
1. Blasted each transcript from Mm.seq.uniq Unigene (Aug 13 2001) against rodrep.

2. Masked all regions of each transcript that aligned with a sequence in rodrep with bit score > 30.

3. Blasted masked transcripts against Mm.seq.uniq.

4. Masked additional regions of each transcript that aligned with other transcripts in Mm.seq.uniq with bit score > 50. (This is just a computation-saving step: if a transcript aligns with another transcript (the subject) in Mm.seq.uniq, it will also align with many of the transcripts in that subject's cluster.)

5. Blasted newly masked transcripts against Mm.all.

6. Beginning at the 3' end, take the first 70-mer oligonucleotide from the unaligned regions with Tm between 70 and 78.

7. Skip 40 bp and search again as above, stopping at 1000 bases from the 3' end or once 6 regions have been selected.

8. Compute the secondary structure of each selected region using RNAfold (Vienna). Eliminate regions with -(delta G) > 20. (Note: we use -(delta G) because the delta G values are negative.)

9. Compute the LZW compression score for each selected region using gzip and wordcount (DeRisi suggestion). Eliminate regions with LZW score > 33.

10. Choose the region with the lowest -(delta G) of secondary structure formation.
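
The masking, scanning and filtering steps above can be sketched roughly as follows. This is only an illustration, not the code that produced the probes. It assumes that masked bases are written as N, that BLAST hits arrive as 0-based half-open (start, end, bit_score) tuples on the transcript, that "skip 40 bp" means leaving a 40-base gap on the 5' side of each selected oligo, and that delta G and the LZW score have already been computed elsewhere. The Tm formula (approx_tm) is a generic salt-adjusted GC approximation chosen for the example; the method actually used to compute Tm is not stated here. All function and constant names are hypothetical.

```python
import math

OLIGO_LEN, SKIP = 70, 40            # oligo length (step 6), gap between picks (step 7)
MAX_FROM_3P, MAX_REGIONS = 1000, 6  # stopping rules from step 7
TM_MIN, TM_MAX = 70.0, 78.0         # Tm window from step 6
MAX_NEG_DG, MAX_LZW = 20.0, 33.0    # cutoffs from steps 8 and 9


def mask_hits(seq, hits, min_bits):
    """Replace aligned bases with N for every hit scoring above min_bits.

    hits: (start, end, bit_score) tuples, 0-based half-open (assumed layout).
    """
    bases = list(seq)
    for start, end, bits in hits:
        if bits > min_bits:
            for i in range(max(start, 0), min(end, len(bases))):
                bases[i] = "N"
    return "".join(bases)


def approx_tm(seq, na_molar=0.1):
    """Assumed Tm estimate (salt-adjusted GC formula), for illustration only."""
    gc = seq.count("G") + seq.count("C")
    return (81.5 + 16.6 * math.log10(na_molar)
            + 41.0 * gc / len(seq) - 600.0 / len(seq))


def select_candidates(masked, tm=approx_tm):
    """Walk from the 3' end toward the 5' end, collecting unmasked 70-mers."""
    picks = []
    end = len(masked)                          # exclusive 3' edge of the window
    limit = max(0, len(masked) - MAX_FROM_3P)  # do not go past 1000 bases from 3'
    while end - OLIGO_LEN >= limit and len(picks) < MAX_REGIONS:
        start = end - OLIGO_LEN
        window = masked[start:end]
        if "N" not in window and TM_MIN <= tm(window) <= TM_MAX:
            picks.append((start, window))
            end = start - SKIP                 # leave a 40-base gap, then resume
        else:
            end -= 1                           # slide one base toward the 5' end
    return picks


def choose_best(scored):
    """scored: (start, seq, delta_g, lzw_score) tuples; delta_g is <= 0.

    Drops regions failing the step 8 and 9 cutoffs, then returns the one
    with the lowest -(delta G), or None if nothing survives.
    """
    kept = [c for c in scored if -c[2] <= MAX_NEG_DG and c[3] <= MAX_LZW]
    return min(kept, key=lambda c: -c[2]) if kept else None
```

A transcript would pass through mask_hits once per BLAST round (with cutoffs 30 and 50 for steps 2 and 4), then through select_candidates, and the surviving regions, once scored, through choose_best.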

Fields in the output are

gb_id|ug_cluster|gi_id|distance_from_5'end\tdistance_from_3'_end\tlength_of transcript\tTm\tdelta_g\tlzw_score\tsequence

Note: a * after the identifier indicates that genscan detected no exons in the transcript (i.e., it is just UTR), and a +? after the identifier indicates that the orientation of the transcript, as detected by genscan, is ambiguous.

(\t indicates a tab)
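
As a rough illustration of reading one output line, the sketch below splits it exactly as the field list is written: seven tab-separated columns, the first of which holds four |-separated parts. It assumes the distances and length are integers, and it accepts the * and +? markers at the end of any part of the first column, since their exact position is not spelled out. The record in the usage example is made up, and ProbeRecord and parse_record are hypothetical names.

```python
from typing import NamedTuple


class ProbeRecord(NamedTuple):
    gb_id: str
    ug_cluster: str
    gi_id: str
    dist_5p: int
    dist_3p: int
    transcript_len: int
    tm: float
    delta_g: float
    lzw_score: float
    sequence: str
    utr_only: bool               # "*" marker: no exons detected by genscan
    ambiguous_orientation: bool  # "+?" marker: orientation ambiguous


def parse_record(line):
    """Parse one output line laid out as in the field list above."""
    head, dist_3p, length, tm, delta_g, lzw, seq = line.rstrip("\n").split("\t")
    utr_only = ambiguous = False
    parts = []
    for part in head.split("|"):
        # Strip trailing markers wherever they occur (position is assumed).
        if part.endswith("+?"):
            ambiguous, part = True, part[:-2]
        if part.endswith("*"):
            utr_only, part = True, part[:-1]
        parts.append(part)
    gb_id, ug_cluster, gi_id, dist_5p = parts
    return ProbeRecord(gb_id, ug_cluster, gi_id, int(dist_5p), int(dist_3p),
                       int(length), float(tm), float(delta_g), float(lzw),
                       seq, utr_only, ambiguous)


# Made-up record for illustration only; none of these values are real data.
example = "\t".join(["AB000001*|Mm.1|g1234567|830", "100", "1000",
                     "74.2", "-12.5", "28", "ACGT" * 17 + "AC"])
record = parse_record(example)
```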

For more information about the LZW compression score see methods page http://arep.med.harvard.edu/choose_description.html
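
For readers who want a feel for a gzip-based complexity score, here is a minimal sketch. It assumes the score is the number of bytes saved when the sequence is compressed (so low-complexity, repetitive sequences score higher); the exact definition used for the probes is the one on the methods page above and may differ. Note also that gzip implements DEFLATE rather than LZW proper. The function name compression_score is hypothetical.

```python
import gzip
import random


def compression_score(seq):
    """Bytes saved by gzip on the raw sequence (assumed definition)."""
    raw = seq.encode("ascii")
    return len(raw) - len(gzip.compress(raw))


# Compare a homopolymer with a pseudo-random 70-mer (seeded, so repeatable).
rng = random.Random(0)
mixed = "".join(rng.choice("ACGT") for _ in range(70))
low_complexity = "A" * 70
print(compression_score(low_complexity), compression_score(mixed))
```

Under this assumed definition the homopolymer should score higher than the mixed sequence, which is the direction a cutoff such as "eliminate regions with score > 33" would need.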

NOTE: All sequence orientations in unigene were assumed to be correct; all probes were chosen from the strand given in unigene.