
1. Unigene (Aug 3 2001) Unique set: each RNA was blasted against all others in the set.
2. Using a cutoff score of 50, choose all non-aligned regions.
3. Beginning at the 3' end, take the first 70-mer oligonucleotide with Tm = 72.5 to 83.5 (as in the Operon set).
4. Skip 39 bp and search as above; stop when 7 regions have been selected or the 5' end is reached, whichever comes first.
5. Blast each region against the entire Unigene set (not just Unigene Unique) and count the maximum number of identities per region.
6. Predict the secondary structure of each region using RNAfold (Vienna).
8. Compute an LZW compression score for each region using gzip and wordcount (DeRisi suggestion). Regions with score < 35 are selected.
9. Lowest with identity <= 25
10. In case of a tie above, the region with the lowest secondary structure is chosen.
11. Blast against Repbase (Genetic Information Research Inst.) and human rRNAs and tRNAs, excluding regions with >= 25 identical bases.
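
Steps 3 and 4 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code used to build the set: the function names are hypothetical, the Tm formula is a common GC-content approximation chosen as a stand-in (the formula actually used is not given here), and the sketch reads "take the first 70-mer" as sliding one base at a time toward the 5' end until the Tm window is met. It works on a single region and ignores the blast-based filtering of the other steps.

```python
import math


def approx_tm(oligo, na_molar=0.1):
    """Stand-in Tm estimate (assumed formula, not taken from the source)."""
    n = len(oligo)
    gc = sum(oligo.upper().count(base) for base in "GC")
    return 81.5 + 16.6 * math.log10(na_molar) + 41.0 * gc / n - 675.0 / n


def pick_candidates(seq, length=70, skip=39, tm_min=72.5, tm_max=83.5,
                    max_regions=7, tm_func=approx_tm):
    """Walk from the 3' end toward the 5' end collecting candidate oligos.

    Returns a list of (start, oligo) pairs, where start is the 0-based
    offset of the oligo from the 5' end of seq.
    """
    picks = []
    end = len(seq)  # exclusive 3' boundary of the current window
    while end - length >= 0 and len(picks) < max_regions:
        start = end - length
        oligo = seq[start:end]
        if tm_min <= tm_func(oligo) <= tm_max:
            picks.append((start, oligo))
            end = start - skip  # skip 39 bp before searching again
        else:
            end -= 1  # slide one base toward the 5' end
    return picks
```

Passing a different tm_func lets the same walk be used with whatever Tm calculation is preferred.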

Fields in the output are:

gb_id|ug_cluster|gi_id|distance_from_5'_end annotation\t distance_from_3'_end\tlength_of transcript\tTm\tdelta_g\tlzw_score\tsequence

Note: * after the identifier indicates that no exons were detected in the genscan of the transcript (i.e. just UTR), and +? after the identifier indicates that there is ambiguity in the orientation of the transcript as detected by genscan.

(\t indicates a tab)
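
A line in this layout could be read with something like the sketch below. The names ProbeRecord and parse_record are hypothetical, and the numeric types are assumptions (integers for the distance and length, floats for Tm, delta_g and the score). The leading identifier field is kept as a single string because the exact placement of the annotation and of the * and +? markers is not spelled out above.

```python
from typing import NamedTuple


class ProbeRecord(NamedTuple):
    header: str             # gb_id|ug_cluster|gi_id|... plus annotation, unparsed
    dist_3prime: int        # distance_from_3'_end
    transcript_length: int  # length_of transcript
    tm: float
    delta_g: float
    lzw_score: float
    sequence: str


def parse_record(line):
    """Split one output line into its tab-separated fields."""
    # Split from the right so any extra tabs in the annotation stay in the header.
    header, d3, length, tm, dg, lzw, seq = line.rstrip("\n").rsplit("\t", 6)
    return ProbeRecord(header, int(d3), int(length), float(tm),
                       float(dg), float(lzw), seq)
```

A line with fewer than six tabs raises ValueError, which is probably the right behaviour for a malformed record.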

For more information about the LZW compression score, see the methods page: http://arep.med.harvard.edu/choose_description.html
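
The general idea of using compressed size as a complexity measure can be illustrated as below. This is an assumption-laden sketch, not the score defined on the methods page: the function name is hypothetical, Python's gzip module stands in for piping a sequence through command-line gzip and counting bytes, and no attempt is made to reproduce the exact scaling or the < 35 cutoff.

```python
import gzip


def compressed_size(seq):
    """Number of bytes in the gzip-compressed sequence (headers included).

    Repetitive, low-complexity sequences compress to fewer bytes than
    sequences with a more varied base composition.
    """
    return len(gzip.compress(seq.encode("ascii")))
```

For example, a homopolymer such as "A" * 70 gives a smaller value than a 70-mer of mixed sequence.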

NOTE: All sequence orientations in Unigene were assumed to be correct; all probes were chosen from the strand given in Unigene.