1. Start from the Unigene Unique set (Aug 3 2001); each RNA is blasted against
all others in the set.
2. Using a cutoff score of 50, choose all non-aligned regions.
3. Beginning at the 3' end, take the first 70-mer oligonucleotide with Tm between 72.5 and 83.5 (as in the Operon set).
4. Skip 39 bp and search as above; stop when 7 regions are selected or the 5' end
is reached, whichever comes first.
5. Blast against the entire Unigene set (not just Unigene Unique) and count the
maximum number of identities per region.
6. Secondary structure prediction for each region using RNAfold (Vienna)
8. LZW compression score using gzip and wordcount (DeRisi suggestion) for each
region. Regions with score < 35 are selected.
9. Lowest with identity <= 25
10. In case of a tie above, the region with the lowest secondary structure is chosen.
11. Blast against Repbase (Genetic Information Research Inst.) and human rRNAs
and tRNAs, excluding regions with >= 25 identical bases.
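
The window search in steps 3 and 4 can be sketched roughly as below. This is only an illustration under stated assumptions: the source does not say how Tm was computed, so gc_tm is a placeholder salt-adjusted GC formula with an assumed 0.1 M Na+; the function names are hypothetical; the window is assumed to slide one base at a time until a 70-mer passes; and the restriction to non-aligned regions from step 2 is ignored.

```python
import math

OLIGO_LEN = 70            # 70-mer oligonucleotides (step 3)
SKIP = 39                 # bases skipped after each selected region (step 4)
MAX_REGIONS = 7           # stop after 7 regions (step 4)
TM_MIN, TM_MAX = 72.5, 83.5


def gc_tm(seq, na_molar=0.1):
    """Placeholder Tm estimate; an assumption, not the formula used in the source."""
    gc = seq.count("G") + seq.count("C")
    return (81.5 + 16.6 * math.log10(na_molar)
            + 41.0 * gc / len(seq) - 600.0 / len(seq))


def pick_regions(transcript, tm=gc_tm):
    """Scan from the 3' end toward the 5' end and return (start, oligo, Tm) tuples.

    'start' is the 0-based offset of the oligo from the 5' end of the transcript.
    """
    seq = transcript.upper()
    picked = []
    start = len(seq) - OLIGO_LEN          # first window sits flush with the 3' end
    while start >= 0 and len(picked) < MAX_REGIONS:
        oligo = seq[start:start + OLIGO_LEN]
        t = tm(oligo)
        if TM_MIN <= t <= TM_MAX:
            picked.append((start, oligo, t))
            start -= OLIGO_LEN + SKIP     # leave a 39 bp gap before the next window
        else:
            start -= 1                    # slide one base toward the 5' end
    return picked


if __name__ == "__main__":
    for start, oligo, t in pick_regions("ACGT" * 100):
        print(start, round(t, 2), oligo)
```

Passing a different tm callable lets the same loop be reused with whatever melting-temperature model matches the Operon set.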
Fields in the output are:
gb_id|ug_cluster|gi_id|distance_from_5'_end annotation\tdistance_from_3'_end\tlength_of_transcript\tTm\tdelta_g\tlzw_score\tsequence
Note: * after the identifier indicates that genscan detected no exons in the transcript (i.e. just UTR); +? after the identifier indicates that the orientation of the transcript, as detected by genscan, is ambiguous.
(\t indicates a tab)
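
A record in this layout could be read along the following lines. This is a sketch only: the names ProbeRecord and parse_record are hypothetical, the numeric types are guesses, the annotation is assumed to follow the pipe-separated identifiers after a single space, and any * or +? marker is left untouched inside whichever field carries it. The sample line in the usage section is invented for illustration and is not real data.

```python
from typing import NamedTuple


class ProbeRecord(NamedTuple):
    gb_id: str
    ug_cluster: str
    gi_id: str
    dist_5p: str          # kept as text in case a marker is attached
    annotation: str
    dist_3p: int
    transcript_len: int
    tm: float
    delta_g: float
    lzw_score: float
    sequence: str


def parse_record(line):
    """Split one tab-delimited output line into a ProbeRecord."""
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 tab-separated fields, got {len(fields)}")
    header, dist_3p, length, tm, delta_g, lzw, seq = fields
    # The first field holds the pipe-separated ids, then the annotation text.
    ids, _, annotation = header.partition(" ")
    parts = ids.split("|")
    if len(parts) != 4:
        raise ValueError(f"expected 4 pipe-separated ids, got {len(parts)}")
    gb_id, ug_cluster, gi_id, dist_5p = parts
    return ProbeRecord(gb_id, ug_cluster, gi_id, dist_5p, annotation,
                       int(dist_3p), int(length), float(tm),
                       float(delta_g), float(lzw), seq)


if __name__ == "__main__":
    # Invented example line, for illustration only.
    sample = "AB000001|Hs.0|1234567|120 example annotation\t950\t1140\t76.8\t-5.1\t31\tACGTACGT"
    print(parse_record(sample))
```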
For more information about the LZW compression score, see the methods page: http://arep.med.harvard.edu/choose_description.html
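
As a rough illustration of the idea, a compression-based complexity score can be approximated by counting the bytes gzip produces for a sequence. This is a sketch under assumptions, not the pipeline's actual script: the function name compression_score is hypothetical, gzip itself implements DEFLATE (LZ77 plus Huffman coding) rather than LZW proper, and the byte count here may differ from the original (for example if a trailing newline was included), so the cutoff of 35 may not transfer directly.

```python
import gzip


def compression_score(seq):
    """Number of bytes in the gzip-compressed sequence (header and trailer included)."""
    return len(gzip.compress(seq.upper().encode("ascii")))


if __name__ == "__main__":
    # A repetitive 70-mer compresses to fewer bytes than a more varied one.
    print(compression_score("A" * 70))
    print(compression_score("ACGTTGCAAGCTTAGGCTAACGTTCAGGATCCGTAAGCTTGACCATGGTACGATCGTTAGCCTAGGATCA"))
```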
NOTE: All sequence orientations in Unigene were assumed to be correct; all probes were chosen from the strand given in Unigene.