The two basic considerations in designing oligonucleotide probes are specificity and sensitivity. To be specific, a probe must hybridize primarily with its target, which entails avoiding cross-hybridization. The thermodynamics of secondary structure formation serve as a good theoretical model for cross-hybridization: the hybridization of non-complementary strands of DNA involves formation of structures similar to those formed in strand self-hybridization. An ideal way to determine the specificity of a potential probe would be to generate alignments with other transcripts and score them using a system based on the Gibbs energy of hybridization. Unfortunately, for a large transcriptome, this calculation is prohibitively expensive. If microarray hybridization conditions are carefully controlled, however, only very good alignments should cause measurable cross-hybridization. Such alignments are easily captured with existing alignment programs such as BLAST, and high-scoring BLAST alignments should correspond to highly stable hybridized structures. In fact, Rosetta Inpharmatics observed a strong correlation between cross-hybridization predicted using BLAST and experimentally measured cross-hybridization (Nat Biotechnol. 2001 Apr;19(4):342-7).
There are a few technical considerations in using BLAST to find alignments. Since BLAST works by finding word matches of a minimal size and then extending them, it is better to search for alignments using entire transcripts as queries rather than subsequences. With the entire sequence as query, a seeding word match outside of a subsequence can allow the extension of an alignment into the subsequence. Such alignments are potentially important and could be missed if only the subsequence were used as the query.
A simple, computationally efficient means of designing highly specific probes, then, is to align each transcript with all other transcripts in the transcriptome using BLAST, annotate it with the resulting alignment information, and attempt to choose probes from the unaligned regions. A threshold bit score can be used to decide which alignments to include. A bit score of n corresponds roughly to a perfect alignment of length ~n/2, so a bit score of 50 corresponds to a perfect match of about 25 bases. Thus for a 70-mer probe, excluding all alignments of bit score > 50 should be safe with respect to cross-hybridization.
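The rule of thumb relating bit score to perfect-match length can be checked with the standard Karlin-Altschul conversion, bit score = (lambda * S - ln K) / ln 2. The sketch below is illustrative only: the values of lambda and K are the commonly quoted ungapped nucleotide statistics for a +1/-3 match/mismatch scheme and are an assumption here, since the source does not state the scoring scheme used; the function names are hypothetical.

```python
import math

# Assumed Karlin-Altschul parameters for ungapped nucleotide scoring
# with match = +1, mismatch = -3 (not stated in the source text).
LAMBDA = 1.374
K = 0.711


def bit_score(raw_score):
    """Convert a raw alignment score to a bit score."""
    return (LAMBDA * raw_score - math.log(K)) / math.log(2)


def perfect_match_length(bits, match_reward=1):
    """Length of a perfect (mismatch-free) alignment giving this bit score."""
    raw_score = (bits * math.log(2) + math.log(K)) / LAMBDA
    return raw_score / match_reward


if __name__ == "__main__":
    # A 25-base perfect match scores about 50 bits under these assumptions,
    # consistent with "bit score n ~ perfect alignment of length n/2".
    print(round(bit_score(25), 1))
    print(round(perfect_match_length(50), 1))
```

Under these assumed parameters a threshold of 50 bits tolerates perfect matches of up to roughly 25 bases within a 70-mer.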
Another consideration is that the transcriptome should be fairly complete and should include all transcribed RNAs, such as repeat elements and non-messenger RNAs.
Achieving good probe sensitivity requires favorable thermodynamics and kinetics of probe-target hybridization and unfavorable self-hybridization. The thermodynamics of probe-target hybridization can be well approximated by calculating the melting temperature, Tm. Additionally, since microarrays involve hybridizing many probes in parallel, the thermodynamics of probe hybridization should be uniform across the chip. Requiring probes to have Tms within a certain range helps maintain this uniformity.
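As an illustration of a Tm range filter, the sketch below uses one common salt- and GC-based approximation for long oligos, Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 600/N. This is an assumption: the source does not say which Tm formula is used, and several published variants of this formula exist. The function names and the default sodium concentration are hypothetical.

```python
import math


def melting_temperature(seq, na_molar=0.05):
    """Approximate Tm (deg C) of an oligo from GC content, length and [Na+].

    Uses Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 600/N, one of several
    published approximations (assumed here, not taken from the source).
    """
    n = len(seq)
    gc_count = sum(1 for base in seq.upper() if base in "GC")
    percent_gc = 100.0 * gc_count / n
    return 81.5 + 16.6 * math.log10(na_molar) + 0.41 * percent_gc - 600.0 / n


def in_tm_range(seq, tm_min, tm_max, na_molar=0.05):
    """True if the probe's approximate Tm lies within [tm_min, tm_max]."""
    return tm_min <= melting_temperature(seq, na_molar) <= tm_max


if __name__ == "__main__":
    probe = "ACGT" * 17 + "AC"  # a 70-mer with 50% GC, for illustration only
    print(round(melting_temperature(probe), 1))
    print(in_tm_range(probe, 65.0, 75.0))
```

Any Tm model can be substituted for `melting_temperature`; what matters for uniformity across the chip is that all probes are filtered against the same range with the same model.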
Sequence complexity is a factor in hybridization. A low-complexity sequence (for instance, a sequence containing a short repeat such as AATAATAAT) can hybridize to its target in many imperfect ways and can also potentially cross-hybridize to other targets in many ways (ways which could easily be missed by BLAST alignments because of the seeding word match).
Joe DeRisi (unpublished data) developed a simple method using LZ77 compression to give a complexity score for a sequence. A repetitive string will be more compressible than a non-repetitive string. Thus a comparison of the size of the gzip-compressed string with the size of the uncompressed string can be used to generate a score for sequence complexity.
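A minimal sketch of a compression-based score follows. It uses Python's zlib module, which implements the same DEFLATE (LZ77 plus Huffman coding) compression that gzip uses. The exact form of the score is an assumption: here it is the number of bytes saved by compression, so that more repetitive sequences score higher; the source does not specify how the two sizes are compared. The function name is hypothetical.

```python
import zlib


def compression_score(seq):
    """Bytes saved by DEFLATE compression; higher means more repetitive.

    Assumed scoring: uncompressed size minus compressed size.
    """
    data = seq.upper().encode("ascii")
    return len(data) - len(zlib.compress(data, 9))


if __name__ == "__main__":
    repetitive = "AAT" * 23 + "A"  # low-complexity 70-mer
    mixed = ("ATGGCGTACCTGAAGTTCGACCATGCTAGGTCAACG"
             "TTGCAGTCCGATAGCTTGACCGTAAGCTGGATCA")  # arbitrary example sequence
    print(compression_score(repetitive), compression_score(mixed))
```

With a score defined this way, probes scoring above a chosen threshold would be rejected as low complexity. Because short strings carry a fixed compression overhead, scores are only comparable between probes of the same length.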
Finally, it is important to eliminate probes which have a significant propensity for forming secondary structure (i.e., probe self-complementarity). Secondary structure in the probe (and therefore also potentially in its target in the region of the probe) will act as a barrier to hybridization between the probe and its target.
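To make the idea of self-complementarity concrete, the sketch below finds the longest perfect stem a probe could form with itself, i.e. the longest stretch whose reverse complement occurs further downstream with at least a few unpaired loop bases in between. This is only a crude stand-in for illustration; it is not the Gibbs energy calculation performed by RNAfold, which the procedure below uses. The function names and the minimum loop size are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def longest_hairpin_stem(probe, min_loop=3):
    """Length of the longest perfect hairpin stem the probe can form.

    A stem of length k exists if some k-mer is followed, after at least
    min_loop bases, by its own reverse complement.
    """
    seq = probe.upper()
    n = len(seq)
    for k in range((n - min_loop) // 2, 0, -1):  # try longest stems first
        for i in range(0, n - 2 * k - min_loop + 1):
            partner = reverse_complement(seq[i:i + k])
            if seq.find(partner, i + k + min_loop) != -1:
                return k
    return 0


if __name__ == "__main__":
    print(longest_hairpin_stem("GGGGCCAAAAGGCCCC"))  # stem GGGGCC / GGCCCC
    print(longest_hairpin_stem("AAAAAAAAAA"))        # no complementary partner
```

A real filter should rely on a thermodynamic folding program, since mismatches, bulges and base composition all affect stability in ways this count ignores.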
(1) Begin with a list of transcripts and a transcriptome
(2) Divide the list of transcripts into files containing ~1000 transcripts each (files of this size are amenable to use of the megablast algorithm)
(3) Calculate alignments of each transcript with a database of repeat elements using megablast (one file at a time)
Parameters W=12, U=T, e=.001
(4) Mask alignments with bit score above 40 (convert sequence characters to lower case) using the program mask.pl
(5) Calculate alignments of each transcript with the rest of the transcriptome using megablast
Parameters W=12, U=T, e=.001
(6) Change output to a more usable form using the program extract.pl
(7) Create concatenated sequence files for each of the files from (2). (line format below)
>ID Def sequence
(8) Choose oligos with program choose.1.00.pl
Parameters
n: Initial number of oligos to choose
m: Final number of oligos to choose
BLAST bit score threshold
Tm Range
Increment for moving through unaligned regions
Jump after probe with Tm in Tm range is found
LZ77 threshold
RNAfold threshold
(a) Read through the alignment information and record the positions of the regions of each query which contain alignments with other transcripts (or transcripts in other UniGene clusters) with a bit score above the threshold.
(b) If regions overlap, merge them into a single region, e.g. alignment a (1 100) and alignment b (75 150) -> merged alignment (1 150)
(c) Get transcript length information from the concatenated sequence file and make a list of non-aligned regions of the query from which oligos may be selected
(d) Beginning at the 3’ end of the 3’-most non-aligned region of the transcript, examine each potential probe to see if its Tm is in the Tm range. If it is, add this probe to the list of probes and jump the set number of bases toward the 5’ end before examining another potential probe. If not, move by the set increment and examine the next potential probe. Stop adding probes to the list when the initial number n has been reached or the maximum distance from the 3’ end has been reached.
(e) Calculate the LZ77 compression score for each probe using gzip and the Gibbs energy of secondary structure formation using RNAfold
(f) From the probes with LZ77 scores and RNAfold scores below the thresholds, choose the m probes with the best RNAfold scores
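Steps (b) through (d) can be sketched as follows. This is not the code of choose.1.00.pl, which is not shown in the source; it is a minimal illustration assuming 1-based inclusive coordinates, a caller-supplied Tm test, and hypothetical function and parameter names (`step` for the increment, `jump` for the jump after an accepted probe, `n` for the initial number of oligos, `max_dist` for the maximum distance from the 3’ end).

```python
def merge_regions(regions):
    """Step (b): merge overlapping (start, end) aligned regions."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def unaligned_regions(merged, length):
    """Step (c): regions of a transcript not covered by any merged alignment."""
    free, pos = [], 1
    for start, end in merged:
        if start > pos:
            free.append((pos, start - 1))
        pos = max(pos, end + 1)
    if pos <= length:
        free.append((pos, length))
    return free


def scan_for_probes(seq, free, probe_len, tm_ok, step=1, jump=10,
                    n=5, max_dist=None):
    """Step (d): walk 3' -> 5' through unaligned regions collecting probes.

    tm_ok is a caller-supplied function taking a probe sequence and
    returning True if its Tm lies in the allowed range.
    Returns a list of (start, probe_sequence) with 1-based start positions.
    """
    chosen = []
    for start, end in sorted(free, reverse=True):  # 3'-most region first
        pos = end - probe_len + 1
        while pos >= start and len(chosen) < n:
            probe_end = pos + probe_len - 1
            if max_dist is not None and len(seq) - probe_end > max_dist:
                return chosen  # too far from the 3' end
            probe = seq[pos - 1:probe_end]
            if tm_ok(probe):
                chosen.append((pos, probe))
                pos -= jump
            else:
                pos -= step
    return chosen


if __name__ == "__main__":
    merged = merge_regions([(1, 100), (75, 150)])
    free = unaligned_regions(merged, 300)
    print(merged, free)
    hits = scan_for_probes("ACGT" * 75, free, 70, lambda p: True, jump=50)
    print([start for start, _ in hits])
```

The candidates returned by the scan would then be scored for complexity and secondary structure in steps (e) and (f) before the final m probes are kept.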