The scripts divide.pl, mask.pl, extract.pl, concatenate_fasta.pl, make_fasta.pl, and choose.1.01.pl can be used to select oligonucleotides from a set of transcripts with minimal cross-hybridization against a transcriptome. For a description of the reasoning behind the method, see http://arep.med.harvard.edu/human.html/criteria.html

They are written to use alignment output from megablast (available via anonymous ftp from ftp://ncbi.nlm.nih.gov) and secondary structure predictions from RNAfold (available from http://www.tbi.univie.ac.at/~ivo/RNA/). Both of these programs must be installed on the machine from which the scripts are run. A sequence complexity score is generated by calling wc and gzip through the system shell, so the scripts require Linux to work properly. We recommend checking for repetitive elements by first running megablast on the transcripts against a database of repeats (described below). We use Repbase (which can be obtained from http://www.girinst.org/Repbase_Update.html).
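
The complexity score relies on compression: a repetitive sequence compresses to fewer bytes than a diverse one of the same length. The following is a minimal sketch of that idea only, not the exact calculation performed in choose.1.01.pl. The function name lz_size is hypothetical, and any normalization or thresholding the scripts apply is not shown.

```shell
# Hypothetical helper (not part of the distributed scripts):
# print the gzip-compressed size, in bytes, of the sequence given as $1.
# Low-complexity sequences compress to fewer bytes than diverse ones.
lz_size() {
    printf '%s' "$1" | gzip -c | wc -c | tr -d ' '
}
```

Calling lz_size on a homopolymer run and on a mixed sequence of equal length should give a smaller number for the homopolymer.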

Instructions

(1) Divide the set of transcripts into files with at most 1000 transcripts each. If the files are in fasta format, the program divide.pl can be used. 

	perl divide.pl input I
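
As a quick check after this step, the number of transcripts in each resulting file can be counted from its FASTA header lines. This sketch assumes standard FASTA files in which every record starts with a line beginning with '>'. The function names are hypothetical and are not part of the distributed scripts.

```shell
# Hypothetical helper: count FASTA records (lines starting with '>') in file $1.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Hypothetical helper: report every file that holds more than 1000 transcripts.
check_fasta_limit() {
    for f in "$@"; do
        n=$(count_fasta_records "$f")
        if [ "$n" -gt 1000 ]; then
            echo "$f has $n transcripts (limit is 1000)"
        fi
    done
}
```

Pass the divided files as arguments to check_fasta_limit; it prints nothing when every file is within the limit.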

(2) Generate concatenated sequence files from each of the new transcript files

	perl concatenate_fasta.pl input > output 

(3) Create BLAST databases for the transcriptome and repeat elements

	formatdb -i input -p F

	(for files in FASTA format)

(4) megaBLAST these files against the database of repeat elements with the following parameters.

	-W 8 -D 2 -U T -e .01

(A separate database of repeat elements allows you to run megablast with a smaller word size and e-value than if the repeats were incorporated into the larger transcriptome database. This allows the search for alignments with repeat elements to be more comprehensive.)
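
For reference, a complete invocation might look like the sketch below. The search parameters are the ones listed above; -d (database), -i (query file) and -o (output file) are the usual options of the legacy NCBI megablast program. The function name and its arguments are placeholders, not names used by the scripts.

```shell
# Hypothetical wrapper for step (4): search one query file against the
# repeat database with the parameters given above.
#   $1 = query file
#   $2 = repeat database (built with formatdb)
#   $3 = file to receive the megablast output
repeat_search() {
    megablast -d "$2" -i "$1" -o "$3" -W 8 -D 2 -U T -e .01
}
```

The output file written here is what extract.pl reformats in the next step.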
	
(5) Change the format of the blast output using extract.pl

	perl extract.pl input > output

(6) Mask the regions of the transcript that align to repeat elements with mask.pl
The variables $infile and $transcript must be set before running the program.

	perl mask.pl

(7) Make fasta files from the masked concatenated sequence files (all of this concatenating and unconcatenating the sequence files will be eliminated soon but for now ...) 

	perl make_fasta.pl input > output

(8) Run megablast on the masked fasta files against the transcriptome with the following parameters.

	-W 12 -U T -D 2 -e .005 -f T

(9) Change the format of the blast output using extract.pl

	perl extract.pl input > output

(10) Create a file list for input to choose.1.01.pl. The format of each line is

	masked_concatenated_sequence_file\textracted_blast_file\tlabel_for_output

(where the \t represents a tab)
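
Because tabs are easily lost when editing by hand, the list can also be written with printf. This is a minimal sketch; the function name is hypothetical, and the three arguments are whatever file names and label you chose in the earlier steps.

```shell
# Hypothetical helper: emit one tab-separated file-list line.
#   $1 = masked concatenated sequence file
#   $2 = extracted blast file
#   $3 = label for output
file_list_line() {
    printf '%s\t%s\t%s\n' "$1" "$2" "$3"
}
```

Call it once per transcript file and append the output to the file list with >>.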

(11) Edit the choose.1.01.pl code for your system and for your particular transcript set and transcriptome.

	line 387: change the system command so that it points to the location of Rnafold.pl on your system
	line 464: change the subroutine parse_cluster so that it finds the proper cluster identifier for your ids
	line 476: change the subroutine parse_qid so that it gets the proper information from your identifier
	line 494: change the subroutine parse_sid so that it gets the proper information from your identifier
	line 515: change the subroutine parse_cncid so that it gets the gi number (or whatever you are using in its place; it just needs to be a unique identifier)
	(note that the identifiers only need to contain a unique identifier and, optionally, a cluster identifier)

(12) Set your own parameters at the beginning of the code. The parameters are described in the code itself.
	In brief, they are
		blast threshold = $threshold
		initial number = $number (oligos picked from non-aligned regions that lie within the Tm range)
		final number = $t_number
		initial length = $length (try to get oligos of this length first)
		minimum length = $molength (shortest length an oligo may have)
		max distance from 3' = $distance_m
		jump after finding oligo = $jump (this is measured in bases)
		tm high = $h_tm
		tm low = $l_tm
		max lz score = $mlz
		max rnafold delta G = $mrna

(13) Run choose.1.01.pl

	perl choose.1.01.pl file_list 

For more information about the algorithm, see the description section of the web page.