Prediction of similarly-acting cis-regulatory modules

Instructions for using sampler and searcher programs

The sampler program requires several additional files, which must either be in the same directory as the program or be entered as input parameters: an annotation file, a file with a list of genes, and several files that provide phylogenetic footprint information. These files are included here and were generated as described in the text of the accompanying paper. They should be placed in the same directory as the sampler program for it to function properly. The files are:

Files representing the subsequence conservation profiles of phylogenetic footprints (PFs). Here, phylogenetic footprints are generated from blocks based on 60% sequence conservation over 100 nt; conserved non-coding regions separated by less than 100 nt are then grouped together into clusters. Conserved non-coding subsequences (CNSSes) were then identified from the conserved non-coding sequences (CNSes) as maximal contiguous stretches of aligned, conserved sequence that contained at most 2 mismatches within each 8 bp window.

CNSSes are analyzed in setting up our Markov chain algorithm: each CNSS string $s_1 s_2 \ldots s_n$ is considered as a sequence of $n - 5$ 6-nt windows $s_i s_{i+1} s_{i+2} s_{i+3} s_{i+4} s_{i+5}$, and each such window is considered a "state transition" from the prefix 5-word $s_i s_{i+1} s_{i+2} s_{i+3} s_{i+4}$ to the suffix 5-word $s_{i+1} s_{i+2} s_{i+3} s_{i+4} s_{i+5}$, provided that none of the six base codes is '-' (gap).
```
	chr2L.regions_60pct.blocks.training.5_O.a4
	chr2R.regions_60pct.blocks.training.5_O.a4
	chr3L.regions_60pct.blocks.training.5_O.a4
	chr3R.regions_60pct.blocks.training.5_O.a4
	chr4.regions_60pct.blocks.training.5_O.a4
	chrX.regions_60pct.blocks.training.5_O.a4
```
A file with the Release 3.1 annotations.
A file with all the PFs with more than 300 state transitions (a parameter chosen to reflect the archetypal length of CRMs), along with a file of header information for each of these PFs (both files are used by the random_selector.pl script).
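
The state-transition windows described above can be illustrated with a short Python sketch. It is not the distributed C training program; the function names (`state_transitions`, `count_transitions`) and the example string are hypothetical and serve only to show how gap-free 6-nt windows map to transitions between 5-words.

```python
from collections import Counter

WINDOW = 6   # one 6-nt window = 5-nt prefix state + 5-nt suffix state
GAP = "-"    # gap code; any window containing it is skipped


def state_transitions(cnss):
    """Yield (prefix 5-word, suffix 5-word) for each gap-free 6-nt window."""
    # A string of length n has n - 5 windows of length 6.
    for i in range(len(cnss) - WINDOW + 1):
        window = cnss[i:i + WINDOW]
        if GAP in window:
            continue
        yield window[:-1], window[1:]


def count_transitions(cnss_strings):
    """Tally state transitions over a collection of CNSS strings."""
    counts = Counter()
    for cnss in cnss_strings:
        counts.update(state_transitions(cnss))
    return counts


# Hypothetical CNSS string with a single gap at position 8.
example = "ACGTACG-TTACGTA"
transitions = list(state_transitions(example))
counts = count_transitions([example])
```

In this example the 15-nt string has 10 windows, of which the 6 that overlap the gap are discarded, leaving 4 state transitions. A PF would qualify for the file above once its total exceeds 300.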

Auxiliary modules that work in combination with the sampler and searcher programs

C code for training program.
Perl code for random_selector.pl script.

The training program can be run on any sequence comparisons in the AVID alignment format. It is used to generate the subsequence conservation profiles for phylogenetic footprints.

The random_selector script is a Perl script that works together with the sampler program. It randomly selects 1000 PFs for use in the ranking procedure, which is important in the updating step.
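
The selection step can be sketched as follows. This is an illustration in Python rather than the distributed Perl script, and it makes several assumptions that are not taken from the script itself: PFs are represented as hypothetical `(header, transition_count)` pairs held in memory, sampling is done without replacement, and the names `eligible_pfs` and `select_random_pfs` are invented for the example.

```python
import random

MIN_TRANSITIONS = 300   # PFs must have more than this many state transitions
SAMPLE_SIZE = 1000      # number of PFs drawn for the ranking procedure


def eligible_pfs(pfs):
    """Keep only PFs with more than MIN_TRANSITIONS state transitions."""
    return [pf for pf in pfs if pf[1] > MIN_TRANSITIONS]


def select_random_pfs(pfs, k=SAMPLE_SIZE, seed=None):
    """Draw up to k distinct eligible PFs uniformly at random."""
    rng = random.Random(seed)          # a seed makes the draw reproducible
    pool = eligible_pfs(pfs)
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical input: 2000 PFs whose transition counts run from 0 to 1999.
pfs = [(f"pf{i}", i) for i in range(2000)]
selected = select_random_pfs(pfs, seed=0)
```

Passing a fixed `seed` is a convenience for repeatable runs; whether the original script supports this is not stated here.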

For more information, please contact Yonatan Grad.

Last updated by YG on 20 January 2004.