3 Sep 99

Getting Started

The ScanACE and AlignACE executables are command-line programs. If you are using Windows95/98/NT, you have to open a DOS/Command prompt window, cd to the appropriate directory, then type alignace.exe (options).

If you run AlignACE without options, it will return a long list of all possible options. Most of these can be ignored in normal use. The most important options are:

-i/-j: One or the other is required. Use -i to specify a FASTA-formatted input file. Use -j in conjunction with -y/-a/-b/-z options to specify that sequence upstream of a list of ORFs be used. The list of ORFs should be either Y-names or common names found in the ORF table, one per line. The amount of upstream sequence to be taken can be controlled with the -k and -l options.

-y/-a/-b/-z: Use these to specify a genome sequence and associated ORF and other features tables. The -y option sets defaults for S. cerevisiae using the files included in the download. These files must be located in the directory from which AlignACE is run. Alternate genome sequences, ORF, and other_features tables are specified with -z, -a, and -b options respectively.

-e: Use Smith-Waterman algorithm to find and purge input set of extra copies of nearly identical input sequences. The presence of such repeated sequences causes the sampler to find motifs that are skewed greatly toward those repeated sequences. Such duplicated upstream regions occur frequently in S. cerevisiae. The algorithm used here to purge these sequences is very inefficient for larger input sets, say more than 50-100 500bp sequences. It will soon be replaced by a BLAST-based algorithm.

The standard parameters I most often use for motif searching in S. cerevisiae are -y, -j, and -e. For example:

    AlignACE -jfile.inp -y -e

Other parameters that are occasionally useful:

-o: Specify an output file. Otherwise, an output filename is automatically generated from the input arguments.

-g: Specify the GC content of the genome.
Default is 0.38 for S. cerevisiae.

-x: Modify the number of expected sites from the default of 10. This is also the number of sites with which the sampler is seeded.

-w: Change the number of columns from the default of 10. I haven't noticed any motifs that are very sensitive to this.

-d: Turn off column sampling. With this option, only motifs with contiguous active columns are found.

-r1 -t1 -otest.out: These options will result in a quick test run. This is useful sometimes just to make sure you have things set up correctly.

The Output

The output files are raw text. PCs end each line with two characters (carriage return plus line feed), whereas Unix/Linux machines use only a line feed. You should still be able to read the files OK in Windows with WordPad.

At the top of the AlignACE output file are listed any purged sequences and any ORF names that were not found in the ORF table. This is followed by a listing of the input sequence names. The numbers associated with these names are used in the subsequent motif descriptions to refer to the input sequences. Motifs are then listed in the order found and masked.

The fields in AlignACE output following Motif number x are:

1: site in sequence context (*'s below indicate 'active' motif columns)
2: number of the sequence from which the site was found (listed at the top of the output file)
3: position of the site in that sequence (specifically, the position of the site column nearest the beginning of the input sequence)
4: strand (1=forward, 0=reverse)
5: sequence in the active columns from field 1
6: an internal score between 0 and 1 measuring site strength relative to the motif
7: distance relative to start (if sequence is given 5'->3', distal->proximal to translation start, as is the default if an ORF list is used for input)

As mentioned above, the *'s indicate active columns. Masking refers to the way multiple motifs are found and is described in Roth et al.
'Masking position x' means that sequence positions corresponding to the x-th position in the motif as shown (considering both active and inactive columns) are to be masked before further sampling.

Map score is the internal score used by AlignACE to judge motifs. Mathematical details are found in Liu et al., Journal of the American Statistical Association 90: 1156-1170 (1995). The last five lines of statistics following that are some measures that we once thought might be interesting. At this point, I'd ignore them.

There will be a significant update to AlignACE at some point in the near future. It should become faster, a little easier to use, and include much improved motif statistics.

ScanACE

The output from ScanACE begins with some information about the motif used in the search, followed by a ranked list of sites. The number of sites may be determined with score cutoffs in terms of standard deviations from the mean of the scores for the aligned sites, or a specific number of sites may be requested with the -s option.

The first line of information about each site includes five fields: the site sequence, chromosome number, position on chromosome, strand (1=forward, 0=reverse), site score. This is followed by information on neighboring genomic features.

*** If at all possible, use Linux.
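
If you want to post-process ScanACE results, the five-field site line described above can be read with a few lines of Python. This is only a rough sketch: it assumes the five fields are separated by whitespace and appear in the order listed, and that the chromosome number is a plain integer. The exact spacing of real ScanACE output is not documented here, so check it against your own files. The names ScanSite and parse_site_line are made up for this example, and the sample line is invented.

```python
from typing import NamedTuple


class ScanSite(NamedTuple):
    """One ScanACE site line (hypothetical container, not part of ScanACE)."""
    sequence: str
    chromosome: int
    position: int
    strand: str  # "forward" or "reverse"
    score: float


def parse_site_line(line: str) -> ScanSite:
    """Split a site line into its five fields.

    Assumption: fields are whitespace-separated and ordered as
    site sequence, chromosome number, position, strand, site score.
    """
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}: {line!r}")
    seq, chrom, pos, strand, score = fields
    # The README defines the strand code as 1=forward, 0=reverse.
    if strand not in ("0", "1"):
        raise ValueError(f"strand must be 0 or 1, got {strand!r}")
    return ScanSite(
        sequence=seq,
        chromosome=int(chrom),
        position=int(pos),
        strand="forward" if strand == "1" else "reverse",
        score=float(score),
    )


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    print(parse_site_line("ACGTGCGCAT 4 123456 1 12.5"))
```

The lines that follow each site line (neighboring genomic features) are not handled; anything that does not have exactly five fields is rejected with a ValueError rather than guessed at.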