3 Sep 99

Getting Started

The ScanACE and AlignACE executables are command-line programs.  If
you are using Windows 95/98/NT, open a DOS/Command prompt window, cd
to the directory containing the executables, then type alignace.exe
followed by any options.

If you run AlignACE without options, it will return a long list of all
possible options.  Most of these can be ignored in normal use.  The
most important options are:

-i/-j: One or the other is required.  Use -i to specify a
FASTA-formatted input file.  Use -j in conjunction with -y/-a/-b/-z
options to specify that sequence upstream of a list of ORFs be used.
The list of ORFs should be either Y-names or common names found in the
ORF table, one per line.  The amount of upstream sequence to be taken
can be controlled with the -k and -l options.

-y/-a/-b/-z: Use these to specify a genome sequence and associated ORF
and other features tables.  The -y option sets defaults for
S. cerevisiae using the files included in the download.  These files
must be located in the directory from which AlignACE is run.
Alternate genome sequences, ORF, and other_features tables are
specified with -z, -a, and -b options respectively.

-e: Use the Smith-Waterman algorithm to find and purge extra copies of
nearly identical sequences from the input set.  Such repeated
sequences cause the sampler to find motifs that are skewed greatly
toward the repeats, and duplicated upstream regions occur frequently
in S. cerevisiae.  The algorithm used here to purge these sequences is
very inefficient for larger input sets, say more than 50-100 sequences
of 500 bp each.  It will soon be replaced by a BLAST-based algorithm.
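
To make the idea of purging concrete, here is a minimal Python sketch.
It is not the code AlignACE uses.  The scoring values (match,
mismatch, gap), the 0.8 threshold, and both function names are
assumptions chosen for illustration, and only the forward strand is
compared.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (linear gap penalty).

    The scoring values are illustrative assumptions, not AlignACE's.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor of 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


def purge_near_duplicates(seqs, min_fraction=0.8):
    """Keep a (name, sequence) pair only if no kept sequence is nearly identical.

    Two sequences count as near-duplicates when their local alignment
    score reaches min_fraction of the shorter length (an assumed rule).
    """
    kept = []
    for name, seq in seqs:
        duplicate = False
        for _, other in kept:
            shorter = min(len(seq), len(other))
            if shorter and smith_waterman_score(seq, other) >= min_fraction * shorter:
                duplicate = True
                break
        if not duplicate:
            kept.append((name, seq))
    return kept
```

Each new sequence is aligned against every sequence kept so far, and
each alignment costs time proportional to the product of the two
lengths, so a sketch like this slows down quickly as the input set
grows.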

The standard parameters I most often use for motif searching in
S. cerevisiae are -y, -j, and -e.  For example:

    AlignACE -jfile.inp -y -e

Other parameters that are occasionally useful:

-o: Specify an output file.  Otherwise, an output filename is
automatically generated from the input arguments.

-g: Specify the GC content of the genome.  Default is 0.38 for
S. cerevisiae.
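
For a genome other than S. cerevisiae, the GC content can be
estimated from the genome's FASTA file.  The following Python sketch
shows one way to do that.  The function names are hypothetical, and
ignoring ambiguous bases such as N is an assumption of this sketch.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gc_fraction(sequences):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    gc = total = 0
    for seq in sequences:
        seq = seq.upper()
        gc += seq.count("G") + seq.count("C")
        total += sum(seq.count(base) for base in "ACGT")
    return gc / total if total else 0.0


# Example use (genome.fa is a placeholder file name):
#   with open("genome.fa") as handle:
#       print(gc_fraction(seq for _, seq in parse_fasta(handle)))
```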

-x: Modify the number of expected sites from the default of 10.  This
is also the number of sites with which the sampler is seeded.

-w: Change the number of columns from the default of 10.  I haven't
noticed any motifs that are very sensitive to this.

-d: Turn off column sampling.  With this option, only motifs with
contiguous active columns are found.

-r1 -t1 -otest.out: These options will result in a quick test run.
This is sometimes useful just to make sure you have things set up
correctly.
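
If you drive AlignACE from a script, the command line can be
assembled programmatically.  The Python sketch below uses only the
options described above and attaches each value directly to its flag,
as in the examples in this file.  The function names and the default
executable name are assumptions; on Windows you would pass
exe="alignace.exe".

```python
import subprocess


def build_alignace_command(orf_list, output=None, quick_test=False,
                           exe="AlignACE"):
    """Return the argument list for a typical S. cerevisiae run.

    orf_list   -- file of ORF names, one per line (passed with -j)
    output     -- optional output file name (passed with -o)
    quick_test -- add -r1 -t1 for a quick test run
    """
    cmd = [exe, "-j" + orf_list, "-y", "-e"]
    if quick_test:
        cmd += ["-r1", "-t1"]
    if output is not None:
        cmd.append("-o" + output)
    return cmd


def run_alignace(cmd):
    """Run AlignACE and raise an error if it exits with a failure code.

    Run this from the directory that holds the S. cerevisiae files
    needed by -y.
    """
    subprocess.run(cmd, check=True)


# Example: build_alignace_command("file.inp", output="test.out", quick_test=True)
```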


The Output

The output files are raw text.  Windows marks the end of each line
with two characters (carriage return plus line feed), whereas
Unix/Linux uses only a line feed.  You should still be able to read
the files in Windows with WordPad.
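
If the line endings do cause trouble, they are easy to convert.  A
minimal Python sketch follows; the function names are hypothetical.
It works on bytes so that Python does not translate the line endings
itself.

```python
def to_unix_newlines(data):
    """Replace Windows line endings (CR LF) with Unix ones (LF)."""
    return data.replace(b"\r\n", b"\n")


def to_windows_newlines(data):
    """Replace Unix line endings (LF) with Windows ones (CR LF).

    Normalizing to LF first avoids turning an existing CR LF into CR CR LF.
    """
    return to_unix_newlines(data).replace(b"\n", b"\r\n")


def convert_file(src, dst, convert):
    """Read src as bytes, apply convert, and write the result to dst."""
    with open(src, "rb") as handle:
        data = handle.read()
    with open(dst, "wb") as handle:
        handle.write(convert(data))


# Example: convert_file("test.out", "test_win.out", to_windows_newlines)
```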

The top of the AlignACE output file lists any purged sequences and
any ORF names that were not found in the ORF table.  This is
followed by a listing of the input sequence names.  The numbers
associated with these names are used in the subsequent motif
descriptions to refer to the input sequences.  Motifs are then listed
in the order found and masked.

The fields in AlignACE output following Motif number x are:

1:  site in sequence context (*'s below indicate 'active' motif columns)

2:  number of the sequence from which the site was found (listed at the
top of the output file)

3:  position of the site in that sequence (specifically, the position of
the site column nearest the beginning of the input sequence)

4:  strand (1=forward, 0=reverse)

5:  sequence in the active columns from field 1

6:  an internal score between 0 and 1 measuring site strength relative to
the motif

7:  distance of the site from the translation start (this applies if the
sequence is given 5'->3', distal to proximal relative to the translation
start, as is the default if an ORF list is used for input)
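
The relationship between fields 1 and 5 can be sketched in Python as
below.  The function name is hypothetical, and the sketch assumes the
line of *'s lines up character by character with the site string from
field 1; check this against your own output before relying on it.

```python
def active_columns(site, mask):
    """Return the bases of site that sit under a '*' in mask.

    Assumes mask is aligned character by character with site.
    """
    if len(site) != len(mask):
        raise ValueError("site and mask must have the same length")
    return "".join(base for base, flag in zip(site, mask) if flag == "*")


# Example: active_columns("ACGTAC", "** * *") keeps columns 1, 2, 4 and 6.
```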

As mentioned above, the *'s indicate active columns.  Masking refers
to the way multiple motifs are found and is described in Roth et al.
'Masking position x' means that sequence positions corresponding to
the x-th position in the motif as shown (considering both active and
inactive columns) are to be masked before further sampling.  The MAP
score is the internal score used by AlignACE to judge motifs.
Mathematical details are found in Liu et al., Journal of the American
Statistical Association 90: 1156-1170 (1995).  The five lines of
statistics that follow the MAP score are some measures that we once
thought might be interesting.  At this point, I'd ignore them.

There will be a significant update to AlignACE at some point in the
near future.  It should become faster, a little easier to use, and include
much improved motif statistics.


ScanACE

The output from ScanACE begins with some information about the motif
used in the search, followed by a ranked list of sites.  The number of
sites reported may be set with a score cutoff, expressed in standard
deviations from the mean of the scores for the aligned sites, or a
specific number of sites may be requested with the -s option.  The
first line of information about each site includes five fields: the
site sequence, chromosome number, position on chromosome, strand
(1=forward, 0=reverse), and site score.  This is followed by
information on neighboring genomic features.
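
The two ways of limiting the site list can be illustrated with a short
Python sketch.  This is not ScanACE's code.  It assumes the cutoff
lies a given number of standard deviations below the mean of the
aligned-site scores and that the population standard deviation is
used; the function names and the (sequence, score) tuple layout are
likewise hypothetical.

```python
from statistics import mean, pstdev


def score_cutoff(aligned_scores, num_sd):
    """Cutoff num_sd standard deviations below the mean (assumed direction)."""
    return mean(aligned_scores) - num_sd * pstdev(aligned_scores)


def select_sites(candidates, aligned_scores, num_sd=None, top_n=None):
    """Rank (sequence, score) candidates and limit the list.

    top_n  -- keep a fixed number of sites (the role -s plays in ScanACE)
    num_sd -- otherwise keep sites scoring at or above the cutoff
    """
    ranked = sorted(candidates, key=lambda site: site[1], reverse=True)
    if top_n is not None:
        return ranked[:top_n]
    if num_sd is not None:
        cutoff = score_cutoff(aligned_scores, num_sd)
        return [site for site in ranked if site[1] >= cutoff]
    return ranked
```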


*** If at all possible, use Linux.