File Formats for Matrix Searches

Alignments

The alignment files (*.dat) are sequence alignments in FASTA format. Each file contains an aligned listing of the known binding sites for one E. coli DNA-binding protein. For some sites, the location of the binding site upstream of the translation start codon is listed beside the name. Binding site data was obtained largely from the DPInteract database.
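As a minimal sketch of reading one of these alignment files, the Python function below collects (header, sequence) pairs from FASTA-style lines. The function name read_fasta is hypothetical, and the sketch assumes each record starts with a ">" header line; the exact layout of the location annotation beside the name is not specified here, so the header is returned unparsed.

```python
def read_fasta(lines):
    """Collect (header, sequence) pairs from FASTA-formatted lines.

    Assumes each record starts with a '>' header line; the header text
    (name plus any location note beside it) is returned unparsed.
    """
    records = []
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

The function accepts any iterable of lines, so an open file handle for a *.dat file can be passed directly.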

Search Output

These files (*.sco) contain the sequences and scores for each matrix search hit in the E. coli genome. At the beginning of the file, the scores for the known input sites are listed (using the scoring method of Berg & von Hippel, 1987). The mean and standard deviation of the scores for the known sites are calculated, and these values are used to determine cutoffs for the searches. We used two cutoffs: a strict cutoff equal to the mean of the scores of the input sites, and a looser cutoff two standard deviations below that mean.
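The two cutoffs described above can be sketched in Python as follows. The function name score_cutoffs is hypothetical, and the use of the sample standard deviation is an assumption, since it is not stated here which form of the standard deviation was used.

```python
from statistics import mean, stdev


def score_cutoffs(site_scores):
    """Return (strict, loose) cutoffs from the scores of the known sites.

    strict = mean of the input-site scores
    loose  = mean minus two standard deviations
    """
    mu = mean(site_scores)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sigma = stdev(site_scores)
    return mu, mu - 2 * sigma
```

The scores passed in would be the known-site scores listed at the beginning of a *.sco file.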

The second section of this file lists all matrix hits in the E. coli genome scoring better than two standard deviations below the mean of the scores of the input sites. Column 1 contains the sequence of the predicted binding site, column 2 gives the chromosome number (always equal to 1 for E. coli), column 3 gives the coordinate in the genome of the start point of the predicted binding site, column 4 gives the direction (1 is forward and 0 is reverse complement), and column 5 gives the score of the site against the weight matrix.
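The five-column layout above might be read with a small Python helper like the one below. This is a sketch only: the names Hit and parse_hit_line are hypothetical, and the code assumes the columns are separated by whitespace, which is not stated here.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    sequence: str    # column 1: predicted binding site sequence
    chromosome: int  # column 2: always 1 for E. coli
    start: int       # column 3: genome coordinate of the site start
    forward: bool    # column 4: True for 1 (forward), False for 0 (reverse complement)
    score: float     # column 5: score against the weight matrix


def parse_hit_line(line):
    """Parse one hit line; assumes five whitespace-separated columns."""
    seq, chrom, start, direction, score = line.split()
    return Hit(seq, int(chrom), int(start), direction == "1", float(score))
```

For example, an invented line such as "TTGACA 1 12345 0 7.25" would yield a Hit on the reverse-complement strand starting at coordinate 12345.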

Locations in the genome

These files (*.dba) describe the location in the E. coli genome of each matrix search hit that is within 600 bp upstream or 100 bp downstream of a start codon. Column 1 gives the coordinate in the genome of the start point of the predicted binding site. The next columns give the name of the closest gene in each direction that the site lies upstream of, together with the distance from the site to that gene's start codon (a positive number is a distance upstream of the start codon; a negative number is a distance downstream of the start codon). The last column (in parentheses) gives the percentage of the site that is located within noncoding regions. At the end of the file, there is a summary of the overall percentage of sites located in noncoding regions for each of the two cutoffs.
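The sign convention for distances can be illustrated with a short Python check. The function name in_reported_window is hypothetical, and treating the window boundaries as inclusive is an assumption.

```python
def in_reported_window(distance, upstream=600, downstream=100):
    """Return True if a signed distance falls in the reported window.

    Positive distances are upstream of the start codon and negative
    distances are downstream. Inclusive boundaries are an assumption.
    """
    return -downstream <= distance <= upstream
```

Under this convention, a site 250 bp upstream of a start codon has distance 250 and is reported, while a site 150 bp downstream has distance -150 and is not.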


Abigail Manson McGuire
Genetics Department
Harvard Medical School/BCMP
200 Longwood Ave.
Boston, MA. 02115.
E-mail: amcguire@arep.med.harvard.edu
Telephone: (617) 432-4136