Overview of the GenomeSequence Database

John Aach
November 25, 1997

last modified: December 1, 1997

The GenomeSequence database is an adjunct to a larger and more complex database named "TBEID" (Total Biomolecule Expression and Interaction Database), which is under development. The purpose of TBEID is to manage expression data systematically collected on cells containing precisely characterizable mutations under various growth and competition conditions. To do its work, TBEID contains tables for gene mutations whose registration depends closely on sequence: when a new version of a sequence is released, database information on mutations must be reviewed and possibly adjusted.

GenomeSequence was set up to assist in this process. It was designed to maintain multiple versions of sequence for an organism and to provide an easy way of selecting out parts relevant to genes of interest. Basic to this usage was the ability to retrieve sequence anywhere in or adjacent to a gene relative to its start or end, since it may be necessary to examine upstream or downstream as well as coding DNA in determining the impact of a sequence revision. However, this capability is of independent interest even for a single sequence version, and soon other potential uses for a sequence database were identified. Primary among these was a table of unique oligos that could be used to identify primers within or in the neighborhood of genes of interest.

The GenomeSequence database is designed to accommodate any fully sequenced organism in which all genes have (at least tentatively) been identified and assigned locations. It also demands that all bases in the sequence have been called (i.e., no ambiguous base codes). However, at this stage, only the M52 version of the E. coli sequence has been loaded into the database, and only a few queries have been made available over the web interface.

Data on the GenomeSequence database

The entire sequence from the GenBank M52 version of the E. coli genome is maintained on the database.

Gene names, PIDs, coding region locations, and "Blattner accession numbers" (the 'b' labels developed in Fred Blattner's lab as a unique identifier for each gene) have been extracted from the GenBank annotations and loaded into the database.

Unique oligos between 7 and 11 bases long were determined through a modification of a suffix array program written by Tim Chen, PhD. These were then processed to ensure that unique oligos that could be derived by prepending or appending additional bases to shorter unique oligos were eliminated. (Suffix array processing guarantees this result for the 3' end, but extra processing was required for the 5' end.) As a result, every unique oligo on the file is the shortest unique oligo that begins or ends at its location.
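The elimination rule can be illustrated with a small sketch. It is not the suffix array program mentioned above; it uses plain k-mer counting, which is only practical for short test sequences. The function names, the single-strand and linear (non-circular) treatment of the sequence, the 1-based start positions, and the choice to also check oligos one base shorter than the minimum length are all assumptions made for illustration.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every length-k substring of seq (single strand, no wraparound)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def minimal_unique_oligos(seq, k_min=7, k_max=11):
    """Return (start, oligo) pairs for unique oligos containing no shorter unique oligo.

    start is 1-based. An oligo of length k is kept only if it occurs exactly
    once and neither its first k-1 bases nor its last k-1 bases are unique.
    """
    kept = []
    shorter = kmer_counts(seq, k_min - 1)  # counts for length k-1
    for k in range(k_min, k_max + 1):
        current = kmer_counts(seq, k)
        for i in range(len(seq) - k + 1):
            oligo = seq[i:i + k]
            if current[oligo] != 1:
                continue  # not unique at this length
            if shorter[oligo[:-1]] == 1 or shorter[oligo[1:]] == 1:
                continue  # extends a shorter unique oligo
            kept.append((i + 1, oligo))
        shorter = current
    return kept
```

For the toy sequence "ACGTACGA" with lengths 2 to 3, minimal_unique_oligos("ACGTACGA", 2, 3) returns [(7, 'GA')]: every other unique 2mer or 3mer contains the unique base T or the unique 2mer GA.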

For the M52 version of GenomeSequence, there are

0 unique 7mers
84 unique 8mers
2798 unique 9mers
75624 unique 10mers
826270 unique 11mers

for a total of 904776 unique oligos in the size range 7-11 with these properties.

Steps were taken to validate all data transformations and processing involved in preparing data for the database. A copy of the tests undertaken and their results is available upon request.

Using the GenomeSequence Database

Before turning to the available queries, it is worth describing the location schemes used in the GenomeSequence database.

Locating sequence in the database

The web interface to GenomeSequence allows queries by gene location or by chromosome location.

Gene location entails the specification of a start and an end location relative to a gene. The gene may be given by name, Blattner accession number, or GenomeSequence number. Each of the start and end locations is specified in two parts: the first indicates the base point of the location as either the start or the end of the gene coding region, and the second indicates a position relative to the base point. Positions may be positive, negative, or zero. When the base point is the start of the gene, positions follow normal conventions: +1 means the first base of the coding region and -1 the first base upstream from it (0 is accepted as equivalent to +1). When the base point is the end of the gene, positions are treated as offsets; thus position +1 from the end point is the first base after the end point, and position -1 is the first base before it. Finally, if no position is specified, it is treated as having a value of 0.

Examples of gene location specifications thus include the following:

start base   start position   end base   end position   yields
start        (none)           end        (none)         whole gene coding region
start        -100             start      -1             100 bases upstream of gene
end          -10              end        10             last 11 bases of gene + 10 more
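The position conventions above can be sketched in a few lines. This is an illustrative reading of the rules, not the database's own code: the names gene_offset and gene_range are hypothetical, and the result is expressed as an offset from the first coding base (offset 0) rather than as a chromosome address.

```python
def gene_offset(base, position, gene_length):
    """Convert a (base point, position) pair to an offset from the first coding base.

    Offset 0 is the first base of the coding region; negative offsets lie
    upstream. base is "start" or "end"; a position of None is treated as 0.
    """
    if position is None:
        position = 0
    if base == "start":
        # +1 and 0 both mean the first coding base; -1 is the first base upstream.
        return position - 1 if position > 0 else position
    if base == "end":
        # Plain offset from the last coding base: +1 is the first base after it.
        return (gene_length - 1) + position
    raise ValueError("base must be 'start' or 'end'")


def gene_range(start_base, start_pos, end_base, end_pos, gene_length):
    """Return (first_offset, last_offset, n_bases) for a gene location query."""
    first = gene_offset(start_base, start_pos, gene_length)
    last = gene_offset(end_base, end_pos, gene_length)
    if last < first:
        raise ValueError("end location precedes start location")
    return first, last, last - first + 1
```

For a hypothetical gene of 300 bases, gene_range("start", None, "end", None, 300) returns (0, 299, 300), the whole coding region, and gene_range("end", -10, "end", 10, 300) returns (289, 309, 21), that is, the last 11 bases of the gene plus 10 more.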

Chromosome location specifies the sequence by its position in the source sequence file. This means that all locations are given as strand 1 locations. Any two of the following three parameters must be specified: a start location, an end location, and a length.
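As a minimal sketch of how two of the three parameters determine the third, the following function derives an inclusive (start, end) pair. The name resolve_chromosome_range is hypothetical, and the behavior when all three values are supplied (accept them only if they agree) is an assumption, since the text above covers only the two-parameter case.

```python
def resolve_chromosome_range(start=None, end=None, length=None):
    """Return (start, end) in strand 1 coordinates from two of the three parameters."""
    given = sum(v is not None for v in (start, end, length))
    if given < 2:
        raise ValueError("specify at least two of start, end, length")
    if length is not None and length < 1:
        raise ValueError("length must be at least 1")
    if start is not None and end is not None:
        # Assumption: a redundant length is accepted only if it is consistent.
        if length is not None and end - start + 1 != length:
            raise ValueError("start, end and length are inconsistent")
    elif start is not None:
        end = start + length - 1
    else:
        start = end - length + 1
    if end < start:
        raise ValueError("end precedes start")
    return start, end
```

For example, resolve_chromosome_range(start=100, length=50) and resolve_chromosome_range(end=149, length=50) both return (100, 149).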

Limited wraparound for circular chromosomes is supported for sequence retrieval by both gene and chromosome location. In the case of gene location, this means that queries will attempt to accommodate situations in which one of the gene-relative addresses for the start or end of the sequence falls either before position 1 of the source sequence file or after the end of the chromosome. For chromosome location, start locations may be specified as less than 1, or end locations as greater than the length of the chromosome. Locations below 1 begin at 0, so position 0 is the last base in the source sequence file.
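The wraparound rule can be sketched with modular arithmetic on 1-based positions. This is only an illustration: the name fetch_circular is hypothetical, and interpreting "limited" as "a range may not be longer than the chromosome" is an assumption, not a documented limit.

```python
def fetch_circular(chromosome, start, end):
    """Return bases start..end (1-based, inclusive) from a circular chromosome.

    Positions below 1 or above len(chromosome) wrap around, so position 0
    is the last base and len(chromosome) + 1 is the first base again.
    """
    n = len(chromosome)
    if end < start:
        raise ValueError("end precedes start")
    if end - start + 1 > n:
        # Assumed limit: at most one full turn of the chromosome.
        raise ValueError("range longer than the chromosome")
    return "".join(chromosome[(pos - 1) % n] for pos in range(start, end + 1))
```

With the toy chromosome "ACGTTGCA", fetch_circular("ACGTTGCA", 0, 2) returns "AAC" (the last base followed by the first two), and fetch_circular("ACGTTGCA", 7, 9) returns "CAA".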

Sequence reporting

All sequences reported as answers to GenomeSequence queries will be given in 5'-to-3' order relative to the strand selected. In the case of gene-located sequences, the strand reported will be the gene coding strand. In the case of chromosome location reporting, the strand of interest must be specified. Note that strand 2 sequences will be reported as 5'-to-3' for strand 2, even though their locations (for chromosome location) must be given as strand 1 addresses.
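The strand rule amounts to taking the reverse complement of the strand 1 segment when strand 2 is requested. A minimal sketch follows; the name report_sequence is hypothetical, and the sketch assumes only the four unambiguous bases and a range that does not wrap around.

```python
# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def report_sequence(chromosome, start, end, strand=1):
    """Return bases start..end (1-based strand 1 addresses), read 5'-to-3'
    on the requested strand."""
    if not 1 <= start <= end <= len(chromosome):
        raise ValueError("range must lie within the chromosome")
    segment = chromosome[start - 1:end]
    if strand == 1:
        return segment
    if strand == 2:
        # Complement each base, then reverse so the result reads 5'-to-3'.
        return segment.translate(_COMPLEMENT)[::-1]
    raise ValueError("strand must be 1 or 2")
```

For the toy sequence "ATGCCGTA", positions 2 to 5 give "TGCC" on strand 1 and "GGCA" on strand 2.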

Available queries on the web interface to the GenomeSequence database

Get Sequence By Gene retrieves sequence based on gene-relative locations as described above.
Get Sequence By Location retrieves sequence based on chromosome locations as described above.
Get Unique Oligos By Gene retrieves unique oligos which are entirely contained within a gene-relative location range.
Get Unique Oligos By Location retrieves unique oligos which are entirely contained within a chromosome location range.
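The "entirely contained" condition used by the two oligo queries can be sketched as a simple filter. The representation of an oligo as a (start, sequence) pair with a 1-based strand 1 start, and the name oligos_within, are assumptions for illustration; the actual table layout is not described here.

```python
def oligos_within(oligos, range_start, range_end):
    """Keep (start, sequence) oligos that lie entirely inside range_start..range_end.

    Both range bounds are inclusive; an oligo overlapping either edge is dropped.
    """
    return [
        (start, seq)
        for start, seq in oligos
        if start >= range_start and start + len(seq) - 1 <= range_end
    ]
```

For instance, with a range of 8 to 25, a 9mer starting at 10 is kept, while an 8mer starting at 5 or a 10mer starting at 20 is dropped because it extends beyond the range.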

Implementation

GenomeSequence is implemented as a Sybase System 11 database on a DEC Alpha 3000. The web application has been written in Perl 5.003 using the DBlib interface provided by sybperl. A command line interface is also available, consisting of a set of 11 Perl packages plus a Perl program for each query type. The same Perl packages, plus two additional ones, are used in the web application.

For more information

Please contact John Aach in case of problems or suggestions:

John Aach, PhD
Department of Genetics
Harvard Medical School
Warren Alpert Building
200 Longwood Avenue
Boston MA 02115

PHONE: 617-432-0503
FAX: 617-432-7266