Accessing Databases

What is Redundancy?

A key concept in comparing databases is the issue of redundancy. Many databases try to be "non-redundant". Unfortunately, biological data is too complex to fit a simple definition of redundancy. Are two alleles of the same locus redundant? Two isozymes in the same organism? The same locus in two closely related organisms? Hence, each "non-redundant" database has its own definition of redundancy. Some use automated measures, while others use manual culling; the former are amenable to large projects, the latter give higher quality. Other databases don't attempt to be non-redundant, but rather sacrifice this goal in favor of ensuring completeness.

Databases

Nucleotide (DNA & RNA)

nr (NCBI)

The nr nucleotide database maintained by NCBI as a target for their BLAST search services is a composite of GenBank, GenBank updates, and EMBL updates.

Non-redundant: Entries with absolutely identical sequences have been merged.

GenBank / EMBL / DDBJ

In theory, GenBank, the EMBL Datalibrary, and the DNA Databank of Japan (DDBJ) are just names for the same database. In reality, small timelags in propagating data between the database centers causes minor differences in these databases. However, if one of these libraries is merged with the updates to all of these databases, a complete set of sequences is formed.

Redundant: Little to no attempts to reduce redundancy

dbEST (Boguski, Lowe, & Tolstoshev. Nature Genetics 4:332 1993) is a library of Expressed Sequence Tags (Science 252:1651), single-pass cDNA sequences generated from automated sequencers.

CAUTION: ESTs are blindly sequenced from cDNA libraries with little or no human intervention; they are therefore likely to contain sequencing errors and are frequently contaminated with heterologous sequences and transcribed repetitive elements.

Redundant: no attempts made to reduce redundancy

Protein

nr (NCBI)

The nr protein database maintained by NCBI as a target for their BLAST search services is a composite of SwissProt, SwissProt updates, PIR, PDB. Entries with absolutely identical sequences have been merged.

SwissProt

SwissProt is maintained by Amos Bairoch at the University of Geneva. SwissProt is a highly-curated, highly-crossreferenced, non-redundant database. Unfortunately, the cost of this labor-intensive quality enhancement process is that not every sequence is in SwissProt. If you wish to look up information about a sequence, SwissProt is the first place to look.

Non-redundant: manual curation used to provide only one entry per protein product; variants are annotated in entry.
Highly-cross-referenced to other databases.

PIR

The Protein Identification Resource was originated by the late Margaret Dayhoff. It attempts to enjoy the advantages of a complete and a non-redundant database.

Non-redundant: PIR1 section contains only one entry per protein product.
Redundant: Complete database (PIR1+PIR2+PIR3) has many redundancies

PDB

The Protein Data Bank, maintained by Brookhaven National Laboratory (Long Island, New York, USA), contains all publically available solved protein structures. Searches against the pdb can be used to ask whether any known 3D structures are similar to your query protein.

Non-redundant: Only the "best" determination of a given structure is left in the database; however, multiple structures for one molecule may exist due to other components (i.e. one entry uncomplexed, one complexed).

OWL

Prot. Eng. 3:153

Non-redundant: Automatically generated from component databases (see reference for further info).

Protein Motifs

Prosite: Prosite is a database of protein motifs maintained by Amos Bairoch at the University of Geneva (NAR 19:2241, 1991). Each motif (defined by either a regular expression or a profile) is accompanied by a description of the motif and what is known about it's biology, as well as a listing of the true positive, false negative, and false positive SwissProt entries for the pattern.
BLOCKS: BLOCKS is a database developed by Steve Henikoff and colleagues. A block is a gap-free multiple alignment of sequences based on Prosite (Henikoff & Henikoff, NAR 19:6565 1991).

This document is intended to serve as a guide to using certain bioinformatics programs. It cannot be guaranteed to be free of errors or completely up-to-date. If you know of errors or other shortcomings of this document, please mail them to Keith Robison (Church Lab, HMS Genetics)

KRobison@nucleus.harvard.edu