Database Artifacts

Like any other scientific data, sequence data is subject to experimental artifacts.

Vector sequences

A number of authors have identified and cataloged the contamination of sequence databases with vector sequences. Among the studies are:
Claverie. Genomics 12:838. 1992.
Lamperti et al. Nucleic Acids Res. 20:2741. 1992.
Of particular note in the Lamperti et al. paper is the finding of short apparent vector sequences in the middle of otherwise non-vector sequence. The authors speculate that these may be due to errors in the editing of sequences or to rearranged plasmids.
Lopez, Kristensen, & Prydz. Nature 355:211. 1992.
Kristensen, Lopez, & Prydz. An estimate of the sequencing error frequency in the DNA sequence databases. DNA Seq 2:343 1989.
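
To make the idea of screening for vector contamination concrete, here is a minimal sketch in Python. It is not the method used in any of the studies above; it assumes a simple exact k-mer lookup against a single vector sequence on both strands, and the names vector_hits, reverse_complement and the word size k=12 are illustrative choices. Because it reports intervals anywhere in the query, it would also flag short internal vector segments of the kind Lamperti et al. describe.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string (non-ACGT bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def vector_hits(query, vector, k=12):
    """Return merged (start, end) intervals of `query` covered by exact
    k-mers that also occur in `vector` on either strand.

    Intervals are 0-based and end-exclusive. The exact-match rule and the
    word size k are assumptions of this sketch, not a published threshold.
    """
    query = query.upper()
    vector = vector.upper()

    # Index every k-mer of the vector, forward and reverse strand.
    vector_kmers = set()
    for strand in (vector, reverse_complement(vector)):
        for i in range(len(strand) - k + 1):
            vector_kmers.add(strand[i:i + k])

    # Scan the query and merge overlapping or adjacent matching windows.
    intervals = []
    for i in range(len(query) - k + 1):
        if query[i:i + k] in vector_kmers:
            if intervals and i <= intervals[-1][1]:
                intervals[-1][1] = i + k
            else:
                intervals.append([i, i + k])
    return [tuple(interval) for interval in intervals]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    toy_vector = "ACGTTGCAAGCTTGGCACTGGCCGTCGTTTTAC"
    toy_entry = "A" * 20 + toy_vector[5:25] + "A" * 20
    print(vector_hits(toy_entry, toy_vector))
```

An exact-match screen like this misses vector segments that carry sequencing errors; a production screen would normally use an alignment tool that tolerates mismatches.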

Heterologous sequences

White, O. et al. Nucleic Acids Res. 21:2829.
Describes a statistical method for comparing sets of sequences (but not individual sequences). Shows that several sets of cDNAs have bulk properties that differ from those of human cDNAs. Sequence comparisons are then used to show that the difference is due to contamination of the anomalous libraries with yeast and bacterial sequences.
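
As a rough illustration of what comparing the bulk properties of two sequence sets can look like, the Python sketch below pools dinucleotide frequencies over each set and measures how far apart the two distributions are. The statistic actually used by White et al. is not described in these notes; the choice of pooled k-mer frequencies and the Jensen-Shannon divergence here is an assumption made for illustration, as are the function names.

```python
from collections import Counter
from math import log2


def pooled_kmer_freqs(seqs, k=2):
    """Pool k-mer counts over a whole set of sequences and return
    relative frequencies. Windows containing non-ACGT symbols are skipped."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kmer: n / total for kmer, n in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two frequency dicts:
    0 for identical distributions, 1 for distributions with no shared k-mers."""
    divergence = 0.0
    for key in set(p) | set(q):
        pi = p.get(key, 0.0)
        qi = q.get(key, 0.0)
        mi = (pi + qi) / 2
        if pi:
            divergence += 0.5 * pi * log2(pi / mi)
        if qi:
            divergence += 0.5 * qi * log2(qi / mi)
    return divergence


if __name__ == "__main__":
    # Toy sets, invented for illustration only.
    reference_set = ["ACGTACGTAC", "ACGTTGCAAC"]
    test_set = ["GGGGCCCCGG", "CCGGGGCCCC"]
    print(js_divergence(pooled_kmer_freqs(reference_set),
                        pooled_kmer_freqs(test_set)))
```

A score of this kind characterizes a library as a whole. Deciding which individual clones are contaminants still requires sequence comparison against candidate source organisms, as in the paper.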