Sequence Pre-Filters

A number of sequence pre-filters have been developed to aid in sequence analysis.

Reducing matches due to biased amino acid composition

Many amino acid sequences are highly repetitive in nature, especially naive translations of genomic DNA. Matches between such segments are more likely to be due to these local amino acid composition biases than to common descent. Filters have been developed to mask out regions showing highly-biased local composition.


SEG (Wootton & Federhen, Computers & Chemistry 17:149. 1993)

XNU (Claverie & States, Computers & Chemistry 17:191. 1993)

XNU and SEG have been integrated into the network BLAST server, but little about their operation would preclude using them with other programs.
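
To make the idea of masking compositionally biased regions concrete, here is a minimal sketch in Python. It is not the SEG or XNU algorithm; it simply slides a fixed window along the sequence and masks any window whose Shannon entropy is low. The function names, the window length, and the threshold are all assumptions chosen for illustration, not the defaults of any published filter.

```python
import math
from collections import Counter


def window_entropy(segment):
    """Shannon entropy (bits per residue) of the residues in segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Mask every window whose entropy is at or below threshold.

    window and threshold are illustrative values only (assumptions).
    Sequences shorter than window are returned unchanged.
    """
    masked = list(seq)
    for start in range(0, len(seq) - window + 1):
        if window_entropy(seq[start:start + window]) <= threshold:
            # Overwrite the whole window with the ambiguity character.
            masked[start:start + window] = mask_char * window
    return "".join(masked)
```

With these settings a run of a single residue, such as a poly-glutamine stretch, is masked entirely, while a window containing twelve different residues is left alone. Windows that straddle the edge of a biased region can also fall below the threshold, so some flanking residues may be masked as well.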

Reducing matches to "uninteresting" sequences


XBLAST (not to be confused with BLASTX) is a program which masks regions of a query sequence using a previous BLAST output as a guide (Claverie & States, Computers & Chemistry, 17:191. 1993). In other words, given a sequence and a BLAST report for that sequence, XBLAST outputs the sequence with all matches from the BLAST report masked by ambiguity characters (X for proteins; N for nucleotides). This can greatly improve the readability of later BLAST reports by removing uninformative or confusing matches.

For example, suppose you have just sequenced 20 kb of human DNA. That DNA is likely to contain various repetitive sequences, such as Alu elements. A BLASTN search will return many hits involving Alu elements, which might obscure more interesting hits involving other similarities. Hence, a wise sequence of searches would be:
  1. BLASTN search versus Human Repetitive Sequences (Rep)
  2. XBLAST of query using Rep
  3. BLASTN search of GenBank (or dbEST) using XBLAST-processed query
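
The masking in step 2 can be sketched in a few lines of Python. This is not XBLAST itself, only an illustration of the idea: the function name mask_matches and its parameters are hypothetical, and the hit ranges are assumed to have already been pulled out of the BLAST report as 1-based, inclusive query coordinates. Parsing the report is not shown.

```python
def mask_matches(query, hits, molecule="protein"):
    """Return query with every hit range replaced by an ambiguity character.

    hits: iterable of (start, end) pairs in query coordinates, assumed to be
    1-based and inclusive. Reversed pairs (end < start) are tolerated.
    molecule: "protein" masks with X; anything else masks with N.
    """
    mask_char = "X" if molecule == "protein" else "N"
    masked = list(query)
    for start, end in hits:
        if start > end:
            start, end = end, start
        lo = max(start, 1) - 1          # convert to a 0-based index
        hi = min(end, len(query))       # clip to the end of the query
        for i in range(lo, hi):
            masked[i] = mask_char
    return "".join(masked)
```

For instance, calling mask_matches on a 20-base nucleotide query with hits [(5, 8), (15, 12)] and molecule="dna" replaces bases 5 to 8 and 12 to 15 with N and leaves the rest untouched. The masked sequence is what would then be submitted in step 3.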

This document is intended to serve as a guide to using certain bioinformatics programs. It cannot be guaranteed to be free of errors or completely up-to-date. If you know of errors or other shortcomings of this document, please mail them to Keith Robison.