Substitution Matrices

In aligning two protein sequences, some method must be used to score the alignment of one residue against another. Substitution matrices contain such values.
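
As a rough illustration of how such values are used, the sketch below scores two equal-length, ungapped sequences by summing one matrix lookup per aligned position. The three-letter alphabet, the scores in TOY_MATRIX, and the function names are assumptions made up for this example; a real search program would load a full 20 x 20 published matrix and also handle gaps.

```python
# Toy substitution scores for a three-letter alphabet.
# Illustrative values only; a real program would load a full published matrix.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "W"): -3,
    ("G", "G"): 6, ("G", "W"): -2,
    ("W", "W"): 11,
}


def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up the score for aligning residue a against residue b.

    The matrix is symmetric, so only one ordering of each pair is stored.
    """
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def ungapped_score(seq1, seq2, matrix=TOY_MATRIX):
    """Sum the per-position scores of two equal-length, ungapped sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(ungapped_score("AGW", "AGW"))  # 4 + 6 + 11
print(ungapped_score("AGW", "GAW"))  # 0 + 0 + 11
```

Identical residues score highest, and the rare residue W scores more for a match than the common residue A, which is the general pattern seen in real matrices.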

Widely used matrices

PAM / MDM / Dayhoff

The late Margaret Dayhoff was a pioneer in protein databasing and comparison. She and her coworkers developed a model of protein evolution that yielded a set of widely used substitution matrices. These are frequently called Dayhoff, MDM (Mutation Data Matrix), or PAM (Point Accepted Mutation, sometimes expanded as Percent Accepted Mutation) matrices.
  • Derived from global alignments of closely related sequences.
  • Matrices for greater evolutionary distances are extrapolated from those for lesser ones.
  • The number with the matrix (PAM40, PAM100) refers to the evolutionary distance; greater numbers are greater distances.
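
The extrapolation mentioned in the list above can be sketched as repeated matrix multiplication: in the Dayhoff model, the mutation probability matrix for distance n is the matrix for one unit of distance raised to the n-th power. The three-letter alphabet and the values in STEP are hypothetical, chosen only so that each residue has a 1% chance of changing per step; the real PAM1 matrix is 20 x 20 and is estimated from observed mutations.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise square matrix m to a non-negative integer power by repeated squaring."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    base = [row[:] for row in m]
    while n > 0:
        if n & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Hypothetical one-step mutation probabilities for a three-letter alphabet.
# Entry [i][j] is the chance that residue i is replaced by residue j in one
# step; each row sums to 1, and each residue has a 1% chance of changing.
STEP = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

for distance in (1, 40, 100, 250):
    extrapolated = mat_pow(STEP, distance)
    # Probability that the first residue is still itself at this distance.
    print(distance, round(extrapolated[0][0], 3))
```

The probability of a residue remaining unchanged falls as the distance grows but never reaches zero, because positions can mutate more than once or revert. Scoring matrices are then obtained from these probabilities as log-odds values, a step this sketch leaves out.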
Several later groups have attempted to extend Dayhoff's methodology or re-apply her analysis using later databases with more examples.
Extensions
  • Jones, Thornton and coworkers used the same methodology as Dayhoff but with modern databases (CABIOS 8:275)
  • Gonnet and coworkers (Science 256:1443) used a slightly different (but theoretically equivalent) methodology
Henikoff & Henikoff (Proteins 17:49) compared these two newer versions of the PAM matrices with Dayhoff's originals.
Seed and coworkers extended the extrapolations to even greater distances.

BLOSUM

The BLOSUM series of matrices was created by Steve Henikoff and colleagues (PNAS 89:10915).
  • Derived from local, ungapped alignments of distantly related sequences
  • All matrices are directly calculated; no extrapolations are used
  • The number after the matrix (BLOSUM62) refers to the percent identity threshold used when building it: sequences within a block that are at least that identical are clustered and counted as one, so greater numbers correspond to lesser evolutionary distances.
  • The BLOSUM series of matrices generally performs better than PAM matrices for local similarity searches (Proteins 17:49).
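
The direct calculation described above can be sketched as a log-odds computation: each score compares how often a residue pair is observed in aligned blocks with how often it would be expected by chance, and the result is expressed in half-bit units as in the BLOSUM series. The two-letter alphabet and the counts in PAIR_COUNTS are hypothetical, and the sketch omits the clustering of similar sequences that distinguishes one BLOSUM matrix from another.

```python
import math

# Hypothetical counts of aligned residue pairs observed in ungapped blocks.
# Pairs are unordered; the two-letter alphabet is a toy assumption.
PAIR_COUNTS = {("A", "A"): 60, ("A", "B"): 20, ("B", "B"): 20}


def log_odds_scores(pair_counts):
    """Turn observed pair counts into rounded log-odds scores in half-bits."""
    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue: an identical pair contributes
    # its full frequency, a mixed pair contributes half to each member.
    background = {}
    for (a, b), freq in observed.items():
        if a == b:
            background[a] = background.get(a, 0.0) + freq
        else:
            background[a] = background.get(a, 0.0) + freq / 2
            background[b] = background.get(b, 0.0) + freq / 2

    scores = {}
    for (a, b), freq in observed.items():
        # Chance expectation for an unordered pair.
        if a == b:
            expected = background[a] * background[b]
        else:
            expected = 2 * background[a] * background[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


print(log_odds_scores(PAIR_COUNTS))
```

Pairs seen more often than chance predicts receive positive scores and pairs seen less often receive negative ones; no extrapolation step is involved.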

Structure-based matrices

Specialized Matrices

Claverie (J Mol Biol 234:1140) has developed a set of substitution matrices designed explicitly for finding possible frameshifts in protein sequences. These matrices are designed solely for use in protein-protein comparisons; they should not be used with programs that blindly translate DNA (e.g. BLASTX, TBLASTN).

Using multiple matrices

There is no such thing as a perfect substitution matrix; each matrix has its own limitations. It should therefore be possible to use multiple matrices so that each one compensates for the limitations of the others. This intuitive notion has been formally developed by Steven Altschul (J Mol Biol 219:555; J Mol Evol 36:290) and tested by Steve Henikoff (Proteins 17:49).
This document is intended to serve as a guide to using certain bioinformatics programs. It cannot be guaranteed to be free of errors or completely up-to-date. If you know of errors or other shortcomings of this document, please mail them to Keith Robison (Church Lab, HMS Genetics)
KRobison@nucleus.harvard.edu