Protein Structure and Function from Sequence
J.M.Johnson and G.M.Church
This page was last updated in 1999. Use at your own risk.
The purpose of this page is to help organize the process of obtaining
maximal structure and function information for a given protein using
computational methods. The rapid increase of publicly available
sequences and protein structures means that an increasing amount of
information can be obtained for any protein sequence through its
relatedness to others.
If a set of homologous proteins can be found and aligned, the
information content at each position in the alignment profile is far
greater than in any single member of the family, and any structural or
functional prediction algorithm should utilize this collective
information. Profile information of this type is extremely sensitive to
the quality of the multiple alignment, and distant homologues should
only be included in the alignment if they can be aligned with
confidence.
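To make "information content at each position" concrete, here is a minimal sketch that scores each alignment column in bits. It assumes a 20-letter alphabet, ignores gaps, and applies no small-sample correction; the function name `column_information` is ours and is not taken from any of the programs listed on this page.

```python
import math
from collections import Counter

def column_information(alignment, alphabet_size=20):
    """Return the information content (bits) of each alignment column.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    Gaps are ignored; a column that contains only gaps scores 0.0.
    """
    n_cols = len(alignment[0])
    max_bits = math.log2(alphabet_size)  # maximum possible entropy
    scores = []
    for i in range(n_cols):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        if not residues:
            scores.append(0.0)
            continue
        counts = Counter(residues)
        total = len(residues)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in counts.values())
        # Information = maximum entropy minus observed entropy.
        scores.append(max_bits - entropy)
    return scores

scores = column_information(["ACD", "ACE", "AGF", "A-G"])
```

An invariant column scores log2(20), about 4.32 bits, and a column with four different residues scores 2 bits less. With only a handful of sequences these numbers overstate conservation, which is one more reason to build the family carefully before trusting the profile.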
No attempt has been made to be comprehensive in the collections of
programs and links, and many useful programs are not listed here, or
are being created as I type this.
Much of the process described below is used in the following paper,
which can be used as a worked example: Johnson and Church (1999) J.
Mol. Biol. 287: 695-715.
Please send comments, additions, corrections to johnson@arep.med.harvard.edu
A. Translate it into all possible reading frames. Find the coding
region(s).
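As a rough illustration of step A, the sketch below translates a DNA sequence in all six reading frames using the standard genetic code and reports the longest stretch that starts with Met and contains no stop codon. The function names are our own, codons containing ambiguous bases are translated as X, and alternative genetic codes are not handled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def translate(dna):
    """Translate complete codons only; unknown codons become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

def longest_orf(protein):
    """Longest stop-free stretch beginning with M in a translated frame.

    The last stretch may be open at its 3' end (no stop codon seen).
    """
    best = ""
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best

frames = six_frames("ATGGCCTAA")
```

This only finds candidate coding regions by length. It does not detect frameshifts, which is why the database comparison in step B is still needed.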
B. Compare to protein databases, check for frameshifts and sequencing
errors
-
BLAST
tools
-
blastn -- nucleotide vs. nucleotide databases
-
blastx -- six-frame translations of n.t. sequence vs. protein databases
-
tblastx -- six-frame translations of n.t. sequence vs. translated DNA
databases
-
GeneWise
(part of Wise2,
Sanger Centre) **recommended**--searches all possible reading frames
vs. various databases, including Pfam
A. Get a family. Find homologues and get
pairwise/multiple alignments
-
Check annotations in GenBank and SwissProt. The protein may have links
to putative homologues and alignments. Don't trust these annotations,
because they contain many errors, but use them to inform your
family-building process.
-
By local alignment algorithms
-
BLASTP
(Altschul et al 1990)
-
PSI-BLAST
-
DeCypher II (Smith-Waterman)
-
Fasta3 (EBI)
-
SSEARCH (Smith Waterman)
-
MPSRCH (Cook ?) Smith-Waterman
-
SCANPS
-
By hidden Markov models (HMM)
-
HMMER (Eddy
et al. 1995)
-
SAM
-
By sequence property searches
-
PROPSEARCH
(Hobohm and Sander 1995)
-
Caution: high-scoring false positives may occur when using any of these
methods, particularly if your query contains transmembrane helices,
coiled-coil patterns, or other low-complexity sequence.
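One simple way to see whether a query is at risk is to look for windows of low compositional entropy before searching. The sketch below is a simplified stand-in for dedicated low-complexity filters, not a reimplementation of any of them; the window length and entropy threshold are illustrative assumptions that you should tune.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def low_complexity_regions(seq, window=12, max_entropy=2.2):
    """Return merged half-open (start, end) spans of low-entropy windows.

    window and max_entropy are illustrative defaults, not calibrated values.
    """
    spans = []
    for i in range(len(seq) - window + 1):
        if window_entropy(seq[i:i + window]) <= max_entropy:
            if spans and i <= spans[-1][1]:
                # Overlaps or touches the previous span: extend it.
                spans[-1] = (spans[-1][0], i + window)
            else:
                spans.append((i, i + window))
    return spans

query = "MKTAYIAKQR" + "S" * 16 + "DEVLHGNWTC"
flagged = low_complexity_regions(query)
```

Hits that align only to flagged spans deserve extra suspicion. This check says nothing about transmembrane helices or coiled coils, which need their own predictors.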
B. Find known motifs in sequence
B. Secondary structure prediction (from single seq)
-
PHDsec
(Burkard Rost et al.) [single or multiple]
-
SOPM (Geourjon
& Deleage, IBCP, France) [single seq.]
-
SSP
(Nearest-neighbor) Solovyev and Salamov, Baylor College, USA. [single
seq.]
-
PSA - for
single domain globular proteins (BMERC) [single seq.]
-
SSP at IBCP, France
- consensus of several methods [single seq.]
-
NNPREDICT (Cohen
et al., UCSF) [single seq.]
-
Quadratic
Logistic (input single sequence or alignment to structural
homolog)
-
GOR
(Garnier et al. 1978)
-
GOR4 (ABS-NIH)
-
Jpred
(consensus method)
-
PSIPred (Warwick)
single or multiple PSI-BLAST -> neural network
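If you run several of the single-sequence predictors above, a per-residue majority vote is the simplest way to combine them. The sketch below assumes each program's output has already been reduced to a three-state string (H, E, C) of the same length as the sequence. It is a plain vote and is not a description of how any listed consensus server works.

```python
from collections import Counter

def consensus_prediction(predictions, tie_state="C"):
    """Majority vote per residue over equal-length H/E/C strings.

    predictions: list of strings, one per prediction method.
    tie_state: state assigned when the top two states tie (an assumption).
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for column in zip(*predictions):
        ranked = Counter(column).most_common()
        top_state, top_count = ranked[0]
        # A tie between the two most frequent states falls back to coil.
        if len(ranked) > 1 and ranked[1][1] == top_count:
            top_state = tie_state
        consensus.append(top_state)
    return "".join(consensus)

combined = consensus_prediction(["HHHEEC", "HHCEEC", "HCCEHC"])
```

Positions where the methods disagree are often more informative than the consensus itself, so it is worth keeping the individual predictions alongside the vote.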
C. Recognize known fold from sequence
D. Identify other characteristics
E. Try an automated function prediction method
-
GeneQuiz server or information
site (EBI). GeneQuiz does not use alignments as search queries,
as we are recommending here, and it gives little structural
information, but it is automated and convenient as a first pass to see
how easy the problem will be. GeneQuiz uses the following algorithms in
combination:
-
Blast: Altschul SF, Gish W, Miller W, Myers EW, Lipman
DJ (1990) J. Mol.Biol. 215: 403-10
-
Fasta: Pearson WR, Lipman DJ (1988) Proc. Natl. Acad.
Sci. USA 85: 2444-2448
-
MaxHom: Sander C, Schneider R (1991) Proteins 9:
56-69
-
Biasdb: Casari G, Ouzounis C, Sander C (1997)
-
Prosite: Bairoch A, Bucher P (1994) Nucleic Acids Res.
22: 3583-3589
-
Blocks: Henikoff S, Henikoff JG (1994) Genomics 19:
97-107
-
PredictProtein: Rost B, Sander C, Schneider R (1994) CABIOS 10:
53-60
-
Transmembrane helix prediction: Rost B, Casadio R,
Fariselli P, Sander C (1995) Protein Sci. 4:
521-533
-
Solvent accessibility: Rost B, Sander C (1994) Proteins
20:216-226
-
Coils: Lupas A, Van Dyke M, Stock J (1991) Science
252: 1162-1164
-
Protein functional class (from database annotation):
Tamames J, Ouzounis C, Sander C, Valencia A (1996) FEBS
Lett. 389:96-101
A. Find motifs and create local multiple alignment
-
PROBE (Neuwald et al. 1997, uses Gibbs algorithms)
-
Gibbs Motif Sampler (Neuwald et al. 1995)
-
Gibbs Site Sampler (Lawrence et al. 1993)
-
ASSET (Neuwald and Green 1994)
-
MATCH-BOX (Depiereux and Feytmans 1992) motif must be present in all
seqs.
-
BLOCK-MAKER
(FHCRC)
-
SAM
-
MEME
(Bailey and Elkan 1994) looks for motifs with Bayesian algorithms.
**recommended**
B. Create global multiple alignment
-
Global dynamic programming
-
CLUSTALW
(Thompson et al. 1994) (server@BCM) hierarchical, pairwise
-
MAP (Huang 1994) hierarchical, pairwise, progressive in linear space
-
MSA (WashU) Near-optimal sum of pairs
-
PIMA (Smith and Smith 1992) hierarchical, pairwise, pattern-induced
-
Local motif-finding plus heuristics
-
MACAW (Schuler et al. 1991)
-
Other, or not sure which of the above to put it in
-
MEME (finds homologues for you)
-
SSPA (Brocchieri and Karlin 1998) "symmetric-iterated" method
-
PSI-BLAST (Altschul et al 1997) position-specific, iterated BLAST
-
HMM (HMMER)
-
MaxHom (Sander and Schneider 1991)
-
AMPS (Barton and Sternberg 1990)
-
Pileup, GCG
-
Alignment Tips:
-
Use an iterated method like PSI-BLAST and/or HMMs
-
Set the parameters on the stringent side to prevent false positives
from creeping in early.
-
Use the conserved motifs you have identified to guide the alignment
process. Use these to identify and remove false positives in your
alignment set.
-
Check alignment after every iteration for poorly aligned regions and
false positives.
-
Refinements made by eye (can be done after every iteration of an
iterated method)
-
Make sure close homologues stay aligned together through subsequent
iterations
-
Check conserved residues and motifs to identify poor alignment regions
-
Remove questionable proteins and clear false positives from search
matrices or hidden Markov models
-
Recognize protein domains when possible and create separate alignments
A. Secondary structure prediction (from multiple alignment)
-
PHD/PredictProtein
(uses Maxhom/neural net algorithm) **recommended**
-
SOPM - self
optimized method (IBCP-CNRS) [can use multalign.] (C. Geourjon & G.
Deleage (1995) Comp. App. Biosci.11: 681-684)
-
Quadratic
Logistics (NIH)(with homologues)
-
PREDATOR
- Frishman & Argos (EMBL) [multalign]
-
SSPRED
- (Mehta et al., EMBL) with residue exchange statistics [multalign]
-
ZPRED -
(LICR, UK) [multalign, AMPS format], GOR method
-
AMAS (Barton,
EBI) [multiply aligned sequences, AMPS fmt]
B. Membrane topology prediction
-
Transmembrane alpha helices (~21 consecutive hydrophobic residues)
-
TMpred
- transmembrane region and orientation prediction (ISREC)
-
TMAP
- accepts input mult. align (Persson and Argos 1996, EMBL)
-
PHDhtm/PredictProtein
(EMBL)
-
SOSUI
(Tokyo)
-
Transmembrane beta strands (uses amphipathicity, border aromatics,
etc.)
-
MAPF (Johnson
and Church 1999)
-
Hsprime (Schirmer and Cowan 1993) [single seq.]
-
"Imax" - by residue conservation alone (Ferenci 1994)
-
Moment
- (GCG) hydrophobic moment
-
many other methods
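The "~21 consecutive hydrophobic residues" rule of thumb for transmembrane alpha helices can be tried directly with a sliding-window hydropathy average. The sketch below uses the Kyte-Doolittle scale; the 21-residue window follows the text above, while the 1.6 threshold and the function names are our assumptions. It is not the algorithm of any server in the list and says nothing about beta strands.

```python
# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=21):
    """Mean hydropathy of every window; unknown letters count as 0.0."""
    values = [KYTE_DOOLITTLE.get(aa, 0.0) for aa in seq.upper()]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

def candidate_tm_segments(seq, window=21, threshold=1.6):
    """Merged half-open (start, end) spans with window mean >= threshold.

    threshold is an illustrative cutoff, not a calibrated value.
    """
    segments = []
    for i, score in enumerate(hydropathy_profile(seq, window)):
        if score >= threshold:
            if segments and i <= segments[-1][1]:
                # Overlaps the previous candidate: extend it.
                segments[-1] = (segments[-1][0], i + window)
            else:
                segments.append((i, i + window))
    return segments

toy = "DEKRDEKRDE" + "L" * 21 + "DEKRDEKRDE"
segments = candidate_tm_segments(toy)
```

Merged spans are usually longer than the helix itself because every window that overlaps the hydrophobic core contributes, so treat the boundaries as approximate. Running the same scan on each aligned homologue and comparing the results is more convincing than a single-sequence hit.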
C. Sort out the domain structure of your family.
-
This is especially important with modular proteins with multiple
domains.
-
XDOM --Gouzy et al. (1997) Comput Appl Biosci 13(6):601-608--
is an automated domain finder which uses the ProDom
database.
D. Phylogenetic analysis
E. Identify conserved/functional residues
-
SEQUENCESPACE (Casari et al. 1995. NatStrBiol 2:171)
-
AMAS (Barton,
EBI) Analyze multiply aligned sequences
F. Predict contacts by correlated mutations
-
Maximum Likelihood (Pollock et al. 1999 J.Mol.Biol. 287:187)
-
Correlated Mutations (Gobel et al. 1994)
-
PREDBB (Hubbard and Park 1994)
-
(Thomas et al. 1996)
-
Note: these methods have not been proven to work well yet.
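To get a feel for what "correlated mutations" means in practice, the sketch below scores pairs of alignment columns by mutual information. This is a simpler statistic that we have swapped in for illustration; it is not the method of any paper cited above, and it inherits the same caveat that such signals are noisy, especially for small or phylogenetically biased alignments.

```python
import math
from collections import Counter

def mutual_information(alignment, i, j):
    """Mutual information (bits) between columns i and j, ignoring gaps."""
    pairs = [(s[i], s[j]) for s in alignment if s[i] != "-" and s[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    pair_counts = Counter(pairs)
    i_counts = Counter(a for a, _ in pairs)
    j_counts = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), c in pair_counts.items():
        p_ab = c / n
        # p_ab / (p_a * p_b), written with raw counts.
        mi += p_ab * math.log2(p_ab * n * n / (i_counts[a] * j_counts[b]))
    return mi

def top_covarying_pairs(alignment, min_separation=3, top=5):
    """Return the highest-scoring (score, i, j) column pairs.

    min_separation skips sequence-local pairs (an illustrative default).
    """
    n_cols = len(alignment[0])
    scored = [(mutual_information(alignment, i, j), i, j)
              for i in range(n_cols)
              for j in range(i + min_separation, n_cols)]
    scored.sort(reverse=True)
    return scored[:top]

toy = ["AKLDE", "AKLDE", "GRLEE", "GRLEE"]
best = top_covarying_pairs(toy, top=1)
```

In the toy alignment, columns 0 and 3 change together and score 1 bit, while any pair involving an invariant column scores 0. Real alignments need many diverse sequences before scores like this mean anything.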
G. 3D fold recognition (find structural homologue)
-
Multiple sequence threading algorithms
-
MST (Taylor 1997)
-
TOPITS/PredictProtein
(Rost, 1995) [single or multalign]
-
Other threading algorithms (many)
-
Fold recognition from a HMM of your multiple alignment. Scan vs. pdb
seqs.
-
Scan HMM vs. PDB sequences
-
(e.g. Hubbard and Park 1995)
H. Find more members of your family in databases
-
SCAN (from the Gibbs algorithm)
-
MAST
(from the MEME server)
-
Meta-MEME (SDSC) -- uses
HMMs of MEME motifs **recommended**
-
Scan (Stanford) with a
regular expression
-
Regular
expression search of OWL
-
Patscan
(Argonne)
-
pmotif (UMN) -
searches DNA sequence for protein motifs
-
HMMsearch (part of HMMer)
-
Note: before the database search, remove the bias that groups of
similar sequences introduce into the search profile or HMM.
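One common way to reduce the bias from near-identical sequences is to down-weight them rather than delete them. The sketch below computes position-based sequence weights in the style of Henikoff and Henikoff (1994); choosing this particular scheme is our assumption, and the programs above may use different weighting internally. Gaps are treated as an ordinary symbol for simplicity.

```python
from collections import Counter

def position_based_weights(alignment):
    """Return one weight per sequence, normalized to sum to 1.

    In each column, a sequence receives 1 / (k * n), where k is the number
    of distinct symbols in the column and n is the number of sequences
    sharing this sequence's symbol. Gaps count as a symbol here.
    """
    weights = [0.0] * len(alignment)
    for column in zip(*alignment):
        counts = Counter(column)
        k = len(counts)
        for idx, residue in enumerate(column):
            weights[idx] += 1.0 / (k * counts[residue])
    total = sum(weights)
    return [w / total for w in weights]

weights = position_based_weights(["AAAA", "AAAA", "AAAA", "GGGG"])
```

Here the three identical sequences share half of the total weight (1/6 each) and the divergent one gets the other half, so the redundant group no longer dominates a profile built from weighted counts.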
I. Identify representative motifs in the aligned sequences
-
E-Motif (Stanford)
this one doesn't align them for you
J. Use the multiple alignment to improve structure prediction
-
MAPF (Johnson
and Church, 1999). Integrates results of structure-prediction
programs for all proteins in a multiple alignment to improve the
accuracy of the predictions and to distribute structural information
from one homologue to another. Can be used with coiled-coil prediction,
secondary-structure prediction, or any other sequence characteristic
for which you have an algorithm.
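The general idea of pooling per-residue predictions across an alignment can be sketched in a few lines. The code below is a hypothetical illustration of that idea only and is not the MAPF implementation: it averages a numeric per-residue score (for example a helix propensity from any predictor) over each alignment column, then projects the column averages back onto one family member.

```python
def column_means(aligned_seqs, per_residue_scores):
    """Average per-residue scores over each alignment column.

    aligned_seqs: equal-length aligned sequences ('-' marks a gap).
    per_residue_scores: one list of floats per sequence, indexed by
    position in the ungapped sequence.
    Columns with no residues get None.
    """
    n_cols = len(aligned_seqs[0])
    sums = [0.0] * n_cols
    counts = [0] * n_cols
    for aligned, scores in zip(aligned_seqs, per_residue_scores):
        pos = 0  # index into the ungapped sequence
        for col, ch in enumerate(aligned):
            if ch == "-":
                continue
            sums[col] += scores[pos]
            counts[col] += 1
            pos += 1
    return [s / c if c else None for s, c in zip(sums, counts)]

def project_onto(aligned_seq, col_means):
    """Family-averaged score for each residue of one aligned sequence."""
    return [m for ch, m in zip(aligned_seq, col_means) if ch != "-"]

aligned = ["AC-D", "A-GD"]
scores = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
means = column_means(aligned, scores)
smoothed = project_onto(aligned[0], means)
```

An unweighted mean lets clusters of close homologues dominate, so in practice you would want to weight the sequences and check the alignment quality first.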
K. Transfer functional information from annotated proteins to others
-
Use database annotation and literature searches on each protein
-
Caveats:
-
Database annotations are often wrong, incomplete, or misleading.
-
Wrong annotations are often propagated, so finding many proteins with
the same annotation is not necessarily convincing.
-
Make sure that the domain/motif which gave rise to the annotation is
present in the aligned region.
-
Proteins may have multiple domains and multiple functions. If the
domain structure of your query protein(s) is known, you are better off
studying one domain at a time, building separate alignments.
-
If you want to transfer annotation of a particular enzymatic activity,
make sure all of the active site residues are present.
-
Beware of alternate splicings. Activity may be associated with only one
splice variant.
-
Conservation of the structural fold does not imply conservation of the
function. Homologous proteins may have evolved to have different
functions.
-
See Doerks et al. (1998), Smith and Zhang (1997), and Karp
(1998) for reviews of some of these functional genomics issues.
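The active-site caveat above is easy to check mechanically once you have a pairwise or multiple alignment. The sketch below maps residue numbers of an annotated reference protein to alignment columns and reports what the candidate protein has in each of those columns. The sequences and residue numbers in the example are made up for illustration.

```python
def check_active_site(aligned_reference, aligned_candidate, active_site):
    """Compare active-site residues between two aligned sequences.

    aligned_reference, aligned_candidate: rows of the same alignment.
    active_site: dict mapping 1-based residue numbers in the ungapped
    reference to the expected one-letter residue.
    Returns {residue_number: (expected, found, is_conserved)}; found is
    '-' when the candidate has a gap at that column.
    """
    # Map reference residue numbers to alignment columns.
    col_of = {}
    pos = 0
    for col, ch in enumerate(aligned_reference):
        if ch != "-":
            pos += 1
            col_of[pos] = col
    report = {}
    for ref_pos, expected in active_site.items():
        found = aligned_candidate[col_of[ref_pos]]
        report[ref_pos] = (expected, found, found == expected)
    return report

# Hypothetical example: residues 3 and 7 of the reference are catalytic.
report = check_active_site("GDSGGP-H", "GDAGGPKH", {3: "S", 7: "H"})
```

In this made-up case the histidine is conserved but the serine is replaced by alanine, so transferring the enzymatic annotation would not be justified without further evidence.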
A. Make a multiple alignment including the structural homologue
-
This allows you to distribute the structural information to the other
members of your protein family, depending on how accurate your
alignment is.
-
See alignment methods above
B. Transfer structural information through the multiple alignment
-
Several databases have been set up to do this with known structures.
-
CATH
-- classification of protein structures
-
FSSP -- families
of structurally similar proteins
-
LPFC
-- library of protein family core structures (Schmidt et al. 1997)
-
HOMALDB
-- database of structural alignments
C. Combine with other predictions and biochemical information
-
e.g. GLASS (Leplae, Hubbard, Tramontano, unpublished)
-
MAPF (Johnson
and Church, 1999). Can distribute structural information across
a multiple alignment in a visual display.
D. Build homology models for the proteins in the alignment
-
InsightII-Homology Package (MSI-Biosym)
-
ProModII
-
LOOK (Molecular Applications Group)
E. Check/refine your structure
-
checking for bumps, disallowed conformations, packing
-
energy minimization, simulated annealing
-
Biotech
Validation Suite for Protein Structures (EMBL)
-
ERRAT
- Protein Structure Verification at UCLA-DOE (US)
-
SCARF2 -
Protein Structure (PDB) Comparison (& Info) at LEMB (US)
-
SwissModel
- Automated Protein Modeling at ExPasy (Switzerland)
-
Verify3D
- 3D Structure Evaluation Service at UCLA-DOE (US)
-
several others
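A crude version of the "checking for bumps" step can be run on alpha-carbon coordinates alone. The sketch below flags pairs of residues whose CA atoms are closer than a cutoff; the 3.0 Angstrom cutoff and the decision to skip sequence neighbors are our illustrative assumptions, and this is no substitute for the all-atom validation services listed above.

```python
import math
from itertools import combinations

def find_bumps(ca_coords, min_distance=3.0, min_separation=2):
    """Return (i, j, distance) for CA pairs closer than min_distance.

    ca_coords: list of (x, y, z) tuples in Angstroms, one per residue.
    min_separation: pairs closer than this in sequence are skipped,
    since bonded neighbors are expected to be near each other.
    """
    bumps = []
    for (i, a), (j, b) in combinations(enumerate(ca_coords), 2):
        if j - i < min_separation:
            continue
        d = math.dist(a, b)
        if d < min_distance:
            bumps.append((i, j, round(d, 2)))
    return bumps

coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (3.8, 1.0, 0.0)]
bumps = find_bumps(coords)
```

The pairwise loop is quadratic in the number of residues, which is fine for a single model but would need a spatial grid for large complexes.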
Other web collections of related tools
Some related publications (mostly reviews)
-
S. R. Eddy (1998) "Profile hidden Markov models," Bioinformatics
14: 755.
-
T. Doerks, A. Bairoch, P. Bork (1998) Trends Genet. 14(6):
248-250.
-
P. Karp (1998) "What we do not know about sequence analysis and
sequence databases," Bioinformatics 14(9):
753-754.
-
M. Gerstein and H. Hegyi (1998) FEMS Microbiol. Rev. 22:
277-304.
-
A. F. Neuwald, J. S. Liu, D. J. Lipman, and C. E. Lawrence (1997)
"Extracting protein alignment models from the sequence
database," Nucleic Acids Res. 25(9):1665-77.
-
R. Sanchez and A. Sali (1997) "Advances in comparative
protein-structure modelling," Curr. Op. Struct. Biol. 7:206-214.
-
T. E. Smith and X. Zhang (1997) Nat. Biotechnol. 15:
1222-1223.
-
T. Hubbard and J. Park (1996) "Protein structure prediction:
playing the fold" Trends Biochem. Sci. 21(8):279.
This gives some of the basic flow shown above.
-
T. Springer (1997) PNAS 94:65-72. Example of homology
modeling.
-
Barton, G. J. (1995), "Protein Secondary Structure
Prediction," Curr. Op. Struct. Biol. 5: 372-376.
-
R.B. Russell & M. J. E. Sternberg (1995) "Protein Structure
Prediction: How Good Are We?," Current Biology 5:
488-490.
-
Benner, S. A., Gerloff, D. L. & Jenny, T. F. (1994) Science
265: 1642-1644.
-
Bairoch A, Bucher P (1994) Nucleic Acids Res. 22:
3583-3589 (Prosite)
-
Henikoff S, Henikoff JG (1994) Genomics 19: 97-107
(BLOCKS)
-
Rost B, Sander C, Schneider R (1994) CABIOS 10: 53-60
(PredictProtein)
-
Rost, B., Schneider, R. & Sander, C. (1993) Trends Biochem.
Sci. 18: 120-123.
-
Sander C, Schneider R (1991) Proteins 9: 56-69
(MaxHom)
-
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) J.
Mol.Biol. 215: 403-10 (BLAST)
-
Pearson WR, Lipman DJ (1988) Proc. Natl. Acad. Sci. USA 85:
2444-2448 (Fasta)
This page is in progress and contains errors.
Please address comments, additions, corrections to jjohnson@fas.harvard.edu