Protein Structure and Function from Sequence

J. M. Johnson and G. M. Church

This page was last updated in 1999. Use at your own risk.

Overview and notes

The purpose of this page is to help organize the process of obtaining maximal structure and function information for a given protein using computational methods. The rapid increase of publicly available sequences and protein structures means that an increasing amount of information can be obtained for any protein sequence through its relatedness to others.

If a set of homologous proteins can be found and aligned, the information content at each position in the alignment profile is far greater than in any single member of the family, and any structural or functional prediction algorithm should utilize this collective information. Profile information of this type is extremely sensitive to the quality of the multiple alignment, and distant homologues should only be included in the alignment if they can be aligned with confidence.
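As a rough illustration of "information content at each position", the sketch below scores each alignment column as log2(20) minus the Shannon entropy of the residues observed there. The function name, the gap handling (gaps are simply ignored), and the lack of any small-sample correction are assumptions made for this example, not part of any program listed on this page.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20):
    """Return the information content (in bits) of each alignment column.

    alignment: list of equal-length aligned sequences, '-' marks a gap.
    Gaps are ignored; a column that contains only gaps scores 0.0.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_bits = math.log2(alphabet_size)  # score of a perfectly conserved column
    scores = []
    for col in range(length):
        residues = [seq[col] for seq in alignment if seq[col] != "-"]
        if not residues:
            scores.append(0.0)
            continue
        counts = Counter(residues)
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(max_bits - entropy)
    return scores
```

A fully conserved column scores log2(20), about 4.32 bits, and a column split evenly between two residues scores one bit less. With only a few sequences these values overstate conservation, which is one reason the quality and depth of the alignment matter.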

No attempt has been made to be comprehensive in the collections of programs and links, and many useful programs are not listed here, or are being created as I type this.

Much of the process described below is used in the following paper, which can be used as a worked example: Johnson and Church (1999) J. Mol. Biol. 287: 695-715.

Please send comments, additions, corrections to johnson@arep.med.harvard.edu

If you start with a DNA sequence

A. Translate it into all possible reading frames. Find the coding region(s).

Translate - (ExPASy)
Protein machine - nucleotide to protein translation at EBI
Gene Identification Software (list)
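For readers who want to see what the translation step does, here is a minimal sketch of a six-frame translation using the standard genetic code. The function names are invented for this example, unknown or ambiguous codons are written as "X", and the "longest stretch without a stop" helper is only a crude stand-in for real gene-finding software.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, frame=0):
    """Translate one forward frame (0, 1 or 2); '*' marks a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


def six_frame_translations(dna):
    """Return a dict mapping '+1'..'+3' and '-1'..'-3' to translations."""
    rc = reverse_complement(dna)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = translate(dna, f)
        frames[f"-{f + 1}"] = translate(rc, f)
    return frames


def longest_open_stretch(protein):
    """Return the longest run of residues not interrupted by a stop."""
    return max(protein.split("*"), key=len)
```

Calling six_frame_translations on a sequence and comparing longest_open_stretch across the six results gives a quick first guess at the coding frame, which should then be checked against protein databases as described in the next step.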

B. Compare to protein databases, check for frameshifts and sequencing errors

BLAST tools
- blastn -- nucleotide vs. nucleotide databases
- blastx -- six-frame translations of n.t. sequence vs. protein databases
- tblastx -- six-frame translations of n.t. sequence vs. translated DNA databases
GeneWise (part of Wise2, Sanger Centre) **recommended**--searches all possible reading frames vs. various databases, including Pfam

What you can do with a single protein sequence

A. Get a family. Find homologues and get pairwise/multiple alignments

Check annotations in GenBank and SwissProt. The protein may have links to putative homologues and alignments. Don't trust these annotations, because they contain many errors, but use them to inform your family-building process.
By local alignment algorithms
1. BLASTP (Altschul et al. 1990)
2. PSI-BLAST
3. DeCypher II (Smith-Waterman)
4. Fasta3 (EBI)
5. SSEARCH (Smith-Waterman)
6. MPSRCH (Cook ?) Smith-Waterman
7. SCANPS
By hidden Markov models (HMM)
1. HMMER (Eddy et al. 1995)
2. SAM
By sequence property searches
1. PROPSEARCH (Hobohm and Sander 1995)
Caution: high-scoring false positives may occur when using any of these methods, particularly if your query contains transmembrane helices, coiled-coil patterns, or other low-complexity sequence.
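One simple way to spot the low-complexity regions mentioned in the caution above is to slide a window along the query and flag windows with low Shannon entropy. This is a minimal sketch only: the window length and cutoff are illustrative values chosen for the example, and it is not a reimplementation of any established filtering program.

```python
import math
from collections import Counter


def window_entropy(segment):
    """Shannon entropy (bits) of the residue composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def low_complexity_windows(seq, window=12, max_entropy=2.2):
    """Return (start, entropy) for every window at or below the cutoff.

    window and max_entropy are illustrative defaults, not tuned values.
    """
    hits = []
    for start in range(len(seq) - window + 1):
        h = window_entropy(seq[start:start + window])
        if h <= max_entropy:
            hits.append((start, h))
    return hits
```

Hits that coincide with the region responsible for a high database score are a hint that the match may reflect composition rather than homology.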

B. Find known motifs in sequence

Search sequence vs. databases of motifs
1. Identify (Stanford) searches e-motif database **recommended**
2. PROSITE (Bairoch 1991)
3. Pfam (Sanger Centre) HMM search vs. Pfam families
4. Block Searcher of BLOCKS database
5. PRINTScan (Basel) -search PRINTS
6. ProDom (Sonnhammer and Kahn 1994, Toulouse) - slow server
7. COG - Clusters of Orthologous Groups (Koonin, NCBI-NIH)
8. MotifFinder (Japan)
9. PFSCAN Profile Scan ISREC (Lausanne) - searches several motif databases
10. PRINTS/PROSITE - Combined PRINTS/PROSITE search
11. BLOCKS/PRINTS - Search BLOCKS/PRINTS (in blocks format)
12. PIMA, BCM Search Launcher - various pattern searches (Houston)
Check databases of alignments e.g. PIR-ALN
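Many of the motif databases above describe motifs as patterns. As a sketch of how such a search works, the code below converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function names are made up for this example, only the common pattern elements are handled (x, [..], {..}, repeat counts, and the < and > anchors outside brackets), and matches are reported without overlaps.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern to a Python regular expression.

    Example: 'N-{P}-[ST]-{P}' becomes 'N[^P][ST][^P]'.
    """
    regex = pattern.rstrip(".")          # drop the optional trailing period
    regex = regex.replace("-", "")       # element separators
    regex = regex.replace("x", ".")      # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = repeat count
    regex = regex.replace("<", "^").replace(">", "$")   # terminus anchors
    return regex


def find_motif(pattern, seq):
    """Return (start, matched_text) for non-overlapping matches in seq."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(seq.upper())]
```

Short patterns such as the one in the docstring match by chance in many unrelated proteins, so a hit is a pointer for further checking rather than evidence of function.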

B. Secondary structure prediction (from single seq)

PHDsec (Burkard Rost et al.) [single or multiple]
SOPM (Geourjon & Deleage, IBCP, France) [single seq.]
SSP (Nearest-neighbor) Solovyev and Salamov, Baylor College, USA. [single seq.]
PSA - for single domain globular proteins (BMERC) [single seq.]
SSP at IBCP, France - consensus of several methods [single seq.]
NNPREDICT (Cohen et al., UCSF) [single seq.]
Quadratic Logistic (input single sequence or alignment to structural homolog)
GOR (Garnier et al. 1978)
GOR4 (ABS-NIH)
Jpred (consensus method)
PSIPred (Warwick) [single or multiple] PSI-BLAST profile -> neural network

C. Recognize known fold from sequence

Threading algorithms
1. THREADER (Jones et al. 1992, Nature 358, 86-89)
2. TOPITS/PredictProtein (Rost, 1995) [single or multalign]
3. UCLA-DOE Threading program
4. Threading123D (Alexandrov)
5. GenTHREADER (D. T. Jones (1999) J. Mol. Biol. 287: 797-815) *recommended*
6. ProCeryon (Sippl et al., ProCeryon Biosciences)
Swiss-Model - from alignment to crystallographic data (ExPASy)
ProFIT (Sippl, Salzburg)
MAP (Barton, unpublished)

D. Identify other characteristics

Coiled-coil prediction algorithms
1. ISREC COILS server
2. Multicoil
3. Paircoil
SAPS (Brendel et al. 1992, PNAS 89: 2002-2006) Statistical analysis of amino acid usage, periodicities, etc. SAPS server at EBI.

E. Try an automated function prediction method

GeneQuiz server or information site (EBI). GeneQuiz does not use alignments as search queries, as we recommend here, and it gives little structural information, but it is automated and convenient as a first pass to see how easy the problem will be. GeneQuiz uses the following algorithms in combination:
- Blast: Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) J. Mol. Biol. 215: 403-10
- Fasta: Pearson WR, Lipman DJ (1988) Proc. Natl. Acad. Sci. USA 85: 2444-2448
- MaxHom: Sander C, Schneider R (1991) Proteins 9: 56-69
- Biasdb: Casari G, Ouzounis C, Sander C (1997)
- Prosite: Bairoch A, Bucher P (1994) Nucleic Acids Res. 22: 3583-3589
- Blocks: Henikoff S, Henikoff JG (1994) Genomics 19: 97-107
- PredictProtein: Rost B, Sander C, Schneider R (1994) CABIOS 10: 53-60
- Transmembrane helix prediction: Rost B, Casadio R, Fariselli P, Sander C (1995) Protein Sci. 4: 521-533
- Solvent accessibility: Rost B, Sander C (1994) Proteins 20:216-226
- Coils: Lupas A, Van Dyke M, Stock J (1991) Science 252: 1162-1164
- Protein functional class (from database annotation): Tamames J, Ouzounis C, Sander C, Valencia A (1996) FEBS Lett. 389:96-101

If you have a set of homologues

A. Find motifs and create local multiple alignment

PROBE (Neuwald et al. 1997, uses Gibbs algorithms)
Gibbs Motif Sampler (Neuwald et al. 1995)
Gibbs Site Sampler (Lawrence et al. 1993)
ASSET (Neuwald and Green 1994)
MATCH-BOX (Depiereux and Feytmans 1992) motif must be present in all seqs.
BLOCK-MAKER (FHCRC)
SAM
MEME (Bailey and Elkan 1994) looks for motifs with Bayesian algorithms. **recommended**

B. Create global multiple alignment

Global dynamic programming
1. CLUSTALW (Thompson et al. 1994) (server@BCM) hierarchical, pairwise
2. MAP (Huang 1994) hierarchical, pairwise, progressive in linear space
3. MSA (WashU) Near-optimal sum of pairs
4. PIMA (Smith and Smith 1992) hierarchical, pairwise, pattern-induced
Local motif-finding plus heuristics
1. MACAW (Schuler et al. 1991)
Other, or not sure which of the above to put it in
1. MEME (finds homologues for you)
2. SSPA (Brocchieri and Karlin 1998) "symmetric-iterated" method
3. PSI-BLAST (Altschul et al. 1997) position-specific, iterated BLAST
4. HMM (HMMER)
5. MaxHom (Sander and Schneider 1991)
6. AMPS (Barton and Sternberg 1990)
7. Pileup, GCG
Alignment Tips:
- Use an iterated method like PSI-BLAST and/or HMMs
- Set the parameters on the stringent side to prevent false positives from creeping in early.
- Use the conserved motifs you have identified to guide the alignment process. Use these to identify and remove false positives in your alignment set.
- Check alignment after every iteration for poorly aligned regions and false positives.
- Refinements made by eye (can be done after every iteration of an iterated method)
  - Make sure close homologues stay aligned together through subsequent iterations
  - Check conserved residues and motifs to identify poor alignment regions
  - Remove questionable proteins and clear false positives from search matrices or hidden Markov models
- Recognize protein domains when possible and create separate alignments

If you have a multiple alignment or known motif

A. Secondary structure prediction (from multiple alignment)

PHD/PredictProtein (uses Maxhom/neural net algorithm) **recommended**
SOPM - self optimized method (IBCP-CNRS) [can use multalign.] (C. Geourjon & G. Deleage (1995) Comp. App. Biosci.11: 681-684)
Quadratic Logistics (NIH)(with homologues)
PREDATOR - Frishman & Argos (EMBL) [multalign]
SSPRED - (Mehta et al., EMBL) with residue exchange statistics [multalign]
ZPRED - (LICR, UK) [multalign, AMPS format], GOR method
AMAS (Barton, EBI) [multiply aligned sequences, AMPS fmt]

B. Membrane topology prediction

Transmembrane alpha helices (~21 consecutive hydrophobic residues)
1. TMpred - transmembrane region and orientation prediction (ISREC)
2. TMAP - accepts input mult. align (Persson and Argos 1996, EMBL)
3. PHDhtm/PredictProtein (EMBL)
4. SOSUI (Tokyo)
Transmembrane beta strands (uses amphipathicity, border aromatics, etc.)
1. MAPF (Johnson and Church 1999)
2. Hsprime (Schirmer and Cowan 1993) [single seq.]
3. "Imax" - by residue conservation alone (Ferenci 1994)
4. Moment - (GCG) hydrophobic moment
5. many other methods
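The "~21 consecutive hydrophobic residues" rule of thumb for transmembrane alpha helices can be tried out with a sliding-window hydropathy average. The sketch below uses the Kyte-Doolittle hydropathy scale; the window of 21 follows the text above, while the cutoff of 1.6 and the function names are assumptions made for this example. The servers listed above use more elaborate methods.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq, window=21):
    """Mean hydropathy of every full window; index i is the window start.

    Residues missing from the scale (e.g. 'X') contribute 0.0.
    """
    values = [KYTE_DOOLITTLE.get(res, 0.0) for res in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def candidate_tm_windows(seq, window=21, threshold=1.6):
    """Start positions of windows whose mean hydropathy meets the cutoff.

    threshold is an assumed, commonly quoted value, not a tuned one.
    """
    profile = hydropathy_profile(seq, window)
    return [i for i, score in enumerate(profile) if score >= threshold]
```

Runs of adjacent flagged windows mark a candidate helix. Signal peptides and buried hydrophobic cores of globular proteins can also score highly, so candidates should be compared across the homologues in the alignment.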

C. Sort out the domain structure of your family.

This is especially important for modular proteins with multiple domains.
XDOM --Gouzy et al. (1997) Comput Appl Biosci 13(6):601-608-- is an automated domain finder which uses the ProDom database.

D. Phylogenetic analysis

GCG (Wisconsin)
Phylip
Phylo_win

E. Identify conserved/functional residues

SEQUENCESPACE (Casari et al. 1995. NatStrBiol 2:171)
AMAS (Barton, EBI) Analyze multiply aligned sequences

F. Predict contacts by correlated mutations

Maximum Likelihood (Pollock et al. 1999 J.Mol.Biol. 287:187)
Correlated Mutations (Gobel et al. 1994)
PREDBB (Hubbard and Park 1994)
(Thomas et al. 1996)
Note: these methods have not been proven to work well yet.
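To make the idea of correlated positions concrete, the sketch below computes the mutual information between two alignment columns. This is a generic stand-in chosen for its simplicity, not the algorithm of any of the methods cited above; the function name is invented, sequences with a gap in either column are skipped, and no correction is made for phylogenetic relatedness or small sample size.

```python
import math
from collections import Counter


def mutual_information(alignment, i, j):
    """Mutual information (bits) between alignment columns i and j.

    alignment: list of equal-length aligned sequences, '-' marks a gap.
    Sequences with a gap in either column are left out of the estimate.
    """
    pairs = [(s[i], s[j]) for s in alignment if s[i] != "-" and s[j] != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    count_i = Counter(a for a, _ in pairs)
    count_j = Counter(b for _, b in pairs)
    count_ij = Counter(pairs)

    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi
```

Two columns that always change together in a balanced two-state alignment give 1 bit, and independent columns give 0. In real families, shared ancestry alone produces apparent correlation, which is part of why these predictions remain unreliable.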

G. 3D fold recognition (find structural homologue)

Multiple sequence threading algorithms
1. MST (Taylor 1997)
2. TOPITS/PredictProtein (Rost, 1995) [single or multalign]
3. Other threading algorithms (many)
Fold recognition from an HMM of your multiple alignment
1. Scan the HMM vs. PDB sequences (e.g. Hubbard and Park 1995)

H. Find more members of your family in databases

SCAN (from the Gibbs algorithm)
MAST (from the MEME server)
Meta-MEME (SDSC) -- uses HMMs of MEME motifs **recommended**
Scan (Stanford) with a regular expression
Regular expression search of OWL
Patscan (Argonne)
pmotif (UMN) - searches DNA sequence for protein motifs
HMMsearch (part of HMMer)
Note: before the database search, remove the bias that groups of very similar sequences introduce into the search profile or HMM (for example by weighting sequences or dropping near-duplicates).
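One common way to reduce the bias from over-represented subfamilies is to down-weight similar sequences before building the profile. The sketch below implements position-based sequence weights in the style of Henikoff and Henikoff: in each column a sequence receives 1 / (k * n), where k is the number of distinct symbols in the column and n is the number of sequences sharing its symbol. Treating a gap as just another symbol and the function name are assumptions made for this example; profile and HMM packages generally apply their own weighting schemes.

```python
from collections import Counter


def position_based_weights(alignment):
    """Return one weight per sequence, normalized to sum to 1.

    alignment: list of equal-length aligned sequences.
    Gaps ('-') are counted as an ordinary symbol in this sketch.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    weights = [0.0] * len(alignment)
    for col in range(length):
        column = [seq[col] for seq in alignment]
        counts = Counter(column)
        k = len(counts)  # number of distinct symbols in this column
        for idx, symbol in enumerate(column):
            weights[idx] += 1.0 / (k * counts[symbol])

    total = sum(weights)
    return [w / total for w in weights]
```

For an alignment of two identical sequences and one divergent sequence, the two duplicates share half of the total weight and the divergent sequence gets the other half, so the duplicated subfamily no longer dominates the profile.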

I. Identify representative motifs in the aligned sequences

E-Motif (Stanford) this one doesn't align them for you

J. Use the multiple alignment to improve structure prediction

MAPF (Johnson and Church, 1999). Integrates results of structure-prediction programs for all proteins in a multiple alignment to improve the accuracy of the predictions and to distribute structural information from one homologue to another. Can be used with coiled-coil prediction, secondary-structure prediction, or any other sequence characteristic for which you have an algorithm.
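The general idea of combining per-sequence predictions through an alignment can be sketched as follows: map each protein's per-residue scores onto alignment columns, then average down each column. This is only a loose illustration of the concept under simple assumptions (unweighted mean, gaps skipped, hypothetical function names); it is not the MAPF implementation, whose details are in the paper.

```python
def project_to_columns(aligned_seq, residue_scores):
    """Place per-residue scores on alignment columns; gaps become None."""
    n_residues = sum(1 for ch in aligned_seq if ch != "-")
    if n_residues != len(residue_scores):
        raise ValueError("need exactly one score per ungapped residue")
    scores = iter(residue_scores)
    return [None if ch == "-" else next(scores) for ch in aligned_seq]


def column_consensus(alignment, per_sequence_scores):
    """Average the projected scores in each column, ignoring gaps.

    alignment: list of equal-length aligned sequences.
    per_sequence_scores: one list of per-residue scores per sequence,
    e.g. helix or coiled-coil probabilities from any predictor.
    Returns one value per column (None where every sequence has a gap).
    """
    projected = [
        project_to_columns(seq, scores)
        for seq, scores in zip(alignment, per_sequence_scores)
    ]
    consensus = []
    for column in zip(*projected):
        present = [v for v in column if v is not None]
        consensus.append(sum(present) / len(present) if present else None)
    return consensus
```

The column averages can be read back onto any member of the family, including one for which the predictor gave a weak or ambiguous signal on its own. The result is only as good as the alignment.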

K. Transfer functional information from annotated proteins to others

Use database annotation and literature searches on each protein
Caveats:
- Database annotations are often wrong, incomplete, or misleading.
- Wrong annotations are often propagated, so finding many proteins with the same annotation is not necessarily convincing.
- Make sure that the domain/motif which gave rise to the annotation is present in the aligned region.
- Proteins may have multiple domains and multiple functions. If the domain structure of your query protein(s) is known, you are better off studying one domain at a time, building separate alignments.
- If you want to transfer annotation of a particular enzymatic activity, make sure all of the active site residues are present.
- Beware of alternate splicings. Activity may be associated with only one splice variant.
- Conservation of the structural fold does not imply conservation of the function. Homologous proteins may have evolved to have different functions.
See Doerks et al. (1998), Smith and Zhang (1997), and Karp (1998) for reviews of some of these functional genomics issues.
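The caveat about active-site residues can be checked mechanically once the query is aligned to the annotated protein. The sketch below walks a pairwise alignment and reports, for each listed site, which residue the query has in the same column. The function name and the convention of 1-based residue numbers in the ungapped annotated sequence are assumptions made for this example; strict identity is used, so conservative substitutions are reported as mismatches and need to be judged by eye.

```python
def check_active_site(annotated_aln, query_aln, site_positions):
    """Compare query residues with annotated active-site residues.

    annotated_aln, query_aln: two rows of the same alignment ('-' = gap).
    site_positions: 1-based residue numbers in the ungapped annotated
    sequence.
    Returns {position: (annotated_residue, query_residue, identical)}.
    A '-' as query_residue means the site falls in a gap in the query.
    """
    if len(annotated_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    wanted = set(site_positions)
    result = {}
    residue_number = 0
    for a, q in zip(annotated_aln, query_aln):
        if a == "-":
            continue  # column has no residue in the annotated protein
        residue_number += 1
        if residue_number in wanted:
            result[residue_number] = (a, q, a == q)
    return result
```

Any site that comes back as a mismatch or a gap is a reason to hold off on transferring the enzymatic annotation until the alignment around that site has been checked.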

If you have a homologue of known structure

A. Make a multiple alignment including the structural homologue

This allows you to distribute the structural information to the other members of your protein family, depending on how accurate your alignment is.
See alignment methods above

B. Transfer structural information through the multiple alignment

Several databases have been set up to do this with known structures.
- CATH -- classification of protein structures
- FSSP -- families of structurally similar proteins
- LPFC -- library of protein family core structures (Schmidt et al. 1997)
- HOMALDB -- database of structural alignments

C. Combine with other predictions and biochemical information

e.g. GLASS (Leplae, Hubbard, Tramontano, unpublished)
MAPF (Johnson and Church, 1999). Can distribute structural information across a multiple alignment in a visual display.

D. Build homology models for the proteins in the alignment

InsightII-Homology Package (MSI-Biosym)
ProModII
LOOK (Molecular Applications Group)

E. Check/refine your structure

checking for bumps, disallowed conformations, packing
energy minimization, simulated annealing
Biotech Validation Suite for Protein Structures (EMBL)
ERRAT - Protein Structure Verification at UCLA-DOE (US)
SCARF2 - Protein Structure (PDB) Comparison (& Info) at LEMB (US)
SwissModel - Automated Protein Modeling at ExPASy (Switzerland)
Verify3D - 3D Structure Evaluation Service at UCLA-DOE (US)
several others

References

Other web collections of related tools

Motif/Pattern/Profile searches, by Peer Bork
BCM Search Launcher (Baylor)
The PredictProtein server (EMBL)
Pedro's BioMolecular Research Tools
ExPASy Mol. Bio. Server
Stanford Motif Bioinformatics Server
Protein Structure Links Server (SDSC)
Barton Group Home Page (EBI)
Rob Russell's guide to structure prediction (UK)

Some related publications (mostly reviews)

S. R. Eddy (1998) "Profile hidden Markov models," Bioinformatics 14: 755.
T. Doerks, A. Bairoch, P. Bork (1998) Trends Genet. 14(6): 248-250.
P. Karp (1998) "What we do not know about sequence analysis and sequence databases," Bioinformatics 14(9): 753-754.
M. Gerstein and H. Hegyi (1998) FEMS Microbiol. Rev. 22: 277-304.
A. F. Neuwald, J. S. Liu, D. J. Lipman, and C. E. Lawrence (1997) "Extracting protein alignment models from the sequence database," Nucleic Acids Res. 25(9):1665-77.
R. Sanchez and A. Sali (1997) "Advances in comparative protein-structure modelling," Curr. Op. Struct. Biol. 7:206-214.
T. E. Smith and X. Zhang (1997) Nat. Biotechnol. 15: 1222-1223.
T. Hubbard and J. Park (1996) "Protein structure prediction: playing the fold" Trends Biochem. Sci. 21(8):279. This gives some of the basic flow shown above.
T. Springer (1997) PNAS 94:65-72. Example of homology modeling.
Barton, G. J. (1995), "Protein Secondary Structure Prediction," Curr. Op. Struct. Biol. 5: 372-376.
R.B. Russell & M. J. E. Sternberg (1995) "Protein Structure Prediction: How Good Are We?," Current Biology 5: 488-490.
Benner, S. A., Gerloff, D. L. & Jenny, T. F. (1994) Science 265: 1642-1644.
Bairoch A, Bucher P (1994) Nucleic Acids Res. 22: 3583-3589 (Prosite)
Henikoff S, Henikoff JG (1994) Genomics 19: 97-107 (BLOCKS)
Rost B, Sander C, Schneider R (1994) CABIOS 10: 53-60 (PredictProtein)
Rost, B., Schneider, R. & Sander, C. (1993) Trends Biochem. Sci. 18: 120-123.
Sander C, Schneider R (1991) Proteins 9: 56-69 (MaxHom)
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) J. Mol. Biol. 215: 403-10 (BLAST)
Pearson WR, Lipman DJ (1988) Proc. Natl. Acad. Sci. USA 85: 2444-2448 (Fasta)

This page is in progress and contains errors.

Please address comments, additions, corrections to jjohnson@fas.harvard.edu