Protein Structure and Function from Sequence

J.M. Johnson and G.M. Church

This page was last updated in 1999. Use at your own risk.


Overview and notes

The purpose of this page is to help organize the process of obtaining maximal structure and function information for a given protein using computational methods. The rapid increase of publicly available sequences and protein structures means that an increasing amount of information can be obtained for any protein sequence through its relatedness to others.

If a set of homologous proteins can be found and aligned, the information content at each position in the alignment profile is far greater than in any single member of the family, and any structural or functional prediction algorithm should utilize this collective information. Profile information of this type is extremely sensitive to the quality of the multiple alignment, and distant homologues should only be included in the alignment if they can be aligned with confidence.
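As a rough illustration of the "information content at each position" idea, the sketch below scores each column of an alignment in bits, using the measure commonly shown in sequence logos (log2 of the alphabet size minus the Shannon entropy of the column). The function name `column_information`, the 20-letter alphabet size, and the gap characters are assumptions made for this example, not part of any program listed on this page. The sketch omits sequence weighting and small-sample correction, so scores from a handful of sequences will be overestimates.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20, gap_chars="-."):
    """Return the information content (bits) of each alignment column.

    alignment: list of equal-length aligned sequences (strings).
    Gaps are ignored when computing residue frequencies (an assumption).
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_bits = math.log2(alphabet_size)  # score of a perfectly conserved column
    scores = []
    for i in range(length):
        residues = [seq[i].upper() for seq in alignment if seq[i] not in gap_chars]
        if not residues:
            scores.append(0.0)  # all-gap column carries no information
            continue
        counts = Counter(residues)
        total = len(residues)
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(max_bits - entropy)
    return scores


if __name__ == "__main__":
    toy_alignment = ["ACDE", "ACDF", "ACEG", "AC-H"]  # hypothetical toy data
    for position, bits in enumerate(column_information(toy_alignment), start=1):
        print(position, round(bits, 2))
```

Fully conserved columns score highest and fully variable columns lowest. A misaligned distant homologue lowers the scores of otherwise conserved columns, which is one way to see why alignment quality matters so much.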

No attempt has been made to be comprehensive in the collections of programs and links, and many useful programs are not listed here, or are being created as I type this.

Much of the process described below is used in the following paper, which can be used as a worked example: Johnson and Church (1999) J. Mol. Biol. 287: 695-715.

Please send comments, additions, corrections to johnson@arep.med.harvard.edu


If you start with a DNA sequence

A. Translate it into all possible reading frames. Find the coding region(s).

B. Compare to protein databases; check for frameshifts and sequencing errors.
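Step A can be sketched in a few lines of Python. This is a minimal example, assuming the standard genetic code and an unambiguous A/C/G/T sequence; the helper names (`six_frame_translations`, `longest_orf`) are hypothetical and do not correspond to any tool linked from this page. Codons containing other characters are translated as "X". The ORF helper simply reports the longest run from a methionine to the next stop (or to the end of the frame), which is a crude stand-in for real coding-region detection.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def longest_orf(protein):
    """Longest stretch that starts at M and runs to the next stop or the end."""
    best = ""
    for segment in protein.split("*"):
        start = segment.find("M")
        if start != -1 and len(segment) - start > len(best):
            best = segment[start:]
    return best


if __name__ == "__main__":
    example = "CCATGGCTAAATGA"  # hypothetical toy sequence
    for label, protein in six_frame_translations(example).items():
        print(label, protein, longest_orf(protein))
```

For real sequences, step B remains essential: a frame that switches partway through a database hit is the usual sign of a frameshift or sequencing error, and a longest-ORF heuristic alone will not catch it.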


What you can do with a single protein sequence

A. Get a family. Find homologues and get pairwise/multiple alignments

B. Find known motifs in sequence

C. Secondary structure prediction (from single seq)

D. Recognize known fold from sequence

E. Identify other characteristics

F. Try an automated function prediction method


If you have a set of homologues

A. Find motifs and create local multiple alignment

B. Create global multiple alignment

If you have a multiple alignment or known motif

A. Secondary structure prediction (from multiple alignment)

B. Membrane topology prediction

C. Sort out the domain structure of your family.

D. Phylogenetic analysis

E. Identify conserved/functional residues

F. Predict contacts by correlated mutations

G. 3D fold recognition (find structural homologue)

H. Find more members of your family in databases

I. Identify representative motifs in the aligned sequences

J. Use the multiple alignment to improve structure prediction

K. Transfer functional information from annotated proteins to others


If you have a homologue of known structure

A. Make a multiple alignment including the structural homologue

B. Transfer structural information through the multiple alignment

C. Combine with other predictions and biochemical information

D. Build homology models for the proteins in the alignment

E. Check/refine your structure


References

Other web collections of related tools

Some related publications (mostly reviews)


This page is in progress and contains errors.
Please address comments, additions, corrections to jjohnson@fas.harvard.edu