Microbial Ecology, Proteogenomics & Computational Optima

Harvard & MIT, 13-May-2002, for CD

George Church, Sallie Chisholm, Martin Polz,

Roberto Kolter, Fred Ausubel, Raju Kucherlapati,

Steve Lory, Steve Gygi, Mike Laub




Table of Contents:

Face Page

Project Description

Computational, Technological and Biological Rationale

Goal 1 -- Identify and Characterize the Molecular Machines of Life

Goal 2 -- Characterize Gene Regulatory Networks

Goal 3 -- Microbial Communities in Natural Environments at the Molecular Level

Goal 4 -- Computational Understanding & Design of Biological Systems

Management Plan

Bibliography

Biographical Sketches

Facilities and Resources

Current and Pending Support

Proprietary Information

Special Information & Supplementary Documentation

Appendices & relevant preprints indexed by first author




We propose to leverage our experience in technology development (in proteomics, selection-genomics, and computational modeling) and our experience in the biology and ecology of three microbial systems sequenced by (and of key interest to) DOE to integrate the four GTL goals (below) in the context of the main DOE missions (energy production/carbon sequestration and environmental cleanup). The microbial genera proposed include uncultivated isolates plus: Prochlorococcus, a group of species responsible for a major fraction of the earth's microbial carbon fixation; Caulobacter, a species relevant to dilute scavenging and bioremediation as well as cell division; and Pseudomonas, a genus displaying a broad range of metabolic pathways (including chemical/biological toxins) and well-studied biofilms. We emphasize the computational theme of optimality and the concept that overconstraining integrative models with comprehensive datasets facilitates examination of inconsistencies, yielding insights into data collection issues and biological discoveries.

Goal 1 -- Identify and Characterize the Molecular Machines of Life. We will use (1a) "Proteogenomic Mapping," (1b) "Quantitative Interaction Proteomics", and (1c) "Proteomic Cellular Deconvolution" to move progressively from crude cell models consisting of lists of expressed gene products to well-determined systems in which concentrations, subcellular localizations, genomic structures (1-, 3- and 4-dimensional), and protein-protein interactions are known.

Goal 2 -- Characterize Gene Regulatory Networks. (2a) Use the above methods together with RNA arrays to monitor the network interactions as a function of external environmental changes, in particular, environmental stresses including light, temperature, metals, radiation, chemical and biological toxins, and phage infection. (2b) Use informatics to identify potential regulatory motifs upstream of co-regulated genes, as determined from microarray analyses including significant combinations of motifs. (2c) We will correlate and test the above hypotheses with selection data on mutations in each gene and genetic domain (goal 3d) and mass spectrometry of protein complexes selected by solid-phase versions of the motifs.

Goal 3 -- Characterize Complex Microbial Communities in their Natural Environments at the Molecular Level. (3a) Study the in situ microdiversity of planktonic microbial populations, the metabolic connections between taxa, and the role of cyanophage in shaping the structure of these communities. (3b) Functional Diversity Arrays and Single Cell Activity Multiplexing.

(3c) Extend newly developed "in situ" amplification methods to explore the location of cells with specific RNA profiles in microbial communities such as biofilms and aquatic gradients.

(3d) Use genome-wide transposon tag array quantitation to survey the relative fitness of thousands of different genotypes in a mixed microbial population subjected to the same stresses as in goal 2. Further develop and apply a method to detect differences in large genomic regions.

Goal 4 -- Develop Computational Methods to Understand Complex Biological Systems and Design new Systems. (Flux Balance=FBA, Minimum Perturbation=MPA, and 4D-cell)

(4a) Integrating Metabolic and Regulatory Networks: Application of MPA to Goal 2.

(4b) Addition of metabolic genes with MPA (Goal 3a)

(4c) Ecosystem Flux Balance Analysis, cross-feeding (Goal 3b)

(4d) Compartmentalized FBA and MPA (Goal 3c)

(4e) Our spatiotemporal (4D mechanical/compartmental) model of biosynthesis and cell division will be extended to the interactions and organisms in goal 1. Our software will be available as open source.
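For readers unfamiliar with flux balance analysis, the sketch below illustrates the general idea only; it is not the software referred to in goal 4. FBA is conventionally posed as a linear program: choose fluxes v that maximize an objective flux subject to the steady-state constraint S v = 0 and to lower and upper bounds on each flux. The four-reaction network, its bounds, and the function names (solve_square, fba) are hypothetical. To stay dependency-free, the example enumerates the vertices of the feasible region rather than calling a production LP solver, which is practical only for toy networks and assumes the rows of S are linearly independent.

```python
from fractions import Fraction
from itertools import combinations, product


def solve_square(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination; None if A is singular."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return None
        M[col], M[pivot] = M[pivot], M[col]
        pivot_value = M[col][col]
        M[col] = [x / pivot_value for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [row[n] for row in M]


def fba(S, lower, upper, objective):
    """Maximize objective . v subject to S v = 0 and lower <= v <= upper.

    Brute force: fix n - m fluxes at a bound, solve for the other m, and
    keep the best feasible candidate (every LP vertex has this form).
    """
    m, n = len(S), len(S[0])
    best_value, best_flux = None, None
    for free in combinations(range(n), m):
        fixed = [j for j in range(n) if j not in free]
        for bounds in product(*[(lower[j], upper[j]) for j in fixed]):
            # Move the fixed fluxes to the right-hand side.
            A = [[S[i][j] for j in free] for i in range(m)]
            b = [-sum(S[i][j] * val for j, val in zip(fixed, bounds))
                 for i in range(m)]
            solution = solve_square(A, b)
            if solution is None:
                continue
            v = [Fraction(0)] * n
            for j, val in zip(fixed, bounds):
                v[j] = Fraction(val)
            for j, val in zip(free, solution):
                v[j] = val
            if any(not (lower[j] <= v[j] <= upper[j]) for j in range(n)):
                continue
            value = sum(c * x for c, x in zip(objective, v))
            if best_value is None or value > best_value:
                best_value, best_flux = value, v
    return best_value, best_flux


# Hypothetical toy network. Rows: metabolites A and B. Columns (reactions):
# uptake (-> A), conversion (A -> B), byproduct export (A ->), biomass (B ->).
S = [[1, -1, -1, 0],
     [0, 1, 0, -1]]
LOWER = [0, 0, 0, 0]
UPPER = [10, 1000, 1000, 1000]   # uptake is the limiting flux
OBJECTIVE = [0, 0, 0, 1]         # maximize the biomass flux

best_value, best_flux = fba(S, LOWER, UPPER, OBJECTIVE)
print("optimal biomass flux:", best_value)
print("flux distribution:", [str(x) for x in best_flux])
```

In this toy case the optimum routes all ten units of uptake to biomass and none to the byproduct. Tightening the upper bound on the conversion reaction lowers the optimum accordingly, which is the kind of constraint-driven behavior that compartmentalized or ecosystem-level variants would build on.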


The importance of "optimality" in biology is hard to overstate. From the folding of proteins to the functioning of large ecosystems, Darwinian processes have found globally optimal solutions (or remarkably powerful local optima). We can now perturb these patterns on a global scale as we respond to our own Darwinian urge to survive as the only known species capable of planning new optima. However, our biological sciences have only recently garnered the requisite combination of holistic tools that might plausibly allow us to tackle such designs on a large or small scale. In particular, the revolutionary "omics" tools are just now becoming ready for integration with system modeling of simple cells in realistic environments.

To engineer an organism or ecosystem, one needs unprecedented understanding, which in turn requires modeling of quantitative spatial and temporal data on nearly all molecules in the cell and nearly all of the organismal types within an ecosystem. We propose to develop the most comprehensive and accurate methods to measure and model these interacting components. The three microbial systems chosen represent important targets in themselves and are also ideal for developing systems measures and models that would be difficult to establish in other organisms but will eventually be applicable very broadly.



A major effort in goal 3 will be to examine the genomic diversity within a "meta-population" of co-occurring Prochlorococcus populations to better understand variability at this level, and also to explore the metabolic diversity of the heterotrophic bacteria that co-occur with these autotrophs, 99% of which are uncultured to date. Our goal is to better understand the metabolic potential of these different rRNA-types, study the flow of carbon from Prochlorococcus to them, and look for the production of signaling compounds by Prochlorococcus in response to them (goal 2). The cellular machinery of these cells is optimized for slow growth rates (roughly one doubling per day), relative to model microbes such as E. coli. The cells are adapted to an extremely dilute environment (P and N levels of 1-20 nM) in which they grow at near maximal growth rates for existing light and temperature levels; that is, they are not in a state of ‘shutting down’. Part of goal 4 is to model the metabolic interactions within the community with fluxes similar to and interfaced with the intracellular models. In goals 1 & 2 we show the power of proteomics, RNA profiling, and population selection methods applied to species that replicate with a doubling time of 24 hours or more. These methods do not require the unnaturally rapid replication rates (30 minutes) found in laboratory strains of E. coli (Wada, et al. 2000). The rationale behind the diverse genera chosen is presented below. Throughout the proposal we try not to depend on the common microbial practices of frequent colony purification or homologous recombination in the DOE-relevant species. We do have cutting-edge research in homologous recombination as exemplified in goal 1, and so will be prepared to apply it if the need and opportunity arise.









Strategic Choice of Three Diverse Genera

We have chosen three genera of bacteria to represent major categories of microbial life styles, cell-to-cell interactions and ecological constraints. The rationale is outlined below.

    1. Prochlorococcus is an autotroph, indeed the smallest known autotroph with the smallest number of genes. Thus it represents the smallest number of genes (about 1700) that can make life from "non-life", i.e. inorganic compounds and the sun’s energy. Its natural habitat is a well-mixed and highly dilute environment.
    2. Caulobacter has the highest ability to scavenge low concentration compounds of any known prokaryote.
    3. Pseudomonas may have the greatest known degradative and biosynthetic diversity per cell among prokaryotes due to its relatively large genome size of ~6 Mb.

    1. Prochlorococcus – planktonic, individual cells
    2. Caulobacter – planktonic and substrate bound (dependent on life stage)
    3. Pseudomonas – planktonic, substrate bound and biofilm forming

    1. Prochlorococcus – defined cell cycle and (potential) circadian synchrony
    2. Caulobacter – life stage synchrony
    3. Pseudomonas – community synchrony mediated by quorum sensing
  5. What keeps the species in balance so that the number of species does not drop with time (as it does in many artificial ecosystems)?

    1. Pseudomonas – cell-to-cell interactions within a biofilm
    2. Prochlorococcus – not likely to be part of a physically constrained consortium, but possibly subject to quorum sensing. The microbial community in its natural environment is very diverse, giving rise to opportunities for multiple interactions and dependencies.
    3. Caulobacter – as per Prochlorococcus when in planktonic stage, as per Pseudomonas when substrate bound

    1. Pseudomonas – steep gradients of compounds within biofilms
    2. Prochlorococcus – gradients of resources with ocean depth. Different ecotypes are distributed differently with depth.
    3. Caulobacter – chemotaxis in gradients.


For Prochlorococcus we have two genomes that differ in size by 40% and in rRNA sequence by 2% (see below). For Pseudomonas the four genome sequences differ in size by at most 15%. For Caulobacter we have only one genome sequence.

Much of the proposal focuses on Prochlorococcus as illustrative of the goals for all three genera, rather than enumerating analogous details for the other two. Prochlorococcus also has the smallest genome & proteome, which will offer technical advantages. Therefore the bulk of the resources will be devoted to it throughout the grant. Nevertheless, the other species will help determine the generality of the approaches, have the advantages described above, and are needed to tie together this amazing community of researchers, in which certain technical advances are happening in the other two organisms: for example, Pseudomonas for the GIRAFF mismatch method (Sokurenko, et al. 2001) and transposon methods, and Caulobacter for DNA-protein interactions and cell cycle (Laub et al. 2000*; Laub et al. 2002*).





Figure 0a. Electron micrograph of Prochlorococcus cell, and a flow cytometric signature of a Prochlorococcus population in the "wild". One of the advantages of studying this group is that one can easily monitor its distribution and abundance in its natural habitat by its distinct light scatter and chlorophyll fluorescence signals. Thus we can begin to understand the biology of this organism at scales ranging from the genome to global ecology [ picoplan.html, cyano_forum/pico/pico.htm]


Because of its simplicity, its relevance to DOE’s mission in energy and carbon management, and its global significance, Prochlorococcus represents an exciting candidate for the Genomes to Life Program. Less than one micron in diameter, with a minimum of 1700 genes, it is a dominant component of the photosynthetic machinery of the oceans. Discovered only 15 years ago, Prochlorococcus accounts for roughly 30% of the total chlorophyll in the mid-latitude oceans, and sometimes as much as 80% of the total primary production. Often present at 10^8 cells per liter in the open ocean, it "may well be the dominant organism on this planet". We have estimated Prochlorococcus’ global abundance to be roughly 10^25 cells (tens of moles!), which is an order of magnitude more than the total number of human cells on Earth.
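The abundance figures above can be cross-checked with a few lines of arithmetic. The sketch below is only a unit conversion; the function names are hypothetical, and the inputs are the round numbers quoted in the text. The mole figure is sensitive to the cell estimate: 10^25 cells corresponds to roughly 17 moles, whereas 100 moles would correspond to about 6 x 10^25 cells.

```python
AVOGADRO = 6.02214076e23  # entities per mole (exact SI definition)


def cells_to_moles(cell_count):
    """Express a number of cells as moles of cells."""
    return cell_count / AVOGADRO


def per_liter_to_per_ml(cells_per_liter):
    """Convert a cell density from cells per liter to cells per milliliter."""
    return cells_per_liter / 1000.0


global_cells = 1e25      # estimated global abundance quoted in the text
open_ocean_density = 1e8  # cells per liter quoted in the text

moles_of_cells = cells_to_moles(global_cells)
density_per_ml = per_liter_to_per_ml(open_ocean_density)

print(f"{global_cells:.0e} cells is about {moles_of_cells:.1f} moles of cells")
print(f"{open_ocean_density:.0e} cells per liter is {density_per_ml:.0e} cells per ml")
```

The per-milliliter value is 10^5, the same order as the open-ocean densities typically reported per milliliter for this genus.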

Unlike heterotrophic microbes, Prochlorococcus is not dependent on the organic products of other cells. This "minimal phototroph" is as free living as a cell can be, requiring only sunlight, CO2, and inorganic nutrients to proliferate. Moreover, the microdiversity within this group is well documented both in laboratory cultures and in field populations, which allows us to gain insights into the origins and nature of variability in cellular networks, and the forces that shape and sustain microbial diversity.

The documented microdiversity within the group we call Prochlorococcus is embedded in a complex microbial community that constitutes the base of the marine pelagic food web. While the genomic revolution continues to reshape our view of bacterial "species", genomic and post-genomic approaches are only beginning to be extended effectively to the exploration of these natural microbial communities. The challenge lies in understanding the extent and nature of genomic adaptation and variation in communities under natural selection, and the response repertoire afforded to cells by the underlying genomes.

Over the last decade, we have learned through the elucidation of rRNA gene diversity that microbial diversity in the environment is much larger than previously assumed and that cultured organisms poorly represent the organisms found in the environment. The functional significance of rRNA variation is yet to be determined; however, comparative sequencing of genomes of closely related strains and of genome fragments recovered from the environment suggests that rRNA variation does not adequately represent functionally significant genomic variation. Genomics has also highlighted the importance of lateral gene transfer in the acquisition of evolutionary novelty, and laboratory experiments have revealed that loss of genes can be rapid when populations experience relaxed selection or environmental change.

Thus it is becoming increasingly clear that the idea of a microbial "species" is flawed, and must be replaced by an evolutionarily and ecologically meaningful term based on genomic information. There are multiple scales of resolution by which an organism can be identified. On one end of the spectrum is the phylotype as defined by the rRNA sequence. At the other end is the complete genome sequence. The former does not properly describe the diversity in protein coding genes, and the latter is far too stringent; cells that function identically in a given environment need not have identical genomes. What is the range of genome similarity that defines a functional ecological unit? How many can co-exist in a well-mixed environment? Is there a core genome that is shared by all, and peripheral genes that are frequently lost and regained? To what degree do microbes in the wild share genetic information by recombination and what is the role of phages or transposons in this process? At what degree of sequence divergence do recombination events become rare and new ecotypes (i.e. closely related but genetically and ecologically distinct strains) emerge? Finally, what selective regimes favor diversification at the genome level? Answers to these questions are essential for understanding the evolution and maintenance of complex microbial communities.
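To make the "core genome" and "peripheral genes" questions concrete, the sketch below partitions gene content across strains using plain set operations. It is an illustration under stated assumptions, not an analysis method taken from this proposal: each genome is reduced to a set of gene-family identifiers, and the strain names, family names, and function names are all hypothetical. The Jaccard index is used here as one simple, commonly used measure of gene-content similarity.

```python
def partition_pangenome(genomes):
    """Split gene families into core (present in every genome) and peripheral.

    genomes: dict mapping a strain name to a set of gene-family identifiers.
    Returns (core, peripheral) as two sets.
    """
    gene_sets = list(genomes.values())
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)
    pan = set.union(*gene_sets)
    return core, pan - core


def jaccard_similarity(a, b):
    """Shared gene families divided by all gene families found in either genome."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical gene-family content for three co-occurring strains.
genomes = {
    "strain_A": {"fam01", "fam02", "fam03", "fam04"},
    "strain_B": {"fam01", "fam02", "fam03", "fam05"},
    "strain_C": {"fam01", "fam02", "fam04", "fam05", "fam06"},
}

core, peripheral = partition_pangenome(genomes)
similarity_ab = jaccard_similarity(genomes["strain_A"], genomes["strain_B"])

print("core families:", sorted(core))
print("peripheral families:", sorted(peripheral))
print("gene-content similarity of A and B:", similarity_ab)
```

Defining a functional ecological unit would then amount to choosing a similarity threshold, or a required set of core families, which is exactly the open question posed above.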

* * *

Here we propose a multidisciplinary effort aimed at the DOE Genomes To Life Program using Prochlorococcus and its accompanying microbial community as our model system. Prochlorococcus is uniquely suited for attacking both of these goals for several reasons. The complete genome sequences of two of the ecotypes, MED4 and MIT9313 (which differ by 2% at the 16S rDNA locus), have recently been completed by the JGI. Like other closely related genomes, they have common genomic backbones as well as major differences. Unlike most other systems, however, the differences between these ecotypes have been interpretable in terms of known ecological and phylogenetic differences. This is possible in part because the ecology and physiological diversity of Prochlorococcus are well studied, in part because it is a simple phototroph, and in part because its natural habitat is a well-mixed, simple environment where the environmental parameters that dictate its distribution can be easily measured. Although we should not be surprised to find such agreement between properties and behaviors at the genetic, organism, and population levels (since the latter emerge from, and feed back upon, the former), we have been inspired by how clearly some of the differences at the genome level can be mapped onto the ecological distributions of the ecotypes in field populations.



DOE Relevance of Prochlorococcus

The oceans contribute 40% of the total photosynthesis on Earth. This drives the "biological pump" in the surface oceans, which exports carbon to the deep sea where it is naturally sequestered. If the pump were turned off, the concentration of CO2 in the atmosphere would more than double. Given the significance of this pump in regulating atmospheric CO2 concentrations, it is important that we understand the cellular processes of the organisms that drive it: the phytoplankton. Despite the astounding diversity of extant phytoplankton species, a significant fraction of oceanic primary productivity is carried out by two closely related groups of cyanobacteria, Prochlorococcus and Synechococcus. In the oligotrophic seas Prochlorococcus alone can account for up to half of the total chlorophyll. These tiny cells, the smallest oxygenic phototrophs in the sea, have been extraordinarily successful in dominating the oceanic carbon cycle.


General Characteristics of Prochlorococcus

Prochlorococcus is very closely related to the Marine A cyanobacterial cluster, forming a single lineage within the cyanobacteria, with 96% similarity in their 16S rDNA sequences (see below, goal 3). The major light harvesting complexes of nearly all other cyanobacteria consist of phycobilisomes, a defining characteristic of this group. In contrast, Prochlorococcus lacks phycobilisomes, and contains divinyl chlorophyll a (chl a2) and divinyl chlorophyll b (chl b2) as its major photosynthetic pigments. The latter enable it to efficiently absorb the blue light in the deep ocean. However, some strains have recently been shown to contain the gene for phycoerythrin, and traces of this pigment within the cells (Hess, et al. 2001; Ting, et al. 1999, 2001).
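Similarity figures such as the 96% quoted above are typically computed as percent identity over an alignment. The sketch below shows one common convention, assuming the two sequences are already aligned to equal length and that columns containing a gap in either sequence are skipped; other conventions exist and give slightly different numbers. The function name and the 50-nucleotide toy sequences are hypothetical and are not real 16S rDNA data.

```python
def percent_identity(seq_a, seq_b, gap="-"):
    """Percent of identical positions between two pre-aligned sequences.

    Columns where either sequence has a gap are excluded from the comparison.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for x, y in zip(seq_a, seq_b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x.upper() == y.upper():
            matches += 1
    if compared == 0:
        raise ValueError("no comparable (ungapped) positions")
    return 100.0 * matches / compared


# Toy example: a 50-nucleotide sequence and a copy with two substitutions.
reference = "ACGT" * 12 + "AC"
variant = list(reference)
variant[5] = "A"    # C -> A
variant[30] = "T"   # G -> T
variant = "".join(variant)

identity = percent_identity(reference, variant)
print(f"identity: {identity:.1f}%")
```

Two substitutions in 50 compared positions give 96% identity, which conveys the scale of divergence being discussed: a few differences per hundred nucleotides of the rRNA gene.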

Prochlorococcus is very abundant (often 10^5 cells ml^-1) in the oligotrophic waters between 40° N and 40° S, a range that is consistent with its temperature optima when grown in culture. Distinct "ecotypes" (using the term to loosely describe physiologically and genetically distinct, but closely related isolates; some would call these different species) exist within the genus, exemplified by MED4 and MIT9313, which we have called high and low light-adapted based on the optimum light intensity for growth, and the range of their chl b/a ratios over the course of photoacclimation (Fig 1 A, B). Phylogenies constructed using rDNA sequences reveal clades that cluster the ecotypes according to these differences (see section below), and depth distributions of ecotypes in the field are consistent with these groupings.



"Caulobacter crescentus is the most common nonpathogenic bacterium in nutrient-poor freshwater streams. In the swarmer phase of its three-phase life cycle, C. crescentus is motile and chemically sensitive, characteristics that help it locate nutrient sources. In its nonswarmer phase, it adheres to solid substrates such as rocks. Microbial Genome Program (MGP) scientists are determining the DNA sequence of the genome of C. crescentus, one of the organisms responsible for sewage treatment." (

In addition, it is an organism whose cells can be naturally synchronized in their division cycle stages, and it displays asymmetric cell division. Relevant RNA microarray data and location data have been collected (Laub et al. 2000*; Laub et al. 2002*).






Figure 0b above. Caulobacter cell division time series from Yves Brun (




Pseudomonas is capable of colonizing niches that range from water and soil to tissues of plants and animals. This ability to thrive in a variety of environments is in part reflected in its complex genome (over 6 Mb for strain PAO1) and its coding capacity of ca. 6,000 proteins, which is similar to that of the simple eukaryote Saccharomyces cerevisiae. Moreover, the annotation of the genome of PAO1 revealed that it contains a large number of proteins that are homologous to transcriptional regulatory proteins. Indeed, over 400 such genes have been identified and they represent 6.7% of all annotated genes, the highest percentage of regulatory genes found in sequenced microbial genomes. This indicates that the regulation of the large number of genes in the P. aeruginosa genome requires the activities of a correspondingly large number of regulatory networks. The thousands of adjustments in the levels of various proteins in the cell, which allow P. aeruginosa to thrive in a particular environment, are accomplished by transduction of signals to regulatory proteins, leading to selective repression and activation of gene expression.







"Pseudomonads are noted for their metabolic diversity and are often isolated from enrichments designed to identify bacteria that degrade pollutants. Bioremediation applications seek to exploit the inherent metabolic diversity of P. fluorescens to partially or completely degrade pollutants such as styrene, TNT, and polycyclic aromatic hydrocarbons (Baggi, et al. 1983; Gilcrease & Murphy 1995; Caldini, et al. 1995). In addition, strains can be modified genetically to improve their performance in particular applications. A number of strains of P. fluorescens ... [produce] secondary metabolites including antibiotics, siderophores and hydrogen cyanide (O'Sullivan, D.B., and O'Gara, F. 1992)." (






Figure 0c, from Drenkard & Ausubel 2002* (see attached). Confocal Scanning Laser Microscopy (CSLM) analysis of biofilm formed by wild-type PA14 and an antibiotic-resistant variant (RSCV), both expressing GFP. Scale bar 50 µm.