"Goal 3 -- Characterize the Functional Repertoire of Complex Microbial Communities in their Natural Environments at the Molecular Level

Goal 3a and b: Background and Progress to date

Understanding the nature of diversity and of functional units in microbial communities is one of the major challenges in microbiology, ecology and evolutionary theory. Although ribosomal RNA approaches have provided first steps towards diversity estimation, and are widely used as a proxy for unique bacterial ‘types’ in natural populations, it remains unknown at what level of genetic resolution an ecologically functional unit must be defined. Furthermore, although genomic studies on cultivated bacteria have resulted in important and unexpected insights into the processes and patterns of genome evolution, it remains unclear how these insights may be extended to populations that co-occur in natural environments. Many crucial questions, such as at what level of structural similarity genome evolution is driven by homogenizing versus differentiating mechanisms, can only be answered by analysis of co-occurring genomes at different levels of phylogenetic relationship.

Goal 3a

We will use the cyanobacterium Prochlorococcus as our central model to explore in detail the genomic variation that occupies a single dominant and well-defined niche in the ocean. This will be accomplished by flow sorting the Prochlorococcus cells away from the rest of the microbial community, constructing a BAC library, and, depending on the diversity encountered, assembling either the complete genomes or large contigs to determine the structure of co-existing genomes. Should assembly of large genome portions not be possible, we will provide anchors for the bioinformatic/evolutionary analysis by identification of homologous genes/genome regions in the BAC libraries (see below). We will also measure the diversity of co-existing Prochlorococcus in the four samples by rarefaction of a number of different gene markers and by application of in situ amplification techniques.
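The rarefaction step mentioned above can be illustrated with a minimal sketch. It assumes that sequence variants of a marker gene have already been tallied per variant type; the function name and the example counts are hypothetical and are not drawn from our data. The function computes the expected number of distinct variants in a random subsample of n sequences using the standard hypergeometric rarefaction formula.

```python
from math import comb


def expected_richness(variant_counts, n):
    """Expected number of distinct variants in a random subsample of size n.

    variant_counts: number of sequences observed for each variant (hypothetical input).
    Uses E[S_n] = sum_i (1 - C(N - N_i, n) / C(N, n)), where N is the total
    number of sequences and N_i the count of variant i.
    """
    total = sum(variant_counts)
    if not 0 <= n <= total:
        raise ValueError("subsample size must lie between 0 and the total count")
    all_subsamples = comb(total, n)
    # comb(total - c, n) is 0 when the subsample cannot avoid variant i
    return sum(1 - comb(total - c, n) / all_subsamples for c in variant_counts)


# Hypothetical clone-library tallies for one marker gene
counts = [40, 25, 10, 3, 1, 1]
curve = [(n, expected_richness(counts, n)) for n in (10, 20, 40, 80)]
```

A curve that flattens as n grows would suggest that the library has sampled most of the variants present; a curve that is still rising would argue for deeper sampling.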

Furthermore, we will estimate the overall diversity and nature of phylogenetic and functional variation in genomes of uncultured bacterioplankton co-existing with Prochlorococcus. This is to delineate the diversity of the total bacterial community – a task that has remained elusive yet is crucial for effective implementation of environmental genomics. We have recently discovered through elimination of a major artifact that bacterial diversity in the coastal ocean has likely been overestimated by at least an order of magnitude. We seek to extend this approach to the open ocean systems, and complement diversity estimation by capturing and assembling large genome fragments of important members of the bacterial community. This will provide estimates of the extent and nature of genome variation on a community level. We will also assess to what extent function is conserved in bacterial communities under different environmental regimes by development of ‘functional genotype multiplexing’ through an extension of ‘in situ amplification’ protocols developed by the Church lab. These will allow the simultaneous identification of phylogenetic identity and presence of functionally relevant genes in the genomes of uncultured prokaryotes.

For both tasks under goal 3a, we will use bioinformatics and evolutionary analysis to assess the nature of the diversification process. That is, we will: (1) survey genes representing different functional categories (informational, central metabolic, photosynthetic, catabolic, etc.) for their prevalence and sequence diversity; (2) distinguish purifying selection (maintenance of function) from function change (or loss) by comparisons of DNA versus protein divergence (synonymous vs. nonsynonymous sequence changes); (3) look for evidence of recombination and gene transfer through congruency of phylogenetic trees of genes, unusual codon usage, and local gene order; and (4) identify potential prophage inserted within the Prochlorococcus genomes to characterize the relationships between Prochlorococcus and prophage diversity.
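The comparison of synonymous and nonsynonymous changes in item (2) can be sketched in its simplest form as follows. This is a crude codon-by-codon tally under the standard genetic code, assuming two gap-free, in-frame, aligned coding sequences; it is not a normalized dN/dS estimate, and the function name and example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of differing codons.

    Both sequences must be aligned, in frame, gap-free and of equal length.
    A differing codon pair is synonymous if both codons encode the same amino acid.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a = seq_a[i:i + 3].upper()
        codon_b = seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# Hypothetical fragments: GCT -> GCC is silent (Ala), AAA -> AGA changes Lys to Arg
result = count_codon_differences("ATGGCTAAA", "ATGGCCAGA")
```

A strong excess of synonymous over nonsynonymous differences would be consistent with purifying selection; a formal analysis would normalize by the number of synonymous and nonsynonymous sites.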

Goal 3b

We will explore the functional connection between the dominant autotroph Prochlorococcus and co-existing heterotrophic bacteria. Our goal here is to determine the extent of specific cell-to-cell interaction in the well-mixed oceanic environments. We will determine whether specific carbon compounds known to be excreted by Prochlorococcus are taken up by specific heterotrophic bacterial populations, indicating selection for species networks, or whether carbon transfer is guided by chance encounters of individual cells. We will combine DNA microarray and radiotracer techniques in a novel application, the ‘functional diversity array’ (FDA), which will allow us to identify and link carbon sources and sinks within the microbial community. The FDA will be complemented by a new technique, which we term here ‘single cell activity multiplexing’. It combines in situ amplification from single cells in acrylamide matrix with quantification of uptake of radiotracers.

Field Dynamics

The seasonal dynamics of Prochlorococcus populations have been well documented in the N. Atlantic and Pacific (from the USJGOFS HOT and BATS Time series stations), and in the Gulf of Aqaba in the Red Sea, the three sites we have chosen for constructing the BAC libraries. These data-sets have information on the total Prochlorococcus "meta-population", i.e., all of the cells that are identified as Prochlorococcus based on their light scatter and fluorescence signature using flow cytometry. This includes all of the ecotypic diversity at a particular site, and thus describes the outer bounds of the collective niche of this group.

The dynamics of the meta-population are distinctly different at the three sites, thus providing us with different selection regimes for our field studies. At the HOT site in the Pacific there is very little seasonal change; the surface mixed layer never extends below the euphotic zone, thus nitrogen remains undetectable in the surface mixed layer throughout the year. Prochlorococcus are fairly uniformly distributed above and below the mixed layer year-round at this site. At the BATS site in the Atlantic, the water column is stratified in summer, with a 20 m mixed layer, but mixes down to about 200 m during the winter. Prochlorococcus abundance is low in the mixed layer in summer, and very high in the static sub-surface chlorophyll maximum layer at the base of the euphotic zone. In winter it is uniformly distributed throughout the mixed layer, in moderate abundances. In the Gulf of Aqaba of the Red Sea, the scenario is the most extreme. Here the deep waters are never cold enough to sustain strong stratification, thus in winter the water column mixes down to at least 600 m and the Prochlorococcus population is undetectable. As deep mixing subsides in April, the population re-emerges and by July there is a huge sub-surface maximum at about 100 m, with cell densities as high as 10⁶ cells ml⁻¹. This is accompanied by a smaller population in the shallow surface mixed layer.

Thus these three sites provide us with very different selective regimes for the Prochlorococcus meta-population. One extreme is the situation in the Red Sea where a very large population is built up from an extremely small founder population after winter mixing. This population is established below the mixed layer, where low light conditions are relatively stable, and persists until the onset of deep winter mixing. The other extreme is the surface mixed layer at the HOT site, which does not undergo much seasonal perturbation, but experiences short term light fluctuations in the mixed layer throughout the year. In the middle is the BATS site, where moderate seasonal forcing exists. With this in mind, we will strategically select depths and seasons for sampling among these three sites for the construction of our BAC libraries.

Ecotypic Diversity in Prochlorococcus

The Chisholm Lab has isolated 55 strains of Prochlorococcus into culture from diverse oceans. Phylogenies constructed using rDNA sequences from a subset of the collection reveal clades that cluster into ecotypes (Fig. 3C, below) according to their optimum and minimum light intensity for growth, and the range of their chl b/a ratios (Fig. 3A, 3B). The ecotypes differ at the 16S rDNA locus by about 2% (Fig. 3C) and the rDNA sequence variability among our cultured isolates can be directly related to that observed in the field. More refined analysis of phylogenetic relationships among isolates based on the 23S and ITS regions of the rDNA locus supports the distinction between the two types. High-light (HL) adapted isolates are closely related and cluster together in a shallow clade, while the low-light (LL) adapted isolates are more divergent. Ecotypes have also been shown to be distinct in terms of their optical properties, and the structure/composition of their photosynthetic apparatus as well as in their Cu tolerance and Co requirement.

We have also shown that the HL ecotypes can only use ammonia as a nitrogen source, while the LL ecotypes can utilize both ammonia and nitrite. In contrast to their close relative Synechococcus, none of the Prochlorococcus isolates can use nitrate. These physiological observations were confirmed by whole genome analyses: The HL strain MED4 lacks the genetic machinery to reduce nitrate or nitrite, whereas the LL strains do contain the genes for nitrite reduction (see below). Thus we can begin to connect genome diversity with niche diversity: The ecotypes that thrive at high light (i.e. surface waters) have lost the machinery to use oxidized forms of nitrogen, which is consistent with the predominance of regenerated ammonium in surface waters. In contrast, those that thrive in low light have retained the ability to use nitrite, which is usually relatively abundant at the base of the euphotic zone. Depth distributions of ecotypes in the field are consistent with the HL and LL designation.

There is no relationship between the phylogenetic affinities of the different cultured ecotypes (Fig. 1C) and their ocean of origin. This is consistent with the observation that a 2% 16S rDNA sequence difference between the ecotypes translates into separate evolutionary history of, very roughly, 100 million years, whereas the mean global circulation time of the oceans is on the order of thousands of years. That is, microbial distribution in the oceans is determined by ecology, not by geography, per se.
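The arithmetic behind this comparison can be made explicit with a short sketch. It assumes a rough, commonly cited calibration of about 1% 16S rRNA divergence per 50 million years, which is not stated in this proposal, and takes 1,000 years as a representative value for "thousands of years"; both numbers and the function name are illustrative assumptions only.

```python
def divergence_time_myr(percent_divergence, percent_per_50_myr=1.0):
    """Rough time since divergence, in millions of years.

    percent_per_50_myr is an ASSUMED 16S rRNA clock calibration
    (about 1% sequence divergence per 50 million years).
    """
    return percent_divergence / percent_per_50_myr * 50.0


ecotype_split_myr = divergence_time_myr(2.0)    # 2% difference between ecotypes
ocean_mixing_years = 1_000                      # assumed order-of-magnitude value
ratio = ecotype_split_myr * 1e6 / ocean_mixing_years
```

Under these assumptions the ecotypes have been separate for roughly five orders of magnitude longer than it takes the oceans to mix, which is why geographic isolation is an unlikely explanation for their divergence.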

Thus we hypothesize that multiple ecotypes of Prochlorococcus co-exist in all oceanic environments, alternating in dominance along spatial and temporal gradients. These ecotypes are descendants of a common ancestor yet have been shaped by evolutionary mechanisms that lead to diversification. Much of this diversification is gradual along clonal lineages, as evidenced by the rRNAs; however, major change can be introduced by gene loss and rearrangement, or lateral gene transfers, as evidenced by comparison of two Prochlorococcus genomes (see below). Ultimately, ecotypes arise that are genomic hybrids, consisting of families of genes whose co-occurrence has been selected for based on the probability of co-occurrence of particular environmental conditions in the oceans. One of the goals of this proposal is to begin to understand the full extent of this diversity (from gradual changes to major genome differences) and, ultimately, its relationship to the dynamics of the environment.

Comparative Genomics of two Prochlorococcus Ecotypes

The DOE’s Joint Genome Institute has sequenced the genomes of Prochlorococcus MED4 and MIT9313. MED4 belongs to the more recently evolved HL clade of Prochlorococcus, while MIT9313 belongs to the LL clade (Fig. 3a). Over the course of this differentiation there has been a dramatic reduction in genome size (Table 1). MED4 has the smallest genome of any known oxygenic phototroph, with 1.7 Mbp and approximately 1700 potential genes (Table 1). A comparison of the genomes of these two ecotypes reveals a common core of ca. 1300 genes, and a large group of genes, conserved in both genomes, are of unknown function. In addition, each genome contains a significant number (200-600) of genes that are (currently) unique (Table 3a), the majority of which (about 60%) are of unknown function. Alignment of the two genomes demonstrates that they are mosaics of blocks of genes with significant rearrangement (Fig. 3b), and closer inspection reveals that even between conserved blocks, insertion/deletion events have led to further differentiation (see below).

Concurrent with the reduction in genome size in MED4 is a dramatic reduction in %GC content, leading to different codon and amino acid usage patterns compared to MIT9313, and a reduced number of genes encoding regulatory proteins (Table 1). For example, the MED4 genome contains 4 histidine kinase motifs (Tolonen unpubl. data) in comparison to the 43 found in Synechocystis PCC6803, a related freshwater species. In fact, of all the genome sequences available on the Integrated Genetics Website, MED4 has the fewest histidine kinase motifs, implying that it has very few regulatory circuits and networks. Superficially, this might suggest that energy is not limiting in the high light environment, so that a small core of constitutively expressed biosynthetic pathways is emphasized over a broader set of regulated assimilatory pathways.
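For reference, the %GC statistic used in this comparison is simple to compute from sequence. The sketch below is illustrative only; the function name and the example string are hypothetical, and ambiguous bases such as N are excluded from the denominator by assumption.

```python
def gc_percent(sequence):
    """Percent G+C among unambiguous bases (A, C, G, T) of a DNA sequence."""
    seq = sequence.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * (seq.count("G") + seq.count("C")) / unambiguous


# Hypothetical fragment, not taken from either genome
example = gc_percent("ATGCATTTAANN")
```

Applied to whole genomes or to sliding windows along a contig, the same calculation can flag regions whose composition departs from the genome average.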

Ecotypic Differences at Selected Loci.

The comparative study of laboratory isolates of Prochlorococcus with regard to detailed features at selected loci (selected either for their universal function or their ecological relevance for this particular organism) has begun to yield some insights into the genetic basis of ecotypic diversity. We do not have room to review all that has been unveiled thus far, but comparisons of the photosynthetic apparatus of the two ecotypes can be found in two of our recently published review articles, one of which can be viewed at http://web.mit.edu/chisholm/www/prog.pdf.

One particular comparison is compelling with regard to the importance of deletion events in the evolution of ecotypes. As mentioned above, Prochlorococcus is unusual in that it cannot utilize nitrate as a nitrogen source, and only the LL ecotypes can utilize nitrite. The HL ecotypes are limited to ammonium and urea as their nitrogen sources, which is consistent with their predominance in surface waters where these regenerated forms of N dominate. Prochlorococcus’ close relative Synechococcus, however, can utilize all three forms of N.

Comparative genomics has revealed that this makes sense when you consider the evolutionary origins of these three ecotypes as well as the ecological niches they now occupy (Fig. 3c).

Serial deletions of segments of the N-metabolism regulon have resulted in the sequential loss of the nitrate and then the nitrite reductase genes as the LL Prochlorococcus ecotypes evolved from Synechococcus, and the HL ecotype evolved from its LL relative (Post et al., unpubl.). The net result is that the HL ecotype dominates high-light surface waters where ammonium is the dominant N source and the LL ecotype dominates deeper waters where light is scarce but nitrite is often abundant. Their close relative Synechococcus has a very broad niche with respect to N utilization, and thus is capable of bloom formation when NO₃⁻ upwells from the deep water. Synechococcus cannot, however, grow at the very low light intensities at which LL Prochlorococcus thrives. Thus these deletion events have played an important role in niche diversification among these ecotypes. It is likely that as we begin to compare other genes that differ among the ecotypes we will gain clues as to other environmental "drivers" for this diversification. Indeed, similar deletion events can be seen in the photosynthetic apparatus of .

Prochlorococcus cyanophage

Almost every Prochlorococcus isolate in our collection has shown susceptibility to lysis by naturally-occurring cyanophage (Sullivan, unpubl). Several phage have been cloned, and their host ranges have been found to vary considerably. Some phage are capable of infecting only a single host while others infect multiple hosts even spanning both ecotypes of Prochlorococcus and in some cases a second genus of marine cyanobacteria, Synechococcus.

In addition to lytic phage, prophage have recently been shown to exist in natural marine Synechococcus communities. Using bioinformatic approaches to detect possible prophage in our Prochlorococcus genomes, we have detected possible prophage present within the MED4 (~35+ kb in size) and MIT9313 (~20+ kb in size) genomes (Sullivan, unpubl). A key objective for future work will be to determine if these represent functional phage capable of being induced to a lytic stage, or are remnants of inactive phage. As we search for novel means of creating a working genetic system, a functional prophage might prove invaluable for future genetic manipulation in Prochlorococcus.

IMPORTANT NOTE:

Polz and Chisholm recently submitted an NSF Biocomplexity proposal (along with Hiroaki Shizuya and Gary Olsen) to do the BAC Library and fingerprinting work described herein for Prochlorococcus at the three study sites. That proposal did not include the analysis of the rest of the microbial community, or of the connectivity between it and Prochlorococcus, that we are proposing here. If the NSF proposal is funded, it would support the construction of a minimum of four BAC libraries from flow-sorted Prochlorococcus cells obtained from the Bermuda Atlantic Time Series Station (BATS), the Hawaii Ocean Time Series Station (HOT) and the Gulf of Aqaba in the Red Sea. One of these libraries would be fingerprinted to determine contigs of the co-existing environmental populations, while the others would serve as reference libraries. The NSF grant only includes funds for sequencing of selected genes of environmental relevance but not of whole genomes. Thus, we propose full sequencing of the fingerprinted library by the JGI under the auspices of this grant (see estimate for coverage and cost estimate). Should the NSF Biocomplexity proposal not be funded, we would ask the JGI to carry out both BAC library construction and sequencing of the environmental Prochlorococcus BAC library.

Measurement of Genomic Diversity of Natural Communities

Diversity is a central ecosystem parameter as a measure of co-existing, interacting and co-evolving genomes. Although we have, in principle, learnt how to measure bacterial community diversity via measurement of ribosomal RNA diversity, reliable estimates are still limited to simple environments. In fact, a recent review showed that no complex marine environment has been sampled sufficiently and so bacterial diversity remains an open question. Molecular diversity studies typically circumvent culture of organisms by directly collecting cells from the environment, extracting mixed DNA, and PCR amplification and cloning of variants of specific homologous genes. Ribosomal RNA (rRNA) genes are particularly useful because they allow universal phylogenetic differentiation of organisms, and the rRNAs themselves provide excellent targets for identification/quantification of populations via in situ or slot blot hybridization. Furthermore, because rRNA content is positively correlated with growth rate in many bacteria, quantification of specific rRNA in natural samples can give information about the relative activity of populations. However, we have recently discovered that although this lack of diversity estimates is in part rooted in technical difficulties, more importantly, methodological problems may lead to an explosive accumulation of sequence artifacts (Thompson et al. 2002). The discovery of this methodological problem has recently enabled us to estimate the total ribotype diversity in a coastal bacterioplankton community (see preliminary results).

This lack of data on ribotype diversity is compounded by an absence of information on genomic variation that may lead to functional variation within ribotypes (genomes with identical rRNA sequences) that co-occur in the environment. Thus, the functional unit represented by diversity measurements cannot currently be ascertained. Only a single study has analyzed an environmental BAC library constructed from a sample of coastal bacterioplankton. It detected two archaeal clones belonging to a single ribotype and several clones with closely related ribotypes. Analysis of genes that flank the rRNA operon revealed that homologous genes were present but that there was sequence variation in all clones. However, in the clones with identical ribotype the variation was limited to synonymous substitutions, indicating functional equivalence. While this suggests that there is indeed genomic variation within ribotypes, more extensive studies are obviously needed to improve our understanding of overall genome variation, especially as it relates to ribotype variation and the relationship of ribotype diversity to ecotype. In addition, sequence variation contains valuable information about the mechanisms and history of the forces that structure environmental populations and genomes.

Mechanisms of diversification and selection

In an ecological context, microbial diversity will ultimately be determined by the rates and mechanisms that generate genetic change and the degree to which such changes are removed through selection and drift. In the extreme, diversity could be manifest as a virtually immeasurable continuum of sequence and genome variants. Alternatively, and more likely, biotic and physical factors in the environment may regularly purge variation from natural communities, leading to discontinuous and limited genomic variation. Several mechanisms that may introduce change into bacterial genomes have been inferred from experimental studies and comparative sequencing. These include clonal diversification (accumulation of point mutations that are passed vertically along lineages), gene loss, intragenomic rearrangements, and horizontal mechanisms like recombination and lateral gene transfer. Additionally, insertion sequence (IS) elements, transposons and phages may play an important role in diversification of genomes. Of these, clonal diversification and recombination will introduce change into existing genes without altering overall genome structure while all other mechanisms will change gene order or content.

Though point mutation is the ultimate source of sequence change, other mechanisms acting in concert may considerably modify its effects. Aside from the generation of evolutionary novelty, clonal divergence may eventually lead to isolation of populations from recombination, a consequence that may be of equal or greater importance. Recombination rates have been shown to decrease exponentially with sequence divergence in Bacillus, Streptococcus and Escherichia. For example, in a study comparing nucleotide divergence in the rpoB gene to transformation frequency in Streptococcus, transformation became increasingly rare as gene sequences diverged and was no longer detectable at 27% difference. Such genetic isolation, reinforced and modified by ecological factors, such as geographic isolation, population effects and selection, may ultimately lead to the accumulation of functional differences. Thus, the degree to which sequence diversity is continuous or discontinuous within and among clonal populations may have considerable ecological significance.
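The exponential relationship described above can be written as a simple model, sketched here for illustration. The decay constant is a hypothetical parameter that would have to be fitted to transformation data for a given organism; the value used below is not a measured one, and the function names are our own.

```python
import math


def relative_recombination_rate(divergence, decay_constant):
    """Recombination rate relative to that between identical sequences.

    divergence: fractional sequence divergence (0.05 means 5%).
    decay_constant: HYPOTHETICAL fitted parameter, per unit fractional divergence.
    """
    return math.exp(-decay_constant * divergence)


def divergence_at_relative_rate(relative_rate, decay_constant):
    """Divergence at which the relative rate falls to the given value."""
    return -math.log(relative_rate) / decay_constant


k = 20.0  # assumed value, for illustration only
rates = {d: relative_recombination_rate(d, k) for d in (0.0, 0.05, 0.10, 0.20)}
```

Under such a model each fixed increment in divergence reduces the recombination rate by the same factor, so a detection limit in an experiment corresponds to a threshold divergence beyond which exchange is effectively unobservable.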

While once considered to lead primarily to homogenization on the population level, recombination can enhance genetic diversity when it occurs between clones in a structured population. In the classical sense, recombination allows the co-existence of polymorphism and so expands the potential niche of a species. However, due to its dependence on genetic similarity, it is difficult to predict recombination rates between populations in the absence of sequence information for co-occurring populations. Rate estimates have been obtained for E. coli isolates from the ECOR collection, suggesting that sequence divergence due to recombination is 50-fold higher than that due to mutation. In the extreme case, H. pylori, which occupies a niche in the absence of competitors, appears to be panmictic. Explicit tests for estimating recombination are now available but to date these have only been applied to pure-cultured isolates. One of our explicit goals is to apply such tests to genomes in naturally occurring communities.

Lateral gene transfer has also undoubtedly played an important role in bacterial evolution. Well known examples include the pathogenicity islands in several bacteria which can be traced to phylogenetically distinct groups. The rates of transfer in the environment are unknown, but may be enhanced if the genomes contain regions that are predisposed to accept foreign genes. For example, recently described transposons harbor integrons that target specific sites in the genome that can integrate and express open reading frames. They appear to be widespread, can be present in multiple copies in genomes, and have been found associated with resistance genes. However, although integron-mediated lateral gene transfer may be one of the major factors that introduce variation into bacterial genomes, at the current state of environmental genomics, its effect may be difficult to estimate as it acts in narrowly circumscribed islands within the genome.

Genome rearrangements and gene loss may also have a significant effect on structuring genomes; however, the importance of these processes in the environment is unknown at present. It is likely that most such events are detrimental in genomes that have long co-evolved with their environment and so may rarely be detected among closely related bacteria in natural environments. Nonetheless, such events may be more favored under conditions of rapid environmental change, such as transfer to a culture medium, and so may be more frequently represented in existing databases (which are dominated by cultivated organisms) than in naturally occurring genomes. For example, even genome disruption by IS elements may be relatively rare in environmental populations, as suggested by a recent comparison of Yersinia pestis strains which showed identical IS element numbers and locations in all strains of the biovar responsible for the plague pandemic in modern times even though these IS elements integrate at random locations into the genome. Ultimately, however, assessing the impact on naturally occurring genomes of gene rearrangements, gene loss, IS elements and more targeted insertions such as integrons will likely have to await environmental genomics approaches capable of examining large numbers of whole genomes or large contiguous genome fragments. We believe that the approach proposed here will allow exploration of several significant features of genome diversity and inference of mechanisms of genome evolution under different environmental regimes.

Diversity and Culturability of Bacterioplankton

As outlined above, the majority of bacteria in the environment have remained uncultured. This also applies to bacterioplankton species. This has largely been determined by comparison of results from isolation attempts, direct counts of cells, and, during the last decade, molecular approaches. For marine bacterioplankton communities, culture-independent approaches have led to several important generalizations. First and foremost, it is believed that culture approaches, which isolate bacteria on media with high substrate concentration, have led to isolates that poorly represent the dominant rDNA sequences recovered. Thus, it has become customary to classify marine bacteria into culturable and unculturable. Only alpha-Proteobacteria of the Roseobacter clade, Cytophaga/Flavobacterium representatives and cyanobacteria generally are both recovered at high frequency in culture collections and in clone libraries. Other common isolates, particularly some gamma-Proteobacteria genera (e.g., Vibrio), grow on marine agars but occur infrequently in clone libraries. Among the groups that have evaded cultivation to date are the SAR11, SAR116 and SAR86 clusters and the Actinobacteria. These are frequently dominant in clone libraries and appear to be cosmopolitan judging from their occurrence in clone libraries from a variety of habitats.

Despite great progress in understanding of bacterioplankton diversity, major questions remain. First, we still do not have good estimates of total diversity in bacterioplankton communities. Second, the dynamics of plankton communities have only infrequently been addressed using clone libraries. Both are problems of insufficient sampling of clone libraries; however, this can now be addressed by using equipment increasingly available through genome centers. Third, the ecological role of the uncultured bacterial phylotypes is unknown; however, as detailed below, exciting new approaches will allow significant progress.

New Approaches to determine structure-function relationships

An exciting recent extension of molecular approaches is the simultaneous determination of structure (phylogeny) and function (metabolism) of microbial populations. Environmental samples are amended with isotopically labeled substrate, which is metabolized by the community. In one set of methods, active populations are identified by incorporation of ¹³C from the added substrate into biomass and subsequent detection of population-specific tracer molecules such as DNA or polar lipid derived fatty acids. A second method combines in situ hybridization by phylogenetic oligonucleotide probes with ¹⁴C-based autoradiography, allowing simultaneous determination of activity and identity.

We are currently developing a conceptually similar but more broadly applicable approach, the Functional Diversity Array (FDA). This combines isotopic labeling of active populations with measurement of population diversity using DNA microarrays.

Overview Goals 3a and b

We propose to analyze the microbial community from three oceanic environments with disturbance regimes that vary over different time-scales (daily, months, seasonal). We will focus on Prochlorococcus, which is the dominant primary producer in these environments, and its functional connection to the bacterial community. We will explore the nature of genomic variation and modes of diversification within the single environmental niche occupied by Prochlorococcus. We will further determine the extent and nature of variation of bacterial ribotypes co-existing with Prochlorococcus under the different environmental regimes. Finally, we will explore connectivity between Prochlorococcus and heterotrophic bacterial populations by determining the patterns of carbon transfer between this dominant primary producer and co-existing heterotrophs.

We have chosen the specific environmental sites below to maximize differences in selective regimes both with regard to seasonal disturbance, and short-term mixing dynamics (see background section):

(1) HOT – Summer surface mixed layer: A population which has been isolated in the mixed layer for most of the year, experiencing fluctuating high light/low nutrient environment (minimum disturbance with short term fluctuations)

(2) HOT – Summer, below the mixed layer: A population that has been isolated from the mixed layer for most of the year and experiencing relatively constant low light/low nutrient environment (minimum disturbance – long-term stability)

(3) BATS – Summer deep chlorophyll maximum layer: A population that has been isolated from the mixed layer for several months and experiences a relatively constant low light environment, and relatively higher nutrients (intermediate disturbance – short term stability)

(4) Red Sea – Summer deep chlorophyll maximum layer: A population experiences relatively constant low light and exists only June – Sept, before it is essentially eliminated by deep winter mixing (maximum disturbance – short term stability)

One of these libraries (to be determined from the analysis of gene diversity described below) will serve as our ‘reference library’ and will be assembled into contigs by fingerprinting (see note on matching funds from NSF). This library will also be targeted for potential full sequencing under the auspices of this proposal. Many of the questions posed will be addressed using this reference library, and this will represent phase I of our work. In phase II, we will move into the comparative stage where we compare loci and genes in the other BAC libraries.

Specific Questions
What type and extent of genomic variation exists in co-occurring Prochlorococcus populations?

We will determine what common genomic backbone and superimposed variation exists in the genomes of co-occurring Prochlorococcus. We will initially approach this by fingerprinting the entire BAC library from the environmental location we have found to display the highest number of sequence variants in the diversity screening. Depending on the genomic variation encountered in the sample, the fingerprinting will provide us either with completely assembled genomes or with large contiguous portions of the genomes (at a minimum the average size of a BAC clone). We will completely sequence large regions of the genomes (or contigs) anchored by informational genes and pathways identified largely from the two sequenced ecotypes. This will provide us with a rich comparative dataset and will form the foundation for comparative analysis of the BAC libraries from the different environments.

What are the major modes of diversification of these Prochlorococcus populations?

We will analyze the gene sequences and genome architecture we encounter in the completely fingerprinted and in the partially characterized BAC libraries for quantitative and qualitative information on mechanisms that drive the evolution of these genomes. We will identify contigs containing rRNA operons and target these for complete sequencing. We will group the contigs by rRNA similarity and analyze the sequences for quantitative evidence of importance of (point) mutation vs. recombination and qualitative evidence for differences in overall architecture. The first will be done by identification of at least 6 orthologous genes that are 10s of kbp apart on the contigs and comparison of their DNA and protein sequence divergence and congruence of phylogeny. The second will be approached by contrasting the contigs for differences in gene arrangement, duplication, gain and loss. Beyond presence and absence of genes (or, more generally, genome regions), relative divergence of genes, synonymous vs. nonsynonymous changes, strength of codon bias, and unusual ("alien") codon usage will also be examined. We will strive to include genes with demonstrated differences in expression level and those with markedly different numbers of interactions within the cell.

What is the genomic diversity in key genes and pathways that are under different selection regimes?

We will determine to what extent the differences in environmental disturbance regimes translate into diversity at the genome level. An important question is whether the two extremes, high stability (HOT) and population crashes (Red Sea), lead to reduced diversity as opposed to the intermediate disturbance regime (BATS). We will address this by comparing evidence of overall diversity in the marker genes obtained by PCR and in specific genes and pathways from the BAC libraries. Initially, we will concentrate on genes that have already been shown to be important in determining ecological success of Prochlorococcus (see Background), but important additional genes are likely to be identified through the ongoing development of Prochlorococcus DNA microarrays in the Chisholm lab. We will identify BAC clones carrying target genes by hybridization with gene probes constructed by PCR and determine sequence diversity. The comparison of genes under strong (e.g., transporters, light-harvesting apparatus, N and P uptake) and weak (e.g., informational, central metabolism and housekeeping genes) environmental selection will help identify key differences.

How closely do cultured Prochlorococcus isolates resemble environmental genomes, and what types are most readily isolated from the environment?

Prochlorococcus is one of the few ecologically dominant microbes for which an extensive culture collection exists. Thus, we will determine in an exemplary fashion how well the diversity among the cultured strains represents the environmental diversity. This will be accomplished by comparing sequence diversity in some of the same key loci used for the above two questions. In cases where genes can be associated with specific isolates, or at least linked with a common organism through the BAC assemblies, phylogenetic trees will be constructed and compared for consistency among genes.

How many bacterial ribotypes co-exist under the different environmental selection regimes and what is the nature of their genomic variation?

We will compare sequence diversity in 16S and 23S rDNA clone libraries obtained from the three different environmental selection regimes. These genes are the standard in diversity work, and estimates of their total diversity and overlap in distribution are needed as a first step for future environmental genomics applications. We have recently shown that rarefaction of such libraries is possible by a combination of high-throughput technology, new statistical methods, and modification of existing PCR amplification schemes to avoid generation of artifactual sequence diversity (see background). Furthermore, we will adapt the ‘in situ amplification and sequencing’ technology developed by the Church lab for rapid determination of overall ribotype diversity in the environment.

What is the relationship between structure (phylogeny) and function in the bacterial communities from the different environmental regimes?

We will expand this question from the detailed exploration of diversity within the Prochlorococcus populations to the co-existing uncultured bacterial community by two new approaches. First, we will use our newly developed ‘capture and walk’ technique that allows us to use oligonucleotide probes specific for a ribotype to pull large genome fragments (up to 20 kb) from the environment. These can be cloned and sequenced, and probes complementary to their ends can be designed for capture of contiguous fragments. Thus, clone libraries that are samples of the co-existing diversity within identical and similar ribotypes can be assembled and the diversity of associated genes explored. Second, we will adapt the ‘in situ amplification’ technology to a functional multiplexing in which co-localization of specific structural and phylogenetic marker genes can be identified in a high throughput manner. We will concentrate on uncultured bacterial ribotypes found to be either numerically dominant or to be an important link in carbon transfer from Prochlorococcus to the bacterial community (see below).

What are the patterns of functional connections between the dominating autotroph Prochlorococcus and the heterotrophic bacterial community?

We will explore to what extent carbon compounds excreted by Prochlorococcus structure the heterotrophic bacterial community by application of our ‘functional diversity array’ (FDA). This allows simultaneous identification of microbial ribotypes and determination of growth on specific carbon substrates. The rRNA clone libraries from the different environments will be used as templates for construction of ribotype-specific oligonucleotide probe arrays. These arrays will be hybridized against total rRNA from samples incubated with ¹⁴C-labeled carbon substrates, which were collected from Prochlorococcus or identified as important exudates. Populations that actively metabolize these substrates can be identified by the radiolabel accumulated in their rRNA, allowing qualitative assessment of whether carbon transfer routes are dictated by chance encounters between heterotrophic and autotrophic populations or whether specific associations may have (co)-evolved over time.

What are the relationships between Prochlorococcus and prophage diversity?

We will use bioinformatics approaches to identify candidate prophage in the BAC library clones of Prochlorococcus. Through the work of other laboratories (Rohwer, pers. comm.), signature genes are beginning to emerge that allow for the phylogenetic analysis of phage types based upon the sequence analysis of one or a few conserved genes, just as has been done for microbial biota using 16S ribosomal DNA. Building upon this work, we have the opportunity to compare the phylogeny of the host and prophage detected within different Prochlorococcus clones to determine the relative importance of vertical or horizontal transmission of phage within the Prochlorococcus community.

What are the relative abundances of Prochlorococcus prophage in natural communities?

Estimating the abundance of prophage in a natural community has traditionally been difficult due to the dependence upon culture-based techniques selecting at two levels (the culturability of the host and the culturability of the phage) and due to the unknown selection of an appropriate inducing agent to target "all prophage." Through statistical analysis of our BAC clone libraries, we will have the unique opportunity to be able to approximate the abundance of prophage within the Prochlorococcus community using culture-independent techniques.

Do prophage confer host cell fitness advantages and drive niche diversification of Prochlorococcus?

We know from other phage-host systems that prophage often encode virulence factors and / or novel genes that allow significant fitness advantages of a lysogenic (prophage-containing) cell over non-prophage containing cells. Detailed characterization of prophage from our BAC libraries will allow for the identification of genes encoding such factors that might drive the physiological diversification of Prochlorococcus ecotypes in oceanic systems.

Progress to Date

Diversity of 23S rDNA in the Plum Island Sound.

We have constructed and screened a large-clone library from a coastal environment by the methods outlined in the experimental approach. Using our recently developed protocol to reduce PCR-generated sequence artifacts (see above), we have found surprisingly low ribotype diversity in this environment. Although the screening is still in progress, we currently estimate about 277 ribotypes to be present in the library (Fig. 1). This allows us to put a first lower boundary on total gene content for this community. Genome size of free-living bacteria ranges from 0.98 to 9.4 Mbp. Taking E. coli as our model with 4.6 Mbp and roughly 4,400 genes, we can estimate a minimum total environmental genome of 277 x 4.6 = 1,274 Mbp and 277 x 4,400 = 1,218,800 genes (some so close as to be allelic, others distant homologs or non-homologs). In comparison, the human genome, which has been sequenced, is 3,000 Mbp but is thought to have 30,000 or more genes. This suggests that environmental communities may be accessible by genomics. However, a critical but unexplored variable in this calculation is the degree of within-ribotype diversity of co-existing bacteria. It will be essential to estimate within-ribotype diversity to arrive at reasonable estimates of total diversity.
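The lower-bound arithmetic above can be reproduced with a minimal sketch. It assumes, as the text does, one model-sized genome per ribotype; the function name is ours and purely illustrative, and the loop over the quoted genome size range is added only to show how sensitive the bound is to the choice of model genome.

```python
# Lower-bound estimate of community genome content from the figures quoted above.
RIBOTYPES = 277            # ribotypes currently estimated in the library
GENOME_MBP = 4.6           # E. coli genome size used as the model (Mbp)
GENES_PER_GENOME = 4400    # approximate E. coli gene count


def community_lower_bound(ribotypes, genome_mbp, genes_per_genome):
    """Return (total Mbp, total genes), assuming one model-sized genome per ribotype."""
    return ribotypes * genome_mbp, ribotypes * genes_per_genome


total_mbp, total_genes = community_lower_bound(RIBOTYPES, GENOME_MBP, GENES_PER_GENOME)
print(f"{total_mbp:.0f} Mbp, {total_genes:,} genes")

# Sensitivity to the assumed genome size, across the range quoted for free-living bacteria.
for size_mbp in (0.98, 4.6, 9.4):
    mbp, _ = community_lower_bound(RIBOTYPES, size_mbp, GENES_PER_GENOME)
    print(f"model genome {size_mbp} Mbp -> {mbp:.0f} Mbp total")
```

The estimate scales linearly with within-ribotype diversity, which is why that quantity is the critical unknown.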

Capture of large genome fragments from the environment.

We have developed a protocol that will be used to capture large (>20 kb) genome fragments from environmental DNA. The protocol was first optimized using Vibrio cholerae DNA that was completely digested with SmaI, resulting in a fragment of 6.1 kbp containing the rRNA operon. Fragment capture with a specific, 23S rDNA-targeted 70-mer oligonucleotide showed good recovery, with 62 ng of specifically enriched DNA. This fragment was then cloned by the methods described below. Subsequently, we were able to recover similar amounts of a ~20 kb genomic fragment when V. cholerae DNA was spiked into DNA extracted from a natural community at 10, 1 and 0.1%. We anticipate being able to recover much larger genome fragments using partial digests of environmental DNA that has been size fractionated on pulsed field gels. As detailed below, we will ultimately use this method to obtain large fragments of DNA from uncultured organisms with unknown genome composition.

Proposed Approaches

Field Sampling

Chisholm already has an ongoing NSF project at the HOT and BATS stations (see Prior Support section), thus obtaining the samples from there will not be a problem. We also have an ongoing collaboration with Dr. Anton Post, a cyanobacterial expert at the Interuniversity Institute of Eilat, who has regular cruises on the Gulf of Aqaba (see letter of collaboration), thus facilitating our sampling there.

Sample Preparation

Cell collection and concentration. We will need a minimum of 2 x 10⁹ Prochlorococcus cells for each BAC library (~ 20 liters of water); however, to ensure sufficient coverage, we will concentrate cells from 100 liters. Samples will be pre-filtered (1 µm pore size) to reduce the concentration of larger, eukaryotic cells. The remaining cells will be concentrated by tangential flow filtration and pelleted by centrifugation as described by Béjà et al. The cell pellet will be frozen in liquid nitrogen.

Cell sorting. Prochlorococcus cells will be sorted from other phytoplankton and heterotrophic bacteria using the MIT flow cytometry facility, which is equipped with several MoFlo flow cytometers (Cytomation). As we have shown many times in our past work, Prochlorococcus has a unique flow cytometric signature that distinguishes it from other phytoplankton and heterotrophic bacteria, and we have sorted them from field populations for other molecular studies. The MoFlo instrument has high-speed sorting capability and can sort up to 30,000 cells per second, which means we could get the requisite 2 x 10⁹ cells in a 24-hour period.
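The sorting-time claim can be checked with a short calculation. This is a sketch only: it assumes the instrument sustains its maximum quoted rate continuously, and the `efficiency` parameter is a hypothetical knob we introduce to represent real-world losses (aborted sorts, downtime), not a measured value.

```python
SORT_RATE_CELLS_PER_S = 30_000      # maximum sort rate quoted for the MoFlo
SECONDS_PER_DAY = 24 * 60 * 60


def hours_to_sort(target_cells, rate_per_s=SORT_RATE_CELLS_PER_S, efficiency=1.0):
    """Hours of continuous sorting needed to collect target_cells.

    efficiency is a hypothetical fraction (0-1] of the nominal rate actually achieved.
    """
    return target_cells / (rate_per_s * efficiency) / 3600


cells_per_day = SORT_RATE_CELLS_PER_S * SECONDS_PER_DAY
print(f"cells sorted in 24 h at the maximum rate: {cells_per_day:,}")
for target in (1e9, 2e9):
    print(f"{target:.0e} cells: {hours_to_sort(target):.1f} h")
```

At the maximum rate, 2 x 10⁹ cells fit within 24 hours only if the sustained rate stays above roughly 77% of nominal, so the margin is modest.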

If we stain the DNA of the community with a fluorescent stain like Hoechst, we will be able to cleanly sort the Prochlorococcus away from all of the heterotrophic bacteria. This would be the ideal approach, and we will use it if we can show that the stain will not interfere with the remainder of the analysis, or that we can remove the stain before the analysis without disrupting the DNA. If this approach fails, we can still greatly enrich the Prochlorococcus cells relative to the heterotrophs through sorting, and the "contaminating" heterotrophs should be easily identified in our libraries. Since statistically they will be the dominant heterotrophs in the sample, some exploration of their genomic identity could be quite interesting and we will treat this as an ancillary part of the work.

DNA extraction. Nucleic acids for diversity estimation by PCR amplification and cloning will be extracted using bead beating, which yields DNA from difficult-to-lyse cells including Bacillus spores. Although cultured Prochlorococcus cells easily lyse quantitatively, the bead-beating will serve as a reference for the more gentle nucleic acid extraction method used for BAC library construction. High molecular weight DNA for BAC construction will be extracted as described by Stein et al. Cells will be embedded in agarose in syringes and lysed by extrusion of the mixture into lysozyme and detergent containing buffer. DNA will be retrieved by enzymatic digestion of the agarose and will be subjected to shearing (see below).

Diversity estimation

Outline. We will estimate the number of co-existing bacterial ribotypes, and, as a preparation for Prochlorococcus BAC construction, the number of Prochlorococcus genomes in our samples, by determination of the sequence diversity in several genes and genetic elements. This will allow us to decide the number of clones needed in the Prochlorococcus BAC library for the desired 15 to 20 x coverage of co-existing genomes and will provide us with suitable molecular markers for identification/quantification of specific genotypes in environmental samples or culture collections. We will target genes that accumulate sequence change at different rates but are limited to genes for which good PCR primers are available. For the total community, 16S and 23S rRNA genes will be used, and for Prochlorococcus the internal transcribed spacer (ITS), the RNA polymerase and the recA genes will also be assayed. Diversity of each gene will be estimated from rarefaction of sequence diversity in PCR-generated clone libraries. We have previously done this for the bacterial community using the 23S rRNA genes (see above), and for Prochlorococcus using the ITS, which is single copy in Prochlorococcus, and have found 20 co-existing sequence variants. Since the ITS is considered hypervariable, we expect this approach to be possible for all genes.
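To illustrate what rarefaction of a clone library involves, the sketch below computes the expected number of sequence types in a random subsample of n clones using the standard analytical (Hurlbert) formula. The clone counts are hypothetical and chosen only for illustration; the statistical methods actually used in our work may differ.

```python
from math import comb


def rarefied_richness(counts, n):
    """Expected number of sequence types in a random subsample of n clones.

    counts: number of clones observed for each sequence type.
    Uses Hurlbert's formula: sum over types of 1 - C(N - N_i, n) / C(N, n).
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must lie between 0 and the total number of clones")
    denom = comb(total, n)
    return sum(1 - comb(total - c, n) / denom for c in counts)


# Hypothetical clone library: 7 sequence types, 70 clones in total.
example_counts = [30, 20, 10, 5, 3, 1, 1]
for n in (10, 35, 70):
    print(n, round(rarefied_richness(example_counts, n), 2))
```

A curve that flattens well before n reaches the library size indicates that further screening is unlikely to reveal many additional sequence types.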

PCR amplification and cloning. All PCR amplification protocols will take into account recent insights into the generation of PCR artifacts, including formation of heteroduplex molecules, which we have recently found to be a potential major source of artificial sequence diversity. Thus, at least 10 replicates will be amplified for only 15 cycles to minimize skewing of the distribution of sequence types and accumulation of mutations and chimeric molecules. Reactions will then be diluted 1:10 into fresh reagents and amplified for 3 cycles to remove heteroduplex molecules, followed by pooling and cloning. We can measure the ratio of the different amplification products in the PCR and can extrapolate to the gene templates by estimating amplification kinetics using our Constant Denaturant Capillary Electrophoresis (CDCE) apparatus. This provides important information for calculating the necessary coverage of the different libraries.

PCR primers. For 16S rDNA, the standard Bacteria-specific primers 27F and 1492R, including recently published modifications, will be used. For 23S rDNA, our recently re-designed Bacteria-specific primers will be used. These are perfectly matched to all Bacteria 23S rDNA sequences in the Ribosomal Database Project (RDP) and amplified a set of 40 phylogenetically representative bacterial strains (Klepac and Polz, unpublished). For ITS, primers anchored in 16S and 23S rDNA will be used. For recA amplification, primers described by Eisen will be used. The gene for DNA-dependent RNA polymerase will be amplified as described by Palenik.

Diversity estimation by in situ amplification (polony formation). As a longer-term technology development project, we will adapt the new polymerase colony (polony) method of PCR amplification in thin polyacrylamide gels with one covalently immobilized primer (Mitra & Church, 1999*, see attached) for rapid diversity estimation of bacterial ribotypes. DNA extracted from environmental samples will be deposited at appropriate dilutions on glass microscope slides and amplified in situ. The resultant PCR colonies (polonies) will be hybridized or sequenced in situ for sequence identification (Mitra et al. 2002*, see attached). This would allow the simultaneous sequencing without prior cloning of thousands of polonies on the slides.

Library and polony screening and diversity estimation. All libraries will be screened by automated sequencing of clones with a single primer (RevPrep Orbit (GeneMachines) and 3700 sequencer). A complete sequence will be obtained for several representative clones in each sequence type. In all cases, the success of the sampling process will be monitored by rarefaction analysis and the total number of sequence types in a sample will be determined by the Chao-1 estimator. Confidence intervals for the Chao-1 estimator will be calculated as described by Hughes et al.
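As a minimal sketch of the richness estimate named above, the following computes Chao-1 from per-type clone counts. We use the bias-corrected form, S_obs + F1(F1 - 1) / (2(F2 + 1)), because it stays defined when no doubletons are observed; the classic form is S_obs + F1² / (2 F2). The clone counts are hypothetical, and confidence intervals are not included here.

```python
from collections import Counter


def chao1(counts):
    """Bias-corrected Chao-1 estimate of the total number of sequence types.

    counts: number of clones observed for each sequence type.
    F1 and F2 are the numbers of types seen exactly once and exactly twice.
    """
    observed = [c for c in counts if c > 0]
    freq = Counter(observed)
    f1, f2 = freq[1], freq[2]
    return len(observed) + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical library: 11 observed types, 4 singletons, 2 doubletons.
example_counts = [30, 20, 10, 5, 3, 2, 2, 1, 1, 1, 1]
print(chao1(example_counts))
```

The estimate depends only on the rare classes, so it is most informative once screening has proceeded far enough that singletons no longer dominate the library.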

Phylogenetically ordered large genome fragment libraries.

Outline. We will capture large genome fragments from bacterial ribotypes to estimate within ribotype diversity and to assay genome structures of important uncultured members of oceanic communities (e.g., members of the "SAR" (Sargasso) cluster, which are dominant bacteria in all oceanic environments) or important sinks of carbon originating from Prochlorococcus (see below, functional diversity array). For this purpose, a 70-mer probe complementary to a highly variable region within the 23S rDNA of each selected ribotype will be constructed. For each ribotype, the captured DNA will be cloned and thus a set of phylogenetically ordered libraries generated. The inserts in each library will vary in size since the environmental DNA is incompletely digested and enriched for size above a 10 kbp cutoff. Furthermore, the library may contain a background of ribotypes that were captured non-specifically by the 70-mer probes. Thus, the initial characterization of the libraries will involve a four-step analysis protocol, which allows exclusion of non-desired clones. First, inserts will be sized by pulsed field gel electrophoresis. Second, inserts above 10 kbp will be screened by RFLP using hexameric restriction enzymes and ordered by similarity. The following groups of cloned inserts are expected: (1) same pattern, same size, (2) similar pattern, different size, and (3) different pattern regardless of size. Third, ribotype identity will be confirmed by sequencing of the 23S and 16S rDNA in the same set of clones and only identical ribotypes will be further analyzed. Fourth, the sequence in the flanking region (gene) downstream of the 23S rDNA will be determined in all clones containing identical 16S ribotypes. Two groups of clones are expected that contain (1) homologous flanking genes and (2) non-homologous genes. Our subsequent analysis will concentrate primarily on the first group since these clones stem identifiably from orthologous rRNA operons (see below). 
Preference will be given to clones with complete ribosomal operons (and complete operons will be sequenced).

Probe construction. Specific 70-mer oligonucleotides will be constructed based on alignments of 23S rDNA sequences recovered in our PCR-generated clone libraries using the GCG (Genetics Computer Group, www.accelrys.com) sequence editor. We have previously determined that tethered 70-mer oligonucleotides have very uniform dissociation behavior almost independent of the sequence (rRNAs have a limited range of GC-content) (Marcelino et al., unpublished). Thus, optimization of hybridization temperatures and conditions is not needed. For genome walking by capture, we will use PCR-amplified sequence stretches from the ends of the initially captured fragments. These should hybridize to and capture homologous genes.

Capture of genomic fragments. Oligonucleotides are tethered to a linker oligonucleotide, which is biotinylated. Hybridization is carried out in solution and the hybridization product subsequently captured using streptavidin coated magnetic beads. The efficiency of this process is demonstrated in the preliminary results section.

Cloning. The captured single-stranded fragments are cloned by attachment of linker oligonucleotides and subsequent partial second strand synthesis using Klenow. This enables either blunt end cloning or forced cloning via restriction sites introduced in the linker oligonucleotides. The plasmids containing the insert are then transformed into E. coli host cells, where the second strand will be fully synthesized. The efficiency of the process has been demonstrated (see preliminary results). To date, we have chosen the pBluescript II SK (+/-) phagemid (Stratagene), which can carry inserts of up to 15 kbp, but the process can be adapted for other plasmids, including BACs.

Sequencing and analysis. We will aim for the initial sampling of captured genomic fragments of 10 unique ribotypes. Preference will be given to captured fragments that contain near complete rRNA operons. Within-ribotype diversity and diversity among closely related ribotypes (<5% 16S rRNA sequence divergence) will be examined by sequence comparison of genome regions flanking strictly homologous (orthologous) rRNA operons. These will be identified by the presence of at least one homologous flanking gene. By primer walking, we will sequence the flanking regions of a set of representative fragments. Contigs will be assembled and sequences will be edited and aligned using Sequencher (Gene Codes, Ann Arbor, MI) and GCG v.10 (Accelrys), and open reading frames will be identified. BLAST similarity searches will be conducted to identify and characterize homologs. Sequence divergence in flanking genes will be estimated for synonymous (K_S) and nonsynonymous (K_A) sites using DIVERGE (GCG) as estimators of selection. Furthermore, we will look for evidence of recombination, gene loss or rearrangement (as described below).
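To make the distinction between synonymous and nonsynonymous change concrete, the sketch below classifies codon differences between two aligned, in-frame coding sequences using the standard genetic code. It is a deliberately simplified raw count, not the K_S and K_A estimators computed by DIVERGE: it does not normalize by the number of synonymous and nonsynonymous sites or correct for multiple substitutions, and codon pairs differing at more than one position are only tallied, not resolved. The function name and example sequences are ours.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def count_substitutions(seq_a, seq_b):
    """Return (synonymous, nonsynonymous, multi_hit) codon differences.

    seq_a and seq_b must be aligned, gap-free, in-frame and of equal length.
    Codons with ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3:
        raise ValueError("sequences must be equal length and a multiple of 3")
    syn = nonsyn = multi = 0
    for i in range(0, len(seq_a), 3):
        ca, cb = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        if ca == cb or ca not in CODON_TABLE or cb not in CODON_TABLE:
            continue
        if sum(x != y for x, y in zip(ca, cb)) > 1:
            multi += 1                       # pathway ambiguous; not classified
        elif CODON_TABLE[ca] == CODON_TABLE[cb]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn, multi


# Hypothetical pair: GCT->GCC is silent (Ala), AAA->AGA changes Lys to Arg.
print(count_substitutions("ATGGCTAAA", "ATGGCCAGA"))
```

An excess of synonymous over nonsynonymous differences in a flanking gene is the pattern expected under purifying selection.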

Functional genotype multiplexing

We will develop functional genotype multiplexing, a new technique based on in situ amplification methods (Mitra and Church, 1999*, Mitra et al. 2002* see attached), that allows the rapid simultaneous phylogenetic identification of organisms and determination of the presence of specific functional genes. This will be done by embedding in thin acrylamide gels bacterial cells that have been made permeable by enzymatic digestion and short fixation, and by subsequent in situ amplification in the gels. Embedding in the gels allows high multiplicities of cells and probes to be inspected without formation of crossover products or heteroduplexes during the PCR, since each cell's amplification is compartmentalized away from other cells. One previous source of variability is PCR without prelysis. Because some environmental bacteria are hard to lyse, enzymatic digestion is used to make them permeable; within the immobilization polymer, the DNA of each cell remains separate. The polony format seems to compensate for differences in amplification in another way: amplicons that start early or amplify faster switch from exponential to slower (cube-law) growth, while the slower ones stay exponential. This is expected due to saturation of the core of the polony. Several primer pairs can be included in the amplification mixture embedded in the gel, some of which can be ribotype or phylogenetic group specific. Alternatively, rRNAs can be amplified with universal primers and different bacterial phylogenetic groups identified by hybridization. Colocalized genes are identified by labeling primers with different fluors and subsequently detecting specific color mixtures.
Our current experiments on FACS sorted mammalian T-cells are expected to be considerably more challenging than the microbial cells and will allow us to work out methods to get acceptably low "false-negative" rates due to limitations on single molecule PCR, which we have already pushed to over 80% efficiency per molecule.

Prochlorococcus BAC Libraries

Outline. BAC libraries will be constructed for the samples described in the section above and the size of the BAC library will be determined by the diversity estimates. We currently estimate based on ITS diversity that ~20 ecotypes co-exist. This would entail a library size of about 20,000 clones for 15 to 20 x coverage (see Library construction below). At least one of the BAC libraries will be fingerprinted to assemble large contiguous genome fragments of the co-existing Prochlorococcus (see sequence analysis). However, we will strive to include a second library if the encountered diversity and costs allow.

Vector and host: We will use HS996, which is based on DH10B, the most commonly employed host for BAC libraries, and carries an additional phage T1 resistance mutation. The resistance mutation prevents destruction of the libraries through host lysis by possible phage T1 contamination in laboratories. We will employ pIndigoBAC536 as a cloning vector. It carries four unique restriction sites in the cloning region: HindIII, BamHI, EcoRI and Eco72I. The Eco72I site will be used for blunt end ligation of sheared DNA sources.

Preparation of sheared DNA: BAC (fosmid) cloning from sheared DNA requires careful preparation of fragments with properly terminated blunt ends, which must carry a 3’-OH and a 5’-phosphate. Over the last two years, we have established a highly efficient and reproducible method for obtaining sheared DNA fragments of 30 to 100 kbp. The agarose DNA plug (100 to 200 µl in volume) is melted in an Eppendorf centrifuge tube by heating at 65 °C for 15 min, and digested by agarase to release the DNA. The DNA is sheared by vortexing at maximum speed for 60 sec and by repeated passage through a 26 gauge needle. The resultant sheared DNA accumulates sharply at an apparent size of 50 kbp. Although the structure of the ends of these sheared DNA fragments is unpredictable, we found that a simple fill-in reaction by T4 DNA polymerase followed by phosphorylation by T4 polynucleotide kinase creates ends in the proper configuration for blunt end ligation.

Library construction: We will make four libraries with a currently estimated size of 20,000 clones (average insert size 36 kb). This estimate is based on a Prochlorococcus genome size of 2 Mbp and 20 ecotypes in the flow sorted fractions to yield 15 to 20 x coverage for a complete physical map of each genotype. However, the exact coverage will take into account the ratios of gene targets, estimated by CDCE-PCR analysis in the diversity screening, as a proxy for the distribution of the different genomes. We will make two copies from each library: one working copy and the other for replication. The libraries will be stored in 384-well microtiter plates at –80 °C. We have developed a series of quality control procedures to maximize library construction and to ensure the size of the inserts is in the required range (30 kb to 55 kb). For example, we will check the percentage of empty (no insert) clones and the size of the inserts in the libraries. (We found that virtually all of the colonies generated by lambda packaging have inserts.) Twenty colonies generated from each ligation mixture will be checked for the apparent size of the inserts by pulsed field electrophoresis.
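The library size estimate follows from a simple coverage calculation, sketched below. It assumes that all co-existing genomes are equally abundant in the sorted fraction, which is the simplification the CDCE-PCR ratios are meant to refine; the function names are illustrative.

```python
from math import ceil


def library_coverage(n_clones, insert_kb, genome_mbp, n_genomes):
    """Fold coverage of the pooled genomes, assuming equal abundance."""
    return (n_clones * insert_kb) / (genome_mbp * 1000 * n_genomes)


def clones_needed(coverage, insert_kb, genome_mbp, n_genomes):
    """Smallest number of clones giving at least the requested fold coverage."""
    return ceil(coverage * genome_mbp * 1000 * n_genomes / insert_kb)


# Figures from the text: 20,000 clones, 36 kb inserts, 2 Mbp genomes, 20 ecotypes.
print(library_coverage(20_000, 36, 2, 20))   # 18.0
for target in (15, 20):
    print(target, "x coverage needs", clones_needed(target, 36, 2, 20), "clones")
```

With unequal abundances, rare genotypes receive proportionally less coverage, so the library must be sized for the rarest genotype to be mapped rather than for the average.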

Making filters for colony hybridization: We will make initially 10 sets of filters for colony hybridization (to be distributed among the collaborators). 40,000 colonies (20,000 unique colonies x 2) can be printed on one 20 cm by 20 cm nylon membrane by Genetix Q-bot. Each set consists of four such sheets. After processing and fixing DNA on the membrane, they will be shipped to the investigators. More will be made as needed.

Fluorescent Fingerprinting of BACs

NOTE: The fingerprinting section describes work proposed in the collaborative NSF Biocomplexity proposal by Chisholm, Polz, Shizuya and Olsen. It is repeated here for clarification but is not part of the budget for this proposal. The fingerprinting method that Shizuya and collaborators have developed is based on end filling with fluorescently labeled nucleotides. We use a class IIS restriction enzyme, HgaI, which cuts DNA five bases away from the recognition site and generates a 5’ overhang of five unknown bases. These unknown bases can be sequenced with the recessed strand serving as primer and using modified fluorescent dideoxy terminator sequencing reagent (now available from Applied Biosystems). In order to accurately determine the size of each fragment, a fifth dye is included in each lane as a size standard. With this in-lane standard, each fragment is characterized by both the size and the sequence of its terminal five bases. (To generate fragments small enough to size on automated sequencers, a four-base blunt cutter, RsaI, is included in the protocol.) This method greatly increases the power to detect minimum overlap between clones (10 - 15 % overlaps as opposed to 50 % in agarose-based fingerprinting), and to assemble highly accurate contigs.

We use a modified version of the FPC program to cluster clones into overlapping groups. The program was originally developed by the Sanger Centre and is based on an algorithm that calculates a pairwise probability of coincidence, i.e. the probability that two non-overlapping clones share a given number of fingerprint fragments by chance. However, since the original program uses only fragment size, it cannot be applied directly to five-color fingerprint fragments. We have therefore modified the program so that it can also use the color information (base sequence information). The modified program is now available from the site: www.discoverybio.com.
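The idea of a probability of coincidence, and why sequence labels sharpen it, can be sketched with a simple binomial model. This is an illustration only and is not the modified FPC code: the binomial form follows the commonly described Sulston-type score, and the `n_label_classes` parameter (which assumes terminal labels are uniformly distributed and independent of fragment size) is our own simplifying assumption. All numeric values in the usage lines are hypothetical.

```python
from math import comb


def coincidence_probability(n_low, n_high, n_shared, tolerance, gel_length,
                            n_label_classes=1):
    """Chance that two unrelated clones share at least `n_shared` bands.

    n_low, n_high   -- band counts of the clone with fewer / more bands
    tolerance       -- size window within which two bands count as equal
    gel_length      -- usable size range, in the same units as `tolerance`
    n_label_classes -- number of distinguishable terminal labels (assumption)
    """
    per_band = 2.0 * tolerance / (gel_length * n_label_classes)
    if not 0.0 <= per_band <= 1.0:
        raise ValueError("tolerance is too large for the given gel length")
    if not 0 <= n_shared <= n_low:
        raise ValueError("n_shared must lie between 0 and n_low")

    # Probability that one band matches at least one band of the other clone.
    p_match = 1.0 - (1.0 - per_band) ** n_high

    # Binomial tail: n_shared or more of the n_low bands match by chance.
    return sum(
        comb(n_low, m) * p_match ** m * (1.0 - p_match) ** (n_low - m)
        for m in range(n_shared, n_low + 1)
    )


size_only = coincidence_probability(30, 40, 10, tolerance=1, gel_length=1000)
with_labels = coincidence_probability(30, 40, 10, tolerance=1, gel_length=1000,
                                      n_label_classes=4 ** 5)
print(size_only > with_labels)  # True
```

Under these assumptions, adding the label information lowers the chance of a spurious match for the same number of shared bands, which is why smaller true overlaps become detectable.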

We have successfully tested the method on 555 BAC clones from human chromosome 16p13.1 to 16p11.2 that previously could not be assembled using marker-based hybridization and agarose-based fingerprinting because of regional duplications. Because we expect high homology among DNA from Prochlorococcus species in the libraries, the five-base terminal sequence information, together with the size information, will greatly enhance the accuracy of the contig assemblies and generate highly accurate physical maps.

We estimate that it will take three months to prepare BAC DNA from 20,000 clones, and an additional six to nine months to fingerprint the clones on ABI 3700 and/or 373 DNA sequencers. Data assembly will take three months to complete.

Comparative genomics of Prochlorococcus BAC libraries

Outline. Four BAC libraries will be constructed, one of which will serve as the ‘reference’ library. The reference library will be completely fingerprinted (depending on matching funding from NSF, see above) or sequenced with the goal of assembling complete genomes. The success of assembly will depend on how many genomes co-exist and how similar these are to one another. If the genome diversity is moderate, we will be able to tile overlapping BACs and obtain a detailed picture of overall genome architecture and superimposed variation. If genome diversity is too high to assemble contiguous BACs, we will establish anchor points in the BACs by probing with highly conserved genes, primarily informational genes, since these have the lowest probability of lateral gene transfer. In this case, the portrait of genome evolution will be one of variations in local genome structure (at a minimum the size of BAC inserts) but will nonetheless provide an unprecedented set of data for interpreting mechanisms of diversification and selection. The other BAC libraries will be a rich source of comparative data against the background of relevant environmental variation and substructure of the metapopulation. The presence/absence and sequence diversity of specific genes and pathways will be compared among all BAC libraries (Table 3). This includes genes and pathways that have already been shown to display key differences in environmental adaptations, including photosynthesis, inorganic carbon fixation, nutrient uptake (N and P), temperature regulation, and organic carbon metabolism. In addition, an important asset is the current development of DNA microarrays that will be used to interrogate gene expression patterns under varying environmental regimes. Specific genes will be identified in the BAC libraries by hybridization to the filters made from BAC colonies. Since multiple probes can be used per hybridization and the filters can be reprobed at least 10 times, we could search for at least 500 genes. More filters will be printed if needed.

Target genes for comparative analysis. As outlined above, several genes and their associated pathways will be targeted either for anchoring the BAC sequencing or for comparison of the BAC libraries from the different environments. Anchoring genes are those that have a low probability of lateral transfer, owing to the high number of interactions of their gene products in the cell, and those that encode important pathways (e.g., photosynthesis and nutrient uptake). Genes for comparison of the BAC libraries are those that allow us to test questions about the diversification of the populations and the influence of different selection regimes. Thus, besides information about ecological function, we will include genes with differences in expression level for interpretation of codon bias, and genes under strong or weak environmental selection. A list of target genes (Table 3) has been drawn up from currently available information based on genome comparison of the Prochlorococcus ecotypes MED4 and MIT9313; however, continuing genomic comparison and information on the environmental regulation of sets of genes, as determined by DNA microarrays, will serve to supplement this list.

Gene Diversity and Divergence. Analyses of informational and central metabolic genes will provide a baseline of sequence diversity and relative divergences at the DNA versus protein level. Establishing such a baseline within the biological system is very important, since the broad range of G+C contents of MED4 and MIT9313 within this relatively tight phylogenetic group indicates one or more shifts in "G+C pressure". Thus, it will be important to characterize (in essence, calibrate) these measures within the low-G+C group, within the high-G+C group, and between the groups. With this background it will then be more meaningful to interpret the diversity and DNA versus amino acid divergences of other proteins, including those that are most apt to be directly involved in environmental adaptations. These include genes in the photosynthesis system and transport proteins for scarce resources. A priori, we would expect that some genes central to photosynthesis, such as chlorophyll biosynthesis genes, would behave much like central metabolic genes, whereas others will more strongly reflect environmental adaptations.

Table 3. Candidate genes for the comparative analyses. The number in parentheses after the gene name indicates the relative degree of expression in MED4 when grown under continuous light (Zinser, unpubl.). E = Environmental, H = Housekeeping. Yes and No indicate presence or absence in the genomes.

Gene	Function	Role	MED4	MIT9313
pstS	phosphate metabolism, periplasmic phosphate binding protein	E	yes	yes
ntcA	nitrogen metabolism, regulatory protein	E	yes	yes
glnA	nitrogen metabolism, glutamine synthetase	E	yes	yes
amtA/cysQ	nitrogen metabolism, ammonium transport protein	E	yes	yes
ureD	nitrogen metabolism, urease component	E	yes	yes
nirA	nitrogen metabolism, nitrite transport	E	no	yes
opp-PBP	oligopeptide permease, periplasmic substrate-binding component	E	yes	yes
acs	acetyl-CoA synthetase	E	yes	yes
malE	maltose/maltodextrin permease, ABC transporter component	E	no	yes
malQ	amylomaltase, breaks down maltose/maltodextrins	E	yes	yes
kaiB	circadian clock	E	yes	yes
zwf	glucose-6-phosphate dehydrogenase, pentose phosphate shunt	H	yes	yes
gnd	6-phosphogluconate dehydrogenase, pentose phosphate shunt	H	yes	yes
tal	transaldolase, pentose phosphate shunt	H	yes	yes
glpX	fructose 1,6-bisphosphatase, C fixation via Calvin-Benson cycle	H	yes	yes
rbcL	Rubisco, carbon fixation	E	yes	yes
pcb (44.7)	photosystem II antenna protein	E	yes	yes (2)
psbA (143)	photosystem II core protein	E	yes	yes (2)
ftsZ (2.21)	cell division protein	H	yes	yes
dnaA (1)	DNA replication protein	H	yes	yes
rnpB (1912)	RNase P	H	yes	?
kaiC (1.48)	circadian clock	E	yes	yes
rpoD (1.48)	sigma factor…housekeeping?	H	yes	yes
ycf27 (2)	two component response regulator protein, function unknown	E	yes	yes
rbcL (2.93)	Rubisco subunit, carbon fixation	E	yes	yes
gltA	citrate synthase, citric acid cycle	H	yes	yes
acnB	aconitase, citric acid cycle	H	yes	yes
sdhA, sdhB	succinate dehydrogenase, citric acid cycle	H	no	yes
numerous	high light inducible proteins	E	yes (21)	yes (9)
phr, or0495	photolyase	E	yes (2)	no
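As a small illustration of the G+C calibration discussed under Gene Diversity and Divergence, the sketch below computes overall G+C content and G+C content at third codon positions (often called GC3) for a coding sequence. The function names are hypothetical, the input is assumed to be an in-frame coding sequence, and ambiguous bases are simply ignored.

```python
def gc_fraction(seq):
    """Fraction of G or C among unambiguous bases (0.0 if there are none)."""
    bases = [b for b in seq.upper() if b in "ACGT"]
    if not bases:
        return 0.0
    return sum(b in "GC" for b in bases) / len(bases)


def gc3_fraction(cds):
    """G+C fraction at third codon positions of an in-frame coding sequence."""
    return gc_fraction(cds.upper()[2::3])


cds = "ATGGCCAAA"  # toy example: codons ATG, GCC, AAA
print(round(gc_fraction(cds), 3), round(gc3_fraction(cds), 3))  # 0.444 0.667
```

Third codon positions are the least constrained by protein sequence, so comparing GC3 with overall G+C for the same genes in low-G+C and high-G+C genomes is one simple way to see how strongly compositional pressure, rather than selection on the protein, contributes to the observed DNA divergence.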

We will explore the relative merits of different methods of estimating DNA and protein divergences for this data set. We have a bias towards likelihood-based methods when they are available, and new models and measures are routinely becoming available (not to mention being hotly debated). In the end, it will be comparisons of genes in different functional categories that will be used as the arbiter in our studies. Fortunately, it is unexpectedly high divergence that tends to indicate a relaxation or other change in constraints, and this is opposite to the direction of most systematic errors in current methods, so false positives will be relatively rare (and they might be an indication of other interesting phenomena, such as lateral gene transfer).
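As a point of reference for what a model-based divergence estimate does, the sketch below contrasts the raw proportion of differing sites with the Jukes-Cantor (JC69) corrected distance, the simplest substitution model. It is offered only as an illustration of multiple-hit correction; the proposal does not commit to this model, and the likelihood-based methods mentioned above are considerably richer. The function names are hypothetical, and the input is assumed to be two pre-aligned sequences of equal length.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Columns containing gaps or ambiguous bases in either sequence are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (a, b)
        for a, b in zip(seq_a.upper(), seq_b.upper())
        if a in "ACGT" and b in "ACGT"
    ]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - (4/3) * p)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        return math.inf  # saturation: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


a = "ACGTACGTAC"
b = "ACGTACGTAT"
print(p_distance(a, b), round(jc69_distance(a, b), 4))  # 0.1 0.1073
```

The corrected distance always exceeds the raw proportion of differences because repeated substitutions at one site are invisible in a pairwise comparison. Underestimating such hidden changes is a typical systematic error, which is in line with the point above that unexpectedly high divergence is unlikely to be a method artifact.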

Codon Usage. Among the other parameters to be monitored is codon usage, as it has been found (in some systems) to provide information on gene expression level. Although the Codon Adaptation Index is commonly used to monitor this, it tests for matching to a specific codon usage pattern, not just a large departure from random. Thus, other measures will also be explored. We also note that highly biased codon usage in a gene systematically changes divergence estimates, so this too will need to be considered in evaluating observed divergences. Another application of codon usage analysis, in some respects a more useful one, has been the detection of "alien" genes. In addition to methods such as correspondence analysis, which is particularly good at revealing recurring patterns, we are interested in using distance-based methods that will reveal unusual patterns, even if a given pattern is found only in a single gene.
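One simple measure of departure from uniform codon usage, which does not depend on a reference set of highly expressed genes, is relative synonymous codon usage (RSCU): the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. The sketch below is illustrative only; it assumes the standard genetic code and an in-frame coding sequence, and the function name is hypothetical. It is one candidate among the "other measures" mentioned above, not a method the proposal specifies.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Group synonymous codons by amino acid (stop codons excluded).
SYNONYMS = defaultdict(list)
for codon, aa in CODON_TABLE.items():
    if aa != "*":
        SYNONYMS[aa].append(codon)


def rscu(cds):
    """RSCU for every codon of each amino acid that occurs in `cds`."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    counts = Counter(c for c in codons if CODON_TABLE.get(c, "*") != "*")

    values = {}
    for aa, synonyms in SYNONYMS.items():
        total = sum(counts[c] for c in synonyms)
        if total == 0:
            continue  # amino acid not observed; RSCU undefined
        for c in synonyms:
            values[c] = counts[c] * len(synonyms) / total
    return values


result = rscu("CTGCTGCTGTTA")  # four leucine codons: CTG x3, TTA x1
print(result["CTG"], result["TTA"], result["CTT"])  # 4.5 1.5 0.0
```

A value of 1.0 means a codon is used exactly as often as expected under equal usage. A vector of RSCU values per gene can then feed either correspondence analysis or the distance-based comparisons described above.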

Phylogenetic Analyses. We have extensive experience in phylogenetic analysis of nucleotide and protein sequences (Moore, et al. 1998). These types of analysis are powerful tools for generating a variety of genome-scale hypotheses, e.g. about metabolic potential. Of great interest will be the congruency of phylogenetic trees for different genes, when their associations with specific lineages can be determined by presence in common DNA fragments, e.g., by BAC mapping, or by association with specific isolates. Incongruent trees can provide evidence for recombination or horizontal transfer of genes (though other causes must also be considered, including systematic errors and cryptic paralogy). A second aspect of phylogenetic trees that can be quite striking is a lengthening of the branches leading to genes whose function has changed or whose constraints have been relaxed. Finally, phylogenetic analyses provide a context within which to interpret other data, including gene presence and absence, or rearrangements.

Whole Genome Sequencing of Assembled Genomes

If this proposal is funded, a first-year milestone would be to provide the JGI with genomic DNA and/or BACs for sequencing the genomes of a selected number of Prochlorococcus and interacting uncultivated species. As stated in the DOE RFP for this GTL solicitation, we have not included this in the budget, since the sequencing would be done by the JGI. We anticipate that 300 Mbp of raw sequence would be a valuable resource for 10X coverage of 1000 BACs (half centered on rRNA sequences and half from key assembled contigs) plus 100 single cell profiles at 1X coverage (~65% sequence) of both ends of 4 kbp restriction fragments. The single cell libraries help establish that the metabolic potentials all come from one cell rather than reflecting chimeric ambiguities in a BAC fingerprint contig from a mixture of incompletely purified cells. This resource will be augmented by short sequence surveys done in the context of polony correlates with metabolic functions (see below).
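The "~65% sequence" figure quoted for 1X coverage is close to what the idealized Lander-Waterman model predicts. The sketch below assumes reads placed uniformly at random and ignores read length, edge effects, and cloning bias; the function name is illustrative.

```python
import math


def expected_fraction_covered(coverage):
    """Expected fraction of a genome hit by at least one read.

    Under idealized random sampling, the chance that a given base is missed
    at fold coverage c is exp(-c), so the covered fraction is 1 - exp(-c).
    """
    if coverage < 0:
        raise ValueError("coverage must be non-negative")
    return 1.0 - math.exp(-coverage)


for c in (1, 10):
    print(f"{c}X: {expected_fraction_covered(c):.4f}")
# 1X: 0.6321
# 10X: 1.0000
```

At 1X this gives about 63%, in reasonable agreement with the estimate in the text, while at 10X essentially every base of a BAC is expected to be sampled at least once.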

Goal 3b: Functional Diversity Arrays and Single Cell Activity Multiplexing

Outline. We will determine the heterotrophic sinks of important carbon substrates by two techniques: (3b.i) The Functional Diversity Array (FDA), and (3b.ii) Single Cell Activity Multiplexing. Both are culture independent approaches that allow identification of heterotrophic bacterial populations or individual cells that actively metabolize labeled substrates. We will use these approaches to explore the interconnection between Prochlorococcus and populations that act as sinks for excreted carbon.

Goal 3b.i. Functional diversity array

In short, the FDA approach is as follows: (I) Use 23S rRNA (and 16S rRNA) gene clone libraries as templates for construction of oligonucleotide probes against variable regions of the molecule (see probe construction). (II) Spot oligonucleotide probes on glass slides using a robotic arrayer. (III) Incubate environmental samples with ¹⁴C-labeled substrate of interest (see ¹⁴C-labeling of rRNA). (IV) Extract total rRNA from the incubations and hybridize to glass slides containing probes (see hybridization).

¹⁴C-labeling of active heterotrophic populations: ¹⁴C-labeled substrate is efficiently incorporated into rRNA and can be used to identify populations that actively degrade a specific organic compound. Both E. coli pure cultures and natural bacterial assemblages incorporated acetate and glucose into cellular RNA at 2 and 7%, respectively, of total substrate assimilated within one to two cell doublings. Based on these results, the sensitivity of our phosphorimager, and literature data on the composition of bacterial cells, we have conservatively estimated the limit of detection for identifying a bacterial population growing on an added substrate to be 100-1000 cells (within limits, independent of volume). We will spike environmental samples with carbon compounds previously identified as excretion products of Prochlorococcus (e.g., formate, glycollate, acetate, lactate; see Goal 2) and with Prochlorococcus cells pregrown on ¹⁴CO₂, and analyze the accumulation of label in rRNA in a pulse-and-chase fashion.

Goal 3b.ii. Single-Cell Activity Multiplexing

A complementary approach to FDA will be taken for organisms that do not have sufficiently active rRNA synthesis. In this approach, the same ¹⁴C-labeled substrates will be incubated with the whole cell population (direct ocean samples and FACS-sorted samples) spread on polony slides for rapid but thorough filter rinses, autoradiography and rRNA-typing (see ‘Functional Multiplexing’ for a description). This will help determine the trans-membrane uptake rates and specificities. The approach will be calibrated with clonal bacteria (Prochlorococcus, Pseudomonas, Caulobacter) that have known pumps and channels, which will also allow us to determine the stochasticity that is independent of the major genetic differences expected in wild populations. The cells will be FACS-sorted onto the slide so that the position on the slide gives information that can later be used to enrich for that cell type. Even the cells that are metabolically dead by all the above criteria can be genotyped for rDNA and a dozen other metabolic genetic markers to determine whether certain types of cells are enriched among the dead. For example, we might expect that the fastest growing cell types dominate this class. This approach could be extended to cover a variety of other transported substances (and isotopes), including atoms which would (³³PO₄²⁻, ³H-uridine) or would not (e.g. ⁴⁵Ca²⁺, ⁵⁵Fe²⁺, ³⁵SO₄²⁻) be present in the nucleic acid probes used in the ¹⁴C hybridization assay above (Haurani et al. 1993). This in turn provides powerful assays and hence purifications based on their ability to modulate the above uptake/incorporation rates with molecules found in the cell environment (ocean samples), e.g. HSL or siderophores (Guan and Kamino, 2001). The individual cells at the end of multiple such assays can be (destructively) genotyped by polony amplification, sequencing and/or hybridization. They can also be assessed microscopically for doubling rate. We have successfully done analogous polony genotyping for single mammalian T-cells FACS-sorted onto slides, which is undoubtedly more technically challenging since the genome is 1000-fold more complex.

Goal 3c: Spatio-temporal (4 dimensional) patterns of genome expression within biofilm communities of Pseudomonas spp.

Introduction to biofilms

Populations of surface-attached microorganisms comprised either of single or multiple species are commonly referred to as biofilms. In most natural settings bacteria are found predominantly in biofilms, yet for many years studies of bacterial physiology focused on the free swimming or planktonic state of bacterial cells. The widespread recognition that biofilms impact a myriad of environments, from water pipes to indwelling devices in hospital patients, has led to concerted efforts at gaining a better understanding of the molecular mechanisms that underlie the development of these communities.

Simply stated, biofilms can be defined as groups of microbes on a surface. Surface-associated bacteria exist in a variety of physiological states, maintain different cell densities and colonize diverse surfaces. Every microbe thus far investigated has been shown to colonize surfaces under some conditions, and the surfaces that sustain growth vary greatly and include abiotic solids, liquids and living cells. Thus, it appears that surface-associated growth, or biofilm formation, is important in the ecology of most or all bacteria. In fact, it is generally accepted that most microbes spend the bulk of their existence in biofilms. Recently, we have begun to appreciate the spatial organization and complex intercellular signaling networks that develop when microbes are present in multi-cellular communities. These higher order bacterial interactions suggest that within biofilms, a microbe’s physiology will be very distinct from that of its well-studied planktonic counterparts. Thus, detailed studies of the molecular processes at work in many different model biofilms will provide new insights and contexts for understanding microbial physiology.

Microbial biofilms predominate in diverse environments. Pipelines, catheters, teeth, plant roots, and the lungs of cystic fibrosis patients are but a few of the most widely recognized surfaces where the effects of biofilms are readily apparent. During the last five years, many investigators interested in microbial physiology and genetics have turned their attention to the study of these surface-associated communities, and much has been learned about the molecular mechanisms of biofilm formation. In contrast to the view of just a few years ago, when many investigators felt that robust biofilm formation was an attribute of a few bacterial species, it is now clear that virtually all microbes can form biofilms. Perhaps the most salient feature of biofilm-associated bacteria is that their physiology and metabolism are markedly different from those of their planktonic counterparts.

Genetic analyses of biofilm development

We initially isolated and characterized mutants defective in biofilm formation on abiotic surfaces. Our general approach was to work with pure cultures of microorganisms and the communities they formed on abiotic solid surfaces such as polyvinylchloride and borosilicate glass. To carry out this approach we developed a very simple genetic screen to identify and characterize the genetic determinants important for biofilm formation. The gold standard method for growing and analyzing biofilms has been the flow cell. However, these chambers are not particularly well suited to genetic screens, where high throughput is important. The basis of the genetic screen that we developed was to use a microtiter dish well as a chamber for biofilm growth. Biofilms could then be visualized using a stain such as crystal violet, safranin, or ruthenium red. The simplicity of the assay meant that it was relatively easy to carry out high throughput screens of thousands of randomly generated mutants. The complete or partial loss of biofilm staining after rinsing served as an operational definition for mutants defective in biofilm formation. Subsequent microscopic examination revealed that the mutants isolated were blocked at different stages of biofilm development. Therefore, as described in detail below, a simple genetic screen allowed for the elucidation of a number of the steps involved in biofilm development.

We carried out genetic screens on libraries of Pseudomonas fluorescens and P. aeruginosa mutants. The rationale behind this choice of model organisms was based primarily on the diversity of the environmental niches colonized by each and their ease of genetic manipulation. P. fluorescens is a plant-beneficial root-colonizing organism used as a biological control agent on various vegetable crops. Colonization of the plant root is an important step in the biocontrol activity of this bacterium. P. aeruginosa is a ubiquitous soil microbe. While it is often considered for its role in opportunistic infections and its broad host range, which includes plants, mice, invertebrates, and insects, its broad role as an environmental microbe is clear.

The biofilm-defective mutants obtained in the genetic screens were subsequently analyzed microscopically to determine where in the pathway of biofilm development they were blocked. In addition, since the mutations were due to transposon insertions, it was possible to immediately map the physical site of the mutation. This process was greatly facilitated when we modified an arbitrary PCR amplification technique to allow us to sequence insertion sites without having to clone the chromosomal fragments. The physical mapping of the insertion sites was also greatly simplified by the fact that the genome sequence of at least one strain of these organisms was available. Thus, it was relatively easy to determine quickly which gene was affected and when during biofilm development that gene function was required.

The results of our analyses, in conjunction with the work from other groups, allowed us to build a working developmental model for biofilm formation that involved four genetically defined stages: i) Attachment, ii) Surface Colonization, iii) Biofilm Maturation and iv) Detachment. In the following sections we summarize the progress we have made in defining and refining this model.

Attachment. Attachment to abiotic surfaces such as plastic or glass occurs in response to specific environmental signals, i.e. biofilm formation is not a constitutive response. Rather, in response to different environmental cues, bacteria turn on genes that allow them to attach to surfaces. We have found that bacterial biofilm formation depends much more on the composition of the medium than on the physical properties of the surface itself. Thus, it is not surprising that among the genes found to affect biofilm formation are those involved in sensing the environment. We have identified several putative sensor kinases and response regulators with no prior known function from Pseudomonas aeruginosa (G. O'Toole, L. Friedman, P. Watnick, N. Bomchil and R. Kolter, unpublished observations). We have also found that the carbon catabolite regulator Crc, another environmental sensor/regulator, is essential for surface colonization during biofilm formation in P. aeruginosa.

Once the cells sense the right environmental cues, they face the challenge of making contact with the surface. Mutants unable to swim showed great delays in biofilm formation, suggesting that flagellar motility (swimming) facilitates initial contact with the surface. The main role of flagellar motility seems to be to provide force-generating movement that allows the cell to overcome repulsive forces when approaching the surface and become attached. In the absence of flagellar motility there appears to be enough kinetic energy from Brownian motion to allow some cells to eventually attach and form biofilms. The flagella themselves do not seem to serve as adhesins, as paralyzed mutants that still have flagella are also defective in biofilm formation. Finally, while swimming plays an important role in attachment, chemotaxis does not seem to have a role, at least when attachment involves inert abiotic surfaces and homogeneous nutritive environments. Once contact with the surface is made, the bacteria must stabilize their interactions with the surface, and in many instances pili play a critical role in this process.

Surface Colonization. Once stably associated with the surface, bacteria begin to colonize it. Colonization involves a combination of three distinct processes: growth and division of surface-associated cells, recruitment of additional planktonic cells via cell-to-cell interactions, and migration of attached cells along the surface. The concerted action of these three processes leads to the formation of stable microcolonies in the early phases of biofilm development. We showed that cell-to-cell interactions can be mediated by surface structures such as type IV pili, which mediate twitching motility in P. aeruginosa. The biofilm-defective phenotype of the P. aeruginosa Crc mutant described above results from a decrease in twitching motility, because Crc controls the transcription of the pil genes, which encode the type IV pili apparatus. The involvement of twitching motility at this stage of biofilm formation is analogous to the social motility exhibited by Myxococcus xanthus during fruiting body development. These observations served as the starting point that led us to draw many parallels between biofilm formation and fruiting body development and to put forth the developmental model of biofilm formation.

Biofilm Maturation. The assemblage of critical numbers of cells in microcolonies on a surface signals the start of biofilm maturation. As biofilms reach maturity, they develop their characteristic architecture of matrix-enclosed pillars of densely-packed cells surrounded by water filled channels. At this time, biofilms develop an increased resistance to antimicrobials; the mechanism of antimicrobial resistance remains unknown. Much progress has been made in determining the genes involved in the production of the extracellular matrix of biofilms, which is composed largely, though by no means entirely, of exopolysaccharide. Recent work from John Mattick indicates that DNA is also part of the extracellular matrix. We found a particularly interesting connection between flagellar motility and the synthesis of exopolysaccharide in V. cholerae O139; the loss of the flagellum led to overproduction of the exopolysaccharide. That observation led us to propose a model whereby the signal to initiate exopolysaccharide synthesis, the committed step in biofilm formation, was the sensing of a non-functioning flagellum due to the cell’s association with a surface. In fact, it now appears that V. cholerae and Pseudomonas putida at least transiently lose their flagella within biofilms (P. Watnick and R. Kolter, unpublished observation), indicating a possible mechanism for this type of regulation of exopolysaccharide production.

The P. aeruginosa genes responsible for the major exopolysaccharide involved in biofilm architecture have not been clearly defined. The major candidate exopolysaccharide, alginate, appears to be part of the extracellular matrix, but in some environments mutants unable to synthesize this polymer make what appear to be normal biofilms (G.A. O’Toole, personal communication). In an elegant study published in 1998, the Greenberg and Costerton groups showed that biofilms formed by P. aeruginosa mutants unable to synthesize a quorum sensing compound, the C12 acylated homoserine lactone, were defective in maturation. The mutant biofilms consisted of densely packed cells, did not display the characteristic architecture of wild-type biofilms, and were easily destroyed by detergents such as SDS. Yet, those biofilms appeared to contain normal amounts of alginate. To date, the quorum sensing-regulated genes responsible for detergent resistance and biofilm architecture have not been identified. Very recently, however, we have obtained biofilm-defective mutants in a cluster of genes that appear to encode the synthesis of an exopolysaccharide that is not alginate (L. Friedman and R. Kolter, unpublished results). Further studies will be needed to determine whether this new gene cluster is in fact responsible for the synthesis of the major architecture-conferring exopolysaccharide in P. aeruginosa.

Detachment. When the environmental conditions are no longer propitious for the biofilm mode of growth, biofilms can fragment and detach. In P. aeruginosa, biofilms that were incubated for prolonged periods in batch culture eventually weakened and dissolved. This observation led us to hypothesize the existence of a "biofilm inhibitory factor" (BIF) produced in stationary phase cultures of this organism. In fact, we were able to partially purify a substance from stationary phase culture supernatants that both inhibited biofilm formation and dissolved preformed biofilms (G. O'Toole and R. Kolter, unpublished results). The substance was shown to exhibit surfactant activity but was also shown to be distinct from the known surfactants of P. aeruginosa.

Proposed work on 4D structure of Pseudomonas biofilms

With this as background we can now look precisely at how the genome manifests itself within a biofilm. The remarkable architecture of biofilms already suggests a high degree of cellular specialization. Several studies have indicated that different patterns of gene expression occur within different regions of biofilms. However, relatively little is known about how this differential gene expression is regulated and whether it reflects a more dramatic cellular differentiation. Our analyses of B. subtilis biofilms led to the surprising discovery that sporulation occurred preferentially at the tips of aerial projections on the surface of the biofilm (Branda, 2001). However, virtually nothing is known about the specialization of labor within different zones of Pseudomonas biofilms. We will address this question using Pseudomonas fluorescens, Pseudomonas putida and Pseudomonas aeruginosa isolates from environmental settings. Our aims in the research plan presented here are to begin to understand how physical and chemical signals are integrated and transduced among the cells composing the biofilm to achieve spatial and temporal differentiation.

The Biological System. Biofilm formation by Pseudomonas follows a distinct developmental pathway that depends on the experimental set-up. For the proposed experiments we will use batch cultures and analyze the biofilm that forms at the air-liquid interface. We chose this system because the process involves the natural congregation of cells in response to environmental changes and self-generated signals. The cells then aggregate and maturation of the biofilm follows. After inoculation of standing cultures, motile cells proliferate throughout the liquid as planktonic cells until they reach a density of ~5 x 10⁸ cfu/ml. At that point, the vast majority of the cells migrate to the air-liquid interface, where they form a biofilm that floats on the surface of the medium. Cells within the biofilm undergo dramatic differentiation as they continue to proliferate. Cells become non-motile and are held together by an extracellular matrix. As the cell mass increases, some groups of cells form aerial projections. It is important to note that this high degree of differentiation is apparent only in wild isolates of Pseudomonas and in less domesticated strains such as PA14; it is not observable in the sequenced strain or in domesticated laboratory strains. This underscores the importance of analyzing robust environmental strains.

The Working Model and Central Hypothesis. Our preliminary genetic analyses of floating biofilm formation have allowed us to construct a developmental model that will be further refined through the proposed experiments. Our central hypothesis is that the interplay between cell-cell signaling and the microenvironments within the biofilm produces spatial and temporal patterns of gene expression that lead to localized cellular differentiation. This hypothesis must now be critically tested through a detailed analysis of the gene expression patterns in different regions of the biofilm. To do so, we will analyze the spatial and temporal expression of the Pseudomonas genome. We will develop microscopic techniques coupled to mRNA amplification techniques to analyze the spatio-temporal expression patterns of all highly expressed genes. Lastly, we will disrupt genes known to be involved in environmental sensing and cell-to-cell signaling and assess the effects on the spatial and temporal patterns of gene expression. The compilation of the results obtained through the analysis of where and when genes are expressed will result in a "functional anatomy" of differentiated Pseudomonas biofilms.

Analysis of the spatio-temporal gene expression. The thrust of our experimental approach will be to study the development of Pseudomonas biofilms much as one would study a multicellular organism. We and others have shown that within structured microbial communities, such as biofilms, member microbes display differences in gene expression depending upon their location within the community. This indicates that regions of cell differentiation, and perhaps specialization, exist within these communities, and we refer to such regions, as well as those that are morphologically distinct, as biofilm "zones". This section describes the microscopic analyses that will enable mapping of the Pseudomonas biofilm zones that show cellular differentiation as defined by differential gene expression. These initial analyses will constitute a series of experiments in which the spatial and temporal patterns of expression will be mapped. At first we will characterize gene expression in mature wild-type biofilms. Subsequently, we will similarly analyze biofilms prior to their maturation. In our experimental set-up, mature biofilms are defined as those that form after five days of incubation at 25ºC in TB (tryptone broth). For these studies, strains will be grown in standing cultures incubated at 25ºC and sampled every six hours over five days, to enable study of the patterns of gene expression during all stages of biofilm development. The result of these initial observations will be the establishment of a temporal and spatial map of gene expression.

Analyses of biofilm thin sections. The biofilms to be analyzed for gene expression will be characterized both through molecular approaches that amplify mRNA in situ and by microscopic approaches.

The molecular approaches will require that we develop mRNA amplification methods and that we have comprehensive arrays; for both, collaboration and integration with the other aims of this program project are essential.

We will use both light microscopy and transmission electron microscopy (TEM). The biofilms will be fixed in paraformaldehyde, washed with glycine to quench free aldehyde groups, and infiltrated with sucrose or dextran as cryoprotectants. For light microscopy, the samples will be embedded in Tissue-Tek OCT compound on a cryostat chuck, frozen in an EtOH/dry ice bath, and 5 µm thick sections will be made using a microtome equipped with a glass knife. These sections will be captured on polylysine-coated slides, air-dried, and covered with aqueous mounting medium and a coverslip and subjected to in situ hybridization using selected mRNA probes determined from our in situ expression analyses. For TEM, the biofilm samples will be directly frozen in liquid nitrogen and sectioned (0.07 µm thick) using a microtome equipped with a diamond knife; the sections will be placed on formvar/carbon-coated copper grids and treated with uranyl acetate in methyl cellulose for contrast and embedding.

Confocal scanning laser microscopy of intact biofilms. Confocal scanning laser microscopy (CSLM) is perhaps the most powerful technique for analyzing patterns of gene expression within relatively thick specimens. For our purposes, the most attractive aspects of CSLM are that it enables simultaneous detection of multiple fluorescent probes as well as three-dimensional imaging through assembly of "stacks" of successive optical sections. Moreover, CSLM specimens do not require fixation or other potentially disruptive manipulations as long as they have been fluorescently labeled in vivo. For this reason, we will use transcriptional gfp fusions to selected reporter genes, identified through our molecular approaches, to follow gene expression within the biofilm. The gfp allele that will be used encodes an unstable variant of GFP, allowing the monitoring of the dynamics, rather than the history, of sspE expression within biofilms. Biofilm fragments excised from mature biofilms will be fixed to a coverslip using non-fluorescent acrylamide and sealed within the depression of a welled slide. Vertical or horizontal optical sections about 0.3 µm thick will be captured using reflected backscattered light imaging to detect structural features, followed by fluorescence imaging to detect GFP expression. The images of corresponding sections will be merged to show GFP expression within the context of the structural features observed. We can also use the fluorescent dye FM4-64 (Molecular Probes) to label cell membranes nonspecifically, to serve as an additional reference. Three-dimensional viewing of GFP expression patterns will be achieved through assembly of successive optical sections into stacks. At the conclusion of these experiments we will have mapped the spatial and temporal patterns of gene expression and developed the techniques to study the spatial and temporal expression patterns of key biofilm-specific genes.

Goal 3d: Genome-wide transposon tag array quantitation to survey the relative fitness of thousands of different genotypes in a mixed microbial population

The Ausubel laboratory has been studying the role of phenotypic variation in the formation of P. aeruginosa strain PA14 biofilms (Drenkard, 2002). This project derived from the observation that antibiotic-resistant PA14 colonies appear at a frequency of 10⁻⁶ to 10⁻⁷ when cultures are plated on Luria-Bertani (LB) agar containing kanamycin (200 µg/ml). One class of resistant variants (~30%) exhibited a rough colony phenotype compared to the wild-type and was called RSCV (Rough Small Colony Variant). When RSCVs were grown on antibiotic-free LB agar, wild-type revertants characterized by a large colony size, smooth appearance, and wild-type susceptibility to kanamycin arose on the edges of the variant colonies after 5 days of incubation at room temperature, suggesting that the phenotypic changes observed in the resistant variants were transient.

Unlike wild-type, RSCV formed visible aggregates when liquid cultures were left without shaking at room temperature. Moreover, RSCV exhibited increased attachment to surfaces such as glass and polyvinylchloride plastic (PVC). Reverted RSCV showed wild-type levels of both agglutination and attachment to glass and PVC plastic. Consistent with the self-agglutination phenotype, RSCV clones agglutinate at a lower salt concentration (0.125 M) than wild-type PA14 (0.5 M), indicating that RSCVs have a higher degree of surface hydrophobicity. Moreover, confocal scanning laser microscopy (CSLM) of GFP-labeled PA14 showed that RSCV form biofilms faster, and with greater biomass, than wild-type. Additionally, measurements of viable biomass of GFP-tagged PA14 and RSCV cells using CSLM analysis showed that biofilms formed by RSCV are more resistant to a continuous flow of the antibiotic tobramycin (200 µg/ml) than wild-type PA14 biofilms, paralleling the resistance observed on plates.

Phenotypic (phase) variation is a common phenomenon in Gram-negative bacteria that often involves environmentally regulated changes in surface components leading to changes in observable phenotypes (Henderson, 1999). Consistent with a role for phase variation in the formation of RSCV, there is a 40-fold increase in the frequency of appearance of resistant variants (not just RSCV) obtained on LB media containing NaCl (85 mM) compared to the same medium without NaCl and a dramatic 10⁶-fold increase on minimal M63 salts compared to LB medium.

Importantly, by transferring a cosmid library of PA14 chromosomal DNA en masse to RSCV, we identified a regulatory gene that causes 100% reversion of RSCV to the antibiotic-susceptible, non-self-agglutinating form. The gene that was identified, pvrR (phenotypic variant regulator), shows sequence similarities to response regulator elements of two-component regulatory systems, including 30% identity and 45% similarity to the Vibrio cholerae response regulator VieA. Moreover, sequence analysis of the regions located upstream and downstream of pvrR revealed the presence of two additional ORFs (designated ORF1 and ORF3, respectively) with sequence homology to two-component regulatory elements. To determine whether pvrR or a highly similar pvrR homolog is present in other P. aeruginosa strains, we performed PCR analysis of 14 P. aeruginosa strains using pvrR-specific primers. We subsequently confirmed the specificity of the PCR products obtained by Southern blotting and hybridization with a pvrR-specific probe. Among the 14 strains tested, 12 contained the pvrR gene fragment or a highly similar fragment. Consistent with the putative role of PvrR in the regulation of phenotypic switching, overexpression of PvrR in wild-type PA14 resulted in a 6-fold reduction in the frequency of resistant variants obtained after plating overnight cultures on kanamycin plates (200 µg/ml) compared to wild-type. Finally, since PvrR is involved in the regulation of the phenotypic switch, we hypothesized that mutation of pvrR would alter the proportion of resistant variants present in the PA14 population. Indeed, a PA14 strain carrying a 914 bp in-frame deletion of pvrR (∆pvrR) exhibited an increased frequency of appearance of resistant variants on kanamycin plates with respect to the wild-type.

These data show that P. aeruginosa is capable of undergoing transient phenotypic changes that enhance its ability to form biofilms. Analogous to phase variation phenomena, enhanced biofilm-forming RSCV, representing a relatively small fraction of the bacterial population, ensure its survival in adverse conditions, such as the presence of antibiotics. It remains to be determined whether the underlying mechanism of RSCV formation involves the types of DNA rearrangements that characterize other phase-variable systems (Han, 1997; Henderson, 1999). Because the appearance of phenotypic variants in response to antibiotic treatment has been reported in both Gram-negative and Gram-positive bacteria (McNamara, 2000), resistance mechanisms similar to the one found in this study may be common among other bacterial pathogens.

Background on Making Non-Redundant Libraries

Rather than screening random mutant transposon libraries, we now propose to sequence the insertion sites of approximately 24,000 random transposon insertions in the PA14 genome. This number gives a 95% probability of obtaining an insertion in each of the estimated 4,800 non-essential P. aeruginosa genes. From this library of 24,000 sequenced insertions, a single insertion in each targeted gene will be chosen for screening in model hosts. There are several advantages of this approach:

Instead of screening 24,000 mutant lines for a particular phenotype to saturate a screen, only approximately 4,800 mutants will have to be screened.

Screening with a 4,800-member library that contains known insertions identifies genes that are not related to a particular phenotype as well as genes that are involved.

When a putative mutant has been identified, it is necessary to verify that the targeted gene is the cause of the phenotype of interest. Currently, this involves one of two laborious procedures, complementation analysis with the wild-type gene or reconstruction of the mutant by targeted mutagenesis. For any given gene, however, the library of non-redundant insertions will likely contain additional, independent insertions that can readily be tested to determine if the same phenotype is observed.

PA14 contains several large blocks of DNA not found in PAO1 (E. Drenkard, S. Miyata, J. Villanueva, L. Rahme, S. Calderwood and F. Ausubel, unpublished data). High-throughput sequencing of all members of the insertion library will facilitate the identification of additional genes present in PA14 but absent in PAO1. Moreover, the sequencing project described in this proposal will provide between 1.5- and 2.0-fold sequence coverage of the non-essential portion of the PA14 genome.
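The library-size reasoning above can be explored with a simple Poisson model. The sketch below is illustrative only and is not the calculation behind the 95% figure quoted in the text: it assumes that insertions land uniformly at random, and both the genome size (6.5 Mbp) and the gene lengths are assumed values chosen for illustration. Real transposons show insertion hotspots and coldspots, so actual coverage would be lower than this model suggests.

```python
import math


def p_hit(n_insertions, target_bp, genome_bp):
    """Probability that a target of target_bp receives at least one of
    n_insertions, assuming uniform random insertion (Poisson model)."""
    expected_hits = n_insertions * target_bp / genome_bp
    return 1.0 - math.exp(-expected_hits)


def insertions_needed(p, target_bp, genome_bp):
    """Number of insertions needed to hit a target with probability p
    under the same uniform-insertion assumption."""
    return math.ceil(-math.log(1.0 - p) * genome_bp / target_bp)


GENOME_BP = 6_500_000  # ASSUMED genome size, for illustration only
N_INSERTIONS = 24_000  # library size proposed in the text

if __name__ == "__main__":
    for gene_bp in (500, 1000, 2000):  # hypothetical gene lengths
        print(gene_bp, "bp gene:", round(p_hit(N_INSERTIONS, gene_bp, GENOME_BP), 3))
```

Under these assumptions short genes are the limiting case, which is one reason a library several times larger than the gene count is needed.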

The Ausubel laboratory is in the process of generating a saturated non-redundant library of sequenced insertions using a highly automated protocol. Briefly, PA14 is mutagenized with transposon TnphoA and insertion strains are picked by a colony-picking robot (Qbot). The mutants in microtiter plates are assigned barcodes and are then replica plated, using a microtiter plate and liquid handling robot (Biomek), to generate backup copies of the library for storage. The locus disrupted by the insertion in each mutant is identified by PCR amplification followed by sequencing of the genomic region adjacent to the transposon. A Biomek robot is used to set up the PCR reactions, which are then cleaned up using a Qiagen Biorobot. The cleaned PCR products are sequenced by the MGH sequencing core facility. Ultimately, the resulting data will be entered into a MySQL relational database that is under development. We are able to obtain high quality sequences for 89% of the mutants, and the projected throughput has increased to approximately 4,600 colonies processed and sequenced per three-week interval.
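A rough projection of the time needed to reach the 24,000-insertion target follows from the figures quoted above. The sketch assumes, as a reading of the text rather than a stated fact, that the 4,600-colony throughput counts colonies entering the pipeline and that 89% of them yield usable sequence; the function name is hypothetical.

```python
import math


def weeks_to_reach(target_sequenced, colonies_per_batch, weeks_per_batch, success_rate):
    """Return (colonies_needed, weeks) to obtain target_sequenced usable
    insertion sequences, processing whole batches only."""
    colonies_needed = math.ceil(target_sequenced / success_rate)
    batches = math.ceil(colonies_needed / colonies_per_batch)
    return colonies_needed, batches * weeks_per_batch


if __name__ == "__main__":
    # 24,000 insertions, 4,600 colonies per 3-week interval, 89% success
    print(weeks_to_reach(24_000, 4_600, 3, 0.89))
```

If the throughput figure instead already counts only successfully sequenced colonies, the same function can be called with a success rate of 1.0.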

It is unlikely that a single transposon will be sufficient to saturate the PA14 genome. Therefore, a derivative of the mariner transposon will be used to make additional transposition events when the library is saturated with TnphoA. Unlike TnphoA, mariner is of eukaryotic origin and presumably exhibits target site preferences different from that of TnphoA. We have found that a mariner-based transposon (TnGFP3; obtained from J. Mekalanos) is five-fold more efficient at mutagenizing PA14 than TnphoA and we can obtain PCR products and DNA sequence using this alternate transposon with a similar (89%) efficiency. An outward facing T7 promoter will also be included. The T7 promoter will be used for in vitro transcription in quantitative growth experiments (see Badarinarayana, et al. 2001*, attached, and goal 3d for Prochlorococcus below, where the same strategy of Y-linker PCR followed by T7 RNA polymerase is needed to get the precise array quantitation of the selected genotypes).

Using this high-throughput protocol, to date we have produced 4608 (48 96-well plates) individual insertional mutants using the TnphoA transposon and processed them for long-term storage. Several of these mutants have been further processed for PCR and sequencing to identify the genomic region adjacent to the transposon insertion. Working with this first set of mutants, we have optimized the high-throughput protocol for maximal efficiency.

We have also made advances in the design of the bioinformatics tools required to store and manipulate the data generated by the construction of the PA14 uni-gene library. Specifically, we have been developing a sophisticated MySQL relational database that will be essential for both sorting and tracking the individual transposon mutants as well as mapping the transposon insertions to specific open reading frames. Currently, the database performs two general functions: 1) tracking the physical location and the stage of processing of each mutant and 2) completing sequence analysis for individual mutants.

The PA14 uni-gene library project requires processing numerous plates in series and in parallel. The database records which plates have completed a given step of the process, including culture, PCR, and sequencing. The database also records the physical location of each mutant in the corresponding wells of each microtiter plate, allowing us to retrieve a given mutant from any plate at any stage of the process.
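The plate and well tracking described above can be sketched as a two-table relational schema. This is a minimal illustration, not the project's actual database: the proposal's system is built on MySQL and its schema is not described here, so the table names, column names and barcode format below are hypothetical, and Python's built-in sqlite3 module is used only so that the example is self-contained.

```python
import sqlite3

# Processing steps named in the text, in order.
STAGES = ("culture", "pcr", "sequencing")


def make_db():
    """Create an in-memory database with hypothetical plate/mutant tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE plate (
            barcode TEXT PRIMARY KEY,
            stage   TEXT NOT NULL DEFAULT 'culture'
        );
        CREATE TABLE mutant (
            mutant_id INTEGER PRIMARY KEY,
            barcode   TEXT NOT NULL REFERENCES plate(barcode),
            well      TEXT NOT NULL,
            UNIQUE (barcode, well)
        );
    """)
    return db


def add_plate(db, barcode):
    """Register a 96-well plate and one mutant per well (A01 .. H12)."""
    wells = [f"{row}{col:02d}" for row in "ABCDEFGH" for col in range(1, 13)]
    db.execute("INSERT INTO plate (barcode) VALUES (?)", (barcode,))
    db.executemany("INSERT INTO mutant (barcode, well) VALUES (?, ?)",
                   [(barcode, well) for well in wells])


def advance(db, barcode):
    """Move a plate to the next processing stage and return that stage."""
    (stage,) = db.execute("SELECT stage FROM plate WHERE barcode = ?",
                          (barcode,)).fetchone()
    nxt = STAGES[min(STAGES.index(stage) + 1, len(STAGES) - 1)]
    db.execute("UPDATE plate SET stage = ? WHERE barcode = ?", (nxt, barcode))
    return nxt


def locate(db, mutant_id):
    """Return (plate barcode, well, current stage) for one mutant."""
    return db.execute(
        "SELECT m.barcode, m.well, p.stage FROM mutant m "
        "JOIN plate p ON p.barcode = m.barcode WHERE m.mutant_id = ?",
        (mutant_id,)).fetchone()
```

Keeping the processing stage on the plate rather than on each mutant mirrors the plate-by-plate workflow; a per-mutant status column would be needed if individual wells could fail and be reprocessed independently.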

The database system also automates much of our sequence data analysis. DNA sequence files for each of the PCR products are imported and correctly assigned to the corresponding mutant in the database. The database then uses software (Phred) to identify low quality or questionable sequence and remove it, leaving behind only high quality sequence for two types of subsequent alignment analyses: (a) Since the PCR products contain a short stretch of transposon DNA adjacent to the PA14 genomic DNA, the database identifies the transposon sequence within the remaining high quality sequence using the Smith-Waterman alignment algorithm. This allows an accurate determination of the genomic site into which the transposon has inserted and which it has therefore disrupted. (b) Because the complete genomic sequence of PA14 is unavailable, automated BLAST searches are performed using the annotated genomic sequence of the highly related strain, PAO1 (www.pseudomonas.org). Successful sequence alignments are then used to identify the location of the transposon insertion site and to determine whether it lies within an ORF. Sequences that fail to align with the PAO1 genome may correspond to genomic DNA that is present in PA14 but absent in PAO1 -- examples of such sequences have already been identified in the Ausubel and other laboratories, and many of these "PA14-specific" sequences have been shown to contribute to the virulence of this more pathogenic strain.
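Step (a) above, locating the transposon end within a quality-trimmed read, can be illustrated with a small Smith-Waterman implementation. This is a sketch under stated assumptions, not the database's own code: the scoring values, the minimum-score cutoff, the requirement that the alignment reach the last base of the transposon end, and the function names are all illustrative choices, and any sequences passed in for testing are made-up strings rather than real TnphoA or PA14 sequence.

```python
def smith_waterman(query, target, match=2, mismatch=-1, gap=-2):
    """Best local alignment of query against target with linear gap costs.

    Returns (score, query_end, target_end), where the end positions are
    1-based indices of the last aligned base in each sequence.
    """
    cols = len(target) + 1
    prev = [0] * cols
    best = (0, 0, 0)
    for i in range(1, len(query) + 1):
        cur = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if query[i - 1] == target[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            if cur[j] > best[0]:
                best = (cur[j], i, j)
        prev = cur
    return best


def genomic_flank(read, tn_end, min_score):
    """Return the read sequence downstream of the transposon end, or None.

    The flank is reported only if the alignment scores at least min_score
    and extends to the final base of tn_end, so that the junction between
    transposon and genomic DNA is unambiguous.
    """
    score, query_end, target_end = smith_waterman(tn_end, read)
    if score < min_score or query_end != len(tn_end):
        return None
    return read[target_end:]
```

The returned flank is the sequence that would then be passed to the BLAST search in step (b); its first base marks the insertion site.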

Significant Quality Assurance testing has been completed to ensure that the database accurately relates and processes each mutant DNA sequence file. As noted above, we feel that a single transposon will not be sufficient to saturate the PA14 genome since it will likely have both hotspots and coldspots for integration. Unlike the TnphoA transposon, a derivative of the mariner transposon (a transposon of eukaryotic rather than prokaryotic origin) presumably exhibits target site preferences different from that of TnphoA. Additional features have been added to the database that allow the software to process chromatograms derived from mutants created by insertion of the mariner transposon.

Using the current version of the database, a subset of the original 4608 mutants has been analyzed and a preliminary breakdown of the PA14 mutants has been obtained. Approximately 60% of 744 processed sequences align with ORFs represented in the PAO1 genome, and 20% align with intergenic PAO1 genomic sequence. The remaining 20% do not align with any sequence in the PAO1 genome and may represent PA14-specific sequences. A web page has been designed so that processed mutants that align with PAO1 ORFs can be viewed and requested: http://pga.mgh.harvard.edu/cgi-bin/pa14/mutants/retrieve.cgi. At this site, the "virtual" insertion site in the PAO1 genome, the ORF that overlaps this site and any annotation associated with the ORF are listed for each mutant. Visitors to the site can search for specific mutants by specifying any one of these fields. Alternatively, visitors can download the entire list of mutants. As more mutants are added to the database, the number of mutants listed on this site will expand, providing more reagents to the Pseudomonas community.

Further develop and apply methods to detect differences in large genomic regions.

Scientific Questions to be Addressed

The intensive application of the tools of molecular genetics to the study of environmental microbes, coupled to the availability of complete genome sequences, has revolutionized the study of environmental microbiology in recent years. However, a key gap in our knowledge of environmental microbes relates to the variability that is found among natural isolates. While complete genome sequences are now essential tools for future research, it remains to be determined just how widely the genetic content varies among different strains of microorganisms and how this variability affects metabolic activity. Thus, it is imperative that the functional significance of genomic variability among environmental bacteria be studied in a methodical and rigorous way. The key questions to address to gain future insights into this matter include:

How much do the genomes of natural isolates vary with respect to those genomes whose sequence has been determined?

What genes are missing?

What different/novel genes are present in natural isolates?

What are the functions associated with the novel genes?

Could these genes encode potential new enzymatic activities with bioremediation potential?

The three environmental organisms that we propose to utilize as the focus of this research are among the most ubiquitous microbes and are members of the broad group known as the Pseudomonads. Most of the evolutionary history of these microbes is likely to have occurred in the open environment, and they are thus likely to have repeatedly encountered horizontal gene transfer opportunities. It is likely that their genomes will display a remarkable degree of heterogeneity amongst microbiologically similar isolates.

Our central hypothesis is that microbes that must survive in constantly changing environmental settings with continual encounters with other microbes will contain highly variable genomes, indicative of frequent horizontal gene transfer. At the core of the proposed research is the testing of key predictions of this hypothesis. Most importantly, this hypothesis predicts that worldwide isolates of Pseudomonads will contain large numbers of genes that differ from those present in the sequenced strains. While there are some indications that this will indeed be the case, the number of strains that have been characterized to date is too limited to validate or falsify the hypothesis. The proposed work will, of course, also provide the scientific community at large with an expanding database that should eventually contain "all" the genes that can be said to be present in strains of P. aeruginosa, P. fluorescens and P. putida.

Proposed Work

Environmental strains from different locations worldwide have already been isolated and characterized microbiologically. At present, our collection of natural isolates is composed of several dozen isolates from each species. Initial screening was performed with the aim of obtaining a heterogeneous collection of isolates by surveying diverse environments. In particular, we have selected soil Pseudomonads from pristine fields, from fields undergoing intense agriculture and from oil fields.

The next step in the research will be to determine the differences in genome arrangement between the new isolates and the "model" strain whose genome has been sequenced. Genes that are missing and, most interestingly, new genes, will be identified using microarray and hybridization technologies.

Once new genes are identified, they will be mapped by sequencing genomic clones and thus their site of insertion into the known genome will be determined. From here the work will take two directions.

1. Bioinformatics approaches will be used to attempt to identify the function of some of the new genes.

2. Multiple tagging methods will be used to assess the relative fitness of the natural isolates when in competition.

The central question to be addressed in this project is that of genetic variability among natural bacterial isolates of three members of the broad microbiological group Pseudomonads. While the genomic sequences have now been completed or are nearly completed for Pseudomonas fluorescens Pf0-1 (5.5 Mbp), Pseudomonas putida KT2440 (6.1 Mbp), Pseudomonas putida PRS1 (6.1 Mbp) and Pseudomonas aeruginosa PAO1 (6.264 Mbp), it is not yet known how much variation there will be among isolates worldwide. In particular, several of these strains have been domesticated through many years of laboratory maintenance. Thus, a most important aspect of our proposed work is to establish criteria that will be applied in the selection of strains to be analyzed. Ideally one would want to analyze as many strains as possible. The very nature of this project is such that the more strains are analyzed, the quicker the analysis of subsequent strains becomes. However, at the outset, selection parameters should favor strains that give indications of containing the most differences from the sequenced strains.

The first criterion to be used is that of geographic/environmental location. Strains of these microorganisms that have been isolated from the most diverse geographical/environmental regions will be analyzed first. To quickly discriminate strains, isolates will be catalogued in terms of differences in RAPD and pulsed-field gel electrophoresis chromosomal patterns. Insertion sequence pattern differences that have already been documented will also be used. The aim is to rapidly identify about 10 to 20 isolates from each species to carry out the whole genome analysis. As the work progresses, additional strains can be added to the in-depth analysis, particularly if new isolates are identified that have unique features with respect to their bioremediation potential.
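One possible way to turn fingerprint differences into a shortlist of 10 to 20 isolates is to score each isolate's band pattern against the others and pick isolates greedily so that each new pick is as different as possible from those already chosen. The sketch below is only an illustration of that idea and is not a method specified in this proposal: the Jaccard distance, the greedy max-min rule, and the isolate names and band sets are all assumptions.

```python
def jaccard_distance(bands_a, bands_b):
    """1 - (shared bands / all bands) for two sets of band identifiers."""
    union = bands_a | bands_b
    if not union:
        return 0.0
    return 1.0 - len(bands_a & bands_b) / len(union)


def pick_diverse(patterns, k, seed):
    """Greedy max-min selection of k isolates, starting from seed.

    patterns maps isolate name -> set of bands (e.g. RAPD or PFGE band
    sizes). At each step the isolate whose nearest already-chosen isolate
    is farthest away is added.
    """
    chosen = [seed]
    while len(chosen) < min(k, len(patterns)):
        candidates = [name for name in sorted(patterns) if name not in chosen]
        best = max(
            candidates,
            key=lambda name: min(
                jaccard_distance(patterns[name], patterns[c]) for c in chosen
            ),
        )
        chosen.append(best)
    return chosen
```

Seeding the selection with the sequenced reference strain biases the shortlist toward isolates that differ most from it, which matches the stated goal of analyzing first the strains most likely to carry novel genes.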

Identification of genes missing in new isolates vs. sequenced strains.

Once the initial collection of strains is identified, the next step will be to map their genomes relative to the reference genomes. For these analyses we will need the central tool of this project: gene microarrays representing the sequenced genomes. Microarrays for Psa PAO1 are already available. For the other genomes we will synthesize oligonucleotide arrays on glass slides. As new genes are discovered as a result of this project, they will be added to the subsequent editions of the microarrays. The eventual hope is to generate nearly comprehensive microarrays that contain "all" genes known to be present in these microbes, thus making subsequent genomic mapping trivial. The first full-fledged effort in this regard will be the comprehensive sequencing of the genes in Psa PA14 not present in Psa PAO1.

Having microarrays, the identification of genes missing in the new isolates becomes simple. We will use fluorescently labeled genomic DNA from the new isolates to probe the microarrays. The hybridization patterns can then be analysed directly to identify deletions. Novel genes present in the new strains, however, will have to be identified through a lengthier process.
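The deletion calls described above can be sketched as a per-gene comparison of hybridization signal from the new isolate against signal from the sequenced strain. This is a minimal illustration under stated assumptions: the log2 ratio cutoff of -1.0, the signal floor, and the gene names in any example data are hypothetical, and a real analysis would add normalization and replicate handling.

```python
import math


def call_absent(test_signal, ref_signal, log2_cutoff=-1.0, floor=1.0):
    """Flag genes whose signal in the test isolate is far below the reference.

    test_signal / ref_signal map gene name -> background-corrected intensity.
    Returns gene name -> (log2 ratio, absent_flag). Signals are clamped to
    floor so that zero or missing values do not break the logarithm.
    """
    calls = {}
    for gene, ref in ref_signal.items():
        test = test_signal.get(gene, 0.0)
        ratio = math.log2(max(test, floor) / max(ref, floor))
        calls[gene] = (ratio, ratio <= log2_cutoff)
    return calls


def absent_genes(calls):
    """Sorted list of genes flagged as absent or highly divergent."""
    return sorted(gene for gene, (_, absent) in calls.items() if absent)
```

A low ratio indicates that the gene is either deleted or too divergent to hybridize; the array alone cannot distinguish these two cases, and it cannot detect genes that are present only in the new isolate.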

Identification of new genes in isolates vs. sequenced strains.

The key step that will allow the identification of novel genes present in the new isolates will be the generation of special probes to hybridize cosmid libraries made from DNA of the new isolates. Concurrent with the preparation of the probes, we will generate the cosmid libraries.

Two alternative approaches will be used to prepare probes that are highly enriched for sequences unique to the new isolates. The first approach makes use of the fact that M13 libraries produce single stranded DNA that can be used directly as probe without denaturation. New isolate DNA fragments, with average size inserts of 1 kilobase, will be cloned into an M13 vector. Once good libraries are available, as judged by the percent of phage containing inserts, we will prepare single stranded phage DNA. This DNA will be hybridized in solution with an excess of denatured chromosomal DNA from the sequenced strains. In this way, the only insert DNA that will remain single stranded will correspond to those sequences that are not represented in the sequenced genome. Without further denaturation, this DNA will be used to probe cosmid libraries that will have been denatured and immobilized on nitrocellulose or nylon membranes. Phage DNA that has hybridized to cosmids can be detected by using a labeled oligonucleotide complementary to M13. Alternatively, the new isolate DNA can be denatured and hybridized at low concentration to a vast excess of sequenced strain single stranded DNA which will be fixed on a solid matrix. The new isolate DNA remaining in the liquid phase will be amplified using randomly primed PCR. The amplified product can be used directly to probe the cosmid library of new isolate DNA.

Those cosmids that give positive results on the initial hybridization will be purified and reprobed using Southern hybridization analysis. Cosmids that reprobe positive will be subjected to limited sequence analysis from both vector-insert boundaries. In this way we will be able to quickly locate the insert within the known genome. Three general results are possible: a) both boundaries correspond to known sequences, b) one boundary corresponds to known sequences or c) neither boundary is present in the sequenced genome. When the boundaries are part of the sequenced genomes, the sequence immediately maps the cosmid insert to a region of the known chromosome. The novel genes can then be pinpointed by restriction mapping followed by sequencing. If the boundaries are not represented in the genome, those novel sequences can be used to generate a probe to identify overlapping cosmids. These cosmids in turn can be used to "walk" to the next sequences that match the known genome sequence. Once novel regions are precisely mapped they will be sequenced in their entirety.
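The three end-sequencing outcomes (a, b, c) above amount to a simple decision rule, sketched here. The function name, the use of reference coordinates to represent a match, and the wording of the suggested next steps are illustrative assumptions that paraphrase the text.

```python
def classify_cosmid(left_pos, right_pos):
    """Classify a cosmid by whether its two end sequences map to the reference.

    left_pos / right_pos: coordinate of each end read in the sequenced
    genome, or None if that end read has no match.
    Returns (case, next_step), with cases a, b and c as in the text.
    """
    mapped = sum(pos is not None for pos in (left_pos, right_pos))
    if mapped == 2:
        return "a", "map insert to known region; pinpoint novel genes by restriction mapping and sequencing"
    if mapped == 1:
        return "b", "anchor at the mapped end; use the unmapped end as a probe for overlapping cosmids"
    return "c", "use end sequences as probes and walk until known genome sequence is reached"
```

For example, `classify_cosmid(120000, None)` falls into case b: one end anchors the insert in the known chromosome, and the novel end supplies the probe for the next overlapping cosmid.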

The end result of these analyses will be a set of complete chromosome sequences in which all gene-sized deletions and insertions that differentiate the new isolates from the sequenced strains will have been delineated. These new sequences will then be used to generate primer pairs to PCR amplify the new genes from their corresponding strains so that the amplified products can be added to the microarray slides for use in the subsequent analysis of new isolates.

Genetic analyses

From here the work will take two directions. First, bioinformatics approaches will be used to attempt to identify the function of some of the new genes. This approach will help to narrow down the number of possible genes to be further analyzed by indicating those with interesting functions. Genes predicted to code for membrane components, for products involved in signal transduction, for global regulators or for proteins that have possible roles in interactions with the host or might be important in survival will be considered. We might expect that novel genes identified might be related to the strains' capacity to survive in different environments and thus might reveal potentially useful genes for application in agricultural biocontrol and bioremediation.

Second, genetic approaches will be used to generate mutations in these genes and to assess the effects of these mutations on key aspects of bacterial physiology, such as fitness within mixed-strain biofilms. Targeted gene disruption and transposon tagging should be straightforward, as these species have been genetically manipulated in the past. Alongside the mutants, we also plan to study the competitive fitness of the new isolates versus the sequenced strains by conducting competition experiments in biofilms formed in vitro.

Goal 3d: Technologies to analyze bacterial fitness. Mismatch methods

In order to monitor changes in phenotypes that result from single nucleotide changes in the genome, we have developed a method based on recognition of mismatched DNA by mismatch binding proteins. One system, termed GIRAFF (Genomic Isogenicity Review by Annealing of Fractionated Fragments), is based on the ability of the CELI nuclease to recognize and cleave DNA fragments at mismatched bases in heteroduplex DNA generated by annealing DNA from wild-type and mutant bacteria (Sokurenko et al., 2001). The method requires large-scale fractionation of DNA and probing of fractions by Southern hybridization. Although we have demonstrated that GIRAFF can be used reliably with several known mutations, it is rather labor intensive and is not easily applicable to large-scale genome mutational analysis. We are currently developing an alternative method, which takes advantage of the ability of the bacterial mismatch-binding protein MutS to recognize heteroduplexes containing at least one mismatched base pair. E. coli MutS bound to magnetic beads can be used to remove heteroduplexes from a mixture of fragments that have been digested with restriction enzymes, mixed and annealed. We demonstrated that the predicted DNA fragment, which contains several mutations and is present in a 12 kb plasmid (xcpTUVW-pDN18), can be enriched by MutS-mediated adsorption following digestion of the plasmid, denaturation and annealing of the combined fragments, and PCR amplification of the bound DNA. This specific fragment is not detected when denatured fragments from each digest are allowed to reanneal prior to MutS treatment. We are currently expanding this method to genome-wide detection of mutations in the P. aeruginosa chromosome, combining MutS isolation of mismatched fragments with identification of mutated genes using microarrays.

Goal 3d: Genome-wide transposon tag array quantitation to survey the relative fitness of thousands of different genotypes in a mixed microbial population

Developing a Genetic System for Prochlorococcus: Progress to date

A reliable and efficient way to introduce DNA into Prochlorococcus is a vital prerequisite for a genetic system. As such, our experiments thus far have aimed to demonstrate that conjugation with E. coli is an effective means to introduce foreign DNA into Prochlorococcus. In these experiments, Prochlorococcus MED4 cells were mated with an E. coli conjugal donor strain containing two plasmids, pRK24 and pRL153.

pRK24 encodes a broad host range conjugal apparatus that has been shown to mediate DNA transfer to a wide range of bacteria including myxobacteria, thiobacilli, and cyanobacteria (Elhai and Wolk, 1988). In fact, this plasmid has been used to transfer DNA from bacteria to fungal (Saccharomyces) and mammalian (CHO K1) cells (Waters, 2001). The other plasmid in the E. coli conjugal donor strain was pRL153, a kanamycin-resistant derivative of the broad host range plasmid RSF1010. RSF1010-derived plasmids have been found to replicate efficiently in the unicellular cyanobacteria Synechocystis sp. strains PCC6803 and PCC6714 and Synechococcus sp. strains PCC7942 and PCC6301, even though these plasmids contain no cyanobacterial DNA (Mermet-Bouvier et al., 1993). Thus, we hypothesized that pRL153 would be transferred into Prochlorococcus by conjugation and would replicate autonomously in Prochlorococcus cells, endowing them with kanamycin resistance.

To demonstrate that pRL153 could be introduced into Prochlorococcus by conjugation, it was necessary to detect pRL153 in pure MED4 cultures after mating. Even if the plasmid were detected, one could not be sure that it was in Prochlorococcus if residual E. coli from the mating survived in the medium. Although Pro99 medium (fortified Sargasso Sea water) does not contain an organic carbon source, experimental evidence indicates that E. coli can live in Prochlorococcus cultures for long periods of time (Tolonen, personal observation). Thus, Prochlorococcus matings were done with a MED4 strain isolated by Erik Zinser that is resistant to streptomycin at 100 micrograms/ml. This streptomycin concentration is well above the resistance levels of our Prochlorococcus and E. coli lab strains.

The conjugation methods that we are using for Prochlorococcus are based upon methods used for other cyanobacteria (Elhai and Wolk, 1988; Brahamsha, 1996). To mate E. coli and Prochlorococcus, late log cultures of both species are grown in liquid. The E. coli and Prochlorococcus cultures are then concentrated and mixed together. This mating mixture is spotted onto plates consisting of seawater and agarose. After 48 hours, the cells are excised from the plate and transferred to liquid seawater media with 100 µg ml^-1 streptomycin (Sm) and with or without 25 µg ml^-1 kanamycin (Kan) to see if the growth characteristics of the cells have changed. The experimental design is outlined in Table 3a.

Table 3a. Experimental design for MED4 conjugation experiments with E. coli. After mating, cells are transferred to media containing 100 µg ml^-1 Sm +/- 25 µg ml^-1 Kan. The E. coli strain with conjugal plasmid and transfer plasmid is 1100-2 (pRK24, pRL153). The E. coli strain with only the transfer plasmid is 1100-2 (pRL153). Results signifying a successful conjugation are indicated.

MED4 Treatment:	Outcome for MED4: Pro99 medium + Kan	Outcome for MED4: Pro99 medium - Kan
+ E. coli (w/ transfer plasmid and w/ conjugal plasmid)	growth	growth
+ E. coli (w/ transfer plasmid w/o conjugal plasmid)	no growth	growth
- E. coli	no growth	growth

We found that mating conferred kanamycin-resistance to MED4 relative to the unmated controls. Furthermore, the kanamycin-resistance phenotype of mated MED4 cells was dependent upon the presence of the pRK24 conjugal plasmid within the donor E. coli cells.

Once MED4 cells that had been mated with the E. coli conjugal donor began to grow under kanamycin selection, a sub-culture was transferred to fresh Sm Kan media. The growth of the mated MED4 cells was compared to that of unmated cells in media with or without kanamycin (Figure 3a).

Figure 3a. Growth curves comparing mated and unmated MED4 cells when transferred into fresh media with +/- 25 micrograms/ml Kan. Each treatment shows duplicate cultures. The MED4 conjugation cultures were mated with the E. coli donor 1100-2 (pRK24, pRL153).

After identifying mated MED4 cultures that were kanamycin-resistant, it was important to demonstrate that this resistance was due to the conjugal transfer of the kanamycin-resistance plasmid pRL153. The first step was to demonstrate that the post-mating liquid cultures contained no viable E. coli donor cells. Viable count assays on LB plates were unable to detect even a single live cell of E. coli from a 10 ml sample of the culture, indicating that the post-mating streptomycin selection killed all of the remaining E. coli cells.

The second step was to demonstrate by PCR that the pRL153 plasmid was present in the MED4 cultures. Two sets of PCR primers were designed to amplify different regions of pRL153. One primer pair was designed to amplify a 500 base pair region within the kanamycin-resistance gene and the other to amplify pRL153 sequence outside the kanamycin-resistance gene. PCR experiments were done directly on 5 microliters of MED4 culture.

MED4 cultures yielded PCR products with both primer pairs only after mating with E. coli (Figure 3b, below), indicating that pRL153 was successfully transferred to MED4 by conjugation. These experiments support the conclusion that conjugation with E. coli is a viable means to introduce DNA into Prochlorococcus.

Figure 3b. Using primers to amplify sequence inside or outside the kanamycin-resistance gene, PCR products are detected in mated MED4 cultures but not in unmated cultures. Lanes are numbered from left to right. Lanes 2-4 are PCR products using primers outside the kanamycin-resistance gene of pRL153. Lanes 5-7 are PCR products using primers inside the kanamycin-resistance gene of pRL153.

Lanes 1 and 8: 100 bp ladder

Lane 2: purified pRL153

Lane 3: unmated Prochlorococcus MED4

Lane 4: mated Prochlorococcus MED4

Lane 5: purified pRL153

Lane 6: unmated Prochlorococcus MED4

Lane 7: mated Prochlorococcus MED4

Experiments to develop transposon mutagenesis in Prochlorococcus

Previous experiments in our lab have demonstrated that conjugation with E. coli is a viable means to introduce foreign DNA into the Prochlorococcus cell. Currently, we are applying these conjugation techniques to determine if the Himar1 mariner transposon can be used to make random, tagged insertions in the Prochlorococcus chromosome. Himar1 has been shown to transpose efficiently in E. coli and Mycobacterium in vivo (Rubin et al., 1999). In fact, Rubin et al. concluded that this transposon will be active in any bacterial strain for which expression signals are known and for which there is a system for introducing DNA. Based upon their results, we are developing a Himar1-based in vivo transposition system for Prochlorococcus.

Initially, we are constructing a plasmid vector to deliver the Himar1 mariner transposon into Prochlorococcus. This vector has several salient features (fig. 3c below). First, the plasmid contains a mini-transposon consisting of a selectable marker flanked by 31 bp inverted repeats. The selectable marker is the chloramphenicol resistance gene driven by the Prochlorococcus psbA promoter. The psbA promoter is a strong Prochlorococcus promoter that will enable us to select Prochlorococcus cells that contain the mini-transposon using 0.5 µg ml^-1 chloramphenicol. The flanking inverted repeats are recognized by the transposase protein and mobilize the cassette for transposition into the chromosome. In addition, the plasmid contains an origin of transfer so that it can be mobilized for conjugal transfer from E. coli to Prochlorococcus. It also contains a colK origin of replication that is recognized by E. coli but not Prochlorococcus (not shown). Thus, once introduced into Prochlorococcus, this plasmid will not replicate autonomously, and all chloramphenicol-resistant cells are expected to contain integrated copies of the mini-transposon. Finally, this plasmid contains the transposase gene driven by the psbA promoter. This promoter will cause the transposase gene to be highly expressed in Prochlorococcus, facilitating transposition, and weakly expressed in the E. coli conjugal donor strain.

Figure 3c. Diagram of the plasmid vector to deliver the Himar1 mariner transposon into Prochlorococcus. This plasmid contains a mini-transposon flanked by inverted repeats. The inverted repeats are recognized by the transposase and mobilize the cassette for transposition. The mini-transposon has a chloramphenicol-resistance gene driven by the Prochlorococcus psbA promoter and an outward-facing T7 promoter. The T7 promoter will be used for in vitro transcription in quantitative growth experiments (next section). In addition, the plasmid has an origin of transfer so that it may be transferred into Prochlorococcus by conjugation. Finally, the plasmid has the Himar1 mariner transposase gene driven by the Prochlorococcus psbA promoter. This promoter results in high expression of the transposase gene in Prochlorococcus and weak expression in E. coli.

We will deliver the plasmid shown in Figure 3c into Prochlorococcus by conjugation. Subsequently, we will select for Prochlorococcus transconjugants with 0.5 µg ml^-1 chloramphenicol. If we can identify chloramphenicol-resistant colonies, we will attempt to detect the mini-transposon in the Prochlorococcus chromosome by PCR. Further, Southern blotting of DNA from 20-30 colonies will be used to determine whether the transposon did indeed insert randomly.

The ability to make random gene disruptions in Prochlorococcus will enable a host of new possibilities for genetic experiments. Initially, we will use it to make a genome-wide library of Prochlorococcus mutants. We will make this library by introducing the Himar1 transposon by conjugation, selecting transconjugants using chloramphenicol selection on plates, and transferring the resistant cells to liquid culture in 96 well plates. Cells in 96 well plates can then be frozen as stocks to form a mutant library. This library could be used to identify genes involved in diverse environmental adaptations, such as growth in adverse light conditions and nutrient stress. Foremost, we will apply the transposon library to the genome-wide analysis of quantitative growth phenotypes by competition in chemostats.

Prochlorococcus fitness experiments to select insertional mutants by competition in chemostats

An ideal way to study the role of a gene in mediating an adaptive response is to measure its contribution to the fitness of the organism. Traditional genetic screens identify genes that are required for growth under selective conditions. However, we would like to quantify the relative fitness contributions of these genes when the cell is under selection. To this end, we propose to use the methods developed by Badarinarayana et al. (2001; see attached) to study Prochlorococcus genes involved in nutrient stress. These methods facilitate the genomic analysis of quantitative growth phenotypes using insertional mutagenesis and DNA microarrays. To do these experiments, we propose to use transposon mutagenesis to make a library of random Prochlorococcus mutants. We will then pool these mutants and compete them in nutrient-replete and nutrient-limited chemostats. Using full-genome Prochlorococcus DNA microarrays, we will compare the relative abundances of mutants in the libraries before and after selection in chemostats to identify insertions that reduced the fitness of the organism under selection.

Our laboratory has extensive experience growing phytoplankton under nutrient-limited conditions in chemostats. In these experiments, we will grow a pool of transposon mutants in chemostats under nutrient-replete and nutrient-limited conditions. For example, both a nitrogen-replete and a nitrogen-limited chemostat will be inoculated with a pool of Prochlorococcus transposon mutants. The continuous cultures will then be maintained in log phase for a defined period of approximately 30 generations. Cells will then be harvested, and the relative abundance of each transposon mutant will be quantified using the methods described in Badarinarayana et al., 2001. By comparing the relative abundances in the nutrient-replete and nutrient-limited cultures to the original transposon library, we will identify genomic insertions that had negative fitness consequences specifically under nutrient-limited conditions. Further, we will be able to quantitatively compare the relative fitness deficits of different mutants. These experiments will provide us with a powerful means to discover genes involved in the nutrient stress responses of Prochlorococcus and to assign fitness contributions to those genes under nutrient-limited conditions.
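The comparison of mutant abundances before and after chemostat selection can be illustrated with a short calculation. The Python sketch below is only one plausible way to turn abundance signals into per-generation fitness estimates, not the procedure of Badarinarayana et al. It assumes that each mutant has a positive abundance signal in every sample, that most insertions are neutral (so the pool median serves as the neutral reference), and that selection acts evenly over the roughly 30 generations. The function names and input layout are hypothetical.

```python
import math
from statistics import median


def selection_coefficients(before, after, generations):
    """Per-generation selection coefficient of each mutant relative to the pool median.

    before, after: dicts mapping mutant ID -> abundance signal (must be > 0).
    """
    # Convert raw signals to within-pool frequencies so that overall
    # intensity differences between the two measurements cancel out.
    total_before = sum(before.values())
    total_after = sum(after.values())
    log_ratio = {
        m: math.log((after[m] / total_after) / (before[m] / total_before))
        for m in before
    }
    # Assumption: most insertions are neutral, so the median log ratio
    # stands in for a mutant with no fitness effect.
    neutral = median(log_ratio.values())
    return {m: (lr - neutral) / generations for m, lr in log_ratio.items()}


def condition_specific_deficit(before, replete, limited, generations=30):
    """Fitness change seen only under nutrient limitation (limited minus replete)."""
    s_replete = selection_coefficients(before, replete, generations)
    s_limited = selection_coefficients(before, limited, generations)
    return {m: s_limited[m] - s_replete[m] for m in before}
```

A mutant with a strongly negative value from `condition_specific_deficit` is depleted in the nutrient-limited chemostat but not in the nutrient-replete one, which is the pattern expected for an insertion in a gene needed specifically under nutrient stress.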
