The recent availability of various draft sequences of the assembled human and mouse genomes has opened the possibility of comprehensive searches for stretches of DNA that exhibit significant conservation over evolutionary time. We adopted a BLAST-based strategy, using the Celera draft mouse genome assembly and the NCBI draft human genome assembly, to define ~1.15 million “islands” of strong homology covering the most conserved ~10% of each genome. We have used these elements to generate a “humanized” version of the mouse genome and a “mousenized” version of the human genome. We are working towards making all of these resources publicly available in the near future. In the meantime, we offer from this page: (1) statistics on the current build of HUMMUS; (2) graphical representations of mouse-human conservation in the context of exon-intron mappings of mRNA sequences to genomic sequence.
Example of mRNA → genome mapping:
The x-axis represents genomic sequence. The green and blue boxes of the upper row represent individual exons of an mRNA mapped to genomic sequence with the SIM4 tool. Coding portions of the mRNA are colored blue and non-coding portions (e.g. the 5’ and 3’ UTRs) are colored green. In this case, the transcript proceeds from left-to-right along the contig. In the lower row (orange), the height of the vertical bar at each position is proportional to the level of mouse-human conservation over a 50-base-pair window of the overlaid human genome, centered at that position. That HUMMUS consists of a set of “islands” of strong conservation, rather than a continuous full alignment, can be clearly seen (see Methods). As expected, the positions of the coding exons correlate strongly with spikes in mouse-human conservation. Also notable are the stretches of conservation upstream and downstream of the transcript, as well as in its 3’ UTR.
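The windowed conservation score plotted in the lower row can be sketched roughly as follows. This is a minimal illustration, not the code used to draw the figures. It assumes that conservation is available as one boolean per genomic position (True where the overlaid base matches) and that the "level of conservation" is the fraction of True values in the window; the function name `window_conservation` and the truncation of windows at the sequence ends are also assumptions.

```python
def window_conservation(conserved, window=50):
    """Score each position by the fraction of conserved bases in a window.

    `conserved` is a sequence of booleans, one per genomic position
    (an assumed representation). The window of `window` positions is
    centered on each position and truncated at the sequence ends.
    """
    n = len(conserved)
    # Prefix sums let each window be scored in constant time.
    prefix = [0] * (n + 1)
    for i, flag in enumerate(conserved):
        prefix[i + 1] = prefix[i] + (1 if flag else 0)

    half = window // 2
    scores = []
    for i in range(n):
        start = i - half
        lo = max(0, start)
        hi = min(n, start + window)  # window covers positions [lo, hi)
        scores.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return scores
```

Calling `window_conservation(flags)` with the default `window=50` gives one score per position, which could then be drawn as the height of the bar at that position.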
HUMMUS is available in three forms:
islands of conservation: a set of ~1.15 million gapped alignments of stretches of syntenic, conserved mouse and human sequence
“mousenized” human genome: an overlay of a reference assembled human genome with corresponding mouse sequence
“humanized” mouse genome: an overlay of a reference assembled mouse genome with corresponding human sequence
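To make the idea of an overlay concrete, here is a small sketch of how aligned sequence from one species might be laid over a reference from the other. It is illustrative only and not the HUMMUS build procedure. It assumes each island supplies one character per reference base (with "-" where the other species has a gap), that islands use 0-based reference coordinates, and that positions outside every island are filled with a placeholder character. The function name `overlay` and the `fill` parameter are hypothetical.

```python
def overlay(reference, islands, fill="N"):
    """Overlay aligned sequence from another species onto a reference.

    `islands` is an iterable of (start, aligned_seq) pairs. `start` is a
    0-based reference coordinate and `aligned_seq` holds one character per
    reference base, with '-' marking a gap in the other species. Positions
    not covered by any island are set to `fill` (an assumed convention).
    """
    out = [fill] * len(reference)
    for start, aligned_seq in islands:
        for offset, base in enumerate(aligned_seq):
            out[start + offset] = base
    return "".join(out)
```

For example, `overlay("ACGTACGTAC", [(2, "GT-C")])` yields a string of the same length as the reference in which only positions 2 to 5 carry overlaid sequence.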
We present here some basic statistics on the “islands of conservation” and the “mousenized” human genome.
islands of conservation
Number of alignments: 1,150,000
Average alignment size: 254.6 bp
Estimated fraction of false-positive bases: ~0.0004
Estimated coverage: ~90%
Number of overlaid bases: 288,384,392 bp (~10% of genome)
Overall % nucleotide identity for overlaid regions: 75.3%
Overall % gapped bases in overlaid regions: ~2%
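As an illustration of how identity and gap percentages like those above can be derived from a gapped alignment, the sketch below scores a single pair of aligned strings. The exact definitions used for HUMMUS are not stated here, so this assumes both percentages are taken over all alignment columns and that "-" marks a gap; the function name `alignment_stats` is hypothetical.

```python
def alignment_stats(human, mouse):
    """Return (% identity, % gapped) for two equal-length aligned strings.

    Both percentages are computed over all alignment columns, which is an
    assumed definition. A column is gapped if either sequence has '-'.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    columns = len(human)
    if columns == 0:
        raise ValueError("alignment is empty")
    gapped = sum(1 for h, m in zip(human, mouse) if h == "-" or m == "-")
    identical = sum(1 for h, m in zip(human, mouse) if h == m and h != "-")
    return 100 * identical / columns, 100 * gapped / columns
```

Genome-wide figures would sum the identical, gapped, and total column counts across all islands before dividing, rather than averaging per-island percentages.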
To obtain an independent measure of coverage and to calculate statistics of conservation over known genes, we mapped 8,100 RefSeq transcripts with defined ORFs back to the “mousenized” human genome with the MEGABLAST and SIM4 tools (see Methods). Of these, 7,442 transcripts (~92%) intersected with HUMMUS-defined overlaid bases, consistent with our independent estimate of ~90% coverage.
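The intersection test behind this coverage figure can be sketched as an interval-overlap check. This is a simplified illustration under stated assumptions: exons and islands are half-open (start, end) intervals on the same contig, islands are sorted and do not overlap one another, and a transcript counts as covered if any exon shares at least one base with an island. The name `intersects` is hypothetical.

```python
from bisect import bisect_right


def intersects(exons, islands):
    """Return True if any exon overlaps any island by at least one base.

    `exons` and `islands` are lists of half-open (start, end) intervals.
    `islands` must be sorted by start and non-overlapping (assumed).
    """
    starts = [start for start, _ in islands]
    for ex_start, ex_end in exons:
        # Index of the last island that starts before the exon ends.
        i = bisect_right(starts, ex_end - 1) - 1
        if i >= 0 and islands[i][1] > ex_start:
            return True
    return False
```

Counting the transcripts for which `intersects` returns True and dividing by the number mapped gives a coverage fraction of the kind reported above (7,442 of 8,100).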
The following statistics were observed for the 7,442 covered mRNA sequences:
% of coding bases overlaid with mouse sequence: 93.0%
% overall nucleotide identity of overlaid coding bases: 84.2%
% of 5’ UTR bases overlaid with mouse sequence: 48.2%
% overall nucleotide identity of overlaid 5’ UTR bases: 78.6%
% of 3’ UTR bases overlaid with mouse sequence: 49.3%
% overall nucleotide identity of overlaid 3’ UTR bases: 77.3%
73% of the transcripts had 3’ UTRs with at least 100 bp of aligned sequence
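Per-feature figures such as "% of coding bases overlaid" amount to counting the bases a feature class shares with the islands. The sketch below shows one way to compute this for a single transcript. It assumes half-open (start, end) genomic intervals, islands that do not overlap one another, and non-overlapping feature intervals; the name `overlaid_fraction` is hypothetical and the quadratic loop is for clarity rather than speed.

```python
def overlaid_fraction(features, islands):
    """Return the fraction of feature bases that fall inside islands.

    `features` (e.g. the coding or UTR portions of exons) and `islands`
    are lists of half-open (start, end) intervals. Intervals within each
    list are assumed not to overlap one another.
    """
    total = sum(end - start for start, end in features)
    if total == 0:
        raise ValueError("no feature bases to score")
    covered = 0
    for f_start, f_end in features:
        for i_start, i_end in islands:
            # Length of the shared span, or 0 if the intervals are disjoint.
            covered += max(0, min(f_end, i_end) - max(f_start, i_start))
    return covered / total
```

To pool across many transcripts, the covered and total base counts would be summed first and divided once, so that long transcripts are weighted by their length.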
The NCBI UniGene resource is the result of partitioning millions of ESTs and tens of thousands of mRNAs into a “non-redundant, gene-oriented” set of clusters. As part of the resource, a single sequence is selected from each cluster as its longest, highest-quality member. We refer to these sequences as the Best-Of-UniGene (BOU) sequences. Many of the BOUs are full-length mRNAs (often, though not always, the same mRNAs found in the RefSeq resource). We used the MEGABLAST tool to map approximate genomic locations and the SIM4 tool to obtain more precise mappings of each human and mouse UniGene BOU sequence onto genomic coordinates of the overlaid human and mouse genomes, respectively. Graphical representations integrating information on exon-intron structure and mouse-human conservation were generated for each UniGene cluster. The map above (see Introduction section) is an example. A full set of graphical representations for all UniGene clusters that we were able to map is available:
Computational Discovery of Overlapping Transcriptional Units in the Human and Mouse Genomes
Jay Shendure
jay_shendure@student.hms.harvard.edu
Last revised: November 11, 2003