Upstream sequences for human and mouse Refseq mRNAs

Paste human and/or mouse Refseq identifiers ("NM_" followed by six digits) to retrieve 5-kb upstream sequences (promoter regions) of mRNAs located in the respective genome.

OR: download all 5-kb human or mouse upstream sequences OR all 10-kb human or mouse upstream sequences

Details

This page provides information supplementary to "Computational comparison of two draft sequences of the human genome", Aach et al., Nature 409, 856-859 (15 February 2001). You can retrieve individual or multiple sequences by providing human and/or mouse Refseq IDs, or you can download the entire sets, separately for mouse and human. The sequences are in FASTA format. A header looks like this:

>UNM_006807.1|2148|NT_035623|99.6|96.0|9-1764=318023-316269=96%TCGT,1765-1917=316240-316090=95%CTAC,1918-2148=316061-315831=97%ACAA|TTATGGGAGGCTGAGGGGAGGGCCG

and consists of the following fields, separated by vertical bars:

1. the sequence identifier, namely "U" followed by the Refseq ID of the gene upstream of which this sequence lies;
2. the length of the Refseq mRNA (we have noticed that these lengths change over time at the NCBI);
3. the contig identifier;
4. the percentage of the mRNA covered by the alignment;
5. the average percent identity of the alignment;
6. details of the alignment (see below); and
7. the first 25 bases of the mRNA; since Refseq sequences may change, this can serve as a landmark.
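As an illustration of the field layout above, here is a minimal Python sketch that splits a header into its seven fields. The names `UpstreamHeader` and `parse_header` are hypothetical and are not part of any tool distributed with this page; the sketch assumes every header has exactly seven bar-separated fields, as in the example shown.

```python
from typing import NamedTuple


class UpstreamHeader(NamedTuple):
    seq_id: str          # "U" followed by the Refseq ID, e.g. "UNM_006807.1"
    refseq_id: str       # seq_id without the leading "U"
    mrna_length: int     # length of the Refseq mRNA
    contig_id: str       # contig identifier
    coverage_pct: float  # percentage of the mRNA covered by the alignment
    identity_pct: float  # average percent identity of the alignment
    alignment: str       # raw, comma-separated alignment details
    mrna_start: str      # first 25 bases of the mRNA (landmark)


def parse_header(line: str) -> UpstreamHeader:
    """Split one FASTA header line into the seven documented fields."""
    fields = line.strip().lstrip(">").split("|")
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    seq_id, length, contig, coverage, identity, alignment, landmark = fields
    if not seq_id.startswith("U"):
        raise ValueError(f"identifier does not start with 'U': {seq_id!r}")
    return UpstreamHeader(seq_id, seq_id[1:], int(length), contig,
                          float(coverage), float(identity), alignment, landmark)


example = (">UNM_006807.1|2148|NT_035623|99.6|96.0|"
           "9-1764=318023-316269=96%TCGT,1765-1917=316240-316090=95%CTAC,"
           "1918-2148=316061-315831=97%ACAA|TTATGGGAGGCTGAGGGGAGGGCCG")
header = parse_header(example)
```

With the example header, `header.refseq_id` holds the Refseq ID and `header.alignment` holds the unparsed alignment details.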

The details of the alignment consist of one or more (comma-separated) entries, each corresponding to a fragment of the mRNA, presumably an exon, aligned to the genome. For example, "1-422=7253179-7253600=99%CCGT" means that bases 1 to 422 of the mRNA match bases 7253179 to 7253600 of the genomic contig with 99% identity. The gene is on the positive strand since 7253600 is greater than 7253179. By contrast, "1-408=151092-150684=94%CTAA" means that the gene is on the negative strand, since 150684 is less than 151092. The four bases that follow "%" are the two bases in the genome before the fragment and the two after the fragment, i.e., the putative splice signals. For example, if you see "877-992=12749440-12749555=99%AGGT,993-1246=12749827-12750080=98%AGGT", you see that an AG precedes and a GT follows each exon.
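The alignment details can be decoded along these lines. This is a sketch, not the code used to build the data set: `Fragment`, `parse_alignment`, and `strand` are hypothetical names, and the pattern assumes each entry has exactly the shape shown in the examples above. A fragment whose genomic start and end coincide is reported as ambiguous, since the description above does not cover that case.

```python
import re
from typing import List, NamedTuple


class Fragment(NamedTuple):
    mrna_start: int
    mrna_end: int
    genome_start: int
    genome_end: int
    identity_pct: float
    before: str  # two genomic bases preceding the fragment
    after: str   # two genomic bases following the fragment


# One entry looks like "1-422=7253179-7253600=99%CCGT".
_ENTRY = re.compile(r"(\d+)-(\d+)=(\d+)-(\d+)=(\d+(?:\.\d+)?)%([A-Za-z]{4})")


def parse_alignment(details: str) -> List[Fragment]:
    """Parse the comma-separated alignment details into fragments."""
    fragments = []
    for entry in details.split(","):
        match = _ENTRY.fullmatch(entry.strip())
        if match is None:
            raise ValueError(f"unrecognized alignment entry: {entry!r}")
        m_start, m_end, g_start, g_end, identity, flank = match.groups()
        fragments.append(Fragment(int(m_start), int(m_end),
                                  int(g_start), int(g_end),
                                  float(identity), flank[:2], flank[2:]))
    return fragments


def strand(fragment: Fragment) -> str:
    """Return "+" if genomic coordinates increase, "-" if they decrease."""
    if fragment.genome_end > fragment.genome_start:
        return "+"
    if fragment.genome_end < fragment.genome_start:
        return "-"
    return "?"  # single-base fragment: orientation cannot be inferred


exons = parse_alignment(
    "877-992=12749440-12749555=99%AGGT,993-1246=12749827-12750080=98%AGGT")
minus = parse_alignment("1-408=151092-150684=94%CTAA")
```

Here `exons` holds two positive-strand fragments, each flanked by AG and GT, and `minus` holds one negative-strand fragment.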

The sequence consists of at most the 5000 bases directly upstream of the first base of the mRNA that matches the genome (as given in the alignment details above). It may be shorter than 5000 bases if the gene lies close to the edge of a contig.
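The extraction rule can be sketched as follows. Several points here are assumptions rather than facts stated on this page: contig coordinates are taken to be 1-based, the upstream region of a negative-strand gene is taken to lie at higher contig coordinates, and that region is reverse-complemented so that it reads in the orientation of the gene. The function name `upstream_region` is hypothetical.

```python
# Complement table for reverse-complementing negative-strand regions.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def upstream_region(contig: str, first_match_pos: int, strand: str,
                    max_len: int = 5000) -> str:
    """Return up to max_len bases directly upstream of first_match_pos.

    first_match_pos is the 1-based contig position (assumed convention) of
    the first mRNA base that matches the genome. The result is shorter than
    max_len when the contig edge is reached first.
    """
    if strand == "+":
        end = first_match_pos - 1           # 0-based index of the matched base
        start = max(0, end - max_len)
        return contig[start:end]
    if strand == "-":
        start = first_match_pos             # 0-based index just past the matched base
        end = min(len(contig), start + max_len)
        # Assumption: report the region in the gene's own orientation.
        return contig[start:end].translate(_COMPLEMENT)[::-1]
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


toy_contig = "ACGTTGCAAT"
plus_short = upstream_region(toy_contig, 5, "+", max_len=3)
plus_edge = upstream_region(toy_contig, 5, "+")
minus_short = upstream_region(toy_contig, 5, "-", max_len=3)
```

In the toy example, `plus_edge` is truncated to four bases because the matched position lies only four bases from the start of the contig.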

Methods

On 8/11/02, we downloaded from the NCBI the unmasked human (CHR_*/hs_chr*.fa.gz, build 30) and mouse (CHR_*/mm_chr*.fa.gz, MGSCv3_Release1) genomes as well as the human (hs.fna.gz) and mouse (mouse.fna.gz) Refseq sets. We removed non-NM entries (NG*, NC*). We used NCBI's MegaBLAST (v2.2.3) to align the mRNAs against their respective genomes with an E-value of 1e-25, a minimum identity of 95%, and filtering disabled for initial matches but enabled for extensions (i.e., we looked for strong initial matches). Then, for every contig, we realigned all mRNAs with reported matches, this time with filtering disabled and an E-value of 1e-10. For every DNA-mRNA pair, we assembled the best apparent alignment; that is, for every base of the mRNA, we kept the genomic fragment with the highest percent identity and, in case of ties, the fragment extending furthest along the mRNA. For each alignment, we calculated the percentage of the mRNA covered and the average percent identity. We multiplied these two numbers to obtain a score for the alignment (maximum 100 x 100 = 10000). For each mRNA, we kept the alignment with the highest score, or all tied alignments in case of ties. To ensure the accuracy of the score, we then redid the alignments with sim4, calculated the score as above, and kept the alignment(s) with the best score.
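The scoring and selection step can be illustrated with a short sketch. It covers only the score (coverage percentage times average percent identity) and the keep-all-ties rule; it does not reproduce the MegaBLAST or sim4 alignment itself. The function names and the candidate tuples are hypothetical, and the coverage and identity values in the example are made up for illustration.

```python
from typing import List, Tuple


def alignment_score(coverage_pct: float, identity_pct: float) -> float:
    """Score = percent of mRNA covered x average percent identity (max 10000)."""
    return coverage_pct * identity_pct


def best_alignments(candidates: List[Tuple[str, float, float]]) -> List[str]:
    """Return the labels of all candidates that share the highest score.

    Each candidate is (label, coverage_pct, identity_pct). Ties are kept,
    so the result may contain more than one label.
    """
    if not candidates:
        return []
    scored = [(alignment_score(cov, ident), label)
              for label, cov, ident in candidates]
    top = max(score for score, _ in scored)
    return [label for score, label in scored if score == top]


# Illustrative values only: two contigs tie for the best score.
candidates = [("contig_A", 99.6, 96.0),
              ("contig_B", 90.0, 99.0),
              ("contig_C", 99.6, 96.0)]
kept = best_alignments(candidates)
```

Because ties are kept rather than broken, a single mRNA can contribute more than one alignment, and hence more than one upstream sequence.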

Brief statistics

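For human, there were 15365 NM sequences in Refseq, and we have 15634 upstream sequences for 14106 of those gene sequences; for mouse, there were 8583 NM sequences in Refseq, and we have 8947 upstream sequences for 7660 of those gene sequences. There are more upstream sequences than genes because an mRNA yields more than one upstream sequence when several alignments tie for the best score.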

Updates

Adnan Derti, Lipper Center for Computational Genetics. Last updated 9/27/02.