SUPPLEMENTAL MATERIAL:

Methods

RNA analysis

To construct a library of enriched miRNAs, endogenous 18nt to 26nt RNAs were gel-isolated from total RNA from the N2 wild-type C. elegans strain, ligated with 5' and 3' oligonucleotide linkers, and amplified by RT-PCR using antisense DNA oligonucleotides to the linker sequences as described previously17.  The resolution of the amplified library of small RNAs by 15% SDS/Urea-PAGE, blotting onto nylon membrane, and hybridization of the blots with the miRNA probes were performed as described previously5, 27.  Total RNA isolation and Northern blot procedure have been described previously5, 27.  A qualification concerning detection by both Northern blots and our amplification protocol is that they may not definitively identify the sequence of a detected miRNA if multiple hairpins in a genome contain ~20nt stem sequences very nearly complementary to the probe.  The candidate miRNA and let-7 sequences are presented in Table I in Supplemental Materials; antisense probes were designed to these sequences and used for the Northern analysis and for probing the blots of the enriched small RNAs.  The total RNAs from human tissues were purchased from Clontech.  The total RNAs from D. melanogaster developmental stages were kind gifts from M. Kuroda (Baylor College of Medicine, Houston, TX.) and N. Perrimon (Harvard Medical School, Boston, MA.). The dcr-1 (ok-247) strain was grown as previously described28. 

 

Computational Methods:

Sequences and annotations: We used genome assemblies as follows: C. elegans (produced by the C. elegans Sequencing Group at the Sanger Institute and Genome Sequencing Center at Washington University) downloaded from http://www.sanger.ac.uk on 7 March, 2001, D. melanogaster (release 2) downloaded from http://www.fruitfly.org on 9 May, 2001 (ref 29) repeat-masked human genome sequence downloaded from http://www.ncbi.nlm.nih.gov on 13 August, 2001 (ref 30).  Annotations downloaded with the C. elegans and D. melanogaster genome sequences were used to identify intragenic regions.  The C. elegans and D. melanogaster genome sequences were repeat masked using the RepeatMasker version dated 19 June, 2001 (Smit, AFA and Green, P., unpublished results).  We downloaded unassembled genomic reads of C. briggsae (~6X coverage) from the Washington University Sequencing Center at http://www.genome.wustl.edu/ on 21 September, 2001.

miRNA test set: A test set of 53 miRNAs made available to us thanks to Thomas Tuschl and later published in Lagos-Quintana, et al., 2001 (ref 18), was used to test the effect of combinations of parameter settings used in generating predicted miRNA hairpin sets.  The test set included several variants of let-7.

srnaloop: srnaloop is a BLAST-like algorithm that looks for short complementary words within a specified distance and uses dynamic programming to determine a complete alignment.  Compared to BLAST24, srnaloop supports shorter word lengths and aligns complementary base pairs (including GUs).  See our web site (http://arep.med.harvard.edu/miRNA/) for additional information and the software itself.  Because we wanted to accommodate the possibility that there might be miRNA precursor hairpins longer than the ~70 nt associated with currently known miRNAs, we generally directed srnaloop to look for hairpins with lengths ≤ 95 (-l 95).  Score thresholds (-t) were generally either 23.5 or 23.  srnaloop scores are aggregate match, mismatch, and gap scores accumulated over the duplex region of the hairpin and are not penalized for hairpin loop size or normalized for hairpin length (which is controlled by –l).  Therefore srnaloop scores are correlated with many other parameters (see Parameter correlations below).

Stutter filtering: In searching a sequence for hairpins of a certain length, srnaloop may find two or more hairpins on the same strand that overlap for a considerable percentage of their lengths, a phenomenon we called “stuttering.”  Stutter filtering consists of iteratively cycling through predicted hairpins on a strand by strand basis, detecting overlaps whose length exceeds a threshold fraction of the smaller of the two overlapping hairpin lengths, and eliminating the hairpins with the smaller srnaloop score.

Folding energy and structure filters: Sets of predicted hairpins were processed by RNAfold31 using the –d0 option to compute minimum free energies of folding and structure characteristics such as numbers of multiloops.

Parameter correlations: Many parameters used to identify and filter candidate miRNA hairpin sequences are strongly correlated.  For example, using the hairpin sequences associated with the miRNA test set (above), we found that srnaloop score and RNAfold-computed folding energy were correlated at -0.57, hairpin sequence length and srnaloop score were correlated at 0.77, and hairpin sequence length and RNAfold-computed folding energy were correlated at  -0.57, and GC content and RNAfold-computed folding energy at -0.49 (Pearson correlation coefficients).

Correspondence determination: To find correspondences between hairpins in a predicted miRNA hairpin set SA from species A in an assembled, repeat-masked genome sequence for species B, each sequence in SA is BLASTed against the genome sequence for B using –W 8 and –e 100 (ref 24).  BLAST hits less than 20 nt in length are ignored if their location in a query sequence in SA is not entirely on one side of the sequence midpoint or the other, a heuristic filter whose purpose is to ensure that BLAST hits represent possible mature miRNA sequences in hairpin stems and not hairpin loop sequences.  Repeat-masked sequence regions surrounding target BLAST hits in genome B are then extracted so that the target BLAST hit is in the same location in the extracted B sequence as the query hit is in the SA sequence, plus up to 10 nt padding on either end.  Overlapping B extracts were merged.  Extracted B sequence is then analyzed for hairpins with srnaloop, and re-verified for the presence of a BLAST target hit on the same side of the computed B hairpin sequence midpoint as the BLAST query hit in the A hairpin for the BLAST hit that generated the extract.  Srnaloop parameters for extracted B sequence may be different from those that generated the initial SA hairpins and always specify a single-stranded search.  Hairpins from B that pass this consistency check are then filtered for GC content and folding energy and structure and comprise the set of B hairpins that ‘correspond’ to SA.  Sets of corresponding hairpins may contain multiple instances of a given sequence if that sequence is duplicated in the A or B genome.

Transitivity filter: In cases where a set of hairpins SA from species A is used to find corresponding sequences SB in species B, and then SB used to find corresponding sequences SC in species C, there is both a BLAST hit that establishes the correspondence between a hairpin sequence HA in SA and hairpin sequence HB in SB, and another BLAST hit that establishes the correspondence between HB in SB to hairpin sequence HC in SC.  However, it may be the case that the target site in HB for the HA®HB correspondence does not overlap the query site in HB for the HB®HC correspondence.  The transitivity filter looks for the subset of corresponding hairpins for which the target site in HB for HA®HB overlaps the query site in HB for HB®HC.  For the CDH set (see below), all C. elegans, D. melanogaster, and human hairpins were BLASTed against each other, and those hairpins with BLAST matches exceeding 14 nt and overlapping in sequence position were further assembled into groups, where at most 12 nt of non-overlap was allowed between the D. melanogaster sequence that is the C. elegans hairpin target and that which is the human hairpin query. 

Short repeat filtering: Although all assembled genomic sequences analyzed were repeat masked, many derived sets of hairpins contained mononucleotide strings or approximate tandem repeats of short words.  In some cases we therefore filtered out hairpins containing 10 nt long mononucleotide strings, or tandem consecutive repeats of 2-4 bases up to lengths 12, 15, 16, respectively, allowing in each case one single base mismatch, deletion, insertion, and a possible insertion between each repeated block.  None of the available cloned miRNA sequences in our set is rejected by this filter.

Structure quality filtering: As a final structure filtering step in generating predicted sets of miRNA hairpins as indicated below, we regenerated hairpin structures using mfold software (http://www.bioinfo.rpi.edu/~zukerm/rna/ ; ref 32, 33) and retained only those sequences with predicted hairpins containing no multiloops.  These filters also ensured that the BLAST hit sequence in the hairpin establishing the correspondence or the hairpin sequence that matches the query mature miRNA sequence is entirely within a duplexed region characterized by the following topology limitations: no bulges of greater than 3 nt, no more than a 5 nt bulge on the opposite stem, and absence from the loop region. 

8713 C. elegans hairpin set: A set of 16,216 intergenic regions was extracted from the repeat masked C. elegans genome and both strands analyzed by srnaloop using parameters including –w 4 –dw 1 -~2w -t 23.5 -l 95 and –sm 0.  In cases where multiple transcripts for a gene were annotated, intergenic regions were based on the smallest transcript.  The 494,319 resulting hairpin sequences were successively filtered to meet the following criteria: GC content ≥ 32.8% and ≤ 62.5%, stutter-filtration at 66% length overlaps, RNAfold-determined minimum free energy ≤ -32.5 kcal/mol and no multiloops, to yield the 8,713 set of C. elegans hairpins.  To preserve genome locations of all hairpins, duplicate hairpin sequences were maintained in the 8,713 set.  As our test set of miRNA sequences (above) did not contain C. elegans miRNAs, we could not directly verify the presence of test set sequences in the 8,713 set of hairpins except for let-7 (lin-4 could not be used to test the 8,713 set because, being found within an intron4, it was excluded by our use of only C. elegans intergenic sequences).  However, similar parameters (but using a lower srnaloop score threshold) passed > 54% of the hairpin sequences associated with the miRNA test set.  We subsequently reassessed our filters as biochemically cloned C.elegans miRNAs were reported: in the current set of 61 hairpins for C. elegans miRNAs (see below), 39 (63.9%) are present in our repeat-masked C. elegans intergenic sequence and 29 (47.5%) are present among the 8713, so that 29/39 (74.4%) of all hairpins available in the sequence analyzed passed our filters.  The 22 hairpins not present in our analyzed sequence are either in genic sequence (including introns) or in clone sequence not found in the C. elegans assembly.  Of the 39 miRNA hairpins present in our sequence, 37 of these 39 passed the srnaloop score threshold, all 37 of these 37 passed GC content and stutter filtering, and 29 of these 37 passed the predicted minimum free energy and multiloop threshold.  We repeated this assessment for the complete set of 61 hairpins where sequence for the 22 hairpins not in our analyzed sequence was taken from C. elegans clones, and found that 46/61 (75.4%) of all hairpins identified for cloned C. elegans miRNAs would have passed our filters.  Of the complete set of 61, 58 of these 61 passed our srnaloop score threshold, 54 of these 58 passed our GC content threshold, and 46 of these 54 passed the predicted minimum free energy and multiloop threshold.

To estimate the probability of finding a C. elegans hairpin that met our criteria by chance, we randomly extracted 1,500 sequences of 100nt from our repeat-masked intergenic sequence and rejected any that contained a string of 8 or more uncalled bases (Ns), leaving a set of 1,154.  We subjected these 1,154 to the same GC content, stutter, structure, and energy filters as were used to derive the 8,713.  Only three of the 1,154 randomly extracted sequences generated hairpins passing all our constraints.  We therefore estimate the probability to be 3/1154 = 0.0026.  This is a liberal estimate because it does not consider sequences with many more Ns that would likely fail our criteria, and also because, unlike with the 8,713, we applied stutter filtering last to guarantee that no hairpin that might meet all constraints might be rejected in favor of an overlapping hairpin that failed a constraint.

 6086 C.elegans hairpin set: For the homology-based miRNA hairpin prediction set, we further filtered the 8,713 C.elegans hairpin set by removing 145 additional sequences that potentially overlapped coding regions (based on longer transcripts for genes with multiple annotated transcripts than first used to extract intergenic regions for the 8,713 set), removing duplicates, applying short repeat filtering, and applying structure quality filtering.  The result was a set of 6,086 C. elegans hairpins.

Drosophila correspondences to the 8713 set: Correspondences in the repeat-masked D. melanogaster genome sequence were determined as described above based on 900,127 BLAST hits.  D. melanogaster genomic sequence was extracted with 10 nt padding on each end.  Srnaloop applied to extracted D. melanogaster genomic sequence used the –t 23 parameter instead of –t 23.5, resulting in 85,999 fly hairpin sequences.  These were re-filtered for srnaloop score of ≥ 24, filtered for GC content ≥ 32.8% and ≤ 63.5%, “stutter”-filtered at 66% overlap, and filtered for RNAfold-determined minimum free energy of ≤ -30 kcal/mol and no multiloops, resulting in 4,778 D. melanogaster hairpins.  Correspondence mapping with the original 8,713 query C. elegans hairpins resulted in 3,514 D. melanogaster and 3,019 C. elegans hairpins, with both sets of hairpins containing duplicate sequences.  Removing duplicates led to a set of 3,505 distinct D. melanogaster and 2,523 distinct C. elegans hairpins.  Nine miRNAs from the test set were found in the predicted fly miRNAs, including let-7, mir-1, mir-2a, mir-2b, mir-8, mir-9, mir-10, mir-13a, and mir-13b.  

CDC set (C.elegansàD. melanogasterà C.briggsae):  The set of 2,523 distinct C. elegans hairpins was BLASTed into the C. briggsae genome sequence reads using an e-value cutoff of 10-14, and yielded a set of 95 hairpins.  Thirteen of these hairpins had between 14 and 253 hits in the C. briggsae genome and were removed because of the possibility they represented repeat sequences.  Two of the remaining 82 hairpins were identical except for the presence of a single “N” in one hairpin sequence; they were deemed equivalent.  Thus, the set of hairpins that showed correspondence between C. elegans and D. melanogaster, and had high homology between C. elegans and C. briggsae consisted of 81 distinct hairpins; this set of hairpins included 6 of the 8 total known miRNAs that have been identified as conserved between C. elegans and D. melanogaster16-18.  All 81 were tested by Northern blot.  Subsequently, we selected 28 higher quality predictions (CDC-f set) on the basis of structure and sequence.  The criteria used were later codified as short repeat and structure quality filtering.

CDH set (C. elegansàD. melanogasteràhuman): Correspondences between the 3,505 distinct D. melanogaster hairpins found to correspond to the 8,713 C. elegans hairpin set and human genomic sequence were determined as described above and based on a set of 1,328,689 BLAST hits.  Sequence extraction, srnaloop hairpin identification, and stutter-filtering, GC filtering, srnaloop score re-filtering, and RNAfold-computed folding energy were performed using the same parameters as for the D. melanogaster correspondences on the 8,713 C. elegans set except that the folding energy criterion originally used for the 8,713 C. elegans hairpins was used (energy ≤ -32.52 kcal/mol).  After BLAST hit correspondences with the 3,505 D. melanogaster hairpin sequences were established and duplicates were eliminated, this left 6,630 distinct human and 2,246 distinct D. melanogaster hairpin sequences, which corresponded to 1,729 distinct sequences from the original 8,713 C. elegans hairpin set.  Ten test miRNA sequences from human and fly, including five let-7 variants, were found in the combined corresponding fly and human hairpin set.  At this point, we applied transitivity filtering.  Groups of hairpins corresponding transitively across all three species with greater than five D. melanogaster hairpins and/or greater than 12 human hairpins were considered to contain possible repeat sequence, and were removed.  The groups were then filtered for C. elegans hairpins whose D. melanogaster BLAST match region was duplexed according to the structure quality filtering described above, leaving 162 C. elegans hairpins.  Corresponding D. melanogaster hairpins were then subjected to the same structure filter over the BLAST match region, yielding a set of 96 C. elegans hairpins where the matching region between the C. elegans and D. melanogaster hairpins adheres to the structural criteria in both species.  Short repeat filtering, removal of possible coding sequence, and a further selection resulted in a set of 40 hairpins, six of which are published miRNAs16,17.

C. elegans miRNA homolog set: We used matcher, a pure Smith-Waterman algorithm, from the EMBOSS v2.3.1 software package26 to align each of 164 miRNA sequences against the C. elegans set of 6,086 hairpins described above, using default settings except for gappenalty=10 and gaplength=1 (parameters which are permissive for gaps).  Using matcher in place of BLAST, allowed us to overcome the BLAST requirement for a matching word of at least 7 nt between sequences, a limitation when applied to very short miRNA sequences.  Hairpins with perfect matches or matches where gaps account for 15% or more of the sequence were removed.  The resulting matches with matcher algorithm scores greater than 60 were subjected to structure quality filtering (see above), yielding a set of 190 hairpins.  A round of filtering to remove hairpins with homolog matches at the extreme ends of hairpins left a total of 116 candidate worm hairpin orthologs and paralogs of known miRNAs, of which three were identified by the CDC and CDH algorithms.

Clustering and multiple alignments of miRNAs: We used the matcher program to align each of 233 mature metazoan miRNA sequences16-20 and our five confirmed novel miRNAs against each other, using the same parameters as in the homology algorithm (default, except gappenalty=10 and gaplength=1).  We generated a dissimilarity matrix from pair-wise matcher alignment scores and performed hierarchical complete clustering.  From the clustering, we drew a dendrogram in Matlab (Supplemental Figure 2).  We then cut the dendrogram to yield clusters that include the large let-7 variant cluster and additional clusters we viewed as promising (Figure 3 and Supplemental Table 2).  The resulting clusters containing multiple miRNAs were then individually aligned using CLUSTAL W34 using an alignment gap-penalty of 8 (permissive towards gaps).  Each cluster was evaluated for whether the aligned subsequences were from similar regions of the constituent miRNAs. We adjusted by hand the composition of clusters to emphasize long contiguous blocks of aligned bases in common locations.  This resulted in a set of ~40 clusters (Supplemental Table 2).

Enumeration of cloned C. elegans miRNAs:  We compiled cloned C. elegans miRNA information16,17, identified duplicates, and checked source C. elegans genome sequence data.  We determined that there were a total of 62 miRNAs and 61 distinct hairpins.  We could not confirm a stable hairpin secondary structure for one reported miRNA (mir-89) and hence excluded it from these counts.

Analysis of conservation of predicted C. elegans  miRNAs:  We analyzed conservation of C. elegans miRNAs in two contexts.  To determine which of the 61 hairpins for cloned C. elegans miRNAs (above) were conserved, we first found that nine were reported in the literature as having potential homologs in the D. melanogaster, M. musculus, or H. sapiens genomes.  We then applied the method described above for finding D. melanogaster correspondences to the 61, which is more sensitive than ordinary BLASTing, to find nine additional apparently conserved sequences.  Therefore we count 18 miRNA hairpins of the 61 as conserved.  We also identified conserved sequences from our own predictions:  We counted all sequences in the CDC-f and CDH sets as conserved by dint of correspondences with D. melanogaster and/or H. sapiens, but only the 63 out of 116 sequences from our homology set that were found to correspond to a D. melanogaster, M. musculus, or H. sapiens miRNA.  After removing duplicates we were left with 119 conserved predicted miRNA hairpins.

 

 

REFERENCES:

1.         Rougvie, A.E. Control of developmental timing in animals. Nature Reviews Genetics 2, 690-701 (2001).

2.         Pasquinelli, A.E. & Ruvkun, G. Control and developmental timing by microRNAs and their targets. Annu Rev Cell Dev Biol 28, 28 (2002).

3.         Feinbaum, R. & Ambros, V. The timing of lin-4 RNA accumulation controls the timing of postembryonic developmental events in Caenorhabditis elegans. Developmental Biology 210, 87-95 (1999).

4.         Lee, R.C., Feinbaum, R.L. & Ambros, V. The C. elegans heterochronic gene lin-4 encodes small RNAs with antisense complementarity to lin-14. Cell 75, 843-54 (1993).

5.         Reinhart, B.J. et al. The 21-nucleotide let-7 RNA regulates developmental timing in Caenorhabditis elegans. Nature 403, 901-6 (2000).

6.         Wightman, B., Ha, I. & Ruvkun, G. Posttranscriptional regulation of the heterochronic gene lin-14 by lin-4 mediates temporal pattern formation in C. elegans. Cell 75, 855-62 (1993).

7.         Slack, F.J. et al. The lin-41 RBCC gene acts in the C. elegans heterochronic pathway between the let-7 regulatory RNA and the LIN-29 transcription factor. Molecular Cell 5, 659-69 (2000).

8.         Pasquinelli, A.E. et al. Conservation of the sequence and temporal expression of let-7 heterochronic regulatory RNA. [see comments.]. Nature 408, 86-9 (2000).

9.         Grishok, A. et al. Genes and mechanisms related to RNA interference regulate expression of the small temporal RNAs that control C. elegans developmental timing. Cell 106, 23-34 (2001).

10.       Ketting, R.F. et al. Dicer functions in RNA interference and in synthesis of small RNA involved in developmental timing in C. elegans. Genes & Development 15, 2654-9 (2001).

11.       Hutvagner, G. et al. A cellular function for the RNA-interference enzyme Dicer in the maturation of the let-7 small temporal RNA. [see comments.]. Science 293, 834-8 (2001).

12.       Bernstein, E., Caudy, A.A., Hammond, S.M. & Hannon, G.J. Role for a bidentate ribonuclease in the initiation step of RNA interference. [see comments.]. Nature 409, 363-6 (2001).

13.       Hannon, G.J. RNA interference. Nature 418, 244-51 (2002).

14.       Ha, I., Wightman, B. & Ruvkun, G. A bulged lin-4/lin-14 RNA duplex is sufficient for Caenorhabditis elegans lin-14 temporal gradient formation. Genes & Development 10, 3041-50 (1996).

15.       Olsen, P.H. & Ambros, V. The lin-4 regulatory RNA controls developmental timing in Caenorhabditis elegans by blocking LIN-14 protein synthesis after the initiation of translation. Dev Biol 216, 671-80 (1999).

16.       Lee, R.C. & Ambros, V. An extensive class of small RNAs in Caenorhabditis elegans. [see comments.]. Science 294, 862-4 (2001).

17.       Lau, N.C., Lim, L.P., Weinstein, E.G. & Bartel, D.P. An abundant class of tiny RNAs with probable regulatory roles in Caenorhabditis elegans. [see comments.]. Science 294, 858-62 (2001).

18.       Lagos-Quintana, M., Rauhut, R., Lendeckel, W. & Tuschl, T. Identification of novel genes coding for small expressed RNAs. [see comments.]. Science 294, 853-8 (2001).

19.       Lagos-Quintana, M. et al. Identification of tissue-specific microRNAs from mouse. Curr Biol 12, 735-9 (2002).

20.       Mourelatos, Z. et al. miRNPs: a novel class of ribonucleoproteins containing numerous microRNAs. Genes Dev 16, 720-8 (2002).

21.       Grosshans, H. & Slack, F.J. Micro-RNAs: small is plentiful. J Cell Biol 156, 17-21 (2002).

22.       Reinhart, B.J., Weinstein, E.G., Rhoades, M.W., Bartel, B. & Bartel, D.P. MicroRNAs in plants. Genes Dev 16, 1616-26 (2002).

23.       Rhoades, M. et al. Prediction of plant microRNA targets. Cell 110, 513 (2002).

24.       Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. Basic local alignment search tool. Journal of Molecular Biology 215, 403-10 (1990).

25.       Lai, E.C. Micro RNAs are complementary to 3' UTR sequence motifs that mediate negative post-transcriptional regulation. Nat Genet 30, 363-4 (2002).

26.       Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends in Genetics 16, 276-7 (2000).

27.       Ausubel, F.M. et al. Current Protocols in Molecular Biology, (John Wiley and Sons, New York, 1995).

28.       Dent, J.A., Davis, M.W. & Avery, L. avr-15 encodes a chloride channel subunit that mediates inhibitory glutamatergic neurotransmission and ivermectin sensitivity in Caenorhabditis elegans. EMBO J 16, 5867-79 (1997).

29.       Adams, M.D. et al. The genome sequence of Drosophila melanogaster. Science 287, 2185-95 (2000).

30.       Lander, E.S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860-921 (2001).

31.       Hofacker, I.L. et al. Fast folding and comparison of RNA secondary structures. Monatshefte f. Chemie 125, 167-188 (1994).

32.       Mathews, D.H., Sabina, J., Zuker, M. & Turner, D.H. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. Journal of Molecular Biology 288, 911-40 (1999).

33.       Zuker, M., Mathews, D.H. & Turner, D.H. Algorithms and thermodynamics for RNA secondary structure prediction: a practical guide. In RNA Biochemistry and Biotechnology (eds. Barciszewski, J. & Clark, B.F.C.) 11-43 (Kluwer Academic Publishers, New York, 1999).

34.       Thompson, J.D., Higgins, D.G. & Gibson, T.J. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research 22, 4673-80 (1994).