Project Ideas, 2002

Genie Hainsworth genie@hms.harvard.edu I work at

Protein-Protein Interactions: Network Structures

I
      would like to look at data from protein-protein interaction experiments, 
      and construct networks showing their connections. By examining the 
      structure and connectivity of these networks, I hope we can deduce 
      something about the roles of particular proteins. The 
      protein chip is created by depositing a microarray of different plasmid 
      DNAs on a glass slide, then using in vitro transcription/translation (IVT) 
      to produce the corresponding proteins. The genes in the plasmids all 
      include an affinity tag, so that as the proteins are made, they become 
      immobilized on the slide. Already, part of my work is to analyze scanned 
      images of these slides, and quantitate. We may have a small amount of real 
      data by early December, which I would like to start analyzing with the 
      method we develop in the project.  Steven 
      Corsello corsello@fas.harvard.edu Aim: To correlate 
      microarray data with the promoter site consensus sequence for a specific 
      transcription factor. Most transcription 
      factors bind a known consensus sequence in the promoter region of a gene. 
      However, often the sequence contains degeneracies, such as TTNNNNNAA. This 
      project would develop a model in which to score tendencies for a 
      particular base to be incorporated into the site, and then compare this 
      result with the fold of gene induction reported on the 
      microarray. This project can be done 
      in either a human or yeast system. Yeast would likely be more 
      straightforward since gene transcription is better characterized in this 
      system, and the promoter region is easier to 
      identify. My background is in 
      biochemistry and cell biology, so a computer scientist, statistician, or 
      someone with experience interpreting microarray data would be particularly 
      helpful. Joshua Rene 
      Lacsina lacsina@fas.harvard.edu I would like to focus on 
      genomic analysis of parasitic human pathogens, particularly Plasmodium 
      falciparum, one of the causative agents of malaria, and Leishmania major, 
      the causative agent of leishmaniasis, an endemic disease primarily 
      affecting the third world. The genome sequence of Plasmodium falciparum
      has been completed, with significant annotation published for chromosomes 
      2 and 3. In contrast, only chromosome 1 of Leishmania has been completed. 
      I developed the following project idea based on a series of articles 
      focusing on malaria genomics published in Molecular and Biochemical 
      Parasitology, available electronically via Hollis: Finding novel motifs in 
      the P. falciparum genome based on an algorithm that searches for 
      repetitions in DNA sequences. I still haven't thought of a good way to 
      enrich my dataset for biologically-relevant/interesting things, though the 
      complete sequence of the mosquito genome could be useful...I'm open to 
      suggestions... Let me know via e-mail if 
      you are interested. Here are some references--the first reference is the 
      journal issue I referred to above (available on Hollis) containing several 
      articles pertinent to this topic, so take a look at as many of them as you 
      wish:
      Molecular and Biochemical Parasitology. (2001) v. 118: pp. 127-302.
      GlimmerM. http://www.tigr.org/software/glimmerm/.
      Huestis R and A. Saul, An algorithm to predict 3' intron splice sites in Plasmodium falciparum genomic sequences. Mol. Biochem. Parasitol. 112 (2001), pp. 71-77.
      Myler PJ, et al., Leishmania major Friedlin chromosome 1 has an unusual distribution of protein-coding genes. PNAS 96 (1999), pp. 2902-2906.

      Leo
      Aristotle S. Hizon Leo.Hizon@mpi.com Simulation of the 
      recombination of antibody genes by using perl to predict the amino acid 
      sequences of the variable region of the antibody. -modeling of semi-random
      recombination events in B-cells                    
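
The recombination idea above can be sketched in a few lines. This is only an illustrative sketch, written in Python rather than the Perl the proposal mentions: the V, D and J segments are made-up toy sequences, and the trimming and insertion limits (`max_trim`, `max_insert`) are hypothetical parameters, not measured values. It picks one segment of each kind at random, trims the V and J ends, adds random junctional bases, and translates the result.

```python
import random

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna):
    """Translate DNA in frame 0; leftover bases that do not fill a codon are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def recombine(v_segments, d_segments, j_segments,
              max_trim=3, max_insert=3, rng=random):
    """Join one random V, D and J segment with semi-random junctions.

    max_trim and max_insert are illustrative assumptions, not biological constants.
    """
    v = rng.choice(v_segments)
    d = rng.choice(d_segments)
    j = rng.choice(j_segments)
    # Trim a random number of bases from the 3' end of V and the 5' end of J.
    v = v[:len(v) - rng.randint(0, max_trim)]
    j = j[rng.randint(0, max_trim):]
    # Add random bases at the V-D and D-J junctions.
    n1 = "".join(rng.choice(BASES) for _ in range(rng.randint(0, max_insert)))
    n2 = "".join(rng.choice(BASES) for _ in range(rng.randint(0, max_insert)))
    return v + n1 + d + n2 + j


if __name__ == "__main__":
    # Toy segments invented for the example; real ones would come from a database.
    toy_v = ["ATGGCCTGTGCGAGA", "ATGGCATGTGCAAAG"]
    toy_d = ["GGTACAACT", "GTATTACTATGG"]
    toy_j = ["TACTTTGACTACTGG", "GCTTTTGATATCTGG"]
    rng = random.Random(0)  # seeded so a run can be repeated
    for _ in range(3):
        dna = recombine(toy_v, toy_d, toy_j, rng=rng)
        print(dna, translate(dna))
```

Because trimming and insertion change the junction length, many simulated joins fall out of frame or contain a stop codon (`*`); counting those would be one simple first result for such a simulation.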
       Dynamic Programming 
      analysis of Th2 chemokine receptors and ligands nucleotide and protein 
      sequences. -the evolutionary 
      relationship between Th2 chemokine receptors  Bob 
      Brady rbrady@fas.harvard.edu The Determination of a 
      General Set of Fine-Grained Selection Criteria for the Discovery of siRNA 
      in Humans It has been reported that small 
      interfering RNA (siRNA) can induce gene silencing in humans. The small (~21 bp)
      double-stranded siRNA triggers the degradation of mRNA that matches its 
      sequence.  This effect can be 
      used to determine the function of specific genes. The problem is that a wide variety 
      of unreported or proprietary results have been used for the determination 
      of fine-grained selection criteria in the discovery of siRNA.  Fine-grained selection criteria 
      include, but are not limited to: %gc content, position from start codon, 
      and homolog parameters from a BLAST search of the proposed siRNA.   The goal of this study is to 
      review the siRNA sequences and corresponding gene silencing results 
      reported in the literature, calculate the fine-grained selection 
      parameters, and determine a general set of selection criteria. Not all published results list the
      actual siRNA sequences. Where the
      sequence is not given, the selection method described in the
      journal article will have to be implemented with a Perl script and BLAST
      queries. Additionally, Perl and Mathematica
      will be used to parse the data, calculate results, and visualize 
      data/results. David 
      Twomey Project Idea 
      #1 With genomics and mouse genomes 
      now available one may now ask if the newly available genomic sequence 
      information may at least partially explain gene expression patterns. There 
      was some work on looking into whether genes that have similar expression 
      patterns also share up-stream regulatory sequences.  Project Idea 
      #2 1) There exists much functional 
      knowledge on biochemical and signaling pathways, also on protein 
      interactions. 2) Novel methods, such as gene 
      network reverse engineering, driven directly by gene expression and 
      molecular activity data, can infer functional, regulatory interactions 
      between any of the genes measured, independent of whether previous
      functional annotation exists. 3) In some cases, reverse 
      engineered networks will make predictions that are concordant with known 
      functional interactions. In other cases, predictions from reverse 
      engineering will go beyond (e.g. functional predictions on novel genes) or 
      contradict current functional knowledge. 4) The case has been made that 
      integrated network models should include components from reverse 
      engineered networks, and known signaling pathways. HOW DO WE MANAGE CASES 
      IN WHICH CONFLICTING, CONTRADICTING OR "SPECULATIVE" FUNCTIONAL 
      PREDICTIONS ARE CONTRIBUTED BY THE VARIOUS INFORMATION SOURCES USED TO 
      BUILD A NETWORK MODEL? Julian
      Bonilla My project idea deals with 
      developing an HMM for determining nucleosome positioning, based on
      ratiometric data generated by microarray experiments.  I have access to some raw data 
      generated by a biologist at CGR.  
      I've also worked on developing a browser that displays and 
      annotates the data, but I think this might be too difficult to complete in 
      the time given.  I'd be 
      interested in working with anyone that has a statistics or math background 
      to further develop the model. Faraz Waseem 
 I have a project idea. The idea is 
      that I want to develop an engine (or program) to predict protein function 
      on a context basis (non-homologous approach). It will be a rule-based
      program. I have read an article on this issue and it seems to be a good 
      computational biology problem.

      Richard Xu Richard_Xu@biogen.com

      Name: TNF Receptor Biomining

      Objective: The tumour necrosis factor (TNF) superfamily has been identified as playing pivotal roles in the organization and function of the immune system [1]. Although 29 TNF receptors have been identified, more are yet to be found in the human genome. This project aims to take an innovative approach to finding gene sequences that could potentially encode TNF receptors by exploiting several features of TNF receptors. Many existing tools will be used: BLASTX, HMM, TMAP and PFAM, to name a few. Computational results will be provided to demonstrate the efficacy of this approach.

      Process: The whole process will be divided into two big steps: discovery and validation.

      Discovery Phase: In the discovery phase, we will use the known TNF receptor sequences and take advantage of the presence of cysteine-rich domains in TNF receptors to build an HMM. We will then apply the HMM to find all the candidate DNA sequences that could code for proteins conforming to this model. We then BLASTX these genes to get their encoded proteins.

      Validation Process: We will use several criteria to sift out unqualified proteins and identify the most promising candidate.

      Comparing Variable Selection Methods for Microarray Classification Models Based on Logistic Regression

      An important application of microarray technology is categorizing
      tissue or cell types based on their gene 
      expression profiles. There are numerous methods available--clustering, self-organizing 
      maps, and neural networks are just a 
      few. Each of these creates a mathematical model using the gene expression levels from a set of "training" arrays.
      In the case where the tissue or cell 
      samples are associated with a disease, the models can be used to (1) search for key genes involved in the 
      disease process, (2) identify 
      subclasses of the disease, or (3) diagnose the disease. 
       My lab has been studying the use of logistic regression for classifying 
      microarrays, and we have shown that it can 
      perform as well as (and often better 
      than) other models, with the added advantage that it requires far 
      fewer genes to make accurate predictions 
      about the identity of "test" arrays. The most difficult part, though, in
      creating a logistic regression model 
      arises from the fact that the number of genes on an array (measured 
      in the thousands) is much greater than the 
      typical number of training arrays 
      (usually just a few dozen). As a result, many different 
      combinations of genes can be used to 
      create a logistic regression model that perfectly fits the training data; however, these sets of 
      genes vary widely in their ability to 
      fit the test data.  The question I hope to answer with my project is, given only the 
      training data, how do you choose the 
      combinations of genes that will most likely fit well to the test data? In other words, is it
      possible to distinguish genes, which 
      by random fluctuations in expression level happen to correlate well 
      with the training data, from other genes 
      whose high correlation with the training data has a real biological basis? 
      Therefore, I plan to compare several 
      well-known variable selection methods and design some algorithms of 
      my own to determine which set of genes is 
      best to use in a logistic regression 
      microarray classification model.  Griffin 
      Weber   weber@fas.harvard.edu  
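
As a point of reference for the comparison described above, one of the simplest well-known filters ranks each gene by how well it separates the two classes of training arrays. The sketch below is an assumption made for illustration, not the method used in the lab: it scores genes with a Welch-style t statistic, and the names `t_score` and `rank_genes` and the toy expression values are hypothetical. Each class needs at least two arrays for the standard deviation to be defined.

```python
from statistics import mean, stdev


def t_score(values, labels):
    """Absolute Welch-style t statistic for one gene across labelled arrays.

    values: expression levels, one per training array
    labels: 0 or 1 for each array (for example normal vs. disease)
    """
    a = [v for v, y in zip(values, labels) if y == 1]
    b = [v for v, y in zip(values, labels) if y == 0]
    spread = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    if spread == 0:
        return 0.0  # no variation at all: treat the gene as uninformative
    return abs(mean(a) - mean(b)) / spread


def rank_genes(expression, labels):
    """Return gene names sorted from most to least class-separating."""
    return sorted(expression,
                  key=lambda gene: t_score(expression[gene], labels),
                  reverse=True)


if __name__ == "__main__":
    # Toy training set: six arrays, three per class (values are invented).
    labels = [0, 0, 0, 1, 1, 1]
    expression = {
        "gene_signal": [1.0, 1.1, 0.9, 5.0, 5.1, 4.9],
        "gene_noise": [1.0, 5.0, 1.2, 4.8, 1.1, 5.2],
    }
    print(rank_genes(expression, labels))
```

With thousands of genes and only a few dozen arrays, some noise genes will score highly by chance, which is exactly the problem the proposal raises; a filter like this is a baseline to compare other selection methods against, not a solution to it.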
 Using the Index of Coincidence to identify Open Reading Frames 
 Every language has what is called an Index of Coincidence. The Index of Coincidence (IC) is defined as the probability that two random elements of a string are identical and can be calculated from the frequency histogram of the string. English has an IC of about 0.065 and random data has an IC of about 0.038. Different languages have different Indices of Coincidence, depending on their particular pattern of alphabet use. Our project will attempt to evaluate the role of the Index of Coincidence in helping to identify open reading frames and to distinguish them from non-coding sequences. We suspect that, within species (and perhaps, regardless of species) coding and non-coding sequences will exhibit characteristic Indices of Coincidence much in the way that different languages do. If we can demonstrate that there is a difference between ICs of ORFs and non-ORFs, then this difference can be used to help identify ORFs in unknown sequences. 
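
The calculation described above is short enough to sketch directly. This is a minimal illustration, assuming the two elements are drawn without replacement, so that IC = sum of n_i(n_i - 1) over N(N - 1) for symbol counts n_i and string length N. The function names and the sliding-window helper are hypothetical additions for the example. For a 4-letter DNA alphabet, uniformly random sequence gives an IC near 0.25, just as 1/26 (about 0.038) is the baseline for random text over 26 letters.

```python
from collections import Counter


def index_of_coincidence(seq):
    """Probability that two distinct randomly chosen positions hold the same symbol."""
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two symbols")
    counts = Counter(seq)
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def sliding_ic(seq, window, step=1):
    """IC of each window along a sequence, as (start position, IC) pairs."""
    return [
        (start, index_of_coincidence(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    # Invented toy sequence; real input would be annotated ORFs and non-coding DNA.
    dna = "ATGAAAAAAGCGCGCGCGTTTTAA"
    print(round(index_of_coincidence(dna), 3))
    for start, ic in sliding_ic(dna, window=12, step=6):
        print(start, round(ic, 3))
```

Comparing the distribution of these values over known ORFs with the distribution over known non-coding regions of the same species would be the direct test of the hypothesis. Computing the IC over codons or over every third base, rather than single bases, is another variant that might be worth trying.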
 Jeanhee Chung   jachung@attbi.com
       Thomas Lasko  tlasko@mit.edu
       Transcriptional control mediated by cleansing of short sequences from gene regulatory regions

       Differentiation of cells
      and their responses to stimuli are in large part made possible by tight 
      regulation of gene expression. This control is executed primarily at the 
      level of transcription initiation. The current theory describes trans-acting transcription factors 
      binding to cis-regulatory 
      sequences within, or adjacent to genes, as a primary mode of regulation of 
      expression. Combination of many different cis binding sites located close to 
      any gene would explain the complexity of transcriptional responses to 
      stimuli and co-expressed genes should in principle share similar patterns 
      of transcription factor binding sites. Computational methods for analysis of transcriptional regulation rely frequently on annotation of regulatory elements located in proximity of the studied genes and comparison of arrangements of these elements between co-expressed genes or between homologous genes among various species. We speculate that the 
      absence of or negative bias towards specific sequences in the 
      regulatory regions of co-expressed genes might add another degree of 
      regulation. Sequences cleansed from regulatory regions of co-expressed 
      genes might serve as “disruption sites”. Disruption of transcription might 
      be achieved in many ways, e.g. by binding proteins that would make spatial 
      arrangement of other trans-acting factors impossible, 
      by binding short silencing RNA sequences or by changing unfavorably the 
      local conformation of DNA strands. Rolf 
      Hanson  I'm 
      interested in issues dealing with retroelements, viruses and genome  evolution - how the study of 
      retroviruses can be used to learn more about the coevolution of 
      retroviruses and their human hosts. This is kind of broad and fuzzy, and I 
      am looking for a specific problem that would be appropriate for a course 
      project. An idea would be to look for retroelements in genome databases 
      using algorithms based on the structure of the retroelements, rather than 
      homology. That 
      said, I am more of a hacker and unfortunately would probably be more 
      excited about creating a cool piece of software or computer graphics, 
      rather than demystifying the mechanisms of human evolution. I am good with Perl, Python, C,
      Java, 3-D graphics (OpenGL), Macintosh programming, Unix, etc. I work at
      Children's Hospital (www.chip.org) with some guys who wrote a book about
      microarrays, so potentially have access to people who know a lot about 
      bioinformatics. Anna 
      Mallikarjunan  annapurni@nevo.com My 
      fairly non-existent biological background is making it hard for me to
      choose a problem that will be both interesting and 
      relevant. However, 
      one area I am interested in is to develop a partial software solution 
      (dependent on the time constraints) that provides a visual interface to 
      nucleotide mutations. I am looking to biologists to suggest what they 
      would want to visualize when analyzing nucleotide 
      mutations. The 
      skills I can bring to a team are experience in a variety of software 
      platforms and programming paradigms. Matt 
      Paschke Background: I have 
      an AB in Computer Science and have done a fair bit of programming, 
      including a good bit of programming in perl and related languages.  I am also in the middle of taking 
      a molecular biology class, a genetics class and an organic chemistry 
      class. My 
      interests probably fall into two broad categories.  The first would be the problem of 
      gene location -- finding where genes are in the vast amount of collected 
      genomic data. The second
      would be data representation: how to most efficiently represent genetic
      data to make searching and sequence alignment more efficient.  Spending hours waiting for a BLAST 
      search is still too slow.  
      These are just broad interests.  I would love to work with anyone 
      who wants to work at the interface of the CS and the 
      biology. Daniel 
      Rosenband  danlief@au-bon-pain.lcs.mit.edu I'm a 
      computer science graduate student at MIT.  My research interests are 
      in supercomputing, computer architecture, and hardware design.  I've 
      only taken introductory undergrad. biology, so I'm looking to be part of a 
      team that consists of at least one or two other people with a strong 
      biology background. The type of project I would like to work on is one 
      that involves some aspect of high-performance computing -- novel 
      algorithms to take advantage of large machines, simulating a complicated 
      biological process, or finding a biological problem that dedicated 
      high-performance hardware could cost-effectively solve.  
       My 
      office and apt. are on MIT campus, so I can easily meet with people to 
      work on the project either at Harvard or MIT. Atif
      Khan  gumaan@yahoo.com I have 
      interests in two directions 1) 
      Machine learning approaches (in particular neural networks) to predict RNA 
      and protein secondary structure. 2) 
      Information theory and evolution - in particular exploring the relation 
      between the theory of error correcting codes, signal/noise propagation and 
      evolutionary constructs like mutation, selection, drift etc. The idea here 
      would be to understand how evolution preserves "informational 
      complexity". Dan 
      O'Brien  
 mjg-dob@attbi.com I have 
      3 ideas for projects. 1. 
      Looking at mutant p53 in clam leukemia cells for homologs 
       http://www.idealibrary.com/retrieve/doi/10.1006/excr.1997.3513#ex973513fn1 2a. To 
      attempt to find a correlation with physical cell stress or size and gene 
      expression. There is work going on growing cells on a scaffold or grating 
      that can be expanded to place the cells under tension.  Heart cells grown on micro pegs 
      exhibit electrical properties different from cells grown in a culture 
      medium. I have no idea how I'll get the data. 2b. 
      Cells grow and shrink in size during the cell cycle. Is it possible to
      correlate this expansion and contraction with up regulation of 
      genes? The 
      data from this could come from the paper that we read this week 
       3. Is 
      DNA used as a framework or building block to increase cell size in single 
      cell organisms?  Is it 
      possible to correlate cell size and amount of DNA?     Heta 
      Ray   Heta@mit.edu Skills: 
      Java, Databases, Object Modeling, Machine Learning and Data 
      Analysis Genomic 
      and proteomic approaches can provide hypotheses concerning function for 
      the large number of genes predicted from genome sequences. Due to the 
      artificial nature of the assays, however, the information from these 
      high-throughput approaches should be considered with caution. Although it 
      is possible that more meaningful hypotheses could be formulated by 
      integrating the data from various functional genomic and proteomic 
      projects, it has yet to be seen to what extent the data can be correlated 
      and how such integration can be achieved. I would like to investigate
      the correlation between mRNA abundance and the presence/absence of
      proteins. This correlation
      can be used to
      improve the quality of hypotheses based on the information from both
      approaches. A test and training
      set would be created, and we could use machine learning (neural networks,
      Bayesian methods, etc.) to predict the outcome. Another
      biological problem could be identifying genes responsible for human 
      diseases by combining information about gene position with clues about 
      biological function.  The 
      recent availability of whole genome sets of RNA and protein expression
      provides powerful new functional insights.  These data sets could be used to 
      expedite disease genes discovery - we could assign a 'score' for each 
      gene, based on similarity in the RNA expression profile to known 
      mitochondrial genes. Using a
      large survey of organellar proteomics, genes can be classified according 
      to the likelihood of their protein product being associated with the 
      mitochondria.  The 
      intersection of this information could narrow down the search for the 
      possible gene candidates. Joe 
      Weber 
       jrweber@attbi.com

Identification of Potential Transcriptional Regulatory Elements by Comparison of Human and Pufferfish Genomic Sequences.

BIOL E-101 Project Proposal by Joe Weber, 11/5/02

The goal of this project is to identify potential transcriptional regulatory elements that are likely to play a conserved role in vertebrate body patterning. To accomplish this goal, the amino acid sequences of human genes thought to be important for body patterning will be used to search the pufferfish (Fugu rubripes) genome for closely related genes. The 5’ flanking and intronic sequences of likely homologs will be compared to identify clusters of conserved transcription factor binding sites that may serve as promoters, enhancers, or silencers. There are two main reasons for choosing the pufferfish and human genomes for this project. First, nearly all of each genome has been sequenced and is publicly available. Second, these two species are separated by approximately 450 million years of evolution. Over such a great period of time, the non-coding regions of a gene (promoter and introns) should have undergone extensive mutation. Therefore, the only sequences that are likely to be conserved are sequences that play a critical functional role, such as important transcriptional regulatory elements.

Here is a summary of how I will proceed with this project:

Step 1: Create a list of human genes for which there is published evidence indicating that the gene product plays a role in body patterning. In many cases, the experimental evidence will be from homologous proteins in animals commonly used for embryological studies, such as mouse, Xenopus, and zebrafish. This list should include at least a few dozen genes, since part of the project goal is to see how common (or uncommon) it is to have highly similar promoter/enhancer sequences between human and pufferfish homologs. This list will include well studied transcription factors and signaling molecules such as members of the Zic, Gli, Sox, BMP, Nodal, and Wnt families.

Step 2: Use the human amino acid sequences to conduct BLAST searches against the pufferfish genome (http://genome.jgi-psf.org/fugu6/fugu6.home.html) and download the genomic sequences of likely homologs.

Step 3: Search for open reading frames and intron-exon splice site consensus sequences in order to verify that a genomic sequence found by the BLAST search is likely to code for a real protein. The pufferfish genome web site has a listing of more than 30,000 predicted protein sequences that will be very helpful. The predicted amino acid sequence for a pufferfish protein will then be aligned to the human sequence used for the BLAST search in order to determine if it is similar enough to be a likely homolog.

Step 4: Use a Smith-Waterman local alignment to search for regions of high similarity in the promoter and intron sequences of likely human-fish homologs, and use the MatInspector program to search for potential transcription factor binding sites based on matrices from the TRANSFAC database.

Step 5: The final product of the steps above would be a set of gene maps and tables listing potential transcription factor binding sites that appear to be phylogenetically conserved. It might be possible to confirm at least some of these predictions by doing a thorough search of the literature to see if any of these elements have already been identified by empirical methods such as protein-DNA binding assays and promoter-reporter gene assays.

Reasons to think that this approach will be productive: I recently worked on a project where I cloned the Xenopus (frog) gene for Zic3, a transcription factor involved in vertebrate body patterning. When I compared the frog and human Zic3 sequences, I found a 120 bp sequence in the middle of the first intron that had 82% identity between species. This similarity was quite striking, since the rest of the intron had only about 22% identity. Because frogs and humans are separated by more than 300 million years of evolution, I thought that this conserved sequence was very likely to be a functional regulatory element. I tested this hypothesis using a variety of promoter-reporter gene assays in Xenopus embryos, and found that the conserved sequence was a transcriptional enhancer that responded to the activin/nodal-related signaling pathway, which is known to induce the endogenous Zic3 gene. I would like to see how common these kinds of conserved regulatory regions are between homologous genes in distantly related vertebrate genomes. Unfortunately, there is very little genomic sequence available for Xenopus. However, the availability of the nearly complete pufferfish genome sequence, and the 450 million years that separate humans and pufferfish, make fish-human comparisons an attractive approach for studying conservation of transcriptional regulatory elements.

Note: I currently plan to carry out this project using existing software tools. However, what might be an interesting related project for someone with a stronger computer science background than I have is as follows:

The process of taking a human protein sequence and BLASTing it against the pufferfish genome is quite fast. However, extracting the relevant pufferfish genomic sequence and annotating its complete exon-intron structure can be quite tedious, especially if this process is going to be repeated with a large number of genes. I suspect that a great deal of this process could be automated, but I don’t know of an available program that does all of what I would like it to do. It would be great to have a program to which a researcher could simply give a known protein sequence from one species and a genomic sequence from another species (selected based on a BLAST search), and then have the program map out the best-fit homolog it can find in the genomic sequence. The output of such a program would include the predicted amino acid sequence of the potential homolog and its percent identity with the input protein sequence. It would also include a table or map of the predicted genomic exon-intron structure with position numbers. Predicting exon-intron splice sites a priori is usually quite difficult, because there is a great deal of flexibility in the splice site consensus sequences. However, in this case a comparison of the input amino acid sequence with all of the open reading frames of the genomic sequence should narrow down the search space by quite a bit. Finally, it would be very useful if the program would copy the non-coding sequences (5’ flanking region and introns) to separate files, so that they can then be used in searches for transcriptional regulatory sequences.

Gregory Minevich
I would like to investigate one of the major pieces of evidence for the neutral theory of molecular evolution. Advocated in part by Motoo Kimura, this idea states that most evolution at the molecular level is not the result of natural selection, but rather the result of random genetic drift. In other words, most molecular variation in DNA and protein composition has no influence on the selective fitness of the organism: it is selectively neutral.

Molecular evolution and morphological evolution are quite independent of one another in that variation on the morphological level has been definitively demonstrated to have an influence on the fitness of the phenotype. Ridley (p173) gives the example of the living fossil shark Heterodontus portusjacksoni, a species that closely resembles its fossil ancestors from 300 million years ago. Though the rates of molecular evolution in humans and this shark have been roughly equal over the past 300 million years, their rates of morphological evolution have been astonishingly different. Whereas the shark closely resembles its ancestor from that time, humans have evolved from fish-like ancestors and have passed through amphibian, reptilian and mammalian stages.

According to Ridley (p151), 4 main types of observation have been used to determine whether natural selection or neutral drift drives molecular evolution. These are:
1) The rate of evolution and the magnitude of polymorphism
2) The constancy of molecular evolution (known as the molecular clock)
3) The relation between functional constraint on molecules and their rates of evolution
4) The relation between polymorphism and evolutionary rate in different molecules (or parts of molecules)

Of these four observations or tests, I would like to single out the constancy of molecular evolution for investigation. Whereas the rate of protein evolution runs relative to absolute time, the molecular clock runs relative to generation time for silent base changes in DNA (changes that do not result in an amino acid change in the protein). According to Ridley, selection likely gives a better explanation for the case of protein evolution whereas neutral drift better explains silent base changes in DNA.

Overall Goal: Does Neutral Drift or Natural Selection Better Explain the Molecular Clock Effect?

Steps Along the Way:
* Investigate whether the latest research still has each respective clock running relative to absolute time and generation time respectively
* Ridley (p178) suggests an argument for how natural selection can explain why the protein molecular clock keeps absolute, rather than generational time. The argument supposes 2 organisms with different generation times, "both separately evolving in relation to changes in parasites with short generation times". I would like to model this argument using a combination of Perl and Mathematica where I could experiment with changing such variables as: mutation rates (of organisms and parasites), selection variables, and generation times (of organisms and parasites). I would then compare the results of this modeling experiment with the latest research findings to achieve the overall goal above.

Resources:
Gould, S.J. 2002 The Structure of Evolutionary Theory
Kimura, M. 1968 Evolutionary Rate at the Molecular Level. Nature 217:624-626
Dawkins, R. 1986 The Blind Watchmaker
Ridley, Mark. 1996 Evolution

Some Further Reading:
Bulmer, M 1988 Evolutionary Aspects of Protein Synthesis. Oxford Surv. Evol. Biol. 5:1-40
Bulmer, M 1988 Estimating the Variability of Substitution Rates. Genetics 123:615-619
Gillespie, J.H. 1991 The Causes of Molecular Evolution.
Gillespie, J.H. 1993 Episodic Evolution of RNA Viruses. Proc. Nat. Acad. Sci. USA 90:10411-10422
Nichol, et al 1993 Punctuated Equilibrium and Positive Darwinian Evolution in Vesicular Stomatitis Virus. Proc. Nat. Acad. Sci. USA 90:10424-10428
Ohta, T. 1992 The Nearly Neutral Theory of Molecular Evolution. Ann. Rev. Ecol. System. 23:263-286
Ohta, T. 1993 An Examination of the Generation-Time Effect on Molecular Evolution. Proc. Nat. Acad. Sci. USA 90:10676-10680

BioPhysics
101 Term Project Proposal

Overlaying Clustering Results from PCA with Clustering Results from Self-Organizing Maps

Group members: Amy Chang, Shixin Zhang, Yan Wang sleepingpanda99@hotmail.com

(Note: each group member submitted the same project proposal to his/her TF)

We are given:
1) DNA microarray data containing the expression profiles for 97% of the known or predicted genes of Saccharomyces cerevisiae. The microarray data measure changes in the concentrations of the RNA transcripts from each gene for seven successive intervals after transfer of wild-type (strain SK1) diploid yeast cells to a nitrogen-deficient medium that induces sporulation. This dataset comes from a published paper: The Transcriptional Program of Sporulation in Budding Yeast, S. Chu et al[i]. The dataset can be downloaded from http://cmgm.stanford.edu/pbrown/sporulation.
2) The results of the analysis done by S. Chu et al, which concluded that “at least seven distinct temporal patterns of induction were observed”. Their conclusion comes from a clustering analysis of the data using self-organizing maps.

Our challenge:
1) Perform principal component analysis on the dataset from S. Chu et al
2) See if