| |||||||||
Project Ideas, 2002
Genie
Hainsworth genie@hms.harvard.edu I work at
Interactions: Network Structures I
would like to look at data from protein-protein interaction experiments,
and construct networks showing their connections. By examining the
structure and connectivity of these networks, I hope we can deduce
something about the roles of particular proteins. The
protein chip is created by depositing a microarray of different plasmid
DNAs on a glass slide, then using in vitro transcription/translation (IVT)
to produce the corresponding proteins. The genes in the plasmids all
include an affinity tag, so that as the proteins are made, they become
immobilized on the slide. Part of my work is already to analyze and quantitate
scanned images of these slides. We may have a small amount of real
data by early December, which I would like to start analyzing with the
method we develop in the project.
Steven Corsello corsello@fas.harvard.edu Aim: To correlate
microarray data with the promoter site consensus sequence for a specific
transcription factor. Most transcription
factors bind a known consensus sequence in the promoter region of a gene.
However, the sequence often contains degeneracies, such as TTNNNNNAA. This
project would develop a model that scores the tendency of a
particular base to occur at each position in the site, and then compare this
score with the fold induction of the gene reported on the
microarray. This project can be done
in either a human or yeast system. Yeast would likely be more
straightforward since gene transcription is better characterized in this
system, and the promoter region is easier to
identify. My background is in
biochemistry and cell biology, so a computer scientist, statistician, or
someone with experience interpreting microarray data would be particularly
helpful.
Joshua Rene Lacsina lacsina@fas.harvard.edu I would like to focus on
genomic analysis of parasitic human pathogens, particularly Plasmodium
falciparum, one of the causative agents of malaria, and Leishmania major,
the causative agent of leishmaniasis, an endemic disease primarily
affecting the third world. The genome sequence of Plasmodium falciparum
has been completed, with significant annotation published for chromosomes
2 and 3. In contrast, only chromosome 1 of Leishmania has been completed.
I developed the following project idea based on a series of articles
focusing on malaria genomics published in Molecular and Biochemical
Parasitology, available electronically via Hollis: Finding novel motifs in
the P. falciparum genome based on an algorithm that searches for
repetitions in DNA sequences. I still haven't thought of a good way to
enrich my dataset for biologically-relevant/interesting things, though the
complete sequence of the mosquito genome could be useful...I'm open to
suggestions... Let me know via e-mail if
you are interested. Here are some references--the first reference is the
journal issue I referred to above (available on Hollis) containing several
articles pertinent to this topic, so take a look at as many of them as you
wish:
Molecular and Biochemical Parasitology. (2001) v. 118: pp. 127-302.
GlimmerM. http://www.tigr.org/software/glimmerm/.
Huestis R and A. Saul, An algorithm to predict 3' intron splice sites in Plasmodium falciparum genomic sequences. Mol. Biochem. Parasitol. 112 (2001), pp. 71-77.
Myler PJ, et al., Leishmania major Friedlin chromosome 1 has an unusual distribution of protein-coding genes. PNAS 96 (1999), pp. 2902-2906.
Leo
Aristotle S. Hizon Leo.Hizon@mpi.com
Simulation of the recombination of antibody genes, using Perl to predict the amino acid
sequences of the variable region of the antibody.
- modeling of semi-random recombination events in B-cells
Dynamic programming analysis of the nucleotide and protein
sequences of Th2 chemokine receptors and ligands.
- the evolutionary relationship between Th2 chemokine receptors
Bob
Brady rbrady@fas.harvard.edu The Determination of a
General Set of Fine-Grained Selection Criteria for the Discovery of siRNA
in Humans
It has been reported that small
interfering RNA (siRNA) can induce gene silencing in humans. The small (~21 bp)
double-stranded siRNA triggers the degradation of mRNA that matches its
sequence. This effect can be
used to determine the function of specific genes. The problem is that a wide variety
of unreported or proprietary results have been used for the determination
of fine-grained selection criteria in the discovery of siRNA. Fine-grained selection criteria
include, but are not limited to: %GC content, position from start codon,
and homolog parameters from a BLAST search of the proposed siRNA. The goal of this study is to
review the siRNA sequences and corresponding gene silencing results
reported in the literature, calculate the fine-grained selection
parameters, and determine a general set of selection criteria. Not all published results list the
actual siRNA sequences. Where the actual siRNA sequence is not given, the
selection method described in the journal article will have to be
implemented with a Perl script and BLAST queries. Additionally, Perl and Mathematica
will be used to parse the data, calculate results, and visualize
data/results.
David Twomey
Project Idea #1
With genomics and mouse genomes now available, one may ask whether the newly available genomic sequence
information can at least partially explain gene expression patterns. There
has been some work on whether genes that have similar expression
patterns also share upstream regulatory sequences.
Project Idea #2
1) There exists much functional knowledge on biochemical and signaling pathways, and also on protein interactions.
2) Novel methods, such as gene network reverse engineering, driven directly by gene expression and molecular activity data, can infer functional, regulatory interactions between any of the genes measured, independent of whether previous functional annotation exists.
3) In some cases, reverse engineered networks will make predictions that are concordant with known functional interactions. In other cases, predictions from reverse engineering will go beyond (e.g. functional predictions on novel genes) or contradict current functional knowledge.
4) The case has been made that integrated network models should include components from reverse engineered networks, and known signaling pathways. HOW DO WE MANAGE CASES
Bonilla My project idea deals with
developing an HMM for determining nucleosome positioning, based on
ratiometric data generated by microarray experiments. I have access to some raw data
generated by a biologist at CGR.
I've also worked on developing a browser that displays and
annotates the data, but I think this might be too difficult to complete in
the time given. I'd be
interested in working with anyone that has a statistics or math background
to further develop the model.
Faraz Waseem
I have a project idea: I want to develop an engine (or program) to predict protein function
on a context basis (a non-homology-based approach). It will be a rule-based
program. I have read an article on this issue and it seems to be a good
computational biology problem.
Richard Xu Richard_Xu@biogen.com
Name: TNF Receptor Biomining
Objective: The tumour necrosis factor (TNF) superfamily has been identified as playing pivotal roles in the organization and function of the immune system [1]. Although 29 TNF receptors have been identified, more TNF receptors are yet to be found in the human genome. This project aims to take an innovative approach to finding gene sequences that could potentially encode TNF receptors by exploiting several features of TNF receptors. Many existing tools will be used: BLASTX, HMM, TMAP and PFAM, to name a few. Computational results will be provided to demonstrate the efficacy of this approach.
Process: The whole process will be divided into two big steps: discovery and validation.
Discovery Phase: In the discovery phase, we will use the existing known TNF receptor sequences and take advantage of the presence of cysteine-rich domains in TNF receptors to build an HMM. We will then apply the HMM to find all the candidate DNA sequences that can potentially code for proteins conforming to this HMM. We then BLASTX these genes to get the proteins they encode.
Process: We will have several criteria to sift out unqualified proteins and identify the most promising candidates.
Comparing Variable Selection Methods for Microarray Classification Models Based on Logistic Regression
An important application of microarray technology is categorizing
tissue or cell types based on their gene expression profiles. There are
numerous methods available; clustering, self-organizing maps, and neural
networks are just a few. Each of these creates a mathematical model using
the gene expression levels from a set of "training" arrays. In the case
where the tissue or cell samples are associated with a disease, the models
can be used to (1) search for key genes involved in the disease process,
(2) identify subclasses of the disease, or (3) diagnose the disease.
My lab has been studying the use of logistic regression for classifying
microarrays, and we have shown that it can perform as well as (and often
better than) other models, with the added advantage that it requires far
fewer genes to make accurate predictions about the identity of "test"
arrays. The most difficult part of creating a logistic regression model,
though, arises from the fact that the number of genes on an array (measured
in the thousands) is much greater than the typical number of training
arrays (usually just a few dozen). As a result, many different combinations
of genes can be used to create a logistic regression model that perfectly
fits the training data; however, these sets of genes vary widely in their
ability to fit the test data.
The question I hope to answer with my project is: given only the training
data, how do you choose the combinations of genes that will most likely fit
the test data well? In other words, is it possible to distinguish genes
that happen to correlate well with the training data through random
fluctuations in expression level from genes whose high correlation with the
training data has a real biological basis? I therefore plan to compare
several well-known variable selection methods and design some algorithms of
my own to determine which set of genes is best to use in a logistic
regression microarray classification model.
Griffin Weber weber@fas.harvard.edu
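To make the overfitting concern in this proposal concrete, here is a small sketch in Python (not part of the proposal). It counts how many pure-noise "genes" separate two classes of training arrays perfectly by chance; any such gene on its own gives a logistic regression model that fits the training data perfectly. The array counts, the number of genes, and the function names are all assumptions chosen for illustration, and the training set is deliberately tiny (three arrays per class) so that the effect is visible with single genes.

```python
import random
from math import comb


def separates(values, labels):
    """True if some threshold on one gene's values splits the two classes perfectly."""
    zeros = [v for v, y in zip(values, labels) if y == 0]
    ones = [v for v, y in zip(values, labels) if y == 1]
    return max(zeros) < min(ones) or max(ones) < min(zeros)


def count_chance_separators(n_genes, labels, rng):
    """Count pure-noise genes that separate the training labels by chance."""
    hits = 0
    for _ in range(n_genes):
        # Expression values with no relation to the labels at all.
        noise = [rng.gauss(0.0, 1.0) for _ in labels]
        if separates(noise, labels):
            hits += 1
    return hits


if __name__ == "__main__":
    rng = random.Random(0)
    labels = [0, 0, 0, 1, 1, 1]   # assumed: 3 training arrays per class
    n_genes = 5000                # assumed number of genes on the array
    hits = count_chance_separators(n_genes, labels, rng)
    # For continuous noise, a gene separates the classes with probability 2 / C(6, 3).
    expected = n_genes * 2 / comb(len(labels), 3)
    print(hits, "noise genes separate the classes; about", expected, "expected")
```

With a few dozen training arrays a single noise gene almost never separates the classes, but combinations of several genes still can, which is why the choice of variable selection method matters.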
Using the Index of Coincidence to identify Open Reading Frames
Every language has what is called an Index of Coincidence. The Index of Coincidence (IC) is defined as the probability that two random elements of a string are identical and can be calculated from the frequency histogram of the string. English has an IC of about 0.065 and random data has an IC of about 0.038. Different languages have different Indices of Coincidence, depending on their particular pattern of alphabet use. Our project will attempt to evaluate the role of the Index of Coincidence in helping to identify open reading frames and to distinguish them from non-coding sequences. We suspect that, within species (and perhaps, regardless of species) coding and non-coding sequences will exhibit characteristic Indices of Coincidence much in the way that different languages do. If we can demonstrate that there is a difference between ICs of ORFs and non-ORFs, then this difference can be used to help identify ORFs in unknown sequences.
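The definition above can be sketched in a few lines of Python. This is only an illustration of the standard formula IC = sum of n_i(n_i - 1) over N(N - 1), where n_i is the count of symbol i and N is the string length; the function names and the sliding-window scan are assumptions, not something the proposal specifies.

```python
from collections import Counter


def index_of_coincidence(seq: str) -> float:
    """Probability that two symbols drawn without replacement from seq are identical."""
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two symbols")
    counts = Counter(seq)  # the frequency histogram of the string
    return sum(c * (c - 1) for c in counts.values()) / (n * (n - 1))


def sliding_ic(seq: str, window: int, step: int = 1):
    """Yield (start, IC) for each window; one hypothetical way to scan a sequence."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, index_of_coincidence(seq[start:start + window])


if __name__ == "__main__":
    print(index_of_coincidence("AAAA"))      # every pair of symbols matches
    print(index_of_coincidence("ACGT"))      # no pair of symbols matches
    print(index_of_coincidence("AACCGGTT"))
```

Note that the 0.065 and 0.038 figures refer to a 26-letter alphabet; for a four-letter DNA alphabet, long uniformly random sequence has an IC near 0.25, so any coding versus non-coding difference would have to be measured against that baseline.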
Jeanhee Chung jachung@attbi.com
Thomas Lasko tlasko@mit.edu
control mediated by cleansing of short sequences from gene regulatory regions
Differentiation of cells
and their responses to stimuli are in large part made possible by tight
regulation of gene expression. This control is executed primarily at the
level of transcription initiation. The current theory describes trans-acting transcription factors
binding to cis-regulatory
sequences within, or adjacent to, genes as a primary mode of regulation of
expression. A combination of many different cis binding sites located close to
any gene would explain the complexity of transcriptional responses to
stimuli, and co-expressed genes should in principle share similar patterns
of transcription factor binding sites. Computational methods for analysis of transcriptional regulation frequently rely on annotation of regulatory elements located in proximity to the studied genes and comparison of the arrangements of these elements between co-expressed genes or between homologous genes among various species.
We speculate that the
absence of, or negative bias towards, specific sequences in the
regulatory regions of co-expressed genes might add another degree of
regulation. Sequences cleansed from regulatory regions of co-expressed
genes might serve as “disruption sites”. Disruption of transcription might
be achieved in many ways, e.g. by binding proteins that would make the spatial
arrangement of other trans-acting factors impossible,
by binding short silencing RNA sequences, or by unfavorably changing the
local conformation of DNA strands.
Rolf Hanson I'm
interested in issues dealing with retroelements, viruses and genome evolution - how the study of
retroviruses can be used to learn more about the coevolution of
retroviruses and their human hosts. This is kind of broad and fuzzy, and I
am looking for a specific problem that would be appropriate for a course
project. An idea would be to look for retroelements in genome databases
using algorithms based on the structure of the retroelements, rather than
homology. That
said, I am more of a hacker and unfortunately would probably be more
excited about creating a cool piece of software or computer graphics,
rather than demystifying the mechanisms of human evolution. I am good with Perl, Python, C,
Java, 3-D graphics (OpenGL), Macintosh programming, Unix, etc. I work at
Children's Hospital (www.chip.org) with some guys who wrote a book about
microarrays, so I potentially have access to people who know a lot about
bioinformatics.
Anna Mallikarjunan annapurni@nevo.com My
fairly non-existent biological background is making it hard for me to
choose a problem that will be both interesting and
relevant. However,
one area I am interested in is to develop a partial software solution
(dependent on the time constraints) that provides a visual interface to
nucleotide mutations. I am looking to biologists to suggest what they
would want to visualize when analyzing nucleotide
mutations. The
skills I can bring to a team are experience in a variety of software
platforms and programming paradigms.
Matt Paschke Background: I have
an AB in Computer Science and have done a fair bit of programming,
including a good bit of programming in Perl and related languages. I am also in the middle of taking
a molecular biology class, a genetics class and an organic chemistry
class. My
interests probably fall into two broad categories. The first would be the problem of
gene location -- finding where genes are in the vast amount of collected
genomic data. The second
would be data representation: how to most efficiently represent genetic
data to make searching and sequence alignment more efficient. Spending hours waiting for a BLAST
search is still too slow.
These are just broad interests. I would love to work with anyone
who wants to work at the interface of CS and
biology.
Daniel Rosenband danlief@au-bon-pain.lcs.mit.edu I'm a
computer science graduate student at MIT. My research interests are
in supercomputing, computer architecture, and hardware design. I've
only taken introductory undergraduate biology, so I'm looking to be part of a
team that consists of at least one or two other people with a strong
biology background. The type of project I would like to work on is one
that involves some aspect of high-performance computing -- novel
algorithms to take advantage of large machines, simulating a complicated
biological process, or finding a biological problem that dedicated
high-performance hardware could cost-effectively solve. My
office and apt. are on MIT campus, so I can easily meet with people to
work on the project either at Harvard or MIT.
Atif Khan gumaan@yahoo.com I have
interests in two directions: 1)
Machine learning approaches (in particular neural networks) to predict RNA
and protein secondary structure. 2)
Information theory and evolution - in particular exploring the relation
between the theory of error correcting codes, signal/noise propagation and
evolutionary constructs like mutation, selection, drift etc. The idea here
would be to understand how evolution preserves "informational
complexity".
Dan mjg-dob@attbi.com I have
3 ideas for projects. 1.
Looking at mutant p53 in clam leukemia cells for homologs
http://www.idealibrary.com/retrieve/doi/10.1006/excr.1997.3513#ex973513fn1 2a. To
attempt to find a correlation with physical cell stress or size and gene
expression. There is work going on growing cells on a scaffold or grating
that can be expanded to place the cells under tension. Heart cells grown on micro pegs
exhibit electrical properties different from cells grown in a culture
medium. I have no idea how I'll get the data. 2b.
Cells grow and shrink in size during the cell cycle; is it possible to
correlate this expansion and contraction with upregulation of
genes? The
data for this could come from the paper that we read this week.
3. Is
DNA used as a framework or building block to increase cell size in single
cell organisms? Is it
possible to correlate cell size and amount of DNA?
Heta Ray Heta@mit.edu Skills:
Java, Databases, Object Modeling, Machine Learning and Data
Analysis Genomic
and proteomic approaches can provide hypotheses concerning function for
the large number of genes predicted from genome sequences. Due to the
artificial nature of the assays, however, the information from these
high-throughput approaches should be considered with caution. Although it
is possible that more meaningful hypotheses could be formulated by
integrating the data from various functional genomic and proteomic
projects, it has yet to be seen to what extent the data can be correlated
and how such integration can be achieved. I would like to investigate
how mRNA abundance correlates with the presence/absence of
proteins. This correlation
between mRNA abundance and the presence/absence of proteins can be used to
improve the quality of hypotheses based on the information from both
approaches. A test and training
set would be created, and we could use machine learning (neural networks,
Bayesian methods, etc.) to predict the outcome. Another
biological problem could be identifying genes responsible for human
diseases by combining information about gene position with clues about
biological function. The
recent availability of whole-genome sets of RNA and protein expression
provides powerful new functional insights. These data sets could be used to
expedite disease gene discovery: we could assign a 'score' to each
gene, based on the similarity of its RNA expression profile to those of known
mitochondrial genes. Using a
large survey of organellar proteomics, genes can be classified according
to the likelihood of their protein product being associated with the
mitochondria. The
intersection of this information could narrow down the search for the
possible gene candidates.
Joe jrweber@attbi.com
Identification of Potential Transcriptional Regulatory Elements by Comparison of Human and Pufferfish Genomic Sequences.
BIOL E-101 Project Proposal by Joe Weber, 11/5/02
The goal of this project is to identify potential transcriptional regulatory elements that are likely to play a conserved role in vertebrate body patterning. To accomplish this goal, the amino acid sequences of human genes thought to be important for body patterning will be used to search the pufferfish (Fugu rubripes) genome for closely related genes. The 5’ flanking and intronic sequences of likely homologs will be compared to identify clusters of conserved transcription factor binding sites that may serve as promoters, enhancers, or silencers.
There are two main reasons for choosing the pufferfish and human genomes for this project. First, nearly all of each genome has been sequenced and is publicly available. Second, these two species are separated by approximately 450 million years of evolution. Over such a great period of time, the non-coding regions of a gene (promoter and introns) should have undergone extensive mutation. Therefore, the only sequences that are likely to be conserved are sequences that play a critical functional role, such as important transcriptional regulatory elements.
Here is a summary of how I will proceed with this project:
Step 1: Create a list of human genes for which there is published evidence indicating that the gene product plays a role in body patterning. In many cases, the experimental evidence will be from homologous proteins in animals commonly used for embryological studies, such as mouse, Xenopus, and zebrafish. This list should include at least a few dozen genes, since part of the project goal is to see how common (or uncommon) it is to have highly similar promoter/enhancer sequences between human and pufferfish homologs. This list will include well studied transcription factors and signaling molecules such as members of the Zic, Gli, Sox, BMP, Nodal, and Wnt families.
Step 2: Use the human amino acid sequences to conduct BLAST searches against the pufferfish genome (http://genome.jgi-psf.org/fugu6/fugu6.home.html) and download the genomic sequences of likely homologs.
Step 3: Search for open reading frames and intron-exon splice site consensus sequences in order to verify that a genomic sequence found by the BLAST search is likely to code for a real protein. The pufferfish genome web site has a listing of more than 30,000 predicted protein sequences that will be very helpful. The predicted amino acid sequence for a pufferfish protein will then be aligned to the human sequence used for the BLAST search in order to determine if it is similar enough to be a likely homolog.
Step 4: Use a Smith-Waterman local alignment to search for regions of high similarity in the promoter and intron sequences of likely human-fish homologs, and use the MatInspector program to search for potential transcription factor binding sites based on matrices from the TRANSFAC database.
Step 5: The final product of the steps above would be a set of gene maps and tables listing potential transcription factor binding sites that appear to be phylogenetically conserved. It might be possible to confirm at least some of these predictions by doing a thorough search of the literature to see if any of these elements have already been identified by empirical methods such as protein-DNA binding assays and promoter-reporter gene assays.
Reasons to think that this approach will be productive: I recently worked on a project where I cloned the Xenopus (frog) gene for Zic3, a transcription factor involved in vertebrate body patterning. When I compared the frog and human Zic3 sequences, I found a 120 bp sequence in the middle of the first intron that had 82% identity between species. This similarity was quite striking, since the rest of the intron had only about 22% identity. Because frogs and humans are separated by more than 300 million years of evolution, I thought that this conserved sequence was very likely to be a functional regulatory element. I tested this hypothesis using a variety of promoter-reporter gene assays in Xenopus embryos, and found that the conserved sequence was a transcriptional enhancer that responded to the activin/nodal-related signaling pathway, which is known to induce the endogenous Zic3 gene. I would like to see how common these kinds of conserved regulatory regions are between homologous genes in distantly related vertebrate genomes. Unfortunately, there is very little genomic sequence available for Xenopus. However, the availability of the nearly complete pufferfish genome sequence, and the 450 million years that separate humans and pufferfish, make fish-human comparisons an attractive approach for studying conservation of transcriptional regulatory elements.
Note: I currently plan to carry out this project using existing software tools. However, the following might be an interesting related project for someone with a stronger computer science background than I have. The process of taking a human protein sequence and BLASTing it against the pufferfish genome is quite fast. However, extracting the relevant pufferfish genomic sequence and annotating its complete exon-intron structure can be quite tedious, especially if this process is going to be repeated with a large number of genes. I suspect that a great deal of this process could be automated, but I don’t know of an available program that does all of what I would like it to do. It would be great to have a program to which a researcher could simply give, as input, a known protein sequence from one species and a genomic sequence from another species (selected based on a BLAST search), and then have the program map out the best-fit homolog it can find in the genomic sequence. The output of such a program would include the predicted amino acid sequence of the potential homolog and its percent identity with the input protein sequence. It would also include a table or map of the predicted genomic exon-intron structure with position numbers. Predicting exon-intron splice sites a priori is usually quite difficult, because there is a great deal of flexibility in the splice site consensus sequences. However, in this case a comparison of the input amino acid sequence with all of the open reading frames of the genomic sequence should narrow down the search space by quite a bit. Finally, it would be very useful if the program would copy the non-coding sequences (5’ flanking region and introns) to separate files, so that they can then be used in searches for transcriptional regulatory sequences.
Gregory Minevich
I would like to investigate one of the major pieces of evidence for the neutral theory of molecular evolution. Advocated in part by Motoo Kimura, this idea states that most evolution at the molecular level is not the result of natural selection, but rather the result of random genetic drift. In other words, most molecular variation in DNA and protein composition has no influence on the selective fitness of the organism; it is selectively neutral. Molecular evolution and morphological evolution are quite independent of one another in that variation on the morphological level has been definitively demonstrated to have an influence on the fitness of the phenotype.
Ridley (p173) gives the example of the living fossil shark Heterodontus portusjacksoni, a species that closely resembles its fossil ancestors from 300 million years ago. Though the rates of molecular evolution in humans and this shark have been roughly equal over the past 300 million years, their rates of morphological evolution have been astonishingly different. Whereas the shark closely resembles its ancestor from that time, humans have evolved from fish-like ancestors and have passed through amphibian, reptilian and mammalian stages.
According to Ridley (p151), 4 main types of observation have been used to determine whether natural selection or neutral drift drives molecular evolution. These are:
1) The rate of evolution and the magnitude of polymorphism
2) The constancy of molecular evolution (known as the molecular clock)
3) The relation between functional constraint on molecules and their rates of evolution
4) The relation between polymorphism and evolutionary rate in different molecules (or parts of molecules)
Of these four observations or tests, I would like to single out the constancy of molecular evolution for investigation. Whereas the rate of protein evolution runs relative to absolute time, the molecular clock runs relative to generation time for silent base changes in DNA (changes that do not result in an amino acid change in the protein). According to Ridley, selection likely gives a better explanation for the case of protein evolution whereas neutral drift better explains silent base changes in DNA.
Overall Goal: Does Neutral Drift or Natural Selection Better Explain the Molecular Clock Effect?
Steps Along the Way:
* Investigate whether the latest research still has the two clocks running relative to absolute time and generation time respectively
* Ridley (p178) suggests an argument for how natural selection can explain why the protein molecular clock keeps absolute, rather than generational time. The argument supposes 2 organisms with different generation times, "both separately evolving in relation to changes in parasites with short generation times". I would like to model this argument using a combination of Perl and Mathematica where I could experiment with changing such variables as: mutation rates (of organisms and parasites), selection variables, and generation times (of organisms and parasites). I would then compare the results of this modeling experiment with the latest research findings to achieve the overall goal above.
Resources:
Gould, S.J. 2002 The Structure of Evolutionary Theory
Kimura, M. 1968 Evolutionary Rate at the Molecular Level. Nature 217:624-626
Dawkins, R. 1986 The Blind Watchmaker
Ridley, Mark. 1996 Evolution
Some Further Reading:
Bulmer, M 1988 Evolutionary Aspects of Protein Synthesis. Oxford Surv. Evol. Biol. 5:1-40
Bulmer, M 1988 Estimating the Variability of Substitution Rates. Genetics 123:615-619
Gillespie, J.H. 1991 The Causes of Molecular Evolution.
Gillespie, J.H. 1993 Episodic Evolution of RNA Viruses. Proc. Nat. Acad. Sci. USA 90:10411-10422
Nichol, et al 1993 Punctuated Equilibrium and Positive Darwinian Evolution in Vesicular Stomatitis Virus. Proc. Nat. Acad. Sci. USA 90:10424-10428
Ohta, T. 1992 The Nearly Neutral Theory of Molecular Evolution. Ann. Rev. Ecol. System. 23:263-286
Ohta, T. 1993 An Examination of the Generations-Time Effect on Molecular Evolution. Proc. Nat. Acad. Sci. USA 90:10676-10680
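As a baseline for the modeling step in the molecular clock proposal above, the purely neutral expectation can be written down directly: under the neutral theory, the substitution rate per generation equals the neutral mutation rate per generation (Kimura 1968), so a lineage with a shorter generation time accumulates more silent substitutions per year. The sketch below is in Python rather than the Perl and Mathematica the proposal names, and the function name and all parameter values are made up for illustration; it does not model the parasite argument itself.

```python
def neutral_substitutions(mutation_rate_per_generation: float,
                          generation_time_years: float,
                          elapsed_years: float) -> float:
    """Expected neutral substitutions per site along one lineage.

    Assumes the neutral result that substitution rate per generation
    equals mutation rate per generation.
    """
    generations = elapsed_years / generation_time_years
    return mutation_rate_per_generation * generations


if __name__ == "__main__":
    mu = 1e-8            # assumed neutral mutation rate per site per generation
    years = 1_000_000    # assumed elapsed absolute time
    short = neutral_substitutions(mu, generation_time_years=1, elapsed_years=years)
    long_ = neutral_substitutions(mu, generation_time_years=20, elapsed_years=years)
    # With the same per-generation mutation rate, the short-generation lineage
    # accumulates 20 times as many substitutions: a generation-time clock.
    print(short, long_, short / long_)
```

A clock that instead keeps absolute time would show equal substitution counts in both lineages, so the gap between these two numbers is what a selection-based model of the parasite argument would need to close.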
101 Term Project Proposal
Overlaying Clustering Results from PCA with Clustering Results from Self-Organizing Maps
Group members: Amy Chang, Shixin Zhang
Wang sleepingpanda99@hotmail.com
(Note: each group member submitted the same project proposal to his/her TF)
We are given:
1) DNA microarray data containing the expression profiles for 97% of the known or predicted genes of Saccharomyces cerevisiae. The microarray data measure changes in the concentrations of the RNA transcripts from each gene for seven successive intervals after transfer of wild-type (strain SK1) diploid yeast cells to a nitrogen-deficient medium that induces sporulation. This dataset comes from a published paper: The Transcriptional Program of Sporulation in Budding Yeast, S. Chu et al[i]. The dataset can be downloaded from http://cmgm.stanford.edu/pbrown/sporulation.
2) The analysis by S. Chu et al. concluded that “at least seven distinct temporal patterns of induction were observed”. Their conclusion comes from a clustering analysis of the data using self-organizing maps.
Our challenge:
1) Perform principal component analysis on the dataset from S. Chu et al
2) See if |