-------------------------------------------------------------------------------------
CATSCORE PERL SCRIPTS
(c) 2002 Kevin Cheung
-------------------------------------------------------------------------------------
DESCRIPTION
-------------------------------------------------------------------------------------

This set of Perl scripts is described in the paper "A microarray-based antibiotic
screen identifies a regulatory role for supercoiling in the osmotic stress response
of Escherichia coli," by Kevin J. Cheung, Vasudeo Badarinarayana, Douglas
Selinger, Daniel Janse, and George M. Church.

Analysis of clustered microarray data can benefit from a systematic approach to the
characterization of clusters, and specifically, the elucidation of the commonalities
that are shared among coordinately regulated genes.  In this way, hypotheses may be 
developed correlating gene regulation with gene function.

CATSCORE implements a test (as described by Tavazoie 1999) that estimates the
statistical enrichment of functional categories (derived from gene association
data) within specific clusters.  As the reference we used GenProtEC: Escherichia
coli genome and protein database, which is maintained on the website of the Marine
Biological Laboratory at Woods Hole, MA.  This data is a composite of functional
classifications compiled by Riley and Labedan in Neidhardt's Escherichia coli and
Salmonella: Cellular and Molecular Biology, 2nd Edition, and work by Serres and Riley
on paralogous proteins in E. coli.  We processed the 8,434 gene classifications and
338 functional categorizations into a format amenable to computational analysis.
CATSCORE then performs a hypergeometric test (equivalent to a one-sided Fisher exact
test on a 2x2 table) for each functional categorization against each cluster.
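
The hypergeometric test mentioned above can be illustrated with a minimal sketch.
It is written in Python for brevity and is NOT the CATSCORE Perl code; the function
and parameter names are hypothetical.  It assumes the usual upper-tail form: given
N genes in total, M of them in a category, and a cluster of n genes of which k fall
in the category, the p-value is the probability of seeing k or more category genes
in a random draw of n genes.

```python
from math import comb


def enrichment_p_value(total_genes, category_size, cluster_size, overlap):
    """Upper-tail hypergeometric probability P(X >= overlap).

    Illustrative only; all names here are assumptions, not CATSCORE identifiers.
    total_genes   -- N, number of genes in the reference set
    category_size -- M, number of genes annotated with the category
    cluster_size  -- n, number of genes in the cluster
    overlap       -- k, number of cluster genes annotated with the category
    """
    # The overlap can never exceed the smaller of the two gene sets.
    upper = min(category_size, cluster_size)
    # Number of ways to draw any cluster of this size from the reference set.
    denominator = comb(total_genes, cluster_size)
    # Count the draws that contain at least `overlap` category genes.
    tail = sum(
        comb(category_size, i) * comb(total_genes - category_size, cluster_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Made-up numbers: 4000 genes, a category of 40, a cluster of 100, 8 shared.
    print(enrichment_p_value(4000, 40, 100, 8))
```

A small p-value suggests that the category is over-represented in the cluster.
Because every category is tested against every cluster, some form of
multiple-testing correction is normally applied when interpreting the results.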


-------------------------------------------------------------------------------------
REQUIREMENTS
-------------------------------------------------------------------------------------

These scripts were designed for use with Windows systems running ActivePerl 5.6, but
should run on UNIX or Linux as well.  GeneCluster can be obtained at:

http://www-genome.wi.mit.edu/cancer/software/genecluster2/gc2.html


-------------------------------------------------------------------------------------
README
-------------------------------------------------------------------------------------

Enclosed are all relevant files needed to analyze clustered output from the Whitehead
Institute GeneCluster program (Tamayo 1999).  The pipeline is as follows:

Table of microarray data

   ||
   ||
   \/

run GeneCluster (GeneCluster generates two files based on a prefix, e.g. the prefix
osr175 will generate osr175_data.txt and osr175_centroids.txt)

   ||
   ||
   \/

at the command line, run "alltests file_data.txt" (e.g. "alltests osr175_data.txt")

   ||
   ||
   \/

output is found in file_data.txt.top5 and file_data.txt.scores (e.g.
osr175_data.txt.top5 and osr175_data.txt.scores)