This document describes the steps involved in the computer-assisted processing of polony images in Zhu et al. (2003). Further details can be found by examining the images and open-source code linked from this document.
All images of gels were acquired with a ScanArray 5000 (Perkin Elmer) instrument at 10-micron resolution using either a 635 nm (for Cy5 detection) or 532 nm (for Cy3 detection) laser. Intensities are returned as 16-bit values (0..65535) per pixel.
All image processing described in this document was executed in MATLAB (R12.1) with extensive use of Image Processing Toolbox functions.
The full sets of unprocessed and processed images as well as the specific MATLAB-based algorithms applied in the analyses described are freely available. We have embedded links to MATLAB scripts within this document. The full set of eighty unprocessed images (8 slides x 20 images per slide) can be linked to here. This README document provides information on the nomenclature of filenames as well as the identities of the eight slides with respect to the experiments that were performed.
A [3 x 3] median filter was applied to each raw image. The appropriate background image was then subtracted from each signal image. When available, the background image used was the image captured after stripping (via denaturation) the extended probe from a given signal image. Where this was not available (Images 19 & 20 for Slides 1 through 4), the background image acquired immediately prior to the signal-acquisition scan was used. Median-filtered, background-subtracted signal images were saved as new image files (these files end with ".sub" extensions in the appropriate directories). All subsequent analyses are based on these median-filtered, background-subtracted signal images. The MATLAB script used to perform the steps described in this section can be found here.
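The filtering and subtraction described above can be sketched as follows. This is a minimal illustration in Python rather than the original MATLAB script. The function names, the border handling (edge pixels use only in-image neighbours) and the clipping of negative differences to zero are assumptions, not details taken from the original analysis.

```python
from statistics import median


def median_filter_3x3(image):
    """Apply a 3 x 3 median filter to a 2-D list of integer intensities.

    Assumption: at the image border the window is truncated to the
    neighbours that fall inside the image (no padding).
    """
    rows, cols = len(image), len(image[0])
    filtered = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
            ]
            # Truncated border windows have an even number of pixels, so the
            # median can be a half-integer; int() truncates it.
            filtered[r][c] = int(median(window))
    return filtered


def subtract_background(signal, background):
    """Subtract the background pixel by pixel.

    Assumption: negative differences are clipped to 0.
    """
    return [
        [max(0, s - b) for s, b in zip(sig_row, bg_row)]
        for sig_row, bg_row in zip(signal, background)
    ]


# Toy 3 x 3 example: an isolated bright pixel is removed by the filter.
raw_signal = [[10, 10, 10], [10, 900, 10], [10, 10, 10]]
raw_background = [[4, 4, 4], [4, 4, 4], [4, 4, 4]]
processed = subtract_background(
    median_filter_3x3(raw_signal), median_filter_3x3(raw_background)
)
```

In this toy case every window contains at most one outlier, so the spike at the centre is replaced by the surrounding value before the background is subtracted.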
Each polony (i.e. each polymerase colony bearing products derived from amplification of the same single molecule) is large enough that it is represented by multiple pixels within each image. Our algorithm performs several processing steps at the level of individual pixels before identifying and computationally partitioning the sets of pixels that comprise individual polonies.
The threshold signal value is defined as the intensity above which a given pixel is considered to validate the presence of a given exon at that specific location on the gel. Because images are scanned at varying laser powers and PMT settings, and because different single-base extensions may yield varying amounts of signal for a given quantity of extended DNA, the images need to be normalized to one another. We do this by 'thresholding' the images: we set an image-specific cutoff, and intensities above it are assigned a '1' while intensities below it are assigned a '0'. The aim here is to determine an approximately correct cutoff for each image.
The algorithm first defines the set of pixels that are above a minimum value on every image. These pixels will generally correspond to polonies that include all 10 exons. How this set of 'focus pixels' varies across the ten images is used to define appropriate thresholds. Choosing appropriate threshold values for each image therefore allows normalization across the full set of images. An automated process (that assumes this linearity) was applied to determine an appropriate set of threshold signal values for each background-subtracted image. Manual inspection suggested that the algorithm performed poorly for 5 of the full set of 80 images; in these cases, thresholds were determined manually. Thresholding of each 16-bit image yields a binary image that reflects our predictions with respect to the presence or absence of a given exon at each pixel in the image.
The MATLAB script used to perform the steps described in this section can be found here.
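The following Python sketch illustrates the two operations that the text does specify: selecting 'focus pixels' that exceed a minimum on every image, and binarizing an image at a cutoff. The exact rule the original MATLAB script uses to turn focus-pixel intensities into a cutoff is not given in this document, so `cutoff_from_focus` and its `fraction` parameter are purely hypothetical stand-ins, as are all function names.

```python
from statistics import median


def focus_pixels(images, minimum):
    """Return (row, col) positions whose intensity exceeds `minimum` in every image."""
    rows, cols = len(images[0]), len(images[0][0])
    return [
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if all(img[r][c] > minimum for img in images)
    ]


def cutoff_from_focus(image, focus, fraction=0.5):
    """HYPOTHETICAL rule, not from the original analysis:
    cutoff = fraction * median intensity of the focus pixels in this image."""
    return fraction * median(image[r][c] for r, c in focus)


def threshold_image(image, cutoff):
    """Binarize: 1 where the intensity is above the cutoff, 0 otherwise.

    Assumption: a value exactly equal to the cutoff is assigned 0.
    """
    return [[1 if value > cutoff else 0 for value in row] for row in image]


# Two toy 2 x 2 images of the same gel scanned at different overall brightness.
images = [
    [[100, 5], [80, 90]],
    [[200, 300], [10, 150]],
]
focus = focus_pixels(images, minimum=50)
binary_images = [
    threshold_image(img, cutoff_from_focus(img, focus)) for img in images
]
```

Because each cutoff is derived from the same set of pixel positions, a brighter scan receives a proportionally higher cutoff, which is the sense in which thresholding normalizes the images to one another.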
Step 3: Signature Determination
The exon “signature” of a given pixel on a polony image is defined as the full set of exons predicted to be present in the amplified DNA at that specific location of the gel. We abstract this signature pattern as a binary number consisting of digits corresponding to the 10 individual exons (which are defined by the 10 scans). For example, a signature of "1000010011" would mean that the 1st, 6th, 9th and 10th variable exons were present (as defined by the set of binary thresholded images). This signature can also be represented as its base 10 equivalent (531 for this example, reading the leftmost digit as the most significant). The previous step defined "threshold" cutoffs for each image, above which a given pixel on a given image is considered to reflect inclusion of the exon corresponding to that image. Here we create a signature image, in which the value at each pixel is the base 10 conversion of the 10-digit base 2 composite signature at that pixel (determined by the binary value at each pixel in each of the ten background-subtracted, thresholded binary images). The MATLAB script used to perform the steps described in this section can be found here. The signature images (consisting of pixel values ranging from 0 to 1023) are stored as a MATLAB file named "signature_image.mat" in the image directory corresponding to each slide.
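The encoding described above can be sketched in a few lines of Python. This is an illustration rather than the original MATLAB script; the function name is hypothetical, and the sketch assumes that the image for the 1st exon supplies the leftmost (most significant) binary digit, as in the "1000010011" example.

```python
def signature_image(binary_images):
    """Combine binary images (1st exon first) into per-pixel base 10 signatures.

    Assumption: the first image contributes the most significant binary digit.
    """
    rows, cols = len(binary_images[0]), len(binary_images[0][0])
    signature = [[0] * cols for _ in range(rows)]
    for binary in binary_images:
        for r in range(rows):
            for c in range(cols):
                # Shift the digits accumulated so far left by one binary
                # place, then append this image's digit.
                signature[r][c] = signature[r][c] * 2 + binary[r][c]
    return signature


# Ten 1 x 1 binary images reproducing the signature "1000010011".
digits = [1, 0, 0, 0, 0, 1, 0, 0, 1, 1]
example = signature_image([[[d]] for d in digits])
```

With ten input images the resulting values fall between 0 (no exon called at that pixel) and 1023 (all ten exons called).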
The values of the “signature” image of a given gel thus reflect the exon presence/absence pattern predicted for amplified DNA at specific pixel positions. This step is aimed at finding and counting polonies (where each polony is captured by multiple pixels). Generally speaking, each polony manifests itself on the signature image as a cluster of connected pixels that bear identical exon signatures, termed here a signature object. MATLAB image processing algorithms are used to identify such objects. Morphological characteristics of individual objects (e.g. size & shape characteristics) are then used to distinguish true polonies from "noise" objects and pixels. Finally, an additional processing step is used to identify instances in which multiple distinct polonies bearing identical signatures are adjacent to one another (such that they form a single connected object), and to classify them separately. The MATLAB script used to perform the steps described in this section can be found here.
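Finding signature objects amounts to connected-component labeling restricted to pixels with equal signature values. The Python sketch below illustrates the idea with a breadth-first search. It is not the original MATLAB script: the function name, the use of 8-connectivity, the treatment of signature 0 as background and the `min_pixels` size filter (a simple stand-in for the unspecified morphological criteria) are all assumptions. The splitting of adjacent polonies that share a signature is not attempted here.

```python
from collections import deque


def signature_objects(signature, min_pixels=1):
    """Find clusters of connected pixels sharing one nonzero signature.

    Returns a list of (signature_value, [(row, col), ...]) tuples in
    row-major order of each cluster's first pixel.
    Assumptions: 8-connectivity; signature 0 is treated as background.
    """
    rows, cols = len(signature), len(signature[0])
    seen = [[False] * cols for _ in range(rows)]
    objects = []
    for r in range(rows):
        for c in range(cols):
            value = signature[r][c]
            if value == 0 or seen[r][c]:
                continue
            # Breadth-first search over neighbours with the same signature.
            seen[r][c] = True
            queue = deque([(r, c)])
            pixels = []
            while queue:
                pr, pc = queue.popleft()
                pixels.append((pr, pc))
                for nr in range(max(0, pr - 1), min(rows, pr + 2)):
                    for nc in range(max(0, pc - 1), min(cols, pc + 2)):
                        if not seen[nr][nc] and signature[nr][nc] == value:
                            seen[nr][nc] = True
                            queue.append((nr, nc))
            # Hypothetical size filter standing in for morphological checks.
            if len(pixels) >= min_pixels:
                objects.append((value, pixels))
    return objects


toy_signature = [
    [531, 531, 0, 7],
    [531, 0, 0, 7],
    [0, 0, 3, 0],
]
candidates = signature_objects(toy_signature, min_pixels=2)
```

In the toy image the single pixel with signature 3 touches a pixel with signature 7 diagonally, but it is not merged with it because the signatures differ; it is then discarded by the size filter.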
Four slides (2, 4, 6, 8) used cDNA derived from the Eph4 cell line and four slides (1, 3, 5, 7) used cDNA derived from the transformed Eph4Bdd cell line. In this step, counts are tabulated from the list of computationally identified polonies, and statistics regarding the difference between Eph4 and Eph4Bdd in proportional representation of any given isoform are calculated. We also calculate statistics for differences in proportional representation of exons (with counts obtained by summing the counts for all isoforms on any given slide that contain a given exon). Finally, the base 10 values (1..1023) are decoded to the exon signatures that they represent. The information generated in this step is exported to a set of text files (see README file). The MATLAB script used to perform the steps described in this section can be found here.
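The tabulation and decoding described above can be sketched in Python as follows. The function names are hypothetical and the polony list is made-up toy data. The statistical comparison between Eph4 and Eph4Bdd is deliberately left out, because this document does not state which test the original MATLAB script applies.

```python
from collections import Counter


def decode_signature(value, n_exons=10):
    """Return the 1-based exon numbers present in a base 10 signature.

    Assumption: the leftmost binary digit corresponds to the 1st exon.
    """
    bits = format(value, "0{}b".format(n_exons))
    return [i + 1 for i, bit in enumerate(bits) if bit == "1"]


def isoform_proportions(polony_signatures):
    """Proportional representation of each isoform among the polonies of one slide."""
    counts = Counter(polony_signatures)
    total = sum(counts.values())
    return {sig: n / total for sig, n in counts.items()}


def exon_counts(polony_signatures, n_exons=10):
    """Count polonies containing each exon by summing over all isoforms."""
    counts = Counter()
    for sig in polony_signatures:
        for exon in decode_signature(sig, n_exons):
            counts[exon] += 1
    return counts


# Toy data: one signature per identified polony on a hypothetical slide.
slide_polonies = [531, 531, 7, 3]
proportions = isoform_proportions(slide_polonies)
per_exon = exon_counts(slide_polonies)
```

Here `decode_signature(531)` recovers the 1st, 6th, 9th and 10th exons of the "1000010011" example, and the per-exon counts are obtained by summing over every polony whose isoform contains that exon.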
The focus of Step 6 is to facilitate the visual validation or rejection of isoforms that were 'automatically' identified by Steps 1 through 5. We want to ensure that the polony-calling algorithm is not inaccurately claiming polonies with a specific rare signature that are not actually polonies but instead border-elements or junk (i.e. false positives). This semi-automated process permits visual evaluation and confirmation of each isoform claimed by Step 4 of the software. A list of isoforms validated in this step is exported to a text file (see README file). The MATLAB script used to perform the steps described in this section can be found here.