POLONY SOFTWARE MANUAL

Church Lab

Department of Genetics

Harvard Medical School

Mitra Lab

Department of Genetics

Washington University in St. Louis

INTRODUCTION

AUTOMATED IMAGE ALIGNMENT

PIXEL-BY-PIXEL SEQUENCE SIGNATURE EXTRACTION

POLONY SIGNATURE OBJECT FINDING

SOFTWARE

REFERENCES

INTRODUCTION

We provide here open-source code for polony image analysis, on the condition that all modifications of the code be made open-source and readily available as well. The current set of scripts are M files (MATLAB) and make extensive use of the Image Processing Toolbox functions. We are currently in the process of converting several of the scripts to C++ to improve speed.

We have yet to seriously generalize the routines. Consequently, many of the below programs were developed for specific sets of polony images. Protocols and parameters are constantly changing, so we often make minor modifications or improvements when analyzing new sets of images. Users may need to do the same in analyzing their own images. This document is intended to illustrate the general principles behind our analysis, and will provide users with a sense of how various commands can be utilized for efficient image analysis. Links to the scripts used in the analysis performed here are provided in the Software section.

Here we focus on the analysis of ten serial scans of a single slide on which a short read-length was sequenced via the polony method (Mitra et al., Figure 4B, manuscript in preparation). Ten cycles of single-base-extension were performed on a single acrylamide gel in which multiple polonies derived from five unique templates were present. A scan was performed after each single-base-extension step. In each scan, data is acquired on two channels, Cy3 (532 nM) and Cy5 (635 nM). The universal sequencing primer in this experiment was labeled with Cy3, whereas the nucleotides being added at each single-base-extension step were labeled with Cy5.

AUTOMATED IMAGE ALIGNMENT

Even with a defined protocol on a scanning instrument, the precise positioning of a given slide can vary appreciably from scan to scan. For our analysis, it is critical that scans be of the same dimension and precisely aligned to one another, such that a given set of pixel coordinates corresponds to precisely the same coordinates on different scans of the same gel. On gels, there exist defects that persist from scan to scan and often take the form of saturated or near-saturated pixels. We took advantage of these consistent defects to perform image alignment. A single scan was selected as a ‘reference’ to which all other images of the same slide could be aligned. Data from only a single channel was considered; in this case, the 532 nm channel was selected as it appeared to contain a greater amount of noise. A series of potential X-Y offsets (relative to the fixed reference scan) was considered for each scan being aligned. The number of individual pixel coordinates that were saturated or near-saturated in both the reference scan and the scan under consideration was calculated for each potential offset. This number is subsequently referred to as the “coincidence score”.

To save computational time, the alignment algorithm considered only a 200x200 region of each image (at the upper-left corner) and restricted potential offsets to ± 50 pixels along each axis. In testing the algorithm, we worked with more than just the ten scans that are the primary focus of this analysis. We attempted to two sets of images, one consisting of 20 scans of a single slide and the other of 27 scans of a different slide. Run-time with these parameters was on the order of ~2 minutes per scan-alignment, and was unambiguously successful for 38/47 of the slides, resulting in a clear and strong peak in the value of the coincidence score at a specific offset. Selecting a somewhat more internal 200x200 block (e.g. not at the edge) resulted in unambiguous alignment of 7 more of the slides. It is very likely that 45 of the 47 slides could have been captured immediately, had an internal 200x200 grid been selected from the beginning of the analysis. The final two slides warranted manual double-checking (due to lower peak scores), but appeared to have been correctly aligned by the latter approach as well.

Example of Pre-Alignment Offset Matrix:

The X-axis and Y-axis represent potential offsets of the test slide relative to the reference slide. The Z-axis represents the coincidence score. Note the presence of a strong peak that is not centered.

After the peak is defined, it is straight-forward to realign the scan under consideration such that it is in register with the reference scan.

Offset Matrix for the same slide, but Post-Alignment:

MATLAB Handle Graphics

The X-axis and Y-axis represent potential offsets of the test slide relative to the reference slide. The Z-axis again represents the score at each potential offset with respect to coincident-near-saturated pixels. Note that the strong peak has been centered.

A total of 47 alignments were performed. A plot of the suggested offsets for individual slides is as follows:

The original set of scans were adjusted by the specified amounts, adjusted to maximum contrast, and resaved. 635 nm scans were adjusted by the same offsets as their 532 nm counterparts. With these images, each pixel corresponds to a 10 micron square.

Note the restricted portion of the landscape in which offsets tend to occur. In the future, computational time for image-alignment can be further reduced by limiting analysis of potential offsets to a narrow corridor along a single axis.

The range of offsets observed appears to vary with the type of scanner used (as one might expect, as it depends on the precision of engineering of the instrument). The data discussed here was collected with an Axon instrument. Anecdotally, we have observed that the offsets with a GSI instrument are much lower (generally less than 5 pixels along each axis).

[ASSESMENT OF ACCURACY OF ALIGNMENT METHOD]

We are rewriting the algorithms in C++ for speed optimization.

PIXEL-BY-PIXEL SEQUENCE SIGNATURE EXTRACTION

The Cy3 channel data essentially provides consistent information on the location of polonies, and serves as an internal control against which base addition or failure of addition at a given step of single-base-extension can be measured. We will term the [532 nm / 635 nm] ratio, below which we consider a base to have been successfully added, to be the “discriminating ratio”. A ratio-image is essentially a matrix consisting of the 532 nm scan intensities divided by the 632 nm scan intensities. Note that in this section of the analysis, all calculations will be performed on a single pixel-by-pixel basis. In other words, no matrix-based smoothing or other operations that rely on multiple pixels will be performed. To save computational time, we will focus on a 1000 x 1000 polony-rich subsection of the slide.

Pixel intensity values are integers ranging from 0 to 65535.

Obviously, not all pixels on the image contribute to polonies. The set of background pixels can be thought of as including both those pixels that lie in the low-intensity areas between polonies, as well as those pixels that are at the fringes of a polony. As mentioned above, the 532 nm channel tells us where the polonies are by virtue of the Cy3 labeled sequencing primer. In this analysis, all pixels whose intensity value on the 532 nm channel was less than 40,000 in all 10 scans were excluded from consideration.

A pixel signature is a binary string, where each digit position is reserved for a given scan. The signature extraction algorithm functions as follows:

(1) Generate ratio-image for a given scan.

(2) Convert to binary-image based on a fixed discriminating ratio (e.g. 1 where base added, 0 where base did not)

(3) Append value to binary-string-image

(4) Go To Step (1) for all relevant images (in order of scans).

(5) Convert binary-strings to sequences based on knowledge of base-addition order

The signature matrix was generated in MATLAB and exported to PERL for decoding of base 10 values into base 2 bit-strings. Time to generate signatures was on the order of minutes, and is directly proportional to the number of images under analysis.

In this experiment, the following five templates were amplified:

ST20

Primer - CACACACACACACACTC

ST21

Primer - GTGTGTGTGTGTGTGTC

ST23

Primer - CAGCCGAACGACCGATC

ST29

Primer - ATGTGAGAGCTGTCGTC

ST30

Primer – AGTGCTCACACACGTGATC

After taking into account systematic errors (discussed below), the five most common “pixel signatures” corresponded to these five templates. The following image depicts the 1000 x 1000 area under consideration. Pixels yielding signatures corresponding to one of the five above templates have been given unique colors, whereas all “noise” pixels (those corresponding to unrecognized signatures) are colored purple. The key below the image summarizes the templates to which each color corresponds, the true sequences, and the observed signatures.

If you know that the base addition order was in this experiment was CAGTCAGTCA (ten bases), then the translation from bit signatures to a sequence read is quite straightforward. It appears that the vast majority of incorrect signatures are to be found at pixels in one of two places. First, note the bands of blue found at the junction of adjacent polonies with differing signatures. Second, note the small but substantial numbers of blue pixels mixed into the green polonies, seemingly reflecting an error that appears to be specific to pixels deriving from that template.

Future work can be directed at identifying and eliminating these forms of random error on a pixel-by-pixel basis (or something close to it).

There are two systematic errors that are consistent with those observed by Rob Mitra when sequencing these template. The first is the the apparent insertion of a G at the second position of the ST29-like sequence. The second is the recognition of the “CC” stretch in ST23 as a single “C”. We have yet to fully define the nature of the former error, but it is consistently observed with this template. The latter error is fully expected, given that non-terminating nucleotides were used in this protocol. Given that these are systematic errors, we will consider the ST29-like and ST23-like sequences to be “correct” in the context of our analysis of random error.

What exactly is the noise? In this figure, the five most common noise species (e.g. signatures) have been given unique colors, whereas correct template calls are colored purple. Minor noise signatures species (not in the top five) are colored white. We observe that characteristic noise types occur systematically at the intersection of specific pair-wise combinations of polonies.

An analysis of the common noise-types (GMC) found that they were generally very consistent with a logical OR function of the signatures of the intersecting polonies, as one might expect. A summary of that analysis is as follows:

POLONY SIGNATURE OBJECT FINDING

In this section we’ll move away from the worrying about individual pixels and instead worry about how to distinguish multi-pixel polonies from one another, but more importantly how to distinguish real signatures from noise in the form of randomly scattered noise pixels as well as erroneous signatures that occur at the junction of closely neighboring polonies. Finding individual polonies on a crowded slide is a difficult task, as the intensity peaks of “touching” polonies do not always appear to be clearly distinguishable. An alternative approach to finding and scoring individual polonies is to start from the single-pixel-signature analysis, and to ask whether objects can be defined as clusters of adjacent pixels that share a common signature. Looking at the above picture, it is quite clear that this should be possible, especially after some image smoothing (or “signature smoothing”) has been performed. We’ll then be faced with the task of distinguishing “real signature objects” from “noise signature objects”, as we have observed that clusters of pixels with identical noise signatures occur quite commonly. One distinguishing characteristic of ‘noise signature objects’ may be that their perimeter may tend to be surrounded by pixels of a different signature, as they most often occur as a consequence of the adjacency of two real polonies. Another distinguishing characteristic is that mixed-polony signatures tend to have higher parities than real polony signatures. We’ll attempt to filter out ‘noise signature objects’ on the basis of these characteristics.

The algorithm consists of just a few steps:

(1) Signature Mode Filtering.

A smoothing operation, similar to median filtering. For each location in the image, the most frequently occurring signature (e.g. the mode) in the 10x10 matrix surrounding it is selected as its new value.

(2) Object Finding

An object is a fully connected set of at least N pixels with a single signature value. In this example, an object was required to contain at least 400 pixels.

(3) Perimeter & Parity Scoring

For each object, find the set of non-object pixels that are adjacent to object pixels. Calculate the fraction of perimeter pixels that have assigned signatures. Report objects that have high fractions of “free” perimeter pixels. In this example, the minimum fraction of free perimeter pixels to qualify for reporting was set to 0.30. The maximum allowable parity was 10 (excluding ‘objects’ that added on every single-base-extension; usually these are clumps of high-intensity noise)

The following image is the same region of the gel as shown above, after signature mode filtering, object finding, and perimeter & parity-based image processing.

The analysis is successful in that 18 out of 18 polony objects are correctly called as one of the five known template sequences (with no mistakes excepting the known systematic errors on one template that are discussed above)..

SOFTWARE

These scripts have not been generalized. We are providing them here to demonstrate how the image analysis proceeds.

SatAlignAll.m Handles multiple image alignment

SatAlign.m Handles single image alignment (called by SatAlignAll.m)

SigProcess.m Determines pixel-by-pixel sequence signatures

ObjectFinder.m Finds signature objects that meet specified criteria

ModeFilter.m Performs a matrix-based signature mode operation on an image (called by ObjectFinder.m)

ModeFind.m Performs signature mode-finding on a sub-matrix (called by ModeFilter.m)

RunAll.m RunAll.m (runs all of the above scripts in concert)

REFERENCES

1. Mitra, R and Church, GM (1999) In situ localized amplification and contact replication of many individual DNA molecules. Nucleic Acids Res. 27(24):e34; pp.1-6.

2. Mitra, RD, Butty, V, Shendure, J, Williams, BR, Housman, DE, and Church, GM (2002) Digital Genotyping and Haplotyping with Polymerase Colonies. (submitted to PNAS, 10-21-02)

3. Mitra RD, Shendure J, Olejnik, J, Church GM (2002) Fluorescent in situ Sequencing on Polymerase Colonies (manuscript in preparation)

Last modified 11-21-02 by Jay Shendure

Please address questions or comments to Jay Shendure at jay_shendure@student.hms.harvard.edu