Department of
Genetics
Harvard Medical
School
Department of
Genetics
Washington
University in St. Louis
PIXEL-BY-PIXEL SEQUENCE SIGNATURE EXTRACTION
POLONY
SIGNATURE OBJECT FINDING
We provide here open-source code for polony image analysis, on the condition that all modifications of the code be made open-source and readily available as well. The current set of scripts are M files (MATLAB) and make extensive use of the Image Processing Toolbox functions. We are currently in the process of converting several of the scripts to C++ to improve speed.
We have yet to seriously generalize the routines. Consequently, many of the below programs were developed for specific sets of polony images. Protocols and parameters are constantly changing, so we often make minor modifications or improvements when analyzing new sets of images. Users may need to do the same in analyzing their own images. This document is intended to illustrate the general principles behind our analysis, and will provide users with a sense of how various commands can be utilized for efficient image analysis. Links to the scripts used in the analysis performed here are provided in the Software section.
Here we focus on the analysis of ten serial scans of a single slide on which a short read-length was sequenced via the polony method (Mitra et al., Figure 4B, manuscript in preparation). Ten cycles of single-base-extension were performed on a single acrylamide gel in which multiple polonies derived from five unique templates were present. A scan was performed after each single-base-extension step. In each scan, data is acquired on two channels, Cy3 (532 nM) and Cy5 (635 nM). The universal sequencing primer in this experiment was labeled with Cy3, whereas the nucleotides being added at each single-base-extension step were labeled with Cy5.
Even with a defined
protocol on a scanning instrument, the precise positioning of a given slide can
vary appreciably from scan to scan. For
our analysis, it is critical that scans be of the same dimension and precisely
aligned to one another, such that a given set of pixel coordinates corresponds
to precisely the same coordinates on different scans of the same gel. On gels, there exist defects that persist
from scan to scan and often take the form of saturated or near-saturated
pixels. We took advantage of these
consistent defects to perform image alignment.
A single scan was selected as a ‘reference’ to which all other images of
the same slide could be aligned. Data
from only a single channel was considered; in this case, the 532 nm channel was
selected as it appeared to contain a greater amount of noise. A series of potential X-Y offsets (relative
to the fixed reference scan) was considered for each scan being aligned. The number of individual pixel coordinates
that were saturated or near-saturated in both the reference scan and the
scan under consideration was calculated for each potential offset. This number is subsequently referred to as
the “coincidence score”.
To save
computational time, the alignment algorithm considered only a 200x200 region of
each image (at the upper-left corner) and restricted potential offsets to ± 50
pixels along each axis. In testing the
algorithm, we worked with more than just the ten scans that are the primary
focus of this analysis. We attempted to
two sets of images, one consisting of 20 scans of a single slide and the other
of 27 scans of a different slide.
Run-time with these parameters was on the order of ~2 minutes per
scan-alignment, and was unambiguously successful for 38/47 of the slides,
resulting in a clear and strong peak in the value of the coincidence score at a
specific offset. Selecting a somewhat
more internal 200x200 block (e.g. not at the edge) resulted in unambiguous
alignment of 7 more of the slides. It
is very likely that 45 of the 47 slides could have been captured immediately,
had an internal 200x200 grid been selected from the beginning of the
analysis. The final two slides
warranted manual double-checking (due to lower peak scores), but appeared to
have been correctly aligned by the latter approach as well.
Example of
Pre-Alignment Offset Matrix:
The X-axis and
Y-axis represent potential offsets of the test slide relative to the reference
slide. The Z-axis represents the
coincidence score. Note the presence of
a strong peak that is not centered.
After the peak is
defined, it is straight-forward to realign the scan under consideration such
that it is in register with the reference scan.
Offset Matrix
for the same slide, but Post-Alignment:
The X-axis and
Y-axis represent potential offsets of the test slide relative to the reference
slide. The Z-axis again represents the
score at each potential offset with respect to coincident-near-saturated
pixels. Note that the strong peak has
been centered.
A total of 47
alignments were performed. A plot of
the suggested offsets for individual slides is as follows:
The original set of
scans were adjusted by the specified amounts, adjusted to maximum contrast, and
resaved. 635 nm scans were adjusted by
the same offsets as their 532 nm counterparts.
With these images, each pixel corresponds to a 10 micron square.
Note the restricted
portion of the landscape in which offsets tend to occur. In the future, computational time for
image-alignment can be further reduced by limiting analysis of potential offsets
to a narrow corridor along a single axis.
The range of
offsets observed appears to vary with the type of scanner used (as one might
expect, as it depends on the precision of engineering of the instrument). The data discussed here was collected with
an Axon instrument. Anecdotally, we
have observed that the offsets with a GSI instrument are much lower (generally
less than 5 pixels along each axis).
[ASSESMENT
OF ACCURACY OF ALIGNMENT METHOD]
We are rewriting
the algorithms in C++ for speed optimization.
PIXEL-BY-PIXEL SEQUENCE SIGNATURE EXTRACTION
The Cy3 channel
data essentially provides consistent information on the location of polonies,
and serves as an internal control against which base addition or failure of
addition at a given step of single-base-extension can be measured. We will term the [532 nm / 635 nm] ratio,
below which we consider a base to have been successfully added, to be the
“discriminating ratio”. A ratio-image
is essentially a matrix consisting of the 532 nm scan intensities divided by
the 632 nm scan intensities. Note that
in this section of the analysis, all calculations will be performed on a single
pixel-by-pixel basis. In other words,
no matrix-based smoothing or other operations that rely on multiple pixels will
be performed. To save computational
time, we will focus on a 1000 x 1000 polony-rich subsection of the slide.
Pixel intensity
values are integers ranging from 0 to 65535.
Obviously, not all
pixels on the image contribute to polonies.
The set of background pixels can be thought of as including both those
pixels that lie in the low-intensity areas between polonies, as well as those
pixels that are at the fringes of a polony.
As mentioned above, the 532 nm channel tells us where the polonies are
by virtue of the Cy3 labeled sequencing primer. In this analysis, all pixels whose intensity value on the 532 nm
channel was less than 40,000 in all 10 scans were excluded from
consideration.
A pixel signature
is a binary string, where each digit position is reserved for a given
scan. The signature extraction
algorithm functions as follows:
(1) Generate ratio-image for a given scan.
(2) Convert to binary-image based on a fixed
discriminating ratio (e.g. 1 where base added, 0 where base did not)
(3) Append value to binary-string-image
(4) Go To Step (1) for all relevant images (in
order of scans).
(5) Convert binary-strings to sequences based on
knowledge of base-addition order
The signature
matrix was generated in MATLAB and exported to PERL for decoding of base 10
values into base 2 bit-strings. Time to
generate signatures was on the order of minutes, and is directly proportional
to the number of images under analysis.
In this experiment,
the following five templates were amplified:
ST20
Primer - CACACACACACACACTC
ST21
Primer - GTGTGTGTGTGTGTGTC
ST23
Primer - CAGCCGAACGACCGATC
ST29
Primer - ATGTGAGAGCTGTCGTC
ST30
Primer – AGTGCTCACACACGTGATC
After taking into account systematic errors (discussed below), the five most common “pixel signatures” corresponded to these five templates. The following image depicts the 1000 x 1000 area under consideration. Pixels yielding signatures corresponding to one of the five above templates have been given unique colors, whereas all “noise” pixels (those corresponding to unrecognized signatures) are colored purple. The key below the image summarizes the templates to which each color corresponds, the true sequences, and the observed signatures.
If you know that
the base addition order was in this experiment was CAGTCAGTCA (ten bases), then
the translation from bit signatures to a sequence read is quite
straightforward. It appears that the
vast majority of incorrect signatures are to be found at pixels in one of two
places. First, note the bands of blue
found at the junction of adjacent polonies with differing signatures. Second, note the small but substantial
numbers of blue pixels mixed into the green polonies, seemingly reflecting an
error that appears to be specific to pixels deriving from that template.
Future work can be
directed at identifying and eliminating these forms of random error on a
pixel-by-pixel basis (or something close to it).
There are two
systematic errors that are consistent with those observed by Rob Mitra when
sequencing these template. The first is
the the apparent insertion of a G at the second position of the ST29-like
sequence. The second is the recognition
of the “CC” stretch in ST23 as a single “C”.
We have yet to fully define the nature of the former error, but it is
consistently observed with this template.
The latter error is fully expected, given that non-terminating
nucleotides were used in this protocol.
Given that these are systematic errors, we will consider the ST29-like
and ST23-like sequences to be “correct” in the context of our analysis of
random error.
What exactly is the
noise? In this figure, the five most
common noise species (e.g. signatures) have been given unique colors, whereas
correct template calls are colored purple.
Minor noise signatures species (not in the top five) are colored
white. We observe that characteristic
noise types occur systematically at the intersection of specific pair-wise
combinations of polonies.
An analysis of the
common noise-types (GMC) found that they were generally very consistent with a
logical OR function of the signatures of the intersecting polonies, as one
might expect. A summary of that
analysis is as follows:
POLONY
SIGNATURE OBJECT FINDING
In this section
we’ll move away from the worrying about individual pixels and instead worry
about how to distinguish multi-pixel polonies from one another, but more
importantly how to distinguish real signatures from noise in the form of
randomly scattered noise pixels as well as erroneous signatures that occur at
the junction of closely neighboring polonies.
Finding individual polonies on a crowded slide is a difficult task, as
the intensity peaks of “touching” polonies do not always appear to be clearly
distinguishable. An alternative
approach to finding and scoring individual polonies is to start from the single-pixel-signature
analysis, and to ask whether objects can be defined as clusters of adjacent
pixels that share a common signature.
Looking at the above picture, it is quite clear that this should be
possible, especially after some image smoothing (or “signature smoothing”) has
been performed. We’ll then be faced
with the task of distinguishing “real signature objects” from “noise signature
objects”, as we have observed that clusters of pixels with identical noise
signatures occur quite commonly. One
distinguishing characteristic of ‘noise signature objects’ may be that their
perimeter may tend to be surrounded by pixels of a different signature, as they
most often occur as a consequence of the adjacency of two real polonies. Another distinguishing characteristic is
that mixed-polony signatures tend to have higher parities than real polony
signatures. We’ll attempt to filter out
‘noise signature objects’ on the basis of these characteristics.
The algorithm
consists of just a few steps:
(1) Signature Mode Filtering.
A smoothing
operation, similar to median filtering.
For each location in the image, the most frequently occurring signature
(e.g. the mode) in the 10x10 matrix surrounding it is selected as its new
value.
(2) Object Finding
An object is a
fully connected set of at least N pixels with a single signature value. In this example, an object was required to
contain at least 400 pixels.
(3) Perimeter &
Parity Scoring
For each object,
find the set of non-object pixels that are adjacent to object pixels. Calculate the fraction of perimeter pixels
that have assigned signatures. Report
objects that have high fractions of “free” perimeter pixels. In this example, the minimum fraction of
free perimeter pixels to qualify for reporting was set to 0.30. The maximum allowable parity was 10 (excluding
‘objects’ that added on every single-base-extension; usually these are clumps
of high-intensity noise)
The following image
is the same region of the gel as shown above, after signature mode filtering,
object finding, and perimeter & parity-based image processing.
The analysis is
successful in that 18 out of 18 polony objects are correctly called as one of
the five known template sequences (with no mistakes excepting the known
systematic errors on one template that are discussed above)..
SatAlignAll.m Handles multiple image alignment
SatAlign.m Handles single image alignment (called by SatAlignAll.m)
SigProcess.m Determines pixel-by-pixel sequence signatures
ObjectFinder.m Finds signature objects that meet specified criteria
ModeFilter.m Performs a matrix-based signature mode operation on an image (called by ObjectFinder.m)
ModeFind.m Performs signature mode-finding on a sub-matrix (called by ModeFilter.m)
RunAll.m RunAll.m (runs all of the above scripts in concert)
1.
Mitra, R and
Church, GM (1999) In situ
localized amplification and contact replication of many individual DNA
molecules. Nucleic
Acids Res. 27(24):e34; pp.1-6.
2. Mitra, RD, Butty, V, Shendure, J, Williams, BR, Housman, DE, and Church, GM (2002) Digital Genotyping and Haplotyping with Polymerase Colonies. (submitted to PNAS, 10-21-02)
3. Mitra RD, Shendure J, Olejnik, J, Church GM (2002) Fluorescent in situ Sequencing on Polymerase Colonies (manuscript in preparation)
Last modified 11-21-02 by Jay Shendure
Please address questions or comments to Jay Shendure at jay_shendure@student.hms.harvard.edu