This writeup describes proof-of-principle experiments for “wobble” sequencing

Wobble Sequencing [JAS, 01-Jun-04]

Proof of Principle on Loaded Beads (T1..T5)

Proof of Principle on Emulsion Beads (T1..T5)

Although this is still a method based on sequencing by polymerase-extension, it differs from FISSEQ and pyrosequencing in that base-additions are not “progressive”. Instead, after a given single-base-extension (SBE), the sequencing primer is stripped from the bead-immobilized templates and a new primer is hybridized. The trick to getting beyond the first base is that each sequencing primer in the set “reaches” out to a defined position in the unknown unique sequence of the template. A sequencing primer, from 5’ to 3’, thus consists of an “anchor sequence” that is complementary to the constant sequence on the template [identical to the sequencing primers that we’ve been using but a bit shorter to aid in primer-strpiping], and a defined number of additional “universal” bases, that will hybridize to the unknown sequence regardless of what it is. If, for example, there are three fixed universal bases, then the sequencing primer is positioned to sequence the fourth base via SBE with labeled nucleotides. After a single-base-extension and data acquisition, extended and unextended primers are stripped (with heat) and a new primer is annealed that has a different number of universal bases, thus querying a different base-position within the unknown sequence. In this simplest iteration of the scheme, one only needs a set of N primers to achieve a read-length of N.

The primary challenge is how to define the universal bases. Universal bases are synthetic nucleotide analogs that ideally pair with equal affinities to each of the natural nucleotides, and are readily accepted as substrates by natural enzymes. Examples include 5-nitroindole, 3-nitropyrole, deoxyinosine, etc. Another possibility would be to use natural bases, but order the sequencing primer oligonucleotides with a full “wobble” at universal positions, such that all possible sequencing primers (or some random subset of all possibilities) is present during hybridization. I tried all of the above with marginal success.

The failure is pretty easy to explain in retrospect by the facts that (1) regardless of how they behave in hybridization, polymerases tend not to like to extend off of unnatural bases; (2) for natural nucleotides, matching at the 3’-terminal of the sequencing primer is the most critical for efficient primer extension, but the terminal bases of any sequencing primer will make the least contribution to hybridization kinetics (relative to other “wobble” positions). The result is that even if some “sorting” goes on with “wobble” sequencing primers during primer hybridization, the primers that ultimately hybridize are quite likely to be mismatched at the 3’ terminus.

The approach that finally started to show some success involved fixing the terminal two bases of a given sequencing primer, but allowing the remainder of bases at “universal” positions to be synthesized with a full “wobble” of natural bases. The disadvantage of this compromise is that 16 separate hybridizations are required for each “reach” length (4^2 combinations of the two terminal bases). This is mitigated by the fact that polymerases don’t extend off mispaired termini very well at all, so a given extension set reveals the identity of both the two terminal bases and the extended base. So the average efficiency of the process is 3/16 = 0.188 bases per cycle.

Non-terminator FISSEQ, by comparison, yields approximately ~0.50 bases-per-cycle (assuming no homopolymer resolution and thus counting multi-base runs as single extensions; someone double-check this if you have a second). By this consideration, achieving an identical read-length will require ~2.67 times as many cycles in the 2bp-matched-wobble-sequencing system.

Regardless, if we or someone else eventually developed a four-color reversible-terminator FISSEQ system, there would be no real advantage to the wobble-sequencing method as it would provide the same information (i.e. homopolymer resolution) but require 5 times as much work.

Summary of Advantages

1. First and foremost, provides a quick solution to the homopolymer problem.

2. Mistakes (manual) and inefficiencies (biochemical) are non-cumulative.

3. Technology development for longer reads is greatly expedited (i.e. don’t have to cycle out to test a hypothesis for improving read-lengths)

4. Better signal than current FISSEQ system (in which a desire for signal has to be balanced against a desire to minimize the fraction of extended templates with cleaved linker as it inhibits the polymerase).

5. Lack of requirement to take extensions to completion opens up polymerase-choice (and amounts).

Summary of Disadvantages

1. The “terminally-matched” approach (i.e. xN.2 primers) requires ~3 times as many cycles than FISSEQ to achieve the same read-lengths.

2. Probably a lower prospect of ever achieving read-lengths > 20.

3. Has only been tried on a small number of templates

Arguments against the listed disadvantages are that (1) if it works, this is no biggie, (2) read-lengths of 10-20 will be sufficient for most purposes, especially in the context of a paired-strategy, and (3) the same could be said of FISSEQ.

We also might imagine that the overall efficiency could by improved by some enzyme engineering for greater permissiveness with respect to mismatches (e.g. the M1/M4 variants of Taq) or alterations to the primer design strategy.

Cycle Protocol

In the current protocol, a typical cycle is as follows:

1. Hybridize sequencing primer (15 minutes, 10 uM primer in 6x SSPE, 40-50’C)

2. Extend (4 minutes, SSB + polymerase + nucleotide)

3. Wash (2 minute)

4. Image acquisition

5. Strip primer (5 minutes, Wash 1E, 70’C)

A couple notes about the above:

(a) If the wobble-bases are fixed (poly-A, poly-G, poly-C, or poly-T instead of poly-N), extensions are no longer efficient. This implies that that some degree of “sorting” is going on during the hybridization that is critical to the overall process working. Hoping for this to occur, the “anchor sequence” is purposefully short (Tm = 37’C if it were alone), weighting the hybridization process to depend to a greater degree on the wobble-sequences.

(b) Initial data suggests that Sequenase is significantly better than Klenow for this approach.

we started top-layering the beads (not too surprising, in retrospect).

Primer Nomeclature

So a typical primer-name below is “37C.2N.CA”. For all of the primers below, the anchor sequence is a trimmed version of the original FISSEQ primer for the T1..T5 template. The “37C” (or “23C” or whatever it is) indicates the extent to which it has been trimmed (i.e. 37C is the Tm of the anchor sequence if it were a stand-alone primer). The “2N” indicates that the anchor-sequence is followed by two full wobble bases, and the CA indicates the fixed two terminal bases. This primer would extend to the 5^th base, thus sequencing 3 bases (base 3, 4 and 5) on 1/16^th of the templates of a random library.

In the experiments described below, we have focused on primers with even numbers of wobble-bases and terminal bases that match at least one of the five T1..T5 templates (so that at least something will extend). For a given reach-length, this is approximately 1/4^th of the primers that would be required in a real sequencing experiment involving sequencing of genomic fragments. However, we’re also being slightly conservative in that one could do multiples of three for the number of wobble-bases, rather than multiples of two. We have some redundancy built in here as this is just a proof-of-principle experiment. For example, 37C.2N.XX sequences bases 3, 4 and 5. 37C.4N.XX sequences bases 5, 6 and 7. Base 5 is sequenced twice (as is base 7, base 9, etc.)

Proof of Principle on Loaded Beads (T1..T5) [Slide EZ]

Slide “EZ” has top-layered, 1 uM beads with loaded T1..T5 templates. As mentioned above, I only tried a subset of the primers that would be required in a full sequencing experiment on unknown sequence. I ordered primers to sequence through to the 11^th base on all five templates (37C.0N.XX through 37C.8N.XX). I also ordered just one primer for 37C.10N.XX through 37C.18N.XX (as initial data suggested that we’d be pushing it to go beyond ~10 bases with this method). Details are as follows. See commentary after for more precise description of the fields here. TR = Texas Red.

No Primer Name Primer Seq Hyb Extend With Templates AS NAS A:N Ratio

1 37C.0N.CA CCTCATTCTCT CA 42c 35’ Cy3-C T1(+T4) 0.87 0.06 15.3

2 Cy5-G T4 0.27 0.02 12.3

3 TR-G T4(+T1) 1.15 0.06 19.4

4 37C.0N.GT CCTCATTCTCT GT 42c 45’ TR-G T2 0.83 0.04 19.5

5 37C.0N.AT CCTCATTCTCT AT 42c 20’ TR-G T5 0.21 0.07 3.0

6 37C.0N.AG CCTCATTCTCT AG 42c 25’ Cy3-T T3 0.64 0.05 13.7

7 37C.2N.CA CCTCATTCTCT NN CA 42c 20’ Cy3-C T1 2.16 0.06 38.9

8 37C.2N.GT CCTCATTCTCT NN GT 42c 60’ TR-G T2,T5 0.35 0.04 8.7

9 37C.2N.TG CCTCATTCTCT NN TG 42c 20’ Cy3-C T3 1.06 0.06 16.9

10 37C.2N.GC CCTCATTCTCT NN GC 42c 20’ Cy3-C T4 1.59 0.07 22.7

11 37C.4N.CA CCTCATTCTCT NNNN CA 45c 40’ Cy3-C T1 1.72 0.06 27.0

12 37C.4N.GT CCTCATTCTCT NNNN GT 45c 27’ TR-G T2 0.52 0.16 3.2

13 37C.4N.CT CCTCATTCTCT NNNN CT 45c 11’ Cy3-C T3 0.70 0.08 8.7

14 37C.4N.GA CCTCATTCTCT NNNN GA 45c 15’ TR-G T5 0.27 0.04 7.0

15 37C.4N.CG CCTCATTCTCT NNNN CG 45c 15’ Cy3-A T4 2.53 0.12 20.5

16 37C.6N.CA CCTCATTCTCT NNNNNN CA 50c 15’ Cy3-C T1,T3 0.53 0.10 5.3

17 37C.6N.GT CCTCATTCTCT NNNNNN GT 50c 15’ TR-G T2 0.12 0.18 0.7

18 37C.6N.AA CCTCATTCTCT NNNNNN AA 50c 15’ Cy3-C T4 0.86 0.09 9.2

19 37C.6N.GA CCTCATTCTCT NNNNNN GA 50c 40’ Cy5-G T5 0.07 0.02 3.1

20 37C.8N.CA CCTCATTCTCT NNNNNNNN CA 50c 30’ Cy3-C T1,T3 0.35 0.10 3.5

21 37C.8N.GT CCTCATTCTCT NNNNNNNN GT 42c 15’ Cy5-G T2 0.31 0.04 7.0

22 37C.8N.CG CCTCATTCTCT NNNNNNNN CG 42c 30’ Cy3-A T4 0.75 0.09 8.1

23 37C.8N.GC CCTCATTCTCT NNNNNNNN GC 42c 15’ Cy3-T T5 0.34 0.06 5.9

24 37C.6N.GT CCTCATTCTCT NNNNNN GT 40c 15’ Cy5-G T2 0.35 0.04 8.3

25 37C.6N.GA CCTCATTCTCT NNNNNN GA 40c 15’ Cy5-G T5 0.07 0.02 2.9

26 23C.10N.AC TCATTCTCT NNNNNNNNNN AC 30c 19’ Cy3-C T4 0.14 0.07 2.1

27 37C.12N.CG CCTCATTCTCT NNNNNNNNNNNN CG 37c 18’ Cy3-A T4 0.74 0.10 7.4

28 Cy5-T T3 0.03 0.06 0.6

29 37C.14N.AT CCTCATTCTCT NNNNNNNNNNNNNN AT 37c 15’ Cy3-C T4 0.08 0.08 1.0

30 37C.16N.CC CCTCATTCTCT NNNNNNNNNNNNNNNN CC 50c 15’ Cy5-A T1,T2,T4,T5 0.11 0.16 0.7

31 37C.18N.CC CCTCATTCTCT NNNNNNNNNNNNNNNNNN CC 50c 15’ Cy5-A T3 0.17 0.10 1.7

32 (Sequencing Primer, Cy3 labeled)

33 (Post-strip, Cy3 signal)

34 (Post-strip, Cy5 signal)

35 (WL)

The first column of numbers indicates the cycle number assigned to a given query. The second and third columns indicate the sequencing primer used, and the fourth column indicates the conditions of hybridization. The fifth column indicates the base(s) used to extend, and the 6^th column indicates the templates expected to add. The remaining columns indicate the best-fit slope coefficient for adders and non-adders, and finally the ratio of these values. Note that this is useful, but not the perfect statistic. Calls about which cycles worked and which didn’

Failures are listed in yellow. The first failure (cycle 17), was probably due to manual error in preparing the extension mix, as its repeat (cycle 24) worked quite well, and this primer worked well in the emulsion-bead experiment below as well. The remaining failures obviously seem to correlate with attempts at longer reads. The 37C.12N.CG primer, interestingly, works quite well for one template but not another. In a subsequent experiment, using Sequenase instead of Klenow resulted in both templates working with this primer. This was the main reason that I used Sequenase in the below experiment (on emulsion-beads). Sequenase also seems to yield greater signal in general than Klenow in this protocol.

There appear to be some other trends: (a) poor performance of “G” extensions, not unexpectedly; this gets better with Sequenase also; and (b) poor performance of the T5 template in terms of signal yield at any given cycle when it is expected to extend. This is a little harder to explain but I have some ideas. It doesn’t normally behave like this, so I’m guessing it has to do with the shortening of the anchor of the sequencing primer (some other experiments hint at this mattering). Other data suggests that the size of the anchor may not be so important, so there might be a solution there.

The good news is that we obtain ~11 base-pair reads on all five templates, and all observations appear consistent with expectation. A 15-bp read is obtained on one of the templates (T4), but results are not consistent (i.e. cycle 28) and we experience failure beyond base 15 (cycles 29-31).

Blue, in the below, indicates bases that were ultimately sequenced in the above experiment; yellow indicates bases attempted and failed, and the remainder were not attempted:

T1 CACACACACACACACACTCCACCACT

T2 GTGTGTGTGTGTGTGTCCACCACTCT

T3 AGTGCTCACACACGTGATCCACCACT

T4 CAGCCGAACGACCGATCCACCACTCT

T5 ATGTGAGAGCTGTCGTCCACCACTCT

Sample graph (extension with 37C.8N.CG, sequencing bases 10,11,12 on T4)

Proof of Principle on Emulsion Beads (T1..T5) [Slide G2]

Since the above worked so well, we decided to try it immediately on emulsion-generated beads (again, just the T1..T5 templates). This is Slide G2, using beads prepared by Greg on 5-31-04, poured top-layered.

On a side-note, Greg has been optimizing the emulsion-protocol for the human-genomic library, but it looks like the optimizations were such that the signal on the emulsion-generated T1..T5 beads are literally through the roof. The FISSEQ primers have historically worked better than the human-genomic library primers so this is not terribly surprising in retrospect, but still really nice. This also suggests that we should try to get these primer sequences flanking whatever libraries we make in the future, or do a bake-off of primer sequences and find a set that is even more suited for emulsion PCR.

Greg diluted these templates independently, only mixing them as they went into the emulsion mix. The reason for this is that they are single-stranded, and this procedure minimizes their binding to one another, which confounds results. However, the ratios of the five templates clearly deviate from 1:1. The initial set of primers used on these templates were the 37C.0N.XX series, which essentially establishes the identity of each bead. As the fraction of beads with 1+ template was high, it was not surprising that we observed a high fraction of non-clonal beads. We were only imaging approximately 1% of the gel (25 frames) at each cycle. The overall numbers were as follows:

No template 29,658

Weakly amplified 10,164

Strong clonal 13,350

T1 = 57

T2 = 8,945

T3 = 2,165

T4 = 1,834

T5 = 349

Strong non-clonal 7,668

Total 60,840

The numbers are generally consistent with what one would expect from Poisson statistics, but with a modest excess of non-clonal beads. It seems likely that some of modest fraction of the “no template” beads actually don’t participate in the distribution (e.g. are excluded b/c they are in the oil compartment, or in a compartment that is too small to initiate PCR, etc.)

The initial analysis of clonality and identity, which was based on the 37C.0N.XX primers, led me to focus only on primers that extended either T2, T3, or T4, as these dominate the slide. Relative to the above there are also changes to the hybridization conditions and modified nucleotides, but the most important difference (other than the fact that these are emulsion-generated beads) is that we are using Sequenase instead of Klenow.

Primer Name Primer Seq Hyb Extend With Templates Add Slope Non-add Slope A:N Ratio

1 37C.0N.CA CCTCATTCTCT CA 42c 15’ Cy3-C T1 0.69 0.009 80.8

2 Cy5-A T4 0.33 0.004 87.2

3 37C.0N.GT CCTCATTCTCT GT 42c 15’ Cy5-G T2 0.19 0.006 29.1

4 37C.0N.AG CCTCATTCTCT AG 42c 15’ Cy5-T T3 0.78 0.02 44.5

5 37C.0N.AT CCTCATTCTCT AT 42c 15’ Cy5-G T5 0.04 0.006 6.4

6 37C.2N.GT CCTCATTCTCT NN GT 42c 50’ Cy5-G T2,T5 0.20 0.05 4.1

7 37C.2N.TG CCTCATTCTCT NN TG 42c 15’ Cy3-C T3 0.28 0.01 25.6

8 37C.2N.GC CCTCATTCTCT NN GC 42c 15’ Cy3-C T4 0.67 0.01 46.2

9 37C.4N.GT CCTCATTCTCT NNNN GT 45c 15’ Cy5-G T2 0.14 0.01 14.4

10 37C.4N.CT CCTCATTCTCT NNNN CT 45c 22’ Cy3-C T3 0.61 0.02 40.8

11 37C.4N.CG CCTCATTCTCT NNNN CG 45c 15’ Cy3-A T4 0.86 0.02 40.6

12 37C.6N.GT CCTCATTCTCT NNNNNN GT 47c 15’ Cy3-G T2 0.17 0.01 12.8

13 37C.6N.CA CCTCATTCTCT NNNNNN CA 47c 15’ Cy3-C T1,T3 0.45 0.03 17.3

14 37C.6N.AA CCTCATTCTCT NNNNNN AA 47c 15’ Cy3-C T4 0.61 0.02 29.2

15 37C.8N.GT CCTCATTCTCT NNNNNNNN GT 50c 30’ Cy5-G T2 0.11 0.01 9.5

16 37C.8N.CA CCTCATTCTCT NNNNNNNN CA 50c 15’ Cy3-C T1,T3 0.27 0.02 12.1

17 37C.8N.CG CCTCATTCTCT NNNNNNNN CG 50c 15’ Cy3-A T4 0.27 0.03 10.2

18 23C.10N.AC TCATTCTCT NNNNNNNNNN AC 37c 30’ Cy3-C T4 0.27 0.02 11.0

19 37C.12N.CG CCTCATTCTCT NNNNNNNNNNNN CG 54c 18’ Cy5-T T3 0.07 0.04 1.7

20 Cy3-A T4 0.03 0.03 1.4

21 37C.14N.AT CCTCATTCTCT NNNNNNNNNNNNNN AT 54c 15’ Cy3-C T4 0.09 0.03 3.0

22 37C.12N.CG CCTCATTCTCT NNNNNNNNNNNN CG 54c 18’ Cy5-T T3 0.05 0.03 1.9

23 Cy3-A T4 0.22 0.02 11.6

Note: On cycle 19/20, as listed above, I accidentally stripped before reading the Cy3 signal out. Interestingly (and assuming that it worked, which data from Slide EX suggests it should have), less than 30 seconds in Wash 1E at 70’C seemed sufficient for stripping (or at least for redistribution of signal amongst the beads). That is why I repeated with 37C.12N.CG (cycles 22, 23).

Note that the calls about what worked and what didn’t work are based on visual inspection of the graphs. Thus, even though 37.12N.CG->T has lower “ratios” than 37C.14N.AT->C, it still appears to have worked, whereas 37C.14N.AT->C appears to have not worked.

I stripped this slide and re-annealed sequencing primer at the conclusion to see what extent the templates had fallen off due to heat-exposure, etc. The difference between the two sets of images (pre-sequencing and post-sequencing) was essentially negligible. It’s actually startling how consistent they were with one another, and indicates that we are not losing template over the course of the experiment. This inspection also demonstrates quite clearly that the extent of gel warping over the ~20 cycles was negligible (a concern, with the heat treatments, etc.)

This is the first real success that we have for good reads on emulsion-generated beads. The signal is really good for nearly all of the cycles.

Color-coding here is the same as described above:

T1 CACACACACACACACACTCCACCACT

T2 GTGTGTGTGTGTGTGTCCACCACTCT

T3 AGTGCTCACACACGTGATCCACCACT

T4 CAGCCGAACGACCGATCCACCACTCT

T5 ATGTGAGAGCTGTCGTCCACCACTCT

Sample graph from the same primer as the sample graph above, 37C.8N.CG, sequencing bases 10,11,12 on T4 (except with emulsion-beads instead of loaded-beads, and showing only well-amplified, clonal beads). Note the signal on these beads is actually higher than the loaded beads, which is a pleasant surprise. Some reasons why this could be include (a) more template on amplified beads than has been the case in the past with Greg’s optimizations, (b) switch to Sequenase from Klenow.

Agenda

Lots of potential things to try:

(1) switch to four-color sequencing

(2) repeat emulsion T1.T5 experiment with additional primers at 10N,12N,14N and adjusted template ratios

(3) full set of primers on a more complex library

(4) hybrid primers with constant region, universal or fixed bases, wobble bases, and fixed termini.

(5) single-base-matched termini with Sequenase.

(6) longer anchors to bring up T5 signal?

(7) stabilizing primer hybridized to distal constant region.

(8) other natural and mutant polymerases