S - PowerPoint PPT Presentation

About This Presentation
Title:

S

Description:

S M A S H – PowerPoint PPT presentation

Slides: 124
Provided by: budmi
Learn more at: https://cs.nyu.edu

Transcript and Presenter's Notes

Title: S


1
S M A S H
  • Single
  • Molecule
  • Approach
  • to
  • Sequencing-by-Hybridization

2
Bud Mishra
  • Professor of Computer Science, Mathematics and
    Cell Biology
  • Courant Institute, NYU School of Medicine, Tata
    Institute of Fundamental Research, and Mt. Sinai
    School of Medicine

3
(No Transcript)
4
Reading DNA Sequences
5
How Does Nature Do It?
6
Tools of the trade
  • Where we collect three important tools from
    biotechnology: scissors, glues and copiers

7
Scissors
  • Type II Restriction Enzyme
  • Biochemicals capable of cutting double-stranded
    DNA by breaking two -O-P-O- bridges, one on each
    backbone
  • Restriction Site
  • Corresponds to specific short sequences, e.g.,
    EcoRI: GAATTC
  • Naturally occurring protein in bacteria. Defends
    the bacterium from invading viral DNA. The
    bacterium produces another enzyme that methylates
    the restriction sites of its own DNA.
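
As a rough illustration of what a restriction enzyme does computationally, the sketch below scans a sequence for the EcoRI recognition site GAATTC and reports the fragment lengths a complete digest would leave. The function names and the toy sequence are invented for this example. The cut offset of 1 reflects EcoRI cutting between G and AATTC, and only one strand is modeled.

```python
def find_sites(dna, site="GAATTC"):
    """Return 0-based start positions of every occurrence of `site`."""
    dna = dna.upper()
    positions = []
    start = dna.find(site)
    while start != -1:
        positions.append(start)
        start = dna.find(site, start + 1)
    return positions


def fragment_lengths(dna, site="GAATTC", cut_offset=1):
    """Lengths of the fragments left by a complete digest of one strand.

    cut_offset is the number of bases into the site where the cut falls
    (1 for EcoRI, which cuts G^AATTC).
    """
    cuts = [p + cut_offset for p in find_sites(dna, site)]
    bounds = [0] + cuts + [len(dna)]
    return [right - left for left, right in zip(bounds, bounds[1:])]


# Hypothetical 20-base sequence containing two EcoRI sites.
toy = "AAGAATTCTTTTGAATTCGG"
print(find_sites(toy))        # [2, 12]
print(fragment_lengths(toy))  # [3, 10, 7]
```

An ordered list of fragment lengths like this is the idealized form of the restriction maps discussed later in the deck.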

8
Glue
  • DNA Ligase
  • Cellular enzyme: joins two strands of DNA
    molecules by repairing phosphodiester bonds
  • T4 DNA Ligase (E. coli infected with
    bacteriophage T4)
  • Hybridization
  • Hydrogen bonding between two complementary single
    stranded DNA fragments, or an RNA fragment and a
    complementary single stranded DNA fragment
    results in a double stranded DNA or a DNA-RNA
    fragment

9
Copier
  • DNA Amplification
  • Main Ingredients: Insert (the DNA segment to be
    amplified), Vector (a cloning vector that
    combines with an insert to create a replicon),
    Host Organism (usually bacteria).

10
Copier
  • PCR (Polymerase Chain Reaction)
  • Main Ingredients: Primers, Catalysts, Templates,
    and the dNTPs.

11
Science by Stamp Collecting
  • How biotechnology relates to the classical
    coupon collector's problem
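
For reference, the classical coupon collector's result says that drawing uniformly at random from n distinct coupons takes n times the n-th harmonic number of draws, on average, to see every coupon at least once. Treating each genome region as a "coupon" sampled by random reads is one reading of the analogy on this slide, not something the transcript spells out. The function name is made up for this sketch.

```python
def expected_draws(n):
    """Expected number of uniform random draws needed to collect all n coupons."""
    harmonic = sum(1.0 / i for i in range(1, n + 1))
    return n * harmonic


for n in (4, 100, 10_000):
    print(n, round(expected_draws(n), 1))
```

The n log n growth is why purely random sampling needs substantial oversampling before every region is covered.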

12
Sir Ernest Rutherford
"For Mike's sake, Soddy, don't call it
transmutation. They'll have our heads off as
alchemists." Rutherford, winner of the 1908 Nobel
Prize in Chemistry for cataloging alpha and beta
particles
  • All science is either physics or stamp
    collecting.

13
Sanger Chemistry
14
A Challenge
  • Prize for DNA Decoding Aims to Fuel Innovation
  • first team that completely decodes the DNA of 100
    or more people in 10 days

15
(No Transcript)
16
Sequencing-by-Synthesis: Pyro-Sequencing
  • BASE EXTENSION: A single-stranded DNA fragment,
    known as the template, is anchored to a surface
    with the starting point of a complementary
    strand, called the primer, attached to one of its
    ends (a). When fluorescently tagged nucleotides
    (dNTPs) and polymerase are exposed to the
    template, a base complementary to the template
    will be added to the primer strand (b). Remaining
    polymerase and dNTPs are washed away, then laser
    light excites the fluorescent tag, revealing the
    identity of the newly incorporated nucleotide
    (c). Its fluorescent tag is then stripped away,
    and the process starts anew.

17
The Middle Way
  • Combining Two Extremes
  • Indexing: For each character b in the genome,
    make a list of each position where it occurs.
  • Shotgunning: For each long sentence in the
    genome, select it with low probability
    (o(lg n / n)), and then read it reasonably
    accurately.
  • Where is the middle???

18
The Middle Way
  • Character Index
  • A: 1, 11, …
  • T: 2, 3, 12, …
  • C: 4, 5, 9, 10, 13, …
  • G: 6, 7, 8, …
  • Sentences w/o Index
  • ATTCCGGG
  • GGGCCATCGT
  • CGTCATTCC

ATTCCGGGCCATC
  • Words w/ approx. Index
  • ATTC: 2..4
  • TCGG: 6..8
  • GGGC: 7..9
  • GCCA: 10..12

ATTCCGGGCCA
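
The "indexing" extreme can be made concrete with a few lines of code. This is a minimal sketch, using 1-based positions as on the slide. The function name is invented, and the input is the 13-base example string shown above.

```python
from collections import defaultdict


def character_index(genome):
    """Map each base to the sorted list of 1-based positions where it occurs."""
    index = defaultdict(list)
    for position, base in enumerate(genome, start=1):
        index[base].append(position)
    return dict(index)


index = character_index("ATTCCGGGCCATC")
for base in "ATCG":
    print(base, index[base])
```

A full character index determines the sequence exactly, whereas the un-indexed sentences and the approximately indexed words each give up part of that information. The "middle way" trades between them.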
19
Comparison
  • Shotgun Assembly of Short Reads from
    Sequencing-by-Synthesis/Ligation
  • Bottom-up
  • Short high-quality reads without location
    information
  • Shot-gun Assembler
  • Must use maps or paired-reads
  • Error Prone
  • Genotypic Sequence
  • Less Flexible
  • SMASH: Single Molecule Approach to
    Sequencing-by-Hybridization
  • Top-down
  • Shorter low-quality reads with location
    information
  • Bayesian Assembler
  • Haplotypic Sequence
  • Flexible
  • Gap-less Sequencing
  • Large high-quality contigs
  • Haplotype phasing
  • Structural changes (e.g., chromosomal aberrations)

20
Outline
  • Physical Mapping & Sequencing
  • Map
  • assign physical locations to important markers
    (e.g., restriction sites or hybridization
    probes).
  • Sequence
  • align short sequence reads to the markers
    (map-based sequence assembly) or
  • align long sequence reads to each other (shotgun
    assembly)
  • Optical Mapping
  • Sequencing

21
Optical Mapping
  • Where we map by watching the genome

22
Probing a Single Molecule
  • Standard ensemble measurements yield only average
    values of a parameter for a large number of
    identical copies of macromolecules.
  • Single molecule measurements provide a rich set
    of information:
  • dependence of a parameter on its nano-environment
  • statistical distribution
  • dynamics (temporal effects)

23
Optical Approaches
  • A single DNA molecule on a surface can be explored
    at nanometer scale with tunneling electrons,
    forces from sharp tips or magnetic resonance
  • STM (scanning tunneling microscope)
  • AFM (atomic force microscopy)
  • MRFM (magnetic resonance force microscopy)
  • Optical Approaches: non-invasive, avoid
    synchronization, need not be real-time.

24
Optical Approaches are Inherently Noisy!
  • Since many biological macromolecules are smaller
    than the Rayleigh limit, the optical approaches
    involve attaching single fluorescent probes to
    specific macromolecules.
  • Controlling Noise
  • Magnitude of Stokes shift
  • Steric hindrance
  • Absorption cross-section
  • Point spread function (PSF)
  • Image Processing

25
Optical Mapping
26
Optical Mapping
  1. Capture and immobilize whole genomes as massive
    collections of single DNA molecules

Cells gently lysed to extract genomic DNA
DNA captured in parallel arrays of long single
DNA molecules using microfluidic device
Genomic DNA, captured as single DNA molecules
produced by random breakage of intact chromosomes
27
Optical Mapping
2. Interrogate with restriction endonucleases
3. Maintain order of restriction fragments in each
molecule
Digestion reveals 6-nucleotide cleavage sites as
gaps
28
Optical Mapping
4. Determine size of fragments
29
Optical Mapping
5. GENTIG Robust Bayesian Map Assembler to
make whole-genome restriction map
30
Computational Analysis
Single DNA molecule on Optical Chip after
digestion, staining
  • Image analysis software measures size and order
    of restriction fragments
  • Overlapping single molecule maps are aligned to
    produce a map assembly covering an entire
    chromosome

31
Map Assembly
Overlapping single molecule maps are aligned to
produce a map assembly covering an entire
chromosome
32
Error Sources
  • Sizing Error
  • (Bernoulli labeling, absorption cross-section,
    PSF)
  • Partial Digestion
  • False Optical Sites
  • Orientation
  • Spurious molecules, Optical chimerism, Calibration

Image of restriction enzyme digested YAC clone
YAC clone 6H3, derived from human chromosome 11,
digested with the restriction endonuclease Eag I
and Mlu I, stained with a fluorochrome and imaged
by fluorescence microscopy.
33
E. coli Shotgun Map
34
Shotgun Mapping
  • Large fragments of genomic DNA of length from 2Mb
    to 12Mb are optically mapped
  • The resulting ordered restriction maps are
    automatically contiged by Gentig
  • The consensus map computed by Gentig is free of
    errors due to partial digestion, sizing error and
    false cuts

35
Taming the noise
  • Where we examine many noise sources in optical
    mapping

36
Optical Mapping: Interplay between Biology and
Computation
37
How Errors Complicate the Mapping
Correct Restriction Map
  • Accuracy in Sizing
  • False Cuts
  • (governed by surface conditions, illuminations,
    optic, imaging,)
  • Partial Digestion

Error Sources
  • Orientation
  • Spurious Data
  • Crossing
  • Breakage
  • Missing Fragments

38
Complexity
  • Where we admit defeat with combinatorial
    algorithmic approaches

39
Complexity Issues
Various combinations of error sources lead to
NP-hard Problems
40
SMRM(Single Molecule Restriction Map)
DRj
Dj
41
SMRM(Single Molecule Restriction Map)
42
Problem 2 (Sizing Error)
43
Problem 2 is NP Complete
44
Example
45
Probabilistic Analysis
  • Where we design the experiments to generate easy
    instances of a difficult problem

46
Sir Ernest Rutherford
  • If your experiment needs statistics, you ought
    to have done a better experiment.

47
Combinatorial Structure
48
Flips Flops
49
Intuition
50
Other Error Sources
51
Discretization
52
Sizing Error
53
Prediction
The probability of successfully computing the
correct restriction map as a function of the
number of cuts in the map and number of molecules
used in creating the map
54
Experimental Results
55
Bayesian Methods
  • Where we rely on statistical models of the error
    sources to map correctly

56
Bayesian Approach
57
Robustness of Optical Mapping Algorithm
  • BAC Clones with 6-cutters
  • Average clone size: 160 Kb; average fragment
    size: 4 Kb; average number of cut sites: 40.
  • Parameters
  • Digestion rate can be as low as 10%
  • Orientation of DNA need not be known.
  • 40% foreign DNA
  • 85% DNA partially broken
  • Relative sizing error up to 30%
  • 30% spurious randomly located cuts

Algorithm Design and Analysis jointly with
T.S.Anantharaman
58
Experimental Results
59
Bayesian Inference
Pr(H | D) = Pr(D | H) Pr(H) / Pr(D)
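
To make Bayes' rule concrete in the mapping setting, here is a toy calculation (not the Gentig model) of the posterior probability that a real restriction site exists at a location, given how many aligned molecules show a cut there. The two-hypothesis setup, the binomial likelihood, and the default rates `p_digest`, `p_false` and `prior` are all assumptions chosen for illustration.

```python
from math import comb


def posterior_site(cuts_seen, molecules, p_digest=0.8, p_false=0.05, prior=0.5):
    """Pr(site | data) for a binomial model of observed cuts.

    H1: a true site exists, each molecule shows a cut with prob. p_digest.
    H0: no site exists, each molecule shows a false cut with prob. p_false.
    """
    def binomial(p):
        return comb(molecules, cuts_seen) * p**cuts_seen * (1 - p) ** (molecules - cuts_seen)

    like_site = binomial(p_digest)   # Pr(D | H1)
    like_none = binomial(p_false)    # Pr(D | H0)
    evidence = prior * like_site + (1 - prior) * like_none  # Pr(D)
    return prior * like_site / evidence


print(posterior_site(8, 10))  # many molecules agree: posterior close to 1
print(posterior_site(1, 10))  # a lone cut: posterior close to 0
```

Even with noisy individual molecules, agreement across several molecules drives the posterior strongly toward one hypothesis.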
60
Bayesian Model
61
Multiple Alignment
62
Bayesian Optimization
Gradient search for good parameters
Local gradient optimization
63
Y
  • From a gene's point of view, reshuffling is a
    great restorative
  • The Y, in its solitary state disapproves of such
    laxity. Apart from small parts near each tip
    which line up with a shared section of the X, it
    stands aloof from the great DNA swap. Its genes,
    such as they are, remain in purdah as the
    generations succeed. As a result, each Y is a
    genetic republic, insulated from the outside
    world. Like most closed societies it becomes both
    selfish and wasteful. Every lineage evolves an
    identity of its own which, quite often, collapses
    under the weight of its own inborn weaknesses.
  • Celibacy has ruined man's chromosome.
  • Steve Jones, Y: The Descent of Men, 2002.

64
Mapping the DAZ locus on Y Chromosome
65
GENomic conTIG
  • Where we map large genomes

66
Plasmodium falciparum
  • Malaria Parasite
  • Deadliest of all the human Malaria parasites
  • P. falciparum
  • P. vivax
  • P. malariae
  • P. ovale
  • Responsible for 1.5-2.7 million deaths in 1997.

67
Gentig Maps: Plasmodium falciparum
  • A. Gap-free consensus BamHI and NheI maps for all
    14 chromosomes.
  • B. BamHI map
  • C. NheI map
  • D. NheI map of Chromosome 3 displayed by ConVEx

68
Gentig Map: Deinococcus radiodurans
Nhe I map of D. radiodurans generated by Gentig
69
Gentig Map: E. coli
  • Whole genome XhoI restriction map of E. coli O157
    generated by Gentig software of Anantharaman and
    Mishra.
  • The outer circle represents an in silico XhoI
    digest of the sequence.
  • The second outermost circle shows the consensus
    optical map.
  • The inner circles represent the individual
    molecule maps from which the consensus map was
    generated.

70
Complexity, Revisited
  • Where we admit defeat again(?)
  • Can Gentig assemble large Eukaryotic Genomes??

71
GCP is NP-Complete
  • Transformation from Hamiltonian Path Problem
    restricted to cubic graphs.

Choose p 3/4 k M
72
Comparing Maps: Effect of Partial Digestion
  • Parameters
  • Partial digestion probability, p
  • Relative sizing error, b
  • Restriction fragments, n
  • Overlap threshold ratio, q
  • m = n p: expected number of detected restriction
    fragments.
  • Controlling False Negative
  • k ≤ n p^4 q/2 and r = k1/p^4, k1 ¼ 2
  • If in fact the clones A and B overlap then we
    will detect it with a probability of at least
  • (1 - exp(-k1)) (1 - exp(-n p^4 q/8))

73
Experiment Design
  • Relation among the error parameters
  • 3 b n p/4 ≤ k ≤ n p^4 q/2
  • ⇒ p ≥ (3b/(2q))^(1/3)
  • Parameter choice for shotgun mapping: make the
    partial digestion probability rather high (close
    to 1) or the relative sizing error low, for
    instance by using a rare cutter.
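
The sketch below evaluates the bound on this slide. It reads the garbled relation as 3bnp/4 ≤ k ≤ np^4 q/2, which rearranges to p ≥ (3b/(2q))^(1/3); that reading is an interpretation, not something the transcript states cleanly. The function name and the numeric inputs are hypothetical and are not taken from the contour plot that follows.

```python
def min_digestion_rate(sizing_error, overlap_ratio):
    """Smallest p satisfying 3*b*n*p/4 <= n*p**4*q/2, i.e. p >= (3b/(2q))**(1/3)."""
    return (3.0 * sizing_error / (2.0 * overlap_ratio)) ** (1.0 / 3.0)


# Hypothetical values: relative sizing error b, overlap threshold ratio q.
for b in (0.10, 0.05, 0.02):
    print(b, round(min_digestion_rate(b, overlap_ratio=0.2), 3))
```

Under this reading, lowering the relative sizing error lowers the digestion rate the experiment must achieve, which matches the slide's advice.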

74
Contour Plot as a Function of Sizing Error
(x-axis) and Digestion Rate (y-axis)
  • The calculation is for the human genome, G = 3,300
    Mb.
  • The average molecule length is 5 Mb, with an
    overlap of 1 Mb
  • The average restriction fragment length is 25 kb
  • For a sizing error of 3 kb, the required
    digestion rate is 94%
  • If the sizing error is reduced to 2 kb, the
    required digestion rate drops to 88%
  • If the sizing error is reduced to 1 kb, the
    required digestion rate drops to 80%

75
Sir Ernest Rutherford
  • You should never bet against anything in science
    at odds of more than about 10^12 to 1.

76
How does Gentig Work
  • Gentig uses a purely Bayesian Approach.
  • It models all the error processes in the prior.
  • It initially starts with a conservative but fast
    pairwise overlap configuration, computed
    efficiently using Geometric Hashing.
  • It iteratively combines pairs of maps or map
    contigs, while optimizing the likelihood score
    subject to a false-positive constraint.
  • It has special heuristics to handle non-local
    errors.

77
The problem: Data error
  • Miss restriction site rate
  • Gaussian sizing var
  • False restriction sites rate
  • Small fragments missing rate
  • Multiple chromosomes mixed together.

78
Solution: Bayesian probability maximization
H
f(H | D1, …, Dm) ∝ f(H) f(D1, …, Dm | H)
79
Conditional Probability: sum of alignments right
of I,J
H
80
Example: increase interval in H
81
Sir Ernest Rutherford
  • You should never bet against anything in science
    at odds of more than about 10^12 to 1.

82
HAPTIG: HAPlotypic conTIG
  • HOOPLAS, HYPES & HAPLOTYPES
  • Replacing Gentig
  • Faster & More Accurate

83
Single Molecule Haplotyping: Candida albicans
  • The left end of chromosome 1 of the common fungus
    Candida albicans (being sequenced by Stanford).
  • You can clearly see 3 polymorphisms:
  • (A) Fragment 2 is of size 41.19kb (top) vs
    38.73kb (bottom).
  • (B) The 3rd fragment of size 7.76kb is missing
    from the top haplotype.
  • (C) The large fragment in the middle is of size
    61.78kb vs 59.66kb.

84
Goals
  • Identify and phase polymorphism in data.
  • Problem
  • Similar to Gentig Problem
  • if distributions of restriction site polymorphism
    (due to SNPs) and restriction fragment length
    polymorphism (due to indels) are known
  • How can one separate partial digestions from
    restriction site polymorphism and sizing error
    from restriction length polymorphism?
  • Solution
  • Bayesian Analysis with Parameterized Priors &
    Fast Dynamic Programming Implementation.

85
Comparison on 3.4 GHz CPU
Chr1 length   Coverage   USC time       NYU time
12 Mb         20x        3431 secs      448 secs
12 Mb         100x       17052 secs     2818 secs
30 Mb         20x        28993 secs     1444 secs
30 Mb         100x       124194 secs    9308 secs
100 Mb        20x        NA             25465 secs
100 Mb        50x        NA             74025 secs
200 Mb        20x        NA             88835 secs
200 Mb        50x        NA             259024 secs
Note: Haptig times based on 2.4 GHz CPU
86
Comparison on 3.4 GHz CPU
20 X coverage
100 X coverage
Unlike the USC algorithm (by Waterman et al.),
Gentig/Haptig algorithms easily scale to human
genome sized problems. Gentig is routinely used
by OpGen and other labs to map.
87
Other Interesting Applications
  • Phylogeny
  • Sequence Validation
  • Haplotyping
  • Sequencing
  • Comparative Genomics
  • Rearrangement events
  • Hemizygous Deletions
  • Epigenomics
  • Characterizing cDNAs
  • Expression Profiling
  • Alternate Splicing

88
Sequencing
89
Joint Work with
  • T.S. Anantharaman, NYU, NY.
  • V. Demidov, UCLA, CA.
  • A. Lim, NYU, NY.
  • S. Paxia, NYU, NY.
  • J. Reed, UCLA, CA.
  • C. Cantor, Sequenom, Inc., CA
  • J. Gimzewski, UCLA, CA.
  • M. Teitell, UCLA, CA.

90
Overview: Technology
  • How does it work?
  • Optimal integration of several technologies based
    on manipulation of single molecules on a surface.

91
Goal
  • Sequence a human-size genome of about 6 Gb
    (include both haplotypes).
  • Integrate three component technologies
  • Optical Mapping to create Ordered Restriction
    Maps with respect to one or more restriction
    enzymes,
  • Hybridization of a pool of short nucleobase
    probes (PNA or LNA oligomers) with Single Genome
    dsDNA molecules on a surface, and
  • Efficient polynomial time algorithms to solve
    localized versions of the PSBH (Positional
    Sequencing by Hybridization) problems over the
    whole genome.

92
SMASH
  • Genomic DNA is carefully extracted from a small
    number of cells of an organism (e.g., human) in
    normal or diseased states. (Fig 1 shows a cancer
    cell to be studied for its oncogenomic
    characterization.)

93
SMASH
  • LNA probes of length 6-8 nucleotides are
    hybridized to dsDNA (double-stranded genomic DNA)
    in a test tube (Fig 2) and the modified DNA is
    stretched on a 1 x 1 chip that has microfluidic
    channels manufactured on its surface. These
    surfaces have been chemically treated to create a
    positive charge.

DNA samples are prepared for analysis with LNA
probes and restriction enzymes.
94
SMASH
  • Since DNA is slightly negatively charged, it
    adheres to the surface as it flows along these
    channels and stretches out. Individual molecules
    range in size from 0.3 to 3 million base pairs in
    length.
  • Next, bright emitters are attached to the probes
    on the surface and the molecules are imaged (Fig
    3).

95
SMASH
  • A restriction enzyme¹ is added to break the DNA
    at specific sites. Since DNA molecules are under
    slight tension, the cut fragments of DNA relax
    like entropic springs, leaving small visible gaps
    corresponding to the positions of the restriction
    site (Fig 4).
  • 1. A restriction enzyme is a highly specific
    molecular scissor that recognizes short
    nucleotide sequences and cuts the DNA at only
    those recognition sites.

96
SMASH
  • The DNA is then stained with a fluorogen (Fig 5)
    and reimaged. The two images are combined to
    create a composite image suggesting the locations
    of a specific short word (e.g., probes) within
    the context of a pattern of restriction sites.

97
SMASH
  • The intensity of the light emitted by the dye at
    one frequency provides a measure of the length of
    the DNA fragments.
  • The intensity of the light emitted by the
    bright-emitters on probes provides an intensity
    profile for locations of the probes.
  • Images of each DNA molecule are then converted
    into ideograms, where the restriction sites are
    represented by a tall rectangle and probe sites
    by small circles (Fig 6).

98
SMASH
  • The steps above are repeated for all possible
    probe compositions (modulo reverse
    complementarity).
  • Sutta software then uses the data from all such
    individual ideograms to create an assembly of the
    haplotypic ordered restriction maps with
    approximate probe locations superimposed on the
    map.
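
"All possible probe compositions (modulo reverse complementarity)" can be counted directly: a probe and its reverse complement bind the same double-stranded site, so only one of each pair is needed. The sketch below enumerates one representative per pair. Picking the lexicographically smaller string as the representative is a convention chosen here, not something the slides specify.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_probes(k):
    """One representative k-mer per {k-mer, reverse complement} pair."""
    probes = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        probes.add(min(kmer, reverse_complement(kmer)))
    return sorted(probes)


print(len(canonical_probes(6)))  # 2080
```

For 6-mers this gives 2080 distinct probes, the same figure as the 2080 simulated probe consensus maps in the simulation slides later in the deck.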

99
SMASH
  • Local clusters of overlapping words are combined
    by Sutta's PSBH (positional sequencing by
    hybridization) algorithm to overlay the inferred
    haplotypic sequence on top of the restriction map
    (Fig 7).

100
Sir Ernest Rutherford
  • "We haven't the money, so we've got to think."

101
Hybridization
102
Thermal Stability of LNA
  • LNA/DNA or LNA/RNA duplexes have increased
    thermal stability compared with similar duplexes
    formed by DNA or RNA.
  • The LNA modification has been shown to increase
    the biological stability of nucleic acids.
  • Fully modified LNA oligonucleotides are resistant
    towards most nucleases tested.

103
Experiments with Probes
  • To test the feasibility of hybridization of a
    modified PNA, we first tested hybridization of
    PNA probe to lambda DNA molecules.
  • We measured the degree of hybridization by
    digesting lambda DNA, after it was hybridized
    with the PNA probe, with PmlI restriction enzyme
    and running the sample on a 4% PAGE gel; the
    degree of hybridization was greater than 90%.

Bound
Unbound
104
Probe Map
  • Overlaid fluorescent images of lambda DNA
    molecules using a FITC filter (white) and CY5
    filter (red), showing the position of the probes
    on the lambda DNA molecules.

105
Final Probe Map
  • We next combined all probe maps from around 20
    image pairs using a Bayesian algorithm to compute
    the most likely consensus map.
  • For one set of 20 image pairs, a total of 512 DNA
    molecules with a total of 678 probes were
    identified and combined into a consensus map with
    2 probe locations at 14.8% and 52.4% of the DNA
    length.
  • The orientation of the DNA cannot be determined
    from optical maps, so the consensus map may be
    read from either end; read from the other end,
    the probes lie at 47.6% and 85.2%.
  • Thus our result is in close agreement with the
    correct map, with probes at 50.2% and 85.7% known
    from the sequence, and the implied probe
    hybridization rate of 42% is also quite good.
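
Because orientation is unknown, comparing the consensus positions (14.8 and 52.4, as a percentage of molecule length) with the known ones (50.2 and 85.7) means trying both orientations and keeping the better fit. This is a minimal sketch of that check. The function names and the sum-of-absolute-differences score are choices made here for illustration.

```python
def flip(positions):
    """Positions (in % of molecule length) as read from the opposite end."""
    return sorted(round(100.0 - p, 1) for p in positions)


def best_orientation(observed, reference):
    """Return the observed map in whichever orientation is closer to the reference."""
    def mismatch(candidate):
        return sum(abs(a - b) for a, b in zip(sorted(candidate), sorted(reference)))

    return min((sorted(observed), flip(observed)), key=mismatch)


print(best_orientation([14.8, 52.4], [50.2, 85.7]))  # [47.6, 85.2]
```

In the flipped orientation the two probes sit within about 2.6 and 0.5 percentage points of the sequence-derived positions.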

106
Positional SBH Algorithm
107
Algorithmic Problem
  • The correct DNA sequence is s
  • Computable
  • Correct Haplotypic Restriction Maps (with respect
    to multiple enzymes)
  • Positional Spectra: For a position x, w_x and
    wc_x, the two k-mers at position x on the Watson
    and Crick strands.
  • Data
  • Map Errors: Errors in sizing, false positives and
    negatives in cleavage
  • Spectral Errors: Errors in position and false
    positives and negatives in hybridization
  • Compute a sequence t.
  • How closely does t approximate s?

108
Anomalies
  • Irresolvable Ambiguities
  • From assemblies based on 6bp probes
  • Error Pattern: s w sRC
  • Correct Pattern: s wRC sRC
  • s = tcgcc (any 5 bases)
  • sRC = ggcga (reverse complement of s)
  • w = CCCCTAAC (any short sequence under 50bp)
  • wRC = GTTAGGGG (reverse complement of w)

Assembly: tcgccCCCCTAACggcga
Correct: tcgccGTTAGGGGggcga
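
One way to see why this pattern cannot be resolved is that the erroneous and correct assemblies are exact reverse complements of each other, so probes that bind either strand produce the same set of words for both. The sketch below checks this for the example on the slide. The helper names are invented, and the spectrum ignores probe positions, which differ here by less than the length of w.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement, preserving letter case."""
    return seq.translate(COMPLEMENT)[::-1]


def double_stranded_spectrum(seq, k):
    """Set of k-mers present on either strand, one representative per pair."""
    seq = seq.upper()
    spectrum = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        spectrum.add(min(kmer, reverse_complement(kmer)))
    return spectrum


assembly = "tcgcc" + "CCCCTAAC" + "ggcga"   # s w sRC
correct = "tcgcc" + "GTTAGGGG" + "ggcga"    # s wRC sRC

print(reverse_complement(assembly) == correct)
print(double_stranded_spectrum(assembly, 6) == double_stranded_spectrum(correct, 6))
```

Both comparisons hold for this example, so only position information finer than the length of w could tell the two apart.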
109
A Library of Anomalies
  • Irresolvable Ambiguities & Unavoidable Error
    Patterns
  • Assuming assemblies with K-bp probes
  • Most common: s w sRC vs. s wRC sRC
  • Also common: s w s t s vs. s t s w s
  • Many more complicated patterns (but extremely
    rare)
  • Here
  • s: any K-1 bp sequence
  • w, t: any short sequence under 50bp
  • The probabilities of such patterns can be reduced
    exponentially with gapped probes without
    increasing the costs.

110
Complexity
  • Overcoming the complexity issues
  • Spectrum of k-mer probes plus constraints on the
    location of the k-mers in the sequence
  • However if the location constraints are in the
    form of k contiguous locations then the
    reconstruction problem is exponential in k,
    rather than the sequence length m.
  • NP-Complete, if k 3

111
Solution
  • Complexity Issue
  • ⇒ Experiment Design & Powerful Heuristics
  • Accuracy Issue
  • ⇒ Combinatorial Design of Probes (Gapped Probes
    using universal Bases)

112
Gapped Probes
  • Mixing solid bases with wild-card bases
  • E.g., xxxxxx (9-mers) or xxxxxxxx (14
    mers)
  • An inert base
  • Universal: in terms of its ability to form base
    pairs with the other natural DNA/RNA bases.
  • Applications in primers and in probes for
    hybridization
  • Examples
  • The naturally occurring base hypoxanthine, as its
    ribo- or 2'-deoxyribonucleoside
  • 2'-deoxyisoinosine
  • 7-deaza-2'-deoxyinosine
  • 2-aza-2'-deoxyinosine

113
2'-Deoxyinosine derivatives
  • 2'-Deoxyinosine derivatives can be used as
    universal DNA analogues.

Loakes, D. Nucl. Acids Res. 2001; 29: 2437-2447.
doi:10.1093/nar/29.12.2437
114
Other Chemistries
  • The bases 3-nitropyrrole and 5-nitroindole have
    also been incorporated into peptide nucleic acids
    (PNA)
  • In PNA-DNA duplexes, it was found that the T_m
    range was narrower than that found for
    corresponding DNA-DNA duplexes, although
    significantly higher than the corresponding
    DNA-DNA duplex containing an A-T base pair.

Loakes, D. Nucl. Acids Res. 2001; 29: 2437-2447.
doi:10.1093/nar/29.12.2437
115
Approximation
  • A special case of the PSBH problem,
  • Separate constraints for each of the multiple
    instances of each k-mer.
  • By focusing on a small window of about 2k-1bp, in
    which most k-mers will occur only once, we hope
    to approximately solve the standard PSBH problem
  • Here separate constraints for multiple instances
    of each k-mer are not important.
  • Amortize over many instances

116
Multiple PSBH problem
  • Data set
  • Probe maps for all K-mers (with gaps)
  • We have multiple instances of each k-mer (for
    k = 6, about one every 4 Kb on each strand of the
    DNA in the sequence),
  • For each instance we can constrain the location
    to within about 100 base pairs depending on the
    optical resolution.
  • Beam Search Based Heuristics
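
The "one every 4 Kb" figure for k = 6 follows from a simple model: if bases are independent and equally likely (an idealization, since real genomes are not uniform), a given k-mer starts at any position with probability 4^-k. The function names below are made up for this sketch.

```python
def expected_spacing(k):
    """Average distance in bases between occurrences of one k-mer on a strand."""
    return 4 ** k


def expected_hits(window_bp, k):
    """Expected occurrences of one k-mer in a window, under the uniform model."""
    return max(window_bp - k + 1, 0) / 4 ** k


print(expected_spacing(6))               # 4096
print(round(expected_hits(100, 6), 4))   # hits expected in a 100 bp window
```

With a location uncertainty of about 100 base pairs, a given 6-mer is therefore unlikely to occur twice inside its own uncertainty window under this model, which is what makes the localized problem tractable.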

117
Simulation Results
  • Probe Map Assumptions
  • For single DNA molecules
  • Probe location standard deviation: 240 bases
  • Data coverage per probe map: 50x
  • Probe hybridization rate: 30%, and
  • false positive rate of 10 probes per megabase,
    uniformly distributed.
  • Analytical estimate of the average error rate
    in the probe consensus map
  • Probe location SD: 60 bases
  • False Positive rate < 2.4%
  • False Negative rate < 2.0%.
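
The slides do not show how the consensus-map figures were derived. One simple reading, offered here only as a plausibility check, is that averaging N independent observations shrinks the standard deviation by a factor of sqrt(N), with N taken as coverage times hybridization rate. The function name and this model are assumptions.

```python
from math import sqrt


def consensus_sd(single_molecule_sd, coverage, hybridization_rate):
    """SD of the averaged probe location, assuming independent observations."""
    effective_observations = coverage * hybridization_rate
    return single_molecule_sd / sqrt(effective_observations)


# 240-base SD per molecule, 50x coverage, 30% hybridization rate.
print(round(consensus_sd(240, 50, 0.30), 1))
```

With 50x coverage and a 30% hybridization rate there are about 15 observations per probe site, and 240 / sqrt(15) is roughly 62 bases, close to the 60-base consensus SD quoted above.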

118
Simulation Results
  • Using estimated error rates, randomly introduced
    errors at the calculated rates into each of the
    2080 simulated probe consensus maps (for the
    above example).
  • Ran our sequence assembly algorithm.
  • Aligned the sequence produced with the originally
    assumed correct sequence using Smith-Waterman
    alignment.
  • Counted the total number of single base errors
    (mismatches + deletions + insertions).
  • Repeated this experiment until a total of 200,000
    bases of sequence had been simulated and computed
    the average error rate per 10,000 bases.

119
Simulation Results
  • We first tried un-gapped probes with 5,6,7 and 8
    bases respectively and got error rates per 10,000
    bases of 1674, 255, 39.6 and 3.7 bases
    respectively.
  • Our machine ran out of memory trying to simulate
    a 9 base ungapped probe, but from the trend it is
    clear that the goal of 1 base error per 10,000
    bases would most likely be achieved by a 9 base
    ungapped probe.
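
The "trend" argument can be checked with a log-linear extrapolation of the four reported error rates. The fit below is a rough sketch: an ordinary least-squares line through log(error rate) against probe length, which is a modeling choice made here and not the authors' method.

```python
from math import exp, log


def extrapolate_log_linear(lengths, rates, target_length):
    """Fit log(rate) = a + b * length by least squares and predict at target_length."""
    n = len(lengths)
    logs = [log(r) for r in rates]
    mean_x = sum(lengths) / n
    mean_y = sum(logs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(lengths, logs))
    sxx = sum((x - mean_x) ** 2 for x in lengths)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return exp(intercept + slope * target_length)


# Errors per 10,000 bases reported for ungapped probes of 5 to 8 bases.
predicted = extrapolate_log_linear([5, 6, 7, 8], [1674, 255, 39.6, 3.7], 9)
print(round(predicted, 2))
```

The extrapolated value for a 9-base probe comes out below one error per 10,000 bases, consistent with the expectation stated on the slide.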

120
Results
UNGAPPED
GAPPED
121
1000 Rupees Genome
$22.67 US for 6 billion bases; $135 billion US
for the entire human population
122
Sir Ernest Rutherford
  • I have become more and more impressed by the
    power of the scientific method of extending our
    knowledge of nature.
  • Experiment, directed by the imagination of either
    an individual, or still better of a group of
    individuals of varied mental outlook is able to
    achieve results which far transcend the
    imagination alone of the greatest natural
    philosopher.

123
Sir Ernest Rutherford
  • Experiment without imagination, or imagination
    without recourse to experiment, can accomplish
    little. But for effective progress, a happy blend
    of these powers is necessary

124
the end