Design 1 - PowerPoint PPT Presentation

Slides: 32
Provided by: Adma3
Title: Design 1


1
Celera Assembler
Arthur L. Delcher
Senior Research Scientist, CBCB, University of Maryland
2
Whole Genome Shotgun Sequencing
WGS Sequencing, WGS Assembly, Performance
3
Mate-Pair Shotgun DNA Sequencing
DNA target sample
SHEAR & SIZE (16 of these), e.g., 10Kbp ± 8% std.dev.
CLONE (16 of these) & END SEQUENCE (automated)
End Reads / Mate Pairs: 550bp reads from the ends of a 10,000bp insert
4
Shotgun DNA Sequencing (Technology)
5
Whole Genome Shotgun Sequencing
  • Collect 10X sequence in a 1-to-1 ratio of two
    types of read pairs: Short (2Kbp) and Long
    (10Kbp). 35 million reads for Human.
  • Collect another 20X in clone coverage of 50Kbp
    end sequence pairs (BAC 3' and BAC 5' ends).
    1.2 million pairs for Human.
  • Early simulations showed that if repeats were
    considered black boxes, one could still cover
    99.7% of the genome unambiguously.
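
The clone-coverage figure on this slide can be checked with simple arithmetic. The sketch below assumes a human genome length of 3x10^9 bp (the value of G quoted later on the overlap slide); the function name is illustrative only.

```python
def fold_coverage(count: float, length_bp: float, genome_bp: float) -> float:
    """Fold coverage = total bases spanned / genome length."""
    return count * length_bp / genome_bp


GENOME_BP = 3e9  # assumed human genome length (G on the overlap slide)

# 1.2 million end-sequence pairs, each spanning one 50Kbp clone insert.
clone_cov = fold_coverage(1.2e6, 50_000, GENOME_BP)
print(f"50Kbp clone coverage: {clone_cov:.1f}X")  # prints "50Kbp clone coverage: 20.0X"
```

Under that assumed genome length, 1.2 million 50Kbp pairs correspond to the 20X clone coverage the slide asks for.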
6
Clone-by-Clone Genome Sequencing
7
Sequencing Factory
8
Celera's Sequencing Factory (circa 2001)
  • 300 ABI 3700 DNA Sequencers
  • 50 Production Staff
  • 20,000 sq. ft. of wet lab
  • 20,000 sq. ft. of sequencing space
  • 800 tons of A/C (160,000 cfm)
  • $1 million / year for electrical service
  • $10 million / month for reagents

9
Human Data (April 2000)
  • Collected 27.27 Million reads = 5.11X coverage
  • 21.04 Million are paired (77%) = 10.52 Million
    pairs
  • 2Kbp: 5.045M pairs, 98.6% true, <6% std.dev.
  • 10Kbp: 4.401M pairs, 98.6% true, <8% std.dev.
  • 50Kbp: 1.071M pairs, 90.0% true, <15% std.dev.
  • validated against finished Chrom. 21 sequence
  • The clones cover the genome 38.7X
  • Data is from 5 individuals (1 at roughly 3X, 4
    others at 0.5X)
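
The read and pair counts on this slide are mutually consistent, which a few lines of arithmetic confirm. This is only a sanity check on the quoted numbers; the variable names are illustrative.

```python
reads_total_m = 27.27    # million reads collected
reads_paired_m = 21.04   # million reads that have a mate
pairs_by_library_m = {"2Kbp": 5.045, "10Kbp": 4.401, "50Kbp": 1.071}

paired_fraction = reads_paired_m / reads_total_m
pairs_from_reads_m = reads_paired_m / 2            # two reads per pair
pairs_from_libraries_m = sum(pairs_by_library_m.values())

print(f"paired fraction: {paired_fraction:.0%}")
print(f"pairs (from paired reads): {pairs_from_reads_m:.2f}M")
print(f"pairs (sum of libraries):  {pairs_from_libraries_m:.3f}M")
```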

10
Pairs Give Order & Orientation
Contig: Consensus (15-30Kbp) over Reads
Assembly without pairs results in contigs whose
order and orientation are not known.
Scaffold: contigs linked by pairs (e.g., a 2-pair link); gap Mean & Std.Dev. is known
Pairs, especially groups of corroborating ones,
link the contigs into scaffolds where the size of
gaps is well characterized.
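
To illustrate how corroborating pairs can characterize a gap, here is a minimal sketch. It assumes each mate pair implies a gap of (library mean insert size) minus the portions of the insert lying inside the two contigs, and it combines several pairs by inverse-variance weighting. That weighting is a standard statistical choice made for this example, not something the slides specify; `MateLink` and its fields are hypothetical names.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class MateLink:
    insert_mean: float  # library mean insert size (bp)
    insert_std: float   # library standard deviation (bp)
    span_in_a: float    # bp of the insert lying inside contig A
    span_in_b: float    # bp of the insert lying inside contig B


def implied_gap(link: MateLink) -> float:
    """Gap implied by one pair: what is left of the insert between the contigs."""
    return link.insert_mean - link.span_in_a - link.span_in_b


def combine_links(links):
    """Inverse-variance weighted gap estimate; returns (mean, std)."""
    weights = [1.0 / link.insert_std ** 2 for link in links]
    total = sum(weights)
    mean = sum(w * implied_gap(l) for w, l in zip(weights, links)) / total
    return mean, sqrt(1.0 / total)


links = [
    MateLink(2000, 120, 900, 700),     # 2Kbp library, implies a 400bp gap
    MateLink(2000, 120, 600, 1050),    # 2Kbp library, implies a 350bp gap
    MateLink(10000, 800, 5200, 4300),  # 10Kbp library, implies a 500bp gap
]
gap_mean, gap_std = combine_links(links)
print(f"gap = {gap_mean:.0f} +/- {gap_std:.0f} bp")
```

Each extra corroborating pair shrinks the standard deviation of the estimate, which is one way to read the slide's point about groups of pairs.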
11
Anatomy of a WGS Assembly
Consensus
Reads (of several haplotypes)
SNPs
External Reads
12
Whole Genome Shotgun Assembly
WGS Sequencing, WGS Assembly, Performance
13
Assembler Design Philosophy
  • Detect repeats and so avoid being misled by them;
    leave them for last.
  • Make 1st order use of mate-pairs: first to
    circumnavigate and later to fill in repeats.
  • Make all the sure moves first:
  • tiered phases that get progressively more
    aggressive.
  • Output a complete audit trail of the evidence for
    assembly.

14
Assembly Pipeline (circa 2006)
Trim & Screen
  • Reads (typically 800bp) are quality-trimmed so
    that the average error rate is 0.5%, with
    1-in-1000 having more than 2% error. Average trim
    length is 500-900bp, depending on the genome
    (590bp for human in year 2000).
  • Contaminant and vector sequence is removed.
  • Repeat screening makes run time and overlap graph
    size reasonable, e.g. 10^6 overlaps per Alu read
    must be avoided.
  • Now we dynamically limit repetitive overlaps in
    the overlap phase.
  • gatekeeper program to vet inputs/assign IDs.
    Reads stored in compressed, random-access
    binary store.

Pipeline stages: Trim & Screen, Overlapper, Unitiger, Scaffolder, Repeat Rez I, II
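
As an illustration of quality trimming to a target average error rate, the sketch below picks the longest contiguous window of a read whose mean per-base error probability is at most 0.5%. It assumes Phred-scaled quality values (error = 10^(-Q/10)) and a longest-window criterion; neither is stated on the slide, and the function names are hypothetical.

```python
def error_probability(phred: int) -> float:
    """Convert a Phred quality value to a per-base error probability."""
    return 10 ** (-phred / 10)


def longest_clear_range(quals, max_mean_error=0.005):
    """Return (start, end) of the longest window with mean error <= max_mean_error.

    end is exclusive; (0, 0) means no acceptable window exists.
    """
    # excess[i] = sum over the first i bases of (error - budget); a window is
    # acceptable exactly when its excess difference is <= 0.
    excess = [0.0]
    for q in quals:
        excess.append(excess[-1] + error_probability(q) - max_mean_error)

    best_start, best_end = 0, 0
    n = len(quals)
    for start in range(n):
        min_end = start + (best_end - best_start)  # must beat the current best
        for end in range(n, min_end, -1):          # try the longest windows first
            if excess[end] - excess[start] <= 1e-12:
                best_start, best_end = start, end
                break
    return best_start, best_end


# Toy read: poor ends (Q5) around a good core (Q40).
quals = [5] * 30 + [40] * 600 + [5] * 40
start, end = longest_clear_range(quals)
print(f"clear range: [{start}, {end}), length {end - start}")
```

The double loop is quadratic in read length, which is acceptable for reads of under a thousand bases; a production trimmer would likely use a different criterion and a faster scan.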
15
Assembly Pipeline
Overlap Detection
Find all overlaps ≥ 40bp allowing 6% mismatch.
Use k-mer seed matches (k = 22) with O(nd)
extension, where extension quits when the probability
of seeing the given # of errors for the amount of
sequence aligned is less than 1/1,000,000.
Avoid k-mers whose occurrence count c is such
that there is less than a threshold (10^-6) chance
of seeing c occurrences given it is part of an
R-fold (R = 50) or less repeat in a genome of
length G (3x10^9).
Dynamic tuple selection is a form of automatic
repeat screening, implying that overlaps involving
ubiquitous repetitive sequence may be missing.
k, the probability threshold, and R were chosen to
give us an appropriate tradeoff between time,
space, and sensitivity.
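
One way to turn the k-mer rule into a concrete cutoff is sketched below. It assumes the number of reads containing a k-mer from an R-copy repeat is Poisson distributed with mean R * N * (L - k + 1) / G, where N is the number of reads and L the read length. The Poisson model and the function names are assumptions made for this example; the slide does not say which distribution was used. The read count and trim length are the April 2000 human figures quoted earlier.

```python
from math import exp, lgamma, log


def poisson_pmf(i: int, lam: float) -> float:
    """P(X = i) for X ~ Poisson(lam), computed in log space for stability."""
    return exp(-lam + i * log(lam) - lgamma(i + 1))


def poisson_upper_tail(c: int, lam: float) -> float:
    """P(X >= c), summed term by term until the terms become negligible."""
    total, i = 0.0, c
    while True:
        term = poisson_pmf(i, lam)
        total += term
        if i > lam and term < 1e-18 * max(total, 1e-300):
            return total
        i += 1


def kmer_count_cutoff(reads, read_len, k, genome_len, repeat_copies, eps):
    """Smallest count c (at or above the mean) with P(X >= c) < eps."""
    lam = repeat_copies * reads * (read_len - k + 1) / genome_len
    c = int(lam)
    while poisson_upper_tail(c, lam) >= eps:
        c += 1
    return c


cutoff = kmer_count_cutoff(reads=27.27e6, read_len=590, k=22,
                           genome_len=3e9, repeat_copies=50, eps=1e-6)
print(f"skip k-mers seen {cutoff} or more times")
```

K-mers at or above the cutoff are unlikely to come from a repeat of 50 or fewer copies, so skipping them as seeds trades a little sensitivity for a much smaller overlap graph.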
16
Assembly Pipeline
Error correction
If a k-mer (k = 10) of a read matches a k-mer from
an overlapping read, then the bases in that k-mer
of the read are confirmed.
If a base is not confirmed, and the
1-neighborhood of an overlapping k-mer matches
it, then there is a vote for correction. The
majority correction vote is applied to the
sequence.
Sequences are not actually changed, but overlaps
are re-evaluated as SNPs are corrected.
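
The confirm-and-vote scheme can be sketched as follows. This is a simplified illustration: overlaps are treated as ungapped alignments given by an offset, "1-neighborhood" is taken to mean exactly one substitution (Hamming distance 1), and the function returns a corrected copy rather than re-scoring overlaps. All three are assumptions of the example, not details from the slides.

```python
from collections import Counter


def correct_read(read: str, overlaps, k: int = 10) -> str:
    """overlaps: list of (other_read, offset); other_read[0] aligns to read[offset]."""
    n = len(read)
    confirmed = [False] * n
    votes = [Counter() for _ in range(n)]

    for other, offset in overlaps:
        lo = max(0, offset)                # first read position covered by other
        hi = min(n, offset + len(other))   # one past the last covered position
        for i in range(lo, hi - k + 1):
            mine = read[i:i + k]
            theirs = other[i - offset:i - offset + k]
            diffs = [j for j in range(k) if mine[j] != theirs[j]]
            if not diffs:
                # Exact k-mer match: every base in the window is confirmed.
                for j in range(i, i + k):
                    confirmed[j] = True
            elif len(diffs) == 1:
                # One substitution away: vote for the other read's base.
                votes[i + diffs[0]][theirs[diffs[0]]] += 1

    corrected = list(read)
    for pos in range(n):
        if not confirmed[pos] and votes[pos]:
            base, count = votes[pos].most_common(1)[0]
            if 2 * count > sum(votes[pos].values()):  # strict majority
                corrected[pos] = base
    return "".join(corrected)


truth = "ACGTACGTTGCAAGCTTAGGCTAACGTT"
noisy = truth[:14] + "A" + truth[15:]          # substitution error at position 14
fixed = correct_read(noisy, [(truth, 0), (truth[5:], 5), (truth[:20], 0)])
print(fixed == truth)
```

Because an overlapping read votes once per k-mer window that covers a position, reads with longer clean context around an error carry more weight in this sketch.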
17
Assembly Pipeline
Find all overlaps ≥ 40bp allowing 6% mismatch.
18
Assembly Pipeline
Compute all overlap-consistent sub-assemblies:
Unitigs (Uniquely Assembled Contigs)
19
OVERLAP GRAPH
Edge Types: Regular Dovetail, Prefix Dovetail, Suffix Dovetail
Edges are annotated with deltas of overlaps
20
The Unitig Reduction
1. Remove Transitively Inferrable Overlaps
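
A minimal sketch of removing transitively inferable overlaps is given below. It is deliberately simplified: reads are assumed to be in a single orientation with no containments, and `hangs[a][b]` (a hypothetical representation) is the number of bases by which read b starts after read a. An edge a->c is treated as inferable when some b gives hang(a,b) + hang(b,c) within a small tolerance of hang(a,c).

```python
def transitive_reduction(hangs, fuzz=10):
    """Drop every edge a->c that is implied by a pair of edges a->b and b->c."""
    removable = set()
    for a, outs in hangs.items():
        for b, hang_ab in outs.items():
            for c, hang_bc in hangs.get(b, {}).items():
                hang_ac = outs.get(c)
                if hang_ac is not None and abs(hang_ab + hang_bc - hang_ac) <= fuzz:
                    removable.add((a, c))
    # Mark first, delete afterwards, so the result does not depend on visit order.
    return {
        a: {b: h for b, h in outs.items() if (a, b) not in removable}
        for a, outs in hangs.items()
    }


# Four 500bp reads tiled every 100bp: every read overlaps all later ones.
hangs = {
    "r1": {"r2": 100, "r3": 200, "r4": 300},
    "r2": {"r3": 100, "r4": 200},
    "r3": {"r4": 100},
}
reduced = transitive_reduction(hangs)
print(reduced)
```

After reduction only the chain r1 -> r2 -> r3 -> r4 remains, which is the form that the next step collapses into a unitig.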
21
The Unitig Reduction
2. Collapse Unique Connector Overlaps
22
Unitigs Definition
Chordal Subgraph with no conflicting edges.
23
Unitig Theorem (Myers, JCB 95)
  • (1) Remove contained fragments
  • (2) Remove transitively inferred edges
  • (3) Collapse into unitigs
  • Restore t.i. (transitively inferred) edges
    between unitig ends.
  • THM: Shortest Common Superstring of unitigs =
    Shortest Common Superstring of reads
  • Caveat: SCS is not the right objective for
    assembly.

24
Revised Unitigger Algorithm
  • The preceding is computationally expensive.
  • The current unitigger finds the best overlap on
    each end of each read: its "best buddy".
  • Unitigs are chains of mutually unique best
    buddies: adjacent reads are best buddies of each
    other and of no other read.
  • This takes time and space linear in the number of
    reads.
  • In rare cases results are different from graph
    reduction.
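
The best-buddy idea can be sketched in a few lines. This example assumes a single orientation and takes the best overlap off each end as given input (`best_right` and `best_left` are hypothetical dictionaries); choosing the "best" overlap and handling reverse complements or circular chains are outside its scope.

```python
from collections import Counter


def best_buddy_unitigs(reads, best_right, best_left):
    """Chain reads whose best overlaps are mutual and shared with no other read."""
    picked_as_right = Counter(best_right.values())  # how often s is someone's right buddy
    picked_as_left = Counter(best_left.values())    # how often r is someone's left buddy

    next_read = {}
    for r in reads:
        s = best_right.get(r)
        if (s is not None and best_left.get(s) == r
                and picked_as_right[s] == 1 and picked_as_left[r] == 1):
            next_read[r] = s

    has_predecessor = set(next_read.values())
    unitigs = []
    for r in reads:                 # start a chain at every read with no predecessor
        if r in has_predecessor:
            continue
        chain = [r]
        while chain[-1] in next_read:
            chain.append(next_read[chain[-1]])
        unitigs.append(chain)
    return unitigs


# a-b-c lie in unique sequence; both c and e pick repeat read r off their right end.
reads = ["a", "b", "c", "e", "r"]
best_right = {"a": "b", "b": "c", "c": "r", "e": "r"}
best_left = {"b": "a", "c": "b", "r": "c"}
unitigs = best_buddy_unitigs(reads, best_right, best_left)
print(unitigs)  # prints [['a', 'b', 'c'], ['e'], ['r']]
```

Every read is visited a constant number of times, matching the slide's claim of linear time and space. The chain breaks at r because two reads claim it as their best buddy, which is the signature of a repeat boundary.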

25
Branch Point Extension
  • A repeat boundary reflected on an underlying
    sequence read.
  • Compare peers to detect branch pts.
  • Make sure you get a read-length into each
    repeat-induced gap (most Alu-sized elements are
    resolved).

Figure labels: Genome, A, C, D
26
Bubble Smoothing
Figure labels: 412, 352, 486, 245
27
Assembly Pipeline
Unique
Repetitive
28
Identifying Unique DNA Stretches
Arrival Intervals: Repetitive DNA unitig vs. Unique DNA unitig
Discriminator Statistic is log-odds ratio of
probability unitig is unique DNA versus 2-copy
DNA.
Figure: Dist. For Repetitive and Dist. For Unique plotted on an axis marked -10, 0, 10
Regions: Definitely Repetitive, Don't Know, Definitely Unique
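
The log-odds statistic can be worked out under a simple model. Assume read starts arrive as a Poisson process, so a unique unitig of length rho is expected to hold lam = rho * F / G reads (F reads in total, genome length G) and a collapsed 2-copy unitig 2 * lam. For k observed reads, ln[P(k | unique) / P(k | 2-copy)] = lam - k * ln 2. The sketch below uses that formula; the Poisson assumption, the use of the read count for k, and the +/-10 cutoffs (read off the axis ticks in the figure) are assumptions of this example.

```python
from math import log

LN2 = log(2)


def a_statistic(unitig_len_bp, reads_in_unitig, total_reads, genome_len_bp):
    """Natural-log odds that a unitig is unique rather than a 2-copy collapse."""
    expected_if_unique = unitig_len_bp * total_reads / genome_len_bp
    return expected_if_unique - reads_in_unitig * LN2


def classify(a_stat, cutoff=10.0):
    if a_stat >= cutoff:
        return "definitely unique"
    if a_stat <= -cutoff:
        return "definitely repetitive"
    return "don't know"


F, G = 27.27e6, 3e9  # read count and genome length quoted elsewhere in the deck
for length, k in [(10_000, 91), (10_000, 182), (1_000, 9)]:
    a = a_statistic(length, k, F, G)
    print(f"{length}bp unitig with {k} reads: A = {a:.1f} ({classify(a)})")
```

Short unitigs land in the "don't know" band even at the expected read density, because too few arrivals have been observed to separate the two hypotheses.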
29
Assembly Pipeline
Scaffold U-unitigs with confirmed pairs
30
Assembly Pipeline
Fill repeat gaps with doubly anchored positive
unitigs
Unitig > 0
31
Assembly Pipeline
Fill repeat gaps with assembled, singly anchored
reads
32
Surrogates
  • Stones containing more than 1 read are added to
    contigs as consensus sequence only, without
    underlying reads.
  • Called surrogates
  • Allows repeat unitigs to be put in multiple
    positions in the assembly, but leaves regions
    without underlying read coverage.
  • We later attempt to resolve surrogates, by
    assigning reads from the original repeat unitig
    to the separate surrogate copies, based on mate
    pairs.

33
CelAsm Weaknesses
  • No dynamic read trimming.
  • However, the latest version has fixed this: it
    offers trimming based on overlaps.
  • Unitigging treats overlaps as boolean values: no
    quality variations are considered.
  • Unitigging ignores mate pairs.
  • Unitig A-stat is sometimes unreliable for
    non-random sequencing, haplotype variation,
    repeat-screening.
  • Scaffolding has no provision for (moderate or
    worse) polymorphism.
  • Read overlaps are ignored after unitigging.