1
De novo bacterial genome sequencing: millions of
very short reads assembled on a desktop computer
  • David Hernandez, Patrice François, Laurent
    Farinelli, Magne Østerås, Jacques Schrenzel

Presented by Lucas Lochovsky
2
Outline
  • Introduction
  • Edena's Methodology
      • Reducing Read Redundancy
      • Overlap Graph Construction
      • Transitive Edge Reduction
      • Graph Cleanup
      • Contig Production
  • Results
      • Assemblers
      • Assembly tasks
  • Additional Edena Analyses
      • Graph Cleaning Effectiveness
      • Effective Coverage Depth
  • Conclusions

4
1) Introduction
  • NGS will allow us to explore strange new genomes,
    blah blah blah.
  • WGS assemblers we've covered so far:
      • Medvedev-Brudno assembler
      • Arachne
      • AMOS-Cmp
      • Velvet
      • ALLPATHS
  • Think you've seen it all?

5
1) Introduction (cont'd)
  • Edena: de novo short read assembler
  • Uses a classic overlap graph approach to assembly
  • Anyone else get a feeling of déjà vu?
  • Compare to other recently published NGS read
    assemblers
  • De novo assembly of two bacterial genomes
    sequenced with the Illumina/Solexa platform

7
2) Edena's Methodology
  • Built around a standard overlap-layout-consensus
    workflow
  • Opted to use exact matching for overlap detection
      • Reduces spurious overlaps
      • Faster than using approximate matching
  • Also assumes that all reads have the same length
      • Is this assumption valid?

8
2) Edena's Methodology (cont'd)
  • Four major steps:
      • Remove redundant reads so that dataset size is
        more manageable
      • Overlap detection and overlap graph construction
      • Graph cleaning: simplification and ambiguity
        resolution
      • Produce contigs

9
2) Edena's Methodology (cont'd)
  • 1) Practice your 3 Rs: Reducing Read Redundancy
  • Illumina Genome Analyzer has a high amount of
    over-sampling → many redundant reads
  • Reduce dataset so it contains only a single copy
    of each read → non-redundant
  • Index all reads into a prefix tree
  • Identical reads will be mapped to the same key →
    no duplicate reads in this structure

10
2) Edena's Methodology (cont'd)
  • Prefix trees are associative arrays for strings
    where all descendants of a node have a common
    prefix
  • Reads and their reverse complements are
    considered the same read → merged into the same
    tree key

11
2) Edena's Methodology (cont'd)
  • Ambiguous reads discarded, since they won't work
    with exact matching
  • Opens up possibility of coverage gaps in read
    data (not explored by the authors)
  • Original read data still useful for getting read
    frequencies
      • Contig coverage depth
      • Repeat identification
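
The redundancy-reduction step described on the last few slides can be sketched roughly as follows. This is only an illustration, not Edena's code: a plain Python dictionary (`Counter`) stands in for the prefix tree, and the function names are made up for this example. A read and its reverse complement share one key, reads with ambiguous bases are dropped, and the per-key counts keep the read frequencies available for later use.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of an unambiguous DNA read."""
    return read.translate(_COMPLEMENT)[::-1]


def canonical(read: str) -> str:
    """Pick one representative for a read and its reverse complement."""
    return min(read, reverse_complement(read))


def deduplicate(reads):
    """Map each distinct read (strand-insensitive) to its number of copies.

    A dict is used here in place of a prefix tree (an assumption made for
    brevity); both give one key per distinct read.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        if set(read) - set("ACGT"):
            continue  # ambiguous base such as N: unusable with exact matching
        counts[canonical(read)] += 1
    return counts


non_redundant = deduplicate(["ACGT", "ACGT", "AAAC", "GTTT", "ANGT"])
```

Here `AAAC` and `GTTT` collapse into one key because they are reverse complements of each other, and `ANGT` is discarded.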

12
2) Edena's Methodology (cont'd)
  • 2) Overlap Graph Construction
  • Non-redundant read dataset is indexed by a suffix
    array
      • Déjà vu moment: almost exactly like suffix trees
        from MUMmer/MUMmerGPU!
  • Information used to produce a bidirected overlap
    graph
      • Déjà vu moment: just like the Medvedev-Brudno
        assembler! (which I presented!)

13
2) Edena's Methodology (cont'd)
  • This slide should be review for all of you!
  • Bidirected graphs are kind of like directed
    graphs, except each edge has an orientation on
    each of its ends
  • Gives rise to three types of edges:
      • Edges where one arrow points out of a vertex, and
        one arrow points into a vertex
      • Edges with both arrows pointing out, and
      • Edges with both arrows pointing in (easiest one
        to do in PowerPoint!)
  • For a walk in a bidirected graph, for each vertex
    on that walk, the orientation of the edge
    entering the vertex must be opposite that of the
    edge leaving the vertex
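
The walk rule can be checked mechanically. The sketch below assumes a made-up representation: each step of a walk is a tuple `(from_vertex, arrow_at_from, to_vertex, arrow_at_to)`, where each arrow is `"in"` or `"out"` relative to the vertex it touches. The function name is hypothetical.

```python
def is_valid_walk(steps):
    """Check the bidirected-walk rule on a list of traversed edges.

    Each step is (from_vertex, arrow_at_from, to_vertex, arrow_at_to).
    At every intermediate vertex, the arrow on the entering edge must be
    the opposite of the arrow on the leaving edge.
    """
    for entering, leaving in zip(steps, steps[1:]):
        _, _, reached, arrow_entering = entering
        departed, arrow_leaving, _, _ = leaving
        if reached != departed:
            return False  # consecutive edges do not share a vertex
        if arrow_entering == arrow_leaving:
            return False  # same orientation on both sides of the vertex
    return True


good = [("a", "out", "b", "in"), ("b", "out", "c", "out"), ("c", "in", "d", "in")]
bad = [("a", "out", "b", "in"), ("b", "in", "c", "in")]
```

`good` passes because the orientations alternate at `b` and at `c`; `bad` fails because both edges at `b` point into it.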

14
2) Edena's Methodology (cont'd)
  • More review!
  • In a bidirected overlap graph, each vertex is a
    double-stranded read
  • Edges represent read overlaps
  • Three possible ways that two double-stranded
    reads can overlap (corresponding to the three
    types of edges)
  • Suppose we have two ds reads r1 and r2
  • Each read can be oriented to the left or to the
    right
  • The three possible overlaps are:
      • i) Both strands point in the same direction (both
        reads can point left, or both can point right;
        it's the same overlap either way)
      • ii) r1 points left and r2 points right
      • iii) r1 points right and r2 points left

15
2) Edena's Methodology (cont'd)
  • Parameter: minimum overlap size
  • Sensitivity vs. specificity tradeoff
      • Small value: higher frequency of chance overlaps
        → causes path branching in graph (sensitivity
        favoured)
      • Large value: creates more dead-end (DE) paths,
        i.e. reads not extended by overlapping reads on
        one side (specificity favoured)
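
To make the effect of the minimum overlap size concrete, here is a small sketch of exact suffix/prefix overlap detection. It is a simplification under stated assumptions: it works on a single strand, uses a dictionary keyed on read prefixes instead of a suffix array, and the function name is invented for this example.

```python
from collections import defaultdict


def exact_overlaps(reads, min_overlap):
    """Return {(a, b): length} for exact overlaps of suffix(a) with prefix(b).

    Only the longest overlap per ordered pair is kept, and only overlaps of
    at least `min_overlap` bases are reported. Reads are assumed to have
    equal length and to be distinct.
    """
    prefix_index = defaultdict(set)
    for read in reads:
        prefix_index[read[:min_overlap]].add(read)

    overlaps = {}
    for a in reads:
        # Increasing `start` shortens the suffix, so longest overlaps come first.
        for start in range(1, len(a) - min_overlap + 1):
            suffix = a[start:]
            for b in prefix_index.get(suffix[:min_overlap], ()):
                if b != a and b.startswith(suffix) and (a, b) not in overlaps:
                    overlaps[(a, b)] = len(suffix)
    return overlaps


reads = ["ACGTAC", "GTACGG", "TACGGA"]
loose = exact_overlaps(reads, 3)   # three overlaps, one of them only 3 bases long
strict = exact_overlaps(reads, 4)  # the 3-base overlap disappears
```

With the smaller threshold, `ACGTAC` gains a second outgoing edge (a branch); raising the threshold removes it, which is the tradeoff the slide describes.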

16
2) Edena's Methodology (cont'd)
  • 3a) Transitive Edge Reduction
  • Simplifies the graph by removing nonessential
    edges
  • Generally speaking, when the path v1 → v2 → v3 and
    the edge v1 → v3 both exist, the edge v1 → v3 is
    transitive and can be removed: the path already
    represents the same sequence
  • Reduces graph complexity by the over-sampling
    rate c = NL/G
      • N = number of reads
      • L = read length
      • G = genome size

17
2) Edena's Methodology (cont'd)
  • For sequences, it's about removing the overlap
    between two reads when an intermediate read, which
    overlaps the first read to a greater extent,
    already spells out the same sequence
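
A minimal sketch of transitive edge reduction, assuming a plain directed, acyclic graph given as a set of `(u, v)` pairs rather than Edena's bidirected graph. The function name is hypothetical, and this is the naive form of the idea, not an optimised algorithm.

```python
def transitive_reduction(edges):
    """Drop every edge (u, w) for which some v gives edges (u, v) and (v, w).

    `edges` is a set of (u, v) pairs. The graph is assumed to be acyclic;
    cycles can make this naive version remove too much.
    """
    successors = {}
    for u, v in edges:
        successors.setdefault(u, set()).add(v)

    redundant = set()
    for u, targets in successors.items():
        for v in targets:
            for w in successors.get(v, ()):
                if w in targets and w != v:
                    redundant.add((u, w))  # u -> w is implied by u -> v -> w
    return set(edges) - redundant


reduced = transitive_reduction({("r1", "r2"), ("r2", "r3"), ("r1", "r3")})
```

In this example the edge `r1 → r3` is dropped, leaving the chain `r1 → r2 → r3`.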

18
2) Edena's Methodology (cont'd)
  • 3b) Graph Cleanup
  • Can have multiple paths branching off a single
    node (branching paths)
  • Due to genomic repetitions, sequencing errors,
    and clonal polymorphisms
  • Genomic repetitions cannot be fixed without
    additional information
  • But the other two can be resolved

19
2) Edena's Methodology (cont'd)
  • Sequencing errors produce short dead-end (DE)
    paths
  • Attempt to elongate branching nodes up to a
    certain depth md (minimum depth)
  • Reads that cannot be extended to a depth of md
    are removed
  • Experimentally determined that md = 10 is the
    best value
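
The dead-end removal idea can be sketched as follows, with several simplifying assumptions: the graph is a plain directed graph stored as a dict of successor sets, depth is counted in nodes, and every node on a removed branch is assumed to have no other predecessor. The function name is invented here and this is not Edena's actual procedure.

```python
def remove_dead_ends(successors, md):
    """Prune branches that leave a branching node and stop before depth md.

    `successors` maps node -> set of successor nodes; nodes without an
    entry are treated as having no successors. Returns a pruned copy.
    """
    graph = {node: set(targets) for node, targets in successors.items()}
    for node in list(graph):
        if node not in graph or len(graph[node]) < 2:
            continue  # only branching nodes are examined
        for first in list(graph[node]):
            path, current = [first], first
            # Follow the branch while it stays linear and shorter than md.
            while len(path) < md and len(graph.get(current, ())) == 1:
                current = next(iter(graph[current]))
                path.append(current)
            if len(path) < md and not graph.get(current):
                # The branch ran out before reaching depth md: drop it.
                graph[node].discard(first)
                for dead in path:
                    graph.pop(dead, None)
    return graph


overlap_graph = {"a": {"b", "x"}, "b": {"c"}, "c": {"d"}, "d": {"e"}}
pruned = remove_dead_ends(overlap_graph, md=3)
```

Here the one-node branch `a → x` is removed, while the branch through `b` survives because it can be extended to depth 3.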

20
2) Edena's Methodology (cont'd)
21
2) Edena's Methodology (cont'd)
  • Also disambiguate bubbles in the graph caused by
    single base substitutions (aka p-bubbles)
  • Length of a p-bubble is at most ms = 4L - 2T - 1
      • L = read length
      • T = min. overlap size
  • Explore each branching path up to length ms
    (guaranteed upper bound)
  • Remove path with less coverage
  • Polymorphisms can be retained for later analysis
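
As a quick worked example of the bound on p-bubble length, the snippet below evaluates 4L - 2T - 1. The read length of 35 is an assumed value chosen for illustration (the slides do not state it); the overlap size of 21 matches the S. aureus setting quoted in the results.

```python
def max_bubble_length(read_length, min_overlap):
    """Upper bound on p-bubble length: 4L - 2T - 1."""
    return 4 * read_length - 2 * min_overlap - 1


# read_length=35 is an illustrative assumption, not a figure from the slides.
ms = max_bubble_length(read_length=35, min_overlap=21)  # 4*35 - 2*21 - 1 = 97
```

A larger minimum overlap gives a smaller bound, so less of the graph has to be explored from each branching node.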

22
2) Edena's Methodology (cont'd)
23
2) Edena's Methodology (cont'd)
  • 4) Contig Production
  • If run in strict mode, Edena starts generating
    contig sequences
  • In non-strict mode, one more cleaning step is
    performed
  • Longer overlaps more reliable than shorter ones
  • Save only edges at branching nodes that have the
    highest overlap of all edges
  • Produce contig sequence by following
    non-intersecting simple paths in overlap graph
  • Nodes must have in-degree and out-degree of
    exactly one
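
Following unbranched paths can be sketched like this, assuming a plain directed graph given as a list of `(u, v)` edges. The function name is hypothetical. Each returned path starts and ends at a node that is not linear, with only in-degree-one, out-degree-one nodes in between; isolated cycles are ignored in this simplified version.

```python
def unbranched_paths(edges):
    """Return maximal paths whose interior nodes have in- and out-degree one."""
    successors, predecessors = {}, {}
    for u, v in edges:
        successors.setdefault(u, []).append(v)
        predecessors.setdefault(v, []).append(u)
    nodes = set(successors) | set(predecessors)

    def is_linear(node):
        return (len(successors.get(node, [])) == 1
                and len(predecessors.get(node, [])) == 1)

    paths = []
    for start in sorted(nodes):
        if is_linear(start):
            continue  # interior nodes are picked up while extending a path
        for nxt in successors.get(start, []):
            path = [start, nxt]
            while is_linear(path[-1]):
                path.append(successors[path[-1]][0])
            paths.append(path)
    return paths


paths = unbranched_paths([("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")])
```

The branch at `c` splits the graph into three paths: `a, b, c`, then `c, d` and `c, e`. Each path would then be turned into a contig by merging its reads along their overlaps.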

25
3) Results
  • Survivor: WGS Assembly
  • Four assemblers
  • Two challenges
  • One winner

26
3) Results (cont'd)
  • Contestant 1: SSAKE
      • Indexes reads in a prefix tree based upon the
        first eleven 5' bases
      • Identifies the highest possible overlap between
        pairs of reads
      • Uses the most highly-covered reads as starting
        points for read extension (i.e. assembly
        nucleation points)
      • So far only used for partial genome sequencing
        for comparative metagenomic analysis (e.g.
        bacterial species distinction)

27
3) Results (cont'd)
  • Contestant 2: Velvet
      • k-mer/q-gram/k-gram/q-mer de Bruijn graph
        representation of reads
  • Contestant 3: SHARCGS
      • Can accept base quality scores along with read
        data for read filtering (low quality reads
        discarded)
      • Also filters out reads with low coverage
      • Assembly performed with a prefix tree
  • Contestant 4: Edena

28
3) Results (cont'd)
  • Reward Challenge
      • Assemble the 2.82 Mbp genome sequence and the
        20.7 kbp plasmid sequence of the Staphylococcus
        aureus MW2 strain from Illumina reads
  • Immunity Challenge
      • Assemble the 1.55 Mbp genome sequence and the
        3.66 kbp plasmid sequence of the Helicobacter
        acinonychis Sheeba strain from Illumina reads

29
3) Results (cont'd)
  • Staphylococcus aureus results
  • Evaluated each assembler on the parameter
    configurations that produced the best results
      • Edena: min. overlap size 21 bases
      • Velvet: k-mer value 23
      • SHARCGS: max. gap span 14
      • SSAKE: default parameters

30
3) Results (cont'd)
  • Compared contig assembly to published reference
    sequence
  • Non-strict mode tends to produce longer contigs
    at the expense of additional misassemblies
  • Velvet comparable to Edena strict

31
3) Results (cont'd)
  • SHARCGS unable to assemble significant contigs →
    insufficient coverage depth
  • SSAKE produced a large number of mismatches,
    mostly at contig boundaries

32
3) Results (cont'd)
  • Authors also tried combining contig results from
    Edena and Velvet due to significant overlaps
    between their contigs
  • N50 and mean contig size increased relative to
    original results
  • Edena non-strict has similar influence on results
    as previously
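
For reference, N50 is the contig length such that contigs of at least that length cover at least half of the total assembled bases. A short sketch, with a function name chosen for this example and made-up contig lengths:

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


# Illustrative lengths only, not data from the paper.
example = n50([2, 2, 2, 3, 3, 4, 8, 8])
```

In the example the two 8-base contigs already account for 16 of the 32 bases, so the N50 is 8.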

33
3) Results (cont'd)
  • Helicobacter acinonychis results
  • Best parameter settings:
      • Edena: min. overlap size 27 (strict), 26
        (non-strict)
      • Velvet: k-mer value 27
      • SHARCGS: max. gap span 10 (also must remove last
        four bases from each read)
      • SSAKE: default parameters

34
3) Results (cont'd)
  • Results similar to those from the previous
    assembly challenge

35
3) Results (cont'd)
  • Survivor: WGS Assembly Conclusion
  • Granted Immunity: Edena, Velvet
  • Sent to the Tribal Council: SSAKE, SHARCGS

37
4) Additional Edena Analyses
  • Graph Cleaning Effectiveness
  • Demonstrate the effectiveness of DE path removal
    and p-bubble fixing
  • Created an ideal read pool from the S. aureus MW2
    strain
  • Consists of one read at every possible position
  • No errors
  • No polymorphisms
  • Distinguish between positive and negative reads
  • Positive reads have at least one exact occurrence
    in the reference sequence
  • Negative reads have none
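
The ideal read pool and the positive/negative distinction can be sketched in a few lines. The function names are invented for this example, and checking both strands of the reference is an assumption on my part, made because reads and their reverse complements are treated as the same read elsewhere in the method.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def ideal_read_pool(reference, read_length):
    """One error-free read starting at every possible position."""
    return [reference[i:i + read_length]
            for i in range(len(reference) - read_length + 1)]


def is_positive(read, reference):
    """True if the read (or its reverse complement) occurs exactly in the reference."""
    reverse_complement = read.translate(_COMPLEMENT)[::-1]
    return read in reference or reverse_complement in reference


reference = "AACCGGTTAC"  # toy sequence, not real data
pool = ideal_read_pool(reference, 4)
labels = {read: is_positive(read, reference) for read in ["CCGG", "GTAA", "AAAA"]}
```

`GTAA` counts as positive only through its reverse complement `TTAC`; `AAAA` is negative because neither strand contains it.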

38
4) Additional Edena Analyses (cont'd)
  • Ideal dataset indicates branching nodes and
    p-bubbles caused by genomic repetition
  • Anomalies in real datasets only due to negative
    reads
  • Due to small quantity of branching nodes in the
    ideal dataset, branch removal procedure is
    extremely effective

39
4) Additional Edena Analyses (cont'd)
  • Though many p-bubbles consist of sequences made
    of negative reads, most cannot be explained by
    base calling errors
  • Thought to correspond to underrepresented clonal
    polymorphisms

40
4) Additional Edena Analyses (cont'd)
  • Since there are no DE paths in the ideal dataset,
    expect that DE removal should remove all DE paths
    in real dataset (i.e. dead-ends correspond to
    negative reads)
  • From tests with different md values (below),
    authors decided 10 was best
  • Not so clear-cut to me

41
4) Additional Edena Analyses (cont'd)
  • Most DE paths have length 1
      • Correspond to paths created by base calling
        errors
  • Longer DE paths exist that do not appear to be
    caused by such errors
      • Thought to be clonal polymorphisms in low
        abundance → can't form a complete p-bubble

42
4) Additional Edena Analyses (cont'd)
  • Effective Coverage Depth
  • Computed effective coverage depth according to
    formula from Lander and Waterman
  • E = N(L - T)/G
      • N = number of usable reads
      • L = read length
      • T = required overlap length
      • G = genome size
  • Can also estimate the number of gaps in read
    coverage with N·e^(-E)
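
The formulas on this slide are easy to evaluate directly. The sketch below uses hypothetical function names and round, made-up input values (one million usable 35-base reads, overlap 21, a 2.8 Mbp genome); they are not the figures from the paper.

```python
import math


def raw_coverage(n_reads, read_length, genome_size):
    """Over-sampling rate c = NL/G."""
    return n_reads * read_length / genome_size


def effective_coverage(n_reads, read_length, min_overlap, genome_size):
    """Effective coverage depth E = N(L - T)/G."""
    return n_reads * (read_length - min_overlap) / genome_size


def expected_gaps(n_reads, read_length, min_overlap, genome_size):
    """Expected number of coverage gaps, N * e^(-E)."""
    e = effective_coverage(n_reads, read_length, min_overlap, genome_size)
    return n_reads * math.exp(-e)


# Illustrative inputs only.
c = raw_coverage(1_000_000, 35, 2_800_000)            # 12.5
e = effective_coverage(1_000_000, 35, 21, 2_800_000)  # 5.0
gaps = expected_gaps(1_000_000, 35, 21, 2_800_000)    # N * exp(-5)
```

Requiring a 21-base overlap out of 35 leaves only 14 "useful" bases per read, which is why the effective depth is so much lower than the raw depth.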

43
4) Additional Edena Analyses (cont'd)
  • S. aureus sequencing
      • Raw coverage depth: 48x
      • Effective coverage depth: 14x
  • H. acinonychis sequencing
      • Raw coverage depth: 284x
      • Effective coverage depth: 36x
  • Statistics imply that there should be no gaps in
    H. acinonychis assembly, and only a few in S.
    aureus
  • But each actual assembly contained several
    hundred gaps

44
4) Additional Edena Analyses (cont'd)
  • Statistics assume uniform read sampling
  • Investigated underrepresented parts of genomes
  • After alignment of reads to reference genome,
    extracted low coverage sequences
  • These sequences have complex motifs and single
    base repeats → cause difficulty in replication

46
5) Conclusions
  • Edena holds up well against other recent
    assemblers, in both assembly quality and
    computational resources
  • Some assemblers are partially complementary to
    each other (Edena and Velvet) → can use together
    to produce results better than each individual
    assembler's results
  • Rise of NGS paired read data will help produce
    longer contigs and clean up ambiguities

47
Is Edena The One?
  • The One that will herald the beginning of
    cost-effective whole genome assembly with NGS?
  • Maybe you should ask the Oracle

48
That's all, folks!
  • Discussion Questions
  • What were the strengths/weaknesses of Edena?
    How would you improve it?
  • How do you think Edena compares to the other
    assemblers tested? Would you test it against
    other assemblers not tested here?
  • Given Edena's limitations, would you trust it for
    de novo genome assembly over traditional sequence
    assembly?
  • Why did we have to discuss yet another NGS genome
    assembler today?