Title: De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer
De novo bacterial genome sequencing: millions of
very short reads assembled on a desktop computer
- David Hernandez, Patrice François, Laurent
Farinelli, Magne Østerås, Jacques Schrenzel
Presented by Lucas Lochovsky
Outline
- Introduction
- Edena's Methodology
- Reducing Read Redundancy
- Overlap Graph Construction
- Transitive Edge Reduction
- Graph Cleanup
- Contig Production
- Results
- Assemblers
- Assembly tasks
- Additional Edena Analyses
- Graph Cleaning Effectiveness
- Effective Coverage Depth
- Conclusions
1) Introduction
- NGS will allow us to explore strange new genomes,
blah blah blah.
- WGS assemblers we've covered so far:
- Medvedev-Brudno assembler
- Arachne
- AMOS-Cmp
- Velvet
- ALLPATHS
- Think you've seen it all?
1) Introduction (cont'd)
- Edena: De novo short read assembler
- Uses a classic overlap graph approach to assembly
- Anyone else get a feeling of déjà vu?
- Compared to other recently published NGS read
assemblers
- De novo assembly of two bacterial genomes
sequenced with the Illumina/Solexa platform
2) Edena's Methodology
- Built around a standard overlap-layout-consensus
workflow
- Opted to use exact matching for overlap detection
- Reduces spurious overlaps
- Faster than using approximate matching
- Also assumes that all reads have the same length
- Is this assumption valid?
2) Edena's Methodology (cont'd)
- Four major steps:
- Remove redundant reads so that dataset size is
more manageable
- Overlap detection and overlap graph construction
- Graph cleaning: simplification and ambiguity
resolution
- Produce contigs
2) Edena's Methodology (cont'd)
- 1) Practice your 3 Rs: Reducing Read Redundancy
- Illumina Genome Analyzer has a high amount of
over-sampling → many redundant reads
- Reduce dataset so it contains only a single copy
of each read → non-redundant
- Index all reads into a prefix tree
- Identical reads will be mapped to the same key →
no duplicate reads in this structure
2) Edena's Methodology (cont'd)
- Prefix trees are associative arrays for strings
where all descendants of a node have a common
prefix
- Reads and their reverse complements are
considered the same read → merged into the same
tree key
2) Edena's Methodology (cont'd)
- Ambiguous reads discarded, since they won't work
with exact matching
- Opens up possibility of coverage gaps in read
data (not explored by the authors)
- Original read data still useful for getting read
frequencies
- Contig coverage depth
- Repeat identification
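The redundancy-reduction step can be sketched in a few lines. This is a minimal illustration, not Edena's implementation: it swaps the prefix tree for a plain Python dictionary (a Counter), and it assumes that "ambiguous" means any read containing a symbol other than A, C, G or T. The function names and the rule for picking a key (the lexicographically smaller of a read and its reverse complement) are hypothetical choices made for the example.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Return the reverse complement of an unambiguous DNA read."""
    return read.translate(COMPLEMENT)[::-1]


def canonical(read):
    """Pick one key for a read and its reverse complement.

    Assumed rule: the lexicographically smaller of the two strings.
    """
    return min(read, reverse_complement(read))


def reduce_redundancy(reads):
    """Map each distinct double-stranded read to its number of occurrences."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        if set(read) - set("ACGT"):
            continue  # assumed meaning of "ambiguous": any non-ACGT symbol
        counts[canonical(read)] += 1
    return counts


# Toy data: "AACGT" is the reverse complement of "ACGTT"; "ACNTT" is ambiguous.
reads = ["ACGTT", "AACGT", "ACGTT", "ACNTT", "GGGCC"]
unique = reduce_redundancy(reads)
print(dict(unique))
```

The keys form the non-redundant dataset, while the counts keep the read frequencies that are later useful for coverage depth and repeat identification.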
2) Edena's Methodology (cont'd)
- 2) Overlap Graph Construction
- Non-redundant read dataset is indexed by a suffix
array
- Déjà vu moment: Almost exactly like suffix trees
from MUMmer/MUMmerGPU!
- Information used to produce a bidirected overlap
graph
- Déjà vu moment: Just like the Medvedev-Brudno
assembler! (which I presented!)
2) Edena's Methodology (cont'd)
- This slide should be review for all of you!
- Bidirected graphs are kind of like directed
graphs, except each edge has an orientation on
each of its ends
- Gives rise to three types of edges:
- Edges where one arrow points out of a vertex, and
one arrow points into a vertex
- Edges with both arrows pointing out, and
- Edges with both arrows pointing in (easiest one
to do in PowerPoint!)
- For a walk in a bidirected graph, for each vertex
on that walk, the orientation of the edge
entering the vertex must be opposite that of the
edge leaving the vertex
2) Edena's Methodology (cont'd)
- More review!
- In a bidirected overlap graph, each vertex is a
double-stranded read
- Edges represent read overlaps
- Three possible ways that two double-stranded
reads can overlap (corresponding to the three types
of edges)
- Suppose we have two ds reads r1 and r2
- Each read can be oriented to the left or to the
right
- The three possible overlaps are:
- i) Both strands point in the same direction (both
reads can point left, or both can point right;
it's the same overlap either way)
- ii) r1 points left and r2 points right
- iii) r1 points right and r2 points left
2) Edena's Methodology (cont'd)
- Parameter: Minimum overlap size
- Sensitivity vs. specificity tradeoff
- Small value: Higher frequency of chance overlaps
→ causes path branching in graph (sensitivity
favoured)
- Large value: Creates more dead-end (DE) paths,
i.e. reads not extended by overlapping reads on
one side (specificity favoured)
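To make the overlap step and the minimum overlap parameter concrete, here is a small sketch of exact suffix-prefix overlap detection between two double-stranded reads. It is an illustration under stated assumptions: it compares reads pairwise by brute force instead of using a suffix array, and the function names and the "+"/"-" strand labels are hypothetical. The three orientation pairs it tries stand in for the three edge types; the fourth combination (both reads reversed) is the same overlap as r2 followed by r1 on the forward strand, so it is not listed separately.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Return the reverse complement of an unambiguous DNA read."""
    return read.translate(COMPLEMENT)[::-1]


def longest_exact_overlap(a, b, min_overlap):
    """Length of the longest proper suffix of a that equals a prefix of b.

    Returns 0 when no exact overlap of at least min_overlap bases exists.
    """
    if min_overlap < 1:
        raise ValueError("min_overlap must be at least 1")
    for size in range(min(len(a), len(b)) - 1, min_overlap - 1, -1):
        if a[-size:] == b[:size]:
            return size
    return 0


def overlaps(r1, r2, min_overlap):
    """Return {(strand of r1, strand of r2): overlap length} for r1 then r2."""
    candidates = {
        ("+", "+"): (r1, r2),
        ("+", "-"): (r1, reverse_complement(r2)),
        ("-", "+"): (reverse_complement(r1), r2),
    }
    found = {}
    for strands, (a, b) in candidates.items():
        size = longest_exact_overlap(a, b, min_overlap)
        if size:
            found[strands] = size
    return found


# "ACGTAC" and "GTACGG" share the 4-base overlap "GTAC" on the same strand.
print(overlaps("ACGTAC", "GTACGG", min_overlap=3))
# Raising the minimum overlap above 4 makes the edge disappear.
print(overlaps("ACGTAC", "GTACGG", min_overlap=5))
```

The second call shows the tradeoff from the slide: with a larger minimum overlap the read loses its only edge and becomes a dead end.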
2) Edena's Methodology (cont'd)
- 3a) Transitive Edge Reduction
- Simplifies the graph by removing nonessential
edges
- Generally speaking, if the edges v1 → v2, v2 → v3
and v1 → v3 all exist, the edge v1 → v3 is
transitive and can be removed: the path
v1 → v2 → v3 represents the same sequence
- Reduces graph complexity by a factor of the
over-sampling rate c = NL/G
- N: Number of reads
- L: Read length
- G: Genome size
2) Edena's Methodology (cont'd)
- For sequences, it's about removing the overlap
between two reads when an intermediate read links
them through longer overlaps, spelling the same
sequence
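Transitive edge reduction can be sketched on a plain directed graph. This is a simplification for illustration only: it ignores the bidirected orientations and the overlap lengths, assumes the graph has no cycles, and the function name is hypothetical. An edge u → w is dropped whenever some v exists with u → v and v → w.

```python
def transitive_reduction(graph):
    """Return a copy of graph without transitive edges.

    graph maps each node to the set of its successors. Edges are first
    marked against the original graph and only then removed, so the
    result does not depend on iteration order.
    """
    transitive = set()
    for u, successors in graph.items():
        for v in successors:
            for w in graph.get(v, ()):
                if w != v and w in successors:
                    transitive.add((u, w))
    return {
        u: {w for w in successors if (u, w) not in transitive}
        for u, successors in graph.items()
    }


# Four reads sampled from the same region: every read overlaps all later ones.
graph = {
    "r1": {"r2", "r3", "r4"},
    "r2": {"r3", "r4"},
    "r3": {"r4"},
    "r4": set(),
}
reduced = transitive_reduction(graph)
print(reduced)
```

In the toy graph six edges collapse to the three edges of the chain r1 → r2 → r3 → r4, which is the kind of shrinkage that grows with the over-sampling rate.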
2) Edena's Methodology (cont'd)
- 3b) Graph Cleanup
- Can have multiple paths branching off a single
node (branching paths)
- Due to genomic repetitions, sequencing errors,
and clonal polymorphisms
- Genomic repetitions cannot be fixed without
additional information
- But the other two can be resolved
2) Edena's Methodology (cont'd)
- Sequencing errors produce short dead-end (DE)
paths
- Attempt to extend each branch leaving a branching
node up to a certain depth md (minimum depth)
- Reads that cannot be extended to a depth of md
are removed
- Experimentally determined that md = 10 is the best
value
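The dead-end removal idea can be sketched as follows. This is a rough illustration, not Edena's code: it works on a plain directed graph, only looks at dead ends in the forward direction, measures depth in nodes, and removes every node reachable from a branch that cannot reach depth md. All names are hypothetical.

```python
def reachable_depth(graph, start, limit):
    """Longest path (counted in nodes) starting at start, capped at limit."""
    best = 0
    stack = [(start, 1)]
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        if best >= limit:
            return limit
        for nxt in graph.get(node, ()):
            stack.append((nxt, depth + 1))
    return best


def reachable(graph, start):
    """All nodes reachable from start, including start itself."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def remove_dead_ends(graph, md):
    """Drop branches of branching nodes that cannot be extended to depth md."""
    doomed = set()
    for node, successors in graph.items():
        if len(successors) < 2:
            continue  # only branching nodes are examined
        for branch in successors:
            if reachable_depth(graph, branch, md) < md:
                doomed |= reachable(graph, branch)
    return {
        node: {s for s in successors if s not in doomed}
        for node, successors in graph.items()
        if node not in doomed
    }


# Node "b" branches into a long path (c, d, e, f) and a one-read dead end (x).
graph = {
    "a": {"b"}, "b": {"c", "x"}, "c": {"d"}, "d": {"e"},
    "e": {"f"}, "f": set(), "x": set(),
}
cleaned = remove_dead_ends(graph, md=3)
print(cleaned)
```

With md = 3 the single-read branch "x" is removed and the long branch survives; with md = 1 nothing is removed, which shows how the threshold controls the cleanup.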
2) Edena's Methodology (cont'd)
- Also disambiguate bubbles in the graph caused by
single base substitutions (aka p-bubbles)
- Length of p-bubble is at most ms = 4L - 2T - 1
- L: Read length
- T: Min. overlap size
- Explore each branching path up to length ms
(guaranteed upper bound)
- Remove path with less coverage
- Polymorphisms can be retained for later analysis
2) Edena's Methodology (cont'd)
- 4) Contig Production
- If run in strict mode, Edena starts generating
contig sequences at this point
- In non-strict mode, one more cleaning step is
performed
- Longer overlaps are more reliable than shorter ones
- At branching nodes, keep only the edge with the
highest overlap of all edges
- Produce contig sequence by following
non-intersecting simple paths in overlap graph
- Nodes must have in-degree and out-degree of
exactly one
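Following simple paths and spelling out contigs can be sketched like this. It is a minimal illustration under stated assumptions: a plain directed graph on a single strand, with each edge storing the overlap length between the two reads. The function names are hypothetical. A path is extended from u to v only when u has exactly one outgoing edge and v has exactly one incoming edge, so no two paths share a node.

```python
def contig_paths(graph):
    """Split the graph into node-disjoint simple paths.

    graph maps each node to a dict {successor: overlap length}.
    Isolated cycles have no starting node and are skipped in this sketch.
    """
    preds = {node: [] for node in graph}
    for u, successors in graph.items():
        for v in successors:
            preds.setdefault(v, []).append(u)

    def unambiguous(u, v):
        return len(graph.get(u, {})) == 1 and len(preds.get(v, [])) == 1

    paths = []
    for node, incoming in preds.items():
        if len(incoming) == 1 and unambiguous(incoming[0], node):
            continue  # node sits inside a path, so it is not a starting point
        path = [node]
        while True:
            successors = graph.get(path[-1], {})
            if len(successors) != 1:
                break
            (nxt,) = successors
            if not unambiguous(path[-1], nxt):
                break
            path.append(nxt)
        paths.append(path)
    return paths


def spell(path, reads, graph):
    """Concatenate the reads on a path, skipping the overlapping prefixes."""
    sequence = reads[path[0]]
    for u, v in zip(path, path[1:]):
        sequence += reads[v][graph[u][v]:]
    return sequence


reads = {"a": "ACGTAC", "b": "GTACGG", "c": "ACGGTT", "d": "GGTTAA", "e": "GGTTCC"}
graph = {"a": {"b": 4}, "b": {"c": 4}, "c": {"d": 4, "e": 4}, "d": {}, "e": {}}
paths = contig_paths(graph)
print([spell(path, reads, graph) for path in paths])
```

The branch at read "c" ends the first contig; the two competing continuations "d" and "e" each become their own short contig because the graph alone cannot say which one follows.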
3) Results
- Survivor: WGS Assembly
- Four assemblers
- Two challenges
- One winner
3) Results (cont'd)
- Contestant 1: SSAKE
- Indexes reads in a prefix tree based upon first
eleven 5' bases
- Identifies highest possible overlap between pairs
of reads
- Uses most highly-covered reads as starting points
for read extension (i.e. assembly nucleation
points)
- So far only used for partial genome sequencing
for comparative metagenomic analysis (e.g.
bacterial species distinction)
3) Results (cont'd)
- Contestant 2: Velvet
- k-mer/q-gram/k-gram/q-mer de Bruijn graph
representation of reads
- Contestant 3: SHARCGS
- Can accept base quality scores along with read
data for read filtering (low quality reads
discarded)
- Also filters out reads with low coverage
- Assembly performed with a prefix tree
- Contestant 4: Edena
3) Results (cont'd)
- Reward Challenge
- Assemble the 2.82 Mbp genome sequence and the
20.7 Kbp plasmid sequence of the Staphylococcus
aureus MW2 strain from Illumina reads
- Immunity Challenge
- Assemble the 1.55 Mbp genome sequence and the 3.66
Kbp plasmid sequence of the Helicobacter
acinonychis Sheeba strain from Illumina reads
3) Results (cont'd)
- Staphylococcus aureus results
- Evaluated each assembler on the parameter
configurations that produced the best results
- Edena: Min. overlap size 21 bases
- Velvet: k-mer value 23
- SHARCGS: Max. gap span 14
- SSAKE: Default parameters
3) Results (cont'd)
- Compared contig assembly to published reference
sequence
- Non-strict mode tends to produce longer contigs
at the expense of additional misassemblies
- Velvet comparable to Edena strict
3) Results (cont'd)
- SHARCGS unable to assemble significant contigs →
insufficient coverage depth
- SSAKE produced a large number of mismatches,
mostly at contig boundaries
3) Results (cont'd)
- Authors also tried combining contig results from
Edena and Velvet due to significant overlaps
between their contigs
- N50 and mean contig size increased relative to
original results
- Edena non-strict has similar influence on results
as previously
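For anyone who has not met N50 before, here is a small sketch using the usual definition: the contig length at which contigs of that length or longer contain at least half of the total assembled bases. The contig lengths below are invented for illustration and are not results from the paper; the function name is hypothetical.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length
    raise ValueError("n50 needs at least one contig")


# Made-up contig lengths in kbp, total 290, so half is 145.
lengths = [80, 70, 50, 40, 30, 20]
print(n50(lengths), sum(lengths) / len(lengths))
```

Merging overlapping contigs replaces several short contigs with fewer long ones, which is why both N50 and the mean contig size can rise when two assemblies are combined.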
3) Results (cont'd)
- Helicobacter acinonychis results
- Best parameter settings
- Edena: Min. overlap size 27 (strict), 26
(non-strict)
- Velvet: k-mer value 27
- SHARCGS: Max. gap span 10 (also must remove last
four bases from each read)
- SSAKE: Default parameters
3) Results (cont'd)
- Results similar to those from the previous
assembly challenge
3) Results (cont'd)
- Survivor: WGS Assembly Conclusion
- Granted Immunity: Edena, Velvet
- Sent to the Tribal Council: SSAKE, SHARCGS
4) Additional Edena Analyses
- Graph Cleaning Effectiveness
- Demonstrate the effectiveness of DE path removal
and p-bubble fixing
- Created an ideal read pool from the S. aureus MW2
strain
- Consists of one read at every possible position
- No errors
- No polymorphisms
- Distinguish between positive and negative reads
- Positive reads have at least one exact occurrence
in the reference sequence
- Negative reads have none
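The ideal read pool and the positive/negative distinction can be sketched as below. This is an illustration under assumptions the slides do not spell out: the reference is treated as a linear string (no wrap-around for a circular genome), and a read counts as positive if it occurs exactly on either strand. The function names and the toy reference are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read):
    """Return the reverse complement of an unambiguous DNA read."""
    return read.translate(COMPLEMENT)[::-1]


def ideal_read_pool(reference, read_length):
    """One error-free read starting at every possible position."""
    last_start = len(reference) - read_length
    return [reference[i:i + read_length] for i in range(last_start + 1)]


def is_positive(read, reference):
    """True if the read occurs exactly in the reference on either strand."""
    return read in reference or reverse_complement(read) in reference


reference = "ACGTTGCA"  # toy reference, not real S. aureus sequence
pool = ideal_read_pool(reference, read_length=4)
print(pool)
print(is_positive("AACG", reference), is_positive("ACGA", reference))
```

Every read in the ideal pool is positive by construction, so any dead end or bubble that remains in its graph has to come from the genome itself rather than from sequencing errors.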
4) Additional Edena Analyses (cont'd)
- Ideal dataset indicates branching nodes and
p-bubbles caused by genomic repetition
- Additional anomalies in real datasets are due
only to negative reads
- Due to small quantity of branching nodes in the
ideal dataset, branch removal procedure is
extremely effective
4) Additional Edena Analyses (cont'd)
- Though many p-bubbles consist of sequences made
of negative reads, most cannot be explained by
base calling errors
- Thought to correspond to underrepresented clonal
polymorphisms
4) Additional Edena Analyses (cont'd)
- Since there are no DE paths in the ideal dataset,
expect that DE removal should remove all DE paths
in real dataset (i.e. dead-ends correspond to
negative reads)
- From tests with different md values (below),
authors decided 10 was best
- Not so clear-cut to me
4) Additional Edena Analyses (cont'd)
- Most DE paths have length 1
- Correspond to paths created by base calling
errors
- Longer DE paths exist that do not appear to be
caused by such errors
- Thought to be clonal polymorphisms in low
abundance → can't form a complete p-bubble
4) Additional Edena Analyses (cont'd)
- Effective Coverage Depth
- Computed effective coverage depth according to
formula from Lander and Waterman
- E = N(L - T)/G
- N: Number of usable reads
- L: Read length
- T: Req. overlap length
- G: Genome size
- Can also estimate gaps in read coverage with
N * e^(-E)
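The coverage formulas are easy to try out numerically. The sketch below implements the raw over-sampling rate c = NL/G, the effective coverage depth E = N(L - T)/G and the gap estimate N * e^(-E) as written on the slides. The input numbers are hypothetical round values chosen for easy arithmetic, not the figures from the paper, and the function names are made up for the example.

```python
import math


def raw_coverage(n_reads, read_length, genome_size):
    """Over-sampling rate c = N * L / G."""
    return n_reads * read_length / genome_size


def effective_coverage(n_reads, read_length, min_overlap, genome_size):
    """Effective coverage depth E = N * (L - T) / G."""
    return n_reads * (read_length - min_overlap) / genome_size


def expected_gaps(n_reads, effective_depth):
    """Estimated number of coverage gaps, N * e^(-E)."""
    return n_reads * math.exp(-effective_depth)


# Hypothetical run: 1,000,000 usable 35-base reads, T = 21, 2 Mbp genome.
c = raw_coverage(1_000_000, 35, 2_000_000)
e = effective_coverage(1_000_000, 35, 21, 2_000_000)
gaps = expected_gaps(1_000_000, e)
print(c, e, round(gaps))
```

Requiring a 21-base overlap out of 35 leaves only 14 useful bases per read, so the effective depth is much lower than the raw depth, and the predicted number of gaps falls off exponentially as the effective depth grows.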
4) Additional Edena Analyses (cont'd)
- S. aureus sequencing
- Raw coverage depth: 48x
- Effective coverage depth: 14x
- H. acinonychis sequencing
- Raw coverage depth: 284x
- Effective coverage depth: 36x
- Statistics imply that there should be no gaps in
H. acinonychis assembly, and only a few in S.
aureus
- But each actual assembly contained several
hundred gaps
4) Additional Edena Analyses (cont'd)
- Statistics assume uniform read sampling
- Investigated underrepresented parts of genomes
- After alignment of reads to reference genome,
extracted low coverage sequences
- These sequences have complex motifs and single
base repeats → cause difficulty in replication
5) Conclusions
- Edena holds up well against other recent
assemblers, in both assembly quality and
computational resources
- Some assemblers are partially complementary to
each other (Edena and Velvet) → can use together
to produce results better than each individual
assembler's results
- Rise of NGS paired read data will help produce
longer contigs and clean up ambiguities
Is Edena The One?
- The One that will herald the beginning of
cost-effective whole genome assembly with NGS?
- Maybe you should ask the Oracle
That's all folks!
- Discussion Questions
- What were the strengths/weaknesses of Edena?
How would you improve it?
- How do you think Edena compares to the other
assemblers tested? Would you test it against
other assemblers not tested here?
- Given Edena's limitations, would you trust it for
de novo genome assembly over traditional sequence
assembly?
- Why did we have to discuss yet another NGS genome
assembler today?