De novo bacterial genome sequencing: millions of very short reads assembled on a desktop computer - PowerPoint PPT Presentation

Slides: 49

Provided by: lucaslo

Learn more at: http://www.cs.toronto.edu

Transcript and Presenter's Notes

1
De novo bacterial genome sequencing: millions of
very short reads assembled on a desktop computer

David Hernandez, Patrice François, Laurent
Farinelli, Magne Østerås, Jacques Schrenzel

Presented by Lucas Lochovsky
2
Outline

Introduction
Edena's Methodology
Reducing Read Redundancy
Overlap Graph Construction
Transitive Edge Reduction
Graph Cleanup
Contig Production
Results
Assemblers
Assembly tasks
Additional Edena Analyses
Graph Cleaning Effectiveness
Effective Coverage Depth
Conclusions

4
1) Introduction

NGS will allow us to explore strange new genomes,
blah blah blah.
WGS assemblers we've covered so far:
Medvedev-Brudno assembler
Arachne
AMOS-Cmp
Velvet
ALLPATHS
Think you've seen it all?

5
1) Introduction (cont'd)

Edena: De novo short read assembler
Uses a classic overlap graph approach to assembly
Anyone else get a feeling of déjà vu?
Compare to other recently published NGS read
assemblers
De novo assembly of two bacterial genomes
sequenced with the Illumina/Solexa platform

7
2) Edena's Methodology

Built around a standard overlap-layout-consensus
workflow
Opted to use exact matching for overlap detection
Reduces the number of spurious overlaps
Faster than using approximate matching
Also assume that all reads have the same length
Is this assumption valid?

8
2) Edena's Methodology (cont'd)

Four major steps
Remove redundant reads so that dataset size is
more manageable
Overlap detection and overlap graph construction
Graph cleaning: simplification and ambiguity
resolution
Produce contigs

9
2) Edena's Methodology (cont'd)

1) Practice your 3 R's: Reducing Read Redundancy
Illumina Genome Analyzer has a high amount of
over-sampling → many redundant reads
Reduce dataset so it contains only a single copy
of each read → non-redundant
Index all reads into a prefix tree
Identical reads will be mapped to the same key →
no duplicate reads in this structure

10
2) Edena's Methodology (cont'd)

Prefix trees are associative arrays for strings
where all descendants of a node have a common
prefix
Reads and their reverse complements are
considered the same read → merged into the same
tree key

11
2) Edena's Methodology (cont'd)

Ambiguous reads discarded, since they won't work
with exact matching
Opens up possibility of coverage gaps in read
data (not explored by the authors)
Original read data still useful for getting read
frequencies
Contig coverage depth
Repeat identification
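
The redundancy-reduction step above can be sketched in a few lines. This is a minimal illustration, not Edena's implementation: a plain dictionary stands in for the prefix tree, the "canonical" key (the lexicographically smaller of a read and its reverse complement) is an assumed convention, and the function names are hypothetical. Reads containing anything other than A, C, G, T are treated as ambiguous and dropped, and the counts keep the original read frequencies.

```python
from collections import Counter

# Translation table for complementing unambiguous DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of an unambiguous DNA read."""
    return read.translate(_COMPLEMENT)[::-1]


def canonical(read: str) -> str:
    """Pick one representative for a read and its reverse complement.

    Assumption: the lexicographically smaller string is used as the key.
    """
    return min(read, reverse_complement(read))


def deduplicate(reads):
    """Map each distinct (strand-independent) read to its frequency.

    A dict (Counter) stands in for the prefix tree from the slides.
    Reads with ambiguous bases (e.g. N) are discarded.
    """
    counts = Counter()
    for read in reads:
        read = read.upper()
        if set(read) - set("ACGT"):
            continue  # ambiguous read: cannot be used with exact matching
        counts[canonical(read)] += 1
    return counts
```

For example, `deduplicate(["AAC", "GTT", "AAC", "ANC", "CCC"])` keeps two keys: `AAC` (seen three times, since `GTT` is its reverse complement) and `CCC` (seen once).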

12
2) Edena's Methodology (cont'd)

2) Overlap Graph Construction
Non-redundant read dataset is indexed by a suffix
array
Déjà vu moment: Almost exactly like suffix trees
from MUMmer/MUMmerGPU!
Information used to produce a bidirected overlap
graph
Déjà vu moment: Just like the Medvedev-Brudno
assembler! (which I presented!)

13
2) Edena's Methodology (cont'd)

This slide should be review for all of you!
Bidirected graphs are kind of like directed
graphs, except each edge has an orientation on
each of its ends
Gives rise to three types of edges
Edges where one arrow points out of a vertex, and
one arrow points into a vertex
Edges with both arrows pointing out, and
Edges with both arrows pointing in (easiest one
to do in PowerPoint!)
For a walk in a bidirected graph, for each vertex
on that walk, the orientation of the edge
entering the vertex must be opposite that of the
edge leaving the vertex
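
The walk rule above can be checked mechanically. The sketch below is only an illustration under assumed conventions: each edge end carries an arrow labelled "in" (pointing into the vertex) or "out" (pointing away from it), the `BiEdge` type and `is_valid_walk` function are hypothetical names, and self-loops are not handled.

```python
from typing import NamedTuple


class BiEdge(NamedTuple):
    """A bidirected edge: one arrow ("in" or "out") at each end."""
    u: str
    u_arrow: str  # orientation of the arrow at u's end
    v: str
    v_arrow: str  # orientation of the arrow at v's end


def arrow_at(edge: BiEdge, vertex: str) -> str:
    """Return the arrow orientation that `edge` has at `vertex`."""
    if edge.u == vertex:
        return edge.u_arrow
    if edge.v == vertex:
        return edge.v_arrow
    raise ValueError(f"{vertex} is not an endpoint of {edge}")


def is_valid_walk(vertices, edges) -> bool:
    """Check a walk v0, e0, v1, e1, ..., vk in a bidirected graph.

    edges[i] must join vertices[i] and vertices[i + 1]; at every interior
    vertex the entering and leaving edges must have opposite arrows.
    """
    if len(edges) != len(vertices) - 1:
        return False
    for i, edge in enumerate(edges):
        if {edge.u, edge.v} != {vertices[i], vertices[i + 1]}:
            return False
    for i in range(1, len(vertices) - 1):
        entering = arrow_at(edges[i - 1], vertices[i])
        leaving = arrow_at(edges[i], vertices[i])
        if entering == leaving:
            return False
    return True
```

With `e1 = BiEdge("a", "out", "b", "in")` and `e2 = BiEdge("b", "out", "c", "in")`, the walk a, b, c is valid; replacing `e2` with an edge whose arrow at b is also "in" makes it invalid.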

14
2) Edena's Methodology (cont'd)

More review!
In a bidirected overlap graph, each vertex is a
double-stranded read
Edges represent read overlaps
Three possible ways that two double-stranded
reads can overlap (corresponds to the three types
of edges)
Suppose we have two ds reads r1 and r2
Each read can be oriented to the left or to the
right
The three possible overlaps are
i) Both strands point in the same direction (both
reads can point left, or both can point right,
it's the same overlap either way)
ii) r1 points left and r2 points right
iii) r1 points right and r2 points left

15
2) Edena's Methodology (cont'd)

Parameter: Minimum overlap size
Sensitivity vs. specificity tradeoff
Small value: Higher frequency of chance overlaps
→ causes path branching in graph (sensitivity
favoured)
Large value: Creates more dead-end (DE) paths,
i.e. reads not extended by overlapping reads on
one side (specificity favoured)
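
To make the role of the minimum overlap size concrete, here is a small sketch of exact suffix-prefix overlap detection. It is a brute-force stand-in, not the suffix-array index described on the earlier slide, and the function name and the choice to ignore full-length (identical read) overlaps are assumptions made for illustration.

```python
def exact_overlap(a: str, b: str, min_overlap: int) -> int:
    """Length of the longest suffix of `a` that exactly equals a prefix of `b`.

    Only overlaps of at least `min_overlap` bases count; 0 means "no overlap".
    Overlaps covering a whole read are skipped, on the assumption that
    duplicate reads were already merged.
    """
    longest_allowed = min(len(a), len(b)) - 1
    for length in range(longest_allowed, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0
```

For instance, `exact_overlap("ACGTAC", "GTACCA", 3)` finds the 4-base overlap `GTAC`, while raising the threshold to 5 rejects it and leaves the first read as a potential dead end.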

16
2) Edena's Methodology (cont'd)

3a) Transitive Edge Reduction
Simplifies the graph by removing nonessential
edges
Generally speaking, given the paths v1 → v2 → v3
and v1 → v3, the edge v1 → v3 is transitive: it
skips v2 while representing the same sequence, so
it can be removed
Reduces graph complexity by the over-sampling
rate c = NL/G
N = Number of reads
L = Read length
G = Genome size

17
2) Edena's Methodology (cont'd)

For sequences, it's about removing the overlap
between two reads when a third read lying between
them overlaps each of the two to a greater extent
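
Transitive edge reduction, in the sense used for overlap graphs, can be sketched on a plain directed graph. This is a simplified one-step check (drop u → w whenever some v has both u → v and v → w), assuming an acyclic graph stored as a dict of successor sets. It ignores strand orientation and overlap lengths, and the function name is hypothetical.

```python
def transitive_reduction(graph):
    """Remove edges u -> w that are implied by a two-step path u -> v -> w.

    `graph` maps each node to a set of successors. The input is not
    modified; a reduced copy is returned. Assumes no cycles or self-loops.
    """
    reduced = {u: set(successors) for u, successors in graph.items()}
    for u, successors in graph.items():
        for v in successors:
            for w in graph.get(v, ()):
                # u -> w is redundant: the same sequence is spelled via v.
                reduced[u].discard(w)
    return reduced
```

Given `{"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}`, the edge r1 → r3 is dropped and the chain r1 → r2 → r3 remains.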

18
2) Edena's Methodology (cont'd)

3b) Graph Cleanup
Can have multiple paths branching off a single
node (branching paths)
Due to genomic repetitions, sequencing errors,
and clonal polymorphisms
Genomic repetitions cannot be fixed without
additional information
But the other two can be resolved

19
2) Edena's Methodology (cont'd)

Sequencing errors produce short dead-end (DE)
paths
Attempt to elongate branching nodes up to a
certain depth md (minimum depth)
Reads that cannot be extended to a depth of md
are removed
Experimentally determined that md = 10 is the best
value
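
A rough sketch of dead-end removal follows. It is not Edena's code: it works on a simplified directed graph (dict of successor sets), only looks in the forward direction, and cuts the edge into a short branch rather than deleting the reads themselves. The helper names and the rule "keep everything if every branch is short" are assumptions made for the example.

```python
def extension_depth(graph, start, limit):
    """Longest path (counted in nodes) reachable from `start`, capped at `limit`."""
    best = 0
    stack = [(start, 1)]
    while stack:
        node, depth = stack.pop()
        best = max(best, depth)
        if best >= limit:
            return limit  # deep enough; no need to explore further
        for successor in graph.get(node, ()):
            stack.append((successor, depth + 1))
    return best


def remove_dead_ends(graph, md):
    """Cut branches that cannot be extended to depth `md` at branching nodes.

    Returns a cleaned copy of `graph`. If all branches at a node are
    short, none is cut (an assumption, to avoid isolating the node).
    """
    cleaned = {u: set(successors) for u, successors in graph.items()}
    for u, successors in graph.items():
        if len(successors) < 2:
            continue  # not a branching node
        short = {v for v in successors if extension_depth(graph, v, md) < md}
        if short != set(successors):
            cleaned[u] -= short
    return cleaned
```

With `graph = {"a": {"b", "x"}, "b": {"c"}, "c": {"d"}, "d": set(), "x": set()}` and `md = 3`, the one-node branch through `x` is cut while the branch through `b` survives.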

20
2) Edena's Methodology (cont'd)
21
2) Edena's Methodology (cont'd)

Also disambiguate bubbles in the graph caused by
single base substitutions (aka p-bubbles)
Length of p-bubble is at most ms = 4L - 2T - 1
L = Read length
T = Min. overlap size
Explore each branching path up to length ms
(guaranteed upper bound)
Remove path with less coverage
Polymorphisms can be retained for later analysis

22
2) Edena's Methodology (cont'd)
23
2) Edena's Methodology (cont'd)

4) Contig Production
If run in strict mode, Edena starts generating
contig sequences
In non-strict mode, one more cleaning step is
performed
Longer overlaps more reliable than shorter ones
Save only edges at branching nodes that have the
highest overlap of all edges
Produce contig sequence by following
non-intersecting simple paths in overlap graph
Nodes must have in-degree and out-degree of
exactly one
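
Contig production along simple paths can be illustrated as below. This is a sketch under simplifying assumptions: a directed graph whose nodes are the read strings themselves, every read present as a key, a single strand, and stored overlap lengths. The name `build_contigs` is hypothetical. A path is extended only while the current node has exactly one successor and that successor has exactly one predecessor.

```python
def build_contigs(overlaps):
    """Spell contigs by following unbranched paths in an overlap graph.

    `overlaps` maps each read to {successor_read: overlap_length}.
    Every read must appear as a key. Isolated perfect cycles are not
    reported by this sketch.
    """
    in_degree = {read: 0 for read in overlaps}
    for successors in overlaps.values():
        for successor in successors:
            in_degree[successor] += 1

    def unique_next(read):
        """Return (successor, overlap) if the step read -> successor is unambiguous."""
        if len(overlaps[read]) != 1:
            return None
        successor, overlap = next(iter(overlaps[read].items()))
        return (successor, overlap) if in_degree[successor] == 1 else None

    # Reads reached by an unambiguous step are interior to some contig.
    interior = set()
    for read in overlaps:
        step = unique_next(read)
        if step is not None:
            interior.add(step[0])

    contigs = []
    for start in overlaps:
        if start in interior:
            continue
        sequence, node = start, start
        while (step := unique_next(node)) is not None:
            successor, overlap = step
            sequence += successor[overlap:]  # append only the non-overlapping tail
            node = successor
        contigs.append(sequence)
    return contigs
```

For the chain `ACGTA` → `GTACC` → `ACCTT` with 3-base overlaps, the result is the single contig `ACGTACCTT`; at a branching node each read ends up in its own contig.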

25
3) Results

Survivor: WGS Assembly
Four assemblers
Two challenges
One winner

26
3) Results (cont'd)

Contestant 1: SSAKE
Indexes reads in a prefix tree based upon first
eleven 5' bases
Identify highest possible overlap between pairs
of reads
Use most highly-covered reads as starting points
for read extension (i.e. assembly nucleation
points)
So far only used for partial genome sequencing
for comparative metagenomic analysis (e.g.
bacterial species distinction)

27
3) Results (cont'd)

Contestant 2: Velvet
k-mer/q-gram/k-gram/q-mer de Bruijn graph
representation of reads
Contestant 3: SHARCGS
Can accept base quality scores along with read
data for read filtering (low quality reads
discarded)
Also filter out reads with low coverage
Assembly performed with a prefix tree
Contestant 4: Edena

28
3) Results (cont'd)

Reward Challenge:
Assemble the 2.82 Mbp genome sequence and the
20.7 Kbp plasmid sequence of the Staphylococcus
aureus MW2 strain from Illumina reads
Immunity Challenge:
Assemble 1.55 Mbp genome sequence and the 3.66
Kbp plasmid sequence of the Helicobacter
acinonychis Sheeba strain from Illumina reads

29
3) Results (cont'd)

Staphylococcus aureus results
Evaluated each assembler on the parameter
configurations that produced the best results
Edena: Min. overlap size = 21 bases
Velvet: k-mer value = 23
SHARCGS: Max. gap span = 14
SSAKE: Default parameters

30
3) Results (cont'd)

Compared contig assembly to published reference
sequence
Non-strict mode tends to produce longer contigs
at the expense of additional misassemblies
Velvet comparable to Edena strict

31
3) Results (cont'd)

SHARCGS unable to assemble significant contigs →
insufficient coverage depth
SSAKE produced a large number of mismatches
mostly at contig boundaries

32
3) Results (cont'd)

Authors also tried combining contig results from
Edena and Velvet due to significant overlaps
between their contigs
N50 and mean contig size increased relative to
original results
Edena non-strict has similar influence on results
as previously
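
Since N50 is used here to compare assemblies, a short sketch of how it is commonly computed may help: sort contig lengths from longest to shortest and report the length at which the running total first reaches half of the total assembled bases. The function name is illustrative, and the returned value for an empty input (0) is an assumed convention.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (0 if empty)."""
    ordered = sorted(contig_lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0
```

For lengths 2 through 10 (total 54), the three longest contigs (10, 9, 8) already sum to 27, so the N50 is 8.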

33
3) Results (cont'd)

Helicobacter acinonychis results
Best parameter settings
Edena: Min. overlap size = 27 (strict), 26
(non-strict)
Velvet: k-mer value = 27
SHARCGS: Max. gap span = 10 (also must remove last
four bases from each read)
SSAKE: Default parameters

34
3) Results (cont'd)

Results similar to those from the previous
assembly challenge

35
3) Results (cont'd)

Survivor: WGS Assembly Conclusion
Granted Immunity: Edena, Velvet
Sent to the Tribal Council: SSAKE, SHARCGS

37
4) Additional Edena Analyses

Graph Cleaning Effectiveness
Demonstrate the effectiveness of DE path removal
and p-bubble fixing
Created an ideal read pool from the S. aureus MW2
strain
Consists of one read at every possible position
No errors
No polymorphisms
Distinguish between positive and negative reads
Positive reads have at least one exact occurrence
in the reference sequence
Negative reads have none
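
The ideal read pool and the positive/negative distinction can be sketched as follows. This is an illustration only: the reference is treated as a linear string (circularity of the bacterial chromosome is ignored), a read counts as positive if it or its reverse complement occurs exactly in the reference, and all function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(read: str) -> str:
    """Return the reverse complement of an unambiguous DNA read."""
    return read.translate(_COMPLEMENT)[::-1]


def ideal_read_pool(reference: str, read_length: int):
    """One error-free read starting at every possible position."""
    last_start = len(reference) - read_length
    return [reference[i:i + read_length] for i in range(last_start + 1)]


def is_positive(read: str, reference: str) -> bool:
    """True if the read occurs exactly in the reference on either strand."""
    return read in reference or reverse_complement(read) in reference
```

With the toy reference `ACGTTGCA`, the read `CAAC` is positive (its reverse complement `GTTG` occurs in the reference), whereas `AAAA` is negative.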

38
4) Additional Edena Analyses (cont'd)

Ideal dataset indicates branching nodes and
p-bubbles caused by genomic repetition
Anomalies in real datasets only due to negative
reads
Due to small quantity of branching nodes in the
ideal dataset, branch removal procedure is
extremely effective

39
4) Additional Edena Analyses (cont'd)

Though many p-bubbles consist of sequences made
of negative reads, most cannot be explained by
base calling errors
Thought to correspond to underrepresented clonal
polymorphisms

40
4) Additional Edena Analyses (cont'd)

Since there are no DE paths in the ideal dataset,
expect that DE removal should remove all DE paths
in real dataset (i.e. dead-ends correspond to
negative reads)
From tests with different md values (below),
authors decided 10 was best
Not so clear-cut to me

41
4) Additional Edena Analyses (cont'd)

Most DE paths have length 1
Correspond to paths created by base calling
errors
Longer DE paths exist that do not appear to be
caused by such errors
Thought to be clonal polymorphisms in low
abundance → can't form a complete p-bubble

42
4) Additional Edena Analyses (cont'd)

Effective Coverage Depth
Computed effective coverage depth according to
formula from Lander and Waterman
E = N(L - T)/G
N = # of usable reads
L = Read length
T = Req. overlap length
G = Genome size
Can also estimate gaps in read coverage with
N * e^(-E)
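
The coverage formulas can be turned into a short calculation. The sketch below assumes the Lander-Waterman forms: raw coverage c = NL/G, effective coverage E = N(L - T)/G, and an expected gap count of N * e^(-E). The function names are hypothetical, and any numbers plugged in are illustrative rather than taken from the paper.

```python
import math


def raw_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Over-sampling rate c = N * L / G."""
    return n_reads * read_length / genome_size


def effective_coverage(n_reads: int, read_length: int,
                       min_overlap: int, genome_size: int) -> float:
    """Effective coverage E = N * (L - T) / G."""
    return n_reads * (read_length - min_overlap) / genome_size


def expected_gaps(n_reads: int, read_length: int,
                  min_overlap: int, genome_size: int) -> float:
    """Expected number of coverage gaps, N * exp(-E), assuming uniform sampling."""
    e = effective_coverage(n_reads, read_length, min_overlap, genome_size)
    return n_reads * math.exp(-e)
```

As a made-up example, 1,000 reads of length 35 with a 21-base required overlap on a 1,000-base genome give a raw coverage of 35x but an effective coverage of only 14x, which shows how strongly the overlap requirement reduces usable depth.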

43
4) Additional Edena Analyses (cont'd)

S. aureus sequencing
Raw coverage depth: 48x
Effective coverage depth: 14x
H. acinonychis sequencing
Raw coverage depth: 284x
Effective coverage depth: 36x
Statistics imply that there should be no gaps in
H. acinonychis assembly, and only a few in S.
aureus
But each actual assembly contained several
hundred gaps

44
4) Additional Edena Analyses (cont'd)

Statistics assume uniform read sampling
Investigated underrepresented parts of genomes
After alignment of reads to reference genome,
extracted low coverage sequences
These sequences have complex motifs and single
base repeats → cause difficulty in replication
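
Finding underrepresented regions from aligned reads can be sketched as a per-base coverage scan. This is an assumed workflow, not the authors' procedure: reads are given as 0-based start positions on a linear reference, all reads share one length, and the threshold and function names are illustrative.

```python
def per_base_coverage(genome_length, read_starts, read_length):
    """Coverage at each reference position, via a difference array."""
    diff = [0] * (genome_length + 1)
    for start in read_starts:
        diff[start] += 1
        diff[min(start + read_length, genome_length)] -= 1
    coverage, running = [], 0
    for delta in diff[:genome_length]:
        running += delta
        coverage.append(running)
    return coverage


def low_coverage_regions(coverage, threshold):
    """Half-open intervals (start, end) where coverage is below `threshold`."""
    regions, region_start = [], None
    for position, depth in enumerate(coverage):
        if depth < threshold:
            if region_start is None:
                region_start = position
        elif region_start is not None:
            regions.append((region_start, position))
            region_start = None
    if region_start is not None:
        regions.append((region_start, len(coverage)))
    return regions
```

The sequences at the reported intervals can then be pulled from the reference and inspected for motifs or single-base repeats.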

46
5) Conclusions

Edena holds up well against other recent
assemblers, in both assembly quality and
computational resources
Some assemblers are partially complementary to
each other (Edena and Velvet) → can use together
to produce results better than each individual
assembler's results
Rise of NGS paired read data will help produce
longer contigs and clean up ambiguities

47
Is Edena The One?

The One that will herald the beginning of cost-
effective whole genome assembly with NGS?
Maybe you should ask the Oracle

48
That's all folks!

Discussion Questions
What were the strengths/weaknesses of Edena?
How would you improve it?
How do you think Edena compares to the other
assemblers tested? Would you test it against
other assemblers not tested here?
Given Edena's limitations, would you trust it for
de novo genome assembly over traditional sequence
assembly?
Why did we have to discuss yet another NGS genome
assembler today?