Title: Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs
1. Velvet: Algorithms for De Novo Short Read Assembly Using De Bruijn Graphs
- March 12, 2008
- Daniel R. Zerbino and Ewan Birney
- Presenter: Seunghak Lee
2. What is a de Bruijn graph?
- A de Bruijn graph is a directed graph
- An edge represents an overlap between sequences of symbols
  - $V = (s_1, s_2, \ldots, s_m)$
  - $E = \{((v_1, v_2, \ldots, v_n), (w_1, w_2, \ldots, w_n)) : v_2 = w_1, v_3 = w_2, \ldots, v_n = w_{n-1}\}$
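
The edge condition above can be checked directly in code. The following is a minimal sketch, assuming vertices are represented as Python strings of equal length; the function names `has_edge` and `de_bruijn_edges` are illustrative only.

```python
def has_edge(v, w):
    """True if w follows v, i.e. v_2 = w_1, v_3 = w_2, ..., v_n = w_{n-1}."""
    return len(v) == len(w) and v[1:] == w[:-1]


def de_bruijn_edges(vertices):
    """List every ordered pair (v, w) of vertices joined by a directed edge."""
    return [(v, w) for v in vertices for w in vertices if has_edge(v, w)]


# All sequences of length 2 over the symbols {0, 1}.
vertices = ["00", "01", "10", "11"]
edges = de_bruijn_edges(vertices)
print(len(edges))  # 8: each vertex has two outgoing edges
```

The quadratic pairwise scan is only meant to mirror the definition; a practical implementation would look successors up by their shared (n-1)-symbol prefix instead.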
3. Introduction
- New sequencing technologies are commercially available (e.g. 454 Sequencing, Solexa)
  - 454 Sequencing: reads of 100 to 200 bp
  - Solexa: reads of 30 bp
- Algorithms for whole-genome shotgun (WGS) assembly are not suitable for short reads
  - An overlap graph with one node per read is extremely large
  - Short reads create more ambiguous connections in the assembly
4. Introduction (cont)
- The Euler assembler (Pevzner 2001) used k-mers as the nodes of a de Bruijn graph
  - Reads are mapped as paths through the de Bruijn graph
  - High redundancy does not affect the number of nodes
- Velvet deals effectively with experimental errors and repeats by using de Bruijn graphs built from k-mers
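
The point that redundancy does not inflate the graph can be illustrated with a small sketch. The toy genome, read length, and helper names below are assumptions made for the example, not data from the talk.

```python
def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def distinct_kmers(reads, k):
    """Set of k-mers seen in any read; one candidate node per distinct k-mer."""
    return {km for read in reads for km in kmers(read, k)}


genome = "ACGTTGCATGCC"                 # toy sequence (assumption)
reads = kmers(genome, 6)                # every 6 bp window used as an error-free read
low = distinct_kmers(reads, 4)
high = distinct_kmers(reads * 10, 4)    # ten times the coverage

print(len(reads), len(reads * 10))      # 7 70: an overlap graph needs one node per read
print(len(low), len(high))              # 9 9: the k-mer set is unchanged
```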
5. De Bruijn Graphs - structure
6. De Bruijn Graphs - structure (cont)
- Adjacent k-mers overlap by k-1 nucleotides
- Each node is attached to a twin node
  - The twin is the reverse series of reverse-complement k-mers
  - This accounts for overlaps between reads from opposite strands
- The union of a node and its twin node is called a block
- The last k-mer of an arc's origin node overlaps with the first k-mer of its destination node
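
As a rough illustration of the structure above, the sketch below builds the k-mer series for a node and derives its twin. The names `reverse_complement`, `kmers`, and `twin` are hypothetical helpers, and the sequence is a toy example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA string over the alphabet A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def twin(node_kmers):
    """Reverse series of reverse-complement k-mers."""
    return [reverse_complement(km) for km in reversed(node_kmers)]


node = kmers("ATGGC", 3)
twin_node = twin(node)
print(node)       # ['ATG', 'TGG', 'GGC']
print(twin_node)  # ['GCC', 'CCA', 'CAT']
```

The twin series is exactly the k-mer series of the reverse-complemented sequence, which is why a read from the opposite strand lands on the same block.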
7. De Bruijn Graphs - construction (cont)
- Construction
  - Reads are hashed with a predefined k-mer length
  - Small k → increased connectivity, but more ambiguous repeats
  - Large k → increased specificity, but decreased connectivity
  - Choose k by balancing sensitivity and specificity
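
The trade-off in choosing k can be seen on a toy sequence. This is only a sketch: the sequence and the helpers `branching_kmers` and `kmers_per_read` are assumptions for illustration. A k-mer with more than one distinct successor marks an ambiguous repeat, while the number of k-mers per read shows how connectivity shrinks as k approaches the read length.

```python
from collections import defaultdict


def branching_kmers(seq, k):
    """k-mers in seq that are followed by more than one distinct k-mer."""
    successors = defaultdict(set)
    for i in range(len(seq) - k):
        successors[seq[i:i + k]].add(seq[i + 1:i + k + 1])
    return sorted(km for km, nxt in successors.items() if len(nxt) > 1)


def kmers_per_read(read_length, k):
    """Number of k-mers a single read contributes."""
    return max(read_length - k + 1, 0)


genome = "TTACGGAACGGCC"  # toy sequence containing the repeat ACGG twice
print(branching_kmers(genome, 4))   # ['ACGG']: the repeat is ambiguous at k = 4
print(branching_kmers(genome, 5))   # []: a longer k-mer spans the repeat
print(kmers_per_read(35, 21), kmers_per_read(35, 31))  # 15 5
```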
8. De Bruijn Graphs - construction (cont)
- For each k-mer, the hash table records the ID of the first read in which it occurs and its position in that read
  - Each k-mer is recorded together with its reverse complement
- A node is created for each stretch of k-mers between distinct interruption points
- Reads are then traced through the graph
  - A directed arc is created where necessary
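
The hashing step might look roughly like the sketch below. It assumes that "recorded with its reverse complement" can be modeled by storing each k-mer under a canonical key (the lexicographically smaller of the k-mer and its reverse complement); that representation and the function names are assumptions of this example, not a description of Velvet's actual data structures.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """Single key shared by a k-mer and its reverse complement (assumption)."""
    return min(kmer, reverse_complement(kmer))


def build_kmer_table(reads, k):
    """Map each canonical k-mer to (ID of first read containing it, position)."""
    table = {}
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            # setdefault keeps the first occurrence and ignores later ones
            table.setdefault(canonical(read[pos:pos + k]), (read_id, pos))
    return table


table = build_kmer_table(["ACGTT", "AACGT"], 3)
print(table)  # {'ACG': (0, 0), 'AAC': (0, 2)}
```

Every k-mer of the second read was already seen in the first read (directly or as a reverse complement), so it adds no new entries.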
9. De Bruijn Graphs - simplification
- Simplify chains of blocks
  - No information is lost
- If node A has only one outgoing arc, pointing to node B, and node B has only one incoming arc, merge the two nodes
[Figure: nodes A and B]
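
The merge rule can be sketched as follows. This simplified version ignores twin nodes and assumes a graph given as two dictionaries keyed by node ID (one for sequences, one for successor sets); these structures and the name `simplify` are illustrative assumptions.

```python
def simplify(seqs, succ, k):
    """Merge A into A+B whenever A has one outgoing arc (to B) and B one incoming arc.

    seqs: node ID -> node sequence; succ: node ID -> set of successor IDs.
    Consecutive node sequences overlap by k-1 nucleotides.
    """
    seqs = dict(seqs)
    succ = {n: set(s) for n, s in succ.items()}
    merged = True
    while merged:
        merged = False
        indegree = {n: 0 for n in succ}
        for outs in succ.values():
            for b in outs:
                indegree[b] += 1
        for a in list(succ):
            if len(succ[a]) != 1:
                continue
            (b,) = succ[a]
            if b != a and indegree[b] == 1:
                seqs[a] += seqs[b][k - 1:]   # append only the non-overlapping part
                succ[a] = succ.pop(b)        # A inherits B's outgoing arcs
                del seqs[b]
                merged = True
                break                        # recompute in-degrees after each merge
    return seqs, succ


seqs = {1: "ACG", 2: "CGT", 3: "GTA", 4: "GTC"}
succ = {1: {2}, 2: {3, 4}, 3: set(), 4: set()}
new_seqs, new_succ = simplify(seqs, succ, k=3)
print(new_seqs)  # {1: 'ACGT', 3: 'GTA', 4: 'GTC'}
```

Nodes 1 and 2 are merged, while the branch to nodes 3 and 4 is kept, so the set of paths through the graph is unchanged.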
10. De Bruijn Graphs - error removal
- Velvet focuses on topological features of the graph
- First step: remove tips
  - Tip: a chain of nodes disconnected on one end
  - Two criteria are used: (1) length and (2) minority count
  - Length: remove a tip if it is shorter than 2k bp, since two nearby errors can create a tip of up to 2k bp
[Figure: a tip caused by two nearby errors, with two segments of length k]
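11. De Bruijn Graphs - error removal (cont)
- Minority count: the multiplicity m of the tip is lower than the multiplicity n of the alternative path (m < n)
- Starting from node B, going through the tip is an alternative to a more common path
[Figure: nodes A, B, and C with a tip; multiplicities m and n]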
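
The two tip criteria can be combined into a small predicate. This sketch assumes that both criteria must hold before a tip is removed, and that the multiplicities of the competing arcs at the junction are already known; the function name and arguments are hypothetical.

```python
def should_remove_tip(tip_length_bp, k, tip_multiplicity, other_multiplicities):
    """Decide whether a tip is removed (assumes both criteria are required).

    Length: the tip is shorter than 2k bp.
    Minority count: some other arc leaving the same junction has a higher
    multiplicity than the arc leading into the tip (m < n).
    """
    is_short = tip_length_bp < 2 * k
    is_minority = any(tip_multiplicity < n for n in other_multiplicities)
    return is_short and is_minority


print(should_remove_tip(40, 31, 1, [12]))   # True: short and less common
print(should_remove_tip(70, 31, 1, [12]))   # False: 70 bp is not below 2k = 62
print(should_remove_tip(40, 31, 12, [3]))   # False: the tip is the more common path
```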
12. De Bruijn Graphs - error removal (cont)
- Second step: remove bubbles using Tour Bus
  - Bubble: redundant paths that start and end at the same nodes
  - Bubbles are created by errors or by biological variants such as SNPs
[Figure: a bubble]
13. De Bruijn Graphs - error removal (cont)
Tour Bus
1. Detect redundant paths
2. Compare them using dynamic programming
3. If they are similar, merge them
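
Steps 2 and 3 can be sketched with a standard edit-distance computation. The similarity threshold `max_divergence`, the rule of keeping the higher-coverage path, and the (sequence, coverage) tuple representation are all assumptions of this example; the slides do not state the exact criteria used by Tour Bus.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def merge_if_similar(path_a, path_b, max_divergence=0.2):
    """Merge two redundant paths, given as (sequence, coverage), if they are similar.

    Returns the merged (sequence, coverage), or None if the paths differ too much.
    """
    (seq_a, cov_a), (seq_b, cov_b) = path_a, path_b
    divergence = edit_distance(seq_a, seq_b) / max(len(seq_a), len(seq_b))
    if divergence > max_divergence:
        return None
    kept = seq_a if cov_a >= cov_b else seq_b   # assumption: keep the better-covered path
    return kept, cov_a + cov_b


print(merge_if_similar(("ACGTACGT", 20), ("ACGAACGT", 2)))  # ('ACGTACGT', 22)
print(merge_if_similar(("AAAAAAAA", 5), ("CCCCCCCC", 5)))   # None
```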
14. De Bruijn Graphs - error removal (cont)
- Third step: remove erroneous connections
  - Erroneous connections are removed after the Tour Bus algorithm
  - They are removed with a basic coverage cutoff
  - Genuine short nodes that cannot be simplified in the graph should have high coverage
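
A basic coverage cutoff could be applied along the lines of the sketch below. The dictionary-based graph, the name `apply_coverage_cutoff`, and the cutoff value are assumptions for illustration.

```python
def apply_coverage_cutoff(coverage, succ, cutoff):
    """Drop nodes whose coverage is below cutoff, along with their arcs.

    coverage: node -> coverage; succ: node -> set of successor nodes.
    """
    kept = {node for node, cov in coverage.items() if cov >= cutoff}
    return {node: {m for m in succ.get(node, ()) if m in kept} for node in kept}


coverage = {"A": 30, "B": 2, "C": 28}
succ = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
print(apply_coverage_cutoff(coverage, succ, cutoff=5))
```

Here the low-coverage node B and both arcs touching it disappear, leaving only the well-supported connection from A to C.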
15. Breadcrumb - resolution of repeats
- Using read pairs, pair up the long nodes
- Flag paired reads using unambiguous long nodes
[Figure: unambiguous long nodes]
17. Breadcrumb - resolution of repeats
- Extend the nodes as far as possible using the flagged paired reads
  - All nodes between A and B are paired up to either A or B
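
The first Breadcrumb step, pairing up long nodes with read pairs, might be sketched as follows. This is a loose illustration under several assumptions: each read pair has already been mapped to the two nodes its mates fall in, a node counts as "long" above a hypothetical length threshold `min_long`, and a hypothetical `min_pairs` supporting pairs are required. It does not reproduce Velvet's actual implementation.

```python
from collections import Counter


def pair_long_nodes(read_pairs, node_length, min_long, min_pairs=1):
    """Return the pairs of long nodes linked by at least min_pairs read pairs.

    read_pairs: iterable of (node of mate 1, node of mate 2).
    node_length: node -> length in bp.
    """
    links = Counter()
    for node_a, node_b in read_pairs:
        both_long = (node_length[node_a] >= min_long
                     and node_length[node_b] >= min_long)
        if node_a != node_b and both_long:
            links[frozenset((node_a, node_b))] += 1
    return {pair for pair, count in links.items() if count >= min_pairs}


read_pairs = [("A", "B"), ("A", "B"), ("A", "x"), ("B", "A")]
node_length = {"A": 500, "B": 400, "x": 20}
paired = pair_long_nodes(read_pairs, node_length, min_long=100, min_pairs=2)
print(paired == {frozenset({"A", "B"})})  # True
```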
18. Experimental Results
- Test the error removal pipeline on simulated data
  - Simulated reads are from E. coli, S. cerevisiae, C. elegans, and H. sapiens
- Coverage density vs. N50 for H. sapiens
  - N50 is limited by the natural repetition of the reference genome
[Figure: N50 vs. coverage density for H. sapiens; curves: Ideal, Error (1), SNP]
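
Since the results are reported as N50, a short sketch of the usual definition may help: N50 is the length of the contig at which, going from longest to shortest, the running total first reaches half of the total assembled length. The contig lengths below are made up for the example.

```python
def n50(lengths):
    """N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


print(n50([80, 70, 50, 40, 30, 20]))  # 70: 80 + 70 = 150 covers half of 290
```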
19. Experimental Results (cont)
- Test the error removal pipeline on experimental data
  - A 173,428 bp human BAC was sequenced using Solexa machines
  - Reads were 35 bp long, and k = 31
- Tour Bus increased sensitivity by correcting errors while preserving the integrity of the graph structure
20. Experimental Results (cont)
21. Experimental Results (cont)
22. Conclusions
- Velvet is a de Bruijn graph-based sequence assembly method for short reads
- Errors are handled by removing tips and by the Tour Bus algorithm
- A large number of repeats are resolved by the Breadcrumb algorithm
- Velvet was assessed on simulated and real datasets and performed well