Title: CompostBin: A DNA composition-based metagenomic binning algorithm
1CompostBin: A DNA composition-based metagenomic binning algorithm
- Sourav Chatterji, Ichitaro Yamazaki, Zhaojun Bai
and Jonathan Eisen - UC Davis
- schatterji_at_ucdavis.edu
2Overview of Talk
- Metagenomics and the binning problem.
- CompostBin
3The Microbial World
4Exploring the Microbial World
- Culturing
- Majority of microbes currently unculturable.
- No ecological context.
- Molecular Surveys (e.g. 16S rRNA)
- who is out there?
- what are they doing?
5Metagenomics
6Interpreting Metagenomic Data
- Nature of Metagenomic Data
- Mosaic
- Intraspecies polymorphism
- Fragmentary
- New Sequencing Technologies
- Enormous amount of data
- Short Reads
7Metagenomic Binning
Classification of sequences by taxa
8Why Bin at all?
9Binning in Action
- Glassy Winged Sharpshooter (Homalodisca coagulata).
- Feeds on plant xylem (poor in organic nutrients).
- Microbial Endosymbionts
11Current Binning Methods
- Assembly
- Align with Reference Genome
- Database Search: MEGAN, BLAST
- Phylogenetic Analysis
- DNA Composition: TETRA, PhyloPythia
12Current Binning Methods
- Need closely related reference genomes.
- Poor performance on short fragments.
- Sanger sequence reads 500-1000 bp long.
- Current assembly methods unreliable
- Complex Communities Hard to Bin.
13Overview of Talk
- Metagenomics and the binning problem.
- CompostBin
14Genome Signatures
- Does genomic sequence from an organism have a
unique signature that distinguishes it from
genomic sequence of other organisms?
- Yes (Karlin et al., 1990s)
- What is the minimum length sequence that is
required to distinguish genomic sequence of one
organism from the genomic sequence of another
organism?
15Imperfect World
- Horizontal Gene Transfer
- Recent Estimates (Ge et al. 2005)
- Varies between 0-6% of genes.
- Typically 2%.
- But
- Amelioration
16DNA-composition metrics
The K-mer Frequency Metric
CompostBin uses hexamers
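As an illustration of the K-mer frequency metric, here is a minimal Python sketch that turns a DNA sequence into a normalized hexamer frequency vector. The function name `kmer_frequencies`, the lexicographic ordering of K-mers, and the choice to skip windows containing ambiguous bases are assumptions made for this example; the slides do not specify these details, and strand (reverse-complement) handling is left out.

```python
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the normalized k-mer frequency vector of a DNA sequence."""
    # Fixed lexicographic ordering of all 4**k k-mers -> vector index.
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * len(index)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if kmer in index:
            counts[index[kmer]] += 1
    total = sum(counts)
    # Normalize so that reads of different lengths are comparable.
    return [c / total for c in counts] if total else counts


freqs = kmer_frequencies("ACGTACGTACGT")  # default k=6 (hexamers)
```

Each read becomes one point in a 4096-dimensional space, which is the input the later projection and clustering steps operate on.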
17DNA-composition metrics
- Working with K-mers for Binning.
- Curse of Dimensionality: O(4^K) independent
dimensions.
- Statistical noise increases with decreasing
fragment lengths.
- Project data into a lower dimensional space to
decrease noise.
- Principal Component Analysis.
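The dimensionality argument can be made concrete with a little arithmetic. The sketch below compares the number of K-mer dimensions with the number of K-mer windows available in a single read; the 1000 bp read length is taken from the upper end of the Sanger range mentioned earlier, and the helper names are hypothetical.

```python
def kmer_dimensions(k):
    """Number of distinct k-mers over the alphabet {A, C, G, T}."""
    return 4 ** k


def kmer_windows(read_length, k):
    """Number of overlapping k-mer windows in a read of the given length."""
    return max(read_length - k + 1, 0)


for k in (2, 4, 6):
    dims = kmer_dimensions(k)
    windows = kmer_windows(1000, k)
    print(f"K={k}: {dims} dimensions, {windows} windows, "
          f"{windows / dims:.2f} observations per dimension")
```

For hexamers, a 1000 bp read supplies 995 windows spread over 4096 dimensions, so most entries of the frequency vector are estimated from zero or one observation. This is the noise that the projection to a lower dimensional space is meant to reduce.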
18PCA separates species
Gluconobacter oxydans (61% GC) and Rhodospirillum
rubrum (65% GC)
19Effect of Skewed Relative Abundance
Abundance 20:1
Abundance 1:1
B. anthracis and L. monocytogenes
20A Weighting Scheme
For each read, find overlap with other sequences
21A Weighting Scheme
[Figure: example per-position redundancy values 4, 5, 5, 3]
Calculate the redundancy of each position.
Weight is inverse of average redundancy.
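The weighting scheme can be sketched in a few lines of Python. This is only an illustration under stated assumptions: overlaps are given as half-open `(start, end)` intervals in the read's own coordinates, the read itself counts once toward the redundancy of each of its positions, and the function names are hypothetical. How overlaps are detected is not covered here.

```python
def position_redundancy(read_length, overlaps):
    """Count how many sequences cover each position of a read.

    overlaps: list of (start, end) half-open intervals, in read
    coordinates, covered by other sequences.
    """
    # Assumption: the read itself contributes 1 to every position.
    redundancy = [1] * read_length
    for start, end in overlaps:
        for pos in range(max(start, 0), min(end, read_length)):
            redundancy[pos] += 1
    return redundancy


def read_weight(redundancy):
    """Weight of a read: the inverse of its average redundancy."""
    return len(redundancy) / sum(redundancy)


example = position_redundancy(5, [(0, 3), (2, 5)])  # [2, 2, 3, 2, 2]
weight = read_weight([4, 5, 5, 3])                  # 4 / 17
```

Reads from an abundant species overlap many other reads, get a high average redundancy, and are therefore down-weighted, which counteracts the skew in relative abundance.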
22Weighted PCA
- Calculate weighted mean µw
- Calculate weighted covariance matrix Mw
- PCs are eigenvectors of Mw.
- Use first three PCs for further analysis.
$$\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$$
23Weighted PCA
- Calculate weighted mean µw
- Calculate weighted covariance matrix Mw
- PCs are eigenvectors of Mw.
- Use first three PCs for further analysis.
24Weighted PCA
- Calculate weighted mean µw
- Calculate weighted covariance matrix Mw
- Principal Components are eigenvectors of Mw.
- Use first three PCs for further analysis.
$$\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$$
$$M_w = \sum_{i=1}^{N} w_i (X_i - \mu_w)(X_i - \mu_w)^T$$
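A minimal NumPy sketch of weighted PCA along these lines: the weighted mean µw is the weighted sum of the X_i divided by N, the weighted covariance matrix Mw is the weighted sum of the outer products of the centered X_i, and the first three eigenvectors of Mw define the projection. Rescaling the weights so that they sum to N is an assumption added here so that µw is a proper weighted average; the function name and the random test data are hypothetical.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project the rows of X onto the top principal components of M_w."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    n = X.shape[0]
    # Assumption: rescale weights to sum to N.
    w = w * (n / w.sum())
    # Weighted mean: mu_w = (1/N) * sum_i w_i X_i
    mu_w = (w[:, None] * X).sum(axis=0) / n
    centered = X - mu_w
    # Weighted covariance: M_w = sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components, mu_w


rng = np.random.default_rng(0)
X = rng.random((50, 10))   # 50 toy "reads", 10 features each
w = rng.random(50) + 0.1   # toy positive weights
projected, pcs, mu = weighted_pca(X, w)
```

With all weights equal, this reduces to ordinary PCA up to a constant scaling of the covariance matrix, which does not change the eigenvectors.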
25Weighted PCA separates species
[Figure panels: PCA vs. Weighted PCA for B. anthracis and L. monocytogenes at 20:1 abundance]
26Unsupervised Classification?
27Semi-Supervised Classification
- 31 Marker Genes (courtesy of Martin Wu)
- Omnipresent
- Relatively Immune to Lateral Gene Transfer
- Reads containing these marker genes can be
classified with high reliability.
28Semi-supervised Classification
Use a semi-supervised version of the normalized
cut algorithm
29The Semi-supervised Normalized Cut Algorithm
- Calculate the K-nearest neighbor graph from the
point set.
- Update graph with marker information.
- If two nodes are from the same species, add an
edge between them.
- If two nodes are from different species, remove
any edge between them.
- Bisect the graph using the normalized-cut
algorithm.
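The steps above can be sketched with NumPy as follows. This is an illustrative simplification, not the CompostBin implementation: it uses unit edge weights, bisects by thresholding the second eigenvector of the normalized Laplacian at zero (a common shortcut for the normalized cut), and adds a tiny `background` weight between all pairs so the graph stays connected. The `background` parameter, the function name, and the toy data are assumptions made for this example.

```python
import numpy as np


def semi_supervised_ncut(points, labels, k=3, background=1e-6):
    """Bisect points; labels[i] is a species id for marker reads, else None."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Step 1: K-nearest neighbor graph from pairwise Euclidean distances.
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dist[i])[:k]] = 1.0
    W = np.maximum(W, W.T)  # make the graph undirected
    # Assumption: tiny background weight keeps the graph connected.
    W += background
    np.fill_diagonal(W, 0.0)
    # Step 2: update the graph with marker information.
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] is None or labels[j] is None:
                continue
            # Same species: add an edge. Different species: remove it.
            W[i, j] = W[j, i] = 1.0 if labels[i] == labels[j] else 0.0
    # Step 3: normalized cut via the normalized Laplacian.
    degree = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return fiedler > 0


points = [[x, 0.0] for x in (0, 1, 2, 3, 4, 6, 7, 8, 9, 10)]
labels = ["a", None, None, None, "a", "b", None, None, None, "b"]
side = semi_supervised_ncut(points, labels)
```

In the toy data, the nearest-neighbor edge between the points at x=4 and x=6 is removed because their marker labels differ, so the bisection falls between the two groups.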
30Generalization to multiple bins
Gluconobacter oxydans (0.61), Granulibacter
bethesdensis (0.59) and Nitrobacter hamburgensis
(0.62)
31Generalization to multiple bins
Gluconobacter oxydans (0.61), Granulibacter
bethesdensis (0.59) and Nitrobacter hamburgensis
(0.62)
32Testing
- Simulate Metagenomic Sequencing
- Sanger Reads
- Variables
- Number of species
- Relative abundance
- GC content
- Phylogenetic Diversity
- Test on a real dataset where answer is
well-established.
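A simulated test set of this kind could be generated roughly as follows. This is a toy sketch, not the simulator used for the talk: reads are error-free substrings, read lengths are drawn uniformly from the 500-1000 bp Sanger range, and the chance of sampling a genome is assumed to be proportional to its relative abundance times its length. All names and the random toy genomes are hypothetical.

```python
import random


def simulate_reads(genomes, abundances, n_reads,
                   min_len=500, max_len=1000, seed=0):
    """Sample (species, read) pairs from genomes at given abundances."""
    rng = random.Random(seed)
    names = list(genomes)
    # Assumption: sampling probability ~ abundance * genome length.
    weights = [abundances[name] * len(genomes[name]) for name in names]
    reads = []
    for _ in range(n_reads):
        name = rng.choices(names, weights=weights, k=1)[0]
        genome = genomes[name]
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)
        # Keep the true species so binning accuracy can be scored later.
        reads.append((name, genome[start:start + length]))
    return reads


toy_rng = random.Random(1)
toy_genomes = {
    "species_A": "".join(toy_rng.choice("ACGT") for _ in range(5000)),
    "species_B": "".join(toy_rng.choice("ACGT") for _ in range(5000)),
}
reads = simulate_reads(toy_genomes, {"species_A": 20, "species_B": 1},
                       n_reads=100)
```

Varying the number of genomes, the abundance ratios, and the choice of genomes (for GC content and phylogenetic distance) covers the variables listed above.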
33Results
34Results
35Conclusions/Future Directions
- Satisfactory performance
- No Training on Existing Genomes ?
- Sanger Reads ?
- Low number of Species ?
- Future Work
- Holy Grail: Complex Communities
- Semi-supervised projection?
- Hybrid Assembly/Binning
36Acknowledgements
- Jonathan Eisen
- Martin Wu
- Dongying Wu
- Ichitaro Yamazaki
- Amber Hartman
- Marcel Huntemann
- Lior Pachter
- Richard Karp
- Ambuj Tewari
- Narayanan Manikandan
- Princeton University
- Simon Levin
- Josh Weitz
- Jonathan Dushoff