CompostBin: A DNA composition based metagenomic binning algorithm

About This Presentation

Title:

CompostBin : A DNA composition based metagenomic binning algorithm

Description:

CompostBin: A DNA composition based metagenomic binning algorithm. Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen, UC Davis

Slides: 34

Provided by: SouravCh

Transcript and Presenter's Notes

Title: CompostBin : A DNA composition based metagenomic binning algorithm

1
CompostBin: A DNA composition based metagenomic
binning algorithm

Sourav Chatterji, Ichitaro Yamazaki, Zhaojun Bai
and Jonathan Eisen
UC Davis
schatterji_at_ucdavis.edu

2
Overview of Talk

Metagenomics and the binning problem.
CompostBin

3
The Microbial World
4
Exploring the Microbial World

Culturing
Majority of microbes currently unculturable.
No ecological context.
Molecular Surveys (e.g. 16S rRNA)
who is out there?
what are they doing?

5
Metagenomics
6
Interpreting Metagenomic Data

Nature of Metagenomic Data
Mosaic
Intraspecies polymorphism
Fragmentary
New Sequencing Technologies
Enormous amount of data
Short Reads

7
Metagenomic Binning
Classification of sequences by taxa
8
Why Bin at all?
9
Binning in Action

Glassy Winged Sharpshooter (Homalodisca
coagulata).
Feeds on plant xylem (poor in organic nutrients).
Microbial Endosymbionts

10
(No Transcript)
11
Current Binning Methods

Assembly
Align with Reference Genome
Database Search: MEGAN, BLAST
Phylogenetic Analysis
DNA Composition: TETRA, PhyloPythia

12
Current Binning Methods

Need closely related reference genomes.
Poor performance on short fragments.
Sanger sequencing reads are only 500-1000 bp long.
Current assembly methods are unreliable.
Complex Communities Hard to Bin.

13
Overview of Talk

Metagenomics and the binning problem.
CompostBin

14
Genome Signatures

Does genomic sequence from an organism have a
unique signature that distinguishes it from
genomic sequence of other organisms?
Yes [Karlin et al., 1990s].
What is the minimum length sequence that is
required to distinguish genomic sequence of one
organism from the genomic sequence of another
organism?

15
Imperfect World

Horizontal Gene Transfer
Recent Estimates [Ge et al. 2005]
Varies between 0-6% of genes.
Typically 2%.
But
Amelioration

16
DNA-composition metrics
The K-mer Frequency Metric
CompostBin uses hexamers
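
A minimal sketch of a K-mer frequency vector for one read, with K = 6 (hexamers) as the default. The function name `kmer_frequencies` is hypothetical. The sketch assumes that windows containing ambiguous bases are skipped and that the two strands are not merged; the slides do not say how CompostBin handles either case.

```python
from itertools import product

def kmer_frequencies(seq, k=6):
    """Return the k-mer frequency vector of seq as a list of length 4**k."""
    # Enumerate all k-mers in a fixed (lexicographic) order.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = [0] * len(kmers)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        # Assumption: windows with N or other ambiguity codes are ignored.
        j = index.get(seq[i:i + k])
        if j is not None:
            counts[j] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [c / total for c in counts]
```

Each read then becomes one point in a 4096-dimensional space when k = 6.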
17
DNA-composition metrics

Working with K-mers for Binning.
Curse of Dimensionality: O(4^K) independent
dimensions.
Statistical noise increases with decreasing
fragment lengths.
Project data into a lower dimensional space to
decrease noise.
Principal Component Analysis.
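
A small worked calculation of why short fragments are noisy. It compares the number of K-mer windows in a read with the number of possible K-mers. The helper name `windows_per_kmer` is hypothetical and the 1000 bp read length is only an illustrative value taken from the Sanger read range mentioned earlier.

```python
def windows_per_kmer(read_length, k):
    """Average number of observed k-mer windows per possible k-mer."""
    n_windows = max(read_length - k + 1, 0)
    return n_windows / 4 ** k

# Dimensions grow as 4**k while the window count stays close to the read length.
for k in (2, 4, 6):
    print(k, 4 ** k, windows_per_kmer(1000, k))
```

For hexamers, a 1000 bp read has 995 windows spread over 4096 dimensions, so most entries of its frequency vector are zero. This is the motivation for projecting to a few principal components.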

18
PCA separates species
Gluconobacter oxydans [65% GC] and Rhodospirillum
rubrum [61% GC]
19
Effect of Skewed Relative Abundance
Abundance 20:1
Abundance 1:1
B. anthracis and L. monocytogenes
20
A Weighting Scheme
For each read, find overlap with other sequences
21
A Weighting Scheme
(Figure labels: 4, 5, 5, 3)
Calculate the redundancy of each position.
Weight is inverse of average redundancy.
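
A minimal sketch of the weighting step, assuming the overlaps of a read with other reads are already known as half-open intervals in the read's own coordinates. The function name `read_weight`, the input format, and the choice to count the read itself once at every position are assumptions, not details given in the slides.

```python
def read_weight(read_length, overlaps):
    """Weight of one read: inverse of its average per-position redundancy.

    overlaps: list of (start, end) half-open intervals in read coordinates,
    one per other read that overlaps this read. read_length must be positive.
    """
    # Assumption: the read itself contributes a redundancy of 1 everywhere.
    redundancy = [1] * read_length
    for start, end in overlaps:
        for pos in range(max(start, 0), min(end, read_length)):
            redundancy[pos] += 1
    average = sum(redundancy) / read_length
    return 1.0 / average
```

Reads from an abundant species overlap many other reads and get small weights, so they no longer dominate the covariance structure.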
22
Weighted PCA

Calculate the weighted mean µ_w.
Calculate the weighted covariance matrix M_w.
PCs are eigenvectors of M_w.
Use first three PCs for further analysis.

$\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$
23
Weighted PCA

Calculate the weighted mean µ_w.
Calculate the weighted covariance matrix M_w.
PCs are eigenvectors of M_w.
Use first three PCs for further analysis.

24
Weighted PCA

Calculate the weighted mean µ_w.
Calculate the weighted covariance matrix M_w.
Principal Components are eigenvectors of M_w.
Use first three PCs for further analysis.

$\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$

$M_w = \sum_{i} w_i (X_i - \mu_w)(X_i - \mu_w)^T$
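
The weighted PCA steps above (weighted mean, weighted covariance matrix, eigenvectors, first three components) can be sketched with NumPy as follows. The function name `weighted_pca` is hypothetical. The sketch assumes the weights are rescaled to sum to N so that dividing by N gives a proper weighted mean; the slides do not state the normalization.

```python
import numpy as np

def weighted_pca(X, w, n_components=3):
    """Project rows of X onto the leading eigenvectors of the weighted covariance."""
    X = np.asarray(X, dtype=float)   # shape (N, d): one frequency vector per read
    w = np.asarray(w, dtype=float)   # shape (N,): one weight per read
    N = X.shape[0]
    # Assumption: weights are rescaled so that they sum to N.
    w = w * (N / w.sum())
    mu_w = (w[:, None] * X).sum(axis=0) / N
    centered = X - mu_w
    # M_w = sum_i w_i (X_i - mu_w)(X_i - mu_w)^T
    M_w = (w[:, None] * centered).T @ centered
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]
    return centered @ components, components, mu_w
```

With all weights equal, this reduces to ordinary PCA.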
25
Weighted PCA separates species
PCA
Weighted PCA
B. anthracis and L. monocytogenes, abundance 20:1
26
Unsupervised Classification?
27
Semi-Supervised Classification

31 Marker Genes [courtesy Martin Wu]
Omnipresent
Relatively Immune to Lateral Gene Transfer
Reads containing these marker genes can be
classified with high reliability.

28
Semi-supervised Classification
Use a semi-supervised version of the normalized
cut algorithm
29
The Semi-supervised Normalized Cut Algorithm

Calculate the K-nearest neighbor graph from the
point set.
Update graph with marker information.
If two nodes are from the same species, add an
edge between them.
If two nodes are from different species, remove
any edge between them.
Bisect the graph using the normalized-cut
algorithm.
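
The three steps above can be sketched as follows. This is an illustration under stated assumptions, not CompostBin's implementation: the function name `semi_supervised_ncut` and the `labels` dictionary format are hypothetical, edges have unit weight, and the bisection uses the standard spectral relaxation of the normalized cut (second-smallest eigenvector of the normalized Laplacian) split at zero.

```python
import numpy as np

def semi_supervised_ncut(points, labels, k=5):
    """Bisect points into two bins; labels maps point index -> species
    for the reads that carry a marker gene (hypothetical input format)."""
    P = np.asarray(points, dtype=float)
    n = len(P)
    # Step 1: K-nearest-neighbor graph with unit edge weights.
    dist = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)          # a point is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dist[i])[:k]] = 1.0
    W = np.maximum(W, W.T)                  # make the graph undirected
    # Step 2: update the graph with marker information.
    for a in labels:
        for b in labels:
            if a != b:
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    # Step 3: spectral relaxation of the normalized cut.
    d = W.sum(axis=1)
    d[d == 0] = 1.0                         # guard against isolated nodes
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L_sym)
    fiedler = eigvecs[:, 1] * d_inv_sqrt    # second-smallest eigenvector
    # Simplification: split at zero rather than searching for the best threshold.
    return fiedler > 0
```

The sketch assumes the graph stays connected after the marker update; a disconnected graph would need its components handled separately.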

30
Generalization to multiple bins
Gluconobacter oxydans [0.61], Granulibacter
bethesdensis [0.59] and Nitrobacter hamburgensis
[0.62]
31
Generalization to multiple bins
Gluconobacter oxydans [0.61], Granulibacter
bethesdensis [0.59] and Nitrobacter hamburgensis
[0.62]
32
Testing