CompostBin : A DNA composition based metagenomic binning algorithm - PowerPoint PPT Presentation

About This Presentation
Title:

CompostBin : A DNA composition based metagenomic binning algorithm

Description:

CompostBin : A DNA composition based metagenomic binning algorithm Sourav Chatterji*, Ichitaro Yamazaki, Zhaojun Bai and Jonathan Eisen UC Davis – PowerPoint PPT presentation

Slides: 34
Provided by: SouravCh

Transcript and Presenter's Notes

1
CompostBin: A DNA composition-based metagenomic
binning algorithm
  • Sourav Chatterji, Ichitaro Yamazaki, Zhaojun Bai
    and Jonathan Eisen
  • UC Davis
  • schatterji_at_ucdavis.edu

2
Overview of Talk
  • Metagenomics and the binning problem.
  • CompostBin

3
The Microbial World
4
Exploring the Microbial World
  • Culturing
  • Majority of microbes currently unculturable.
  • No ecological context.
  • Molecular Surveys (e.g. 16S rRNA)
  • who is out there?
  • what are they doing?

5
Metagenomics
6
Interpreting Metagenomic Data
  • Nature of Metagenomic Data
      • Mosaic
      • Intraspecies polymorphism
      • Fragmentary
  • New Sequencing Technologies
      • Enormous amount of data
      • Short Reads

7
Metagenomic Binning
Classification of sequences by taxa
8
Why Bin at all?
9
Binning in Action
  • Glassy Winged Sharpshooter (Homalodisca
    coagulata).
  • Feeds on plant xylem (poor in organic nutrients).
  • Microbial Endosymbionts

10
(No Transcript)
11
Current Binning Methods
  • Assembly
  • Align with Reference Genome
  • Database Search: MEGAN, BLAST
  • Phylogenetic Analysis
  • DNA Composition: TETRA, PhyloPythia

12
Current Binning Methods
  • Need closely related reference genomes.
  • Poor performance on short fragments.
  • Sanger sequence reads 500-1000 bp long.
  • Current assembly methods unreliable
  • Complex Communities Hard to Bin.

13
Overview of Talk
  • Metagenomics and the binning problem.
  • CompostBin

14
Genome Signatures
  • Does genomic sequence from an organism have a
    unique signature that distinguishes it from
    genomic sequence of other organisms?
  • Yes (Karlin et al., 1990s)
  • What is the minimum length sequence that is
    required to distinguish genomic sequence of one
    organism from the genomic sequence of another
    organism?

15
Imperfect World
  • Horizontal Gene Transfer
  • Recent Estimates (Ge et al. 2005)
  • Varies between 0-6% of genes.
  • Typically 2%.
  • But
  • Amelioration

16
DNA-composition metrics
The K-mer Frequency Metric
CompostBin uses hexamers
17
DNA-composition metrics
  • Working with K-mers for Binning.
  • Curse of Dimensionality: O(4^K) independent
    dimensions.
  • Statistical noise increases with decreasing
    fragment lengths.
  • Project data into a lower dimensional space to
    decrease noise.
  • Principal Component Analysis.
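
The K-mer frequency metric above can be sketched in a few lines of Python. This is a minimal illustration, not the CompostBin implementation: the function name `kmer_frequencies` is hypothetical, windows containing characters other than A, C, G, T are skipped, and the two strands are not merged, all of which are assumptions.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=6):
    """Return the frequency of every DNA k-mer in `seq` as a list of length 4**k."""
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed lexicographic ordering of all 4**k k-mers over the alphabet ACGT.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Windows containing N or other symbols are ignored (assumption).
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


vec = kmer_frequencies("ACGTACGTAGCTAGCTAGGATCCA")
```

With k = 6 (hexamers) each read becomes a vector with 4^6 = 4096 entries, which is the dimensionality that the projection step is then asked to reduce.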

18
PCA separates species
Gluconobacter oxydans (65% GC) and Rhodospirillum
rubrum (61% GC)
19
Effect of Skewed Relative Abundance
Abundance 20:1
Abundance 1:1
B. anthracis and L. monocytogenes
20
A Weighting Scheme
For each read, find overlap with other sequences
21
A Weighting Scheme
[Figure: example read with redundancy values 4, 5, 5, 3]
Calculate the redundancy of each position.
Weight is inverse of average redundancy.
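
The weighting rule on this slide can be illustrated with a short sketch. It assumes the overlaps have already been found and are given as half-open intervals in the read's own coordinates, and it counts the read itself toward the redundancy of each of its positions; both choices, and the name `read_weight`, are assumptions rather than details taken from the talk.

```python
def read_weight(read_length, overlaps):
    """Weight of one read = 1 / (average per-position redundancy).

    `read_length` must be positive. `overlaps` lists half-open (start, end)
    intervals, in read coordinates, that are covered by other reads.
    """
    # Each position is covered at least once, by the read itself (assumption).
    redundancy = [1] * read_length
    for start, end in overlaps:
        # Clip each overlap to the read before counting it.
        for pos in range(max(start, 0), min(end, read_length)):
            redundancy[pos] += 1
    average = sum(redundancy) / read_length
    return 1.0 / average


# A 10 bp read fully covered by two other reads has redundancy 3 everywhere.
w = read_weight(10, [(0, 10), (0, 10)])
```

Reads from an abundant species overlap many other reads and so receive small weights, which is what lets the weighted analysis compensate for skewed relative abundance.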
22
Weighted PCA
  • Calculate weighted mean µw
  • Calculate weighted covariance matrix Mw
  • PCs are eigenvectors of Mw.
  • Use first three PCs for further analysis.

$$\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$$
24
Weighted PCA
  • Calculate weighted mean µw
  • Calculate weighted covariance matrix Mw
  • Principal Components are eigenvectors of Mw.
  • Use first three PCs for further analysis.

$$\mu_w = \frac{1}{N} \sum_{i=1}^{N} w_i X_i$$
$$M_w = \sum_{i} w_i (X_i - \mu_w)(X_i - \mu_w)^T$$
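
The weighted PCA steps listed on this slide can be sketched with NumPy. The sketch reads the slide's formulas as µw = (1/N) Σ wi Xi and Mw = Σ wi (Xi − µw)(Xi − µw)^T; that reading, the function name `weighted_pca`, and the use of `numpy.linalg.eigh` are assumptions, not a description of the authors' code. Note that µw as written is the usual weighted mean only when the weights sum to N.

```python
import numpy as np


def weighted_pca(X, w, n_components=3):
    """Project the rows of X onto the top eigenvectors of the weighted covariance."""
    X = np.asarray(X, dtype=float)  # shape (N, d): one feature vector per read
    w = np.asarray(w, dtype=float)  # shape (N,): one weight per read
    # Weighted mean, following the slide: (1/N) * sum_i w_i X_i.
    mu_w = (w[:, None] * X).sum(axis=0) / len(X)
    centered = X - mu_w
    # Weighted covariance: sum_i w_i (X_i - mu_w)(X_i - mu_w)^T.
    M_w = (w[:, None] * centered).T @ centered
    # M_w is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(M_w)
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]  # shape (d, n_components)
    return centered @ components, components
```

Any constant factor in front of Mw rescales the eigenvalues but leaves the eigenvectors, and therefore the principal components, unchanged.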
25
Weighted PCA separates species
PCA
Weighted PCA
B. anthracis and L. monocytogenes (20:1)
26
Unsupervised Classification?
27
Semi-Supervised Classification
  • 31 Marker Genes (courtesy of Martin Wu)
  • Omnipresent
  • Relatively Immune to Lateral Gene Transfer
  • Reads containing these marker genes can be
    classified with high reliability.

28
Semi-supervised Classification
Use a semi-supervised version of the normalized
cut algorithm
29
The Semi-supervised Normalized Cut Algorithm
  • Calculate the K-nearest neighbor graph from the
    point set.
  • Update graph with marker information.
  • If two nodes are from the same species, add an
    edge between them.
  • If two nodes are from different species, remove
    any edge between them.
  • Bisect the graph using the normalized-cut
    algorithm.
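
The three steps above can be sketched as follows. This is an illustrative version under stated assumptions, not the authors' implementation: edges have unit weight, the bisection uses the standard spectral relaxation of the normalized cut (second-smallest eigenvector of the normalized graph Laplacian, split at zero), and the function name `semi_supervised_ncut` and the `labels` convention are hypothetical. The split at zero is only meaningful when the graph is connected.

```python
import numpy as np


def semi_supervised_ncut(points, k, labels):
    """Bisect `points` (N x d). labels[i] is a species id, or None if unlabeled."""
    X = np.asarray(points, dtype=float)
    n = len(X)
    # 1. Symmetric k-nearest-neighbor graph with unit edge weights.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dist[i])[:k]] = 1.0
    W = np.maximum(W, W.T)
    # 2. Marker information: add edges within a species, remove edges between species.
    marked = [i for i in range(n) if labels[i] is not None]
    for a in marked:
        for b in marked:
            if a != b:
                W[a, b] = 1.0 if labels[a] == labels[b] else 0.0
    # 3. Spectral relaxation of the normalized cut.
    deg = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))
    L_sym = np.eye(n) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    fiedler = d_inv_sqrt * vecs[:, 1]  # second-smallest generalized eigenvector
    return fiedler > 0  # boolean side of the cut for each point


pts = [[x] for x in (0, 1, 2, 3, 4, 5.5, 6.5, 7.5, 8.5, 9.5)]
side = semi_supervised_ncut(pts, k=2, labels=["a"] + [None] * 8 + ["b"])
```

Applying the same bisection recursively to each side is one way to extend the sketch to more than two bins.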

30
Generalization to multiple bins
Gluconobacter oxydans (GC 0.61), Granulibacter
bethesdensis (GC 0.59) and Nitrobacter hamburgensis
(GC 0.62)
32
Testing
  • Simulate Metagenomic Sequencing
  • Sanger Reads
  • Variables
      • Number of species
      • Relative abundance
      • GC content
      • Phylogenetic Diversity
  • Test on a real dataset where answer is
    well-established.
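
A simulated dataset of the kind described above could be generated along these lines. The sketch is a toy under several assumptions that are not from the talk: reads are error-free, lengths are drawn uniformly from 500 to 1000 bp, and the chance of sampling a genome is proportional to its relative abundance times its length. The name `simulate_reads` and its parameters are hypothetical.

```python
import random


def simulate_reads(genomes, abundance, n_reads, min_len=500, max_len=1000, seed=0):
    """Sample labeled reads. `genomes` maps species name -> genome sequence."""
    rng = random.Random(seed)  # fixed seed so a run can be reproduced
    names = list(genomes)
    # A genome contributes reads in proportion to abundance * length (assumption).
    weights = [abundance[s] * len(genomes[s]) for s in names]
    reads = []
    for _ in range(n_reads):
        species = rng.choices(names, weights=weights)[0]
        genome = genomes[species]
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)
        # Keep the true species with each read so binning accuracy can be scored.
        reads.append((species, genome[start:start + length]))
    return reads


toy = {"x": "ACGT" * 500, "y": "GGCA" * 500}
reads = simulate_reads(toy, {"x": 20, "y": 1}, n_reads=100)
```

Varying the number of entries in `genomes` and the values in `abundance` covers two of the variables listed on the slide; GC content and phylogenetic diversity depend on which genomes are chosen.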

33
Results
34
Results
35
Conclusions/Future Directions
  • Satisfactory performance
  • No Training on Existing Genomes ?
  • Sanger Reads ?
  • Low number of Species ?
  • Future Work
  • Holy Grail: Complex Communities
  • Semi-supervised projection?
  • Hybrid Assembly/Binning

36
Acknowledgements
  • UC Davis
  • UC Berkeley
  • Jonathan Eisen
  • Martin Wu
  • Dongying Wu
  • Ichitaro Yamazaki
  • Amber Hartman
  • Marcel Huntemann
  • Lior Pachter
  • Richard Karp
  • Ambuj Tewari
  • Narayanan Manikandan
  • Princeton University
  • Simon Levin
  • Josh Weitz
  • Jonathan Dushoff

37
(No Transcript)