CAMERA: A Metagenomics Resource for Microbial Ecology

Slides: 53
Provided by: saulkr
Learn more at: http://www.jcvi.org
Transcript and Presenter's Notes

1
CAMERA: A Metagenomics Resource for Microbial Ecology
  • Saul A. Kravitz
  • J. Craig Venter Institute
  • Rockville, Maryland USA
  • KNAW Colloquium
  • May 29, 2008

2
Goals
  • Introduce you to CAMERA
  • Encourage you to use CAMERA
  • What can CAMERA do for you?

3
Presentation Outline
  • Introduction to Metagenomics
  • Global Ocean Sampling (GOS) Expedition
  • CAMERA Capabilities and Features
  • Compute Resources
  • Data Resources
  • Tools Resources
  • Looking Forward

4
Metagenomic Questions
  • Within an environment
  • What biological functions are present (absent)?
  • What organisms are present (absent)
  • Compare data from (dis)similar environments
  • What are the fundamental rules of microbial
    ecology
  • Adapting to environmental conditions?
  • How?
  • Evidence and mechanisms for lateral transfer
  • Search for novel proteins and protein families
  • And diversity within known families

5
Genomics vs Metagenomics
  • Genomics Old School
  • Study of a single organism's genome
  • Genome sequence determined using shotgun
    sequencing and assembly
  • >1300 microbes sequenced, first in 1995
  • DNA usually obtained from pure cultures (<1%)
  • Metagenomics
  • Application of genome sequencing methods to
    environmental samples (no culturing)
  • Environmental shotgun sequencing is the most
    widely used approach
  • Environmental Metadata provides key context

6
Complexity of Microbial Communities
  • Simple (e.g., AMD, gutless worm)
  • Few species present (<10)
  • Diverse
  • → Variations on standard genomics techniques
  • Complex (e.g., Soil or Marine)
  • Many species present (>10, often >1000)
  • Many closely related
  • → New techniques

7
Global Ocean Sampling Expedition
8
Global Ocean Sampling (GOS)
  • 178 Total Sampling Locations
  • Phase 1: 7.7M reads, >6M proteins (3/07)
  • Phase 2-IO: 2.2M reads (3/08)
  • Phase 2: 10M reads (future)
  • Diverse Environments
  • Open ocean, estuary, embayment, upwelling,
    fringing reef, atoll

Dates shown on slide: 4/04, 3/07, 3/08
9
GOS Sequence Diversity in the Ocean: Rusch et al. (PLoS 2007)
  • Most sequence reads are unique
  • Very limited assembly
  • Most sequences not taxonomically anchored
  • Relating shotgun data to reference genomes
  • Annotation challenging
  • New Techniques Needed
  • Fragment Recruitment
  • Extreme Assembly to find pan genomes
  • Sample to Sample Comparisons

10
Comparison of Dominant Ribotypes
11
Comparison of Total Genomic Content
12
GOS Protein Analysis: Yooseph et al. (PLoS 2007)
  • Novel clustering process
  • Sequence similarity based
  • Predict proteins and group into related clusters
  • Include GOS and all known proteins
  • Findings
  • GOS proteins
  • cover all existing prokaryotic families
  • expand diversity of known protein families
  • 10% of large clusters are novel
  • Many are of viral origin
  • No saturation in the rate of novel protein family
    discovery

13
Added Protein Family Diversity: Yooseph et al. (PLoS 2007)
Rubisco homologs
Known eukaryotes
Known prokaryotes
GOS prokaryotes
New Groups
14
GOS Viral Analysis (Williamson et al., PLoS ONE 2008)
  • Study of dsDNA viruses from shotgun data
  • 155k viral proteins identified from 37 GOS I
    sites (2.5%)
  • 59% of viral sequences were bacteriophage
  • Viral acquisition and retention of host metabolic
    genes is common and widespread
  • Viruses have made these genes their own
  • Clade tightly with viral genes
  • Codistribution of P-SSM4-like cyanophage and the
    dominant ecotype of Prochlorococcus in GOS
    samples.

15
Viral acquisition of host genes: talC Gene
  • GOS Viral
  • Public Viral
  • GOS Bacterial
  • Public Bacterial
  • Public Euk

16
Reference Genomes
  • Overview
  • 150 reference marine microbes (101 released)
  • Scaffold for GOS
  • Sequenced, assembled, autoannotated
  • Isolation Metadata
  • Incomplete
  • Bottlenecks
  • Availability of DNA
  • Purity of DNA
  • Status and Data
  • https://research.venterinstitute.org/moore/

17
Motivations for CAMERA
  • Significant investment in sequencing
  • Only accessible to bioinformatics elite
  • Diversity of user sophistication and needs
  • Bioinformatics and Computation Challenges
  • Assembly, annotation, comparative analysis,
    visualization
  • Dedicated compute resources
  • Importance of Metadata
  • Metadata required for environmental analysis
  • Need to drive standards
  • Compliance with Convention on Biological Diversity

18
Convention on Biological Diversity
  • Sample in territorial waters?
  • Country granted certain rights by CBD
  • Sampling agreements may contain restrictions
  • CAMERA users must acknowledge potential
    restrictions on commercial data use
  • CAMERA maintains mapping of country-of-origin for
    all data objects

19
CAMERA: http://camera.calit2.net
  • Convenient acronym for cumbersome name
  • Henry Nichols, PLoS Biology
  • Mission
  • Enable Research in Marine Microbiology
  • Debuted March 2007
  • camera-info_at_calit2.net

20
CAMERA Capabilities
  • Compute Resources
  • 512-node compute grid, 200 TB storage
  • Data and Metadata Resources
  • Annotated Metagenomic and genomic data
  • Tools Resources
  • Scalable BLAST
  • Fragment Recruitment
  • Metagenomic Annotation
  • Text Search

21
CAMERA Compute and Storage Complex at UCSD/Calit2
512 Processors, 5 Teraflops, 200 Terabytes Storage
  • Source: Larry Smarr, Calit2

22
CAMERA Metagenomic Data Volume by Project
23
CAMERA Metagenomic Samples
24
CAMERA Users: >2000 Registered Since March 2007
25
CAMERA Data Collections
  • Metagenomic Sequence Collection
  • Reads and assemblies w/associated metadata
  • CAMERA-computed annotation
  • Protein Clusters
  • Maintaining clusters from Yooseph et al (Yooseph
    and Li, 08)
  • Genomic Data
  • Viral, Fungal, pico-Eukaryotes, Microbial
  • Moore Marine Genomes with Metadata
  • Non-redundant sequence Collection
  • Genbank, Refseq, Uniprot/Swissprot, PDB etc

26
Standardizing Contextual Metadata
  • Genomic Standards Consortium
  • Led by Dawn Field, NIEeS
  • Members from EU, UK, US
  • Goals are to promote
  • Standardization of genomic descriptions
  • Exchange and integration of genomic data
  • Metadata standardization key enabler
  • MIMS: Min Info for Metagenomic Sample
  • GCDML: Standard format

27
Contextual Metadata Challenges
  • Researchers Need to Collect and Submit
  • Relevant metadata depends on study MIMS
  • Specification of minimum metadata
  • Standardize Exchange Format - GCDML
  • Comprehensive and extensible
  • Leverages Existing Ontologies, Validatable
  • And
  • Easy for a scientist to use...
  • Need ongoing software support for tools

28
CAMERA Core Metadata by Project
  • De facto Core
  • Latitude and Longitude
  • Collection date
  • Habitat and Geographic Location
  • Missing metadata
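
The core fields above can be checked mechanically before a sample is submitted. The following is a minimal sketch only: the `SampleMetadata` record, its field names, and the `missing_core_fields` helper are hypothetical and are not part of CAMERA or of any metadata standard.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class SampleMetadata:
    """Hypothetical record holding the de facto core fields listed on the slide."""
    sample_id: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    collection_date: Optional[date] = None
    habitat: Optional[str] = None
    geographic_location: Optional[str] = None


# Field names are assumptions chosen to mirror the slide's bullet points.
CORE_FIELDS = (
    "latitude",
    "longitude",
    "collection_date",
    "habitat",
    "geographic_location",
)


def missing_core_fields(sample: SampleMetadata) -> List[str]:
    """Return the names of core fields that have no value."""
    return [name for name in CORE_FIELDS if getattr(sample, name) is None]


def has_valid_coordinates(sample: SampleMetadata) -> bool:
    """True when both coordinates are present and within the usual ranges."""
    if sample.latitude is None or sample.longitude is None:
        return False
    return -90.0 <= sample.latitude <= 90.0 and -180.0 <= sample.longitude <= 180.0
```

A per-project report of missing metadata could then be built by calling `missing_core_fields` on every sample in the project.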

29
CAMERA Contextual Metadata
30
CAMERA 1.3
  • http://camera.calit2.net

31
Scalable BLAST with Metadata
  • Large searches permitted and encouraged
  • 454 FLX run vs All Metagenomic
  • Some larger tblastx jobs have run >20 hrs
  • 10 kbp BLASTN vs All Metagenomic: 1 min
  • BLAST XML or Tabular Export
  • Searches against NRAA
  • BLAST XML output feeds MEGAN
  • Searches against All Metagenomic
  • GUI with metadata
  • Tabular with metadata
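
One way to use a tabular export together with sample metadata is to count hits per habitat. This is a sketch under stated assumptions: it assumes the 12 standard BLAST tabular columns, while the exact layout of CAMERA's tabular export is not given in the slides. The `read_to_sample` and `sample_habitat` lookups and the `hits_per_habitat` function are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Dict

# Standard 12-column BLAST tabular layout (assumed here, not confirmed for CAMERA).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def hits_per_habitat(
    blast_tsv: str,
    read_to_sample: Dict[str, str],
    sample_habitat: Dict[str, str],
    min_identity: float = 0.0,
) -> Counter:
    """Count hits per habitat, keeping hits with percent identity >= min_identity."""
    reader = csv.DictReader(
        io.StringIO(blast_tsv), fieldnames=BLAST_COLUMNS, delimiter="\t"
    )
    counts: Counter = Counter()
    for row in reader:
        # Skip comment lines that some tabular exports include.
        if not row["qseqid"] or row["qseqid"].startswith("#"):
            continue
        if float(row["pident"]) < min_identity:
            continue
        # Subject read -> sample -> habitat; unmapped hits are counted as "unknown".
        sample = read_to_sample.get(row["sseqid"])
        counts[sample_habitat.get(sample, "unknown")] += 1
    return counts
```

The same join could group by collection date or geographic location instead of habitat.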

32
Scalable BLAST with Metadata
33
Integration of Metadata and Data
34
Browsing Large Data Collections: Fragment Recruitment Viewer
  • Microbial Communities vs Reference Genomes
  • Millions of sequence reads vs Thousands of
    genomes
  • Definition: A read is recruited to a sequence if
  • End-to-end blastN alignment exists
  • Rapid Hypothesis Generation and Exploration
  • How do cultured and wildtype genomes differ?
  • Insertions, deletions, translocations
  • Correlation with environmental factors
  • Export sequence and annotation
  • Credits: Doug Rusch and Michael Press
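
The recruitment definition above can be sketched as a filter over alignment records. This is an illustration only, not the viewer's implementation: the `Alignment` record, the `slack` tolerance, and keeping only the best hit per read are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alignment:
    """Hypothetical blastn alignment of one read against a reference genome."""
    read_id: str
    read_length: int
    q_start: int          # 1-based, inclusive, on the read
    q_end: int
    s_start: int          # 1-based, on the reference; s_start > s_end on the minus strand
    s_end: int
    pct_identity: float


def is_end_to_end(aln: Alignment, slack: int = 0) -> bool:
    """True when the alignment spans the whole read, allowing `slack` unaligned bases per end."""
    return aln.q_start <= 1 + slack and aln.q_end >= aln.read_length - slack


def recruit(alignments: Iterable[Alignment], slack: int = 0) -> List[Tuple[int, float, str]]:
    """Return (genomic position, percent identity, read id) for each recruited read.

    Only the highest-identity end-to-end alignment per read is kept (an assumption).
    """
    best: Dict[str, Alignment] = {}
    for aln in alignments:
        if not is_end_to_end(aln, slack):
            continue
        current = best.get(aln.read_id)
        if current is None or aln.pct_identity > current.pct_identity:
            best[aln.read_id] = aln
    return sorted(
        (min(a.s_start, a.s_end), a.pct_identity, a.read_id) for a in best.values()
    )
```

Each returned tuple corresponds to one point in a plot of sequence similarity against genomic position.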

35
Fragment Recruitment Viewer
Sequence Similarity
Genomic Position
Doug Rusch, JCVI
36
Geographic Legend
Sequence Similarity
Genomic Position
Annotation
37
(No Transcript)
38
(No Transcript)
39
(No Transcript)
40
Prochlorococcus marinus str. MIT 9312
  • Coloring by geography
  • 80-95% identity cloud
  • GOS Indian Ocean
  • Regions with no coverage
  • Where?
  • Real?

41
Mate Status Highlights Differences
  • Paired end (mate) sequencing
  • Coloring by mate status
  • Highlights cultured vs metagenomic differences
  • Selective display of
  • Mates by status
  • Reads by sample

42
Mate Pairs Highlight Variation
43
What Genes are Involved
44
View by Sample
45
View by Sample: Filter by mate status
46
Annotation of Environmental Shotgun Data
  • Gene Finding
  • Using Yooseph's Protein Clusters, and/or
  • Metagene
  • Functional Assignment
  • Variation of JCVI prok annotation pipeline
  • Leverages protein cluster annotation -- soon
  • Quality Nearly Comparable to Prokaryotic Genomic
    Annotation

47
Protein Clusters as Gene Finder
  • Identification and soft mask of ncRNAs
  • Naïve identification of ORFs (60aa min)
  • Add peptides to clusters incrementally
  • Yooseph and Li, 2008
  • Predicted Genes based on ORFs in
  • Clusters of sufficient size
  • Clusters that satisfy additional filters
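
The naïve ORF step can be illustrated with a six-frame, stop-to-stop scan. This is a rough sketch, not the pipeline's code: it assumes the standard genetic code, treats any stretch between stop codons as an ORF, and ignores soft masking by upper-casing the input. The function names are hypothetical, and the cluster-based filtering steps are not shown.

```python
from itertools import product
from typing import List, Tuple

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate in frame 0; codons with ambiguous bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def naive_orfs(dna: str, min_aa: int = 60) -> List[Tuple[str, int, str]]:
    """Return (strand, frame, peptide) for stop-to-stop ORFs of at least min_aa residues."""
    dna = dna.upper()  # discards any soft masking (a simplification)
    strands = {"+": dna, "-": dna.translate(COMPLEMENT)[::-1]}
    orfs = []
    for strand, seq in strands.items():
        for frame in range(3):
            # Stop codons ('*') delimit candidate ORFs within each frame.
            for peptide in translate(seq[frame:]).split("*"):
                if len(peptide) >= min_aa:
                    orfs.append((strand, frame, peptide))
    return orfs
```

Because no start codon or coding signal is required, this scan over-predicts heavily, which is why a later filter such as cluster membership is needed.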

48
Protein Clusters: Advantages and Disadvantages
  • Weaknesses
  • Homology-based
  • Stateful (also a strength)
  • Less sensitive (for now)
  • Strengths
  • More specific
  • Transitive Annotation
  • Learns over time
  • Easy to maintain

49
Search for Dehalogenase
50
Browse Clusters
51
Near Future
  • More extensive data collection
  • Summary views of data sets by
  • Annotation
  • Samples
  • Mate Status
  • Taxonomy
  • Habitat and other contextual metadata
  • 16S datasets?

52
Credits
  • JCVI CAMERA Team
  • Leonid Kagan, Michael Press, Todd Safford,
    Cristian Goina, Qi Yang, Sean Murphy, Jeff
    Hoover, Tanja Davidsen, Ramana Madupu, Sree
    Nampally, Nikhat Zhafar, Prateek Kumar
  • Doug Rusch, Shibu Yooseph, Aaron Halpern,
    Granger Sutton, Shannon Williamson
  • Marv Frazier and Bob Friedman
  • Calit2 CAMERA Team
  • Adam Brust, Michael Chiu, Brian Fox, Adam Dunne,
    Kayo Arima
  • Larry Smarr and Paul Gilna
  • http://camera.calit2.net