Tools for comparative genomics and expert annotations

Provided by: lkmc

1
Tools for comparative genomics and expert
annotations

2
Goals of this Presentation
  • Introduce microbiologists to the power of NMPDR
    and SEED
  • Enable users to interact with data
  • Invite experts to participate in construction of
    subsystems
  • Capture expert annotations via the annotation
    clearinghouse

3
What is NMPDR?
  • Beautified, read-only version of the SEED
  • What is the SEED?
  • Editable environment for assignment of function
    in the context of systems biology
  • Intended to clean up legacy of errors created by
    similarity-based, automated assignment of
    function
  • Manual assignment of function based on integrated
    evidence: sequence similarity, functional
    clusters, phylogenetic and metabolic profiles
  • Developed for the project to annotate 1000 genomes

4
When Will We Have 1000 Complete Genomes?
  • Depends on what is meant by complete
  • Many sequencing projects will stop without
    finishing or closing the genome in one
    contiguous sequence for each replicon
  • A genome is essentially complete when:
  • 95-99% of the genome is accurately sequenced
  • 10X coverage by 454 method or 5X coverage by Sanger
    method
  • Assembly places 70% of data in contigs of at least 20
    kbp
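
The three criteria above can be expressed as a simple check. This is a minimal sketch, not part of NMPDR or SEED: the class and function names are hypothetical, and reading the slide's numbers as minimum thresholds is an assumption.

```python
from dataclasses import dataclass


@dataclass
class AssemblyStats:
    fraction_sequenced: float        # fraction of genome accurately sequenced (0..1)
    coverage: float                  # average fold coverage
    method: str                      # "454" or "sanger"
    fraction_in_long_contigs: float  # fraction of data in contigs >= 20 kbp


# Coverage thresholds from the slide, interpreted as minimums (assumption).
MIN_COVERAGE = {"454": 10.0, "sanger": 5.0}


def is_essentially_complete(stats: AssemblyStats) -> bool:
    """Return True when all three slide criteria are met."""
    min_cov = MIN_COVERAGE[stats.method.lower()]
    return (
        stats.fraction_sequenced >= 0.95
        and stats.coverage >= min_cov
        and stats.fraction_in_long_contigs >= 0.70
    )
```

For example, `is_essentially_complete(AssemblyStats(0.97, 12, "454", 0.8))` passes all three tests, while the same assembly at 6X coverage by the 454 method does not.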

5
Bacterial Genome Facts
  • First two complete genomes in 1995 were bacterial
    pathogens
  • 2913 genomes started as of Sept., 2007
  • 63% of total are bacteria; 50% of bacteria are
    pathogens
  • 4434 genomes started as of January, 2009
  • 51% bacteria
  • Value depends on accuracy of annotation

6
Complete Genome Projects

7
What is an Annotation?
  • Identification of nucleotide string that could
    potentially encode a protein
  • Open reading frames (ORFs) computed from stop and
    start codons, codon bias, promoters and RBS
  • Assignment of a name to that gene
  • Usually that of known protein with most similar
    sequence, computed from translated BLAST
  • Prediction of functional role for that gene
  • Function of most similar protein not always
    established with experimental evidence
  • Most similar protein may not have known function
  • Most similar ORF may or may not be expressed
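
The first annotation step can be illustrated with a toy ORF scan. This is a sketch under strong simplifications: it uses only ATG start codons and the three standard stop codons, and it ignores the codon bias, promoter and RBS evidence that the slide says real gene callers also use. The function names are hypothetical.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, orf) tuples for all six reading frames.

    Coordinates are 0-based on the scanned strand, end exclusive.
    min_codons counts codons from the start codon up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START:
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOPS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return orfs
```

An ORF found this way is only a candidate: as the bullets above note, it still needs a name and a predicted function, and it may or may not be expressed.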

8
Problems with Standard Annotations
  • 42% of H. influenzae ORFs assigned no function in
    1995
  • about half of those had no sequence match in
    GenBank
  • the rest matched hypothetical proteins in E.
    coli
  • 58% of H. influenzae ORFs assigned function of a
    significantly similar sequence
  • What was in GenBank to compare with in 1995?
  • 7% of all GenBank entries were bacterial; 16% of
    those were E. coli
  • many conserved hypotheticals added to database
  • Paralogous members of protein families may not be
    properly discriminated
  • Significantly similar enzymes may act on
    different substrates
  • Assignments are transitive, many times removed
    from experimental data

9
Subsystems Annotations vs. Pipelines or Protein
Families
  • What is subsystems annotation?
  • humans integrating evidence within a comparative
    framework
  • What's wrong with genome-at-a-time pipelines?
  • automated assignment of archived annotations to
    new genomes
  • propagates uninformative and incorrect
    annotations
  • What's wrong with annotation based on protein
    families?
  • emphasizes structural and phylogenetic evidence
  • ignores metabolic and chromosomal contexts
  • leads to ambiguity for members of large families,
    e.g. transporters

10
What is a Subsystem?
  • Subsystem is a generalization of pathway
  • Collection of functional roles jointly involved
    in a biological process or complex
  • metabolic, signaling, regulatory, structural
  • Functional role is the abstract biological
    function of a gene product
  • Atomic or fundamental examples
  • 6-phosphofructokinase (EC 2.7.1.11)
  • LSU ribosomal protein L31p
  • cell division protein FtsZ
  • Inclusion of gene in subsystem is only by
    functional role
  • Controlled vocabulary

11
Expert-Defined Subsystems
  • Curator is researcher with first-hand knowledge
    of biological system
  • Functional roles defined and grouped into
    subsystem and subsets by curator
  • universal groups of roles include all organisms
  • functional variants are subsets of roles found in
    a limited number of organisms
  • often represent alternative paths or
    nonorthologous replacement
  • Semi-automated assignment of function based on
    manual groundwork, sequence homology, and
    functional clustering

12
Subsystem Primer
  • Describe your subsystem in 150 words or less: why
    should these functions be considered together?
  • define the emergent properties of the system
  • Provide or link to a diagram that illustrates
    this subsystem
  • define the graph or network
  • List the reactions or relationships between these
    functional roles
  • define the edges
  • List the exact names and abbreviations of these
    functional roles
  • define the nodes
  • List the id numbers (GenBank, SwissProt, or any
    identifying alias) of genes that play these roles
    in one or more exemplar genomes
  • examples of nodes
  • Provide one or more references that support the
    assignment of function for the exemplar genes
  • provide evidence

13
Populated Subsystems
  • Two-dimensional integration of functional roles
    with genomes
  • Spreadsheet
  • Columns of functional roles
  • Rows of organisms
  • Cells of annotated genes
  • Table of functional roles with GO terms
  • Diagram
  • Curator notes and citations

14
Simple Example: Histidine Degradation Subsystem
  • Conversion of histidine to glutamate is
    organizing principle
  • Functional roles defined in table

15
Subsystem Diagram
  • Three functional variants
  • Universal subset has three roles, followed by
    three alternative paths from IV to VI

16
Subsystem Spreadsheet
  • Column headers taken from table of functional
    roles
  • Rows are selected genomes, or organisms
  • Cells are populated with specific, annotated
    genes
  • Shared background color indicates proximity of
    genes
  • Functional variants defined by the annotated
    roles
  • Variant code -1 indicates subsystem is not
    functional
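
A populated spreadsheet can be modeled as a nested mapping of genomes to roles to gene IDs, with the variant code derived from which roles are annotated. The sketch below is illustrative only. The abbreviations HutH, GluF, HutG, NfoD and ForI appear in these slides; HutU and HutI are assumed names for the other two universal roles, all genome names and gene IDs are hypothetical, and assigning -1 whenever no complete path is annotated is a simplification of how curators assign variant codes.

```python
# Universal subset of three roles (HutU and HutI are assumed names).
UNIVERSAL_ROLES = ("HutH", "HutU", "HutI")
# Roles that complete each alternative path: variant code -> roles.
VARIANT_ROLES = {1: ("GluF",), 2: ("HutG",), 3: ("NfoD", "ForI")}

# Rows are genomes, columns are functional roles, cells hold gene IDs.
# Genome names and gene IDs are hypothetical.
spreadsheet = {
    "genome_A": {"HutH": ["A_0101"], "HutU": ["A_0102"], "HutI": ["A_0103"],
                 "GluF": ["A_0450"]},
    "genome_B": {"HutH": ["B_2210"], "HutU": ["B_2211"], "HutI": ["B_2212"],
                 "NfoD": ["B_2213"], "ForI": ["B_2215"]},
    "genome_C": {"HutH": ["C_0007"]},
}


def variant_code(row):
    """Return the first variant whose roles are all annotated, else -1."""
    present = {role for role, genes in row.items() if genes}
    if not all(role in present for role in UNIVERSAL_ROLES):
        return -1
    for code, roles in VARIANT_ROLES.items():
        if all(role in present for role in roles):
            return code
    return -1


codes = {genome: variant_code(row) for genome, row in spreadsheet.items()}
```

Here genome_A resolves to variant 1, genome_B to variant 3, and genome_C, which lacks most of the universal roles, to -1.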

17
Missing Genes Noticed by Subsystems Annotation
  • No genes were annotated as ForI, formiminoglutamic
    iminohydrolase (EC 3.5.3.13), when the
    Histidine Degradation subsystem was populated
  • Organisms missing ForI still convert His to Glu
  • Candidate genes that could perform the role
    ForI must be identified
  • Strategy for finding genes is based on
    chromosomal clustering and occurrence profiling
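
Occurrence profiling can be sketched as ranking protein families by how closely their presence across genomes matches the set of genomes that need the missing role. The Jaccard index used below is an assumed similarity measure chosen for illustration, not necessarily the one used in the SEED, and all family and genome names are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(target_genomes, family_occurrence):
    """Rank families by similarity of their occurrence to the target set.

    target_genomes: genomes expected to contain the missing role
    family_occurrence: family name -> set of genomes containing it
    Returns (family, score) pairs, best match first.
    """
    scored = [(fam, jaccard(target_genomes, genomes))
              for fam, genomes in family_occurrence.items()]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical example: genomes G1-G3 contain NfoD but no annotated ForI.
target = {"G1", "G2", "G3"}
families = {
    "conserved_hypothetical_17": {"G1", "G2", "G3"},
    "transporter_4": {"G1", "G4", "G5", "G6"},
    "ribosomal_L31p": {"G1", "G2", "G3", "G4", "G5", "G6"},
}
ranking = rank_candidates(target, families)
```

A family found in exactly the target genomes ranks first, while a universal family such as a ribosomal protein scores lower because it also occurs where the role is not needed.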

18
Finding Genes that Cluster with NfoD
  • Red gene in graphic and table is NfoD of
    Xanthomonas
  • Genes pictured in gray boxes are located near NfoD
    in four or more species
  • Advanced controls expand display of homologous
    regions in other genomes
  • Functional Coupling score links to table of
    homologous pairs in other genomes
  • Cluster button finds biggest clusters in other
    species when not clustered in subject genome

19
What are Pinned Regions?
  • Focus gene is number 1, colored red
  • Most frequently co-localized homolog numbered 2,
    colored green
  • Sets of homologous genes presented in the same
    color with the same numerical label (BLASTP
    cut-off e-value 1e-20)
  • Numerical labels correspond to rank-ordered
    frequency of co-localization with the focus gene
  • Number of regions, size of region, and cut-off
    can be re-set by user
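
The numbering rule can be sketched as a frequency count. The code assumes homologous genes have already been grouped into sets (for example by BLASTP at the cut-off above) and that each region is given as a list of set identifiers. The function name, the identifiers, and the alphabetical tie-break are hypothetical.

```python
from collections import Counter


def label_pinned_regions(regions, focus_family):
    """Assign numerical labels: 1 for the focus gene's homolog set, then
    2, 3, ... in order of decreasing co-localization frequency.

    regions: one list of homolog-set IDs per genome region
    """
    counts = Counter()
    for region in regions:
        for fam in set(region):  # count each set once per region
            if fam != focus_family:
                counts[fam] += 1
    # Ties are broken alphabetically; this tie-break is an assumption.
    ranked = sorted(counts, key=lambda fam: (-counts[fam], fam))
    labels = {focus_family: 1}
    for rank, fam in enumerate(ranked, start=2):
        labels[fam] = rank
    return labels


regions = [
    ["famA", "focus", "famB"],
    ["famA", "focus"],
    ["famB", "focus", "famA", "famC"],
]
labels = label_pinned_regions(regions, "focus")
```

In this toy input famA co-occurs with the focus gene in all three regions and gets label 2, famB gets 3, and famC gets 4.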

20
Candidate ForI in Context with NfoD
  • Compare Regions around NfoD, red, center
  • HutC, the regulator, is green, 2
  • HutH, the first functional role in the subsystem,
    is blue, 4
  • Candidate ForI is teal, 6, originally annotated
    as conserved hypothetical

21
Annotation of ForI (EC 3.5.3.13)
  • Metabolic context proves need for role
  • Organisms missing annotated ForI degrade His to
    Glu
  • Chromosomal context points to candidate
  • Clusters with NfoD and other genes in subsystem
  • Occurrence context supports candidate
  • Organisms containing NfoD lack GluF and HutG,
    required for functional variants 1 and 2,
    respectively
  • Organisms containing candidate ForI also contain
    NfoD, indicating functional variant 3
  • Phylogenetic trees of candidate ForI genes are
    coherent

22
Subsystems Allow Bioinformatics to Inform Bench
Research
  • Subsystems point to missing or alternative genes
  • Bioinformatic predictions need to be tested at
    the bench
  • ForI candidate now verified experimentally
  • Connections forged between bench and
    bioinformatics

23
How is NMPDR distinct from NCBI?
  • Corrected, functional annotations, manually
    curated in context of systems biology
  • Multiple starting points for accessing data
  • gene or protein name, subsystem, organism
  • Search results downloadable as names or sequences
  • Interactive tools for comparative analysis
  • Compare regions: adjust size of region, number of
    genomes
  • Subsystems: browse phylogenetic distribution of
    biological system; color spreadsheet and diagram
  • Functional clusters: find genes with conserved
    proximity
  • BLASTP Hits: select and align interesting
    sequences
  • Signature genes: find genes in common or that
    distinguish user-selected groups of genomes;
    groups may contain one or many
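
The signature genes idea reduces to set operations once genes are grouped into families. This is a minimal sketch, not the NMPDR implementation; the function names and family identifiers are hypothetical, and "distinguishing" is read here as present in every genome of one group and absent from every genome of the other.

```python
def common_families(group):
    """Families present in every genome of the group.

    group: genome name -> set of family IDs
    """
    sets = [set(families) for families in group.values()]
    return set.intersection(*sets) if sets else set()


def distinguishing_families(group_a, group_b):
    """Families in every genome of group_a and in no genome of group_b."""
    in_any_b = set().union(*(set(f) for f in group_b.values()))
    return common_families(group_a) - in_any_b


pathogens = {"P1": {"f1", "f2", "f3"}, "P2": {"f1", "f2", "f4"}}
others = {"N1": {"f1", "f5"}}

shared = common_families(pathogens)
signature = distinguishing_families(pathogens, others)
```

A group may hold a single genome, in which case every family of that genome that is absent from the other group counts as distinguishing.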

24
Exploration of physical, genomic context
  • Compare Regions graphic
  • Focus protein highlighted red
  • Color-matched orthologs allow comparative analysis
    of functional clustering and chromosomal
    rearrangements
  • Redraw the display with different number of
    genomes or different size region
  • Compare Regions table
  • Table is sortable and filterable with active
    column headings
  • Genes with conserved proximity shown with
    functional coupling scores, fc-sc
  • fc-sc (functional coupling score)
  • Measures conservation of gene proximity and
    phylogenetic distance
  • Link returns table listing pairs of proximal
    orthologs
  • CL (find best clusters)
  • Finds clusters containing the focus protein in
    other genomes
  • Useful for genes without functional coupling
    scores, fc-sc

25
Exploration of functional, biological context
  • Populated Subsystem Spreadsheet
  • Columns represent functional roles, mouse over
    header for definition
  • Genomes (rows) shown may be filtered and sorted
    by name or taxonomic group
  • Cells populated with specific, annotated genes
    linked to context pages
  • Functional variants defined by the annotated
    roles
  • Variant codes defined in notes tab
  • Diagram of subsystem often provided
  • Protein families
  • FIGfams taken from single column of functional
    roles
  • Links to structures, orthologs, literature

26
NMPDR Services
  • Essential Genes on Genomic Scale
  • Experimentally verified in genome-wide scans of
    10 important model organisms
  • Drug targets pipeline to in silico screening
  • essential in at least one of the NMPDR pathogens
  • included in subsystems by our curators
  • orthologs in the Protein Data Bank
  • orthologs in a substantial number of bacterial
    priority pathogens
  • Targets search: flexible search forms for
    discovering novel targets based on computed
    attributes
  • physical characteristics such as MW, pI
  • subcellular location
  • transmembrane regions and signal peptides
  • subsystem, pathway, reaction
  • structural motifs, protein families

27
Related NMPDR Services
  • RAST: Genome annotation server
  • Automated annotation of essentially complete
    genome sequences in a small set of long sequence
    contigs
  • View results in comparative context with other
    genomes
  • MG-RAST: Metagenome annotation server
  • Automated annotation of a very large set of very
    short DNA sequences
  • View results in comparative context with other
    data sets
  • Annotation Clearinghouse
  • Tool to credit experts with annotation of
    specific genes and to share annotations with
    other databases
  • Input is a two-column table of gene IDs and
    annotations vouched for by expert
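
A two-column table of this kind can be validated before submission with a few lines of code. The sketch assumes tab-separated columns with no header row, which the slide does not specify, and the gene IDs in the usage note are hypothetical.

```python
import csv
import io


def parse_annotation_table(text):
    """Parse 'gene ID <TAB> annotation' lines into a dict.

    Blank lines are skipped; any other line without exactly two
    non-empty columns raises ValueError with its line number.
    """
    annotations = {}
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for line_no, row in enumerate(reader, start=1):
        if not row or all(not cell.strip() for cell in row):
            continue
        if len(row) != 2:
            raise ValueError(f"line {line_no}: expected 2 columns, got {len(row)}")
        gene_id, annotation = (cell.strip() for cell in row)
        if not gene_id or not annotation:
            raise ValueError(f"line {line_no}: empty gene ID or annotation")
        annotations[gene_id] = annotation
    return annotations
```

Calling `parse_annotation_table("gene_0001\tHistidine ammonia-lyase (EC 4.3.1.3)\n")` returns a one-entry dictionary keyed by the gene ID.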

28
Who is NMPDR?
  • Fellowship for Interpretation of Genomes (FIG)
  • Ross Overbeek, Veronika Vonstein, Gordon Pusch,
    Bruce Parrello, Rob Edwards, Andrei Osterman,
    Michael Fonstein, Svetlana Gerdes, Olga Zagnitko,
    Olga Vassieva, Yakov Kogan, Irina Goltsman
  • Argonne National Laboratory
  • Rick Stevens, Terry Disz, Robert Olson, Folker
    Meyer, Elizabeth Glass, Chris Henry, Jared
    Wilkening
  • Computation Institute at University of Chicago
  • Daniela Bartels, Michael Kubal, William Mihalo,
    Tobias Paczian, Andreas Wilke, Alex Rodriguez,
    Mark D'Souza, Rami Aziz
  • University of Illinois at Urbana; Hope College
  • Gary J. Olsen, Claudia Reich, Leslie McNeil;
    Aaron Best, Matt DeJongh
  • National Institute of Allergy and Infectious
    Diseases
  • National Institutes of Health, Department of
    Health and Human Services, Contract
    HHSN266200400042C.