Interoperation of Molecular Biology Databases

Peter D. Karp, Ph.D., Bioinformatics Research Group, SRI International, Menlo Park, CA. pkarp_at_ai.sri.com

1
Interoperation of Molecular Biology Databases
  • Peter D. Karp, Ph.D.
  • Bioinformatics Research Group
  • SRI International
  • Menlo Park, CA
  • pkarp_at_ai.sri.com

2
Main Message
  • Interoperation of molecular-biology databases is
    a challenging problem of critical importance
  • DOE should initiate a program in interoperation
    of molecular biology databases
  • Pursue both warehouse approach and multidatabase
    approach
  • Major progress possible within 5 years

3
Motivations
  • Important biological problems require access to
    multiple bioinformatics databases
  • Different problems require different sets of
    databases
  • Hundreds of bioinformatics databases exist
      • Nucleic Acids Research 32 (2004) Database issue
      • Nucleic Acids Research DB list:
        http://www3.oup.co.uk/nar/database/a/
      • 350 databases listed in 2002
      • 560 databases listed in 2004
  • Applications of integration include
      • Complex queries
      • Comparison of overlapping sources
      • Data mining

4
Bioinformatics Databases
  • Tremendous progress in point-and-click access for
    biologist users
  • Less progress toward providing a computable,
    interoperable infrastructure for large-scale data
    mining
  • Every large-scale mining/learning problem
    requires time-consuming crafting of
    input/training datasets

5
Warehouse Approach vs. Multidatabase Approach
  • Multidatabase query approaches assume databases
    are in a queryable DBMS
  • Most sites that do operate DBMSs do not allow
    remote query access because of security and
    loading concerns
  • Users want to control data stability
  • Users want to control hardware applied to problem
  • Internet bandwidth limits query throughput
  • Users need to capture, integrate and publish
    locally produced data of different types
  • Replicating and refreshing very large sources is
    expensive
  • Multidatabase and Warehouse approaches
    complementary

6
SRI BioWarehouse Project Goal
  • Create a toolkit for constructing bioinformatics
    database warehouses that integrate sets of
    bioinformatics databases into one physical DBMS

7
BioWarehouse Approach
  • Warehouse schema defines many bioinformatics
    datatypes
  • Create loaders for public bioinformatics DBs
      • Parse file format for the DB
      • Apply semantic transformations
      • Insert database into warehouse tables
  • Oracle and MySQL implementations
  • Warehouse query access mechanisms
      • SQL queries via JDBC, Lisp, Perl, ODBC, OAA
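
The three loader steps above (parse, transform, insert) can be sketched as follows. This is only an illustration under stated assumptions: the flat-file format is a simplified, hypothetical ENZYME-like layout, the `enzyme` table and its columns are made-up names rather than the BioWarehouse schema, and Python's built-in `sqlite3` stands in for the Oracle and MySQL implementations mentioned on the slide.

```python
import sqlite3

# Hypothetical flat-file records in a simplified ENZYME-like layout
# (two-letter line codes, "//" ends an entry). Real files are richer.
FLAT_FILE = """\
ID   1.1.1.1
DE   Alcohol dehydrogenase.
//
ID   2.7.1.1
DE   Hexokinase.
//
"""

def parse_records(text):
    """Step 1: parse the source file format into one dict per entry."""
    record = {}
    for line in text.splitlines():
        if line.startswith("//"):
            if record:
                yield record
            record = {}
        elif line.strip():
            code, value = line[:2], line[5:].strip()
            record[code] = value
    if record:
        yield record

def transform(record):
    """Step 2: map source fields onto (assumed) warehouse conventions."""
    return {
        "ec_number": record["ID"],
        "name": record["DE"].rstrip("."),  # drop the trailing period
    }

def load(conn, text):
    """Step 3: insert the transformed entries into a warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS enzyme "
        "(ec_number TEXT PRIMARY KEY, name TEXT)"
    )
    rows = [transform(r) for r in parse_records(text)]
    conn.executemany(
        "INSERT INTO enzyme (ec_number, name) VALUES (:ec_number, :name)",
        rows,
    )
    conn.commit()
    return len(rows)

# In-memory database as a stand-in for the warehouse DBMS.
conn = sqlite3.connect(":memory:")
loaded = load(conn, FLAT_FILE)

# Once loaded, the data is reachable through ordinary SQL.
names = [row[0] for row in
         conn.execute("SELECT name FROM enzyme ORDER BY ec_number")]
```

Keeping parsing, transformation, and insertion as separate functions means a new source format would only need a new `parse_records`, which is one plausible reading of why the slide lists them as distinct steps.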

8
Warehouse Schema
  • Manages many bioinformatics datatypes
    simultaneously
      • Pathways, Reactions, Chemicals
      • Proteins, Genes, Replicons
      • Sequences, Sequence Features
      • Organisms, Taxonomic relationships
      • Computations (sequence matches)
      • Citations, Controlled vocabularies
      • Links to external databases
  • Each type of warehouse object implemented through
    one or more relational tables (currently 43)

9
Warehouse Schema
  • Manages multiple datasets simultaneously
  • Dataset: a single version of a database
  • Allows version comparison
  • Multiple software tools or experiments require
    access to different versions
  • Each dataset is a warehouse entity
  • Every warehouse object is registered in a dataset
  • Different databases storing the same biological
    datatypes are coerced into same warehouse tables
  • Design of most datatypes inspired by multiple
    databases
  • Representational tricks to decrease schema bloat
      • Single space of primary keys
      • Single set of satellite tables such as for
        synonyms, citations, comments, etc.
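
A minimal sketch of the schema ideas on this slide: datasets as warehouse entities, every object registered in a dataset, one shared space of primary keys, and a single satellite table for synonyms. All table names, column names, and sample records here (`wid`, `dataset_wid`, `other_wid`, "ExampleDB") are hypothetical illustrations, not the actual BioWarehouse schema, and `sqlite3` again stands in for the production DBMS.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dataset (wid INTEGER PRIMARY KEY, name TEXT, version TEXT);
CREATE TABLE protein (wid INTEGER PRIMARY KEY,
                      dataset_wid INTEGER REFERENCES dataset(wid),
                      name TEXT);
CREATE TABLE gene    (wid INTEGER PRIMARY KEY,
                      dataset_wid INTEGER REFERENCES dataset(wid),
                      name TEXT);
CREATE TABLE synonym (other_wid INTEGER, syn TEXT);
""")

# One counter feeds every table, so a key identifies an object
# warehouse-wide and the shared synonym table needs no type column.
_next_wid = itertools.count(1)

def add_dataset(name, version):
    """Register one version of a source database as a dataset."""
    wid = next(_next_wid)
    conn.execute("INSERT INTO dataset VALUES (?, ?, ?)", (wid, name, version))
    return wid

def add_object(table, dataset_wid, name, synonyms=()):
    """Insert an object, tagged with its dataset, plus any synonyms."""
    if table not in ("protein", "gene"):
        raise ValueError(f"unknown warehouse table: {table}")
    wid = next(_next_wid)
    conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)",
                 (wid, dataset_wid, name))
    conn.executemany("INSERT INTO synonym VALUES (?, ?)",
                     [(wid, s) for s in synonyms])
    return wid

# Two versions of the same (made-up) database live side by side.
v1 = add_dataset("ExampleDB", "1.0")
v2 = add_dataset("ExampleDB", "2.0")
add_object("protein", v1, "TrpA", synonyms=["TrpA-alt"])
add_object("protein", v2, "TrpA")
add_object("protein", v2, "TrpB")
add_object("gene", v2, "trpA", synonyms=["trpA-alt"])

def protein_names(dataset_wid):
    rows = conn.execute(
        "SELECT name FROM protein WHERE dataset_wid = ?", (dataset_wid,))
    return {r[0] for r in rows}

# Version comparison: proteins present in 2.0 but absent from 1.0.
added_in_v2 = protein_names(v2) - protein_names(v1)
```

Because each row carries its dataset key, comparing versions reduces to filtering on `dataset_wid`, and tools pinned to an older version can keep querying it after a newer one is loaded.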

10
Current Databases Supported by BioWarehouse
  • BioCyc
      • 15 genomes and metabolic networks
  • Swiss-Prot, TrEMBL
      • 1.3M proteins
  • ENZYME
  • KEGG
  • NCBI Taxonomy
  • CMR
      • 105 genomes, 250K genes, 250K proteins
  • Applications
      • DARPA BioSpice program on biological simulation
      • Study of sequence coverage of known enzymes
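
Once several sources sit in one DBMS, a question like the enzyme sequence-coverage study becomes a single SQL join. The sketch below is purely illustrative and is not the method used in that study: the two tables, their columns, and the placeholder identifiers (`EC-A`, `P1`, and so on) are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables: one filled from an enzyme nomenclature source,
# the other from a protein sequence source.
conn.executescript("""
CREATE TABLE enzyme_activity (ec_number TEXT PRIMARY KEY);
CREATE TABLE protein_annotation (protein_id TEXT, ec_number TEXT);
""")
conn.executemany("INSERT INTO enzyme_activity VALUES (?)",
                 [("EC-A",), ("EC-B",), ("EC-C",)])
conn.executemany("INSERT INTO protein_annotation VALUES (?, ?)",
                 [("P1", "EC-A"), ("P2", "EC-A"), ("P3", "EC-C")])

# Activities with no annotated protein: the LEFT JOIN keeps every
# activity, and the NULL test selects those that found no match.
uncovered = [row[0] for row in conn.execute("""
    SELECT e.ec_number
    FROM enzyme_activity AS e
    LEFT JOIN protein_annotation AS p ON p.ec_number = e.ec_number
    WHERE p.protein_id IS NULL
    ORDER BY e.ec_number
""")]
```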

11
Summary
  • Interoperation of molecular-biology databases is
    a challenging problem of critical importance
  • DOE should initiate a program in interoperation
    of molecular biology databases
  • Pursue both warehouse approach and multidatabase
    approach
  • Major progress possible within 5 years