Overview of Genome Databases - PowerPoint PPT Presentation

1
Overview of Genome Databases
  • Peter D. Karp, Ph.D.
  • SRI International
  • pkarp@ai.sri.com
  • www-db.stanford.edu/dbseminar/seminar.html

2
Talk Overview
  • Definition of bioinformatics
  • Motivations for genome databases
  • Issues in building genome databases

3
Definition of Bioinformatics
  • Computational techniques for management and
    analysis of biological data and knowledge
  • Methods for disseminating, archiving,
    interpreting, and mining scientific information
  • Computational theories of biology
  • Genome Databases is a subfield of bioinformatics

4
Motivations for Bioinformatics
  • Growth in molecular-biology knowledge
    (literature)
  • Genomics
  • Study of genomes through DNA sequencing
  • Industrial Biology

5
Example Genomics Datatypes
  • Genome sequences
  • DOE Joint Genome Institute
  • 511M bases in Dec 2001
  • 11.97G bases since Mar 1999
  • Gene and protein expression data
  • Protein-protein interaction data
  • Protein 3-D structures

6
Genome Databases
  • Experimental data
  • Archive experimental datasets
  • Retrieving past experimental results should be
    faster than repeating the experiment
  • Capture alternative analyses
  • Lots of data, simpler semantics
  • Computational symbolic theories
  • Complex theories become too large to be grasped
    by a single mind
  • The database is the theory
  • Biology is very much concerned with qualitative
    relationships
  • Less data, more complex semantics

7
Bioinformatics
  • Distinct intellectual field at the intersection
    of CS and molecular biology
  • Distinct field because researchers in the field
    must know CS, biology, and bioinformatics
  • Spectrum from CS research to biology service
  • Rich source of challenging CS problems
  • Large, noisy, complex data-sets and
    knowledge-sets
  • Biologists and funding agencies demand working
    solutions

8
Bioinformatics Research
  • algorithms + data structures = programs
  • algorithms + databases = discoveries
  • Combine sophisticated algorithms with the right
    content
  • Properly structured
  • Carefully curated
  • Relevant data fields
  • Proper amount of data

9
Reference on Major Genome Databases
  • Nucleic Acids Research Database Issue
  • http://nar.oupjournals.org/content/vol30/issue1/
  • 112 databases

10
Questions to Ask of a New Genome Database

11
What are Database Goals and Requirements?
  • What problems will database be used to solve?
  • Who are the users and what is their expertise?

12
What is its Organizing Principle?
  • Different DBs partition the space of genome
    information in different dimensions
  • Experimental methods (Genbank, PDB)
  • Organism (EcoCyc, Flybase)

13
What is its Level of Interpretation?
  • Laboratory data
  • Primary literature (Genbank)
  • Review (SwissProt, MetaCyc)
  • Does DB model disagreement?

14
What are its Semantics and Content?
  • What entities and relationships does it model?
  • How does its content overlap with similar DBs?
  • How many entities of each type are present?
  • Sparseness of attributes and statistics on
    attribute values

15
What are Sources of its Data?
  • Potential information sources
  • Laboratory instruments
  • Scientific literature
  • Manual entry
  • Natural-language text mining
  • Direct submission from the scientific community
  • Genbank
  • Modification policy
  • DB staff only
  • Submission of new entries by scientific community
  • Update access by scientific community

16
What DBMS is Employed?
  • None
  • Relational
  • Object oriented
  • Frame knowledge representation system

17
Distribution / User Access
  • Multiple distribution forms enhance access
  • Browsing access with visualization tools
  • API
  • Portability

18
What Validation Approaches are Employed?
  • None
  • Declarative consistency constraints
  • Programmatic consistency checking
  • Internal vs external consistency checking
  • What types of systematic errors might DB contain?
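The difference between declarative constraints and programmatic checking can be sketched with Python's built-in sqlite3 module. The table, columns, and rows below are hypothetical and chosen only for illustration; a real genome database would have its own schema and rules.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Declarative constraint: the rule lives in the schema, so the DBMS
# itself rejects any row that violates it.
conn.execute("""
    CREATE TABLE protein (
        id         INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        fragment   TEXT CHECK (fragment IN ('T', 'F')),
        mol_weight REAL
    )
""")
conn.execute("INSERT INTO protein VALUES (1, 'example protein A', 'F', 100.0)")

try:
    conn.execute("INSERT INTO protein VALUES (2, 'bad flag', 'X', 100.0)")
    declarative_rejected = False
except sqlite3.IntegrityError:
    declarative_rejected = True

# Programmatic check: the row is accepted at load time, and a separate
# query run afterwards flags values that look wrong.
conn.execute("INSERT INTO protein VALUES (3, 'suspicious weight', 'F', -5.0)")
suspect_ids = [
    row[0]
    for row in conn.execute(
        "SELECT id FROM protein WHERE mol_weight <= 0 ORDER BY id"
    )
]

print(declarative_rejected, suspect_ids)
```

Both checks here are internal: they test the database against its own rules. External checking would compare its content against another source.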

19
Database Documentation
  • Schema and its semantics
  • Format
  • API
  • Data acquisition techniques
  • Validation techniques
  • Size of different classes
  • Coverage of subject matter
  • Sparseness of attributes
  • Error rates
  • Update frequency

20
Relationship of Database Field to Bioinformatics
  • Scientists generally unaware of basic DB
    principles
  • Complex queries vs click-at-a-time access
  • Data model
  • Defined semantics for DB fields
  • Controlled vocabularies
  • Regular syntax for flatfiles
  • Automated consistency checking
  • Most biologists take one programming class
  • Evolution of typical genome database
  • Finer points of DB research off their radar
    screen
  • Handful of DB researchers work in bioinformatics

21
Database Field
  • For many years, the majority of bioinformatics
    DBs did not employ a DBMS
  • Flatfiles were the rule
  • Scientists want to see the data directly
  • Commercial DBMSs too expensive, too complex
  • DBAs too expensive
  • Most scientists do not understand
  • Differences between BA, MS, PhD in CS
  • CS research vs applications
  • Implications for project planning, funding,
    bioinformatics research

22
Recommendation
  • Teaching scientists programming is not enough
  • Teaching scientists how to build a DBMS is
    irrelevant
  • Teach scientists basic aspects of databases and
    symbolic computing
  • Database requirements analysis
  • Data models, schema design
  • Knowledge representation, ontologies
  • Formal grammars
  • Complex queries
  • Database interoperability

23
BioSPICE Bioinformatics Database Warehouse
  • Peter Karp, Dave Stringer-Calvert, Tom Lee, Kemal
    Sonmez
  • SRI International
  • http://www.BioSPICE.org/

24
Project Goal
  • Create a toolkit for constructing bioinformatics
    database warehouses that collect together a set
    of bioinformatics databases into one physical DBMS

25
Motivations
  • Important bioinformatics problems require access
    to multiple bioinformatics databases
  • Hundreds of bioinformatics databases exist
  • Nucleic Acids Research 30(1) 2002 DB issue
  • Nucleic Acids Research DB list: 350 DBs at
    http://www3.oup.co.uk/nar/database/a/
  • Different problems require different sets of
    databases

26
Motivations
  • Combining multiple databases allows for data
    verification and complementation
  • Simulation problems require access to data on
    pathways, enzymes, reactions, genetic regulation

27
Why is the Multidatabase Approach Not Sufficient?
  • Multidatabase query approaches assume databases
    are in a DBMS
  • Internet bandwidth limits query throughput
  • Most sites that do operate DBMSs do not allow
    remote SQL access because of security and loading
    concerns
  • Control data stability
  • Need to capture, integrate and publish locally
    produced data of different types
  • Multidatabase and Warehouse approaches
    complementary

28
Scenario 1
  • BioSPICE scientist wants to model multiple
    metabolic pathways in a given organism
  • Enumerate pathways and reactions
  • What enzymes catalyze each reaction?
  • What genes code for each enzyme?
  • What control regions regulate each gene?

29
Approach
  • Oracle and MySQL implementations
  • Warehouse schema defines many bioinformatics
    datatypes
  • Create loaders for public bioinformatics DBs
  • Parse file format for the DB
  • Semantic transformations
  • Insert database into warehouse tables
  • Warehouse query access mechanisms
  • SQL queries via Perl, ODBC, OAA

30
Example Swiss-Prot DB
  • Version 40.0 describes 101K proteins in a 320MB
    file
  • Each protein described as one block of records
    (an entry) in a large text file
  • Loader tool parses file one entry at a time
  • Creates new entries in a set of warehouse tables
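Entry-at-a-time parsing can be sketched as a Python generator. The sketch relies on two facts about the Swiss-Prot flatfile layout: each line starts with a two-letter line code, and a line beginning with "//" ends an entry. The function name `iter_entries` is hypothetical. The first sample entry is abbreviated from the 1A11_CUCMA entry in this deck, and the second is invented. This is not the project's Java loader.

```python
import io

def iter_entries(handle):
    """Yield one entry at a time as a list of (line_code, text) pairs."""
    entry = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            # Terminator line: hand the finished entry to the caller.
            if entry:
                yield entry
            entry = []
        elif line.strip():
            # Two-letter code, then the content starting at column 6.
            entry.append((line[:2], line[5:].strip()))
    if entry:
        # Tolerate a final entry that lacks its terminator.
        yield entry

SAMPLE = """\
ID   1A11_CUCMA   STANDARD   PRT   493 AA.
AC   P23599
GN   ACS1 OR ACCW.
//
ID   TOY2_EXAMPLE   STANDARD   PRT   10 AA.
AC   X00000
//
"""

entries = list(iter_entries(io.StringIO(SAMPLE)))
accessions = [dict(entry)["AC"] for entry in entries]
print(accessions)
```

Because the generator holds only one entry in memory, the same loop works on a file of any size. A loader would insert each yielded entry into the warehouse tables before reading the next.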

31
Warehouse Schema
  • Manages many bioinformatics datatypes
    simultaneously
  • Pathways, Reactions, Chemicals
  • Proteins, Genes, Replicons
  • Citations, Organisms
  • Links to external databases
  • Each type of warehouse object implemented through
    one or more relational tables (currently 43)

32
Warehouse Schema
  • Databases on our wish list
  • Genbank (nucleotide sequences)
  • Protein expression database
  • Protein-protein interactions database
  • Gene expression database
  • NCBI Taxonomy database
  • Gene Ontology
  • CMR

33
Warehouse Schema
  • Manages multiple datasets simultaneously
  • Dataset: single version of a database
  • Support alternative measurements and viewpoints
  • Version comparison
  • Multiple software tools or experiments that
    require access to different versions
  • Each dataset is a warehouse entity
  • Every warehouse object is registered in a dataset

34
Warehouse Schema
  • Different databases storing the same biological
    types are coerced into same warehouse tables
  • Design of most datatypes inspired by multiple
    databases
  • Representational tricks to decrease schema bloat
  • Single space of primary keys
  • Single set of satellite tables such as for
    synonyms, citations, comments, etc.

35
Warehouse Schema
  • Examples
  • Protein data from Swiss-Prot, TrEMBL, KEGG, and
    EcoCyc all loaded into same relational tables
  • Pathway data from MetaCyc and KEGG are loaded
    into the same relational tables

36
Example Swiss-Prot DB
ID   1A11_CUCMA   STANDARD   PRT   493 AA.
AC   P23599
DT   01-NOV-1991 (Rel. 20, Created)
DT   01-NOV-1991 (Rel. 20, Last sequence update)
DT   15-DEC-1998 (Rel. 37, Last annotation update)
DE   1-AMINOCYCLOPROPANE-1-CARBOXYLATE SYNTHASE CMW33 (EC 4.4.1.14) (ACC
DE   SYNTHASE) (S-ADENOSYL-L-METHIONINE METHYLTHIOADENOSINE-LYASE).
GN   ACS1 OR ACCW.

37
How Swiss-Prot is Loaded into the Warehouse
  • Register Swiss-Prot in Datasets table
  • Create entry in Entry and Protein tables for each
    Swiss-Prot protein
  • Satellite tables store
  • Protein synonyms, citations, comments, accession
    numbers, organism, sequence features,
    subunits/complexes, DB links

38
Protein Table
CREATE TABLE Protein (
  WID                  NUMBER,          --The warehouse ID of this protein
  Name                 VARCHAR2(500),   --Common name of the protein
  AASequence           VARCHAR2(4000),  --Amino-acid sequence for this protein
  Charge               NUMBER,          --Charge of the chemical
  Fragment             CHAR(1),         --Is this protein a fragment or not, T or F
  MolecularWeightCalc  NUMBER,          --Molecular weight calculated from sequence. Units: Daltons.
  MolecularWeightExp   NUMBER,          --Molecular weight determined through experimentation. Units: Daltons.
  PICalc               VARCHAR2(50),    --pI calculated from its sequence.
  PIExp                VARCHAR2(50),    --pI value determined through experimentation.
  DataSetWID           NUMBER           --Reference to the data set from which the entity came
)

39
Database Loaders
  • Loader tool defined for each DB to be loaded into
    Warehouse
  • Example loaders available in several languages
  • Loaders
  • KEGG (C)
  • BioCyc collection of 15 pathway DBs (C)
  • Swiss-Prot (Java)
  • ENZYME (Java)

40
Terminology
  • Model Organism Database (MOD): DB describing
    genome and other information about an organism
  • Pathway/Genome Database (PGDB): MOD that
    combines information about
  • Pathways, reactions, substrates
  • Enzymes, transporters
  • Genes, replicons
  • Transcription factors, promoters, operons, DNA
    binding sites
  • BioCyc: collection of 15 PGDBs at BioCyc.org
  • EcoCyc, AgroCyc, YeastCyc

41
Loader Architecture
  • Grammar for Swiss-Prot
  • ANTLR Parser Generator (builds the parser from the grammar)
  • Parser for SwissProt (reads the Swiss-Prot datafile)
  • Outputs: SQL insert commands, Oracle loadable file

42
Current Warehouse Contents
Datatype               KEGG  ENZYME  SwissProt  BsubCyc  Warehouse Total
Chemicals             7,284   2,952          0      576           10,812
Genes                 5,714       0     88,605    4,221           98,540
Organisms                60       0    103,807        1          103,868
Proteins              3,829   3,870    101,602    4,150          113,451
Enzymatic Reactions   3,509       0          0      717            4,226
Pathways              4,517       0          0      138            4,655
Pathway Reactions    36,271       0          0      530           36,801

43
Example Warehouse Uses
  • Check completeness of data sources

  • Count reactions in ENZYME database with (and without) associated
    protein sequences in SWISS-PROT database
  • 3870 reactions in ENZYME
  • 1662 reactions (43%) with a sequence in SWISS-PROT
  • 2208 reactions (57%) without a sequence in SWISS-PROT
  • Count of distinct non-partial EC numbers in SWISS-PROT
  • 1554 distinct EC numbers in SWISS-PROT (non-partial)
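A completeness check of this kind becomes a short query once both sources sit in one DBMS. The sketch below uses Python's sqlite3 module with hypothetical tables and invented rows, so its toy counts are not the warehouse's. The last lines only recompute the percentages from the three counts reported above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins: one row per reaction from ENZYME, and one
    -- row per (protein, EC number) annotation from SWISS-PROT.
    CREATE TABLE enzyme_reaction (ec TEXT PRIMARY KEY);
    CREATE TABLE protein_ec (protein TEXT, ec TEXT);

    INSERT INTO enzyme_reaction VALUES ('EC-A'), ('EC-B'), ('EC-C');
    INSERT INTO protein_ec VALUES
        ('protein-1', 'EC-A'), ('protein-2', 'EC-A'), ('protein-3', 'EC-C');
""")

total = conn.execute("SELECT COUNT(*) FROM enzyme_reaction").fetchone()[0]
# EXISTS counts each reaction once, however many proteins carry its EC number.
with_seq = conn.execute("""
    SELECT COUNT(*) FROM enzyme_reaction e
    WHERE EXISTS (SELECT 1 FROM protein_ec p WHERE p.ec = e.ec)
""").fetchone()[0]
without_seq = total - with_seq
print(total, with_seq, without_seq)

def percent(part, whole):
    """Share of whole, rounded to the nearest integer percent."""
    return round(100 * part / whole)

# Recompute the percentages from the counts given on this slide.
slide_with = percent(1662, 3870)
slide_without = percent(2208, 3870)
print(slide_with, slide_without)
```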