BioWarehouse: The BioSPICE Bioinformatics Database Warehouse

1
BioWarehouse: The BioSPICE Bioinformatics Database Warehouse
  • Peter D. Karp, Tom Lee, Valerie Wagner, Yannick
    Pouliot
  • SRI International
  • http://www.BioSPICE.org/

2
Project Goal
  • Create a toolkit for integrating a set of
    bioinformatics databases into one physical
    database warehouse

3
Motivations
  • Important bioinformatics problems require access
    to multiple bioinformatics databases
  • Hundreds of bioinformatics databases exist
  • Nucleic Acids Research 30(1) 2002 DB issue
  • Nucleic Acids Research DB list: 350 DBs at
    http://www3.oup.co.uk/nar/database/a/
  • Different problems require different sets of
    databases

4
Motivations
  • Combining multiple databases allows for data
    verification and complementation
  • Simulation problems require access to data on
    pathways, enzymes, reactions, genetic regulation

5
Why is the Multidatabase Approach Not Sufficient?
  • Multidatabase query approaches assume databases
    are in a queryable DBMS
  • Most sites that do operate DBMSs do not allow
    remote query access because of security and
    loading concerns
  • Users want to control data stability
  • Users want to control speed of their hardware
  • Internet bandwidth limits query throughput
  • Users need to capture, integrate and publish
    locally produced data of different types
  • Multidatabase and Warehouse approaches
    complementary

6
Recent Progress and Current Work
  • Efforts completed to
  • Clean up schema
  • Refine loaders
  • Re-implement BioCyc loader in C
  • Define installation procedure
  • GenBank loader almost complete
  • Bacterial division of GenBank
  • Extensive BioWarehouse schema changes
  • All loaders being upgraded to support current
    schema and current versions of their DBs

7
Scenario 1
  • BioSPICE scientist wants to model multiple
    metabolic pathways in a given organism
  • Enumerate pathways and reactions
  • What enzymes catalyze each reaction?
  • What genes code for each enzyme?
  • What control regions regulate each gene?

8
Scenario 2
  • BioSPICE scientist evaluating a simulation model
    requires
  • Transcriptome, proteome, and metabolome
    measurements from as many sources as possible

9
Databases Supported by BioWarehouse
  • BioCyc collection
  • (EcoCyc, HumanCyc, MetaCyc, 13 more)
  • Swiss-Prot, TrEMBL
  • ENZYME
  • KEGG
  • NCBI Taxonomy
  • CMR -- Comprehensive Microbial Resource
  • GenBank bacterial division

10
SRI Constructing Public Warehouse Server
  • Publicly queryable collection of databases
  • NCBI Taxonomy
  • BioCyc open subset
  • Swiss-Prot (2005)
  • CMR
  • UCLA microarray data
  • Data will be openly available and queryable
  • Linux/PC running MySQL
  • Expected availability Sept 2004

11
Next Steps
  • Add schema for protein-protein interactions,
    implement loader for BIND

12
Approach
  • Oracle and MySQL implementations
  • Warehouse schema defines many bioinformatics
    datatypes
  • Create loaders for public bioinformatics DBs
  • Parse file format for the DB
  • Semantic transformations
  • Insert database into warehouse tables
  • Warehouse query access mechanisms
  • SQL queries via Perl, ODBC, JDBC, OAA
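
The three loader steps above (parse, transform, insert) can be sketched as follows. This is a minimal illustration only: the tab-separated input format, the simplified Protein table (including its DataSetWID column), and the function names are assumptions made for the example, and SQLite stands in for the Oracle and MySQL implementations.

```python
import sqlite3

# Hypothetical flat-file records: accession <TAB> name <TAB> sequence.
SAMPLE = "P00001\ttrpA protein\tmkt\nP00002\ttrpB protein\tMSTLL\n"

def parse(text):
    """Step 1: parse the source file format into raw records."""
    for line in text.splitlines():
        acc, name, seq = line.split("\t")
        yield {"acc": acc, "name": name, "seq": seq}

def transform(rec):
    """Step 2: semantic transformation into warehouse conventions."""
    seq = rec["seq"].upper()
    return {"name": rec["name"], "aasequence": seq, "length": len(seq)}

def load(conn, text, dataset_wid):
    """Step 3: insert the transformed records into a warehouse table."""
    rows = [transform(r) for r in parse(text)]
    conn.executemany(
        "INSERT INTO Protein (Name, AASequence, Length, DataSetWID) "
        "VALUES (?, ?, ?, ?)",
        [(r["name"], r["aasequence"], r["length"], dataset_wid) for r in rows],
    )
    return len(rows)

conn = sqlite3.connect(":memory:")
# Simplified, assumed table layout for the sketch.
conn.execute(
    "CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT, "
    "AASequence TEXT, Length INTEGER, DataSetWID INTEGER)"
)
loaded = load(conn, SAMPLE, dataset_wid=1)
```

Keeping parsing and transformation separate from insertion means only the first two functions need to change when a source database changes its file format.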

13
Database Loaders
  • Loader tool defined for each DB to be loaded into
    Warehouse
  • Example loaders available in several languages
  • C-based Loaders
  • NCBI Taxonomy
  • CMR (Comprehensive Microbial Resource)
  • KEGG
  • BioCyc collection of 15 pathway DBs
  • Java-based Loaders
  • Swiss-Prot
  • ENZYME
  • GenBank

14
Warehouse Schema
  • Manages many bioinformatics datatypes
    simultaneously
  • Pathways, Reactions, Chemicals
  • Proteins, Genes, Replicons, Nucleotide sequences
  • Citations
  • Organisms, Taxonomic relationships
  • Gene expression data
  • Links to external databases
  • Each type of warehouse object implemented through
    one or more relational tables (currently 62)

15
Warehouse Schema
  • Manages multiple datasets simultaneously
  • Dataset: a single version of a database
  • Support alternative measurements and viewpoints
  • Version comparison
  • Multiple software tools or experiments that
    require access to different versions
  • Each dataset is a warehouse entity
  • Every warehouse object is registered in a dataset

16
Warehouse Schema
  • Different databases storing the same biological
    datatypes are coerced into same warehouse tables
  • Design of most datatypes inspired by multiple
    databases
  • Representational tricks to decrease schema bloat
  • Single space of primary keys
  • Single set of satellite tables such as for
    synonyms, citations, comments, etc.
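
The two tricks above can be sketched together: if every object draws its identifier from one shared counter, a single satellite table can attach synonyms to any object type without one synonym table per datatype. The table and column names (SynonymTable, OtherWID, Syn), the helper add_object, and the sample values are assumptions for this illustration.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Gene (WID INTEGER PRIMARY KEY, Name TEXT);
-- One satellite table serves every object type: OtherWID may refer to a
-- row in any object table, because WIDs never collide across tables.
CREATE TABLE SynonymTable (OtherWID INTEGER, Syn TEXT);
""")

next_wid = itertools.count(1)  # single key space shared by all tables

def add_object(table, name, synonyms=()):
    """Insert an object into `table` and its synonyms into the shared table."""
    wid = next(next_wid)
    # Table name is interpolated only because this sketch controls it.
    conn.execute(f"INSERT INTO {table} (WID, Name) VALUES (?, ?)", (wid, name))
    conn.executemany(
        "INSERT INTO SynonymTable VALUES (?, ?)", [(wid, s) for s in synonyms]
    )
    return wid

gene_wid = add_object("Gene", "trpA", ["trpA-alias"])
protein_wid = add_object("Protein", "tryptophan synthase alpha chain", ["TSase alpha"])
```

Given a WID from any table, one lookup in SynonymTable returns its synonyms; the same pattern would extend to citations and comments.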

17
Warehouse Schema
  • Examples
  • Protein data from Swiss-Prot, TrEMBL, KEGG, and
    EcoCyc all loaded into same relational tables
  • Pathway data from MetaCyc and KEGG are loaded
    into the same relational tables

18
Loader / Schema Example: CMR
  • TIGR Comprehensive Microbial Resource
  • Contains all sequenced microbial genomes
  • For each genome describes
  • Organism
  • Replicons
  • Genes and gene products
  • Features on the replicons
  • Also includes All-vs-All BLAST search among all
    genomes

19
Warehouse Implementation of CMR
  • CMR loader written in C
  • Documented in HTML document that describes CMR
    mapping to BioWarehouse schema
  • Loading 105 genomes takes 4.5 hrs on
    dual-processor 2.5GHz PC

20
Overview of CMR in BioWarehouse Schema
Dataset
BioSource
Taxon
NucleicAcid
Gene
Protein
Feature
SubSequence
CrossReference
SequenceMatch
21
Dataset Table
CREATE TABLE DataSet (
  WID          number    -- Warehouse identifier for this entry.
  Name         char      -- Name of the database.
  Version      char      -- Version of the database described by this warehouse dataset.
  LoadDate     datetime  -- Date that this version of the data set was loaded into the warehouse.
  ReleaseDate  char      -- Date that the original database was first released.
  HomeURL      char      -- Web address of the home page of the original database.
  QueryURL     char      -- A URL at which we can retrieve objects in this database via the WWW
                         -- by substituting the unique ID of the object we wish to retrieve for the
                         -- string "s" in the query URL. An example URL is
                         -- http://database.university.edu/get-object/5050s
)
22
Overview of CMR in BioWarehouse Schema
Dataset
BioSource
Taxon
NucleicAcid
Gene
Protein
Feature
SubSequence
CrossReference
SequenceMatch
23
BioSource Table
CREATE TABLE BioSource (
  WID               number  -- Warehouse identifier for biological source.
  TaxonWID          number  -- Warehouse GUID identifying WID of Taxon entry for the species
  Name              char    -- Informal name assigned to this source.
  Strain            char    -- Strain of organism from which object is derived
  Organ             char    -- Organ of organism from which object is derived
  Organelle         char    -- Organelle of organism from which object is derived
  Tissue            char    -- Tissue of organism from which object is derived
  CellType          char    -- Cell type of organism from which object is derived
  CellLine          char    -- Cell line from which object is derived, if applicable
  DevelopmentStage  char    -- Stage of development associated with the object
  Sex               char    -- Sex of the organism from which object is derived
)
24
Overview of CMR in BioWarehouse Schema
Dataset
BioSource
Taxon
NucleicAcid
Gene
Protein
Feature
SubSequence
CrossReference
SequenceMatch
25
Taxon Table
CREATE TABLE Taxon (
  WID                 number  -- Warehouse ID for this Taxon
  ParentWID           number  -- Warehouse ID of the parent of this taxon
  Name                char    -- Taxonomic name of this taxon
  Rank                char    -- Rank of this taxon (kingdom, superkingdom ..)
  DivisionWID         number  -- Warehouse ID of the division this taxon belongs to.
  InheritedDivision   char    -- 'T' if division is inherited from parent, else 'F'
  GencodeWID          number  -- Warehouse ID of the genetic code for this taxon.
  InheritedGencode    char    -- 'T' if gencode is inherited from parent, else 'F'
  MCGencodeWID        number  -- Warehouse ID of the mitochondrial genetic code for this taxon.
  InheritedMCGencode  char    -- 'T' if the mitochondrial gencode is inherited from parent, else 'F'
  DataSetWID          number  -- Reference to the data set from which the entity came
)
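
The ParentWID column makes Taxon a tree, so an organism's lineage can be recovered by following parent links up to the root. The sketch below does this with a recursive common table expression in SQLite; the three taxa are invented placeholders, only the WID, ParentWID and Name columns are modeled, and whether a given Oracle or MySQL version accepts the same syntax is not something the slides address.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Taxon (WID INTEGER PRIMARY KEY, ParentWID INTEGER, Name TEXT);
-- Invented three-level hierarchy; the root has no parent.
INSERT INTO Taxon VALUES
  (1, NULL, 'root taxon'),
  (2, 1,    'example genus'),
  (3, 2,    'example species');
""")

def lineage(conn, wid):
    """Return taxon names from the root down to the taxon with this WID."""
    rows = conn.execute(
        """
        WITH RECURSIVE up(WID, ParentWID, Name, depth) AS (
            SELECT WID, ParentWID, Name, 0 FROM Taxon WHERE WID = ?
            UNION ALL
            SELECT t.WID, t.ParentWID, t.Name, up.depth + 1
            FROM Taxon t JOIN up ON t.WID = up.ParentWID
        )
        SELECT Name FROM up ORDER BY depth DESC
        """,
        (wid,),
    )
    return [name for (name,) in rows]

path = lineage(conn, 3)
```

The same upward walk is what the Inherited* flags imply: a taxon marked 'T' takes its division or genetic code from the nearest ancestor.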
26
Overview of CMR in BioWarehouse Schema
Dataset
BioSource
Taxon
NucleicAcid
Gene
Protein
Feature
SubSequence
CrossReference
SequenceMatch
27
NucleicAcid Table
CREATE TABLE NucleicAcid (
  WID               number  -- Warehouse identifier for this NucleicAcid.
  Name              char    -- Name or description of this molecule.
                            -- Ex (CMR): 'Chromosome II Brucella melitensis 16M'.
  Type              char    -- Enumeration: 'DNA', 'RNA' or 'NA'
  Class             char    -- Enumeration (RNA subtype, replicon type etc.)
  Topology          char    -- Enumeration: 'circular', 'linear' or 'other'.
  Strandedness      char    -- Enumeration indicating whether Nucleic Acid is single stranded,
                            -- double stranded or mixed stranded.
  Fragment          char    -- 'T' if this is a fragment of a molecule,
                            -- 'F' if this NucleicAcid describes an entire molecule.
  FullySequenced    char    -- 'T' if the molecule is completely sequenced within this dataset, else 'F'.
  MoleculeLength    int32   -- The length of the molecule, in nucleotides. This value can be an
                            -- approximation. Even if this value is known to be exact,
                            -- it may not necessarily be identical to that of TotalLength, as the molecule
  CumulativeLength  int32   -- The cumulative number of nucleotides for Subsequences referenced by this
                            -- NucleicAcid entry, whether contiguous or not. This value is a summation
                            -- of the number of nucleotides for these Subsequences.
                            -- If the molecule is completely sequenced, this value should be identical
                            -- to that of MoleculeLength; in this case, both fields are populated.
  GeneticCodeWID    number  -- References the genetic code of this molecule.
  BioSourceWID      number  -- References the biological source of this molecule.
  DataSetWID        number  -- References the data set from which the entity came
)
28
Overview of CMR in BioWarehouse Schema
Dataset
BioSource
Taxon
NucleicAcid
Gene
Protein
Feature
SubSequence
CrossReference
SequenceMatch
29
SubSequence Table
CREATE TABLE Subsequence (
  WID                number  -- Warehouse identifier for this Subsequence.
  NucleicAcidWID     number  -- References the containing nucleic acid molecule.
  FullSequence       char    -- 'T' if this is the complete sequence of the nucleic acid molecule
                             -- referenced by NucleicAcidWID, else 'F'.
  Sequence           string  -- Nucleotide sequence of this Subsequence.
  Length             int32   -- Number of nucleotides (and characters) in Sequence
  LengthApproximate  char    -- Enumeration; implies that Length approximates actual Sequence length:
                             -- 'gt' for greater than,
                             -- 'lt' for less than, or
                             -- 'ne' for not equal.
  PercentGC          real    -- Percentage of Sequence nucleotides that are either guanine or cytosine.
  Version            char    -- Dataset-specific information to indicate the version of this Subsequence.
  DataSetWID         number  -- Reference to the data set from which the entity came
)
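
Length and PercentGC are both derivable from Sequence, so a loader can compute them when it populates a row. A minimal sketch, assuming PercentGC is stored on a 0 to 100 scale and that an empty sequence yields 0.0 (neither detail is stated on the slide); the function name is hypothetical.

```python
def subsequence_stats(sequence):
    """Derive the Length and PercentGC column values from a nucleotide string."""
    seq = sequence.upper()
    gc = sum(1 for base in seq if base in "GC")
    # Guard against division by zero for an empty sequence (assumed convention).
    percent_gc = 100.0 * gc / len(seq) if seq else 0.0
    return {"Length": len(seq), "PercentGC": percent_gc}

stats = subsequence_stats("ATGC")
```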
30
Overview of CMR in BioWarehouse Schema
Dataset
BioSource
Taxon
NucleicAcid
Gene
Protein
Feature
SubSequence
CrossReference
SequenceMatch
31
Gene Table
CREATE TABLE Gene (
  WID                number  -- Warehouse identifier for this gene.
  Name               char    -- Name of the gene.
  NucleicAcidWID     number  -- Reference to the NucleicAcid molecule this gene resides upon.
  SubsequenceWID     number  -- Reference to the Subsequence containing the nucleotide sequence of this Gene.
  Type               char    -- Describes the type of molecule which is known to be ultimately produced
                             -- by a gene; enumerated values (pre-mRNA, rRNA, tRNA, etc)
  GenomeID           char    -- Unique ID assigned to this gene, such as by a genome project
  CodingRegionStart  int32   -- Base position of start of coding region. Start is always less than End,
                             -- except for genes that wrap around the origin of a circular chromosome.
  CodingRegionEnd    int32   -- Base position of end of coding region.
  Direction          char    -- Direction of transcription: 'F' for forward (5' to 3'),
                             -- 'R' for reverse/complement (3' to 5')
  Interrupted        char    -- 'T' if the gene is interrupted, else 'F'.
  DataSetWID         number  -- Reference to the data set from which the entity came
)
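
The wrap-around rule for CodingRegionStart and CodingRegionEnd matters whenever a length is computed from the two positions. The sketch below assumes 1-based inclusive coordinates and that the replicon length is available (for example from NucleicAcid.MoleculeLength); both are assumptions, as is the function name.

```python
def coding_region_length(start, end, replicon_length):
    """Number of bases in a coding region, allowing for origin wrap-around."""
    if start <= end:
        return end - start + 1
    # Start > End: the gene runs through the origin of a circular replicon,
    # so count from start to the last base, then from base 1 to end.
    return (replicon_length - start + 1) + end

ordinary = coding_region_length(100, 400, 1000)
wrapped = coding_region_length(950, 50, 1000)
```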
32
Overview of CMR in BioWarehouse Schema
Dataset
BioSource
Taxon
NucleicAcid
Gene
Protein
Feature
SubSequence
CrossReference
SequenceMatch
33
Protein Table
CREATE TABLE Protein (
  WID                  number  -- The warehouse ID of this protein
  Name                 char    -- Common name of the protein, if it exists
  AASequence           string  -- Amino-acid sequence for this protein
  Length               int32   -- Length of the amino-acid sequence for this protein
  Charge               int16   -- Charge of the protein
  Fragment             char    -- 'T' if protein is a fragment, else 'F'
  MolecularWeightCalc  real    -- Molecular weight calculated from sequence.
                               -- Units: Daltons.
  MolecularWeightExp   real    -- Molecular weight determined through experimentation.
                               -- Units: Daltons.
  PICalc               char    -- pI calculated from its sequence.
  PIExp                char    -- pI value determined through experimentation.
)
34
Overview of CMR in BioWarehouse Schema
Dataset
BioSource
Taxon
NucleicAcid
Gene
Protein
Feature
SubSequence
CrossReference
SequenceMatch
35
Feature Table
CREATE TABLE Feature (
  WID                   number  -- Warehouse identifier of this feature.
  Description           char    -- Describes this feature.
  Type                  char    -- Type of feature. These are the original type values.
                                -- Example: "Promoter CYZ".
  Class                 char    -- Class of feature. Assigns our typology for types of features, or
                                -- qualifiers associated with features. These are our own enumerated
                                -- type values, to allow us to classify features without losing the
                                -- original (author-provided) type values stored in Type.
                                -- Example: Class = Promoter.
  SequenceType          char    -- Enumeration (see also SequenceWID) that indicates whether the sequence
                                -- is protein or nucleic and how the sequence (if available) is represented.
                                -- If 'P', feature resides on a protein.
                                -- If 'S' or 'N', feature resides on a nucleic acid.
  SequenceWID           number  -- References a Protein or a Subsequence containing the sequence contents.
                                -- SequenceType of 'S' implies SequenceWID is nonNULL and references a
                                -- Subsequence: sequence = Subsequence.Sequence (i.e., it is stored explicitly).
                                -- SequenceType of 'N' implies SequenceWID (if nonNULL) references a
                                -- NucleicAcid: sequence is the substring
                                -- Subsequence.Sequence[StartPosition..EndPosition], where Subsequence is
                                -- the full Subsequence of the nucleic acid.
                                -- SequenceType of 'P' implies SequenceWID (if nonNULL) references a
                                -- Protein: sequence is the substring
                                -- Protein.AASequence[StartPosition..EndPosition].
  StartPosition         int32   -- Start position of the feature within the NucleicAcid or Protein sequence.
  EndPosition           int32   -- End position of the feature within the NucleicAcid or Protein sequence.
  ExperimentalSupport   char    -- 'T' if the feature is supported by experimental evidence, else 'F'
  ComputationalSupport  char    -- 'T' if the feature is supported by computational evidence, else 'F'
)
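
The SequenceType rules amount to a three-way dispatch on how a feature's residues are found. The sketch below models rows as plain dictionaries rather than database tables and treats positions as 1-based inclusive; both choices, and the function name, are assumptions made to keep the example short.

```python
def feature_sequence(feature, subsequences, proteins):
    """Resolve the residues a Feature row refers to, or None if unavailable.

    subsequences: WID -> {"Sequence", "NucleicAcidWID", "FullSequence"}
    proteins:     WID -> {"AASequence"}
    """
    wid = feature["SequenceWID"]
    if wid is None:
        return None
    kind = feature["SequenceType"]
    start, end = feature["StartPosition"], feature["EndPosition"]
    if kind == "S":
        # Sequence is stored explicitly in the referenced Subsequence.
        return subsequences[wid]["Sequence"]
    if kind == "N":
        # SequenceWID names a NucleicAcid; slice its full Subsequence.
        full = next(
            s for s in subsequences.values()
            if s["NucleicAcidWID"] == wid and s["FullSequence"] == "T"
        )
        return full["Sequence"][start - 1:end]
    if kind == "P":
        return proteins[wid]["AASequence"][start - 1:end]
    raise ValueError(f"unknown SequenceType: {kind!r}")

subsequences = {5: {"Sequence": "ATGCCCGGGTAA", "NucleicAcidWID": 2, "FullSequence": "T"}}
proteins = {9: {"AASequence": "MPG"}}
on_dna = {"SequenceType": "N", "SequenceWID": 2, "StartPosition": 4, "EndPosition": 6}
on_protein = {"SequenceType": "P", "SequenceWID": 9, "StartPosition": 2, "EndPosition": 3}
```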
36
Queries to BioWarehouse CMR
Which organisms have a trpA gene?
(bwhissue-sql
  "SELECT DISTINCT biosource.name
   FROM gene, biosourcewidgenewid, biosource
   WHERE gene.name = 'trpA'
     AND gene.wid = biosourcewidgenewid.genewid
     AND biosource.wid = biosourcewidgenewid.biosourcewid")
-- 71 organisms are returned
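
The trpA query can also be run from any language with a SQL binding. The sketch below uses Python's sqlite3 against a toy copy of the three tables involved; the rows are invented, only the columns the query touches are modeled, and explicit JOIN syntax is used in place of the comma-separated FROM list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BioSource (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Gene (WID INTEGER PRIMARY KEY, Name TEXT);
-- Linking table that associates genes with their biological source.
CREATE TABLE BioSourceWIDGeneWID (BioSourceWID INTEGER, GeneWID INTEGER);
INSERT INTO BioSource VALUES (1, 'organism A'), (2, 'organism B');
INSERT INTO Gene VALUES (10, 'trpA'), (11, 'trpB'), (12, 'trpA');
INSERT INTO BioSourceWIDGeneWID VALUES (1, 10), (1, 11), (2, 12);
""")

def organisms_with_gene(conn, gene_name):
    """Names of all biological sources that have a gene with this name."""
    rows = conn.execute(
        """
        SELECT DISTINCT biosource.name
        FROM gene
        JOIN biosourcewidgenewid ON gene.wid = biosourcewidgenewid.genewid
        JOIN biosource ON biosource.wid = biosourcewidgenewid.biosourcewid
        WHERE gene.name = ?
        ORDER BY biosource.name
        """,
        (gene_name,),
    )
    return [name for (name,) in rows]

organisms = organisms_with_gene(conn, "trpA")
```

Passing the gene name as a bound parameter avoids building the SQL string by hand.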
37
Queries to BioWarehouse CMR
(loop for (WID num) in (bwhissue-sql
                         "SELECT biosourceWID, count(*) FROM nucleicacid
                          WHERE datasetWID = 3 AND type = 'DNA'
                          GROUP BY biosourceWID")
      append (when WID
               (list (list WID
                           (caar (bwhissue-sql
                                   (format nil "SELECT name FROM biosource WHERE wid = ~A"
                                           WID)))
                           num))))
("202425" "1")
("202426" "4")
("202427" "4")
("202428" "1")
("202429" "1")
("202430" "1")
("202431" "1")
("202432" "1")
("202433" "22")
38
Queries to BioWarehouse CMR
("202433" "Borrelia burgdorferi B31" 22)
("202479" "Nostoc sp. PCC 7120" 7)
("202426" "Agrobacterium tumefaciens C58 Cereon" 4)
("202427" "Agrobacterium tumefaciens C58 UWash" 4)
("202450" "Deinococcus radiodurans R1" 4)
("202451" "Enterococcus faecalis V583" 4)
("202528" "Yersinia pestis CO92" 4)
("202437" "Buchnera sp. APS" 3)
("202457" "Halobacterium sp. NRC-1" 3)
("202465" "Mesorhizobium loti MAFF303099" 3)
("202467" "Methanococcus jannaschii DSM2661" 3)
("202485" "Pseudomonas syringae DC3000" 3)
("202497" "Sinorhizobium meliloti 1021" 3)
("202510" "Streptomyces coelicolor A3(2)" 3)
("202527" "Xylella fastidiosa 9a5c" 3)
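
The per-organism replicon counts above are produced by one grouped query followed by one name lookup per organism. The same listing can be obtained with a single join, sketched here in Python with sqlite3. The rows are invented, only the columns used are modeled, and the descending sort is an assumption based on the order of the results shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE BioSource (WID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE NucleicAcid (WID INTEGER PRIMARY KEY, Name TEXT, Type TEXT,
                          BioSourceWID INTEGER, DataSetWID INTEGER);
INSERT INTO BioSource VALUES (1, 'organism A'), (2, 'organism B');
INSERT INTO NucleicAcid VALUES
  (10, 'chromosome', 'DNA', 1, 3),
  (11, 'plasmid 1',  'DNA', 1, 3),
  (12, 'chromosome', 'DNA', 2, 3),
  (13, 'transcript', 'RNA', 2, 3);
""")

def replicon_counts(conn, dataset_wid):
    """(WID, name, DNA molecule count) per organism, largest count first."""
    return conn.execute(
        """
        SELECT b.WID, b.Name, COUNT(*) AS n
        FROM NucleicAcid na
        JOIN BioSource b ON b.WID = na.BioSourceWID
        WHERE na.DataSetWID = ? AND na.Type = 'DNA'
        GROUP BY b.WID, b.Name
        ORDER BY n DESC, b.Name
        """,
        (dataset_wid,),
    ).fetchall()

counts = replicon_counts(conn, 3)
```

Doing the join in the database replaces one round trip per organism with a single query.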
39
Queries to BioWarehouse CMR
(bwhissue-sql "SELECT * FROM nucleicacid
               WHERE biosourcewid = 202433 AND type = 'DNA'")
("chromosome Borrelia burgdorferi B31" "DNA" "chromosome" "linear")
("cp26(plasmid B) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-1(plasmid P) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-3(plasmid S) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-4(plasmid R) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-6(plasmid M) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-7(plasmid O) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-8(plasmid L) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-9(plasmid N) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp9(plasmid C) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("lp17(plasmid D) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp21(plasmid U) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp25(plasmid E) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp28-1(plasmid F) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp28-2(plasmid G) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp28-3(plasmid H) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp28-4(plasmid I) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp36(plasmid K) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp38(plasmid J) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp5(plasmid T) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp54(plasmid A) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp56(plasmid Q) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
40
Dashboard Interface to BioWarehouse
42
Accessing the Warehouse
  • Create your own locally configured warehouse
  • Loader tools and schema definitions available as
    open source
  • Query SRI public warehouse in Sept

43
Acknowledgments
  • Funded by DARPA BioSPICE program