BioWarehouse: The BioSPICE Bioinformatics Database Warehouse - PowerPoint PPT Presentation

About This Presentation

Title:

BioWarehouse: The BioSPICE Bioinformatics Database Warehouse

Description:

Create a toolkit for integrating a set of bioinformatics databases into one ... Representational tricks to decrease schema bloat. Single space of primary keys ...

Slides: 39

Provided by: tomga6

Transcript and Presenter's Notes

Title: BioWarehouse: The BioSPICE Bioinformatics Database Warehouse

1
BioWarehouse: The BioSPICE Bioinformatics Database Warehouse

Peter D. Karp, Tom Lee, Valerie Wagner, Yannick Pouliot
SRI International
http://www.BioSPICE.org/

2
Project Goal

Create a toolkit for integrating a set of
bioinformatics databases into one physical
database warehouse

3
Motivations

Important bioinformatics problems require access
to multiple bioinformatics databases
Hundreds of bioinformatics databases exist
Nucleic Acids Research 30(1), 2002 DB issue
Nucleic Acids Research DB list: 350 DBs at
http://www3.oup.co.uk/nar/database/a/
Different problems require different sets of
databases

4
Motivations

Combining multiple databases allows for data
verification and complementation
Simulation problems require access to data on
pathways, enzymes, reactions, genetic regulation

5
Why is the Multidatabase Approach Not Sufficient?

Multidatabase query approaches assume databases
are in a queryable DBMS
Most sites that do operate DBMSs do not allow
remote query access because of security and
loading concerns
Users want to control data stability
Users want to control speed of their hardware
Internet bandwidth limits query throughput
Users need to capture, integrate and publish
locally produced data of different types
Multidatabase and warehouse approaches are
complementary

6
Recent Progress and Current Work

Efforts completed to
Clean up schema
Refine loaders
Re-implement BioCyc loader in C
Define installation procedure
Genbank loader almost complete
Bacterial division of Genbank
Extensive BioWarehouse schema changes
All loaders being upgraded to support current
schema and current versions of their DBs

7
Scenario 1

BioSPICE scientist wants to model multiple
metabolic pathways in a given organism
Enumerate pathways and reactions
What enzymes catalyze each reaction?
What genes code for each enzyme?
What control regions regulate each gene?

8
Scenario 2

BioSPICE scientist evaluating a simulation model
requires
Transcriptome, proteome, and metabolome
measurements from as many sources as possible

9
Databases Supported by BioWarehouse

BioCyc collection
(EcoCyc, HumanCyc, MetaCyc, 13 more)
Swiss-Prot, TrEMBL
ENZYME
KEGG
NCBI Taxonomy
CMR -- Comprehensive Microbial Resource
Genbank bacterial division

10
SRI Constructing Public Warehouse Server

Publicly queryable collection of databases
NCBI Taxonomy
BioCyc open subset
Swiss-Prot (2005)
CMR
UCLA microarray data
Data will be openly available and queryable
Linux/PC running MySQL
Expected availability Sept 2004

11
Next Steps

Add schema for protein-protein interactions,
implement loader for BIND

12
Approach

Oracle and MySQL implementations
Warehouse schema defines many bioinformatics
datatypes
Create loaders for public bioinformatics DBs
Parse file format for the DB
Semantic transformations
Insert database into warehouse tables
Warehouse query access mechanisms
SQL queries via Perl, ODBC, JDBC, OAA
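
The three loader steps listed on this slide (parse, transform, insert) can be sketched as follows. This is only an illustration in Python with SQLite, not the project's C or Java loaders: the tab-separated input format, the `load_taxa` function, and the reduced `Taxon` table are assumptions made for the example.

```python
import sqlite3

def load_taxa(conn, lines, dataset_wid, next_wid):
    """Parse 'name<TAB>rank' lines, normalize them, and insert Taxon rows.

    Hypothetical loader: the input format and the reduced table are assumed.
    Returns the number of rows loaded.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Taxon "
        "(WID INTEGER PRIMARY KEY, Name TEXT, Rank TEXT, DataSetWID INTEGER)"
    )
    loaded = 0
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        name, rank = line.split("\t")      # 1. parse the source file format
        rank = rank.strip().lower()        # 2. semantic transformation
        conn.execute(                      # 3. insert into the warehouse table
            "INSERT INTO Taxon VALUES (?, ?, ?, ?)",
            (next_wid + loaded, name.strip(), rank, dataset_wid),
        )
        loaded += 1
    conn.commit()
    return loaded
```

Every inserted row carries the `DataSetWID` of the load, so rows from different source databases can share one table.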

13
Database Loaders

Loader tool defined for each DB to be loaded into
Warehouse
Example loaders available in several languages
C-based Loaders
NCBI Taxonomy
CMR (Comprehensive Microbial Resource)
KEGG
BioCyc collection of 15 pathway DBs
Java-based Loaders
Swiss-Prot
ENZYME
Genbank

14
Warehouse Schema

Manages many bioinformatics datatypes
simultaneously
Pathways, Reactions, Chemicals
Proteins, Genes, Replicons, Nucleotide sequences
Citations
Organisms, Taxonomic relationships
Gene expression data
Links to external databases
Each type of warehouse object implemented through
one or more relational tables (currently 62)

15
Warehouse Schema

Manages multiple datasets simultaneously
Dataset: a single version of a database
Support alternative measurements and viewpoints
Version comparison
Multiple software tools or experiments that
require access to different versions
Each dataset is a warehouse entity
Every warehouse object is registered in a dataset
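
A small sketch of what registering every object in a dataset makes possible: two versions of one source database sit side by side, and a version comparison becomes a plain SQL set difference. The toy data, the reduced tables, and the helper names are assumptions for illustration only.

```python
import sqlite3

def build_demo():
    """Create a toy warehouse holding two versions of the same source database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE DataSet (WID INTEGER PRIMARY KEY, Name TEXT, Version TEXT);
        CREATE TABLE Gene (WID INTEGER PRIMARY KEY, Name TEXT, DataSetWID INTEGER);
        INSERT INTO DataSet VALUES (1, 'DemoDB', '1.0'), (2, 'DemoDB', '2.0');
        INSERT INTO Gene VALUES (10, 'trpA', 1), (11, 'trpA', 2), (12, 'trpB', 2);
    """)
    return conn

def names_added(conn, old_dataset_wid, new_dataset_wid):
    """Gene names present in the newer dataset but absent from the older one."""
    rows = conn.execute(
        "SELECT Name FROM Gene WHERE DataSetWID = ? "
        "EXCEPT "
        "SELECT Name FROM Gene WHERE DataSetWID = ? "
        "ORDER BY Name",
        (new_dataset_wid, old_dataset_wid),
    )
    return [name for (name,) in rows]
```

Calling `names_added(build_demo(), 1, 2)` lists the gene names that appear only in version 2.0 of the toy database.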

16
Warehouse Schema

Different databases storing the same biological
datatypes are coerced into same warehouse tables
Design of most datatypes inspired by multiple
databases
Representational tricks to decrease schema bloat
Single space of primary keys
Single set of satellite tables such as for
synonyms, citations, comments, etc.
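
The two tricks on this slide can be illustrated together. Because WIDs come from one key space, a single satellite table can hold synonyms for any kind of object without per-type synonym tables. The table name `SynonymTable`, its columns `OtherWID` and `Syn`, and the toy rows are assumptions for this sketch; they do not appear on the slides.

```python
import itertools
import sqlite3

def build_demo():
    """Build a toy schema in which genes and proteins share one WID space."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Gene    (WID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Protein (WID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE SynonymTable (OtherWID INTEGER, Syn TEXT);
    """)
    next_wid = itertools.count(1)  # one counter shared by every object table
    gene_wid, protein_wid = next(next_wid), next(next_wid)
    conn.execute("INSERT INTO Gene VALUES (?, ?)", (gene_wid, "geneA"))
    conn.execute("INSERT INTO Protein VALUES (?, ?)", (protein_wid, "proteinA"))
    conn.executemany(
        "INSERT INTO SynonymTable VALUES (?, ?)",
        [(gene_wid, "gA"), (protein_wid, "pA"), (protein_wid, "pA-alt")],
    )
    return conn, gene_wid, protein_wid

def synonyms_for(conn, wid):
    """Look up synonyms by WID alone; the object's type is not needed."""
    rows = conn.execute(
        "SELECT Syn FROM SynonymTable WHERE OtherWID = ? ORDER BY Syn", (wid,)
    )
    return [syn for (syn,) in rows]
```

Since no two objects share a WID, `synonyms_for` never has to ask which table the WID belongs to.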

17
Warehouse Schema

Examples
Protein data from Swiss-Prot, TrEMBL, KEGG, and
EcoCyc all loaded into same relational tables
Pathway data from MetaCyc and KEGG are loaded
into the same relational tables

18
Loader / Schema Example: CMR

TIGR Comprehensive Microbial Resource
Contains all sequenced microbial genomes
For each genome describes
Organism
Replicons
Genes and gene products
Features on the replicons
Also includes All-vs-All BLAST search among all
genomes

19
Warehouse Implementation of CMR

CMR loader written in C
Documented in HTML document that describes CMR
mapping to BioWarehouse schema
Loading 105 genomes takes 4.5 hrs on
dual-processor 2.5GHz PC

20
Overview of CMR in BioWarehouse Schema
Dataset
BioSource
Taxon
NucleicAcid
Gene
Protein
Feature
SubSequence
CrossReference
SequenceMatch
21
Dataset Table
CREATE TABLE DataSet (
  WID          number    -- Warehouse identifier for this entry.
  Name         char      -- Name of the database.
  Version      char      -- Version of the database described by this warehouse dataset.
  LoadDate     datetime  -- Date that this version of the data set was loaded into the warehouse.
  ReleaseDate  char      -- Date that the original database was first released.
  HomeURL      char      -- Web address of the home page of the original database.
  QueryURL     char      -- A URL at which we can retrieve objects in this database via the WWW
                         -- by substituting the unique ID of the object we wish to retrieve
                         -- for the string "%s" in the query URL. An example URL is
                         -- http://database.university.edu/get-object/5050%s
)
23
BioSource Table
CREATE TABLE BioSource (
  WID               number  -- Warehouse identifier for biological source.
  TaxonWID          number  -- Warehouse GUID identifying WID of Taxon entry for the species.
  Name              char    -- Informal name assigned to this source.
  Strain            char    -- Strain of organism from which object is derived.
  Organ             char    -- Organ of organism from which object is derived.
  Organelle         char    -- Organelle of organism from which object is derived.
  Tissue            char    -- Tissue of organism from which object is derived.
  CellType          char    -- Cell type of organism from which object is derived.
  CellLine          char    -- Cell line from which object is derived, if applicable.
  DevelopmentStage  char    -- Stage of development associated with the object.
  Sex               char    -- Sex of the organism from which object is derived.
)
25
Taxon Table
CREATE TABLE Taxon (
  WID                 number  -- Warehouse ID for this Taxon.
  ParentWID           number  -- Warehouse ID of the parent of this taxon.
  Name                char    -- Taxonomic name of this taxon.
  Rank                char    -- Rank of this taxon (kingdom, superkingdom ..)
  DivisionWID         number  -- Warehouse ID of the division this taxon belongs to.
  InheritedDivision   char    -- 'T' if division is inherited from parent, else 'F'.
  GencodeWID          number  -- Warehouse ID of the genetic code for this taxon.
  InheritedGencode    char    -- 'T' if gencode is inherited from parent, else 'F'.
  MCGencodeWID        number  -- Warehouse ID of the mitochondrial genetic code for this taxon.
  InheritedMCGencode  char    -- 'T' if the mitochondrial gencode is inherited from parent, else 'F'.
  DataSetWID          number  -- Reference to the data set from which the entity came.
)
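
Because each Taxon row points at its parent through ParentWID, a full lineage can be recovered with one recursive query. The sketch below uses Python with SQLite and a reduced table; the toy rows (which skip intermediate ranks) and the function names are assumptions for illustration.

```python
import sqlite3

def build_demo():
    """Toy Taxon table with only the columns needed to walk the hierarchy."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Taxon (WID INTEGER PRIMARY KEY, ParentWID INTEGER,
                            Name TEXT, Rank TEXT);
        INSERT INTO Taxon VALUES
            (1, NULL, 'Bacteria', 'superkingdom'),
            (2, 1,    'Proteobacteria', 'phylum'),
            (3, 2,    'Escherichia coli', 'species');
    """)
    return conn

def lineage(conn, taxon_wid):
    """Return taxon names from the root down to the requested taxon."""
    rows = conn.execute("""
        WITH RECURSIVE up(WID, ParentWID, Name, depth) AS (
            SELECT WID, ParentWID, Name, 0 FROM Taxon WHERE WID = ?
            UNION ALL
            SELECT t.WID, t.ParentWID, t.Name, up.depth + 1
            FROM Taxon AS t
            JOIN up ON t.WID = up.ParentWID
                   AND t.WID <> up.WID   -- guard against a root that is its own parent
        )
        SELECT Name FROM up ORDER BY depth DESC
    """, (taxon_wid,))
    return [name for (name,) in rows]
```

The walk stops when ParentWID is NULL or when a taxon lists itself as its parent.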
27
NucleicAcid Table
CREATE TABLE NucleicAcid (
  WID               number  -- Warehouse identifier for this NucleicAcid.
  Name              char    -- Name or description of this molecule.
                            -- Ex (CMR): 'Chromosome II Brucella melitensis 16M'.
  Type              char    -- Enumeration: 'DNA', 'RNA' or 'NA'.
  Class             char    -- Enumeration (RNA subtype, replicon type etc.)
  Topology          char    -- Enumeration: 'circular', 'linear' or 'other'.
  Strandedness      char    -- Enumeration indicating whether Nucleic Acid is single stranded,
                            -- double stranded or mixed stranded.
  Fragment          char    -- 'T' if this is a fragment of a molecule,
                            -- 'F' if this NucleicAcid describes an entire molecule.
  FullySequenced    char    -- 'T' if the molecule is completely sequenced within this dataset,
                            -- else 'F'.
  MoleculeLength    int32   -- The length of the molecule, in nucleotides. This value can be an
                            -- approximation. Even if this value is known to be exact,
                            -- it may not necessarily be identical to that of TotalLength,
                            -- as the molecule
  CumulativeLength  int32   -- The cumulative number of nucleotides for Subsequences referenced
                            -- by this NucleicAcid entry, whether contiguous or not. This value
                            -- is a summation of the number of nucleotides for these Subsequences.
                            -- If the molecule is completely sequenced, this value should be
                            -- identical to that of MoleculeLength; in this case, both fields
                            -- are populated.
  GeneticCodeWID    number  -- References the genetic code of this molecule.
  BioSourceWID      number  -- References the biological source of this molecule.
  DataSetWID        number  -- References the data set from which the entity came.
)
29
SubSequence Table
CREATE TABLE Subsequence (
  WID                number  -- Warehouse identifier for this Subsequence.
  NucleicAcidWID     number  -- References the containing nucleic acid molecule.
  FullSequence       char    -- 'T' if this is the complete sequence of the nucleic acid molecule
                             -- referenced by NucleicAcidWID, else 'F'.
  Sequence           string  -- Nucleotide sequence of this Subsequence.
  Length             int32   -- Number of nucleotides (and characters) in Sequence.
  LengthApproximate  char    -- Enumeration; implies that Length approximates actual Sequence length:
                             -- 'gt' for greater than,
                             -- 'lt' for less than, or
                             -- 'ne' for not equal.
  PercentGC          real    -- Percentage of Sequence nucleotides that are either guanine or cytosine.
  Version            char    -- Dataset-specific information to indicate the version of this Subsequence.
  DataSetWID         number  -- Reference to the data set from which the entity came.
)
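
The derived columns Length and PercentGC follow directly from Sequence. A minimal sketch of how a loader might compute them is shown here; the function name and the choice to return None for an empty sequence are assumptions, not documented loader behavior.

```python
def subsequence_stats(sequence):
    """Return (Length, PercentGC) for a nucleotide string.

    PercentGC is the percentage of characters that are G or C,
    or None when the sequence is empty (an assumption of this sketch).
    """
    seq = sequence.upper()
    length = len(seq)
    gc_count = sum(1 for base in seq if base in "GC")
    percent_gc = 100.0 * gc_count / length if length else None
    return length, percent_gc
```

For instance, `subsequence_stats("ATGC")` counts two G/C bases out of four.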
31
Gene Table
CREATE TABLE Gene (
  WID                number  -- Warehouse identifier for this gene.
  Name               char    -- Name of the gene.
  NucleicAcidWID     number  -- Reference to the NucleicAcid molecule this gene resides upon.
  SubsequenceWID     number  -- Reference to the Subsequence containing the nucleotide sequence
                             -- of this Gene.
  Type               char    -- Describes the type of molecule which is known to be ultimately
                             -- produced by a gene; enumerated values (pre-mRNA, rRNA, tRNA, etc).
  GenomeID           char    -- Unique ID assigned to this gene, such as by a genome project.
  CodingRegionStart  int32   -- Base position of start of coding region. Start is always less
                             -- than End, except for genes that wrap around the origin of a
                             -- circular chromosome.
  CodingRegionEnd    int32   -- Base position of end of coding region.
  Direction          char    -- Direction of transcription: 'F' for forward (5' to 3'),
                             -- 'R' for reverse/complement (3' to 5').
  Interrupted        char    -- 'T' if the gene is interrupted, else 'F'.
  DataSetWID         number  -- Reference to the data set from which the entity came.
)
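
The wrap-around rule for CodingRegionStart and CodingRegionEnd matters when computing the length of a coding region. The sketch below assumes 1-based, inclusive coordinates and takes the replicon length as an argument (for example from NucleicAcid.MoleculeLength); both are assumptions, since the slide does not state the coordinate convention.

```python
def coding_region_length(start, end, molecule_length=None):
    """Length in bases of a coding region, allowing wrap-around.

    Assumes 1-based inclusive positions. When start > end the gene is taken
    to wrap around the origin of a circular molecule, so the molecule
    length is required.
    """
    if start <= end:
        return end - start + 1
    if molecule_length is None:
        raise ValueError("molecule_length is required when start > end")
    # Bases from start to the end of the molecule, plus bases 1..end.
    return (molecule_length - start + 1) + end
```

A gene running from base 95 to base 5 on a 100-base circular replicon covers bases 95..100 and 1..5.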
33
Protein Table
CREATE TABLE Protein (
  WID                  number  -- The warehouse ID of this protein.
  Name                 char    -- Common name of the protein, if it exists.
  AASequence           string  -- Amino-acid sequence for this protein.
  Length               int32   -- Length of the amino-acid sequence for this protein.
  Charge               int16   -- Charge of the protein.
  Fragment             char    -- 'T' if protein is a fragment, else 'F'.
  MolecularWeightCalc  real    -- Molecular weight calculated from sequence.
                               -- Units: Daltons.
  MolecularWeightExp   real    -- Molecular weight determined through experimentation.
                               -- Units: Daltons.
  PICalc               char    -- pI calculated from its sequence.
  PIExp                char    -- pI value determined through experimentation.
)
35
Feature Table
CREATE TABLE Feature (
  WID                   number  -- Warehouse identifier of this feature.
  Description           char    -- Describes this feature.
  Type                  char    -- Type of feature. These are the original type values.
                                -- Example: "Promoter CYZ".
  Class                 char    -- Class of feature. Assigns our typology for types of features,
                                -- or qualifiers associated with features. These are our own
                                -- enumerated type values, to allow us to classify features
                                -- without losing the original (author-provided) type values
                                -- stored in Type. Example: ClassPromoter.
  SequenceType          char    -- Enumeration (see also SequenceWID) that indicates whether the
                                -- sequence is protein or nucleic and how the sequence
                                -- (if available) is represented.
                                -- If 'P', feature resides on a protein.
                                -- If 'S' or 'N', feature resides on a nucleic acid.
  SequenceWID           number  -- References a Protein or a Subsequence containing the sequence
                                -- contents.
                                -- SequenceType of 'S' implies SequenceWID is nonNULL and
                                -- references a Subsequence; the sequence is Subsequence.Sequence
                                -- (i.e., it is stored explicitly).
                                -- SequenceType of 'N' implies SequenceWID (if nonNULL) references
                                -- a NucleicAcid; the sequence is the substring of
                                -- Subsequence.Sequence from StartPosition to EndPosition, where
                                -- Subsequence is the full Subsequence of the nucleic acid.
                                -- SequenceType of 'P' implies SequenceWID (if nonNULL) references
                                -- a Protein; the sequence is the substring of Protein.AASequence
                                -- from StartPosition to EndPosition.
  StartPosition         int32   -- Start position of the feature within the NucleicAcid or
                                -- Protein sequence.
  EndPosition           int32   -- End position of the feature within the NucleicAcid or
                                -- Protein sequence.
  ExperimentalSupport   char    -- 'T' if the feature is supported by experimental evidence,
                                -- else 'F'.
  ComputationalSupport  char    -- 'T' if the feature is supported by computational evidence,
                                -- else 'F'.
)
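
The three SequenceType cases amount to a small dispatch when a client wants the actual sequence of a feature. The sketch below models the tables as plain dictionaries keyed by WID and assumes 1-based, inclusive positions; the dictionary layout and the function name are assumptions for illustration.

```python
def feature_sequence(feature, subsequences, full_sequences, proteins):
    """Resolve the sequence of a feature according to its SequenceType.

    subsequences:   Subsequence WID -> Sequence             (used for 'S')
    full_sequences: NucleicAcid WID -> full Sequence string (used for 'N')
    proteins:       Protein WID -> AASequence               (used for 'P')
    Positions are assumed to be 1-based and inclusive.
    """
    kind, wid = feature["SequenceType"], feature["SequenceWID"]
    if wid is None:
        return None  # no sequence is available for this feature
    if kind == "S":
        return subsequences[wid]  # stored explicitly
    if kind == "N":
        source = full_sequences[wid]
    elif kind == "P":
        source = proteins[wid]
    else:
        raise ValueError(f"unknown SequenceType: {kind!r}")
    return source[feature["StartPosition"] - 1 : feature["EndPosition"]]
```

Only the 'S' case reads a stored sequence as-is; the 'N' and 'P' cases slice a longer sequence using StartPosition and EndPosition.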
36
Queries to BioWarehouse CMR
Which organisms have a trpA gene?
(bwhissue-sql
  "SELECT DISTINCT biosource.name
   FROM gene, biosourcewidgenewid, biosource
   WHERE gene.name = 'trpA'
     AND gene.wid = biosourcewidgenewid.genewid
     AND biosource.wid = biosourcewidgenewid.biosourcewid")
-- 71 organisms are returned
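
For readers without a warehouse at hand, the same three-table join can be tried against a toy SQLite database. The table and column names follow the query on this slide; the reduced column sets, the toy rows, and the helper names are assumptions, and explicit JOIN syntax is used in place of the comma-separated FROM list.

```python
import sqlite3

GENE_TO_ORGANISM_QUERY = """
    SELECT DISTINCT BioSource.Name
    FROM Gene
    JOIN BioSourceWIDGeneWID AS link ON link.GeneWID = Gene.WID
    JOIN BioSource ON BioSource.WID = link.BioSourceWID
    WHERE Gene.Name = ?
    ORDER BY BioSource.Name
"""

def build_demo():
    """Toy data: three organisms, four genes, one linking table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE BioSource (WID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE Gene (WID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE BioSourceWIDGeneWID (BioSourceWID INTEGER, GeneWID INTEGER);
        INSERT INTO BioSource VALUES (1, 'Organism A'), (2, 'Organism B'), (3, 'Organism C');
        INSERT INTO Gene VALUES (10, 'trpA'), (11, 'trpA'), (12, 'trpB'), (13, 'trpA');
        INSERT INTO BioSourceWIDGeneWID VALUES (1, 10), (2, 11), (3, 12), (1, 13);
    """)
    return conn

def organisms_with_gene(conn, gene_name):
    """Names of the organisms that have at least one gene with this name."""
    return [name for (name,) in conn.execute(GENE_TO_ORGANISM_QUERY, (gene_name,))]
```

DISTINCT keeps an organism from being listed twice when it carries two genes with the same name.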
37
Queries to BioWarehouse CMR
(loop for (WID num) in (bwhissue-sql
                         "SELECT biosourceWID, count(*)
                          FROM nucleicacid
                          WHERE datasetWID = 3 AND type = 'DNA'
                          GROUP BY biosourceWID")
      append (when WID
               (list (list WID
                           (caar (bwhissue-sql
                                   (format nil "SELECT name FROM biosource WHERE wid = ~A"
                                           WID)))
                           num))))
("202425" "1")
("202426" "4")
("202427" "4")
("202428" "1")
("202429" "1")
("202430" "1")
("202431" "1")
("202432" "1")
("202433" "22")
38
Queries to BioWarehouse CMR
("202433" "Borrelia burgdorferi B31" 22)
("202479" "Nostoc sp. PCC 7120" 7)
("202426" "Agrobacterium tumefaciens C58 Cereon" 4)
("202427" "Agrobacterium tumefaciens C58 UWash" 4)
("202450" "Deinococcus radiodurans R1" 4)
("202451" "Enterococcus faecalis V583" 4)
("202528" "Yersinia pestis CO92" 4)
("202437" "Buchnera sp. APS" 3)
("202457" "Halobacterium sp. NRC-1" 3)
("202465" "Mesorhizobium loti MAFF303099" 3)
("202467" "Methanococcus jannaschii DSM2661" 3)
("202485" "Pseudomonas syringae DC3000" 3)
("202497" "Sinorhizobium meliloti 1021" 3)
("202510" "Streptomyces coelicolor A3(2)" 3)
("202527" "Xylella fastidiosa 9a5c" 3)
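
The slides obtain this listing with a Lisp loop that issues one name lookup per organism. As an alternative sketch (not the presenters' method), the count and the name can be fetched with one JOIN and GROUP BY. The toy SQLite data, the reduced column sets, and the function names below are assumptions.

```python
import sqlite3

def build_demo():
    """Toy NucleicAcid and BioSource tables with only the columns used here."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE BioSource (WID INTEGER PRIMARY KEY, Name TEXT);
        CREATE TABLE NucleicAcid (WID INTEGER PRIMARY KEY, Type TEXT,
                                  BioSourceWID INTEGER, DataSetWID INTEGER);
        INSERT INTO BioSource VALUES (1, 'Organism A'), (2, 'Organism B');
        INSERT INTO NucleicAcid VALUES
            (100, 'DNA', 1, 3), (101, 'DNA', 1, 3), (102, 'RNA', 1, 3),
            (103, 'DNA', 2, 3), (104, 'DNA', NULL, 3), (105, 'DNA', 2, 4);
    """)
    return conn

def replicon_counts(conn, dataset_wid):
    """(BioSource WID, name, number of DNA molecules), largest count first."""
    rows = conn.execute("""
        SELECT b.WID, b.Name, COUNT(*) AS n
        FROM NucleicAcid AS na
        JOIN BioSource AS b ON b.WID = na.BioSourceWID
        WHERE na.DataSetWID = ? AND na.Type = 'DNA'
        GROUP BY b.WID, b.Name
        ORDER BY n DESC, b.Name
    """, (dataset_wid,))
    return rows.fetchall()
```

The inner join drops molecules whose BioSourceWID is NULL, which plays the same role as the `(when WID ...)` test in the loop.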
39
Queries to BioWarehouse CMR
(bwhissue-sql "SELECT * FROM nucleicacid
               WHERE biosourcewid = 202433 AND type = 'DNA'")
("chromosome Borrelia burgdorferi B31" "DNA" "chromosome" "linear")
("cp26(plasmid B) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-1(plasmid P) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-3(plasmid S) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-4(plasmid R) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-6(plasmid M) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-7(plasmid O) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-8(plasmid L) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp32-9(plasmid N) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("cp9(plasmid C) Borrelia burgdorferi B31" "DNA" "plasmid" "circular")
("lp17(plasmid D) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp21(plasmid U) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp25(plasmid E) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp28-1(plasmid F) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp28-2(plasmid G) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp28-3(plasmid H) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp28-4(plasmid I) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp36(plasmid K) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp38(plasmid J) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp5(plasmid T) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp54(plasmid A) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
("lp56(plasmid Q) Borrelia burgdorferi B31" "DNA" "plasmid" "linear")
40
Dashboard Interface to BioWarehouse
42
Accessing the Warehouse