Databases - PowerPoint PPT Presentation

About This Presentation
Title:

Databases


Slides: 34
Provided by: CurtisJ

Transcript and Presenter's Notes

Title: Databases


1
Databases
2
Database Federation
3
Definitions
  • Data warehouse
      • aggregation and summation of data sources
      • used to quickly answer very specific questions
  • Database federation
      • richer schema reflecting more of the source schemas
      • optimized for general rather than specific queries

4
Federation Considerations
  • Advantages
      • One-stop shopping: a single query (and query
        language), with multiple resources brought together
      • Automatic result aggregation
  • Technical complexities
      • Updating from source databases
      • Unifying disparate schemas

5
Designing a Data Federation
  • Concrete vs. virtual
      • How is federation to take place?
  • Data sources
      • Are sources curated?
      • Include boutique databases?
  • Schema integration
      • How do schemas map to the federated schema?

6
Concrete Database Federation
7
Virtual Database Federation
Query Gateway
8
Comparison
Concrete
  • Pros: quicker, single schema
  • Cons: bulkier, maintenance issues
Virtual
  • Pros: less disk intensive, less maintenance
  • Cons: slower, fault tolerance needed

9
Data Sources
  • Curated databases increase the level of trust you
    can have in the data.
  • Uncurated databases often have the latest and
    greatest data.
  • Boutique databases are small, specialty databases
    that may add a modicum of knowledge.

10
Schema Integration
  • Decide what data is important to federate
  • Decide what fields correspond to other fields
  • Decide how to merge duplicate records (and decide
    what is a duplicate record).
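
The three decisions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source names, field names, and the duplicate rule (same symbol and species, ignoring case) are hypothetical and do not come from any real database.

```python
# Hypothetical field mappings: source field name -> federated field name.
# Fields that are not listed are treated as "not important to federate".
FIELD_MAPS = {
    "source_a": {"gene_symbol": "symbol", "organism": "species", "desc": "description"},
    "source_b": {"name": "symbol", "taxon": "species", "summary": "description"},
}

def to_federated(source, record):
    """Rename a source record's fields to the federated schema, dropping unmapped ones."""
    mapping = FIELD_MAPS[source]
    return {fed: record[src] for src, fed in mapping.items() if src in record}

def duplicate_key(record):
    """Assumed duplicate rule: same symbol and species, ignoring case."""
    return (record["symbol"].lower(), record["species"].lower())

def merge(records):
    """Merge duplicates; the first non-empty value seen for each field wins."""
    merged = {}
    for rec in records:
        target = merged.setdefault(duplicate_key(rec), {})
        for key, value in rec.items():
            if value and not target.get(key):
                target[key] = value
    return list(merged.values())

rows = [
    to_federated("source_a", {"gene_symbol": "Gld",
                              "organism": "Drosophila melanogaster",
                              "desc": ""}),
    to_federated("source_b", {"name": "GLD",
                              "taxon": "drosophila melanogaster",
                              "summary": "glucose dehydrogenase",
                              "internal_id": 7}),
]
federated = merge(rows)
print(federated)
```

Here the two source records collapse into one federated record, and the empty description from the first source is filled in from the second. A real federation would need a more careful merge policy, for example preferring curated sources over uncurated ones.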

11
Concrete Federation: IGD
  • Integrated Genome Database
      • federated genomic data from multiple genome
        databases
      • used ACEDB as db engine
      • driven by most genome projects using ACEDB
  • ran out of room
      • too many records used up all the inodes on system
  • slow and clunky

12
Virtual Federation: ENQUire
  • Extensible Network Query Unifier
  • WWW-based front end
  • single generic query split into db-specific query
    and sent out over network
  • results merged according to type (genomic,
    sequence, etc.)
  • used as basis for NCSA Biology Workbench
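
The split-and-merge pattern described above can be sketched as follows. This is not ENQUire's code: the two "databases" are hypothetical in-memory functions standing in for remote services, and the query translation (upper-casing or lower-casing the term) is a placeholder for a real db-specific rewrite.

```python
from collections import defaultdict

def genome_db(term):
    """Hypothetical genome database; expects an upper-case locus name."""
    data = {"GLD": ["locus Gld on 3R"]}
    return [("genomic", hit) for hit in data.get(term, [])]

def sequence_db(term):
    """Hypothetical sequence database; expects a lower-case keyword."""
    data = {"gld": ["sequence record for Gld mRNA"]}
    return [("sequence", hit) for hit in data.get(term, [])]

# Each source: (translate generic query to native form, function that runs it).
SOURCES = {
    "genome_db": (str.upper, genome_db),
    "sequence_db": (str.lower, sequence_db),
}

def federated_query(term):
    """Send one generic query to every source and merge the hits by result type."""
    merged = defaultdict(list)
    failed = []
    for name, (translate, run) in SOURCES.items():
        try:
            for result_type, hit in run(translate(term)):
                merged[result_type].append((name, hit))
        except Exception:
            # A source that is down should not sink the whole query.
            failed.append(name)
    return dict(merged), failed

results, failed = federated_query("Gld")
print(results, failed)
```

The `failed` list reflects the fault tolerance a virtual federation needs: the gateway reports which sources did not answer instead of failing the entire query.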

13
Semi-Virtual Federation: SRS
  • Sequence Retrieval System
  • federated databases stored locally, with a query
    integration engine acting as middleware
  • good performance (no network hit)
  • simplified maintenance (native format dbs)
  • basis for Lion Biology Analysis Environment
  • Similar to DBGET

14
MOBY
  • Model Organism Bring Your (Own Schema)
  • MOBY is a system through which a client will be
    able to interact with multiple sources of
    biological data regardless of the underlying
    format or schema. The system also allows for the
    dynamic identification of new relationships
    between data from different sources.

15
(No Transcript)
16
  • "A classification is not a neutral hat rack; it
    expresses a theory of relationships that controls
    our concepts."
      • Stephen Jay Gould, Ever Since Darwin

17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
(No Transcript)
21
Ontologies
  • "An ontology is a specification of a
    conceptualization."
  • In other words, a hierarchical mapping of
    concepts within a given frame of reference.

22
Gene Ontology
  • The goal of the Gene Ontology (GO) Consortium is
    to produce a controlled vocabulary that can be
    applied to all organisms even as knowledge of
    gene and protein roles in cells is accumulating
    and changing.
  • http://www.geneontology.org/

23
What GO Is
  • A collaborative effort to address the need for
    consistent descriptions of gene products in
    different databases
  • Three structured, controlled vocabularies
    (ontologies) that describe gene products in a
    species-independent manner
  • Uniform query facilitator

24
What GO Is Not
  • GO is not a database of gene sequences, nor a
    catalog of gene products. Rather, GO describes
    how gene products behave in a cellular context.
  • GO is not a 'federated solution'. Sharing
    vocabulary is a step towards unification, but is
    not, in itself, sufficient. Reasons for this
    include the following.
  • Knowledge changes and updates lag behind.
  • Individual curators evaluate data differently.
  • GO does not attempt to describe every aspect of
    biology.

25
GO Categories
  • Molecular Function Ontology
      • the tasks performed by individual gene products;
        examples are carbohydrate binding and ATPase
        activity
  • Biological Process Ontology
      • broad biological goals, such as mitosis or purine
        metabolism, that are accomplished by ordered
        assemblies of molecular functions
  • Cellular Component Ontology
      • subcellular structures, locations, and
        macromolecular complexes; examples include
        nucleus, telomere, and origin recognition complex

26
GO Structure
27
<go:term rdf:about="http://www.geneontology.org/go#GO:0008708" n_associations="0">
  <go:accession>GO:0008708</go:accession>
  <go:name>glucose dehydrogenase activity</go:name>
  <go:definition>Catalysis of the reaction: D-glucose + acceptor = D-glucono-1,5-lactone + reduced acceptor.</go:definition>
  <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0016902" />
  <go:dbxref rdf:parseType="Resource">
    <go:database_symbol>EC</go:database_symbol>
    <go:reference>1.1.99.-</go:reference>
  </go:dbxref>
</go:term>
28
<go:term rdf:about="http://www.geneontology.org/go#GO:0008708" n_associations="4">
  <go:accession>GO:0008708</go:accession>
  <go:name>glucose dehydrogenase activity</go:name>
  <go:definition>Catalysis of the reaction: D-glucose + acceptor = D-glucono-1,5-lactone + reduced acceptor.</go:definition>
  <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0016902" />
  <go:dbxref rdf:parseType="Resource">
    <go:database_symbol>EC</go:database_symbol>
    <go:reference>1.1.99.-</go:reference>
  </go:dbxref>
  <go:association rdf:parseType="Resource">
    <go:evidence evidence_code="ISS">
      <go:dbxref rdf:parseType="Resource">
        <go:database_symbol>FB</go:database_symbol>
        <go:reference>FBrf0141274</go:reference>
      </go:dbxref>
    </go:evidence>
    <go:gene_product rdf:parseType="Resource">
      <go:name>CG9517</go:name>
      <go:dbxref rdf:parseType="Resource">
        <go:database_symbol>fb</go:database_symbol>
        <go:reference>FBgn0030591</go:reference>
      </go:dbxref>
    </go:gene_product>
  </go:association>
</go:term>
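
A GO term record in this RDF/XML form can be read with Python's standard library. The sketch below is illustrative only: the enclosing go:go wrapper element and the `go` namespace URI are assumptions needed to make the fragment parse on its own, and the record is a shortened copy of the term shown on the slide.

```python
import xml.etree.ElementTree as ET

GO_NS = "http://www.geneontology.org/dtds/go.dtd#"   # assumed namespace URI
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Shortened term record; the <go:go> wrapper is an assumption so that the
# namespace prefixes are declared and the fragment is well-formed.
DOC = f"""<go:go xmlns:go="{GO_NS}" xmlns:rdf="{RDF_NS}">
  <go:term rdf:about="http://www.geneontology.org/go#GO:0008708" n_associations="0">
    <go:accession>GO:0008708</go:accession>
    <go:name>glucose dehydrogenase activity</go:name>
    <go:is_a rdf:resource="http://www.geneontology.org/go#GO:0016902" />
    <go:dbxref rdf:parseType="Resource">
      <go:database_symbol>EC</go:database_symbol>
      <go:reference>1.1.99.-</go:reference>
    </go:dbxref>
  </go:term>
</go:go>"""

ns = {"go": GO_NS, "rdf": RDF_NS}
root = ET.fromstring(DOC)

terms = []
for term in root.findall("go:term", ns):
    terms.append({
        "accession": term.findtext("go:accession", namespaces=ns),
        "name": term.findtext("go:name", namespaces=ns),
        # is_a links point at the parent term; keep only the accession part.
        "parents": [e.get(f"{{{RDF_NS}}}resource").split("#")[-1]
                    for e in term.findall("go:is_a", ns)],
        # Only direct-child dbxrefs, i.e. cross-references of the term itself.
        "xrefs": [(x.findtext("go:database_symbol", namespaces=ns),
                   x.findtext("go:reference", namespaces=ns))
                  for x in term.findall("go:dbxref", ns)],
    })

print(terms)
```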
29
GO Relational Schema
30
Data Mining in Gene Expression
  • Levels of data mining in gene expression studies
      • Classification studies through pattern discovery
      • Knowledge discovery through linkage with a priori
        biological knowledge
      • Hypothesis generation

31
(No Transcript)
32
GO Agent
  • Genes of similar function are often co-expressed.
  • Our GO agent mines the GO database and creates
    a knowledge store of functions for each of our
    gene expression clusters.
  • Cluster (1,3) has 14 genes, 5 of which are
    characterized by GO.
  • Another agent then groups and reports mined
    information.
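
The idea behind the two agents can be sketched as below. This is not the presenters' agent: the gene names, cluster contents, and annotation table are hypothetical, and a plain dictionary stands in for the GO database.

```python
from collections import Counter

# Hypothetical expression clusters: cluster id -> gene names.
CLUSTERS = {(1, 3): ["geneA", "geneB", "geneC", "geneD"]}

# Hypothetical GO annotations: gene name -> molecular function terms.
# geneD is deliberately absent, i.e. not characterized by GO.
GO_ANNOTATIONS = {
    "geneA": ["glucose dehydrogenase activity"],
    "geneB": ["glucose dehydrogenase activity", "carbohydrate binding"],
    "geneC": ["ATPase activity"],
}

def summarize_cluster(genes, annotations):
    """First agent: collect the GO functions known for the genes in one cluster."""
    characterized = [g for g in genes if g in annotations]
    counts = Counter(term for g in characterized for term in annotations[g])
    return {"n_genes": len(genes),
            "n_characterized": len(characterized),
            "functions": counts}

knowledge_store = {cid: summarize_cluster(genes, GO_ANNOTATIONS)
                   for cid, genes in CLUSTERS.items()}

def report(store):
    """Second agent: group and report the mined information per cluster."""
    lines = []
    for cid, summary in store.items():
        lines.append(f"Cluster {cid} has {summary['n_genes']} genes, "
                     f"{summary['n_characterized']} of which are characterized by GO.")
        for term, n in summary["functions"].most_common():
            lines.append(f"  {term}: {n}")
    return lines

print("\n".join(report(knowledge_store)))
```

A function that appears in several genes of the same cluster, such as the one shared by geneA and geneB here, is the kind of signal the co-expression assumption predicts.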

33
Visualization of GO Terms