Title: THE CHALLENGES OF GENOME INFORMATION MANAGEMENT


1
THE CHALLENGES OF GENOME INFORMATION MANAGEMENT
Shamkant B. Navathe (1), sham_at_cc.gatech.edu
in collaboration with Douglas C. Wallace (2), dwallace_at_gmm.gen.emory.edu
(1) Bioengineering Program - Database Group, College of Computing, Georgia Institute of Technology
(2) Center for Molecular Medicine, Emory University School of Medicine
2
Acknowledgements
  • Students
  • Andreas Kogelnik, M.D., Ph.D.
  • Girish Katdare, M.S.
  • Mondira Deb, Ph.D. student (ECE)
  • Ken Kladitis, B.S.
  • Martin Brandon, Ph.D. student
  • Faculty
  • Mike Brown, Ph.D. Asst Prof., Emory
  • Marie Lott, Ph.D., Research Scientist, Emory

3
General Challenges for Data Management in
Biological Applications
  • Collection and Curation
  • Analysis
  • comparative analysis
  • Integration
  • cross linking of data
  • Understanding
  • multiple interpretations
  • Dissemination
  • traditional means / web-based

4
Desired Properties of Proposed Solutions
  • Scalability
  • applicable to large volumes of data
  • high rates of data acquisition
  • No loss of information between systems
  • Constructive approach to data management vs.
    Absolute representation
  • identification
  • accommodation (of viewpoints)
  • manipulation
  • conflict resolution
  • This represents a very tall order for existing
    DBMSs

5
Current State of Affairs in Genetic Data
Management
  • A lot of genetic information is held in non-electronic form
  • Biology is becoming very "data rich"
  • laboratory automation is increasing data output
  • Comprehensive data analysis and interpretation is
    no longer feasible by manual means
  • many databases, many tools, many nomenclatures

6
Human Genome Initiative (HGI)
  • Begun in 1988
  • $200 million/year for 15 years from Congress
  • Capturing, analyzing, and interpreting the collective
    human genetic information across the 24 distinct
    chromosomes (22 autosomes plus X and Y)
  • 3-4 billion nucleotides per genome
  • 100,000-300,000 genes

This initiative was overtaken by industry,
particularly Celera, which announced in Feb. 2001
that it had sequenced the complete genome
within 2 years and had identified about 30K genes.
The gene sets from HGI and Celera do not fully
match, so the exact number of genes is
unknown.
7
Current Genome-related prominent databases
  • GenBank
  • DNA sequence: 1,053,000,000 bases in 1,611,000
    sequences
  • Molecular modeling DB (MMDB) - 3-D structures
  • Online Mendelian Inheritance in Man (OMIM)
  • Clinical phenotypes - 8,700 entries
  • Swiss-Prot
  • Protein sequence: 70,000 proteins
  • Genome Database (GDB) - human genome mapping data
  • Locus-specific databases such as PAHdb, CFTR
  • PIR International and PDB (Protein data bank)

8
Types of Database Content
  • Full genome databases of human and other
    organisms (C. elegans, E. coli, Drosophila,
    mouse, ...)
  • Specialized subject databases
  • TRANSFAC: transcription factors and their binding
    sites
  • REBASE: Restriction Enzyme Resource
  • Derived databases providing annotations and
    novel structuring of the content
  • Protein motif databases (PROSITE)
  • Protein structure-sequence alignment database
    (HSSP)
  • Protein domains (PFAM)
  • Structural Classification of Proteins (SCOP)

9
TYPE OF DATA
  • Sequence data (different sequences for DNA data
    and protein data). Sequences are linked to
    structures, motifs and metabolic pathways
  • Structure and function data
  • Annotations
  • Evolutionary relationships
  • Visual Representations
  • Audio and video data related to phenotypes
    (patient symptoms and behaviors)

Databases of metabolic pathways are intrinsically
complex: nodes represent data derived from
sequences, while edges represent chemical
reactions, which are independent of and unrelated
to sequence (Karp 1998)
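
The node/edge split described in this note can be sketched as a small graph structure. This is only an illustration: the class names, field names, and the two-node toy pathway are assumptions made up for the example and do not come from any pathway database.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SequenceNode:
    """Node backed by sequence data (hypothetical fields)."""
    name: str
    sequence: str


@dataclass(frozen=True)
class ReactionEdge:
    """Edge describing chemistry; it carries no sequence content."""
    source: str
    target: str
    reaction: str


@dataclass
class Pathway:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: SequenceNode) -> None:
        self.nodes[node.name] = node

    def add_edge(self, edge: ReactionEdge) -> None:
        # Refuse edges whose endpoints are not registered nodes.
        if edge.source not in self.nodes or edge.target not in self.nodes:
            raise KeyError("both endpoints must be known nodes")
        self.edges.append(edge)

    def successors(self, name: str) -> list:
        # Names of the nodes reachable from `name` by one reaction.
        return [e.target for e in self.edges if e.source == name]


# Toy data: names, sequences and the reaction are invented placeholders.
pathway = Pathway()
pathway.add_node(SequenceNode("enzyme_1", "MKTAYIAK"))
pathway.add_node(SequenceNode("enzyme_2", "MSLLNVPA"))
pathway.add_edge(ReactionEdge("enzyme_1", "enzyme_2", "X -> Y"))
```

Keeping the two record types separate shows why such databases are hard to model: node attributes can be checked against sequence sources, while edge attributes need an independent source of chemical knowledge.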
10
Nature of Biological Data
  • Representation of biological macromolecules
  • Combined with associated fuzzy information
  • Incomplete and sometimes subjective
  • Open to interpretation
  • Using different nomenclature
  • Quality Control is a major Issue
  • New data is based on experiments without
    confirmation
  • Previous annotations of data may be inherited,
    but may not match with new results
  • Submissions have to be checked for accuracy
  • Same data may occur in multiple submissions.

Question: should genomic and proteomic databases
be passive repositories, or active resources that
provide annotations and links to other databases?
11
Quality control of data
  • Easier for structural data
  • rules of stereo chemistry and protein
    architecture apply
  • Experimental techniques exist to verify structure
  • NMR (nuclear magnetic resonance)
  • X ray Crystallography
  • Classical Tradeoff - whether to make data
    available quickly or whether to wait to verify
    its accuracy before it is made available
  • Existing collections such as ESTs (expressed sequence
    tags) contain a lot of related and possibly
    redundant information.

12
Related Areas of Computer Science/Computational
Science/Computer Technology
  • Algorithm design and algorithm complexity analysis
  • applicable to comparison of sequence data
  • applicable to prediction of protein folding
  • Database modeling
  • entities, relationships, attributes, constraints
  • objects and object references
  • Database design
  • schemas, content/data organization design,
    loading, curating process design
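
As one concrete instance of algorithm design applied to sequence comparison, the sketch below computes the edit (Levenshtein) distance between two sequences by dynamic programming. The slides do not name a specific algorithm, so this choice is an assumption made for illustration; production tools use scoring schemes tuned for biology rather than unit costs.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions
    needed to turn sequence `a` into sequence `b`.

    Runs in O(len(a) * len(b)) time and O(len(b)) memory.
    """
    # prev[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]
```

The quadratic running time is the kind of complexity result that matters here: it is fine for two short sequences but becomes the bottleneck when one sequence is compared against a whole database.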

13
Areas of Computer Science/Computational
Science/Computer Technology (cont.)
  • Knowledge system design
  • incorporation of heuristic rules
  • automated detection of patterns
  • deduction or derivation of new information using
    distributed computing and neural networks
  • Parallel processing and supercomputing
  • Animation and visualization
  • Virtual environments

14
OUR WORK IN THE AREA
  • Started a project around 1993, later named
    MITOMAP, to create a database of the mitochondrial
    genome
  • Resulted in the PhD dissertation of Andreas
    Kogelnik in 1997 on Biological Information
    Management
  • The dissertation included an approach to
    managing all aspects of the mitochondrial genome
  • A system called GENOME was proposed
  • Currently being maintained and further developed
    by Martin Brandon

15
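Human Mitochondrial Map: MITOMAP
http://www.gen.emory.edu
[Figure: circular map of the human mitochondrial genome. Labels recoverable from the slide:
  regions and genes: D-Loop, 12S rRNA, 16S rRNA, ND1, ND2, ND3, ND4, ND4L, ND5, ND6, COI, COII, COIII, ATPase6, ATPase8, Cyt b;
  tRNA genes, by one-letter amino acid code: F, V, L, I, Q, M, W, A, N, C, Y, S, D, K, G, R, H, S, L, E, T, P;
  disease mutations: DEAF 1555, MELAS 3243, LHON 3460, ADPD 4336, MERRF 8344, NARP 8993 / Leigh's 8993, LHON 11778, LDYS 14459, LHON 14484;
  population labels: Africa L; America A, B, C, D; Asia B, F; Europe H.]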
16
GENOME: Georgia Tech Emory Networked Object
Management Environment
  • Focus on mitochondrial genome
  • 16,569 base pairs
  • Develop capabilities for collecting, storing,
    distributing and analyzing the data produced
  • Integration of multiple types of data to create a
    comprehensive research data repository

17
Data Organization Problem: Relational Model of
Data
  • Best for structured information
  • Naturally appealing for tabular data
  • Well founded and mathematically sound theory
    of sets and relations
  • - No accounting of semantics of data
  • - Does not provide simple features like subtyping
    and inheritance
  • - SQL as a language is not powerful enough
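
A minimal sketch of the relational approach, using Python's built-in sqlite3 module. The table and column names are hypothetical and are not the MITOMAP schema; the rows use mutation positions named on the MITOMAP slide, paired with their commonly cited genes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE gene (
    name TEXT PRIMARY KEY,
    -- Subtypes have to be encoded by hand as a discriminator column.
    kind TEXT NOT NULL CHECK (kind IN ('protein', 'tRNA', 'rRNA'))
);
CREATE TABLE mutation (
    position INTEGER NOT NULL,
    disease  TEXT NOT NULL,
    gene     TEXT NOT NULL REFERENCES gene(name),
    PRIMARY KEY (position, disease)
);
""")
conn.executemany(
    "INSERT INTO gene VALUES (?, ?)",
    [("ND1", "protein"), ("ND4", "protein"), ("ATPase6", "protein")],
)
conn.executemany(
    "INSERT INTO mutation VALUES (?, ?, ?)",
    [(3460, "LHON", "ND1"), (11778, "LHON", "ND4"), (8993, "NARP", "ATPase6")],
)

# Tabular questions are easy: which genes carry LHON mutations?
rows = conn.execute("""
    SELECT m.disease, m.position, g.name
    FROM mutation AS m JOIN gene AS g ON g.name = m.gene
    WHERE m.disease = 'LHON'
    ORDER BY m.position
""").fetchall()
```

The query is short and precise, which is the strength listed above. The `kind` column shows the weakness: the model has no notion that a tRNA gene and a protein-coding gene are subtypes with different attributes.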

18
Data Organization Problem: Object-Oriented Model
of Data
  • Captures objects of greater complexity
  • Easier to deal with unstructured information
  • Easier to deal with relationships, behavior and
    interpretation of data
  • - Query languages are not well developed
  • - Schema evolution techniques are lacking
  • - industrial support / experience is much weaker
    compared to relational
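
A minimal sketch of what the object-oriented model adds, namely subtypes that carry their own attributes and behavior. The class hierarchy and field names are assumptions made for illustration, not the GENOME system's classes.

```python
from dataclasses import dataclass


@dataclass
class Gene:
    """Base type shared by every gene (hypothetical model)."""
    name: str

    def describe(self) -> str:
        return f"{self.name}: gene"


@dataclass
class ProteinCodingGene(Gene):
    """Subtype with an attribute that only makes sense for proteins."""
    complex_name: str = ""

    def describe(self) -> str:
        return f"{self.name}: protein-coding gene ({self.complex_name})"


@dataclass
class TRNAGene(Gene):
    """Subtype with an attribute that only makes sense for tRNAs."""
    amino_acid: str = ""

    def describe(self) -> str:
        return f"{self.name}: tRNA gene for {self.amino_acid}"


genes = [
    ProteinCodingGene("ND4", complex_name="Complex I"),
    TRNAGene("tRNA-Lys", amino_acid="lysine"),
]
# One call site, behavior chosen by the object's subtype.
descriptions = [g.describe() for g in genes]
```

Inheritance and per-type behavior come for free here, but answering an ad hoc question over thousands of such objects needs a query facility, which is the weakness listed above.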

19
CASE STUDY OF DDBJ
  • Human Genome Database
  • Maintained in Kyoto, Japan
  • Linked to GenBank and EMBL

20
Schema Levels
21
Conceptual Schema
22
External Schema
23
Data Organization: Our Approach in Initial Design
  • Use of standardized notation
  • ASN.1
  • Tailoring the approach by defining our own
    schemas, classes, properties, and functions
  • Creating an open architecture for future
    expansion/evolution of data models and
    incorporation of databases

24
Data Organization: General Trends
  • Combining relational and O-O features into one
    system
  • Providing system functionality with pre-defined
    classes then adding user defined facilities
  • Active data - use of triggers and rules to
    create new data
  • Allowing support for heterogeneous/federated
    data collections
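
The "active data" trend can be illustrated with a database trigger that creates new data whenever a row is inserted. The sketch uses Python's built-in sqlite3 module; the table names, the trigger name, and the review-queue idea are assumptions for the example, not part of any system described in these slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mutation (position INTEGER, disease TEXT);
CREATE TABLE review_queue (position INTEGER, note TEXT);

-- Rule: every newly submitted mutation also produces a review entry.
CREATE TRIGGER queue_new_mutation
AFTER INSERT ON mutation
BEGIN
    INSERT INTO review_queue (position, note)
    VALUES (NEW.position, 'unconfirmed: ' || NEW.disease);
END;
""")

# The application inserts only into `mutation`;
# the database itself derives the second row.
conn.execute("INSERT INTO mutation VALUES (?, ?)", (3243, "MELAS"))
queued = conn.execute("SELECT position, note FROM review_queue").fetchall()
```

Putting the rule in the database rather than in each client means every submission path gets the same treatment, which suits curated collections with many contributors.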

25
OUR FUTURE APPROACH
  • Data Integration
  • integrate sequence based genomic information with
    mutation/disease related information, functional
    and biochemical information
  • interaction between nuclear and mitochondrial
    genome data using microarray experiments
  • combine with existing mutation databases
  • Long Term Maintainable Repository
  • use standard commercial approaches
  • likely to implement system using Oracle 9i

26
Data Integration
Genomic DBs
Mutational DBs
Protein DBs
Central Repository
27
MITOMAP Data Interactions
Functional Data
Gene-gene Interactions Data
mtDNA Sequence Data
Population Data
population database
Disease Data
28
Complexities vs. Ease of Use
  • Complex objects with ill-defined and uncertain
    data
  • Multiplicity of data types
  • Numbers, text, images, audio, video
  • Easy interface for the scientists and drug
    designers etc.
  • Better interfaces
  • More visualization
  • More animation
  • More user interactivity
  • Varied search paradigms - querying, browsing,
    navigation, interactive exploration

29
Biological Application Challenges
  • Raw Data
  • sequence data, anthropological evolution,
    microarray studies of gene expression, electronic
    patient records
  • Multiple dimensions of information
  • Content-relationships and links among data
  • Incomplete, ill-defined, ill-structured,
    ill-formed information
  • Missing and erroneous information
  • Going beyond raw data to meaningful
    information
  • Extraction - selection
  • Derivation - deduction
  • Exploration - discovery (data mining)

30
Challenges for Database Professionals
  • Learning applications
  • Jargon
  • Process model of the environment
  • Complexities, typical scenarios, rules,
    constraints
  • Apply database techniques to help in application
  • Conceptual modeling
  • Views, indexing, text analysis
  • Specification, normalization, query optimization
  • Apply techniques from outside the database area
  • AI, information retrieval, software engineering,
    user interfaces

31
Challenges for Biomedical Scientists
  • Integration of multiple, disparate data sources
  • Appreciation of data modeling as a precursor to
    information utilization
  • Ability to deal with multiple models, interfaces
    and environments
  • Awareness of the limitations of information
    technology

32
"A lot of biologists and scientists don't realize
that if you build a database and you tolerate 5%
sloppiness in the definition of individual
concepts, when you execute a query that joins
across 15 concepts, you've got less than a 50/50
chance of getting the answer you want." -
Robert Robbins, Director, Applied Research
Laboratory, Johns Hopkins
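
The arithmetic behind this quote can be checked in a few lines. The sketch reads the tolerated sloppiness as a 5% error rate per concept and assumes the errors are independent, so that the chance of a fully correct join is the product of the per-concept chances. Both readings are assumptions; the quote itself does not state a model.

```python
def chance_all_correct(error_rate: float, concepts: int) -> float:
    """Probability that every concept in a join is correctly defined,
    assuming independent errors at the same rate for each concept."""
    return (1.0 - error_rate) ** concepts


# 5% sloppiness per concept, query joining across 15 concepts.
p = chance_all_correct(0.05, 15)
print(f"{p:.3f}")  # about 0.463, i.e. below 50/50
```

Under the same assumptions the break-even point is 13 concepts: 0.95 to the 13th power is still just above one half, and the 14th concept pushes it below.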