Data integration architectures and methodologies
for the Life Sciences
Alexandra Poulovassilis, Birkbeck, U. of London
(transcript of a PowerPoint presentation, 53 slides)
1
Data integration architectures and methodologies
for the Life SciencesAlexandra Poulovassilis,
Birkbeck, U. of London
2
Outline of the talk
  • The problem and challenges faced
  • Historical background
  • Main Data Integration approaches in the Life
    Sciences
  • Our work
  • Materialised and Virtual DI
  • Future directions
  • ISPIDER Project
  • Bioinformatics service reconciliation

3
1. The Problem
  • Given a set of biological data sources, data
    integration (DI) is the process of creating
    an integrated resource which
  • combines data from each of the data sources
  • in order to support new queries and analyses
  • Biological data sources are characterised by
    their high degree of heterogeneity, in terms of
  • data model, query interfaces, query processing
    capabilities, database schema or data exchange
    format, data types used, nomenclature adopted
  • Coupled with the variety, complexity and large
    volumes of biological data, this poses several
    challenges, leading to several methodologies,
    architectures and systems being developed

4
Challenges faced
  • Increasingly large volumes of complex, highly
    varying, biological data are being made available
  • Data sources are developed by different people in
    differing research environments for differing
    purposes
  • Integrating them to meet the needs of new users
    and applications requires reconciliation of their
    heterogeneity w.r.t. content, data
    representation/exchange and querying
  • Data sources may freely change their format and
    content without considering the impact on any
    integrated derived resources
  • Integrated resources may themselves become data
    sources for high-level integrations, resulting in
    a network of dependencies

5
Biological data: Genes → Proteins → Biological
Function
  • Genome: DNA, sequences of 4 bases (A, C, G, T);
    a gene is a permanent copy
  • RNA: a temporary copy of a DNA sequence
  • Protein: a sequence of 20 amino acids, the
    gene's product (each triple of RNA bases encodes
    an amino acid)
  • Biological processes: the protein's function,
    i.e. its job
This slide is adapted from Nigel Martin's Lecture
Notes on Bioinformatics
6
Varieties of Biological Data
  • genomic data
  • gene expression data (DNA → proteins) and gene
    function data
  • protein structure and function data
  • regulatory pathway data: how gene expression is
    regulated by proteins
  • cluster data: similarity-based clustering of
    genes or proteins
  • proteomics data: from experiments on separating
    proteins produced by organisms into peptides, and
    on protein identification
  • phylogenetic data: evolution of genomic, protein
    and function data
  • data on genomic variations in species
  • semi-structured/unstructured data: e.g. medical
    abstracts

7
Some Key Application Areas for DI
  • Integrating, analysing and annotating genomic
    data
  • Predicting the functional role of genes and
    integrating function- specific information
  • Integrating organism-specific information
  • Integrating protein structure and pathway data
    with gene expression data, to support functional
    genomics analysis
  • Integrating, analysing and annotating proteomics
    data sources
  • Integrating phylogenetic data sources for
    genealogy research
  • Integrating data on genomic variations to analyse
    health impact
  • Integrating genomic, proteomic and clinical data
    for personalised medicine

8
2. Historical Background
  • One possible approach would be to encode
    transformation/ integration functionality in the
    application programs
  • However, this may be a complex and lengthy
    process, and may affect robustness,
    maintainability, extensibility
  • This has motivated the development of generic
    architectures and methodologies for DI, which
    abstract out this functionality from application
    programs into generic DI software
  • Much work has been done since the 1990s
    specifically in biological DI
  • Many systems have been developed e.g.
    DiscoveryLink, Kleisli, Tambis, BioMart, SRS,
    Entrez, that aim to address some of the
    challenges faced

9
3. Main DI Approaches in the Life Sciences
  • Materialised
  • import data into a DW
  • transform and aggregate the imported data
  • query the DW via the DBMS
  • Virtual
  • specify the integrated schema
  • wrap the data sources, using wrapper software
  • construct mappings between data sources and IS
    using mediator software
  • query the integrated schema
  • mediator software coordinates query evaluation,
    using the mappings and wrappers
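The virtual approach above can be illustrated with a toy wrapper/mediator sketch in Python. All class names, field names and data here are hypothetical and purely illustrative, not taken from any real DI system: each wrapper maps its source's local field names to the integrated schema, and the mediator fans a global query out to every wrapper and unions the results.

```python
# Minimal virtual-DI sketch: wrappers present heterogeneous sources
# through a uniform interface; the mediator coordinates evaluation.

class Wrapper:
    """Presents one source uniformly, translating field names."""
    def __init__(self, records, field_map):
        self.records = records        # the source's raw data
        self.field_map = field_map    # source field -> global field

    def query(self, global_field, value):
        inverse = {v: k for k, v in self.field_map.items()}
        src_field = inverse[global_field]
        return [{self.field_map[k]: v for k, v in r.items()}
                for r in self.records if r[src_field] == value]

class Mediator:
    """Evaluates global queries by fanning out to all wrappers."""
    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, global_field, value):
        results = []
        for w in self.wrappers:
            results.extend(w.query(global_field, value))
        return results

# Two sources using different field names for the same concepts
w1 = Wrapper([{"acc": "P12345", "seq": "MKT"}],
             {"acc": "id", "seq": "sequence"})
w2 = Wrapper([{"uid": "P12345", "aa": "MKV"}],
             {"uid": "id", "aa": "sequence"})
m = Mediator([w1, w2])
print(m.query("id", "P12345"))  # records from both sources, in global terms
```

The essential point is that users query only the integrated vocabulary ("id", "sequence"); the wrappers hide each source's local naming and format.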

10
Main DI Approaches in the Life Sciences
  • Link-based
  • no integrated schema
  • users submit simple queries to the integration
    software e.g. via web-based user interface
  • queries are formulated w.r.t. the data sources,
    as selected by the user
  • the integration software provides additional
    capabilities for
  • facilitating query formulation e.g.
    cross-references are maintained between different
    data sources and used to augment query results
    with links to other related data
  • speeding up query evaluation e.g. indexes are
    maintained supporting efficient keyword based
    search
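The link-based idea of augmenting results with cross-references can be sketched in a few lines of Python. All identifiers and cross-reference entries below are hypothetical, for illustration only:

```python
# Link-based integration sketch: no integrated schema; keyword search
# results are augmented with maintained cross-references to entries
# in other sources.

# Maintained cross-reference table: (source, id) -> related entries
xrefs = {("uniprot", "P12345"): [("pdb", "1ABC"), ("kegg", "hsa:1234")]}

def search(source, records, keyword):
    """Keyword search over one source, augmenting hits with links."""
    hits = [r for r in records if keyword in r["description"]]
    for r in hits:
        r["links"] = xrefs.get((source, r["id"]), [])  # add related links
    return hits

uniprot = [{"id": "P12345", "description": "kinase domain protein"}]
print(search("uniprot", uniprot, "kinase"))
```

A real system would of course back `xrefs` with prebuilt indexes, which is exactly the "speeding up query evaluation" point above.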

11
4. Comparing the Main Approaches
  • Link-based integration is fine if its functionality
    meets users' needs
  • Otherwise materialised or virtual DI is
    indicated
  • both allow the integrated resource to be queried
    as though it were a single data source.
    Users/applications do not need to be aware of
    source schemas/formats/content
  • Materialised DI is generally adopted for
  • better query performance
  • greater ease of data cleaning and annotation
  • Virtual DI is generally adopted for
  • lower cost of storing and maintaining the
    integrated resource
  • greater currency of the integrated resource

12
5. Our work: AutoMed
  • The AutoMed Project at Birkbeck and Imperial
  • is developing tools for the semi-automatic
    integration of heterogeneous information sources
  • can handle both structured and semi-structured
    data
  • provides a unifying graph-based metamodel (HDM)
    for specifying higher-level modelling languages
  • provides a single framework for expressing data
    cleansing, transformation and integration logic
  • the AutoMed toolkit is currently being used for
    biological data integration and p2p data
    integration

13
AutoMed Architecture
Schema and Transformation Repository
Wrapper
Schema Transformation and Integration Tools
Global Query Processor/Optimiser
Model Definitions Repository
Schema Matching Tools
Model Definition Tool
Other Tools, e.g. GUI, schema evolution, DLT
14
AutoMed Features
  • Schema transformations are automatically
    reversible
  • addT/deleteT(c,q) by deleteT/addT(c,q)
  • extendT(c,Range q1 q2) by contractT(c,Range q1
    q2)
  • renameT(c,n,n′) by renameT(c,n′,n)
  • Hence bi-directional transformation pathways
    (more generally transformation networks) are
    defined between schemas
  • The queries within transformations allow
    automatic data and query translation
  • Schemas may be expressed in a variety of
    modelling languages
  • Schemas may or may not have a data source
    associated with them
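The automatic reversibility above can be sketched in Python. The tuple encoding of transformation steps is illustrative, not AutoMed's actual API:

```python
# Sketch of automatic pathway reversal: each primitive transformation
# has a known inverse, so a pathway S1 -> S2, reversed step by step,
# yields a pathway S2 -> S1.

INVERSE = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend",
           "rename": "rename"}

def reverse_step(step):
    kind, args = step
    if kind == "rename":
        construct, old, new = args
        return ("rename", (construct, new, old))  # swap the two names
    return (INVERSE[kind], args)                  # same construct and query

def reverse_pathway(pathway):
    """Reverse the step order and invert each step."""
    return [reverse_step(s) for s in reversed(pathway)]

p = [("add", ("protein", "q1")),
     ("rename", ("gene", "g", "gene_id")),
     ("contract", ("locus", "q2"))]
print(reverse_pathway(p))
```

Because every step carries the query defining the object it adds or deletes, the inverse step can reuse that same query, which is what makes the pathways bi-directional.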

15
AutoMed vs Common Data Model approach
16
6. Materialised DI
17
Some characteristics of Biological DI
  • prevalence of automated and manual annotation of
    data
  • prior, during and after its integration
  • e.g. DAS distributed annotation service GUS data
    warehouse annotation of data origin and data
    derivation
  • importance of being able to trace the provenance
    of data
  • wide variety of nomenclatures adopted
  • greatly increases the difficulty of data
    aggregation
  • has led to many standardised ontologies and
    taxonomies
  • inconsistencies in identification of biological
    entities
  • has led to standardisation efforts e.g. LSID
  • but still a legacy of non-standard identifiers
    present

18
The BioMap Data Warehouse
  • A data warehouse integrating
  • gene expression data
  • protein structure data including
  • data from the Macromolecular Structure Database
    (MSD) from the European Bioinformatics
    Institute (EBI)
  • CATH structural classification data
  • functional data including
  • the Gene Ontology, KEGG
  • hierarchical clustering data, derived from the
    above
  • Aiming to support mining, analysis and
    visualisation of gene expression data

19
BioMap integration approach
20
BioMap architecture
21
Using AutoMed in the BioMap Project
  • Wrapping of the data sources and the DW
  • Automatic translation of source and global
    schemas into AutoMed's XML schema language
    (XMLDSS)
  • Domain experts provide matchings between
    constructs in source and global schemas,
    expressed as rename transformations
  • Automatic schema restructuring and generation of
    transformation pathways
  • Pathways can subsequently be used for
    maintenance and evolution of the DW, and also
    for data lineage tracing
  • See the DILS05 paper for details of the
    architecture and clustering approach

[Figure: BioMap integration architecture. XML files and relational
databases are wrapped (XML Wrapper, RDB Wrappers) and translated into
AutoMed XMLDSS and relational schemas; AutoMed transformation pathways
map these to an AutoMed integrated schema, which is materialised in
the integrated database.]
22
7. Virtual DI
  • The integrated schema may be defined in a
    standard data modelling language
  • Or, more broadly, it may be a source-independent
    ontology
  • defined in an ontology language
  • serving as a global schema for multiple
    potential data sources, beyond the ones being
    integrated, e.g. as in TAMBIS
  • The integrated schema may/may not encompass all
    of the data in the data sources
  • it may be sufficient to capture just the data
    needed for answering key user queries/analyses
  • this avoids the possibly complex and lengthy
    process of creating a complete integrated schema
    and set of mappings

23
Virtual DI Architecture
Wrappers
  • Metadata Repository
  • Data source schemas
  • Integrated schemas
  • Mappings

Schema Integration Tools
Global Query Processor
Global Query Optimiser
24
Degree of Data Source Overlap
  • different systems make different assumptions
    about this
  • some systems assume that each DS contributes a
    different part of the integrated virtual resource
    e.g. K2/Kleisli
  • some systems relax this but do not attempt any
    aggregation of duplicate or overlapping data from
    the DSs e.g. TAMBIS
  • some systems support aggregation at both schema
    and data levels e.g. DiscoveryLink, AutoMed
  • the degree of data source overlap impacts on
    complexity of the mappings and the design effort
    involved in specifying them
  • the complexity of the mappings in turn impacts on
    the sophistication of the global query
    optimisation and evaluation mechanisms that will
    be needed

25
Virtual DI methodologies
  • Top-down
  • integrated schema IS is first constructed
  • or it may already exist from previous integration
    or standardisation efforts
  • the set of mappings M is defined between IS and
    DS schemas

26
Virtual DI methodologies
  • Bottom-up
  • initial version of IS and M constructed e.g. from
    one DS
  • these are incrementally extended/refined by
    considering in turn each of the other DSs
  • for each object O in each DS, M is modified to
    encompass the mapping between O and IS, if
    possible
  • if not, IS is extended as necessary to encompass
    information represented by O, and M is then
    modified accordingly
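The bottom-up loop above can be sketched directly in Python. The data structures and the `can_map` oracle are illustrative stand-ins for the real schema objects and the domain expert's judgement:

```python
# Bottom-up virtual DI sketch: seed the integrated schema IS and the
# mappings M from the first source, then for each object of every
# further source either map it to IS or extend IS to cover it.

def bottom_up(sources, can_map):
    """can_map(obj, IS) returns the IS object that obj maps to,
    or None if obj cannot be expressed over the current IS."""
    IS = set(sources[0])
    M = {o: o for o in sources[0]}      # seed IS and M from source 1
    for ds in sources[1:]:
        for obj in ds:
            target = can_map(obj, IS)
            if target is None:          # not expressible: extend IS
                IS.add(obj)
                target = obj
            M[obj] = target             # record mapping obj -> IS object
    return IS, M

srcs = [["gene", "protein"], ["gene_id", "pathway"]]
IS, M = bottom_up(srcs, lambda o, IS: "gene" if o == "gene_id" else None)
print(sorted(IS), M["gene_id"])
```

The mixed methodology on the next slide follows the same loop, differing only in that IS and M may already exist before the loop starts.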

27
Virtual DI methodologies
  • Mixed Top-down and Bottom-up
  • initial IS may exist
  • initial set of mappings M is specified
  • IS and M may need to be extended/refined by
    considering additional data from the DSs that IS
    needs to capture
  • for each object O in each DS that IS needs to
    capture, M is modified to encompass the mapping
    between O and IS, if possible
  • if not, IS is extended as necessary to encompass
    information represented by O, and M is then
    modified accordingly

28
Defining Mappings
  • Global-as-view (GAV)
  • each schema object in IS defined by a view over
    DSs
  • simple global query reformulation by query
    unfolding
  • view evolution problems if DSs change
  • Local-as-view (LAV)
  • each schema object in a DS defined by a view over
    IS
  • harder global query reformulation using views
  • evolution problems if IS changes
  • Global-local-as-view (GLAV)
  • views relate multiple schema objects in a DS with
    IS
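The GAV case, with its simple query unfolding, can be illustrated in Python. All source names, view definitions and data are hypothetical:

```python
# GAV sketch: each integrated-schema object is defined as a view over
# the sources, so a global query is answered by substituting
# (unfolding) the view definition and evaluating it.

sources = {
    "swissprot": [{"id": "P1", "len": 300}],
    "trembl":    [{"id": "P2", "len": 150}],
}

# GAV mapping: global object -> view (a function over the sources)
gav_views = {
    "protein": lambda src: src["swissprot"] + src["trembl"],
}

def unfold(global_obj, predicate):
    """Answer a selection on a global object by unfolding its view."""
    return [r for r in gav_views[global_obj](sources) if predicate(r)]

print(unfold("protein", lambda r: r["len"] > 200))  # [{'id': 'P1', 'len': 300}]
```

The LAV direction is genuinely harder: there the views define source objects over IS, and answering a global query requires rewriting it using those views rather than simple substitution.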

29
Both-As-View approach supported by AutoMed
  • not based on views between integrated and source
    schemas
  • instead, provides a set of primitive schema
    transformations each adding, deleting or
    renaming just one schema object
  • relationships between source and integrated
    schema objects are thus represented by a pathway
    of primitive transformations
  • add, extend, delete, contract transformations are
    accompanied by a query defining the new/deleted
    object in terms of the other schema objects
  • from the pathways and queries, it is possible to
    derive GAV, LAV and GLAV mappings
  • currently AutoMed supports GAV, LAV and combined
    GAV/LAV query processing

30
Typical BAV Integration Network
                      GS
                       |
US1 --id-- US2 --id-- ... --id-- USi --id-- ... --id-- USn
 |          |                     |                     |
DS1        DS2                   DSi                   DSn
31
Typical BAV Integration Network (contd)
  • On the previous slide
  • GS is a global schema
  • DS1, ..., DSn are data source schemas
  • US1, ..., USn are union-compatible schemas
  • the transformation pathways between each pair DSi
    and USi may consist of add, delete, rename,
    extend and contract primitive transformations,
    operating on any modelling construct defined in
    the AutoMed Model Definitions Repository
  • the transformation pathway between USi and GS is
    similar
  • the transformation pathway between each pair of
    union-compatible schemas consists of id
    transformation steps

32
8. Schema Evolution
  • In biological DI, data sources may evolve their
    schemas to meet the needs of new experimental
    techniques or applications
  • Global schemas may similarly need to evolve to
    encompass new requirements
  • Supporting schema evolution in materialised DI is
    costly requires modifying the ETL and view
    materialisation processes, plus the processes
    maintaining any derived data marts
  • With view-based virtual DI approaches, the sets
    of views that may be affected need to be examined
    and redefined

33
Schema Evolution in BAV
  • BAV supports the evolution of both data source
    and global schemas
  • The evolution of any schema is specified by a
    transformation pathway from the old to the new
    schema
  • For example, the figure on the right shows
    transformation pathways, T, from an old to a new
    global or data source schema

Global Schema S --T--> New Global Schema S′
Data Source Schema S --T--> New Data Source Schema S′
34
Global Schema Evolution
  • Each transformation step t in the pathway T from
    S to S′ is considered in turn
  • if t is an add, delete or rename, then schema
    equivalence is preserved and there is nothing
    further to do (except perhaps optimise the
    extended transformation pathway, using an AutoMed
    tool that does this); the extended pathway can be
    used to regenerate the necessary GAV or LAV views
  • if t is a contract, then there will be information
    present in S that is no longer available in S′;
    again there is nothing further to do
  • if t is an extend, then domain knowledge is
    required to determine if, and how, the new
    construct in S′ could be derived from existing
    constructs; if not, there is nothing further to
    do; if yes, the extend step is replaced by an add
    step
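The per-step rule above can be sketched as a small Python loop. The triple encoding of steps and the `derive` oracle (standing in for domain knowledge) are illustrative:

```python
# Sketch of the global-schema-evolution rule: walk the pathway from
# the old to the new global schema and decide, per step, whether the
# step can stand as-is or an extend can be upgraded to an add.

def process_evolution(pathway, derive):
    """derive(construct) returns a defining query if domain knowledge
    can express the new construct over existing ones, else None."""
    out = []
    for kind, construct, query in pathway:
        if kind == "extend":
            q = derive(construct)
            if q is not None:
                out.append(("add", construct, q))  # now derivable
                continue
        out.append((kind, construct, query))       # unchanged otherwise
    return out

steps = [("rename", "gene", None), ("extend", "pathway", None)]
print(process_evolution(steps,
                        lambda c: "q_kegg" if c == "pathway" else None))
```

Only the extend case needs human input; every other kind of step extends the pathway mechanically, which is why the slide calls those cases automatic.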

35
Local Schema Evolution
  • This is a bit more complicated as it may require
    changes to be propagated also to the global
    schema(s)
  • Again, each transformation step t in the pathway
    T from S to S′ is considered in turn
  • In the case that t is an add, delete, rename or
    contract step, the evolution can be carried out
    automatically
  • If it is an extend, then domain knowledge is
    required
  • See our CAiSE02, ICDE03 and ER04 papers for
    more details
  • The last of these discusses a materialised DI
    scenario where the old/new global/source schemas
    have an extent
  • We are currently implementing this functionality
    within the AutoMed toolkit

36
9. Some Future Directions in Biological DI
  • Automatic or semi-automatic identification of
    correspondences between sources, or between
    sources and global schemas e.g.
  • name-based and structural comparisons of schema
    elements
  • instance-based matching at the data level
  • annotation of data sources with terms from
    ontologies to facilitate automated reasoning
  • Capturing incomplete and uncertain information
    about the data sources within the integrated
    resource e.g. using probabilistic or logic-based
    representations and reasoning
  • Automating information extraction from textual
    sources using grammar and rule-based approaches
    integrating this with other related structured or
    semi-structured data

37
9.1 Harnessing Grid Technologies: ISPIDER
  • ISPIDER Project partners: Birkbeck, EBI,
    Manchester, UCL
  • Aims
  • Large volumes of heterogeneous proteomics data
  • Need for interoperability
  • Need for efficient processing
  • Development of Proteomics Grid Infrastructure,
    use existing proteomics resources and develop new
    ones, develop new proteomics clients for
    querying, visualisation, workflow etc.

38
Project Aims (figures, slides 38-42)
43
myGrid / DQP / AutoMed
  • myGrid: a collection of services/components
    allowing high-level integration, via workflows,
    of data and applications
  • DQP
  • uses OGSA-DAI (Open Grid Services Architecture
    Data Access and Integration) to access data
    sources
  • provides distributed query processing over
    OGSA-DAI enabled resources
  • Current research: AutoMed/DQP and AutoMed/myGrid
    workflows interoperation
  • See the DILS06 and DILS07 papers, respectively

44
AutoMed/DQP Interoperability
  • Data sources are wrapped with OGSA-DAI
  • AutoMed-DAI wrappers extract the data sources'
    metadata
  • Semantic integration of the data sources, using
    AutoMed transformation pathways, into an
    integrated AutoMed schema
  • IQL queries submitted to this integrated schema
    are
  • reformulated to IQL queries on the data sources,
    using the AutoMed transformation pathways
  • submitted to DQP for evaluation via the
    AutoMed-DQP Wrapper

45
9.2 Bioinformatics Service Reconciliation
  • Plethora of bioinformatics services are being
    made available
  • Semantically compatible services are often not
    able to interoperate automatically in workflows
    due to
  • different service technologies
  • differences in data model, data modelling, data
    types
  • hence the need for service reconciliation

46
Previous Approaches
  • Shims. myGrid uses shims, i.e. services that act
    as intermediaries between specific pairs of
    services and reconcile their inputs and outputs
  • Bowers and Ludäscher (DILS04) use 1-1 path
    correspondences to one or more ontologies for
    reconciling services. Sample implementation uses
    mappings to a single ontology and generates an
    XQuery query as the transformation program
  • Thakkar et al. use a mediator system, like us,
    but for service integration i.e. for providing
    services that integrate other services not for
    reconciling semantically compatible services that
    need to form a pipeline within a workflow

47
Our approach
  • XML as the common representation format
  • Assume availability of format converters to
    convert to/from XML, if output/input of a service
    is not XML

48
Our approach
  • XMLDSS as the schema type
  • We use our XMLDSS schema type as the common
    schema type for XML
  • Can be automatically derived from DTD/XML Schema,
    if available
  • Or can be automatically extracted from an XML
    document

49
Our approach
  • Correspondences to an ontology
  • Set of GLAV correspondences between each XMLDSS
    schema and a typed ontology
  • An element maps to a concept/path in the ontology
  • An attribute maps to a literal-valued
    property/path
  • There may be multiple correspondences for
    elements/attributes in the ontology

50
Our approach
  • Schema and data transformation
  • a pathway is generated to transform X1 to X2
  • the correspondences are used to create pathways
    X1→X1′ and X2→X2′
  • the XMLDSS restructuring algorithm creates
    X1′→X2′
  • hence the overall pathway X1→X1′→X2′→X2
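The composition of the three stages into one end-to-end transformation can be sketched as plain function composition in Python. The stand-in stage functions are hypothetical; real AutoMed pathways carry schema transformations, not plain Python functions:

```python
# Sketch of composing three transformation stages into one pipeline,
# mirroring the overall pathway: X1 -> X1' -> X2' -> X2.

def compose(*stages):
    """Chain stages left to right into a single transformation."""
    def run(document):
        for stage in stages:
            document = stage(document)
        return document
    return run

# Hypothetical stand-ins for the three stages, acting on a toy
# dict-shaped "document"
to_x1_prime = lambda d: {("g:" + k): v for k, v in d.items()}  # X1 -> X1'
restructure = lambda d: dict(sorted(d.items()))                # X1' -> X2'
to_x2 = lambda d: {k.replace("g:", "s2:"): v
                   for k, v in d.items()}                      # X2' -> X2

pipeline = compose(to_x1_prime, restructure, to_x2)
print(pipeline({"b": 2, "a": 1}))
```

The point of the construction is that each stage can be generated independently (two from correspondences, one from the restructuring algorithm) and then chained.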

51
Architecture
  • A workflow tool could use our approach either
    dynamically or statically
  • Mediation service
  • Workflow tool invokes service S1 and receives its
    output
  • Workflow tool submits output of S1, the schema of
    S2 and the two sets of correspondences to an
    AutoMed service
  • The AutoMed service transforms the output of S1
    to a suitable input for consumption by S2
  • Shim generation
  • AutoMed is used to generate a shim for services
    S1 and S2
  • The XMLDSS schema transformation algorithm is
    currently tightly coupled with AutoMed; its
    functionality can, however, be exported as a
    single XQuery query able to materialise the
    input to S2 from the data output by S1

52
10. Conclusions
  • Integrating biological data sources is hard!
  • The overarching motivation is the potential to
    make scientific discoveries that can improve
    quality of life
  • The technical challenges faced can lead to new,
    more generally applicable, DI techniques
  • Thus, biological data integration continues to be
    a rich field for multi- and interdisciplinary
    research between clinicians, biologists,
    bioinformaticians and computer scientists