Title: Data integration architectures and methodologies for the Life Sciences
Alexandra Poulovassilis, Birkbeck, University of London
Outline of the talk
- The problem and challenges faced
- Historical background
- Main Data Integration approaches in the Life Sciences
- Our work
- Materialised and Virtual DI
- Future directions
- ISPIDER Project
- Bioinformatics service reconciliation
1. The Problem
- Given a set of biological data sources, data integration (DI) is the process of creating an integrated resource which
  - combines data from each of the data sources
  - in order to support new queries and analyses
- Biological data sources are characterised by a high degree of heterogeneity, in terms of data model, query interfaces, query processing capabilities, database schema or data exchange format, data types used, and nomenclature adopted
- Coupled with the variety, complexity and large volumes of biological data, this poses several challenges, and has led to several methodologies, architectures and systems being developed
Challenges faced
- Increasingly large volumes of complex, highly varying biological data are being made available
- Data sources are developed by different people, in differing research environments, for differing purposes
- Integrating them to meet the needs of new users and applications requires reconciliation of their heterogeneity w.r.t. content, data representation/exchange and querying
- Data sources may freely change their format and content without considering the impact on any integrated resources derived from them
- Integrated resources may themselves become data sources for higher-level integrations, resulting in a network of dependencies
Biological data: Genes → Proteins → Biological Function
- Genome: DNA sequences of 4 bases (A, C, G, T) — the permanent copy (a gene)
- RNA: copy of a DNA sequence — the temporary copy
- Protein: sequence of 20 amino acids — the product (each triple of RNA bases encodes an amino acid)
- Biological Processes: the function (the "job")
This slide is adapted from Nigel Martin's Lecture Notes on Bioinformatics
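The coding step sketched above (DNA → RNA → protein, with each RNA triple encoding an amino acid) can be illustrated in a few lines of Python. The codon table here is a tiny illustrative subset of the standard genetic code, not the full 64 entries.

```python
# Minimal sketch of the coding step: transcription copies DNA to RNA,
# then each RNA triple (codon) is translated to an amino acid.

CODON_TABLE = {  # illustrative subset of the standard genetic code
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP",
}

def transcribe(dna: str) -> str:
    """Produce the RNA copy of a DNA coding sequence (T -> U)."""
    return dna.replace("T", "U")

def translate(rna: str) -> list[str]:
    """Translate RNA codons to amino acids until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE.get(rna[i:i+3], "?")
        if amino == "STOP":
            break
        protein.append(amino)
    return protein

print(translate(transcribe("ATGTTTGGCTAA")))  # ['Met', 'Phe', 'Gly']
```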
Varieties of Biological Data
- genomic data
- gene expression data (DNA → proteins) and gene function data
- protein structure and function data
- regulatory pathway data: how gene expression is regulated by proteins
- cluster data: similarity-based clustering of genes or proteins
- proteomics data: from experiments on separating proteins produced by organisms into peptides, and protein identification
- phylogenetic data: evolution of genomic, protein and function data
- data on genomic variations in species
- semi-structured/unstructured data: medical abstracts
Some Key Application Areas for DI
- Integrating, analysing and annotating genomic data
- Predicting the functional role of genes and integrating function-specific information
- Integrating organism-specific information
- Integrating protein structure and pathway data with gene expression data, to support functional genomics analysis
- Integrating, analysing and annotating proteomics data sources
- Integrating phylogenetic data sources for genealogy research
- Integrating data on genomic variations to analyse health impact
- Integrating genomic, proteomic and clinical data for personalised medicine
2. Historical Background
- One possible approach would be to encode transformation/integration functionality in the application programs
- However, this may be a complex and lengthy process, and may affect robustness, maintainability and extensibility
- This has motivated the development of generic architectures and methodologies for DI, which abstract this functionality out of application programs into generic DI software
- Much work has been done since the 1990s specifically in biological DI
- Many systems have been developed, e.g. DiscoveryLink, Kleisli, Tambis, BioMart, SRS, Entrez, that aim to address some of the challenges faced
3. Main DI Approaches in the Life Sciences
- Materialised
  - import data into a data warehouse (DW)
  - transform/aggregate the imported data
  - query the DW via the DBMS
- Virtual
  - specify the integrated schema (IS)
  - wrap the data sources, using wrapper software
  - construct mappings between the data sources and the IS, using mediator software
  - query the integrated schema
  - mediator software coordinates query evaluation, using the mappings and wrappers
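The contrast between the two approaches can be sketched as follows. The toy sources, the `wrap_b` wrapper and the integrated record format are all invented for illustration.

```python
# Two toy "sources" with heterogeneous schemas/nomenclature.
SOURCE_A = [{"gene": "BRCA1", "organism": "human"}]
SOURCE_B = [{"gene_id": "brca2", "species": "human"}]

def wrap_b(record):
    """Wrapper: translate source B's schema and nomenclature to the integrated one."""
    return {"gene": record["gene_id"].upper(), "organism": record["species"]}

# Materialised DI: transform and load everything up front, then query the copy.
warehouse = SOURCE_A + [wrap_b(r) for r in SOURCE_B]

# Virtual DI: a mediator evaluates queries against the live sources at query time.
def mediator_query(predicate):
    yield from (r for r in SOURCE_A if predicate(r))
    yield from (r for r in map(wrap_b, SOURCE_B) if predicate(r))

human = list(mediator_query(lambda r: r["organism"] == "human"))
```

Both routes answer the same query; the difference is when the transformation work happens and which copy of the data is consulted.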
Main DI Approaches in the Life Sciences (contd)
- Link-based
  - no integrated schema
  - users submit simple queries to the integration software, e.g. via a web-based user interface
  - queries are formulated w.r.t. the data sources, as selected by the user
  - the integration software provides additional capabilities for
    - facilitating query formulation, e.g. cross-references are maintained between different data sources and used to augment query results with links to other related data
    - speeding up query evaluation, e.g. indexes are maintained supporting efficient keyword-based search
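The two supporting capabilities above (cross-references and keyword indexes) can be sketched as follows. The entries, identifiers and cross-references are invented sample data, not taken from any real source.

```python
from collections import defaultdict

# Invented sample entries, keyed by (source, identifier).
entries = {
    ("uniprot", "P38398"): "BRCA1 human breast cancer type 1 protein",
    ("embl", "U14680"):    "BRCA1 mRNA complete cds",
}
# Cross-reference table linking related records across sources.
cross_refs = {("uniprot", "P38398"): [("embl", "U14680")]}

# Build an inverted index: keyword -> set of entry keys containing it.
index = defaultdict(set)
for key, text in entries.items():
    for word in text.lower().split():
        index[word].add(key)

def search(keyword):
    """Keyword search, augmenting each hit with its cross-referenced entries."""
    return [(key, cross_refs.get(key, [])) for key in index[keyword.lower()]]
```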
4. Comparing the Main Approaches
- Link-based integration is fine if its functionality meets users' needs
- Otherwise, materialised or virtual DI is indicated
  - both allow the integrated resource to be queried as though it were a single data source: users/applications do not need to be aware of source schemas/formats/content
- Materialised DI is generally adopted for
  - better query performance
  - greater ease of data cleaning and annotation
- Virtual DI is generally adopted for
  - lower cost of storing and maintaining the integrated resource
  - greater currency of the integrated resource
5. Our work: AutoMed
- The AutoMed Project at Birkbeck and Imperial
  - is developing tools for the semi-automatic integration of heterogeneous information sources
  - can handle both structured and semi-structured data
  - provides a unifying graph-based metamodel (the HDM) for specifying higher-level modelling languages
  - provides a single framework for expressing data cleansing, transformation and integration logic
  - the AutoMed toolkit is currently being used for biological data integration and P2P data integration
AutoMed Architecture (components)
- Schema and Transformation Repository
- Model Definitions Repository
- Model Definition Tool
- Wrappers
- Schema Matching Tools
- Schema Transformation and Integration Tools
- Global Query Processor/Optimiser
- Other Tools, e.g. GUI, schema evolution, DLT
AutoMed Features
- Schema transformations are automatically reversible
  - addT(c,q) is reversed by deleteT(c,q), and vice versa
  - extendT(c,Range q1 q2) is reversed by contractT(c,Range q1 q2), and vice versa
  - renameT(c,n,n') is reversed by renameT(c,n',n)
- Hence bi-directional transformation pathways (more generally, transformation networks) are defined between schemas
- The queries within transformations allow automatic data and query translation
- Schemas may be expressed in a variety of modelling languages
- Schemas may or may not have a data source associated with them
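The reversibility of the primitive transformations can be sketched as follows, representing each step as an (operator, construct, query) triple. This record format is an assumption of the sketch, not AutoMed's actual API.

```python
# Each primitive transformation has a mechanical inverse, so a whole
# pathway can be replayed in either direction.
INVERSE = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend", "rename": "rename"}

def invert(step):
    """Invert one (op, construct, query) step; rename swaps its two names."""
    op, c, q = step
    if op == "rename":
        old, new = c
        return ("rename", (new, old), q)
    return (INVERSE[op], c, q)

def reverse_pathway(pathway):
    """Invert each step and reverse their order."""
    return [invert(s) for s in reversed(pathway)]

pathway = [("add", "protein", "q1"), ("rename", ("prot", "protein"), None)]
backwards = reverse_pathway(pathway)
```

Reversing twice yields the original pathway, which is what makes the pathways bi-directional.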
AutoMed vs Common Data Model approach
6. Materialised DI
Some characteristics of Biological DI
- prevalence of automated and manual annotation of data, prior to, during and after its integration
  - e.g. DAS (distributed annotation service); the GUS data warehouse's annotation of data origin and data derivation
- importance of being able to trace the provenance of data
- wide variety of nomenclatures adopted
  - greatly increases the difficulty of data aggregation
  - has led to many standardised ontologies and taxonomies
- inconsistencies in identification of biological entities
  - has led to standardisation efforts, e.g. LSID
  - but a legacy of non-standard identifiers is still present
The BioMap Data Warehouse
- A data warehouse integrating
  - gene expression data
  - protein structure data, including
    - data from the Macromolecular Structure Database (MSD) of the European Bioinformatics Institute (EBI)
    - CATH structural classification data
  - functional data, including
    - the Gene Ontology and KEGG
  - hierarchical clustering data, derived from the above
- Aiming to support mining, analysis and visualisation of gene expression data
BioMap integration approach
BioMap architecture
Using AutoMed in the BioMap Project
- Wrapping of the data sources and the DW
- Automatic translation of source and global schemas into AutoMed's XML schema language (XMLDSS)
- Domain experts provide matchings between constructs in source and global schemas (rename transformations)
- Automatic schema restructuring and generation of transformation pathways
- Pathways could subsequently be used for maintenance and evolution of the DW, and also for data lineage tracing
- See our DILS'05 paper for details of the architecture and clustering approach
[Figure: BioMap integration architecture. Data sources (an XML file and relational databases) are wrapped by AutoMed wrappers, yielding AutoMed XMLDSS and relational schemas; transformation pathways integrate these into an AutoMed integrated schema over the integrated database, which is itself wrapped.]
7. Virtual DI
- The integrated schema may be defined in a standard data modelling language
- Or, more broadly, it may be a source-independent ontology
  - defined in an ontology language
  - serving as a global schema for multiple potential data sources, beyond the ones being integrated, e.g. as in TAMBIS
- The integrated schema may or may not encompass all of the data in the data sources
  - it may be sufficient to capture just the data needed for answering key user queries/analyses
  - this avoids the possibly complex and lengthy process of creating a complete integrated schema and set of mappings
Virtual DI Architecture (components)
- Wrappers
- Metadata Repository
  - Data source schemas
  - Integrated schemas
  - Mappings
- Schema Integration Tools
- Global Query Processor
- Global Query Optimiser
Degree of Data Source Overlap
- Different systems make different assumptions about this
  - some systems assume that each DS contributes a different part of the integrated virtual resource, e.g. K2/Kleisli
  - some systems relax this but do not attempt any aggregation of duplicate or overlapping data from the DSs, e.g. TAMBIS
  - some systems support aggregation at both the schema and data levels, e.g. DiscoveryLink, AutoMed
- The degree of data source overlap impacts the complexity of the mappings and the design effort involved in specifying them
- The complexity of the mappings in turn impacts the sophistication of the global query optimisation and evaluation mechanisms that will be needed
Virtual DI methodologies
- Top-down
  - the integrated schema IS is first constructed
    - or it may already exist, from previous integration or standardisation efforts
  - the set of mappings M is then defined between IS and the DS schemas
Virtual DI methodologies (contd)
- Bottom-up
  - an initial version of IS and M is constructed, e.g. from one DS
  - these are incrementally extended/refined by considering in turn each of the other DSs
  - for each object O in each DS, M is modified to encompass the mapping between O and IS, if possible
  - if not, IS is extended as necessary to encompass the information represented by O, and M is then modified accordingly
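The bottom-up loop can be sketched as follows. The names `integrate_bottom_up` and `try_map` are invented, and `try_map` stands in for the (manual or tool-assisted) step of mapping a source object onto the IS.

```python
def integrate_bottom_up(sources, try_map):
    """Build IS and M incrementally, starting from the first data source."""
    IS = set(sources[0])                  # initial IS from the first DS
    M = {obj: obj for obj in sources[0]}  # identity mappings to start
    for ds in sources[1:]:
        for obj in ds:
            target = try_map(obj, IS)     # can obj be mapped onto IS?
            if target is None:            # no counterpart: extend IS first
                IS.add(obj)
                target = obj
            M[obj] = target               # record the mapping for obj
    return IS, M

# Example try_map: a case-insensitive name match (purely illustrative).
IS, M = integrate_bottom_up(
    [["gene", "protein"], ["Gene", "pathway"]],
    lambda o, IS: next((t for t in IS if t.lower() == o.lower()), None))
```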
Virtual DI methodologies (contd)
- Mixed Top-down and Bottom-up
  - an initial IS may exist
  - an initial set of mappings M is specified
  - IS and M may need to be extended/refined by considering additional data from the DSs that IS needs to capture
  - for each object O in each DS that IS needs to capture, M is modified to encompass the mapping between O and IS, if possible
  - if not, IS is extended as necessary to encompass the information represented by O, and M is then modified accordingly
Defining Mappings
- Global-as-view (GAV)
  - each schema object in IS is defined by a view over the DSs
  - simple global query reformulation, by query unfolding
  - view evolution problems if the DSs change
- Local-as-view (LAV)
  - each schema object in a DS is defined by a view over IS
  - harder global query reformulation, requiring answering queries using views
  - evolution problems if IS changes
- Global-local-as-view (GLAV)
  - views relate multiple schema objects in a DS with IS
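GAV query unfolding can be sketched as follows: each IS object is a view over the sources, and a global query is answered by substituting the view extents. The source data and view bodies are invented for illustration.

```python
# Invented source extents.
sources = {
    "swissprot": [("P38398", "BRCA1")],
    "pdb":       [("1JNX", "P38398")],
}

# GAV views: each IS object is a function computing its extent from the sources.
gav_views = {
    "protein":   lambda src: [{"acc": a, "name": n} for a, n in src["swissprot"]],
    "structure": lambda src: [{"pdb_id": p, "acc": a} for p, a in src["pdb"]],
}

def answer(global_query, src=sources):
    """Unfold: replace each IS object with its view extent, then evaluate."""
    return global_query({obj: view(src) for obj, view in gav_views.items()})

# A global query over the IS object "structure", answered by unfolding.
result = answer(lambda ext: [s for s in ext["structure"]
                             if s["acc"] == "P38398"])
```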
Both-As-View (BAV) approach, supported by AutoMed
- not based on views between integrated and source schemas
- instead, provides a set of primitive schema transformations, each adding, deleting or renaming just one schema object
- relationships between source and integrated schema objects are thus represented by a pathway of primitive transformations
- add, extend, delete and contract transformations are accompanied by a query defining the new/deleted object in terms of the other schema objects
- from the pathways and queries, it is possible to derive GAV, LAV and GLAV mappings
- currently AutoMed supports GAV, LAV and combined GAV-LAV query processing
Typical BAV Integration Network
[Figure: data source schemas DS1, DS2, ..., DSi, ..., DSn are each transformed into union-compatible schemas US1, US2, ..., USi, ..., USn; the union-compatible schemas are related pairwise by id transformations and are transformed into the global schema GS.]
Typical BAV Integration Network (contd)
- On the previous slide
  - GS is a global schema
  - DS1, ..., DSn are data source schemas
  - US1, ..., USn are union-compatible schemas
  - the transformation pathway between each pair DSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
  - the transformation pathway between USi and GS is similar
  - the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
8. Schema Evolution
- In biological DI, data sources may evolve their schemas to meet the needs of new experimental techniques or applications
- Global schemas may similarly need to evolve to encompass new requirements
- Supporting schema evolution in materialised DI is costly: it requires modifying the ETL and view materialisation processes, plus the processes maintaining any derived data marts
- With view-based virtual DI approaches, the sets of views that may be affected need to be examined and redefined
Schema Evolution in BAV
- BAV supports the evolution of both data source and global schemas
- The evolution of any schema is specified by a transformation pathway from the old to the new schema
- For example, a transformation pathway T can lead from an old global schema S to a new global schema S', and likewise from an old data source schema S to a new data source schema S'
Global Schema Evolution
- Each transformation step t in the pathway T from S to S' is considered in turn
  - if t is an add, delete or rename, then schema equivalence is preserved and there is nothing further to do (except perhaps optimise the extended transformation pathway, using an AutoMed tool that does this); the extended pathway can be used to regenerate the necessary GAV or LAV views
  - if t is a contract, then there will be information present in S that is no longer available in S'; again, there is nothing further to do
  - if t is an extend, then domain knowledge is required to determine if, and how, the new construct in S' could be derived from existing constructs; if not, there is nothing further to do; if yes, the extend step is replaced by an add step
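The case analysis above can be sketched as follows; `derive_query` is an invented placeholder for the domain knowledge needed to decide whether an extended construct is derivable.

```python
def evolve_global(pathway, derive_query):
    """Process each (op, construct, query) step of the old-to-new pathway."""
    extended = []
    for op, construct, query in pathway:
        if op == "extend":
            q = derive_query(construct)   # consult domain knowledge
            if q is not None:             # derivable: replace extend by add
                extended.append(("add", construct, q))
                continue
        # add/delete/rename preserve equivalence; contract just loses
        # information -- in both cases the step is kept unchanged
        extended.append((op, construct, query))
    return extended  # usable to regenerate the GAV or LAV views
```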
Local Schema Evolution
- This is a bit more complicated, as it may require changes to be propagated also to the global schema(s)
- Again, each transformation step t in the pathway T from S to S' is considered in turn
- If t is an add, delete, rename or contract step, the evolution can be carried out automatically
- If it is an extend, then domain knowledge is required
- See our CAiSE'02, ICDE'03 and ER'04 papers for more details
  - the last of these discusses a materialised DI scenario where the old/new global/source schemas have an extent
- We are currently implementing this functionality within the AutoMed toolkit
9. Some Future Directions in Biological DI
- Automatic or semi-automatic identification of correspondences between sources, or between sources and global schemas, e.g.
  - name-based and structural comparisons of schema elements
  - instance-based matching at the data level
  - annotation of data sources with terms from ontologies, to facilitate automated reasoning
- Capturing incomplete and uncertain information about the data sources within the integrated resource, e.g. using probabilistic or logic-based representations and reasoning
- Automating information extraction from textual sources using grammar- and rule-based approaches, and integrating this with other related structured or semi-structured data
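Name-based comparison of schema elements, the first technique listed above, can be sketched as follows. The choice of similarity measure (the Dice coefficient over character bigrams) and the threshold are illustrative assumptions, not a prescription.

```python
def bigrams(s):
    """Character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i+2] for i in range(len(s) - 1)}

def similarity(a, b):
    """Dice coefficient over character bigrams, in [0, 1]."""
    ba, bb = bigrams(a), bigrams(b)
    return 2 * len(ba & bb) / (len(ba) + len(bb)) if ba and bb else 0.0

def match(source_elems, global_elems, threshold=0.5):
    """Propose correspondences whose name similarity meets the threshold."""
    return [(s, g, round(similarity(s, g), 2))
            for s in source_elems for g in global_elems
            if similarity(s, g) >= threshold]

pairs = match(["protein_name", "gene_id"], ["proteinName", "geneIdentifier"])
```

Such proposals would still need confirmation by a domain expert, which is why the identification is only semi-automatic.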
9.1 Harnessing Grid Technologies: ISPIDER
- ISPIDER Project Partners: Birkbeck, EBI, Manchester, UCL
- Aims
  - handle large volumes of heterogeneous proteomics data
  - meet the need for interoperability
  - meet the need for efficient processing
  - develop a Proteomics Grid Infrastructure: use existing proteomics resources and develop new ones; develop new proteomics clients for querying, visualisation, workflow etc.
Project Aims
myGrid / DQP / AutoMed
- myGrid: a collection of services/components allowing high-level integration, via workflows, of data and applications
- DQP
  - uses OGSA-DAI (Open Grid Services Architecture - Data Access and Integration) to access data sources
  - provides distributed query processing over OGSA-DAI enabled resources
- Current research: AutoMed + DQP interoperation, and AutoMed + myGrid workflows interoperation
  - see our DILS'06 and DILS'07 papers, respectively
AutoMed + DQP Interoperability
- Data sources are wrapped with OGSA-DAI
- AutoMed-DAI wrappers extract the data sources' metadata
- Semantic integration of the data sources into an integrated AutoMed schema, using AutoMed transformation pathways
- IQL queries submitted to this integrated schema are
  - reformulated into IQL queries on the data sources, using the AutoMed transformation pathways
  - submitted to DQP for evaluation, via the AutoMed-DQP Wrapper
9.2 Bioinformatics Service Reconciliation
- A plethora of bioinformatics services is being made available
- Semantically compatible services are often not able to interoperate automatically in workflows, due to
  - different service technologies
  - differences in data model, data modelling and data types
- Hence the need for service reconciliation
Previous Approaches
- Shims: myGrid uses shims, i.e. services that act as intermediaries between specific pairs of services and reconcile their inputs and outputs
- Bowers & Ludäscher (DILS'04) use 1-1 path correspondences to one or more ontologies for reconciling services; their sample implementation uses mappings to a single ontology and generates an XQuery query as the transformation program
- Thakkar et al. use a mediator system, like us, but for service integration, i.e. for providing services that integrate other services, not for reconciling semantically compatible services that need to form a pipeline within a workflow
Our approach
- XML as the common representation format
- We assume the availability of format converters to convert to/from XML, if the output/input of a service is not XML
Our approach (contd)
- XMLDSS as the schema type
  - we use our XMLDSS schema type as the common schema type for XML
  - it can be automatically derived from a DTD/XML Schema, if available
  - or it can be automatically extracted from an XML document
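The idea of extracting a schema directly from an XML document can be sketched as follows. This is not the actual XMLDSS extraction algorithm, just a minimal structural summary recording each element's child elements and attributes; the sample document is invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def extract_schema(xml_text):
    """Summarise an XML document's structure: tag -> child tags / attributes."""
    children = defaultdict(set)
    attributes = defaultdict(set)
    def visit(elem):
        attributes[elem.tag].update(elem.attrib)
        for child in elem:
            children[elem.tag].add(child.tag)
            visit(child)
    visit(ET.fromstring(xml_text))
    return dict(children), dict(attributes)

doc = '<proteins><protein acc="P38398"><name>BRCA1</name></protein></proteins>'
kids, attrs = extract_schema(doc)
# kids records that <proteins> contains <protein>, which contains <name>
```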
Our approach (contd)
- Correspondences to an ontology
  - a set of GLAV correspondences is defined between each XMLDSS schema and a typed ontology
  - an element maps to a concept/path in the ontology
  - an attribute maps to a literal-valued property/path
  - there may be multiple correspondences for elements/attributes in the ontology
Our approach (contd)
- Schema and data transformation
  - a pathway is generated to transform X1 to X2
  - the correspondences are used to create pathways X1→X1' and X2→X2'
  - the XMLDSS restructuring algorithm creates X1'→X2'
  - hence the overall pathway is X1→X1'→X2'→X2
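The overall pathway X1→X1'→X2'→X2 can be sketched as function composition. The three stage functions are invented placeholders for the generated AutoMed transformations, with a single tag rename standing in for the correspondence step.

```python
def compose(*stages):
    """Chain transformation stages left to right: X1 -> X1' -> X2' -> X2."""
    def pipeline(data):
        for stage in stages:
            data = stage(data)
        return data
    return pipeline

# Placeholder stages (illustrative only):
x1_to_x1p = lambda d: {("seq" if k == "sequence" else k): v
                       for k, v in d.items()}   # correspondences on the S1 side
x1p_to_x2p = lambda d: d                        # XMLDSS restructuring X1' -> X2'
x2p_to_x2 = lambda d: d                         # correspondences on the S2 side

reconcile = compose(x1_to_x1p, x1p_to_x2p, x2p_to_x2)
```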
Architecture
- A workflow tool could use our approach either dynamically or statically
- Mediation service (dynamic)
  - the workflow tool invokes service S1 and receives its output
  - the workflow tool submits the output of S1, the schema of S2 and the two sets of correspondences to an AutoMed service
  - the AutoMed service transforms the output of S1 into a suitable input for consumption by S2
- Shim generation (static)
  - AutoMed is used to generate a shim for services S1 and S2
  - the XMLDSS schema transformation algorithm is currently tightly coupled with AutoMed, but its functionality can be exported as a single XQuery query able to materialise S2's input from the data output by S1
10. Conclusions
- Integrating biological data sources is hard!
- The overarching motivation is the potential to make scientific discoveries that can improve quality of life
- The technical challenges faced can lead to new, more generally applicable DI techniques
- Thus, biological data integration continues to be a rich field for multi- and interdisciplinary research between clinicians, biologists, bioinformaticians and computer scientists