Title: Data integration architectures and methodologies for the Life Sciences
Alexandra Poulovassilis, Birkbeck, University of London
Outline of the talk
- The problem and challenges faced
- Historical background
- Main Data Integration approaches in the Life Sciences
- Our work
- Materialised and Virtual DI
- Future directions
- ISPIDER Project
- Bioinformatics service reconciliation
1. The Problem
- Given a set of biological data sources, data integration (DI) is the process of creating an integrated resource which
  - combines data from each of the data sources
  - in order to support new queries and analyses
- Biological data sources are characterised by a high degree of heterogeneity, in terms of data model, query interfaces, query processing capabilities, database schema or data exchange format, data types used, and nomenclature adopted
- Coupled with the variety, complexity and large volumes of biological data, this poses several challenges, and has led to several methodologies, architectures and systems being developed
Challenges faced
- Increasingly large volumes of complex, highly varying biological data are being made available
- Data sources are developed by different people, in differing research environments, for differing purposes
- Integrating them to meet the needs of new users and applications requires reconciliation of their heterogeneity w.r.t. content, data representation/exchange and querying
- Data sources may freely change their format and content without considering the impact on any integrated resources derived from them
- Integrated resources may themselves become data sources for higher-level integrations, resulting in a network of dependencies
Biological data: Genes → Proteins → Biological Function
- Genome: DNA sequences of 4 bases (A, C, G, T) — the permanent copy (a gene)
- RNA: copy of a DNA sequence — the temporary copy
- Protein: sequence of 20 amino acids — the product (each triple of RNA bases encodes an amino acid)
- Biological Processes: the function (the "job")
This slide is adapted from Nigel Martin's Lecture Notes on Bioinformatics
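The coding step sketched above (DNA → RNA → protein, with each RNA triple encoding an amino acid) can be illustrated in a few lines of Python. The codon table here is a tiny illustrative subset of the standard genetic code, not the full 64 entries.

```python
# Minimal sketch of the coding step: transcription copies DNA to RNA,
# then each RNA triple (codon) is translated to an amino acid.

CODON_TABLE = {  # illustrative subset of the standard genetic code
    "AUG": "Met", "UUU": "Phe", "GGC": "Gly", "UAA": "STOP",
}

def transcribe(dna: str) -> str:
    """Produce the RNA copy of a DNA coding sequence (T -> U)."""
    return dna.replace("T", "U")

def translate(rna: str) -> list[str]:
    """Translate RNA codons to amino acids until a stop codon."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE.get(rna[i:i+3], "?")
        if amino == "STOP":
            break
        protein.append(amino)
    return protein

print(translate(transcribe("ATGTTTGGCTAA")))  # ['Met', 'Phe', 'Gly']
```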
Varieties of Biological Data
- genomic data
- gene expression data (DNA → proteins) and gene function data
- protein structure and function data
- regulatory pathway data: how gene expression is regulated by proteins
- cluster data: similarity-based clustering of genes or proteins
- proteomics data: from experiments on separating proteins produced by organisms into peptides, and protein identification
- phylogenetic data: evolution of genomic, protein and function data
- data on genomic variations in species
- semi-structured/unstructured data: medical abstracts
Some Key Application Areas for DI
- Integrating, analysing and annotating genomic data
- Predicting the functional role of genes and integrating function-specific information
- Integrating organism-specific information
- Integrating protein structure and pathway data with gene expression data, to support functional genomics analysis
- Integrating, analysing and annotating proteomics data sources
- Integrating phylogenetic data sources for genealogy research
- Integrating data on genomic variations to analyse health impact
- Integrating genomic, proteomic and clinical data for personalised medicine
2. Historical Background
- One possible approach would be to encode transformation/integration functionality in the application programs
- However, this may be a complex and lengthy process, and may affect robustness, maintainability and extensibility
- This has motivated the development of generic architectures and methodologies for DI, which abstract this functionality out of application programs into generic DI software
- Much work has been done since the 1990s specifically in biological DI
- Many systems have been developed, e.g. DiscoveryLink, Kleisli, Tambis, BioMart, SRS, Entrez, that aim to address some of the challenges faced
3. Main DI Approaches in the Life Sciences
- Materialised
  - import data into a data warehouse (DW)
  - transform/aggregate the imported data
  - query the DW via the DBMS
- Virtual
  - specify the integrated schema (IS)
  - wrap the data sources, using wrapper software
  - construct mappings between the data sources and the IS, using mediator software
  - query the integrated schema
  - mediator software coordinates query evaluation, using the mappings and wrappers
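The contrast between the two approaches can be sketched as follows. The toy sources, the `wrap_b` wrapper and the integrated record format are all invented for illustration.

```python
# Two toy "sources" with heterogeneous schemas/nomenclature.
SOURCE_A = [{"gene": "BRCA1", "organism": "human"}]
SOURCE_B = [{"gene_id": "brca2", "species": "human"}]

def wrap_b(record):
    """Wrapper: translate source B's schema and nomenclature to the integrated one."""
    return {"gene": record["gene_id"].upper(), "organism": record["species"]}

# Materialised DI: transform and load everything up front, then query the copy.
warehouse = SOURCE_A + [wrap_b(r) for r in SOURCE_B]

# Virtual DI: a mediator evaluates queries against the live sources at query time.
def mediator_query(predicate):
    yield from (r for r in SOURCE_A if predicate(r))
    yield from (r for r in map(wrap_b, SOURCE_B) if predicate(r))

human = list(mediator_query(lambda r: r["organism"] == "human"))
```

Both routes answer the same query; the difference is when the transformation work happens and which copy of the data is consulted.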
Main DI Approaches in the Life Sciences (contd)
- Link-based
  - no integrated schema
  - users submit simple queries to the integration software, e.g. via a web-based user interface
  - queries are formulated w.r.t. the data sources, as selected by the user
  - the integration software provides additional capabilities for
    - facilitating query formulation, e.g. cross-references are maintained between different data sources and used to augment query results with links to other related data
    - speeding up query evaluation, e.g. indexes are maintained supporting efficient keyword-based search
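The two supporting capabilities above (cross-references and keyword indexes) can be sketched as follows. The entries, identifiers and cross-references are invented sample data, not taken from any real source.

```python
from collections import defaultdict

# Invented sample entries, keyed by (source, identifier).
entries = {
    ("uniprot", "P38398"): "BRCA1 human breast cancer type 1 protein",
    ("embl", "U14680"):    "BRCA1 mRNA complete cds",
}
# Cross-reference table linking related records across sources.
cross_refs = {("uniprot", "P38398"): [("embl", "U14680")]}

# Build an inverted index: keyword -> set of entry keys containing it.
index = defaultdict(set)
for key, text in entries.items():
    for word in text.lower().split():
        index[word].add(key)

def search(keyword):
    """Keyword search, augmenting each hit with its cross-referenced entries."""
    return [(key, cross_refs.get(key, [])) for key in index[keyword.lower()]]
```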
4. Comparing the Main Approaches
- Link-based integration is fine if its functionality meets users' needs
- Otherwise, materialised or virtual DI is indicated
  - both allow the integrated resource to be queried as though it were a single data source: users/applications do not need to be aware of source schemas/formats/content
- Materialised DI is generally adopted for
  - better query performance
  - greater ease of data cleaning and annotation
- Virtual DI is generally adopted for
  - lower cost of storing and maintaining the integrated resource
  - greater currency of the integrated resource
5. Our work: AutoMed
- The AutoMed Project at Birkbeck and Imperial
  - is developing tools for the semi-automatic integration of heterogeneous information sources
  - can handle both structured and semi-structured data
  - provides a unifying graph-based metamodel (the HDM) for specifying higher-level modelling languages
  - provides a single framework for expressing data cleansing, transformation and integration logic
  - the AutoMed toolkit is currently being used for biological data integration and P2P data integration
AutoMed Architecture (components)
- Schema and Transformation Repository
- Model Definitions Repository
- Model Definition Tool
- Wrappers
- Schema Matching Tools
- Schema Transformation and Integration Tools
- Global Query Processor/Optimiser
- Other Tools, e.g. GUI, schema evolution, DLT
AutoMed Features
- Schema transformations are automatically reversible
  - addT(c,q) is reversed by deleteT(c,q), and vice versa
  - extendT(c,Range q1 q2) is reversed by contractT(c,Range q1 q2), and vice versa
  - renameT(c,n,n') is reversed by renameT(c,n',n)
- Hence bi-directional transformation pathways (more generally, transformation networks) are defined between schemas
- The queries within transformations allow automatic data and query translation
- Schemas may be expressed in a variety of modelling languages
- Schemas may or may not have a data source associated with them
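The reversibility of the primitive transformations can be sketched as follows, representing each step as an (operator, construct, query) triple. This record format is an assumption of the sketch, not AutoMed's actual API.

```python
# Each primitive transformation has a mechanical inverse, so a whole
# pathway can be replayed in either direction.
INVERSE = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend", "rename": "rename"}

def invert(step):
    """Invert one (op, construct, query) step; rename swaps its two names."""
    op, c, q = step
    if op == "rename":
        old, new = c
        return ("rename", (new, old), q)
    return (INVERSE[op], c, q)

def reverse_pathway(pathway):
    """Invert each step and reverse their order."""
    return [invert(s) for s in reversed(pathway)]

pathway = [("add", "protein", "q1"), ("rename", ("prot", "protein"), None)]
backwards = reverse_pathway(pathway)
```

Reversing twice yields the original pathway, which is what makes the pathways bi-directional.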
AutoMed vs Common Data Model approach
6. Materialised DI
Some characteristics of Biological DI
- prevalence of automated and manual annotation of data, prior to, during and after its integration
  - e.g. DAS (distributed annotation service); the GUS data warehouse's annotation of data origin and data derivation
- importance of being able to trace the provenance of data
- wide variety of nomenclatures adopted
  - greatly increases the difficulty of data aggregation
  - has led to many standardised ontologies and taxonomies
- inconsistencies in identification of biological entities
  - has led to standardisation efforts, e.g. LSID
  - but a legacy of non-standard identifiers is still present
The BioMap Data Warehouse
- A data warehouse integrating
  - gene expression data
  - protein structure data, including
    - data from the Macromolecular Structure Database (MSD) of the European Bioinformatics Institute (EBI)
    - CATH structural classification data
  - functional data, including
    - the Gene Ontology and KEGG
  - hierarchical clustering data, derived from the above
- Aiming to support mining, analysis and visualisation of gene expression data
BioMap integration approach
BioMap architecture
Using AutoMed in the BioMap Project
- Wrapping of the data sources and the DW
- Automatic translation of source and global schemas into AutoMed's XML schema language (XMLDSS)
- Domain experts provide matchings between constructs in source and global schemas (rename transformations)
- Automatic schema restructuring and generation of transformation pathways
- Pathways could subsequently be used for maintenance and evolution of the DW, and also for data lineage tracing
- See our DILS'05 paper for details of the architecture and clustering approach
[Figure: BioMap integration architecture. Data sources (an XML file and relational databases) are wrapped by AutoMed wrappers, yielding AutoMed XMLDSS and relational schemas; transformation pathways integrate these into an AutoMed integrated schema over the integrated database, which is itself wrapped.]
7. Virtual DI
- The integrated schema may be defined in a standard data modelling language
- Or, more broadly, it may be a source-independent ontology
  - defined in an ontology language
  - serving as a global schema for multiple potential data sources, beyond the ones being integrated, e.g. as in TAMBIS
- The integrated schema may or may not encompass all of the data in the data sources
  - it may be sufficient to capture just the data needed for answering key user queries/analyses
  - this avoids the possibly complex and lengthy process of creating a complete integrated schema and set of mappings
Virtual DI Architecture (components)
- Wrappers
- Metadata Repository
  - Data source schemas
  - Integrated schemas
  - Mappings
- Schema Integration Tools
- Global Query Processor
- Global Query Optimiser
Degree of Data Source Overlap
- Different systems make different assumptions about this
  - some systems assume that each DS contributes a different part of the integrated virtual resource, e.g. K2/Kleisli
  - some systems relax this but do not attempt any aggregation of duplicate or overlapping data from the DSs, e.g. TAMBIS
  - some systems support aggregation at both the schema and data levels, e.g. DiscoveryLink, AutoMed
- The degree of data source overlap impacts the complexity of the mappings and the design effort involved in specifying them
- The complexity of the mappings in turn impacts the sophistication of the global query optimisation and evaluation mechanisms that will be needed
Virtual DI methodologies
- Top-down
  - the integrated schema IS is first constructed
    - or it may already exist, from previous integration or standardisation efforts
  - the set of mappings M is then defined between IS and the DS schemas
Virtual DI methodologies (contd)
- Bottom-up
  - an initial version of IS and M is constructed, e.g. from one DS
  - these are incrementally extended/refined by considering in turn each of the other DSs
  - for each object O in each DS, M is modified to encompass the mapping between O and IS, if possible
  - if not, IS is extended as necessary to encompass the information represented by O, and M is then modified accordingly
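The bottom-up loop can be sketched as follows. The names `integrate_bottom_up` and `try_map` are invented, and `try_map` stands in for the (manual or tool-assisted) step of mapping a source object onto the IS.

```python
def integrate_bottom_up(sources, try_map):
    """Build IS and M incrementally, starting from the first data source."""
    IS = set(sources[0])                  # initial IS from the first DS
    M = {obj: obj for obj in sources[0]}  # identity mappings to start
    for ds in sources[1:]:
        for obj in ds:
            target = try_map(obj, IS)     # can obj be mapped onto IS?
            if target is None:            # no counterpart: extend IS first
                IS.add(obj)
                target = obj
            M[obj] = target               # record the mapping for obj
    return IS, M

# Example try_map: a case-insensitive name match (purely illustrative).
IS, M = integrate_bottom_up(
    [["gene", "protein"], ["Gene", "pathway"]],
    lambda o, IS: next((t for t in IS if t.lower() == o.lower()), None))
```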
Virtual DI methodologies (contd)
- Mixed Top-down and Bottom-up
  - an initial IS may exist
  - an initial set of mappings M is specified
  - IS and M may need to be extended/refined by considering additional data from the DSs that IS needs to capture
  - for each object O in each DS that IS needs to capture, M is modified to encompass the mapping between O and IS, if possible
  - if not, IS is extended as necessary to encompass the information represented by O, and M is then modified accordingly
Defining Mappings
- Global-as-view (GAV)
  - each schema object in IS is defined by a view over the DSs
  - simple global query reformulation, by query unfolding
  - view evolution problems if the DSs change
- Local-as-view (LAV)
  - each schema object in a DS is defined by a view over IS
  - harder global query reformulation, requiring answering queries using views
  - evolution problems if IS changes
- Global-local-as-view (GLAV)
  - views relate multiple schema objects in a DS with IS
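GAV query unfolding can be sketched as follows: each IS object is a view over the sources, and a global query is answered by substituting the view extents. The source data and view bodies are invented for illustration.

```python
# Invented source extents.
sources = {
    "swissprot": [("P38398", "BRCA1")],
    "pdb":       [("1JNX", "P38398")],
}

# GAV views: each IS object is a function computing its extent from the sources.
gav_views = {
    "protein":   lambda src: [{"acc": a, "name": n} for a, n in src["swissprot"]],
    "structure": lambda src: [{"pdb_id": p, "acc": a} for p, a in src["pdb"]],
}

def answer(global_query, src=sources):
    """Unfold: replace each IS object with its view extent, then evaluate."""
    return global_query({obj: view(src) for obj, view in gav_views.items()})

# A global query over the IS object "structure", answered by unfolding.
result = answer(lambda ext: [s for s in ext["structure"]
                             if s["acc"] == "P38398"])
```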
Both-As-View (BAV) approach, supported by AutoMed
- not based on views between integrated and source schemas
- instead, provides a set of primitive schema transformations, each adding, deleting or renaming just one schema object
- relationships between source and integrated schema objects are thus represented by a pathway of primitive transformations
- add, extend, delete and contract transformations are accompanied by a query defining the new/deleted object in terms of the other schema objects
- from the pathways and queries, it is possible to derive GAV, LAV and GLAV mappings
- currently AutoMed supports GAV, LAV and combined GAV-LAV query processing
Typical BAV Integration Network
[Figure: data source schemas DS1, DS2, ..., DSi, ..., DSn are each transformed into union-compatible schemas US1, US2, ..., USi, ..., USn; the union-compatible schemas are related pairwise by id transformations and are transformed into the global schema GS.]
Typical BAV Integration Network (contd)
- On the previous slide
  - GS is a global schema
  - DS1, ..., DSn are data source schemas
  - US1, ..., USn are union-compatible schemas
  - the transformation pathway between each pair DSi and USi may consist of add, delete, rename, extend and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
  - the transformation pathway between USi and GS is similar
  - the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
8. Schema Evolution
- In biological DI, data sources may evolve their schemas to meet the needs of new experimental techniques or applications
- Global schemas may similarly need to evolve to encompass new requirements
- Supporting schema evolution in materialised DI is costly: it requires modifying the ETL and view materialisation processes, plus the processes maintaining any derived data marts
- With view-based virtual DI approaches, the sets of views that may be affected need to be examined and redefined
Schema Evolution in BAV
- BAV supports the evolution of both data source and global schemas
- The evolution of any schema is specified by a transformation pathway from the old to the new schema
- For example, a transformation pathway T can lead from an old global schema S to a new global schema S', and likewise from an old data source schema S to a new data source schema S'
Global Schema Evolution
- Each transformation step t in the pathway T from S to S' is considered in turn
  - if t is an add, delete or rename, then schema equivalence is preserved and there is nothing further to do (except perhaps optimise the extended transformation pathway, using an AutoMed tool that does this); the extended pathway can be used to regenerate the necessary GAV or LAV views
  - if t is a contract, then there will be information present in S that is no longer available in S'; again, there is nothing further to do
  - if t is an extend, then domain knowledge is required to determine if, and how, the new construct in S' could be derived from existing constructs; if not, there is nothing further to do; if yes, the extend step is replaced by an add step
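The case analysis above can be sketched as follows; `derive_query` is an invented placeholder for the domain knowledge needed to decide whether an extended construct is derivable.

```python
def evolve_global(pathway, derive_query):
    """Process each (op, construct, query) step of the old-to-new pathway."""
    extended = []
    for op, construct, query in pathway:
        if op == "extend":
            q = derive_query(construct)   # consult domain knowledge
            if q is not None:             # derivable: replace extend by add
                extended.append(("add", construct, q))
                continue
        # add/delete/rename preserve equivalence; contract just loses
        # information -- in both cases the step is kept unchanged
        extended.append((op, construct, query))
    return extended  # usable to regenerate the GAV or LAV views
```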
Local Schema Evolution
- This is a bit more complicated, as it may require changes to be propagated also to the global schema(s)
- Again, each transformation step t in the pathway T from S to S' is considered in turn
- If t is an add, delete, rename or contract step, the evolution can be carried out automatically
- If it is an extend, then domain knowledge is required
- See our CAiSE'02, ICDE'03 and ER'04 papers for more details
  - the last of these discusses a materialised DI scenario where the old/new global/source schemas have an extent
- We are currently implementing this functionality within the AutoMed toolkit
9. Some Future Directions in Biological DI
- Automatic or semi-automatic identification of correspondences between sources, or between sources and global schemas, e.g.
  - name-based and structural comparisons of schema elements
  - instance-based matching at the data level
  - annotation of data sources with terms from ontologies, to facilitate automated reasoning
- Capturing incomplete and uncertain information about the data sources within the integrated resource, e.g. using probabilistic or logic-based representations and reasoning
- Automating information extraction from textual sources using grammar- and rule-based approaches, and integrating this with other related structured or semi-structured data
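Name-based comparison of schema elements, the first technique listed above, can be sketched as follows. The choice of similarity measure (the Dice coefficient over character bigrams) and the threshold are illustrative assumptions, not a prescription.

```python
def bigrams(s):
    """Character bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i+2] for i in range(len(s) - 1)}

def similarity(a, b):
    """Dice coefficient over character bigrams, in [0, 1]."""
    ba, bb = bigrams(a), bigrams(b)
    return 2 * len(ba & bb) / (len(ba) + len(bb)) if ba and bb else 0.0

def match(source_elems, global_elems, threshold=0.5):
    """Propose correspondences whose name similarity meets the threshold."""
    return [(s, g, round(similarity(s, g), 2))
            for s in source_elems for g in global_elems
            if similarity(s, g) >= threshold]

pairs = match(["protein_name", "gene_id"], ["proteinName", "geneIdentifier"])
```

Such proposals would still need confirmation by a domain expert, which is why the identification is only semi-automatic.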
9.1 Harnessing Grid Technologies: ISPIDER
- ISPIDER Project Partners: Birkbeck, EBI, Manchester, UCL
- Aims
  - handle large volumes of heterogeneous proteomics data
  - meet the need for interoperability
  - meet the need for efficient processing
  - develop a Proteomics Grid Infrastructure: use existing proteomics resources and develop new ones; develop new proteomics clients for querying, visualisation, workflow etc.
Project Aims
myGrid / DQP / AutoMed
- myGrid: a collection of services/components allowing high-level integration, via workflows, of data and applications
- DQP
  - uses OGSA-DAI (Open Grid Services Architecture - Data Access and Integration) to access data sources
  - provides distributed query processing over OGSA-DAI enabled resources
- Current research: AutoMed + DQP interoperation, and AutoMed + myGrid workflows interoperation
  - see our DILS'06 and DILS'07 papers, respectively
AutoMed + DQP Interoperability
- Data sources are wrapped with OGSA-DAI
- AutoMed-DAI wrappers extract the data sources' metadata
- Semantic integration of the data sources into an integrated AutoMed schema, using AutoMed transformation pathways
- IQL queries submitted to this integrated schema are
  - reformulated into IQL queries on the data sources, using the AutoMed transformation pathways
  - submitted to DQP for evaluation, via the AutoMed-DQP Wrapper
9.2 Bioinformatics Service Reconciliation
- A plethora of bioinformatics services is being made available
- Semantically compatible services are often not able to interoperate automatically in workflows, due to
  - different service technologies
  - differences in data model, data modelling and data types
- Hence the need for service reconciliation
Previous Approaches
- Shims: myGrid uses shims, i.e. services that act as intermediaries between specific pairs of services and reconcile their inputs and outputs
- Bowers & Ludäscher (DILS'04) use 1-1 path correspondences to one or more ontologies for reconciling services; their sample implementation uses mappings to a single ontology and generates an XQuery query as the transformation program
- Thakkar et al. use a mediator system, like us, but for service integration, i.e. for providing services that integrate other services, not for reconciling semantically compatible services that need to form a pipeline within a workflow
Our approach
- XML as the common representation format
- We assume the availability of format converters to convert to/from XML, if the output/input of a service is not XML
Our approach (contd)
- XMLDSS as the schema type
  - we use our XMLDSS schema type as the common schema type for XML
  - it can be automatically derived from a DTD/XML Schema, if available
  - or it can be automatically extracted from an XML document
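The idea of extracting a schema directly from an XML document can be sketched as follows. This is not the actual XMLDSS extraction algorithm, just a minimal structural summary recording each element's child elements and attributes; the sample document is invented.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def extract_schema(xml_text):
    """Summarise an XML document's structure: tag -> child tags / attributes."""
    children = defaultdict(set)
    attributes = defaultdict(set)
    def visit(elem):
        attributes[elem.tag].update(elem.attrib)
        for child in elem:
            children[elem.tag].add(child.tag)
            visit(child)
    visit(ET.fromstring(xml_text))
    return dict(children), dict(attributes)

doc = '<proteins><protein acc="P38398"><name>BRCA1</name></protein></proteins>'
kids, attrs = extract_schema(doc)
# kids records that <proteins> contains <protein>, which contains <name>
```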
Our approach (contd)
- Correspondences to an ontology
  - a set of GLAV correspondences is defined between each XMLDSS schema and a typed ontology
  - an element maps to a concept/path in the ontology
  - an attribute maps to a literal-valued property/path
  - there may be multiple correspondences for elements/attributes in the ontology
Our approach (contd)
- Schema and data transformation
  - a pathway is generated to transform X1 to X2
  - the correspondences are used to create pathways X1→X1' and X2→X2'
  - the XMLDSS restructuring algorithm creates X1'→X2'
  - hence the overall pathway is X1→X1'→X2'→X2
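The overall pathway X1→X1'→X2'→X2 can be sketched as function composition. The three stage functions are invented placeholders for the generated AutoMed transformations, with a single tag rename standing in for the correspondence step.

```python
def compose(*stages):
    """Chain transformation stages left to right: X1 -> X1' -> X2' -> X2."""
    def pipeline(data):
        for stage in stages:
            data = stage(data)
        return data
    return pipeline

# Placeholder stages (illustrative only):
x1_to_x1p = lambda d: {("seq" if k == "sequence" else k): v
                       for k, v in d.items()}   # correspondences on the S1 side
x1p_to_x2p = lambda d: d                        # XMLDSS restructuring X1' -> X2'
x2p_to_x2 = lambda d: d                         # correspondences on the S2 side

reconcile = compose(x1_to_x1p, x1p_to_x2p, x2p_to_x2)
```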
Architecture
- A workflow tool could use our approach either dynamically or statically
- Mediation service (dynamic)
  - the workflow tool invokes service S1 and receives its output
  - the workflow tool submits the output of S1, the schema of S2 and the two sets of correspondences to an AutoMed service
  - the AutoMed service transforms the output of S1 into a suitable input for consumption by S2
- Shim generation (static)
  - AutoMed is used to generate a shim for services S1 and S2
  - the XMLDSS schema transformation algorithm is currently tightly coupled with AutoMed, but its functionality can be exported as a single XQuery query able to materialise S2's input from the data output by S1
10. Conclusions
- Integrating biological data sources is hard!
- The overarching motivation is the potential to make scientific discoveries that can improve quality of life
- The technical challenges faced can lead to new, more generally applicable DI techniques
- Thus, biological data integration continues to be a rich field for multi- and interdisciplinary research between clinicians, biologists, bioinformaticians and computer scientists