Title: AutoMed: Automatic generation of Mediator tools for heterogeneous data integration
1AutoMed: Automatic generation of Mediator tools
for heterogeneous data integration
- Alex Poulovassilis
- School of Computer Science and Information Systems, Birkbeck
- AutoMed is a joint project with Peter McBrien (Imperial College), funded under the 2nd DIM call by EPSRC grants GR/N38107 and GR/N35915
2[Figure: an Integrated Schema and three component Schemas]
3Background
- In earlier work (ER97, IS98, DKE98) we developed a new framework to support transformation and integration of heterogeneous database schemas
- Our framework consisted of:
  - a new notion of schema equivalence
  - a set of primitive schema transformations which can be composed to define unconditional or conditional equivalences between schemas
4Background
- In our data integration approach, we represent the modelling constructs of higher-level data models (e.g. relational, object-oriented, semi-structured, XML, RDF) in terms of a low-level hypergraph data model (HDM) whose constructs are nodes, edges and constraints
- The HDM common data model provides a unifying semantics for such higher-level modelling constructs
- It avoids the semantic mismatches that may occur between constructs of higher-level modelling languages
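A minimal sketch of how an HDM schema might be held in memory, assuming a plain Python representation. The class names, field names and the Programme/Category example are illustrative assumptions, not AutoMed's actual data structures.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Node:
    name: str

@dataclass(frozen=True)
class Edge:
    name: str
    ends: tuple  # names of the constructs this (hyper)edge connects

@dataclass(frozen=True)
class Constraint:
    kind: str    # e.g. "subset" (hypothetical label)
    args: tuple  # names of the constructs the constraint ranges over

@dataclass
class HDMSchema:
    nodes: set = field(default_factory=set)
    edges: set = field(default_factory=set)
    constraints: set = field(default_factory=set)

# Hypothetical source schema: two nodes linked by one edge.
source = HDMSchema(
    nodes={Node("Programme"), Node("Category")},
    edges={Edge("category", ("Programme", "Category"))},
)

# Hypothetical target schema: four nodes and three subset constraints.
target = HDMSchema(
    nodes={Node(n) for n in ("Prog", "Film", "Doc", "Series")},
    constraints={Constraint("subset", (c, "Prog")) for c in ("Film", "Doc", "Series")},
)
```

Because every higher-level construct is reduced to these three kinds of object, schemas from different modelling languages can be compared on the same footing.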
5Background
- Our approach allows constructs from different modelling languages to be mixed within the same intermediate schema during the schema transformation/integration process (CAiSE99)
- Our schema transformations are automatically reversible, setting up a two-way transformation pathway between pairs of schemas
8- addClass Series {p | (p,'S') ∈ category}
- addClass Doc {p | (p,'D') ∈ category}
- addClass Film {p | (p,'F') ∈ category}
- addClass Prog {p | (p,c) ∈ category}
9- addSubClass Film Prog
- addSubClass Doc Prog
- addSubClass Series Prog
- addClass Series {p | (p,'S') ∈ category}
- addClass Doc {p | (p,'D') ∈ category}
- addClass Film {p | (p,'F') ∈ category}
- addClass Prog {p | (p,c) ∈ category}
10- addSubClass Film Prog
- addSubClass Doc Prog
- addSubClass Series Prog
- addClass Series {p | (p,'S') ∈ category}
- addClass Doc {p | (p,'D') ∈ category}
- addClass Film {p | (p,'F') ∈ category}
- addClass Prog {p | (p,c) ∈ category}
- delRel category {(p,'F') | p ∈ Film} ∪ {(p,'D') | p ∈ Doc} ∪ {(p,'S') | p ∈ Series}
11- delSubClass Film Prog
- delSubClass Doc Prog
- delSubClass Series Prog
- delClass Series {p | (p,'S') ∈ category}
- delClass Doc {p | (p,'D') ∈ category}
- delClass Film {p | (p,'F') ∈ category}
- delClass Prog {p | (p,c) ∈ category}
- addRel category {(p,'F') | p ∈ Film} ∪ {(p,'D') | p ∈ Doc} ∪ {(p,'S') | p ∈ Series}
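The automatic reversal of a pathway can be sketched as follows: reverse the order of the steps and swap each add for the corresponding del, keeping the query unchanged. This is a minimal illustration in Python, not AutoMed code. The `Step` type, the order of the forward steps, and the empty query on the subclass steps are assumptions, and the rename, expand and contract primitives are not covered.

```python
from typing import NamedTuple

class Step(NamedTuple):
    op: str         # "add" or "del"
    kind: str       # construct kind, e.g. "Class", "SubClass", "Rel"
    construct: str  # name of the construct being added or deleted
    query: str      # defining query, kept as plain text in this sketch

INVERSE = {"add": "del", "del": "add"}

def reverse(pathway):
    """Reverse the step order and swap add/del; queries are left untouched."""
    return [step._replace(op=INVERSE[step.op]) for step in reversed(pathway)]

forward = [
    Step("add", "Class", "Prog", "{p | (p,c) ∈ category}"),
    Step("add", "Class", "Film", "{p | (p,'F') ∈ category}"),
    Step("add", "Class", "Doc", "{p | (p,'D') ∈ category}"),
    Step("add", "Class", "Series", "{p | (p,'S') ∈ category}"),
    Step("add", "SubClass", "Film Prog", ""),
    Step("add", "SubClass", "Doc Prog", ""),
    Step("add", "SubClass", "Series Prog", ""),
    Step("del", "Rel", "category",
         "{(p,'F') | p ∈ Film} ∪ {(p,'D') | p ∈ Doc} ∪ {(p,'S') | p ∈ Series}"),
]

backward = reverse(forward)
for step in backward:
    print(f"{step.op}{step.kind} {step.construct} {step.query}".rstrip())
```

Applying `reverse` twice returns the original pathway, which is what makes the pathway two-way.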
12- addConstraint subset Film Prog
- addConstraint subset Doc Prog
- addConstraint subset Series Prog
- addNode Series {p | (p,'S') ∈ category}
- addNode Doc {p | (p,'D') ∈ category}
- addNode Film {p | (p,'F') ∈ category}
- addNode Prog {p | (p,c) ∈ category}
- delEdge category {(p,'F') | p ∈ Film} ∪ {(p,'D') | p ∈ Doc} ∪ {(p,'S') | p ∈ Series}
- delNode Programme Prog
- delNode Category {'F','D','S'}
13- delConstraint subset Film Prog
- delConstraint subset Doc Prog
- delConstraint subset Series Prog
- delNode Series {p | (p,'S') ∈ category}
- delNode Doc {p | (p,'D') ∈ category}
- delNode Film {p | (p,'F') ∈ category}
- delNode Prog {p | (p,c) ∈ category}
- addEdge category {(p,'F') | p ∈ Film} ∪ {(p,'D') | p ∈ Doc} ∪ {(p,'S') | p ∈ Series}
- addNode Programme Prog
- addNode Category {'F','D','S'}
14Background
- These pathways can be used to automatically translate data and queries between pairs of schemas (ER99)
- From a pathway T: S → S' we
  - compose the queries in the add steps to derive a definition of each construct in S' as a view over S, and
  - compose the queries in the del steps to derive a definition of each construct in S as a view over S'
15Background
- Thus
  - Prog = {p | (p,c) ∈ category}
  - Film = {p | (p,'F') ∈ category}
  - Doc = {p | (p,'D') ∈ category}
  - Series = {p | (p,'S') ∈ category}
- and
  - category = {(p,'F') | p ∈ Film} ∪ {(p,'D') | p ∈ Doc} ∪ {(p,'S') | p ∈ Series}
- These view definitions can then be used to automatically translate data and queries between S and S'
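To make the two directions concrete, the view definitions can be evaluated over a small data set. The following is a sketch in Python using set comprehensions. The sample extent of `category` and the function names are made up for illustration, and the round trip only holds under the assumption that every category code is 'F', 'D' or 'S'.

```python
# Hypothetical extent of the category relationship in the source schema:
# pairs (programme, category code).
category = {("p1", "F"), ("p2", "D"), ("p3", "S"), ("p4", "F")}

def derive_new(category):
    """Views taken from the add steps: new constructs defined over the old schema."""
    return {
        "Prog": {p for (p, c) in category},
        "Film": {p for (p, c) in category if c == "F"},
        "Doc": {p for (p, c) in category if c == "D"},
        "Series": {p for (p, c) in category if c == "S"},
    }

def derive_old(new):
    """View taken from the del step: category defined over the new schema."""
    return (
        {(p, "F") for p in new["Film"]}
        | {(p, "D") for p in new["Doc"]}
        | {(p, "S") for p in new["Series"]}
    )

new = derive_new(category)
restored = derive_old(new)
print(sorted(new["Film"]))
print(restored == category)
```

Translating the data forward and then back recovers the original `category` extent, which is the sense in which the two schemas are equivalent here.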
16Overview of the AutoMed Project
- The AutoMed project aims to investigate
  - how our theoretical framework can be practically applied to real data integration problems
  - how much of a mediator's global query processing functionality can be automatically generated from our transformation pathways
  - evolutionary and heuristic techniques for schema improvement and global query optimisation
17The AutoMed Architecture
Schema and Transformation Repository
Schema Transformation and Integration Tool
Global Query Processor
Global Query Optimiser
Model Definitions Repository
Model Definition Tool
Schema Evolution Tool
18Schema Transformation/Integration Networks in AutoMed
[Figure: a global schema GS; union-compatible schemas US1, US2, ..., USi, ..., USn linked by id steps; local schemas LS1, LS2, ..., LSi, ..., LSn]
19Schema Transformation/Integration Networks in AutoMed
- On the previous slide
  - GS is a global schema
  - LS1, ..., LSn are local schemas
  - US1, ..., USn are union-compatible schemas
  - the transformation pathway between each pair LSi and USi may consist of add, delete, rename, expand and contract primitive transformations, operating on any modelling construct defined in the AutoMed Model Definitions Repository
  - the transformation pathway between USi and GS is similar
  - the transformation pathway between each pair of union-compatible schemas consists of id transformation steps
20Both-As-View integration
- Our schema transformation pathways capture at least the information available from global-as-view (GAV) or local-as-view (LAV)
- We discuss this in a forthcoming paper (ICDE03) and term our integration approach both-as-view (BAV)
- In particular, we discuss how
  - GAV and LAV view definitions can be derived from a BAV specification
  - a BAV specification can be partially derived from a set of GAV or LAV view definitions
21Schema Evolution
- Unlike GAV and LAV, our framework readily supports the evolution of both local and global schemas
- The first step is to define the evolution of the global or local schema as a schema transformation pathway from the old to the new schema
- There is then a systematic way of evolving, as opposed to re-generating, the transformation pathways
- In the case of a local schema evolution, the global schema may also be evolved
22Schema Evolution
- In particular (see our CAiSE02 and ICDE03 papers for details)
  - if the evolved schema is semantically equivalent to the original schema, then the transformation network can be repaired automatically
  - if the evolved schema is a contraction of the original schema, the transformation network can again be repaired automatically
  - if the evolved schema is an extension of the original schema, then domain knowledge may be required (but again the network can be evolved rather than regenerated)
23Global Query Processing
- We are handling query language heterogeneity by translation into/from a functional intermediate query language, IQL [Edgar Jasper] (BNCOD02 poster, BNCOD02 summer school paper)
- A query Q expressed in a high-level query language on a global schema GS is first translated into IQL
- GAV view definitions are derived from the transformation pathways between GS and the local schemas
- These view definitions are substituted into Q, reformulating it into an IQL query over local schema constructs
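The substitution step can be sketched as a recursive rewrite over a query expression tree. This is a hedged illustration in Python rather than IQL: queries are nested tuples of the form `(operator, argument, ...)`, construct names are plain strings, and the names `GS.Prog`, `LS1.Prog`, `LS2.Prog`, `union` and `count` are hypothetical. View definitions are assumed to be acyclic.

```python
def substitute(query, views):
    """Replace every global construct in `query` by its view definition."""
    if isinstance(query, str):
        # A construct name: unfold it if a view definition exists,
        # then keep unfolding inside the definition itself.
        return substitute(views[query], views) if query in views else query
    operator, *arguments = query
    return (operator, *[substitute(arg, views) for arg in arguments])

# Hypothetical GAV view: a global construct defined over two local schemas.
views = {"GS.Prog": ("union", "LS1.Prog", "LS2.Prog")}

# Hypothetical global query: count the members of the global construct.
q = ("count", "GS.Prog")

reformulated = substitute(q, views)
print(reformulated)
```

After substitution the query mentions only local schema constructs, so it can be handed on for optimisation and evaluation.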
24Global Query Processing
- Query optimisation and query evaluation then occur
- Specific issues for query optimisation in AutoMed are
  - optimising the view definitions derived from the transformation pathways, and
  - handling heterogeneous modelling constructs appearing within these view definitions
- For query evaluation, wrappers translate IQL sub-queries into the local query language, and translate results back into the IQL type system
- Further query post-processing is possible
25Why a Functional Language as the AutoMed Intermediate Query Language?
- Compositionality: operators can be composed to an arbitrary level of nesting within a query, provided the types of the operators are respected by the expressions passed to them
- Referential transparency: any query evaluates to a single answer, irrespective of the order of evaluation of its sub-expressions
- These properties make view generation, query reformulation and query rewriting simpler than they would be with imperative or logic notations
26Why a Functional Language as the AutoMed Intermediate Query Language?
- Natural support for collection types and aggregation operators
- Makes this a natural formalism for translating into/out of other query languages, e.g.
  - OQL is a functional query language
  - SQL can be considered to be a restriction of OQL
  - XQuery has a functional core language
  - other languages for semi-structured and RDF data are also functional (UnQL, YATL, RQL)
27Why a Functional Language as the AutoMed Intermediate Query Language?
- Aggregation operators over collection types such as sets, bags and lists are generalised by a single fold function (Buneman, Tannen, Naqvi, 1990s)
- Optimisation techniques have been developed for fold which are applicable to all functional query languages with this formalism at their core (e.g. work by Wadler, Wong, Fegaras, Grust, Poulovassilis & Small)
- We plan to leverage these techniques, and perhaps even existing software, for global query optimisation in AutoMed
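The idea that one fold function subsumes the usual aggregation operators can be illustrated with a short sketch. It is written in Python rather than IQL, and the argument order of `fold` and the helper names are assumptions made for this example.

```python
from functools import reduce
from operator import add

def fold(f, op, zero, collection):
    """Apply f to each element, then combine the results with op, starting from zero."""
    return reduce(op, map(f, collection), zero)

def count(xs):
    return fold(lambda _: 1, add, 0, xs)

def total(xs):
    return fold(lambda x: x, add, 0, xs)

def maximum(xs, lowest=float("-inf")):
    return fold(lambda x: x, max, lowest, xs)

def distinct(xs):
    # Union is commutative and idempotent, so the result does not depend
    # on element order or on duplicates in the input.
    return fold(lambda x: frozenset([x]), frozenset.union, frozenset(), xs)

data = [3, 1, 2, 3]
print(count(data), total(data), maximum(data), sorted(distinct(data)))
```

Since each operator is just a choice of `f`, `op` and `zero`, a rewrite rule stated once for `fold` applies to all of them.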
28XML Data Sources
- As well as integration of structured data sources, we have done some work on translating and integrating XML data (see our CAiSE01 paper)
- We have defined a representation of XML in terms of the nodes, edges and constraints of the HDM
- We capture the ordering of XML elements by an order node and a hyperedge to it from the edge representing the parent-child relationship
29Translating XML into HDM
- <customer name="Jones">
- <account number="A14"/>
- <account number="B37"/>
- </customer>
- <customer name="Smith">
- <account number="C514"/>
- <account number="D438"/>
- </customer>
[Figure: HDM graph with nodes root, customer, name, account and number, and two order nodes]
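A rough sketch of this translation, assuming Python's standard `xml.etree.ElementTree` parser. The enclosing `<root>` element, the generated identifiers and the tuple layout are assumptions made for this example and do not reproduce the exact HDM encoding. The third component of each parent-child tuple stands in for the link to the order node.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
  <customer name="Jones">
    <account number="A14"/>
    <account number="B37"/>
  </customer>
  <customer name="Smith">
    <account number="C514"/>
    <account number="D438"/>
  </customer>
</root>"""

def to_hdm(xml_text):
    """Return (elements, child_edges, attr_edges) for an XML document."""
    elements = []     # (identifier, tag): instances of element nodes
    child_edges = []  # (parent id, child id, position): ordered parent-child edges
    attr_edges = []   # (element id, attribute name, value)

    def visit(elem):
        ident = f"{elem.tag}{len(elements) + 1}"
        elements.append((ident, elem.tag))
        for name, value in elem.attrib.items():
            attr_edges.append((ident, name, value))
        for position, child in enumerate(elem, start=1):
            child_edges.append((ident, visit(child), position))
        return ident

    visit(ET.fromstring(xml_text))
    return elements, child_edges, attr_edges

elements, child_edges, attr_edges = to_hdm(DOC)
for edge in child_edges:
    print(edge)
```

Keeping the position on the parent-child edge preserves document order without changing the element nodes themselves.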
30XML Data Sources
- We have defined a set of primitive transformations on XML, in terms of the underlying transformations on the equivalent HDM representation (which is the general AutoMed methodology)
- XML documents are then translated into a simple ER representation, which allows them to be integrated with each other and with other structured data sources
- One possible direction of further work is automatic or semi-automatic transformation and integration of the ER models arising from XML documents
31Unstructured Text Sources
- We have also been working on extracting structure from unstructured text sources [Dean Williams]
- The aim here is to integrate information extracted from unstructured text with structured or semi-structured information available from other sources
- We are using existing technology (the GATE tool) for the text annotation and information extraction (IE) part of this work
32Unstructured Text Sources
- Natural language and domain ontologies will be used to extend these annotations
- These will be imported into RDF repositories, and we have extended AutoMed to encompass RDF and RDFS data sources
- The information extracted from the text will be matched with existing structured information to derive new facts and perhaps new schema information as well
33Materialised integration
- Finally, as well as virtual integration of data sources, we are also investigating using the AutoMed framework for materialised integration, i.e. a data warehousing approach
- In particular, we are looking at incremental view maintenance and data lineage tracing using the AutoMed schema transformation pathways [Hao Fan]
34Ongoing AutoMed Work at Imperial
- Automatic generation of equivalences between different data models
- A graphical schema transformations editor
- Data mining techniques for extracting relational schema equivalences
- Using AutoMed for integrating semi-structured and structured data, in particular genomic data
- Optimising schema transformation pathways