Title: Turning information into knowledge: the challenges of integrating diverse information sources Alex P
1Turning information into knowledge: the
challenges of integrating diverse information
sources
Alex Poulovassilis, Birkbeck, U. of London
Co-Director of the London Knowledge Lab
2The London Knowledge Lab
Institute of Education, University of London
Birkbeck College, University of London
- Purpose-designed building: Science Research
Infrastructure Fund, 6m
- Research staff and students: 50
- Location: Bloomsbury
- Open: June 2004
- Social scientists: experts in education,
sociology, culture and media, semiotics,
philosophy, knowledge management ...
- Computer scientists: experts in information
systems, information management, web
technologies, personalisation, ubiquitous
technologies
3LKL mission
- to understand how digital technologies and media
are transforming people's relationships to
information, learning and culture at home, work
and play
- to design, build and evaluate systems,
processes and interfaces which enhance learning
throughout life
- to examine critically the
assumptions about knowledge and learning that
underlie the different uses of digital
technologies
The starting point for our mission is that
digital technologies and new media will change
how we learn, work, collaborate and communicate
4LKL research themes
- Our research is funded by projects from EU,
EPSRC, ESRC, BBSRC, JISC, Wellcome Trust;
currently about 25 projects.
- Four broad themes guide our work and inform our
research:
- new forms of knowledge
- turning information into knowledge
- the changing cultures of new media
- creating empowering technologies for formal and
informal learning
5New forms of knowledge
- What do children and adults of the twenty-first
century need to know?
- How can we learn in new and more effective ways?
- What kinds of knowledge are emerging in the
knowledge economy?
- How can this knowledge be made more accessible to
more people?
6Turning information into knowledge
- The need to cope with ubiquitous, complex,
incomplete and inconsistent information is
pervasive in our societies
- How can people benefit from this information in
their learning, working and social lives?
- What new techniques are necessary for managing,
accessing, integrating and personalising such
information?
- How to design and build tools that help people to
understand such information and generate new
knowledge from it?
7The changing cultures of new media
- What are the differences and continuities between
old media (books, film, TV) and new media
(internet, computer games, mobile phones)?
- How do children and adults use these media in
different contexts, both as consumers and
producers?
- How are they learning in, and from, this
convergent media environment?
- What are the implications of these developments
for formal and informal learning?
8Creating empowering technologies for learning
- How are equity, participation, learner autonomy,
and the structuring of learning impacted by
digital technologies and new media?
- Which media-enhanced approaches can help people
to learn and collaborate?
- How can the Internet, and ambient and mobile
technologies create new learning opportunities?
9Turning information into knowledge: information
integration
- AutoMed (EPSRC)
- developing tools for semi-automatic integration
of heterogeneous information sources
- can handle both structured and semi-structured
(RDF/S, XML) data
- can handle virtual, materialised and hybrid
integration scenarios
- application in biological data integration,
e-learning, p2p data integration
- ISPIDER (BBSRC e-Science programme)
- developing an integrated platform of proteomic
data sources, enabled as Grid and Web services
- collaboration with groups at EBI, Manchester,
UCL
10The AutoMed Project
- Partners: Birkbeck and Imperial Colleges
- Data integration based on schema
equivalence/subsumption
- Low-level metamodel, the Hypergraph Data Model
(HDM), in terms of which higher-level data
modelling languages are defined; it is therefore
extensible with new modelling languages
- Provides a set of primitive equivalence-preserving
schema transformations for higher-level
modelling languages:
- addT(c,q) deleteT(c,q) renameT(c,n,n')
- Also two more primitive transformations for
imprecise integration scenarios:
- extendT(c,Range q1 q2) contractT(c,Range q1 q2)
11Features of the AutoMed toolkit
- Schema transformations are automatically
reversible:
- addT/deleteT(c,q) by deleteT/addT(c,q)
- extendT(c,Range q1 q2) by contractT(c,Range q1
q2)
- renameT(c,n,n') by renameT(c,n',n)
- Hence bi-directional transformation pathways
(more generally transformation networks) are
defined between schemas
- The queries within transformations allow
automatic data and query translation
- Schemas may be expressed in a variety of
modelling languages
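The reversal rules listed on this slide can be sketched in a few lines of Python. This is only an illustrative model, not the AutoMed toolkit's API: the `Step` class, its field names and the two helper functions are hypothetical, and queries are kept as opaque values.

```python
from dataclasses import dataclass

# Which primitive undoes which (rename is handled separately below).
INVERSE = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend"}

@dataclass(frozen=True)
class Step:
    """One primitive transformation (hypothetical representation)."""
    kind: str        # "add", "delete", "extend", "contract" or "rename"
    construct: str   # the schema construct c
    args: tuple      # (q,), ("Range", q1, q2), or (n, n') for rename

def reverse_step(step: Step) -> Step:
    if step.kind == "rename":
        old, new = step.args
        # rename(c, n, n') is undone by rename(c, n', n)
        return Step("rename", step.construct, (new, old))
    # add <-> delete and extend <-> contract keep the same arguments
    return Step(INVERSE[step.kind], step.construct, step.args)

def reverse_pathway(pathway):
    """Reverse a pathway: invert every step and apply them in reverse order."""
    return [reverse_step(s) for s in reversed(pathway)]
```

Reversing a pathway twice gives back the original pathway, which is what makes a single pathway usable in both directions.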
12Schema transformation/integration networks
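[Diagram: local schemas LS1, LS2, ..., LSi, ..., LSn; union-compatible schemas US1, US2, ..., USi, ..., USn, with id steps between them; global schema GS]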
13Schema transformation/integration networks
(contd)
- On the previous slide:
- GS is a global schema
- LS1, ..., LSn are local schemas
- US1, ..., USn are union-compatible schemas
- the transformation pathway between each pair LSi
and USi may consist of add, delete, rename,
extend and contract primitive transformations,
operating on any modelling construct defined in
the AutoMed Model Definitions Repository
- the transformation pathway between USi and GS is
similar
- the transformation pathway between each pair of
union-compatible schemas consists of id
transformation steps
14AutoMed architecture
Schema and Transformations Repository (STR)
Wrapper
Schema Transformation and Integration Tools
Global Query Processor
Model Definitions Repository (MDR)
Global Query Optimiser
Model Definition Tool
Schema Evolution Tool
15Other data integration approaches: GAV, LAV
- Global-As-View (GAV) approach: specify GS
constructs by view definitions over LS constructs
- Local-As-View (LAV) approach: specify LS
constructs by view definitions over GS constructs
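A toy illustration of the two directions, in Python. The schemas, construct names (`staff`, `student`, `person`) and data are invented for the example; a real system would express these views in a query language rather than as functions.

```python
# Hypothetical local sources: LS1 holds staff, LS2 holds students.
LS1_STAFF = [{"name": "Ann"}]
LS2_STUDENT = [{"name": "Bob"}]

def gav_person(staff, student):
    """GAV: the global construct `person` is defined as a view over LS constructs."""
    return ([{"name": r["name"], "role": "staff"} for r in staff]
            + [{"name": r["name"], "role": "student"} for r in student])

def lav_staff(person):
    """LAV: the local construct `staff` is described as a view over GS constructs."""
    return [{"name": p["name"]} for p in person if p["role"] == "staff"]

# GAV answers a global query by evaluating the view over the sources.
person = gav_person(LS1_STAFF, LS2_STUDENT)

# LAV states what each source contains in terms of the global schema;
# here the description of LS1 agrees with the data LS1 actually holds.
assert lav_staff(person) == LS1_STAFF
```

In the GAV direction a query over `person` can be answered by unfolding the view; in the LAV direction the view only describes the source, so answering global queries needs a rewriting step that this sketch does not attempt.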
16Evolution problems of GAV and LAV
- GAV does not readily support evolution of local
schemas, e.g. adding a new attribute to a source
table may invalidate some of the global view
definitions
- In LAV, changes to a local schema impact only the
derivation rules defined for that schema
- But conversely LAV has problems if one wants to
evolve the global schema since all the view
definitions defining local schema constructs in
terms of the global schema would need to be
reviewed
- These evolution problems are exacerbated in P2P
data integration scenarios where there is no
distinction between local and global schemas
17AutoMed vs GAV/LAV/GLAV
- AutoMed schema transformation pathways capture at
least the information available from GAV and LAV
rules:
- add/extend transformations correspond to GAV
rules
- delete/contract transformations correspond to LAV
rules
- Thus, GAV and LAV view definitions can be derived
from a BAV (both-as-view) network, i.e. a network
of AutoMed transformation pathways
- GLAV rules e - e are also captured, by BAV
transformations of the form add(T,e) del(T,e)
- Thus, any reasoning or processing that is
possible using GAV, LAV or GLAV is also possible
using BAV
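The correspondence stated above (add/extend steps give GAV rules, delete/contract steps give LAV rules) can be sketched as a simple classification of the steps in a pathway. This is a deliberate simplification under stated assumptions: steps are hypothetical `(kind, construct, query)` tuples, queries are placeholder strings, and the composition of definitions through intermediate constructs is not modelled.

```python
def derive_views(pathway):
    """Split a source-to-global pathway into GAV-style and LAV-style definitions.

    Each step is a hypothetical (kind, construct, query) tuple.
    """
    gav, lav = {}, {}
    for kind, construct, query in pathway:
        if kind in ("add", "extend"):
            # a new construct defined by a query over what already exists
            gav[construct] = query
        elif kind in ("delete", "contract"):
            # a removed construct described by a query over what remains
            lav[construct] = query
        # rename and id steps carry no view definition of their own
    return gav, lav

example = [
    ("add", "GS.person", "union of LS.staff and LS.student"),
    ("delete", "LS.staff", "members of GS.person who are staff"),
    ("contract", "LS.phone", "Range Void Any"),
]
gav_rules, lav_rules = derive_views(example)
```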
18Schema Evolution in BAV
- Unlike GAV/LAV/GLAV, BAV readily supports the
evolution of both local and global schemas.
- The evolution of a global or local schema is
specified by a schema transformation pathway T
from the old schema S to the new schema S'
- The transformation network and schemas can then
be systematically repaired (rather than having to
be redefined)
[Diagram: pathway T from Global Schema S to New Global Schema S'; pathway T from Local Schema S to New Local Schema S']
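One way to picture the repair is as pathway composition: because every step is reversible, an evolution pathway T can be prefixed (in reverse) or appended to the existing pathway. The sketch below assumes steps are hypothetical `(kind, construct, arg)` tuples, with rename written as `("rename", old, new)`; it omits any simplification of the resulting pathway that a real tool might perform.

```python
INVERSE = {"add": "delete", "delete": "add",
           "extend": "contract", "contract": "extend"}

def reverse(pathway):
    """Invert each step and reverse the order of the steps."""
    out = []
    for kind, construct, arg in reversed(pathway):
        if kind == "rename":
            out.append(("rename", arg, construct))   # new name back to old name
        else:
            out.append((INVERSE[kind], construct, arg))
    return out

def repair_after_local_evolution(local_to_global, evolution):
    """Local schema S evolves to S' via `evolution`; build the pathway S' -> GS."""
    return reverse(evolution) + local_to_global

def repair_after_global_evolution(local_to_global, evolution):
    """Global schema S evolves to S' via `evolution`; build the pathway LS -> S'."""
    return local_to_global + evolution
```

In both cases the existing pathway is reused unchanged, which is the sense in which the network is repaired rather than redefined.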
19Global Query Processing
- We handle query language heterogeneity by
translation into/from a functional intermediate
query language IQL
- A query Q expressed in a high-level query
language on a global schema S is first
translated into IQL (this functionality is not
yet supported in the AutoMed toolkit)
- View definitions are derived from the
transformation pathways between S and the
requested data source schemas
- These view definitions are substituted into Q,
reformulating it into an IQL query over source
schema constructs
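The substitution step can be illustrated with a small query-unfolding sketch. It does not use IQL: queries are modelled as nested Python tuples `(operator, arg, ...)`, construct names are plain strings, and the view definitions and names (`GS.person`, `LS1.staff`, `LS2.student`) are invented for the example.

```python
def unfold(query, views):
    """Replace every global construct in `query` by its view definition.

    `views` maps a global construct name to an expression over source
    constructs. Definitions are unfolded recursively, so they must be acyclic.
    """
    if isinstance(query, str):
        # a construct name: substitute its definition if it has one
        return unfold(views[query], views) if query in views else query
    op, *args = query
    # keep the operator, unfold each argument
    return (op, *[unfold(a, views) for a in args])

# Hypothetical view definition derived from the transformation pathways.
views = {"GS.person": ("union", "LS1.staff", "LS2.student")}

# A query over the global schema ...
q = ("count", "GS.person")

# ... reformulated into a query over source schema constructs only.
reformulated = unfold(q, views)
```

After unfolding, the query mentions only source constructs, so it can be passed on for optimisation and evaluation.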
20Global Query Processing (contd)
- Query optimisation and query evaluation then
occur
- During query evaluation, the evaluator submits to
wrappers sub-queries that they are able to
translate into the local query language.
Currently, AutoMed supports wrappers for SQL,
OQL, XPath, XQuery and flat-file data sources
- The wrappers translate sub-query results back
into the IQL type system
- Further query post-processing then occurs in the
IQL evaluator
21Other AutoMed research at BBK
- As well as virtual integration of data sources,
we have investigated using AutoMed for
materialised data integration, i.e. a data
warehousing approach
- In particular, Hao Fan has worked on incremental
view maintenance, data lineage tracing and schema
evolution over AutoMed schema transformation
pathways
- Lucas Zamboulis has developed semi-automatic
techniques for transforming and integrating
heterogeneous XML data
- In recent work he is investigating the use of
correspondences to ontologies to enhance these
techniques
- Sandeep Mittal is working on update translation
and update propagation along AutoMed pathways,
e.g. in P2P environments
22Other AutoMed research at BBK (contd)
- Dean Williams has been working on extracting
structure from unstructured text sources
- The aim here is to integrate information
extracted from unstructured text with structured
information available from other sources
- Dean is using existing technology (the GATE tool)
for the text annotation and IE part of this work
- The information extracted from the text is
matched with existing structured information to
derive new instance data and perhaps also new
schema fragments
- AutoMed is being used for the schema and data
integration aspects of this project
23ISPIDER Project
- Partners: Birkbeck, EBI, Manchester, UCL
- Aims
- Vast, heterogeneous biological data
- Need for interoperability
- Need for efficient processing
- Development of Proteomics Grid Infrastructure,
use existing proteomics resources and develop new
ones, develop new proteomics clients for
querying, visualisation, workflow etc.
24Project Aims
29myGrid / DQP / AutoMed
- myGrid: collection of services/components
allowing high-level integration of
data/applications for in-silico experiments in
biology
- DQP:
- OGSA-DAI (Open Grid Services Architecture Data
Access and Integration)
- Distributed query processing over OGSA-DAI
enabled resources
- Ongoing research:
- AutoMed / DQP interoperability
- AutoMed / myGrid interoperability
30DQP / AutoMed interoperability
- Data sources wrapped with OGSA-DAI
- AutoMed OGSA-DAI wrappers extract data sources'
metadata
- Semantic integration of data sources using
AutoMed transformation pathways into an
integrated AutoMed schema
- IQL queries submitted to this integrated schema
are:
- Reformulated to IQL queries on the data sources,
using the AutoMed transformation pathways
- Submitted to DQP for evaluation
31Ongoing and future research
- Heterogeneous data integration in Grid and P2P
environments, with bioinformatics and e-learning
as example application domains
- Flexible combinations of virtual, materialised or
hybrid integration
- Flexible query processing in imprecise
integration scenarios
- P2P query processing over BAV pathways
- P2P update processing over BAV pathways
- Use of ECA rules and a P2P ECA rule execution
engine for flexible update processing and data
sharing