Data Integration - PowerPoint PPT Presentation

1 / 28

About This Presentation

Title:

Data Integration

Description:

The LSD (Learning Source Descriptions) System. Suppose user wants to integrate 100 data sources ... LSD provides attribute correspondences, but not complete mappings ... – PowerPoint PPT presentation

Number of Views:120

Avg rating:3.0/5.0

Slides: 29

Provided by: zack4

Category:

more less

Transcript and Presenter's Notes

Title: Data Integration

1
Data Integration

Zachary G. Ives
University of Pennsylvania
CIS 550 Database Information Systems
November 14, 2007

2
Challenges of Mapping Schemas

In a perfect world, it would be easy to match up
items from one schema with another
Every table would have a similar table in the
other schema
Every attribute would have an identical attribute
in the other schema
Every value would clearly map to a value in the
other schema
Real world as with human languages, things
dont map clearly!
May have different numbers of tables different
decompositions
Metadata in one relation may be data in another
Values may not exactly correspond
It may be unclear whether a value is the same

3
Different Aspects to Mapping

Schema matching / ontology alignment
How do we find correspondences between
attributes?
Entity matching / deduplication / record linking
/ etc.
How do we know when two records refer to the
same thing?
Mapping definition
How do we specify the constraints or
transformations that let us reason about when to
create an entry in one schema, given an entry in
another schema?

Lets see one influential approach to schema
matching
4
The LSD (Learning Source Descriptions) System

Suppose user wants to integrate 100 data sources
User
manually creates mappings for a few sources, say
3
shows LSD these mappings
LSD learns from the mappings
Multi-strategy learning incorporates many types
of info in a general way
Knowledge of constraints further helps
LSD proposes mappings for remaining 97 sources

5
Example
Mediated schema
address price agent-phone
description
location listed-price phone
comments
Learned hypotheses
Schema of realestate.com
If phone occurs in the name agent-phone
listed-price 250,000 110,000 ...
location Miami, FL Boston, MA ...
phone (305) 729 0831 (617) 253 1429 ...
comments Fantastic house Great location ...
realestate.com
If fantastic great occur frequently in
data values description
homes.com
price 550,000 320,000 ...
contact-phone (278) 345 7215 (617) 335 2315 ...
extra-info Beautiful yard Great beach ...
6
LSDs Multi-Strategy Learning

Use a set of base learners
Each exploits well certain types of information
Name learner looks at words in the attribute
names
Naïve Bayes learner looks at patterns in the data
values
Etc.
Match schema elements of a new source
Apply the base learners
Each returns a score
For different attributes one learner is more
useful than another
Combine their predictions using a meta-learner
Meta-learner
Uses training sources to measure base learner
accuracy
Weighs each learner based on its accuracy

7
Training the Learners
Mediated schema
address price agent-phone
description
location listed-price phone
comments
Schema of realestate.com
Name Learner
(location, address) (listed-price, price) (phone,
agent-phone) (comments, description) ...
Miami, FL
250,000 (305) 729 0831
Fantastic house
realestate.com
Naive Bayes Learner
Boston, MA
110,000 (617) 253 1429
Great location
(Miami, FL, address) ( 250,000,
price) ((305) 729 0831, agent-phone) (Fantastic
house, description) ...
8
Applying the Learners
Mediated schema
Schema of homes.com
address price agent-phone
description
area day-phone extra-info
Name Learner Naive Bayes
Seattle, WA Kent,
WA Austin, TX
(address,0.8), (description,0.2) (address,0.6),
(description,0.4) (address,0.7), (description,0.3)
Meta-Learner
Name Learner Naive Bayes
Meta-Learner
(address,0.7), (description,0.3)
(278) 345 7215 (617) 335
2315 (512) 427 1115
(agent-phone,0.9), (description,0.1)
(address,0.6), (description,0.4)
Beautiful yard Great
beach Close to Seattle
9
Putting It All Together LSD System
Matching Phase
Training Phase
Mediated schema
Source schemas
Domain Constraints
Data listings
Training data for base learners
User Feedback
Constraint Handler
L1
L2
Lk
Mapping Combination
10
Mappings between Schemas

LSD provides attribute correspondences, but not
complete mappings
Mappings generally are posed as views define
relations in one schema (typically either the
mediated schema or the source schema), given data
in the other schema
This allows us to restructure or recompose
decompose our data in a new way
We can also define mappings between values in a
view
We use an intermediate table defining
correspondences a concordance table
It can be filled in using some type of code, and
corrected by hand

11
A Few Mapping Examples

Movie(Title, Year, Director, Editor, Star1,
Star2)
Movie(Title, Year, Director, Editor, Star1,
Star2)

PieceOfArt(ID, Artist, Subject, Title, TypeOfArt)
MotionPicture(ID, Title, Year)Participant(ID,
Name, Role)

PieceOfArt(I, A, S, T, Movie) - Movie(T, Y, A,
_, S1, S2), ID T Y, S S1 S2
Movie(T, Y, D, E, S1, S2) - MotionPicture(I, T,
Y), Participant(I, D, Dir), Participant(I, E,
Editor), Participant(I, S1, Star1),
Participant(I, S2, Star2)
T1
T2
Need a concordance table from CustIDs to PennIDs
12
Two Important Approaches

TSIMMIS Garcia-Molina97 Stanford
Focus semistructured data (OEM), OQL-based
language (Lorel)
Creates a mediated schema as a view over the
sources
Spawned a UCSD project called MIX, which led to a
company now owned by BEA Systems
Other important systems of this vein Kleisli/K2
_at_ Penn
Information Manifold Levy96 ATT Research
Focus local-as-view mappings, relational model
Sources defined as views over mediated schema
Requires a special
Led to peer-to-peer integration approaches
(Piazza, etc.)
Focus Web-based queriable sources

13
TSIMMIS

One of the first systems to support
semi-structured data, which predated XML by
several years OEM
An instance of a global-as-view mediation
system
We define our global schema as views over the
sources

14
XML vs. Object Exchange Model
Bernstein
Newcomer Principles of
TP Chamberlinthor DB2 UDB
O1 book O2 author Bernstein O3
author Newcomer O4 title Principles of
TP O5 book O6 author Chamberlin
O7 title DB2 UDB
15
Queries in TSIMMIS

Specified in OQL-style language called Lorel
OQL was an object-oriented query language that
looks like SQL
Lorel is, in many ways, a predecessor to XQuery
Based on path expressions over OEM structures
select book where book.title DB2 UDB and
book.author Chamberlin
This is basically like XQuery, which well use in
place of Lorel and the MSL template language.
Previous query restated
for b in AllData()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b

16
Query Answering in TSIMMIS

Basically, its view unfolding, i.e., composing a
query with a view
The query is the one being asked
The views are the MSL templates for the wrappers
Some of the views may actually require
parameters, e.g., an author name, before theyll
return answers
Common for web forms (see Amazon, Google, )
XQuery functions (XQuerys version of views)
support parameters as well, so well see these in
action

17
A Wrapper Definition in MSL

Wrappers have templates and binding patterns (X)
in MSL
B - B // select
from book where author X //
This reformats a SQL query over Book(author,
year, title)
In XQuery, this might look like
declare function GetBook(x AS xsdstring) as
book for b in sql(Amazon.DB,
select from book where author x
)return b/titlex
book

book
author
title

The union of GetBooks results is unioned with
others to form the view Mediator()
18
How to Answer the Query

Given our query
for b in Mediator()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b
Find all wrapper definitions that
Contain output enough structure to match the
conditions of the query
Or have already tested the conditions for us!

19
Query Composition with Views

We find all views that define book with author
and title, and we compose the query with each
declare function GetBook(x AS xsdstring) as
book for b in sql(Amazon.DB, select
from book where author x )return
b/title x
for b in Mediator()/bookwhere b/title/text()
DB2 UDB and b/author/text()
Chamberlinreturn b

book
author
title

20
Matching View Output to Our Querys Conditions

Determine that b/book/author/text() ?? x by
matching the pattern on the functions output
declare function GetBook(x AS xsdstring) as
book for b in sql(Amazon.DB,
select from book where author x
)return b/title
x
let x Chamberlinfor b in
GetBook(x)/bookwhere b/title/text() DB2
UDB return b

book
author
title

21
The Final Step Unfolding

let x Chamberlinfor b in ( for b in
sql(Amazon.com, select from book where
author x ) return b/title
x
)/bookwhere b/title/text() DB2 UDB
return b
How do we simplify further to get to here?
for b in sql(Amazon.com, select from
book where authorChamberlin)where
b/title/text() DB2 UDB return b

22
Virtues of TSIMMIS

Early adopter of semistructured data, greatly
predating XML
Can support data from many different kinds of
sources
Obviously, doesnt fully solve heterogeneity
problem
Presents a mediated schema that is the union of
multiple views
Query answering based on view unfolding
Easily composed in a hierarchy of mediators

23
Limitations of TSIMMIS Approach

Some data sources may contain data with certain
ranges or properties
Books by Aho, Students at UPenn,
If we ask a query for students at Columbia, dont
want to bother querying students at Penn
How do we express these?
Mediated schema is basically the union of the
various MSL templates as they change, so may
the mediated schema

24
An Alternate ApproachThe Information Manifold
(Levy et al.)

When you integrate something, you have some
conceptual model of the integrated domain
Define that as a basic frame of reference,
everything else as a view over it
Local as View
May have overlapping/incomplete sources
Define each source as the subset of a query over
the mediated schema
We can use selection or join predicates to
specify that a source contains a range of values
ComputerBooks() ? Books(Title, , Subj), Subj
Computers

25
The Local-as-View Model

The basic model is the following
Local sources are views over the mediated
schema
Sources have the data mediated schema is
virtual
Sources may not have all the data from the domain
open-world assumption
The system must use the sources (views) to answer
queries over the mediated schema

26
Query Answering

Assumption conjunctive queries, set semantics
Suppose we have a mediated schema author(aID,
isbn, year), book(isbn, title, publisher)
Suppose we have the query
q(a, t) - author(a, i, _), book(i, t, p), t
DB2 UDB
and sources
s1(a,t) ? author(a, i, _), book(i, t, p), t
123
s5(a, t, p) ? author(a, i, _), book(i,t), p
SAMS
We want to compose the query with the source
mappings but theyre in the wrong direction!
Yet everything in s1, s5 is an answer to the
query!

27
Answering Queries Using Views

Numerous recently-developed algorithms for these
Inverse rules Duschka et al.
Bucket algorithm Levy et al.
MiniCon Pottinger Halevy
Also related chase and backchase Popa,
Tannen, Deutsch
Requires conjunctive queries

28
Summary of Data Integration

Local-as-view integration has replaced
global-as-view as the standard
More robust way of defining mediated schemas and
sources
Mediated schema is clearly defined, less likely
to change
Sources can be more accurately described
Methods exist for query reformulation, including
inverse rules
Integration requires standardization on a single
schema
Can be hard to get consensus
Today we have peer-to-peer data integration,
e.g., Piazza Halevy et al., Orchestra Ives et
al., Hyperion Miller et al.
Some other aspects of integration were addressed
in related papers
Overlap between sources coverage of data at
sources
Semi-automated creation of mappings and wrappers
Data integration capabilities in commercial
products BEAs Liquid Data, IBMs DB2
Information Integrator, numerous packages from
middleware companies