On Schema Matching with Opaque Column Names and Data Values

About This Presentation

Title:

On Schema Matching with Opaque Column Names and Data Values

Description:

... of this talk. Outline of the remainder of this talk. Formal definition ... Two Car Part Tables. June 10, 2003. SIGMOD 2003. 15. Jaewoo Kang. Background ... – PowerPoint PPT presentation

Number of Views:22

Avg rating:3.0/5.0

Slides: 37

Provided by: www43

Category:

more less

Transcript and Presenter's Notes

Title: On Schema Matching with Opaque Column Names and Data Values

1
On Schema Matching with Opaque Column Names and
Data Values

Jaewoo Kang
NC State (Aug 2003)
Jeffrey F. Naughton
Univ. of Wisconsin-Madison

2
What is Schema Matching?

Finding semantic correspondences of schema
elements across heterogeneous sources.
Old problem yet attracting new interests.

3
What is Schema Matching? (Contd)

Important for enterprise applications
Data warehouses, data migration.
Also important for Internet data
Virtual databases, web information systems.
Fundamental element of data integration.

4
No Silver Bullet!

State of the art A collection of techniques that
propose matches.
We have added a new technique to this collection
that works when previous techniques dont even
apply.

5
Some Previous Approaches

Schema-based approaches

Manager Employ Salary
J. K. J. D. 50K
T. J. N. D 80K
P. K. Z. I. 75K
MNG EMP WAGE
U. P. D. S. 85K
A. H. M. H. 75K
J. N. D. F. 60K
Site 1
Site 2
6
Some Previous Approaches II

Instance-based approaches

Dept Employ Phone
HR J. D. 267-7622
RD N. D 354-8736
Sales Z. I. 219-0457
DPT EMP CONT
RD D. S. 387-9802
Sales M. H. 546-3856
Adm D. F. 326-1284
Site 1
Site 2
7
So two previous approaches

Schema-based (interpret column names)
Instance-based (interpret data values)

8
But what about this problem?
t1 t2 t3 t4 t5 t6 t7
193 7.3 3.6 5.7 9 0.39 0.2
176 7 4.5 8 15 0.87 0.4
123 6.3 3.8 7 12 0.56 0.5
238 6.7 3.9 3.7 18 0.44 0.5
174 6.1 3.5 4.4 21 0.56 0.6
96 6.1 4.1 3.1 10 0.73 0.3
133 8.4 4.7 6.3 12 0.77 0.3
t1 t2 t3 t4 t5 t6 t7
164 7.4 4.2 3.8 13 0.57 0.4
129 6.3 3.4 4.8 16 0.44 0.2
136 7.6 4 3.1 9 0.52 0.6
395 6.9 3.6 4.8 8 0.38 0.4
93 6.6 3.7 3.9 17 0.61 0.6
114 6.8 3.9 4 17 0.32 0.5
144 7.8 4.3 3.8 16 0.51 0.9
Site 1
Site 2
9
This is the Un-interpreted Matching Problem.

Focus of this talk
Outline of the remainder of this talk
Formal definition
Terminology
Algorithm
Experimental Results

10
Un-interpreted Matching

M1 match(R(r1, r2, .., rn), S(s1, s2, .., sm))
M2 match(R(r1, r2, .., rn), S(f1(s1), f2(s2),
.., fm(sm))
where match a schema matching algorithm,
Mi (ri-sj) set of matching column pairs,
fi arbitrary one-to-one function.
match is an un-interpreted matching iff M1M2
for all fis.
Main idea specific token representing column
name and value is not important.

11
Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6

Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3

12
Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6

Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3

13
Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6

Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3

14
Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6

Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3

15
Background

Before introducing our algorithm, need
Information Entropy
Mutual Information
Modeling Dependency Relations
Graph Matching

16
Information Entropy

Measures the uncertainty of values in an
attribute
Standard information theoretic measure

17
Mutual Information

Another standard information theoretic measure
Measures the amount of information captured in
one attribute about the other.
Note Self-information MI(XX) H(X)

18
Modeling Dependency Relation
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
Table R
GTable2DepGraph(R)
19
Graph Matching

Our algorithm will use graph matching.
(G1(a),G2(b))GraphMatch(G1,G2)
Finds a mapping that minimizes the distance
between the two graphs.

G1
G2
20
Distance Between the Graphs

Euclidean distance metric (Frobenius norm)
where aij and bij mutual information between
node i and j.
m(node in A) matching node in B.

21
Measuring the quality of match results
22
Finally, Our Matching Algorithm

1. G1 Table2DepGraph(S1)
G2 Table2DepGraph(S2)
2. (G1(a), G2(b)) GraphMatch(G1, G2)
where Si an input table,
Gi a dependency graph,
(G1(a), G2(b)) a matching node pair.

23
Validating the Framework

Graph matching algorithm
Used exhaustive search w/ simple filtering.
Can be replaced w/ approximate algorithms in
practice.
System
Java HotSpot VM 1.4

24
Goals of experiments

Main goal see if mutual information-based
un-interpreted matching works.
Secondary goal see if mutual information is
necessary, or if a simpler approach, Entropy-only
Matching, works just as well.
Only compares the entropies of attributes in
isolation, without considering mutual
information.

25
Data Set I

Census Data (U.S. Census Bureau)
State census data files NY and CA.
Can algorithm find mapping between attributes in
NY and CA tables?

1 2 3 4 5 6 7 8 9 10
18091 1063 10 9 9 41 15 368 368 288
17511 3281 25 21 40 89 59 1211 1211 796
609 3424 29 13 15 148 26 1055 1055 861
3861 2884 18 7 4 114 11 670 670 568
18614 1478 12 10 15 40 16 630 630 459
26
Data Set II

Medical Data
Thrombosis lab exam data (12 years of patient
records.)
Range partitioned into two tables based on exam
dates.
Can algorithm find mapping between attributes in
resulting two tables?

1 2 3 4 5 6 7 8 9 10
970709 23 530 104 6.4 4 14 0.5 232 100
971022 26 564 108 6.8 5.3 13 0.55 250 103
971224 25 483 90 6.5 5.1 15 0.62
980120 26 578 101 7 4.6 16 0.49 224 93
980217 34 521 98 5.3 10 0.62 234 111
27
Results
Thrombosis exam
Census data

Match precision deteriorates as the size of match
increases.
However, deterioration is small compared to the
exponential increase in search space.
MI-based approach dominates entropy-only
approach.

28
Why does mutual information-based approach
dominate entropy-only approach?
29
Cardinality Constraints in Schema Matching

One-to-one mapping (bijective)

A B C
A B C
G1
G2
30
Cardinality Constraints in Schema Matching

One-to-one mapping (bijective)
Onto mapping (surjective)

A B C D
A B C
G1
G2
31
Cardinality Constraints in Schema Matching

One-to-one mapping
Onto mapping
Partial mapping

A B C E
A B C D
G1
G2
32
What about schemas that dont match?

Examined how our matching algorithm reacts to the
matching of unrelated schemas. (NY-CA vs.
Lab1-CA)

33
Distinguishing Good and Bad Matches

Clearly detects case where there is no good
matching.

34
Summary

Identified new class of schema matching problems
that have not been addressed by existing
solutions.
First to introduce an un-interpreted matching
technique that addresses the new class of
problems.
Evaluation suggests it may be useful as an
addition to existing matching techniques.

35
Future Work

Find an efficient, accurate graph matching
approximation algorithm.
Extend the techniques to nested structures such
as XML, OO schemas.
See if the technique applicable to the problems
of schema classification / clustering.

On Schema Matching with Opaque Column Names and Data Values - PowerPoint PPT Presentation

On Schema Matching with Opaque Column Names and Data Values

... of this talk. Outline of the remainder of this talk. Formal definition ... Two Car Part Tables. June 10, 2003. SIGMOD 2003. 15. Jaewoo Kang. Background ... – PowerPoint PPT presentation