Title: On Schema Matching with Opaque Column Names and Data Values
1On Schema Matching with Opaque Column Names and
Data Values
- Jaewoo Kang
- NC State (Aug 2003)
- Jeffrey F. Naughton
-
- Univ. of Wisconsin-Madison
2What is Schema Matching?
- Finding semantic correspondences of schema
elements across heterogeneous sources. - Old problem yet attracting new interests.
3What is Schema Matching? (Contd)
- Important for enterprise applications
- Data warehouses, data migration.
- Also important for Internet data
- Virtual databases, web information systems.
- Fundamental element of data integration.
4No Silver Bullet!
- State of the art A collection of techniques that
propose matches. - We have added a new technique to this collection
that works when previous techniques dont even
apply.
5Some Previous Approaches
Manager Employ Salary
J. K. J. D. 50K
T. J. N. D 80K
P. K. Z. I. 75K
MNG EMP WAGE
U. P. D. S. 85K
A. H. M. H. 75K
J. N. D. F. 60K
Site 1
Site 2
6Some Previous Approaches II
- Instance-based approaches
Dept Employ Phone
HR J. D. 267-7622
RD N. D 354-8736
Sales Z. I. 219-0457
DPT EMP CONT
RD D. S. 387-9802
Sales M. H. 546-3856
Adm D. F. 326-1284
Site 1
Site 2
7So two previous approaches
- Schema-based (interpret column names)
- Instance-based (interpret data values)
8But what about this problem?
t1 t2 t3 t4 t5 t6 t7
193 7.3 3.6 5.7 9 0.39 0.2
176 7 4.5 8 15 0.87 0.4
123 6.3 3.8 7 12 0.56 0.5
238 6.7 3.9 3.7 18 0.44 0.5
174 6.1 3.5 4.4 21 0.56 0.6
96 6.1 4.1 3.1 10 0.73 0.3
133 8.4 4.7 6.3 12 0.77 0.3
t1 t2 t3 t4 t5 t6 t7
164 7.4 4.2 3.8 13 0.57 0.4
129 6.3 3.4 4.8 16 0.44 0.2
136 7.6 4 3.1 9 0.52 0.6
395 6.9 3.6 4.8 8 0.38 0.4
93 6.6 3.7 3.9 17 0.61 0.6
114 6.8 3.9 4 17 0.32 0.5
144 7.8 4.3 3.8 16 0.51 0.9
Site 1
Site 2
9This is the Un-interpreted Matching Problem.
- Focus of this talk
- Outline of the remainder of this talk
- Formal definition
- Terminology
- Algorithm
- Experimental Results
10Un-interpreted Matching
- M1 match(R(r1, r2, .., rn), S(s1, s2, .., sm))
- M2 match(R(r1, r2, .., rn), S(f1(s1), f2(s2),
.., fm(sm)) - where match a schema matching algorithm,
- Mi (ri-sj) set of matching column pairs,
- fi arbitrary one-to-one function.
- match is an un-interpreted matching iff M1M2
for all fis. - Main idea specific token representing column
name and value is not important.
11Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6
Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3
12Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6
Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3
13Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6
Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3
14Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6
Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3
15Background
- Before introducing our algorithm, need
- Information Entropy
- Mutual Information
- Modeling Dependency Relations
- Graph Matching
16Information Entropy
- Measures the uncertainty of values in an
attribute - Standard information theoretic measure
17Mutual Information
- Another standard information theoretic measure
- Measures the amount of information captured in
one attribute about the other. - Note Self-information MI(XX) H(X)
18Modeling Dependency Relation
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
Table R
GTable2DepGraph(R)
19Graph Matching
- Our algorithm will use graph matching.
- (G1(a),G2(b))GraphMatch(G1,G2)
- Finds a mapping that minimizes the distance
between the two graphs.
G1
G2
20Distance Between the Graphs
- Euclidean distance metric (Frobenius norm)
- where aij and bij mutual information between
node i and j. - m(node in A) matching node in B.
21Measuring the quality of match results
22Finally, Our Matching Algorithm
- 1. G1 Table2DepGraph(S1)
- G2 Table2DepGraph(S2)
- 2. (G1(a), G2(b)) GraphMatch(G1, G2)
- where Si an input table,
- Gi a dependency graph,
- (G1(a), G2(b)) a matching node pair.
23Validating the Framework
- Graph matching algorithm
- Used exhaustive search w/ simple filtering.
- Can be replaced w/ approximate algorithms in
practice. - System
- Java HotSpot VM 1.4
24Goals of experiments
- Main goal see if mutual information-based
un-interpreted matching works. - Secondary goal see if mutual information is
necessary, or if a simpler approach, Entropy-only
Matching, works just as well. - Only compares the entropies of attributes in
isolation, without considering mutual
information.
25Data Set I
- Census Data (U.S. Census Bureau)
- State census data files NY and CA.
- Can algorithm find mapping between attributes in
NY and CA tables?
1 2 3 4 5 6 7 8 9 10
18091 1063 10 9 9 41 15 368 368 288
17511 3281 25 21 40 89 59 1211 1211 796
609 3424 29 13 15 148 26 1055 1055 861
3861 2884 18 7 4 114 11 670 670 568
18614 1478 12 10 15 40 16 630 630 459
26Data Set II
- Medical Data
- Thrombosis lab exam data (12 years of patient
records.) - Range partitioned into two tables based on exam
dates. - Can algorithm find mapping between attributes in
resulting two tables?
1 2 3 4 5 6 7 8 9 10
970709 23 530 104 6.4 4 14 0.5 232 100
971022 26 564 108 6.8 5.3 13 0.55 250 103
971224 25 483 90 6.5 5.1 15 0.62
980120 26 578 101 7 4.6 16 0.49 224 93
980217 34 521 98 5.3 10 0.62 234 111
27Results
Thrombosis exam
Census data
- Match precision deteriorates as the size of match
increases. - However, deterioration is small compared to the
exponential increase in search space. - MI-based approach dominates entropy-only
approach.
28Why does mutual information-based approach
dominate entropy-only approach?
29Cardinality Constraints in Schema Matching
- One-to-one mapping (bijective)
A B C
A B C
G1
G2
30Cardinality Constraints in Schema Matching
- One-to-one mapping (bijective)
- Onto mapping (surjective)
A B C D
A B C
G1
G2
31Cardinality Constraints in Schema Matching
- One-to-one mapping
- Onto mapping
- Partial mapping
A B C E
A B C D
G1
G2
32What about schemas that dont match?
- Examined how our matching algorithm reacts to the
matching of unrelated schemas. (NY-CA vs.
Lab1-CA)
33Distinguishing Good and Bad Matches
- Clearly detects case where there is no good
matching.
34Summary
- Identified new class of schema matching problems
that have not been addressed by existing
solutions. - First to introduce an un-interpreted matching
technique that addresses the new class of
problems. - Evaluation suggests it may be useful as an
addition to existing matching techniques.
35Future Work
- Find an efficient, accurate graph matching
approximation algorithm. - Extend the techniques to nested structures such
as XML, OO schemas. - See if the technique applicable to the problems
of schema classification / clustering.
36Questions?
- For more information
- jaewoo_at_cs.wisc.edu
- http//www.cs.wisc.edu/jaewoo