On Schema Matching with Opaque Column Names and Data Values - PowerPoint PPT Presentation

1 / 36
About This Presentation
Title:

On Schema Matching with Opaque Column Names and Data Values

Description:

... of this talk. Outline of the remainder of this talk. Formal definition ... Two Car Part Tables. June 10, 2003. SIGMOD 2003. 15. Jaewoo Kang. Background ... – PowerPoint PPT presentation

Number of Views:22
Avg rating:3.0/5.0
Slides: 37
Provided by: www43
Category:

less

Transcript and Presenter's Notes

Title: On Schema Matching with Opaque Column Names and Data Values


1
On Schema Matching with Opaque Column Names and
Data Values
  • Jaewoo Kang
  • NC State (Aug 2003)
  • Jeffrey F. Naughton
  • Univ. of Wisconsin-Madison

2
What is Schema Matching?
  • Finding semantic correspondences of schema
    elements across heterogeneous sources.
  • Old problem yet attracting new interests.

3
What is Schema Matching? (Contd)
  • Important for enterprise applications
  • Data warehouses, data migration.
  • Also important for Internet data
  • Virtual databases, web information systems.
  • Fundamental element of data integration.

4
No Silver Bullet!
  • State of the art A collection of techniques that
    propose matches.
  • We have added a new technique to this collection
    that works when previous techniques dont even
    apply.

5
Some Previous Approaches
  • Schema-based approaches

Manager Employ Salary
J. K. J. D. 50K
T. J. N. D 80K
P. K. Z. I. 75K
MNG EMP WAGE
U. P. D. S. 85K
A. H. M. H. 75K
J. N. D. F. 60K
Site 1
Site 2
6
Some Previous Approaches II
  • Instance-based approaches

Dept Employ Phone
HR J. D. 267-7622
RD N. D 354-8736
Sales Z. I. 219-0457
DPT EMP CONT
RD D. S. 387-9802
Sales M. H. 546-3856
Adm D. F. 326-1284
Site 1
Site 2
7
So two previous approaches
  • Schema-based (interpret column names)
  • Instance-based (interpret data values)

8
But what about this problem?
t1 t2 t3 t4 t5 t6 t7
193 7.3 3.6 5.7 9 0.39 0.2
176 7 4.5 8 15 0.87 0.4
123 6.3 3.8 7 12 0.56 0.5
238 6.7 3.9 3.7 18 0.44 0.5
174 6.1 3.5 4.4 21 0.56 0.6
96 6.1 4.1 3.1 10 0.73 0.3
133 8.4 4.7 6.3 12 0.77 0.3
t1 t2 t3 t4 t5 t6 t7
164 7.4 4.2 3.8 13 0.57 0.4
129 6.3 3.4 4.8 16 0.44 0.2
136 7.6 4 3.1 9 0.52 0.6
395 6.9 3.6 4.8 8 0.38 0.4
93 6.6 3.7 3.9 17 0.61 0.6
114 6.8 3.9 4 17 0.32 0.5
144 7.8 4.3 3.8 16 0.51 0.9
Site 1
Site 2
9
This is the Un-interpreted Matching Problem.
  • Focus of this talk
  • Outline of the remainder of this talk
  • Formal definition
  • Terminology
  • Algorithm
  • Experimental Results

10
Un-interpreted Matching
  • M1 match(R(r1, r2, .., rn), S(s1, s2, .., sm))
  • M2 match(R(r1, r2, .., rn), S(f1(s1), f2(s2),
    .., fm(sm))
  • where match a schema matching algorithm,
  • Mi (ri-sj) set of matching column pairs,
  • fi arbitrary one-to-one function.
  • match is an un-interpreted matching iff M1M2
    for all fis.
  • Main idea specific token representing column
    name and value is not important.

11
Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6

Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3

12
Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6

Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3

13
Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6

Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3

14
Motivating example
Two Car Part Tables
Model Color Tire
XLE white P2R6
XG2.5 silver XR5
LE red GM6

Model C1 C2
GL3.5 a1 b1
XGL a2 b2
XE a3 b3

15
Background
  • Before introducing our algorithm, need
  • Information Entropy
  • Mutual Information
  • Modeling Dependency Relations
  • Graph Matching

16
Information Entropy
  • Measures the uncertainty of values in an
    attribute
  • Standard information theoretic measure

17
Mutual Information
  • Another standard information theoretic measure
  • Measures the amount of information captured in
    one attribute about the other.
  • Note Self-information MI(XX) H(X)

18
Modeling Dependency Relation
A B C D
a1 b2 c1 d1
a3 b4 c2 d2
a1 b1 c1 d2
a4 b3 c2 d3
Table R
GTable2DepGraph(R)
19
Graph Matching
  • Our algorithm will use graph matching.
  • (G1(a),G2(b))GraphMatch(G1,G2)
  • Finds a mapping that minimizes the distance
    between the two graphs.

G1
G2
20
Distance Between the Graphs
  • Euclidean distance metric (Frobenius norm)
  • where aij and bij mutual information between
    node i and j.
  • m(node in A) matching node in B.

21
Measuring the quality of match results
22
Finally, Our Matching Algorithm
  • 1. G1 Table2DepGraph(S1)
  • G2 Table2DepGraph(S2)
  • 2. (G1(a), G2(b)) GraphMatch(G1, G2)
  • where Si an input table,
  • Gi a dependency graph,
  • (G1(a), G2(b)) a matching node pair.

23
Validating the Framework
  • Graph matching algorithm
  • Used exhaustive search w/ simple filtering.
  • Can be replaced w/ approximate algorithms in
    practice.
  • System
  • Java HotSpot VM 1.4

24
Goals of experiments
  • Main goal see if mutual information-based
    un-interpreted matching works.
  • Secondary goal see if mutual information is
    necessary, or if a simpler approach, Entropy-only
    Matching, works just as well.
  • Only compares the entropies of attributes in
    isolation, without considering mutual
    information.

25
Data Set I
  • Census Data (U.S. Census Bureau)
  • State census data files NY and CA.
  • Can algorithm find mapping between attributes in
    NY and CA tables?

1 2 3 4 5 6 7 8 9 10
18091 1063 10 9 9 41 15 368 368 288
17511 3281 25 21 40 89 59 1211 1211 796
609 3424 29 13 15 148 26 1055 1055 861
3861 2884 18 7 4 114 11 670 670 568
18614 1478 12 10 15 40 16 630 630 459
26
Data Set II
  • Medical Data
  • Thrombosis lab exam data (12 years of patient
    records.)
  • Range partitioned into two tables based on exam
    dates.
  • Can algorithm find mapping between attributes in
    resulting two tables?

1 2 3 4 5 6 7 8 9 10
970709 23 530 104 6.4 4 14 0.5 232 100
971022 26 564 108 6.8 5.3 13 0.55 250 103
971224 25 483 90 6.5 5.1 15 0.62
980120 26 578 101 7 4.6 16 0.49 224 93
980217 34 521 98 5.3 10 0.62 234 111
27
Results
Thrombosis exam
Census data
  • Match precision deteriorates as the size of match
    increases.
  • However, deterioration is small compared to the
    exponential increase in search space.
  • MI-based approach dominates entropy-only
    approach.

28
Why does mutual information-based approach
dominate entropy-only approach?
29
Cardinality Constraints in Schema Matching
  • One-to-one mapping (bijective)

A B C
A B C
G1
G2
30
Cardinality Constraints in Schema Matching
  • One-to-one mapping (bijective)
  • Onto mapping (surjective)

A B C D
A B C
G1
G2
31
Cardinality Constraints in Schema Matching
  • One-to-one mapping
  • Onto mapping
  • Partial mapping

A B C E
A B C D
G1
G2
32
What about schemas that dont match?
  • Examined how our matching algorithm reacts to the
    matching of unrelated schemas. (NY-CA vs.
    Lab1-CA)

33
Distinguishing Good and Bad Matches
  • Clearly detects case where there is no good
    matching.

34
Summary
  • Identified new class of schema matching problems
    that have not been addressed by existing
    solutions.
  • First to introduce an un-interpreted matching
    technique that addresses the new class of
    problems.
  • Evaluation suggests it may be useful as an
    addition to existing matching techniques.

35
Future Work
  • Find an efficient, accurate graph matching
    approximation algorithm.
  • Extend the techniques to nested structures such
    as XML, OO schemas.
  • See if the technique applicable to the problems
    of schema classification / clustering.

36
Questions?
  • For more information
  • jaewoo_at_cs.wisc.edu
  • http//www.cs.wisc.edu/jaewoo
Write a Comment
User Comments (0)
About PowerShow.com