Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets

Title: Swoosh: A Generic Approach to Entity Resolution
Last modified by: Hector Garcia-Molina
Document presentation format: On-screen Show


Transcript and Presenter's Notes

1
Generic Entity Resolution: Identifying Real-World
Entities in Large Data Sets
  • Hector Garcia-Molina
  • Stanford University

Work with Omar Benjelloun, Qi Su, Jennifer
Widom, Tyson Condie, Nicolas Pombourcq, David
Menestrina, Steven Whang
2
Entity Resolution
e1: [N: a, A: b, CC: c, Ph: e]
e2: [N: a, Exp: d, Ph: e]
3
Applications
  • comparison shopping
  • mailing lists
  • classified ads
  • customer files
  • counter-terrorism

e1: [N: a, A: b, CC: c, Ph: e]
e2: [N: a, Exp: d, Ph: e]
4
Outline
  • Why is ER challenging?
  • How is ER done?
  • Some ER work at Stanford
  • Confidences

5
Challenges (1)
  • No keys!
  • Value matching
  • Kaddafi, Qaddafi, Kadafi, Kaddaffi...
  • Record matching

[Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
[Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212]
6
Challenges (2)
  • Merging records

[Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
[Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212, Zp: 94305]
[Nm: Tom, Nm: Thomas, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777, Zp: 94305]
7
Challenges (3)
  • Chaining

[Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM]
[Nm: Thomas, Ad: 123 Maim, Oc: lawyer]
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer]
8
Challenges (4)
  • Un-merging

[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
too young to make 500K at IBM!!
9
Challenges (5)
  • Confidences in data

[Nm: Tom (0.9), Ad: 123 Main St (1.0), Ph: (650) 555-1212 (0.6), Ph: (650) 777-7777 (0.8)] (0.8)
  • In value matching, match rules, merge

conf ?
10
Taxonomy
  • Pairwise snaps vs. clustering
  • De-duplication vs. fidelity enhancement
  • Schema differences
  • Relationships
  • Exact vs. approximate
  • Generic vs application specific
  • Confidences

11
Schema Differences
[Name: Tom, Address: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
[FirstName: Tom, StreetName: Main St, StreetNumber: 123, Tel: (650) 777-7777]
12
Pair-Wise Snaps vs. Clustering
13
De-Duplication vs. Fidelity Enhancement
14
Relationships
father
brother
business
business
15
Using Relationships
papers
authors
same??
16
Exact vs Approximate ER
cameras
resolved cameras
ER
products
CDs
resolved CDs
ER
books
resolved books
ER
...
...
17
Exact vs Approximate ER
terrorists
terrorists
sort by age
match against ages 25-35
Widom 30
18
Generic vs Application Specific
  • Match function M(r, s)
  • Merge function <r, s> => t
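A minimal sketch of this generic interface, assuming records are sets of (attribute, value) pairs. The slides treat match and merge as black boxes supplied by the application, so the concrete rules here (match on a shared phone number, merge by keeping every value) are illustrative assumptions only.

```python
from typing import FrozenSet, Tuple

Record = FrozenSet[Tuple[str, str]]  # a record is a set of (attribute, value) pairs


def values(r: Record, attr: str) -> set:
    """All values a record holds for one attribute (merged records may hold several)."""
    return {v for a, v in r if a == attr}


def match(r: Record, s: Record) -> bool:
    """M(r, s). Hypothetical rule: the records share at least one phone number."""
    return bool(values(r, "Ph") & values(s, "Ph"))


def merge(r: Record, s: Record) -> Record:
    """<r, s> => t. Hypothetical rule: keep every value seen in either record."""
    return r | s


tom = frozenset({("Nm", "Tom"), ("Ad", "123 Main St"),
                 ("Ph", "(650) 555-1212"), ("Ph", "(650) 777-7777")})
thomas = frozenset({("Nm", "Thomas"), ("Ad", "132 Main St"),
                    ("Ph", "(650) 555-1212"), ("Zp", "94305")})

if match(tom, thomas):
    merged = merge(tom, thomas)
    print(sorted(merged))
```

An ER algorithm written only against `match` and `merge` never needs to know what the attributes mean, which is what makes the approach generic.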

19
Taxonomy
  • Pairwise snaps vs. clustering
  • De-duplication vs. fidelity enhancement
  • Schema differences
  • Relationships
  • Exact vs. approximate
  • Generic vs application specific
  • Confidences

20
Outline
  • Why is ER challenging?
  • How is ER done?
  • Some ER work at Stanford
  • Confidences

21
Taxonomy
  • Pairwise snaps vs. clustering
  • De-duplication vs. fidelity enhancement
  • Schema differences No
  • Relationships No
  • Exact vs. approximate
  • Generic vs application specific
  • Confidences ... later on

22
Model
r3: [Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]
r1: [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM]
r2: [Nm: Thomas, Ad: 123 Maim, Oc: lawyer]
M(r1, r2): r4 = <r1, r2> = [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer]
M(r4, r3): <r4, r3> = [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
23
Correct Answer
ER(R): all derivable records,
minus dominated records
24
Question
  • What is the best sequence of match and merge calls that
    gives us the right answer?

25
Brute Force Algorithm
  • Input R
  • r1: [a: 1, b: 2]
  • r2: [a: 1, c: 4, e: 5]
  • r3: [b: 2, c: 4, f: 6]
  • r4: [a: 7, e: 5, f: 6]

26
Brute Force Algorithm
  • Input R
  • r1: [a: 1, b: 2]
  • r2: [a: 1, c: 4, e: 5]
  • r3: [b: 2, c: 4, f: 6]
  • r4: [a: 7, e: 5, f: 6]
  • Match all pairs
  • r1: [a: 1, b: 2]
  • r2: [a: 1, c: 4, e: 5]
  • r3: [b: 2, c: 4, f: 6]
  • r4: [a: 7, e: 5, f: 6]
  • r12: [a: 1, b: 2, c: 4, e: 5]

27
Brute Force Algorithm
  • Match all pairs
  • r1: [a: 1, b: 2]
  • r2: [a: 1, c: 4, e: 5]
  • r3: [b: 2, c: 4, f: 6]
  • r4: [a: 7, e: 5, f: 6]
  • r12: [a: 1, b: 2, c: 4, e: 5]
  • Repeat
  • r1: [a: 1, b: 2]
  • r2: [a: 1, c: 4, e: 5]
  • r3: [b: 2, c: 4, f: 6]
  • r4: [a: 7, e: 5, f: 6]
  • r12: [a: 1, b: 2, c: 4, e: 5]
  • r123: [a: 1, b: 2, c: 4, e: 5, f: 6]
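The brute-force procedure can be sketched as a fixpoint loop: compare all pairs, add every merge, repeat until nothing new appears, then drop dominated records. The slides do not state the match rule; the rule below (same value for a, or same values for both b and c) is an assumption chosen only because it reproduces the merges shown on these slides. Merge is assumed to be a union of attribute values.

```python
from itertools import combinations


def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())


def share(r, s, attr):
    """True if r and s have a common value for the given attribute."""
    return bool({v for a, v in r if a == attr} & {v for a, v in s if a == attr})


def match(r, s):
    """Assumed rule: same value for a, or same values for both b and c."""
    return share(r, s, "a") or (share(r, s, "b") and share(r, s, "c"))


def merge(r, s):
    """Assumed rule: union of all attribute values."""
    return r | s


def brute_force_er(records):
    derived = set(records)
    while True:
        # Match all pairs and collect every merge they produce.
        new = {merge(r, s) for r, s in combinations(derived, 2) if match(r, s)}
        if new <= derived:          # nothing new: fixpoint reached
            break
        derived |= new
    # A record is dominated here if another record strictly contains it.
    return {r for r in derived if not any(r < s for s in derived)}


r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)

result = brute_force_er([r1, r2, r3, r4])
for r in result:
    print(sorted(r))
```

Under these assumptions the result contains r123 and r4; r1, r2, r3 and r12 are all dominated by r123.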

28
Question 1
Can we delete r1, r2?
29
Question 2
Can we avoid comparisons?
30
ICAR Properties
  • Idempotence
  • M(r1, r1) = true, <r1, r1> = r1
  • Commutativity
  • M(r1, r2) = M(r2, r1)
  • <r1, r2> = <r2, r1>
  • Associativity
  • <r1, <r2, r3>> = <<r1, r2>, r3>

31
More Properties
  • Representativity
  • If <r1, r2> = r3, then for any r4 such that M(r1,
    r4) is true we also have M(r3, r4) true.
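The four ICAR properties can be spot-checked on a sample of records. The sketch below assumes a union merge and a hypothetical match rule (same a, or same b and c); passing on a sample does not prove the properties hold in general, it only catches violations among the records tried.

```python
from itertools import product


def rec(**attrs):
    return frozenset(attrs.items())


def share(r, s, attr):
    return bool({v for a, v in r if a == attr} & {v for a, v in s if a == attr})


def match(r, s):
    """Assumed rule: same value for a, or same values for both b and c."""
    return share(r, s, "a") or (share(r, s, "b") and share(r, s, "c"))


def merge(r, s):
    """Assumed rule: union of all attribute values."""
    return r | s


def check_icar(records):
    """Return the names of the ICAR properties violated on this sample."""
    failed = set()
    for r in records:
        if not match(r, r) or merge(r, r) != r:
            failed.add("idempotence")
    for r, s in product(records, repeat=2):
        if match(r, s) != match(s, r):
            failed.add("commutativity")
        if match(r, s):
            if merge(r, s) != merge(s, r):
                failed.add("commutativity")
            t = merge(r, s)
            # Representativity: whatever matched r must still match <r, s>.
            if any(match(r, u) and not match(t, u) for u in records):
                failed.add("representativity")
    # Associativity of merge, checked on every triple regardless of match.
    for r, s, t in product(records, repeat=3):
        if merge(r, merge(s, t)) != merge(merge(r, s), t):
            failed.add("associativity")
    return failed


sample = [rec(a=1, b=2), rec(a=1, c=4, e=5), rec(b=2, c=4, f=6), rec(a=7, e=5, f=6)]
violations = check_icar(sample)
print(violations or "no violations found on this sample")
```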

r4
r1
r3
r2
32
ICAR Properties ⇒ Efficiency
  • Commutativity
  • Idempotence
  • Associativity
  • Representativity
  • Can discard records
  • ER result independent of processing order

33
Swoosh Algorithms
  • Record Swoosh
  • Merges records as soon as they match
  • Optimal in terms of record comparisons
  • Feature Swoosh
  • Remembers values seen for each feature
  • Avoids redundant value comparisons
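A sketch of the Record Swoosh idea, "merge as soon as a match is found", assuming the ICAR properties hold. The match and merge functions below are the same hypothetical ones used for illustration elsewhere (same a, or same b and c; union merge) and are not taken from the slides.

```python
def rec(**attrs):
    return frozenset(attrs.items())


def share(r, s, attr):
    return bool({v for a, v in r if a == attr} & {v for a, v in s if a == attr})


def match(r, s):
    """Assumed rule: same value for a, or same values for both b and c."""
    return share(r, s, "a") or (share(r, s, "b") and share(r, s, "c"))


def merge(r, s):
    """Assumed rule: union of all attribute values."""
    return r | s


def r_swoosh(records):
    todo = list(records)      # records still to be processed
    resolved = []             # records known not to match one another
    while todo:
        r = todo.pop()
        buddy = next((s for s in resolved if match(r, s)), None)
        if buddy is None:
            resolved.append(r)
        else:
            # Merge immediately and discard both sources; under ICAR the
            # merged record represents them, so they are never compared again.
            resolved.remove(buddy)
            todo.append(merge(r, buddy))
    return resolved


r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)

answer = r_swoosh([r1, r2, r3, r4])
for r in answer:
    print(sorted(r))
```

On the four-record example this returns r123 and r4 without ever materializing the dominated records that brute force has to generate and then remove.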

34
Swoosh Performance
35
If ICAR Properties Do Not Hold?
r12 Joe Sr., 123 Main, Ph 123, DLX
r23 Joe Jr., 123 Main, Ph 123, DLY
r3 Joe Jr., 123 Main, DLY
r1 Joe Sr., 123 Main, DLX
r2 Joe, 123 Main, Ph123
36
If ICAR Properties Do Not Hold?
r12 Joe Sr., 123 Main, Ph 123, DLX
r23 Joe Jr., 123 Main, Ph 123, DLY
r3 Joe Jr., 123 Main, DLY
r1 Joe Sr., 123 Main, DLX
r2 Joe, 123 Main, Ph123
Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
37
If ICAR Properties Do Not Hold?
r12 Joe Sr., 123 Main, Ph 123, DLX
r23 Joe Jr., 123 Main, Ph 123, DLY
r3 Joe Jr., 123 Main, DLY
r1 Joe Sr., 123 Main, DLX
r2 Joe, 123 Main, Ph123
Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
38
Swoosh Without ICAR Properties
39
Distributed Swoosh
P1
P2
P3
r1 r2 r3 r4 r5 r6 ...
40
Distributed Swoosh
P1
P2
P3
r1 r3 r4 r6 ...
r1 r2 r4 r5 ...
r2 r3 r5 r6 ...
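The layout on this slide can be reproduced by hashing records into three buckets and giving each processor two of the buckets, so that every pair of records meets on at least one processor. This bucket scheme is inferred from the picture and is an assumption; the slides do not spell out the distribution rule, and the sketch ignores how merged records are redistributed.

```python
from itertools import combinations

# Assumed scheme: record ri falls in bucket i mod 3, and each processor
# is responsible for one pair of buckets.
PROCESSOR_BUCKETS = {"P1": {0, 1}, "P2": {1, 2}, "P3": {0, 2}}


def distribute(ids):
    """Map each processor to the record ids it receives."""
    return {p: [i for i in ids if i % 3 in buckets]
            for p, buckets in PROCESSOR_BUCKETS.items()}


def every_pair_covered(layout, ids):
    """True if each pair of records is co-located on at least one processor."""
    return all(any(i in recs and j in recs for recs in layout.values())
               for i, j in combinations(ids, 2))


ids = range(1, 7)
layout = distribute(ids)
for processor, recs in layout.items():
    print(processor, ["r%d" % i for i in recs])
print(every_pair_covered(layout, ids))
```

Each record is sent to two of the three processors, which is the price paid for guaranteeing that no pair of records goes uncompared.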
41
DSwoosh Performance
42
Outline
  • Why is ER challenging?
  • How is ER done?
  • Some ER work at Stanford
  • Confidences

43
Conclusion
  • ER is old and important problem
  • Our approach generic
  • Confidences
  • challenging
  • two ways to tame
  • thresholds
  • packages

44
Thanks.
45
Generic Confidence Model
  • r1: 0.7 [a: v1, b: v2, c: v3]

match(0.7 [a, b, c], 0.9 [a, c, d]): yes (or no)
merge(0.7 [a, b, c], 0.9 [a, c, d]): 0.65 [a, b, c, d, x]
46
Problem: Properties May Not Hold
  • r1 = 0.9 [a, b, c]
  • r2 = 0.8 [a, d]
  • say confidences multiplied on merge
  • <r1, r2> = 0.72 [a, b, c, d]
  • <<r1, r2>, r1> = 0.648 [a, b, c, d]
  • <<r1, r1>, r2> = <r1, r2> = 0.72 [a, b, c, d]

47
ER with Confidences
  • Very Expensive
  • must compute all derivations
  • cannot delete records after they merge
  • What can we do??
  • thresholds
  • packages

48
Important Property
  • If conf(Rx) < threshold
  • Then for any Ry derived from Rx, conf(Ry) <
    threshold

C 0.7
r1
C < 0.7
r3
r4
r2
49
Thresholds - Example
T = 0.7
0.9 [a: v1, b: v2]
0.8 [a: v1, c: v3]
0.6 [b: v2, c: v3, d: v4]
0.75 [a: v1, b: v2, c: v3]
0.5 [a: v1, b: v2, c: v3, d: v4]
...
50
Thresholds - Example
T = 0.7
0.9 [a: v1, b: v2]
0.8 [a: v1, c: v3]
0.6 [b: v2, c: v3, d: v4]
0.75 [a: v1, b: v2, c: v3]
0.5 [a: v1, b: v2, c: v3, d: v4]
...
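A sketch of why the threshold property allows early pruning, assuming confidences multiply on merge so that a derived record can never be more confident than its sources. That merge rule is an assumption borrowed from the earlier "confidences multiplied" example; the slide's own merged record carries 0.75 rather than 0.9 × 0.8 = 0.72, so the confidence function used in the figure is evidently a different one.

```python
from fractions import Fraction


def merge(r, s):
    """Assumed merge: confidences multiply, attribute sets are unioned."""
    return (r[0] * s[0], r[1] | s[1])


T = Fraction(7, 10)
base = [
    (Fraction(9, 10), frozenset({("a", "v1"), ("b", "v2")})),
    (Fraction(8, 10), frozenset({("a", "v1"), ("c", "v3")})),
    (Fraction(6, 10), frozenset({("b", "v2"), ("c", "v3"), ("d", "v4")})),
]

kept = [r for r in base if r[0] >= T]
dropped = [r for r in base if r[0] < T]

# Anything derived from a below-threshold record is itself below threshold,
# so dropped records never need to be compared or merged at all.
doomed = [merge(d, r) for d in dropped for r in base]
print(all(m[0] < T for m in doomed))

survivor = merge(kept[0], kept[1])
print(float(survivor[0]), survivor[0] >= T)
```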
51
Goal C-Swoosh
base records
all possible merges
eliminate dominated
eliminate below threshold
52
Goal C-Swoosh
base records
all possible merges
eliminate dominated
eliminate below threshold
earlier
53
Does Threshold Property Hold?
  • NO: records are evidence

0.7 [a, b, c]
merge
0.9 [a, b, c]
0.8 [a, b, c]
54
Does Threshold Property Hold?
  • YES: records are beliefs

0.7 [a, b, c]
merge
0.8 [a, b, c]
0.8 [a, b, c]
55
Simple Confidence Model
  • 0.7 a, b

Alternate Worlds
a, b
a, b
a, b, c
a, b
a, b
a, b
a, b, d
???
???
???
56
Rules
  • 0.7 [a, b, c], 0.7 [a, b, c] → 0.7 [a, b, c]
  • 0.7 [a, b], 0.5 [a, b] → 0.7 [a, b]
  • 0.7 [a, b, c], 0.5 [a, b] → 0.7 [a, b, c]
  • 0.7 [a, b, c], 0.9 [a, b] → 0.7 [a, b, c], 0.9 [a, b]
  • etc
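These rules behave like a domination filter: a record is dropped when another record has at least its confidence and at least its attributes. The sketch below is one reading of the rules under that assumption, with records as (confidence, attribute set) pairs; the function names are hypothetical.

```python
def dominates(r, s):
    """r dominates s if r is at least as confident and contains all of s's attributes."""
    return r != s and r[0] >= s[0] and r[1] >= s[1]


def remove_dominated(records):
    unique = set(records)           # identical records collapse into one
    return {s for s in unique if not any(dominates(r, s) for r in unique)}


abc = frozenset("abc")
ab = frozenset("ab")

rule1 = remove_dominated([(0.7, abc), (0.7, abc)])
rule2 = remove_dominated([(0.7, ab), (0.5, ab)])
rule3 = remove_dominated([(0.7, abc), (0.5, ab)])
rule4 = remove_dominated([(0.7, abc), (0.9, ab)])

for name, out in [("rule 1", rule1), ("rule 2", rule2),
                  ("rule 3", rule3), ("rule 4", rule4)]:
    print(name, sorted((c, sorted(attrs)) for c, attrs in out))
```

In the fourth rule neither record dominates the other (one has more attributes, the other more confidence), so both survive.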

57
Matches
Match with confidence 0.5
0.9 [a, b, c], 0.8 [a, b, d], [a, x], [c, d, y]
worlds
1
2
3
4
5
6
7
8
9
10
a,b,c
a,b,d
a,b,c,d
58
Matches
0.4 [a, b, c, d], 0.9 [a, b, c], 0.8 [a, b, d], [a, x], [c, d, y]
0.9 [a, b, c], 0.8 [a, b, d], [a, x], [c, d, y]
worlds
1
2
3
4
5
6
7
8
9
10
a,b,c
a,b,d
a,b,c,d
59
Summary
  • Belief model well suited for ER
  • Evidence model is very complex and expensive!

60
Packages
  • Match does not use confidences
  • merge does compute confidences
  • 4 properties hold for deterministic attributes
  • e.g., <<r1, r2>, r3> = <r1, <r2, r3>>
    ignoring confidences

61
Partition Records
  • r1: 0.9 [a: 1, b: 2]
  • r2: 0.8 [a: 1, c: 4, e: 5]
  • r3: 0.7 [b: 2, c: 4, f: 6]
  • r4: 0.8 [a: 7, e: 5, f: 6]
  • r5: 0.9 [a: 7, b: 2]

r1
r12
r2
r123
r3
r4
r45
r5
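Partitioning into packages can be sketched by running a Swoosh-style pass that ignores confidences and only tracks which base records end up merged together. The match rule (same a, or same b and c) is an assumption chosen because it reproduces the two packages in the figure; confidences are carried in the input but deliberately unused at this stage.

```python
def share(r, s, attr):
    return bool({v for a, v in r if a == attr} & {v for a, v in s if a == attr})


def match(r, s):
    """Assumed rule on attributes only: same a, or same b and c."""
    return share(r, s, "a") or (share(r, s, "b") and share(r, s, "c"))


def partition(records):
    """records maps a name to (confidence, attributes); returns sets of names."""
    todo = [(attrs, frozenset({name})) for name, (_, attrs) in records.items()]
    packages = []
    while todo:
        attrs, members = todo.pop()
        buddy = next((p for p in packages if match(attrs, p[0])), None)
        if buddy is None:
            packages.append((attrs, members))
        else:
            packages.remove(buddy)
            todo.append((attrs | buddy[0], members | buddy[1]))
    return [set(members) for _, members in packages]


def rec(**attrs):
    return frozenset(attrs.items())


records = {
    "r1": (0.9, rec(a=1, b=2)),
    "r2": (0.8, rec(a=1, c=4, e=5)),
    "r3": (0.7, rec(b=2, c=4, f=6)),
    "r4": (0.8, rec(a=7, e=5, f=6)),
    "r5": (0.9, rec(a=7, b=2)),
}

packages = partition(records)
print(sorted(sorted(p) for p in packages))
```

The expensive confidence computation then only has to consider derivations inside each package rather than across the whole data set.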
62
Expand Packages
  • r1: 0.9 [a: 1, b: 2]
  • r2: 0.8 [a: 1, c: 4, e: 5]
  • r3: 0.7 [b: 2, c: 4, f: 6]
  • r4: 0.8 [a: 7, e: 5, f: 6]
  • r5: 0.9 [a: 7, b: 2]

r1
r12
r2
r123
r3
r1, r2, r3, <r1, r2>, <<r1, r2>, r3>, <r1, r3>, <r2, r3>, ...
63
Conclusion
  • ER is old and important problem
  • Our approach generic
  • Confidences
  • challenging
  • two ways to tame
  • thresholds
  • packages

64
Thanks.
65
Extra Slides
66
Taxonomy
  • Pairwise snaps vs. clustering
  • De-duplication vs. fidelity enhancement
  • Schema differences No
  • Relationships No
  • Exact vs. approximate
  • Generic vs application specific
  • Confidences ... later on

67
One Confidence Model
id1, a, b, c, d
id3, a, b, f, g
id2, a, c, e
[id1, a, b, c, d] [id1, a, b, d] [id1, a, x] [id1, b, y]
[id3, a, b, c] [id3, a, b, d] [id3, a, b, f, g] [id3, a, b, f, g]
[id2, a, b, c] [id2, a, c, e] [id2, a, c, e] [id2, a, c, e]
shorthand
68
Records Are Evidence
id1, a, b, c, d
[id1, a, b, c, d] [id1, a, b, d] [id1, a, x] [id1, b, y]
not 0.25
[id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y]
69
New Evidence
id1, a, b, c, d
[id1, a, b, c, d] [id1, a, b, d] [id1, a, x] [id1, b, y] [id1, a, b, c, d]
[id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y] [id1, a, b, c, d]
[id1, (4/5)a, (4/5)b, (2/5)c, (3/5)d, (1/5)x, (1/5)y]
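The fractions on these two slides are consistent with counting, for each value, the share of evidence records for id1 that contain it. A small sketch under that reading (the function name `support` is hypothetical):

```python
from collections import Counter
from fractions import Fraction


def support(evidence):
    """Fraction of evidence records that contain each value."""
    counts = Counter(v for record in evidence for v in set(record))
    return {v: Fraction(c, len(evidence)) for v, c in counts.items()}


# The four evidence records for id1.
evidence = [{"a", "b", "c", "d"}, {"a", "b", "d"}, {"a", "x"}, {"b", "y"}]
before = support(evidence)

# New evidence arrives: one more [a, b, c, d] record for id1.
after = support(evidence + [{"a", "b", "c", "d"}])

for v in sorted(before):
    print(v, before[v], "->", after[v])
```

Every new evidence record changes the denominator for all values, which is one reason the evidence model is costly to maintain.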
70
No Ids
[a, b, c] [a, b, d] [a, x] [c, d, y]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
71
No Ids
[a, b, c] [a, b, d] [a, x] [c, d, y]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
0.1
0.9
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y] [a, x]
72
Queries?
[a, b, c] [a, b, d] [a, x] [c, d, y]
Threshold 0.5, Support 2, Maximal Record
Example: [a, b, c, d]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
0.1
0.9
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y] [a, x]
73
Queries?
[a, b, c] [a, b, d] [a, x] [c, d, y]
Threshold 0.5, Support 2, Maximal Record
Example: [a, b, c, d]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
0.1
0.9
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y] [a, x]
74
Need Simpler Model?
75
Bonus Material
  • Entity Resolution, Confidences, and their
    relationship to Information Privacy

76
Privacy
Alice
1.0
1.0
[Nm: Alice, Ad: 32 Fox]
[Nm: Alice, Ad: 32 Fox, Ph: 5551212, Ad: 14 Cat]
Bob
77
Leakage
Alice
Bob
L = 0.6 (between 0 and 1)
78
Multi-Record Leakage
Alice
r1, L = 0.9; r2, L = 0.8; r3, L = 0.7
Bob
LL = 0.9 (between 0 and 1,
e.g., max L)
79
Q1 Added Vulnerability?
p
Alice
r1
r2
r3
r4
Bob
r4 may cause Bob's records to snap together!
?LL ??
80
Q2 Disinformation?
p
Alice
r1
r2
r3
r4 (lies)
Bob
What is most cost effective disinformation?
?LL ??
81
Q3 Verification?
p
Alice
hypothesis h (0.6)
r1, 0.9 r2, 0.8 r3, 0.7 ...
Bob
What is best fact to verify to increase confidence
in hypothesis?
82
Summary
  • Entity resolution is critical
  • Efficient resolution important
  • Confidences are important, but how?
  • ER is key aspect of info privacy
  • check www-db.stanford.edu for Swoosh paper,
    forthcoming paper

83
Thanks.
84
Extra Slides
85
Challenges
  • Exponential growth in complexity

0.9 [a: v1, b: v2]
0.8 [a: v1, c: v3]
0.6 [b: v2, c: v3, d: v4]
0.75 [a: v1, b: v2, c: v3]
0.5 [a: v1, b: v2, c: v3, d: v4]
...
86
Three Ideas to Tame Complexity
  • Thresholds
  • Domination
  • Packages

87
Thresholds
T = 0.7
0.9 [a: v1, b: v2]
0.8 [a: v1, c: v3]
0.6 [b: v2, c: v3, d: v4]
0.75 [a: v1, b: v2, c: v3]
0.5 [a: v1, b: v2, c: v3, d: v4]
...
88
Domination
0.9 [a: v1, b: v2, c: v3]
0.8 [a: v1, b: v2, c: v3]
0.8 [b: v2, c: v3]
...
89
Domination
0.9 [a: v1, b: v2, c: v3]
0.8 [a: v1, b: v2, c: v3]
0.8 [b: v2, c: v3]
...
90
Summary
  • Our approach pairwise, generic, Swoosh
  • Confidences
  • Making Tractable
  • threshold
  • domination
  • packages

91
Thank You
92
What Swoosh Does NOT Do
  • Hash table with every pair seen
  • records ri, rj
  • compared values vi, vj
  • Swoosh achieves the same effect without N^2 space

93
Swoosh Performance (I)
94
Swoosh Performance (II)
95
Swoosh Performance (III)