Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets

Title: Swoosh: A Generic Approach to Entity Resolution
Last modified by: Hector Garcia-Molina


1
Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets

Hector Garcia-Molina
Stanford University

Work with Omar Benjelloun, Qi Su, Jennifer
Widom, Tyson Condie, Nicolas Pombourcq, David
Menestrina, Steven Whang
2
Entity Resolution
e1 = [N: a, A: b, CC: c, Ph: e]
e2 = [N: a, Exp: d, Ph: e]
3
Applications

comparison shopping
mailing lists
classified ads
customer files
counter-terrorism

e1 = [N: a, A: b, CC: c, Ph: e]
e2 = [N: a, Exp: d, Ph: e]
4
Outline

Why is ER challenging?
How is ER done?
Some ER work at Stanford
Confidences

5
Challenges (1)

No keys!
Value matching
Kaddafi, Qaddafi, Kadafi, Kaddaffi...
Record matching

[Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
[Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212]
6
Challenges (2)

Merging records

[Nm: Tom, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
[Nm: Thomas, Ad: 132 Main St, Ph: (650) 555-1212, Zp: 94305]
[Nm: Tom, Nm: Thomas, Ad: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777, Zp: 94305]
7
Challenges (3)

Chaining

[Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM]
[Nm: Thomas, Ad: 123 Maim, Oc: lawyer]
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer]
8
Challenges (4)

Un-merging

[Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
too young to make 500K at IBM!!
9
Challenges (5)

Confidences in data

[Nm: Tom (0.9), Ad: 123 Main St (1.0), Ph: (650) 555-1212 (0.6), Ph: (650) 777-7777 (0.8)] (0.8)

In value matching, match rules, merge

conf ?
10
Taxonomy

Pairwise snaps vs. clustering
De-duplication vs. fidelity enhancement
Schema differences
Relationships
Exact vs. approximate
Generic vs application specific
Confidences

11
Schema Differences
[Name: Tom, Address: 123 Main St, Ph: (650) 555-1212, Ph: (650) 777-7777]
[FirstName: Tom, StreetName: Main St, StreetNumber: 123, Tel: (650) 777-7777]
12
Pair-Wise Snaps vs. Clustering
13
De-Duplication vs. Fidelity Enhancement
[Figure labels: B, S, R, S, N]
14
Relationships
father
brother
business
business
15
Using Relationships
papers
authors
same??
16
Exact vs Approximate ER
cameras
resolved cameras
ER
products
CDs
resolved CDs
ER
books
resolved books
ER
...
...
17
Exact vs Approximate ER
terrorists
terrorists
sort by age
match against ages 25-35
Widom 30
18
Generic vs Application Specific

Match function: M(r, s)
Merge function: <r, s> => t

19
Taxonomy

Pairwise snaps vs. clustering
De-duplication vs. fidelity enhancement
Schema differences
Relationships
Exact vs. approximate
Generic vs application specific
Confidences

20
Outline

Why is ER challenging?
How is ER done?
Some ER work at Stanford
Confidences

21
Taxonomy

Pairwise snaps vs. clustering
De-duplication vs. fidelity enhancement
Schema differences No
Relationships No
Exact vs. approximate
Generic vs application specific
Confidences ... later on

22
Model
r1 = [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM]
r2 = [Nm: Thomas, Ad: 123 Maim, Oc: lawyer]
r3 = [Nm: Tom, Wk: IBM, Oc: laywer, Sal: 500K]
M(r1, r2)
r4 = <r1, r2> = [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer]
M(r4, r3)
<r4, r3> = [Nm: Tom, Ad: 123 Main, BD: Jan 1, 85, Wk: IBM, Oc: lawyer, Sal: 500K]
23
Correct Answer
ER(R) = all derivable records,
minus dominated records
24
Question

What is the best sequence of match and merge calls that
gives us the right answer?

25
Brute Force Algorithm

Input R
r1 = [a:1, b:2]
r2 = [a:1, c:4, e:5]
r3 = [b:2, c:4, f:6]
r4 = [a:7, e:5, f:6]

26
Brute Force Algorithm

Input R
r1 = [a:1, b:2]
r2 = [a:1, c:4, e:5]
r3 = [b:2, c:4, f:6]
r4 = [a:7, e:5, f:6]

Match all pairs
r1 = [a:1, b:2]
r2 = [a:1, c:4, e:5]
r3 = [b:2, c:4, f:6]
r4 = [a:7, e:5, f:6]
r12 = [a:1, b:2, c:4, e:5]

27
Brute Force Algorithm

Match all pairs
r1 = [a:1, b:2]
r2 = [a:1, c:4, e:5]
r3 = [b:2, c:4, f:6]
r4 = [a:7, e:5, f:6]
r12 = [a:1, b:2, c:4, e:5]

Repeat
r1 = [a:1, b:2]
r2 = [a:1, c:4, e:5]
r3 = [b:2, c:4, f:6]
r4 = [a:7, e:5, f:6]
r12 = [a:1, b:2, c:4, e:5]
r123 = [a:1, b:2, c:4, e:5, f:6]
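
The brute-force loop above can be sketched in Python. The slides do not state the match rule, so this sketch assumes a hypothetical one (two records match if they share a value for a, or share values for both b and c) and a hypothetical merge that takes the union of attribute-value pairs. Under those assumptions it reproduces r12 and r123 from the slide and then drops dominated records.

```python
from itertools import combinations

def record(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def values(r, attr):
    """All values a record holds for one attribute (merged records may hold several)."""
    return {v for a, v in r if a == attr}

def match(r, s):
    """Assumed match rule: same a, or same b and same c."""
    def same(attr):
        return bool(values(r, attr) & values(s, attr))
    return same("a") or (same("b") and same("c"))

def merge(r, s):
    """Assumed merge function: union of attribute-value pairs."""
    return r | s

def brute_force_er(records):
    """Match all pairs, add every merge, repeat until nothing new appears."""
    derived = set(records)
    while True:
        new = {merge(r, s) for r, s in combinations(derived, 2) if match(r, s)}
        new -= derived
        if not new:
            break
        derived |= new
    # A record is dominated when a strictly larger derived record contains it.
    return {r for r in derived if not any(r < s for s in derived)}

r1 = record(a=1, b=2)
r2 = record(a=1, c=4, e=5)
r3 = record(b=2, c=4, f=6)
r4 = record(a=7, e=5, f=6)
result = brute_force_er([r1, r2, r3, r4])
```

With this input the derived set is r1, r2, r3, r4, r12, r123, and after removing dominated records only r123 and r4 remain.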

28
Question 1
Can we delete r1, r2?
29
Question 2
Can we avoid comparisons?
30
ICAR Properties

Idempotence
M(r1, r1) = true, <r1, r1> = r1
Commutativity
M(r1, r2) = M(r2, r1)
<r1, r2> = <r2, r1>
Associativity
<r1, <r2, r3>> = <<r1, r2>, r3>

31
More Properties

Representativity
If <r1, r2> = r3, then for any r4 such that M(r1,
r4) is true we also have M(r3, r4) true.

r4
r1
r3
r2
32
ICAR Properties => Efficiency

Commutativity
Idempotence
Associativity
Representativity

Can discard records
ER result independent of processing order

33
Swoosh Algorithms

Record Swoosh
Merges records as soon as they match
Optimal in terms of record comparisons
Feature Swoosh
Remembers values seen for each feature
Avoids redundant value comparisons
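
A minimal sketch of the "merge as soon as they match" idea behind Record Swoosh follows. It is an illustration, not necessarily the authors' exact algorithm. The match and merge functions are assumptions (same a, or same b and same c; merge by union) chosen so that the ICAR properties hold on the example records from the brute-force slides.

```python
def record(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def values(r, attr):
    """All values a record holds for one attribute."""
    return {v for a, v in r if a == attr}

def match(r, s):
    """Assumed match rule: same a, or same b and same c."""
    def same(attr):
        return bool(values(r, attr) & values(s, attr))
    return same("a") or (same("b") and same("c"))

def merge(r, s):
    """Assumed merge function: union of attribute-value pairs."""
    return r | s

def r_swoosh(records):
    """Compare each record against the resolved set; merge immediately on a match."""
    todo = list(records)   # records still to be processed
    resolved = []          # records known not to match one another
    while todo:
        r = todo.pop()
        buddy = next((s for s in resolved if match(r, s)), None)
        if buddy is None:
            resolved.append(r)
        else:
            # The two source records are discarded; only their merge is re-examined.
            resolved.remove(buddy)
            todo.append(merge(r, buddy))
    return resolved

r1 = record(a=1, b=2)
r2 = record(a=1, c=4, e=5)
r3 = record(b=2, c=4, f=6)
r4 = record(a=7, e=5, f=6)
forward = set(r_swoosh([r1, r2, r3, r4]))
backward = set(r_swoosh([r4, r3, r2, r1]))
```

Both processing orders end with the same two records, r123 and r4, which is the order independence the ICAR properties are meant to give.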

34
Swoosh Performance
35
If ICAR Properties Do Not Hold?
r12 Joe Sr., 123 Main, Ph 123, DLX
r23 Joe Jr., 123 Main, Ph 123, DLY
r3 Joe Jr., 123 Main, DLY
r1 Joe Sr., 123 Main, DLX
r2 Joe, 123 Main, Ph123
36
If ICAR Properties Do Not Hold?
r12 Joe Sr., 123 Main, Ph 123, DLX
r23 Joe Jr., 123 Main, Ph 123, DLY
r3 Joe Jr., 123 Main, DLY
r1 Joe Sr., 123 Main, DLX
r2 Joe, 123 Main, Ph123
Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
37
If ICAR Properties Do Not Hold?
r12 Joe Sr., 123 Main, Ph 123, DLX
r23 Joe Jr., 123 Main, Ph 123, DLY
r3 Joe Jr., 123 Main, DLY
r1 Joe Sr., 123 Main, DLX
r2 Joe, 123 Main, Ph123
Full Answer: ER(R) = {r12, r23, r1, r2, r3}
Minus Dominated: ER(R) = {r12, r23}
R-Swoosh Yields: ER(R) = {r12, r3} or {r1, r23}
38
Swoosh Without ICAR Properties
39
Distributed Swoosh
P1
P2
P3
r1 r2 r3 r4 r5 r6 ...
40
Distributed Swoosh
P1
P2
P3
r1 r3 r4 r6 ...
r1 r2 r4 r5 ...
r2 r3 r5 r6 ...
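
One reading of the assignment on this slide: the records are split round-robin into three buckets, and each processor receives one pair of buckets, so every pair of records meets on at least one processor. The sketch below reproduces the three record lists under that assumption. The function name `assign` and the bucket scheme are illustrative, not taken from the talk.

```python
from itertools import combinations

def assign(records, buckets=3):
    """Split records round-robin into buckets; give each processor one pair of buckets."""
    groups = [records[k::buckets] for k in range(buckets)]
    return [
        sorted(groups[i] + groups[j], key=records.index)
        for i, j in combinations(range(buckets), 2)
    ]

records = ["r1", "r2", "r3", "r4", "r5", "r6"]
processors = assign(records)
```

For six records this yields the lists r1 r2 r4 r5, r1 r3 r4 r6, and r2 r3 r5 r6, matching the slide up to processor order.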
41
DSwoosh Performance
42
Outline

Why is ER challenging?
How is ER done?
Some ER work at Stanford
Confidences

43
Conclusion

ER is old and important problem
Our approach generic
Confidences
challenging
two ways to tame
thresholds
packages

44
Thanks.
45
Generic Confidence Model

r1 = 0.7 [a:v1, b:v2, c:v3]

match(0.7 [a, b, c], 0.9 [a, c, d]) => yes (or no)
merge(0.7 [a, b, c], 0.9 [a, c, d]) => 0.65 [a, b, c, d, x]
46
Problem Properties May Not Hold

r1 = 0.9 [a, b, c]
r2 = 0.8 [a, d]
say confidences multiplied on merge
<r1, r2> = 0.72 [a, b, c, d]
<<r1, r2>, r1> = 0.648 [a, b, c, d]
<<r1, r1>, r2> = <r1, r2> = 0.72 [a, b, c, d]
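
The arithmetic on this slide can be checked with a small sketch. The merge function here is an assumption that follows the slide's "confidences multiplied on merge": identical records merge to themselves (idempotence), and otherwise attributes are unioned and confidences multiplied.

```python
def merge(r, s):
    """Assumed merge: idempotent on identical records, else union attrs and multiply confidences."""
    if r == s:
        return r
    (conf_r, attrs_r), (conf_s, attrs_s) = r, s
    return (conf_r * conf_s, attrs_r | attrs_s)

r1 = (0.9, frozenset("abc"))
r2 = (0.8, frozenset("ad"))

left = merge(merge(r1, r2), r1)    # <<r1, r2>, r1>
right = merge(merge(r1, r1), r2)   # <<r1, r1>, r2> = <r1, r2>
```

The two groupings produce the same attributes but different confidences (about 0.648 versus 0.72), so associativity fails once confidences are part of the record.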

47
ER with Confidences

Very Expensive
must compute all derivations
cannot delete records after they merge
What can we do??
thresholds
packages

48
Important Property

If conf(Rx) < threshold,
then for any Ry derived from Rx, conf(Ry) < threshold

C 0.7
r1
C < 0.7
r3
r4
r2
49
Thresholds - Example
T = 0.7
0.9 [a:v1, b:v2]
0.8 [a:v1, c:v3]
0.6 [b:v2, c:v3, d:v4]
0.75 [a:v1, b:v2, c:v3]
0.5 [a:v1, b:v2, c:v3, d:v4]
...
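
As a small illustration of threshold pruning on these five records: if the threshold property holds, any record below T can be dropped as soon as it is produced, because nothing derived from it can reach T. The tuple layout (confidence, attributes) is an assumption made for this sketch.

```python
T = 0.7

# (confidence, attributes) pairs taken from the slide
records = [
    (0.9, {"a": "v1", "b": "v2"}),
    (0.8, {"a": "v1", "c": "v3"}),
    (0.6, {"b": "v2", "c": "v3", "d": "v4"}),
    (0.75, {"a": "v1", "b": "v2", "c": "v3"}),
    (0.5, {"a": "v1", "b": "v2", "c": "v3", "d": "v4"}),
]

# Keep only records whose confidence meets the threshold.
kept = [r for r in records if r[0] >= T]
kept_confidences = [conf for conf, _ in kept]
```

Here the records with confidence 0.6 and 0.5 are eliminated, leaving those with 0.9, 0.8, and 0.75.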
51
Goal C-Swoosh
base records
all possible merges
eliminate dominated
eliminate below threshold
52
Goal C-Swoosh
base records
all possible merges
eliminate dominated
eliminate below threshold
earlier
53
Does Threshold Property Hold?

NO: records are evidence

merge(0.7 [a, b, c], 0.8 [a, b, c]) => 0.9 [a, b, c]
54
Does Threshold Property Hold?

YES: records are beliefs

merge(0.7 [a, b, c], 0.8 [a, b, c]) => 0.8 [a, b, c]
55
Simple Confidence Model

0.7 a, b

Alternate Worlds
a, b
a, b
a, b, c
a, b
a, b
a, b
a, b, d
???
???
???
56
Rules

0.7 [a, b, c], 0.7 [a, b, c] => 0.7 [a, b, c]
0.7 [a, b], 0.5 [a, b] => 0.7 [a, b]
0.7 [a, b, c], 0.5 [a, b] => 0.7 [a, b, c]
0.7 [a, b, c], 0.9 [a, b] => 0.7 [a, b, c], 0.9 [a, b]
etc.
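
These four rules can be read as a domination check: a record is dropped when another record has at least the same attributes and at least the same confidence. The sketch below encodes that reading. The representation (confidence, frozenset of attributes) and the helper names are assumptions for illustration.

```python
def rec(conf, attrs):
    """Build a record as (confidence, frozenset of attribute names)."""
    return (conf, frozenset(attrs))

def prune_dominated(records):
    """Drop duplicates and any record dominated by a different record."""
    unique = set(records)
    return {
        (conf, attrs)
        for conf, attrs in unique
        if not any(
            (conf2, attrs2) != (conf, attrs) and attrs <= attrs2 and conf <= conf2
            for conf2, attrs2 in unique
        )
    }

rule1 = prune_dominated([rec(0.7, "abc"), rec(0.7, "abc")])
rule2 = prune_dominated([rec(0.7, "ab"), rec(0.5, "ab")])
rule3 = prune_dominated([rec(0.7, "abc"), rec(0.5, "ab")])
rule4 = prune_dominated([rec(0.7, "abc"), rec(0.9, "ab")])
```

In the fourth case neither record dominates the other, since the smaller record carries the higher confidence, so both survive.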

57
Matches
Match with confidence 0.5
0.9 [a, b, c]   0.8 [a, b, d]   [a, x]   [c, d, y]
[Figure: worlds 1 to 10, with records [a, b, c], [a, b, d], [a, b, c, d]]
58
Matches
0.4 [a, b, c, d]   0.9 [a, b, c]   0.8 [a, b, d]   [a, x]   [c, d, y]
0.9 [a, b, c]   0.8 [a, b, d]   [a, x]   [c, d, y]
[Figure: worlds 1 to 10, with records [a, b, c], [a, b, d], [a, b, c, d]]
59
Summary

Belief model well suited for ER
Evidence model is very complex and expensive!

60
Packages

Match does not use confidences
merge does compute confidences
4 properties hold for deterministic attributes
e.g., <<r1, r2>, r3> = <r1, <r2, r3>>,
ignoring confidences

61
Partition Records

r1 = 0.9 [a:1, b:2]
r2 = 0.8 [a:1, c:4, e:5]
r3 = 0.7 [b:2, c:4, f:6]
r4 = 0.8 [a:7, e:5, f:6]
r5 = 0.9 [a:7, b:2]

r1
r12
r2
r123
r3
r4
r45
r5
62
Expand Packages

r1 = 0.9 [a:1, b:2]
r2 = 0.8 [a:1, c:4, e:5]
r3 = 0.7 [b:2, c:4, f:6]
r4 = 0.8 [a:7, e:5, f:6]
r5 = 0.9 [a:7, b:2]

r1
r12
r2
r123
r3
r1, r2, r3, <r1, r2>, <<r1, r2>, r3>, <r1, r3>, <r2, r3>, ...
63
Conclusion

ER is old and important problem
Our approach generic
Confidences
challenging
two ways to tame
thresholds
packages

64
Thanks.
65
Extra Slides
66
Taxonomy

Pairwise snaps vs. clustering
De-duplication vs. fidelity enhancement
Schema differences No
Relationships No
Exact vs. approximate
Generic vs application specific
Confidences ... later on

67
One Confidence Model
[id1, a, b, c, d]
[id3, a, b, f, g]
[id2, a, c, e]
[id1, a, b, c, d] [id1, a, b, d] [id1, a, x] [id1, b, y]
[id3, a, b, c] [id3, a, b, d] [id3, a, b, f, g] [id3, a, b, f, g]
[id2, a, b, c] [id2, a, c, e] [id2, a, c, e] [id2, a, c, e]
shorthand
68
Records Are Evidence
[id1, a, b, c, d]
[id1, a, b, c, d] [id1, a, b, d] [id1, a, x] [id1, b, y]
not 0.25
[id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y]
69
New Evidence
[id1, a, b, c, d]
[id1, a, b, c, d] [id1, a, b, d] [id1, a, x] [id1, b, y] [id1, a, b, c, d]
[id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y] [id1, a, b, c, d]
[id1, (4/5)a, (4/5)b, (2/5)c, (3/5)d, (1/5)x, (1/5)y]
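
The fractions on the last two slides are consistent with treating each attribute's confidence as the share of evidence records for id1 that contain it. A minimal sketch under that assumption follows; the function name is illustrative.

```python
from collections import Counter
from fractions import Fraction

def attribute_confidences(evidence):
    """Confidence of each attribute = fraction of evidence records that contain it."""
    counts = Counter(attr for rec in evidence for attr in rec)
    return {attr: Fraction(n, len(evidence)) for attr, n in counts.items()}

# The four evidence records for id1, then one new piece of evidence [a, b, c, d].
evidence = [set("abcd"), set("abd"), set("ax"), set("by")]
before = attribute_confidences(evidence)
after = attribute_confidences(evidence + [set("abcd")])
```

With four records the confidence of a is 3/4, and after the fifth record arrives it becomes 4/5, as on the slide.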
70
No Ids
[a, b, c] [a, b, d] [a, x] [c, d, y]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
71
No Ids
[a, b, c] [a, b, d] [a, x] [c, d, y]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
0.1
0.9
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y] [a, x]
72
Queries?
[a, b, c] [a, b, d] [a, x] [c, d, y]
Threshold 0.5, Support 2, Maximal
Record Example: [a, b, c, d]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
0.1
0.9
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y] [a, x]
74
Need Simpler Model?
75
Bonus Material

Entity Resolution, Confidences, and their
relationship to Information Privacy

76
Privacy
Alice
1.0
1.0
Nm Alice Ad 32 Fox
Nm Alice Ad 32 Fox Ph 5551212 Ad 14 Cat
Bob
77
Leakage
Alice
Bob
L = 0.6 (between 0 and 1)
78
Multi-Record Leakage
Alice
r1, L = 0.9; r2, L = 0.8; r3, L = 0.7
Bob
LL = 0.9 (between 0 and 1,
e.g., max L)
79
Q1 Added Vulnerability?
p
Alice
r1
r2
r3
r4
Bob
r4 may cause Bob's records to snap together!
?LL ??
80
Q2 Disinformation?
p
Alice
r1
r2
r3
r4 (lies)
Bob
What is the most cost-effective disinformation?
?LL ??
81
Q3 Verification?
p
Alice
hypothesis h (0.6)
r1, 0.9 r2, 0.8 r3, 0.7 ...
Bob
What is the best fact to verify to increase confidence
in the hypothesis?
82
Summary

Entity resolution is critical
Efficient resolution important
Confidences are important, but how?
ER is key aspect of info privacy
check www-db.stanford.edu for Swoosh paper
forthcoming paper

83
Thanks.
84
Extra Slides
85
Challenges

Exponential growth in complexity

0.9 [a:v1, b:v2]
0.8 [a:v1, c:v3]
0.6 [b:v2, c:v3, d:v4]
0.75 [a:v1, b:v2, c:v3]
0.5 [a:v1, b:v2, c:v3, d:v4]
...
86
Three Ideas to Tame Complexity

Thresholds
Domination
Packages

87
Thresholds
T = 0.7
0.9 [a:v1, b:v2]
0.8 [a:v1, c:v3]
0.6 [b:v2, c:v3, d:v4]
0.75 [a:v1, b:v2, c:v3]
0.5 [a:v1, b:v2, c:v3, d:v4]
...
88
Domination
0.9 [a:v1, b:v2, c:v3]
0.8 [a:v1, b:v2, c:v3]
0.8 [b:v2, c:v3]
...
90
Summary