Title: Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets
1Generic Entity Resolution: Identifying Real-World Entities in Large Data Sets
- Hector Garcia-Molina
- Stanford University
Work with Omar Benjelloun, Qi Su, Jennifer
Widom, Tyson Condie, Nicolas Pombourcq, David
Menestrina, Steven Whang
2Entity Resolution
e1: N: a; A: b; CC: c; Ph: e
e2: N: a; Exp: d; Ph: e
3Applications
- comparison shopping
- mailing lists
- classified ads
- customer files
- counter-terrorism
e1: N: a; A: b; CC: c; Ph: e
e2: N: a; Exp: d; Ph: e
4Outline
- Why is ER challenging?
- How is ER done?
- Some ER work at Stanford
- Confidences
5Challenges (1)
- No keys!
- Value matching
- Kaddafi, Qaddafi, Kadafi, Kaddaffi...
- Record matching
Nm: Tom; Ad: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777
Nm: Thomas; Ad: 132 Main St; Ph: (650) 555-1212
6Challenges (2)
Nm: Tom; Ad: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777
Nm: Thomas; Ad: 132 Main St; Ph: (650) 555-1212; Zp: 94305
Nm: Tom; Nm: Thomas; Ad: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777; Zp: 94305
7Challenges (3)
Nm: Tom; Wk: IBM; Oc: laywer; Sal: 500K
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM
Nm: Thomas; Ad: 123 Maim; Oc: lawyer
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer
8Challenges (4)
Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K
Too young to make 500K at IBM!!
9Challenges (5)
Nm: Tom (0.9); Ad: 123 Main St (1.0); Ph: (650) 555-1212 (0.6); Ph: (650) 777-7777 (0.8)
(0.8)
- Confidences in value matching, match rules, and merges?
10Taxonomy
- Pairwise snaps vs. clustering
- De-duplication vs. fidelity enhancement
- Schema differences
- Relationships
- Exact vs. approximate
- Generic vs application specific
- Confidences
11Schema Differences
Name: Tom; Address: 123 Main St; Ph: (650) 555-1212; Ph: (650) 777-7777
FirstName: Tom; StreetName: Main St; StreetNumber: 123; Tel: (650) 777-7777
12Pair-Wise Snaps vs. Clustering
13De-Duplication vs. Fidelity Enhancement
B
S
R
S
N
14Relationships
father
brother
business
business
15Using Relationships
papers
authors
same??
16Exact vs Approximate ER
cameras
resolved cameras
ER
products
CDs
resolved CDs
ER
books
resolved books
ER
...
...
17Exact vs Approximate ER
terrorists
terrorists
sort by age
match against ages 25-35
Widom 30
18Generic vs Application Specific
- Match function M(r, s)
- Merge function <r, s> => t
19Taxonomy
- Pairwise snaps vs. clustering
- De-duplication vs. fidelity enhancement
- Schema differences
- Relationships
- Exact vs. approximate
- Generic vs application specific
- Confidences
20Outline
- Why is ER challenging?
- How is ER done?
- Some ER work at Stanford
- Confidences
21Taxonomy
- Pairwise snaps vs. clustering
- De-duplication vs. fidelity enhancement
- Schema differences No
- Relationships No
- Exact vs. approximate
- Generic vs application specific
- Confidences ... later on
22Model
r3: Nm: Tom; Wk: IBM; Oc: laywer; Sal: 500K
r1: Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM
r2: Nm: Thomas; Ad: 123 Maim; Oc: lawyer
M(r1, r2)
M(r4, r3)
r4 = <r1, r2>: Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer
<r4, r3>: Nm: Tom; Ad: 123 Main; BD: Jan 1, 85; Wk: IBM; Oc: lawyer; Sal: 500K
23Correct Answer
ER(R) = all derivable records, minus dominated records
24Question
- What is the best sequence of match and merge calls that gives us the right answer?
25Brute Force Algorithm
- Input R
- r1 = [a:1, b:2]
- r2 = [a:1, c:4, e:5]
- r3 = [b:2, c:4, f:6]
- r4 = [a:7, e:5, f:6]
26Brute Force Algorithm
- Input R
- r1 = [a:1, b:2]
- r2 = [a:1, c:4, e:5]
- r3 = [b:2, c:4, f:6]
- r4 = [a:7, e:5, f:6]
- Match all pairs
- r1 = [a:1, b:2]
- r2 = [a:1, c:4, e:5]
- r3 = [b:2, c:4, f:6]
- r4 = [a:7, e:5, f:6]
- r12 = [a:1, b:2, c:4, e:5]
27Brute Force Algorithm
- Match all pairs
- r1 = [a:1, b:2]
- r2 = [a:1, c:4, e:5]
- r3 = [b:2, c:4, f:6]
- r4 = [a:7, e:5, f:6]
- r12 = [a:1, b:2, c:4, e:5]
- Repeat
- r1 = [a:1, b:2]
- r2 = [a:1, c:4, e:5]
- r3 = [b:2, c:4, f:6]
- r4 = [a:7, e:5, f:6]
- r12 = [a:1, b:2, c:4, e:5]
- r123 = [a:1, b:2, c:4, e:5, f:6]
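The brute-force procedure on these slides can be sketched in Python as follows. The talk treats match and merge as black boxes, so the rule used here (two records match if they share an `a` value, or share both a `b` and a `c` value) and the union merge are assumptions, chosen only because they reproduce r12 and r123 from the slide. The helper names `rec`, `shares`, and `brute_force_er` are hypothetical.

```python
from itertools import combinations

def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def shares(r, s, attr):
    """True if r and s have a common value for the given attribute."""
    return any(k == attr and (k, v) in s for (k, v) in r)

def match(r, s):
    # Assumed rule (not from the talk): same a, or same b and same c.
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))

def merge(r, s):
    # Assumed merge: keep every attribute-value pair of both records.
    return r | s

def brute_force_er(records):
    """Compare all pairs, add new merges, repeat until nothing new appears."""
    derived = set(records)
    changed = True
    while changed:
        changed = False
        for r, s in combinations(list(derived), 2):
            if match(r, s):
                m = merge(r, s)
                if m not in derived:
                    derived.add(m)
                    changed = True
    # Drop dominated records: those strictly contained in another record.
    return {r for r in derived if not any(r < s for s in derived)}

r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)
result = brute_force_er([r1, r2, r3, r4])
```

Under these assumptions `result` contains r123 and r4: r1, r2, r3, and r12 are all dominated by r123, and r4 matches nothing.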
28Question 1
Can we delete r1, r2?
29Question 2
Can we avoid comparisons?
30ICAR Properties
- Idempotence
- M(r1, r1) = true; <r1, r1> = r1
- Commutativity
- M(r1, r2) = M(r2, r1)
- <r1, r2> = <r2, r1>
- Associativity
- <r1, <r2, r3>> = <<r1, r2>, r3>
31More Properties
- Representativity
- If <r1, r2> = r3, then for any r4 such that M(r1, r4) is true we also have M(r3, r4) true.
r4
r1
r3
r2
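The four properties can be spot-checked on a sample of records. This is a minimal sketch, not a proof: it only tests the given sample, and it reuses an assumed match rule (same `a`, or same `b` and `c`) and an assumed union merge that are not taken from the talk. `check_icar` is a hypothetical helper.

```python
from itertools import product

def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def shares(r, s, attr):
    return any(k == attr and (k, v) in s for (k, v) in r)

def match(r, s):
    # Assumed rule: same a, or same b and same c.
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))

def merge(r, s):
    # Assumed merge: union of attribute-value pairs.
    return r | s

def check_icar(records):
    """Return the names of the properties violated on this sample."""
    failed = set()
    for r in records:
        if not (match(r, r) and merge(r, r) == r):
            failed.add("idempotence")
    for r, s in product(records, repeat=2):
        if match(r, s) != match(s, r) or merge(r, s) != merge(s, r):
            failed.add("commutativity")
    for r, s, t in product(records, repeat=3):
        if merge(r, merge(s, t)) != merge(merge(r, s), t):
            failed.add("associativity")
        # Representativity: whatever matched r must still match <r, s>.
        if match(r, s) and match(r, t) and not match(merge(r, s), t):
            failed.add("representativity")
    return failed

sample = [rec(a=1, b=2), rec(a=1, c=4, e=5), rec(b=2, c=4, f=6), rec(a=7, e=5, f=6)]
violations = check_icar(sample)
```

For this sample `violations` is empty. A record with no `a` value and without both `b` and `c` would fail idempotence under the assumed rule, which shows how easily a match function can break the properties.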
32ICAR Properties → Efficiency
- Commutativity
- Idempotence
- Associativity
- Representativity
- Can discard records
- ER result independent of processing order
33Swoosh Algorithms
- Record Swoosh
- Merges records as soon as they match
- Optimal in terms of record comparisons
- Feature Swoosh
- Remembers values seen for each feature
- Avoids redundant value comparisons
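The Record Swoosh idea of merging as soon as a match is found can be sketched as below, assuming the ICAR properties hold so that the two matched records can be discarded once merged. The match rule and union merge are the same illustrative assumptions as for the brute-force example, not the talk's own functions, and Feature Swoosh is not shown.

```python
def rec(**attrs):
    """Build a record as a frozenset of (attribute, value) pairs."""
    return frozenset(attrs.items())

def shares(r, s, attr):
    return any(k == attr and (k, v) in s for (k, v) in r)

def match(r, s):
    # Assumed rule: same a, or same b and same c.
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))

def merge(r, s):
    # Assumed merge: union of attribute-value pairs.
    return r | s

def r_swoosh(records):
    """Sketch of Record Swoosh: merge on first match, discard the originals."""
    pending = list(records)  # records still to be processed
    resolved = []            # records known not to match each other
    comparisons = 0
    while pending:
        r = pending.pop(0)
        partner = None
        for s in resolved:
            comparisons += 1
            if match(r, s):
                partner = s
                break
        if partner is None:
            resolved.append(r)
        else:
            # Both originals are dropped; the merged record is reprocessed.
            resolved.remove(partner)
            pending.append(merge(r, partner))
    return resolved, comparisons

r1 = rec(a=1, b=2)
r2 = rec(a=1, c=4, e=5)
r3 = rec(b=2, c=4, f=6)
r4 = rec(a=7, e=5, f=6)
resolved, comparisons = r_swoosh([r1, r2, r3, r4])
```

On the four-record example this returns r4 and r123 directly, without ever materializing the dominated records in the output.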
34Swoosh Performance
35If ICAR Properties Do Not Hold?
r12 Joe Sr., 123 Main, Ph 123, DLX
r23 Joe Jr., 123 Main, Ph 123, DLY
r3 Joe Jr., 123 Main, DLY
r1 Joe Sr., 123 Main, DLX
r2 Joe, 123 Main, Ph123
36If ICAR Properties Do Not Hold?
r12 Joe Sr., 123 Main, Ph 123, DLX
r23 Joe Jr., 123 Main, Ph 123, DLY
r3 Joe Jr., 123 Main, DLY
r1 Joe Sr., 123 Main, DLX
r2 Joe, 123 Main, Ph123
Full answer: ER(R) = {r12, r23, r1, r2, r3}
Minus dominated: ER(R) = {r12, r23}
37If ICAR Properties Do Not Hold?
r12 Joe Sr., 123 Main, Ph 123, DLX
r23 Joe Jr., 123 Main, Ph 123, DLY
r3 Joe Jr., 123 Main, DLY
r1 Joe Sr., 123 Main, DLX
r2 Joe, 123 Main, Ph123
Full answer: ER(R) = {r12, r23, r1, r2, r3}
Minus dominated: ER(R) = {r12, r23}
R-Swoosh yields: ER(R) = {r12, r3} or {r1, r23}
38Swoosh Without ICAR Properties
39Distributed Swoosh
P1
P2
P3
r1 r2 r3 r4 r5 r6 ...
40Distributed Swoosh
P1
P2
P3
r1 r3 r4 r6 ...
r1 r2 r4 r5 ...
r2 r3 r5 r6 ...
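One way to read the layout on this slide: the records are dealt round-robin into three buckets and each processor receives two of the buckets, so every pair of records meets on at least one processor. The sketch below reproduces that assignment. It is an inference from the slide, not a description of the actual distributed algorithm, and it does not model how merged records are redistributed. `assign` is a hypothetical helper.

```python
from itertools import combinations

def assign(records, num_buckets=3):
    """Deal records into buckets, then give each processor one pair of buckets."""
    buckets = [records[i::num_buckets] for i in range(num_buckets)]
    return [
        sorted(buckets[i] + buckets[j])
        for i, j in combinations(range(num_buckets), 2)
    ]

records = ["r1", "r2", "r3", "r4", "r5", "r6"]
processors = assign(records)

# Every pair of records should be co-located on some processor.
all_pairs_covered = all(
    any(x in p and y in p for p in processors)
    for x, y in combinations(records, 2)
)
```

With six records this yields the three groups shown on the slide (r1 r2 r4 r5, r1 r3 r4 r6, r2 r3 r5 r6), though not necessarily in the same processor order.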
41DSwoosh Performance
42Outline
- Why is ER challenging?
- How is ER done?
- Some ER work at Stanford
- Confidences
43Conclusion
- ER is an old and important problem
- Our approach: generic
- Confidences
- challenging
- two ways to tame
- thresholds
- packages
44Thanks.
45Generic Confidence Model
match(0.7 [a, b, c], 0.9 [a, c, d]) → yes (or no)
merge(0.7 [a, b, c], 0.9 [a, c, d]) → 0.65 [a, b, c, d, x]
46Problem: Properties May Not Hold
- r1 = 0.9 [a, b, c]
- r2 = 0.8 [a, d]
- say confidences are multiplied on merge
- <r1, r2> = 0.72 [a, b, c, d]
- <<r1, r2>, r1> = 0.648 [a, b, c, d]
- <<r1, r1>, r2> = <r1, r2> = 0.72 [a, b, c, d]
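The arithmetic on this slide can be replayed with exact fractions. The sketch assumes, as the slide does, that a merge unions the values and multiplies the confidences; the tuple representation is only for illustration.

```python
from fractions import Fraction

def merge(r, s):
    """Assumed merge from the slide: union the values, multiply the confidences."""
    (conf_r, vals_r), (conf_s, vals_s) = r, s
    return (conf_r * conf_s, vals_r | vals_s)

r1 = (Fraction(9, 10), frozenset("abc"))
r2 = (Fraction(8, 10), frozenset("ad"))

r12 = merge(r1, r2)     # <r1, r2>
left = merge(r12, r1)   # <<r1, r2>, r1>
right = merge(r1, r2)   # <<r1, r1>, r2>, taking <r1, r1> = r1 as on the slide

same_values = left[1] == right[1]
same_confidence = left[0] == right[0]
```

Both groupings produce the values a, b, c, d, but the confidences are 0.648 and 0.72, so the merge order now changes the answer. A literal multiplication would also give <r1, r1> a confidence of 0.81, so idempotence fails as well.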
47ER with Confidences
- Very Expensive
- must compute all derivations
- cannot delete records after they merge
- What can we do??
- thresholds
- packages
48Important Property
- If conf(Rx) < threshold
- Then for any Ry derived from Rx, conf(Ry) < threshold
C 0.7
r1
C < 0.7
r3
r4
r2
49Thresholds - Example
T = 0.7
0.9 [a: v1, b: v2]
0.8 [a: v1, c: v3]
0.6 [b: v2, c: v3, d: v4]
0.75 [a: v1, b: v2, c: v3]
0.5 [a: v1, b: v2, c: v3, d: v4]
...
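Applied to the records on this slide, threshold pruning is a simple filter. The sketch assumes a record exactly at the threshold is kept, which the slide does not state. The point of the threshold property is that dropping a record early is safe, because nothing derived from it could reach the threshold either.

```python
def prune(records, threshold):
    """Keep only records whose confidence reaches the threshold (assumed >=)."""
    return [(conf, values) for conf, values in records if conf >= threshold]

records = [
    (0.9, {"a": "v1", "b": "v2"}),
    (0.8, {"a": "v1", "c": "v3"}),
    (0.6, {"b": "v2", "c": "v3", "d": "v4"}),
    (0.75, {"a": "v1", "b": "v2", "c": "v3"}),
    (0.5, {"a": "v1", "b": "v2", "c": "v3", "d": "v4"}),
]
kept = prune(records, threshold=0.7)
kept_confidences = [conf for conf, _values in kept]
```

With T = 0.7 the records at 0.6 and 0.5 are eliminated, and no merges involving them need to be computed.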
50Thresholds - Example
T = 0.7
0.9 [a: v1, b: v2]
0.8 [a: v1, c: v3]
0.6 [b: v2, c: v3, d: v4]
0.75 [a: v1, b: v2, c: v3]
0.5 [a: v1, b: v2, c: v3, d: v4]
...
51Goal C-Swoosh
base records
all possible merges
eliminate dominated
eliminate below threshold
52Goal C-Swoosh
base records
all possible merges
eliminate dominated
eliminate below threshold
earlier
53Does Threshold Property Hold?
.7a, b, c
merge
.9a, b, c
.8a, b, c
54Does Threshold Property Hold?
.7a, b, c
merge
.8a, b, c
.8a, b, c
55Simple Confidence Model
Alternate Worlds
a, b
a, b
a, b, c
a, b
a, b
a, b
a, b, d
???
???
???
56Rules
- 0.7 [a, b, c], 0.7 [a, b, c] → 0.7 [a, b, c]
- 0.7 [a, b], 0.5 [a, b] → 0.7 [a, b]
- 0.7 [a, b, c], 0.5 [a, b] → 0.7 [a, b, c]
- 0.7 [a, b, c], 0.9 [a, b] → 0.7 [a, b, c], 0.9 [a, b]
- etc.
57Matches
Match with confidence 0.5
0.9 [a, b, c] 0.8 [a, b, d] [a, x] [c, d, y]
worlds
1
2
3
4
5
6
7
8
9
10
a,b,c
a,b,d
a,b,c,d
58Matches
0.4 [a, b, c, d] 0.9 [a, b, c] 0.8 [a, b, d] [a, x] [c, d, y]
0.9 [a, b, c] 0.8 [a, b, d] [a, x] [c, d, y]
worlds
1
2
3
4
5
6
7
8
9
10
a,b,c
a,b,d
a,b,c,d
59Summary
- Belief model well suited for ER
- Evidence model is very complex and expensive!
60Packages
- Match does not use confidences
- merge does compute confidences
- 4 properties hold for deterministic attributes
- e.g., <<r1, r2>, r3> = <r1, <r2, r3>>, ignoring confidences
61Partition Records
- r1 = 0.9 [a:1, b:2]
- r2 = 0.8 [a:1, c:4, e:5]
- r3 = 0.7 [b:2, c:4, f:6]
- r4 = 0.8 [a:7, e:5, f:6]
- r5 = 0.9 [a:7, b:2]
r1
r12
r2
r123
r3
r4
r45
r5
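The partitioning step can be sketched by running a merge-on-match loop on the attribute values alone and remembering which base records ended up together. Confidences are ignored here, as the Packages slide says match does not use them. The match rule (same `a`, or same `b` and `c`) and the union merge are assumptions that happen to reproduce the r123 and r45 groups on the slide; `partition_into_packages` is a hypothetical name.

```python
def shares(r, s, attr):
    return any(k == attr and (k, v) in s for (k, v) in r)

def match(r, s):
    # Assumed rule: same a, or same b and same c. Confidences play no role.
    return shares(r, s, "a") or (shares(r, s, "b") and shares(r, s, "c"))

def merge(r, s):
    # Assumed merge on the deterministic attributes only.
    return r | s

def partition_into_packages(records):
    """records: name -> (confidence, values). Returns sets of base record names."""
    pending = [
        (frozenset(values.items()), frozenset([name]))
        for name, (_conf, values) in records.items()
    ]
    packages = []
    while pending:
        values, members = pending.pop(0)
        partner = next((p for p in packages if match(values, p[0])), None)
        if partner is None:
            packages.append((values, members))
        else:
            packages.remove(partner)
            pending.append((merge(values, partner[0]), members | partner[1]))
    return [members for _values, members in packages]

records = {
    "r1": (0.9, {"a": 1, "b": 2}),
    "r2": (0.8, {"a": 1, "c": 4, "e": 5}),
    "r3": (0.7, {"b": 2, "c": 4, "f": 6}),
    "r4": (0.8, {"a": 7, "e": 5, "f": 6}),
    "r5": (0.9, {"a": 7, "b": 2}),
}
packages = partition_into_packages(records)
```

Under these assumptions the result is two packages, {r1, r2, r3} and {r4, r5}. The expensive confidence computation then only has to consider merges inside each package.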
62Expand Packages
- r1 = 0.9 [a:1, b:2]
- r2 = 0.8 [a:1, c:4, e:5]
- r3 = 0.7 [b:2, c:4, f:6]
- r4 = 0.8 [a:7, e:5, f:6]
- r5 = 0.9 [a:7, b:2]
r1
r12
r2
r123
r3
r1, r2, r3, <r1, r2>, <<r1, r2>, r3>, <r1, r3>, <r2, r3>, ...
63Conclusion
- ER is an old and important problem
- Our approach: generic
- Confidences
- challenging
- two ways to tame
- thresholds
- packages
64Thanks.
65Extra Slides
66Taxonomy
- Pairwise snaps vs. clustering
- De-duplication vs. fidelity enhancement
- Schema differences No
- Relationships No
- Exact vs. approximate
- Generic vs application specific
- Confidences ... later on
67One Confidence Model
id1, a, b, c, d
id3, a, b, f, g
id2, a, c, e
[id1, a, b, c, d] [id1, a, b, d] [id1, a, x] [id1, b, y]
[id3, a, b, c] [id3, a, b, d] [id3, a, b, f, g] [id3, a, b, f, g]
[id2, a, b, c] [id2, a, c, e] [id2, a, c, e] [id2, a, c, e]
shorthand
68Records Are Evidence
id1, a, b, c, d
[id1, a, b, c, d] [id1, a, b, d] [id1, a, x] [id1, b, y]
not 0.25
id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y
69New Evidence
id1, a, b, c, d
[id1, a, b, c, d] [id1, a, b, d] [id1, a, x] [id1, b, y] [id1, a, b, c, d]
id1, (3/4)a, (3/4)b, (1/4)c, (2/4)d, (1/4)x, (1/4)y    [id1, a, b, c, d]
id1, (4/5)a, (4/5)b, (2/5)c, (3/5)d, (1/5)x, (1/5)y
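The fractions on these two slides are consistent with reading each value's confidence as the share of evidence records that contain it. The sketch below recomputes them under that reading, which is an interpretation of the slides rather than a stated definition. `value_confidences` is a hypothetical helper.

```python
from collections import Counter
from fractions import Fraction

def value_confidences(evidence):
    """Confidence of a value = fraction of evidence records that contain it."""
    counts = Counter(v for record in evidence for v in set(record))
    return {v: Fraction(n, len(evidence)) for v, n in counts.items()}

evidence = [
    {"a", "b", "c", "d"},
    {"a", "b", "d"},
    {"a", "x"},
    {"b", "y"},
]
before = value_confidences(evidence)

# New evidence for the same id: one more record [a, b, c, d].
after = value_confidences(evidence + [{"a", "b", "c", "d"}])
```

`before` gives 3/4 for a and b, 1/4 for c, x, and y, and 2/4 for d (stored in lowest terms as 1/2). After the fifth record arrives the values shift to 4/5, 4/5, 2/5, 3/5, 1/5, and 1/5, matching the slide.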
70No Ids
[a, b, c] [a, b, d] [a, x] [c, d, y]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
71No Ids
[a, b, c] [a, b, d] [a, x] [c, d, y]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
0.1
0.9
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y] [a, x]
72Queries?
[a, b, c] [a, b, d] [a, x] [c, d, y]
Threshold 0.5; Support 2; Maximal Record; Example [a, b, c, d]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
0.1
0.9
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y] [a, x]
73Queries?
[a, b, c] [a, b, d] [a, x] [c, d, y]
Threshold 0.5; Support 2; Maximal Record; Example [a, b, c, d]
[a, b, c] [a, b, d] [a, x] [c, d, y]
0.3
0.7
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
[a, b, (1/2)c, (1/2)d] [a, x] [c, d, y]
0.1
0.9
[(2/3)a, (2/3)b, (2/3)c, (2/3)d, (1/3)y] [a, x]
74Need Simpler Model?
75Bonus Material
- Entity Resolution, Confidences, and their relationship to Information Privacy
76Privacy
Alice
1.0
1.0
Nm Alice Ad 32 Fox
Nm Alice Ad 32 Fox Ph 5551212 Ad 14 Cat
Bob
77Leakage
Alice
Bob
L 0.6 (between 0 and 1)
78Multi-Record Leakage
Alice
r1, L 0.9 r2, L 0.8 r3, L 0.7
Bob
LL 0.9 (between 0 and 1,
e.g., max L)
79Q1 Added Vulnerability?
p
Alice
r1
r2
r3
r4
Bob
r4 may cause Bob's records to snap together!
?LL ??
80Q2 Disinformation?
p
Alice
r1
r2
r3
r4 (lies)
Bob
What is the most cost-effective disinformation?
?LL ??
81Q3 Verification?
p
Alice
hypothesis h (0.6)
r1, 0.9 r2, 0.8 r3, 0.7 ...
Bob
What is the best fact to verify to increase confidence in the hypothesis?
82Summary
- Entity resolution is critical
- Efficient resolution important
- Confidences are important, but how?
- ER is key aspect of info privacy
- check www-db.stanford.edu for the Swoosh paper and a forthcoming paper
83Thanks.
84Extra Slides
85Challenges
- Exponential growth in complexity
0.9 [a: v1, b: v2]
0.8 [a: v1, c: v3]
0.6 [b: v2, c: v3, d: v4]
0.75 [a: v1, b: v2, c: v3]
0.5 [a: v1, b: v2, c: v3, d: v4]
...
86Three Ideas to Tame Complexity
- Thresholds
- Domination
- Packages
87Thresholds
T = 0.7
0.9 [a: v1, b: v2]
0.8 [a: v1, c: v3]
0.6 [b: v2, c: v3, d: v4]
0.75 [a: v1, b: v2, c: v3]
0.5 [a: v1, b: v2, c: v3, d: v4]
...
88Domination
0.9 [a: v1, b: v2, c: v3]
0.8 [a: v1, b: v2, c: v3]
0.8 [b: v2, c: v3]
...
89Domination
0.9 [a: v1, b: v2, c: v3]
0.8 [a: v1, b: v2, c: v3]
0.8 [b: v2, c: v3]
...
90Summary
- Our approach: pairwise, generic, Swoosh
- Confidences
- Making Tractable
- threshold
- domination
- packages
91Thank You
92What Swoosh Does NOT Do
- Hash table with every pair seen
- records ri, rj
- compared values vi, vj
- Swoosh achieves the same effect without N^2 space
93Swoosh Performance (I)
94Swoosh Performance (II)
95Swoosh Performance (III)