Title: Weighted Exact Set Similarity Join
1 Weighted Exact Set Similarity Join
- The Pennsylvania State University
- Dongwon Lee
- dongwon_at_psu.edu
2 Set Similarity Join
- Def. Set Similarity Join (SSJoin): Between collections A and B, find X pairs of objects whose similarity > t
  - If X = MOST → Approximate SSJoin
  - If X = ALL → Exact SSJoin
[Figure: records of A and B, e.g. {Lake, Monona, Wisc, Dane, County} and {University, Mendota, Wisc, Dane, ...}, linked by edges labeled with similarity scores 0.7, 0.5, 0.4, 0.9, 0.2, 0.1]
3 Set Similarity Join
- Weighted vs. Unweighted
  - Weighting quantifies the relative importance of a token
  - E.g., "Microsoft" is more important than "Corp."
  - How to assign meaningful weights to tokens is an important problem in itself
  - Not further discussed here
4 Set Similarity Join
- Approximate SSJoin
  - Allows some false positives/negatives
  - E.g., LSH as a solution
- Exact SSJoin
  - Does not allow any false positives/negatives
  - Needs to be scalable
- Weighted Exact SSJoin
  - Will simply be called WESSJoin

           weighted   unweighted
exact      WESSJoin   UESSJoin
approx.    WASSJoin   UASSJoin
5 Applications of WESSJoin
- Entity resolution
- Web document genre classification
  - Find all pairs of documents with similar contents
- Query refinement for web search
  - For a query, find another with a similar search result
- Movie recommendation
  - Identify users who have similar movie tastes w.r.t. the rented movies
- → Focus on string data represented as a SET
  - E.g., document, web page, record
6 Research Issues
- Why not express WESSJoin in SQL?
  - Join predicate as UDF
  - Cartesian product followed by UDF processing → inefficient evaluation
- Special handling for WESSJoin needed
  - Scalability
  - Support diverse similarity (or distance) functions
    - E.g., Overlap, Jaccard, Cosine vs. Edit, ...
  - Support diverse computation models
    - E.g., Threshold vs. Top-k
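7 Similarity/Distance Functions
- Jaccard coefficient: $J(x,y) = |x \cap y| / |x \cup y|$
- Overlap similarity: $O(x,y) = |x \cap y|$
- Cosine similarity: $C(x,y) = |x \cap y| / \sqrt{|x| \cdot |y|}$
- Hamming distance: $H(x,y) = |(x - y) \cup (y - x)|$
- Levenshtein distance: L(x,y) = minimum number of edit operations to transform x into y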
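As a minimal sketch, the four set-based measures can be written in Python using the standard unweighted definitions. The function names are illustrative, and a weighted variant would replace set sizes with sums of token weights (an assumption, since weighting schemes are not covered here).

```python
import math

def overlap(x, y):
    """O(x, y): number of shared tokens."""
    return len(x & y)

def jaccard(x, y):
    """J(x, y): shared tokens over all distinct tokens."""
    union = len(x | y)
    return len(x & y) / union if union else 0.0

def cosine(x, y):
    """C(x, y): shared tokens over the geometric mean of the set sizes."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))

def hamming(x, y):
    """H(x, y): tokens that occur in exactly one of the two sets."""
    return len(x ^ y)

x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}
print(overlap(x, y), jaccard(x, y), hamming(x, y))  # 2 0.4 3
```

Levenshtein distance is left out because it operates on character sequences rather than token sets.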
8 Properties of sim()
- Similarity functions can be rewritten into each other equivalently
  - $J(x,y) > t \Leftrightarrow O(x,y) > \frac{t}{1+t}\,(|x| + |y|)$
  - $O(x,y) > t \Leftrightarrow H(x,y) < |x| + |y| - 2t$
  - $C(x,y) > t \Leftrightarrow O(x,y) > t \sqrt{|x| \cdot |y|}$
- E.g.,
  - x = {Lake, Mendota, Monona}
  - y = {Wisc, Dane, Mendota, Lake}
  - J(x,y) > 0.5 ? ⇔ O(x,y) > 2.3 ?
- Set representation: k-gram, word, phrase, ...
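The threshold rewrites can be checked numerically. The sketch below (helper names are hypothetical) converts a Jaccard or cosine threshold into an overlap threshold, and an overlap threshold into a Hamming threshold, then reproduces the example with |x| = 3 and |y| = 4.

```python
import math

def overlap_bound_from_jaccard(t, size_x, size_y):
    """J(x, y) > t  <=>  O(x, y) > t / (1 + t) * (|x| + |y|)."""
    return t / (1 + t) * (size_x + size_y)

def overlap_bound_from_cosine(t, size_x, size_y):
    """C(x, y) > t  <=>  O(x, y) > t * sqrt(|x| * |y|)."""
    return t * math.sqrt(size_x * size_y)

def hamming_bound_from_overlap(t, size_x, size_y):
    """O(x, y) > t  <=>  H(x, y) < |x| + |y| - 2t."""
    return size_x + size_y - 2 * t

x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}

bound = overlap_bound_from_jaccard(0.5, len(x), len(y))
print(round(bound, 2))  # 2.33

o = len(x & y)
j = o / len(x | y)
print(j > 0.5, o > bound)  # False False
```

Both tests agree: the pair fails the Jaccard threshold exactly when it fails the derived overlap threshold.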
9 Naïve Solution
- All pair-wise comparisons between A and B
  - Nested loop: |A| · |B| comparisons
- The sim() evaluation may be costly
  - E.g., Generalized Jaccard similarity function with O(|x|^3)

For x in A
  For y in B
    If sim(x,y) > t, return (x,y)

A, B: table; x, y: record as set
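The nested loop translates almost directly into Python. This is a sketch rather than a tuned implementation: records are assumed to be Python sets keyed by ID, and the sample data is the small A/B example used throughout these slides.

```python
def overlap(x, y):
    """O(x, y): number of shared tokens."""
    return len(x & y)

def naive_join(A, B, sim, t):
    """Compare every record of A with every record of B: |A| * |B| calls to sim()."""
    return [(i, j) for i, x in A.items() for j, y in B.items() if sim(x, y) > t]

A = {
    1: {"Lake", "Mendota"},
    2: {"Lake", "Monona", "Area"},
    3: {"Lake", "Mendota", "Monona", "Dane"},
}
B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}

print(naive_join(A, B, overlap, 2))  # [(2, 6), (3, 6)]
```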
10 Naïve Solution Example

A:
ID  Content
1   Lake, Mendota
2   Lake, Monona, Area
3   Lake, Mendota, Monona, Dane

B:
ID  Content
4   Lake, Monona, University
5   Monona, Research, Area
6   Lake, Mendota, Monona, Area

O(x,y) > 2 ?

O(x,y)  ID4  ID5  ID6
ID1     1    0    2
ID2     2    2    3
ID3     2    1    3
11 Naïve Solution Example
(Tables A and B as on slide 10.)

J(x,y) > 0.6 ?

J(x,y)  ID4   ID5   ID6
ID1     0.25  0     0.5
ID2     0.5   0.5   0.75
ID3     0.4   0.17  0.6
12 2-Step Framework
- Step 1: Blocking
  - Using index/heuristics/filtering/etc., reduce the number of candidates to compare
- Step 2: sim() only within candidate sets
  - O(|A| · |C|) s.t. |C| << |B|

For x in A
  Using Foo, find a candidate set C in B
  For y in C
    If sim(x,y) > t, return (x,y)
13 Variants for Foo
- Foo: how to identify candidate set C
  - Fast
  - Accurate: no false positives/negatives
- Many variants for Foo
  - Inverted Index [Sarawagi et al, SIGMOD 04]
  - Size filtering [Arasu et al, VLDB 06]
  - Prefix Index [Chaudhuri et al, ICDE 06]
  - Prefix Inverted Index [Bayardo et al, WWW 07]
  - Bound filtering [On et al, ICDE 07]
  - Position Index [Xiao et al, WWW 08]
14 Inverted Index [Sarawagi et al, SIGMOD 04]

A:
ID  Content
1   Lake, Mendota
2   Lake, Monona, Area
3   Lake, Mendota, Monona, Dane

B:
ID  Content
4   Lake, Monona, University
5   Monona, Research, Area
6   Lake, Mendota, Monona, Area

Inverted Index (IDX) for A:
Token in A  ID List
Area        2
Dane        3
Lake        1, 2, 3
Mendota     1, 3
Monona      2, 3

Inverted Index (IDX) for B:
Token in B  ID List
Area        5, 6
Lake        4, 6
Mendota     6
Monona      4, 5, 6
Research    5
University  4
15 Inverted Index [Sarawagi et al, SIGMOD 04]
(Tables A and B and the inverted index for B as on slide 14.)

For x in A
  Using IDX, find a candidate set C in B
  For y in C
    If sim(x,y) > t, return (x,y)

Probing IDX with ID1 = {Lake, Mendota} (ID2 and ID3 are handled the same way):
- Lake → 4, 6
- Mendota → 6
- Candidate set C = {4, 6}
16 Inverted Index [Sarawagi et al, SIGMOD 04]
(Tables A and B, the inverted index for B, and the loop as on slide 15.)

Probing IDX with ID1 = {Lake, Mendota}: count how often each ID of B appears in the retrieved ID lists. The frequency equals the overlap O(x,y).

ID  Freq.
4   1
6   2

Candidate set C: keep only the IDs whose frequency passes the test O(x,y) > 2
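The frequency-counting idea can be sketched as follows. The helper names are hypothetical and the index is a plain dictionary, not the data structure of the cited paper. Because each record is a set, the number of times an ID shows up across the probed lists equals its overlap with the query.

```python
from collections import Counter, defaultdict

def build_inverted_index(records):
    """Map each token to the list of record IDs that contain it."""
    index = defaultdict(list)
    for rid, tokens in records.items():
        for token in tokens:
            index[token].append(rid)
    return index

def candidate_counts(x, index):
    """Count, per record ID, how many tokens of x it shares (= overlap)."""
    counts = Counter()
    for token in x:
        counts.update(index.get(token, []))
    return counts

B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}

idx = build_inverted_index(B)
counts = candidate_counts({"Lake", "Mendota"}, idx)
print(sorted(counts.items()))  # [(4, 1), (6, 2)]
```

Record 5 never appears in the counts, so it is skipped without any sim() call.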
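17 Size Filtering [Arasu et al, VLDB 06]
- Idea: build an index on the size of the inputs
- Jaccard coefficient: $J(x,y) = |x \cap y| / |x \cup y|$
- Upper bound for Jaccard: $J(x,y) \le \min(|x|,|y|) / \max(|x|,|y|)$
- Bounding |y| w.r.t. |x|
  - If $|y| \le |x|$: $J(x,y) > t$ requires $|y| > t\,|x|$
  - If $|y| \ge |x|$: $J(x,y) > t$ requires $|y| < |x| / t$
- Combining the two → $t\,|x| < |y| < |x| / t$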
18 Size Filtering [Arasu et al, VLDB 06]
- Intuition: if t and |x| are given, |y| is bounded
- E.g.,
  - x = {Lake, Mendota}
  - y = {Lake, Mendota, Monona, Area}
  - J(x,y) > 0.8 ?
- Then, according to the bound t·|x| < |y| < |x|/t:
  - |x| = 2, t = 0.8 → 1.6 < |y| < 2.5
  - However, |y| = 4
  - y cannot satisfy t = 0.8 → no need to compute J(x,y) at all
19 Size Filtering [Arasu et al, VLDB 06]

For x in A
  Using IDX, find a candidate set C in B
  For y in C
    If sim(x,y) > t, return (x,y)

- Algorithm
  - For all input strings, build a B-tree w.r.t. their sizes
  - Given a set x, using the B-tree index, find a candidate y in B s.t. t·|x| < |y| < |x|/t
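A rough sketch of the size filter for Jaccard follows. As an assumption for illustration, a sorted list searched with `bisect` stands in for the B-tree, and the function names are hypothetical. The range is taken inclusively, which may keep a few borderline candidates but does not affect pairs that lie strictly inside the bound.

```python
from bisect import bisect_left, bisect_right

def build_size_index(records):
    """Return parallel lists (sizes, ids) sorted by set size."""
    pairs = sorted((len(tokens), rid) for rid, tokens in records.items())
    sizes = [size for size, _ in pairs]
    ids = [rid for _, rid in pairs]
    return sizes, ids

def size_candidates(x, sizes, ids, t):
    """IDs of records y whose size lies between t * |x| and |x| / t."""
    lo = bisect_left(sizes, t * len(x))
    hi = bisect_right(sizes, len(x) / t)
    return ids[lo:hi]

B = {
    4: {"Lake", "Monona", "University"},
    5: {"Monona", "Research", "Area"},
    6: {"Lake", "Mendota", "Monona", "Area"},
}
sizes, ids = build_size_index(B)

# |x| = 2, t = 0.8: only sizes between 1.6 and 2.5 qualify, so nothing in B does.
print(size_candidates({"Lake", "Mendota"}, sizes, ids, 0.8))         # []
# |x| = 3, t = 0.8: sizes between 2.4 and 3.75 qualify, so record 6 is skipped.
print(size_candidates({"Lake", "Monona", "Area"}, sizes, ids, 0.8))  # [4, 5]
```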
20 Prefix Index [Chaudhuri et al, ICDE 06]
- Intuition: if two sets are very similar, their prefixes, when ordered, must have some common tokens
- E.g.,
  - x = {Dane, University, Monona, Mendota}
  - y = {Area, Lake, Mendota, Monona, Wisc}
  - O(x,y) > 3 ?
- After ordering (here alphabetically), the prefixes are the leading tokens of each set:
  - x = Dane, Mendota, Monona, University
  - y = Area, Lake, Mendota, Monona, Wisc
21 Prefix Index [Chaudhuri et al, ICDE 06]
- Theorem 1: If there is no overlap between Prefix(x) and Prefix(y), then sim(x,y) < t, where
  - If sim() = Overlap, |Prefix(x)| = |x| - (t - 1)
  - If sim() = Jaccard, |Prefix(x)| = |x| - Ceiling(t·|x|) + 1
- Algorithm using Theorem 1
  - Given a set x
  - For each token t_x in the prefix of x
    - Using an index, locate a candidate y that contains t_x in the prefix of y
    - If sim(x,y) > t, return (x,y)
22 Prefix Inverted Index [Bayardo et al, WWW 07]

A:
ID  Content
1   Lake, Mendota
2   Lake, Monona, Area
3   Lake, Mendota, Monona, Dane

B:
ID  Content
4   Lake, Monona, University
5   Monona, Research, Area
6   Lake, Mendota, Monona, Area

Inverted Index (IDX) for both A and B:
Token       ID List        DF  Order
Area        2, 5, 6        3   4
Dane        3              1   1
Lake        1, 2, 3, 4, 6  5   6
Mendota     1, 3, 6        3   5
Monona      2, 3, 4, 5, 6  5   7
Research    5              1   2
University  4              1   3

Create a universal order: put rare tokens at the front
Order: Dane > Research > University > Area > Mendota > Lake > Monona
23 Prefix Inverted Index [Bayardo et al, WWW 07]

Ordered A:
ID  Content
1   Mendota, Lake
2   Area, Lake, Monona
3   Dane, Mendota, Lake, Monona

Ordered B:
ID  Content
4   University, Lake, Monona
5   Research, Area, Monona
6   Area, Mendota, Lake, Monona

Order: Dane > Research > University > Area > Mendota > Lake > Monona
24 Prefix Inverted Index [Bayardo et al, WWW 07]
(Ordered A and Ordered B as on slide 23.)

O(x,y) > 2: |Prefix(x)| = |x| - (t - 1) = |x| - 1

Prefix Inverted Index for B:
Token in B  ID List
Area        5, 6
Lake        4, 6
Mendota     6
Research    5
University  4

ID1 = Mendota, Lake (prefix: Mendota)
- Mendota → 6
- Candidate set C = {6}
25 Prefix Inverted Index [Bayardo et al, WWW 07]
(Ordered A, Ordered B, and the prefix inverted index for B as on slide 24.)

O(x,y) > 2: |Prefix(x)| = |x| - (t - 1) = |x| - 1

ID2 = Area, Lake, Monona (prefix: Area, Lake)
- Area → 5, 6
- Lake → 4, 6
- Candidate set C = {4, 5, 6}
26 Prefix Inverted Index [Bayardo et al, WWW 07]
(Ordered A, Ordered B, and the prefix inverted index for B as on slide 24.)

O(x,y) > 2: |Prefix(x)| = |x| - (t - 1) = |x| - 1

ID3 = Dane, Mendota, Lake, Monona (prefix: Dane, Mendota, Lake)
- Dane → (none)
- Mendota → 6
- Lake → 4, 6
- Candidate set C = {4, 6}
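The walkthrough for ID1, ID2, and ID3 can be reproduced with a short sketch. All function names are hypothetical. Tokens are ranked by ascending document frequency over A and B, with ties broken alphabetically (an assumption that happens to yield the order shown on the slides), and the prefix length is |x| - (t - 1).

```python
from collections import Counter, defaultdict

def global_order(*collections):
    """Rank tokens so that rare tokens come first."""
    df = Counter(tok for recs in collections for toks in recs.values() for tok in toks)
    ranked = sorted(df, key=lambda tok: (df[tok], tok))
    return {tok: rank for rank, tok in enumerate(ranked)}

def prefix(tokens, rank, t):
    """First |x| - (t - 1) tokens of x under the global order."""
    ordered = sorted(tokens, key=rank.__getitem__)
    return ordered[:max(len(ordered) - (t - 1), 0)]

def build_prefix_index(records, rank, t):
    """Inverted index over prefix tokens only."""
    index = defaultdict(set)
    for rid, tokens in records.items():
        for tok in prefix(tokens, rank, t):
            index[tok].add(rid)
    return index

def prefix_candidates(x, index, rank, t):
    """Records of the indexed collection that share a prefix token with x."""
    found = set()
    for tok in prefix(x, rank, t):
        found |= index.get(tok, set())
    return found

A = {1: {"Lake", "Mendota"},
     2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

t = 2
rank = global_order(A, B)
idx = build_prefix_index(B, rank, t)
result = {rid: sorted(prefix_candidates(x, idx, rank, t)) for rid, x in A.items()}
print(result)  # {1: [6], 2: [4, 5, 6], 3: [4, 6]}
```

Monona, the most frequent token, never enters the index, which is where the savings over a full inverted index come from.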
27 Position Index [Xiao et al, WWW 08]
Order: Dane > Research > University > Area > Mendota > Lake > Monona
- E.g.,
  - x = Dane, Research, Area, Mendota, Lake
  - y = Research, Area, Mendota, Lake, Monona
  - O(x,y) > 4 ?
- |Prefix(x)| = |Prefix(y)| = 5 - (4 - 1) = 2
  - Prefix(x) = Dane, Research
  - Prefix(y) = Research, Area
- Research is common between the prefixes → (x,y) is a candidate pair → need to compute sim(x,y)
28 Position Index [Xiao et al, WWW 08]
Order: Dane > Research > University > Area > Mendota > Lake > Monona
- E.g.,
  - x = Dane, Research, Area, Mendota, Lake
  - y = Research, Area, Mendota, Lake, Monona
  - O(x,y) > 4 ?
- |Prefix(x)| = |Prefix(y)| = 5 - (4 - 1) = 2
  - Prefix(x) = Dane, Research
  - Prefix(y) = Research, Area
- Estimation of max overlap = overlap in prefixes + min number of unseen tokens = 1 + min(3, 4) = 4
- Is 4 > t? No, since t = 4 → no need to compute sim(x,y)!
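The positional estimate can be sketched in a few lines. This is a simplified illustration, not the full algorithm of the cited paper: it looks only at the first token shared by the two prefixes, and the function name is hypothetical. Both inputs are assumed to be lists already sorted by the global order.

```python
def positional_upper_bound(x, y, prefix_len_x, prefix_len_y):
    """Upper bound on O(x, y) taken at the first token shared by both prefixes.

    Returns None when the prefixes share no token, in which case the pair
    is not a candidate at all.
    """
    position_in_y = {tok: j for j, tok in enumerate(y[:prefix_len_y])}
    for i, tok in enumerate(x[:prefix_len_x]):
        j = position_in_y.get(tok)
        if j is not None:
            # One shared token so far, plus at most as many further matches
            # as the shorter of the two unseen remainders.
            return 1 + min(len(x) - i - 1, len(y) - j - 1)
    return None

x = ["Dane", "Research", "Area", "Mendota", "Lake"]
y = ["Research", "Area", "Mendota", "Lake", "Monona"]
t = 4
prefix_len = len(x) - (t - 1)  # 2 for both sets

bound = positional_upper_bound(x, y, prefix_len, prefix_len)
print(bound)      # 4
print(bound > t)  # False
```

Since the bound does not exceed t, the pair is dropped before sim() is evaluated.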
29 Bound Filtering [On et al, ICDE 07]
- Generalized Jaccard (GJ) similarity
  - Two sets x = {a_1, ..., a_|x|}, y = {b_1, ..., b_|y|}
  - Normalized weight of the maximum bipartite matching M in the bipartite graph (N = x ∪ y, E = x × y)
30 Bound Filtering [On et al, ICDE 07]
[Figure: bipartite graph between the elements of x and y with edge weights 0.7, 0.5, 0.4, 0.9, 0.2, 0.1; M = maximum weight bipartite matching]
31 Bound Filtering [On et al, ICDE 07]
- Issues
  - GJ captures more semantics between two sets via the weighted bipartite matching than Jaccard
  - But it is more costly to compute the maximum weight bipartite matching
    - Bellman-Ford: O(V^2 E)
    - Hungarian: O(V^3)

For x in A
  Using Foo, find a candidate set C in B
  For y in C
    If GJ(x,y) > t, return (x,y)
32 Bound Filtering [On et al, ICDE 07]
- Bipartite matching computation is expensive because of the requirement
  - No node in the bipartite graph can have more than one edge incident on it
- Relax this constraint
  - For each element a_i in x, find an element b_j in y with the highest element-level similarity → S1
  - For each element b_j in y, find an element a_i in x with the highest element-level similarity → S2
- Complexity becomes linear in the number of edges: O(|x| · |y|)
33 Bound Filtering [On et al, ICDE 07]
[Figure: the bipartite graph between x and y (edge weights 0.7, 0.5, 0.4, 0.9, 0.2, 0.1) shown twice, with the edges selected for S1 and for S2 highlighted]
34 Bound Filtering [On et al, ICDE 07]
- Properties
  - Numerator of UB is at least as large as that of GJ
  - Denominator of UB is no larger than that of GJ
  - Similar arguments for LB
- Theorem 2
  - LB ≤ GJ ≤ UB
35 Bound Filtering [On et al, ICDE 07]

For x in A
  Using Foo, find a candidate set C in B
  For y in C
    If GJ(x,y) > t, return (x,y)

- Algorithm
  - Compute UB(x,y)
  - If UB(x,y) < t → GJ(x,y) < t → (x,y) is not an answer
  - Else compute LB(x,y)
    - If LB(x,y) > t → GJ(x,y) > t → (x,y) is an answer
    - Else compute GJ(x,y)

LB ≤ GJ ≤ UB
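The filter cascade can be sketched as below. The exact UB and LB formulas did not survive in this text, so the sketch assumes UB normalizes the edges in S1 ∪ S2 and LB normalizes the edges in S1 ∩ S2, each divided by |x| + |y| minus the number of edges used; both choices satisfy the properties listed above. GJ is computed by brute force over all matchings, which is only feasible for tiny sets and stands in for a proper matching algorithm such as Hungarian. The similarity matrix is made up for illustration, and all names are hypothetical.

```python
from itertools import permutations

def greedy_pairs(sim):
    """S1: best column for every row; S2: best row for every column."""
    rows, cols = range(len(sim)), range(len(sim[0]))
    s1 = {(i, max(cols, key=lambda j: sim[i][j])) for i in rows}
    s2 = {(max(rows, key=lambda i: sim[i][j]), j) for j in cols}
    return s1, s2

def normalized_weight(sim, pairs):
    """Sum of the selected edge weights over |x| + |y| - (number of edges)."""
    total = sum(sim[i][j] for i, j in pairs)
    return total / (len(sim) + len(sim[0]) - len(pairs))

def generalized_jaccard(sim):
    """Exact GJ by brute force; assumes non-negative weights."""
    if len(sim) > len(sim[0]):
        sim = [list(col) for col in zip(*sim)]  # make rows the smaller side
    n_rows, n_cols = len(sim), len(sim[0])
    best = max(sum(sim[i][j] for i, j in enumerate(perm))
               for perm in permutations(range(n_cols), n_rows))
    return best / n_cols  # |x| + |y| - |M| with |M| = n_rows

def is_answer(sim, t):
    """UB test first, then LB test, and the exact GJ only as a last resort."""
    s1, s2 = greedy_pairs(sim)
    if normalized_weight(sim, s1 | s2) < t:
        return False
    if normalized_weight(sim, s1 & s2) > t:
        return True
    return generalized_jaccard(sim) > t

# Rows are the elements of x, columns the elements of y (illustrative values).
sim = [[0.9, 0.1, 0.0],
       [0.2, 0.8, 0.1]]

s1, s2 = greedy_pairs(sim)
lb = normalized_weight(sim, s1 & s2)
ub = normalized_weight(sim, s1 | s2)
gj = generalized_jaccard(sim)
print(round(lb, 3), round(gj, 3), round(ub, 3))  # 0.567 0.567 0.9
print(is_answer(sim, 0.95), is_answer(sim, 0.5), is_answer(sim, 0.7))  # False True False
```

The three calls exercise the three branches: pruned by UB, accepted by LB, and decided by the exact GJ.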
36 Takeaways
- WESSJoin finds ALL pairs of sets between two collections whose similarity > t
  - Good abstraction for various problems
- 2-step framework is promising
  - Step 1: reduce candidates
  - Step 2: similarity computation among candidates
- Less researched issues
  - Comparison among different WESSJoin methods
  - WESSJoin + top-k/skyline/MapReduce/etc.
37 Reference
- [Sarawagi et al, SIGMOD 04] Sunita Sarawagi, Alok Kirpal. Efficient set joins on similarity predicates. SIGMOD 2004.
- [Arasu et al, VLDB 06] Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. Efficient exact set-similarity joins. VLDB 2006.
- [Chaudhuri et al, ICDE 06] Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.
- [Bayardo et al, WWW 07] R. J. Bayardo, Yiming Ma, Ramakrishnan Srikant. Scaling Up All-Pairs Similarity Search. WWW 2007.
- [On et al, ICDE 07] Byung-Won On, Nick Koudas, Dongwon Lee, Divesh Srivastava. Group Linkage. ICDE 2007.
- [Xiao et al, WWW 08] Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
- Wei Wang. Efficient Exact Similarity Join Algorithms
  - http://www.cse.unsw.edu.au/weiw/project/PPJoin-UTS-Oct-2008.pdf
- Jeffrey D. Ullman. High-Similarity Algorithms
  - http://infolab.stanford.edu/ullman/mining/2009/similarity4.pdf