Weighted Exact Set Similarity Join

Dongwon Lee, The Pennsylvania State University (dongwon_at_psu.edu)

Learn more at: https://pike.psu.edu

1
Weighted Exact Set Similarity Join

The Pennsylvania State University
Dongwon Lee
dongwon_at_psu.edu

2
Set Similarity Join

Def. Set Similarity Join (SSJoin): Between
collections A and B, find X pairs of objects
whose similarity $\ge t$
If X = MOST → Approximate SSJoin
If X = ALL → Exact SSJoin

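[Figure: example sets from collections A and B, such as {Lake, Monona, Wisc, Dane, County} and {University, Mendota, Wisc, Dane, ...}, connected by pairwise similarity scores 0.7, 0.5, 0.4, 0.9, 0.2, 0.1]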
3
Set Similarity Join

Weighted vs. Unweighted
Weighting quantifies the relative importance of each token
Eg, "Microsoft" is more important than "Corp."
How to assign meaningful weights to tokens is an
important problem in itself
Not discussed further here

4
Set Similarity Join

Approximate SSJoin
Allows some false positives/negatives
Eg, LSH as solution
Exact SSJoin
Does not allow any false positives/negatives
Needs to be scalable
Weighted Exact SSJoin
Will simply call WESSJoin

            weighted    unweighted
exact       WESSJoin    UESSJoin
approx.     WASSJoin    UASSJoin
5
Applications of WESSJoin

Entity resolution
Web document genre classification
Find all pairs of documents w. similar contents
Query refinement for web search
For a query, find another w. similar search
result
Movie recommendation
Identify users who have similar movie tastes
w.r.t. the rented movies
→ Focus on string data represented as SET
Eg, document, web page, record

6
Research Issues

Why not express WESSJoin in SQL?
Join predicate as UDF
Cartesian product followed by UDF processing →
inefficient evaluation
Special handling for WESSJoin needed
Scalability
Support diverse similarity (or distance)
functions
Eg, Overlap, Jaccard, Cosine vs. Edit, ...
Support diverse computation models
Eg, Threshold vs. Top-k

7
Similarity/Distance Functions

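Jaccard coefficient: $J(x,y) = \frac{|x \cap y|}{|x \cup y|}$
Overlap similarity: $O(x,y) = |x \cap y|$
Cosine similarity: $C(x,y) = \frac{|x \cap y|}{\sqrt{|x| \cdot |y|}}$
Hamming distance: $H(x,y) = |(x - y) \cup (y - x)|$
Levenshtein distance: $L(x,y)$ = minimum number of edit
operations to transform x into y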
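
The functions listed on this slide can be sketched in a few lines of Python. This is a minimal illustration that assumes unweighted, non-empty sets of tokens (token weights are left out), and the function names are chosen here for illustration only.

```python
from math import sqrt

def overlap(x, y):
    """O(x,y): number of tokens shared by the two sets."""
    return len(x & y)

def jaccard(x, y):
    """J(x,y): shared tokens divided by all distinct tokens."""
    return len(x & y) / len(x | y)

def cosine(x, y):
    """C(x,y) for unweighted sets."""
    return len(x & y) / sqrt(len(x) * len(y))

def hamming(x, y):
    """H(x,y): number of tokens that occur in exactly one of the sets."""
    return len(x ^ y)

def levenshtein(s, t):
    """L(s,t): minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # delete cs
                           cur[j - 1] + 1,             # insert ct
                           prev[j - 1] + (cs != ct)))  # substitute or match
        prev = cur
    return prev[-1]

x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}
print(overlap(x, y), jaccard(x, y), hamming(x, y))  # 2 0.4 3
print(levenshtein("Mendota", "Monona"))
```

The first four functions work on token sets, while the edit distance works on the character sequence of a string, which is why the slide groups them as "Overlap, Jaccard, Cosine vs. Edit".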

8
Properties of sim()

Similarity functions can be rewritten into
one another equivalently
$J(x,y) \ge t \iff O(x,y) \ge \frac{t}{1+t}\,(|x|+|y|)$
$O(x,y) \ge t \iff H(x,y) \le |x|+|y|-2t$
$C(x,y) \ge t \iff O(x,y) \ge t\,\sqrt{|x| \cdot |y|}$
Eg,
x = {Lake, Mendota, Monona}
y = {Wisc, Dane, Mendota, Lake}
Is $J(x,y) \ge 0.5$? $\iff$ Is $O(x,y) \ge \frac{0.5}{1.5} \cdot (3+4) \approx 2.3$?
Set representation: k-gram, word, phrase, ...
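
The rewriting can be checked numerically on the slide's example. The sketch below assumes the threshold comparisons are "greater than or equal" and uses helper names invented for this illustration.

```python
def overlap_threshold_from_jaccard(t, size_x, size_y):
    """Overlap bound equivalent to J(x,y) >= t."""
    return t / (1 + t) * (size_x + size_y)

def hamming_threshold_from_overlap(t, size_x, size_y):
    """Hamming bound equivalent to O(x,y) >= t."""
    return size_x + size_y - 2 * t

x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}

o = len(x & y)        # overlap: Lake, Mendota
j = o / len(x | y)    # Jaccard: 2 shared tokens out of 5 distinct ones
h = len(x ^ y)        # Hamming: Monona, Wisc, Dane

bound = overlap_threshold_from_jaccard(0.5, len(x), len(y))
print(round(bound, 2))       # 2.33
print(j >= 0.5, o >= bound)  # False False
print(o >= 2, h <= hamming_threshold_from_overlap(2, len(x), len(y)))  # True True
```

Both sides of each equivalence give the same answer, so a join algorithm built for overlap thresholds can also serve Jaccard or Hamming thresholds after converting the bound.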

9
Naïve Solution

All pair-wise comparisons between A and B
Nested loop: $|A| \cdot |B|$ comparisons
The sim() evaluation may be costly
Eg, Generalized Jaccard similarity function with
$O(|x|^3)$

For x in A
  For y in B
    If sim(x,y) >= t, return (x,y)
A, B: tables; x, y: records represented as sets
10
Naïve Solution Example
A
ID  Content
1   Lake, Mendota
2   Lake, Monona, Area
3   Lake, Mendota, Monona, Dane

B
ID  Content
4   Lake, Monona, University
5   Monona, Research, Area
6   Lake, Mendota, Monona, Area

$O(x,y) \ge 2$?
O(x,y)  ID4  ID5  ID6
ID1     1    0    2
ID2     2    2    3
ID3     2    1    3
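
The nested-loop join can be run directly on these two collections. This is a minimal sketch that assumes the threshold test is "at least t"; `naive_join` is a name made up for the example.

```python
A = {1: {"Lake", "Mendota"},
     2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

def overlap(x, y):
    return len(x & y)

def naive_join(A, B, sim, t):
    """Compare every record of A with every record of B: |A| * |B| sim() calls."""
    return [(i, j) for i, x in A.items() for j, y in B.items() if sim(x, y) >= t]

print(naive_join(A, B, overlap, 2))
# [(1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 6)]
```

All nine pairs are evaluated even though three of them share at most one token, which is the cost the later filtering techniques try to avoid.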
11
Naïve Solution Example
A
ID  Content
1   Lake, Mendota
2   Lake, Monona, Area
3   Lake, Mendota, Monona, Dane

B
ID  Content
4   Lake, Monona, University
5   Monona, Research, Area
6   Lake, Mendota, Monona, Area

$J(x,y) \ge 0.6$?
J(x,y)  ID4   ID5   ID6
ID1     0.25  0     0.5
ID2     0.5   0.5   0.75
ID3     0.4   0.17  0.6
12
2-Step Framework

Step 1: Blocking
Using index/heuristics/filtering/etc., reduce the number of
candidates to compare
Step 2: Compute sim() only within candidate sets
$O(|A| \cdot |C|)$ s.t. $|C| \ll |B|$

For x in A
  Using Foo, find a candidate set C in B
  For y in C
    If sim(x,y) >= t, return (x,y)
13
Variants for Foo

Foo: How to identify candidate set C
Fast
Accurate: no false positives/negatives
Many variants for Foo
Inverted Index: Sarawagi et al, SIGMOD 04
Size filtering: Arasu et al, VLDB 06
Prefix Index: Chaudhuri et al, ICDE 06
Prefix Inverted Index: Bayardo et al, WWW 07
Bound filtering: On et al, ICDE 07
Position Index: Xiao et al, WWW 08

14
Inverted Index Sarawagi et al, SIGMOD 04
A
ID  Content
1   Lake, Mendota
2   Lake, Monona, Area
3   Lake, Mendota, Monona, Dane

B
ID  Content
4   Lake, Monona, University
5   Monona, Research, Area
6   Lake, Mendota, Monona, Area

Inverted Index (IDX) for A
Token in A  ID List
Area        2
Dane        3
Lake        1, 2, 3
Mendota     1, 3
Monona      2, 3

Inverted Index (IDX) for B
Token in B  ID List
Area        5, 6
Lake        4, 6
Mendota     6
Monona      4, 5, 6
Research    5
University  4
15
Inverted Index Sarawagi et al, SIGMOD 04
A
ID  Content
1   Lake, Mendota
2   Lake, Monona, Area
3   Lake, Mendota, Monona, Dane

B
ID  Content
4   Lake, Monona, University
5   Monona, Research, Area
6   Lake, Mendota, Monona, Area

For x in A
  Using IDX, find a candidate set C in B
  For y in C
    If sim(x,y) >= t, return (x,y)

Inverted Index (IDX) for B
Token in B  ID List
Area        5, 6
Lake        4, 6
Mendota     6
Monona      4, 5, 6
Research    5
University  4

Probing IDX with ID1 = {Lake, Mendota}:
Lake → 4, 6
Mendota → 6
Candidate set C = {4, 6}
16
Inverted Index Sarawagi et al, SIGMOD 04
A
ID  Content
1   Lake, Mendota
2   Lake, Monona, Area
3   Lake, Mendota, Monona, Dane

B
ID  Content
4   Lake, Monona, University
5   Monona, Research, Area
6   Lake, Mendota, Monona, Area

For x in A
  Using IDX, find a candidate set C in B
  For y in C
    If sim(x,y) >= t, return (x,y)

Inverted Index (IDX) for B
Token in B  ID List
Area        5, 6
Lake        4, 6
Mendota     6
Monona      4, 5, 6
Research    5
University  4

Probing IDX with ID1 = {Lake, Mendota}:
ID  Freq.
4   1
6   2
$O(x,y) \ge 2$ → Candidate set C = {6}
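
The probe-and-count step can be sketched as follows. This is an illustrative version under the assumption that sim() is overlap, so the number of ID lists a record appears in equals its overlap with x; the function names are chosen for this example.

```python
from collections import Counter, defaultdict

B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

def build_inverted_index(records):
    """Map each token to the list of record IDs that contain it."""
    index = defaultdict(list)
    for rid, tokens in records.items():
        for tok in tokens:
            index[tok].append(rid)
    return index

def probe(x, index, t):
    """Count how often each record ID shows up in the ID lists of x's tokens."""
    freq = Counter(rid for tok in x for rid in index.get(tok, ()))
    candidates = {rid for rid, f in freq.items() if f >= t}
    return freq, candidates

idx = build_inverted_index(B)
freq, candidates = probe({"Lake", "Mendota"}, idx, 2)
print(sorted(freq.items()))  # [(4, 1), (6, 2)]
print(candidates)            # {6}
```

Record 5 is never touched because it shares no token with ID1, and record 4 is dropped because its frequency stays below the threshold.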
17
Size Filtering Arasu et al, VLDB 06

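Idea: Build an index on the size of inputs
Jaccard coefficient: $J(x,y) = \frac{|x \cap y|}{|x \cup y|}$
Upper bound for Jaccard: $J(x,y) \le \frac{\min(|x|,|y|)}{\max(|x|,|y|)}$
Bounding $|y|$ w.r.t. $|x|$: $t \le \frac{|y|}{|x|}$ if $|y| \le |x|$, and $t \le \frac{|x|}{|y|}$ otherwise
Combining the two → $t \cdot |x| \le |y| \le \frac{|x|}{t}$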
18
Size Filtering Arasu et al, VLDB 06

Intuition: If $t$ and $|x|$ are given, $|y|$ is bounded
Eg,
x = {Lake, Mendota}
y = {Lake, Mendota, Monona, Area}
$J(x,y) \ge 0.8$?
Then, according to $t \cdot |x| \le |y| \le |x|/t$:
$|x| = 2$, $t = 0.8$ → $1.6 \le |y| \le 2.5$
However, $|y| = 4$
y cannot satisfy $t = 0.8$ → no need to compute
J(x,y) at all

19
Size Filtering Arasu et al, VLDB 06
For x in A
  Using IDX, find a candidate set C in B
  For y in C
    If sim(x,y) >= t, return (x,y)

Algorithm
For all input strings, build a B-tree w.r.t. their
sizes
Given a set x, using the B-tree index, find a
candidate y in B s.t. $t \cdot |x| \le |y| \le |x|/t$
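
A small sketch of the size filter follows. It assumes a Jaccard threshold and substitutes a sorted list searched with `bisect` for the B-tree, which is enough to show the range lookup; collection B from the earlier examples is reused and the helper names are invented here.

```python
from bisect import bisect_left, bisect_right

B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

# Stand-in for the B-tree: (size, record ID) pairs kept in sorted order.
by_size = sorted((len(tokens), rid) for rid, tokens in B.items())
sizes = [size for size, _ in by_size]

def size_bounds(size_x, t):
    """Range of |y| that can still reach J(x,y) >= t."""
    return t * size_x, size_x / t

def size_candidates(x, t):
    """IDs of records in B whose size falls inside the allowed range."""
    lo, hi = size_bounds(len(x), t)
    start, end = bisect_left(sizes, lo), bisect_right(sizes, hi)
    return [rid for _, rid in by_size[start:end]]

x = {"Lake", "Mendota"}
print(size_bounds(len(x), 0.8))
print(size_candidates(x, 0.8))  # []
print(size_candidates(x, 0.5))  # [4, 5, 6]
```

With t = 0.8 every record in B is too large for x, so no Jaccard value is computed at all; with t = 0.5 the size range widens to 1 through 4 and all three records remain candidates.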

20
Prefix Index Chaudhuri et al, ICDE 06

Intuition: If two sets are very similar, their
prefixes, when ordered, must have some common
tokens
Eg,
x = {Dane, University, Monona, Mendota}
y = {Area, Lake, Mendota, Monona, Wisc}
$O(x,y) \ge 3$?
After ordering the tokens (prefixes in brackets):
x: [Dane, Mendota], Monona, University
y: [Area, Lake, Mendota], Monona, Wisc
21
Prefix Index Chaudhuri et al, ICDE 06

Theorem 1: If there is no overlap btw. Prefix(x)
and Prefix(y), then $sim(x,y) < t$, where
If sim() = Overlap, $|Prefix(x)| = |x| - (t-1)$
If sim() = Jaccard, $|Prefix(x)| = |x| - \lceil t \cdot |x| \rceil + 1$
Algorithm using Theorem 1
Given a set x
  For each token t_x in the prefix of x
    Using an index, locate a candidate y that
    contains t_x in the prefix of y
    If sim(x,y) >= t, return (x,y)
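
The prefix test can be tried on the example from the previous slide. The sketch assumes alphabetical token order, as in that example, and reads the theorem as "disjoint prefixes imply sim(x,y) below t"; all function names are illustrative.

```python
from math import ceil

def prefix_len_overlap(size, t):
    """Prefix length for an overlap threshold t."""
    return size - (t - 1)

def prefix_len_jaccard(size, t):
    """Prefix length for a Jaccard threshold t."""
    return size - ceil(t * size) + 1

def prefix(tokens, length):
    """First `length` tokens under the (here alphabetical) global order."""
    return set(sorted(tokens)[:length])

x = {"Dane", "University", "Monona", "Mendota"}
y = {"Area", "Lake", "Mendota", "Monona", "Wisc"}
t = 3

px = prefix(x, prefix_len_overlap(len(x), t))  # Dane, Mendota
py = prefix(y, prefix_len_overlap(len(y), t))  # Area, Lake, Mendota
print(sorted(px & py))  # ['Mendota']
print(len(x & y) >= t)  # False
```

The prefixes share Mendota, so (x,y) becomes a candidate, but the verification step rejects it because the true overlap is only 2. The filter never drops a real answer, yet it can let through pairs that fail verification.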

22
Prefix Inverted Index Bayardo et al, WWW 07
A
ID  Content
1   Lake, Mendota
2   Lake, Monona, Area
3   Lake, Mendota, Monona, Dane

B
ID  Content
4   Lake, Monona, University
5   Monona, Research, Area
6   Lake, Mendota, Monona, Area

Inverted Index (IDX) for both A and B
Token       ID List         DF  Order
Area        2, 5, 6         3   4
Dane        3               1   1
Lake        1, 2, 3, 4, 6   5   6
Mendota     1, 3, 6         3   5
Monona      2, 3, 4, 5, 6   5   7
Research    5               1   2
University  4               1   3

Create a universal order: put rare tokens first
Order: Dane > Research > University > Area >
Mendota > Lake > Monona
23
Prefix Inverted Index Bayardo et al, WWW 07
Ordered A
ID  Content
1   Mendota, Lake
2   Area, Lake, Monona
3   Dane, Mendota, Lake, Monona

Ordered B
ID  Content
4   University, Lake, Monona
5   Research, Area, Monona
6   Area, Mendota, Lake, Monona

Order: Dane > Research > University > Area >
Mendota > Lake > Monona
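
The universal order and the reordered records can be derived mechanically. The sketch below assumes ties in document frequency are broken alphabetically, which is an assumption of this example that happens to reproduce the order shown on the slide.

```python
from collections import Counter

A = {1: {"Lake", "Mendota"},
     2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

records = {**A, **B}

# Document frequency: in how many records each token occurs.
df = Counter(tok for tokens in records.values() for tok in tokens)

# Rare tokens first; alphabetical tie-break (assumed).
order = sorted(df, key=lambda tok: (df[tok], tok))
rank = {tok: i for i, tok in enumerate(order)}

ordered = {rid: sorted(tokens, key=rank.__getitem__)
           for rid, tokens in records.items()}

print(order)
# ['Dane', 'Research', 'University', 'Area', 'Mendota', 'Lake', 'Monona']
print(ordered[6])
# ['Area', 'Mendota', 'Lake', 'Monona']
```

Putting rare tokens first keeps the prefixes selective: a frequent token such as Monona ends up at the tail of every record and rarely enters a prefix.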
24
Prefix Inverted Index Bayardo et al, WWW 07
Ordered A
ID  Content
1   Mendota, Lake
2   Area, Lake, Monona
3   Dane, Mendota, Lake, Monona

Ordered B
ID  Content
4   University, Lake, Monona
5   Research, Area, Monona
6   Area, Mendota, Lake, Monona

$O(x,y) \ge 2$: $|Prefix(x)| = |x| - (t-1) = |x| - 1$

Prefix Inverted Index for B
Token in B  ID List
Area        5, 6
Lake        4, 6
Mendota     6
Research    5
University  4

Probing with ID1 = Mendota, Lake (prefix: Mendota):
Mendota → 6
Candidate set C = {6}
25
Prefix Inverted Index Bayardo et al, WWW 07
Ordered A
ID  Content
1   Mendota, Lake
2   Area, Lake, Monona
3   Dane, Mendota, Lake, Monona

Ordered B
ID  Content
4   University, Lake, Monona
5   Research, Area, Monona
6   Area, Mendota, Lake, Monona

$O(x,y) \ge 2$: $|Prefix(x)| = |x| - (t-1) = |x| - 1$

Prefix Inverted Index for B
Token in B  ID List
Area        5, 6
Lake        4, 6
Mendota     6
Research    5
University  4

Probing with ID2 = Area, Lake, Monona (prefix: Area, Lake):
Area → 5, 6
Lake → 4, 6
Candidate set C = {4, 5, 6}
26
Prefix Inverted Index Bayardo et al, WWW 07
Ordered A
ID  Content
1   Mendota, Lake
2   Area, Lake, Monona
3   Dane, Mendota, Lake, Monona

Ordered B
ID  Content
4   University, Lake, Monona
5   Research, Area, Monona
6   Area, Mendota, Lake, Monona

$O(x,y) \ge 2$: $|Prefix(x)| = |x| - (t-1) = |x| - 1$

Prefix Inverted Index for B
Token in B  ID List
Area        5, 6
Lake        4, 6
Mendota     6
Research    5
University  4

Probing with ID3 = Dane, Mendota, Lake, Monona (prefix: Dane, Mendota, Lake):
Dane → (none)
Mendota → 6
Lake → 4, 6
Candidate set C = {4, 6}
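
The three probing slides can be replayed in one short program. This sketch assumes overlap similarity with threshold 2 and hard-codes the universal order from the slides; `prefix_join` and the other names are hypothetical.

```python
from collections import defaultdict

ORDER = ["Dane", "Research", "University", "Area", "Mendota", "Lake", "Monona"]
RANK = {tok: i for i, tok in enumerate(ORDER)}

A = {1: {"Lake", "Mendota"},
     2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}

def prefix(tokens, t):
    """First |x| - (t - 1) tokens of a record under the universal order."""
    ordered = sorted(tokens, key=RANK.__getitem__)
    return ordered[:len(ordered) - (t - 1)]

def build_prefix_index(records, t):
    """Inverted index that only covers the prefix tokens of each record."""
    index = defaultdict(set)
    for rid, tokens in records.items():
        for tok in prefix(tokens, t):
            index[tok].add(rid)
    return index

def prefix_join(A, B, t):
    index = build_prefix_index(B, t)
    candidates, answers = {}, []
    for xid, x in A.items():
        cands = set()
        for tok in prefix(x, t):
            cands |= index.get(tok, set())
        candidates[xid] = sorted(cands)
        # Verification step: compute the real overlap for candidates only.
        answers += [(xid, yid) for yid in sorted(cands) if len(x & B[yid]) >= t]
    return candidates, answers

candidates, answers = prefix_join(A, B, 2)
print(candidates)  # {1: [6], 2: [4, 5, 6], 3: [4, 6]}
print(answers)     # [(1, 6), (2, 4), (2, 5), (2, 6), (3, 4), (3, 6)]
```

Six candidate pairs are verified instead of the nine pairs a nested loop would compare, and the answer set is unchanged.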
27
Position Index Xiao et al, WWW 08
Order: Dane > Research > University > Area >
Mendota > Lake > Monona

Eg,
x = Dane, Research, Area, Mendota, Lake
y = Research, Area, Mendota, Lake, Monona
$O(x,y) \ge 4$?
→ $|Prefix(x)| = |Prefix(y)| = 5 - (4-1) = 2$
x: [Dane, Research], Area, Mendota, Lake
y: [Research, Area], Mendota, Lake, Monona
Research is common btw. the prefixes → (x,y) is a
candidate pair → need to compute sim(x,y)

28
Position Index Xiao et al, WWW 08
Order: Dane > Research > University > Area >
Mendota > Lake > Monona

Eg,
x = Dane, Research, Area, Mendota, Lake
y = Research, Area, Mendota, Lake, Monona
$O(x,y) \ge 4$?
→ $|Prefix(x)| = |Prefix(y)| = 5 - (4-1) = 2$
x: [Dane, Research], Area, Mendota, Lake
y: [Research, Area], Mendota, Lake, Monona
Estimation of max overlap = overlap in prefixes
+ min. number of unseen tokens = $1 + \min(3, 4) = 4$
If the estimate is $< t$ → no need to compute sim(x,y)!
Here the estimate is $4 \ge t$, so (x,y) stays a candidate

29
Bound Filtering On et al, ICDE 07

Generalized Jaccard (GJ) similarity
Two sets $x = \{a_1, \ldots, a_{|x|}\}$, $y = \{b_1, \ldots, b_{|y|}\}$
Normalized weight of the maximum bipartite
matching M in the bipartite graph $(N = x \cup y,\ E = x \times y)$

30
Bound Filtering On et al, ICDE 07
[Figure: bipartite graph between the elements of x and y with edge weights 0.7, 0.5, 0.4, 0.9, 0.2, 0.1]
M: maximum weight bipartite matching
31
Bound Filtering On et al, ICDE 07

Issues
GJ captures more semantics btw. two sets via the
weighted bipartite matching than Jaccard
But more costly to compute maximum weight
bipartite matching
Bellman-Ford: $O(V^2 E)$
Hungarian: $O(V^3)$

For x in A
  Using Foo, find a candidate set C in B
  For y in C
    If GJ(x,y) >= t, return (x,y)
32
Bound Filtering On et al, ICDE 07

Bipartite matching computation is expensive
because of the requirement:
No node in the bipartite graph can have more than
one edge incident on it
Relax this constraint
For each element $a_i$ in x, find an element $b_j$ in y
with the highest element-level similarity → S1
For each element $b_j$ in y, find an element $a_i$ in x
with the highest element-level similarity → S2
Complexity becomes linear in the number of edges: $O(|x| \cdot |y|)$

33
Bound Filtering On et al, ICDE 07
[Figure: the bipartite graph between x and y (edge weights 0.7, 0.5, 0.4, 0.9, 0.2, 0.1) shown twice, once highlighting the edge set S1 and once highlighting the edge set S2]
34
Bound Filtering On et al, ICDE 07

Properties
Numerator of UB is at least as large as that of
GJ
Denominator of UB is no larger than that of GJ
Similar arguments for LB
Theorem 2: $LB \le GJ \le UB$

35
Bound Filtering On et al, ICDE 07
For x in A
  Using Foo, find a candidate set C in B
  For y in C
    If GJ(x,y) >= t, return (x,y)

Algorithm
Compute UB(x,y)
If $UB(x,y) < t$ → $GJ(x,y) < t$ → (x,y) is not an
answer
Else compute LB(x,y)
  If $LB(x,y) \ge t$ → $GJ(x,y) \ge t$ → (x,y) is an answer
  Else compute GJ(x,y)

$LB \le GJ \le UB$
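
The cascade can be sketched end to end, with the caveat that the transcript does not preserve the formulas for GJ, UB and LB. The code below therefore assumes the following definitions: GJ is the weight of M divided by |x| + |y| - |M|, UB applies the same ratio to the edge set S1 ∪ S2, and LB applies it to S1 ∩ S2. The weight matrices are hypothetical, element similarities are assumed distinct and within [0, 1], and the brute-force matching is only meant for tiny sets.

```python
from itertools import permutations

def gj(w):
    """Assumed GJ: weight of the maximum matching / (|x| + |y| - |M|).
    w[i][j] is the similarity of a_i and b_j; brute force over matchings."""
    rows, cols = len(w), len(w[0])
    if rows > cols:                       # make rows the smaller side
        w = [list(col) for col in zip(*w)]
        rows, cols = cols, rows
    best = max(sum(w[i][p[i]] for i in range(rows))
               for p in permutations(range(cols), rows))
    matched = rows                        # every row is matched
    return best / (rows + cols - matched)

def greedy_edges(w):
    """S1: best partner of each a_i. S2: best partner of each b_j."""
    rows, cols = len(w), len(w[0])
    s1 = {(i, max(range(cols), key=lambda j: w[i][j])) for i in range(rows)}
    s2 = {(max(range(rows), key=lambda i: w[i][j]), j) for j in range(cols)}
    return s1, s2

def ratio(w, edges):
    total = sum(w[i][j] for i, j in edges)
    return total / (len(w) + len(w[0]) - len(edges))

def ub(w):
    s1, s2 = greedy_edges(w)
    return ratio(w, s1 | s2)              # assumed upper bound

def lb(w):
    s1, s2 = greedy_edges(w)
    return ratio(w, s1 & s2)              # assumed lower bound

def bound_filter(w, t):
    """Return (is_answer, stage that decided)."""
    if ub(w) < t:
        return False, "UB"
    if lb(w) >= t:
        return True, "LB"
    return gj(w) >= t, "GJ"

w1 = [[0.9, 0.2, 0.1],
      [0.4, 0.7, 0.5]]                    # hypothetical arrangement of weights
w2 = [[0.2, 0.1],
      [0.1, 0.3]]
print(bound_filter(w1, 0.5))  # (True, 'LB')
print(bound_filter(w1, 0.8))  # (False, 'GJ')
print(bound_filter(w2, 0.5))  # (False, 'UB')
```

Only the middle call pays for the expensive matching. The other two pairs are decided by the cheap bounds alone, which is the point of the cascade.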
36
Takeaways

WESSJoin finds ALL pairs of sets btw. two
collections whose similarity $\ge t$
Good abstraction for various problems
2-step framework is promising
Step 1: reduce candidates
Step 2: similarity computation among candidates
Less researched issues
Comparison among different WESSJoin methods
WESSJoin + top-k/skyline/MapReduce/etc.

37
Reference

Sarawagi et al, SIGMOD 04: Sunita Sarawagi, Alok Kirpal. Efficient Set Joins on Similarity Predicates. SIGMOD 2004.
Arasu et al, VLDB 06: Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. Efficient Exact Set-Similarity Joins. VLDB 2006.
Chaudhuri et al, ICDE 06: Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.
Bayardo et al, WWW 07: R. J. Bayardo, Yiming Ma, Ramakrishnan Srikant. Scaling Up All-Pairs Similarity Search. WWW 2007.
On et al, ICDE 07: Byung-Won On, Nick Koudas, Dongwon Lee, Divesh Srivastava. Group Linkage. ICDE 2007.
Xiao et al, WWW 08: Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
Wei Wang. Efficient Exact Similarity Join Algorithms. http://www.cse.unsw.edu.au/~weiw/project/PPJoin-UTS-Oct-2008.pdf
Jeffrey D. Ullman. High-Similarity Algorithms. http://infolab.stanford.edu/~ullman/mining/2009/similarity4.pdf