1
Weighted Exact Set Similarity Join
  • The Pennsylvania State University
  • Dongwon Lee
  • dongwon@psu.edu

2
Set Similarity Join
  • Def. Set Similarity Join (SSJoin): between collections A and B, find X pairs of objects whose similarity > t
  • If X = MOST → Approximate SSJoin
  • If X = ALL → Exact SSJoin

[Figure: two example collections A and B, e.g. {Lake, Monona, Wisc, Dane, County} in A and {University, Mendota, Wisc, Dane, ...} in B, with pairwise similarity scores such as 0.7, 0.5, 0.4, 0.9, 0.2, 0.1 on the edges]
3
Set Similarity Join
  • Weighted vs. unweighted
  • Weighting quantifies the relative importance of a token
  • E.g., "Microsoft" is more important than "Corp."
  • How to assign meaningful weights to tokens is an important problem in itself
  • Not discussed further here

4
Set Similarity Join
  • Approximate SSJoin
  • Allows some false positives/negatives
  • E.g., LSH as a solution
  • Exact SSJoin
  • Does not allow any false positives/negatives
  • Needs to be scalable
  • Weighted Exact SSJoin
  • We will simply call it WESSJoin

            weighted    unweighted
exact       WESSJoin    UESSJoin
approx.     WASSJoin    UASSJoin
5
Applications of WESSJoin
  • Entity resolution
  • Web document genre classification
  • Find all pairs of documents with similar contents
  • Query refinement for web search
  • For a query, find another query with similar search results
  • Movie recommendation
  • Identify users who have similar movie tastes w.r.t. the rented movies
  • → Focus on string data represented as a SET
  • E.g., document, web page, record

6
Research Issues
  • Why not express WESSJoin in SQL?
  • Join predicate as a UDF
  • Cartesian product followed by UDF processing → inefficient evaluation
  • Special handling for WESSJoin is needed
  • Scalability
  • Support for diverse similarity (or distance) functions
  • E.g., Overlap, Jaccard, Cosine vs. Edit distance, ...
  • Support for diverse computation models
  • E.g., Threshold vs. Top-k

7
Similarity/Distance Functions
  • Jaccard coefficient: J(x,y) = |x ∩ y| / |x ∪ y|
  • Overlap similarity: O(x,y) = |x ∩ y|
  • Cosine similarity: C(x,y) = |x ∩ y| / √(|x|·|y|)
  • Hamming distance: H(x,y) = size of the symmetric difference = |x| + |y| - 2·|x ∩ y|
  • Levenshtein distance L(x,y): min # of edit operations to transform x into y
  • (The set-based measures are sketched in code below)
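As a concrete reference, here is a minimal Python sketch of the set-based measures above (the function names are mine, not from the slides; Levenshtein is omitted since it operates on strings rather than sets):

def jaccard(x, y):
    # J(x,y) = |x ∩ y| / |x ∪ y|
    return len(x & y) / len(x | y)

def overlap(x, y):
    # O(x,y) = |x ∩ y|
    return len(x & y)

def cosine(x, y):
    # C(x,y) = |x ∩ y| / sqrt(|x| * |y|)
    return len(x & y) / (len(x) * len(y)) ** 0.5

def hamming(x, y):
    # H(x,y) = size of the symmetric difference
    return len(x ^ y)

x = {"Lake", "Mendota", "Monona"}
y = {"Wisc", "Dane", "Mendota", "Lake"}
print(jaccard(x, y), overlap(x, y))   # 0.4 2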

8
Properties of sim()
  • Threshold conditions on these similarity functions can be rewritten into one another equivalently
  • J(x,y) > t ⇔ O(x,y) > t/(1+t) · (|x| + |y|)
  • O(x,y) > t ⇔ H(x,y) < |x| + |y| - 2t
  • C(x,y) > t ⇔ O(x,y) > t · √(|x|·|y|)
  • E.g.,
  • x = {Lake, Mendota, Monona}
  • y = {Wisc, Dane, Mendota, Lake}
  • J(x,y) > 0.5? ⇔ O(x,y) > 0.5/1.5 · 7 ≈ 2.33?
  • Set representation: k-gram, word, phrase, ...

9
Naïve Solution
  • All pair-wise comparisons between A and B
  • Nested loop: |A|·|B| comparisons
  • The sim() evaluation itself may be costly
  • E.g., the Generalized Jaccard similarity function costs O(|x|³)

For x in A
    For y in B
        If sim(x,y) > t, return (x,y)
A, B: tables; x, y: records represented as sets
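A direct Python sketch of this naive nested-loop join, using Jaccard as the example sim() (names are illustrative only):

def jaccard(x, y):
    return len(x & y) / len(x | y)

def naive_ssjoin(A, B, sim, t):
    # A, B: dicts mapping record id -> set of tokens
    results = []
    for xid, x in A.items():
        for yid, y in B.items():          # |A| * |B| sim() evaluations
            if sim(x, y) > t:
                results.append((xid, yid))
    return results

A = {1: {"Lake", "Mendota"}, 2: {"Lake", "Monona", "Area"}}
B = {6: {"Lake", "Mendota", "Monona", "Area"}}
print(naive_ssjoin(A, B, jaccard, 0.4))   # [(1, 6), (2, 6)]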
10
Naïve Solution Example
A:
  ID  Content
  1   Lake, Mendota
  2   Lake, Monona, Area
  3   Lake, Mendota, Monona, Dane
B:
  ID  Content
  4   Lake, Monona, University
  5   Monona, Research, Area
  6   Lake, Mendota, Monona, Area
O(x,y) > 2?
  O(x,y)  ID4   ID5   ID6
  ID1     1     0     2
  ID2     2     2     3
  ID3     2     1     3
11
Naïve Solution Example
A:
  ID  Content
  1   Lake, Mendota
  2   Lake, Monona, Area
  3   Lake, Mendota, Monona, Dane
B:
  ID  Content
  4   Lake, Monona, University
  5   Monona, Research, Area
  6   Lake, Mendota, Monona, Area
J(x,y) > 0.6?
  J(x,y)  ID4   ID5   ID6
  ID1     0.25  0     0.5
  ID2     0.5   0.4   0.75
  ID3     0.2   0.16  0.6
12
2-Step Framework
  • Step 1: Blocking
  • Using an index / heuristics / filtering / etc., reduce the # of candidates to compare
  • Step 2: compute sim() only within the candidate sets
  • Cost O(|A|·|C|) s.t. |C| << |B|

For x in A
    Using Foo, find a candidate set C in B
    For y in C
        If sim(x,y) > t, return (x,y)
13
Variants for Foo
  • Foo: how to identify the candidate set C
  • Fast
  • Accurate: no false positives/negatives
  • Many Variants for Foo
  • Inverted Index Sarawagi et al, SIGMOD 04
  • Size filtering Arasu et al, VLDB 06
  • Prefix Index Chaudhuri et al, ICDE 06
  • Prefix Inverted Index Bayardo et al, WWW 07
  • Bound filtering On et al, ICDE 07
  • Position Index Xiao et al, WWW 08

14
Inverted Index Sarawagi et al, SIGMOD 04
A:
  ID  Content
  1   Lake, Mendota
  2   Lake, Monona, Area
  3   Lake, Mendota, Monona, Dane
B:
  ID  Content
  4   Lake, Monona, University
  5   Monona, Research, Area
  6   Lake, Mendota, Monona, Area
Inverted Index (IDX) for A:
  Token in A   ID List
  Area         2
  Dane         3
  Lake         1, 2, 3
  Mendota      1, 3
  Monona       2, 3
Inverted Index (IDX) for B:
  Token in B   ID List
  Area         5
  Lake         4, 6
  Mendota      6
  Monona       4, 5, 6
  Research     5
  University   4
15
Inverted Index Sarawagi et al, SIGMOD 04
A:
  ID  Content
  1   Lake, Mendota
  2   Lake, Monona, Area
  3   Lake, Mendota, Monona, Dane
B:
  ID  Content
  4   Lake, Monona, University
  5   Monona, Research, Area
  6   Lake, Mendota, Monona, Area
Inverted Index (IDX) for B:
  Token in B   ID List
  Area         5
  Lake         4, 6
  Mendota      6
  Monona       4, 5, 6
  Research     5
  University   4
For x in A
    Using IDX, find a candidate set C in B
    For y in C
        If sim(x,y) > t, return (x,y)
ID1 = {Lake, Mendota}: Lake → {4, 6}, Mendota → {6} → candidate set C = {4, 6}
16
Inverted Index Sarawagi et al, SIGMOD 04
A:
  ID  Content
  1   Lake, Mendota
  2   Lake, Monona, Area
  3   Lake, Mendota, Monona, Dane
B:
  ID  Content
  4   Lake, Monona, University
  5   Monona, Research, Area
  6   Lake, Mendota, Monona, Area
Inverted Index (IDX) for B:
  Token in B   ID List
  Area         5
  Lake         4, 6
  Mendota      6
  Monona       4, 5, 6
  Research     5
  University   4
For x in A
    Using IDX, find a candidate set C in B
    For y in C
        If sim(x,y) > t, return (x,y)
ID1 = {Lake, Mendota} → candidate set C with overlap counts:
  ID  Freq.
  4   1
  6   2
O(x,y) > 2?
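A minimal Python sketch of this inverted-index blocking: for each probed record x, count how many of its tokens hit each record of B, then verify only those candidates (illustrative code, not the paper's exact algorithm):

from collections import defaultdict

def build_inverted_index(B):
    # token -> list of record ids in B containing it
    idx = defaultdict(list)
    for yid, y in B.items():
        for token in y:
            idx[token].append(yid)
    return idx

def ssjoin_inverted(A, B, t):
    idx = build_inverted_index(B)
    results = []
    for xid, x in A.items():
        freq = defaultdict(int)              # candidate id -> overlap count
        for token in x:
            for yid in idx.get(token, []):
                freq[yid] += 1
        for yid, ov in freq.items():         # only candidates, not all of B
            if ov > t:                       # here sim() = Overlap
                results.append((xid, yid))
    return results

A = {1: {"Lake", "Mendota"}}
B = {4: {"Lake", "Monona", "University"},
     5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}
print(ssjoin_inverted(A, B, 1))              # [(1, 6)] since O = 2 > 1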
17
Size Filtering Arasu et al, VLDB 06
  • Idea: build an index on the sizes of the inputs
  • Jaccard coefficient: J(x,y) = |x ∩ y| / |x ∪ y|
  • Upper bound for Jaccard: J(x,y) ≤ min(|x|, |y|) / max(|x|, |y|)
  • Bounding |y| w.r.t. |x|: J(x,y) > t ⇒ |y| > t·|x| and |y| < |x|/t
  • Combining the two → t·|x| < |y| < |x|/t

18
Size Filtering Arasu et al, VLDB 06
  • Intuition: if t and |x| are given, |y| is bounded
  • E.g.,
  • x = {Lake, Mendota}
  • y = {Lake, Mendota, Monona, Area}
  • J(x,y) > 0.8?
  • Then, according to the bound above:
  • |x| = 2, t = 0.8 → 1.6 < |y| < 2.5
  • However, |y| = 4
  • → y cannot satisfy t = 0.8 → no need to compute J(x,y) at all

19
Size Filtering Arasu et al, VLDB 06
For x in A
    Using IDX, find a candidate set C in B
    For y in C
        If sim(x,y) > t, return (x,y)
  • Algorithm (see the sketch below)
  • For all input strings, build a B-tree w.r.t. their sizes
  • Given a set x, use the B-tree index to find candidates y in B s.t. t·|x| < |y| < |x|/t
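A minimal sketch of size filtering for Jaccard; a sorted list plus binary search stands in for the B-tree, and all names are illustrative:

import bisect

def jaccard(x, y):
    return len(x & y) / len(x | y)

def ssjoin_size_filter(A, B, t):
    # B-tree substitute: records of B sorted by set size
    by_size = sorted(B.items(), key=lambda kv: len(kv[1]))
    sizes = [len(y) for _, y in by_size]
    results = []
    for xid, x in A.items():
        lo, hi = t * len(x), len(x) / t        # size window t*|x| .. |x|/t (boundaries kept, to stay safe)
        i = bisect.bisect_left(sizes, lo)
        j = bisect.bisect_right(sizes, hi)
        for yid, y in by_size[i:j]:            # only candidates inside the window
            if jaccard(x, y) > t:
                results.append((xid, yid))
    return results

A = {1: {"Lake", "Mendota"}}
B = {6: {"Lake", "Mendota", "Monona", "Area"}}
print(ssjoin_size_filter(A, B, 0.8))           # []: |y| = 4 falls outside [1.6, 2.5]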

20
Prefix Index Chaudhuri et al, ICDE 06
  • Intuition: if two sets are very similar, their prefixes, once the sets are ordered, must have some common tokens
  • E.g.,
  • x = {Dane, University, Monona, Mendota}
  • y = {Area, Lake, Mendota, Monona, Wisc}
  • O(x,y) > 3?
  • Ordered x = [Dane, Mendota, Monona, University]
  • Ordered y = [Area, Lake, Mendota, Monona, Wisc]
  • Prefixes: Prefix(x) = {Dane, Mendota}, Prefix(y) = {Area, Lake, Mendota} → they share "Mendota"
21
Prefix Index Chaudhuri et al, ICDE 06
  • Theorem 1: if there is no overlap between Prefix(x) and Prefix(y), then sim(x,y) < t, where
  • If sim() = Overlap, |Prefix(x)| = |x| - (t - 1)
  • If sim() = Jaccard, |Prefix(x)| = |x| - Ceiling(t·|x|) + 1
  • Algorithm using Theorem 1 (see the sketch below)
  • Given a set x
  • For each token t_x in the prefix of x
  • Using an index, locate candidates y that contain t_x in their prefix
  • If sim(x,y) > t, return (x,y)
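A small Python sketch of prefix filtering with Overlap as sim(); `order` maps each token to its rank in a fixed global ordering (e.g., the document-frequency order of the next slides), and the names are mine:

from collections import defaultdict

def prefix(tokens, length, order):
    # first `length` tokens under the global ordering
    return sorted(tokens, key=order.__getitem__)[:length]

def prefix_join_overlap(A, B, t, order):
    # index records of B by the tokens in their prefixes
    idx = defaultdict(set)
    for yid, y in B.items():
        for token in prefix(y, len(y) - (t - 1), order):
            idx[token].add(yid)
    results = []
    for xid, x in A.items():
        cands = set()
        for token in prefix(x, len(x) - (t - 1), order):
            cands |= idx.get(token, set())
        for yid in cands:                       # verify candidates only
            if len(x & B[yid]) > t:             # sim() = Overlap
                results.append((xid, yid))
    return results

A = {1: {"Dane", "Mendota", "Monona", "University"}}
B = {9: {"Area", "Lake", "Mendota", "Monona", "Wisc"}}
order = {tok: i for i, tok in enumerate(sorted(set().union(*A.values(), *B.values())))}
print(prefix_join_overlap(A, B, 3, order))      # []: the overlap is 2, not > 3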

22
Prefix Inverted Index Bayardo et al, WWW 07
A:
  ID  Content
  1   Lake, Mendota
  2   Lake, Monona, Area
  3   Lake, Mendota, Monona, Dane
B:
  ID  Content
  4   Lake, Monona, University
  5   Monona, Research, Area
  6   Lake, Mendota, Monona, Area
Inverted Index (IDX) for both A and B:
  Token       ID List          DF  Order
  Area        2, 5             2   4
  Dane        3                1   1
  Lake        1, 2, 3, 4, 6    5   6
  Mendota     1, 3, 6          3   5
  Monona      2, 3, 4, 5, 6    5   7
  Research    5                1   2
  University  4                1   3
Create a universal order: put rare tokens first
Order: Dane > Research > University > Area > Mendota > Lake > Monona
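A small sketch of deriving such a universal order from document frequencies and re-sorting every record by it (illustrative names; ties here are broken alphabetically):

from collections import Counter

def build_order(*collections):
    # document frequency of each token across all records
    df = Counter(tok for coll in collections for rec in coll.values() for tok in rec)
    ranked = sorted(df, key=lambda tok: (df[tok], tok))   # rare tokens first
    return {tok: rank for rank, tok in enumerate(ranked)}

A = {1: {"Lake", "Mendota"}, 2: {"Lake", "Monona", "Area"},
     3: {"Lake", "Mendota", "Monona", "Dane"}}
B = {4: {"Lake", "Monona", "University"}, 5: {"Monona", "Research", "Area"},
     6: {"Lake", "Mendota", "Monona", "Area"}}
order = build_order(A, B)
ordered_A = {i: sorted(rec, key=order.__getitem__) for i, rec in A.items()}
print(ordered_A[3])   # ['Dane', 'Mendota', 'Lake', 'Monona']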
23
Prefix Inverted Index Bayardo et al, WWW 07
Ordered A:
  ID  Content
  1   Mendota, Lake
  2   Area, Lake, Monona
  3   Dane, Mendota, Lake, Monona
Ordered B:
  ID  Content
  4   University, Lake, Monona
  5   Research, Area, Monona
  6   Area, Mendota, Lake, Monona
Order: Dane > Research > University > Area > Mendota > Lake > Monona
24
Prefix Inverted Index Bayardo et al, WWW 07
Ordered A:
  ID  Content
  1   Mendota, Lake
  2   Area, Lake, Monona
  3   Dane, Mendota, Lake, Monona
Ordered B:
  ID  Content
  4   University, Lake, Monona
  5   Research, Area, Monona
  6   Area, Mendota, Lake, Monona
O(x,y) > 2; |Prefix(x)| = |x| - (t - 1) = |x| - 1
Prefix Inverted Index for B:
  Token in B   ID List
  Area         5
  Lake         4, 6
  Mendota      6
  Research     5
  University   4
ID1 = {Mendota, Lake}, Prefix(ID1) = {Mendota} → candidate set C = {6}
25
Prefix Inverted Index Bayardo et al, WWW 07
Ordered A:
  ID  Content
  1   Mendota, Lake
  2   Area, Lake, Monona
  3   Dane, Mendota, Lake, Monona
Ordered B:
  ID  Content
  4   University, Lake, Monona
  5   Research, Area, Monona
  6   Area, Mendota, Lake, Monona
O(x,y) > 2; |Prefix(x)| = |x| - (t - 1) = |x| - 1
Prefix Inverted Index for B:
  Token in B   ID List
  Area         5
  Lake         4, 6
  Mendota      6
  Research     5
  University   4
ID2 = {Area, Lake, Monona}, Prefix(ID2) = {Area, Lake}: Area → {5}, Lake → {4, 6} → candidate set C = {4, 5, 6}
26
Prefix Inverted Index Bayardo et al, WWW 07
Ordered A:
  ID  Content
  1   Mendota, Lake
  2   Area, Lake, Monona
  3   Dane, Mendota, Lake, Monona
Ordered B:
  ID  Content
  4   University, Lake, Monona
  5   Research, Area, Monona
  6   Area, Mendota, Lake, Monona
O(x,y) > 2; |Prefix(x)| = |x| - (t - 1) = |x| - 1
Prefix Inverted Index for B:
  Token in B   ID List
  Area         5
  Lake         4, 6
  Mendota      6
  Research     5
  University   4
ID3 = {Dane, Mendota, Lake, Monona}, Prefix(ID3) = {Dane, Mendota, Lake}: Dane → {}, Mendota → {6}, Lake → {4, 6} → candidate set C = {4, 6}
27
Position Index Xiao et al, WWW 08
Order: Dane > Research > University > Area > Mendota > Lake > Monona
  • E.g.,
  • x = {Dane, Research, Area, Mendota, Lake}
  • y = {Research, Area, Mendota, Lake, Monona}
  • O(x,y) > 4?
  • |Prefix(x)| = |Prefix(y)| = 5 - (4 - 1) = 2
  • Prefix(x) = {Dane, Research}, Prefix(y) = {Research, Area}
  • "Research" is common between the prefixes → (x,y) is a candidate pair → need to compute sim(x,y)

28
Position Index Xiao et al, WWW 08
Order: Dane > Research > University > Area > Mendota > Lake > Monona
  • E.g.,
  • x = {Dane, Research, Area, Mendota, Lake}
  • y = {Research, Area, Mendota, Lake, Monona}
  • O(x,y) > 4?
  • |Prefix(x)| = |Prefix(y)| = 5 - (4 - 1) = 2
  • Prefix(x) = {Dane, Research}, Prefix(y) = {Research, Area}
  • Estimate of the max overlap = overlap in the prefixes + min(# of unseen tokens) = 1 + min(3, 4) = 4, which is not > t (= 4) → no need to compute sim(x,y) at all! (see the sketch below)
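A tiny sketch of this positional upper bound for two prefix-ordered records that share a token (names are illustrative):

def max_overlap_bound(x, y, i, j, current):
    # x, y: token lists sorted by the universal order
    # i, j: number of tokens already scanned in x and y (up to and including the shared token)
    # current: overlap accumulated within the prefixes so far
    return current + min(len(x) - i, len(y) - j)

x = ["Dane", "Research", "Area", "Mendota", "Lake"]
y = ["Research", "Area", "Mendota", "Lake", "Monona"]
# "Research" is at position 2 in x and position 1 in y; prefix overlap so far = 1
print(max_overlap_bound(x, y, 2, 1, 1))   # 1 + min(3, 4) = 4, not > 4, so the pair is pruned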

29
Bound Filtering On et al, ICDE 07
  • Generalized Jaccard (GJ) similarity
  • Two sets x = {a_1, ..., a_|x|}, y = {b_1, ..., b_|y|}
  • GJ(x,y) = normalized weight of the maximum-weight bipartite matching M in the bipartite graph G = (N = x ∪ y, E ⊆ x × y), where each edge carries an element-level similarity
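A minimal sketch of GJ, assuming an element-level similarity elem_sim and using SciPy's assignment solver for the maximum-weight matching; the normalization by |x| + |y| - |M| follows the group-linkage style of measure and should be treated as an assumption here, not the slide's exact definition:

import numpy as np
from scipy.optimize import linear_sum_assignment

def generalized_jaccard(x, y, elem_sim):
    # weight matrix of element-level similarities
    W = np.array([[elem_sim(a, b) for b in y] for a in x])
    rows, cols = linear_sum_assignment(W, maximize=True)   # maximum-weight matching M
    total = W[rows, cols].sum()
    m = len(rows)                                          # |M| = min(|x|, |y|)
    return total / (len(x) + len(y) - m)

def elem_sim(a, b):
    # toy element similarity: Jaccard over character bigrams
    ga = {a[i:i + 2] for i in range(len(a) - 1)}
    gb = {b[i:i + 2] for i in range(len(b) - 1)}
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

x = ["Lake Mendota", "Dane County"]
y = ["Lake Mendota WI", "Dane Cty"]
print(generalized_jaccard(x, y, elem_sim))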

30
Bound Filtering On et al, ICDE 07
[Figure: bipartite graph between the elements of x and y with edge weights 0.7, 0.5, 0.4, 0.9, 0.2, 0.1; M = the maximum-weight bipartite matching]
31
Bound Filtering On et al, ICDE 07
  • Issues
  • GJ captures more semantics between two sets, via the weighted bipartite matching, than Jaccard
  • But it is more costly to compute the maximum-weight bipartite matching
  • Bellman-Ford: O(V²E)
  • Hungarian: O(V³)
For x in A
    Using Foo, find a candidate set C in B
    For y in C
        If GJ(x,y) > t, return (x,y)
32
Bound Filtering On et al, ICDE 07
  • Bipartite matching computation is expensive because of the requirement that no node in the bipartite graph can have more than one edge incident on it
  • Relax this constraint:
  • For each element a_i in x, find the element b_j in y with the highest element-level similarity → S1
  • For each element b_j in y, find the element a_i in x with the highest element-level similarity → S2
  • Complexity becomes O(|x|·|y|), i.e., linear in the number of element pairs

33
Bound Filtering On et al, ICDE 07
[Figure: relaxed matchings S1 and S2 over the same weighted bipartite graph (edge weights 0.7, 0.5, 0.4, 0.9, 0.2, 0.1); in S1 every element of x keeps only its best edge, in S2 every element of y keeps only its best edge]
34
Bound Filtering On et al, ICDE 07
  • Properties
  • The numerator of UB is at least as large as that of GJ
  • The denominator of UB is no larger than that of GJ
  • Similar arguments hold for LB
  • Theorem 2:
  • LB ≤ GJ ≤ UB

35
Bound Filtering On et al, ICDE 07
For x in A
    Using Foo, find a candidate set C in B
    For y in C
        If GJ(x,y) > t, return (x,y)
  • Algorithm (see the sketch below)
  • Compute UB(x,y)
  • If UB(x,y) < t → GJ(x,y) < t → (x,y) is not an answer
  • Else compute LB(x,y)
  • If LB(x,y) > t → GJ(x,y) > t → (x,y) is an answer
  • Else compute GJ(x,y)

Recall Theorem 2: LB ≤ GJ ≤ UB
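The slides do not spell out the exact UB/LB formulas, so the sketch below uses one simple pair of bounds in the same spirit: row/column maxima of the weight matrix for UB (the relaxed matchings S1/S2) and a cheap greedy one-to-one matching for LB. This is my own instantiation, not necessarily the formulas of On et al.:

def bounds(W):
    # W: |x| x |y| matrix of element-level similarities (nested lists)
    nx, ny = len(W), len(W[0])
    denom = nx + ny - min(nx, ny)            # same denominator as used for GJ here
    row_max = sum(max(row) for row in W)     # S1-style relaxation
    col_max = sum(max(W[i][j] for i in range(nx)) for j in range(ny))   # S2-style
    ub = min(row_max, col_max) / denom       # >= weight of any one-to-one matching
    pairs = sorted(((W[i][j], i, j) for i in range(nx) for j in range(ny)), reverse=True)
    used_i, used_j, greedy = set(), set(), 0.0
    for w, i, j in pairs:                    # greedy matching: a feasible lower bound
        if i not in used_i and j not in used_j:
            used_i.add(i); used_j.add(j); greedy += w
    return greedy / denom, ub

def filtered_check(W, t, exact_gj):
    lb, ub = bounds(W)
    if ub < t: return False                  # GJ <= UB < t: prune, no matching needed
    if lb > t: return True                   # GJ >= LB > t: accept, no matching needed
    return exact_gj(W) > t                   # only now pay for the bipartite matching

W = [[0.9, 0.1], [0.2, 0.7], [0.4, 0.5]]
print(bounds(W))   # (0.533..., 0.533...): here LB = UB, so GJ is pinned down already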
36
Takeaways
  • WESSJoin finds ALL pairs of sets between two collections whose similarity > t
  • A good abstraction for various problems
  • The 2-step framework is promising
  • Step 1: reduce the # of candidates
  • Step 2: similarity computation among the candidates only
  • Less-researched issues:
  • Comparison among different WESSJoin methods
  • WESSJoin + top-k / skyline / MapReduce / etc.

37
References
  • Sarawagi et al, SIGMOD 04: Sunita Sarawagi, Alok Kirpal. Efficient Set Joins on Similarity Predicates. SIGMOD 2004.
  • Arasu et al, VLDB 06: Arvind Arasu, Venkatesh Ganti, Raghav Kaushik. Efficient Exact Set-Similarity Joins. VLDB 2006.
  • Chaudhuri et al, ICDE 06: Surajit Chaudhuri, Venkatesh Ganti, Raghav Kaushik. A Primitive Operator for Similarity Joins in Data Cleaning. ICDE 2006.
  • Bayardo et al, WWW 07: R. J. Bayardo, Yiming Ma, Ramakrishnan Srikant. Scaling Up All-Pairs Similarity Search. WWW 2007.
  • On et al, ICDE 07: Byung-Won On, Nick Koudas, Dongwon Lee, Divesh Srivastava. Group Linkage. ICDE 2007.
  • Xiao et al, WWW 08: Chuan Xiao, Wei Wang, Xuemin Lin, Jeffrey Xu Yu. Efficient Similarity Joins for Near Duplicate Detection. WWW 2008.
  • Wei Wang. Efficient Exact Similarity Join Algorithms. http://www.cse.unsw.edu.au/~weiw/project/PPJoin-UTS-Oct-2008.pdf
  • Jeffrey D. Ullman. High-Similarity Algorithms. http://infolab.stanford.edu/~ullman/mining/2009/similarity4.pdf