CS 361A (Advanced Data Structures and Algorithms) presentation

1
CS 361A (Advanced Data Structures and Algorithms)

Lecture 18
(Nov 30, 2005)
Fingerprints, Min-Hashing, and Document Similarity
Rajeev Motwani

2
Game Plan for Week

Fingerprints
Document Similarity
Shingling
Min-Hashing
Min-Wise Independent Permutations

3
Fingerprints

$\Omega$: set of large objects (e.g., URLs)
Goal
avoid storing large objects explicitly
quick-and-dirty equality-testing
Fingerprints?
Short tags for objects
Distinct fingerprints $\Rightarrow$ distinct objects
Distinct objects $\Rightarrow$ probably distinct fingerprints

4
Formalization

Fingerprint length $k$ $\Rightarrow$ fingerprint space size $N = 2^k$
Fingerprint function family $F = \{ f : \Omega \to \{0,1\}^k \}$
Random $f \in_R F$ $\Rightarrow$
$f(A) \ne f(B) \Rightarrow A \ne B$
Collisions: $P[\, f(A) = f(B) \mid A \ne B \,] \approx 0$ (ideally $2^{O(-k)}$)
Typical Application
Adversarial object-set $S$ with $|S| = n \ll 2^k$
Goal: $|f(S)| = |S|$ with high probability
$n^2$ pair-wise collisions possible $\Rightarrow$ need $2^k > n^2$ (to avoid Birthday Paradox)
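The birthday-paradox constraint can be made concrete with a small calculation. The following is a minimal Python sketch, assuming an idealized fingerprint function in which every pair of distinct objects collides with probability exactly $2^{-k}$ (an assumption; real families only approximate this). The function name `collision_union_bound` is hypothetical.

```python
from fractions import Fraction

def collision_union_bound(n: int, k: int) -> Fraction:
    """Union bound on P[some pair among n objects collides] for k-bit fingerprints.

    Assumes each pair collides with probability exactly 2^-k (idealized).
    """
    pairs = n * (n - 1) // 2                  # unordered pairs, roughly n^2 / 2
    return min(Fraction(1), Fraction(pairs, 2 ** k))

# With n = 2^16 objects, k = 32 bits leaves a bound of about 1/2,
# while k = 64 bits makes the bound negligible.
print(float(collision_union_bound(2 ** 16, 32)))
print(float(collision_union_bound(2 ** 16, 64)))
```

The bound only becomes small once $2^k$ is well above $n^2$, which is the condition stated on the slide.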

5
Example: URL Fingerprints

Search Engines
Manage large numbers of URL strings
Long, variable strings (embedded objects/database-queries)
Desiderata
small/fixed-length encodings; hopefully, unique
Some scenarios
Exact string irrelevant
Only need ability to distinguish distinct URLs
Even otherwise, unique IDs useful for indexing
Numbers?
4 billion webpages $\Rightarrow$ $n = 2^{32}$
$N = n^2 \Rightarrow k = 64$
Fingerprints $\Rightarrow$ 8-byte representation

6
Fingerprinting vs Hashing

Hashing: $h : \Omega \to \{0,1\}^k$
Set Membership testing for set $S$ of size $n$
Desire uniform distribution over bin addresses $\{0,1\}^k$
Minimize collisions per bin $\Rightarrow$ reduce lookup time
Minimize hash table size $\Rightarrow$ $n \approx N = 2^k$
Fingerprinting: $f : \Omega \to \{0,1\}^k$
Object Equality testing over set $S$ of size $n$
Distribution over $\{0,1\}^k$ is irrelevant
Avoid collisions altogether
Tolerate larger $k$; typically $N > n^2$

7
Fingerprinting Strings

Typical application, but techniques extend to combinatorial objects (database tuples, trees/graphs)
Obvious techniques
Checksum: no worst-case collision probability guarantees
MD5: cryptographic string hashes
relatively slow
avoids leaking information about original string
Rabin's Scheme
Algebraic technique: polynomial arithmetic
Efficient: need (1 table lookup + 1 xor + 1 shift) per byte
other nice properties

8
Rabin Fingerprints

Consider $m$-bit string $A = a_1 a_2 \cdots a_m$
Assume $a_1 = 1$ and fixed-length strings (wlog)
Encoding Strings
Degree-$(m-1)$ polynomials over $Z_2$
$A(x) = a_1 x^{m-1} + a_2 x^{m-2} + \cdots + a_{m-1} x + a_m$
Fingerprints
$P(x)$: random, irreducible degree-$k$ polynomial over $Z_2$ (easy to sample such polynomials)
irreducible: cannot be factored (unlike $x^2 + x + 1$, can factor $x^2 + 1 = (x + 1)^2$ over $Z_2$)
$f(A) = A(x) \bmod P(x)$
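To illustrate the definition $f(A) = A(x) \bmod P(x)$, here is a minimal Python sketch of polynomial reduction over $Z_2$. It assumes polynomials are encoded as Python integers (bit $i$ is the coefficient of $x^i$) and uses a tiny fixed modulus for readability; the scheme on the slide samples a random irreducible $P(x)$ of degree $k$, which this sketch does not do. The names `poly_mod` and `fingerprint` are hypothetical.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2.

    Polynomials are encoded as ints: bit i is the coefficient of x^i.
    """
    if p == 0:
        raise ZeroDivisionError("modulus polynomial must be non-zero")
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        shift = (a.bit_length() - 1) - deg_p
        a ^= p << shift          # subtraction over Z_2 is xor
    return a

def fingerprint(bits: str, p: int) -> int:
    """f(A) = A(x) mod P(x), where `bits` is the string a_1 a_2 ... a_m."""
    return poly_mod(int(bits, 2), p)

# x^2 + 1 = (x + 1)^2 over Z_2, so it leaves remainder 0 modulo x + 1.
print(poly_mod(0b101, 0b11))         # 0
# A = 1101 encodes x^3 + x^2 + 1; modulo x^2 + x + 1 this is x + 1.
print(fingerprint("1101", 0b111))    # 3
```

This bit-at-a-time loop is chosen for clarity; the per-byte table-lookup variant mentioned earlier in the lecture is an optimization of the same reduction.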

9
Analysis

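Fix set $S$ of $n$ strings of length $m$
Consider $Q_S(x) = \prod_{A \ne B \in S} (A(x) - B(x))$
Collision: $f(A) = f(B) \Leftrightarrow A(x) \equiv B(x) \pmod{P(x)} \Rightarrow Q_S(x) \equiv 0 \pmod{P(x)}$
Therefore $P(x)$ is a factor of $Q_S(x)$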
Collision Probability?
$\deg(Q_S) \le n^2 m$
number of irreducible degree-$k$ factors of $Q_S(x)$ is $< n^2 m / k$
Fact: number of irreducible degree-$k$ polynomials $> (2^k - 2^{k/2}) / k$
$\mathrm{Prob}[\text{random } P(x) \text{ divides } Q_S(x)] < n^2 m / (2^k - 2^{k/2}) \approx n^2 m / 2^k$
$\mathrm{Prob}[\text{fingerprints not distinct}] < n^2 m / (2^k - 2^{k/2}) \approx n^2 m / 2^k$
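The approximate bound $n^2 m / 2^k$ can be turned around to ask how many fingerprint bits are needed for a target failure probability. The following Python sketch does that arithmetic with exact integers; the function name `min_fingerprint_bits` and the parameter `inv_eps` (the reciprocal of the target probability) are hypothetical, and the example figures are illustrative rather than taken from the lecture.

```python
def min_fingerprint_bits(n: int, m: int, inv_eps: int) -> int:
    """Smallest k with n^2 * m / 2^k <= 1 / inv_eps (all arguments positive ints)."""
    target = n * n * m * inv_eps          # need 2^k >= target
    return (target - 1).bit_length()      # equals ceil(log2(target)) for target >= 1

# Illustrative: n = 2^32 strings of m = 2^13 bits, failure probability 2^-10.
print(min_fingerprint_bits(2 ** 32, 2 ** 13, 2 ** 10))   # 87
```

Under this bound, the string length $m$ and the target probability both add to the $2 \log_2 n$ bits required by the birthday argument.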

10
Beneficial Properties

Hardware-level implementation
$Z_2$-polynomials same as bit strings
simple shift-register operations
Distributivity: $f(A + B) = f(A) + f(B)$ over $Z_2$
Let $\circ$ denote concatenation
$f(A \circ B) = f(f(A) \circ B)$
$f(A \circ B) = A(x)\, x^m + B(x) \bmod P(x)$, where $m$ is the length of $B$
Fingerprint sliding windows over strings: low incremental cost
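The distributivity and concatenation properties can be checked numerically. The sketch below is a minimal Python illustration, assuming the same integer encoding of $Z_2$-polynomials (bit $i$ is the coefficient of $x^i$) and a small fixed irreducible modulus $x^3 + x + 1$ chosen only for the example; `poly_mod` and `concat` are hypothetical helper names.

```python
def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2 (ints encode coefficient bits)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << ((a.bit_length() - 1) - deg_p)
    return a

def concat(a: int, b: int, b_len: int) -> int:
    """Bit-string concatenation: A(x) * x^b_len + B(x), with B given as b_len bits."""
    return (a << b_len) | b

P = 0b1011                          # x^3 + x + 1, irreducible over Z_2
A, B, B_LEN = 0b110101, 0b0111, 4

# Distributivity: f(A + B) = f(A) + f(B), where + over Z_2 is xor.
print(poly_mod(A ^ B, P) == poly_mod(A, P) ^ poly_mod(B, P))

# Concatenation: f(A o B) can be computed from f(A) and B alone.
direct = poly_mod(concat(A, B, B_LEN), P)
incremental = poly_mod(concat(poly_mod(A, P), B, B_LEN), P)
print(direct == incremental)
```

Because only $f(A)$ is needed to extend a fingerprint, a long string can be processed in chunks without keeping the earlier chunks around.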

11
Duplicate Document Detection

Problem
Given large collection of arbitrary documents
Identify near-duplicate documents
Web search engines
Proliferation of near-duplicate documents
Legitimate: mirrors, local copies, updates, ...
Malicious: spam, spider-traps, dynamic URLs, ...
Mistaken: spider errors
30% of web-pages are near-duplicates [Broder et al 1997]
Cost: RAM/disk, search quality, unhappy users
Enterprise search: even larger amount of duplication
SCAM: plagiarism detection [Shivakumar et al 1998]

12
Natural Approaches

Fingerprinting?
only works for exact matches
here must identify even near-duplicates
Random Sampling?
sample substrings (phrases, sentences, etc)
hope similar documents $\Rightarrow$ similar samples
No: even samples of same document will differ
Edit-distance?
metric for approximate string-matching
expensive even for one pair of strings
impossible for $10^{32}$ web documents

13
Desiderata

Storage
only small sketches of each document.
Computation
O(n log n) time on n documents
Stream Processing
once sketch computed, source is unavailable
Error Guarantees
problem scale $\Rightarrow$ small biases have large impact
need formal guarantees; heuristics will not do

14
Basic Idea [Broder 1997]

Shingling
dissect document into q-grams (shingles)
represent documents by shingle-sets
near-duplicates $\Rightarrow$ shingle-sets intersection is large
reduce problem to set intersection
Set Intersection
fingerprints of shingles
min-hash to estimate intersection sizes

15
Shingling

Shingle: $q$ contiguous tokens/words ($q$-gram)
Consider following document
a rose is a rose is a rose
Choose $q = 4$ $\Rightarrow$ get multi-set of shingles
a rose is a
rose is a rose
is a rose is
a rose is a
rose is a rose
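The shingling step above can be sketched in a few lines of Python. This is a minimal illustration that assumes tokens are whitespace-separated words (the lecture does not fix a tokenizer); the function name `shingles` is hypothetical.

```python
from typing import List

def shingles(text: str, q: int) -> List[str]:
    """Return the multi-set (as a list) of q-grams over whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + q]) for i in range(len(tokens) - q + 1)]

for s in shingles("a rose is a rose is a rose", 4):
    print(s)
```

A document with fewer than $q$ tokens yields no shingles under this definition; how to treat such short documents is a design choice left open here.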

16
Documents $\Rightarrow$ Sets of 64-bit fingerprints

Fingerprints?
Use Rabin fingerprints
Fingerprint space $U = \{0, \ldots, N-1\}$
In practice, use 64-bit fingerprints, i.e., $N = 2^{64}$
Result: uniformity in length of strings

17
Similarity of Documents
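[Figure: Doc A and Doc B]

Jaccard measure: similarity of $S_A, S_B \subseteq U = \{0, \ldots, N-1\}$
$sim(S_A, S_B) = |S_A \cap S_B| / |S_A \cup S_B|$
$con(S_A, S_B) = |S_A \cap S_B| / |S_A|$
Claim: A and B are near-duplicates if $sim(S_A, S_B)$ is high
Claim: A is contained in B if $con(S_A, S_B)$ is high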

18
Remarks

Multiplicities of $q$-grams: could retain or ignore
trade-off efficiency with precision
Shingle Size: $q \in [3, 10]$
Short shingles $\Rightarrow$ increase similarity of unrelated documents
With $q = 1$, $sim(S_A, S_B) = 1 \Leftrightarrow$ A is permutation of B
Need larger $q$ to sensitize to permutation changes
Long shingles $\Rightarrow$ small random changes have larger impact
Similarity Measure
Similarity is non-transitive, non-metric
But dissimilarity $1 - sim(S_A, S_B)$ is a metric [Charikar 02]
[Ukkonen 92]: relate $q$-gram and edit-distance
19
Example

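A = a rose is a rose is a rose
B = a rose is a flower which is a rose
Preserving multiplicity
$q = 1$ $\Rightarrow$ $sim(S_A, S_B) = 0.7$
$S_A$ = {a, a, a, is, is, rose, rose, rose}
$S_B$ = {a, a, a, is, is, rose, rose, flower, which}
$q = 2$ $\Rightarrow$ $sim(S_A, S_B) = 0.5$
$q = 3$ $\Rightarrow$ $sim(S_A, S_B) = 0.3$
Disregarding multiplicity
$q = 1$ $\Rightarrow$ $sim(S_A, S_B) = 0.6$
$q = 2$ $\Rightarrow$ $sim(S_A, S_B) = 0.5$
$q = 3$ $\Rightarrow$ $sim(S_A, S_B) = 3/7 \approx 0.4286$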
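The six similarity values in this example can be reproduced with a short Python sketch. It assumes whitespace tokenization and takes the Jaccard measure as intersection size over union size, using multiset intersection and union (minimum and maximum of counts) when multiplicity is preserved. The helper names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

def shingles(text, q):
    """List of q-grams over whitespace-separated tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + q]) for i in range(len(tokens) - q + 1)]

def sim_multiset(a, b):
    """Jaccard similarity preserving multiplicity (both inputs non-empty)."""
    ca, cb = Counter(a), Counter(b)
    return Fraction(sum((ca & cb).values()), sum((ca | cb).values()))

def sim_set(a, b):
    """Jaccard similarity disregarding multiplicity (both inputs non-empty)."""
    sa, sb = set(a), set(b)
    return Fraction(len(sa & sb), len(sa | sb))

A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
for q in (1, 2, 3):
    sa, sb = shingles(A, q), shingles(B, q)
    print(q, sim_multiset(sa, sb), sim_set(sa, sb))
# prints:
# 1 7/10 3/5
# 2 1/2 1/2
# 3 3/10 3/7
```

Exact fractions are used so that the last value shows up as 3/7 rather than a truncated decimal.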

20
Min-Hashing

Consider
$S_A, S_B \subseteq U$
Pick random permutation $\pi$ of $U$
Define $\alpha = \pi^{-1}(\min \pi(S_A))$ and $\beta = \pi^{-1}(\min \pi(S_B))$
Meaning? minimal element under permutation $\pi$
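Lemma: $P[\alpha = \beta] = |S_A \cap S_B| / |S_A \cup S_B| = sim(S_A, S_B)$
Let $d = \min \pi(S_A \cup S_B)$
Claim: $\alpha = \beta \Leftrightarrow \pi^{-1}(d) \in S_A \cap S_B$
Clearly, $\pi^{-1}(d)$ is equally likely to be any element of $S_A \cup S_B$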
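For a small universe the min-hash lemma can be verified exactly by enumerating every permutation. The following Python sketch does this; the universe, the two sets, and the function name `min_hash_agreement` are illustrative assumptions, not taken from the lecture.

```python
from fractions import Fraction
from itertools import permutations

def min_hash_agreement(universe, s_a, s_b):
    """Exact probability, over all permutations of `universe`, that the
    minimal element of s_a equals the minimal element of s_b."""
    agree = total = 0
    for perm in permutations(range(len(universe))):
        pi = dict(zip(universe, perm))           # pi maps element -> position
        alpha = min(s_a, key=pi.__getitem__)     # element of s_a with smallest pi value
        beta = min(s_b, key=pi.__getitem__)      # element of s_b with smallest pi value
        agree += (alpha == beta)
        total += 1
    return Fraction(agree, total)

U = list("abcdef")
S_A, S_B = {"a", "b", "c"}, {"b", "c", "d"}
print(min_hash_agreement(U, S_A, S_B))           # 1/2
print(Fraction(len(S_A & S_B), len(S_A | S_B)))  # 1/2
```

Enumeration is only feasible for tiny universes (here 6! = 720 permutations); it serves as a sanity check of the lemma, not as a practical method.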

21
Min-Hashing

Similarity Sketches
Succinct representation of fingerprint sets SA
Allows efficient estimation of sim(SA,SB)
Basic idea use min-hash of fingerprints
$sk(A)$ = $k$ minimal elements under $\pi(S_A)$
Claim: $E[\, sim(sk(A), sk(B)) \,] = sim(S_A, S_B)$
For each $\alpha \in sk(A) \cup sk(B)$
Observe
sketch-similarity is unbiased estimator of similarity
reducing variance: use larger $k$
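A sketch of this construction in Python follows. It takes the slide's definition literally: a sketch is the set of the $k$ smallest values of $\pi(S)$, and sketches are compared with the same Jaccard measure. The permutation is an explicit shuffle of a small universe, which is only workable for illustration; the sets, sizes, seed, and function names are assumptions.

```python
import random

def bottom_k_sketch(s, rank, k):
    """The k smallest values of pi(S), where rank[x] = pi(x)."""
    return set(sorted(rank[x] for x in s)[:k])

def sketch_similarity(sk_a, sk_b):
    """Jaccard similarity of two sketches."""
    return len(sk_a & sk_b) / len(sk_a | sk_b)

rng = random.Random(0)                   # fixed seed, only for reproducibility
universe = list(range(1000))
order = universe[:]
rng.shuffle(order)
rank = {x: i for i, x in enumerate(order)}    # one random permutation pi of U

S_A = set(range(0, 600))
S_B = set(range(300, 900))               # true similarity is 300 / 900 = 1/3
est = sketch_similarity(bottom_k_sketch(S_A, rank, 100),
                        bottom_k_sketch(S_B, rank, 100))
print(est)
```

Both documents must be sketched with the same permutation, otherwise the minimal elements are not comparable.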

22
Remarks

Implementation
shingle/fingerprint/sketch document in streams
Issue: cost of pairwise comparison of sketches?
cluster sketch-streams [Broder et al, Guha et al]
Open? hashing sketches to identify similarity
[Broder-Mitzenmacher 99]: Min-Hash is only unbiased estimator
[Indyk-Motwani 99]: Locality-Sensitive Hash
collisions more likely for similar items
Min-Hash is special case

23
Multiple Permutations

Better Variance Reduction
Instead of larger $k$, stick with $k = 1$
Multiple, independent permutations
Sketch Construction
Pick $p$ random permutations of $U$: $\pi_1, \pi_2, \ldots, \pi_p$
$sk(A)$ = minimal elements under $\pi_1(S_A), \ldots, \pi_p(S_A)$
Claim: $E[\, sim(sk(A), sk(B)) \,] = sim(S_A, S_B)$
Earlier lemma $\Rightarrow$ true for $p = 1$
Linearity of expectations
Variance reduction: independence of $\pi_1, \ldots, \pi_p$
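The multiple-permutation construction can be sketched in Python as follows. Each permutation contributes one minimal value, and the similarity estimate is read here as the fraction of permutations on which the two documents agree (an interpretation of the sketch similarity for this construction). The permutations are explicit shuffles of a small universe, which is only practical for illustration; the sets, sizes, seed, and function names are assumptions.

```python
import random

def min_hash_signature(s, ranks):
    """One min-hash value per permutation: [min pi_1(S), ..., min pi_p(S)]."""
    return [min(rank[x] for x in s) for rank in ranks]

def estimate_sim(sig_a, sig_b):
    """Fraction of permutations on which the two min-hash values agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

rng = random.Random(1)                   # fixed seed, only for reproducibility
universe = list(range(200))
ranks = []
for _ in range(400):                     # p = 400 independent permutations
    order = universe[:]
    rng.shuffle(order)
    ranks.append({x: i for i, x in enumerate(order)})

S_A, S_B = set(range(0, 120)), set(range(60, 180))   # true similarity 60 / 180 = 1/3
est = estimate_sim(min_hash_signature(S_A, ranks), min_hash_signature(S_B, ranks))
print(est)    # expected to be close to 1/3; the exact value depends on the seed
```

With independent permutations the estimate is an average of $p$ independent indicator variables, so its standard deviation shrinks like $1/\sqrt{p}$.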

24
Min-Wise Indep Permutations

Problem
Truly-random $\pi$ over $U = \{0, \ldots, N-1\}$ is infeasible
But do we really need true randomness?
Solution
Poly-size family of permutations $F \subseteq S_N$ over $U$
Choosing/representing random $\pi \in F$ is easy
Min-Wise Independence (MWI) Property
For all sets $X \subseteq U$, for all $x \in X$, $P_{\pi \in F}[\min \pi(X) = \pi(x)] = 1/|X|$

25
Minimum-Size MWI Families

[Broder et al 98]
Upper/lower bounds of $\mathrm{lcm}(1, 2, \ldots, N)$
Problem: exponential in $N$
Approximate MWI Families
Relax to $P[\min \pi(X) = \pi(x)] = (1 \pm \epsilon) / |X|$
Non-constructive: polynomial-size
Constructive: size $N^{O(\log 1/\epsilon)}$ [Indyk 99]
In practice: 2-universal hashes work well!
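As an illustration of the practical remark about 2-universal hashes, the Python sketch below replaces random permutations with linear hash functions $h(x) = (a x + b) \bmod p$ for a prime $p$. This is a common stand-in and not a min-wise independent family, so the estimate carries no formal guarantee here. The prime, the seed, and all function names are assumptions, and fingerprints are assumed to be integers smaller than the prime.

```python
import random

PRIME = (1 << 61) - 1       # a Mersenne prime; inputs are assumed to be < PRIME

def make_linear_hashes(count, seed=0):
    """Draw `count` pairs (a, b) with a != 0, defining h(x) = (a*x + b) mod PRIME."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(count)]

def signature(s, hashes):
    """Min-hash signature: the minimum hash value of the set under each function."""
    return [min((a * x + b) % PRIME for x in s) for a, b in hashes]

def estimate_sim(sig_a, sig_b):
    """Fraction of hash functions on which the two minima agree."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)

hashes = make_linear_hashes(200)
S_A, S_B = set(range(0, 120)), set(range(60, 180))   # true similarity 1/3
print(estimate_sim(signature(S_A, hashes), signature(S_B, hashes)))
```

Each function is stored as two integers, which is what makes this family easy to choose and represent compared with a truly random permutation of $U$.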

26
References I

Fingerprinting by Random Polynomials. M. Rabin. Technical Report TR-15-81, Harvard University (1981).
Some Applications of Rabin's Fingerprinting Method. A. Broder. Sequences II (1993).
On the Resemblance and Containment of Documents. A. Broder. SEQUENCES (1997).
Syntactic Clustering of the Web. A. Broder, S. Glassman, M. Manasse, and G. Zweig. WWW (1997).
Finding Near-Replicas of Documents on the Web. N. Shivakumar and H. Garcia-Molina. WebDB (1998).
Identifying and Filtering Near-Duplicate Documents. A. Broder. CPM (2000).

27
References II

Approximate String Matching with q-grams and Maximal Matches. E. Ukkonen. Theoretical Computer Science (1992).
Completeness and Robustness Properties of Min-Wise Independent Permutations. A. Broder and M. Mitzenmacher.
Min-Wise Independent Permutations. A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. JCSS (2000).
A Small Approximately Min-Wise Independent Family of Hash Functions. P. Indyk. SODA (1999).
Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. P. Indyk and R. Motwani. STOC (1998).
Similarity Search in High Dimensions via Hashing. A. Gionis, P. Indyk, and R. Motwani. VLDB (1999).
Similarity Estimation Techniques from Rounding Algorithms. M. Charikar. STOC (2002).
