CS 361A (Advanced Data Structures and Algorithms)

1
CS 361A (Advanced Data Structures and Algorithms)
  • Lecture 18
    (Nov 30, 2005)
  • Fingerprints, Min-Hashing, and Document
    Similarity
  • Rajeev Motwani

2
Game Plan for Week
  • Fingerprints
  • Document Similarity
  • Shingling
  • Min-Hashing
  • Min-Wise Independent Permutations

3
Fingerprints
  • $\Omega$: set of large objects (e.g., URLs)
  • Goal:
  • avoid storing large objects explicitly
  • quick-and-dirty equality-testing
  • Fingerprints?
  • Short tags for objects
  • Distinct fingerprints ⇒ distinct objects
  • Distinct objects ⇒ probably distinct fingerprints

4
Formalization
  • Fingerprint length $k$ ⇒ fingerprint space size
    $N = 2^k$
  • Fingerprint function family $F = \{ f : \Omega \to \{0,1\}^k \}$
  • Random $f \in_R F$ ⇒
  • $f(A) \ne f(B) \Rightarrow A \ne B$
  • Collisions: $\Pr[f(A) = f(B) \mid A \ne B] \approx 0$
    (ideally $2^{O(-k)}$)
  • Typical Application:
  • Adversarial object-set $S$ with $|S| = n \ll 2^k$
  • Goal: $|f(S)| = |S|$ with high probability
  • $n^2$ pairwise collisions possible ⇒ need $2^k > n^2$
    (to avoid Birthday Paradox)

5
Example: URL Fingerprints
  • Search Engines:
  • Manage large numbers of URL strings
  • Long, variable strings (embedded
    objects/database-queries)
  • Desiderata:
  • small/fixed-length encodings; hopefully, unique
  • Some scenarios:
  • Exact string irrelevant
  • Only need ability to distinguish distinct URLs
  • Even otherwise, unique IDs useful for indexing
  • Numbers?
  • 4 billion webpages ⇒ $n = 2^{32}$
  • $N = n^2 \Rightarrow k = 64$
  • Fingerprints ⇒ 8-byte representation
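
The arithmetic on this slide can be checked with a short sketch. The function names `min_fingerprint_bits` and `pairwise_collision_bound` are illustrative, not from the lecture, and the bound assumes an idealized fingerprint function whose outputs are uniform and pairwise independent.

```python
from fractions import Fraction


def min_fingerprint_bits(n: int) -> int:
    """Smallest k with 2**k >= n**2 (the slide's choice N = n^2)."""
    return (n * n - 1).bit_length()


def pairwise_collision_bound(n: int, k: int) -> Fraction:
    """Union bound over all pairs: Pr[some collision] <= C(n, 2) / 2**k.

    Assumes an idealized k-bit fingerprint; this is an illustration,
    not the guarantee of any particular scheme.
    """
    pairs = n * (n - 1) // 2
    return Fraction(pairs, 2 ** k)


n = 2 ** 32                      # about 4 billion web pages
k = min_fingerprint_bits(n)      # fingerprint length in bits
print(k, k // 8)                 # bits and bytes per fingerprint
print(float(pairwise_collision_bound(n, k)))
```

With $N = n^2$ exactly, the union bound is just under 1/2, which is why the slide treats $2^k > n^2$ as the threshold rather than a comfortable margin.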

6
Fingerprinting vs Hashing
  • Hashing: $h : \Omega \to \{0,1\}^k$
  • Set Membership testing for set $S$ of size $n$
  • Desire uniform distribution over bin addresses
    $\{0,1\}^k$
  • Minimize collisions per bin to reduce lookup time
  • Minimize hash table size ⇒ $n \approx N = 2^k$
  • Fingerprinting: $f : \Omega \to \{0,1\}^k$
  • Object Equality testing over set $S$ of size $n$
  • Distribution over $\{0,1\}^k$ is irrelevant
  • Avoid collisions altogether
  • Tolerate larger $k$: typically $N > n^2$

7
Fingerprinting Strings
  • Typical application, but techniques extend to
    combinatorial objects (database tuples,
    trees/graphs)
  • Obvious techniques:
  • Checksum: no worst-case collision probability
    guarantees
  • MD5: cryptographically secure string hashes
  • relatively slow
  • avoids leaking information about original string
  • Rabin's Scheme:
  • Algebraic technique: polynomial arithmetic
  • Efficient: needs (1 table lookup + 1 xor + 1
    shift) per byte
  • other nice properties

8
Rabin Fingerprints
  • Consider $m$-bit string $A = a_1 a_2 \cdots a_m$
  • Assume $a_1 = 1$ and fixed-length strings (wlog)
  • Encoding Strings:
  • Polynomials of degree $m-1$ over $Z_2$
  • $A(x) = a_1 x^{m-1} + a_2 x^{m-2} + \cdots + a_{m-1} x + a_m$
  • Fingerprints:
  • $P(x)$: random, irreducible degree-$k$ polynomial over
    $Z_2$
  • (easy to sample such polynomials)
  • irreducible: cannot be factored, e.g., $x^2 + x + 1$;
    in contrast, $x^2 + 1 = (x+1)^2$ factors
  • $f(A) = A(x) \bmod P(x)$
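
A minimal sketch of the definition $f(A) = A(x) \bmod P(x)$, assuming polynomials over $Z_2$ are stored as Python integers (bit $i$ is the coefficient of $x^i$). The helper names are illustrative, and the trial-division irreducibility test is only practical for small $k$; it stands in for the efficient sampling method the slide alludes to.

```python
import random


def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2 (long division with XOR)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def is_irreducible(p: int) -> bool:
    """Trial division by every polynomial of degree 1..deg(p)//2 (small k only)."""
    deg = p.bit_length() - 1
    if deg < 1:
        return False
    return all(poly_mod(p, d) != 0 for d in range(2, 1 << (deg // 2 + 1)))


def random_irreducible(k: int, rng: random.Random) -> int:
    """Rejection-sample a degree-k irreducible polynomial (assumes k >= 2)."""
    while True:
        # Force the leading and constant coefficients to 1.
        candidate = (1 << k) | rng.getrandbits(k) | 1
        if is_irreducible(candidate):
            return candidate


def rabin_fingerprint(bits: str, p: int) -> int:
    """f(A) = A(x) mod P(x); the first bit a_1 is the leading coefficient."""
    return poly_mod(int(bits, 2), p)


p = random_irreducible(8, random.Random(0))
print(bin(p), rabin_fingerprint("1101001110101101", p))
```

For example, with $P(x) = x^2 + x + 1$ the string 1011 encodes $x^3 + x + 1$, which reduces to $x$ because $x^3 \equiv 1 \pmod{P(x)}$.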

9
Analysis
  • Fix $S$: $n$ strings of length $m$
  • Consider $Q_S(x) = \prod_{A \ne B \in S} (A(x) - B(x))$
  • Collision: $f(A) = f(B) \iff A(x) \equiv B(x) \pmod{P(x)} \Rightarrow Q_S(x) \equiv 0 \pmod{P(x)}$
  • Therefore $P(x)$ is a factor of $Q_S(x)$
  • Collision Probability?
  • $\deg(Q_S) \le n^2 m$
  • number of irreducible degree-$k$ factors of $Q_S(x)$
    is $< n^2 m / k$
  • Fact: number of irreducible degree-$k$ polynomials
    $> (2^k - 2^{k/2})/k$
  • $\Pr[\text{random } P(x) \text{ divides } Q_S(x)] < n^2 m / 2^k$
  • $\Pr[\text{fingerprints not distinct}] < n^2 m / 2^k$

10
Beneficial Properties
  • Hardware-level implementation:
  • $Z_2$-polynomials same as strings
  • simple shift-register operations
  • Distributivity: $f(A + B) = f(A) + f(B)$ over $Z_2$
  • Let $\circ$ denote concatenation
  • $f(A \circ B) = f(f(A) \circ B)$
  • $f(A \circ B) = (A(x)\, x^{m} + B(x)) \bmod P(x)$, where $m$ is the length of $B$
  • Fingerprint sliding windows over strings:
    low incremental cost
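
The algebraic properties on this slide can be exercised directly. This is a sketch under assumptions: polynomials over $Z_2$ are Python integers, the modulus `P` is a fixed example polynomial rather than a randomly sampled one, and the names `extend` and `stream` are illustrative.

```python
P = 0b100011011  # x^8 + x^4 + x^3 + x + 1, irreducible over Z_2 (example choice)


def poly_mod(a: int, p: int) -> int:
    """Remainder of a(x) divided by p(x) over Z_2."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def f(bits: str) -> int:
    """Fingerprint of a non-empty bit string."""
    return poly_mod(int(bits, 2), P)


def extend(fp_a: int, b_bits: str) -> int:
    """Fingerprint of A followed by B, given only f(A) and the bits of B."""
    return poly_mod((fp_a << len(b_bits)) ^ int(b_bits, 2), P)


def stream(bits: str) -> int:
    """Fingerprint computed incrementally, one bit at a time."""
    fp = 0
    for bit in bits:
        fp = extend(fp, bit)
    return fp


A = "110100111010110"
B = "011011100101"
print(f(A + B) == extend(f(A), B))   # concatenation property
print(f(A + B) == stream(A + B))     # incremental computation agrees
```

Because `extend` needs only the previous fingerprint, a stream can be fingerprinted without keeping the text seen so far; a production version would process a byte at a time with a lookup table, as the earlier slide notes.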

11
Duplicate Document Detection
  • Problem:
  • Given large collection of arbitrary documents
  • Identify near-duplicate documents
  • Web search engines:
  • Proliferation of near-duplicate documents
  • Legitimate: mirrors, local copies, updates, …
  • Malicious: spam, spider-traps, dynamic URLs, …
  • Mistaken: spider errors
  • 30% of web pages are near-duplicates [Broder et
    al 1997]
  • Cost: RAM/disk, search quality, unhappy users
  • Enterprise search: even larger amount of
    duplication
  • SCAM: plagiarism detection [Shivakumar et al
    1998]

12
Natural Approaches
  • Fingerprinting?
  • only works for exact matches
  • here must identify even near-duplicates
  • Random Sampling?
  • sample substrings (phrases, sentences, etc.)
  • hope similar documents ⇒ similar samples
  • No: even samples of same document will differ
  • Edit-distance?
  • metric for approximate string-matching
  • expensive even for one pair of strings
  • impossible for $10^{32}$ web documents

13
Desiderata
  • Storage:
  • only small sketches of each document
  • Computation:
  • $O(n \log n)$ time on $n$ documents
  • Stream Processing:
  • once sketch computed, source is unavailable
  • Error Guarantees:
  • problem scale ⇒ small biases have large impact
  • need formal guarantees; heuristics will not do

14
Basic Idea [Broder 1997]
  • Shingling:
  • dissect document into q-grams (shingles)
  • represent documents by shingle-sets
  • near-duplicates ⇒ shingle-sets intersection is
    large
  • reduce problem to set intersection
  • Set Intersection:
  • fingerprints of shingles
  • min-hash to estimate intersection sizes

15
Shingling
  • Shingle: $q$ contiguous tokens/words (q-gram)
  • Consider following document:
  • a rose is a rose is a rose
  • Choose $q = 4$ ⇒ get multi-set of shingles:
  • a rose is a
  • rose is a rose
  • is a rose is
  • a rose is a
  • rose is a rose
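
The shingle multi-set above can be reproduced with a few lines. This is a sketch that assumes whitespace tokenization; the function name `shingles` is illustrative.

```python
from collections import Counter


def shingles(text: str, q: int) -> Counter:
    """Multi-set of q-grams over whitespace-separated tokens."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + q]) for i in range(len(tokens) - q + 1))


doc = "a rose is a rose is a rose"
for shingle, count in shingles(doc, 4).items():
    print(" ".join(shingle), count)
```

The eight tokens yield five 4-grams, of which only three are distinct, so ignoring multiplicity shrinks the representation at some cost in precision.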

16
Documents ⇒ Sets of 64-bit fingerprints
  • Fingerprints?
  • Use Rabin fingerprints
  • Fingerprint space $U = \{0, \ldots, N-1\}$
  • In practice, use 64-bit fingerprints, i.e.,
    $N = 2^{64}$
  • Result: uniformity in length of strings

17
Similarity of Documents
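[Figure: Doc A and Doc B]
  • Jaccard measure: similarity of $S_A, S_B \subseteq U = \{0, \ldots, N-1\}$,
    $\mathrm{sim}(S_A,S_B) = |S_A \cap S_B| / |S_A \cup S_B|$
  • Containment: $\mathrm{con}(S_A,S_B) = |S_A \cap S_B| / |S_A|$
  • Claim: A and B are near-duplicates if $\mathrm{sim}(S_A,S_B)$
    is high
  • Claim: A is contained in B if $\mathrm{con}(S_A,S_B)$ is high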

18
Remarks
  • Multiplicities of q-grams: could retain or
    ignore
  • trade-off efficiency with precision
  • Shingle Size: $q \in [3, 10]$
  • Short shingles ⇒ increase similarity of unrelated
    documents
  • With $q = 1$, $\mathrm{sim}(S_A,S_B) = 1 \iff$ A is a permutation of B
  • Need larger $q$ to sensitize to permutation changes
  • Long shingles ⇒ small random changes have larger
    impact
  • Similarity Measure:
  • Similarity is non-transitive, non-metric
  • But dissimilarity $1 - \mathrm{sim}(S_A,S_B)$ is a metric
    [Charikar 02]
  • [Ukkonen 92] relates q-grams and edit-distance

19
Example
  • A = a rose is a rose is a rose
  • B = a rose is a flower which is a rose
  • Preserving multiplicity:
  • $q = 1$ ⇒ $\mathrm{sim}(S_A,S_B) = 0.7$
  • $S_A$ = {a, a, a, is, is, rose, rose, rose}
  • $S_B$ = {a, a, a, is, is, rose, rose, flower, which}
  • $q = 2$ ⇒ $\mathrm{sim}(S_A,S_B) = 0.5$
  • $q = 3$ ⇒ $\mathrm{sim}(S_A,S_B) = 0.3$
  • Disregarding multiplicity:
  • $q = 1$ ⇒ $\mathrm{sim}(S_A,S_B) = 0.6$
  • $q = 2$ ⇒ $\mathrm{sim}(S_A,S_B) = 0.5$
  • $q = 3$ ⇒ $\mathrm{sim}(S_A,S_B) = 0.4285$
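
The six similarity values in this example can be recomputed as follows. The sketch assumes whitespace tokenization, takes multi-set intersection and union as element-wise minimum and maximum of counts, and assumes non-empty shingle sets; all function names are illustrative.

```python
from collections import Counter
from fractions import Fraction


def shingles(text: str, q: int) -> Counter:
    """Multi-set of q-grams over whitespace-separated tokens."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + q]) for i in range(len(tokens) - q + 1))


def jaccard_multiset(a: Counter, b: Counter) -> Fraction:
    """Jaccard similarity preserving multiplicity (min/max of counts)."""
    return Fraction(sum((a & b).values()), sum((a | b).values()))


def jaccard_set(a: Counter, b: Counter) -> Fraction:
    """Jaccard similarity disregarding multiplicity."""
    sa, sb = set(a), set(b)
    return Fraction(len(sa & sb), len(sa | sb))


A = "a rose is a rose is a rose"
B = "a rose is a flower which is a rose"
for q in (1, 2, 3):
    sa, sb = shingles(A, q), shingles(B, q)
    print(q, float(jaccard_multiset(sa, sb)), float(jaccard_set(sa, sb)))
```

The value 0.4285 on the slide is the truncated decimal of 3/7.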

20
Min-Hashing
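  • Consider:
  • $S_A, S_B \subseteq U$
  • Pick random permutation $\pi$ of $U$
  • Define $\alpha = \pi^{-1}(\min \pi(S_A))$ and $\beta = \pi^{-1}(\min \pi(S_B))$
  • Meaning? minimal element under permutation $\pi$
  • Lemma: $\Pr[\alpha = \beta] = \mathrm{sim}(S_A,S_B)$
  • Let $\delta = \min \pi(S_A \cup S_B)$
  • Claim: $\alpha = \beta \iff \pi^{-1}(\delta) \in S_A \cap S_B$
  • Clearly, $\pi^{-1}(\delta)$ is uniformly distributed over $S_A \cup S_B$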
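
The min-hash property, that the minimal elements of two sets coincide with probability equal to their Jaccard similarity, can be verified exactly on a tiny universe by enumerating every permutation. This is a brute-force illustration only (it visits all $|U|!$ permutations), and the function name is an assumption.

```python
from fractions import Fraction
from itertools import permutations


def minhash_agreement(sa: set, sb: set, universe: list) -> Fraction:
    """Fraction of all permutations of the universe under which both sets
    have the same minimal element."""
    agree = total = 0
    for image in permutations(range(len(universe))):
        pi = dict(zip(universe, image))   # pi[x] = rank of x under this permutation
        alpha = min(sa, key=pi.get)       # minimal element of sa under pi
        beta = min(sb, key=pi.get)        # minimal element of sb under pi
        agree += alpha == beta
        total += 1
    return Fraction(agree, total)


U = list(range(6))
SA, SB = {0, 1, 2, 3}, {2, 3, 4}
print(minhash_agreement(SA, SB, U), Fraction(len(SA & SB), len(SA | SB)))
```

Both printed fractions are 2/5: the element that is smallest over $S_A \cup S_B$ is equally likely to be any of the five, and the minima agree exactly when it falls in the two-element intersection.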

21
Min-Hashing
  • Similarity Sketches:
  • Succinct representation of fingerprint sets $S_A$
  • Allows efficient estimation of $\mathrm{sim}(S_A,S_B)$
  • Basic idea: use min-hash of fingerprints
  • $sk(A)$ = $k$ minimal elements under $\pi(S_A)$
  • Claim: $E[\mathrm{sim}(sk(A), sk(B))] = \mathrm{sim}(S_A,S_B)$
  • For each $\alpha \in sk(A) \cup sk(B)$
  • Observe:
  • sketch-similarity is unbiased estimator of
    similarity
  • reducing variance: use larger $k$
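
A sketch of $k$-minimum sketches on a small universe follows. Two assumptions are worth flagging: the permutation is an explicit shuffled table (feasible only for a toy universe), and the estimator is the one described in Broder's 1997 paper, which compares the sketches only on the $k$ smallest values of their union; the slide's shorthand $\mathrm{sim}(sk(A), sk(B))$ is assumed to refer to it. All names are illustrative, and input sets are assumed non-empty.

```python
import random
from fractions import Fraction


def bottom_k_sketch(s: set, pi: dict, k: int) -> set:
    """sk(A): the k smallest images of the set under the permutation pi."""
    return set(sorted(pi[x] for x in s)[:k])


def sketch_similarity(sk_a: set, sk_b: set, k: int) -> Fraction:
    """Fraction of the k smallest values in the union of the two sketches
    that appear in both sketches."""
    smallest = set(sorted(sk_a | sk_b)[:k])
    return Fraction(len(smallest & sk_a & sk_b), len(smallest))


universe = list(range(100))
image = universe[:]
random.Random(1).shuffle(image)
pi = dict(zip(universe, image))           # one random permutation of U

A, B = set(range(0, 60)), set(range(30, 90))
k = 20
estimate = sketch_similarity(bottom_k_sketch(A, pi, k), bottom_k_sketch(B, pi, k), k)
print(float(estimate), len(A & B) / len(A | B))
```

When $k$ is at least $|S_A \cup S_B|$ the sketches are the full sets and the estimate equals the exact similarity; smaller $k$ trades accuracy for space.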

22
Remarks
  • Implementation:
  • shingle/fingerprint/sketch document in streams
  • Issue: cost of pairwise comparison of sketches?
  • cluster sketch-streams [Broder et al, Guha et al]
  • Open? hashing sketches to identify similarity
  • [Broder-Mitzenmacher 99]: Min-Hash is only
    unbiased estimator
  • [Indyk-Motwani 99]: Locality-Sensitive Hash
  • collisions more likely for similar items
  • Min-Hash is special case

23
Multiple Permutations
  • Better Variance Reduction:
  • Instead of larger $k$, stick with $k = 1$
  • Multiple, independent permutations
  • Sketch Construction:
  • Pick $p$ random permutations of $U$: $\pi_1, \pi_2, \ldots, \pi_p$
  • $sk(A)$ = minimal elements under $\pi_1(S_A), \ldots, \pi_p(S_A)$
  • Claim: $E[\mathrm{sim}(sk(A),sk(B))] = \mathrm{sim}(S_A,S_B)$
  • Earlier lemma ⇒ true for $p = 1$
  • Linearity of expectations
  • Variance reduction: independence of $\pi_1, \ldots, \pi_p$
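
The construction with $p$ independent permutations can be sketched as below. It assumes a toy universe small enough to store each permutation explicitly, and it takes sketch similarity to be the fraction of coordinates on which two signatures agree; the function names and the seed are illustrative.

```python
import random


def random_permutations(universe: list, p: int, seed: int = 0) -> list:
    """p independent random permutations of the universe, stored as dicts."""
    rng = random.Random(seed)
    perms = []
    for _ in range(p):
        image = list(universe)
        rng.shuffle(image)
        perms.append(dict(zip(universe, image)))
    return perms


def minhash_signature(s: set, perms: list) -> list:
    """sk(A): the minimum image of the set under each permutation."""
    return [min(pi[x] for x in s) for pi in perms]


def estimate_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of permutations on which the two minima agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


U = list(range(50))
perms = random_permutations(U, p=1000)
A, B = set(range(0, 30)), set(range(10, 40))
est = estimate_similarity(minhash_signature(A, perms), minhash_signature(B, perms))
print(est, len(A & B) / len(A | B))
```

Each coordinate is an independent indicator with mean $\mathrm{sim}(S_A,S_B)$, so the variance of the average falls as $1/p$.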

24
Min-Wise Independent Permutations
  • Problem:
  • Truly random $\pi$ over $U = \{0, \ldots, N-1\}$ is infeasible
  • But do we really need true randomness?
  • Solution:
  • Poly-size family of permutations $F \subseteq S_N$ over $U$
  • Choosing/representing random $\pi \in F$ is easy
  • Min-Wise Independence (MWI) Property:
  • For all sets $X \subseteq U$, for all $x \in X$,
    $\Pr_{\pi \in F}[\min \pi(X) = \pi(x)] = 1/|X|$

25
Minimum-Size MWI Families
  • [Broder et al 98]
  • Upper/lower bounds of lcm(1, 2, …, n)
  • Problem: exponential in $N$
  • Approximate MWI Families:
  • Relax to $|\Pr[\min \pi(X) = \pi(x)] - 1/|X|| \le \epsilon / |X|$
  • Non-constructive: polynomial-size
  • Constructive: size $N^{O(\log 1/\epsilon)}$ [Indyk 99]
  • In practice: 2-universal hashes work well!
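
As an illustration of the last point, random linear maps $x \mapsto (ax + b) \bmod q$ can stand in for random permutations. This sketch assumes set elements are integers smaller than the prime `PRIME`; the choice of prime, the seeds, and the function names are assumptions, and such maps are not exactly min-wise independent.

```python
import random

PRIME = (1 << 61) - 1  # Mersenne prime 2^61 - 1; elements are assumed smaller than this


def linear_hashes(count: int, seed: int = 0) -> list:
    """Random (a, b) pairs defining x -> (a*x + b) mod PRIME, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(count)]


def signature(s: set, hashes: list) -> list:
    """Minimum hash value of the set under each linear map."""
    return [min((a * x + b) % PRIME for x in s) for a, b in hashes]


def estimate_similarity(sig_a: list, sig_b: list) -> float:
    """Fraction of hash functions on which the two minima agree."""
    return sum(u == v for u, v in zip(sig_a, sig_b)) / len(sig_a)


pool = random.Random(42).sample(range(10 ** 9), 40)
A, B = set(pool[:30]), set(pool[10:])     # 20 shared elements, 40 in the union
hashes = linear_hashes(1000)
est = estimate_similarity(signature(A, hashes), signature(B, hashes))
print(est, len(A & B) / len(A | B))
```

Each map needs only two integers of storage, compared with a full table for an explicit permutation of $U$.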

26
References I
  • Fingerprinting by random polynomials. M. Rabin.
    Technical Report TR-15-81, Harvard University
    (1981).
  • Some applications of Rabin's fingerprinting
    method. A. Broder. Sequences II (1993).
  • On the Resemblance and Containment of Documents,
    A. Broder. SEQUENCES 1997.
  • Syntactic Clustering of the Web, A. Broder, S.
    Glassman, M. Manasse, and G. Zweig, WWW 1997.
  • Finding near-replicas of documents on the web. N.
    Shivakumar and H. Garcia-Molina. WebDB 1998.
  • Identifying and Filtering Near-Duplicate
    Documents, Andrei Broder. CPM 2000.

27
References II
  • Approximate String Matching with q-grams and
    Maximal Matches. E. Ukkonen. Theoretical Computer
    Science (1992).
  • Completeness and Robustness Properties of
    Min-Wise Independent Permutations. A. Broder and
    M. Mitzenmacher.
  • Min-Wise Independent Permutations, A. Broder, M.
    Charikar, A. Frieze and M. Mitzenmacher, JCSS
    (2000).
  • A Small Approximately min-wise Independent Family
    of Hash Functions. P. Indyk. SODA 1999.
  • Approximate Nearest Neighbors: Towards Removing
    the Curse of Dimensionality, P. Indyk and R.
    Motwani. STOC 1998.
  • Similarity Search in High Dimensions via Hashing,
    A. Gionis, P. Indyk, and R. Motwani. VLDB 1999.
  • Similarity Estimation Techniques from Rounding
    Algorithms, M. Charikar, STOC 2002.