Improving minhashing: De Bruijn sequences and primitive roots for counting trailing zeroes
1
Improving minhashing: De Bruijn sequences and
primitive roots for counting trailing zeroes
Why things you didn't think you cared about are
actually practical means of bit-twiddling
  • Mark Manasse
  • Frank McSherry
  • Kunal Talwar
  • Microsoft Research

2
Minhashing
  • This all comes out of a perennial quest to make
    minhashing faster
  • Minhashing is a technique for sampling an element
    from a stream which is
  • Uniformly random (every element equally likely)
  • Consistent (similar stream → similar sample)
  • P(S(A) = S(B)) = sim(A,B)

3
Locality sensitive hashing
  • View documents, photos, music, etc. as a set of
    features
  • View feature set as high-dimensional vector
  • Find closely-matching vectors
  • Most of the time
  • Proportionally close
  • In L2, this leads us to cosine similarity (Indyk,
    Motwani)
  • A hash function whose bits match proportionally to
    the cosine of the angle between two vectors
  • Allows off-line computation of hash, and faster
    comparison of hashes

4
Working in L1: Jaccard similarity
  • Given two sets, define sim(A,B) to equal the
    cardinality of the intersection of the sets
    divided by the cardinality of the union
  • Proven useful for near-duplicate detection, when
    applied to the set of phrases in a web page

[Venn diagram: sets A and B with intersection A ∩ B]
5
Basic idea, and old speedup
  • Pick 100 such samples, by picking 100 random
    one-to-one functions mapping to a well-ordered
    range
  • For each function, select the pre-image of the
    smallest image under that function
  • Naively, takes 100 such evaluations per input
    item (one for each function)
  • Improve by a factor of almost 8, by
  • Choosing a 64-bit function
  • Taking the lead 8 bits of each of 8 images,
    generated by carving the 64 bits into 8 bytes
  • Compute more bits, but only when needed

6
Why this approximates Jaccard
  • Assume a uniformly chosen random function,
    mapping injectively to infinite binary sequences.
  • Order sequences lexicographically
  • Given sets A and B, and a random function f,
    argmin(f(A ∪ B)) is certain to be an element of A
    or B, and is in the intersection with probability
    exactly the Jaccard coefficient.
  • Sampling requires uniformity and consistency
  • Uniformity, so that probability mass is spread
  • Consistency, so that small perturbations don't matter

7
New idea to speed up and reduce collisions
  • Carve the 64 bits into an expected 32 samples, by
    dividing at 1's, until 100 samples are filled
  • 32, because we expect half of the bits to be 1
  • Better yet, the number of maximal-length
    sequences is bounded by 2, independent of the length
    of the input
  • But, how do we efficiently divide at 1's?

8
Dividing at 1's
  • Could go bit at a time, shifting and testing
  • Lots of missed branch predictions, lots of tests
  • Could look at the low-order byte: if zero, shift by
    8; if not, do a table look-up
  • Almost certainly good enough, in practice
  • But we can do mathematically better.

9
Dividing at 1's smoothly
  • Reduce to a simpler problem: taking the logarithm
    base 2 of a power of 2
  • Given x in two's-complement, x & -x is the smallest
    power of 2 in the binary expansion of x
  • Works because -x = ~x + 1
  • ~x is all 1's below the smallest power of 2 in x
  • x & (x-1) removes the least power of 2 (not
    useful here)
  • x ^ -x is all ones above the least power of 2
  • x ^ (x-1) is all ones at and below the least
    power of 2
  • So all 64-bit numbers can be reduced to only 65
    possibilities, depending on the least power of 2

10
How do we figure out which one?
  • Naïve: binary search in a sorted table
  • Better: perfect hashing
  • Using x & -x, all 65 possible values are powers of
    2, or 0
  • 2 is a primitive root modulo 67 (kudos
    to Peter Montgomery for noting this)
  • So, the powers of 2 generate the multiplicative
    group {1, 2, …, 66} modulo 67
  • That is, the first 66 powers of 2 are distinct mod 67
  • So, take (x & -x) mod 67, and look in a table

11
Perfect, maybe. But optimal?
  • Leiserson, Prokop, and Randall noticed that De
    Bruijn sequences are even better
  • Like Gray codes, only folded
  • De Bruijn sequences are vaguely like Gray codes
  • Hamiltonian circuit of hypercube is Gray code
  • Hamiltonian circuit of a De Bruijn graph is a De Bruijn sequence
  • De Bruijn sequences allow candy necklaces where
    any sequence of k candies occurs at exactly one
    starting point in clockwise order
  • Always exist (even generalized to higher
    dimensions, but we don't need that)
  • (00011101) is such a sequence for 3-bit binary

12
More de Bruijn
  • Any rotation of De Bruijn is De Bruijn
  • Reversal of De Bruijn is De Bruijn
  • Canonicalize sequences by rotating to the least
  • Three canonical sequences for binary sequences of
    length 6; one is its own reversal (6 is even)
  • Starting with 6 zeroes, the first five bits
    needed in rotation are zero, so shift is good
    enough
  • Just look at the high-order 6 bits after multiplying
    by the constant DB = 0x0218a392cd3d5dbfUL
  • Doesn't handle 0, just powers of 2

13
Few branch misses
#define DB 0x0218a392cd3d5dbfUL
#define NRES 100
unsigned short dblookup[64];      // initialized so that dblookup[(DB << i) >> 58] = i
unsigned result[NRES + 64];       // answers plus spill-over space
unsigned n = 0, rb = 0, elog = 0; // quantity produced, remaining bits of randomness, left-over zeroes
unsigned long long cur;
while (n < NRES) {
  cur = newrandom(key);
  elog += rb;
  rb = 64;
  while (cur != 0) {
    unsigned short log = dblookup[((cur & (1 + ~cur)) * DB) >> 58];
    cur >>= log + 1;
    rb -= log + 1;
    result[n++] = log + elog;
    elog = 0;
  }
}

14
Selecting randomly and repeatably from weighted
distributions
  • Mark Manasse
  • Frank McSherry
  • Kunal Talwar
  • Microsoft Research

15
Jaccard, extended to multisets
  • Jaccard, as defined, doesn't work when the number
    of copies of an element is a distinguishing
    factor
  • If we generalize a little, we get sum of lesser
    number of occurrences divided by sum of greater
  • Still works even for non-integral counts
  • Allows for weighting of elements by importance
  • Same as before, for integer counts, if we replace
    items with (item, instance) pairs
  • ⟨cat, cat, dog⟩ → {(cat,1), (cat,2), (dog,1)}
  • Sample is an (item, instance) pair, not just an item

16
Sampling, instead of pairwise computation, for
sets
  • To allow for faster computation, we estimate
    similarity by sampling
  • Pick some number of samples, where for any sets A
    and B, each sample agrees with probability equal
    to sim(A,B)
  • Count the average number of matching samples
  • To get a good sample, pick a random one-to-one
    mapping of set elements to a well-ordering, and
    pick preimage of the (unique!) smallest.

17
Multiset Jaccard, one implementation
  • Given a good way to approximate Jaccard for sets,
    we can convert a multiset (but not a
    distribution) into a set by replacing 100
    occurrences of cat by (cat,1), (cat,2), …,
    (cat,100).
  • Requires (if streaming) remembering how many
    prior occurrences of elements have been seen.

18
Multiset reduction considered inefficient
  • Previous technique is linear in input size, if
    input is cat, cat, cat, dog
  • Exponential in input size if input is (cat, 100)
  • Probability to the rescue!
  • If our random map is to a real number between 0
    and 1, we don't need to generate 100 random
    values to find the smallest
  • P(X > x) = 1 - x
  • P(min of k draws > x) = (1 - x)^k

19
Not so fast!
  • That's the probability, but not one that lets us
    pick samples to test for agreement
  • Not good enough to pick an element; we have to pick
    an instance of the element
  • (cat, 100) and (cat, 200) are only .5 similar
  • Has to be repeatable
  • If (cat,7) is chosen from (cat, 100), we mustn't choose
    (cat,73) from (cat, 200) (but (cat,104) would be OK)

20
Properties for repeatable sampling
  • A sampling process must pick an element of the
    input
  • For discrete things, an integer occurrence at
    most equal to the number of occurrences
  • For non-negative real valued things, a real value
    at most equal to the input
  • Must pick uniformly by weight
  • Must pick same sample from any subset containing
    the sample

21
Uniformity
  • To be uniform in a column, we have to pick a
    number smaller than a given number
  • Variant of reservoir sampling suffices
  • Given n, pick a random number below n uniformly
  • Given that number, pick a random number below
    that, and repeat
  • To make this repeatable in expected constant time,
    break into powers of 2
  • Given n, round up to the next higher power of 2, and
    repeat the downward process until below half

[Figure: a vertical scale from 0.5 to 8.0 with selections
S(8,1) = 7.632, S(8,2) = 4.918, n = 3.724, S(4,1) = 3.054,
S(1,1) = 0.783. For this choice of n, S(4,1) is the downward
selection; for slightly smaller n, S(1,1) would be selected.]
22
Scaling up
  • Same process, but we have to first round up, by
    finding smallest chosen number above n
  • First check the power-of-2 range containing n
  • The next level up contains something if the first
    selected number < 2^(k+1) is > 2^k
  • If this happens, take the smallest number in range
  • Otherwise, repeat at the next power of 2 up

[Figure: the same scale from 0.5 to 8.0 with S(8,1) = 7.632,
S(8,2) = 4.918, n = 3.724, S(4,1) = 3.054, S(1,1) = 0.783.
For this choice of n, S(8,2) is the upward selection.]
23
Picking a column
  • Given a column scaled up to n, we need to construct
    the right distribution for the second-smallest number
    (assuming the nth is the smallest)
  • In the discrete case (if we consider only integers
    as valid choices), PDF(2nd smallest = x) is
    proportional to (1-x)·x^(n-1), so CDF =
    (n+1)·x^n - n·x^(n+1) = x^n·(1 - n(x-1))
  • In the continuous case (which we can get by
    scaling the discrete case), CDF = x^n·(1 - n·ln x)
  • Pick a random luckiness factor for a column, p, and
    solve for x in CDF(x) = p by iteration
  • Pick the column with the smallest x value

24
Reducing randomness and improving numerical
accuracy
  • We can just use one bit to decide if a power of 2
    range has any selected values
  • So use a single random value to decide which of
    64 powers of 2 are useful
  • Either by computing 64 samples in parallel or
  • Computing 64 intervals at once
  • Use logarithms of the CDF rather than the CDF to keep
    things numerically reasonable; look at 1 - x instead of
    x to keep the log away from 0
  • Partially evaluate convergence to save time, and
    compare preimages to CDF when possible