Title: Improving minhashing: De Bruijn sequences and primitive roots for counting trailing zeroes
1. Improving minhashing: De Bruijn sequences and primitive roots for counting trailing zeroes
Why things you didn't think you cared about are actually practical means of bit-twiddling
- Mark Manasse
- Frank McSherry
- Kunal Talwar
- Microsoft Research
2. Minhashing
- This all comes out of a perennial quest to make minhashing faster
- Minhashing is a technique for sampling an element from a stream which is
  - Uniformly random (every element equally likely)
  - Consistent (similar stream → similar sample)
- P(S(A) = S(B)) = sim(A, B)
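A minimal sketch of one such sample, assuming splitmix64 as a stand-in for the random one-to-one function (the talk does not specify a particular hash):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in "random" injection: splitmix64, seeded per sample. */
static uint64_t splitmix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

/* Return the pre-image of the smallest image: the sample S(set). */
static uint64_t minhash_sample(const uint64_t *set, size_t n, uint64_t seed) {
    uint64_t best = set[0], best_img = splitmix64(set[0] ^ seed);
    for (size_t i = 1; i < n; i++) {
        uint64_t img = splitmix64(set[i] ^ seed);
        if (img < best_img) { best_img = img; best = set[i]; }
    }
    return best;
}
```

Consistency falls out of the argmin: the sample of a set changes only if the minimizing element itself is added or removed.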
3. Locality sensitive hashing
- View documents, photos, music, etc. as a set of features
- View the feature set as a high-dimensional vector
- Find closely-matching vectors
  - Most of the time
  - Proportionally close
- In L2, this leads us to cosine similarity (Indyk, Motwani)
  - A hash function whose bits match proportionally to the cosine of the angle between two vectors
  - Allows off-line computation of hashes, and faster comparison of hashes
4. Working in L1: Jaccard similarity
- Given two sets, define sim(A, B) to equal the cardinality of the intersection of the sets divided by the cardinality of the union: sim(A, B) = |A ∩ B| / |A ∪ B|
- Proven useful, when applied to the set of phrases in a web page, when testing for near duplicates
[Venn diagram: sets A and B, overlapping in A ∩ B]
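The definition above is easy to compute exactly with one merge pass over sorted sets; a small sketch:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Jaccard similarity of two sorted, duplicate-free integer arrays:
   |A ∩ B| / |A ∪ B|, computed by a single merge pass. */
static double jaccard(const int *a, size_t na, const int *b, size_t nb) {
    size_t i = 0, j = 0, both = 0;
    while (i < na && j < nb) {
        if (a[i] < b[j]) i++;
        else if (a[i] > b[j]) j++;
        else { both++; i++; j++; }
    }
    /* |A ∪ B| = |A| + |B| − |A ∩ B| */
    return (double)both / (double)(na + nb - both);
}
```

This exact pairwise computation is what the sampling scheme on the following slides avoids.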
5. Basic idea, and old speedup
- Pick 100 such samples, by picking 100 random one-to-one functions mapping to a well-ordered range
- For each function, select the pre-image of the smallest image under that function
- Naively, takes 100 such evaluations per input item (one for each function)
- Improve by a factor of almost 8, by
  - Choosing a 64-bit function
  - Taking the leading 8 bits of each of the 8 images generated by carving the 64 bits into 8 bytes
  - Computing more bits, but only when needed
6. Why this approximates Jaccard
- Assume a uniformly chosen random function, mapping injectively to infinite binary sequences
- Order sequences lexicographically
- Given sets A and B, and a random function f, argmin(f(A ∪ B)) is certain to be an element of A or B, and is in the intersection with probability exactly the Jaccard coefficient
- Sampling requires uniformity and consistency
  - Uniformity, so that probability mass is spread evenly
  - Consistency, so that small perturbations don't matter
7. New idea to speed up and reduce collisions
- Carve each 64-bit word into an expected 32 samples (toward the 100 needed), by dividing at 1s: each sample is a run of 0s terminated by a 1
- 32, because we expect half of the bits to be 1
- Better yet, the expected number of maximal-length sequences is bounded by 2, independent of the length of the input
- But, how do we efficiently divide at 1s?
8. Dividing at 1s
- Could go bit at a time, shifting and testing
  - Lots of missed branch predictions, lots of tests
- Could look at the low-order byte: if zero, shift by 8; if not, do a table look-up
  - Almost certainly good enough, in practice
- But we can do mathematically better
9. Dividing at 1s smoothly
- Reduce to a simpler problem: taking the logarithm base 2 of a power of 2
- Given x in two's complement, x & -x is the smallest power of 2 in the binary expansion of x
- Works because -x = ~x + 1
  - ~x ends in 01..1: a 0 at the smallest power of 2 in x, all 1s below it
  - x & (x - 1) removes the least power of 2 (not useful here)
  - x | -x is all ones at and above the least power of 2
  - x ^ (x - 1) is all ones at and below the least power of 2
- So all 64-bit numbers can be reduced to only 65 possibilities, depending on the least power of 2
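The identities above can be checked directly on a 64-bit word; a small sketch (writing -x as ~x + 1 to make the two's-complement identity explicit):

```c
#include <assert.h>
#include <stdint.h>

static uint64_t least_bit(uint64_t x)    { return x & (~x + 1); } /* x & -x          */
static uint64_t clear_least(uint64_t x)  { return x & (x - 1);  } /* drop lowest bit */
static uint64_t at_and_above(uint64_t x) { return x | (~x + 1); } /* x | -x          */
static uint64_t at_and_below(uint64_t x) { return x ^ (x - 1);  }
```

For x = 44 (binary 101100), the least power of 2 is 4, so the four expressions give 000100, 101000, 1…111100, and 000111.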
10. How do we figure out which one?
- Naïve: binary search in a sorted table
- Better: perfect hashing
- Using x & -x, all 65 possible values are powers of 2, or 0
- 2 is a primitive root modulo 67 (kudos to Peter Montgomery for noting this)
- So, the powers of 2 generate the multiplicative group {1, 2, …, 66} modulo 67
  - That is, the first 66 powers of 2 are distinct mod 67
- So, take (x & -x) % 67, and look in a 67-entry table
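A sketch of the mod-67 perfect hash, with the table filled from the first 64 powers of 2:

```c
#include <assert.h>
#include <stdint.h>

/* Since 2 is a primitive root mod 67, the values 2^0 .. 2^63 are
   distinct mod 67, and none is 0 mod 67, so index 0 is free to
   mean "x was zero". */
static int log2_mod67[67];

static void init_log2_mod67(void) {
    log2_mod67[0] = -1;                 /* x == 0: no set bit */
    uint64_t p = 1;
    for (int i = 0; i < 64; i++) {
        log2_mod67[p % 67] = i;
        p <<= 1;
    }
}

/* Number of trailing zeroes of x (or -1 if x == 0). */
static int trailing_zeroes_mod67(uint64_t x) {
    return log2_mod67[(x & (~x + 1)) % 67];
}
```

The cost of the integer remainder is what the De Bruijn approach on the next slide improves on.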
11. Perfect, maybe. But optimal?
- Leiserson, Prokop, and Randall noticed that De Bruijn sequences are even better
- De Bruijn sequences are vaguely like Gray codes, only folded
  - A Hamiltonian circuit of the hypercube is a Gray code
  - A Hamiltonian circuit of the De Bruijn graph is a De Bruijn sequence
- De Bruijn sequences allow candy necklaces where any sequence of k candies occurs at exactly one starting point in clockwise order
- They always exist (even generalized to higher dimensions, but we don't need that)
- (00011101) is such a sequence for 3-bit binary
12. More de Bruijn
- Any rotation of a De Bruijn sequence is De Bruijn
- The reversal of a De Bruijn sequence is De Bruijn
- Canonicalize sequences by rotating to the least
- Three canonical sequences for binary sequences of length 6; one is its own reversal (6 is even)
- Starting with 6 zeroes, the first five bits needed in rotation are zero, so shifting (rather than rotating) is good enough
- Just look at the high-order 6 bits after multiplying by the constant DB = 0x0218a392cd3d5dbfUL
- Doesn't handle 0, just powers of 2
13. Few branch misses
    #define DB 0x0218a392cd3d5dbfUL
    #define NRES 100

    unsigned short dblookup[64];       /* initialized so that dblookup[(DB << i) >> 58] = i */
    unsigned result[NRES + 64];        /* answers plus spill-over space */
    unsigned n = 0, rb = 0, elog = 0;  /* quantity produced, remaining bits of randomness, left-over zeroes */
    unsigned long long cur;

    while (n < NRES) {
        cur = newrandom(key);          /* next 64 random bits for this key */
        elog += rb;                    /* zeroes left over from the previous word */
        rb = 64;
        while (cur != 0) {
            unsigned short log = dblookup[((cur & (1 + ~cur)) * DB) >> 58];
            cur >>= log + 1;
            rb -= log + 1;
            result[n++] = log + elog;
            elog = 0;
        }
    }
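A self-contained version of the lookup at the heart of the loop above, with the table built exactly as the comment describes (newrandom and key are omitted; this shows only the trailing-zero count):

```c
#include <assert.h>
#include <stdint.h>

#define DB 0x0218a392cd3d5dbfULL

static unsigned short dblookup[64];

/* Build the table so that dblookup[(DB << i) >> 58] = i; the constant
   starts with six zeroes, so plain shifts stand in for rotations. */
static void init_dblookup(void) {
    for (int i = 0; i < 64; i++)
        dblookup[(uint64_t)(DB << i) >> 58] = (unsigned short)i;
}

/* Trailing-zero count of a nonzero 64-bit word: isolate the least
   set bit, multiply by the De Bruijn constant, read the top 6 bits. */
static unsigned debruijn_tz(uint64_t x) {
    return dblookup[((x & (~x + 1)) * DB) >> 58];
}
```

Multiplying 2^k by DB is just DB << k, so the top 6 bits are the 6-bit window of the sequence starting at position k, which the table maps back to k.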
14. Selecting randomly and repeatably from weighted distributions
- Mark Manasse
- Frank McSherry
- Kunal Talwar
- Microsoft Research
15. Jaccard, extended to multisets
- Jaccard, as defined, doesn't work when the number of copies of an element is a distinguishing factor
- If we generalize a little, we get the sum of the lesser numbers of occurrences divided by the sum of the greater: sim(A, B) = Σ min(A(x), B(x)) / Σ max(A(x), B(x))
- Still works even for non-integral counts
- Allows for weighting of elements by importance
- Same as before, for integer counts, if we replace items with (item, instance) pairs
  - <cat, cat, dog> → {(cat, 1), (cat, 2), (dog, 1)}
  - The sample is an (item, instance) pair, not just an item
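The generalized definition, sketched over a shared index of items with non-negative (possibly fractional) counts:

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Multiset Jaccard: sum of the lesser counts over the sum of the
   greater counts, with a[i] and b[i] the counts of item i. */
static double multiset_jaccard(const double *a, const double *b, size_t n) {
    double lesser = 0.0, greater = 0.0;
    for (size_t i = 0; i < n; i++) {
        lesser  += a[i] < b[i] ? a[i] : b[i];
        greater += a[i] > b[i] ? a[i] : b[i];
    }
    return greater > 0.0 ? lesser / greater : 1.0;
}
```

For integer counts this agrees with ordinary Jaccard on the (item, instance) expansion: {cat: 2, dog: 1} vs {cat: 1, dog: 1} gives 2/3 either way.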
16. Sampling, instead of pairwise computation, for sets
- To allow for faster computation, we estimate similarity by sampling
- Pick some number of samples, where for any sets A and B, each sample agrees with probability equal to sim(A, B)
- Count the average number of matching samples
- To get a good sample, pick a random one-to-one mapping of set elements to a well-ordering, and pick the preimage of the (unique!) smallest
17. Multiset Jaccard, one implementation
- Given a good way to approximate Jaccard for sets, we can convert a multiset (but not a distribution) into a set by replacing 100 occurrences of cat by (cat, 1), (cat, 2), …, (cat, 100)
- Requires (if streaming) remembering how many prior occurrences of each element have been seen
18. Multiset reduction considered inefficient
- The previous technique is linear in input size, if the input is <cat, cat, cat, dog>
- Exponential in input size if the input is (cat, 100): the count takes few bits to write, but expands to 100 elements
- Probability to the rescue!
- If our random map is to a real number between 0 and 1, we don't need to generate 100 random values to find the smallest
  - P(X > x) = 1 − x
  - P(min of k > x) = (1 − x)^k
19. Not so fast!
- That's the probability, but not one that lets us pick samples to test for agreement
- Not good enough to pick an element; we have to pick an instance of the element
  - (cat, 100) and (cat, 200) are only 0.5 similar
- Has to be repeatable
  - If (cat, 7) is chosen from (cat, 100), we mustn't choose (cat, 73) from (cat, 200) (but (cat, 104) would be OK)
20. Properties for repeatable sampling
- A sampling process must pick an element of the input
  - For discrete things, an integer occurrence at most equal to the number of occurrences
  - For non-negative real-valued things, a real value at most equal to the input
- Must pick uniformly by weight
- Must pick the same sample from any subset containing the sample
21. Uniformity
- To be uniform in a column, we have to pick a number smaller than a given number
- A variant of reservoir sampling suffices
  - Given n, pick a random number below n uniformly
  - Given that number, pick a random number below that, and repeat
- To make this repeatable in expected constant time, break into powers of 2
  - Given n, round up to the next higher power of 2, and repeat the downward process until below half
[Figure: number line from 0.5 to 8.0 with n = 3.724 and selections S(8,1) = 7.632, S(8,2) = 4.918, S(4,1) = 3.054, S(1,1) = 0.783. For this choice of n, S(4,1) is the downward selection; for slightly smaller n, S(1,1) would be selected.]
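A sketch of the downward chain. Note that rand() stands in for the random source here, so this version is not repeatable; the talk's version would derive each level's draw deterministically from the element being sampled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Uniform in (0, 1), exclusive of the endpoints. */
static double unif01(void) {
    return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
}

/* From n, repeatedly pick a uniformly random number below the
   current one; the visited values are the nested selections S(., .). */
static size_t downward_chain(double n, double *out, size_t max) {
    size_t k = 0;
    double cur = n;
    while (k < max && cur > 1e-9) {
        cur *= unif01();   /* uniform below cur */
        out[k++] = cur;
    }
    return k;
}
```

Each step halves the value in expectation, which is why bucketing by powers of 2 yields expected constant work per level.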
22. Scaling up
- Same process, but we have to first round up, by finding the smallest chosen number above n
- First check the power-of-2 range containing n
- The next level up contains something if the first number selected below 2^(k+1) is > 2^k
- If this happens, take the smallest number in that range
- Otherwise, repeat at the next power of 2 up
[Figure: same number line (n = 3.724, S(8,1) = 7.632, S(8,2) = 4.918, S(4,1) = 3.054, S(1,1) = 0.783). For this choice of n, S(8,2) is the upward selection.]
23. Picking a column
- Given a column scaled up to n, we need to construct the right distribution for the second smallest number (assuming the nth is the smallest)
- In the discrete case (if we consider only integers as valid choices), the density of the 2nd smallest is proportional to x^(n−1)(1 − x), so CDF = (n+1)x^n − n·x^(n+1) = x^n(1 − n(x − 1))
- In the continuous case (which we can get by scaling the discrete case), CDF = x^n(1 − n·ln x)
- Pick a random luckiness factor p for each column, and solve for x in CDF(x) = p by iteration
- Pick the column with the smallest x value
24. Reducing randomness and improving numerical accuracy
- We can use just one bit to decide if a power-of-2 range has any selected values
- So use a single random value to decide which of the 64 powers of 2 are useful
  - Either by computing 64 samples in parallel, or
  - Computing 64 intervals at once
- Use logarithms of the CDF rather than the CDF to keep things reasonable; look at 1 − x instead of x to keep the log away from 0
- Partially evaluate convergence to save time, and compare preimages to the CDF when possible