Title: Redundant Bit Vectors for the Audio Fingerprinting Server
1Redundant Bit Vectors for the Audio
Fingerprinting Server
- John Platt
- Jonathan Goldstein
- Chris Burges
2Structure of Talk
- Problem Statement
- Problems with Existing Techniques
- Bit Vectors
- Partitioning the Query Space
- Results
- Future Extensions
3Problem Statement
- Find all hyperspheres that overlap a query point
- Centers of spheres undistorted fingerprint of
song - Radius of sphere acceptable distortion of
fingerprint - Sphere overlaps query query song is same as dB
song
S1
S2
S3
Query is song 2
Query
S4
S5
Data sphere
4Existing Techniques
- Linear scan
- For each sphere, check if query is inside
- Linear effort in size of dB
- Previous best known technique
- Data partitioning (R-trees, SS-trees, etc)
- Store data in a tree of shapes
- Shapes chop up space
- Descend and backtrack in tree, finding all leaf
nodes that could match query - Performs worse than linear scan!!!
5Key New Ideas
- New Algorithm Redundant Bit Vectors (RBV)
- Partition the queries, not the data
- Avoids hopeless task
- Store each data point redundantly
- Combats high dimensionality
- Use bit vectors to index database
- Small fast representation
6Data Structure Bit Vectors
- Bit vectors represent every data point as one bit
1 0 1 1 1 1 1 1
Point 1
1 1 0 1 1 0 1 1
1 1 1 1 0 0 1 1
1 0 0 1 1 1 0 1
1 0 0 1 1 0 1 1
1 0 0 1 0 0 1 1
1 0 0 1 0 0 0 1
Most examples are excluded from a final linear
scan
?
Point N
0 means exclude this example from linear search 1
means linear search must still look at example
7Why Bit Vectors are Good
- Small memory footprint
- 1 bit per example per bit vector
- Fast on modern CPUs
- 1 CPU cycle operates on 32 examples per clock
cycle - Compare to Euclidean distance
- 3 operations/example/dimension
- Potential speed up 96x !!!
- we use 1 bit vector per dimension in lookup
8Partition the Queries, not the Data
Query
bin
- For each query dimension,
- dimension indexed into bins
- bit vector associated with each bin
- when query falls into bin
- use the corresponding bit vector
- AND together bit vectors for some
- or all dimensions
- Each dimension trims examples
- Perform linear scan on survivors
1 0 1 1 1 1 1 0
1 1 0 1 1 1 1 1
1 1 0 1 1 1 1 1
1 1 0 1 0 1 1 0
9How We Compute the RBVs
- At index building time, construct the bit vectors
S1
Project spheres into each dimension
S5
S2
S4
S3
S6
i th vector, j th bit does sphere j overlap
bin i ?
1 1 0 0 0 0
1 0 1 0 0 0
0 0 1 1 0 1
0 0 0 1 1 1
10How We Decide on the Bin Edges
- Two equivalent heuristics
- each bin should have same number of spheres
- adjacent bit vectors should be constant Hamming
distance apart - Place bin edges equal num. of sphere edges apart
6
3
9
12
S1
S5
S2
S4
S3
S6
11Improve Selectivity by Shrinking Boxes
- Fingerprinting with hyperspheres (L2 norm)
- Low, but non-zero false negative rate
- Bit vectors implement hyperrectangles (L8 norm)
bit vectors guaranteed to never introduce
false positives
bit vectors empirically found to introduce
no extra false positives
We shrink the hyperrectangles to speed up final
linear search
12Speed Comparisons
- Ran 1000 queries against fingerprint database
- Database size 240K
- 64 dimensional points
- 14 bit vector dimensions used
- chosen to optimize bit vector linear scan speed
- more dim bit vectors slow down, linear scan
speeds up - 32 bins per dimension
- chosen to optimize memory/speed tradeoff
- Pentium 4, 2.2 GHz
13Results
- all linear scans used early bailing
- measured code by itself, not in context of SQL
or IIS
14Code Details
- 600 lines of C (pretty simple)
- Not integrated with SQL Server or IIS, etc.
- Running as part of audio fingerprinting demo
15How Fast Is It Really?
- You tell us it depends on several factors
- Linear time in size of database
- Linear time in amount of resistance to cropping
- Sorting by popularity may help substantially
16Resistance to Cropping
- How much crop slop do you need?
- current system 5.4 traces / second of slop
- possible to reduce by 2x
- server load linear in number of traces
17Popularity Sorting May Help
Order database by approximate popularity of
music Split search into different sections
5000 most popular
first search here
May yield substantial speed gain
if not found, search here
50000 next most popular
if not found, search here
the rest
18Memory Performance
- Bit Vectors can stay resident in memory
- For 240K songs
- All fingerprints live in 128M
- Bit vector indices only require 13M
- We can store fingerprints as 2-byte short, save
2x mem. - Bit vector search blows out of cache
- speed depends on memory bandwidth of server
19Summary
- Bit Vectors are Simple, Small, and Fast
- Must be used to get good server-side performance