Redundant Bit Vectors for the Audio Fingerprinting Server

1
Redundant Bit Vectors for the Audio
Fingerprinting Server
  • John Platt
  • Jonathan Goldstein
  • Chris Burges

2
Structure of Talk
  • Problem Statement
  • Problems with Existing Techniques
  • Bit Vectors
  • Partitioning the Query Space
  • Results
  • Future Extensions

3
Problem Statement
  • Find all hyperspheres that overlap a query point
  • Centers of spheres = undistorted fingerprint of song
  • Radius of sphere = acceptable distortion of fingerprint
  • Sphere overlaps query => query song is same as database song

[Figure: data spheres S1 to S5 and a query point, annotated "Query is song 2".]
4
Existing Techniques
  • Linear scan
      • For each sphere, check if query is inside
      • Linear effort in size of database
      • Previous best known technique
  • Data partitioning (R-trees, SS-trees, etc.)
      • Store data in a tree of shapes
      • Shapes chop up space
      • Descend and backtrack in tree, finding all leaf nodes that could match query
      • Performs worse than linear scan!
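The linear-scan baseline can be sketched in a few lines. This is a minimal illustration, not the presenters' code: the function names and the toy 2-D data are assumptions made for the example.

```python
def inside(query, center, radius):
    """True if the query point lies within the sphere (squared L2 test)."""
    dist_sq = sum((q - c) ** 2 for q, c in zip(query, center))
    return dist_sq <= radius ** 2

def linear_scan(query, spheres):
    """Return the indices of all spheres containing the query.

    Work grows linearly with the number of spheres and with the dimension.
    """
    return [i for i, (center, radius) in enumerate(spheres)
            if inside(query, center, radius)]

# Toy 2-D database of (center, radius) pairs; the values are made up.
spheres = [((0.0, 0.0), 1.0), ((3.0, 0.0), 1.0), ((3.5, 0.5), 1.0)]
matches = linear_scan((3.2, 0.2), spheres)  # the query lies in spheres 1 and 2
```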

5
Key New Ideas
  • New algorithm: Redundant Bit Vectors (RBV)
  • Partition the queries, not the data
      • Avoids hopeless task
  • Store each data point redundantly
      • Combats high dimensionality
  • Use bit vectors to index database
      • Small, fast representation

6
Data Structure: Bit Vectors
  • Bit vectors represent every data point as one bit

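[Figure: bit vectors with one bit per data point (Point 1 through Point N). The first four rows are ANDed together; the last three rows match the running AND results.]
1 0 1 1 1 1 1 1
1 1 0 1 1 0 1 1
1 1 1 1 0 0 1 1
1 0 0 1 1 1 0 1
1 0 0 1 1 0 1 1
1 0 0 1 0 0 1 1
1 0 0 1 0 0 0 1
  • Most examples are excluded from a final linear scan
  • 0 means exclude this example from linear search
  • 1 means linear search must still look at example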
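The filtering step can be reproduced with the bit rows shown on this slide. The sketch below assumes the first four rows are the vectors being combined and the remaining rows are intermediate AND results. The helper name `and_all` is invented for illustration.

```python
from functools import reduce

# The first four bit rows from the slide, one bit per data point
# (leftmost bit = first point).
vectors = [
    "10111111",
    "11011011",
    "11110011",
    "10011101",
]

def and_all(bit_strings):
    """AND equal-length bit strings together and return the result as a string."""
    width = len(bit_strings[0])
    combined = reduce(lambda a, b: a & b, (int(s, 2) for s in bit_strings))
    return format(combined, f"0{width}b")

survivors_mask = and_all(vectors)
# Only points whose bit is still 1 go on to the final linear scan.
survivors = [i for i, bit in enumerate(survivors_mask) if bit == "1"]
```

Under that reading, the running results after two, three, and four vectors match the last three rows on the slide.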
7
Why Bit Vectors are Good
  • Small memory footprint
      • 1 bit per example per bit vector
  • Fast on modern CPUs
      • 1 CPU operation processes 32 examples per clock cycle
  • Compare to Euclidean distance
      • 3 operations/example/dimension
  • Potential speedup: 96x (3 operations x 32 examples)
      • we use 1 bit vector per dimension in lookup
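As a rough illustration of where the 96x figure comes from, the sketch below packs one bit per example into 32-bit words, so one AND covers 32 examples. The 32-bit word size follows the slide. The helper names are assumptions, and Python integers only stand in for machine words here.

```python
WORD_BITS = 32  # word size taken from the slide's "32 examples per clock cycle"

def pack(bits):
    """Pack a list of 0/1 flags into 32-bit words, one bit per example."""
    words = []
    for start in range(0, len(bits), WORD_BITS):
        word = 0
        for offset, bit in enumerate(bits[start:start + WORD_BITS]):
            word |= bit << offset
        words.append(word)
    return words

def and_words(a, b):
    """One AND per word filters up to 32 examples at a time."""
    return [x & y for x, y in zip(a, b)]

a = pack([1] * 40)                      # 40 examples need two words
b = pack([i % 2 for i in range(40)])    # keep only odd-numbered examples
filtered = and_words(a, b)

# Slide's estimate: 3 operations per example per dimension for a Euclidean
# distance, versus 1 operation per 32 examples for an AND.
potential_speedup = 3 * WORD_BITS
```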

8
Partition the Queries, not the Data
  • For each query dimension:
      • dimension indexed into bins
      • bit vector associated with each bin
      • when query falls into bin, use the corresponding bit vector
  • AND together bit vectors for some or all dimensions
  • Each dimension trims examples
  • Perform linear scan on survivors

[Figure: a query falling into one bin of a dimension, with example bit vectors]
1 0 1 1 1 1 1 0
1 1 0 1 1 1 1 1
1 1 0 1 1 1 1 1
1 1 0 1 0 1 1 0
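The query-time lookup described on this slide might look like the following sketch. The index layout (one list of bin edges and one list of integer bit masks per dimension) and all names are assumptions chosen for the example, not the presenters' data structures.

```python
from bisect import bisect_right

def bin_index(edges, value):
    """Index of the bin containing value, given sorted interior bin edges."""
    return bisect_right(edges, value)

def candidates(query, index):
    """AND together the bit vector of the bin each query coordinate falls into.

    index holds one (edges, bit_vectors) pair per indexed dimension;
    bit_vectors[b] is an int whose bit j is set if data point j may match
    a query that falls in bin b.
    """
    mask = -1  # all ones before any dimension has trimmed it
    for value, (edges, bit_vectors) in zip(query, index):
        mask &= bit_vectors[bin_index(edges, value)]
    return mask

# Hypothetical 2-D index over 4 data points with 3 bins per dimension.
index = [
    ([1.0, 2.0], [0b0011, 0b0110, 0b1100]),
    ([1.0, 2.0], [0b0101, 0b0111, 0b1010]),
]
mask = candidates((1.5, 0.5), index)
survivors = [j for j in range(4) if (mask >> j) & 1]
```

Only the indices in `survivors` would then be passed to the exact distance check in the final linear scan.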
9
How We Compute the RBVs
  • At index building time, construct the bit vectors

  • Project spheres into each dimension
  • i-th vector, j-th bit: does sphere j overlap bin i?

[Figure: spheres S1 to S6 projected onto one dimension, with one bit vector per bin]
1 1 0 0 0 0
1 0 1 0 0 0
0 0 1 1 0 1
0 0 0 1 1 1
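A minimal sketch of index construction for a single dimension follows. The sphere positions and bin edges are invented so that the output reproduces the four bit rows on this slide. The real figure's coordinates are not recoverable from the transcript.

```python
def build_bit_vectors(intervals, edges):
    """Return one bit string per bin; bit j is '1' if interval j overlaps the bin.

    intervals: (low, high) projection of each sphere, i.e. center -/+ radius.
    edges: sorted bin boundaries, so bin i spans edges[i] .. edges[i + 1].
    """
    vectors = []
    for i in range(len(edges) - 1):
        bin_lo, bin_hi = edges[i], edges[i + 1]
        bits = "".join(
            "1" if low < bin_hi and high > bin_lo else "0"
            for low, high in intervals
        )
        vectors.append(bits)
    return vectors

# (center, radius) of S1..S6 along one dimension; values are invented.
spheres_1d = [(1.0, 0.5), (0.4, 0.2), (2.0, 0.5), (3.0, 0.5), (3.5, 0.3), (3.3, 0.6)]
intervals = [(c - r, c + r) for c, r in spheres_1d]
vectors = build_bit_vectors(intervals, [0.0, 1.0, 2.0, 3.0, 4.0])
```

A sphere that straddles a bin edge sets its bit in both bins, which is the redundancy in "redundant bit vectors".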
10
How We Decide on the Bin Edges
  • Two equivalent heuristics
      • each bin should have the same number of spheres
      • adjacent bit vectors should be a constant Hamming distance apart
  • Place bin edges an equal number of sphere edges apart

[Figure: spheres S1 to S6 projected onto one dimension, with marks labeled 3, 6, 9, and 12]
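One way to read the edge-placement rule is sketched below: sort every sphere's lower and upper edge along a dimension, then drop a bin boundary after every k-th sphere edge. Placing the boundary at the midpoint between neighbouring sphere edges is an arbitrary choice made for this example, and the intervals are made up.

```python
def place_bin_edges(intervals, edges_per_bin):
    """Put a bin boundary after every edges_per_bin sphere edges.

    intervals: (low, high) projection of each sphere onto one dimension.
    Returns the interior bin boundaries in increasing order.
    """
    sphere_edges = sorted(e for interval in intervals for e in interval)
    boundaries = []
    for k in range(edges_per_bin, len(sphere_edges), edges_per_bin):
        # Midpoint between the k-th and (k+1)-th sphere edge (assumption).
        boundaries.append((sphere_edges[k - 1] + sphere_edges[k]) / 2)
    return boundaries

# Six made-up intervals give twelve sphere edges, as in the slide's figure.
intervals = [(0.0, 2.0), (1.0, 3.0), (4.0, 6.0), (5.0, 7.0), (8.0, 10.0), (9.0, 11.0)]
boundaries = place_bin_edges(intervals, 3)
```

With three sphere edges per bin, the twelve edges are split into four bins.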
11
Improve Selectivity by Shrinking Boxes
  • Fingerprinting with hyperspheres (L2 norm)
      • Low, but non-zero false negative rate
  • Bit vectors implement hyperrectangles (L∞ norm)

  • Bounding hyperrectangle: bit vectors guaranteed to never introduce false negatives
  • Shrunken hyperrectangle: bit vectors empirically found to introduce no extra false negatives
  • We shrink the hyperrectangles to speed up the final linear search
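The geometry behind box shrinking can be checked with a toy 2-D case. A box of half-width equal to the radius contains the whole sphere, so filtering by it cannot drop a point that the sphere test would accept. A smaller box can. The shrink factor and the test points below are hypothetical, not values from the slides.

```python
def in_box(query, center, half_width):
    """L-infinity test: every coordinate within half_width of the center."""
    return all(abs(q - c) <= half_width for q, c in zip(query, center))

def in_sphere(query, center, radius):
    """L2 test on squared distance."""
    return sum((q - c) ** 2 for q, c in zip(query, center)) <= radius ** 2

center, radius = (0.0, 0.0), 1.0
shrink = 0.8  # hypothetical shrink factor

corner = (0.9, 0.9)  # inside the bounding box, outside the sphere
edge = (0.95, 0.0)   # inside the sphere, outside the shrunken box

results = {
    "corner_in_box": in_box(corner, center, radius),
    "corner_in_sphere": in_sphere(corner, center, radius),
    "edge_in_sphere": in_sphere(edge, center, radius),
    "edge_in_shrunken_box": in_box(edge, center, shrink * radius),
}
```

The corner point is the kind of candidate the final linear scan has to reject, so a smaller box leaves fewer of them. The edge point is the kind of true match a shrunken box could miss.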
12
Speed Comparisons
  • Ran 1000 queries against fingerprint database
  • Database size: 240K
  • 64-dimensional points
  • 14 bit-vector dimensions used
      • chosen to optimize combined bit vector and linear scan speed
      • with more dimensions, bit vectors slow down and linear scan speeds up
  • 32 bins per dimension
      • chosen to optimize memory/speed tradeoff
  • Pentium 4, 2.2 GHz

13
Results
  • all linear scans used early bailing
  • measured code by itself, not in context of SQL
    or IIS
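Assuming "early bailing" means abandoning a distance computation as soon as the partial sum exceeds the squared radius, a linear-scan inner loop could be sketched as follows. The function name and sample points are illustrative.

```python
def within_radius_early_bail(query, center, radius):
    """Squared-distance test that stops once the running sum exceeds radius^2."""
    limit = radius * radius
    total = 0.0
    for q, c in zip(query, center):
        total += (q - c) ** 2
        if total > limit:
            return False  # bail out: later dimensions can only add to the sum
    return True

far = within_radius_early_bail((0.0, 0.0, 0.0, 0.0), (3.0, 0.0, 0.0, 0.0), 1.0)
near = within_radius_early_bail((0.1, 0.1, 0.1, 0.1), (0.0, 0.0, 0.0, 0.0), 1.0)
```

For the far point the loop exits after the first dimension, which is where the saving comes from when most candidates are not matches.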

14
Code Details
  • 600 lines of C (pretty simple)
  • Not integrated with SQL Server or IIS, etc.
  • Running as part of audio fingerprinting demo

15
How Fast Is It Really?
  • You tell us: it depends on several factors
      • Linear time in size of database
      • Linear time in amount of resistance to cropping
      • Sorting by popularity may help substantially

16
Resistance to Cropping
  • How much crop slop do you need?
      • current system: 5.4 traces per second of slop
      • possible to reduce by 2x
      • server load is linear in number of traces

17
Popularity Sorting May Help
  • Order database by approximate popularity of music
  • Split search into different sections
      • 5000 most popular: first search here
      • 50000 next most popular: if not found, search here
      • the rest: if not found, search here
  • May yield substantial speed gain
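The tiered search could be organised as in this sketch. The `search_tier` function is a placeholder for a bit-vector lookup plus linear scan over one section of the database, and the tiny sets stand in for the three popularity tiers. None of these names come from the presentation.

```python
def tiered_search(query, tiers, search_tier):
    """Search tiers in popularity order and stop at the first match."""
    for tier in tiers:
        match = search_tier(query, tier)
        if match is not None:
            return match
    return None

def search_tier(query, tier):
    # Placeholder for a bit-vector lookup plus linear scan over one tier.
    return query if query in tier else None

# Toy tiers standing in for "5000 most popular", "50000 next", "the rest".
tiers = [{"song_a", "song_b"}, {"song_c"}, {"song_d", "song_e"}]
found = tiered_search("song_c", tiers, search_tier)
missing = tiered_search("song_z", tiers, search_tier)
```

A query that matches a popular song never touches the larger tiers. A query with no match still pays for all of them.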
18
Memory Performance
  • Bit vectors can stay resident in memory
  • For 240K songs
      • All fingerprints live in 128 MB
      • Bit vector indices only require 13 MB
      • We can store fingerprints as 2-byte shorts, saving 2x memory
  • Bit vector search blows out of cache
      • speed depends on memory bandwidth of server
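The 13 MB index figure is consistent with the parameters quoted earlier in the talk, as this back-of-the-envelope check shows. It assumes one indexed fingerprint per song and 1 MB = 10^6 bytes. The variable names are illustrative.

```python
songs = 240_000           # database size from the slides
dimensions = 14           # bit-vector dimensions used
bins_per_dimension = 32   # bins per dimension

# 1 bit per example per bit vector, one bit vector per bin per dimension.
index_bits = songs * dimensions * bins_per_dimension
index_megabytes = index_bits / 8 / 1e6
```

Here `index_megabytes` comes out at about 13.4.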

19
Summary
  • Bit Vectors are Simple, Small, and Fast
  • Must be used to get good server-side performance