CS 361A (Advanced Data Structures and Algorithms) presentation

1
CS 361A (Advanced Data Structures and Algorithms)

Lecture 19 (Dec 5, 2005)
Nearest Neighbors
Dimensionality Reduction and Locality-Sensitive Hashing
Rajeev Motwani

2
Metric Space

Metric Space (M,D)
For points p, q in M, D(p,q) is the distance from p to q
Only reasonable model for high-dimensional geometric space
Defining Properties
Reflexive: $D(p,q) = 0$ if and only if $p = q$
Symmetric: $D(p,q) = D(q,p)$
Triangle Inequality: $D(p,q) \le D(p,r) + D(r,q)$
Interesting Cases
M: points in d-dimensional space
D: Hamming or Euclidean $L_p$-norms

3
High-Dimensional Near Neighbors

Nearest Neighbors Data Structure
Given N points $P = \{p_1, \ldots, p_N\}$ in metric space (M,D)
Queries: which point $p \in P$ is closest to point q?
Complexity: trade off preprocessing space against query time
Applications
vector quantization
multimedia databases
data mining
machine learning
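
The query defined on this slide can be answered by a linear scan, which is the brute-force baseline the later slides compare against. The following is a minimal sketch, assuming points are coordinate tuples and the metric is passed in as a function; the names `nearest_neighbor` and `euclidean` are illustrative and not from the lecture.

```python
from typing import Callable, Sequence, TypeVar

Point = TypeVar("Point")


def nearest_neighbor(points: Sequence[Point], q: Point,
                     dist: Callable[[Point, Point], float]) -> Point:
    """Return the point of `points` closest to `q` under `dist` (linear scan)."""
    if not points:
        raise ValueError("point set is empty")
    # One distance evaluation per stored point: O(dN) time for d-dimensional points.
    return min(points, key=lambda p: dist(p, q))


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """L2 distance between two equal-length coordinate sequences."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


P = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(nearest_neighbor(P, (0.9, 1.2), euclidean))
```

No preprocessing is done here, so storage is just the points themselves; every other technique in the lecture spends extra space to beat this query time.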

4
Known Results
Query Time         | Storage             | Technique         | Paper
$dN$               | $dN$                | Brute-Force       |
$2^d \log N$       | $N^{2^{d+1}}$       | Voronoi Diagram   | Dobkin-Lipton 76
$D^{d/2} \log N$   | $N^{d/2}$           | Random Sampling   | Clarkson 88
$d^5 \log N$       | $N^d$               | Combination       | Meiser 93
$\log^{d-1} N$     | $N \log^{d-1} N$    | Parametric Search | Agarwal-Matousek 92

Some expressions are approximate
Bottom line: exponential dependence on d

5
Approximate Nearest Neighbor

Exact Algorithms
Benchmark: brute-force needs space O(N), query time O(N)
Known Results: exponential dependence on dimension
Theory/Practice: no better than brute-force search
Approximate Near-Neighbors
Given N points $P = \{p_1, \ldots, p_N\}$ in metric space (M,D)
Given error parameter $\epsilon > 0$
Goal: for query q and nearest-neighbor p, return r such that $D(q,r) \le (1+\epsilon)\, D(q,p)$
Justification
Mapping objects to metric space is heuristic anyway
Get tremendous performance improvement

6
Results for Approximate NN
Query Time                      | Storage                                | Technique                          | Paper
$d^d \epsilon^{-d}$             | $dN$                                   | Balanced Trees                     | Arya et al 94
$d^2\,\mathrm{polylog}(N,d)$ or $N$ | $N^{2d}$ or $dN\,\mathrm{polylog}(N,d)$ | Random Projection                  | Kleinberg 97
$\log^3 N$                      | $N^{1/\epsilon^2}$                     | Search Trees + Dimension Reduction | Indyk-Motwani 98
$dN^{1/\epsilon} \log^2 N$      | $N^{1+1/\epsilon} \log N$              | Locality-Sensitive Hashing         | Indyk-Motwani 98
External Memory                 | External Memory                        | Locality-Sensitive Hashing         | Gionis-Indyk-Motwani 99

Will show main ideas of last 3 results
Some expressions are approximate

7
Approximate r-Near Neighbors

Given N points $P = \{p_1, \ldots, p_N\}$ in metric space (M,D)
Given error parameter $\epsilon > 0$, distance threshold $r > 0$
Query
If no point p with $D(q,p) < r$, return FAILURE
Else, return any p with $D(q,p) < (1+\epsilon) r$
Application
Solving Approximate Nearest Neighbor
Assume maximum distance is R
Run in parallel for
Time/space: O(log R) overhead
Indyk-Motwani: reduce to O(polylog N) overhead
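
The reduction in the Application bullets can be sketched as follows. This is a hedged illustration, not the lecture's construction: it assumes the radii form the geometric sequence 1, (1+ε), (1+ε)², ... up to R (the transcript does not show the sequence), that all distances lie between 1 and R, and it tries the radii one after another instead of in parallel. The function `r_near_neighbor` is a brute-force stand-in for a real r-near-neighbor structure, and both function names are hypothetical.

```python
import math
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Dist = Callable[[Point, Point], float]


def r_near_neighbor(points: Sequence[Point], q: Point, r: float,
                    eps: float, dist: Dist) -> Optional[Point]:
    """Stand-in oracle: any point within (1+eps)*r of q, else None (FAILURE)."""
    for p in points:
        if dist(q, p) < (1 + eps) * r:
            return p
    return None


def approx_nearest(points: Sequence[Point], q: Point, eps: float,
                   max_dist: float, dist: Dist) -> Optional[Point]:
    """Try radii 1, (1+eps), (1+eps)^2, ... past max_dist; keep the first success."""
    # Assumed radius schedule: O(log max_dist) levels for fixed eps.
    levels = math.ceil(math.log(max_dist, 1 + eps)) + 1
    for i in range(levels + 1):
        found = r_near_neighbor(points, q, (1 + eps) ** i, eps, dist)
        if found is not None:
            return found
    return None


pts = [(0.0, 0.0), (10.0, 0.0), (4.0, 3.0)]
print(approx_nearest(pts, (5.0, 5.0), 0.5, 20.0, math.dist))
```

Because the previous radius failed, the true nearest distance is at least r/(1+ε) when radius r first succeeds, so under these assumptions the returned point is within a factor (1+ε)² of optimal.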

8
Hamming Metric

Hamming Space
Points in M: bit-vectors $\{0,1\}^d$ (can generalize to $\{0,1,2,\ldots,q\}^d$)
Hamming Distance: D(p,q) = number of positions where p, q differ
Remarks
Simplest high-dimensional setting
Still useful in practice
In theory, as hard (or easy) as Euclidean space
Trivial in low dimensions
Example
Hypercube in d = 3 dimensions
000, 001, 010, 011, 100, 101, 110, 111
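
As a small illustration of the definitions on this slide, the sketch below computes Hamming distance on bit strings and enumerates the d = 3 hypercube. Representing points as Python strings of "0"/"1" is an assumption made for readability.

```python
from itertools import product


def hamming(p: str, q: str) -> int:
    """Number of positions where equal-length bit strings p and q differ."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    return sum(a != b for a, b in zip(p, q))


# The 2^3 vertices of the hypercube, in the order listed on the slide.
cube = ["".join(bits) for bits in product("01", repeat=3)]
print(cube)
print(hamming("000", "111"), hamming("010", "011"))
```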

9
Dimensionality Reduction

Overall Idea
Map from high to low dimensions
Preserve distances approximately
Solve Nearest Neighbors in new space
Performance improvement at cost of approximation
error
Mapping?
Hash function family $H = \{H_1, \ldots, H_m\}$
Each $H_i : \{0,1\}^d \to \{0,1\}^t$ with $t \ll d$
Pick $H_R$ from H uniformly at random
Map each point in P using same $H_R$
Solve NN problem on $H_R(P) = \{H_R(p_1), \ldots, H_R(p_N)\}$

10
Reduction for Hamming Spaces

Theorem: For any r and small $\epsilon > 0$, there is a hash family H such that for any p, q and random $H_R \in H$
with probability $> 1 - \delta$, provided for some constant C,

[Figure: points labeled a, b, c, each shown twice]
11
Remarks

For fixed threshold r, can distinguish between
Near: $D(p,q) < r$
Far: $D(p,q) > (1+\epsilon) r$
For N points, need
Yet, can reduce to O(log N)-dimensional space, while approximately preserving distances
Works even if points not known in advance

12
Hash Family

Projection Function
Let S be an ordered multiset of s indexes from $\{1, \ldots, d\}$
$p|_S : \{0,1\}^d \to \{0,1\}^s$ projects p into s-dimensional subspace
Example
d = 5, p = 01100
s = 3, S = {2,2,4} $\Rightarrow$ $p|_S = 110$
Choosing hash function $H_R$ in H
Repeat for $i = 1, \ldots, t$
Pick $S_i$ randomly (with replacement) from $\{1, \ldots, d\}$
Pick random hash function $f_i : \{0,1\}^s \to \{0,1\}$
$h_i(p) = f_i(p|_{S_i})$
$H_R(p) = (h_1(p), h_2(p), \ldots, h_t(p))$
Remark: note similarity to Bloom Filters
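
The construction above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: points are bit strings, indexes are 1-based as on the slide, and each random function f_i is materialized lazily (its value on an input is drawn the first time that input is seen), which has the same distribution as drawing the whole table up front. The names `project` and `make_hash` are hypothetical.

```python
import random
from typing import Callable, Dict, List


def project(p: str, S: List[int]) -> str:
    """p|S: bits of p at the 1-based indexes in S, in order, repeats allowed."""
    return "".join(p[i - 1] for i in S)


def make_hash(d: int, s: int, t: int, rng: random.Random) -> Callable[[str], str]:
    """Draw H_R: t index multisets S_i plus t lazily built random functions f_i."""
    index_sets = [[rng.randint(1, d) for _ in range(s)] for _ in range(t)]
    tables: List[Dict[str, str]] = [{} for _ in range(t)]

    def H_R(p: str) -> str:
        out = []
        for S, f in zip(index_sets, tables):
            key = project(p, S)
            if key not in f:
                # f_i(key) is fixed the first time it is needed, then reused.
                f[key] = rng.choice("01")
            out.append(f[key])
        return "".join(out)

    return H_R


print(project("01100", [2, 2, 4]))          # the slide's example
H = make_hash(d=10, s=3, t=8, rng=random.Random(0))
print(H("0110001010"))
```

The same `H` must be applied to every data point and to every query, as the Dimensionality Reduction slide requires.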

13
Illustration of Hashing
[Figure: the d bits of p (0 1 1 0 0 0 1 0 1 0) are projected to s-bit strings $p|_{S_1}, \ldots, p|_{S_t}$; functions $f_1, \ldots, f_t$ map these to the bits $h_1(p), \ldots, h_t(p)$ of $H_R(p)$]
14
Analysis I

Choose random index-set S
Claim: For any p, q, $\Pr[p|_S = q|_S] = (1 - D(p,q)/d)^s$
Why?
p, q differ in D(p,q) bit positions
Need all s indexes of S to avoid these positions
Sampling with replacement from $\{1, \ldots, d\}$
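
The counting argument in the "Why?" bullets gives a collision probability of (1 - D(p,q)/d)^s, since each of the s independent index draws must miss the D(p,q) differing positions. The sketch below checks this exactly for a small case by enumerating every ordered choice of S; the function name `collision_probability` is illustrative.

```python
from itertools import product


def collision_probability(p: str, q: str, s: int) -> float:
    """Exact Pr[p|S = q|S] over all d^s ordered index choices S."""
    d = len(p)
    agree = 0
    for S in product(range(d), repeat=s):
        # The projections agree only if every sampled position agrees.
        if all(p[i] == q[i] for i in S):
            agree += 1
    return agree / d ** s


p, q, s = "01100", "11000", 3
dist = sum(a != b for a, b in zip(p, q))
print(collision_probability(p, q, s), (1 - dist / len(p)) ** s)
```

The two printed values should agree up to floating-point rounding. Exhaustive enumeration is only feasible for tiny d and s; it is used here so the check does not depend on random sampling.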

15
Analysis II

Choose $s = d/r$
Since $1 - x < e^{-x}$ for $x < 1$, we obtain
Thus

16
Analysis III

Recall $h_i(p) = f_i(p|_{S_i})$
Thus
Choosing $c = \frac{1}{2}(1 - e^{-1})$
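
One way to see where the constant c comes from is the short derivation below. It is a sketch under two assumptions: each $f_i$ is a uniformly random function into $\{0,1\}$, and the projection collision probability is $(1 - D(p,q)/d)^s$ as follows from sampling with replacement. It is not a transcription of the formulas originally displayed on this slide.

```latex
% If the projections agree, h_i(p) = h_i(q); otherwise f_i takes two
% independent uniform bits on the two distinct inputs, which differ w.p. 1/2.
\begin{align*}
\Pr[h_i(p) \ne h_i(q)]
  &= \tfrac{1}{2}\,\Pr\bigl[p|_{S_i} \ne q|_{S_i}\bigr] \\
  &= \tfrac{1}{2}\Bigl(1 - \bigl(1 - \tfrac{D(p,q)}{d}\bigr)^{s}\Bigr).
\end{align*}
% With s = d/r and 1 - x \le e^{-x}:
\[
  \Bigl(1 - \tfrac{D(p,q)}{d}\Bigr)^{d/r} \le e^{-D(p,q)/r}
  \quad\Longrightarrow\quad
  \Pr[h_i(p) \ne h_i(q)] \ge \tfrac{1}{2}\bigl(1 - e^{-D(p,q)/r}\bigr).
\]
```

At $D(p,q) = r$ the right-hand side equals $\frac{1}{2}(1 - e^{-1}) = c$, and it grows with distance, so far pairs disagree in each hashed coordinate with probability strictly above c.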

17
Analysis IV

Recall $H_R(p) = (h_1(p), h_2(p), \ldots, h_t(p))$
$D(H_R(p), H_R(q))$ = number of i's where $h_i(p)$, $h_i(q)$ differ
By linearity of expectations
Theorem almost proved
For high probability bound, need Chernoff Bound

18
Chernoff Bound

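Consider Bernoulli random variables $X_1, X_2, \ldots, X_n$
Values are 0-1
$\Pr[X_i = 1] = x$ and $\Pr[X_i = 0] = 1 - x$
Define $X = X_1 + X_2 + \cdots + X_n$ with $E[X] = nx$
Theorem: For independent $X_1, \ldots, X_n$, for any $0 < \beta < 1$,

[Figure: probability P plotted against X, with a window of width $2\beta nx$ around the mean $nx$]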
19
Analysis V

Define
$X_i = 0$ if $h_i(p) = h_i(q)$, and 1 otherwise
$n = t$
Then $X = X_1 + X_2 + \cdots + X_t = D(H_R(p), H_R(q))$
Case 1: $D(p,q) < r \Rightarrow x = c$
Case 2: $D(p,q) > (1+\epsilon) r \Rightarrow x = c + \epsilon/6$
Observe sloppy bounding of constants in Case 2

20
Putting it all together

Recall
Thus, error probability
Choosing $C = 1200/c$
Theorem is proved!!
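
The formulas on this slide did not survive in the transcript. As a hedged illustration of how a constant of the form 1200/c can arise, the arithmetic below uses the standard two-sided multiplicative Chernoff bound. The deviation parameter $\beta = \epsilon/20$ and the target error probability $\delta$ are assumptions chosen for this sketch, not values confirmed by the source.

```latex
% Standard two-sided Chernoff bound for independent 0-1 variables,
% with mean E[X] = nx and any 0 < \beta < 1:
\[
  \Pr\bigl[\,|X - nx| > \beta n x\,\bigr] \le 2 e^{-\beta^{2} n x / 3}.
\]
% Take n = t, x \ge c, and the assumed choice \beta = \epsilon/20:
\[
  2 e^{-\beta^{2} t x / 3} \le 2 e^{-\epsilon^{2} c\, t / 1200} \le \delta
  \quad\Longleftrightarrow\quad
  t \ge \frac{1200}{c} \cdot \frac{\ln(2/\delta)}{\epsilon^{2}}.
\]
```

Under these assumptions the required number of hash coordinates t grows as $\epsilon^{-2} \log(1/\delta)$ and is independent of the original dimension d.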

21
Algorithm I

Set error probability
Select hash $H_R$ and map points $p \to H_R(p)$
Processing query q
Compute $H_R(q)$
Find nearest neighbor $H_R(p)$ for $H_R(q)$
If then return p,
else FAILURE
Remarks
Brute-force for finding $H_R(p)$ implies query time
Need another approach for lower dimensions

22
Algorithm II

Fact: Exact nearest neighbors in $\{0,1\}^t$ requires
Space $O(2^t)$
Query time $O(t)$
How?
Precompute/store answers to all queries
Number of possible queries is $2^t$
Since
Theorem: In Hamming space $\{0,1\}^d$, can solve approximate nearest neighbor with
Space
Query time

23
Different Metric

Many applications have sparse points
Many dimensions but few 1s
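Example: points are documents, dimensions are words
Better to view as sets
Previous approach would require large s
For sets A, B, define $\mathrm{sim}(A,B) = |A \cap B| / |A \cup B|$
Observe
$A = B \Rightarrow \mathrm{sim}(A,B) = 1$
A, B disjoint $\Rightarrow \mathrm{sim}(A,B) = 0$
Question: handling $D(A,B) = 1 - \mathrm{sim}(A,B)$?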

24
Min-Hash

Random permutations $\pi_1, \ldots, \pi_t$ of universe (dimensions)
Define mapping $h_j(A) = \min_{a \in A} \pi_j(a)$
Fact: $\Pr[h_j(A) = h_j(B)] = \mathrm{sim}(A,B)$
Proof? Already seen!!
Overall hash-function
$H_R(A) = (h_1(A), h_2(A), \ldots, h_t(A))$
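
The Fact on this slide can be checked exactly on a toy universe. The sketch below assumes sim(A,B) is the Jaccard similarity |A ∩ B| / |A ∪ B| of two non-empty sets, represents a permutation as a dictionary from element to rank, and enumerates all permutations instead of sampling t of them. The names `jaccard`, `min_hash`, and `signature` are illustrative.

```python
from itertools import permutations
from typing import Dict, Iterable, Sequence, Set, Tuple


def jaccard(A: Set[str], B: Set[str]) -> float:
    """sim(A,B) = |A & B| / |A | B|, assuming the sets are not both empty."""
    return len(A & B) / len(A | B)


def min_hash(A: Iterable[str], rank: Dict[str, int]) -> int:
    """h(A): smallest rank the permutation `rank` assigns to an element of A."""
    return min(rank[a] for a in A)


def signature(A: Iterable[str], ranks: Sequence[Dict[str, int]]) -> Tuple[int, ...]:
    """H_R(A) = (h_1(A), ..., h_t(A)) for the given t permutations."""
    items = list(A)
    return tuple(min_hash(items, rank) for rank in ranks)


universe = ["a", "b", "c", "d", "e"]
A, B = {"a", "b", "c"}, {"b", "c", "d"}

# Enumerate all 5! permutations so the collision frequency is exact.
hits = total = 0
for perm in permutations(universe):
    rank = {x: i for i, x in enumerate(perm)}
    hits += min_hash(A, rank) == min_hash(B, rank)
    total += 1
print(hits / total, jaccard(A, B))
```

The two hashes collide exactly when the minimum-rank element of A ∪ B lies in A ∩ B, which is why the collision frequency matches the similarity.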

25
Min-Hash Analysis

Select
Hamming Distance
$D(H_R(A), H_R(B))$ = number of j's such that $h_j(A) \ne h_j(B)$
Theorem: For any A, B,
Proof? Exercise (apply Chernoff Bound)
Obtain ANN algorithm similar to earlier result

26
Generalization

Goal
abstract technique used for Hamming space
enable application to other metric spaces
handle Dynamic ANN
Dynamic Approximate r-Near Neighbors
Fix threshold r
Query: if any point within distance r of q, return any point within distance $(1+\epsilon) r$
Allow insertions/deletions of points in P
Recall: earlier method required preprocessing all possible queries in hash-range-space

27
Locality-Sensitive Hashing

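Fix metric space (M,D), threshold r, error $\epsilon > 0$
Choose probability parameters $Q_1 > Q_2 > 0$
Definition: Hash family $H = \{h : M \to S\}$ for (M,D) is called $(r, \epsilon, Q_1, Q_2)$-sensitive, if for random h and for any p, q in M
$D(p,q) < r \Rightarrow \Pr[h(p) = h(q)] \ge Q_1$
$D(p,q) > (1+\epsilon) r \Rightarrow \Pr[h(p) = h(q)] \le Q_2$
Intuition
p, q are near $\Rightarrow$ likely to collide
p, q are far $\Rightarrow$ unlikely to collide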

28
Examples

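Hamming Space $M = \{0,1\}^d$
point $p = b_1 \ldots b_d$
$H = \{h_i(b_1 \ldots b_d) = b_i$, for $i = 1, \ldots, d\}$
sampling one bit at random
$\Pr[h_i(q) = h_i(p)] = 1 - D(p,q)/d$
Set Similarity $D(A,B) = 1 - \mathrm{sim}(A,B)$
Recall
H
$\Pr[h(A) = h(B)] = 1 - D(A,B)$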

29
Multi-Index Hashing

Overall Idea
Fix LSH family H
Boost $Q_1$, $Q_2$ gap by defining $G = H^k$
Using G, each point hashes into l buckets
Intuition
r-near neighbors likely to collide
few non-near pairs in any bucket
Define
$G = \{g \mid g(p) = h_1(p) h_2(p) \cdots h_k(p)\}$
Hamming metric $\Rightarrow$ sample k random bits

30
Example (l = 4)

[Figure: points p, q, r hashed by $g_1, g_2, g_3, g_4$, each built from $h_1, \ldots, h_k$]
31
Overall Scheme

Preprocessing
Prepare hash table for range of G
Select l hash functions $g_1, g_2, \ldots, g_l$
Insert(p): add p to buckets $g_1(p), g_2(p), \ldots, g_l(p)$
Delete(p): remove p from buckets $g_1(p), g_2(p), \ldots, g_l(p)$
Query(q)
Check buckets $g_1(q), g_2(q), \ldots, g_l(q)$
Report nearest of (say) first 3l points
Complexity
Assume computing D(p,q) needs O(d) time
Assume storing p needs O(d) space
Insert/Delete/Query Time: $O(dlk)$
Preprocessing/Storage: $O(dN + Nlk)$
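
The scheme on this slide can be sketched for the Hamming metric, where each g_j samples k random bit positions. This is a minimal illustration under stated assumptions: points are bit strings, buckets are Python sets inside dictionaries, and the class name `BitSamplingLSH` and its parameters are hypothetical rather than taken from the lecture.

```python
import random
from typing import Dict, List, Optional, Set


def hamming(p: str, q: str) -> int:
    """Number of positions where bit strings p and q differ."""
    return sum(a != b for a, b in zip(p, q))


class BitSamplingLSH:
    """l hash tables; table j is keyed by g_j(p), the k sampled bits of p."""

    def __init__(self, d: int, k: int, l: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.l = l
        # g_j concatenates k functions h_i(p) = p[i] with i drawn at random.
        self.g = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables: List[Dict[str, Set[str]]] = [{} for _ in range(l)]

    def _key(self, j: int, p: str) -> str:
        return "".join(p[i] for i in self.g[j])

    def insert(self, p: str) -> None:
        for j in range(self.l):
            self.tables[j].setdefault(self._key(j, p), set()).add(p)

    def delete(self, p: str) -> None:
        for j in range(self.l):
            bucket = self.tables[j].get(self._key(j, p))
            if bucket is not None:
                bucket.discard(p)

    def query(self, q: str) -> Optional[str]:
        """Nearest of the first (at most) 3l distinct candidates in q's buckets."""
        seen: List[str] = []
        for j in range(self.l):
            for p in self.tables[j].get(self._key(j, q), ()):
                if p not in seen:
                    seen.append(p)
                if len(seen) >= 3 * self.l:
                    break
            if len(seen) >= 3 * self.l:
                break
        return min(seen, key=lambda p: hamming(p, q)) if seen else None


index = BitSamplingLSH(d=8, k=2, l=4, seed=1)
for point in ["00000000", "00000011", "11111111"]:
    index.insert(point)
print(index.query("00000001"))
```

Capping the scan at 3l candidates is what keeps the query cost at O(dl) distance work even when some buckets are crowded with far points.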

32
Collision Probability vs. Distance
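[Figure: collision probability on the vertical axis (ticks 0, $Q_2$, $Q_1$, 1) plotted against distance on the horizontal axis (marks at r)]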
33
Multi-Index versus Error

Set $l = N^z$ where
Theorem: For $l = N^z$, any query returns r-near neighbor correctly with probability at least 1/6.
Consequently (ignoring k = O(log N) factors)
Time $O(dN^z)$
Space $O(N^{1+z})$
Hamming Metric $\Rightarrow$
Boost Probability: use several parallel hash-tables

34
Analysis

Define (for fixed query q)
p: any point with $D(q,p) < r$
FAR(q): all p with $D(q,p) > (1+\epsilon) r$
BUCKET(q,j): all p with $g_j(p) = g_j(q)$
Event $E_{size}$
($\Rightarrow$ query cost bounded by O(dl))
Event $E_{NN}$: $g_j(p) = g_j(q)$ for some j
($\Rightarrow$ nearest point in l buckets is r-near neighbor)
Analysis
Show $\Pr[E_{size}] = x > 2/3$ and $\Pr[E_{NN}] = y > 1/2$
Thus $\Pr[\mathrm{not}(E_{size} \cap E_{NN})] < (1-x) + (1-y) < 5/6$

35
Analysis Bad Collisions

Choose
Fact
Clearly
Markov Inequality: $\Pr[X > r \cdot E[X]] < 1/r$, for $X > 0$
Lemma 1

36
Analysis Good Collisions

Observe $\Rightarrow$
Since $l = N^z$ $\Rightarrow$
Lemma 2: $\Pr[E_{NN}] > 1/2$

37
Euclidean Norms

Recall
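$x = (x_1, x_2, \ldots, x_d)$ and $y = (y_1, y_2, \ldots, y_d)$ in $R^d$
$L_1$-norm: $\|x - y\|_1 = \sum_{i=1}^{d} |x_i - y_i|$
$L_p$-norm (for $p > 1$): $\|x - y\|_p = \bigl(\sum_{i=1}^{d} |x_i - y_i|^p\bigr)^{1/p}$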

38
Extension to L1-Norm

Round coordinates to $\{1, \ldots, M\}$
Embed $L_1$ over $\{1, \ldots, M\}^d$ into Hamming $\{0,1\}^{dM}$
Unary Mapping
Apply algorithm for Hamming Spaces
Error due to rounding of 1/M ?
Space-Time Overhead due to mapping of $d \to dM$
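
The unary embedding can be sketched as below. The slide's own mapping formula is not shown in the transcript, so this assumes the usual choice: a coordinate x in {1,...,M} becomes x ones followed by M - x zeros. Under that assumption the Hamming distance of the images equals the L1 distance of the originals. All function names are illustrative.

```python
from typing import Sequence


def unary(x: int, M: int) -> str:
    """Map x in {1..M} to M bits: x ones followed by M - x zeros."""
    if not 1 <= x <= M:
        raise ValueError("coordinate out of range")
    return "1" * x + "0" * (M - x)


def embed(point: Sequence[int], M: int) -> str:
    """Concatenate the unary codes of all d coordinates (dM bits in total)."""
    return "".join(unary(x, M) for x in point)


def l1(p: Sequence[int], q: Sequence[int]) -> int:
    """L1 distance between integer points."""
    return sum(abs(a - b) for a, b in zip(p, q))


def hamming(p: str, q: str) -> int:
    """Number of positions where bit strings p and q differ."""
    return sum(a != b for a, b in zip(p, q))


p, q, M = (1, 4, 2), (3, 4, 5), 5
print(l1(p, q), hamming(embed(p, M), embed(q, M)))
```

Two unary codes differ in exactly |x - y| positions, which is why the distances match coordinate by coordinate; the price is the d to dM blow-up in dimension noted on the slide.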

39
Extension to L2-Norm

Observe
Little difference in L1-norm and L2-norm for high
d
Additional error is small
More generally $L_p$, for $1 \le p \le 2$
Figiel et al 1977, Johnson-Schechtman 1982
Can embed $L_p$ into $L_1$
Dimensions $d \to O(d)$
Distances preserved within factor $(1+\alpha)$
Key Idea: random rotation of space

40
Improved Bounds

Indyk-Motwani 1998
For any $L_p$-norm
Query Time $O(\log^3 N)$
Space
Problem: impractical
Today: only a high-level sketch

41
Better Reduction

Recall
Reduced Approximate Nearest Neighbors to
Approximate r-Near Neighbors
Space/Time Overhead O(log R)
R = max distance in metric space
Ring-Cover Trees
Removed dependence on R
Reduced overhead to O(polylog N)

42
Approximate r-Near Neighbors

Idea
Impose regular grid on $R^d$
Decompose into cubes of side length s
Label cubes with points at distance $< r$
Data Structure
Query q: determine cube containing q
Cube labels: candidate r-near neighbors
Goals
Small s $\Rightarrow$ lower error
Fewer cubes $\Rightarrow$ smaller storage
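
The grid idea can be sketched as follows for the Euclidean norm. This is an illustration under stated assumptions, not the lecture's data structure: cubes are half-open boxes indexed by integer tuples, only cubes that actually receive a label are stored, and the names `cell_of`, `dist_to_cell`, `build_grid`, and `query` are hypothetical. Building the labels enumerates on the order of (2r/s)^d cubes per point, which mirrors the storage concern raised on the slide.

```python
import math
from collections import defaultdict
from itertools import product
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]
Cell = Tuple[int, ...]


def cell_of(x: Sequence[float], s: float) -> Cell:
    """Index of the grid cube of side s that contains x."""
    return tuple(math.floor(c / s) for c in x)


def dist_to_cell(p: Sequence[float], cell: Cell, s: float) -> float:
    """Euclidean distance from p to the closest point of the cube `cell`."""
    gaps = (max(i * s - c, 0.0, c - (i + 1) * s) for c, i in zip(p, cell))
    return math.sqrt(sum(g * g for g in gaps))


def build_grid(points: Sequence[Point], r: float, s: float) -> Dict[Cell, List[Point]]:
    """Label every cube that comes within distance r of a point with that point."""
    labels: Dict[Cell, List[Point]] = defaultdict(list)
    for p in points:
        # Only cubes inside the bounding box of the r-ball around p can qualify.
        ranges = [range(math.floor((c - r) / s), math.floor((c + r) / s) + 1)
                  for c in p]
        for cell in product(*ranges):
            if dist_to_cell(p, cell, s) < r:
                labels[cell].append(p)
    return labels


def query(labels: Dict[Cell, List[Point]], q: Sequence[float], s: float) -> List[Point]:
    """Candidate r-near neighbors of q: the labels of the cube containing q."""
    return labels.get(cell_of(q, s), [])


grid = build_grid([(0.0, 0.0), (5.0, 5.0)], r=1.0, s=0.5)
print(query(grid, (0.3, 0.4), 0.5), query(grid, (2.5, 2.5), 0.5))
```

Every candidate returned for q lies within r plus one cube diameter of q, so shrinking s lowers the error while multiplying the number of labeled cubes.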

43

[Figure: points $p_1$, $p_2$, $p_3$]
44
Grid Analysis

Assume r = 1
Choose
Cube Diameter
Number of cubes
Theorem: For any $L_p$-norm, can solve Approx r-Near Neighbor using
Space
Time O(d)

45
Dimensionality Reduction

Johnson-Lindenstrauss 84, Frankl-Maehara 88: For $0 < \epsilon < 1$, can map points in P into subspace of dimension $O(\log N / \epsilon^2)$ while preserving all inter-point distances to within a factor $(1+\epsilon)$
Proof idea: project onto random lines
Result for NN
Space
Time O(polylog N)

46
References

P. Indyk and R. Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.
A. Gionis, P. Indyk, and R. Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
