1
CS 361A (Advanced Data Structures and Algorithms)
• Lecture 19 (Dec 5, 2005)
• Nearest Neighbors: Dimensionality Reduction and Locality-Sensitive Hashing
• Rajeev Motwani

2
Metric Space
• Metric Space (M,D)
• For points p,q in M, D(p,q) is the distance from p to q
• The only reasonable model for high-dimensional geometric space
• Defining Properties
• Reflexive: D(p,q) = 0 if and only if p = q
• Symmetric: D(p,q) = D(q,p)
• Triangle Inequality: D(p,q) ≤ D(p,r) + D(r,q)
• Interesting Cases
• M → points in d-dimensional space
• D → Hamming or Euclidean Lp-norms

3
High-Dimensional Near Neighbors
• Nearest Neighbors Data Structure
• Given N points P = {p1, …, pN} in metric space (M,D)
• Queries: Which point p ∈ P is closest to point q?
• Complexity: Trade off preprocessing space against query time
• Applications
• vector quantization
• multimedia databases
• data mining
• machine learning

4
Known Results
Query Time         | Storage          | Technique                          | Paper
d·N                | d·N              | Brute Force                        | —
2^d · log N        | N^(2^(d+1))      | Voronoi Diagram                    | Dobkin-Lipton 76
d^(d/2) · log N    | N^(d/2)          | Random Sampling                    | Clarkson 88
d^5 · log N        | N^d              | Combination                        | Meiser 93
log^(d-1) N        | N · log^(d-1) N  | Parametric Search                  | Agarwal-Matousek 92
• Some expressions are approximate
• Bottom line: exponential dependence on d

5
Approximate Nearest Neighbor
• Exact Algorithms
• Benchmark: brute force needs space O(N), query time O(N)
• Known Results: exponential dependence on dimension
• Theory/Practice: no better than brute-force search
• Approximate Near-Neighbors
• Given N points P = {p1, …, pN} in metric space (M,D)
• Given error parameter ε > 0
• Goal: for query q with nearest neighbor p, return a point r such that D(q,r) ≤ (1+ε)·D(q,p)
• Justification
• Mapping objects to a metric space is heuristic anyway
• Get a tremendous performance improvement

6
Results for Approximate NN
Query Time                  | Storage                        | Technique                          | Paper
d^d · ε^(−d)                | d·N                            | Balanced Trees                     | Arya et al 94
d² · polylog(N,d)           | (N^(2d) + d·N) · polylog(N,d)  | Random Projection                  | Kleinberg 97
log³ N                      | N^(1/ε²)                       | Search Trees + Dimension Reduction | Indyk-Motwani 98
d · N^(1/ε) · log² N        | N^(1+1/ε) · log N              | Locality-Sensitive Hashing         | Indyk-Motwani 98
(external memory)           | (external memory)              | Locality-Sensitive Hashing         | Gionis-Indyk-Motwani 99
• Will show the main ideas of the last 3 results
• Some expressions are approximate

7
Approximate r-Near Neighbors
• Given N points P = {p1, …, pN} in metric space (M,D)
• Given error parameter ε > 0 and distance threshold r > 0
• Query
• If there is no point p with D(q,p) < r, return FAILURE
• Else, return any p with D(q,p) < (1+ε)·r
• Application
• Solving Approximate Nearest Neighbor
• Assume maximum distance is R
• Run in parallel for r = 1, (1+ε), (1+ε)², …, R (see the sketch below)
• Indyk-Motwani reduce this to O(polylog N) overhead
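Below is a minimal Python sketch (mine, not from the slides) of this reduction; `near_neighbor(q, r)` stands for a hypothetical Approximate r-Near Neighbor oracle with the behavior specified above.

```python
def approx_nearest(q, near_neighbor, R, eps):
    """Reduce Approximate Nearest Neighbor to Approximate r-Near Neighbor.

    Tries radii 1, (1+eps), (1+eps)^2, ... up to the maximum distance R;
    the first radius that does not FAIL yields an approximate answer.
    `near_neighbor(q, r)` returns a point within (1+eps)*r of q, or None
    (FAILURE) when no point lies within distance r.
    """
    r = 1.0
    while r <= R * (1 + eps):
        p = near_neighbor(q, r)
        if p is not None:
            # No point lies within r/(1+eps), and p is within (1+eps)*r,
            # so p is roughly a (1+eps)^2-approximate nearest neighbor.
            return p
        r *= 1 + eps
    return None  # point set empty (or all points farther than R)
```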

8
Hamming Metric
• Hamming Space
• Points in M: bit-vectors {0,1}^d (can generalize to {0,1,2,…,q}^d)
• Hamming Distance: D(p,q) = number of positions where p,q differ (see the code sketch below)
• Remarks
• Simplest high-dimensional setting
• Still useful in practice
• In theory, as hard (or easy) as Euclidean space
• Trivial in low dimensions
• Example
• Hypercube in d = 3 dimensions
• 000, 001, 010, 011, 100, 101, 110, 111
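In code the metric is one line; a small illustration using the slide's d = 3 hypercube (Python is my choice of language here):

```python
def hamming(p, q):
    """Hamming distance: number of positions where bit-strings p, q differ."""
    assert len(p) == len(q)
    return sum(a != b for a, b in zip(p, q))

# Points of the d = 3 hypercube from the slide:
assert hamming("000", "111") == 3   # opposite corners
assert hamming("010", "011") == 1   # adjacent corners
```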

9
Dimensionality Reduction
• Overall Idea
• Map from high to low dimensions
• Preserve distances approximately
• Solve Nearest Neighbors in the new space
• Performance improvement at the cost of approximation error
• Mapping?
• Hash function family H = {H1, …, Hm}
• Each Hi: {0,1}^d → {0,1}^t with t << d
• Pick HR from H uniformly at random
• Map each point in P using the same HR
• Solve the NN problem on HR(P) = {HR(p1), …, HR(pN)}

10
Reduction for Hamming Spaces
• Theorem: For any r and small ε > 0, there is a hash family H such that, for any p,q and random HR ∈ H, with probability > 1−δ:
• D(p,q) < r ⇒ D(HR(p),HR(q)) < (1+ε/12)·c·t
• D(p,q) > (1+ε)·r ⇒ D(HR(p),HR(q)) > (1+ε/12)·c·t
• provided, for some constant C, t ≥ C·ε^(−2)·log(2/δ)
11
Remarks
• For fixed threshold r, can distinguish between
• Near: D(p,q) < r
• Far: D(p,q) > (1+ε)·r
• For N points, need δ = O(1/N²) (union bound over all pairs), i.e. t = O(ε^(−2)·log N)
• Yet, can reduce to O(log N)-dimensional space, while approximately preserving distances
• Works even if the points are not known in advance

12
Hash Family
• Projection Function
• Let S be an ordered multiset of s indexes from {1,…,d}
• pS: {0,1}^d → {0,1}^s projects p onto the s coordinates in S
• Example
• d = 5, p = 01100
• s = 3, S = {2,2,4} ⇒ pS = 110
• Choosing hash function HR in H
• Repeat for i = 1,…,t
• Pick Si randomly (with replacement) from {1,…,d}
• Pick a random hash function fi: {0,1}^s → {0,1}
• hi(p) = fi(pSi)
• HR(p) = (h1(p), h2(p), …, ht(p))
• Remark: note the similarity to Bloom Filters (a code sketch follows below)
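A minimal sketch of this hash family, assuming points are given as bit-strings; representing each random fi lazily as a dictionary of random bits is my implementation choice, not the slides'.

```python
import random

def make_HR(d, s, t, seed=None):
    """Sample HR = (h1,...,ht), where hi(p) = fi(p projected onto Si)."""
    rng = random.Random(seed)
    # Each Si: s indexes sampled with replacement from {0,...,d-1}.
    S = [[rng.randrange(d) for _ in range(s)] for _ in range(t)]
    # Each fi: {0,1}^s -> {0,1}, built lazily with a fresh random bit per key.
    f = [dict() for _ in range(t)]

    def HR(p):  # p is a length-d bit-string
        bits = []
        for i in range(t):
            pSi = "".join(p[j] for j in S[i])   # projection of p onto Si
            if pSi not in f[i]:
                f[i][pSi] = rng.randrange(2)
            bits.append(str(f[i][pSi]))
        return "".join(bits)                    # a point in {0,1}^t

    return HR

HR = make_HR(d=5, s=3, t=4, seed=0)
print(HR("01100"))  # the example point mapped to a 4-bit image
```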

13
Illustration of Hashing
(Figure: a point p = 0110…0 in {0,1}^d is projected onto the index sets S1,…,St; each projection pSi ∈ {0,1}^s is hashed by fi to a single bit hi(p), and the bits are concatenated into HR(p) ∈ {0,1}^t.)
14
Analysis I
• Choose a random index-set S
• Claim: For any p,q: Pr[pS = qS] = (1 − D(p,q)/d)^s
• Why?
• p,q differ in D(p,q) bit positions
• Need all s indexes of S to avoid these positions
• Sampling with replacement from {1,…,d} makes the s choices independent
15
Analysis II
• Choose s = d/r
• Since 1−x ≤ e^(−x) for x < 1, we obtain (1 − D(p,q)/d)^(d/r) ≤ e^(−D(p,q)/r)
• Thus Pr[pS = qS] ≤ e^(−D(p,q)/r)

16
Analysis III
• Recall hi(p) = fi(pSi)
• Thus, since a random fi separates two distinct projections with probability ½: Pr[hi(p) ≠ hi(q)] = ½·(1 − Pr[pSi = qSi]) = ½·(1 − (1 − D(p,q)/d)^s)
• Choosing c = ½·(1 − e^(−1))

17
Analysis IV
• Recall HR(p) = (h1(p), h2(p), …, ht(p))
• D(HR(p),HR(q)) = number of i's where hi(p), hi(q) differ
• By linearity of expectation, E[D(HR(p),HR(q))] = t·Pr[hi(p) ≠ hi(q)]
• Theorem almost proved (in expectation)
• For the high-probability bound, we need the Chernoff Bound

18
Chernoff Bound
• Consider Bernoulli random variables X1, X2, …, Xn
• Values are 0-1
• Pr[Xi = 1] = x and Pr[Xi = 0] = 1−x
• Define X = X1+X2+…+Xn with E[X] = nx
• Theorem: For independent X1,…,Xn and any 0 < δ < 1,
Pr[ |X − nx| > δ·nx ] < 2·e^(−δ²·nx/3)
19
Analysis V
• Define
• Xi = 0 if hi(p) = hi(q), and 1 otherwise
• n = t
• Then X = X1+X2+…+Xt = D(HR(p),HR(q))
• Case 1: D(p,q) < r ⇒ x < c
• Case 2: D(p,q) > (1+ε)·r ⇒ x > c·(1+ε/6)
• Observe the sloppy bounding of constants in Case 2

20
Putting it all together
• Recall t = C·ε^(−2)·log(2/δ)
• Thus, by the Chernoff Bound with deviation ε/12 around the threshold c·t, the error probability is at most 2·e^(−ε²·c·t/432) = 2·e^(−C·c·log(2/δ)/432)
• Choosing C = 1200/c makes this less than δ
• Theorem is proved!!

21
Algorithm I
• Set error probability δ = 1/poly(N)
• Select hash HR and map points p → HR(p)
• Processing query q
• Compute HR(q)
• Find the nearest neighbor HR(p) of HR(q)
• If D(HR(q),HR(p)) ≤ (1+ε/12)·c·t, return p; else FAILURE
• Remarks
• Brute force for finding HR(p) implies query time O(N·t) = O(N·ε^(−2)·log N)
• Need another approach to exploit the lower dimension

22
Algorithm II
• Fact: Exact nearest neighbors in {0,1}^t requires
• Space O(2^t)
• Query time O(t)
• How?
• Precompute/store answers to all queries (a sketch follows below)
• Number of possible queries is 2^t
• Since t = O(ε^(−2)·log N), we get 2^t = N^(O(1/ε²))
• Theorem: In Hamming space {0,1}^d, can solve approximate nearest neighbor with
• Space N^(O(1/ε²))
• Query time O(t) = O(ε^(−2)·log N), plus the time to compute HR(q)
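A sketch of the lookup-table idea, assuming small t and the `hamming` function from the earlier sketch:

```python
from itertools import product

def build_table(hashed_points, t):
    """Precompute the exact nearest neighbor for every t-bit query.

    O(2^t) space; afterwards each query is an O(t) dictionary lookup.
    `hashed_points` are the images HR(p1),...,HR(pN) as bit-strings.
    """
    table = {}
    for bits in product("01", repeat=t):
        q = "".join(bits)
        table[q] = min(hashed_points, key=lambda p: hamming(q, p))
    return table
```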

23
Different Metric
• Many applications have sparse points
• Many dimensions, but few 1s
• Example: points → documents, dimensions → words
• Better to view the points as sets
• The previous approach would require large s
• For sets A,B, define sim(A,B) = |A ∩ B| / |A ∪ B|
• Observe
• A = B ⇒ sim(A,B) = 1
• A,B disjoint ⇒ sim(A,B) = 0
• Question: Handling D(A,B) = 1 − sim(A,B)?

24
Min-Hash
• Random permutations π1,…,πt of the universe (the dimensions)
• Define mapping hj(A) = min over a ∈ A of πj(a)
• Fact: Pr[hj(A) = hj(B)] = sim(A,B)
• Overall hash-function
• HR(A) = (h1(A), h2(A), …, ht(A)) (see the sketch below)
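A minimal Min-Hash sketch, assuming a universe {0,…,u−1} and sets of integers; the estimate below concentrates around sim(A,B):

```python
import random

def make_minhash(u, t, seed=None):
    """t random permutations pi_1..pi_t of {0,...,u-1}; hj(A) = min of pi_j over A."""
    rng = random.Random(seed)
    perms = []
    for _ in range(t):
        pi = list(range(u))
        rng.shuffle(pi)
        perms.append(pi)
    return lambda A: tuple(min(pi[a] for a in A) for pi in perms)

HR = make_minhash(u=1000, t=200, seed=0)
A, B = {1, 2, 3, 4}, {2, 3, 4, 5}
est = sum(x == y for x, y in zip(HR(A), HR(B))) / 200
print(est)  # close to sim(A,B) = |A ∩ B| / |A ∪ B| = 3/5
```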

25
Min-Hash Analysis
• Select t = O(ε^(−2)·log N)
• Hamming Distance
• D(HR(A),HR(B)) = number of j's such that hj(A) ≠ hj(B)
• Theorem: For any A,B, E[D(HR(A),HR(B))] = t·(1 − sim(A,B)) = t·D(A,B), with concentration around the mean
• Proof? Exercise (apply the Chernoff Bound)
• Obtain an ANN algorithm similar to the earlier result

26
Generalization
• Goal
• Abstract the technique used for Hamming space
• Enable application to other metric spaces
• Handle Dynamic ANN
• Dynamic Approximate r-Near Neighbors
• Fix threshold r
• Query: if any point lies within distance r of q, return any point within distance (1+ε)·r
• Allow insertions/deletions of points in P
• Recall: the earlier method required preprocessing all possible queries in the hash range space

27
Locality-Sensitive Hashing
• Fix metric space (M,D), threshold r, error ε > 0
• Choose probability parameters Q1 > Q2 > 0
• Definition: Hash family H = {h: M → S} for (M,D) is called (r, (1+ε)·r, Q1, Q2)-sensitive if, for random h and for any p,q in M:
• D(p,q) < r ⇒ Pr[h(p) = h(q)] > Q1
• D(p,q) > (1+ε)·r ⇒ Pr[h(p) = h(q)] < Q2
• Intuition
• p,q near ⇒ likely to collide
• p,q far ⇒ unlikely to collide

28
Examples
• Hamming Space M = {0,1}^d
• Point p = b1…bd
• H = {hi : hi(b1…bd) = bi, for i = 1,…,d}
• Sampling one bit at random
• Pr[hi(q) = hi(p)] = 1 − D(p,q)/d
• Set Similarity D(A,B) = 1 − sim(A,B)
• Recall Pr[h(A) = h(B)] = sim(A,B)
• H = {min-hash functions over random permutations}
• Pr[h(A) = h(B)] = 1 − D(A,B)

29
Multi-Index Hashing
• Overall Idea
• Fix LSH family H
• Boost the Q1, Q2 gap by defining G = H^k
• Using G, each point hashes into l buckets
• Intuition
• r-near neighbors are likely to collide
• Few non-near pairs fall in any bucket
• Define
• G = {g : g(p) = h1(p)·h2(p)···hk(p)} (concatenation of k hash values)
• Hamming metric ⇒ sample k random bits

30
Example (l = 4)
(Figure: a point p and query q are hashed by four functions g1,…,g4, each a concatenation of h1,…,hk; p and q land in the same bucket of some of the four tables when they are within distance r.)
31
Overall Scheme
• Preprocessing
• Prepare a hash table for the range of G
• Select l hash functions g1, g2, …, gl
• Insert(p): add p to buckets g1(p), g2(p), …, gl(p)
• Delete(p): remove p from buckets g1(p), g2(p), …, gl(p)
• Query(q)
• Check buckets g1(q), g2(q), …, gl(q)
• Report the nearest of (say) the first 3l points (a Python sketch of the full scheme follows below)
• Complexity
• Assume computing D(p,q) needs O(d) time
• Assume storing p needs O(d) space
• Insert/Delete/Query Time O(d·l·k)
• Preprocessing/Storage O(d·N + N·l·k)
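A sketch of the whole scheme for the Hamming metric, assuming bit-string points and the `hamming` function from the earlier sketch; here each gj samples k random bit positions, as suggested on slide 29.

```python
import random
from collections import defaultdict

class MultiIndexLSH:
    def __init__(self, d, k, l, seed=None):
        rng = random.Random(seed)
        # g_j = concatenation of k sampled bits (H^k for the Hamming family).
        self.g = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(list) for _ in range(l)]

    def _key(self, j, p):
        return "".join(p[i] for i in self.g[j])

    def insert(self, p):
        for j, table in enumerate(self.tables):
            table[self._key(j, p)].append(p)

    def delete(self, p):
        for j, table in enumerate(self.tables):
            bucket = table[self._key(j, p)]
            if p in bucket:
                bucket.remove(p)

    def query(self, q):
        """Report the nearest of (roughly) the first 3l candidate points."""
        limit = 3 * len(self.tables)
        candidates = []
        for j, table in enumerate(self.tables):
            candidates.extend(table[self._key(j, q)])
            if len(candidates) >= limit:
                break
        candidates = candidates[:limit]
        return min(candidates, key=lambda p: hamming(q, p)) if candidates else None
```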

32
Collision Probability vs. Distance
(Figure: collision probability as a function of distance; it starts near 1, is above Q1 at distance r, and drops below Q2 at distance (1+ε)·r.)
33
Multi-Index versus Error
• Set l = N^z where z = log(1/Q1) / log(1/Q2)
• Theorem: For l = N^z, any query returns an r-near neighbor correctly with probability at least 1/6.
• Consequently (ignoring k = O(log N) factors)
• Time O(d·N^z)
• Space O(N^(1+z))
• Hamming Metric ⇒ z ≤ 1/(1+ε)
• Boost Probability: use several parallel hash tables

34
Analysis
• Define (for a fixed query q)
• p* = some point with D(q,p*) < r
• FAR(q) = {all p with D(q,p) > (1+ε)·r}
• BUCKET(q,j) = {all p with gj(p) = gj(q)}
• Event Esize: Σj |BUCKET(q,j) ∩ FAR(q)| < 3·l
• (⇒ query cost bounded by O(d·l))
• Event ENN: gj(p*) = gj(q) for some j
• (⇒ the nearest point found in the l buckets is an r-near neighbor)
• Analysis
• Show Pr[Esize] = x > 2/3 and Pr[ENN] = y > 1/2
• Thus Pr[not(Esize and ENN)] ≤ (1−x) + (1−y) < 5/6

35
• Choose k = log_{1/Q2} N
• Fact: For any p ∈ FAR(q), Pr[gj(p) = gj(q)] ≤ Q2^k = 1/N
• Clearly, E[Σj |BUCKET(q,j) ∩ FAR(q)|] ≤ l
• Markov Inequality: Pr[X > r·E[X]] < 1/r, for X > 0
• Lemma 1: Pr[Σj |BUCKET(q,j) ∩ FAR(q)| ≥ 3·l] < 1/3, hence Pr[Esize] > 2/3

36
Analysis: Good Collisions
• Observe: Pr[gj(p*) = gj(q)] ≥ Q1^k = Q1^(log_{1/Q2} N) = N^(−z)
• Since l = N^z ⇒ Pr[gj(p*) ≠ gj(q) for all j] ≤ (1 − N^(−z))^(N^z) ≤ 1/e < 1/2
• Lemma 2: Pr[ENN] > 1/2

37
Euclidean Norms
• Recall
• x = (x1, x2, …, xd) and y = (y1, y2, …, yd) in R^d
• L1-norm: D(x,y) = Σi |xi − yi|
• Lp-norm (for p > 1): D(x,y) = (Σi |xi − yi|^p)^(1/p) (written out in code below)
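The formulas above, written out directly:

```python
def l1(x, y):
    """L1 distance: sum of coordinate-wise absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def lp(x, y, p):
    """Lp distance for p >= 1."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

assert l1((0, 0), (3, 4)) == 7
assert lp((0, 0), (3, 4), 2) == 5.0  # L2 (Euclidean)
```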

38
Extension to L1-Norm
• Round coordinates to {1,…,M}
• Embed L1-{1,…,M}^d into Hamming {0,1}^(dM)
• Unary Mapping: each coordinate value x becomes 1^x 0^(M−x), so Hamming distance equals L1 distance (see the sketch below)
• Apply the algorithm for Hamming Spaces
• Error due to rounding: 1/M per coordinate
• Space-Time overhead due to the mapping: d → dM
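A sketch of the unary embedding, reusing `hamming` and `l1` from the earlier sketches; it checks the key property that Hamming distance on the images equals L1 distance on the originals.

```python
def unary_embed(x, M):
    """Map a point in {1,...,M}^d to {0,1}^(d*M): value v becomes 1^v 0^(M-v)."""
    return "".join("1" * v + "0" * (M - v) for v in x)

p, q, M = (2, 5), (4, 1), 5
assert hamming(unary_embed(p, M), unary_embed(q, M)) == l1(p, q)  # both equal 6
```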

39
Extension to L2-Norm
• Observe
• Little difference between the L1-norm and L2-norm for high d
• More generally Lp, for 1 ≤ p ≤ 2
• Figiel et al 1977, Johnson-Schechtman 1982
• Can embed Lp into L1
• Dimensions: d → O(d)
• Distances preserved within factor (1+α)
• Key Idea: random rotation of space

40
Improved Bounds
• Indyk-Motwani 1998
• For any Lp-norm
• Query Time O(log³ N)
• Space N^(O(1/ε²))
• Problem: impractical
• Today: only a high-level sketch

41
Better Reduction
• Recall
• Reduced Approximate Nearest Neighbors to Approximate r-Near Neighbors
• R = maximum distance in the metric space
• Ring-Cover Trees
• Removed the dependence on R
• Reduced the overhead to O(polylog N)

42
Approximate r-Near Neighbors
• Idea
• Impose a regular grid on R^d
• Decompose into cubes of side length s
• Label cubes with all points at distance < r
• Data Structure
• Query q: determine the cube containing q
• Cube labels = candidate r-near neighbors
• Goals
• Small s ⇒ lower error
• Fewer cubes ⇒ smaller storage

43

(Figure: a grid over the plane; each cell is labeled with the nearby points p1, p2, p3 whose r-balls intersect it.)
44
Grid Analysis
• Assume r = 1
• Choose side length s = ε/√d
• Cube Diameter: s·√d = ε
• Number of cubes: roughly O(√d/ε)^d per unit ball
• Theorem: For any Lp-norm, can solve Approximate r-Near Neighbor using
• Space N·O(√d/ε)^d
• Time O(d)

45
Dimensionality Reduction
• Johnson-Lindenstrauss 84, Frankl-Maehara 88: For any 0 < ε < 1, can map the points in P into a subspace of dimension O(ε^(−2)·log N) while preserving all inter-point distances to within a factor of (1+ε)
• Proof idea: project onto random lines (see the sketch below)
• Result for NN
• Space N^(O(1/ε²))
• Time O(polylog N)
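A minimal sketch of the random-projection idea behind the lemma (Gaussian entries are one standard choice; the exact distribution in the original proofs differs):

```python
import math
import random

def jl_project(points, k, seed=None):
    """Project d-dimensional points onto k random directions.

    With k = O(log N / eps^2), all pairwise distances are preserved
    within a (1+eps) factor with high probability.
    """
    rng = random.Random(seed)
    d = len(points[0])
    # k x d matrix with N(0, 1/k) entries, so squared lengths are
    # preserved in expectation.
    R = [[rng.gauss(0, 1 / math.sqrt(k)) for _ in range(d)] for _ in range(k)]
    return [tuple(sum(row[i] * p[i] for i in range(d)) for row in R)
            for p in points]
```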

46
References
• "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality." P. Indyk and R. Motwani. STOC 1998.
• "Similarity Search in High Dimensions via Hashing." A. Gionis, P. Indyk, and R. Motwani. VLDB 1999.