
CS 361A (Advanced Data Structures and Algorithms)

- Lecture 19 (Dec 5, 2005)
- Nearest Neighbors, Dimensionality Reduction, and Locality-Sensitive Hashing
- Rajeev Motwani

Metric Space

- Metric Space (M,D)
- For points p,q in M, D(p,q) is the distance from p to q
- Only reasonable model for high-dimensional geometric space
- Defining Properties
- Reflexive: D(p,q) = 0 if and only if p = q
- Symmetric: D(p,q) = D(q,p)
- Triangle Inequality: D(p,q) ≤ D(p,r) + D(r,q)
- Interesting Cases
- M = points in d-dimensional space
- D = Hamming or Euclidean (Lp) norms

High-Dimensional Near Neighbors

- Nearest Neighbors Data Structure
- Given N points P = {p1, ..., pN} in metric space (M,D)
- Queries: which point p ∈ P is closest to query point q?
- Complexity: tradeoff of preprocessing space with query time
- Applications
- vector quantization
- multimedia databases
- data mining
- machine learning
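As a concrete baseline for the problem statement above, a brute-force scan answers each query in O(dN) time with O(dN) space. This is a minimal sketch (not from the slides); the Euclidean metric and the sample points are illustrative choices.

```python
# Brute-force nearest neighbor: scan all N points, O(dN) per query.
import math

def euclidean(p, q):
    """D(p,q) for points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_neighbor(P, q, D=euclidean):
    """Return the point p in P minimizing D(p,q)."""
    return min(P, key=lambda p: D(p, q))

P = [(0, 0), (5, 5), (1, 2)]
print(nearest_neighbor(P, (1, 1)))  # → (1, 2)
```

Every result in the tables below is measured against this O(dN)/O(dN) benchmark.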

Known Results

Query Time      | Storage        | Technique         | Paper
dN              | dN             | Brute Force       | --
2^d log N       | N^(2^(d+1))    | Voronoi Diagram   | Dobkin-Lipton 76
d^(d/2) log N   | N^(d/2)        | Random Sampling   | Clarkson 88
d^5 log N       | N^d            | Combination       | Meiser 93
log^(d-1) N     | N log^(d-1) N  | Parametric Search | Agarwal-Matousek 92

- Some expressions are approximate
- Bottom line: exponential dependence on d

Approximate Nearest Neighbor

- Exact Algorithms
- Benchmark: brute-force needs space O(N), query time O(N)
- Known Results: exponential dependence on dimension
- Theory/Practice: no better than brute-force search
- Approximate Near-Neighbors
- Given N points P = {p1, ..., pN} in metric space (M,D)
- Given error parameter ε > 0
- Goal: for query q with nearest neighbor p, return p' such that D(q,p') ≤ (1+ε) D(q,p)
- Justification
- Mapping objects to metric space is a heuristic anyway
- Get tremendous performance improvement

Results for Approximate NN

Query Time             | Storage              | Technique                           | Paper
d^d ε^(-d) log N       | dN                   | Balanced Trees                      | Arya et al 94
d^2 polylog(N,d)       | (N log d)^(2d)       | Random Projection                   | Kleinberg 97
N + d polylog(N,d)     | dN polylog(N,d)      | Random Projection                   | Kleinberg 97
log^3 N                | N^(1/ε^2)            | Search Trees + Dimension Reduction  | Indyk-Motwani 98
d N^(1/(1+ε)) log^2 N  | N^(1+1/(1+ε)) log N  | Locality-Sensitive Hashing          | Indyk-Motwani 98
External Memory        | External Memory      | Locality-Sensitive Hashing          | Gionis-Indyk-Motwani 99

- Will show main ideas of the last 3 results
- Some expressions are approximate

Approximate r-Near Neighbors

- Given N points P = {p1, ..., pN} in metric space (M,D)
- Given error parameter ε > 0, distance threshold r > 0
- Query
- If no point p with D(q,p) < r, return FAILURE
- Else, return any p' with D(q,p') < (1+ε)r
- Application
- Solving Approximate Nearest Neighbor
- Assume maximum distance is R
- Run in parallel for r = 1, (1+ε), (1+ε)^2, ..., R
- Time/space O(log R) overhead
- Indyk-Motwani reduce to O(polylog N) overhead
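The reduction from Approximate Nearest Neighbor to r-Near Neighbor can be sketched as follows. The brute-force `r_near` subroutine here is only an illustrative stand-in for a real r-near data structure, and the L1 metric and parameter values are assumptions for the demo.

```python
def r_near(P, q, r, D):
    """Return some p with D(q,p) < r, else None (exact stand-in oracle)."""
    for p in P:
        if D(p, q) < r:
            return p
    return None

def ann_via_r_near(P, q, D, eps, R):
    """Try thresholds r = 1, (1+eps), (1+eps)^2, ... up to ~R; first hit wins."""
    r = 1.0
    while r <= R * (1 + eps):
        p = r_near(P, q, r, D)
        if p is not None:
            return p
        r *= 1 + eps
    return None

D = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))  # L1 metric
P = [(0, 0), (10, 10), (3, 3)]
print(ann_via_r_near(P, (4, 4), D, eps=0.5, R=100))  # → (3, 3)
```

The geometric sequence of thresholds is what produces the O(log R) time/space overhead quoted above.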

Hamming Metric

- Hamming Space
- Points in M: bit-vectors {0,1}^d (can generalize to {0,1,2,...,q}^d)
- Hamming Distance: D(p,q) = number of positions where p,q differ
- Remarks
- Simplest high-dimensional setting
- Still useful in practice
- In theory, as hard (or easy) as Euclidean space
- Trivial in low dimensions
- Example
- Hypercube in d = 3 dimensions
- 000, 001, 010, 011, 100, 101, 110, 111
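The metric above is simple enough to state in a few lines; this sketch represents points as bit strings, an illustrative encoding choice.

```python
# Hamming distance over bit-vectors in {0,1}^d, given as bit strings.
def hamming(p, q):
    """Number of positions where p and q differ."""
    assert len(p) == len(q)
    return sum(1 for a, b in zip(p, q) if a != b)

# On the d = 3 hypercube, adjacent corners are at distance 1.
print(hamming("000", "001"))  # → 1
print(hamming("000", "111"))  # → 3
```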

Dimensionality Reduction

- Overall Idea
- Map from high to low dimensions
- Preserve distances approximately
- Solve Nearest Neighbors in new space
- Performance improvement at cost of approximation error
- Mapping?
- Hash function family H = {H1, ..., Hm}
- Each Hi: {0,1}^d → {0,1}^t with t << d
- Pick HR from H uniformly at random
- Map each point in P using the same HR
- Solve NN problem on HR(P) = {HR(p1), ..., HR(pN)}

Reduction for Hamming Spaces

- Theorem: For any r and small ε > 0, there is a hash family H such that, for any p,q and random HR ∈ H, with probability > 1-δ:
- D(p,q) < r ⇒ D(HR(p), HR(q)) < c(1 + ε/12) t
- D(p,q) > (1+ε)r ⇒ D(HR(p), HR(q)) > c(1 + ε/12) t
- provided t ≥ C ε^(-2) log(N/δ), for some constant C

Remarks

- For fixed threshold r, can distinguish between
- Near: D(p,q) < r
- Far: D(p,q) > (1+ε)r
- For N points, need failure probability δ small enough to union-bound over all pairs
- Yet, can reduce to O(log N)-dimensional space, while approximately preserving distances
- Works even if points not known in advance

Hash Family

- Projection Function
- Let S be an ordered multiset of s indexes from {1,...,d}
- p|S: {0,1}^d → {0,1}^s projects p onto the s coordinates in S
- Example
- d = 5, p = 01100
- s = 3, S = {2,2,4} ⇒ p|S = 110
- Choosing hash function HR in H
- Repeat for i = 1,...,t
- Pick Si randomly (with replacement) from {1,...,d}
- Pick random hash function fi: {0,1}^s → {0,1}
- hi(p) = fi(p|Si)
- HR(p) = (h1(p), h2(p), ..., ht(p))
- Remark: note similarity to Bloom Filters
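The construction above can be sketched directly. The helper name `make_HR`, the lazy dictionary realization of each random fi, and the seeded rng are implementation assumptions, not from the slides.

```python
# Sketch of the hash family: t random index multisets S_i (sampled with
# replacement) and random functions f_i, giving HR(p) = (h_1(p),...,h_t(p)).
import random

def make_HR(d, s, t, rng):
    S = [[rng.randrange(d) for _ in range(s)] for _ in range(t)]  # index sets S_i
    f = [dict() for _ in range(t)]                                # lazy random f_i

    def HR(p):
        out = []
        for i in range(t):
            proj = "".join(p[j] for j in S[i])   # p|S_i, an s-bit string
            if proj not in f[i]:
                f[i][proj] = rng.randrange(2)    # draw the random bit f_i(p|S_i)
            out.append(f[i][proj])
        return tuple(out)
    return HR

rng = random.Random(0)
HR = make_HR(d=5, s=3, t=8, rng=rng)
print(HR("01100"))                  # an 8-bit image of the 5-bit point
print(HR("01100") == HR("01100"))   # → True (same point always maps the same way)
```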

Illustration of Hashing

[Figure: a point p = 0110...0 in {0,1}^d is projected onto index sets S1, ..., St; each projection p|Si ∈ {0,1}^s is fed to fi, producing the bits h1(p), ..., ht(p) that form HR(p).]

Analysis I

- Choose random index-set S
- Claim: For any p,q, Pr[p|S = q|S] = (1 - D(p,q)/d)^s
- Why?
- p,q differ in D(p,q) bit positions
- Need all s indexes of S to avoid these positions
- Sampling with replacement from {1,...,d}
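For tiny d and s, the claim can be checked exactly by enumerating all d^s ordered index multisets. The toy sizes d = 4, s = 2 are illustrative choices.

```python
# Exact check of Pr[p|S = q|S] = (1 - D(p,q)/d)^s by full enumeration.
from itertools import product

def collision_prob(p, q, s):
    d = len(p)
    hits = sum(1 for S in product(range(d), repeat=s)
               if all(p[j] == q[j] for j in S))   # p|S == q|S
    return hits / d ** s

p, q = "0110", "0100"          # differ in D(p,q) = 1 position
print(collision_prob(p, q, 2))  # → 0.5625
print((1 - 1 / 4) ** 2)         # (1 - D/d)^s → 0.5625
```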

Analysis II

- Choose s = d/r
- Since 1-x ≤ e^(-x) for x < 1, we obtain Pr[p|S = q|S] = (1 - D(p,q)/d)^(d/r) ≈ e^(-D(p,q)/r)
- Thus
- D(p,q) < r ⇒ Pr[p|S = q|S] > e^(-1)
- D(p,q) > (1+ε)r ⇒ Pr[p|S = q|S] < e^(-(1+ε))

Analysis III

- Recall hi(p) = fi(p|Si)
- Thus Pr[hi(p) ≠ hi(q)] = ½ (1 - Pr[p|Si = q|Si])
- Choosing c = ½ (1 - e^(-1)):
- D(p,q) < r ⇒ Pr[hi(p) ≠ hi(q)] < c
- D(p,q) > (1+ε)r ⇒ Pr[hi(p) ≠ hi(q)] > ½ (1 - e^(-(1+ε))) ≥ c (1 + ε/6)

Analysis IV

- Recall HR(p) = (h1(p), h2(p), ..., ht(p))
- D(HR(p), HR(q)) = number of i's where hi(p), hi(q) differ
- By linearity of expectation: E[D(HR(p), HR(q))] = t · Pr[hi(p) ≠ hi(q)]
- Theorem almost proved (in expectation)
- For high-probability bound, need Chernoff Bound

Chernoff Bound

- Consider Bernoulli random variables X1, X2, ..., Xn
- Values are 0-1
- Pr[Xi = 1] = x and Pr[Xi = 0] = 1-x
- Define X = X1 + X2 + ... + Xn with E[X] = nx
- Theorem: For independent X1, ..., Xn, for any 0 < β < 1,
  Pr[ |X - nx| > βnx ] < 2e^(-β²nx/3)
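A quick numeric sanity check of the bound: compute the exact binomial tail probability and compare it against 2e^(-β²nx/3). The sample parameters n, x, β are chosen here purely for illustration.

```python
# Exact binomial tail vs. the Chernoff upper bound.
import math

def tail_exact(n, x, beta):
    """Pr[|X - nx| > beta*n*x] computed exactly from the binomial distribution."""
    lo, hi = n * x * (1 - beta), n * x * (1 + beta)
    return sum(math.comb(n, k) * x ** k * (1 - x) ** (n - k)
               for k in range(n + 1) if not (lo <= k <= hi))

def chernoff(n, x, beta):
    return 2 * math.exp(-beta ** 2 * n * x / 3)

n, x, beta = 200, 0.5, 0.3
print(tail_exact(n, x, beta) <= chernoff(n, x, beta))  # → True (the bound holds)
```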

Analysis V

- Define
- Xi = 0 if hi(p) = hi(q), and 1 otherwise
- n = t
- Then X = X1 + X2 + ... + Xt = D(HR(p), HR(q))
- Case 1: D(p,q) < r ⇒ x ≤ c
- Case 2: D(p,q) > (1+ε)r ⇒ x ≥ c(1 + ε/6)
- Observe: sloppy bounding of constants in Case 2

Putting it all together

- Recall t ≥ C ε^(-2) log(N/δ)
- Thus, error probability (by Chernoff with β = Θ(ε)) is at most 2e^(-Ω(ε² c t)) ≤ δ
- Choosing C = 1200/c suffices
- Theorem is proved!!

Algorithm I

- Set error probability δ
- Select hash HR and map points p → HR(p)
- Processing query q
- Compute HR(q)
- Find nearest neighbor HR(p) of HR(q)
- If D(HR(q), HR(p)) ≤ c(1 + ε/12) t, then return p, else FAILURE
- Remarks
- Brute-force for finding HR(p) implies query time O(Nt)
- Need another approach for the lower-dimensional space

Algorithm II

- Fact: Exact nearest neighbors in {0,1}^t requires
- Space O(2^t)
- Query time O(t)
- How?
- Precompute/store answers to all queries
- Number of possible queries is 2^t
- Since t = O(ε^(-2) log N), space is 2^t = N^O(1/ε²)
- Theorem: In Hamming space {0,1}^d, can solve approximate nearest neighbor with
- Space N^O(1/ε²)
- Query time O(d + ε^(-2) log N)

Different Metric

- Many applications have sparse points
- Many dimensions but few 1s
- Example: points = documents, dimensions = words
- Better to view points as sets
- Previous approach would require large s
- For sets A,B, define sim(A,B) = |A ∩ B| / |A ∪ B| (Jaccard similarity)
- Observe
- A = B ⇒ sim(A,B) = 1
- A,B disjoint ⇒ sim(A,B) = 0
- Question: Handling D(A,B) = 1 - sim(A,B)?

Min-Hash

- Random permutations π1, ..., πt of the universe (dimensions)
- Define mapping hj(A) = min{πj(a) : a ∈ A}
- Fact: Pr[hj(A) = hj(B)] = sim(A,B)
- Proof? already seen!!
- Overall hash-function
- HR(A) = (h1(A), h2(A), ..., ht(A))
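The Min-Hash signature above can be sketched with explicit random permutations. The universe size, number of permutations, and seeded rng are illustrative choices.

```python
# MinHash: h_j(A) = min over a in A of pi_j(a); the fraction of agreeing
# signature coordinates estimates sim(A,B).
import random

def minhash_signature(A, perms):
    return tuple(min(pi[a] for a in A) for pi in perms)

def estimate_sim(A, B, perms):
    sa, sb = minhash_signature(A, perms), minhash_signature(B, perms)
    return sum(1 for u, v in zip(sa, sb) if u == v) / len(perms)

universe = list(range(20))
rng = random.Random(1)
perms = []
for _ in range(200):                      # t = 200 random permutations
    order = universe[:]
    rng.shuffle(order)
    perms.append({a: rank for rank, a in enumerate(order)})

A, B = {1, 2, 3, 4}, {3, 4, 5, 6}         # true Jaccard sim(A,B) = 2/6
print(estimate_sim(A, A, perms))          # → 1.0 (identical sets always agree)
print(estimate_sim(A, {10, 11}, perms))   # → 0.0 (disjoint sets never agree)
print(estimate_sim(A, B, perms))          # close to 1/3
```

Disjoint sets never collide because each πj is injective, so their argmins take distinct values.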

Min-Hash Analysis

- Select t = O(ε^(-2) log N)
- Hamming Distance
- D(HR(A), HR(B)) = number of j's such that hj(A) ≠ hj(B)
- Theorem: For any A,B, D(HR(A), HR(B)) ≈ t · D(A,B) with high probability
- Proof? Exercise (apply Chernoff Bound)
- Obtain ANN algorithm similar to earlier result

Generalization

- Goal
- abstract the technique used for Hamming space
- enable application to other metric spaces
- handle Dynamic ANN
- Dynamic Approximate r-Near Neighbors
- Fix threshold r
- Query: if any point lies within distance r of q, return any point within distance (1+ε)r
- Allow insertions/deletions of points in P
- Recall: the earlier method required preprocessing all possible queries in the hash range space

Locality-Sensitive Hashing

- Fix metric space (M,D), threshold r, error ε > 0
- Choose probability parameters Q1 > Q2 > 0
- Definition: Hash family H = {h: M → S} for (M,D) is called (r, (1+ε)r, Q1, Q2)-sensitive if, for random h and for any p,q in M:
- D(p,q) < r ⇒ Pr[h(p) = h(q)] > Q1
- D(p,q) > (1+ε)r ⇒ Pr[h(p) = h(q)] < Q2
- Intuition
- p,q near ⇒ likely to collide
- p,q far ⇒ unlikely to collide

Examples

- Hamming Space M = {0,1}^d
- point p = b1...bd
- H = {hi : hi(b1...bd) = bi, for i = 1,...,d}
- sampling one bit at random
- Pr[hi(q) = hi(p)] = 1 - D(p,q)/d
- Set Similarity: D(A,B) = 1 - sim(A,B)
- Recall sim(A,B) = |A ∩ B| / |A ∪ B|
- H = {min-hash functions hj}
- Pr[h(A) = h(B)] = 1 - D(A,B)

Multi-Index Hashing

- Overall Idea
- Fix LSH family H
- Boost the Q1-Q2 gap by defining G = H^k
- Using G, each point hashes into l buckets
- Intuition
- r-near neighbors likely to collide
- few non-near pairs in any bucket
- Define
- G = {g : g(p) = h1(p) h2(p) ... hk(p)}
- Hamming metric ⇒ sample k random bits

Example (l = 4)

[Figure: l = 4 hash functions g1, ..., g4, each the concatenation h1...hk; query q and its r-near neighbor p collide in at least one of the four buckets.]

Overall Scheme

- Preprocessing
- Prepare a hash table for the range of G
- Select l hash functions g1, g2, ..., gl
- Insert(p): add p to buckets g1(p), g2(p), ..., gl(p)
- Delete(p): remove p from buckets g1(p), g2(p), ..., gl(p)
- Query(q)
- Check buckets g1(q), g2(q), ..., gl(q)
- Report nearest of (say) the first 3l points
- Complexity
- Assume computing D(p,q) needs O(d) time
- Assume storing p needs O(d) space
- Insert/Delete/Query Time: O(dlk)
- Preprocessing/Storage: O(dN + Nlk)
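The scheme above can be sketched for the Hamming metric, where each gj samples k random bit positions. The class name, parameter values, and seeded rng are illustrative; for simplicity this sketch scans every candidate in the l buckets rather than stopping after 3l points.

```python
# Multi-index LSH for Hamming space: l tables, each keyed by k sampled bits.
import random
from collections import defaultdict

class HammingLSH:
    def __init__(self, d, k, l, rng):
        # g_j(p) = concatenation h_1(p)...h_k(p) of k sampled bits
        self.gs = [[rng.randrange(d) for _ in range(k)] for _ in range(l)]
        self.tables = [defaultdict(set) for _ in range(l)]

    def _key(self, j, p):
        return "".join(p[i] for i in self.gs[j])

    def insert(self, p):
        for j, table in enumerate(self.tables):
            table[self._key(j, p)].add(p)

    def delete(self, p):
        for j, table in enumerate(self.tables):
            table[self._key(j, p)].discard(p)

    def query(self, q):
        """Return the nearest point found in q's l buckets (None if all empty)."""
        candidates = set()
        for j, table in enumerate(self.tables):
            candidates |= table[self._key(j, q)]
        dist = lambda p: sum(a != b for a, b in zip(p, q))
        return min(candidates, key=dist) if candidates else None

rng = random.Random(2)
index = HammingLSH(d=16, k=6, l=8, rng=rng)
for p in ("0000000000000000", "1111111111111111", "0000000011111111"):
    index.insert(p)
print(index.query("0000000000000000"))  # → '0000000000000000'
```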

Collision Probability vs. Distance

[Figure: collision probability as a function of D(p,q): near 1 at distance 0, at least Q1 at distance r, at most Q2 beyond distance (1+ε)r, approaching 0 thereafter.]

Multi-Index versus Error

- Set l = N^z, where z = log(1/Q1) / log(1/Q2)
- Theorem: For l = N^z, any query returns an r-near neighbor correctly with probability at least 1/6
- Consequently (ignoring k = O(log N) factors)
- Time O(dN^z)
- Space O(N^(1+z))
- Hamming Metric ⇒ z = 1/(1+ε)
- Boost probability: use several parallel hash tables

Analysis

- Define (for fixed query q)
- p* = any point with D(q,p*) < r
- FAR(q) = all p with D(q,p) > (1+ε)r
- BUCKET(q,j) = all p with gj(p) = gj(q)
- Event Esize: total number of far points in q's buckets is at most 3l
  (⇒ query cost bounded by O(dl))
- Event ENN: gj(p*) = gj(q) for some j
  (⇒ nearest point in the l buckets is an r-near neighbor)
- Analysis
- Show Pr[Esize] = x > 2/3 and Pr[ENN] = y > 1/2
- Thus Pr[not(Esize and ENN)] ≤ (1-x) + (1-y) < 5/6

Analysis Bad Collisions

- Choose k = log_(1/Q2) N
- Fact: E[number of far points colliding with q in one bucket] ≤ N Q2^k = 1
- Clearly: E[total number of far points in q's l buckets] ≤ l
- Markov Inequality: Pr[X > r·E[X]] < 1/r, for X > 0
- Lemma 1: Pr[Esize] > 2/3

Analysis Good Collisions

- Observe: Pr[gj(p*) = gj(q)] ≥ Q1^k = Q1^(log_(1/Q2) N) = N^(-z)
- Since l = N^z ⇒ Pr[some j collides] ≥ 1 - (1 - N^(-z))^l ≥ 1 - 1/e > 1/2
- Lemma 2: Pr[ENN] > 1/2
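The arithmetic behind Lemma 2 can be checked numerically: with l = N^z tables and per-table success probability N^(-z), the chance that some table works is 1 - (1 - N^(-z))^l ≥ 1 - 1/e > 1/2. The sample values of N and z below are illustrative.

```python
# Check 1 - (1 - N^(-z))^l for l = N^z at a few sample sizes.
import math

def pr_enn(N, z):
    l = round(N ** z)
    return 1 - (1 - N ** (-z)) ** l

for N in (100, 10_000, 1_000_000):
    p = pr_enn(N, z=0.5)
    print(N, round(p, 4), p > 0.5)  # the probability always exceeds 1/2
```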

Euclidean Norms

- Recall
- x = (x1, x2, ..., xd) and y = (y1, y2, ..., yd) in R^d
- L1-norm: D(x,y) = Σi |xi - yi|
- Lp-norm (for p > 1): D(x,y) = (Σi |xi - yi|^p)^(1/p)

Extension to L1-Norm

- Round coordinates to {1,...,M}
- Embed L1-{1,...,M}^d into Hamming {0,1}^(dM)
- Unary Mapping: coordinate xi maps to xi ones followed by M - xi zeros
- Apply algorithm for Hamming Spaces
- Error due to rounding of 1/M
- Space/Time overhead due to mapping of d → dM
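The unary embedding above can be demonstrated in a few lines: for coordinates in {1,...,M}, L1 distance becomes exactly Hamming distance in {0,1}^(dM). The sample points and M are illustrative.

```python
# Unary embedding: coordinate x in {1,...,M} -> x ones followed by M-x zeros.
def unary_embed(point, M):
    return "".join("1" * x + "0" * (M - x) for x in point)

def l1(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(p, q):
    return sum(a != b for a, b in zip(p, q))

x, y, M = (3, 1, 5), (1, 4, 5), 6
print(unary_embed(x, M))                              # → 111000100000111110
print(l1(x, y))                                       # → 5
print(hamming(unary_embed(x, M), unary_embed(y, M)))  # → 5 (distances agree exactly)
```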

Extension to L2-Norm

- Observe
- Little difference between L1-norm and L2-norm for high d
- Additional error is small
- More generally Lp, for 1 ≤ p ≤ 2
- Figiel et al 1977, Johnson-Schechtman 1982
- Can embed Lp into L1
- Dimensions: d → O(d)
- Distances preserved within factor (1+α)
- Key Idea: random rotation of space

Improved Bounds

- Indyk-Motwani 1998
- For any Lp-norm
- Query Time: O(log^3 N)
- Space: N^O(1/ε²)
- Problem: impractical
- Today only a high-level sketch

Better Reduction

- Recall
- Reduced Approximate Nearest Neighbors to Approximate r-Near Neighbors
- Space/Time overhead: O(log R)
- R = max distance in metric space
- Ring-Cover Trees
- Remove dependence on R
- Reduce overhead to O(polylog N)

Approximate r-Near Neighbors

- Idea
- Impose a regular grid on R^d
- Decompose into cubes of side length s
- Label each cube with the points at distance < r from it
- Data Structure
- Query q: determine the cube containing q
- Cube's labels = candidate r-near neighbors
- Goals
- Small s ⇒ lower error
- Fewer cubes ⇒ smaller storage

[Figure: grid cells near points p1, p2, p3, each cell labeled with the nearby points.]

Grid Analysis

- Assume r = 1
- Choose s = ε/√d
- Cube Diameter: s√d = ε
- Number of cubes: N · (1/ε)^O(d)
- Theorem: For any Lp-norm, can solve Approximate r-Near Neighbor using
- Space N · (1/ε)^O(d)
- Time O(d)

Dimensionality Reduction

- Johnson-Lindenstrauss 84, Frankl-Maehara 88: For any ε > 0, can map the points in P into a subspace of dimension O(ε^(-2) log N) while preserving all inter-point distances to within a factor of (1+ε)
- Proof idea: project onto random lines
- Result for NN
- Space N^O(1/ε²)
- Time O(polylog N)

References

- Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. P. Indyk and R. Motwani. STOC 1998.
- Similarity Search in High Dimensions via Hashing. A. Gionis, P. Indyk, and R. Motwani. VLDB 1999.