Overcoming the L1 Non-Embeddability Barrier presentation

1
Overcoming the L1 Non-Embeddability Barrier

Robert Krauthgamer (Weizmann Institute)
Joint work with Alexandr Andoni and Piotr Indyk
(MIT)

2
Algorithms on Metric Spaces
Fix a metric M, e.g. Hamming distance, Ulam metric, earthmover distance, edit distance
Fix a computational problem, e.g. compute the distance between x,y
Solve problem under M

ED(x,y) = minimum number of edit operations
that transform x into y. Edit operation =
insert/delete/substitute a character.
ED(0101010, 1010101) = 2
Nearest Neighbor Search: preprocess n strings
so that, given a query string, can find the
closest string to it.
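
The definitions on this slide can be sketched in a few lines. This is a minimal illustration, not the talk's algorithm: `edit_distance` is the textbook dynamic program, and `nearest_neighbor` is a brute-force linear scan (both names are chosen here for illustration).

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,                # delete cx
                            curr[j - 1] + 1,            # insert cy
                            prev[j - 1] + (cx != cy)))  # substitute or match
        prev = curr
    return prev[-1]


def nearest_neighbor(strings, query):
    """Brute-force NNS: scan all strings (no preprocessing at all)."""
    return min(strings, key=lambda s: edit_distance(s, query))


print(edit_distance("0101010", "1010101"))
print(nearest_neighbor(["0000000", "1010101"], "0101010"))
```

The linear scan costs one quadratic-time distance computation per stored string, which is the cost that preprocessing-based NNS aims to avoid.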

3
Motivation for Nearest Neighbor

Many applications
Image search (Euclidean dist, Earth-mover dist)
Processing of genetic information, text
processing (edit dist.)
many others

Generic Search Engine
4
A General Tool: Embeddings

An embedding of M into a host metric $(H, d_H)$ is a
map $f: M \to H$ that
preserves distances approximately:
it has distortion $A \ge 1$ if for all $x, y \in M$,
$d_M(x,y) \le d_H(f(x), f(y)) \le A \cdot d_M(x,y)$
Why?
If H is "easy" (= can efficiently solve
computational problems like NNS),
then get good algorithms for the original space
M!
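
To make the distortion definition concrete, here is a small sketch that measures the distortion of a map on a finite point set. It assumes distortion is taken up to rescaling (largest expansion divided by smallest expansion); the function name and the two toy examples are illustrative choices, not from the talk.

```python
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Distortion of f on a finite set of distinct points, up to rescaling."""
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


# Hamming distance on bit strings embeds isometrically into l1.
bits = ["000", "011", "101", "110"]
print(distortion(bits, hamming, l1, lambda s: tuple(int(c) for c in s)))


# A 4-cycle mapped onto a line: the pair (0, 3) is stretched by a factor 3.
def cycle_dist(i, j):
    return min(abs(i - j), 4 - abs(i - j))


print(distortion(range(4), cycle_dist, lambda a, b: abs(a - b), lambda i: i))
```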
5
Host space?
$\ell_1$ = real space with $d_1(x,y) = \sum_i |x_i - y_i|$

Popular target metric: $\ell_1$
Have efficient algorithms:
Distance estimation: O(d) for d-dimensional space
(often less)
NNS: c-approx with $O(n^{1/c})$ query time and
$O(n^{1+1/c})$ space [IM98]
Powerful enough for some things

Metric | References (upper; lower) | Upper bound | Lower bound
Edit distance over $\{0,1\}^d$ | [OR05]; [KN05, KR06, AK07] | $2^{O(\sqrt{\log d})}$ | $\Omega(\log d)$
Ulam (= edit distance over permutations) | [CK06]; [AK07] | $O(\log d)$ | $\Omega(\log d)$
Block edit distance over $\{0,1\}^d$ | [MS00, CM07]; [Cor03] | $O(\log d)$ | 4/3
Earthmover distance in $\mathbb{R}^2$ (sets of size s) | [Cha02, IT03]; [NS07] | $O(\log s)$ | $\Omega(\log^{1/2} s)$
Earthmover distance in $\{0,1\}^d$ (sets of size s) | [AIK08]; [KN05] | $O(\log s \log d)$ | $\Omega(\log s)$
6
Below logarithmic?

Cannot work with $\ell_1$
Other possibilities?
$(\ell_2)^p$ is bigger and algorithmically tractable
but not rich enough (often same lower bounds)
$(\ell_2)^p$ = real space with $\mathrm{dist}_{2^p}(x,y) = \|x-y\|_2^p$
$\ell_\infty$ is rich (includes all metrics),
but usually not computationally efficient (high
dimension)
$\ell_\infty$ = real space with $\mathrm{dist}_\infty(x,y) = \max_i |x_i - y_i|$
And that's roughly it
(at least for efficient NNS)
7
Meet our new host
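
Iterated product space, $\ell_{2^2,\infty,1}$

[Figure: three nested levels of dimensions $\alpha$, $\beta$, $\gamma$ with distances $d_1$, $d_{\infty,1}$, $d_{2^2,\infty,1}$]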
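
The three nested distances in the figure can be written out directly. The sketch below assumes a point is a nested list of shape gamma x beta x alpha: the innermost rows are compared in l1, the beta rows of a block are combined by a maximum, and the gamma blocks are combined by a sum of squares. The function names are illustrative.

```python
def d_1(u, v):
    """l1 distance between two rows of alpha coordinates."""
    return sum(abs(a - b) for a, b in zip(u, v))


def d_inf_1(U, V):
    """l_infinity-product of l1: maximum over the beta rows of a block."""
    return max(d_1(u, v) for u, v in zip(U, V))


def d_22_inf_1(X, Y):
    """(l2)^2-product on top: sum over the gamma blocks of squared block distances."""
    return sum(d_inf_1(U, V) ** 2 for U, V in zip(X, Y))


# gamma = 2 blocks, beta = 2 rows per block, alpha = 2 coordinates per row
X = [[[0, 0], [1, 1]], [[2, 0], [0, 0]]]
Y = [[[1, 0], [1, 3]], [[0, 0], [0, 1]]]
print(d_22_inf_1(X, Y))  # block distances are 2 and 2, so 2**2 + 2**2
```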
8
Why $\ell_{2^2,\infty,1}$?

Because we can...
Rich: Theorem 1. Ulam embeds into $\ell_{2^2,\infty,1}$ with O(1)
distortion
Dimensions $(\gamma, \beta, \alpha) = (d, \log d, d)$
Algorithmically tractable: Theorem 2. $\ell_{2^2,\infty,1}$ admits NNS on n points with
O(log log n) approximation
$O(n^\epsilon)$ query time and $O(n^{1+\epsilon})$ space
In fact, there is more for Ulam...
9
Our Algorithms for Ulam

Ulam = edit distance on strings where each symbol appears
at most once, e.g. ED(1234567, 7123456) = 2
A classical distance between rankings
Exhibits hardness of misalignments (as in general
edit)
All lower bounds same as for general edit (up to
$\Theta(\cdot)$)
Distortion of embedding into $\ell_1$ (and $(\ell_2)^p$, etc.):
$\Theta(\log d)$
Our approach implies new algorithms for Ulam:
1. NNS with O(log log n) approx, $O(n^\epsilon)$ query
time
Can improve to O(log log d) approx
2. Sketching with O(1)-approx in $\log^{O(1)} d$ space
3. Distance estimation with O(1)-approx in time

If we ever hope for approximation $\ll \log d$ for NNS
under general edit, first we have to get it under
Ulam!
[BEKMRRS03]: when ED $\approx$ d, approx $d^\epsilon$ in $O(d^{1-2\epsilon})$
time
10
Theorem 1

Theorem 1. Can embed Ulam into $\ell_{2^2,\infty,1}$ with O(1)
distortion
Dimensions $(\gamma, \beta, \alpha) = (d, \log d, d)$
Proof:
"Geometrization" of Ulam characterizations
Previously studied in the context of testing
monotonicity (sortedness):
Sublinear algorithms [EKKRV98, ACCL04]
Data-stream algorithms [GJKK07, GG07, EH08]

11
Thm 1: Characterizing Ulam

Consider permutations x,y over [d]
Assume for now x = identity permutation
Idea:
Count # chars in y to delete to obtain increasing
sequence ($\approx$ Ulam(x,y))
Call them faulty characters
Issues:
Ambiguity...
How do we count them?

Examples:
x = 123456789, y = 234657891
x = 123456789, y = 341256789
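
When x is the identity, the count in the "Idea" bullet equals d minus the length of a longest increasing subsequence of y. A minimal sketch using the standard patience-sorting method (the function name is an illustrative choice):

```python
from bisect import bisect_left


def chars_to_delete(y):
    """Fewest characters to delete from y so that the rest is increasing."""
    tails = []  # tails[i] = smallest possible last value of an increasing subsequence of length i + 1
    for c in y:
        i = bisect_left(tails, c)
        if i == len(tails):
            tails.append(c)
        else:
            tails[i] = c
    return len(y) - len(tails)


for s in ("234657891", "341256789"):
    print(s, chars_to_delete([int(c) for c in s]))
```

This gives the count but not a canonical set of characters: in 341256789 one can delete either {3,4} or {1,2}, which is the ambiguity the slide points at.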
12
Thm 1: Characterization (inversions)

Definition: chars a<b form an inversion if b
precedes a in y
How to identify a faulty char?
Has an inversion?
Doesn't work: all chars might have an inversion
Has many inversions?
Still can miss faulty chars
Has many inversions locally?
Same problem

Check if either is true!
Examples (x = 123456789 in each):
y = 567981234
y = 234567891
y = 213456798
13
Thm 1: Characterization (faulty chars)

Definition 1: a is faulty if there exists K>0 s.t.
a is inverted w.r.t. a majority of the K symbols
preceding a in y
(ok to consider $K = 2^k$)
Lemma [ACCL04, GJKK07]: # faulty chars =
$\Theta(\mathrm{Ulam}(x,y))$.

Example: x = 123456789, y = 234567891
4 characters preceding 1 (all inversions with 1)
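
Definition 1 can be checked by brute force. The sketch below assumes x is the identity, reads "majority" as a strict majority, and tries every K rather than only powers of two; those readings and the function name are assumptions made for illustration.

```python
def faulty_chars(y):
    """Symbols of y that are faulty in the sense of Definition 1 (x = identity)."""
    faulty = set()
    for pos, a in enumerate(y):
        inversions = 0
        for K in range(1, pos + 1):
            # Extend the window to the K symbols preceding a; a preceding
            # symbol larger than a forms an inversion with a.
            if y[pos - K] > a:
                inversions += 1
            if 2 * inversions > K:  # strict majority of the window
                faulty.add(a)
                break
    return faulty


print(faulty_chars([int(c) for c in "234567891"]))
print(faulty_chars([int(c) for c in "213456798"]))
```

In the slide's example only the symbol 1 is reported, matching the single move that separates the two permutations.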
14
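Thm 1: Characterization → Embedding

To get embedding, need:
1. Symmetrization (neither string is identity)
2. Deal with "exists", "majority"...?
To resolve (1), use instead $X_a^K$, the set of K symbols preceding a in x (and $Y_a^K$ likewise for y)
Definition 2: a is faulty if there exists $K = 2^k$ such
that
$|X_a^{2^k} \,\triangle\, Y_a^{2^k}| > 2^k$ (symmetric difference)

Example: x = 123456789 with $X_5^4$; y = 123467895 with $Y_5^4$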
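
A brute-force reading of Definition 2, as a sketch. It assumes the set of K symbols preceding a is simply truncated when fewer than K symbols precede a, and that K ranges over powers of two up to d; both are assumptions made here, as are the function names.

```python
def preceding(perm, a, K):
    """Set of (up to) K symbols immediately preceding a in perm."""
    pos = perm.index(a)
    return set(perm[max(0, pos - K):pos])


def faulty_symmetric(x, y):
    """Symbols a with |X_a^K symmetric-difference Y_a^K| > K for some K = 2^k."""
    faulty = set()
    for a in x:
        K = 1
        while K <= len(x):
            if len(preceding(x, a, K) ^ preceding(y, a, K)) > K:
                faulty.add(a)
                break
            K *= 2
    return faulty


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(preceding(x, 5, 4), preceding(y, 5, 4))
print(faulty_symmetric(x, y))
```

On the slide's example the sets for a = 5 and K = 4 are disjoint, so 5 is faulty; 6 is flagged as well because its immediate predecessor changes, which is still a constant number of symbols per move.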
15
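Thm 1: Embedding, final step

Example: x = 123456789 with $X_5^{2^2}$; y = 123467895 with $Y_5^{2^2}$
We have:
Ulam(x,y) $\approx \sum_a \mathbf{1}\left[\exists k:\ |X_a^{2^k} \triangle Y_a^{2^k}| > 2^k\right]$ (each term equals 1 iff the condition is true)
Replace the indicator by a weight?
Final embedding: distance of the form
$\sum_a \left( \max_k \frac{c}{2^k}\, |X_a^{2^k} \triangle Y_a^{2^k}| \right)^2$ for a constant c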
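
The final step can be illustrated by building the embedding explicitly and measuring the resulting product-space distance. This is a sketch under stated assumptions: symbols are 1..d, coordinate (a, k) holds the indicator vector of the $2^k$ symbols preceding a, scaled by $1/(2 \cdot 2^k)$ (this normalization is an assumption, chosen so every inner $\ell_1$ distance is at most 1), and short prefixes are truncated. All names are illustrative.

```python
def preceding(perm, a, K):
    """Set of (up to) K symbols immediately preceding a in perm."""
    pos = perm.index(a)
    return set(perm[max(0, pos - K):pos])


def embed(perm):
    """Map a permutation of 1..d to a d x (number of K values) x d nested list."""
    d = len(perm)
    Ks = []
    K = 1
    while K <= d:
        Ks.append(K)
        K *= 2
    point = []
    for a in range(1, d + 1):
        block = []
        for K in Ks:
            S = preceding(perm, a, K)
            # Scaled indicator vector of the K symbols preceding a.
            block.append([1.0 / (2 * K) if s in S else 0.0 for s in range(1, d + 1)])
        point.append(block)
    return point


def product_distance(X, Y):
    """Sum over blocks of (max over rows of the l1 distance) squared."""
    total = 0.0
    for U, V in zip(X, Y):
        inner = max(sum(abs(u - v) for u, v in zip(ru, rv)) for ru, rv in zip(U, V))
        total += inner ** 2
    return total


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(product_distance(embed(x), embed(y)))
```

For this pair, which differs by moving one symbol, the embedded distance is a small constant, in line with the O(1) distortion claimed by Theorem 1.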
16
Theorem 2

Theorem 2. $\ell_{2^2,\infty,1}$ admits NNS on n points with
O(log log n) approximation
$O(n^\epsilon)$ query time and $O(n^{1+\epsilon})$ space for any small
$\epsilon$
(ignoring $(\alpha\beta\gamma)^{O(1)}$)
A rather general approach:
"LSH" on $\ell_1$-products of general metric spaces
Of course, cannot do, but can reduce to
$\ell_\infty$-products

17
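Thm 2: Proof

Let's start from basics: $\ell_1^\alpha$
[IM98]: c-approx with $O(n^{1/c})$ query time and
$O(n^{1+1/c})$ space
(ignoring $\alpha^{O(1)}$)
Ok, what about the $\ell_\infty$-product of $\ell_1^\alpha$?

[I02]: Suppose NNS for M with
$c_M$-approx
$Q_M$ query time
$S_M$ space.

Then NNS for the $\ell_\infty$-product of M with
$O(c_M \log\log n)$-approx
$O(Q_M)$ query time
$O(S_M \cdot n^{1+\epsilon})$ space.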
18
Thm 2: What about $(\ell_2)^2$-product?

Enough to consider
(for us, M is the $\ell_1$-product)
Off-the-shelf?
[I04] gives space n? or > log n approximation
We reduce to multiple NNS queries under
Instructive to first look at NNS for standard $\ell_1$

19
Thm 2: Review of NNS for $\ell_1$

LSH family: collection H of
hash functions such that:
For random $h \in H$ (parameter $\Delta > 0$)
$\Pr[h(q) = h(p)] \approx 1 - \|q-p\|_1 / \Delta$
Query just uses primitive:
return all points p such that h(q) = h(p)
Can obtain H by imposing randomly-shifted grid of
side-length $\Delta$
Then for h defined by $r_i \in [0, \Delta]$ at random,
primitive becomes:
return all p s.t. $|q_i - p_i| < r_i$ for all $i \in [d]$
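
The randomly-shifted grid can be simulated to see the collision probability. The sketch below is one possible realization (names, seed and trial count are illustrative): each hash function shifts every axis by a uniform offset in $[0, \Delta]$ and maps a point to the index of its grid cell.

```python
import math
import random


def make_grid_hash(dim, delta, rng):
    """One hash function: the cell of a grid of side delta with random per-axis shifts."""
    shifts = [rng.uniform(0, delta) for _ in range(dim)]

    def h(p):
        return tuple(math.floor((p[i] + shifts[i]) / delta) for i in range(dim))

    return h


def collision_rate(q, p, delta, trials=20000, seed=0):
    """Empirical Pr[h(q) = h(p)] over independently drawn grids."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(q), delta, rng)
        hits += h(q) == h(p)
    return hits / trials


# ||q - p||_1 = 0.3 with delta = 1: the exact value is (1 - 0.1)(1 - 0.2) = 0.72,
# close to 1 - 0.3 = 0.7.
print(collision_rate((0.0, 0.0), (0.1, 0.2), delta=1.0))
# Points more than delta apart along one axis never share a cell.
print(collision_rate((0.0, 0.0), (2.0, 0.0), delta=1.0))
```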
20
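Thm 2: LSH for $\ell_1$-product

Intuition: abstract LSH!
Recall we had:
for $r_i$ random from $[0, \Delta]$,
point p returned if for all i: $|q_i - p_i| < r_i$
Equivalently:
For all i: $|q_i - p_i| / r_i < 1$, i.e. $\max_i |q_i - p_i| / r_i < 1$

$\ell_\infty$-product of $\mathbb{R}$!
For $\ell_1$:
return all p s.t. $|q_i - p_i| < r_i$ for all $i \in [d]$
For the $\ell_1$-product of M:
return all points p such that $\max_i d_M(q_i, p_i) / r_i < 1$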
21
Thm 2: Final

Thus, sufficient to solve primitive:
return all points p such that $\max_i d_M(q_i, p_i) / r_i < 1$
(in fact, for k independent
choices of $(r_1, \ldots, r_d)$)
We reduced NNS over the $\ell_1$-product of M
to several instances of NNS over the $\ell_\infty$-product of M
(with appropriately scaled coordinates)
Approximation is $O(1) \cdot O(\log\log n)$
Done!
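
The primitive is a range query in an $\ell_\infty$-product of M whose i-th coordinate is rescaled by $1/r_i$. A brute-force sketch of what it returns, for illustration only (the talk answers it with an NNS structure for $\ell_\infty$-products, not a scan; here M is taken to be the plane under $\ell_1$ and all names are illustrative):

```python
def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def scaled_linf_primitive(points, q, r, d_M):
    """Return all points p with max_i d_M(q_i, p_i) / r_i < 1 (linear scan)."""
    return [p for p in points
            if max(d_M(qi, pi) / ri for qi, pi, ri in zip(q, p, r)) < 1]


# Each point has two coordinates, each of which is itself a point of M.
q = ((0, 0), (1, 1))
p_close = ((0, 1), (1, 1))  # coordinate distances 1 and 0
p_far = ((3, 0), (1, 1))    # coordinate distances 3 and 0
r = (2.0, 0.5)              # one random draw of (r_1, r_2)

print(scaled_linf_primitive([p_close, p_far], q, r, l1))
```

Repeating the query for k independent draws of r and merging the answers corresponds to the "k independent choices" remark on the slide.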
22
Take-home message

Can embed combinatorial metrics into iterated
product spaces
Works for Ulam (edit on non-repetitive strings)
Approach bypasses non-embeddability results into
usual-suspect spaces like $\ell_1$, $(\ell_2)^2$
