Polylogarithmic Approximation for Edit Distance (and the Asymmetric Query Complexity)

1
Polylogarithmic Approximation for Edit Distance
(and the Asymmetric Query Complexity)

Robert Krauthgamer (Weizmann Institute)
Joint with Alexandr Andoni (Microsoft SVC) and Krzysztof Onak (CMU)

3
Edit Distance (Levenshtein distance)
Given two strings $x, y \in \Sigma^n$:
$\mathrm{ed}(x,y)$ = minimum number of character operations (insertion/deletion/substitution) that transform $x$ to $y$.
Example: ed(banana, ananas) = 2
Applications

Computational Biology
Text processing
Web search

4
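Basic task

Compute $\mathrm{ed}(x,y)$ for input $x, y \in \Sigma^n$
$O(n^2)$ time [WF74]

Faster algorithms?

Dynamic program: $D(i,j) = \mathrm{ed}(x_{1..i}, y_{1..j})$, with
$D(i,j) = \min \{\, D(i-1,j-1) \text{ if } x_i = y_j \text{ (else } D(i-1,j-1) + 1),\;\; D(i-1,j) + 1,\;\; D(i,j-1) + 1 \,\}$

Table of $D$ for x = banana (columns) and y = ananas (rows); the bottom-right entry is ed(x,y) = 2:

      b  a  n  a  n  a
  a   1  1  2  3  4  5
  n   2  2  1  2  3  4
  a   3  2  2  1  2  3
  n   4  3  2  2  1  2
  a   5  4  3  2  2  1
  s   6  5  4  3  3  2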
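
The quadratic dynamic program on this slide can be sketched in a few lines of Python. This is a minimal illustration of the standard recurrence, not code from the talk; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the quadratic dynamic program."""
    n, m = len(x), len(y)
    # D[i][j] = ed(x[:i], y[:j]); row 0 and column 0 are the empty-prefix cases.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else 1  # substitution cost
            D[i][j] = min(
                D[i - 1][j - 1] + sub,  # match or substitute
                D[i - 1][j] + 1,        # delete x[i-1]
                D[i][j - 1] + 1,        # insert y[j-1]
            )
    return D[n][m]


print(edit_distance("banana", "ananas"))  # 2
```

The table has $(n+1)(m+1)$ entries and each is filled in constant time, which is the $O(n^2)$ bound quoted above.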
5
Faster Algorithms?

Compute $\mathrm{ed}(x,y)$ for given $x, y \in \Sigma^n$
$O(n^2)$ time [WF74]
$O(n^2/\log^2 n)$ time [MP80]
Linear time (or near-linear)?
Specific cases (average, smoothed, restricted input) and variants (block edit distance, etc.)
[U83, LV85, M86, GG88, GP89, UW90, CL90, CH98, LMS98, U85, CL92, N99, CPSV00, MS00, CM02, AK08, BF08]
$2^{O(\sqrt{\log n})}$-approximation [OR05, AO09], improving the earlier $n^c$-approximation [BEKMRRS03, BJKK04, BES06]
Same barrier, a $2^{O(\sqrt{\log n})}$-approximation, also for related tasks:
nearest neighbor search (text indexing), embedding into normed spaces, sketching [OR05]

6
Results I

Theorem 1: Can approximate $\mathrm{ed}(x,y)$ within a $(\log n)^{O(1/\varepsilon)}$ factor in time $n^{1+\varepsilon}$ (for any $\varepsilon > 0$).
Exponential improvement over the previous factor $2^{O(\sqrt{\log n})}$
Fallout from the study of the asymmetric query model

7
Approach: asymmetric query model

Compress one string, $x$, to $n^{\varepsilon}$ information
Use dynamic programming to compute $\mathrm{ed}(x,y)$ in $n^{1+\varepsilon}$ time
How to compress?
Carefully subsample $x$
Focus on the sample size (number of queried positions) in $x$, for fixed $y$
Obtain near-tight bounds

8
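Results II: Asymmetric Query Complexity

Problem: Decide $\mathrm{ed}(x,y) \ge n/10$ vs. $\mathrm{ed}(x,y) \le n/A$
Complexity: number of queries into $x$ (unlimited access to $y$)

  Approximation $A$                          Queries (upper bound)    Queries (lower bound)
  $(\log n)^{O(1/\varepsilon)}$              $O(n^{\varepsilon})$     $\Omega(n^{\varepsilon/\log\log n})$
  $[n^{1/(t+1)}, n^{1/t-\varepsilon}]$       $O(\log^t n)$            $\Omega(\log^t n)$

[Figure: number of queries as a function of the approximation $A$, with levels $\Theta(\log n)$, $\Theta(\log^2 n)$, $\Theta(\log^3 n)$, ..., $\Theta(\log^t n)$ and marks on the $A$ axis at $n^{1-\varepsilon}$, $n^{1/2}$, $n^{1/2-\varepsilon}$, $n^{1/3}$, $n^{1/4}$, ..., $n^{1/t-\varepsilon}$, $n^{1/(t+1)}$]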
9
Upper bound

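Theorem 2: Can distinguish $\mathrm{ed}(x,y) \ge n/10$ vs. $\mathrm{ed}(x,y) \le n/A$ for $A = (\log n)^{O(1/\varepsilon)}$ approximation with $n^{\varepsilon}$ queries into $x$ (for any $\varepsilon > 0$).
Proof structure:
1. Characterize edit distance by the tree-distance $T_{xy}$
   Parameter $b \ge 2$ (degree)
   $T_{xy} \approx \mathrm{ed}(x,y)$ up to a $6b \log n$ factor
2. Prune the tree to subsample $x$

[Figure: tree of degree $b$ over $x_1, x_2, \dots, x_n$, with the sampled positions in $x$ marked]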
10
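Step 1: Tree distance

Partition $x$ into $b$ blocks, recursively, for $h = \log_b n$ levels

[Figure: recursive partition of $x_{1..n}$ into blocks, with a block of $x$ compared against a block of $y_{1..n}$ starting at position $u$]

$T_i(s,u)$ = T-distance between $x[s : s+l_i]$ and $y[u : u+l_i]$, where $l_i$ is the block length at level $i$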

11
Tree distance: recursive definition

Recall: $T_i(s,u)$ = distance between $x[s : s+l_i]$ and $y[u : u+l_i]$
Base case: $T_h(s,u) = \mathrm{Hamming}(x[s], y[u])$
Output: $T_{xy} = T_0(s=1, u=1)$

[Figure: block $x[s : s+l_i]$ of $x$ aligned with block $y[u : u+l_i]$ of $y$, with a shift labeled $r$]
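
The recursive step itself did not survive in the transcript (it was presumably drawn on the slide). The sketch below therefore assumes one plausible form, in which each of the $b$ child blocks may be shifted by some $r$ at an extra cost of $|r|$: $T_i(s,u) = \sum_{j=0}^{b-1} \min_r [\, T_{i+1}(s + j\,l_{i+1},\, u + j\,l_{i+1} + r) + |r| \,]$. The function name `tree_distance`, the 0-based indexing, the treatment of positions outside $y$ as mismatches, and the restriction $|r| \le l_{i+1}$ are all assumptions made for illustration.

```python
from functools import lru_cache


def tree_distance(x: str, y: str, b: int = 2) -> int:
    """Brute-force T-distance under the assumed recursion (len(x) must be a power of b)."""
    if b < 2:
        raise ValueError("degree b must be at least 2")
    n = len(x)
    # Block lengths l_0 = n, l_1 = n/b, ..., l_h = 1.
    lengths = [n]
    while lengths[-1] > 1:
        if lengths[-1] % b != 0:
            raise ValueError("len(x) must be a power of b")
        lengths.append(lengths[-1] // b)
    h = len(lengths) - 1

    @lru_cache(maxsize=None)
    def T(i: int, s: int, u: int) -> int:
        if i == h:
            # Base case: Hamming distance of single characters.
            # Assumption: a position outside y counts as a mismatch.
            return 0 if 0 <= u < len(y) and x[s] == y[u] else 1
        child = lengths[i + 1]
        total = 0
        for j in range(b):
            # The j-th child block may be shifted by r at cost |r|.
            # Shifts with |r| > child never help, since a child's
            # distance is at most its length (take r = 0 everywhere).
            total += min(
                T(i + 1, s + j * child, u + j * child + r) + abs(r)
                for r in range(-child, child + 1)
            )
        return total

    return T(0, 0, 0)


print(tree_distance("abcd", "abxd"))  # 1
```

This direct evaluation is far from efficient; it only makes the shape of the definition concrete.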
12
T-distance approximates edit distance

Lemma: $T_{xy} \approx \mathrm{ed}(x,y)$ up to a $6b \log_b n$ factor.
Hierarchical decomposition inspired by earlier approaches [BEKMRRS03, OR05]
All had an approximation recurrence of the type
$A(n) = c \cdot A(n/b) + b$
for $c \ge 2$
Solves to an $A(n) \ge 2^{\sqrt{\log n}}$ factor for every choice of $b$
Our characterization has no multiplicative loss ($c = 1$):
$A(n) = A(n/b) + b$
Analysis inspired by algorithms for smoothed edit distance [AK08]
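
To see where the two bounds come from, the recurrence can be unrolled over its $h = \log_b n$ levels. The following is a short worked derivation, assuming $A(1) = O(1)$ and logarithms to base 2:

\[
A(n) = c^{h} A(1) + b \sum_{i=0}^{h-1} c^{i}, \qquad h = \log_b n .
\]

For $c = 2$ this gives $A(n) = \Theta(b \cdot 2^{h})$, and

\[
b \cdot 2^{\log_b n} = 2^{\log b + \frac{\log n}{\log b}} \ge 2^{2\sqrt{\log n}},
\]

with equality at $\log b = \sqrt{\log n}$, so no choice of $b$ beats a $2^{\Theta(\sqrt{\log n})}$ factor. For $c = 1$ the geometric sum collapses to

\[
A(n) = A(1) + b \log_b n ,
\]

which matches the $O(b \log_b n)$ factor in the lemma.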

13
Step 2: Compute the tree distance

For $b = 2$, T-distance gives an $O(\log n)$ approximation!
BUT we only know how to compute T-distance in $O(n^2)$ time
Instead, for $b = (\log n)^{1/\varepsilon}$, can prune the tree to $n^{O(\varepsilon)}$ nodes, and get a $(1+\varepsilon)$-approximation
Pruning: subsample $(\log n)^{O(1)}$ children out of each node
Works only when $\mathrm{ed}(x,y) = \Omega(n)$
Generally, must subsample the tree non-uniformly, using the Precision Sampling Lemma

[Figure: pruned tree of degree $b$, with the sampled positions in $x$ marked]
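
The parameter choice can be checked with a short calculation (a sketch, treating the $O(1)$ exponents as constants). With $b = (\log n)^{1/\varepsilon}$ the tree has depth

\[
h = \log_b n = \frac{\log n}{\log b} = \frac{\varepsilon \log n}{\log\log n} .
\]

Keeping $(\log n)^{O(1)} = 2^{O(\log\log n)}$ children per node leaves at most

\[
\left( 2^{O(\log\log n)} \right)^{h} = 2^{O(\varepsilon \log n)} = n^{O(\varepsilon)}
\]

nodes, while the tree-distance factor from the lemma becomes

\[
6b \log_b n = 6 (\log n)^{1/\varepsilon} \cdot \frac{\varepsilon \log n}{\log\log n} = (\log n)^{O(1/\varepsilon)} ,
\]

in line with the approximation factor of Theorem 1.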
14
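Key tool: non-uniform sampling

Goal:
For unknown $a_1, a_2, \dots, a_n \in [0,1]$
Estimate their sum, up to an additive constant error
Using only weak estimates $\hat a_1, \hat a_2, \dots, \hat a_n$

Game between the Sum Estimator and the Adversary:
0. (Sum Estimator) fix distribution $U$
1. (Adversary) fix $a_1, a_2, \dots, a_n$ (unknown)
2. (Sum Estimator) pick precisions $u_i$ (our algorithm: $u_i \sim U$ i.i.d.)
3. (Adversary) provide $\hat a_1, \hat a_2, \dots, \hat a_n$ s.t. $|a_i - \hat a_i| < 1/u_i$
4. (Sum Estimator) report $\tilde S = \tilde S(\hat a_1, \dots, u_1, \dots)$ with $|\tilde S - \sum_i a_i| < 1$.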
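
To make the game concrete, here is a small simulation of the naive uniform-precision baseline, in which every $u_i$ equals $n$. It is not the estimator behind the Precision Sampling Lemma; it only shows the protocol and why uniform precisions are expensive. The function names and the random-noise adversary are assumptions made for illustration.

```python
import random


def adversary_estimates(a, u, rng):
    """Return estimates a_hat with |a_i - a_hat_i| < 1/u_i.

    The adversary is modeled as bounded random noise (an assumption).
    """
    return [ai + rng.uniform(-0.999, 0.999) / ui for ai, ui in zip(a, u)]


def uniform_precision_sum(a, rng):
    """Baseline estimator: request the same precision u_i = n everywhere."""
    n = len(a)
    u = [n] * n
    a_hat = adversary_estimates(a, u, rng)
    # Each error is below 1/n, so the n errors add up to less than 1.
    return sum(a_hat), sum(u) / n


rng = random.Random(0)
a = [rng.random() for _ in range(1000)]
estimate, avg_precision = uniform_precision_sum(a, rng)
print(abs(estimate - sum(a)) < 1, avg_precision)
```

The baseline meets the additive-error goal only by paying average precision $n$; the point of choosing the $u_i$ at random from a suitable distribution $U$ is to bring that average down.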
15
Precision Sampling

Goal: estimate $\sum_i a_i$ from $\hat a_i$ s.t. $|a_i - \hat a_i| < 1/u_i$.
Precision Sampling Lemma: Can achieve, WHP, additive error 1 and multiplicative error 1.5 with expected precision $E_{u_i \sim U}[u_i] = O(\log n)$.
Inspired by a technique from [IW05] for streaming ($F_k$ moments)
In fact, PSL gives simple improved algorithms for $F_k$ moments, cascaded (mixed) norms, and $\ell_p$-sampling problems [AKO10]
Also a distant relative of Priority Sampling [DLT07]

16
Precision Sampling for Edit Distance

Apply Precision Sampling to the tree from the
characterization recursively at each node
If a node has very weak precision, can trim the
entire sub-tree

17
Lower Bound Theorem

Theorem 3: Achieving approximation $A = O(\log^7 n)$ for edit distance requires asymmetric query complexity $n^{\Omega(1/\log\log n)}$.
I.e., distinguishing $\mathrm{ed}(x,y) > n/10$ vs. $\mathrm{ed}(x,y) < n/(10A)$
Implications:
First lower bound to expose hardness from repetitiveness in edit distance
Contrast with edit distance on non-repetitive strings (Ulam's distance):
Empirically easier (better algorithms are known for it)
Yet, all previous lower bounds are essentially equivalent for the two variants [BEKMRRS03, AN10, KN05, KR06, AK07, AJP10]
But for asymmetric query complexity:
Ulam: 2-approx. with $O(\log n)$ queries [ACCL04, SS10]
Edit: requires $n^{\Omega(1/\log\log n)}$ queries

18
Lower Bound Techniques

Core gadget: $\sigma(\cdot)$, the cyclic shift operation
Observation: $\mathrm{ed}(x, \sigma^j(x)) \le 2j$
Lower bound outline:
Exhibit a lower bound via shifts
Amplification by composing the hard instance recursively
We will see here:
Theorem 4: Asymmetric query complexity of approximation $n^{1/2}$ to edit distance is $\Omega(\log^2 n)$
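
The observation about cyclic shifts is easy to check numerically: deleting the first $j$ characters and re-inserting them at the end takes $2j$ operations. The sketch below uses a compact two-row version of the standard dynamic program; the helper names `edit_distance` and `cyclic_shift` are chosen here for illustration.

```python
import random


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance, keeping only two rows of the DP table."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, 1):
        cur = [i]
        for j, cy in enumerate(y, 1):
            cur.append(min(prev[j - 1] + (cx != cy), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def cyclic_shift(x: str, j: int) -> str:
    """Rotate a non-empty string x to the left by j positions."""
    j %= len(x)
    return x[j:] + x[:j]


rng = random.Random(1)
x = "".join(rng.choice("01") for _ in range(64))
for j in range(1, 9):
    # Delete j characters at the front, insert them at the back.
    assert edit_distance(x, cyclic_shift(x, j)) <= 2 * j
```

A small shift therefore keeps the two strings close in edit distance even though almost every position may change, which is what makes shifts useful against algorithms that read only a few positions.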

19
The Shift Gadget

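Lemma: $\Omega(\log n)$ query lower bound for approximation $A = n^{0.5}$.
Hard distribution $(x,y)$:
Fix specific $z_1, z_2 \in \{0,1\}^n$ (random-looking)
Set, formally: $y = z_1$ and $x = \sigma^j(z)$ with $z \in \{z_1, z_2\}$ and random $j \in [n^{0.5}]$
$z = z_1 \Rightarrow \mathrm{ed}(x,y) \le 2n^{0.5}$ (close)
$z = z_2 \Rightarrow \mathrm{ed}(x,y) \ge n/10$ (far)
An algorithm is a set of queried positions $Q \subseteq [n]$, $|Q| \ll \log n$
It reads $z_1$ or $z_2$ at positions $Q + j$
Claim: Both $z_1|_{Q+j}$ and $z_2|_{Q+j}$ are close to the uniform distribution on $\{0,1\}^{Q}$, up to $2^{|Q|}/n^{0.5}$ statistical distance
Hence $|Q| \ge \Omega(\log n)$, even for approximation $A = n^{0.99}$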
20
Amplification via Substitution Product

$\Omega(\log^2 n)$ lower bound by amplification: compose two shift instances
Hard distribution $(x,y)$:
Fix $z_1, z_2 \in \{0,1\}^{\sqrt n}$, $w_0, w_1 \in \{0,1\}^{\sqrt n}$ and $y = z_1 \odot (w_0, w_1)$ (substitution)
Choose either $z = z_1$ (close) or $z = z_2$ (far)
$x = z \odot (w_0, w_1)$ but with random shifts $j \in [n^{1/3}]$ inside each block and between blocks
Intuition: must distinguish $z = z_1$ from $z = z_2$
Must learn $\Omega(\log n)$ positions $i$ of $z$, and each requires reading $\Omega(\log n)$ further positions in the corresponding blocks $w_{z_i}$
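
The substitution step can be sketched as follows: every bit $z_i$ is replaced by the block $w_{z_i}$, so a string of length $\sqrt n$ over blocks of length $\sqrt n$ yields a string of length $n$. The function name `substitution_product` is hypothetical, and the random shifts inside and between blocks are left out of this sketch.

```python
def substitution_product(z: str, w0: str, w1: str) -> str:
    """Replace each bit of z by the block w0 (for '0') or w1 (for '1')."""
    if set(z) - {"0", "1"}:
        raise ValueError("z must be a binary string")
    return "".join(w1 if bit == "1" else w0 for bit in z)


z1, z2 = "0110", "1001"
w0, w1 = "0011", "0101"
y = substitution_product(z1, w0, w1)        # fixed, fully known string
x_close = substitution_product(z1, w0, w1)  # z = z1, before any shifts
x_far = substitution_product(z2, w0, w1)    # z = z2, before any shifts
print(y, x_far)
```

Reading one position of the product reveals something about a single block only, which matches the intuition on the slide: an algorithm has to identify enough bits of $z$, and each bit costs several queries inside its block.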