Title: Polylogarithmic Approximation for Edit Distance (and the Asymmetric Query Complexity)
1Polylogarithmic Approximation for Edit Distance
(and the Asymmetric Query Complexity)
- Robert Krauthgamer Weizmann Institute
- Joint with Alexandr Andoni Microsoft SVC
- Krzysztof Onak CMU
3Edit Distance (Levenshtein distance)
Given two strings x, y ∈ Σ^n
ed(x,y) = minimum number of character operations
(insertion/deletion/substitution) that transform
x to y.
ed(banana, ananas) = 2
Applications
- Computational Biology
- Text processing
- Web search
Generic Search Engine
4Basic task
- Compute ed(x,y) for input x, y ∈ Σ^n
- O(n^2) time [WF74]
Faster algorithms?
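D(i,j) = ed(x[1:i], y[1:j])
D(i,j) = min { D(i-1, j-1) + [x_i ≠ y_j],  D(i-1, j) + 1,  D(i, j-1) + 1 }
where [x_i ≠ y_j] is 1 if the characters differ and 0 otherwise.
Table for x = banana (columns, index i) and y = ananas (rows, index j):
      b  a  n  a  n  a
  a   1  1  2  3  4  5
  n   2  2  1  2  3  4
  a   3  2  2  1  2  3
  n   4  3  2  2  1  2
  a   5  4  3  2  2  1
  s   6  5  4  3  3  2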
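The quadratic-time dynamic program on this slide can be sketched as follows. This is a minimal illustration of the recurrence, not code from the talk; the function name `edit_distance` and the two-row memory layout are choices made here.

```python
def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance via the O(|x| * |y|) dynamic program."""
    n, m = len(x), len(y)
    # prev[j] holds D(i-1, j); only two rows of the table are kept.
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            substitute = prev[j - 1] + (x[i - 1] != y[j - 1])
            delete = prev[j] + 1
            insert = curr[j - 1] + 1
            curr[j] = min(substitute, delete, insert)
        prev = curr
    return prev[m]


print(edit_distance("banana", "ananas"))  # 2, as on the earlier slide
```

Keeping only two rows reduces memory to O(n) but leaves the running time quadratic, which is the barrier the rest of the talk addresses.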
5Faster Algorithms?
- Compute ed(x,y) for given x, y ∈ Σ^n
- O(n^2) time [WF74]
- O(n^2/log^2 n) time [MP80]
- Linear time (or near-linear)?
- Specific cases (average, smoothed, restricted input) and variants (block edit distance, etc.) [U83, LV85, M86, GG88, GP89, UW90, CL90, CH98, LMS98, U85, CL92, N99, CPSV00, MS00, CM02, AK08, BF08]
- 2^{O(√log n)} approximation [OR05, AO09], improving earlier n^c-approximation [BEKMRRS03, BJKK04, BES06]
- Same barrier of 2^{O(√log n)}-approximation also for related tasks: nearest neighbor search (text indexing), embedding into normed spaces, sketching [OR05]
6Results I
- Theorem 1: Can approximate ed(x,y) within a (log n)^{O(1/ε)} factor in time n^{1+ε} (for any ε > 0).
- Exponential improvement over the previous factor 2^{O(√log n)}
- Fallout from the study of the asymmetric query model
7Approach asymmetric query model
- Compress one string, x, to n^ε information
- Use dynamic programming to compute ed(x,y) in n^{1+ε} time
- How to compress?
- Carefully subsample x
- Focus on sample size (number of queried positions) in x, for fixed y
- Obtain near-tight bounds
[Figure: the strings y and x, with only a few positions of x queried]
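To make the cost measure concrete, here is a toy sketch of the query model: y is fully known, and only the distinct positions read from x are counted. The class `QueryCounter` and the function `sampled_mismatch_rate` are hypothetical names introduced for illustration. The uniform-sampling estimator shown measures positionwise (Hamming) mismatch only, so it is a stand-in for the accounting, not the talk's algorithm.

```python
import random


class QueryCounter:
    """Wraps x so that every distinct position read is recorded."""

    def __init__(self, x: str) -> None:
        self._x = x
        self.positions = set()

    def __len__(self) -> int:
        return len(self._x)

    def __getitem__(self, i: int) -> str:
        self.positions.add(i)
        return self._x[i]


def sampled_mismatch_rate(x: QueryCounter, y: str, k: int,
                          rng: random.Random) -> float:
    """Toy estimator: fraction of k sampled positions where x and y differ."""
    idx = rng.sample(range(len(x)), k)
    return sum(x[i] != y[i] for i in idx) / k


x = QueryCounter("ab" * 50)
y = "ab" * 50
rate = sampled_mismatch_rate(x, y, k=10, rng=random.Random(0))
print(rate, len(x.positions))  # mismatch rate and number of queries into x
```

The point of the wrapper is that access to y is free while each read of x is charged, which is the asymmetry the bounds on the following slides refer to.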
8Results II Asymmetric Query Complexity
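- Problem: Decide ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/A
- Complexity: number of queries into x (unlimited access to y)
Approximation A               Queries (upper bound)   Queries (lower bound)
(log n)^{O(1/ε)}              O(n^ε)                  Ω(n^{ε/loglog n})
[n^{1/(t+1)}, n^{1/t-ε}]      O(log^t n)              Ω(log^t n)
[Figure: step plot of the number of queries against the approximation factor A. Vertical axis (queries): Θ(log n), Θ(log^2 n), Θ(log^3 n), ..., Θ(log^t n). Horizontal axis (A): n^{1/(t+1)}, n^{1/t-ε}, ..., n^{1/4}, n^{1/3}, n^{1/2-ε}, n^{1/2}, n^{1-ε}.]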
9Upper bound
- Theorem 2: Can distinguish ed(x,y) ≥ n/10 vs ed(x,y) ≤ n/A for A = (log n)^{O(1/ε)} approximation with n^ε queries into x (for any ε > 0).
- Proof structure:
- 1. Characterize edit distance by the tree-distance T_{xy}
- Parameter b ≥ 2 (degree)
- T_{xy} ≈ ed(x,y) up to a 6b·log n factor
- 2. Prune the tree to subsample x
[Figure: tree of degree b over x_1, x_2, ..., x_n, with the sampled positions in x marked]
10Step 1 Tree distance
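- Partition x into b blocks, recursively, for h = log_b n levels
[Figure: x[1:n] split recursively into blocks; a block of x is compared against a substring of y[1:n] of the same length starting at position u]
- T_i(s,u) = T-distance between x[s : s+l_i] and y[u : u+l_i], where l_i is the block length at level i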
11Tree distance recursive definition
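- Recall: T_i(s,u) = distance between x[s : s+l_i] and y[u : u+l_i]
- Base case: T_h(s,u) = Hamming(x[s], y[u])
- Output: T_{xy} = T_0(s=1, u=1)
[Figure: block x[s : s+l_i] of x matched against block y[u : u+l_i] of y, with a shift r between the blocks]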
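The inductive step of the definition did not survive in the slide text. The sketch below therefore assumes one natural form of it: each block is split into b children, each child of x may be matched against y at a shifted position, and a shift by r costs |r|. That assumed step reads T_i(s,u) = Σ_{j=0}^{b-1} min_r [ T_{i+1}(s + j·l_{i+1}, u + j·l_{i+1} + r) + |r| ]. The base case and the output follow the slide. Names are hypothetical, indices are 0-based, and len(x) is assumed to be a power of b.

```python
from functools import lru_cache


def tree_distance(x: str, y: str, b: int = 2) -> int:
    """T-distance under the ASSUMED recurrence described in the lead-in."""
    lengths = [len(x)]                      # l_0 = n, l_1 = n/b, ..., l_h = 1
    while lengths[-1] > 1:
        if lengths[-1] % b:
            raise ValueError("len(x) must be a power of b")
        lengths.append(lengths[-1] // b)
    h = len(lengths) - 1

    @lru_cache(maxsize=None)
    def T(i: int, s: int, u: int) -> int:
        if i == h:
            # Base case: single characters; positions outside y never match.
            return 0 if 0 <= u < len(y) and x[s] == y[u] else 1
        child = lengths[i + 1]
        total = 0
        for j in range(b):
            cs, cu = s + j * child, u + j * child
            # |r| > child cannot win, because T at level i+1 is at most child.
            total += min(T(i + 1, cs, cu + r) + abs(r)
                         for r in range(-child, child + 1))
        return total

    return T(0, 0, 0)


print(tree_distance("abcd", "abxd"))  # one substitution
```

With memoization this evaluates every (level, s, u) triple at most once, which is in line with the quadratic-time remark made later in the talk but far from the sublinear sampling the talk is after.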
12T-distance approximates edit distance
- Lemma: T_{xy} ≈ ed(x,y) up to a 6b·log_b n factor.
- Hierarchical decomposition inspired by earlier approaches [BEKMRRS03, OR05]
- All had an approximation recurrence of the type
- A(n) = c·A(n/b) + b
- for c ≥ 2
- Solves to A(n) ≥ 2^{√log n} factor for every choice of b
- Our characterization has no multiplicative loss (c = 1):
- A(n) = A(n/b) + b
- Analysis inspired by algorithms for smoothed edit distance [AK08]
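The gap between the two recurrences can be checked numerically. The sketch below unrolls A(n) = c·A(n/b) + b with an assumed base case A(1) = 1 (the slide does not state one) and compares c = 2, minimized over all b, with c = 1 and b = 2. The function name is hypothetical.

```python
import math


def unrolled_factor(n: int, b: int, c: int) -> int:
    """Unroll A(n) = c * A(n/b) + b down to the assumed base case A(1) = 1."""
    levels, size = 0, 1
    while size < n:              # levels = ceil(log_b n)
        size *= b
        levels += 1
    a = 1
    for _ in range(levels):
        a = c * a + b
    return a


n = 2 ** 16
best_with_loss = min(unrolled_factor(n, b, 2) for b in range(2, n + 1))
no_loss = unrolled_factor(n, 2, 1)
print(best_with_loss, 2 ** math.sqrt(math.log2(n)), no_loss)
```

For c = 2 the unrolled value is at least max(2^{log_b n}, b), and one of the two terms is at least 2^{√log n} whatever b is. For c = 1 the value is 1 + b·log_b n, which is logarithmic for constant b.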
13Step 2 Compute the tree distance
- For b = 2, T-distance gives O(log n) approximation!
- BUT we know only how to compute T-distance in O(n^2) time
- Instead, for b = (log n)^{1/ε}, can prune the tree to n^{O(ε)} nodes, and get a 1+ε approximation
- Pruning: subsample (log n)^{O(1)} children out of each node
- Works only when ed(x,y) = Ω(n)
- Generally, must subsample the tree non-uniformly, using the Precision Sampling Lemma
[Figure: pruned tree of degree b, with the sampled positions in x marked]
14Key tool non-uniform sampling
- Goal:
- For unknown a_1, a_2, ..., a_n ∈ [0,1]
- Estimate their sum, up to an additive constant error
- Using only weak estimates ã_1, ã_2, ..., ã_n
Sum Estimator vs. Adversary:
0. fix distribution U
1. Fix a_1, a_2, ..., a_n (unknown)
2. pick precisions u_i (our algorithm: u_i ~ U i.i.d.)
3. provide ã_1, ã_2, ..., ã_n s.t. |a_i - ã_i| < 1/u_i
4. report S = S(ã_1, ..., u_1, ...) with |S - Σ a_i| < 1.
15Precision Sampling
- Goal: estimate Σ a_i from ã_i s.t. |a_i - ã_i| < 1/u_i.
- Precision Sampling Lemma: Can achieve, WHP,
- additive error 1 and multiplicative error 1.5
- with expected precision E_{u_i ~ U}[u_i] = O(log n).
- Inspired by a technique from [IW05] for streaming (F_k moments)
- In fact, PSL gives simple improved algorithms for F_k moments, cascaded (mixed) norms, l_p-sampling problems [AKO10]
- Also a distant relative of Priority Sampling [DLT07]
16Precision Sampling for Edit Distance
- Apply Precision Sampling to the tree from the characterization, recursively at each node
- If a node has very weak precision, can trim the entire sub-tree
17Lower Bound Theorem
- Theorem 3: Achieving approximation A = O(log^7 n) for edit distance requires asymmetric query complexity n^{Ω(1/loglog n)}.
- I.e., distinguishing ed(x,y) > n/10 vs ed(x,y) < n/10A
- Implications:
- First lower bound to expose hardness from repetitiveness in edit distance
- Contrast with edit distance on non-repetitive strings (Ulam's distance)
- Empirically easier (better algorithms are known for it)
- Yet, all previous lower bounds are essentially equivalent for the two variants [BEKMRRS03, AN10, KN05, KR06, AK07, AJP10]
- But asymmetric query complexity:
- Ulam: 2-approx. with O(log n) queries [ACCL04, SS10]
- Edit: requires n^{Ω(1/loglog n)} queries
18Lower Bound Techniques
- Core gadget: σ(·), the cyclic shift operation
- Observation: ed(x, σ^j(x)) ≤ 2j
- Lower bound outline:
- exhibit lower bound via shifts
- Amplification by composing the hard instance recursively
- We will see here:
- Theorem 4: Asymmetric query complexity of approximation n^{1/2} to edit distance is Ω(log^2 n)
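The observation about shifts holds because deleting the first j characters and appending them at the end uses 2j operations. A small sketch that checks it on a random binary string follows. The helper names are hypothetical, and `cyclic_shift` rotates to the left, which is an arbitrary choice of direction.

```python
import random


def edit_distance(x: str, y: str) -> int:
    """Unit-cost edit distance via the standard quadratic dynamic program."""
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(prev[j - 1] + (cx != cy), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def cyclic_shift(x: str, j: int) -> str:
    """Rotate x to the left by j positions."""
    j %= len(x)
    return x[j:] + x[:j]


rng = random.Random(0)
x = "".join(rng.choice("01") for _ in range(40))
for j in range(len(x)):
    # j deletions at the front plus j insertions at the back always suffice.
    assert edit_distance(x, cyclic_shift(x, j)) <= 2 * j
print("shift bound holds for all j <", len(x))
```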
19The Shift Gadget
- Lemma: Ω(log n) query lower bound for approximation A = n^{0.5}.
- Hard distribution (x,y):
- Fix specific z_1, z_2 ∈ {0,1}^n (random-looking)
- Set: x is a shift of z_1 ⇒ ed(x,y) ≤ 2n^{0.5} (close); x is a shift of z_2 ⇒ ed(x,y) ≥ n/10 (far)
- Formally: y = z_1 and x = σ^j(z_1 or z_2), for random j ∈ [n^{0.5}]
- An algorithm is a set of queried positions Q ⊆ [n], |Q| ≪ log n
- ⇒ It reads (z_1 or z_2) at positions Q + j
- Claim: Both z_1|_{Q+j} and z_2|_{Q+j} are close to the uniform distribution on {0,1}^{|Q|}
- up to 2^{|Q|}/n^{0.5} statistical distance
- Hence |Q| = Ω(log n), even for approximation A = n^{0.99}
20Amplification via Substitution Product
- Ω(log^2 n) lower bound by amplification: compose two shift instances
- Hard distribution (x,y):
- Fix z_1, z_2 ∈ {0,1}^{√n}, w_0, w_1 ∈ {0,1}^{√n}, and y = z_1 ⊛ (w_0, w_1) (substitution)
- Choose either z = z_1 (close) or z = z_2 (far)
- x = z ⊛ (w_0, w_1), but with random shifts j ∈ [n^{1/3}] inside each block and between blocks
- Intuition: must distinguish z = z_1 from z = z_2
- Must learn Ω(log n) positions i of z, and each requires reading Ω(log n) further positions in the corresponding block w_{z_i}
[Figure: example of the substitution product with labels w_0, w_1, z_1 and x; bit blocks shown: 00101, 11011, 00111, and x drawn as the blocks 11011 00111 11011 11011 00111]
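The substitution product replaces each bit z_i by the block w_{z_i}. The sketch below builds it and adds random cyclic shifts, following one reading of "inside each block and between blocks": z is rotated by a random amount, then every block is rotated independently. That reading, the function names, and the example bit strings are assumptions made for illustration.

```python
import random


def cyclic_shift(s: str, j: int) -> str:
    """Rotate s to the left by j positions."""
    j %= len(s)
    return s[j:] + s[:j]


def substitution_product(z: str, w0: str, w1: str) -> str:
    """Replace every bit of z by the block w0 (for '0') or w1 (for '1')."""
    return "".join(w1 if bit == "1" else w0 for bit in z)


def shifted_instance(z: str, w0: str, w1: str, max_shift: int,
                     rng: random.Random) -> str:
    """Assumed form of x: shift z, then shift each substituted block."""
    z = cyclic_shift(z, rng.randrange(max_shift + 1))
    blocks = (w1 if bit == "1" else w0 for bit in z)
    return "".join(cyclic_shift(blk, rng.randrange(max_shift + 1)) for blk in blocks)


z1, w0, w1 = "00101", "11011", "00111"   # illustrative values only
y = substitution_product(z1, w0, w1)
x = shifted_instance(z1, w0, w1, max_shift=2, rng=random.Random(0))
print(y)
print(x)
```

With max_shift = 0 the shifted instance coincides with the plain product, which gives a quick sanity check on the construction.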
21Towards the Full Theorem
- For the full theorem: recursive composition
- Proof overview:
- 1. Define α-similarity of k distributions (information per query)
- 2. α-similarity ⇒ query lower bound 1/α (for adaptive algorithms)
- 3. Initial Shift metric has high α-similarity (induction basis)
- 4. α-similarity is amplified under substitution product (inductive step)
- 5. Prove edit distance concentrates well (requires large alphabet)
- 6. Can reduce large alphabet to binary (lossy, but done once)
22Conclusion
- We compute ed(x,y) up to (log n)^{O(1/ε)} approximation in n^{1+ε} time
- Via Asymmetric Query Complexity (new model)
- Open questions
- Do faster / limitations
- E.g. O(log^2 n) approximation in n^{1+o(1)} time?
- Use these insights for related problems
- Nearest Neighbor Search?
- Sublinear-time algorithms (symmetric queries)?
- Embeddings? Communication complexity?
- Further thoughts
- Practical ramifications?
- Asymmetric queries model?
- Paradigm for fast dynamic programming?
Thank you!