Polylogarithmic Approximation for Edit Distance (and the Asymmetric Query Complexity)

About This Presentation
Title:

Polylogarithmic Approximation for Edit Distance (and the Asymmetric Query Complexity)

Description:

Robert Krauthgamer [Weizmann Institute]. Joint with: Alexandr Andoni [Microsoft SVC].


Transcript and Presenter's Notes

Title: Polylogarithmic Approximation for Edit Distance (and the Asymmetric Query Complexity)


1
Polylogarithmic Approximation for Edit Distance
(and the Asymmetric Query Complexity)
  • Robert Krauthgamer [Weizmann Institute]
  • Joint with: Alexandr Andoni [Microsoft SVC]
  • Krzysztof Onak [CMU]

3
Edit Distance (Levenshtein distance)
Given two strings x, y ∈ Σ^n
ed(x,y) = minimum number of character operations
(insertion/deletion/substitution) that transform
x to y.
ed(banana, ananas) = 2
Applications
  • Computational Biology
  • Text processing
  • Web search

Generic Search Engine
4
Basic task
  • Compute ed(x,y) for input x, y ∈ Σ^n
  • O(n^2) time [WF74]
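
The quadratic-time dynamic program behind this bound can be sketched as follows. This is a minimal textbook version for illustration, not code from the talk; the function name `edit_distance` and the 0-indexed table layout are choices made here.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the classic O(|x| * |y|) dynamic program."""
    m, n = len(x), len(y)
    # D[i][j] = ed(x[:i], y[:j]); row 0 and column 0 are the empty-prefix cases.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # delete all i characters of x[:i]
    for j in range(n + 1):
        D[0][j] = j  # insert all j characters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + (x[i - 1] != y[j - 1]),  # match or substitution
                D[i - 1][j] + 1,                           # deletion
                D[i][j - 1] + 1,                           # insertion
            )
    return D[m][n]


print(edit_distance("banana", "ananas"))
```

The table has (|x|+1)·(|y|+1) cells and each cell takes constant work, which is where the O(n^2) bound comes from.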

Faster algorithms?
[Figure: dynamic-programming table D, with x = banana across the top and y = ananas down the side]
D(i,j) = ed(x[1..i], y[1..j])
D(i,j) = min of:
  D(i-1, j-1), if x_i = y_j
  D(i-1, j-1) + 1, if x_i ≠ y_j
  D(i-1, j) + 1
  D(i, j-1) + 1
5
Faster Algorithms?
  • Compute ed(x,y) for given x, y ∈ Σ^n
  • O(n^2) time [WF74]
  • O(n^2 / log^2 n) time [MP80]
  • Linear time (or near-linear)?
  • Specific cases (average, smoothed, restricted
    input) and variants (block edit distance etc.)
    [U83, LV85, M86, GG88, GP89, UW90, CL90,
    CH98, LMS98, U85, CL92, N99, CPSV00,
    MS00, CM02, AK08, BF08]
  • 2^(O(√(log n))) approximation [OR05, AO09],
    improving earlier n^c-approximation
    [BEKMRRS03, BJKK04, BES06]
  • Same barrier, 2^(O(√(log n)))-approximation, also for
    related tasks:
  • Nearest neighbor search (text indexing),
    embedding into normed spaces, sketching [OR05]

6
Results I
  • Theorem 1: Can approximate ed(x,y) within a
    (log n)^(O(1/ε)) factor in time n^(1+ε) (for any ε > 0).
  • Exponential improvement over previous factor
    2^(O(√(log n)))
  • Fallout from the study of asymmetric query model

7
Approach: asymmetric query model
  • Compress one string, x, to n^ε information
  • Use dynamic programming to compute ed(x,y) in
    n^(1+ε) time
  • How to compress?
  • Carefully subsample x
  • Focus on sample size (number of queried
    positions) in x, for fixed y
  • Obtain near-tight bounds

8
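Results II: Asymmetric Query Complexity
  • Problem: Decide ed(x,y) ≥ n/10 vs. ed(x,y) ≤ n/A
  • Complexity: number of queries into x (unlimited
    access to y)

Approximation A            | Queries (upper bound) | Queries (lower bound)
(log n)^(O(1/ε))           | O(n^ε)                | Ω(n^(ε/loglog n))
[n^(1/(t+1)), n^(1/t-ε)]   | O(log^t n)            | Ω(log^t n)

[Figure: number of queries as a function of A, with levels Θ(log n), Θ(log^2 n), Θ(log^3 n), ..., Θ(log^t n) and axis marks at A = n^(1-ε), n^(1/2), n^(1/2-ε), n^(1/3), n^(1/4), ..., n^(1/t-ε), n^(1/(t+1))]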
9
Upper bound
  • Theorem 2: Can distinguish ed(x,y) ≥ n/10 vs.
    ed(x,y) ≤ n/A for approximation A = (log n)^(O(1/ε))
    with n^ε queries into x (for any ε > 0).
  • Proof structure:
  • 1. Characterize edit distance by tree-distance T_xy
  • Parameter b ≥ 2 (degree)
  • T_xy ≈ ed(x,y) up to a 6b·log n factor
  • 2. Prune the tree to subsample x

[Figure: tree of degree b over x_1, x_2, ..., x_n, with the sampled positions in x marked]
10
Step 1: Tree distance
  • Partition x into b blocks, recursively, for
    h = log_b n levels

[Figure: x[1:n] partitioned recursively into blocks, shown alongside y[1:n] and one of its blocks]
  • T_i(s,u) = T-distance between x[s:s+l_i] and
    y[u:u+l_i], where l_i is the block length at level i

11
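Tree distance: recursive definition
  • Recall: T_i(s,u) = distance between x[s:s+l_i] and
    y[u:u+l_i]
  • Base case: T_h(s,u) = Hamming(x_s, y_u)
  • Output: T_xy = T_0(s=1, u=1)

[Figure: block x[s:s+l_i] of x aligned with block y[u:u+l_i] of y; the figure labels include a shift r]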
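
The recursive formula itself did not survive in this transcript. The sketch below assumes a recurrence of the form T_i(s,u) = Σ_{j=0..b-1} min_r [ T_{i+1}(s + j·l_{i+1}, u + j·l_{i+1} + r) + |r| ], which matches the base case and output stated on the slide, but the exact form is an assumption. Also assumed here: 0-indexed strings, n an exact power of b, and positions outside y counting as mismatches. The function name `tree_distance` is hypothetical, and the code is a brute-force illustration for tiny inputs, not the talk's algorithm.

```python
from functools import lru_cache


def tree_distance(x: str, y: str, b: int = 2) -> int:
    """T_xy = T_0(0, 0) under the ASSUMED recurrence described above."""
    n = len(x)
    # Assumption: len(x) == len(y) == b**h for some integer h >= 0.
    h, size = 0, 1
    while size < n:
        size *= b
        h += 1
    assert size == n == len(y), "sketch assumes n is an exact power of b"

    @lru_cache(maxsize=None)
    def T(i: int, s: int, u: int) -> int:
        if i == h:
            # Base case: Hamming distance between single characters.
            # Assumption: a position outside y counts as a mismatch.
            return 0 if 0 <= u < n and x[s] == y[u] else 1
        child = n // b ** (i + 1)  # block length l_{i+1} at the next level
        total = 0
        for j in range(b):
            cs, cu = s + j * child, u + j * child
            # Each child block may slide by r positions in y at cost |r|.
            total += min(T(i + 1, cs, cu + r) + abs(r) for r in range(-n, n + 1))
        return total

    return T(0, 0, 0)


print(tree_distance("abcd", "abce"))
```

Each child picks its own shift r independently, so the recursion never has to coordinate choices across sibling blocks; that independence is what makes the quantity a tree sum rather than a global alignment.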
12
T-distance approximates edit distance
  • Lemma: T_xy ≈ ed(x,y) up to a 6b·log_b n factor.
  • Hierarchical decomposition inspired by earlier
    approaches [BEKMRRS03, OR05]
  • All had approximation recurrence of the type
  • A(n) = c·A(n/b) + b
  • for c ≥ 2
  • Solves to A(n) ≥ 2^(√(log n)) factor for every choice
    of b
  • Our characterization has no multiplicative loss
    (c = 1)
  • A(n) = A(n/b) + b
  • Analysis inspired by algorithms for smoothed edit
    [AK08]
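
To see why the constant c matters, both recurrences can be unrolled over h = log_b n levels. The derivation below is a sketch under two assumptions that are not on the slide: A(1) = 1 and logarithms taken base 2.

```latex
% Assumptions (not from the slide): A(1) = 1, h = \log_b n = \log n / \log b, logs base 2.
\begin{align*}
A(n) &= c\,A(n/b) + b \;=\; c^{h} A(1) + b \sum_{i=0}^{h-1} c^{i}. \\[4pt]
\text{For } c = 2:\quad
A(n) &= 2^{h} + b\,(2^{h} - 1) \;\ge\; b \cdot 2^{h-1}
      \;=\; 2^{\log b \,+\, \frac{\log n}{\log b} \,-\, 1}
      \;\ge\; 2^{\,2\sqrt{\log n} \,-\, 1}, \\[4pt]
\text{For } c = 1:\quad
A(n) &= A(1) + b\,h \;=\; 1 + b \log_b n .
\end{align*}
% The last inequality for c = 2 is AM-GM applied to the exponent:
% \log b + \log n / \log b \ge 2\sqrt{\log n}, with equality at \log b = \sqrt{\log n}.
```

With c = 2 no choice of b escapes the 2^(√(log n)) barrier, while with c = 1 and b = 2 the same unrolling gives an O(log n) factor.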

13
Step 2: Compute the tree distance
  • For b = 2, T-distance gives O(log n)
    approximation!
  • BUT know only how to compute T-distance in O(n^2)
    time
  • Instead, for b = (log n)^(1/ε), can prune the tree to
    n^(O(ε)) nodes, and get 1+ε approximation
  • Pruning: subsample (log n)^(O(1)) children out of
    each node
  • Works only when ed(x,y) ≥ Ω(n)
  • Generally, must subsample the tree non-uniformly,
    using the Precision Sampling Lemma

[Figure: pruned tree of degree b, with the sampled positions in x marked]
14
Key tool: non-uniform sampling
  • Goal:
  • For unknown a_1, a_2, ..., a_n ∈ [0,1]
  • Estimate their sum, up to an additive constant
    error
  • Using only weak estimates ã_1, ã_2, ..., ã_n

Game between the Sum Estimator and the Adversary:
0. Sum Estimator: fix distribution U
1. Adversary: fix a_1, a_2, ..., a_n (unknown)
2. Sum Estimator: pick precisions u_i (our algorithm: u_i ~ U i.i.d.)
3. Adversary: provide ã_1, ã_2, ..., ã_n s.t. |a_i - ã_i| < 1/u_i
4. Sum Estimator: report S = S(ã_1, ..., u_1, ...) with |S - Σ a_i| < 1.
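
The mechanics of this game can be simulated with a naive baseline that requests the same precision u_i = n for every item. This is not the Precision Sampling Lemma's estimator; it is the trivial strategy whose average precision of n is the cost that non-uniform sampling aims to reduce. The function names and the way the adversary perturbs values are hypothetical choices for illustration.

```python
import random


def adversary_estimates(a, u, rng):
    """Step 3 of the game: return estimates with |a_i - est_i| < 1/u_i."""
    # Each perturbation has magnitude at most 0.999 / u_i, strictly below 1 / u_i.
    return [ai + 0.999 * rng.uniform(-1.0, 1.0) / ui for ai, ui in zip(a, u)]


def uniform_precision_sum(a, rng):
    """Naive baseline: request precision u_i = n for every item.

    Returns (estimate of sum(a), average precision requested).
    The n per-item errors, each below 1/n, add up to less than 1.
    """
    n = len(a)
    u = [float(n)] * n  # step 2: uniform precisions, not i.i.d. samples from U
    estimates = adversary_estimates(a, u, rng)  # step 3
    return sum(estimates), sum(u) / n  # step 4


rng = random.Random(0)
a = [rng.random() for _ in range(100)]  # step 1: hidden values in [0, 1]
s, avg_precision = uniform_precision_sum(a, rng)
print(abs(s - sum(a)) < 1.0, avg_precision)
```

The baseline meets the additive-error target only by paying precision n on every coordinate, which is the quantity the game asks the estimator to keep small on average.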
15
Precision Sampling
  • Goal: estimate Σ a_i from ã_i s.t. |a_i - ã_i| < 1/u_i.
  • Precision Sampling Lemma: Can achieve WHP
    additive error 1 and multiplicative error 1.5
    with expected precision E_{u_i ~ U}[u_i] = O(log n).
  • Inspired by a technique from [IW05] for
    streaming (F_k moments)
  • In fact, PSL gives simple improved algorithms
    for F_k moments, cascaded (mixed) norms,
    ℓ_p-sampling problems [AKO10]
  • Also distant relative of Priority Sampling
    [DLT07]

16
Precision Sampling for Edit Distance
  • Apply Precision Sampling to the tree from the
    characterization recursively at each node
  • If a node has very weak precision, can trim the
    entire sub-tree

17
Lower Bound Theorem
  • Theorem 3: Achieving approximation A = O(log^7 n)
    for edit distance requires asymmetric query
    complexity n^(Ω(1/loglog n)).
  • I.e., distinguishing ed(x,y) > n/10 vs.
    ed(x,y) < n/(10A)
  • Implications:
  • First lower bound to expose hardness from
    repetitiveness in edit distance
  • Contrast with edit distance on non-repetitive strings
    (Ulam's distance)
  • Empirically easier (better algorithms are known
    for it)
  • Yet, all previous lower bounds are essentially
    equivalent for the two variants [BEKMRRS03,
    AN10, KN05, KR06, AK07, AJP10]
  • But asymmetric query complexity:
  • Ulam: 2-approx. with O(log n) queries [ACCL04,
    SS10]
  • Edit: requires n^(Ω(1/loglog n)) queries

18
Lower Bound Techniques
  • Core gadget: σ(·) = cyclic shift operation
  • Observation: ed(x, σ^j(x)) ≤ 2j
  • Lower bound outline:
  • exhibit lower bound via shifts
  • Amplification by composing the hard instance
    recursively
  • We will see here:
  • Theorem 4: Asymmetric query complexity of
    approximation n^(1/2) to edit distance is Ω(log^2 n)
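
The observation that a cyclic shift by j positions costs at most 2j edits (delete j characters at one end, insert them at the other) can be checked numerically. The sketch below is illustrative only; the helper names and the left-rotation convention for the shift are assumptions, and it expects a non-empty string.

```python
def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance via the standard dynamic program, one row at a time."""
    prev = list(range(len(y) + 1))
    for i, xc in enumerate(x, start=1):
        cur = [i]
        for j, yc in enumerate(y, start=1):
            cur.append(min(prev[j - 1] + (xc != yc),  # match or substitution
                           prev[j] + 1,               # delete xc
                           cur[j - 1] + 1))           # insert yc
        prev = cur
    return prev[-1]


def cyclic_shift(x: str, j: int) -> str:
    """Rotate x left by j positions (direction chosen here by assumption)."""
    j %= len(x)
    return x[j:] + x[:j]


def check_shift_bound(x: str) -> bool:
    """True if ed(x, shift^j(x)) <= 2j for every 0 <= j < len(x)."""
    return all(edit_distance(x, cyclic_shift(x, j)) <= 2 * j
               for j in range(len(x)))


print(check_shift_bound("abracadabra"))
```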

19
The Shift Gadget
  • Lemma: Ω(log n) query lower bound for
    approximation A = n^0.5.
  • Hard distribution (x,y):
  • Fix specific z_1, z_2 ∈ {0,1}^n (random-looking)
  • Set x = σ^j(z_1) ⇒ ed(x,y) ≤ 2n^0.5 (close),
    or x = σ^j(z_2) ⇒ ed(x,y) ≥ n/10 (far)
  • Formally: y = z_1 and x = σ^j(z_1 OR z_2) for random
    j ∈ [n^0.5]
  • An algorithm is a set of queried positions Q ⊆ [n],
    |Q| << log n
  • ⇒ It reads (z_1 OR z_2) at positions Q + j
  • Claim: Both z_1|_(Q+j) and z_2|_(Q+j) are close to the
    uniform distribution on {0,1}^Q,
  • up to 2^|Q| / n^0.5 statistical distance
  • Hence |Q| ≥ Ω(log n), even for approximation
    A = n^0.99
20
Amplification via Substitution Product
  • Ω(log^2 n) lower bound by amplification: compose
    two shift instances
  • Hard distribution (x,y):
  • Fix z_1, z_2 ∈ {0,1}^√n, w_0, w_1 ∈ {0,1}^√n and
    y = z_1 ⊛ (w_0, w_1) (substitution)
  • Choose either z = z_1 (close) or z = z_2 (far)
  • x = z ⊛ (w_0, w_1) but with random shifts j ∈ [n^(1/3)]
    inside each block and between blocks
  • Intuition: must distinguish z = z_1 from z = z_2
  • Must learn Ω(log n) positions i of z, and each
    requires reading Ω(log n) further positions in
    the corresponding blocks w_(z_i)

[Figure: substitution example with labels w_0, w_1, z_1 and bit strings 00101, 11011, 00111; x is drawn as the blocks 11011 00111 11011 11011 00111]
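
The substitution step can be sketched in a few lines: each bit z_i is replaced by the block w_(z_i). The shifted variant below is only one hypothetical reading of "random shifts inside each block and between blocks"; the function names, the explicit shift parameters, and the sample strings are all assumptions made for illustration.

```python
def substitution_product(z: str, w0: str, w1: str) -> str:
    """Replace every bit of z by a block: '0' becomes w0 and '1' becomes w1."""
    return "".join(w1 if bit == "1" else w0 for bit in z)


def cyclic_shift(s: str, j: int) -> str:
    """Rotate s left by j positions; the empty string is returned unchanged."""
    if not s:
        return s
    j %= len(s)
    return s[j:] + s[:j]


def shifted_product(z: str, w0: str, w1: str, block_shifts, outer_shift: int) -> str:
    """Hypothetical shifted variant: rotate each block, then rotate the whole string."""
    blocks = [cyclic_shift(w1 if bit == "1" else w0, j)
              for bit, j in zip(z, block_shifts)]
    return cyclic_shift("".join(blocks), outer_shift)


y = substitution_product("010", "11011", "00111")
x = shifted_product("010", "11011", "00111", block_shifts=[1, 0, 2], outer_shift=1)
print(y, x)
```

A query into x lands inside some block, so learning a bit of z first requires telling the shifted w_0 from the shifted w_1, which is itself a shift instance. That nesting is where the product of the two logarithmic factors comes from.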
21
Towards the Full Theorem
  • For the full theorem: recursive composition
  • Proof overview:
  • 1. Define α-similarity of k distributions
    (information per query)
  • 2. α-similarity ⇒ query lower bound 1/α
    (for adaptive algorithms)
  • 3. Initial Shift metric has high α-similarity
    (induction basis)
  • 4. α-similarity amplified under substitution
    product (inductive step)
  • 5. Prove edit distance concentrates
    well (requires large alphabet)
  • 6. Can reduce large alphabet to binary (lossy,
    but done once)

22
Conclusion
  • We compute ed(x,y) up to (log n)^(O(1/ε))
    approximation in n^(1+ε) time
  • Via Asymmetric Query Complexity (new model)
  • Open questions:
  • Do faster / limitations
  • E.g. O(log^2 n) approximation in n^(1+o(1)) time?
  • Use these insights for related problems
  • Nearest Neighbor Search?
  • Sublinear-time algorithms (symmetric queries)?
  • Embeddings? Communication complexity?
  • Further thoughts
  • Practical ramifications?
  • Asymmetric queries model?
  • Paradigm for fast dynamic programming?

Thank you!