
Title: Efficient Algorithms for Substring Near Neighbor Problem


1
Efficient Algorithms for Substring Near Neighbor
Problem
  • Alexandr Andoni
  • Piotr Indyk
  • MIT

2
What's SNN?
  • SNN = text indexing with mismatches
  • Text indexing:
  • Construct a data structure on a text T[1..n],
    s.t.
  • Given a query P[1..m], it finds the occurrences of P in T
  • Text indexing with mismatches:
  • Given P, find the substrings of T that are equal
    to P except in at most R chars.
  • Motivation: e.g., computational biology (BLAST)

T = GAGTAACTCAATA
P = AGTA
3
Outline
  • General approach
  • View Near Neighbor in Hamming
  • Focus: reducing space
  • Background
  • Locality-Sensitive Hashing (LSH)
  • Solution
  • Reducing query & preprocessing
  • Redesign LSH
  • Concluding remarks

4
Approach (Or, why SNN?)
  • SNN = a near neighbor problem in the Hamming metric
    with m dimensions
  • Construct a data structure on
  • D = {all substrings of T of length m}, s.t.
  • Given P, find a point in D that is at distance ≤ R
    from P
  • ⇒ Use a NN data structure for Hamming

D = {GAGT, AGTA, GTAA, ..., AATA}
T = GAGTAACTCAATA
P = AGTA
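The reduction on this slide can be sketched as a brute-force baseline: build D explicitly and scan it for points within Hamming distance R of P. This is only an illustration of the problem statement on the slide's example strings, not the data structure from the talk; the function names are made up for this sketch.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

def substrings(text, m):
    """D = all length-m substrings of text, ordered by start position."""
    return [text[i:i + m] for i in range(len(text) - m + 1)]

def near_neighbors(text, pattern, r):
    """Start positions i with hamming(text[i:i+m], pattern) <= r (linear scan)."""
    m = len(pattern)
    return [i for i, s in enumerate(substrings(text, m))
            if hamming(s, pattern) <= r]

T = "GAGTAACTCAATA"
P = "AGTA"
print(substrings(T, 4)[:3])     # ['GAGT', 'AGTA', 'GTAA']
print(near_neighbors(T, P, 0))  # [1]     exact occurrence AGTA
print(near_neighbors(T, P, 1))  # [1, 9]  AATA differs from P in one char
```

The scan costs about n·m character comparisons per query, which is the cost the NN data structure is meant to avoid.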
5
Approximate NN
  • Exact NN problem seems hard (i.e., hard w/o
    exponential space or O(n) query time)
  • Approximate NN is easier
  • Defined for approximation c = 1+ε as:
  • OK to report a point at distance ≤ cR (when there
    is a point at distance ≤ R)

[Figure: query point q with radii R and cR]

Algorithm        Query            Space
[KOR98, IM98]    poly(log n, m)   n^O(1/ε^2)
LSH [IM98]       n^(1/c) m        n^(1+1/c)
6
Our contribution
  • Problem: need m in advance for NN
  • Have to construct a data structure for each m ≤ M
  • Here: approximate SNN data structure for unknown m
  • Without degradation in space or query time
  • Our algorithm for SNN is based on LSH
  • Supports patterns of length m ≤ M
  • Optimal space: n^(1+1/c)
  • Optimal query time: n^(1/c)
  • Slightly worse preprocessing time if c > 3
  • (Optimal w.r.t. LSH, modulo subpolynomial factors)
  • Also extends to ℓ1

7
Outline
  • General approach
  • View Near Neighbor in Hamming
  • Focus: reducing space
  • Background
  • Locality-Sensitive Hashing (LSH)
  • Solution
  • Reducing query & preprocessing
  • Redesign LSH
  • Concluding remarks

8
Locality-Sensitive Hashing
  • Based on a family of hash functions g
  • For points P[1..m], Q[1..m]:
  • If dist(P,Q) ≤ R, Pr_g[g(P) = g(Q)] is "medium"
  • If dist(P,Q) > cR, Pr_g[g(P) = g(Q)] is "low"
  • Idea:
  • Construct L hash tables with random g1, g2, ..., gL
  • For query P, look at buckets g1(P), g2(P), ..., gL(P)
  • Space: L·n
  • Query time: L

9
LSH for Hamming
  • Hash function g:
  • Projection on k random coordinates
  • E.g., g1(AGTA) = AA (k = 2)
  • L = # hash tables = n^(1/c)
  • k = log n / log(1/(1 - cR/m)) < m log n

R = 1
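A minimal sketch of the fixed-m scheme from the two slides above: L hash tables, each keyed by a projection on k random coordinates. The parameter choice follows the standard bit-sampling analysis (k makes far points collide with probability about 1/n, and L = 1/Pr[near collision]); the class and function names are assumptions of this sketch, and it requires cR < m.

```python
import math
import random
from collections import defaultdict

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def lsh_parameters(n, m, R, c):
    """k = log n / log(1/(1 - cR/m)); L = (1 - R/m)^(-k), both rounded up."""
    k = math.ceil(math.log(n) / -math.log(1 - c * R / m))
    L = math.ceil((1 - R / m) ** -k)
    return k, L

class HammingLSH:
    def __init__(self, points, k, L, seed=0):
        rng = random.Random(seed)
        m = len(points[0])
        self.points = points
        # Each g_i is a list of k coordinates drawn uniformly from 0..m-1.
        self.gs = [[rng.randrange(m) for _ in range(k)] for _ in range(L)]
        self.tables = []
        for g in self.gs:
            table = defaultdict(list)
            for idx, p in enumerate(points):
                table[self.project(g, p)].append(idx)
            self.tables.append(table)

    @staticmethod
    def project(g, s):
        return "".join(s[j] for j in g)

    def query(self, q, max_dist):
        """Index of some point within max_dist of q found in q's buckets, else None."""
        for g, table in zip(self.gs, self.tables):
            for idx in table.get(self.project(g, q), []):
                if hamming(self.points[idx], q) <= max_dist:
                    return idx
        return None

T = "GAGTAACTCAATA"
D = [T[i:i + 4] for i in range(len(T) - 3)]
k, L = lsh_parameters(n=len(D), m=4, R=1, c=2)
index = HammingLSH(D, k, L)
print(k, L)                             # 4 4
print(index.query("AGTA", max_dist=2))  # 1
```

Note that the tables are built for one fixed m; this is exactly the limitation the following slides address.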
10
Outline
  • General approach
  • View Near Neighbor in Hamming
  • Focus: reducing space
  • Background
  • Locality-Sensitive Hashing (LSH)
  • Solution
  • Reducing query & preprocessing
  • Redesign LSH
  • Concluding remarks

11
Unknown m
  • Bad news:
  • k is dependent on m!
  • Distinct m ⇒ distinct hash tables

g1(AGT) = AT
12
Solution
  • Let's just reuse the same data structure for all
    m
  • g(AGTA) = AA
  • On AGT ⇒ have to guess the last char
  • g(AGT?) = A?
  • Like in exact text indexing

13
Tries!
AGT
AGTA
  • Replace HT1 with a trie on g1(suffixes)
  • Stop the search when outside P
  • Same analysis!

Tries have been used with LSH before in [MS02],
but in a different context.
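One way to picture the trie idea is the sketch below. It assumes that g samples k sorted coordinates from 0..M-1, so that the coordinates falling inside a pattern of length m form a prefix of g's coordinate list and the trie walk can simply stop at the first coordinate outside P. The trie here is uncompressed and stores start positions at every node for simplicity (the slides use compressed tries); all names and the padding character are assumptions of this sketch.

```python
import random

class TrieNode:
    def __init__(self):
        self.children = {}
        self.starts = []  # start positions of suffixes passing through this node

def build_trie(text, coords, pad="$"):
    """Insert g(suffix) for every suffix of text; suffixes are padded past the end."""
    root, n = TrieNode(), len(text)
    for i in range(n):
        node = root
        node.starts.append(i)
        for j in coords:
            ch = text[i + j] if i + j < n else pad
            if ch not in node.children:
                node.children[ch] = TrieNode()
            node = node.children[ch]
            node.starts.append(i)
    return root

def candidates(root, coords, pattern):
    """Walk the trie along g(pattern); stop at the first coordinate outside P."""
    node = root
    for j in coords:
        if j >= len(pattern):
            break
        node = node.children.get(pattern[j])
        if node is None:
            return []
    return node.starts

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def query(text, root, coords, pattern, max_dist):
    m = len(pattern)
    for i in candidates(root, coords, pattern):
        if i + m <= len(text) and hamming(text[i:i + m], pattern) <= max_dist:
            return i
    return None

T = "GAGTAACTCAATA"
M, k = 6, 3                                   # max pattern length, coordinates per g
coords = sorted(random.Random(1).sample(range(M), k))
root = build_trie(T, coords)
print(query(T, root, coords, "AGTA", 1))      # 1
print(query(T, root, coords, "AGT", 0))       # 1  (same trie, shorter pattern)
```

A full structure would keep L such tries, one per g_i, and try each in turn, as with the L hash tables.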
14
Resulting performance
  • Space:
  • n^(1+1/c) (using compressed tries, one trie takes n
    space)
  • Optimal!
  • Query time:
  • n^(1/c) m (m = length of P)
  • Not yet really optimal: originally, could do
    dimensionality reduction
  • Can improve to n^(1/c) + m n^(o(1))
  • Preprocessing time:
  • n^(1+1/c) M (M = max m)
  • Not optimal (optimal: n^(1+1/c))
  • Can improve to n^(1+1/c) + M^(1/3) n^(1+o(1))
  • Optimal for c < 3

15
Outline
  • General approach
  • View Near Neighbor in Hamming
  • Focus: reducing space
  • Background
  • Locality-Sensitive Hashing (LSH)
  • Solution
  • Reducing query & preprocessing
  • Redesign LSH
  • Concluding remarks

16
Better query & preprocessing
  • Redesign LSH to improve query and preprocessing
  • Query: n^(1/c) m → n^(1/c) + m n^(o(1))
  • Preprocessing: n^(1+1/c) M → n^(1+1/c) + n^(1+o(1)) M
  • Idea for new LSH:
  • Use the same # of hash tables/tries (L = n^(1/c))
  • But use less randomness in choosing the hash
    functions g1, g2, ..., gL
  • S.t. each gi looks random, but the g's are not
    independent

17
New LSH scheme
  • Old scheme:
  • Choose L hash functions gi
  • Each gi = projection on k random coordinates
  • New scheme:
  • Construct the L functions gi from a smaller
    number of base hash functions
  • A base hash function = projection on k/2 random
    coordinates
  • {gi, i = 1..L} = all pairs of base hash
    functions
  • Need only L^(1/2) base hash functions!

18
Example
  • k = 4
  • w = # base fns = 4
  • L = (w choose 2) = (4 choose 2) = 6
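The example above can be reproduced with a short sketch: draw w base functions of k/2 coordinates each, then form one g per unordered pair by concatenating the pair's coordinates. The dimension m = 8 and the function names are arbitrary choices for this illustration.

```python
import random
from itertools import combinations

def make_base_functions(m, k, w, rng):
    """w base functions, each a projection on k/2 random coordinates of 0..m-1."""
    return [[rng.randrange(m) for _ in range(k // 2)] for _ in range(w)]

def make_hash_functions(base):
    """One g per unordered pair of base functions: g = <u_a, u_b>."""
    return [ua + ub for ua, ub in combinations(base, 2)]

rng = random.Random(0)
base = make_base_functions(m=8, k=4, w=4, rng=rng)
gs = make_hash_functions(base)
print(len(base), len(gs))        # 4 6
print(all(len(g) == 4 for g in gs))  # True: each g still projects on k = 4 coordinates
```

Two different g's may share a base function, which is why the g's are no longer independent even though each one on its own is a projection on k random coordinates.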

19
Saving time
  • Can save time since there are fewer base hash
    functions
  • E.g., computing fingerprints:
  • Want to compute FP(gi(P)) for i = 1..L
  • FP(gi(P)) = (Σ_j P_j · χ_j^i · 2^j) mod prime,
    where χ_j^i = 1 if gi samples coordinate j, else 0
  • Old way:
  • Would take L·m time for L functions g
  • New way:
  • Takes L^(1/2)·m time for L^(1/2) functions ui
  • Need only L time to combine FP(u(P)) into
    FP(g(P))
  • If g = <u1, u2>, then FP(g(P)) = (FP(u1(P)) + FP(u2(P)))
    mod prime
  • Total: L + L^(1/2)·m
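The combining step can be checked with a small sketch. It assumes the additive fingerprint reading of the slide (each sampled coordinate j contributes P[j]·2^j), encodes characters with `ord`, and picks 2^31 - 1 as the prime; these choices and the function names are assumptions of this illustration.

```python
import random
from itertools import combinations

PRIME = 2_147_483_647  # 2^31 - 1, an arbitrary prime for this sketch

def fingerprint(coords, p):
    """FP(u(P)) = (sum over sampled coordinates j of P[j] * 2^j) mod PRIME."""
    return sum(ord(p[j]) * pow(2, j, PRIME) for j in coords) % PRIME

def all_fingerprints(base, p):
    """FP(g(P)) for every pair g = <u_a, u_b>, combined from the base fingerprints."""
    fps = [fingerprint(u, p) for u in base]          # one pass per base function
    return [(fa + fb) % PRIME for fa, fb in combinations(fps, 2)]  # O(1) per g

rng = random.Random(0)
P = "GAGTAACT"
base = [[rng.randrange(len(P)) for _ in range(2)] for _ in range(4)]

# Old way for comparison: fingerprint each concatenated g directly.
direct = [fingerprint(ua + ub, P) for ua, ub in combinations(base, 2)]
print(all_fingerprints(base, P) == direct)  # True
```

The equality holds because the fingerprint is a sum over sampled coordinates, so the fingerprint of a concatenation is the sum of the parts modulo the prime.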

20
Better query & preprocessing (2)
  • E.g., for query:
  • Use fingerprints to leap faster in the trie
  • Yields time n^(1/c) + n^(1/(2c)) m (since L = n^(1/c))
  • To get n^(1/c) + n^(o(1)) m, generalize:
  • g = tuple of t base functions
  • A base function = k/t random coordinates
  • Other details are similar to fingerprints
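To get a feel for the generalization, the sketch below computes how many base functions w are needed so that t-subsets of them yield at least L hash functions, i.e. the smallest w with (w choose t) ≥ L. The helper name is made up for this illustration, and the exact way tuples are formed in the talk may differ.

```python
from math import comb

def base_functions_needed(L, t):
    """Smallest w such that C(w, t) >= L."""
    w = t
    while comb(w, t) < L:
        w += 1
    return w

print(base_functions_needed(6, 2))     # 4   (the earlier example: k = 4, L = 6)
print(base_functions_needed(1000, 2))  # 46
print(base_functions_needed(1000, 4))  # 14
```

Larger t means fewer base functions to evaluate on P (roughly L^(1/t) of them), at the price of combining t base fingerprints per g.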

21
Better preprocessing (3)
  • For preprocessing, can get
  • n^(1+1/c) + n^(1+o(1)) M
  • Can get n^(1+1/c) + n^(1+o(1)) M^(1/3)
  • Can construct a trie in n M^(1/3) time (instead of n
    M)
  • Using FFT, etc.

22
Outline
  • General approach
  • View Near Neighbor problem in Hamming metric
  • Focus: reducing space
  • Background
  • Locality-Sensitive Hashing (LSH)
  • Solution: LSH + Tries
  • Reducing query & preprocessing
  • Redesign LSH
  • Concluding remarks

23
Conclusions
  • Problem
  • Substring Near Neighbor (a.k.a., text indexing
    with mismatches)
  • Approach
  • View as NN in m-dimensional Hamming
  • Use LSH
  • Challenge
  • Variable-length pattern w/o degradation in
    performance
  • Solution
  • Space/query optimal (w.r.t. LSH)
  • Preprocessing optimal (w.r.t. LSH) for c < 3

24
Extensions
  • Extends to ℓ1
  • Nontrivial since we need quite different LSH
    functions
  • Preprocessing slightly worse: n^(1+1/c) + n^(1+o(1))
    M^(2/3)
  • Using the less-than-matching problem
    [Amir-Farach95]

25
Remarks
  • Other approaches?
  • Or, why LSH for SNN?
  • Since better SNN ⇒ better NN
  • And LSH is the best known algorithm for
    high-dimensional NN (using reasonable space)

26
  • Thanks!