Efficient Algorithms for Substring Near Neighbor Problem

1
Efficient Algorithms for Substring Near Neighbor Problem

Alexandr Andoni
Piotr Indyk
MIT

2
What's SNN?

SNN = Text Indexing with mismatches
Text Indexing:
Construct a data structure on a text $T[1..n]$, s.t.
given a query $P[1..m]$, it finds the occurrences of P in T
Text indexing with mismatches:
Given P, find the substrings of T that are equal to P except in at most R chars.
Motivation: e.g., computational bio (BLAST)

T = GAGTAACTCAATA
P = AGTA
3
Outline

General approach
View: Near Neighbor in Hamming
Focus: reducing space
Background
Locality-Sensitive Hashing (LSH)
Solution
Reducing query & preprocessing
Redesign LSH
Concluding remarks

4
Approach (Or, why SNN?)

SNN = a near neighbor problem in Hamming metric with m dimensions
Construct a data structure on D = {all substrings of T of length m}, s.t.
given P, find a point in D that is at distance $\le R$ from P
⇒ Use an NN data structure for Hamming

D = {GAGT, AGTA, GTAA, ..., AATA}
T = GAGTAACTCAATA
P = AGTA
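To make the reduction concrete, here is a minimal brute-force sketch in Python. It enumerates D for a known m and scans it for points within Hamming distance R of P. The function names are illustrative and not from the talk, and the linear scan is only a baseline, not the data structure the talk builds.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def substrings(text: str, m: int) -> list[str]:
    """D = all length-m substrings of text, in order of start position."""
    return [text[i:i + m] for i in range(len(text) - m + 1)]


def brute_force_snn(text: str, pattern: str, r: int) -> list[tuple[int, str]]:
    """All (start, substring) pairs within Hamming distance r of pattern."""
    m = len(pattern)
    return [
        (i, s)
        for i, s in enumerate(substrings(text, m))
        if hamming(s, pattern) <= r
    ]


if __name__ == "__main__":
    # The running example from the slides, with R = 1.
    print(brute_force_snn("GAGTAACTCAATA", "AGTA", 1))
```

On the slide's example with R = 1 this reports the exact occurrence AGTA at offset 1 and the one-mismatch substring AATA at offset 9 (0-based offsets).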
5
Approximate NN

Exact NN problem seems hard (i.e., hard w/o exponential space or O(n) query time)
Approximate NN is easier
Defined for approximation $c = 1 + \epsilon$ as:
OK to report a point at distance $\le cR$ (when there is a point at distance $\le R$)

[Figure: query point q with radii R and cR]

                 Query                        Space
[KOR98, IM98]    $\mathrm{poly}(\log n, m)$   $n^{O(1/\epsilon^2)}$
LSH [IM98]       $n^{1/c} \cdot m$            $n^{1+1/c}$
6
Our contribution

Problem: need m in advance for NN
Have to construct a data structure for each $m \le M$
Here: approx SNN data structure for unknown m
Without degradation in space or query time
Our algorithm for SNN, based on LSH:
Supports patterns of length $m \le M$
Optimal* space: $n^{1+1/c}$
Optimal* query time: $n^{1/c}$
Slightly worse preprocessing time if $c > 3$
(* Optimal w.r.t. LSH, modulo subpoly factors)
Also extends to $\ell_1$

7
Outline

General approach
View: Near Neighbor in Hamming
Focus: reducing space
Background
Locality-Sensitive Hashing (LSH)
Solution
Reducing query & preprocessing
Redesign LSH
Concluding remarks

8
Locality-Sensitive Hashing

Based on a family of hash functions {g}
For points $P[1..m]$, $Q[1..m]$:
If $\mathrm{dist}(P,Q) \le R$, then $\Pr_g[g(P) = g(Q)]$ is medium
If $\mathrm{dist}(P,Q) > cR$, then $\Pr_g[g(P) = g(Q)]$ is low
Idea:
Construct L hash tables with random $g_1, g_2, \ldots, g_L$
For query P, look at buckets $g_1(P), g_2(P), \ldots, g_L(P)$
Space: $L \cdot n$
Query time: $L$

9
LSH for Hamming

Hash function g:
Projection on k random coordinates
E.g., $g_1(\mathrm{AGTA}) = \mathrm{AA}$ (k = 2)
L = # hash tables = $n^{1/c}$
$k = \log n / \log\frac{1}{1 - cR/m} < m \log n$

(Example shown with R = 1)
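The scheme on the last two slides can be sketched as follows: L tables, each keyed by a projection on k random coordinates, with a query probing one bucket per table. This is a minimal illustration for a fixed, known m. The class and method names are made up for this sketch, coordinates are sampled with replacement as a simplifying assumption, and k and L are passed in by the caller rather than derived from n, c, R and m.

```python
import random
from collections import defaultdict


def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def project(coords, s):
    """g(s): keep only the characters of s at the sampled coordinates."""
    return "".join(s[j] for j in coords)


class HammingLSH:
    """L hash tables, each keyed by a projection on k random coordinates."""

    def __init__(self, points, k, num_tables, seed=0):
        rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
        m = len(points[0])
        self.points = points
        # One projection g_i per table (coordinates drawn with replacement).
        self.projections = [
            tuple(rng.randrange(m) for _ in range(k)) for _ in range(num_tables)
        ]
        self.tables = []
        for coords in self.projections:
            table = defaultdict(list)
            for idx, p in enumerate(points):
                table[project(coords, p)].append(idx)
            self.tables.append(table)

    def query(self, q, max_dist):
        """Return some stored point within max_dist of q, or None if no
        colliding point qualifies (the scheme is randomized, so a near
        point can be missed)."""
        for coords, table in zip(self.projections, self.tables):
            for idx in table.get(project(coords, q), ()):
                if hamming(self.points[idx], q) <= max_dist:
                    return self.points[idx]
        return None
```

A caller would build the structure on D (all length-m substrings of T) and pass `max_dist = cR`. Every candidate is verified against the true Hamming distance, so a returned point is always within `max_dist`; only the chance of missing a near point depends on k and L.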
10
Outline

General approach
View: Near Neighbor in Hamming
Focus: reducing space
Background
Locality-Sensitive Hashing (LSH)
Solution
Reducing query & preprocessing
Redesign LSH
Concluding remarks

11
Unknown m

Bad news:
k is dependent on m!
Distinct m ⇒ distinct hash tables

$g_1(\mathrm{AGT}) = \mathrm{AT}$
12
Solution

Let's just reuse the same data structure for all m
$g(\mathrm{AGTA}) = \mathrm{AA}$
On AGT ⇒ have to guess the last char
$g(\mathrm{AGT?}) = \mathrm{A?}$
Like in exact text indexing

13
Tries!
[Figure: trie, with labels AGT and AGTA]

Replace HT1 (the first hash table) with a trie on $g_1$(suffixes)
Stop the search when outside P
Same analysis!

Tries have been used with LSH before in [MS02], but in a different context
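A rough sketch of the trie idea, under stated assumptions: the sampled coordinates are kept in increasing order, each suffix of T is inserted by reading only those coordinates, and a query stops descending at the first coordinate that falls outside P. Everything below that node is a candidate and is then checked directly. The trie here is uncompressed and stores start positions at every node for simplicity, so it does not reflect the space bounds of the talk; all names are illustrative.

```python
def hamming(a, b):
    """Number of positions where two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


class TrieNode:
    __slots__ = ("children", "starts")

    def __init__(self):
        self.children = {}
        self.starts = []  # start positions of suffixes passing through this node


def build_projection_trie(text, coords):
    """Trie over g(suffix) for every suffix of text.

    coords must be sorted in increasing order (0-based offsets into a suffix).
    """
    root = TrieNode()
    n = len(text)
    for i in range(n):
        node = root
        node.starts.append(i)
        for j in coords:
            if i + j >= n:  # the suffix is too short for this coordinate
                break
            node = node.children.setdefault(text[i + j], TrieNode())
            node.starts.append(i)
    return root


def trie_candidates(root, coords, pattern):
    """Walk the trie along g(pattern); stop at the first coordinate outside P."""
    node = root
    for j in coords:
        if j >= len(pattern):  # outside P: every suffix below is a candidate
            break
        node = node.children.get(pattern[j])
        if node is None:
            return []
    return node.starts


def verified_matches(text, root, coords, pattern, max_dist):
    """Filter candidates by true Hamming distance to the pattern."""
    m = len(pattern)
    return [
        i
        for i in trie_candidates(root, coords, pattern)
        if i + m <= len(text) and hamming(text[i:i + m], pattern) <= max_dist
    ]
```

With `coords = (0, 3)`, which corresponds to g(AGTA) = AA in the slides, the query AGTA follows both coordinates, while the shorter query AGT stops after the first one and treats every suffix starting with A as a candidate.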
14
Resulting performance

Space:
$n^{1+1/c}$ (using compressed tries, one trie takes n space)
Optimal!
Query time:
$n^{1/c} \cdot m$ (m = length of P)
Not yet really optimal: originally, could do dim-reduction
Can improve to $n^{1/c} + m \cdot n^{o(1)}$
Preprocessing time:
$n^{1+1/c} \cdot M$ (M = max m)
Not optimal (optimal: $n^{1+1/c}$)
Can improve to $n^{1+1/c} + M^{1/3} \cdot n^{1+o(1)}$
Optimal for $c < 3$

15
Outline

General approach
View: Near Neighbor in Hamming
Focus: reducing space
Background
Locality-Sensitive Hashing (LSH)
Solution
Reducing query & preprocessing
Redesign LSH
Concluding remarks

16
Better query & preprocessing

Redesign LSH to improve query and preprocessing
Query: $n^{1/c} \cdot m$ ⇒ $n^{1/c} + m \cdot n^{o(1)}$
Preprocessing: $n^{1+1/c} \cdot M$ ⇒ $n^{1+1/c} + n^{1+o(1)} \cdot M$
Idea for new LSH:
Use the same # of hash tables/tries ($L = n^{1/c}$)
But use less randomness in choosing the hash functions $g_1, g_2, \ldots, g_L$
S.t. each $g_i$ looks random, but the g's are not independent

17
New LSH scheme

Old scheme:
Choose L hash functions $g_i$
Each $g_i$ = projection on k random coordinates
New scheme:
Construct the L functions $g_i$ from a smaller number of base hash functions
A base hash function = projection on k/2 random coordinates
$\{g_i,\ i = 1..L\}$ = all pairs of base hash functions
Need only $O(L^{1/2})$ base hash functions!

18
Example

k = 4
w = # base fns = 4
$L = \binom{w}{2} = \binom{4}{2} = 6$
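The construction in this example can be sketched in a few lines of Python: draw w base projections of k/2 coordinates each, then form one g per unordered pair. The function names and the choice to sample coordinates with replacement are assumptions of this sketch, not details from the talk.

```python
import random
from itertools import combinations


def make_base_functions(m, k, w, rng):
    """w base functions, each a tuple of k // 2 random coordinates in [0, m)."""
    return [tuple(rng.randrange(m) for _ in range(k // 2)) for _ in range(w)]


def compose_pairs(base):
    """One g per unordered pair of base functions: g = <u1, u2>."""
    return [u1 + u2 for u1, u2 in combinations(base, 2)]


if __name__ == "__main__":
    rng = random.Random(0)
    base = make_base_functions(m=10, k=4, w=4, rng=rng)
    gs = compose_pairs(base)
    print(len(gs))  # number of pairs, i.e. (w choose 2)
```

With k = 4 and w = 4 this yields 6 functions g, each reading 4 coordinates, while only 4 base functions ever need to be evaluated on a query.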

19
Saving time

Can save time since there are fewer base hash functions
E.g., computing fingerprints
Want to compute $FP(g_i(P))$ for $i = 1..L$
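$FP(g_i(P)) = \left(\sum_j P_j \cdot \chi_j^i \cdot 2^j\right) \bmod \text{prime}$, where $\chi_j^i$ marks whether $g_i$ uses coordinate j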
Old way:
Would take $L \cdot m$ time for the L functions g
New way:
Takes $L^{1/2} \cdot m$ time for the $L^{1/2}$ base functions $u_i$
Need only L time to combine the $FP(u(P))$ into the $FP(g(P))$
If $g = \langle u_1, u_2 \rangle$, then $FP(g(P)) = (FP(u_1(P)) + FP(u_2(P))) \bmod \text{prime}$
Total: $L + L^{1/2} \cdot m$
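The saving can be illustrated with a small Python sketch. It assumes an additive fingerprint (a sum over the sampled coordinates of the character code times a power of 2, taken modulo a prime), so that the fingerprint of a pair is the sum of the two base fingerprints. The modulus, the use of `ord` for character values, and the function names are choices made for this sketch only.

```python
from itertools import combinations

PRIME = (1 << 61) - 1  # a Mersenne prime; the modulus is an assumption of this sketch


def fingerprint(coords, pattern, prime=PRIME):
    """Additive fingerprint: sum of P[j] * 2^j over the sampled coordinates j."""
    return sum(ord(pattern[j]) * pow(2, j, prime) for j in coords) % prime


def fingerprints_direct(base, pattern, prime=PRIME):
    """Old way: evaluate every g = <u1, u2> from scratch."""
    return [
        fingerprint(u1 + u2, pattern, prime) for u1, u2 in combinations(base, 2)
    ]


def fingerprints_via_base(base, pattern, prime=PRIME):
    """New way: fingerprint each base function once, then combine per pair."""
    base_fp = [fingerprint(u, pattern, prime) for u in base]
    return [
        (base_fp[a] + base_fp[b]) % prime
        for a, b in combinations(range(len(base)), 2)
    ]
```

Both routes give the same list of fingerprints. The second touches the pattern only once per base function and then spends constant time per pair, which is the source of the $L + L^{1/2} \cdot m$ total.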

20
Better query & preproc (2)

E.g., for query:
Use fingerprints to leap faster in the trie
Yields time $n^{1/c} + n^{1/(2c)} \cdot m$ (since $L = n^{1/c}$)
To get $n^{1/c} + n^{o(1)} \cdot m$, generalize:
g = tuple of t base functions
a base function = k/t random coordinates
Other details similar to fingerprints
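As a rough illustration of why larger tuples help, the sketch below counts how many base functions are needed if, by analogy with the pairs case, the g's are taken to be distinct t-element subsets of w base functions. That reading is an assumption of this sketch; the talk does not spell out the exact construction here.

```python
from math import comb


def base_functions_needed(num_tables: int, t: int) -> int:
    """Smallest w such that (w choose t) >= num_tables."""
    if num_tables < 1 or t < 1:
        raise ValueError("num_tables and t must be positive")
    w = t
    while comb(w, t) < num_tables:
        w += 1
    return w


if __name__ == "__main__":
    for t in (1, 2, 3, 4):
        print(t, base_functions_needed(1000, t))
```

For L = 1000 tables this gives w = 1000 for t = 1, w = 46 for t = 2, and w = 20 for t = 3, so the number of base functions that must be evaluated on the query shrinks quickly as t grows.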

21
Better preprocessing (3)

Preprocessing: can get $n^{1+1/c} + n^{1+o(1)} \cdot M$
Can get $n^{1+1/c} + n^{1+o(1)} \cdot M^{1/3}$
Can construct a trie in $n \cdot M^{1/3}$ time (instead of $n \cdot M$)
Using FFT, etc.

22
Outline

General approach
View: Near Neighbor problem in Hamming metric
Focus: reducing space
Background
Locality-Sensitive Hashing (LSH)
Solution: LSH + Tries
Reducing query & preprocessing
Redesign LSH
Concluding remarks

23
Conclusions

Problem
Substring Near Neighbor (a.k.a., text indexing
with mismatches)
Approach
View as NN in m-dimensional Hamming
Use LSH
Challenge
Variable-length pattern w/o degradation in
performance
Solution
Space/query optimal (w.r.t. LSH)
Preprocessing optimal (w.r.t. LSH) for $c < 3$

24
Extensions

Extends to $\ell_1$
Nontrivial since we need quite different LSH functions
Preprocessing slightly worse: $n^{1+1/c} + n^{1+o(1)} \cdot M^{2/3}$
Using the Less-than-matching problem [Amir-Farach95]

25
Remarks

Other approaches?
Or, why LSH for SNN?
Since better SNN ⇒ better NN
And LSH is the best known algorithm for
high-dimensional NN (using reasonable space)

Thanks!
