Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249

1
Approximate String Matching Using Compressed Suffix Arrays
Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung,
Theoretical Computer Science, Vol. 352, 2006, pp. 240-249
  • Advisor: Prof. R. C. T. Lee
  • Speaker: C. W. Lu

2
  • Let x and y be two strings. The edit distance d(x,
    y) is the minimum number of character insertions,
    deletions, and replacements needed to convert x
    into y.
  • k-difference string matching problem:
  • Given a text T of length n, a pattern P of
    length m, and an error bound k,
  • find all positions i of T such that there exists
    a suffix S of T(1, i) with d(S, P) ≤ k.
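
The problem statement can be made concrete with the classical dynamic-programming solution. The sketch below is not the method of this paper; it is a plain O(nm) baseline, and the function name `k_difference_ends` is chosen here only for illustration.

```python
def k_difference_ends(text, pattern, k):
    """Return the 1-based end positions i such that some suffix S of
    text[0:i] satisfies d(S, pattern) <= k."""
    m = len(pattern)
    col = list(range(m + 1))          # distances against the empty text prefix
    ends = []
    for i, t in enumerate(text, start=1):
        new = [0]                     # S may start anywhere, so row 0 is always 0
        for r in range(1, m + 1):
            new.append(min(col[r - 1] + (pattern[r - 1] != t),  # match or replacement
                           col[r] + 1,                          # extra text character
                           new[r - 1] + 1))                     # missing pattern character
        col = new
        if col[m] <= k:
            ends.append(i)
    return ends


print(k_difference_ends("abbaaa", "aba", 1))  # [2, 3, 4, 5, 6]
```

Only one column of the table is kept at a time, so the sketch uses O(m) working space.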

3
  • The approach of this paper is as follows:
  • Given a pattern P and an error bound k, we
    generate every possible P′ that can be derived
    from P with at most k errors.
  • Then we conduct an exact match of every such P′
    against T.

4
  • Example
  • T = abbaaa,
  • P = aba and k = 1.
  • From P and k, we generate the following P′:
  • ba, aaba, baba, bba, aa, abba, aaa, ab, abaa,
    abab, abb, aba.
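
A brute-force sketch of this generate-then-match idea, assuming the alphabet is {a, b} as in the example. The helper name `edit_neighbourhood` is hypothetical, and the exact matching is done here with Python's substring test rather than with a suffix array.

```python
def edit_neighbourhood(pattern, k, alphabet):
    """All strings within edit distance k of pattern (including pattern itself)."""
    seen = {pattern}
    frontier = {pattern}
    for _ in range(k):
        nxt = set()
        for s in frontier:
            for i in range(len(s) + 1):
                for c in alphabet:
                    nxt.add(s[:i] + c + s[i:])          # insertion before position i
                    if i < len(s):
                        nxt.add(s[:i] + c + s[i + 1:])  # replacement at position i
                if i < len(s):
                    nxt.add(s[:i] + s[i + 1:])          # deletion at position i
        frontier = nxt - seen
        seen |= nxt
    return seen


T = "abbaaa"
candidates = edit_neighbourhood("aba", 1, "ab")
print(len(candidates))                          # 12
print([p for p in sorted(candidates) if p in T])
# ['aa', 'aaa', 'ab', 'abb', 'abba', 'ba', 'bba']
```

The number of candidates grows quickly with k and with the alphabet size, which is why the rest of the presentation organises the generation so that it can share work with the index.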

5
  • Then we conduct an exact matching of every P′
    against T. Any success indicates that there is a
    substring S of T such that d(S, P) ≤ k.
  • How can we generate all the P′ that we want?
  • We use the following observation.

6
[Figure: a substring S of T split into S1 S2, aligned with P split into P1 P2.]
Let S be a substring of T with S = S1S2, and let P = P1P2.
If d(S1, P1) ≤ k and d(S2, P2) = 0, then d(S, P) ≤ k.

7
  • Example

k = 2
i   1  2  3  4  5  6  7  8  9 10 11 12 13
T   A  C  A  C  A  A  A  A  A  C  A  C  C
i   1  2  3  4  5  6
P   A  G  A  B  C  A
P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA.
Consider the substring S = T(6, 11) = AAAACA. Let
S1 = T(6, 9) = AAAA and S2 = T(10, 11) = CA.
Then d(S1, P1) = 2 ≤ k and d(S2, P2) = 0, so we
have d(S, P) = 2 ≤ k.

8
  • Example

k = 2
i   1  2  3  4  5  6  7  8  9 10 11 12 13
T   A  C  A  C  A  A  A  A  A  C  A  C  C
i   1  2  3  4  5  6
P   A  G  A  B  C  A
P1 = P(1, 4) = AGAB and P2 = P(5, 6) = CA.
Consider the substring S = T(8, 11) = AACA. Let
S1 = T(8, 9) = AA and S2 = T(10, 11) = CA.
Then d(S1, P1) = 2 ≤ k and d(S2, P2) = 0, so we
have d(S, P) = 2 ≤ k.
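
The distances quoted in the two examples can be checked with a standard edit-distance routine. This is only a verification sketch; `edit_distance` is a name introduced here, not something from the paper.

```python
def edit_distance(x, y):
    """Classical dynamic-programming edit distance d(x, y)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j - 1] + (a != b),   # match or replacement
                           prev[j] + 1,              # delete a from x
                           cur[j - 1] + 1))          # insert b into x
        prev = cur
    return prev[-1]


P1, P2 = "AGAB", "CA"
for S1, S2 in [("AAAA", "CA"), ("AA", "CA")]:
    # d(S1, P1), d(S2, P2), d(S1 S2, P1 P2)
    print(edit_distance(S1, P1), edit_distance(S2, P2), edit_distance(S1 + S2, P1 + P2))
```

In both cases the errors are confined to the prefix, and the distance of the whole strings does not exceed the distance of the prefixes.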
9
  • Based upon the above observation, we can generate
    every edited pattern P′ by editing a prefix of P
    and keeping the remaining suffix untouched.
  • Consider P = aba, k = 1.

10
  • P = aba, k = 1. (k′ is the number of errors used so far.)

i = 1: P = aba
    ba    (deletion, k′ = 1)
    aaba  (insertion, k′ = 1)
    baba  (insertion, k′ = 1)
    bba   (substitution, k′ = 1)
i = 2: aba (k′ = 0)
    aa    (deletion, k′ = 1)
    aaba  (insertion, k′ = 1)
    abba  (insertion, k′ = 1)
    aaa   (substitution, k′ = 1)
i = 3: aba (k′ = 0)
    ab    (deletion, k′ = 1)
    abaa  (insertion, k′ = 1)
    abba  (insertion, k′ = 1)
    abb   (substitution, k′ = 1)
i = 4: aba (k′ = 0)
    abaa  (insertion, k′ = 1)
    abab  (insertion, k′ = 1)

11
  • P = aba, k = 2. The first level of the tree uses one error (k′ = 1):

i = 1, P = aba:        ba (deletion), aaba, baba (insertions), bba (substitution)
i = 2, aba (k′ = 0):   aa (deletion), aaba, abba (insertions), aaa (substitution)
i = 3, aba (k′ = 0):   ab (deletion), abaa, abba (insertions), abb (substitution)
i = 4, aba (k′ = 0):   abaa, abab (insertions)

12
  • P = aba, k = 2. Second level below the node ba (k′ = 1); every child has k′ = 2:

i = 2, ba (k′ = 1):   a (deletion), aba, bba (insertions), aa (substitution)
i = 3, ba (k′ = 1):   b (deletion), baa, bba (insertions), bb (substitution)
i = 4, ba (k′ = 1):   baa, bab (insertions)

13
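[Figure: P = PL PR is split at position i; the edited pattern is P′ = P′L P′R.]
For i = 1 to m+1:
  • Deletion (k′ increases by 1): delete the character of P at position i.
  • Replacement (k′ increases by 1): replace the character of P at position i
    with another character.
  • Insertion (k′ increases by 1): insert a character in front of position i.
  • No operation: keep the character at position i.
Throughout, k′ = d(PL, P′L) ≤ k and d(PR, P′R) = 0.
Terminate if k′ > k.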
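
A minimal sketch of this left-to-right generation, under the assumption that each call keeps the already edited prefix (`left`, playing the role of P′L) and the position i from which P is still untouched. The function name `edited_patterns` is hypothetical and positions are 0-based.

```python
def edited_patterns(pattern, k, alphabet):
    """Enumerate every P' with at most k edits, editing P from left to right."""
    m = len(pattern)
    out = set()

    def rec(left, i, errs):
        out.add(left + pattern[i:])              # P' = P'_L followed by the untouched suffix
        if errs == k:
            return
        for j in range(i, m + 1):
            if j < m:
                rec(left, j + 1, errs + 1)       # deletion at j
                for c in alphabet:
                    if c != pattern[j]:
                        rec(left + c, j + 1, errs + 1)   # replacement at j
            for c in alphabet:
                rec(left + c, j, errs + 1)       # insertion in front of j
            if j < m:
                left += pattern[j]               # no operation: keep pattern[j]

    rec("", 0, 0)
    return out


print(sorted(edited_patterns("aba", 1, "ab")))
```

For P = aba and k = 1 this visits the same nodes as the tree for k = 1; the same P′ can be reached along more than one branch, so the results are collected in a set.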
14
  • Our problem now becomes the following: given a
    pattern P, we produce a modified pattern P′. Our
    job is to determine whether P′ exactly matches
    some substring of T.
  • For example, suppose P = aba. We have ba as one of
    the modified patterns, so we would like to find out
    whether ba exactly matches a substring of T.

15
  • This exact matching can be performed by using the
    suffix array and the inverse suffix array.

16
Suffix Array
  • Let T = t0 t1 … tn, where t0, t1, …, tn-1 belong to
    an alphabet A and tn = $ is a special symbol that
    is not in A and is smaller than any symbol in A.
  • The jth suffix of T is defined as T(j, n) = tj … tn
    and is denoted by Tj.
  • The suffix array SA[0..n] of T is an array of
    integers j, each representing the suffix Tj, sorted
    in the lexicographic order of the corresponding
    suffixes.

17
Example
i      0 1 2 3 4 5 6 7 8 9
T      G A C A G T T C G $

Suffixes of T: GACAGTTCG$, ACAGTTCG$, CAGTTCG$,
AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $.
Lexicographic order: $, ACAGTTCG$, AGTTCG$,
CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$,
TTCG$, that is, T9, T1, T3, T2, T7, T8, T0, T4, T6, T5.

i      0 1 2 3 4 5 6 7 8 9
SA[i]  9 1 3 2 7 8 0 4 6 5
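
The example can be reproduced with a naive construction that simply sorts the suffixes. This is a sketch for checking small inputs only; it takes roughly O(n² log n) time and is not one of the linear-time constructions. It assumes '$' is used as the end symbol, which sorts before the letters in ASCII.

```python
def suffix_array(text):
    """Naive suffix array: sort the start positions by the suffix they begin."""
    return sorted(range(len(text)), key=lambda j: text[j:])


T = "GACAGTTCG" + "$"      # '$' is assumed to be the end symbol
SA = suffix_array(T)
print(SA)                  # [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]
```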
18
Inverse Suffix Array
  • The inverse suffix array of T is denoted as
    SA⁻¹[i].
  • SA⁻¹[i] equals the number of suffixes that are
    lexicographically smaller than Ti.

19
Example
i      0 1 2 3 4 5 6 7 8 9
T      G A C A G T T C G $

i   SA[i]   SA⁻¹[i]   suffix T_SA[i]
0   9       6         $
1   1       1         ACAGTTCG$
2   3       3         AGTTCG$
3   2       2         CAGTTCG$
4   7       7         CG$
5   8       9         G$
6   0       8         GACAGTTCG$
7   4       4         GTTCG$
8   6       5         TCG$
9   5       0         TTCG$

SA⁻¹[0] = 6 because there are 6 suffixes smaller
than T0 = GACAGTTCG$.
SA⁻¹[SA[x]] = x.
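
Because SA is a permutation of 0..n, its inverse can be filled in with one pass. The sketch below rebuilds SA naively and uses the hypothetical names `suffix_array` and `inverse_suffix_array`.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda j: text[j:])


def inverse_suffix_array(sa):
    """inv[j] = rank of suffix T_j = number of suffixes smaller than T_j."""
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    return inv


T = "GACAGTTCG$"
SA = suffix_array(T)
ISA = inverse_suffix_array(SA)
print(ISA)   # [6, 1, 3, 2, 7, 9, 8, 4, 5, 0]
```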
20
  • The sizes of SA and SA⁻¹ are O(n log n) bits each.
    Both data structures can be constructed in linear
    time [13, 15, 17].

21
  • In this paper, an interval [st..ed] is called the
    range of the suffix array of T corresponding to a
    string P if [st..ed] is the largest interval such
    that P is a prefix of every suffix Tj for j =
    SA[st], SA[st+1], …, SA[ed].
  • We write [st..ed] = range(T, P).

22
Example
i      0 1 2 3 4 5 6 7 8 9
T      G A C A G T T C G $
SA[i]  9 1 3 2 7 8 0 4 6 5

Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3),
CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0),
GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
P = G.
G is a prefix of T8, T0 and T4.
T8 = T_SA[5], T0 = T_SA[6], T4 = T_SA[7] ⇒ st = 5,
ed = 7, range(T, P) = [5..7].
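
The definition of range(T, P) can be checked directly by scanning the suffix array. This linear scan is only meant to illustrate the definition; the name `pattern_range` is hypothetical, and the faster searches are the subject of the lemmas that follow.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda j: text[j:])


def pattern_range(text, sa, pattern):
    """Return (st, ed) with pattern a prefix of T_SA[i] exactly for st <= i <= ed,
    or None if the pattern does not occur."""
    hits = [i for i, j in enumerate(sa) if text.startswith(pattern, j)]
    return (hits[0], hits[-1]) if hits else None   # hits are contiguous because SA is sorted


T = "GACAGTTCG$"
SA = suffix_array(T)
print(pattern_range(T, SA, "G"))    # (5, 7)
```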
23
  • Lemma 1 (Gusfield [12])
  • Given a text T together with its suffix array,
    assume [st..ed] = range(T, P). Then, for any
    character c, the interval [st′..ed′] = range(T,
    Pc) can be computed in O(log n) time.
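
One way to realise Lemma 1 is a pair of binary searches inside [st..ed] on the character that follows P in each suffix. The sketch below assumes P does not contain the end symbol '$' and uses the hypothetical name `extend_range`; it is an illustration, not the formulation from [12].

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda j: text[j:])


def extend_range(text, sa, rng, depth, c):
    """Given rng = range(T, P) with |P| = depth, return range(T, Pc) or None."""
    st, ed = rng

    def char_at(rank):
        pos = sa[rank] + depth
        return text[pos] if pos < len(text) else ""   # "" sorts before every character

    lo, hi = st, ed + 1
    while lo < hi:                    # first rank whose next character is >= c
        mid = (lo + hi) // 2
        if char_at(mid) < c:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, ed + 1
    while lo < hi:                    # first rank whose next character is > c
        mid = (lo + hi) // 2
        if char_at(mid) <= c:
            lo = mid + 1
        else:
            hi = mid
    return (first, lo - 1) if first < lo else None


T = "GACAGTTCG$"
SA = suffix_array(T)
G = extend_range(T, SA, (0, len(T) - 1), 0, "G")   # range(T, "G") from the full interval
print(G, extend_range(T, SA, G, 1, "A"))           # (5, 7) (6, 6)
```

The searches work because all suffixes in [st..ed] share the prefix P, so they are ordered by the character at offset |P|.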

24
  • Lemma 2
  • Given the interval [st1..ed1] = range(T, P1) and
    the interval [st2..ed2] = range(T, P2), we can
    find the interval [st..ed] = range(T, P1P2) in
    O(log n) time using the suffix array and the
    inverse suffix array of T.

25
  • Let [st1..ed1] = range(T, P1),
  • [st2..ed2] = range(T, P2),
  • [st..ed] = range(T, P1P2).
  • [st..ed] is a subinterval of [st1..ed1].

26
Example
i      0 1 2 3 4 5 6 7 8 9
T      G A C A G T T C G $
SA[i]  9 1 3 2 7 8 0 4 6 5

Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3),
CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0),
GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
P1 = G. P2 = A.
range(T, P1) = [5..7].
range(T, P1P2) must be within [5..7]. How can we
find the exact interval within [5..7]?
27
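  • By the definition of the suffix array, the suffixes
    Tj for j = SA[st1], SA[st1+1], …, SA[ed1] are in
    increasing lexicographic order.
  • All of these suffixes start with P1, so the suffixes
    T(j+|P1|), obtained by removing the prefix P1,
  • are also in increasing lexicographic order.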

28
T2 = CAGTTCG$ and T(2+1) = T3 = AGTTCG$. T(2+1) is
obtained by deleting the prefix of length 1
from T2. In general, T(i+1) can be obtained by
deleting the prefix of length 1 from Ti.
Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3),
CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0),
GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
29
Example
i      0 1 2 3 4 5 6 7 8 9
T      G A C A G T T C G $
SA[i]  9 1 3 2 7 8 0 4 6 5

Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3),
CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0),
GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
P1 = G. P2 = A.
range(T, P1) = [5..7] ⇒ T8 < T0 < T4.
  • Deleting the first character gives T(8+1), T(0+1), T(4+1),
  • and indeed T9 < T1 < T5.
30
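  • The lexicographic order of the suffixes T(SA[i]+|P1|),
    for i = st1, …, ed1,
  • is also increasing.
  • Thus SA⁻¹[SA[st1]+|P1|] < SA⁻¹[SA[st1+1]+|P1|] < …
    < SA⁻¹[SA[ed1]+|P1|].
  • To find st and ed, we find the smallest st in
    [st1..ed1] such that st2 ≤ SA⁻¹[SA[st]+|P1|] ≤ ed2
    and the largest ed in [st1..ed1] such that
    st2 ≤ SA⁻¹[SA[ed]+|P1|] ≤ ed2.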

31
Example
i      0 1 2 3 4 5 6 7 8 9
T      G A C A G A T C G $

P1 = G. P2 = A.

i   SA[i]   SA⁻¹[i]   suffix T_SA[i]
0   9       7         $
1   1       1         ACAGATCG$
2   3       4         AGATCG$
3   5       2         ATCG$
4   2       8         CAGATCG$
5   7       3         CG$
6   8       9         G$
7   0       5         GACAGATCG$
8   4       6         GATCG$
9   6       0         TCG$

range(T, P1) = [6..8].
range(T, P2) = [1..3].
range(T, P1P2) = [st..ed] with 6 ≤ st and ed ≤ 8.
SA⁻¹[SA[6]+1] = SA⁻¹[9] = 0, SA⁻¹[SA[7]+1] = SA⁻¹[1] = 1,
SA⁻¹[SA[8]+1] = SA⁻¹[5] = 3, and only 1 and 3 lie in [1..3]
⇒ st = 7 and ed = 8.
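
The search behind Lemma 2 can be sketched as two binary searches over [st1..ed1] on the rank of the suffix that follows P1. The names `concat_ranges` and `follow_rank` are introduced here for illustration, and P1 is assumed not to contain the end symbol.

```python
def suffix_array(text):
    return sorted(range(len(text)), key=lambda j: text[j:])


def inverse_suffix_array(sa):
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    return inv


def concat_ranges(sa, inv, r1, len1, r2):
    """Given r1 = range(T, P1), len1 = |P1| and r2 = range(T, P2),
    return range(T, P1P2) or None."""
    (st1, ed1), (st2, ed2) = r1, r2

    def follow_rank(rank):            # rank of the suffix left after removing P1
        pos = sa[rank] + len1
        return inv[pos] if pos < len(sa) else -1

    lo, hi = st1, ed1 + 1
    while lo < hi:                    # smallest index whose follow rank is >= st2
        mid = (lo + hi) // 2
        if follow_rank(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, ed1 + 1
    while lo < hi:                    # first index whose follow rank is > ed2
        mid = (lo + hi) // 2
        if follow_rank(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st < lo else None


T = "GACAGATCG$"
SA = suffix_array(T)
ISA = inverse_suffix_array(SA)
print(concat_ranges(SA, ISA, (6, 8), 1, (1, 3)))   # (7, 8)
```

The binary searches are valid because the follow ranks are increasing across [st1..ed1].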
32
  • To find the interval of the first character of P:
  • We construct an array C such that for any c in
    A, C[c] stores the total number of occurrences in
    T of all characters c′ with c′ ≤ c.
  • range(T, p1) = [C[c2]+1 .. C[c]], where c = p1 and
    c2 is the character immediately before c in A.

33
Example
i      0 1 2 3 4 5 6 7 8 9
T      G A C A G T T C G $
SA[i]  9 1 3 2 7 8 0 4 6 5

C[A] = 2, C[C] = 4, C[G] = 7, C[T] = 9.
Lexicographic order: $ (T9), ACAGTTCG$ (T1), AGTTCG$ (T3),
CAGTTCG$ (T2), CG$ (T7), G$ (T8), GACAGTTCG$ (T0),
GTTCG$ (T4), TCG$ (T6), TTCG$ (T5).
P = GACAGCA, so p1 = G.
range(T, p1) = [C[C]+1 .. C[G]] = [5..7].
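
A small sketch of the array C and of the range of a single character. The end symbol is left out of the counts, so rank 0 of the suffix array belongs to the suffix consisting of the end symbol alone; the function names are hypothetical.

```python
def cumulative_counts(text, alphabet):
    """C[c] = number of characters in text that are <= c (end symbol excluded)."""
    counts, total = {}, 0
    for c in sorted(alphabet):
        total += text.count(c)
        counts[c] = total
    return counts


def first_char_range(C, c):
    """range(T, c) = [C[c2] + 1 .. C[c]], where c2 precedes c in the alphabet."""
    order = sorted(C)
    idx = order.index(c)
    start = (C[order[idx - 1]] if idx > 0 else 0) + 1
    return (start, C[c]) if start <= C[c] else None


C = cumulative_counts("GACAGTTCG", "ACGT")
print(C)                          # {'A': 2, 'C': 4, 'G': 7, 'T': 9}
print(first_char_range(C, "G"))   # (5, 7)
```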
34
  • Lemma 3
  • Given the suffix array and the inverse suffix
    array of T, assume [st..ed] = range(T, P). For
    any character c, assuming the array C has been
    built in advance, we can find the interval
    [st′..ed′] = range(T, cP) in O(log n) time.

35
  • I. Construct Fst[1..m+1] and Fed[1..m+1] such
    that [Fst[i]..Fed[i]] = range(T, P[i..m]).
  • II. Call kapproximate([0..n], 1, 0, ε, ε).
  • kapproximate([s..e], i, k′, P′L, σ)
  • begin
  • 1. Given [Fst[i]..Fed[i]] = range(T,
    P[i..m]) and [s..e] = range(T, P′L), by
    Lemma 2 find [st..ed] = range(T,
    P′L P[i..m]).
  • 2. Report the occurrences of P′ = P′L P[i..m] in
    [st..ed] if the interval exists.
  • 3. If (k′ = k) return.
  • 4. For j = i to m+1
  • (a) (when j ≤ m, deletion at j)
    Call kapproximate([s..e], j+1,
    k′+1, P′L, dσ).
  • (b) (when j ≤ m, replacement at j) for
    each c in A
  • i. Given [s..e] = range(T,
    P′L), by Lemma 1 find [s′..e′] = range(T,
    P′L c).
  • ii. Call kapproximate([s′..e′],
    j+1, k′+1, P′L c, rσ).
  • (c) (insertion at j) for each c in A
  • i. Given [s..e] = range(T,
    P′L), by Lemma 1 find [s′..e′] = range(T,
    P′L c).
  • ii. Call kapproximate([s′..e′],
    j, k′+1, P′L c, iσ).
  • (d) (when j ≤ m) Given [s..e] = range(T,
    P′L), by Lemma 1 find [s′..e′] …
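
The procedure can be sketched in Python on top of a plain (uncompressed) suffix array. Several things here are assumptions rather than the authors' code: indices are 0-based, the last argument of kapproximate is omitted, step (d) is taken to extend P′L by the unchanged character P[j], the alphabet must not contain '$', and k is assumed to be smaller than m. The helper `narrow` serves both Lemma 1 and Lemma 2.

```python
def narrow(s, e, key, a, b):
    """Largest sub-interval of [s..e] whose non-decreasing key lies in [a..b]."""
    lo, hi = s, e + 1
    while lo < hi:                               # first index with key >= a
        mid = (lo + hi) // 2
        if key(mid) < a:
            lo = mid + 1
        else:
            hi = mid
    st, hi = lo, e + 1
    while lo < hi:                               # first index with key > b
        mid = (lo + hi) // 2
        if key(mid) <= b:
            lo = mid + 1
        else:
            hi = mid
    return (st, lo - 1) if st < lo else None


def k_difference(text, pattern, k, alphabet):
    """1-based end positions of substrings S of text with d(S, pattern) <= k."""
    T = text + "$"                               # '$' sorts before letters in ASCII
    n, m = len(T), len(pattern)
    sa = sorted(range(n), key=lambda j: T[j:])   # naive construction, for illustration
    inv = [0] * n
    for r, j in enumerate(sa):
        inv[j] = r

    def extend(rng, depth, c):                   # Lemma 1: range(T, X) -> range(T, Xc)
        return narrow(rng[0], rng[1], lambda r: T[sa[r] + depth], c, c)

    def concat(rng, depth, rng2):                # Lemma 2: range(T, X), range(T, Y) -> range(T, XY)
        return narrow(rng[0], rng[1], lambda r: inv[sa[r] + depth], rng2[0], rng2[1])

    full = (0, n - 1)
    F = []                                       # F[i] = range(T, pattern[i:]) or None
    for i in range(m + 1):
        rng = full
        for d, c in enumerate(pattern[i:]):
            rng = extend(rng, d, c) if rng else None
        F.append(rng)

    ends = set()

    def rec(rng, depth, i, errs):                # rng = range(T, P'_L), depth = |P'_L|
        hit = concat(rng, depth, F[i]) if F[i] else None
        if hit:
            length = depth + m - i               # |P'| = |P'_L| + |pattern[i:]|
            ends.update(sa[r] + length for r in range(hit[0], hit[1] + 1))
        if errs == k:
            return
        for j in range(i, m + 1):
            if j < m:
                rec(rng, depth, j + 1, errs + 1)                 # deletion at j
            for c in alphabet:
                nxt = extend(rng, depth, c)
                if nxt:
                    rec(nxt, depth + 1, j, errs + 1)             # insertion in front of j
                    if j < m and c != pattern[j]:
                        rec(nxt, depth + 1, j + 1, errs + 1)     # replacement at j
            if j == m:
                break
            rng = extend(rng, depth, pattern[j])                 # assumed step (d): keep pattern[j]
            if rng is None:
                return
            depth += 1

    rec(full, 0, 0, 0)
    return sorted(ends)


print(k_difference("abbaaa", "aba", 1, "ab"))   # [2, 3, 4, 5, 6]
```

Branches whose prefix P′L does not occur in T are cut off immediately, because Lemma 1 returns no interval for them.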

36
  • After O(n)-time preprocessing of the text T into
    an O(n log n)-bit data structure, the algorithm
    solves the k-difference problem in
    O(|A|^k m^k log n + output time) time.

37
  • References
  • [1] A. Amir, D. Keselman, G.M. Landau, M.
    Lewenstein, N. Lewenstein, M. Rodeh, Indexing and
    dictionary matching with one error, in Proc.
    Sixth WADS, Lecture Notes in Computer Science,
    vol. 1663, Springer, Berlin, 1999, pp. 181-192.
  • [2] A. Amir, M. Lewenstein, Ely. Porat, Faster
    algorithms for string matching with k mismatches,
    in Proc. 11th Ann. ACM-SIAM Symp. on
    Discrete Algorithms, 2000, pp. 794-803.
  • [3] R.A. Baeza-Yates, G. Navarro, A faster
    algorithm for approximate string matching, in
    Proc. Seventh Ann. Symp. on Combinatorial Pattern
    Matching (CPM'96), pp. 1-23.
  • [4] R.A. Baeza-Yates, G. Navarro, A practical
    index for text retrieval allowing errors, in
    CLEI, vol. 1, November 1997, pp. 273-282.
  • [5] R. Boyer, S. Moore, A fast string matching
    algorithm, CACM 20 (1977) 762-772.
  • [6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook,
    Range searching over tree cross products, in ESA
    2000, pp. 120-131.
  • [7] A. Cobbs, Fast approximate matching using
    suffix trees, in Proc. Sixth Ann. Symp. on
    Combinatorial Pattern Matching (CPM'95), Lecture
    Notes in Computer Science, vol. 807, Springer,
    Berlin, 1995, pp. 41-54.
  • [8] R. Cole, L.A. Gottlieb, M. Lewenstein,
    Dictionary matching and indexing with errors and
    don't cares, in Proc. 36th Ann. ACM Symp. on
    Theory of Computing, 2004, pp. 91-100.
  • [9] P. Ferragina, G. Manzini, Opportunistic data
    structures with applications, in Proc. 41st IEEE
    Symp. on Foundations of Computer Science
    (FOCS'00), 2000, pp. 390-398.

38
  • [10] G. Gonnet, A tutorial introduction to
    computational biochemistry using Darwin,
    Technical Report, Informatik E.T.H., Zurich,
    Switzerland, 1992.
  • [11] R. Grossi, J.S. Vitter, Compressed suffix
    arrays and suffix trees with applications to text
    indexing and string matching, in Proc. 32nd ACM
    Symp. on Theory of Computing, 2000, pp. 397-406.
  • [12] D. Gusfield, Algorithms on Strings, Trees,
    and Sequences: Computer Science and Computational
    Biology, Cambridge University Press,
    Cambridge, 1997.
  • [13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a
    time-and-space barrier in constructing full-text
    indices, in Proc. IEEE Symp. on Foundations
    of Computer Science, 2003.
  • [14] P. Jokinen, E. Ukkonen, Two algorithms for
    approximate string matching in static texts, in
    Proc. MFCS'91, Lecture Notes in Computer Science,
    vol. 520, Springer, Berlin, 1991, pp. 240-248.
  • [15] D.K. Kim, J.S. Sim, H. Park, K. Park,
    Linear-time construction of suffix arrays, in
    CPM 2003, pp. 186-199.
  • [16] D.E. Knuth, J. Morris, V. Pratt, Fast
    pattern matching in strings, SIAM J. Comput. 6
    (1977) 323-350.
  • [17] P. Ko, S. Aluru, Space efficient linear time
    construction of suffix arrays, in CPM 2003, pp.
    200-210.
  • [18] G.M. Landau, U. Vishkin, Fast parallel and
    serial approximate string matching, J. Algorithms
    10 (1989) 157-169.
  • [19] U. Manber, G. Myers, Suffix arrays: a new
    method for on-line string searches, SIAM J.
    Comput. 22 (5) (1993) 935-948.

39
  • [20] E.M. McCreight, A space economical suffix
    tree construction algorithm, J. ACM 23 (2) (1976)
    262-272.
  • [21] G. Navarro, A guided tour to approximate
    string matching, ACM Comput. Surveys 33 (1)
    (2001) 31-88.
  • [22] G. Navarro, R.A. Baeza-Yates, A new indexing
    method for approximate string matching, in Proc.
    10th Ann. Symp. on Combinatorial Pattern
    Matching (CPM'99), pp. 163-185.
  • [23] G. Navarro, R.A. Baeza-Yates, A hybrid
    indexing method for approximate string matching,
    J. Discrete Algorithms 1 (1) (2000) 205-239.
  • [24] G. Navarro, R. Baeza-Yates, E. Sutinen, J.
    Tarhio, Indexing methods for approximate string
    matching, IEEE Data Eng. Bull. 24 (4) (2001)
    19-27.
  • [25] G. Navarro, E. Sutinen, J. Tanninen, J.
    Tarhio, Indexing text with approximate q-grams,
    in Proc. 11th Ann. Symp. on Combinatorial
    Pattern Matching, Lecture Notes in Computer
    Science, vol. 1848, Springer, Berlin, 2000.
  • [26] K. Sadakane, T. Shibuya, Indexing huge
    genome sequences for solving various problems,
    Genome Informatics 12 (2001) 175-183.
  • [27] F. Shi, Fast approximate string matching
    with q-blocks sequences, in Proc. Third South
    American Workshop on String Processing (WSP'96),
    Carleton University Press, 1996.
  • [28] E. Sutinen, J. Tarhio, Filtration with
    q-samples in approximate string matching, in
    Proc. Seventh Ann. Symp. on Combinatorial Pattern
    Matching (CPM'96), pp. 50-63.
  • [29] E. Ukkonen, Approximate matching over suffix
    trees, in Proc. Combinatorial Pattern Matching
    1993, vol. 4, Springer, Berlin, June 1993,
    pp. 228-242.
  • [30] R.A. Wagner, M.J. Fischer, The
    string-to-string correction problem, J. ACM 21
    (1974) 168-173.

40
  • Thank you!