Title: Approximate String Matching Using Compressed Suffix Arrays Trinh N. D. Huynh, W. K. Hon, T. W. Lam and W. K. Sung, Theoretical Computer Science, Vol. 352, 2006, pp. 240-249
- Advisor: Prof. R. C. T. Lee
- Speaker: C. W. Lu
- Let x and y be two strings. The edit distance d(x, y) is the minimum number of character insertions, deletions, and replacements needed to convert string x to y.
- The k-difference string matching problem:
- Given a text T of length n, a pattern P of length m, and an error bound k.
- Find all positions i of T such that there exists a suffix S of T(1, i) with d(S, P) ≤ k.
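The two definitions above can be sketched with the classical dynamic programming recurrences. This is only an illustration of the problem statement, not the index-based method of the paper; the function names `edit_distance` and `k_difference_end_positions` are hypothetical.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and replacements turning x into y."""
    prev = list(range(len(y) + 1))            # distance from "" to each prefix of y
    for i, cx in enumerate(x, start=1):
        cur = [i]                             # distance from x[:i] to ""
        for j, cy in enumerate(y, start=1):
            cur.append(min(prev[j - 1] + (cx != cy),  # match or replacement
                           prev[j] + 1,               # delete cx
                           cur[j - 1] + 1))           # insert cy
        prev = cur
    return prev[-1]


def k_difference_end_positions(text: str, pattern: str, k: int) -> list[int]:
    """1-based positions i such that some suffix S of text[:i] has d(S, pattern) <= k."""
    # col[j] = minimum over suffixes S of the text read so far of d(S, pattern[:j]).
    col = list(range(len(pattern) + 1))
    hits = []
    for i, ct in enumerate(text, start=1):
        new = [0]                             # the empty suffix matches the empty prefix
        for j, cp in enumerate(pattern, start=1):
            new.append(min(col[j - 1] + (ct != cp), col[j] + 1, new[j - 1] + 1))
        col = new
        if col[-1] <= k:
            hits.append(i)
    return hits


print(k_difference_end_positions("abbaaa", "aba", 1))   # [2, 3, 4, 5, 6]
```

This scan takes O(nm) time per query, which is the cost an index on T is meant to avoid.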
- The approach of this paper is as follows:
- Given a pattern P and an error bound k, we generate all possible edited patterns P' that can be derived from P with at most k errors.
- Then we conduct an exact match of every such P' against T.
- Example
- T = abbaaa,
- P = aba and k = 1.
- From P and k, we generate the following P's:
- ba, aaba, baba, bba, aa, abba, aaa, ab, abaa, abab, abb, aba.
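A brute-force sketch of this generate-then-match idea, assuming the alphabet is {a, b}. The helper names are hypothetical, and the substring test `q in text` merely stands in for the exact matching step.

```python
def one_edit_neighbors(p: str, alphabet: str) -> set[str]:
    """All strings obtained from p by one insertion, deletion or replacement."""
    out = set()
    for i in range(len(p) + 1):
        for c in alphabet:
            out.add(p[:i] + c + p[i:])              # insertion before position i
        if i < len(p):
            out.add(p[:i] + p[i + 1:])              # deletion of p[i]
            for c in alphabet:
                if c != p[i]:
                    out.add(p[:i] + c + p[i + 1:])  # replacement of p[i]
    return out


def edited_patterns(p: str, k: int, alphabet: str) -> set[str]:
    """All strings within edit distance k of p (p itself included)."""
    seen = {p}
    frontier = {p}
    for _ in range(k):
        frontier = {q for s in frontier for q in one_edit_neighbors(s, alphabet)} - seen
        seen |= frontier
    return seen


def matching_patterns(text: str, p: str, k: int, alphabet: str) -> set[str]:
    """Edited patterns that occur exactly as a substring of text."""
    return {q for q in edited_patterns(p, k, alphabet) if q and q in text}


print(sorted(matching_patterns("abbaaa", "aba", 1, "ab")))
# ['aa', 'aaa', 'ab', 'abb', 'abba', 'ba', 'bba']
```

The number of edited patterns grows quickly with k and with the alphabet size, so each exact match needs to be cheap.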
- Then we conduct an exact matching of all P's against T. Any success indicates that there is a substring S in T such that d(S, P) ≤ k.
- How can we generate all the P's that we want?
- We use the following observation.
[Figure: a substring S = S1S2 of T aligned with P = P1P2; S1 is aligned with P1 and S2 with P2.]
- Let S be a substring of T with S = S1S2, and let P = P1P2. If d(S1, P1) ≤ k and d(S2, P2) = 0, then d(S, P) ≤ k.
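A short derivation of why the observation holds, assuming only that an edit script for S1 to P1 followed by one for S2 to P2 is a valid edit script for S to P:

```latex
\begin{aligned}
d(S, P) &= d(S_1 S_2,\ P_1 P_2) \\
        &\le d(S_1, P_1) + d(S_2, P_2) \\
        &= d(S_1, P_1) + 0 \\
        &\le k .
\end{aligned}
```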
k = 2
i    1  2  3  4  5  6  7  8  9 10 11 12 13
T    A  C  A  C  A  A  A  A  A  C  A  C  C
i    1  2  3  4  5  6
P    A  G  A  B  C  A
- P1 = AGAB, P2 = CA.
- Consider the substring S = T(6, 11) = AAAACA. Let S1 = T(6, 9) = AAAA and S2 = T(10, 11) = CA. Dist(S1, P1) = 2 ≤ k and Dist(S2, P2) = 0, so we have Dist(S, P) = 2 ≤ k.
k = 2
i    1  2  3  4  5  6  7  8  9 10 11 12 13
T    A  C  A  C  A  A  A  A  A  C  A  C  C
i    1  2  3  4  5  6
P    A  G  A  B  C  A
- P1 = AGAB, P2 = CA.
- Consider the substring S = T(8, 11) = AACA. Let S1 = T(8, 9) = AA and S2 = T(10, 11) = CA. Dist(S1, P1) = 2 ≤ k and Dist(S2, P2) = 0, so we have Dist(S, P) = 2 ≤ k.
- Based upon the above observation, we can generate all edited patterns P' by editing the prefix and keeping the suffix untouched, in some manner.
- Consider P = aba, k = 1.
- P = aba (k' is the number of errors in the edited pattern)
- i = 1: ba (Deletion, k' = 1); aaba, baba (Insertion, k' = 1); bba (Substitution, k' = 1); aba (k' = 0)
- i = 2: aa (Deletion, k' = 1); aaba, abba (Insertion, k' = 1); aaa (Substitution, k' = 1); aba (k' = 0)
- i = 3: ab (Deletion, k' = 1); abaa, abba (Insertion, k' = 1); abb (Substitution, k' = 1); aba (k' = 0)
- i = 4: abaa, abab (Insertion, k' = 1)
- Starting from ba (k' = 1):
- i = 2: a (Deletion, k' = 2); aba, bba (Insertion, k' = 2); aa (Substitution, k' = 2); ba (k' = 1)
- i = 3: b (Deletion, k' = 2); baa, bba (Insertion, k' = 2); bb (Substitution, k' = 2); ba (k' = 1)
- i = 4: baa, bab (Insertion, k' = 2)
[Figure: P is split at position i into a left part PL and a right part PR. For i = 1 to m+1, one of four operations is applied at position i (deletion, replacement, insertion, or no operation), giving P' = P'L P'R.]
- k' = Dist(PL, P'L) ≤ k and Dist(PR, P'R) = 0.
- Terminate if k' > k.
- Our problem now becomes the following: given a pattern P, we produce a modified pattern P'. Our job is to determine whether P' exactly matches some substring of T or not.
- For example, suppose P = aba. We have ba as one of the modified patterns, so we would like to find out whether ba matches exactly with a substring of T.
- This exact matching can be performed by using the suffix array and the inverse suffix array.
Suffix Array
- Let T = t_0 t_1 ... t_n, where t_0, t_1, ..., t_{n-1} are symbols of an alphabet A and t_n = $ is a special symbol that is not in A and is smaller than any symbol in A.
- The jth suffix of T is defined as T(j, n) = t_j ... t_n and is denoted by T_j.
- The suffix array SA[0..n] of T is an array of integers j that represent the suffixes T_j; the integers are sorted in lexicographic order of the corresponding suffixes.
Example
i        0  1  2  3  4  5  6  7  8  9
T        G  A  C  A  G  T  T  C  G  $
- Suffixes of T: GACAGTTCG$, ACAGTTCG$, CAGTTCG$, AGTTCG$, GTTCG$, TTCG$, TCG$, CG$, G$, $.
- Lexicographic order: $, ACAGTTCG$, AGTTCG$, CAGTTCG$, CG$, G$, GACAGTTCG$, GTTCG$, TCG$, TTCG$, that is, T_9, T_1, T_3, T_2, T_7, T_8, T_0, T_4, T_6, T_5.
i        0  1  2  3  4  5  6  7  8  9
SA[i]    9  1  3  2  7  8  0  4  6  5
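The example can be reproduced with a naive construction that simply sorts the suffix start positions. This is a sketch for small inputs, not one of the linear-time constructions; the function name `suffix_array` is hypothetical, and it assumes the terminator is written as `$`, which sorts before the letters in ASCII.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes of text, in lexicographic order of the suffixes."""
    # Naive construction: sort positions by the suffix starting there.
    return sorted(range(len(text)), key=lambda j: text[j:])


T = "GACAGTTCG$"            # '$' plays the role of the special symbol t_n
SA = suffix_array(T)
print(SA)                   # [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]
```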
Inverse Suffix Array
- The inverse suffix array of T is denoted as SA^-1[i].
- SA^-1[i] equals the number of suffixes which are lexicographically smaller than T_i.
Example
i          0  1  2  3  4  5  6  7  8  9
T          G  A  C  A  G  T  T  C  G  $
- Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
i          0  1  2  3  4  5  6  7  8  9
SA[i]      9  1  3  2  7  8  0  4  6  5
SA^-1[i]   6  1  3  2  7  9  8  4  5  0
- SA^-1[0] = 6 because there are 6 suffixes smaller than T_0 = GACAGTTCG$.
- SA^-1[SA[x]] = x.
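A minimal sketch of how the inverse array follows from the suffix array of the example; the function name `inverse_suffix_array` is hypothetical.

```python
def inverse_suffix_array(sa: list[int]) -> list[int]:
    """inv[j] = rank of suffix T_j, i.e. the number of suffixes smaller than T_j."""
    inv = [0] * len(sa)
    for rank, j in enumerate(sa):
        inv[j] = rank
    return inv


SA = [9, 1, 3, 2, 7, 8, 0, 4, 6, 5]      # suffix array of GACAGTTCG$
INV = inverse_suffix_array(SA)
print(INV)                                # [6, 1, 3, 2, 7, 9, 8, 4, 5, 0]
```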
- The sizes of SA and SA^-1 are O(n log n) bits. Both data structures can be constructed in linear time [13, 15, 17].
- In this paper, an interval [st..ed] is called the range of the suffix array of T corresponding to a string P if [st..ed] is the largest interval such that P is a prefix of every suffix T_j for j = SA[st], SA[st+1], ..., SA[ed].
- We write [st..ed] = range(T, P).
Example
i        0  1  2  3  4  5  6  7  8  9
T        G  A  C  A  G  T  T  C  G  $
SA[i]    9  1  3  2  7  8  0  4  6  5
- Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
- P = G.
- G is a prefix of T_8, T_0 and T_4.
- T_8 = T_{SA[5]}, T_0 = T_{SA[6]}, T_4 = T_{SA[7]} ⇒ st = 5, ed = 7, range(T, P) = [5..7].
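For small inputs, range(T, P) can be found by binary search over the length-|P| prefixes of the sorted suffixes. This sketch materialises those prefixes for simplicity, so it is not space efficient; the function name `suffix_range` is hypothetical.

```python
from bisect import bisect_left, bisect_right


def suffix_range(text, sa, pattern):
    """Return (st, ed) with range(T, P) = [st..ed], or None if P does not occur."""
    # Truncating the sorted suffixes to len(pattern) characters keeps them non-decreasing.
    prefixes = [text[j:j + len(pattern)] for j in sa]
    st = bisect_left(prefixes, pattern)
    ed = bisect_right(prefixes, pattern) - 1
    return (st, ed) if st <= ed else None


T = "GACAGTTCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])
print(suffix_range(T, SA, "G"))     # (5, 7)
```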
- Lemma 1 (Gusfield [12])
- Given a text T together with its suffix array, assume [st..ed] = range(T, P). Then, for any character c, the interval [st'..ed'] = range(T, Pc) can be computed in O(log n) time.
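One way to obtain the O(log n) bound of Lemma 1 is to binary search inside [st..ed] on the character that follows P in each suffix. The sketch below assumes the length of P is passed in as `plen`; the function name `extend_range` is hypothetical.

```python
def extend_range(text, sa, st, ed, plen, c):
    """Given [st..ed] = range(T, P) with |P| = plen, return range(T, Pc) or None."""
    def char_at(i):
        # Character following P in the suffix at SA[i]; "" if the suffix ends there.
        pos = sa[i] + plen
        return text[pos] if pos < len(text) else ""

    # All suffixes in [st..ed] share the prefix P, so char_at is non-decreasing there.
    lo, hi = st, ed + 1
    while lo < hi:                      # first index with char_at >= c
        mid = (lo + hi) // 2
        if char_at(mid) < c:
            lo = mid + 1
        else:
            hi = mid
    new_st = lo
    hi = ed + 1
    while lo < hi:                      # first index with char_at > c
        mid = (lo + hi) // 2
        if char_at(mid) <= c:
            lo = mid + 1
        else:
            hi = mid
    new_ed = lo - 1
    return (new_st, new_ed) if new_st <= new_ed else None


T = "GACAGTTCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])
print(extend_range(T, SA, 5, 7, 1, "A"))    # (6, 6), i.e. range(T, GA)
```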
- Lemma 2
- Given the interval [st1..ed1] = range(T, P1) and the interval [st2..ed2] = range(T, P2), we can find the interval [st..ed] = range(T, P1P2) in O(log n) time using the suffix array and the inverse suffix array of T.
- Let [st1..ed1] = range(T, P1),
- [st2..ed2] = range(T, P2),
- [st..ed] = range(T, P1P2).
- [st..ed] is a subinterval of [st1..ed1].
Example
i        0  1  2  3  4  5  6  7  8  9
T        G  A  C  A  G  T  T  C  G  $
SA[i]    9  1  3  2  7  8  0  4  6  5
- Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
- P1 = G, P2 = A.
- range(T, P1) = [5..7].
- range(T, P1P2) must be within [5..7]. How can we find the exact interval within [5..7]?
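- By the definition of the suffix array, the suffixes T_{SA[st1]}, T_{SA[st1+1]}, ..., T_{SA[ed1]} are in increasing lexicographic order.
- Since they all share the prefix P1, the suffixes T_{SA[st1]+m1}, T_{SA[st1+1]+m1}, ..., T_{SA[ed1]+m1}, where m1 is the length of P1, are also in increasing lexicographic order.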
- T_2 = CAGTTCG$ and T_{2+1} = T_3 = AGTTCG$. T_{2+1} is obtained by deleting the prefix of length 1 from T_2.
- In general, T_{i+1} can be obtained by deleting the prefix of length 1 from T_i.
- Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
Example
i        0  1  2  3  4  5  6  7  8  9
T        G  A  C  A  G  T  T  C  G  $
SA[i]    9  1  3  2  7  8  0  4  6  5
- P1 = G, P2 = A.
- Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
- range(T, P1) = [5..7] ⇒ T_8 < T_0 < T_4.
- Deleting the first character gives T_{8+1}, T_{0+1}, T_{4+1}, and T_9 < T_1 < T_5.
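- The lexicographic order of T_{SA[st1]+m1}, T_{SA[st1+1]+m1}, ..., T_{SA[ed1]+m1}, where m1 is the length of P1, is also increasing.
- Thus SA^-1[SA[st1]+m1] < SA^-1[SA[st1+1]+m1] < ... < SA^-1[SA[ed1]+m1].
- To find st and ed, we find the smallest st in [st1..ed1] such that st2 ≤ SA^-1[SA[st]+m1] ≤ ed2 and the largest ed in [st1..ed1] such that st2 ≤ SA^-1[SA[ed]+m1] ≤ ed2.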
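Example
i          0  1  2  3  4  5  6  7  8  9
T          G  A  C  A  G  A  T  C  G  $
SA[i]      9  1  3  5  2  7  8  0  4  6
SA^-1[i]   7  1  4  2  8  3  9  5  6  0
- P1 = G, P2 = A.
- Lexicographic order: $ (T_9), ACAGATCG$ (T_1), AGATCG$ (T_3), ATCG$ (T_5), CAGATCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGATCG$ (T_0), GATCG$ (T_4), TCG$ (T_6).
- range(T, P1) = [6..8].
- range(T, P2) = [1..3].
- range(T, P1P2) = [st..ed], with 6 ≤ st and ed ≤ 8.
- SA^-1[SA[6]+1] = SA^-1[9] = 0 lies outside [1..3], while SA^-1[SA[7]+1] = SA^-1[1] = 1 and SA^-1[SA[8]+1] = SA^-1[5] = 3 lie inside it.
- ⇒ st = 7 and ed = 8.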
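The search for st and ed can be sketched as two binary searches over [st1..ed1], using the fact that the ranks SA^-1[SA[i]+m1] are increasing on that interval. The function name `combine_ranges` and the handling of a suffix that ends right after P1 (rank -1) are assumptions of this sketch.

```python
def combine_ranges(sa, inv, r1, m1, r2):
    """range(T, P1P2) from r1 = range(T, P1), m1 = |P1| and r2 = range(T, P2)."""
    st1, ed1 = r1
    st2, ed2 = r2
    n = len(sa)

    def rank_after(i):
        # Rank of the suffix left after dropping the first m1 characters of T_SA[i].
        j = sa[i] + m1
        return inv[j] if j < n else -1      # -1: nothing left, sorts before everything

    lo, hi = st1, ed1 + 1
    while lo < hi:                          # smallest i with rank_after(i) >= st2
        mid = (lo + hi) // 2
        if rank_after(mid) < st2:
            lo = mid + 1
        else:
            hi = mid
    st = lo
    hi = ed1 + 1
    while lo < hi:                          # smallest i with rank_after(i) > ed2
        mid = (lo + hi) // 2
        if rank_after(mid) <= ed2:
            lo = mid + 1
        else:
            hi = mid
    ed = lo - 1
    return (st, ed) if st <= ed else None


T = "GACAGATCG$"
SA = sorted(range(len(T)), key=lambda j: T[j:])
INV = [0] * len(SA)
for rank, j in enumerate(SA):
    INV[j] = rank
print(combine_ranges(SA, INV, (6, 8), 1, (1, 3)))   # (7, 8)
```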
- To find the interval of the first character of P:
- We construct an array C such that for any c in A, C[c] stores the total number of occurrences in T of all characters c' with c' ≤ c.
- range(T, p1) = [C[c2]+1..C[c]], where c = p1 is the first character of P and c2 is the character immediately before c in A.
Example
i        0  1  2  3  4  5  6  7  8  9
T        G  A  C  A  G  T  T  C  G  $
SA[i]    9  1  3  2  7  8  0  4  6  5
- C[A] = 2, C[C] = 4, C[G] = 7, C[T] = 9.
- Lexicographic order: $ (T_9), ACAGTTCG$ (T_1), AGTTCG$ (T_3), CAGTTCG$ (T_2), CG$ (T_7), G$ (T_8), GACAGTTCG$ (T_0), GTTCG$ (T_4), TCG$ (T_6), TTCG$ (T_5).
- P = GACAGCA.
- range(T, p1) = [C[C]+1..C[G]] = [5..7].
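A small sketch of the array C and the first-character range. It assumes, as in the example, that the terminator is not counted and occupies index 0 of the suffix array; the function names `build_c` and `first_char_range` are hypothetical.

```python
from collections import Counter


def build_c(text: str, terminator: str = "$") -> dict[str, int]:
    """C[c] = number of characters c' <= c in text, the terminator excluded."""
    counts = Counter(ch for ch in text if ch != terminator)
    c_array, total = {}, 0
    for ch in sorted(counts):
        total += counts[ch]
        c_array[ch] = total
    return c_array


def first_char_range(c_array: dict[str, int], c: str):
    """range(T, c) = [C[c2] + 1 .. C[c]], where c2 precedes c in the alphabet."""
    if c not in c_array:
        return None
    smaller = [v for ch, v in c_array.items() if ch < c]
    before = max(smaller, default=0)        # C[c2], or 0 when c is the smallest symbol
    return (before + 1, c_array[c])


C = build_c("GACAGTTCG$")
print(C)                                    # {'A': 2, 'C': 4, 'G': 7, 'T': 9}
print(first_char_range(C, "G"))             # (5, 7)
```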
- Lemma 3
- Given the suffix array and the inverse suffix array of T, assume [st..ed] = range(T, P). For any character c, assuming we have the array C in advance, we can find the interval [st'..ed'] = range(T, cP) in O(log n) time.
- I. Construct Fst[1..m+1] and Fed[1..m+1] such that [Fst[i]..Fed[i]] = range(T, P[i..m]).
- II. Call kapproximate([0..n], 1, 0, ε, ε).
- kapproximate([s..e], i, k', P'L, ?)
- begin
- 1. Given [Fst[i]..Fed[i]] = range(T, P[i..m]) and [s..e] = range(T, P'L), by Lemma 2 find [st..ed] = range(T, P'L P[i..m]).
- 2. Report occurrences of P' = P'L P[i..m] in [st..ed] if the interval exists.
- 3. If (k' = k) return.
- 4. For j = i to m+1
- (a) (when j ≤ m, deletion at j) Call kapproximate([s..e], j+1, k'+1, P'L, 'd').
- (b) (when j ≤ m, replacement at j) for each c in A
- i. Given [s..e] = range(T, P'L), by Lemma 1 find [s'..e'] = range(T, P'L c).
- ii. Call kapproximate([s'..e'], j+1, k'+1, P'L c, 'r').
- (c) (insertion at j) for each c in A
- i. Given [s..e] = range(T, P'L), by Lemma 1 find [s'..e'] = range(T, P'L c).
- ii. Call kapproximate([s'..e'], j, k'+1, P'L c, 'i').
- (d) (when j ≤ m) Given [s..e] = range(T, P'L), by Lemma 1 find [s..e]
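The recursion can be sketched as follows, under several loudly stated assumptions: step (d), which is cut off above, is taken to extend P'L by the unedited character P[j]; the last argument of kapproximate (the edit-type tag) is omitted, and a dictionary absorbs duplicates instead; and Lemmas 1 and 2 are replaced by a simple linear scan, so the time bound of the paper does not apply to this code. The names `k_difference_patterns` and `extend` are hypothetical, and indices are 0-based.

```python
def k_difference_patterns(text, pattern, k, alphabet):
    """Edited patterns P' (at most k errors) that occur in text, with their SA ranges."""
    t = text + "$"                                    # terminator assumed absent from text
    sa = sorted(range(len(t)), key=lambda j: t[j:])   # naive suffix array
    m = len(pattern)

    def extend(rng, plen, c):
        # Stand-in for Lemma 1: narrow range(T, X) to range(T, Xc) by a linear scan.
        if rng is None:
            return None
        hits = [i for i in range(rng[0], rng[1] + 1)
                if sa[i] + plen < len(t) and t[sa[i] + plen] == c]
        return (hits[0], hits[-1]) if hits else None

    found = {}

    def kapproximate(rng, i, errors, left):
        # rng = range(T, left), where left is the already edited prefix P'L.
        if rng is None:
            return                                    # P'L does not occur: prune
        # Steps 1-2: append the untouched suffix P[i..m] and report.
        full, plen = rng, len(left)
        for c in pattern[i:]:
            full = extend(full, plen, c)
            plen += 1
        candidate = left + pattern[i:]
        if full is not None and candidate:
            found[candidate] = full
        if errors == k:                               # step 3
            return
        for j in range(i, m + 1):                     # step 4
            if j < m:
                kapproximate(rng, j + 1, errors + 1, left)             # (a) deletion
            for c in alphabet:
                nxt = extend(rng, len(left), c)
                if j < m and c != pattern[j]:
                    kapproximate(nxt, j + 1, errors + 1, left + c)     # (b) replacement
                kapproximate(nxt, j, errors + 1, left + c)             # (c) insertion
            if j < m:                                 # (d) assumed: keep P[j] unedited
                rng = extend(rng, len(left), pattern[j])
                left += pattern[j]
                if rng is None:
                    return

    kapproximate((0, len(t) - 1), 0, 0, "")
    return found


print(sorted(k_difference_patterns("abbaaa", "aba", 1, "ab")))
# ['aa', 'aaa', 'ab', 'abb', 'abba', 'ba', 'bba']
```

Each reported range [st..ed] gives the occurrences of P' in T as the start positions SA[st], ..., SA[ed].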
- After an O(n)-time preprocessing of the text T into an O(n log n)-bit data structure, the algorithm solves the k-difference problem in O(|A|^k m^k log n + output time).
- References
- [1] A. Amir, D. Keselman, G.M. Landau, M. Lewenstein, N. Lewenstein, M. Rodeh, Indexing and dictionary matching with one error, in Proc. Sixth WADS, Lecture Notes in Computer Science, vol. 1663, Springer, Berlin, 1999, pp. 181-192.
- [2] A. Amir, M. Lewenstein, Ely Porat, Faster algorithms for string matching with k mismatches, in Proc. 11th Ann. ACM-SIAM Symp. on Discrete Algorithms, 2000, pp. 794-803.
- [3] R.A. Baeza-Yates, G. Navarro, A faster algorithm for approximate string matching, in Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM96), pp. 1-23.
- [4] R.A. Baeza-Yates, G. Navarro, A practical index for text retrieval allowing errors, in CLEI, vol. 1, November 1997, pp. 273-282.
- [5] R. Boyer, S. Moore, A fast string matching algorithm, CACM 20 (1977) 762-772.
- [6] A.L. Buchsbaum, M.T. Goodrich, J. Westbrook, Range searching over tree cross products, in ESA 2000, pp. 120-131.
- [7] A. Cobbs, Fast approximate matching using suffix trees, in Proc. Sixth Ann. Symp. on Combinatorial Pattern Matching (CPM95), Lecture Notes in Computer Science, vol. 807, Springer, Berlin, 1995, pp. 41-54.
- [8] R. Cole, L.A. Gottlieb, M. Lewenstein, Dictionary matching and indexing with errors and don't cares, in Proc. 36th Ann. ACM Symp. on Theory of Computing, 2004, pp. 91-100.
- [9] P. Ferragina, G. Manzini, Opportunistic data structures with applications, in Proc. 41st IEEE Symp. on Foundations of Computer Science (FOCS00), 2000, pp. 390-398.
- [10] G. Gonnet, A tutorial introduction to computational biochemistry using Darwin, Technical Report, Informatik E.T.H., Zurich, Switzerland, 1992.
- [11] R. Grossi, J.S. Vitter, Compressed suffix arrays and suffix trees with applications to text indexing and string matching, in Proc. 32nd ACM Symp. on Theory of Computing, 2000, pp. 397-406.
- [12] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, Cambridge, 1997.
- [13] W.K. Hon, K. Sadakane, W.K. Sung, Breaking a time-and-space barrier in constructing full-text indices, in Proc. IEEE Symp. on Foundations of Computer Science, 2003.
- [14] P. Jokinen, E. Ukkonen, Two algorithms for approximate string matching in static texts, in Proc. MFCS91, Lecture Notes in Computer Science, vol. 520, Springer, Berlin, 1991, pp. 240-248.
- [15] D.K. Kim, J.S. Sim, H. Park, K. Park, Linear-time construction of suffix arrays, in CPM 2003, pp. 186-199.
- [16] D.E. Knuth, J. Morris, V. Pratt, Fast pattern matching in strings, SIAM J. Comput. 6 (1977) 323-350.
- [17] P. Ko, S. Aluru, Space efficient linear time construction of suffix arrays, in CPM 2003, pp. 200-210.
- [18] G.M. Landau, U. Vishkin, Fast parallel and serial approximate string matching, J. Algorithms 10 (1989) 157-169.
- [19] U. Manber, G. Myers, Suffix arrays: a new method for on-line string searches, SIAM J. Comput. 22 (5) (1993) 935-948.
- [20] E.M. McCreight, A space economical suffix tree construction algorithm, J. ACM 23 (2) (1976) 262-272.
- [21] G. Navarro, A guided tour to approximate string matching, ACM Comput. Surveys 33 (1) (2001) 31-88.
- [22] G. Navarro, R.A. Baeza-Yates, A new indexing method for approximate string matching, in Proc. 10th Ann. Symp. on Combinatorial Pattern Matching (CPM99), pp. 163-185.
- [23] G. Navarro, R.A. Baeza-Yates, A hybrid indexing method for approximate string matching, J. Discrete Algorithms 1 (1) (2000) 205-239.
- [24] G. Navarro, R. Baeza-Yates, E. Sutinen, J. Tarhio, Indexing methods for approximate string matching, IEEE Data Eng. Bull. 24 (4) (2001) 19-27.
- [25] G. Navarro, E. Sutinen, J. Tanninen, J. Tarhio, Indexing text with approximate q-grams, in Proc. 11th Ann. Symp. on Combinatorial Pattern Matching, Lecture Notes in Computer Science, vol. 1848, Springer, Berlin, 2000.
- [26] K. Sadakane, T. Shibuya, Indexing huge genome sequences for solving various problems, Genome Informatics 12 (2001) 175-183.
- [27] F. Shi, Fast approximate string matching with q-blocks sequences, in Proc. Third South American Workshop on String Processing (WSP96), Carleton University Press, 1996.
- [28] E. Sutinen, J. Tarhio, Filtration with q-samples in approximate string matching, in Proc. Seventh Ann. Symp. on Combinatorial Pattern Matching (CPM96), pp. 50-63.
- [29] E. Ukkonen, Approximate matching over suffix trees, in Proc. Combinatorial Pattern Matching 1993, vol. 4, Springer, Berlin, June 1993, pp. 228-242.
- [30] R.A. Wagner, M.J. Fischer, The string-to-string correction problem, J. ACM 21 (1974) 168-173.