String Matching with k Mismatches by Using Kangaroo Method: Efficient string matching with k mismatches, Landau, G.M., and Vishkin, U., Theoret. Comput. Sci. 43, 1986, pp. 239-249

Provided by: HsuHo


1
String Matching with k Mismatches by Using Kangaroo Method
Efficient string matching with k mismatches, Landau, G.M., and Vishkin, U.,
Theoret. Comput. Sci. 43, 1986, pp. 239-249
Speaker: C. C. Lin
Adviser: R. C. T. Lee
2
  • Problem definition
  • Input: A text T of length n, a pattern P of length m, and a mismatch threshold k.
  • Output: All substrings of T of length m that match P with at most k mismatches.

Example with k = 2:
T = A G C T G C D C A C G I A B ...
P = A G C C
[Figure: P aligned under T at four successive positions; the alignments are labeled with mismatch counts 1, 3, 2 and 4.]
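
As a point of reference for the problem definition above, here is a minimal brute-force sketch that compares every window of T against P character by character. It is not the Kangaroo method; the function name `k_mismatch_naive` is an illustrative assumption, and the inputs are the T, P and k used in the worked example later in these slides.

```python
def k_mismatch_naive(text, pattern, k):
    """Return 0-based start positions where pattern matches text
    with at most k mismatches (brute force, O(n * m) time)."""
    n, m = len(text), len(pattern)
    hits = []
    for start in range(n - m + 1):
        window = text[start:start + m]
        # Count positions where the window and the pattern disagree.
        mismatches = sum(1 for a, b in zip(window, pattern) if a != b)
        if mismatches <= k:
            hits.append(start)
    return hits


print(k_mismatch_naive("ABCCBDCDBC", "ABCD", 2))  # [0, 4]
```

The two reported positions correspond to the substrings ABCC and BDCD.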
3
  • The concept of the Kangaroo method is illustrated by the following figure.
  • Assume it is known beforehand that t_1 t_2 ... t_a = p_1 p_2 ... p_a and that t_{a+1} is not equal to p_{a+1}.
  • Then we do not have to compare t_1 t_2 ... t_{a+1} with p_1 p_2 ... p_{a+1}; we jump directly to matching the suffixes beginning at t_{a+2} and p_{a+2}.

Text:    t_1 t_2 ... t_a  t_{a+1}  t_{a+2} t_{a+3} ... t_k
Pattern: p_1 p_2 ... p_a  p_{a+1}  p_{a+2} p_{a+3} ... p_k
                          (mismatch)
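
The jump in the figure can be sketched as a single step: measure the matching run, then step over the one mismatching character. In this sketch the run length is found by a direct scan, which is only a stand-in for the constant-time query introduced later in the slides; the names `lce` and `jump` and the sample strings are assumptions made for illustration.

```python
def lce(text, i, pattern, j):
    """Length of the longest common prefix of text[i:] and pattern[j:],
    found here by direct scanning."""
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def jump(text, i, pattern, j):
    """Skip the matching run of length a, then step over the mismatch.
    Returns the next text index, the next pattern index, and a."""
    a = lce(text, i, pattern, j)
    return i + a + 1, j + a + 1, a


# "ABC" matches, then X != Y, so comparison resumes at index 4 in both.
print(jump("ABCXEF", 0, "ABCYEF", 0))  # (4, 4, 3)
```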
4
The Kangaroo method proceeds as follows.
k = 0
P = ETBDBCCDFDC
T = ABCCABDADBDETADBAADFDAAEERDXTDADCT
(start)
5
The Kangaroo method proceeds as follows.
k = 1
P = ETBDBCCDFDC
T = ABCCABDADBDETADBAADFDAAEERDXTDADCT
6
The Kangaroo method proceeds as follows.
k = 2
P = ETBDBCCDFDC
T = ABCCABDADBDETADBAADFDAAEERDXTDADCT
7
The Kangaroo method proceeds as follows.
k = 3
P = ETBDBCCDFDC
T = ABCCABDADBDETADBAADFDAAEERDXTDADCT
8
The Kangaroo method proceeds as follows.
k = 4
P = ETBDBCCDFDC
T = ABCCABDADBDETADBAADFDAAEERDXTDADCT
9
  • We continue the above process. Whenever a substring of T is known to match a substring of P exactly, we skip that substring. The process stops when k+1 mismatches have been found.
  • Input: T = ABAABBCCDD, P = ACDCB and k = 2.
  • T = ABAABBCCDD
  • P = ACDCB
  • When k = 3, we stop and discard ABAAB; then we start to compare BAABB and ACDCB.
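
The stopping rule above can be sketched as a bounded mismatch counter that gives up as soon as mismatch number k+1 is seen. This is a hedged illustration only: the helper names are assumptions, and the matching runs are measured by direct scanning rather than by the suffix-tree queries described later.

```python
def common_run(a, b, start):
    """Number of consecutive equal characters of a and b from index start."""
    length = 0
    while (start + length < len(a) and start + length < len(b)
           and a[start + length] == b[start + length]):
        length += 1
    return length


def mismatches_bounded(window, pattern, k):
    """Count mismatches between two equal-length strings,
    stopping as soon as k + 1 mismatches have been found."""
    pos, mismatches = 0, 0
    while mismatches <= k:
        pos += common_run(window, pattern, pos)  # skip the matching run
        if pos >= len(pattern):                  # reached the end
            break
        mismatches += 1                          # record the mismatch
        pos += 1                                 # and hop over it
    return mismatches


# ABAAB vs ACDCB with k = 2: the third mismatch stops the scan.
print(mismatches_bounded("ABAAB", "ACDCB", 2))  # 3
```

A return value greater than k means the window is discarded.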

10
  • Before we introduce the Kangaroo algorithm, we first introduce the suffix tree and the lowest common ancestor of two nodes.
  • The properties of the suffix tree and of the lowest common ancestor of two nodes are used in the Kangaroo algorithm.

11
S = ABCDEADDBE
A suffix tree of a string of length n can be constructed in O(n) time
[Weiner, 1973; McCreight, 1976; Ukkonen, 1995].
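
To make the data structure concrete, here is a minimal sketch of an uncompressed suffix trie for S. It is deliberately naive and takes quadratic time and space, so it is not one of the linear-time constructions cited above. The nested-dict layout, the `$` terminator and the function names are assumptions made for illustration.

```python
def build_suffix_trie(s, terminal="$"):
    """Naive suffix trie as nested dicts. Each leaf stores the start
    index of its suffix under the key None. Assumes terminal is not in s."""
    s = s + terminal
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
        node[None] = start  # mark the leaf for the suffix s[start:]
    return root


def count_leaves(node):
    """Number of suffixes stored below node."""
    return sum(1 if key is None else count_leaves(child)
               for key, child in node.items())


trie = build_suffix_trie("ABCDEADDBE")
print(count_leaves(trie))  # 11: ten suffixes of S plus the lone terminator
```

A suffix tree is the same structure with every chain of single-child nodes collapsed into one edge.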
12
After O(n)-time preprocessing, the lowest common ancestor of two leaf
nodes can be found in O(1) time [Harel and Tarjan, 1984].
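
For intuition, constant-time lowest-common-ancestor queries can be sketched with an Euler tour and a sparse table. This is a simpler substitute for the Harel and Tarjan algorithm, not that algorithm itself: preprocessing here costs O(n log n) rather than O(n). The tree encoding (a dict mapping a node to its list of children) and all names are assumptions.

```python
_DONE = object()  # sentinel: no more children to visit


def preprocess_lca(children, root):
    """Euler tour plus a sparse table of minimum-depth tour positions."""
    euler, depth, first = [root], [0], {root: 0}
    stack = [(root, 0, iter(children.get(root, [])))]
    while stack:
        node, d, it = stack[-1]
        child = next(it, _DONE)
        if child is _DONE:
            stack.pop()
            if stack:  # returning to the parent adds it to the tour again
                euler.append(stack[-1][0])
                depth.append(stack[-1][1])
        else:
            first[child] = len(euler)
            euler.append(child)
            depth.append(d + 1)
            stack.append((child, d + 1, iter(children.get(child, []))))
    size = len(euler)
    table = [list(range(size))]  # table[j][i]: argmin depth on [i, i + 2^j)
    span = 1
    while 2 * span <= size:
        prev = table[-1]
        table.append([min(prev[i], prev[i + span], key=depth.__getitem__)
                      for i in range(size - 2 * span + 1)])
        span *= 2
    return euler, depth, first, table


def lca(pre, u, v):
    """Lowest common ancestor of u and v using two table lookups."""
    euler, depth, first, table = pre
    lo, hi = sorted((first[u], first[v]))
    level = (hi - lo + 1).bit_length() - 1
    a = table[level][lo]
    b = table[level][hi - (1 << level) + 1]
    return euler[a if depth[a] <= depth[b] else b]


tree = {"r": ["x", "y"], "x": ["a", "b"], "y": ["c"]}
pre = preprocess_lca(tree, "r")
print(lca(pre, "a", "b"), lca(pre, "a", "c"))  # x r
```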
13
  • The Kangaroo method constructs a suffix tree for the text T and the pattern P. Let X denote the leaf corresponding to the suffix of T that starts at the text location being verified, and let Y denote the leaf corresponding to the pattern.
  • The Kangaroo method uses lowest-common-ancestor queries on X and Y to verify a text location against k mismatches in O(k) time.
  • The following slides illustrate the Kangaroo method.

14
Two suffix strings: ANBECF and ANCEC.
They share the prefix AN, followed by a mismatch between B and C.
ANBECF
ANCEC
We now have to find whether there are any
mismatches between ECF and EC.
mismatches = 1
15
The remaining suffix strings are ECF and EC.
They share the prefix EC, and because we reach the end of the shorter
string, the verification is finished.
ECF
EC
Thus the number of mismatches between ANBECF and ANCEC is 1.
mismatches = 1
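
The two steps of this example can be traced with a short sketch that records each hop as a shared prefix plus the mismatching pair that ended it. The function name `kangaroo_hops` and the tuple layout are assumptions, and the shared prefix is found by direct scanning rather than by a suffix-tree query.

```python
def kangaroo_hops(a, b):
    """List of (shared_prefix, mismatch) hops between a and b.
    mismatch is a pair of characters, or None when the hop ran to the
    end of the shorter string."""
    hops, pos = [], 0
    limit = min(len(a), len(b))
    while pos < limit:
        start = pos
        while pos < limit and a[pos] == b[pos]:
            pos += 1
        if pos < limit:
            hops.append((a[start:pos], (a[pos], b[pos])))
            pos += 1  # hop over the mismatching character
        else:
            hops.append((a[start:pos], None))
    return hops


hops = kangaroo_hops("ANBECF", "ANCEC")
print(hops)  # [('AN', ('B', 'C')), ('EC', None)]
print(sum(1 for _, mismatch in hops if mismatch is not None))  # 1
```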
16
  • By finding the lowest common ancestor of the text suffix and the pattern suffix in the suffix tree, we avoid comparing every character.
  • This is useful when the text and the pattern share many matching characters, because those characters need not be compared one by one.
  • Finding the lowest common ancestor of two suffixes amounts to finding the next mismatch between the two strings.
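
The "next mismatch" query can be sketched without building a suffix tree explicitly. The version below swaps in a related, well-known construction: a suffix array, an LCP array (Kasai's algorithm) and a sparse table of range minima, which together answer the same longest-common-prefix question in O(1) per query. The suffix array is built by plain sorting, so preprocessing is not linear here. The class name, the separators `#` and `$`, and the method names are assumptions.

```python
class LCEIndex:
    """Common-prefix length of text[i:] and pattern[j:] in O(1) per query."""

    def __init__(self, text, pattern):
        # Assumption: "#" and "$" occur in neither text nor pattern.
        self.offset = len(pattern) + 1
        s = pattern + "#" + text + "$"
        n = len(s)
        sa = sorted(range(n), key=lambda i: s[i:])  # simple, not linear time
        self.rank = [0] * n
        for r, i in enumerate(sa):
            self.rank[i] = r
        lcp = [0] * n  # lcp[r]: common prefix of suffixes sa[r - 1] and sa[r]
        h = 0
        for i in range(n):  # Kasai's algorithm
            if self.rank[i] == 0:
                h = 0
                continue
            j = sa[self.rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[self.rank[i]] = h
            h = max(h - 1, 0)
        self.table = [lcp]  # sparse table of range minima over lcp
        span = 1
        while 2 * span <= n:
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + span])
                               for i in range(n - 2 * span + 1)])
            span *= 2

    def lce(self, i, j):
        """Length of the common prefix of text[i:] and pattern[j:]."""
        lo, hi = sorted((self.rank[self.offset + i], self.rank[j]))
        lo += 1  # minimum of lcp over the ranks lo + 1 .. hi
        level = (hi - lo + 1).bit_length() - 1
        return min(self.table[level][lo],
                   self.table[level][hi - (1 << level) + 1])


index = LCEIndex("ABCCBDCDBC", "ABCD")
print(index.lce(0, 0))  # 3: "ABC" is shared, then C != D
print(index.lce(6, 2))  # 2: "CD" is shared up to the end of the pattern
```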

17
  • Input: T = ABCCBDCDBC, P = ABCD and k = 2.
  • The suffix tree of T and P is: [figure]

18
  • The lowest common ancestor of ABCD and ABCCBDCDBC.

T = ABCCBDCDBC, P = ABCD, k = 1; return ABCC.
19
  • The lowest common ancestor of ABCD and BCCBDCDBC.

T = ABCCBDCDBC, P = ABCD, k = 1.
20
  • The lowest common ancestor of BCD and CCBDCDBC.

T = ABCCBDCDBC, P = ABCD, k = 2.
21
  • The lowest common ancestor of CD and CBDCDBC.

T = ABCCBDCDBC, P = ABCD, k = 3; discard BCCB.
22
  • The lowest common ancestor of ABCD and CCBDCDBC.

T = ABCCBDCDBC, P = ABCD, k = 1.
23
  • The lowest common ancestor of BCD and CBDCDBC.

T = ABCCBDCDBC, P = ABCD, k = 2.
24
  • The lowest common ancestor of CD and BDCDBC.

T = ABCCBDCDBC, P = ABCD, k = 3; discard CCBD.
25
  • The lowest common ancestor of ABCD and CBDCDBC.

T = ABCCBDCDBC, P = ABCD, k = 1.
26
  • The lowest common ancestor of BCD and BDCDBC.

T = ABCCBDCDBC, P = ABCD, k = 2.
27
  • The lowest common ancestor of D and CDBC.

T = ABCCBDCDBC, P = ABCD, k = 3; discard CBDC.
28
  • The lowest common ancestor of ABCD and BDCDBC.

T = ABCCBDCDBC, P = ABCD, k = 1.
29
  • The lowest common ancestor of BCD and DCDBC.

T = ABCCBDCDBC, P = ABCD, k = 2.
30
  • The lowest common ancestor of CD and CDBC.

T = ABCCBDCDBC, P = ABCD, k = 2; return BDCD.
31
  • The lowest common ancestor of ABCD and DCDBC.

T = ABCCBDCDBC, P = ABCD, k = 1.
32
  • The lowest common ancestor of BCD and CDBC.

T = ABCCBDCDBC, P = ABCD, k = 2.
33
  • The lowest common ancestor of CD and DBC.

T = ABCCBDCDBC, P = ABCD, k = 3; discard DCDB.
34
  • The lowest common ancestor of ABCD and CDBC.

T = ABCCBDCDBC, P = ABCD, k = 1.
35
  • The lowest common ancestor of BCD and DBC.

T = ABCCBDCDBC, P = ABCD, k = 2.
36
  • The lowest common ancestor of CD and BC.

T = ABCCBDCDBC, P = ABCD, k = 3; discard CDBC.
37
  • Input: T = ABCCBDCDBC, P = ABCD and k = 2.
  • Output: ABCC and BDCD.
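
The walkthrough on the preceding slides can be replayed with the sketch below, which records for every window the mismatch count at which verification stopped and the resulting verdict. The matching runs are measured by direct scanning, a stand-in for the lowest-common-ancestor queries, so this sketch shows the control flow but not the O(k) bound per window. The names `lce` and `kangaroo_search` are assumptions.

```python
def lce(text, i, pattern, j):
    """Common-prefix length of text[i:] and pattern[j:] by direct scan
    (stand-in for a constant-time lowest-common-ancestor query)."""
    length = 0
    while (i + length < len(text) and j + length < len(pattern)
           and text[i + length] == pattern[j + length]):
        length += 1
    return length


def kangaroo_search(text, pattern, k):
    """Return (window, mismatches, verdict) for every alignment.
    Verification of a window stops at mismatch number k + 1."""
    n, m = len(text), len(pattern)
    results = []
    for start in range(n - m + 1):
        pos, mismatches = 0, 0
        while True:
            pos += lce(text, start + pos, pattern, pos)  # skip matching run
            if pos >= m:          # reached the end of the pattern
                break
            mismatches += 1
            if mismatches > k:    # k + 1 mismatches: give up on this window
                break
            pos += 1              # hop over the mismatching character
        verdict = "return" if mismatches <= k else "discard"
        results.append((text[start:start + m], mismatches, verdict))
    return results


for window, count, verdict in kangaroo_search("ABCCBDCDBC", "ABCD", 2):
    print(window, count, verdict)
```

Under these assumptions the only windows marked "return" are ABCC (1 mismatch) and BDCD (2 mismatches), matching the output on the slide.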

38
  • To use the Kangaroo method, we construct a suffix tree for the text T of length n and the pattern P of length m in O(n+m) time.
  • With the Kangaroo method, finding one mismatch takes O(1) time. We stop once more than k mismatches have been found. Therefore, verifying one text location takes O(k) time.

39
  • Thus, the time complexity of finding all locations of the text T that match the pattern P with at most k mismatches is O(nk).
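
The bound can be spelled out as a short derivation. This is a sketch under the assumptions that m <= n and k >= 1, and that each verification makes at most k+1 constant-time queries; the symbol T_total is introduced here only for illustration.

```latex
\begin{align*}
T_{\text{total}}(n, m, k)
  &= \underbrace{O(n + m)}_{\text{suffix tree and LCA preprocessing}}
   + \underbrace{(n - m + 1)}_{\text{text locations}} \cdot
     \underbrace{O(k + 1)}_{\text{queries per location}} \\
  &= O(n + m + nk) \\
  &= O(nk) \qquad \text{for } m \le n \text{ and } k \ge 1 .
\end{align*}
```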

40
References
  • For Construction of Suffix Trees
  • [M76] McCreight, E.M., A Space-Economical Suffix Tree Construction Algorithm, J. ACM 23 (1976) 262-272.
  • [U95] Ukkonen, E., On-line Construction of Suffix Trees, Algorithmica 14 (1995) 249-260.
  • For Finding Lowest Common Ancestor
  • [HT84] Harel, D. and Tarjan, R.E., Fast Algorithms for Finding Nearest Common Ancestors, SIAM Journal on Computing 13 (1984) 338-355.

41
References
  • For String Matching with k Mismatches
  • [LV86] Landau, G.M., and Vishkin, U., Efficient string matching with k mismatches, Theoret. Comput. Sci. 43 (1986) 239-249.

42
  • Thank you