Chapter 3 String Matching

Slides: 35
Provided by: dann60

1
Chapter 3 String Matching
  • 3.1 Basic Terminologies of Strings
  • 3.2 The KMP Algorithm
  • 3.3 The Boyer-Moore Algorithm
  • 3.4 Suffix Trees and Suffix Arrays
  • 3.5 Approximate String Matching

2
3.1 Basic Terminologies of Strings
  • Σ: the alphabet set
  • S_i: the i-th character of string S
  • S_{i,j} = S_i S_{i+1} ... S_j
  • S_{i,j}: a substring of string S
  • prefix of S: S_{1,s}
  • suffix of S: S_{|S|-s+1,|S|}

3
It takes O(nm) time.
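The figure for this slide is lost, so the following is only a sketch under the assumption that the O(nm) bound refers to the naive sliding-window comparison of a pattern of length m against a text of length n. The function name `naive_match` is illustrative and not from the slides.

```python
def naive_match(text, pattern):
    """Return all 1-based start positions of pattern in text by brute force."""
    n, m = len(text), len(pattern)
    positions = []
    for s in range(n - m + 1):           # every window start (0-based)
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1                       # at most m comparisons per window
        if j == m:
            positions.append(s + 1)      # report 1-based, as on the slides
    return positions


print(naive_match("ATCACATCATCA", "TCAT"))  # [7]
```

There are at most n - m + 1 windows and at most m comparisons in each, which gives the O(nm) bound.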
4
The KMP Algorithm
  • Case 1
  • The first mismatch occurs at P_4 and T_4.
  • Slide the window all the way to T_4 and match P_1
    with T_4.

5
The KMP Algorithm
  • Case 2
  • The first mismatch occurs at P_7 and T_7.
  • We cannot slide the window all the way to match
    P_1 with T_7.
  • Slide the window to match P_1 with T_6.

6
The KMP Algorithm
  • Case 3
  • The first mismatch occurs at P_8 and T_8.
  • Go back to P_7 and find that P_{6,7} is equal to
    the prefix P_{1,2}.
  • We can slide the window to align P_1 with T_6 and
    P_3 with T_8.

7
The KMP Algorithm
8
The KMP Algorithm
9
The KMP Algorithm
  • The KMP algorithm consists of two phases.
  • First phase computes the prefix function for the
    pattern P .
  • Second phase searches the pattern.
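The two phases can be sketched as follows. The slides' exact definition of the prefix function is not in the transcript, so this assumes the common formulation in which f[j] is the length of the longest proper prefix of P that is also a suffix of the first j+1 characters. The names `prefix_function` and `kmp_search` are illustrative.

```python
def prefix_function(p):
    """Phase 1: f[j] = length of the longest proper prefix of p[:j+1]
    that is also a suffix of p[:j+1]."""
    f = [0] * len(p)
    k = 0
    for j in range(1, len(p)):
        while k > 0 and p[j] != p[k]:
            k = f[k - 1]                 # fall back to a shorter border
        if p[j] == p[k]:
            k += 1
        f[j] = k
    return f


def kmp_search(t, p):
    """Phase 2: return all 1-based start positions of p in t."""
    if not p:
        return []
    f = prefix_function(p)
    hits = []
    k = 0                                # number of pattern characters matched
    for i, c in enumerate(t):
        while k > 0 and c != p[k]:
            k = f[k - 1]                 # slide the window, keep the matched prefix
        if c == p[k]:
            k += 1
        if k == len(p):
            hits.append(i - len(p) + 2)  # convert 0-based end to 1-based start
            k = f[k - 1]
    return hits


print(prefix_function("ATCACATCATCA"))   # [0, 0, 0, 1, 0, 1, 2, 3, 4, 2, 3, 4]
print(kmp_search("ATCACATCATCA", "TCAT"))  # [7]
```

The text pointer never moves backwards, which is why the search phase is linear in the text length once the prefix function is known.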

10
The KMP Algorithm
11
The KMP Algorithm
12
The Boyer-Moore Algorithm
  • This algorithm compares the pattern with the
    substring within a sliding window in
    right-to-left order.
  • Assume that the first mismatch occurs when
    comparing T_{s+j-1} with P_j.

13
The Boyer-Moore Algorithm
  • Bad character rule
  • Align T_{s+j-1} with P_{j'}, where j' is the rightmost
    position of T_{s+j-1} in P.
  • Only one character is used.
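A minimal sketch of the table behind the bad character rule: for each character, record its rightmost 1-based position in P. The names `bad_character_table` and `bad_character_shift` are illustrative; treating an absent character as position 0 and clamping the shift to at least 1 are assumptions of this sketch, not something stated on the slide.

```python
def bad_character_table(p):
    """Map each character of p to its rightmost 1-based position in p."""
    # Later positions overwrite earlier ones, so the rightmost one survives.
    return {c: j for j, c in enumerate(p, start=1)}


def bad_character_shift(table, j, c):
    """Shift that aligns text character c (mismatched against P_j)
    with its rightmost occurrence in P.

    Assumption: characters absent from P count as position 0, and the
    shift is clamped to at least 1 so the window always moves forward.
    """
    return max(1, j - table.get(c, 0))


B = bad_character_table("ATCACATCATCA")
print(B)                                 # {'A': 12, 'T': 10, 'C': 11}
print(bad_character_shift(B, 12, "T"))   # 2
```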

14
The Boyer-Moore Algorithm
  • Good Suffix Rule 1
  • Align T_{s+j-1} with P_{j'-m+j}.
  • j' is the largest position such that P_{j+1,m} is a
    suffix of P_{1,j'}.
  • P_{j'-m+j} ≠ P_j

15
The Boyer-Moore Algorithm
  • Good Suffix Rule 2
  • Align T_{s+m-j'} with P_1.
  • j' is the largest position such that P_{1,j'} is a
    suffix of P_{j+1,m}.

16
The Boyer-Moore Algorithm
17
The Boyer-Moore Algorithm
  • Let us consider g1(7). Note that g1(7) = 9. This
    means that P_{8,12} = CATCA must be equal to P_{5,9} and
    P_7 ≠ P_4.
  • Consider g2(4). Note that g2(4) = 4. This means that
    P_{1,4} is a suffix of P_{5,12}. That is, P_{1,4} must be
    equal to P_{9,12}.
  • G(j) = m - max{g1(j), g2(j)}
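The two good suffix functions can be checked by brute force directly from their definitions. The sketch below assumes the pattern is P = ATCACATCATCA (the 12-character string whose P_{8,12} is CATCA) and that g1(j) or g2(j) is 0 when no qualifying position exists; both are assumptions, and the quadratic search is for illustration only, not the efficient preprocessing.

```python
def good_suffix_tables(p):
    """Return (g1, g2, G) as 1-indexed lists (index 0 is unused)."""
    m = len(p)
    P = " " + p                          # pad so that P[1..m] matches the slides
    g1 = [0] * (m + 1)
    g2 = [0] * (m + 1)
    for j in range(1, m + 1):
        suffix = P[j + 1:]               # P_{j+1,m}
        length = m - j
        # Rule 1: largest j' with P_{j+1,m} a suffix of P_{1,j'}
        # and P_{j'-m+j} != P_j (so position j'-m+j must exist).
        for jp in range(m - 1, length, -1):
            if P[jp - length + 1:jp + 1] == suffix and P[jp - length] != P[j]:
                g1[j] = jp
                break
        # Rule 2: largest j' with P_{1,j'} a suffix of P_{j+1,m}.
        for jp in range(length, 0, -1):
            if P[1:jp + 1] == P[m - jp + 1:]:
                g2[j] = jp
                break
    G = [0] + [m - max(g1[j], g2[j]) for j in range(1, m + 1)]
    return g1, g2, G


g1, g2, G = good_suffix_tables("ATCACATCATCA")
print(g1[7], g2[4], G[12])               # 9 4 1
```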

18
The Boyer-Moore Algorithm
19
The Boyer-Moore Algorithm
20
The Boyer-Moore Algorithm
21
The Boyer-Moore Algorithm
22
The Boyer-Moore Algorithm
23
The Boyer-Moore Algorithm
  • n = 24 and m = 12
  • The first mismatch occurs at P_j = P_12.
  • s = s + max{G(j), m - B(T_{s+j-1})}
  •   = 1 + max{G(12), 12 - B(T_12)}
  •   = 1 + max{1, 2}
  •   = 3
  • Then we shift the window to position s = 3.

24
Suffix Trees
If S is ATCACATCATCA, its 12 suffixes are listed
in Table 3.1.
25
Suffix Tree
26
For Example
Consider the suffix tree for
S = ATCACATCATCA. Suppose P = TCAT. Since P_1 is
T, we follow the branch of TCA. Then we
match P_{1,3} with TCA. Since P_4 is T, we
follow the branch of TCA. We can now report
that P is at position 7 in S because P_4 matches
the first symbol of TCA and leaf 7 is reached
along the branch of TCA.
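The walk described above can be imitated with an uncompressed suffix trie, which is simpler to build than the compressed suffix tree in the figure but follows the same idea: spell P from the root and read off the suffixes that pass through the node reached. This is a swapped-in simplification using O(n^2) space; the names `build_suffix_trie` and `find` are illustrative.

```python
def build_suffix_trie(s):
    """Insert every suffix of s; each node records the 1-based start
    positions of the suffixes that pass through it."""
    root = {"children": {}, "starts": []}
    for i in range(len(s)):
        node = root
        for c in s[i:]:
            node = node["children"].setdefault(c, {"children": {}, "starts": []})
            node["starts"].append(i + 1)
    return root


def find(root, p):
    """Return the 1-based positions where p occurs, or [] if it does not."""
    node = root
    for c in p:
        node = node["children"].get(c)
        if node is None:
            return []                    # fell off the trie: no occurrence
    return node["starts"]


trie = build_suffix_trie("ATCACATCATCA")
print(find(trie, "TCAT"))                # [7]
print(find(trie, "TCA"))                 # [2, 7, 10]
```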
27
Suffix Array
For example, the non-decreasing lexical order
of the suffixes of S = ATCACATCATCA is S(12), S(4),
S(9), S(1), S(6), S(11), S(3), S(8),
S(5), S(10), S(2) and S(7). Table 3.2 shows the
suffix array A.
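The order listed above can be reproduced with a short sketch that sorts the suffix start positions by the suffixes themselves. This is the simplest construction, not an efficient one, and the function name `suffix_array` is illustrative.

```python
def suffix_array(s):
    """Return the 1-based start positions of the suffixes of s
    in non-decreasing lexical order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


print(suffix_array("ATCACATCATCA"))
# [12, 4, 9, 1, 6, 11, 3, 8, 5, 10, 2, 7]
```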
28
The longest common substring (1)
The longest common substring of strings X and
Y is a common substring of X and Y which has the
longest length. For example, PAT is the
longest common substring of X = APAT and
Y = PATT. We can create a suffix tree for X
and Y to find the longest common substring.
The suffixes of X and Y are described in Table
3.3. Figure 3.25 shows the suffix tree for
X and Y.
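As a quick check of the APAT / PATT example, here is a sketch that finds the longest common substring by dynamic programming. Note that this swaps in a different technique from the suffix tree approach on the slide, chosen only because it fits in a few lines; the function name is illustrative.

```python
def longest_common_substring(x, y):
    """Return one longest common substring of x and y (O(|x||y|) time)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(y) + 1)            # prev[j]: common suffix length of x[:i-1], y[:j]
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            if x[i - 1] == y[j - 1]:
                cur[j] = prev[j - 1] + 1  # extend the common suffix
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return x[best_end - best_len:best_end]


print(longest_common_substring("APAT", "PATT"))  # PAT
```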
29
The longest common substring (2)
30
Approximate String Matching
We are given a text string T of length n, a pattern
string P of length m and a maximum number k of
allowed errors, and we look for substrings of T
that are within k errors of P. For instance, if
T = pttapa, P = patt and k = 2, the substrings
T_{1,2}, T_{1,4} and T_{5,6} are all within 2 errors of
P.
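The three substrings in the example can be checked with an ordinary edit distance, assuming that an "error" means one substitution, insertion or deletion. This is a sketch for verification only; the function name `edit_distance` is illustrative.

```python
def edit_distance(a, b):
    """Minimum number of substitutions, insertions and deletions
    needed to turn a into b."""
    prev = list(range(len(b) + 1))       # distance from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j - 1] + (ca != cb),  # match or substitute
                           prev[j] + 1,               # delete ca
                           cur[j - 1] + 1))           # insert cb
        prev = cur
    return prev[-1]


T, P, k = "pttapa", "patt", 2
for lo, hi in [(1, 2), (1, 4), (5, 6)]:
    sub = T[lo - 1:hi]                   # T_{lo,hi} with 1-based bounds
    print(sub, edit_distance(sub, P))    # pt 2, ptta 2, pa 2
```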
31
The suffix edit distance
The suffix edit distance between S1 and S2
is the minimum number of substitutions,
insertions and deletions that transform
some suffix of S1 into S2. Consider S1 = p and
S2 = p. The suffix edit distance between S1 and
S2 is 0. Consider S1 = ptt and S2 = p. The
suffix edit distance between S1 and S2 is 1, as we
can replace the last character t by p.
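A small sketch of how the suffix edit distance can be computed: it is the usual edit distance recurrence, except that skipping a leading part of S1 is free, which is what "some suffix" allows. Allowing the empty suffix is an assumption of this sketch, and the function name is illustrative.

```python
def suffix_edit_distance(s1, s2):
    """Minimum edits that turn some suffix of s1 (possibly empty) into s2."""
    # col[i]: suffix edit distance between the part of s1 read so far and s2[:i]
    col = list(range(len(s2) + 1))       # nothing read yet: i insertions
    for c in s1:
        new = [0]                        # s2[:0] is matched by the empty suffix for free
        for i in range(1, len(s2) + 1):
            cost = 0 if s2[i - 1] == c else 1
            new.append(min(col[i - 1] + cost,  # match or substitute
                           col[i] + 1,         # drop c from the suffix
                           new[i - 1] + 1))    # insert s2[i-1]
        col = new
    return col[-1]


print(suffix_edit_distance("p", "p"))    # 0
print(suffix_edit_distance("ptt", "p"))  # 1
```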
32
Algorithm
33
Dynamic Programming
34
For Example
Consider E(4, 3). The arrows traced are
E(4, 3) to E(3, 2) to E(2, 1) to E(1, 1). We ignore
E(2, 1) to E(1, 1). Thus we have obtained an
occurrence of approximate matching, T_{1,3} = ptt.
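The table and the traced arrows are not in the transcript, so the following is a sketch under the assumption that E(i, j) is the suffix edit distance between T_{1,j} and P_{1,i}, with T = pttapa and P = patt. The tie-breaking order in the trace (diagonal, then vertical, then horizontal) is also an assumption, and both function names are illustrative.

```python
def suffix_edit_table(t, p):
    """E[i][j] = suffix edit distance between t[:j] and p[:i]."""
    n, m = len(t), len(p)
    E = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 is all zeros
    for i in range(1, m + 1):
        E[i][0] = i
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == t[j - 1] else 1
            E[i][j] = min(E[i - 1][j - 1] + cost,  # diagonal
                          E[i - 1][j] + 1,         # vertical
                          E[i][j - 1] + 1)         # horizontal
    return E


def trace_start(E, t, p, j):
    """Follow the arrows back from E[m][j] to row 0 and return the
    1-based start of the occurrence that ends at text position j."""
    i = len(p)
    while i > 0:
        cost = 0 if j > 0 and p[i - 1] == t[j - 1] else 1
        if j > 0 and E[i][j] == E[i - 1][j - 1] + cost:
            i, j = i - 1, j - 1          # diagonal: match or substitution
        elif E[i][j] == E[i - 1][j] + 1:
            i -= 1                       # vertical: consumes no text character
        else:
            j -= 1                       # horizontal: extra text character
    return j + 1


T, P, k = "pttapa", "patt", 2
E = suffix_edit_table(T, P)
print(E[4][3])                                          # 1
print([j for j in range(1, len(T) + 1) if E[4][j] <= k])  # [2, 3, 4, 6]
print(trace_start(E, T, P, 3))                          # 1, i.e. T_{1,3} = ptt
```

Under these assumptions the trace from E(4, 3) visits E(3, 2), E(2, 1) and E(1, 1), and the vertical step from E(2, 1) to E(1, 1) consumes no text character.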