Chapter 13 String Matching

Provided by: cynd4

1
Chapter 13 String Matching
2
Objectives
  • Discuss the following topics
  • Exact String Matching
  • Approximate String Matching
  • Case Study Longest Common Substring

3
Exact String Matching
  • Stringology's major area of interest is pattern
    matching
  • Exact string matching consists of finding an
    exact copy of pattern P in text T

4
Exact String Matching (continued)
  • bruteForceStringMatching(pattern P, text T)
  •   i = 0
  •   while i ≤ |T| - |P|
  •     j = 0
  •     while j < |P| and T[i] == P[j]
  •       i++ // try to match all characters in P
  •       j++
  •     if j == |P|
  •       return match at i - |P| // success if the end
          of P is reached
  •     // if there is a mismatch,
  •     i = i - j + 1 // shift P to the right by one
        position
  •   return no match // failure if fewer characters
      left in T than P
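
The brute-force pseudocode above can be sketched in Python as follows. This is a minimal illustration, not the chapter's own listing; the function name brute_force_match and the convention of returning -1 for "no match" are assumptions of this sketch. The bound on j is tested before the character comparison so that the pattern is never read out of range.

```python
def brute_force_match(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    i = 0
    while i <= n - m:
        j = 0
        # Try to match all characters of the pattern, left to right.
        while j < m and text[i] == pattern[j]:
            i += 1
            j += 1
        if j == m:
            return i - m  # success: the end of the pattern was reached
        # Mismatch: undo the j matched steps and shift right by one position.
        i = i - j + 1
    return -1  # fewer characters are left in the text than in the pattern


print(brute_force_match("put", "outinputting"))
```

In the worst case every alignment compares nearly the whole pattern, so the running time can grow with the product of the two lengths.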

5
Exact String Matching (continued)
  • Hancart(pattern P, text T)
  •   if P[0] == P[1]
  •     sEqual = 1
  •     sDiff = 2
  •   else sEqual = 2
  •     sDiff = 1
  •   i = 0
  •   while i ≤ |T| - |P|
  •     if T[i+1] ≠ P[1]
  •       i = i + sDiff
  •     else j = 1
  •       while j < |P| and T[i+j] == P[j]
  •         j++
  •       if j == |P| and P[0] == T[i]
  •         return match at i
  •       i = i + sEqual
  •   return no match
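
A hedged Python sketch of the Hancart pseudocode above. The name hancart_match and the -1 return value are assumptions of this sketch, and so is the fallback to str.find for patterns shorter than two characters, a case the pseudocode does not cover because it reads P[1].

```python
def hancart_match(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m < 2:
        # Assumption of this sketch: the pseudocode needs P[1] to exist.
        return text.find(pattern)
    # Shifts depend only on whether the first two pattern characters agree.
    if pattern[0] == pattern[1]:
        s_equal, s_diff = 1, 2
    else:
        s_equal, s_diff = 2, 1
    i = 0
    while i <= n - m:
        if text[i + 1] != pattern[1]:
            i += s_diff
        else:
            j = 1
            while j < m and text[i + j] == pattern[j]:
                j += 1
            # The first pattern character is compared last.
            if j == m and pattern[0] == text[i]:
                return i
            i += s_equal
    return -1


print(hancart_match("put", "outinputting"))
```

The shift of 2 is safe because the comparison of T[i+1] with P[1] already rules out the alignment at i + 1 in those cases.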

6
The Knuth-Morris-Pratt Algorithm
  • The Knuth-Morris-Pratt algorithm can be obtained
    from bruteForceStringMatching() by using a
    precomputed table next so that the text index i
    never moves back after a mismatch
  • KnuthMorrisPratt(pattern P, text T)
  •   findNext(P, next)
  •   i = j = 0
  •   while i - j ≤ |T| - |P|
  •     while j == -1 or j < |P| and T[i] == P[j]
  •       i++ // increment i only for matched character
  •       j++
  •     if j == |P|
  •       return a match at i - |P|
  •     j = next[j] // in the case of a mismatch, i does
        not change
  •   return no match
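
The following Python sketch illustrates the Knuth-Morris-Pratt pseudocode above under a few stated assumptions. The slides do not show findNext, so find_next here is a hypothetical helper that builds the basic table (next[0] = -1, next[j] = length of the longest proper border of P[0..j-1]). The outer loop is bounded by the alignment start i - j, which keeps every read of the text in range.

```python
def find_next(pattern: str) -> list:
    """next[j] = length of the longest proper border of pattern[:j]; next[0] = -1."""
    nxt = [-1] * len(pattern)
    k = -1
    for j in range(1, len(pattern)):
        # Fall back through shorter borders until one can be extended.
        while k >= 0 and pattern[k] != pattern[j - 1]:
            k = nxt[k]
        k += 1
        nxt[j] = k
    return nxt


def kmp_match(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    nxt = find_next(pattern)
    i = j = 0
    while i - j <= n - m:  # i - j is the current alignment start
        while j == -1 or (j < m and text[i] == pattern[j]):
            i += 1
            j += 1
        if j == m:
            return i - m
        j = nxt[j]  # mismatch: i does not change
    return -1


print(find_next("abab"), kmp_match("aab", "aaab"))
```

Because i never decreases and each use of the table moves the alignment start forward, the scan takes time linear in the text length.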

7
The Boyer-Moore Algorithm
  • The Boyer-Moore algorithm tries to match P with T
    by comparing them from right to left, not from
    left to right

8
The Boyer-Moore Algorithm (continued)
  • BoyerMooreSimple(pattern P, text T)
  •   initialize all cells of delta1 to |P|
  •   for j = 0 to |P| - 1
  •     delta1[P[j]] = |P| - j - 1
  •   i = |P| - 1
  •   while i < |T|
  •     j = |P| - 1
  •     while j ≥ 0 and P[j] == T[i]
  •       i--
  •       j--
  •     if j == -1
  •       return match at i+1
  •     i = i + max(delta1[T[i]], |P| - j)
  •   return no match
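
A minimal Python sketch of the simplified Boyer-Moore pseudocode above. The function name boyer_moore_simple, the -1 return value, and the use of a dictionary for delta1 (with |P| as the default for characters absent from the pattern) are assumptions of this sketch.

```python
def boyer_moore_simple(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    if m == 0:
        return 0
    # delta1[c]: distance from the rightmost occurrence of c to the pattern's end.
    delta1 = {}
    for j, ch in enumerate(pattern):
        delta1[ch] = m - j - 1
    i = m - 1
    while i < n:
        j = m - 1
        # Compare right to left.
        while j >= 0 and pattern[j] == text[i]:
            i -= 1
            j -= 1
        if j == -1:
            return i + 1
        # m - j guarantees the pattern moves right by at least one position.
        i += max(delta1.get(text[i], m), m - j)
    return -1


print(boyer_moore_simple("put", "outinputting"))
```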

9
The Sunday Algorithms
  • Daniel Sunday (1990) observed that in the case of
    a mismatch with a text character T[i], the pattern
    shifts to the right by at least one position;
    thus, the character T[i+|P|] is included in the
    next comparison
  • It is more advantageous to build delta1 with
    respect to character T[i+|P|]
  • Sunday introduced two more algorithms, based on a
    generalized delta2 table
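
Sunday's observation can be illustrated with a short Python sketch of a quick-search style matcher. Here i denotes the start of the current window, the names sunday_quick_search and shift are hypothetical, and the window comparison uses a slice for brevity; the generalized delta2 table mentioned above is not covered.

```python
def sunday_quick_search(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    # shift[c]: how far to move so the rightmost c in the pattern lines up
    # with the text character just past the current window.
    shift = {ch: m - j for j, ch in enumerate(pattern)}
    i = 0
    while i <= n - m:
        if text[i:i + m] == pattern:
            return i
        if i + m >= n:
            break  # no character past the window is left to inspect
        # A character absent from the pattern lets the window jump past it.
        i += shift.get(text[i + m], m + 1)
    return -1


print(sunday_quick_search("put", "outinputting"))
```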

10
Multiple Searches
  • All the algorithms presented so far stop after
    finding the first occurrence of a pattern in the
    text
  • Modifying the Boyer-Moore algorithm allows for
    multiple searches

11
Multiple Searches (continued)
  • BoyerMooreSimple(pattern P, text T)
  •   initialize all cells of delta1 to |P|
  •   for j = 0 to |P| - 1
  •     delta1[P[j]] = |P| - j - 1
  •   compute delta2
  •   i = |P| - 1
  •   while i < |T|
  •     j = |P| - 1
  •     while j ≥ 0 and P[j] == T[i]
  •       i--
  •       j--
  •     if j == -1
  •       output match at i+1
  •       i = i + |P| + 1 // shift P by one position to
          the right
  •     else i = i + max(delta1[T[i]], delta2[j])

12
Bit-Oriented Approach
  • Each state of the search is represented as a
    number (that is, a string of bits), and a
    transition from one state to the next is the
    result of a small number of bitwise operations
  • A shift-and algorithm that uses a bit-oriented
    approach for string matching was proposed by
    Baeza-Yates and Gonnet (1992)
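
A hedged Python sketch of the shift-and idea. Python integers stand in for machine words, and the names shift_and, masks, and accept are assumptions of this sketch. Bit j of state is set exactly when the pattern prefix P[0..j] matches the text ending at the current character.

```python
def shift_and(pattern: str, text: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    m = len(pattern)
    if m == 0:
        return 0
    # masks[c] has bit j set when pattern[j] == c.
    masks = {}
    for j, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, 0) | (1 << j)
    accept = 1 << (m - 1)  # bit that signals a full match
    state = 0
    for i, ch in enumerate(text):
        # Extend every partial match by one character and start a new one.
        state = ((state << 1) | 1) & masks.get(ch, 0)
        if state & accept:
            return i - m + 1
    return -1


print(shift_and("put", "outinputting"))
```

Each text character costs one shift, one OR, and one AND, which is why the approach is fast when the pattern fits in a machine word.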

13
Matching Sets of Words
  • To considerably improve run time by considering
    all relevant words at the same time during the
    matching process, Aho and Corasick (1975)
    constructed a string-matching automaton
  • The goto function is constructed in the form of a
    trie, or multiway tree, in which consecutive
    characters of a string are used to navigate the
    search in the tree

14
Matching Sets of Words (continued)
  • AhoCorasick(set keywords, text T)
  •   computeGotoFunction(keywords, g, output) // the
      output function is computed
  •   computeFailureFunction(g, output, f) // in
      these two functions
  •   state = 0
  •   for i = 0 to |T| - 1
  •     while g(state, T[i]) == fail
  •       state = f(state)
  •     state = g(state, T[i])
  •     if output(state) is not empty
  •       output a match ending at i: output(state)
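
The Aho-Corasick pseudocode above might be realized in Python roughly as follows. This is a sketch under assumptions: build_automaton is a hypothetical helper that merges the two construction steps, states are list indices with 0 as the root, and matches are reported as (start index, keyword) pairs rather than end positions.

```python
from collections import deque


def build_automaton(keywords):
    """Build the goto trie, failure links, and output sets."""
    goto = [{}]       # goto[state][char] -> next state
    output = [set()]  # output[state] -> keywords recognized at state
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                output.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].add(word)
    fail = [0] * len(goto)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            state = fail[r]
            while state != 0 and ch not in goto[state]:
                state = fail[state]
            fail[s] = goto[state].get(ch, 0)
            output[s] |= output[fail[s]]  # inherit keywords ending here
    return goto, fail, output


def aho_corasick(keywords, text):
    """Return sorted (start, keyword) pairs for every keyword occurrence."""
    goto, fail, output = build_automaton(keywords)
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state != 0 and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in output[state]:
            matches.append((i - len(word) + 1, word))
    return sorted(matches)


words = ["inner", "input", "in", "outer", "output", "out", "put", "outing", "tint"]
print(aho_corasick(words, "outinputting"))
```

The text is scanned once regardless of how many keywords there are, which is the run-time improvement the slide refers to.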

15
Matching Sets of Words (continued)
Figure 13-1 (a) A trie for the string inner, (b)
for the strings inner and input, and (c) for the
set keywords = {inner, input, in, outer, output,
out, put, outing, tint}
16
Matching Sets of Words (continued)
Figure 13-1 (d) the trie (c) with failure links
(e) scanning the trie (d) for the text T =
outinputting (continued)
17
Regular Expression Matching
  • All letters of the alphabet are regular
    expressions
  • If r and s are regular expressions, then r|s,
    (r), r*, and rs are regular expressions.
  • Regular expression r|s represents regular
    expression r or s
  • Regular expression r* (where the star is called
    a Kleene closure) represents any finite, possibly
    empty, sequence of r's: r, rr, rrr, . . . .

18
Regular Expression Matching (continued)
  • Regular expression rs represents the
    concatenation of r and s
  • (r) represents regular expression r
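
The four operators map directly onto most regular expression libraries. As a hedged illustration, the sketch below uses Python's standard re module rather than the automaton construction the chapter goes on to describe; the sample expression a(b|cd)*ef and the helper name matches are chosen only for this example.

```python
import re

# Alternation (b|cd), Kleene closure (...)*, concatenation, and grouping.
SAMPLE = re.compile(r"a(b|cd)*ef")


def matches(candidate: str) -> bool:
    """True when the whole candidate string belongs to the language of SAMPLE."""
    return SAMPLE.fullmatch(candidate) is not None


for candidate in ["aef", "abef", "acdbef", "acef"]:
    print(candidate, matches(candidate))
```

The string aef is accepted because the closure allows zero repetitions, while acef is rejected because c can only appear as part of cd.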

19
Regular Expression Matching (continued)
Figure 13-2 (a) An automaton representing one
letter c; an automaton representing a
regular expression (b) r|s
20
Regular Expression Matching (continued)
Figure 13-2 (c) rs, (d) r* (continued)
21
Regular Expression Matching (continued)
Figure 13-3 The Thompson automaton for the
regular expression a(b|cd)*ef
22
Suffix Tries and Trees
  • A suffix trie for a text T is a tree structure in
    which each edge is labeled with one letter of T
    and each suffix of T is represented in the trie
    as a concatenation of edge labels from the root
    to some node of the trie
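
A minimal Python sketch of a suffix trie built from nested dictionaries, one edge per letter. The names build_suffix_trie and contains_substring are hypothetical, and this naive construction inserts each suffix separately, so it takes quadratic time and space; it is not Ukkonen's construction.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a trie of nested dicts."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            # Follow the edge labeled ch, creating it if it is missing.
            node = node.setdefault(ch, {})
    return root


def contains_substring(trie: dict, pattern: str) -> bool:
    """A pattern occurs in the text iff it labels a path from the root."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("caracas")
print(contains_substring(trie, "raca"), contains_substring(trie, "carc"))
```

Every substring is a prefix of some suffix, which is why a walk from the root answers substring queries in time proportional to the pattern length.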

23
Suffix Tries and Trees (continued)
Figure 13-4 (a) A suffix trie for the string
caracas
24
Suffix Tries and Trees (continued)
Figure 13-4 (b) a suffix tree for the substring
caraca and (c) for the
string caracas (continued)
25
Suffix Tries and Trees (continued)
Figure 13-5 Creating an Ukkonen suffix trie for
the string pepper
26
Suffix Tries and Trees (continued)
Figure 13-5 Creating an Ukkonen suffix trie for
the string pepper (continued)
27
Suffix Tries and Trees (continued)
Figure 13-5 Creating an Ukkonen suffix trie for
the string pepper (continued)
28
Suffix Tries and Trees (continued)
Figure 13-6 Creating an Ukkonen suffix tree for
the string pepper
29
Suffix Tries and Trees (continued)
Figure 13-6 Creating an Ukkonen suffix tree for
the string pepper (continued)
30
Suffix Arrays
  • If suffix trees require too much space, a simple
    alternative is a suffix array (Manber and Myers,
    1993)
  • The suffix array pos is the array of positions 0
    through |T| - 1 of suffixes taken in
    lexicographic order
  • The suffix array can be created from an existing
    suffix tree on which an ordered depth-first
    traversal is performed
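
As a hedged illustration, the Python sketch below builds pos by sorting the suffix start positions directly, which is a swapped-in naive technique rather than the suffix-tree traversal described above. The names build_suffix_array and find_occurrences are hypothetical.

```python
def build_suffix_array(text: str) -> list:
    """Positions 0..len(text)-1 ordered by the suffixes that start there."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, pos: list, pattern: str) -> list:
    """Binary-search pos for suffixes that begin with pattern."""
    m = len(pattern)
    lo, hi = 0, len(pos)
    # Find the first suffix whose m-character prefix is not below pattern.
    while lo < hi:
        mid = (lo + hi) // 2
        if text[pos[mid]:pos[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    result = []
    # Matching suffixes are adjacent in lexicographic order.
    while lo < len(pos) and text[pos[lo]:pos[lo] + m] == pattern:
        result.append(pos[lo])
        lo += 1
    return sorted(result)


pos = build_suffix_array("caracas")
print(pos, find_occurrences("caracas", pos, "ca"))
```

The array stores one integer per text position, which is the space saving over a suffix tree; the price is a logarithmic factor in search time.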

31
Approximate String Matching
  • A popular measure of the similarity of two
    strings is the number of elementary edit
    operations that are needed to transform one
    string into another
  • The difference between two strings is sought in
    terms of insertion (I), deletion (D), and
    substitution (S)
  • The difference can be represented as a trace, an
    alignment (matching), or a listing (derivation)

32
String Similarity
  • The string similarity problem can be approached
    by reducing the problem of finding the minimum
    distance for a particular i and j to the problem
    of finding the minimum distance for values not
    larger than i and j
  • There are four possibilities: deletion,
    insertion, substitution, and match
  • The Wagner and Fischer algorithm (1974) computes
    this minimum distance with dynamic programming
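
The reduction described above can be sketched as a dynamic-programming table in Python. This is a minimal illustration with unit costs for all three edit operations; the function name edit_distance and the table name dist are assumptions of this sketch.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    m, n = len(a), len(b)
    # dist[i][j]: distance between the prefixes a[:i] and b[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1  # match or substitution
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[m][n]


print(edit_distance("kitten", "sitting"))
```

Each cell depends only on cells with indices not larger than its own, so the table can be filled row by row in time proportional to the product of the two lengths.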

33
String Matching with k Errors
  • To determine all substrings of text T for which
    the Levenshtein distance to pattern P does not
    exceed k, perform string matching with k errors
    (or k differences)
  • All the possibilities for matching P(0...j) with
    a substring of T that ends at position i with
    e ≤ k errors can be summarized in four cases:
    Match, Substitution, Insertion, and Deletion. In
    the Match case, for example, P[j] = T[i] and
    there is a match with e errors between
    P(0...j - 1) and a substring ending at T[i-1]
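
A hedged Python sketch of matching with k errors, using the same recurrence as the edit-distance table but letting a match start at any text position (the entry for the empty pattern prefix is always 0). The function name match_with_k_errors and the choice to report end positions are assumptions of this sketch.

```python
def match_with_k_errors(pattern: str, text: str, k: int) -> list:
    """End positions i where some substring of text ending at i is within k edits of pattern."""
    m = len(pattern)
    # prev[j]: fewest errors matching pattern[:j] against a substring ending
    # just before the current text character.
    prev = list(range(m + 1))
    ends = []
    for i, ch in enumerate(text):
        cur = [0]  # the empty prefix matches anywhere with 0 errors
        for j in range(1, m + 1):
            cost = 0 if pattern[j - 1] == ch else 1
            cur.append(min(
                prev[j - 1] + cost,  # match or substitution
                prev[j] + 1,         # extra text character
                cur[j - 1] + 1,      # extra pattern character
            ))
        if cur[m] <= k:
            ends.append(i)
        prev = cur
    return ends


print(match_with_k_errors("abc", "xabcx", 1))
```

Only two columns are kept at a time, so the space used is proportional to the pattern length.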

34
Case Study Longest Common Substring
Figure 13-7 (a-h) Creating an Ukkonen suffix tree
for the string abaabaac
35
Case Study Longest Common Substring (continued)
Figure 13-7 (a-h) Creating an Ukkonen suffix tree
for the string abaabaac
(continued)
36
Case Study Longest Common Substring (continued)
Figure 13-7 (i) a data structure used for
implementation of the
Ukkonen tree (h) (continued)
37
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
38
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
39
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
40
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
41
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
42
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
43
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
44
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
45
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
46
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
47
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
48
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
49
Case Study Longest Common Substring (continued)
Figure 13-8 Listing of the program to find
longest common substring
(continued)
50
Summary
  • Stringology's major area of interest is pattern
    matching
  • Exact string matching consists of finding an
    exact copy of pattern P in text T
  • The bruteForceStringMatching algorithm is an
    example of exact string matching
  • The Knuth-Morris-Pratt algorithm can be obtained
    from bruteForceStringMatching()
  • The Boyer-Moore algorithm tries to match P with T
    by comparing them from right to left, not from
    left to right

51
Summary (continued)
  • Daniel Sunday (1990) observed that in the case of
    a mismatch with a text character T[i], the pattern
    shifts to the right by at least one position;
    thus, the character T[i+|P|] is included in the
    next comparison.
  • Modifying the Boyer-Moore algorithm allows for
    multiple searches
  • A shift-and algorithm that uses a bit-oriented
    approach for string matching was proposed by
    Baeza-Yates and Gonnet (1992)
  • To considerably improve run time by considering
    all relevant words at the same time during the
    matching process, Aho and Corasick (1975)
    constructed a string-matching automaton

52
Summary (continued)
  • All letters of the alphabet are regular
    expressions
  • A suffix trie for a text T is a tree structure in
    which each edge is labeled with one letter of T
    and each suffix of T is represented in the trie
    as a concatenation of edge labels from the root
    to some node of the trie
  • If suffix trees require too much space, a simple
    alternative is a suffix array (Manber and Myers,
    1993)

53
Summary (continued)
  • A popular measure of the similarity of two
    strings is the number of elementary edit
    operations that are needed to transform one
    string into another
  • The difference between two strings is sought in
    terms of insertion (I), deletion (D), and
    substitution (S)
  • The string similarity problem can be approached
    by reducing the problem of finding the minimum
    distance for a particular i and j to the problem
    of finding the minimum distance for values not
    larger than i and j

54
Summary (continued)
  • The Wagner and Fischer algorithm (1974) computes
    this minimum distance with dynamic programming
  • To determine all substrings of text T for which
    the Levenshtein distance to pattern P does not
    exceed k, perform string matching with k errors
    (or k differences)