Matching Algorithms - PowerPoint PPT Presentation

Provided by: Jeffr258

Transcript and Presenter's Notes


1
Chapter 7
  • Matching Algorithms

2
Chapter Outline
  • String Matching
  • Straightforward matching
  • Finite automata
  • Knuth-Morris-Pratt algorithm
  • Boyer-Moore algorithm
  • Approximate string matching

3
Prerequisites
  • Before beginning this chapter, you should be able
    to
  • Create finite automata
  • Use character strings
  • Use one- and two-dimensional arrays
  • Describe growth rates and order

4
Goals
  • At the end of this chapter you should be able to
  • Explain the substring matching problem
  • Explain the straightforward algorithm and its
    analysis
  • Explain the use of finite automata for string
    matching

5
Goals (continued)
  • Construct and use a Knuth-Morris-Pratt automaton
  • Construct and use slide and jump arrays for the
    Boyer-Moore algorithm
  • Explain the method of approximate string matching

6
Matching Algorithms
  • The general problem is to find a string of
    characters in a larger piece of text
  • Could also be used to find any pattern of bits or
    bytes in a larger binary file
  • Algorithms find the first occurrence of the
    string in the larger text
  • We will assume that the string length is S and
    the text length is T when we analyze the
    algorithms

7
Straightforward Matching Example
8
Straightforward Matching
  • We compare the first character of the string with
    the first character of the text
  • If they match, we move to the next character
    until we have matched the entire string or found
    a mismatch
  • If there is a mismatch, we move the string by one
    place and start again

9
The Straightforward Algorithm
    subLoc = 1
    textLoc = 1
    textStart = 1
    while textLoc ≤ length(text)
          and subLoc ≤ length(substring) do
       if text[textLoc] = substring[subLoc] then
          textLoc = textLoc + 1
          subLoc = subLoc + 1
       else
          textStart = textStart + 1
          textLoc = textStart
          subLoc = 1
       end if
    end while
    if subLoc > length(substring) then
       return textStart
    else
       return 0
    end if
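
The loop above can be sketched in Python. This is only an illustration under assumed names (straightforward_match is not from the slides); it uses 0-based indexing internally and converts the result back to the slides' 1-based convention, where 0 means "not found".

```python
def straightforward_match(text: str, substring: str) -> int:
    """Return the 1-based position of the first match, or 0 if there is none."""
    sub_loc = 0      # 0-based index into substring
    text_loc = 0     # 0-based index into text
    text_start = 0   # where the current attempt began
    while text_loc < len(text) and sub_loc < len(substring):
        if text[text_loc] == substring[sub_loc]:
            text_loc += 1
            sub_loc += 1
        else:
            # Mismatch: move the substring one place and start again.
            text_start += 1
            text_loc = text_start
            sub_loc = 0
    if sub_loc >= len(substring):
        return text_start + 1  # convert to the slides' 1-based convention
    return 0


print(straightforward_match("there they are", "they"))  # 7
```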

10
Analysis
  • In the worst case, we succeed on each comparison
    of the string with the text except for the last
  • This is possible if the string is all X
    characters except for one Y at the end and the
    text is all X characters
  • In this case, we do S(T - S + 1) comparisons
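
The worst-case count can be checked on a small instance. The sketch below is an assumption-laden illustration: count_comparisons is a hypothetical helper, and it stops at the last start position where the string still fits (T - S + 1 attempts), which is the situation the formula describes.

```python
def count_comparisons(text: str, substring: str) -> int:
    """Count character comparisons made by trying each start position in turn."""
    comparisons = 0
    last_start = len(text) - len(substring)
    for start in range(last_start + 1):
        for offset, ch in enumerate(substring):
            comparisons += 1
            if text[start + offset] != ch:
                break          # mismatch: give up on this start position
        else:
            break              # no mismatch: full match found, stop searching
    return comparisons


S, T = 4, 10
worst = count_comparisons("X" * T, "X" * (S - 1) + "Y")
print(worst, S * (T - S + 1))  # 28 28
```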

11
Analysis
  • Natural language texts do not have this sort of
    pattern, so the algorithm will do better with
    them
  • This is because there is an uneven distribution
    of character use in natural language
  • Studies show that this algorithm uses a little
    over T comparisons on a natural language text

12
Finite Automata
  • Finite automata are used to decide whether a word
    is in a given language
  • We could set up a finite automaton to accept the
    string we are looking for and then if we wind up
    in the accepting state, we know we found the
    string and can stop

13
Finite Automata
  • Because we will look at each text character once,
    this will do at most T comparisons
  • However, the algorithm to construct a finite
    automaton from a string takes a long time
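
A minimal sketch of the idea, assuming a non-empty pattern. The construction shown is a deliberately naive one (it rechecks prefixes for every state and character), chosen to make the cost of building the automaton visible; it is not necessarily the construction the chapter has in mind, and the names build_automaton and automaton_match are illustrative.

```python
def build_automaton(pattern: str) -> list[dict[str, int]]:
    """delta[state][ch] = length of the longest pattern prefix that is a
    suffix of pattern[:state] + ch. State len(pattern) is accepting."""
    alphabet = set(pattern)
    delta = []
    for state in range(len(pattern) + 1):
        row = {}
        for ch in alphabet:
            seen = pattern[:state] + ch
            k = min(len(pattern), len(seen))
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        delta.append(row)
    return delta


def automaton_match(text: str, pattern: str) -> int:
    """Return the 1-based position of the first match, or 0 if none."""
    delta = build_automaton(pattern)
    state = 0
    for pos, ch in enumerate(text):
        # Characters that never occur in the pattern send us back to state 0.
        state = delta[state].get(ch, 0)
        if state == len(pattern):
            return pos - len(pattern) + 2
    return 0


print(automaton_match("there they are", "they"))  # 7
```

Matching looks at each text character exactly once, while the table construction does far more work per state than the matching loop does per character.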

14
Knuth-Morris-Pratt Algorithm
  • For each character comparison, we can either
    succeed or fail
  • The Knuth-Morris-Pratt (KMP) algorithm constructs
    an automaton that labels the nodes with the
    string characters and has a success and fail link
    for each node

15
Knuth-Morris-Pratt Algorithm
  • The success links are easy to determine because
    they just take us to the next node
  • The fail links will take us back in the automaton
    and are based on the string we are trying to
    match
  • We will get a new character of the text when we
    succeed in matching, but will reuse that
    character if we fail

16
Knuth-Morris-Pratt Example
  • The automaton for the string ababcd would be

17
Knuth-Morris-Pratt Matching Algorithm
    subLoc = 1
    textLoc = 1
    while textLoc ≤ length(text)
          and subLoc ≤ length(substring) do
       if subLoc = 0 or
             text[textLoc] = substring[subLoc] then
          textLoc = textLoc + 1
          subLoc = subLoc + 1
       else
          subLoc = fail[subLoc]
       end if
    end while
    if subLoc > length(substring) then
       return textLoc - length(substring)
    else
       return 0
    end if
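
A Python rendering of this loop, kept 1-based to stay close to the slide. The function name kmp_match and the padded fail list are assumptions for illustration; the fail links for "ababcd" in the example were worked out by hand from the fail-link rule rather than taken from the slides.

```python
def kmp_match(text: str, substring: str, fail: list[int]) -> int:
    """Return the 1-based position of the first match, or 0 if none.
    fail[i] is the fail link of node i; fail[0] is unused padding."""
    sub_loc = 1
    text_loc = 1
    while text_loc <= len(text) and sub_loc <= len(substring):
        if sub_loc == 0 or text[text_loc - 1] == substring[sub_loc - 1]:
            text_loc += 1            # success: move on to a new text character
            sub_loc += 1
        else:
            sub_loc = fail[sub_loc]  # failure: reuse the same text character
    if sub_loc > len(substring):
        return text_loc - len(substring)
    return 0


# Hand-computed fail links for "ababcd" (index 0 is padding).
fail_ababcd = [0, 0, 1, 1, 2, 3, 1]
print(kmp_match("abababcd", "ababcd", fail_ababcd))  # 3
```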

18
Knuth-Morris-Pratt Failure Link Algorithm
    fail[1] = 0
    for i = 2 to length(substring) do
       temp = fail[i - 1]
       while temp > 0 and substring[temp] ≠
             substring[i - 1] do
          temp = fail[temp]
       end while
       fail[i] = temp + 1
    end for
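
The failure-link construction translates almost line for line into Python. In this sketch (build_fail_links is an assumed name) the list is padded at index 0 so that fail[i] keeps the slide's 1-based meaning, while the string itself is indexed 0-based.

```python
def build_fail_links(substring: str) -> list[int]:
    """Return 1-based fail links; index 0 is unused padding."""
    n = len(substring)
    fail = [0] * (n + 1)   # fail[1] = 0: nowhere to fall back to from the first node
    for i in range(2, n + 1):
        temp = fail[i - 1]
        # Follow fail links until the character before node i matches.
        while temp > 0 and substring[temp - 1] != substring[i - 2]:
            temp = fail[temp]
        fail[i] = temp + 1
    return fail


print(build_fail_links("ababcd")[1:])  # [0, 1, 1, 2, 3, 1]
```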

19
Failure Link Analysis
  • The ≠ comparison will be false at most S - 1
    times
  • The fail links are all smaller than their index
  • temp is decreased each time the ≠ is true
  • The while loop is not done on the first pass

20
Failure Link Analysis
  • The variable temp is incremented by 1 for the
    next pass because of
  • The final statement of the for loop
  • The increment of i
  • The first statement of the for loop
  • There are S - 2 next passes, so temp is
    incremented S - 2 times
  • Because fail[1] = 0, temp never becomes negative

21
Failure Link Analysis
  • temp starts at 0 and is incremented no more than
    S - 2 times
  • Because temp is decreased for each mismatched
    comparison, there are at most S - 2 failed
    comparisons
  • There are at most S - 1 successful comparisons,
    so there are at most 2S - 3 comparisons

22
Match Algorithm Analysis
  • The while loop does one character comparison per
    pass
  • Either textLoc and subLoc are incremented or
    subLoc is decremented
  • Because textLoc starts at 1 and is never greater
    than T, it is incremented no more than T times

23
Match Algorithm Analysis
  • Because subLoc starts at 1, is incremented only
    together with textLoc (so no more than T times),
    and never drops below 0, it is decremented no
    more than T times
  • This means that the then clause is done no more
    than T times and the else clause is done no more
    than T times, so there are no more than 2T
    comparisons

24
Knuth-Morris-Pratt Analysis
  • The fail link construction takes at most 2S - 3
    comparisons and the matching takes at most 2T
    comparisons
  • The KMP algorithm is O(S + T), whereas the
    straightforward algorithm is O(ST)

25
Boyer-Moore Algorithm
  • If we match from the right of the string, a
    mismatch might help us move the string a bigger
    distance in the text to skip over other mismatch
    locations that can be predicted

26
Boyer-Moore Algorithm
  • We also have to consider what we have already
    matched, so that we do not make too small a move
  • If we move the string by one position to line up
    the two t characters we will fail quickly, but
    that could be predicted

27
Boyer-Moore Algorithm
  • This algorithm calculates a slide and a jump move
  • The slide value tells us how much the pattern
    should be moved to line up the text character
    that did not match
  • The jump value tells us how much to move the
    pattern to line up the end characters that
    matched with their occurrence earlier in the
    string

28
Boyer-Moore Matching Algorithm
    textLoc = length(pattern)
    patternLoc = length(pattern)
    while (textLoc ≤ length(text)) and (patternLoc > 0) do
       if text[textLoc] = pattern[patternLoc] then
          textLoc = textLoc - 1
          patternLoc = patternLoc - 1
       else
          textLoc = textLoc +
             MAXIMUM(slide[text[textLoc]], jump[patternLoc])
          patternLoc = length(pattern)
       end if
    end while
    if patternLoc = 0 then
       return textLoc + 1
    else
       return 0
    end if
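
A Python sketch of this matching loop, kept 1-based like the slide. The function name and the way the tables are passed in are assumptions: slide is a dict whose missing characters default to the pattern length, and jump is a list padded at index 0. The tables for "abab" in the example were worked out by hand.

```python
def boyer_moore_match(text: str, pattern: str,
                      slide: dict[str, int], jump: list[int]) -> int:
    """Return the 1-based position of the first match, or 0 if none."""
    text_loc = len(pattern)
    pattern_loc = len(pattern)
    while text_loc <= len(text) and pattern_loc > 0:
        if text[text_loc - 1] == pattern[pattern_loc - 1]:
            # Characters agree: keep comparing right to left.
            text_loc -= 1
            pattern_loc -= 1
        else:
            # Mismatch: move by the larger of the slide and jump values.
            bad = slide.get(text[text_loc - 1], len(pattern))
            text_loc += max(bad, jump[pattern_loc])
            pattern_loc = len(pattern)
    if pattern_loc == 0:
        return text_loc + 1
    return 0


# Hand-computed tables for the pattern "abab" (jump[0] is padding).
slide_abab = {"a": 1, "b": 0}
jump_abab = [0, 5, 4, 5, 1]
print(boyer_moore_match("abcabab", "abab", slide_abab, jump_abab))  # 4
```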

29
Deciding on a Slide Value
30
Boyer-Moore Slide Array Algorithm
    for every ch in the character set do
       slide[ch] = length(pattern)
    end for
    for i = 1 to length(pattern) do
       slide[pattern[i]] = length(pattern) - i
    end for
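
In Python the slide array is naturally a dictionary. This sketch (build_slide is an assumed name) stores only the characters that occur in the pattern and treats every other character as having the default value, the pattern length, instead of looping over the whole character set.

```python
def build_slide(pattern: str) -> dict[str, int]:
    """slide[ch] = distance from the last occurrence of ch to the pattern's end.
    Characters missing from the dict take the default, len(pattern)."""
    slide = {}
    for i, ch in enumerate(pattern, start=1):  # i is 1-based, as on the slide
        slide[ch] = len(pattern) - i           # later occurrences overwrite earlier ones
    return slide


pattern = "datadata"
slide = build_slide(pattern)
print(slide)                         # {'d': 3, 'a': 0, 't': 1}
print(slide.get("x", len(pattern)))  # 8
```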

31
Boyer-Moore Jump Array Algorithm
    for i = 1 to length(pattern) do
       jump[i] = 2 * length(pattern) - i
    end for
    test = length(pattern)
    target = length(pattern) + 1
    while test > 0 do
       link[test] = target
       while target ≤ length(pattern)
             and pattern[test] ≠ pattern[target] do
          jump[target] = MINIMUM(jump[target],
                                 length(pattern) - test)
          target = link[target]
       end while
       test = test - 1
       target = target - 1
    end while

32
Boyer-Moore Jump Array Algorithm
    for i = 1 to target do
       jump[i] = MINIMUM(jump[i],
                         length(pattern) + target - i)
    end for
    temp = link[target]
    while target < length(pattern) do
       while target ≤ temp do
          jump[target] = MINIMUM(jump[target],
                                 temp - target + length(pattern))
          target = target + 1
       end while
       temp = link[temp]
    end while
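
Both parts of the jump array calculation can be put together in one Python function. This is a sketch that assumes a non-empty pattern; build_jump is an illustrative name, and the pattern is padded with a leading blank so that indices keep their 1-based meaning from the slides.

```python
def build_jump(pattern: str) -> list[int]:
    """Return the 1-based jump array; index 0 is unused padding."""
    p = len(pattern)
    pat = " " + pattern        # pad so pat[i] is the slides' pattern[i]
    jump = [0] * (p + 1)
    link = [0] * (p + 2)       # 1-based, with a spare slot at the end
    for i in range(1, p + 1):
        jump[i] = 2 * p - i    # start from the largest possible jump
    test, target = p, p + 1
    while test > 0:
        link[test] = target
        while target <= p and pat[test] != pat[target]:
            jump[target] = min(jump[target], p - test)
            target = link[target]
        test -= 1
        target -= 1
    for i in range(1, target + 1):
        jump[i] = min(jump[i], p + target - i)
    temp = link[target]
    while target < p:
        while target <= temp:
            jump[target] = min(jump[target], temp - target + p)
            target += 1
        temp = link[temp]
    return jump


print(build_jump("abab")[1:])  # [5, 4, 5, 1]
```

For "abab", a mismatch at the last pattern character moves the comparison point by 1, while a mismatch after matching the final "b" moves it by 5, because no earlier "b" is preceded by a different character.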

33
Jump Array Calculation Example
34
Boyer-Moore Analysis
  • The slide array calculation does O(A + P)
    assignments but no comparisons, where A is the
    size of the character set and P is the pattern
    length
  • The jump array calculation at worst compares all
    of the pattern characters with those appearing
    later for O(P^2) comparisons

35
Boyer-Moore Analysis
  • Studies have shown that with natural language
    text, and a pattern of six or more characters,
    there are at most 0.4T comparisons
  • As the length of the pattern increases, the
    number of comparisons drops to about 0.25T

36
Approximate String Matching
  • Spelling checkers will make suggestions of close
    words that could have been intended for
    misspelled words
  • This involves finding words that are close to the
    misspelled word
  • We will talk about approximate string matching in
    terms of a string and text as in the other
    algorithms

37
Common errors
  • The string could have characters that are missing
    from the text
  • The text could have characters that are missing
    from the string
  • There could be a character in the string or the
    text that needs to be changed

38
Errors Example
  • Matching the string "ad" with the text "read" we
    could have
  • 2 mismatches in the first position or a missing
    "re" from the string
  • 2 mismatches in the second position or just a
    missing "e" from the string

39
The Algorithm
  • This can be complex because it might be that a
    better match occurs if we look at other
    possibilities
  • In the example above, for the second position
    there were 2 mismatches of characters, but we get
    a better result if we add just one character to
    the string

40
The Algorithm
  • To keep the algorithm a little simpler, we use a
    larger structure to keep track of what we have
    found so far
  • In this case, we will keep a two-dimensional
    array with the best matches found so far
  • This array will have a row for each character of
    the string and a column for each character of the
    text

41
The Array
  • For each location of the array, diffs[i, j], we
    will choose the minimum of
  • diffs[i - 1, j - 1] if string[i] = text[j],
    otherwise diffs[i - 1, j - 1] + 1
  • diffs[i - 1, j] + 1
  • diffs[i, j - 1] + 1
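
The three-way minimum can be sketched in Python as follows. The boundary values are an assumption not stated on the slide: row 0 is all zeros (a match may begin anywhere in the text) and column 0 holds i (the first i string characters have nothing to match). The function name approximate_match is illustrative.

```python
def approximate_match(string: str, text: str) -> list[int]:
    """Return the last row of diffs: entry j is the fewest differences for a
    match of the whole string that ends at text position j (1-based)."""
    rows, cols = len(string), len(text)
    # Assumed boundaries: row 0 all zeros, column 0 counts unmatched string characters.
    diffs = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        diffs[i][0] = i
    for j in range(1, cols + 1):
        for i in range(1, rows + 1):
            same = string[i - 1] == text[j - 1]
            diffs[i][j] = min(
                diffs[i - 1][j - 1] + (0 if same else 1),  # match, or change a character
                diffs[i - 1][j] + 1,                       # string character missing from the text
                diffs[i][j - 1] + 1,                       # text character missing from the string
            )
    return diffs[rows]


last_row = approximate_match("trim", "try the trumpet")
print(min(last_row[1:]))  # 1
```

The smallest entry in the last row is 1 here because "trum" differs from "trim" by a single changed character and "trim" itself never occurs in the text.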

42
Example
  • If we compare the string "trim" with the text
    "try the trumpet" we get

43
Analysis
  • We do not really need the entire array, but just
    need two columns - the current one and the
    previous one on which it is based
  • We compare each string character with each text
    character and so do ST comparisons
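
The two-column observation can be sketched like this, again assuming a zero top row and a first column that counts unmatched string characters. The name best_approximate_match is illustrative; the function keeps only the previous and current columns and reports the smallest number of differences over all end positions in the text.

```python
def best_approximate_match(string: str, text: str) -> int:
    """Fewest differences over all end positions, using two columns only."""
    previous = list(range(len(string) + 1))  # column 0: i unmatched string characters
    best = previous[-1]
    for ch in text:
        current = [0]                        # row 0 stays 0: a match may start anywhere
        for i, s in enumerate(string, start=1):
            current.append(min(
                previous[i - 1] + (0 if s == ch else 1),  # diagonal neighbor
                current[i - 1] + 1,                       # neighbor above, same column
                previous[i] + 1,                          # neighbor in the previous column
            ))
        best = min(best, current[-1])
        previous = current
    return best


print(best_approximate_match("trim", "try the trumpet"))  # 1
```

Memory use drops from S * T entries to about 2S, while the number of character comparisons stays at S * T.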