Matching Algorithms presentation

Title: Matching Algorithms

1
Chapter 7

Matching Algorithms

2
Chapter Outline

String Matching
Straightforward matching
Finite automata
Knuth-Morris-Pratt algorithm
Boyer-Moore algorithm
Approximate string matching

3
Prerequisites

Before beginning this chapter, you should be able
to
Create finite automata
Use character strings
Use one- and two-dimensional arrays
Describe growth rates and order

4
Goals

At the end of this chapter you should be able to
Explain the substring matching problems
Explain the straightforward algorithm and its
analysis
Explain the use of finite automata for string
matching

5
Goals (continued)

Construct and use a Knuth-Morris-Pratt automaton
Construct and use slide and jump arrays for the
Boyer-Moore algorithm
Explain the method of approximate string matching

6
Matching Algorithms

The general problem is to find a string of
characters in a larger piece of text
Could also be used to find any pattern of bits or
bytes in a larger binary file
Algorithms find the first occurrence of the
string in the larger text
We will assume that the string length is S and
the text length is T when we analyze the
algorithms

7
Straightforward Matching Example
8
Straightforward Matching

We compare the first character of the string with
the first character of the text
If they match, we move to the next character
until we have matched the entire string or found
a mismatch
If there is a mismatch, we move the string by one
place and start again

9
The Straightforward Algorithm

subLoc ← 1
textLoc ← 1
textStart ← 1
while textLoc ≤ length(text)
      and subLoc ≤ length(substring) do
   if text[textLoc] = substring[subLoc] then
      textLoc ← textLoc + 1
      subLoc ← subLoc + 1
   else
      textStart ← textStart + 1
      textLoc ← textStart
      subLoc ← 1
   end if
end while
if subLoc > length(substring) then
   return textStart
else
   return 0
end if
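
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the presenter's own code: the function name straightforward_match is an assumption, and the indexes are 0-based internally, with the result converted back to the 1-based position (or 0 for no match) that the slide returns.

```python
def straightforward_match(text: str, substring: str) -> int:
    """Return the 1-based position of the first match, or 0 if there is none."""
    sub_loc = 0      # 0-based index into substring
    text_loc = 0     # 0-based index into text
    text_start = 0   # 0-based index where the current attempt began
    while text_loc < len(text) and sub_loc < len(substring):
        if text[text_loc] == substring[sub_loc]:
            # Characters agree: advance in both strings.
            text_loc += 1
            sub_loc += 1
        else:
            # Mismatch: slide the substring one place and start again.
            text_start += 1
            text_loc = text_start
            sub_loc = 0
    if sub_loc >= len(substring):
        return text_start + 1   # convert to the slide's 1-based position
    return 0


print(straightforward_match("there they are", "they"))
```

With the text "there they are" and the string "they", the first three characters match at position 1 before failing on the fourth, and the match is finally reported at position 7.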

10
Analysis

In the worst case, we succeed on each comparison
of the string with the text except for the last
This is possible if the string is all X
characters except for one Y at the end and the
text is all X characters
In this case, we do S(T - S + 1) comparisons
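
The worst case can be checked with a small counting sketch. The helper name count_comparisons is hypothetical, and the sketch assumes the search stops at the last alignment where the whole string still fits in the text, which is the assumption behind the S(T - S + 1) figure.

```python
def count_comparisons(text: str, substring: str) -> int:
    """Count character comparisons made by straightforward matching."""
    comparisons = 0
    last_start = len(text) - len(substring)
    for start in range(last_start + 1):
        for offset in range(len(substring)):
            comparisons += 1
            if text[start + offset] != substring[offset]:
                break           # mismatch: try the next alignment
        else:
            return comparisons  # every character matched: stop searching
    return comparisons


T, S = 10, 4
worst = count_comparisons("X" * T, "X" * (S - 1) + "Y")
print(worst, S * (T - S + 1))
```

For T = 10 and S = 4, each of the 7 alignments costs 4 comparisons, so both printed values are 28.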

11
Analysis

Natural language texts do not have this sort of
pattern, so the algorithm will do better with
them
This is because there is an uneven distribution
of character use in natural language
Studies show that this algorithm uses a little
over T comparisons on a natural language text

12
Finite Automata

Finite automata are used to decide whether a word
is in a given language
We could set up a finite automaton to accept the
string we are looking for and then if we wind up
in the accepting state, we know we found the
string and can stop

13
Finite Automata

Because we will look at each text character once,
this will do at most T comparisons
However, the algorithm to construct a finite
automaton from a string takes a long time
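
One way to picture this trade-off is the deliberately naive sketch below. It is an illustration under stated assumptions, not the construction the slides have in mind: the names build_automaton and automaton_match are hypothetical, and the alphabet is assumed to be just the characters that occur in the text or the string. Building the table tries every state, every character, and every candidate prefix, which is why construction is slow, while matching reads each text character exactly once.

```python
def build_automaton(pattern: str, alphabet) -> list:
    """table[state][ch] = length of the longest pattern prefix that is a
    suffix of the characters matched so far followed by ch."""
    table = []
    for state in range(len(pattern) + 1):
        row = {}
        for ch in alphabet:
            seen = pattern[:state] + ch
            k = min(len(pattern), len(seen))
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[ch] = k
        table.append(row)
    return table


def automaton_match(text: str, pattern: str) -> int:
    """Return the 1-based position of the first match, or 0 if none."""
    if not pattern:
        return 1
    table = build_automaton(pattern, set(text) | set(pattern))
    state = 0
    for position, ch in enumerate(text, start=1):
        state = table[state][ch]          # one table lookup per text character
        if state == len(pattern):         # accepting state reached
            return position - len(pattern) + 1
    return 0


print(automaton_match("abababcd", "ababcd"))
```

The accepting state is reached after the eighth text character, so the match is reported at position 3.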

14
Knuth-Morris-Pratt Algorithm

For each character comparison, we can either
succeed or fail
The Knuth-Morris-Pratt (KMP) algorithm constructs
an automaton that labels the nodes with the
string characters and has a success and fail link
for each node

15
Knuth-Morris-Pratt Algorithm

The success links are easy to determine because
they just take us to the next node
The fail links will take us back in the automaton
and are based on the string we are trying to
match
We will get a new character of the text when we
succeed in matching, but will reuse that
character if we fail

16
Knuth-Morris-Pratt Example

The automaton for the string ababcd would be

17
Knuth-Morris-Pratt Matching Algorithm

subLoc ← 1
textLoc ← 1
while textLoc ≤ length(text)
      and subLoc ≤ length(substring) do
   if subLoc = 0 or
         text[textLoc] = substring[subLoc] then
      textLoc ← textLoc + 1
      subLoc ← subLoc + 1
   else
      subLoc ← fail[subLoc]
   end if
end while
if subLoc > length(substring) then
   return textLoc - length(substring)
else
   return 0
end if
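
A Python sketch of this matching loop is given below. It assumes the fail links are supplied by the caller as a 1-based list (index 0 is unused padding), and the name kmp_match is hypothetical. The fail values used in the example for the string ababcd were worked out by hand from the failure link algorithm in this chapter.

```python
def kmp_match(text: str, substring: str, fail: list) -> int:
    """Return the 1-based position of the first match, or 0 if none.

    fail[k] is the fail link for the k-th character (1-based); fail[0] is unused.
    """
    sub_loc = 1
    text_loc = 1
    while text_loc <= len(text) and sub_loc <= len(substring):
        if sub_loc == 0 or text[text_loc - 1] == substring[sub_loc - 1]:
            # Success link: take a new text character and move to the next node.
            text_loc += 1
            sub_loc += 1
        else:
            # Fail link: move back in the automaton and reuse the text character.
            sub_loc = fail[sub_loc]
    if sub_loc > len(substring):
        return text_loc - len(substring)
    return 0


fail_links = [None, 0, 1, 1, 2, 3, 1]   # 1-based fail links for "ababcd"
print(kmp_match("abababcd", "ababcd", fail_links))
```

On the text abababcd the first attempt fails at the fifth character, the fail link sends the search back to the third node without moving back in the text, and the match is found at position 3.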

18
Knuth-Morris-Pratt Failure Link Algorithm

fail[1] ← 0
for i ← 2 to length(substring) do
   temp ← fail[i - 1]
   while temp > 0 and substring[temp] ≠
         substring[i - 1] do
      temp ← fail[temp]
   end while
   fail[i] ← temp + 1
end for
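
The failure link construction above might be written in Python as in the sketch below. The function name failure_links is an assumption, and the returned list keeps the slide's 1-based numbering by leaving index 0 unused.

```python
def failure_links(substring: str) -> list:
    """Return the 1-based fail links; index 0 is unused padding."""
    n = len(substring)
    fail = [0] * (n + 1)            # fail[1] = 0 is already in place
    for i in range(2, n + 1):
        temp = fail[i - 1]
        # Follow fail links until the characters agree or we fall off the start.
        while temp > 0 and substring[temp - 1] != substring[i - 2]:
            temp = fail[temp]
        fail[i] = temp + 1
    return fail


print(failure_links("ababcd")[1:])
```

For the string ababcd this prints [0, 1, 1, 2, 3, 1]: a failure on the fifth character (c) falls back to the third node, because the ab just matched is also a prefix of the string.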

19
Failure Link Analysis

The ≠ comparison will be false at most S - 1
times
The fail links are all smaller than their index
temp is decreased each time the ≠ is true
The while loop is not done on the first pass

20
Failure Link Analysis

The variable temp is incremented by 1 for the
next pass because of
The final statement of the for loop
The increment of i
The first statement of the for loop
There are S - 2 next passes, so temp is
incremented S - 2 times
Because fail[1] = 0, temp never becomes negative

21
Failure Link Analysis

temp starts at 0 and is incremented no more than
S - 2 times
Because temp is decreased for each mismatched
comparison, there are at most S - 2 failed
comparisons
There are S - 1 successful comparisons, so there
are at most 2S - 3 comparisons

22
Match Algorithm Analysis

The while loop does one character comparison per
pass
Either textLoc and subLoc are incremented or
subLoc is decremented
Because textLoc starts at 1 and is never greater
than T, it is incremented no more than T times

23
Match Algorithm Analysis

Because subLoc starts at 1, is incremented no more
than T times (always together with textLoc), and
never falls below 0, it is decremented no more
than T times
This means that the then clause is done no more
than T times and the else clause is done no more
than T times, so there are no more than 2T
comparisons
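
The 2T bound can be checked empirically with the sketch below. It is only an illustration: the names failure_links and kmp_count are hypothetical, the counter records character comparisons only (the subLoc = 0 test is not counted), and the input is the all-X text that is the worst case for straightforward matching.

```python
def failure_links(substring: str) -> list:
    """1-based fail links; index 0 is unused padding."""
    fail = [0] * (len(substring) + 1)
    for i in range(2, len(substring) + 1):
        temp = fail[i - 1]
        while temp > 0 and substring[temp - 1] != substring[i - 2]:
            temp = fail[temp]
        fail[i] = temp + 1
    return fail


def kmp_count(text: str, substring: str):
    """Return (1-based match position or 0, number of character comparisons)."""
    fail = failure_links(substring)
    sub_loc = text_loc = 1
    comparisons = 0
    while text_loc <= len(text) and sub_loc <= len(substring):
        if sub_loc != 0:
            comparisons += 1    # a text character is about to be compared
        if sub_loc == 0 or text[text_loc - 1] == substring[sub_loc - 1]:
            text_loc += 1
            sub_loc += 1
        else:
            sub_loc = fail[sub_loc]
    found = text_loc - len(substring) if sub_loc > len(substring) else 0
    return found, comparisons


T = 10
position, comparisons = kmp_count("X" * T, "XXXY")
print(position, comparisons, 2 * T)
```

Here the search fails (position 0) after 17 comparisons, which is within the bound of 2T = 20; straightforward matching needs 28 comparisons on the same input.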

24
Knuth-Morris-Pratt Analysis

The fail link construction takes 2S-3 comparisons
and the matching takes 2T comparisons
The KMP algorithm is O(S + T), whereas the
straightforward algorithm is O(ST)

25
Boyer-Moore Algorithm

If we match from the right of the string, a
mismatch might help us move the string a bigger
distance in the text to skip over other mismatch
locations that can be predicted

26
Boyer-Moore Algorithm

We also have to consider what we have matched, so
we do not make too small a move
If we move the string by one position to line up
the two t characters we will fail quickly, but
that could be predicted

27
Boyer-Moore Algorithm

This algorithm calculates a slide and a jump move
The slide value tells us how much the pattern
should be moved to line up the text character
that did not match
The jump value tells us how much to move the
pattern to line up the end characters that
matched with their occurrence earlier in the
string

28
Boyer-Moore Matching Algorithm

textLoc ← length(pattern)
patternLoc ← length(pattern)
while (textLoc ≤ length(text)) and (patternLoc >
      0) do
   if text[textLoc] = pattern[patternLoc] then
      textLoc ← textLoc - 1
      patternLoc ← patternLoc - 1
   else
      textLoc ← textLoc +
         MAXIMUM(slide[text[textLoc]], jump[patternLoc])
      patternLoc ← length(pattern)
   end if
end while
if patternLoc = 0 then
   return textLoc + 1
else
   return 0
end if

29
Deciding on a Slide Value
30
Boyer-Moore Slide Array Algorithm

for every ch in the character set do
   slide[ch] ← length(pattern)
end for
for i ← 1 to length(pattern) do
   slide[pattern[i]] ← length(pattern) - i
end for
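
As a rough Python sketch of the slide array calculation, a dictionary can stand in for the array indexed by character. The function name slide_array and the explicit alphabet argument are assumptions made for this illustration.

```python
def slide_array(pattern: str, alphabet: str) -> dict:
    """Map each character to the distance from its last occurrence in the
    pattern to the end of the pattern (or the full length if it is absent)."""
    m = len(pattern)
    slide = {ch: m for ch in alphabet}       # default: slide past the pattern
    for i, ch in enumerate(pattern, start=1):
        slide[ch] = m - i                    # later occurrences overwrite earlier ones
    return slide


print(slide_array("abcab", "abcd"))
```

For the pattern abcab this gives a slide of 1 for a, 0 for b, 2 for c, and 5 for d, which does not appear in the pattern at all.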

31
Boyer-Moore Jump Array Algorithm

for i ← 1 to length(pattern) do
   jump[i] ← 2 * length(pattern) - i
end for
test ← length(pattern)
target ← length(pattern) + 1
while test > 0 do
   link[test] ← target
   while target ≤ length(pattern)
         and pattern[test] ≠ pattern[target] do
      jump[target] ← MINIMUM( jump[target],
                              length(pattern) - test )
      target ← link[target]
   end while
   test ← test - 1
   target ← target - 1
end while

32
Boyer-Moore Jump Array Algorithm

for i ← 1 to target do
   jump[i] ← MINIMUM( jump[i],
                      length(pattern) + target - i )
end for
temp ← link[target]
while target < length(pattern) do
   while target ≤ temp do
      jump[target] ← MINIMUM( jump[target],
                              temp - target + length(pattern) )
      target ← target + 1
   end while
   temp ← link[temp]
end while
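
Putting the pieces together, the sketch below follows the slide, jump, and matching pseudocode of this chapter in Python. It is an illustration under assumptions: the function names are hypothetical, lists are padded so that the 1-based indexes of the slides can be used directly, and a dictionary with a default of the pattern length replaces the slide array over the whole character set.

```python
def jump_array(pattern: str) -> list:
    """1-based jump values for a non-empty pattern; index 0 is unused."""
    m = len(pattern)
    jump = [0] + [2 * m - i for i in range(1, m + 1)]
    link = [0] * (m + 2)
    test, target = m, m + 1
    while test > 0:
        link[test] = target
        while target <= m and pattern[test - 1] != pattern[target - 1]:
            jump[target] = min(jump[target], m - test)
            target = link[target]
        test -= 1
        target -= 1
    for i in range(1, target + 1):
        jump[i] = min(jump[i], m + target - i)
    temp = link[target]
    while target < m:
        while target <= temp:
            jump[target] = min(jump[target], temp - target + m)
            target += 1
        temp = link[temp]
    return jump


def boyer_moore_match(text: str, pattern: str) -> int:
    """Return the 1-based position of the first match, or 0 if none."""
    m = len(pattern)
    if m == 0:
        return 1
    slide = {ch: m - i for i, ch in enumerate(pattern, start=1)}
    jump = jump_array(pattern)
    text_loc = pattern_loc = m
    while text_loc <= len(text) and pattern_loc > 0:
        if text[text_loc - 1] == pattern[pattern_loc - 1]:
            text_loc -= 1           # match: keep moving right to left
            pattern_loc -= 1
        else:
            # Mismatch: take the larger of the slide and jump moves.
            text_loc += max(slide.get(text[text_loc - 1], m), jump[pattern_loc])
            pattern_loc = m
    return text_loc + 1 if pattern_loc == 0 else 0


print(jump_array("abab")[1:])
print(boyer_moore_match("abababcd", "ababcd"))
```

For the pattern abab the jump values come out as [5, 4, 5, 1], and searching abababcd for ababcd needs only one mismatch (a slide of 2) before the match at position 3 is confirmed from right to left.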

33
Jump Array Calculation Example
34
Boyer-Moore Analysis

The slide array calculation does O(A + P)
assignments, where A is the size of the character
set and P is the pattern length, but no comparisons
The jump array calculation at worst compares all
of the pattern characters with those appearing
later for O(P^2) comparisons

35
Boyer-Moore Analysis

Studies have shown that with natural language
text, and a pattern of six or more characters,
there are at most 0.4T comparisons
As the length of the pattern increases, the
algorithm has a lower value of about 0.25T
comparisons

36
Approximate String Matching

Spelling checkers will make suggestions of close
words that could have been intended for
misspelled words
This involves finding words that are close to the
misspelled word
We will talk about approximate string matching in
terms of a string and text as in the other
algorithms

37
Common errors

The string could have characters that are missing
from the text
The text could have characters that are missing
from the string
There could be a character in the string or the
text that needs to be changed

38
Errors Example

Matching the string ad with the text read we
could have
2 mismatches in the first position or a missing
re from the string
2 mismatches in the second position or just a
missing e from the string

39
The Algorithm

This can be complex because it might be that a
better match occurs if we look at other
possibilities
In the example above, for the second position
there were 2 mismatches of characters, but we get
a better result if we add just one character to
the string

40
The Algorithm

To keep the algorithm a little simpler, we use a
larger structure to keep track of what we have
found so far
In this case, we will keep a two-dimensional
array with the best matches found so far
This array will have a row for each character of
the string and a column for each character of the
text

41
The Array

For each location of the array diffs[i, j], we
will choose the minimum of
If string[i] = text[j], diffs[i - 1, j - 1]; otherwise
diffs[i - 1, j - 1] + 1
diffs[i - 1, j] + 1
diffs[i, j - 1] + 1

42
Example

If we compare the string trim with the text
try the trumpet we get

43
Analysis

We do not really need the entire array, but just
need two columns - the current one and the
previous one on which it is based
We compare each string character with each text
character and so do ST comparisons
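
The two-column idea can be sketched in Python as follows. The boundary values are assumptions not stated on the slides: row 0 is taken to be all zeros (a match may begin anywhere in the text) and column 0 is taken to be diffs[i, 0] = i (matching i string characters against no text costs i differences). The function name approximate_match is also hypothetical.

```python
def approximate_match(string: str, text: str) -> int:
    """Return the smallest number of differences between string and any
    stretch of text, keeping only the previous and current columns."""
    prev = list(range(len(string) + 1))    # column 0: diffs[i, 0] = i (assumed)
    best = prev[-1]
    for t_ch in text:
        curr = [0]                         # row 0: diffs[0, j] = 0 (assumed)
        for i, s_ch in enumerate(string, start=1):
            diagonal = prev[i - 1] + (0 if s_ch == t_ch else 1)
            curr.append(min(diagonal,         # diffs[i - 1, j - 1], plus 1 on mismatch
                            curr[i - 1] + 1,  # diffs[i - 1, j] + 1
                            prev[i] + 1))     # diffs[i, j - 1] + 1
        best = min(best, curr[-1])         # last row: whole string accounted for
        prev = curr
    return best


print(approximate_match("trim", "try the trumpet"))
```

For the string trim and the text try the trumpet, the best result is 1 difference, obtained by lining trim up with trum.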
