CSE 024: Design - PowerPoint PPT Presentation

About This Presentation

Title:

CSE 024: Design

Description:

CSE 024: Design & Analysis of Algorithms Chapter 8: String Searching Sedgewick Chp:19 – PowerPoint PPT presentation

Slides: 33

Provided by: Yusu73

Transcript and Presenter's Notes

1
CSE 024: Design & Analysis of Algorithms

Chapter 8: String Searching
Sedgewick Chp. 19

2
Course Content

Introduction, Algorithmic Notation and Flowcharts
(Brassard & Bratley, Chapter 3)
Efficiency of Algorithms (Brassard & Bratley,
Chapter 2)
Basic Data Structures (Brassard & Bratley,
Chapter 5)
Sorting (Weiss, Chp. 7)
Searching (Brassard & Bratley, Chapter 9)
Graph Algorithms (Weiss, Chp. 9)
Randomized Algorithms (Weiss, Chp. 10)
String Searching (Sedgewick, Chp. 19)
NP Completeness (Sedgewick, Chp. 40)

3
Lecture Content

String Applications
String Operations
What is Pattern Matching?
Brute-Force Algorithm
Knuth-Morris-Pratt Algorithm
Boyer-Moore Algorithm

4
String Applications

Processing text means dealing with character strings
Strings come from a wide variety of sources
DNA applications
P = "CGTAAACTGCTTTAATCAAACGC"
News headline
R = "U.S. Men Win Soccer World Cup!"
URL of a Web site
S = "http://www.wiley.com/college/goodrich"
String operations
Breaking large strings into smaller strings
Pattern matching

5
String Concepts

Assume S is a string of size m.
A substring S[i .. j] of S is the string fragment
between indexes i and j.
A prefix of S is a substring S[0 .. i]
A suffix of S is a substring S[i .. m-1]
i is any index between 0 and m-1

6
Examples
S = "andrew" (indexes 0 to 5)

Substring S[1..3] = "ndr"
All possible prefixes of S
"andrew", "andre", "andr", "and", "an", "a"
All possible suffixes of S
"andrew", "ndrew", "drew", "rew", "ew", "w"

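The definitions above can be tried out directly. This is only a small sketch in Python; the helper names prefixes and suffixes are made up for illustration, and note that Python slices exclude the end index, so the slides' inclusive S[1..3] becomes S[1:4].

```python
def prefixes(s):
    """All non-empty prefixes of s, longest first (hypothetical helper)."""
    return [s[:i] for i in range(len(s), 0, -1)]


def suffixes(s):
    """All non-empty suffixes of s, longest first (hypothetical helper)."""
    return [s[i:] for i in range(len(s))]


S = "andrew"
substring = S[1:4]            # inclusive S[1..3] in the slides' notation
all_prefixes = prefixes(S)
all_suffixes = suffixes(S)
```
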
7
What is Pattern Matching?

Definition
Given a text string T and a pattern string P,
find the pattern inside the text
T = "the rain in spain stays mainly on the plain"
P = "n th"
Applications
text editors,
Web search engines
e.g. Google
image analysis
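
As a quick illustration of the definition, Python's built-in str.find already solves this problem for the example above. The variable name position is just an illustrative choice; str.find gives the index of the first occurrence, or -1 if there is none.

```python
T = "the rain in spain stays mainly on the plain"
P = "n th"

# Index (0-based) of the first place where P occurs inside T, or -1.
position = T.find(P)
print(position, repr(T[position:position + len(P)]))
```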

8
Brute-Force Algorithm

Check each position in the text T
to see if the pattern P starts in that position

[Figure: P = "rew" is compared against T = "andrew" at each successive position.]
P moves 1 char at a time through T

9
Brute-Force Algorithm

The obvious method for pattern matching
Just check, for each possible position in the
text
at which the pattern could match
Search for the first occurrence of
a pattern p[1..M] in a text string a[1..N]

Keep one pointer (i) into the text,
another pointer (j) into the pattern.
As long as they point to matching characters,
both pointers are incremented.
If the end of the pattern is reached (j > M),
then a match has been found.
If i and j point to mismatching characters,
then j is reset to point to the beginning of the
pattern and
i is reset to correspond to moving the pattern to
the right one position for matching against the
text.
If the end of the text is reached (i > N),
then there is no match.
If the pattern does not occur in the text,
the value N+1 is returned.

int bruteSearch(int M, int N)
  int i, j
  i = 1; j = 1
  repeat
    if a[i] = p[j] then { i++; j++ }
    else { i = i-j+2; j = 1 }
  until (j > M) or (i > N)
  if j > M then return i-M else return i
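
The two-pointer procedure described above can be sketched in Python as follows. This is a 0-based translation, not the slide's exact program: the name brute_search is illustrative, and returning -1 when there is no match is an assumption made to follow Python convention (the slide's 1-based version returns N+1 instead).

```python
def brute_search(text, pattern):
    """Return the 0-based index of the first match, or -1 (assumed convention)."""
    n, m = len(text), len(pattern)
    i = j = 0                      # i indexes the text, j indexes the pattern
    while j < m and i < n:
        if text[i] == pattern[j]:
            i += 1                 # characters agree: advance both pointers
            j += 1
        else:
            i = i - j + 1          # back up to one past where this attempt began
            j = 0                  # restart at the beginning of the pattern
    return i - m if j == m else -1


print(brute_search("andrew", "rew"))
```
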
10
Analysis of Algorithm

Brute force pattern matching
runs in time O(mn) in the worst case.
But most searches of ordinary text
take O(m+n), which is very quick.
The brute force algorithm is fast
when the alphabet of the text is large
e.g. A..Z, a..z, 1..9, etc.
It is slower
when the alphabet is small
e.g. 0, 1 (as in binary files, image files, etc.)

11
Algorithm Performance

In a text-editing application,
the inner loop of this program is seldom
iterated, so
the running time is very nearly proportional
to the number of text characters examined: O(n)
For example, look for the pattern STING in
A STRING SEARCHING EXAMPLE CONSISTING OF SIMPLE
TEXT
the statement j = j+1 is executed only four times
before the match:
once for each S,
but twice for the first ST

12
Algorithm Performance Binary Strings

Example
pattern is 00000001
the text string is
00000000000000000000000000000000000000000000000000
001
then j is incremented
7 × 45 (315) times before the match

search for 10010111 in the binary string
100111010010010010010111000111
partial matches found along the way:
1001, 1, 10, 10010, 10010, 10010,
followed by the full match 10010111
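
The counts quoted for the STING search and for the binary search above can be checked with a small instrumented brute-force search. This is only a sketch: the function name and the "wasted" counter are illustrative, and the binary text is assumed to be 52 zeros followed by a single 1, which is what the 7 × 45 figure implies.

```python
def wasted_matches(text, pattern):
    """Brute-force search that also counts pattern characters matched
    during failed attempts. Returns (match_index, wasted); match_index
    is -1 when the pattern does not occur."""
    wasted = 0
    for start in range(len(text) - len(pattern) + 1):
        j = 0
        while j < len(pattern) and text[start + j] == pattern[j]:
            j += 1
        if j == len(pattern):
            return start, wasted
        wasted += j                # j characters matched before the mismatch
    return -1, wasted


english = "A STRING SEARCHING EXAMPLE CONSISTING OF SIMPLE TEXT"
binary = "0" * 52 + "1"            # assumed length, see the note above

print(wasted_matches(english, "STING"))
print(wasted_matches(binary, "00000001"))
```

The large-alphabet text wastes only a handful of comparisons, while the binary text wastes seven on almost every attempt.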

13
Knuth-Morris-Pratt Algorithm

The basic idea is
when a mismatch is detected,
our false start consists of characters
we know in advance
since they are in the pattern.
We can take advantage of this information
instead of backing up the i pointer
over all those known characters

14
The Knuth-Morris-Pratt (KMP) algorithm

looks for the pattern in the text
in a left-to-right order
like the brute force algorithm
But it shifts the pattern more intelligently
If a mismatch occurs
between the text and pattern P at P[j],
what is the most we can shift the pattern
to avoid wasteful comparisons?
Answer: the largest prefix of P[0 .. j-1]
that is a suffix of P[1 .. j-1]
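
The shift rule above can be computed naively, straight from its definition. The sketch below is a deliberately simple (quadratic) illustration, not the efficient table construction KMP actually uses; the function name longest_border and the pattern "abaaba" are hypothetical choices made for the example.

```python
def longest_border(s):
    """Length of the longest proper prefix of s that is also a suffix of s."""
    for k in range(len(s) - 1, 0, -1):
        if s[:k] == s[-k:]:
            return k
    return 0


P = "abaaba"                       # hypothetical pattern
j = 5                              # mismatch at P[5]; P[0..4] = "abaab" matched
new_j = longest_border(P[:j])      # matching resumes at this pattern index
print(new_j)
```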

15
KMP Algorithm is Generalization

Assume the first character in the pattern
doesn't appear again in the pattern
say the pattern is 10000000
Suppose we have a false start j characters long
at some position in the text.
When the mismatch is detected,
j characters have already matched,
No need to back up the text pointer i
none of the previous j-1 characters in the text
can match the first character in the pattern.
This change could be implemented by replacing
i = i-j+2 in the program above by i++
The practical effect of this change is limited
such a specialized pattern is not particularly
likely to occur,
but the idea is worth thinking about
The Knuth-Morris-Pratt algorithm is a generalization:
it is always possible to arrange things
so that the i pointer is never decremented.

Fully skipping past the pattern
on detecting a mismatch won't work
when the pattern could match itself
at the point of the mismatch.
For example,
searching for 10100111 in 1010100111
The first mismatch is at the fifth character,
but it is better to back up
to the third character to continue the search,
since otherwise we would miss the match.
But we can figure out ahead of time what to do
because it depends only on the pattern
17
Example
[Figure: text T and pattern P with a mismatch at j = 5; after the shift, matching resumes at jnew = 2.]
18
Knuth-Morris-Pratt Algorithm

The pseudocode and explanation

int kmpsearch(int M, int N)
  int i, j
  i = 1; j = 1
  repeat
    if (j = 0) or (a[i] = p[j]) then { i++; j++ }
    else j = next[j]
  until (j > M) or (i > N)
  if j > M then return i-M else return i

When i and j point to mismatching characters
(testing for a pattern match
beginning at position i-j+1 in the text string),
then the next possible position for a pattern
match
is beginning at position i-next[j]+1.
The first next[j]-1 characters at that position
match
the first next[j]-1 characters of the pattern,
so there's no need to back up the i pointer that
far:
we can simply leave the i pointer unchanged
and set the j pointer to next[j]

19
Knuth-Morris-Pratt Algorithm

The pseudocode and explanation

int initnext(int M, int N)
  int i, j
  i = 1; j = 0; next[1] = 0
  repeat
    if (j = 0) or (p[i] = p[j]) then { i++; j++; next[i] = j }
    else j = next[j]
  until i > M

Just after i and j are incremented,
the first j-1 characters of the pattern match
the characters in positions p[i-j+1 .. i-1],
the last j-1 characters in the first i-1
characters of the pattern.
This is the largest j with this property,
otherwise a possible match of the pattern with
itself would be missed.
Thus, j is exactly the value to be assigned to
next[i].
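
Putting the table construction and the search together, a minimal Python sketch of KMP might look like the following. It is a 0-based translation, so the sentinel next[1] = 0 of the 1-based pseudocode becomes nxt[0] = -1 here; the names build_next and kmp_search, and the -1 return value for "no match", are assumptions made for this illustration.

```python
def build_next(pattern):
    """nxt[j] (for j >= 1) is the length of the longest proper prefix of
    pattern[:j] that is also a suffix of it; nxt[0] = -1 is a sentinel."""
    m = len(pattern)
    nxt = [0] * (m + 1)
    nxt[0] = -1
    i, j = 0, -1
    while i < m:
        if j == -1 or pattern[i] == pattern[j]:
            i += 1
            j += 1
            nxt[i] = j             # the pattern matched against itself
        else:
            j = nxt[j]             # fall back to a shorter self-match
    return nxt


def kmp_search(text, pattern):
    """Return the 0-based index of the first match, or -1 (assumed convention)."""
    n, m = len(text), len(pattern)
    nxt = build_next(pattern)
    i = j = 0
    while j < m and i < n:
        if j == -1 or text[i] == pattern[j]:
            i += 1
            j += 1
        else:
            j = nxt[j]             # i is never decremented
    return i - m if j == m else -1


print(kmp_search("1010100111", "10100111"))
```

On this input the first mismatch occurs at the fifth pattern character and the search resumes at the third, without moving i backwards.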

20
Advantages

KMP runs in optimal time O(m+n)
very fast
The algorithm never needs
to move backwards in the input text, T
this makes the algorithm good
for processing very large files
read in from external devices
or through a network stream

21
Disadvantages

KMP doesn't work so well
as the size of the alphabet increases
more chance for a mismatch
mismatches tend to occur
early in the pattern
but KMP is faster
when the mismatches occur later

22
Boyer-Moore Algorithm

a significantly faster string searching method
scan the pattern from right to left
when trying to match it against the text.
When searching for our sample pattern 10100111
if we find matches
on the eighth, seventh, and sixth character
but not on the fifth,
then we can immediately slide
the pattern seven positions to the right,
and check the fifteenth character next.

23
Heuristics Used

based on two heuristics
1. The looking-glass technique
find P in T by moving backwards
through P, starting at its end
2. The character-jump technique
when a mismatch occurs at T[i] = x,
the character in the pattern, P[j],
is not the same as T[i]

[Figure: T holds x at index i; P holds a different character at index j.]
There are 3 possible cases, tried in order.
24
Case 1

If P contains x somewhere,
then try to shift P right
to align the last occurrence of x in P with T[i],
and move i and j right, so j at end

[Figure: T and P before and after the shift; the last occurrence of x in P ends up under T[i], and the new positions are marked inew and jnew.]
25
Case 2

If P contains x somewhere,
but a shift right
to the last occurrence is not possible,
then shift P right by 1 character to T[i+1],
and move i and j right, so j at end

[Figure: T and P before and after the shift; x is after the j position in P, so P moves right by one, and the new positions are marked inew and jnew.]
26
Case 3

If cases 1 and 2 do not apply,
then shift P to align P[0] with T[i+1],
and move i and j right, so j at end

[Figure: T and P before and after the shift; there is no x in P, so P[0] ends up under T[i+1], and the new positions are marked inew and jnew.]
27
Boyer-Moore Algorithm

Pseudocode and Explanation

int mischarsearch(int M, int N)
  int i, j
  i = M; j = M
  repeat
    if a[i] = p[j] then { i--; j-- }
    else {
      if M-j+1 > skip[index(a[i])] then i = i+M-j+1
      else i = i+skip[index(a[i])]
      j = M
    }
  until (j < 1) or (i > N)
  return i+1

It simply improves a brute-force right-to-left
pattern scan by using an array skip
that tells, for each character in the alphabet,
how far to skip if that character appears in the
text and causes a mismatch
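
A minimal Python sketch of the right-to-left scan with the character-jump heuristic is given below. It follows the three cases described earlier by using a table of last occurrences, which is a variant of the skip array rather than a line-by-line translation of the pseudocode; the names last_occurrence and boyer_moore_search, and the -1 return value for "no match", are assumptions for this illustration.

```python
def last_occurrence(pattern):
    """Map each pattern character to the index of its last occurrence."""
    return {ch: idx for idx, ch in enumerate(pattern)}


def boyer_moore_search(text, pattern):
    """Return the 0-based index of the first match, or -1 (assumed convention)."""
    n, m = len(text), len(pattern)
    if m == 0:
        return 0
    last = last_occurrence(pattern)
    i = j = m - 1                      # start comparing at the pattern's end
    while i < n:
        if text[i] == pattern[j]:
            if j == 0:
                return i               # the whole pattern matched
            i -= 1                     # looking-glass: keep moving backwards
            j -= 1
        else:
            l = last.get(text[i], -1)  # -1 when x does not occur in P (case 3)
            # l < j: align last x in P with T[i] (case 1);
            # l > j: that shift is impossible, move P right by 1 (case 2).
            i += m - min(j, l + 1)
            j = m - 1
    return -1


print(boyer_moore_search("the rain in spain stays mainly on the plain", "n th"))
```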

28
Boyer-Moore Example (1)
29
Boyer-Moore Example (2)
30
Boyer-Moore Algorithm Analysis