ICS220 Data Structures and Algorithms Analysis - PowerPoint PPT Presentation


Title: ICS220 Data Structures and Algorithms Analysis


1
ICS220 Data Structures and Algorithms Analysis
  • Lecture 14
  • Dr. Ken Cosh

2
Review
  • Memory Management
  • Memory Allocation
  • Garbage Collection

3
This Week
  • String Matching
  • String matching is a common task for many
    computer users
  • Internet Searches
  • String manipulation in word processing
  • Advanced DNA sequence matching
  • Therefore effective pattern matching algorithms
    are essential.

4
Brute Force
  • Our first simple string matching algorithm is
    brute force.
  • We check the first character; if it matches, we
    check the second character, and so on. If it does
    not match, we step forward one character in the
    text and start again from the beginning of the
    pattern.
  • Any useful information that could be used in
    subsequent searches is then lost.

5
Brute Force
  • bruteForceStringMatching(pattern P, text T)
  •   i = 0
  •   while i <= |T| - |P|
  •     j = 0
  •     while j < |P| and T[i] == P[j]
  •       i++
  •       j++
  •     if j == |P|
  •       return match at i - |P|
  •     i = i - j + 1
  •   return no match
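
The pseudocode above can be sketched in Python roughly as follows. This is a minimal illustration, not the lecture's own code: the function name and the choice of returning -1 for "no match" are assumptions made for the example.

```python
def brute_force_string_matching(pattern, text):
    """Return the index of the first match of pattern in text, or -1."""
    i = 0
    while i <= len(text) - len(pattern):
        j = 0
        # Test j first so we never read past the end of the pattern.
        while j < len(pattern) and text[i] == pattern[j]:
            i += 1
            j += 1
        if j == len(pattern):
            return i - len(pattern)  # match starts |P| characters back
        # Undo the j steps taken, then move the start forward by one.
        i = i - j + 1
    return -1  # assumed convention for "no match"


print(brute_force_string_matching("babab", "ababcdababababababad"))  # 7
```

The index is 0-based, so a result of 7 means the match starts at the 8th character of the text.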

6
Brute Force
  • T = ababcdababababababad, P = babab
  •    ababcdababababababad
  • 1  babab
  • 2   babab
  • 3    babab
  • 4     babab
  • 5      babab
  • 6       babab
  • 7        babab
  • 8         babab
  • In this case the match is found on the 8th try.
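
The trace above can be reproduced with a short script. This is a sketch only; `brute_force_tries` is a hypothetical helper name, and it compares whole slices rather than single characters to keep the example short.

```python
def brute_force_tries(pattern, text):
    """Print each alignment tried; return (match index, number of tries)."""
    tries = 0
    print("   " + text)
    for start in range(len(text) - len(pattern) + 1):
        tries += 1
        # Indent the pattern so it sits under the text position being tried.
        print(f"{tries:<3}" + " " * start + pattern)
        if text[start:start + len(pattern)] == pattern:
            return start, tries
    return -1, tries


index, tries = brute_force_tries("babab", "ababcdababababababad")
print(f"match at index {index} on try {tries}")  # match at index 7 on try 8
```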

7
Brute Force Complexity
  • The best case for the algorithm is that the
    string is matched straight away (consider
    searching this sentence for 'The'). Here |P|
    comparisons are required: O(|P|).
  • The worst case is if the string isn't found, but
    for each character in T we are required to
    make |P| comparisons; here the worst case is
    O(|T||P|).
  • The average case depends on the size and
    frequencies of the character set.

8
Brute Force Complexity
  • Notice the nested while loops in the Brute Force
    algorithm
  • while i <= |T| - |P|
  • while j < |P| and T[i] == P[j]
  • Shortly we'll investigate how we can reduce the
    number of iterations of each loop.
  • For the worst case to occur we could search for a
    string such as aaaaaaaaaaaab within a string
    aaaaaaaaaaaaaaaaaaaaaaaaaaa etc.
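
To make the best and worst cases concrete, the sketch below counts character comparisons made by a brute force search. The function name is hypothetical, and the string lengths (a pattern of 12 a's plus b, a text of 27 a's) are assumptions chosen to resemble the example above.

```python
def count_comparisons(pattern, text):
    """Brute force search returning (match index or -1, comparisons made)."""
    comparisons = 0
    for start in range(len(text) - len(pattern) + 1):
        j = 0
        while j < len(pattern):
            comparisons += 1
            if text[start + j] != pattern[j]:
                break
            j += 1
        if j == len(pattern):
            return start, comparisons
    return -1, comparisons


# Best case: the pattern sits at the very start, so |P| comparisons suffice.
print(count_comparisons("The", "The best case for the algorithm"))  # (0, 3)

# Worst case: every alignment matches 12 a's before failing on the b.
# 15 alignments x 13 comparisons = 195.
print(count_comparisons("a" * 12 + "b", "a" * 27))  # (-1, 195)
```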

9
Improving Brute Force
  • A key problem with brute force is that each time
    we abort the comparison we have to start from the
    beginning of the pattern again.
  • We could reduce the work the algorithm does by
    skipping comparisons that cannot lead to a match.
  • Hancart's algorithm allows the search to step
    forward 2 characters if a match won't be found.

10
Hancart
  • Hancart's algorithm refines brute force in a
    couple of ways.
  • First, the first two characters of the pattern
    are compared with each other.
  • Either they are the same, or they are different.
  • Second, at each position comparisons begin with
    the 2nd character of the pattern against the 2nd
    character of the current window of the text.

11
Hancart
  • Hancart's revision works by allowing us to skip
    forward 2 characters in situations where there
    can't be a match.
  • Notice that the situations where 2 steps forward
    are allowed depend on whether the first 2
    characters of the pattern are the same.
  • We can refine the search further by extending
    this observation that the number of steps
    forward allowed depends on the contents of the
    pattern.
  • The Knuth Morris Pratt algorithm observes that
    the pattern contains enough information to
    determine where the next match could begin.

12
Hancart
  • Hancart's algorithm reduces the number of
    iterations through the outer loop by sometimes
    allowing the increment to be
  • i = i - j + 2
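
As an illustration of the idea in the Hancart slides, here is a sketch following the commonly published "Not So Naive" formulation of Hancart's algorithm. It is not taken from the lecture: the function and variable names are hypothetical, it tracks the start of the current window directly instead of using i and j, and it assumes the pattern has at least 2 characters.

```python
def hancart_search(pattern, text):
    """Return the start index of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m < 2:
        raise ValueError("this sketch assumes a pattern of at least 2 characters")

    # Preprocessing: compare the first two pattern characters with each other.
    if pattern[0] == pattern[1]:
        shift_on_mismatch, shift_on_match = 2, 1
    else:
        shift_on_mismatch, shift_on_match = 1, 2

    matches = []
    start = 0
    while start <= n - m:
        # Comparisons begin with the 2nd character of the pattern.
        if pattern[1] != text[start + 1]:
            start += shift_on_mismatch
        else:
            rest_matches = text[start + 2:start + m] == pattern[2:]
            if rest_matches and text[start] == pattern[0]:
                matches.append(start)
            start += shift_on_match
    return matches


print(hancart_search("babab", "ababcdababababababad"))  # [7, 9, 11, 13]
```

The 2-step shift is safe because in both cases the text character at start + 1 is known to differ from the first pattern character, so no match can begin there.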

13
Knuth Morris Pratt
  • The Knuth Morris Pratt algorithm begins by
    finding, for each position in the substring, the
    longest suffix ending there which is equal to a
    prefix of the same substring.
  • Substring A,B,C,D,A,B,D
  • Longest Suffix 0,0,0,0,1,2,0
  • i.e. when the 2nd A comes it is both a suffix and
    a prefix for the substring. The following B
    forms AB a 2 character prefix and suffix.
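  • Now for each iteration of the outer loop i can be
    increased by j - x (or by 1 if nothing matched),
    where j is the number of characters matched
    before the mismatch and x is the longest suffix
    value for those j characters.
  • i.e. if a mismatch is found when comparing the
    second A, then A,B,C,D have matched, so j = 4 and
    x = 0, and i can be increased by 4 (j - 0).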
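
The longest suffix table can be computed in a single pass over the substring. The sketch below uses the standard construction for this kind of table; the function name is an assumption made for the example.

```python
def longest_suffix_table(pattern):
    """table[i] = length of the longest proper suffix of pattern[:i + 1]
    that is also a prefix of pattern."""
    table = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # Fall back to shorter candidates until one can be extended.
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


print(longest_suffix_table("ABCDABD"))  # [0, 0, 0, 0, 1, 2, 0]
```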

14
Test
  • Try searching for this substring,
  • A,B,C,D,A,B,D
  • within this string
  • ABCDABCABCDABDE
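
After working through the test by hand, the result can be checked against a small Knuth Morris Pratt sketch like the one below. The function names and the -1 "no match" convention are assumptions made for the example.

```python
def longest_suffix_table(pattern):
    """table[i] = longest proper suffix of pattern[:i + 1] that is a prefix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_search(pattern, text):
    """Return the index of the first match of pattern in text, or -1."""
    if not pattern:
        return 0
    table = longest_suffix_table(pattern)
    j = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        # On a mismatch, reuse the table instead of re-reading the text.
        while j > 0 and ch != pattern[j]:
            j = table[j - 1]
        if ch == pattern[j]:
            j += 1
        if j == len(pattern):
            return i - j + 1
    return -1


print(kmp_search("ABCDABD", "ABCDABCABCDABDE"))  # 7
```

Note that the loop over the text never moves backwards; only j is reset after a mismatch.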

15
Knuth Morris Pratt complexity
  • Knuth Morris Pratt removes some of the complexity
    of the brute force algorithm by preprocessing the
    substring being searched for (to create the
    suffix table).
  • Now as we don't need to recheck characters in the
    text it is O(|T|) for the outer loop.
  • Preprocessing can be performed quickly, in O(|P|)
    time, leaving a total complexity of O(|T| + |P|)
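
The linear bound can be checked empirically by counting comparisons during the search. The sketch below is illustrative only: the function names are hypothetical, and the input (12 a's plus b searched for in 27 a's) is an assumed instance of the brute force worst case.

```python
def longest_suffix_table(pattern):
    """table[i] = longest proper suffix of pattern[:i + 1] that is a prefix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table


def kmp_count_comparisons(pattern, text):
    """Return (match index or -1, text/pattern comparisons made).
    Assumes a non-empty pattern."""
    table = longest_suffix_table(pattern)
    comparisons = 0
    j = 0
    for i, ch in enumerate(text):
        while True:
            comparisons += 1
            if ch == pattern[j]:
                j += 1
                break
            if j == 0:
                break
            j = table[j - 1]  # fall back without moving in the text
        if j == len(pattern):
            return i - j + 1, comparisons
    return -1, comparisons


text = "a" * 27
index, comparisons = kmp_count_comparisons("a" * 12 + "b", text)
# Each comparison either advances in the text or shrinks j, so the
# search phase never needs more than 2 * |T| comparisons.
print(index, comparisons <= 2 * len(text))  # -1 True
```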