1
CS5263 Bioinformatics
  • Lecture 9-10
  • Exact String Matching Algorithms

2
Overview
  • Pair-wise alignment
  • Multiple alignment
  • Commonality: allow errors when comparing strings
  • Two sub-problems:
  • How to score an alignment with errors
  • How to find an alignment with the best score
  • Today: exact string matching
  • Do not allow any errors
  • Efficiency becomes the sole consideration

3
Why exact string matching?
  • The most fundamental string comparison problem
  • Word processors
  • Information retrieval
  • DNA sequence retrieval
  • Many many more
  • Is it still an interesting research problem?
  • Yes, if database is large
  • Exact string matching is often the core of more
    complex string comparison algorithms
  • E.g., BLAST
  • Often repeatedly called by other methods
  • Usually the most time consuming part
  • A small improvement can considerably improve the overall efficiency

4
Definitions
  • Text: a longer string T (length m)
  • Pattern: a shorter string P (length n)
  • Exact matching: find all occurrences of P in T

[Diagram: T of length m, with P of length n aligned beneath it]
5
The naïve algorithm
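
The naïve algorithm slides P along T one position at a time and, at each position, compares characters left to right. A minimal Python sketch (the function name and 0-based positions are my own, not from the slides):

```python
def naive_match(T, P):
    """Report every 0-based position i where P occurs in T."""
    m, n = len(T), len(P)
    hits = []
    for i in range(m - n + 1):                 # candidate start position in T
        j = 0
        while j < n and T[i + j] == P[j]:      # compare left to right
            j += 1
        if j == n:                             # all n chars matched
            hits.append(i)
    return hits

print(naive_match("cabababd", "aba"))          # [1, 3]
```
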
6
Time complexity
  • Worst case: O(mn)
  • Best case: O(m)
  • e.g., T = aaaaaaaaaaaaaa vs. P = baaaaaaa
  • Average case?
  • Alphabet: A, C, G, T
  • Assume both P and T are random, with equal probability for each char
  • On average, how many chars do you need to compare before giving up at each position?

7
Average case time complexity
  • P(mismatch at 1st position) = 3/4
  • P(mismatch at 2nd position) = (1/4)(3/4)
  • P(mismatch at 3rd position) = (1/4)^2 (3/4)
  • P(mismatch at kth position) = (1/4)^(k-1) (3/4)
  • Expected number of comparisons per position, with p = 1/4:
  • Σ_k k (1−p) p^(k−1) = ((1−p)/p) Σ_k k p^k = 1/(1−p) = 4/3
  • Average complexity: 4m/3
  • Not as bad as you thought it might be
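
As an illustrative sanity check (not from the slides) of the 4/3 figure, a tiny Monte Carlo estimate of the expected comparisons per alignment, assuming independent, uniformly random characters over a 4-letter alphabet:

```python
import random

def avg_comparisons(trials=200_000, n=8, alphabet="ACGT"):
    """Average number of char comparisons at one alignment before giving up."""
    total = 0
    for _ in range(trials):
        k = 0
        while k < n and random.choice(alphabet) == random.choice(alphabet):
            k += 1                      # another matching char, keep comparing
        total += min(k + 1, n)          # comparisons made (capped by pattern length)
    return total / trials

print(avg_comparisons())                # ≈ 1.33, i.e., close to 4/3
```
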

8
Biological sequences are not random
  • T = aaaaaaaaaaaaaaaaaaaaaaaaa
  • P = aaaab
  • Plus, the 4m/3 average case is still bad for long genomic sequences!
  • Especially if this has to be done again and again
  • Smarter algorithms:
  • O(m + n) in the worst case
  • sub-linear in practice

9
How to speed up?
  • Pre-process T or P
  • Why can pre-processing save us time?
  • It uncovers the structure of T or P
  • It determines when we can skip ahead without missing anything
  • It determines when we can infer the result of character comparisons without doing them.

T: ACGTAXACXTAXACGXAX
P: ACGTACA
10
Cost for exact string matching
  • Total cost = cost(preprocessing) + cost(comparison) + cost(output)
  • preprocessing: overhead
  • comparison: minimize this
  • output: constant
  • Hope: gain > overhead
11
String matching scenarios
  • One T and one P
  • Search for a word in a document
  • One T and many P, all at once
  • Search for a set of words in a document
  • Spell checking (fixed P)
  • One fixed T, many P
  • Search a completed genome for short sequences
  • Two (or many) Ts, searched for common patterns
  • Q: Which one should be pre-processed?
  • A: Always pre-process the shorter seq, or the one that is used repeatedly

12
Pre-processing algs
  • Pattern preprocessing
  • Karp-Rabin algorithm
  • Small alphabet and short patterns
  • Knuth-Morris-Pratt algorithm (KMP)
  • Aho-Corasick algorithm
  • Multiple patterns
  • Boyer-Moore algorithm
  • The choice for most cases
  • Typically sub-linear time
  • Text preprocessing
  • Suffix tree
  • Very useful for many purposes

13
Karp Rabin Algorithm
  • Let's say we are dealing with binary numbers
  • Text: 01010001011001010101001
  • Pattern: 101100
  • Convert the pattern to an integer:
  • 101100 = 2^5 + 2^3 + 2^2 = 44

14
Karp Rabin algorithm
  • Text: 10111011001010101001
  • Pattern: 101100 = 44 (decimal)
  • First window: 101110 = 2^5 + 2^3 + 2^2 + 2^1 = 46
  • Slide the window one bit at a time; new value = 2 × old − 64 × (dropped bit) + (new bit):
  • 011101: 46 × 2 − 64 × 1 + 1 = 29
  • 111011: 29 × 2 − 64 × 0 + 1 = 59
  • 110110: 59 × 2 − 64 × 1 + 0 = 54
  • 101100: 54 × 2 − 64 × 1 + 0 = 44 → match

15
Karp Rabin algorithm
  • What if the pattern is too long to fit into a single integer?
  • Pattern 101100, but our machine only has 5 bits
  • Basic idea: hashing. 44 mod 13 = 5
  • 101110 = 46 (mod 13 = 7)
  • 011101: 46 × 2 − 64 × 1 + 1 = 29 (mod 13 = 3)
  • 111011: 29 × 2 − 64 × 0 + 1 = 59 (mod 13 = 7)
  • 110110: 59 × 2 − 64 × 1 + 0 = 54 (mod 13 = 2)
  • 101100: 54 × 2 − 64 × 1 + 0 = 44 (mod 13 = 5) → hash match; verify by direct comparison
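
A hedged Python sketch of the whole Karp-Rabin idea for binary strings: the rolling update above, the hash mod q, and an explicit verification step to rule out hash collisions (the function and parameter names are mine):

```python
def karp_rabin(T, P, q=13):
    """Karp-Rabin over '0'/'1' strings: hash = n-bit window value mod q."""
    m, n = len(T), len(P)
    if n > m:
        return []
    msb = pow(2, n - 1, q)                    # weight of the bit dropped at each slide
    hp = ht = 0
    for i in range(n):                        # initial hashes of P and of T[0:n]
        hp = (hp * 2 + int(P[i])) % q
        ht = (ht * 2 + int(T[i])) % q
    hits = []
    for i in range(m - n + 1):
        if ht == hp and T[i:i + n] == P:      # verify: hash hits may be collisions
            hits.append(i)
        if i + n < m:                         # roll: drop T[i], append T[i+n]
            ht = ((ht - int(T[i]) * msb) * 2 + int(T[i + n])) % q
    return hits

print(karp_rabin("10111011001010101001", "101100"))   # [4] (0-based start)
```
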

16
Algorithm KMP
  • Not the fastest, but the best known
  • Good for real-time matching
  • i.e., the text comes one char at a time
  • No memory of previous chars is needed
  • Idea:
  • Left-to-right comparison
  • Shift P by more than one char whenever possible

17
Intuitive example 1
[Diagram: T contains abcxabc followed by a mismatching char; P = abcxabcde is aligned below, so the mismatch is at P[8]. The naïve approach would shift P by one and restart the comparison.]
  • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches.
  • Number of comparisons saved: 6

18
Intuitive example 2
[Diagram: the same situation, but now the mismatch happens at P[7] of P = abcxabcde.]
  • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1] without missing any possible matches.
  • Number of comparisons saved: 7

19
KMP algorithm pre-processing
  • Key: the reasoning is done without even knowing what string T is.
  • Only the location of the mismatch in P must be known.

[Diagram: T with a mismatched char x; P aligned below, where t = P[j..i] is a suffix of P[1..i] matching a prefix t' of P; y is the char following t' and z = P[i+1] is the char following t]

Pre-processing: for any position i in P, find the longest proper suffix t = P[j..i] of P[1..i] that matches a prefix t' of P, such that the char following t' differs from the char following t (i.e., y ≠ z). For each i, let sp(i) = length(t).
20
KMP algorithm shift rule
[Diagram: the mismatch is between P[i+1] and the text char x = T[k]; t = P[j..i] matches a prefix t' of P]

Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i − sp(i) chars and compare T[k] with P[sp(i)+1]. The shift rule can be represented implicitly by a failure link: when a mismatch occurs between a char x in T and P[i+1], resume the comparison between x and P[sp(i)+1].
21
Failure Link Example
  • P = aataac

If a char in T fails to match at position 6, re-compare it with the char at position 3 (= sp(5) + 1 = 2 + 1)

P:     a  a  t  a  a  c
sp(i): 0  1  0  0  2  0
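
A small sketch that computes the sp values used above; the y ≠ z condition makes them the "strong" values shown on the slide. Indexing is 1-based to match the slides, and the function names are my own. The second call checks the abababc example on the next slide.

```python
def sp_values(P):
    """Weak sp(i): longest proper suffix of P[1..i] that matches a prefix of P."""
    n = len(P)
    sp = [0] * (n + 1)                         # 1-based; sp[i] is for the prefix P[1..i]
    k = 0
    for i in range(2, n + 1):
        while k > 0 and P[i - 1] != P[k]:
            k = sp[k]
        if P[i - 1] == P[k]:
            k += 1
        sp[i] = k
    return sp

def strong_sp_values(P):
    """Strong sp'(i): additionally require that the char after the prefix copy
    differs from P[i+1] (the y != z condition)."""
    n = len(P)
    sp = sp_values(P)
    spp = sp[:]
    for i in range(1, n):                      # i = n keeps the weak value
        if sp[i] > 0 and P[sp[i]] == P[i]:     # following chars agree -> fall back
            spp[i] = spp[sp[i]]
    return spp

print(strong_sp_values("aataac")[1:])          # [0, 1, 0, 0, 2, 0]
print(strong_sp_values("abababc")[1:])         # [0, 0, 0, 0, 0, 4, 0]
```
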
22
Another example
  • P = abababc

If a char in T fails to match at position 7, re-compare it with the char at position 5 (= sp(6) + 1 = 4 + 1)

P:     a  b  a  b  a  b  c
sp(i): 0  0  0  0  0  4  0
23
KMP Example using Failure Link
P = aataac
T = aacaataaaaataaccttacta
  • Time complexity analysis
  • Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
  • More careful analysis: the comparisons can be broken into two phases
  • Comparison phase: the first time a char in T is compared to P. Total is exactly m.
  • Shift phase: comparisons made immediately after a shift. Total is at most m.
  • Time complexity: O(2m) = O(m)

[Trace: successive placements of P = aataac against T, following a failure link after each mismatch]
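
For reference, a self-contained KMP matcher in Python. It uses the classic (weak) failure function, which already gives the 2m comparison bound; the strong sp' values from the earlier sketch would only skip a few more redundant comparisons. Names and 0-based positions are mine.

```python
def kmp_search(T, P):
    """Find all occurrences of P in T in O(|T| + |P|) time."""
    n = len(P)
    fail = [0] * n                        # fail[j]: longest proper border of P[:j+1]
    k = 0
    for j in range(1, n):
        while k > 0 and P[j] != P[k]:
            k = fail[k - 1]
        if P[j] == P[k]:
            k += 1
        fail[j] = k

    hits, q = [], 0                       # q = number of pattern chars matched so far
    for i, c in enumerate(T):
        while q > 0 and c != P[q]:
            q = fail[q - 1]               # follow failure links; never move back in T
        if c == P[q]:
            q += 1
        if q == n:                        # full match ending at position i
            hits.append(i - n + 1)
            q = fail[q - 1]
    return hits

print(kmp_search("aacaataaaaataaccttacta", "aataac"))   # [9]
print(kmp_search("xabxabxac", "abxac"))                 # [4]
```
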
24
KMP algorithm using DFA (Deterministic Finite
Automata)
  • P = aataac

If a char in T fails to match at position 6, re-compare it with the char at position 3 (failure link).

[DFA diagram: states 0-6, where state k means the last k chars of T read so far match P[1..k]. Match transitions: 0 -a-> 1 -a-> 2 -t-> 3 -a-> 4 -a-> 5 -c-> 6. If the next char in T is t after matching 5 chars, go to state 3; a few additional a-transitions handle partial re-matches; all other inputs go to state 0.]
25
DFA Example
[The same DFA for P = aataac]

T =     aacaataataataaccttacta
states: 1201234534534560001001

Each char in T is examined exactly once, so exactly m comparisons are made. But the pre-processing takes longer, and more space is needed to store the DFA.
26
Difference between Failure Link and DFA
  • Failure link
  • Preprocessing time and space are O(n), regardless of alphabet size
  • Comparison time is at most 2m (at least m)
  • DFA
  • Preprocessing time and space are O(n |Σ|), where |Σ| is the alphabet size
  • May be a problem for a very large alphabet
  • For example, each char is a big integer
  • Chinese characters
  • Comparison time is always m.
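
A sketch of the DFA construction (in the style of Sedgewick's KMP automaton); for P = aataac it reproduces the state trace in the DFA example above. The function names and the chosen alphabet are mine.

```python
def build_kmp_dfa(P, alphabet="acgt"):
    """delta[c][s] = state entered after reading char c in state s
    (state = number of pattern chars matched). O(n*|alphabet|) time and space."""
    n = len(P)
    delta = {c: [0] * (n + 1) for c in alphabet}
    delta[P[0]][0] = 1
    x = 0                                      # restart state (simulates P minus its 1st char)
    for j in range(1, n):
        for c in alphabet:
            delta[c][j] = delta[c][x]          # mismatch: behave like the restart state
        delta[P[j]][j] = j + 1                 # match: advance
        x = delta[P[j]][x]
    for c in alphabet:                         # after a full match, continue from the
        delta[c][n] = delta[c][x]              # state of P's longest proper border
    return delta

def dfa_search(T, P, alphabet="acgt"):
    delta, n, state, hits = build_kmp_dfa(P, alphabet), len(P), 0, []
    for i, c in enumerate(T):                  # each text char is examined exactly once
        state = delta[c][state]
        if state == n:
            hits.append(i - n + 1)
    return hits

d, s, trace = build_kmp_dfa("aataac"), 0, ""
for ch in "aacaataataataaccttacta":
    s = d[ch][s]
    trace += str(s)
print(trace)                                           # 1201234534534560001001
print(dfa_search("aacaataataataaccttacta", "aataac"))  # [9]
```
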

27
Boyer Moore algorithm
  • Often the algorithm of choice
  • One T and one P
  • We will talk about it later if we have time
  • The original version does not guarantee linear time
  • Some modifications do (e.g., Galil's rule)
  • Sub-linear in practice

28
The set matching problem
  • Find all occurrences of a set of patterns in T
  • First idea: run KMP or BM for each P
  • O(km + n)
  • k = number of patterns
  • m = length of the text
  • n = total length of the patterns
  • Better idea: combine all the patterns together and search in one run

29
A simpler problem spell-checking
  • A dictionary contains five words
  • potato
  • poetry
  • pottery
  • science
  • school
  • Given a document, check if any word is (not) in
    the dictionary
  • Words in document are separated by special chars.
  • Relatively easy.

30
Keyword tree for spell checking
This version of the potato gun was inspired by
the Weird Science team out of Illinois
[Keyword tree (trie) for the dictionary words potato, poetry, pottery, science, school; the leaves are numbered 1-5]
  • O(n) time to construct, where n = total length of the patterns
  • Search time: O(m), where m = length of the text
  • A common prefix only needs to be compared once
  • What if there is no space between words?
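
A minimal keyword-tree (trie) sketch for the whole-word spell-checking case, assuming words are separated by spaces; names are mine.

```python
def build_keyword_tree(words):
    """Trie over the dictionary; construction is O(total pattern length)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["_end_"] = True                   # marks the end of a dictionary word
    return root

def spell_check(text, tree):
    """Check each space-separated word of the document; O(document length)."""
    unknown = []
    for word in text.lower().split():
        node, ok = tree, True
        for ch in word:
            if ch not in node:
                ok = False
                break
            node = node[ch]
        if not (ok and "_end_" in node):
            unknown.append(word)
    return unknown

tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
print(spell_check("the potato gun was inspired by the weird science team", tree))
# flags every word except 'potato' and 'science'
```
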

31
Aho-Corasick algorithm
  • Basis of the fgrep algorithm
  • Generalizes KMP
  • Uses failure links
  • Example: given the following 4 patterns
  • potato
  • tattoo
  • theater
  • other

32
Keyword tree
[Keyword tree for the patterns potato, tattoo, theater, other; the root is node 0 and the leaves are numbered 1-4]
33
Keyword tree
[The same keyword tree, now used to scan a text]

T = potherotathxythopotattooattoo
34
Keyword tree
[The same keyword tree scanning T]

T = potherotathxythopotattooattoo
Without failure links: O(mn), where m = length of the text and n = length of the longest pattern
35-36
Keyword Tree with a failure link
[The keyword tree with one failure link added, shown over two animation slides while scanning T = potherotathxythopotattooattoo]
37
Keyword Tree with all failure links
[The keyword tree for potato, tattoo, theater, other with all failure links drawn]
38-42
Example
[Five animation slides stepping T = potherotathxythopotattooattoo through the keyword tree, following failure links after mismatches]
43
Aho-Corasick algorithm
  • O(n) preprocessing, and O(m + k) searching
  • n = total length of the patterns
  • m = length of the text
  • k = number of occurrences
  • Can create a DFA similar to the one in KMP
  • Requires more space
  • Preprocessing time depends on the alphabet size
  • Search time is constant per text char
  • Q: Where can this algorithm be used among the previous topics?
  • A: BLAST
  • Given a query sequence, we generate many seed sequences (k-mers)
  • Search for exact matches to these seed sequences
  • Extend the exact matches into longer inexact matches
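
A compact Aho-Corasick sketch (keyword tree + failure links + inherited outputs). It is an illustration in Python rather than the slides' own pseudocode; function names are mine, and the final call runs it on the slides' example text.

```python
from collections import deque

def aho_corasick(patterns):
    """Keyword tree plus failure links. Preprocessing: O(total pattern length)."""
    goto, fail, out = [{}], [0], [[]]              # node 0 is the root
    for idx, p in enumerate(patterns):             # build the keyword tree
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)                      # pattern idx ends at this node
    queue = deque(goto[0].values())                # depth-1 nodes fail to the root
    while queue:                                   # BFS: set failure links by depth
        u = queue.popleft()
        for ch, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(ch, 0)
            out[v] += out[fail[v]]                 # inherit matches reached via failure links
    return goto, fail, out

def ac_search(text, patterns):
    """Report (start position, pattern) pairs in O(|text| + #occurrences)."""
    goto, fail, out = aho_corasick(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for idx in out[node]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits

print(ac_search("potherotathxythopotattooattoo",
                ["potato", "tattoo", "theater", "other"]))
# [(1, 'other'), (18, 'tattoo')] -- 0-based start positions in the slides' text
```
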

44
Suffix Tree
  • All the algorithms we have talked about so far preprocess the pattern(s)
  • Karp-Rabin: small pattern, small alphabet
  • Boyer-Moore: fastest in practice; O(m) worst case (with modification)
  • KMP: O(m)
  • Aho-Corasick: O(m)
  • In some cases we may prefer to pre-process T
  • Fixed T, varying P
  • Suffix tree: basically a keyword tree of all suffixes

45
Suffix tree
  • T = xabxac
  • Suffixes:
  • xabxac
  • abxac
  • bxac
  • xac
  • ac
  • c

[Suffix tree for T = xabxac; each leaf is labeled 1-6 with the starting position of its suffix]

Naïve construction: O(m^2), e.g., using Aho-Corasick. Smarter constructions run in O(m) but are very technical and have a big constant factor. Difference from a keyword tree: an internal node is created only where there is a branch.
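
A naive construction in the spirit of the O(m^2) note above: insert every suffix into an uncompressed suffix trie (a real suffix tree would also compress non-branching paths and can be built in O(m)). Names are mine.

```python
def build_suffix_trie(T, end="$"):
    """Insert all suffixes of T + end marker; O(m^2) time and space."""
    T = T + end                                # terminal symbol: every suffix ends at a leaf
    root = {}
    for i in range(len(T)):
        node = root
        for ch in T[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1                   # 1-based starting position of this suffix
    return root

def find_occurrences(trie, P):
    """Walk P from the root (O(|P|)), then collect the leaf labels below (O(#hits))."""
    node = trie
    for ch in P:
        if ch not in node:
            return []
        node = node[ch]
    hits, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "leaf":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)

trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"))            # [1, 4]
print(find_occurrences(trie, "abx"))           # [2]
```
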
46
Suffix tree implementation
  • Explicitly labeling the sequence end
  • T = xabxa, terminated with a special end symbol: xabxa$

[Diagram: without the end symbol, the suffixes xa and a end in the middle of edges of the tree for xabxa; with $ appended, every suffix ends at its own leaf (leaves 1-5)]
47
Suffix tree implementation
  • Implicitly labeling edges
  • T = xabxa$

[Diagram: each edge label is stored as a pair of indices (start, end) into T rather than as an explicit string, so the tree takes O(m) space regardless of the total length of the edge labels]
48
Suffix links
  • Similar to the failure links in a keyword tree
  • Only internal nodes with branches are linked

[Diagram: in the tree for P = xabcf..., a suffix link points from the internal node whose path label is xα (a char x followed by a string α) to the internal node whose path label is α]
49-58
Suffix tree construction
[Animation over ten slides: the suffix tree for T = acatgacatt (positions 1-10) is built by inserting the suffixes starting at positions 1, 2, ..., 10 one at a time; internal nodes are created only where branches arise, and a terminal symbol marks suffix ends]
59
ST Application 1: pattern matching
  • Find all occurrences of P = xa in T
  • Find the node v in the ST whose path from the root spells P
  • Traverse the subtree rooted at v to get the locations

[Suffix tree for T = xabxac; the path xa leads to a node whose subtree contains leaves 1 and 4]

  • O(m) to construct the ST (large constant factor)
  • O(n) to find v: linear in the length of P instead of T!
  • O(k) to get all leaves, where k is the number of occurrences
  • Asymptotic time is the same as KMP. ST wins if T is fixed; KMP wins otherwise.

60
ST Application 2: set matching
  • Find all occurrences of a set of patterns in T
  • Build a ST from T
  • Match each P against the ST

[Suffix tree for T = xabxac; example pattern P = xab]

  • O(m) to construct the ST (large constant factor)
  • O(n) to find v: linear in the total length of the Ps
  • O(k) to get all leaves, where k is the number of occurrences
  • Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed; AC wins if the Ps are fixed. Otherwise it depends on their relative sizes.

61
ST application 3: repeats finding
  • Genomes contain many repeated DNA sequences
  • Repeat length varies from 1 nucleotide to millions
  • Genes may have multiple copies (50 to 10,000)
  • Highly repetitive DNA in some non-coding regions
  • 6 to 10 bp repeated 100,000 to 1,000,000 times
  • Problem: find all repeats that are at least k residues long and appear at least p times in the genome

62
Repeats finding
  • At least k residues long and appearing at least p times in the sequence
  • Phase 1: top-down, compute the label length L from the root to each node
  • Phase 2: bottom-up, count the number N of leaves descended from each internal node

For each node with L ≥ k and N ≥ p, print all of its leaves. O(m) to traverse the tree; each node stores the pair (L, N).
63
Maximal repeats finding
  • Right-maximal repeat:
  • S[i+1..i+k] = S[j+1..j+k],
  • but S[i+k+1] ≠ S[j+k+1]
  • Left-maximal repeat:
  • S[i+1..i+k] = S[j+1..j+k],
  • but S[i] ≠ S[j]
  • Maximal repeat:
  • S[i+1..i+k] = S[j+1..j+k],
  • but S[i] ≠ S[j] and S[i+k+1] ≠ S[j+k+1]

Example: acatgacatt
  • cat (right-maximal)
  • aca (left-maximal)
  • acat (maximal)

64
Maximal repeats finding
[The suffix tree for T = acatgacatt (positions 1-10), with label lengths and leaf counts at the internal nodes]

  • Find repeats that are at least 3 bases long and occur at least twice
  • Right-maximal: cat
  • Maximal: acat
  • Left-maximal: aca

65
Maximal repeats finding
[The same suffix tree, with each leaf annotated with its left char, i.e., the char immediately preceding that suffix in T (here: g, c, c, a, a)]

  • How do we find a maximal repeat?
  • It is a right-maximal repeat whose occurrences have different left chars

66
ST application 4: word enumeration
  • Find all k-mers that occur at least p times
  • Compute (L, N) for each node
  • L = total label length from the root to the node
  • N = number of leaves below the node
  • Find the nodes v with L ≥ k, L(parent) < k, and N ≥ p
  • Traverse the subtree rooted at v to get the locations

[Diagram: such a node v sits where the path length first reaches k; the k-mer is the first k chars of its path label]

This can be used in many applications, for example to find words that appear frequently in a genome or a document.
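
For a single value of k, the same enumeration can be done with a plain hash map; this is not the suffix-tree traversal itself, just a simple stand-in that returns the same answer (names are mine).

```python
from collections import defaultdict

def frequent_kmers(T, k, p):
    """All k-mers of T occurring at least p times, with their 1-based positions."""
    positions = defaultdict(list)
    for i in range(len(T) - k + 1):
        positions[T[i:i + k]].append(i + 1)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= p}

print(frequent_kmers("acatgacatt", k=3, p=2))   # {'aca': [1, 6], 'cat': [2, 7]}
```
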
67
Joint Suffix Tree (JST)
  • Build a single ST for two or more strings
  • Two strings S1 and S2
  • S = S1 + separator + S2
  • Build a suffix tree for S in time O(|S1| + |S2|)
  • The separator will only appear in edges ending at a leaf

68
Joint suffix tree example
  • S1 = abcd
  • S2 = abca
  • S = abcd·abca (with a separator between the two strings)

[Joint suffix tree for S: every leaf is labeled with a pair (sequence ID, suffix ID), e.g., (1,1)-(1,4) for the suffixes of S1 and (2,1)-(2,4) for those of S2; the portions of edge labels that run past the separator are useless]
69
To Simplify
[Diagram: the same joint suffix tree, redrawn with the useless label portions (everything past the separator) trimmed]

  • We don't really need to do anything, since all edge labels are implicit (index pairs).
  • The right-hand side is simply more convenient to look at.

70
Application 1 of JST
  • Longest common substring between two sequences
  • Using Smith-Waterman with gap and mismatch penalties of −∞: quadratic time
  • Using the JST: linear time
  • For each internal node v, keep a bit vector B
  • B[1] = 1 if the subtree of v contains a suffix of S1 (and similarly B[2] for S2)
  • Bottom-up: find all internal nodes with B[1] = B[2] = 1 (the green nodes)
  • Report a green node with the longest path label
  • Can be extended to k sequences; just use a longer bit vector

[Joint suffix tree for S1 = abcd, S2 = abca with the common internal nodes marked in green]
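
For comparison, the quadratic-time baseline mentioned above, written as a plain dynamic program over common suffixes (this is the slow alternative, not the linear-time JST method; names are mine):

```python
def longest_common_substring(s1, s2):
    """DP: L[i][j] = length of the longest common suffix of s1[:i] and s2[:j]."""
    best, end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, end = cur[j], i
        prev = cur
    return s1[end - best:end]

print(longest_common_substring("abcd", "abca"))   # 'abc'
```
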
71
Application 2 of JST
  • Given K strings, find all k-mers that appear in at least (or at most) d strings
  • Exact motif finding problem

[Diagram: each node stores a bit vector B marking which sequences have a suffix in its subtree; a parent's vector is the bitwise OR of its children's, e.g., B = BitOR(1010, 0011) = 1011. Report nodes with path-label length L ≥ k, L(parent) < k, and cardinality(B) ≥ d (3 in the example); leaves are labeled (sequence ID, suffix position).]
72
Application 3 of JST
  • Substring problem for sequence databases
  • Given: a fixed database of sequences (e.g., individual genomes)
  • Given: a short pattern (e.g., a DNA signature)
  • Q: Does this DNA signature belong to any individual in the database?
  • i.e., is the pattern a substring of some sequence in the database?
  • Aho-Corasick doesn't work here: the patterns vary, so we want to pre-process the database instead
  • This can also be used to design signatures for individuals
  • Build a JST for the database seqs
  • Match P against the JST
  • Find the sequence IDs from the descendant leaves

[Joint suffix tree for the database seqs abcd, abca; example patterns P1 = cd, P2 = ac]
73
Application 4 of JST
  • Detect DNA contamination
  • When we clone and sequence a genome, DNA from other sources may contaminate the sample; this should be detected and removed
  • Given: a fixed database of sequences (e.g., possible contamination sources)
  • Given: a DNA sequence that was just sequenced
  • Q: Does this DNA contain a long enough substring from the seqs in the database?
  • Build a JST for the database seqs
  • Scan T using the JST

[Joint suffix tree for the contamination sources abcd, abca; example sequence to scan: dbcgaabctacgtctagt]
74
Summary
  • One T, one P
  • Boyer-Moore is the choice
  • KMP works but is not the best
  • One T, many P
  • Aho-Corasick
  • Suffix tree
  • One fixed T, many varying P
  • Suffix tree
  • Two or more Ts
  • Suffix tree, joint suffix tree

[Slide annotations mark which of these choices are alphabet independent and which are alphabet dependent]
75
Boyer Moore algorithm
  • Three ideas
  • Right-to-left comparison
  • Bad character rule
  • Good suffix rule

76
Boyer Moore algorithm
  • Right to left comparison

[Diagram: P is compared against T from right to left; after a mismatch at a text char x, P is shifted right and the comparison resumes at P's right end, skipping some chars of T without missing any occurrence]
77
Bad character rule
  • T = xpbctbxabpqqaabpq (positions 1-17)
  • P = tpabxab, compared right to left against T
  • What would you do now?

78
Bad character rule
  • T = xpbctbxabpqqaabpq (positions 1-17)
  • P = tpabxab
  • P = tpabxab, shifted to its next alignment by the bad character rule

79
Bad character rule
  • T = xpbctbxabpqqaabpqz (positions 1-18)
  • P = tpabxab
  • P = tpabxab, shifted once by the bad character rule
  • P = tpabxab, shifted again

80
Basic bad character rule
P = tpabxab

char   rightmost position in P
a      6
b      7
p      2
t      1
x      5

Pre-processing: O(n)
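
The table above is the only preprocessing the basic rule needs; a one-pass sketch (names are mine):

```python
def rightmost_positions(P):
    """For each char of P, its rightmost (1-based) position; O(n).
    Chars absent from P are treated as position 0 by the shift rule."""
    table = {}
    for i, ch in enumerate(P, start=1):
        table[ch] = i                     # later (more rightward) positions overwrite
    return table

print(rightmost_positions("tpabxab"))     # {'t': 1, 'p': 2, 'a': 6, 'b': 7, 'x': 5}
```
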
81
Basic bad character rule
T = xpbctbxabpqqaabpqz (the mismatched text char is T(k))
P = tpabxab

When the rightmost occurrence of T(k) in P is to the left of the mismatch position i, shift P so that T(k) aligns with that rightmost occurrence.
Here i = 3 and the rightmost position of T(k) in P is 1, so the shift is 3 − 1 = 2 (using the rightmost-position table above).
82
Basic bad character rule
T = xpbctbxabpqqaabpqz
P = tpabxab

When T(k) does not occur in P at all, shift the left end of P to align with T(k+1).
Here i = 7, so the shift is 7 − 0 = 7 (using the rightmost-position table above).
83
Basic bad character rule
T = xpbctbxabpqqaabpqz
P = tpabxab

When the rightmost occurrence of T(k) in P is to the right of the mismatch position i, just shift P by 1.
Here i = 5 and the rightmost position of T(k) is 6: 5 − 6 < 0, so the shift is 1 (using the rightmost-position table above).
84
Extended bad character rule
T = xpbctbxabpqqaabpqz
P = tpabxab

Find the occurrence of T(k) in P that is immediately to the left of i, and shift P to align T(k) with that position.
Here i = 5 and the nearest occurrence of T(k) to the left is at position 3, so the shift is 5 − 3 = 2.

char   positions in P
a      6, 3
b      7, 4
p      2
t      1
x      5

Preprocessing is still O(n)
85
Extended bad character rule
  • Best case: m/n comparisons
  • Works better for large alphabets
  • In some cases the extended bad character rule alone is sufficiently good
  • Worst case: O(mn)
  • Expected time is sub-linear
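
As a runnable illustration of the bad-character idea, here is the Horspool simplification of Boyer-Moore: right-to-left comparison with a shift keyed on the text char under the pattern's last position. This is a simplified variant, not the full algorithm with the good suffix rule; names are mine.

```python
def horspool_search(T, P):
    """Find all occurrences of P in T; sub-linear on average, O(mn) worst case."""
    m, n = len(T), len(P)
    shift = {ch: n - 1 - i for i, ch in enumerate(P[:-1])}   # distance to the last position
    hits, k = [], 0                                          # k = current alignment start in T
    while k <= m - n:
        j = n - 1
        while j >= 0 and T[k + j] == P[j]:                   # compare right to left
            j -= 1
        if j < 0:
            hits.append(k)
        k += shift.get(T[k + n - 1], n)                      # bad-character-style shift
    return hits

print(horspool_search("xpbctbxabpqqaabpqz", "tpabxab"))      # []  (no occurrence)
print(horspool_search("prstabstubabvqxrst", "ab"))           # [4, 10]
```
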

86
  • T = prstabstubabvqxrst (positions 1-18)
  • P = qcabdabdab
  • P = qcabdabdab, shifted

According to the extended bad character rule
87
(weak) good suffix rule
  • T = prstabstubabvqxrst (positions 1-18)
  • P = qcabdabdab
  • P = qcabdabdab, shifted by the (weak) good suffix rule

88
(Weak) good suffix rule
[Diagram: a suffix t of P matches T, but the chars just before it differ (x in T vs. y in P); P is shifted so that the rightmost other copy t' of t inside P aligns with the matched text chars]

Preprocessing: for any suffix t of P, find the rightmost copy of t inside P, denoted t'. How can t' be found efficiently?
89
(Strong) good suffix rule
  • T = prstabstubabvqxrst (positions 1-18)
  • P = qcabdabdab

90
(Strong) good suffix rule
  • T = prstabstubabvqxrst (positions 1-18)
  • P = qcabdabdab
  • P = qcabdabdab, shifted by the (strong) good suffix rule

91
(Strong) good suffix rule
[Diagram: a suffix t of P matches T, with a mismatch between x in T and y in P just before t; t' is a copy of t inside P preceded by a char z]

In preprocessing: for any suffix t of P, find the rightmost copy t' of t such that the char to the left of t' differs from the char to the left of t (z ≠ y), and shift P so that t' aligns with the matched text chars.
92
Example preprocessing
P = qcabdabdab

Bad char rule (extended):
char   positions in P
a      9, 6, 3
b      10, 7, 4
c      2
d      8, 5
q      1

Good suffix rule:
position   1  2  3  4  5  6  7  8  9  10
char       q  c  a  b  d  a  b  d  a  b
value      0  0  0  0  2  0  0  2  0  0

The bad char shift depends on T (on the mismatched text char); the good suffix shift does not depend on T. The largest shift given by either the (extended) bad char rule or the (strong) good suffix rule is used.
93
Time complexity of BM algorithm
  • Pre-processing can be done in linear time
  • With the strong good suffix rule, the worst case is O(m) if P is not in T
  • If P is in T, the worst case can be O(mn)
  • E.g., T = aa...a (m = 100), P = aa...a (n = 10)
  • unless a modification is used (Galil's rule)
  • The proofs are technical; we skip them.

94
How to actually do pre-processing?
  • Similar pre-processing is needed for KMP and B-M
  • Find matches between a suffix and a prefix
  • Both can be done in linear time
  • P is usually short, so even a more expensive pre-processing may still give an overall gain

[Diagram, KMP: for each position i, find the longest suffix t of P[1..i] that matches a prefix t' of P, with differing following chars; the values can be filled in for i = 2, 3, ... in a DP-like fashion]

[Diagram, B-M: for each suffix t of P, find its rightmost interior copy t']
95
Fundamental pre-processing
[Diagram: the prefix t' = P[1..Z_i] matches the substring t starting at position i, i.e., P[i..i+Z_i−1]; the chars that follow them, y and x, differ]

  • Z_i = length of the longest substring starting at i that matches a prefix of P
  • i.e., t = t', x ≠ y, Z_i = |t|
  • With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time
  • Example: P = aabcaabxaaz
  •          Z = 0 1 0 0 3 1 0 0 2 1 0
  • How can the Z-values be computed in linear time?

96
Computing Z in Linear time
We have already computed all Z-values up to k−1 and need to compute Z_k. We also know the start and end points, l and r, of the rightmost previous match (the Z-box): P[l..r] matches the prefix P[1..r−l+1].

[Diagram: because P[l..r] = P[1..r−l+1], position k inside the box corresponds to position k−l+1 in the prefix, so Z_{k−l+1} may tell us Z_k without any new comparisons]
97
Computing Z in Linear time
Case 1: the previous r is smaller than k, i.e., no previous match extends beyond k. Do explicit comparisons starting at k.

Case 2: k ≤ r and Z_{k−l+1} < r − k + 1. Then Z_k = Z_{k−l+1}, and no comparison is needed.

Case 3: k ≤ r and Z_{k−l+1} ≥ r − k + 1. Then Z_k ≥ r − k + 1, and explicit comparisons continue from the end of the box (position r + 1) onward.
  • No char inside the box is compared twice. At most
    one mismatch per iteration.
  • Therefore, O(n).
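
A Python sketch of the Z computation following the three cases above (1-based Z values, as on the slides; the example checks the aabcaabxaaz values). Names are mine.

```python
def z_values(P):
    """Z[i] = length of the longest substring starting at i that matches a prefix of P."""
    n = len(P)
    Z = [0] * (n + 1)                      # 1-based; Z[1] is left as 0
    l = r = 0                              # current Z-box P[l..r]
    for k in range(2, n + 1):
        if k > r:                          # case 1: outside any box, compare explicitly
            l = r = k
            while r <= n and P[r - 1] == P[r - k]:
                r += 1
            r -= 1
            Z[k] = r - k + 1
        elif Z[k - l + 1] < r - k + 1:     # case 2: copy the value, no comparison needed
            Z[k] = Z[k - l + 1]
        else:                              # case 3: extend explicitly past r
            q = r + 1
            while q <= n and P[q - 1] == P[q - k]:
                q += 1
            Z[k] = q - k
            l, r = k, q - 1
    return Z[1:]

print(z_values("aabcaabxaaz"))             # [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```
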

98
Z-preprocessing for B-M and KMP
[Diagram: the Z-box starting at position i has length Z_i and ends at j = i + Z_i − 1]

KMP: for each j, set sp(j + Z_j − 1) = Z_j.

B-M: use the Z-values backwards, i.e., computed on the reversed pattern.
  • Both KMP and B-M preprocessing can be done in O(n)