Title: CS5263 Bioinformatics
1CS5263 Bioinformatics
- Lecture 9-10
- Exact String Matching Algorithms
2Overview
- Pair-wise alignment
- Multiple alignment
- Commonality: allowing errors when comparing strings
- Two sub-problems
- How to score an alignment with errors
- How to find an alignment with the best score
- Today: exact string matching
- Do not allow any errors
- Efficiency becomes the sole consideration
3Why exact string matching?
- The most fundamental string comparison problem
- Word processors
- Information retrieval
- DNA sequence retrieval
- Many many more
- Is it still an interesting research problem?
- Yes, if database is large
- Exact string matching is often the core of more complex string comparison algorithms
- E.g., BLAST
- Often repeatedly called by other methods
- Usually the most time consuming part
- A small improvement could considerably improve overall efficiency
4Definitions
- Text: a longer string T (length m)
- Pattern: a shorter string P (length n)
- Exact matching: find all occurrences of P in T
[Figure: T of length m, with P of length n aligned beneath it]
5The naïve algorithm
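The slide's animation is not preserved in this text. As a sketch (not the lecture's code), the naive algorithm in Python, using 0-based positions:

```python
def naive_match(T, P):
    """Naive exact matching: try every alignment of P against T."""
    occurrences = []
    m, n = len(T), len(P)
    for i in range(m - n + 1):           # each possible start in T
        j = 0
        while j < n and T[i + j] == P[j]:
            j += 1                       # compare left to right
        if j == n:                       # all n chars matched
            occurrences.append(i)
    return occurrences
```

Each of the m - n + 1 alignments is checked left to right, which gives the O(mn) worst case discussed on the next slide.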
6Time complexity
- Worst case O(mn)
- Best case O(m)
- e.g. aaaaaaaaaaaaaa vs baaaaaaa
- Average case?
- Alphabet A, C, G, T
- Assume both P and T are random
- Equal probability
- On average, how many chars do you need to compare before giving up?
7Average case time complexity
- P(mismatch at 1st position) = 3/4
- P(mismatch at 2nd position) = (1/4)(3/4)
- P(mismatch at 3rd position) = (1/4)^2 (3/4)
- P(mismatch at kth position) = (1/4)^(k-1) (3/4)
- Expected number of comparisons per position, with p = 1/4:
- Sum_k k (1-p) p^(k-1) = ((1-p)/p) Sum_k k p^k = 1/(1-p)
- = 4/3
- Average complexity: 4m/3 comparisons
- Not as bad as you thought it might be
8Biological sequences are not random
- T: aaaaaaaaaaaaaaaaaaaaaaaaa
- P: aaaab
- Plus, the 4m/3 average case is still bad for long genomic sequences!
- Especially if this has to be done again and again
- Smarter algorithms
- O(m + n) in worst case
- sub-linear in practice
9How to speedup?
- Pre-processing T or P
- Why can pre-processing save us time?
- Uncovers the structure of T or P
- Determines when we can skip ahead without missing anything
- Determines when we can infer the result of character comparisons without doing them
T: ACGTAXACXTAXACGXAX
P: ACGTACA
10Cost for exact string matching
- Total cost = cost(preprocessing) + cost(comparison) + cost(output)
- Preprocessing: overhead
- Comparison: minimize
- Output: constant
- Hope: gain > overhead
11String matching scenarios
- One T and one P
- Search a word in a document
- One T and many P all at once
- Search a set of words in a document
- Spell checking (fixed P)
- One fixed T, many P
- Search a completed genome for short sequences
- Two (or many) Ts for common patterns
- Q: Which one to pre-process?
- A: Always pre-process the shorter seq, or the one that is repeatedly used
12Pre-processing algs
- Pattern preprocessing
- Karp-Rabin algorithm
- Small alphabet and short patterns
- Knuth-Morris-Pratt algorithm (KMP)
- Aho-Corasick algorithm
- Multiple patterns
- Boyer-Moore algorithm
- The choice in most cases
- Typically sub-linear time
- Text preprocessing
- Suffix tree
- Very useful for many purposes
13Karp Rabin Algorithm
- Let's say we are dealing with binary numbers
- Text: 01010001011001010101001
- Pattern: 101100
- Convert the pattern to an integer
- 101100 = 2^5 + 2^3 + 2^2 = 44
14Karp Rabin algorithm
- Text: 10111011001010101001
- Pattern: 101100 = 44 (decimal)
- Window 101110: 2^5 + 2^3 + 2^2 + 2^1 = 46
- Window 011101: 46 * 2 - 64 * 1 + 1 = 29
- Window 111011: 29 * 2 - 64 * 0 + 1 = 59
- Window 110110: 59 * 2 - 64 * 1 + 0 = 54
- Window 101100: 54 * 2 - 64 * 1 + 0 = 44 -- match!
15Karp Rabin algorithm
- What if the pattern is too long to fit into a single integer?
- Pattern 101100, but our machine only has 5 bits
- Basic idea: hashing. 44 mod 13 = 5
- Window 101110: 46 (mod 13 = 7)
- Window 011101: 46 * 2 - 64 * 1 + 1 = 29 (mod 13 = 3)
- Window 111011: 29 * 2 - 64 * 0 + 1 = 59 (mod 13 = 7)
- Window 110110: 59 * 2 - 64 * 1 + 0 = 54 (mod 13 = 2)
- Window 101100: 54 * 2 - 64 * 1 + 0 = 44 (mod 13 = 5) -- hash hit: verify
16Algorithm KMP
- Not the fastest
- Best known
- Good for real-time matching
- i.e. text comes one char at a time
- No memory of previous chars
- Idea
- Left-to-right comparison
- Shift P more than one char whenever possible
17Intuitive example 1
T: ... abcxabc ...
P:     abcxabcde
[Figure: mismatch at P[8]; the naive approach shifts P by one char, while the smarter shift moves P four chars so its prefix abc lands under the matched suffix abc]
- Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches.
- Number of comparisons saved: 6
18Intuitive example 2
abcxabc
T
mismatch
P
abcxabcde
Naïve approach
abcxabc
T
?
abcxabcde
- Observation by reasoning on the pattern alone,
we can determine that if a mismatch happened
between P7 and Tj, we can shift P by six
chars and compare Tj with P1 without missing
any possible matches - Number of comparisons saved 7
19KMP algorithm pre-processing
- Key: the reasoning is done without even knowing what string T is.
- Only the location of the mismatch in P must be known.
[Figure: mismatch x (in T) vs y (in P); t is a suffix of the matched part of P, t' is a matching prefix of P, and z is the char following t']
Pre-processing: for any position i in P, find the longest proper suffix t = P[j..i] of P[1..i] such that t matches a prefix t' of P, and the next char after t is different from the next char after t' (i.e., y != z). For each i, let sp(i) = length(t)
20KMP algorithm shift rule
[Figure: on a mismatch at P[i+1], P is shifted right by i - sp(i) chars so that the prefix copy t' lines up under the suffix t]
Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i - sp(i) chars and compare x with z. This shift rule can be implicitly represented by creating a failure link between y and z: when a mismatch occurs between x on T and P[i+1], resume comparison between x and P[sp(i)+1].
21Failure Link Example
If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1)
P:     a a t a a c
sp(i): 0 1 0 0 2 0
[Figure: failure link from position 6 back to position 3; the prefix aa of P lines up under the matched suffix aa of aataa]
22Another example
If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1)
P:     a b a b a b c
sp(i): 0 0 0 0 0 4 0
[Figure: failure link from position 7 back to position 5; the prefix abab lines up under the matched suffix abab of ababab]
23KMP Example using Failure Link
P: aataac
T: aacaataaaaataaccttacta
[Figure: P slides along T, following failure links after each mismatch]
- Time complexity analysis
- Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
- More careful analysis: the comparisons can be broken into two phases
- Comparison phase: the first time a char in T is compared to P. Total is exactly m.
- Shift phase: the first comparison made after a shift. Total is at most m.
- Time complexity: O(2m)
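The failure-link search can be sketched in code. As an illustration (not the lecture's code), this uses the standard weak prefix-function variant of the sp values, so some shifts may be slightly smaller than with the strong sp' values on the slides; the 2m comparison bound still holds:

```python
def kmp_search(T, P):
    """KMP with failure links: each char of T triggers at most a few
    failure-link hops, for at most 2m comparisons overall."""
    n = len(P)
    # f[i] = length of the longest proper suffix of P[:i] that is also
    # a prefix of P (the weak form of the slides' sp values)
    f = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k and P[i - 1] != P[k]:
            k = f[k]
        if P[i - 1] == P[k]:
            k += 1
        f[i] = k
    hits, q = [], 0
    for i, c in enumerate(T):
        while q and c != P[q]:
            q = f[q]                 # follow failure link, don't restart
        if c == P[q]:
            q += 1
        if q == n:                   # full match ending at position i
            hits.append(i - n + 1)   # 0-based start
            q = f[q]
    return hits
```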
24KMP algorithm using DFA (Deterministic Finite Automata)
Failure link: if a char in T fails to match at pos 6, re-compare it with the char at pos 3
DFA: if the next char in T is t after matching 5 chars, go to state 3
[Figure: DFA with states 0-6 for P = aataac; the match edges a, a, t, a, a, c advance 0 -> 1 -> ... -> 6, extra edges such as t from state 5 to state 3 encode the failure links, and all other inputs go to state 0]
25DFA Example
[Figure: the same DFA, states 0-6 for P = aataac]
T:      aacaataataataaccttacta
States: 1201234534534560001001
Each char in T will be examined exactly once. Therefore, exactly m comparisons are made. But it takes longer to do the pre-processing, and more space is needed to store the DFA.
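A sketch of the DFA construction (a standard textbook-style construction, not the lecture's code; the alphabet "act" is assumed here just to cover the slide's example):

```python
def kmp_dfa(P, alphabet="act"):
    """Build the KMP DFA over a fixed alphabet:
    dfa[c][j] = state reached after reading char c in state j."""
    n = len(P)
    dfa = {c: [0] * n for c in alphabet}
    dfa[P[0]][0] = 1
    x = 0                              # restart state, simulates failure links
    for j in range(1, n):
        for c in alphabet:
            dfa[c][j] = dfa[c][x]      # copy mismatch transitions
        dfa[P[j]][j] = j + 1           # match transition
        x = dfa[P[j]][x]
    return dfa, x                      # x = state to resume from after a match

def dfa_search(T, P, alphabet="act"):
    dfa, restart = kmp_dfa(P, alphabet)
    n, j, hits = len(P), 0, []
    for i, c in enumerate(T):          # each text char examined exactly once
        j = dfa[c][j]
        if j == n:                     # reached the accepting state
            hits.append(i - n + 2)     # 1-based start of the occurrence
            j = restart
    return hits
```

For P = aataac this reproduces the slide's transition "on t in state 5, go to state 3", and scanning the slide's T visits the state sequence shown above.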
26Difference between Failure Link and DFA
- Failure link
- Preprocessing time and space are O(n), regardless of alphabet size
- Comparison time is at most 2m (at least m)
- DFA
- Preprocessing time and space are O(n * |alphabet|)
- May be a problem for very large alphabet size
- For example, each char is a big integer
- Chinese characters
- Comparison time is always m.
27Boyer Moore algorithm
- Often the algorithm of choice in many cases
- One T and one P
- We will talk about it later if we have time
- In its original version, does not guarantee linear time
- Some modifications achieve it
- In practice: sub-linear
28The set matching problem
- Find all occurrences of a set of patterns in T
- First idea: run KMP or BM for each P
- O(km + n)
- k: number of patterns
- m: length of text
- n: total length of patterns
- Better idea: combine all patterns together and search in one run
29A simpler problem spell-checking
- A dictionary contains five words
- potato
- poetry
- pottery
- science
- school
- Given a document, check if any word is (not) in the dictionary
- Words in the document are separated by special chars.
- Relatively easy.
30Keyword tree for spell checking
This version of the potato gun was inspired by the Weird Science team out of Illinois
[Figure: keyword tree (trie) of the five dictionary words; shared prefixes such as pot- share a path, with leaves numbered 1-5]
- O(n) time to construct. n: total length of patterns.
- Search time O(m). m: length of text
- Common prefixes only need to be compared once.
- What if there is no space between words?
31Aho-Corasick algorithm
- Basis of the fgrep algorithm
- Generalizing KMP
- Using failure links
- Example given the following 4 patterns
- potato
- tattoo
- theater
- other
32Keyword tree
[Figure: keyword tree for the patterns potato (1), tattoo (2), theater (3), other (4), rooted at node 0]
33Keyword tree
T: potherotathxythopotattooattoo
[Figure: the keyword tree is matched against T starting from each position]
34Keyword tree
T: potherotathxythopotattooattoo
[Figure: the naive use of the keyword tree restarts from the root after every position]
O(mn): m = length of text, n = length of the longest pattern
35Keyword Tree with a failure link
T: potherotathxythopotattooattoo
[Figure: a failure link added between two nodes of the keyword tree]
36Keyword Tree with a failure link
T: potherotathxythopotattooattoo
[Figure: after a mismatch, the search follows the failure link instead of restarting at the root]
37Keyword Tree with all failure links
[Figure: the keyword tree with every node's failure link drawn]
38-42Example
T: potherotathxythopotattooattoo
[Figure: five steps of scanning T over the keyword tree, following goto edges on matches and failure links on mismatches]
43Aho-Corasick algorithm
- O(n) preprocessing, and O(m + k) searching.
- n: total length of patterns.
- m: length of text
- k: number of occurrences.
- Can create a DFA similar to the one in KMP.
- Requires more space,
- Preprocessing time depends on alphabet size
- Search time is constant per char
- Q: Where can this algorithm be used in previous topics?
- A: BLAST
- Given a query sequence, we generate many seed sequences (k-mers)
- Search for exact matches to these seed sequences
- Extend exact matches into longer inexact matches
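A compact Aho-Corasick sketch (illustrative, not the lecture's code; `goto`, `fail`, and `out` are the keyword-tree edges, failure links, and output sets, and positions are 0-based):

```python
from collections import deque

def build_ac(patterns):
    """Keyword tree plus failure links, with failure links set by a BFS
    so that parents are finished before their children."""
    goto, fail, out = [{}], [0], [[]]          # node 0 is the root
    for p in patterns:
        v = 0
        for c in p:
            if c not in goto[v]:
                goto.append({}); fail.append(0); out.append([])
                goto[v][c] = len(goto) - 1
            v = goto[v][c]
        out[v].append(p)                       # pattern p ends at node v
    q = deque(goto[0].values())                # depth-1 nodes fail to the root
    while q:
        v = q.popleft()
        for c, u in goto[v].items():
            q.append(u)
            f = fail[v]
            while f and c not in goto[f]:
                f = fail[f]                    # climb failure links
            t = goto[f].get(c, 0)
            fail[u] = t if t != u else 0
            out[u] += out[fail[u]]             # inherit matches ending here
    return goto, fail, out

def ac_search(T, patterns):
    goto, fail, out = build_ac(patterns)
    v, hits = 0, []
    for i, c in enumerate(T):
        while v and c not in goto[v]:
            v = fail[v]                        # mismatch: follow failure link
        v = goto[v].get(c, 0)
        for p in out[v]:
            hits.append((i - len(p) + 1, p))   # 0-based start position
    return hits
```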
44Suffix Tree
- All algorithms we talked about so far preprocess the pattern(s)
- Karp-Rabin: small pattern, small alphabet
- Boyer-Moore: fastest in practice. O(m) worst case.
- KMP: O(m)
- Aho-Corasick: O(m)
- In some cases we may prefer to pre-process T
- Fixed T, varying P
- Suffix tree: basically a keyword tree of all suffixes
45Suffix tree
- T: xabxac
- Suffixes
- xabxac
- abxac
- bxac
- xac
- ac
- c
[Figure: suffix tree of xabxac; leaves 1-6 mark the starting position of each suffix]
Naive construction: O(m^2) using Aho-Corasick. Smarter: O(m). Very technical, big constant factor. Difference from a keyword tree: create an internal node only when there is a branch
46Suffix tree implementation
- Explicitly labeling the sequence end
- T: xabxa
[Figure: without an end marker, suffixes that are prefixes of other suffixes (e.g. a, xa) end inside edges; appending a terminal symbol makes every suffix end at a leaf]
47Suffix tree implementation
- Implicitly labeling edges
- T: xabxa
[Figure: each edge label is stored as a pair of indices into T rather than as a string, so the tree takes O(m) space even though the labels total O(m^2) chars]
48Suffix links
- Similar to failure links in a keyword tree
- Only link internal nodes having branches
[Figure: suffix link from the internal node spelling xab to the internal node spelling ab]
49-58Suffix tree construction
T: acatgacatt (positions 1234567890)
[Figure: ten steps of the naive construction, inserting the suffixes starting at positions 1 through 10 one at a time; a new internal node is created only where a branch appears]
59ST Application 1 pattern matching
- Find all occurrences of P = xa in T
- Find the node v in the ST that matches P
- Traverse the subtree rooted at v to get the locations
T: xabxac
[Figure: matching xa walks from the root to node v; the leaves below v (1 and 4) are the occurrences]
- O(m) to construct the ST (large constant factor)
- O(n) to find v: linear in the length of P instead of T!
- O(k) to get all leaves, k is the number of occurrences.
- Asymptotic time is the same as KMP. ST wins if T is fixed. KMP wins otherwise.
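For illustration only, a naive uncompressed suffix *trie* (O(m^2) space; the compressed suffix tree of the slides is O(m)) supports the same query pattern: walk down for P, then collect the leaves below:

```python
def build_suffix_trie(T):
    """Naive suffix trie of T plus an explicit end marker (slide 46);
    quadratic, unlike the real O(m) suffix tree construction."""
    T += "#"
    root = {}
    for i in range(len(T)):            # insert the suffix starting at i
        node = root
        for c in T[i:]:
            node = node.setdefault(c, {})
        node["$"] = i + 1              # leaf label: 1-based suffix start
    return root

def occurrences(root, P):
    """O(|P|) walk down to the node matching P, then O(k) leaf collection."""
    node = root
    for c in P:
        if c not in node:
            return []
        node = node[c]
    leaves, stack = [], [node]
    while stack:
        v = stack.pop()
        for c, child in v.items():
            if c == "$":
                leaves.append(child)   # child is the stored suffix start
            else:
                stack.append(child)
    return sorted(leaves)
```

With T = xabxac (the slide's example), the query xa returns the suffix starts 1 and 4.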
60ST Application 2 set matching
- Find all occurrences of a set of patterns in T
- Build a ST from T
- Match each P to the ST
T: xabxac  P: xab
[Figure: the same suffix tree; each pattern is matched by walking down from the root]
- O(m) to construct the ST (large constant factor)
- O(n) to find v: linear in the total length of the Ps
- O(k) to get all leaves, k is the number of occurrences.
- Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed. AC wins if the Ps are fixed. Otherwise it depends on the relative sizes.
61ST application 3 repeats finding
- Genome contains many repeated DNA sequences
- Repeat sequence lengths vary from 1 nucleotide to millions
- Genes may have multiple copies (50 to 10,000)
- Highly repetitive DNA in some non-coding regions
- 6 to 10 bp x 100,000 to 1,000,000 times
- Problem: find all repeats that are at least k residues long and appear at least p times in the genome
62Repeats finding
- At least k residues long and appearing at least p times in the seq
- Phase 1: top-down, count label lengths (L) from the root to each node
- Phase 2: bottom-up, count the number of leaves (N) descended from each internal node
- For each node with L >= k and N >= p, print all leaves
- O(m) to traverse the tree
[Figure: each internal node annotated with its pair (L, N)]
63Maximal repeats finding
- Right-maximal repeat
- S[i+1..i+k] = S[j+1..j+k],
- but S[i+k+1] != S[j+k+1]
- Left-maximal repeat
- S[i+1..i+k] = S[j+1..j+k],
- but S[i] != S[j]
- Maximal repeat
- S[i+1..i+k] = S[j+1..j+k],
- but S[i] != S[j] and S[i+k+1] != S[j+k+1]
- Example: acatgacatt
64Maximal repeats finding
T: acatgacatt (positions 1234567890)
[Figure: suffix tree of acatgacatt with (L, N) annotations on the internal nodes]
- Find repeats with at least 3 bases and 2 occurrences
- Right-maximal: cat
- Maximal: acat
- Left-maximal: aca
65Maximal repeats finding
[Figure: the same tree, with each leaf annotated by the char to its left in T: g, c, c, a, a]
- How to find a maximal repeat?
- A right-maximal repeat whose occurrences have different left chars
66ST application 4 word enumeration
- Find all k-mers that occur at least p times
- Compute (L, N) for each node
- L: total label length from the root to the node
- N: number of leaves below it
- Find nodes v with L >= k, L(parent) < k, and N >= p
- Traverse the sub-tree rooted at v to get the locations
[Figure: the frontier where the label length first reaches k; nodes there with N >= p are reported]
This can be used in many applications. For example, to find words that appear frequently in a genome or a document
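As a hash-table stand-in for the suffix-tree traversal (illustrative only; the tree version avoids re-extracting each window):

```python
from collections import defaultdict

def frequent_kmers(T, k, p):
    """All k-mers of T occurring at least p times, with their 1-based
    start positions."""
    pos = defaultdict(list)
    for i in range(len(T) - k + 1):
        pos[T[i:i + k]].append(i + 1)   # record every window of length k
    return {w: locs for w, locs in pos.items() if len(locs) >= p}
```

On the running example acatgacatt with k = 4 and p = 2, the only answer is acat at positions 1 and 6, matching slide 64.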
67Joint Suffix Tree (JST)
- Build a ST for two or more strings
- Two strings S1 and S2
- S = S1 + separator + S2
- Build a suffix tree for S in time O(|S1| + |S2|)
- The separator will only appear in edges ending in a leaf
68Joint suffix tree example
- S1 = abcd
- S2 = abca
- S = abcd + separator + abca
[Figure: suffix tree of S; the parts of edge labels that run across the separator are useless; each leaf is labeled (seq ID, suffix ID), e.g. (1,1) for the suffix of S1 starting at abcd and (2,4) for the last a of S2]
69To Simplify
[Figure: the same tree with the useless parts of the edge labels trimmed at the separator]
- We don't really need to do anything, since all edge labels were implicit.
- The right hand side is more convenient to look at
70Application 1 of JST
- Longest common substring between two sequences
- Using Smith-Waterman
- Gap = mismatch = -infinity
- Quadratic time
- Using JST
- Linear time
- For each internal node v, keep a bit vector B
- B[1] = 1 if a leaf below v is a suffix of S1
- Bottom-up: find all internal nodes with B[1] = B[2] = 1 (green nodes)
- Report a green node with the longest label
- Can be extended to k sequences. Just use a longer bit vector.
[Figure: JST of abcd and abca with the common-substring nodes highlighted]
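The quadratic baseline mentioned above can be sketched as the standard common-substring DP (illustrative only; the JST solution replaces this with a linear-time bottom-up traversal):

```python
def longest_common_substring(s1, s2):
    """Quadratic DP: dp[j] = length of the longest common suffix of
    s1[:i] and s2[:j]; track the best cell seen."""
    best, end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1       # extend the common suffix
                if cur[j] > best:
                    best, end = cur[j], i      # remember where it ends in s1
        prev = cur
    return s1[end - best:end]
```

On the slide's strings S1 = abcd and S2 = abca, the answer is abc.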
71Application 2 of JST
- Given K strings, find all k-mers that appear in at least (or at most) d strings
- Exact motif finding problem
[Figure: each node's bit vector is the OR of its children's, e.g. B = BitOR(1010, 0011) = 1011; at the nodes where the label length first reaches k (L >= k, L(parent) < k), report those with cardinal(B) >= d, e.g. cardinal(B) >= 3]
72Application 3 of JST
- Substring problem for sequence databases
- Given: a fixed database of sequences (e.g., individual genomes)
- Given: a short pattern (e.g., DNA signature)
- Q: Does this DNA signature belong to any individual in the database?
- i.e. is the pattern a substring of some sequence in the database?
- Aho-Corasick doesn't work
- This can also be used to design signatures for individuals
- Build a JST for the database seqs
- Match P to the JST
- Find seq IDs from the descendant leaves
Seqs: abcd, abca  P1: cd  P2: ac
[Figure: matching each P on the JST; the leaf labels below the matched node give the sequence IDs]
73Application 4 of JST
- Detect DNA contamination
- For some reason, when we try to clone and sequence a genome, DNA from other sources may contaminate our sample, which should be detected and removed
- Given: a fixed database of sequences (e.g., possible contamination sources)
- Given: a DNA just sequenced
- Q: Does this DNA contain a long enough substring from the seqs in the database?
- Build a JST for the database seqs
- Scan T using the JST
Contamination sources: abcd, abca  Sequence: dbcgaabctacgtctagt
[Figure: scanning the sequence against the JST of the contamination sources]
74Summary
- One T, one P
- Boyer-Moore is the choice
- KMP works but not the best
- One T, many P
- Aho-Corasick
- Suffix Tree
- One fixed T, many varying P
- Suffix tree
- Two or more Ts
- Suffix tree, joint suffix tree
Alphabet independent
Alphabet dependent
75Boyer Moore algorithm
- Three ideas
- Right-to-left comparison
- Bad character rule
- Good suffix rule
76Boyer Moore algorithm
[Figure: P compared right to left under T; on a mismatch (x vs y), P is shifted right and comparison resumes at the right end of P, skipping some chars of T without missing any occurrence]
- T: xpbctbxabpqqaabpq (positions 1 to 17)
- P: tpabxab
[Figure: P aligned under T[1..7]; scanning right to left, the first comparison T[7] = x vs P[7] = b already mismatches]
- What would you do now?
78Bad character rule
- T: xpbctbxabpqqaabpq
- P: tpabxab
[Figure: x occurs rightmost at P[5], so P is shifted right to align P[5] under the mismatched x in T]
79Bad character rule
- T: xpbctbxabpqqaabpqz (positions 1 to 18)
- P: tpabxab
[Figure: two more shifts driven by mismatched text chars; each shift aligns the mismatched char with its rightmost occurrence in P, or moves P past it entirely]
80Basic bad character rule
P: tpabxab
char: rightmost position in P
a: 6
b: 7
p: 2
t: 1
x: 5
Pre-processing: O(n)
81Basic bad character rule
T: xpbctbxabpqqaabpqz
P: tpabxab
When the rightmost occurrence of T(k) in P is to the left of the mismatch position i, shift P to align T(k) with that rightmost occurrence
Example: i = 3, rightmost position 1, shift = 3 - 1 = 2
char: rightmost position in P -- a: 6, b: 7, p: 2, t: 1, x: 5
82Basic bad character rule
T: xpbctbxabpqqaabpqz
P: tpabxab
When T(k) does not occur in P, shift the left end of P to align with T(k+1)
Example: i = 7, shift = 7 - 0 = 7
char: rightmost position in P -- a: 6, b: 7, p: 2, t: 1, x: 5
83Basic bad character rule
T: xpbctbxabpqqaabpqz
P: tpabxab
When the rightmost occurrence of T(k) in P is to the right of i, shift P by 1
Example: i = 5, 5 - 6 < 0, so shift 1
char: rightmost position in P -- a: 6, b: 7, p: 2, t: 1, x: 5
84Extended bad character rule
T: xpbctbxabpqqaabpqz
P: tpabxab
Find the occurrence of T(k) in P that is immediately to the left of i, and shift P to align T(k) with that position
Example: i = 5, 5 - 3 = 2, so shift 2
char: positions in P -- a: 6, 3; b: 7, 4; p: 2; t: 1; x: 5
Preprocessing is still O(n)
85Extended bad character rule
- Best possible: m/n comparisons
- Works better for large alphabets
- In some cases the extended bad character rule is sufficiently good
- Worst-case: O(mn)
- Expected time is sublinear
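A sketch of Boyer-Moore restricted to the basic bad character rule (illustrative, not the lecture's code; the full algorithm also applies the good suffix rule, and after a full match the basic rule gives no shift, so this sketch shifts by 1):

```python
def bm_bad_char(T, P):
    """Boyer-Moore with right-to-left comparison and the basic bad
    character rule only."""
    rightmost = {c: i for i, c in enumerate(P)}   # rightmost index of each char
    m, n = len(T), len(P)
    hits, s = [], 0
    while s <= m - n:
        j = n - 1
        while j >= 0 and P[j] == T[s + j]:
            j -= 1                      # compare right to left
        if j < 0:
            hits.append(s)              # 0-based occurrence
            s += 1                      # basic rule gives no shift here
        else:
            # align the mismatched text char with its rightmost copy in P;
            # shift at least 1 when that copy lies to the right
            s += max(1, j - rightmost.get(T[s + j], -1))
    return hits
```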
86Example
- T: prstabstubabvqxrst (positions 1 to 18)
- P: qcabdabdab
[Figure: P shifted according to the extended bad character rule]
87(weak) good suffix rule
- T: prstabstubabvqxrst (positions 1 to 18)
- P: qcabdabdab
[Figure: the matched suffix of P also occurs earlier in P; shift P so that the rightmost earlier copy lines up under the matched text]
88(Weak) good suffix rule
[Figure: generic picture -- a mismatch x vs y in T after matching the suffix t of P]
Preprocessing: for any suffix t of P, find the rightmost copy of t, denoted t'. How can t' be found efficiently?
89(Strong) good suffix rule
- T: prstabstubabvqxrst (positions 1 to 18)
- P: qcabdabdab
[Figure: the weak rule can shift to a copy of t preceded by the same char, guaranteeing another immediate mismatch]
90(Strong) good suffix rule
- T: prstabstubabvqxrst
- P: qcabdabdab
[Figure: the strong rule shifts to a copy of t whose preceding char differs]
91(Strong) good suffix rule
[Figure: mismatch x vs y after matching the suffix t; t' is the rightmost copy of t preceded by z with z != y]
In preprocessing: for any suffix t of P, find the rightmost copy t' of t such that the char to the left of t' differs from the char to the left of t
92Example preprocessing
P: qcabdabdab
Bad char rule (positions in P):
a: 9, 6, 3
b: 10, 7, 4
c: 2
d: 8, 5
Good suffix rule:
pos:   1 2 3 4 5 6 7 8 9 10
char:  q c a b d a b d a b
value: 0 0 0 0 2 0 0 2 0 0
Bad char rule: where to shift depends on T. Good suffix rule: does not depend on T.
The largest shift given by either the (extended) bad char rule or the (strong) good suffix rule is used.
93Time complexity of BM algorithm
- Pre-processing can be done in linear time
- With the strong good suffix rule, worst case is O(m) if P is not in T
- If P is in T, worst case could be O(mn)
- E.g. T = a^100, P = a^10
- unless a modification is used (Galil's rule)
- Proofs are technical. Skip.
94How to actually do pre-processing?
- Similar pre-processing for KMP and B-M
- Find matches between a suffix and a prefix
- Both can be done in linear time
- P is usually short, so even a more expensive pre-processing may result in an overall gain
KMP: [Figure: for each i, find a j such that the suffix t = P[j..i] matches a prefix t' of P; similar to DP, starting from i = 2]
B-M: [Figure: the mirror problem -- for each suffix t, find its rightmost earlier copy t']
95Fundamental pre-processing
[Figure: the Z box at i: t' = P[i..i+Z_i-1] matches the prefix t = P[1..Z_i], and the next chars differ (x != y)]
- Z_i = length of the longest substring starting at i that matches a prefix of P
- i.e. t = t', x != y, Z_i = |t|
- With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time.
- P: aabcaabxaaz
- Z: 0 1 0 0 3 1 0 0 2 1 0
- How to compute Z-values in linear time?
96Computing Z in Linear time
We have already computed all Z-values up to k-1 and need to compute Z_k. We also know the starting and ending points, l and r, of the previous rightmost match.
[Figure: the box P[l..r] matches the prefix P[1..r-l+1]]
We know that t = t', therefore the Z-value at position k-l+1 may be helpful to us.
97Computing Z in Linear time
Case 1: the previous r is smaller than k, i.e., no previous match extends beyond k: do explicit comparison.
Case 2: Z_{k-l+1} < r-k+1. Then Z_k = Z_{k-l+1}; no comparison is needed.
Case 3: Z_{k-l+1} >= r-k+1. Then Z_k >= r-k+1; comparison starts from r+1.
- No char inside the box is compared twice. At most one mismatch per iteration.
- Therefore, O(n).
98Z-preprocessing for B-M and KMP
[Figure: the Z box at i: t' = P[i..i+Z_i-1] matches the prefix t; let j = i + Z_i - 1]
KMP: for each i, sp(i + Z_i - 1) = Z_i
B-M: use the Z-values backwards (i.e., on the reversed pattern)
- Both KMP and B-M preprocessing can be done in O(n)