1
CS5263 Bioinformatics
  • Lecture 9-10
  • Exact String Matching Algorithms

2
Overview
  • Pair-wise alignment
  • Multiple alignment
  • Commonality: allow errors when comparing strings
  • Two sub-problems:
  • How to score an alignment with errors
  • How to find an alignment with the best score
  • Today: exact string matching
  • Do not allow any errors
  • Efficiency becomes the sole consideration

3
Why exact string matching?
  • The most fundamental string comparison problem
  • Word processors
  • Information retrieval
  • DNA sequence retrieval
  • Many many more
  • Is it still an interesting research problem?
  • Yes, if database is large
  • Exact string matching is often the core of more
    complex string comparison algorithms
  • E.g., BLAST
  • Often repeatedly called by other methods
  • Usually the most time consuming part
  • A small improvement can considerably improve the overall efficiency

4
Definitions
  • Text: a longer string T (length m)
  • Pattern: a shorter string P (length n)
  • Exact matching: find all occurrences of P in T

[Diagram: T of length m, with P of length n aligned beneath it]
5
The naïve algorithm
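
The naïve algorithm slides P along T one position at a time and, at each position, compares characters left to right. A minimal Python sketch (the function name and 0-based positions are my own, not from the slides):

```python
def naive_match(T, P):
    """Report every 0-based position i where P occurs in T."""
    m, n = len(T), len(P)
    hits = []
    for i in range(m - n + 1):                 # candidate start position in T
        j = 0
        while j < n and T[i + j] == P[j]:      # compare left to right
            j += 1
        if j == n:                             # all n chars matched
            hits.append(i)
    return hits

print(naive_match("cabababd", "aba"))          # [1, 3]
```
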
6
Time complexity
  • Worst case: O(mn)
  • Best case: O(m)
  • e.g., T = aaaaaaaaaaaaaa vs. P = baaaaaaa
  • Average case?
  • Alphabet: A, C, G, T
  • Assume both P and T are random, with equal probability for each char
  • On average, how many chars do you need to compare before giving up at each position?

7
Average case time complexity
  • P(mismatch at 1st position) = 3/4
  • P(mismatch at 2nd position) = (1/4)(3/4)
  • P(mismatch at 3rd position) = (1/4)^2 (3/4)
  • P(mismatch at kth position) = (1/4)^(k-1) (3/4)
  • Expected number of comparisons per position, with p = 1/4:
  • Σ_k k (1−p) p^(k−1) = ((1−p)/p) Σ_k k p^k = 1/(1−p) = 4/3
  • Average complexity: 4m/3
  • Not as bad as you thought it might be
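
As an illustrative sanity check (not from the slides) of the 4/3 figure, a tiny Monte Carlo estimate of the expected comparisons per alignment, assuming independent, uniformly random characters over a 4-letter alphabet:

```python
import random

def avg_comparisons(trials=200_000, n=8, alphabet="ACGT"):
    """Average number of char comparisons at one alignment before giving up."""
    total = 0
    for _ in range(trials):
        k = 0
        while k < n and random.choice(alphabet) == random.choice(alphabet):
            k += 1                      # another matching char, keep comparing
        total += min(k + 1, n)          # comparisons made (capped by pattern length)
    return total / trials

print(avg_comparisons())                # ≈ 1.33, i.e., close to 4/3
```
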

8
Biological sequences are not random
  • T = aaaaaaaaaaaaaaaaaaaaaaaaa
  • P = aaaab
  • Plus, the 4m/3 average case is still bad for long genomic sequences!
  • Especially if this has to be done again and again
  • Smarter algorithms:
  • O(m + n) in the worst case
  • sub-linear in practice

9
How to speed up?
  • Pre-process T or P
  • Why can pre-processing save us time?
  • It uncovers the structure of T or P
  • It determines when we can skip ahead without missing anything
  • It determines when we can infer the result of character comparisons without doing them.

T: ACGTAXACXTAXACGXAX
P: ACGTACA
10
Cost for exact string matching
  • Total cost = cost(preprocessing) + cost(comparison) + cost(output)
  • preprocessing: overhead
  • comparison: minimize this
  • output: constant
  • Hope: gain > overhead
11
String matching scenarios
  • One T and one P
  • Search for a word in a document
  • One T and many P, all at once
  • Search for a set of words in a document
  • Spell checking (fixed P)
  • One fixed T, many P
  • Search a completed genome for short sequences
  • Two (or many) Ts, searched for common patterns
  • Q: Which one should be pre-processed?
  • A: Always pre-process the shorter seq, or the one that is used repeatedly

12
Pre-processing algs
  • Pattern preprocessing
  • Karp-Rabin algorithm
  • Small alphabet and short patterns
  • Knuth-Morris-Pratt algorithm (KMP)
  • Aho-Corasick algorithm
  • Multiple patterns
  • Boyer-Moore algorithm
  • The choice for most cases
  • Typically sub-linear time
  • Text preprocessing
  • Suffix tree
  • Very useful for many purposes

13
Karp Rabin Algorithm
  • Let's say we are dealing with binary numbers
  • Text: 01010001011001010101001
  • Pattern: 101100
  • Convert the pattern to an integer:
  • 101100 = 2^5 + 2^3 + 2^2 = 44

14
Karp Rabin algorithm
  • Text: 10111011001010101001
  • Pattern: 101100 = 44 (decimal)
  • First window: 101110 = 2^5 + 2^3 + 2^2 + 2^1 = 46
  • Slide the window one bit at a time; new value = 2 × old − 64 × (dropped bit) + (new bit):
  • 011101: 46 × 2 − 64 × 1 + 1 = 29
  • 111011: 29 × 2 − 64 × 0 + 1 = 59
  • 110110: 59 × 2 − 64 × 1 + 0 = 54
  • 101100: 54 × 2 − 64 × 1 + 0 = 44 → match

15
Karp Rabin algorithm
  • What if the pattern is too long to fit into a single integer?
  • Pattern 101100, but our machine only has 5 bits
  • Basic idea: hashing. 44 mod 13 = 5
  • 101110 = 46 (mod 13 = 7)
  • 011101: 46 × 2 − 64 × 1 + 1 = 29 (mod 13 = 3)
  • 111011: 29 × 2 − 64 × 0 + 1 = 59 (mod 13 = 7)
  • 110110: 59 × 2 − 64 × 1 + 0 = 54 (mod 13 = 2)
  • 101100: 54 × 2 − 64 × 1 + 0 = 44 (mod 13 = 5) → hash match; verify by direct comparison
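
A hedged Python sketch of the whole Karp-Rabin idea for binary strings: the rolling update above, the hash mod q, and an explicit verification step to rule out hash collisions (the function and parameter names are mine):

```python
def karp_rabin(T, P, q=13):
    """Karp-Rabin over '0'/'1' strings: hash = n-bit window value mod q."""
    m, n = len(T), len(P)
    if n > m:
        return []
    msb = pow(2, n - 1, q)                    # weight of the bit dropped at each slide
    hp = ht = 0
    for i in range(n):                        # initial hashes of P and of T[0:n]
        hp = (hp * 2 + int(P[i])) % q
        ht = (ht * 2 + int(T[i])) % q
    hits = []
    for i in range(m - n + 1):
        if ht == hp and T[i:i + n] == P:      # verify: hash hits may be collisions
            hits.append(i)
        if i + n < m:                         # roll: drop T[i], append T[i+n]
            ht = ((ht - int(T[i]) * msb) * 2 + int(T[i + n])) % q
    return hits

print(karp_rabin("10111011001010101001", "101100"))   # [4] (0-based start)
```
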

16
Algorithm KMP
  • Not the fastest, but the best known
  • Good for real-time matching
  • i.e., the text comes one char at a time
  • No memory of previous chars is needed
  • Idea:
  • Left-to-right comparison
  • Shift P by more than one char whenever possible

17
Intuitive example 1
[Diagram: T contains abcxabc followed by a mismatching char; P = abcxabcde is aligned below, so the mismatch is at P[8]. The naïve approach would shift P by one and restart the comparison.]
  • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches.
  • Number of comparisons saved: 6

18
Intuitive example 2
[Diagram: the same situation, but now the mismatch happens at P[7] of P = abcxabcde.]
  • Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1] without missing any possible matches.
  • Number of comparisons saved: 7

19
KMP algorithm pre-processing
  • Key: the reasoning is done without even knowing what string T is.
  • Only the location of the mismatch in P must be known.

[Diagram: T with a mismatched char x; P aligned below, where t = P[j..i] is a suffix of P[1..i] matching a prefix t' of P; y is the char following t' and z = P[i+1] is the char following t]

Pre-processing: for any position i in P, find the longest proper suffix t = P[j..i] of P[1..i] that matches a prefix t' of P, such that the char following t' differs from the char following t (i.e., y ≠ z). For each i, let sp(i) = length(t).
20
KMP algorithm shift rule
[Diagram: the mismatch is between P[i+1] and the text char x = T[k]; t = P[j..i] matches a prefix t' of P]

Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i − sp(i) chars and compare T[k] with P[sp(i)+1]. The shift rule can be represented implicitly by a failure link: when a mismatch occurs between a char x in T and P[i+1], resume the comparison between x and P[sp(i)+1].
21
Failure Link Example
  • P = aataac

If a char in T fails to match at position 6, re-compare it with the char at position 3 (= sp(5) + 1 = 2 + 1)

P:     a  a  t  a  a  c
sp(i): 0  1  0  0  2  0
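
A small sketch that computes the sp values used above; the y ≠ z condition makes them the "strong" values shown on the slide. Indexing is 1-based to match the slides, and the function names are my own. The second call checks the abababc example on the next slide.

```python
def sp_values(P):
    """Weak sp(i): longest proper suffix of P[1..i] that matches a prefix of P."""
    n = len(P)
    sp = [0] * (n + 1)                         # 1-based; sp[i] is for the prefix P[1..i]
    k = 0
    for i in range(2, n + 1):
        while k > 0 and P[i - 1] != P[k]:
            k = sp[k]
        if P[i - 1] == P[k]:
            k += 1
        sp[i] = k
    return sp

def strong_sp_values(P):
    """Strong sp'(i): additionally require that the char after the prefix copy
    differs from P[i+1] (the y != z condition)."""
    n = len(P)
    sp = sp_values(P)
    spp = sp[:]
    for i in range(1, n):                      # i = n keeps the weak value
        if sp[i] > 0 and P[sp[i]] == P[i]:     # following chars agree -> fall back
            spp[i] = spp[sp[i]]
    return spp

print(strong_sp_values("aataac")[1:])          # [0, 1, 0, 0, 2, 0]
print(strong_sp_values("abababc")[1:])         # [0, 0, 0, 0, 0, 4, 0]
```
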
22
Another example
  • P = abababc

If a char in T fails to match at position 7, re-compare it with the char at position 5 (= sp(6) + 1 = 4 + 1)

P:     a  b  a  b  a  b  c
sp(i): 0  0  0  0  0  4  0
23
KMP Example using Failure Link
P = aataac
T = aacaataaaaataaccttacta
  • Time complexity analysis
  • Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
  • More careful analysis: the comparisons can be broken into two phases
  • Comparison phase: the first time a char in T is compared to P. Total is exactly m.
  • Shift phase: comparisons made immediately after a shift. Total is at most m.
  • Time complexity: O(2m) = O(m)

[Trace: successive placements of P = aataac against T, following a failure link after each mismatch]
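
For reference, a self-contained KMP matcher in Python. It uses the classic (weak) failure function, which already gives the 2m comparison bound; the strong sp' values from the earlier sketch would only skip a few more redundant comparisons. Names and 0-based positions are mine.

```python
def kmp_search(T, P):
    """Find all occurrences of P in T in O(|T| + |P|) time."""
    n = len(P)
    fail = [0] * n                        # fail[j]: longest proper border of P[:j+1]
    k = 0
    for j in range(1, n):
        while k > 0 and P[j] != P[k]:
            k = fail[k - 1]
        if P[j] == P[k]:
            k += 1
        fail[j] = k

    hits, q = [], 0                       # q = number of pattern chars matched so far
    for i, c in enumerate(T):
        while q > 0 and c != P[q]:
            q = fail[q - 1]               # follow failure links; never move back in T
        if c == P[q]:
            q += 1
        if q == n:                        # full match ending at position i
            hits.append(i - n + 1)
            q = fail[q - 1]
    return hits

print(kmp_search("aacaataaaaataaccttacta", "aataac"))   # [9]
print(kmp_search("xabxabxac", "abxac"))                 # [4]
```
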
24
KMP algorithm using DFA (Deterministic Finite
Automata)
  • P = aataac

If a char in T fails to match at position 6, re-compare it with the char at position 3 (failure link).

[DFA diagram: states 0-6, where state k means the last k chars of T read so far match P[1..k]. Match transitions: 0 -a-> 1 -a-> 2 -t-> 3 -a-> 4 -a-> 5 -c-> 6. If the next char in T is t after matching 5 chars, go to state 3; a few additional a-transitions handle partial re-matches; all other inputs go to state 0.]
25
DFA Example
[The same DFA for P = aataac]

T =     aacaataataataaccttacta
states: 1201234534534560001001

Each char in T is examined exactly once, so exactly m comparisons are made. But the pre-processing takes longer, and more space is needed to store the DFA.
26
Difference between Failure Link and DFA
  • Failure link
  • Preprocessing time and space are O(n), regardless of alphabet size
  • Comparison time is at most 2m (at least m)
  • DFA
  • Preprocessing time and space are O(n |Σ|), where |Σ| is the alphabet size
  • May be a problem for a very large alphabet
  • For example, each char is a big integer
  • Chinese characters
  • Comparison time is always m.
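
A sketch of the DFA construction (in the style of Sedgewick's KMP automaton); for P = aataac it reproduces the state trace in the DFA example above. The function names and the chosen alphabet are mine.

```python
def build_kmp_dfa(P, alphabet="acgt"):
    """delta[c][s] = state entered after reading char c in state s
    (state = number of pattern chars matched). O(n*|alphabet|) time and space."""
    n = len(P)
    delta = {c: [0] * (n + 1) for c in alphabet}
    delta[P[0]][0] = 1
    x = 0                                      # restart state (simulates P minus its 1st char)
    for j in range(1, n):
        for c in alphabet:
            delta[c][j] = delta[c][x]          # mismatch: behave like the restart state
        delta[P[j]][j] = j + 1                 # match: advance
        x = delta[P[j]][x]
    for c in alphabet:                         # after a full match, continue from the
        delta[c][n] = delta[c][x]              # state of P's longest proper border
    return delta

def dfa_search(T, P, alphabet="acgt"):
    delta, n, state, hits = build_kmp_dfa(P, alphabet), len(P), 0, []
    for i, c in enumerate(T):                  # each text char is examined exactly once
        state = delta[c][state]
        if state == n:
            hits.append(i - n + 1)
    return hits

d, s, trace = build_kmp_dfa("aataac"), 0, ""
for ch in "aacaataataataaccttacta":
    s = d[ch][s]
    trace += str(s)
print(trace)                                           # 1201234534534560001001
print(dfa_search("aacaataataataaccttacta", "aataac"))  # [9]
```
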

27
Boyer Moore algorithm
  • Often the algorithm of choice
  • One T and one P
  • We will talk about it later if we have time
  • The original version does not guarantee linear time
  • Some modifications do (e.g., Galil's rule)
  • Sub-linear in practice

28
The set matching problem
  • Find all occurrences of a set of patterns in T
  • First idea: run KMP or BM for each P
  • O(km + n)
  • k = number of patterns
  • m = length of the text
  • n = total length of the patterns
  • Better idea: combine all the patterns together and search in one run

29
A simpler problem spell-checking
  • A dictionary contains five words
  • potato
  • poetry
  • pottery
  • science
  • school
  • Given a document, check if any word is (not) in
    the dictionary
  • Words in document are separated by special chars.
  • Relatively easy.

30
Keyword tree for spell checking
This version of the potato gun was inspired by
the Weird Science team out of Illinois
[Keyword tree (trie) for the dictionary words potato, poetry, pottery, science, school; the leaves are numbered 1-5]
  • O(n) time to construct, where n = total length of the patterns
  • Search time: O(m), where m = length of the text
  • A common prefix only needs to be compared once
  • What if there is no space between words?
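
A minimal keyword-tree (trie) sketch for the whole-word spell-checking case, assuming words are separated by spaces; names are mine.

```python
def build_keyword_tree(words):
    """Trie over the dictionary; construction is O(total pattern length)."""
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node["_end_"] = True                   # marks the end of a dictionary word
    return root

def spell_check(text, tree):
    """Check each space-separated word of the document; O(document length)."""
    unknown = []
    for word in text.lower().split():
        node, ok = tree, True
        for ch in word:
            if ch not in node:
                ok = False
                break
            node = node[ch]
        if not (ok and "_end_" in node):
            unknown.append(word)
    return unknown

tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
print(spell_check("the potato gun was inspired by the weird science team", tree))
# flags every word except 'potato' and 'science'
```
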

31
Aho-Corasick algorithm
  • Basis of the fgrep algorithm
  • Generalizes KMP
  • Uses failure links
  • Example: given the following 4 patterns
  • potato
  • tattoo
  • theater
  • other

32
Keyword tree
[Keyword tree for the patterns potato, tattoo, theater, other; the root is node 0 and the leaves are numbered 1-4]
33
Keyword tree
[The same keyword tree, now used to scan a text]

T = potherotathxythopotattooattoo
34
Keyword tree
[The same keyword tree scanning T]

T = potherotathxythopotattooattoo
Without failure links: O(mn), where m = length of the text and n = length of the longest pattern
35-36
Keyword Tree with a failure link
[The keyword tree with one failure link added, shown over two animation slides while scanning T = potherotathxythopotattooattoo]
37
Keyword Tree with all failure links
[The keyword tree for potato, tattoo, theater, other with all failure links drawn]
38-42
Example
[Five animation slides stepping T = potherotathxythopotattooattoo through the keyword tree, following failure links after mismatches]
43
Aho-Corasick algorithm
  • O(n) preprocessing, and O(m + k) searching
  • n = total length of the patterns
  • m = length of the text
  • k = number of occurrences
  • Can create a DFA similar to the one in KMP
  • Requires more space
  • Preprocessing time depends on the alphabet size
  • Search time is constant per text char
  • Q: Where can this algorithm be used among the previous topics?
  • A: BLAST
  • Given a query sequence, we generate many seed sequences (k-mers)
  • Search for exact matches to these seed sequences
  • Extend the exact matches into longer inexact matches
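
A compact Aho-Corasick sketch (keyword tree + failure links + inherited outputs). It is an illustration in Python rather than the slides' own pseudocode; function names are mine, and the final call runs it on the slides' example text.

```python
from collections import deque

def aho_corasick(patterns):
    """Keyword tree plus failure links. Preprocessing: O(total pattern length)."""
    goto, fail, out = [{}], [0], [[]]              # node 0 is the root
    for idx, p in enumerate(patterns):             # build the keyword tree
        node = 0
        for ch in p:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)                      # pattern idx ends at this node
    queue = deque(goto[0].values())                # depth-1 nodes fail to the root
    while queue:                                   # BFS: set failure links by depth
        u = queue.popleft()
        for ch, v in goto[u].items():
            queue.append(v)
            f = fail[u]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[v] = goto[f].get(ch, 0)
            out[v] += out[fail[v]]                 # inherit matches reached via failure links
    return goto, fail, out

def ac_search(text, patterns):
    """Report (start position, pattern) pairs in O(|text| + #occurrences)."""
    goto, fail, out = aho_corasick(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for idx in out[node]:
            hits.append((i - len(patterns[idx]) + 1, patterns[idx]))
    return hits

print(ac_search("potherotathxythopotattooattoo",
                ["potato", "tattoo", "theater", "other"]))
# [(1, 'other'), (18, 'tattoo')] -- 0-based start positions in the slides' text
```
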

44
Suffix Tree
  • All the algorithms we have talked about so far preprocess the pattern(s)
  • Karp-Rabin: small pattern, small alphabet
  • Boyer-Moore: fastest in practice; O(m) worst case (with modification)
  • KMP: O(m)
  • Aho-Corasick: O(m)
  • In some cases we may prefer to pre-process T
  • Fixed T, varying P
  • Suffix tree: basically a keyword tree of all suffixes

45
Suffix tree
  • T = xabxac
  • Suffixes:
  • xabxac
  • abxac
  • bxac
  • xac
  • ac
  • c

[Suffix tree for T = xabxac; each leaf is labeled 1-6 with the starting position of its suffix]

Naïve construction: O(m^2), e.g., using Aho-Corasick. Smarter constructions run in O(m) but are very technical and have a big constant factor. Difference from a keyword tree: an internal node is created only where there is a branch.
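
A naive construction in the spirit of the O(m^2) note above: insert every suffix into an uncompressed suffix trie (a real suffix tree would also compress non-branching paths and can be built in O(m)). Names are mine.

```python
def build_suffix_trie(T, end="$"):
    """Insert all suffixes of T + end marker; O(m^2) time and space."""
    T = T + end                                # terminal symbol: every suffix ends at a leaf
    root = {}
    for i in range(len(T)):
        node = root
        for ch in T[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1                   # 1-based starting position of this suffix
    return root

def find_occurrences(trie, P):
    """Walk P from the root (O(|P|)), then collect the leaf labels below (O(#hits))."""
    node = trie
    for ch in P:
        if ch not in node:
            return []
        node = node[ch]
    hits, stack = [], [node]
    while stack:
        n = stack.pop()
        for key, child in n.items():
            if key == "leaf":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)

trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"))            # [1, 4]
print(find_occurrences(trie, "abx"))           # [2]
```
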
46
Suffix tree implementation
  • Explicitly labeling the sequence end
  • T = xabxa, terminated with a special end symbol: xabxa$

[Diagram: without the end symbol, the suffixes xa and a end in the middle of edges of the tree for xabxa; with $ appended, every suffix ends at its own leaf (leaves 1-5)]
47
Suffix tree implementation
  • Implicitly labeling edges
  • T = xabxa$

[Diagram: each edge label is stored as a pair of indices (start, end) into T rather than as an explicit string, so the tree takes O(m) space regardless of the total length of the edge labels]
48
Suffix links
  • Similar to the failure links in a keyword tree
  • Only internal nodes with branches are linked

[Diagram: in the tree for P = xabcf..., a suffix link points from the internal node whose path label is xα (a char x followed by a string α) to the internal node whose path label is α]
49-58
Suffix tree construction
[Animation over ten slides: the suffix tree for T = acatgacatt (positions 1-10) is built by inserting the suffixes starting at positions 1, 2, ..., 10 one at a time; internal nodes are created only where branches arise, and a terminal symbol marks suffix ends]
59
ST Application 1: pattern matching
  • Find all occurrences of P = xa in T
  • Find the node v in the ST whose path from the root spells P
  • Traverse the subtree rooted at v to get the locations

[Suffix tree for T = xabxac; the path xa leads to a node whose subtree contains leaves 1 and 4]

  • O(m) to construct the ST (large constant factor)
  • O(n) to find v: linear in the length of P instead of T!
  • O(k) to get all leaves, where k is the number of occurrences
  • Asymptotic time is the same as KMP. ST wins if T is fixed; KMP wins otherwise.

60
ST Application 2: set matching
  • Find all occurrences of a set of patterns in T
  • Build a ST from T
  • Match each P against the ST

[Suffix tree for T = xabxac; example pattern P = xab]

  • O(m) to construct the ST (large constant factor)
  • O(n) to find v: linear in the total length of the Ps
  • O(k) to get all leaves, where k is the number of occurrences
  • Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed; AC wins if the Ps are fixed. Otherwise it depends on their relative sizes.

61
ST application 3: repeats finding
  • Genomes contain many repeated DNA sequences
  • Repeat length varies from 1 nucleotide to millions
  • Genes may have multiple copies (50 to 10,000)
  • Highly repetitive DNA in some non-coding regions
  • 6 to 10 bp repeated 100,000 to 1,000,000 times
  • Problem: find all repeats that are at least k residues long and appear at least p times in the genome

62
Repeats finding
  • At least k residues long and appearing at least p times in the sequence
  • Phase 1: top-down, compute the label length L from the root to each node
  • Phase 2: bottom-up, count the number N of leaves descended from each internal node

For each node with L ≥ k and N ≥ p, print all of its leaves. O(m) to traverse the tree; each node stores the pair (L, N).
63
Maximal repeats finding
  • Right-maximal repeat:
  • S[i+1..i+k] = S[j+1..j+k],
  • but S[i+k+1] ≠ S[j+k+1]
  • Left-maximal repeat:
  • S[i+1..i+k] = S[j+1..j+k],
  • but S[i] ≠ S[j]
  • Maximal repeat:
  • S[i+1..i+k] = S[j+1..j+k],
  • but S[i] ≠ S[j] and S[i+k+1] ≠ S[j+k+1]

Example: acatgacatt
  • cat (right-maximal)
  • aca (left-maximal)
  • acat (maximal)

64
Maximal repeats finding
[The suffix tree for T = acatgacatt (positions 1-10), with label lengths and leaf counts at the internal nodes]

  • Find repeats that are at least 3 bases long and occur at least twice
  • Right-maximal: cat
  • Maximal: acat
  • Left-maximal: aca

65
Maximal repeats finding
[The same suffix tree, with each leaf annotated with its left char, i.e., the char immediately preceding that suffix in T (here: g, c, c, a, a)]

  • How do we find a maximal repeat?
  • It is a right-maximal repeat whose occurrences have different left chars

66
ST application 4: word enumeration
  • Find all k-mers that occur at least p times
  • Compute (L, N) for each node
  • L = total label length from the root to the node
  • N = number of leaves below the node
  • Find the nodes v with L ≥ k, L(parent) < k, and N ≥ p
  • Traverse the subtree rooted at v to get the locations

[Diagram: such a node v sits where the path length first reaches k; the k-mer is the first k chars of its path label]

This can be used in many applications, for example to find words that appear frequently in a genome or a document.
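
For a single value of k, the same enumeration can be done with a plain hash map; this is not the suffix-tree traversal itself, just a simple stand-in that returns the same answer (names are mine).

```python
from collections import defaultdict

def frequent_kmers(T, k, p):
    """All k-mers of T occurring at least p times, with their 1-based positions."""
    positions = defaultdict(list)
    for i in range(len(T) - k + 1):
        positions[T[i:i + k]].append(i + 1)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) >= p}

print(frequent_kmers("acatgacatt", k=3, p=2))   # {'aca': [1, 6], 'cat': [2, 7]}
```
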
67
Joint Suffix Tree (JST)
  • Build a single ST for two or more strings
  • Two strings S1 and S2
  • S = S1 + separator + S2
  • Build a suffix tree for S in time O(|S1| + |S2|)
  • The separator will only appear in edges ending at a leaf

68
Joint suffix tree example
  • S1 = abcd
  • S2 = abca
  • S = abcd·abca (with a separator between the two strings)

[Joint suffix tree for S: every leaf is labeled with a pair (sequence ID, suffix ID), e.g., (1,1)-(1,4) for the suffixes of S1 and (2,1)-(2,4) for those of S2; the portions of edge labels that run past the separator are useless]
69
To Simplify
[Diagram: the same joint suffix tree, redrawn with the useless label portions (everything past the separator) trimmed]

  • We don't really need to do anything, since all edge labels are implicit (index pairs).
  • The right-hand side is simply more convenient to look at.

70
Application 1 of JST
  • Longest common substring between two sequences
  • Using Smith-Waterman with gap and mismatch penalties of −∞: quadratic time
  • Using the JST: linear time
  • For each internal node v, keep a bit vector B
  • B[1] = 1 if the subtree of v contains a suffix of S1 (and similarly B[2] for S2)
  • Bottom-up: find all internal nodes with B[1] = B[2] = 1 (the green nodes)
  • Report a green node with the longest path label
  • Can be extended to k sequences; just use a longer bit vector

[Joint suffix tree for S1 = abcd, S2 = abca with the common internal nodes marked in green]
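
For comparison, the quadratic-time baseline mentioned above, written as a plain dynamic program over common suffixes (this is the slow alternative, not the linear-time JST method; names are mine):

```python
def longest_common_substring(s1, s2):
    """DP: L[i][j] = length of the longest common suffix of s1[:i] and s2[:j]."""
    best, end = 0, 0
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best:
                    best, end = cur[j], i
        prev = cur
    return s1[end - best:end]

print(longest_common_substring("abcd", "abca"))   # 'abc'
```
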
71
Application 2 of JST
  • Given K strings, find all k-mers that appear in at least (or at most) d strings
  • Exact motif finding problem

[Diagram: each node stores a bit vector B marking which sequences have a suffix in its subtree; a parent's vector is the bitwise OR of its children's, e.g., B = BitOR(1010, 0011) = 1011. Report nodes with path-label length L ≥ k, L(parent) < k, and cardinality(B) ≥ d (3 in the example); leaves are labeled (sequence ID, suffix position).]
72
Application 3 of JST
  • Substring problem for sequence databases
  • Given: a fixed database of sequences (e.g., individual genomes)
  • Given: a short pattern (e.g., a DNA signature)
  • Q: Does this DNA signature belong to any individual in the database?
  • i.e., is the pattern a substring of some sequence in the database?
  • Aho-Corasick doesn't work here: the patterns vary, so we want to pre-process the database instead
  • This can also be used to design signatures for individuals
  • Build a JST for the database seqs
  • Match P against the JST
  • Find the sequence IDs from the descendant leaves

[Joint suffix tree for the database seqs abcd, abca; example patterns P1 = cd, P2 = ac]
73
Application 4 of JST
  • Detect DNA contamination
  • When we clone and sequence a genome, DNA from other sources may contaminate the sample; this should be detected and removed
  • Given: a fixed database of sequences (e.g., possible contamination sources)
  • Given: a DNA sequence that was just sequenced
  • Q: Does this DNA contain a long enough substring from the seqs in the database?
  • Build a JST for the database seqs
  • Scan T using the JST

[Joint suffix tree for the contamination sources abcd, abca; example sequence to scan: dbcgaabctacgtctagt]
74
Summary
  • One T, one P
  • Boyer-Moore is the choice
  • KMP works but is not the best
  • One T, many P
  • Aho-Corasick
  • Suffix tree
  • One fixed T, many varying P
  • Suffix tree
  • Two or more Ts
  • Suffix tree, joint suffix tree

[Slide annotations mark which of these choices are alphabet independent and which are alphabet dependent]
75
Boyer Moore algorithm
  • Three ideas
  • Right-to-left comparison
  • Bad character rule
  • Good suffix rule

76
Boyer Moore algorithm
  • Right to left comparison

[Diagram: P is compared against T from right to left; after a mismatch at a text char x, P is shifted right and the comparison resumes at P's right end, skipping some chars of T without missing any occurrence]
77
Bad character rule
  • T = xpbctbxabpqqaabpq (positions 1-17)
  • P = tpabxab, compared right to left against T
  • What would you do now?

78
Bad character rule
  • T = xpbctbxabpqqaabpq (positions 1-17)
  • P = tpabxab
  • P = tpabxab, shifted to its next alignment by the bad character rule

79
Bad character rule
  • T = xpbctbxabpqqaabpqz (positions 1-18)
  • P = tpabxab
  • P = tpabxab, shifted once by the bad character rule
  • P = tpabxab, shifted again

80
Basic bad character rule
P = tpabxab

char   rightmost position in P
a      6
b      7
p      2
t      1
x      5

Pre-processing: O(n)
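
The table above is the only preprocessing the basic rule needs; a one-pass sketch (names are mine):

```python
def rightmost_positions(P):
    """For each char of P, its rightmost (1-based) position; O(n).
    Chars absent from P are treated as position 0 by the shift rule."""
    table = {}
    for i, ch in enumerate(P, start=1):
        table[ch] = i                     # later (more rightward) positions overwrite
    return table

print(rightmost_positions("tpabxab"))     # {'t': 1, 'p': 2, 'a': 6, 'b': 7, 'x': 5}
```
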
81
Basic bad character rule
T = xpbctbxabpqqaabpqz (the mismatched text char is T(k))
P = tpabxab

When the rightmost occurrence of T(k) in P is to the left of the mismatch position i, shift P so that T(k) aligns with that rightmost occurrence.
Here i = 3 and the rightmost position of T(k) in P is 1, so the shift is 3 − 1 = 2 (using the rightmost-position table above).
82
Basic bad character rule
T = xpbctbxabpqqaabpqz
P = tpabxab

When T(k) does not occur in P at all, shift the left end of P to align with T(k+1).
Here i = 7, so the shift is 7 − 0 = 7 (using the rightmost-position table above).
83
Basic bad character rule
T = xpbctbxabpqqaabpqz
P = tpabxab

When the rightmost occurrence of T(k) in P is to the right of the mismatch position i, just shift P by 1.
Here i = 5 and the rightmost position of T(k) is 6: 5 − 6 < 0, so the shift is 1 (using the rightmost-position table above).
84
Extended bad character rule
T = xpbctbxabpqqaabpqz
P = tpabxab

Find the occurrence of T(k) in P that is immediately to the left of i, and shift P to align T(k) with that position.
Here i = 5 and the nearest occurrence of T(k) to the left is at position 3, so the shift is 5 − 3 = 2.

char   positions in P
a      6, 3
b      7, 4
p      2
t      1
x      5

Preprocessing is still O(n)
85
Extended bad character rule
  • Best case: m/n comparisons
  • Works better for large alphabets
  • In some cases the extended bad character rule alone is sufficiently good
  • Worst case: O(mn)
  • Expected time is sub-linear
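
As a runnable illustration of the bad-character idea, here is the Horspool simplification of Boyer-Moore: right-to-left comparison with a shift keyed on the text char under the pattern's last position. This is a simplified variant, not the full algorithm with the good suffix rule; names are mine.

```python
def horspool_search(T, P):
    """Find all occurrences of P in T; sub-linear on average, O(mn) worst case."""
    m, n = len(T), len(P)
    shift = {ch: n - 1 - i for i, ch in enumerate(P[:-1])}   # distance to the last position
    hits, k = [], 0                                          # k = current alignment start in T
    while k <= m - n:
        j = n - 1
        while j >= 0 and T[k + j] == P[j]:                   # compare right to left
            j -= 1
        if j < 0:
            hits.append(k)
        k += shift.get(T[k + n - 1], n)                      # bad-character-style shift
    return hits

print(horspool_search("xpbctbxabpqqaabpqz", "tpabxab"))      # []  (no occurrence)
print(horspool_search("prstabstubabvqxrst", "ab"))           # [4, 10]
```
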

86
  • T = prstabstubabvqxrst (positions 1-18)
  • P = qcabdabdab
  • P = qcabdabdab, shifted

According to the extended bad character rule
87
(weak) good suffix rule
  • T = prstabstubabvqxrst (positions 1-18)
  • P = qcabdabdab
  • P = qcabdabdab, shifted by the (weak) good suffix rule

88
(Weak) good suffix rule
[Diagram: a suffix t of P matches T, but the chars just before it differ (x in T vs. y in P); P is shifted so that the rightmost other copy t' of t inside P aligns with the matched text chars]

Preprocessing: for any suffix t of P, find the rightmost copy of t inside P, denoted t'. How can t' be found efficiently?
89
(Strong) good suffix rule
  • T = prstabstubabvqxrst (positions 1-18)
  • P = qcabdabdab

90
(Strong) good suffix rule
  • T = prstabstubabvqxrst (positions 1-18)
  • P = qcabdabdab
  • P = qcabdabdab, shifted by the (strong) good suffix rule

91
(Strong) good suffix rule
[Diagram: a suffix t of P matches T, with a mismatch between x in T and y in P just before t; t' is a copy of t inside P preceded by a char z]

In preprocessing: for any suffix t of P, find the rightmost copy t' of t such that the char to the left of t' differs from the char to the left of t (z ≠ y), and shift P so that t' aligns with the matched text chars.
92
Example preprocessing
P = qcabdabdab

Bad char rule (extended):
char   positions in P
a      9, 6, 3
b      10, 7, 4
c      2
d      8, 5
q      1

Good suffix rule:
position   1  2  3  4  5  6  7  8  9  10
char       q  c  a  b  d  a  b  d  a  b
value      0  0  0  0  2  0  0  2  0  0

The bad char shift depends on T (on the mismatched text char); the good suffix shift does not depend on T. The largest shift given by either the (extended) bad char rule or the (strong) good suffix rule is used.
93
Time complexity of BM algorithm
  • Pre-processing can be done in linear time
  • With the strong good suffix rule, the worst case is O(m) if P is not in T
  • If P is in T, the worst case can be O(mn)
  • E.g., T = aa...a (m = 100), P = aa...a (n = 10)
  • unless a modification is used (Galil's rule)
  • The proofs are technical; we skip them.

94
How to actually do pre-processing?
  • Similar pre-processing is needed for KMP and B-M
  • Find matches between a suffix and a prefix
  • Both can be done in linear time
  • P is usually short, so even a more expensive pre-processing may still give an overall gain

[Diagram, KMP: for each position i, find the longest suffix t of P[1..i] that matches a prefix t' of P, with differing following chars; the values can be filled in for i = 2, 3, ... in a DP-like fashion]

[Diagram, B-M: for each suffix t of P, find its rightmost interior copy t']
95
Fundamental pre-processing
[Diagram: the prefix t' = P[1..Z_i] matches the substring t starting at position i, i.e., P[i..i+Z_i−1]; the chars that follow them, y and x, differ]

  • Z_i = length of the longest substring starting at i that matches a prefix of P
  • i.e., t = t', x ≠ y, Z_i = |t|
  • With the Z-values computed, we can get the preprocessing for both KMP and B-M in linear time
  • Example: P = aabcaabxaaz
  •          Z = 0 1 0 0 3 1 0 0 2 1 0
  • How can the Z-values be computed in linear time?

96
Computing Z in Linear time
We have already computed all Z-values up to k−1 and need to compute Z_k. We also know the start and end points, l and r, of the rightmost previous match (the Z-box): P[l..r] matches the prefix P[1..r−l+1].

[Diagram: because P[l..r] = P[1..r−l+1], position k inside the box corresponds to position k−l+1 in the prefix, so Z_{k−l+1} may tell us Z_k without any new comparisons]
97
Computing Z in Linear time
Case 1: the previous r is smaller than k, i.e., no previous match extends beyond k. Do explicit comparisons starting at k.

Case 2: k ≤ r and Z_{k−l+1} < r − k + 1. Then Z_k = Z_{k−l+1}, and no comparison is needed.

Case 3: k ≤ r and Z_{k−l+1} ≥ r − k + 1. Then Z_k ≥ r − k + 1, and explicit comparisons continue from the end of the box (position r + 1) onward.
  • No char inside the box is compared twice. At most
    one mismatch per iteration.
  • Therefore, O(n).
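
A Python sketch of the Z computation following the three cases above (1-based Z values, as on the slides; the example checks the aabcaabxaaz values). Names are mine.

```python
def z_values(P):
    """Z[i] = length of the longest substring starting at i that matches a prefix of P."""
    n = len(P)
    Z = [0] * (n + 1)                      # 1-based; Z[1] is left as 0
    l = r = 0                              # current Z-box P[l..r]
    for k in range(2, n + 1):
        if k > r:                          # case 1: outside any box, compare explicitly
            l = r = k
            while r <= n and P[r - 1] == P[r - k]:
                r += 1
            r -= 1
            Z[k] = r - k + 1
        elif Z[k - l + 1] < r - k + 1:     # case 2: copy the value, no comparison needed
            Z[k] = Z[k - l + 1]
        else:                              # case 3: extend explicitly past r
            q = r + 1
            while q <= n and P[q - 1] == P[q - k]:
                q += 1
            Z[k] = q - k
            l, r = k, q - 1
    return Z[1:]

print(z_values("aabcaabxaaz"))             # [0, 1, 0, 0, 3, 1, 0, 0, 2, 1, 0]
```
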

98
Z-preprocessing for B-M and KMP
[Diagram: the Z-box starting at position i has length Z_i and ends at j = i + Z_i − 1]

KMP: for each j, set sp(j + Z_j − 1) = Z_j.

B-M: use the Z-values backwards, i.e., computed on the reversed pattern.
  • Both KMP and B-M preprocessing can be done in O(n)