CS 3343 Analysis of Algorithms

- Lecture 26: String Matching Algorithms

Definitions

- Text: a longer string T, of length m
- Pattern: a shorter string P, of length n
- Exact matching: find all occurrences of P in T

The naïve algorithm

[Figure: P (length n) is compared against T (length m) at each alignment, shifting one char at a time.]

Time complexity

- Worst case: O(mn)
- Best case: O(m)
- e.g., aaaaaaaaaaaaaa vs. baaaaaaa
- Average case?
- Alphabet size: k
- Assume each char is equally likely
- How many chars do you need to compare before finding a mismatch?
- On average, k / (k-1)
- Therefore the average-case complexity is mk / (k-1)
- For a large alphabet, this is approximately m
- Not as bad as you thought, huh?
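The naïve algorithm analyzed above can be sketched as follows. This is a minimal illustration, not code from the lecture; the function name `naive_search` and the 0-based positions are assumptions made for the example.

```python
def naive_search(text, pattern):
    """Return the 0-based start of every occurrence of pattern in text."""
    m, n = len(text), len(pattern)
    hits = []
    for start in range(m - n + 1):
        j = 0
        # Compare left to right until a mismatch or the end of the pattern.
        while j < n and text[start + j] == pattern[j]:
            j += 1
        if j == n:
            hits.append(start)
    return hits


print(naive_search("ACGTAXACXTAXACGXAX", "TAX"))  # [3, 9]
```

Each of the m - n + 1 alignments costs at most n comparisons, which gives the O(mn) worst case; when the first char almost always mismatches, the cost stays close to m.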

Real strings are not random

- T: aaaaaaaaaaaaaaaaaaaaaaaaa
- P: aaaab
- Plus, O(m) average case is still bad for long strings!
- Smarter algorithms
- O(m + n) in worst case
- sub-linear in practice
- how is this possible?

How to speedup?

- Pre-processing T or P
- Why can pre-processing save us time?
- It uncovers the structure of T or P
- It determines when we can skip ahead without missing anything
- It determines when we can infer the result of character comparisons without actually doing them

ACGTAXACXTAXACGXAX

ACGTACA

Cost for exact string matching

- Total cost = cost(preprocessing) + cost(comparison) + cost(output)
- cost(preprocessing): overhead
- cost(comparison): minimize
- cost(output): constant
- Hope: gain > overhead

String matching scenarios

- One T and one P
- Search for a word in a document
- One T and many P, all at once
- Search for a set of words in a document
- Spell checking
- One fixed T, many P
- Search a completed genome for a short sequence
- Two (or many) Ts for common patterns
- Would you preprocess P or T?
- Always pre-process the shorter seq, or the one that is repeatedly used

Pattern pre-processing algs

- Karp-Rabin algorithm
- Small alphabet and small pattern
- Boyer-Moore algorithm
- The choice in most cases
- Typically sub-linear time
- Knuth-Morris-Pratt algorithm (KMP)
- Aho-Corasick algorithm
- The algorithm for the unix utility fgrep
- Suffix tree
- One of the most useful preprocessing techniques
- Many applications

Algorithm KMP

- Not the fastest
- Best known
- Good for real-time matching
- i.e. text comes one char at a time
- No memory of previous chars
- Idea
- Left-to-right comparison
- Shift P more than one char whenever possible

Intuitive example 1

[Figure: P = abcxabcde aligned under T. The first seven chars, abcxabc, match, and a mismatch occurs at P[8]. The naïve approach shifts P by one char and starts over.]

- Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches.
- Number of comparisons saved: 6

Intuitive example 2

[Figure: P = abcxabcde aligned under T, with a mismatch at P[7]. The naïve approach shifts P by one char and starts over.]

- Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1], without missing any possible matches.
- Number of comparisons saved: 7

KMP algorithm pre-processing

- Key: the reasoning is done without even knowing what string T is.
- Only the location of the mismatch in P must be known.

[Figure: T contains a substring t followed by char x. P, aligned under T, has t = P[j..i] as a suffix of P[1..i], followed by char y, and a copy t' of t as a prefix, followed by char z.]

Pre-processing: for any position i in P, find the longest proper suffix of P[1..i], t = P[j..i], such that t matches a prefix of P, t', and the next char of t is different from the next char of t' (i.e., y ≠ z). For each i, let sp(i) = length(t).
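The sp(i) values defined above can be computed from the pattern alone. The sketch below is one possible way to do it, assuming the usual two-pass approach: first the ordinary longest-border lengths, then skipping any border whose following char equals the char after position i. The function name `sp_prime` and the 0-based list layout (entry i-1 holds sp(i)) are choices made for this example.

```python
def sp_prime(pattern):
    """Return a list whose entry i-1 holds sp(i) for i = 1..n."""
    n = len(pattern)
    # border[i]: length of the longest proper suffix of pattern[:i+1]
    # that is also a prefix of pattern (no condition on the next char).
    border = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = border[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        border[i] = k

    strong = [0] * n
    for i in range(n):
        k = border[i]
        # Reject a candidate when the char after the prefix copy (z)
        # equals the char after the suffix copy (y).
        while k > 0 and i + 1 < n and pattern[k] == pattern[i + 1]:
            k = border[k - 1]
        strong[i] = k
    return strong


print(sp_prime("aataac"))   # [0, 1, 0, 0, 2, 0]
print(sp_prime("abababc"))  # [0, 0, 0, 0, 0, 4, 0]
```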

KMP algorithm shift rule

[Figure: the same alignment before and after the shift. After shifting P to the right by i - sp(i) chars, the prefix t' lies under the copy of t in T, and z = P[sp(i)+1] lies under x.]

Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i - sp(i) chars and compare x with z. This shift rule can be implicitly represented by creating a failure link between y and z. Meaning: when a mismatch occurs between x on T and P[i+1], resume comparison between x and P[sp(i)+1].

Failure Link Example

- P: aataac

If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1).

i      1 2 3 4 5 6
P[i]   a a t a a c
sp(i)  0 1 0 0 2 0

- sp(2) = 1 because aa ≠ at
- sp(5) = 2 because aat ≠ aac

Another example

- P: abababc

If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1).

i      1 2 3 4 5 6 7
P[i]   a b a b a b c
sp(i)  0 0 0 0 0 4 0

- sp(6) = 4 because ababa ≠ ababc
- sp(3) = 0 because ab = ab
- sp(5) = 0 because abab = abab

KMP Example using Failure Link

[Figure: failure links for P = aataac, and P sliding along T as the links are followed.]

T: aacaataaaaataaccttacta
P: aataac

- Time complexity analysis
- Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
- More careful analysis: the comparisons can be broken into two phases
- Comparison phase: the first time a char in T is compared to P. Total is exactly m.
- Shift phase: the first comparisons made after a shift. Total is at most m.
- Time complexity: O(2m)
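The 2m bound can be checked empirically with a small failure-link search that counts comparisons. This is a sketch under stated assumptions: it uses the ordinary failure function rather than the stricter sp(i) from the slides (the bound holds for both), the pattern is assumed non-empty, and the name `kmp_search` is hypothetical.

```python
def kmp_search(text, pattern):
    """Return (0-based match starts, number of char comparisons)."""
    n = len(pattern)
    fail = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    hits, comparisons, k = [], 0, 0
    for pos, ch in enumerate(text):
        while True:
            comparisons += 1
            if ch == pattern[k]:
                k += 1          # match: move on to the next text char
                break
            if k == 0:
                break           # mismatch at P[1]: move on as well
            k = fail[k - 1]     # follow the failure link, same text char
        if k == n:
            hits.append(pos - n + 1)
            k = fail[k - 1]
    return hits, comparisons


text = "aacaataataataaccttacta"
hits, comparisons = kmp_search(text, "aataac")
print(hits, comparisons <= 2 * len(text))
```

Every comparison either moves to the next text char (at most m of these) or follows a failure link, and a link can only undo progress that an earlier match created, so there are at most m of those too.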

KMP algorithm using DFA (Deterministic Finite

Automata)

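- P: aataac

Failure link: if a char in T fails to match at pos 6, re-compare it with the char at pos 3.

DFA: if the next char in T is t after matching 5 chars, go to state 3.

[Figure: the failure-link chain for aataac and, below it, the DFA with states 0 to 6. Besides the forward edges a, a, t, a, a, c, the DFA has extra edges such as the t edge from state 5 to state 3.]

All other inputs go to state 0.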

DFA Example

[Figure: the same DFA for P = aataac.]

T:     aacaataataataaccttacta
State: 1201234534534560001001

Each char in T is examined exactly once. Therefore, exactly m comparisons are made. But it takes longer to do pre-processing, and needs more space to store the DFA.
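The state sequence above can be reproduced with a small table-driven automaton. The sketch below assumes the standard construction in which each state copies the row of its "restart" state and then overrides the matching char; the names `build_dfa` and `run_dfa`, and the explicit `alphabet` argument, are choices made for this example. Text chars outside the alphabet are not handled.

```python
def build_dfa(pattern, alphabet):
    """dfa[state][ch] = next state; state = number of chars matched."""
    n = len(pattern)
    dfa = [{ch: 0 for ch in alphabet} for _ in range(n + 1)]
    dfa[0][pattern[0]] = 1
    restart = 0  # state the automaton would be in after a one-char shift
    for state in range(1, n + 1):
        for ch in alphabet:
            dfa[state][ch] = dfa[restart][ch]      # mismatch transitions
        if state < n:
            dfa[state][pattern[state]] = state + 1  # match transition
            restart = dfa[restart][pattern[state]]
    return dfa


def run_dfa(dfa, text):
    """Return the state reached after each text char."""
    state, trace = 0, []
    for ch in text:
        state = dfa[state][ch]
        trace.append(state)
    return trace


dfa = build_dfa("aataac", "act")
trace = run_dfa(dfa, "aacaataataataaccttacta")
print("".join(map(str, trace)))  # 1201234534534560001001
```

The table has (n + 1) rows of one entry per alphabet char, which is where the extra space and the alphabet-dependent preprocessing time come from.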

Difference between Failure Link and DFA

- Failure link
- Preprocessing time and space are O(n), regardless of alphabet size
- Comparison time is at most 2m (at least m)
- DFA
- Preprocessing time and space are O(n|Σ|), where |Σ| is the alphabet size
- May be a problem for very large alphabet size
- For example, each char is a big integer
- Chinese characters
- Comparison time is always m.

The set matching problem

- Find all occurrences of a set of patterns in T
- First idea: run KMP or BM for each P
- O(km + n)
- k: number of patterns
- m: length of text
- n: total length of patterns
- Better idea: combine all patterns together and search in one run

A simpler problem spell-checking

- A dictionary contains five words
- potato
- poetry
- pottery
- science
- school
- Given a document, check if any word is (not) in the dictionary
- Words in the document are separated by special chars.
- Relatively easy.

Keyword tree for spell checking

Example document: This version of the potato gun was inspired by the Weird Science team out of Illinois

[Figure: keyword tree for the five dictionary words. Words that share a prefix share a path from the root, and the leaf that ends each word is numbered 1 to 5.]

- O(n) time to construct. n: total length of patterns.
- Search time: O(m). m: length of text
- A common prefix only needs to be compared once.
- What if there is no space between words?
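A keyword tree for the spell-checking case can be sketched with nested dictionaries. This is an illustrative sketch only: the names `build_keyword_tree` and `lookup` are hypothetical, and the "$" key used as an end-of-word marker is an assumption that relies on "$" never occurring inside a word.

```python
def build_keyword_tree(words):
    """Insert each word char by char; shared prefixes share nodes."""
    root = {}
    for number, word in enumerate(words, start=1):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = number  # marks the end of word `number`
    return root


def lookup(root, word):
    """Return the word's number if it is in the tree, else None."""
    node = root
    for ch in word:
        if ch not in node:
            return None
        node = node[ch]
    return node.get("$")


tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
doc = "This version of the potato gun was inspired by the Weird Science team out of Illinois"
found = [w for w in doc.lower().split() if lookup(tree, w)]
print(found)  # ['potato', 'science']
```

Because the words are delimited, each lookup starts at the root and touches each char of the document once, matching the O(m) search time. Without delimiters a match could start anywhere, which is the case the failure links of Aho-Corasick address.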

Aho-Corasick algorithm

- Basis of the fgrep algorithm
- Generalizes KMP
- Uses failure links
- Example: given the following 4 patterns
- potato
- tattoo
- theater
- other

Keyword tree

[Figure: keyword tree for the four patterns potato, tattoo, theater, and other. The root is node 0, and the node that ends each pattern is labeled with a number from 1 to 4.]

Text to search: potherotathxythopotattooattoo

- Without failure links, the search restarts from the root at every text position: O(mn)
- m: length of text. n: length of longest pattern

Keyword tree with failure links

[Figure: the same keyword tree, first with a single failure link and then with all failure links added, followed by a step-by-step trace of the search.]

Text to search: potherotathxythopotattooattoo

Aho-Corasick algorithm

- O(n) preprocessing, and O(m + k) searching.
- n: total length of patterns.
- m: length of text
- k: number of occurrences.
- Can create a DFA, similar to the one in KMP.
- Requires more space
- Preprocessing time depends on alphabet size
- Search time per text char is constant
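The keyword tree with failure links can be sketched in a few dozen lines. This is a minimal illustration using the failure-link version (not the DFA version); the names `build_automaton` and `search`, the parallel-list node layout, and the 0-based positions are assumptions made for this example.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree as parallel lists, plus failure links set by BFS."""
    goto, fail, out = [{}], [0], [[]]
    for index, pat in enumerate(patterns, start=1):
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(index)

    queue = deque(goto[0].values())  # depth-1 nodes fail to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Patterns ending at the failure target also end here.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, patterns):
    """Return (0-based start, pattern) for every occurrence."""
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for pos, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for index in out[node]:
            pat = patterns[index - 1]
            hits.append((pos - len(pat) + 1, pat))
    return hits


patterns = ["potato", "tattoo", "theater", "other"]
print(search("potherotathxythopotattooattoo", patterns))
# [(1, 'other'), (18, 'tattoo')]
```

As in KMP, the text pointer never moves backwards: on a mismatch the search follows failure links instead of restarting from the root.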

Suffix Tree

- All algorithms we talked about so far preprocess the pattern(s)
- Karp-Rabin: small pattern, small alphabet
- Boyer-Moore: fastest in practice. O(m) worst case.
- KMP: O(m)
- Aho-Corasick: O(m)
- In some cases we may prefer to pre-process T
- Fixed T, varying P
- Suffix tree: basically a keyword tree of all suffixes

Suffix tree

- T: xabxac
- Suffixes
- xabxac
- abxac
- bxac
- xac
- ac
- c

[Figure: suffix tree for xabxac. Each leaf is labeled with the starting position (1 to 6) of its suffix.]

- Naïve construction: O(m^2) using Aho-Corasick.
- Smarter: O(m). Very technical. Big constant factor.
- Difference from a keyword tree: create an internal node only when there is a branch

Suffix tree implementation

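- Explicitly labeling seq end
- T: xabxa, then T: xabxa plus an end marker (conventionally $)

[Figure: the suffix tree for xabxa before and after the end marker is appended. Without the marker only suffixes 1, 2, and 3 end at a leaf; with it, suffixes 4 and 5 get leaves of their own.]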

Suffix tree implementation

- Implicitly labeling edges
- T: xabxa

[Figure: the same tree drawn twice, once with edge labels spelled out and once with each label replaced by the positions in T that delimit it.]

Suffix links

- Similar to failure link in a keyword tree
- Only link internal nodes having branches

[Figure: a suffix tree fragment with suffix links drawn between internal nodes.]

Suffix tree construction

1234567890
acatgacatt

[Figure: the suffix tree for acatgacatt built step by step, adding the leaves for suffixes 1 through 10 one at a time.]

ST Application 1 pattern matching

- Find all occurrences of P = xa in T
- Find node v in the ST that matches P
- Traverse the subtree rooted at v to get the locations

[Figure: suffix tree for T = xabxac. The node v reached by spelling xa has leaves 1 and 4 below it.]

- O(m) to construct the ST (large constant factor)
- O(n) to find v: linear in the length of P instead of T!
- O(k) to get all leaves, where k is the number of occurrences.
- Asymptotic time is the same as KMP. ST wins if T is fixed. KMP wins otherwise.
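The two-step search (walk down to v, then collect the leaves below it) can be tried on a simplified structure. The sketch below builds an uncompressed suffix trie by inserting every suffix, which takes O(m^2) time and space, so it is not the linear-time suffix tree from the lecture; it only illustrates the query. The names `build_suffix_trie` and `find_all`, the "$" end marker, and the "leaf" key are assumptions made for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a nested-dict trie."""
    text += "$"  # end marker, assumed not to occur in text
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1  # 1-based suffix number, as on the slides
    return root


def find_all(root, pattern):
    """Return the sorted 1-based positions where pattern occurs."""
    node = root
    for ch in pattern:          # step 1: find the node v that matches P
        if ch not in node:
            return []
        node = node[ch]
    hits, stack = [], [node]
    while stack:                # step 2: collect the leaves below v
        node = stack.pop()
        for key, child in node.items():
            if key == "leaf":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)


trie = build_suffix_trie("xabxac")
print(find_all(trie, "xa"))  # [1, 4]
```

Step 1 depends only on the length of P, which is why a fixed T with many varying patterns favors this approach.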

ST Application 2 set matching

- Find all occurrences of a set of patterns in T
- Build a ST from T
- Match each P to the ST

[Figure: suffix tree for T = xabxac, searched for P = xab.]

- O(m) to construct the ST (large constant factor)
- O(n) to find v: linear in the total length of the Ps
- O(k) to get all leaves, where k is the number of occurrences.
- Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed. AC wins if the Ps are fixed. Otherwise it depends on their relative sizes.

ST application 3 repeats finding

- Genome contains many repeated DNA sequences
- Repeat sequence length: varies from 1 nucleotide to millions
- Genes may have multiple copies (50 to 10,000)
- Highly repetitive DNA in some non-coding regions
- 6 to 10 bp x 100,000 to 1,000,000 times
- Problem: find all repeats that are at least k residues long and appear at least p times in the genome

Repeats finding

- Find repeats that are at least k residues long and appear at least p times in the seq
- Phase 1: top-down, count label lengths (L) from the root to each node
- Phase 2: bottom-up, count the leaves (N) descended from each internal node
- For each node with L ≥ k and N ≥ p, print all leaves
- O(m) to traverse the tree

[Figure: suffix tree with each node annotated with its (L, N) pair.]

Maximal repeats finding

- Right-maximal repeat
- S[i+1..i+k] = S[j+1..j+k],
- but S[i+k+1] ≠ S[j+k+1]
- Left-maximal repeat
- S[i+1..i+k] = S[j+1..j+k]
- but S[i] ≠ S[j]
- Maximal repeat
- S[i+1..i+k] = S[j+1..j+k]
- but S[i] ≠ S[j], and S[i+k+1] ≠ S[j+k+1]

Example: acatgacatt

- cat
- aca
- acat
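The three definitions can be checked by brute force on the example string. This sketch does not use a suffix tree; it simply lists the occurrences of a candidate substring and looks at the chars on either side. The function name `classify_repeat` and the sentinels "^" and "$" for the string boundaries are assumptions made for this example.

```python
def classify_repeat(s, sub):
    """Return (left_maximal, right_maximal) for substring sub of s."""
    starts = [i for i in range(len(s) - len(sub) + 1) if s.startswith(sub, i)]
    if len(starts) < 2:
        return (False, False)  # not a repeat at all
    # Chars just before and just after each occurrence; the string
    # boundaries count as chars that differ from everything else.
    left = {s[i - 1] if i > 0 else "^" for i in starts}
    right = {s[i + len(sub)] if i + len(sub) < len(s) else "$" for i in starts}
    return (len(left) > 1, len(right) > 1)


for sub in ["cat", "aca", "acat"]:
    print(sub, classify_repeat("acatgacatt", sub))
# cat (False, True)
# aca (True, False)
# acat (True, True)
```

So cat is right-maximal only, aca is left-maximal only, and acat is maximal, since it can be extended in neither direction without losing an occurrence.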

Maximal repeats finding

1234567890
acatgacatt

[Figure: suffix tree for acatgacatt with implicit edge labels and leaves numbered 1 to 10.]

- Find repeats with at least 3 bases and 2 occurrences
- right-maximal: cat
- Maximal: acat
- left-maximal: aca

Maximal repeats finding

1234567890
acatgacatt

[Figure: the same suffix tree, with leaves annotated with their left char, the char that precedes the suffix in the string.]

- How to find a maximal repeat?
- A right-maximal repeat whose occurrences have different left chars

ST application 4 word enumeration

- Find all k-mers that occur at least p times
- Compute (L, N) for each node
- L: total label length from the root to the node
- N: number of leaves
- Find nodes v with L ≥ k, L(parent) < k, and N ≥ p
- Traverse the sub-tree rooted at v to get the locations

[Figure: a path from the root crossing depth k. Nodes above have L < k; the first node with L ≥ k and N ≥ p is selected.]

This can be used in many applications. For example, to find words that appear frequently in a genome or a document.

Joint Suffix Tree

- Build a ST for two or more strings
- Two strings S1 and S2
- S = S1, followed by a separator char, followed by S2
- Build a suffix tree for S in time O(|S1| + |S2|)
- The separator will only appear in an edge ending in a leaf

- S1: abcd
- S2: abca
- S: abcd and abca joined by the separator

[Figure: suffix tree for S. Each leaf is labeled with a pair (string number, start position), such as 1,4 or 2,3. The part of a leaf edge that follows the separator is marked useless.]

To Simplify

[Figure: the tree before and after the leaf-edge labels are cut at the separator.]

- We don't really need to do anything, since all edge labels were implicit.
- The right-hand side is more convenient to look at

Application of JST

- Longest common substring (not subsequence)
- For each internal node v, keep a bit vector B
- B[1] = 1 if a leaf below v is a suffix of S1
- Find all internal nodes with B[1] = B[2] = 1
- Report the one with the longest label
- Can be extended to k sequences. Just use a longer bit vector.
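The bit-vector idea can be tried on a simplified joint structure. The sketch below uses an uncompressed trie of all suffixes of all strings (quadratic size, so not the joint suffix tree itself), and a Python set of string ids stands in for the bit vector B. The function name `longest_common_substring` and the node layout are assumptions made for this example.

```python
def longest_common_substring(strings):
    """Return one longest substring shared by every input string."""
    root = {"ids": set(), "kids": {}}
    for sid, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["kids"].setdefault(ch, {"ids": set(), "kids": {}})
                node["ids"].add(sid)  # plays the role of setting B[sid] = 1

    best, stack = "", [(root, "")]
    while stack:
        node, label = stack.pop()
        for ch, child in node["kids"].items():
            # Only descend while every string is represented below.
            if len(child["ids"]) == len(strings):
                if len(label) + 1 > len(best):
                    best = label + ch
                stack.append((child, label + ch))
    return best


print(longest_common_substring(["abcd", "abca"]))  # abc
```

When several common substrings tie for the longest length, which one is returned depends on traversal order.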

[Figure: the simplified joint suffix tree for S1 = abcd and S2 = abca, with leaves labeled (string number, start position).]

Application of JST

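- Given K strings, find all k-mers that appear in at least d strings

[Figure: a path in the joint suffix tree crossing depth k. The parent has L < k; the selected node has L ≥ k and bit vector B = (1, 0, 1, 1) with cardinal(B) ≥ d; the leaves below it belong to strings 4, 1, 3, and 3.]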

Many other applications

- Reproduce the behavior of Aho-Corasick
- Recognizing computer virus
- A database of known computer viruses
- Does a file contain virus?
- DNA fingerprinting
- A database of people's DNA sequences
- Given a short DNA, which person is it from?
- Catch
- Large constant factor for space requirement
- Large constant factor for construction
- Suffix array: trades time for space

Summary

- One T, one P
- Boyer-Moore is the choice
- KMP works but not the best
- One T, many P
- Aho-Corasick
- Suffix Tree
- One fixed T, many varying P
- Suffix tree
- Two or more Ts
- Suffix tree, joint suffix tree, suffix array

Alphabet independent

Alphabet dependent

Karp-Rabin algorithm

- Let's say we are dealing with binary numbers
- Text: 01010001011001010101001
- Pattern: 101100
- Convert the pattern to an integer
- 101100 = 2^5 + 2^3 + 2^2 = 44

Karp-Rabin algorithm

- Text: 01010001011001010101001
- Pattern: 101100 = 44 decimal
- Example on the text 10111011001010101001, showing the 6-bit window and its value after each shift:
- 101110: 2^5 + 0 + 2^3 + 2^2 + 2^1 = 46
- 011101: 46 × 2 - 64 + 1 = 29
- 111011: 29 × 2 - 0 + 1 = 59
- 110110: 59 × 2 - 64 + 0 = 54
- 101100: 54 × 2 - 64 + 0 = 44

Θ(m + n)

Karp-Rabin algorithm

- What if the pattern is too long to fit into a single integer?
- Pattern: 101100. What if each word in our computer has only 4 bits?
- Basic idea: hashing. 44 mod 13 = 5
- 101110: 46 (mod 13 = 7)
- 011101: 46 × 2 - 64 + 1 = 29 (mod 13 = 3)
- 111011: 29 × 2 - 0 + 1 = 59 (mod 13 = 7)
- 110110: 59 × 2 - 64 + 0 = 54 (mod 13 = 2)
- 101100: 54 × 2 - 64 + 0 = 44 (mod 13 = 5)

Θ(m + n) expected running time
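The rolling update with hashing can be sketched as follows. This is a minimal illustration for strings of digits such as the binary example; the function name `karp_rabin` and the default `base=2`, `modulus=13` mirror the slide values but are otherwise assumptions. A hash match is verified by a direct comparison, since different windows can share a hash value.

```python
def karp_rabin(text, pattern, base=2, modulus=13):
    """Return the 0-based start of every occurrence of pattern in text."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return []
    high = pow(base, n - 1, modulus)  # weight of the window's leading digit
    target = window = 0
    for i in range(n):
        target = (target * base + int(pattern[i])) % modulus
        window = (window * base + int(text[i])) % modulus

    hits = []
    for start in range(m - n + 1):
        # Equal hashes may be a collision, so confirm char by char.
        if window == target and text[start:start + n] == pattern:
            hits.append(start)
        if start + n < m:
            # Drop the leading digit, shift, and append the next digit.
            window = ((window - int(text[start]) * high) * base
                      + int(text[start + n])) % modulus
    return hits


print(karp_rabin("10111011001010101001", "101100"))  # [4]
```

With a small modulus such as 13, collisions are frequent (46 and 59 both hash to 7 above), so the verification step matters; a larger modulus makes collisions rare and keeps the expected running time linear.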

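Boyer-Moore algorithm

- Three ideas
- Right-to-left comparison
- Bad character rule
- Good suffix rule

Boyer-Moore algorithm

- Right-to-left comparison

[Figure: P is compared against T from right to left until char y in P mismatches char x in T; P is then shifted to the right.]

- Skip some chars without missing any occurrence.
- But how?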

Bad character rule

- T: xpbctbxabpqqaabpqz (positions 1 to 18)
- P: tpabxab
- What would you do now?

[Figure: successive frames show P shifted to the right along T.]

Basic bad character rule

P: tpabxab

char   Right-most position in P
a      6
b      7
p      2
t      1
x      5

Pre-processing: O(n)

Basic bad character rule

T: xpbctbxabpqqaabpqz
P: tpabxab

When the rightmost T(k) in P is to the left of i, shift pattern P to align T(k) with the rightmost T(k) in P.

- i = 3, and T(k) = t, whose rightmost position in P is 1
- Shift = 3 - 1 = 2

Basic bad character rule

T: xpbctbxabpqqaabpqz
P: tpabxab

When T(k) is not in P, shift the left end of P to align with T(k+1).

- i = 7
- Shift = 7 - 0 = 7

Basic bad character rule

T: xpbctbxabpqqaabpqz
P: tpabxab

When the rightmost T(k) in P is to the right of i, shift pattern P by one pos.

- i = 5, and T(k) = a, whose rightmost position in P is 6
- 5 - 6 < 0, so shift 1

Extended bad character rule

T: xpbctbxabpqqaabpqz
P: tpabxab

Find the occurrence of T(k) in P that is immediately to the left of i, and shift P to align T(k) with that position.

- i = 5
- 5 - 3 = 2, so shift 2

char   Positions in P
a      6, 3
b      7, 4
p      2
t      1
x      5

Preprocessing: still O(n)
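The extended bad character rule can be sketched on its own, without the good suffix rule. This is an illustrative sketch, not a full Boyer-Moore implementation; the names `bad_char_positions` and `bm_bad_char_search` are hypothetical, pattern positions are 1-based as on the slides, and reported match starts are 0-based.

```python
def bad_char_positions(pattern):
    """Map each char to its 1-based positions in pattern, rightmost first."""
    table = {}
    for pos in range(len(pattern), 0, -1):
        table.setdefault(pattern[pos - 1], []).append(pos)
    return table


def bm_bad_char_search(text, pattern):
    """Right-to-left comparison with extended bad character shifts only."""
    table = bad_char_positions(pattern)
    m, n = len(text), len(pattern)
    hits, start = [], 0
    while start <= m - n:
        i = n  # 1-based position in the pattern, scanned right to left
        while i > 0 and pattern[i - 1] == text[start + i - 1]:
            i -= 1
        if i == 0:
            hits.append(start)
            start += 1
        else:
            bad = text[start + i - 1]
            # Closest occurrence of the bad char to the left of i, or 0
            # if there is none (then P moves entirely past the bad char).
            left = next((p for p in table.get(bad, []) if p < i), 0)
            start += i - left
    return hits


print(bad_char_positions("tpabxab"))
print(bm_bad_char_search("xxtpabxabxx", "tpabxab"))  # [2]
```

The shift i - left is always at least 1, so the search always makes progress; the position lists are scanned linearly here for simplicity.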

Extended bad character rule

- Best possible: m / n comparisons
- Works better for large alphabet size
- In some cases the extended bad character rule is sufficiently good
- Worst case: O(mn)
- What else can we do?

- T: prstabstubabvqxrst (positions 1 to 18)
- P: qcabdabdab

[Figure: P before and after the shift given by the extended bad character rule.]

(Weak) good suffix rule

- T: prstabstubabvqxrst (positions 1 to 18)
- P: qcabdabdab

[Figure: T contains t preceded by the mismatched char x. P has suffix t preceded by y, and an earlier copy t' of t. After the shift, t' lies under t in T.]

Preprocessing: for any suffix t of P, find the rightmost copy of t, denoted by t'. How to find t' efficiently?

(Strong) good suffix rule

- T: prstabstubabvqxrst (positions 1 to 18)
- P: qcabdabdab

[Figure: T contains t preceded by x. P has suffix t preceded by y, and an earlier copy t' preceded by z, with z ≠ y. After the shift, t' lies under t in T.]

In preprocessing: for any suffix t of P, find the rightmost copy of t, t', such that the char to the left of t ≠ the char to the left of t'.

- Pre-processing can be done in linear time
- If P is in T, searching may take O(mn)
- If P is not in T, searching in the worst case is O(m + n)

Example preprocessing

P: qcabdabdab

Bad char rule (where to shift depends on T)

char   Positions in P
a      9, 6, 3
b      10, 7, 4
c      2
d      8, 5
q      1

Good suffix rule (does not depend on T)

pos    1 2 3 4 5 6 7 8 9 10
char   q c a b d a b d a b
value  0 0 0 0 0 0 0 2 0 0

- dab ≠ cab