# CS 3343: Analysis of Algorithms

## CS 3343: Analysis of Algorithms

### CS 3343: Analysis of Algorithms, Lecture 26: String Matching Algorithms

Slides: 90
Provided by: Jianh155
Transcript and Presenter's Notes

Title: CS 3343: Analysis of Algorithms

1
CS 3343: Analysis of Algorithms
• Lecture 26: String Matching Algorithms

2
Definitions
• Text: a longer string T, of length m
• Pattern: a shorter string P, of length n
• Exact matching: find all occurrences of P in T
3
The naïve algorithm
[Figure: P (length n) aligned against T (length m)]
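The naïve algorithm tries every alignment of P against T and compares char by char. A minimal sketch of that idea follows; the function name `naive_match` and the 0-based result positions are illustrative choices, not taken from the slides.

```python
def naive_match(text, pattern):
    """Return the 0-based start of every occurrence of pattern in text."""
    m, n = len(text), len(pattern)
    hits = []
    for start in range(m - n + 1):          # try every alignment of P in T
        offset = 0
        # Compare left to right until a mismatch or the end of P.
        while offset < n and text[start + offset] == pattern[offset]:
            offset += 1
        if offset == n:                     # all n chars matched
            hits.append(start)
    return hits
```

For example, `naive_match("aaaa", "aa")` reports the overlapping occurrences at 0, 1 and 2.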
4
Time complexity
• Worst case: O(mn)
• Best case: O(m)
• e.g. aaaaaaaaaaaaaa vs. baaaaaaa
• Average case?
• Alphabet size k
• Assume each char is equally probable
• How many chars do you need to compare before finding a mismatch?
• On average: k / (k-1)
• Therefore the average-case complexity is mk / (k-1)
• For a large alphabet, this is about m
• Not as bad as you thought, huh?
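The k / (k-1) figure can be checked with a short derivation. This is a sketch under two assumptions that go beyond the slide: X denotes the number of comparisons made at one alignment, and the pattern is long enough that running off its end can be ignored. With uniformly random chars, the first i comparisons all match with probability (1/k)^i, so:

```latex
\[
E[X] \;=\; \sum_{i \ge 0} \Pr[X > i]
      \;=\; \sum_{i \ge 0} \left(\frac{1}{k}\right)^{i}
      \;=\; \frac{1}{1 - 1/k}
      \;=\; \frac{k}{k-1}
\]
```

Multiplying by the roughly m alignments gives the mk / (k-1) total quoted above.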

5
Real strings are not random
• T: aaaaaaaaaaaaaaaaaaaaaaaaa
• P: aaaab
• Plus, an O(m) average case is still bad for long strings!
• Smarter algorithms
• O(m + n) in the worst case
• Sub-linear in practice
• How is this possible?

6
How to speed up?
• Pre-processing T or P
• Why can pre-processing save us time?
• Uncovers the structure of T or P
• Determines when we can skip ahead without missing
anything
• Determines when we can infer the result of
character comparisons without actually doing them.

ACGTAXACXTAXACGXAX
ACGTACA
7
Cost for exact string matching
• Total cost = cost(preprocessing) + cost(comparison) + cost(output)
• cost(preprocessing): overhead; hope gain > overhead
• cost(comparison): minimize
• cost(output): constant
8
String matching scenarios
• One T and one P
• Search a word in a document
• One T and many P all at once
• Search a set of words in a document
• Spell checking
• One fixed T, many P
• Search a completed genome for a short sequence
• Two (or many) Ts for common patterns
• Would you preprocess P or T?
• Always pre-process the shorter seq, or the one
that is repeatedly used

9
Pattern pre-processing algs
• Karp Rabin algorithm
• Small alphabet and small pattern
• Boyer Moore algorithm
• The choice of most cases
• Typically sub-linear time
• Knuth-Morris-Pratt algorithm (KMP)
• Aho-Corasick algorithm
• The algorithm for the unix utility fgrep
• Suffix tree
• One of the most useful preprocessing techniques
• Many applications

10
Algorithm KMP
• Not the fastest
• Best known
• Good for real-time matching
• i.e. text comes one char at a time
• No memory of previous chars
• Idea
• Left-to-right comparison
• Shift P more than one char whenever possible

11
Intuitive example 1
[Figure: T contains abcxabc followed by a mismatching char; P = abcxabcde. The naïve approach shifts P by one char and starts over.]
• Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches.
• Number of comparisons saved: 6

12
Intuitive example 2
[Figure: T aligned with P = abcxabcde, with a mismatch inside P. The naïve approach shifts P by one char and starts over.]
• Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1], without missing any possible matches.
• Number of comparisons saved: 7

13
KMP algorithm pre-processing
• Key: the reasoning is done without even knowing what string T is.
• Only the location of the mismatch in P must be known.

[Figure: T with matched part t followed by char x; P with prefix t' followed by y, and suffix t = P[j..i] followed by z]
Pre-processing: for any position i in P, find the longest proper suffix of P[1..i], t = P[j..i], such that t matches a prefix of P, t', and the next char of t is different from the next char of t' (i.e., y ≠ z). For each i, let sp(i) = length(t).
14
KMP algorithm shift rule
[Figure: T with matched part t followed by char x; P before and after the shift, with y following t' at position sp(i) + 1 and z following t at position i + 1]
Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i - sp(i) chars and compare x with z. This shift rule can be implicitly represented by creating a failure link between y and z. Meaning: when a mismatch occurs between x on T and P[i+1], resume comparison between x and P[sp(i)+1].
15
• P: aataac

If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1).
P[i]: a a t a a c
sp(i): 0 1 0 0 2 0
Annotations: aa vs. at; aat vs. aac
16
Another example
• P: abababc

If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1).
P[i]: a b a b a b c
sp(i): 0 0 0 0 0 4 0
Annotations: ababa vs. ababc; ab vs. ab; abab vs. abab
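The sp(i) values in the two examples can be reproduced with a short sketch. It first computes the ordinary border lengths and then applies the extra "next chars differ" condition from the pre-processing slide. The function names `border_lengths` and `strong_sp` are hypothetical, and the lists are 0-indexed, so list entry i-1 holds sp(i).

```python
def border_lengths(pattern):
    """Length of the longest proper suffix of pattern[:i+1] that is also a prefix."""
    sp = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = sp[k - 1]                  # fall back to a shorter border
        if pattern[i] == pattern[k]:
            k += 1
        sp[i] = k
    return sp


def strong_sp(pattern):
    """Border lengths that also require the chars after t and t' to differ."""
    weak = border_lengths(pattern)
    n = len(pattern)
    strong = [0] * n
    for i in range(n):
        k = weak[i]
        if k == 0 or i == n - 1 or pattern[i + 1] != pattern[k]:
            strong[i] = k                  # next chars differ (or none exists)
        else:
            strong[i] = strong[k - 1]      # same next char: try shorter borders
    return strong
```

On the two patterns from the slides this gives `[0, 1, 0, 0, 2, 0]` for aataac and `[0, 0, 0, 0, 0, 4, 0]` for abababc.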
17
KMP Example using Failure Link
[Figure: P = aataac with its failure links]
T: aacaataaaaataaccttacta
P: aataac
• Time complexity analysis
• Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
• More careful analysis: the comparisons can be broken into two phases
• Comparison phase: the first time a char in T is compared to P. Total is exactly m.
• Shift phase: the first comparisons made after a shift. Total is at most m.
• Time complexity: at most 2m comparisons, i.e. O(m)

[Figure: successive alignments of aataac against T]
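The failure-link search can be sketched as follows. This version uses the plain border lengths as failure links (the stronger variant with the "next chars differ" condition only skips some comparisons), assumes a non-empty pattern, and reports 0-based start positions; the names `failure_function` and `kmp_search` are illustrative.

```python
def failure_function(pattern):
    """sp[i] = length of the longest proper border of pattern[:i+1]."""
    sp = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = sp[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        sp[i] = k
    return sp


def kmp_search(text, pattern):
    """Scan text left to right, never moving backwards in the text."""
    sp = failure_function(pattern)
    hits = []
    matched = 0                             # chars of P currently matched
    for pos, ch in enumerate(text):
        while matched > 0 and ch != pattern[matched]:
            matched = sp[matched - 1]       # follow the failure link
        if ch == pattern[matched]:
            matched += 1
        if matched == len(pattern):         # full occurrence ends at pos
            hits.append(pos - matched + 1)
            matched = sp[matched - 1]       # keep going for overlaps
    return hits
```

Because the text index only moves forward, this also works when the text arrives one char at a time.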
18
KMP algorithm using DFA (Deterministic Finite
Automata)
• P: aataac

Failure link: if a char in T fails to match at pos 6, re-compare it with the char at pos 3.
DFA: if the next char in T is t after matching 5 chars, go to state 3.
[Figure: DFA for P = aataac with states 0 to 6]
All other inputs go to state 0.
19
DFA Example
[Figure: DFA for P = aataac with states 0 to 6]
T: aacaataataataaccttacta
States: 1201234534534560001001
Each char in T is examined exactly once, so exactly m comparisons are made. But pre-processing takes longer, and more space is needed to store the DFA.
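The state trace above can be reproduced with a small table-driven sketch. The construction below is one common way to build the KMP automaton (copy the transitions of a "restart" state, then overwrite the matching transition); it is not necessarily the construction used in the lecture. The names `build_dfa` and `run_dfa` are hypothetical, the pattern is assumed non-empty, and chars that do not occur in the pattern are sent to state 0.

```python
def build_dfa(pattern):
    """dfa[state][char] -> next state, for states 0..len(pattern)."""
    n = len(pattern)
    alphabet = sorted(set(pattern))
    dfa = [{c: 0 for c in alphabet} for _ in range(n + 1)]
    dfa[0][pattern[0]] = 1
    restart = 0                       # state reached on pattern[1:state]
    for state in range(1, n + 1):
        for c in alphabet:            # on a mismatch, behave like restart
            dfa[state][c] = dfa[restart][c]
        if state < n:
            dfa[state][pattern[state]] = state + 1   # on a match, advance
            restart = dfa[restart][pattern[state]]
    return dfa


def run_dfa(dfa, text):
    """Return the state after each char of text; one table lookup per char."""
    state = 0
    trace = []
    for ch in text:
        state = dfa[state].get(ch, 0)  # unknown chars reset to state 0
        trace.append(state)
    return trace
```

Running it on the slide's text and joining the states gives the same string, 1201234534534560001001, and an occurrence ends wherever the state equals 6.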
20
Difference between Failure Link and DFA
• Failure link
• Preprocessing time and space are O(n), regardless of alphabet size
• Comparison time is at most 2m (at least m)
• DFA
• Preprocessing time and space are O(n |Σ|), where |Σ| is the alphabet size
• May be a problem for a very large alphabet
• For example, each char is a big integer
• Chinese characters
• Comparison time is always m.

21
The set matching problem
• Find all occurrences of a set of patterns in T
• First idea: run KMP or BM for each P
• O(km + n)
• k: number of patterns
• m: length of text
• n: total length of patterns
• Better idea: combine all patterns together and search in one run

22
A simpler problem: spell-checking
• A dictionary contains five words
• potato
• poetry
• pottery
• science
• school
• Given a document, check if any word is (not) in
the dictionary
• Words in document are separated by special chars.
• Relatively easy.

23
Keyword tree for spell checking
Document: This version of the potato gun was inspired by the Weird Science team out of Illinois
[Figure: keyword tree for the five dictionary words, with word ends numbered 1 to 5]
• O(n) time to construct. n: total length of patterns.
• Search time: O(m). m: length of text
• A common prefix only needs to be compared once.
• What if there is no space between words?
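A keyword tree for the spell-checking case can be sketched with nested dictionaries. The names `build_keyword_tree`, `contains` and `unknown_words`, the end-of-word marker, and the rule that words are runs of lowercase letters are all assumptions made for this illustration.

```python
import re

END = "<end>"   # marker key; longer than one char, so it cannot clash with a letter


def build_keyword_tree(words):
    """Insert each word char by char; shared prefixes share nodes."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def contains(tree, word):
    """Walk down the tree; the word is present only if it ends on a marker."""
    node = tree
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node


def unknown_words(tree, document):
    """Split the document on non-letters and report words not in the tree."""
    return [w for w in re.findall(r"[a-z]+", document.lower())
            if not contains(tree, w)]
```

Each document word is matched in time proportional to its length, so the whole document takes O(m) once the tree is built.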

24
Aho-Corasick algorithm
• Basis of the fgrep algorithm
• Generalizing KMP
• Using failure links
• Example: given the following 4 patterns
• potato
• tattoo
• theater
• other

25
Keyword tree
[Figure: keyword tree for the four patterns, with root 0 and pattern ends numbered 1 to 4]
26
Keyword tree
[Figure: keyword tree for the four patterns]
T: potherotathxythopotattooattoo
27
Keyword tree
[Figure: keyword tree for the four patterns]
T: potherotathxythopotattooattoo
Search time: O(mn)
m: length of text. n: length of longest pattern
28
Keyword Tree with a failure link
[Figure: keyword tree for the four patterns, with one failure link drawn]
T: potherotathxythopotattooattoo
29
Keyword Tree with a failure link
[Figure: keyword tree for the four patterns, with one failure link drawn]
T: potherotathxythopotattooattoo
30
Keyword Tree with all failure links
[Figure: keyword tree for the four patterns, with all failure links drawn]
31
Example
[Figure: keyword tree with all failure links, searching T]
T: potherotathxythopotattooattoo
32
Example
[Figure: keyword tree with all failure links, searching T]
T: potherotathxythopotattooattoo
33
Example
[Figure: keyword tree with all failure links, searching T]
T: potherotathxythopotattooattoo
34
Example
[Figure: keyword tree with all failure links, searching T]
T: potherotathxythopotattooattoo
35
Example
[Figure: keyword tree with all failure links, searching T]
T: potherotathxythopotattooattoo
36
Aho-Corasick algorithm
• O(n) preprocessing, and O(m + k) searching.
• n: total length of patterns.
• m: length of text
• k: number of occurrences.
• Can create a DFA similar to the one in KMP.
• Requires more space
• Preprocessing time depends on alphabet size
• Search time per text char is constant
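The keyword tree with failure links can be sketched compactly. This is a minimal illustration, not the fgrep implementation: the tree is stored as parallel lists, failure links are filled in breadth-first, and each node carries the patterns that end there or at a node reachable through failure links. The names `build_automaton` and `search_all` are hypothetical.

```python
from collections import deque


def build_automaton(patterns):
    """Return (goto, fail, out) lists indexed by node id; node 0 is the root."""
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:                       # 1. build the keyword tree
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    queue = deque(goto[0].values())            # depth-1 nodes fail to the root
    while queue:                               # 2. failure links, level by level
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search_all(text, patterns):
    """Return (0-based start, pattern) for every occurrence, in text order."""
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]                  # follow failure links on mismatch
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

On the example text, `search_all("potherotathxythopotattooattoo", ["potato", "tattoo", "theater", "other"])` finds other starting at index 1 and tattoo starting at index 18.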

37
Suffix Tree
• All algorithms we talked about so far preprocess the pattern(s)
• Karp-Rabin: small pattern, small alphabet
• Boyer-Moore: fastest in practice. O(m) worst case.
• KMP: O(m)
• Aho-Corasick: O(m)
• In some cases we may prefer to pre-process T
• Fixed T, varying P
• Suffix tree: basically a keyword tree of all suffixes

38
Suffix tree
• T: xabxac
• Suffixes:
• xabxac
• abxac
• bxac
• xac
• ac
• c

[Figure: suffix tree of xabxac, with leaves numbered 1 to 6]
Naïve construction: O(m^2) using Aho-Corasick. Smarter: O(m), but very technical, with a big constant factor.
Difference from a keyword tree: create an internal node only when there is a branch.
39
Suffix tree implementation
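• Explicitly labeling the sequence end
• T: xabxa → T: xabxa followed by an end-of-sequence marker

[Figure: suffix trees of xabxa without and with the end marker; with the marker, every suffix 1 to 5 ends at its own leaf]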
40
Suffix tree implementation
• Implicitly labeling edges
• T: xabxa

[Figure: the same suffix trees, with edge labels stored as positions in T rather than as substrings]
41
• Similar to failure link in a keyword tree
• Only link internal nodes having branches

[Figure: suffix tree fragment showing links between internal nodes that have branches]
42
Suffix tree construction
Positions: 1234567890
T: acatgacatt
[Figure: suffix tree under construction, step 1 of 10]
43
Suffix tree construction
[Figure: suffix tree of acatgacatt under construction, step 2 of 10]
44
Suffix tree construction
[Figure: suffix tree of acatgacatt under construction, step 3 of 10]
45
Suffix tree construction
[Figure: suffix tree of acatgacatt under construction, step 4 of 10]
46
Suffix tree construction
[Figure: suffix tree of acatgacatt under construction, step 5 of 10]
47
Suffix tree construction
[Figure: suffix tree of acatgacatt under construction, step 6 of 10]
48
Suffix tree construction
[Figure: suffix tree of acatgacatt under construction, step 7 of 10]
49
Suffix tree construction
[Figure: suffix tree of acatgacatt under construction, step 8 of 10]
50
Suffix tree construction
[Figure: suffix tree of acatgacatt under construction, step 9 of 10]
51
Suffix tree construction
[Figure: suffix tree of acatgacatt under construction, step 10 of 10]
52
ST Application 1: pattern matching
• Find all occurrences of P = xa in T
• Find the node v in the ST that matches P
• Traverse the subtree rooted at v to get the locations

[Figure: suffix tree of T = xabxac, with leaves numbered 1 to 6]
• O(m) to construct the ST (large constant factor)
• O(n) to find v: linear in the length of P instead of T!
• O(k) to get all leaves, where k is the number of occurrences.
• Asymptotic time is the same as KMP. ST wins if T is fixed. KMP wins otherwise.
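The two search steps (find v, then collect the leaves below it) can be sketched on an uncompressed suffix trie. Note the swap: this is the naïve O(m^2) structure with one node per char, not the linear-time suffix tree from the slides, and it is only meant to show the lookup. The names `build_suffix_trie` and `find_occurrences` and the `LEAF` marker are hypothetical; positions are 1-based to match the figure.

```python
LEAF = "<leaf>"   # marker key; longer than one char, so it cannot clash with text chars


def build_suffix_trie(text):
    """Insert every suffix of text; store its 1-based start where it ends."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node[LEAF] = start + 1
    return root


def find_occurrences(trie, pattern):
    """Step 1: walk down along pattern to node v. Step 2: collect leaves under v."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []                 # pattern does not occur in the text
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == LEAF:
                found.append(child)   # child is the stored start position
            else:
                stack.append(child)
    return sorted(found)
```

For T = xabxac and P = xa this returns the positions 1 and 4, the two leaves below the node reached by xa.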

53
ST Application 2: set matching
• Find all occurrences of a set of patterns in T
• Build a ST from T
• Match each P to the ST

[Figure: suffix tree of T = xabxac, with leaves numbered 1 to 6; P = xab]
• O(m) to construct the ST (large constant factor)
• O(n) to find v: linear in the total length of the Ps
• O(k) to get all leaves, where k is the number of occurrences.
• Asymptotic time is the same as Aho-Corasick. ST wins if T is fixed. AC wins if the Ps are fixed. Otherwise it depends on their relative size.

54
ST application 3: repeats finding
• Genomes contain many repeated DNA sequences
• Repeat sequence length: varies from 1 nucleotide to millions
• Genes may have multiple copies (50 to 10,000)
• Highly repetitive DNA in some non-coding regions
• 6 to 10 bp x 100,000 to 1,000,000 times
• Problem: find all repeats that are at least k residues long and appear at least p times in the genome

55
Repeats finding
• At least k residues long and appearing at least p times in the seq
• Phase 1: top-down, count label lengths (L) from the root to each node
• Phase 2: bottom-up, count the number of leaves (N) descended from each internal node

For each node with L ≥ k and N ≥ p, print all leaves
O(m) to traverse the tree
[Figure: suffix tree nodes annotated with (L, N)]
56
Maximal repeats finding
• Right-maximal repeat
• S[i+1..i+k] = S[j+1..j+k],
• but S[i+k+1] ≠ S[j+k+1]
• Left-maximal repeat
• S[i+1..i+k] = S[j+1..j+k],
• but S[i] ≠ S[j]
• Maximal repeat
• S[i+1..i+k] = S[j+1..j+k],
• but S[i] ≠ S[j], and S[i+k+1] ≠ S[j+k+1]

Example: acatgacatt
• cat
• aca
• acat

57
Maximal repeats finding
Positions: 1234567890
T: acatgacatt
[Figure: suffix tree of acatgacatt, with leaves numbered 1 to 10]
• Find repeats with at least 3 bases and 2 occurrences
• Right-maximal: cat
• Maximal: acat
• Left-maximal: aca

58
Maximal repeats finding
Positions: 1234567890
T: acatgacatt
[Figure: suffix tree of acatgacatt, with leaves annotated by their left chars: g, c, c, a, a]
• How to find a maximal repeat?
• A right-maximal repeat with different left chars
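The three definitions can be checked by brute force on the example string. This sketch deliberately does not use a suffix tree; it enumerates substrings directly, which is far slower but makes the definitions concrete. The name `classify_repeats` is hypothetical, and treating the string boundary as a char different from every real char is an assumption of this sketch.

```python
def classify_repeats(s, min_len=3, min_count=2):
    """Map each repeated substring to 'maximal', 'left-maximal',
    'right-maximal' or 'not maximal'."""
    result = {}
    for length in range(min_len, len(s)):
        for start in range(len(s) - length + 1):
            w = s[start:start + length]
            if w in result:
                continue
            occ = [i for i in range(len(s) - length + 1) if s.startswith(w, i)]
            if len(occ) < min_count:
                continue
            # Chars just before / after each occurrence ("" marks the boundary).
            left = {s[i - 1] if i > 0 else "" for i in occ}
            right = {s[i + length] if i + length < len(s) else "" for i in occ}
            left_max = len(left) > 1      # some pair differs on the left
            right_max = len(right) > 1    # some pair differs on the right
            if left_max and right_max:
                result[w] = "maximal"
            elif left_max:
                result[w] = "left-maximal"
            elif right_max:
                result[w] = "right-maximal"
            else:
                result[w] = "not maximal"
    return result
```

For acatgacatt with at least 3 bases and 2 occurrences it reports aca as left-maximal, cat as right-maximal and acat as maximal, in line with the slide.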

59
ST application 4: word enumeration
• Find all k-mers that occur at least p times
• Compute (L, N) for each node
• L: total label length from the root to the node
• N: number of leaves
• Find nodes v with L ≥ k, L(parent) < k, and N ≥ p
• Traverse the sub-tree rooted at v to get the locations

[Figure: a path from the root crossing depth k; the parent has L < k, and node v has L ≥ k and N ≥ p]
This can be used in many applications. For example, to find words that appear frequently in a genome or a document.
60
Joint Suffix Tree
• Build a ST for more than one string
• Two strings S1 and S2
• S = S1 followed by S2, joined with a separator
• Build a suffix tree for S in time O(|S1| + |S2|)
• The separator will only appear in edges ending in a leaf

61
• S1: abcd
• S2: abca
• S: abcd and abca, joined with a separator

[Figure: joint suffix tree of S, with leaves labelled (string, position) from 1,1 to 2,4; the part of an edge label after the separator is marked useless]
62
To Simplify
[Figure: the joint suffix tree before (left) and after (right) dropping the useless part of each edge label]
• We don't really need to do anything, since all edge labels were implicit.
• The right-hand side is more convenient to look at

63
Application of JST
• Longest common substring (not subsequence)
• For each internal node v, keep a bit vector B
• B[1] = 1 if a child of v is a suffix of S1
• Find all internal nodes with B[1] = B[2] = 1
• Report the one with the longest label
• Can be extended to k sequences: just use a longer bit vector.

[Figure: joint suffix tree of abcd and abca, with leaves labelled (string, position)]
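The bit-vector idea can be sketched on an uncompressed joint suffix trie. As with the other sketches here, this swaps the linear-time suffix tree for a naïve quadratic structure to keep the code short. Each node records which of the two strings has a suffix passing through it, and the deepest node with both bits set spells the answer. The name `longest_common_substring` is illustrative.

```python
def longest_common_substring(s1, s2):
    """Deepest node of the joint suffix trie that lies on suffixes of both strings."""
    def new_node():
        return {"children": {}, "bits": [False, False]}

    root = new_node()
    for idx, s in enumerate((s1, s2)):
        for start in range(len(s)):            # insert every suffix of s
            node = root
            for ch in s[start:]:
                if ch not in node["children"]:
                    node["children"][ch] = new_node()
                node = node["children"][ch]
                node["bits"][idx] = True       # a suffix of string idx passes here
    best = ""
    stack = [(root, "")]
    while stack:                               # only descend where both bits are set
        node, label = stack.pop()
        for ch, child in node["children"].items():
            if all(child["bits"]):
                if len(label) + 1 > len(best):
                    best = label + ch
                stack.append((child, label + ch))
    return best
```

For S1 = abcd and S2 = abca the result is abc. Extending this to k strings would only need a longer `bits` list.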
64
Application of JST
• Given K strings, find all k-mers that appear in at least d strings

[Figure: node v with L ≥ k whose parent has L < k; leaves below v labelled 4,x 1,x 3,x 3,x, giving B = (1, 0, 1, 1); report v if cardinal(B) ≥ d]
65
Many other applications
• Reproduce the behavior of Aho-Corasick
• Recognizing computer viruses
• A database of known computer viruses
• Does a file contain a virus?
• DNA fingerprinting
• A database of people's DNA sequences
• Given a short DNA sequence, which person is it from?
• Catch
• Large constant factor for space requirement
• Large constant factor for construction
• Suffix array: trades off time for space

66
Summary
• One T, one P
• Boyer-Moore is the choice
• KMP works but not the best
• One T, many P
• Aho-Corasick
• Suffix Tree
• One fixed T, many varying P
• Suffix tree
• Two or more Ts
• Suffix tree, joint suffix tree, suffix array

Alphabet independent
Alphabet dependent
67
Pattern pre-processing algs
• Karp Rabin algorithm
• Small alphabet and small pattern
• Boyer Moore algorithm
• The choice of most cases
• Typically sub-linear time
• Knuth-Morris-Pratt algorithm (KMP)
• Aho-Corasick algorithm
• The algorithm for the unix utility fgrep
• Suffix tree
• One of the most useful preprocessing techniques
• Many applications

68
Karp Rabin Algorithm
• Let's say we are dealing with binary numbers
• Text: 01010001011001010101001
• Pattern: 101100
• Convert the pattern to an integer
• 101100 = 2^5 + 2^3 + 2^2 = 44

69
Karp Rabin algorithm
• Text: 01010001011001010101001
• Pattern: 101100 = 44 decimal
• 10111011001010101001
• 2^5 + 0 + 2^3 + 2^2 + 2^1 = 46
• 10111011001010101001
• 46 × 2 - 64 + 1 = 29
• 10111011001010101001
• 29 × 2 - 0 + 1 = 59
• 10111011001010101001
• 59 × 2 - 64 + 0 = 54
• 10111011001010101001
• 54 × 2 - 64 + 0 = 44

Θ(m + n)
70
Karp Rabin algorithm
• What if the pattern is too long to fit into a single integer?
• Pattern: 101100. What if each word in our computer has only 4 bits?
• Basic idea: hashing. 44 mod 13 = 5
• 10111011001010101001
• 46 (mod 13 = 7)
• 10111011001010101001
• 46 × 2 - 64 + 1 = 29 (mod 13 = 3)
• 10111011001010101001
• 29 × 2 - 0 + 1 = 59 (mod 13 = 7)
• 10111011001010101001
• 59 × 2 - 64 + 0 = 54 (mod 13 = 2)
• 10111011001010101001
• 54 × 2 - 64 + 0 = 44 (mod 13 = 5)

Θ(m + n) expected running time
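The rolling update in the worked example can be sketched for digit strings. The defaults (base 2, modulus 13) mirror the slide; the function name `karp_rabin` and the explicit char-by-char check on every hash hit (needed because different windows can share a hash) are choices made for this sketch.

```python
def karp_rabin(text, pattern, base=2, modulus=13):
    """Return 0-based starts of pattern in text; both are strings of digits."""
    m, n = len(text), len(pattern)
    if n == 0 or n > m:
        return []
    high = pow(base, n - 1, modulus)     # weight of the window's leading digit
    p_hash = w_hash = 0
    for i in range(n):                   # hash of P and of the first window
        p_hash = (p_hash * base + int(pattern[i])) % modulus
        w_hash = (w_hash * base + int(text[i])) % modulus
    hits = []
    for s in range(m - n + 1):
        # Equal hashes may be a false match, so verify before reporting.
        if w_hash == p_hash and text[s:s + n] == pattern:
            hits.append(s)
        if s + n < m:
            # Drop the leading digit, shift left by one, append the next digit.
            w_hash = ((w_hash - int(text[s]) * high) * base
                      + int(text[s + n])) % modulus
    return hits
```

On the worked text 10111011001010101001 the window hashes are 7, 3, 7, 2, 5, and the pattern 101100 (hash 5) is found at 0-based index 4.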
71
Boyer Moore algorithm
• Three ideas
• Right-to-left comparison
• Bad character rule
• Good suffix rule

72
Boyer Moore algorithm
• Right to left comparison

[Figure: P compared with T from right to left; char x in T mismatches char y in P]
Skip some chars without missing any occurrence. But how?
73
• Positions: 12345678901234567
• T: xpbctbxabpqqaabpq
• P: tpabxab
• What would you do now?

74
• Positions: 12345678901234567
• T: xpbctbxabpqqaabpq
• P: tpabxab
• P: tpabxab (shifted)

75
• Positions: 123456789012345678
• T: xpbctbxabpqqaabpqz
• P: tpabxab
• P: tpabxab (shifted)
• P: tpabxab (shifted again)

76
Basic bad character rule
P: tpabxab
Right-most position of each char in P: a 6, b 7, p 2, t 1, x 5
Pre-processing: O(n)
77
Basic bad character rule
T: xpbctbxabpqqaabpqz (k marks the mismatched position in T)
P: tpabxab
When the rightmost T(k) in P is to the left of i, shift pattern P to align T(k) with the rightmost T(k) in P.
i = 3
Shift = 3 - 1 = 2
P: tpabxab (after the shift)
Right-most position of each char in P: a 6, b 7, p 2, t 1, x 5
78
Basic bad character rule
T: xpbctbxabpqqaabpqz (k marks the mismatched position in T)
P: tpabxab
When T(k) is not in P, shift the left end of P to align with T(k+1).
i = 7
Shift = 7 - 0 = 7
P: tpabxab (after the shift)
Right-most position of each char in P: a 6, b 7, p 2, t 1, x 5
79
Basic bad character rule
T: xpbctbxabpqqaabpqz (k marks the mismatched position in T)
P: tpabxab
When the rightmost T(k) in P is to the right of i, shift pattern P by one position.
i = 5
5 - 6 < 0, so shift 1
P: tpabxab (after the shift)
Right-most position of each char in P: a 6, b 7, p 2, t 1, x 5
80
Extended bad character rule
T: xpbctbxabpqqaabpqz (k marks the mismatched position in T)
P: tpabxab
Find the T(k) in P that is immediately to the left of i, and shift P to align T(k) with that position.
i = 5
5 - 3 = 2, so shift 2
P: tpabxab (after the shift)
Positions of each char in P: a 6, 3; b 7, 4; p 2; t 1; x 5
Preprocessing is still O(n)
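The basic bad character rule on its own already gives a working matcher, sketched below. This is not full Boyer-Moore: there is no good suffix rule and no extended bad character rule. The names `rightmost_positions` and `bad_character_search` are hypothetical, and positions are 1-based to match the tables above.

```python
def rightmost_positions(pattern):
    """1-based right-most position of each char in the pattern."""
    return {ch: i + 1 for i, ch in enumerate(pattern)}   # later entries overwrite


def bad_character_search(text, pattern):
    """Right-to-left comparison, shifting by the basic bad character rule."""
    m, n = len(text), len(pattern)
    right = rightmost_positions(pattern)
    hits = []
    s = 0                                   # 0-based offset of P's left end in T
    while s <= m - n:
        i = n                               # 1-based position in P being compared
        while i > 0 and pattern[i - 1] == text[s + i - 1]:
            i -= 1
        if i == 0:
            hits.append(s + 1)              # occurrence at 1-based position s + 1
            s += 1
        else:
            bad = text[s + i - 1]           # the mismatched text char T(k)
            # Shift i - R(T(k)); R is 0 if the char is absent; at least 1.
            s += max(1, i - right.get(bad, 0))
    return hits
```

For P = tpabxab the table is a 6, b 7, p 2, t 1, x 5, and the three shift cases from the slides come out as 3 - 1 = 2, 7 - 0 = 7 and max(1, 5 - 6) = 1.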
81
Extended bad character rule
• Best possible: m / n comparisons
• Works better for a large alphabet
• In some cases the extended bad character rule is sufficiently good
• Worst case: O(mn)
• What else can we do?

82
• Positions: 123456789012345678
• T: prstabstubabvqxrst
• P: qcabdabdab
• P: qcabdabdab (shifted according to the extended bad character rule)

83
(weak) good suffix rule
• Positions: 123456789012345678
• T: prstabstubabvqxrst
• P: qcabdabdab
• P: qcabdabdab (shifted)

84
(Weak) good suffix rule
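[Figure: T with char x followed by the matched part t; P with char y followed by its suffix t, and an earlier copy t' of t inside P; after the shift, t' is aligned with t in T]
Preprocessing: for any suffix t of P, find the rightmost copy of t, denoted by t'. How to find t' efficiently?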
85
(Strong) good suffix rule
• Positions: 123456789012345678
• T: prstabstubabvqxrst
• P: qcabdabdab

86
(Strong) good suffix rule
• Positions: 123456789012345678
• T: prstabstubabvqxrst
• P: qcabdabdab
• P: qcabdabdab (shifted)

87
(Strong) good suffix rule
• Positions: 123456789012345678
• T: prstabstubabvqxrst
• P: qcabdabdab
• P: qcabdabdab (shifted)

88
(Strong) good suffix rule
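[Figure: T with char x followed by the matched part t; P with char y before its suffix t and char z before the earlier copy t'; after the shift, t' is aligned with t in T]
In preprocessing: for any suffix t of P, find the rightmost copy of t, t', such that the char to the left of t ≠ the char to the left of t' (z ≠ y).
• Pre-processing can be done in linear time
• If P is in T, searching may take O(mn)
• If P is not in T, searching in the worst case is O(m + n)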

89
Example preprocessing
P: qcabdabdab
Bad character rule (positions of each char in P): a 9, 6, 3; b 10, 7, 4; c 2; d 8, 5; q 1
Where to shift depends on T
Good suffix rule:
Position: 1 2 3 4 5 6 7 8 9 10
P: q c a b d a b d a b
Value: 0 0 0 0 0 0 0 2 0 0
Annotation: dab vs. cab
Does not depend on T