Title: CS5263 Bioinformatics
1CS5263 Bioinformatics
- Lecture 17
- Exact String Matching Algorithms
2Boyer Moore algorithm
- Three ideas
- Right-to-left comparison
- Bad character rule
- Good suffix rule
3Boyer Moore algorithm
x
y
Skip some chars without missing any occurrence.
y
4Extended bad character rule
k
T xpbctbxabpqqaabpqz
P tpabxab
Find T(k) in P that is immediately left to i,
shift P to align T(k) with that position
i 5
5 3 2. so shift 2
P tpabxab
Restart the comparison here.
Preprocessing O(n)
5(Strong) good suffix rule
x
t
T
In preprocessing For any suffix t of P, find
the rightmost copy of t, t, such that the char
left to t ? the char left to t
y
z
P
t
t
z ? y
y
z
t
t
P
x
t
T
y
z
z
t
P
t
t
y
z
z
t
P
t
t
6Example preprocessing
qcabdabdab
Bad char rule
Good suffix rule
1 2 3 4 5 6 7 8 9 10
q c a b d a b d a b
0 0 0 0 2 0 0 2 0 0
dabcab
dabdabcabdab
Where to shift depends on T
Does not depend on T
7Tricky case
T x y a a b c a b
a b c a b0 0 0 1 0
a b c a bN N 0 N N
i-L
shift 4 1 3
b
b
c
c
8Example preprocessing
qcabdabdab
Bad char rule
Good suffix rule
1 2 3 4 5 6 7 8 9 10
q c a b d a b d a b
0 0 0 0 0 3 0 0 3 0
dabcab
dabdabcabdab
Where to shift depends on T
Does not depend on T
9Example preprocessing
qcabdabdab
Bad char rule
Good suffix rule
1 2 3 4 5 6 7 8 9 10
q c a b d a b d a b
N N N N 2 N N 2 N N
dabcab
dabdabcabdab
Where to shift depends on T
Does not depend on T
10Algorithm KMP Basic idea
x
t
T
y
z
t
t
P
i
j
y
z
t
t
P
In pre-processing for any position i in P, find
the longest suffix t, such that t t, and y ?
z. For each i, let Sp(i) length(t)
11Failure link
If a char in T fails to match at pos 6,
re-compare it with the char at pos 3
a
a
t
a
a
c
Sp(i) 0 1 0 0 2 0
aaat
aataac
12FSA
If the next char in T is t, we go to state 3
t
a
a
t
a
a
c
a
1
2
3
4
5
0
6
Sp(i) 0 1 0 0 2 0
aaat
aataac
All other input goes to state 0
13Tricky case
a
b
c
a
b
Failure link
dummy
0 0 0 0 2
c
b
b
c
a
a
FSA
14How to actually do pre-processing?
- Similar pre-processing for KMP and B-M
- Find matches between a suffix and a prefix
- Both can be done in linear time
- P is usually short, even a more expensive
pre-processing may result in a gain overall
y
x
KMP
t
t
P
i
j
For each i, find a j. similar to DP. Start from i
2
y
x
B-M
t
t
P
i
j
15Fundamental pre-processing
y
x
t
t
P
izi-1
zi
i
1
- Zi length of longest substring starting at i
that matches a prefix of P - i.e. t t, x ? y, Zi t
- With the Z-values computed, we can get the
preprocessing for both KMP and B-M in linear
time. - aabcaabxaaz
- Z 01003100210
- How to compute Z-values in linear time?
16Computing Z in Linear time
We already computed all Z-values up to k-1. need
to compute Zk. We also know the starting and
ending points of the previous match, l and r.
t
t
y
x
P
r
k
l
t
t
y
x
P
We know that t t, therefore the Z-value at
k-l1 may be helpful to us.
r
k
l
1
k-l1
17Computing Z in Linear time
The previous r is smaller than k. i.e., no
previous match extends beyond k. do explicit
comparison.
P
Case 1
k
Case 2
y
x
P
Zk-l1 lt r-k1. Zk Zk-l1 No comparison is
needed.
l
r
k
1
k-l1
Zk-l1 gt r-k1. Zk Zk-l1 Comparison start from
r
Case 3
P
r
k
l
1
k-l1
- No char inside the box is compared twice. At most
one mismatch per iteration. - Therefore, O(n).
18Z-preprocessing for B-M and KMP
y
x
t
t
Z
zi
i
j
1
j izi-1
y
x
KMP
t
t
For each j sp(jzj-1) z(j)
j
i
y
x
B-M
t
t
Use Z backwards
i
j
- Both KMP and B-M preprocessing can be done in O(n)
19Keyword tree for spell checking
p
s
o
c
l
h
o
o
5
e
i
t
e
t
a
t
r
n
t
y
e
c
o
r
e
y
3
1
4
2
- O(n) time to construct. n total length of
patterns. - Search time O(m). m length of word
- Common prefix only need to be compared once.
20Aho-Corasick algorithm
- Generalizing KMP
- Create failure links
- Basis of the fgrep algorithm
- Given the following patterns
- potato
- tattoo
- theater
- other
21Failure link
0
p
t
t
h
o
e
h
a
t
r
e
t
a
a
4
t
t
t
e
o
o
r
1
o
3
2
potterisapersonwhomakespottery
22Failure link
0
p
t
t
h
o
e
h
a
t
r
e
t
4
a
a
t
t
t
e
o
o
r
1
o
3
2
O(n) preprocessing, and O(mk) searching. k is
of occurrence. Can create a FSA similarly.
Requires more space, and preprocessing time
depends on alphabet size.
23A problem with failure link
- Patterns potato, other, pot
0
p
t
h
o
e
t
r
3
a
2
t
o
1
24A problem with failure link for multiple patterns
- Patterns potato, other, pot, the, he, era
h
e
5
e
0
r
p
t
a
t
h
o
h
e
t
r
e
2
a
3
4
t
o
potherarac
1
25Output link
- Patterns potato, other, pot, the
Failure link taken when a mismatch occurs.
Output link always taken. (but will return).
h
e
e
5
0
r
p
t
a
t
h
o
h
e
t
r
e
2
a
3
4
t
o
potherarac
1
26Suffix Tree
- All algorithms we talked about so far preprocess
pattern(s) - Karp-Rabin small pattern, small alphabet
- Boyer-Moore fastest in practice. O(m) worst
case. - KMP O(m)
- Aho-Corasick O(m)
- In some cases we may prefer to pre-process T
- Fixed T, varying P
- Suffix tree basically a keyword tree of all
suffixes
27Suffix tree
- T xabxac
- Suffixes
- xabxac
- abxac
- bxac
- xac
- ac
- c
x
a
b
x
a
a
c
c
1
c
b
b
x
x
c
4
6
a
a
c
c
5
2
3
Naïve construction O(m2) using
Aho-Corasick. Smarter O(m). Very technical. big
constant factor Create an internal node only when
there is a branch
28Suffix tree implementation
- Explicitly labeling seq end
- T xabxa T xabxa
x
a
x
a
b
x
b
a
a
x
a
a
1
1
b
b
b
b
x
x
x
x
4
a
a
a
a
5
2
2
3
3
29Suffix tree implementation
- Implicitly labeling edges
- T xabxa
12
x
a
3
b
x
22
a
a
1
1
b
b
x
x
3
3
4
4
a
a
5
5
2
2
3
3
30Suffix links
- Similar to failure link in a keyword tree
- Only link internal nodes having branches
x
a
b
xabcf
a
b
c
f
c
d
d
e
e
f
f
g
g
h
h
i
i
j
j
31Suffix tree construction
1234567890...acatgacatt...
1
1
32Suffix tree construction
1234567890...acatgacatt...
2
1
1
2
33Suffix tree construction
1234567890...acatgacatt...
a
2
2
4
3
1
2
34Suffix tree construction
1234567890...acatgacatt...
a
4
2
2
4
4
3
1
2
35Suffix tree construction
5
1234567890...acatgacatt...
5
a
4
2
2
4
4
3
1
2
36Suffix tree construction
5
1234567890...acatgacatt...
5
a
4
c
a
2
4
t
4
t
5
3
6
1
2
37Suffix tree construction
5
1234567890...acatgacatt...
5
a
c
4
a
c
t
a
4
t
4
t
5
t
5
7
3
6
1
2
- With this suffix link, when we later need to add
another suffix, say acaty, we can use the link to
avoid going back to the root and re-compare cat
38Suffix tree construction
5
1234567890...acatgacatt...
5
a
c
4
a
c
t
t
a
t
4
t
5
5
t
5
t
7
3
6
8
1
2
39Suffix tree construction
5
1234567890...acatgacatt...
5
t
a
c
a
5
t
c
t
t
a
9
t
4
t
5
5
t
5
t
7
3
6
8
1
2
40Suffix tree construction
5
1234567890...acatgacatt...
5
t
a
c
10
a
5
t
c
t
t
a
9
t
4
t
5
5
t
5
t
7
3
6
8
1
2
41ST Application pattern matching
- Find all occurrence of Pxa in T
- Find node v in the ST that matches to P
- Traverse the subtree rooted at v to get the
locations
x
a
b
x
a
a
c
c
1
c
b
b
x
x
c
4
6
a
a
c
c
5
2
3
T xabxac
O(m) to construct ST (large constant factor) O(n)
to find v linear to length of P instead of
T! O(k) to get all leaves, k is the number of
occurrence.
42ST application repeats finding
- Genome contains many repeated DNA sequences
- Repeat sequence length Varies from 1 nucleotide
to whole gene - Highly repetitive DNA in some non-coding regions
- 6 to 10bp x 100,000 to 1,000,000 times
- Genes may have multiple copies (50 to 10,000)
43Find longest repeated substring
25
L 4
1518
610
L 9
L 8
- Do a tree traversal, compute the lengths of
labels at each node - O(m)
44Repeats finding
- Find all repeats that are at least k-residue long
and appear at least p times in the seq - Phase 1 top-down, count lengths of labels at
each node - Phase 2 bottom-up count of leaves descended
from each internal node
For each node with L gt k, and N gt p, print all
leaves
O(m) to traverse tree
(L, N)
45Repeats finding
5e
1234567890acatgacatt
5
t
a
c
10
a
5e
t
c
t
t
a
9
t
4
t
5e
5e
t
5e
t
7
3
6
8
1
2
- Find repeats with at least 3 bases and 2
occurrence - cat
- acat
- aca
46Repeats finding
- Left-maximal repeat
- Si1..ik Sj1..jk
- Si ! Sj
- Right-maximal repeat
- Si1..ik Sj1..jk,
- Sik1 ! Sjk1
- Maximal repeat
- Si1..ik Sj1..jk
- Si ! Sj, and Sik1 ! Sjk1
acatgacatt
47Repeats finding
5e
1234567890acatgacatt
5
t
a
c
10
a
5e
t
c
t
t
a
9
t
4
t
5e
5e
t
5e
t
7
3
6
8
1
2
Left char
g
c
c
a
a
- How to find maximal repeat?
- A right-maximal repeats with different left chars
48ST application word enumeration
- Find all k-mers that occur at least p times
- Compute (L, N) for each node
- Find nodes v with Lgtk, and L(parent)ltk, and Ngty
- Traverse sub-tree rooted at v to get the locations
Lltk
Lk
L K
Lgtk, Ngtp
This can be used in many applications. For
example, to find words that appeared frequently
in a genome or a document
49Joint Suffix Tree
- Build a ST for many than two strings
- Two strings S1 and S2
- S S1 S2
- Build a suffix tree for S in time O(S1 S2)
- The separator will only appear in the edge ending
in a leaf
50- S1 abcd
- S2 abca
- S abcdabca
a b c d
a
d
c
b c d a b c a
a
b
c
b
c
d
d
d
a
a
a
a
2,4
b
1,4
a
c
2,3
a
b
2,1
c
2,2
d
1,1
1,3
1,2
51To Simplify
a b c d
useless
a
d
c
b c d a b c a
a
a
b
d
c
b
c
c
b c d
b
d
d
d
c
a
d
a
d
a
1,4
a
2,4
b
a
1,4
a
a
c
a
2,4
2,3
a
b
1,1
2,3
2,1
c
2,1
2,2
1,3
d
1,1
2,2
1,2
1,3
1,2
- We dont really need to do anything, since all
edge labels were implicit. - The right hand side is more convenient to look at
52Application of JST
- Longest common substring
- For each internal node v, keep a bit vector B2
- B1 1 if a child of v is a suffix of S1
- Find all internal nodes with B1 B2 1
- Report one with the longest label
- Can be extended to k sequences. Just use a longer
bit vector.
a
d
c
b c d
b
c
d
d
1,4
a
a
a
2,4
1,1
2,3
2,1
1,3
2,2
1,2
O(m), m the total seq length
53Application of JST
- Given K strings, find all substrings with Lgtl,
that appear in at least d strings - Exact motif finding problem
- Build a joint suffix tree with all strings
- S S1 S2 S3 S4 _at_ S5 ! S6 S7
- Use a unique end char for each string
- Not really necessary if caution is taken in
construction
54Llt k
B 1010 0011 1011
L gt k
B 3
B 0011
4,x
3,x
3,x
1,x
3,x
B 1010
O(mK), m the total seq length. K is for bitwise
or two bit vectors
55Many other applications
- Reproduce the behavior of Aho-Corasick
- DNA finger printing
- A database of peoples DNA sequence
- Given a short DNA, which person is it from?
- Recognizing DNA contamination
- Indexing sequence databases
-
- Catch
- Large constant factor for space requirement
(15-40 bytes per base for DNA) - Large constant factor for construction
- Suffix array trade off time for space
56Summary
- One T, one P
- Boyer-Moore is the choice
- KMP works but not the best
- One T, many P
- Aho-Corasick
- Suffix Tree
- One fixed T, many varying P
- Suffix tree
- Two or more Ts
- Suffix tree, joint suffix tree, suffix array
Alphabet independent
Alphabet dependent