CS5263 Bioinformatics - PowerPoint PPT Presentation

1 / 56

About This Presentation

Title:

CS5263 Bioinformatics

Description:

Three ideas: Right-to-left comparison. Bad character rule. Good suffix rule ... tattoo. theater. other. Failure link. p. o. t. a. t. o. t. e. r. 0. t. h. e. r ... – PowerPoint PPT presentation

Number of Views:85

Avg rating:3.0/5.0

Slides: 57

Provided by: jianhu

Learn more at: http://www.cs.utsa.edu

Category:

more less

Transcript and Presenter's Notes

Title: CS5263 Bioinformatics

1
CS5263 Bioinformatics

Lecture 17
Exact String Matching Algorithms

2
Boyer Moore algorithm

Three ideas
Right-to-left comparison
Bad character rule
Good suffix rule

3
Boyer Moore algorithm

Right to left comparison

x
y
Skip some chars without missing any occurrence.
y
4
Extended bad character rule
k
T xpbctbxabpqqaabpqz
P tpabxab
Find T(k) in P that is immediately left to i,
shift P to align T(k) with that position
i 5
5 3 2. so shift 2
P tpabxab
Restart the comparison here.
Preprocessing O(n)
5
(Strong) good suffix rule
x
t
T
In preprocessing For any suffix t of P, find
the rightmost copy of t, t, such that the char
left to t ? the char left to t
y
z
P
t
t
z ? y
y
z
t
t
P
x
t
T
y
z
z
t
P
t
t
y
z
z
t
P
t
t
6
Example preprocessing
qcabdabdab
Bad char rule
Good suffix rule
1 2 3 4 5 6 7 8 9 10
q c a b d a b d a b
0 0 0 0 2 0 0 2 0 0
dabcab
dabdabcabdab
Where to shift depends on T
Does not depend on T
7
Tricky case

Pattern abcab

T x y a a b c a b
a b c a b0 0 0 1 0
a b c a bN N 0 N N
i-L
shift 4 1 3
b
b
c
c
8
Example preprocessing
qcabdabdab
Bad char rule
Good suffix rule
1 2 3 4 5 6 7 8 9 10
q c a b d a b d a b
0 0 0 0 0 3 0 0 3 0
dabcab
dabdabcabdab
Where to shift depends on T
Does not depend on T
9
Example preprocessing
qcabdabdab
Bad char rule
Good suffix rule
1 2 3 4 5 6 7 8 9 10
q c a b d a b d a b
N N N N 2 N N 2 N N
dabcab
dabdabcabdab
Where to shift depends on T
Does not depend on T
10
Algorithm KMP Basic idea
x
t
T
y
z
t
t
P
i
j
y
z
t
t
P
In pre-processing for any position i in P, find
the longest suffix t, such that t t, and y ?
z. For each i, let Sp(i) length(t)
11
Failure link

P aataac

If a char in T fails to match at pos 6,
re-compare it with the char at pos 3
a
a
t
a
a
c
Sp(i) 0 1 0 0 2 0
aaat
aataac
12
FSA
If the next char in T is t, we go to state 3

P aataac

t
a
a
t
a
a
c
a
1
2
3
4
5
0
6
Sp(i) 0 1 0 0 2 0
aaat
aataac
All other input goes to state 0
13
Tricky case

Pattern abcab

a
b
c
a
b
Failure link
dummy
0 0 0 0 2
c
b
b
c
a
a
FSA
14
How to actually do pre-processing?

Similar pre-processing for KMP and B-M
Find matches between a suffix and a prefix
Both can be done in linear time
P is usually short, even a more expensive
pre-processing may result in a gain overall

y
x
KMP
t
t
P
i
j
For each i, find a j. similar to DP. Start from i
2
y
x
B-M
t
t
P
i
j
15
Fundamental pre-processing
y
x
t
t
P
izi-1
zi
i
1

Zi length of longest substring starting at i
that matches a prefix of P
i.e. t t, x ? y, Zi t
With the Z-values computed, we can get the
preprocessing for both KMP and B-M in linear
time.
aabcaabxaaz
Z 01003100210
How to compute Z-values in linear time?

16
Computing Z in Linear time
We already computed all Z-values up to k-1. need
to compute Zk. We also know the starting and
ending points of the previous match, l and r.
t
t
y
x
P
r
k
l
t
t
y
x
P
We know that t t, therefore the Z-value at
k-l1 may be helpful to us.
r
k
l
1
k-l1
17
Computing Z in Linear time
The previous r is smaller than k. i.e., no
previous match extends beyond k. do explicit
comparison.
P
Case 1
k
Case 2
y
x
P
Zk-l1 lt r-k1. Zk Zk-l1 No comparison is
needed.
l
r
k
1
k-l1
Zk-l1 gt r-k1. Zk Zk-l1 Comparison start from
r
Case 3
P
r
k
l
1
k-l1

No char inside the box is compared twice. At most
one mismatch per iteration.
Therefore, O(n).

18
Z-preprocessing for B-M and KMP
y
x
t
t
Z
zi
i
j
1
j izi-1
y
x
KMP
t
t
For each j sp(jzj-1) z(j)
j
i
y
x
B-M
t
t
Use Z backwards
i
j

Both KMP and B-M preprocessing can be done in O(n)

19
Keyword tree for spell checking
p
s
o
c
l
h
o
o
5
e
i
t
e
t
a
t
r
n
t
y
e
c
o
r
e
y
3
1
4
2

O(n) time to construct. n total length of
patterns.
Search time O(m). m length of word
Common prefix only need to be compared once.

20
Aho-Corasick algorithm

Generalizing KMP
Create failure links
Basis of the fgrep algorithm
Given the following patterns
potato
tattoo
theater
other

21
Failure link
0
p
t
t
h
o
e
h
a
t
r
e
t
a
a
4
t
t
t
e
o
o
r
1
o
3
2
potterisapersonwhomakespottery
22
Failure link
0
p
t
t
h
o
e
h
a
t
r
e
t
4
a
a
t
t
t
e
o
o
r
1
o
3
2
O(n) preprocessing, and O(mk) searching. k is
of occurrence. Can create a FSA similarly.
Requires more space, and preprocessing time
depends on alphabet size.
23
A problem with failure link

Patterns potato, other, pot

0
p
t
h
o
e
t
r
3
a
2
t
o
1
24
A problem with failure link for multiple patterns

Patterns potato, other, pot, the, he, era

h
e
5
e
0
r
p
t
a
t
h
o
h
e
t
r
e
2
a
3
4
t
o
potherarac
1
25
Output link

Patterns potato, other, pot, the

Failure link taken when a mismatch occurs.
Output link always taken. (but will return).
h
e
e
5
0
r
p
t
a
t
h
o
h
e
t
r
e
2
a
3
4
t
o
potherarac
1
26
Suffix Tree

All algorithms we talked about so far preprocess
pattern(s)
Karp-Rabin small pattern, small alphabet
Boyer-Moore fastest in practice. O(m) worst
case.
KMP O(m)
Aho-Corasick O(m)
In some cases we may prefer to pre-process T
Fixed T, varying P
Suffix tree basically a keyword tree of all
suffixes

27
Suffix tree

T xabxac
Suffixes
xabxac
abxac
bxac
xac
ac
c

x
a
b
x
a
a
c
c
1
c
b
b
x
x
c
4
6
a
a
c
c
5
2
3
Naïve construction O(m2) using
Aho-Corasick. Smarter O(m). Very technical. big
constant factor Create an internal node only when
there is a branch
28
Suffix tree implementation

Explicitly labeling seq end
T xabxa T xabxa

x
a
x
a
b
x
b
a
a
x
a
a

1
1

b
b
b
b
x

x
x
x
4
a
a
a
a

5

2
2
3
3
29
Suffix tree implementation

Implicitly labeling edges
T xabxa

12
x
a
3
b
x
22
a
a

1
1

b
b

x
x
3
3
4
4
a
a
5

5

2
2
3
3
30
Suffix links

Similar to failure link in a keyword tree
Only link internal nodes having branches

x
a
b
xabcf
a
b
c
f
c
d
d
e
e
f
f
g
g
h
h
i
i
j
j
31
Suffix tree construction
1234567890...acatgacatt...
1
1
32
Suffix tree construction
1234567890...acatgacatt...
2
1
1
2
33
Suffix tree construction
1234567890...acatgacatt...
a
2
2
4
3
1
2
34
Suffix tree construction
1234567890...acatgacatt...
a
4
2
2
4
4
3
1
2
35
Suffix tree construction
5
1234567890...acatgacatt...
5
a
4
2
2
4
4
3
1
2
36
Suffix tree construction
5
1234567890...acatgacatt...
5
a
4
c
a
2
4
t
4
t
5

3
6
1
2
37
Suffix tree construction
5
1234567890...acatgacatt...
5
a
c
4
a
c
t
a
4
t
4
t
5
t
5

7
3
6
1
2

With this suffix link, when we later need to add
another suffix, say acaty, we can use the link to
avoid going back to the root and re-compare cat

38
Suffix tree construction
5
1234567890...acatgacatt...
5
a
c
4
a
c
t
t
a
t
4
t
5
5
t
5
t

7
3
6
8
1
2
39
Suffix tree construction
5
1234567890...acatgacatt...
5
t
a
c
a
5
t
c
t
t
a
9
t
4
t
5
5
t
5
t

7
3
6
8
1
2
40
Suffix tree construction
5
1234567890...acatgacatt...
5
t
a

c
10
a
5
t
c
t
t
a
9
t
4
t
5
5
t
5
t

7
3
6
8
1
2
41
ST Application pattern matching

Find all occurrence of Pxa in T
Find node v in the ST that matches to P
Traverse the subtree rooted at v to get the
locations

x
a
b
x
a
a
c
c
1
c
b
b
x
x
c
4
6
a
a
c
c
5
2
3
T xabxac
O(m) to construct ST (large constant factor) O(n)
to find v linear to length of P instead of
T! O(k) to get all leaves, k is the number of
occurrence.
42
ST application repeats finding

Genome contains many repeated DNA sequences
Repeat sequence length Varies from 1 nucleotide
to whole gene
Highly repetitive DNA in some non-coding regions
6 to 10bp x 100,000 to 1,000,000 times
Genes may have multiple copies (50 to 10,000)

43
Find longest repeated substring
25
L 4
1518
610
L 9
L 8

Do a tree traversal, compute the lengths of
labels at each node
O(m)

44
Repeats finding

Find all repeats that are at least k-residue long
and appear at least p times in the seq
Phase 1 top-down, count lengths of labels at
each node
Phase 2 bottom-up count of leaves descended
from each internal node

For each node with L gt k, and N gt p, print all
leaves
O(m) to traverse tree
(L, N)
45
Repeats finding
5e
1234567890acatgacatt
5
t
a

c
10
a
5e
t
c
t
t
a
9
t
4
t
5e
5e
t
5e
t
7
3
6
8
1
2

Find repeats with at least 3 bases and 2
occurrence
cat
acat
aca

46
Repeats finding

Left-maximal repeat
Si1..ik Sj1..jk
Si ! Sj
Right-maximal repeat
Si1..ik Sj1..jk,
Sik1 ! Sjk1
Maximal repeat
Si1..ik Sj1..jk
Si ! Sj, and Sik1 ! Sjk1

acatgacatt

aca
cat
acat

47
Repeats finding
5e
1234567890acatgacatt
5
t
a

c
10
a
5e
t
c
t
t
a
9
t
4
t
5e
5e
t
5e
t
7
3
6
8
1
2
Left char
g
c
c
a
a

How to find maximal repeat?
A right-maximal repeats with different left chars

48
ST application word enumeration

Find all k-mers that occur at least p times
Compute (L, N) for each node
Find nodes v with Lgtk, and L(parent)ltk, and Ngty
Traverse sub-tree rooted at v to get the locations

Lltk
Lk
L K
Lgtk, Ngtp
This can be used in many applications. For
example, to find words that appeared frequently
in a genome or a document
49
Joint Suffix Tree

Build a ST for many than two strings
Two strings S1 and S2
S S1 S2
Build a suffix tree for S in time O(S1 S2)
The separator will only appear in the edge ending
in a leaf

S1 abcd
S2 abca
S abcdabca

a b c d
a
d

c
b c d a b c a
a
b
c
b
c
d

d
d

a

a
a
a
2,4
b
1,4
a
c
2,3
a
b
2,1
c
2,2
d
1,1
1,3
1,2
51
To Simplify
a b c d
useless
a
d

c
b c d a b c a
a
a
b
d
c
b
c
c
b c d

b
d
d
d
c

a

d
a
d
a
1,4
a
2,4
b
a
1,4
a
a
c
a
2,4
2,3
a
b
1,1
2,3
2,1
c
2,1
2,2
1,3
d
1,1
2,2
1,2
1,3
1,2

We dont really need to do anything, since all
edge labels were implicit.
The right hand side is more convenient to look at

52
Application of JST

Longest common substring
For each internal node v, keep a bit vector B2
B1 1 if a child of v is a suffix of S1
Find all internal nodes with B1 B2 1
Report one with the longest label
Can be extended to k sequences. Just use a longer
bit vector.

a
d
c
b c d
b
c

d
d
1,4
a
a
a
2,4
1,1
2,3
2,1
1,3
2,2
1,2
O(m), m the total seq length
53
Application of JST

Given K strings, find all substrings with Lgtl,
that appear in at least d strings
Exact motif finding problem
Build a joint suffix tree with all strings
S S1 S2 S3 S4 _at_ S5 ! S6 S7
Use a unique end char for each string
Not really necessary if caution is taken in
construction

54
Llt k
B 1010 0011 1011
L gt k
B 3
B 0011
4,x
3,x
3,x
1,x
3,x
B 1010
O(mK), m the total seq length. K is for bitwise
or two bit vectors
55
Many other applications