CS5263 Bioinformatics - PowerPoint PPT Presentation

1 / 56
About This Presentation
Title:

CS5263 Bioinformatics

Description:

Three ideas: Right-to-left comparison. Bad character rule. Good suffix rule ... tattoo. theater. other. Failure link. p. o. t. a. t. o. t. e. r. 0. t. h. e. r ... – PowerPoint PPT presentation

Number of Views:85
Avg rating:3.0/5.0
Slides: 57
Provided by: jianhu
Learn more at: http://www.cs.utsa.edu
Category:

less

Transcript and Presenter's Notes

Title: CS5263 Bioinformatics


1
CS5263 Bioinformatics
  • Lecture 17
  • Exact String Matching Algorithms

2
Boyer Moore algorithm
  • Three ideas
  • Right-to-left comparison
  • Bad character rule
  • Good suffix rule

3
Boyer Moore algorithm
  • Right to left comparison

x
y
Skip some chars without missing any occurrence.
y
4
Extended bad character rule
k
T xpbctbxabpqqaabpqz
P tpabxab
Find T(k) in P that is immediately left to i,
shift P to align T(k) with that position
i 5
5 3 2. so shift 2
P tpabxab
Restart the comparison here.
Preprocessing O(n)
5
(Strong) good suffix rule
x
t
T
In preprocessing For any suffix t of P, find
the rightmost copy of t, t, such that the char
left to t ? the char left to t
y
z
P
t
t
z ? y
y
z
t
t
P
x
t
T
y
z
z
t
P
t
t
y
z
z
t
P
t
t
6
Example preprocessing
qcabdabdab
Bad char rule
Good suffix rule
1 2 3 4 5 6 7 8 9 10
q c a b d a b d a b
0 0 0 0 2 0 0 2 0 0
dabcab
dabdabcabdab
Where to shift depends on T
Does not depend on T
7
Tricky case
  • Pattern abcab

T x y a a b c a b
a b c a b0 0 0 1 0
a b c a bN N 0 N N
i-L
shift 4 1 3
b
b
c
c
8
Example preprocessing
qcabdabdab
Bad char rule
Good suffix rule
1 2 3 4 5 6 7 8 9 10
q c a b d a b d a b
0 0 0 0 0 3 0 0 3 0
dabcab
dabdabcabdab
Where to shift depends on T
Does not depend on T
9
Example preprocessing
qcabdabdab
Bad char rule
Good suffix rule
1 2 3 4 5 6 7 8 9 10
q c a b d a b d a b
N N N N 2 N N 2 N N
dabcab
dabdabcabdab
Where to shift depends on T
Does not depend on T
10
Algorithm KMP Basic idea
x
t
T
y
z
t
t
P
i
j
y
z
t
t
P
In pre-processing for any position i in P, find
the longest suffix t, such that t t, and y ?
z. For each i, let Sp(i) length(t)
11
Failure link
  • P aataac

If a char in T fails to match at pos 6,
re-compare it with the char at pos 3
a
a
t
a
a
c
Sp(i) 0 1 0 0 2 0
aaat
aataac
12
FSA
If the next char in T is t, we go to state 3
  • P aataac

t
a
a
t
a
a
c
a
1
2
3
4
5
0
6
Sp(i) 0 1 0 0 2 0
aaat
aataac
All other input goes to state 0
13
Tricky case
  • Pattern abcab

a
b
c
a
b
Failure link
dummy
0 0 0 0 2
c
b
b
c
a
a
FSA
14
How to actually do pre-processing?
  • Similar pre-processing for KMP and B-M
  • Find matches between a suffix and a prefix
  • Both can be done in linear time
  • P is usually short, even a more expensive
    pre-processing may result in a gain overall

y
x
KMP
t
t
P
i
j
For each i, find a j. similar to DP. Start from i
2
y
x
B-M
t
t
P
i
j
15
Fundamental pre-processing
y
x
t
t
P
izi-1
zi
i
1
  • Zi length of longest substring starting at i
    that matches a prefix of P
  • i.e. t t, x ? y, Zi t
  • With the Z-values computed, we can get the
    preprocessing for both KMP and B-M in linear
    time.
  • aabcaabxaaz
  • Z 01003100210
  • How to compute Z-values in linear time?

16
Computing Z in Linear time
We already computed all Z-values up to k-1. need
to compute Zk. We also know the starting and
ending points of the previous match, l and r.
t
t
y
x
P
r
k
l
t
t
y
x
P
We know that t t, therefore the Z-value at
k-l1 may be helpful to us.
r
k
l
1
k-l1
17
Computing Z in Linear time
The previous r is smaller than k. i.e., no
previous match extends beyond k. do explicit
comparison.
P
Case 1
k
Case 2
y
x
P
Zk-l1 lt r-k1. Zk Zk-l1 No comparison is
needed.
l
r
k
1
k-l1
Zk-l1 gt r-k1. Zk Zk-l1 Comparison start from
r
Case 3
P
r
k
l
1
k-l1
  • No char inside the box is compared twice. At most
    one mismatch per iteration.
  • Therefore, O(n).

18
Z-preprocessing for B-M and KMP
y
x
t
t
Z
zi
i
j
1
j izi-1
y
x
KMP
t
t
For each j sp(jzj-1) z(j)
j
i
y
x
B-M
t
t
Use Z backwards
i
j
  • Both KMP and B-M preprocessing can be done in O(n)

19
Keyword tree for spell checking
p
s
o
c
l
h
o
o
5
e
i
t
e
t
a
t
r
n
t
y
e
c
o
r
e
y
3
1
4
2
  • O(n) time to construct. n total length of
    patterns.
  • Search time O(m). m length of word
  • Common prefix only need to be compared once.

20
Aho-Corasick algorithm
  • Generalizing KMP
  • Create failure links
  • Basis of the fgrep algorithm
  • Given the following patterns
  • potato
  • tattoo
  • theater
  • other

21
Failure link
0
p
t
t
h
o
e
h
a
t
r
e
t
a
a
4
t
t
t
e
o
o
r
1
o
3
2
potterisapersonwhomakespottery
22
Failure link
0
p
t
t
h
o
e
h
a
t
r
e
t
4
a
a
t
t
t
e
o
o
r
1
o
3
2
O(n) preprocessing, and O(mk) searching. k is
of occurrence. Can create a FSA similarly.
Requires more space, and preprocessing time
depends on alphabet size.
23
A problem with failure link
  • Patterns potato, other, pot

0
p
t
h
o
e
t
r
3
a
2
t
o
1
24
A problem with failure link for multiple patterns
  • Patterns potato, other, pot, the, he, era

h
e
5
e
0
r
p
t
a
t
h
o
h
e
t
r
e
2
a
3
4
t
o
potherarac
1
25
Output link
  • Patterns potato, other, pot, the

Failure link taken when a mismatch occurs.
Output link always taken. (but will return).
h
e
e
5
0
r
p
t
a
t
h
o
h
e
t
r
e
2
a
3
4
t
o
potherarac
1
26
Suffix Tree
  • All algorithms we talked about so far preprocess
    pattern(s)
  • Karp-Rabin small pattern, small alphabet
  • Boyer-Moore fastest in practice. O(m) worst
    case.
  • KMP O(m)
  • Aho-Corasick O(m)
  • In some cases we may prefer to pre-process T
  • Fixed T, varying P
  • Suffix tree basically a keyword tree of all
    suffixes

27
Suffix tree
  • T xabxac
  • Suffixes
  • xabxac
  • abxac
  • bxac
  • xac
  • ac
  • c

x
a
b
x
a
a
c
c
1
c
b
b
x
x
c
4
6
a
a
c
c
5
2
3
Naïve construction O(m2) using
Aho-Corasick. Smarter O(m). Very technical. big
constant factor Create an internal node only when
there is a branch
28
Suffix tree implementation
  • Explicitly labeling seq end
  • T xabxa T xabxa

x
a
x
a
b
x
b
a
a
x
a
a

1
1

b
b
b
b
x

x
x
x
4
a
a
a
a

5

2
2
3
3
29
Suffix tree implementation
  • Implicitly labeling edges
  • T xabxa

12
x
a
3
b
x
22
a
a

1
1


b
b


x
x
3
3
4
4
a
a
5

5

2
2
3
3
30
Suffix links
  • Similar to failure link in a keyword tree
  • Only link internal nodes having branches

x
a
b
xabcf
a
b
c
f
c
d
d
e
e
f
f
g
g
h
h
i
i
j
j
31
Suffix tree construction
1234567890...acatgacatt...
1
1
32
Suffix tree construction
1234567890...acatgacatt...
2
1
1
2
33
Suffix tree construction
1234567890...acatgacatt...
a
2
2
4
3
1
2
34
Suffix tree construction
1234567890...acatgacatt...
a
4
2
2
4
4
3
1
2
35
Suffix tree construction
5
1234567890...acatgacatt...
5
a
4
2
2
4
4
3
1
2
36
Suffix tree construction
5
1234567890...acatgacatt...
5
a
4
c
a
2
4
t
4
t
5

3
6
1
2
37
Suffix tree construction
5
1234567890...acatgacatt...
5
a
c
4
a
c
t
a
4
t
4
t
5
t
5

7
3
6
1
2
  • With this suffix link, when we later need to add
    another suffix, say acaty, we can use the link to
    avoid going back to the root and re-compare cat

38
Suffix tree construction
5
1234567890...acatgacatt...
5
a
c
4
a
c
t
t
a
t
4
t
5
5
t
5
t

7
3
6
8
1
2
39
Suffix tree construction
5
1234567890...acatgacatt...
5
t
a
c
a
5
t
c
t
t
a
9
t
4
t
5
5
t
5
t

7
3
6
8
1
2
40
Suffix tree construction
5
1234567890...acatgacatt...
5
t
a

c
10
a
5
t
c
t
t
a
9
t
4
t
5
5
t
5
t

7
3
6
8
1
2
41
ST Application pattern matching
  • Find all occurrence of Pxa in T
  • Find node v in the ST that matches to P
  • Traverse the subtree rooted at v to get the
    locations

x
a
b
x
a
a
c
c
1
c
b
b
x
x
c
4
6
a
a
c
c
5
2
3
T xabxac
O(m) to construct ST (large constant factor) O(n)
to find v linear to length of P instead of
T! O(k) to get all leaves, k is the number of
occurrence.
42
ST application repeats finding
  • Genome contains many repeated DNA sequences
  • Repeat sequence length Varies from 1 nucleotide
    to whole gene
  • Highly repetitive DNA in some non-coding regions
  • 6 to 10bp x 100,000 to 1,000,000 times
  • Genes may have multiple copies (50 to 10,000)

43
Find longest repeated substring
25
L 4
1518
610
L 9
L 8
  • Do a tree traversal, compute the lengths of
    labels at each node
  • O(m)

44
Repeats finding
  • Find all repeats that are at least k-residue long
    and appear at least p times in the seq
  • Phase 1 top-down, count lengths of labels at
    each node
  • Phase 2 bottom-up count of leaves descended
    from each internal node

For each node with L gt k, and N gt p, print all
leaves
O(m) to traverse tree
(L, N)
45
Repeats finding
5e
1234567890acatgacatt
5
t
a

c
10
a
5e
t
c
t
t
a
9
t
4
t
5e
5e
t
5e
t
7
3
6
8
1
2
  • Find repeats with at least 3 bases and 2
    occurrence
  • cat
  • acat
  • aca

46
Repeats finding
  • Left-maximal repeat
  • Si1..ik Sj1..jk
  • Si ! Sj
  • Right-maximal repeat
  • Si1..ik Sj1..jk,
  • Sik1 ! Sjk1
  • Maximal repeat
  • Si1..ik Sj1..jk
  • Si ! Sj, and Sik1 ! Sjk1

acatgacatt
  • aca
  • cat
  • acat

47
Repeats finding
5e
1234567890acatgacatt
5
t
a

c
10
a
5e
t
c
t
t
a
9
t
4
t
5e
5e
t
5e
t
7
3
6
8
1
2
Left char
g
c
c
a
a
  • How to find maximal repeat?
  • A right-maximal repeats with different left chars

48
ST application word enumeration
  • Find all k-mers that occur at least p times
  • Compute (L, N) for each node
  • Find nodes v with Lgtk, and L(parent)ltk, and Ngty
  • Traverse sub-tree rooted at v to get the locations

Lltk
Lk
L K
Lgtk, Ngtp
This can be used in many applications. For
example, to find words that appeared frequently
in a genome or a document
49
Joint Suffix Tree
  • Build a ST for many than two strings
  • Two strings S1 and S2
  • S S1 S2
  • Build a suffix tree for S in time O(S1 S2)
  • The separator will only appear in the edge ending
    in a leaf

50
  • S1 abcd
  • S2 abca
  • S abcdabca

a b c d
a
d

c
b c d a b c a
a
b
c
b
c
d

d
d

a

a
a
a
2,4
b
1,4
a
c
2,3
a
b
2,1
c
2,2
d
1,1
1,3
1,2
51
To Simplify
a b c d
useless
a
d

c
b c d a b c a
a
a
b
d
c
b
c
c
b c d

b
d
d
d
c


a

d
a
d
a
1,4
a
2,4
b
a
1,4
a
a
c
a
2,4
2,3
a
b
1,1
2,3
2,1
c
2,1
2,2
1,3
d
1,1
2,2
1,2
1,3
1,2
  • We dont really need to do anything, since all
    edge labels were implicit.
  • The right hand side is more convenient to look at

52
Application of JST
  • Longest common substring
  • For each internal node v, keep a bit vector B2
  • B1 1 if a child of v is a suffix of S1
  • Find all internal nodes with B1 B2 1
  • Report one with the longest label
  • Can be extended to k sequences. Just use a longer
    bit vector.

a
d
c
b c d
b
c

d
d
1,4
a
a
a
2,4
1,1
2,3
2,1
1,3
2,2
1,2
O(m), m the total seq length
53
Application of JST
  • Given K strings, find all substrings with Lgtl,
    that appear in at least d strings
  • Exact motif finding problem
  • Build a joint suffix tree with all strings
  • S S1 S2 S3 S4 _at_ S5 ! S6 S7
  • Use a unique end char for each string
  • Not really necessary if caution is taken in
    construction

54
Llt k
B 1010 0011 1011
L gt k
B 3
B 0011
4,x
3,x
3,x
1,x
3,x
B 1010
O(mK), m the total seq length. K is for bitwise
or two bit vectors
55
Many other applications
  • Reproduce the behavior of Aho-Corasick
  • DNA finger printing
  • A database of peoples DNA sequence
  • Given a short DNA, which person is it from?
  • Recognizing DNA contamination
  • Indexing sequence databases
  • Catch
  • Large constant factor for space requirement
    (15-40 bytes per base for DNA)
  • Large constant factor for construction
  • Suffix array trade off time for space

56
Summary
  • One T, one P
  • Boyer-Moore is the choice
  • KMP works but not the best
  • One T, many P
  • Aho-Corasick
  • Suffix Tree
  • One fixed T, many varying P
  • Suffix tree
  • Two or more Ts
  • Suffix tree, joint suffix tree, suffix array

Alphabet independent
Alphabet dependent
Write a Comment
User Comments (0)
About PowerShow.com