CS 6293 Advanced Topics: Translational Bioinformatics - PowerPoint PPT Presentation


Provided by: Jianh152

Learn more at: http://www.cs.utsa.edu


Transcript and Presenter's Notes


1
CS 6293 Advanced Topics: Translational Bioinformatics

Lecture 5
Exact String Matching Algorithms

2
Overview

Sequence alignment: two sub-problems
How to score an alignment with errors
How to find an alignment with the best score
Today: exact string matching
Does not allow any errors
Efficiency becomes the sole consideration
Time and space

3
Why exact string matching?

The most fundamental string comparison problem
Often the core of more complex string comparison algorithms
E.g., BLAST
Often repeatedly called by other methods
Usually the most time-consuming part
A small improvement could improve overall efficiency considerably

4
Definitions

Text: a longer string T (length m)
Pattern: a shorter string P (length n)
Exact matching: find all occurrences of P in T
5
The naïve algorithm
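A minimal Python sketch of the naïve algorithm, using the definitions above (T of length m, P of length n). The function name `naive_search` and the 1-based reporting of positions are choices made for this illustration only.

```python
def naive_search(text, pattern):
    """Try every alignment of pattern against text, comparing left to right."""
    m, n = len(text), len(pattern)
    hits = []
    for start in range(m - n + 1):          # every possible alignment
        k = 0
        while k < n and text[start + k] == pattern[k]:
            k += 1                          # extend the match char by char
        if k == n:                          # all n chars matched
            hits.append(start + 1)          # report 1-based start position
    return hits


print(naive_search("xabxac", "xa"))  # [1, 4]
```

Each alignment can cost up to n comparisons and there are up to m alignments, which is where the O(mn) worst case comes from (e.g. T = aaaa...a, P = aa...ab).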
6
Time complexity

Worst case: O(mn)
How to speed up?
Pre-process T or P
Why can pre-processing save us time?
It uncovers the structure of T or P
It determines when we can skip ahead without missing anything
It determines when we can infer the result of character comparisons without doing them

7
Cost for exact string matching

Total cost = cost(preprocessing) + cost(comparison) + cost(output)
cost(preprocessing): overhead
cost(comparison): minimize
cost(output): constant
Hope: gain > overhead
8
String matching scenarios

One T and one P
Search a word in a document
One T and many P all at once
Search a set of words in a document
Spell checking (fixed P)
One fixed T, many P
Search a completed genome for short sequences
Two (or many) Ts for common patterns
Q: Which one to pre-process?
A: Always pre-process the shorter sequence, or the one that is used repeatedly

9
Pre-processing algorithms

Pattern preprocessing
Knuth-Morris-Pratt algorithm (KMP)
Aho-Corasick algorithm
Multiple patterns
Boyer-Moore algorithm (discussed if time permits)
The choice in most cases
Typically sub-linear time
Text preprocessing
Suffix tree
Very useful for many purposes
Suffix array
Burrows-Wheeler Transform

10
Algorithm KMP: Intuitive example 1
[Figure: P = abcxabcde aligned under T, which contains abcxabc followed by a char that mismatches P[8]; the naïve approach shifts P by one position and starts over]

Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens when comparing P[8] with T[i], we can shift P by four chars and compare P[4] with T[i], without missing any possible matches.
Number of comparisons saved: 6

11
Intuitive example 2
[Figure: P = abcxabcde aligned under T, with the mismatch at P[7]; the naïve approach shifts P by one position and starts over]

Observation: by reasoning on the pattern alone, we can determine that if a mismatch happens between P[7] and T[j], we can shift P by six chars and compare T[j] with P[1], without missing any possible matches.
Number of comparisons saved: 7

12
KMP algorithm: pre-processing

Key: the reasoning is done without even knowing what string T is.
Only the location of the mismatch in P must be known.

[Figure: P aligned under T before and after a shift; x is the mismatched char in T, t is the matched suffix of P ending at position i (starting at j), and y and z are the chars that follow the two copies of t in P]
Pre-processing: for any position i in P, find the longest proper suffix of P[1..i], t = P[j..i], such that t matches a prefix of P, t', and the next char of t is different from the next char of t' (i.e., y ≠ z). For each i, let sp(i) = length(t).
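The pre-processing step can be sketched in Python as follows. This is one possible way to compute the values, not necessarily the one used in the lecture: it first computes the ordinary border lengths and then applies the "next chars must differ" condition. The function name `failure_values` is hypothetical; the returned list holds sp(1), ..., sp(n) at indices 0, ..., n-1.

```python
def failure_values(pattern):
    """Return sp(i) for i = 1..n (stored 0-based), with the y != z condition."""
    n = len(pattern)
    # weak[i]: length of the longest proper suffix of pattern[:i+1]
    # that is also a prefix of the pattern (no condition on next chars).
    weak = [0] * n
    k = 0
    for i in range(1, n):
        while k and pattern[i] != pattern[k]:
            k = weak[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        weak[i] = k
    # Enforce that the char after the prefix differs from the char after
    # the suffix; otherwise fall back to the value of the shorter prefix.
    sp = weak[:]
    for i in range(n - 1):
        k = weak[i]
        if k and pattern[k] == pattern[i + 1]:
            sp[i] = sp[k - 1]
    return sp


print(failure_values("aataac"))   # [0, 1, 0, 0, 2, 0]
print(failure_values("abababc"))  # [0, 0, 0, 0, 0, 4, 0]
```

Both steps are linear in the pattern length, and neither looks at T.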
13
KMP algorithm: shift rule
[Figure: P aligned under T before and after the shift; after the shift the prefix of length sp(i) lies under the matched suffix t, and P[sp(i)+1] is compared with x]
Shift rule: when a mismatch occurs between P[i+1] and T[k], shift P to the right by i - sp(i) chars and compare x with z. This shift rule can be implicitly represented by creating a failure link between y and z. Meaning: when a mismatch occurs between x in T and P[i+1], resume comparison between x and P[sp(i)+1].
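A sketch of how the shift rule might be applied during the scan of T. The function `kmp_search` is a hypothetical helper that takes the sp values as an argument (stored 0-based, so `sp[i - 1]` is sp(i)); here the values for P = aataac are typed in by hand. It also counts character comparisons, which is an addition for illustration.

```python
def kmp_search(text, pattern, sp):
    """Scan text once, following failure links on mismatches.

    Returns (1-based start positions of matches, number of comparisons).
    """
    n = len(pattern)
    hits, comparisons = [], 0
    k = 0                                   # pattern chars matched so far
    for pos, x in enumerate(text):
        while True:
            comparisons += 1
            if x == pattern[k]:             # x matches P[k+1]
                k += 1
                break
            if k == 0:                      # nothing to fall back to
                break
            k = sp[k - 1]                   # failure link: retry x at P[sp(k)+1]
        if k == n:                          # full occurrence found
            hits.append(pos - n + 2)
            k = sp[k - 1]
    return hits, comparisons


hits, comparisons = kmp_search("aacaataaaaataaccttacta", "aataac",
                               [0, 1, 0, 0, 2, 0])
print(hits)  # [10]
```

The text pointer never moves backwards; only the position in P changes when a failure link is followed.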
14
Failure Link Example

P = aataac

If a char in T fails to match at pos 6, re-compare it with the char at pos 3 (= 2 + 1)
i      1 2 3 4 5 6
P[i]   a a t a a c
sp(i)  0 1 0 0 2 0
15
Another example

P = abababc

If a char in T fails to match at pos 7, re-compare it with the char at pos 5 (= 4 + 1)
i      1 2 3 4 5 6 7
P[i]   a b a b a b c
sp(i)  0 0 0 0 0 4 0
16
KMP Example using Failure Link
[Figure: failure links for P = aataac, and P sliding along T as the links are followed]
T = aacaataaaaataaccttacta
P = aataac

Time complexity analysis
Each char in T may be compared up to n times. A lousy analysis gives O(mn) time.
More careful analysis: the comparisons can be broken into two phases
Comparison phase: the first time a char in T is compared to P. Total is exactly m.
Shift phase: first comparisons made after a shift. Total is at most m.
Time complexity: O(2m) = O(m)
17
KMP algorithm using DFA (Deterministic Finite Automaton)

P = aataac

Failure link: if a char in T fails to match at pos 6, re-compare it with the char at pos 3
DFA: if the next char in T is t after matching 5 chars, go to state 3
[Figure: the failure links over a a t a a c, and the corresponding DFA with states 0 to 6]
All other inputs go to state 0.
18
DFA Example
[Figure: the DFA for P = aataac with states 0 to 6]
T:     aacaataataataaccttacta
State: 1201234534534560001001
Each char in T is examined exactly once. Therefore, exactly m comparisons are made. But pre-processing takes longer, and more space is needed to store the DFA.
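The DFA can be built with a standard restart-state construction, sketched below in Python. The names `build_dfa` and `run_dfa` and the explicit `alphabet` argument are assumptions made for this example; characters outside the alphabet are sent to state 0.

```python
def build_dfa(pattern, alphabet):
    """dfa[state][char] -> next state; state = number of chars matched."""
    n = len(pattern)
    dfa = [{c: 0 for c in alphabet} for _ in range(n + 1)]
    dfa[0][pattern[0]] = 1
    x = 0                                   # state reached on pattern[1:j]
    for j in range(1, n + 1):
        for c in alphabet:
            dfa[j][c] = dfa[x][c]           # mismatch: behave like state x
        if j < n:
            dfa[j][pattern[j]] = j + 1      # match: advance
            x = dfa[x][pattern[j]]
    return dfa


def run_dfa(text, pattern, alphabet):
    """Feed text through the DFA; one table lookup per char of text."""
    dfa = build_dfa(pattern, alphabet)
    state, states, hits = 0, [], []
    for pos, ch in enumerate(text, start=1):
        state = dfa[state].get(ch, 0)
        states.append(state)
        if state == len(pattern):           # accepting state
            hits.append(pos - len(pattern) + 1)
    return states, hits


states, hits = run_dfa("aacaataataataaccttacta", "aataac", "act")
print("".join(map(str, states)))  # 1201234534534560001001
print(hits)                       # [10]
```

The table has (n + 1) rows and one column per alphabet symbol, which is the extra pre-processing time and space mentioned above.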
19
Difference between Failure Link and DFA

Failure link
Preprocessing time and space are O(n), regardless of alphabet size
Comparison time is at most 2m (at least m)
DFA
Preprocessing time and space are O(n|Σ|)
May be a problem for a very large alphabet
For example, when each char is a big integer
Chinese characters
Comparison time is always m.

20
The set matching problem

Find all occurrences of a set of patterns in T
First idea: run KMP or BM for each P
O(km + n)
k: number of patterns
m: length of text
n: total length of patterns
Better idea: combine all patterns together and search in one run

21
A simpler problem: spell-checking

A dictionary contains five words
potato
poetry
pottery
science
school
Given a document, check whether each word is (or is not) in the dictionary
Words in the document are separated by special chars.
Relatively easy.

22
Keyword tree for spell checking
Example text: This version of the potato gun was inspired by the Weird Science team out of Illinois
[Figure: keyword tree of the five dictionary words; words share the nodes of their common prefixes, and the leaves are numbered 1 to 5]

O(n) time to construct. n: total length of patterns.
Search time: O(m). m: length of text.
A common prefix only needs to be compared once.
What if there is no space between words?
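A small Python sketch of the keyword tree for the spell-checking case, where words are already delimited. Nested dictionaries stand in for tree nodes; the `None` key used as an end-of-word marker, the lower-casing of the document, and the function names are all choices made for this illustration.

```python
import re


def build_keyword_tree(words):
    """Insert each word char by char; shared prefixes share nodes."""
    root = {}
    for number, word in enumerate(words, start=1):
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[None] = number                 # end-of-word marker holds the word number
    return root


def lookup(root, word):
    """Return the word's number if it is in the tree, else None."""
    node = root
    for ch in word:
        node = node.get(ch)
        if node is None:
            return None
    return node.get(None)


def words_in_dictionary(document, root):
    """Split the document on non-letters and look every word up."""
    hits = []
    for token in re.findall(r"[a-z]+", document.lower()):
        number = lookup(root, token)
        if number is not None:
            hits.append((token, number))
    return hits


tree = build_keyword_tree(["potato", "poetry", "pottery", "science", "school"])
text = ("This version of the potato gun was inspired by "
        "the Weird Science team out of Illinois")
print(words_in_dictionary(text, tree))  # [('potato', 1), ('science', 4)]
```

Each lookup restarts at the root, which only works because word boundaries are known; without them a failed partial match would have to be resumed somewhere else in the tree.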

23
Aho-Corasick algorithm

Basis of the fgrep algorithm
Generalizing KMP
Using failure links
Example: given the following 4 patterns
potato
tattoo
theater
other

24
Keyword tree
[Figure: keyword tree for the patterns potato, tattoo, theater and other; root 0, leaves numbered 1 to 4]
25
Keyword tree
[Figure: keyword tree for the patterns potato, tattoo, theater and other; root 0, leaves numbered 1 to 4]
T = potherotathxythopotattooattoo
26
Keyword tree
[Figure: keyword tree for the patterns potato, tattoo, theater and other; root 0, leaves numbered 1 to 4]
T = potherotathxythopotattooattoo
Search time without failure links: O(mn)
m: length of text. n: length of longest pattern
27
Keyword Tree with a failure link
[Figure: the same keyword tree with one failure link added]
T = potherotathxythopotattooattoo
29
Keyword Tree with all failure links
[Figure: the keyword tree for potato, tattoo, theater and other with all failure links]
30
Example
[Figure: searching T in the keyword tree with failure links]
T = potherotathxythopotattooattoo
35
Aho-Corasick algorithm

O(n) preprocessing, and O(m + k) searching.
n: total length of patterns.
m: length of text.
k: number of occurrences.
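A compact Python sketch of Aho-Corasick on the four example patterns. It follows the usual textbook construction (keyword tree, then failure links by breadth-first search); the function names and the list-based node layout are choices made for this example, not taken from the lecture.

```python
from collections import deque


def build_automaton(patterns):
    """Keyword tree plus failure links, built breadth-first."""
    goto, fail, out = [{}], [0], [[]]       # node 0 is the root
    for idx, pat in enumerate(patterns):
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(idx)               # pattern idx ends at this node
    queue = deque(goto[0].values())         # depth-1 nodes fail to the root
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, patterns):
    """Return (1-based start, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]               # follow failure links on mismatch
        node = goto[node].get(ch, 0)
        for idx in out[node]:
            hits.append((i - len(patterns[idx]) + 2, patterns[idx]))
    return hits


patterns = ["potato", "tattoo", "theater", "other"]
print(search("potherotathxythopotattooattoo", patterns))
# [(2, 'other'), (19, 'tattoo')]
```

As in KMP, the position in T never moves backwards; the output lists are what add the k term to the search time.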

36
Suffix Tree

All the algorithms we have talked about so far pre-process the pattern(s)
Boyer-Moore: fastest in practice. O(m) worst case.
KMP: O(m)
Aho-Corasick: O(m)
In some cases we may prefer to pre-process T
Fixed T, varying P
Suffix tree: basically a keyword tree of all suffixes

37
Suffix tree

T = xabxac
Suffixes
xabxac
abxac
bxac
xac
ac
c

[Figure: suffix tree of T = xabxac; leaves numbered 1 to 6 by the start position of their suffix]
Naïve construction: O(m^2) using Aho-Corasick.
Smarter: O(m). Very technical, with a big constant factor.
Difference from a keyword tree: create an internal node only when there is a branch.
38
Suffix tree implementation

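Explicitly labeling sequence end
T = xabxa
[Figure: suffix tree of T = xabxa drawn without and with an explicit end-of-sequence marker; with the marker every suffix ends at its own leaf, numbered 1 to 5]

One-to-one correspondence of leaves and suffixes
|T| leaves, hence < |T| internal nodes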

39
Suffix tree implementation

Implicitly labeling edges
T = xabxa
[Figure: the same suffix tree with each edge label replaced by a pair of positions in T]

|Tree(T)| = O(|T| + size(edge labels))

40
Suffix links

Similar to failure links in a keyword tree
Only link internal nodes that have branches
[Figure: suffix links between branching internal nodes; example with P = xabcf]
41
ST Application 1: pattern matching

Find all occurrences of P = xa in T
Find node v in the ST that matches P
Traverse the subtree rooted at v to get the locations
[Figure: suffix tree of T = xabxac; the path xa ends at node v, whose leaves 1 and 4 are the occurrences]

O(m) to construct the ST (large constant factor)
O(n) to find v: linear in the length of P instead of T!
O(k) to get all leaves; k is the number of occurrences.
Asymptotic time is the same as KMP. ST wins if T is fixed. KMP wins otherwise.
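The two steps (find v, then collect the leaves below it) can be sketched in Python. Note the assumption: the tree here is built by naïve suffix-by-suffix insertion, which takes O(m^2) time, not by the O(m) construction. The `Node` class, the function names, and the "$" terminator (assumed not to occur in T) are illustrative choices.

```python
class Node:
    def __init__(self):
        self.edges = {}            # first char -> (edge label, child node)
        self.suffix_start = None   # set on leaves: 1-based start of the suffix


def build_suffix_tree(text):
    """Naive construction: insert every suffix, splitting edges at branches."""
    text += "$"                                   # explicit sequence end
    root = Node()
    for start in range(len(text)):
        node, rest = root, text[start:]
        while True:
            if rest[0] not in node.edges:         # branch off with a new leaf
                leaf = Node()
                leaf.suffix_start = start + 1
                node.edges[rest[0]] = (rest, leaf)
                break
            label, child = node.edges[rest[0]]
            k = 0
            while k < len(label) and label[k] == rest[k]:
                k += 1
            if k < len(label):                    # split edge: new internal node
                mid = Node()
                mid.edges[label[k]] = (label[k:], child)
                node.edges[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def find_occurrences(root, pattern):
    """Walk down along pattern to node v, then collect all leaves below v."""
    node, rest = root, pattern
    while rest:
        entry = node.edges.get(rest[0])
        if entry is None:
            return []
        label, child = entry
        n = min(len(label), len(rest))
        if label[:n] != rest[:n]:
            return []
        node, rest = child, rest[n:]
    hits, stack = [], [node]
    while stack:
        v = stack.pop()
        if v.suffix_start is not None:
            hits.append(v.suffix_start)
        stack.extend(child for _, child in v.edges.values())
    return sorted(hits)


tree = build_suffix_tree("xabxac")
print(find_occurrences(tree, "xa"))  # [1, 4]
```

Once the tree exists, each query touches only the path for P and the leaves below v, regardless of the length of T.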

42
ST application 2: repeats finding

Genome contains many repeated DNA sequences
Repeat sequence length: varies from 1 nucleotide to millions
Genes may have multiple copies (50 to 10,000)
Highly repetitive DNA in some non-coding regions
6 to 10 bp x 100,000 to 1,000,000 times
Problem: find all repeats that are at least k residues long and appear at least p times in the genome

43
Repeats finding

At least k residues long and appear at least p times in the seq
Phase 1: top-down, count label lengths (L) from root to each node
Phase 2: bottom-up, count the number of leaves (N) descended from each internal node

For each node with L >= k and N >= p, print all leaves
O(m) to traverse tree
[Figure: suffix tree with each internal node annotated (L, N)]
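A Python sketch of the two phases. To keep it short, it uses an uncompressed suffix trie (one node per character, O(m^2) size) and treats the nodes with at least two children as the internal nodes of the suffix tree; this is a simplification assumed for the example, and the names are hypothetical. "$" is assumed not to occur in the sequence.

```python
class Node:
    def __init__(self):
        self.children = {}         # char -> Node
        self.suffix_start = None   # set on leaves: 1-based start of the suffix


def build_suffix_trie(text):
    text += "$"
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
        node.suffix_start = start + 1
    return root


def find_repeats(text, k, p):
    """Map each repeat with L >= k and N >= p to its sorted start positions."""
    found = {}

    def visit(node, label):
        # Phase 1 (top-down): len(label) is L, the label length from the root.
        if node.suffix_start is not None:
            return [node.suffix_start]
        leaves = []
        for ch, child in node.children.items():
            leaves.extend(visit(child, label + ch))
        # Phase 2 (bottom-up): len(leaves) is N, the leaves below this node.
        branching = len(node.children) >= 2
        if branching and len(label) >= k and len(leaves) >= p:
            found[label] = sorted(leaves)
        return leaves

    visit(build_suffix_trie(text), "")
    return found


print(find_repeats("acatgacatt", 3, 2))
```

For acatgacatt with k = 3 and p = 2 this reports acat at positions 1 and 6 and cat at positions 2 and 7. The recursion is only suitable for short inputs; a real suffix tree with an explicit stack would be needed for genome-sized sequences.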
44
Maximal repeats finding

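Right-maximal repeat
S[i+1..i+k] = S[j+1..j+k],
but S[i+k+1] != S[j+k+1]
Left-maximal repeat
S[i+1..i+k] = S[j+1..j+k],
but S[i] != S[j]
Maximal repeat
S[i+1..i+k] = S[j+1..j+k],
but S[i] != S[j], and S[i+k+1] != S[j+k+1]

Example: S = acatgacatt
cat: right-maximal
aca: left-maximal
acat: maximal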
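The three definitions can be checked directly with a brute-force Python sketch. This is not the suffix-tree method; it simply enumerates all substrings and compares the flanking characters of every pair of occurrences. The function name is hypothetical, and a missing neighbor at either end of S is treated as different from any character.

```python
from collections import defaultdict
from itertools import combinations


def classify_repeats(s, min_len):
    """Return (right-maximal, left-maximal, maximal) repeats of length >= min_len."""
    starts_of = defaultdict(list)
    for i in range(len(s)):
        for j in range(i + min_len, len(s) + 1):
            starts_of[s[i:j]].append(i)     # 0-based start of each occurrence

    def char_at(pos):
        return s[pos] if 0 <= pos < len(s) else None

    right, left, maximal = set(), set(), set()
    for sub, starts in starts_of.items():
        for i, j in combinations(starts, 2):
            left_differs = char_at(i - 1) != char_at(j - 1)
            right_differs = char_at(i + len(sub)) != char_at(j + len(sub))
            if right_differs:
                right.add(sub)
            if left_differs:
                left.add(sub)
            if left_differs and right_differs:
                maximal.add(sub)
    return right, left, maximal


right, left, maximal = classify_repeats("acatgacatt", 3)
print(sorted(right), sorted(left), sorted(maximal))
# ['acat', 'cat'] ['aca', 'acat'] ['acat']
```

acat appears in all three sets because a maximal repeat is by definition both left- and right-maximal.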

45
Maximal repeats finding
S = acatgacatt (positions 1 to 10)
[Figure: suffix tree of S; leaves numbered 1 to 10 by the start position of their suffix]

Find repeats with at least 3 bases and 2 occurrences
Right-maximal: cat
Maximal: acat
Left-maximal: aca

46
Maximal repeats finding
S = acatgacatt (positions 1 to 10)
[Figure: the same suffix tree, with leaves labeled by their left char, the char that precedes the suffix in S]

How to find maximal repeats?
A right-maximal repeat whose occurrences have different left chars

47
Joint Suffix Tree (JST)

Build a ST for two or more strings
Two strings S1 and S2
S = S1 + separator + S2
Build a suffix tree for S in time O(|S1| + |S2|)
The separator will only appear in the edge ending in a leaf (why?)
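A rough Python sketch of the idea, under several assumptions: it uses an uncompressed trie instead of a linear-time suffix tree, it gives each string its own terminator ("$" and "#", assumed not to occur in the strings), and it cuts every suffix at its terminator so that nothing after a separator is stored. Leaves are labeled (seq ID, suffix ID). All names are hypothetical.

```python
class Node:
    def __init__(self):
        self.children = {}   # char -> Node
        self.labels = []     # (seq id, suffix id) of the suffix ending here


def build_joint_suffix_trie(strings, terminators="$#"):
    """Insert every suffix of every string, each ended by its own terminator."""
    root = Node()
    for seq_id, (s, end) in enumerate(zip(strings, terminators), start=1):
        for start in range(len(s)):
            node = root
            for ch in s[start:] + end:
                node = node.children.setdefault(ch, Node())
            node.labels.append((seq_id, start + 1))
    return root


def leaves_below(root, prefix):
    """All (seq id, suffix id) labels of suffixes that start with prefix."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    found, stack = [], [node]
    while stack:
        v = stack.pop()
        found.extend(v.labels)
        stack.extend(v.children.values())
    return sorted(found)


tree = build_joint_suffix_trie(["abcd", "abca"])
print(leaves_below(tree, "abc"))  # [(1, 1), (2, 1)]
print(leaves_below(tree, "a"))    # [(1, 1), (2, 1), (2, 4)]
```

A node whose leaves carry both sequence IDs, such as the one for abc, spells a substring common to S1 and S2.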

48
Joint suffix tree example

S1 = abcd
S2 = abca
S = abcd + separator + abca

[Figure: joint suffix tree of S; each leaf is labeled (Seq ID, Suffix ID), from (1,1) to (1,4) and from (2,1) to (2,4); the leaf (2,0) is marked useless]
49
To Simplify
[Figure: the joint suffix tree of S1 = abcd and S2 = abca before and after simplification; the useless leaf is removed and the leaves keep their (Seq ID, Suffix ID) labels]