CS 5263 Bioinformatics

About This Presentation

Title:

CS 5263 Bioinformatics

Description:

Lectures 3-6: Pair-wise Sequence Alignment ... – PowerPoint PPT presentation

Number of Views:124

Avg rating:3.0/5.0

Slides: 198

Provided by: Jian144

Learn more at: http://www.cs.utsa.edu

Category:

more less

Transcript and Presenter's Notes

Title: CS 5263 Bioinformatics

1
CS 5263 Bioinformatics

Lectures 3-6 Pair-wise Sequence Alignment

2
Outline

Part I Algorithms
Biological problem
Intro to dynamic programming
Global sequence alignment
Local sequence alignment
More efficient algorithms
Part II Biological issues
Model gaps more accurately
Alignment statistics
Part III BLAST

3
Evolution at the DNA level
C
ACGGTGCAGTCACCA
ACGTTGC-GTCCACCA
DNA evolutionary events (sequence
edits) Mutation, deletion, insertion
4
Sequence conservation implies function

next generation
OK

OK

OK

X

X

Still OK?

5
Why sequence alignment?

Conserved regions are more likely to be
functional
Can be used for finding genes, regulatory
elements, etc.
Similar sequences often have similar origin and
function
Can be used to predict functions for new genes /
proteins
Sequence alignment is one of the most widely used
computational tools in biology

6
Global Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGG
TCGATTTGCCCGAC
S
T
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- TAG-CTATCAC-
-GACCGC--GGTCGATTTGCCCGAC
S
T

Definition
An alignment of two strings S, T is a pair of
strings S, T (with spaces) s.t.
S T, and (S length of S)
removing all spaces in S, T leaves S, T

7
What is a good alignment?

Alignment
The best way to match the letters of one
sequence with those of the other
How do we define best?

8
S -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC--- T
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

The score of aligning (characters or spaces) x
y is s (x,y).
Score of an alignment
An optimal alignment one with max score

9
Scoring Function

Sequence edits
AGGCCTC
Mutations AGGACTC
Insertions AGGGCCTC
Deletions AGG-CTC
Scoring Function
Match m AAC
Mismatch -s A-A
Gap (indel) -d

Match 2, mismatch -1, gap -1
Score 3 x 2 2 x 1 1 x 1 3

11
More complex scoring function

Substitution matrix
Similarity score of matching two letters a, b
should reflect the probability of a, b derived
from the same ancestor
It is usually defined by log likelihood ratio
Active research area. Especially for proteins.
Commonly used PAM, BLOSUM

12
An example substitution matrix
A C G T
A 3 -2 -1 -2
C 3 -2 -1
G 3 -2
T 3
13
How to find an optimal alignment?

A naïve algorithm
for all subseqs A of S, B of T s.t. A B do
align Ai with Bi, 1 i A
align all other chars to spaces
compute its value
retain the max
end
output the retained alignment

S abcd A cd T wxyz B xz -abc-d
a-bc-d w--xyz -w-xyz
14
Analysis

Assume S T n
Cost of evaluating one alignment n
How many alignments are there
pick n chars of S,T together
say k of them are in S
match these k to the k unpicked chars of T
Total time
E.g., for n 20, time is gt 240 gt1012 operations

Intro to Dynamic Programming

16
Dynamic programming

What is dynamic programming?
A method for solving problems exhibiting the
properties of overlapping subproblems and optimal
substructure
Key idea tabulating sub-problem solutions rather
than re-computing them repeatedly
Two simple examples
Computing Fibonacci numbers
Find the special shortest path in a grid

17
Example 1 Fibonacci numbers

1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89,
F(0) 1
F(1) 1
F(n) F(n-1) f(n-2)
How to compute F(n)?

18
A recursive algorithm

function fib(n)
if (n 0 or n 1) return 1
else return fib(n-1) fib(n-2)

19
n/2
n

Time complexity
Between 2n/2 and 2n
O(1.62n), i.e. exponential
Why recursive Fib algorithm is inefficient?
Overlapping subproblems

20
An iterative algorithm

function fib(n)
F0 1 F1 1
for i 2 to n
Fi Fi-1 Fi-2
Return Fn

Time complexity Time O(n), space O(n)
21
Example 2 shortest path in a grid
S

m
G
n
Each edge has a length (cost). We need to get to
G from S. Can only move right or down. Aim find
a path with the minimum total length
22
Optimal substructures

Naïve algorithm enumerate all possible paths and
compare costs
Exponential number of paths
Key observation
If a path P(S, G) is the shortest from S to G,
any of its sub-path P(S,x), where x is on P(S,G),
is the shortest from S to x

23
Proof

Proof by contradiction
If the path between P(S,x) is not the shortest,
i.e., P(S,x) lt P(S,x)
Construct a new path P(S,G) P(S,x) P(x, G)
P(S,G) lt P(S,G) gt P(S,G) is not the shortest
Contradiction
Therefore, P(S, x) is the shortest

S

x
G
24
Recursive solution
(0,0)

Index each intersection by two indices, (i, j)
Let F(i, j) be the total length of the shortest
path from (0, 0) to (i, j). Therefore, F(m, n) is
the shortest path we wanted.
To compute F(m, n), we need to compute both
F(m-1, n) and F(m, n-1)

m
(m, n)
n
F(m-1, n) length((m-1, n), (m, n)) F(m,
n) min F(m, n-1) length((m,
n-1), (m, n))
25
Recursive solution
F(i-1, j) length((i-1, j), (i, j)) F(i, j)
min F(i, j-1) length((i, j-1), (i, j))
(0,0)

But if we use recursive call, many subpaths will
be recomputed for many times
Strategy pre-compute F values starting from the
upper-left corner. Fill in row by row (what other
order will also do?)

(i-1, j)
(i, j)
(i, j-1)
m
(m, n)
n
26
Dynamic programming illustration
S
9
1
2
3
3
12
13
15
0

5
3
3
3
3
2
5
2
3
6
8
13
15
5
2
3
3
9
3
4
2
3
2
9
7
11
13
16
6
2
3
7
4
6
3
3
3
11
14
17
20
13
4
6
3
1
3
2
3
2
1
17
17
18
20
17
G
F(i-1, j) length(i-1, j, i, j) F(i, j)
min F(i, j-1) length(i, j-1, i, j)
27
Trackback
9
1
2
3
3
12
13
15
0

5
3
3
3
3
2
5
2
3
6
8
13
15
5
2
3
3
9
3
4
2
3
2
9
7
11
13
16
6
2
3
7
4
6
3
3
3
11
14
17
20
13
4
6
3
1
3
2
3
2
1
17
17
18
20
17
28
Elements of dynamic programming

Optimal sub-structures
Optimal solutions to the original problem
contains optimal solutions to sub-problems
Overlapping sub-problems
Some sub-problems appear in many solutions
Memorization and reuse
Carefully choose the order that sub-problems are
solved

29
Dynamic Programming for sequence alignment

Suppose we wish to align
x1xM
y1yN
Let F(i,j) optimal score of aligning
x1xi
y1yj
Scoring Function
Match m
Mismatch -s
Gap (indel) -d

30
Elements of dynamic programming

Optimal sub-structures
Optimal solutions to the original problem
contains optimal solutions to sub-problems
Overlapping sub-problems
Some sub-problems appear in many solutions
Memorization and reuse
Carefully choose the order that sub-problems are
solved

31
Optimal substructure

If xi is aligned to yj in the optimal
alignment between x1..M and y1..N, then
The alignment between x1..i and y1..j is also
optimal
Easy to prove by contradiction

32
Recursive solution

Notice three possible cases
xM aligns to yN
xM
yN
2. xM aligns to a gap
xM
?
yN aligns to a gap
?
yN

m, if xM yN F(M,N)
F(M-1, N-1) -s, if not
F(M,N) F(M-1, N) - d
F(M,N) F(M, N-1) - d
33
Recursive solution

Generalize
F(i-1, j-1) ?(Xi,Yj)
F(i,j) max F(i-1, j) d
F(i, j-1) d
?(Xi,Yj) m if Xi Yj, and s otherwise
Boundary conditions
F(0, 0) 0.
F(0, j) ?
F(i, 0) ?

-jd y1..j aligned to gaps.
-id x1..i aligned to gaps.
34
What order to fill?
F(0,0)

F(M,N)
i
j
35
What order to fill?
F(0,0)

F(M,N)
36
Example

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A

A
T
A
j 0
1
2
3
37
Example

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1
T -2
A -3
j 0
1
2
3
38
Example

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2
A -3
j 0
1
2
3
39
Example

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3
j 0
1
2
3
40
Example

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2
j 0
1
2
3
41
Example

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 This only tells us
the best score
j 0
1
2
3
42
Trace-back

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
A
A
2
3
43
Trace-back

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
T A
T A
2
3
44
Trace-back

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
G T A
- T A
2
3
45
Trace-back

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
A G T A
A - T A
2
3
46
Trace-back

x AGTA m 1
y ATA s 1
d 1

F(i-1, j-1) ?(Xi,Yj) F(i,j)
max F(i-1, j) d F(i,
j-1) d
F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
47
Using trace-back pointers

x AGTA m 1
y ATA s 1
d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1
T -2
A -3
j 0
1
2
3
48
Using trace-back pointers

x AGTA m 1
y ATA s 1
d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2
A -3
j 0
1
2
3
49
Using trace-back pointers

x AGTA m 1
y ATA s 1
d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3
j 0
1
2
3
50
Using trace-back pointers

x AGTA m 1
y ATA s 1
d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
51
Using trace-back pointers

x AGTA m 1
y ATA s 1
d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
52
Using trace-back pointers

x AGTA m 1
y ATA s 1
d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
53
Using trace-back pointers

x AGTA m 1
y ATA s 1
d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
j 0
1
2
3
54
Using trace-back pointers

x AGTA m 1
y ATA s 1
d 1

F(i,j) i 0 1 2 3 4
A G T A
0 -1 -2 -3 -4
A -1 1 0 -1 -2
T -2 0 0 1 0
A -3 -1 -1 0 2
Optimal Alignment F(4,3) 2 AGTA A?TA
j 0
1
2
3
55
The Needleman-Wunsch Algorithm

Initialization.
F(0, 0) 0
F(0, j) - j ? d
F(i, 0) - i ? d
Main Iteration. Filling in scores
For each i 1M
For each j 1N
F(i-1,j) d case 1
F(i, j) max F(i, j-1) d
case 2
F(i-1, j-1) s(xi, yj) case 3
UP, if case 1
Ptr(i,j) LEFT if case 2
DIAG if case 3
Termination. F(M, N) is the optimal score, and
from Ptr(M, N) can trace back optimal alignment

56
Complexity

Time
O(NM)
Space
O(NM)
Linear-space algorithms do exist (with the same
time complexity)

57
Equivalent graph problem
S1
G
A
T
A
(0,0)

? a gap in the 2nd sequence ? a gap in the 1st
sequence match / mismatch
1
1
A
S2
1
T
Value on vertical/horizontal line -d Value on
diagonal m or -s
1
1
A
(3,4)

Number of steps length of the alignment
Path length alignment score
Optimal alignment find the longest path from (0,
0) to (3, 4)
General longest path problem cannot be found with
DP. Longest path on this graph can be found by DP
since no cycle is possible.

58
Question

If we change the scoring scheme, will the optimal
alignment be changed?
Old Match 1, mismatch gap -1
New match 2, mismatch gap 0
New Match 2, mismatch gap -2?

59
Question

What kind of alignment is represented by these
paths?

A
A
A
A
A

B C
B C
B C
B C
B C
A- BC
A-- -BC
--A BC-
-A- B-C
-A BC
Alternating gaps are impossible if s gt -2d
60
A variant of the basic algorithm

Scoring scheme m s d 1
Seq1 CAGCA-CTTGGATTCTCGG
Seq2 ---CAGCGTGG--------
Seq1 CAGCACTTGGATTCTCGG
Seq2 CAGC-----G-T----GG
The first alignment may be biologically more
realistic in some cases (e.g. if we know s2 is a
subsequence of s1)

Score -7
Score -2
61
A variant of the basic algorithm

Maybe it is OK to have an unlimited of gaps in
the beginning and end

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC GCG
AGTTCATCTATCAC--GACCGC--GGTCG--------------

Then, we dont want to penalize gaps in the ends

62
The Overlap Detection variant

Changes
Initialization
For all i, j,
F(i, 0) 0
F(0, j) 0
Termination
maxi F(i, N)
FOPT max maxj F(M, j)

x1 xM
yN y1
63
Different types of overlaps
x
x
y
y
64
The local alignment problem

Given two strings X x1xM,
Y y1yN
Find substrings x, y whose similarity (optimal
global alignment value) is maximum
e.g. X abcxdex X cxde
Y xxxcde Y c-de

x
y
65
Why local alignment

Conserved regions may be a small part of the
whole
Global alignment might miss them if flanking
junk outweighs similar regions
Genes are shuffled between genomes

C
D
B
A
D
A
B
C
66
Naïve algorithm

for all substrings X of X and Y of Y
Align X Y via dynamic programming
Retain pair with max value
end
Output the retained pair
Time O(n2) choices for A, O(m2) for B, O(nm) for
DP, so O(n3m3 ) total.

67
Reminder

The overlap detection algorithm
We do not give penalty to gaps at either end

Free gap
Free gap
68
The local alignment idea

Do not penalize the unaligned regions (gaps or
mismatches)
The alignment can start anywhere and ends
anywhere
Strategy whenever we get to some low similarity
region (negative score), we restart a new
alignment
By resetting alignment score to zero

69
The Smith-Waterman algorithm

Initialization F(0, j) F(i, 0) 0
0
F(i 1, j) d
F(i, j 1) d
F(i 1, j 1) ?(xi, yj)

Iteration F(i, j) max
70
The Smith-Waterman algorithm

Termination
If we want the best local alignment
FOPT maxi,j F(i, j)
If we want all local alignments scoring gt t
For all i, j find F(i, j) gt t, and trace back

71
x x x c d e
0 0 0 0 0 0 0
a 0
b 0
c 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
72
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
73
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
74
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0
e 0
x 0
Match 2 Mismatch -1 Gap -1
75
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0 1 1 1 1 3 2
e 0
x 0
Match 2 Mismatch -1 Gap -1
76
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 0 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0
Match 2 Mismatch -1 Gap -1
77
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
78
Trace back
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
79
Trace back
x x x c d e
0 0 0 0 0 0 0
a 0 0 0 0 0 0 0
b 0 0 0 0 0 0 0
c 0 0 0 0 2 1 0
x 0 2 2 2 1 1 0
d 0 1 1 1 1 3 2
e 0 0 0 0 0 2 5
x 0 2 2 2 1 1 4
Match 2 Mismatch -1 Gap -1
cxde c-de
x-de xcde
80

No negative values in local alignment DP array
Optimal local alignment will never have a gap on
either end
Local alignment Smith-Waterman
Global alignment Needleman-Wunsch

81
Analysis

Time
O(MN) for finding the best alignment
Time to report all alignments depends on the
number of sub-opt alignments
Memory
O(MN)
O(MN) possible

More efficient alignment algorithms

Given two sequences of length M, N
Time O(MN)
Ok, but still slow for long sequences
Space O(MN)
bad
1Mb seq x 1Mb seq 1TB memory
Can we do better?

84
Bounded alignment

Good alignment should appear near the diagonal

85
Bounded Dynamic Programming

If we know that x and y are very similar
Assumption gaps(x, y) lt k
xi
Then, implies i j lt k
yj

86
Bounded Dynamic Programming

Initialization
F(i,0), F(0,j) undefined for i, j gt k
Iteration
For i 1M
For j max(1, i k)min(N, ik)
F(i 1, j 1) ?(xi, yj)
F(i, j) max F(i, j 1) d, if j gt i k
F(i 1, j) d, if j lt i k
Termination same

x1 xM
yN y1
k
87
Analysis

Time O(kM) ltlt O(MN)
Space O(kM) with some tricks

gt
M
M
2k
2k
88

(No Transcript)
89

Given two sequences of length M, N
Time O(MN)
ok
Space O(MN)
bad
1mb seq x 1mb seq 1TB memory
Can we do better?

90
Linear space algorithm

If all we need is the alignment score but not the
alignment, easy!

We only need to keep two rows (You only need one
row, with a little trick)

But how do we get the alignment?
91
Linear space algorithm

When we finish, we know how we have aligned the
ends of the sequences

XM YN

Naïve idea Repeat on the smaller subproblem
F(M-1, N-1) Time complexity O((MN)(MN))
92
(0, 0)
M/2
(M, N)
Key observation optimal alignment (longest path)
must use an intermediate point on the M/2-th row.
Call it (M/2, k), where k is unknown.
93
(0,0)

(3,2)
(3,4)
(3,6)
(3,0)
(6,6)

Longest path from (0, 0) to (6, 6) is max_k
(LP(0,0,3,k) LP(6,6,3,k))

94
Hirschbergs idea

Divide and conquer!

Y

Forward algorithm Align x1x2xM/2 with Y
X
M/2
F(M/2, k) represents the best alignment between
x1x2xM/2 and y1y2yk
95
Backward Algorithm
Y

Backward algorithm Align reverse(xM/21xM) with
reverse(Y)
X
M/2
B(M/2, k) represents the best alignment between
reverse(xM/21xM) and reverse(ykyk1yN )
96
Linear-space alignment

Using 2 (4) rows of space, we can compute
for k 1N, F(M/2, k), B(M/2, k)

M/2
97
Linear-space alignment

Now, we can find k maximizing F(M/2, k) B(M/2,
k)
Also, we can trace the path exiting column M/2
from k

Conclusion In O(NM) time, O(N) space, we found
optimal alignment path at row M/2
98
Linear-space alignment

Iterate this procedure to the two sub-problems!

M/2
k
M/2
N-k
99
Analysis

Memory O(N) for computation, O(NM) to store the
optimal alignment
Time
MN for first iteration
k M/2 (N-k) M/2 MN/2 for second

k
M/2
M/2
N-k
100
MN
MN/2
MN/4
MN MN/2 MN/4 MN/8 MN (1 ½ ¼
1/8 1/16 ) 2MN O(MN)
MN/8
101
Outline

Part I Algorithms
Biological problem
Intro to dynamic programming
Global sequence alignment
Local sequence alignment
More efficient algorithms
Part II Biological issues
Model gaps more accurately
Alignment statistics
Part III BLAST

102

How to model gaps more accurately

103
Whats a better alignment?

GACGCCGAACG
GACGC---ACG

GACGCCGAACG
GACG-C-A-CG

Score 8 x m 3 x d
Score 8 x m 3 x d

However, gaps usually occur in bunches.
During evolution, chunks of DNA may be lost
entirely
Aligning genomic sequences vs. cDNAs (reverse
complimentary to mRNAs)

104
Model gaps more accurately

Current model
Gap of length n incurs penalty n?d
General
Convex function
E.g. ?(n) c sqrt (n)

?
n
?
n
105
General gap dynamic programming

Initialization same
Iteration
F(i-1, j-1) s(xi, yj)
F(i, j) max maxk0i-1F(k,j) ?(i-k)
maxk0j-1F(i,k) ?(j-k)
Termination same
Running Time O((MN)MN) (cubic)
Space O(NM) (linear-space algorithm not
applicable)

106
Compromise affine gaps
?(n)

?(n) d (n 1)?e
gap gap
open extension

e
d
Match 2 Gap open -5 Gap extension -1
GACGCCGAACG GACGC---ACG
GACGCCGAACG GACG-C-A-CG
8x2-5-2 9
8x2-3x5 1

We want to find the optimal alignment with affine
gap penalty in
O(MN) time
O(MN) or better O(MN) memory

107
Allowing affine gap penalties

Still three cases
xi aligned with yj
xi aligned to a gap
Are we continuing a gap in x? (if no, start is
more expensive)
yj aligned to a gap
Are we continuing a gap in y? (if no, start is
more expensive)
We can use a finite state machine to represent
the three cases as three states
The machine has two heads, reading the chars on
the two strings separately
At every step, each head reads 0 or 1 char from
each sequence
Depending on what it reads, goes to a different
state, and produces different scores

108
Finite State Machine
Input
Output
? / ?
? / ?
Ix
? / ?
? / ?
F
? / ?
Iy
? / ?
State
? / ?
F have just read 1 char from each seq (xi
aligned to yj ) Ix have read 0 char from x. (yj
aligned to a gap) Iy have read 0 char from y (xi
aligned to a gap)
109
Input
Output
(-, yj) / e
Ix
(xi,yj) / ?
(xi,yj) / ?
(-, yj) / d
F
(xi,-) / d
Iy
(xi,-) / e
Start state
(xi,yj) / ?
Current state Input Output Next state
F (xi,yj) ? F
F (-,yj) d Ix
F (xi,-) d Iy
Ix (-,yj) e Ix

110
F-F-Iy-F-Ix
F-F-F-F
F-Iy-F-F-Ix
AAC ACT
AAC- A-CT
AAC- -ACT
AAC ACT
Given a pair of sequences, an alignment (not
necessarily optimal) corresponds to a state path
in the FSM. Optimal alignment find a state path
to read the two sequences such that the total
output score is the highest
111
Dynamic programming

We encode this information in three different
matrices
For each element (i,j) we use three variables
F(i,j) best alignment (score) of x1..xi y1..yj
if xi aligns to yj
Ix(i,j) best alignment of x1..xi y1..yj if yj
aligns to gap
Iy(i,j) best alignment of x1..xi y1..yj if xi
aligns to gap

xi
xi
xi
yj
yj
yj
Iy(i, j)
Ix(i, j)
F(i, j)
112
(-, yj)/e
Ix
(xi,yj) /?
(xi,yj) /?
(-, yj) /d
F
(xi,-) /d
Iy
(xi,-)/e
(xi,yj) /?
F(i-1, j-1) ?(xi, yj) F(i, j) max Ix(i-1,
j-1) ?(xi, yj) Iy(i-1, j-1) ?(xi, yj)
xi
yj
113
(-, yj)/e
Ix
(xi,yj) /?
(xi,yj) /?
(-, yj) /d
F
(xi,-) /d
Iy
(xi,-)/e
(xi,yj) /?
F(i, j-1) d Ix(i, j) max Ix(i, j-1)
e
xi
yj
Ix(i, j)
114
(-, yj)/e
Ix
(xi,yj) /?
(xi,yj) /?
(-, yj) /d
F
(xi,-) /d
Iy
(xi,-)/e
(xi,yj) /?
F(i-1, j) d Iy(i, j) max Iy(i-1, j)
e
xi
yj
Iy(i, j)
115

F(i 1, j 1)
F(i, j) ?(xi, yj) max Ix(i 1, j 1)
Iy(i 1, j 1)
F(i, j 1) d
Ix(i, j) max
Ix(i, j 1) e
F(i 1, j) d
Iy(i, j) max
Iy(i 1, j) e

Continuing alignment
Closing gaps in x
Closing gaps in y
Opening a gap in x
Gap extension in x
Opening a gap in y
Gap extension in y
116
Data dependency

F
i
Iy
Ix
j

i-1
i-1
j-1
j-1
117
Data dependency

F
i
Iy
Ix
j

i
i
j
j
118
Data dependency

If we stack all three matrices
No cyclic dependency
Therefore, we can fill in all three matrices in
order

119
Algorithm

for i 1m
for j 1n
Fill in F(i, j), Ix(i, j), Iy(i, j)
end
end
F(M, N) max (F(M, N), Ix(M, N), Iy(M, N))
Time O(MN)
Space O(MN) or O(N) when combined with the
linear-space algorithm

120
Exercise

x GCAC
y GCC
m 2
s -2
d -5
e -1

121
y
y
G C C
G C C
0 -? -? -?
-?
-?
-?
-?
x
x
-? -? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F aligned on both
Iy Insertion on y
G C C
y
F(i-1, j-1)
Iy(i-1, j-1)
-? -5 -6 -7
-?
-?
-?
-?
Iy(i-1,j)
x
F(i-1,j)
?(xi, yj)
G C A C
e
Ix(i-1, j-1)
d
F(i, j)
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
Ix(i,j-1)
e
Ix Insertion on x
122
y
y
G C C
G C C
0 -? -? -?
-? 2
-?
-?
-?
x
x
-? -? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
G C C
y
-? -5 -6 -7
-?
-?
-?
-?
x
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) 2
Ix(i-1, j-1)
F(i, j)
Ix
123
y
y
G C C
G C C
0 -? -? -?
-? 2 -7
-?
-?
-?
x
x
-? -? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
G C C
y
-? -5 -6 -7
-?
-?
-?
-?
x
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) -2
Ix(i-1, j-1)
F(i, j)
Ix
124
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-?
-?
-?
x
x
-? -? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
G C C
y
-? -5 -6 -7
-?
-?
-?
-?
x
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) -2
Ix(i-1, j-1)
F(i, j)
Ix
125
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-?
-?
-?
x
x
-? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3
-?
-?
-?
G C A C
F(i,j-1)
d -5
Ix(i,j)
Ix(i,j-1)
e -1
Ix
126
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-?
-?
-?
x
x
-? -? -?
-5
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-?
-?
-?
G C A C
F(i,j-1)
d -5
Ix(i,j)
Ix(i,j-1)
e -1
Ix
127
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-?
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-?
-?
-?
Iy(i-1,j)
G C A C
F(i-1,j)
e-1
d-5
Iy(i,j)
Ix
128
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-?
-?
-?
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) -2
Ix(i-1, j-1)
F(i, j)
Ix
129
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-?
-?
-?
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) 2
Ix(i-1, j-1)
F(i, j)
Ix
130
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-?
-?
-?
F(i-1, j-1)
Iy(i-1, j-1)
G C A C
?(xi, yj) 2
Ix(i-1, j-1)
F(i, j)
Ix
131
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-?
-?
G C A C
F(i,j-1)
d -5
Ix(i,j)
Ix(i,j-1)
e -1
Ix
132
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6 -3
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-?
-?
Iy(i-1,j)
G C A C
F(i-1,j)
e-1
d-5
Iy(i,j)
Ix
133
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-?
-?
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
F(i-1, j-1)
Iy(i-1, j-1)
Iy(i-1,j)
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-?
-?
F(i-1,j)
?(xi, yj)
G C A C
e
Ix(i-1, j-1)
d
F(i, j)
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
Ix(i,j-1)
e
Ix
134
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -5
-? -8 -5 2
-?
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
F(i-1, j-1)
Iy(i-1, j-1)
Iy(i-1,j)
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-?
F(i-1,j)
?(xi, yj)
G C A C
e
Ix(i-1, j-1)
d
F(i, j)
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
Ix(i,j-1)
e
Ix
135
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-? -8 -5 2
-?
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7 -8 -1
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-?
Iy(i-1,j)
G C A C
F(i-1,j)
e-1
d-5
Iy(i,j)
Ix
136
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-? -8 -5 2
-?
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7 -8 -1 -6
-8
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-?
Iy(i-1,j)
G C A C
F(i-1,j)
e-1
d-5
Iy(i,j)
Ix
137
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-? -8 -5 2
-? -9 -6 1
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7 -8 -1 -6
-8 -13 -2 -3
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
F(i-1, j-1)
Iy(i-1, j-1)
Iy(i-1,j)
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-? -? -14 -11
F(i-1,j)
?(xi, yj)
G C A C
e
Ix(i-1, j-1)
d
F(i, j)
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
Ix(i,j-1)
e
Ix
138
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-? -8 -5 2
-? -9 -6 1
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7 -8 -1 -6
-8 -13 -2 -3
m 2 s -2 d -5 e -1
G C A C
G C A C
F
Iy
y
G C C
F(i-1, j-1)
Iy(i-1, j-1)
Iy(i-1,j)
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-? -? -14 -11
F(i-1,j)
?(xi, yj)
G C A C
e
Ix(i-1, j-1)
d
F(i, j)
F(i,j-1)
Iy(i,j)
d
Ix(i,j)
Ix(i,j-1)
e
Ix
139
y
y
G C C
G C C
0 -? -? -?
-? 2 -7 -8
-? -7 4 -1
-? -8 -5 2
-? -9 -6 1
x
x
-? -? -?
-5 -? -? -?
-6 -3 -12 -13
-7 -8 -1 -6
-8 -13 -2 -3
m 2 s -2 d -5 e -1
G C A C
G C A C
GCAC GC-C
x
y
F
Iy
y
G C C
y
G C C
x
-5 -6 -7
-? -? -3 -4
-? -? -12 -1
-? -? -13 -10
-? -? -14 -11

x
G C A C
G C A C
Ix
140
Statistics of alignment

Where does ?(xi, yj) come from?
Are two aligned sequences actually related?

141
Probabilistic model of alignments

Well first focus on protein alignments without
gaps
Given an alignment, we can consider two possible
models
R the sequences are related by evolution
U the sequences are unrelated
How can we distinguish these two models?
How is this view related to amino-acid
substitution matrix?

142
Model for unrelated sequences

Assume each position of the alignment is
independently sampled from some distribution of
amino acids
ps probability of amino acid s in the sequences
Probability of seeing an amino acid s aligned to
an amino acid t by chance is
Pr(s, t U) ps pt
Probability of seeing an ungapped alignment
between x x1xn and y y1yn randomly is

i
143
Model for related sequences

Assume each pair of aligned amino acids evolved
from a common ancestor
Let qst be the probability that amino acid s in
one sequence is related to t in another sequence
The probability of an alignment of x and y is
give by

144
Probabilistic model of Alignments

How can we decide which model (U or R) is more
likely?
One principled way is to consider the relative
likelihood of the two models (the odds ratio)
A higher ratio means that R is more likely than U

145
Log odds ratio

Taking logarithm, we get
Recall that the score of an alignment is given by

146

Therefore, if we define
We are actually defining the alignment score as
the log odds ratio between the two models R and U

147
How to get the probabilities?

ps can be counted from the available protein
sequences
But how do we get qst? (the probability that s
and t have a common ancestor)
Counted from trusted alignments of related
sequences

148
Protein Substitution Matrices

Two popular sets of matrices for protein
sequences
PAM matrices Dayhoff et al, 1978
Better for aligning closely related sequences
BLOSUM matrices Henikoff Henikoff, 1992
For both closely or remotely related sequences

149
BLOSUM-N matrices

Constructed from a database called BLOCKS
Contain many closely related sequences
Conserved amino acids may be over-counted
N 62 the probabilities qst were computed using
trusted alignments with no more than 62 identity
identity of matched columns
Using this matrix, the Smith-Waterman algorithm
is most effective in detecting real alignments
with a similar identity level (i.e. 62)

150
? Scaling factor to convert score to
integer. Important when you are told that
a scoring matrix is in half-bits gt ? ½ ln2
Positive for chemically similar substitution
Common amino acids get low weights
Rare amino acids get high weights
151
BLOSUM-N matrices

If you want to detect homologous genes with high
identity, you may want a BLOSUM matrix with
higher N. say BLOSUM75
On the other hand, if you want to detect remote
homology, you may want to use lower N, say
BLOSUM50
BLOSUM-62 good for most purposes

152
For DNAs

No database of trusted alignments to start with
Specify the percentage identity you would like to
detect
You can then get the substitution matrix by some
calculation

153
For example

Suppose pA pC pT pG 0.25
We want 88 identity
qAA qCC qTT qGG 0.22, the rest 0.12/12
0.01
?(A, A) ?(C, C) ?(G, G) ?(T, T)
log (0.22 / (0.250.25)) 1.26
?(s, t) log (0.01 / (0.250.25)) -1.83 for s
? t.

154
Substitution matrix
A C G T
A 1.26 -1.83 -1.83 -1.83
C -1.83 1.26 -1.83 -1.83
G -1.83 -1.83 1.26 -1.83
T -1.83 -1.83 -1.83 1.26
155
A C G T
A 5 -7 -7 -7
C -7 5 -7 -7
G -7 -7 5 -7
T -7 -7 -7 5

Scale wont change the alignment
Multiply by 4 and then round off to get integers

156
Arbitrary substitution matrix

Say you have a substitution matrix provided by
someone
Its important to know what you are actually
looking for when you use the matrix

157
NCBI-BLAST
WU-BLAST
A C G T
A 1 -2 -2 -2
C -2 1 -2 -2
G -2 -2 1 -2
T -2 -2 -2 1
A C G T
A 5 -4 -4 -4
C -4 5 -4 -4
G -4 -4 5 -4
T -4 -4 -4 5

Whats the difference?
Which one should I use for my sequences?

158

We had
Scale it, so that
Reorganize

159

Since all probabilities must sum to 1,
We have
Suppose again ps 0.25 for any s
We know ?(s, t) from the substitution matrix
We can solve the equation for ?
Plug ? into to
get qst

160
NCBI-BLAST
WU-BLAST
A C G T
A 1 -2 -2 -2
C -2 1 -2 -2
G -2 -2 1 -2
T -2 -2 -2 1
A C G T
A 5 -4 -4 -4
C -4 5 -4 -4
G -4 -4 5 -4
T -4 -4 -4 5
? 1.33 qst 0.24 for s t, and 0.004 for s ?
t Translate 95 identity
? 0.19 qst 0.16 for s t, and 0.03 for s ?
t Translate 65 identity
161
Details for solving ?