Pairwise alignment algorithms Haverford College, Fall 2001 - PowerPoint PPT Presentation

1 / 54

About This Presentation

Title:

Pairwise alignment algorithms Haverford College, Fall 2001

Description:

... Statistical Methods in Bioinformatics: An Introduction, Springer-Verlag, ... Most textbooks on Computational Biology/Bioinformatics cover these topics ... – PowerPoint PPT presentation

Number of Views:25

Avg rating:3.0/5.0

Slides: 55

Provided by: Elisabett70

Category:

more less

Transcript and Presenter's Notes

Title: Pairwise alignment algorithms Haverford College, Fall 2001

1
Pairwise alignment algorithms(Haverford
College, Fall 2001)

Elisabetta Manduchi
manduchi_at_pcbi.upenn.edu
phone 215-573-4408

2
References

Ewens and Grant, Statistical Methods in
Bioinformatics An Introduction, Springer-Verlag,
New York, 2001
Sections 6.2 and 6.4 (all subsections)
See also problems 6.1-6.5
Background appendix B, sections B.1-B.9
Most textbooks on Computational
Biology/Bioinformatics cover these topics

3
Algorithms definition

Websters definition
a procedure for solving a mathematical problem
in a finite number of steps that frequently
involves a repetition of an operation or
broadly a step-by-step procedure for solving a
problem or accomplishing some end

4
Time and Space Complexity

Time Complexity or Running Time of an algorithm
number of steps.
Space Complexity amount of temporary storage
required to run the algorithm.
These are functions of the input size.
They refer to the worst case scenario.

5
Example

Given a list of numbers x1,x2,,xn, where each xi
is 0 or 1, find the first 0, if any, in this list
and change it to a 1.
For i1,2,,n, if xi0, set xi1. Stop.
Input size is n.
Total number of steps? Depends on the input, e.g.
1 for 0,1,1,1
3 for 1,1,0,0
In the worst case scenario, the total number of
steps is n. This is the time complexity of this
algorithm.

6
The O notation

Ignore constant factors when trying to evaluate
time or space complexity.
Definition Let f and g be positive-valued
functions of n. We say that f(n)O(g(n)) as n??
if there exist constants c?0 and N such that, for
all n?N, we have f(n)?cg(n).
That is, if for sufficiently large values of n,
f(n) is no bigger than some fixed non-negative
constant times g(n).

7
Examples

2n3O(n3)
10n415O(n4)O(n5)
a polynomial of degree s is O of any polynomial
of degree t, for t?s. In particular it is O(nt).
If agt1 and if g is a polynomial, then f(n)an is
NOT O(g(n))

8
Substitutions, Insertions, Deletions

Mutation one of
switch from one nucleotide to another
insertion
deletion
Substitution a switch in nucleotides which
spreads throughout most of a species.
Substitutions, insertions and deletions passed
along two independent lines of descent cause a
divergence of the two sequences from the original
(and from each other)
cgggtatccaa
cggtatgcca
ccctaggtccca

9
Questions

Given a collection of sequences. Do they descend
from the same ancestor sequence?
If so, how did they diverge from their ancestor?
I.e. how should they be aligned?

10
Example

For the previous example
cggtatgcca? cgggtatccaa , ccctaggtccca, the two
descendent sequences align as follows
c g g g t a - - t - c
c a a
c c c - t a g g t c c
c - a
- (indel) represents an insertion or deletion.
A sequence of ? consecutive indels is called a
gap of length ? .

11
Alignments

A means of comparing sequences to answer the
questions stated above.
Types of alignments
global/local
gapped/ungapped
pairwise/multiple
In what follows we will focus on pairwise
alignments.

12
Alignments (cont.)

Given two sequences, find an optimal alignment
between them and use it to answer the questions
stated above.
What is an optimal alignment?
Need a way to compare alignments.
Attach a score to each alignment.
This should reflect the likelihood that this
alignment was produced as a consequence of
divergence from a common ancestor.

13
Scoring schemes

Given a scoring scheme,
an optimal alignment between two sequences is one
with the best score (there might be more than one
optimal alignment).
the score of the sequence pair is such a best
score.
Using the scores of sequence pairs one can
investigate the hypothesis that two sequences
diverged from a common ancestor
use the alignment of a pair of sequences that are
judged to be related in order to discover common
patterns.
by comparing scores among different species, get
information to help reconstruct the phylogenetic
tree that relates them all.

14
Types of scores

Similarity Scores the higher the score, the more
closely related are the two aligned sequences.
Distance scores (or distance measures) the lower
the score, the more closely related the
sequences.
In what follows we will use similarity score.

15
Using substitution matrices and gap penalties

Score of an alignment sum of individual scores,
one for each aligned pair of residues, and a
score (penalty) for each gap.
A substitution matrix s is a (symmetric) matrix
where s(X,Y) is the score assigned to the aligned
pair of residues X and Y.
A gap penalty is usually a function of the length
? of a gap ?(?). The simplest example is the
linear gap model ?(?) -?d, for some
non-negative linear gap penalty d.

16
Substitution Matrices

A 4?4 (NA) or a 20?20 (AA) symmetric matrix.
Examples
see table 6.7.
s(X,Y)1 if XY, -1 otherwise.
In what follows we will assume that a scoring
scheme, consisting of a substitution matrix and a
gap penalty function, is given.
Substitution matrices (such as BLOSUM and PAM
matrices) are discussed in detail in section 6.5
of the Ewens and Grant book and will also be
discussed next time.

17
Example

Let s(X,Y)1 if XY, -1 otherwise and use a
linear gap score with d2. Then the score of the
alignment
c t t a g - g - -
c a t - g a g a a
is 1 1 1 -2 1 -2 1 -2 ? 2 -5

18
Gapped Global Pairwise Alignments(linear gap
model)

Given
a scoring scheme described by a substitution
matrix and a linear gap penalty
two sequences of lengths m and n xX1X2Xm and
yY1Y2Yn
Goal
find the score of this sequence pair (i.e. the
score of an optimal gapped global alignment
between x and y)
find an optimal alignment of this type.

19
Naïve approach

Exhaustive search
List all possible global gapped alignments of x
and y.
For each such alignment, compute its score using
the given scoring scheme.
Find the maximum of the scores and the
corresponding alignment(s).

20
Time complexity of the naïve approach

Let c(m,n) be the number of global gapped
alignments between two sequences of lengths m and
n respectively.
How does c grow with m and n?
Let g(m,n) be the number of groups obtained by
grouping together those alignments that have the
same combination of aligned residue pairs,
ignoring indels.
e.g. X1 X2 X3 - X4 and X1 X2 - X3 X4
- Y1 - Y2 Y3 - Y1 Y2
- Y3

21
Time complexity of the naïve approach (cont.)

Then g(m,n)?c(m,n).
But
The RHS equals (see problem 6.1).
This number grows quite fast with m and n.

22
Time complexity of the naïve approach (cont.)

To get a sense of how fast this number grows,
assume mn.
Then (use
Stirlings
approximation B.5).
For example

23
Needleman-Wunsch algorithm (1970)Gotohs version
(1982)

This is an example of dynamic programming
algorithm
break the problem into sub-problems of the same
kind
build the final solution using the solutions for
the sub-problems.

24
Notation

s substitution matrix (for brevity, we will
write s(i,j) for s(Xi,Yj)), d linear gap
penalty
xX1X2Xm and yY1Y2Yn sequences to align
For i1,2,,m and j1,2,,n, let
x1,iX1X2Xi, y1,jY1Y2Yj
B(i,j) score of the pair (x1i,,y1,j)
B(i,0)-id, B(0,j)-jd, B(0,0)0
Get an (m1)?(n1) matrix B, whose entry B(m,n)
is the desired score of the pair (x,y)

25
The algorithm

Fill in the elements of B recursively, proceeding
from top left to bottom right.
A highest scoring alignment between x1,i and y1,j
could terminate in one of 3 ways
Xi Xi
-
Yj - Yj
In the first case, B(i,j) B(i-1,j-1)s(i,j)
In the second case, B(i,j) B(i-1,j)-d
In the third case, B(i,j) B(i,j-1)-d

26
Recursive formula

B(i,j) maxB(i-1,j-1)s(i,j), B(i-1,j)-d,
B(i,j-1)-d
Thus, to compute B(m,n), the running time is
O(mn).
To find an alignment that has this score, we must
keep track, at each step of the recursion, of the
choice made to get the max.

27
Example

xgaatct, ycatt (m6 and n4)
s(X,Y)1 if XY, -1 otherwise
d2

28
Example (cont.)

3 optimal alignments
gaatct gaatct gaatct
c-at-t ca-t-t -cat-t

29
Fitting one sequence into another(linear gap
model)

Given
a scoring scheme described by a substitution
matrix and a linear gap penalty
two sequences of lengths m and n xX1X2Xm and
yY1Y2Yn, with m? n.
Goal
find the score of an optimal (gapped) alignment
of x with a subsequence of y.
find an optimal alignment of this type.

30
Restating the goal

Notation
s substitution matrix (for brevity, we will
write s(i,j) for s(Xi,Yj)), d linear gap
penalty
for 1? i? m and 1? k? j? n, let x1,iX1X2Xi, and
yk,jYkY2Yj
for two sequences u and v, let B(u,v) be the
score of the pair (u,v)
Goal
find maxB(x,yk,j) 1? k? j? n

31
Naïve approach

For each choice of k and j, with 1? k? j?n, apply
the NW algorithm to the sequences x and yk,j to
find B(x,yk,j). Then find the max.
Running time to find the optimal score is
therefore O(mn3)
there are choices for
k and j
for each such choice, the running time of NW to
find B(x,yk,j) is O(m(j-k)).

32
A better approach

For i1,2,,m and j1,2,,n, let
F(i,j)maxB(x1,i,yk,j) k1,2,,j
Then maxB(x,yk,j) 1? k? j? n equals
maxF(m,j) j1,2,,n, since
maxB(x,yk,j) 1? k? j? n
maxB(x1,m,yk,j) 1? k? j? n
max maxB(x1,m,yk,j) k1,2,,j j1,2,,n
maxF(m,j) j1,2,,n

33
Computing the matrix F

Initialize
F(i,0)-id, for i1,2,,m
F(0,j)0, for j0,1,,n, since deletions of the
beginning of y should be without penalty
Fill in the elements of F recursively, proceeding
from top left to bottom right.
Look at the last row of the (m1)?(n1) matrix F.
The max value on this row is the desired score.
Note that there might be more than one cell on
this row at which the max is attained.

34
Recursive formula

F(i,j)maxF(i-1,j-1)s(i,j), F(i-1,j)-d,
F(i,j-1)-d
Thus, to compute maxF(m,j) j1,2,,n, the
running time is O(mn).
To find an alignment that has this score, we must
keep track, at each step of the recursion, of the
choice made to get the max.

35
Example

xatg, ygaatct (m3 and n6)
s(X,Y)1 if XY, -1 otherwise
d2

36
Example (cont.)

One optimal fit
a t g
g a a t c t

37
Local alignments(linear gap model)

Given
a scoring scheme described by a substitution
matrix and a linear gap penalty
two sequences of lengths m and n xX1X2Xm and
yY1Y2Yn.
Goal
find the score of an optimal (gapped) alignment
of a subsequence of x with a subsequence of y.
find an optimal alignment of this type.

38
Assumption

In what follows we make the assumption that the
scoring scheme we use is such that the expected
score for a random alignment is negative.
If this assumption did not hold, then long
matches between subsequences could score highly
just because of their lengths, so that two long
unrelated subsequences could give a highest
scoring alignment. We do not want this to occur.

39
Restating the goal

Notation
s substitution matrix (for brevity, we will
write s(i,j) for s(Xi,Yj)), d linear gap
penalty
for 1?h? i?m and 1? k? j? n, let xh,iXhX2Xi,
and yk,jYkY2Yj
for two sequences u and v, let B(u,v) be the
score of the pair (u,v)
Goal
find maxB(xh,i,yk,j) 1? h? i? m, 1? k? j?
n, when this is non-negative.

40
Naïve approach

For each choice of h, i, k and j, with 1? h? i? m
and 1? k? j? n, apply the NW algorithm to the
sequences xh,i and yk,j to find B(xh,i,yk,j).
Then find the max.
Running time to find the optimal score is
therefore O(m3n3)
there are choices
for h and i and
choices
for k and j
for each such choice, the running time of NW to
find B(xh,i,yk,j) is O((i-h)(j-k)).

41
A better approachthe Smith-Waterman algorithm
(1981)

For i1,2,,m and j1,2,,n, let
L(i,j)max0, B(xh,i,yk,j) h1,2,, i,
k1,2,,j
Then
max0, B(xh,i yk,j) 1? h? i? m, 1? k? j? n
equals maxL(i,j) i1,2,,m, j1,2,,n, since
max0, B(xh,i,yk,j) 1? h? i? m, 1? k? j? n
max max0, B(xh,i,yk,j) h1,2,,i, k1,2,,j
i1,2,,m, j1,2,,n
maxL(i,j) i1,2,,m, j1,2,,n

42
Computing the matrix L

Initialize
L(i,0)0, for i1,2,,m
L(0,j)0, for j0,1,,n,
since deletions at the beginning or end of
our two sequences should be without penalty
Fill in the elements of L recursively, proceeding
from top left to bottom right.
Look at the cell of the (m1)? (n1) matrix L.
The max value is the desired score.
Note that there might be more than one cell in L
at which this max is attained.

43
Recursive formula

L(i,j)max0, L(i-1,j-1)s(i,j), L(i-1,j)-d,
L(i,j-1)-d
Thus, to compute
maxL(i,j) i1,2,,m, j1,2,,n
the running time is O(mn).
To find an alignment that has this score, we must
keep track, at each step of the recursion, of the
choice made to get the max.

44
Example

X2 X3 - X4 X5
Y5 Y6 Y7 Y8 Y9

45
Other gap models

Linear gap model
simple
but often not appropriate for biological
sequences often it is harder for a gap to open
than it is for it to extend
penalize additional gap steps a little less than
the first one
use a more complicated gap cost ?(?)
need to adjust the recurrence relations in the
dynamic programming algorithms

46
Other gap models (cont.)

E.g., for global alignments, need to distinguish
among alignments of x1,i and y1,j that end with
Xi aligned to an indel
the score of such an alignment will also depend
on how many symbols in x immediately preceding Xi
are aligned to indels
if the last symbol not aligned to an indel
preceding Xi is Xk, then the i-k symbols from
Xk1 to Xi are aligned to indels, and the score
of a highest scoring alignment is B(k, j)?(i-k)

47
Recursive formula for a general gap model(global
alignment)

B(0,0)0, B(i,0)?(i), B(0,j)?(j)
B(i,j) maxB(i-1,j-1)s(i,j),
B(k,j)?(i-k) k0,1,,i-1,
B(i,k)?(j-k) k0,1,,j-1
Finding B(m,n) has a running time of O(m2nmn2)
as, for each cell, one must consider ij1
previous ones, not just 3.

48
Special case affine gap model

?(?)-d-(?-1)e, 0lteltd dgap open penalty, egap
extension penalty
Trick to get a running time of O(mn) use 3
matrices
S(i,j) holds the score of a highest scoring
alignment between x1,i and y1,j , given that the
alignment ends with Xi aligned to Yj
Ix(i,j) holds the score of a highest scoring
alignment between x1,i and y1,j , given that the
alignment ends with Xi aligned to an indel.
Iy(i,j) holds the score of a highest scoring
alignment between x1,i and y1,j , given that the
alignment ends with and indel aligned to Yj.

49
Initialization

S(0,0) Ix(0,0) Iy(0,0)0
S(0,j) Ix(0,j) -d-(j-1)e
S(i,0) Iy(i,0) -d-(i-1)e
We assume that a deletion will not be followed
directly by an insertion

50
Recursive formulae

S(i,j) maxS(i-1,j-1)s(i,j ),
Ix(i-1,j-1)s(i,j),
Iy(i-1,j-1)s(i,j)
Ix(i,j) maxS(i-1,j)-d, Ix(i-1,j)-e
Iy(i,j) maxS(i,j-1)-d, Iy(i,j-1)-e
The score of the pair (x,y) is then
maxS(m,n), Ix(m,n), Iy(m,n)

51
Limitation of the dynamic programming alignment
algorithms

When dealing with very long sequences a time
complexity as that of the algorithms discussed so
far might not be good enough
e.g. when trying to align a query sequence to all
sequences in a database
Other algorithms, like BLAST, have been
developed.
These use heuristic techniques to limit the
search to a fraction of the possible alignments
between two sequences, with an attempt not to
miss the high-scoring alignments.
Trade-off dynamic programming always finds the
best score, BLAST might miss the best scoring
alignment in some cases.

52
BLAST(Basic Local Alignment Search Tool)

For simplicity here we sketch the ideas behind
the original (ungapped) algorithm
There are variants of this algorithm, also
accomodating for gapped alignments and
incorporating dynamic programming ideas
The statistical aspects of BLAST will be
discussed in a subsequent lecture
GOAL seek for locally maximal segment pairs with
scores above some cutoff (HSPs)
segment paira pair of subsequences of the same
length, one from each sequence
locally maximalscore cannot be improved either
by extending or shortening both segments
a maximal segment pair (MSP) is also a locally
maximal one
statistical results can be used to estimate the
highest MSP score S at which chance similarities
are likely to appear

53
BLAST algorithm outline

Confine the attention to segment pairs that
contain a word pair with a score of at least T,
for some appropriate choice of T and w
word pair a segment pair of fixed length w
Compile a list of high-scoring words
Search for hits each hit gives a seed
Extend seeds
Note 1 and (the typical implementation of) 4 are
the steps that might introduce departure from the
ideal finding guaranteed MSPs

54
An implementation for protein sequences

w fixed at 4 (more recently 3), T chosen to
balance sensitivity and running time
List all w-mers that score at least T with some
w-mer in the query sequence
Scan the database in search for hits (perfect
matches) in the list constructed in the previous
step (e.g. using a hash)
Extend each hit to economize time the process of
extending in one direction is terminated when a
segment pair whose score falls a certain distance
below the best score found for shorter extensions
is reached