Title: Pairwise alignment algorithms Haverford College, Fall 2001
1Pairwise alignment algorithms(Haverford
College, Fall 2001)
- Elisabetta Manduchi
- manduchi_at_pcbi.upenn.edu
- phone 215-573-4408
2References
- Ewens and Grant, Statistical Methods in
Bioinformatics An Introduction, Springer-Verlag,
New York, 2001 - Sections 6.2 and 6.4 (all subsections)
- See also problems 6.1-6.5
- Background appendix B, sections B.1-B.9
- Most textbooks on Computational
Biology/Bioinformatics cover these topics
3Algorithms definition
- Websters definition
- a procedure for solving a mathematical problem
in a finite number of steps that frequently
involves a repetition of an operation or
broadly a step-by-step procedure for solving a
problem or accomplishing some end
4Time and Space Complexity
- Time Complexity or Running Time of an algorithm
number of steps. - Space Complexity amount of temporary storage
required to run the algorithm. - These are functions of the input size.
- They refer to the worst case scenario.
5Example
- Given a list of numbers x1,x2,,xn, where each xi
is 0 or 1, find the first 0, if any, in this list
and change it to a 1. - For i1,2,,n, if xi0, set xi1. Stop.
- Input size is n.
- Total number of steps? Depends on the input, e.g.
- 1 for 0,1,1,1
- 3 for 1,1,0,0
- In the worst case scenario, the total number of
steps is n. This is the time complexity of this
algorithm.
6The O notation
- Ignore constant factors when trying to evaluate
time or space complexity. - Definition Let f and g be positive-valued
functions of n. We say that f(n)O(g(n)) as n??
if there exist constants c?0 and N such that, for
all n?N, we have f(n)?cg(n). - That is, if for sufficiently large values of n,
f(n) is no bigger than some fixed non-negative
constant times g(n).
7Examples
- 2n3O(n3)
- 10n415O(n4)O(n5)
- a polynomial of degree s is O of any polynomial
of degree t, for t?s. In particular it is O(nt). - If agt1 and if g is a polynomial, then f(n)an is
NOT O(g(n))
8Substitutions, Insertions, Deletions
- Mutation one of
- switch from one nucleotide to another
- insertion
- deletion
- Substitution a switch in nucleotides which
spreads throughout most of a species. - Substitutions, insertions and deletions passed
along two independent lines of descent cause a
divergence of the two sequences from the original
(and from each other) -
cgggtatccaa - cggtatgcca
-
ccctaggtccca
9Questions
- Given a collection of sequences. Do they descend
from the same ancestor sequence? - If so, how did they diverge from their ancestor?
I.e. how should they be aligned?
10Example
- For the previous example
- cggtatgcca? cgggtatccaa , ccctaggtccca, the two
descendent sequences align as follows - c g g g t a - - t - c
c a a - c c c - t a g g t c c
c - a - - (indel) represents an insertion or deletion.
- A sequence of ? consecutive indels is called a
gap of length ? .
11Alignments
- A means of comparing sequences to answer the
questions stated above. - Types of alignments
- global/local
- gapped/ungapped
- pairwise/multiple
- In what follows we will focus on pairwise
alignments.
12Alignments (cont.)
- Given two sequences, find an optimal alignment
between them and use it to answer the questions
stated above. - What is an optimal alignment?
- Need a way to compare alignments.
- Attach a score to each alignment.
- This should reflect the likelihood that this
alignment was produced as a consequence of
divergence from a common ancestor.
13Scoring schemes
- Given a scoring scheme,
- an optimal alignment between two sequences is one
with the best score (there might be more than one
optimal alignment). - the score of the sequence pair is such a best
score. - Using the scores of sequence pairs one can
- investigate the hypothesis that two sequences
diverged from a common ancestor - use the alignment of a pair of sequences that are
judged to be related in order to discover common
patterns. - by comparing scores among different species, get
information to help reconstruct the phylogenetic
tree that relates them all.
14Types of scores
- Similarity Scores the higher the score, the more
closely related are the two aligned sequences. - Distance scores (or distance measures) the lower
the score, the more closely related the
sequences. - In what follows we will use similarity score.
15Using substitution matrices and gap penalties
- Score of an alignment sum of individual scores,
one for each aligned pair of residues, and a
score (penalty) for each gap. - A substitution matrix s is a (symmetric) matrix
where s(X,Y) is the score assigned to the aligned
pair of residues X and Y. - A gap penalty is usually a function of the length
? of a gap ?(?). The simplest example is the
linear gap model ?(?) -?d, for some
non-negative linear gap penalty d.
16Substitution Matrices
- A 4?4 (NA) or a 20?20 (AA) symmetric matrix.
- Examples
- see table 6.7.
- s(X,Y)1 if XY, -1 otherwise.
- In what follows we will assume that a scoring
scheme, consisting of a substitution matrix and a
gap penalty function, is given. - Substitution matrices (such as BLOSUM and PAM
matrices) are discussed in detail in section 6.5
of the Ewens and Grant book and will also be
discussed next time.
17Example
- Let s(X,Y)1 if XY, -1 otherwise and use a
linear gap score with d2. Then the score of the
alignment - c t t a g - g - -
- c a t - g a g a a
- is 1 1 1 -2 1 -2 1 -2 ? 2 -5
18Gapped Global Pairwise Alignments(linear gap
model)
- Given
- a scoring scheme described by a substitution
matrix and a linear gap penalty - two sequences of lengths m and n xX1X2Xm and
yY1Y2Yn - Goal
- find the score of this sequence pair (i.e. the
score of an optimal gapped global alignment
between x and y) - find an optimal alignment of this type.
19Naïve approach
- Exhaustive search
- List all possible global gapped alignments of x
and y. - For each such alignment, compute its score using
the given scoring scheme. - Find the maximum of the scores and the
corresponding alignment(s).
20Time complexity of the naïve approach
- Let c(m,n) be the number of global gapped
alignments between two sequences of lengths m and
n respectively. - How does c grow with m and n?
- Let g(m,n) be the number of groups obtained by
grouping together those alignments that have the
same combination of aligned residue pairs,
ignoring indels. - e.g. X1 X2 X3 - X4 and X1 X2 - X3 X4
- - Y1 - Y2 Y3 - Y1 Y2
- Y3
21Time complexity of the naïve approach (cont.)
- Then g(m,n)?c(m,n).
- But
- The RHS equals (see problem 6.1).
- This number grows quite fast with m and n.
22Time complexity of the naïve approach (cont.)
- To get a sense of how fast this number grows,
assume mn. - Then (use
Stirlings - approximation B.5).
- For example
23Needleman-Wunsch algorithm (1970)Gotohs version
(1982)
- This is an example of dynamic programming
algorithm - break the problem into sub-problems of the same
kind - build the final solution using the solutions for
the sub-problems.
24Notation
- s substitution matrix (for brevity, we will
write s(i,j) for s(Xi,Yj)), d linear gap
penalty - xX1X2Xm and yY1Y2Yn sequences to align
- For i1,2,,m and j1,2,,n, let
- x1,iX1X2Xi, y1,jY1Y2Yj
- B(i,j) score of the pair (x1i,,y1,j)
- B(i,0)-id, B(0,j)-jd, B(0,0)0
- Get an (m1)?(n1) matrix B, whose entry B(m,n)
is the desired score of the pair (x,y)
25The algorithm
- Fill in the elements of B recursively, proceeding
from top left to bottom right. - A highest scoring alignment between x1,i and y1,j
could terminate in one of 3 ways - Xi Xi
- - Yj - Yj
- In the first case, B(i,j) B(i-1,j-1)s(i,j)
- In the second case, B(i,j) B(i-1,j)-d
- In the third case, B(i,j) B(i,j-1)-d
26Recursive formula
- B(i,j) maxB(i-1,j-1)s(i,j), B(i-1,j)-d,
B(i,j-1)-d - Thus, to compute B(m,n), the running time is
O(mn). - To find an alignment that has this score, we must
keep track, at each step of the recursion, of the
choice made to get the max.
27Example
- xgaatct, ycatt (m6 and n4)
- s(X,Y)1 if XY, -1 otherwise
- d2
28Example (cont.)
- 3 optimal alignments
- gaatct gaatct gaatct
- c-at-t ca-t-t -cat-t
29Fitting one sequence into another(linear gap
model)
- Given
- a scoring scheme described by a substitution
matrix and a linear gap penalty - two sequences of lengths m and n xX1X2Xm and
yY1Y2Yn, with m? n. - Goal
- find the score of an optimal (gapped) alignment
of x with a subsequence of y. - find an optimal alignment of this type.
30Restating the goal
- Notation
- s substitution matrix (for brevity, we will
write s(i,j) for s(Xi,Yj)), d linear gap
penalty - for 1? i? m and 1? k? j? n, let x1,iX1X2Xi, and
yk,jYkY2Yj - for two sequences u and v, let B(u,v) be the
score of the pair (u,v) - Goal
- find maxB(x,yk,j) 1? k? j? n
31Naïve approach
- For each choice of k and j, with 1? k? j?n, apply
the NW algorithm to the sequences x and yk,j to
find B(x,yk,j). Then find the max. - Running time to find the optimal score is
therefore O(mn3) - there are choices for
k and j - for each such choice, the running time of NW to
find B(x,yk,j) is O(m(j-k)).
32A better approach
- For i1,2,,m and j1,2,,n, let
F(i,j)maxB(x1,i,yk,j) k1,2,,j - Then maxB(x,yk,j) 1? k? j? n equals
maxF(m,j) j1,2,,n, since - maxB(x,yk,j) 1? k? j? n
- maxB(x1,m,yk,j) 1? k? j? n
- max maxB(x1,m,yk,j) k1,2,,j j1,2,,n
- maxF(m,j) j1,2,,n
33Computing the matrix F
- Initialize
- F(i,0)-id, for i1,2,,m
- F(0,j)0, for j0,1,,n, since deletions of the
beginning of y should be without penalty - Fill in the elements of F recursively, proceeding
from top left to bottom right. - Look at the last row of the (m1)?(n1) matrix F.
The max value on this row is the desired score. - Note that there might be more than one cell on
this row at which the max is attained.
34Recursive formula
- F(i,j)maxF(i-1,j-1)s(i,j), F(i-1,j)-d,
F(i,j-1)-d - Thus, to compute maxF(m,j) j1,2,,n, the
running time is O(mn). - To find an alignment that has this score, we must
keep track, at each step of the recursion, of the
choice made to get the max.
35Example
- xatg, ygaatct (m3 and n6)
- s(X,Y)1 if XY, -1 otherwise
- d2
36Example (cont.)
- One optimal fit
- a t g
- g a a t c t
37Local alignments(linear gap model)
- Given
- a scoring scheme described by a substitution
matrix and a linear gap penalty - two sequences of lengths m and n xX1X2Xm and
yY1Y2Yn. - Goal
- find the score of an optimal (gapped) alignment
of a subsequence of x with a subsequence of y. - find an optimal alignment of this type.
38Assumption
- In what follows we make the assumption that the
scoring scheme we use is such that the expected
score for a random alignment is negative. - If this assumption did not hold, then long
matches between subsequences could score highly
just because of their lengths, so that two long
unrelated subsequences could give a highest
scoring alignment. We do not want this to occur.
39Restating the goal
- Notation
- s substitution matrix (for brevity, we will
write s(i,j) for s(Xi,Yj)), d linear gap
penalty - for 1?h? i?m and 1? k? j? n, let xh,iXhX2Xi,
and yk,jYkY2Yj - for two sequences u and v, let B(u,v) be the
score of the pair (u,v) - Goal
- find maxB(xh,i,yk,j) 1? h? i? m, 1? k? j?
n, when this is non-negative.
40Naïve approach
- For each choice of h, i, k and j, with 1? h? i? m
and 1? k? j? n, apply the NW algorithm to the
sequences xh,i and yk,j to find B(xh,i,yk,j).
Then find the max. - Running time to find the optimal score is
therefore O(m3n3) - there are choices
for h and i and - choices
for k and j - for each such choice, the running time of NW to
find B(xh,i,yk,j) is O((i-h)(j-k)).
41A better approachthe Smith-Waterman algorithm
(1981)
- For i1,2,,m and j1,2,,n, let
- L(i,j)max0, B(xh,i,yk,j) h1,2,, i,
k1,2,,j - Then
- max0, B(xh,i yk,j) 1? h? i? m, 1? k? j? n
- equals maxL(i,j) i1,2,,m, j1,2,,n, since
- max0, B(xh,i,yk,j) 1? h? i? m, 1? k? j? n
- max max0, B(xh,i,yk,j) h1,2,,i, k1,2,,j
i1,2,,m, j1,2,,n - maxL(i,j) i1,2,,m, j1,2,,n
42Computing the matrix L
- Initialize
- L(i,0)0, for i1,2,,m
- L(0,j)0, for j0,1,,n,
- since deletions at the beginning or end of
our two sequences should be without penalty - Fill in the elements of L recursively, proceeding
from top left to bottom right. - Look at the cell of the (m1)? (n1) matrix L.
The max value is the desired score. - Note that there might be more than one cell in L
at which this max is attained.
43Recursive formula
- L(i,j)max0, L(i-1,j-1)s(i,j), L(i-1,j)-d,
L(i,j-1)-d - Thus, to compute
- maxL(i,j) i1,2,,m, j1,2,,n
- the running time is O(mn).
- To find an alignment that has this score, we must
keep track, at each step of the recursion, of the
choice made to get the max.
44Example
- X2 X3 - X4 X5
- Y5 Y6 Y7 Y8 Y9
45Other gap models
- Linear gap model
- simple
- but often not appropriate for biological
sequences often it is harder for a gap to open
than it is for it to extend - penalize additional gap steps a little less than
the first one - use a more complicated gap cost ?(?)
- need to adjust the recurrence relations in the
dynamic programming algorithms
46Other gap models (cont.)
- E.g., for global alignments, need to distinguish
among alignments of x1,i and y1,j that end with
Xi aligned to an indel - the score of such an alignment will also depend
on how many symbols in x immediately preceding Xi
are aligned to indels - if the last symbol not aligned to an indel
preceding Xi is Xk, then the i-k symbols from
Xk1 to Xi are aligned to indels, and the score
of a highest scoring alignment is B(k, j)?(i-k)
47Recursive formula for a general gap model(global
alignment)
- B(0,0)0, B(i,0)?(i), B(0,j)?(j)
- B(i,j) maxB(i-1,j-1)s(i,j),
- B(k,j)?(i-k) k0,1,,i-1,
- B(i,k)?(j-k) k0,1,,j-1
- Finding B(m,n) has a running time of O(m2nmn2)
as, for each cell, one must consider ij1
previous ones, not just 3.
48Special case affine gap model
- ?(?)-d-(?-1)e, 0lteltd dgap open penalty, egap
extension penalty - Trick to get a running time of O(mn) use 3
matrices - S(i,j) holds the score of a highest scoring
alignment between x1,i and y1,j , given that the
alignment ends with Xi aligned to Yj - Ix(i,j) holds the score of a highest scoring
alignment between x1,i and y1,j , given that the
alignment ends with Xi aligned to an indel. - Iy(i,j) holds the score of a highest scoring
alignment between x1,i and y1,j , given that the
alignment ends with and indel aligned to Yj.
49Initialization
- S(0,0) Ix(0,0) Iy(0,0)0
- S(0,j) Ix(0,j) -d-(j-1)e
- S(i,0) Iy(i,0) -d-(i-1)e
- We assume that a deletion will not be followed
directly by an insertion
50Recursive formulae
- S(i,j) maxS(i-1,j-1)s(i,j ),
- Ix(i-1,j-1)s(i,j),
- Iy(i-1,j-1)s(i,j)
- Ix(i,j) maxS(i-1,j)-d, Ix(i-1,j)-e
- Iy(i,j) maxS(i,j-1)-d, Iy(i,j-1)-e
- The score of the pair (x,y) is then
- maxS(m,n), Ix(m,n), Iy(m,n)
51Limitation of the dynamic programming alignment
algorithms
- When dealing with very long sequences a time
complexity as that of the algorithms discussed so
far might not be good enough - e.g. when trying to align a query sequence to all
sequences in a database - Other algorithms, like BLAST, have been
developed. - These use heuristic techniques to limit the
search to a fraction of the possible alignments
between two sequences, with an attempt not to
miss the high-scoring alignments. - Trade-off dynamic programming always finds the
best score, BLAST might miss the best scoring
alignment in some cases.
52BLAST(Basic Local Alignment Search Tool)
- For simplicity here we sketch the ideas behind
the original (ungapped) algorithm - There are variants of this algorithm, also
accomodating for gapped alignments and
incorporating dynamic programming ideas - The statistical aspects of BLAST will be
discussed in a subsequent lecture - GOAL seek for locally maximal segment pairs with
scores above some cutoff (HSPs) - segment paira pair of subsequences of the same
length, one from each sequence - locally maximalscore cannot be improved either
by extending or shortening both segments - a maximal segment pair (MSP) is also a locally
maximal one - statistical results can be used to estimate the
highest MSP score S at which chance similarities
are likely to appear
53BLAST algorithm outline
- Confine the attention to segment pairs that
contain a word pair with a score of at least T,
for some appropriate choice of T and w - word pair a segment pair of fixed length w
- Compile a list of high-scoring words
- Search for hits each hit gives a seed
- Extend seeds
- Note 1 and (the typical implementation of) 4 are
the steps that might introduce departure from the
ideal finding guaranteed MSPs
54An implementation for protein sequences
- w fixed at 4 (more recently 3), T chosen to
balance sensitivity and running time - List all w-mers that score at least T with some
w-mer in the query sequence - Scan the database in search for hits (perfect
matches) in the list constructed in the previous
step (e.g. using a hash) - Extend each hit to economize time the process of
extending in one direction is terminated when a
segment pair whose score falls a certain distance
below the best score found for shorter extensions
is reached