Title: Developing Pairwise Sequence Alignment Algorithms
1Developing Pairwise Sequence Alignment Algorithms
2Outline
- Overview of global and local alignment
- References for sequence alignment algorithms
- Discussion of Needleman-Wunsch iterative approach
to global alignment - Discussion of Smith-Waterman recursive approach
to local alignment - Discussion of how LCS Algorithm can be extended
for - Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
- Group assignments for project
3Overview of Pairwise Sequence Alignment
- Dynamic Programming
- Applied to optimization problems
- Useful when
- Problem can be recursively divided into
sub-problems - Sub-problems are not independent
- Needleman-Wunsch is a global alignment technique
that uses an iterative algorithm and no gap
penalty (could extend to fixed gap penalty). - Smith-Waterman is a local alignment technique
that uses a recursive algorithm and can use
alternative gap penalties (such as affine).
Smith-Watermans algorithm is an extension of
Longest Common Substring (LCS) problem and can be
generalized to solve both local and global
alignment. - Note Needleman-Wunsch is usually used to refer
to global alignment regardless of the algorithm
used.
4References
- http//www.sbc.su.se/arne/kurser/swell/pairwise_a
lignments.html - An Introduction to Bioinformatics Algorithms
(Computational Molecular Biology) Neil C. Jones,
Pavel Pevzner - Computational Molecular Biology An Algorithmic
Approach, Pavel Pevzner - Introduction to Computational Biology Maps,
sequences, and genomes, Michael Waterman - Algorithms on Strings, Trees, and Sequences
Computer Science and Computational Biology, Dan
Gusfield
5Classic Papers
- Needleman, S.B. and Wunsch, C.D. A General Method
Applicable to the Search for Similarities in
Amino Acid Sequence of Two Proteins. J. Mol.
Biol., 48, pp. 443-453, 1970. (http//www.cs.umd.e
du/class/spring2003/cmsc838t/papers/needlemanandwu
nsch1970.pdf) - Smith, T.F. and Waterman, M.S. Identification of
Common Molecular Subsequences. J. Mol. Biol.,
147, pp. 195-197, 1981.(http//www.cmb.usc.edu/pap
ers/msw_papers/msw-042.pdf)
6Why search sequence databases?
- I have just sequenced something. What is known
about the thing I sequenced? - I have a unique sequence. Is there similarity to
another gene that has a known function? - I found a new protein sequence in a lower
organism. Is it similar to a protein from
another species?
7Global Alignment Method
Output An alignment of two sequences is
represented by three lines The first line shows
the first sequence The third line shows the
second sequence. The second line has a row of
symbols. The symbol is a vertical bar wherever
characters in the two sequences match, and a
space where ever they do not. Dots (or dashes)
may be inserted in either sequence to represent
gaps.
8Global Alignment Method (cont. 1)
For example, the two hypothetical sequences
abcdefghajklm abbdhijk could be aligned like
this abcdefghajklm
abbd...hijk As shown, there are 6 matches, 2
mismatches, and one gap of length 3.
9Global Alignment Method (cont. 2)
The alignment can be scored according to a payoff
matrix for fixed scoring. You can also use PAM
or BLOSUM. payoff match gt match,
mismatch gt mismatch,
gap_open gt gap_open,
gap_extend gt gap_extend For correct
operation, an algorithm is created such that the
match must be positive and the other payoff
entities must be negative.
10Global Alignment Method (cont. 3)
Example Given the payoff matrix payoff
match gt 4, mismatch gt
-3, gap_open gt -2,
gap_extend gt -1 What is the alignment and
what is the alignment score for the Following
two sequences? Sequence 1 abcdefghajklm Sequence
2 abbdhijk
11Global Alignment Method (cont. 4)
The sequences abcdefghajklm abbdhijk are
aligned and scored like this a b
c d e f g h a j k l m
a b b d . . . h i j k
match 4 4 4 4 4 4
mismatch -3 -3 gap_open
-2 gap_extend -1-1-1 for a total
score of 24-6-2-3 13.
12Global Alignment Method (cont. 5)
The algorithm should guarantee that no
other alignment of these two sequences has
a higher score under this payoff matrix.
13Three steps in Dynamic Programming
1. Initialization 2. Matrix fill or scoring 3.
Traceback and alignment
14Ends-Free Global Alignment with Fixed Scoring
Two sequences will be aligned. GGATCGA (sequence
1 V) GAATTCAGTTA (sequence 2 W) A simple
fixed scoring scheme will be used ?(vi, wj) 1
if the residue at position i of sequence 1 (vi)
is the same as the residue at position j of the
sequence 2 (wj) called match score ?(vi,
wj) 0 if vi ? wj called mismatch score ?(vi,
-) ?(-, wj) 0 called gap penalty
15Matrix fill step Each position Si,j is defined
to be the MAXIMUM score at position i,j Si,j
MAXIMUM Si-1, j-1 ?(vi, wj) (match or
mismatch in the diagonal) Si, j-1 w (gap in
sequence 1 V) Si-1, j w (gap in sequence 2
W)
column
row
16Initialization step 1) Create Matrix with m 1
columns and n 1 rows, where n number of
letters in sequence 1 and m number of letters
in sequence 2. 2) First column and first row
will be filled with 0s (for ends-free
alignment).
17Scoring Step Fill in row by row
18Traceback Step
Seq1 G G A - T C - G - - A
Seq2 G A A T T C A G T T A
19Global Alignment output file
Global HBA_HUMAN vs HBB_HUMAN Score
290.50 HBA_HUMAN 1
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44
HBB_HUMAN 1
VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE
43 HBA_HUMAN 45 HF.DLS.....HGSAQVKGHG
KKVADALTNAVAHVDDMPNALSAL 83
HBB_HUMAN 44 SFGDLSTPDAVMGNPKVKAHGKK
VLGAFSDGLAHLDNLKGTFATL 88 HBA_HUMAN 84
SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF
128
HBB_HUMAN 89
SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV
133 HBA_HUMAN 129 LASVSTVLTSKYR
141
HBB_HUMAN 134
VAGVANALAHKYH
146 id 45.32 similarity 63.31
(88/139 100) Overall id 43.15 Overall
similarity 60.27 (88/146 100)
20LCS Problem (review)
- Similarity score
- si-1,j
- si,j max si,j-1
- si-1,j-1 1, if vi wj
21Extend LCS to Global Alignment
- si-1,j ?(vi, -)
- si,j max si,j-1 ?(-, wj)
- si-1,j-1 ?(vi, wj)
- ?(vi, -) ?(-, wj) -? fixed gap penalty
- ?(vi, wj) score for match or mismatch can be
fixed, from PAM or BLOSUM
22Global Alignment Alternatives
- Ends-free alignment dont penalize gaps at the
beginning or end - Initialize first row and column of S to 0
- Search last row and column for maximum score
- Regular global alignment score end to end
(penalize gaps at beginning and end) - Initialize first row and column of S with gap
penalty - Alignment score is in the lower right corner of S
23Historical PerspectiveNeedleman-Wunsch (1 of 3)
Match 1 Mismatch 0 Gap 0
24Historical Perspective Needleman-Wunsch (2 of 3)
25Historical Perspective Needleman-Wunsch (3 of 3)
From page 446 It is apparent that the above
array operation can begin at any of a number of
points along the borders of the array, which is
equivalent to a comparison of N-terminal residues
or C-terminal residues only. As long as the
appropriate rules for pathways are followed, the
maximum match will be the same. The cells of the
array which contributed to the maximum match, may
be determined by recording the origin of the
number that was added to each cell when the array
was operated upon.
26Smith-Waterman Algorithm Advances inApplied
Mathematics, 2482-489 (1981)
- The Smith-Waterman algorithm is a local alignment
tool used to obtain sensitive pairwise similarity
alignments. - Smith-Waterman algorithm uses dynamic
programming. - It selects the optimal path as the highest ranked
alignment. - Smith-Waterman algorithm is useful for finding
local areas of similarity between sequences that
are too dissimilar for global alignment. - The S-W algorithm uses a lot of computer memory.
- BLAST and FASTA are other search algorithms that
use some aspects of S-W.
27Smith-Waterman (cont. 1)
a. It searches for sequence matches. b. Assigns a
score to each pair of amino acids -uses
similarity scores -uses positive scores for
related residues -uses negative scores for
substitutions and gaps c. Initializes edges of
the matrix with zeros d. As the scores are summed
in the matrix, any sum below 0 is recorded as
a zero. e. Begins backtracing at the maximum
value found anywhere in the matrix. f.
Continues the backtrace until the score falls to
0.
28Smith-Waterman (cont. 2)
H E A G A W G H E E
Put zeros on borders. Assign initial scores based
on a scoring matrix. Calculate new scores based
on adjacent cell scores. If sum is less than zero
or equal to zero begin new scoring with next
cell.
P A W H E A E
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 5 0 5 0 0 0 0 0 0 0 0 0 3 0 2012 4 0
0 0 10 2 0 0 1 12182214 6 0 2 16 8 0 0 4101828
20 0 0 82113 5 0 41020 27 0 0 6131912 4 0 416
26
This example uses the BLOSUM45 Scoring Matrix
with a gap penalty of -8.
29Smith-Waterman (cont. 3)
H E A G A W G H E E
P A W H E A E
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 5 0 5 0 0 0 0 0 0 0 0 0 3 0 2012 4 0
0 0 10 2 0 0 0 12182214 6 0 2 16 8 0 0 4101828
20 0 0 82113 5 0 41020 27 0 0 6131812 4 0 416
26
Begin backtrace at the maximum value
found anywhere on the matrix. Continue the
backtrace until score falls to zero
AWGHE AW-HE
Path Score28
30Calculation of similarity score and percent
similarity
A W G H E A W - H E
Blosum45 SCORES
5 15 -8 10 6
GAP PENALTY (novel)
SIMILARITY NUMBER OF POS. SCORES DIVIDED BY
NUMBER OF AAs IN REGION x 100
Similarity Score 28
SIMILARITY 4/5 x 100 80
31Extend LCS to Local Alignment
- 0 (no negative scores)
- si-1,j ?(vi, -)
- si,j max si,j-1 ?(-, wj)
- si-1,j-1 ?(vi, wj)
- ?(vi, -) ?(-, wj) -? fixed gap penalty
- ?(vi, wj) score for match or mismatch can be
fixed, from PAM or BLOSUM
32Historical Perspective Smith-Waterman (1 of 3)
Algorithm The two molecular sequences will be
Aa1a2 . . . an, and Bb1b2 . . . bm. A
similarity s(a,b) is given between sequence
elements a and b. Deletions of length k are given
weight Wk. To find pairs of segments with high
degrees of similarity, we set up a matrix H .
First set Hk0 Hol 0 for 0 lt k lt n and 0 lt
l lt m. Preliminary values of H have the
interpretation that H i j is the maximum
similarity of two segments ending in ai and bj.
respectively. These values are obtained from the
relationship HijmaxHi-1,j-1 s(ai,bj), max
Hi-k,j Wk, maxHi,j-l - Wl , 0 ( 1 )
k
gt 1 l gt 1 1 lt i lt n and 1 lt j
lt m.
33Historical Perspective Smith-Waterman (2 of 3)
- The formula for Hij follows by considering the
possibilities for ending the segments at any ai
and bj. - If ai and bj are associated, the similarity is
- Hi-l,j-l s(ai,bj).
- (2) If ai is at the end of a deletion of length
k, the similarity is - Hi k, j - Wk .
- (3) If bj is at the end of a deletion of length
1, the similarity is - Hi,j-l - Wl. (typo in paper)
- (4) Finally, a zero is included to prevent
calculated negative similarity, indicating no
similarity up to ai and bj.
34Historical Perspective Smith-Waterman (3 of 3)
The pair of segments with maximum similarity is
found by first locating the maximum element of H.
The other matrix elements leading to this maximum
value are than sequentially determined with a
traceback procedure ending with an element of H
equal to zero. This procedure identifies the
segments as well as produces the corresponding
alignment. The pair of segments with the next
best similarity is found by applying the
traceback procedure to the second largest element
of H not associated with the first traceback.
35The E value (false positive expectation value)
- The Expect value (E) is a parameter that
describes the number of hits one can "expect"
to see just by chance when searching a database
of a particular size. It decreases exponentially
as the Similarity Score (S) increases (inverse
relationship). The higher the Similarity Score,
the lower the E value. Essentially, the E value
describes the random background noise that exists
for matches between two sequences. The E value
is used as a convenient way to create a
significance threshold for reporting results.
When the E value is increased from the default
value of 10 prior to a sequence search, a larger
list with more low-similarity scoring hits can be
reported. An E value of 1 assigned to a hit can
be interpreted as meaning that in a database of
the current size you might expect to see 1 match
with a similar score simply by chance.
36E value (Karlin-Altschul statistics)
- E Kmne-?S
- Where K is constant, m is the length of the query
sequence, n is the length of the database
sequence, ? is the decay constant, S is the
similarity score. - If S increases, E decreases exponentially.
- If the decay constant increases, E decreases
exponentially - If mn increases the search space increases and
there is a greater chance for a random hit, E
increases. Larger database will increase E.
However, larger query sequence often decreases E.
Why???
37Project Teams and Presentation Assignments
- Base Project (Global Alignment)
- Larry and Darnell
- Extension 1 (Ends-Free Global Alignment)
- Steven and Charlie
- Extension 2 (Local Alignment)
- Olivera and Natalia
- Extension 3 (Local Alignment all)
- Brittany and Alana
- Extension 4 (Database)
- Nathaniel and Anna U.
- Extension 5 (Space Efficient Algorithm)
- David and Shilpa
- Extension 6 (Affine Gap Penalty)
- Rachel and Anna P.
- Extension 7 (Hirschbergs Algorithm)
- Wendy and Andrew
38Workshop
- Meet with your group and develop for the overall
structure of your program - High-level algorithm
- Identify the modules, functions (including
parameters), and global variables - Determine who is responsible for each module
- Devise a development timeline and a testing
strategy