Developing Pairwise Sequence Alignment Algorithms - PowerPoint PPT Presentation

1 / 38

About This Presentation

Title:

Developing Pairwise Sequence Alignment Algorithms

Description:

I found a new protein sequence in a lower organism. Is it similar to a protein from another species? ... begin new scoring. with next cell. This example uses ... – PowerPoint PPT presentation

Number of Views:95

Avg rating:3.0/5.0

Slides: 39

Provided by: lanct9

Learn more at: https://www.calstatela.edu

Category:

more less

Transcript and Presenter's Notes

Title: Developing Pairwise Sequence Alignment Algorithms

1
Developing Pairwise Sequence Alignment Algorithms

Dr. Nancy Warter-Perez

2
Outline

Overview of global and local alignment
References for sequence alignment algorithms
Discussion of Needleman-Wunsch iterative approach
to global alignment
Discussion of Smith-Waterman recursive approach
to local alignment
Discussion of how LCS Algorithm can be extended
for
Global alignment (Needleman-Wunsch)
Local alignment (Smith-Waterman)
Group assignments for project

3
Overview of Pairwise Sequence Alignment

Dynamic Programming
Applied to optimization problems
Useful when
Problem can be recursively divided into
sub-problems
Sub-problems are not independent
Needleman-Wunsch is a global alignment technique
that uses an iterative algorithm and no gap
penalty (could extend to fixed gap penalty).
Smith-Waterman is a local alignment technique
that uses a recursive algorithm and can use
alternative gap penalties (such as affine).
Smith-Watermans algorithm is an extension of
Longest Common Substring (LCS) problem and can be
generalized to solve both local and global
alignment.
Note Needleman-Wunsch is usually used to refer
to global alignment regardless of the algorithm
used.

4
References

http//www.sbc.su.se/arne/kurser/swell/pairwise_a
lignments.html
An Introduction to Bioinformatics Algorithms
(Computational Molecular Biology) Neil C. Jones,
Pavel Pevzner
Computational Molecular Biology An Algorithmic
Approach, Pavel Pevzner
Introduction to Computational Biology Maps,
sequences, and genomes, Michael Waterman
Algorithms on Strings, Trees, and Sequences
Computer Science and Computational Biology, Dan
Gusfield

5
Classic Papers

Needleman, S.B. and Wunsch, C.D. A General Method
Applicable to the Search for Similarities in
Amino Acid Sequence of Two Proteins. J. Mol.
Biol., 48, pp. 443-453, 1970. (http//www.cs.umd.e
du/class/spring2003/cmsc838t/papers/needlemanandwu
nsch1970.pdf)
Smith, T.F. and Waterman, M.S. Identification of
Common Molecular Subsequences. J. Mol. Biol.,
147, pp. 195-197, 1981.(http//www.cmb.usc.edu/pap
ers/msw_papers/msw-042.pdf)

6
Why search sequence databases?

I have just sequenced something. What is known
about the thing I sequenced?
I have a unique sequence. Is there similarity to
another gene that has a known function?
I found a new protein sequence in a lower
organism. Is it similar to a protein from
another species?

7
Global Alignment Method
Output An alignment of two sequences is
represented by three lines The first line shows
the first sequence The third line shows the
second sequence. The second line has a row of
symbols. The symbol is a vertical bar wherever
characters in the two sequences match, and a
space where ever they do not. Dots (or dashes)
may be inserted in either sequence to represent
gaps.
8
Global Alignment Method (cont. 1)
For example, the two hypothetical sequences
abcdefghajklm abbdhijk could be aligned like
this abcdefghajklm
abbd...hijk As shown, there are 6 matches, 2
mismatches, and one gap of length 3.
9
Global Alignment Method (cont. 2)
The alignment can be scored according to a payoff
matrix for fixed scoring. You can also use PAM
or BLOSUM. payoff match gt match,
mismatch gt mismatch,
gap_open gt gap_open,
gap_extend gt gap_extend For correct
operation, an algorithm is created such that the
match must be positive and the other payoff
entities must be negative.
10
Global Alignment Method (cont. 3)
Example Given the payoff matrix payoff
match gt 4, mismatch gt
-3, gap_open gt -2,
gap_extend gt -1 What is the alignment and
what is the alignment score for the Following
two sequences? Sequence 1 abcdefghajklm Sequence
2 abbdhijk
11
Global Alignment Method (cont. 4)
The sequences abcdefghajklm abbdhijk are
aligned and scored like this a b
c d e f g h a j k l m
a b b d . . . h i j k
match 4 4 4 4 4 4
mismatch -3 -3 gap_open
-2 gap_extend -1-1-1 for a total
score of 24-6-2-3 13.
12
Global Alignment Method (cont. 5)
The algorithm should guarantee that no
other alignment of these two sequences has
a higher score under this payoff matrix.
13
Three steps in Dynamic Programming
1. Initialization 2. Matrix fill or scoring 3.
Traceback and alignment
14
Ends-Free Global Alignment with Fixed Scoring
Two sequences will be aligned. GGATCGA (sequence
1 V) GAATTCAGTTA (sequence 2 W) A simple
fixed scoring scheme will be used ?(vi, wj) 1
if the residue at position i of sequence 1 (vi)
is the same as the residue at position j of the
sequence 2 (wj) called match score ?(vi,
wj) 0 if vi ? wj called mismatch score ?(vi,
-) ?(-, wj) 0 called gap penalty
15
Matrix fill step Each position Si,j is defined
to be the MAXIMUM score at position i,j Si,j
MAXIMUM Si-1, j-1 ?(vi, wj) (match or
mismatch in the diagonal) Si, j-1 w (gap in
sequence 1 V) Si-1, j w (gap in sequence 2
W)
column
row
16
Initialization step 1) Create Matrix with m 1
columns and n 1 rows, where n number of
letters in sequence 1 and m number of letters
in sequence 2. 2) First column and first row
will be filled with 0s (for ends-free
alignment).
17
Scoring Step Fill in row by row
18
Traceback Step
Seq1 G G A - T C - G - - A
Seq2 G A A T T C A G T T A
19
Global Alignment output file
Global HBA_HUMAN vs HBB_HUMAN Score
290.50 HBA_HUMAN 1
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44

HBB_HUMAN 1
VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE
43 HBA_HUMAN 45 HF.DLS.....HGSAQVKGHG
KKVADALTNAVAHVDDMPNALSAL 83

HBB_HUMAN 44 SFGDLSTPDAVMGNPKVKAHGKK
VLGAFSDGLAHLDNLKGTFATL 88 HBA_HUMAN 84
SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF
128
HBB_HUMAN 89
SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV
133 HBA_HUMAN 129 LASVSTVLTSKYR
141
HBB_HUMAN 134
VAGVANALAHKYH
146 id 45.32 similarity 63.31
(88/139 100) Overall id 43.15 Overall
similarity 60.27 (88/146 100)
20
LCS Problem (review)

Similarity score
si-1,j
si,j max si,j-1
si-1,j-1 1, if vi wj

21
Extend LCS to Global Alignment

si-1,j ?(vi, -)
si,j max si,j-1 ?(-, wj)
si-1,j-1 ?(vi, wj)
?(vi, -) ?(-, wj) -? fixed gap penalty
?(vi, wj) score for match or mismatch can be
fixed, from PAM or BLOSUM

22
Global Alignment Alternatives

Ends-free alignment dont penalize gaps at the
beginning or end
Initialize first row and column of S to 0
Search last row and column for maximum score
Regular global alignment score end to end
(penalize gaps at beginning and end)
Initialize first row and column of S with gap
penalty
Alignment score is in the lower right corner of S

23
Historical PerspectiveNeedleman-Wunsch (1 of 3)
Match 1 Mismatch 0 Gap 0
24
Historical Perspective Needleman-Wunsch (2 of 3)
25
Historical Perspective Needleman-Wunsch (3 of 3)
From page 446 It is apparent that the above
array operation can begin at any of a number of
points along the borders of the array, which is
equivalent to a comparison of N-terminal residues
or C-terminal residues only. As long as the
appropriate rules for pathways are followed, the
maximum match will be the same. The cells of the
array which contributed to the maximum match, may
be determined by recording the origin of the
number that was added to each cell when the array
was operated upon.
26
Smith-Waterman Algorithm Advances inApplied
Mathematics, 2482-489 (1981)

The Smith-Waterman algorithm is a local alignment
tool used to obtain sensitive pairwise similarity
alignments.
Smith-Waterman algorithm uses dynamic
programming.
It selects the optimal path as the highest ranked
alignment.
Smith-Waterman algorithm is useful for finding
local areas of similarity between sequences that
are too dissimilar for global alignment.
The S-W algorithm uses a lot of computer memory.
BLAST and FASTA are other search algorithms that
use some aspects of S-W.

27
Smith-Waterman (cont. 1)
a. It searches for sequence matches. b. Assigns a
score to each pair of amino acids -uses
similarity scores -uses positive scores for
related residues -uses negative scores for
substitutions and gaps c. Initializes edges of
the matrix with zeros d. As the scores are summed
in the matrix, any sum below 0 is recorded as
a zero. e. Begins backtracing at the maximum
value found anywhere in the matrix. f.
Continues the backtrace until the score falls to
0.
28
Smith-Waterman (cont. 2)
H E A G A W G H E E
Put zeros on borders. Assign initial scores based
on a scoring matrix. Calculate new scores based
on adjacent cell scores. If sum is less than zero
or equal to zero begin new scoring with next
cell.
P A W H E A E
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 5 0 5 0 0 0 0 0 0 0 0 0 3 0 2012 4 0
0 0 10 2 0 0 1 12182214 6 0 2 16 8 0 0 4101828
20 0 0 82113 5 0 41020 27 0 0 6131912 4 0 416
26
This example uses the BLOSUM45 Scoring Matrix
with a gap penalty of -8.
29
Smith-Waterman (cont. 3)
H E A G A W G H E E
P A W H E A E
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 5 0 5 0 0 0 0 0 0 0 0 0 3 0 2012 4 0
0 0 10 2 0 0 0 12182214 6 0 2 16 8 0 0 4101828
20 0 0 82113 5 0 41020 27 0 0 6131812 4 0 416
26
Begin backtrace at the maximum value
found anywhere on the matrix. Continue the
backtrace until score falls to zero
AWGHE AW-HE
Path Score28
30
Calculation of similarity score and percent
similarity
A W G H E A W - H E
Blosum45 SCORES
5 15 -8 10 6
GAP PENALTY (novel)
SIMILARITY NUMBER OF POS. SCORES DIVIDED BY
NUMBER OF AAs IN REGION x 100
Similarity Score 28
SIMILARITY 4/5 x 100 80
31
Extend LCS to Local Alignment

0 (no negative scores)
si-1,j ?(vi, -)
si,j max si,j-1 ?(-, wj)
si-1,j-1 ?(vi, wj)
?(vi, -) ?(-, wj) -? fixed gap penalty
?(vi, wj) score for match or mismatch can be
fixed, from PAM or BLOSUM

32
Historical Perspective Smith-Waterman (1 of 3)
Algorithm The two molecular sequences will be
Aa1a2 . . . an, and Bb1b2 . . . bm. A
similarity s(a,b) is given between sequence
elements a and b. Deletions of length k are given
weight Wk. To find pairs of segments with high
degrees of similarity, we set up a matrix H .
First set Hk0 Hol 0 for 0 lt k lt n and 0 lt
l lt m. Preliminary values of H have the
interpretation that H i j is the maximum
similarity of two segments ending in ai and bj.
respectively. These values are obtained from the
relationship HijmaxHi-1,j-1 s(ai,bj), max
Hi-k,j Wk, maxHi,j-l - Wl , 0 ( 1 )
k
gt 1 l gt 1 1 lt i lt n and 1 lt j
lt m.
33
Historical Perspective Smith-Waterman (2 of 3)

The formula for Hij follows by considering the
possibilities for ending the segments at any ai
and bj.
If ai and bj are associated, the similarity is
Hi-l,j-l s(ai,bj).
(2) If ai is at the end of a deletion of length
k, the similarity is
Hi k, j - Wk .
(3) If bj is at the end of a deletion of length
1, the similarity is
Hi,j-l - Wl. (typo in paper)
(4) Finally, a zero is included to prevent
calculated negative similarity, indicating no
similarity up to ai and bj.

34
Historical Perspective Smith-Waterman (3 of 3)
The pair of segments with maximum similarity is
found by first locating the maximum element of H.
The other matrix elements leading to this maximum
value are than sequentially determined with a
traceback procedure ending with an element of H
equal to zero. This procedure identifies the
segments as well as produces the corresponding
alignment. The pair of segments with the next
best similarity is found by applying the
traceback procedure to the second largest element
of H not associated with the first traceback.
35
The E value (false positive expectation value)

The Expect value (E) is a parameter that
describes the number of hits one can "expect"
to see just by chance when searching a database
of a particular size. It decreases exponentially
as the Similarity Score (S) increases (inverse
relationship). The higher the Similarity Score,
the lower the E value. Essentially, the E value
describes the random background noise that exists
for matches between two sequences. The E value
is used as a convenient way to create a
significance threshold for reporting results.
When the E value is increased from the default
value of 10 prior to a sequence search, a larger
list with more low-similarity scoring hits can be
reported. An E value of 1 assigned to a hit can
be interpreted as meaning that in a database of
the current size you might expect to see 1 match
with a similar score simply by chance.

36
E value (Karlin-Altschul statistics)

E Kmne-?S
Where K is constant, m is the length of the query
sequence, n is the length of the database
sequence, ? is the decay constant, S is the
similarity score.
If S increases, E decreases exponentially.
If the decay constant increases, E decreases
exponentially
If mn increases the search space increases and
there is a greater chance for a random hit, E
increases. Larger database will increase E.
However, larger query sequence often decreases E.
Why???

37
Project Teams and Presentation Assignments