Developing Pairwise Sequence Alignment Algorithms - PowerPoint PPT Presentation

1 / 38
About This Presentation
Title:

Developing Pairwise Sequence Alignment Algorithms

Description:

I found a new protein sequence in a lower organism. Is it similar to a protein from another species? ... begin new scoring. with next cell. This example uses ... – PowerPoint PPT presentation

Number of Views:95
Avg rating:3.0/5.0
Slides: 39
Provided by: lanct9
Category:

less

Transcript and Presenter's Notes

Title: Developing Pairwise Sequence Alignment Algorithms


1
Developing Pairwise Sequence Alignment Algorithms
  • Dr. Nancy Warter-Perez

2
Outline
  • Overview of global and local alignment
  • References for sequence alignment algorithms
  • Discussion of Needleman-Wunsch iterative approach
    to global alignment
  • Discussion of Smith-Waterman recursive approach
    to local alignment
  • Discussion of how LCS Algorithm can be extended
    for
  • Global alignment (Needleman-Wunsch)
  • Local alignment (Smith-Waterman)
  • Group assignments for project

3
Overview of Pairwise Sequence Alignment
  • Dynamic Programming
  • Applied to optimization problems
  • Useful when
  • Problem can be recursively divided into
    sub-problems
  • Sub-problems are not independent
  • Needleman-Wunsch is a global alignment technique
    that uses an iterative algorithm and no gap
    penalty (could extend to fixed gap penalty).
  • Smith-Waterman is a local alignment technique
    that uses a recursive algorithm and can use
    alternative gap penalties (such as affine).
    Smith-Watermans algorithm is an extension of
    Longest Common Substring (LCS) problem and can be
    generalized to solve both local and global
    alignment.
  • Note Needleman-Wunsch is usually used to refer
    to global alignment regardless of the algorithm
    used.

4
References
  • http//www.sbc.su.se/arne/kurser/swell/pairwise_a
    lignments.html
  • An Introduction to Bioinformatics Algorithms
    (Computational Molecular Biology) Neil C. Jones,
    Pavel Pevzner
  • Computational Molecular Biology An Algorithmic
    Approach, Pavel Pevzner
  • Introduction to Computational Biology Maps,
    sequences, and genomes, Michael Waterman
  • Algorithms on Strings, Trees, and Sequences
    Computer Science and Computational Biology, Dan
    Gusfield

5
Classic Papers
  • Needleman, S.B. and Wunsch, C.D. A General Method
    Applicable to the Search for Similarities in
    Amino Acid Sequence of Two Proteins. J. Mol.
    Biol., 48, pp. 443-453, 1970. (http//www.cs.umd.e
    du/class/spring2003/cmsc838t/papers/needlemanandwu
    nsch1970.pdf)
  • Smith, T.F. and Waterman, M.S. Identification of
    Common Molecular Subsequences. J. Mol. Biol.,
    147, pp. 195-197, 1981.(http//www.cmb.usc.edu/pap
    ers/msw_papers/msw-042.pdf)

6
Why search sequence databases?
  • I have just sequenced something. What is known
    about the thing I sequenced?
  • I have a unique sequence. Is there similarity to
    another gene that has a known function?
  • I found a new protein sequence in a lower
    organism. Is it similar to a protein from
    another species?

7
Global Alignment Method
Output An alignment of two sequences is
represented by three lines The first line shows
the first sequence The third line shows the
second sequence. The second line has a row of
symbols. The symbol is a vertical bar wherever
characters in the two sequences match, and a
space where ever they do not. Dots (or dashes)
may be inserted in either sequence to represent
gaps.
8
Global Alignment Method (cont. 1)
For example, the two hypothetical sequences
abcdefghajklm abbdhijk could be aligned like
this abcdefghajklm
abbd...hijk As shown, there are 6 matches, 2
mismatches, and one gap of length 3.
9
Global Alignment Method (cont. 2)
The alignment can be scored according to a payoff
matrix for fixed scoring. You can also use PAM
or BLOSUM. payoff match gt match,
mismatch gt mismatch,
gap_open gt gap_open,
gap_extend gt gap_extend For correct
operation, an algorithm is created such that the
match must be positive and the other payoff
entities must be negative.
10
Global Alignment Method (cont. 3)
Example Given the payoff matrix payoff
match gt 4, mismatch gt
-3, gap_open gt -2,
gap_extend gt -1 What is the alignment and
what is the alignment score for the Following
two sequences? Sequence 1 abcdefghajklm Sequence
2 abbdhijk
11
Global Alignment Method (cont. 4)
The sequences abcdefghajklm abbdhijk are
aligned and scored like this a b
c d e f g h a j k l m
a b b d . . . h i j k
match 4 4 4 4 4 4
mismatch -3 -3 gap_open
-2 gap_extend -1-1-1 for a total
score of 24-6-2-3 13.
12
Global Alignment Method (cont. 5)
The algorithm should guarantee that no
other alignment of these two sequences has
a higher score under this payoff matrix.
13
Three steps in Dynamic Programming
1. Initialization 2. Matrix fill or scoring 3.
Traceback and alignment
14
Ends-Free Global Alignment with Fixed Scoring
Two sequences will be aligned. GGATCGA (sequence
1 V) GAATTCAGTTA (sequence 2 W) A simple
fixed scoring scheme will be used ?(vi, wj) 1
if the residue at position i of sequence 1 (vi)
is the same as the residue at position j of the
sequence 2 (wj) called match score ?(vi,
wj) 0 if vi ? wj called mismatch score ?(vi,
-) ?(-, wj) 0 called gap penalty
15
Matrix fill step Each position Si,j is defined
to be the MAXIMUM score at position i,j Si,j
MAXIMUM Si-1, j-1 ?(vi, wj) (match or
mismatch in the diagonal) Si, j-1 w (gap in
sequence 1 V) Si-1, j w (gap in sequence 2
W)
column
row
16
Initialization step 1) Create Matrix with m 1
columns and n 1 rows, where n number of
letters in sequence 1 and m number of letters
in sequence 2. 2) First column and first row
will be filled with 0s (for ends-free
alignment).
17
Scoring Step Fill in row by row
18
Traceback Step
Seq1 G G A - T C - G - - A
Seq2 G A A T T C A G T T A
19
Global Alignment output file
Global HBA_HUMAN vs HBB_HUMAN Score
290.50 HBA_HUMAN 1
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44

HBB_HUMAN 1
VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE
43 HBA_HUMAN 45 HF.DLS.....HGSAQVKGHG
KKVADALTNAVAHVDDMPNALSAL 83

HBB_HUMAN 44 SFGDLSTPDAVMGNPKVKAHGKK
VLGAFSDGLAHLDNLKGTFATL 88 HBA_HUMAN 84
SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF
128
HBB_HUMAN 89
SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV
133 HBA_HUMAN 129 LASVSTVLTSKYR
141
HBB_HUMAN 134
VAGVANALAHKYH
146 id 45.32 similarity 63.31
(88/139 100) Overall id 43.15 Overall
similarity 60.27 (88/146 100)
20
LCS Problem (review)
  • Similarity score
  • si-1,j
  • si,j max si,j-1
  • si-1,j-1 1, if vi wj

21
Extend LCS to Global Alignment
  • si-1,j ?(vi, -)
  • si,j max si,j-1 ?(-, wj)
  • si-1,j-1 ?(vi, wj)
  • ?(vi, -) ?(-, wj) -? fixed gap penalty
  • ?(vi, wj) score for match or mismatch can be
    fixed, from PAM or BLOSUM

22
Global Alignment Alternatives
  • Ends-free alignment dont penalize gaps at the
    beginning or end
  • Initialize first row and column of S to 0
  • Search last row and column for maximum score
  • Regular global alignment score end to end
    (penalize gaps at beginning and end)
  • Initialize first row and column of S with gap
    penalty
  • Alignment score is in the lower right corner of S

23
Historical PerspectiveNeedleman-Wunsch (1 of 3)
Match 1 Mismatch 0 Gap 0
24
Historical Perspective Needleman-Wunsch (2 of 3)
25
Historical Perspective Needleman-Wunsch (3 of 3)
From page 446 It is apparent that the above
array operation can begin at any of a number of
points along the borders of the array, which is
equivalent to a comparison of N-terminal residues
or C-terminal residues only. As long as the
appropriate rules for pathways are followed, the
maximum match will be the same. The cells of the
array which contributed to the maximum match, may
be determined by recording the origin of the
number that was added to each cell when the array
was operated upon.
26
Smith-Waterman Algorithm Advances inApplied
Mathematics, 2482-489 (1981)
  • The Smith-Waterman algorithm is a local alignment
    tool used to obtain sensitive pairwise similarity
    alignments.
  • Smith-Waterman algorithm uses dynamic
    programming.
  • It selects the optimal path as the highest ranked
    alignment.
  • Smith-Waterman algorithm is useful for finding
    local areas of similarity between sequences that
    are too dissimilar for global alignment.
  • The S-W algorithm uses a lot of computer memory.
  • BLAST and FASTA are other search algorithms that
    use some aspects of S-W.

27
Smith-Waterman (cont. 1)
a. It searches for sequence matches. b. Assigns a
score to each pair of amino acids -uses
similarity scores -uses positive scores for
related residues -uses negative scores for
substitutions and gaps c. Initializes edges of
the matrix with zeros d. As the scores are summed
in the matrix, any sum below 0 is recorded as
a zero. e. Begins backtracing at the maximum
value found anywhere in the matrix. f.
Continues the backtrace until the score falls to
0.
28
Smith-Waterman (cont. 2)
H E A G A W G H E E
Put zeros on borders. Assign initial scores based
on a scoring matrix. Calculate new scores based
on adjacent cell scores. If sum is less than zero
or equal to zero begin new scoring with next
cell.
P A W H E A E
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 5 0 5 0 0 0 0 0 0 0 0 0 3 0 2012 4 0
0 0 10 2 0 0 1 12182214 6 0 2 16 8 0 0 4101828
20 0 0 82113 5 0 41020 27 0 0 6131912 4 0 416
26
This example uses the BLOSUM45 Scoring Matrix
with a gap penalty of -8.
29
Smith-Waterman (cont. 3)
H E A G A W G H E E
P A W H E A E
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 5 0 5 0 0 0 0 0 0 0 0 0 3 0 2012 4 0
0 0 10 2 0 0 0 12182214 6 0 2 16 8 0 0 4101828
20 0 0 82113 5 0 41020 27 0 0 6131812 4 0 416
26
Begin backtrace at the maximum value
found anywhere on the matrix. Continue the
backtrace until score falls to zero
AWGHE AW-HE
Path Score28
30
Calculation of similarity score and percent
similarity
A W G H E A W - H E
Blosum45 SCORES
5 15 -8 10 6
GAP PENALTY (novel)
SIMILARITY NUMBER OF POS. SCORES DIVIDED BY
NUMBER OF AAs IN REGION x 100
Similarity Score 28
SIMILARITY 4/5 x 100 80
31
Extend LCS to Local Alignment
  • 0 (no negative scores)
  • si-1,j ?(vi, -)
  • si,j max si,j-1 ?(-, wj)
  • si-1,j-1 ?(vi, wj)
  • ?(vi, -) ?(-, wj) -? fixed gap penalty
  • ?(vi, wj) score for match or mismatch can be
    fixed, from PAM or BLOSUM

32
Historical Perspective Smith-Waterman (1 of 3)
Algorithm The two molecular sequences will be
Aa1a2 . . . an, and Bb1b2 . . . bm. A
similarity s(a,b) is given between sequence
elements a and b. Deletions of length k are given
weight Wk. To find pairs of segments with high
degrees of similarity, we set up a matrix H .
First set Hk0 Hol 0 for 0 lt k lt n and 0 lt
l lt m. Preliminary values of H have the
interpretation that H i j is the maximum
similarity of two segments ending in ai and bj.
respectively. These values are obtained from the
relationship HijmaxHi-1,j-1 s(ai,bj), max
Hi-k,j Wk, maxHi,j-l - Wl , 0 ( 1 )
k
gt 1 l gt 1 1 lt i lt n and 1 lt j
lt m.
33
Historical Perspective Smith-Waterman (2 of 3)
  • The formula for Hij follows by considering the
    possibilities for ending the segments at any ai
    and bj.
  • If ai and bj are associated, the similarity is
  • Hi-l,j-l s(ai,bj).
  • (2) If ai is at the end of a deletion of length
    k, the similarity is
  • Hi k, j - Wk .
  • (3) If bj is at the end of a deletion of length
    1, the similarity is
  • Hi,j-l - Wl. (typo in paper)
  • (4) Finally, a zero is included to prevent
    calculated negative similarity, indicating no
    similarity up to ai and bj.

34
Historical Perspective Smith-Waterman (3 of 3)
The pair of segments with maximum similarity is
found by first locating the maximum element of H.
The other matrix elements leading to this maximum
value are than sequentially determined with a
traceback procedure ending with an element of H
equal to zero. This procedure identifies the
segments as well as produces the corresponding
alignment. The pair of segments with the next
best similarity is found by applying the
traceback procedure to the second largest element
of H not associated with the first traceback.
35
The E value (false positive expectation value)
  • The Expect value (E) is a parameter that
    describes the number of hits one can "expect"
    to see just by chance when searching a database
    of a particular size. It decreases exponentially
    as the Similarity Score (S) increases (inverse
    relationship). The higher the Similarity Score,
    the lower the E value. Essentially, the E value
    describes the random background noise that exists
    for matches between two sequences. The E value
    is used as a convenient way to create a
    significance threshold for reporting results.
    When the E value is increased from the default
    value of 10 prior to a sequence search, a larger
    list with more low-similarity scoring hits can be
    reported. An E value of 1 assigned to a hit can
    be interpreted as meaning that in a database of
    the current size you might expect to see 1 match
    with a similar score simply by chance.

36
E value (Karlin-Altschul statistics)
  • E Kmne-?S
  • Where K is constant, m is the length of the query
    sequence, n is the length of the database
    sequence, ? is the decay constant, S is the
    similarity score.
  • If S increases, E decreases exponentially.
  • If the decay constant increases, E decreases
    exponentially
  • If mn increases the search space increases and
    there is a greater chance for a random hit, E
    increases. Larger database will increase E.
    However, larger query sequence often decreases E.
    Why???

37
Project Teams and Presentation Assignments
  • Base Project (Global Alignment)
  • Larry and Darnell
  • Extension 1 (Ends-Free Global Alignment)
  • Steven and Charlie
  • Extension 2 (Local Alignment)
  • Olivera and Natalia
  • Extension 3 (Local Alignment all)
  • Brittany and Alana
  • Extension 4 (Database)
  • Nathaniel and Anna U.
  • Extension 5 (Space Efficient Algorithm)
  • David and Shilpa
  • Extension 6 (Affine Gap Penalty)
  • Rachel and Anna P.
  • Extension 7 (Hirschbergs Algorithm)
  • Wendy and Andrew

38
Workshop
  • Meet with your group and develop for the overall
    structure of your program
  • High-level algorithm
  • Identify the modules, functions (including
    parameters), and global variables
  • Determine who is responsible for each module
  • Devise a development timeline and a testing
    strategy
Write a Comment
User Comments (0)
About PowerShow.com