Loading...

PPT – Developing Pairwise Sequence Alignment Algorithms PowerPoint presentation | free to download - id: 27111c-ZDc1Z

The Adobe Flash plugin is needed to view this content

Developing Pairwise Sequence Alignment Algorithms

- Dr. Nancy Warter-Perez

Outline

- Overview of global and local alignment
- References for sequence alignment algorithms
- Discussion of Needleman-Wunsch iterative approach

to global alignment - Discussion of Smith-Waterman recursive approach

to local alignment - Discussion of how LCS Algorithm can be extended

for - Global alignment (Needleman-Wunsch)
- Local alignment (Smith-Waterman)
- Project assignment

Why search sequence databases?

- I have just sequenced something. What is known

about the thing I sequenced? - I have a unique sequence. Is there similarity to

another gene that has a known function? - I found a new protein sequence in a lower

organism. Is it similar to a protein from

another species?

Overview of Pairwise Sequence Alignment

- Dynamic Programming
- Applied to optimization problems
- Useful when
- Problem can be recursively divided into

sub-problems - Sub-problems are not independent
- Needleman-Wunsch is a global alignment technique

that uses an iterative algorithm and no gap

penalty (could extend to fixed gap penalty). - Smith-Waterman is a local alignment technique

that uses a recursive algorithm and can use

alternative gap penalties (such as affine).

Smith-Watermans algorithm is an extension of

Longest Common Substring (LCS) problem and can be

generalized to solve both local and global

alignment. - Note Needleman-Wunsch is usually used to refer

to global alignment regardless of the algorithm

used.

References

- http//www.sbc.su.se/arne/kurser/swell/pairwise_a

lignments.html - An Introduction to Bioinformatics Algorithms

(Computational Molecular Biology) Neil C. Jones,

Pavel Pevzner - Computational Molecular Biology An Algorithmic

Approach, Pavel Pevzner - Introduction to Computational Biology Maps,

sequences, and genomes, Michael Waterman - Algorithms on Strings, Trees, and Sequences

Computer Science and Computational Biology, Dan

Gusfield

Classic Papers

- Needleman, S.B. and Wunsch, C.D. A General Method

Applicable to the Search for Similarities in

Amino Acid Sequence of Two Proteins. J. Mol.

Biol., 48, pp. 443-453, 1970. (http//www.cs.umd.e

du/class/spring2003/cmsc838t/papers/needlemanandwu

nsch1970.pdf) - Smith, T.F. and Waterman, M.S. Identification of

Common Molecular Subsequences. J. Mol. Biol.,

147, pp. 195-197, 1981.(http//www.cmb.usc.edu/pap

ers/msw_papers/msw-042.pdf)

Global Alignment Method

Output An alignment of two sequences is

represented by three lines The first line shows

the first sequence The third line shows the

second sequence. The second line has a row of

symbols. The symbol is a vertical bar wherever

characters in the two sequences match, and a

space wherever they do not. Dots (or dashes) may

be inserted in either sequence to represent gaps.

Global Alignment Method (cont. 1)

For example, the two hypothetical sequences

abcdefghajklm abbdhijk could be aligned like

this abcdefghajklm

abbd...hijk As shown, there are 6 matches, 2

mismatches, and one gap of length 3.

Global Alignment Method (cont. 2)

The alignment can be scored according to a payoff

matrix for fixed scoring. You can also use PAM

or BLOSUM. payoff match gt match,

mismatch gt mismatch,

gap_open gt gap_open,

gap_extend gt gap_extend For correct

operation, an algorithm is created such that the

match must be positive and the other payoff

entities must be negative.

Global Alignment Method (cont. 3)

Example Given the payoff matrix payoff

match gt 4, mismatch gt

-3, gap_open gt -2,

gap_extend gt -1 What is the alignment and

what is the alignment score for the Following

two sequences? Sequence 1 abcdefghajklm Sequence

2 abbdhijk

Global Alignment Method (cont. 4)

The sequences abcdefghajklm abbdhijk are

aligned and scored like this a b

c d e f g h a j k l m

a b b d . . . h i j k

match 4 4 4 4 4 4

mismatch -3 -3 gap_open

-2 gap_extend -1-1-1 for a total

score of 24-6-2-3 13.

Global Alignment Method (cont. 5)

The algorithm should guarantee that no

other alignment of these two sequences has

a higher score under this payoff matrix.

Three steps in Dynamic Programming

1. Initialization 2. Matrix fill or scoring 3.

Traceback and alignment

Ends-Free Global Alignment with Fixed Scoring

Two sequences will be aligned. GGATCGA (sequence

1 V) GAATTCAGTTA (sequence 2 W) A simple

fixed scoring scheme will be used ?(vi, wj) 1

if the residue at position i of sequence 1 (vi)

is the same as the residue at position j of the

sequence 2 (wj) called match score ?(vi,

wj) 0 if vi ? wj called mismatch score ?(vi,

-) ?(-, wj) 0 called gap penalty

Matrix fill step Each position Si,j is defined

to be the MAXIMUM score at position i,j Si,j

MAXIMUM Si-1, j-1 ?(vi, wj) (match or

mismatch in the diagonal) Si, j-1 w (gap in

sequence 1 V) Si-1, j w (gap in sequence 2

W)

column

row

Initialization step 1) Create Matrix with m 1

columns and n 1 rows, where n number of

letters in sequence 1 and m number of letters

in sequence 2. 2) First column and first row

will be filled with 0s (for ends-free

alignment).

Scoring Step Fill in row by row

Traceback Step

Seq1 G G A - T C - G - - A

Seq2 G A A T T C A G T T A

LCS Problem (review)

- Similarity score
- si-1,j
- si,j max si,j-1
- si-1,j-1 1, if vi wj

Extend LCS to Global Alignment

- si-1,j ?(vi, -)
- si,j max si,j-1 ?(-, wj)
- si-1,j-1 ?(vi, wj)
- ?(vi, -) ?(-, wj) -? fixed gap penalty
- ?(vi, wj) score for match or mismatch can be

fixed, from PAM or BLOSUM

Global Alignment Alternatives

- Ends-free alignment dont penalize gaps at the

beginning or end - Initialize first row and column of S to 0
- Search last row and column for maximum score
- Regular global alignment score end to end

(penalize gaps at beginning and end) - Initialize first row and column of S with gap

penalty - Alignment score is in the lower right corner of S

Historical Perspective Needleman-Wunsch (1 of 3)

Match 1 Mismatch 0 Gap 0

Historical Perspective Needleman-Wunsch (2 of 3)

Historical Perspective Needleman-Wunsch (3 of 3)

From page 446 It is apparent that the above

array operation can begin at any of a number of

points along the borders of the array, which is

equivalent to a comparison of N-terminal residues

or C-terminal residues only. As long as the

appropriate rules for pathways are followed, the

maximum match will be the same. The cells of the

array which contributed to the maximum match, may

be determined by recording the origin of the

number that was added to each cell when the array

was operated upon.

Smith-Waterman Algorithm Advances in Applied

Mathematics, 2482-489 (1981)

- The Smith-Waterman algorithm is a local alignment

tool used to obtain sensitive pairwise similarity

alignments. - Smith-Waterman algorithm uses dynamic

programming. - It selects the optimal path as the highest ranked

alignment. - Smith-Waterman algorithm is useful for finding

local areas of similarity between sequences that

are too dissimilar for global alignment. - The S-W algorithm uses a lot of computer memory.
- BLAST and FASTA are other search algorithms that

use some aspects of S-W.

Smith-Waterman (cont. 1)

a. It searches for sequence matches. b. Assigns a

score to each pair of amino acids -uses

similarity scores -uses positive scores for

related residues -uses negative scores for

substitutions and gaps c. Initializes edges of

the matrix with zeros d. As the scores are summed

in the matrix, any sum below 0 is recorded as

a zero. e. Begins backtracing at the maximum

value found anywhere in the matrix. f.

Continues the backtrace until the score falls to

0.

Smith-Waterman (cont. 2)

H E A G A W G H E E

Put zeros on borders. Assign initial scores based

on a scoring matrix. Calculate new scores based

on adjacent cell scores. If sum is less than zero

or equal to zero begin new scoring with next

cell.

P A W H E A E

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 5 0 5 0 0 0 0 0 0 0 0 0 3 0 2012 4 0

0 0 10 2 0 0 1 12182214 6 0 2 16 8 0 0 4101828

20 0 0 82113 5 0 41020 27 0 0 6131912 4 0 416

26

This example uses the BLOSUM45 Scoring Matrix

with a gap penalty of -8.

Smith-Waterman (cont. 3)

H E A G A W G H E E

P A W H E A E

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

0 0 0 0 5 0 5 0 0 0 0 0 0 0 0 0 3 0 2012 4 0

0 0 10 2 0 0 0 12182214 6 0 2 16 8 0 0 4101828

20 0 0 82113 5 0 41020 27 0 0 6131812 4 0 416

26

Begin backtrace at the maximum value

found anywhere on the matrix. Continue the

backtrace until score falls to zero

AWGHE AW-HE

Path Score28

Extend LCS to Local Alignment

- 0 (no negative scores)
- si-1,j ?(vi, -)
- si,j max si,j-1 ?(-, wj)
- si-1,j-1 ?(vi, wj)
- ?(vi, -) ?(-, wj) -? fixed gap penalty
- ?(vi, wj) score for match or mismatch can be

fixed, from PAM or BLOSUM

Historical Perspective Smith-Waterman (1 of 3)

Algorithm The two molecular sequences will be

Aa1a2 . . . an, and Bb1b2 . . . bm. A

similarity s(a,b) is given between sequence

elements a and b. Deletions of length k are given

weight Wk. To find pairs of segments with high

degrees of similarity, we set up a matrix H .

First set Hk0 Hol 0 for 0 lt k lt n and 0 lt

l lt m. Preliminary values of H have the

interpretation that H i j is the maximum

similarity of two segments ending in ai and bj.

respectively. These values are obtained from the

relationship HijmaxHi-1,j-1 s(ai,bj), max

Hi-k,j Wk, maxHi,j-l - Wl , 0 ( 1 )

k

gt 1 l gt 1 1 lt i lt n and 1 lt j

lt m.

Historical Perspective Smith-Waterman (2 of 3)

- The formula for Hij follows by considering the

possibilities for ending the segments at any ai

and bj. - If ai and bj are associated, the similarity is
- Hi-l,j-l s(ai,bj).
- (2) If ai is at the end of a deletion of length

k, the similarity is - Hi k, j - Wk .
- (3) If bj is at the end of a deletion of length

1, the similarity is - Hi,j-l - Wl. (typo in paper)
- (4) Finally, a zero is included to prevent

calculated negative similarity, indicating no

similarity up to ai and bj.

Historical Perspective Smith-Waterman (3 of 3)

The pair of segments with maximum similarity is

found by first locating the maximum element of H.

The other matrix elements leading to this maximum

value are than sequentially determined with a

traceback procedure ending with an element of H

equal to zero. This procedure identifies the

segments as well as produces the corresponding

alignment. The pair of segments with the next

best similarity is found by applying the

traceback procedure to the second largest element

of H not associated with the first traceback.

Global Alignment output file

Global HBA_HUMAN vs HBB_HUMAN Score

290.50 HBA_HUMAN 1

VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44

HBB_HUMAN 1

VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE

43 HBA_HUMAN 45 HF.DLS.....HGSAQVKGHG

KKVADALTNAVAHVDDMPNALSAL 83

HBB_HUMAN 44 SFGDLSTPDAVMGNPKVKAHGKK

VLGAFSDGLAHLDNLKGTFATL 88 HBA_HUMAN 84

SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF

128

HBB_HUMAN 89

SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV

133 HBA_HUMAN 129 LASVSTVLTSKYR

141

HBB_HUMAN 134

VAGVANALAHKYH

146 id 45.32 similarity 63.31

(88/139 100) Overall id 43.15 Overall

similarity 60.27 (88/146 100)

Calculation of similarity score and percent

similarity

A W G H E A W - H E

Blosum45 SCORES

5 15 -8 10 6

GAP PENALTY (novel)

SIMILARITY NUMBER OF POS. SCORES DIVIDED BY

NUMBER OF AAs IN REGION x 100

Similarity Score 28

SIMILARITY 4/5 x 100 80

The E value (false positive expectation value)

- The Expect value (E) is a parameter that

describes the number of hits one can "expect"

to see just by chance when searching a database

of a particular size. It decreases exponentially

as the Similarity Score (S) increases (inverse

relationship). The higher the Similarity Score,

the lower the E value. Essentially, the E value

describes the random background noise that exists

for matches between two sequences. The E value

is used as a convenient way to create a

significance threshold for reporting results.

When the E value is increased from the default

value of 10 prior to a sequence search, a larger

list with more low-similarity scoring hits can be

reported. An E value of 1 assigned to a hit can

be interpreted as meaning that in a database of

the current size you might expect to see 1 match

with a similar score simply by chance.

E value (Karlin-Altschul statistics)

- E Kmne-?S
- Where K is constant, m is the length of the query

sequence, n is the length of the database

sequence, ? is the decay constant, S is the

similarity score. - If S increases, E decreases exponentially.
- If the decay constant increases, E decreases

exponentially - If mn increases the search space increases and

there is a greater chance for a random hit, E

increases. Larger database will increase E.

However, larger query sequence often decreases E.

Why???

Project Teams and Presentation Assignments

- Base Project (Global Alignment)
- Extension 1 (Ends-Free Global Alignment)
- Extension 2 (Local Alignment)
- Extension 3 (Local Alignment all)
- Extension 4 (Database)
- Extension 5 (Space Efficient Algorithm)
- Extension 6 (Affine Gap Penalty)
- Extension 7 (Hirschbergs Algorithm)