# Algorithms for Pairwise Sequence Alignment - PowerPoint PPT Presentation

1 / 26
Title:

## Algorithms for Pairwise Sequence Alignment

Description:

### Title: Algorithms for Pairwise Sequence Alignment Author: hygiea Last modified by: hygiea Created Date: 9/12/2002 8:24:52 PM Document presentation format – PowerPoint PPT presentation

Number of Views:188
Avg rating:3.0/5.0
Slides: 27
Provided by: hyg7
Category:
Tags:
Transcript and Presenter's Notes

Title: Algorithms for Pairwise Sequence Alignment

1
Algorithms for Pairwise Sequence Alignment
• Craig A. Struble, Ph.D.
• Marquette University

2
Overview
• Pairwise Sequence Alignment
• Dynamic Programming Solution
• Global Alignment
• Local Alignment
• BLAST and FASTA

3
Pairwise Sequence Alignment
• As weve seen, sequence similarity is an
indicator of homology
• There are other uses for sequence similarity
• Database queries
• Comparative genomics

4
Pairwise Sequence Alignment
• Example
• Which one is better?

HEAGAWGHEE
PAWHEAE
HEAGAWGHE-E
HEAGAWGHE-E
P-A--W-HEAE
--P-AW-HEAE
5
Scoring
• To compare two sequence alignments, calculate a
score
• PAM or BLOSUM matrices
• Matches and mismatches
• Gap penalty
• Initiating a gap
• Gap extension penalty
• Extending a gap

6
Example
• Gap penalty -8
• Gap extension -8

A E G H W
A 5 -1 0 -2 -3
E -1 6 -3 0 -3
H -2 0 -2 10 -3
P -1 -1 -2 -2 -4
W -3 -3 -3 -3 15
HEAGAWGHE-E
--P-AW-HEAE
(-8) (-8) (-1) 5 15 (-8) 10 6
(-8) 6 9
HEAGAWGHE-E
Exercise Calculate for
P-A--W-HEAE
7
Formal Description
• Problem PairSeqAlign
• Input Two sequences x,y
• Scoring matrix s
• Gap penalty d
• Gap extension penalty e
• Output The optimal sequence alignment

8
How Difficult Is This?
• Consider two sequences of length n
• There are
• possible global alignments, and we need to find
an optimal one from amongst those!

9
So what?
• So at n 20, we have over 120 billion possible
alignments
• We want to be able to align much, much longer
sequences
• Some proteins have 1000 amino acids
• Genes can have several thousand base pairs

10
Dynamic Programming
• General algorithmic development technique
• Reuses the results of previous computations
• Store intermediate results in a table for reuse
• Look up in table for earlier result to build from

11
Global Alignment
• Needleman-Wunsch 1970
• Idea Build up optimal alignment from optimal
alignments of subsequences

HEAG --P- -25
HEAGA --P-A -20
HEAG- --P-A -33
HEAGA --P -33
Gap with bottom
Top and bottom
Gap with top
12
Global Alignment
• Notation
• xi ith letter of string x
• yj jth letter of string y
• x1..i Prefix of x from letters 1 through I
• F matrix of optimal scores
• F(i,j) represents optimal score lining up x1..i
with y1..j
• d gap penalty
• s scoring matrix

13
Global Alignment
• The work is to build up F
• Initialize F(0,0) 0, F(i,0) id, F(0,j)jd
• Fill from top left to bottom right using the
recursive relation

14
Global Alignment
yj aligned to gap
F(i-1,j-1) F(i,j-1)
F(i-1,j) F(i,j)
s(xi,yj)
d
d
xi aligned to gap
While building the table, keep track of where
optimal score came from, reverse arrows
15
Example
H E A G A W G H E E
0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16
W -24
H -32
E -40
A -48
E -56
16
Completed Table
H E A G A W G H E E
0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16 -10 -3 -4 -12 -20 -28 -36 -44 -52 -60
W -24 -18 -11 -6 -7 -15 -5 -13 -21 -29 -37
H -32 -14 -18 -13 -8 -9 -13 -7 -3 -11 -19
E -40 -22 -8 -16 -16 -9 -12 -15 -7 3 -5
A -48 -30 -16 -3 -11 -11 -12 -12 -15 -5 2
E -56 -38 -24 -11 -6 -12 -14 -15 -12 -9 1
17
Traceback
• Trace arrows back from the lower right to top
left
• Diagonal both
• Up upper gap
• Left lower gap

H E A G A W G H E E
0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80
P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73
A -16 -10 -3 -4 -12 -20 -28 -36 -44 -52 -60
W -24 -18 -11 -6 -7 -15 -5 -13 -21 -29 -37
H -32 -14 -18 -13 -8 -9 -13 -7 -3 -11 -19
E -40 -22 -8 -16 -16 -9 -12 -15 -7 3 -5
A -48 -30 -16 -3 -11 -11 -12 -12 -15 -5 2
E -56 -38 -24 -11 -6 -12 -14 -15 -12 -9 1
HEAGAWGHE-E --P-AW-HEAE
18
Summary
• Uses recursion to fill in intermediate results
table
• Uses O(nm) space and time
• O(n2) algorithm
• Feasible for moderate sized sequences, but not
for aligning whole genomes.

19
Local Alignment
• Smith-Waterman (1981)
• Another dynamic programming solution

20
Example
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 5 0 5 0 0 0 0 0
W 0 0 0 0 2 0 20 12 4 0 0
H 0 10 2 0 0 0 12 18 22 14 6
E 0 2 16 8 0 0 4 10 18 28 20
A 0 0 8 21 13 5 0 4 10 20 27
E 0 0 6 13 18 12 4 0 4 16 26
21
Traceback
Start at highest score and traceback to first 0
H E A G A W G H E E
0 0 0 0 0 0 0 0 0 0 0
P 0 0 0 0 0 0 0 0 0 0 0
A 0 0 0 5 0 5 0 0 0 0 0
W 0 0 0 0 2 0 20 12 4 0 0
H 0 10 2 0 0 0 12 18 22 14 6
E 0 2 16 8 0 0 4 10 18 28 20
A 0 0 8 21 13 5 0 4 10 20 27
E 0 0 6 13 18 12 4 0 4 16 26
AWGHE AW-HE
22
Summary
• Similar to global alignment algorithm
• For this to work, expected match with random
sequence must have negative score.
• Behavior is like global alignment otherwise
• Similar extensions for repeated and overlap
matching
• Care must be given to gap penalties to maintain
O(nm) time complexity

23
Repeat and Overlap Matches
• Repeat matches allow for sections of a sequence
to match repeatedly
• Repeated domain or motif
• Overlap matches
• Matching when the two sequences overlap
• Does not penalize overhanging ends

x
x
y
y
24
BLAST
• O(n2) algorithms are too slow for large scale
searches
• BLAST developed by Altschul et al (1990)
• Uses probabilistic approach to searching
• Idea True alignments will have a short stretch
of identities (perfect match)

25
BLAST Overview
• Make a list of neighborhood words
• Length 3 for proteins, 11 for nucleic acids
• Match query with score higher than some threshold
• Usually 2 bits per residue
• Scans database for words
• When a hit is obtained, extends the match in both
direction as ungapped alignment

26
FASTA
• Pearson Lipman (1988)
• Find all matching words of length ktup
• 1 or 2 for proteins, 4 or 6 for DNA
• Look for diagonals supporting word matches
• Extend with ungapped alignment
• Join ungapped regions with gaps