Lecture 5 Pairwise sequence alignment algorithms - PowerPoint PPT Presentation

1 / 113
About This Presentation
Title:

Lecture 5 Pairwise sequence alignment algorithms

Description:

Local - Smith Waterman. Heuristics for the real world. BLAST. Fundamentals. When we compare sequences, we are looking for evidence that they have diverged ... – PowerPoint PPT presentation

Number of Views:173
Avg rating:3.0/5.0
Slides: 114
Provided by: karch7
Category:

less

Transcript and Presenter's Notes

Title: Lecture 5 Pairwise sequence alignment algorithms


1
Lecture 5Pairwise sequence alignment algorithms
  • Rachel Karchin
  • BME 580.688/580.488 and CS 600.488 Spring 2009

2
Outline
  • Fundamentals and basics
  • Exact alignment algorithms
  • Global Needleman-Wunsch
  • Local - Smith Waterman
  • Heuristics for the real world
  • BLAST

3
Fundamentals
  • When we compare sequences, we are looking for
    evidence that they have diverged from a common
    ancestor by a process of mutation and selection.

Durbin et al. Biological Sequence Analysis
4
Fundamentals
  • Assume a simplified model of evolution
  • Substitutions
  • Deletions
  • Insertions

Durbin et al. Biological Sequence Analysis
5
Pairwise alignment basics
GACCC GACAA
Two sequences
What is the best way to match up their shared
equivalent positions?
What is a shared equivalent position?
6
Pairwise alignment basics
GACCC GACAA
Two sequences
What is the best way to match up their shared
equivalent positions?
GACCC GACAA
GACCC-- --GACAA
GACC-C GACAA-
G-ACCC GACAA-
GACCC- GA-CAA
GACCC- GAC-AA
7
Pairwise alignment basics
Tabular representation
GACCC GACAA
Two sequences
8
Pairwise alignment basics
GACCC GACAA
Tabular representation
Two sequences
9
Pairwise alignment basics
Tabular representation
GACCC- G-ACAA
10
Pairwise alignment basics
Tabular representation
GACCC- G-ACAA
11
Pairwise alignment
  • Which of all of these alignments is the best one?

12
Pairwise alignment
  • Requires a scoring system that gives points or
    penalties for matches and gaps

13
Pairwise alignment
  • Also requires efficient search strategy to select
    the best alignment or best k alignments out of
    very large number of possible alignments

14
Scoring system
  • Match points
  • Indel costs

What do we want out of such a scoring system?
15
Scoring matrices
  • PAM matrices invented by Margaret Dayhoff in
    1978.
  • Based on mutations in 70 protein families
  • Reconstructed the ancestral sequence of each
    family

16
Scoring matrices
  • Count number of changes for each amino acid and
    divide by a normalization factor of exposure to
    mutation
  • Assumes that the observed substitution is the
    result of a single mutation event.

17
Scoring matrices
  • PAM1 gives the probability of a 1 change in
    sequence
  • One substitution per 100 amino acids
  • PAM100, PAM250 etc. are linear extrapolations of
    PAM1
  • PAM250 is probability of 250 change in sequence

Tutorial to do it yourself http//www.biorecipes
.com/Dayhoff/code.html
18
Scoring matrices
Do these look like probabilities?
19
Scoring matrices
Do these look like probabilities?
20
Scoring matrices
  • Henikoff and Henikoff (1992)
  • Based on amino acid substitutions in conserved
    amino acid patterns from the BLOCKS database
  • BLOSUMn from blocks which are n identical

21
Scoring matrices
  • BLOCKS found with motif software which finds
    patterns of amino acids at arbitrary intervals
  • Protein families defined by presence of a common
    motif
  • Motifs strung together into ungapped blocks of
    sequence

22
BLOCKS
http//blocks.fhcrc.org/
23
Nucleotide substitution matrices
  • How might you design one?

24
Gap penalties
  • Simplest approach is a constant gap penalty
  • Cost for opening a gap d
  • Every gap gets same penalty, regardless of length

25
Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d -10
26
Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d -5
What has changed?
27
Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d 0
What has changed?
28
Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap constant d 0
What has changed? How might we assess which
value of d is best?
29
Gap penalties
  • Linear gap penalty
  • Cost for opening a gap d
  • Cost for gap extenstion d
  • Each gap of one base (or residue) gets penalized
    the same, regardless of whether or not it is
    contiguous with other gaps
  • Alignment with fewer gaps favored
  • Penalty for large gap is same as many small gaps

30
Gap penalties
  • Linear gap penalty
  • Cost for opening a gap d
  • Cost for gap extenstion d
  • Each gap of one base (or residue) gets penalized
    the same, regardless of whether or not it is
    contiguous with other gaps
  • Alignment with fewer gaps favored
  • Penalty for large gap is same as many small gaps

What do you think about this from a biological
point of view?
31
Gap penalties
Scoring system
PIK3CA COW PIK3CB HUMAN
First 100 aa of both
Substitution BLOSUM62 Gap linear d -10
What has changed?
32
Gap penalties
  • Affine gap penalty
  • Biologically, more likely to see a single gap of
    10 than 10 single gaps
  • We can model this using a gap opening penalty o
    and a gap extension penalty e
  • Gap of length l gets penalty o (l-1) e

33
Gap penalties
  • Affine gap penalty
  • Biologically, more likely to see a single gap of
    10 than 10 single gaps
  • We can model this using a gap opening penalty o
    and a gap extension penalty e
  • Gap of length l gets penalty o (l-1) e

Should e be more or less negative than o ? Why?
34
Gap penalties
  • Affine gap penalty
  • Biologically, more likely to see a single gap of
    10 than 10 single gaps
  • We can model this using a gap opening penalty o
    and a gap extension penalty e
  • Gap of length l gets penalty o (l-1) e

Should e be more or less negative than o
? Why? Less. Because a few large gaps are
morerealistic than manys mall gaps. Want to
encourage gap extension.
35
Gap penalties
  • Build your intuition about parameters for
    pairwise sequence alignment.
  • Gap penalty functions often written as ?(.)

http//fasta.bioch.virginia.edu/noptalign/
36
Dynamic programming for sequence alignment
  • Scoring system
  • Search strategy

37
Dynamic programming for sequence alignment
  • Key insight is that we can align prefixes and
    then extend them one position at a time
  • Efficient search through the huge alignment space
    is done by caching intermediate results

38
Dynamic programming for sequence alignment
  • General idea

Matrix Initialization
Two sequences
GACCC ACAA
39
Dynamic programming for sequence alignment
  • General idea

Matrix Initialization
Why?
Two sequences
GACCC ACAA
40
Dynamic programming for sequence alignment
  • General idea

Aligning each nucleotidewith the empty string
isequivalent to putting a gap atthe beginning
of the alignment. What kind of gap penalty is
being used here?
Matrix Initialization
Why?
Two sequences
GACCC ACAA
41
Dynamic programming for sequence alignment
  • General idea

What is this?
Matrix Initialization
Two sequences
GACCC ACAA
42
Dynamic programming for sequence alignment
  • General idea

What is this?
Score for aligning two empty sequences
Matrix Initialization
Two sequences
GACCC ACAA
43
Dynamic programming for sequence alignment
  • General idea

Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
44
To fill in each M(i,j), need to look where?
  • General idea

Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
45
M(i-1,j)
M(i-1,j-1)
To fill in each M(i,j), need to look where?
M(i,j)
M(i,j-1)
  • General idea

Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
46
Dynamic programming for sequence alignment
  • General idea

Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
What else do we need here?
47
Dynamic programming for sequence alignment
constant
48
Dynamic programming for sequence alignment
Whats this?
constant
49
Dynamic programming for sequence alignment
Whats this?
constant
50
Dynamic programming for sequence alignment
Whats this?
constant
51
Dynamic programming for sequence alignment
constant
52
Dynamic programming for sequence alignment
constant
53
Dynamic programming for sequence alignment
constant
54
Dynamic programming for sequence alignment
constant
55
Dynamic programming for sequence alignment
constant
56
Dynamic programming for sequence alignment
constant
57
Dynamic programming for sequence alignment
constant
58
Dynamic programming for sequence alignment
constant
59
Dynamic programming for sequence alignment
Matrix Fill
Recurrence relation
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
Scoring system
S(i,j)
?(.) -2
constant
60
Dynamic programming for sequence alignment
constant
61
Dynamic programming for sequence alignment
Only one winner is shown for each
maxoperation, but there are often ties, with
thewinner selected at random. Thus, there are
many optimal alignmentsand even more that are
slightly less than optimal (sub-optimal
alignments)
M(i,j) max( M(i-1,j-1) S(i,j),
M(i,j-1) ?(gap in seq 1),
M(i-1,j) ?(gap in seq 2) )
Scoring system
S(i,j)
?(.) -2
constant
62
Dynamic programming for sequence alignment
Time and space complexity?
constant
63
Dynamic programming for sequence alignment
Traceback
Where do we start?
64
Dynamic programming for sequence alignment
Traceback
Are we doing a globalor local alignment here?
65
Dynamic programming for sequence alignment
Traceback
Are we doing a globalor local alignment here?
Bottom right cell givesus the optimal global
alignment score
66
Dynamic programming for sequence alignment
Traceback
Want to identify the alignmentwith over the
length of bothsequences Backtrack over the
winningpath How?
67
Dynamic programming for sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpath How? Have to identify the
directioninto which each cell was reached.
68
Dynamic programming for sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpath How? Have to identify the
directioninto which each cell was
reached. Forward pointers tell us
69
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
70
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
71
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path What does our alignment look like
at this point?
72
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path What does our alignment look like
at this point?
CCC CAA
73
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
ACCC ACAA
74
Dynamic programming for sequence alignment
Traceback
Follow the path backwardsIdentify the (a)
winning path
GACCC -ACAA
75
Global pairwise sequence alignment
  • Global pairwise sequence alignment by dynamic
    programming introduced by Needleman and Wunsch in
  • Known as the Needleman-Wunsch algorithm.

76
  • In what situations might we want a global
    sequence alignment?
  • In what situations might this not be desirable?

77
Local pairwise sequence alignment
  • General idea

Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
78
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Zero win gt start ofa new alignment
(reset) Alignment can begin andend anywhere in
the matrix
79
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
80
Local pairwise sequence alignment
Do we set a pointer from the parent to this cell?
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
81
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
82
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
83
Local pairwise sequence alignment
Do we set a pointer from the parent to this cell?
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
84
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
85
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
86
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
87
Local pairwise sequence alignment
Recurrence relation
M(i,j) max( 0,
M(i-1,j-1) S(i,j), M(i,j-1) ?(gap
in seq 1), M(i-1,j) ?(gap in seq 2)
)
Scoring system
S(i,j)
?(.) -2
constant
88
Local pairwise sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpaths How? Where do we start?
89
Local pairwise sequence alignment
Traceback
Want to recover the (an) optimal alignment over
the length of both sequences Backtrack over the
winningpaths How? Where do we start?
Largest value in the matrixis score of the best
localalignment between subsequences
90
Local pairwise sequence alignment
Traceback
91
Local pairwise sequence alignment
Traceback
92
Local pairwise sequence alignment
Traceback
93
Local pairwise sequence alignment
Traceback
What does this alignmentlook like?
94
Local pairwise sequence alignment
Traceback
What does this alignmentlook like?
CCC CCA
95
Local pairwise sequence alignment
Traceback
For this to work, what is requirement for
expected value of a randommatch? Why?
96
Local pairwise sequence alignment
  • Smith-Waterman (1981)
  • Gotoh (affine gap scores) (1982)

97
Motivation for heuristic alignments
  • For two sequences of length m and n
  • Time to get optimal alignment score is T(mn)
  • T notation like O notation but provides upper and
    lower bounds on asymptotic behavior (O is only
    for upper)
  • Space required to recover the alignment (naively
    T(mn) but can reduce to T(mn)
  • Not realistic for large scale problems
  • Sequence (or sequence database size) in the
    billions

98
Big picture strategy for heuristic alignments
  • Do fast preprocessing to identify regions where
    more expensive algorithms can be used

Filter candidates byseeking a high
scoringalignment near each one
Generate candidate patterns that
indicatepresence of a high scoring alignment
Repeat withprogressivelymoresensitivealgoirthm
s
Courtesy of WU lecture 1
99
Big picture strategy for heuristic alignments
  • Key issues to address
  • What are good candidate patterns?
  • How do we generate and find them efficiently?
  • How do we eliminate redundant candidates?
  • How do we search for alignments near candidates?
  • How do we score (and filter) alignments?

Courtesy of WU lectures on alignment problems
100
BLAST
Biologically real alignments very likelyto
contain a short stretch of identitiesor very
high scoring matches.
Parallel computing for bioinformatics and
computational biology, Zomaya
101
BLAST
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
102
BLAST
Q
D
ACTGA
GACTGC
Word List
ACTCTGTGA
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
103
BLAST
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
104
BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
What are they?
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
105
BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
ACTCTG
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
106
BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
ACTCTG
What kind of data structures might you use to
store the words? The word/seed associations?
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
107
BLAST
Q
D
ACTGA
GACTGC
Word List
Seeds?
ACTCTGTGA
ACT CTG
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
108
BLAST
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
109
BLAST
HSP(high-scoring pair)
GACTGC ACT CTG
GACTGC ACTG
D
GACTGC
Seeds?
ACT CTG
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
110
BLAST
  • Combine consistent HSPs
  • They cant overlap
  • They must be in same order in Q and D sequences
  • These consistent local aligments are assessed
    for statistical significance

Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
111
BLAST
  • Parameterization includes choice of W (word
    length) and T (threshold for scores in the
    initial word list)

How will make W and T longer or shorter impact
the finallist of statistically significant local
alignments produced by BLAST?What about its
speed?
Parallel computing for bioinformatics and
computational biology, Wu and Tseng,Editor
Zomaya
112
What you should know
  • What alignments mean in terms of a model of
    sequence evolution
  • Start to develop intuition about dynamic
    programming and how to use it
  • Basic ideas behind heuristic alignment approach
    such as BLAST

113
Credits
  • Materials for this lecture were taken from many
    sources including the following web pages, which
    you may want to explore on your own

http//www.life.umd.edu/labs/delwiche/bsci348s/lec
/PAMmatrices.html
http//web.cecs.pdx.edu/ps/CapStone03/dynvis/Simi
larityApplet.html
http//fasta.bioch.virginia.edu/noptalign/
Write a Comment
User Comments (0)
About PowerShow.com