Introduction to bioinformatics, Lecture 5: Pair-wise sequence alignment

Slides: 41
Provided by: heri4
1
Introduction to bioinformatics
Lecture 5: Pair-wise sequence alignment
2
Bioinformatics
  • "Nothing in Biology makes sense except in the
    light of evolution" (Theodosius Dobzhansky,
    1900-1975)
  • "Nothing in bioinformatics makes sense except in
    the light of Biology"

3
Example today: pairwise sequence alignment needs a
sense of evolution
Global dynamic programming
Sequence 1: MDAGSTVILCFVG
Sequence 2 (related by evolution): MDAASTILCGS
Inputs to the search matrix: amino acid exchange matrix, gap penalties (open, extension)
Resulting alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS
4
Evolution
  • Ancestral sequence: ABCD
  • Descendants: ACCD (B → C, mutation) and
    ABD (C → ø, deletion)
  • Pairwise alignment, two candidates:
    ACCD      ACCD
    AB-D  or  A-BD
5
Evolution
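  • Ancestral sequence: ABCD
  • Descendants: ACCD (B → C, mutation) and
    ABD (C → ø, deletion)
  • Pairwise alignment, two candidates:
    ACCD      ACCD
    AB-D  or  A-BD
  • True alignment: ACCD over AB-D, because the mutated
    B sits in column 2 and the deleted C in column 3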
6
A protein sequence alignment:
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS

A DNA sequence alignment:
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----

7
Searching for similarities
What is the function of the new gene?
The "lazy" investigation (i.e., no biological
experiments, just bioinformatics techniques):
  • Find a set of protein sequences similar to the
    unknown sequence
  • Identify similarities and differences
  • For long proteins: identify domains
8
  • Evolutionary and functional relationships
  • Reconstruct evolutionary relation
  • Based on sequence:
    - Identity (simplest method)
    - Similarity
  • Homology (common ancestry: the ultimate goal)
  • Other (e.g., 3D structure)
  • Functional relation
  • Sequence → Structure → Function

9
Searching for similarities
Common ancestry is more interesting: it makes it
more likely that genes share the same function.
Homology means sharing a common ancestor. It is a
binary property (yes/no) and a nice tool: when
(an unknown) gene X is homologous to (a known)
gene G, we gain a lot of information on X, because
what we know about G can be transferred to X as a
good suggestion.
10
How to go from DNA to protein sequence
A piece of double-stranded DNA:
5' attcgttggcaaatcgcccctatccggc 3'
3' taagcaaccgtttagcggggataggccg 5'
DNA direction is from 5' to 3'
11
How to go from DNA to protein sequence
6-frame translation using the codon table (last
lecture):
5' attcgttggcaaatcgcccctatccggc 3'
3' taagcaaccgtttagcggggataggccg 5'
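A minimal Python sketch of 6-frame translation, assuming the standard genetic code: three reading frames on the given strand plus three on its reverse complement. The function names are illustrative and not from the lecture; stop codons are shown as `*`, and trailing bases that do not fill a codon are dropped.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Frames 1-3 of the forward strand, then frames 1-3 of the reverse strand."""
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


frames = six_frames("attcgttggcaaatcgcccctatccggc")
print(frames[0])  # first forward frame: IRWQIAPIR
```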
12
Evolution and three-dimensional protein structure
information
Isocitrate dehydrogenase: the distance from the
active site (in yellow) determines the rate of
evolution (red = fast evolution, blue = slow
evolution)
Dean, A. M. and G. B. Golding, Pacific Symposium
on Biocomputing 2000
13
How to determine similarity
Frequent evolutionary events at the DNA level:
1. Substitution
2. Insertion, deletion
3. Duplication
4. Inversion
We will restrict ourselves to these events
15
Dynamic programming: Scoring alignments
  • Substitution (or match/mismatch) scores, for DNA
    and for proteins
  • Gap penalty as a function of the gap length k:
    - Linear: $gp(k) = ak$
    - Affine: $gp(k) = b + ak$
    - Concave, e.g. $gp(k) = \log(k)$
The score for an alignment is the sum of the
scores of all alignment columns
16
Dynamic programming: Scoring alignments
Alignment score $S_{a,b}$ with affine gap penalties:
$gp(k) = \mathrm{gapinit} + k \times \mathrm{gapextension}$
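The three gap-penalty shapes can be written as small functions of the gap length k. This is a sketch only: the default parameter values are placeholders chosen for illustration, and the concave form is the log(k) example given in the lecture.

```python
import math


def linear_gap(k, a=2):
    """Linear: every gapped position costs the same."""
    return a * k


def affine_gap(k, gap_init=10, gap_ext=2):
    """Affine: a one-off opening cost plus a per-position extension cost."""
    return gap_init + k * gap_ext


def concave_gap(k):
    """Concave example: each extra position costs less than the one before."""
    return math.log(k)


for k in (1, 2, 7):
    print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 2))
```

With an affine penalty, one gap of length 3 (10 + 3·2 = 16) is cheaper than three separate gaps of length 1 (3·12 = 36), which is why it favours fewer, longer gaps.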
17
DNA: define a score for match/mismatch of letters

Simple:
       A     C     G     T
A      1    -1    -1    -1
C     -1     1    -1    -1
G     -1    -1     1    -1
T     -1    -1    -1     1

Used in genome alignments:
       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91
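As a small illustration, the two matrices above can be applied column by column to an ungapped DNA alignment. The helper name `column_score` is hypothetical. Note that in the second matrix the transitions (A↔G, C↔T) score -31, far less harshly than the other mismatches.

```python
BASES = "ACGT"

# Values copied from the two matrices above (rows and columns in A, C, G, T order).
GENOME_ROWS = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]
GENOME = {(x, y): GENOME_ROWS[i][j]
          for i, x in enumerate(BASES) for j, y in enumerate(BASES)}
SIMPLE = {(x, y): (1 if x == y else -1) for x in BASES for y in BASES}


def column_score(a, b, matrix):
    """Sum the match/mismatch scores of an ungapped alignment of equal length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(matrix[(x, y)] for x, y in zip(a.upper(), b.upper()))


print(column_score("ACGT", "ACGT", SIMPLE))  # 4
print(column_score("ACGT", "GCGT", GENOME))  # -31 + 100 + 100 + 91 = 260
```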
18
Dynamic programming: Scoring alignments
T D W V T A L K
T D W L - - I K
Amino Acid Exchange Matrix (20×20)
Affine gap penalties (open, extension): 10, 1
Score = s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px
        + s(L,I) + s(K,K)
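The score formula can be checked with a short sketch. The exchange values are read from the PAM250 matrix in this lecture; the penalties Po = 10 and Px = 1 are an assumption taken from the numbers shown on the slide. The function name is hypothetical, and it treats any run of gapped columns as a single gap.

```python
# Only the PAM250 entries needed for this example.
PAM250 = {("T", "T"): 3, ("D", "D"): 4, ("W", "W"): 17,
          ("V", "L"): 2, ("L", "I"): 2, ("K", "K"): 5}


def score_alignment(top, bottom, sub, p_open, p_ext):
    """Sum exchange scores; charge p_open + k * p_ext for each gap run of length k."""
    score, gap_len = 0, 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_len += 1          # still inside a gap run
            continue
        if gap_len:               # a gap run just ended: charge it once
            score -= p_open + gap_len * p_ext
            gap_len = 0
        score += sub[(a, b)] if (a, b) in sub else sub[(b, a)]
    if gap_len:                   # gap run reaching the end of the alignment
        score -= p_open + gap_len * p_ext
    return score


print(score_alignment("TDWVTALK", "TDWL--IK", PAM250, p_open=10, p_ext=1))
# 3 + 4 + 17 + 2 - 10 - 2*1 + 2 + 5 = 21
```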
19
Amino acid exchange matrices (20×20)
How do we get one? And how do we get associated
gap penalties?
First systematic method to derive a.a. exchange
matrices: Margaret Dayhoff et al. (1978), Atlas of
Protein Sequence and Structure
20
PAM250 matrix: amino acid exchange matrix (log odds)

A   2
R  -2  6
N   0  0  2
D   0 -1  2  4
C  -2 -4 -4 -5 12
Q   0  1  1  2 -5  4
E   0 -1  1  3 -5  2  4
G   1 -3  0  1 -3 -1  0  5
H  -1  2  2  1 -3  3  1 -2  6
I  -1 -2 -2 -2 -2 -2 -2 -3 -2  5
L  -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6
K  -1  3  1  0 -5  1  0 -2  0 -2 -3  5
M  -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6
F  -4 -4 -4 -6 -4 -5 -5 -5 -2  1  2 -5  0  9
P   1  0 -1 -1 -3  0 -1 -1  0 -2 -3 -1 -2 -5  6
S   1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2
T   1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3
W  -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17
Y  -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10
V   0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4
B   0 -1  2  3 -4  1  2  0  1 -2 -3  1 -2 -5 -1  0  0 -5 -3 -2  2
Z   0  0  1  3 -5  3  3 -1  2 -2 -3  0 -2 -5  0  0 -1 -6 -4 -2  2  3
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z

Positive exchange values denote mutations that
are more likely than randomly expected, while
negative numbers correspond to avoided mutations
compared to the randomly expected situation
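For reference, a log-odds exchange value is commonly written as below. The symbols are notation assumed here, not taken from the slides: $q_{ab}$ is the frequency with which residues a and b are found aligned in related proteins, and $p_a$, $p_b$ are their background frequencies. PAM matrices are conventionally reported as 10 times the base-10 logarithm, rounded to an integer.

```latex
s(a,b) = 10 \log_{10} \frac{q_{ab}}{p_a \, p_b}
```

The value is positive when the pair is seen more often than chance ($q_{ab} > p_a p_b$), zero at chance level, and negative when the exchange is avoided.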
21
Amino acid exchange matrices
Amino acids are not equal:
1. Some are easily substituted because they have
   similar physico-chemical properties or structure
2. Some mutations between amino acids occur more
   often due to similar codons
These two observations give us ways to define
substitution matrices
22
Pair-wise alignment
T D W V T A L K
T D W L - - I K
Combinatorial explosion:
  - 1 gap in 1 sequence: n+1 possibilities
  - 2 gaps in 1 sequence: (n+1)n
  - 3 gaps in 1 sequence: (n+1)n(n-1), etc.
$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
2 sequences of 300 a.a.: about $10^{88}$ alignments
2 sequences of 1000 a.a.: about $10^{600}$ alignments!
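The growth of the central binomial coefficient can be checked directly, since Python integers have arbitrary precision. This is a sketch with hypothetical function names; it evaluates the formula as written above and compares it with the Stirling-based approximation.

```python
import math


def n_alignments(n):
    """Exact central binomial coefficient (2n choose n)."""
    return math.comb(2 * n, n)


def log10_approx(n):
    """log10 of the approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


print(n_alignments(2))                 # 6
print(len(str(n_alignments(1000))))    # number of decimal digits: 601
print(round(log10_approx(1000), 2))    # 600.31
```

For n = 1000 the exact count has 601 digits, in line with the order of magnitude quoted for two sequences of 1000 a.a.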
23
Technique to overcome the combinatorial
explosion: Dynamic Programming
  • Alignment is simulated as a Markov process; all
    sequence positions are seen as independent
  • Chances of sequence events are independent

24
Sequence alignment: History of the Dynamic Programming
algorithm
1970: Needleman-Wunsch, global pair-wise alignment
Needleman SB, Wunsch CD (1970) A general method
applicable to the search for similarities in the
amino acid sequence of two proteins. J. Mol. Biol.
48(3), 443-453.
1981: Smith-Waterman, local pair-wise alignment
Smith TF, Waterman MS (1981) Identification of
common molecular subsequences. J. Mol. Biol. 147,
195-197.
25
Pairwise sequence alignment: Global dynamic
programming
Sequence 1: MDAGSTVILCFVG
Sequence 2 (related by evolution): MDAASTILCGS
Inputs to the search matrix: amino acid exchange matrix, gap penalties (open, extension)
Resulting alignment:
MDAGSTVILCFVG-
MDAAST-ILC--GS
26
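Global dynamic programming
$$
S_{i,j} = s_{i,j} + \max \begin{cases}
\max_{0<x<i-1} \left( S_{x,j-1} - P_i - (i-x-1) P_x \right) \\
S_{i-1,j-1} \\
\max_{0<y<j-1} \left( S_{i-1,y} - P_i - (j-y-1) P_x \right)
\end{cases}
$$
Here $s_{i,j}$ is the exchange value for the residues at
positions i and j, $P_i$ the gap initiation penalty and
$P_x$ the gap extension penalty. Cell (i,j) looks back
along column j-1 and row i-1.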
27
Global dynamic programming
These values are copied from the PAM250 matrix
(see earlier slide), after being made
non-negative by adding 8 to each PAM250 matrix
cell (-8 is the lowest number in the PAM250
matrix)
Global score is 65 - 10 - 1·2 - 10 - 2·2
29
Global dynamic programming (Gapo = 10, Gape = 2)

Search matrix:
           D    W    V    T    A    L    K
      0  -12  -14  -16  -18  -20  -22  -24
T   -12    8   -9   -6   -5   -9  -11  -14  -34
D   -14    0    9    2    2    3   -5   -3  -21
W   -16  -13   25   11    5    4    9    0  -16
V   -18  -10   -4   37   21   19   19   15    1
L   -20  -14   -2   23   46   31   37   26   14
K   -22  -12   -9   17   33   53   39   50
         -34  -29   -1   17   39   27   50

Exchange values:
      D    W    V    T    A    L    K
T     8    3    8   11    9    9    8
D    12    1    6    8    8    4    8
W     1   25    2    3    2    6    5
V     6    2   12    8    8   10    6
L     4    6   10    9    6   14    5
K     8    5    6    8    7    5   13

The exchange values are copied from the PAM250 matrix
(see earlier slide), after being made
non-negative by adding 8 to each PAM250 matrix
cell (-8 is the lowest number in the PAM250
matrix)
The extra bottom row and rightmost column give
the final global alignment scores
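The forward step for this example can be sketched as follows. It assumes the exchange values as transcribed from the slide, a gap of length k costing 10 + 2k, and a look-back that includes the border row and column (index 0). The function and variable names are illustrative, not the lecturer's code.

```python
P_OPEN, P_EXT = 10, 2          # Gapo and Gape from the slide


def gap(k):
    """Affine penalty for a gap of length k."""
    return P_OPEN + P_EXT * k


# Exchange values (PAM250 + 8) as transcribed; rows T D W V L K, columns D W V T A L K.
SUB = [
    [8, 3, 8, 11, 9, 9, 8],
    [12, 1, 6, 8, 8, 4, 8],
    [1, 25, 2, 3, 2, 6, 5],
    [6, 2, 12, 8, 8, 10, 6],
    [4, 6, 10, 9, 6, 14, 5],
    [8, 5, 6, 8, 7, 5, 13],
]


def global_dp(sub):
    m, n = len(sub), len(sub[0])
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):              # border column: leading gap of length i
        S[i][0] = -gap(i)
    for j in range(1, n + 1):              # border row: leading gap of length j
        S[0][j] = -gap(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best = S[i - 1][j - 1]                       # no gap
            for x in range(i - 1):                       # look back along column j-1
                best = max(best, S[x][j - 1] - gap(i - x - 1))
            for y in range(j - 1):                       # look back along row i-1
                best = max(best, S[i - 1][y] - gap(j - y - 1))
            S[i][j] = sub[i - 1][j - 1] + best
    # Final score: last cell, or an earlier cell plus a penalised end gap.
    score = S[m][n]
    for j in range(1, n):
        score = max(score, S[m][j] - gap(n - j))
    for i in range(1, m):
        score = max(score, S[i][n] - gap(m - i))
    return S, score


S, score = global_dp(SUB)
print(score)  # 50
```

Under these assumptions the sketch arrives at the final global score of 50 shown in the bottom-right corner of the search matrix.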
30
Easy DP recipe for using affine gap penalties
  • M[i,j] is the optimal alignment (highest scoring
    alignment) until (i,j)
  • To fill cell [i,j], check:
  • the preceding row (i-1) until column j-2: apply
    appropriate gap penalties
  • the preceding column (j-1) until row i-2: apply
    appropriate gap penalties
  • and cell [i-1, j-1]: apply the score for cell
    [i-1, j-1]

31
DP is a two-step process
  • Forward step: calculate scores
  • Trace back: start at the highest score and
    reconstruct the path leading to it
  • These two steps lead to the highest scoring
    alignment (the optimal alignment)
  • This is guaranteed when you use DP!
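The two steps can be sketched in a few lines. To keep the example short it swaps the affine penalties for a linear gap penalty and a simple match/mismatch score; all three values are placeholders. With end gaps penalised, the highest global score sits in the bottom-right cell, so the trace back starts there.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment with a linear gap penalty; returns (score, top, bottom)."""
    m, n = len(a), len(b)

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Forward step: fill the score matrix.
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = i * gap
    for j in range(1, n + 1):
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            S[i][j] = max(S[i - 1][j - 1] + sub(i, j),
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)

    # Trace back: from the bottom-right cell, follow a move that explains each score.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + sub(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("ACCD", "ABD")
print(score)
print(top)
print(bottom)
```

For ACCD and ABD both candidate alignments (AB-D and A-BD) reach the same optimal score under this scoring, so DP guarantees an optimal alignment but cannot tell which one reflects the true evolutionary history.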

32
Global dynamic programming
33
Global pairwise alignment
  • Global alignment: all gaps are penalised
  • Semi-global alignment: N- and C-terminal gaps
    (end-gaps) are not penalised
  • Example with end-gaps at both termini:
    MSTGAVLIY--TS-----
    ---GGILLFHRTSGTSNS
34
Semi-global pairwise alignment
  • Applications of semi-global alignment:
    - Finding a gene in a genome
    - Placing a marker onto a chromosome
    - One sequence much longer than the other
  • Danger: if gap penalties are high, divergent
    sequences get really bad alignments
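One common way to make end-gaps free, sketched here with a linear gap penalty and placeholder scores: initialise the first row and column to zero (leading end-gaps cost nothing) and take the best value anywhere in the last row or last column (trailing end-gaps cost nothing). The function name is hypothetical.

```python
def semi_global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best alignment score when gaps at either end of either sequence are free."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # zero borders: free leading gaps
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = S[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            S[i][j] = max(diag, S[i - 1][j] + gap, S[i][j - 1] + gap)
    # Free trailing gaps: the alignment may end anywhere in the last row or column.
    return max(max(S[m]), max(row[n] for row in S))


print(semi_global_score("CGT", "AACGTAA"))  # 3: the short sequence is placed inside the long one
```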

35
Local dynamic programming (Smith & Waterman,
1981)
Sequence 1: LCFVMLAGSTVIVGTR
Sequence 2: EDASTILCGS
Inputs to the search matrix: amino acid exchange matrix (must contain negative numbers), gap penalties (open, extension)
Resulting local alignment:
AGSTVIVG
A-STILCG
36
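Local dynamic programming (Smith & Waterman,
1981)
$$
S_{i,j} = \max \begin{cases}
s_{i,j} + \max_{0<x<i-1} \left( S_{x,j-1} - P_i - (i-x-1) P_x \right) \\
s_{i,j} + S_{i-1,j-1} \\
s_{i,j} + \max_{0<y<j-1} \left( S_{i-1,y} - P_i - (j-y-1) P_x \right) \\
0
\end{cases}
$$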
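The effect of the extra 0 in the recurrence can be seen in a reduced sketch. It uses a linear gap penalty and placeholder match/mismatch scores instead of the affine penalties and exchange matrix of the lecture, and it returns only the best local score (the highest cell anywhere in the matrix).

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score; no cell is allowed to drop below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # The 0 lets a new local alignment start at any cell.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur
    return best


print(smith_waterman_score("ACGT", "TTACGTTT"))  # 8: the four matching letters
print(smith_waterman_score("AAAA", "CCCC"))      # 0: nothing worth aligning
```

This only works when mismatches score negatively; with an all-positive matrix, such as PAM250 + 8, every path would keep growing and the local alignment would always extend to the sequence ends.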
37
Local dynamic programming
38
Dot plots
  • Way of representing (visualising) sequence
    similarity without doing dynamic programming (DP)
  • Make the same matrix, but locally represent sequence
    similarity by averaging using a window
  • See Lesk's book, pp. 167-171

39
Comparing two sequences
We want to be able to choose the best alignment
between two sequences. A simple method of
visualising similarities between two sequences is
to use dot plots. The first sequence to be compared
is assigned to the horizontal axis and the second
is assigned to the vertical axis.
40
Dot plots can be filtered by window approaches
(to calculate running averages) and by applying a
threshold. They can identify insertions,
deletions, inversions
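A windowed dot plot with a threshold can be sketched as below. The window size, threshold and function name are illustrative choices. A dot at (row, column) means the window starting at that row of the vertical sequence matches the window starting at that column of the horizontal sequence in at least `threshold` positions.

```python
def dot_plot(seq_x, seq_y, window=3, threshold=3):
    """Return the set of (row, col) dots; seq_x is horizontal, seq_y is vertical."""
    dots = set()
    for i in range(len(seq_y) - window + 1):
        for j in range(len(seq_x) - window + 1):
            matches = sum(seq_y[i + k] == seq_x[j + k] for k in range(window))
            if matches >= threshold:
                dots.add((i, j))
    return dots


def render(seq_x, seq_y, dots):
    """Plain-text picture of the plot, one line per position of seq_y."""
    lines = ["  " + seq_x]
    for i, residue in enumerate(seq_y):
        lines.append(residue + " " + "".join(
            "*" if (i, j) in dots else "." for j in range(len(seq_x))))
    return "\n".join(lines)


x, y = "MDAGSTVIL", "MDAVIL"          # y lacks the GST segment of x
dots = dot_plot(x, y)
print(sorted(dots))                   # [(0, 0), (3, 6)]
print(render(x, y, dots))
```

The jump of the diagonal from column 0 to column 6 is how an insertion or deletion shows up in the plot.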