Sequence comparison and Phylogeny

About This Presentation

Title:

Sequence comparison and Phylogeny

Description:

In 1999 Encephalitis caused by the West Nile Virus broke out in New York. ... E.g. v-sys onco genes in simian sarcoma virus leading to cancer in monkeys and ... – PowerPoint PPT presentation

Number of Views:170

Avg rating:3.0/5.0

Slides: 149

Provided by: biotecTu

Category:

more less

Transcript and Presenter's Notes

Title: Sequence comparison and Phylogeny

1
Sequence comparisonand Phylogeny
based on Chapter 4 Lesk, Introduction to
Bioinformatics
2
Contents

Motivation
Sequence comparison and alignments
Dot plots
Dynamic programming
Substitution matrices
Dynamic programming Local and global alignments
and gaps
BLAST
Significance of alignments
Multiple sequence alignments
Phylogenetic trees

3
Motivation

From where are we?
Recent Africa vs. Multi-regional Hypothese
In 1999 Encephalitis caused by the West Nile
Virus broke out in New York. How did the virus
come to New York?
How did the nucleus get into the eucaryotic
cells?
To answer such questions we will need sequence
comparison and phylogenetic trees

4
Sequence Similarity Searches

Sequence similarity can be clue to common
evolutionary ancestor
E.g. globin genes in chimpanzees and humans
or common function
E.g. v-sys onco genes in simian sarcoma virus
leading to cancer in monkeys and the seemingly
unrelated growth stimulating hormone PDGF, which
stimulates cell growth (first success of
similarity idea, 1983)
In general
If an unknown sequences is found, deduce its
function/structure indirectly by finding similar
sequences, whose function/structure is known
Assumption Evolution changes sequences slowly
often maintaining main features of a sequences
function/structure

5
Sequence alignment

Substitutions, insertions and deletions can be
interpreted in evolutionary terms
But distinguish chance similarity and real
biological relationship

CCGTAA
TCGTAA
CCGTAT
TCGTAC
TTGTAA
TCGTAG
TAGTAC
6
Evolution

Convergent evolution same sequence evolved from
different ancestors
Back evolution - mutate to a previous sequence

CCGTAA
TCGTAA
CCGTAT
TCGTAC
TAGTAA
TCGTAG
TAGTAC
TAGTAC
CCGTAA
7
Similarity vs. Homology

Any sequence can be similar
Sequences homologues if evolved from common
ancestor
Homologous sequences
Orthologs similar biological function
Paralogs different biological function (after
gene duplication), e.g. lysozyme and
a-lactalbumin, a mammalian regulatory protein
Assumption Similarity indicator for homology
Note, altered function of the expressed protein
will determine if the organism will survive to
reproduce, and hence pass on the altered gene

8
Sequence alignments

Given two or more sequences, we wish to
Measure their similarity
Determine the residue-residue correspondences
Observe patterns of conservation and variability
Infer evolutionary relationships

9
What is the best alignment?

Uninformative -------gctgaacg ctataatc------
-
Without gaps gctgaacg ctataatc
With gaps gctga-a--cg --ct-ataatc
Another one gctg-aa-cg -ctataatc-
Formally The best alignment have only a minimal
number of mismatches (insertions, deletions,
replace)
We need a method to systematically explore and to
compute alignments

10
Scores for an alignment

Percentage of matches
Score each match, mismatch, gap opening, gap
extension
Example
match 1
mismatch -1
Gap opening -3
Gap extension -1
Uninformative 0, score -21 -------gctgaacg
ctataatc-------
Without gaps25,score -4 gctgaacg ctataatc
With gaps 0, score -23 gctga-a--cg --ct-a
taatc
Another one 50, score-12 gctg-aa-cg -ctat
aatc-

11
Scores for an alignment

Percentage of matches
Score each match, mismatch, gap opening, gap
extension
Example
match 2
mismatch -1
Gap opening -1
Gap extension -1
Uninformative 0, score -17 -------gctgaacg
ctataatc-------
Without gaps25,score -2 gctgaacg ctataatc
With gaps 0, score -15 gctga-a--cg --ct-a
taatc
Another one 50, score0 gctg-aa-cg -ctataa
tc-

12
Dot plots
13
Dot plots

A convenient way of comparing 2 sequences
visually
Use matrix, put 1 sequence on X-axis, 1 on Y-axis
Cells with
identical characters filled with a 1,
non-identical with 0
(simplest scheme - could have weights)

14
Dot plots
15
Dot plots
16
Interpreting dot plots

What do identical sequences look like?
What do unrelated sequences look like?
What do distantly related sequences look like?
What does reverse sequence look like?
Relevant for detections of stems in RNA structure
What does a palindrome look like?
Relevant for restriction enzymes
What do repeats look like?
What does a protein with domains A and B and
another one with domains B and C look like?

17
Dot plot for identical sequences
18
Dotplot for unrelated sequences
19
Dotplot for distantly related sequences
20
Dotplot for reverse sequences
21
Dotplot for reverse sequences

Relevant to identify stems in RNA structures
Plot sequence against its reverse complement

22
Palindromes and restriction enzymes

Madam, I'm Adam
Able was I ere I saw Elba (supposedly said by
Napoleon)
Doc note I dissent, a fast never prevents a
fatness, I diet on cod.
Because DNA is double stranded and the strands
run antiparallel, palindromes are defined as any
double stranded DNA in which reading 5 to 3
both are the same
The HindIII cutting site
5'-AAGCTT-3'
3'-TTCGAA-5'
The EcoRI cutting site
5'-GAATTC-3'
3'-CTTAAG-5'

23
Dotplot of a Palindrome
24
Dotplot of repeats
25
Dotplot of Repeats/Palindrome
26
Dotplot for shared domain
27
Result

Dot plot
dorothycrowfoothodgkin
d
o
r
o
t
h
y
h
o
d
g
k
i
n

28
Result

Dot plot
dorothycrowfoothodgkin
d
o
r
o
t
h
y
h
o
d
g
k
i
n

29
Dotplots

Window size 15
Dot if
6 matches in window

30
(No Transcript)
31
gtgi1942644pdb1MEG Crystal Structure Of A
Caricain D158e Mutant In Complex With E-64
Length 216 Score 271 bits (693), Expect
1e-73 Identities 142/216 (65), Positives
168/216 (77), Gaps 4/216 (1) Query 1
IPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRTGNLNQYSE
QELLDCDRRS 60 PE VDWRKGAVTPVQGSCGSC
WAFSAV TEGI KIRTG L SEQELDCRRS Sbjct 1
LPENVDWRKKGAVTPVRHQGSCGSCWAFSAVATVEGINKIRTGKLVELSE
QELVDCERRS 60 Query 61 YGCNGGYPWSALQLVAQYGIHYRN
TYPYEGVQRYCRSREKGPYAAKTDGVRQVQPYNQGA 120
GC GGYP AL VA GIH R YPY Q CR G KT
GV VQP NG Sbjct 61 HGCKGGYPPYALEYVAKNGIHLRSKY
PYKAKQGTCRAKQVGGPIVKTSGVGRVQPNNEGN 120 Query
121 LLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCGNKVDHAVAAV--
--GYGPNYILIKNS 176 LL IA QPVSVV G
FQLYGGIF GPCG KVHAV AV G YILIKNS Sbjct
121 LLNAIAKQPVSVVVESKGRPFQLYKGGIFEGPCGTKVEHAVTAVGY
GKSGGKGYILIKNS 180 Query 177 WGTGWGENGYIRIKRGTGN
SYGVCGLYTSSFYPVKN 212 WGT WGE GYIRIKR
GNS GVCGLY SSYP KN Sbjct 181 WGTAWGEKGYIRIKRAPGN
SPGVCGLYKSSYYPTKN 216 1 lpenvdwrkk
gavtpvrhqg scgscwafsa vatveginki rtgklvelse
qelvdcerrs 61 hgckggyppy aleyvakngi
hlrskypyka kqgtcrakqv ggpivktsgv grvqpnnegn
121 llnaiakqpv svvveskgrp fqlykggife gpcgtkveha
vtavgygksg gkgyilikns 181 wgtawgekgy
irikrapgns pgvcglykss yyptkn
32
(No Transcript)
33

gtgi2624670pdb1AIM Cruzain Inhibited By
Benzoyl-Tyrosine-Alanine- Fluoromethylketone
Length 215
Score 121 bits (303), Expect 3e-28
Identities 78/202 (38), Positives 107/202
(52), Gaps 13/202 (6)
Query 2 PEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKI
RTGNLNQYSEQELLDCDRRSY 61
P VDWR GAVT VKQG CGSCWAFSA E
L SEQ L CD
Sbjct 2 PAAVDWRARGAVTAVKDQGQCGSCWAFSAIGNVECQWFL
AGHPLTNLSEQMLVSCDKTDS 61
Query 62 GCNGGYPWSALQLVAQY---GIHYRNTYPY---EGVQRY
CRSREKGPYAAKTDGVRQVQP 115
GCGG A Q YPY EG
C A T V Q
Sbjct 62 GCSGGLMNNAFEWIVQENNGAVYTEDSYPYASGEGISPP
CTTSGHTVGATITGHVELPQD 121
Query 116 YNQGALLYSIANQPVSVVLQAAGKDFQLYRGGIFVGPCG
NKVDHAVAAVGYGPN----YI 171
Q A N PVV A Y GG
DH V VGY Y
Sbjct 122 EAQIAAWLAV-NGPVAVAVDAS--SWMTYTGGVMTSCVS
EALDHGVLLVGYNDSAAVPYW 178
Query 172 LIKNSWGTGWGENGYIRIKRGT 193

34
(No Transcript)
35

gi7546546pdb1EF7B Chain B, Crystal
Structure Of Human Cathepsin X
Length 242
Score 52.0 bits (123), Expect 2e-07
Identities 60/231 (25), Positives 94/231
(40), Gaps 34/231 (14)
Query 1 IPEYVDWRQKGAV---TPVKNQ---GSCGSCWAFSAVVT
IEGIIKIRTGNL---NQYSEQ 51
P DWR V NQ CGSCWA
I I S Q
Sbjct 1 LPKSWDWRNVDGVNYASITRNQHIPQYCGSCWAHASTSA
MADRINIKRKGAWPSTLLSVQ 60
Query 52 ELLDCDRRSYGCNGGYPWSALQLVAQYGIHYRNTYPYEG
VQRYCR--------SREKGPY 103
DC C GG S QGI Y
C K
Sbjct 61 NVIDCGNAG-SCEGGNDLSVWDYAHQHGIPDETCNNYQA
KDQECDKFNQCGTCNEFKECH 119
Query 104 AAKTDGVRQVQPYN-----QGALLYSIANQPVSVVLQAA
GKDFQLYRGGIFVGPCGNK-V 157
A V Y AN PS A
Y GGI
Sbjct 120 AIRNYTLWRVGDYGSLSGREKMMAEIYANGPISCGIMAT
ER-LANYTGGIYAEYQDTTYI 178
Query 158 DHAVAAVGY----GPNYILIKNSWGTGWGENGYIRI---
--KRGTGNSYGV 199

36
(No Transcript)
37
(No Transcript)
38
Dynamic programming
39
From Dotplots to Alignments

Obvious best alignment DOROTHYCROWFOOTHODGKIN
DOROTHY--------HODGKIN

40
From Dotplots to Alignments

Find best path from top left corner to bottom
right
Moving east corresponds to - in the second
sequence
Moving south corresponds to - in the first
sequence
Moving southeast corresponds to a match if the
characters are the same or a mismatch otherwise
Can we automate this?

41
From Dotplots to Alignments

Algorithm (Dynamic Programming)
Insert a row 0 and column 0 initialised with 0
Starting from the top left, move down row by row
from row 1 and
right column by column from column 1 visiting
each cell
Consider
The value of the cell north
The value of the cell west
The value of the cell northwest if the row/column
character mismatch
1 the value of the cell northwest if the
row/column character match
Put down the maximum of these values as the value
for the current cell
Trace back the path with the highest values from
the bottom right to the top left and output the
alignment

42
From Dotplots to Alignments

0 1 2 3 4 5 6 T G C A T A0 1 A2 T3
C4 T5 G6 A7 T

43
From Dotplots to Alignments

0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 02 T 03 C 04 T 05 G 06 A 07 T 0
Insert a row 0 and column 0 initialised with 0

44
From Dotplots to Alignments

0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 03 C 04 T 05 G 06 A 07
T 0

Consider
Value north
Value west
Value northwest if the row/column character
mismatch
1 value northwest if the row/column character
match
Put down the maximum of these values for current
celll

45
From Dotplots to Alignments

0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23
C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35
G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47
T 0 1 2 2 3 4 4

46
Reading the Alignment

0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23
C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35
G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47
T 0 1 2 2 3 4 4

-tgcat-a- at-c-tgat
47
Reading the Alignment there are more than one
possibility

0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23
C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35
G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47
T 0 1 2 2 3 4 4

---tgcata atctg-at-
48
FormallyLongest Common Subsequence LCS

What is the length s(V,W) of the longest common
subsequence of two sequencesVv1..vn and
Ww1..wm ?
Find sequences of indices1 i1 lt lt ik n and
1 j1 lt lt jk msuch that vit wjt for 1
t k
How? Dynamic programming
si,0 s0,j 0 for all 1 i n and 1 j m
and
si-1,jsi,j max si,j-1 si-1,j-1 1,
if vi wj
Then s(V,W) sn,m is the length of the LCS

49
Example LCS

0 1 2 3 4 5 6 T G C A T A0 1 A2 T3
C4 T5 G6 A7 T

50
Example LCS

0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 02 T 03 C 04 T 05 G 06 A 07 T 0
Initialisation si,0 s0,j 0 for all 1 i
n and 1 j m and

51
Example LCS

0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 03 C 04 T 05 G 06 A 07
T 0
Computing each cell si-1,j si,j max
si,j-1 si-1,j-1 1, if vi wj

52
Example LCS

0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 0 0 0 1 1 12 T 0 1 1 1 1 2 23
C 0 1 1 2 2 2 24 T 0 1 1 2 2 3 35
G 0 1 2 2 2 3 36 A 0 1 2 2 3 3 47
T 0 1 2 2 3 4 4
Computing each cell si-1,j si,j max
si,j-1 si-1,j-1 1, if vi wj

53
LCS Algorithm

Complexity
LCS has quadratic complexity O(n m)

LCS(V,W)
For i 1 to n
si,0 0
For j 1 to m
s0,j 0
For i 1 to n
For j 1 to m
If vi wj and si-1,j-1 1 si-1,j and si-1,j-1
1 si,j-1 Then
si,j si-1,j-1 1
bi,j North West
Else if si-1,j si,j-1 Then
si,j si-1,j
bi,j North
Else
si,j si,j-1
bi,j West
Return s and b

54
Printing the alignment of LCS

PRINT-LCS(b,V,i,j)
If i0 or j0 Then Return
If bi,j North West Then
PRINT-LCS(V,b,i-1,j-1)
Print vi
Else if bi,j North Then
PRINT-LCS(V,b,i-1,j)
Else
PRINT-LCS(V,b,i,j-1)

55
Rewards/Penalities

We can use different schemes
-1 for insert/delete/mismatch
1 for match
Consider
-1 the value of the cell north
-1 the value of the cell west
-1 the value of the cell northwest if the
row/column character mismatch
1 the value of the cell northwest if the
row/column character match
Put down the maximum of these values as the value
for the current cell

56
Reading the Alignment

0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 -1 -1 -1 1 0 12 T 0 1 0 -1 0 2 13
C 0 0 -1 1 0 1 14 T 0 1 0 0 0 1 05
G 0 0 2 1 0 0 06 A 0 -1 1 1 2 1 17
T 0 1 0 0 1 3 2

---tgcata atctg-at-
57
Rewards/Penalities

Lets refine the schemes
Transition mutations are more common
purinelt-gtpurine, alt-gtg
pyrimidinelt-gtpyrimidine, tlt-gtc
Transversions (purinelt-gtpyrimidine) are less
common
Use a subsitutation matrix to rate mismatches
-2 for insert/delete
Mismatch/match according to substitution matrix
Consider
-2 the value of the cell north
-2 the value of the cell west
Corresponding value of the substion matrix the
value of the cell northwest
Put down the maximum of these values as the value
for the current cell

58
Reading the Alignment

0 1 2 3 4 5 6 T G C A T A0 0 0 0 0 0 0 01
A 0 -2 0 -2 2 0 22 T 0 2 0 0 0 4 23
C 0 0 0 2 0 2 24 T 0 2 0 0 0 2 05
G 0 0 4 2 0 0 26 A 0 -2 2 2 4 2 27
T 0 2 0 2 2 6 4

---tgcata atctg-at-
59
Substitution matrixes
60
How to derive a substitution matrix for amino
acids?

Amino acids can be classified by physiochemical
properties

61
PAM 250 matrix
Cys 12 Ser 0 2 Thr -2 1 3 Pro -3 1 0
6 Ala -2 1 1 1 2 Gly -3 1 0 -1 1 5 Asn -4
1 0 -1 0 0 2 Asp -5 0 0 -1 0 1 2 4 Glu
-5 0 0 -1 0 0 1 3 4 Gln -5 -1 -1 0 0 -1
1 2 2 4 His -3 -1 -1 0 -1 -2 2 1 1 3
6 Arg -4 0 -1 0 -2 -3 0 -1 -1 1 2 6 Lys -5
0 0 -1 -1 -2 1 0 0 1 0 3 5 Met -5 -2 -1
-2 -1 -3 -2 -3 -2 -1 -2 0 0 6 Ile -2 -1 0 -2
-1 -3 -2 -2 -2 -2 -2 -2 -2 2 5 Leu -6 -3 -2 -3
-2 -4 -3 -4 -3 -2 -2 -3 -3 4 2 6 Val -2 -1 0
-1 0 -1 -2 -2 -2 -2 -2 -2 -2 2 4 2 4 Phe -4
-3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5 0 1 2 -1
9 Tyr 0 -3 -3 -5 -3 -5 -2 -4 -4 -4 0 -4 -4 -2
-1 -1 -2 7 10 Trp -8 -2 -5 -6 -6 -7 -4 -7 -7 -5
-3 2 -3 -4 -5 -2 -6 0 0 17 C S T P A
G N D E Q H R K M I L V F Y W
gt0, likely mutation 0, random mutation lt0,
unlikely
62
(No Transcript)
63
(No Transcript)
64
PAM 250 Interpretation

Immutable
Cysteine (Avg-2.8) known to have several
unique, indispensable functions
attachment site of heme group in cytochrome and
of iron sulphur FeS in ferredoxins
Cross links in proteins such as chymotrypsin or
ribonuclease
Seldom without unique function
Glycine (Avg-1.6) small size maybe advantageous
Mutable
Serine often functions in active site, but can be
easily replaced
Self-alignment
Tryptophan with itself scores very high, as W
occurs rarely

65
Point Accepted Mutations PAM

Substitution matrix using explicit evolutionary
model of how amino acids change over time
Use parsimony method to determine frequency of
mutations
Entry in PAM matrix Likelihood ratio for
residues a and b Probability a-b is a mutation /
probability a-b is chance
PAM x Two sequences V, W have evolutionary
distance of x PAM if a series of accepted point
mutations (and no insertions/deletions) converts
V into W averaging to x point mutation per 100
residues
Mutations here mutations in the DNA
Because of silent mutations and back mutations n
can be gt100
PAM 250 most commonly used

66
PAM and Sequence Similarity
67
PAM

Dayhoff, Eck, Park A model of evolutionary
change in proteins, 1978
Accepted point mutation substitution of an
amino acid accepted by natureal selection
Assumption X replacing Y as likely as Y
replacing X
Used cytochrome c, hemoglobin, myoglobin, virus
coat proteins, chymotrypsinogen, glyceraldehyde
3-phosphate dehrydogenase, clupeine, insulin,
ferredoxin
Sequences which are too distantly related have
been omitted as they are more likely to contain
multiple mutations per site

68
PAM Step 1

Step 1 Construct a multiple alignment
Example
ACGCTAFKI
GCGCTAFKI
ACGCTAFKL
GCGCTGFKI
GCGCTLFKI
ASGCTAFKL
ACACTAFKL

69
PAM Step 2

Create a phylogenetic tree (parsimony method)

ACGCTAFKI A-gtG
I-gtL GCGCTAFKI ACGCTAFKL
A-gtG A-gtL C-gtS G-gtA GCGCTGFKI
GCGCTLFKI ASGCTAFKL ACACTAFKL
70
PAM Step 3

Note, the following variables
Residue frequency ri is the number of amino acid
i occurring in the sequences, e.g. rA 10 and
rG10
Number of residues r is the number of overall
amino acids in all sequences, e.g. r63
Substitutability si is the number of
substitutions in the tree involving amino acid i
, e.g. sA4
Substituion frequency si,j is the number of
substitutions involving amino acid i and j (i.e.
the number of i?j and j?i ), e.g. sA,G 3
Number of substitutions s is the number of
overall substitutions, s6

71
PAM Step 4

Relative mutability is the number of times the
amino acid is substituted by any other amino acid
in the tree divided by the total number of
substitutions that could have affected the
residue
Note, it is assumed that substitutions in both
directions are equally likely
mi 100 x ( si x ri ) / ( 2 s x r )
Example mA 100 x ( 4 x 10 ) / ( 2 x 6 x 63 )
5.3

72
PAM Step 5

Compute mutation probability
Mi,j mj x si,j / sj
Example MG,A 5.3 x 3 / 4 3.975

73
PAM Step 6

Finally the entry in the PAM Matrix
Ri,j log ( Mi,j / ( ri / r ) )
Example RG,A log ( 3.975 / (10/63) ) 1.4

74
PAM Step 6

For the entries on the diagonal
mj relative mutability ? 1-mj relative
immutability
Rj,j log ( (1-mj ) / ( rj / r ) )

75
BLOSUM

Different approach to PAM
BLOcks SUbstitution Matrix (based on BLOCKS
database)
Generation of BLOSUM x
Group highly similar sequences and replace them
by a representative sequences.
Only consider sequences with no more than x
similarity
Align sequences (no gaps)
For any pair of amino acids a,b and for all
columns c of the alignment, let q(a,b) be the
number of co-occurrences of a,b in all columns c.
Let p(a) be the overall probability of a
occurring
BLOSUM entry for a,b is log2 ( q(a,b) / (
p(a)p(b) ) )
BLOSUM 50 and BLOSUM 62 widely used

76
LCS Algorithm (Longest Common Subsequence)
Revisited

Algorithm (Dynamic Programming) with Substitution
Matrix
Insert a row 0 and column 0 initialised with 0
Starting from the top left, move down row by row
from row 1 and
right column by column from column 1 visiting
each cell
Consider
The value of the cell north
The value of the cell west
The value of the cell northwest if the row/column
character mismatch
s the value of the cell northwest, where s is
the value in the subsitution matrix for the
residues in row/column
Put down the minimum of these values as the value
for the current cell
Trace back the path with the highest values from
the bottom right to the top left and output the
alignment

77
LCS Revisited Formally

What is the length s(V,W) of the longest common
subsequence of two sequencesVv1..vn and
Ww1..wm ?
Find sequences of indices1 i1 lt lt ik n and
1 j1 lt lt jk msuch that vit wjt for 1
t k
How? Dynamic programming
si,0 s0,j 0 for all 1 i n and 1 j m
and
si-1,jsi,j max si,j-1 si-1,j-1 t,
where t is the value for vi and wj in
the substitution matrix
Then s(V,W) sn,m is the length of the LCS

78
Dynamic programming revisitedlocal and global
alignments and gap
79
Evolution and Alignments

Alignments can be interpreted in evolutionary
terms
Identical letters are aligned. Interpretation
part of the same ancestral sequence and not
changed
Non-identical letters are aligned
(substitution)Interpretation Mutation
GapsInterpretation Insertions and deletions
(indels)

80
Evolution and Alignments

Specific problems aligning DNA
Frame shift
DNA triplets code amino acids
Indel of one nucleotide shifts the whole sequence
of triplets
Thus may have a global effect and change all
coded amino acids
Silent mutation
Substitution in DNA leaves transcribed amino acid
unchanged
Non-sense mutation
Substitution to stop-codon

81
Local and Global Alignments

Global alignment (Needleham-Wunsch) algorithm
finds overall best alignment
Example members of a protein family, e.g.
globins are very conserved and have the same
length in different organisms from fruit fly to
humans
Local alignment (Smith-Waterman) algorithm finds
locally best alignment
most widely used, as
e.g. genes from different organisms retain
similar exons, but may have different introns
e.g. homeobox gene, which regulates embryonic
development occurs in many species, but very
different apart from one region called
homeodomain
e.g. proteins share some domains, but not all

82
Local Alignment

LCS s(V,W) computes globally best alignment
Often it is better to maximise locally, i.e.
compute maximal s(vivi , wj wi ) for all
substrings of V and W
Can we adapt algorithm?
Global alignment longest path in matrix s from
(0,0) to (n,m)
Local alignment longest path in matrix s from
any (i,j) to any (i,j)
Modify definition of s adding vertex of weight 0
from source to every other vertex, creating a
free jump to any starting position (i,j)

83
Local Alignment

Modify the definition of s as follows
si,0 s0,j 0 for all 1 i n and 1 j m
and
0 si-1,jsi,j max si,j-1 si-1,j-1
t, where t is the value for vi wj in the
substitution matrix
Then s(V,W) max si,j is the length of the
local LCS
This computes longest path in edit graph
Several local alignment may have biological
significance (consider e.g. two multi-domain
proteins whose domains are re-ordered

84
Aligning with Gap Penalties

Gap is sequence of spaces in alignment
So far, we consider only insertion and deletion
of single nucleotides or amino acids creating
alignments with many gaps
So far, score of a gap of length l is l
Because insertion/deletion of monomers is
evolutionary slow process, large numbers of gaps
do not make sense
Instead whole substrings will be deleted or
inserted
We can generalise score of a gap to a score
function A B l, where A is the penalty to open
the gap and B is the penalty to extend the gap

85
Aligning with Gap Penalties

High gap penalties result in shorter,
lower-scoring alignments with fewer gaps and
Lower gap penalties give higher-scoring, longer
alignments with more gaps
Gap opening penalty A mainly influences number of
gaps
Gap extension penalty B mainly influences length
of gaps
E.g. if interested in close relationships, then
choose A, B above default values, for distant
relationships decrease default values

86
Aligning with Gap Penalties

Adapt the definition of s as follows
s-deli,j max s-deli-1,j - B si-1,j
(AB)
s-insi,j max s-insi,j-1 - B si,j-1
(AB)
0 s-deli,jsi,j max s-insi,j
si-1,j-1 t, where t is the value for vi,
wj in the substitution matrix Then s(V,W)
max si,j is the length of the local LCS with
gap penalties A and B

87
FASTA and BLAST
88
Motivation

As in dotplots, the underlying data structure for
dynamic programming is a table
Given two sequences of length n dynamic
programming takes time proportional to n2
Given a database with m sequences, comparing a
query sequence to the whole database takes time
proportional to m n2
What does this mean?
Imagine you need to fill in the tables by hand
and it takes 10 second to fill in one cell
Assume there are 1.000.000 sequences each 100
amino acids long
How long does it take?

1.000.000 x 100 x 100 x 10 sec 1011 sec
27.777.778h 1157407days 3170 years
Even if a computer does not take 10 sec, but just
0.1ms to fill in one cell, it would still be 12
days.
We cannot do something about the database size,
but can we do something about the table size?

90
An idea Prune the search space
91
Another idea

Did we formulate the problem correctly?
Do we need the alignments for all sequences in
the database?
No, only for reasonable hits ? introduce a
threshold
A reasonable alignment will contain short
stretches of perfect matches
Find these first, then extend them to connect
them as best possible

92
FASTA and BLAST

FASTA and BLAST faster than dynamic programming
(5 times and 50 times respectively)
Underlying idea for a heuristic
High-scoring alignments will contain short
stretches of identical letters, called words
FASTA and BLAST first search for matches of words
of a given length and score threshold
BLAST for words of length 3 for proteins and 11
for DNA
FASTA for words of length 2 for proteins and 6
for DNA
Next, matches are extended to local (BLAST) and
global (FASTA) alignments

93
FASTA and BLAST

More formallyIf the strings Vv1..vm and
Ww1..wm match with at most k mismatches, then
they share an p-tuple for p ?m/(k1)?, i.e.
vi..vil-1 wj..wjl-1 for some 1 i,j m-p1
FILTRATION ALGORITHM, which detects all matching
words of length m with up to k mismatches
Potential match detection Find all matches of
p-tuples of V,W (can be done in linear time by
inserting them into a hash table)
Potential match verification Verify each
potential match by extending it to the left and
right until either the first k1 mismatches are
found or the beginning or end of the sequences
are found

94
Example for BLAST

Search SWISSPROT for Immunoglobulin

95
Example for BLAST

Search BLAST (www.ncbi.nlm.nih.gov/BLAST/) for
P11912

96
Example for BLAST

Distribution of Hits

97
Example for BLAST Top Hits

Score E Sequences producing significant
alignments Score E-Value gi547896spP11912C7
9A_HUMAN B-cell antigen receptor comp... 473
e-133 gi728993spP40293C79A_BOVIN B-cell
antigen receptor comp... 312 3e-85
gi126779spP11911C79A_MOUSE B-cell antigen
receptor comp... 278 5e-75 gi728994spP40259C
79B_HUMAN B-cell antigen receptor comp... 55
1e-07 gi125781spP01618KV1_CANFA IG KAPPA
CHAIN V REGION GOM 38 0.019 gi125361spP17948
VGR1_HUMAN Vascular endothelial growth ... 37
0.042 gi549319spP35969VGR1_MOUSE Vascular
endothelial growth ... 36 0.052
gi114764spP15530C79B_MOUSE B-cell antigen
receptor comp... 36 0.064 gi1718161spP53767V
GR1_RAT Vascular endothelial growth f... 35
0.080 gi125735spP01681KV01_RAT Ig kappa
chain V region S211 35 0.095
gi1730075spP01625KV4A_HUMAN IG KAPPA CHAIN
V-IV REGION LEN 34 0.26 gi1718188spP52583VGR2
_COTJA Vascular endothelial growth... 33 0.28
gi125833spP06313KV4B_HUMAN IG KAPPA CHAIN
V-IV REGION J... 33 0.30 gi125806spP01658KV3
F_MOUSE IG KAPPA CHAIN V-III REGION ... 33 0.30
gi125808spP01659KV3G_MOUSE IG KAPPA CHAIN
V-III REGION ... 33 0.30 gi1172451spQ05793PG
BM_MOUSE Basement membrane-specific ... 33 0.33
gi125850spP01648KV5O_MOUSE Ig kappa chain V-V
region HP... 33 0.36 gi125830spP06312KV40_HU
MAN Ig kappa chain V-IV region p... 33 0.38
gi2501738spQ06639YD03_YEAST Putative 101.7
kDa transcri... 33 0.41

98
Example for BLAST Alignment
gtgi126779spP11911C79A_MOUSE B-cell antigen
receptor complex associated protein
alpha-chain precursor (IG-alpha) (MB-1 membrane
glycoprotein) (Surface-IGM-associated protein)
(Membrane-bound immunoglobulin associated
protein) (CD79A) Length 220 Score 278 bits
(711), Expect 5e-75 Identities 150/226 (66),
Positives 165/226 (73), Gaps 6/226
(2) Query 1 MPGGPGVLQALPATIFLLFLLSAVYLGPGCQA
LWMHKVPASLMVSLGEDAHFQCPHNSSN 60 MPGG
LL LS LGPGCQAL P SL VLGEA C
N Sbjct 1 MPGG----LEALRALPLLLFLSYACLGPGCQAL
RVEGGPPSLTVNLGEEARLTC-ENNGR 55 Query 61
NANVTWWRVLHGNYTWPPEFLGPGEDPNGTLIIQNVNKSHGGIYVCRVQ
EGNESYQQSCG 120 N NTWW L N TWPP
LGPG G L VNK G CV E N SCG Sbjct
56 NPNITWWFSLQSNITWPPVPLGPGQGTTGQLFFPEVNKNTGACTG
CQVIE-NNILKRSCG 114 Query 121
TYLRVRQPPPRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQ
NEKLGLDAGD 180 TYLRVR P
PRPFLDMGEGTKNRIITAEGIILLFCAVVPGTLLLFRKRWQNEK GD
D Sbjct 115 TYLRVRNPVPRPFLDMGEGTKNRIITAEGIILLFCA
VVPGTLLLFRKRWQNEKFGVDMPD 174 Query 181
EYEDENLYEGLNLDDCSMYEDISRGLQGTYQDVGSLNIGDVQLEKP
226 YEDENLYEGLNLDDCSMYEDISRGLQGTYQDVG
LIGD QLEKP Sbjct 175 DYEDENLYEGLNLDDCSMYEDISRG
LQGTYQDVGNLHIGDAQLEKP 220
99
Example for BLAST

Lineage Report
root
. cellular organisms
. . Eukaryota eukaryotes
. . . Fungi/Metazoa group eukaryotes
. . . . Bilateria animals
. . . . . Coelomata animals
. . . . . . Gnathostomata vertebrates
. . . . . . . Tetrapoda vertebrates
. . . . . . . . Amniota vertebrates
. . . . . . . . . Eutheria mammals
. . . . . . . . . . Homo sapiens (man)
---------------------- 473 33 hits mammals
B-cell antigen receptor complex associated
protein alpha-ch
. . . . . . . . . . Bos taurus (bovine)
..................... 312 2 hits mammals
B-cell antigen receptor complex associated
protein alpha-ch
. . . . . . . . . . Mus musculus (mouse)
.................... 278 31 hits mammals
B-cell antigen receptor complex associated
protein alpha-ch
. . . . . . . . . . Canis familiaris (dogs)
................. 37 1 hit mammals
IG KAPPA CHAIN V REGION GOM
. . . . . . . . . . Rattus norvegicus (brown rat)
........... 35 7 hits mammals
Vascular endothelial growth factor receptor 1
precursor (VE
. . . . . . . . . . Oryctolagus cuniculus
(domestic rabbit) . 29 1 hit mammals
IG KAPPA CHAIN V REGION K29-213
. . . . . . . . . Coturnix japonica
------------------------- 33 2 hits birds
Vascular endothelial growth factor
receptor 2 precursor (VE
. . . . . . . . . Gallus gallus (chickens)
.................. 31 4 hits birds
CILIARY NEUROTROPHIC FACTOR RECEPTOR ALPHA
PRECURSOR (CNTFR

100
How good is an alignment?

Be careful Fitch/Smith found 17 alignments for
alpha- and beta-chains in chicken haemoglobins
Only one is the correct one (according to the
structure)
Given an alignment, how good is it
Percentage of matching residues, i.e. number of
matches divided by length of smallest sequence
Advantage independent of sequence length
E.g. ATC TGAT 4/6 66.67 TGCAT A
More general also consider gaps, extensions,

101
Blast Raw Score

R a I b X - c O - d G, where
I is the number of identities in the alignment
and a is the reward for each identity
X is the number of mismatches in the alignment
and b is the reward for each mismatch
O is the number of gaps and c is the penalty for
each gap
G is the number of - characters in the
alignment and d is the penalty for each
The values for a,b,c,d appear at the bottom of a
Blast report. For BLASTn they are a1, b-3, c5,
d2

102
Example

Query 1 atgctctggccacggcacttgcgga
Sbjt107 atgctctggccacggatcttgtgga
tcccagggtgatctgtgcacctgcgata 53
tccca---tgatatgtgcacctgcgata 156
R 1 x 46 -3 x 4 - 5 x 1 - 2 x 3 23
So, given the scores how significant is the
alignment?

103
Significance of an alignment

Significance of an alignment needs to be defined
with respect to a control population
Pairwise alignment How can we get control
population?
Generate sequences randomly? ? Not a good model
of real sequences
Chop up both sequences and randomly reassemble
them
Database search How can we get control
population?
Control whole database
Align sequence to control population and see how
good result is in comparison
This is captured by Z scores, P-values and
E-values

104
Z-score

Z-score normalises the score S
Let m be mean of population and std its standard
deviation, then Z-score (S m) / std
Z-score of 0 ? no better than average, hence
might have occurred by chance
The higher the Z-score the better

105
P-value

P-value probability of obtaining a score S
Range 0 P 1
Let m be the number of sequences in the control
population with score S
Let p be the size of the control population
Then P-value m / p
Rule of thumb
P 10-100 exact match,
10-100 P 10-50 nearly identical (SNPs)
10-50 P 10-10 homology certain
10-5 P 10-1 usually distant relative
P gt 10-1 probably insignificant

106
E-values

E-value takes also the database into account
E-value expected frequency of a score S
Range 0 E m, where m is the size of the
database
Relationship to P E m P
E values are calculated from
the bit score
the length of the query
the size of the database

107
BLAST Bit score

The bit score normalizes the raw score S to make
score under different settings comparable
The bit score is obtained from the raw score as
follows
S ( lambda x R - ln(K) ) / ( ln(2),
where lambda 1.37 and K0.711
Example
S ( 1.37 x 23 - ln(0.711) ) / ln(2) 46

108
E-value

The E-value is then calculated as follows
E m x n x 2 -S , where
m is the effective length of the query
n is the effective length of the database
S is the bit score
(effective length takes into account that an
alignment cannot start at the end of a sequence)
Example
m34 (19 nucleotides fewer than the 53 submitted)
n5,854,611,841
Result E0.003

109
Precision and Recall

How good are BLAST and FASTA?
True positives, tp hits which are biologically
meaningful
False positives, fp hits which are not
biologically meaningful
True negatives, tn non-hits which are not
biologically meaningful
False negatives, fn non-hits which are
biologically meaningful
Minimise fp and fn
Recall tp/(tpfn) (meaningful hits / all
meaningful)
Precision tp/(tpfp) (meaningful hits / all
hits)
But since no objective data available difficult
to judge BLAST and FASTAs sensitivity and
specificity

110
Multiple Sequence Alignments
111
Multiple Sequence Alignment

Align more than two sequences
Choice of sequences
If too closely related then large redundant
If very distantly related then difficult to
generate good alignment
Additionally use colour for residues with similar
properties
Yellow Small polar GLy,Ala,Ser,Thr
Green Hydrophobic Cys,Val,Ile,Leu, Pro,Phe
,Tyr,Met,Trp
Magenta Polar Asn,Gln,His
Red Negatively charged Asp,Glu
Blue Positively charged Lys, Arg

112
Thioredoxins WCGPCK or R motif
113
Thioredoxins Gly/Pro turn
114
Thioredoxins every second hydrophobic beta
strand
115
Thioredoxins ca. every 4th hydrophobic alpha
helix
116
(No Transcript)
117
Profiles, PSI-Blast, HMM
118
Profiles

Derive profile from multiple sequence alignment
Useful to
Align distantly related sequences
Conserved regions, which may indicate active site
Classify subfamilies within homologues
How can profile be used to search
Insist on profile (such as WGCPC)? Too strict
Use frequence distribution of profile

119
Consider frequencies

Score for
VDFSAS 1316167167
ADATAA 11601160
Not good to pick up distant relationships
Better combine with substitution matrix
Result position specific substitution matrix

120
PSI-Blast

Globin familiy (oxygen transport ) of proteins
occurs in many species
Proteins have same function and structure and
But there are pairs of members of the family
sharing less than 10 identical residues
A B C
PSI-BLAST idea score via intermediaries may be
better than score from direct comparison

50
50
Only 10
121
PSI-BLAST

PSI-BLAST
1. BLAST
2. Collect top hits
3. Build multiple sequence alignment from
significant local matches
4. Build profile
5. Re-probe database with profile
6. Go back to 2.

122
PSI-BLAST

But beware of PSI-BLAST
False positives propagate and spread through
iterations
If protein A consists of domains D and E, and
protein B of domains E and F and protein C of
domain F, then PSI-BLAST will relate A and C
although they do not share any domain

123
Hidden Markov Model

Procedure to generate sequences
State transition systems with three types of
states
Deletion
Insertion
Match, which emits residues
Follow probability distribution for successor
state
Train model on multiple sequence alignment

124
Summary

Evolutionary model Indels and substitutions
Homologues vs. similarity
Dot plots
Easy visual exploration, but not scalable
Dynamic programming
Local, global, gaps
Substitution matrices (PAM, BLOSUM)
BLAST and FASTA
Scores and significance
Multiple Sequence Alignments
Profiles, PSI-BLAST, HMM

125
Phylogeny
126
Motivation

How did the nucleus get into the eucaryotic
cells?
From where are we?
Recent Africa vs. Multi-regional Hypothese
In 1999 Encephalitis caused by the West Nile
Virus broke out in New York. How did the virus
come to New York?

127
How did the nucleus get into the eucaryotic cells?

Simple experiment
Blast classes genes with related functions in
yeast (Eucaryote)
against Bacteria and
against Archaea
And count number of significant hits

128
How did the nucleus get into the eucaryotic cells?

Mitochondria und Energy metabolism
Significantly more hits in bacteria
Cell organisation
Significantly more hits in Archaea
Fundamental Result without any experiment!

Blue Bacteria Grey Archaea
129
Phylogeny

Taxonomists aim to classify and group organisms
E.g. Aristoteles, De Partibus Animalium
Ought we, for instance, to begin by discussing
each separate species man, lion, ox, and the
like taking each kind in hand independently of
the rest, or ought we rather to deal first with
the attributes which they have in common in
virtue of some common element of their nature,
and proceed from this as a basis for
consideration of them separately other

130
Schools of Taxonomists

Goal create taxonomy
Approach
Phenotype
Phylogeny
3 schools
Phenotype only
Evolutionary TaxonomistsPhenotype ( Phylogeny)
Cladists Phylogeny (Phenotype)

131
Practical Application Westnile virus in NY

Westnile virus mainly in Africa
Transmitted by insects and birds
How did the virus get to NY in 1999
Hundreds of DNA samples taken
All 99.8 identical ? single entry to NY!
Phylogenetic tree allows to deduce origin

132
Example Westnil virus in NY

How can the trees be constructed?

133
Three Methods to Generate Phylogenetic Trees

Distance-based
Hierarchical clustering
Character-based
Parsimony
Maximum likelihood

134
Distance-based Approach

Single Alignment
Score 46 matches, 3 mismatches, 1 gap, 3 gap
extensions,
z.B. Score 46x1 - 3x1 - 1x2 - 3x1 38
Approach
Define distance between two sequences, e.g.
percentage of mismatches in their alignment
Construct tree, which groups sequences with
minimal distances iteratively together

135
Hierarchical Clustering
136
Hierarchical Clustering

Given a distance matrix D(dij) with 1 i,j n
Result A binary tree of clusters
Init
ToDo
For all i in 1,, n do
Let ti be a tree without children, i.e. a leaf
ToDo ToDo ? ti
Main loop
While ToDo gt 1 do
Find i,j such that dij is minimal
Add a new column and row labelled k (i,j) to D
For all indices h of D apart from k,i,j do
dh,k dk,h min dh,i , dh,j // min
single linkage
Let tk be a new tree with children ti and tj
ToDo ( ToDo ? tk ) - ti ,tj
Remove columns and rows i,j from D
Complexity O(n2)

137
Hierarchical Clustering

How to define distance between clusters?
Single linkage
dh,k dk,h min dh,i , dh,j
Example Distance (A,B) to C is 1
Complete linkage
dh,k dk,h max dh,i , dh,j
Example Distance (A,B) is C is 2
Average linkage
dh,k dk,h 0.5 dh,i 0.5 dh,j
Example Distance (A,B) to C is 1.5
Are dendrograms always the same independent of
the linkage method?

138
Parsimony-method

Approach Generate smallest tree containing all
the sequences as leaves

139
Parsimony

Generate smallest tree
Informative vs. non-informative sites
Build pairs with fewest possible substitutions
Example
3 possible trees
((a,b),(c,d)) or ((a,c),(b,d)) or ((a,d),(b,c))
1,2,3,4 are not informative
5,6 are informative
5 ((a,b),(c,d))
6 ((a,c),(b,d))

140
Maximum likelihood

Assigns quantitative probabilities to mutation
events
Reconstructs ancestors for all nodes in the tree
Assigns branch lengths based on probabilities of
the mutational events
For each possible tree topology, the assumed
substitution rates are varied to find the
parameters that give the highest likelihood of
producing the observed data

141
Problems

Character-based methods tend to be better (based
on paleontological data)
All make assumptions
No back mutations
Same evolutionary rate

142
Assessing Quality Bootstrapping

Given a tree obtained from one of the methods
above
Generate Multiple Alignment
For a number of interations
Generate new sequences by selecting columns
(possibly the same column more than once) form
the multiple alignment
Generate tree for the new sequences
Compare this new tree with the given tree
For each cluster in the given tree, which also
approach in the new tree, the bootstrap value is
increased
Bootstrap-Value Percentage of trees containing
the same cluster

143
From where are we?

Recent-Africa Hypothesis
Homo Sapiens came 100-200.000 years ago from
Africa
Multi-regional Hypothesis
Ancestors of Homo Sapiens left Africa ca.
2.000.000 years ago
Which ones right?

144
Experiment

Mitochondrial DNA form 53 humans in different
regions sequenced
Outgroup Mitochondrial DNA of chimpanzee

145
A nice phylogeny (Nature 2004)
Nature October 2004 Volume 431 No. 7012
146
Why Mitochondria?

Simple genetic structure
No repetitions
No Pseudo genes
No Introns
No recombination

147
Molecular Clock

Based on genetic and paleontological the most
recent common ancestor (mrca) of chimpanzee and
homo sapiens dates back 5.000.000 years
Molecular clock
? 1.7 x10-8 nucleotide changes per site and year
Assumption equal distribution, no silent
mutations
Diversity in Afrikca 3.7 x10-3 nucleotide
changes per site and year
Diversity outside Africa 1.7 x10-3 nucleotide
changes per site and year
Estimated expansion1925 generations ago ca.
40.000 years
Mrca of all humans 171.500 /- 50.000 years ago
Mrca of African and non-African 52,000 /-
27.500 years ago
Experiment supports recent-Africa hypothesis