Molecular Evolution of Proteins and Phylogenetic Analysis Fred R. Opperdoes Christian de Duve Institute of Cellular Pathology (ICP) and Laboratory of Biochemistry, Universit

Description:

Fred R. Opperdoes Christian de Duve Institute of Cellular Pathology (ICP) and Laboratory of Biochemistry, Universit catholique de Louvain, Brussels, Belgium – PowerPoint PPT presentation

Slides: 132

Provided by: FOppe

1
Molecular Evolution of Proteins and Phylogenetic
Analysis
Fred R. Opperdoes
Christian de Duve Institute of Cellular Pathology (ICP) and
Laboratory of Biochemistry, Université catholique
de Louvain, Brussels, Belgium
2
The tree of life based on rRNA sequences
Mitochondriates
Amitochondriates
3
The fusion hypothesis: the eukaryotic cell is a
chimaera of eubacterial and archaebacterial
traits
Energy metabolism
Genetic machinery
Root?
Common ancestor?
4
Triosephosphate isomerase
Triosephosphate isomerase of eukaryotes is of
typical eubacterial origin and probably has
entered the eukaryotic cell together with the
bacterial endosymbiont that gave rise to the
formation of the mitochondrion
Root?
5
Arguments in favour of protein rather than the
DNA sequences

CODON BIAS
64 different possible triplet codes encode 20
amino acids. One amino acid may be encoded by 1
to 6 different triplet codes, and 3 of the 64
codes, called stop (or termination) codons,
specify "end of peptide sequence"
The different codons are used with unequal
frequency and this distribution of frequency is
referred to as "codon usage"
Codon usage varies between species. Amino-acid
codons are degenerate, with wobble in the
third position.
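The codon counts quoted above can be checked with a short sketch. It builds the standard genetic code from the conventional 64-letter string (codons ordered TTT, TTC, TTA, TTG, TCT, ...; `*` marks a stop) and counts how many codons encode each amino acid. The variable names are illustrative assumptions.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet to its amino acid (or '*').
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons per amino acid, i.e. the degeneracy of the code.
degeneracy = Counter(CODON_TABLE.values())
stop_codons = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")

for aa, n in sorted(degeneracy.items(), key=lambda item: (item[1], item[0])):
    print(aa, n)
print("stop codons:", stop_codons)
```

Met and Trp come out with a single codon each, while Leu, Ser and Arg have six.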

6
The universal genetic code

First      Second position              Third
position   U(T)   C      A      G       position
U(T)       Phe    Ser    Tyr    Cys     U(T)
           Phe    Ser    Tyr    Cys     C
           Leu    Ser    STOP   STOP    A
           Leu    Ser    STOP   Trp     G
C          Leu    Pro    His    Arg     U(T)
           Leu    Pro    His    Arg     C
           Leu    Pro    Gln    Arg     A
           Leu    Pro    Gln    Arg     G
A          Ile    Thr    Asn    Ser     U(T)
           Ile    Thr    Asn    Ser     C
           Ile    Thr    Lys    Arg     A
           Met    Thr    Lys    Arg     G
G          Val    Ala    Asp    Gly     U(T)
           Val    Ala    Asp    Gly     C
           Val    Ala    Glu    Gly     A
           Val    Ala    Glu    Gly     G

7
Arguments in favour of ... (codon bias 2)

Yeasts, protozoa, and animals have different
codon preferences.
This results in differences in DNA sequence
that are related to codon bias and not to evolution.

8
Different species use different codons

Homo sapiens gbmam 1 CDS's (389 codons)
fields: triplet, frequency per thousand, (number)
UUU 20.6(     8)  UCU  5.1(     2)  UAU  7.7(     3)  UGU  7.7(     3)
UUC 12.9(     5)  UCC 20.6(     8)  UAC 30.8(    12)  UGC  0.0(     0)
UUA 10.3(     4)  UCA 18.0(     7)  UAA  0.0(     0)  UGA  0.0(     0)
UUG 10.3(     4)  UCG  0.0(     0)  UAG  2.6(     1)  UGG 15.4(     6)

Saccharomyces cerevisiae gbpln 9295 CDS's (4586264 codons)
fields: triplet, frequency per thousand, (number)
UUU 25.9(118900)  UCU 23.6(108308)  UAU 18.7( 85651)  UGU  8.0( 36624)
UUC 18.3( 83880)  UCC 14.3( 65421)  UAC 14.7( 67599)  UGC  4.6( 21255)
UUA 26.3(120698)  UCA 18.7( 85618)  UAA  1.0(  4476)  UGA  0.6(  2742)
UUG 27.2(124967)  UCG  8.5( 39137)  UAG  0.4(  2058)  UGG 10.4( 47694)
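The "frequency per thousand" column in these tables is the codon count divided by the total number of codons, times 1000. The sketch below reproduces a few of the tabulated values and compares the share of UUU among the two phenylalanine codons in the two samples. The function names are illustrative assumptions, and the single human CDS is far too small a sample to draw conclusions from.

```python
def per_thousand(count, total_codons):
    """Codon frequency per thousand codons, rounded to one decimal."""
    return round(1000.0 * count / total_codons, 1)


def synonymous_fraction(counts, codon):
    """Share of one codon among the synonymous codons listed in counts."""
    return counts[codon] / sum(counts.values())


# Phenylalanine codon counts taken from the two tables above.
human_phe = {"UUU": 8, "UUC": 5}
yeast_phe = {"UUU": 118900, "UUC": 83880}

print(per_thousand(8, 389))            # human UUU
print(per_thousand(118900, 4586264))   # yeast UUU
print(round(synonymous_fraction(human_phe, "UUU"), 3))
print(round(synonymous_fraction(yeast_phe, "UUU"), 3))
```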
9
Differences between the Universal and
Mitochondrial Genetic Codes

Codon   Universal code   Mitochondrial code
UGA     Stop             Trp
AGA     Arg              Stop
AGG     Arg              Stop (or Lys*)
AUA     Ile              Met
Modified from Li and Graur, 1991, Fundamentals
of Molecular Evolution, Sinauer Publ.
* Only in arthropod mitochondria (Abascal et al.,
PLoS Biol 4, e127 (2006))

10
Arguments in favour... (codon bias)

Also, some protozoa (e.g. ciliates) use the codons
UAA and UAG to encode glutamine, rather than STOP
The inclusion of unique codons in a subset of the
sequences will tend to make that subset appear
more divergent than it really is

11
Arguments in favour... (codon bias 2)

High GC content of DNA seems to be associated
with aerobiosis in prokaryotes (Naya et al.,
2002)
In all major groups both organisms with AT rich
and GC rich DNA can be found.
The inclusion of unique codons in a subset of the
sequences will tend to make that subset appear
more divergent than it really is

12
GC content of DNA in aerobic and anaerobic
prokaryotes
Anaerobic
Aerobic
From Naya et al., J. Mol. Evol. 55 (2002) 260-264
13
The use of protein sequences in phylogeny
requires knowledge of the properties of the
amino acids and their single letter codes
14
The use of protein sequences in phylogeny
requires knowledge of the properties of the
amino acids and their single letter codes

Alanine        A     Leucine        L
Arginine       R     Lysine         K
Asparagine     N     Methionine     M
Aspartic acid  D     Phenylalanine  F
Cysteine       C     Proline        P
Glutamic acid  E     Serine         S
Glutamine      Q     Threonine      T
Glycine        G     Tryptophan     W
Histidine      H     Tyrosine       Y
Isoleucine     I     Valine         V

15
Arguments in favour of a phylogenetic analysis of
the corresponding protein rather than the DNA

LONG TIME HORIZON
When comparing sequences that have diverged for
possibly a billion years or more, it is very
likely that the wobble bases in the codons will
have become randomized. By excluding the wobble
bases (a general technique), one is actually
looking at amino acid sequences. So why not
take the protein sequence directly?

16
Advantages of the translation of DNA into protein
(1)

DNA is composed of only four kinds of unit: A, G,
C and T
If gaps are not allowed, on average 25% of
residues in two randomly chosen aligned sequences
would be identical
If gaps are allowed, as much as 50% of residues
in two randomly chosen aligned sequences can be
identical. Such a situation may obscure any
genuine relationship that may exist, especially
when comparing distantly related or rapidly
evolving gene sequences
Moreover, it is easier to translate a gene
sequence into its corresponding protein than to
remove the third wobble base from each of the
codons in the gene
All open reading frames have already been
translated into their corresponding peptide
sequences (GenPept and UniProt databases)
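The ungapped expectation can be checked by simulation. The sketch below draws two random sequences and measures the fraction of identical positions. It assumes a uniform residue composition (every letter equally likely), which real sequences do not have, so it is only a rough illustration. With a four-letter alphabet the identity settles near 1/4, and with twenty letters near 1/20.

```python
import random


def random_sequence(alphabet, length, rng):
    """Random sequence with every letter of the alphabet equally likely."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def ungapped_identity(a, b):
    """Fraction of identical positions when a and b are aligned without gaps."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / min(len(a), len(b))


rng = random.Random(0)  # fixed seed so the run is repeatable
DNA = "ACGT"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"

dna_identity = ungapped_identity(
    random_sequence(DNA, 20000, rng), random_sequence(DNA, 20000, rng)
)
protein_identity = ungapped_identity(
    random_sequence(PROTEIN, 20000, rng), random_sequence(PROTEIN, 20000, rng)
)
print(f"DNA: {dna_identity:.3f}  protein: {protein_identity:.3f}")
```

Short sequences scatter widely around these expectations, which is one reason a single pair of random sequences can show a noticeably different value.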

17
Alignment of two random DNA sequences
Without indels: 19% identity
Indels allowed: 56% identity
18
Advantages of the translation of DNA into protein
(2)

Translation of DNA into 21 different types of
codon (20 amino acids and a terminator) allows
the information to sharpen up considerably. Wrong
frame information is set aside
Third-base degeneracies are consolidated
After insertion of gaps to align two random
protein sequences it can be expected that they
are between 10 and 20% identical
As a result of the translation procedure the
protein sequences with their 20 amino acids are
much easier to align than the corresponding
DNA sequences with only 4 nucleotides

19
Alignment of two random protein sequences
Without indels: 7% identity
Indels allowed: 22% identity
20
Advantages of the translation of DNA into protein
(3)

If, after this, you still want to align distantly
related gene sequences, you had better first prepare
a protein alignment and then use this alignment
as the basis for aligning the gene
sequences and for the precise placement of indels in
the aligned sequences (use EMBOSS tranalign).
Conclusion: the signal-to-noise ratio is greatly
improved when using protein sequences rather than DNA
sequences!
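The idea of letting the protein alignment dictate the gene alignment can be sketched in a few lines. Each residue of the aligned protein is replaced by its codon and each protein gap by three gap characters. This is only an illustration of the principle, not the EMBOSS tranalign implementation. The function name is hypothetical, and the sketch assumes that the coding sequence starts in frame and contains one codon per residue.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Copy the gaps of an aligned protein onto its coding DNA, codon by codon."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    residues = sum(1 for ch in aligned_protein if ch != gap)
    if len(codons) < residues or any(len(c) != 3 for c in codons[:residues]):
        raise ValueError("coding sequence too short for the aligned protein")
    out = []
    next_codon = 0
    for ch in aligned_protein:
        if ch == gap:
            out.append(gap * 3)             # one protein gap = three DNA gaps
        else:
            out.append(codons[next_codon])  # residue -> its own codon
            next_codon += 1
    return "".join(out)


# Two toy sequences: the second lacks one residue relative to the first.
print(thread_codons("MKL", "ATGAAACTG"))
print(thread_codons("M-L", "ATGTTA"))
```

Because gaps are always inserted in multiples of three, the reading frame of every gene stays intact in the resulting nucleotide alignment.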

21
TBLASTX

The blast algorithm TBLASTX allows the use of
translated nucleic acid sequence information to
search for distant relationships between genes
The nucleotide query is translated in all six
reading frames and compared with the six-frame
translations of all the sequences in a nucleotide
database

22
NCBI BLASTN output
23
NCBI TBLASTX output
24
Nature of Sequence Divergence in Proteins

The observed sequence difference of two diverging
sequences takes the course of a negative
exponential. This is the result of the fact that
each position is subject to reverse changes
("back mutations") and multiple hits
Thus the observed percentage of difference
between the protein sequences is not proportional
to the actual evolutionary difference between two
homologous sequences
The evolutionary distance between two proteins is
expressed in PAM units. PAM (Dayhoff and Eck,
1968) stands for "accepted point mutation"

25
Relation between observed distance and PAM distance

PAM value    Distance (%)
80           50
100          60
200          75
250          85    Twilight zone
300          92
(From Doolittle, 1987, Of URFs and ORFs,
University Science Books)
As the evolutionary distance increases, the
probability of superimposed mutations becomes
greater, so the observed percent difference falls
further and further short of the actual number
of changes.

26
Relation between distance and PAM distance
[Figure: plot of distance against PAM value, with the twilight zone marked]
27
The Kimura correction for multiple substitutions

The formula used to correct for multiple hits is
from Motoo Kimura (Kimura, M., The Neutral Theory
of Molecular Evolution, Cambridge Univ. Press, 1983,
page 75):
K = -ln(1 - D - D^2/5), where D is the observed
distance and K is the corrected distance.
This formula gives the mean number of estimated
substitutions per site and, in contrast to D (the
observed number), can be greater than 1, i.e. more
than one substitution per site, on average. For
example, if you observe 0.8 differences per site
(80% difference, 20% identity), then the above
formula predicts that there have been about 2.6
substitutions per site over the course of
evolution since the 2 sequences diverged.
This can also be expressed in PAM units by
multiplying by 100 (mean number of substitutions
per 100 residues).
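The correction is easy to evaluate directly. The sketch below implements K = -ln(1 - D - D^2/5) and applies it to D = 0.8 and to some of the uncorrected values that appear in the ClustalX distance matrix of slide 47. The function name is an illustrative assumption. The formula is undefined once 1 - D - D^2/5 drops to zero or below (D of roughly 0.85 or more), so the sketch raises an error in that case.

```python
import math


def kimura_distance(d):
    """Kimura's corrected protein distance for an observed fraction d of differences."""
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0.0:
        raise ValueError("observed distance too large for the Kimura correction")
    return -math.log(arg)


print(round(kimura_distance(0.8), 2))          # substitutions per site
print(round(100 * kimura_distance(0.8)))       # same distance in PAM units

# Uncorrected values from the ClustalX matrix on slide 47.
for observed in (0.036, 0.089, 0.232, 0.268):
    print(observed, round(kimura_distance(observed), 3))
```

For small D the corrected and observed distances are nearly equal. The correction only becomes substantial as the sequences approach the twilight zone.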

28
Proteins evolve at highly different rates

Protein                           Rate of change       Theoretical lookback time
                                  (PAMs / 10^8 yrs)    (x 10^6 yrs)
Pseudogenes                       400                  45
Fibrinopeptides                   90                   200
Lactalbumins                      27                   670
Lysozymes                         24                   850
Ribonucleases                     21                   850
Haemoglobins                      12                   1500
Acid proteases                    8                    2300
Cytochrome c                      4                    5000
Glyceraldehyde-P dehydrogenase    2                    9000
Glutamate dehydrogenase           1                    18000
PAM = number of Accepted Point Mutations per
100 amino acids. Useful lookback time = 360 PAMs
29
Some Important Dates in History

Event                                 Number of years ago
Origin of the Universe                15 ± 4 x 10^9 yrs
Formation of the Solar System         4.6 "
First Self-replicating System         3.5 ± 0.5 "
Prokaryotic-Eukaryotic Divergence     2.0 ± 0.5 "
Plant-Animal Divergence               1.0 "
Invertebrate-Vertebrate Divergence    0.5 "
Mammalian Radiation Beginning         0.1 "
From Doolittle, Of URFs and ORFs, 1987

30
Construction of a phylogenetic tree from
phosphoglycerate kinase sequences
31
Arguments in favour of a protein rather than a
DNA sequence (3)

INTRONS
A study of the evolution of a protein using its
DNA sequence should only include coding sequences
This requires that all the introns are edited
out of every DNA sequence. This may be
cumbersome and time-consuming
An easier approach would be the direct
translation of the cDNA sequence into its
corresponding protein sequence

32
Typical structure of a eukaryotic gene
[Figure: gene diagram labelled, from 5' to 3': flanking region, TATA box, transcription initiation, initiation codon, Exon 1, Intron I, Exon 2, Intron II, Exon 3, stop codon, AATAA, poly(A) addition site, flanking region]
33
Arguments in favour of a protein rather than a
DNA sequence (4)

MULTIGENE FAMILIES
Organisms may contain many highly similar genes,
while only one peptide sequence can be identified
(e.g. histones, tubulins and GAPDH in humans).
Using these DNA sequences, it would be difficult
to decide which are expressed and which not and
thus which genes to include in the analysis.
Moreover, if all the genes that are expressed
encode the same protein, then DNA differences are
not significant

34
Arguments in favour of a protein rather than a
DNA sequence (5)

PROTEIN IS THE UNIT OF SELECTION
For protein-encoding genes, the object on which
natural selection acts is the protein itself.
The underlying DNA sequence reflects this process
in combination with species-specific pressures on
DNA sequence (like the need for aerophiles to
have DNA that is GC richer).
If function demands that a protein maintains a
specific sequence, there still is room for the
DNA sequence to change.

35
Arguments in favour of a protein rather than a
DNA sequence (6)

RNA EDITING
The DNA sequence does not always translate
directly into the amino acid sequence.
In post-transcriptional (RNA) editing, non-coded
nucleotides are added to the transcript or coded
nucleotides are removed from it.
This can lead to major differences in DNA
sequence (sometimes more than 50%) that
nevertheless lead to roughly the same protein
sequence after final editing

36
Pan-editing of mitochondrial RNA in Kinetoplastida
UCCuAuuAAuUUUUUGuUAUAu AGuuuuuuAAUGUUGuuuGGuGu
A uuuuuuuAuUGUGuuuAGuuuuG uuuuGuuGuuGuuuGuuuG
GU GuGuuAuuGUUUUGAGAuuGuuG
Note that the mature mRNA would not be able to
hybridise with the gene present in the
kinetoplast DNA, so the gene cannot be detected
in this way.
37
Some good advice (1)

It is recommended to prepare the phylogenetic
trees both ways (DNA and Protein) and see how
they look
For a group of species that are relatively close
in time and closely related (like viral proteins
or vertebrate enzymes), DNA-based analysis is
probably a good way to go, since you avoid
problems of codon bias and randomization of
wobble bases. But check the protein anyway

38
Some good advice (2)

Be aware of the problems of multigene families
(for instance coding for isoenzymes)
Be careful when you decide to exclude or include
such sequences (you may compare paralogous
rather than orthologous sequences)

39
What is required

A DNA or protein sequence
A set of homologous sequences
A good multiple sequence alignment
Several programs to create a phylogenetic tree

42
What is required

A DNA or protein sequence
A set of homologous sequences
A good multiple sequence alignment
Several programs to create a phylogenetic tree

45
Alignment parameters in ClustalX
46
PAM 250 matrix as used in Clustal

C 12,
S 0, 2,
T -2, 1, 3,
P -3, 1, 0, 6,
A -2, 1, 1, 1, 2,
G -3, 1, 0,-1, 1, 5,
N -4, 1, 0,-1, 0, 0, 2,
D -5, 0, 0,-1, 0, 1, 2, 4,
E -5, 0, 0,-1, 0, 0, 1, 3, 4,
Q -5,-1,-1, 0, 0,-1, 1, 2, 2, 4,
H -3,-1,-1, 0,-1,-2, 2, 1, 1, 3, 6,
R -4, 0,-1, 0,-2,-3, 0,-1,-1, 1, 2, 6,
K -5, 0, 0,-1,-1,-2, 1, 0, 0, 1, 0, 3, 5,
M -5,-2,-1,-2,-1,-3,-2,-3,-2,-1,-2, 0, 0, 6,
I -2,-1, 0,-2,-1,-3,-2,-2,-2,-2,-2,-2,-2, 2, 5,
L -6,-3,-2,-3,-2,-4,-3,-4,-3,-2,-2,-3,-3, 4, 2, 6,
V -2,-1, 0,-1, 0,-1,-2,-2,-2,-2,-2,-2,-2, 2, 4, 2, 4,
F -4,-3,-3,-5,-4,-5,-4,-6,-5,-5,-2,-4,-5, 0, 1, 2,-1, 9,
Y 0,-3,-3,-5,-3,-5,-2,-4,-4,-4, 0,-4,-4,-2,-1,-1,-2, 7,10,

47
ClustalX distance matrix

Non-corrected
AROC_LEIMJ 0.000 0.036 0.268 0.268 0.232
AROC_PSEAE 0.036 0.000 0.268 0.268 0.232
AROC_VIBCH 0.268 0.268 0.000 0.089 0.232
AROC_VIBAN 0.268 0.268 0.089 0.000 0.232
AROC_NEIMB 0.232 0.232 0.232 0.232 0.000
Corrected for multiple substitution
AROC_LEIMJ 0.000 0.037 0.332 0.332 0.278
AROC_PSEAE 0.037 0.000 0.332 0.332 0.278
AROC_VIBCH 0.332 0.332 0.000 0.095 0.278
AROC_VIBAN 0.332 0.332 0.095 0.000 0.278
AROC_NEIMB 0.278 0.278 0.278 0.278 0.000

48
Matrices often used for the alignment of proteins

PAM 350 (Dayhoff et al., 1978)
BLOSUM30 (Henikoff-Henikoff, 1992)
JTT (Jones et al., 1992)
mtREV24 (Adachi-Hasegawa, 1996)
GONNET 250 matrix (Gonnet et al., 1992)

49
Alignment of two protein sequences (1)

For the creation of a phylogenetic tree a good
alignment of protein sequences is of vital
importance
Only homologous residues should be aligned with
each other
Doubtful regions should not be included in the
alignment
Aligned sequences should have similar lengths

50
Alignment of two protein sequences

Alignment requires the user to make assumptions
regarding relative costs of substitution versus
insertions and deletions (indels).
If substitution cost >> gap penalty there will
be many short gaps and no phylogenetic
information.
In general search for maximum similarity and
minimize the number of insertions and deletions.
Exclude regions that cannot be aligned
unambiguously!
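The effect of the relative costs can be seen with a minimal global alignment (Needleman-Wunsch) sketch. The scoring values, the linear gap penalty and the function name are illustrative assumptions, not ClustalX defaults. When gaps cost nothing compared with a mismatch, the alignment fills up with gaps. When gaps are expensive, the same two sequences are aligned without any.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # substitution
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


cheap = global_align("ACGT", "TGCA", gap=0)    # gaps are free
costly = global_align("ACGT", "TGCA", gap=-5)  # gaps are expensive
print(cheap)
print(costly)
```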

51
Multiple alignment of protein sequences

For the construction of reliable phylogenetic
trees the quality of a multiple alignment is of
the utmost importance
There are many programs available for the
multiple alignment of proteins.
A good program in the public domain is ClustalW
or ClustalX
Available on the web for free and for any
platform (PC, Mac, Unix/Linux)
They quickly align sequence pairs and roughly
determine the degrees of identity between each
pair
Then the sequences are aligned more precisely in
a progressive way starting with the two closest
sequences
Most programs work best when the
sequences have similar length.

52
Some rules of thumb for the manual alignment of
proteins (1)

An automatically produced multiple alignment
often needs manual adjustment to improve the
quality of the alignment.
Such improvement can be obtained by using all the
knowledge that is available about a protein.
If a structure is available you should use the
detailed information about secondary structure
for the alignment.

53
What is required

A DNA or protein sequence
A set of homologous sequences
A good multiple sequence alignment
Several programs to create a phylogenetic tree

54
Tree construction methods (2)

Character-based methods
maximum parsimony
maximum likelihood
Non-character-based methods
distance matrix methods

55
Tree construction methods (1)

Distance matrix methods
Cluster analysis (UPGMA, WPGMA, etc.)
Fitch & Margoliash (1967)
Transformed distance methods (e.g. Li, 1981)
Neighbor-joining (Saitou & Nei, 1987)
...many more
Parsimony methods
Maximum parsimony (Protpars, DNApars, PAUP)
Other methods
Maximum likelihood (DNAML, ProtML, TreePuzzle)
SplitsTree, MrBayes
... many more

56
Text available from: opperdoes_at_bchm.ucl.ac.be
Text and slides: http://www.icp.be/opperd/chapter8/
Website: http://www.icp.be/opperd/private/proteins.html
http://www.icp.be/opperd/private/phylogeny_2006_Athens.ppt
57
Distance Matrix Methods

UPGMA (Unweighted Pair Group with Arithmetic
Mean) uses real (uncorrected) distance values and
a sequential clustering algorithm. (Should only
be used with closely related OTUs, or when there
is constancy of evolutionary rate)
Neighbor-relation methods
FITCH (Fitch, 1981)
Neighbor-Joining method (Saitou and Nei, 1987)
Should all be used with corrected (see above)
distance matrices

58
Alignment of two protein sequences (1)

For the creation of a phylogenetic tree a good
alignment of protein sequences is of vital
importance
Only homologous residues should be aligned with
each other
Doubtful regions should not be included in the
alignment
Aligned sequences should have similar lengths

59
Pair-wise alignment of two protein sequences
according to the Dot-Matrix method
60
Dot-Matrix plots
Two homologous sequences with 81% identity
Two homologous sequences with 50% identity
61
Pair-wise alignment of two protein sequences
according to the Dot-Matrix method
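A dot-matrix comparison is simple to sketch in text form. One sequence is laid out along the rows, the other along the columns, and a mark is placed wherever the residues in a short window match. Homologous stretches then show up as diagonal runs of marks. The window size, the match threshold and the function name are illustrative assumptions.

```python
def dot_matrix(a, b, window=1, min_matches=1):
    """Text dot plot: '*' where at least min_matches of window residues agree."""
    rows = []
    for i in range(len(a) - window + 1):
        row = []
        for j in range(len(b) - window + 1):
            matches = sum(a[i + k] == b[j + k] for k in range(window))
            row.append("*" if matches >= min_matches else ".")
        rows.append("".join(row))
    return rows


# Two toy sequences that differ by a two-residue insertion in the second.
for line in dot_matrix("MKVLGAT", "MKVAALGAT", window=2, min_matches=2):
    print(line)
```

With a window of one residue, chance matches clutter the plot. A larger window with a threshold filters most of them out while keeping the diagonals, and a shift between two diagonal runs indicates an indel.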
62
Alignment of two protein sequences (2)

Alignment requires the user to make assumptions
regarding relative costs of substitution versus
insertions and deletions (indels).
If substitution cost >> gap penalty there will
be many short gaps and no phylogenetic
information.
In general search for maximum identity and
minimize the number of insertions and deletions.
Exclude regions that cannot be aligned
unambiguously!
Visual alignment is possible using the
"dot-matrix method"

63
Identity matrix as used in Clustal

C 10,
S 0, 10,
T 0, 0, 10,
P 0, 0, 0, 10,
A 0, 0, 0, 0, 10,
G 0, 0, 0, 0, 0, 10,
N 0, 0, 0, 0, 0, 0, 10,
D 0, 0, 0, 0, 0, 0, 0, 10,
E 0, 0, 0, 0, 0, 0, 0, 0, 10,
Q 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
H 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
R 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
K 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
M 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
I 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
L 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
V 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
F 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,

64
Distance matrix with mutation costs for amino
acids

       A S G L K V T P E D N I Q R F Y C H M W Z B X
Ala A  0 1 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
Ser S  1 0 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 2 2 1 2 2 2
Gly G  1 1 0 2 2 1 2 2 1 1 2 2 2 1 2 2 1 2 2 1 2 2 2
Leu L  2 1 2 0 2 1 2 1 2 2 2 1 1 1 1 2 2 1 1 1 2 2 2
Lys K  2 2 2 2 0 2 1 2 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2
Val V  1 2 1 1 2 0 2 2 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2
Thr T  1 1 2 2 1 2 0 1 2 2 1 1 2 1 2 2 2 2 1 2 2 2 2
Pro P  1 1 2 1 2 2 1 0 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2
Glu E  1 2 1 2 1 1 2 2 0 1 2 2 1 2 2 2 2 2 2 2 1 2 2
Asp D  1 2 1 2 2 1 2 2 1 0 1 2 2 2 2 1 2 1 2 2 2 1 2
Asn N  2 1 2 2 1 2 1 2 2 1 0 1 2 2 2 1 2 1 2 2 2 1 2
Ile I  2 1 2 1 1 1 1 2 2 2 1 0 2 1 1 2 2 2 1 2 2 2 2
Gln Q  2 2 2 1 1 2 2 1 1 2 2 2 0 1 2 2 2 1 2 2 1 2 2
Arg R  2 1 1 1 1 2 1 1 2 2 2 1 1 0 2 2 1 1 1 1 2 2 2
Phe F  2 1 2 1 2 1 2 2 2 2 2 1 2 2 0 1 1 2 2 2 2 2 2
Tyr Y  2 1 2 2 2 2 2 2 2 1 1 2 2 2 1 0 1 1 3 2 2 1 2
Cys C  2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 0 2 2 1 2 2 2

65
Hydrophobicity matrix

        R  K  D  E  B  Z  S  N  Q  G  X  T  H  A  C  M  P  V  L  I  Y  F  W
Arg R  10 10  9  9  8  8  6  6  6  5  5  5  5  5  4  3  3  3  3  3  2  1  0
Lys K  10 10  9  9  8  8  6  6  6  5  5  5  5  5  4  3  3  3  3  3  2  1  0
Asp D   9  9 10 10  8  8  7  6  6  6  5  5  5  5  5  4  4  4  3  3  3  2  1
Glu E   9  9 10 10  8  8  7  6  6  6  5  5  5  5  5  4  4  4  3  3  3  2  1
Asx B   8  8  8  8 10 10  8  8  8  8  7  7  7  7  6  6  6  5  5  5  4  4  3
Glx Z   8  8  8  8 10 10  8  8  8  8  7  7  7  7  6  6  6  5  5  5  4  4  3
Ser S   6  6  7  7  8  8 10 10 10 10  9  9  9  9  8  8  7  7  7  7  6  6  4
Asn N   6  6  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  7  7  7  6  6  4
Gln Q   6  6  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  7  7  7  6  6  4
Gly G   5  5  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  8  7  7  6  6  5
??? X   5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  8  8  8  8  7  7  5
Thr T   5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  8  8  8  8  7  7  5
His H   5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  9  8  8  8  7  7  5
Ala A   5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  9  8  8  8  7  7  5
Cys C   4  4  5  5  6  6  8  8  8  8  9  9  9  9 10 10  9  9  9  9  8  8  5
Met M   3  3  4  4  6  6  8  8  8  8  9  9  9  9 10 10 10 10  9  9  8  8  7
Pro P   3  3  4  4  6  6  7  8  8  8  8  8  9  9  9 10 10 10  9  9  9  8  7
Val V   3  3  4  4  5  5  7  7  7  8  8  8  8  8  9 10 10 10 10 10  9  8  7

66
PAM 1 mutation matrix

1 PAM evolutionary distance
        Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
          A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A  9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R     1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N     4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D     6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C     1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q     3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E    10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G    21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H     1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I     2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L     3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K     2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M     1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F     1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P    13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2

67
PAM 100 matrix as used in Clustal

C 14,
S -1, 6,
T -5, 2, 7,
P -6, 1, -1, 10,
A -5, 2, 2, 1, 6,
G -8, 1, -3, -3, 1, 8,
N -8, 2, 0, -3, -1, -1, 7,
D -11, -1, -2, -4, -1, -1, 4, 8,
E -11, -2, -3, -3, 0, -2, 1, 5, 8,
Q -11, -3, -3, -1, -2, -5, -1, 1, 4, 9,
H -6, -4, -5, -2, -5, -7, 2, -1, -2, 4, 11,
R -6, -1, -4, -2, -5, -8, -3, -6, -5, 1, 1, 10,
K -11, -2, -1, -4, -4, -5, 1, -2, -2, -1, -3, 3, 8,
M -11, -4, -2, -6, -3, -8, -5, -8, -6, -2, -7, -2, 1, 13,
I -5, -4, -1, -6, -3, -7, -4, -6, -5, -5, -7, -4, -4, 2, 9,
L -12, -7, -5, -5, -5, -8, -6, -9, -7, -3, -5, -7, -6, 4, 2, 9,
V -4, -4, -1, -4, 0, -4, -5, -6, -5, -5, -6, -6, -6, 1, 5, 1, 8,
F -10, -5, -6, -9, -7, -8, -6,-11,-11,-10, -4, -7,-11, -2, 0, 0, -5, 12,

68
PAM 250 matrix as used in Clustal

C 12,
S 0, 2,
T -2, 1, 3,
P -3, 1, 0, 6,
A -2, 1, 1, 1, 2,
G -3, 1, 0,-1, 1, 5,
N -4, 1, 0,-1, 0, 0, 2,
D -5, 0, 0,-1, 0, 1, 2, 4,
E -5, 0, 0,-1, 0, 0, 1, 3, 4,
Q -5,-1,-1, 0, 0,-1, 1, 2, 2, 4,
H -3,-1,-1, 0,-1,-2, 2, 1, 1, 3, 6,
R -4, 0,-1, 0,-2,-3, 0,-1,-1, 1, 2, 6,
K -5, 0, 0,-1,-1,-2, 1, 0, 0, 1, 0, 3, 5,
M -5,-2,-1,-2,-1,-3,-2,-3,-2,-1,-2, 0, 0, 6,
I -2,-1, 0,-2,-1,-3,-2,-2,-2,-2,-2,-2,-2, 2, 5,
L -6,-3,-2,-3,-2,-4,-3,-4,-3,-2,-2,-3,-3, 4, 2, 6,
V -2,-1, 0,-1, 0,-1,-2,-2,-2,-2,-2,-2,-2, 2, 4, 2, 4,
F -4,-3,-3,-5,-4,-5,-4,-6,-5,-5,-2,-4,-5, 0, 1, 2,-1, 9,
Y 0,-3,-3,-5,-3,-5,-2,-4,-4,-4, 0,-4,-4,-2,-1,-1,-2, 7,10,

69
Matrices often used for the alignment of proteins

PAM 250 (Dayhoff et al., 1978)
BLOSUM62 (Henikoff-Henikoff, 1992)
JTT (Jones et al., 1992)
mtREV24 (Adachi-Hasegawa, 1996)
GONNET matrix (Gonnet et al., 1992)

70
Multiple alignment of protein sequences

For the construction of reliable phylogenetic
trees the quality of a multiple alignment is of
the utmost importance
There are many programs available for the
multiple alignment of proteins.
A good program in the public domain is ClustalW
A similar program is Pileup of the GCG package
They quickly align sequence pairs and roughly
determine the degrees of identity between each
pair
Then the sequences are aligned more precisely in
a progressive way starting with the two closest
sequences
Most programs work best when the
sequences have similar length.

71
Some rules of thumb for the manual alignment of
proteins (1)

An automatically produced multiple alignment
often needs manual adjustment to improve the
quality of the alignment.
Such improvement can be obtained by using all the
knowledge that is available about a protein.
If a structure is available you should use the
detailed information about secondary structure
for the alignment.

72
Some rules of thumb for the manual alignment of
proteins (2)

The rules for mutation of amino acids are
dependent on their physicochemical properties.
Surface residues (DRENK) are preferably mutated
to residues of similar properties. Since they are
not, or less, involved in protein folding they
mutate rather easily.
Hydrophobic residues (FAMILYVW) are
preferentially replaced by other hydrophobic
ones. These residues are mainly internal and
determine the folding of the protein. They thus
mutate rather slowly.

73
Some rules of thumb for the manual alignment of
proteins (3)

The residues CHQST are indifferent and may be
replaced with any other type of residue
The residues (DRENKCHQST), when conserved
throughout the alignment are very likely residues
that are involved in the active site. So the
multiple alignment should be adjusted
accordingly
Periodicity of charged residues may provide
information as to the presence of elements of
secondary structure such as α-helices and
β-strands

74
α-helix
75
β-strand
76
Some rules of thumb for the manual alignment of
proteins (4)

Indels (insertions/deletions) are rarely found
within elements of secondary structure; they
occur mostly in loops.
Pro and Gly interfere with secondary structure
elements and thus have a preference for loops
Hydrophobicity (or hydropathy) profiles according
to Kyte and Doolittle of two homologous proteins
are in general strikingly similar

77
Proline interferes with α-helix and β-sheet
formation
From Deber and Therien, 2002
78
Possible functions for proline in trans-membrane
domains
From Deber and Therien, 2002
79
Alignment of malate dehydrogenase sequences
lclCHR34_tmp.0150   ----MKPST--LSRFKVTVLGASGAIGQPLALALVQNKRVSEL-----ALYDIVQPR---
lclCHR34_tmp.0140   ----MRRSQ--GCFFRVAVLGAAGGIGQPLSLLLKNNKYVKEL-----KLYDVKGGP---
lclCHR34_tmp.0130   MGLLFRRSLTALKKGKVVLFGCSNAVGQPLSLLLKMNPHVEELVCCNTAADDDVPGS---
lclCHR28_tmp.0050   -----------MSAVKVAVTGAAGQIGYALVPLIARGALLGPTTPVELRLLDIEPALKAL

lclCHR34_tmp.0150   -GVAVDLSHFPRKVKVTGYPTKWIHK--ALDGADLVLMSAGMPRRPGMT-HDDLFNTNAL
lclCHR34_tmp.0140   -GVAADLSHICAPAKVTGYTKDELSR--AVENADVVVIPAGIPRKPGMT-RDDLFNTNAS
lclCHR34_tmp.0130   -GIAADLSHIDTLPKVH-YATDEGQWPALLRDAQLILVCFGSSFDLLREDRDIALKAAAP
lclCHR28_tmp.0050   AGVEAELEDCAFPLLDKVVVTADPRV--AFDGVAIAIMCGAFPRKAGME-RKDLLEMNAR

lclCHR34_tmp.0150   TVNELSAAVARYAPKSV-LAIISNPLNSMVPVAAETLQRAGVYDPRKLFGIISLNMMRAR
lclCHR34_tmp.0140   IVRDLAIAVGTHAPKAI-VGIITNPVNSTVPVAAEALKKVGVYDPARLFGVTTLDVVRAR
lclCHR34_tmp.0130   TMRRVMAAVASSDTTGN-VAVVSSPVNALTPFCAELLKASGKFDPRKLFGVTTLDVIRTR
lclCHR28_tmp.0050   IFKEQGEAIAAVAASDCRVVVVGNPANTNALILLKSAQ--GKLNPRHVTAMTRLDHNRAL

lclCHR34_tmp.0150   KMLGDFTGQDPEMLDVPVIGGHSGQTIVPLFSHS--GVELRQEQVEYLTHRVR-------
lclCHR34_tmp.0140   TFVAEALGASPYDVDVPVIGGHSGETIVPLLSG---FPSLSEEQVRQLTHRIQ-------
lclCHR34_tmp.0130   KLVAGTLHMNPYDVNVPVVGGCGGVTACPLIAQT--GLRIPLDDIVRISGEVQSYGVLFE
lclCHR28_tmp.0050   SLLARKAGVPVSQVRNVIIWGNHSSTQVPDTDSAVIGTTPAREAIKDDALDDD-----FV

lclCHR34_tmp.0150   --VGGD-EVVKAKEGRGSSSLSMAFAAAEWADGVLRAMDGEKTLLQCSFVESPLFADKCR
lclCHR34_tmp.0140   --FGGD-EVVKAKDGAGSATLSMAFAGNEWTTAVLRALSGEKGVVVCTYVQS-TVEPSCA
lclCHR34_tmp.0130   AAVGADSHDALSTEVAPPVALGLAYAACDFSTSLLKALRGDVGIVECALVES-TMRSETP
lclCHR28_tmp.0050   QVVRGRGAEIIQLRGLSSAMSAAKAAVDHVHDWIHGTPEGVYVSMGVYSDENPYGVPSGL

lclCHR34_tmp.0150   FFGSTVEVCKEGIERVLPLPPLNEYEEEQLDRCLPDLEKN-IRKGLAFVAENAATSTPST
lclCHR34_tmp.0140   FFSSPVLLGNSGVEKIYPVPMLNAYEEKLMAKCLEGLQSN-ITKGIAFSNK---------
lclCHR34_tmp.0130   FFSSRVELGREGVQRVFPMGALTSYEHELIETAVPELMRD-VQAGIEAATQF--------
lclCHR28_tmp.0050   IFSFP-CTCHAGEWTVVSGKLNGDLGKQRLASTIAELQEERAQAGL--------------
80
Hydrophobicity profiles

Profiles according to Kyte and Doolittle of
homologous proteins are in general strikingly
similar and may provide a tool in the alignment
of two or more proteins.
The two phosphoglycerate kinase sequences below
share 50% identical residues.

Trypanosoma congolense PGK
Euglena gracilis PGK
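A hydropathy profile of this kind is a sliding-window average of per-residue hydropathy values. The sketch below uses the Kyte and Doolittle scale. The window length of 9 and the function name are illustrative assumptions, and the example sequence is simply the ungapped start of the first malate dehydrogenase sequence from slide 79.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=9):
    """Mean hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


example = "MKPSTLSRFKVTVLGASGAIGQPLALALVQNKRVSEL"
profile = hydropathy_profile(example)
for position, value in enumerate(profile, start=1):
    print(position, round(value, 2))
```

Plotting the profiles of two homologous proteins on the same axis shows where peaks and troughs line up, which can guide the placement of gaps in doubtful regions.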
81
Tree construction methods (1)

Distance matrix methods
Cluster analysis (UPGMA, WPGMA, etc.)
Fitch & Margoliash (1967)
Transformed distance methods (e.g. Li, 1981)
Neighbor-joining (Saitou & Nei, 1987)
...many more
Parsimony methods
Maximum parsimony
Other methods
Maximum likelihood (Felsenstein, 1981)
... many more

82
Tree construction methods (2)

Character-based methods
maximum parsimony
maximum likelihood
Non-character-based methods
distance matrix methods

83
Phylogeny (2)

Distance matrix methods (in the public domain)
Least squares method (Fitch and Margoliash)
Fitch, Kitsch of the Phylip package (Joe
Felsenstein, Univ. Washington)
Neighbor-joining method
Neighbor of the Phylip package (Joe Felsenstein,
Univ. Washington)
Clustal, or Distnj in the Protml package (Adachi and
Hasegawa, Univ. Tokyo)
Darwin (Gaston Gonnet, ETH, Zurich, via
mailserver or WWW)
Protein maximum likelihood (in the public domain)
Protml (Adachi and Hasegawa, Univ. Tokyo) (very
CPU intensive)
TreePuzzle (Strimmer and von Haeseler, 1997)
Protein maximum parsimony (in the public domain)
Protpars (Joe Felsenstein, Univ. Washington)
Paup (David Swofford, latest version will be
commercial)

84
Some useful information about phylogenetic trees
[Figure: rooted tree with external nodes A-E, internal nodes F-I and the root marked]
A-E are external nodes (extant)
F-I are internal (ancestral) nodes
OTUs are operational taxonomic units
They can be species, populations, individuals,
genes or proteins
They are the extant (existing) or extinct
(ancestral) OTUs
Topology: order of the nodes on the tree

85
Distance Matrix Methods

UPGMA (Unweighted Pair Group with Arithmetic
Mean) uses real (uncorrected) distance values and
a sequential clustering algorithm. (Should only
be used with closely related OTUs, or when there
is constancy of evolutionary rate)
Transformed distance methods. Corrections may be
introduced to obtain trees with true evolutionary
distances (PAM values, Kimura), or corrections
are carried out with reference to an outgroup
(Farris, 1971; Klotz et al., 1979). Should be used
when evolutionarily distant organisms are included
in the dataset
Neighbor-relation methods
FITCH (Fitch, 1981)
Neighbor-Joining method (Saitou and Nei, 1987)
Should all be used with corrected (see above)
distance matrices

86
Distance matrix

Uncorrected for Multiple Substitutions
     1      2      3      4      5
  0.00   0.63   0.63  22.88  18.50   AC007866_13  1
         0.00   0.63  22.57  18.50   AC007866_17  2
                0.00  22.88  17.87   AC007866_15  3
                       0.00   5.64   AC007866_9   4
                              0.00   AC007866_11  5

Using the Kimura correction method
Gap weighting is 0.000000
     1      2      3      4      5
  0.00   0.63   0.63  27.35  21.29   AC007866_13  1
         0.00   0.63  26.90  21.29   AC007866_17  2
                0.00  27.35  20.47   AC007866_15  3
                       0.00   5.88   AC007866_9   4
                              0.00   AC007866_11  5

Distance matrix as produced by the EMBOSS program
distmat
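
The corrected values above are consistent with Kimura's approximation for protein distances, d = -ln(1 - p - 0.2 p^2), where p is the fraction of differing sites. The sketch below applies that formula to the uncorrected percentages. That distmat computes its Kimura correction in exactly this way is an assumption made here, and the function name is hypothetical.

```python
from math import log

def kimura_protein_distance(percent_diff):
    """Correct an uncorrected percent difference for multiple substitutions.

    Assumes Kimura's approximation d = -ln(1 - p - 0.2 * p**2),
    with p the fraction of differing sites. Returns a percentage.
    """
    p = percent_diff / 100.0
    return -100.0 * log(1.0 - p - 0.2 * p * p)

# Uncorrected values taken from the first matrix on the slide
for uncorrected in (0.63, 5.64, 17.87, 22.57, 22.88):
    print(uncorrected, round(kimura_protein_distance(uncorrected), 2))
```

The correction is negligible for closely related sequences (0.63) and grows with the distance, which is why corrected matrices matter most for divergent OTUs.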
87
UPGMA

UPGMA (Unweighted Pair Group with Arithmetic
Mean) uses real (uncorrected) distance values and
a sequential clustering algorithm. (Should only
be used with closely related OTUs, or when there
is constancy of evolutionary rate)

88
Tree construction (UPGMA)
First cycle
    A  B  C  D  E
B   2
C   4  4
D   6  6  6
E   6  6  6  4
F   8  8  8  8  8
Cluster the pair of OTUs with the smallest
distance, being A and B. The branching point is
positioned at a distance of 2 / 2 = 1
substitution.
89
Tree construction (UPGMA)

Following the first clustering, A and B are
considered as a single composite OTU (A,B) and we
now calculate the new distance matrix as follows:
dist(A,B),C = (distAC + distBC) / 2 = 4
dist(A,B),D = (distAD + distBD) / 2 = 6
dist(A,B),E = (distAE + distBE) / 2 = 6
dist(A,B),F = (distAF + distBF) / 2 = 8
In other words, the distance between a simple
OTU and a composite OTU is the average of the
distances between the simple OTU and the
constituent simple OTUs of the composite OTU.
Then a new distance matrix is built from the
newly calculated distances and the whole
cycle is repeated.

90
Tree construction (UPGMA)

Second cycle
A,B C D E
C 4
D 6 6
E 6 6 4
F 8 8 8 8

91
Tree construction (UPGMA)

Third cycle
A,B C D,E
C 4
D,E 6 6
F 8 8 8

92
Tree construction (UPGMA 4)

Fourth cycle
AB,C D,E
D,E 6
F 8 8

93
Tree construction (UPGMA)

Fifth cycle
ABC,DE
F 8
The final step consists of clustering the last
OTU, F, with the composite OTU (ABC,DE).
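
The five cycles above can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular package: the function name `upgma` and the dictionary-based matrix are choices made here. Distances to a new composite OTU are averaged over all constituent simple OTUs, as described on the earlier slide. Ties, such as (A,B)-C versus D-E at distance 4, may be resolved in a different order than on the slides, but the branching heights come out the same.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    names: list of OTU labels
    dist:  dict mapping frozenset({a, b}) -> distance
    Returns the merge steps as (cluster_a, cluster_b, branching_height).
    """
    clusters = {name: 1 for name in names}  # label -> number of simple OTUs
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # closest pair; the first minimum found wins in case of a tie
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        na, nb = clusters.pop(a), clusters.pop(b)
        new = f"({a},{b})"
        for c in clusters:
            # average over all constituent simple OTUs of the composite
            d[frozenset((new, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[new] = na + nb
        merges.append((a, b, height))
    return merges

# Distance matrix of the first cycle (lower triangle)
names = ["A", "B", "C", "D", "E", "F"]
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6],
        "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {frozenset((r, names[i])): v
        for r, vals in rows.items() for i, v in enumerate(vals)}

merges = upgma(names, dist)
for a, b, height in merges:
    print(f"join {a} and {b} at height {height}")
```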

94
Pitfalls of UPGMA

The UPGMA clustering method is very sensitive to
unequal evolutionary rates.
Clustering works only if the data are ultrametric
Ultrametric distances are defined by the
satisfaction of the 'three-point condition'.

95
The three-point condition

For any three taxa A, B and C:
distAC <= max(distAB, distBC)
or, in words: the two greatest distances are
equal.
UPGMA assumes that the evolutionary rate is the
same for all branches.
If the assumption of rate constancy among
lineages does not hold UPGMA may give an
erroneous topology.
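
Whether a distance matrix is ultrametric can be checked directly from the three-point condition. The sketch below tests every triple of OTUs and requires the two largest of the three distances to be equal. The function name `is_ultrametric`, the tolerance and the second, non-ultrametric matrix are hypothetical and only serve the illustration.

```python
from itertools import combinations

def is_ultrametric(names, d, tol=1e-9):
    """Return True if every triple satisfies the three-point condition."""
    for a, b, c in combinations(names, 3):
        ab, ac, bc = d[frozenset((a, b))], d[frozenset((a, c))], d[frozenset((b, c))]
        _, second, largest = sorted((ab, ac, bc))
        if abs(largest - second) > tol:  # the two greatest must be equal
            return False
    return True

# Matrix of the UPGMA example (lower triangle)
names = ["A", "B", "C", "D", "E", "F"]
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6],
        "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
clock_like = {frozenset((r, names[i])): v
              for r, vals in rows.items() for i, v in enumerate(vals)}

# Hypothetical variant: the lineage leading to B evolves faster,
# which lengthens every distance that involves B by 3
unequal_rates = {pair: v + 3 if "B" in pair else v
                 for pair, v in clock_like.items()}

print(is_ultrametric(names, clock_like))     # True
print(is_ultrametric(names, unequal_rates))  # False
```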

Non-ultrametric tree
96
Unequal rates of mutation lead to wrong trees

UPGMA tree construction based on the data of the
left tree would result in the erroneous tree at
the right

97
UPGMA (conclusion)

UPGMA uses real (uncorrected) distance values and
a sequential clustering algorithm.
This method of tree construction is very
sensitive to differences in branch length or
unequal rates of evolution.
It should only be used with closely related OTUs,
or when there is constancy of evolutionary rate.
The method is often used in combination with
isoenzyme or restriction site data or with
morphological criteria

98
Maximum Parsimony Methods

Use sequence information rather than distance
information
Evaluate all possible trees and select the tree
that requires the minimum number of substitutions,
summed over the informative sites

99
Maximum Parsimony analysis (2)

Parsimony implies that simpler hypotheses are
preferable to more complicated ones.
Maximum parsimony is a character-based method
that infers a phylogenetic tree by minimizing the
total number of evolutionary steps required to
explain a given set of data, or in other words by
minimizing the total tree length.
The steps may be base or amino-acid substitutions
for sequence data, or gain and loss events for
restriction site data.

100
Maximum Parsimony analysis (3)

Maximum parsimony, when applied to protein
sequence data either considers each site of the
sequence as a multistate unordered character
with 20 possible states (the amino-acids) (Eck
and Dayhoff, 1966), or may take into account the
genetic code and the number of mutations, 1, 2 or
3, that is required to explain an observed
amino-acid substitution. The latter method is
implemented in the PROTPARS program (Felsenstein,
1993).
The maximum parsimony method searches all
possible tree topologies for the optimal
(minimal) tree. However, the number of unrooted
trees that have to be analysed rapidly increases
with the number of OTUs.

101
Maximum Parsimony analysis (4)

The number of rooted trees (Nr) for n OTUs is
given by N_r = (2n-3)! / (2^{n-2} (n-2)!)
The number of unrooted trees (Nu) for n OTUs is
given by N_u = (2n-5)! / (2^{n-3} (n-3)!)

Number of OTUs   unrooted trees    rooted trees
 2                         1               1
 3                         1               3
 4                         3              15
 5                        15             105
 6                       105             945
 7                       945          10,395
 8                    10,395         135,135
 9                   135,135       2,027,025
10                 2,027,025      34,459,425
15                   7.91E12         2.13E14
20                   2.22E20         8.20E21

This rapid increase in the number of trees to be
analysed may make it impossible to apply the
method to very large datasets. In that case the
parsimony method may become very time consuming,
even on very fast computers.
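
The two formulas are easy to evaluate for any number of OTUs. The short sketch below does so with exact integer arithmetic. The function names are chosen here for illustration, and the special case for fewer than three OTUs is added only so that the formula for unrooted trees is never asked for the factorial of a negative number.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 OTUs."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n OTUs."""
    if n < 3:
        return 1  # a single tree; the formula needs n >= 3
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (4, 7, 10, 15, 20):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Note that an unrooted tree of n + 1 OTUs can be read as a rooted tree of n OTUs, so `rooted_trees(n)` equals `unrooted_trees(n + 1)`.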
102
maximum parsimony method for 4 nucleic-acid
sequences

            Site
Sequence    1 2 3 4 5 6 7 8 9
1           A A G A G T G C A
2           A G C C G T G C G
3           A G A T A T C C A
4           A G A G A T C C G
For four OTUs there are three possible unrooted
trees. The trees are then analysed by searching
for the ancestral sequences and by counting the
number of mutations required to explain the
respective trees

103
                                              Number of mutations
(1) AAGAGTGCA           AGATATCCA (3)
             \4       2/
              \   4   /
       AGCCGTGCG --- AGAGATCCG                Tree I     11
              /       \
             /0       0\
(2) AGCCGTGCG           AGAGATCCG (4)

(1) AAGAGTGCA           AGCCGTGCG (2)
             \1       3/
              \   5   /
       AGGAGTGCA --- AGAGGTCCG                Tree II    14
              /       \
             /4       1\
(3) AGATATCCA           AGAGATCCG (4)

(1) AAGAGTGCA           AGCCGTGCG (2)
             \1       3/
              \   5   /
       AGAAGTGCA --- AGATGTCCG                Tree III   16
              /       \
             /5       2\
(4) AGAGATCCG           AGATATCCA (3)

Tree I has the topology with the smallest number
of mutations and thus is the most parsimonious
tree. Ancestral sequences are calculated. This
analysis includes both informative and
non-informative sites in the sequence. When
only informative sites are included, far fewer
sites have to be analysed, which means in
the case of large datasets a considerable gain in
CPU time.
104
Informative sites
A site is informative only when there are at
least two different kinds of nucleotides at the
site, each of which is represented in at least
two of the sequences under study.

            Site
Sequence    1 2 3 4 5 6 7 8 9
1           A A G A G T G C A
2           A G C C G T G C G
3           A G A T A T C C A
4           A G A G A T C C G
                    *   *   *

Informative sites are indicated by an asterisk (*)
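
The definition of an informative site translates directly into a small test on each alignment column. The sketch below is an illustration only, with a function name chosen here. It counts the states in a column and keeps the site if at least two different states each occur at least twice.

```python
from collections import Counter

def informative_sites(seqs):
    """Return the 1-based positions of parsimony-informative sites."""
    sites = []
    for position, column in enumerate(zip(*seqs), start=1):
        counts = Counter(column)
        # at least two kinds of nucleotide, each present in >= 2 sequences
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(position)
    return sites

seqs = ["AAGAGTGCA",   # sequence 1
        "AGCCGTGCG",   # sequence 2
        "AGATATCCA",   # sequence 3
        "AGAGATCCG"]   # sequence 4

print(informative_sites(seqs))  # [5, 7, 9]
```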
105
Informative sites only
1 GGA
2 GGG
3 ACA
4 ACG
                              Number of mutations
(1) GGA         ACA (3)
       \1     1/
        \  2  /
       GGG --- ACG            Tree I     4
        /     \
       /0     0\
(2) GGG         ACG (4)

(1) GGA         GGG (2)
       \1     1/
        \  1  /
       GCA --- GCG            Tree II    5
        /     \
       /1     1\
(3) ACA         ACG (4)

(1) GGA         GGG (2)
       \2     1/
        \  0  /
       GCG --- GCG            Tree III   6
        /     \
       /1     2\
(4) ACG         ACA (3)
To infer a maximum parsimony tree, for each
possible tree we calculate the minimum number of
substitutions at each informative site. In the
above example, for sites 5, 7, and 9, tree I
requires in total 4 changes, tree II requires 5
changes, and tree III requires 6 changes. In the
final step, we sum the number of changes over all
the informative sites for each tree and choose
the tree associated with the smallest number of
substitutions. In our case, tree I is chosen
because it requires the smallest number of
changes (4) at the informative sites.
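
The counts for the three trees can be reproduced with Fitch's algorithm, which finds the minimum number of changes per site in a single pass from the tips towards an arbitrarily placed root. The slides do not name this algorithm; it is used here as one standard way to obtain the minimum. Trees are written as nested tuples of sequence indices (0 to 3 for sequences 1 to 4), a representation chosen for this sketch.

```python
def fitch(tree, column):
    """Return (possible_states, changes) for one site on a nested-tuple tree."""
    if isinstance(tree, int):          # a tip: its observed state, no change
        return {column[tree]}, 0
    left, right = tree
    left_states, left_changes = fitch(left, column)
    right_states, right_changes = fitch(right, column)
    shared = left_states & right_states
    if shared:                         # subtrees agree: no extra change
        return shared, left_changes + right_changes
    return left_states | right_states, left_changes + right_changes + 1

def tree_length(tree, seqs):
    """Minimum number of substitutions summed over all sites."""
    return sum(fitch(tree, column)[1] for column in zip(*seqs))

# Informative sites 5, 7 and 9 of sequences 1-4
seqs = ["GGA", "GGG", "ACA", "ACG"]
trees = {
    "I":   ((0, 1), (2, 3)),   # (1,2) versus (3,4)
    "II":  ((0, 2), (1, 3)),   # (1,3) versus (2,4)
    "III": ((0, 3), (1, 2)),   # (1,4) versus (2,3)
}
lengths = {name: tree_length(tree, seqs) for name, tree in trees.items()}
print(lengths)  # {'I': 4, 'II': 5, 'III': 6}
```

The root is placed on the internal branch only for convenience; the parsimony length does not depend on where the tree is rooted.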
106
How to find the best tree ?

Maximum parsimony searches for the optimal
(minimal) tree. In this process more than one
minimal tree may be found. In order to guarantee
finding the best possible tree, an exhaustive
evaluation of all possible tree topologies has to
be carried out. However, this becomes impossible
when there are more than 12 OTUs in a dataset.
Branch and Bound is a variation on maximum
parsimony that guarantees finding the minimal tree
without having to evaluate all possible trees.
This way a larger number of taxa can be evaluated
but the method is still limited.
Heuristic search is a method with step-wise
addition and rearrangement (branch swapping) of
OTUs. Here it is not guaranteed that the best
tree is found.
Since, in view of the size of the dataset, it is
often not possible to carry out an exhaustive or
branch-and-bound search for the best tree, it is
advised to change the order of the taxa in the
dataset and to repeat the analysis, or to instruct
the program to do this for you by providing a
so-called jumble factor.

107
Consensus tree

Since the Maximum Parsimony method may result in
more than one equally parsimonious tree, a
consensus tree should be created. For the
creation of a consensus tree see bootstrapping.

108
Parsimony and branch lengths
(1) G         A (3)
     \1     0/
      \  1  /
       C --- A
      /     \
     /0     1\
(2) C         T (4)

(1) G         A (3)
     \0     1/
      \  1  /
       G --- T
      /     \
     /1     0\
(2) C         T (4)

(1) G         A (3)
     \1     1/
      \  1  /
       C --- T
      /     \
     /0     0\
(2) C         T (4)
Three possible trees for 4 OTUs all describe the
same final state by assuming a total of 3 steps.
Each final state is arrived at via a different
route. Each of the three trees is equally
valid, but the number of steps along the
individual branches (or the length of each branch)
is not determined. For this reason branch
lengths are not given in parsimony, but only the
total number of steps for a tree.
109
Some final notes on maximum parsimony

Maximum Parsimony (positive points)
is based on shared and derived characters. It
therefore is a cladistic rather than a phenetic
method
does not reduce sequence information to a single
number
tries to provide information on the ancestral
sequences
evaluates different trees
Maximum Parsimony (negative points)
does not assume an evolutionary model
is slow in comparison with distance methods
does not use all the sequence information (only
informative sites are used)
does not correct for multiple mutations (does not
imply a model of evolution)
does not provide information on the branch
lengths
is notorious for its sensitivity to codon bias

110
How to root an unrooted tree?

The majority of methods yield unrooted trees
To root a tree one should add an outgroup to the
dataset. An outgroup is an OTU for which external
information (e.g. paleontological information) is
available that indicates that the outgroup
branched off before all other taxa
Do not choose an outgroup that is very distantly
related to your taxa. This may result in serious
topological errors
Do not choose an outgroup that is too
closely related to the taxa in question either. In
this case it may not be a true outgroup
The use of more than one outgroup generally
improves the estimate of tree topology
In the absence of a good outgroup the root may be
positioned by assuming approximately equal
evolutionary rates over all the branches. In this
way the root is put at the midpoint of the
longest pathway between two OTUs

111
Maximum likelihood

It evaluates a hypothesis about evolutionary
history in terms of the probability that the
proposed model and the hypothesized history would
give rise to the observed data set. A history
with a higher probability of reaching the
observed state is preferred to a history with a
lower probability. The method searches for the
tree with the highest probability or likelihood.
The following programs are available from the
web
DNAML (DNA data only. By Joe Felsenstein in the
Phylip package)
FastDNAML (DNA data only. A faster algorithm
applied by Gary Olsen to Joe Felsenstein's DNAML
program )
ProtML (DNA and protein. By Adachi and Hasegawa,
1992)
TreePuzzle (DNA and protein. By Strimmer and von
Haeseler, 1995). This program applies a heuristic
method and is much faster than PROTML, but does
not guarantee to find the best tree.

112
Advantages and disadvantages of the maximum
likelihood method

There are some supposed advantages of maximum
likelihood methods over other methods.
It is the estimation method least affected by
sampling error
It is robust to many violations of the
assumptions in the evolutionary model
With very short sequences it tends to outperform
alternative methods such as parsimony or distance
methods
The method is statistically well founded
It evaluates different tree topologies
It uses all the sequence information
There are also some supposed disadvantages
Maximum likelihood is very CPU intensive and thus
extremely slow
The result is dependent on the model of evolution
used

113
Explanation of the method
Maximum likelihood evaluates the probability that
the chosen evolutionary model will have
generated the observed sequences. Phylogenies are
then inferred by finding those trees that yield
the highest likelihood. Assume that we have the
aligned nucleotide sequences for four taxa

      1         j            N
(1)   A G G C U C C A A .... A
(2)   A G G U U C G A A .... A
(3)   A G C C C A G A A .... A
(4)   A U U U C G G A A .... C

and we want to evaluate the likelihood of the
unrooted tree represented by the nucleotides of
site j in the sequence and shown below

(1)         (2)
   \       /
    \     /
     -----
    /     \
   /       \
(3)         (4)

What is the probability that this tree would have
generated the data presented in the sequence
under the chosen model?
114
Likelihood for one site
The models are time-reversible, therefore the
likelihood of the tree is independent of the
position of the root. Thus it is convenient to
root the tree at an arbitrary internal node.
[Figure: the tree for site j rooted at an internal
node, with the nucleotides C, C, A and G at the
tips; shown once with nucleotide A assigned to
both internal nodes and once with the internal
nodes labelled (5) and (6)]
L(j) = Sum( Prob( tree with given nucleotides at
nodes (5) and (6) ) )
Assume that nucleotide sites evolve independently
(the Markovian model of evolution). Then we can
calculate the likelihood for each site separately
and combine these into the total likelihood. For
the likelihood for site j, we have to consider
all the possible scenarios by which the
nucleotides present at the tips of the tree could
have evolved. So the likelihood for a particular
site is the summation of the probabilities of
every possible reconstruction of ancestral
states, given some model of base substitution. So
in this specific case all possible nucleotides A,
G, C, and T occupying nodes (5) and (6), or 4 x 4
= 16 possibilities.
In the case of protein sequences each site may
occupy 20 states (those of the 20 amino acids) and
thus 20 x 20 = 400 possibilities have to be
considered. Since any one of these scenarios could
have led to the amino-acid configuration at the
tips of the tree, we must calculate the
probability of each and sum them to obtain the
total probability for each site j.
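
The summation over the 16 reconstructions can be written out explicitly. The sketch below makes several assumptions that are not given on the slides: it uses the Jukes-Cantor substitution model, gives every branch the same hypothetical length t = 0.1, and takes node (5) as the ancestor of tips (1) and (2) and node (6) as the ancestor of tips (3) and (4). The root is placed at node (5), which is allowed because the model is time-reversible.

```python
from itertools import product
from math import exp

BASES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability that base x is replaced by y over length t."""
    e = exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(tips, t=0.1):
    """Likelihood of one site: sum over all states at nodes (5) and (6)."""
    s1, s2, s3, s4 = tips
    total = 0.0
    for x5, x6 in product(BASES, repeat=2):      # 4 x 4 = 16 scenarios
        total += (0.25                           # base frequency at the root (5)
                  * jc_prob(x5, s1, t) * jc_prob(x5, s2, t)
                  * jc_prob(x5, x6, t)           # internal branch (5)-(6)
                  * jc_prob(x6, s3, t) * jc_prob(x6, s4, t))
    return total

L_j = site_likelihood("CCAG")   # nucleotides at the tips for site j
print(L_j)
```

A useful sanity check on such code is that the likelihoods of all 256 possible tip patterns add up to 1.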
115
likelihood for the full tree
The likelihood for the full tree then is the
product of the likelihood at each site.
$$L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j)$$
Since the individual likelihoods are
extremely small numbers it is convenient to sum
the log likelihoods at each site and report the
likelihood of the entire tree as the log
likelihood.
$$\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)$$
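
The practical reason for working with log likelihoods shows up as soon as the product is computed in floating point. In the sketch below the per-site likelihoods are hypothetical values, chosen only to illustrate the effect: their product is too small to be represented and becomes zero, while the sum of the logarithms remains an ordinary number.

```python
from math import log, prod

# Hypothetical likelihoods for an alignment of 100 sites
site_likelihoods = [1e-5] * 100

L = prod(site_likelihoods)                    # true value 1e-500: underflows
lnL = sum(log(x) for x in site_likelihoods)   # about -1151.3

print(L)     # 0.0
print(lnL)
```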
116
The model of evolution
The PROTML program in the MOLPHY package (Adachi
and Hasegawa, 1992), as well as the TreePuzzle
program by Strimmer and von Haeseler (1995), have
implemented an instantaneous rate matrix derived
from the Dayhoff empirical substitution matrix.
This has been called the Dayhoff model.
Recently a model called the
