Title: Molecular Evolution of Proteins and Phylogenetic Analysis Fred R. Opperdoes Christian de Duve Institute of Cellular Pathology (ICP) and Laboratory of Biochemistry, Universit
1Molecular Evolution of Proteins and Phylogenetic Analysis
Fred R. Opperdoes
Christian de Duve Institute of Cellular Pathology (ICP) and Laboratory of Biochemistry, Université catholique de Louvain, Brussels, Belgium
2The tree of life based on rRNA sequences
Mitochondriates
Amitochondriates
3The fusion hypothesis: the eukaryotic cell is a chimaera of eubacterial and archaebacterial traits
Energy metabolism
Genetic machinery
Root?
Common ancestor?
4Triosephosphate isomerase
Triosephosphate isomerase of eukaryotes is of
typical eubacterial origin and probably has
entered the eukaryotic cell together with the
bacterial endosymbiont that gave rise to the
formation of the mitochondrion
Root?
5Arguments in favour of protein rather than DNA sequences
- CODON BIAS
- 64 possible triplet codons encode 20 amino acids. One amino acid may be encoded by 1 to 6 different codons, and 3 of the 64 codons, called stop (or termination) codons, specify "end of peptide sequence"
- The different codons are used with unequal frequency, and this frequency distribution is referred to as "codon usage"
- Codon usage varies between species. The code is degenerate, and most of the redundancy lies in the third ("wobble") position of the codon.
6The universal genetic code
First       Second position             Third
position    U(T)   C      A      G      position
U(T)        Phe    Ser    Tyr    Cys    U(T)
            Phe    Ser    Tyr    Cys    C
            Leu    Ser    STOP   STOP   A
            Leu    Ser    STOP   Trp    G
C           Leu    Pro    His    Arg    U(T)
            Leu    Pro    His    Arg    C
            Leu    Pro    Gln    Arg    A
            Leu    Pro    Gln    Arg    G
A           Ile    Thr    Asn    Ser    U(T)
            Ile    Thr    Asn    Ser    C
            Ile    Thr    Lys    Arg    A
            Met    Thr    Lys    Arg    G
G           Val    Ala    Asp    Gly    U(T)
            Val    Ala    Asp    Gly    C
            Val    Ala    Glu    Gly    A
            Val    Ala    Glu    Gly    G
7Arguments in favour of ... (codon bias 2)
- Yeasts, protozoa, and animals have different codon preferences.
- This results in differences in DNA sequence that are related to codon bias and not to evolution.
8Different species use different codons
Homo sapiens [gbmam]: 1 CDS's (389 codons)
fields: [triplet] [frequency per thousand] ([number])
UUU 20.6(     8)   UCU  5.1(     2)   UAU  7.7(     3)   UGU  7.7(     3)
UUC 12.9(     5)   UCC 20.6(     8)   UAC 30.8(    12)   UGC  0.0(     0)
UUA 10.3(     4)   UCA 18.0(     7)   UAA  0.0(     0)   UGA  0.0(     0)
UUG 10.3(     4)   UCG  0.0(     0)   UAG  2.6(     1)   UGG 15.4(     6)

Saccharomyces cerevisiae [gbpln]: 9295 CDS's (4586264 codons)
fields: [triplet] [frequency per thousand] ([number])
UUU 25.9(118900)   UCU 23.6(108308)   UAU 18.7( 85651)   UGU  8.0( 36624)
UUC 18.3( 83880)   UCC 14.3( 65421)   UAC 14.7( 67599)   UGC  4.6( 21255)
UUA 26.3(120698)   UCA 18.7( 85618)   UAA  1.0(  4476)   UGA  0.6(  2742)
UUG 27.2(124967)   UCG  8.5( 39137)   UAG  0.4(  2058)   UGG 10.4( 47694)
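The "frequency per thousand" figures in codon usage tables like the two above can be reproduced from any in-frame coding sequence. The following is a minimal sketch only; the function name `codon_usage_per_thousand` and the toy sequence are hypothetical and are not taken from the codon usage database itself.

```python
from collections import Counter


def codon_usage_per_thousand(cds: str) -> dict[str, float]:
    """Return codon frequencies per 1000 codons for an in-frame coding sequence."""
    seq = cds.upper().replace("T", "U")
    # Only complete codons are counted; a trailing partial codon is ignored.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: 1000.0 * n / total for codon, n in counts.items()}


# Hypothetical toy coding sequence: AUG UUU UUC UUA UUG UAA (6 codons).
example_cds = "ATGTTTTTCTTATTGTAA"
usage = codon_usage_per_thousand(example_cds)
for codon in sorted(usage):
    print(f"{codon} {usage[codon]:6.1f}")
```

With only six codons, each one accounts for one sixth of the total; real tables are compiled over many coding sequences, which is why the single-CDS human table above is much noisier than the yeast one.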
9Differences between the Universal and
Mitochondrial Genetic Codes
- Codon   Universal code   Mitochondrial code
- UGA     Stop             Trp
- AGA     Arg              Stop
- AGG     Arg              Stop (or Lys*)
- AUA     Ile              Met
- Modified from Li and Graur, 1991, Fundamentals of Molecular Evolution, Sinauer Publ.
- * Only in arthropod mitochondria (Abascal et al., PLoS Biol 4, e127 (2006))
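To see how the four codon reassignments listed above change a translation, one can build the universal code table and override those entries. This is a sketch under stated assumptions: the names `UNIVERSAL`, `MITO` and `translate` are illustrative, the override set contains only the four codons from the table, and the arthropod AGG = Lys variant is left out.

```python
from itertools import product

BASES = "UCAG"
# Amino acids of the universal code, ordered with the first base varying
# slowest and the third base fastest, each in the order U, C, A, G.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
UNIVERSAL = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# The four differences listed in the table above ("*" marks a stop codon).
MITO = {**UNIVERSAL, "UGA": "W", "AGA": "*", "AGG": "*", "AUA": "M"}


def translate(rna: str, code: dict[str, str]) -> str:
    """Translate an in-frame RNA string codon by codon with the given code."""
    return "".join(code[rna[i:i + 3]] for i in range(0, len(rna) - 2, 3))


rna = "AUGAUAUGAAGA"  # hypothetical example: AUG AUA UGA AGA
print(translate(rna, UNIVERSAL))  # MI*R
print(translate(rna, MITO))       # MMW*
```

The same twelve nucleotides give two different peptides, which is one reason why the correct code table has to be chosen before protein sequences are compared.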
10Arguments in favour... (codon bias)
- Also, some protozoa (ciliates) use the codons UAA and UAG to encode glutamine, rather than STOP
- The inclusion of unique codons in a subset of the sequences will tend to make that subset appear more divergent than it really is
11Arguments in favour... (codon bias 2)
- High GC content of DNA seems to be associated with aerobiosis in prokaryotes (Naya et al., 2002)
- In all major groups, organisms with both AT-rich and GC-rich DNA can be found.
- The inclusion of unique codons in a subset of the sequences will tend to make that subset appear more divergent than it really is
12GC content of DNA in aerobic and anaerobic prokaryotes
[Figure: GC content of DNA in anaerobic and aerobic prokaryotes. From Naya et al., J. Mol. Evol. 55 (2002) 260-264]
14The use of protein sequences in phylogeny
requires knowledge of the properties of the
amino acids and their single letter codes
- Alanine A Leucine L
- Arginine R Lysine K
- Asparagine N Methionine M
- Aspartic acid D Phenylalanine F
- Cysteine C Proline P
- Glutamic acid E Serine S
- Glutamine Q Threonine T
- Glycine G Tryptophan W
- Histidine H Tyrosine Y
- Isoleucine I Valine V
15Arguments in favour of a phylogenetic analysis of the corresponding protein rather than the DNA
- LONG TIME HORIZON
- When comparing sequences that have diverged for possibly a billion years or more, it is very likely that the wobble bases in the codons will have become randomized. By excluding the wobble bases (a general technique), one is actually looking at amino acid sequences. So why not take the protein sequence directly?
16Advantages of the translation of DNA into protein (1)
- DNA is composed of only four kinds of unit: A, G, C and T
- If gaps are not allowed, on average 25% of residues in two randomly chosen aligned sequences would be identical
- If gaps are allowed, as much as 50% of residues in two randomly chosen aligned sequences can be identical. Such a situation may obscure any genuine relationship that may exist, especially when comparing distantly related or rapidly evolving gene sequences
- Moreover, it is easier to translate a gene sequence into its corresponding protein than to remove the third (wobble) base from each of the codons in the gene
- All open reading frames have already been translated into their corresponding peptide sequences (GenPept and UniProt databases)
17Alignment of two random DNA sequences
Without indels: 19% identity
Indels allowed: 56% identity
18Advantages of the translation of DNA into protein (2)
- Translation of DNA into 21 different types of codon (20 amino acids and a terminator) sharpens up the information considerably. Wrong-frame information is set aside
- Third-base degeneracies are consolidated
- After insertion of gaps to align two random protein sequences, it can be expected that they are between 10 and 20% identical
- As a result of the translation procedure, the protein sequences with their 20 amino acids are much easier to align than the corresponding DNA sequences with only 4 nucleotides
19Alignment of two random protein sequences
Without indels: 7% identity
Indels allowed: 22% identity
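The ungapped background identities quoted on these slides can be checked with a small simulation. The sketch below assumes uniformly distributed residues, so it gives about 25% for DNA and about 5% for protein; real sequences have uneven composition, which shifts the figures (for example towards the 7% shown above for proteins). The helper names are hypothetical, and gapped alignment is not simulated here.

```python
import random


def ungapped_identity(a: str, b: str) -> float:
    """Fraction of identical positions when two sequences are aligned without gaps."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))


def random_sequence(alphabet: str, length: int, rng: random.Random) -> str:
    """Draw a sequence with every letter of the alphabet equally likely."""
    return "".join(rng.choice(alphabet) for _ in range(length))


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 20000
dna_identity = ungapped_identity(
    random_sequence("ACGT", n, rng), random_sequence("ACGT", n, rng)
)
protein_identity = ungapped_identity(
    random_sequence("ACDEFGHIKLMNPQRSTVWY", n, rng),
    random_sequence("ACDEFGHIKLMNPQRSTVWY", n, rng),
)
print(f"random DNA:     {100 * dna_identity:.1f}% identity")
print(f"random protein: {100 * protein_identity:.1f}% identity")
```

The fivefold lower background for the 20-letter alphabet is the signal-to-noise argument for aligning proteins rather than DNA.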
20Advantages of the translation of DNA into protein (3)
- If, after this, you still want to align distantly related gene sequences, you should first prepare a protein alignment and then use it to guide the alignment of the gene sequences and the precise placement of indels (use EMBOSS tranalign).
- Conclusion: the signal-to-noise ratio is greatly improved when using protein sequences rather than DNA sequences!
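The idea of letting a protein alignment dictate where indels go in the gene alignment can be sketched in a few lines. This is only an illustration of the principle, not a description of how EMBOSS tranalign is implemented; the function name `thread_codons` and the toy sequences are hypothetical.

```python
def thread_codons(aligned_protein: str, cds: str) -> str:
    """Place each codon of `cds` under its residue; a protein gap becomes '---'."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = len(aligned_protein.replace("-", ""))
    if len(codons) < n_residues or (codons and len(codons[-1]) != 3):
        raise ValueError("coding sequence does not match the aligned protein")
    codon_iter = iter(codons)
    return "".join("---" if ch == "-" else next(codon_iter) for ch in aligned_protein)


# Hypothetical toy example: two aligned peptides and their coding sequences.
print(thread_codons("M-KV", "ATGAAAGTT"))     # ATG---AAAGTT
print(thread_codons("MSKV", "ATGTCTAAAGTA"))  # ATGTCTAAAGTA
```

Gaps in the resulting nucleotide alignment always come in multiples of three and never split a codon, which a direct DNA alignment does not guarantee.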
21TBLASTX
- The BLAST program TBLASTX allows the use of translated nucleic acid sequence information to search for distant relationships between genes
- The six-frame translation of a nucleotide query is compared with the six-frame translations of all the sequences in a nucleotide database
22NCBI BLASTN output
23NCBI TBLASTX output
24Nature of Sequence Divergence in Proteins
- The observed sequence difference of two diverging sequences takes the course of a negative exponential. This is because each position is subject to reverse changes ("back mutations") and multiple hits
- Thus the observed percentage of difference between the protein sequences is not proportional to the actual evolutionary difference between two homologous sequences
- The evolutionary distance between two proteins is expressed in PAM units. PAM (Dayhoff and Eck, 1968) stands for "accepted point mutation"
25Relation between distance and PAM distance
- PAM value   Distance (%)
- 80          50
- 100         60
- 200         75
- 250         85   Twilight zone
- 300         92
- (From Doolittle, 1987, Of URFs and ORFs, University Science Books)
- As the evolutionary distance increases, the probability of superimposed mutations becomes greater, so the observed percent difference falls increasingly short of the true number of changes.
26Relation between distance and PAM distance
[Figure: observed distance plotted against PAM value, with the twilight zone indicated]
27The Kimura correction for multiple substitutions
- The formula used to correct for multiple hits is from Motoo Kimura (Kimura, M., The Neutral Theory of Molecular Evolution, Camb. Univ. Press, 1983, page 75)
- $K = -\ln(1 - D - D^2/5)$, where D is the observed distance and K is the corrected distance.
- This formula gives the mean number of estimated substitutions per site and, in contrast to D (the observed number), can be greater than 1, i.e. more than one substitution per site, on average. For example, if you observe 0.8 differences per site (80% difference; 20% identity), then the above formula predicts that there have been about 2.6 substitutions per site over the course of evolution since the 2 sequences diverged.
- This can also be expressed in PAM units by multiplying by 100 (mean number of substitutions per 100 residues).
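The correction is easy to evaluate directly. The sketch below implements the formula exactly as written on this slide; the function name `kimura_protein_distance` is hypothetical. Evaluated as written, the formula gives about 2.63 substitutions per site for D = 0.8.

```python
import math


def kimura_protein_distance(d: float) -> float:
    """Corrected distance K = -ln(1 - D - D^2/5) for an observed distance D (0..1)."""
    arg = 1.0 - d - d * d / 5.0
    if arg <= 0.0:
        # The logarithm is undefined once D exceeds roughly 0.85.
        raise ValueError("observed distance too large for this correction")
    return -math.log(arg)


for d in (0.036, 0.232, 0.268, 0.8):
    k = kimura_protein_distance(d)
    print(f"D = {d:.3f}  K = {k:.3f}  ({100 * k:.1f} PAM)")
```

For small D the corrected and observed values are nearly equal; the correction only becomes substantial as the sequences approach the twilight zone.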
28Proteins evolve at highly different rates
Protein                          Rate of change        Theoretical lookback time
                                 (PAMs / 10^8 yrs)     (x 10^6 yrs)
Pseudogenes                      400                   45
Fibrinopeptides                  90                    200
Lactalbumins                     27                    670
Lysozymes                        24                    850
Ribonucleases                    21                    850
Haemoglobins                     12                    1500
Acid proteases                   8                     2300
Cytochrome c                     4                     5000
Glyceraldehyde-P dehydrogenase   2                     9000
Glutamate dehydrogenase          1                     18000
PAM = number of Accepted Point Mutations per 100 amino acids. Useful lookback time = 360 PAMs
29Some Important Dates in History
- Event                               Number of years ago (x 10^9 yrs)
- Origin of the Universe              15 ± 4
- Formation of the Solar System       4.6
- First Self-replicating System       3.5 ± 0.5
- Prokaryotic-Eukaryotic Divergence   2.0 ± 0.5
- Plant-Animal Divergence             1.0
- Invertebrate-Vertebrate Divergence  0.5
- Mammalian Radiation Beginning       0.1
- From Doolittle, Of URFs and ORFs, 1987
30Construction of a phylogenetic tree from
phosphoglycerate kinase sequences
31Arguments in favour of a protein rather than a DNA sequence (3)
- INTRONS
- A study of the evolution of a protein using its DNA sequence should only include coding sequences
- This requires that all the introns are edited out of every DNA sequence. This may be cumbersome and time-consuming
- An easier approach is the direct translation of the cDNA sequence into its corresponding protein sequence
32Typical structure of a eukaryotic gene
[Figure, 5' to 3': flanking region with TATA box and transcription initiation site; Exon 1 (with initiation codon); Intron I; Exon 2; Intron II; Exon 3 (with stop codon); AATAAA poly(A) addition site; flanking region]
33Arguments in favour of a protein rather than a DNA sequence (4)
- MULTIGENE FAMILIES
- Organisms may contain many highly similar genes, while only one peptide sequence can be identified (e.g. histones, tubulins and GAPDH in humans).
- Using these DNA sequences, it would be difficult to decide which are expressed and which are not, and thus which genes to include in the analysis.
- Moreover, if all the genes that are expressed encode the same protein, then the DNA differences are not significant
34Arguments in favour of a protein rather than a DNA sequence (5)
- PROTEIN IS THE UNIT OF SELECTION
- For protein-encoding genes, the object on which natural selection acts is the protein itself.
- The underlying DNA sequence reflects this process in combination with species-specific pressures on the DNA sequence (like the need for aerophiles to have DNA that is richer in GC).
- If function demands that a protein maintains a specific sequence, there is still room for the DNA sequence to change.
35Arguments in favour of a protein rather than a DNA sequence (6)
- RNA EDITING
- The DNA sequence doesn't always translate directly into the amino acid sequence.
- In post-transcriptional RNA editing, nucleotides that are not encoded in the gene are added to the transcript, or encoded nucleotides are removed from it.
- This can lead to major differences in DNA sequence (sometimes more than 50%) that nevertheless lead to roughly the same protein sequence after final editing
36Pan-editing of mitochondrial RNA in Kinetoplastida
UCCuAuuAAuUUUUUGuUAUAu AGuuuuuuAAUGUUGuuuGGuGu
A uuuuuuuAuUGUGuuuAGuuuuG uuuuGuuGuuGuuuGuuuG
GU GuGuuAuuGUUUUGAGAuuGuuG
Note that the mature mRNA would not be able to hybridise with the gene present in the kinetoplast DNA, and thus the gene cannot be detected as such.
37Some good advice (1)
- It is recommended to prepare the phylogenetic trees both ways (DNA and protein) and see how they look
- For a group of species that are relatively close in time and closely related (like viral proteins or vertebrate enzymes), DNA-based analysis is probably a good way to go, since you avoid problems of codon bias and randomization of wobble bases. But check the protein anyway
38Some good advice (2)
- Be aware of the problems of multigene families (for instance coding for isoenzymes)
- Be careful when you decide to exclude or include such sequences (you may be comparing paralogous rather than orthologous sequences)
39What is required
- A DNA or protein sequence
- A set of homologous sequences
- A good multiple sequence alignment
- Several programs to create a phylogenetic tree
45Alignment parameters in ClustalX
46PAM 250 matrix as used in Clustal
- C 12,
- S 0, 2,
- T -2, 1, 3,
- P -3, 1, 0, 6,
- A -2, 1, 1, 1, 2,
- G -3, 1, 0,-1, 1, 5,
- N -4, 1, 0,-1, 0, 0, 2,
- D -5, 0, 0,-1, 0, 1, 2, 4,
- E -5, 0, 0,-1, 0, 0, 1, 3, 4,
- Q -5,-1,-1, 0, 0,-1, 1, 2, 2, 4,
- H -3,-1,-1, 0,-1,-2, 2, 1, 1, 3, 6,
- R -4, 0,-1, 0,-2,-3, 0,-1,-1, 1, 2, 6,
- K -5, 0, 0,-1,-1,-2, 1, 0, 0, 1, 0, 3, 5,
- M -5,-2,-1,-2,-1,-3,-2,-3,-2,-1,-2, 0, 0, 6,
- I -2,-1, 0,-2,-1,-3,-2,-2,-2,-2,-2,-2,-2, 2, 5,
- L -6,-3,-2,-3,-2,-4,-3,-4,-3,-2,-2,-3,-3, 4, 2, 6,
- V -2,-1, 0,-1, 0,-1,-2,-2,-2,-2,-2,-2,-2, 2, 4, 2, 4,
- F -4,-3,-3,-5,-4,-5,-4,-6,-5,-5,-2,-4,-5, 0, 1, 2,-1, 9,
- Y  0,-3,-3,-5,-3,-5,-2,-4,-4,-4, 0,-4,-4,-2,-1,-1,-2, 7,10,
47ClustalX distance matrix
- Non-corrected
- AROC_LEIMJ 0.000 0.036 0.268 0.268 0.232
- AROC_PSEAE 0.036 0.000 0.268 0.268 0.232
- AROC_VIBCH 0.268 0.268 0.000 0.089 0.232
- AROC_VIBAN 0.268 0.268 0.089 0.000 0.232
- AROC_NEIMB 0.232 0.232 0.232 0.232 0.000
- Corrected for multiple substitution
- AROC_LEIMJ 0.000 0.037 0.332 0.332 0.278
- AROC_PSEAE 0.037 0.000 0.332 0.332 0.278
- AROC_VIBCH 0.332 0.332 0.000 0.095 0.278
- AROC_VIBAN 0.332 0.332 0.095 0.000 0.278
- AROC_NEIMB 0.278 0.278 0.278 0.278 0.000
48Matrices often used for the alignment of proteins
- PAM 350 (Dayhoff et al., 1978)
- BLOSUM30 (Henikoff-Henikoff, 1992)
- JTT (Jones et al., 1992)
- mtREV24 (Adachi-Hasegawa, 1996)
- GONNET 250 matrix (Gonnet et al., 1992)
49Alignment of two protein sequences (1)
- For the creation of a phylogenetic tree, a good alignment of protein sequences is of vital importance
- Only homologous residues should be aligned with each other
- Doubtful regions should not be included in the alignment
- Aligned sequences should have similar lengths
50Alignment of two protein sequences
- Alignment requires the user to make assumptions regarding the relative costs of substitutions versus insertions and deletions (indels).
- If substitution cost >> gap penalty, there will be many short gaps and no phylogenetic information.
- In general, search for maximum similarity and minimize the number of insertions and deletions.
- Exclude regions that cannot be aligned unambiguously!
51Multiple alignment of protein sequences
- For the construction of reliable phylogenetic trees, the quality of a multiple alignment is of the utmost importance
- There are many programs available for the multiple alignment of proteins.
- A good program in the public domain is ClustalW or ClustalX
- Available on the web for free and for any platform (PC, Mac, Unix/Linux)
- They quickly align sequence pairs and roughly determine the degree of identity between each pair
- Then the sequences are aligned more precisely in a progressive way, starting with the two closest sequences
- Most programs work best when the sequences have similar length.
52Some rules of thumb for the manual alignment of proteins (1)
- An automatically produced multiple alignment often needs manual adjustment to improve the quality of the alignment.
- Such improvement can be obtained by using all the knowledge that is available about a protein.
- If a structure is available, you should use the detailed information about secondary structure for the alignment.
54Tree construction methods (2)
- Character-based methods
- maximum parsimony
- maximum likelihood
- Non-character-based methods
- distance matrix methods
55Tree construction methods (1)
- Distance matrix methods
- Cluster analysis (UPGMA, WPGMA, etc)
- Fitch & Margoliash (1967)
- Transformed distance methods (e.g. Li, 1981)
- Neighbor-joining (Saitou & Nei, 1987)
- ...many more
- Parsimony methods
- Maximum parsimony (Protpars, DNApars, PAUP)
- Other methods
- Maximum likelihood (DNAML, ProtML, TreePuzzle)
- SplitsTree, MrBayes
- ... many more
56Text available from opperdoes_at_bchm.ucl.ac.be
Text and slides: http://www.icp.be/opperd/chapter8/
Website: http://www.icp.be/opperd/private/proteins.html
http://www.icp.be/opperd/private/phylogeny_2006_Athens.ppt
57Distance Matrix Methods
- UPGMA (Unweighted Pair Group with Arithmetic Mean) uses real (uncorrected) distance values and a sequential clustering algorithm. (Should only be used with closely related OTUs, or when there is constancy of evolutionary rate)
- Neighbors relation methods
- FITCH (Fitch, 1981)
- Neighbor-Joining method (Saitou and Nei, 1987)
- These should all be used with corrected (see above) distance matrices
59Pair-wise alignment of two protein sequences
according to the Dot-Matrix method
60Dot-Matrix plots
Two homologous sequences with 81% identity
Two homologous sequences with 50% identity
62Alignment of two protein sequences (2)
- Alignment requires the user to make assumptions regarding the relative costs of substitutions versus insertions and deletions (indels).
- If substitution cost >> gap penalty, there will be many short gaps and no phylogenetic information.
- In general, search for maximum identity and minimize the number of insertions and deletions.
- Exclude regions that cannot be aligned unambiguously!
- Visual alignment is possible using the "dot-matrix method"
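A dot-matrix plot simply marks every pair of positions at which the two sequences match; homologous stretches then show up as diagonals, and indels as jumps from one diagonal to another. The following is a minimal text-mode sketch. The `window` and `min_matches` filter is a common way to suppress background noise and is an assumption here, not something specified on the slide; all names are hypothetical.

```python
def dot_matrix(seq_a: str, seq_b: str, window: int = 1, min_matches: int = 1):
    """Return rows of booleans: True where a window of seq_a matches seq_b well enough."""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    return [
        [
            sum(seq_a[i + k] == seq_b[j + k] for k in range(window)) >= min_matches
            for j in range(cols)
        ]
        for i in range(rows)
    ]


def render(matrix) -> str:
    """Draw the matrix with '*' for a dot and '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


# Hypothetical toy peptides; the second one lacks the two residues "GA".
print(render(dot_matrix("MKVLGAAG", "MKVLAG")))
```

With real proteins a window of roughly 10 residues with a threshold of a few matches gives much cleaner diagonals than the single-residue comparison used in this toy example.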
63Identity matrix as used in Clustal
- C 10,
- S 0, 10,
- T 0, 0, 10,
- P 0, 0, 0, 10,
- A 0, 0, 0, 0, 10,
- G 0, 0, 0, 0, 0, 10,
- N 0, 0, 0, 0, 0, 0, 10,
- D 0, 0, 0, 0, 0, 0, 0, 10,
- E 0, 0, 0, 0, 0, 0, 0, 0, 10,
- Q 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
- H 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
- R 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
- K 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
- M 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
- I 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
- L 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
- V 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
- F 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
64Distance matrix with mutation costs for amino acids
        A S G L K V T P E D N I Q R F Y C H M W Z B X
Ala A   0 1 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
Ser S   1 0 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 2 2 1 2 2 2
Gly G   1 1 0 2 2 1 2 2 1 1 2 2 2 1 2 2 1 2 2 1 2 2 2
Leu L   2 1 2 0 2 1 2 1 2 2 2 1 1 1 1 2 2 1 1 1 2 2 2
Lys K   2 2 2 2 0 2 1 2 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2
Val V   1 2 1 1 2 0 2 2 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2
Thr T   1 1 2 2 1 2 0 1 2 2 1 1 2 1 2 2 2 2 1 2 2 2 2
Pro P   1 1 2 1 2 2 1 0 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2
Glu E   1 2 1 2 1 1 2 2 0 1 2 2 1 2 2 2 2 2 2 2 1 2 2
Asp D   1 2 1 2 2 1 2 2 1 0 1 2 2 2 2 1 2 1 2 2 2 1 2
Asn N   2 1 2 2 1 2 1 2 2 1 0 1 2 2 2 1 2 1 2 2 2 1 2
Ile I   2 1 2 1 1 1 1 2 2 2 1 0 2 1 1 2 2 2 1 2 2 2 2
Gln Q   2 2 2 1 1 2 2 1 1 2 2 2 0 1 2 2 2 1 2 2 1 2 2
Arg R   2 1 1 1 1 2 1 1 2 2 2 1 1 0 2 2 1 1 1 1 2 2 2
Phe F   2 1 2 1 2 1 2 2 2 2 2 1 2 2 0 1 1 2 2 2 2 2 2
Tyr Y   2 1 2 2 2 2 2 2 2 1 1 2 2 2 1 0 1 1 3 2 2 1 2
Cys C   2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 0 2 2 1 2 2 2
65Hydrophobicity matrix
         R  K  D  E  B  Z  S  N  Q  G  X  T  H  A  C  M  P  V  L  I  Y  F  W
Arg R   10 10  9  9  8  8  6  6  6  5  5  5  5  5  4  3  3  3  3  3  2  1  0
Lys K   10 10  9  9  8  8  6  6  6  5  5  5  5  5  4  3  3  3  3  3  2  1  0
Asp D    9  9 10 10  8  8  7  6  6  6  5  5  5  5  5  4  4  4  3  3  3  2  1
Glu E    9  9 10 10  8  8  7  6  6  6  5  5  5  5  5  4  4  4  3  3  3  2  1
Asx B    8  8  8  8 10 10  8  8  8  8  7  7  7  7  6  6  6  5  5  5  4  4  3
Glx Z    8  8  8  8 10 10  8  8  8  8  7  7  7  7  6  6  6  5  5  5  4  4  3
Ser S    6  6  7  7  8  8 10 10 10 10  9  9  9  9  8  8  7  7  7  7  6  6  4
Asn N    6  6  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  7  7  7  6  6  4
Gln Q    6  6  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  7  7  7  6  6  4
Gly G    5  5  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  8  7  7  6  6  5
??? X    5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  8  8  8  8  7  7  5
Thr T    5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  8  8  8  8  7  7  5
His H    5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  9  8  8  8  7  7  5
Ala A    5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  9  8  8  8  7  7  5
Cys C    4  4  5  5  6  6  8  8  8  8  9  9  9  9 10 10  9  9  9  9  8  8  5
Met M    3  3  4  4  6  6  8  8  8  8  9  9  9  9 10 10 10 10  9  9  8  8  7
Pro P    3  3  4  4  6  6  7  8  8  8  8  8  9  9  9 10 10 10  9  9  9  8  7
Val V    3  3  4  4  5  5  7  7  7  8  8  8  8  8  9 10 10 10 10 10  9  8  7
66PAM 1 mutation matrix
- 1 PAM evolutionary distance
         Ala  Arg  Asn  Asp  Cys  Gln  Glu  Gly  His  Ile  Leu  Lys  Met  Phe  Pro  Ser  Thr  Trp  Tyr  Val
           A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A   9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R      1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N      4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D      6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C      1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q      3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E     10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G     21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H      1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I      2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L      3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K      2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M      1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F      1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P     13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
67PAM 100 matrix as used in Clustal
- C 14,
- S -1, 6,
- T -5, 2, 7,
- P -6, 1, -1, 10,
- A -5, 2, 2, 1, 6,
- G -8, 1, -3, -3, 1, 8,
- N -8, 2, 0, -3, -1, -1, 7,
- D -11, -1, -2, -4, -1, -1, 4, 8,
- E -11, -2, -3, -3, 0, -2, 1, 5, 8,
- Q -11, -3, -3, -1, -2, -5, -1, 1, 4, 9,
- H -6, -4, -5, -2, -5, -7, 2, -1, -2, 4, 11,
- R -6, -1, -4, -2, -5, -8, -3, -6, -5, 1, 1, 10,
- K -11, -2, -1, -4, -4, -5, 1, -2, -2, -1, -3, 3, 8,
- M -11, -4, -2, -6, -3, -8, -5, -8, -6, -2, -7, -2, 1, 13,
- I -5, -4, -1, -6, -3, -7, -4, -6, -5, -5, -7, -4, -4, 2, 9,
- L -12, -7, -5, -5, -5, -8, -6, -9, -7, -3, -5, -7, -6, 4, 2, 9,
- V -4, -4, -1, -4, 0, -4, -5, -6, -5, -5, -6, -6, -6, 1, 5, 1, 8,
- F -10, -5, -6, -9, -7, -8, -6,-11,-11,-10, -4, -7,-11, -2, 0, 0, -5, 12,
69Matrices often used for the alignment of proteins
- PAM 250 (Dayhoff et al., 1978)
- BLOSUM62 (Henikoff-Henikoff, 1992)
- JTT (Jones et al., 1992)
- mtREV24 (Adachi-Hasegawa, 1996)
- GONNET matrix (Gonnet et al., 1992)
70Multiple alignment of protein sequences
- For the construction of reliable phylogenetic trees, the quality of a multiple alignment is of the utmost importance
- There are many programs available for the multiple alignment of proteins.
- A good program in the public domain is ClustalW
- A similar program is Pileup of the GCG package
- They quickly align sequence pairs and roughly determine the degree of identity between each pair
- Then the sequences are aligned more precisely in a progressive way, starting with the two closest sequences
- Most programs work best when the sequences have similar length.
72Some rules of thumb for the manual alignment of proteins (2)
- The rules for mutation of amino acids are dependent on their physicochemical properties.
- Surface residues (DRENK) are preferably mutated to residues of similar properties. Since they are not, or less, involved in protein folding, they mutate rather easily.
- Hydrophobic residues (FAMILYVW) are preferentially replaced by other hydrophobic ones. These residues are mainly internal and determine the folding of the protein. They thus mutate rather slowly.
73Some rules of thumb for the manual alignment of proteins (3)
- The residues CHQST are indifferent and may be replaced with any other type of residue
- The residues (DRENKCHQST), when conserved throughout the alignment, are very likely residues that are involved in the active site. So the multiple alignment should be adjusted accordingly
- Periodicity of charged residues may provide information as to the presence of elements of secondary structure such as α-helices and β-strands
74α-helix
75β-strand
76Some rules of thumb for the manual alignment of proteins (4)
- Indels (insertions/deletions) are rarely found within elements of secondary structure; as a rule they occur in loops.
- Pro and Gly interfere with secondary structure elements and thus have a preference for loops
- Hydrophobicity (or hydropathy) profiles according to Kyte and Doolittle of two homologous proteins are in general strikingly similar
77Proline interferes with α-helix and β-sheet formation
From Deber and Therien,2002
78Possible functions for proline in trans-membrane
domains
From Deber and Therien,2002
79Alignment of malate dehydrogenase sequences
lclCHR34_tmp.0150  ----MKPST--LSRFKVTVLGASGAIGQPLALALVQNKRVSEL-----ALYDIVQPR---
lclCHR34_tmp.0140  ----MRRSQ--GCFFRVAVLGAAGGIGQPLSLLLKNNKYVKEL-----KLYDVKGGP---
lclCHR34_tmp.0130  MGLLFRRSLTALKKGKVVLFGCSNAVGQPLSLLLKMNPHVEELVCCNTAADDDVPGS---
lclCHR28_tmp.0050  -----------MSAVKVAVTGAAGQIGYALVPLIARGALLGPTTPVELRLLDIEPALKAL

lclCHR34_tmp.0150  -GVAVDLSHFPRKVKVTGYPTKWIHK--ALDGADLVLMSAGMPRRPGMT-HDDLFNTNAL
lclCHR34_tmp.0140  -GVAADLSHICAPAKVTGYTKDELSR--AVENADVVVIPAGIPRKPGMT-RDDLFNTNAS
lclCHR34_tmp.0130  -GIAADLSHIDTLPKVH-YATDEGQWPALLRDAQLILVCFGSSFDLLREDRDIALKAAAP
lclCHR28_tmp.0050  AGVEAELEDCAFPLLDKVVVTADPRV--AFDGVAIAIMCGAFPRKAGME-RKDLLEMNAR

lclCHR34_tmp.0150  TVNELSAAVARYAPKSV-LAIISNPLNSMVPVAAETLQRAGVYDPRKLFGIISLNMMRAR
lclCHR34_tmp.0140  IVRDLAIAVGTHAPKAI-VGIITNPVNSTVPVAAEALKKVGVYDPARLFGVTTLDVVRAR
lclCHR34_tmp.0130  TMRRVMAAVASSDTTGN-VAVVSSPVNALTPFCAELLKASGKFDPRKLFGVTTLDVIRTR
lclCHR28_tmp.0050  IFKEQGEAIAAVAASDCRVVVVGNPANTNALILLKSAQ--GKLNPRHVTAMTRLDHNRAL

lclCHR34_tmp.0150  KMLGDFTGQDPEMLDVPVIGGHSGQTIVPLFSHS--GVELRQEQVEYLTHRVR-------
lclCHR34_tmp.0140  TFVAEALGASPYDVDVPVIGGHSGETIVPLLSG---FPSLSEEQVRQLTHRIQ-------
lclCHR34_tmp.0130  KLVAGTLHMNPYDVNVPVVGGCGGVTACPLIAQT--GLRIPLDDIVRISGEVQSYGVLFE
lclCHR28_tmp.0050  SLLARKAGVPVSQVRNVIIWGNHSSTQVPDTDSAVIGTTPAREAIKDDALDDD-----FV

lclCHR34_tmp.0150  --VGGD-EVVKAKEGRGSSSLSMAFAAAEWADGVLRAMDGEKTLLQCSFVESPLFADKCR
lclCHR34_tmp.0140  --FGGD-EVVKAKDGAGSATLSMAFAGNEWTTAVLRALSGEKGVVVCTYVQS-TVEPSCA
lclCHR34_tmp.0130  AAVGADSHDALSTEVAPPVALGLAYAACDFSTSLLKALRGDVGIVECALVES-TMRSETP
lclCHR28_tmp.0050  QVVRGRGAEIIQLRGLSSAMSAAKAAVDHVHDWIHGTPEGVYVSMGVYSDENPYGVPSGL

lclCHR34_tmp.0150  FFGSTVEVCKEGIERVLPLPPLNEYEEEQLDRCLPDLEKN-IRKGLAFVAENAATSTPST
lclCHR34_tmp.0140  FFSSPVLLGNSGVEKIYPVPMLNAYEEKLMAKCLEGLQSN-ITKGIAFSNK---------
lclCHR34_tmp.0130  FFSSRVELGREGVQRVFPMGALTSYEHELIETAVPELMRD-VQAGIEAATQF--------
lclCHR28_tmp.0050  IFSFP-CTCHAGEWTVVSGKLNGDLGKQRLASTIAELQEERAQAGL--------------
80Hydrophobicity profiles
- Profiles according to Kyte and Doolittle of homologous proteins are in general strikingly similar and may provide a tool in the alignment of two or more proteins.
- The two phosphoglycerate kinase sequences below share 50% identical residues.
[Figure: hydrophobicity profiles of Trypanosoma congolense PGK and Euglena gracilis PGK]
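A Kyte and Doolittle profile is a sliding-window average of per-residue hydropathy values. The sketch below uses the published Kyte-Doolittle hydropathy scale; the window length of 9 and the function name `hydropathy_profile` are choices made for this illustration, and the peptide is an invented toy sequence.

```python
# Kyte-Doolittle hydropathy scale (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(seq: str, window: int = 9) -> list[float]:
    """Mean hydropathy of every window of `window` consecutive residues."""
    scores = [KD[aa] for aa in seq.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


# Hypothetical peptide: a charged stretch followed by a hydrophobic stretch.
peptide = "MKDEERKDSLLVIAVLGIAVF"
profile = hydropathy_profile(peptide, window=9)
for pos, value in enumerate(profile, start=1):
    print(f"{pos:3d} {value:6.2f}")
```

Plotting such profiles for two candidate homologues one above the other makes conserved hydrophobic cores visible even where the residues themselves differ, which can help to place gaps during manual adjustment.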
81Tree construction methods (1)
- Distance matrix methods
- Cluster analysis (UPGMA, WPGMA, etc)
- Fitch & Margoliash (1967)
- Transformed distance methods (e.g. Li, 1981)
- Neighbor-joining (Saitou & Nei, 1987)
- ...many more
- Parsimony methods
- Maximum parsimony
- Other methods
- Maximum likelihood (Felsenstein, 1981)
- ... many more
83Phylogeny (2)
- Distance matrix methods (in the public domain)
- Least squares method (Fitch and Margoliash)
- Fitch, Kitsch of the Phylip package (Joe Felsenstein, Univ. Washington)
- Neighbor-joining method
- Neighbor of the Phylip package (Joe Felsenstein, Univ. Washington)
- Clustal, or Distnj in the Protml package (Adachi and Hasegawa, Univ. Tokyo)
- Darwin (Gaston Gonnet, ETH, Zurich, via mailserver or WWW)
- Protein maximum likelihood (in the public domain)
- Protml (Adachi and Hasegawa, Univ. Tokyo) (very CPU-intensive)
- TreePuzzle (Strimmer and von Haeseler, 1997)
- Protein maximum parsimony (in the public domain)
- Protpars (Joe Felsenstein, Univ. Washington)
- Paup (David Swofford, latest version will be commercial)
84Some useful information about phylogenetic trees
[Figure: rooted tree with external nodes A-E, internal nodes F-I and the root indicated]
- A-E are external nodes (extant); F-I are internal (ancestral) nodes
- External nodes = OTUs. OTUs are operational taxonomic units; they can be species, populations, individuals, genes or proteins
- Nodes are the extant (existing) or extinct (ancestral) OTUs
- Topology = order of the nodes on the tree
85Distance Matrix Methods
- UPGMA (Unweighted Pair Group with Arithmetic Mean) uses real (uncorrected) distance values and a sequential clustering algorithm. (Should only be used with closely related OTUs, or when there is constancy of evolutionary rate)
- Transformed distance methods. Corrections may be introduced to obtain trees with true evolutionary distances (PAM values, Kimura), or corrections are carried out with reference to an outgroup (Farris, 1971; Klotz et al., 1979). Should be used when evolutionarily distant organisms are included in the dataset
- Neighbors relation methods
- FITCH (Fitch, 1981)
- Neighbor-Joining method (Saitou and Nei, 1987)
- These should all be used with corrected (see above) distance matrices
86Distance matrix
Uncorrected for Multiple Substitutions
      1      2      3      4      5
   0.00   0.63   0.63  22.88  18.50   AC007866_13 1
          0.00   0.63  22.57  18.50   AC007866_17 2
                 0.00  22.88  17.87   AC007866_15 3
                        0.00   5.64   AC007866_9 4
                               0.00   AC007866_11 5

Using the Kimura correction method
Gap weighting is 0.000000
      1      2      3      4      5
   0.00   0.63   0.63  27.35  21.29   AC007866_13 1
          0.00   0.63  26.90  21.29   AC007866_17 2
                 0.00  27.35  20.47   AC007866_15 3
                        0.00   5.88   AC007866_9 4
                               0.00   AC007866_11 5

Distance matrix as produced by the EMBOSS program distmat
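The corrected values in the second matrix are consistent with Kimura's empirical protein-distance formula, D = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing sites. The sketch below applies that formula to the uncorrected percentages; it assumes that the sequences are proteins and that this is the variant of the correction that was applied, neither of which is stated on the slide. The function name is illustrative only.

```python
import math

def kimura_protein_distance(p):
    """Kimura's empirical correction; p is the observed fraction of differing sites."""
    return -math.log(1.0 - p - 0.2 * p * p)

# Uncorrected distances (percent) taken from the first matrix.
uncorrected_percent = [0.63, 22.88, 22.57, 18.50, 17.87, 5.64]

for percent in uncorrected_percent:
    corrected = 100.0 * kimura_protein_distance(percent / 100.0)
    print(f"{percent:6.2f} -> {corrected:6.2f}")
```

The printed values should be close to the tabulated ones; small differences in the last digit can be expected because the inputs are already rounded to two decimals. Note how the correction barely changes small distances but inflates large ones.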
87UPGMA
- UPGMA (Unweighted Pair Group with Arithmetic
Mean) uses real (uncorrected) distance values and
a sequential clustering algorithm. (Should only
be used with closely related OTUs, or when there
is constancy of evolutionary rate)
88Tree construction (UPGMA)
First cycle
     A   B   C   D   E
B    2
C    4   4
D    6   6   6
E    6   6   6   4
F    8   8   8   8   8
Cluster the pair of OTUs with the smallest distance, here A and B. The branching point is positioned at a distance of 2 / 2 = 1 substitution.
89Tree construction (UPGMA)
- Following the first clustering A and B are considered as a single composite OTU (A,B) and we now calculate the new distance matrix as follows
- dist(A,B),C = (distAC + distBC) / 2 = 4
- dist(A,B),D = (distAD + distBD) / 2 = 6
- dist(A,B),E = (distAE + distBE) / 2 = 6
- dist(A,B),F = (distAF + distBF) / 2 = 8
- In other words, the distance between a simple OTU and a composite OTU is the average of the distances between the simple OTU and the constituent simple OTUs of the composite OTU. Then a new distance matrix is built from the newly calculated distances and the whole cycle is repeated
90Tree construction (UPGMA)
- Second cycle
       A,B   C    D    E
C       4
D       6    6
E       6    6    4
F       8    8    8    8
91Tree construction (UPGMA)
- Third cycle
       A,B   C    D,E
C       4
D,E     6    6
F       8    8    8
92Tree construction (UPGMA 4)
- Fourth cycle
       AB,C  D,E
D,E     6
F       8    8
93Tree construction (UPGMA)
- Fifth cycle
       ABC,DE
F       8
- The final step consists of clustering the last OTU, F, with the composite OTU.
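The five clustering cycles can be sketched in a few lines of Python. This is a minimal illustration, not the code of any particular package: the function name `upgma` and the data layout (a dictionary keyed by pairs of OTU names) are assumptions made for this example. The distance between two clusters is taken as the average of all pairwise distances between their member OTUs, and each branching point is placed at half the distance between the clusters that are joined.

```python
from itertools import combinations

def upgma(names, dist):
    """Cluster OTUs by UPGMA.

    dist maps frozenset({a, b}) to the distance between OTUs a and b.
    Returns the merges as (cluster_1, cluster_2, height_of_branching_point).
    """
    clusters = [(name,) for name in names]
    merges = []

    def cluster_distance(c1, c2):
        # Average over all pairs of original OTUs in the two clusters.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2) / 2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Lower-triangular matrix of the first cycle (rows B..F, columns A..E).
names = "ABCDEF"
rows = [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]]
dist = {}
for i, row in enumerate(rows, start=1):
    for j, value in enumerate(row):
        dist[frozenset((names[i], names[j]))] = value

for c1, c2, height in upgma(names, dist):
    print("".join(c1), "+", "".join(c2), "at height", height)
```

When two pairs are tied for the smallest distance, as (A,B)-C and D-E are in the second cycle, this sketch simply takes the first pair it encounters, so the order of the merges may differ from the slides while the resulting tree is the same.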
94Pitfalls of UPGMA
- The UPGMA clustering method is very sensitive to unequal evolutionary rates.
- Clustering works only if the data are ultrametric
- Ultrametric distances are defined by the satisfaction of the 'three-point condition'.
95The three-point condition
- For any three taxa: dist AC <= max(distAB, distBC), or, in words: the two greatest distances are equal
- UPGMA assumes that the evolutionary rate is the same for all branches
- If the assumption of rate constancy among lineages does not hold, UPGMA may give an erroneous topology.
Non-ultrametric tree
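The three-point condition is easy to test directly: for every triple of taxa, sort the three pairwise distances and check that the two largest are equal. The sketch below does this for the distance matrix of the UPGMA example and for a small hypothetical three-taxon matrix with unequal rates; the function names and the second data set are assumptions made for this illustration.

```python
from itertools import combinations

def is_ultrametric(names, dist, tol=1e-9):
    """True if every triple of taxa satisfies the three-point condition."""
    for a, b, c in combinations(names, 3):
        d = sorted([dist[frozenset((a, b))],
                    dist[frozenset((a, c))],
                    dist[frozenset((b, c))]])
        if abs(d[2] - d[1]) > tol:  # the two greatest distances must be equal
            return False
    return True

def from_lower_triangle(names, rows):
    """Build a pairwise distance dictionary from lower-triangular rows."""
    dist = {}
    for i, row in enumerate(rows, start=1):
        for j, value in enumerate(row):
            dist[frozenset((names[i], names[j]))] = value
    return dist

# Matrix of the UPGMA worked example (rows B..F).
clock_like = from_lower_triangle(
    "ABCDEF", [[2], [4, 4], [6, 6, 6], [6, 6, 6, 4], [8, 8, 8, 8, 8]])

# Hypothetical distances with unequal rates: AB = 2, AC = 5, BC = 4.
unequal = from_lower_triangle("ABC", [[2], [5, 4]])

print(is_ultrametric("ABCDEF", clock_like))
print(is_ultrametric("ABC", unequal))
```

Running such a check before clustering gives an indication of whether UPGMA is appropriate for a given data set.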
96Unequal rates of mutation lead to wrong trees
- UPGMA tree construction based on the data of the
left tree would result in the erroneous tree at
the right
97UPGMA (conclusion)
- UPGMA uses real (uncorrected) distance values and a sequential clustering algorithm.
- This method of tree construction is very sensitive to differences in branch length or unequal rates of evolution.
- It should only be used with closely related OTUs, or when there is constancy of evolutionary rate.
- The method is often used in combination with isoenzyme or restriction site data or with morphological criteria
98Maximum Parsimony Methods
- Use sequence information rather than distance information
- Search among all possible trees for the tree that requires the minimum number of substitutions at the informative sites
99Maximum Parsimony analysis (2)
- Parsimony implies that simpler hypotheses are preferable to more complicated ones.
- Maximum parsimony is a character-based method that infers a phylogenetic tree by minimizing the total number of evolutionary steps required to explain a given set of data, or in other words by minimizing the total tree length.
- The steps may be base or amino-acid substitutions for sequence data, or gain and loss events for restriction site data.
100Maximum Parsimony analysis (3)
- Maximum parsimony, when applied to protein sequence data, either considers each site of the sequence as a multistate unordered character with 20 possible states (the amino acids) (Eck and Dayhoff, 1966), or may take into account the genetic code and the number of mutations, 1, 2 or 3, that is required to explain an observed amino-acid substitution. The latter method is implemented in the PROTPARS program (Felsenstein, 1993).
- The maximum parsimony method searches all possible tree topologies for the optimal (minimal) tree. However, the number of unrooted trees that have to be analysed rapidly increases with the number of OTUs.
101Maximum Parsimony analysis (4)
- The number of rooted trees (Nr) for n OTUs is given by
$$N_r = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$$
- The number of unrooted trees (Nu) for n OTUs is given by
$$N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$

Number of OTUs   unrooted trees   rooted trees
 2               1                1
 3               1                3
 4               3                15
 5               15               105
 6               105              945
 7               945              10,395
 8               10,395           135,135
 9               135,135          2,027,025
10               2,027,025        34,459,425
15               7.91E12          2.13E14
20               2.22E20          8.20E21

This rapid increase in the number of trees to be analysed may make it impossible to apply the method to very large datasets. In that case the parsimony method may become very time consuming, even on very fast computers.
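Both formulas reduce to a product of odd numbers: Nr = 1 x 3 x 5 x ... x (2n - 3) and Nu = 1 x 3 x 5 x ... x (2n - 5). The following sketch uses that identity to compute the counts exactly with Python integers; the function names are illustrative.

```python
def odd_product(k):
    """Return 1 * 3 * 5 * ... * k for odd k (and 1 when k < 3)."""
    result = 1
    for factor in range(3, k + 1, 2):
        result *= factor
    return result

def rooted_trees(n):
    """Number of rooted bifurcating trees for n OTUs."""
    return odd_product(2 * n - 3)

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n OTUs."""
    return odd_product(2 * n - 5)

for n in (4, 5, 6, 7, 8, 9, 10, 15, 20):
    print(f"{n:3d} {unrooted_trees(n):>25,d} {rooted_trees(n):>30,d}")
```

The number of unrooted trees for n OTUs equals the number of rooted trees for n - 1 OTUs, which is a convenient check on any such table.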
102maximum parsimony method for 4 nucleic-acid
sequences
Sequence   Site
           1 2 3 4 5 6 7 8 9
1          A A G A G T G C A
2          A G C C G T G C G
3          A G A T A T C C A
4          A G A G A T C C G
- For four OTUs there are three possible unrooted trees. The trees are then analysed by searching for the ancestral sequences and by counting the number of mutations required to explain the respective trees
103
Tree I (number of mutations: 10)
(1) AAGAGTGCA                     AGATATCCA (3)
             \4                 2/
              AGCCGTGCG --4-- AGAGATCCG
             /0                 0\
(2) AGCCGTGCG                     AGAGATCCG (4)

Tree II (number of mutations: 14)
(1) AAGAGTGCA                     AGCCGTGCG (2)
             \1                 3/
              AGGAGTGCA --5-- AGAGGTCCG
             /4                 1\
(3) AGATATCCA                     AGAGATCCG (4)

Tree III (number of mutations: 16)
(1) AAGAGTGCA                     AGCCGTGCG (2)
             \1                 3/
              AGAAGTGCA --5-- AGATGTCCG
             /5                 2\
(4) AGAGATCCG                     AGATATCCA (3)

Tree I has the topology with the least number of mutations and thus is the most parsimonious tree. The ancestral sequences at the internal nodes are inferred. This analysis includes both informative and non-informative sites in the sequence. When only informative sites are included, far fewer sites have to be analysed, which in the case of large datasets means a considerable gain in CPU time.
104Informative sites
A site is informative only when there are at
least two different kinds of nucleotides at the
site, each of which is represented in at least
two of the sequences under study.
Sequence   Site
           1 2 3 4 5 6 7 8 9
1          A A G A G T G C A
2          A G C C G T G C G
3          A G A T A T C C A
4          A G A G A T C C G
                   *   *   *
Informative sites are indicated by an asterisk (*)
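The definition of an informative site translates directly into a small test on each alignment column: count how often each nucleotide occurs and require at least two nucleotides that each occur at least twice. The sketch below applies this to the four example sequences; the function name is illustrative.

```python
from collections import Counter

def informative_sites(sequences):
    """Return the 1-based positions of parsimony-informative sites."""
    sites = []
    for position, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        # States that are present in at least two of the sequences.
        shared_states = [state for state, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            sites.append(position)
    return sites

sequences = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
print(informative_sites(sequences))
```

For these sequences the function reports sites 5, 7 and 9; sites such as 2 (A, G, G, G) or 4 (A, C, T, G) vary but are not informative, because they require the same number of changes on every tree.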
105Informative sites only
Sequence   Sites 5, 7, 9
1          GGA
2          GGG
3          ACA
4          ACG

Tree I (number of mutations: 4)
(1) GGA             ACA (3)
       \1         1/
        GGG --2-- ACG
       /0         0\
(2) GGG             ACG (4)

Tree II (number of mutations: 5)
(1) GGA             GGG (2)
       \1         1/
        GCA --1-- GCG
       /1         1\
(3) ACA             ACG (4)

Tree III (number of mutations: 6)
(1) GGA             GGG (2)
       \2         1/
        GCG --0-- GCG
       /1         2\
(4) ACG             ACA (3)

To infer a maximum parsimony tree, for each
possible tree we calculate the minimum number of
substitutions at each informative site. In the
above example, for sites 5, 7, and 9, tree I
requires in total 4 changes, tree II requires 5
changes, and tree III requires 6 changes. In the
final step, we sum the number of changes over all
the informative sites for each tree and choose
the tree associated with the smallest number of
substitutions. In our case, tree I is chosen
because it requires the smallest number of
changes (4) at the informative sites.
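The counts of 4, 5 and 6 changes can be reproduced with a short script. The sketch below uses Fitch's set-intersection procedure to obtain the minimum number of changes per site; the slides do not name a specific counting algorithm, so this choice, like the function name and the nested-tuple tree notation, is an assumption made for illustration. Each unrooted four-taxon tree is written as a pair of pairs, which amounts to placing a temporary root on the internal branch and does not affect the count.

```python
def fitch(tree, column):
    """Return (possible ancestral states, minimum changes) for one site.

    tree is either a leaf label or a pair (left_subtree, right_subtree);
    column maps each leaf label to the state observed at this site.
    """
    if not isinstance(tree, tuple):
        return {column[tree]}, 0
    left_states, left_changes = fitch(tree[0], column)
    right_states, right_changes = fitch(tree[1], column)
    common = left_states & right_states
    if common:
        return common, left_changes + right_changes
    # No shared state: one extra change is needed at this node.
    return left_states | right_states, left_changes + right_changes + 1

# Informative sites 5, 7 and 9 of the four example sequences.
sequences = {1: "GGA", 2: "GGG", 3: "ACA", 4: "ACG"}
trees = {
    "I": ((1, 2), (3, 4)),
    "II": ((1, 3), (2, 4)),
    "III": ((1, 4), (2, 3)),
}

scores = {}
for name, tree in trees.items():
    total = 0
    for site in range(3):
        column = {leaf: seq[site] for leaf, seq in sequences.items()}
        total += fitch(tree, column)[1]
    scores[name] = total

print(scores)
```

The tree with the lowest score is the most parsimonious one; with more than four OTUs the same counting routine would be applied to every candidate topology.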
106How to find the best tree ?
- Maximum parsimony searches for the optimal (minimal) tree. In this process more than one minimal tree may be found. To guarantee finding the best possible tree, an exhaustive evaluation of all possible tree topologies has to be carried out. However, this becomes impossible when there are more than 12 OTUs in a dataset.
- Branch and Bound is a variation on maximum parsimony that guarantees finding the minimal tree without having to evaluate all possible trees. This way a larger number of taxa can be evaluated, but the method is still limited.
- Heuristic searching is a method with step-wise addition and rearrangement (branch swapping) of OTUs. Here it is not guaranteed that the best tree is found.
- Since, in view of the size of the dataset, it is often not possible to carry out an exhaustive or other search for the best tree, it is advised to change the order of the taxa in the dataset and to repeat the analysis, or to instruct the program to do this for you by providing a so-called jumble factor.
107Consensus tree
- Since the Maximum Parsimony method may result in
more than one equally parsimonious tree, a
consensus tree should be created. For the
creation of a consensus tree see bootstrapping.
108Parsimony and branch lengths
(1) G         A (3)
     \1     0/
      C --1-- A
     /0     1\
(2) C         T (4)

(1) G         A (3)
     \0     1/
      G --1-- T
     /1     0\
(2) C         T (4)

(1) G         A (3)
     \1     1/
      C --1-- A
     /0     0\
(2) C         A (4)

Three possible trees for 4 OTUs; all describe the same final state by assuming a total of 3 steps. Each final state is arrived at via a different route. Each of the three trees is equally valid, but the number of steps along the individual branches (or the length of each branch) is not determined. For this reason branch lengths are not given in parsimony, but only the total number of steps for a tree.
109Some final notes on maximum parsimony
- Maximum Parsimony (positive points)
- is based on shared and derived characters. It therefore is a cladistic rather than a phenetic method
- does not reduce sequence information to a single number
- tries to provide information on the ancestral sequences
- evaluates different trees
- Maximum Parsimony (negative points)
- does not assume an evolutionary model
- is slow in comparison with distance methods
- does not use all the sequence information (only informative sites are used)
- does not correct for multiple mutations (does not imply a model of evolution)
- does not provide information on the branch lengths
- is notorious for its sensitivity to codon bias
110How to root an unrooted tree?
- The majority of methods yield unrooted trees
- To root a tree one should add an outgroup to the dataset. An outgroup is an OTU for which external information (e.g. paleontological information) is available that indicates that the outgroup branched off before all other taxa
- Do not choose an outgroup that is very distantly related to your taxa. This may result in serious topological errors
- Do not choose an outgroup that is too closely related to the taxa in question either. In this case it may not be a true outgroup
- The use of more than one outgroup generally improves the estimate of tree topology
- In the absence of a good outgroup the root may be positioned by assuming approximately equal evolutionary rates over all the branches. In this way the root is put at the midpoint of the longest pathway between two OTUs
111Maximum likelihood
- It evaluates a hypothesis about evolutionary history in terms of the probability that the proposed model and the hypothesized history would give rise to the observed data set. A history with a higher probability of reaching the observed state is preferred to a history with a lower probability. The method searches for the tree with the highest probability or likelihood.
- The following programs are available from the web
- DNAML (DNA data only. By Joe Felsenstein in the Phylip package)
- FastDNAML (DNA data only. A faster algorithm applied by Gary Olsen to Joe Felsenstein's DNAML program)
- ProtML (DNA and protein. By Adachi and Hasegawa, 1992)
- TreePuzzle (DNA and protein. By Strimmer and von Haeseler, 1995). This program applies a heuristic method and is much faster than PROTML, but does not guarantee to find the best tree.
112Advantages and disadvantages of the maximum
likelihood method
- There are some supposed advantages of maximum likelihood methods over other methods.
- It is the estimation method least affected by sampling error
- It is robust to many violations of the assumptions in the evolutionary model
- With very short sequences it tends to outperform alternative methods such as parsimony or distance methods.
- The method is statistically well founded
- It evaluates different tree topologies
- It uses all the sequence information
- There are also some supposed disadvantages
- Maximum likelihood is very CPU intensive and thus extremely slow
- The result is dependent on the model of evolution used
113Explication of the method
Maximum likelihood evaluates the probability that the chosen evolutionary model will have generated the observed sequences. Phylogenies are then inferred by finding those trees that yield the highest likelihood. Assume that we have the aligned nucleotide sequences for four taxa

         1         j            N
(1)      A G G C U C C A A .... A
(2)      A G G U U C G A A .... A
(3)      A G C C C A G A A .... A
(4)      A U U U C G G A A .... C

and we want to evaluate the likelihood of the unrooted tree represented by the nucleotides of site j in the sequence and shown below

(1)        (2)
   \      /
    \    /
    ------
    /    \
   /      \
(3)        (4)

What is the probability that this tree would have generated the data presented in the sequence under the chosen model?
114Likelihood for one site
The models are time-reversible, therefore the
likelihood of the tree is independent of the
position of the root. Thus it is convenient to
root the tree at an arbitrary internal node.
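[Figure: the tree for site j, rooted at an arbitrary internal node, with the nucleotides C, C, A and G at the tips; one reconstruction with A at both internal nodes is shown, followed by the general case with unknown states at the internal nodes (5) and (6)]
L(j) = Sum of Prob(tree with given nucleotides at nodes (5) and (6)) over all possible nucleotides at nodes (5) and (6)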
Assume that nucleotide sites evolve independently (the Markovian model of evolution). Then we can calculate the likelihood for each site separately and combine these into the total likelihood. For the likelihood for site j, we have to consider all the possible scenarios by which the nucleotides present at the tips of the tree could have evolved. So the likelihood for a particular site is the summation of the probabilities of every possible reconstruction of ancestral states, given some model of base substitution. In this specific case all possible nucleotides A, G, C, and T may occupy nodes (5) and (6), giving 4 x 4 = 16 possibilities.
In the case of protein sequences each site may occupy 20 states (those of the 20 amino acids) and thus 20 x 20 = 400 possibilities have to be considered. Since any one of these scenarios could have led to the amino-acid configuration at the tips of the tree, we must calculate the probability of each and sum them to obtain the total probability for each site j.
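The summation over the 16 possible reconstructions can be written out explicitly. The sketch below is a minimal illustration under several assumptions that are not part of the slides: the Jukes-Cantor model is used for the substitution probabilities, all five branches have the same hypothetical length t (in expected substitutions per site), node (5) joins the first two tips and node (6) the last two, and the tree is rooted at node (5) with equal base frequencies of 1/4.

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(a, b, t):
    """Jukes-Cantor probability of observing b at the end of a branch of
    length t (expected substitutions per site) that started with a."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def site_likelihood(tips, t=0.1):
    """Likelihood of one site for the four-taxon tree ((1,2),(3,4)).

    Sums over all 4 x 4 assignments of nucleotides x and y to the
    internal nodes (5) and (6); the root is placed at node (5).
    """
    tip1, tip2, tip3, tip4 = tips
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (0.25                      # prior probability of x at the root
                  * jc_prob(x, tip1, t)
                  * jc_prob(x, tip2, t)
                  * jc_prob(x, y, t)        # internal branch (5)-(6)
                  * jc_prob(y, tip3, t)
                  * jc_prob(y, tip4, t))
    return total

print(site_likelihood("CCAG"))
```

Because the model is time-reversible, rooting the tree at node (6) instead would give the same value. A useful sanity check is that the likelihoods of all 256 possible tip patterns add up to 1.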
115likelihood for the full tree
The likelihood for the full tree then is the product of the likelihood at each site.
$$L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j)$$
Since the individual likelihoods are extremely small numbers it is convenient to sum the log likelihoods at each site and report the likelihood of the entire tree as the log likelihood.
$$\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)$$
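The practical reason for working with log likelihoods is numerical: the product of many small per-site likelihoods quickly falls below the smallest number a floating-point variable can hold. The sketch below illustrates this with hypothetical per-site likelihoods of 0.00001 for 100 sites; the values are invented purely for the demonstration.

```python
import math

# Hypothetical per-site likelihoods for an alignment of 100 sites.
site_likelihoods = [1e-5] * 100

# Direct product: underflows to zero in double precision.
product_likelihood = math.prod(site_likelihoods)

# Sum of logs: stays well within the representable range.
log_likelihood = sum(math.log(value) for value in site_likelihoods)

print(product_likelihood)
print(log_likelihood)
```

Since the logarithm is a monotonic function, the tree with the highest log likelihood is also the tree with the highest likelihood, so nothing is lost by comparing trees on the log scale.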
116The model of evolution
The PROTML program in the MOLPHY package (Adachi
and Hasegawa, 1992), as well as the TreePUZZLE
program by Strimmer and von Haeseler (1995), have
implemented an instantaneous rate matrix derived
from the Dayhoff empirical substitution matrix.
This has been called the Dayhoff model.
Recently a model called the