Molecular Evolution of Proteins and Phylogenetic Analysis. Fred R. Opperdoes, Christian de Duve Institute of Cellular Pathology (ICP) and Laboratory of Biochemistry, Université catholique de Louvain, Brussels, Belgium

1
Molecular Evolution of Proteins and Phylogenetic Analysis
Fred R. Opperdoes
Christian de Duve Institute of Cellular Pathology (ICP) and
Laboratory of Biochemistry, Université catholique
de Louvain, Brussels, Belgium
2
The tree of life based on rRNA sequences
Mitochondriates
Amitochondriates
3
The fusion hypothesis: the eukaryotic cell is a
chimaera of eubacterial and archaebacterial
traits
Energy metabolism
Genetic machinery
Root?
Common ancestor?
4
Triosephosphate isomerase
Triosephosphate isomerase of eukaryotes is of
typical eubacterial origin and probably has
entered the eukaryotic cell together with the
bacterial endosymbiont that gave rise to the
formation of the mitochondrion
Root?
5
Arguments in favour of protein rather than the
DNA sequences
  • CODON BIAS
  • 64 different possible triplet codes encode 20
    amino acids. One amino acid may be encoded by 1
    to 6 different triplet codes, and 3 of the 64
    codes, called stop (or termination) codons,
    specify "end of peptide sequence"
  • The different codons are used with unequal
    frequency and this distribution of frequency is
    referred to as "codon usage"
  • Codon usage varies between species. Amino-acid
    codons are degenerate, with wobble in the
    third position.
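
The degeneracy described above (1 to 6 codons per amino acid, 3 stop codons) can be checked with a short sketch. It assumes the standard genetic code written as the usual 64-letter string in T, C, A, G order; the names `CODON_TABLE` and `degeneracy` are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map every triplet to its one-letter amino acid (or '*').
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that encode each amino acid (degeneracy).
degeneracy = Counter(CODON_TABLE.values())

if __name__ == "__main__":
    for aa, n in sorted(degeneracy.items(), key=lambda kv: -kv[1]):
        print(aa, n)
```

Leu, Ser and Arg come out with six codons each, while Met and Trp have a single codon.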

6
The universal genetic code
First      Second position               Third
position   U(T)   C      A      G        position
U(T)       Phe    Ser    Tyr    Cys      U(T)
           Phe    Ser    Tyr    Cys      C
           Leu    Ser    STOP   STOP     A
           Leu    Ser    STOP   Trp      G
C          Leu    Pro    His    Arg      U(T)
           Leu    Pro    His    Arg      C
           Leu    Pro    Gln    Arg      A
           Leu    Pro    Gln    Arg      G
A          Ile    Thr    Asn    Ser      U(T)
           Ile    Thr    Asn    Ser      C
           Ile    Thr    Lys    Arg      A
           Met    Thr    Lys    Arg      G
G          Val    Ala    Asp    Gly      U(T)
           Val    Ala    Asp    Gly      C
           Val    Ala    Glu    Gly      A
           Val    Ala    Glu    Gly      G

7
Arguments in favour of ... (codon bias 2)
  • Yeasts, protozoa and animals have different
    codon preferences.
  • This results in differences in DNA sequence that
    are related to codon bias and not to evolution.

8
Different species use different codons
Homo sapiens, gbmam: 1 CDS (389 codons)
fields: triplet, frequency per thousand, (number)

UUU 20.6(     8)  UCU  5.1(     2)  UAU  7.7(     3)  UGU  7.7(     3)
UUC 12.9(     5)  UCC 20.6(     8)  UAC 30.8(    12)  UGC  0.0(     0)
UUA 10.3(     4)  UCA 18.0(     7)  UAA  0.0(     0)  UGA  0.0(     0)
UUG 10.3(     4)  UCG  0.0(     0)  UAG  2.6(     1)  UGG 15.4(     6)

Saccharomyces cerevisiae, gbpln: 9295 CDS's (4586264 codons)
fields: triplet, frequency per thousand, (number)

UUU 25.9(118900)  UCU 23.6(108308)  UAU 18.7( 85651)  UGU  8.0( 36624)
UUC 18.3( 83880)  UCC 14.3( 65421)  UAC 14.7( 67599)  UGC  4.6( 21255)
UUA 26.3(120698)  UCA 18.7( 85618)  UAA  1.0(  4476)  UGA  0.6(  2742)
UUG 27.2(124967)  UCG  8.5( 39137)  UAG  0.4(  2058)  UGG 10.4( 47694)
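
The "frequency per thousand" column is simply the codon count divided by the total number of codons, times 1000 (for example 8 / 389 gives 20.6). A minimal sketch of that calculation follows; the function name `codon_usage_per_thousand` and the toy input are hypothetical, and the sequence is assumed to be an in-frame coding sequence.

```python
from collections import Counter


def codon_usage_per_thousand(cds):
    """Return {codon: (frequency per thousand, count)} for an in-frame CDS."""
    rna = cds.upper().replace("T", "U")
    usable = len(rna) - len(rna) % 3          # ignore a trailing partial codon
    codons = [rna[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {c: (round(1000 * n / total, 1), n) for c, n in counts.items()}


if __name__ == "__main__":
    # Toy sequence, not real data: AUG UUU UUU UAA
    for codon, (freq, n) in sorted(codon_usage_per_thousand("ATGTTTTTTTAA").items()):
        print(codon, freq, n)
```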
9
Differences between the Universal and
Mitochondrial Genetic Codes
Codon   Universal code   Mitochondrial code
UGA     Stop             Trp
AGA     Arg              Stop
AGG     Arg              Stop (or Lys*)
AUA     Ile              Met
Modified from Li and Graur, 1991, Fundamentals
of Molecular Evolution, Sinauer Publ.
* Only in arthropod mitochondria (Abascal et al.,
PLoS Biol 4, e127 (2006))

10
Arguments in favour... (codon bias)
  • Also, some protozoa (ciliates) use the codons UAA
    and UAG to encode glutamine, rather than STOP
  • The inclusion of unique codons in a subset of the
    sequences will tend to make that subset appear
    more divergent than it really is

11
Arguments in favour... (codon bias 2)
  • High GC content of DNA seems to be associated
    with aerobiosis in prokaryotes (Naya et al.,
    2002)
  • In all major groups both organisms with AT rich
    and GC rich DNA can be found.
  • The inclusion of unique codons in a subset of the
    sequences will tend to make that subset appear
    more divergent than it really is

12
GC content of DNA in aerobic and anaerobic
prokaryotes
[Figure: distribution of DNA GC content, panels "Anaerobic" and "Aerobic"]
From Naya et al., J. Mol. Evol. 55 (2002) 260-264
13
The use of protein sequences in phylogeny
requires knowledge of the properties of the
amino acids and their single letter codes
14
The use of protein sequences in phylogeny
requires knowledge of the properties of the
amino acids and their single letter codes
Alanine        A      Leucine        L
Arginine       R      Lysine         K
Asparagine     N      Methionine     M
Aspartic acid  D      Phenylalanine  F
Cysteine       C      Proline        P
Glutamic acid  E      Serine         S
Glutamine      Q      Threonine      T
Glycine        G      Tryptophan     W
Histidine      H      Tyrosine       Y
Isoleucine     I      Valine         V

15
Arguments in favour of a phylogenetic analysis of
the corresponding protein rather than the DNA
  • LONG TIME HORIZON
  • When comparing sequences that have diverged for
    possibly a billion years or more, it is very
    likely that the wobble bases in the codons will
    have become randomized. By excluding the wobble
    bases (a general technique), one is actually
    looking at amino acid sequences. So why not
    take the protein sequence directly?

16
Advantages of the translation of DNA into protein
(1)
  • DNA is composed of only four kinds of unit: A, G,
    C and T
  • If gaps are not allowed, on average 25% of the
    residues in two randomly chosen aligned sequences
    will be identical
  • If gaps are allowed, as much as 50% of the residues
    in two randomly chosen aligned sequences can be
    identical. Such a situation may obscure any
    genuine relationship that may exist, especially
    when comparing distantly related or rapidly
    evolving gene sequences
  • Moreover, it is easier to translate a gene
    sequence into its corresponding protein than to
    remove the third (wobble) base from each of the
    codons in the gene
  • All open reading frames have already been
    translated into their corresponding peptide
    sequences (GenPept and UniProt databases)
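
The 25% figure for ungapped random DNA follows from the four-letter alphabet (1/4 chance of a match per position). The sketch below estimates it by simulation, assuming equal base frequencies; a 20-letter alphabet is included only for comparison. The function name `random_identity` and its parameters are illustrative, not taken from any package.

```python
import random


def random_identity(alphabet, length=10000, seed=0):
    """Fraction of identical positions between two random ungapped sequences."""
    rng = random.Random(seed)
    a = [rng.choice(alphabet) for _ in range(length)]
    b = [rng.choice(alphabet) for _ in range(length)]
    return sum(x == y for x, y in zip(a, b)) / length


if __name__ == "__main__":
    print("DNA (4 letters):     ", random_identity("ACGT"))
    print("Protein (20 letters):", random_identity("ACDEFGHIKLMNPQRSTVWY"))
```

With unequal residue frequencies, as in real sequences, the expected chance identity is somewhat higher than 1/n.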

17
Alignment of two random DNA sequences
Without indels: 19% identity
Indels allowed: 56% identity
18
Advantages of the translation of DNA into protein
(2)
  • Translation of DNA into 21 different types of
    codon (20 amino acids and a terminator) allows
    the information to sharpen up considerably. Wrong
    frame information is set aside
  • Third-base degeneracies are consolidated
  • After insertion of gaps to align two random
    protein sequences it can be expected that they
    are 10-20% identical
  • As a result of the translation procedure the
    protein sequences with their 20 amino acids are
    much easier to align than the corresponding
    DNA sequences with only 4 nucleotides

19
Alignment of two random protein sequences
Without indels: 7% identity
Indels allowed: 22% identity
20
Advantages of the translation of DNA into protein
(3)
  • If, after this, you still want to align distantly
    related gene sequences, you had better first prepare
    a protein alignment and then use this alignment
    as the basis for the alignment of the gene
    sequences and the precise placement of indels in
    the aligned sequences (use EMBOSS tranalign).
  • Conclusion: the signal-to-noise ratio is greatly
    improved when using protein sequences rather than
    DNA sequences!
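
The idea of basing a gene alignment on a protein alignment can be sketched as follows: each residue in the gapped protein takes the next codon of the coding sequence, and each protein gap becomes a three-base gap. This is only an illustration of the principle, not EMBOSS tranalign itself; the function name `thread_codons` is hypothetical, and the coding sequence is assumed to be in frame and to match the protein residue for residue.

```python
def thread_codons(aligned_protein, cds, gap="-"):
    """Place the codons of `cds` under a gapped protein sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(1 for c in aligned_protein if c != gap)
    if len(codons) < n_residues or (codons and len(codons[-1]) != 3):
        raise ValueError("coding sequence does not cover the protein in frame")
    codon_iter = iter(codons)
    out = []
    for c in aligned_protein:
        # A protein gap becomes a gap of three nucleotides.
        out.append(gap * 3 if c == gap else next(codon_iter))
    return "".join(out)


if __name__ == "__main__":
    print(thread_codons("M-KL", "ATGAAACTG"))   # ATG---AAACTG
    print(thread_codons("MRKL", "ATGCGTAAATTA"))
```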

21
TBLASTX
  • The BLAST algorithm TBLASTX allows the use of
    translated nucleic acid sequence information to
    search for distant relationships between genes
  • The six-frame translation of a nucleotide query is
    compared with the six-frame translations of all
    the sequences in a nucleotide database
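
A translated search works on all six reading frames of a nucleotide sequence (three on each strand). The following sketch shows only the translation step, assuming the standard genetic code; it is not the BLAST implementation, and the names `translate` and `six_frames` are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq):
    """Translate complete codons; unknown triplets become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(dna):
    """Return translations of frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[f:]) for strand in (dna, reverse) for f in range(3)]


if __name__ == "__main__":
    for frame in six_frames("ATGAAATAG"):
        print(frame)
```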

22
NCBI BLASTN output
23
NCBI TBLASTX output
24
Nature of Sequence Divergence in Proteins
  • The observed sequence difference of two diverging
    sequences takes the course of a negative
    exponential. This is the result of the fact that
    each position is subject to reverse changes
    ("back mutations") and multiple hits
  • Thus the observed percentage of difference
    between the protein sequences is not proportional
    to the actual evolutionary difference between two
    homologous sequences
  • The evolutionary distance between two proteins is
    expressed in PAM units. PAM (Dayhoff and Eck,
    1968) stands for "accepted point mutation"

25
Relation between distance and PAM distance
PAM value   Distance (%)
80          50
100         60
200         75
250         85    Twilight zone
300         92
(From Doolittle, 1987, Of URFs and ORFs,
University Science Books)
  • As the evolutionary distance increases, the
    probability of super-imposed mutations becomes
    greater resulting in a lower observed percent
    difference.

26
Relation between distance and PAM distance
[Figure: observed distance plotted against PAM value, with the twilight zone indicated]
27
The Kimura correction for multiple substitutions
  • The formula used to correct for multiple hits is
    from Motoo Kimura (Kimura, M., The Neutral Theory
    of Molecular Evolution, Camb. Univ. Press, 1983,
    page 75)
  • $K = -\ln\left(1 - D - D^2/5\right)$, where D is the
    observed distance and K is the corrected distance.
  • This formula gives the mean number of estimated
    substitutions per site and, in contrast to D (the
    observed number), can be greater than 1, i.e. more
    than one substitution per site, on average. For
    example, if you observe 0.8 differences per site
    (80% difference, 20% identity), then the above
    formula predicts that there have been about 2.6
    substitutions per site over the course of
    evolution since the 2 sequences diverged.
  • This can also be expressed in PAM units by
    multiplying by 100 (mean number of substitutions
    per 100 residues).
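
The correction is easy to evaluate directly. The sketch below implements the formula as given on this slide; the function names are illustrative. The sample values are the uncorrected distances from the ClustalX distance matrix shown later in this presentation.

```python
import math


def kimura_distance(d):
    """Corrected distance K = -ln(1 - D - D*D/5) for an observed distance D."""
    x = 1.0 - d - d * d / 5.0
    if x <= 0:
        raise ValueError("observed distance too large for the correction")
    return -math.log(x)


def pam_units(d):
    """Corrected distance expressed as substitutions per 100 residues."""
    return 100.0 * kimura_distance(d)


if __name__ == "__main__":
    for d in (0.036, 0.089, 0.232, 0.268, 0.8):
        print(f"D = {d:.3f}  K = {kimura_distance(d):.3f}  PAM = {pam_units(d):.0f}")
```

For small D the corrected and observed values nearly coincide; the gap widens quickly as D grows.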

28
Proteins evolve at highly different rates
Protein                          Rate of change       Theoretical lookback time
                                 (PAMs / 10^8 yrs)    (x 10^6 yrs)
Pseudogenes                      400                  45
Fibrinopeptides                  90                   200
Lactalbumins                     27                   670
Lysozymes                        24                   850
Ribonucleases                    21                   850
Haemoglobins                     12                   1500
Acid proteases                   8                    2300
Cytochrome c                     4                    5000
Glyceraldehyde-P dehydrogenase   2                    9000
Glutamate dehydrogenase          1                    18000
PAM = number of Accepted Point Mutations per 100 amino acids.
Useful lookback time = 360 PAMs
29
Some Important Dates in History
Event                                 Number of years ago (x 10^9 yrs)
Origin of the Universe                15 ± 4
Formation of the Solar System         4.6
First Self-replicating System         3.5 ± 0.5
Prokaryotic-Eukaryotic Divergence     2.0 ± 0.5
Plant-Animal Divergence               1.0
Invertebrate-Vertebrate Divergence    0.5
Mammalian Radiation Beginning         0.1
From Doolittle, Of URFs and ORFs, 1987

30
Construction of a phylogenetic tree from
phosphoglycerate kinase sequences
31
Arguments in favour of a protein rather than a
DNA sequence (3)
  • INTRONS
  • A study of the evolution of a protein using its
    DNA sequence should only include coding sequences
  • This requires that all the introns be edited out
    of every DNA sequence. This may be
    cumbersome and time-consuming
  • An easier approach would be the direct
    translation of the cDNA sequence into its
    corresponding protein sequence

32
Typical structure of a eukaryotic gene
[Figure: typical structure of a eukaryotic gene. Labels, 5' to 3': flanking region, TATA box, transcription initiation, initiation codon, Exon 1, Intron I, Exon 2, Intron II, Exon 3, stop codon, AATAA, poly(A) addition site, flanking region]
33
Arguments in favour of a protein rather than a
DNA sequence (4)
  • MULTIGENE FAMILIES
  • Organisms may contain many highly similar genes,
    while only one peptide sequence can be identified
    (e.g. histones, tubulins and GAPDH in humans).
  • Using these DNA sequences, it would be difficult
    to decide which are expressed and which not and
    thus which genes to include in the analysis.
  • Moreover, if all the genes that are expressed
    encode the same protein, then DNA differences are
    not significant

34
Arguments in favour of a protein rather than a
DNA sequence (5)
  • PROTEIN IS THE UNIT OF SELECTION
  • For protein-encoding genes, the object on which
    natural selection acts is the protein itself.
  • The underlying DNA sequence reflects this process
    in combination with species-specific pressures on
    DNA sequence (like the need for aerophiles to
    have DNA that is GC richer).
  • If function demands that a protein maintains a
    specific sequence, there still is room for the
    DNA sequence to change.

35
Arguments in favour of a protein rather than a
DNA sequence (6)
  • RNA EDITING
  • The DNA sequence does not always translate
    directly into the amino acid sequence.
  • In post-transcriptional RNA editing, nucleotides
    that are not encoded in the gene are inserted into
    the transcript, or encoded nucleotides are removed.
  • This can lead to major differences in DNA
    sequence (sometimes more than 50%) that
    nevertheless lead to roughly the same protein
    sequence after final editing

36
Pan-editing of mitochondrial RNA in Kinetoplastida
UCCuAuuAAuUUUUUGuUAUAu AGuuuuuuAAUGUUGuuuGGuGu
A uuuuuuuAuUGUGuuuAGuuuuG uuuuGuuGuuGuuuGuuuG
GU GuGuuAuuGUUUUGAGAuuGuuG
Note that the mature mRNA would not be able to
hybridise with the gene present in the
kinetoplast DNA and thus cannot be detected as
such.
37
Some good advice (1)
  • It is recommended to prepare the phylogenetic
    trees both ways (DNA and Protein) and see how
    they look
  • For a group of species that are relatively close
    in time and closely related (like viral proteins
    or vertebrate enzymes), DNA-based analysis is
    probably a good way to go, since you avoid
    problems of codon bias and randomization of
    wobble bases. But check the protein anyway

38
Some good advice (2)
  • Be aware of the problems of multigene families
    (for instance coding for isoenzymes)
  • Be careful when you decide to exclude or include
    such sequences (you may compare paralogous
    rather than orthologous sequences)

39
What is required
  • A DNA or protein sequence
  • A set of homologous sequences
  • A good multiple sequence alignment
  • Several programs to create a phylogenetic tree

45
Alignment parameters in ClustalX
46
PAM 250 matrix as used in Clustal
C 12,
S  0, 2,
T -2, 1, 3,
P -3, 1, 0, 6,
A -2, 1, 1, 1, 2,
G -3, 1, 0,-1, 1, 5,
N -4, 1, 0,-1, 0, 0, 2,
D -5, 0, 0,-1, 0, 1, 2, 4,
E -5, 0, 0,-1, 0, 0, 1, 3, 4,
Q -5,-1,-1, 0, 0,-1, 1, 2, 2, 4,
H -3,-1,-1, 0,-1,-2, 2, 1, 1, 3, 6,
R -4, 0,-1, 0,-2,-3, 0,-1,-1, 1, 2, 6,
K -5, 0, 0,-1,-1,-2, 1, 0, 0, 1, 0, 3, 5,
M -5,-2,-1,-2,-1,-3,-2,-3,-2,-1,-2, 0, 0, 6,
I -2,-1, 0,-2,-1,-3,-2,-2,-2,-2,-2,-2,-2, 2, 5,
L -6,-3,-2,-3,-2,-4,-3,-4,-3,-2,-2,-3,-3, 4, 2, 6,
V -2,-1, 0,-1, 0,-1,-2,-2,-2,-2,-2,-2,-2, 2, 4, 2, 4,
F -4,-3,-3,-5,-4,-5,-4,-6,-5,-5,-2,-4,-5, 0, 1, 2,-1, 9,
Y  0,-3,-3,-5,-3,-5,-2,-4,-4,-4, 0,-4,-4,-2,-1,-1,-2, 7,10,
47
ClustalX distance matrix
  • Non-corrected
  • AROC_LEIMJ 0.000 0.036 0.268 0.268 0.232
  • AROC_PSEAE 0.036 0.000 0.268 0.268 0.232
  • AROC_VIBCH 0.268 0.268 0.000 0.089 0.232
  • AROC_VIBAN 0.268 0.268 0.089 0.000 0.232
  • AROC_NEIMB 0.232 0.232 0.232 0.232 0.000
  • Corrected for multiple substitution
  • AROC_LEIMJ 0.000 0.037 0.332 0.332 0.278
  • AROC_PSEAE 0.037 0.000 0.332 0.332 0.278
  • AROC_VIBCH 0.332 0.332 0.000 0.095 0.278
  • AROC_VIBAN 0.332 0.332 0.095 0.000 0.278
  • AROC_NEIMB 0.278 0.278 0.278 0.278 0.000

48
Matrices often used for the alignment of proteins
  • PAM 350 (Dayhoff et al., 1978)
  • BLOSUM30 (Henikoff-Henikoff, 1992)
  • JTT (Jones et al., 1992)
  • mtREV24 (Adachi-Hasegawa, 1996)
  • GONNET 250 matrix (Gonnet et al., 1992)

49
Alignment of two protein sequences (1)
  • For the creation of a phylogenetic tree a good
    alignment of protein sequences is of vital
    importance
  • Only homologous residues should be aligned with
    each other
  • Doubtful regions should not be included in the
    alignment
  • Aligned sequences should have similar lengths

50
Alignment of two protein sequences
  • Alignment requires the user to make assumptions
    regarding relative costs of substitution versus
    insertions and deletions (indels).
  • If substitution cost >> gap penalty there will
    be many short gaps and no phylogenetic
    information.
  • In general search for maximum similarity and
    minimize the number of insertions and deletions.
  • Exclude regions that cannot be aligned
    unambiguously!
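
The trade-off between substitution cost and gap penalty can be made concrete with a small global alignment scorer. This is a minimal Needleman-Wunsch sketch with a linear gap penalty and identity scoring; it is not the algorithm of any particular program, and the parameter values are arbitrary assumptions.

```python
def global_alignment_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Best global alignment score under simple match/mismatch/gap costs."""
    # prev[j] holds the best score for aligning s1[:i-1] with s2[:j].
    prev = [j * gap for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, 1):
        cur = [i * gap]
        for j, b in enumerate(s2, 1):
            diagonal = prev[j - 1] + (match if a == b else mismatch)
            cur.append(max(diagonal, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))          # gaps are costly
    print(global_alignment_score("ACGT", "AGT", gap=0))   # gaps are free
```

With free gaps the score reduces to the number of residues that can be matched in order, which is how random sequences end up looking deceptively similar.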

51
Multiple alignment of protein sequences
  • For the construction of reliable phylogenetic
    trees the quality of a multiple alignment is of
    the utmost importance
  • There are many programs available for the
    multiple alignment of proteins.
  • A good program in the public domain is ClustalW
    or ClustalX
  • Available on the web for free and for any
    platform (PC, Mac, Unix/Linux)
  • They quickly align sequence pairs and roughly
    determine the degrees of identity between each
    pair
  • Then the sequences are aligned more precisely in
    a progressive way starting with the two closest
    sequences
  • Most programs work best when the
    sequences have similar length.

52
Some rules of thumb for the manual alignment of
proteins (1)
  • An automatically produced multiple alignment
    often needs manual adjustment to improve the
    quality of the alignment.
  • Such improvement can be obtained by using all the
    knowledge that is available about a protein.
  • If a structure is available you should use the
    detailed information about secondary structure
    for the alignment.

54
Tree construction methods (2)
  • Character-based methods
  • maximum parsimony
  • maximum likelihood
  • Non-character-based methods
  • distance matrix methods

55
Tree construction methods (1)
  • Distance matrix methods
    • Cluster analysis (UPGMA, WPGMA, etc.)
    • Fitch and Margoliash (1967)
    • Transformed distance methods (e.g. Li, 1981)
    • Neighbor-joining (Saitou and Nei, 1987)
    • ...many more
  • Parsimony methods
    • Maximum parsimony (Protpars, DNApars, PAUP)
  • Other methods
    • Maximum likelihood (DNAML, ProtML, TreePuzzle)
    • SplitsTree, MrBayes
    • ...many more

56
Text available from: opperdoes_at_bchm.ucl.ac.be
Text and slides: http://www.icp.be/opperd/chapter8/
Website: http://www.icp.be/opperd/private/proteins.html
http://www.icp.be/opperd/private/phylogeny_2006_Athens.ppt
57
Distance Matrix Methods
  • UPGMA (Unweighted Pair Group Method with Arithmetic
    Mean) uses real (uncorrected) distance values and
    a sequential clustering algorithm. (Should only
    be used with closely related OTUs, or when there
    is constancy of evolutionary rate)
  • Neighbors relation methods
  • FITCH (Fitch, 1981)
  • Neighbor-Joining method (Saitou and Nei, 1987)
  • The latter methods should all be used with
    corrected (see above) distance matrices
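
The sequential clustering used by UPGMA can be sketched in a few lines: repeatedly join the two closest clusters and average their distances to all other clusters, weighted by cluster size. The example uses the uncorrected ClustalX distances for five AroC sequences shown earlier in this presentation. It is a teaching sketch, not the code of any phylogeny package, and ties are broken arbitrarily.

```python
def upgma(names, dist):
    """Return the list of merges as (cluster_a, cluster_b, node_height)."""
    n = len(names)
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (label, size)
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    merges, next_id = [], n
    while len(clusters) > 1:
        (a, b), dab = min(d.items(), key=lambda kv: kv[1])     # closest pair
        (na, sa), (nb, sb) = clusters.pop(a), clusters.pop(b)
        new = {}
        for c in clusters:
            dac = d[tuple(sorted((a, c)))]
            dbc = d[tuple(sorted((b, c)))]
            # Size-weighted average = arithmetic mean over all member pairs.
            new[(c, next_id)] = (sa * dac + sb * dbc) / (sa + sb)
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        d.update(new)
        clusters[next_id] = (f"({na},{nb})", sa + sb)
        merges.append((na, nb, dab / 2))
        next_id += 1
    return merges


NAMES = ["AROC_LEIMJ", "AROC_PSEAE", "AROC_VIBCH", "AROC_VIBAN", "AROC_NEIMB"]
DIST = [
    [0.000, 0.036, 0.268, 0.268, 0.232],
    [0.036, 0.000, 0.268, 0.268, 0.232],
    [0.268, 0.268, 0.000, 0.089, 0.232],
    [0.268, 0.268, 0.089, 0.000, 0.232],
    [0.232, 0.232, 0.232, 0.232, 0.000],
]

if __name__ == "__main__":
    for a, b, height in upgma(NAMES, DIST):
        print(f"join {a} + {b} at height {height:.4f}")
```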

59
Pair-wise alignment of two protein sequences
according to the Dot-Matrix method
60
Dot-Matrix plots
Two homologous sequences with 81% identity
Two homologous sequences with 50% identity
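
A dot-matrix plot places one sequence along each axis and marks every position where the two agree; homologous regions show up as diagonals. The sketch below is a minimal text version with an optional sliding window to suppress chance matches. The function name `dot_matrix`, its parameters and the toy sequences are assumptions for illustration.

```python
def dot_matrix(seq1, seq2, window=1, min_matches=1):
    """Return rows of '*' and '.' marking window matches between two sequences."""
    rows = []
    for i in range(len(seq1) - window + 1):
        row = ""
        for j in range(len(seq2) - window + 1):
            matches = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            row += "*" if matches >= min_matches else "."
        rows.append(row)
    return rows


if __name__ == "__main__":
    # Toy sequences; a one-residue insertion in the second shifts the diagonal.
    for row in dot_matrix("MKVAVLGA", "MKVGAVLGA", window=3, min_matches=3):
        print(row)
```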
61
Pair-wise alignment of two protein sequences
according to the Dot-Matrix method
62
Alignment of two protein sequences (2)
  • Alignment requires the user to make assumptions
    regarding relative costs of substitution versus
    insertions and deletions (indels).
  • If substitution cost >> gap penalty there will
    be many short gaps and no phylogenetic
    information.
  • In general search for maximum identity and
    minimize the number of insertions and deletions.
  • Exclude regions that cannot be aligned
    unambiguously!
  • Visual alignment is possible using the
    "dot-matrix method"

63
Identity matrix as used in Clustal
C 10,
S 0, 10,
T 0, 0, 10,
P 0, 0, 0, 10,
A 0, 0, 0, 0, 10,
G 0, 0, 0, 0, 0, 10,
N 0, 0, 0, 0, 0, 0, 10,
D 0, 0, 0, 0, 0, 0, 0, 10,
E 0, 0, 0, 0, 0, 0, 0, 0, 10,
Q 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
H 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
R 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
K 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
M 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
I 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
L 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
V 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
F 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10,
64
Distance matrix with mutation costs for amino
acids
      A S G L K V T P E D N I Q R F Y C H M W Z B X
Ala A 0 1 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
Ser S 1 0 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 2 2 1 2 2 2
Gly G 1 1 0 2 2 1 2 2 1 1 2 2 2 1 2 2 1 2 2 1 2 2 2
Leu L 2 1 2 0 2 1 2 1 2 2 2 1 1 1 1 2 2 1 1 1 2 2 2
Lys K 2 2 2 2 0 2 1 2 1 2 1 1 1 1 2 2 2 2 1 2 1 2 2
Val V 1 2 1 1 2 0 2 2 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2
Thr T 1 1 2 2 1 2 0 1 2 2 1 1 2 1 2 2 2 2 1 2 2 2 2
Pro P 1 1 2 1 2 2 1 0 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2
Glu E 1 2 1 2 1 1 2 2 0 1 2 2 1 2 2 2 2 2 2 2 1 2 2
Asp D 1 2 1 2 2 1 2 2 1 0 1 2 2 2 2 1 2 1 2 2 2 1 2
Asn N 2 1 2 2 1 2 1 2 2 1 0 1 2 2 2 1 2 1 2 2 2 1 2
Ile I 2 1 2 1 1 1 1 2 2 2 1 0 2 1 1 2 2 2 1 2 2 2 2
Gln Q 2 2 2 1 1 2 2 1 1 2 2 2 0 1 2 2 2 1 2 2 1 2 2
Arg R 2 1 1 1 1 2 1 1 2 2 2 1 1 0 2 2 1 1 1 1 2 2 2
Phe F 2 1 2 1 2 1 2 2 2 2 2 1 2 2 0 1 1 2 2 2 2 2 2
Tyr Y 2 1 2 2 2 2 2 2 2 1 1 2 2 2 1 0 1 1 3 2 2 1 2
Cys C 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 1 0 2 2 1 2 2 2
65
Hydrophobicity matrix
       R  K  D  E  B  Z  S  N  Q  G  X  T  H  A  C  M  P  V  L  I  Y  F  W
Arg R 10 10  9  9  8  8  6  6  6  5  5  5  5  5  4  3  3  3  3  3  2  1  0
Lys K 10 10  9  9  8  8  6  6  6  5  5  5  5  5  4  3  3  3  3  3  2  1  0
Asp D  9  9 10 10  8  8  7  6  6  6  5  5  5  5  5  4  4  4  3  3  3  2  1
Glu E  9  9 10 10  8  8  7  6  6  6  5  5  5  5  5  4  4  4  3  3  3  2  1
Asx B  8  8  8  8 10 10  8  8  8  8  7  7  7  7  6  6  6  5  5  5  4  4  3
Glx Z  8  8  8  8 10 10  8  8  8  8  7  7  7  7  6  6  6  5  5  5  4  4  3
Ser S  6  6  7  7  8  8 10 10 10 10  9  9  9  9  8  8  7  7  7  7  6  6  4
Asn N  6  6  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  7  7  7  6  6  4
Gln Q  6  6  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  7  7  7  6  6  4
Gly G  5  5  6  6  8  8 10 10 10 10  9  9  9  9  8  8  8  8  7  7  6  6  5
??? X  5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  8  8  8  8  7  7  5
Thr T  5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  8  8  8  8  7  7  5
His H  5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  9  8  8  8  7  7  5
Ala A  5  5  5  5  7  7  9  9  9  9 10 10 10 10  9  9  9  8  8  8  7  7  5
Cys C  4  4  5  5  6  6  8  8  8  8  9  9  9  9 10 10  9  9  9  9  8  8  5
Met M  3  3  4  4  6  6  8  8  8  8  9  9  9  9 10 10 10 10  9  9  8  8  7
Pro P  3  3  4  4  6  6  7  8  8  8  8  8  9  9  9 10 10 10  9  9  9  8  7
Val V  3  3  4  4  5  5  7  7  7  8  8  8  8  8  9 10 10 10 10 10  9  8  7
66
PAM 1 mutation matrix
1 PAM evolutionary distance
         A    R    N    D    C    Q    E    G    H    I    L    K    M    F    P    S    T    W    Y    V
Ala A 9867    2    9   10    3    8   17   21    2    6    4    2    6    2   22   35   32    0    2   18
Arg R    1 9913    1    0    1   10    0    0   10    3    1   19    4    1    4    6    1    8    0    1
Asn N    4    1 9822   36    0    4    6    6   21    3    1   13    0    1    2   20    9    1    4    1
Asp D    6    0   42 9859    0    6   53    6    4    1    0    3    0    0    1    5    3    0    0    1
Cys C    1    1    0    0 9973    0    0    0    1    1    0    0    0    0    1    5    1    0    3    2
Gln Q    3    9    4    5    0 9876   27    1   23    1    3    6    4    0    6    2    2    0    0    1
Glu E   10    0    7   56    0   35 9865    4    2    3    1    4    1    0    3    4    2    0    1    2
Gly G   21    1   12   11    1    3    7 9935    1    0    1    2    1    1    3   21    3    0    0    5
His H    1    8   18    3    1   20    1    0 9912    0    1    1    0    2    3    1    1    1    4    1
Ile I    2    2    3    1    2    1    2    0    0 9872    9    2   12    7    0    1    7    0    1   33
Leu L    3    1    3    0    0    6    1    1    4   22 9947    2   45   13    3    1    3    4    2   15
Lys K    2   37   25    6    0   12    7    2    2    4    1 9926   20    0    3    8   11    0    1    1
Met M    1    1    0    0    0    2    0    0    0    5    8    4 9874    1    0    1    2    0    0    4
Phe F    1    1    1    0    0    0    0    1    2    8    6    0    4 9946    0    2    1    3   28    0
Pro P   13    5    2    1    1    8    3    2    5    1    2    2    1    1 9926   12    4    0    0    2
67
PAM 100 matrix as used in Clustal
C 14,
S -1, 6,
T -5, 2, 7,
P -6, 1, -1, 10,
A -5, 2, 2, 1, 6,
G -8, 1, -3, -3, 1, 8,
N -8, 2, 0, -3, -1, -1, 7,
D -11, -1, -2, -4, -1, -1, 4, 8,
E -11, -2, -3, -3, 0, -2, 1, 5, 8,
Q -11, -3, -3, -1, -2, -5, -1, 1, 4, 9,
H -6, -4, -5, -2, -5, -7, 2, -1, -2, 4, 11,
R -6, -1, -4, -2, -5, -8, -3, -6, -5, 1, 1, 10,
K -11, -2, -1, -4, -4, -5, 1, -2, -2, -1, -3, 3, 8,
M -11, -4, -2, -6, -3, -8, -5, -8, -6, -2, -7, -2, 1, 13,
I -5, -4, -1, -6, -3, -7, -4, -6, -5, -5, -7, -4, -4, 2, 9,
L -12, -7, -5, -5, -5, -8, -6, -9, -7, -3, -5, -7, -6, 4, 2, 9,
V -4, -4, -1, -4, 0, -4, -5, -6, -5, -5, -6, -6, -6, 1, 5, 1, 8,
F -10, -5, -6, -9, -7, -8, -6,-11,-11,-10, -4, -7,-11, -2, 0, 0, -5, 12,
69
Matrices often used for the alignment of proteins
  • PAM 250 (Dayhoff et al., 1978)
  • BLOSUM62 (Henikoff-Henikoff, 1992)
  • JTT (Jones et al., 1992)
  • mtREV24 (Adachi-Hasegawa, 1996)
  • GONNET matrix (Gonnet et al., 1992)

70
Multiple alignment of protein sequences
  • For the construction of reliable phylogenetic
    trees the quality of a multiple alignment is of
    the utmost importance
  • There are many programs available for the
    multiple alignment of proteins.
  • A good program in the public domain is ClustalW
  • A similar program is Pileup of the GCG package
  • They quickly align sequence pairs and roughly
    determine the degrees of identity between each
    pair
  • Then the sequences are aligned more precisely in
    a progressive way starting with the two closest
    sequences
  • Most programs work best when the
    sequences have similar length.

72
Some rules of thumb for the manual alignment of
proteins (2)
  • The rules for mutation of amino acids are
    dependent on their physicochemical properties.
  • Surface residues (DRENK) are preferably mutated
    to residues of similar properties. Since they are
    not, or less, involved in protein folding they
    mutate rather easily.
  • Hydrophobic residues (FAMILYVW) are
    preferentially replaced by other hydrophobic
    ones. These residues are mainly internal and
    determine the folding of the protein. They thus
    mutate rather slowly.

73
Some rules of thumb for the manual alignment of
proteins (3)
  • The residues CHQST are indifferent and may be
    replaced with any other type of residue
  • The residues (DRENKCHQST), when conserved
    throughout the alignment are very likely residues
    that are involved in the active site. So the
    multiple alignment should be adjusted
    accordingly
  • Periodicity of charged residues may provide
    information as to the presence of elements of
    secondary structure such as α-helices and
    β-strands

74
α-helix
75
β-strand
76
Some rules of thumb for the manual alignment of
proteins (4)
  • Indels (insertions/deletions) are, as a rule, not
    found in elements of secondary structure but only in
    loops.
  • Pro and Gly interfere with secondary structure
    elements and thus have a preference for loops
  • Hydrophobicity (or hydropathy) profiles according
    to Kyte and Doolittle of two homologous proteins
    are in general strikingly similar

77
Proline interferes with α-helix and β-sheet
formation
From Deber and Therien, 2002
78
Possible functions for proline in trans-membrane
domains
From Deber and Therien, 2002
79
Alignment of malate dehydrogenase sequences
lclCHR34_tmp.0150  ----MKPST--LSRFKVTVLGASGAIGQPLALALVQNKRVSEL-----ALYDIVQPR---
lclCHR34_tmp.0140  ----MRRSQ--GCFFRVAVLGAAGGIGQPLSLLLKNNKYVKEL-----KLYDVKGGP---
lclCHR34_tmp.0130  MGLLFRRSLTALKKGKVVLFGCSNAVGQPLSLLLKMNPHVEELVCCNTAADDDVPGS---
lclCHR28_tmp.0050  -----------MSAVKVAVTGAAGQIGYALVPLIARGALLGPTTPVELRLLDIEPALKAL

lclCHR34_tmp.0150  -GVAVDLSHFPRKVKVTGYPTKWIHK--ALDGADLVLMSAGMPRRPGMT-HDDLFNTNAL
lclCHR34_tmp.0140  -GVAADLSHICAPAKVTGYTKDELSR--AVENADVVVIPAGIPRKPGMT-RDDLFNTNAS
lclCHR34_tmp.0130  -GIAADLSHIDTLPKVH-YATDEGQWPALLRDAQLILVCFGSSFDLLREDRDIALKAAAP
lclCHR28_tmp.0050  AGVEAELEDCAFPLLDKVVVTADPRV--AFDGVAIAIMCGAFPRKAGME-RKDLLEMNAR

lclCHR34_tmp.0150  TVNELSAAVARYAPKSV-LAIISNPLNSMVPVAAETLQRAGVYDPRKLFGIISLNMMRAR
lclCHR34_tmp.0140  IVRDLAIAVGTHAPKAI-VGIITNPVNSTVPVAAEALKKVGVYDPARLFGVTTLDVVRAR
lclCHR34_tmp.0130  TMRRVMAAVASSDTTGN-VAVVSSPVNALTPFCAELLKASGKFDPRKLFGVTTLDVIRTR
lclCHR28_tmp.0050  IFKEQGEAIAAVAASDCRVVVVGNPANTNALILLKSAQ--GKLNPRHVTAMTRLDHNRAL

lclCHR34_tmp.0150  KMLGDFTGQDPEMLDVPVIGGHSGQTIVPLFSHS--GVELRQEQVEYLTHRVR-------
lclCHR34_tmp.0140  TFVAEALGASPYDVDVPVIGGHSGETIVPLLSG---FPSLSEEQVRQLTHRIQ-------
lclCHR34_tmp.0130  KLVAGTLHMNPYDVNVPVVGGCGGVTACPLIAQT--GLRIPLDDIVRISGEVQSYGVLFE
lclCHR28_tmp.0050  SLLARKAGVPVSQVRNVIIWGNHSSTQVPDTDSAVIGTTPAREAIKDDALDDD-----FV

lclCHR34_tmp.0150  --VGGD-EVVKAKEGRGSSSLSMAFAAAEWADGVLRAMDGEKTLLQCSFVESPLFADKCR
lclCHR34_tmp.0140  --FGGD-EVVKAKDGAGSATLSMAFAGNEWTTAVLRALSGEKGVVVCTYVQS-TVEPSCA
lclCHR34_tmp.0130  AAVGADSHDALSTEVAPPVALGLAYAACDFSTSLLKALRGDVGIVECALVES-TMRSETP
lclCHR28_tmp.0050  QVVRGRGAEIIQLRGLSSAMSAAKAAVDHVHDWIHGTPEGVYVSMGVYSDENPYGVPSGL

lclCHR34_tmp.0150  FFGSTVEVCKEGIERVLPLPPLNEYEEEQLDRCLPDLEKN-IRKGLAFVAENAATSTPST
lclCHR34_tmp.0140  FFSSPVLLGNSGVEKIYPVPMLNAYEEKLMAKCLEGLQSN-ITKGIAFSNK---------
lclCHR34_tmp.0130  FFSSRVELGREGVQRVFPMGALTSYEHELIETAVPELMRD-VQAGIEAATQF--------
lclCHR28_tmp.0050  IFSFP-CTCHAGEWTVVSGKLNGDLGKQRLASTIAELQEERAQAGL--------------
80
Hydrophobicity profiles
  • Profiles according to Kyte and Doolittle of
    homologous proteins are in general strikingly
    similar and may provide a tool in the alignment
    of two or more proteins.
  • The two phosphoglycerate kinase sequences below
    share 50 identical residues.
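
A hydropathy profile of the kind described above can be sketched as a sliding-window average over the Kyte and Doolittle scale. The window length of 7 and the function name `hydropathy_profile` are illustrative assumptions, not taken from the slides; the example fragment is the start of one of the aligned sequences shown earlier.

```python
# Kyte-Doolittle hydropathy values per amino acid (Kyte and Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=7):
    """Return the mean hydropathy of every full window along seq."""
    values = [KD[aa] for aa in seq.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical usage on a short protein fragment (window size is an assumption).
profile = hydropathy_profile("MSAVKVAVTGAAGQIGYALV", window=7)
print([round(v, 2) for v in profile])
```

Plotting two such profiles on the same axis is one way to compare homologous proteins before or after alignment.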

Trypanosoma congolense PGK
Euglena gracilis PGK
81
Tree construction methods (1)
  • Distance matrix methods
      • Cluster analysis (UPGMA, WPGMA, etc.)
      • Fitch and Margoliash (1967)
      • Transformed distance methods (e.g. Li, 1981)
      • Neighbor-joining (Saitou and Nei, 1987)
      • ... many more
  • Parsimony methods
      • Maximum parsimony
  • Other methods
      • Maximum likelihood (Felsenstein, 1981)
      • ... many more

82
Tree construction methods (2)
  • Character-based methods
      • maximum parsimony
      • maximum likelihood
  • Non-character-based methods
      • distance matrix methods

83
Phylogeny (2)
  • Distance matrix methods (in the public domain)
      • Least squares method (Fitch and Margoliash)
      • Fitch, Kitsch of the Phylip package (Joe
        Felsenstein, Univ. Washington)
      • Neighbor-joining method
      • Neighbor of the Phylip package (Joe Felsenstein,
        Univ. Washington)
      • Clustal, or NJdist in the MOLPHY package (Adachi
        and Hasegawa, Univ. Tokyo)
      • Darwin (Gaston Gonnet, ETH, Zurich, via
        mail server or WWW)
  • Protein maximum likelihood (in the public domain)
      • Protml (Adachi and Hasegawa, Univ. Tokyo) (very
        CPU intensive)
      • TreePuzzle (Strimmer and von Haeseler, 1997)
  • Protein maximum parsimony (in the public domain)
      • Protpars (Joe Felsenstein, Univ. Washington)
      • Paup (David Swofford, latest version will be
        commercial)

84
Some useful information about phylogenetic trees
[Figure: rooted tree with the external nodes (OTUs) A to E, the internal nodes F to I, and the root labelled.]

A-E are external nodes (extant)
F-I are internal (ancestral) nodes

OTUs are operational taxonomic units
They can be species, populations, individuals, genes or proteins
They are the extant (existing) or extinct (ancestral) OTUs

Topology: the order of the nodes on the tree


85
Distance Matrix Methods
  • UPGMA (Unweighted Pair Group with Arithmetic
    Mean) uses real (uncorrected) distance values and
    a sequential clustering algorithm. (Should only
    be used with closely related OTUs, or when there
    is constancy of evolutionary rate)
  • Transformed distance methods. Corrections may be
    introduced to obtain trees with true evolutionary
    distances (PAM values, Kimura), or corrections
    are carried out with reference to an outgroup
    (Farris, 1971; Klotz et al., 1979). Should be used
    when evolutionarily distant organisms are included
    in the dataset
  • Neighbor-relation methods
      • FITCH (Fitch, 1981)
      • Neighbor-Joining method (Saitou and Nei, 1987)
    These should all be used with corrected (see above)
    distance matrices

86
Distance matrix
Uncorrected for Multiple Substitutions

      1      2      3      4      5
   0.00   0.63   0.63  22.88  18.50    AC007866_13 1
          0.00   0.63  22.57  18.50    AC007866_17 2
                 0.00  22.88  17.87    AC007866_15 3
                        0.00   5.64    AC007866_9 4
                               0.00    AC007866_11 5

Using the Kimura correction method
Gap weighting is 0.000000

      1      2      3      4      5
   0.00   0.63   0.63  27.35  21.29    AC007866_13 1
          0.00   0.63  26.90  21.29    AC007866_17 2
                 0.00  27.35  20.47    AC007866_15 3
                        0.00   5.88    AC007866_9 4
                               0.00    AC007866_11 5

Distance matrix as produced by the EMBOSS program distmat
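
The relation between the uncorrected and the corrected values can be sketched as follows. This assumes that the Kimura option applies Kimura's (1983) protein approximation, d = -ln(1 - p - 0.2 p^2), where p is the observed fraction of differing sites; the function name is hypothetical.

```python
import math

def kimura_protein_distance(percent_diff):
    """Correct an observed % difference for multiple substitutions.

    Assumes Kimura's (1983) approximation d = -ln(1 - p - 0.2 * p**2).
    """
    p = percent_diff / 100.0
    return -math.log(1.0 - p - 0.2 * p * p) * 100.0

# Uncorrected values taken from the first matrix above.
for observed in (0.63, 5.64, 17.87, 22.57, 22.88):
    print(f"{observed:6.2f} -> {kimura_protein_distance(observed):6.2f}")
```

Under this assumption the corrected values agree with the second matrix to within rounding, and the correction grows with the observed distance, which is why it matters most for distantly related sequences.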
87
UPGMA
  • UPGMA (Unweighted Pair Group with Arithmetic
    Mean) uses real (uncorrected) distance values and
    a sequential clustering algorithm. (Should only
    be used with closely related OTUs, or when there
    is constancy of evolutionary rate)

88
Tree construction (UPGMA)
First cycle

       A   B   C   D   E
   B   2
   C   4   4
   D   6   6   6
   E   6   6   6   4
   F   8   8   8   8   8

Cluster the pair of OTUs with the smallest
distance, being A and B. The branching point is
positioned at a distance of 2 / 2 = 1
substitution.
89
Tree construction (UPGMA)
  • Following the first clustering, A and B are
    considered as a single composite OTU (A,B) and we
    now calculate the new distance matrix as follows
  • dist(A,B),C = (distAC + distBC) / 2 = 4
  • dist(A,B),D = (distAD + distBD) / 2 = 6
  • dist(A,B),E = (distAE + distBE) / 2 = 6
  • dist(A,B),F = (distAF + distBF) / 2 = 8
  • In other words, the distance between a simple
    OTU and a composite OTU is the average of the
    distances between the simple OTU and the
    constituent simple OTUs of the composite OTU.
    Then a new distance matrix is recalculated using
    the newly calculated distances and the whole
    cycle is repeated
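
The averaging rule above can be sketched in a few lines. The distances are those of the first-cycle matrix; the helper names `d` and `cluster_distance` are illustrative and not part of any package.

```python
from itertools import product

# Pairwise distances from the first-cycle matrix (one entry per pair).
D = {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
    ("A", "E"): 6, ("B", "E"): 6, ("C", "E"): 6, ("D", "E"): 4,
    ("A", "F"): 8, ("B", "F"): 8, ("C", "F"): 8, ("D", "F"): 8, ("E", "F"): 8,
}

def d(x, y):
    """Look up a distance regardless of the order of the two OTUs."""
    return D[(x, y)] if (x, y) in D else D[(y, x)]

def cluster_distance(group1, group2):
    """Average distance over all pairs of simple OTUs from the two groups."""
    pairs = list(product(group1, group2))
    return sum(d(x, y) for x, y in pairs) / len(pairs)

for otu in "CDEF":
    print(f"dist(A,B),{otu} =", cluster_distance(("A", "B"), (otu,)))
```

Averaging over the constituent simple OTUs, rather than over the two most recent sub-clusters, is what keeps every OTU equally weighted in later cycles.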

90
Tree construction (UPGMA)
  • Second cycle

         A,B   C    D    E
    C     4
    D     6    6
    E     6    6    4
    F     8    8    8    8

91
Tree construction (UPGMA)
  • Third cycle

         A,B   C    D,E
    C     4
    D,E   6    6
    F     8    8    8

92
Tree construction (UPGMA 4)
  • Fourth cycle

         AB,C  D,E
    D,E   6
    F     8    8

93
Tree construction (UPGMA)
  • Fifth cycle

         ABC,DE
    F     8

  • The final step consists of clustering the last
    OTU, F, with the composite OTU.
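
The five cycles above can be reproduced with a compact UPGMA sketch. It is a minimal illustration, not the code of any particular package; the data structures and the function name `upgma` are assumptions. Where two pairs tie for the smallest distance (as in the second cycle), the sketch takes the first pair it finds, so the order of the merges may differ from the slides while the resulting tree is the same.

```python
from itertools import combinations

def upgma(names, dist):
    """Run UPGMA; return the merges as (members, node height) in order.

    dist maps frozenset({x, y}) to the distance between simple OTUs x and y.
    """
    clusters = [(name,) for name in names]

    def average(c1, c2):
        # Mean distance over all pairs of simple OTUs in the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of clusters (ties go to the first pair found).
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: average(*pair))
        height = average(c1, c2) / 2  # the branching point lies halfway
        clusters.remove(c1)
        clusters.remove(c2)
        clusters.append(tuple(sorted(c1 + c2)))
        merges.append((clusters[-1], height))
    return merges

# Lower-triangular distance matrix of the first cycle.
names = "ABCDEF"
rows = {"B": [2], "C": [4, 4], "D": [6, 6, 6], "E": [6, 6, 6, 4], "F": [8, 8, 8, 8, 8]}
dist = {frozenset((r, names[i])): v for r, vals in rows.items() for i, v in enumerate(vals)}

merges = upgma(names, dist)
for members, height in merges:
    print("".join(members), height)
```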

94
Pitfalls of UPGMA
  • The UPGMA clustering method is very sensitive to
    unequal evolutionary rates.
  • Clustering works only if the data are ultrametric
  • Ultrametric distances are defined by the
    satisfaction of the 'three-point condition'.

95
The three-point condition
  • For any three taxa A, B and C: distAC ≤ max(distAB,
    distBC), or,
  • in words: the two greatest distances are equal
  • UPGMA assumes that the evolutionary rate is the
    same for all branches
  • If the assumption of rate constancy among
    lineages does not hold, UPGMA may give an
    erroneous topology.
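
A distance matrix can be tested against the three-point condition before UPGMA is applied. The sketch below is one possible check; the function name, the tolerance and the two small example matrices are illustrative assumptions.

```python
from itertools import combinations

def is_ultrametric(names, dist, tol=1e-9):
    """Check the three-point condition for every triple of taxa.

    dist maps frozenset({x, y}) to the distance between taxa x and y.
    """
    for a, b, c in combinations(names, 3):
        triple = (dist[frozenset((a, b))],
                  dist[frozenset((a, c))],
                  dist[frozenset((b, c))])
        _, middle, largest = sorted(triple)
        if largest - middle > tol:  # the two greatest distances must be equal
            return False
    return True

# Hypothetical three-taxon examples.
clock_like = {frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 4}
unequal_rates = {frozenset("AB"): 2, frozenset("AC"): 4, frozenset("BC"): 6}

print(is_ultrametric("ABC", clock_like))
print(is_ultrametric("ABC", unequal_rates))
```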

Non-ultrametric tree
96
Unequal rates of mutation lead to wrong trees
  • UPGMA tree construction based on the data of the
    left tree would result in the erroneous tree at
    the right

97
UPGMA (conclusion)
  • UPGMA uses real (uncorrected) distance values and
    a sequential clustering algorithm.
  • This method of tree construction is very
    sensitive to differences in branch length or
    unequal rates of evolution.
  • It should only be used with closely related OTUs,
    or when there is constancy of evolutionary rate.
  • The method is often used in combination with
    isoenzyme or restriction site data or with
    morphological criteria

98
Maximum Parsimony Methods
  • Use sequence information rather than distance
    information
  • Evaluate all possible trees and select the tree
    that requires the minimum number of substitutions,
    summed over the informative sites

99
Maximum Parsimony analysis (2)
  • Parsimony implies that simpler hypotheses are
    preferable to more complicated ones.
  • Maximum parsimony is a character-based method
    that infers a phylogenetic tree by minimizing the
    total number of evolutionary steps required to
    explain a given set of data, or in other words by
    minimizing the total tree length.
  • The steps may be base or amino-acid substitutions
    for sequence data, or gain and loss events for
    restriction site data.

100
Maximum Parsimony analysis (3)
  • Maximum parsimony, when applied to protein
    sequence data, either considers each site of the
    sequence as a multistate unordered character
    with 20 possible states (the amino acids) (Eck
    and Dayhoff, 1966), or may take into account the
    genetic code and the number of mutations, 1, 2 or
    3, that is required to explain an observed
    amino-acid substitution. The latter method is
    implemented in the PROTPARS program (Felsenstein,
    1993).
  • The maximum parsimony method searches all
    possible tree topologies for the optimal
    (minimal) tree. However, the number of unrooted
    trees that have to be analysed rapidly increases
    with the number of OTUs.

101
Maximum Parsimony analysis (4)
  • The number of rooted trees (Nr) for n OTUs is
    given by $N_r = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
  • The number of unrooted trees (Nu) for n OTUs is
    given by $N_u = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$

Number of OTUs   unrooted trees   rooted trees
      2                    1                1
      3                    1                3
      4                    3               15
      5                   15              105
      6                  105              945
      7                  945           10,395
      8               10,395          135,135
      9              135,135        2,027,025
     10            2,027,025       34,459,425
     15              7.91E12          2.13E14
     20              2.22E20          8.20E21

This rapid increase in the number of trees to be
analysed may make it impossible to apply the
method to very large datasets. In that case the
parsimony method may become very time consuming,
even on very fast computers.
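
The two formulas are easy to evaluate directly, which gives a quick way to check any entry in the table. The function names below are illustrative.

```python
from math import factorial

def rooted_trees(n):
    """Number of rooted bifurcating trees for n >= 2 OTUs."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    """Number of unrooted bifurcating trees for n >= 3 OTUs."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (3, 4, 5, 6, 7, 8, 9, 10, 15):
    print(n, unrooted_trees(n), rooted_trees(n))
```

Note that the number of unrooted trees for n + 1 OTUs equals the number of rooted trees for n OTUs, because rooting a tree amounts to adding one extra branch.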
102
maximum parsimony method for 4 nucleic-acid
sequences
  •              Site
    Sequence     1 2 3 4 5 6 7 8 9
       1         A A G A G T G C A
       2         A G C C G T G C G
       3         A G A T A T C C A
       4         A G A G A T C C G
  • For four OTUs there are three possible unrooted
    trees. The trees are then analysed by searching
    for the ancestral sequences and by counting the
    number of mutations required to explain the
    respective trees

103
Number of mutations

(1) AAGAGTGCA               AGATATCCA (3)
             \4           2/
              \     4     /
        AGCCGTGCG --- AGAGATCCG           Tree I: 11
              /           \
             /0           0\
(2) AGCCGTGCG               AGAGATCCG (4)

(1) AAGAGTGCA               AGCCGTGCG (2)
             \1           3/
              \     5     /
        AGGAGTGCA --- AGAGGTCCG           Tree II: 14
              /           \
             /4           1\
(3) AGATATCCA               AGAGATCCG (4)

(1) AAGAGTGCA               AGCCGTGCG (2)
             \1           3/
              \     5     /
        AGAAGTGCA --- AGATGTCCG           Tree III: 16
              /           \
             /5           2\
(4) AGAGATCCG               AGATATCCA (3)

Tree I has the topology with the least number of
mutations and thus is the most parsimonious
tree. The ancestral sequences at the internal
nodes are calculated. This analysis includes both
informative and non-informative sites in the
sequence. When only informative sites are
included, far fewer sites have to be analysed,
which means in the case of large datasets a
considerable gain in CPU time.
104
Informative sites
A site is informative only when there are at
least two different kinds of nucleotides at the
site, each of which is represented in at least
two of the sequences under study.  
  •              Site
    Sequence     1 2 3 4 5 6 7 8 9
       1         A A G A G T G C A
       2         A G C C G T G C G
       3         A G A T A T C C A
       4         A G A G A T C C G
                         *   *   *

Informative sites are indicated by an asterisk (*)
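
The definition of an informative site translates directly into a short check over the alignment columns. This is a minimal sketch; the function name is an assumption.

```python
from collections import Counter

def informative_sites(sequences):
    """Return the 1-based positions of the parsimony-informative sites."""
    sites = []
    for pos, column in enumerate(zip(*sequences), start=1):
        counts = Counter(column)
        # At least two states must each occur in at least two sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites

seqs = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]
print(informative_sites(seqs))  # [5, 7, 9]
```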
105
Informative sites only
1 GGA
2 GGG
3 ACA
4 ACG

Number of mutations

(1) GGA           ACA (3)
       \1       1/
        \   2   /
      GGG --- ACG             Tree I: 4
        /       \
       /0       0\
(2) GGG           ACG (4)

(1) GGA           GGG (2)
       \1       1/
        \   1   /
      GCA --- GCG             Tree II: 5
        /       \
       /1       1\
(3) ACA           ACG (4)

(1) GGA           GGG (2)
       \2       1/
        \   0   /
      GCG --- GCG             Tree III: 6
        /       \
       /1       2\
(4) ACG           ACA (3)

To infer a maximum parsimony tree, for each
possible tree we calculate the minimum number of
substitutions at each informative site. In the
above example, for sites 5, 7, and 9, tree I
requires in total 4 changes, tree II requires 5
changes, and tree III requires 6 changes. In the
final step, we sum the number of changes over all
the informative sites for each tree and choose
the tree associated with the smallest number of
substitutions. In our case, tree I is chosen
because it requires the smallest number of
changes (4) at the informative sites.
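
The counts for the three trees can be checked with Fitch's algorithm, which finds the minimum number of changes per site without writing out the ancestral sequences. Fitch's algorithm is not named in the slides; it is used here as one standard way to obtain the minimum. Each unrooted four-taxon tree is written as a pair of pairs, i.e. rooted on its internal branch, which does not change the count.

```python
def fitch_length(tree, site):
    """Return (state set, minimum changes) for one site on a binary tree.

    tree is a leaf name or a pair (left, right); site maps leaf name to state.
    """
    if not isinstance(tree, tuple):
        return {site[tree]}, 0
    left_set, left_cost = fitch_length(tree[0], site)
    right_set, right_cost = fitch_length(tree[1], site)
    common = left_set & right_set
    if common:  # the subtrees agree on a state: no extra change needed
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1

def tree_length(tree, sequences):
    """Sum the minimum number of changes over all sites."""
    n_sites = len(next(iter(sequences.values())))
    total = 0
    for i in range(n_sites):
        site = {name: seq[i] for name, seq in sequences.items()}
        total += fitch_length(tree, site)[1]
    return total

# Informative sites 5, 7 and 9 of the four sequences.
informative = {1: "GGA", 2: "GGG", 3: "ACA", 4: "ACG"}
trees = {"I": ((1, 2), (3, 4)), "II": ((1, 3), (2, 4)), "III": ((1, 4), (2, 3))}

for label, tree in trees.items():
    print("Tree", label, tree_length(tree, informative))
```

For the informative sites this gives 4, 5 and 6 changes for trees I, II and III, in agreement with the example.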
106
How to find the best tree ?
  • Maximum parsimony searches for the optimal
    (minimal) tree. In this process more than one
    minimal tree may be found. To guarantee that the
    best possible tree is found, an exhaustive
    evaluation of all possible tree topologies has to
    be carried out. However, this becomes impossible
    when there are more than 12 OTUs in a dataset.
  • Branch and Bound is a variation on maximum
    parsimony that guarantees to find the minimal tree
    without having to evaluate all possible trees.
    This way a larger number of taxa can be evaluated
    but the method is still limited.
  • Heuristic search is a method with step-wise
    addition and rearrangement (branch swapping) of
    OTUs. Here the best tree is not guaranteed to be
    found.
  • Since, in view of the size of the dataset, it is
    often not possible to carry out an exhaustive or
    other search for the best tree, it is advised to
    change the order of the taxa in the dataset and
    to repeat the analysis, or to instruct the
    program to do this for you by providing a
    so-called jumble factor.

107
Consensus tree
  • Since the Maximum Parsimony method may result in
    more than one equally parsimonious tree, a
    consensus tree should be created. For the
    creation of a consensus tree see bootstrapping.

108
Parsimony and branch lengths
(1) G         A (3)
     \1     0/
      \  1  /
      C ----- A
      /     \
     /0     1\
(2) C         T (4)

(1) G         A (3)
     \0     1/
      \  1  /
      G ----- T
      /     \
     /1     0\
(2) C         T (4)

(1) G         A (3)
     \1     1/
      \  1  /
      C ----- A
      /     \
     /0     0\
(2) C         A (4)

Three possible trees for 4 OTUs; all describe the
same final state by assuming a total of 3 steps.
Each final state is arrived at via a different
route. Each of the three trees is equally
valid, but the number of steps along the
individual branches (or the length of each branch)
is not determined. For this reason branch
lengths are not given in parsimony, but only the
total number of steps for a tree.
109
Some final notes on maximum parsimony
  • Maximum Parsimony (positive points)
      • is based on shared and derived characters. It
        therefore is a cladistic rather than a phenetic
        method
      • does not reduce sequence information to a single
        number
      • tries to provide information on the ancestral
        sequences
      • evaluates different trees
  • Maximum Parsimony (negative points)
      • does not assume an evolutionary model
      • is slow in comparison with distance methods
      • does not use all the sequence information (only
        informative sites are used)
      • does not correct for multiple mutations (does not
        imply a model of evolution)
      • does not provide information on the branch
        lengths
      • is notorious for its sensitivity to codon bias

110
How to root an unrooted tree?
  • The majority of methods yield unrooted trees
  • To root a tree one should add an outgroup to the
    dataset. An outgroup is an OTU for which external
    information (e.g. paleontological information) is
    available that indicates that the outgroup
    branched off before all other taxa
  • Do not choose an outgroup that is very distantly
    related to your taxa. This may result in serious
    topological errors
  • Nor should you choose an outgroup that is too
    closely related to the taxa in question. In this
    case it may not be a true outgroup
  • The use of more than one outgroup generally
    improves the estimate of tree topology
  • In the absence of a good outgroup the root may be
    positioned by assuming approximately equal
    evolutionary rates over all the branches. In this
    way the root is put at the midpoint of the
    longest pathway between two OTUs

111
Maximum likelihood
  • It evaluates a hypothesis about evolutionary
    history in terms of the probability that the
    proposed model and the hypothesized history would
    give rise to the observed data set. A history
    with a higher probability of reaching the
    observed state is preferred to a history with a
    lower probability. The method searches for the
    tree with the highest probability or likelihood.
  • The following programs are available from the
    web
      • DNAML (DNA data only. By Joe Felsenstein in the
        Phylip package)
      • FastDNAML (DNA data only. A faster algorithm
        applied by Gary Olsen to Joe Felsenstein's DNAML
        program)
      • ProtML (DNA and protein. By Adachi and Hasegawa,
        1992)
      • TreePuzzle (DNA and protein. By Strimmer and von
        Haeseler, 1995). This program applies a heuristic
        method and is much faster than PROTML, but is
        not guaranteed to find the best tree.

112
Advantages and disadvantages of the maximum
likelihood method
  • There are some supposed advantages of maximum
    likelihood methods over other methods.
      • It is the estimation method least affected by
        sampling error
      • It is robust to many violations of the
        assumptions in the evolutionary model
      • With very short sequences it tends to outperform
        alternative methods such as parsimony or distance
        methods
      • The method is statistically well founded
      • It evaluates different tree topologies
      • It uses all the sequence information
  • There are also some supposed disadvantages
      • Maximum likelihood is very CPU intensive and thus
        extremely slow
      • The result is dependent on the model of evolution
        used

113
Explanation of the method
Maximum likelihood evaluates the probability that
the chosen evolutionary model will have
generated the observed sequences. Phylogenies are
then inferred by finding those trees that yield
the highest likelihood. Assume that we have the
aligned nucleotide sequences for four taxa

Site: 1 ... j ... N
(1)   A G G C U C C A A .... A
(2)   A G G U U C G A A .... A
(3)   A G C C C A G A A .... A
(4)   A U U U C G G A A .... C

and we want to evaluate the likelihood
of the unrooted tree represented by the
nucleotides of site j in the sequence and shown
below

(1)        (2)
   \      /
    \    /
     ------
    /    \
   /      \
(3)        (4)

What is the probability that this tree would have
generated the data presented in the sequence
under the chosen model?
114
Likelihood for one site
The models are time-reversible, therefore the
likelihood of the tree is independent of the
position of the root. Thus it is convenient to
root the tree at an arbitrary internal node.

[Figure: the tree rooted at an internal node, with C, C, A and G at the tips. One panel shows a single reconstruction with A at both internal nodes; the other shows the general case with unknown states at the internal nodes (5) and (6).]

$L(j) = \sum_{x_5} \sum_{x_6} \mathrm{Prob}(\text{tips } C, C, A, G \text{ with state } x_5 \text{ at node (5) and } x_6 \text{ at node (6)})$

Assume that nucleotide sites evolve independently
(the Markovian model of evolution). Then we can
calculate the likelihood for each site separately
and combine these into the total likelihood. For
the likelihood for site j, we have to consider
all the possible scenarios by which the
nucleotides present at the tips of the tree could
have evolved. So the likelihood for a particular
site is the summation of the probabilities of
every possible reconstruction of ancestral
states, given some model of base substitution. So
in this specific case all possible nucleotides A,
G, C, and T may occupy nodes (5) and (6), giving
4 x 4 = 16 possibilities.
In the case of protein sequences each site may
occupy 20 states (those of the 20 amino acids) and
thus 20 x 20 = 400 possibilities have to be
considered. Since any one of these scenarios could
have led to the amino-acid configuration at the
tips of the tree, we must calculate the
probability of each and sum them to obtain the
total probability for each site j.
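
The summation over the 16 reconstructions can be made concrete with a small sketch. Several things here are assumptions made only for illustration: the Jukes-Cantor substitution model (the slides do not fix a nucleotide model), equal branch lengths t, and a tree in which taxa 1 and 2 join at node (5) and taxa 3 and 4 join at node (6).

```python
import math
from itertools import product

BASES = "ACGT"

def jc_prob(x, y, t):
    """Jukes-Cantor probability that state x becomes y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e

def site_likelihood(tips, t=0.1):
    """Likelihood of one site on the tree ((1,2)5,(3,4)6), rooted at node (5).

    tips holds the states of taxa 1 to 4; every branch has length t.
    """
    s1, s2, s3, s4 = tips
    total = 0.0
    for x5, x6 in product(BASES, repeat=2):  # the 4 x 4 = 16 reconstructions
        total += (
            0.25                              # equilibrium frequency of the root state
            * jc_prob(x5, s1, t) * jc_prob(x5, s2, t)
            * jc_prob(x5, x6, t)              # internal branch (5)-(6)
            * jc_prob(x6, s3, t) * jc_prob(x6, s4, t)
        )
    return total

print(site_likelihood("CCAG"))
```

Because the model is time-reversible, rooting the same tree at node (6) instead of node (5) gives the same value, and summing the likelihood over all 256 possible tip patterns gives 1.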
115
likelihood for the full tree
The likelihood for the full tree then is the
product of the likelihood at each site.

$L = L(1) \times L(2) \times \cdots \times L(N) = \prod_{j=1}^{N} L(j)$

Since the individual likelihoods are
extremely small numbers it is convenient to sum
the log likelihoods at each site and report the
likelihood of the entire tree as the log
likelihood.

$\ln L = \ln L(1) + \ln L(2) + \cdots + \ln L(N) = \sum_{j=1}^{N} \ln L(j)$
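
The practical reason for working with log likelihoods can be shown with made-up numbers. The per-site likelihoods below are hypothetical (2000 sites, each with L(j) = 0.001); the point is only that the plain product underflows in double precision while the sum of logarithms does not.

```python
import math

def tree_log_likelihood(site_likelihoods):
    """Sum the per-site log likelihoods to get ln L for the whole tree."""
    return sum(math.log(lj) for lj in site_likelihoods)

# Hypothetical per-site likelihoods, not taken from any real dataset.
site_likelihoods = [1e-3] * 2000

plain_product = 1.0
for lj in site_likelihoods:
    plain_product *= lj  # underflows to 0.0 long before the last site

print(plain_product)
print(tree_log_likelihood(site_likelihoods))
```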
116
The model of evolution
The PROTML program in the MOLPHY package (Adachi
and Hasegawa, 1992), as well as the TreePUZZLE
program by Strimmer and von Haeseler (1995), have
implemented an instantaneous rate matrix derived
from the Dayhoff empirical substitution matrix.
This has been called the Dayhoff model.
Recently a model called the