Molecular Phylogenetics presentation

1
Molecular Phylogenetics
2
Trees

Diagram consisting of branches and nodes
Species tree (how are my species related?)
contains only one representative from each
species
when did speciation take place?
all nodes indicate speciation events
Gene tree (how are my genes related?)
normally contains a number of genes from a single
species
nodes relate either to speciation or gene
duplication events

4
The purpose of a phylogenetic tree is to
illustrate how a group of objects (usually genes
or organisms) are related to one another
5
Phylogenetic trees are about visualising
evolutionary relationships
"Nothing in Biology Makes Sense Except in the
Light of Evolution"
Theodosius Dobzhansky (1900-1975)
6
Terms

Phylogeny (phylon = tribe, genesis = origin)
Homologue
Orthologue
Paralogue
Tree topology
Cladogram
Phenogram

7
Cladogram
8
Phenogram
9
Clade: a set of species which includes all of
the species derived from a single common ancestor
10
Molecular Evolution - Li
11
Cladistics and Phenetics

Cladistic approach: trees are drawn based on the
conserved characters
Phenetic approach: trees are based on some
measure of distance between the leaves
Molecular phylogenies are inferred from molecular
(usually sequence) data
either cladistic (e.g. gene order) or phenetic

12
Classes of algorithm used to infer phylogeny from
sequence

Distance methods
Parsimony
Likelihood
Probabilistic methods
Phylogenetic invariants

13
Distance methods

Calculate the distance CORRECTING FOR MULTIPLE
HITS
The Distance Matrix
7
Rat      0.0000 0.0646 0.1434 0.1456 0.3213 0.3213 0.7018
Mouse    0.0646 0.0000 0.1716 0.1743 0.3253 0.3743 0.7673
Rabbit   0.1434 0.1716 0.0000 0.0649 0.3582 0.3385 0.7522
Human    0.1456 0.1743 0.0649 0.0000 0.3299 0.2915 0.7116
Opossum  0.3213 0.3253 0.3582 0.3299 0.0000 0.3279 0.6653
Chicken  0.3213 0.3743 0.3385 0.2915 0.3279 0.0000 0.5721
Frog     0.7018 0.7673 0.7522 0.7116 0.6653 0.5721 0.0000

14
Distance methods

Normally fast and simple
e.g. UPGMA, Neighbour Joining, Minimum Evolution,
Fitch-Margoliash
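
As a rough illustration of the simplest method named above, the following Python sketch runs UPGMA on the seven-taxon distance matrix from the previous slide. The function name `upgma`, the nested-tuple tree and the `merges` list are assumptions made for this example; they do not come from the slides or from any particular package.

```python
from itertools import combinations

NAMES = ["Rat", "Mouse", "Rabbit", "Human", "Opossum", "Chicken", "Frog"]
DIST = [
    [0.0000, 0.0646, 0.1434, 0.1456, 0.3213, 0.3213, 0.7018],
    [0.0646, 0.0000, 0.1716, 0.1743, 0.3253, 0.3743, 0.7673],
    [0.1434, 0.1716, 0.0000, 0.0649, 0.3582, 0.3385, 0.7522],
    [0.1456, 0.1743, 0.0649, 0.0000, 0.3299, 0.2915, 0.7116],
    [0.3213, 0.3253, 0.3582, 0.3299, 0.0000, 0.3279, 0.6653],
    [0.3213, 0.3743, 0.3385, 0.2915, 0.3279, 0.0000, 0.5721],
    [0.7018, 0.7673, 0.7522, 0.7116, 0.6653, 0.5721, 0.0000],
]

def upgma(names, dist):
    """Cluster taxa with UPGMA; return (tree, merges).

    tree is a nested tuple of names; merges lists (left, right, height)
    in the order in which the clusters were joined.
    """
    # Active clusters: id -> (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {frozenset((i, j)): dist[i][j]
         for i, j in combinations(range(len(names)), 2)}
    merges = []
    next_id = len(names)
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of active clusters
        i, j = sorted(pair)
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        merges.append((ti, tj, d[pair] / 2))
        # Distance to the merged cluster is the size-weighted average.
        for k in clusters:
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        del d[pair]
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    tree = next(iter(clusters.values()))[0]
    return tree, merges

tree, merges = upgma(NAMES, DIST)
print(tree)
```

UPGMA assumes a constant rate along all lineages, so this sketch is only meant to show how a distance method turns a matrix into a tree.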

15
Correction for multiple hits

Only differences can be observed directly, not
distances
All distance methods rely (crucially) on this
A great many models used for nucleotide sequences
(e.g. JC, K2P, HKY, Rev, Maximum Likelihood)
Amino acid (aa) sequences are far more complicated!
Accuracy falls off drastically for highly
divergent sequences
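
To make the correction concrete, here is a minimal Python sketch of the simplest model in the list, Jukes-Cantor (JC), which converts the observed proportion of differing sites p into an estimated distance d = -(3/4) ln(1 - 4p/3). The function names `p_distance` and `jc69_distance` are chosen for this example only.

```python
import math

def p_distance(seq1, seq2):
    """Proportion of aligned, ungapped sites at which two sequences differ."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    return sum(a != b for a, b in pairs) / len(pairs)

def jc69_distance(p):
    """Jukes-Cantor corrected distance in substitutions per site."""
    if p >= 0.75:
        # The formula is undefined once the sequences look random.
        raise ValueError("p >= 0.75: sequences are saturated under JC")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

for p in (0.05, 0.30, 0.60):
    print(p, round(jc69_distance(p), 4))
```

The gap between p and the corrected distance widens quickly as p grows, which is one way to see why accuracy falls off for highly divergent sequences.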

16
Minimum Evolution

The total length of all branches in the tree
should be a minimum
It has been shown that the minimum evolution tree
is expected to be the true tree provided branch
lengths are corrected for multiple hits

17
Neighbour Joining
(Figure: tree diagram with leaves labelled 1 to 8)
18
Neighbour joining is an approximation to minimum
evolution
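
A hedged sketch of a single neighbour-joining step may help here. It uses the standard Q criterion, Q(i,j) = (n - 2) d(i,j) - sum_k d(i,k) - sum_k d(j,k), to choose the pair to join, and then computes the two new branch lengths. The function name `nj_pick_pair` and the five-taxon matrix are illustrative assumptions, not data from the slides.

```python
def nj_pick_pair(dist):
    """Return (i, j, length_i, length_j) for the first neighbour-joining step.

    dist is a symmetric distance matrix (list of lists) with n >= 3 taxa.
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - row_sums[i] - row_sums[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    _, i, j = best
    # Branch lengths from each joined taxon to the new internal node.
    length_i = dist[i][j] / 2 + (row_sums[i] - row_sums[j]) / (2 * (n - 2))
    length_j = dist[i][j] - length_i
    return i, j, length_i, length_j

# Made-up example matrix for taxa a, b, c, d, e.
EXAMPLE = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(EXAMPLE))  # (0, 1, 2.0, 3.0)
```

Note that the pair chosen (a, b) is not the pair with the smallest raw distance (d, e); the row-sum terms are what distinguish neighbour joining from plain clustering.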
19
Maximum Parsimony

Occam's Razor
"Entia non sunt multiplicanda praeter
necessitatem."
(Entities should not be multiplied beyond
necessity.)
William of Occam (1300-1349)

The best tree is the one which requires the
fewest substitutions
20

Check each topology
Count the minimum number of changes required to
explain the data
Choose the tree with the smallest number of
changes
Usually performs well with closely related
sequences but often performs badly with very
distantly related sequences
With distantly related sequences homoplasy
becomes a major problem
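
The counting step can be sketched with Fitch's algorithm, which gives the minimum number of changes for one site on a fixed rooted binary tree. The tree encoding (nested pairs of leaf names), the function names and the toy alignment are assumptions made for this illustration.

```python
def fitch(tree, states):
    """Return (state_set, changes) for one site on a rooted binary tree.

    tree is a leaf name (str) or a pair (left, right);
    states maps each leaf name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared state: at least one extra change is needed here.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, alignment):
    """Sum the Fitch counts over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch(tree, {name: seq[k] for name, seq in alignment.items()})[1]
        for k in range(length)
    )

# Toy alignment and the three possible topologies for four taxa.
ALIGNMENT = {"a": "AAG", "b": "AAG", "c": "CAT", "d": "CAT"}
TOPOLOGIES = [
    (("a", "b"), ("c", "d")),
    (("a", "c"), ("b", "d")),
    (("a", "d"), ("b", "c")),
]
scores = [parsimony_score(t, ALIGNMENT) for t in TOPOLOGIES]
best = TOPOLOGIES[scores.index(min(scores))]
print(scores, best)
```

Checking every topology like this is only feasible for a handful of taxa, since the number of topologies grows very rapidly with the number of sequences.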

21
Maximum Likelihood

Requires a model of evolution
Each substitution has an associated likelihood
given a branch of a certain length
A function is derived to represent the likelihood
of the data given the tree, branch-lengths and
additional parameters
The function is maximized
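
As a small worked case, the sketch below writes down the likelihood for the simplest possible "tree" (two sequences joined by one branch) under the Jukes-Cantor model and finds the branch length at which the likelihood is highest. The function names, the ternary search and the example counts (100 sites, 30 differences) are assumptions for illustration; real programs optimise topology and many branch lengths at once.

```python
import math

def jc69_log_likelihood(d, n_sites, n_diff):
    """Log-likelihood of a pairwise alignment under JC at distance d."""
    e = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 * (0.25 + 0.75 * e)  # a specific identical pair, e.g. A/A
    p_diff = 0.25 * (0.25 - 0.25 * e)  # a specific differing pair, e.g. A/C
    return (n_sites - n_diff) * math.log(p_same) + n_diff * math.log(p_diff)

def maximise(f, lo, hi, tol=1e-8):
    """Ternary search for the maximum of a single-peaked function."""
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return (lo + hi) / 2

d_hat = maximise(lambda d: jc69_log_likelihood(d, 100, 30), 1e-9, 5.0)
print(round(d_hat, 4))
```

For this two-sequence case the numerical optimum should agree with the closed-form Jukes-Cantor distance, which gives a useful sanity check on the likelihood function.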

22
Models can be made more parameter rich to
increase their realism

The most common additional parameters are
A correction to allow different substitution
rates for each type of nucleotide change
A correction for the proportion of sites which
are unable to change
A correction for variable site rates at those
sites which can change
The values of the additional parameters will be
estimated in the process (e.g. PAUP)

23
A gamma distribution can be used to model site
rate heterogeneity
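
The effect of the gamma shape parameter can be seen by drawing relative site rates directly. This sketch assumes the common convention of a gamma distribution with mean 1 (shape alpha, scale 1/alpha); the function name and the chosen alpha values are illustrative only.

```python
import random

def sample_site_rates(n_sites, alpha, seed=0):
    """Draw relative substitution rates from a gamma distribution with mean 1."""
    rng = random.Random(seed)
    # shape = alpha and scale = 1/alpha give mean 1 and variance 1/alpha.
    return [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]

def fraction_below(rates, threshold):
    """Fraction of sites evolving more slowly than the threshold."""
    return sum(r < threshold for r in rates) / len(rates)

strong = sample_site_rates(20000, alpha=0.5)  # strong rate heterogeneity
weak = sample_site_rates(20000, alpha=5.0)    # rates close to uniform
print(fraction_below(strong, 0.1), fraction_below(weak, 0.1))
```

A small alpha produces many nearly invariant sites alongside a few very fast ones, while a large alpha gives rates that cluster around the mean.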
24
Long Branches Attract

In a set of sequences evolving at different
rates the sequences evolving rapidly are drawn
together

25
Comparison of methods

Inconsistency
Neighbour Joining (NJ) is very fast but depends
on accurate estimates of distance. This is more
difficult with very divergent data
Parsimony suffers from Long Branch Attraction.
This may be a particular problem for very
divergent data
NJ can suffer from Long Branch Attraction
Parsimony is also computationally intensive
Codon usage bias can be a problem for MP and NJ
Maximum Likelihood is the most reliable but
depends on the choice of model and is very slow
Methods may be combined

26
The Molecular Clock

For a given protein the rate of sequence
evolution is approximately constant across
lineages
Zuckerkandl and Pauling (1965)

This would allow speciation and duplication
events to be dated accurately based on molecular
data
Local and approximate molecular clocks are more
reasonable
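
Under a strict clock the dating arithmetic is simple: two lineages that split t time units ago and each evolve at rate r are separated by a distance d = 2rt. The sketch below calibrates r from one pair and dates another; all numbers are hypothetical values invented for this example, and the function names are assumptions.

```python
def clock_rate(distance, calibration_time):
    """Substitutions per site per unit time along one lineage."""
    return distance / (2.0 * calibration_time)

def divergence_time(distance, rate):
    """Time since two lineages split, assuming a constant rate."""
    return distance / (2.0 * rate)

# Hypothetical calibration: distance 0.2 for a split dated at 100 time units.
rate = clock_rate(0.2, 100.0)
# Hypothetical query pair with distance 0.1 under the same clock.
print(divergence_time(0.1, rate))
```

If the rate differs between lineages, dates obtained this way will be biased, which is why a rate test is usually applied first.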
27
Relative Rate Test

Test whether sets of sequences are evolving at
equal rates (local molecular clock hypothesis)

e.g. RRTree, Robinson-Rechavi
http://pbil.univ-lyon1.fr/software/rrtree.html
28
Rooting the Tree

In an unrooted tree the direction of evolution is
unknown
The root is the hypothesized ancestor of the
sequences in the tree
The root can either be placed on a branch or at a
node
You should start by viewing an unrooted tree

32
Automatic rooting

Many software packages will root trees
automatically (e.g. mid-point rooting in NJPlot)
This must involve assumptions. BEWARE!

33
Rooting Using an Outgroup

1. The outgroup should be a sequence (or set of
sequences) known to be less closely related to
the rest of the sequences than they are to each
other
2. It should ideally be as closely related as
possible to the rest of the sequences while still
satisfying condition 1
The root must be somewhere between the outgroup
and the rest (either on the node or in a branch)

34
Sometimes two trees may look very different
but, in fact, differ only in the position of the
root
35
What sequences should I use for organism
phylogenies?

Slowly evolving / Fast evolving
rRNA
mitochondrion
other

36
How confident am I that my tree is correct?

Bootstrap values
Bootstrapping is a statistical technique that
uses random resampling of the data to estimate
sampling error for tree topologies

37
Bootstrapping phylogenies

Characters are resampled with replacement to
create many bootstrap replicate data sets
Each bootstrap replicate data set is analysed
(e.g. with parsimony, distance, ML etc.)
Agreement among the resulting trees is summarized
with a majority-rule consensus tree
Frequencies of occurrence of groups, bootstrap
proportions (BPs), are a measure of support for
those groups
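
The resampling step described above can be sketched as follows. The alignment is held as a dictionary of equal-length strings, and `has_group` is a hypothetical stand-in for "build a tree from this replicate and report whether it contains the group of interest"; neither is taken from a real phylogenetics package.

```python
import random

def bootstrap_replicate(alignment, rng):
    """Resample alignment columns with replacement.

    alignment maps taxon name -> aligned sequence (all the same length).
    """
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols)
            for name, seq in alignment.items()}

def bootstrap_proportion(alignment, has_group, n_reps=100, seed=0):
    """Fraction of replicates in which has_group(replicate) is true."""
    rng = random.Random(seed)
    hits = sum(bool(has_group(bootstrap_replicate(alignment, rng)))
               for _ in range(n_reps))
    return hits / n_reps

# Toy alignment; the group test simply asks whether a and b are identical.
ALN = {"a": "ACGTAC", "b": "ACGTAA", "c": "TCGAAC"}
print(bootstrap_proportion(ALN, lambda rep: rep["a"] == rep["b"]))
```

In practice `has_group` would wrap a full tree search, which is why bootstrapping multiplies the cost of an analysis by the number of replicates.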

38
Bootstrapping - an example
Ciliate SSUrDNA - parsimony bootstrap
(Figure: majority-rule consensus tree for nine taxa:
Ochromonas (1), Symbiodinium (2), Prorocentrum (3),
Loxodes (4), Tracheloraphis (5), Spirostomum (6),
Gruberia (7), Euplotes (8), Tetrahymena (9).
Bootstrap proportions shown on the internal
branches: 100, 84, 96, 100, 100, 100.)
Wim de Grave et al. Fiocruz bioinformatics
training course
39
Bootstrapping
Majority-rule consensus (with minority components)
Wim de Grave et al. Fiocruz bioinformatics
training course
40
Bootstrap - interpretation

Bootstrapping is a very valuable and widely used
technique (it is demanded by some journals)
BPs give an idea of how likely a given branch is
to remain unchanged if additional data, with the
same distribution, became available
BPs are not the same as confidence intervals.
There is no simple mapping between bootstrap
values and confidence intervals. There is no
agreement about what constitutes a good
bootstrap value (> 70, > 80, > 85?)
Some theoretical work indicates that BPs can be a
conservative estimate of confidence intervals
If the estimated tree is inconsistent, all the
bootstraps in the world won't help you.

41
Jack-knifing

Jack-knifing is very similar to bootstrapping and
differs only in the character resampling strategy
Jack-knifing is not as widely available or widely
used as bootstrapping
Tends to produce broadly similar results

42
Likelihood-based tests of topologies

Kishino-Hasegawa test
Trees specified a priori
KH can be used to test whether two competing
hypotheses have significantly different
likelihood
NB: should not be used to test trees that have
been chosen on the basis of the data!
Shimodaira-Hasegawa test
Can be used to test confidence of ML tree
compared to related trees (e.g. second most
likely tree from the data)
Andrew Rambaut
http://evolve.zoo.ox.ac.uk/software/shtests

43
Inferring Sequences at Ancestral Nodes

Maximum likelihood estimates of tree topologies
also provide inferred sequences at ancestral
nodes
Analysis of sequences at ancestral nodes and
sequence changes along ancestral branches can
provide information about when a novel trait or
mutation was acquired
PAML (Phylogenetic Analysis using Maximum
Likelihood)
Confidence intervals provided
Selection can be inferred

44
Coalescent models

Consider genetic lineages going back in time
Make inferences from patterns of coalescent
events (e.g. effective population size,
migrations etc.)
Improved efficiency for sequence simulations
A great number of software packages, including
LAMARC (Kuhner and Felsenstein)
Has generated enormous interest and a large body
of literature (for review see Rosenberg and
Nordborg, Nature Reviews Genetics, 2002)

For an excellent review of recent progress in
phylogenetics see
Molecular phylogenetics: state-of-the-art methods
for looking into the past
Whelan et al, Trends in Genetics, 2001

46
Inferring Function from Sequence Homology
High throughput Genome Annotation
47
Basic Methods
Taken from JA Eisen, Genome Research, 1998

Highest BLAST hit
uncharacterised gene is assigned the function of
the best BLAST hit
An additional cut-off value may be used
Still very frequently used; sometimes referred
to as first-pass annotation
Top Hits
Examine a set of top BLAST hits and consider the
consensus function
COGs (Clusters of Orthologous Groups)
Clustering of genes into orthologous groups based
on similarity scores

48
Phylogenomics
Taken from JA Eisen, Genome Research, 1998

Choose genes of interest
Identify homologues
Align sequences
Calculate gene tree
Overlay known functions onto tree
Infer likely function of genes of interest
Note: only proteins with confirmed functions
should be used (to avoid error propagation)

49
Increased power over similarity methods. However,
increased power comes with increased costs in
terms of labour
50
Duplication and Speciation

Gene duplication may accelerate the rate of
evolution and the rate of functional divergence
Orthologues may provide better information about
function (for a given level of divergence) than
paralogues
Software is available for automatic inference of
gene duplication events on a phylogenetic tree
(e.g. SDI, Seán Eddy). This can be used for
improved automation of phylogenomics (e.g. RIO,
Eddy et al.)
Ref: Zmasek and Eddy, BMC Bioinformatics, 2002

51
Functional Genomics

e.g. Xun Gu, Molecular Biology and Evolution, 1999
Examine functional divergence after duplication
Type I (rates change)
Type II (replacement at conserved site)
Diverge program
