V4 Prediction of Phylogenies based on single genes presentation

1
V4 Prediction of Phylogenies based on single genes
Material for this lecture is taken from chapter 6
of D. W. Mount, Bioinformatics, and from Joseph
Felsenstein's book.

A phylogenetic analysis of a family of related
nucleic acid or protein sequences is a
determination of how the family might have been
derived during evolution. The evolutionary
relationships among the sequences are depicted by
placing the sequences as outer branches on a tree.
Phylogenies, or evolutionary trees, are the basic
structures to describe differences between
species, and to analyze them statistically. They
have been around for over 140 years. Statistical,
computational, and algorithmic work on them is
ca. 40 years old.
2
3 main approaches in single-gene phylogeny
- maximum parsimony
- distance matrix
- maximum likelihood (not covered here)

Popular programs:
- PHYLIP (phylogenetic inference package, J. Felsenstein)
- PAUP (phylogenetic analysis using parsimony, Sinauer Assoc.)
3
Methods for Single-Gene Phylogeny
Flowchart:
(1) Choose set of related sequences.
(2) Obtain multiple sequence alignment.
(3) Is there strong sequence similarity? Yes: maximum parsimony methods. No: go to (4).
(4) Is there clearly recognizable sequence similarity? Yes: distance methods. No: maximum likelihood methods.
(5) Analyze how well data support prediction.
4
Parsimony methods
Edwards & Cavalli-Sforza (1963): that
evolutionary tree is to be preferred that
involves "the minimum net amount of
evolution". → Seek that phylogeny on which, when
we reconstruct the evolutionary events leading to
our data, there are as few events as
possible.
(1) We must be able to make a
reconstruction of events, involving as few events
as possible, for any proposed phylogeny.
(2) We must be able to search among all possible
phylogenies for the one or ones that minimize the
number of events.

5
A simple example
Suppose that we have 5 species, each of which
has been scored for 6 characters with two
possible states (0, 1). We will allow changes
0 → 1 and 1 → 0. The initial state at the root
of a tree may be either state 0 or state 1.

6
Evaluating a particular tree
To find the most parsimonious tree, we must have
a way of calculating how many changes of state
are needed on a given tree. This tree
represents the phylogeny of character
1. Reconstruct phylogeny of character 1 on this
tree.

7
Evaluating a particular tree
There are 2 equally good reconstructions, each
involving just one change of character
state. They differ in which state they assume at
the root of the tree, and they differ in which
branch they place the single change.

8
Evaluating a particular tree
3 equally good reconstructions for character 2,
which needs two changes of state.

9
Evaluating a particular tree
A single reconstruction for character 3,
involving one change of state.

10
Evaluating a particular tree
On the right: 2 reconstructions each for
characters 4 and 5 (two changes of state each),
because these characters have identical
patterns. A single reconstruction for
character 6, involving one change of state.

11
Evaluating a particular tree
The total number of changes of character state
needed on this tree is 1 + 2 + 1 + 2 + 2 + 1 = 9.
[Figure: reconstruction of the changes in state on
this tree]

12
Evaluating a particular tree
Alternative tree with only 8 changes of
state. The minimum conceivable number
of changes of state would be 6, as each of the 6
characters shows both states and so needs at least
one change. Thus, we have two extra changes
→ called homoplasy.

13
Evaluating a particular tree

Figure right shows another tree also requiring 8
changes. These two most parsimonious trees are
the same tree when the roots of the tree are
removed.
14
Methods of rooting the tree
There are many rooted trees, one for each branch
of this unrooted tree, and all have the same
number of changes of state. The number of
changes of state only depends on the unrooted
tree, and not at all on where the tree is then
rooted. Biologists want to think of trees as
rooted → need a method to place the root in an
otherwise unrooted tree:
(1) Outgroup criterion
(2) Use a molecular clock.

15
Outgroup criterion
Assumes that we know the answer in
advance. Suppose that we have a number of great
apes, plus a single old-world monkey. Suppose
that we know that the great apes are a
monophyletic group. If we infer a tree of these
species, we know that the root must be placed on
the lineage that connects the old-world monkey
(outgroup) to the great apes (ingroup).

16
Molecular clock
If an equal amount of changes were observed on
all lineages, there should be a point on the tree
that has equal amounts of change (branch lengths)
from there to all tips. With a molecular clock,
it is only the expected amounts of change that
are equal. The observed amounts may not be.
→ Use one of various methods to find a root that
makes the amounts of change approximately equal
on all lineages.

17
Branch lengths
Having found an unrooted tree, locate the changes
on it and find out how many occur in each of the
branches. The location of the changes can be
ambiguous. → Average over all possible
reconstructions of each character for which there
is ambiguity in the unrooted tree.
Fractional numbers in some branches of the left
tree add up to the (integer) number of changes
(right).

18
Open questions
- Particularly for larger data sets, we need to know
  how to count the number of changes of state by use
  of an algorithm.
- We need an algorithm for reconstructing states at
  interior nodes of the tree.
- We need to know how to search among all possible
  trees for the most parsimonious ones, and how to
  infer branch lengths.
- So far we have only considered a simple model of
  0/1 characters. DNA sequences have 4 states,
  protein sequences 20 states.
- Justification: is it reasonable to use the
  parsimony criterion? If so, what does it
  implicitly assume about the biology?
- What is the statistical status of finding the most
  parsimonious tree? Can we make statements about
  how well supported it is compared to other trees?

19
Counting evolutionary changes
2 related dynamic programming algorithms: Fitch
(1971) and Sankoff (1975)
- evaluate a phylogeny character by character
- for each character, consider it as a rooted tree,
  placing the root wherever seems appropriate
- update some information down the tree; when we
  reach the bottom, the number of changes of state
  is available.
They do not actually locate changes or
reconstruct interior states at the nodes of the
tree.

20
Fitch algorithm
Intended to count the number of changes in a
bifurcating tree with nucleotide sequence data,
in which any one of the 4 bases (A, C, G, T) can
change to any other. At a particular
site, we have observed the bases C, A, C, A and G
in the 5 species. They are given in the order in
which they appear in the tree, left to right.

21
Fitch algorithm
For the left two, at the node that is their
immediate common ancestor, attempt to construct
the intersection of the two sets. But as
$\{C\} \cap \{A\} = \emptyset$, instead construct
the union $\{C\} \cup \{A\} = \{A, C\}$ and count
1 change of state. For the rightmost pair of
species, assign the common ancestor the set
$\{A, G\}$, since $\{A\} \cap \{G\} = \emptyset$,
and count another change of state. ... Proceed to
the bottom. Total number of changes: 3. The
algorithm works on arbitrarily large trees.
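
The set rule above can be sketched in a few lines of Python. This is a minimal sketch, not a reference implementation: the nested-tuple tree encoding and the function name `fitch` are illustrative choices, and the tree shape ((C,A),(C,(A,G))) is an assumption, since the figure is not part of this transcript. With that shape the count comes out as the 3 changes quoted on the slide.

```python
from typing import FrozenSet, Tuple, Union

# A tree is either a tip (a one-letter base) or a pair (left, right) of subtrees.
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch(tree: Tree) -> Tuple[FrozenSet[str], int]:
    """Return (state set at this node, number of changes in its subtree)."""
    if isinstance(tree, str):              # tip: the observed base, no changes
        return frozenset(tree), 0
    left_set, left_changes = fitch(tree[0])
    right_set, right_changes = fitch(tree[1])
    common = left_set & right_set
    if common:                             # non-empty intersection: no new change
        return common, left_changes + right_changes
    # empty intersection: use the union and count one change of state
    return left_set | right_set, left_changes + right_changes + 1


# Assumed tree shape for the tips C, A, C, A, G (left to right).
example_tree: Tree = (("C", "A"), ("C", ("A", "G")))
root_set, n_changes = fitch(example_tree)
print(sorted(root_set), n_changes)
```

Each node is visited once, which is why the work per site grows in proportion to the number of tips.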

22
Complexity of Fitch algorithm
The Fitch algorithm can be carried out in a number
of operations that is proportional to the number of
species (tips) on the tree. Don't we need to
multiply this by the number of sites n? Not quite:
- Any site that is invariant (which has the same
  base in all species, e.g. AAAAA) can be dropped.
- Other sites with a single variant base
  (e.g. ATAAA) will only require a single change of
  state on all trees. These too can be dropped.
- For sites with the same pattern (e.g. CACAG) that
  we have already seen, simply use the number of
  changes previously computed.
- Patterns related by a relabeling of the bases
  (e.g. TCTCA and CACAG) need the same number of
  changes.
→ The numerical effort rises slower than linearly
with the number of sites.
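
The bookkeeping described above can be sketched as follows. The function names are hypothetical, and the drop rule used here (keep a site only if at least two different bases each occur in two or more species) is the usual parsimony-informative criterion, which covers the two cases named on the slide. The relabeling shortcut assumes equal costs for all base changes, as in the Fitch algorithm, and unambiguous A/C/G/T data.

```python
from collections import Counter
from typing import Dict, List


def canonical_pattern(column: str) -> str:
    """Relabel bases by order of first appearance: CACAG and TCTCA -> 'ababc'."""
    labels: Dict[str, str] = {}
    for base in column:
        if base not in labels:
            labels[base] = "abcd"[len(labels)]   # at most 4 distinct bases
    return "".join(labels[base] for base in column)


def compress_sites(columns: List[str]) -> Counter:
    """Count how often each canonical pattern occurs among informative sites."""
    counts: Counter = Counter()
    for column in columns:
        # number of bases that are shared by two or more species
        shared = sum(1 for n in Counter(column).values() if n >= 2)
        if shared < 2:
            continue          # invariant (AAAAA) or single-variant (ATAAA) site
        counts[canonical_pattern(column)] += 1
    return counts


sites = ["AAAAA", "ATAAA", "CACAG", "TCTCA", "CACAG"]
print(compress_sites(sites))
```

Each distinct pattern then needs to be evaluated on a tree only once, with the resulting number of changes multiplied by the pattern's count.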

23
Sankoff algorithm
The Fitch algorithm is very effective, but it is
not obvious why it works. The Sankoff algorithm is
more complex, but its structure is more
apparent. Assume that we have a table of the
cost of changes $c_{ij}$ between each character
state i and each other state j. Compute the total
cost of the most parsimonious combination of events
by computing it for each character. For a given
character, compute for each node k in the tree a
quantity $S_k(i)$. This is interpreted as the
minimal cost, given that node k is assigned state
i, of all the events upwards from node k in the
tree.

24
Sankoff algorithm
If we can compute these values for all nodes, we
can also compute them for the bottom node in the
tree (node 0). Simply choose the minimum of these
values,

$$S = \min_i S_0(i),$$

which is the desired total cost we seek,
the minimum cost of evolution for this
character. At the tips of the tree, the $S(i)$ are
easy to compute. The cost is 0 if the observed
state is state i, and infinite otherwise. If we
have observed an ambiguous state, the cost is 0
for all states that it could be, and infinite for
the rest. Now we just need an algorithm to
calculate the $S(i)$ for the immediate common
ancestor of two nodes.

25
Sankoff algorithm
Suppose that the two descendant nodes are called
l and r (for left and right). For their
immediate common ancestor, node a, we compute

$$S_a(i) = \min_j \left[ c_{ij} + S_l(j) \right] + \min_m \left[ c_{im} + S_r(m) \right]$$

The smallest possible cost given that node a is
in state i is the cost $c_{ij}$ of going from state
i to state j in the left descendant lineage, plus
the cost $S_l(j)$ of events further up in the
subtree given that node l is in state j. Select the
value of j that minimizes that sum. The same
calculation applies to the right descendant lineage
→ the sum of these two minima is the smallest
possible cost for the subtree above node a, given
that node a is in state i. Apply the equation
successively to each node in the tree, working
downwards. Finally compute all $S_0(i)$ and take
the minimum over i to find the minimum cost for the
whole tree.
26
Sankoff algorithm
The array (6,6,7,8) at the bottom
of the tree has a minimum value of 6, which is the
minimum total cost of the tree for this site.
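
The recursion can be sketched in Python as below. Neither the cost table nor the tree is given in this transcript, so both are assumptions: costs of 1 for a transition (A↔G, C↔T) and 2.5 for a transversion, and the tree ((C,A),(C,(A,G))) with tips C, A, C, A, G. Under these assumptions the sketch yields the array (6, 6, 7, 8) quoted above, in the state order A, C, G, T.

```python
from typing import List, Tuple, Union

Tree = Union[str, Tuple["Tree", "Tree"]]
STATES = "ACGT"
INF = float("inf")
PURINES = {"A", "G"}


def cost(i: str, j: str) -> float:
    """Assumed costs: 0 for no change, 1 for a transition, 2.5 for a transversion."""
    if i == j:
        return 0.0
    return 1.0 if (i in PURINES) == (j in PURINES) else 2.5


def sankoff(tree: Tree) -> List[float]:
    """Return [S(A), S(C), S(G), S(T)] for the node at the base of `tree`."""
    if isinstance(tree, str):
        # tip: cost 0 for the observed state, infinite for every other state
        return [0.0 if s == tree else INF for s in STATES]
    s_left, s_right = sankoff(tree[0]), sankoff(tree[1])
    result = []
    for i in STATES:
        # cheapest way to continue into each descendant lineage from state i
        best_left = min(cost(i, j) + s_left[n] for n, j in enumerate(STATES))
        best_right = min(cost(i, j) + s_right[n] for n, j in enumerate(STATES))
        result.append(best_left + best_right)
    return result


root_costs = sankoff((("C", "A"), ("C", ("A", "G"))))
print(root_costs, min(root_costs))
```

Setting every off-diagonal cost to 1 makes the minimum at the root equal to the Fitch count for the same tree.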

27
Finding the best tree by heuristic search
The obvious method for searching for the most
parsimonious tree is to consider ALL trees and
evaluate each one. Unfortunately, the number of
possible trees is generally too large. → Use
heuristic search methods that attempt to find the
best trees without looking at all possible
trees.
(1) Make an initial estimate of the tree
and make small rearrangements of it to find
neighboring trees.
(2) If any of these neighbors are better, move to
them and continue the search.
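
To give a feeling for "too large": a standard counting result, not stated on the slide, is that n labeled tips admit (2n − 5)!! unrooted and (2n − 3)!! rooted bifurcating trees. The short sketch below evaluates these counts; the function names are illustrative.

```python
def odd_double_factorial(m: int) -> int:
    """Product m * (m - 2) * ... * 3 * 1 for odd m; returns 1 when m < 1."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def count_unrooted_trees(n_tips: int) -> int:
    """Number of unrooted bifurcating trees with n_tips labeled tips (n_tips >= 3)."""
    return odd_double_factorial(2 * n_tips - 5)


def count_rooted_trees(n_tips: int) -> int:
    """Number of rooted bifurcating trees with n_tips labeled tips (n_tips >= 2)."""
    return odd_double_factorial(2 * n_tips - 3)


for n in (5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Already at 10 species there are about two million unrooted trees, so exhaustive evaluation is only practical for very small data sets.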

28
Distance matrix methods
Introduced by Cavalli-Sforza & Edwards (1967) and
by Fitch & Margoliash (1967). The general idea
seems as if it would not work very well
(Felsenstein):
- calculate a measure of the distance between each
  pair of species
- find a tree that predicts the observed set of
  distances as closely as possible.
All information from higher-order combinations of
character states is left out. But computer
simulation studies show that the amount of lost
information is remarkably small. Best way to think
about distance matrix methods: consider distances
as estimates of the branch length separating that
pair of species.

29
Least square method
- observed table (matrix) of distances $D_{ij}$
- any particular tree leads to a predicted set of
  distances $d_{ij}$.

30
Least square method
Measure of the discrepancy between the observed
and expected distances:

$$Q = \sum_i \sum_{j \neq i} w_{ij} \left( D_{ij} - d_{ij} \right)^2$$

where the weights $w_{ij}$ can be differently
defined:
- $w_{ij} = 1$ (Cavalli-Sforza & Edwards, 1967)
- $w_{ij} = 1/D_{ij}^2$ (Fitch & Margoliash, 1967)
- $w_{ij} = 1/D_{ij}$ (Beyer et al., 1974)
Aim: find the tree topology and branch lengths that
minimize Q. The equation above is quadratic in the
branch lengths. Take the derivative with respect to
the branch lengths, set it to 0, and solve the
resulting system of linear equations. The solution
will minimize Q.
(Doug Brutlag's course)
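
As a small illustration of the three weighting schemes, the sketch below evaluates the discrepancy for a toy set of distances. All numbers and names are hypothetical. Each unordered pair is counted once here, which only rescales Q by a constant factor and does not change which tree minimizes it.

```python
from typing import Callable, Dict, Tuple

Pair = Tuple[str, str]


def q_statistic(observed: Dict[Pair, float],
                predicted: Dict[Pair, float],
                weight: Callable[[float], float]) -> float:
    """Weighted sum of squared differences between observed and predicted distances."""
    return sum(weight(D) * (D - predicted[pair]) ** 2
               for pair, D in observed.items())


def unweighted(D: float) -> float:         # w_ij = 1
    return 1.0


def fitch_margoliash(D: float) -> float:   # w_ij = 1 / D_ij^2
    return 1.0 / D ** 2


def beyer(D: float) -> float:              # w_ij = 1 / D_ij
    return 1.0 / D


# Hypothetical observed distances and the distances predicted by some tree.
observed = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
predicted = {("A", "B"): 2.0, ("A", "C"): 3.0, ("B", "C"): 5.0}

for w in (unweighted, fitch_margoliash, beyer):
    print(w.__name__, q_statistic(observed, predicted, w))
```

The weighted variants penalize a given absolute error less when the observed distance is large.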
31
Least square method

[Figure: unrooted tree of the five species A to E
with branch lengths v1 to v7]
Number species in alphabetical order. The
expected distance between species A and D is
$d_{14} = v_1 + v_7 + v_4$. The expected distance
between species B and E is
$d_{25} = v_5 + v_6 + v_7 + v_2$.
32
Least square method
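Number all branches of the tree and introduce an
indicator variable $x_{ijk}$: $x_{ijk} = 1$ if
branch k lies in the path from species i to species
j, and $x_{ijk} = 0$ otherwise. The expected
distance between i and j will then be

$$d_{ij} = \sum_k x_{ijk} v_k$$

and

$$Q = \sum_i \sum_{j \neq i} w_{ij} \Big( D_{ij} - \sum_k x_{ijk} v_k \Big)^2 .$$

Setting the derivative of Q with respect to $v_k$
to zero gives, for the case $w_{ij} = 1 \;\forall\, i, j$,

$$\sum_i \sum_{j \neq i} x_{ijk} \Big( D_{ij} - \sum_m x_{ijm} v_m \Big) = 0 .$$

Note that there is one such equation for each
branch k.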

33
Least square method
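$$\begin{aligned}
D_{AB} + D_{AC} + D_{AD} + D_{AE} &= 4v_1 + v_2 + v_3 + v_4 + v_5 + 2v_6 + 2v_7 \\
D_{AB} + D_{BC} + D_{BD} + D_{BE} &= v_1 + 4v_2 + v_3 + v_4 + v_5 + 2v_6 + 3v_7 \\
D_{AC} + D_{BC} + D_{CD} + D_{CE} &= v_1 + v_2 + 4v_3 + v_4 + v_5 + 3v_6 + 2v_7 \\
D_{AD} + D_{BD} + D_{CD} + D_{DE} &= v_1 + v_2 + v_3 + 4v_4 + v_5 + 2v_6 + 3v_7 \\
D_{AE} + D_{BE} + D_{CE} + D_{DE} &= v_1 + v_2 + v_3 + v_4 + 4v_5 + 3v_6 + 2v_7 \\
D_{AC} + D_{AE} + D_{BC} + D_{BE} + D_{CD} + D_{DE} &= 2v_1 + 2v_2 + 3v_3 + 2v_4 + 3v_5 + 6v_6 + 4v_7 \\
D_{AB} + D_{AD} + D_{BC} + D_{CD} + D_{BE} + D_{DE} &= 2v_1 + 3v_2 + 2v_3 + 3v_4 + 2v_5 + 4v_6 + 6v_7
\end{aligned}$$

Stack up the (4 + 3 + 2 + 1 = 10) $D_{ij}$, in
alphabetical order, into a vector

$$\mathbf{d} = (D_{AB}, D_{AC}, D_{AD}, D_{AE}, D_{BC}, D_{BD}, D_{BE}, D_{CD}, D_{CE}, D_{DE})^T$$

and arrange the coefficients $x_{ijk}$ in a matrix
X, with each row corresponding to the $D_{ij}$ in
the same row of $\mathbf{d}$ and containing a 1 in
column k if branch k occurs on the path between
species i and j (and a 0 otherwise).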

34
Least square method
If we also stack up the 7 $v_i$ into a vector
$\mathbf{v}$, the previous set of linear equations
can be compactly expressed as

$$X^T \mathbf{d} = (X^T X)\, \mathbf{v}$$

Multiplying from the left by the inverse of
$X^T X$, one can solve for the least squares branch
lengths:

$$\mathbf{v} = (X^T X)^{-1} X^T \mathbf{d}$$

This is a standard method of expressing least
squares problems in matrix notation and solving
them.
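
The matrix recipe can be tried out on the five-species example. The following is a sketch under stated assumptions: the topology (branches 1 to 5 lead to tips A to E, branch 6 separates {C, E} and branch 7 separates {B, D} from the rest) is read off the path equations above, the branch lengths are made up for the demonstration, and the normal equations are solved with exact fractions from the standard library rather than with a linear-algebra package.

```python
from fractions import Fraction
from itertools import combinations
from typing import List

SPECIES = "ABCDE"
# Assumed topology: branches on the path from each tip to the central node.
PATH_TO_CENTER = {"A": [1], "B": [2, 7], "C": [3, 6], "D": [4, 7], "E": [5, 6]}
PAIRS = list(combinations(SPECIES, 2))   # AB, AC, ..., DE in alphabetical order


def design_matrix() -> List[List[int]]:
    """One row per species pair, one column per branch; 1 if branch is on the path."""
    rows = []
    for i, j in PAIRS:
        on_path = set(PATH_TO_CENTER[i]) ^ set(PATH_TO_CENTER[j])
        rows.append([1 if k in on_path else 0 for k in range(1, 8)])
    return rows


def solve(a: List[List[Fraction]], b: List[Fraction]) -> List[Fraction]:
    """Solve a x = b by Gauss-Jordan elimination in exact arithmetic."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]        # augmented matrix
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def least_squares_branch_lengths(distances: List[Fraction]) -> List[Fraction]:
    """Solve (X^T X) v = X^T d for the 7 branch lengths."""
    x = design_matrix()
    xtx = [[Fraction(sum(r[p] * r[q] for r in x)) for q in range(7)] for p in range(7)]
    xtd = [sum(r[p] * d for r, d in zip(x, distances)) for p in range(7)]
    return solve(xtx, xtd)


true_v = [Fraction(k) for k in (1, 2, 3, 4, 5, 6, 7)]     # hypothetical lengths
d = [sum(c * v for c, v in zip(row, true_v)) for row in design_matrix()]
estimated = least_squares_branch_lengths(d)
print([str(v) for v in estimated])
```

Because the distances in this demonstration are generated from the tree itself, the estimate should reproduce the input branch lengths; with real data the fit is only approximate.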

35
Least square method
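When we have weighted least squares, with a
diagonal matrix of weights in the same order as
the $D_{ij}$,

$$W = \mathrm{diag}(w_{AB}, w_{AC}, \dots, w_{DE}),$$

then the least squares equations can be written

$$X^T W \mathbf{d} = (X^T W X)\, \mathbf{v}$$

and their solution

$$\mathbf{v} = (X^T W X)^{-1} X^T W \mathbf{d}$$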
36
Finding the least squares tree topology
Now that we are able to assign branch lengths to
each tree topology, we need to search among tree
topologies. This can be done by the same methods
of heuristic search that were presented for the
maximum parsimony method. Note: no one has so far
presented a branch-and-bound method for finding
the least squares tree exactly. Day (1986) has
shown that this problem is NP-complete. The
search is not only among tree topologies, but
also among branch lengths.

37
neighbor-joining method
Introduced by Saitou and Nei (1987). The algorithm
works by clustering. It does not assume a molecular
clock, but approximates the minimum evolution
model. Minimum evolution model: among
possible tree topologies, choose the one with
minimal total branch length. Neighbor-joining,
like the least-squares method, is guaranteed to
recover the true tree if the distance matrix is
an exact reflection of the tree.

38
neighbor-joining method
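(1) For each tip, compute

$$u_i = \sum_{j \neq i} \frac{D_{ij}}{n - 2}$$

where n is the current number of tips.
(2) Choose the i and j for which
$D_{ij} - u_i - u_j$ is smallest.
(3) Join items i and j. Compute the branch length
from i to the new node ($v_i$) and from j to the
new node ($v_j$) as

$$v_i = \tfrac{1}{2} D_{ij} + \tfrac{1}{2} (u_i - u_j), \qquad v_j = \tfrac{1}{2} D_{ij} + \tfrac{1}{2} (u_j - u_i)$$

(4) Compute the distance between the new node (ij)
and each of the remaining tips k as

$$D_{(ij),k} = \frac{D_{ik} + D_{jk} - D_{ij}}{2}$$

(5) Delete tips i and j from the tables and
replace them by the new node, (ij), which is now
treated as a tip.
(6) If more than 2 nodes remain, go back to step
(1). Otherwise, connect the two remaining nodes
(e.g. l and m) by a branch of length $D_{lm}$.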
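
The six steps can be sketched in Python as follows. This is an illustrative sketch rather than the code of any particular package: the nested-dictionary distance table, the node naming and the four-species test distances are all assumptions made for the example. The distances are generated from a tree with tip branches A = 1, B = 2, C = 3, D = 4 and an internal branch of length 1, so the sketch should recover those lengths.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Dist = Dict[str, Dict[str, float]]


def neighbor_joining(dist: Dist) -> List[Tuple[str, str, float]]:
    """Return branches as (node, neighbor, length) from a symmetric distance table."""
    d = {a: dict(row) for a, row in dist.items()}       # work on a copy
    branches: List[Tuple[str, str, float]] = []
    while len(d) > 2:
        n = len(d)
        # (1) u_i for each current tip
        u = {i: sum(d[i][j] for j in d if j != i) / (n - 2) for i in d}
        # (2) pair with the smallest D_ij - u_i - u_j
        i, j = min(combinations(d, 2),
                   key=lambda p: d[p[0]][p[1]] - u[p[0]] - u[p[1]])
        # (3) join i and j; branch lengths to the new node
        new = f"({i},{j})"
        v_i = 0.5 * d[i][j] + 0.5 * (u[i] - u[j])
        v_j = 0.5 * d[i][j] + 0.5 * (u[j] - u[i])
        branches += [(i, new, v_i), (j, new, v_j)]
        # (4) distances from the new node to each remaining tip
        new_row = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j])
                   for k in d if k not in (i, j)}
        # (5) delete i and j, insert the new node as a tip
        for k in (i, j):
            del d[k]
        for k, row in d.items():
            del row[i], row[j]
            row[new] = new_row[k]
        d[new] = new_row
    # (6) connect the two remaining nodes
    l, m = d
    branches.append((l, m, d[l][m]))
    return branches


# Hypothetical additive distances for four species.
example: Dist = {
    "A": {"B": 3.0, "C": 5.0, "D": 6.0},
    "B": {"A": 3.0, "C": 6.0, "D": 7.0},
    "C": {"A": 5.0, "B": 6.0, "D": 7.0},
    "D": {"A": 6.0, "B": 7.0, "C": 7.0},
}
for branch in neighbor_joining(example):
    print(branch)
```

Each pass of the loop removes one tip from the table, so the procedure finishes after n − 2 joins plus the final connecting branch.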

39
limitation of distance methods
Distance matrix methods are the easiest phylogeny
methods to program, and they are very
fast. Distance methods have problems when the
evolutionary rates vary strongly among sites. One
can correct for this in distance methods as well as
in likelihood methods. When the variation of rates
is large, these corrections become important. In
likelihood methods, the correction can use
information from changes in one part of the tree
to inform the correction in others. Once a
particular part of the molecule is seen to change
rapidly in the primates, this will affect the
interpretation of that part of the molecule among
the rodents as well. But a distance matrix
method is inherently incapable of propagating the
information in this way. Once one is looking at
changes within rodents, it has forgotten where
changes were seen among primates.
