Building Phylogenetic Trees - PowerPoint PPT Presentation

1
Building Phylogenetic Trees
Yaw-Ling Lin
Dept. of Computer Science and Information Management, Providence University, Taiwan
E-mail: yllin_at_pu.edu.tw
WWW: http://www.cs.pu.edu.tw/yawlin
2
Phylogenetic Tree
  • Topology: bifurcating
  • Leaves: 1 ... N
  • Internal nodes: N+1 ... 2N-2

3
Orthologues / Paralogues
4
Rooted / Unrooted Tree
5
Counting Trees
6
Counting Trees
(2N - 5)!! unrooted trees for N taxa
(2N - 3)!! rooted trees for N taxa
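The two counts can be evaluated directly to see how quickly the number of topologies grows. A minimal sketch in Python; the function names `double_factorial`, `count_unrooted`, and `count_rooted` are illustrative and do not come from the slides.

```python
def double_factorial(n: int) -> int:
    """Return n!! = n * (n - 2) * (n - 4) * ... (1 for n <= 1)."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def count_unrooted(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees on n_taxa labeled leaves: (2N - 5)!!."""
    if n_taxa < 3:
        raise ValueError("need at least 3 taxa for an unrooted tree")
    return double_factorial(2 * n_taxa - 5)


def count_rooted(n_taxa: int) -> int:
    """Number of rooted bifurcating trees on n_taxa labeled leaves: (2N - 3)!!."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa for a rooted tree")
    return double_factorial(2 * n_taxa - 3)


for n in (3, 4, 5, 10):
    print(n, count_unrooted(n), count_rooted(n))
# 3 1 3
# 4 3 15
# 5 15 105
# 10 2027025 34459425
```

Already at 10 taxa there are over two million unrooted topologies, which is why exhaustive search is only feasible for small N.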
7
Rooting the tree
To root a tree mentally, imagine that the tree is
made of string. Grab the string at the root
and tug on it until the ends of the string (the
taxa) fall opposite the root.
Unrooted tree
8
UPGMA -- Unweighted Pair Group Method with
Arithmetic mean
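Simplest method: uses a sequential clustering
algorithm (assumes rate constancy among
lineages, which is often violated).
[Figure: step 1 and step 2, each showing the distance matrix and the tree; after A and B are joined, the reduced matrix has the entry d(AB)C]
Distance from the cluster (AB) to C:
$d_{(AB)C} = (d_{AC} + d_{BC}) / 2$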
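The sequential clustering can be sketched in a few lines of Python. This is an illustrative sketch, not the deck's own code: the function name `upgma`, the `frozenset`-keyed distance dictionary, and the four-taxon distances are all assumptions made for the example. Cluster distances are averaged weighted by cluster size, which reduces to the two-taxon formula above when both clusters are single taxa.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> distance between taxa a and b.
    Returns a nested tuple (left, right, height), where height is the
    distance from that node down to its leaves.
    """
    # Each active cluster maps to (subtree, number_of_leaves).
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the pair of active clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = f"({a}{b})"
        height = d[frozenset((a, b))] / 2
        # Distance to every other cluster: size-weighted arithmetic mean.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b, height), size_a + size_b)
    [(tree, _)] = clusters.values()
    return tree


# Hypothetical ultrametric distances (not taken from the slides).
raw = {("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
       ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6}
dist = {frozenset(pair): value for pair, value in raw.items()}
tree = upgma(["A", "B", "C", "D"], dist)
print(tree)  # ('D', ('C', ('A', 'B', 1.0), 2.0), 3.0)
```

Because the example distances are ultrametric, the rate-constancy assumption holds and the recovered node heights (1, 2, 3) are exactly half the pairwise distances.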
9
UPGMA step 1: combine B and C
10
UPGMA step 2: combine BC and D
(10 + 12)/2
(4 + 6)/2
11
UPGMA step 3: combine A and E
12
UPGMA step 4: combine AE and BCD
13
UPGMA Result
14
UPGMA Result
15
UPGMA(1)
16
UPGMA(2)
17
UPGMA -- Illustrations
18
When UPGMA fails
19
Neighbor Joining
  • Very popular method
  • Does not make the molecular clock assumption: a
    modified distance matrix is constructed to adjust
    for differences in the evolution rate of each taxon
  • Produces an unrooted tree
  • Assumes additivity: the distance between a pair of
    leaves is the sum of the lengths of the edges
    connecting them
  • Like UPGMA, constructs the tree by sequentially
    joining subtrees

20
Additivity
21
Naïve NJ by Additivity?
  • $O(n^2)$ (i,j) pairs
  • $O(n^2)$ (k,l) pairs
  • (k,l) rejects (i,j) whenever additivity fails
  • $O(n^4)$ to pick an (i,j) neighbor pair!
  • So $O(n^5)$ time suffices in total
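The naive test can be written down directly. The sketch below assumes the rejection rule is the four-point condition: (k,l) accepts (i,j) only if d_ij + d_kl is strictly smaller than the other two pairings of the four leaves. The function name `neighbor_pairs` and the five-leaf additive matrix are hypothetical, chosen for illustration.

```python
from itertools import combinations


def neighbor_pairs(d):
    """Return all pairs (i, j) that no (k, l) rejects under the four-point test.

    d is a symmetric matrix (list of lists) of additive distances.
    """
    n = len(d)
    pairs = []
    for i, j in combinations(range(n), 2):            # O(n^2) candidate pairs
        others = [x for x in range(n) if x not in (i, j)]
        # (k, l) rejects (i, j) unless d_ij + d_kl is strictly the smallest
        # of the three ways to pair up {i, j, k, l}.
        if all(
            d[i][j] + d[k][l] < min(d[i][k] + d[j][l], d[i][l] + d[j][k])
            for k, l in combinations(others, 2)       # O(n^2) witnesses
        ):
            pairs.append((i, j))
    return pairs


# Hypothetical additive distances for leaves A, B, C, D, E (indices 0..4),
# generated from a tree in which (A, B) and (C, D) are the neighbor pairs.
matrix = [
    [0, 3, 6, 4, 4],
    [3, 0, 7, 5, 5],
    [6, 7, 0, 4, 6],
    [4, 5, 4, 0, 4],
    [4, 5, 6, 4, 0],
]
print(neighbor_pairs(matrix))  # [(0, 1), (2, 3)]
```

Each call scans all O(n^2) candidates against all O(n^2) witnesses, which is the O(n^4) cost per joining step mentioned above.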

22
Neighbor Joining: once we know the correct (i,j)
pair
23
Neighbour Joining: why not pick the smallest
(i,j) pair?
24
Neighbour Joining(3)
25
Neighbour Joining Algorithm
26
Neighbor-Joining Algorithm
27
Neighbor-Joining Complexity
  • Each iteration performs a search in $O(n^2)$ time
    and updates the distance matrix in $O(n^2)$ time.
  • With $O(n)$ iterations, this gives a total time
    complexity of $O(n^3)$ and a space complexity of
    $O(n^2)$.
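A compact sketch of the joining loop makes the cost per iteration visible. It assumes the standard rate-corrected criterion (n - 2) d_ij - r_i - r_j, with r_i the sum of distances from i to all other active nodes; the function name `neighbor_joining`, the internal node labels `U0`, `U1`, ..., and the five-leaf matrix are hypothetical and not taken from the slides.

```python
def neighbor_joining(names, dist):
    """Neighbor joining on a dict-of-dicts distance table.

    Returns a list of (node, joined_to, branch_length) edges of the
    unrooted tree; internal nodes are named U0, U1, ...
    """
    d = {a: dict(row) for a, row in dist.items()}     # work on a copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each active node.
        r = {a: sum(d[a][b] for b in nodes if b != a) for a in nodes}
        # O(n^2) search for the pair minimizing the rate-corrected criterion.
        i, j = min(
            ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:]),
            key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]],
        )
        u = f"U{next_id}"
        next_id += 1
        # Branch lengths from i and j to the new internal node u.
        length_i = d[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges.append((i, u, length_i))
        edges.append((j, u, d[i][j] - length_i))
        # Distances from u to every remaining node.
        d[u] = {}
        for k in nodes:
            if k not in (i, j):
                d[u][k] = d[k][u] = (d[i][k] + d[j][k] - d[i][j]) / 2
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[a][b]))
    return edges


names = ["A", "B", "C", "D", "E"]
matrix = [
    [0, 3, 6, 4, 4],
    [3, 0, 7, 5, 5],
    [6, 7, 0, 4, 6],
    [4, 5, 4, 0, 4],
    [4, 5, 6, 4, 0],
]
dist = {a: {b: matrix[x][y] for y, b in enumerate(names)}
        for x, a in enumerate(names)}
edges = neighbor_joining(names, dist)
for edge in edges:
    print(edge)
```

The example matrix is additive, so the leaf branch lengths reported for A to E (1, 2, 3, 1, 2) match the tree the distances were generated from.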

28
Reasoning the NJ Method
  • Where do the ideas of $S_{i,j}$ and $R_i$ come from?
  • How correct is the algorithm?
  • Heuristic or exact solution?

29
The 1-star Sum of the Branch Lengths
  • D and L denote the distance between OTUs and the
    branch length between nodes, respectively
  • Each branch is counted N-1 times when all
    distances are added

30
The paired-2-star Sum of the Branch Lengths
31
The paired-2-star Tree Size
32
The Distance and Branch Lengths between a
Combined OTU and another One
33
Before the proof
34
Before the proof (Cont.)
35
Neighbor-Joining The proof
36
Lemma
37
Lemma (Cont.)
38
Proof
39
Proof of the Theorem by contradiction
[Figure: tree with labels i, j, k, l, r, s, u, v, w, attachment points x, and arcs A and B]
Type 1 (A): $-2D_{ux} - 2D_{uv}$. Type 2 (B): $-4D_{vx} + 2D_{uv}$. For
the sum in formula (b) to be nonnegative, there should be
more OTUs of type 2 than of type 1.
Suppose that i and j are not neighbors. Let k and
l be any pair of neighbors, so that i, j, k, and
l are distinct and are represented in the tree.
Consider the sum in formula (b), which is
nonnegative. If m is a fifth OTU, then it joins the
tree at point x along one of the indicated arcs.
Say that m is of type 1 if it joins the path from
i to j at any node different from u, and that m is
of type 2 if it joins the path from i to j at
node u.
40
Proof of the theorem (Cont.)
If m is of type 1, then the corresponding summand
in formula (b) is $-2D_{ux} - 2D_{uv}$. If m is of type 2,
then the corresponding summand in formula (b) is
$-4D_{vx} + 2D_{uv}$. For the sum in formula (b) to be
nonnegative, there must be at least as many terms
corresponding to OTUs m of type 2 as there are
terms corresponding to OTUs m of type 1. It
follows that there are more OTUs that join the
path from i to j at u than there are OTUs that
join that path at all other nodes combined.
Because neither i nor j has a neighbor, there
must be a pair r, s of neighbors that joins the
path from i to j at a node w that is different from
u. By the above argument applied to w, there are
more OTUs that join the path from i to j at w
than there are OTUs that join that path at all
other nodes combined. The conclusions about u and
w contradict each other, and the theorem follows.
41
Speeding up Neighbor-Joining Tree Construction
  • In this paper, the authors present several
    heuristics for speeding up the NJ method.
  • The heuristics attempt to reduce the search time
    by using a quad-tree.
  • The worst case time complexity remains $O(n^3)$ and
    the space complexity after adding the quad-tree
    is still $O(n^2)$.
  • The authors have implemented a tool, QuickJoin.

42
Previous Work
  • The neighbor-joining method was introduced by
    Saitou and Nei.
  • The algorithm was later amended by Studier and
    Keppler, with a running time of $O(n^3)$.
  • BIONJ -- Gascuel et al. produced an $O(n^3)$
    implementation of a variant of the NJ algorithm
    that produces more accurate trees in many cases.
  • QuickTree -- Durbin et al. produced a
    code-optimized implementation of the NJ algorithm.

43
Appendix: Proof of neighbour-joining
44
+/- of distance methods
  • Advantages
    • easy to perform
    • quick calculation
    • fit for sequences having high similarity scores
  • Disadvantages
    • the sequences are not considered as such (loss of
      information)
    • all sites are generally treated equally (do not
      take into account differences of substitution
      rates)
    • not applicable to distantly divergent sequences

45
Parsimony
46
Maximum Parsimony Method
Principle: search for the tree that requires the
smallest number of character state changes
between the OTUs.
Informative sites: those that favor some trees
over others. Operationally: at least two
different kinds of residues at the site, each of
which is found in at least two of the OTU
sequences.
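The operational definition of an informative site translates into a short check per alignment column. A minimal sketch; the function name `informative_sites` is illustrative, and the five sequences are the ones labeled A to E in the deck's parsimony example.

```python
from collections import Counter


def informative_sites(seqs):
    """Return the 0-based column indices that are parsimony-informative."""
    sites = []
    for pos, column in enumerate(zip(*seqs)):
        counts = Counter(column)
        # At least two kinds of residues, each found in at least two sequences.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(pos)
    return sites


seqs = ["CAGGTA", "CAGACA", "CGGGTA", "TGCACT", "TGCGTA"]  # A, B, C, D, E
print(informative_sites(seqs))  # [0, 1, 2, 3, 4]
```

The last column (A, A, A, T, A) is excluded because T occurs in only one sequence, so it costs one change on every tree and cannot favor one topology over another.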
47
Evaluating Parsimony Scores
  • How do we compute the Parsimony score for a given
    tree?
  • Traditional Parsimony
    • Each base change has a cost of 1
  • Weighted Parsimony
    • Each change is weighted by the score c(a,b)

48
Traditional Parsimony
  • Solved independently for each position
  • Linear time solution

[Figure: example tree with nodes labeled a, a, {a,g}, a]
49
Traditional Parsimony
50
Evaluating Weighted Parsimony
  • Dynamic programming on the tree
  • Initialization
    • For each leaf i set $S(i,a) = 0$ if i is labeled by
      a, otherwise $S(i,a) = \infty$
  • Iteration
    • If k is a node with children i and j, then
      $S(k,a) = \min_b (S(i,b) + c(a,b)) + \min_b (S(j,b) + c(a,b))$
  • Termination
    • Cost of the tree is $\min_a S(r,a)$, where r is the root
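The recursion for a single site can be sketched as follows. The nested-tuple tree encoding, the function names, and the transition/transversion cost of 1 and 2 are assumptions made for the example, not values from the slides.

```python
import math


def weighted_parsimony(tree, cost, alphabet="ACGT"):
    """Weighted parsimony cost of one site by dynamic programming on the tree.

    tree: a leaf is a single character; an internal node is a pair of subtrees.
    cost: function c(a, b) giving the cost of a change between a and b.
    """
    def score(node):
        if isinstance(node, str):
            # Initialization: S(i, a) = 0 if leaf i is labeled a, else infinity.
            return {a: 0 if a == node else math.inf for a in alphabet}
        left, right = (score(child) for child in node)
        # Iteration: add the best continuation into each child.
        return {
            a: min(left[b] + cost(a, b) for b in alphabet)
            + min(right[b] + cost(a, b) for b in alphabet)
            for a in alphabet
        }

    # Termination: minimize over the label at the root.
    return min(score(tree).values())


def unit_cost(a, b):
    """Traditional parsimony: every change costs 1."""
    return 0 if a == b else 1


def ts_tv_cost(a, b):
    """Hypothetical weights: transitions cost 1, transversions cost 2."""
    if a == b:
        return 0
    purines = {"A", "G"}
    return 1 if (a in purines) == (b in purines) else 2


# One site with leaf labels A, A, A, T, A on a hypothetical topology.
site = ((("A", "A"), "A"), ("T", "A"))
print(weighted_parsimony(site, unit_cost))   # 1
print(weighted_parsimony(site, ts_tv_cost))  # 2
```

With unit costs the function reproduces the traditional parsimony score; with the weighted costs the single A to T change is charged as a transversion.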

51
Example
A: CAGGTA
B: CAGACA
C: CGGGTA
D: TGCACT
E: TGCGTA
52
Cost of Evaluating Parsimony
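  • Score is evaluated on each position independently.
    Scores are then summed over all positions.
  • If there are n nodes, m characters, and k
    possible values for each character, then
    complexity is O(nmk) for traditional parsimony
    (the weighted recursion takes $O(nmk^2)$, since
    each node minimizes over k values for each of k
    labels)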
  • By keeping traceback information, we can
    reconstruct most parsimonious values at each
    ancestor node

53
Weighted Parsimony
54
Traditional Parsimony is not complete
55
Parsimony: Searching over all trees by Branch and
Bound
56
Assessing the trees: the bootstrap
58
Simultaneous alignment and phylogeny(1)
59
Inferring trees: Maximum Likelihood method
  • Maximum likelihood supposes a model of evolution
    along tree branches.
  • Strategy
    • Find parameters (tree, branch lengths,
      substitution rate) that maximize the likelihood
      assigned to the data.
  • Note: the model of evolution does not include indels!
  • In the Phylip package: program PROTML

60
Probabilistic Methods
  • The phylogenetic tree represents a generative
    probabilistic model (like HMMs) for the observed
    sequences.
  • Background probabilities: q(a)
  • Mutation probabilities: P(a|b, t)
  • Models for evolutionary mutations
    • Jukes Cantor
    • Kimura 2-parameter model
  • Such models are used to derive the probabilities

61
Jukes Cantor model
  • A model for mutation rates
  • Mutation occurs at a constant rate
  • Each nucleotide is equally likely to mutate into
    any other nucleotide with rate a.

62
Kimura 2-parameter model
  • Allows a different rate for transitions and
    transversions.

63
Mutation Probabilities
  • The rate matrix R is used to derive the mutation
    probability matrix S
  • S is obtained by integration. For Jukes Cantor
  • q can be obtained by setting t to infinity
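For reference, the standard closed form for the Jukes Cantor model is written out below. It assumes $\alpha$ denotes the common substitution rate (written "a" on the Jukes Cantor slide) and uses $r_t$ and $s_t$ as names for the two distinct entries of S(t); these symbols are assumptions, since the slide's own equations are not in the transcript.

```latex
% Rows and columns ordered A, C, G, T.
R = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix},
\qquad
S(t) = e^{Rt} = \begin{pmatrix}
r_t & s_t & s_t & s_t \\
s_t & r_t & s_t & s_t \\
s_t & s_t & r_t & s_t \\
s_t & s_t & s_t & r_t
\end{pmatrix}

r_t = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right),
\qquad
s_t = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right)

\lim_{t \to \infty} r_t = \lim_{t \to \infty} s_t = \tfrac{1}{4}
\quad\Longrightarrow\quad
q(a) = \tfrac{1}{4} \text{ for every nucleotide } a
```

Each row of S(t) sums to $r_t + 3 s_t = 1$, and at $t = 0$ the matrix is the identity, as expected for a probability matrix obtained by integrating the rates.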

64
Mutation Probabilities
  • Both models satisfy the following properties
    • Lack of memory
    • Reversibility
    • Stationary probabilities $P_a$ exist s.t.

65
Probabilistic Approach
  • Given P,q, the tree topology and branch lengths,
    we can compute

66
Computing the Tree Likelihood
  • We are interested in the probability of observed
    data given tree and branch lengths
  • Computed by summing over internal nodes
  • This can be done efficiently using a tree upward
    traversal pass.

67
Tree Likelihood Computation
  • Define $P(L_k \mid a)$ = prob. of leaves below node k
    given that $x_k = a$
  • Init: for leaves, $P(L_k \mid a) = 1$ if $x_k = a$, 0 otherwise
  • Iteration: if k is a node with children i and j,
    then
  • Termination: Likelihood is
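The upward pass can be sketched for a single site. The iteration and termination formulas are not in the transcript, so this sketch assumes the standard pruning form: $P(L_k \mid a)$ is the product over the two children of $\sum_b P(b \mid a, t) \, P(L_{child} \mid b)$, and the likelihood is $\sum_a q(a) \, P(L_r \mid a)$ at the root r. It also assumes Jukes Cantor probabilities with rate 1 and a uniform q; the tree encoding and function names are hypothetical.

```python
import math


def jc_prob(a, b, t, alpha=1.0):
    """Jukes Cantor probability of observing b after time t, starting from a."""
    e = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * e) if a == b else 0.25 * (1 - e)


def leaf_likelihoods(node, alphabet="ACGT"):
    """Return {a: P(L_k | a)} for the subtree rooted at node.

    A leaf is a single character; an internal node is a tuple
    (left, t_left, right, t_right) with the branch length to each child.
    """
    if isinstance(node, str):
        # Init: probability 1 for the observed letter, 0 otherwise.
        return {a: 1.0 if a == node else 0.0 for a in alphabet}
    left, t_left, right, t_right = node
    p_left = leaf_likelihoods(left, alphabet)
    p_right = leaf_likelihoods(right, alphabet)
    # Iteration: sum over the letter at each child, then multiply.
    return {
        a: sum(jc_prob(a, b, t_left) * p_left[b] for b in alphabet)
        * sum(jc_prob(a, c, t_right) * p_right[c] for c in alphabet)
        for a in alphabet
    }


def site_likelihood(tree, q=0.25):
    """Termination: sum over the root letter, weighted by the background q."""
    return sum(q * p for p in leaf_likelihoods(tree).values())


# Hypothetical three-leaf tree with branch lengths chosen for illustration.
tree = (("A", 0.1, "A", 0.1), 0.2, "G", 0.3)
print(site_likelihood(tree))
```

Each node is visited once and does work proportional to the square of the alphabet size, so the pass is linear in the number of nodes.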

68
Maximum Likelihood (ML)
  • Score each tree by
  • Assumption of independent positions
  • Branch lengths t can be optimized
    • Gradient ascent
    • EM
  • We look for the highest scoring tree
    • Exhaustive
    • Sampling methods (Metropolis)

69
Optimal Tree Search
  • Perform search over possible topologies

[Figure: search over topologies; labels: parameter space, parametric optimization (EM), local maxima]
70
Computational Problem
  • Such procedures are computationally expensive!
  • Computation of optimal parameters, per candidate,
    requires non-trivial optimization step.
  • Spend non-negligible computation on a candidate,
    even if it is a low scoring one.
  • In practice, such learning procedures can only
    consider small sets of candidate structures

71
Max Likelihood versus Parsimony
[Figure: tree T on leaves 1-4 with branch lengths 0.3, 0.1, 0.09, 0.1, 0.3, and the three candidate topologies T1, T2, T3]
  • (Example from BSA p. 225)
  • Choose tree T, with unequal branch lengths.
  • Generate 1000 sequences of length N according to
    probabilistic model
  • (A) Reconstruction by ML (B)
    Reconstruction by Parsimony

(A) Reconstruction by ML
N      T1    T2    T3
20     419   339   242
100    638   204   158
500    904   61    35
2000   997   3     0

(B) Reconstruction by Parsimony
N      T1    T2    T3
20     396   378   224
100    405   515   79
500    404   594   2
2000   353   646   0

Conclusion: ML infers the right tree as N gets
larger; Parsimony does not necessarily.
72
Max Likelihood versus NJ
  • (Example from BSA p. 225)
  • Choose tree T, with unequal branch lengths.
  • Generate 1000 sequences of length N according to
    probabilistic model
  • (A) Reconstruction by ML (B)
    Reconstruction by NJ

N      T1    T2    T3
20     419   339   242
100    638   204   158
500    904   61    35
2000   997   3     0

Conclusion: ML infers the right tree as N gets
larger. If the probabilistic model is correct,
the ML distances should be very close to additive,
and therefore the NJ method predicts the correct
tree.
73
Phylip - practicalities
  • Menu-driven, no command line
  • Input file format
    • First line: <number of sequences> <number of
      letters per sequence>
    • Next lines: sequences
    • The first ten characters are the sequence name
    • Then the sequence follows. Spaces and newlines are
      allowed.
    • Dashes (-) signify gaps
  • Example

4 46
hba1      MV-LSPADKTNVKAAWGKVG AHAGEYGAEALERM FLSFPTTKTYFP
beta      MVHLTPEEKSAVTALWGKVN VDEVGG EALGRLLVVYPWTQRFFESF
Myoglobin MGLSDGEWQLVLNVWGKV E ADIPGHGQEVLIRLFKGHPETLEKFD
Leghemogl MGAFSEKQESLVKSSWEAFK QNVPHHSAVFYTLILEKAPAAQNMFS
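The input format described on this slide can be read with a short parser. This is a sketch under stated assumptions: it handles only the sequential layout (each record starts with a ten-character name field and may continue on following lines), the function name `parse_phylip` is hypothetical, and the three-sequence input is made up for the example. It is not code from the Phylip package.

```python
def parse_phylip(text):
    """Parse sequential PHYLIP-style text into {name: sequence}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_seqs, n_chars = map(int, lines[0].split())
    records = {}
    name = None
    for line in lines[1:]:
        if name is None or len(records[name]) >= n_chars:
            # New record: the first ten characters are the sequence name.
            name, rest = line[:10].strip(), line[10:]
            records[name] = ""
        else:
            # Continuation of the current sequence on a new line.
            rest = line
        # Spaces inside the sequence are allowed, so drop them.
        records[name] += "".join(rest.split())
    if len(records) != n_seqs or any(len(s) != n_chars for s in records.values()):
        raise ValueError("header does not match sequence data")
    return records


# Hypothetical input: 3 sequences of 12 letters, dashes marking gaps.
example = """3 12
seq_one   ACGT-ACGTACG
seq_two   ACGTTACG
TACG
seq_three ACG-TACG TACG
"""
records = parse_phylip(example)
for name, seq in records.items():
    print(name, seq)
```

The length check against the header catches the most common formatting mistakes, such as a name field that is not padded to ten characters.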
74
The End