Phylogeny Tree Reconstruction

Learn more at: http://ai.stanford.edu
1
Phylogeny Tree Reconstruction
2
Inferring Phylogenies
  • Trees can be inferred by several criteria:
  • Morphology of the organisms
  • Sequence comparison
  • Example:
  • Orc:    ACAGTGACGCCCCAAACGT
  • Elf:    ACAGTGACGCTACAAACGT
  • Dwarf:  CCTGTGACGTAACAAACGA
  • Hobbit: CCTGTGACGTAGCAAACGA
  • Human:  CCTGTGACGTAGCAAACGA

3
Modeling Evolution
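  • During infinitesimal time $\Delta t$, there is not enough
    time for two substitutions to happen on the same
    nucleotide
  • So we can estimate $P(x \mid y, \Delta t)$, for $x, y \in \{A, C, G, T\}$
  • Then let
    $$S(\Delta t) = \begin{pmatrix} P(A \mid A, \Delta t) & \cdots & P(A \mid T, \Delta t) \\ \vdots & \ddots & \vdots \\ P(T \mid A, \Delta t) & \cdots & P(T \mid T, \Delta t) \end{pmatrix}$$

[Figure: a nucleotide x evolving over time $\Delta t$ into x or y]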
4
Modeling Evolution
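  • Reasonable assumption: multiplicative
  • (implying a stationary Markov process)
  • $S(t + t') = S(t)\,S(t')$
  • That is, $P(x \mid y, t + t') = \sum_z P(x \mid z, t)\, P(z \mid y, t')$
  • Jukes-Cantor: constant rate of evolution
  • For short time $\epsilon$,
    $$S(\epsilon) = I + R\epsilon = \begin{pmatrix} 1 - 3\alpha\epsilon & \alpha\epsilon & \alpha\epsilon & \alpha\epsilon \\ \alpha\epsilon & 1 - 3\alpha\epsilon & \alpha\epsilon & \alpha\epsilon \\ \alpha\epsilon & \alpha\epsilon & 1 - 3\alpha\epsilon & \alpha\epsilon \\ \alpha\epsilon & \alpha\epsilon & \alpha\epsilon & 1 - 3\alpha\epsilon \end{pmatrix}$$
  • Here $R$ is the rate matrix, with $-3\alpha$ on the diagonal and $\alpha$ elsewhere

[Figure: the four nucleotides A, C, G, T]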
5
Modeling Evolution
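  • Jukes-Cantor
  • For longer times,
    $$S(t) = \begin{pmatrix} r(t) & s(t) & s(t) & s(t) \\ s(t) & r(t) & s(t) & s(t) \\ s(t) & s(t) & r(t) & s(t) \\ s(t) & s(t) & s(t) & r(t) \end{pmatrix}$$
  • Where we can derive
  • $r(t) = \tfrac{1}{4}(1 + 3e^{-4\alpha t})$
  • $s(t) = \tfrac{1}{4}(1 - e^{-4\alpha t})$

$S(t + \epsilon) = S(t)\,S(\epsilon) = S(t)(I + R\epsilon)$.
Therefore, $(S(t + \epsilon) - S(t))/\epsilon = S(t)\,R$.
At the limit of $\epsilon \to 0$, $S'(t) = S(t)\,R$.
Equivalently, $r' = -3\alpha r + 3\alpha s$ and $s' = -\alpha s + \alpha r$.
Those differential equations lead to $r(t) = \tfrac{1}{4}(1 + 3e^{-4\alpha t})$ and $s(t) = \tfrac{1}{4}(1 - e^{-4\alpha t})$.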
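The closed forms for r(t) and s(t) can be checked numerically. The following is a minimal sketch, not part of the original slides: it builds the Jukes-Cantor matrix S(t), then tests the multiplicative property S(t + t') = S(t)S(t') and the short-time approximation S(ε) ≈ I + Rε. The rate `ALPHA = 0.1` and the helper names are assumptions chosen for illustration.

```python
import math

def jc_matrix(t, alpha):
    """Jukes-Cantor S(t): r(t) on the diagonal, s(t) everywhere else."""
    decay = math.exp(-4.0 * alpha * t)
    r = 0.25 * (1.0 + 3.0 * decay)
    s = 0.25 * (1.0 - decay)
    return [[r if i == j else s for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Product of two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def max_abs_diff(a, b):
    """Largest entrywise difference between two 4x4 matrices."""
    return max(abs(a[i][j] - b[i][j]) for i in range(4) for j in range(4))

ALPHA = 0.1  # assumed substitution rate, chosen only for illustration

# Multiplicativity: S(t + t') should equal S(t) S(t').
gap = max_abs_diff(jc_matrix(2.0, ALPHA),
                   matmul(jc_matrix(0.7, ALPHA), jc_matrix(1.3, ALPHA)))
print("max |S(t+t') - S(t)S(t')| =", gap)

# Short-time behaviour: S(eps) should be close to I + R*eps.
eps = 1e-4
approx = [[1.0 - 3.0 * ALPHA * eps if i == j else ALPHA * eps
           for j in range(4)] for i in range(4)]
short_gap = max_abs_diff(jc_matrix(eps, ALPHA), approx)
print("max |S(eps) - (I + R eps)| =", short_gap)
```

Both printed gaps should be tiny: the first is limited only by floating-point rounding, the second by the second-order term in ε.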
6
Modeling Evolution
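  • Kimura
  • Transitions: A/G, C/T
  • Transversions: A/T, A/C, G/T, C/G
  • Transitions (rate $\alpha$) are much more likely than
    transversions (rate $\beta$)
    $$S(t) = \begin{pmatrix} r(t) & s(t) & u(t) & s(t) \\ s(t) & r(t) & s(t) & u(t) \\ u(t) & s(t) & r(t) & s(t) \\ s(t) & u(t) & s(t) & r(t) \end{pmatrix}$$
    (rows and columns in the order A, C, G, T)
  • Where $s(t) = \tfrac{1}{4}(1 - e^{-4\beta t})$
  • $u(t) = \tfrac{1}{4}(1 + e^{-4\beta t} - 2e^{-2(\alpha + \beta)t})$
  • $r(t) = 1 - 2s(t) - u(t)$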
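A small sketch of the Kimura matrix, using the standard closed forms s(t) = ¼(1 − e^{−4βt}), u(t) = ¼(1 + e^{−4βt} − 2e^{−2(α+β)t}) and r(t) = 1 − 2s(t) − u(t). The factor of 2 on the last exponential is what makes u(0) = 0. The function name and the example rates are assumptions for illustration only.

```python
import math

def kimura_matrix(t, alpha, beta):
    """Kimura S(t), rows and columns ordered A, C, G, T.

    alpha is the transition rate (A/G, C/T); beta is the transversion rate.
    """
    s = 0.25 * (1.0 - math.exp(-4.0 * beta * t))
    u = 0.25 * (1.0 + math.exp(-4.0 * beta * t)
                - 2.0 * math.exp(-2.0 * (alpha + beta) * t))
    r = 1.0 - 2.0 * s - u
    return [[r, s, u, s],
            [s, r, s, u],
            [u, s, r, s],
            [s, u, s, r]]

# Assumed example rates: transitions ten times faster than transversions.
m = kimura_matrix(0.1, alpha=1.0, beta=0.1)
print("P(A -> G) =", m[0][2])  # transition
print("P(A -> C) =", m[0][1])  # transversion
```

With α = β the transition and transversion entries coincide and the matrix reduces to the Jukes-Cantor form.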

7
Phylogeny and sequence comparison
  • Basic principles:
  • Degree of sequence difference is proportional to
    length of independent sequence evolution
  • Only use positions where alignment is pretty
    certain; avoid areas with (too many) gaps

8
Distance between two sequences
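  • Given sequences $x^i$, $x^j$,
  • Define
  • $d_{ij}$ = distance between the two sequences
  • One possible definition:
  • $d_{ij}$ = fraction $f$ of sites $u$ where $x^i_u \neq x^j_u$
  • Better model (Jukes-Cantor):
  • $f = \tfrac{3}{4}(1 - e^{-4\alpha t}) \Rightarrow$
  • $\tfrac{3}{4} e^{-4\alpha t} = \tfrac{3}{4} - f \Rightarrow \log(e^{-4\alpha t}) = \log(1 - \tfrac{4}{3} f)$
  • $d_{ij} = t = -\tfrac{1}{4}\,\alpha^{-1} \log(1 - \tfrac{4}{3} f)$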
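As an illustration of both definitions, the sketch below computes the mismatch fraction f and the Jukes-Cantor corrected distance d = −(1/(4α)) log(1 − 4f/3) for the Orc and Elf sequences from the earlier example slide. The function names and the default rate `alpha=1.0` are assumptions made for this sketch.

```python
import math

def fraction_differing(x, y):
    """Fraction f of aligned sites at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

def jc_distance(f, alpha=1.0):
    """Jukes-Cantor distance d = -(1 / (4 alpha)) * log(1 - 4 f / 3).

    The correction diverges as f approaches 3/4, so infinity is returned there.
    """
    if f >= 0.75:
        return math.inf
    return -math.log(1.0 - 4.0 * f / 3.0) / (4.0 * alpha)

ORC = "ACAGTGACGCCCCAAACGT"
ELF = "ACAGTGACGCTACAAACGT"

f = fraction_differing(ORC, ELF)
print("f =", f, " corrected distance =", jc_distance(f))
```

The two sequences differ at 2 of 19 sites. For fixed α the corrected distance grows faster than f, because it accounts for repeated substitutions at the same site.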

9
A simple clustering method for building tree
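  • UPGMA (unweighted pair group method using
    arithmetic averages)
  • Or the Average Linkage Method
  • Given two disjoint clusters $C_i$, $C_j$ of sequences,
    $$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$
  • Claim that if $C_k = C_i \cup C_j$, then distance to
    another cluster $C_l$ is
    $$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$

Proof:
$$d_{kl} = \frac{\sum_{C_i, C_l} d_{pq} + \sum_{C_j, C_l} d_{pq}}{(|C_i| + |C_j|)\,|C_l|}
= \frac{\frac{|C_i|}{|C_i|\,|C_l|} \sum_{C_i, C_l} d_{pq} + \frac{|C_j|}{|C_j|\,|C_l|} \sum_{C_j, C_l} d_{pq}}{|C_i| + |C_j|}
= \frac{|C_i|\, d_{il} + |C_j|\, d_{jl}}{|C_i| + |C_j|}$$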
10
Algorithm Average Linkage
  • Initialization:
  • Assign each $x^i$ into its own cluster $C_i$
  • Define one leaf per sequence, height 0
  • Iteration:
  • Find two clusters $C_i$, $C_j$ s.t. $d_{ij}$ is min
  • Let $C_k = C_i \cup C_j$
  • Define node connecting $C_i$, $C_j$,
  • place it at height $d_{ij}/2$
  • Delete $C_i$, $C_j$
  • Termination:
  • When two clusters $i$, $j$ remain,
  • place root at height $d_{ij}/2$

[Figure: average linkage tree built over leaves 1 to 5]
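The average linkage procedure can be sketched in a few lines. This is an illustrative implementation, not the presenter's code: trees are nested tuples, distances live in a dict keyed by `frozenset` pairs, and the distance from a merged cluster to any other cluster is updated with the size-weighted average from the UPGMA claim. The four-leaf distance table is a made-up ultrametric example.

```python
from itertools import combinations

def upgma(leaves, dist):
    """Average linkage clustering.

    leaves: list of leaf names; dist: {frozenset({a, b}): distance}.
    Returns (root, height), where root is a nested tuple of leaf names and
    height maps every node (leaf or tuple) to its height in the tree.
    """
    d = dict(dist)
    size = {leaf: 1 for leaf in leaves}
    height = {leaf: 0.0 for leaf in leaves}
    while len(size) > 1:
        # Find the two closest clusters.
        ci, cj = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        ck = (ci, cj)
        height[ck] = d[frozenset((ci, cj))] / 2.0
        # Size-weighted average distance from the new cluster to the others.
        for cl in size:
            if cl not in (ci, cj):
                d[frozenset((ck, cl))] = (
                    d[frozenset((ci, cl))] * size[ci]
                    + d[frozenset((cj, cl))] * size[cj]
                ) / (size[ci] + size[cj])
        size[ck] = size[ci] + size[cj]
        del size[ci], size[cj]
    (root,) = size
    return root, height

# Made-up ultrametric distances: (a, b) join at height 1, (c, d) at 2, root at 3.
PAIRS = {("a", "b"): 2.0, ("c", "d"): 4.0,
         ("a", "c"): 6.0, ("a", "d"): 6.0, ("b", "c"): 6.0, ("b", "d"): 6.0}
DIST = {frozenset(pair): value for pair, value in PAIRS.items()}

root, height = upgma(["a", "b", "c", "d"], DIST)
print(root, height[root])
```

Because the example distances are ultrametric, the recovered heights match the ones used to construct the table.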
11
Example
[Figure: average linkage example on leaves v, w, x, y, z, with internal nodes numbered 1 to 4]
12
Ultrametric Distances and Molecular Clock
  • Definition:
  • A distance function $d(\cdot,\cdot)$ is ultrametric if for
    any three distances $d_{ij} \le d_{ik} \le d_{jk}$, it is true
    that
  • $d_{ij} \le d_{ik} = d_{jk}$
  • The Molecular Clock:
  • The evolutionary distance between species x and y
    is 2 × the Earth time to reach the nearest common
    ancestor
  • That is, the molecular clock has constant rate in
    all species

The molecular clock results in ultrametric
distances
[Figure: tree over leaves 1 to 5 drawn against a time axis in years]
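The ultrametric definition translates directly into a check over all triples of leaves: after sorting the three pairwise distances, the two largest must be equal. The sketch below assumes distances are stored in a dict keyed by `frozenset` pairs; the function name, tolerance, and example table are all illustrative.

```python
from itertools import combinations

def is_ultrametric(leaves, d, tol=1e-9):
    """Return True if, in every triple, the two largest distances are equal."""
    for i, j, k in combinations(leaves, 3):
        smallest, middle, largest = sorted(
            (d[frozenset((i, j))], d[frozenset((i, k))], d[frozenset((j, k))])
        )
        if abs(largest - middle) > tol:
            return False
    return True

# Made-up distances consistent with a molecular clock.
PAIRS = {("a", "b"): 2.0, ("c", "d"): 4.0,
         ("a", "c"): 6.0, ("a", "d"): 6.0, ("b", "c"): 6.0, ("b", "d"): 6.0}
CLOCK = {frozenset(pair): value for pair, value in PAIRS.items()}

# Same table, but lineage "a" is made to look closer to "c" than "b" is.
SKEWED = dict(CLOCK)
SKEWED[frozenset(("a", "c"))] = 5.0

print(is_ultrametric("abcd", CLOCK), is_ultrametric("abcd", SKEWED))
```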
13
Ultrametric Distances & Average Linkage
[Figure: ultrametric tree over leaves 1 to 5]
  • Average Linkage is guaranteed to reconstruct
    correctly a binary tree with ultrametric
    distances
  • Proof: Exercise (extra credit)

14
Weakness of Average Linkage
  • Molecular clock: all species evolve at the same
    rate (Earth time)
  • However, certain species (e.g., mouse, rat)
    evolve much faster
  • Example where UPGMA messes up:

[Figure: the average linkage (AL) tree next to the correct tree, both over leaves 1 to 4]
15
Additive Distances
[Figure: tree with nodes numbered 1 to 13]
  • Given a tree, a distance measure is additive if
    the distance between any pair of leaves is the
    sum of lengths of edges connecting them
  • Given a tree T & additive distances $d_{ij}$, can
    uniquely reconstruct edge lengths:
  • Find two neighboring leaves $i$, $j$, with common
    parent $k$
  • Place parent node $k$ at distance $d_{km} = \tfrac{1}{2}(d_{im} + d_{jm} - d_{ij})$
    from any node $m$

16
Additive Distances
[Figure: unrooted tree with four leaves x, y, z, w]
  • For any four leaves x, y, z, w, consider the
    three sums
  • $d(x, y) + d(z, w)$
  • $d(x, z) + d(y, w)$
  • $d(x, w) + d(y, z)$
  • One of them is smaller than the other two, which
    are equal
  • $d(x, y) + d(z, w) < d(x, z) + d(y, w) = d(x, w) + d(y, z)$
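The four-point condition can be tested mechanically, and the smallest of the three sums also reveals which leaves are neighbors. The sketch below is an illustration under assumed inputs: the quartet has x and y on one side of an internal edge of length 5 and z and w on the other, with leaf edge lengths 1, 2, 3, 4. These values are made up and do not come from the slides.

```python
def neighbor_pairing(d, x, y, z, w, tol=1e-9):
    """Check the four-point condition on leaves x, y, z, w.

    d maps frozenset({a, b}) to a distance. Returns the pairing with the
    smallest sum when the two larger sums agree within tol, else None.
    """
    sums = [
        (d[frozenset((x, y))] + d[frozenset((z, w))], ((x, y), (z, w))),
        (d[frozenset((x, z))] + d[frozenset((y, w))], ((x, z), (y, w))),
        (d[frozenset((x, w))] + d[frozenset((y, z))], ((x, w), (y, z))),
    ]
    sums.sort(key=lambda item: item[0])
    if abs(sums[2][0] - sums[1][0]) > tol:
        return None
    return sums[0][1]

# Leaf distances read off the assumed quartet tree described above.
PAIRS = {("x", "y"): 3.0, ("z", "w"): 7.0,
         ("x", "z"): 9.0, ("y", "w"): 11.0,
         ("x", "w"): 10.0, ("y", "z"): 10.0}
ADDITIVE = {frozenset(pair): value for pair, value in PAIRS.items()}

print(neighbor_pairing(ADDITIVE, "x", "y", "z", "w"))
```

Here the three sums are 10, 20 and 20, so the condition holds and the pairing (x, y) with (z, w) is reported.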

17
Reconstructing Additive Distances Given T
[Figure: tree T over leaves x, y, z, w, v with edge lengths 5, 4, 3, 3, 4, 7, 6, next to its distance matrix D]
If we know T and D, but do not know the length of
each edge, we can reconstruct those lengths
18
Reconstructing Additive Distances Given T
[Figure: tree T over leaves x, y, z, w, v, next to distance matrix D]
19
Reconstructing Additive Distances Given T
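[Figure: tree T with neighboring leaves v and w joined at a new node a; distance matrix D is reduced to D1]
$d_{ax} = \tfrac{1}{2}(d_{vx} + d_{wx} - d_{vw})$
$d_{ay} = \tfrac{1}{2}(d_{vy} + d_{wy} - d_{vw})$
$d_{az} = \tfrac{1}{2}(d_{vz} + d_{wz} - d_{vw})$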
20
Reconstructing Additive Distances Given T
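[Figure: tree T with internal nodes a, b, c and edge lengths 5, 4, 3, 3, 4, 7, 6; distance matrices D1, D2, D3]
$d(a, c) = 3$
$d(b, c) = d(a, b) - d(a, c) = 3$
$d(c, z) = d(a, z) - d(a, c) = 7$
$d(b, x) = d(a, x) - d(a, b) = 5$
$d(b, y) = d(a, y) - d(a, b) = 4$
$d(a, w) = d(z, w) - d(a, z) = 4$
$d(a, v) = d(z, v) - d(a, z) = 6$
Correct!!!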
21
Neighbor-Joining
  • Guaranteed to produce the correct tree if
    distance is additive
  • May produce a good tree even when distance is not
    additive
  • Step 1: Finding neighboring leaves
  • Define
  • $D_{ij} = d_{ij} - (r_i + r_j)$
  • Where
    $$r_i = \frac{1}{|L| - 2} \sum_{k \in L} d_{ik}$$
  • Claim: The above "magic trick" ensures that if $D_{ij}$
    is minimal, then $i$, $j$ are neighbors
  • Proof: Very technical, please read Durbin et al.!

[Figure: four-leaf tree over leaves 1, 2, 3, 4 with edge lengths 0.1, 0.1, 0.1, 0.4, 0.4]
22
Algorithm Neighbor-joining
  • Initialization:
  • Define T to be the set of leaf nodes, one per
    sequence
  • Let L = T
  • Iteration:
  • Pick $i$, $j$ s.t. $D_{ij}$ is minimal
  • Define a new node $k$, and set $d_{km} = \tfrac{1}{2}(d_{im} + d_{jm} - d_{ij})$
    for all $m \in L$
  • Add $k$ to T, with edges of lengths $d_{ik} = \tfrac{1}{2}(d_{ij} + r_i - r_j)$
    and $d_{jk} = d_{ij} - d_{ik}$
  • Remove $i$, $j$ from L
  • Add $k$ to L
  • Termination:
  • When L consists of two nodes, $i$, $j$, add the edge
    between them with length $d_{ij}$
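A compact sketch of neighbor-joining follows. It is an illustration rather than a reference implementation: distances are a dict keyed by `frozenset` pairs, new internal nodes get hypothetical names `node0`, `node1`, ..., and the second edge length is taken as d_jk = d_ij − d_ik. The example quartet (edge lengths 1 and 4 on each side of an internal edge of length 1) is made up so that the closest pair of leaves, a and c, are not neighbors.

```python
from itertools import combinations

def neighbor_joining(leaves, dist):
    """Return the tree as a list of edges (node, node, length).

    leaves: list of leaf names; dist: {frozenset({a, b}): distance}.
    """
    d = dict(dist)
    active = list(leaves)
    edges = []
    next_id = 0
    while len(active) > 2:
        n = len(active)
        # r_i = (1 / (|L| - 2)) * sum_k d_ik
        r = {i: sum(d[frozenset((i, k))] for k in active if k != i) / (n - 2)
             for i in active}
        # Pick the pair minimizing D_ij = d_ij - (r_i + r_j).
        i, j = min(combinations(active, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"  # hypothetical name for the new internal node
        next_id += 1
        dij = d[frozenset((i, j))]
        dik = 0.5 * (dij + r[i] - r[j])
        edges.append((i, k, dik))
        edges.append((j, k, dij - dik))
        # Distances from the new node to every remaining node.
        for m in active:
            if m not in (i, j):
                d[frozenset((k, m))] = 0.5 * (
                    d[frozenset((i, m))] + d[frozenset((j, m))] - dij)
        active = [m for m in active if m not in (i, j)] + [k]
    i, j = active
    edges.append((i, j, d[frozenset((i, j))]))
    return edges

# Assumed quartet: a (1) and b (4) on one side, c (1) and d (4) on the other,
# internal edge of length 1. Note that d(a, c) = 3 is the smallest distance.
PAIRS = {("a", "b"): 5.0, ("a", "c"): 3.0, ("a", "d"): 6.0,
         ("b", "c"): 6.0, ("b", "d"): 9.0, ("c", "d"): 5.0}
DIST = {frozenset(pair): value for pair, value in PAIRS.items()}

edges = neighbor_joining(["a", "b", "c", "d"], DIST)
for edge in edges:
    print(edge)
```

On this input a method that joins the closest pair first would group a with c, while the D_ij criterion joins a with b and recovers the edge lengths used to build the table.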

23
Parsimony
  • One of the most popular methods
  • Idea:
  • Find the tree that explains the observed
    sequences with a minimal number of substitutions
  • Two computational subproblems:
  • Find the parsimony cost of a given tree (easy)
  • Search through all tree topologies (hard)

24
Parsimony Scoring
  • Given a tree, and an alignment column $u$
  • Label internal nodes to minimize the number of
    required substitutions
  • Initialization:
  • Set cost $C = 0$; $k = 2N - 1$
  • Iteration:
  • If $k$ is a leaf, set $R_k = \{x^k_u\}$
  • If $k$ is not a leaf,
  • Let $i$, $j$ be the daughter nodes
  • Set $R_k = R_i \cap R_j$ if intersection is nonempty
  • Set $R_k = R_i \cup R_j$, and $C = C + 1$, if intersection
    is empty
  • Termination:
  • Minimal cost of tree for column $u$ = $C$
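The scoring pass can be written as a short recursion. In this sketch the tree is a nested tuple of leaf names and the alignment column is a dict from leaf name to nucleotide. Both representations, along with the five-leaf example, are assumptions made for illustration.

```python
def parsimony_cost(tree, column):
    """Return (R_root, C) for one alignment column.

    tree: a leaf name (str) or a pair (left_subtree, right_subtree).
    column: {leaf name: nucleotide at this column}.
    """
    cost = 0

    def visit(node):
        nonlocal cost
        if isinstance(node, str):           # leaf: R_k = {x_u^k}
            return {column[node]}
        left, right = node
        r_i, r_j = visit(left), visit(right)
        if r_i & r_j:                       # nonempty intersection
            return r_i & r_j
        cost += 1                           # empty intersection: one substitution
        return r_i | r_j

    root_set = visit(tree)
    return root_set, cost

# Made-up example: five sequences and one alignment column.
TREE = ((("s1", "s2"), "s3"), ("s4", "s5"))
COLUMN = {"s1": "A", "s2": "C", "s3": "A", "s4": "C", "s5": "C"}

root_set, cost = parsimony_cost(TREE, COLUMN)
print(root_set, cost)
```

For this column the cost is 2: one substitution is needed inside the (s1, s2) pair and one across the root.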

25
Example
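[Figure: parsimony scoring on a tree with leaves labeled A and B; two internal nodes receive the set {A, B}, each adding 1 to the cost C]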
26
Example
[Figure: further parsimony labelings of trees with leaves A and B, including internal nodes labeled {A, B}]
27
Traceback to find ancestral nucleotides
  • Traceback:
  • Choose an arbitrary nucleotide from $R_{2N-1}$ for
    the root
  • Having chosen nucleotide $r$ for parent $k$,
  • If $r \in R_i$ choose $r$ for daughter $i$
  • Else, choose arbitrary nucleotide from $R_i$
  • Easy to see that this traceback produces some
    assignment of cost $C$
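The traceback can be sketched on top of a bottom-up pass that stores the set R_k for every node. This is an illustrative version: trees are nested tuples of leaf names, "arbitrary" choices are made with `min` so the run is repeatable, and the example tree and column are made up.

```python
def fitch_sets(tree, column, sets):
    """Bottom-up pass: store R_k in sets[node] and return the cost C."""
    if isinstance(tree, str):
        sets[tree] = {column[tree]}
        return 0
    left, right = tree
    cost = fitch_sets(left, column, sets) + fitch_sets(right, column, sets)
    if sets[left] & sets[right]:
        sets[tree] = sets[left] & sets[right]
    else:
        sets[tree] = sets[left] | sets[right]
        cost += 1
    return cost

def traceback(tree, sets, labels, chosen=None):
    """Top-down pass: keep the parent's nucleotide whenever it is in R_i."""
    if chosen is None:
        chosen = min(sets[tree])            # arbitrary choice at the root
    labels[tree] = chosen
    if not isinstance(tree, str):
        for child in tree:
            r = chosen if chosen in sets[child] else min(sets[child])
            traceback(child, sets, labels, r)
    return labels

def count_substitutions(tree, labels):
    """Number of edges whose two endpoints carry different nucleotides."""
    if isinstance(tree, str):
        return 0
    return sum((labels[tree] != labels[child]) + count_substitutions(child, labels)
               for child in tree)

TREE = ((("s1", "s2"), "s3"), ("s4", "s5"))
COLUMN = {"s1": "A", "s2": "C", "s3": "A", "s4": "C", "s5": "C"}

sets = {}
cost = fitch_sets(TREE, COLUMN, sets)
labels = traceback(TREE, sets, {})
print(cost, count_substitutions(TREE, labels))
```

The labeling produced by the traceback uses exactly as many substitutions as the cost C computed in the bottom-up pass.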

28
Example
[Figure: two labelings of the same tree with leaves A and B. One is admissible with traceback; the other is still optimal, but inadmissible with traceback]
29
Number of labeled unrooted tree topologies
[Figure: unrooted tree over leaves 1, 2, 3, with three candidate positions for leaf 4]
  • How many possibilities are there for leaf 4?

30
Number of labeled unrooted tree topologies
[Figure: unrooted tree over leaves 1, 2, 3, 4]
  • How many possibilities are there for leaf 4?
  • For the 4th leaf, there are 3 possibilities

31
Number of labeled unrooted tree topologies
[Figure: unrooted tree over leaves 1, 2, 3, 4, 5]
  • How many possibilities are there for leaf 5?
  • For the 5th leaf, there are 5 possibilities

32
Number of labeled unrooted tree topologies
[Figure: unrooted tree over leaves 1, 2, 3, 4, 5]
  • How many possibilities are there for leaf 6?
  • For the 6th leaf, there are 7 possibilities

33
Number of labeled unrooted tree topologies
[Figure: unrooted tree over leaves 1, 2, 3, 4, 5]
  • How many possibilities are there for leaf n?
  • For the nth leaf, there are $2n - 5$ possibilities

34
Number of labeled unrooted tree topologies
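[Figure: unrooted tree over leaves 1, 2, 3, 4, 5]
N = 10: # unrooted 2,027,025; # rooted 34,459,425
N = 30: # unrooted $8.7 \times 10^{36}$; # rooted $4.95 \times 10^{38}$
  • # unrooted trees for n taxa: $(2n-5)(2n-7)\cdots 3 \cdot 1 = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
  • # rooted trees for n taxa: $(2n-3)(2n-5)(2n-7)\cdots 3 = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$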
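The counts on this slide follow from multiplying the number of attachment points for each added leaf. A short sketch (function names are assumptions) that reproduces the figures for 10 and 30 taxa:

```python
def num_unrooted(n):
    """Labeled unrooted binary trees on n >= 3 leaves: 3 * 5 * ... * (2n - 5)."""
    count = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5   # the k-th leaf can attach to any of 2k - 5 edges
    return count

def num_rooted(n):
    """Rooting adds one more factor: the root sits on any of 2n - 3 edges."""
    return num_unrooted(n) * (2 * n - 3)

for n in (10, 30):
    print(n, num_unrooted(n), num_rooted(n))
```

Python integers are arbitrary precision, so the 30-taxon values are computed exactly rather than as floating-point approximations.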

35
Search through tree topologies: Branch and Bound
  • Observation: adding an edge to an existing tree
    can only increase the parsimony cost
  • Enumerate all unrooted trees with at most n
    leaves:
  • $[i_3][i_5][i_7]\cdots[i_{2N-5}]$
  • where each $i_k$ can take values from 0 (no edge) to
    $k$
  • At each point keep $C$ = smallest cost so far for a
    complete tree
  • Start B&B with tree $[1][0][0][0]$
  • Whenever cost of current tree T is > $C$, then
  • T is not optimal
  • Any tree extending T with more edges is not
    optimal
  • Increment by 1 the rightmost nonzero counter

36
Bootstrapping to get the best trees
  • Main outline of algorithm:
  • (1) Select random columns from a multiple alignment;
    one column can then appear several times
  • (2) Build a phylogenetic tree based on the random
    sample from (1)
  • (3) Repeat (1), (2) many (say, 1000) times
  • (4) Output the tree that is constructed most
    frequently
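The resampling loop can be sketched independently of the tree-building method. In the sketch below, `build_tree` is a caller-supplied function that must return a hashable tree. The toy `closest_pair` builder is only a stand-in, so that the example runs without a real phylogeny method. All names, the seed, and the replicate count are assumptions.

```python
import random
from collections import Counter

def resample_columns(alignment, rng):
    """Step (1): draw as many column indices as there are columns, with replacement."""
    n_cols = len(alignment[0])
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(row[c] for c in picks) for row in alignment]

def bootstrap(alignment, build_tree, replicates=1000, seed=0):
    """Steps (2) to (4): build one tree per resample and tally the results."""
    rng = random.Random(seed)
    counts = Counter(build_tree(resample_columns(alignment, rng))
                     for _ in range(replicates))
    return counts.most_common(1)[0][0], counts

def closest_pair(alignment):
    """Toy stand-in for a tree builder: the pair of rows with fewest mismatches."""
    pairs = [(i, j) for i in range(len(alignment))
             for j in range(i + 1, len(alignment))]
    return min(pairs, key=lambda p: sum(
        a != b for a, b in zip(alignment[p[0]], alignment[p[1]])))

# Sequences from the introductory example (Orc, Elf, Dwarf, Hobbit, Human).
ALIGNMENT = ["ACAGTGACGCCCCAAACGT", "ACAGTGACGCTACAAACGT",
             "CCTGTGACGTAACAAACGA", "CCTGTGACGTAGCAAACGA",
             "CCTGTGACGTAGCAAACGA"]

best, counts = bootstrap(ALIGNMENT, closest_pair, replicates=200)
print(best, counts[best] / 200)
```

The printed fraction is the bootstrap support for the most frequent result. With a real tree builder the same tally would be taken over whole topologies.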

37
Probabilistic Methods
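[Figure: root $x_{root}$ with children $x_1$ and $x_2$ on branches of length $t_1$ and $t_2$]
  • A more refined measure of evolution along a tree
    than parsimony
  • $P(x_1, x_2, x_{root} \mid t_1, t_2) = P(x_{root})\, P(x_1 \mid t_1, x_{root})\, P(x_2 \mid t_2, x_{root})$
  • If we use Jukes-Cantor, for example, and $x_1 = x_{root} = A$, $x_2 = C$, $t_1 = t_2 = 1$,
  • $= p_A \cdot \tfrac{1}{4}(1 + 3e^{-4\alpha}) \cdot \tfrac{1}{4}(1 - e^{-4\alpha}) = (\tfrac{1}{4})^3 (1 + 3e^{-4\alpha})(1 - e^{-4\alpha})$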

38
Probabilistic Methods
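[Figure: tree with root $x_{root}$, internal nodes $x_u$, and leaves $x_1, x_2, \ldots, x_N$]
  • If we know all internal labels $x_u$,
  • $P(x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{2N-1} \mid T, t) = P(x_{root}) \prod_{j \neq root} P(x_j \mid x_{parent(j)}, t_{j, parent(j)})$
  • Usually we don't know the internal labels,
    therefore
  • $P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_{x_{N+1}} \sum_{x_{N+2}} \cdots \sum_{x_{2N-1}} P(x_1, x_2, \ldots, x_{2N-1} \mid T, t)$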

39
Felsenstein's Likelihood Algorithm
  • To calculate $P(x_1, x_2, \ldots, x_N \mid T, t)$
  • Initialization:
  • Set $k = 2N - 1$
  • Recursion: Compute $P(L_k \mid a)$ for all $a \in \Sigma$
  • If $k$ is a leaf node
  • Set $P(L_k \mid a) = 1(a = x_k)$
  • If $k$ is not a leaf node
  • 1. Compute $P(L_i \mid b)$, $P(L_j \mid b)$ for all $b$, for
    daughter nodes $i$, $j$
  • 2. Set $P(L_k \mid a) = \sum_{b, c} P(b \mid a, t_i)\, P(L_i \mid b)\, P(c \mid a, t_j)\, P(L_j \mid c)$
  • Termination:
  • Likelihood at this column $= P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_a P(L_{2N-1} \mid a)\, P(a)$

40
Probabilistic Methods
  • Given M (ungapped) alignment columns of N
    sequences,
  • Define likelihood of a tree:
  • $L(T, t) = P(\text{Data} \mid T, t) = \prod_{m=1}^{M} P(x_{1m}, \ldots, x_{Nm} \mid T, t)$
  • Maximum Likelihood Reconstruction:
  • Given data $X = (x_{ij})$, find a topology T and
    length vector t that maximize likelihood $L(T, t)$