Phylogeny Tree Reconstruction

Inferring Phylogenies

- Trees can be inferred by several criteria:
  - Morphology of the organisms
  - Sequence comparison
- Example:
  - Orc: ACAGTGACGCCCCAAACGT
  - Elf: ACAGTGACGCTACAAACGT
  - Dwarf: CCTGTGACGTAACAAACGA
  - Hobbit: CCTGTGACGTAGCAAACGA
  - Human: CCTGTGACGTAGCAAACGA

Modeling Evolution

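- During an infinitesimal time $\Delta t$, there is not enough time for two substitutions to happen on the same nucleotide
- So we can estimate $P(x \mid y, \Delta t)$, for $x, y \in \{A, C, G, T\}$
- Then let

$$S(\Delta t) = \begin{pmatrix} P(A \mid A, \Delta t) & \cdots & P(A \mid T, \Delta t) \\ \vdots & \ddots & \vdots \\ P(T \mid A, \Delta t) & \cdots & P(T \mid T, \Delta t) \end{pmatrix}$$

(Figure: nucleotide x changing to y over time $\Delta t$.)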

Modeling Evolution

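- Reasonable assumption: multiplicative (implying a stationary Markov process)
  - $S(t + t') = S(t)\,S(t')$
  - That is, $P(x \mid y, t + t') = \sum_z P(x \mid z, t)\, P(z \mid y, t')$
- Jukes-Cantor: constant rate of evolution
- For short time $\epsilon$,

$$S(\epsilon) = I + R\epsilon = \begin{pmatrix} 1 - 3\alpha\epsilon & \alpha\epsilon & \alpha\epsilon & \alpha\epsilon \\ \alpha\epsilon & 1 - 3\alpha\epsilon & \alpha\epsilon & \alpha\epsilon \\ \alpha\epsilon & \alpha\epsilon & 1 - 3\alpha\epsilon & \alpha\epsilon \\ \alpha\epsilon & \alpha\epsilon & \alpha\epsilon & 1 - 3\alpha\epsilon \end{pmatrix}$$

(Figure: substitution diagram over the nucleotides A, C, G, T.)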

Modeling Evolution

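- Jukes-Cantor
- For longer times,

$$S(t) = \begin{pmatrix} r(t) & s(t) & s(t) & s(t) \\ s(t) & r(t) & s(t) & s(t) \\ s(t) & s(t) & r(t) & s(t) \\ s(t) & s(t) & s(t) & r(t) \end{pmatrix}$$

- Where we can derive
  - $r(t) = \tfrac{1}{4}\,(1 + 3e^{-4\alpha t})$
  - $s(t) = \tfrac{1}{4}\,(1 - e^{-4\alpha t})$

Derivation: $S(t + \epsilon) = S(t)\,S(\epsilon) = S(t)\,(I + R\epsilon)$. Therefore, $(S(t + \epsilon) - S(t))/\epsilon = S(t)\,R$. At the limit of $\epsilon \to 0$, $S'(t) = S(t)\,R$. Equivalently, $r' = -3\alpha r + 3\alpha s$ and $s' = -\alpha s + \alpha r$. Those differential equations lead to $r(t) = \tfrac{1}{4}\,(1 + 3e^{-4\alpha t})$ and $s(t) = \tfrac{1}{4}\,(1 - e^{-4\alpha t})$.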
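The multiplicative property can be checked numerically. The following is a minimal sketch, assuming an illustrative rate `alpha = 0.1` and helper names (`jc_probs`, `jc_matrix`, `matmul`) that are not part of the slides: it builds $S(t)$ from the closed forms for $r(t)$ and $s(t)$ and compares $S(1)\,S(2)$ with $S(3)$.

```python
import math

def jc_probs(t, alpha):
    """Return (r, s): probability of no net change, and of one specific change, after time t."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e), 0.25 * (1.0 - e)

def jc_matrix(t, alpha):
    """4x4 Jukes-Cantor matrix S(t), rows and columns ordered A, C, G, T."""
    r, s = jc_probs(t, alpha)
    return [[r if i == j else s for j in range(4)] for i in range(4)]

def matmul(a, b):
    """Plain-Python product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

alpha = 0.1  # assumed rate, for illustration only
S1, S2, S3 = jc_matrix(1.0, alpha), jc_matrix(2.0, alpha), jc_matrix(3.0, alpha)
product = matmul(S1, S2)

# Largest entrywise gap between S(1) S(2) and S(3); it should be at rounding level.
max_diff = max(abs(product[i][j] - S3[i][j]) for i in range(4) for j in range(4))
print(max_diff)
```

Each row of `jc_matrix` also sums to 1, since $r(t) + 3s(t) = 1$.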

Modeling Evolution

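- Kimura
  - Transitions: A/G, C/T
  - Transversions: A/T, A/C, G/T, C/G
  - Transitions (rate $\alpha$) are much more likely than transversions (rate $\beta$)

$$S(t) = \begin{pmatrix} r(t) & s(t) & u(t) & s(t) \\ s(t) & r(t) & s(t) & u(t) \\ u(t) & s(t) & r(t) & s(t) \\ s(t) & u(t) & s(t) & r(t) \end{pmatrix}$$

- Where (rows and columns ordered A, C, G, T)
  - $s(t) = \tfrac{1}{4}\,(1 - e^{-4\beta t})$
  - $u(t) = \tfrac{1}{4}\,(1 + e^{-4\beta t} - 2e^{-2(\alpha + \beta) t})$
  - $r(t) = 1 - 2s(t) - u(t)$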
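A small numerical sketch of the Kimura probabilities, using the standard two-parameter form $u(t) = \tfrac{1}{4}(1 + e^{-4\beta t} - 2e^{-2(\alpha+\beta)t})$ so that $u(0) = 0$. The function name `kimura_probs` and the rates below are assumptions chosen for illustration.

```python
import math

def kimura_probs(t, alpha, beta):
    """Return (r, s, u) for the Kimura model.

    r: no net change, s: one specific transversion, u: the transition.
    alpha is the transition rate, beta the transversion rate.
    """
    e4b = math.exp(-4.0 * beta * t)
    e2ab = math.exp(-2.0 * (alpha + beta) * t)
    s = 0.25 * (1.0 - e4b)
    u = 0.25 * (1.0 + e4b - 2.0 * e2ab)
    r = 1.0 - 2.0 * s - u
    return r, s, u

# Assumed rates: transitions ten times more frequent than transversions.
r, s, u = kimura_probs(0.1, alpha=1.0, beta=0.1)
print(r, s, u)

# With alpha == beta the model collapses to Jukes-Cantor, where u(t) == s(t).
r_jc, s_jc, u_jc = kimura_probs(0.5, alpha=0.2, beta=0.2)
```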

Phylogeny and sequence comparison

- Basic principles
  - Degree of sequence difference is proportional to length of independent sequence evolution
  - Only use positions where alignment is pretty certain; avoid areas with (too many) gaps

Distance between two sequences

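- Given sequences $x_i$, $x_j$,
- Define
  - $d_{ij}$ = distance between the two sequences
- One possible definition
  - $d_{ij}$ = fraction $f$ of sites $u$ where $x_{iu} \ne x_{ju}$
- Better model (Jukes-Cantor)
  - $f = \tfrac{3}{4}\,(1 - e^{-4\alpha t}) \;\Rightarrow\; \tfrac{3}{4}\,e^{-4\alpha t} = \tfrac{3}{4} - f \;\Rightarrow\; \log(e^{-4\alpha t}) = \log(1 - \tfrac{4}{3} f)$
  - $d_{ij} = t = -\tfrac{1}{4}\,\alpha^{-1} \log(1 - \tfrac{4}{3} f)$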
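As a sketch of both definitions, the code below computes the mismatch fraction $f$ and the Jukes-Cantor corrected distance for the example sequences from the first slide. The function names and the default `alpha=1.0` are assumptions for illustration; the rate only rescales the distance.

```python
import math

def mismatch_fraction(x, y):
    """Fraction f of aligned sites at which x and y differ."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(x, y)) / len(x)

def jc_distance(x, y, alpha=1.0):
    """Jukes-Cantor distance t = -(1 / (4 alpha)) * log(1 - 4f/3)."""
    f = mismatch_fraction(x, y)
    if f >= 0.75:
        # The correction is undefined once f reaches its saturation value 3/4.
        return math.inf
    return -math.log(1.0 - 4.0 * f / 3.0) / (4.0 * alpha)

orc = "ACAGTGACGCCCCAAACGT"
elf = "ACAGTGACGCTACAAACGT"
dwarf = "CCTGTGACGTAACAAACGA"
hobbit = "CCTGTGACGTAGCAAACGA"
human = "CCTGTGACGTAGCAAACGA"

print(mismatch_fraction(orc, elf), jc_distance(orc, elf))
print(mismatch_fraction(dwarf, hobbit), jc_distance(dwarf, hobbit))
```

Identical sequences (Hobbit and Human here) get distance 0 under both definitions.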

A simple clustering method for building tree

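- UPGMA (unweighted pair group method using arithmetic averages)
- Or the Average Linkage Method
- Given two disjoint clusters $C_i$, $C_j$ of sequences,

$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq}$$

- Claim that if $C_k = C_i \cup C_j$, then distance to another cluster $C_l$ is

$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|}$$

Proof:

$$d_{kl} = \frac{\sum_{C_i, C_l} d_{pq} + \sum_{C_j, C_l} d_{pq}}{(|C_i| + |C_j|)\,|C_l|} = \frac{\frac{|C_i|}{|C_i|\,|C_l|} \sum_{C_i, C_l} d_{pq} + \frac{|C_j|}{|C_j|\,|C_l|} \sum_{C_j, C_l} d_{pq}}{|C_i| + |C_j|} = \frac{|C_i|\, d_{il} + |C_j|\, d_{jl}}{|C_i| + |C_j|}$$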

Algorithm: Average Linkage

- Initialization
  - Assign each $x_i$ into its own cluster $C_i$
  - Define one leaf per sequence, height 0
- Iteration
  - Find two clusters $C_i$, $C_j$ s.t. $d_{ij}$ is min
  - Let $C_k = C_i \cup C_j$
  - Define node connecting $C_i$, $C_j$, and place it at height $d_{ij}/2$
  - Delete $C_i$, $C_j$
- Termination
  - When two clusters i, j remain, place root at height $d_{ij}/2$
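The steps above can be sketched in a few lines. This is a minimal illustration, not a tuned implementation: the `upgma` name, the nested-tuple tree representation, and the four-leaf distance table are assumptions made for the example. Cluster-to-cluster distances are updated with the size-weighted average from the claim on the previous slide.

```python
from itertools import combinations

def upgma(names, dist):
    """Average-linkage clustering.

    names: list of leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between leaves a and b.
    Returns a nested tuple (left, right, height); leaves are plain labels.
    """
    size = {name: 1 for name in names}   # cluster -> number of leaves in it
    d = dict(dist)                       # working copy of the distances
    while len(size) > 1:
        # Find the two clusters at minimal distance.
        i, j = min(combinations(size, 2), key=lambda p: d[frozenset(p)])
        dij = d[frozenset((i, j))]
        k = (i, j, dij / 2)              # new node placed at height d_ij / 2
        ni, nj = size.pop(i), size.pop(j)
        # Size-weighted average distance from the merged cluster to the rest.
        for l in size:
            d[frozenset((k, l))] = (
                ni * d[frozenset((i, l))] + nj * d[frozenset((j, l))]
            ) / (ni + nj)
        size[k] = ni + nj
    (root,) = size                       # the last remaining cluster is the root
    return root

# Hypothetical ultrametric distances on four leaves.
names = ["a", "b", "c", "d"]
pairs = {("a", "b"): 2, ("a", "c"): 6, ("a", "d"): 6,
         ("b", "c"): 6, ("b", "d"): 6, ("c", "d"): 4}
dist = {frozenset(p): v for p, v in pairs.items()}
tree = upgma(names, dist)
print(tree)
```

The final merge of the last two clusters plays the role of the termination step, placing the root at height $d_{ij}/2$.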

Example

(Figure: average-linkage tree over the sequences v, w, x, y, z, with internal nodes 1 to 4.)

Ultrametric Distances and Molecular Clock

- Definition
  - A distance function $d(\cdot,\cdot)$ is ultrametric if, for any three distances $d_{ij} \le d_{ik} \le d_{jk}$, it is true that $d_{ij} \le d_{ik} = d_{jk}$
- The Molecular Clock
  - The evolutionary distance between species x and y is 2× the Earth time to reach the nearest common ancestor
  - That is, the molecular clock has constant rate in all species

The molecular clock results in ultrametric distances

(Figure: tree over leaves 1 to 5, with the vertical axis in years.)
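The ultrametric condition says that, among the three distances of any triple, the two largest are equal. A minimal check, assuming a hypothetical four-leaf distance table and a helper name `is_ultrametric` that does not come from the slides:

```python
from itertools import combinations

def is_ultrametric(names, d, tol=1e-9):
    """True if, for every triple of leaves, the two largest pairwise distances agree."""
    for i, j, k in combinations(names, 3):
        _, middle, largest = sorted(
            (d[frozenset((i, j))], d[frozenset((i, k))], d[frozenset((j, k))])
        )
        if abs(largest - middle) > tol:
            return False
    return True

names = ["a", "b", "c", "d"]
pairs = {("a", "b"): 2, ("a", "c"): 6, ("a", "d"): 6,
         ("b", "c"): 6, ("b", "d"): 6, ("c", "d"): 4}
clock_like = {frozenset(p): v for p, v in pairs.items()}

# Stretch one distance, as if leaf c had evolved faster with respect to a.
distorted = dict(clock_like)
distorted[frozenset(("a", "c"))] = 7

print(is_ultrametric(names, clock_like), is_ultrametric(names, distorted))
```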

Ultrametric Distances & Average Linkage

- Average Linkage is guaranteed to reconstruct correctly a binary tree with ultrametric distances
- Proof: Exercise (extra credit)

Weakness of Average Linkage

- Molecular clock: all species evolve at the same rate (Earth time)
- However, certain species (e.g., mouse, rat) evolve much faster
- Example where UPGMA messes up:

(Figure: two trees over leaves 1 to 4, the "Correct tree" and the "AL tree" produced by average linkage.)

Additive Distances

(Figure: tree with nodes numbered 1 to 13.)

- Given a tree, a distance measure is additive if the distance between any pair of leaves is the sum of lengths of edges connecting them
- Given a tree T & additive distances $d_{ij}$, can uniquely reconstruct edge lengths:
  - Find two neighboring leaves i, j, with common parent k
  - Place parent node k at distance $d_{km} = \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij})$ from any node m

Additive Distances

(Figure: unrooted tree on the four leaves x, y, z, w.)

- For any four leaves x, y, z, w, consider the three sums
  - $d(x, y) + d(z, w)$
  - $d(x, z) + d(y, w)$
  - $d(x, w) + d(y, z)$
- One of them is smaller than the other two, which are equal:
  - $d(x, y) + d(z, w) < d(x, z) + d(y, w) = d(x, w) + d(y, z)$
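The four-leaf condition can be tested directly on a distance table. The sketch below assumes a hypothetical five-leaf tree (leaves x, y, z, w, v; leaf edges 5, 4, 7, 4, 6; two internal edges of length 3) and derives its leaf-to-leaf distances by hand; the function name `is_additive` is also an assumption.

```python
from itertools import combinations

def is_additive(names, d, tol=1e-9):
    """Four-point check: for every quadruple, the two largest of the three sums are equal."""
    def dist(a, b):
        return d[frozenset((a, b))]
    for x, y, z, w in combinations(names, 4):
        sums = sorted((dist(x, y) + dist(z, w),
                       dist(x, z) + dist(y, w),
                       dist(x, w) + dist(y, z)))
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

# Path lengths in the assumed tree: x and y hang off one internal node,
# w and v off another, and z off the node between them.
names = ["x", "y", "z", "w", "v"]
pairs = {("x", "y"): 9, ("x", "z"): 15, ("x", "w"): 15, ("x", "v"): 17,
         ("y", "z"): 14, ("y", "w"): 14, ("y", "v"): 16,
         ("z", "w"): 14, ("z", "v"): 16, ("w", "v"): 10}
additive = {frozenset(p): v for p, v in pairs.items()}

# Changing a single entry breaks the equality of the two larger sums.
perturbed = dict(additive)
perturbed[frozenset(("x", "z"))] = 16

print(is_additive(names, additive), is_additive(names, perturbed))
```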

Reconstructing Additive Distances Given T

(Figure: tree T over leaves x, y, z, w, v with edge lengths 5, 4, 3, 3, 4, 7, 6, shown next to its distance matrix D.)

If we know T and D, but do not know the length of each edge, we can reconstruct those lengths

Reconstructing Additive Distances Given T

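(Figure: leaves v and w are joined at a new node a in T; the distance matrix D is reduced to D1.)

- $d_{ax} = \tfrac{1}{2}\,(d_{vx} + d_{wx} - d_{vw})$
- $d_{ay} = \tfrac{1}{2}\,(d_{vy} + d_{wy} - d_{vw})$
- $d_{az} = \tfrac{1}{2}\,(d_{vz} + d_{wz} - d_{vw})$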
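A worked instance of these three formulas, as a sketch. The leaf-to-leaf distances are assumed values for a hypothetical tree in which v and w are neighbors with parent a; they are not read from the slides' distance matrix.

```python
def parent_distance(d_im, d_jm, d_ij):
    """Distance from the parent of neighboring leaves i, j to another node m."""
    return 0.5 * (d_im + d_jm - d_ij)

# Assumed leaf-to-leaf distances (hypothetical tree, v and w are neighbors).
d_vw = 10
d_vx, d_wx = 17, 15
d_vy, d_wy = 16, 14
d_vz, d_wz = 16, 14

d_ax = parent_distance(d_vx, d_wx, d_vw)
d_ay = parent_distance(d_vy, d_wy, d_vw)
d_az = parent_distance(d_vz, d_wz, d_vw)
print(d_ax, d_ay, d_az)
```

The edges to v and w cancel out of each expression: $d_{vx} + d_{wx}$ counts the path from a to x twice plus $d_{vw}$ once.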

Reconstructing Additive Distances Given T

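(Figure: tree T with internal nodes a, b, c and edge lengths 5, 4, 3, 3, 4, 7, 6; the distance matrix is reduced from D1 to D2 to D3.)

- $d(a, c) = 3$
- $d(b, c) = d(a, b) - d(a, c) = 3$
- $d(c, z) = d(a, z) - d(a, c) = 7$
- $d(b, x) = d(a, x) - d(a, b) = 5$
- $d(b, y) = d(a, y) - d(a, b) = 4$
- $d(a, w) = d(z, w) - d(a, z) = 4$
- $d(a, v) = d(z, v) - d(a, z) = 6$

Correct!!!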

Neighbor-Joining

- Guaranteed to produce the correct tree if distance is additive
- May produce a good tree even when distance is not additive
- Step 1: Finding neighboring leaves
- Define $D_{ij} = d_{ij} - (r_i + r_j)$, where

$$r_i = \frac{1}{|L| - 2} \sum_k d_{ik}$$

- Claim: The above "magic trick" ensures that $D_{ij}$ is minimal iff i, j are neighbors
- Proof: Very technical, please read Durbin et al.!

(Figure: four-leaf tree with leaves 1 to 4 and edge lengths 0.1, 0.1, 0.1, 0.4, 0.4.)

Algorithm: Neighbor-joining

- Initialization
  - Define T to be the set of leaf nodes, one per sequence
  - Let L = T
- Iteration
  - Pick i, j s.t. $D_{ij}$ is minimal
  - Define a new node k, and set $d_{km} = \tfrac{1}{2}\,(d_{im} + d_{jm} - d_{ij})$ for all $m \in L$
  - Add k to T, with edges of lengths $d_{ik} = \tfrac{1}{2}\,(d_{ij} + r_i - r_j)$ and $d_{jk} = d_{ij} - d_{ik}$
  - Remove i, j from L
  - Add k to L
- Termination
  - When L consists of two nodes i, j, add the edge between them, of length $d_{ij}$
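A compact sketch of neighbor-joining follows. The function name, the edge-list output, and the `("internal", n)` node labels are assumptions for illustration. The test tree is also an assumption: leaves 1 and 3 share a parent, leaves 2 and 4 share a parent, with leaf edges 1 and 4 and a middle edge of 1 (the 0.1/0.4 pattern scaled by 10). In it the closest pair (1, 2) are not neighbors, so plain minimum-distance joining would go wrong.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Return a list of edges (node, node, length) for the reconstructed tree.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    """
    d = dict(dist)
    L = list(names)
    edges = []
    next_id = 0
    while len(L) > 2:
        # r_i: summed distance to the other active nodes, divided by |L| - 2.
        r = {i: sum(d[frozenset((i, m))] for m in L if m != i) / (len(L) - 2)
             for i in L}
        # Pick the pair minimizing D_ij = d_ij - (r_i + r_j).
        i, j = min(combinations(L, 2),
                   key=lambda p: d[frozenset(p)] - r[p[0]] - r[p[1]])
        dij = d[frozenset((i, j))]
        k = ("internal", next_id)
        next_id += 1
        dik = 0.5 * (dij + r[i] - r[j])
        edges.append((i, k, dik))
        edges.append((j, k, dij - dik))
        L.remove(i)
        L.remove(j)
        for m in L:
            d[frozenset((k, m))] = 0.5 * (
                d[frozenset((i, m))] + d[frozenset((j, m))] - dij
            )
        L.append(k)
    i, j = L
    edges.append((i, j, d[frozenset((i, j))]))
    return edges

names = [1, 2, 3, 4]
pairs = {(1, 2): 3, (1, 3): 5, (1, 4): 6, (2, 3): 6, (2, 4): 5, (3, 4): 9}
dist = {frozenset(p): v for p, v in pairs.items()}
edges = neighbor_joining(names, dist)
for edge in edges:
    print(edge)
```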

Parsimony

- One of the most popular methods
- Idea
  - Find the tree that explains the observed sequences with a minimal number of substitutions
- Two computational subproblems
  - Find the parsimony cost of a given tree (easy)
  - Search through all tree topologies (hard)

Parsimony Scoring

- Given a tree, and an alignment column u
- Label internal nodes to minimize the number of required substitutions
- Initialization
  - Set cost C = 0; k = 2N − 1
- Iteration
  - If k is a leaf, set $R_k = \{x_{ku}\}$
  - If k is not a leaf,
    - Let i, j be the daughter nodes
    - Set $R_k = R_i \cap R_j$ if intersection is nonempty
    - Set $R_k = R_i \cup R_j$, and C += 1, if intersection is empty
- Termination
  - Minimal cost of tree for column u = C
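The scoring pass can be written as a short recursion. This is a sketch under assumed conventions: the tree is a nested pair of leaf names, `column` maps each leaf to its character, and the five-leaf example is hypothetical.

```python
def fitch_cost(tree, column):
    """Return (C, R_root): the minimal number of substitutions and the root's candidate set.

    tree:   a leaf name, or a pair (left_subtree, right_subtree).
    column: dict mapping each leaf name to its character in this alignment column.
    """
    cost = 0

    def visit(node):
        nonlocal cost
        if not isinstance(node, tuple):
            return {column[node]}          # leaf: R_k holds the observed character
        left, right = visit(node[0]), visit(node[1])
        common = left & right
        if common:
            return common                  # nonempty intersection, no extra cost
        cost += 1                          # empty intersection: one substitution
        return left | right

    root_set = visit(tree)
    return cost, root_set

tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
column = {"s1": "A", "s2": "B", "s3": "A", "s4": "B", "s5": "B"}
cost, root_set = fitch_cost(tree, column)
print(cost, root_set)
```

Here the recursion visits daughters before parents, which stands in for counting k down from 2N − 1 when nodes are numbered so that parents have larger indices than their daughters.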

Example

(Figure: parsimony scoring worked on trees whose leaves are labeled A or B; an internal node whose daughter sets are disjoint is labeled {A, B} and adds 1 to C.)

Traceback to find ancestral nucleotides

- Traceback
  - Choose an arbitrary nucleotide from $R_{2N-1}$ for the root
  - Having chosen nucleotide r for parent k,
    - If $r \in R_i$ choose r for daughter i
    - Else, choose arbitrary nucleotide from $R_i$
- Easy to see that this traceback produces some assignment of cost C
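A sketch of the traceback on top of the scoring pass, under the same assumed tree encoding (nested pairs of leaf names). "Arbitrary" choices are made deterministic here by taking the smallest character; that tie-break, the function names, and the example column are assumptions.

```python
def fitch_sets(tree, column, sets):
    """Post-order pass: fill sets[node] with R_k and return the parsimony cost C."""
    if not isinstance(tree, tuple):
        sets[tree] = {column[tree]}
        return 0
    left, right = tree
    cost = fitch_sets(left, column, sets) + fitch_sets(right, column, sets)
    common = sets[left] & sets[right]
    sets[tree] = common if common else sets[left] | sets[right]
    return cost + (0 if common else 1)

def traceback(tree, sets, labels, parent_choice=None):
    """Pre-order pass: keep the parent's character when allowed, else pick from R_i."""
    if parent_choice in sets[tree]:
        choice = parent_choice
    else:
        choice = min(sets[tree])   # deterministic stand-in for an arbitrary pick
    labels[tree] = choice
    if isinstance(tree, tuple):
        for child in tree:
            traceback(child, sets, labels, choice)

def count_substitutions(tree, labels):
    """Number of edges whose two endpoints carry different characters."""
    if not isinstance(tree, tuple):
        return 0
    return sum((labels[child] != labels[tree]) + count_substitutions(child, labels)
               for child in tree)

tree = ((("s1", "s2"), "s3"), ("s4", "s5"))
column = {"s1": "A", "s2": "B", "s3": "A", "s4": "B", "s5": "B"}
sets, labels = {}, {}
cost = fitch_sets(tree, column, sets)
traceback(tree, sets, labels)
print(cost, count_substitutions(tree, labels), labels[tree])
```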

Example

(Figure: optimal labelings of a tree whose leaves are labeled A or B, with substitutions marked x. One panel is titled "Admissible with Traceback", the other "Still optimal, but inadmissible with Traceback".)

Number of labeled unrooted tree topologies

(Figure: an unrooted tree grown one leaf at a time, starting from leaves 1, 2, 3.)

- How many possibilities are there for leaf 4?
  - For the 4th leaf, there are 3 possibilities
- How many possibilities are there for leaf 5?
  - For the 5th leaf, there are 5 possibilities
- How many possibilities are there for leaf 6?
  - For the 6th leaf, there are 7 possibilities
- How many possibilities are there for leaf n?
  - For the nth leaf, there are 2n − 5 possibilities
- Number of unrooted trees for n taxa: $(2n-5)(2n-7)\cdots 3 \cdot 1 = \dfrac{(2n-5)!}{2^{n-3}\,(n-3)!}$
- Number of rooted trees for n taxa: $(2n-3)(2n-5)(2n-7)\cdots 3 = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
- N = 10: 2,027,025 unrooted; 34,459,425 rooted
- N = 30: $8.7 \times 10^{36}$ unrooted; $4.95 \times 10^{38}$ rooted
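The counts follow from multiplying the number of placements available to each new leaf. A small sketch (function names are assumptions) that reproduces the N = 10 figures and cross-checks the product against the factorial form:

```python
import math

def num_unrooted_trees(n):
    """(2n-5)(2n-7)...3*1: leaf k (k >= 4) can attach to any of 2k-5 existing edges."""
    count = 1
    for k in range(4, n + 1):
        count *= 2 * k - 5
    return count

def num_rooted_trees(n):
    """The root can be placed on any of the 2n-3 edges of an unrooted tree."""
    return num_unrooted_trees(n) * (2 * n - 3)

def num_unrooted_closed_form(n):
    """(2n-5)! / (2^(n-3) (n-3)!), valid for n >= 3."""
    return math.factorial(2 * n - 5) // (2 ** (n - 3) * math.factorial(n - 3))

print(num_unrooted_trees(10), num_rooted_trees(10))
print(f"{num_unrooted_trees(30):.3e}", f"{num_rooted_trees(30):.3e}")
```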

Search through tree topologies: Branch and Bound

- Observation: adding an edge to an existing tree can only increase the parsimony cost
- Enumerate all unrooted trees with at most n leaves:
  - $i_3\, i_5\, i_7 \cdots i_{2N-5}$
  - where each $i_k$ can take values from 0 (no edge) to k
- At each point keep C = smallest cost so far for a complete tree
- Start B&B with tree 1000, that is, $i_3 = 1$ and all other counters 0
- Whenever cost of current tree T is > C, then
  - T is not optimal
  - Any tree extending T with more edges is not optimal
  - Increment by 1 the rightmost nonzero counter

Bootstrapping to get the best trees

- Main outline of algorithm
  - 1. Select random columns from a multiple alignment; one column can then appear several times
  - 2. Build a phylogenetic tree based on the random sample from (1)
  - 3. Repeat (1), (2) many (say, 1000) times
  - 4. Output the tree that is constructed most frequently
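The resampling loop can be sketched independently of the tree-building method. In the code below, `closest_pair` is a toy stand-in for a real tree builder (it reports only the most similar pair of sequences), and all function names and the fixed seed are assumptions for illustration.

```python
import random
from collections import Counter
from itertools import combinations

def resample_columns(alignment, rng):
    """Step 1: draw as many columns as the alignment has, with replacement."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    return {name: "".join(seq[c] for c in cols) for name, seq in alignment.items()}

def closest_pair(alignment):
    """Toy stand-in for step 2: the pair of sequences with the fewest mismatches."""
    def mismatches(pair):
        a, b = alignment[pair[0]], alignment[pair[1]]
        return sum(x != y for x, y in zip(a, b))
    return min(combinations(sorted(alignment), 2), key=mismatches)

def bootstrap(alignment, build_tree, replicates, seed=0):
    """Steps 3 and 4: repeat, and count how often each result is constructed."""
    rng = random.Random(seed)
    return Counter(build_tree(resample_columns(alignment, rng))
                   for _ in range(replicates))

alignment = {
    "Orc": "ACAGTGACGCCCCAAACGT",
    "Elf": "ACAGTGACGCTACAAACGT",
    "Dwarf": "CCTGTGACGTAACAAACGA",
    "Hobbit": "CCTGTGACGTAGCAAACGA",
    "Human": "CCTGTGACGTAGCAAACGA",
}
counts = bootstrap(alignment, closest_pair, replicates=200)
print(counts.most_common(3))
```

Any function that maps an alignment to a hashable tree description can be passed as `build_tree`.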

Probabilistic Methods

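(Figure: a root $x_{root}$ with two leaves $x_1$, $x_2$ on branches of length $t_1$, $t_2$.)

- A more refined measure of evolution along a tree than parsimony
- $P(x_1, x_2, x_{root} \mid t_1, t_2) = P(x_{root})\, P(x_1 \mid t_1, x_{root})\, P(x_2 \mid t_2, x_{root})$
- If we use Jukes-Cantor, for example, and $x_1 = x_{root} = A$, $x_2 = C$, $t_1 = t_2 = 1$,
  - $= p_A \cdot \tfrac{1}{4}\,(1 + 3e^{-4\alpha}) \cdot \tfrac{1}{4}\,(1 - e^{-4\alpha}) = (\tfrac{1}{4})^3\,(1 + 3e^{-4\alpha})(1 - e^{-4\alpha})$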

Probabilistic Methods

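(Figure: tree with root $x_{root}$, an internal node $x_u$, and leaves $x_1, x_2, \ldots, x_N$.)

- If we know all internal labels $x_u$,
  - $P(x_1, x_2, \ldots, x_N, x_{N+1}, \ldots, x_{2N-1} \mid T, t) = P(x_{root}) \prod_{j \ne root} P(x_j \mid x_{parent(j)},\, t_{j,\,parent(j)})$
- Usually we don't know the internal labels, therefore
  - $P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_{x_{N+1}} \sum_{x_{N+2}} \cdots \sum_{x_{2N-1}} P(x_1, x_2, \ldots, x_{2N-1} \mid T, t)$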

Felsenstein's Likelihood Algorithm

- To calculate $P(x_1, x_2, \ldots, x_N \mid T, t)$
- Initialization
  - Set k = 2N − 1
- Recursion: Compute $P(L_k \mid a)$ for all $a \in \Sigma$
  - If k is a leaf node
    - Set $P(L_k \mid a) = 1(a = x_k)$
  - If k is not a leaf node
    - 1. Compute $P(L_i \mid b)$, $P(L_j \mid b)$ for all b, for daughter nodes i, j
    - 2. Set $P(L_k \mid a) = \sum_{b,\,c} P(b \mid a, t_i)\, P(L_i \mid b)\, P(c \mid a, t_j)\, P(L_j \mid c)$
- Termination
  - Likelihood at this column $= P(x_1, x_2, \ldots, x_N \mid T, t) = \sum_a P(L_{2N-1} \mid a)\, P(a)$
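The recursion can be sketched for a single alignment column as follows. This is a minimal illustration under several assumptions: substitutions follow Jukes-Cantor with an illustrative rate `alpha`, the prior $P(a)$ is uniform (1/4), and a tree is encoded as nested tuples `((left, t_left), (right, t_right))` with leaf names as strings. None of these names come from the slides.

```python
import math

ALPHABET = "ACGT"

def jc_prob(b, a, t, alpha):
    """Jukes-Cantor P(b | a, t): probability that a becomes b along a branch of length t."""
    e = math.exp(-4.0 * alpha * t)
    return 0.25 * (1.0 + 3.0 * e) if a == b else 0.25 * (1.0 - e)

def partial_likelihood(node, column, alpha):
    """Return {a: P(L_k | a)} for the subtree rooted at node."""
    if isinstance(node, str):
        # Leaf: indicator of the observed character.
        return {a: 1.0 if a == column[node] else 0.0 for a in ALPHABET}
    (left, t_i), (right, t_j) = node
    p_i = partial_likelihood(left, column, alpha)
    p_j = partial_likelihood(right, column, alpha)
    result = {}
    for a in ALPHABET:
        # The double sum over (b, c) factorizes into one sum per daughter.
        from_left = sum(jc_prob(b, a, t_i, alpha) * p_i[b] for b in ALPHABET)
        from_right = sum(jc_prob(c, a, t_j, alpha) * p_j[c] for c in ALPHABET)
        result[a] = from_left * from_right
    return result

def column_likelihood(tree, column, alpha, prior=0.25):
    """Sum over root characters a of P(L_root | a) P(a), with a uniform prior."""
    return sum(prior * p for p in partial_likelihood(tree, column, alpha).values())

alpha = 0.1  # assumed rate
tree = (("x1", 1.0), ("x2", 1.0))
column = {"x1": "A", "x2": "C"}
root_partials = partial_likelihood(tree, column, alpha)
likelihood = column_likelihood(tree, column, alpha)
print(likelihood)
```

For this two-leaf tree, the root term for `a = "A"` multiplied by the prior is the quantity $(\tfrac{1}{4})^3 (1 + 3e^{-4\alpha})(1 - e^{-4\alpha})$, and summing the column likelihood over all 16 possible leaf columns gives 1.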

Probabilistic Methods

- Given M (ungapped) alignment columns of N sequences,
- Define likelihood of a tree:
  - $L(T, t) = P(\text{Data} \mid T, t) = \prod_{m=1}^{M} P(x_{1m}, \ldots, x_{Nm} \mid T, t)$
- Maximum Likelihood Reconstruction:
  - Given data $X = (x_{ij})$, find a topology T and length vector t that maximize likelihood L(T, t)