Phylogenetic Trees: Lecture 12

Based on pages 160-176 in Durbin et al. (the black textbook).

This class has been edited from Nir Friedman's lecture, which was available at www.cs.huji.ac.il/~nir. Pictures from Tal Pupko's slides. Changes by Dan Geiger and Shlomo Moran.

Evolution

- Evolution of new organisms is driven by:
  - Diversity: different individuals carry different variants of the same basic blueprint.
  - Mutations: the DNA sequence can be changed by single-base changes, deletion/insertion of DNA segments, etc.
  - Selection bias

The Tree of Life

Source: Alberts et al.

Tree of Life: a better picture

After Ernst Haeckel, 1891

Primate evolution

A phylogeny, also called a phylogenetic tree, is a tree that describes the sequence of speciation events that led to the formation of a set of current-day species.

Morphological vs. Molecular

- Classical phylogenetic analysis: morphological features such as number of legs, lengths of legs, etc.
- Modern biological methods allow the use of molecular features:
  - Gene sequences
  - Protein sequences
- Analysis is based on homologous sequences (e.g., globins) in different species.
- Important for many aspects of biology:
  - Classification
  - Understanding biological mechanisms

Morphological topology

(Based on McKenna and Bell, 1997)

Archonta

Ungulata

From sequences to a phylogenetic tree

Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG

There are many possible types of sequences to use (e.g., mitochondrial vs. nuclear proteins).

Mitochondrial topology

(Based on Pupko et al.)

Nuclear topology

(Based on Pupko et al. slide)

(tree by Madsen)

Theory of Evolution

- Basic idea:
  - Speciation events lead to the creation of different species.
  - Speciation is caused by physical separation into groups in which different genetic variants become dominant.
- Any two species share a (possibly distant) common ancestor.

Phylogenetic trees

- Leaves - current-day species
- Nodes - hypothetical most recent common ancestors
- Edge lengths - time from one speciation to the next

Dangers in Molecular Phylogenies

- Gene and protein sequences can be homologous for various reasons:
  - Orthologs -- sequences diverged after a speciation event. Indicative of a new species.
  - Paralogs -- sequences diverged after a duplication event.
  - Xenologs -- sequences diverged after a horizontal transfer (e.g., by a virus).

Gene Phylogenies

Phylogenies can be constructed to describe the evolution of genes.

Three species, termed 1, 2, 3. Two paralogous genes, A and B.

Dangers of Paralogs

- If we happen to consider only the genes 1A, 2B, and 3A, we get a wrong tree: it does not represent the phylogeny of the host species of the given sequences, because duplication does not create new species.

[Figure: gene tree in which a gene duplication is followed by speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B.]

In the sequel we assume all given sequences are

orthologs.

Types of Trees

- A natural model to consider is that of rooted

trees

Common Ancestor

Types of trees

- An unrooted tree represents a phylogeny without the root node

Depending on the model, data from current day

species does not distinguish between different

placements of the root. In this example there

are seven possible ways to place a root.

Rooted versus unrooted trees

Tree c represents the three rooted trees.

[Figure: tree with labels a, b, c.]

Slide by Tal Pupko

Positioning Roots in Unrooted Trees

- We can estimate the position of the root by introducing an outgroup: a set of species that are definitely distant from all the species of interest.

[Figure: tree on Aardvark, Bison, Chimp, Dog, and Elephant, with Falcon as the outgroup; the proposed root is marked.]

Type of Data

- Distance-based
  - Input is a matrix of distances between species.
  - Can be the fraction of residues they disagree on, or the alignment score between them, or ...
- Character-based
  - Examine each character (e.g., residue) separately.
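As a minimal sketch of the first distance option (the fraction of residues on which two sequences disagree), the following Python builds a distance matrix for the four sequences shown earlier. The names `p_distance` and `distance_matrix` are illustrative choices, not part of the lecture, and the sequences are assumed to be already aligned to equal length.

```python
from itertools import combinations

# Sequences from the "From sequences to a phylogenetic tree" slide.
sequences = {
    "Rat":     "QEPGGLVVPPTDA",
    "Rabbit":  "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat":     "REPGGLVVPPTEG",
}

def p_distance(s, t):
    """Fraction of aligned positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)

def distance_matrix(seqs):
    """Symmetric dictionary (a, b) -> distance over all pairs of names."""
    names = list(seqs)
    d = {(a, a): 0.0 for a in names}
    for a, b in combinations(names, 2):
        d[a, b] = d[b, a] = p_distance(seqs[a], seqs[b])
    return d

dist = distance_matrix(sequences)
```

With this input, Rat and Gorilla are at distance 0, while Rat and Rabbit differ at one of 13 positions.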

Three Methods of Tree Construction

- Distance: a tree that recursively combines the two nodes with the smallest distance.
- Parsimony: a tree with the minimum total number of character changes between nodes.
- Maximum likelihood: finding the best Bayesian network of a tree shape. The method of choice nowadays. A well-known and useful software package, PHYLIP, implements this method: http://evolution.genetics.washington.edu/phylip.html

Distance-Based (1st type Method)

- Input: distance matrix between species
- Outline:
  - Cluster species together
  - Initially clusters are singletons
  - At each iteration combine the two closest clusters to get a new one

UPGMA Clustering

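- Let Ci and Cj be clusters; define the distance between them to be the average distance between their members:
  $d(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
- When we combine two clusters, Ci and Cj, to form a new cluster Ck, then for every other cluster Cl:
  $d(C_k, C_l) = \frac{|C_i|\, d(C_i, C_l) + |C_j|\, d(C_j, C_l)}{|C_i| + |C_j|}$
- Define a node k with daughter nodes i and j, and place it at height d(Ci,Cj)/2.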
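The clustering steps can be sketched in Python as follows. This is an illustrative sketch, not the lecture's own code: the function name `upgma`, the nested-tuple tree representation `(left, right, height)`, and the example distances are all assumptions. Cluster distances are updated with the size-weighted average of the two merged clusters, which keeps them equal to the average pairwise distance between members.

```python
def upgma(names, d):
    """Return a UPGMA tree as nested tuples (left, right, height); leaves are names.

    d maps (a, b) to the distance between leaves a and b, with a listed
    before b in names.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (subtree, size)
    dist = {(i, j): d[names[i], names[j]]
            for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(dist, key=dist.get)            # closest pair of clusters
        d_ij = dist[i, j]
        (tree_i, n_i), (tree_j, n_j) = clusters.pop(i), clusters.pop(j)
        # Size-weighted average: distance from the merged cluster to each other one.
        merged = {(m, next_id): (n_i * dist[min(i, m), max(i, m)]
                                 + n_j * dist[min(j, m), max(j, m)]) / (n_i + n_j)
                  for m in clusters}
        dist = {key: v for key, v in dist.items() if i not in key and j not in key}
        dist.update(merged)
        # The new node sits at half the distance between the merged clusters.
        clusters[next_id] = ((tree_i, tree_j, d_ij / 2), n_i + n_j)
        next_id += 1
    tree, _size = next(iter(clusters.values()))
    return tree

# Hypothetical clock-like distances on four leaves.
names = ["A", "B", "C", "D"]
d = {("A", "B"): 2, ("C", "D"): 4,
     ("A", "C"): 8, ("A", "D"): 8, ("B", "C"): 8, ("B", "D"): 8}
tree = upgma(names, d)
```

Here A and B are merged first at height 1, then C and D at height 2, and the two clusters are joined at height 4.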

Example

UPGMA construction on five objects. The length of an edge is its (vertical) height.

[Figure: UPGMA tree on leaves 1-5 with internal nodes 6-9; the heights 0.5d(2,3) and 0.5d(7,8) are marked.]

Molecular clock

This phylogenetic tree has all leaves at the same level. When this property holds, the phylogenetic tree is said to satisfy a molecular clock. Namely, the time from a speciation event to the formation of current species is identical for all paths (an assumption that does not hold in reality).

Molecular Clock

UPGMA constructs trees that satisfy a molecular

clock, even if the true tree does not satisfy a

molecular clock.

UPGMA

Restrictive Correctness of UPGMA

Proposition: If the distance function is derived by adding edge distances in a tree T with a molecular clock, then UPGMA will reconstruct T.
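One way to test whether a given distance matrix could come from a tree with a molecular clock is the standard three-point (ultrametric) condition: among the three pairwise distances of any three leaves, the two largest are equal. This condition is not stated on the slides; it is used here only as an assumption-labeled check, and the name `is_ultrametric` and both example matrices are hypothetical.

```python
from itertools import combinations

def is_ultrametric(names, d, tol=1e-9):
    """True if, for every triple of leaves, the two largest distances agree.

    d maps (a, b) to a distance, with a listed before b in names.
    """
    for a, b, c in combinations(names, 3):
        _, middle, largest = sorted((d[a, b], d[a, c], d[b, c]))
        if largest - middle > tol:
            return False
    return True

names = ["A", "B", "C", "D"]

# Distances read off a tree in which all leaves are at the same level.
clock = {("A", "B"): 2, ("C", "D"): 4,
         ("A", "C"): 8, ("A", "D"): 8, ("B", "C"): 8, ("B", "D"): 8}

# Distances read off a tree with leaf edges A=1, B=4, C=1, D=4 and an
# internal edge of length 1 between the parent of A,B and the parent of C,D.
no_clock = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
            ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}

clock_ok = is_ultrametric(names, clock)
no_clock_ok = is_ultrametric(names, no_clock)
```

The first matrix passes the check; the second fails it, since the triple A, B, C has distances 3, 5, 6.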

Additivity

- A molecular clock defines additive distances; namely, distances between objects that can be realized as path lengths in a tree.

Basic property of Additivity

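- Suppose input distances are additive.
- For any three leaves i, j, m, the paths between them meet at a node k; let a, b, c be the distances from k to i, j, m respectively:
  $d(i,j) = a + b, \quad d(i,m) = a + c, \quad d(j,m) = b + c$
- Thus
  $d(k,m) = c = \tfrac{1}{2}\,\bigl(d(i,m) + d(j,m) - d(i,j)\bigr)$

[Figure: leaves i, j, m joined at node k by edges of lengths a, b, c.]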

Constructing additive trees: The neighbor finding problem

- Can we use this fact to construct trees assuming only additivity (but not a molecular clock)?

Yes. The formula $d(k,m) = \tfrac{1}{2}\,\bigl(d(i,m) + d(j,m) - d(i,j)\bigr)$ shows that if we knew that i and j are neighboring leaves, then we can construct their parent node k and compute the distances of k to all other leaves m. We remove nodes i,j and add k.
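A minimal sketch of this single step, assuming we are told that i and j are neighboring leaves: the distance from their parent k to every other leaf m is taken as half of d(i,m) + d(j,m) - d(i,j). The function name `collapse_neighbors`, the label "K", and the example tree are hypothetical.

```python
def collapse_neighbors(d, leaves, i, j, k):
    """Replace neighboring leaves i and j by their parent k.

    d is a symmetric dictionary (a, b) -> distance. Returns the new list of
    leaves and a new symmetric distance dictionary over them.
    """
    rest = [m for m in leaves if m not in (i, j)]
    new_d = {(a, b): d[a, b] for a in rest for b in rest if a != b}
    for m in rest:
        new_d[k, m] = new_d[m, k] = 0.5 * (d[i, m] + d[j, m] - d[i, j])
    return rest + [k], new_d

# Additive distances from a tree with leaf edges A=1, B=4, C=1, D=4 and an
# internal edge of length 1 between the parent of A,B and the parent of C,D.
pairs = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}
d = {}
for (a, b), value in pairs.items():
    d[a, b] = d[b, a] = value

leaves, d2 = collapse_neighbors(d, ["A", "B", "C", "D"], "A", "B", "K")
```

In this example the parent K of A and B ends up at distance 2 from C and 5 from D, which matches the edge lengths of the tree the distances were read from.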

Neighbor Finding

- How can we find from distances alone that a pair of nodes i,j are neighboring leaves?
- Closest nodes aren't necessarily neighbors.

Next we show one way to find neighbors from additive distances.

Neighbor Finding

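Theorem (Saitou-Nei): For leaves i and j, let $r_i = \sum_{k} d(i,k)$ (the sum over all leaves k) and $D(i,j) = (L-2)\,d(i,j) - (r_i + r_j)$, where L is the number of leaves. Assume all edge weights are positive. If D(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.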
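The following sketch illustrates the theorem on a small hypothetical tree. It assumes the criterion D(i,j) = (L-2) d(i,j) - (r_i + r_j), with r_i the sum of distances from leaf i to all other leaves, which is the form used in the "Definition" step of the proof; the name `nj_criterion` and the example distances are illustrative.

```python
from itertools import combinations

def nj_criterion(leaves, d):
    """Return {(i, j): D(i, j)} with D(i,j) = (L-2) d(i,j) - (r_i + r_j)."""
    n = len(leaves)
    r = {a: sum(d[a, m] for m in leaves if m != a) for a in leaves}
    return {(a, b): (n - 2) * d[a, b] - (r[a] + r[b])
            for a, b in combinations(leaves, 2)}

# Additive distances from a tree with leaf edges A=1, B=4, C=1, D=4 and an
# internal edge of length 1; the neighboring pairs are (A, B) and (C, D).
pairs = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}
d = {}
for (a, b), value in pairs.items():
    d[a, b] = d[b, a] = value

leaves = ["A", "B", "C", "D"]
D = nj_criterion(leaves, d)
closest = min(combinations(leaves, 2), key=lambda p: d[p])  # smallest d(i,j)
best = min(D, key=D.get)                                    # smallest D(i,j)
```

Here the closest pair is (A, C), which are not neighbors, while D is minimized by the true neighbor pairs (A, B) and (C, D), both with value -24.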

Neighbor Joining Algorithm

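- Set L to contain all leaves
- Iteration:
  - Choose i,j such that D(i,j) is minimal
  - Create new node k, and set, for every other node m in L, $d(k,m) = \tfrac{1}{2}\,\bigl(d(i,m) + d(j,m) - d(i,j)\bigr)$
  - Remove i,j from L, and add k
- Terminate: when |L| = 2, connect the two remaining nodes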
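A compact Python sketch of these steps is given below, under stated assumptions. The selection criterion is D(i,j) = (L-2) d(i,j) - (r_i + r_j) with r_i the sum of distances from i. The edge length from i to the new node k, computed as half of d(i,j) + (r_i - r_j)/(L-2), follows the usual textbook description of neighbor joining and does not appear on the slides. The function name `neighbor_joining`, the node labels `node0`, `node1`, ..., and the example distances are hypothetical.

```python
from itertools import combinations

def neighbor_joining(leaves, d):
    """Return a list of edges (u, v, length) built by neighbor joining.

    d is a symmetric dictionary (a, b) -> distance over the leaves.
    """
    d = dict(d)                       # work on a copy
    nodes = list(leaves)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(d[a, m] for m in nodes if m != a) for a in nodes}
        # Choose the pair minimizing D(i,j) = (n-2) d(i,j) - (r_i + r_j).
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * d[p] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"
        next_id += 1
        # Edge lengths from i and j to their new parent k (textbook formula).
        d_ik = 0.5 * (d[i, j] + (r[i] - r[j]) / (n - 2))
        edges += [(i, k, d_ik), (j, k, d[i, j] - d_ik)]
        # Distances from k to every remaining node m.
        for m in nodes:
            if m not in (i, j):
                d[k, m] = d[m, k] = 0.5 * (d[i, m] + d[j, m] - d[i, j])
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, d[a, b]))     # connect the two remaining nodes
    return edges

# Additive distances from a tree with leaf edges A=1, B=4, C=1, D=4 and an
# internal edge of length 1 between the parent of A,B and the parent of C,D.
pairs = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}
d = {}
for (a, b), value in pairs.items():
    d[a, b] = d[b, a] = value

edges = neighbor_joining(["A", "B", "C", "D"], d)
```

On this additive input the sketch recovers the original edge lengths: A and B hang off `node0` with lengths 1 and 4, C and D hang off `node1` with lengths 1 and 4, and the two internal nodes are joined by an edge of length 1.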

Neighbor Finding

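Notations used in the proof:

- P(i,j) = the path from vertex i to vertex j. Example: P(D,C) = (e1,e2,e3) = (D,E,F,C).
- For a vertex i and an edge e: $N_i(e) = |\{k : e \text{ is on } P(i,k)\}|$, where k ranges over the leaves. Example: $N_D(e_1) = 3$, $N_D(e_2) = 2$, $N_D(e_3) = 1$, $N_C(e_1) = 1$.

[Figure: example tree containing the path D, E, F, C.]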

Neighbor Finding

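Notation: For an edge e = (i,m), we denote d(i,m) by d(e).

Lemma: Since $r_i = \sum_e N_i(e)\, d(e)$, and $N_i(e) = N_m(e)$ for every edge e that is not on P(i,m), for any two leaves i and m we have $r_i - r_m = \sum_{e \in P(i,m)} \bigl(N_i(e) - N_m(e)\bigr)\, d(e)$.

[Figure: leaves i and j joined by the path (i,l,...,k,j); the remainder of the tree is labeled "Rest of T".]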

Neighbor Finding

Proof of Theorem: Assume by contradiction that D(i,j) is minimal for i,j which are not neighboring leaves. Let (i,l,...,k,j) be the path from i to j. Let T1 and T2 be the subtrees rooted at k and l, respectively (the parts of the tree hanging off the path at k and at l). Let |T| denote the number of leaves in T.

Neighbor Finding

Case 1: i or j has a neighboring leaf. WLOG j and m are such leaves (both adjacent to k).

A. $D(i,j) - D(m,j) = (L-2)\,\bigl(d(i,j) - d(j,m)\bigr) - (r_i + r_j) + (r_m + r_j)$ [Definition]
$\qquad = (L-2)\,\bigl(d(i,k) - d(k,m)\bigr) + r_m - r_i$ [Figure]

B. $r_m - r_i \ge (L-2)\,\bigl(d(k,m) - d(i,l)\bigr) + (4-L)\,d(k,l)$ [Lemma + Figure]
(since for each edge $e \in P(k,l)$, $N_m(e) \ge 2$ and $N_i(e) \le L-2$, so $N_m(e) - N_i(e) \ge 4-L$)

Substituting B in A: $D(i,j) - D(m,j) \ge (L-2)\,\bigl(d(i,k) - d(i,l)\bigr) + (4-L)\,d(k,l) = 2\,d(k,l) > 0$, contradicting the minimality assumption.

Neighbor Finding

Case 2: Not case 1. Then both T1 and T2 contain 2 neighboring leaves. We show that if D(i,j) is minimal, then we must have both |T1| > |T2| and |T2| > |T1|, which is a contradiction; hence D(i,j) is not minimal.

We prove that |T1| > |T2| by assuming that |T1| ≤ |T2| and reaching a contradiction. The proof that |T2| > |T1| is similar. Let n,m be neighboring leaves in T1, and let p be their common parent.

Neighbor Finding

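A. $0 \le D(m,n) - D(i,j) = (L-2)\,\bigl(d(m,n) - d(i,j)\bigr) + (r_i + r_j) - (r_m + r_n)$

B. $r_j - r_m < (L-2)\,\bigl(d(j,k) - d(m,p)\bigr) + (|T_1| - |T_2|)\,d(k,p)$ (because $N_j(e) - N_m(e) < |T_1| - |T_2|$ for each edge $e \in P(k,p)$).

C. $r_i - r_n < (L-2)\,\bigl(d(i,k) - d(n,p)\bigr) + (|T_1| - |T_2|)\,d(l,p)$

Adding B and C, noting that $d(l,p) > d(k,p)$ and using the assumption $|T_1| - |T_2| \le 0$:

D. $(r_i + r_j) - (r_m + r_n) < (L-2)\,\bigl(d(i,j) - d(n,m)\bigr) + 2\,(|T_1| - |T_2|)\,d(k,p)$

Substituting D in the right-hand side of A: $0 \le D(m,n) - D(i,j) < 2\,(|T_1| - |T_2|)\,d(k,p)$, hence $|T_1| - |T_2| > 0$, a contradiction.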