V4 Prediction of Phylogenies based on single genes

Material of this lecture is taken from chapter 6 of D. W. Mount, Bioinformatics, and from Joseph Felsenstein's book.

A phylogenetic analysis of a family of related

nucleic acid or protein sequences is a

determination of how the family might have been

derived during evolution. Placing the sequences

as outer branches on a tree, the evolutionary

relationships among the sequences are depicted.

Phylogenies, or evolutionary trees, are the basic

structures to describe differences between

species, and to analyze them statistically. They

have been around for over 140 years. Statistical,

computational, and algorithmic work on them is

ca. 40 years old.

3 main approaches in single-gene phylogeny:
- maximum parsimony
- distance matrix
- maximum likelihood (not covered here)

Popular programs:
- PHYLIP (Phylogeny Inference Package; J. Felsenstein)
- PAUP (Phylogenetic Analysis Using Parsimony; Sinauer Assoc.)

Methods for Single-Gene Phylogeny

(1) Choose a set of related sequences.
(2) Obtain a multiple sequence alignment.
(3) Is there strong sequence similarity? Yes: use maximum parsimony methods. No: go to (4).
(4) Is there clearly recognizable sequence similarity? Yes: use distance methods. No: use maximum likelihood methods.
(5) Analyze how well the data support the prediction.

Parsimony methods

Edwards & Cavalli-Sforza (1963): that evolutionary tree is to be preferred that involves the minimum net amount of evolution.
→ Seek the phylogeny on which, when we reconstruct the evolutionary events leading to our data, there are as few events as possible.
(1) We must be able to make a reconstruction of events, involving as few events as possible, for any proposed phylogeny.
(2) We must be able to search among all possible phylogenies for the one or ones that minimize the number of events.

A simple example

Suppose that we have 5 species, each of which has been scored for 6 characters with two states (0, 1). We will allow changes 0 → 1 and 1 → 0. The initial state at the root of a tree may be either state 0 or state 1.

Evaluating a particular tree

To find the most parsimonious tree, we must have

a way of calculating how many changes of state

are needed on a given tree. This tree

represents the phylogeny of character

1. Reconstruct phylogeny of character 1 on this

tree.

Evaluating a particular tree

There are 2 equally good reconstructions, each

involving just one change of character

state. They differ in which state they assume at

the root of the tree, and they differ in which

branch they place the single change.

Evaluating a particular tree

3 equally good reconstructions for character 2,

which needs two changes of state.

Evaluating a particular tree

A single reconstruction for character 3,

involving one change of state.

Evaluating a particular tree

On the right: 2 reconstructions for characters 4 and 5, which have identical patterns. A single reconstruction for character 6, involving one change of state.

Evaluating a particular tree

The total number of changes of character state needed on this tree is 1 + 2 + 1 + 2 + 2 + 1 = 9.
[Figure: reconstruction of the changes in state on this tree]

Evaluating a particular tree

Alternative tree with only 8 changes of state. The minimum number of changes of state would be 6, as there are 6 characters that can each have 2 states. Thus, we have two extra changes. This is called homoplasy.

Evaluating a particular tree

The figure on the right shows another tree that also requires 8 changes. These two most parsimonious trees are the same tree when the roots of the trees are removed.

Methods of rooting the tree

There are many rooted trees, one for each branch of this unrooted tree, and all have the same number of changes of state. The number of changes of state depends only on the unrooted tree, and not at all on where the tree is then rooted.
Biologists want to think of trees as rooted → we need a method to place the root in an otherwise unrooted tree:
(1) Outgroup criterion
(2) Use a molecular clock.

Outgroup criterion

Assumes that we know the answer in

advance. Suppose that we have a number of great

apes, plus a single old-world monkey. Suppose

that we know that the great apes are a

monophyletic group. If we infer a tree of these

species, we know that the root must be placed on

the lineage that connects the old-world monkey

(outgroup) to the great apes (ingroup).

Molecular clock

If an equal amount of change had occurred on all lineages, there would be a point on the tree that has equal amounts of change (branch lengths) from there to all tips. With a molecular clock, it is only the expected amounts of change that are equal; the observed amounts may not be.
→ Using one of various methods, find a root that makes the amounts of change approximately equal on all lineages.

Branch lengths

Having found an unrooted tree, locate the changes on it and find out how many occur in each of the branches. The location of the changes can be ambiguous.
→ Average over all possible reconstructions of each character for which there is ambiguity in the unrooted tree.
The fractional numbers in some branches of the left tree add up to the (integer) number of changes (right).

Open questions

- Particularly for larger data sets, we need to know how to count the number of changes of state by use of an algorithm.
- We need an algorithm for reconstructing states at interior nodes of the tree.
- We need to know how to search among all possible trees for the most parsimonious ones, and how to infer branch lengths.
- So far we have only considered a simple model of 0/1 characters. DNA sequences have 4 states, protein sequences 20 states.
- Justification: is it reasonable to use the parsimony criterion? If so, what does it implicitly assume about the biology?
- What is the statistical status of finding the most parsimonious tree? Can we make statements about how well supported it is compared to other trees?

Counting evolutionary changes

2 related dynamic programming algorithms: Fitch (1971) and Sankoff (1975)
- evaluate a phylogeny character by character
- for each character, consider it as a rooted tree, placing the root wherever seems appropriate
- update some information down the tree; when we reach the bottom, the number of changes of state is available.

These algorithms do not actually locate changes or reconstruct interior states at the nodes of the tree.

Fitch algorithm

Intended to count the number of changes in a bifurcating tree with nucleotide sequence data, in which any one of the 4 bases (A, C, G, T) can change to any other. At a particular site, we have observed the bases C, A, C, A and G in the 5 species, given here in the order in which they appear in the tree, left to right.

Fitch algorithm

For the left two, at the node that is their immediate common ancestor, attempt to construct the intersection of the two sets. But as {C} ∩ {A} = ∅, instead construct the union {C} ∪ {A} = {A, C} and count 1 change of state.
For the rightmost pair of species, assign the set {A, G} to the common ancestor, since {A} ∩ {G} = ∅, and count another change of state.
... proceed to the bottom.
Total number of changes: 3. The algorithm works on arbitrarily large trees.
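The set operations above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the nested-tuple tree encoding and the function name `fitch` are hypothetical, and the topology `((C, A), (C, (A, G)))` is an assumption, because the figure is not reproduced here. It was chosen so that the left two tips and the rightmost two tips are sister pairs, as the text describes.

```python
def fitch(tree):
    """Return (state_set, changes) for a nested-tuple bifurcating tree.

    A tip is a one-letter string; an interior node is a pair (left, right).
    """
    if isinstance(tree, str):
        return {tree}, 0
    left_set, left_changes = fitch(tree[0])
    right_set, right_changes = fitch(tree[1])
    common = left_set & right_set
    if common:
        # Non-empty intersection: keep it, no change of state is counted.
        return common, left_changes + right_changes
    # Empty intersection: take the union and count one change of state.
    return left_set | right_set, left_changes + right_changes + 1


# Assumed topology for the five tips C, A, C, A, G (left to right).
tree = (("C", "A"), ("C", ("A", "G")))
root_set, changes = fitch(tree)
print(sorted(root_set), changes)  # ['A', 'C'] 3
```

Under this assumed topology the count is 3, in agreement with the total given above. Each node is visited once, which is why the work grows in proportion to the number of tips.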

Complexity of Fitch algorithm

The Fitch algorithm can be carried out in a number of operations that is proportional to the number of species (tips) on the tree.
Don't we need to multiply this by the number of sites n?
- Any site that is invariant (which has the same base in all species, e.g. AAAAA) can be dropped.
- Other sites with a single variant base (e.g. ATAAA) will only require a single change of state on all trees. These too can be dropped.
- For sites with the same pattern (e.g. CACAG) that we have already seen, simply use the number of changes previously computed.
- Patterns with the same symmetry (e.g. TCTCA and CACAG) need the same number of changes.

→ The numerical effort rises slower than linearly with the number of sites.
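The shortcuts listed above amount to compressing the alignment into distinct site patterns before any tree is evaluated. The sketch below shows one way to do this; the function names `canonical` and `compress_sites` are hypothetical, and the symmetry rule assumes that all changes of state cost the same, as in the Fitch algorithm.

```python
from collections import Counter


def canonical(pattern):
    """Relabel bases by order of first appearance, e.g. 'TCTCA' -> (0, 1, 0, 1, 2)."""
    labels = {}
    return tuple(labels.setdefault(base, len(labels)) for base in pattern)


def compress_sites(columns):
    """Return (fixed_changes, pattern_counts) for a list of alignment columns.

    fixed_changes counts sites that need exactly one change on every tree;
    pattern_counts maps each remaining canonical pattern to its multiplicity.
    """
    fixed_changes = 0
    pattern_counts = Counter()
    for column in columns:
        base_counts = Counter(column)
        if len(base_counts) == 1:
            continue  # invariant site: zero changes on every tree
        if len(base_counts) == 2 and min(base_counts.values()) == 1:
            fixed_changes += 1  # single variant base: one change on every tree
            continue
        pattern_counts[canonical(column)] += 1
    return fixed_changes, pattern_counts


columns = ["AAAAA", "ATAAA", "CACAG", "TCTCA", "CACAG"]
fixed, patterns = compress_sites(columns)
print(fixed, dict(patterns))
```

For a given tree, the number of changes is then computed once per distinct pattern, multiplied by its count, and added to `fixed_changes`.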

Sankoff algorithm

The Fitch algorithm is very effective, but it is not obvious why it works. The Sankoff algorithm is more complex, but its structure is more apparent.
Assume that we have a table of the costs of changes c_ij between each character state i and each other state j. Compute the total cost of the most parsimonious combination of events by computing it for each character.
For a given character, compute for each node k in the tree a quantity S_k(i). This is interpreted as the minimal cost, given that node k is assigned state i, of all the events upwards from node k in the tree.

Sankoff algorithm

If we can compute these values for all nodes, we can also compute them for the bottom node (node 0) in the tree. Simply choose the minimum of these values,

$$S = \min_i S_0(i),$$

which is the total cost we seek, the minimum cost of evolution for this character.
At the tips of the tree, the S(i) are easy to compute. The cost is 0 if the observed state is state i, and infinite otherwise. If we have observed an ambiguous state, the cost is 0 for all states that it could be, and infinite for the rest.
Now we just need an algorithm to calculate the S(i) for the immediate common ancestor of two nodes.

Sankoff algorithm

Suppose that the two descendant nodes are called l and r (for left and right). For their immediate common ancestor, node a, we compute

$$S_a(i) = \min_j \left[ c_{ij} + S_l(j) \right] + \min_k \left[ c_{ik} + S_r(k) \right].$$

The smallest possible cost given that node a is in state i is the cost c_ij of going from state i to state j in the left descendant lineage, plus the cost S_l(j) of events further up in the subtree given that node l is in state j. Select the value of j that minimizes that sum. Do the same calculation for the right descendant lineage.
→ The sum of these two minima is the smallest possible cost for the subtree above node a, given that node a is in state i.
Apply the equation successively to each node in the tree, working downwards. Finally compute all S_0(i) and take their minimum over i to find the minimum cost for the whole tree.

Sankoff algorithm

The array (6, 6, 7, 8) at the bottom of the tree has a minimum value of 6. This is the minimum total cost of the tree for this site.
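The recursion can be sketched as follows. Two things in this example are assumptions rather than part of the text: the tree topology `((C, A), (C, (A, G)))` for the tips C, A, C, A, G, and the cost table (1 for a transition, 2.5 for a transversion). With these choices the sketch gives the array (6, 6, 7, 8) at the bottom node, in the state order A, C, G, T. The function names are hypothetical.

```python
import math

STATES = "ACGT"
PURINES = {"A", "G"}


def cost(i, j):
    """Assumed cost table: 0 for no change, 1 for transitions, 2.5 for transversions."""
    if i == j:
        return 0.0
    return 1.0 if (i in PURINES) == (j in PURINES) else 2.5


def sankoff(tree):
    """Return [S(i) for i in STATES] for a nested-tuple tree (tips are one-letter strings)."""
    if isinstance(tree, str):
        # Tip: cost 0 for the observed state, infinite for all others.
        return [0.0 if state == tree else math.inf for state in STATES]
    s_left, s_right = sankoff(tree[0]), sankoff(tree[1])
    # S_a(i) = min_j [c_ij + S_l(j)] + min_k [c_ik + S_r(k)]
    return [
        min(cost(i, j) + s_left[n] for n, j in enumerate(STATES))
        + min(cost(i, k) + s_right[n] for n, k in enumerate(STATES))
        for i in STATES
    ]


tree = (("C", "A"), ("C", ("A", "G")))  # assumed topology
root_costs = sankoff(tree)
print(root_costs, min(root_costs))  # [6.0, 6.0, 7.0, 8.0] 6.0
```

Setting every off-diagonal cost to 1 instead would make the minimum at the bottom node equal to the Fitch count for the same tree.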

Finding the best tree by heuristic search

The obvious method for searching for the most parsimonious tree is to consider ALL trees and evaluate each one. Unfortunately, the number of possible trees is generally too large.
→ Use heuristic search methods that attempt to find the best trees without looking at all possible trees.
(1) Make an initial estimate of the tree and make small rearrangements of it to find neighboring trees.
(2) If any of these neighbors are better, move to them and continue the search.
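To see why exhaustive search is infeasible, it helps to count the trees. The sketch below uses the standard counting result for bifurcating trees with labeled tips, which is not stated in the text above: there are (2n-5)!! unrooted and (2n-3)!! rooted trees for n tips. The function names are hypothetical.

```python
def num_unrooted_trees(n_tips):
    """Number of unrooted bifurcating trees with n labeled tips: (2n-5)!! for n >= 3."""
    count = 1
    for factor in range(3, 2 * n_tips - 4, 2):
        count *= factor
    return count


def num_rooted_trees(n_tips):
    """Number of rooted bifurcating trees: (2n-3)!!, the unrooted count for n + 1 tips."""
    return num_unrooted_trees(n_tips + 1)


for n in (5, 10, 20):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

With 5 tips there are 15 unrooted trees, with 10 tips already 2,027,025, so evaluating every tree quickly becomes impractical as species are added.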

Distance matrix methods

Introduced by Cavalli-Sforza & Edwards (1967) and by Fitch & Margoliash (1967). The general idea seems as if it would not work very well (Felsenstein):
- calculate a measure of the distance between each pair of species
- find a tree that predicts the observed set of distances as closely as possible.

All information from higher-order combinations of character states is left out. But computer simulation studies show that the amount of lost information is remarkably small.
Best way to think about distance matrix methods: consider the distances as estimates of the branch length separating each pair of species.

Least square method

- observed table (matrix) of distances Dij
- any particular tree leads to a predicted set of distances dij.

Least square method

Measure of the discrepancy between the observed and expected distances:

$$Q = \sum_{i} \sum_{j \neq i} w_{ij} \left( D_{ij} - d_{ij} \right)^2$$

where the weights wij can be defined in different ways:
- wij = 1 (Cavalli-Sforza & Edwards, 1967)
- wij = 1/Dij^2 (Fitch & Margoliash, 1967)
- wij = 1/Dij (Beyer et al., 1974)

Aim: find the tree topology and branch lengths that minimize Q.
The equation above is quadratic in the branch lengths. Take the derivative with respect to the branch lengths, set it to 0, and solve the resulting system of linear equations. The solution will minimize Q.
Doug Brutlag's course
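As a small worked illustration of the discrepancy measure Q under the three weighting schemes, consider the sketch below. The distances are made-up numbers for three species, the function name `q_statistic` is hypothetical, and each unordered pair is summed once (summing over ordered pairs would double Q without changing which tree minimizes it).

```python
def q_statistic(observed, predicted, weight=lambda D: 1.0):
    """Weighted sum of squared differences between observed and predicted distances."""
    return sum(weight(D) * (D - predicted[pair]) ** 2 for pair, D in observed.items())


# Hypothetical observed distances Dij and tree-predicted distances dij.
observed = {("A", "B"): 2.0, ("A", "C"): 4.0, ("B", "C"): 4.0}
predicted = {("A", "B"): 2.0, ("A", "C"): 5.0, ("B", "C"): 3.0}

print(q_statistic(observed, predicted))                       # wij = 1
print(q_statistic(observed, predicted, lambda D: 1 / D**2))   # wij = 1/Dij^2
print(q_statistic(observed, predicted, lambda D: 1 / D))      # wij = 1/Dij
```

The weighted variants down-weight large distances, which tend to be estimated less precisely.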

Least square method

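[Figure: unrooted tree of the five species A to E with branch lengths v1 to v7; v1 to v5 lead to the tips, v6 and v7 are interior branches]

Number the species in alphabetical order.
The expected distance between species A and D is d14 = v1 + v7 + v4.
The expected distance between species B and E is d25 = v5 + v6 + v7 + v2.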

Least square method

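Number all branches of the tree and introduce an indicator variable xijk:
xijk = 1 if branch k lies in the path from species i to species j,
xijk = 0 otherwise.
The expected distance between i and j will then be

$$d_{ij} = \sum_k x_{ijk} v_k$$

and

$$Q = \sum_{i} \sum_{j \neq i} w_{ij} \Big( D_{ij} - \sum_k x_{ijk} v_k \Big)^2 .$$

For the case with wij = 1 for all i, j, setting the derivative with respect to each branch length to zero gives

$$\frac{dQ}{dv_k} = -2 \sum_{i} \sum_{j \neq i} x_{ijk} \Big( D_{ij} - \sum_m x_{ijm} v_m \Big) = 0 .$$

Note: this is one equation for each branch k.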

Least square method

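$$
\begin{aligned}
D_{AB} + D_{AC} + D_{AD} + D_{AE} &= 4v_1 + v_2 + v_3 + v_4 + v_5 + 2v_6 + 2v_7 \\
D_{AB} + D_{BC} + D_{BD} + D_{BE} &= v_1 + 4v_2 + v_3 + v_4 + v_5 + 2v_6 + 3v_7 \\
D_{AC} + D_{BC} + D_{CD} + D_{CE} &= v_1 + v_2 + 4v_3 + v_4 + v_5 + 3v_6 + 2v_7 \\
D_{AD} + D_{BD} + D_{CD} + D_{DE} &= v_1 + v_2 + v_3 + 4v_4 + v_5 + 2v_6 + 3v_7 \\
D_{AE} + D_{BE} + D_{CE} + D_{DE} &= v_1 + v_2 + v_3 + v_4 + 4v_5 + 3v_6 + 2v_7 \\
D_{AC} + D_{AE} + D_{BC} + D_{BE} + D_{CD} + D_{DE} &= 2v_1 + 2v_2 + 3v_3 + 2v_4 + 3v_5 + 6v_6 + 4v_7 \\
D_{AB} + D_{AD} + D_{BC} + D_{CD} + D_{BE} + D_{DE} &= 2v_1 + 3v_2 + 2v_3 + 3v_4 + 2v_5 + 4v_6 + 6v_7
\end{aligned}
$$

Stack up the (4 + 3 + 2 + 1 = 10) Dij, in alphabetical order, into a vector

$$d = (D_{AB}, D_{AC}, D_{AD}, D_{AE}, D_{BC}, D_{BD}, D_{BE}, D_{CD}, D_{CE}, D_{DE})^T .$$

The coefficients xijk are arranged in a matrix X, with each row corresponding to the Dij in that row of d and containing a 1 if branch k occurs on the path between species i and j (and a 0 otherwise).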

Least square method

If we also stack up the 7 vi into a vector v, the previous set of linear equations can be compactly expressed as

$$X^T d = (X^T X)\, v .$$

Multiplying from the left by the inverse of X^T X, one can solve for the least squares branch lengths:

$$v = (X^T X)^{-1} X^T d .$$

This is a standard method of expressing least squares problems in matrix notation and solving them.

Least square method

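When we have weighted least squares, with a diagonal matrix of weights in the same order as the Dij,

$$W = \mathrm{diag}(w_{AB}, w_{AC}, w_{AD}, \ldots, w_{DE}),$$

then the least squares equations can be written

$$X^T W d = (X^T W X)\, v$$

and their solution is

$$v = (X^T W X)^{-1} X^T W d .$$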
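The matrix formulation can be checked numerically with the sketch below, which uses exact fractions and no external libraries. The tree topology is an assumption inferred from the path sums quoted on the earlier slides (interior branch v6 separates C and E from the rest, v7 separates B and D from the rest). The branch lengths `true_v` are made-up numbers, and all function and variable names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

SPECIES = "ABCDE"
# Tips cut off from the rest of the tree by each branch v1..v7 (assumed topology).
BRANCH_SPLITS = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}, {"C", "E"}, {"B", "D"}]
PAIRS = list(combinations(SPECIES, 2))  # AB, AC, ..., DE in alphabetical order

# x_ijk = 1 if branch k separates i from j, i.e. lies on the path between them.
X = [[int((i in side) != (j in side)) for side in BRANCH_SPLITS] for i, j in PAIRS]


def solve(a, b):
    """Solve a v = b by Gauss-Jordan elimination with exact fractions."""
    n = len(a)
    m = [[Fraction(x) for x in row] + [Fraction(y)] for row, y in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [x / m[col][col] for x in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [row[n] for row in m]


def least_squares_branch_lengths(d, weights=None):
    """Return v = (X^T W X)^-1 X^T W d; W defaults to the identity matrix."""
    w = weights or [1] * len(d)
    rows, k = range(len(X)), len(X[0])
    xtwx = [[sum(w[r] * X[r][a] * X[r][b] for r in rows) for b in range(k)]
            for a in range(k)]
    xtwd = [sum(w[r] * X[r][a] * d[r] for r in rows) for a in range(k)]
    return solve(xtwx, xtwd)


true_v = [3, 2, 4, 1, 2, 1, 2]  # hypothetical branch lengths v1..v7
d = [sum(x * v for x, v in zip(row, true_v)) for row in X]  # exact tree distances
print(least_squares_branch_lengths(d))
```

Because the distances here are generated from the tree itself, both the unweighted and the weighted solution recover `true_v` exactly. With real data the two generally differ, and some estimated branch lengths can come out negative.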

Finding the least squares tree topology

Now that we are able to assign branch lengths to each tree topology, we need to search among tree topologies. This can be done by the same methods of heuristic search that were presented for the maximum parsimony method.
Note: no one has so far presented a branch-and-bound method for finding the least squares tree exactly. Day (1986) has shown that this problem is NP-complete. The search is not only among tree topologies, but also among branch lengths.

neighbor-joining method

Introduced by Saitou and Nei (1987). The algorithm works by clustering. It does not assume a molecular clock but approximates the minimum evolution model.
Minimum evolution model: among possible tree topologies, choose the one with minimal total branch length.
Neighbor-joining, like the least squares method, is guaranteed to recover the true tree if the distance matrix is an exact reflection of the tree.

neighbor-joining method

(1) For each tip, compute

$$u_i = \sum_{j \neq i} \frac{D_{ij}}{n - 2} .$$

(2) Choose the i and j for which Dij - ui - uj is smallest.
(3) Join items i and j. Compute the branch length from i to the new node (vi) and from j to the new node (vj) as

$$v_i = \tfrac{1}{2} D_{ij} + \tfrac{1}{2} (u_i - u_j), \qquad v_j = \tfrac{1}{2} D_{ij} + \tfrac{1}{2} (u_j - u_i) .$$

(4) Compute the distance between the new node (ij) and each of the remaining tips k as

$$D_{(ij),k} = \frac{D_{ik} + D_{jk} - D_{ij}}{2} .$$

(5) Delete tips i and j from the tables and replace them by the new node, (ij), which is now treated as a tip.
(6) If more than 2 nodes remain, go back to step (1). Otherwise, connect the two remaining nodes (e.g. l and m) by a branch of length Dlm.
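Steps (1) to (6) can be sketched in Python as below, using the standard neighbor-joining update formulas. The distance matrix is a made-up additive example, and the edge-list output format and the function name `neighbor_joining` are illustrative choices, not part of the original method description. Exact fractions are used so that ties and branch lengths are not affected by rounding.

```python
from fractions import Fraction
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a list of (node, node, branch_length) edges.

    dist maps frozenset({a, b}) to the distance between items a and b.
    """
    d = {pair: Fraction(x) for pair, x in dist.items()}
    nodes, edges = list(names), []
    while len(nodes) > 2:
        n = len(nodes)
        # (1) u_i: sum of distances from i to all other items, divided by n - 2
        u = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) / (n - 2)
             for i in nodes}
        # (2) choose the pair with the smallest D_ij - u_i - u_j
        i, j = min(combinations(nodes, 2),
                   key=lambda p: d[frozenset(p)] - u[p[0]] - u[p[1]])
        d_ij = d[frozenset((i, j))]
        new = f"({i},{j})"
        # (3) branch lengths from i and j to the new node
        edges.append((i, new, d_ij / 2 + (u[i] - u[j]) / 2))
        edges.append((j, new, d_ij / 2 + (u[j] - u[i]) / 2))
        # (4) distances from the new node to the remaining items
        rest = [k for k in nodes if k not in (i, j)]
        for k in rest:
            d[frozenset((new, k))] = (d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij) / 2
        # (5) replace i and j by the new node, which is now treated as a tip
        nodes = rest + [new]
    # (6) connect the last two nodes by a branch of length D_lm
    l, m = nodes
    edges.append((l, m, d[frozenset((l, m))]))
    return edges


# Hypothetical additive distances between five tips a..e.
upper = {"ab": 5, "ac": 9, "ad": 9, "ae": 8, "bc": 10,
         "bd": 10, "be": 9, "cd": 8, "ce": 7, "de": 3}
dist = {frozenset(pair): value for pair, value in upper.items()}
edges = neighbor_joining("abcde", dist)
for a, b, length in edges:
    print(a, b, length)
```

In this example the first pair joined is a and b, with branch lengths 2 and 3. Ties in step (2) are broken by taking the first minimal pair; for an exactly additive matrix such as this one, the resulting unrooted tree does not depend on how ties are broken.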

limitation of distance methods

Distance matrix methods are the easiest phylogeny methods to program, and they are very fast.
Distance methods have problems when the evolutionary rates vary strongly. One can correct for this in distance methods as well as in likelihood methods. When the variation of rates is large, these corrections become important.
In likelihood methods, the correction can use information from changes in one part of the tree to inform the correction in others. Once a particular part of the molecule is seen to change rapidly in the primates, this will affect the interpretation of that part of the molecule among the rodents as well.
But a distance matrix method is inherently incapable of propagating the information in this way. Once it is looking at changes within rodents, it has forgotten where changes were seen among primates.