Phylogenetics I - PowerPoint PPT Presentation

About This Presentation

Title: Phylogenetics I

Description:

White-fronted capuchin. Slow loris. Tree shrew. Japanese pipistrelle. Long-tailed bat ... White-fronted capuchin. Slow loris. Squirrel. Dormouse. Cane-rat ... – PowerPoint PPT presentation

Slides: 54

Provided by: hat89

Learn more at: http://darwin.informatics.indiana.edu

Transcript and Presenter's Notes

Title: Phylogenetics I

1
Phylogenetics I
2
Evolution

Evolution of new organisms is driven by
Mutations
The DNA sequence can be changed due to single
base changes, deletion/insertion of DNA segments,
etc.
Selection bias

3
Theory of Evolution

Basic idea
speciation events lead to creation of different
species.
Speciation caused by physical separation into
groups where different genetic variants become
dominant
Any two species share a (possibly distant) common
ancestor

4
The Tree of Life
5
Primate evolution
A phylogeny (also called a phylogenetic tree) is a tree that describes the
sequence of speciation events that led to the formation of a set of
present-day species.
6
Morphological vs. Molecular

Classical phylogenetic analysis uses morphological
features: number of legs, lengths of legs, etc.
Modern biological methods allow the use of molecular
features:
Gene sequences
Protein sequences

7
Morphological topology
(Based on McKenna and Bell, 1997)
Archonta
Ungulata
8
From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use
(e.g., mitochondrial vs. nuclear proteins).
9
Mitochondrial topology
(Based on Pupko et al.)
10
Nuclear topology
(Based on Pupko et al. slide)
(tree by Madsen et al.)
11
Phylogenetic trees

Leaves - current-day species (or taxa, the plural of
taxon)
Internal vertices - hypothetical common ancestors
Edge lengths - time from one speciation to the
next

12
Twists in molecular phylogenies

We have to emphasize that gene/protein sequences
can be homologous for several different reasons:
Orthologs -- sequences diverged after a
speciation event
Paralogs -- sequences diverged after a
duplication event
Xenologs -- sequences diverged after a horizontal
transfer (e.g., by virus)

13
Paralogs
Consider the evolutionary tree of three taxa
and assume that at some point in the past a gene
duplication event occurred.
[Figure: species tree of three taxa with the gene duplication marked]
14
Paralogs
The gene evolution is described by this tree (A and
B are the copies of the same gene).
[Figure: gene tree with a gene duplication followed by speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]
15
Paralogs

If we happen to consider genes 1A, 2B, and 3A of
species 1, 2, 3, we get a wrong tree that does not
represent the phylogeny of the host species.

[Figure: gene tree with the gene duplication and the speciation events (S) marked; leaves 1A, 2A, 3A, 1B, 2B, 3B]
16
Types of Trees

A natural model to consider is that of rooted
trees

Common Ancestor
17
Types of trees

Unrooted tree represents the same phylogeny
without the root node

Depending on the model, data from current day
species does not distinguish between different
placements of the root.
18
Rooted versus unrooted trees
[Figure: unrooted tree on leaves a, b, c]
Represents the three rooted trees
19
Total numbers of trees

For n taxa,
Rooted bifurcating trees:
$(2n-3)!! = \dfrac{(2n-3)!}{2^{n-2}\,(n-2)!}$
Unrooted bifurcating trees:
$(2n-5)!!$
Tree shapes
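
To get a feel for how quickly these counts grow, the two formulas can be evaluated directly. The following is a minimal sketch; the function names are illustrative and not part of the original slides.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k-2) * (k-4) * ... (taken as 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def num_rooted_trees(n: int) -> int:
    """Rooted bifurcating trees on n >= 2 labeled leaves: (2n-3)!!."""
    return double_factorial(2 * n - 3)

def num_unrooted_trees(n: int) -> int:
    """Unrooted bifurcating trees on n >= 3 labeled leaves: (2n-5)!!."""
    return double_factorial(2 * n - 5)

for n in (3, 4, 5, 10):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

Already for 10 taxa there are more than 34 million rooted trees, which is why exhaustive search over tree topologies is impractical.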

20
Positioning Roots in Unrooted Trees

We can estimate the position of the root by
introducing an outgroup:
a set of species that are definitely distant from
all the species of interest

[Figure: tree on leaves Falcon, Aardvark, Bison, Chimp, Dog, Elephant, with the proposed root marked]
21
Type of Data

Distance-based
Input is a matrix of distances between species
Can be the fraction of residues they disagree on, an
alignment score between them, or a similar measure
Character-based
Examine each character (e.g., residue) separately

22
Two Methods of Tree Construction

Distance: a weighted tree that realizes the
distances between the objects.
Parsimony: a tree with a minimum total number of
character changes between nodes.

We start with distance-based methods, considering
the following question: given a set of species
(leaves in a supposed tree) and distances
between them, construct a phylogeny which best
fits the distances.
23
Distance Matrix

Given n species, we can compute the n x n
distance matrix $D_{ij}$
$D_{ij}$ may be defined as the edit distance between a
gene in species i and species j, where the gene
of interest is sequenced for all n species.
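
As an illustration, a distance matrix of this kind can be computed for the four short protein fragments shown on slide 8 (Rat, Rabbit, Gorilla, Cat). This is a minimal sketch that assumes unit costs for substitutions, insertions, and deletions; real analyses typically use model-based distances instead.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance with unit costs, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute or match
        prev = curr
    return prev[-1]

# Sequences taken from slide 8 of the transcript.
sequences = {
    "Rat": "QEPGGLVVPPTDA",
    "Rabbit": "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat": "REPGGLVVPPTEG",
}
names = list(sequences)
D = [[edit_distance(sequences[a], sequences[b]) for b in names] for a in names]

for name, row in zip(names, D):
    print(f"{name:8s}", row)
```

The resulting matrix is symmetric with a zero diagonal; Rat and Gorilla are identical on this fragment, so their distance is 0.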

24
The distance between two sequences

Protein sequences
PAM
BLOSUM
DNA sequences
Jukes-Cantor
HKY
Kimura 2-Parameter

25
General Stationary Time-reversible Model
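$$
R = \begin{pmatrix}
\cdot & \pi_C r_{CA} & \pi_G r_{GA} & \pi_T r_{TA} \\
\pi_A r_{AC} & \cdot & \pi_G r_{GC} & \pi_T r_{TC} \\
\pi_A r_{AG} & \pi_C r_{CG} & \cdot & \pi_T r_{TG} \\
\pi_A r_{AT} & \pi_C r_{CT} & \pi_G r_{GT} & \cdot
\end{pmatrix}
$$
(Diagonal elements such that rows sum to zero)
Time reversibility: $\pi_i r_{ij} = \pi_j r_{ji}$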
26
General Stationary Time-reversible Model

$P(t) = e^{Rt}$
Given rates, one can find transition
probabilities, and vice-versa.
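
A small numerical check can make the relation between rates and transition probabilities concrete. The sketch below assumes the simplest case, a Jukes-Cantor rate matrix with off-diagonal entries u/3 and diagonal -u, approximates the matrix exponential by a truncated Taylor series, and compares one entry with the known closed form 1/4 + 3/4 e^{-4ut/3}. The values of u and t and the helper names are hypothetical.

```python
import math

def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(M, terms=40):
    """Approximate e^M by the truncated Taylor series sum_k M^k / k!."""
    n = len(M)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]                               # M^0 / 0!
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, M)]   # M^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

u, t = 1.0, 0.5  # hypothetical rate and elapsed time
R = [[-u if i == j else u / 3 for j in range(4)] for i in range(4)]
P = mat_exp([[r * t for r in row] for row in R])

closed_form_same = 0.25 + 0.75 * math.exp(-4 * u * t / 3)
print(P[0][0], closed_form_same)
```

A plain Taylor series is adequate here because the matrix is tiny and its entries are small; for general use a library routine for the matrix exponential would be the safer choice.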

27
Jukes-Cantor
$$
R = \begin{pmatrix}
\cdot & u/3 & u/3 & u/3 \\
u/3 & \cdot & u/3 & u/3 \\
u/3 & u/3 & \cdot & u/3 \\
u/3 & u/3 & u/3 & \cdot
\end{pmatrix}
$$
28
Jukes-Cantor

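$P(\text{no mutation}) = e^{-\frac{4}{3}ut}$
$P(\text{at least one mutation}) = 1 - e^{-\frac{4}{3}ut}$
$D_s = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$
$D = ut = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D_s\right)$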
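
The correction can be applied directly to an observed fraction of differing sites. The following is a minimal sketch; the two DNA strings are made up for illustration, and the function names are not from the slides.

```python
import math

def fraction_different(s: str, t: str) -> float:
    """Observed fraction of sites at which two aligned sequences differ (D_s)."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)

def jukes_cantor_distance(ds: float) -> float:
    """Jukes-Cantor distance D = -3/4 ln(1 - 4/3 D_s); needs 0 <= D_s < 3/4."""
    if not 0 <= ds < 0.75:
        raise ValueError("D_s must lie in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * ds / 3)

# Hypothetical aligned DNA fragments differing at 2 of 10 sites.
ds = fraction_different("ACGTACGTAC", "ACGTTCGTAA")
print(ds, jukes_cantor_distance(ds))
```

The corrected distance is always at least as large as the observed fraction, because it accounts for multiple substitutions at the same site.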

29
Kimura 2-Parameter
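Rate matrix (rows and columns ordered A, C, G, T):
$$
\begin{pmatrix}
\cdot & b & a & b \\
b & \cdot & b & a \\
a & b & \cdot & b \\
b & a & b & \cdot
\end{pmatrix}
$$
a: transition rate, b: transversion rate
Transition/transversion bias: $R = a/(2b)$, with $a + 2b = 1$
per unit time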
30
Kimura 2-Parameter

$a = \dfrac{R}{R+1}, \quad b = \dfrac{0.5}{R+1}$
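
These two expressions can be checked numerically: for any bias R they give rates with a + 2b = 1, so each row of the rate matrix sums to zero once the diagonal is set to -1. A minimal sketch, with illustrative function names and the base order A, C, G, T assumed:

```python
def k2p_rates(R: float):
    """Transition rate a and transversion rate b for a given bias R."""
    return R / (R + 1), 0.5 / (R + 1)

def k2p_rate_matrix(R: float):
    """Kimura 2-parameter rate matrix with rows/columns ordered A, C, G, T."""
    a, b = k2p_rates(R)
    bases = "ACGT"
    transitions = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}
    Q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(bases):
        for j, y in enumerate(bases):
            if x != y:
                Q[i][j] = a if (x, y) in transitions else b
        Q[i][i] = -sum(Q[i])  # diagonal chosen so that the row sums to zero
    return Q

a, b = k2p_rates(2.0)
print(a, b, a + 2 * b)
for row in k2p_rate_matrix(2.0):
    print(row)
```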

31
HKY (Hasegawa, Kishino, Yano)
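$$
R = \begin{pmatrix}
\cdot & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\
\mu\pi_A & \cdot & \mu\pi_G & \mu\kappa\pi_T \\
\mu\kappa\pi_A & \mu\pi_C & \cdot & \mu\pi_T \\
\mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & \cdot
\end{pmatrix}
$$
$\kappa$: transition/transversion rate ratio (it multiplies the transition entries A-G and C-T)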
32
Distances in Trees

Edges may have weights reflecting
Number of mutations on evolutionary path from one
species to another
Time estimate for evolution of one species into
another
In a tree T, we often compute
$d_{ij}(T)$ - the length of the path between leaves i
and j

33
Distance in Trees: an Example
$d_{1,4} = 12 + 13 + 14 + 17 + 12 = 68$
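
The path sum can be computed for any pair of leaves by walking the tree. The sketch below uses a hypothetical tree (the original figure is not available) whose path from leaf 1 to leaf 4 reuses the edge weights 12, 13, 14, 17, 12; the internal node names a to d and the other two leaf edges are assumptions.

```python
from collections import defaultdict

def build_tree(edges):
    """Build an undirected weighted adjacency map from (u, v, weight) triples."""
    adj = defaultdict(dict)
    for u, v, w in edges:
        adj[u][v] = w
        adj[v][u] = w
    return adj

def path_length(adj, start, goal):
    """Length of the unique path between two nodes of a tree (iterative DFS)."""
    stack = [(start, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt, w in adj[node].items():
            if nxt != parent:  # never walk back along the edge we came from
                stack.append((nxt, node, dist + w))
    raise ValueError("nodes are not connected")

# Hypothetical tree: leaves "1".."4", internal nodes "a".."d".
edges = [("1", "a", 12), ("a", "b", 13), ("b", "c", 14), ("c", "d", 17),
         ("d", "4", 12), ("a", "2", 5), ("d", "3", 7)]
tree = build_tree(edges)
print(path_length(tree, "1", "4"))
```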
34
Fitting Distance Matrix

Given n species, we can compute the n x n
distance matrix $D_{ij}$
Evolution of these genes is described by a tree
that we don't know.
We need an algorithm to construct a tree that
best fits the distance matrix $D_{ij}$

35
Reconstructing a 3 Leaved Tree

Tree reconstruction for any 3x3 matrix is
straightforward
We have 3 leaves i, j, k and a center vertex c

Observe:
$d_{ic} + d_{jc} = D_{ij}$
$d_{ic} + d_{kc} = D_{ik}$
$d_{jc} + d_{kc} = D_{jk}$
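
Solving these three equations gives each edge length as half of (the two distances involving that leaf, minus the third distance). A minimal sketch with made-up input distances:

```python
def three_leaf_tree(D_ij: float, D_ik: float, D_jk: float):
    """Edge lengths (d_ic, d_jc, d_kc) of the star tree with center c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc

# Hypothetical distances between three leaves.
d_ic, d_jc, d_kc = three_leaf_tree(D_ij=5, D_ik=7, D_jk=8)
print(d_ic, d_jc, d_kc)
```

The edge lengths are non-negative exactly when the three distances satisfy the triangle inequality.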
36
Reconstructing a 3 Leaved Tree
37
Trees with > 3 Leaves

An unrooted binary tree with n leaves has 2n-3 edges
This means fitting a given tree to a distance
matrix D requires solving a system of $\binom{n}{2}$
equations with 2n-3 variables
This is not always possible to solve for n > 3

38
Additive Distance Matrices
Matrix D is ADDITIVE if there exists a tree T
with $d_{ij}(T) = D_{ij}$
NON-ADDITIVE otherwise
39
Distance Based Phylogeny Problem

Goal: Reconstruct an evolutionary tree from a
distance matrix
Input: n x n distance matrix $D_{ij}$
Output: weighted tree T with n leaves fitting D
If D is additive, this problem has a solution and
there is a simple algorithm to solve it

40
Using Neighboring Leaves to Construct the Tree

Find neighboring leaves i and j with parent k
Remove the rows and columns of i and j
Add a new row and column corresponding to k,
where the distance from k to any other leaf m can
be computed as

$D_{km} = (D_{im} + D_{jm} - D_{ij})/2$
Compress i and j into k, iterate algorithm for
rest of tree
41
Finding Neighboring Leaves

To find neighboring leaves we simply select a
pair of closest leaves.

42
Finding Neighboring Leaves

To find neighboring leaves we simply select a
pair of closest leaves.
WRONG

43
Finding Neighboring Leaves

Closest leaves aren't necessarily neighbors
i and j are neighbors, but $(d_{ij} = 13) > (d_{jk} = 12)$

Finding a pair of neighboring leaves is
a nontrivial problem!

44
Neighbor Joining Algorithm

In 1987 Naruya Saitou and Masatoshi Nei developed
a neighbor joining algorithm for phylogenetic
tree reconstruction
Finds a pair of leaves that are close to each
other but far from other leaves: implicitly finds
a pair of neighboring leaves
Advantages: works well for additive and also for
non-additive matrices; it does not rely on the
flawed molecular clock assumption

45
Constructing additive trees: the neighbor joining
algorithm

Let i, j be neighboring leaves in a tree, let k
be their parent, and let m be any other vertex.
The formula
$D_{km} = (D_{im} + D_{jm} - D_{ij})/2$
shows that we can compute the distances of k to
all other leaves. This suggests the following
method to construct a tree from a distance matrix:
Find neighboring leaves i, j in the tree.
Replace i, j by their parent k and recursively
construct a tree T for the smaller set.
Add i, j as children of k in T.

46
Neighbor Finding

How can we find, from distances alone, a pair of
nodes which are neighboring leaves?
Closest nodes aren't necessarily neighboring
leaves.

Next we show one way to find neighbors from
distances.
47
Neighbor Finding: Saitou & Nei algorithm
Definitions
Theorem (Saitou & Nei): Assume all edge weights
are positive. If D(i,j) is minimal (among all
pairs of leaves), then i and j are neighboring
leaves in the tree.
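
The whole procedure can be sketched in a few lines. Because the definitions on this slide did not survive in the transcript, the sketch below assumes the commonly used form of the criterion, Q(i,j) = (n-2) d(i,j) - r_i - r_j with r_i the sum of distances from i to all other current nodes, and picks the pair minimizing it; this may differ in form from the slide's D(i,j). The example distances come from a hypothetical quartet tree chosen so that d(i,j) = 13 and d(j,k) = 12, as in the earlier example.

```python
from itertools import combinations

def neighbor_joining(names, dist):
    """Neighbor joining on a symmetric dict-of-dicts distance table.

    Returns the edges (node, node, length) of an unrooted tree."""
    D = {a: dict(dist[a]) for a in names}   # working copy
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(D[a][b] for b in nodes if b != a) for a in nodes}
        # Pair minimizing the assumed criterion Q(i,j) = (n-2) d(i,j) - r_i - r_j.
        i, j = min(combinations(nodes, 2),
                   key=lambda p: (n - 2) * D[p[0]][p[1]] - r[p[0]] - r[p[1]])
        k = f"node{next_id}"                # new internal node (parent of i and j)
        next_id += 1
        d_ik = D[i][j] / 2 + (r[i] - r[j]) / (2 * (n - 2))
        d_jk = D[i][j] - d_ik
        edges += [(i, k, d_ik), (j, k, d_jk)]
        D[k] = {}
        for m in nodes:
            if m not in (i, j):
                # Distance from the new parent k to every remaining node m.
                D[k][m] = D[m][k] = (D[i][m] + D[j][m] - D[i][j]) / 2
        nodes = [m for m in nodes if m not in (i, j)] + [k]
    a, b = nodes
    edges.append((a, b, D[a][b]))           # join the last two nodes
    return edges

# Hypothetical additive distances: leaves i, j are neighbors, yet j is closer to k.
dist = {
    "i": {"j": 13, "k": 21, "l": 25},
    "j": {"i": 13, "k": 12, "l": 16},
    "k": {"i": 21, "j": 12, "l": 16},
    "l": {"i": 25, "j": 16, "k": 16},
}
for edge in neighbor_joining(["i", "j", "k", "l"], dist):
    print(edge)
```

On this input the closest pair is (j, k), but the criterion still joins i with j first, which is the behavior the theorem describes for additive matrices.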
48
Complexity of Neighbor Joining Algorithm

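Naive Implementation
Initialization: $\Theta(L^2)$ to compute d(r,i) and
C(i,j) for all $i, j \in L$.
Each Iteration:
$O(L^2)$ to find the maximal C(i,j).
$O(L)$ to compute C(m,k) for all $m \in L$, for the new node k.
Total of $O(L^3)$.

[Figure: tree with the reference object r, a node m, and the new node k, illustrating C(m,k)]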
49
Complexity of Neighbor Joining Algorithm

Using a heap to store the C(i,j)'s
Input: distance matrix D = [d(i,j)], and an
arbitrary object r.
Initialization: $\Theta(L^2)$ to compute and heapify the
C(i,j)'s in a heap H.
Each Iteration:
$O(\log L)$ to find and delete the maximal C(i,j)
from H.
$O(L)$ to add the values d(k,m) to D, for all
objects m.
$O(L)$ to delete d(m,i), d(m,j) from D (for all
m).
$O(L \log L)$ to delete C(i,m), C(j,m) from H and add
C(k,m) to H, for all objects m.
Total of $O(L^2 \log L)$.
(implementation details are omitted)

50
Neighbor Joining Algorithm

Applicable to matrices which are not additive
Known to work well in practice
The algorithm and its variants are the most
widely used distance-based algorithms today.

51
The Four Point Condition
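Compute: 1. $D_{ij} + D_{kl}$, 2. $D_{ik} + D_{jl}$, 3. $D_{il} + D_{jk}$
[Figure: the three pairings of leaves i, j, k, l on a quartet tree, labeled 1, 2, 3]
Sums 2 and 3 represent the same number: the length of
all edges + the middle edge (it is counted twice)
Sum 1 represents a smaller number: the length of all
edges - the middle edge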
52
The Four Point Condition Theorem

The four point condition for the quartet i, j, k, l
is satisfied if two of these sums are the same,
with the third sum smaller than these first two
Theorem: An n x n matrix D is additive if and
only if the four point condition holds for every
quartet $1 \le i, j, k, l \le n$
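
The theorem gives a direct test for additivity. The sketch below checks every quartet of distinct indices and accepts a quartet when its two largest sums agree; it assumes D is symmetric with a zero diagonal and already satisfies the triangle inequality. Both example matrices are hypothetical.

```python
from itertools import combinations

def four_point_holds(D, i, j, k, l, tol=1e-9):
    """True if the two largest of the three pairwise sums are equal."""
    sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
    return abs(sums[2] - sums[1]) <= tol

def is_additive(D, tol=1e-9):
    """Check the four point condition for every quartet of distinct leaves."""
    return all(four_point_holds(D, *quartet, tol)
               for quartet in combinations(range(len(D)), 4))

# Distances realized by a hypothetical quartet tree (so the matrix is additive).
additive = [[0, 13, 21, 25],
            [13, 0, 12, 16],
            [21, 12, 0, 16],
            [25, 16, 16, 0]]
# The same matrix with one entry changed, which breaks additivity.
perturbed = [[0, 13, 21, 30],
             [13, 0, 12, 16],
             [21, 12, 0, 16],
             [30, 16, 16, 0]]
print(is_additive(additive), is_additive(perturbed))
```

Checking all quartets takes time proportional to n^4, which is fine for small matrices.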

53
Least Squares Distance Phylogeny Problem

If the distance matrix D is NOT additive, then we
look for a tree T that approximates D the best
Squared Error: $\sum_{i,j} \left(d_{ij}(T) - D_{ij}\right)^2$
Squared Error is a measure of the quality of the
fit between distance matrix and the tree: we want
to minimize it.
Least Squares Distance Phylogeny Problem: finding
the best approximation tree T for a non-additive
matrix D (NP-hard).
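
Evaluating the squared error for one candidate tree is easy; the hard part is searching over trees. A minimal sketch, assuming the tree's leaf-to-leaf path lengths are already available as a matrix and summing over unordered pairs i < j (summing over ordered pairs would simply double the value). Both matrices are hypothetical.

```python
def squared_error(tree_dist, D):
    """Sum over pairs i < j of (d_ij(T) - D_ij)^2."""
    n = len(D)
    return sum((tree_dist[i][j] - D[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

# Path lengths d_ij(T) in a hypothetical candidate tree T.
tree_dist = [[0, 13, 21, 25],
             [13, 0, 12, 16],
             [21, 12, 0, 16],
             [25, 16, 16, 0]]
# Hypothetical observed matrix D that differs from T in one pair.
D = [[0, 13, 21, 27],
     [13, 0, 12, 16],
     [21, 12, 0, 16],
     [27, 16, 16, 0]]
print(squared_error(tree_dist, D))
```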