Phylogenetics I - PowerPoint PPT Presentation

Slides: 54
Provided by: hat89
1
Phylogenetics I
2
Evolution
  • Evolution of new organisms is driven by:
  • Mutations
  • The DNA sequence can be changed due to single
    base changes, deletion/insertion of DNA segments,
    etc.
  • Selection bias

3
Theory of Evolution
  • Basic idea:
  • Speciation events lead to the creation of different
    species.
  • Speciation is caused by physical separation into
    groups in which different genetic variants become
    dominant.
  • Any two species share a (possibly distant) common
    ancestor

4
The Tree of Life
5
Primate evolution
A phylogeny (also called a phylogenetic tree) is a
tree that describes the sequence of speciation
events that led to the formation of a set of
present-day species.
6
Morphological vs. Molecular
  • Classical phylogenetic analysis uses morphological
    features: number of legs, lengths of legs, etc.
  • Modern biological methods allow the use of molecular
    features:
  • Gene sequences
  • Protein sequences

7
Morphological topology
(Based on McKenna and Bell, 1997)
Archonta
Ungulata
8
From sequences to a phylogenetic tree
Rat      QEPGGLVVPPTDA
Rabbit   QEPGGMVVPPTDA
Gorilla  QEPGGLVVPPTDA
Cat      REPGGLVVPPTEG
There are many possible types of sequences to use
(e.g. Mitochondrial vs Nuclear proteins).
9
Mitochondrial topology
(Based on Pupko et al.)
10
Nuclear topology
(Based on Pupko et al. slide)
(tree by Madsen)
11
Phylogenetic trees
  • Leaves: present-day species (or taxa, the plural of
    taxon)
  • Internal vertices: hypothetical common ancestors
  • Edge lengths: time from one speciation to the
    next

12
Twists in molecular phylogenies
  • We have to emphasize that gene/protein sequences
    can be homologous for several different reasons:
  • Orthologs -- sequences diverged after a
    speciation event
  • Paralogs -- sequences diverged after a
    duplication event
  • Xenologs -- sequences diverged after a horizontal
    transfer (e.g., by virus)

13
Paralogs
Consider an evolutionary tree of three taxa
and assume that at some point in the past a gene
duplication event occurred.
[Figure: tree of three taxa with the gene duplication marked]
14
Paralogs
The evolution of the gene is described by this tree
(A and B are the two copies of the same gene).
[Figure: gene tree in which a gene duplication is followed by speciation events; leaves 1A, 2A, 3A, 1B, 2B, 3B]
15
Paralogs
  • If we happen to consider genes 1A, 2B, and 3A of
    species 1,2,3, we get a wrong tree that does not
    represent the phylogeny of the host species

[Figure: the same gene tree, with the gene duplication and the speciation events (S) marked; leaves 1A, 2A, 3A, 1B, 2B, 3B]
16
Types of Trees
  • A natural model to consider is that of rooted
    trees

Common Ancestor
17
Types of trees
  • An unrooted tree represents the same phylogeny
    without the root node

Depending on the model, data from current day
species does not distinguish between different
placements of the root.
18
Rooted versus unrooted trees
[Figure: an unrooted tree on leaves a, b, c]
Represents the three rooted trees
19
Total numbers of trees
  • For n taxa:
  • Rooted bifurcating trees:
  • $(2n-3)!! = \frac{(2n-3)!}{2^{n-2}(n-2)!}$
  • Unrooted bifurcating trees:
  • $(2n-5)!!$
  • Tree shapes
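
To get a feel for how quickly these counts grow, the double factorials can be evaluated directly. The sketch below is only illustrative; the function names are made up for this example and assume labeled leaves.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k-2) * (k-4) * ... (taken as 1 for k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_rooted_trees(n: int) -> int:
    """Rooted bifurcating trees on n >= 2 labeled leaves: (2n-3)!!."""
    return double_factorial(2 * n - 3)


def num_unrooted_trees(n: int) -> int:
    """Unrooted bifurcating trees on n >= 3 labeled leaves: (2n-5)!!."""
    return double_factorial(2 * n - 5)


for n in (3, 4, 5, 10):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

Already for 10 taxa there are more than 34 million rooted trees, which is why exhaustive search over trees is impractical.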

20
Positioning Roots in Unrooted Trees
  • We can estimate the position of the root by
    introducing an outgroup:
  • a set of species that are definitely distant from
    all the species of interest

[Figure: tree on leaves Falcon, Aardvark, Bison, Chimp, Dog, Elephant, with the proposed root marked]
Types of Data
  • Distance-based
  • Input is a matrix of distances between species
  • Can be the fraction of residues on which they disagree, an
    alignment score between them, or ...
  • Character-based
  • Examine each character (e.g., residue) separately
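
As an illustration of the distance-based input, the sketch below builds a matrix from the "fraction of residues they disagree on" for the four aligned fragments shown earlier (Rat, Rabbit, Gorilla, Cat). The function names are hypothetical, and the sequences are assumed to be already aligned and of equal length.

```python
from itertools import combinations

# Aligned protein fragments from the "From sequences to a phylogenetic tree" slide.
sequences = {
    "Rat":     "QEPGGLVVPPTDA",
    "Rabbit":  "QEPGGMVVPPTDA",
    "Gorilla": "QEPGGLVVPPTDA",
    "Cat":     "REPGGLVVPPTEG",
}


def p_distance(s: str, t: str) -> float:
    """Fraction of aligned positions at which s and t differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)


def distance_matrix(seqs: dict) -> dict:
    """Symmetric matrix (dict of dicts) of pairwise p-distances."""
    names = list(seqs)
    D = {a: {b: 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        D[a][b] = D[b][a] = p_distance(seqs[a], seqs[b])
    return D


D = distance_matrix(sequences)
for name, row in D.items():
    print(name, [round(row[other], 3) for other in D])
```

On such a short fragment the distances are coarse (Rat and Gorilla are identical here), so real analyses use much longer alignments.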

22
Two methods of tree construction
  • Distance: a weighted tree that realizes the
    distances between the objects.
  • Parsimony: a tree with the minimum total number of
    character changes between nodes.

We start with distance-based methods, considering
the following question: given a set of species
(leaves in a supposed tree) and the distances
between them, construct a phylogeny which best
fits the distances.
23
Distance Matrix
  • Given n species, we can compute the n x n
    distance matrix $D_{ij}$
  • $D_{ij}$ may be defined as the edit distance between a
    gene in species i and the same gene in species j, where
    the gene of interest is sequenced for all n species.

24
The distance between two sequences
  • Protein sequences
  • PAM
  • BLOSUM
  • DNA sequences
  • Jukes-Cantor
  • HKY
  • Kimura 2-Parameter

25
General Stationary Time-reversible Model
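$$
R = \begin{pmatrix}
\cdot & \pi_C r_{CA} & \pi_G r_{GA} & \pi_T r_{TA} \\
\pi_A r_{AC} & \cdot & \pi_G r_{GC} & \pi_T r_{TC} \\
\pi_A r_{AG} & \pi_C r_{CG} & \cdot & \pi_T r_{TG} \\
\pi_A r_{AT} & \pi_C r_{CT} & \pi_G r_{GT} & \cdot
\end{pmatrix}
$$
(Diagonal elements are such that rows sum to zero)
Time reversibility: $\pi_i r_{ij} = \pi_j r_{ji}$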
26
General Stationary Time-reversible Model
  • $P(t) = e^{Rt}$
  • Given rates, one can find transition
    probabilities, and vice versa.

27
Jukes-Cantor
$$
R = \begin{pmatrix}
\cdot & u/3 & u/3 & u/3 \\
u/3 & \cdot & u/3 & u/3 \\
u/3 & u/3 & \cdot & u/3 \\
u/3 & u/3 & u/3 & \cdot
\end{pmatrix}
$$
28
Jukes-Cantor
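  • P(no mutation) = $e^{-\frac{4}{3}ut}$
  • P(at least one mutation) = $1 - e^{-\frac{4}{3}ut}$
  • $D_s = \frac{3}{4}\left(1 - e^{-\frac{4}{3}ut}\right)$
  • $D \approx ut = -\frac{3}{4}\ln\left(1 - \frac{4}{3}D_s\right)$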
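
The last two formulas are inverses of each other, which is easy to check numerically. The sketch below assumes that D_s is the observed fraction of sites that differ between two sequences (the slide does not define it explicitly); the function names are made up for this example.

```python
import math


def expected_difference(ut: float) -> float:
    """Expected fraction of differing sites D_s after ut substitutions per site."""
    return 0.75 * (1.0 - math.exp(-4.0 * ut / 3.0))


def jukes_cantor_distance(d_s: float) -> float:
    """Estimate ut from the observed fraction of differing sites d_s."""
    if not 0.0 <= d_s < 0.75:
        # The logarithm is undefined once d_s reaches the saturation value 3/4.
        raise ValueError("d_s must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * d_s / 3.0)


for d_s in (0.05, 0.2, 0.4, 0.6):
    print(d_s, round(jukes_cantor_distance(d_s), 4))
```

The corrected distance is always at least as large as the observed fraction, since multiple substitutions at the same site hide some of the changes.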

29
Kimura 2-Parameter
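Rate matrix (rows and columns ordered A, C, G, T):
$$
R = \begin{pmatrix}
\cdot & b & a & b \\
b & \cdot & b & a \\
a & b & \cdot & b \\
b & a & b & \cdot
\end{pmatrix}
$$
a is the transition rate and b the transversion rate.
The transition/transversion bias is the scalar R = a/(2b)
(not to be confused with the rate matrix), with
a + 2b = 1 per unit time.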
30
Kimura 2-Parameter
  • $a = R/(R+1)$, $b = 0.5/(R+1)$

31
HKY (Hasegawa, Kishino, Yano)
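$$
R = \begin{pmatrix}
\cdot & \mu\pi_C & \mu\kappa\pi_G & \mu\pi_T \\
\mu\pi_A & \cdot & \mu\pi_G & \mu\kappa\pi_T \\
\mu\kappa\pi_A & \mu\pi_C & \cdot & \mu\pi_T \\
\mu\pi_A & \mu\kappa\pi_C & \mu\pi_G & \cdot
\end{pmatrix}
$$
$\kappa$: transition/transversion rate ratio (it multiplies the A↔G and C↔T entries)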
32
Distances in Trees
  • Edges may have weights reflecting:
  • Number of mutations on the evolutionary path from one
    species to another
  • Time estimate for the evolution of one species into
    another
  • In a tree T, we often compute
  • $d_{ij}(T)$: the length of the path between leaves i
    and j

33
Distances in Trees: an Example
$d_{1,4} = 12 + 13 + 14 + 17 + 12 = 68$
34
Fitting Distance Matrix
  • Given n species, we can compute the n x n
    distance matrix Dij
  • The evolution of these genes is described by a tree
    that we don't know.
  • We need an algorithm to construct a tree that
    best fits the distance matrix Dij

35
Reconstructing a 3 Leaved Tree
  • Tree reconstruction for any 3x3 matrix is
    straightforward
  • We have 3 leaves i, j, k and a center vertex c

Observe:
$d_{ic} + d_{jc} = D_{ij}$
$d_{ic} + d_{kc} = D_{ik}$
$d_{jc} + d_{kc} = D_{jk}$
36
Reconstructing a 3 Leaved Tree
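
Solving the three equations for the edge lengths gives d_ic = (D_ij + D_ik - D_jk)/2, and symmetrically for d_jc and d_kc. A minimal sketch of this step follows; the function name and the example numbers are illustrative, not taken from the slides.

```python
def three_leaf_tree(D_ij: float, D_ik: float, D_jk: float):
    """Edge lengths (d_ic, d_jc, d_kc) of the star tree with center c."""
    d_ic = (D_ij + D_ik - D_jk) / 2
    d_jc = (D_ij + D_jk - D_ik) / 2
    d_kc = (D_ik + D_jk - D_ij) / 2
    return d_ic, d_jc, d_kc


# Example distances (made up): D_ij = 5, D_ik = 6, D_jk = 7.
d_ic, d_jc, d_kc = three_leaf_tree(5, 6, 7)
print(d_ic, d_jc, d_kc)

# The fitted edges reproduce the input distances exactly.
print(d_ic + d_jc, d_ic + d_kc, d_jc + d_kc)
```

The edge lengths are non-negative exactly when the three distances satisfy the triangle inequality.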
37
Trees with > 3 Leaves
  • An unrooted binary tree with n leaves has 2n-3 edges
  • This means fitting a given tree to a distance
    matrix D requires solving a system of $\binom{n}{2}$
    equations in 2n-3 variables
  • This is not always possible to solve for n > 3

38
Additive Distance Matrices
Matrix D is ADDITIVE if there exists a tree T
with $d_{ij}(T) = D_{ij}$ for all i, j;
NON-ADDITIVE otherwise
39
Distance Based Phylogeny Problem
  • Goal: Reconstruct an evolutionary tree from a
    distance matrix
  • Input: n x n distance matrix $D_{ij}$
  • Output: weighted tree T with n leaves fitting D
  • If D is additive, this problem has a solution and
    there is a simple algorithm to solve it

40
Using Neighboring Leaves to Construct the Tree
  • Find neighboring leaves i and j with parent k
  • Remove the rows and columns of i and j
  • Add a new row and column corresponding to k,
    where the distance from k to any other leaf m can
    be computed as

$D_{km} = (D_{im} + D_{jm} - D_{ij})/2$
Compress i and j into k, then iterate the algorithm on the
rest of the tree
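
The compression step can be sketched as follows, with the matrix stored as a dict of dicts. The function name and the example tree (edges a-u: 11, b-u: 1, u-v: 1, c-v: 10, d-v: 1) are assumptions made for illustration, not part of the slides.

```python
def merge_neighbors(D: dict, i, j, k) -> dict:
    """Return a new matrix in which neighboring leaves i and j are replaced by parent k."""
    others = [m for m in D if m not in (i, j)]
    # Copy the rows and columns of all leaves other than i and j.
    new_D = {m: {n: D[m][n] for n in others} for m in others}
    new_D[k] = {k: 0.0}
    for m in others:
        d_km = (D[i][m] + D[j][m] - D[i][j]) / 2
        new_D[k][m] = d_km
        new_D[m][k] = d_km
    return new_D


# Hypothetical additive matrix for the tree described in the lead-in;
# a and b are neighbors with parent u.
D = {
    "a": {"a": 0, "b": 12, "c": 22, "d": 13},
    "b": {"a": 12, "b": 0, "c": 12, "d": 3},
    "c": {"a": 22, "b": 12, "c": 0, "d": 11},
    "d": {"a": 13, "b": 3, "c": 11, "d": 0},
}

reduced = merge_neighbors(D, "a", "b", "u")
print(reduced["u"])
```

In the example, the distances from u come out as 11 to c and 2 to d, matching the path lengths in the assumed tree.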
41
Finding Neighboring Leaves
  • To find neighboring leaves we simply select a
    pair of closest leaves.

42
Finding Neighboring Leaves
  • To find neighboring leaves we simply select a
    pair of closest leaves.
  • WRONG

43
Finding Neighboring Leaves
  • Closest leaves aren't necessarily neighbors
  • i and j are neighbors, but ($d_{ij} = 13$) > ($d_{jk} = 12$)
  • Finding a pair of neighboring leaves is
    a nontrivial problem!

44
Neighbor Joining Algorithm
  • In 1987 Naruya Saitou and Masatoshi Nei developed
    a neighbor joining algorithm for phylogenetic
    tree reconstruction
  • Finds a pair of leaves that are close to each
    other but far from other leaves; this implicitly finds
    a pair of neighboring leaves
  • Advantages: works well for additive matrices and also
    for non-additive ones; it does not rely on the
    flawed molecular clock assumption

45
Constructing additive trees: the neighbor joining
algorithm
  • Let i, j be neighboring leaves in a tree, let k
    be their parent, and let m be any other vertex.
  • The formula $d(k,m) = \frac{1}{2}\left(d(i,m) + d(j,m) - d(i,j)\right)$
    shows that we can compute the distances of k to
    all other leaves. This suggests the following
    method to construct a tree from a distance matrix:
  • Find neighboring leaves i,j in the tree,
  • Replace i,j by their parent k and recursively
    construct a tree T for the smaller set.
  • Add i,j as children of k in T.

46
Neighbor Finding
  • How can we find from distances alone a pair of
    nodes which are neighboring leaves?
  • Closest nodes aren't necessarily neighboring
    leaves.

Next we show one way to find neighbors from
distances.
47
Neighbor Finding: Saitou-Nei algorithm
Definitions
Theorem (Saitou and Nei): Assume all edge weights
are positive. If D(i,j) is minimal (among all
pairs of leaves), then i and j are neighboring
leaves in the tree.
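
The definitions on this slide are not preserved in the transcript. A commonly used formulation of the neighbor-joining criterion is Q(i,j) = (n-2) d(i,j) - Σ_m d(i,m) - Σ_m d(j,m), and the sketch below assumes that this is the quantity being minimized. The function names and the example matrix (tree with edges a-u: 11, b-u: 1, u-v: 1, c-v: 10, d-v: 1) are illustrative.

```python
from itertools import combinations


def closest_pair(D: dict):
    """Pair of leaves with the smallest raw distance."""
    return min(combinations(D, 2), key=lambda p: D[p[0]][p[1]])


def nj_pair(D: dict):
    """Pair of leaves minimizing the (assumed) neighbor-joining criterion Q."""
    n = len(D)
    total = {x: sum(D[x].values()) for x in D}  # row sums (diagonal is zero)

    def q(pair):
        i, j = pair
        return (n - 2) * D[i][j] - total[i] - total[j]

    return min(combinations(D, 2), key=q)


# Hypothetical additive matrix: (a, b) and (c, d) are the neighboring pairs.
D = {
    "a": {"a": 0, "b": 12, "c": 22, "d": 13},
    "b": {"a": 12, "b": 0, "c": 12, "d": 3},
    "c": {"a": 22, "b": 12, "c": 0, "d": 11},
    "d": {"a": 13, "b": 3, "c": 11, "d": 0},
}

print("closest pair:", closest_pair(D))
print("pair chosen by Q:", nj_pair(D))
```

Here the closest pair (b, d) is not a pair of neighbors, while the Q criterion selects a true neighboring pair. The pairs (a, b) and (c, d) tie on Q in this example, and `min` returns the first one encountered.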
48
Complexity of Neighbor Joining Algorithm
  • Naive Implementation
  • Initialization: $\Theta(L^2)$ to compute d(r,i) and
    C(i,j) for all $i, j \in L$.
  • Each Iteration:
  • $O(L^2)$ to find the maximal C(i,j).
  • $O(L)$ to compute C(m,k), $m \in L$, for the new node k.
  • Total of $O(L^3)$.

[Figure: reference object r, new node k, and an object m, with C(m,k) marked]
49
Complexity of Neighbor Joining Algorithm
  • Using a heap to store the C(i,j)'s
  • Input: distance matrix D = [d(i,j)], and an
    arbitrary object r.
  • Initialization: $\Theta(L^2)$ to compute and heapify the
    C(i,j)'s in a heap H.
  • Each Iteration:
  • $O(\log L)$ to find and delete the maximal C(i,j)
    from H.
  • $O(L)$ to add the values d(k,m) to D, for all
    objects m.
  • $O(L)$ to delete d(m,i), d(m,j) from D (for all
    m).
  • $O(L \log L)$ to delete C(i,m), C(j,m) from H and add
    C(k,m) to H, for all objects m.
  • Total of $O(L^2 \log L)$.
  • (implementation details are omitted)

50
Neighbor Joining Algorithm
  • Applicable to matrices which are not additive
  • Known to work well in practice
  • The algorithm and its variants are the most
    widely used distance-based algorithms today.

51
The Four Point Condition
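Compute: 1. $D_{ij} + D_{kl}$, 2. $D_{ik} + D_{jl}$, 3. $D_{il} + D_{jk}$
[Figure: the quartet i, j, k, l with the paths counted by sums 2, 3, and 1]
2 and 3 represent the same number: the length of
all edges + the middle edge (it is counted twice)
1 represents a smaller number: the length of all
edges - the middle edge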
52
The Four Point Condition Theorem
  • The four point condition for the quartet i,j,k,l
    is satisfied if two of these sums are the same,
    with the third sum smaller than these first two
  • Theorem: An n x n matrix D is additive if and
    only if the four point condition holds for every
    quartet $1 \le i,j,k,l \le n$
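
The theorem gives a direct additivity test. The sketch below is a simplified version: it assumes D is symmetric with a zero diagonal, checks only quartets of four distinct leaves, and accepts the degenerate case where all three sums are equal (a zero-length middle edge). The names and example matrices are made up for illustration.

```python
from itertools import combinations


def four_point_ok(D: dict, i, j, k, l, tol: float = 1e-9) -> bool:
    """True if the two largest of the three pairwise sums are equal (within tol)."""
    sums = sorted([
        D[i][j] + D[k][l],
        D[i][k] + D[j][l],
        D[i][l] + D[j][k],
    ])
    # After sorting, sums[0] is automatically no larger than the other two.
    return abs(sums[2] - sums[1]) <= tol


def is_additive(D: dict, tol: float = 1e-9) -> bool:
    """Check the four point condition on every quartet of distinct leaves."""
    return all(four_point_ok(D, *quartet, tol) for quartet in combinations(D, 4))


# Hypothetical additive matrix (tree with edges a-u: 11, b-u: 1, u-v: 1, c-v: 10, d-v: 1).
additive = {
    "a": {"a": 0, "b": 12, "c": 22, "d": 13},
    "b": {"a": 12, "b": 0, "c": 12, "d": 3},
    "c": {"a": 22, "b": 12, "c": 0, "d": 11},
    "d": {"a": 13, "b": 3, "c": 11, "d": 0},
}

# Perturbing a single distance breaks the condition.
perturbed = {x: dict(row) for x, row in additive.items()}
perturbed["a"]["c"] = perturbed["c"]["a"] = 30

print(is_additive(additive), is_additive(perturbed))
```

For the additive example the three sums are 23, 25, and 25, so the smallest sum corresponds to the pairing (a, b) with (c, d), the true neighbor pairs.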

53
Least Squares Distance Phylogeny Problem
  • If the distance matrix D is NOT additive, then we
    look for a tree T that approximates D the best
  • Squared Error: $\sum_{i,j} \left(d_{ij}(T) - D_{ij}\right)^2$
  • Squared Error is a measure of the quality of the
    fit between the distance matrix and the tree; we want
    to minimize it.
  • Least Squares Distance Phylogeny Problem: finding
    the best approximation tree T for a non-additive
    matrix D (NP-hard).
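
Evaluating the squared error of one candidate tree is easy; the hard part is searching over all trees. The sketch below only does the evaluation: it computes the leaf-to-leaf path lengths d_ij(T) of a given weighted tree and compares them with D. It assumes the sum runs over unordered pairs of leaves, and the tree, names, and observed matrix are invented for illustration.

```python
from itertools import combinations


def path_lengths(edges, leaves) -> dict:
    """d[i][j] = length of the path between leaves i and j in a weighted tree."""
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    def from_source(src):
        # Depth-first traversal; in a tree each vertex is reached by a unique path.
        dist = {src: 0.0}
        stack = [src]
        while stack:
            x = stack.pop()
            for y, w in adj[x]:
                if y not in dist:
                    dist[y] = dist[x] + w
                    stack.append(y)
        return dist

    d = {}
    for i in leaves:
        dist = from_source(i)
        d[i] = {j: dist[j] for j in leaves}
    return d


def squared_error(tree_d: dict, D: dict) -> float:
    """Sum over unordered leaf pairs of (d_ij(T) - D_ij)^2."""
    return sum((tree_d[i][j] - D[i][j]) ** 2 for i, j in combinations(D, 2))


# Hypothetical candidate tree with internal vertices u and v.
edges = [("a", "u", 11), ("b", "u", 1), ("u", "v", 1), ("c", "v", 10), ("d", "v", 1)]
tree_d = path_lengths(edges, ["a", "b", "c", "d"])

# Hypothetical observed matrix: equal to the tree distances except for one entry.
observed = {i: dict(row) for i, row in tree_d.items()}
observed["a"]["c"] = observed["c"]["a"] = 24.0

print(squared_error(tree_d, observed))
```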