Title: Perfect Phylogeny MLE for Phylogeny Lecture 14
Based on Setubal & Meidanis 6.2 and Durbin et al. 8.1
Slide 2: Some Announcements
- The final exam will take place on Friday,
  17.2.04, at 09:00, in Taub 8.
- Allowed material: course and tutorial slides, and the
  textbooks of the course (Durbin et al.,
  Setubal & Meidanis, Gusfield).
- Lab offered next semester:
  - algorithms for constructing phylogenetic trees
  - http://www.cs.technion.ac.il/moran/lab06.htm
Slide 3: 2. The perfect phylogeny problem
- A character is a property that
  distinguishes between species (e.g. dental
  structure).
- A character's state is a value of the character
  (e.g. human dental structure).
- Problem: given a set of species, specified by their
  characters, reconstruct their evolutionary tree.
Slide 4: The Perfect Phylogeny Problem (pure graph-theoretic setting)
Input: partial colorings (C1, ..., Ck) of a set of
vertices U (in the example: 3 total colorings,
left, center, and right, each using two colors).
Problem: is there a tree T = (V, E) such that U ⊆ V
and, for i = 1, ..., k, Ci is a convex (partial)
coloring of T?
NP-hard in general; in P for some special cases.
Slide 5: Perfect Phylogeny for directed binary characters
- Input: a matrix whose rows correspond to objects
  (species) and whose columns correspond to characters.
- Each character has two states: 0 (absent) or
  1 (present).
- Question: is there a directed perfect phylogeny
  tree for the given species in which all the
  characters have value 0 at the root?
C1 C2 C3 C4 C5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 0 1
D 0 0 1 1 0
E 0 1 0 0 0
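[Figure: a directed perfect phylogeny for this matrix. The root is labeled (00000); the other node labels are (01000) for E, (00100) for B, (00110) for D, (11000) for A, and (11001) for C.]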
Slide 6: Perfect Phylogeny for directed binary characters
- By definition, for each character C there is
  one edge on which it is converted from 0 to 1. In
  the tree below, the edge on which character C2 is
  converted to 1 is marked. The resulting tree is
  convex for this character.
Values of character C2:
  C2
A 1
B 0
C 1
D 0
E 1
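[Figure: the tree with the edge on which character C2 is converted to 1 marked. The root and leaves B and D have value 0; leaves E, A, and C have value 1.]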
Slide 7: Perfect Phylogeny for directed binary characters
- A tree is a directed perfect phylogeny for a
  given 0-1 matrix M iff we can map each character
  to an edge such that the edge labeled Ci represents
  the change of character Ci's state from 0 to 1. The
  slide shows such a tree for the matrix below:
C1 C2 C3 C4 C5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 0 1
D 0 0 1 1 0
E 0 1 0 0 0
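The edge-labeling condition can be checked mechanically. The following is a minimal Python sketch, not part of the lecture. The `PARENT` dictionary is an assumed encoding of one tree that is consistent with the matrix above, in which each internal node is named after the character labeling the edge that enters it. All names (`MATRIX`, `PARENT`, `characters_on_path`, `is_directed_perfect_phylogeny`) are illustrative.

```python
# Matrix from the slide: rows are objects, columns are characters C1..C5.
CHARACTERS = ["C1", "C2", "C3", "C4", "C5"]
MATRIX = {
    "A": [1, 1, 0, 0, 0],
    "B": [0, 0, 1, 0, 0],
    "C": [1, 1, 0, 0, 1],
    "D": [0, 0, 1, 1, 0],
    "E": [0, 1, 0, 0, 0],
}

# Assumed tree encoding (hypothetical): child -> parent. An internal node is
# named after the character whose 0 -> 1 change labels the edge entering it.
PARENT = {
    "C2": "root", "C1": "C2", "C5": "C1",
    "C3": "root", "C4": "C3",
    "E": "C2", "A": "C1", "C": "C5", "B": "C3", "D": "C4",
}


def characters_on_path(parent, leaf):
    """Collect the characters labeling the edges on the root-to-leaf path."""
    found = set()
    node = parent[leaf]
    while node != "root":
        found.add(node)
        node = parent[node]
    return found


def is_directed_perfect_phylogeny(parent, matrix, characters):
    """True if each object has exactly the characters that change to 1 above it."""
    for obj, row in matrix.items():
        expected = {c for c, bit in zip(characters, row) if bit == 1}
        if characters_on_path(parent, obj) != expected:
            return False
    return True
```

Calling `is_directed_perfect_phylogeny(PARENT, MATRIX, CHARACTERS)` walks from every leaf to the root and compares the edge labels it meets with the 1s in that object's row.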
Slide 8: Efficient algorithm for the Binary Perfect Phylogeny Problem
- Definition: given a 0-1 matrix M, Ok = {j : Mjk = 1},
  i.e. Ok is the set of objects that have character
  Ck.
- Theorem: M has a perfect phylogenetic tree iff
  the sets Oi are laminar, i.e. for all i, j,
  either Oi and Oj are disjoint, or one includes
  the other.
Laminar:
C1 C2 C3 C4 C5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 0 1
D 0 0 1 1 0
E 0 1 0 0 0
Not laminar (for example, O3 = {B, D} and O5 = {B, C, E} intersect, but neither includes the other):
C1 C2 C3 C4 C5
A 1 1 0 0 0
B 0 0 1 0 1
C 1 1 0 0 1
D 0 0 1 1 0
E 0 1 0 0 1
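The laminarity condition of the theorem can be tested directly from its definition. The sketch below is a straightforward pairwise check, written for illustration only; it is not the efficient algorithm developed on the later slides, and the names `character_sets` and `is_laminar` are assumptions of this example.

```python
from itertools import combinations


def character_sets(matrix, objects):
    """Return O_k for every column k: the set of objects with a 1 in column k."""
    n_chars = len(matrix[0])
    return [
        {obj for obj, row in zip(objects, matrix) if row[k] == 1}
        for k in range(n_chars)
    ]


def is_laminar(sets):
    """True if every two sets are disjoint or one includes the other."""
    for a, b in combinations(sets, 2):
        if a & b and not (a <= b or b <= a):
            return False
    return True


OBJECTS = ["A", "B", "C", "D", "E"]

# The two matrices from the slide (columns C1..C5).
LAMINAR = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
]
NOT_LAMINAR = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 1],
]

print(is_laminar(character_sets(LAMINAR, OBJECTS)))      # True
print(is_laminar(character_sets(NOT_LAMINAR, OBJECTS)))  # False
```

The pairwise test compares every two columns, so it is slower than the O(mn) procedure, but it mirrors the statement of the theorem one to one.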
Slide 9: Proof
- (⇒) Assume M has a perfect phylogeny, and let Ci,
  Cj be given.
- Consider the edges labeled Ci and Cj.
- Case 1: there is a root-to-leaf path containing
  both edges. Then one of Oi, Oj includes the other (C2
  and C1 below).
- Case 2: not case 1. Then Oi and Oj are disjoint (C2
  and C3).
[Figure: the perfect phylogeny tree, with edges labeled C1, ..., C5 and leaves A, ..., E.]
Slide 10: Proof (cont.)
- (⇐) Assume that for all i, j, either Oi and Oj are
  disjoint, or one includes the other. We prove by
  induction on the number of characters that M has
  a perfect phylogenetic tree.
- Basis: one character. Then there are at most two
  distinct objects, one with and one without this character.
C1
A 1
B 0
Slide 11: Proof (cont.)
- (⇐) Induction step: assume correctness for n-1
  characters, and consider a matrix with n
  characters (non-zero columns). WLOG assume that
  O1 is not properly contained in Oj for any j > 1.
- Let S1 be the set of objects j for which Mj1 = 1,
  and S2 be the remaining objects. Then each
  character other than C1 belongs to objects in S1 or in S2, but not
  both (prove!). By induction there are trees T1
  and T2 for S1 and S2. Combining them as below
  gives the desired tree.
C1 C2 C3 C4 C5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 0 1
D 0 0 1 1 0
E 1 0 0 0 0
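S1 = {A, C, E}, S2 = {B, D}
[Figure: the trees T1 and T2 combined under a common root; the edge leading to T1 is labeled 1.]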
Slide 12: Efficient Implementation
- 1. Sort the columns (characters) by decreasing
  value when considered as binary numbers. (Time
  complexity O(mn), using radix sort.)
- Claim: if the binary value of column i is larger
  than that of column j, then Oi is not a proper
  subset of Oj.
- Proof: Oi - Oj > 0 (as binary numbers) means that the 1s in Oi are not
  all covered by the 1s in Oj.
C1 C2 C3 C4 C5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 0 1
D 0 0 1 1 0
E 0 1 0 0 0
C2 C1 C3 C5 C4
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
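The sorting step can be reproduced on the example matrix. This is a minimal sketch that uses Python's built-in comparison sort on the integer value of each column rather than the radix sort mentioned on the slide, so it does not achieve the O(mn) bound; the function name `sort_columns` is an assumption of this example.

```python
def sort_columns(matrix, names):
    """Reorder columns by decreasing binary value (top row = most significant bit)."""
    def value(col):
        # Read the column from top to bottom as a binary number.
        return int("".join(str(row[col]) for row in matrix), 2)

    order = sorted(range(len(names)), key=value, reverse=True)
    sorted_matrix = [[row[col] for col in order] for row in matrix]
    return sorted_matrix, [names[col] for col in order]


# Matrix from the slide: rows A..E, columns C1..C5.
MATRIX = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
    [0, 1, 0, 0, 0],
]
NAMES = ["C1", "C2", "C3", "C4", "C5"]

SORTED_MATRIX, SORTED_NAMES = sort_columns(MATRIX, NAMES)
print(SORTED_NAMES)  # ['C2', 'C1', 'C3', 'C5', 'C4']
```

The column values here are 20, 21, 10, 2, and 4 for C1 to C5, which gives the column order shown in the second matrix of the slide.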
Slide 13: Efficient Implementation (2)
- 2. Make a backwards linked list of the 1s in
  each row (the leftmost 1 in each row points at
  itself). Time complexity O(mn).
C2 C1 C3 C5 C4
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
Claim: if the columns are sorted, then the set of
columns is laminar iff for each column i, all the
links leaving column i point at the same column.
This can be checked in O(mn) time.
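The link construction and the test from the claim can be sketched as follows. This is an illustrative implementation under the assumption that the columns are already sorted as in step 1; the links are stored per row as a dictionary from a column index to the column index it points at, and the names `backward_links` and `is_laminar_sorted` are hypothetical.

```python
def backward_links(matrix):
    """For every 1, record the column of the previous 1 in the same row.

    The leftmost 1 of a row points at its own column.
    """
    links = []
    for row in matrix:
        row_links = {}
        previous = None
        for col, bit in enumerate(row):
            if bit == 1:
                row_links[col] = col if previous is None else previous
                previous = col
        links.append(row_links)
    return links


def is_laminar_sorted(matrix):
    """Check the claim: all links leaving a column point at the same column."""
    links = backward_links(matrix)
    for col in range(len(matrix[0])):
        targets = {row_links[col] for row_links in links if col in row_links}
        if len(targets) > 1:
            return False
    return True


# Sorted matrix from the slide (columns C2 C1 C3 C5 C4, rows A..E).
SORTED_LAMINAR = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]

print(is_laminar_sorted(SORTED_LAMINAR))  # True
```

Both functions visit every matrix entry a constant number of times, which matches the O(mn) bound stated on the slide.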
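Slide 14: Examples
Laminar:
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
Not laminar (in the third column, the links of B and D point at the column itself, while the link of E points at the first column):
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 1 1 0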
Slide 15: Efficient Implementation (3)
- 3. When the matrix is laminar, the tree edges
  corresponding to characters are defined by the
  backwards links in the matrix. The
  remaining edges and leaves are determined by the
  characters of each object. This takes O(mn) time.
C2 C1 C3 C5 C4
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
[Figure: the resulting tree, with edges labeled C1, ..., C5 and leaves A, ..., E.]
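Step 3 can be sketched in a few lines. The code below is an illustration rather than the lecture's implementation. It assumes the matrix is laminar with sorted columns and does not verify this. The tree is returned as a child-to-parent dictionary in which each internal node is named after the character on the edge entering it; this encoding and the name `build_tree` are assumptions of the example.

```python
def build_tree(matrix, characters, objects):
    """Build a child -> parent map from a laminar matrix with sorted columns.

    The backward link of each 1 gives the parent of the node below the
    corresponding character edge; each object hangs below its rightmost 1.
    """
    parent = {}
    for obj, row in zip(objects, matrix):
        previous = "root"
        for name, bit in zip(characters, row):
            if bit == 1:
                parent[name] = previous  # edge 'name' hangs below 'previous'
                previous = name
        parent[obj] = previous  # the leaf is attached below its last character
    return parent


# Sorted laminar matrix from the slide.
CHARACTERS = ["C2", "C1", "C3", "C5", "C4"]
OBJECTS = ["A", "B", "C", "D", "E"]
MATRIX = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0],
]

TREE = build_tree(MATRIX, CHARACTERS, OBJECTS)
print(TREE["C1"], TREE["C5"], TREE["C4"])  # C2 C1 C3
```

In the resulting map, C2 and C3 hang below the root, C1 below C2, C5 below C1, and C4 below C3. This agrees with the node labels (01000), (11000), (11001), (00100), and (00110) of the tree on slide 5.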
Slide 16: A scenario where Maximum Parsimony (and Perfect Phylogeny) are misleading
Consider a model with 4 letters (DNA), where the
probability of a substitution is proportional to
time.
In the following topology, leaves 2 and 3 are likely to
be the same as the origin, but leaves 1 and 4 are likely to be
different. In this case, the Maximum Parsimony
principle may be useless or misleading.
[Figure: a tree with leaves 1, 2, 3, and 4; leaves 2 and 3 and the origin are labeled A.]
Slide 17: Parsimony may be useless/misleading
Assume the (likely) scenario where leaves 2 and 3
are the same. There are 4 combinations of
substitutions for leaves 1 and 4. In the first three
(cases I, II, and III), all three topologies obtain the same
parsimony score, so the site is uninformative.
In the fourth, a wrong topology scores best.
Slide 18: Case I: Parsimony is useless
[Figure: leaves 1 = A, 4 = A, 3 = A, 2 = A; origin A. The three topologies score 0, 0, and 0.]
Slide 19: Case II: Parsimony is useless
[Figure: leaves 1 = G, 4 = A, 3 = A, 2 = A; origin A. The three topologies score 1, 1, and 1.]
Slide 20: Case III: Parsimony is useless
[Figure: leaves 1 = G, 4 = C, 3 = A, 2 = A; origin A. The three topologies score 2, 2, and 2.]
Slide 21: Case IV: Parsimony is misleading
[Figure: leaves 1 = C, 4 = C, 3 = A, 2 = A; origin A. The three topologies score 1, 2, and 2.]
Slide 22: Parsimony is correct only in rare cases
Parsimony will infer the tree correctly only in the rare case of a
change on the central edge, or
in the even rarer case of a parallel change
from A to C on the pendant edges to 1 and 2.
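The scores quoted for cases I to IV can be recomputed with Fitch's small-parsimony procedure on the three possible quartet topologies. The sketch below is an illustration, not code from the lecture. It assumes that in cases II and III the letter G sits on leaf 1 (the slides only give the letters of leaves 1 and 4), and the names `fitch_join`, `quartet_score`, and `topology_scores` are hypothetical.

```python
def fitch_join(x, y):
    """One Fitch step: intersect the child sets, or take the union at cost 1."""
    common = x & y
    return (common, 0) if common else (x | y, 1)


def quartet_score(pair1, pair2):
    """Parsimony score of one site on the quartet tree pair1 | pair2."""
    set1, cost1 = fitch_join({pair1[0]}, {pair1[1]})
    set2, cost2 = fitch_join({pair2[0]}, {pair2[1]})
    _, cost3 = fitch_join(set1, set2)  # join across the central edge
    return cost1 + cost2 + cost3


def topology_scores(l1, l2, l3, l4):
    """Scores of the three quartet topologies for the letters at leaves 1..4."""
    return {
        "12|34": quartet_score((l1, l2), (l3, l4)),
        "13|24": quartet_score((l1, l3), (l2, l4)),
        "14|23": quartet_score((l1, l4), (l2, l3)),
    }


# Letters at leaves 1, 2, 3, 4 in the four cases (leaves 2 and 3 are always A).
CASES = {"I": "AAAA", "II": "GAAA", "III": "GAAC", "IV": "CAAC"}

for name, letters in CASES.items():
    print(name, topology_scores(*letters))
```

In cases I, II, and III the three topologies tie at 0, 1, and 2 respectively. In case IV the topology that pairs leaves 1 and 4 scores 1 while the other two score 2, so parsimony prefers the grouping of the two leaves that changed, which is presumably the wrong topology the slides refer to.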
Slide 23: 3. Maximum Likelihood Approach
Consider the phylogenetic tree to be a stochastic
process.
A simple model assumes that on each edge, the
likelihood of a transition from character a to
character b is given by a parameter θb|a. The
likelihood of a letter a at the root is qa. Given
the complete tree, its probability is defined by
the values of the θb|a's and the qa's.
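For a complete tree, in which every node including the internal ones carries a letter, the probability is the root term times one transition term per edge. A minimal sketch follows. The parameter values, the two-leaf tree, and the assumption that the same transition parameters apply on every edge are all illustrative and not taken from the lecture. In the code, `theta[a][b]` stands for the parameter of a transition from a to b, and `q[a]` for the root parameter.

```python
LETTERS = "ACGT"


def complete_tree_probability(parent, letter, root, q, theta):
    """Probability of a fully labeled tree: q at the root, theta on every edge."""
    prob = q[letter[root]]
    for child, par in parent.items():
        prob *= theta[letter[par]][letter[child]]
    return prob


# Illustrative parameter values (assumptions, not from the lecture).
Q = {a: 0.25 for a in LETTERS}
THETA = {a: {b: 0.7 if a == b else 0.1 for b in LETTERS} for a in LETTERS}

# A toy complete tree: a root with two children, all nodes labeled.
PARENT = {"x": "root", "y": "root"}
LETTER = {"root": "A", "x": "A", "y": "C"}

print(complete_tree_probability(PARENT, LETTER, "root", Q, THETA))
```

For this toy tree the product is 0.25 for the root letter A, times 0.7 for the edge on which A is kept, times 0.1 for the edge on which A changes to C.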
Slide 24: Maximum Likelihood Approach (2)
When the data consists only of the leaves'
sequences (but the topology is fixed):
write down the likelihood of the data (the leaves'
sequences) given the tree, and use EM to estimate the
θb|a parameters.
When the tree is not given:
search for the tree that maximizes
Prob(data | Tree, θEM).
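When only the leaves are observed, the likelihood of the data given the tree sums the complete-tree probability over all letters at the internal nodes. The sketch below computes this sum for one site by dynamic programming over the tree (Felsenstein's pruning recursion), a standard technique that is named here as an assumption and does not appear on the slides. It covers only the likelihood computation for fixed parameters; the EM estimation and the search over trees are not shown. The parameter values and the tree are illustrative.

```python
LETTERS = "ACGT"


def leaf_likelihood(children, leaf_letter, root, q, theta):
    """Likelihood of the observed leaf letters for one site, given the tree."""
    def below(node):
        # Returns a -> P(observed leaves below node | node has letter a).
        if node in leaf_letter:
            return {a: 1.0 if a == leaf_letter[node] else 0.0 for a in LETTERS}
        result = {a: 1.0 for a in LETTERS}
        for child in children[node]:
            child_below = below(child)
            for a in LETTERS:
                result[a] *= sum(theta[a][b] * child_below[b] for b in LETTERS)
        return result

    root_below = below(root)
    return sum(q[a] * root_below[a] for a in LETTERS)


# Illustrative parameter values (assumptions, not from the lecture).
Q = {a: 0.25 for a in LETTERS}
THETA = {a: {b: 0.7 if a == b else 0.1 for b in LETTERS} for a in LETTERS}

# A toy tree: a root with two leaves, both observed as A.
CHILDREN = {"root": ["x", "y"]}
LEAVES = {"x": "A", "y": "A"}

print(leaf_likelihood(CHILDREN, LEAVES, "root", Q, THETA))
```

An EM procedure would re-estimate the parameters from expected transition counts and call such a likelihood routine repeatedly; a tree search would then compare the resulting maximized values across candidate topologies.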