Algorithms For Quartet Based Phylogeny Reconstruction

Slides: 54
Provided by: gan98

1
Algorithms For Quartet Based Phylogeny Reconstruction
Gang Wu
Department of Computing Science, University of Alberta
Edmonton, Alberta, Canada
2
Outline
  • Introduction
  • Research methods
  • Computational results
  • Conclusions and future works

3
Common Phylogeny Terminology
Phylogeny: pattern of historical relationships among species (taxa).
Tree: mathematical structure used to depict the evolutionary history of a group of taxa.
(Figure: a rooted tree with leaves A, B, C, D, E and the following parts labeled.)
  • Leaf nodes: represent the taxa (genes, populations, etc.) used to infer the phylogeny
  • Branches or edges
  • Internal nodes: represent hypothetical ancestors of the taxa
  • Root of the tree: common ancestor of all taxa
4
(No Transcript)
5
Quartet Based Phylogeny Reconstruction
  • Quartet: four taxa (A, B, C, D)
  • Quartet topology: an unrooted tree for a quartet
  • There are three possible quartet topologies for a quartet:

AB|CD
AC|BD
AD|BC
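
As a small illustration of the three topologies, the sketch below enumerates them for any four taxa. The encoding (a frozenset of two frozensets, so that AB|CD and CD|AB compare equal) and the helper names are choices made here, not something taken from the slides.

```python
def quartet_topologies(a, b, c, d):
    """Return the three unrooted topologies on four taxa.

    Each topology xy|zw is stored as a frozenset of its two sides,
    so ab|cd, ba|cd and cd|ab are all the same object value.
    """
    return [
        frozenset({frozenset({a, b}), frozenset({c, d})}),  # ab|cd
        frozenset({frozenset({a, c}), frozenset({b, d})}),  # ac|bd
        frozenset({frozenset({a, d}), frozenset({b, c})}),  # ad|bc
    ]


def show(topology):
    """Format a topology as a canonical 'xy|zw' string."""
    sides = sorted("".join(sorted(side)) for side in topology)
    return "|".join(sides)


topologies = quartet_topologies("A", "B", "C", "D")
print([show(t) for t in topologies])  # ['AB|CD', 'AC|BD', 'AD|BC']
```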
6
Process of Quartet Based Phylogeny Reconstruction
7
Definitions
A quartet topology ab|cd is consistent with a
phylogeny T, or a phylogeny T satisfies a
quartet topology ab|cd, iff a, b, c, d are all
leaves of T and the path from a to b does not
share any nodes with the path from c to d.
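
The definition can be checked directly by comparing the two paths. The sketch below does this for a tree stored as adjacency lists; the example tree (three cherries (a,b), (c,d), (e,f) joined at a node called "m") and all function names are assumptions made for illustration.

```python
from collections import deque


def path(adj, start, goal):
    """Return the unique node path between two nodes of a tree (BFS)."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            break
        for nxt in adj[node]:
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    nodes = []
    node = goal
    while node is not None:
        nodes.append(node)
        node = parent[node]
    return nodes[::-1]


def is_consistent(adj, a, b, c, d):
    """True iff ab|cd is consistent with the tree given by adjacency lists."""
    leaves = {n for n, nbrs in adj.items() if len(nbrs) == 1}
    if len({a, b, c, d}) != 4 or not {a, b, c, d} <= leaves:
        return False
    # Consistent iff the a-b path and the c-d path share no node.
    return not set(path(adj, a, b)) & set(path(adj, c, d))


# Hypothetical unrooted tree: cherries (a,b), (c,d), (e,f) joined at node "m".
edges = [("a", "x"), ("b", "x"), ("c", "y"), ("d", "y"),
         ("e", "z"), ("f", "z"), ("x", "m"), ("y", "m"), ("z", "m")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

print(is_consistent(adj, "a", "e", "c", "d"))  # ae|cd -> True
print(is_consistent(adj, "a", "c", "e", "d"))  # ac|ed -> False
```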
8
(Figure: phylogeny T on leaves a, b, c, d, e, f.)
Quartet topology ae|cd is consistent with T, or T
satisfies ae|cd.
9
Definitions
A quartet topology set Q is complete iff Q
contains a quartet topology for every four taxa
in S.
Given a quartet topology set Q on a taxon set S,
Q is compatible iff there is a phylogeny on S
which satisfies all the quartet topologies in Q.
10
ae|cd ab|cd ab|ce ab|cf ab|de ab|df ab|ef
af|cd ac|ef ad|ef be|cd bf|cd bc|ef
bd|ef cd|ef
Quartet topology set Q on taxon set
S = {a, b, c, d, e, f}
Q is complete
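
Completeness of this example can be verified mechanically. The sketch below assumes single-character taxon names and reads "a quartet topology for every four taxa" as exactly one topology per 4-subset; both are assumptions made here.

```python
from itertools import combinations


def parse(q):
    """Turn 'ab|cd' into a (taxon set, topology) pair."""
    left, right = q.split("|")
    return frozenset(left + right), frozenset({frozenset(left), frozenset(right)})


def is_complete(quartets, taxa):
    """True iff every 4-subset of taxa has exactly one topology in quartets."""
    seen = {}
    for q in quartets:
        four, topo = parse(q)
        if seen.setdefault(four, topo) != topo:
            return False  # two different topologies for the same four taxa
    return set(seen) == {frozenset(c) for c in combinations(taxa, 4)}


Q = ("ae|cd ab|cd ab|ce ab|cf ab|de ab|df ab|ef "
     "af|cd ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()

print(is_complete(Q, "abcdef"))      # all C(6,4) = 15 subsets are covered
print(is_complete(Q[1:], "abcdef"))  # the subset {a, c, d, e} is now missing
```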
11
Definitions
Given a taxon set S, we define the phylogeny that
reveals the correct relationships among the taxa
in S as the true phylogeny on S, denoted as
T_true. If a quartet topology q is inconsistent
with T_true, then q is a quartet error of T_true.
(Figure: true phylogeny T_true on leaves a, b, c, d, e, f; the quartet topology ac|ed is a quartet error.)
12
Research Goal
Given a quartet topology set Q on taxon set S,
reconstruct a phylogeny that is as close as
possible to the true phylogeny on S.
13
Research Methods
  • An answer set programming based method and a
    look-ahead branch and bound method for solving
    the general Maximum Quartet Consistency (MQC)
    problem
  • A polynomial time algorithm for solving a special
    MQC problem
  • Three efficient algorithms to reconstruct true
    phylogeny with high success probabilities.

14
MQC/MQI Problem
Maximum Quartet Consistency Problem (MQC)
Input: A set Q of quartet topologies on S.
Goal: Find a phylogeny T on S such that the number of
consistent quartet topologies in Q is maximized.
Minimum Quartet Inconsistency Problem (MQI)
Input: A set Q of quartet topologies on S.
Goal: Find a phylogeny T on S such that the number of
inconsistent quartet topologies in Q is minimized.
15
MQC/MQI Problem
ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef
af|cd ac|ef ad|ef be|cd bf|cd bc|ef
bd|ef cd|ef
Quartet topology set Q on taxon set
S = {a, b, c, d, e, f}
No phylogeny satisfies all
the quartet topologies in Q.
We therefore look for a phylogeny that satisfies
a maximum number of quartet topologies in Q.
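
To make the MQC objective concrete, the sketch below counts how many quartet topologies of this Q a candidate phylogeny satisfies. It uses the equivalent edge-based test (ab|cd holds iff some edge separates {a, b} from {c, d}). The candidate tree with cherries (a,b), (c,d), (e,f) is an assumption inferred from the quartet list, not copied from the slide figure, and taxon names are assumed to be single characters.

```python
def clusters(tree, out):
    """Collect the leaf set below every node of a nested-tuple tree."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset().union(*(clusters(child, out) for child in tree))
    out.append(leaves)
    return leaves


def satisfies(cluster_list, q):
    """True iff some edge of the tree separates the two sides of q."""
    left, right = (set(side) for side in q.split("|"))
    return any((left <= c and not right & c) or (right <= c and not left & c)
               for c in cluster_list)


def count_consistent(tree, quartets):
    """MQC objective value of a candidate tree."""
    cluster_list = []
    clusters(tree, cluster_list)
    return sum(satisfies(cluster_list, q) for q in quartets)


Q = ("ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef "
     "af|cd ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()

# Assumed candidate phylogeny (the rooting is arbitrary for this test).
candidate = (("a", "b"), ("c", "d"), ("e", "f"))
print(count_consistent(candidate, Q), "of", len(Q))  # 14 of 15
```

Under these assumptions only ac|ed is left unsatisfied, so the MQI value of this candidate is 1.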
16
MQC/MQI Problem
ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef
af|cd ac|ef ad|ef be|cd bf|cd bc|ef
bd|ef cd|ef
Quartet topology set Q on taxon set
S = {a, b, c, d, e, f}
Phylogeny T_opt for the MQC problem
If a quartet topology q ∈ Q is not consistent
with the optimal phylogeny T_opt, then it is a
conflicting quartet topology.
17
Why MQC/MQI
  • Simulation results show that in most cases the
    phylogenies obtained by solving the MQC/MQI problem
    are the true phylogenies, especially when the number of
    quartet errors is small.
  • The MQC/MQI problem is NP-complete, and it is a
    challenge to solve it efficiently in practical
    phylogeny reconstruction.

18
Outline
  • Introduction
  • Research methods
  • Answer set programming method based on
    ultrametric phylogeny
  • Computational results
  • Conclusions and future works

19
Ultrametric Phylogeny and Matrix
  • Ultrametric Phylogeny
  • Label each internal node with a positive integer
    number
  • Along any root to a leaf path, the labels on the
    path are strictly decreasing

20
Ultrametric Phylogeny and Matrix
  • Ultrametric Phylogeny
  • Label each internal node with a positive integer
    number
  • Along any root to a leaf path, the labels on the
    path are strictly decreasing

Ultrametric Matrix: each entry value is the label
of the least common ancestor of the two leaf nodes.
It is
  • Symmetric, with M(i, i) = 0, and
  • For every triplet (i, j, k), two of the values
    M(i, j), M(j, k), and M(i, k) are equal and
    greater than the third value.

e.g. i = 1, j = 3, k = 4: M(1, 3) = M(3, 4) > M(1, 4)
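
The two properties can be checked on a small example. The sketch below builds the matrix of least-common-ancestor labels from a labeled rooted tree and tests the ultrametric conditions; the tree (root 4 with subtrees 1 = (s1, s5) and 3 = (s4, 2 = (s3, s2))) and the function names are assumptions chosen for illustration.

```python
from itertools import combinations


def fill_lca_labels(tree, M):
    """Fill M[x, y] with the label of the least common ancestor of x and y.

    A tree is a leaf name or a pair (label, [children]).
    Returns the list of leaves below the given node.
    """
    if isinstance(tree, str):
        M[tree, tree] = 0
        return [tree]
    label, children = tree
    groups = [fill_lca_labels(child, M) for child in children]
    for g1, g2 in combinations(groups, 2):
        for x in g1:
            for y in g2:
                M[x, y] = M[y, x] = label
    return [leaf for g in groups for leaf in g]


def is_ultrametric(M, leaves):
    """Check zero diagonal, symmetry, and the two-equal-maxima triplet rule."""
    if any(M[x, x] != 0 for x in leaves):
        return False
    if any(M[x, y] != M[y, x] for x, y in combinations(leaves, 2)):
        return False
    for i, j, k in combinations(leaves, 3):
        lo, mid, hi = sorted([M[i, j], M[j, k], M[i, k]])
        if not (lo < mid == hi):
            return False
    return True


# Labels strictly decrease along every root-to-leaf path.
tree = (4, [(1, ["s1", "s5"]), (3, ["s4", (2, ["s3", "s2"])])])
M = {}
leaves = fill_lca_labels(tree, M)
print(is_ultrametric(M, leaves))
```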
21
Theorem: A quartet topology ab|cd is consistent
with a phylogeny T iff any ultrametric labeling
scheme M of T satisfies
min{M(a, c), M(b, d)} > min{M(a, b), M(c, d)}.
(Figure: an ultrametric phylogeny on leaves s1, s5, s4, s3, s2 with internal nodes labeled 4, 3, 1, 2.)
s1 s5 | s2 s3 is consistent with the tree and
its corresponding matrix:
min{M(1, 2), M(5, 3)} = 4 > min{M(1, 5), M(2, 3)} = 1.
Condition satisfied!
An ultrametric matrix satisfies a quartet
topology ab|cd if the above inequality is
satisfied.
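
The inequality is easy to evaluate once the matrix is known. The sketch below hard-codes the least-common-ancestor labels of one labeled tree that reproduces the numbers of the example (root 4, (s1, s5) labeled 1, (s2, s3) labeled 2, and s4 joined to (s2, s3) at label 3); that tree is an assumption, since the slide figure is not recoverable.

```python
from itertools import combinations

leaves = ["s1", "s2", "s3", "s4", "s5"]

# LCA labels below the root; every other pair meets at the root (label 4).
below_root = {
    frozenset({"s1", "s5"}): 1,
    frozenset({"s2", "s3"}): 2,
    frozenset({"s2", "s4"}): 3,
    frozenset({"s3", "s4"}): 3,
}


def M(x, y):
    """Ultrametric matrix entry for two distinct leaves."""
    return below_root.get(frozenset({x, y}), 4)


def matrix_satisfies(a, b, c, d):
    """Test ab|cd via min{M(a,c), M(b,d)} > min{M(a,b), M(c,d)}."""
    return min(M(a, c), M(b, d)) > min(M(a, b), M(c, d))


print(matrix_satisfies("s1", "s5", "s2", "s3"))  # 4 > 1 -> True
print(matrix_satisfies("s1", "s2", "s5", "s3"))  # 1 > 4 -> False
```

For every four leaves of this tree, exactly one of the three possible topologies passes the test.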
22
Theorem Given a quartet topology set Q on S and
an ultrametric phylogeny T on S, T satisfies k
quartet topologies in Q if and only if the
corresponding ultrametric matrix M on S satisfies
the same k quartet topologies in Q.
We transform the original MQC problem into an
ultrametric matrix search problem.
23
Problem Formulation
  • Input
      • n × n matrix M(i, j); the domain of each matrix
        entry is 0..n-1
      • Quartet topology set Q
  • Goal
      • Find a solution to M(i, j), so that
          • The matrix is ultrametric
          • The number of quartet topologies satisfied by the
            matrix is maximized

24
Answer Set Programming
Given a set of logic rules
a ← b, c, not d
b ← not d
Use a solver program to find the solution
{a, b, c}
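
What a solver computes can be illustrated with a brute-force check of the stable model (answer set) condition. The transcript shows only two rules; the sketch below additionally assumes that c is given as a fact, since otherwise c could not be derived. The rule encoding and function names are choices made here.

```python
from itertools import chain, combinations

# Each rule is (head, positive body, negative body).
program = [
    ("a", {"b", "c"}, {"d"}),  # a <- b, c, not d.
    ("b", set(), {"d"}),       # b <- not d.
    ("c", set(), set()),       # c.   (assumed fact, not in the transcript)
]
atoms = ["a", "b", "c", "d"]


def least_model(positive_rules):
    """Least model of a negation-free program by fixpoint iteration."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, body in positive_rules:
            if body <= model and head not in model:
                model.add(head)
                changed = True
    return model


def is_answer_set(candidate):
    """Gelfond-Lifschitz test: candidate equals the least model of its reduct."""
    # Drop rules blocked by the candidate, then drop the negative literals.
    reduct = [(h, pos) for h, pos, neg in program if not neg & candidate]
    return least_model(reduct) == candidate


subsets = chain.from_iterable(combinations(atoms, r) for r in range(len(atoms) + 1))
answer_sets = [set(s) for s in subsets if is_answer_set(set(s))]
print(answer_sets)
```

Real answer set solvers avoid this exponential enumeration; the sketch only shows which sets qualify as solutions.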
25
Formulation in Answer Set Programming
Domain
1 { m(1, 2, 1), m(1, 2, 2), m(1, 2, 3), m(1, 2, 4), m(1, 2, 5) } 1
(matrix entry (1, 2) takes exactly one value in the domain 1..5)
Ultrametric Constraints
for three matrix values, m(i, j), m(j, k) and
m(i, k), two of them are equal and greater than
the third one
Quartet Constraints
if min{m(i, k), m(j, l)} > min{m(i, j), m(k, l)} then
quartet ij|kl is satisfied
Objective
maximize q(i, j, k, l)
26
Outline
  • Introduction
  • Research methods
  • Answer set programming method based on
    ultrametric phylogeny
  • A lookahead Branch and Bound algorithm
  • Computational results
  • Conclusions and future works

27
Background
Local conflict: an incompatible set of 3 quartet
topologies over 5 taxa, for example ab|cd, ac|be
and ac|de.
Theorem: Given a complete set of
quartet topologies Q over a set of taxa S and
some taxon e in S, Q is compatible iff there
exists no local conflict whose taxon set includes
e.
Idea: Construct a local conflict list
involving a taxon e, and then try to resolve all
the local conflicts in the list by changing fewer
than k quartet topologies.
Method: Branch and Bound
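
Whether three quartet topologies over five taxa form a local conflict can be decided by brute force, because a binary tree on five leaves always has two cherries and one middle leaf. The sketch below tries every such tree; single-character taxon names and the function names are assumptions made here.

```python
from itertools import permutations


def norm(x, y, z, w):
    """Canonical form of the topology xy|zw."""
    return frozenset({frozenset({x, y}), frozenset({z, w})})


def parse(q):
    """Turn 'ab|cd' into its canonical form."""
    left, right = q.split("|")
    return norm(left[0], left[1], right[0], right[1])


def induced_quartets(x, y, z, u, v):
    """Topologies of the 5-leaf tree with cherries (x,y), (u,v) and middle leaf z."""
    return {norm(x, y, u, v), norm(y, z, u, v), norm(x, z, u, v),
            norm(x, y, z, v), norm(x, y, z, u)}


def is_local_conflict(quartets):
    """True iff the 3 topologies span 5 taxa and no 5-leaf tree satisfies all."""
    taxa = set("".join(quartets)) - {"|"}
    if len(quartets) != 3 or len(taxa) != 5:
        return False
    wanted = {parse(q) for q in quartets}
    return not any(wanted <= induced_quartets(*order)
                   for order in permutations(sorted(taxa)))


print(is_local_conflict(["ab|cd", "ac|be", "ac|de"]))  # the example above -> True
print(is_local_conflict(["ab|cd", "ab|ce", "ac|de"]))  # compatible triple -> False
```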
28
Lookahead
Contribution of changing a quartet topology: the
difference between the sizes of the local conflict
lists before and after the change.
Example: suppose the current local conflict list has size 100.
We choose a quartet topology ab|cd and
change it to ac|bd. The new local conflict list
has size 60. Then the contribution of
ab|cd → ac|bd is 40.
At each search node, a lookahead mechanism first
tests the contribution of each possible branch
and chooses the one with maximum
contribution to continue searching.
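
The contribution of a single change can be computed by rebuilding the local conflict list. The sketch below does this for the 15-quartet example set containing ac|ed and taxon e; it treats every incompatible triple inside a 5-taxon set containing e as one local conflict, which is one reading of the definition. The helper names and the brute-force compatibility test are assumptions made here, not the data structures used in the actual algorithm.

```python
from itertools import combinations, permutations


def norm(x, y, z, w):
    """Canonical form of the topology xy|zw."""
    return frozenset({frozenset({x, y}), frozenset({z, w})})


def parse(q):
    left, right = q.split("|")
    return norm(*left, *right)


def tree_quartets(x, y, z, u, v):
    """Topologies of the 5-leaf tree with cherries (x,y), (u,v) and middle leaf z."""
    return {norm(x, y, u, v), norm(y, z, u, v), norm(x, z, u, v),
            norm(x, y, z, v), norm(x, y, z, u)}


def local_conflicts(Q, taxa, e):
    """Incompatible triples of topologies within 5-taxon sets that contain e."""
    topo = {frozenset().union(*t): t for t in map(parse, Q)}  # 4-set -> topology
    conflicts = []
    for rest in combinations([t for t in taxa if t != e], 4):
        five = rest + (e,)
        trees = [tree_quartets(*order) for order in permutations(five)]
        present = [topo[frozenset(four)] for four in combinations(five, 4)]
        for triple in combinations(present, 3):
            if not any(set(triple) <= t for t in trees):
                conflicts.append(triple)
    return conflicts


Q = ("ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef "
     "af|cd ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()

before = len(local_conflicts(Q, "abcdef", "e"))
changed = ["ae|cd" if q == "ac|ed" else q for q in Q]   # the change ac|ed -> ae|cd
after = len(local_conflicts(changed, "abcdef", "e"))
contribution = before - after
print(before, after, contribution)
```

A lookahead step would compute this value for each candidate change and branch on the largest one.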
29
Outline of Algorithm
  • At every node in the search tree:
    1. Test to decide to cut the node or not
       (m is the number of local conflicts, k is the
       maximum number of quartet errors, k1 is the number of
       changed quartet topologies so far)
    2. Determine need-to-be-changed quartet topologies
       (if there are 3(k-k1) distinct local conflicts
       involving q, then q must be changed)
    3. Determine need-to-be-fixed quartet topologies
       (find optimal edges; all the quartet
       topologies consistent with the optimal edges are
       fixed)
    4. Use the quartet inference rules on the quartet
       topologies generated in step 3 (e.g. ab|cd and
       ab|ce → ab|de)

30
Outline of Algorithm-Contd
    5. Build a local conflict list and partition it
       into two parts.
       IF there are need-to-be-changed quartet topologies
           Pick the need-to-be-changed quartet topology
           achieving the largest contribution to resolve
       ELSE
           Pick the way of resolving a conflict that achieves
           the largest contribution
31
Outline
  • Introduction
  • Research methods
  • Answer set programming method based on
    ultrametric phylogeny
  • A lookahead Branch and Bound algorithm.
  • A polynomial time algorithm when the number of
    conflicting quartet topologies is O(n)
  • Computational results
  • Conclusions and future works

32
Background
  • MQC/MQI is NP-complete if Q is complete
  • Known result If the number of conflicting
    quartet topologies is less than (n is
    the number of taxa), then MQC/MQI can be solved
    in polynomial time
  • We extend the result to O(n).

33
Lemma
Let E denote the set of conflicting quartet
topologies in an optimal solution to the MQC/MQI
problem. There is a taxon e such that the number
of quartet topologies in E involving e is less
than or equal to 4|E|/n.
  • A quartet topology contains 4 taxa
  • The quartet topologies in E contain 4|E| taxon occurrences in total
  • Input taxon set size is n.
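
The counting argument behind the three bullets can be written out as an averaging step. The notation occ_E(t), for the number of quartet topologies in E that involve taxon t, is introduced here, and the bound 4|E|/n is inferred from the bullets and from the 4c bound used in the theorem that follows.

```latex
\[
  \sum_{t \in S} \mathrm{occ}_E(t) = 4\,|E|
  \quad\Longrightarrow\quad
  \min_{t \in S} \mathrm{occ}_E(t)
  \;\le\; \frac{1}{n} \sum_{t \in S} \mathrm{occ}_E(t)
  \;=\; \frac{4\,|E|}{n}.
\]
```

In particular, if |E| is at most cn, some taxon e is involved in at most 4c quartet topologies of E.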

34
Lemma
In the MQC/MQI problem, if no
conflicting quartet topology involves taxon e,
then the problem can be solved in O(n^4) time.
  • Build a local conflict list involving e
  • Of the 3 quartet topologies in a local conflict,
    at least 2 must contain e. Therefore they are not
    conflicting quartet topologies
  • Then the third quartet topology, which does not contain e,
    must be changed to resolve the local conflict.

35
Theorem
There is a polynomial time algorithm that solves the
MQC/MQI problem when the number of conflicting
quartet topologies in the given complete quartet
topology set Q is at most cn, where c is a
positive constant and n is the number of taxa.
  • At least one taxon, e, is involved in at most 4c
    quartet topologies in E.
  • We try every possible change for every possible
    combination of the 4c quartet topologies in E. This
    is still polynomial since 4c is a constant.
  • The remaining quartet topologies in E do not involve e,
    and can be determined in O(n^4) time.

36
Algorithm
  • For (every taxon e ∈ S)
      • Construct the local conflict list involving taxon e
      • If (the size of the list is 30c(n - 4))
          • For (every k ∈ [0, 4c])
              • For (every combination of k topologies involving e)
                  • For (every possible change of these k topologies)
                      • Change (cn - k) topologies not involving e to resolve conflicts
                      • Update the best solution if there is no conflict left
  • If (the best solution is empty)
      • Set Q must contain more than cn conflicting quartet topologies, exit
  • Else
      • Construct the phylogeny associated with the best solution
      • Return the best solution and its associated phylogeny.

37
Outline
  • Introduction
  • Research methods
  • Answer set programming based on ultrametric
    phylogeny
  • A lookahead Branch and Bound algorithm.
  • A polynomial time algorithm when the number of
    conflicting quartet topologies is O(n)
  • A probabilistic model and three efficient
    algorithms to compute T_true with high
    probabilities.
  • Computational results
  • Conclusions and future works

38
Background
Given a tree T on n leaves, there exists an
internal node (separator) whose removal
partitions the tree into connected components,
each with at most n/2 leaves, and such a node can
be found in O(n) time.
(Figure: a tree with its separator node marked.)
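
One standard way to find such a separator is to root the tree at any internal node, count the leaves below every node, and walk toward the child that still holds more than n/2 leaves. The sketch below follows that idea; it is not necessarily the procedure intended on the slide, it assumes the tree has at least one internal node and at least two leaves, and the example tree is hypothetical.

```python
def find_separator(adj):
    """Return an internal node whose removal leaves components with <= n/2 leaves."""
    n = sum(1 for v in adj if len(adj[v]) == 1)          # number of leaves
    root = next(v for v in adj if len(adj[v]) > 1)       # any internal node

    # Breadth-first traversal: parent map and visiting order.
    parent = {root: None}
    order = [root]
    for v in order:                                      # list grows while iterating
        for w in adj[v]:
            if w not in parent:
                parent[w] = v
                order.append(w)

    # Number of leaves below each node, accumulated bottom-up.
    below = {v: (1 if len(adj[v]) == 1 else 0) for v in adj}
    for v in reversed(order[1:]):
        below[parent[v]] += below[v]

    # Walk down while some child subtree holds more than half of the leaves.
    node = root
    while True:
        heavy = [w for w in adj[node] if parent.get(w) == node and below[w] > n / 2]
        if not heavy:
            return node
        node = heavy[0]


# Hypothetical tree: cherries (a,b), (c,d), (e,f) joined at node "m".
edges = [("a", "x"), ("b", "x"), ("c", "y"), ("d", "y"),
         ("e", "z"), ("f", "z"), ("x", "m"), ("y", "m"), ("z", "m")]
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

print(find_separator(adj))  # m: removing it leaves three components with 2 leaves each
```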
39
Probabilistic Model
  • Given T_true, generate the complete quartet topology
    set Q_{T_true} for T_true.
  • For every quartet topology in Q_{T_true}, with
    probability p/2, change its topology into each of
    the other two topologies (then every quartet
    topology has a probability of p to be a quartet
    error).
  • Simulation results show that current quartet
    inference methods can achieve over 80% accuracy
    for inferred quartet topologies.
  • We assume 0 < p < 1/3
40
Theorem
Given a quartet topology set Q with no quartet
errors (p = 0), T_true can be constructed in O(n^2)
time by querying at most (n-4) log(n-1) quartet
topologies in Q.
  • Start with a random quartet topology
  • Iteratively insert a new taxon to grow the
    phylogeny.

41
(Figure: inserting a new taxon g into the phylogeny; quartet topologies shown: ae|cg and ag|cd.)
42
Theorem
When 0 < p < 1/3, we can reconstruct T_true in
O(n^4 log n) time with a probability at least
  • Use a voting scheme to decide into which branch
    the new taxon should be inserted.

43
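(Figure: voting on the branch into which the new taxon g is inserted. Quartet topologies shown: ae|cg, ae|dg, be|cg, be|dg, ag|cf, bf|dg, ... and ag|cd, bg|cd, ec|gd, fg|cd.)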
44
Compatible 5-subset
A compatible complete quartet topology set on a
taxon set of 5 taxa.
Example: ae|cd, ab|cd, ab|ce, ab|de, be|cd
(Figure: the corresponding phylogeny on leaves a, b, c, d, e.)
45
Theorem
When 0 < p < 1/3 and we can start with a
compatible 5-subset, T_true can be
reconstructed in O(n^5) time with a probability at
least
where
46
Outline
  • Introduction
  • Research methods
  • Answer set programming based on ultrametric
    phylogeny
  • A lookahead Branch and Bound algorithm.
  • A polynomial time algorithm when the number of
    conflicting quartet topologies is O(n)
  • A probabilistic model and three efficient
    algorithms to compute T_true with high
    probabilities.
  • Computational results
  • Conclusions and future works

47
Running times of the exact algorithms for the MQC
problem
  • DP: dynamic programming method by A. Ben-Dor et al.
  • GN: fixed parameter method by Gramm et al.
  • ASP: our answer set programming method
  • LBnB-Opt: our lookahead branch and bound method
48
Running times on datasets with small quartet
errors
49
Probability comparison among the proposed
algorithm (M-VOTE), the hypercleaning algorithm
(HC), the answer set programming method for the
MQC problem (ASP), and the theoretical success
probability of M-VOTE.
50
Conclusions
  • The answer set programming formulation gives a
    new perspective of the MQC problem.
  • The proposed exact algorithms outperform other
    exact algorithms significantly.
  • In general problem instances, the answer set
    programming method has the greatest efficiency.
  • If the number of quartet errors is small and the quartet
    topology set is complete, the lookahead branch
    and bound algorithm has the greatest efficiency.

51
Conclusions
  • The polynomial time algorithm for solving a
    special MQC/MQI problem improves on the current
    result in this area.
  • The probabilistic model and the proposed
    algorithms open up several research directions.

52
Future Work
  • Design a quartet specific answer set programming
    solver
  • Investigate other possible probability
    distributions and design efficient algorithms.

53
Questions?