PHYLOGENY RECONSTRUCTION

Provided by: gan98
1
PHYLOGENY RECONSTRUCTION FROM QUARTETS
Gang Wu
Department of Computing Science
University of Alberta
2
Outline
  • Introduction
  • Research Methods
  • Computational Results and Analysis

3
Common Phylogenetic Tree Terminology
Phylogeny: the pattern of historical relationships among species.
Tree: a mathematical structure used to depict the evolutionary history of a group of species.
  • Leaf nodes: represent the species (genes, populations, etc.) used to infer the phylogeny
  • Branches or edges: connect the nodes
  • Internal nodes: represent hypothetical ancestors of the species
  • Root of the tree: the common ancestor of all species
[Figure: example tree with nodes labeled A, B, C, D, E]
4
Phylogeny Example for Mammal
5
Rooted and Unrooted tree
[Figure: an unrooted tree on the species A, B, C, D, and rooted trees on the same species with the root marked]
6
General Process of Phylogeny Construction
Input: a set of (DNA or protein) sequences for the species
Output: an evolutionary tree (phylogeny) whose leaf nodes are the input species
Methods: Maximum Parsimony (MP), Maximum Likelihood (ML), etc.
Exact versions of these methods are not suitable for large trees (over 20 species). Current software packages all use heuristics to reduce the computation time.
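One way to see why exact search breaks down for large trees is to count the candidate trees. The sketch below uses the standard counting result that there are (2n - 5)!! unrooted binary trees on n labeled leaves. That formula is not stated on the slide, and the function name is hypothetical.

```python
def num_unrooted_binary_trees(n: int) -> int:
    """Number of distinct unrooted binary trees on n labeled leaves: (2n - 5)!!."""
    if n < 3:
        return 1  # one, two, or (below) three species admit a single tree
    count = 1
    for k in range(3, 2 * n - 4, 2):  # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 6, 10, 20):
    print(n, num_unrooted_binary_trees(n))
```

For four species the count is three, and it grows super-exponentially from there.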
7
Quartet Based Phylogeny Construction
  • There is only one unrooted tree for one, two or
    three species.
  • There are three possible unrooted trees for four
    species (A, B, C, D)
  • Quartets are the smallest informative unrooted trees
  • MP or ML can be solved exactly on quartets

The three quartet topologies: AB|CD, AC|BD, AD|BC
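As a small illustration, the three topologies on four species can be enumerated directly. The sketch writes AB|CD for the unrooted tree that separates A and B from C and D; the function names are hypothetical.

```python
def quartet_topologies(a, b, c, d):
    """Return the three possible unrooted topologies on four species."""
    return [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]


def fmt(quartet):
    """Format ((w, x), (y, z)) as 'wx|yz'."""
    (w, x), (y, z) = quartet
    return f"{w}{x}|{y}{z}"


topologies = [fmt(q) for q in quartet_topologies("A", "B", "C", "D")]
print(topologies)
```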
8
Process of Quartet Based Phylogeny Construction
9
Definitions
A quartet ab|cd is consistent with a phylogeny T,
or a phylogeny T satisfies a quartet ab|cd, if
and only if a, b, c, d are all leaves of T and the
path from a to b does not share any nodes with
the path from c to d.
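The definition translates directly into a path-disjointness test. The following is a minimal sketch, assuming the tree is given as an adjacency list. The example tree (pairs a,b / c,d / e,f joined at a central node) and all names are assumptions made for illustration, and a quartet ab|cd is written as ((a, b), (c, d)).

```python
# Hypothetical unrooted tree: leaves a..f, internal nodes u, v, w, x.
TREE = {
    "a": ["u"], "b": ["u"], "c": ["v"], "d": ["v"], "e": ["w"], "f": ["w"],
    "u": ["a", "b", "x"], "v": ["c", "d", "x"], "w": ["e", "f", "x"],
    "x": ["u", "v", "w"],
}


def path(tree, start, goal):
    """Return the unique path between two nodes of a tree (depth-first search)."""
    stack = [[start]]
    while stack:
        trail = stack.pop()
        if trail[-1] == goal:
            return trail
        for nxt in tree[trail[-1]]:
            if nxt not in trail:
                stack.append(trail + [nxt])
    raise ValueError("nodes are not connected")


def satisfies(tree, quartet):
    """True if the tree satisfies the quartet ((a, b), (c, d))."""
    (a, b), (c, d) = quartet
    if any(len(tree.get(x, [])) != 1 for x in (a, b, c, d)):
        return False  # all four species must be leaves of the tree
    # Consistent exactly when the two paths share no node.
    return not set(path(tree, a, b)) & set(path(tree, c, d))


print(satisfies(TREE, (("a", "e"), ("c", "d"))))  # quartet ae|cd
print(satisfies(TREE, (("a", "c"), ("e", "d"))))  # quartet ac|ed
```

In this example tree the paths a-e and c-d are disjoint, while the paths a-c and e-d both pass through the central node.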
10
Quartet set Q: ae|cd, ab|cd, ab|ce, ab|cf, ab|de, ab|df, ab|ef, af|cd, ac|ef, ad|ef, be|cd, bf|cd, bc|ef, bd|ef, cd|ef
[Figure: phylogeny T]
The quartet ae|cd is consistent with T, or T satisfies ae|cd.
11
Definitions
Given a set of quartets Q on a set S of species,
Q is compatible if and only if there is a
phylogeny on S which satisfies all the quartets
in Q.
A set Q of quartet topologies is complete if Q
contains a quartet topology for every four labels
in the label set S.
12
Quartet set Q: ae|cd, ab|cd, ab|ce, ab|cf, ab|de, ab|df, ab|ef, af|cd, ac|ef, ad|ef, be|cd, bf|cd, bc|ef, bd|ef, cd|ef
[Figure: phylogeny T]
  • The quartet set Q is compatible
  • The quartet set Q is complete
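Completeness can be checked mechanically: a complete set on n species has one topology for each of the C(n, 4) four-element subsets. The sketch below assumes single-character species labels and writes each quartet as ab|cd (a and b on one side, c and d on the other). It also assumes "complete" means exactly one topology per subset; the function name is hypothetical.

```python
from itertools import combinations
from math import comb

Q = ("ae|cd ab|cd ab|ce ab|cf ab|de ab|df ab|ef af|cd "
     "ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()


def is_complete(quartets, species):
    """True if every four-element subset of species has exactly one topology."""
    covered = [frozenset(q.replace("|", "")) for q in quartets]
    wanted = {frozenset(four) for four in combinations(species, 4)}
    return len(covered) == len(set(covered)) and set(covered) == wanted


print(len(Q), comb(6, 4))
print(is_complete(Q, "abcdef"))
```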

13
Problem Descriptions
In practice, the given quartet set Q usually contains errors and is therefore incompatible.
Quartet Compatibility Problem (QCP)
  Input: a set Q of quartets on S.
  Question: is Q compatible? Equivalently, is there a phylogeny T on S such that all quartets in Q are satisfied?
Maximum Quartet Consistency Problem (MQC)
  Input: a set Q of quartets on S.
  Goal: find a phylogeny T on S such that the number of consistent quartets in Q is maximized.
Minimum Quartet Inconsistency Problem (MQI)
  Input: a set Q of quartets on S.
  Goal: find a phylogeny T on S such that the number of inconsistent quartets in Q is minimized.
14
Input quartet set Q: ac|ed, ab|cd, ab|ce, ab|cf, ab|de, ab|df, ab|ef, af|cd, ac|ef, ad|ef, be|cd, bf|cd, bc|ef, bd|ef, cd|ef
Quartet Compatibility Problem (QCP)? No.
MQC or MQI? Only ac|ed is not satisfied.
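The MQC and MQI objectives for a fixed tree can be counted as follows. This is a sketch under stated assumptions: the tree figure is not in the transcript, so the tree used here (pairs a,b / c,d / e,f joined at the root) is assumed. Quartets are written ab|cd with single-character labels. The test used is the split formulation of consistency: a tree satisfies ab|cd when some edge separates {a, b} from {c, d}.

```python
Q = ("ac|ed ab|cd ab|ce ab|cf ab|de ab|df ab|ef af|cd "
     "ac|ef ad|ef be|cd bf|cd bc|ef bd|ef cd|ef").split()

# Assumed tree as nested tuples; strings are leaves.
TREE = (("a", "b"), ("c", "d"), ("e", "f"))


def clusters(tree):
    """Collect the set of leaves below every node of a rooted tree."""
    found = []

    def walk(node):
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.append(leaves)
        return leaves

    walk(tree)
    return found


def satisfied(quartet, cluster_list):
    """True if some cluster holds one side of the quartet and none of the other."""
    left, right = (set(side) for side in quartet.split("|"))
    return any((left <= c and not right & c) or (right <= c and not left & c)
               for c in cluster_list)


cl = clusters(TREE)
consistent = sum(satisfied(q, cl) for q in Q)    # MQC objective for this tree
inconsistent = len(Q) - consistent               # MQI objective for this tree
print(consistent, inconsistent)
```

Because the two counts always add up to the size of Q, maximizing one is the same as minimizing the other for exact solutions.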
15
Known Results
The Quartet Compatibility Problem (QCP) can be solved
in polynomial time if the given quartet set Q is
complete, but it is NP-complete if Q is
incomplete.
The Maximum Quartet Consistency Problem (MQC) and the
Minimum Quartet Inconsistency Problem (MQI) are
NP-hard even if Q is complete.
Exact algorithms: guaranteed to find the
optimal ("best") tree.
Heuristic algorithms: approximate or quick-and-dirty methods that
attempt to find the optimal tree but cannot
guarantee to do so.
16
Known Results
Many heuristics exist. The best known approximation
algorithm is Quartet Cleaning, with an approximation
ratio of (formula not in transcript) for MQI, where n is the number of
species.
There are only two exact algorithms in the
literature.
  • Dynamic programming has a complexity of (formula not in transcript),
    where m is the number of input quartets and n is the number of input
    species. It is a general algorithm.
  • The fixed-parameter algorithm has a complexity of (formula not in transcript),
    where k is the largest number of quartet errors and n is the number of input
    species. It works well if k is very small compared to the
    total number of quartets, and is worse than dynamic
    programming if k is relatively large.
Dynamic programming can solve the MQC problem with 20
species in 6 days on a 300 MHz computer. The fixed-parameter
algorithm can solve the MQI problem with 50
species when k = 100 in 40 minutes on a 750 MHz
computer.
17
Research Objectives
  • Exact algorithm for MQC
  • Quartet set Q is complete
  • Faster
  • Can solve problem with more species

18
Ultrametric Tree and Matrix
Ultrametric tree: label each internal node
with a number. If, along every root-to-leaf path,
the labels of the internal nodes are
strictly decreasing, then the tree together with its
labels is called an ultrametric tree.
Ultrametric matrix: each entry M(i, j) is the label
of the least common ancestor of leaf nodes i and j.
The matrix satisfies:
  • M is symmetric, with M(i, i) = 0, and
  • for every triplet (i, j, k), two of the values
    M(i, j), M(j, k), and M(i, k) are equal, and they are
    greater than the third value.

e.g. i = 1, j = 3, k = 4: M(1, 3) = M(3, 4) > M(1, 4)
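A short sketch of both definitions: it builds the matrix of least-common-ancestor labels from a labeled rooted tree and then checks the three conditions. The example tree, its labels, and the function names are hypothetical; internal nodes are tuples of the form (label, child, child) and leaves are strings.

```python
from itertools import combinations

# Hypothetical ultrametric tree: labels decrease along every root-to-leaf path.
TREE = (4, (2, (1, "s1", "s2"), "s3"), (3, "s4", "s5"))


def lca_matrix(tree):
    """Return (leaves, M), where M[x, y] is the label of the least common ancestor."""
    M = {}

    def walk(node):
        if isinstance(node, str):
            M[node, node] = 0
            return [node]
        label, *children = node
        groups = [walk(child) for child in children]
        # Leaves from different children meet for the first time at this node.
        for g1, g2 in combinations(groups, 2):
            for x in g1:
                for y in g2:
                    M[x, y] = M[y, x] = label
        return [leaf for g in groups for leaf in g]

    return walk(tree), M


def is_ultrametric(leaves, M):
    """Check zero diagonal, symmetry, and the condition on every triplet."""
    if any(M[x, x] != 0 for x in leaves):
        return False
    if any(M[x, y] != M[y, x] for x, y in combinations(leaves, 2)):
        return False
    for i, j, k in combinations(leaves, 3):
        low, mid, high = sorted((M[i, j], M[j, k], M[i, k]))
        if not (low < mid == high):
            return False
    return True


leaves, M = lca_matrix(TREE)
print(M["s1", "s2"], M["s1", "s3"], M["s3", "s5"])
print(is_ultrametric(leaves, M))
```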
19
Theorem 1: A quartet ab|cd is consistent with a
phylogeny T if and only if any ultrametric
labeling scheme M of T satisfies
min{M(a, c), M(b, d)} > min{M(a, b), M(c, d)}.
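The condition in Theorem 1 is a one-line test on the matrix. The sketch below uses a hypothetical ultrametric matrix for five species, corresponding to a tree that groups species 1 and 2, then 3, with 4 and 5 as a separate pair. It is not the matrix from the slides, which is not in the transcript. Here consistent(a, b, c, d) tests the quartet that separates a, b from c, d.

```python
# Hypothetical matrix: upper triangle of least-common-ancestor labels.
UPPER = {
    (1, 2): 1, (1, 3): 2, (2, 3): 2,
    (1, 4): 4, (1, 5): 4, (2, 4): 4, (2, 5): 4, (3, 4): 4, (3, 5): 4,
    (4, 5): 3,
}


def M(i, j):
    """Symmetric lookup with a zero diagonal."""
    return 0 if i == j else UPPER[min(i, j), max(i, j)]


def consistent(a, b, c, d):
    """Theorem 1 test for the quartet ab|cd."""
    return min(M(a, c), M(b, d)) > min(M(a, b), M(c, d))


print(consistent(1, 2, 3, 4))  # quartet 12|34
print(consistent(1, 3, 2, 4))  # quartet 13|24
```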
20
Theorem 1, example: the quartet s1 s5 | s2 s3 is consistent with the tree and
its corresponding matrix:
min{M(1, 2), M(5, 3)} = 4 > min{M(1, 5), M(2, 3)} = 1.
Condition satisfied!
21
Theorem 1, counterexample: the quartet s1 s4 | s2 s5 is NOT consistent with the tree and
its corresponding matrix:
min{M(1, 2), M(4, 5)} = min{M(1, 4), M(2, 5)} = 3.
Condition not satisfied!
22
Theorem 2 Given a set Q of quartets on a set of
species S and an ultrametric phylogeny T on S, T
satisfies the maximum number of quartets in Q if
and only if the corresponding ultrametric matrix
M on S satisfies the maximum number of quartets
in Q.
We thereby transform the original MQC problem into a
search for an ultrametric matrix.
23
(No Transcript)
24
Formulation in Answer Set Programming
Domain
1 { m(1,2,1), m(1,2,2), m(1,2,3), m(1,2,4), m(1,2,5) } 1: matrix entry (1, 2) takes exactly
one value in the domain 1..5
Ultrametric Constraints
for any three matrix values m(i,j), m(j,k), and
m(i,k), two of them are equal and greater than
the third one
Quartet Constraints
if min{m(i,k), m(j,l)} > min{m(i,j), m(k,l)}, then
quartet i,j|k,l is satisfied
Objective
maximize the number of satisfied quartets q(i,j,k,l)
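The same domain, constraints, and objective can be mirrored in a plain brute-force search. This is a Python sketch for very small inputs, not the Answer Set Programming encoding from the slide. The domain 1..n-1, the function names, and the five-species quartet set are assumptions made for illustration. Quartets are written ((a, b), (c, d)) for the topology separating a, b from c, d.

```python
from itertools import combinations, product


def is_ultrametric(values, triples):
    """Ultrametric constraint: in every triple, two values are equal and exceed the third."""
    for p, q, r in triples:
        x, y, z = values[p], values[q], values[r]
        if not (x == y > z or x == z > y or y == z > x):
            return False
    return True


def best_ultrametric(n, quartets):
    """Exhaustive search over label matrices; returns (best score, one best matrix)."""
    species = range(1, n + 1)
    pairs = list(combinations(species, 2))
    index = {pair: pos for pos, pair in enumerate(pairs)}
    triples = [(index[i, j], index[j, k], index[i, k])
               for i, j, k in combinations(species, 3)]
    best_score, best_matrix = -1, None
    # Domain: each matrix entry takes exactly one value in 1..n-1 (assumed).
    for values in product(range(1, n), repeat=len(pairs)):
        if not is_ultrametric(values, triples):
            continue

        def m(x, y):
            return values[index[min(x, y), max(x, y)]]

        # Quartet constraint: count the quartets that pass the min-test.
        score = sum(min(m(a, c), m(b, d)) > min(m(a, b), m(c, d))
                    for (a, b), (c, d) in quartets)
        if score > best_score:  # objective: maximize satisfied quartets
            best_score, best_matrix = score, dict(zip(pairs, values))
    return best_score, best_matrix


# Complete quartet set on 5 species with one deliberately wrong topology (13|24).
Q = [((1, 3), (2, 4)), ((1, 2), (3, 5)), ((1, 2), (4, 5)),
     ((1, 3), (4, 5)), ((2, 3), (4, 5))]
best_score, best_matrix = best_ultrametric(5, Q)
print(best_score, best_matrix)
```

The search space grows as (n-1) to the power C(n, 2), so this enumeration is only practical for a handful of species. It is meant to show the structure of the formulation, not to replace a solver.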
25
Optimizations
26
Experiment Results
n: number of species
p: percentage of quartet errors
27
Phylogenetic Analysis on Prokaryote Dataset
  • 20 species, 4845 quartets in total.
  • One tree is generated by PHYLIP using Neighbor Joining. It
    satisfies 3968 quartets.
  • The other tree is generated by our program. It satisfies 3984
    quartets and is more accurate with respect to Bergey's Code.

28
Phylogenetic Analysis of SARS
  • Severe Acute Respiratory Syndrome (SARS) is
    caused by a virus recognized as a coronavirus (SARS-CoV)
  • The coronaviruses are currently divided into
    three groups
  • Representative viruses from each group are
    shown on the slide (No Transcript)

29
Phylogeny Construction Procedure
  • Get the whole-genome data and protein data
    for each virus from NCBI.
  • Compute a distance matrix M for these viruses
    using a measure proposed by Xiaomeng.
  • Use the quartet-based algorithm to generate a
    phylogeny from M.
  • Use the Neighbor Joining algorithm in the PHYLIP package
    to generate another phylogeny from M.
  • Compute the average distance between SARS-CoV
    and the Group 1 (D1), Group 2 (D2), and Group 3 (D3)
    viruses, respectively.
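The last step, averaging the distances from SARS-CoV to each group, can be sketched as below. The virus names and distance values are made-up placeholders chosen only to show the computation; they are not data from the study, and the function name is hypothetical.

```python
from statistics import mean


def average_group_distances(dist, target, groups):
    """Average distance from `target` to the members of each group."""
    return {name: mean([dist[target][virus] for virus in members])
            for name, members in groups.items()}


# Placeholder distances from the target virus to hypothetical group members.
dist = {"SARS-CoV": {"g1_a": 4.0, "g1_b": 6.0, "g2_a": 2.0, "g2_b": 3.0, "g3_a": 3.0}}
groups = {"D1": ["g1_a", "g1_b"], "D2": ["g2_a", "g2_b"], "D3": ["g3_a"]}

averages = average_group_distances(dist, "SARS-CoV", groups)
print(averages)
```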

30
Phylogeny on Protein Data without Outgroup
Both the Neighbor Joining and quartet-based methods
generate the same phylogeny.
D1 = 466.3, D2 = 459.8, D3 = 460.6
31
Phylogeny on Protein Data with Outgroup
With Neighbor Joining, the relation of SARS-CoV to
Group 2 and Group 3 varies from tree to tree. The
following is one such phylogeny, in which SARS-CoV lies in
Group 3.
D1 = 464.4, D2 = 460.3, D3 = 459
32
Phylogeny on Protein Data with Outgroup
With Neighbor Joining, the relation of SARS-CoV to
Group 2 and Group 3 varies from tree to tree.
The following is another tree, in which SARS-CoV lies
in Group 2.
D1 = 464.5, D2 = 458.8, D3 = 459.7
33
Phylogeny on Protein Data with Outgroup
The quartet-based method produces a consistent
phylogeny across various outgroups.
D1 = 464.4, D2 = 460.3, D3 = 459
34
Phylogeny on Genome Data with Outgroup
Both the Neighbor Joining and quartet-based methods
generate the same phylogeny.
D1 = 457.4, D2 = 456.1, D3 = 455.2
35
Summary
  • Our phylogeny construction method can
    successfully identify the three groups of
    coronaviruses.
  • SARS-CoV lies closer to Groups 2 and 3
    than to Group 1. The average distances of SARS-CoV
    to the Group 2 and Group 3 viruses are approximately
    the same.
  • Based on whole protein data, our quartet-based
    method consistently generates the same phylogeny
    with various outgroups. This phylogeny suggests
    that SARS-CoV more likely lies in Group 3.