Problems with large-scale phylogeny - PowerPoint PPT Presentation


Slides: 37
Provided by: tandyw

1
Problems with large-scale phylogeny
  • Tandy Warnow, UT-Austin
  • Department of Computer Sciences
  • Center for Computational Biology and
    Bioinformatics

2
Phylogeny
From the Tree of Life Website, University of Arizona
[Figure: tree relating Orangutan, Human, Gorilla, and Chimpanzee]
3
DNA Sequence Evolution
4
Molecular Systematics
[Figure: five taxa (U, V, W, X, Y), five aligned DNA sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT), and a tree on the same five taxa]
5
Quantifying Error
  • FN: false negative (missing edge)
  • FP: false positive (incorrect edge)
  • 50% error rate
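A minimal sketch of how FN and FP rates could be computed, assuming trees are written as nested tuples of leaf names and each internal edge is identified with the bipartition of the leaf set it induces. Normalizing by the number of internal edges in each tree is an assumption here; the slide only defines the two error types. The function names are illustrative.

```python
def leaves(tree):
    """Return the set of leaf names under a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def bipartitions(tree):
    """Return the nontrivial bipartitions (internal edges) of an unrooted tree.

    Each bipartition is stored as the side that does NOT contain a fixed
    anchor leaf, so the same edge always gets the same representation.
    """
    all_leaves = leaves(tree)
    anchor = min(all_leaves)
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        # Keep only edges with at least two leaves on each side.
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return splits


def error_rates(true_tree, estimated_tree):
    """Return (FN rate, FP rate) of the estimated tree against the true tree."""
    true_splits = bipartitions(true_tree)
    est_splits = bipartitions(estimated_tree)
    fn = len(true_splits - est_splits) / len(true_splits) if true_splits else 0.0
    fp = len(est_splits - true_splits) / len(est_splits) if est_splits else 0.0
    return fn, fp


true_tree = (("A", "B"), "C", ("D", "E"))
estimated = (("A", "C"), "B", ("D", "E"))
print(error_rates(true_tree, estimated))  # (0.5, 0.5)
```

In this toy case one of the two true internal edges is missing and one of the two estimated edges is wrong, giving a 50% rate for both error types.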
6
Methods and Conjectures
  • Popular methods: Neighbor-Joining (polynomial
    time, distance-based), and heuristics for Maximum
    Parsimony and Maximum Likelihood
  • Big debates about which is better, and when

7
Methods and Conjectures
  • Popular methods: Neighbor-Joining (polynomial
    time, distance-based), and heuristics for Maximum
    Parsimony and Maximum Likelihood
  • Big debates about which is better, and when
  • Our research shows big differences between NJ
    and MP, on large enough trees

8
Methods and Conjectures
  • Popular methods: Neighbor-Joining (polynomial
    time, distance-based), and heuristics for Maximum
    Parsimony and Maximum Likelihood
  • Big debates about which is better, and when
  • Our research shows big differences between NJ
    and MP, on large enough trees
  • Our research also shows that current techniques
    (in the best software packages) can be sped up,
    to solve MP and ML faster.
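
To make "polynomial time, distance-based" concrete, here is a minimal sketch of the standard Neighbor-Joining agglomeration, assuming a symmetric distance matrix given as a list of lists and unique taxon labels. It returns only the topology (no branch lengths), stops when three nodes remain, and breaks ties by input order; these are simplifications for illustration, not the behavior of any particular software package.

```python
def neighbor_joining(labels, matrix):
    """Return an unrooted NJ topology as a tuple of three nested subtrees."""
    # d[x][y] holds the current distance between active nodes x and y.
    d = {a: {b: matrix[i][j] for j, b in enumerate(labels) if b != a}
         for i, a in enumerate(labels)}
    nodes = list(labels)
    while len(nodes) > 3:
        n = len(nodes)
        # r[x] is the sum of distances from x to every other active node.
        r = {x: sum(d[x].values()) for x in nodes}
        best = None
        for i in range(n):
            for j in range(i + 1, n):
                x, y = nodes[i], nodes[j]
                q = (n - 2) * d[x][y] - r[x] - r[y]  # NJ selection criterion
                if best is None or q < best[0]:
                    best = (q, x, y)
        _, x, y = best
        u = (x, y)  # new internal node joining x and y
        d[u] = {}
        for z in nodes:
            if z != x and z != y:
                dist_uz = (d[x][z] + d[y][z] - d[x][y]) / 2
                d[u][z] = dist_uz
                d[z][u] = dist_uz
                del d[z][x], d[z][y]
        del d[x], d[y]
        nodes = [z for z in nodes if z != x and z != y] + [u]
    return tuple(nodes)


# Additive distances for the quartet ab|cd (pendant edges 1, internal edge 3).
matrix = [
    [0, 2, 5, 5],
    [2, 0, 5, 5],
    [5, 5, 0, 2],
    [5, 5, 2, 0],
]
print(neighbor_joining(["a", "b", "c", "d"], matrix))
```

Each round scans all pairs of active nodes, so this straightforward version takes O(n^3) time overall.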

9
Computational challenges for Assembling the Tree
of Life
  • 8 million species for the Tree of Life -- cannot
    currently analyze more than a few hundred (and
    even this takes years)
  • We need new methods for inferring large
    phylogenies - hard optimization problems!
  • We need new software for visualizing large trees
  • We need new database technology
  • Not all phylogenies are trees, so we need methods
    for inferring phylogenetic networks

10
Our research projects
  • DCM-boosting phylogenetic reconstruction methods
    (improving the accuracy of NJ and speeding up MP
    and ML)
  • Phylogenetic reconstruction from gene orders
  • Reticulate evolution detection and phylogenetic
    network reconstruction
  • Visualization of large trees

11
DCM-boosting NJ
  • Outline
  • Convergence rates (how long do the sequences need
    to be for methods to reconstruct the true tree
    with high probability?)
  • DCM-boosting Neighbor-Joining
  • Experimental study comparing DCM-NJ to NJ on
    large trees

12
The Jukes-Cantor model of DNA sequence evolution
  • A random DNA sequence evolves down the tree from
    the root
  • The positions within the sequence evolve
    independently and identically
  • If the nucleotide at a particular position
    changes on an edge, it changes with equal
    probability to the other nucleotides
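
A minimal sketch of the Jukes-Cantor model on a single edge, assuming the edge is parameterized by t, the expected number of substitutions per site (a standard parameterization that the slide does not state). The function names are illustrative.

```python
import math
import random

NUCLEOTIDES = "ACGT"


def jc_change_probability(t):
    """Probability that a site differs at the two ends of an edge of length t."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))


def jc_distance(p):
    """Invert the formula above: estimate t from the observed fraction p
    of differing sites (requires p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def evolve_along_edge(sequence, t, rng):
    """Evolve every site independently and identically along one edge."""
    p = jc_change_probability(t)
    result = []
    for base in sequence:
        if rng.random() < p:
            # A change picks each of the three other nucleotides equally often.
            result.append(rng.choice([b for b in NUCLEOTIDES if b != base]))
        else:
            result.append(base)
    return "".join(result)


rng = random.Random(0)
parent = "".join(rng.choice(NUCLEOTIDES) for _ in range(20))
print(parent)
print(evolve_along_edge(parent, 0.5, rng))
```

The inverse function `jc_distance` is the usual correction that distance-based methods such as NJ apply to observed differences before building a tree.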

13
The General Markov model of DNA sequence evolution
  • A random DNA sequence evolves down the tree from
    the root
  • The positions within the sequence evolve
    independently and identically (or under a
    distribution of rates across sites)
  • Each edge has a 4x4 stochastic substitution
    matrix governing the evolution of a random site
    on the edge
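
A minimal sketch of evolution under the General Markov model, assuming a tree is either a leaf name or a list of (subtree, edge matrix) pairs, and that each edge matrix is a 4x4 row-stochastic list of lists indexed in the order A, C, G, T. This representation and the function names are illustrative assumptions, and rate variation across sites is not modeled.

```python
import random

STATES = "ACGT"


def is_stochastic(matrix, tol=1e-9):
    """Check that the matrix is 4x4, nonnegative, and each row sums to 1."""
    return len(matrix) == 4 and all(
        len(row) == 4
        and all(x >= 0 for x in row)
        and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def evolve_site(state, matrix, rng):
    """Sample the child state of one site from the row of its parent state."""
    row = matrix[STATES.index(state)]
    return rng.choices(STATES, weights=row, k=1)[0]


def evolve_down_tree(tree, sequence, rng):
    """Evolve a sequence from the root down; return {leaf name: sequence}."""
    if isinstance(tree, str):
        return {tree: sequence}
    result = {}
    for child, matrix in tree:
        assert is_stochastic(matrix)
        child_sequence = "".join(evolve_site(s, matrix, rng) for s in sequence)
        result.update(evolve_down_tree(child, child_sequence, rng))
    return result


# Example edge matrix: keep the state with probability 0.7, otherwise
# move to each other state with probability 0.1.
NOISY = [[0.7 if i == j else 0.1 for j in range(4)] for i in range(4)]
tree = [("x", NOISY), ([("y", NOISY), ("z", NOISY)], NOISY)]
print(evolve_down_tree(tree, "ACGTACGTAC", random.Random(0)))
```

Unlike the Jukes-Cantor case, nothing here forces the matrices on different edges to be equal or symmetric; each edge carries its own matrix.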

14
Statistical Performance Issues
  • Statistical consistency: does the reconstruction
    method return the true tree with high probability
    from long enough sequences?
  • Convergence rate: at what sequence length will
    the reconstruction method return the true tree
    with high probability?
  • Robustness: if we violate the model conditions,
    what can we say about the performance of the
    method?

15
Absolute fast convergence vs. exponential
convergence
16
Theoretical Comparison of Methods
  • Theorem 1 (Warnow et al. 2001): DCM-NJ is absolute
    fast converging (afc) for the GM model.
  • Theorem 3 (Atteson 1999): NJ is exponentially
    converging for the GM model (but is not known to
    be afc).

17
DCM1: a divide-and-conquer strategy to improve
NJ's accuracy
  • Phase I, basic step: divide the dataset into many
    small-diameter subproblems. Construct NJ trees on
    each subproblem, and merge the subtrees using the
    Strict Consensus Merger. Refine the resultant tree
    using PAUP's constrained search.
  • Do the basic step for each way of setting the
    diameter.
  • Phase II: pick the best tree out of the set of
    O(n^2) trees.
18
Strict Consensus Merger
19
DCM-Boosting (Warnow et al. 2001)
  • DCM+SQS is a two-phase procedure which reduces
    the sequence length requirement of methods.

[Diagram: exponentially converging method → DCM → SQS → absolute fast converging method]
  • DCM-NJ+SQS is the result of DCM-boosting NJ.
  • We can replace SQS by MP or ML, and get better
    empirical performance (though not provably afc)

20
DCM-boosting Neighbor Joining
  • DCM-boosting makes distance-based methods more
    accurate (we have established this for other
    distance-based methods, too)

[Plot: error rate (0 to 0.8) against number of taxa (0 to 1600) for NJ and DCM-NJ]
21
Summary of DCM-NJ
  • These are the first polynomial time methods that
    improve upon NJ (with respect to topological
    accuracy) and are never worse than NJ.
  • The advantage obtained with DCM-NJ+MP and DCM-NJ+ML
    increases with number of taxa, deviation from a
    molecular clock, and rate of evolution.
  • In practice these new methods are slower than NJ
    (minutes vs. seconds), but still much faster than
    MP and ML (which can take days).

22
Time is a bottleneck for MP and ML
  • Systematists tend to prefer trees with the
    optimal maximum parsimony score or optimal
    maximum likelihood score; however, both problems
    are hard to solve
  • (Our experimental studies show that NJ doesn't do
    as well as MP when trees are big and have high
    rates of evolution, so NJ and other fast methods
    aren't sufficiently reliable.)

[Figure: MP score over the space of phylogenetic trees, showing a local optimum and the global optimum]
23
MP/ML heuristics
[Figure (fake study): performance of hill-climbing heuristic; MP score of best trees against time]
24
DCM-boosting: speeding up MP/ML heuristics
[Figure (fake study): performance of hill-climbing heuristic compared with desired performance; MP score of best trees against time]
25
DCM-boosting MP and ML
  • Idea: it is better to run a computationally
    expensive method on two subproblems of somewhat
    smaller size
  • The DCM is different: we decompose the dataset
    into just two subproblems (but they are bigger),
    and only for one threshold; we use the same
    merger technique and the same refinement stage
  • Challenge: how to pick the best decomposition?
  • This depends upon the base method

26
Addressing the accuracy/time issues
Disk-Covering Methods
DCM1 decomposition: lots of small-diameter
subproblems. (Used for NJ.)
DCM2 decomposition: very few subproblems, each
somewhat smaller. (Used for MP or ML.)
27
Maximum Parsimony
[Figure: input sequences ACT, ACA, GTA, GTT and candidate trees on them, with internal nodes labeled by sequences]
28
Maximum Parsimony
[Figure: three candidate trees on the sequences ACT, ACA, GTA, GTT, with the number of changes marked on each edge. MP scores: 7, 5, and 4 (optimal MP tree)]
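A minimal sketch of how the parsimony score of one fixed tree can be computed, using Fitch's algorithm (a standard technique that the slides do not name). It assumes a rooted binary tree written as nested pairs whose leaves are the sequences themselves; the score does not depend on where the root is placed.

```python
def fitch_site(tree, site):
    """Return (candidate states, number of changes) for one site of the tree."""
    if isinstance(tree, str):
        return {tree[site]}, 0
    left, right = tree
    left_states, left_cost = fitch_site(left, site)
    right_states, right_cost = fitch_site(right, site)
    common = left_states & right_states
    if common:
        # The children can agree on a state: no extra change needed here.
        return common, left_cost + right_cost
    # The children disagree: one change is forced on an edge below this node.
    return left_states | right_states, left_cost + right_cost + 1


def parsimony_score(tree, length):
    """Sum the minimum number of changes over all sites."""
    return sum(fitch_site(tree, site)[1] for site in range(length))


print(parsimony_score((("ACT", "ACA"), ("GTT", "GTA")), 3))  # 4
print(parsimony_score((("ACT", "GTT"), ("ACA", "GTA")), 3))  # 5
```

Scoring a given tree is fast; the hard part of Maximum Parsimony is searching over all possible trees for the one with the lowest score.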
29
Maximum Parsimony computational complexity
30
The DCM technique for speeding up MP/ML searches
31
DCM2-MP/ML
  • Step 1: Pick a threshold at which the threshold
    graph is connected, and divide the dataset into
    two overlapping subsets.
  • Step 2: Compute trees on each subset using a
    heuristic for MP or ML
  • Step 3: Merge subtrees using the Strict Consensus
    Merger
  • Step 4: Refine the resultant tree using PAUP's
    constrained search
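
A minimal sketch of the threshold choice in Step 1, assuming the threshold graph for a value q has one vertex per taxon and an edge between two taxa whenever their distance is at most q. Picking the smallest connecting threshold is an illustrative choice, not necessarily the one used in the talk, and the split into two overlapping subsets is not shown.

```python
def is_connected(dist, q):
    """Check whether the threshold graph for q is connected (depth-first search)."""
    n = len(dist)
    seen = {0}
    stack = [0]
    while stack:
        i = stack.pop()
        for j in range(n):
            if j not in seen and dist[i][j] <= q:
                seen.add(j)
                stack.append(j)
    return len(seen) == n


def smallest_connecting_threshold(dist):
    """Return the smallest matrix entry q whose threshold graph is connected."""
    n = len(dist)
    thresholds = sorted({dist[i][j] for i in range(n) for j in range(i + 1, n)})
    for q in thresholds:
        if is_connected(dist, q):
            return q
    return None  # fewer than two taxa: there is no threshold to pick


dist = [
    [0, 2, 5, 5],
    [2, 0, 5, 5],
    [5, 5, 0, 2],
    [5, 5, 2, 0],
]
print(smallest_connecting_threshold(dist))  # 5
```

In this example the threshold 2 only links the two close pairs, so the graph first becomes connected at 5.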

32
Phase I of DCM-NJ
  • For each value, q, in the distance matrix,
    compute a tree t_q as follows:
  • Divide the dataset into subsets of diameter at
    most q
  • Construct trees on each subset using NJ
  • Merge the trees using the Strict Consensus Merger
    technique
  • Refine the (probably unresolved) tree into a
    bifurcating tree
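
A rough sketch of the Phase I loop, showing why it yields O(n^2) candidate trees: there is one tree per distinct value in the distance matrix. The decomposition below is a simple greedy stand-in and is not the actual DCM1 decomposition; `build_tree`, `merge`, and `refine` are hypothetical callables standing in for NJ, the Strict Consensus Merger, and the refinement step.

```python
def greedy_cover(dist, q):
    """Cover the taxa with subsets of diameter at most q (greedy stand-in)."""
    n = len(dist)
    subsets = []
    for seed in range(n):
        subset = [seed]
        for j in range(n):
            # Add j only if it stays within q of everything already chosen.
            if j != seed and all(dist[j][k] <= q for k in subset):
                subset.append(j)
        candidate = frozenset(subset)
        if candidate not in subsets:
            subsets.append(candidate)
    return subsets


def phase_one(dist, build_tree, merge, refine):
    """Return {q: tree t_q} for every distinct value q in the matrix."""
    n = len(dist)
    thresholds = sorted({dist[i][j] for i in range(n) for j in range(i + 1, n)})
    trees = {}
    for q in thresholds:
        subtrees = [build_tree(sorted(s)) for s in greedy_cover(dist, q)]
        trees[q] = refine(merge(subtrees))
    return trees


dist = [
    [0, 2, 5, 5],
    [2, 0, 5, 5],
    [5, 5, 0, 2],
    [5, 5, 2, 0],
]
# Placeholder callables that only group taxon indices, to show the data flow.
trees = phase_one(dist, build_tree=tuple, merge=tuple, refine=lambda t: t)
print(trees)
```

A matrix on n taxa has at most n(n-1)/2 distinct off-diagonal values, so Phase II chooses among at most that many trees.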

33
Study of hill-climbing heuristics
Biological dataset of 500 rbcL sequences
(benchmark dataset). Previous best known trees
have MP score 16531.
34
Current best DCM2 technique
  • Pick threshold to get two subproblems
  • Use expensive but accurate base method
  • Use SCM to merge subtrees
  • Use PAUP's constrained search with moderately
    expensive hill-climbing heuristic

35
DCM2 vs hill-climbing
Biological dataset of 388 rRNA sequences.
Maximum subproblem size: 70
36
DCM2 vs hill-climbing
Biological dataset of 503 rRNA sequences.
Maximum subproblem size: 64
37
DCM2 vs hill-climbing
Biological dataset of 816 rRNA sequences.
Maximum subproblem size: 55
38
What we see
  • Some datasets decompose well, and DCM gives real
    advantage
  • The bigger the dataset, and the more careful the
    heuristic search, the less good the decomposition
    has to be for DCM to give an advantage
  • Outlier identification may help

39
Other projects (briefly)
  • Gene order phylogeny: GRAPPA (our free software)
    is the fastest and most accurate software for
    reconstructing phylogenies from gene order and
    content data. Joint project with Bob Jansen (UT)
    and Bernard Moret (UNM), and others.
  • Reticulate evolution inference: our research
    shows that no existing method for reconstructing
    networks works, and that methods (such as ILD) for
    detecting reticulation fail. Joint project with
    Randy Linder (UT) and Bernard Moret.

40
Acknowledgements
  • Funding:
  • The David and Lucile Packard Foundation, and
  • The National Science Foundation.
  • Collaborators:
  • Bernard Moret (UNM), Daniel Huson
    (Tübingen), Lisa Vawter (Aventis), Katherine St.
    John (CUNY), Randy Linder (UT), Bob Jansen (UT)
  • Students: Luay Nakhleh, Usman Roshan, Jerry Sun,
    and Li-San Wang

41
Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/