http://creativecommons.org/licenses/by-sa/2.0/ - PowerPoint PPT Presentation

About This Presentation

Title:

http://creativecommons.org/licenses/by-sa/2.0/

Slides: 80

Provided by: Usm16

Learn more at: http://www.cs.njit.edu

Transcript and Presenter's Notes

1
http://creativecommons.org/licenses/by-sa/2.0/
2
CIS786, Lecture 3

Usman Roshan

3
Maximum Parsimony

Character based method
NP-hard (reduction to the Steiner tree problem)
Widely-used in phylogenetics
Slower than NJ but more accurate
Faster than ML
Assumes sites are i.i.d. (independent and identically distributed)

4
Maximum Parsimony

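Input: Set S of n aligned sequences of length k
Output: A phylogenetic tree T
leaf-labeled by sequences in S, with
additional sequences of length k labeling the
internal nodes of T,
such that $\sum_{(u,v) \in E(T)} H(u,v)$ is minimized, where H(u,v) is the
Hamming distance between the sequences labeling the endpoints of edge (u,v).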
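
As a minimal sketch of this objective, the length of a fully labeled tree is the sum of Hamming distances over its edges. The helper names `hamming` and `tree_length`, the node names, and the choice of internal labels below are illustrative assumptions, not part of the lecture.

```python
from typing import Dict, List, Tuple


def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def tree_length(labels: Dict[str, str], edges: List[Tuple[str, str]]) -> int:
    """Sum of Hamming distances over all edges of a fully labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical toy tree: leaves a, b, c, d and internal nodes x, y.
labels = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA",
          "x": "ACA", "y": "GTA"}
edges = [("a", "x"), ("b", "x"), ("x", "y"), ("c", "y"), ("d", "y")]

print(tree_length(labels, edges))  # 4
```

Maximum parsimony asks for the topology and internal labels that make this sum as small as possible.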

5
Maximum parsimony (example)

Input: Four sequences
ACT
ACA
GTT
GTA
Question: which of the three trees has the best
MP score?

6
Maximum Parsimony
[Figure: the three possible tree topologies on the four sequences ACT, ACA, GTT, GTA]
7
Maximum Parsimony
[Figure: the three trees with internal node labels and per-edge change counts. MP scores of the trees as drawn: 7, 5, and 4. The tree with MP score 4 is the optimal MP tree.]
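
When only the topology is fixed, the best internal labels can be found site by site. The sketch below uses Fitch's algorithm, a standard method for this that the slides do not name, so treat it as an assumption about how a score could be computed. Trees are written as nested pairs rooted on the central edge, and the function names are hypothetical.

```python
def fitch(node):
    """Return (per-site state sets, number of changes) for a rooted binary tree.

    A leaf is a sequence string; an internal node is a pair (left, right).
    """
    if isinstance(node, str):
        return [{c} for c in node], 0
    left_sets, left_cost = fitch(node[0])
    right_sets, right_cost = fitch(node[1])
    sets, cost = [], left_cost + right_cost
    for a, b in zip(left_sets, right_sets):
        common = a & b
        if common:
            sets.append(common)   # children agree on a state: no change
        else:
            sets.append(a | b)    # children disagree: one change at this site
            cost += 1
    return sets, cost


def mp_score(tree):
    """Parsimony score of a fixed topology under optimal internal labels."""
    return fitch(tree)[1]


print(mp_score((("ACT", "ACA"), ("GTT", "GTA"))))  # 4
print(mp_score((("ACT", "GTT"), ("ACA", "GTA"))))  # 5
```

The first call scores the tree that groups ACT with ACA and GTT with GTA, which reproduces the optimal score of 4.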
8
Maximum Parsimony: computational complexity
9
Local search strategies
10
Local search for MP

Determine a candidate solution s
While s is not a local minimum
Find a neighbor s' of s such that MP(s') < MP(s)
If found, set s = s'
Else return s and exit
Time complexity is unknown: the search could take forever or
end quickly depending on the starting tree and local
move
Need to specify how to construct the starting tree
and the local move
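
The loop can be sketched generically. In this hedged example the solution type, `neighbors`, and `score` are placeholders supplied by the caller; for MP they would be a tree, a local move such as NNI, and the parsimony score. The toy problem at the bottom is an invented stand-in, not phylogenetic data.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def local_search(start: T,
                 neighbors: Callable[[T], Iterable[T]],
                 score: Callable[[T], float]) -> T:
    """Move to a strictly better neighbor until none exists."""
    s = start
    while True:
        current = score(s)
        improved = None
        for candidate in neighbors(s):
            if score(candidate) < current:
                improved = candidate   # first-improvement strategy
                break
        if improved is None:
            return s                   # s is a local minimum
        s = improved


# Toy stand-in: solutions are integers, neighbors are x - 1 and x + 1.
def toy_neighbors(x: int):
    return (x - 1, x + 1)


def toy_score(x: int) -> int:
    return (x - 3) ** 2


print(local_search(10, toy_neighbors, toy_score))  # 3
```

This version accepts the first improving neighbor it finds; scanning the whole neighborhood and taking the best one is an equally valid choice.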

11
Starting tree for MP

Random phylogeny---O(n) time
Greedy-MP

12
Greedy-MP
Greedy-MP takes O(n^3 k) time
13
Faster Greedy MP: 3-way labeling

If we can assign optimal labels to each internal
node rooted in each possible way, we can speed up
computation by a factor of n
Optimal 3-way labeling:
Sort all 3n subtrees by size using bucket sort in O(n)
Starting from small subtrees, compute optimal
labelings
For each subtree rooted at v, the optimal
labelings of the children are already computed
Total time O(nk)

14
Faster Greedy MP: 3-way labeling

With optimal labeling it takes constant time to
compute the MP score for each edge, so the total
Greedy-MP time is O(n^2 k)
15
Local moves for MP: NNI

For each internal edge we get two different topologies
Neighborhood size is 2n-6 (n-3 internal edges, two rearrangements each)
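
The neighborhood count can be checked on a small example. This is a sketch under stated assumptions: an unrooted binary tree stored as an adjacency dictionary, with hypothetical helper names and a hypothetical 5-leaf tree. Each internal edge yields two rearrangements, obtained by swapping one subtree across the edge.

```python
def build_adjacency(edges):
    """Adjacency sets for an undirected tree given as a list of edges."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def internal_edges(adj):
    """Edges whose two endpoints are both internal (degree-3) nodes."""
    return [(u, v) for u in adj for v in adj[u]
            if u < v and len(adj[u]) == 3 and len(adj[v]) == 3]


def nni_neighbors(adj):
    """All trees one nearest-neighbor interchange away from adj."""
    result = []
    for u, v in internal_edges(adj):
        a, b = sorted(adj[u] - {v})   # two subtrees hanging off u
        c, d = sorted(adj[v] - {u})   # two subtrees hanging off v
        for w in (c, d):
            new = {node: set(nbrs) for node, nbrs in adj.items()}
            # Swap subtree b (on u's side) with subtree w (on v's side).
            new[u].remove(b); new[b].remove(u)
            new[v].remove(w); new[w].remove(v)
            new[u].add(w); new[w].add(u)
            new[v].add(b); new[b].add(v)
            result.append(new)
    return result


# Hypothetical 5-leaf tree: leaves A..E, internal nodes x, y, z.
tree = build_adjacency([("A", "x"), ("B", "x"), ("x", "y"), ("C", "y"),
                        ("y", "z"), ("D", "z"), ("E", "z")])

print(len(internal_edges(tree)))  # 2, i.e. n - 3 for n = 5
print(len(nni_neighbors(tree)))   # 4, i.e. 2n - 6 for n = 5
```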

16
Local moves for MP: SPR

Neighborhood size is quadratic in number of taxa
Computing the minimum number of SPR moves between
two rooted phylogenies is NP-hard

17
Local moves for MP: TBR

Neighborhood size is cubic in number of taxa
Computing the minimum number of TBR moves between
two unrooted phylogenies is NP-hard

Tree Bisection and Reconnection (TBR)

Delete an edge
20

Tree Bisection and Reconnection (TBR)

Reconnect the trees with a new edge that
bifurcates an edge in each tree
22
Local optima are a problem
23
Iterated local search: escape local optima by
perturbation
[Figure: local search descending to a local optimum]
24
Iterated local search: escape local optima by
perturbation
[Figure: local search reaches a local optimum; a perturbation moves the search away from it (output of perturbation)]
25
Iterated local search: escape local optima by
perturbation
[Figure: local search reaches a local optimum, a perturbation produces a new starting point (output of perturbation), and local search resumes from there]
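
The cycle of local search, perturbation, and renewed local search can be sketched on a toy landscape. Everything here is an illustrative assumption: the `LANDSCAPE` values, the random-jump perturbation, and the rule of keeping only improvements. For MP the solutions would be trees rather than list indices.

```python
import random

# Hypothetical one-dimensional landscape: index = solution, value = score.
LANDSCAPE = [5, 3, 4, 6, 2, 7, 1, 8, 9]


def score(i):
    return LANDSCAPE[i]


def descend(i):
    """Plain local search: step to the better adjacent index until stuck."""
    while True:
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]
        better = [j for j in nbrs if score(j) < score(i)]
        if not better:
            return i
        i = min(better, key=score)


def perturb(i, rng, radius=3):
    """Jump a random distance away from i, clamped to the landscape."""
    j = i + rng.randint(-radius, radius)
    return max(0, min(len(LANDSCAPE) - 1, j))


def iterated_local_search(start, iterations, rng):
    best = descend(start)
    for _ in range(iterations):
        candidate = descend(perturb(best, rng))
        if score(candidate) < score(best):   # keep only improvements
            best = candidate
    return best


result = iterated_local_search(0, iterations=50, rng=random.Random(0))
print(descend(0), result)
```

Plain local search from index 0 stops at index 1 with score 3. The iterated version can never return anything worse, because a perturbed restart is accepted only when it improves the score.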
26
ILS for MP

Ratchet
Iterative-DCM3
TNT

28
Ratchet

Perturbation input: alignment and phylogeny
Sample with replacement p% of sites and reweight
them to w
Perform local search on the modified dataset starting
from the input phylogeny
Reset the alignment to the original after completion
and output the local minimum
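
The reweighting step can be sketched as follows. This assumes unit weights on the original alignment and takes the sampled share as a fraction between 0 and 1. The function names and parameters are hypothetical and are not PAUP or TNT commands.

```python
import random


def ratchet_weights(num_sites, fraction, new_weight, rng):
    """Sample sites with replacement and set their weight to new_weight.

    Sites that are never drawn keep weight 1 (an assumption about the
    original weighting).
    """
    weights = [1] * num_sites
    num_draws = round(fraction * num_sites)
    for _ in range(num_draws):
        weights[rng.randrange(num_sites)] = new_weight
    return weights


def weighted_hamming(a, b, weights):
    """Hamming distance in which each differing site counts by its weight."""
    return sum(w for x, y, w in zip(a, b, weights) if x != y)


rng = random.Random(0)
weights = ratchet_weights(num_sites=10, fraction=0.3, new_weight=2, rng=rng)
print(weights)

# Resetting after the perturbed search means going back to unit weights.
original_weights = [1] * 10
```

Because sampling is with replacement, the same site can be drawn more than once, so fewer than p% of the sites may end up reweighted.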

29
Ratchet: escaping local minima by data
perturbation
[Figure: local search reaches a local optimum; a ratchet search on the perturbed data moves away from it (output of ratchet), and local search resumes from there]
30
Ratchet: escaping local minima by data
perturbation
[Figure: local search reaches a local optimum; a ratchet search on the perturbed data moves away from it (output of ratchet), and local search resumes from there]
But how well does this perform? We have to
examine this experimentally on real data
31
Experimental methodology for MP on real data

Collect alignments of real datasets
Usually constructed using ClustalW
Followed by manual (eye) adjustments
Must be reliable to get sensible tree!
Run methods for a fixed time period
Compare MP scores as a function of time
Examine how scores improve over time
Rate of convergence of different methods (not
sequence length but as a function of time)

32
Experimental methodology for MP on real data

We use rRNA and DNA alignments
Obtained from researchers and public databases
We run iterative improvement and ratchet each for
24 hours beginning from a randomized greedy MP
tree
Each method was run five times and average scores
were plotted
We use PAUP---very widely used software package
for various types of phylogenetic analysis

33
500 aligned rbcL sequences (Zilla dataset)
34
854 aligned rbcL sequences
35
2000 aligned Eukaryotes
36
7180 aligned 3domain
37
13921 aligned Proteobacteria
38
Comparison of MP heuristics

What about other techniques for escaping local
minima?
TNT: a combination of divide-and-conquer,
simulated annealing, and genetic algorithms
Sectorial search (random): construct ancestral
sequence states using parsimony; randomly select
a subset of nodes; compute iterative-improvement
trees and, if a better tree is found, replace
Genetic algorithm (fuse): exchange subtrees
between two trees to see if better ones are found
Default search: (1) do sectorial search starting
from five randomized greedy MP trees; (2) apply
the genetic algorithm to find better ones; (3) output
the best tree

39
Comparison of MP heuristics

How does this compare to PAUP-ratchet?
40
Experimental methodology for MP on real data

We use rRNA and DNA alignments
Obtained from researchers and public databases
We run PAUP-ratchet, TNT-default, and
TNT-ratchet each for 24 hours beginning from
randomized greedy MP trees
Each method was run five times on each dataset
and average scores were plotted

41
500 aligned rbcL sequences (Zilla dataset)
42
854 aligned rbcL sequences
43
2000 aligned Eukaryotes
44
7180 aligned 3domain
45
13921 aligned Proteobacteria
46
Can we do even better?

Yes! But first let's look at
Disk-Covering Methods

47
Disk Covering Methods (DCMs)

DCMs are divide-and-conquer booster methods. They
divide the dataset into small subproblems,
compute subtrees using a given base method, merge
the subtrees, and refine the supertree.
DCMs to date:
DCM1: for improving statistical performance of
distance-based methods
DCM2: for improving heuristic search for MP and
ML
DCM3: latest, fastest, and best (in accuracy and
optimality) DCM

48
DCM2 technique for speeding up MP searches
49
DCM2 decomposition

DCM2
Input: distance matrix d, threshold
q, sequences S
Algorithm:
1a. Compute a threshold graph G using q and d
1b. Perform a minimum weight triangulation of G
2. Find separator X in G which minimizes $\max_i |A_i \cup X|$,
where $A_1, \dots, A_r$ are the connected components of $G - X$
3. Output subproblems as $A_i \cup X$.

50
Threshold graph

Add edges until graph is connected
Perform minimum weight triangulation
NP-hard
A graph is triangulated if and only if it has a perfect elimination ordering
(PEO)
Max cliques can be determined in linear time
Use greedy triangulation heuristic: compute PEO
by adding vertices which minimize largest edge
added
Worst case is O(n^3) but fast in practice
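
The "add edges until the graph is connected" step can be sketched with a union-find pass over the pairwise distances in increasing order. This reads the threshold graph as having an edge between i and j exactly when d[i][j] <= q, which is an assumption about the slide's definition. The function name and the distance matrix are hypothetical.

```python
def connecting_threshold_graph(d):
    """Return (q, edges): the smallest threshold q at which the graph with
    edges {(i, j) : d[i][j] <= q} is connected, together with those edges."""
    n = len(d)
    pairs = sorted((d[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    components, q = n, 0
    for dist, i, j in pairs:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
            if components == 1:
                q = dist                    # last edge needed to connect
                break
    edges = [(i, j) for dist, i, j in pairs if dist <= q]
    return q, edges


# Hypothetical symmetric distance matrix on four taxa.
d = [[0, 1, 4, 5],
     [1, 0, 3, 6],
     [4, 3, 0, 2],
     [5, 6, 2, 0]]

q, edges = connecting_threshold_graph(d)
print(q, sorted(edges))  # 3 [(0, 1), (1, 2), (2, 3)]
```

The triangulation step is not shown here.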

51
Finding DCM2 separator

Find separator X in G which minimizes $\max_i |A_i \cup X|$,
where $A_1, \dots, A_r$ are the connected components of $G - X$
Output subproblems as $A_i \cup X$
This takes O(n^3) worst-case time: perform depth-first
search on each component (O(n^2)) for each
of O(n) separators
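
A hedged sketch of the separator step: for each candidate separator X, a depth-first search finds the components of the graph with X removed, and the candidate whose largest subproblem (a component together with X) is smallest wins. How the candidates are generated is left to the caller. The function names and the path graph below are illustrative assumptions.

```python
def components_without(adj, separator):
    """Connected components of the graph after deleting the separator."""
    remaining = set(adj) - set(separator)
    seen, comps = set(), []
    for start in sorted(remaining):
        if start in seen:
            continue
        stack, comp = [start], {start}
        seen.add(start)
        while stack:                        # iterative depth-first search
            u = stack.pop()
            for v in adj[u]:
                if v in remaining and v not in seen:
                    seen.add(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps


def dcm2_subproblems(adj, separator):
    """Each subproblem is one component together with the separator."""
    return [comp | set(separator) for comp in components_without(adj, separator)]


def best_separator(adj, candidates):
    """Candidate that minimizes the size of the largest subproblem."""
    def largest(X):
        return max((len(s) for s in dcm2_subproblems(adj, X)), default=len(X))
    return min(candidates, key=largest)


# Hypothetical path graph 0 - 1 - 2 - 3 - 4.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}

X = best_separator(adj, [{1}, {2}])
print(X, dcm2_subproblems(adj, X))  # {2} [{0, 1, 2}, {2, 3, 4}]
```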

52
DCM2 subsets
53
DCM3 decomposition - example
54
DCM1 vs DCM2
DCM1 decomposition: NJ gets better accuracy on
small diameter subproblems (which we shall return
to later)
DCM2 decomposition: getting a smaller number of
smaller subproblems speeds up solution
55
We saw how decomposition takes place, now on to
supertree methods
56
Supertree Methods
57
Optimization problems

Subtree Compatibility: given a set of trees
T_1, ..., T_k, does there exist a tree T
such that T restricted to the leaf set of each T_i agrees with T_i (we
say T contains T_i)?
NP-hard (Steel 1992)
Special cases are poly-time (rooted trees, DCM)
MRP also NP-hard

58
Direct supertree methods

Strict consensus supertrees, MinCutSupertrees

59
Indirect supertree methods

MRP, Average consensus

60
MRP---Matrix Representation using Parsimony (very
popular)
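
As a rough illustration of the matrix-representation idea, the sketch below uses the commonly described coding for rooted source trees. That coding is an assumption here, since the slide gives only the method's name. Each clade of each source tree becomes one binary character: 1 for taxa inside the clade, 0 for other taxa in that tree, and ? for taxa missing from that tree. The resulting matrix would then be analyzed with parsimony. Function names and the example trees are hypothetical.

```python
def leaves(tree):
    """Leaf names of a tree written as nested tuples of strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clusters(tree):
    """Leaf sets below every internal node except the root."""
    found = []

    def walk(node, is_root):
        if isinstance(node, str):
            return {node}
        below = set()
        for child in node:
            below |= walk(child, False)
        if not is_root:
            found.append(below)
        return below

    walk(tree, True)
    return found


def mrp_matrix(trees):
    """One row per taxon, one column per clade of each source tree."""
    taxa = sorted(set().union(*(leaves(t) for t in trees)))
    rows = {t: "" for t in taxa}
    for tree in trees:
        present = leaves(tree)
        for cluster in clusters(tree):
            for t in taxa:
                rows[t] += "1" if t in cluster else ("0" if t in present else "?")
    return rows


source_trees = [(("A", "B"), ("C", "D")), (("B", "C"), "E")]
for taxon, row in mrp_matrix(source_trees).items():
    print(taxon, row)
```

On these two source trees the rows come out as A 10?, B 101, C 011, D 01?, and E ??0.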
61
Strict Consensus Merger---faster and used in DCMs
62
Strict Consensus Merger: compatible subtrees
63
Strict Consensus Merger: compatible but collision
64
Strict Consensus Merger: incompatible subtrees
65
Strict Consensus Merger: incompatible and
collision
66
Strict Consensus Merger: difference from Gordon's
SC method
67
We saw how decomposition takes place, now on to
supertree methods
68
Tree Refinement

Challenge: given an unresolved tree, find a
refinement with the optimal parsimony score
NP-hard

69
Tree Refinement
70
We saw how decomposition takes place, now on to
supertree methods
71
Comparing DCM decompositions
72
Study of DCM decompositions
Comparison of MP scores
Comparison of running times
DCM2 is faster and better than DCM1
73
Best DCM (DCM2) vs Random
Comparison of MP scores
Comparison of running times
DCM2 is better than RANDOM w.r.t MP scores and
running times
74
DCM2 (comparing two different thresholds)
Comparison of MP scores
Comparison of running times
75
Threshold selection techniques
Biological dataset of 503 rRNA sequences.
Threshold value at which we get two subproblems
has best MP score.
76
Comparing supertree methods
77
MRP vs. SCM
Comparison of MP scores
Comparison of running times

SCM is better than MRP

78
Comparing tree refinement techniques
79
Study of tree refinement techniques
Comparison of MP scores
Comparison of running times
Constrained tree search had best MP scores but is
slower than other methods
80
Next time