http://creativecommons.org/licenses/by-sa/2.0/ - PowerPoint PPT Presentation

Title:

http://creativecommons.org/licenses/by-sa/2.0/

Description:

Maximum Parsimony. Character based method. NP-hard (reduction to the Steiner tree problem) ... MRP---Matrix Representation using Parsimony (very popular) ... – PowerPoint PPT presentation

Slides: 80
Provided by: Usm16
Learn more at: http://www.cs.njit.edu

Transcript and Presenter's Notes

1
http://creativecommons.org/licenses/by-sa/2.0/
2
CIS786, Lecture 3
  • Usman Roshan

3
Maximum Parsimony
  • Character based method
  • NP-hard (reduction to the Steiner tree problem)
  • Widely-used in phylogenetics
  • Slower than NJ but more accurate
  • Faster than ML
  • Assumes sites are i.i.d. (independent and identically distributed)

4
Maximum Parsimony
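  • Input: Set S of n aligned sequences of length k
  • Output: A phylogenetic tree T
  • leaf-labeled by sequences in S
  • additional sequences of length k labeling the
    internal nodes of T
  • such that $\sum_{(u,v) \in E(T)} H(u,v)$ is minimized, where $H(u,v)$
    is the Hamming distance between the sequences labeling u and v.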
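
A minimal sketch of the quantity being minimized, assuming the MP score of a fully labeled tree is the usual sum of Hamming distances over its edges. The function names and the edge-list representation are illustrative choices, not part of the lecture.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b))


def mp_score(edges: list[tuple[str, str]]) -> int:
    """Sum of Hamming distances over all edges of a fully labeled tree."""
    return sum(hamming(u, v) for u, v in edges)


# Hypothetical labeled tree on leaves ACT, ACA, GTT, GTA with
# internal nodes labeled ACA and GTA (each edge is a pair of labels).
example_edges = [
    ("ACT", "ACA"),  # leaf ACT to internal node ACA
    ("ACA", "ACA"),  # leaf ACA to internal node ACA
    ("ACA", "GTA"),  # the internal edge
    ("GTA", "GTA"),  # leaf GTA to internal node GTA
    ("GTA", "GTT"),  # leaf GTT to internal node GTA
]
print(mp_score(example_edges))  # 1 + 0 + 2 + 0 + 1 = 4
```

Scoring a tree whose internal labels are already fixed is the easy part; the hard part of MP is searching over topologies and internal labelings.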

5
Maximum parsimony (example)
  • Input: Four sequences
  • ACT
  • ACA
  • GTT
  • GTA
  • Question: which of the three trees has the best
    MP score?

6
Maximum Parsimony
[Figure: the three candidate trees on the leaves ACT, ACA, GTT, and GTA]
7
Maximum Parsimony
[Figure: the three trees with labeled internal nodes and each edge annotated with its Hamming distance. Their MP scores are 7, 5, and 4; the tree with MP score 4 is the optimal MP tree.]
8
Maximum Parsimony: computational complexity
9
Local search strategies
10
Local search for MP
  • Determine a candidate solution s
  • While s is not a local minimum
  • Find a neighbor s' of s such that MP(s') < MP(s)
  • If found, set s = s'
  • Else return s and exit
  • Time complexity unknown: could take forever or
    end quickly depending on starting tree and local
    move
  • Need to specify how to construct starting tree
    and local move
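
The loop above can be sketched generically. This is only an illustration: `local_search`, `neighbors`, and `score` are hypothetical names, and the toy search space (integers with a quadratic score) stands in for tree space and the MP score.

```python
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")


def local_search(start: S,
                 neighbors: Callable[[S], Iterable[S]],
                 score: Callable[[S], float]) -> S:
    """Move to a strictly better neighbor until none exists."""
    s = start
    while True:
        current = score(s)
        # First neighbor that strictly improves the score, if any.
        better = next((t for t in neighbors(s) if score(t) < current), None)
        if better is None:
            return s  # s is a local minimum
        s = better


def toy_score(x: int) -> int:
    """Toy objective with a single minimum at x = 3."""
    return (x - 3) ** 2


def toy_neighbors(x: int) -> list[int]:
    """Toy local move: step one unit left or right."""
    return [x - 1, x + 1]


print(local_search(10, toy_neighbors, toy_score))  # 3
```

For MP, `neighbors` would enumerate the trees reachable by one local move (NNI, SPR, or TBR) and `score` would be the MP score.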

11
Starting tree for MP
  • Random phylogeny: O(n) time
  • Greedy-MP

12
Greedy-MP
Greedy-MP takes O(n^3 k) time
13
Faster Greedy MP: 3-way labeling
  • If we can assign optimal labels to each internal
    node rooted in each possible way, we can speed up
    computation by order of n
  • Optimal 3-way labeling
  • Sort all 3n subtrees using bucket sort in O(n)
  • Starting from small subtrees compute optimal
    labelings
  • For each subtree rooted at v, the optimal
    labelings of children nodes is already computed
  • Total time O(nk)

14
With optimal labeling it takes constant time to
compute the MP score for each edge, so the total
Greedy-MP time is O(n^2 k)
15
Local moves for MP: NNI
  • For each edge we get two different topologies
  • Neighborhood size is 2n-6
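
A small sketch of where the count 2n-6 comes from, assuming NNI means nearest neighbor interchange on an unrooted binary tree. The nested-tuple representation of the four subtrees around an internal edge is a hypothetical choice made for illustration.

```python
def nni_neighbors(a, b, c, d):
    """Given an internal edge with subtrees (a, b) on one side and (c, d)
    on the other, return the two alternative arrangements around that edge."""
    return [((a, c), (b, d)), ((a, d), (b, c))]


def nni_neighborhood_size(n: int) -> int:
    """An unrooted binary tree on n leaves has n - 3 internal edges,
    and each internal edge yields 2 NNI neighbors."""
    if n < 4:
        return 0  # no internal edge, so no NNI move is possible
    return 2 * (n - 3)


print(nni_neighbors("A", "B", "C", "D"))
print(nni_neighborhood_size(10))  # 2 * 7 = 14
```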

16
Local moves for MP: SPR
  • Neighborhood size is quadratic in number of taxa
  • Computing the minimum number of SPR moves between
    two rooted phylogenies is NP-hard

17
Local moves for MP: TBR
  • Neighborhood size is cubic in number of taxa
  • Computing the minimum number of TBR moves between
    two unrooted phylogenies is NP-hard

18-21
Tree Bisection and Reconnection (TBR)
  1. Delete an edge, which splits the tree into two subtrees
  2. Reconnect the trees with a new edge that
    bifurcates an edge in each tree
22
Local optima are a problem
23-25
Iterated local search: escape local optima by
perturbation
[Figure: local search descends to a local optimum; a perturbation moves the search away from it, and local search is run again from the output of the perturbation]
26
ILS for MP
  • Ratchet
  • Iterative-DCM3
  • TNT
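
The iterated local search scheme shared by these methods can be sketched as a generic loop. Everything here is an assumption made for illustration: the function names, the rule of keeping a candidate only if it improves the score, and the toy one-dimensional landscape used in place of tree space.

```python
import random
from typing import Callable, TypeVar

S = TypeVar("S")


def iterated_local_search(start: S,
                          local_search: Callable[[S], S],
                          perturb: Callable[[S, random.Random], S],
                          score: Callable[[S], float],
                          iterations: int,
                          seed: int = 0) -> S:
    """Alternate perturbation and local search, keeping the best optimum."""
    rng = random.Random(seed)
    best = local_search(start)
    for _ in range(iterations):
        candidate = local_search(perturb(best, rng))
        if score(candidate) < score(best):
            best = candidate
    return best


# Toy landscape with several local minima (indices 1, 3, 5, 7).
landscape = [5, 3, 4, 2, 6, 1, 7, 0, 8]


def toy_score(i: int) -> int:
    return landscape[i]


def toy_descent(i: int) -> int:
    """Steepest descent over the adjacent indices."""
    while True:
        nbrs = [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]
        best = min(nbrs, key=toy_score)
        if toy_score(best) >= toy_score(i):
            return i
        i = best


def toy_perturb(i: int, rng: random.Random) -> int:
    """Jump two positions left or right, clamped to the landscape."""
    return min(max(i + rng.choice([-2, 2]), 0), len(landscape) - 1)


result = iterated_local_search(0, toy_descent, toy_perturb, toy_score, 20)
print(result, toy_score(result))
```

Plain descent from index 0 stops at index 1 (score 3); the perturbation step is what allows the search to reach lower-scoring optima.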

28
Ratchet
  • Perturbation input: alignment and phylogeny
  • Sample, with replacement, a proportion p of the sites
    and reweight them to w
  • Perform local search on the modified dataset, starting
    from the input phylogeny
  • Reset the alignment to the original after completion
    and output the local minimum
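
A minimal sketch of the reweighting step, assuming p is a fraction of the sites, every unsampled site keeps weight 1, and a weighted Hamming distance replaces the plain one during the perturbed search. The function names are hypothetical and this is not the PAUP or TNT implementation.

```python
import random


def ratchet_weights(num_sites: int, p: float, w: int,
                    rng: random.Random) -> list[int]:
    """Per-site weights: sampled sites get weight w, all others keep 1."""
    weights = [1] * num_sites
    sample_size = round(p * num_sites)
    # Sampling with replacement: a site drawn twice is still set to w once.
    for site in rng.choices(range(num_sites), k=sample_size):
        weights[site] = w
    return weights


def weighted_hamming(a: str, b: str, weights: list[int]) -> int:
    """Hamming distance in which a mismatch at site i costs weights[i]."""
    return sum(wt for x, y, wt in zip(a, b, weights) if x != y)


rng = random.Random(1)
weights = ratchet_weights(10, 0.3, 2, rng)
print(weights)
print(weighted_hamming("ACT", "GTA", [1, 2, 1]))  # 1 + 2 + 1 = 4
```

After the local search on the reweighted data finishes, the weights are discarded and the resulting tree is scored again on the original alignment.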

29-30
Ratchet: escaping local minima by data
perturbation
[Figure: local search reaches a local optimum; the ratchet search on the perturbed data moves away from it, and local search then continues from the output of the ratchet]
But how well does this perform? We have to
examine this experimentally on real data
31
Experimental methodology for MP on real data
  • Collect alignments of real datasets
  • Usually constructed using ClustalW
  • Followed by manual (eye) adjustments
  • Must be reliable to get sensible tree!
  • Run methods for a fixed time period
  • Compare MP scores as a function of time
  • Examine how scores improve over time
  • Rate of convergence of different methods (not
    sequence length but as a function of time)

32
Experimental methodology for MP on real data
  • We use rRNA and DNA alignments
  • Obtained from researchers and public databases
  • We run iterative improvement and ratchet each for
    24 hours beginning from a randomized greedy MP
    tree
  • Each method was run five times and average scores
    were plotted
  • We use PAUP, a very widely used software package
    for various types of phylogenetic analysis

33
500 aligned rbcL sequences (Zilla dataset)
34
854 aligned rbcL sequences
35
2000 aligned Eukaryotes
36
7180 aligned 3domain
37
13921 aligned Proteobacteria
38
Comparison of MP heuristics
  • What about other techniques for escaping local
    minima?
  • TNT: a combination of divide-and-conquer,
    simulated annealing, and genetic algorithms
  • Sectorial search (random): construct ancestral
    sequence states using parsimony; randomly select
    a subset of nodes; compute iterative-improvement
    trees; if a better tree is found, replace
  • Genetic algorithm (fuse): exchange subtrees
    between two trees to see if better ones are found
  • Default search: (1) do sectorial search starting
    from five randomized greedy MP trees; (2) apply
    the genetic algorithm to find better ones; (3) output
    the best tree

39
How does this compare to PAUP-ratchet?
40
Experimental methodology for MP on real data
  • We use rRNA and DNA alignments
  • Obtained from researchers and public databases
  • We run PAUP-ratchet, TNT-default, and
    TNT-ratchet each for 24 hours beginning from
    randomized greedy MP trees
  • Each method was run five times on each dataset
    and average scores were plotted

41
500 aligned rbcL sequences (Zilla dataset)
42
854 aligned rbcL sequences
43
2000 aligned Eukaryotes
44
7180 aligned 3domain
45
13921 aligned Proteobacteria
46
Can we do even better?
  • Yes! But first let's look at
  • Disk-Covering Methods

47
Disk Covering Methods (DCMs)
  • DCMs are divide-and-conquer booster methods. They
    divide the dataset into small subproblems,
    compute subtrees using a given base method, merge
    the subtrees, and refine the supertree.
  • DCMs to date:
  • DCM1: for improving statistical performance of
    distance-based methods
  • DCM2: for improving heuristic search for MP and
    ML
  • DCM3: latest, fastest, and best (in accuracy and
    optimality) DCM

48
DCM2: technique for speeding up MP searches
49
DCM2 decomposition
  • DCM2
  • Input: distance matrix d, threshold q, sequences S
  • Algorithm:
  • 1a. Compute a threshold graph G using q and d
  • 1b. Perform a minimum weight triangulation of G
  • Find separator X in G which minimizes $\max_i |X \cup C_i|$,
    where $C_1, \dots, C_r$ are the connected components of
    $G - X$
  • Output subproblems as $X \cup C_i$.

50
Threshold graph
  • Add edges until graph is connected
  • Perform minimum weight triangulation
  • NP-hard
  • A graph is triangulated if and only if it has a
    perfect elimination ordering (PEO)
  • Max cliques can be determined in linear time
  • Use greedy triangulation heuristic: compute PEO
    by adding vertices which minimize largest edge
    added
  • Worst case is O(n^3) but fast in practice
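
The first step, adding edges until the graph is connected, can be sketched as follows. The assumptions here are that edges are added in order of increasing distance and that the resulting threshold graph contains every pair with d[i][j] <= q. The function name and the union-find bookkeeping are illustrative, and triangulation is not shown.

```python
from itertools import combinations


def connecting_threshold_graph(d):
    """Return (q, edges): the smallest threshold q at which the graph with
    edges {(i, j) : d[i][j] <= q} is connected, and that edge set."""
    n = len(d)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    pairs = sorted(combinations(range(n), 2), key=lambda e: d[e[0]][e[1]])
    components = n
    q = 0
    for i, j in pairs:
        if components == 1:
            break  # already connected; q is the connecting distance
        q = d[i][j]
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
    edges = [(i, j) for i, j in pairs if d[i][j] <= q]
    return q, edges


# Four taxa placed at hypothetical positions on a line; d is |a - b|.
points = [0, 1, 2, 10]
dist = [[abs(a - b) for b in points] for a in points]
q, edges = connecting_threshold_graph(dist)
print(q, edges)  # 8 [(0, 1), (1, 2), (0, 2), (2, 3)]
```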

51
Finding DCM2 separator
  1. Find separator X in G which minimizes $\max_i |X \cup C_i|$,
    where $C_1, \dots, C_r$ are the connected components of
    $G - X$
  2. Output subproblems as $X \cup C_i$
  3. This takes O(n^3) worst case time: perform depth
    first search on each component (O(n^2)) for each
    of O(n) separators
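
A sketch of the separator search, assuming the objective is the size of the largest subproblem (the separator together with one component) and that the candidate separators are supplied by the caller. How candidates are obtained from the triangulated graph is not shown, and all names are hypothetical.

```python
def components_without(adj, removed):
    """Connected components of the graph after deleting `removed`,
    found by depth first search."""
    seen = set(removed)
    comps = []
    for start in adj:
        if start in seen:
            continue
        comp, stack = set(), [start]
        seen.add(start)
        while stack:
            v = stack.pop()
            comp.add(v)
            for u in adj[v]:
                if u not in seen:
                    seen.add(u)
                    stack.append(u)
        comps.append(comp)
    return comps


def best_separator(adj, candidates):
    """Return (cost, X, subproblems) minimizing the largest subproblem,
    or None if no candidate disconnects the graph."""
    best = None
    for x in candidates:
        comps = components_without(adj, x)
        if len(comps) < 2:
            continue  # x does not separate the graph
        cost = max(len(x | c) for c in comps)
        if best is None or cost < best[0]:
            best = (cost, x, [x | c for c in comps])
    return best


# Path graph 0 - 1 - 2 - 3 - 4 as a toy input.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(best_separator(path, [{1}, {2}, {3}]))
# (3, {2}, [{0, 1, 2}, {2, 3, 4}])
```

Each candidate costs one traversal of the graph, which matches the slide's accounting of one depth first search pass per separator.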

52
DCM2 subsets
53
DCM3 decomposition - example
54
DCM1 vs DCM2
DCM1 decomposition: NJ gets better accuracy on
small diameter subproblems (which we shall return
to later)
DCM2 decomposition: Getting a smaller number of
smaller subproblems speeds up solution
55
We saw how decomposition takes place, now on to
supertree methods
56
Supertree Methods
57
Optimization problems
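  • Subtree Compatibility: Given a set of trees
    $T_1, \dots, T_k$, does there exist a tree $T$
    such that $T|_{L(T_i)} = T_i$ for every $i$, where
    $T|_{L(T_i)}$ is $T$ restricted to the leaves of $T_i$
    (we say $T$ contains $T_i$)?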
  • NP-hard (Steel 1992)
  • Special cases are poly-time (rooted trees, DCM)
  • MRP also NP-hard

58
Direct supertree methods
  • Strict consensus supertrees, MinCutSupertrees

59
Indirect supertree methods
  • MRP, Average consensus

60
MRP: Matrix Representation using Parsimony (very
popular)
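
The slide gives only the name, so the following is a sketch of the matrix encoding as it is commonly described, not of the lecture's own presentation: each internal edge of each source tree becomes one binary character, with 1 for taxa on one side of the edge, 0 for the other taxa of that tree, and ? for taxa missing from that tree. The input format and the function name are assumptions.

```python
def mrp_matrix(trees):
    """trees: list of (taxa, clusters) pairs, where taxa is the leaf set of
    a source tree and clusters lists one leaf subset per internal edge.
    Returns (sorted taxa, dict mapping each taxon to its character string)."""
    all_taxa = sorted(set().union(*(taxa for taxa, _ in trees)))
    rows = {t: [] for t in all_taxa}
    for taxa, clusters in trees:
        for cluster in clusters:
            for t in all_taxa:
                if t not in taxa:
                    rows[t].append("?")  # taxon absent from this source tree
                elif t in cluster:
                    rows[t].append("1")
                else:
                    rows[t].append("0")
    return all_taxa, {t: "".join(chars) for t, chars in rows.items()}


# Two hypothetical source trees with overlapping leaf sets.
source_trees = [
    ({"A", "B", "C", "D"}, [{"A", "B"}]),
    ({"B", "C", "D", "E"}, [{"D", "E"}]),
]
taxa, matrix = mrp_matrix(source_trees)
for t in taxa:
    print(t, matrix[t])
```

The supertree is then sought by running a parsimony search on this matrix, which is why MRP is listed as an indirect method.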
61
Strict Consensus Merger: faster and used in DCMs
62
Strict Consensus Merger: compatible subtrees
63
Strict Consensus Merger: compatible but collision
64
Strict Consensus Merger: incompatible subtrees
65
Strict Consensus Merger: incompatible and
collision
66
Strict Consensus Merger: difference from Gordon's
SC method
67
We saw how decomposition takes place, now on to
supertree methods
68
Tree Refinement
  • Challenge: given an unresolved tree, find a
    refinement with an optimal parsimony score
  • NP-hard

69
Tree Refinement
70
We saw how decomposition takes place, now on to
supertree methods
71
Comparing DCM decompositions
72
Study of DCM decompositions
Comparison of MP scores
Comparison of running times
DCM2 is faster and better than DCM1
73
Best DCM (DCM2) vs Random
Comparison of MP scores
Comparison of running times
DCM2 is better than RANDOM w.r.t MP scores and
running times
74
DCM2 (comparing two different thresholds)
Comparison of MP scores
Comparison of running times
75
Threshold selection techniques
Biological dataset of 503 rRNA sequences.
Threshold value at which we get two subproblems
has best MP score.
76
Comparing supertree methods
77
MRP vs. SCM
Comparison of MP scores
Comparison of running times
  1. SCM is better than MRP

78
Comparing tree refinement techniques
79
Study of tree refinement techniques
Comparison of MP scores
Comparison of running times
Constrained tree search had best MP scores but is
slower than other methods
80
Next time
  • DCM1 for improving NJ
  • Recursive-Iterative-DCM3: state of the art in
    solving MP and ML