1
NP-hardness and Phylogeny Reconstruction
  • Tandy Warnow
  • Department of Computer Sciences
  • University of Texas at Austin

2
Phylogeny
From the Tree of Life website, University of Arizona
Orangutan
Human
Gorilla
Chimpanzee
4
DNA Sequence Evolution
5
Evolution informs about everything in biology
  • Big genome sequencing projects just produce data
    -- so what?
  • Evolutionary history relates all organisms and
    genes, and helps us understand and predict
  • interactions between genes (genetic networks)
  • drug design
  • predicting functions of genes
  • influenza vaccine development
  • origins and spread of disease
  • origins and migrations of humans

6
Molecular Systematics
[Figure: five taxa U, V, W, X, Y with the DNA sequences TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT, and a tree with leaves X, U, Y, V, W]
7
Major methods for phylogeny reconstruction
  • Biology: polynomial-time methods (good enough for
    small datasets), and local search heuristics for
    NP-hard optimization problems
  • Linguistics: an exact algorithm for an
    NP-hard optimization problem

8
Outline for the rest of the talk
  • NP-hard and polynomial time problems
  • Phylogeny reconstruction in biology: the NP-hard
    maximum parsimony problem, and how we can solve
    it better
  • Phylogeny reconstruction in linguistics: the
    NP-hard perfect phylogeny problem, and how we
    solve it exactly
  • An open problem from whole genome phylogeny
  • Thoughts about computational biology, and the
    role of mathematics in this field

9
Polynomial-time problems
  • Shortest path: Given edge-weighted graph
    G = (V,E) and two vertices, v and w, find
    shortest path from v to w (O(n^2) time)
  • 2-colorability: Given graph G = (V,E), determine
    if we can assign two colors to the vertices of G
    so that no edge connects vertices of the same
    color (O(nm) time)
  • 3-clique: Given graph G = (V,E), determine if G
    contains a 3-clique (O(n^3) time)
  • For all these, n = |V| and m = |E|.
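
The 2-colorability test above can be sketched with a breadth-first search. This is a minimal illustration, not code from the talk; the adjacency-list dictionary and the function name two_coloring are assumptions made for the example. BFS runs in O(n + m) time, which is within the O(nm) bound quoted on the slide.

```python
from collections import deque

def two_coloring(adj):
    """Return a dict vertex -> 0/1 if the graph is 2-colorable, else None.

    adj maps each vertex to a list of its neighbors (undirected graph).
    """
    color = {}
    for start in adj:                  # handle every connected component
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]   # give the neighbor the other color
                    queue.append(v)
                elif color[v] == color[u]:
                    return None               # an edge joins two same-colored vertices
    return color

# Hypothetical inputs: a 4-cycle is 2-colorable, a triangle is not.
square = {1: [2, 4], 2: [1, 3], 3: [2, 4], 4: [1, 3]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
print(two_coloring(square) is not None)   # True
print(two_coloring(triangle))             # None
```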

10
NP-hard problems
  • Some problems seem hard to solve
  • Hamilton path: Given graph G, determine if G has
    a simple path going through every vertex
  • 3-colorability: Given graph G, determine if G can
    be properly 3-colored
  • Max-clique: Given graph G, find a largest clique
    in the graph

11
Technical definition of NP-hard
  • NP is the class of decision problems for which
    yes-instances can be verified in polynomial
    time. (Example: I can prove to you that a graph
    has a 3-coloring by presenting that 3-coloring to
    you. So 3-coloring is in NP.)
  • Definition: A problem X is NP-hard if every
    problem in NP can be reduced to X in polynomial
    time (yes-instances mapped to yes-instances, and
    no-instances mapped to no-instances). For example,
    3-coloring is NP-hard, so 2-coloring can be
    reduced to 3-coloring.
  • Definition: A problem X is in P if it is in NP
    and can be solved in polynomial time.
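
The 3-coloring example of membership in NP can be made concrete: checking a claimed coloring is easy even though finding one may not be. The sketch below is illustrative only; the function name is_proper_coloring and the edge-list input format are assumptions.

```python
def is_proper_coloring(edges, coloring, k=3):
    """Check a claimed k-coloring (the certificate) in time linear in the input.

    edges is a list of vertex pairs; coloring maps vertex -> color in range(k).
    """
    if any(c not in range(k) for c in coloring.values()):
        return False                       # a color outside 0..k-1 was used
    return all(
        u in coloring and v in coloring and coloring[u] != coloring[v]
        for u, v in edges
    )

# Hypothetical instance: a triangle with a pendant vertex.
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(is_proper_coloring(edges, {1: 0, 2: 1, 3: 2, 4: 0}))  # True
print(is_proper_coloring(edges, {1: 0, 2: 1, 3: 0, 4: 1}))  # False
```

The verifier never searches for a coloring; it only checks the one it is handed, which is what makes the problem a member of NP.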

12
NP-hard optimization problems
  • Graph-theoretic examples
  • Travelling Salesperson (TSP): find minimum cost
    tour visiting every vertex
  • Maximum Clique: find maximum sized subset of
    vertices which are all pairwise adjacent
  • Minimum Vertex Coloring: find minimum number of
    colors so that every vertex can be assigned a
    color, and no edge connects vertices of the same
    color.

13
NP-hard decision problems
  • Each optimization problem has a corresponding
    decision problem. For example, the max clique
    optimization problem corresponds to the decision
    problem:
  • Input: Graph G = (V,E), positive integer B
  • Question: Does there exist a subset V' of V such
    that |V'| ≥ B and V' is a clique?

14
NP-hard problems and polynomial time problems (P
vs. NP)
  • Some decision problems can be solved in
    polynomial time
  • Can graph G be 2-colored?
  • Does graph G have a 5-clique?
  • Some decision problems seem to not be solvable in
    polynomial time
  • Can graph G be 3-colored?
  • Does graph G have a k-clique?

15
P vs. NP, continued
  • The big question in theoretical computer
    science is:
  • Is it possible to solve an NP-hard problem in
    polynomial time?
  • If the answer is yes, then every problem in NP
    can be solved in polynomial time, so P = NP. This
    is generally not believed.

16
Coping with NP-hard problems
  • Since NP-hard problems may not be solvable in
    polynomial time, the options are:
  • Solve the problem exactly (but use lots of time
    on some inputs)
  • Use heuristics which may not solve the problem
    exactly (and which might be computationally
    expensive, anyway)

17
Example: Maximum Clique
  • Exact solution: find largest k so that some
    subset of size k is a clique. Runs in O(n^k)
    time.
  • Heuristic: Pick a vertex at random, and greedily
    assemble a set which is a clique, and stop when
    you can't add any more vertices. Repeat until
    tired (or bored, or running out of time, or ...).
    How do we evaluate the running time, or accuracy?
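
Both strategies on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the speaker's code: the graph is a dictionary of neighbor sets, and the names max_clique_exact, greedy_clique and best_of_restarts are hypothetical. The exact search enumerates subsets and is exponential in the worst case; the heuristic returns a clique that cannot be extended, with no guarantee that it is a largest one.

```python
import random
from itertools import combinations

def is_clique(adj, vertices):
    """True if every pair of the given vertices is adjacent."""
    return all(v in adj[u] for u, v in combinations(vertices, 2))

def max_clique_exact(adj):
    """Try subset sizes from largest to smallest; return the first clique found."""
    nodes = list(adj)
    for k in range(len(nodes), 0, -1):
        for subset in combinations(nodes, k):
            if is_clique(adj, subset):
                return set(subset)
    return set()

def greedy_clique(adj, rng):
    """Visit vertices in random order, keeping each one adjacent to all kept so far."""
    nodes = list(adj)
    rng.shuffle(nodes)
    clique = []
    for v in nodes:
        if all(v in adj[u] for u in clique):
            clique.append(v)
    return set(clique)

def best_of_restarts(adj, restarts=20, seed=0):
    """Repeat the greedy heuristic and keep the largest clique seen."""
    rng = random.Random(seed)
    best = set()
    for _ in range(restarts):
        candidate = greedy_clique(adj, rng)
        if len(candidate) > len(best):
            best = candidate
    return best

# Hypothetical graph: a 4-clique {1,2,3,4} with a path 4-5-6 attached.
example = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4},
           4: {1, 2, 3, 5}, 5: {4, 6}, 6: {5}}
print(max_clique_exact(example))   # {1, 2, 3, 4}
```

The closing question on the slide applies directly here: the restart loop has an obvious running time, but how close its answer is to the optimum can only be judged against the exact search, which is the expensive part.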

18
General comments for NP-hard optimization problems
  • Getting exact solutions may not be possible for
    some problems on some inputs, without spending a
    great deal of time.
  • You may not know when you have an optimal
    solution, if you use a heuristic.
  • Sometimes exact solutions may not be necessary,
    and approximate solutions may suffice. (But this
    may not be true for biology.)

19
Major methods for phylogeny reconstruction
  • Biology: polynomial-time methods (good enough for
    small datasets), and local search heuristics for
    NP-hard optimization problems
  • Linguistics: an exact algorithm for an
    NP-hard optimization problem

20
Polynomial time methods
  • Quartet-based methods:
  • Construct trees on all 4-leaf subsets
  • Combine quartet trees into tree on full dataset
  • Distance-based methods:
  • Estimate pairwise distance matrix d_ij
  • Find tree T and edge-weights w(e) so that d^T_ij
    approximates d_ij
  • For both methods, if there are no errors (in
    quartet trees or pairwise distances) then the
    correct tree can be obtained in polynomial time.
    Otherwise, optimization problems are NP-hard.
  • Polytime heuristics along these lines are popular.
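
One way the two ideas meet is the four-point method: estimate pairwise distances, then pick the quartet split whose two within-pair distances have the smallest sum. The sketch below is an illustration, not the method used in the talk. It assumes uncorrected Hamming distances (real analyses typically apply a model-based correction), and the sequences and function names are chosen for the example.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(seqs):
    """Pairwise Hamming distances as a nested dict keyed by sequence."""
    return {s: {t: hamming(s, t) for t in seqs} for s in seqs}

def four_point_quartet(names, dist):
    """Return the split ((a,b),(c,d)) minimizing dist[a][b] + dist[c][d]."""
    a, b, c, d = names
    splits = {
        ((a, b), (c, d)): dist[a][b] + dist[c][d],
        ((a, c), (b, d)): dist[a][c] + dist[b][d],
        ((a, d), (b, c)): dist[a][d] + dist[b][c],
    }
    return min(splits, key=splits.get)

# Illustrative input: four short sequences used as their own names.
seqs = ["ACT", "ACA", "GTA", "GTT"]
dist = distance_matrix(seqs)
print(four_point_quartet(seqs, dist))   # (('ACT', 'ACA'), ('GTA', 'GTT'))
```

Here the three candidate sums are 2, 6 and 4, so the split pairing ACT with ACA is selected.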

21
Phylogeny reconstruction
  • In biology, the most popular approaches for
    reconstructing phylogenetic trees are heuristics
    for Maximum Parsimony (NP-hard) or Maximum
    Likelihood (conjectured to be NP-hard)
  • In historical linguistics, a new approach based
    upon exactly solving the NP-hard Perfect
    Phylogeny problem has been useful.

22
DNA Sequence Evolution
23
Maximum Parsimony
  • Given a set S of strings of the same length over
    a fixed alphabet, find a tree T leaf-labelled by
    S and with all internal nodes labelled by strings
    of the same length over the same alphabet which
    minimizes the sum of the edge lengths (the length
    of an edge is the Hamming distance between the
    labels at its endpoints).
  • Motivation: seeks to minimize the total number of
    point mutations needed to explain the data
  • NP-hard
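
Scoring a fully labelled tree is the easy part of the problem: add up the Hamming distances along the edges. The sketch below shows only that step. The tree, its internal labels and the function names are assumptions made for illustration; the hard part, choosing the topology, is not attempted.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def parsimony_length(edges):
    """Sum of edge lengths for a tree given as (label, label) pairs."""
    return sum(hamming(u, v) for u, v in edges)

# Hypothetical labelled tree on leaves ACT, ACA, GTA, GTT with two
# internal nodes labelled ACA and GTA (first element of each pair below
# is the internal node for leaf edges).
edges = [
    ("ACA", "ACT"),   # internal ACA - leaf ACT, length 1
    ("ACA", "ACA"),   # internal ACA - leaf ACA, length 0
    ("ACA", "GTA"),   # internal ACA - internal GTA, length 2
    ("GTA", "GTA"),   # internal GTA - leaf GTA, length 0
    ("GTA", "GTT"),   # internal GTA - leaf GTT, length 1
]
print(parsimony_length(edges))   # 4
```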

24
Major phylogeny reconstruction methods
  • In biology: mostly hill-climbing heuristics that
    attempt to solve NP-hard optimization problems
    (maximum parsimony or maximum likelihood)
  • In historical linguistics: much less is
    established, but an exact solution to an
    NP-hard problem looks very promising.

25
Maximum Parsimony
[Figure: the input sequences ACT, ACA, GTA, GTT and candidate trees on them with labelled internal nodes]
26
Maximum Parsimony
[Figure: three candidate labelled trees on the sequences ACT, ACA, GTA, GTT, with edge lengths shown; their MP scores are 7, 5, and 4, and the tree with MP score 4 is the optimal MP tree]
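
For four sequences there are only three unrooted topologies, so the optimum can be checked by hand or with Fitch's algorithm, which computes the minimum number of changes for a fixed topology. The sketch below is an illustration, not part of the original slides. It assumes the four sequences on this slide, hypothetical leaf names a to d, and trees written as nested pairs rooted on the internal edge (the parsimony score does not depend on where the root is placed).

```python
def fitch_site(node, site, seqs):
    """Return (candidate state set, change count) for one site."""
    if isinstance(node, str):                  # leaf: its observed state
        return {seqs[node][site]}, 0
    left, right = node
    left_states, left_cost = fitch_site(left, site, seqs)
    right_states, right_cost = fitch_site(right, site, seqs)
    common = left_states & right_states
    if common:                                 # children agree: no new change
        return common, left_cost + right_cost
    return left_states | right_states, left_cost + right_cost + 1

def fitch_score(tree, seqs):
    """Minimum number of changes over all internal labellings of the tree."""
    n_sites = len(next(iter(seqs.values())))
    return sum(fitch_site(tree, s, seqs)[1] for s in range(n_sites))

seqs = {"a": "ACT", "b": "ACA", "c": "GTA", "d": "GTT"}
topologies = [
    (("a", "b"), ("c", "d")),
    (("a", "d"), ("b", "c")),
    (("a", "c"), ("b", "d")),
]
for tree in topologies:
    print(tree, fitch_score(tree, seqs))   # scores 4, 5, 6 in this order
```

Because the Fitch score is a minimum over internal labellings, a particular labelled drawing of a topology can have a higher score than the value computed here. The best score over the three topologies is 4, in agreement with the optimal MP tree on the slide.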
27
Maximum Parsimony: computational complexity
28
Exact solutions: fixed-parameter approaches
  • Fixed-parameter approaches restrict some
    parameter and solve the problem exactly for those
    cases. Examples:
  • Does graph G = (V,E) have a k-clique? Solvable
    in O(n^k) time (n = |V|).
  • Does graph G = (V,E) have a k-coloring? Solvable in
    O(k^n) time for general k, and in O(nm) time for
    k = 2 (n = |V| and m = |E|).
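
The O(k^n) bound for k-coloring corresponds to trying every assignment of k colors to n vertices. A minimal brute-force sketch follows; the function name k_coloring and the input format are assumptions, and the approach is only practical for very small graphs.

```python
from itertools import product

def k_coloring(vertices, edges, k):
    """Return a proper k-coloring as a dict, or None if none exists.

    Enumerates all k**n assignments, so the running time is exponential in n.
    """
    for assignment in product(range(k), repeat=len(vertices)):
        color = dict(zip(vertices, assignment))
        if all(color[u] != color[v] for u, v in edges):
            return color
    return None

# Hypothetical instance: the complete graph on four vertices.
vertices = [1, 2, 3, 4]
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(k_coloring(vertices, edges, 3))               # None
print(k_coloring(vertices, edges, 4) is not None)   # True
```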

29
Solving MP (maximum parsimony) and ML (maximum
likelihood)
  • Why are MP and ML hard? The search space is huge
    -- there are (2n-5)!! trees, it is easy to
    get stuck in local optima, and there can be many
    optimal trees.
  • Why try to solve MP or ML? Our experimental
    studies show that polynomial time algorithms
    don't do as well as MP or ML when trees are big
    and have high rates of evolution.
  • Why solve MP and ML well? Because trees can
    change in biologically significant ways with
    small changes in objective criterion.
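
The size of the search space is easy to compute. The expression (2n-5)!! is the product of the odd numbers up to 2n-5 and counts the unrooted binary trees on n labelled leaves. The function name below is an assumption made for the example.

```python
def num_unrooted_binary_trees(n):
    """Return (2n-5)!! = 1 * 3 * 5 * ... * (2n-5), defined for n >= 3."""
    if n < 3:
        raise ValueError("need at least 3 leaves")
    count = 1
    for odd in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n-5
        count *= odd
    return count

for n in (4, 5, 10, 20):
    print(n, num_unrooted_binary_trees(n))
# 4 leaves give 3 trees, 5 give 15, and 10 already give 2027025.
```

The count grows faster than exponentially, which is why exhaustive search is ruled out for all but the smallest datasets.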

[Figure: MP score plotted over the space of phylogenetic trees, showing a local optimum and the global optimum]
30
MP/ML heuristics
[Figure ("fake study"): performance of a hill-climbing heuristic; MP score of best trees plotted against time]
31
Speeding up MP/ML heuristics
[Figure ("fake study"): performance of a hill-climbing heuristic compared with the desired performance; MP score of best trees plotted against time]
32
Divide-and-Conquer Approach
  • Step 1: Get good starting tree
  • 1. Decompose the dataset into smaller,
    overlapping subsets.
  • 2. Construct phylogenetic trees on the subsets using
    a base method.
  • 3. Merge the subtrees into a single tree on the
    entire dataset.
  • 4. Refine the resultant tree to produce a binary
    tree.
  • Step 2: Follow with usual heuristic (hill-climbing or
    other such strategy) to improve tree.

33
Divide-and-conquer approaches
  • Step 1: Get good starting tree
  • Divide dataset into overlapping subsets
  • Construct trees on each subset
  • Combine subtrees into tree on full dataset
  • Refine into binary tree if needed
  • Step 2: Apply favored heuristic to improve tree.

34
Using divide-and-conquer for MP and ML
  • Conjecture: better (more accurate) solutions will
    be found in less time, if we analyze a small
    number of smaller subsets and then combine
    solutions
  • Need:
  • 1. techniques for decomposing datasets,
  • 2. base methods for subproblems, and
  • 3. techniques for combining subtrees

35
The DCM3 technique for speeding up MP/ML searches
36
DCM Decompositions
Input: Set S of sequences, distance matrix d,
threshold value
1. Compute threshold graph
2. Perform minimum weight triangulation
[Figure: DCM1 decomposition and DCM2 decomposition]
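
The first step can be sketched as follows. The slide does not define the threshold graph, so this example assumes the usual construction: one vertex per sequence, with an edge between two sequences whenever their distance is at most the threshold. The name threshold_graph, the parameter q and the toy distances are hypothetical, and the triangulation step is not shown.

```python
def threshold_graph(dist, q):
    """Adjacency sets of the graph joining pairs with dist <= q.

    dist is a symmetric nested dict of pairwise distances;
    q is the threshold value (an assumed parameter name).
    """
    names = list(dist)
    return {
        a: {b for b in names if b != a and dist[a][b] <= q}
        for a in names
    }

# Toy distance matrix on three sequences.
dist = {
    "s1": {"s1": 0, "s2": 1, "s3": 4},
    "s2": {"s1": 1, "s2": 0, "s3": 2},
    "s3": {"s1": 4, "s2": 2, "s3": 0},
}
print(threshold_graph(dist, 2))
# {'s1': {'s2'}, 's2': {'s1', 's3'}, 's3': {'s2'}} (set order may vary)
```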
37
DCM3 Decompositions
Input: Set S of sequences, and estimate T of the
true tree
1. Compute short subtree graph G(S,T), based
upon T
2. Find clique separator in the graph G(S,T), and
form subproblems
[Figure: the graph G(S,T) and the resulting DCM3 decomposition]
38
Strict Consensus Merger (SCM)
39
DCM3-boosting a base method
  • Decompose the dataset into smaller, overlapping
    subsets, using DCM3
  • Construct phylogenetic trees on the subsets using
    a base method
  • Merge the subtrees into a single tree using the
    Strict Consensus Merger
  • Use PAUP constrained search to refine the
    resultant tree

40
Iterative-DCM3 vs Ratchet
41
Iterative-DCM3 vs Ratchet
42
Comments
  • Developing heuristics with good performance takes
    mathematical insights, but may not involve
    proofs. Even so, it's really important.
  • Extracting information from the set of optimal
    (and near-optimal) solutions is a major open
    problem.
  • Other types of data (gene orders, morphology)
    present novel challenges.
  • Reticulate evolution detection and reconstruction
    is a major open problem.

43
Ringe-Warnow Phylogenetic Tree of Indo-European
44
Cognate Classes
  • Two words w1 and w2 are in the same cognate
    class, if they evolved from the same word through
    sound changes.
  • French champ and Italian campo are both
    descendants of Latin campus; thus the two words
    belong to the same cognate class.
  • Spanish mucho and English much are not in the
    same cognate class.

45
Phylogenies of Languages
  • Languages evolve over time, just as biological
    species do (geographic and other separations
    induce changes that over time make different
    dialects incomprehensible -- and new languages
    appear)
  • The result can be modelled as a rooted tree
  • The interesting thing is that many
    characteristics of languages evolve without back
    mutation or parallel evolution -- so a perfect
    phylogeny is possible!

46
Historical Linguistic Data
  • A character is a function that maps a set of
    languages, L, to a set of states.
  • Three kinds of characters
  • Phonological (sound changes)
  • Lexical (meanings based on a wordlist)
  • Morphological (grammatical features)

47
Perfect Phylogeny
  • A phylogeny T for a set S of taxa is a perfect
    phylogeny if each state of each character
    occupies a subtree (no character has
    back-mutations or parallel evolution)

48
Homoplasy-Free Evolution (perfect phylogenies)
[Figure: two example trees, one labelled YES (homoplasy-free) and one labelled NO]

49
The Comparative Method (Hoenigswald 1960)
  • Used to verify relatedness between languages and
    to infer features of the ancestral languages of a
    group of related languages
  • Step 1: establish sound correspondences in a set
    of related languages
  • Step 2: establish cognate classes

50
The Ringe-Warnow Model of Language Evolution
  • The nodes of the tree which contain elements of
    the same cognate class should form a rooted
    connected subgraph of the true tree
  • The model is known as Character Compatibility,
    or Perfect Phylogeny.

51
Character Compatibility and Perfect Phylogeny
  • Ringe and Warnow postulated that all properly
    encoded characters for the Indo-European
    languages should be compatible on the true tree,
    if such a tree existed
  • A tree T on which all characters are compatible
    is called a perfect phylogeny

52
The Perfect Phylogeny Problem
  • Given a set S of taxa (species, languages, etc.)
    determine if a perfect phylogeny T exists for S.
  • The problem of determining whether a perfect
    phylogeny exists is NP-hard (McMorris et al.
    1994, Steel 1991).

53
Triangulated Graphs
  • A graph is triangulated if it has no induced
    simple cycles of size four or more (every cycle
    of length four or more has a chord).
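
A triangulated (chordal) graph is one in which every cycle of length four or more has a chord. One simple way to test this is to delete simplicial vertices repeatedly (a vertex is simplicial when its neighbors are pairwise adjacent): the graph is triangulated exactly when every vertex can be deleted this way. The sketch below is a straightforward polynomial-time version of that idea, written for illustration; faster linear-time tests exist, and the function name is an assumption.

```python
def is_triangulated(adj):
    """True if the undirected graph (dict of neighbor sets) is chordal."""
    graph = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    while graph:
        simplicial = next(
            (v for v, nbrs in graph.items()
             if all(b in graph[a] for a in nbrs for b in nbrs if a != b)),
            None,
        )
        if simplicial is None:
            return False          # no simplicial vertex left: a chordless cycle remains
        for u in graph[simplicial]:
            graph[u].discard(simplicial)
        del graph[simplicial]
    return True

four_cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(is_triangulated(four_cycle))   # False
print(is_triangulated(with_chord))   # True
```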

54
Triangulating Colored Graphs: An Example
  • A graph that can be c-triangulated

55
Triangulating Colored Graphs: An Example
  • A graph that can be c-triangulated

56
Triangulating Colored Graphs: An Example
  • A graph that cannot be c-triangulated

57
Triangulating Colored Graphs (TCG)
  • Triangulating Colored Graphs: given a
    vertex-colored graph G, determine if G can be
    c-triangulated (made triangulated by adding
    edges, without ever joining two vertices of the
    same color).

58
The PP and TCG Problems
  • Buneman's Theorem:
    A perfect phylogeny exists for a set S
    if and only if the associated character state
    intersection graph can be c-triangulated.
  • The PP and TCG problems are polynomially
    equivalent and NP-hard.

59
Solving the PP Problem Using Buneman's Theorem
  • Yes Instance of PP:
          c1  c2  c3
    s1     3   2   1
    s2     1   2   2
    s3     1   1   3
    s4     2   1   1
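
That this is a yes-instance can be checked directly by exhibiting a tree and testing that every state of every character occupies a connected subtree. The sketch below does that. The tree, the two internal nodes u and v and their labels are a construction made for this example, not taken from the slides, and the check verifies a given tree rather than using Buneman's theorem to find one.

```python
def is_perfect_phylogeny(labels, edges):
    """True if, for each character, the nodes sharing a state are connected.

    labels maps node -> tuple of character states;
    edges is a list of node pairs forming a tree.
    """
    adj = {v: set() for v in labels}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    n_chars = len(next(iter(labels.values())))
    for c in range(n_chars):
        for state in {lab[c] for lab in labels.values()}:
            nodes = {v for v, lab in labels.items() if lab[c] == state}
            start = next(iter(nodes))
            seen, stack = {start}, [start]
            while stack:                       # search restricted to this state
                u = stack.pop()
                for w in adj[u] & nodes:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
            if seen != nodes:
                return False                   # the state is split across the tree
    return True

# Leaves s1..s4 carry the states from the table (c1, c2, c3);
# u and v are hypothetical internal nodes with labels chosen for the example.
labels = {
    "s1": (3, 2, 1), "s2": (1, 2, 2), "s3": (1, 1, 3), "s4": (2, 1, 1),
    "u": (1, 2, 1), "v": (1, 1, 1),
}
edges = [("s1", "u"), ("s2", "u"), ("u", "v"), ("v", "s3"), ("v", "s4")]
print(is_perfect_phylogeny(labels, edges))   # True
```

Joining all four leaves to a single internal node labelled (1, 2, 1) fails the same check, because the two leaves with state 1 for c2 are then separated by a node in state 2.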

Some special cases are easy
  • Binary character perfect phylogeny: solvable in
    linear time
  • r-state characters: solvable in polynomial time
    for each r (combinatorial algorithm)
  • Two character perfect phylogeny: solvable in
    polynomial time (produces 2-colored graph)
  • k-character perfect phylogeny: solvable in
    polynomial time for each k (produces k-colored
    graphs -- connections to Robertson-Seymour graph
    minor theory)
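
For binary characters the existence question has a simple pairwise test: an (unrooted) perfect phylogeny exists exactly when no pair of characters shows all four state combinations 00, 01, 10 and 11. The sketch below implements that test for illustration. It is not the linear-time algorithm mentioned on the slide; it runs in time proportional to the number of taxa times the square of the number of characters, and the function name and matrices are assumptions.

```python
from itertools import combinations

def binary_perfect_phylogeny_exists(matrix):
    """Four-gamete test on a taxa-by-characters 0/1 matrix (list of rows)."""
    n_chars = len(matrix[0])
    for i, j in combinations(range(n_chars), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:        # 00, 01, 10 and 11 all occur
            return False
    return True

compatible = [
    [1, 0, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
conflicting = [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
]
print(binary_perfect_phylogeny_exists(compatible))    # True
print(binary_perfect_phylogeny_exists(conflicting))   # False
```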

62
The Indo-European (IE) Dataset
  • 24 languages
  • 22 phonological characters, 15 morphological
    characters, and 333 lexical characters
  • Total number of working characters is 390
    (multiple character coding, and parallel
    development)
  • A phylogenetic tree T on the IE dataset (Ringe,
    Taylor and Warnow)
  • T is compatible with all but 22 characters: 16
    (18) monomorphic and 6 polymorphic
  • Resolves most of the significant controversies in
    Indo-European evolution; shows, however, that
    Germanic is a problem (not treelike)

63
Phylogenetic Tree of the IE Dataset
64
An open problem to take home
  • computing the transposition distance
    between two genomes
  • (important in whole genome phylogeny
    reconstruction)

65
Genomes As Signed Permutations
1 5 3 4 -2 -6 or 6 2 -4 3 5 1, etc.
66
Genomes Evolve by Rearrangements
1 2 3 4 5 6 7 8 9 10
  • Inversion (Reversal)
  • Transposition
  • Inverted Transposition
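
The three rearrangement types can be written as small operations on a signed permutation stored as a list. This is a minimal sketch with assumed conventions: genomes are linear, indices are 0-based half-open slices, and the function names are chosen for the example.

```python
def inversion(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign of each gene in it."""
    return genome[:i] + [-g for g in reversed(genome[i:j])] + genome[j:]

def transposition(genome, i, j, k):
    """Exchange the adjacent segments genome[i:j] and genome[j:k]."""
    return genome[:i] + genome[j:k] + genome[i:j] + genome[k:]

def inverted_transposition(genome, i, j, k):
    """Move genome[i:j] to just after genome[j:k], inverting it on the way."""
    moved = [-g for g in reversed(genome[i:j])]
    return genome[:i] + genome[j:k] + moved + genome[k:]

genome = list(range(1, 11))                    # 1 2 3 4 5 6 7 8 9 10
print(inversion(genome, 2, 5))                 # [1, 2, -5, -4, -3, 6, 7, 8, 9, 10]
print(transposition(genome, 1, 3, 6))          # [1, 4, 5, 6, 2, 3, 7, 8, 9, 10]
print(inverted_transposition(genome, 1, 3, 6)) # [1, 4, 5, 6, -3, -2, 7, 8, 9, 10]
```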

67
An open problem to play with
  • Given two permutations on {1, 2, ..., n}, compute the
    minimum transposition distance (unknown
    computational complexity)
  • (The corresponding problem for inversion
    distances involves very beautiful graph theory
    and algorithms.)
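
For very small n the transposition distance can be found by breadth-first search over permutations. The sketch below is a brute-force illustration only: it explores up to n! permutations and says nothing about the complexity of the general problem. The function names are assumptions, and a transposition is taken to mean exchanging two adjacent segments.

```python
from collections import deque

def transposition_neighbors(perm):
    """Yield every permutation reachable by exchanging two adjacent segments."""
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                yield perm[:i] + perm[j:k] + perm[i:j] + perm[k:]

def transposition_distance(source, target):
    """Minimum number of transpositions turning source into target (BFS)."""
    source, target = tuple(source), tuple(target)
    if sorted(source) != sorted(target):
        raise ValueError("inputs must be permutations of the same elements")
    dist = {source: 0}
    queue = deque([source])
    while queue:
        perm = queue.popleft()
        if perm == target:
            return dist[perm]
        for nxt in transposition_neighbors(perm):
            if nxt not in dist:
                dist[nxt] = dist[perm] + 1
                queue.append(nxt)
    return None   # not reached when both inputs are permutations of the same set

print(transposition_distance((1, 2, 3), (2, 3, 1)))   # 1
print(transposition_distance((1, 2, 3), (3, 2, 1)))   # 2
```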

68
Summary
  • NP-hard optimization problems abound in phylogeny
    reconstruction, and in computational biology in
    general, and need very accurate solutions
  • Many real problems have beautiful and natural
    combinatorial and graph-theoretic formulations

69
Acknowledgements
  • NSF and the David and Lucile Packard Foundation
    (funding)
  • Collaborators: Bernard Moret (UNM CS), Donald
    Ringe (Penn Linguistics)
  • Students: Usman Roshan and Luay Nakhleh

70
Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/