On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities

- Ilan Gronau, Shlomo Moran
- Technion, Israel Institute of Technology
- Haifa, Israel

Pairwise-Distance Based Reconstruction

[Figure: a tree over the taxa B, E, G, H, L, M, inducing the tree metric $D_T$]

Optimization Criteria

We want the tree metric $D_T$ to approximate all pairwise distances in $D$ simultaneously: $D_T$ should be close to $D$.

Two closeness measures are studied here:

- Maximal Difference ($\ell_\infty$)
- Maximal Distortion

Maximal Difference ($\ell_\infty$) vs. Maximal Distortion

[Figure: the matrices $D$ and $D_T$, both indexed by the taxa B, E, G, H, L, M]

Goal: find an optimal tree T, i.e. one that minimizes the maximal difference (or maximal distortion) between $D$ and $D_T$.

Previous works on Approximating Dissimilarities by Tree Distances

- Negative results (NP-hardness)
  - Closest tree-metric (even ultrametric) to a dissimilarity matrix under $\ell_1$, $\ell_2$ [Day 87]
  - Closest tree-metric to a dissimilarity matrix under $\ell_\infty$ [ABFPT99]
    - Hard to approximate better than 1.125
  - Implicit: hard to approximate the closest MaxDist tree within any constant factor
- Positive results
  - Closest ultrametric to a dissimilarity matrix under $\ell_\infty$ [Krivanek 88]
  - 3-approximation of the closest additive metric to a given metric [ABFPT99]
    - (implicit 6-approximation for general dissimilarity matrices)

This Work: Triplet-Distances (Distances to Triplets' Midpoints)

[Figure: taxa i, j, k joined at the midpoint $C(i,j,k)$; $t_T(i;jk)$ is the distance from i to $C(i,j,k)$]

- $t_T(i;jk) = t_T(i;kj)$
- $t_T(i;ij) = 0$
- $t_T(i;jj) = D_T(i,j)$

Triplet-Distances Defined by 2-Distances

- Each distance matrix $D$ defines 3-trees: $t(i;jk) = \tfrac{1}{2}\,[D(i,j) + D(i,k) - D(j,k)]$.

[Figure: any metric on 3 taxa i, j, k defines a 3-tree; the example uses the pairwise distances 7, 8, 9]
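As a worked instance of the formula $t(i;jk) = \tfrac{1}{2}[D(i,j) + D(i,k) - D(j,k)]$, the sketch below computes the three pendant lengths of the 3-tree for pairwise distances 7, 8, 9. Which pair carries which value is not recoverable from the transcript, so the labeling $D(i,j) = 8$, $D(i,k) = 9$, $D(j,k) = 7$ is an assumption, as is the nested-dictionary layout.

```python
def triplet_distance(D, i, j, k):
    """t(i; jk) = 1/2 * [D(i,j) + D(i,k) - D(j,k)]."""
    return 0.5 * (D[i][j] + D[i][k] - D[j][k])


# Assumed labeling of the example: D(i,j) = 8, D(i,k) = 9, D(j,k) = 7.
D = {
    "i": {"i": 0, "j": 8, "k": 9},
    "j": {"i": 8, "j": 0, "k": 7},
    "k": {"i": 9, "j": 7, "k": 0},
}

t_i = triplet_distance(D, "i", "j", "k")
t_j = triplet_distance(D, "j", "i", "k")
t_k = triplet_distance(D, "k", "i", "j")
print(t_i, t_j, t_k)  # 5.0 3.0 4.0

# The pendant edges of the 3-tree add back up to the pairwise distances.
assert t_i + t_j == D["i"]["j"]
assert t_i + t_k == D["i"]["k"]
assert t_j + t_k == D["j"]["k"]
```

The same function reproduces the degenerate cases: `triplet_distance(D, "i", "i", "j")` is 0 and `triplet_distance(D, "i", "j", "j")` equals `D["i"]["j"]`.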

Triplet-Distance Based Reconstruction

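$t(i;jk) = \tfrac{1}{2}\,[D(i,j) + D(i,k) - D(j,k)]$

[Figure: a table of triplet-distances indexed by the taxa B, E, G, H, L, M and the taxon pairs BB, BE, BG, ..., LL, LM, MM; the tree to be reconstructed from it is marked "?"]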

Why use Triplet-Distances?

1. They enable more accurate estimations of 2-distances.
2. They are used (de facto) by known reconstruction algorithms.

Improved Estimations of Pairwise Distances

[Figure: calculating the entry $D(H,E)$ of $D$; labeled "Information Loss"]

Improved Estimations (cont)

- Estimate $D(H,E)$ by calculating all the 3-trees on $\{H, E, X\}$, $X \ne H, E$
- (Or calculate just one 3-tree, for a trusted 3rd taxon X)
- V. Ranwez, O. Gascuel, Improvement of distance-based phylogenetic methods by a local maximum likelihood approach using triplets, Mol. Biol. Evol. 19(11), 1952-1963 (2002).
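One way to read the first bullet is sketched below. It assumes each 3-tree on {H, E, X} has already been estimated elsewhere and is given as its three pendant lengths (distance from each taxon to the triplet's center). Each 3-tree then yields one estimate of D(H,E), namely the sum of the H and E pendant lengths. Combining the estimates by a plain average is an assumption made here for illustration; the slides do not say how the 3-trees are combined, and the numbers are made up.

```python
from statistics import mean


def estimate_pair_distance(three_trees, a, b):
    """Average, over third taxa X, of the a-b path length in the 3-tree on {a, b, X}.

    three_trees maps X to a dict of pendant lengths for a, b and X
    (hypothetical input format).
    """
    estimates = [edges[a] + edges[b] for edges in three_trees.values()]
    return mean(estimates)


# Hypothetical pendant lengths for two third taxa, B and G.
three_trees = {
    "B": {"H": 2.0, "E": 3.0, "B": 4.0},
    "G": {"H": 2.5, "E": 3.5, "G": 1.0},
}

estimate = estimate_pair_distance(three_trees, "H", "E")
print(estimate)
```

Using a single trusted third taxon X, as in the second bullet, corresponds to passing a dictionary with one entry.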

(Implicit) use of Triplet-Distances in 2-Distance Reconstruction Algorithms

$t(i;jk) = \tfrac{1}{2}\,[D(i,j) + D(i,k) - D(j,k)]$

1st use: Triplet Distances from a Single Source

- Fix a taxon r, and construct a tree T which minimizes [formula missing]
- An optimal solution is computable in $O(n^2)$ time, and is used e.g. in:
  - (FKW95) Optimal approximation of distances by ultrametric trees.
  - (ABFPT99) The best known approximation of distances by general trees.
  - (BB99) Fast construction of Buneman trees.

2nd use: Saitou-Nei Neighbour Joining

The neighbor-selection criterion of NJ selects a taxon-pair i, j which maximizes the sum [formula missing]

[Figure: the taxon pair i, j together with the remaining taxa r]

Previous Works on Triplet-Dissimilarities/Distances

- I. Gronau, S. Moran, Neighbor Joining Algorithms for Inferring Phylogenies via LCA-Distances, Journal of Computational Biology 14(1), pp. 1-15 (2007).
- Works which use the total weights of 3-trees:
  - S. Joly, G. Le Calvé, Three-Way Distances, Journal of Classification 12, pp. 191-205 (1995).
  - L. Pachter, D. Speyer, Reconstructing Trees from Subtree Weights, Applied Mathematics Letters 17, pp. 615-621 (2004).
  - D. Levy, R. Yoshida, L. Pachter, Beyond Pairwise Distances: Neighbor-Joining with Phylogenetic Diversity Estimates, Mol. Biol. Evol. 23(3), pp. 491-498 (2006).

Summary of Results

- Results for Maximal Difference ($\ell_\infty$)
  - Decision problem is NP-hard
    - Is there a tree T s.t. $\|t, t_T\|_\infty \le \Delta$?
  - Hardness-of-approximation of the optimization problem
    - Finding a tree T s.t. $\|t, t_T\|_\infty < 1.4\,\|t, t_{OPT}\|_\infty$
  - A 15-approximation algorithm
    - Using the 6-approximation algorithm for 2-dissimilarities from ABFPT99
- Result for Maximal Distortion
  - Hardness-of-approximation within any constant factor

NP Hardness of the Decision Problem

We use a reduction from 3SAT (the problem of determining whether a 3CNF formula is satisfiable).

We show: if one can determine for $(t, \Delta)$ whether there exists a tree T s.t. $\|t, t_T\|_\infty \le \Delta$, then one can determine for every 3CNF formula $\varphi$ whether it is satisfiable.

The Reduction

Given a 3CNF formula $\varphi$, we define triplet distances $t$ and an error bound $\Delta$ which force the output tree to imply a satisfying assignment to $\varphi$.

- The set of taxa:
  - Taxa T, F.
  - A taxon for every literal ($x_i$, $\bar{x}_i$).
  - 3 taxa for every clause $C_j$ ($y_{j1}$, $y_{j2}$, $y_{j3}$).
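The taxon set of the reduction can be enumerated mechanically. The sketch below is only an illustration: the function name `reduction_taxa`, the string labels, and the encoding of clauses as triples of signed integers (positive for $x_i$, negative for its negation) are assumptions, not part of the slides.

```python
def reduction_taxa(num_vars, clauses):
    """List the taxa: T, F, two per variable (one per literal), three per clause."""
    taxa = ["T", "F"]
    for i in range(1, num_vars + 1):
        taxa += [f"x{i}", f"~x{i}"]  # literal x_i and its negation
    for j in range(1, len(clauses) + 1):
        taxa += [f"y{j}_{a}" for a in (1, 2, 3)]  # y_j1, y_j2, y_j3
    return taxa


# Hypothetical formula: (x1 or not x2 or x3) and (not x1 or x2 or x3).
clauses = [(1, -2, 3), (-1, 2, 3)]
taxa = reduction_taxa(3, clauses)
print(len(taxa), taxa)
```

For n variables and m clauses this gives 2 + 2n + 3m taxa, so the instance size is linear in the size of the formula.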

Properties Enforced by the Input $(t, \Delta)$

- One of the following can be enforced on each taxa triplet (u, v, w):
  - taxon u is close to Path(v, w), or
  - taxon u is far from Path(v, w)

[Figure: taxon u relative to Path(v, w)]

Enforcing Truth Assignment

- A truth assignment to $\varphi$ is implied by the following:
  - T is far from F
  - For each i, $x_i$ is far from $\bar{x}_i$, and both $x_i$ and $\bar{x}_i$ are close to Path(T, F)

Thus we set $x_i = T$ iff $x_i$ is close to T.

Enforcing Clauses-Satisfaction

A clause $C = (l_1 \lor l_2 \lor l_3)$ is satisfied iff at least one literal $l_i$ is true, i.e. is close to T.

$(l_1 \lor l_2 \lor l_3)$ is satisfied iff it is not like this:

[Figure: none of $l_1$, $l_2$, $l_3$ is close to T]

We need to guarantee that all clauses avoid the above configuration by means of the close/far relations.

Clauses-Satisfaction (cont)

- $(l_1 \lor l_2 \lor l_3)$ is satisfied iff out of the three paths Path($l_1$, $l_2$), Path($l_1$, $l_3$), Path($l_2$, $l_3$), at least two paths are close to T.

[Figure: a tree with T, F and the literals $l_1$, $l_2$, $l_3$]

Clauses-Satisfaction (cont)

We attach a taxon to each such path:

- $y_1$ is close to Path($l_2$, $l_3$)
- $y_2$ is close to Path($l_1$, $l_3$)
- $y_3$ is close to Path($l_1$, $l_2$)

$\Rightarrow$ $(l_1 \lor l_2 \lor l_3)$ is satisfied iff at least two $y_i$'s can be located close to T.

Clauses-Satisfaction (end)

At least two of the $y_i$'s can be located close to T iff Path($y_2$, $y_3$), Path($y_1$, $y_3$), Path($y_1$, $y_2$) are all close to T.

So, $(l_1 \lor l_2 \lor l_3)$ is satisfied iff all the above paths are close to T.
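The counting argument behind the clause gadget can be checked by brute force over the eight truth assignments. The sketch abstracts away the tree geometry and assumes only this: a taxon $y_a$ attached to Path($l_b$, $l_c$) can be located close to T exactly when at least one of $l_b$, $l_c$ is true. That Boolean abstraction and the function names are assumptions made for illustration.

```python
from itertools import product


def clause_satisfied(l1, l2, l3):
    return l1 or l2 or l3


def ys_that_can_be_true(l1, l2, l3):
    # y1 sits on Path(l2, l3), y2 on Path(l1, l3), y3 on Path(l1, l2).
    can_be_true = [l2 or l3, l1 or l3, l1 or l2]
    return sum(can_be_true)


def check_gadget():
    """Clause is satisfied iff at least two of y1, y2, y3 can be close to T."""
    return all(
        clause_satisfied(*lits) == (ys_that_can_be_true(*lits) >= 2)
        for lits in product([False, True], repeat=3)
    )


print(check_gadget())
```

A single true literal already lies on two of the three paths, which is why the threshold is two rather than one.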

Construction Example

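$\varphi$ is satisfiable $\Rightarrow$ there is a tree T which satisfies all bounds:

- A1: $t_T(T, F) \ge 2\alpha + 2\beta$
- A2 ($\forall i = 1..n$): $t_T(T; x_i \bar{x}_i) \le \alpha$, $t_T(F; x_i \bar{x}_i) \le \alpha$
- B1 ($\forall j = 1..m$): $t_T(y_{j1}; l_{j2} l_{j3}) \le \alpha$, $t_T(y_{j2}; l_{j1} l_{j3}) \le \alpha$, $t_T(y_{j3}; l_{j1} l_{j2}) \le \alpha$
- B2 ($\forall j = 1..m$): $t_T(y_{j1}; T F) \ge \alpha$, $t_T(y_{j2}; T F) \ge \alpha$, $t_T(y_{j3}; T F) \ge \alpha$
- B3 ($\forall j = 1..m$): $t_T(T; y_{j2} y_{j3}) \le \alpha$, $t_T(T; y_{j1} y_{j3}) \le \alpha$, $t_T(T; y_{j1} y_{j2}) \le \alpha$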

Hardness of Approximation Results

By stretching the close/far restrictions, the following problems are also shown NP-hard:

- Approximating Maximal Difference
  - Finding a tree T s.t. $\|t, t_T\|_\infty < 1.4\,\|t, t_{OPT}\|_\infty$
- Approximating Maximal Distortion
  - Finding a tree T s.t. $\mathrm{MaxDist}(t, t_T) \le C \cdot \mathrm{MaxDist}(t, t_{OPT})$ for any constant C

Details in: I. Gronau and S. Moran, On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities, Theoretical Computer Science 389(1-2), December 2007, pp. 44-55.

Open Problems/Further Research

- Extending hardness results to 3-diss tables induced by 2-diss matrices
  - ($t(i;jk) = \tfrac{1}{2}\,[D(i,j) + D(i,k) - D(j,k)]$)
- Extending hardness results to naturally looking trees
  - (binary trees with constant-bounded edge weights)
- Check the performance of NJ when the neighbor-selection formula is computed from real 3-distances.
- Devise algorithms which use 3-distances as input.
- Does optimization of 3-diss lead to good topological accuracy (under accepted models of sequence evolution)?
  - (it is known that optimization of 2-diss doesn't lead to good topological accuracy)

Thank You

Distance-Based Phylogenetic Reconstruction

- Compute distances between all taxon-pairs
- Find an edge-weighted tree that best describes the distances

Optimization Criteria

- Known measures of closeness:
  - $\ell_\infty$ (maximal difference)
  - $\ell_p$
  - MaxDist (maximal distortion), where $0/0 = 1$

The Reduction

[Diagram: a 3CNF formula $\varphi$ is mapped to an input $(t, \Delta)$]

$\varphi$ is satisfiable $\Leftrightarrow$ there is a tree T s.t. $\|t, t_T\|_\infty \le \Delta$

The Reduction

Define a set of lower and upper bounds (A1, A2, B1, B2, B3) on entries of $t_T$.

The Reduction

[Diagram: a 3CNF formula $\varphi$ is mapped to a pair of bound tables $(t_l, t_u)$ whose gap is $2\Delta$]

$\varphi$ is satisfiable $\Leftrightarrow$ there is a tree T s.t. $t_l \le t_T \le t_u$

The Reduction

- Define the set of taxa.
- Define a set of lower and upper bounds on some entries of $t_T$.
- $\varphi$ is satisfiable $\Leftrightarrow$ there is a tree T which satisfies all bounds.
- Define $\Delta$ according to the slackness required for the proof of $\Rightarrow$.

The Analysis

- A1: $t_T(T, F) \ge 2\alpha + 2\beta$
- A2 ($\forall i = 1..n$): $t_T(T; x_i \bar{x}_i) \le \alpha$, $t_T(F; x_i \bar{x}_i) \le \alpha$

Trees satisfying A1 and A2 imply a truth-assignment to $x_1, \ldots, x_n$.

The Analysis

- B1 ($\forall j = 1..m$): $t_T(y_{ja}; l_{jb} l_{jc}) \le \alpha$ for $\{a,b,c\} = \{1,2,3\}$
- B2 ($\forall j = 1..m$): $t_T(y_{ja}; T F) \ge \alpha$ for $a = 1,2,3$
- B3 ($\forall j = 1..m$): $t_T(T; y_{jb} y_{jc}) \le \alpha$ for $1 \le b < c \le 3$

There is a tree T which satisfies all bounds $\Rightarrow$ $\varphi$ is satisfiable

- B1 and B2 imply that $y_{ja} \Rightarrow l_{jb} \lor l_{jc}$ for $\{a,b,c\} = \{1,2,3\}$.
- B3 implies that at least two of $y_{j1}$, $y_{j2}$, $y_{j3}$ are satisfied.

The Reduction: $t(\varphi)$

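- In our constructed tree:
  - All 2-distances are in $[2\alpha, 2\alpha + 2\beta]$.
  - All 3-distances are in $[\alpha, \alpha + 2\beta]$.
  - $\Rightarrow \Delta = \beta$.

- A1: $t(T, F) = 2\alpha + 3\beta$
- A2 ($\forall i = 1..n$): $t(T; x_i \bar{x}_i) = \alpha - \beta$, $t(F; x_i \bar{x}_i) = \alpha - \beta$
- B1 ($\forall j = 1..m$): $t(y_{j1}; l_{j2} l_{j3}) = \alpha - \beta$, $t(y_{j2}; l_{j1} l_{j3}) = \alpha - \beta$, $t(y_{j3}; l_{j1} l_{j2}) = \alpha - \beta$
- B2 ($\forall j = 1..m$): $t(y_{j1}; T F) = \alpha + \beta$, $t(y_{j2}; T F) = \alpha + \beta$, $t(y_{j3}; T F) = \alpha + \beta$
- B3 ($\forall j = 1..m$): $t(T; y_{j2} y_{j3}) = \alpha - \beta$, $t(T; y_{j1} y_{j3}) = \alpha - \beta$, $t(T; y_{j1} y_{j2}) = \alpha - \beta$
- Other 2-distances: $t(s, t) = 2\alpha + 2\beta$
- Other 3-distances: $t(s; tu) = \alpha + 2\beta$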