Graphs and Graph Theory in Computational Biology

- Dan Gusfield
- Miami University, May 15, 2008
- (four hour tutorial)

Some examples of graphs in biology

- Taken from the web; see the citations for details. There are many other examples in biology of graphs more complex than trees.

From Max Delbrueck Center, Berlin

Yeast protein interactions

From http://www-personal.umich.edu/~mejn/networks/

Protein-Protein Interactions

Protein-Protein Interaction Modelling: Dr. Peter Uetz, Institut für Toxikologie und Genetik, Forschungszentrum Karlsruhe

NY Times, May 5, 2008: The Diseasome

http://www.nytimes.com/interactive/2008/05/05/science/20080506_DISEASE.html

Graphs and Graph Theory

- 1. Numerous uses of graphs and networks to represent biological phenomena at many conceptual levels. Maybe several thousand papers use graph representations, particularly trees, but little graph theory.
- 2. A respectable number of papers that develop new non-trivial graph theory for problems in biology. Hundreds of papers, maybe 1000.
- 3. A handful of papers exploiting or extending non-trivial classic graph theory for problems in biology. Perhaps a few hundred.

Introduction and Conclusion

- Very diverse biological applications and very diverse graph theory. So there is no single grand reason for graphs and no single graph topic in biology.
- Lots of opportunity for graph theorists and graph algorithmists to develop or apply graph theory to biological problems. Even more opportunity for combinatorial optimization.

What I will do in this tutorial

- Emphasis on points 2 and 3, i.e., examples of the development of new non-trivial graph theory, and of the exploitation of classic graph theory. And (my apologies) I will mostly emphasize topics I have been involved with.
- Still, there are some hot biological areas today where graphs arise, and some graph topics that recur commonly, and I should point those out even if I will not talk in detail on those topics.

The digression

- Hot biology: Network biology, i.e., biological phenomena that are represented by networks; gene regulatory networks and protein interaction networks, just to name two. These form the core of Systems biology. Other relationships in biology are also represented by graphs and networks. Ex. the diseasome.
- Recurring graph problems: graph problems in clustering data (ex. finding cliques or variants of cliques); variants of graph isomorphism in network motif or molecular pathway problems; need for more random graph theory for significance testing.

Clique Problems

- Clique problems are recurrent in clustering applications, but true cliques are computationally hard to find. Suggested research for graph theorists and algorithmists: computationally tractable, biologically meaningful alternatives to cliques. As examples: maximum density subgraphs; extreme sets in a graph.

Subgraph density

- Given a graph G, and a subset S of its nodes, let G(S) be the subgraph of G induced by S, i.e., G(S) has node set S and edge set E(S) consisting of all edges in G both of whose ends are in S.
- A Maximum Density subgraph of G is induced by the set of nodes S which maximizes |E(S)|/|S|.
- The maximum density subgraph can be found in polynomial time. It has the flavor of a maximum clique, but has different properties.
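
The density definition can be made concrete with a small sketch. It only illustrates the quantity |E(S)|/|S| by exhaustive search over node subsets, which takes exponential time; it is not the polynomial-time method the slide refers to. The function names and the example graph are assumptions made for illustration.

```python
from itertools import combinations

def density(edges, nodes):
    """Return |E(S)| / |S| for the subgraph induced by the node set S."""
    s = set(nodes)
    induced = sum(1 for u, v in edges if u in s and v in s)
    return induced / len(s)

def max_density_subgraph(nodes, edges):
    """Exhaustively try every non-empty node subset (exponential time)."""
    best_set, best_density = None, -1.0
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            d = density(edges, subset)
            if d > best_density:
                best_set, best_density = set(subset), d
    return best_set, best_density

# Hypothetical example graph: a 4-clique {1, 2, 3, 4} with a tail 4-5-6.
nodes = [1, 2, 3, 4, 5, 6]
edges = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4), (4, 5), (5, 6)]
best_set, best_density = max_density_subgraph(nodes, edges)
print(best_set, best_density)
```

In this example the densest node set is the 4-clique, with 6 edges on 4 nodes. In general the densest subgraph need not be a clique, which is one way its properties differ.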

Extreme Sets

- In an edge-weighted undirected graph G with node set V, a subset S of nodes of G is called an extreme set if for every non-empty proper subset S' of S, the total weight of the edges crossing from S' to V-S' is larger than the total weight of the edges crossing from S to V-S.
- All the extreme sets in a graph can be found in polynomial time.
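
As a minimal sketch of the definition, the check below tests one candidate set by comparing its cut weight against the cut weight of every non-empty proper subset. It is a brute-force illustration only (exponential in the size of the set), not the polynomial-time algorithm mentioned above; the function names and the weighted example graph are assumptions.

```python
from itertools import combinations

def cut_weight(weights, subset):
    """Total weight of edges with exactly one end inside the subset."""
    s = set(subset)
    return sum(w for (u, v), w in weights.items() if (u in s) != (v in s))

def is_extreme(weights, subset):
    """True if every non-empty proper subset has a strictly larger cut."""
    target = cut_weight(weights, subset)
    members = list(subset)
    for size in range(1, len(members)):
        for part in combinations(members, size):
            if cut_weight(weights, part) <= target:
                return False
    return True

# Hypothetical example: a heavy triangle a-b-c attached to d by a light edge.
weights = {("a", "b"): 5, ("a", "c"): 5, ("b", "c"): 5, ("c", "d"): 1}
print(is_extreme(weights, {"a", "b", "c"}))  # tightly knit set
print(is_extreme(weights, {"c", "d"}))       # d alone has a smaller cut
```

The triangle is cut off from the rest of the graph by a single light edge, while every part of it is held in place by heavy edges, which is the clustering intuition behind extreme sets.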

Also

- There is also a great need for more sophisticated application of random graph theory in the study of biological networks. This is needed in order to establish null models to use in assessing the statistical significance of subgraphs, paths, patterns and motifs that are found in biological networks.
- We need to be able to distinguish observed patterns and subgraphs from those that occur with a high probability in a random graph, under a biologically appropriate model of randomness (an open field).

End of digression

- Start of the main tutorial: Examples of Graph Theory in Bioinformatics and Computational Biology

Outline

- Three smaller examples: Euler paths and sequencing; Tanglegrams and co-evolution; Network Design and Multiple Alignment.
- Haplotyping by Perfect Phylogeny: Graph Realization.
- Phylogenetic Networks: Incompatibility Graph; Galled-Trees; Recombination Networks; The Decomposition Theorem and sufficient conditions.
- Multi-state Perfect Phylogeny and Chordal Graphs.

To start: Three small examples

- Euler paths in sequencing and sequence assembly.
- Tanglegrams and planarity testing in the study of co-evolution.
- Application of Tree-Design approximations in multiple sequence alignment. Interplay between trees and strings.

Topic I: Eulerian paths in sequencing problems

- The general situation is that we have a molecule S (DNA, say) whose sequence is unknown, but we know all the k-mers that occur in S, for some fixed k. Given those k-mers, we want to determine S, if possible, or determine whatever is possible to determine about S. Note that k is not related to the alphabet size.
- A very useful approach to problems of this type is to build an Eulerian digraph, based on the (k-1)-mers.

Euler graph for general k

- For general k, there is one node for each (k-1)-mer contained in an observed k-mer. Then there is a directed edge from the node for (k-1)-mer A to the node for (k-1)-mer B if the (k-2)-suffix of A matches the (k-2)-prefix of B, so that A and B can be overlapped to form the observed k-mer.
- Example: k = 5 and we observe the 5-mer XXYZW. Then there will be a node for XXYZ and a node for XYZW, and a directed edge from the first node to the second node. Those two nodes and the directed edge between them represent the 5-mer XXYZW. In some applications, there will be one such edge for each observation of that 5-mer.

Ex. k = 3. The graph will have one node for each of the 2-mers in the observed 3-mers. Then there is a directed edge from the node for the 2-mer XY to the node for the 2-mer YZ, for any X, Z such that XYZ is an observed 3-mer.

The Euler graph derived from the sequence ACACGCAACTTAAA. If a triple is observed more than once, there should be one directed edge for each observation of the triple.

The point: Every Eulerian path in the graph specifies a sequence whose k-mers match the given data, and conversely every sequence whose k-mers match the data specifies an Eulerian path in the graph. So the set of Eulerian paths specifies the set of candidate sequences for the unknown original sequence.

Algorithms exist for efficiently finding Eulerian paths, for counting their number, for determining uniqueness etc., so we can use this representation to study the set of candidate sequences. Compare this approach to earlier efforts to represent the set of candidates by a graph with a Hamilton path, where each node represents an observed k-mer, not a (k-1)-mer.
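
The construction can be sketched in a few lines, under the assumption that every k-mer occurrence is observed once per occurrence. The code builds the graph on (k-1)-mers from the 3-mers of the slide's example sequence and finds one Eulerian path with Hierholzer's algorithm, a standard method chosen here for illustration; the function names are assumptions.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """All length-k substrings of seq, with repeats, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def euler_graph(kmer_list):
    """One node per (k-1)-mer, one directed edge per observed k-mer."""
    graph = defaultdict(list)
    for kmer in kmer_list:
        graph[kmer[:-1]].append(kmer[1:])
    return graph

def eulerian_path(graph):
    """Hierholzer's algorithm; raises if no path uses every edge once."""
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = Counter(v for vs in graph.values() for v in vs)
    # Start where out-degree exceeds in-degree by one, else anywhere.
    start = next((u for u in out_deg if out_deg[u] - in_deg.get(u, 0) == 1),
                 next(iter(graph)))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != sum(out_deg.values()) + 1:
        raise ValueError("no Eulerian path uses every edge")
    return path

def spell(path):
    """Sequence spelled by a path of overlapping (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])

original = "ACACGCAACTTAAA"
observed = kmers(original, 3)
candidate = spell(eulerian_path(euler_graph(observed)))
print(candidate, Counter(kmers(candidate, 3)) == Counter(observed))
```

The candidate has the same 3-mers as the original but need not be the original sequence: this graph has more than one Eulerian path, which is exactly why the set of all Eulerian paths is the object of study.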

Making finer distinctions in Euler paths

In general there may be many Eulerian paths in the graph, and we want some additional criteria to distinguish the goodness of one Eulerian path compared to another. Different biological considerations translate into having a value for each subpath of length two. Then the value of an Eulerian path P with n edges is the sum of the values of the n-1 length-two subpaths in P. The problem is to find an Eulerian path with maximum value. We have some reasonable approximations for that, but a simpler case can be solved optimally in polynomial time.

The case of a binary alphabet, but arbitrary k

- Since the alphabet size is two, each node in the graph has at most two incoming edges and two outgoing edges. Assume exactly two each.
- At any node, there are two possible ways for an Euler path to pass through the node: turning or crossing.

(Figure: Ex. k = 4, with 3-mer nodes such as 001, 011, 101 and 110; a node with two incoming and two outgoing edges is shown once with the turning pairing and once with the crossing pairing.)

So in terms of subpaths of length two, we have two choices at each node.

Restating the optimal Euler path problem

- We are given an Eulerian graph where the in and out degrees are at most two at each node, and at each node there is a given value for the turning pair, and a value for the crossing pair. Then choose the turning or the crossing pairs at the nodes to maximize the total value of the choices, subject to the requirement that the choices create an Euler path in the graph.

Main Result

- The problem can be solved in polynomial time.
- The set of choices that give Euler paths has a matroidal structure, which allows a matroid-greedy algorithm to find the optimal Euler path.
- A more direct algorithm based on Minimum Spanning Trees also solves the problem.

The Matroid Structure

- At every node v, the edge pair (crossing or turning) which has the lowest value is called the low pair, and the other pair is the high pair. The difference in values is called the loss at v.
- A subset S of nodes is called independent if there is an Euler path in the graph where at every node in S, the low pair is chosen.
- As defined, the family of independent sets forms a matroid, and so we can find, by a greedy algorithm, an independent set which minimizes the loss, and this gives the optimal Euler path.

Topic II: Tanglegrams

- A Tanglegram is a pair of trees drawn in the plane with no crossing edges, with the same labeled leaf set. The leaves of one tree are displayed on a line, and the leaves of the other tree are displayed on a parallel line.
- A straight line connects each leaf in one tree to the leaf with the same label in the other tree.
- The number of crossing lines is a measure of the similarity of the trees.
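
For one fixed drawing, the number of crossing lines depends only on the two left-to-right leaf orders, so it can be counted as the number of leaf pairs whose relative order differs. The sketch below does this with a simple quadratic loop; the function name and the leaf orders are assumptions, and it does not search over the alternative drawings of the two trees.

```python
def crossings(top_order, bottom_order):
    """Count pairs of connecting lines that cross in a fixed layout.

    Two lines cross exactly when their leaves appear in opposite
    relative order on the two parallel lines.
    """
    position = {leaf: i for i, leaf in enumerate(bottom_order)}
    mapped = [position[leaf] for leaf in top_order]
    return sum(
        1
        for i in range(len(mapped))
        for j in range(i + 1, len(mapped))
        if mapped[i] > mapped[j]
    )

# Hypothetical leaf orders for two trees on the leaf set {a, b, c, d}.
print(crossings("abcd", "abcd"))
print(crossings("abcd", "badc"))
print(crossings("abcd", "dcba"))
```

Identical orders give no crossings and a fully reversed order gives the maximum, one crossing for every pair of leaves.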


Topic III: Multiple Sequence Alignment

- Interplay between sequences and trees.
- Exploitation of network design approximation.


Intro to Hours 2 and 3: Two Post-HGP Topics

- Two topics in Population Genomics:
- SNP Haplotyping in populations
- Reconstructing a history of recombination
- These topics in Population Genomics illustrate current challenges in biology, and illustrate the use of graph theory, combinatorial algorithms and discrete mathematics in biology.

What is population genomics?

- The Human genome sequence is done.
- Now we want to sequence many individuals in a population to correlate similarities and differences in their sequences with genetic traits (e.g. disease or disease susceptibility).
- Presently, we can't sequence large numbers of individuals, but we can sample the sequences at SNP sites.

SNP Data

- A SNP is a Single Nucleotide Polymorphism: a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).
- SNP maps have been compiled with a density of about 1 site per 1000 nucleotides.
- SNP data is what is mostly collected in populations. It is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.

Haplotype Map Project: HAPMAP

- NIH-led project ($100M) to find common SNP haplotypes (SNP sequences) in the Human population.
- Association mapping: HAPMAP is used to try to associate genetically influenced diseases with specific SNP haplotypes, to either find causal haplotypes, or to find the region near causal mutations.
- The key to the logic of Association mapping is historical recombination in populations. Nature has done the experiments, now we try to make sense of the results.

Topic IV: Perfect Phylogeny Haplotyping via Graph Realization

Genotypes and Haplotypes

- Each individual has two copies of each chromosome.
- At each site, each chromosome has one of two alleles (states) denoted by 0 and 1 (motivated by SNPs).

Two haplotypes per individual:

0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0

Merge the haplotypes: a site where the two haplotypes agree keeps that state, and a site where they differ is recorded as 2.

Genotype for the individual:

2 1 2 1 0 0 1 2 0
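
The merge of two haplotypes into a genotype can be written down directly. This is a minimal sketch using the 0/1/2 encoding of the slide; the function names `merge` and `explains` are assumptions, not names from the tutorial.

```python
def merge(hap1, hap2):
    """Merge two 0/1 haplotypes into a genotype.

    A site where the haplotypes agree keeps that state; a site where
    they differ (heterozygous) is recorded as 2.
    """
    return [a if a == b else 2 for a, b in zip(hap1, hap2)]

def explains(hap1, hap2, genotype):
    """True if the merge of the haplotype pair creates the genotype."""
    return merge(hap1, hap2) == list(genotype)

# The two haplotypes and the genotype shown on the slide.
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(merge(h1, h2))
print(explains([0, 1], [1, 0], [2, 2]), explains([0, 0], [1, 1], [2, 2]))
```

The last line shows why haplotyping is ambiguous: the genotype 2 2 is explained both by the pair 0 1, 1 0 and by the pair 0 0, 1 1.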

Haplotyping Problem

- Biological Problem: For disease association studies, haplotype data is more valuable than genotype data, but haplotype data is hard to collect. Genotype data is easy to collect.
- Computational Problem: Given a set of n genotypes, determine the original set of n haplotype pairs that generated the n genotypes. This is hopeless without a genetic model.

The Perfect Phylogeny Model for SNP sequences

Only one mutation per site allowed.

(Figure: a rooted tree over sites 1 to 5 with the ancestral sequence 00000 at the root, site mutations 1, 2, 3, 4, 5 labeling edges, and the extant sequences at the leaves.)

The tree derives the set M = {10100, 10000, 01011, 01010, 00010}.

When can a set of sequences be derived on a perfect phylogeny?

- Classic NASC (necessary and sufficient condition): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs: 0,0 and 0,1 and 1,0 and 1,1.

This is the 4-Gamete Test
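
The test translates into a short pairwise check. The sketch below examines every pair of columns, which is simpler but slower than the O(nm) method cited later in the tutorial; the function name is an assumption, and sequences are given as strings of 0s and 1s with sites numbered from 1.

```python
from itertools import combinations

def incompatible_pairs(rows):
    """Return the site pairs (1-based) that contain all four gametes."""
    num_sites = len(rows[0])
    failing = []
    for i, j in combinations(range(num_sites), 2):
        gametes = {(row[i], row[j]) for row in rows}
        if len(gametes) == 4:  # 00, 01, 10 and 11 all appear
            failing.append((i + 1, j + 1))
    return failing

# The set M derived by the perfect phylogeny shown earlier.
M = ["10100", "10000", "01011", "01010", "00010"]
print(incompatible_pairs(M))              # no pair fails
print(incompatible_pairs(M + ["10101"]))  # adding 10101 creates failures
```

For M itself no pair fails, consistent with M having been derived on a tree. Once 10101 is added, sites 4 and 5 are among the pairs that fail.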

So, in the case of binary characters, if each pair of columns allows a tree, then the entire set of columns allows a tree. For M of dimension n by m, the existence of a perfect phylogeny for M can be tested in O(nm) time and a tree built in that time, if there is one (Gusfield, Networks 91).

We will use the classic theorem in two more modern and more genetic applications.

The Perfect Phylogeny Model

- We assume that the evolution of extant haplotypes can be displayed on a rooted, directed tree, with the all-0 haplotype at the root, where each site changes from 0 to 1 on exactly one edge, and each extant haplotype is created by accumulating the changes on a path from the root to a leaf, where that haplotype is displayed.
- In other words, the extant haplotypes evolved along a perfect phylogeny with all-0 root.
- Justification: Haplotype Blocks; rare recombination; a base problem whose solution is to be modified to incorporate more biological complexity.

Perfect Phylogeny Haplotype (PPH)

Given a set of genotypes S, find an explaining set of haplotypes that fits a perfect phylogeny.

A haplotype pair explains a genotype if the merge of the haplotypes creates the genotype. Example: The merge of 0 1 and 1 0 explains 2 2.

(Figure: the genotype matrix S, with one row per genotype and one column per site.)

The PPH Problem

(Figure: example genotypes a, b and c over two sites, with an explaining set of haplotypes drawn from 00, 01 and 10 displayed on a tree.)

The Alternative Explanation

No tree possible for this explanation

Efficient Solutions to the PPH problem: n genotypes, m sites

- Reduction to a graph realization problem (GPPH): build on the Bixby-Wagner or Fujishige solution to graph realization, O(nm alpha(nm)) time. Gusfield, Recomb 02
- Reduction to graph realization: build on Tutte's graph realization method, O(nm^2) time. Chung, Gusfield 03
- Direct, from scratch combinatorial approach: O(nm^2). Bafna, Gusfield et al JCB 03
- Berkeley (EHK) approach: specialize the Tutte solution to the PPH problem, O(nm^2) time.
- Linear-time solutions: Recomb 2005, Ding, Filkov, Gusfield, and a different linear time solution.

The Reduction Approach

- This is the original polynomial time method. Conceptually simplest at a high level (but not at the implementation level) and most extendable to other problems; nearly linear-time but not linear-time.

The case of the 1's

- For any row i in S, the set of 1 entries in row i specifies the exact set of mutations on the path from the root to the least common ancestor of the two leaves labeled i, in every perfect phylogeny for S.
- The order of those 1 entries on the path is also the same in every perfect phylogeny for S, and is easy to determine by leaf counting.

Leaf Counting

In any column c, count two for each 1, and count one for each 2. The total is the number of leaves below mutation c, in every perfect phylogeny for S. So if we know the set of mutations on a path from the root, we know their order as well.

(Figure: a genotype matrix S with its column counts.)

Count 5 4 2 2 1 1 1
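
The counting rule is a one-line computation per column. The genotype matrix on the slide is not recoverable from the transcript, so the matrix below is a small hypothetical one, built by merging haplotypes from a perfect phylogeny in which mutation 2 lies below mutation 1 and mutation 3 is on a separate branch; the function name is also an assumption.

```python
def leaf_counts(genotypes):
    """For each column, count two for each 1 and one for each 2."""
    return [
        sum(2 if x == 1 else 1 if x == 2 else 0 for x in column)
        for column in zip(*genotypes)
    ]

# Hypothetical genotype matrix S (rows are genotypes, columns are sites).
# Rows are the merges of haplotype pairs (100, 110), (000, 001), (110, 001).
S = [
    [1, 2, 0],
    [0, 0, 2],
    [2, 2, 2],
]
print(leaf_counts(S))
```

Each genotype contributes two leaves, so this tree has six leaves. Three of them carry mutation 1 and two of those also carry mutation 2, which matches the counts and places mutation 1 above mutation 2 on their shared path.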

Simple Conclusions

Subtree for row i data:

sites: 1 2 3 4 5 6 7
row i: 0 1 0 1 2 2 2

(Figure: the path from the root through mutations 2 and 4, followed by mutation 5.)

The order is known for the red mutations together with the leftmost blue(?) mutation.

But what to do with the remaining blue entries (2s) in a row?

More Simple Tools

- For any row i in S, and any column c, if S(i,c) is 2, then in every perfect phylogeny for S, the path between the two leaves labeled i must contain the edge with mutation c.
- Further, every mutation c on the path between the two i leaves must be from such a column c.

From Row Data to Tree Constraints

Subtree for row i data:

sites: 1 2 3 4 5 6 7
row i: 0 1 0 1 2 2 2

(Figure: the path from the root through mutations 2 and 4, then 5, with the two leaves labeled i below.)

Edges 5, 6 and 7 must be on the blue path, and 5 is already known to follow 4, but we don't know where to put 6 and 7.

The Graph Theoretic Problem

- Given a genotype matrix S with n sites, and a red-blue subgraph for each row i, create a directed tree T where each integer from 1 to n labels exactly one edge, so that each subgraph is contained in T.

(Figure: the red-blue subgraph for row i, ending at the two leaves labeled i.)

Powerful Tool: Tree and Graph Realization

- Let Rn be the integers 1 to n, and let P be an unordered subset of Rn. P is called a path set.
- A tree T with n edges, where each is labeled with a unique integer of Rn, realizes P if there is a contiguous path in T labeled with the integers of P and no others.
- Given a family P1, P2, P3, ..., Pk of path sets, tree T realizes the family if it realizes each Pi.
- The graph realization problem generalizes the consecutive ones problem, where T is a path.
- More generally, each set specifies a fundamental cycle in the unknown graph.

Tree Realization Example

P1 = {1, 5, 8}
P2 = {2, 4}
P3 = {1, 2, 5, 6}
P4 = {3, 6, 8}
P5 = {1, 5, 6, 7}

(Figure: a realizing tree T with edges labeled 1 to 8.)

More generally, think of each path set as specifying a fundamental cycle containing the edges in the specified path.
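
Checking whether a given tree realizes a path set is easy, even though finding such a tree is the hard part. In a tree, a set of edges forms one contiguous path exactly when it touches one more node than it has edges and no node is touched more than twice. The sketch below applies that check to the five path sets above. The tree is one realization worked out by hand for this illustration, with made-up node names a to i; it is not necessarily the tree drawn on the slide.

```python
from collections import Counter

def realizes(tree, path_set):
    """True if the edges labeled by path_set form one contiguous path.

    `tree` maps each edge label to its two end nodes and must be a tree
    (connected and acyclic); the check relies on that assumption.
    """
    degree = Counter(node for label in path_set for node in tree[label])
    return len(degree) == len(path_set) + 1 and max(degree.values()) <= 2

# Hypothetical realizing tree: edge label -> (node, node).
tree = {
    1: ("b", "c"), 2: ("c", "g"), 3: ("d", "f"), 4: ("g", "i"),
    5: ("a", "b"), 6: ("a", "e"), 7: ("e", "h"), 8: ("a", "d"),
}
family = [{1, 5, 8}, {2, 4}, {1, 2, 5, 6}, {3, 6, 8}, {1, 5, 6, 7}]
print(all(realizes(tree, p) for p in family))
print(realizes(tree, {5, 6, 8}))  # three edges meeting at one node
```

The second call shows a set that is connected in the tree but branches at node a, so it is not a path.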

Graph Realization

- Polynomial time (almost linear-time) algorithms exist for the graph realization problem, given the family of fundamental cycles the unknown graph should contain (Whitney, Tutte, Cunningham, Edmonds, Bixby, Wagner, Gavril, Tamari, Fujishige, Lofgren; 1930s to 1980s).
- Most of the literature on this problem is in the context of determining if a binary matroid is graphic.
- The algorithms are not simple; none were implemented before 2002.

Reducing PPH to graph realization

- We solve any instance of the PPH problem by creating appropriate path sets, so that a solution to the resulting graph realization problem leads to a solution to the PPH problem instance.
- The key issue: How to encode the needed subgraph for each row, and glue them together at the root.

From Row Data to Tree Constraints

Subtree for row i data:

sites: 1 2 3 4 5 6 7
row i: 0 1 0 1 2 2 2

(Figure: the path from the root through mutations 2 and 4, then 5, with the two leaves labeled i below.)

Edges 5, 6 and 7 must be on the blue path, and 5 is already known to follow 4.

Encoding a Red-Blue directed path

P1 = {U, 2}
P2 = {U, 2, 4}
P3 = {2, 4}
P4 = {2, 4, 5}
P5 = {4, 5}

(Figure: the path U, 2, 4, 5 that these path sets force in T.)

U is a glue edge used to glue together the directed paths from the different rows.

Now add a path set for the blues in row i.

sites: 1 2 3 4 5 6 7
row i: 0 1 0 1 2 2 2

P = {5, 6, 7}

(Figure: the path from the root through 2, 4 and 5, with the two leaves labeled i.)

That's the Reduction

The resulting path-sets encode everything that is known about row i in the input. The family of path-sets is input to the graph-realization problem, and every solution to that graph-realization problem specifies a solution to the PPH problem, and conversely.

Whitney (1933?) characterized the set of all solutions to graph realization (based on the three-connected components of a graph) and Tarjan et al showed how to find these in linear time.

An implicit representation of all solutions

Whitney (1930) proved that a graph realization problem has a unique solution if and only if the graph is three-connected. That is, at least three nodes must be removed in order to disconnect the graph (assuming it is connected).

Whitney (1931) proved that if the solution is not unique, then there is a semi-unique decomposition of the graph into three-connected components, so that the graph realizations are in one-to-one correspondence with all the ways that these components can be twisted relative to each other. So the number of solutions is 2^(number of three-connected components - 1).

Tree Realization Example

P1 = {1, 5, 8}, P2 = {2, 4}, P3 = {1, 2, 5, 6}, P4 = {3, 6, 8}, P5 = {1, 5, 6, 7}

(Figure: the realizing tree T, with edges added to create a fundamental cycle for each path.)

Topic V: Phylogenetic Networks with Recombination

Incompatible Sites

- A pair of sites (columns) of M that fails the 4-gametes test is said to be incompatible.
- A site that is not in such a pair is compatible.

A richer model

M: 10100, 10000, 01011, 01010, 00010, and 10101 (added)

(Figure: the earlier perfect phylogeny over sites 1 to 5 with root 00000, which cannot derive the added sequence.)

Pair 4, 5 fails the four-gamete test. The sites 4, 5 are incompatible.

Real sequence histories often involve recombination.

Sequence Recombination

(Figure: single crossover recombination of P = 10100 and S = 01011 at point 5, producing 10101.)

A recombination of P and S at recombination point 5. The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).
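
Single-crossover recombination is a prefix-suffix splice, which the short sketch below spells out. The function name is an assumption; the recombination point is 1-based, as on the slide, and the parent roles P = 10100 and S = 01011 are the reading of the figure under which the splice yields 10101.

```python
def recombine(prefix_parent, suffix_parent, point):
    """Single-crossover recombination at a 1-based recombination point.

    Sites before `point` come from the prefix parent P; sites from
    `point` onward come from the suffix parent S.
    """
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]

# P contributes sites 1-4 and S contributes site 5.
print(recombine("10100", "01011", 5))
```

Swapping the roles of the two parents at the same point gives a different recombinant, so the P and S labels in a network carry real information.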

Network with Recombination: ARG

M: 10100, 10000, 01011, 01010, 00010, and 10101 (new)

(Figure: the previous tree extended with one recombination node, with parents P = 10100 and S = 01011 and recombination point 5, producing 10101.)

The previous tree with one recombination event now derives all the sequences.

A Min ARG for Kreitman's data

ARG created by SHRUB

An illustration of why we are interested in recombination: Association Mapping of Complex Diseases Using ARGs

Association Mapping

- A major strategy being practiced to find genes influencing disease from haplotypes of a subset of SNPs.
- Disease mutations are unobserved.
- A simple example to explain association mapping and why ARGs are useful, assuming the true ARG is known.

(Figure: haplotypes over the observed SNP sites, with the unobserved disease mutation site among them.)

Very Simplistic: Mapping the Unobserved Mutation of Mendelian Diseases with ARGs

Assumption (for now): A sequence is diseased iff it carries the single disease mutation.

(Figure: an ARG rooted at 00000 with mutations at sites 1 to 5 and recombination nodes whose parents are marked P and S. The leaves are a 00010, b 10010, c 00100, d 10100, e 01100, f 01101 and g 00101, and some leaves are marked Diseased.)

Where is the disease mutation?

Mapping Disease Gene with Inferred ARGs

- "...the best information that we could possibly get about association is to know the full coalescent genealogy" (Zollner and Pritchard, 2005)
- But we do not know the true ARG!
- Goal: infer ARGs from SNP data for association mapping
- Not easy, and often done by approximation (e.g. Zollner and Pritchard)
- Improved results to do the inference: Y. Wu (RECOMB 2007)

Results on Reconstructing the Evolution of SNP Sequences

- Part I: Clean mathematical and algorithmic results: Galled-Trees, near-uniqueness, graph-theory lower bound, and the Decomposition theorem
- Part II: Practical computation of Lower and Upper bounds on the number of recombinations needed. Construction of (optimal) phylogenetic networks; uniform sampling; haplotyping with ARGs; LD mapping
- Part III: Varied Biological Applications
- Part IV: Extension to Gene Conversion
- Part V: The Minimum Mosaic Model of Recombination

This talk will discuss topics in Parts I

Problem: If not a tree, then what?

- If the set of sequences M cannot be derived on a perfect phylogeny (true tree), how much deviation from a tree is required?
- We want a network for M that uses a small number of recombinations, and we want the resulting network to be as tree-like as possible.

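A tree-like network for the same sequences generated by the prior network.

(Figure: a network with mutations at sites 1 to 5 and two recombination cycles, with recombination parents marked p and s. The leaves are a 00010, b 10010, c 00100, d 10100, e 01100, f 01101 and g 00101.)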

Recombination Cycles

- In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.
- The cycle specified by those two paths is called a recombination cycle.

Galled-Trees

- A phylogenetic network where no recombination cycles share an edge is called a galled tree.
- A cycle in a galled-tree is called a gall.
- Question: if M cannot be generated on a true tree, can it be generated on a galled-tree?

Results about galled-trees

- Theorem: There is an efficient (provably polynomial-time) algorithm to determine whether or not any sequence set M can be derived on a galled-tree.
- Theorem: A galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used over all possible phylogenetic-networks.
- Theorem: If M can be derived on a galled tree, then the Galled-Tree is nearly unique. This is important for biological conclusions derived from the galled-tree.

Papers from 2003-2007.

Elaboration on Near Uniqueness

Theorem: The number of arrangements (permutations) of the sites on any gall is at most three, and this happens only if the gall has two sites. If the gall has more than two sites, then the number of arrangements is at most two. If the gall has four or more sites, with at least two sites on each side of the recombination point (not the side of the gall), then the arrangement is forced and unique.

Theorem: All other features of the galled-trees for M are invariant.

A whiff of the ideas behind the results

Incompatibility Graph G(M)

M (rows are the sequences a to g, columns are the sites 1 to 5):

   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1

(Figure: the incompatibility graph G(M) on nodes 1, 2, 3, 4, 5.)

Two nodes are connected iff the pair of sites is incompatible, i.e., fails the 4-gamete test.

THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph.

The connected components of G(M) are very informative

- Theorem: The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network.
- Theorem: When M can be derived on a galled-tree, all the incompatible sites in a gall must come from a single connected component C, and that gall must contain all the sites from C. Compatible sites need not be inside any blob.
- In a galled-tree the number of recombinations is exactly the number of non-trivial connected components in G(M), and hence is minimum over all possible phylogenetic networks for M.
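
The lower bound can be computed directly from the sequences. The sketch below builds the incompatibility graph with the pairwise 4-gamete test and groups sites into connected components with a small union-find; the function name is an assumption, and the input is the seven sequences a to g used in the tutorial's running example.

```python
from itertools import combinations

def incompatibility_components(rows):
    """Return (edges, non-trivial components) of the incompatibility graph.

    Sites are numbered from 1. An edge joins two sites that fail the
    4-gamete test; a component is non-trivial if it has more than one site.
    """
    num_sites = len(rows[0])
    parent = list(range(num_sites))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = []
    for i, j in combinations(range(num_sites), 2):
        if len({(row[i], row[j]) for row in rows}) == 4:
            edges.append((i + 1, j + 1))
            parent[find(i)] = find(j)

    groups = {}
    for site in range(num_sites):
        groups.setdefault(find(site), set()).add(site + 1)
    nontrivial = [g for g in groups.values() if len(g) > 1]
    return edges, nontrivial

# Sequences a to g from the running example.
M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
edges, components = incompatibility_components(M)
print(edges, components, len(components))
```

For this M the graph has two non-trivial components, so any network deriving M needs at least two recombinations.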

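Incompatibility Graph

(Figure: the incompatibility graph on sites 1 to 5, shown next to a galled-tree for M with recombination parents marked p and s. The leaves are a 00010, b 10010, c 00100, d 10100, e 01100, f 01101 and g 00101.)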

A Graph Theoretic Necessary Condition for a Galled-Tree

- If M can be generated on a galled-tree, then the incompatibility graph must be a bipartite bi-convex graph. Other structural properties of the conflict graph can be deduced and exploited.

Galled-Tree Haplotyping

- Problem: Given genotype matrix G, if there is no PPH solution for G, is there a haplotyping H for G such that H can be derived on a Galled-Tree?

A different Necessary Condition for a one-gall tree

- 1. There exists a set of sequences S such that for every pair of incompatible sites p,q, a single p,q state-pair appears in all sequences in S, and does not appear in any sequence outside S.
- 2. There must be a number x such that p < x < q, for each incompatible pair p,q.

Example

H (rows a to g, six sites):

a  0 0 0 1 0 0
b  1 0 0 1 0 0
c  0 0 1 0 0 0
d  1 0 1 0 0 0
e  1 1 0 0 0 1
f  0 1 1 0 0 0
g  0 0 1 0 1 0

(Figure: a one-gall tree for H with mutations at sites 1 to 6, recombination parents marked p and s, and leaves a to g.)

S = {e, d}: the sequences below the recombination node.

Surprising Result - Yun Song

- The necessary condition is also sufficient.
- Yun S. Song in TCBB 2006

Coming full circle - back to genotypes

- When can a set of genotypes be explained by a set of haplotypes derived on a galled-tree, rather than on a perfect phylogeny?
- The Song NASC can be translated into an ILP, using the part of the MinIncompat ILP that identifies which site pairs are incompatible.
- For the one gall problem, the ILP formulation solves very efficiently (200 rows x 40 sites in seconds to minutes). So far, the 2-gall case does not solve well (ongoing work).
- (Dan Brown, Gusfield 2006).

Coming full circle - back to genotypes

- When can a set of genotypes be explained by a set of haplotypes derived on a galled-tree, rather than on a perfect phylogeny?
- Recently, we developed an Integer Linear Programming solution to this problem, and are now testing its practical efficiency.
- (Brown, Gusfield).

Change of Scope: Minimizing Recombinations in unconstrained networks

- Problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations used to generate M, allowing only one mutation per site. This has biological meaning in appropriate contexts.
- We can solve this problem in poly-time for the special case of Galled-Trees.
- The minimization problem is NP-hard in general.

Minimization is an NP-hard Problem

- What we have done:
- 1. Solve small data-sets optimally with exponential-time methods or with algorithms that work well in practice.
- 2. Efficiently compute lower and upper bounds on the number of needed recombinations.
- 3. Apply these methods to address specific biological and bio-tech questions.

The Decomposition Theorem

Since the minimization problem is NP-hard we want

to break up a problem into subproblems that can

be solved separately and combined.

The connected components of G(M) are very informative

- For example we have the Theorem: The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network.

Recombination Cycles

- In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.
- The cycle specified by those two paths is called a recombination cycle.

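A maximal set of intersecting cycles forms a Blob

(Figure: a network rooted at 00000 with mutations at sites 1 to 5 and recombination nodes whose parents are marked P and S; the intersecting recombination cycles form one blob.)

If directions on the edges are removed, a blob is a bi-connected component of the network.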

Blobbed Trees

- Contracting each blob in a network results in a directed, rooted tree, otherwise one of the blobs was not maximal. Simple, but key insight.
- So every phylogenetic network can be viewed as a directed tree of blobs: a blobbed-tree.
- The blobs are the non-tree-like parts of the network.

Every network is a tree of blobs.

A network where every blob is a single cycle is a Galled-Tree.

(Figure: a blobbed-tree, with an ugly tangled network inside the blob.)

A Simple Observation

- In any network N for M, all sites from the same connected component of G(M) must appear together in a single blob in N.

The Decomposition Theorem

- Theorem: For any set of sequences M, there is a phylogenetic network that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob.

- This fully-decomposed network is the finest decomposition possible.

Example Network for input M with one blob

[Figure: a network for M rooted at 00000, with sites 1-5 on its edges and leaves a 00010, b 10010, c 00100, d 10100, e 01100, f 01101 and g 00101; the recombination nodes all lie in a single blob.]

The fully-decomposed network for M

[Figure: the incompatibility graph G(M) on sites 1-5, shown beside the fully-decomposed network for M, with leaves a 00010, b 10010, c 00100, d 10100, e 01100, f 01101 and g 00101.]

Moreover, the backbone tree is invariant over

all the fully-decomposed networks for M, and can

be determined in polynomial-time. So, we can

find a network for M by solving the recombination

minimization problem for each connected component

of G(M) separately, and then connect those

subnetworks in an invariant way.

Algorithmically

- Finding the tree part of the blobbed-tree is easy.

- Determining the sequences labeling the exterior nodes on any blob is easy.

- Determining a good structure inside a blob B is the problem of generating the sequences of the exterior nodes of B.

- It is easy to test whether the exterior sequences on B can be generated with only a single recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob.

- That can be solved by successively removing each exterior sequence and testing if the remaining sequences can be generated on a perfect phylogeny of the correct form.

However

- While fully-decomposed networks always exist, they do not necessarily minimize the number of recombination nodes, over all possible networks.

- That is, sometimes it pays to put sites from different connected components together on the same blob.

Sufficient Conditions

But we can prove several useful sufficient conditions for when there is a fully-decomposed network that minimizes the number of recombinations, over all possible networks. The deepest result:

Theorem: Let N be a phylogenetic network for input M, let L be the set of sequences that label the nodes of N, and let G(L) be the incompatibility graph for L. If G(L) and G(M) have the same number of connected components, then there is a fully-decomposed network for M with the same number of recombinations as in N. (JCB, December 2007)

Corollary

A fully-decomposed network exists that minimizes

the number of recombinations, unless every

optimal network uses some recombination node(s)

labeled by sequence(s) not in M, and the addition

of those sequences to M creates an

incompatibility between sites in different

components of G(M).

[Figure: a network rooted at 000000 on sites 1-6. Sequences in M are in black. Sequence 100010 is not in M.]

G(M) has two components. Each requires two recombinations, but this combined network needs only three.

G(L) has one component. The addition of sequence 100010 reduces the number of components from 2 to 1.

A Practical Sufficient Condition

If M can be derived on a network N in which every

edge contains at most one site, and every node is

labeled with a sequence in M, then there is a

fully-decomposed network for M which minimizes

the number of recombinations over all possible

networks for M.

Another Practical Sufficient Condition

If M can be derived on a network N where the

number of recombinations equals

the (poly-computable) Haplotype Lower Bound,

then there is a fully-decomposed network for M

which minimizes the number of recombinations

over all possible networks.

Topic VI Perfect Phylogeny Extension to

non-binary characters

- We detail the case of three allowed states per

character.

What is a Perfect Phylogeny for non-binary

characters?

- Input consists of n sequences M with m sites (characters) each, where each site can take one of k states.

- In a Perfect Phylogeny T for M, each node of T is labeled with an m-length sequence where each site has a value from 1 to k.

- T has n leaves, one for each sequence in M, labeled by that sequence.

- For each character-state pair (C,s), the nodes of T that are labeled with state s for character C form a connected subtree of T. It follows that the subtrees for any C are node-disjoint.
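The connected-subtree condition in the last bullet can be checked directly on a labeled tree. The following is a minimal sketch under stated assumptions: the node names and the tree topology are hypothetical (the slide's actual tree is not recoverable from the text), and only the label values are borrowed from the example figure.

```python
from collections import defaultdict

def is_perfect_phylogeny(edges, labels):
    """Check that, for every character c and state s, the nodes labeled
    with state s for c induce a connected subtree of the tree."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    m = len(next(iter(labels.values())))
    for c in range(m):
        groups = defaultdict(set)          # state -> nodes carrying it
        for node, seq in labels.items():
            groups[seq[c]].add(node)
        for nodes in groups.values():
            start = next(iter(nodes))
            seen, stack = {start}, [start]
            while stack:                   # search restricted to this state's nodes
                v = stack.pop()
                for w in adj[v] & nodes:
                    if w not in seen:
                        seen.add(w)
                        stack.append(w)
            if seen != nodes:
                return False
    return True

# Hypothetical tree: two internal nodes u, w and five leaves.
tree_edges = [("u", "w"), ("r1", "u"), ("r2", "u"), ("r4", "u"),
              ("r3", "w"), ("r5", "w")]
labels = {
    "u": (3, 2, 3), "w": (1, 2, 3),
    "r1": (3, 2, 1), "r2": (2, 3, 2), "r3": (1, 1, 3),
    "r4": (3, 2, 3), "r5": (1, 2, 3),
}
ok = is_perfect_phylogeny(tree_edges, labels)

# Relabeling w breaks the condition: the two nodes with state 1 for the
# first character are then separated by a node with state 3.
bad = is_perfect_phylogeny(tree_edges, dict(labels, w=(3, 2, 3)))
print(ok, bad)
```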

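Example: A perfect phylogeny for input M

[Figure: input table M with characters A, B, C and rows 1-5 (n = 5, m = 3, k = 3), beside a perfect phylogeny whose nodes carry the labels (2,3,2), (3,2,1), (3,2,3), (1,2,3) and (1,1,3).]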

Example

[Figure: the same table M (n = 5, m = 3, k = 3) and perfect phylogeny, highlighting the tree for State 2 of Character B.]

Perfect Phylogeny Problem

Given M, is there a Perfect Phylogeny for M?

Chordal Graphs

Basic Definition: A graph G is called chordal if every cycle of length four or more contains a chord.

More useful result: A graph G is chordal if and only if every minimal vertex separator in G is a clique.

Chordal graphs have a large number of applications, more based on the separator result than on the basic definition. For example, a chordal graph on n nodes can have at most n maximal cliques and n-1 minimal vertex separators.
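A small chordality test can make the definition concrete. The sketch below does not use either characterization above directly. It relies instead on another standard fact: a graph is chordal if and only if its vertices can be removed one at a time, each being simplicial (its neighbors form a clique) at the moment of removal. The function name and the two test graphs are illustrative, and faster linear-time methods exist.

```python
from itertools import combinations

def is_chordal(adj):
    """Greedy simplicial elimination on an undirected graph given as
    {vertex: set of neighbors}. Returns True iff the graph is chordal."""
    g = {v: set(ns) for v, ns in adj.items()}   # work on a copy
    while g:
        simplicial = next(
            (v for v, ns in g.items()
             if all(b in g[a] for a, b in combinations(ns, 2))),
            None,
        )
        if simplicial is None:       # no simplicial vertex left: not chordal
            return False
        for n in g[simplicial]:
            g[n].discard(simplicial)
        del g[simplicial]
    return True

# A chordless 4-cycle 1-2-3-4-1 is not chordal.
four_cycle = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
# Adding the chord 1-3 makes it chordal.
with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}
print(is_chordal(four_cycle), is_chordal(with_chord))
```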

Another Classic Chordal Graph Theorem

A graph G is chordal if and only if it is the

intersection graph of a set S of subtrees of a

tree T. Each node of G is a member of S.

[Figure: a graph G on nodes a-g, beside a tree T whose nodes are labeled with the sets {b,c}, {b,c,d}, {c,d,e,g}, {a,e,g}, {a,e} and {e,f,g}.]

Relation to Perfect Phylogeny

In a perfect phylogeny T for a table E, for any character C and any state X of character C, the sub-forest of T induced by the nodes labeled (C,X) forms a single, connected subtree of T. So, there is a natural set of subtrees of T induced by E.

Chordal Completion Approach to Perfect Phylogeny

Graph G(E) has one node for each character-state pair in E, and an edge between two nodes if and only if there is a row in E with both those character-state pairs.

[Figure: Table E with characters A, B, C and rows 1-5, beside the graph G(E) whose nodes are the states 1, 2, 3 of each of A, B and C.]

Each row of table E induces a clique in G(E).
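Building G(E) from a table takes only a few lines. In the sketch below the five rows of E are illustrative values chosen for this example, not the table from the slide, and `build_state_graph` is a made-up name.

```python
from itertools import combinations

def build_state_graph(table, names):
    """Build G(E): one node per (character, state) pair that occurs in the
    table, and an edge between two pairs that occur together in some row."""
    nodes, edges = set(), set()
    for row in table:
        pairs = [(names[c], s) for c, s in enumerate(row)]
        nodes.update(pairs)
        # Each row contributes a clique on its character-state pairs.
        for p, q in combinations(pairs, 2):
            edges.add(frozenset((p, q)))
    return nodes, edges

# Hypothetical 3-state table with characters A, B, C.
E = [
    (3, 2, 1),
    (2, 3, 2),
    (1, 1, 3),
    (3, 2, 3),
    (1, 2, 3),
]
nodes, edges = build_state_graph(E, "ABC")
print(len(nodes), len(edges))
```

Because a row holds exactly one state per character, no edge ever joins two states of the same character.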

Classic Theorem

Note that if table E has K columns, then G(E) is

a K-partite graph.

Theorem (Buneman 196?)

There is a perfect phylogeny for table E if and only if edges can be added to graph G(E) to make it a chordal, K-partite graph. If there is such a chordal graph, denote it by G'(E).

Deeper Result: If G'(E) exists

- Let C(E) be the graph derived from graph G'(E) as follows: create a node in C(E) for each maximal clique in G'(E), and create an edge (u,v) in C(E) iff the cliques for u and v in G'(E) share a node. Weight edge (u,v) by the number of shared nodes. Note that C(E) can be created from G'(E) in polynomial time.

- Any Maximum Spanning Tree T in C(E) is a perfect phylogeny for E. Actually, T can be found more directly in linear time from G'(E).

Perfect Phylogeny Results

The perfect phylogeny problem was open for about

20 years, but solved by Dress, Steel, Warnow and

Kannan, Agarwala and Fernandez-Baca. For any

fixed bound on the number of states per

character, the Perfect Phylogeny Problem can be

solved in polynomial time. However, if the

number of states per character is not

bounded, then the problem is NP-Complete. Also,

for any fixed number of characters, the problem

can be solved in polynomial time.

Dress-Steel solution for 3-state Perfect

phylogeny given complete data (1991)

- Recode each site M(i) of M as three binary sites M(i,1), M(i,2), M(i,3), each indicating the taxa that have state 1, 2, or 3.

- Theorem (DS): There is a 3-state perfect phylogeny for M if and only if there is a binary-character perfect phylogeny for some subset of the recoded sites consisting of exactly two of the columns M(i,1), M(i,2), M(i,3), for each column i of M.
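The recoding and the theorem can be illustrated with a brute-force sketch that tries every way of dropping one recoded column per character. It assumes the standard fact that binary sites admit an (unrooted) perfect phylogeny exactly when every pair passes the four-gamete test. The search is exponential in the number of characters, so it is only an illustration, not a polynomial-time method. The tables and function names are hypothetical.

```python
from itertools import combinations, product

def binary_column(table, c, state):
    """Recoded site M(c, state): 1 for the taxa that have `state` at character c."""
    return tuple(1 if row[c] == state else 0 for row in table)

def compatible(x, y):
    """Two binary sites are compatible iff they show fewer than four gametes."""
    return len(set(zip(x, y))) < 4

def three_state_perfect_phylogeny(table, states=(1, 2, 3)):
    """Return a tuple naming, per character, the state whose recoded column is
    dropped so that the remaining columns are pairwise compatible; None if no
    such choice exists."""
    m = len(table[0])
    for drop in product(states, repeat=m):
        cols = [binary_column(table, c, s)
                for c in range(m) for s in states if s != drop[c]]
        if all(compatible(x, y) for x, y in combinations(cols, 2)):
            return drop
    return None

# Hypothetical 3-state table (5 taxa, 3 characters).
table = [
    (3, 2, 1),
    (2, 3, 2),
    (1, 1, 3),
    (3, 2, 3),
    (1, 2, 3),
]
choice = three_state_perfect_phylogeny(table)

# Two characters showing all four state combinations admit no perfect phylogeny.
conflict = [(1, 1), (1, 2), (2, 1), (2, 2)]
no_choice = three_state_perfect_phylogeny(conflict)
print(choice, no_choice)
```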

Example

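[Figure: table M with characters A, B, C and rows 1-5, beside its binary recoding with columns A,1 A,2 A,3 B,1 B,2 B,3 C,1 C,2 C,3; a compatible subset of the recoded columns is marked.]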

Solved in Poly-Time by 2-SAT

As stated, the problem still seems like it would

take exponential time to solve, but in fact it is

easy to code the problem as a 2-SAT problem (Y.

Wu) and hence is solvable in polynomial time.

The Dress-Steel paper gave an independent

poly-time solution.