Graphs and Graph Theory in Computational Biology

Transcript and Presenter's Notes

Title: Graphs and Graph Theory in Computational Biology


1
Graphs and Graph Theory in Computational Biology
  • Dan Gusfield
  • Miami University, May 15, 2008
  • (four hour tutorial)

2
Some examples of graphs in biology
  • Taken from the web - see the citations for
    details. Many other examples of graphs more
    complex than trees in biology.

3
From Max Delbrueck Center, Berlin
4
Yeast protein interactions
From http://www-personal.umich.edu/~mejn/networks/
5
Protein-Protein Interactions
6
Protein-Protein Interaction Modelling
Dr. Peter Uetz, Institut für Toxikologie und Genetik,
Forschungszentrum Karlsruhe
7
NY Times, May 5, 2008: The Diseasome
http://www.nytimes.com/interactive/2008/05/05/science/20080506_DISEASE.html
8
Graphs and Graph Theory
  • 1. Numerous uses of graphs and networks to
    represent biological phenomena at many conceptual
    levels. Maybe several 1000s of papers using graph
    representations, particularly trees, but little
    graph theory.
  • 2. A respectable number of papers that develop
    new non-trivial graph theory for problems in
    biology. 100s of papers, maybe 1000.
  • 3. A handful of papers exploiting or extending
    non-trivial classic graph theory for problems in
    biology. Perhaps a few hundred.

9
Introduction and Conclusion
  • Very diverse biological applications and very
    diverse graph theory. So no single grand reason
    for graphs and no single graph topic in biology.
  • Lots of opportunity for graph theorists and
    graph algorithmists to develop or apply graph
    theory to biological problems. Even more
    opportunity for combinatorial optimization.

10
What I will do in this tutorial
  • Emphasis on points 2 and 3, i.e., Examples of
    the development of new non-trivial graph theory,
    and of the exploitation of classic graph theory.
    And (my apologies) I will mostly emphasize topics
    I have been involved with.
  • Still, there are some hot biological areas today
    where graphs arise, and some graph topics that
    recur commonly, and I should point those out even
    if I will not talk in detail on those topics.

11
The digression
  • Hot biology: Network biology -- biological
    phenomena that are represented by networks --
    gene regulatory networks and protein interaction
    networks, just to name two. These form the core
    of Systems biology. Other relationships in
    biology are also represented by graphs and
    networks. Ex.: the diseasome.
  • Recurring graph problems: graph problems in
    clustering data (ex. finding cliques or
    variants of cliques); variants of graph
    isomorphism in network motif or molecular pathway
    problems; need for more random graph theory for
    significance testing.

12
Clique Problems
  • Clique problems are recurrent in clustering
    applications, but true cliques are
    computationally hard to find. Suggested research
    for graph theorists and algorithmists:
    computationally tractable, biologically
    meaningful alternatives to cliques. Examples:
    maximum density subgraphs; extreme sets in a
    graph.

13
Subgraph density
  • Given a graph G, and a subset S of its nodes,
    let G(S) be the subgraph of G induced by S, i.e,
    G(S) has node set S and edge set E(S) consisting
    of all edges in G both of whose ends are in S.
  • A Maximum Density subgraph of G is induced by the
    set of nodes S which maximizes |E(S)|/|S|.
  • The maximum density subgraph can be found in
    polynomial time. It has the flavor of a maximum
    clique, but has different properties.
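
The density definition on this slide can be made concrete with a small sketch. It is a brute-force enumeration over all node subsets, so it only illustrates the definition of |E(S)|/|S|; it is not the polynomial-time method the slide refers to. The toy graph and the function names are assumptions made for this example.

```python
from itertools import combinations

def density(edges, nodes):
    """Return |E(S)| / |S| for the subgraph induced by the node set `nodes`."""
    s = set(nodes)
    induced = [e for e in edges if e[0] in s and e[1] in s]
    return len(induced) / len(s)

def max_density_subgraph(vertices, edges):
    """Try every non-empty node subset (exponential time, illustration only)."""
    best = None
    for size in range(1, len(vertices) + 1):
        for subset in combinations(vertices, size):
            d = density(edges, subset)
            if best is None or d > best[0]:
                best = (d, set(subset))
    return best

# Hypothetical toy graph: a 4-clique {a, b, c, d} with a pendant path d-e-f.
V = ["a", "b", "c", "d", "e", "f"]
E = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"),
     ("b", "d"), ("c", "d"), ("d", "e"), ("e", "f")]

best_density, best_nodes = max_density_subgraph(V, E)
print(best_density, sorted(best_nodes))
```

In this toy graph the densest induced subgraph happens to be the clique, but in general a maximum density subgraph need not be a clique, which is the "different properties" point made above.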

14
Extreme Sets
  • In an edge-weighted undirected graph G with node
    set V, a subset S of nodes of G is called an
    extreme set if for every nonempty proper subset
    S' of S, the total weight of the edges crossing
    from S' to V-S' is larger than the total weight
    of the edges crossing from S to V-S.
  • All the extreme sets in a graph can be found in
    polynomial time.
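
As a rough illustration of the extreme-set definition, the sketch below checks a candidate set directly against every nonempty proper subset. This is exponential in the size of the set and is only meant to make the definition concrete; it is not the polynomial-time method mentioned above. The weighted toy graph and the function names are assumptions for this example.

```python
from itertools import combinations

def cut_weight(weights, s):
    """Total weight of edges with exactly one end in the node set s."""
    s = set(s)
    return sum(w for (u, v), w in weights.items() if (u in s) != (v in s))

def is_extreme(weights, s):
    """True if every nonempty proper subset of s has a strictly larger cut."""
    s = list(s)
    d_s = cut_weight(weights, s)
    for size in range(1, len(s)):
        for sub in combinations(s, size):
            if cut_weight(weights, sub) <= d_s:
                return False
    return True

# Hypothetical toy graph: two heavy triangles joined by one light edge c-d.
W = {("a", "b"): 3, ("a", "c"): 3, ("b", "c"): 3,
     ("d", "e"): 3, ("d", "f"): 3, ("e", "f"): 3,
     ("c", "d"): 1}

print(is_extreme(W, {"a", "b", "c"}))  # tightly knit triangle
print(is_extreme(W, {"a", "d"}))       # two nodes from different triangles
```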

15
Also
  • There is also a great need for more
    sophisticated application of random graph theory
    in the study of biological networks. This is
    needed in order to establish null models to use
    in assessing the statistical significance of
    subgraphs, paths, patterns and motifs that are
    found in biological networks.
  • We need to be able to distinguish observed
    patterns and subgraphs from those that occur with
    a high probability in a random graph, under a
    biologically appropriate model of randomness (an
    open field).

16
End of digression
  • Start of the main tutorial: Examples of Graph
    Theory in Bioinformatics and Computational
    Biology

17
Outline
  • Three smaller examples: Euler paths and
    sequencing; Tanglegrams and co-evolution; Network
    Design and Multiple Alignment.
  • Haplotyping by Perfect Phylogeny: Graph
    Realization.
  • Phylogenetic Networks: Incompatibility Graph;
    Galled-Trees; Recombination Networks; The
    Decomposition Theorem and sufficient conditions.
  • Multi-state Perfect Phylogeny and Chordal Graphs.

18
To start: Three small examples
  • Euler paths in sequencing and sequence assembly.
  • Tanglegrams and planarity testing in the study of
    co-evolution.
  • Application of Tree-Design approximations in
    multiple sequence alignment. Interplay between
    trees and strings.

19
Topic I: Eulerian paths in sequencing problems
  • The general situation is that we have a (DNA,
    say) molecule S whose sequence is unknown, but
    we know all the k-mers that occur in S, for some
    fixed k. Given those k-mers, we want to determine
    S, if possible, or determine whatever is possible
    to determine about S. Note that k is not related
    to the alphabet size.
  • A very useful approach to problems of this type
    is to build an Eulerian digraph, based on the
    (k-1)-mers.

20
Euler graph for general k
  • For general k, there is one node for each
    (k-1)-mer contained in an observed k-mer. Then
    there is a directed edge from the node for
    (k-1)-mer A to the node for (k-1)-mer B if the
    (k-2)-suffix of A matches the (k-2)-prefix of
    B and overlapping A and B forms an observed
    k-mer.
  • Example: k = 5 and we observe the 5-mer XXYZW.
    Then there will be a node for XXYZ and a node for
    XYZW and a directed edge from the first node to
    the second node. Those two nodes and the directed
    edge between them represent the 5-mer XXYZW. In
    some applications, there will be one such edge
    for each observation of that 5-mer.

21
Ex.: k = 3. The graph will have one node for each
of the 2-mers in the observed 3-mers. Then there
is a directed edge from the node for the 2-mer XY
to the node for the 2-mer YZ, for each observed
3-mer XYZ.
[Figure: the Euler graph derived from the sequence
ACACGCAACTTAAA.]
If a triple is observed more than once, there
should be one directed edge for each observation
of the triple.
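
The construction for k = 3 can be sketched as follows, using the sequence named on this slide. The function names are illustrative assumptions; the sketch adds one directed edge per observed 3-mer, as described above.

```python
from collections import Counter

def kmers(seq, k):
    """All length-k substrings of seq, in order, with repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def euler_graph_edges(observed_kmers):
    """One directed edge per observed k-mer: prefix (k-1)-mer -> suffix (k-1)-mer."""
    return [(kmer[:-1], kmer[1:]) for kmer in observed_kmers]

seq = "ACACGCAACTTAAA"
edges = euler_graph_edges(kmers(seq, 3))
nodes = {n for e in edges for n in e}
multiplicity = Counter(edges)  # how many parallel copies each edge has

print(sorted(nodes))
print(edges)
```

For this particular sequence every 3-mer happens to occur once, so no edge is duplicated; the 3-mer AAA shows up as a self-loop on the node AA.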
22
The point: Every Eulerian path in the graph
specifies a sequence whose k-mers match the given
data, and conversely every sequence whose k-mers
match the data specifies an Eulerian path in the
graph. So the set of Eulerian paths specifies
the set of candidate sequences for the unknown
original sequence.
Algorithms exist for efficiently finding Eulerian
paths, for counting their number, for
determining uniqueness, etc., so we can use this
representation to study the set of candidate
sequences. Compare this approach to earlier
efforts to represent the set of candidates by a
graph with a Hamilton path, where each node
represents an observed k-mer, not a (k-1)-mer.
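
One standard way to find an Eulerian path is Hierholzer's algorithm; the slides do not say which algorithm is intended, so the sketch below is only one possible choice. It rebuilds a candidate sequence from the 3-mers of the example sequence used earlier and assumes an Eulerian path exists. The candidate may differ from the original sequence while still having exactly the same 3-mers. All function names are illustrative.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def eulerian_path(edges):
    """Hierholzer's algorithm on a directed multigraph given as (u, v) pairs.
    Assumes that an Eulerian path exists."""
    out_edges = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start where out-degree exceeds in-degree by one, if such a node exists.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]

def spell(path):
    """Overlap consecutive (k-1)-mers along the path into one sequence."""
    return path[0] + "".join(node[-1] for node in path[1:])

seq = "ACACGCAACTTAAA"
observed = kmers(seq, 3)
edges = [(x[:-1], x[1:]) for x in observed]
candidate = spell(eulerian_path(edges))
print(candidate)
```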
23
Making finer distinctions in Euler paths
In general there may be many Eulerian paths in
the graph, and we want some additional criteria
to distinguish the goodness of one Eulerian path
compared to another. Different biological
considerations translate into having a value for
each subpath of length two. Then the value of an
Eulerian path P with n edges is the sum of the
n-1 values of the n-1 length-two subpaths in
P. The problem is to find an Eulerian path with
maximum value. We have some reasonable
approximations for that, but a simpler case can
be solved optimally in polynomial time.
24
The case of a binary alphabet, but arbitrary k
  • Since the alphabet size is two, each node in the
    graph has at most two incoming edges
  • and two outgoing edges. Assume exactly two each.

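[Figure: Ex. k = 4. A node of the graph with two
incoming and two outgoing edges; labels shown in
the figure: 001, 110, 011, 110, 101.]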
25
The case of a binary alphabet, but arbitrary k
  • At any node, there are two possible ways for
  • an Euler path to pass through the node.

[Figure: Ex. k = 4, with the same labels (001, 110,
011, 110, 101); the "turning" way of passing
through the node.]
26
The case of a binary alphabet, but arbitrary k

[Figure: Ex. k = 4, with the same labels (001, 110,
011, 110, 101); the "crossing" way of passing
through the node.]
So in terms of subpaths of length two, we have
two choices at each node.
27
Restating the optimal Euler path problem
  • We are given an Eulerian graph where the in
    and out degrees are at most two at each node,
    and at each node there is a given value for the
    turning pair, and a value for the crossing pair.
    Then choose the turning or the crossing pairs at
    the nodes to maximize the total value of the
    choices, subject to the requirement that the
    choices create an Euler path in the graph.

28
Main Result
  • The problem can be solved in polynomial time.
  • The set of choices that give Euler paths has a
    matroidal structure, which allows a
    matroid-greedy algorithm to find the optimal
    Euler path.
  • A more direct algorithm based on Minimum Spanning
    Trees also solves the problem.

29
The Matroid Structure
  • At every node v, the edge pair (crossing or
    turning) which has the lowest value is called the
    low pair, and the other pair is the high pair.
    The difference in values is called the loss at v.
  • A subset S of nodes is called independent if
    there is an Euler path in the graph where at
    every node in S, the low pair is chosen.
  • As defined, the family of independent sets forms
    a matroid, and so we can find, by a greedy
    algorithm, an independent set which minimizes the
    loss - and this gives the optimal Euler path.

30
Topic II: Tanglegrams
  • A Tanglegram is a pair of trees with the same
    labeled leaf set, each drawn in the plane with no
    crossing edges. The leaves of one tree are
    displayed on a line, and the leaves of the other
    tree are displayed on a parallel line.
  • A straight line connects each leaf in one tree to
    the leaf with the same label in the other tree.
  • The number of crossings among these lines is a
    measure of the similarity of the trees.
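
For one fixed drawing of the two trees, the number of crossings among the leaf-to-leaf lines depends only on the two left-to-right leaf orders: two lines cross exactly when their leaves appear in opposite order on the two parallel lines. The sketch below counts such pairs. It does not search over the possible drawings of the trees, and the leaf orders are hypothetical.

```python
def crossings(order_top, order_bottom):
    """Count pairs of leaves whose connecting lines cross, given the
    left-to-right leaf order on each of the two parallel lines."""
    pos = {leaf: i for i, leaf in enumerate(order_bottom)}
    seq = [pos[leaf] for leaf in order_top]
    return sum(
        1
        for i in range(len(seq))
        for j in range(i + 1, len(seq))
        if seq[i] > seq[j]
    )

# Hypothetical leaf orders read off two drawings.
print(crossings(list("abcd"), list("abcd")))
print(crossings(list("abcd"), list("badc")))
print(crossings(list("abcd"), list("dcba")))
```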

31
(No Transcript)
32
(No Transcript)
33
(No Transcript)
34
(No Transcript)
35
Topic III: Multiple Sequence Alignment
  • Interplay between sequences and trees.
  • Exploitation of network design approximation.

36
(No Transcript)
37
Intro to Hours 2 and 3: Two Post-HGP Topics
  • Two topics in Population Genomics:
  • SNP Haplotyping in populations
  • Reconstructing a history of recombination
  • These topics in Population Genomics illustrate
    current challenges in biology, and illustrate the
    use of graph theory, combinatorial algorithms and
    discrete mathematics in biology.

38
What is population genomics?
  • The Human genome sequence is done.
  • Now we want to sequence many individuals in a
    population to correlate similarities and
    differences in their sequences with genetic
    traits (e.g. disease or disease susceptibility).
  • Presently, we can't sequence large numbers of
    individuals, but we can sample the sequences at
    SNP sites.

39
SNP Data
  • A SNP is a Single Nucleotide Polymorphism - a
    site in the genome where two different
    nucleotides appear with sufficient frequency in
    the population (say each with 5% frequency or
    more).
  • SNP maps have been compiled with a density of
    about 1 site per 1000.
  • SNP data is what is mostly collected in
    populations - it is much cheaper to collect than
    full sequence data, and focuses on variation in
    the population, which is what is of interest.

40
Haplotype Map Project: HAPMAP
  • NIH-led project ($100M) to find common SNP
    haplotypes (SNP sequences) in the Human
    population.
  • Association mapping: HAPMAP is used to try to
    associate genetic-influenced diseases with
    specific SNP haplotypes, to either find causal
    haplotypes, or to find the region near causal
    mutations.
  • The key to the logic of Association mapping is
    historical recombination in populations. Nature
    has done the experiments, now we try to make
    sense of the results.

41
Topic IV: Perfect Phylogeny Haplotyping via Graph
Realization
42
Genotypes and Haplotypes
  • Each individual has two copies of each
    chromosome.
  • At each site, each chromosome has one of two
    alleles (states) denoted by 0 and 1 (motivated by
    SNPs).

Two haplotypes per individual:
0 1 1 1 0 0 1 1 0
1 1 0 1 0 0 1 0 0
Merge the haplotypes to get the genotype for the
individual:
2 1 2 1 0 0 1 2 0
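
The merge operation on this slide can be sketched as follows: a site where the two haplotypes agree keeps that state, and a site where they differ becomes 2. The second function enumerates the haplotype pairs that could explain a genotype, which shows why haplotyping is ambiguous without a genetic model. The function names are illustrative assumptions; the data are the slide's example.

```python
from itertools import product

def merge(h1, h2):
    """Genotype of an individual carrying haplotypes h1 and h2."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))

def explaining_pairs(genotype):
    """All unordered haplotype pairs whose merge equals the genotype."""
    twos = [i for i, g in enumerate(genotype) if g == "2"]
    pairs = set()
    for bits in product("01", repeat=len(twos)):
        h1, h2 = list(genotype), list(genotype)
        for i, b in zip(twos, bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.add(frozenset(("".join(h1), "".join(h2))))
    return pairs

g = merge("011100110", "110100100")
print(g)
print(len(explaining_pairs(g)))
```

A genotype with t sites equal to 2 (t at least 1) has 2^(t-1) explaining pairs, so the number of candidate explanations grows exponentially with the number of heterozygous sites.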
43
Haplotyping Problem
  • Biological Problem: For disease association
    studies, haplotype data is more valuable than
    genotype data, but haplotype data is hard to
    collect. Genotype data is easy to collect.
  • Computational Problem: Given a set of n
    genotypes, determine the original set of n
    haplotype pairs that generated the n genotypes.
    This is hopeless without a genetic model.

44
The Perfect Phylogeny Model for SNP sequences

Only one mutation per site allowed.
[Figure: a perfect phylogeny on sites 1 to 5, with
the ancestral sequence 00000 at the root, site
mutations on the edges, and the extant sequences
at the leaves.]
The tree derives the set M:
10100
10000
01011
01010
00010
45
When can a set of sequences be derived on a
perfect phylogeny?
  • Classic NASC (necessary and sufficient
    condition): Arrange the sequences in a matrix.
    Then (with no duplicate columns), the sequences
    can be generated on a unique perfect phylogeny if
    and only if no two columns (sites) contain all
    four pairs 0,0 and 0,1 and 1,0 and 1,1.

This is the 4-Gamete Test
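
A direct, quadratic-in-sites sketch of the 4-gamete test is given below, applied to the set M from the perfect phylogeny slide. It simply checks every pair of columns; it is not the O(nm) method cited on the next slide. The function name is an illustrative assumption.

```python
from itertools import combinations

def incompatible_pairs(rows):
    """Return the 1-based site pairs whose two columns contain all four
    of the pairs 0,0 and 0,1 and 1,0 and 1,1."""
    m = len(rows[0])
    bad = []
    for p, q in combinations(range(m), 2):
        gametes = {(r[p], r[q]) for r in rows}
        if len(gametes) == 4:
            bad.append((p + 1, q + 1))
    return bad

M = ["10100", "10000", "01011", "01010", "00010"]
print(incompatible_pairs(M))              # M passes the test
print(incompatible_pairs(M + ["10101"]))  # adding 10101 breaks it
```

With the extra sequence 10101, the pair of sites 4, 5 is among the pairs that fail the test, so the enlarged set cannot be derived on a perfect phylogeny.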
46
So, in the case of binary characters, if each
pair of columns allows a tree, then the entire
set of columns allows a tree. For M of dimension
n by m, the existence of a perfect phylogeny for
M can be tested in O(nm) time and a tree built in
that time, if there is one (Gusfield, Networks 91).
We will use the classic theorem in two more
modern and more genetic applications.
47
The Perfect Phylogeny Model
  • We assume that the evolution of extant haplotypes
    can be displayed on a rooted, directed tree, with
    the all-0 haplotype at the root, where each site
    changes from 0 to 1 on exactly one edge, and each
    extant haplotype is created by accumulating the
    changes on a path from the root to a leaf, where
    that haplotype is displayed.
  • In other words, the extant haplotypes evolved
    along a perfect phylogeny with all-0 root.
  • Justification: haplotype blocks, rare
    recombination, and a base problem whose solution
    can be modified to incorporate more biological
    complexity.

48
Perfect Phylogeny Haplotype (PPH)
Given a set of genotypes S, find an explaining
set of haplotypes that fits a perfect phylogeny.
A haplotype pair explains a genotype if the merge
of the haplotypes creates the genotype. Example:
The merge of 0 1 and 1 0 explains 2 2.
[Figure: the genotype matrix S, one column per
site.]
49
The PPH Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
50
The Haplotype Phylogeny Problem
Given a set of genotypes, find an explaining set
of haplotypes that fits a perfect phylogeny
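[Figure: genotypes a, b, c over sites 1 and 2, and
a tree whose leaves carry an explaining set of
haplotypes drawn from 00, 01, and 10.]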
51
The Alternative Explanation
No tree possible for this explanation
52
Efficient Solutions to the PPH problem - n
genotypes, m sites
  • Reduction to a graph realization problem (GPPH) -
    build on the Bixby-Wagner or Fujishige solution
    to graph realization: O(nm alpha(nm)) time
    (Gusfield, Recomb 02).
  • Reduction to graph realization - build on Tutte's
    graph realization method: O(nm^2) time (Chung,
    Gusfield 03).
  • Direct, from-scratch combinatorial approach:
    O(nm^2) (Bafna, Gusfield et al., JCB 03).
  • Berkeley (EHK) approach - specialize the Tutte
    solution to the PPH problem: O(nm^2) time.
  • Linear-time solutions - Recomb 2005, Ding,
    Filkov, Gusfield, and a different linear-time
    solution.

53
The Reduction Approach
  • This is the original polynomial time method.
    Conceptually simplest at a high level (but not at
    the implementation level) and most extendable to
    other problems; nearly linear-time, but not
    linear-time.

54
The case of the 1s
  • For any row i in S, the set of 1 entries in row i
    specifies the exact set of mutations on the path
    from the root to the least common ancestor of the
    two leaves labeled i, in every perfect phylogeny
    for S.
  • The order of those 1 entries on the path is also
    the same in every perfect phylogeny for S, and is
    easy to determine by leaf counting.

55
Leaf Counting
In any column c, count two for each 1, and count
one for each 2. The total is the number of
leaves below mutation c, in every perfect
phylogeny for S. So if we know the set
of mutations on a path from the root, we
know their order as well.
[Figure: the matrix S with the count for each of
its seven columns: 5 4 2 2 1 1 1.]
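
The counting rule can be sketched as below. The genotype matrix in the slide's figure is not part of the transcript, so the matrix used here is a small hypothetical one, built by hand from haplotypes 110, 100, 001, 000 that fit a perfect phylogeny. The function names are also illustrative.

```python
def leaf_counts(S):
    """For each column: count two for each 1 and one for each 2."""
    return [
        sum(2 if x == "1" else 1 if x == "2" else 0 for x in col)
        for col in zip(*S)
    ]

def order_of_ones(row, counts):
    """1-based sites with a 1 in this row, ordered from the root downward,
    i.e. by decreasing leaf count."""
    sites = [c for c, x in enumerate(row) if x == "1"]
    return [c + 1 for c in sorted(sites, key=lambda c: -counts[c])]

# Hypothetical genotype matrix (rows are genotypes, columns are sites).
S = ["120", "110", "002", "202"]
counts = leaf_counts(S)
print(counts)
print(order_of_ones("110", counts))
```

In this toy instance the counts agree with the number of haplotype leaves carrying each mutation (five, three, and two), which is what the slide's rule promises.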
56
Simple Conclusions
Row i data:
sites  1 2 3 4 5 6 7
i      0 1 0 1 2 2 2
[Figure: subtree for row i, with mutations 2, 4, 5
in that order on a path from the root.]
The order is known for the red mutations together
with the leftmost blue(?) mutation.
57
But what to do with the remaining blue entries
(2s) in a row?
58
More Simple Tools
  • For any row i in S, and any column c, if S(i,c)
    is 2, then in every perfect phylogeny for S, the
    path between the two leaves labeled i, must
    contain the edge with mutation c.
  • Further, every mutation c on the path
    between the two i leaves must be from such a
    column c.

59
From Row Data to Tree Constraints
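Row i data:
sites  1 2 3 4 5 6 7
i      0 1 0 1 2 2 2
[Figure: subtree for row i, with mutations 2, 4, 5
in that order on a path from the root, and the two
leaves labeled i.]
Edges 5, 6 and 7 must be on the blue path, and 5
is already known to follow 4, but we don't know
where to put 6 and 7.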
60
The Graph Theoretic Problem
  • Given a genotype matrix S with n sites, and a
    red-blue subgraph for each row i, create a
    directed tree T where each integer from 1 to n
    labels exactly one edge, so that each subgraph
    is contained in T.
61
Powerful Tool Tree and Graph Realization
  • Let Rn be the integers 1 to n, and let P be an
    unordered subset of Rn. P is called a path set.
  • A tree T with n edges, where each is labeled with
    a unique integer of Rn, realizes P if there is a
    contiguous path in T labeled with the integers of
    P and no others.
  • Given a family P1, P2, P3, ..., Pk of path sets,
    tree T realizes the family if it realizes each Pi.
  • The graph realization problem generalizes the
    consecutive ones problem, where T is a path.
  • More generally, each set specifies a fundamental
    cycle in the unknown graph.

62
Tree Realization Example
P1 = {1, 5, 8}
P2 = {2, 4}
P3 = {1, 2, 5, 6}
P4 = {3, 6, 8}
P5 = {1, 5, 6, 7}
[Figure: realizing tree T, with edges labeled 1
to 8.]
More generally, think of each path set as
specifying a fundamental cycle containing the
edges in the specified path.
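
Checking whether a given tree realizes a path set is easy, even though finding a realizing tree is not. The sketch below uses the five path sets of this example together with one edge-labeled tree constructed by hand for this illustration; it is not necessarily the tree drawn in the original figure. In a tree, a set of edges forms a contiguous path exactly when the edges are connected and no node touches more than two of them.

```python
def realizes(tree, path_set):
    """tree maps an edge label to its (u, v) endpoints; path_set is a
    non-empty set of edge labels. True if those edges form one path."""
    adj = {}
    for label in path_set:
        u, v = tree[label]
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    if any(len(nbrs) > 2 for nbrs in adj.values()):
        return False  # a branching node cannot lie on a path
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:  # check that the chosen edges are connected
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(adj)

# Hand-built tree on nodes 0..8 (an assumption for this sketch).
T = {4: (0, 1), 2: (1, 2), 5: (2, 3), 1: (3, 4),
     8: (4, 5), 6: (4, 6), 7: (6, 7), 3: (6, 8)}

family = [{1, 5, 8}, {2, 4}, {1, 2, 5, 6}, {3, 6, 8}, {1, 5, 6, 7}]
print(all(realizes(T, P) for P in family))
print(realizes(T, {1, 6, 8}))  # edges 1, 6, 8 meet at a single node
```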
63
Graph Realization
  • Polynomial time (almost linear-time)
    algorithms exist for the graph realization
    problem, given the family of fundamental cycles
    the unknown graph should contain (Whitney,
    Tutte, Cunningham, Edmonds, Bixby, Wagner,
    Gavril, Tamari, Fujishige, Lofgren; 1930s -
    1980s).
  • Most of the literature on this problem is in
    the context of determining if a binary matroid is
    graphic.
  • The algorithms are not simple; none were
    implemented before 2002.

64
Reducing PPH to graph realization
  • We solve any instance of the PPH problem by
    creating appropriate path sets, so that a
    solution to the resulting graph realization
    problem leads to a solution to the PPH problem
    instance.
  • The key issue: How to encode the needed
    subgraph for each row, and glue them together at
    the root.

65
From Row Data to Tree Constraints
Row i data:
sites  1 2 3 4 5 6 7
i      0 1 0 1 2 2 2
[Figure: subtree for row i, with mutations 2, 4, 5
in that order on a path from the root, and the two
leaves labeled i.]
Edges 5, 6 and 7 must be on the blue path, and 5
is already known to follow 4.
66
Encoding a Red-Blue directed path
P1 = {U, 2}
P2 = {U, 2, 4}
P3 = {2, 4}
P4 = {2, 4, 5}
P5 = {4, 5}
[Figure: the path U, 2, 4, 5 that these path sets
force in T.]
U is a glue edge used to glue together the
directed paths from the different rows.
67
Now add a path set for the blues in row i.
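Row i data:
sites  1 2 3 4 5 6 7
i      0 1 0 1 2 2 2
[Figure: subtree for row i with mutations 2, 4, 5
on a path from the root, and the two leaves
labeled i.]
P = {5, 6, 7}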
68
That's the Reduction
The resulting path-sets encode everything that
is known about row i in the input. The family of
path-sets is input to the graph-realization
problem, and every solution to that
graph-realization problem specifies a solution
to the PPH problem, and conversely.
Whitney (1933?) characterized the set of all
solutions to graph realization (based on the
three-connected components of a graph) and Tarjan
et al. showed how to find these in linear time.
69
An implicit representation of all solutions
Whitney (1930) proved that a graph realization
problem has a unique solution if and only if the
graph is three-connected. That is, at least
three nodes must be removed in order to
disconnect the graph (assuming it is
connected). Whitney (1931) proved that if the
solution is not unique, then there is a
semi-unique decomposition of the graph into
three-connected components, so that the graph
realizations are in one-to-one correspondence with
all the ways that these components can be
twisted relative to each other. So the number
of solutions is 2^(number of three-connected
components - 1).
70
Tree Realization Example
P1 = {1, 5, 8}
P2 = {2, 4}
P3 = {1, 2, 5, 6}
P4 = {3, 6, 8}
P5 = {1, 5, 6, 7}
[Figure: realizing tree T, with edges added to
create a fundamental cycle for each path.]
71
Topic V: Phylogenetic Networks with Recombination
72
When can a set of sequences be derived on a
perfect phylogeny?
  • Classic NASC (necessary and sufficient
    condition): Arrange the sequences in a matrix.
    Then (with no duplicate columns), the sequences
    can be generated on a unique perfect phylogeny if
    and only if no two columns (sites) contain all
    four pairs 0,0 and 0,1 and 1,0 and 1,1.

This is the 4-Gamete Test
73
Incompatible Sites
  • A pair of sites (columns) of M that fails the
    4-gametes test is said to be incompatible.
  • A site that is not in such a pair is compatible.

74
A richer model

M, with the sequence 10101 added:
10100
10000
01011
01010
00010
10101 (added)
[Figure: the previous tree on sites 1 to 5, with
root 00000.]
Pair 4, 5 fails the four-gamete test. The sites
4, 5 are incompatible.
Real sequence histories often involve
recombination.
75
Sequence Recombination
P = 10100 (Prefix), S = 01011 (Suffix)
Single crossover recombination at point 5: 10101
A recombination of P and S at recombination point
5.
The first 4 sites come from P (Prefix) and the
sites from 5 onward come from S (Suffix).
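
Single-crossover recombination as described on this slide amounts to concatenating a prefix of one parent with a suffix of the other. A minimal sketch, with an illustrative function name and 1-based recombination points as on the slide:

```python
def recombine(prefix_parent, suffix_parent, point):
    """Sites 1..point-1 come from prefix_parent; sites point..m come from
    suffix_parent. Sites are numbered from 1."""
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]

P = "10100"
S = "01011"
print(recombine(P, S, 5))
```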
76
Network with Recombination: ARG

M: 10100 10000 01011 01010 00010 10101 (new)
[Figure: the previous tree on sites 1 to 5, with
root 00000, plus a recombination node at point 5
whose parents P and S give 10101.]
The previous tree with one recombination event
now derives all the sequences.
77
A Min ARG for Kreitman's data
ARG created by SHRUB
78
An illustration of why we are interested in
recombination: Association Mapping of Complex
Diseases Using ARGs
79
Association Mapping
  • A major strategy being practiced to find genes
    influencing disease from haplotypes of a subset
    of SNPs.
  • Disease mutations unobserved.
  • A simple example to explain association mapping
    and why ARGs are useful, assuming the true ARG is
    known.

[Figure: SNP sites with states 0, 1, 0, 0, 1 and
the (unobserved) disease mutation site.]
80
Very Simplistic Mapping the Unobserved Mutation
of Mendelian Diseases with ARGs
Assumption (for now): A sequence is diseased iff
it carries the single disease mutation.
[Figure: an ARG with root 00000, mutations 1 to 5
on its edges, two recombination nodes (parents
marked P and S), and leaves a 00010, b 10010,
c 00100, d 10100, e 01100, f 01101, g 00101; some
leaves are marked as diseased.]
Where is the disease mutation?
81
Mapping Disease Gene with Inferred ARGs
  • "...the best information that we could possibly
    get about association is to know the full
    coalescent genealogy" (Zollner and Pritchard,
    2005)
  • But we do not know the true ARG!
  • Goal: infer ARGs from SNP data for association
    mapping.
  • Not easy, and often done by approximation (e.g.
    Zollner and Pritchard).
  • Improved results for doing the inference: Y. Wu
    (RECOMB 2007).

82
Results on Reconstructing the Evolution of SNP
Sequences
  • Part I: Clean mathematical and algorithmic
    results: Galled-Trees, near-uniqueness,
    graph-theory lower bound, and the Decomposition
    theorem
  • Part II: Practical computation of Lower and
    Upper bounds on the number of recombinations
    needed. Construction of (optimal)
    phylogenetic networks; uniform sampling;
    haplotyping with ARGs; LD mapping
  • Part III: Varied Biological Applications
  • Part IV: Extension to Gene Conversion
  • Part V: The Minimum Mosaic Model of Recombination

This talk will discuss topics in Parts I
83
Problem: If not a tree, then what?
  • If the set of sequences M cannot be derived on a
    perfect phylogeny (true tree) how much deviation
    from a tree is required?
  • We want a network for M that uses a small number
    of recombinations, and we want the resulting
    network to be as tree-like as possible.

84
A tree-like network for the same sequences
generated by the prior network.
[Figure: a tree-like network with two recombination
nodes (parents marked p and s) and leaves a 00010,
b 10010, c 00100, d 10100, e 01100, f 01101,
g 00101.]
85
Recombination Cycles
  • In a Phylogenetic Network, with a recombination
    node x, if we trace two paths backwards from x,
    then the paths will eventually meet.
  • The cycle specified by those two paths is called
    a recombination cycle.

86
Galled-Trees
  • A phylogenetic network where no recombination
    cycles share an edge is called a galled tree.
  • A cycle in a galled-tree is called a gall.
  • Question: if M cannot be generated on a true
    tree, can it be generated on a galled-tree?

87
(No Transcript)
88
Results about galled-trees
  • Theorem: Efficient (provably polynomial-time)
    algorithm to determine whether or not any
    sequence set M can be derived on a galled-tree.
  • Theorem: A galled-tree (if one exists) produced
    by the algorithm minimizes the number of
    recombinations used over all possible
    phylogenetic-networks.
  • Theorem: If M can be derived on a galled tree,
    then the Galled-Tree is nearly unique. This
    is important for biological conclusions derived
    from the galled-tree.

Papers from 2003-2007.
89
Elaboration on Near Uniqueness
Theorem: The number of arrangements
(permutations) of the sites on any gall is at
most three, and this happens only if the gall has
two sites. If the gall has more than two sites,
then the number of arrangements is at most
two. If the gall has four or more sites, with at
least two sites on each side of the recombination
point (not the side of the gall) then the
arrangement is forced and unique.
Theorem: All other features of the galled-trees
for M are invariant.
90
A whiff of the ideas behind the results
91
Incompatible Sites
  • A pair of sites (columns) of M that fails the
    4-gametes test is said to be incompatible.
  • A site that is not in such a pair is compatible.

92
Incompatibility Graph G(M)
M:
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1
[Figure: G(M), with one node for each of the sites
1 to 5.]
Two nodes are connected iff the pair of sites is
incompatible, i.e., fails the 4-gamete test.
THE MAIN TOOL: We represent the pairwise
incompatibilities in an incompatibility graph.
93
The connected components of G(M) are very
informative
  • Theorem: The number of non-trivial connected
    components is a lower-bound on the number of
    recombinations needed in any network.
  • Theorem: When M can be derived on a galled-tree,
    all the incompatible sites in a gall must come
    from a single connected component C, and that
    gall must contain all the sites from C.
    Compatible sites need not be inside any blob.
  • In a galled-tree the number of recombinations is
    exactly the number of non-trivial connected
    components in G(M), and hence is minimum over all
    possible phylogenetic networks for M.
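
The connected-component lower bound can be computed directly from the sequences. The sketch below builds the incompatibility graph for the seven sequences a to g used in this part of the talk (a 00010, b 10010, c 00100, d 10100, e 01100, f 01101, g 00101) and lists its non-trivial connected components. The function names are illustrative, and the pairwise 4-gamete check is the simple quadratic one.

```python
from itertools import combinations

def incompatibility_graph(rows):
    """Adjacency sets over 1-based sites; two sites are adjacent iff
    their columns contain all four of 00, 01, 10, 11."""
    m = len(rows[0])
    adj = {site: set() for site in range(1, m + 1)}
    for p, q in combinations(range(m), 2):
        if len({(r[p], r[q]) for r in rows}) == 4:
            adj[p + 1].add(q + 1)
            adj[q + 1].add(p + 1)
    return adj

def nontrivial_components(adj):
    """Connected components that contain at least one edge."""
    seen, comps = set(), []
    for start in adj:
        if start in seen or not adj[start]:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
G = incompatibility_graph(M)
components = nontrivial_components(G)
print(components)
print("lower bound on recombinations:", len(components))
```

For this M the sketch finds two non-trivial components, so by the first theorem above any network for M needs at least two recombinations.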

94
Incompatibility Graph
[Figure: the incompatibility graph on sites 1 to 5,
shown next to the galled-tree with leaves a 00010,
b 10010, c 00100, d 10100, e 01100, f 01101,
g 00101.]
95
A Graph Theoretic Necessary Condition for a
Galled-Tree
  • If M can be generated on a galled-tree, then
    the incompatibility graph must be a bipartite
    bi-convex graph. Other structural properties
    of the conflict graph can be deduced and
    exploited.

96
Galled-Tree Haplotyping
  • Problem: Given genotype matrix G, if there is
    no PPH solution for G, is there a haplotyping H
    for G such that H can be derived on a Galled-Tree?

97
A different Necessary Condition for a one-gall
tree
  • 1. There exists a set of sequences S such that
    for every pair of incompatible sites p,q, a
    single p,q state-pair appears in all sequences in
    S, and does not appear in any sequence outside S.
  • 2. There must be a number x such that
    p < x < q, for each incompatible pair p,q.

98
Example
H:
   1 2 3 4 5 6
a  0 0 0 1 0 0
b  1 0 0 1 0 0
c  0 0 1 0 0 0
d  1 0 1 0 0 0
e  1 1 0 0 0 1
f  0 1 1 0 0 0
g  0 0 1 0 1 0
[Figure: a galled tree for H with one
recombination node.]
S = {e, d}: the sequences below the recombination
node.
99
Surprising Result - Yun Song
  • The necessary condition is also sufficient.
  • Yun S. Song in TCBB 2006

100
Coming full circle - back to genotypes
  • When can a set of genotypes be explained by a
    set of haplotypes derived on a galled-tree,
    rather than on a perfect phylogeny?
  • The Song NASC can be translated into an ILP,
    using the part of the MinIncompat ILP that
    identifies which site pairs are incompatible.

101
  • For the one gall problem, the ILP formulation
    solves very efficiently (200 rows x 40 sites in
    seconds to minutes). So far, the 2-gall case
    does not solve well (ongoing work).
  • (Dan Brown, Gusfield 2006).

102
Coming full circle - back to genotypes
  • When can a set of genotypes be explained by a
    set of haplotypes that are derived on a
    galled-tree, rather than on a perfect phylogeny?
  • Recently, we developed an Integer Linear
    Programming solution to this problem, and are
    now testing the practical efficiency of it
    (Brown, Gusfield).

103
Change of Scope: Minimizing Recombinations in
unconstrained networks
  • Problem: given a set of sequences M, find a
    phylogenetic network generating M, minimizing the
    number of recombinations used to generate M,
    allowing only one mutation per site. This has
    biological meaning in appropriate contexts.
  • We can solve this problem in poly-time for the
    special case of Galled-Trees.
  • The minimization problem is NP-hard in general.

104
Minimization is an NP-hard Problem
  • What we have done:
  • 1. Solve small data-sets optimally with
    exponential-time methods or with algorithms that
    work well in practice.
  • 2. Efficiently compute lower and upper bounds on
    the number of needed recombinations.
  • 3. Apply these methods to address specific
    biological and bio-tech questions.
105
The Decomposition Theorem
Since the minimization problem is NP-hard we want
to break up a problem into subproblems that can
be solved separately and combined.
106
Incompatibility Graph G(M)
Input matrix M (rows are sequences a-g, columns are sites 1-5)
     1 2 3 4 5
  a  0 0 0 1 0
  b  1 0 0 1 0
  c  0 0 1 0 0
  d  1 0 1 0 0
  e  0 1 1 0 0
  f  0 1 1 0 1
  g  0 0 1 0 1
[Figure: the incompatibility graph G(M), with one node for each of the sites 1-5.]
Two nodes are connected iff the pair of sites is
incompatible, i.e., fails the 4-gamete test.
THE MAIN TOOL: We represent the pairwise
incompatibilities in an incompatibility graph.
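As a minimal sketch of the 4-gamete test and the construction of G(M), the following Python uses the seven example sequences a-g on five sites. The function names `incompatible` and `incompatibility_graph` are illustrative choices, not names from the tutorial.

```python
from itertools import combinations

# Example sequences a-g over sites 1-5 (the slide's matrix M).
M = {
    "a": "00010",
    "b": "10010",
    "c": "00100",
    "d": "10100",
    "e": "01100",
    "f": "01101",
    "g": "00101",
}

def incompatible(rows, i, j):
    """Four-gamete test: sites i and j (0-based) are incompatible
    iff all four pairs 00, 01, 10, 11 occur in some row."""
    gametes = {(r[i], r[j]) for r in rows}
    return len(gametes) == 4

def incompatibility_graph(rows):
    """Return the edges of G(M) as a list of 1-based site pairs."""
    m = len(rows[0])
    return [(i + 1, j + 1)
            for i, j in combinations(range(m), 2)
            if incompatible(rows, i, j)]

edges = incompatibility_graph(list(M.values()))
print(edges)  # [(1, 3), (1, 4), (2, 5)]
```

On this input the graph has the two non-trivial components {1, 3, 4} and {2, 5}.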
107
The connected components of G(M) are very
informative
  • For example, we have the following theorem.
  • Theorem: The number of non-trivial connected
    components is a lower bound on the number of
    recombinations needed in any network.
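The bound can be computed with a small union-find pass over the edges of G(M). This is a sketch only: the function name is an illustrative assumption, and the edge list is what the 4-gamete test yields for the example sequences a-g on sites 1-5.

```python
def nontrivial_components(num_sites, edges):
    """Count connected components containing at least two sites."""
    parent = list(range(num_sites + 1))  # index 0 unused; sites are 1-based

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    sizes = {}
    for s in range(1, num_sites + 1):
        root = find(s)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for size in sizes.values() if size >= 2)

# Edges of G(M) for the example matrix: 1-3, 1-4 and 2-5.
edges = [(1, 3), (1, 4), (2, 5)]
lower_bound = nontrivial_components(5, edges)
print(lower_bound)  # 2
```

So any network for the example needs at least two recombinations.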

108
Recombination Cycles
  • In a Phylogenetic Network, with a recombination
    node x, if we trace two paths backwards from x,
    then the paths will eventually meet.
  • The cycle specified by those two paths is called
    a recombination cycle.

109
A maximal set of intersecting cycles forms a Blob
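[Figure: a phylogenetic network rooted at 00000, with node labels 00010, 10010, 00100, 00101, 01100 and 01101; mutation edges are labeled by sites 1-5, and the edges entering each recombination node are marked P and S.]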
If directions on the edges are removed, a blob
is a bi-connected component of the network.
110
Blobbed Trees
  • Contracting each blob in a network results in a
    directed, rooted tree, otherwise one of the
    blobs was not maximal. Simple, but key
    insight.
  • So every phylogenetic network can be viewed as a
    directed tree of blobs - a blobbed-tree.
  • The blobs are the non-tree-like parts of the
    network.

111
Every network is a tree of blobs.
A network where every blob is a single cycle
is a Galled-Tree.
Ugly tangled network inside the blob.
112
A Simple Observation
  • In any network N for M, all sites from the
    same connected component of G(M) must appear
    together in a single blob in N.

113
The Decomposition Theorem
  • Theorem: For any set of sequences M, there is a
    phylogenetic network that derives M, where each
    blob contains all and only the sites in one
    non-trivial connected component of G(M). The
    compatible sites can always be put on edges
    outside of any blob.
  • This fully-decomposed network is the finest
    decomposition possible.

114
Example Network for input M with one blob
[Figure: a network with a single blob that derives M from the root 00000. Leaf labels: a 00010, b 10010, c 00100, d 10100, e 01100, f 01101, g 00101.]
115
The fully-decomposed network for M
[Figure: the incompatibility graph G(M) on sites 1-5, beside the fully-decomposed network for M. Leaf labels: a 00010, b 10010, c 00100, d 10100, e 01100, f 01101, g 00101.]
116
Moreover, the backbone tree is invariant over
all the fully-decomposed networks for M, and can
be determined in polynomial-time. So, we can
find a network for M by solving the recombination
minimization problem for each connected component
of G(M) separately, and then connect those
subnetworks in an invariant way.
117
Algorithmically
  • Finding the tree part of the blobbed-tree is
    easy.
  • Determining the sequences labeling the exterior
    nodes on any blob is easy.
  • Determining a good structure inside a blob B is
    the problem of generating the sequences of the
    exterior nodes of B.
  • It is easy to test whether the exterior sequences
    on B can be generated with only a single
    recombination. The original galled-tree problem
    is now just the problem of testing whether one
    single-crossover recombination is sufficient for
    each blob.
  • That can be solved by successively removing each
    exterior sequence and testing if the remaining
    sequences can be generated on a perfect phylogeny
    of the correct form.

118
However
  • While fully-decomposed networks always exist,
    they do not necessarily minimize the number of
    recombination nodes, over all possible networks.
  • That is, sometimes it pays to put sites from
    different connected components together on the
    same blob.

119

Sufficient Conditions
But we can prove several useful sufficient
conditions for when there is a fully-decomposed
network that minimizes the number of
recombinations, over all possible networks. The
deepest result is the following.
Theorem: Let N be a phylogenetic network for
input M, let L be the set of sequences that label
the nodes of N, and let G(L) be the
incompatibility graph for L. If G(L) and G(M)
have the same number of connected components,
then there is a fully-decomposed network for M
with the same number of recombinations as in N.
(JCB, December 2007)
120
Corollary
A fully-decomposed network exists that minimizes
the number of recombinations, unless every
optimal network uses some recombination node(s)
labeled by sequence(s) not in M, and the addition
of those sequences to M creates an
incompatibility between sites in different
components of G(M).
121
[Figure: a network rooted at 000000, with mutation edges labeled by sites 1-6 and three recombination nodes whose entering edges are marked P and S. Node labels include 100010, 000100, 001000, 100101, 010010 and 100001.]
Sequences in M are in black. Sequence 100010 is
not in M.
G(M) has two components. Each requires two
recombinations, but this combined network needs
only three.
G(L) has one component. The addition of
sequence 100010 reduces the number of
components from 2 to 1.
122
A Practical Sufficient Condition
If M can be derived on a network N in which every
edge contains at most one site, and every node is
labeled with a sequence in M, then there is a
fully-decomposed network for M which minimizes
the number of recombinations over all possible
networks for M.
123
Another Practical Sufficient Condition
If M can be derived on a network N where the
number of recombinations equals
the (poly-computable) Haplotype Lower Bound,
then there is a fully-decomposed network for M
which minimizes the number of recombinations
over all possible networks.
124
Topic VI: Perfect Phylogeny Extension to
non-binary characters
  • We detail the case of three allowed states per
    character.

125
What is a Perfect Phylogeny for non-binary
characters?
  • Input consists of n sequences M with m sites
    (characters) each, where each site can take one
    of k states.
  • In a Perfect Phylogeny T for M, each node of T is
    labeled with an m-length sequence where each site
    has a value from 1 to k.
  • T has n leaves, one for each sequence in M,
    labeled by that sequence.
  • For each character-state pair (C,s), the nodes of
    T that are labeled with state s for character C
    form a connected subtree of T. It follows that
    the subtrees for the different states of any
    one character C are node-disjoint.
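The connected-subtree condition can be checked directly on a labeled tree. The sketch below is illustrative only: the function name, the node names and the small 3-state tree are hypothetical, and the code assumes the edge list really is a tree and does not check that the leaves match the rows of M.

```python
from collections import deque

def is_perfect_phylogeny(labels, tree_edges):
    """labels: node -> tuple of states; tree_edges: list of (u, v) pairs
    assumed to form a tree. Returns True iff, for every character C and
    state s, the nodes labeled s at C induce a connected subtree."""
    adj = {v: set() for v in labels}
    for u, v in tree_edges:
        adj[u].add(v)
        adj[v].add(u)
    m = len(next(iter(labels.values())))
    for c in range(m):
        for s in {lab[c] for lab in labels.values()}:
            nodes = {v for v, lab in labels.items() if lab[c] == s}
            start = next(iter(nodes))
            seen, queue = {start}, deque([start])
            while queue:  # BFS restricted to nodes carrying state s
                u = queue.popleft()
                for w in adj[u] & nodes:
                    if w not in seen:
                        seen.add(w)
                        queue.append(w)
            if seen != nodes:
                return False
    return True

# Hypothetical tree: internal nodes x, y and leaves r1..r5 (n=5, m=3, k=3).
labels = {
    "x": (3, 2, 3), "y": (1, 2, 3),
    "r1": (2, 3, 2), "r2": (3, 2, 1), "r3": (3, 2, 3),
    "r4": (1, 2, 3), "r5": (1, 1, 3),
}
tree_edges = [("x", "r1"), ("x", "r2"), ("x", "r3"),
              ("x", "y"), ("y", "r4"), ("y", "r5")]
ok = is_perfect_phylogeny(labels, tree_edges)

# Relabeling r5 separates the nodes with state 3 at the first character.
broken = is_perfect_phylogeny(dict(labels, r5=(3, 1, 3)), tree_edges)
```

Here `ok` is True, and `broken` is False because r5 is attached only to y, whose first character has state 1.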

126
Example A perfect phylogeny for input M
[Figure: input table M with rows 1-5 over characters A, B, C (n = 5, m = 3, k = 3), and a perfect phylogeny for M. Sequences shown: (2,3,2), (3,2,1), (3,2,3), (1,2,3), (1,1,3).]
127
Example
[Figure: the same table M and tree (n = 5, m = 3, k = 3), highlighting the connected subtree formed by the nodes with state 2 of character B.]
The tree for State 2 of Character B
128
Perfect Phylogeny Problem
Given M, is there a Perfect Phylogeny for M?
129
Chordal Graphs
Basic definition: A graph G is called chordal if
every cycle of length four or more contains a
chord.
More useful result: A graph G is chordal if and
only if every minimal vertex separator in G is a
clique.
Chordal graphs have a large number of
applications, more based on the separator result
than on the basic definition. For example, a
chordal graph on n nodes can have at most n
maximal cliques and n-1 minimal vertex
separators.
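Chordality can be tested without enumerating cycles. The sketch below uses a standard characterization that is not stated on the slide: a graph is chordal iff its vertices can be deleted one at a time, each being simplicial (its remaining neighbours form a clique) at the moment it is deleted. The function name is illustrative, and the code is a simple polynomial-time version rather than a linear-time one.

```python
def is_chordal(adj):
    """adj: dict node -> set of neighbours (undirected, no self-loops).
    Repeatedly delete a simplicial vertex; the graph is chordal iff
    this process deletes every vertex."""
    g = {v: set(ns) for v, ns in adj.items()}  # work on a copy
    while g:
        for v, ns in g.items():
            # v is simplicial if its neighbours are pairwise adjacent.
            if all(b in g[a] for a in ns for b in ns if a != b):
                break
        else:
            return False  # no simplicial vertex left: not chordal
        for u in g[v]:
            g[u].discard(v)
        del g[v]
    return True

# A chordless 4-cycle 1-2-3-4-1 is not chordal.
cycle4 = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}
# Adding the chord 1-3 makes it chordal.
cycle4_with_chord = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2, 4}, 4: {1, 3}}

print(is_chordal(cycle4), is_chordal(cycle4_with_chord))  # False True
```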
130
Another Classic Chordal Graph Theorem
A graph G is chordal if and only if it is the
intersection graph of a set S of subtrees of a
tree T. Each node of G is a member of S.
[Figure: a chordal graph G on nodes a-g, beside a tree T whose nodes are labeled with the sets {b,c}, {b,c,d}, {c,d,e,g}, {a,e,g}, {a,e} and {e,f,g}.]
131
Relation to Perfect Phylogeny
In a perfect phylogeny T for a table E, for any
character C and any state X of character C, the
sub-forest of T induced by the nodes labeled
(C,X) forms a single, connected subtree of T. So,
there is a natural set of subtrees of T induced
by E.
132
Chordal Completion Approach to Perfect Phylogeny
Graph G(E) has one node for each character-state
pair in E, and an edge between two nodes if and
only if there is a row in E with both
those character-state pairs.
[Figure: Table E with rows 1-5 over characters A, B, C, and the graph G(E) with nodes for states 1, 2, 3 of each of A, B, C.]
Each row of table E induces a clique in G(E).
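A minimal sketch of building G(E) from a table follows. The function name `state_graph` and the 5-row table are hypothetical illustrations, not data recovered from the slide.

```python
from itertools import combinations

def state_graph(table, names):
    """Build G(E): nodes are (character, state) pairs, and two nodes are
    adjacent iff some row of the table contains both pairs."""
    nodes, edges = set(), set()
    for row in table:
        pairs = [(names[c], s) for c, s in enumerate(row)]
        nodes.update(pairs)
        # Each row contributes a clique on its character-state pairs.
        for p, q in combinations(pairs, 2):
            edges.add((p, q))
    return nodes, edges

# Hypothetical table E over characters A, B, C with states 1..3.
E = [(2, 3, 2), (3, 2, 1), (3, 2, 3), (1, 2, 3), (1, 1, 3)]
nodes, edges = state_graph(E, "ABC")

# No edge joins two states of the same character, so G(E) is 3-partite.
is_k_partite = all(p[0] != q[0] for p, q in edges)
```

For this table the graph has 9 nodes and 12 edges, since some rows share character-state pairs and so contribute overlapping cliques.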
133
Classic Theorem
Note that if table E has K columns, then G(E) is
a K-partite graph.
Theorem (Buneman 196?)
There is a perfect phylogeny for table E if and
only if edges can be added to graph G(E) to make
it a chordal, K-partite graph. If there is such
a chordal graph, denote it by G'(E).
134
Deeper Result: If G'(E) exists
  • Let C(E) be the graph derived from graph G'(E) as
    follows: create a node in C(E) for each maximal
    clique in G'(E), and create an edge (u,v) in C(E)
    iff the cliques for u and v in G'(E) share a
    node. Weight edge (u,v) by the number of shared
    nodes. Note that C(E) can be created from G'(E)
    in polynomial time.
  • Any Maximum Spanning Tree T in C(E) is a perfect
    phylogeny for E. Actually, T can be found more
    directly in linear time from G'(E).

135
Perfect Phylogeny Results
The perfect phylogeny problem was open for about
20 years before it was solved by Dress and Steel,
Warnow and Kannan, and Agarwala and
Fernandez-Baca. For any fixed bound on the number
of states per character, the Perfect Phylogeny
Problem can be solved in polynomial time.
However, if the number of states per character is
not bounded, then the problem is NP-Complete.
Also, for any fixed number of characters, the
problem can be solved in polynomial time.
136
Dress-Steel solution for 3-state Perfect
phylogeny given complete data (1991)
  • Recode each site M(i) of M as three binary sites
    M(i,1), M(i,2), M(i,3), where M(i,s) indicates
    the taxa that have state s at site i.
  • Theorem (DS): There is a 3-state perfect
    phylogeny for M if and only if there is a
    binary-character perfect phylogeny for some
    subset of the recoded binary sites consisting of
    exactly two of the columns M(i,1), M(i,2),
    M(i,3), for each column i of M.
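The recoding and the theorem can be tried out by brute force on small inputs. The sketch below assumes the standard fact that a set of binary characters has a perfect phylogeny (with no fixed root) iff every pair of columns passes the 4-gamete test. The function names and both example tables are hypothetical, and the search is exponential in the number of sites.

```python
from itertools import combinations, product

def recode(M):
    """Return {(i, s): binary column} for each site i and state s in 1..3."""
    m = len(M[0])
    return {(i, s): tuple(int(row[i] == s) for row in M)
            for i in range(m) for s in (1, 2, 3)}

def compatible(c1, c2):
    """Two binary columns are compatible iff they pass the 4-gamete test."""
    return len(set(zip(c1, c2))) < 4

def dress_steel_brute_force(M):
    """Try every way of dropping one recoded column per site (3^m choices).
    Return a pairwise-compatible list of kept columns, or None."""
    cols = recode(M)
    m = len(M[0])
    for dropped in product((1, 2, 3), repeat=m):
        kept = [(i, s) for i in range(m) for s in (1, 2, 3)
                if s != dropped[i]]
        if all(compatible(cols[a], cols[b])
               for a, b in combinations(kept, 2)):
            return kept
    return None

# Hypothetical 3-state table that has a perfect phylogeny.
good = [(2, 3, 2), (3, 2, 1), (3, 2, 3), (1, 2, 3), (1, 1, 3)]
# Hypothetical table whose two sites conflict whichever columns are kept.
bad = [(1, 1), (1, 2), (2, 1), (2, 2)]

kept_good = dress_steel_brute_force(good)
kept_bad = dress_steel_brute_force(bad)
```

`kept_good` is a list of six (site, state) pairs, two per site, while `kept_bad` is None.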

137
Example
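[Figure: table M with rows 1-5 over characters A, B, C, beside its binary recoding with columns A,1 A,2 A,3 B,1 B,2 B,3 C,1 C,2 C,3; a compatible subset of the recoded columns is marked.]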
138
Solved in Poly-Time by 2-SAT
As stated, the problem still seems like it would
take exponential time to solve, but in fact it is
easy to code the problem as a 2-SAT problem (Y.
Wu) and hence is solvable in polynomial time.
The Dress-Steel paper gave an independent
poly-time solution.
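One plausible 2-SAT encoding is sketched below. It is an assumption for illustration and not necessarily the encoding the slide attributes to Y. Wu. There is one Boolean variable per recoded column, meaning "this column is kept". For each site, the clauses (x1 or x2), (x1 or x3), (x2 or x3) force at least two of its three columns to be kept, and each pair of columns that fails the 4-gamete test gets a clause (not xa or not xb). Any two kept columns per site then form a compatible subset. All function names and both tables are hypothetical.

```python
from itertools import combinations

def solve_2sat(n, clauses):
    """clauses: list of ((var, bool), (var, bool)) literal pairs over
    variables 0..n-1. Returns a satisfying list of bools, or None."""
    # Literal (v, True) is node 2v; literal (v, False) is node 2v + 1.
    node = lambda lit: 2 * lit[0] + (0 if lit[1] else 1)
    graph = [[] for _ in range(2 * n)]
    rgraph = [[] for _ in range(2 * n)]
    for a, b in clauses:
        # (a or b) gives the implications (not a -> b) and (not b -> a).
        for x, y in ((node(a) ^ 1, node(b)), (node(b) ^ 1, node(a))):
            graph[x].append(y)
            rgraph[y].append(x)
    # Kosaraju pass 1: record nodes in order of DFS finishing time.
    order, seen = [], [False] * (2 * n)
    for root in range(2 * n):
        if seen[root]:
            continue
        seen[root] = True
        stack = [(root, iter(graph[root]))]
        while stack:
            u, it = stack[-1]
            for w in it:
                if not seen[w]:
                    seen[w] = True
                    stack.append((w, iter(graph[w])))
                    break
            else:
                order.append(u)
                stack.pop()
    # Pass 2: label components on the reversed graph.
    comp, label = [-1] * (2 * n), 0
    for root in reversed(order):
        if comp[root] != -1:
            continue
        comp[root] = label
        stack = [root]
        while stack:
            u = stack.pop()
            for w in rgraph[u]:
                if comp[w] == -1:
                    comp[w] = label
                    stack.append(w)
        label += 1
    if any(comp[2 * v] == comp[2 * v + 1] for v in range(n)):
        return None
    # Labels follow a topological order of the implication graph, so a
    # variable is set true when its positive literal comes later.
    return [comp[2 * v] > comp[2 * v + 1] for v in range(n)]

def three_state_pp_2sat(M):
    """Return a keep/drop assignment for the recoded columns, or None."""
    m = len(M[0])
    var = {(i, s): 3 * i + (s - 1) for i in range(m) for s in (1, 2, 3)}
    col = {k: tuple(int(row[k[0]] == k[1]) for row in M) for k in var}
    clauses = []
    for i in range(m):
        for s, t in combinations((1, 2, 3), 2):
            clauses.append(((var[i, s], True), (var[i, t], True)))
    for a, b in combinations(var, 2):
        if len(set(zip(col[a], col[b]))) == 4:  # fails the 4-gamete test
            clauses.append(((var[a], False), (var[b], False)))
    return solve_2sat(3 * m, clauses)

good = [(2, 3, 2), (3, 2, 1), (3, 2, 3), (1, 2, 3), (1, 1, 3)]
bad = [(1, 1), (1, 2), (2, 1), (2, 2)]
answer_good = three_state_pp_2sat(good)
answer_bad = three_state_pp_2sat(bad)
```

The number of clauses is quadratic in the number of recoded columns, and 2-SAT is solvable in time linear in the formula size, so the whole test is polynomial.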