Title: Combinatorial methods in Bioinformatics: the haplotyping problem
1Combinatorial methods in Bioinformatics: the haplotyping problem
- Paola Bonizzoni
- DISCo
- Università di Milano-Bicocca
2Content
- Motivation, biological terms
- Combinatorial methods in haplotyping
- Haplotyping via perfect phylogeny: the PPH problem
- Inference of incomplete perfect phylogeny: algorithms
- Incomplete PPH and missing data
- Other models, open problems
3Biological terms
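Diploid organism: two copies of each chromosome
Biallelic site i: exactly 2 of the values in {A, C, G, T} occur at site i
Heterozygous site: the two copies carry different values at site i
Homozygous site: the two copies carry the same value at site i
Haplotype: the sequence of values along one copy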
4Motivations
- Human genetic variations are related to diseases (cancers, diabetes, osteoporosis); the most common variation is the Single Nucleotide Polymorphism (SNP) on haplotypes in chromosomes
- The human genome project produces genotype sequences of humans
- Computational methods to derive haplotypes from genotype data are in demand
- Ongoing international HapMap project: find haplotype differences on a large scale (population data)
Combinatorial methods: graphs, set-cover problems, optimization problems
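5Haplotyping: the formal model
- Haplotype: m-vector $h = \langle 0, 1, \dots, 0 \rangle$ over $\{0,1\}^m$
- Genotype: m-sequence $g = \langle \{0,1\}, \dots, \{0,0\}, \{1,1\} \rangle$ over $\{0, 1, *\}$, written $g = \langle *, \dots, 0, \dots, 1 \rangle$
Def.
Haplotypes $\langle h, k \rangle$ solve genotype $g$ iff
- $g(i) = *$ implies $h(i) \neq k(i)$
- $h(i) = k(i) = g(i)$ otherwise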
6Examples
$g$ is solved by $\langle k, h \rangle$:
$k = \langle 0,0,1,1,0,1,1 \rangle$
$h = \langle 0,1,1,0,0,1,1 \rangle$
Clark inference rule: given $g$ and a known haplotype $k$, infer $h$
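The definition of "solve" and the single inference step can be checked mechanically. The following is a minimal sketch, assuming heterozygous sites are encoded with the symbol `*`; the function names `solves` and `clark_infer` and the genotype `g` (derived here from the two haplotypes k and h of the example) are illustrative choices, not taken from the slides.

```python
HET = "*"  # assumed encoding of a heterozygous site

def solves(h, k, g):
    """True iff the haplotype pair <h, k> solves genotype g."""
    return len(h) == len(k) == len(g) and all(
        (a != b) if site == HET else (a == b == site)
        for a, b, site in zip(h, k, g)
    )

def clark_infer(g, k):
    """Given g and a known haplotype k, return the only h such that
    <k, h> solves g, or None if k is incompatible with g."""
    h = []
    for site, b in zip(g, k):
        if site == HET:
            h.append(1 - b)   # heterozygous: h must differ from k
        elif site == b:
            h.append(b)       # homozygous: h must agree with k and g
        else:
            return None       # k contradicts g at a homozygous site
    return h

k = [0, 0, 1, 1, 0, 1, 1]
h = [0, 1, 1, 0, 0, 1, 1]
g = [0, HET, 1, HET, 0, 1, 1]   # genotype obtained by mating k and h

print(solves(h, k, g))
print(clark_infer(g, k) == h)
```

Both calls print `True`: the pair solves g, and the inference step recovers h from g and k.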
7Haplotype inference: the general problem
- Problem HI
- Instance: a set $G = \{g_1, \dots, g_m\}$ of genotypes and a set $H = \{h_1, \dots, h_n\}$ of haplotypes
- Solution: a set $H'$ of haplotypes that solves each genotype $g$ in $G$, s.t. $H \subseteq H'$
$H'$ derives from an inference RULE
8Types of inference rules
- Clark's rule: haplotypes solve g by an iterative rule
- Gusfield coalescent model: haplotypes are related to genotypes by a tree model
- Pedigree data: haplotypes are related to genotypes by a directed graph
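The first item, Clark's iterative rule, can be sketched as a greedy loop: start from the haplotypes that are forced by genotypes with at most one heterozygous site, then repeatedly use a known haplotype to resolve an unresolved genotype. This is a simplified illustration under assumptions: `*` encodes a heterozygous site, genotypes are scanned in input order (the outcome depends on that order), and the names `infer` and `clark` are hypothetical.

```python
HET = "*"  # assumed encoding of a heterozygous site

def infer(g, k):
    """Complement of known haplotype k with respect to genotype g, or None."""
    h = []
    for s, b in zip(g, k):
        if s == HET:
            h.append(1 - b)
        elif s == b:
            h.append(b)
        else:
            return None
    return tuple(h)

def clark(G):
    """Greedy sketch of the iterative rule. Returns (known, unresolved)."""
    known, unresolved = [], []
    for g in G:
        if sum(s == HET for s in g) <= 1:
            # At most one heterozygous site: the solving pair is forced.
            for fill in (0, 1):
                x = tuple(fill if s == HET else s for s in g)
                if x not in known:
                    known.append(x)
        else:
            unresolved.append(g)
    progress = True
    while unresolved and progress:
        progress = False
        for g in list(unresolved):
            for k in known:
                h = infer(g, k)
                if h is not None:
                    if h not in known:
                        known.append(h)
                    unresolved.remove(g)
                    progress = True
                    break
    return known, unresolved

G = [[0, 1, 0, 0], [0, HET, HET, 0], [HET, 0, 1, HET]]
known, unresolved = clark(G)
print(known)       # [(0, 1, 0, 0), (0, 0, 1, 0), (1, 0, 1, 1)]
print(unresolved)  # []
```

When no genotype is unambiguous, the loop cannot start and every genotype is returned as unresolved.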
9Mendelian law and Recombination
[Figure: Mendelian inheritance and recombination; panels Father, Mother, Parent, Child; haplotype labels A, B, C, D]
10Pedigree
- Pedigree, nuclear family, founder
11Pedigree
- Pedigree, nuclear family, founder
[Figure: pedigree graph; labels Father, Mother, Children, Founders, Nuclear family, Family trio, Loop, ID Num, Genotypes]
12Haplotyping from genotypes: the problem and methods
- Problem
  - Input: genotype data (possibly with missing values).
  - Output: haplotypes.
- Input data
  - Data with pedigree (dependent).
  - Data without pedigree info (independent).
- Statistical methods
  - Find the most likely haplotypes based on genotype data.
  - Adv: solid theoretical bases
  - Disadv: computation intensive
- Rule-based methods
  - Define rules based on some plausible assumptions and find those haplotypes consistent with these rules.
  - Adv: usually simple, thus very fast
  - Disadv: no numerical assessment of the reliability of the results
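13HI by the perfect phylogeny model
Root: 00000
G: $g_1 = \langle 0,1,*,*,1 \rangle$, $g_2 = \langle *,0,0,0,1 \rangle$
H: $\langle 1,0,0,0,1 \rangle$, $\langle 0,0,0,0,1 \rangle$, $\langle 0,1,1,0,1 \rangle$
Genotypes are the mating of haplotypes in a tree
Given G, find H and T that explain G!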
14Perfect Phylogeny models
- Input data: 0-1 matrix A of characters (columns) and species (rows)
- Output data: phylogeny for A
     c1 c2 c3 c4 c5
s1    1  1  0  0  0
s2    0  0  1  0  0
s3    1  1  0  0  1
s4    0  0  1  1  0
[Figure: tree with root R; path c3c4]
15Perfect phylogeny
Def.
A pp T for a 0-1 matrix A:
- each row si labels exactly one leaf of T
- each column cj labels exactly one edge of T
- each internal edge is labelled by at least one column cj
- row si gives the 0,1 path from the root to si
Path c3c4: 0 0 1 1
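The last condition of the definition can be made concrete: when A has a pp with an all-zero root, the columns holding a 1 in row si are exactly the edge labels on the path from the root to leaf si, and edges closer to the root cover more leaves. The sketch below assumes this directed setting and uses the 4 x 5 matrix of the previous slide; the name `root_path` is illustrative.

```python
def root_path(A, i):
    """Column labels on the path from the root to the leaf of row i,
    root side first. Assumes A has a pp rooted at the all-zero vector."""
    n = len(A)
    size = lambda j: sum(A[r][j] for r in range(n))  # leaves below edge cj
    cols = [j for j, bit in enumerate(A[i]) if bit == 1]
    cols.sort(key=lambda j: (-size(j), j))           # larger clusters first
    return [f"c{j + 1}" for j in cols]

A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(root_path(A, 3))  # ['c3', 'c4']
```

For row s4 this reproduces the path c3c4; identical columns such as c1 and c2 label the same edge and are listed by index.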
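16pp model: another view
$L(x)$: cluster of node $x$ = set of leaves of the subtree $T_x$
A pp is associated to a tree-family $(S, C)$ with
$S = \{s_1, \dots, s_n\}$, $C = \{S' \subseteq S : S' \text{ is a cluster}\}$, s.t.
$\forall X, Y \in C$, if $X \cap Y \neq \emptyset$ then $X \subseteq Y$ or $Y \subseteq X$.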
17pp: another view
A tree-family (S,C) is represented by a 0-1 matrix (columns ci, rows sj):
0 1 0 0 0
0 0 1 0 0
1 1 0 0 1
0 0 1 1 0
- for each set in C there is at least one column
Lemma: A 0-1 matrix has a pp iff it represents a tree-family
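The lemma suggests a direct, if not the fastest, test: read each column as the set of rows holding a 1 and check that any two such sets are either disjoint or nested. The sketch below is a quadratic-in-columns illustration of that condition, not one of the algorithms cited in these slides; `clusters` and `is_tree_family` are illustrative names.

```python
from itertools import combinations

def clusters(A):
    """One cluster per column: the set of row indices holding a 1."""
    n, m = len(A), len(A[0])
    return [frozenset(i for i in range(n) if A[i][j] == 1) for j in range(m)]

def is_tree_family(C):
    """True iff any two clusters are disjoint or one contains the other."""
    return all(not (X & Y) or X <= Y or Y <= X for X, Y in combinations(C, 2))

A = [
    [0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(is_tree_family(clusters(A)))                          # True
print(is_tree_family(clusters([[1, 0], [1, 1], [0, 1]])))   # False
```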
18Haplotyping by the pp
- A 0-1 matrix B represents the phylogenetic tree for a set H of haplotypes
- si: haplotype
- ci: SNP
- A 0-1 switch in position i occurs only once in the tree!
[Figure: tree with nodes 00000 and 01000; edge label: SNP site]
19Haplotyping and the pp: observations
- The root of T may not be the haplotype 000000
- 0-1 switch or 1-0 switch (directed case)
[Figure: trees illustrating a 0-1 switch and a 1-0 switch; node labels 00000, 00011, 01000, 01001, 01010, 01100, 11000, 11001, 11010]
20HI problem in the pp model
- Input data: a 0-1-* matrix B ($n \times m$) of genotypes G
- Output data: a 0-1 matrix B' ($2n \times m$) of haplotypes s.t.
  - (1) each $g \in G$ is solved by a pair of rows $\langle h, k \rangle$ in B'
  - (2) B' has a pp (tree family)
DECISION problem
21 An example
Genotypes B:
a  * *
b  0 *
c  1 0
Haplotypes B':
a  1 0
a  0 1
b  0 1
b  0 0
c  1 0
c  1 0
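On very small inputs the decision problem can be explored by brute force: enumerate every way of resolving the heterozygous sites of each genotype and keep the first 2n x m haplotype matrix that forms a tree-family. The sketch below takes exponential time and is only an illustration of the problem statement, not any of the published PPH algorithms; it assumes `*` encodes a heterozygous site and an all-zero root, and the function names are hypothetical.

```python
from itertools import combinations, product

HET = "*"  # assumed encoding of a heterozygous site

def resolutions(g):
    """All unordered haplotype pairs <h, k> that solve genotype g."""
    het = [i for i, s in enumerate(g) if s == HET]
    if not het:
        return [(list(g), list(g))]
    out = []
    # Fix h = 0 at the first heterozygous site so no pair is listed twice.
    for bits in product([0, 1], repeat=len(het) - 1):
        h, k = list(g), list(g)
        for i, b in zip(het, (0,) + bits):
            h[i], k[i] = b, 1 - b
        out.append((h, k))
    return out

def has_pp(B):
    """Tree-family test on the columns of B (all-zero root assumed)."""
    cols = [frozenset(i for i, row in enumerate(B) if row[j] == 1)
            for j in range(len(B[0]))]
    return all(not (X & Y) or X <= Y or Y <= X
               for X, Y in combinations(cols, 2))

def pph_brute_force(G):
    """Return a 2n x m haplotype matrix with a pp, or None if none exists."""
    for choice in product(*(resolutions(g) for g in G)):
        B = [hap for pair in choice for hap in pair]
        if has_pp(B):
            return B
    return None

G = [[HET, HET], [0, HET], [1, 0]]
print(pph_brute_force(G))
print(pph_brute_force([[1, HET], [HET, 1]]))  # None
```

The second instance is a "no" instance: both genotypes are forced to use 11 together with 10 and 01, which creates two overlapping, non-nested clusters.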
22The pph problem: solutions
- An undirected algorithm: Gusfield, Recomb 2002
- An $O(nm^2)$ algorithm: Karp et al., Recomb 2003
- A linear-time $O(nm)$ algorithm?
  - Optimal algorithm
- A related problem: the incomplete directed pp (IDP)
  - Inferring a pp from a 0-1-? matrix
  - $O(nm + k \log^2(n + m))$ algorithm: I. Pe'er, T. Pupko, R. Shamir, R. Sharan, SIAM 2004
23IDP problem
Instance: a 0-1-? matrix A.
Solution: resolve each ? into 0 or 1 to obtain a matrix A' and a pp for A', or say that no pp exists.
- OPEN PROBLEM: find an optimal algorithm?
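To make the instance/solution pair concrete, the sketch below completes the `?` entries by exhaustive search and tests each completion with the tree-family condition. It is exponential in the number of missing entries and is not the algorithm of Pe'er et al.; the encoding of missing entries as the string `"?"` and the function names are assumptions for illustration.

```python
from itertools import combinations, product

def has_pp(A):
    """Tree-family test on the columns of a complete 0-1 matrix."""
    cols = [frozenset(i for i, row in enumerate(A) if row[j] == 1)
            for j in range(len(A[0]))]
    return all(not (X & Y) or X <= Y or Y <= X
               for X, Y in combinations(cols, 2))

def idp_brute_force(A):
    """Return a completion of A that has a pp, or None if no pp exists."""
    holes = [(i, j) for i, row in enumerate(A)
             for j, v in enumerate(row) if v == "?"]
    for bits in product([0, 1], repeat=len(holes)):
        B = [list(row) for row in A]
        for (i, j), b in zip(holes, bits):
            B[i][j] = b
        if has_pp(B):
            return B
    return None

print(idp_brute_force([[1, "?"], [1, 1], ["?", 1]]))  # [[1, 0], [1, 1], [1, 1]]
print(idp_brute_force([[1, 0], [1, 1], [0, 1], ["?", 0]]))  # None
```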
24Decision algorithms for incomplete pp
- Based on
  - Characterization of a 0-1 matrix A that has a pp
    - Tree family
    - Forbidden submatrix: gives a "no" certificate
Bipartite graph $G(A) = (S, C, E)$ with $E = \{(s_i, c_j) : b_{ij} = 1\}$
Forbidden subgraph (columns X, Y):
X Y
1 0
1 1
0 1
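A "no" certificate can be produced by scanning all column pairs for the three forbidden row patterns 10, 11 and 01. The sketch below does this in $O(nm^2)$ time for a complete 0-1 matrix; the function name `no_certificate` and the shape of the returned tuple are illustrative choices.

```python
from itertools import combinations

def no_certificate(A):
    """Return (X, Y, r10, r11, r01): two column indices and three row
    indices showing the forbidden submatrix, or None if there is none."""
    m = len(A[0])
    for X, Y in combinations(range(m), 2):
        first_row = {}
        for r, row in enumerate(A):
            # Remember the first row showing each pattern on columns X, Y.
            first_row.setdefault((row[X], row[Y]), r)
        if all(p in first_row for p in [(1, 0), (1, 1), (0, 1)]):
            return X, Y, first_row[(1, 0)], first_row[(1, 1)], first_row[(0, 1)]
    return None

print(no_certificate([[1, 0], [1, 1], [0, 1]]))  # (0, 1, 0, 1, 2)
print(no_certificate([[1, 1, 0], [0, 0, 1], [1, 0, 0]]))  # None
```

Returning the witness rows and columns, rather than a plain boolean, lets the caller show why the matrix has no pp.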
25Test: does a 0-1 matrix A have a pp?
- $O(nm)$ algorithm (Gusfield 1991)
- Steps
  - Given A, order $c_1, \dots, c_m$ as (decreasing) binary numbers, obtaining A'
  - Let $L(i,j) = k$, where $k = \max\{l < j : A'[i,l] = 1\}$
  - Let $\mathrm{index}(j) = \max_i L(i,j)$
  - Then apply the theorem.
TH. A' has a pp iff $L(i,j) = \mathrm{index}(j)$ for each $(i,j)$ s.t. $A'[i,j] = 1$
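The steps above can be sketched as follows. Two points are assumptions of this illustration rather than statements from the slide: L(i,j) is taken to be 0 when row i has no 1 to the left of column j, and columns are sorted with Python's comparison sort, so the sketch runs in O(nm log m) rather than O(nm) (a radix sort would be needed for the stated bound). The name `has_perfect_phylogeny` is hypothetical.

```python
def has_perfect_phylogeny(A):
    """Test the condition L(i,j) = index(j) on the column-sorted matrix."""
    n, m = len(A), len(A[0])
    # Step 1: sort columns as binary numbers (top row = most significant bit),
    # largest first.
    order = sorted(range(m),
                   key=lambda j: [A[i][j] for i in range(n)],
                   reverse=True)
    M = [[A[i][j] for j in order] for i in range(n)]
    # Step 2: L[i][j] = position (1-based) of the nearest 1 to the left of
    # column j in row i, or 0 if there is none. Only set where M[i][j] == 1.
    L = [[0] * m for _ in range(n)]
    index = [0] * m
    for i in range(n):
        last = 0
        for j in range(m):
            if M[i][j] == 1:
                L[i][j] = last
                index[j] = max(index[j], last)  # Step 3: column maximum
                last = j + 1
    # Step 4: apply the theorem.
    return all(L[i][j] == index[j]
               for i in range(n) for j in range(m) if M[i][j] == 1)

A = [
    [1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 0, 0, 1],
    [0, 0, 1, 1, 0],
]
print(has_perfect_phylogeny(A))                          # True
print(has_perfect_phylogeny([[1, 0], [1, 1], [0, 1]]))   # False
```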
26Idea
27The IDP algorithm
[Figure: labels C, c, s1, s2, s3]
28Other HI problems via the pp model
- Incomplete 0-1-*-? matrix because of missing data
  - haplotype pp (Ihpp): haplotype rows
  - genotype pp (Igpp): genotype rows
- Algorithms
  - Ihpp: IDP given a row as a root (polynomial time); NP-complete otherwise
  - Igpp: has a polynomial solution under the rich data hypothesis (Karp et al. Recomb 2004, ICALP 2004); NP-complete otherwise
29HI problem and other models
- Haplotype inference in pedigree data under the
recombination model
30Pedigree graph
[Figure: pedigree graph with nodes father, mother, child]
31Haplotype inference in pedigree
[Figure: pedigree example; node labels 00 01 10, 10 11 00, 01 11 01]
32Problems
- MPT-MRHI (Pedigree tree multi-mating minimum recombination HI)
- SPT-MRHI (Pedigree tree single-mating minimum recombination HI)
NP-complete even if the graph is acyclic, but with an unbounded number of children
OPEN
33Conclusions
34References