Title: Haplotype Phasing using Semidefinite Programming
1 Haplotype Phasing using Semidefinite Programming
- Parag Namjoshi
- CSEE Department
- University of Maryland Baltimore County
Joint work with Konstantinos Kalpakis
2Outline
- Biology Review
- Motivation
- Previous work
- Our contribution
- Experimental results
- Conclusions
3Biology Review
- living systems are composed of cells
- the code for the creation of the cells is packed
in a molecule called DNA. - DNA consists of four nucleic acids Adenine,
Cytosine, Guanine, and Thymine arranged as
complementary strands of a double helix. - DNA strand string of A,C,G, Ts.
4Chromosomes
- the genome is arranged as set of distinct
chromosomes.
- mammals are diploids
- humans have 22 x
- and y chromosomes.
- chromosomes occur in
- homologous pairs
- one homologous chromosome
- is inherited from each parent
-
- homologous chromosomes contain the same genes in
the same order (up to mutations)
5Single Nucleotide Polymorphisms.
- Single Nucleotide Polymorphism (SNP) mutation
of a single base. - evidence suggests that in humans
- 90 of variation is due to SNPs
- DNA has long conserved regions punctuated by SNPs
- there is one SNP in approximately 1000 bases
- most SNPS are bi-allelic
- at any given locus, only two of the four possible
nucleotides are present in 95 of the population - the restriction (projection) of a DNA strand to
SNP sites is a haplotype
6What are Genotypes?
- the genotype of diploid organisms is the
conflation of the inherited haplotypes
7Genotype Haplotype Std. Representation
- genotypes and haplotypes can be represented as a
0,1,2 vectors - independently for each site
- identify each one of the two letters that appear
in it with 0 or 1 - replace each homozygous site with 0/1 using the
mapping above - replace heterozygous sites with 2
8Haplotypes vs. Genotypes
- large scale polymorphism studies such as Linkage
Disequilibrium need haplotype information - however, experimentally
- it is expensive to segregate the haplotypes of
the individuals - it is easier to observe the genotypes of those
individuals - can we find haplotypes from the genotypes
computationally? - a genotype with h heterozygous sites can be
explained (phased) by 2h-1 different haplotype
pairs - how do you choose among them?
9Haplotype Phasing with Parsimony
- in Population haplotyping, given genotypes from
different individuals we want to find a set of
haplotypes which resolve all the genotypes - Recall that there can be many such solutions
- Experimental evidence suggests that the number of
such haplotypes is small - HPP Haplotype Phasing Problem with Pure
Parsimony - Given a set of genotypes, find a minimum size set
of haplotypes which conflate to produce the given
genotypes - other criteria for choosing among possible sets
of haplotypes are - perfect phylogeny, minimum total pairwise
distance, minimum diameter, etc - we focus on HPP problem
- Lancia, Pinotti, and Rizzi proved that the HPP is
NPcomplete as well as APXhard
10Clarks Rule
- Clark (1990) describes a greedy inference rule to
find a small set of haplotypes resolving a set of
genotypes - Starting with a set of haplotypes H that resolves
all the homozygous genotypes, do the following - for each unresolved genotype g
- if there is a pair (h, h) that resolves g with h
in H, then add h to H, else stop - the solution obtained is sensitive to the order
in which genotypes are resolved - Clarks rule may terminate with some genotypes
unresolved (orphans) - The rule can be modified to include a pair of
haplotypes that resolve an orphan genotype, and
continue as before
11Gusfields TIP
- Gusfield (1999) introduces the TIP approach
- enumerate all distinct haplotypes that can be
used to resolve any single heterozygous genotype - solve an Integer linear Program (IP) to select a
minimum size set haplotypes from the enumerated
haplotypes that explains the genotypes - TIP uses O(2L n) variables and constraints, where
L is the maximum number of heterozygous loci of
any genotype - Gusfield describes a number of important
improvements to the basic approach above that
improve performance
12Harrower-Brown IP
- Harrower and Brown give an alternate 0-1 IP for
the HPP problem (HB-IP) - explain the n genotypes with 2n haplotypes (not
necessarily distinct) - the number of distinct haplotypes used are
minimized - the number of variables and constraints is
polynomial in n, m
13The QIP approach - Outline
- arithmetic representation of genotypes
- semidefinite programming (SDP)
- Quadratic Integer Program (QIP) for HPP
- a semidefinite programming based heuristic to
solve QIP - experimental results
- concluding remarks
14Arithmetic Representation of Genotypes
- represent each genotype g as a vector d with
- each homozygous locus takes value 0 or 2 iff it
was 0 or 1 in g - each heterozygous locus takes value 1
- conflation can now be replaced by addition
- if haplotypes h1 and h2 explain genotype d, then
- d h1 h2
- we call d an arithmetic genotype
g 0 1 2 h1 0 1 0 h2 0 1 1
d 0 2 1 h1 0 1 0 h2 0 1 1
g ? d
15Arithmetic Genotypes
- let ? be n x m matrix with the arithmetic
genotypes as rows - let H be k x m matrix with haplotypes as rows
- if haplotypes in H resolve ?, then
- ? S H
- where S is a n x k 0-1-2 matrix
- the row of S for a homozygous genotype has a
single 2 - all other rows have exactly two 1s
- we call S a selector matrix
- ith row of S selects two haplotypes (rows of
H) to explain ith genotype
16The k-HPP Problem
- the k-HPP problem
- Given nxm matrix ? representing a set of n
distinct genotypes each with m loci - Find an nxk 0-1-2 selector matrix S and a kxm 0-1
haplotype matrix H such that - ? S H
- S has as few non-zero columns as possible
- all row-sums of S are 2
- HPP is equivalent to k-HPP with k2n
- lower Bounds for HPP
- is a well known lower bound
- Lemma rank(?) is a lower bound for HPP
- Consider an optimal solution S, H
- Since ? S H, we know that rank(?)
min(rank(S), rank(H)), and thus H must have at
least rank(?) distinct rows (haplotypes)
17Finding H given ? and S
- given ? and H to find an S is easy
- given ? and S find an H by solving a 2-SAT
problem - If genotype i is resolved by haplotypes t and l,
then for each locus j, add following clauses - If di,j 0, add two clauses (ht,j) (hl,j)
- If di,j 2, add two clauses (ht,j) (hl,j)
- If di,j 1, add clauses (ht,j V hl,j ) (ht,j
V hl,j) - Only one of the ht,j ,hl,j must both be 1
- 2-SAT problem
- has km variables and 2nm clauses
- can be solved in (almost) linear time
- any satisfying assignment gives a resolution of
the genotypes
18Quadratic, Vector, and Semi-definite Programs
- Quadratic Integer Program
- Optimize a quadratic objective function subject
to quadratic constraints on integer variables - Strict, when each term has total degree 0 or 2
- Vector program
- optimize a linear objective function of inner
products of vector variables subject to linear
constraints on inner products of those variables - Strict quadratic programs lead to vector programs
(products of variables are mapped to inner
products of corresponding vectors) - SDP program
- optimize a linear objective function of the
elements of a matrix X subject to - linear constraints on the elements of X
- X being a positive semi-definite matrix
- Vector programs lead to SDP (X is the matrix of
all vector inner products) - SDP programs can be solved in polynomial-time
with small numerical errors, thus - solving vector programs, thus
- solving relaxations of strict Quadratic Integer
programs - construct an approximate solution to a quadratic
integer program from a solution of its
relaxation, obtained via SDP
19Quadratic Integer Program for the k-HPP
Subject to
20QIP Heuristic SDPRoundingBacktracking
- recursively solve k-HPP
- using SDP compute vectors for the variables of
QIP - for each selector variable Si,j, compute
- PSi,jprobability that a random hyperplane
separates the vectors of Si,j and z variables
(ala MAX-CUT) - round to 1 the Si,j with the highest PSi,j
- residual k-HPPk-HPP problem with the rounded
Si,js fixed to their rounded value - if the residual k-HPP is infeasible
- round Si,j to 0 instead
- if the new residual k-HPP is still infeasible
- backtrack by returning infeasible
- recursively solve the residual k-HPP
21Experiments
- we experiment with three approaches for the HPP
problem - Clarks rule
- LP relaxation of Gusfields TIP scheme with
simple rounding - the QIP heuristic for kHPP with k 2n
- The MATLAB package SDPT 3.02 is used to solve the
SDP relaxation of the problem - all experiments are done on a single CPU MATLAB
on a Dual Xeon 2.4 Ghz desktop with 1GB memory
22Experimental Datasets
- we use synthetic datasets A and B
- each with 20 instances for each triplet (n, m, k)
(5, 5, 5), (8, 8, 8), (10, 10, 10), and (15,
15, 15) (and for B, recombination levels ? 0,
16 and 40) - generate instances of the HPP problem as follows
- randomly mate k haplotypes with m loci to produce
n genotypes - generation of haplotypes for dataset A
- each locus of k haplotypes takes value 0/1 with
probability ½ independent of other loci and other
genotypes - generation of haplotypes for dataset B
- Use Hudsons program to generate haplotypes with
these parameters - diploid population of size 106
- mutation rate 1.5 10-6
- recombination levels ? 0, 16 and 40
corresponding to crossover probabilities 0, 4
10-6, and 10-5
23Experimental Results
24QIP Extensions
- QIP can be extended to handle many variants of
basic k-HPP problem, such as - partial Genotypes
- Some loci in some genotypes are unknown
- shared haplotypes
- Prior knowledge of shared haplotypes
- allowing for erroneous genotypes and loci editing
- allowing for outlier genotypes
25Concluding Remarks
- developed arithmetic formulation for the HPP
problem - provides new lower bound
- yields simple quadratic IP (QIP)
- QIP can be extended to handle many variants,
incorporate prior information etc - SDP relaxation of QIP that can be solved in
polynomial time - SDProundingbacktracking gives QIP heuristic
- experimentally
- Demonstrate competitiveness of QIP heuristic vs
Clarks rule and Gusfields TIP relaxation - Show that rank of the genotypes is a tighter
lower bound than - future work
- Analysis of worst-case performance ratio of the
QIP heuristic - Devise algorithms that scale better
26Thank You !
Questions ?