Haplotype Phasing using Semidefinite Programming - PowerPoint PPT Presentation

About This Presentation

Title:

Haplotype Phasing using Semidefinite Programming

Description:

the code for the creation of the cells is packed in a molecule called DNA. ... a minimum size set of haplotypes which conflate to produce the given genotypes ... – PowerPoint PPT presentation

Number of Views:74

Avg rating:3.0/5.0

Slides: 27

Provided by: paragna

Learn more at: https://redirect.cs.umbc.edu

Category:

more less

Transcript and Presenter's Notes

Title: Haplotype Phasing using Semidefinite Programming

1
Haplotype Phasing using Semidefinite Programming

Parag Namjoshi
CSEE Department
University of Maryland Baltimore County

Joint work with Konstantinos Kalpakis
2
Outline

Biology Review
Motivation
Previous work
Our contribution
Experimental results
Conclusions

3
Biology Review

living systems are composed of cells
the code for the creation of the cells is packed
in a molecule called DNA.
DNA consists of four nucleic acids Adenine,
Cytosine, Guanine, and Thymine arranged as
complementary strands of a double helix.
DNA strand string of A,C,G, Ts.

4
Chromosomes

the genome is arranged as set of distinct
chromosomes.

mammals are diploids
humans have 22 x
and y chromosomes.
chromosomes occur in
homologous pairs
one homologous chromosome
is inherited from each parent
homologous chromosomes contain the same genes in
the same order (up to mutations)

5
Single Nucleotide Polymorphisms.

Single Nucleotide Polymorphism (SNP) mutation
of a single base.
evidence suggests that in humans
90 of variation is due to SNPs
DNA has long conserved regions punctuated by SNPs
there is one SNP in approximately 1000 bases
most SNPS are bi-allelic
at any given locus, only two of the four possible
nucleotides are present in 95 of the population
the restriction (projection) of a DNA strand to
SNP sites is a haplotype

6
What are Genotypes?

the genotype of diploid organisms is the
conflation of the inherited haplotypes

7
Genotype Haplotype Std. Representation

genotypes and haplotypes can be represented as a
0,1,2 vectors
independently for each site
identify each one of the two letters that appear
in it with 0 or 1
replace each homozygous site with 0/1 using the
mapping above
replace heterozygous sites with 2

8
Haplotypes vs. Genotypes

large scale polymorphism studies such as Linkage
Disequilibrium need haplotype information
however, experimentally
it is expensive to segregate the haplotypes of
the individuals
it is easier to observe the genotypes of those
individuals
can we find haplotypes from the genotypes
computationally?
a genotype with h heterozygous sites can be
explained (phased) by 2h-1 different haplotype
pairs
how do you choose among them?

9
Haplotype Phasing with Parsimony

in Population haplotyping, given genotypes from
different individuals we want to find a set of
haplotypes which resolve all the genotypes
Recall that there can be many such solutions
Experimental evidence suggests that the number of
such haplotypes is small
HPP Haplotype Phasing Problem with Pure
Parsimony
Given a set of genotypes, find a minimum size set
of haplotypes which conflate to produce the given
genotypes
other criteria for choosing among possible sets
of haplotypes are
perfect phylogeny, minimum total pairwise
distance, minimum diameter, etc
we focus on HPP problem
Lancia, Pinotti, and Rizzi proved that the HPP is
NPcomplete as well as APXhard

10
Clarks Rule

Clark (1990) describes a greedy inference rule to
find a small set of haplotypes resolving a set of
genotypes
Starting with a set of haplotypes H that resolves
all the homozygous genotypes, do the following
for each unresolved genotype g
if there is a pair (h, h) that resolves g with h
in H, then add h to H, else stop
the solution obtained is sensitive to the order
in which genotypes are resolved
Clarks rule may terminate with some genotypes
unresolved (orphans)
The rule can be modified to include a pair of
haplotypes that resolve an orphan genotype, and
continue as before

11
Gusfields TIP

Gusfield (1999) introduces the TIP approach
enumerate all distinct haplotypes that can be
used to resolve any single heterozygous genotype
solve an Integer linear Program (IP) to select a
minimum size set haplotypes from the enumerated
haplotypes that explains the genotypes
TIP uses O(2L n) variables and constraints, where
L is the maximum number of heterozygous loci of
any genotype
Gusfield describes a number of important
improvements to the basic approach above that
improve performance

12
Harrower-Brown IP

Harrower and Brown give an alternate 0-1 IP for
the HPP problem (HB-IP)
explain the n genotypes with 2n haplotypes (not
necessarily distinct)
the number of distinct haplotypes used are
minimized
the number of variables and constraints is
polynomial in n, m

13
The QIP approach - Outline

arithmetic representation of genotypes
semidefinite programming (SDP)
Quadratic Integer Program (QIP) for HPP
a semidefinite programming based heuristic to
solve QIP
experimental results
concluding remarks

14
Arithmetic Representation of Genotypes

represent each genotype g as a vector d with
each homozygous locus takes value 0 or 2 iff it
was 0 or 1 in g
each heterozygous locus takes value 1
conflation can now be replaced by addition
if haplotypes h1 and h2 explain genotype d, then
d h1 h2
we call d an arithmetic genotype

g 0 1 2 h1 0 1 0 h2 0 1 1
d 0 2 1 h1 0 1 0 h2 0 1 1
g ? d
15
Arithmetic Genotypes

let ? be n x m matrix with the arithmetic
genotypes as rows
let H be k x m matrix with haplotypes as rows
if haplotypes in H resolve ?, then
? S H
where S is a n x k 0-1-2 matrix
the row of S for a homozygous genotype has a
single 2
all other rows have exactly two 1s
we call S a selector matrix
ith row of S selects two haplotypes (rows of
H) to explain ith genotype

16
The k-HPP Problem

the k-HPP problem
Given nxm matrix ? representing a set of n
distinct genotypes each with m loci
Find an nxk 0-1-2 selector matrix S and a kxm 0-1
haplotype matrix H such that
? S H
S has as few non-zero columns as possible
all row-sums of S are 2
HPP is equivalent to k-HPP with k2n
lower Bounds for HPP
is a well known lower bound
Lemma rank(?) is a lower bound for HPP
Consider an optimal solution S, H
Since ? S H, we know that rank(?)
min(rank(S), rank(H)), and thus H must have at
least rank(?) distinct rows (haplotypes)

17
Finding H given ? and S

given ? and H to find an S is easy
given ? and S find an H by solving a 2-SAT
problem
If genotype i is resolved by haplotypes t and l,
then for each locus j, add following clauses
If di,j 0, add two clauses (ht,j) (hl,j)
If di,j 2, add two clauses (ht,j) (hl,j)
If di,j 1, add clauses (ht,j V hl,j ) (ht,j
V hl,j)
Only one of the ht,j ,hl,j must both be 1
2-SAT problem
has km variables and 2nm clauses
can be solved in (almost) linear time
any satisfying assignment gives a resolution of
the genotypes

18
Quadratic, Vector, and Semi-definite Programs

Quadratic Integer Program
Optimize a quadratic objective function subject
to quadratic constraints on integer variables
Strict, when each term has total degree 0 or 2
Vector program
optimize a linear objective function of inner
products of vector variables subject to linear
constraints on inner products of those variables
Strict quadratic programs lead to vector programs
(products of variables are mapped to inner
products of corresponding vectors)
SDP program
optimize a linear objective function of the
elements of a matrix X subject to
linear constraints on the elements of X
X being a positive semi-definite matrix
Vector programs lead to SDP (X is the matrix of
all vector inner products)
SDP programs can be solved in polynomial-time
with small numerical errors, thus
solving vector programs, thus
solving relaxations of strict Quadratic Integer
programs
construct an approximate solution to a quadratic
integer program from a solution of its
relaxation, obtained via SDP

19
Quadratic Integer Program for the k-HPP
Subject to
20
QIP Heuristic SDPRoundingBacktracking

recursively solve k-HPP
using SDP compute vectors for the variables of
QIP
for each selector variable Si,j, compute
PSi,jprobability that a random hyperplane
separates the vectors of Si,j and z variables
(ala MAX-CUT)
round to 1 the Si,j with the highest PSi,j
residual k-HPPk-HPP problem with the rounded
Si,js fixed to their rounded value
if the residual k-HPP is infeasible
round Si,j to 0 instead
if the new residual k-HPP is still infeasible
backtrack by returning infeasible
recursively solve the residual k-HPP

21
Experiments

we experiment with three approaches for the HPP
problem
Clarks rule
LP relaxation of Gusfields TIP scheme with
simple rounding
the QIP heuristic for kHPP with k 2n
The MATLAB package SDPT 3.02 is used to solve the
SDP relaxation of the problem
all experiments are done on a single CPU MATLAB
on a Dual Xeon 2.4 Ghz desktop with 1GB memory

22
Experimental Datasets

we use synthetic datasets A and B
each with 20 instances for each triplet (n, m, k)
(5, 5, 5), (8, 8, 8), (10, 10, 10), and (15,
15, 15) (and for B, recombination levels ? 0,
16 and 40)
generate instances of the HPP problem as follows
randomly mate k haplotypes with m loci to produce
n genotypes
generation of haplotypes for dataset A
each locus of k haplotypes takes value 0/1 with
probability ½ independent of other loci and other
genotypes
generation of haplotypes for dataset B
Use Hudsons program to generate haplotypes with
these parameters
diploid population of size 106
mutation rate 1.5 10-6
recombination levels ? 0, 16 and 40
corresponding to crossover probabilities 0, 4
10-6, and 10-5

23
Experimental Results
24
QIP Extensions

QIP can be extended to handle many variants of
basic k-HPP problem, such as
partial Genotypes
Some loci in some genotypes are unknown
shared haplotypes
Prior knowledge of shared haplotypes
allowing for erroneous genotypes and loci editing
allowing for outlier genotypes

25
Concluding Remarks

developed arithmetic formulation for the HPP
problem
provides new lower bound
yields simple quadratic IP (QIP)
QIP can be extended to handle many variants,
incorporate prior information etc
SDP relaxation of QIP that can be solved in
polynomial time
SDProundingbacktracking gives QIP heuristic
experimentally
Demonstrate competitiveness of QIP heuristic vs
Clarks rule and Gusfields TIP relaxation
Show that rank of the genotypes is a tighter
lower bound than
future work
Analysis of worst-case performance ratio of the
QIP heuristic
Devise algorithms that scale better