Haplotype Phasing using Semidefinite Programming


1
Haplotype Phasing using Semidefinite Programming
  • Parag Namjoshi
  • CSEE Department
  • University of Maryland Baltimore County

Joint work with Konstantinos Kalpakis
2
Outline
  • Biology Review
  • Motivation
  • Previous work
  • Our contribution
  • Experimental results
  • Conclusions

3
Biology Review
  • living systems are composed of cells
  • the code for the creation of the cells is packed
    in a molecule called DNA
  • DNA consists of four nucleotide bases, Adenine,
    Cytosine, Guanine, and Thymine, arranged as
    complementary strands of a double helix
  • DNA strand = string of A's, C's, G's, and T's

4
Chromosomes
  • the genome is arranged as a set of distinct
    chromosomes
  • mammals are diploid
  • humans have 22 autosomes plus the X and Y
    chromosomes
  • chromosomes occur in homologous pairs
  • one homologous chromosome is inherited from each
    parent
  • homologous chromosomes contain the same genes in
    the same order (up to mutations)

5
Single Nucleotide Polymorphisms.
  • Single Nucleotide Polymorphism (SNP) = mutation
    of a single base
  • evidence suggests that in humans
  • 90% of variation is due to SNPs
  • DNA has long conserved regions punctuated by SNPs
  • there is one SNP in approximately 1000 bases
  • most SNPs are bi-allelic
  • at any given locus, only two of the four possible
    nucleotides are present in 95% of the population
  • the restriction (projection) of a DNA strand to
    SNP sites is a haplotype

6
What are Genotypes?
  • the genotype of diploid organisms is the
    conflation of the inherited haplotypes

7
Genotype & Haplotype Std. Representation
  • genotypes and haplotypes can be represented as
    {0,1,2} vectors
  • independently for each site
  • identify each one of the two letters that appear
    in it with 0 or 1
  • replace each homozygous site with 0/1 using the
    mapping above
  • replace heterozygous sites with 2
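
The encoding steps above can be sketched in a few lines of Python. The function name `encode_genotype`, the per-site allele table, and the sample letters are all hypothetical choices made for illustration; the slides only fix the rule (homozygous sites become 0 or 1, heterozygous sites become 2).

```python
def encode_genotype(site_pairs, alleles):
    """Encode one genotype as a {0,1,2} vector.

    site_pairs: the two letters observed at each SNP site, e.g. ("C", "T").
    alleles:    per site, the (letter mapped to 0, letter mapped to 1).
    """
    encoded = []
    for (a, b), (zero, one) in zip(site_pairs, alleles):
        to_bit = {zero: 0, one: 1}
        # Homozygous site: both letters agree, so use the 0/1 mapping.
        # Heterozygous site: the letters differ, so write 2.
        encoded.append(to_bit[a] if a == b else 2)
    return encoded


alleles = [("A", "C"), ("T", "G"), ("C", "T")]    # hypothetical per-site mapping
genotype = [("A", "A"), ("G", "G"), ("C", "T")]   # hypothetical observed letters
print(encode_genotype(genotype, alleles))
```

The mapping of letters to 0 and 1 is arbitrary per site; it only has to be applied consistently to every genotype and haplotype in the data set.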

8
Haplotypes vs. Genotypes
  • large scale polymorphism studies such as Linkage
    Disequilibrium need haplotype information
  • however, experimentally
  • it is expensive to segregate the haplotypes of
    the individuals
  • it is easier to observe the genotypes of those
    individuals
  • can we find haplotypes from the genotypes
    computationally?
  • a genotype with h heterozygous sites can be
    explained (phased) by $2^{h-1}$ different haplotype
    pairs
  • how do you choose among them?
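
To make the count of candidate phasings concrete, the following sketch enumerates every unordered haplotype pair that explains a {0,1,2} genotype. The function name `phasings` is an illustrative assumption; pinning the first heterozygous site to 0 in the first haplotype is just one way to avoid listing each pair twice.

```python
from itertools import product


def phasings(genotype):
    """Return all unordered haplotype pairs that conflate to `genotype`."""
    het = [i for i, v in enumerate(genotype) if v == 2]
    if not het:
        # A fully homozygous genotype has exactly one explanation.
        h = tuple(genotype)
        return [(h, h)]
    pairs = []
    # Fix the first heterozygous site to 0 in h1 so (h1, h2) and (h2, h1)
    # are counted once; the other h - 1 sites are free.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site] = bit
            h2[site] = 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


for pair in phasings((2, 1, 2)):
    print(pair)
```

A genotype with three heterozygous sites yields four pairs, with four sites eight pairs, and so on, which is why a selection criterion such as parsimony is needed.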

9
Haplotype Phasing with Parsimony
  • in Population haplotyping, given genotypes from
    different individuals we want to find a set of
    haplotypes which resolve all the genotypes
  • Recall that there can be many such solutions
  • Experimental evidence suggests that the number of
    such haplotypes is small
  • HPP = Haplotype Phasing Problem with Pure
    Parsimony
  • Given a set of genotypes, find a minimum size set
    of haplotypes which conflate to produce the given
    genotypes
  • other criteria for choosing among possible sets
    of haplotypes are
  • perfect phylogeny, minimum total pairwise
    distance, minimum diameter, etc
  • we focus on HPP problem
  • Lancia, Pinotti, and Rizzi proved that HPP is
    NP-complete as well as APX-hard

10
Clark's Rule
  • Clark (1990) describes a greedy inference rule to
    find a small set of haplotypes resolving a set of
    genotypes
  • Starting with a set of haplotypes H that resolves
    all the homozygous genotypes, do the following
  • for each unresolved genotype g
  • if there is a pair (h, h') that resolves g with h
    in H, then add h' to H, else stop
  • the solution obtained is sensitive to the order
    in which genotypes are resolved
  • Clark's rule may terminate with some genotypes
    unresolved (orphans)
  • The rule can be modified to include a pair of
    haplotypes that resolve an orphan genotype, and
    continue as before
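
A minimal sketch of the greedy rule described above, assuming genotypes are given as {0,1,2} tuples. The function names and the choice to rescan the unresolved list until no progress is made are illustrative assumptions, not details taken from Clark's paper or from the slides; orphans are simply returned rather than resolved.

```python
def compatible(h, g):
    """True if haplotype h can be one of the two haplotypes explaining g."""
    return all(gv == 2 or gv == hv for hv, gv in zip(h, g))


def mate(h, g):
    """The haplotype h' such that (h, h') conflates to g."""
    return tuple(1 - hv if gv == 2 else hv for hv, gv in zip(h, g))


def clark(genotypes):
    """Greedy inference; returns (haplotypes found, orphan genotypes)."""
    known = []
    # Seed H with the haplotypes of the homozygous genotypes.
    for g in genotypes:
        if 2 not in g and g not in known:
            known.append(g)
    unresolved = [g for g in genotypes if 2 in g]
    progress = True
    while progress and unresolved:
        progress = False
        for g in list(unresolved):
            for h in known:
                if compatible(h, g):
                    other = mate(h, g)
                    if other not in known:
                        known.append(other)
                    unresolved.remove(g)
                    progress = True
                    break
    return known, unresolved


haplotypes, orphans = clark([(0, 1, 0), (0, 1, 2), (2, 1, 2)])
print(haplotypes, orphans)
```

Reordering the input list can change which haplotype is picked first for a genotype, which is the order sensitivity noted on the slide.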

11
Gusfield's TIP
  • Gusfield (1999) introduces the TIP approach
  • enumerate all distinct haplotypes that can be
    used to resolve any single heterozygous genotype
  • solve an Integer Linear Program (IP) to select a
    minimum-size set of haplotypes from the enumerated
    haplotypes that explains the genotypes
  • TIP uses $O(2^L n)$ variables and constraints, where
    L is the maximum number of heterozygous loci of
    any genotype
  • Gusfield describes a number of important
    improvements to the basic approach above that
    improve performance
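
The enumerate-then-select idea can be illustrated on tiny instances without an IP solver. The sketch below swaps the integer program for a plain exhaustive search over subsets of the enumerated haplotypes, so it is not Gusfield's formulation and only works for a handful of loci; all function names are hypothetical.

```python
from itertools import combinations, product


def candidate_pairs(genotype):
    """All unordered haplotype pairs explaining a {0,1,2} genotype."""
    het = [i for i, v in enumerate(genotype) if v == 2]
    if not het:
        h = tuple(genotype)
        return [(h, h)]
    pairs = []
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + bits):
            h1[site], h2[site] = bit, 1 - bit
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def min_haplotype_set(genotypes):
    """Smallest subset of enumerated haplotypes that explains every genotype."""
    pairs = [candidate_pairs(g) for g in genotypes]
    candidates = sorted({h for ps in pairs for p in ps for h in p})
    # Try subsets in order of increasing size; the first hit is optimal.
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            chosen = set(subset)
            if all(any(a in chosen and b in chosen for a, b in ps)
                   for ps in pairs):
                return list(subset)
    return []


best = min_haplotype_set([(0, 1, 0), (0, 1, 2), (2, 1, 2)])
print(len(best), best)
```

The number of candidate haplotypes grows exponentially with the number of heterozygous loci, which matches the variable count quoted for TIP and explains why the search above is only a teaching aid.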

12
Harrower-Brown IP
  • Harrower and Brown give an alternate 0-1 IP for
    the HPP problem (HB-IP)
  • explain the n genotypes with 2n haplotypes (not
    necessarily distinct)
  • the number of distinct haplotypes used is
    minimized
  • the number of variables and constraints is
    polynomial in n, m

13
The QIP approach - Outline
  • arithmetic representation of genotypes
  • semidefinite programming (SDP)
  • Quadratic Integer Program (QIP) for HPP
  • a semidefinite programming based heuristic to
    solve QIP
  • experimental results
  • concluding remarks

14
Arithmetic Representation of Genotypes
  • represent each genotype g as a vector d with
  • each homozygous locus takes value 0 or 2 iff it
    was 0 or 1 in g, respectively
  • each heterozygous locus takes value 1
  • conflation can now be replaced by addition
  • if haplotypes $h_1$ and $h_2$ explain genotype d, then
  • $d = h_1 + h_2$
  • we call d an arithmetic genotype

$g = (0, 1, 2)$, $h_1 = (0, 1, 0)$, $h_2 = (0, 1, 1)$
$d = (0, 2, 1)$, $h_1 = (0, 1, 0)$, $h_2 = (0, 1, 1)$
$g \leftrightarrow d$
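
The slide's example can be checked mechanically. This small sketch converts a {0,1,2} genotype to its arithmetic form and confirms that it equals the element-wise sum of the two haplotypes; the function names are illustrative.

```python
def to_arithmetic(g):
    """Map a {0,1,2} genotype to its arithmetic form."""
    # homozygous 0 -> 0, homozygous 1 -> 2, heterozygous (2) -> 1
    table = {0: 0, 1: 2, 2: 1}
    return [table[v] for v in g]


def add_haplotypes(h1, h2):
    """Element-wise sum of two 0/1 haplotypes."""
    return [a + b for a, b in zip(h1, h2)]


g, h1, h2 = [0, 1, 2], [0, 1, 0], [0, 1, 1]
d = to_arithmetic(g)
print(d, add_haplotypes(h1, h2))
```
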
15
Arithmetic Genotypes
  • let $\Delta$ be the $n \times m$ matrix with the arithmetic
    genotypes as rows
  • let H be the $k \times m$ matrix with haplotypes as rows
  • if the haplotypes in H resolve $\Delta$, then
  • $\Delta = S H$
  • where S is an $n \times k$ 0-1-2 matrix
  • the row of S for a homozygous genotype has a
    single 2
  • all other rows have exactly two 1s
  • we call S a selector matrix
  • ith row of S selects two haplotypes (rows of
    H) to explain ith genotype

16
The k-HPP Problem
  • the k-HPP problem
  • Given an $n \times m$ matrix $\Delta$ representing a set of n
    distinct genotypes each with m loci
  • Find an $n \times k$ 0-1-2 selector matrix S and a
    $k \times m$ 0-1 haplotype matrix H such that
  • $\Delta = S H$
  • S has as few non-zero columns as possible
  • all row-sums of S are 2
  • HPP is equivalent to k-HPP with $k = 2n$
  • lower bounds for HPP
  • [formula not preserved in the transcript] is a well known lower bound
  • Lemma: $\mathrm{rank}(\Delta)$ is a lower bound for HPP
  • Consider an optimal solution S, H
  • Since $\Delta = S H$, we know that
    $\mathrm{rank}(\Delta) \le \min(\mathrm{rank}(S), \mathrm{rank}(H))$,
    and thus H must have at least $\mathrm{rank}(\Delta)$
    distinct rows (haplotypes)
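
The rank lower bound is easy to evaluate. Below is a minimal sketch that computes the rank of an arithmetic genotype matrix with exact rational Gaussian elimination, so no numerical library is needed; the function name and the sample matrix are illustrative assumptions.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank over the rationals via Gaussian elimination."""
    M = [[Fraction(v) for v in row] for row in rows]
    n_cols = len(M[0]) if M else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(M)) if M[r][col] != 0), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        for r in range(rank + 1, len(M)):
            factor = M[r][col] / M[rank][col]
            if factor != 0:
                M[r] = [x - factor * y for x, y in zip(M[r], M[rank])]
        rank += 1
        if rank == len(M):
            break
    return rank


D = [[0, 2, 0], [0, 2, 1], [1, 2, 1]]   # arithmetic genotypes as rows
print(matrix_rank(D))
```

For this sample matrix the bound is 3, so any set of haplotypes that explains these three genotypes must contain at least three distinct haplotypes.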

17
Finding H given $\Delta$ and S
  • given $\Delta$ and H, finding an S is easy
  • given $\Delta$ and S, find an H by solving a 2-SAT
    problem
  • If genotype i is resolved by haplotypes t and l,
    then for each locus j, add the following clauses
  • If $d_{i,j} = 0$, add two clauses $(\lnot h_{t,j}) \land (\lnot h_{l,j})$
  • If $d_{i,j} = 2$, add two clauses $(h_{t,j}) \land (h_{l,j})$
  • If $d_{i,j} = 1$, add clauses
    $(h_{t,j} \lor h_{l,j}) \land (\lnot h_{t,j} \lor \lnot h_{l,j})$
  • Exactly one of $h_{t,j}$, $h_{l,j}$ must be 1
  • 2-SAT problem
  • has km variables and 2nm clauses
  • can be solved in (almost) linear time
  • any satisfying assignment gives a resolution of
    the genotypes
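
The clauses above are either unit clauses (entries 0 and 2) or "exactly one" constraints (entry 1), so a satisfying assignment can be found by simple propagation, one locus at a time. The sketch below uses that shortcut instead of a general 2-SAT solver; it is an illustrative substitute with hypothetical names, with the arithmetic genotype matrix passed as `D` and the selector given as index pairs.

```python
def propagate(value, differs, stack):
    """Push fixed bits across 'must differ' edges; False on conflict."""
    while stack:
        u = stack.pop()
        for v in differs[u]:
            want = 1 - value[u]
            if value[v] is None:
                value[v] = want
                stack.append(v)
            elif value[v] != want:
                return False
    return True


def solve_haplotypes(D, pairs, k):
    """Return a k x m 0/1 matrix H consistent with D and pairs, or None."""
    m = len(D[0])
    H = [[0] * m for _ in range(k)]
    for j in range(m):
        value = [None] * k
        differs = [[] for _ in range(k)]
        for i, (t, l) in enumerate(pairs):
            d = D[i][j]
            if d in (0, 2):                 # unit clauses: both bits fixed
                bit = d // 2
                for idx in (t, l):
                    if value[idx] not in (None, bit):
                        return None
                    value[idx] = bit
            else:                           # d == 1: exactly one bit is 1
                if t == l:
                    return None
                differs[t].append(l)
                differs[l].append(t)
        fixed = [u for u in range(k) if value[u] is not None]
        if not propagate(value, differs, fixed):
            return None
        for u in range(k):                  # unconstrained components
            if value[u] is None:
                value[u] = 0
                if not propagate(value, differs, [u]):
                    return None
        for t in range(k):
            H[t][j] = value[t]
    return H


D = [[0, 2, 0], [0, 2, 1], [1, 2, 1]]
H = solve_haplotypes(D, pairs=[(0, 0), (0, 1), (0, 2)], k=3)
print(H)
```

Haplotype bits that no clause constrains are set to 0 here; any other choice would also give a valid resolution.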

18
Quadratic, Vector, and Semi-definite Programs
  • Quadratic Integer Program
  • Optimize a quadratic objective function subject
    to quadratic constraints on integer variables
  • Strict, when each term has total degree 0 or 2
  • Vector program
  • optimize a linear objective function of inner
    products of vector variables subject to linear
    constraints on inner products of those variables
  • Strict quadratic programs lead to vector programs
    (products of variables are mapped to inner
    products of corresponding vectors)
  • SDP program
  • optimize a linear objective function of the
    elements of a matrix X subject to
  • linear constraints on the elements of X
  • X being a positive semi-definite matrix
  • Vector programs lead to SDP (X is the matrix of
    all vector inner products)
  • SDP programs can be solved in polynomial-time
    with small numerical errors, thus
  • solving vector programs, thus
  • solving relaxations of strict Quadratic Integer
    programs
  • construct an approximate solution to a quadratic
    integer program from a solution of its
    relaxation, obtained via SDP
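
The chain of relaxations above can be written out for a generic strict quadratic program over $\pm 1$ variables. The coefficient names $c_{ij}$, $a^{(r)}_{ij}$ and $b_r$ are illustrative and not taken from the slides; this is the textbook construction, not the specific QIP of the talk.

```latex
% Strict quadratic integer program: every term has total degree 0 or 2
\[
\min \sum_{i,j} c_{ij}\, x_i x_j
\quad \text{s.t.} \quad
\sum_{i,j} a^{(r)}_{ij}\, x_i x_j = b_r \ (r = 1, \dots, p), \qquad
x_i^2 = 1, \ x_i \in \mathbb{Z}
\]

% Vector program: each product x_i x_j becomes the inner product v_i . v_j
\[
\min \sum_{i,j} c_{ij}\, v_i \cdot v_j
\quad \text{s.t.} \quad
\sum_{i,j} a^{(r)}_{ij}\, v_i \cdot v_j = b_r, \qquad
v_i \cdot v_i = 1, \ v_i \in \mathbb{R}^n
\]

% SDP: X collects all inner products, X_{ij} = v_i . v_j
\[
\min \sum_{i,j} c_{ij}\, X_{ij}
\quad \text{s.t.} \quad
\sum_{i,j} a^{(r)}_{ij}\, X_{ij} = b_r, \qquad
X_{ii} = 1, \ X \succeq 0
\]
```

Any feasible integer solution gives a feasible X of rank one, so the SDP optimum bounds the integer optimum; the vectors are recovered from X by a Cholesky-style factorization.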

19
Quadratic Integer Program for the k-HPP
[Equations: objective and constraints of the QIP; not preserved in the transcript]
20
QIP Heuristic = SDP + Rounding + Backtracking
  • recursively solve k-HPP
  • using SDP compute vectors for the variables of
    QIP
  • for each selector variable $S_{i,j}$, compute
  • $P_{S_{i,j}}$ = probability that a random hyperplane
    separates the vectors of $S_{i,j}$ and z variables
    (à la MAX-CUT)
  • round to 1 the $S_{i,j}$ with the highest $P_{S_{i,j}}$
  • residual k-HPP = k-HPP problem with the rounded
    $S_{i,j}$'s fixed to their rounded value
  • if the residual k-HPP is infeasible
  • round $S_{i,j}$ to 0 instead
  • if the new residual k-HPP is still infeasible
  • backtrack by returning infeasible
  • recursively solve the residual k-HPP
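
The separation probability used in the rounding step has a closed form: a uniformly random hyperplane through the origin separates two vectors with probability equal to the angle between them divided by pi, as in the MAX-CUT analysis. The sketch below computes that quantity; the function name is hypothetical, and how the authors turn these probabilities into the full rounding and backtracking procedure is not shown.

```python
import math


def separation_probability(u, v):
    """Probability that a random hyperplane through 0 separates u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.acos(cos_angle) / math.pi


z = [1.0, 0.0]   # stand-in for the vector of the z variable
for s in ([1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]):
    print(s, separation_probability(s, z))
```

Vectors pointing the same way as the reference vector are never separated, orthogonal vectors are separated half the time, and opposite vectors always are.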

21
Experiments
  • we experiment with three approaches for the HPP
    problem
  • Clark's rule
  • LP relaxation of Gusfield's TIP scheme with
    simple rounding
  • the QIP heuristic for k-HPP with $k = 2n$
  • The MATLAB package SDPT3 3.02 is used to solve the
    SDP relaxation of the problem
  • all experiments are done with single-CPU MATLAB
    on a Dual Xeon 2.4 GHz desktop with 1 GB memory

22
Experimental Datasets
  • we use synthetic datasets A and B
  • each with 20 instances for each triplet (n, m, k) in
    (5, 5, 5), (8, 8, 8), (10, 10, 10), and (15,
    15, 15) (and for B, recombination levels $\rho$ = 0,
    16 and 40)
  • generate instances of the HPP problem as follows
  • randomly mate k haplotypes with m loci to produce
    n genotypes
  • generation of haplotypes for dataset A
  • each locus of k haplotypes takes value 0/1 with
    probability ½ independent of other loci and other
    haplotypes
  • generation of haplotypes for dataset B
  • Use Hudson's program to generate haplotypes with
    these parameters
  • diploid population of size $10^6$
  • mutation rate $1.5 \times 10^{-6}$
  • recombination levels $\rho$ = 0, 16 and 40
    corresponding to crossover probabilities 0,
    $4 \times 10^{-6}$, and $10^{-5}$
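
A minimal sketch of how a dataset A instance could be generated, assuming each genotype is formed by drawing two of the k haplotypes uniformly at random with replacement. The slides do not say whether self-mating or duplicate genotypes were allowed, so those choices, the function name, and the `seed` parameter are assumptions.

```python
import random


def generate_instance(n, m, k, seed=None):
    """Return (haplotypes, arithmetic genotypes) for one random instance."""
    rng = random.Random(seed)
    # Each locus of each haplotype is 0 or 1 with probability 1/2.
    haplotypes = [[rng.randint(0, 1) for _ in range(m)] for _ in range(k)]
    genotypes = []
    for _ in range(n):
        # Assumption: parents are drawn uniformly, with replacement.
        h1 = rng.choice(haplotypes)
        h2 = rng.choice(haplotypes)
        genotypes.append([a + b for a, b in zip(h1, h2)])
    return haplotypes, genotypes


haps, genos = generate_instance(n=5, m=5, k=5, seed=0)
for row in genos:
    print(row)
```

Because the genotypes are built from at most k haplotypes, k is an upper bound on the optimal HPP value of the generated instance.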

23
Experimental Results
24
QIP Extensions
  • QIP can be extended to handle many variants of
    basic k-HPP problem, such as
  • partial Genotypes
  • Some loci in some genotypes are unknown
  • shared haplotypes
  • Prior knowledge of shared haplotypes
  • allowing for erroneous genotypes and loci editing
  • allowing for outlier genotypes

25
Concluding Remarks
  • developed arithmetic formulation for the HPP
    problem
  • provides new lower bound
  • yields simple quadratic IP (QIP)
  • QIP can be extended to handle many variants,
    incorporate prior information etc
  • SDP relaxation of QIP that can be solved in
    polynomial time
  • SDP + rounding + backtracking gives QIP heuristic
  • experimentally
  • Demonstrate competitiveness of QIP heuristic vs
    Clark's rule and Gusfield's TIP relaxation
  • Show that rank of the genotypes is a tighter
    lower bound than [formula not preserved in the transcript]
  • future work
  • Analysis of worst-case performance ratio of the
    QIP heuristic
  • Devise algorithms that scale better

26
Thank You !
Questions ?