Single Nucleotide Polymorphisms

Instructor Yao-Ting Huang

Bioinformatics Laboratory, Department of Computer Science and Information Engineering, National Chung Cheng University.

Genetic Variants

- We are distinguished from each other by genetic variants.
  - Single Nucleotide Polymorphisms (SNP)
  - Insertion/deletion
  - Copy Number Polymorphism (CNP)
  - Inversion

Genetic Variants Over Time

[Figure: timeline from a common ancestor to the present; mutations accumulate over time and produce the variants observed in a population.]

SNPs and Haplotypes

- A Single Nucleotide Polymorphism (SNP), pronounced "snip", is a single DNA base variation observed in the human population.
- A haplotype is a set of linked SNPs on the same chromosome.

Single Nucleotide Polymorphism

- We only consider SNPs observed with sufficient frequency in the population.
  - SNP: the minor allele frequency is at least 5%.
  - Mutation: the minor allele frequency is less than 5%.

[Figure: two example loci.
SNP:      C T T A G C T T  vs.  C T T A G T T T
Mutation: A C T T A G C T T (99.9%)  vs.  A C T T A G T T T (0.1%)]

Single Nucleotide Polymorphism

- All humans share 99.9% of the same DNA sequence.
- SNPs occur about every 200 to 600 base pairs.
- 90% of human genome variation comes from SNPs.
- The human genome contains about four million SNPs.
- Because the probability of recurrent mutation at the same locus is quite low, we usually observe only two alleles at a SNP locus.

Single Nucleotide Polymorphism

- SNPs differ among members of the human population.

DNA sequences of 6 individuals:

Individual  Phenotype  Sequence          Haplotype
1           Black eye  GATATTCGTACGGA-T  AG-
2           Brown eye  GATGTTCGTACTGAAT  GTA
3           Black eye  GATATTCGTACGGA-T  AG-
4           Blue eye   GATATTCGTACGGAAT  AGA
5           Brown eye  GATGTTCGTACTGAAT  GTA
6           Brown eye  GATGTTCGTACTGAAT  GTA

Haplotype frequencies: AG- 2/6, GTA 3/6, AGA 1/6
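The haplotype frequencies on this slide can be recomputed from the six sequences. The sketch below is only an illustration: the function names are hypothetical, and it assumes the sequences are already aligned so that each column is one locus.

```python
from collections import Counter

# The six aligned sequences from the slide, in order.
sequences = [
    "GATATTCGTACGGA-T",
    "GATGTTCGTACTGAAT",
    "GATATTCGTACGGA-T",
    "GATATTCGTACGGAAT",
    "GATGTTCGTACTGAAT",
    "GATGTTCGTACTGAAT",
]

def variable_sites(seqs):
    """Return the 0-based column indices where the sequences disagree."""
    return [i for i in range(len(seqs[0])) if len({s[i] for s in seqs}) > 1]

def haplotype_counts(seqs):
    """Project each sequence onto the variable sites and count the haplotypes."""
    sites = variable_sites(seqs)
    return Counter("".join(s[i] for i in sites) for s in seqs)

sites = variable_sites(sequences)
counts = haplotype_counts(sequences)
for haplotype, count in counts.items():
    print(haplotype, f"{count}/{len(sequences)}")
```

Only three columns vary, so each 16-base sequence reduces to a 3-letter haplotype.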

Discovery of SNPs

- The DNA of two individuals differs in less than 0.1%.
- Hinds et al. identified 1,586,383 Single Nucleotide Polymorphisms across three human populations (Science, 2005).

The HapMap Project

- The International HapMap Project aims to provide a map of SNPs in the human genome (269 individuals from 4 populations).
  - Phase I: 1,007,329 SNPs.
  - Phase II (ongoing): 4.6 million SNPs.

Haplotype vs. Genotype

- The collection of haplotypes has been limited because the human genome is diploid.
- In the above projects, genotypes instead of haplotypes are collected due to cost considerations.

Haplotype vs. Genotype

- Genotypes only tell us the alleles at each SNP locus.
- But we don't know the connection of alleles at different SNP loci.
- There could be several possible haplotype pairs for the same genotype.

[Figure: one genotype resolved as either of two haplotype pairs.]

We don't know which haplotype pair is real.
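To make the ambiguity concrete, the following sketch lists every haplotype pair that could produce a given genotype. The genotype encoding (one unordered allele pair per locus) and the example alleles are assumptions chosen for illustration, not data from the slides.

```python
from itertools import product

def haplotype_pairs(genotype):
    """List the unordered haplotype pairs consistent with a genotype.

    genotype: one (allele, allele) tuple per SNP locus; the order inside
    each tuple carries no information.
    """
    pairs = set()
    # At each locus, choose which of the two alleles sits on the first chromosome.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

# Hypothetical genotype that is heterozygous at both loci.
example = [("A", "C"), ("T", "G")]
print(haplotype_pairs(example))
```

A genotype with k heterozygous loci has 2^(k-1) candidate pairs for k ≥ 1, which is why genotype data alone cannot identify the real pair.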

Three Possible Genotypes at Each SNP Locus

- At SNP1, it is possible to observe three genotypes (A, C), (A, A), and (C, C) in the population.
  - (A, C): heterozygous (one major and one minor allele).
  - (A, A): homozygous wild type (two major alleles).
  - (C, C): homozygous mutant (two minor alleles).

[Figure: genotype G3 shown at loci SNP1 and SNP2.]

Haplotype Inference

- Inferring the haplotypes from a set of genotypes is called haplotype inference.
- Without further assumptions, this problem cannot be solved.
- Most combinatorial methods consider the maximum parsimony model to solve this problem.
  - Methods based on this model search for a minimum set of haplotypes which can explain all genotypes.
  - This problem is shown to be APX-hard (Lancia et al., 2005).

Maximum Parsimony

- Find a minimum set of haplotypes that can explain all genotypes.
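As a small illustration of the maximum parsimony objective, the sketch below finds a minimum haplotype set by exhaustive search. It is not the method presented in these slides, and it is exponential, so it only suits toy inputs. The three genotypes and all function names are hypothetical.

```python
from itertools import combinations, product

def candidate_pairs(genotype):
    """Haplotype pairs (as frozensets) that could explain one genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = "".join(site[c] for site, c in zip(genotype, choice))
        h2 = "".join(site[1 - c] for site, c in zip(genotype, choice))
        pairs.add(frozenset((h1, h2)))
    return pairs

def max_parsimony(genotypes):
    """Smallest haplotype set in which every genotype has an explaining pair."""
    options = [candidate_pairs(g) for g in genotypes]
    pool = sorted({h for pairs in options for pair in pairs for h in pair})
    # Try subsets in order of increasing size; the first feasible one is minimum.
    for size in range(1, len(pool) + 1):
        for subset in combinations(pool, size):
            chosen = set(subset)
            if all(any(pair <= chosen for pair in pairs) for pairs in options):
                return sorted(chosen)
    return []

# Hypothetical genotypes over two loci.
genotypes = [
    [("A", "C"), ("T", "G")],  # explained by AT+CG or AG+CT
    [("A", "A"), ("T", "G")],  # explained only by AT+AG
    [("A", "C"), ("G", "G")],  # explained only by AG+CG
]
print(max_parsimony(genotypes))
```

The second and third genotypes force AG, AT, and CG into the solution, and the first is then explained by AT+CG without adding a fourth haplotype.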

Related Works

- Statistical methods
  - Niu et al. (2002) developed a PL-EM algorithm called HAPLOTYPER.
  - Stephens and Donnelly (2003) designed an MCMC algorithm based on Gibbs sampling called PHASE.
- Combinatorial methods
  - Gusfield (2003) proposed an integer linear programming formulation for this problem.
  - Wang and Xu (2003) developed a branch-and-bound algorithm called HAPAR to find the optimal solution.
  - Brown and Harrower (2004) proposed a new integer linear programming formulation for this problem.

Our Results

- Huang et al., "An approximation algorithm for haplotype inference by maximum parsimony," Journal of Computational Biology, 2005.

Yao-Ting Huang

Approximation Approaches to NP-hard problems

- Formulate the problem as an integer linear programming problem.
  - Relax to a Linear Programming (LP) problem and solve it.
  - Gusfield and Brown formulate the haplotype inference problem as integer programming.
- Formulate the problem as an integer quadratic programming (IQP) problem.
  - Relax to a Semi-Definite Programming (SDP) problem and solve it.
  - We formulate the haplotype inference problem as an IQP problem.

Integer Quadratic Programming

- Define xi as an integer variable with values 1 or -1.
  - xi = 1 if the i-th haplotype is selected.
  - xi = -1 if the i-th haplotype is not selected.
- Finding a minimum set of haplotypes is to minimize the following function.
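A sketch of an objective consistent with the definitions above, offered as an assumption rather than the slide's exact formula. Here m denotes the number of candidate haplotypes. Since (1 + xi)/2 equals 1 for a selected haplotype and 0 otherwise, the number of selected haplotypes can be written as:

```latex
\min \; \sum_{i=1}^{m} \frac{1 + x_i}{2}, \qquad x_i \in \{1, -1\}
```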

Integer Quadratic Programming

- Each genotype must be explained by at least one pair of haplotypes.
- For genotype G1, the following inequality must be satisfied.

[Figure: G1 is explained if h1 and h2 are selected, or if an alternative haplotype pair is selected.]
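One way to write such a constraint, as a hedged sketch: let P(G1) (a hypothetical symbol, not from the slides) be the set of haplotype index pairs (j, k) that can explain G1. The product (1 + xj)(1 + xk)/4 equals 1 only when both haplotypes are selected, so requiring at least one explaining pair gives:

```latex
\sum_{(j,k) \in P(G_1)} \frac{(1 + x_j)(1 + x_k)}{4} \;\ge\; 1
```

For instance, if (1, 2) is in P(G1) and both h1 and h2 are selected, that term alone contributes 1 and the inequality holds.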

Integer Quadratic Programming

- Maximum parsimony: find a minimum set of haplotypes which can explain all genotypes.

[Figure: the IQP objective together with its constraint functions.]

An Iterative Semi-definite Programming Relaxation Algorithm

[Flowchart: Integer Quadratic Programming → Vector Formulation → Semi-definite Programming → SDP Solution → Vector Solution → Integral Solution]

Relaxation

Integer Quadratic Programming

Vector Formulation

- We relax xi into an (m+1)-dimensional unit vector yi.
- Replace the integer constant 1 with another unit vector y0 = (1, 0, ..., 0).

SDP Formulation

Vector Formulation

- Let Y = (y0 y1 ... ym)^T (y0 y1 ... ym).

Reformulation

Vector Formulation

Solving SDP

Semidefinite Programming

- The SDP problem can be solved in polynomial time by algorithms such as the interior point method.
- We can obtain the SDP solution matrix Y.

Decomposition

SDP Solution

- Recall that Y = (y0 y1 ... ym)^T (y0 y1 ... ym).
- Use the incomplete Cholesky decomposition method to obtain vector solutions y0, y1, ..., ym.

Randomized Rounding

Integral Solution

Vector Solution

- Randomly generate two unit vectors z1 and z2.
- Set xi = 1 if
  - (z1 · yi)(z1 · y0) > 0, and
  - (z2 · yi)(z2 · y0) > 0.
- Set xi = -1 otherwise.

We will discuss this later
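The rounding rule above can be sketched in a few lines. This is an illustration under assumptions: the function names are hypothetical, the vectors yi are hand-picked rather than produced by an SDP solver, and random directions are drawn by normalizing Gaussian samples.

```python
import math
import random

def random_unit_vector(dim, rng):
    """Draw a uniformly random direction by normalizing Gaussian samples."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def round_vectors(y0, ys, rng):
    """Set xi = 1 only when yi agrees in sign with y0 under both z1 and z2."""
    z1 = random_unit_vector(len(y0), rng)
    z2 = random_unit_vector(len(y0), rng)
    xs = []
    for yi in ys:
        agree1 = dot(z1, yi) * dot(z1, y0) > 0
        agree2 = dot(z2, yi) * dot(z2, y0) > 0
        xs.append(1 if agree1 and agree2 else -1)
    return xs

# Hand-picked unit vectors standing in for an SDP solution.
y0 = [1.0, 0.0, 0.0]
ys = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
xs = round_vectors(y0, ys, random.Random(0))
print(xs)
```

A vector equal to y0 is always rounded to 1 and a vector opposite to y0 always to -1. Vectors in between are selected at random, less often the further they are from y0.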

Iterative Process

Integer Quadratic Programming

- Check if all inequalities are satisfied.
  - No: repeat this algorithm only for the unsatisfied inequalities.
  - Yes: we are done.

Analysis of the SDP-relaxation Algorithm

- Recall the randomized rounding:
  - Randomly generate two unit vectors z1 and z2.
  - Set xi = 1 if (z1 · yi)(z1 · y0) > 0 and (z2 · yi)(z2 · y0) > 0.
  - Set xi = -1 otherwise.
- We will show that the randomized rounding outputs a solution with expected value E[w] at least as good as the optimal solution.

Analysis of the SDP-relaxation Algorithm

- The randomized rounding method can output a solution with expected value E[w] at least as good as the optimal solution.
- We will show OPT(IQP) ≥ OPT(SDP) ≥ E[w].
- The solution space of SDP includes that of IQP, so for this minimization problem we already have OPT(IQP) ≥ OPT(SDP).
  - We can set yi = (1, 0, 0, 0, ...) if xi = 1.
  - We can set yi = (-1, 0, 0, 0, ...) if xi = -1.

Analysis of the SDP-relaxation Algorithm

- We still need to prove the remaining inequality, OPT(SDP) ≥ E[w], in
- OPT(IQP) ≥ OPT(SDP) ≥ E[w].

Analysis of the SDP-relaxation Algorithm

- Recall xi = 1 if
  - (z1 · yi)(z1 · y0) > 0, and
  - (z2 · yi)(z2 · y0) > 0.
- Note that cos θ = vi · vj for unit vectors vi and vj separated by angle θ.
- Let the angle between vectors y0 and yi be θ.
- Recall that cos θ > 0 when θ ∈ [0, π/2) or θ ∈ (3π/2, 2π].

Analysis of the SDP-relaxation Algorithm

- Recall xi = 1 if
  - (z1 · yi)(z1 · y0) > 0, and
  - (z2 · yi)(z2 · y0) > 0.
- Let the angle between vectors y0 and yi be θ.
- (z1 · yi)(z1 · y0) > 0 if z1 is within the region of angle (π - θ) or the opposite region.
- (z2 · yi)(z2 · y0) > 0 if z2 is within the region of angle (π - θ) or the opposite region.
- xi = 1 with probability ((π - θ)/π)^2.
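The selection probability can be checked numerically. The sketch below is a Monte Carlo estimate with hypothetical function names. It works in the plane spanned by y0 and yi, which suffices because only the projection of each random vector onto that plane affects the two signs.

```python
import math
import random

def same_side(z, a, b):
    """True when a and b lie strictly on the same side of the line with normal z."""
    dot_a = z[0] * a[0] + z[1] * a[1]
    dot_b = z[0] * b[0] + z[1] * b[1]
    return dot_a * dot_b > 0

def random_direction(rng):
    """Uniform random unit vector in the plane."""
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return (math.cos(angle), math.sin(angle))

def estimate_probability(theta, trials, rng):
    """Estimate P(xi = 1) when y0 and yi are separated by angle theta."""
    y0 = (1.0, 0.0)
    yi = (math.cos(theta), math.sin(theta))
    hits = 0
    for _ in range(trials):
        z1 = random_direction(rng)
        z2 = random_direction(rng)
        if same_side(z1, yi, y0) and same_side(z2, yi, y0):
            hits += 1
    return hits / trials

theta = math.pi / 3
estimate = estimate_probability(theta, 20000, random.Random(0))
expected = ((math.pi - theta) / math.pi) ** 2
print(round(estimate, 3), round(expected, 3))
```

The estimate should land close to ((π - θ)/π)^2, which is 4/9 for θ = π/3.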

Analysis of the SDP-relaxation Algorithm

- We now complete the proof:
- OPT(IQP) ≥ OPT(SDP) ≥ E[w].

Simulation Methods

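- The haplotypes are used to validate the result.
- We randomly pair two haplotypes to generate a genotype.

[Figure: haplotype data (h1, h2, ..., hm) are paired into genotype data (G1 = h1h4, G2 = h2hm, ..., Gn = h1h2). Each method (SDPHapInfer, HAPAR, HAPLOTYPER, PHASE) returns a solution (e.g. G1 = h1h4, G2 = h1h2, ..., Gn = h1h2) that is compared with the true pairs.]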
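A minimal sketch of this simulation step, under stated assumptions: haplotypes are strings with one character per SNP, pairs are drawn uniformly with replacement, and the function names and the three example haplotypes are hypothetical. The true pair is kept beside each genotype so that an inferred solution can be validated later.

```python
import random

def make_genotype(h1, h2):
    """Combine two haplotypes into a genotype: one unordered allele pair per SNP."""
    return [tuple(sorted((a, b))) for a, b in zip(h1, h2)]

def simulate_genotypes(haplotypes, n, rng):
    """Draw n genotypes by pairing haplotypes chosen uniformly at random."""
    data = []
    for _ in range(n):
        i = rng.randrange(len(haplotypes))
        j = rng.randrange(len(haplotypes))
        # Record the true pair (i, j) next to the genotype it produces.
        data.append(((i, j), make_genotype(haplotypes[i], haplotypes[j])))
    return data

haplotypes = ["AG-", "GTA", "AGA"]
data = simulate_genotypes(haplotypes, 5, random.Random(0))
for (i, j), genotype in data:
    print(f"h{i + 1} h{j + 1}", genotype)
```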

Results

- We prove that SDPHapInfer gives an O(log n)-approximation with high probability, where n is the number of genotypes.
- We implement SDPHapInfer in MATLAB.
- We compare the number of haplotypes found by different methods on simulated data sets.

Experimental Results (1/2)

[Figure: error rate versus number of genotypes, on 100 simulated data sets of 10 haplotypes with 20 SNPs.]

The Challenge

- Inferring haplotypes for long genotypes is still a challenging problem.
- Existing methods are forced to
  - partition the genotypes into small segments,
  - infer haplotypes in each segment,
  - and concatenate the inferred haplotypes to construct a final solution.

The First Application of SDP on Approximation Algorithms

- A 0.878-approximation randomized algorithm for the MAXCUT problem is developed by the SDP relaxation technique.
  - LP relaxation can only achieve a 0.5 approximation ratio.
  - An upper bound has been shown to be 0.941.
- Goemans, M. and Williamson, D., ACM STOC 1994.

The MAXCUT Problem

- Given an undirected graph G with n nodes x1, x2, ..., xn, find a cut that maximizes the number of edges on the cut.
- Let xi be 1 if the vertex is on one side of the cut, and -1 if the vertex is on the other side of the cut.

[Figure: example graph with each vertex labeled 1 or -1 according to its side of the cut.]

Integer Quadratic Programming

- Define aij to be 1 if the edge (xi, xj) exists and 0 otherwise.
- Relax the integer constraint so that each xi becomes a unit-length vector in dimension m.

[Figure: example graph on vertices x1, x2, x3, x4.]
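A sketch of the standard Goemans-Williamson formulation, written with the symbols above and taken from the literature rather than from the slide text. The term (1 - xi xj)/2 equals 1 exactly when xi and xj have opposite signs, so the sum counts the edges on the cut. In the relaxation, vi names the unit vector that replaces xi.

```latex
\text{IQP:}\quad \max \sum_{i<j} a_{ij}\,\frac{1 - x_i x_j}{2}, \qquad x_i \in \{1, -1\}

\text{Relaxation:}\quad \max \sum_{i<j} a_{ij}\,\frac{1 - v_i \cdot v_j}{2}, \qquad \lVert v_i \rVert = 1
```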

Semidefinite Programming Formulation

[Figure: example graph on vertices x1, x2, x3, x4.]

- Let X be (v1, v2, ..., vn)^T (v1, v2, ..., vn).

Randomized Rounding Method

- Once X is found, perform Cholesky decomposition to obtain the vector solutions v1, v2, ..., vn.
- Pick a random unit vector r and
  - set xi = 1 if vi · r ≥ 0,
  - set xi = -1 if vi · r < 0.
- Note that cos θ = vi · vj.
- The edge (vi, vj) is on the cut iff (vi · r) and (vj · r) have different signs.

[Figure: vectors vi and vj separated by angle θ, with the random vector r.]
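The hyperplane rounding step can be sketched as follows. No SDP solver is involved: three hand-picked unit vectors 120 degrees apart stand in for a solution on a triangle graph, and the function names are hypothetical.

```python
import math
import random

def hyperplane_round(vectors, rng):
    """Assign each vertex 1 or -1 by the sign of vi . r for a random direction r."""
    dim = len(vectors[0])
    # Only the direction of r matters, so it is left unnormalized.
    r = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return [1 if sum(a * b for a, b in zip(v, r)) >= 0 else -1 for v in vectors]

def cut_size(edges, x):
    """Count the edges whose endpoints received different labels."""
    return sum(1 for i, j in edges if x[i] != x[j])

# Triangle graph with vectors spread evenly around the unit circle.
edges = [(0, 1), (1, 2), (0, 2)]
vectors = [
    [math.cos(2.0 * math.pi * k / 3.0), math.sin(2.0 * math.pi * k / 3.0)]
    for k in range(3)
]
x = hyperplane_round(vectors, random.Random(1))
print(x, cut_size(edges, x))
```

Any line through the origin leaves one of these three vectors on one side and two on the other, so the rounding cuts two of the three edges, which is the maximum cut of a triangle.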

Analysis

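- Denote by C the size of the cut found by the above algorithm.
- Each edge (xi, xj) is on the cut with a probability determined by the angle θ between vi and vj.

[Figure: vectors vi and vj separated by angle θ, with the random vector r.]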
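A sketch of the standard calculation behind this step, taken from the Goemans-Williamson analysis rather than from the slide text. A random hyperplane separates vi and vj with probability proportional to the angle between them, and comparing that probability term by term with the relaxed objective gives the 0.878 ratio quoted earlier.

```latex
\Pr[\text{edge } (x_i, x_j) \text{ is on the cut}] = \frac{\theta_{ij}}{\pi} = \frac{\arccos(v_i \cdot v_j)}{\pi}

E[C] = \sum_{i<j} a_{ij}\,\frac{\arccos(v_i \cdot v_j)}{\pi}
\;\ge\; 0.878 \sum_{i<j} a_{ij}\,\frac{1 - v_i \cdot v_j}{2}
```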

Analysis

- The randomized rounding partitions the nodes by a hyperplane.

[Figure: a hyperplane with normal vector r separates the nodes labeled 1 from the nodes labeled -1.]

Linear Algebra Background

- A symmetric n×n matrix A is positive semidefinite iff x^T A x ≥ 0 for every x ∈ R^n.
- Equivalently, A = B^T B for some m×n matrix B.
- Equivalently, all the eigenvalues of A are non-negative.
- The inner product of symmetric matrices A and B is