Constrained Hidden Markov Models for Population-based Haplotyping

About This Presentation

Slides: 32

Provided by: nielsla

Transcript and Presenter's Notes

1
Constrained Hidden Markov Models for
Population-based Haplotyping
Application of Probabilistic ILP II, FP6-508861
www.aprill.org

Niels Landwehr
Joint work with Taneli Mielikäinen, Lauri Eronen,
Hannu Toivonen, Heikki Mannila
University of Freiburg / University of Helsinki

2
Outline

Population-based haplotype reconstruction
Infer haplotypes from genotypes: reconstruct
hidden phase of genetic data
Important problem in biology/medicine, e.g.
disease association studies
An approach using constrained HMMs
Sparse Markov chains to represent conserved
haplotype fragments
HMM model that can be learned directly from
genotype data
Experimental results

3
Human Genome and SNPs
DNA sequences of individuals 1-6, with three SNP (marker) positions
1  ...GATATTCGTACGGATGTTTCCA...
2  ...GATGTTCGTACTGATGTCTCCA...
3  ...GATATTCGTACGGATGTTTCCA...
4  ...GATATTCGTACGGATGTTTCCA...
5  ...GATGTTCGTACTGATGTCTCCA...
6  ...GATGTTCGTACTGATGTCTCCA...
4
Haplotypes
Individual  SNP  SNP  SNP
1           A    G    T
2           G    T    C
3           A    G    T
4           A    G    T
5           G    T    C
6           G    T    C
5
Haplotypes
Individual  SNP  SNP  SNP
1           1    0    1
2           0    1    0
3           1    0    1
4           1    0    1
5           0    1    0
6           0    1    0
6
Why Haplotypes?

Haplotypes
define our genetic individuality
contribute to risk factors of complex diseases
(e.g., diabetes)
Disease Association Studies (Gene Mapping)
find genetic difference between a case and a
control population
Identifying SNPs responsible for disease might
help find a cure
Also useful for
Linkage disequilibrium studies: summarize genetic
variation
Understanding evolution of human populations

7
The problem: haplotypes not directly observable
Paternal: . 1 . . . 1 . . . 0 . . . 0 . . . 1 .
Maternal: . 0 . . . 0 . . . 0 . . . 1 . . . 1 .
8
Population-based Haplotype Reconstruction

Given the genotypes of several individuals, infer
for every individual the most likely underlying
haplotype pair
Hidden data reconstruction problem using
probabilistic model: exploit patterns in the
haplotypes (linkage disequilibrium)

[Figure: genotypes of individuals 1-3 (heterozygous markers shown as 0,1), each with its underlying haplotype pair]
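
To make the relation between a haplotype pair and its genotype concrete, here is a minimal Python sketch. The encoding (each marker of a genotype as a sorted allele pair, so a heterozygous marker is `(0, 1)`) and the function names are assumptions made for illustration, not taken from the slides.

```python
from itertools import product

def genotype_of(h1, h2):
    """Genotype of a haplotype pair: one unordered (sorted) allele pair per marker."""
    return tuple(tuple(sorted(pair)) for pair in zip(h1, h2))

def compatible_pairs(genotype):
    """Enumerate all unordered haplotype pairs that produce the given genotype."""
    het = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    base = [g[0] for g in genotype]  # homozygous markers are already determined
    if not het:
        return [(tuple(base), tuple(base))]
    pairs = []
    # Fix the phase of the first heterozygous marker so that (h1, h2) and
    # (h2, h1) are not both listed; the other markers can go either way.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs
```

A genotype with k heterozygous markers has 2^(k-1) compatible pairs, which is why a probabilistic model is needed to pick the most likely one.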
9
Haplotype Reconstruction Problem (CS Perspective)
Input: A set G of genotypes
Output: A set H of corresponding haplotype pairs
such that
10
Population-based Haplotype Reconstruction

Given a model M for the distribution of
haplotypes, can infer most likely resolution

Hardy-Weinberg equilibrium

Need to estimate this model from available
genotype data
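
The formulas on this slide did not survive in the transcript. As a sketch of what the Hardy-Weinberg assumption implies, written in notation introduced here for illustration (h^1 and h^2 for the two haplotypes of an individual, taken as an ordered pair, and g for the observed genotype): the two haplotypes are drawn independently from the model M, and the most likely resolution maximizes the product over all pairs consistent with g.

```latex
\begin{align}
P(h^1, h^2 \mid M) &= P(h^1 \mid M)\, P(h^2 \mid M) \\
(\hat{h}^1, \hat{h}^2) &= \operatorname*{arg\,max}_{(h^1, h^2)\ \text{consistent with}\ g}
    P(h^1 \mid M)\, P(h^2 \mid M)
\end{align}
```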

11
Prior Work on Haplotype Reconstruction

Competitive application domain for several years:
many systems developed
characterized by the statistical model and
learning/reconstruction algorithms employed
Special-purpose statistical models
Approximate Coalescent (PHASE 2001,2003,2005)
Block-based (Gerbil 2004,2005)
Variable-length MC (HaploRec 2004,2006)
Founder-based (HIT 2005)
Local clusters (fastPHASE 2006)

12
Prior Work on Haplotype Reconstruction

Special-purpose learning/reconstruction
algorithms
MCMC variant
Approximate EM partition ligation
Our approach
Model haplotypes using (sparse) Markov chains
Natural extension to a Hidden Markov Model on
genotypes
Directly learnable from genotype data (standard
Baum-Welch)

13
Constrained HMMs for haplotyping
Path for haplotype 0,1,1,0

Modeling haplotypes
Standard Markov chain
More general: order k Markov chain
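
As an illustration of the standard (order 1) case, the following Python sketch fits a marker-specific first-order Markov chain and scores a haplotype as a path through it. It is a simplification: it estimates parameters by counting over known haplotypes, whereas the talk learns them from genotypes. The function names and the `pseudocount` smoothing are assumptions.

```python
from collections import Counter

def fit_first_order_chain(haplotypes, pseudocount=1.0):
    """Estimate initial and per-marker transition probabilities from haplotypes."""
    n_markers = len(haplotypes[0])
    first = Counter(h[0] for h in haplotypes)
    total = len(haplotypes) + 2 * pseudocount
    initial = {a: (first[a] + pseudocount) / total for a in (0, 1)}
    transitions = []  # transitions[t - 1][(prev, cur)] for marker t - 1 -> t
    for t in range(1, n_markers):
        counts = Counter((h[t - 1], h[t]) for h in haplotypes)
        table = {}
        for prev in (0, 1):
            row = counts[(prev, 0)] + counts[(prev, 1)] + 2 * pseudocount
            for cur in (0, 1):
                table[(prev, cur)] = (counts[(prev, cur)] + pseudocount) / row
        transitions.append(table)
    return initial, transitions

def haplotype_probability(h, initial, transitions):
    """Probability of the path that spells out haplotype h."""
    p = initial[h[0]]
    for t in range(1, len(h)):
        p *= transitions[t - 1][(h[t - 1], h[t])]
    return p
```

An order k chain would condition each allele on the previous k alleles instead of one, which is what makes the model size grow exponentially in k.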

14
Constrained HMMs for haplotyping

Modeling genotypes
Hidden phase (order of pair): Hidden Markov Model
States: pairs of states of the underlying Markov
chain (state of the maternal/paternal sequence)
Output symbol: unordered pair
Path in the model: sample two haplotypes, output
corresponding genotype
Have to enforce Hardy-Weinberg equilibrium
Parameter tying: constraints on transition
probabilities
Algorithms
Learning: standard Baum-Welch
Reconstruction of most likely haplotype pair:
Viterbi
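
The reconstruction step can be sketched in Python as a Viterbi search over pair states. This is a minimal sketch assuming a first-order chain rather than the sparse higher-order model of the talk. The parameter layout (`initial[a]`, `transitions[t][(prev, cur)]`) and the genotype encoding (one sorted allele pair per marker) are assumptions. The tied transition probability of a pair state is the product of the two single-chain transitions, which is how the Hardy-Weinberg constraint enters.

```python
import math
from itertools import product

def viterbi_phase(genotype, initial, transitions):
    """Most likely haplotype pair for a genotype under a first-order chain."""
    def allowed(g):
        # Pair states whose unordered output symbol equals the observed genotype.
        return [(a, b) for a, b in product((0, 1), repeat=2)
                if tuple(sorted((a, b))) == tuple(g)]

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    score = {s: log(initial[s[0]]) + log(initial[s[1]]) for s in allowed(genotype[0])}
    back = []
    for t in range(1, len(genotype)):
        new_score, pointers = {}, {}
        for s in allowed(genotype[t]):
            best_prev, best = None, float("-inf")
            for p, sc in score.items():
                # Tied parameters: both haplotypes use the same chain.
                cand = (sc + log(transitions[t - 1][(p[0], s[0])])
                           + log(transitions[t - 1][(p[1], s[1])]))
                if best_prev is None or cand > best:
                    best_prev, best = p, cand
            new_score[s], pointers[s] = best, best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(score, key=score.get)]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return tuple(s[0] for s in path), tuple(s[1] for s in path)
```

Because emissions are deterministic, the search only ever visits pair states that agree with the observed genotype at each marker.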

15
Constrained HMMs for haplotyping

Example paths for genotype 0,1,1,0,1,0

16
Sparse Markov Modeling (SpaMM)

Higher-order models (long history) needed:
exponential size of model
However, out of the possible history
blocks, only a few occur in data (conserved
fragments)
Idea: sparse model, iterative structure learning
algorithm to identify conserved fragments
(Apriori-style)

Initialize first-order-model()
em-training( )
repeat
    regularize-and-extend( )
    em-training( )
until
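
The Apriori-style idea (grow only the fragments that survived the previous round) can be illustrated with a small Python sketch. It is not the SpaMM algorithm: it works on known haplotypes and prunes by raw frequency, whereas the loop above alternates EM training on genotypes with a regularize-and-extend step. The function name and the `max_order` and `min_support` parameters are hypothetical.

```python
from collections import Counter

def conserved_fragments(haplotypes, max_order, min_support):
    """Level-wise search for frequent fragments (start marker, allele string)."""
    n = len(haplotypes[0])
    threshold = min_support * len(haplotypes)

    def frequent(length, candidates=None):
        counts = Counter()
        for h in haplotypes:
            for start in range(n - length + 1):
                frag = (start, tuple(h[start:start + length]))
                if candidates is None or frag in candidates:
                    counts[frag] += 1
        return {f for f, c in counts.items() if c >= threshold}

    levels = [frequent(1)]
    for length in range(2, max_order + 1):
        # Candidate generation: extend each surviving fragment by one allele.
        candidates = {(start, alleles + (a,))
                      for start, alleles in levels[-1]
                      for a in (0, 1)
                      if start + len(alleles) < n}
        kept = frequent(length, candidates)  # prune rare extensions
        if not kept:
            break
        levels.append(kept)
    return levels
```

Since a fragment is only extended if it was kept at the previous level, the number of candidates stays close to the number of fragments actually present in the data instead of doubling with each order.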
17
SpaMM Model (order 1)

Initial model: standard Markov chain of order 1

Iteration: extend order of model by 1, prune
unlikely parts
Avoids combinatorial explosion of model size

18
SpaMM Model (order 2)

Iteration: extend order of model by 1, prune
unlikely paths
Avoids combinatorial explosion of model size

19
SpaMM Model (order 3)

Iteration: extend order of model by 1, prune
unlikely paths
Avoids combinatorial explosion of model size

20
SpaMM Model (order 4)

Iteration: extend order of model by 1, prune
unlikely paths
Avoids combinatorial explosion of model size

21
SpaMM Model (order 5)

Iteration: extend order of model by 1, prune
unlikely paths
Avoids combinatorial explosion of model size

22
SpaMM Model (order 6)

Iteration: extend order of model by 1, prune
unlikely paths
Avoids combinatorial explosion of model size

23
SpaMM Model (final)

Final model: model structure encodes conserved
fragments
Concise representation of all haplotypes with
non-zero probability

24
Experimental Evaluation

Real-world population data
Correct haplotypes have been inferred from trios
Daly dataset: 103 SNP markers for 174 individuals
Yoruba population: 100 datasets, 500 SNP markers
each, 60 individuals
Problem Setting
Given the set of genotypes, algorithm outputs
most likely haplotype pairs
Difference to real haplotype pairs is measured in
switch distance (number of recombinations needed to
transform pairs, normalized)
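
A small Python sketch of the switch distance between a predicted and a true haplotype pair. The slide only says the count is normalized; dividing by the number of heterozygous markers minus one is a common convention and is an assumption here, as is the function name. The predicted pair is assumed to be consistent with the same genotype as the true pair.

```python
def switch_distance(predicted, truth):
    """Fraction of adjacent heterozygous markers where the phase flips."""
    p1, _ = predicted
    t1, t2 = truth
    # Only heterozygous markers carry phase information.
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    if len(het) < 2:
        return 0.0
    # At each heterozygous marker: does the first predicted haplotype
    # carry the allele of the first true haplotype?
    follows_first = [p1[i] == t1[i] for i in het]
    switches = sum(a != b for a, b in zip(follows_first, follows_first[1:]))
    return switches / (len(het) - 1)
```

A prediction that is correct up to swapping the two haplotypes has distance 0, since the order of the pair is arbitrary.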

25
Results: Haplotype Reconstruction

Many well-engineered systems
Smart priors, averaging over several random
restarts of EM, ...
SpaMM: proof-of-concept implementation, not tuned

26
Results: Haplotype Reconstruction

PHASE most accurate, then fastPHASE, then SpaMM
however, PHASE too slow for long maps
SpaMM beats fastPHASE without averaging
overall, competitive accuracy

27
Results: Runtime

Runtime in seconds for phasing 100 markers (log.
scale)
SpaMM scales linearly in the number of markers
like fastPHASE, HaploRec, HIT
unlike PHASE, Gerbil

28
Results: Genotype imputation

Most haplotyping methods can also predict missing
genotype values
for SpaMM, can be read off Viterbi path

29
Results: Genotype imputation

fastPHASE: best known method
Again, SpaMM beats fastPHASE without averaging

30
Conclusions

SpaMM: new haplotyping method
Sparse Markov chains to encode conserved
haplotype fragments
Constrained HMM for modeling genotypes
Apriori-style structure learning algorithm
Simple, accurate, interpretable output
Future work
Accuracy can probably be improved using standard
techniques (EM random restarts, averaging, ...)

31
Thanks!
