Title: Constrained Hidden Markov Models for Population-based Haplotyping
1Constrained Hidden Markov Models for
Population-based Haplotyping
Application of Probabilistic ILP II, FP6-508861
www.aprill.org
- Niels Landwehr
- Joint work with Taneli Mielikäinen, Lauri Eronen, Hannu Toivonen, Heikki Mannila
- University of Freiburg / University of Helsinki
2Outline
- Population-based haplotype reconstruction
  - Infer haplotypes from genotypes: reconstruct hidden phase of genetic data
  - Important problem in biology/medicine, e.g. disease association studies
- An approach using constrained HMMs
  - Sparse Markov chains to represent conserved haplotype fragments
  - HMM model that can be learned directly from genotype data
- Experimental results
3Human Genome and SNPs
[Figure: aligned DNA sequences of individuals 1-6, with three SNP (marker) positions highlighted]
...GATATTCGTACGGATGTTTCCA...
...GATGTTCGTACTGATGTCTCCA...
...GATATTCGTACGGATGTTTCCA...
...GATATTCGTACGGATGTTTCCA...
...GATGTTCGTACTGATGTCTCCA...
...GATGTTCGTACTGATGTCTCCA...
4Haplotypes
[Figure: alleles at the three SNPs, one row per DNA sequence (individuals 1-6)]
SNP  SNP  SNP
A    G    T
G    T    C
A    G    T
A    G    T
G    T    C
G    T    C
5Haplotypes
[Figure: the same alleles in 0/1 coding, one row per DNA sequence (individuals 1-6)]
SNP  SNP  SNP
1    0    1
0    1    0
1    0    1
1    0    1
0    1    0
0    1    0
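The step from aligned sequences to 0/1 haplotypes can be sketched in a few lines of Python. This is only an illustration: the function name `snp_matrix` and the `one_alleles` argument are hypothetical, and the choice of which allele is coded as 1 at each SNP (A, T, T) is simply read off the slides, since any consistent two-allele labelling would work.

```python
def snp_matrix(sequences, one_alleles):
    """Find polymorphic positions and code each sequence as a 0/1 row.

    one_alleles: the allele coded as 1 at each SNP, in order (an assumption
    taken from the slides; the labelling itself is arbitrary).
    """
    length = len(sequences[0])
    assert all(len(s) == length for s in sequences), "sequences must be aligned"
    # A position is a SNP if more than one base is observed there.
    positions = [i for i in range(length) if len({s[i] for s in sequences}) > 1]
    assert len(positions) == len(one_alleles)
    matrix = [[1 if s[i] == one else 0 for i, one in zip(positions, one_alleles)]
              for s in sequences]
    return positions, matrix


sequences = [
    "GATATTCGTACGGATGTTTCCA",
    "GATGTTCGTACTGATGTCTCCA",
    "GATATTCGTACGGATGTTTCCA",
    "GATATTCGTACGGATGTTTCCA",
    "GATGTTCGTACTGATGTCTCCA",
    "GATGTTCGTACTGATGTCTCCA",
]
positions, matrix = snp_matrix(sequences, one_alleles="ATT")
```

Here `positions` holds the 0-based indices of the three markers and `matrix` holds one 0/1 row per sequence.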
6Why Haplotypes?
- Haplotypes
  - define our genetic individuality
  - contribute to risk factors of complex diseases (e.g., diabetes)
- Disease Association Studies (Gene Mapping)
  - find genetic difference between a case and a control population
  - Identifying SNPs responsible for disease might help find a cure
- Also useful for
  - Linkage disequilibrium studies: summarize genetic variation
  - Understanding evolution of human populations
7The problem: Haplotypes not directly observable
Paternal  . 1 . . . 1 . . . 0 . . . 0 . . . 1 .
Maternal  . 0 . . . 0 . . . 0 . . . 1 . . . 1 .
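Why the phase is lost can be shown with a minimal Python sketch. It assumes a genotype is represented as the unordered allele pair at each marker (a `frozenset` here, which is a representation chosen for this example); the paternal and maternal alleles are the five marker values from the slide.

```python
def genotype(paternal, maternal):
    """Unordered allele pair at each marker; parental origin is discarded."""
    assert len(paternal) == len(maternal)
    return [frozenset((p, m)) for p, m in zip(paternal, maternal)]


paternal = [1, 1, 0, 0, 1]
maternal = [0, 0, 0, 1, 1]
g = genotype(paternal, maternal)

# A different haplotype pair (alleles exchanged at the fourth marker)
# produces exactly the same genotype, so the genotype alone cannot
# distinguish the two resolutions.
alt_a = [1, 1, 0, 1, 1]
alt_b = [0, 0, 0, 0, 1]
same_genotype = genotype(alt_a, alt_b) == g
```

Markers where both haplotypes agree (the third and fifth here) are unambiguous; only the heterozygous markers carry the hidden phase.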
8Population-based Haplotype Reconstruction
- Given the genotypes of several individuals, infer for every individual the most likely underlying haplotype pair
- Hidden data reconstruction problem using probabilistic model: exploit patterns in the haplotypes (linkage disequilibrium)

[Figure: haplotype pair and resulting genotype for Individuals 1, 2 and 3]
haplotype pair  1 1 0 1 0 1 0 1
                0 0 1 1 1 1 0 0
genotype        0,1 0,1 0,1 1 0,1 1 0 0,1

haplotype pair  1 1 0 1 0 1 0 1
                0 0 1 1 1 1 0 1
genotype        0,1 0,1 0,1 1 0,1 1 0 1

haplotype pair  0 0 1 1 1 1 0 1
                1 1 1 1 1 1 0 1
genotype        0,1 0,1 1 1 1 1 0 1
9Haplotype Reconstruction Problem (CS Perspective)
Input: A set G of genotypes
Output: A set H of corresponding haplotype pairs
such that
10Population-based Haplotype Reconstruction
- Given a model M for the distribution of haplotypes, can infer most likely resolution
  - Hardy-Weinberg equilibrium
- Need to estimate this model from available genotype data
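A minimal Python sketch of "most likely resolution given a model M" follows. It reads Hardy-Weinberg equilibrium as the two haplotypes of an individual being drawn independently, so a pair is scored by the product P(h1 | M) * P(h2 | M). The model is reduced to a plain dictionary of haplotype probabilities; the function names and the toy probabilities in `hap_prob` are hypothetical, not values from the talk.

```python
from itertools import product


def resolutions(genotype):
    """All unordered haplotype pairs consistent with a genotype.

    genotype: list of allele sets, e.g. {0}, {1} or {0, 1} per marker.
    """
    het = [i for i, g in enumerate(genotype) if len(g) == 2]
    pairs = []
    # The first heterozygous marker is fixed to (0, 1) so that mirrored
    # pairs (h1, h2) and (h2, h1) are not listed twice.
    for bits in product((0, 1), repeat=max(len(het) - 1, 0)):
        h1 = [min(g) for g in genotype]
        h2 = [max(g) for g in genotype]
        for site, flip in zip(het[1:], bits):
            if flip:
                h1[site], h2[site] = h2[site], h1[site]
        pairs.append((tuple(h1), tuple(h2)))
    return pairs


def most_likely_resolution(genotype, hap_prob):
    """Pick the pair maximizing P(h1) * P(h2) (independence assumption)."""
    return max(
        resolutions(genotype),
        key=lambda pair: hap_prob.get(pair[0], 0.0) * hap_prob.get(pair[1], 0.0),
    )


# Hypothetical haplotype distribution over three markers.
hap_prob = {(0, 0, 1): 0.4, (1, 1, 0): 0.3, (0, 1, 1): 0.2, (1, 0, 0): 0.1}
genotype = [{0, 1}, {0, 1}, {0, 1}]
best = most_likely_resolution(genotype, hap_prob)
```

A genotype with k heterozygous markers has 2^(k-1) candidate resolutions, so explicit enumeration like this is only feasible for short marker maps; it is meant to make the objective concrete, not to suggest how any of the systems discussed here search.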
11Prior Work on Haplotype Reconstruction
- Competitive application domain for several years: many systems developed
  - characterized by the statistical model and learning/reconstruction algorithms employed
- Special-purpose statistical models
  - Approximate Coalescent (PHASE 2001,2003,2005)
  - Block-based (Gerbil 2004,2005)
  - Variable-length MC (HaploRec 2004,2006)
  - Founder-based (HIT 2005)
  - Local clusters (fastPHASE 2006)
12Prior Work on Haplotype Reconstruction
- Special-purpose learning/reconstruction algorithms
  - MCMC variant
  - Approximate EM, partition ligation
- Our approach
  - Model haplotypes using (sparse) Markov chains
  - Natural extension to a Hidden Markov Model on genotypes
  - Directly learnable from genotype data (standard Baum-Welch)
13Constrained HMMs for haplotyping
- Modeling haplotypes
  - Standard Markov chain
  - More general: order-k Markov chain
[Figure: path for haplotype 0,1,1,0]
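As a rough illustration of the standard (order 1) case, the sketch below treats a haplotype as a path through a Markov chain with one transition table per pair of adjacent markers and multiplies the probabilities along the path for haplotype 0,1,1,0. The parameter values in `init` and `trans` are made up for the example, and the layout `trans[t][a][b]` is an assumption of this sketch.

```python
from itertools import product


def haplotype_probability(hap, init, trans):
    """Probability of one haplotype under a position-specific order-1 chain.

    init[a]        : P(allele a at the first marker)
    trans[t][a][b] : P(allele b at marker t+1 | allele a at marker t)
    """
    p = init[hap[0]]
    for t in range(1, len(hap)):
        p *= trans[t - 1][hap[t - 1]][hap[t]]
    return p


# Hypothetical parameters for four markers.
init = [0.6, 0.4]
trans = [
    [[0.3, 0.7], [0.8, 0.2]],  # marker 1 -> marker 2
    [[0.5, 0.5], [0.1, 0.9]],  # marker 2 -> marker 3
    [[0.9, 0.1], [0.6, 0.4]],  # marker 3 -> marker 4
]

p = haplotype_probability([0, 1, 1, 0], init, trans)  # 0.6 * 0.7 * 0.9 * 0.6

# Sanity check: probabilities of all 2^4 haplotypes sum to 1.
total = sum(haplotype_probability(h, init, trans) for h in product((0, 1), repeat=4))
```

An order-k chain would condition each allele on the previous k alleles instead of one, which is where the exponential growth in model size comes from.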
14Constrained HMMs for haplotyping
- Modeling genotypes
  - Hidden phase (order of pair): Hidden Markov Model
  - States: pairs of states of the underlying Markov chain (state of the maternal/paternal sequence)
  - Output symbol: unordered pair
  - Path in the model: sample two haplotypes, output corresponding genotype
- Have to enforce Hardy-Weinberg equilibrium
  - Parameter tying constraints on transition probabilities
- Algorithms
  - Learning: standard Baum-Welch
  - Reconstruction of most likely haplotype pair: Viterbi
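The reconstruction step can be sketched for the order-1 case as follows. This is an illustrative sketch under stated assumptions, not the SpaMM implementation: states are ordered allele pairs, a state may only emit the genotype it is consistent with, and the transition probability of a pair is the product of the two underlying chain transitions (one way to realize the parameter tying). All parameter values are hypothetical.

```python
from itertools import product
from math import inf, log


def safe_log(x):
    return log(x) if x > 0 else -inf


def viterbi_phase(genotype, init, trans):
    """Most likely haplotype pair for a genotype (list of allele sets).

    init[a] and trans[t][a][b] parameterize the underlying haplotype chain.
    Returns (h1, h2, log_probability of the ordered pair).
    """
    states = list(product((0, 1), repeat=2))  # (allele on h1, allele on h2)

    def allowed(state, g):
        return set(state) == set(g)

    score = {
        s: safe_log(init[s[0]]) + safe_log(init[s[1]]) if allowed(s, genotype[0]) else -inf
        for s in states
    }
    backpointers = []
    for t in range(1, len(genotype)):
        step = trans[t - 1]
        new_score, pointer = {}, {}
        for s in states:
            if not allowed(s, genotype[t]):
                new_score[s], pointer[s] = -inf, states[0]
                continue
            # Tied transitions: pair transition = product of two chain transitions.
            candidates = [
                (score[r] + safe_log(step[r[0]][s[0]]) + safe_log(step[r[1]][s[1]]), r)
                for r in states
            ]
            new_score[s], pointer[s] = max(candidates)
        score = new_score
        backpointers.append(pointer)

    last = max(states, key=score.get)
    if score[last] == -inf:
        raise ValueError("genotype has probability zero under the model")
    path = [last]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return [s[0] for s in path], [s[1] for s in path], score[last]


# Hypothetical chain parameters for four markers.
init = [0.6, 0.4]
trans = [
    [[0.3, 0.7], [0.8, 0.2]],
    [[0.5, 0.5], [0.1, 0.9]],
    [[0.9, 0.1], [0.6, 0.4]],
]
genotype = [{0, 1}, {0, 1}, {0, 1}, {0}]
h1, h2, log_p = viterbi_phase(genotype, init, trans)
```

The pair (h1, h2) and its mirror image (h2, h1) always have the same probability, so which of the two orderings is returned carries no meaning. Note also that with an order-1 chain a homozygous marker between two heterozygous ones leaves their relative phase undetermined, which is one motivation for the longer histories discussed next.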
15Constrained HMMs for haplotyping
- Example paths for genotype 0,1,1,0,1,0
16Sparse Markov Modeling (SpaMM)
- Higher-order models (long history) needed: exponential size of model
- However, out of the possible history blocks, only few occur in data (conserved fragments)
- Idea: Sparse model, iterative structure learning algorithm to identify conserved fragments (Apriori-style)

Initialize first-order-model()
em-training( )
repeat
    regularize-and-extend( )
    em-training( )
until
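The Apriori-style idea can be illustrated with a deliberately simplified Python sketch. It is not the SpaMM algorithm: instead of EM training on genotypes, it counts fragments directly in already-phased toy haplotypes, and pruning uses a plain count threshold. The function name, the `min_count` parameter, and the `(start, alleles)` fragment representation are all assumptions made for this example. What it keeps from the slide is the level-wise loop: extend every surviving fragment by one marker, discard extensions that are not supported, repeat.

```python
def conserved_fragments(haplotypes, max_order, min_count):
    """Level-wise search for fragments (start, alleles) seen at least min_count times."""
    n_markers = len(haplotypes[0])

    def support(start, alleles):
        end = start + len(alleles)
        return sum(1 for h in haplotypes if tuple(h[start:end]) == alleles)

    # Level 1: single alleles at single markers.
    level = {
        (start, (a,))
        for start in range(n_markers)
        for a in (0, 1)
        if support(start, (a,)) >= min_count
    }
    levels = [level]
    for _ in range(2, max_order + 1):
        candidates = set()
        for start, alleles in levels[-1]:
            for a in (0, 1):
                extended = alleles + (a,)
                # Apriori check: the fragment without its first marker must
                # itself have survived the previous level.
                if (start + 1, extended[1:]) in levels[-1]:
                    candidates.add((start, extended))
        level = {f for f in candidates if support(*f) >= min_count}
        if not level:
            break
        levels.append(level)
    return levels


# Toy input: six phased haplotypes over three markers.
haplotypes = [[1, 0, 1], [0, 1, 0], [1, 0, 1], [1, 0, 1], [0, 1, 0], [0, 1, 0]]
levels = conserved_fragments(haplotypes, max_order=3, min_count=2)
sizes = [len(level) for level in levels]
```

On this toy input only two of the eight possible length-3 fragments survive, which is the kind of sparsity the structure learning is meant to exploit.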
17SpaMM Model (order 1)
- Initial model: standard Markov chain of order 1
- Iteration: extend order of model by 1, prune unlikely parts
- Avoids combinatorial explosion of model size
18SpaMM Model (order 2)
- Iteration: extend order of model by 1, prune unlikely paths
- Avoids combinatorial explosion of model size
19SpaMM Model (order 3)
20SpaMM Model (order 4)
21SpaMM Model (order 5)
22SpaMM Model (order 6)
23SpaMM Model (final)
- Final model: model structure encodes conserved fragments
- Concise representation of all haplotypes with non-zero probability
24Experimental Evaluation
- Real world population data
  - Correct haplotypes have been inferred from trios
  - Daly dataset: 103 SNP markers for 174 individuals
  - Yoruba population: 100 datasets, 500 SNP markers each, 60 individuals
- Problem Setting
  - Given the set of genotypes, algorithm outputs most likely haplotype pairs
  - Difference to real haplotype pairs is measured in switch distance (number of recombinations needed to transform pairs, normalized)
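A small Python sketch of the switch distance, under stated assumptions: only heterozygous markers matter, a "switch" is counted whenever the inferred phase flips relative to the true phase between two consecutive heterozygous markers, and the count is normalized by the number of heterozygous markers minus one. The exact normalization used in the experiments is not given on the slide, so that last choice is an assumption.

```python
def switch_distance(true_pair, inferred_pair):
    """Return (switch count, normalized switch distance) for one individual."""
    t1, t2 = true_pair
    p1, p2 = inferred_pair
    # Both pairs must resolve the same genotype.
    assert all({a, b} == {c, d} for a, b, c, d in zip(t1, t2, p1, p2))
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    # Phase at a heterozygous marker: does inferred haplotype 1 carry the
    # allele of true haplotype 1?
    phase = [p1[i] == t1[i] for i in het]
    switches = sum(1 for a, b in zip(phase, phase[1:]) if a != b)
    normalized = switches / (len(het) - 1) if len(het) > 1 else 0.0
    return switches, normalized


true_pair = ([0, 0, 0, 0], [1, 1, 1, 1])
one_switch = switch_distance(true_pair, ([0, 0, 1, 1], [1, 1, 0, 0]))
mirrored = switch_distance(true_pair, ([1, 1, 1, 1], [0, 0, 0, 0]))
alternating = switch_distance(true_pair, ([0, 1, 0, 1], [1, 0, 1, 0]))
```

Listing the two haplotypes in the opposite order costs nothing (`mirrored` has zero switches), which matches the fact that the pair is unordered.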
25Results Haplotype Reconstruction
- Many well-engineered systems
  - Smart priors, averaging over several random restarts of EM, ...
- SpaMM: proof-of-concept implementation, not tuned
26Results Haplotype Reconstruction
- PHASE most accurate, then fastPHASE, then SpaMM
- however, PHASE too slow for long maps
- SpaMM beats fastPHASE without averaging
- overall, competitive accuracy
27Results Runtime
- Runtime in seconds for phasing 100 markers (log. scale)
- SpaMM scales linearly in markers
  - like fastPHASE, HaploRec, HIT
  - unlike PHASE, Gerbil
28Results Genotype imputation
- Most haplotyping methods can also predict missing genotype values
  - for SpaMM, can be read off Viterbi path
29Results Genotype imputation
- fastPHASE best known method
- Again, SpaMM beats fastPHASE without averaging
30Conclusions
- SpaMM: new haplotyping method
  - Sparse Markov chains to encode conserved haplotype fragments
  - Constrained HMM for modeling genotypes
  - Apriori-style structure learning algorithm
  - Simple, accurate, interpretable output
- Future work
  - Accuracy can probably be improved using standard techniques (EM random restarts, averaging, ...)
31Thanks!