Constrained Hidden Markov Models for Population-based Haplotyping

1
Constrained Hidden Markov Models for
Population-based Haplotyping
Application of Probabilistic ILP II, FP6-508861
www.aprill.org
  • Niels Landwehr
  • Joint work with Taneli Mielikäinen, Lauri Eronen,
    Hannu Toivonen, Heikki Mannila
  • University of Freiburg / University of Helsinki

2
Outline
  • Population-based haplotype reconstruction
      • Infer haplotypes from genotypes: reconstruct the
        hidden phase of genetic data
      • Important problem in biology/medicine, e.g.
        disease association studies
  • An approach using constrained HMMs
      • Sparse Markov chains to represent conserved
        haplotype fragments
      • HMM model that can be learned directly from
        genotype data
  • Experimental results

3
Human Genome and SNPs
DNA sequences of six individuals; the three positions where the
sequences differ are SNPs (markers).

  Individual   DNA sequence
  1            ...GATATTCGTACGGATGTTTCCA...
  2            ...GATGTTCGTACTGATGTCTCCA...
  3            ...GATATTCGTACGGATGTTTCCA...
  4            ...GATATTCGTACGGATGTTTCCA...
  5            ...GATGTTCGTACTGATGTCTCCA...
  6            ...GATGTTCGTACTGATGTCTCCA...

4
Haplotypes
Alleles at the three SNPs for each individual:

  Individual   SNP 1   SNP 2   SNP 3
  1            A       G       T
  2            G       T       C
  3            A       G       T
  4            A       G       T
  5            G       T       C
  6            G       T       C

5
Haplotypes
The same haplotypes with the two alleles of each SNP encoded as 0/1:

  Individual   SNP 1   SNP 2   SNP 3
  1            1       0       1
  2            0       1       0
  3            1       0       1
  4            1       0       1
  5            0       1       0
  6            0       1       0

6
Why Haplotypes?
  • Haplotypes
      • define our genetic individuality
      • contribute to risk factors of complex diseases
        (e.g., diabetes)
  • Disease Association Studies (Gene Mapping)
      • find genetic differences between a case and a
        control population
      • Identifying SNPs responsible for disease might
        help find a cure
  • Also useful for
      • Linkage disequilibrium studies: summarize genetic
        variation
      • Understanding evolution of human populations

7
The problem: Haplotypes not directly observable
  Paternal:  . 1 . . . 1 . . . 0 . . . 0 . . . 1 .
  Maternal:  . 0 . . . 0 . . . 0 . . . 1 . . . 1 .

8
Population-based Haplotype Reconstruction
  • Given the genotypes of several individuals, infer
    for every individual the most likely underlying
    haplotype pair
  • Hidden data reconstruction problem using a
    probabilistic model: exploit patterns in the
    haplotypes (linkage disequilibrium)

Haplotype pairs and the resulting genotypes for three individuals
(0,1 marks a marker where the two haplotypes differ):

  haplotype pair       genotype
  1 1 0 1 0 1 0 1
  0 0 1 1 1 1 0 0      0,1  0,1  0,1  1  0,1  1  0  0,1

  1 1 0 1 0 1 0 1
  0 0 1 1 1 1 0 1      0,1  0,1  0,1  1  0,1  1  0  1

  0 0 1 1 1 1 0 1
  1 1 1 1 1 1 0 1      0,1  0,1  1    1  1    1  0  1

9
Haplotype Reconstruction Problem (CS Perspective)
Input: A set G of genotypes
Output: A set H of corresponding haplotype pairs
such that
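The condition following "such that" is not preserved in this transcript. A minimal sketch, assuming the usual consistency requirement (each haplotype pair must reproduce its genotype when combined marker by marker). The encoding of a heterozygous marker as 2 (shown as 0,1 on the slides) and all function names are illustrative assumptions, not taken from the slides.

```python
def genotype_of(h1, h2):
    """Genotype implied by a haplotype pair.

    Assumed encoding: 0 or 1 where both haplotypes agree,
    2 where they differ (heterozygous marker).
    """
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def is_consistent(h1, h2, genotype):
    """True if the pair (h1, h2) is a valid resolution of the genotype."""
    return genotype_of(h1, h2) == tuple(genotype)


def num_resolutions(genotype):
    """Number of distinct unordered haplotype pairs consistent with a genotype."""
    k = sum(1 for g in genotype if g == 2)
    return 1 if k == 0 else 2 ** (k - 1)


# First example individual from the previous slide.
g = (2, 2, 2, 1, 2, 1, 0, 2)
h1 = (1, 1, 0, 1, 0, 1, 0, 1)
h2 = (0, 0, 1, 1, 1, 1, 0, 0)
print(is_consistent(h1, h2, g), num_resolutions(g))
```

The count grows exponentially with the number of heterozygous markers, which is why a probabilistic model is needed to choose among the candidates.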
10
Population-based Haplotype Reconstruction
  • Given a model M for the distribution of
    haplotypes, can infer most likely resolution
    (assuming Hardy-Weinberg equilibrium)
  • Need to estimate this model from available
    genotype data
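A minimal sketch of the inference step, assuming Hardy-Weinberg equilibrium means the two haplotypes of an individual are drawn independently from M, so a candidate pair can be scored by P(h1 | M) * P(h2 | M). The frequency table standing in for M, the encoding of heterozygous markers as 2, and the function name are hypothetical.

```python
from itertools import product


def most_likely_resolution(genotype, hap_prob):
    """Return (h1, h2, score) maximising hap_prob(h1) * hap_prob(h2).

    genotype: entries 0, 1 (homozygous) or 2 (heterozygous), an assumed encoding.
    hap_prob: callable giving the model probability of one haplotype.
    """
    het = [i for i, g in enumerate(genotype) if g == 2]
    best = None
    # Brute force over all phase assignments at heterozygous markers.
    for bits in product([0, 1], repeat=len(het)):
        h1, h2 = list(genotype), list(genotype)
        for site, b in zip(het, bits):
            h1[site], h2[site] = b, 1 - b
        h1, h2 = tuple(h1), tuple(h2)
        score = hap_prob(h1) * hap_prob(h2)
        if best is None or score > best[2]:
            best = (h1, h2, score)
    return best


# Toy stand-in for the model M: a table of haplotype frequencies.
freqs = {(1, 0, 1): 0.5, (0, 1, 0): 0.3, (1, 1, 1): 0.1, (0, 0, 0): 0.1}
h1, h2, score = most_likely_resolution((2, 2, 2), lambda h: freqs.get(h, 0.0))
print(h1, h2, score)
```

Exhaustive enumeration is exponential in the number of heterozygous markers and only serves to illustrate the objective; the slides obtain the same kind of answer with dynamic programming on an HMM.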

11
Prior Work on Haplotype Reconstruction
  • Competitive application domain for several years:
    many systems developed
      • characterized by the statistical model and
        learning/reconstruction algorithms employed
  • Special-purpose statistical models
      • Approximate Coalescent (PHASE 2001, 2003, 2005)
      • Block-based (Gerbil 2004, 2005)
      • Variable-length MC (HaploRec 2004, 2006)
      • Founder-based (HIT 2005)
      • Local clusters (fastPHASE 2006)

12
Prior Work on Haplotype Reconstruction
  • Special-purpose learning/reconstruction
    algorithms
      • MCMC variant
      • Approximate EM, partition ligation
  • Our approach
      • Model haplotypes using (sparse) Markov chains
      • Natural extension to a Hidden Markov Model on
        genotypes
      • Directly learnable from genotype data (standard
        Baum-Welch)

13
Constrained HMMs for haplotyping
(Figure: path for haplotype 0,1,1,0)
  • Modeling haplotypes
      • Standard Markov chain
      • More general: order k Markov chain
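The formulas for these two models are not preserved in the transcript. A minimal sketch of how an order k Markov chain assigns a probability to a haplotype, assuming position-specific conditional probabilities; the `cond_prob` callable and the toy parameters are hypothetical, not the presenter's model.

```python
def haplotype_prob(haplotype, cond_prob, k=1):
    """Probability of a haplotype under an order-k Markov chain.

    cond_prob(t, history, allele) is assumed to return
    P(allele at marker t | up to k preceding alleles).
    """
    p = 1.0
    for t, allele in enumerate(haplotype):
        history = tuple(haplotype[max(0, t - k):t])  # at most k previous alleles
        p *= cond_prob(t, history, allele)
    return p


def toy_cond_prob(t, history, allele):
    """Toy first-order parameters: uniform start, then 0.9 to repeat the last allele."""
    if not history:
        return 0.5
    return 0.9 if allele == history[-1] else 0.1


# The haplotype from the figure caption: 0,1,1,0.
print(haplotype_prob((0, 1, 1, 0), toy_cond_prob, k=1))
```

With k = 1 this is the standard Markov chain; a larger k only lengthens the history passed to `cond_prob`.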

14
Constrained HMMs for haplotyping
  • Modeling genotypes
      • Hidden phase (order of pair): Hidden Markov Model
      • States: pairs of states of the underlying Markov
        chain (state of the maternal/paternal sequence)
      • Output symbol: unordered pair
      • Path in the model: sample two haplotypes, output
        corresponding genotype
  • Have to enforce Hardy-Weinberg equilibrium
      • Parameter tying: constraints on transition
        probabilities
  • Algorithms
      • Learning: standard Baum-Welch
      • Reconstruction of most likely haplotype pair:
        Viterbi
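A minimal sketch of the reconstruction step, assuming a first-order underlying chain with position-specific transition tables. Each HMM state is an ordered allele pair, the pair transition is the product of two chain transitions (one way to read the parameter tying), and the emission is the unordered pair, so only states matching the genotype are allowed. The encoding of heterozygous markers as 2 and all names are assumptions.

```python
import math


def safe_log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def states_for(g):
    """Ordered allele pairs whose unordered pair equals genotype entry g."""
    return [(0, 1), (1, 0)] if g == 2 else [(g, g)]


def viterbi_phase(genotype, initial, transition):
    """Most likely haplotype pair for a genotype.

    initial[a]: probability of allele a at the first marker.
    transition[t][a][b]: probability of allele b at marker t+1 given a at marker t.
    """
    score = {s: safe_log(initial[s[0]]) + safe_log(initial[s[1]])
             for s in states_for(genotype[0])}
    back = []
    for t in range(1, len(genotype)):
        T = transition[t - 1]
        new_score, pointers = {}, {}
        for s in states_for(genotype[t]):
            # Tied transition: both haplotypes move as copies of the same chain.
            cands = {p: score[p] + safe_log(T[p[0]][s[0]]) + safe_log(T[p[1]][s[1]])
                     for p in score}
            best_prev = max(cands, key=cands.get)
            new_score[s], pointers[s] = cands[best_prev], best_prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(score, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return tuple(s[0] for s in path), tuple(s[1] for s in path)


sticky = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.1, 1: 0.9}}
print(viterbi_phase((2, 2, 2), {0: 0.5, 1: 0.5}, [sticky, sticky]))
```

Because emissions are deterministic, restricting the state set per marker replaces the usual emission term; learning the transition tables with Baum-Welch is not shown here.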

15
Constrained HMMs for haplotyping
  • Example paths for genotype 0,1,1,0,1,0

16
Sparse Markov Modeling (SpaMM)
  • Higher-order models (long history) needed:
    exponential size of model
  • However, out of the many possible history
    blocks, only a few occur in the data (conserved
    fragments)
  • Idea: sparse model, iterative structure learning
    algorithm to identify conserved fragments
    (Apriori-style)

initialize-first-order-model()
em-training()
repeat
    regularize-and-extend()
    em-training()
until
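The Apriori-style idea can be sketched in a simplified setting. The code below is not the SpaMM procedure: it assumes fully known haplotypes and plain counting instead of EM training on genotypes, and the threshold `min_count` and all names are hypothetical. It only illustrates how level-wise extension with pruning keeps the set of candidate fragments small.

```python
from collections import Counter


def frequent_fragments(haplotypes, max_order, min_count):
    """Level-wise search for conserved fragments.

    Returns {length: {(start, fragment): count}}, keeping only fragments
    seen at least min_count times.
    """
    n_markers = len(haplotypes[0])
    counts = Counter((t, (h[t],)) for h in haplotypes for t in range(n_markers))
    levels = {1: {f: c for f, c in counts.items() if c >= min_count}}
    for length in range(2, max_order + 1):
        prev = levels[length - 1]
        counts = Counter()
        for h in haplotypes:
            for t in range(n_markers - length + 1):
                frag = tuple(h[t:t + length])
                # Apriori pruning: count a fragment only if both of its
                # shorter sub-fragments survived the previous level.
                if (t, frag[:-1]) in prev and (t + 1, frag[1:]) in prev:
                    counts[(t, frag)] += 1
        levels[length] = {f: c for f, c in counts.items() if c >= min_count}
        if not levels[length]:
            break  # nothing left to extend
    return levels


haps = [(1, 0, 1), (1, 0, 1), (0, 1, 0), (0, 1, 0), (1, 1, 1)]
for length, frags in frequent_fragments(haps, max_order=3, min_count=2).items():
    print(length, sorted(frags))
```

Only fragments that survive at one length are extended to the next, so the number of candidates stays proportional to what occurs in the data rather than growing exponentially with the order.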
17
SpaMM Model (order 1)
  • Initial model: standard Markov chain of order 1
  • Iteration: extend order of model by 1, prune
    unlikely parts
  • Avoids combinatorial explosion of model size

18
SpaMM Model (order 2)
  • Iteration: extend order of model by 1, prune
    unlikely paths
  • Avoids combinatorial explosion of model size

19
SpaMM Model (order 3)
  • Iteration: extend order of model by 1, prune
    unlikely paths
  • Avoids combinatorial explosion of model size

20
SpaMM Model (order 4)
  • Iteration: extend order of model by 1, prune
    unlikely paths
  • Avoids combinatorial explosion of model size

21
SpaMM Model (order 5)
  • Iteration: extend order of model by 1, prune
    unlikely paths
  • Avoids combinatorial explosion of model size

22
SpaMM Model (order 6)
  • Iteration: extend order of model by 1, prune
    unlikely paths
  • Avoids combinatorial explosion of model size

23
SpaMM Model (final)
  • Final model: model structure encodes conserved
    fragments
  • Concise representation of all haplotypes with
    non-zero probability

24
Experimental Evaluation
  • Real world population data
      • Correct haplotypes have been inferred from trios
      • Daly dataset: 103 SNP markers for 174 individuals
      • Yoruba population: 100 datasets, 500 SNP markers
        each, 60 individuals
  • Problem Setting
      • Given the set of genotypes, algorithm outputs
        most likely haplotype pairs
      • Difference to real haplotype pairs is measured in
        switch distance (number of recombinations needed
        to transform pairs, normalized)
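A minimal sketch of the switch distance for one individual, assuming the predicted and true pairs resolve the same genotype. The slide does not state the normalization; dividing by the number of heterozygous markers minus one is a common convention and is an assumption here, as are the function names.

```python
def switch_distance(pred, true):
    """Number of phase switches needed to turn the predicted pair into the true pair."""
    (p1, _p2), (t1, t2) = pred, true
    het = [i for i in range(len(t1)) if t1[i] != t2[i]]
    # Phase at each heterozygous marker: 0 if p1 carries t1's allele, else 1.
    phase = [0 if p1[i] == t1[i] else 1 for i in het]
    return sum(1 for a, b in zip(phase, phase[1:]) if a != b)


def normalized_switch_distance(pred, true):
    """Switch distance divided by the maximum possible number of switches."""
    t1, t2 = true
    n_het = sum(1 for a, b in zip(t1, t2) if a != b)
    if n_het < 2:
        return 0.0  # phase is unambiguous with fewer than two heterozygous markers
    return switch_distance(pred, true) / (n_het - 1)


true_pair = ((0, 0, 0, 0), (1, 1, 1, 1))
pred_pair = ((0, 0, 1, 1), (1, 1, 0, 0))
print(switch_distance(pred_pair, true_pair),
      normalized_switch_distance(pred_pair, true_pair))
```

Swapping the two haplotypes of the predicted pair leaves the distance unchanged, which matches the fact that the pair is unordered.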

25
Results Haplotype Reconstruction
  • Many well-engineered systems
      • Smart priors, averaging over several random
        restarts of EM, ...
  • SpaMM: proof-of-concept implementation, not tuned

26
Results Haplotype Reconstruction
  • PHASE most accurate, then fastPHASE, then SpaMM
      • however, PHASE too slow for long maps
  • SpaMM beats fastPHASE without averaging
      • overall, competitive accuracy

27
Results Runtime
  • Runtime in seconds for phasing 100 markers (log
    scale)
  • SpaMM scales linearly in markers
      • like fastPHASE, HaploRec, HIT
      • unlike PHASE, Gerbil

28
Results Genotype imputation
  • Most haplotyping methods can also predict missing
    genotype values
      • for SpaMM, can be read off Viterbi path

29
Results Genotype imputation
  • fastPHASE: best known method
  • Again, SpaMM beats fastPHASE without averaging

30
Conclusions
  • SpaMM: new haplotyping method
      • sparse Markov chains to encode conserved
        haplotype fragments
      • Constrained HMM for modeling genotypes
      • Apriori-style structure learning algorithm
      • Simple, accurate, interpretable output
  • Future work
      • Accuracy can probably be improved using standard
        techniques (EM random restarts, averaging, ...)

31
Thanks!