Genotype Error Detection using Hidden Markov Models of Haplotype Diversity

1
Genotype Error Detection using Hidden Markov
Models of Haplotype Diversity
Justin Kennedy, Ion Mandoiu, Bogdan Pasaniuc CSE
Department, University of Connecticut
2
Outline
  • Introduction
  • Likelihood Sensitivity Approach to Error
    Detection
  • HMM-Based Algorithms
  • Experimental Results
  • Conclusion

3
Genotyping Errors
  • A real problem despite advances in genotyping
    technology
      • [Zaitlen et al. 2005] found 1.1% inconsistencies
        among the 20 million dbSNP genotypes typed
        multiple times
  • Error types
      • Systematic errors (e.g., assay failure) detected
        by departure from Hardy-Weinberg equilibrium (HWE)
        [Hosking et al. 2004]
      • For pedigree data, some errors are detected as
        Mendelian Inconsistencies (MIs)
  • Undetected errors
      • E.g., if mother, father, and child are all
        heterozygous, any single error is Mendelian consistent
      • Only 30% of errors are detectable as MIs for trios
        [Gordon et al. 1999]
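The all-heterozygous example above can be checked mechanically. The following is a minimal sketch, assuming biallelic SNPs and genotypes coded as the number of copies of one allele (0, 1, or 2); the coding and the function names are illustrative and do not come from the slides.

```python
from itertools import product

def transmissible(genotype):
    """Alleles (0 or 1) that a parent with this genotype can pass on."""
    return {0: {0}, 1: {0, 1}, 2: {1}}[genotype]

def is_mendelian_consistent(mother, father, child):
    """True if the child's genotype can arise from one allele per parent."""
    return any(m + f == child
               for m, f in product(transmissible(mother), transmissible(father)))

def undetectable_single_errors(mother, father, child):
    """List single-genotype changes that leave the trio Mendelian consistent."""
    trio = {"mother": mother, "father": father, "child": child}
    hidden = []
    for member, true_g in trio.items():
        for wrong_g in (0, 1, 2):
            if wrong_g == true_g:
                continue
            changed = dict(trio)
            changed[member] = wrong_g
            if is_mendelian_consistent(changed["mother"], changed["father"],
                                       changed["child"]):
                hidden.append((member, wrong_g))
    return hidden

# All-heterozygous trio: every one of the 6 possible single errors goes unnoticed.
all_het_hidden = undetectable_single_errors(1, 1, 1)
```

Under this coding, a trio with two homozygous 0 parents and a heterozygous child is reported as inconsistent, while none of the single errors in an all-heterozygous trio is.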

4
Effects of Undetected Genotyping Errors
  • Even low error levels can have large effects for
    some study designs (e.g., rare alleles,
    haplotype-based)
  • Error rates as low as 0.1% can increase Type I error
    rates in the haplotype sharing transmission
    disequilibrium test (HS-TDT) [Knapp & Becker 04]
  • 1% errors decrease power by 10-50% for linkage
    and by 5-20% for association [Douglas et al. 00,
    Abecasis et al. 01]

5
Related Work
  • Improved genotype calling algorithms
      • [Di et al. 05], [Rabbee & Speed 06], [Nicolae et al.
        06]
  • Explicit modeling in analysis methods
      • [Kruglyak et al. 96], [Sieberts et al. 01], [Sobel et
        al. 02], [Abecasis et al. 02]
      • Computationally complex
  • Separate error detection step
      • [Douglas et al. 00], [Becker et al. 06]
      • Detected errors can be retyped, imputed, or
        ignored in downstream analyses

6
Outline
  • Introduction
  • Likelihood Sensitivity Approach to Error
    Detection
  • HMM-Based Algorithms
  • Experimental Results
  • Conclusion

7
Likelihood Sensitivity Approach to Error
Detection [Becker et al. 06]
8
Likelihood Sensitivity Approach to Error
Detection [Becker et al. 06]
L(T): likelihood of best phasing for original trio T
9
Likelihood Sensitivity Approach to Error
Detection [Becker et al. 06]
[Figure: example trio genotypes. Mother and father rows: 0 1 2 1 0 2 and 0 2 2 1 0 2; child row: 0 2 2 1 0 2]
  • Large change in likelihood suggests likely error
  • Flag genotype as an error if L(T')/L(T) > R,
    where T' is the trio with the tested genotype deleted
    and R is the detection threshold (e.g., R = 10^4)
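The flagging rule can be sketched as follows, working in log space to avoid underflow. This assumes the likelihoods L(T) and L(T') have already been computed by some phasing model, with T' denoting the trio after the tested genotype is set to missing; the function name and the dictionary layout are hypothetical.

```python
import math

def flag_errors(log_l_original, log_l_deleted, threshold=1e4):
    """Return keys of genotypes whose ratio L(T')/L(T) exceeds the threshold R.

    log_l_original: natural-log likelihood of the original trio, log L(T)
    log_l_deleted:  dict mapping a genotype key, e.g. (member, snp_index),
                    to log L(T') with that one genotype set to missing
    """
    log_r = math.log(threshold)
    return [key for key, log_l in log_l_deleted.items()
            if log_l - log_l_original > log_r]

# Made-up likelihood values, for illustration only.
example_original = math.log(1e-12)
example_deleted = {
    ("child", 3): math.log(1e-7),    # ratio 1e5, above R = 1e4
    ("mother", 3): math.log(1e-11),  # ratio 10, below R
}
example_flags = flag_errors(example_original, example_deleted)
```

Because deleting a genotype can only relax the constraints on the phasing, the ratio is at least 1 for likelihood functions that are monotonic under data deletion, so only thresholds R > 1 are meaningful.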

10
Implementation in FAMHAP [Becker et al. 06]
  • Window-based algorithm
      • For each window including the SNP under test,
        generate a list of the H most frequent haplotypes
        (default H = 50)
      • Find the most likely trio phasings by pruned search
        over the H^4 quadruples of frequent haplotypes
      • Flag genotype as an error if L(T')/L(T) > R for
        at least one window

11
Limitations of FAMHAP Implementation
  • Truncating the list of haplotypes to size H may
    lead to sub-optimal phasings and inaccurate L(T)
    values
  • False positives caused by nearby errors (due to
    the use of multiple short windows)
  • Our approach
      • HMM of haplotype diversity → all haplotypes
        are represented, no need for short windows
      • Alternate likelihood functions → scalable runtime

12
Outline
  • Introduction
  • Likelihood Sensitivity Approach to Error
    Detection
  • HMM-Based Algorithms
  • Experimental Results
  • Conclusion

13
HMM Model
(Figure from Rastas et al. 07)
  • Similar to models proposed by [Schwartz 04],
    [Rastas et al. 05], [Kimmel & Shamir 05]
  • Block-free model, paths with high transition
    probability correspond to founder haplotypes

14
HMM Training
  • Previous works use EM training of the HMM based on
    unrelated genotype data
  • Our 2-step algorithm exploits pedigree info
      • Step 1: infer haplotypes using a pedigree-aware
        algorithm based on entropy minimization
      • Step 2: train the HMM on the inferred haplotypes
        using Baum-Welch

15
Alternate Likelihood Functions
  • Maximum phasing probability is hard to compute
  • We use alternate likelihood functions that are
    monotonic under data deletion and efficiently
    computable
      • Viterbi probability (ViterbiProb): the maximum
        probability of a set of 4 HMM paths that emit 4
        haplotypes compatible with the trio
      • Probability of Viterbi haplotypes (ViterbiHaps):
        product of the total probabilities of the 4 Viterbi
        haplotypes
      • Total trio probability (TotalProb): total
        probability P(T) that the HMM emits four
        haplotypes that explain trio T along all possible
        4-tuples of paths
16
Efficient Computation of Viterbi Probability
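  • For a fixed trio, Viterbi paths can be found
    using a 4-path version of Viterbi's algorithm in
    time [expression not preserved in transcript]
  • K^3 speed-up by factoring common terms

[Recurrence and the definitions following "Where" not preserved in transcript]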
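The formulas on this slide did not survive in the transcript. As an illustration only, here is one way a factor-K^3 saving can arise when a Viterbi recurrence runs over 4-tuples of states. It assumes K states per SNP and a transition probability that factors across the four paths; the symbols V, γ, E, and A_1 to A_4 are assumptions and not necessarily the authors' notation.

```latex
% Naive recurrence: K^4 target tuples, each maximizing over K^4 source tuples,
% i.e. O(K^8) work per SNP.
V_{j+1}(q_1,q_2,q_3,q_4) = E_{j+1}(q_1,q_2,q_3,q_4)
  \max_{p_1,p_2,p_3,p_4} V_j(p_1,p_2,p_3,p_4) \prod_{i=1}^{4} \gamma_j(p_i,q_i)

% Factored recurrence: maximize over one coordinate at a time.
A_1(q_1,p_2,p_3,p_4) = \max_{p_1} V_j(p_1,p_2,p_3,p_4)\,\gamma_j(p_1,q_1)
A_2(q_1,q_2,p_3,p_4) = \max_{p_2} A_1(q_1,p_2,p_3,p_4)\,\gamma_j(p_2,q_2)
A_3(q_1,q_2,q_3,p_4) = \max_{p_3} A_2(q_1,q_2,p_3,p_4)\,\gamma_j(p_3,q_3)
A_4(q_1,q_2,q_3,q_4) = \max_{p_4} A_3(q_1,q_2,q_3,p_4)\,\gamma_j(p_4,q_4)
V_{j+1}(q_1,q_2,q_3,q_4) = E_{j+1}(q_1,q_2,q_3,q_4)\,A_4(q_1,q_2,q_3,q_4)
```

Each table A_i has K^4 entries and each entry costs O(K) to fill, so one SNP costs O(K^5) instead of O(K^8), which is a factor of K^3. The factoring is valid because all terms are non-negative, so the maximum of the product can be taken one coordinate at a time.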
17
Overall Runtimes
  • Viterbi probability
      • Likelihoods of all 3N modified trios can be
        computed within time [expression not preserved in
        transcript] using the forward-backward algorithm
      • Overall runtime for M trios: [expression not preserved]
  • Probability of Viterbi haplotypes
      • Obtain haplotypes from standard traceback, then
        compute haplotype probabilities using the forward
        algorithm
      • Overall runtime: [expression not preserved]
  • Total trio probability
      • Similar pre-computation speed-up and
        forward-backward algorithm
      • Overall runtime: [expression not preserved]

18
Outline
  • Introduction
  • Likelihood Sensitivity Approach to Error
    Detection
  • HMM-Based Algorithms
  • Experimental Results
  • Conclusion

19
Datasets
  • Real dataset [Becker et al. 2006]
      • 35 SNPs per individual genotype sequence
      • 551 trios
  • Synthetic datasets
      • 35 SNPs, 30-551 trios
      • Preserved missing data pattern of real dataset
      • Haplotypes assigned to trios based on frequencies
        inferred from real dataset
      • 1% error rate, four error insertion models
          • Random allele
          • Random genotype
          • Heterozygous-to-homozygous
          • Homozygous-to-heterozygous
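The slide names the four error insertion models without defining them. The sketch below shows one plausible reading of each, assuming genotypes coded as allele counts (0, 1, 2) with `None` for missing data; the exact definitions used in the experiments may differ, and all names here are hypothetical.

```python
import random

def insert_error(genotype, model, rng):
    """Return an erroneous genotype under one assumed reading of each model."""
    if model == "random_allele":
        # Assumed: one of the two alleles is replaced by the other allele.
        return rng.choice([0, 2]) if genotype == 1 else 1
    if model == "random_genotype":
        # Assumed: replaced by a different genotype chosen uniformly.
        return rng.choice([g for g in (0, 1, 2) if g != genotype])
    if model == "het_to_hom":
        # Assumed: only heterozygous genotypes are affected.
        return rng.choice([0, 2]) if genotype == 1 else genotype
    if model == "hom_to_het":
        # Assumed: only homozygous genotypes are affected.
        return 1 if genotype in (0, 2) else genotype
    raise ValueError(f"unknown error model: {model}")

def corrupt(genotypes, model, rate=0.01, seed=0):
    """Insert errors at the given per-genotype rate; missing entries stay missing.

    Returns the corrupted list and the indices that were actually changed.
    """
    rng = random.Random(seed)
    out, changed = list(genotypes), []
    for i, g in enumerate(genotypes):
        if g is None or rng.random() >= rate:
            continue
        new_g = insert_error(g, model, rng)
        if new_g != g:
            out[i] = new_g
            changed.append(i)
    return out, changed
```

Returning the changed indices alongside the corrupted data keeps the ground truth available for scoring a detector afterwards.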

20
Experimental Setup
  • Two strategies for handling MIs
      • Set all three individuals to unknown prior to
        error detection, or
      • Set child only to unknown (preserving parents'
        original data)
  • Two testing strategies
      • Test one SNP genotype: ViterbiProb-1,
        ViterbiHaps-1, TotalProb-1
      • Simultaneously test three SNP genotypes at the
        same locus: ViterbiProb-3, ViterbiHaps-3,
        TotalProb-3
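The two MI-handling strategies can be sketched as a preprocessing step. This assumes genotypes coded as allele counts (0, 1, 2) with `None` for unknown, and a trio stored as a dictionary of per-SNP lists; the data layout and the names are illustrative only.

```python
MISSING = None

def is_consistent(mother, father, child):
    """Mendelian consistency at one SNP; a missing parent may transmit either allele."""
    if child is MISSING:
        return True
    gametes = {0: {0}, 1: {0, 1}, 2: {1}, MISSING: {0, 1}}
    return any(m + f == child for m in gametes[mother] for f in gametes[father])

def mask_inconsistencies(trio, strategy="all"):
    """Set genotypes at Mendelian-inconsistent SNPs to unknown.

    trio: dict with keys "mother", "father", "child", each a list of genotypes
    strategy: "all" masks all three individuals, "child" masks the child only
    """
    if strategy not in ("all", "child"):
        raise ValueError("strategy must be 'all' or 'child'")
    targets = ("mother", "father", "child") if strategy == "all" else ("child",)
    masked = {who: list(gs) for who, gs in trio.items()}
    for j in range(len(trio["child"])):
        if not is_consistent(trio["mother"][j], trio["father"][j], trio["child"][j]):
            for who in targets:
                masked[who][j] = MISSING
    return masked

# SNP 0 is inconsistent (two 0-homozygous parents, heterozygous child).
example_trio = {"mother": [0, 1], "father": [0, 1], "child": [1, 1]}
masked_all = mask_inconsistencies(example_trio, "all")
masked_child = mask_inconsistencies(example_trio, "child")
```

Masking only the child keeps more of the parental data available to the error detection step, at the cost of retaining a parental genotype that may itself be the erroneous one.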

21
Comparison with FAMHAP (Random Allele Errors)
22
Children vs. Parents (Random Allele Errors)
23
Error Model Comparison (TrioProb-1 Parents)
24
Error Model Comparison (TrioProb-1 Children)
25
Unrelated vs. Trio Likelihood Sensitivity
Unrelated ViterbiProb-1 Likelihood ratios
(children)
Trio ViterbiProb-1 Likelihood ratios (children)
26
Pedigree Info vs. Sample Size Effect
27
TrioProb-1 Results on Real Dataset
  • [Becker et al. 06] resequenced all trio members
    at 41 loci flagged by FAMHAP-3
      • 23 SNP genotypes were identified as true errors
      • The remaining 41×3 - 23 = 100 resequenced SNP
        genotypes agree with original calls
  • Predictive value for R = 10^4 is between 18/26 = 69%
    and 24/26 = 92%, compared to 23/41 = 56% for FAMHAP-3

28
Outline
  • Introduction
  • Likelihood Sensitivity Approach to Error
    Detection
  • HMM-Based Algorithms
  • Experimental Results
  • Conclusion

29
Conclusion
  • We have proposed efficient methods for error
    detection in trio genotype data based on an HMM
    of haplotype diversity
      • Exploiting pedigree info is very useful
      • Improved detection accuracy compared to FAMHAP
      • Runtime linear in the number of SNPs and trios
  • Ongoing work
      • Improve detection accuracy via iteration
          • Fix MIs using likelihood before error detection
          • Correct errors with high likelihood ratio, then
            recompute likelihood ratios (possibly after
            re-phasing and HMM re-training)
      • Integration with genotype calling algorithms
          • Combine low-level intensity data with
            haplotype-based likelihoods

30
Questions?
31
Accuracy Measures
  • Sensitivity
      • TP/(TP+FN)
  • Predictive value
      • TP/(TP+FP)
  • Specificity
      • TN/(FP+TN)
  • False positive rate
      • 1 - Specificity
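The four measures can be computed directly from confusion-matrix counts. A minimal sketch follows; the function name is hypothetical and the counts in the example are made up for illustration.

```python
def accuracy_measures(tp, fp, tn, fn):
    """Compute the accuracy measures from confusion-matrix counts.

    tp: true errors that were flagged      fp: correct genotypes that were flagged
    tn: correct genotypes not flagged      fn: true errors that were missed
    A measure with a zero denominator is returned as NaN.
    """
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "predictive_value": ratio(tp, tp + fp),
        "specificity": ratio(tn, fp + tn),
        "false_positive_rate": ratio(fp, fp + tn),
    }

# Made-up counts, for illustration only.
example_measures = accuracy_measures(tp=8, fp=2, tn=85, fn=5)
```

Note that the false positive rate and the specificity share the denominator FP+TN, which is why they sum to 1.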