Haplotype analysis of population-based association studies

Provided by: amo131
1
Haplotype analysis of population-based
association studies
2
Why study haplotypes?
  • Much of common human genetic variation can be
    arranged on relatively few haplotypes within
    blocks of strong LD across the genome that are
    rarely disturbed by meiosis.
  • Functional properties of a protein are determined
    by the linear sequence of amino acids,
    corresponding to DNA variation on a haplotype.
  • Example: a combination of causal variants in cis
    in HPC2/ELAC2 increases risk of prostate cancer.
  • Rare causal variants may reside on specific
    haplotype backgrounds that could not be
    identified through single-locus analysis.
  • Here we will consider two specific applications
    of haplotypes in population-based association
    studies:
      • Testing for association of disease with SNPs
        in a candidate gene or small candidate region.
      • Fine-mapping within candidate regions.

3
Testing for association of disease with SNPs in
a candidate gene
4
Logistic regression framework
  • Convenient to model log-odds of disease in a
    logistic regression framework, assuming
    multiplicative haplotype risks.
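  • Likelihood contribution of individual with
    phenotype yi (1 for a case, 0 for a control),
    carrying pair of haplotypes Hi1 and Hi2, given by
    $$f(y_i \mid H_{i1}, H_{i2}, \beta) = \frac{\exp[y_i(\beta_{H_{i1}} + \beta_{H_{i2}})]}{1 + \exp(\beta_{H_{i1}} + \beta_{H_{i2}})}$$
    where $\beta_{H_j}$ denotes the additive effect of
    haplotype Hj on the log-odds of disease.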
  • Can be easily extended to incorporate covariates
    in the same way as for single-locus analysis.
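
The likelihood on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming phase is known, phenotypes are coded 1 (case) and 0 (control), and the log-odds for an individual are the sum of the effects of the two haplotypes carried. The function name, haplotype labels and toy data are hypothetical.

```python
import math

def log_likelihood(phenotypes, haplotype_pairs, beta):
    """Log-likelihood of case/control status under multiplicative haplotype risks.

    phenotypes      : list of 0/1 values (control/case)
    haplotype_pairs : list of (h1, h2) haplotype labels, one pair per individual
    beta            : dict mapping haplotype label -> additive effect on the log-odds
    """
    total = 0.0
    for y, (h1, h2) in zip(phenotypes, haplotype_pairs):
        eta = beta[h1] + beta[h2]                   # log-odds of disease
        total += y * eta - math.log1p(math.exp(eta))
    return total

# Toy data (hypothetical): three individuals, two haplotypes "A" and "B".
y = [1, 0, 1]
pairs = [("A", "A"), ("A", "B"), ("B", "B")]
print(log_likelihood(y, pairs, {"A": 0.0, "B": 0.0}))
```

Covariates could be added as extra terms in `eta`, in the same way as for single-locus analysis.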

5
Testing for haplotype association with disease
  • Under null hypothesis of no association, each
    haplotype has the same risk of disease, i.e.
    $\beta_{H_j} = \beta$ is constant for all haplotypes Hj.
  • Maximise likelihood
    $$f(y \mid H, \beta) = \prod_i f(y_i \mid H_{i1}, H_{i2}, \beta)$$
    under null and alternative models.
  • Deviance has approximate chi-squared distribution
    with n-1 degrees of freedom, where n is the
    number of distinct observed haplotypes in H.
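
As a sketch of the test described above, the deviance is twice the difference between the maximised log-likelihoods under the alternative and null models, referred to a chi-squared distribution with n-1 degrees of freedom. The helper below uses only the standard library; `chi2_sf` is a hypothetical name for a simple series evaluation of the chi-squared upper tail, and the log-likelihood values in the example are made up.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with df degrees of
    freedom, via the series for the regularised lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > 1e-15 * total:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(a * math.log(z) - z - math.lgamma(a))
    return max(0.0, 1.0 - lower)

def deviance_test(loglik_null, loglik_alt, n_haplotypes):
    """Deviance and approximate p-value with n_haplotypes - 1 degrees of freedom."""
    deviance = 2.0 * (loglik_alt - loglik_null)
    return deviance, chi2_sf(deviance, n_haplotypes - 1)

# Hypothetical maximised log-likelihoods for a model with 6 distinct haplotypes.
print(deviance_test(-120.4, -112.9, 6))
```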

6
Unknown phase
  • Haplotypes cannot generally be recovered from the
    unphased genotype data generated by current SNP
    typing technology.
  • Statistical algorithms exist to infer haplotypes
    from unphased genotype data:
      • SNPHAP: maximum-likelihood estimation via
        implementation of E-M algorithm.
      • PHASE: pseudo-Bayesian MCMC algorithm using
        population genetics model for haplotype
        evolution.
  • Statistical algorithms demonstrated to work well
    in blocks of strong LD.
  • Inferred haplotypes are just estimates of the
    true underlying phase, so it is important to
    address the error in the estimation process.

7
Unphased genotype data
  • Consider all possible haplotype configurations
    consistent with unphased genotype data, weighted
    in likelihood by the corresponding phase
    assignment probability.
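  • Likelihood contribution of individual with
    phenotype yi, with unphased multi-locus genotype
    Gi given by
    $$f(y_i \mid G_i, \beta) = \sum_{(H_{i1}, H_{i2}) \in S(G_i)} f(y_i \mid H_{i1}, H_{i2}, \beta)\, f(H_{i1}, H_{i2} \mid G_i)$$
    where $S(G_i)$ is the set of haplotype pairs
    consistent with Gi, and the phase assignment
    probability $f(H_{i1}, H_{i2} \mid G_i)$ is estimated by one of
    the haplotype reconstruction algorithms.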
  • Naturally allows for missing genotype data.
  • Deviance has approximate chi-squared distribution
    with n-1 degrees of freedom, where n is the
    number of distinct haplotypes consistent with
    observed unphased genotype data.
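
The weighting over phase configurations can be illustrated as follows. This is a sketch only: it assumes the phase assignment probabilities have already been produced by a reconstruction program and are supplied as a plain dictionary (a hypothetical input format, not the output format of SNPHAP or PHASE).

```python
import math

def individual_likelihood(y, phase_probs, beta):
    """Likelihood contribution of one individual with unphased genotype data.

    y           : phenotype, 1 (case) or 0 (control)
    phase_probs : dict mapping (h1, h2) -> probability of that haplotype pair
                  given the unphased genotype
    beta        : dict mapping haplotype -> additive effect on the log-odds
    """
    lik = 0.0
    for (h1, h2), prob in phase_probs.items():
        eta = beta[h1] + beta[h2]
        p_case = 1.0 / (1.0 + math.exp(-eta))
        lik += prob * (p_case if y == 1 else 1.0 - p_case)
    return lik

# A double heterozygote at two SNPs has two possible phase configurations.
phase_probs = {("00", "11"): 0.9, ("01", "10"): 0.1}   # hypothetical probabilities
beta = {"00": 0.0, "11": 0.0, "01": math.log(3.0), "10": 0.0}
print(individual_likelihood(1, phase_probs, beta))
```

Missing genotypes simply enlarge the set of consistent haplotype pairs, so no special handling is needed in this form.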

8
Parsimony issues
  • There may be many haplotypes consistent with
    observed unphased SNP genotype data, some of
    which will be very rare.
  • Difficult to estimate and interpret disease odds
    for rare haplotypes.
  • Each rare haplotype may contribute little
    information about association, but requires an
    additional degree of freedom in the deviance,
    leading to lack of power to detect association.
  • Pool together rare haplotypes, and assign the
    same odds of disease to everything in this
    dustbin category. However, may group together
    high- and low-risk rare haplotypes, which may
    mask association with disease.
  • More powerful to group haplotypes based on their
    likely evolution within a block of strong LD.
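
The "dustbin" pooling of rare haplotypes might look like the sketch below. The frequency threshold and the pooled label are assumptions chosen for illustration; the slides do not specify them.

```python
from collections import Counter

def pool_rare(haplotypes, min_freq=0.05, label="rare"):
    """Replace haplotypes with relative frequency below min_freq by one pooled label."""
    counts = Counter(haplotypes)
    n = len(haplotypes)
    return [h if counts[h] / n >= min_freq else label for h in haplotypes]

# Hypothetical sample of 20 chromosomes; "01" occurs only once.
sample = ["00"] * 10 + ["11"] * 9 + ["01"]
print(Counter(pool_rare(sample, min_freq=0.1)))
```

Every pooled haplotype then shares a single log-odds parameter, which is what can mask a mixture of high- and low-risk rare haplotypes.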

9
Haplotype evolution
[Figure: genealogical tree of SNP haplotypes descending from the ancestral haplotype 0 0 0 0 0, with labels 1 to 5 marked on the tree. Haplotypes shown include 1 0 0 0 0, 0 1 0 0 0, 0 1 1 0 0, 0 0 0 1 0, 0 0 0 0 1 and 0 0 0 0 0.]
10
Haplotype clustering
  • We expect a pair of haplotypes carrying the same
    disease mutation to share more recent common
    ancestry than a random pair of haplotypes from
    the population.
  • Reduce the dimensionality of the problem by
    taking advantage of the expectation that
    similar SNP haplotypes will have comparable
    risks in the region flanking the disease gene.
  • Cluster haplotypes according to their similarity.
    Assign the same contribution to log-odds of
    disease to haplotypes in the same cluster, but
    allow haplotype effects to vary between clusters.
  • Depends on availability of appropriate metric of
    distance between pairs of haplotypes. A number
    of alternative metrics exist, most based on the
    proportion of SNPs at which pairs of haplotypes
    share the same allele.

11
Measuring haplotype similarity
  • Assuming that haplotype diversity is driven
    primarily by mutation, as opposed to
    recombination events, score similarity in terms
    of SNP alleles shared in common.
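  • Similarity between haplotype Hj and Hk defined as
    $$S_{jk} = \frac{1}{L} \sum_{l=1}^{L} s_{jkl}$$
    where $s_{jkl} = 1$ if Hj and Hk carry the same
    allele at SNP l, and 0 otherwise, and L is the
    number of SNPs.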
  • Could utilise alternative similarity metric. For
    example, incorporate allele frequencies to give
    greater weight to sharing rare alleles.
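
A minimal sketch of an allele-sharing similarity score, assuming similarity is measured as the proportion of SNPs at which two haplotypes carry the same allele and that haplotypes are stored as equal-length strings (both are assumptions for illustration):

```python
def similarity(hj, hk):
    """Proportion of SNPs at which two haplotypes carry the same allele."""
    if len(hj) != len(hk):
        raise ValueError("haplotypes must span the same SNPs")
    return sum(a == b for a, b in zip(hj, hk)) / len(hj)

# Two five-SNP haplotypes that differ only at the third SNP.
print(similarity("01100", "01000"))  # 0.8
```

A frequency-weighted version would replace the unit score for each shared allele by a weight that is larger for rarer alleles.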

12
Hierarchical clustering
  • Standard hierarchical clustering algorithms
    available to group haplotypes according to their
    similarity.
  • Haplotypes are successively combined into
    increasingly diverse clusters, until ultimately
    all haplotypes form a single clade (equivalent to
    the null model). Represented as a dendrogram.
  • Can test for haplotype association with disease
    via the deviance at each level of the dendrogram,
    where all haplotypes in the same cluster have the
    same additive effect on the log-odds of disease
    (see Durrant et al. 2004, for example).
  • Multiple testing issue: can correct the maximal
    deviance using Bonferroni correction or by means
    of permutation methods.
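
The successive merging of haplotypes can be sketched with a small agglomerative routine. Average linkage and the distance (proportion of SNPs at which alleles differ) are assumptions made here for illustration; the slides only say that standard hierarchical clustering algorithms are used.

```python
from itertools import combinations

def distance(hj, hk):
    """Proportion of SNPs at which two haplotypes carry different alleles."""
    return sum(a != b for a, b in zip(hj, hk)) / len(hj)

def agglomerate(haplotypes):
    """Average-linkage clustering; returns the partition at each dendrogram level,
    from one cluster per haplotype down to a single cluster (the null model)."""
    clusters = [[h] for h in haplotypes]
    levels = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        def avg(pair):
            a, b = clusters[pair[0]], clusters[pair[1]]
            return sum(distance(x, y) for x in a for y in b) / (len(a) * len(b))
        # Merge the closest pair of clusters (first pair wins on ties).
        i, j = min(combinations(range(len(clusters)), 2), key=avg)
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append([list(c) for c in clusters])
    return levels

for partition in agglomerate(["00000", "10000", "01000", "01100", "00001"]):
    print(partition)
```

Each partition in the returned list defines one candidate model, in which haplotypes in the same cluster share an effect; the deviance would be computed at every level and the maximum corrected for multiple testing.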

13
[Figure: dendrogram of haplotype clusters plotted against distance, with partitions labelled T1, T2, T4, T5, T6, T7 and Th (h = 9 clusters of identical haplotypes).]
14
Bayesian partition model
  • Specify number of clusters, K.
  • Specify K cluster centres, CK.
  • Allocate haplotypes to nearest cluster centre.
  • Allocate log-odds of disease, ß, to each cluster.
  • Assign prior distribution to K, CK, and ß, and
    sample from their joint posterior distribution
    given phenotype data and consistent SNP
    haplotypes under the logistic regression model.
  • Implemented in Bayesian MCMC algorithm, GENEBPM,
    developed by Morris (2005).
[Figure: example partition with K = 2 clusters of the haplotypes 1121, 1111, 2121, 2122, 2222 and 2112, with cluster log-odds ß1 = 0.50 and ß2 = -0.34.]
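
The allocation step can be sketched as below, using the six haplotypes and the two log-odds values shown on the slide. The choice of centres and the use of a simple mismatch count as the distance are hypothetical, and this is not the GENEBPM implementation.

```python
def hamming(a, b):
    """Number of SNPs at which two haplotypes carry different alleles."""
    return sum(x != y for x, y in zip(a, b))

def allocate(haplotypes, centres):
    """Index of the nearest cluster centre for each haplotype
    (ties go to the lowest-numbered centre)."""
    return [min(range(len(centres)), key=lambda k: hamming(h, centres[k]))
            for h in haplotypes]

haplotypes = ["1121", "1111", "2121", "2122", "2222", "2112"]
centres = ["1121", "2222"]       # hypothetical choice of K = 2 centres
beta = [0.50, -0.34]             # log-odds of disease for each cluster

clusters = allocate(haplotypes, centres)
effects = {h: beta[k] for h, k in zip(haplotypes, clusters)}
print(clusters)
print(effects)
```

In an MCMC scheme, K, the centres and the cluster log-odds would be proposed and updated repeatedly, with this allocation recomputed after each change of centres.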
15
Prior density function
  • Under the null hypothesis of no association,
    there will be a single cluster of haplotypes, all
    with the same risk of disease.
  • Assume a prior probability of 0.5 of a single
    cluster (i.e. K = 1) of haplotypes, and assume a
    truncated geometric distribution for K > 1.
  • Prior probability of association is 0.5, and
    favours smaller numbers of clusters (i.e. more
    parsimonious models).
  • Can reduce prior probability of association to
    account for testing multiple candidate genes or
    regions.
  • Assume that each haplotype is equally likely to
    be chosen as a centre for any cluster.
  • Assume independent N(µ, σß) distributions for the
    log-odds of disease in each cluster.
  • Hyperparameter µ has a uniform density, and σß
    has an exponential distribution with expectation
    1.
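
One way to write down the prior on the number of clusters is sketched below. The geometric ratio and the upper limit on K are not given on the slide, so `ratio` and `k_max` are assumed parameters; only the mass of 0.5 at K = 1 and the truncated geometric shape for K > 1 come from the text.

```python
def prior_k(k, k_max, ratio=0.5, p_null=0.5):
    """Prior probability of k clusters: mass p_null at k = 1, and a geometric
    distribution truncated to 2..k_max sharing the remaining 1 - p_null.
    Requires k_max >= 2 and 0 < ratio < 1."""
    if not 1 <= k <= k_max:
        return 0.0
    if k == 1:
        return p_null
    norm = (1.0 - ratio ** (k_max - 1)) / (1.0 - ratio)
    return (1.0 - p_null) * ratio ** (k - 2) / norm

print([round(prior_k(k, 10), 4) for k in range(1, 11)])
```

Lowering `p_null` below 0.5 is not what the slide suggests; to account for multiple candidate genes one would instead raise it, reducing the prior probability of association `1 - p_null`.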

16
Example: CYP2D6
  • The gene CYP2D6 on human chromosome 22q13 has an
    established role in drug metabolism.
  • Hosking et al. (2002) genotyped 1,018 individuals
    at 32 SNP markers across an 890kb region flanking
    CYP2D6.
  • By typing four functional polymorphisms in
    CYP2D6, 41 individuals were found to carry two
    mutations (not necessarily the same variant), and
    hence were predicted to be recessive poor drug
    metaboliser (PDM) cases.
  • Standard single-locus analysis of SNP markers
    revealed highly-significant evidence of
    association across 400kb block of strong LD
    flanking the CYP2D6 locus.
  • GENEBPM algorithm applied to the 32 SNP markers
    (excluding the known functional polymorphisms) to
    test for association between haplotypes in the
    region flanking CYP2D6 and PDM phenotype.

17
GENEBPM analysis
  • 878 haplotypes consistent with the observed
    unphased SNP genotype data. 41 common
    haplotypes with relative frequency greater than
    0.5%.
  • Posterior distribution of the number of clusters
    of haplotypes ranges from 3 to 22, with a mode of
    5. Posterior probability of association > 99.9%.

18
Posterior haplotype similarity
Log-odds-ratios (baseline haplotype 1) for
high-risk clusters (A) 7.25-7.28 (B) 2.12-3.05.
19
Clustering of cases
Variant 1 associated with cluster (A). Variant 2
associated with cluster (B).
20
When is GENEBPM analysis appropriate?
  • Use of the GENEBPM algorithm relies on robust
    haplotype estimation, and assumes minimal
    evidence of ancestral recombination in
    calculating the haplotype similarity metric.
  • GENEBPM is appropriate for testing for
    association within a strong block of LD, or
    within a single candidate gene subject to limited
    recombination.
  • GENEBPM has been demonstrated to perform well,
    even in the presence of ancestral recombination
    (for example CYP2D6 application).
  • For larger genetic regions:
      • Perform sliding window analysis, with
        independent analyses undertaken with
        overlapping sets of SNPs.
      • Break into non-overlapping blocks of strong
        LD, with independent analyses performed within
        each block.
  • Remember to take account of multiple testing.

21
Fine-mapping within candidate regions
22
Fine-mapping and the hidden SNP problem
  • The goal of fine-mapping studies is to improve
    the resolution of estimates of the location of
    functional polymorphism(s) contributing to
    disease.
  • It is possible that we have not typed the
    functional polymorphism itself, but we can use
    genotypes at nearby SNPs and local patterns of LD
    to infer genotypes at the unobserved locus (or so
    called hidden SNP).
  • By modelling association between disease and
    genotypes at the hidden SNP for many different
    possible locations across the candidate region,
    we can plot a statistical measure of support for
    association in order to fine-map the functional
    polymorphism.

23
Likelihood formulation
  • Logistic regression model parameterised in terms
    of the log-odds of disease, ß, for each genotype
    at the hidden SNP.
  • Calculate the likelihood of observed phenotype
    data, y, for a case-control sample of
    individuals, given their haplotypes, H, at marker
    SNPs as a summation over possible genotypes, X,
    at the hidden SNP at location z:
    $$f(y \mid H, z, \beta) = \sum_{X} f(y \mid X, \beta)\, f(X \mid H, z)$$
    where $f(y \mid X, \beta) = \prod_i \exp(y_i \beta_{X_i}) / [1 + \exp(\beta_{X_i})]$
    is the logistic regression likelihood given the
    hidden SNP genotypes.
  • We cannot evaluate the conditional distribution
    of hidden SNP genotypes, f(X|H,z), directly.
    However, we can approximate the likelihood by
    taking samples of hidden SNP genotypes, X1, X2,
    ..., XT, from this conditional distribution using
    population genetics theory.
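
The sampling approximation can be sketched as a simple Monte Carlo average: the likelihood is estimated by the mean, over T sampled genotype configurations, of the logistic regression likelihood given each configuration. The sketch assumes genotypes are coded 0, 1 or 2 and that the samples have already been drawn by some other routine; the data below are hypothetical.

```python
import math

def phenotype_likelihood(y, genotypes, beta):
    """Logistic regression likelihood of phenotypes given hidden SNP genotypes.

    beta maps each genotype (0, 1 or 2) to its log-odds of disease."""
    lik = 1.0
    for yi, xi in zip(y, genotypes):
        p_case = 1.0 / (1.0 + math.exp(-beta[xi]))
        lik *= p_case if yi == 1 else 1.0 - p_case
    return lik

def mc_likelihood(y, sampled_genotypes, beta):
    """Average the likelihood over T sampled genotype configurations X1..XT."""
    return sum(phenotype_likelihood(y, x, beta)
               for x in sampled_genotypes) / len(sampled_genotypes)

y = [1, 0, 1]                                   # hypothetical phenotypes
samples = [[0, 1, 2], [1, 1, 0], [2, 0, 2]]     # hypothetical draws, T = 3
print(mc_likelihood(y, samples, {0: -1.0, 1: 0.0, 2: 1.0}))
```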

24
Sampling genotypes at the hidden SNP
[Figure: genealogical tree of marker SNP haplotypes descending from the ancestral haplotype 0 0 0 0 0 0, with the hidden SNP at position z and labels 1 to 5 marked on the tree.]
25
Sampling practicalities
  • A common distribution for genealogical trees
    representing the ancestry of a sample of
    chromosomes is given by the coalescent process.
  • However, we cannot sample directly from the
    conditional distribution of genealogies, given
    the observed marker SNP haplotype data.
  • One attractive solution is to use the product of
    approximate conditionals (PAC) likelihood
    introduced by Li and Stephens (2003).
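  • Conditional on the estimated fine-scale
    recombination rate, ρ, across the candidate
    region, it follows that the conditional
    distribution of hidden SNP genotypes, f(X|H,z),
    can be approximated under the PAC model.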
  • Haplotypes are not observed directly from
    unphased SNP genotype data. However, PHASE can
    be used to infer haplotypes, and obtain estimates
    of the recombination rate across the candidate
    region.

26
Localisation
  • Consider many positions, Z = {z1, z2, ..., zP},
    for the functional polymorphism.
  • Sample from the conditional distribution of
    hidden SNP genotypes at each position, given the
    observed haplotype data.
  • Obtain the likelihood f(y|H,zj) at each position
    by integration over the hidden SNP genotype odds
    of disease, ß.
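  • Posterior probability that the functional
    polymorphism is located at position zj given by
    Bayes theorem
    $$f(z_j \mid y, H) = \frac{f(y \mid H, z_j)\, f(z_j)}{\sum_{k=1}^{P} f(y \mid H, z_k)\, f(z_k)}$$
    where f(zj) denotes the prior probability
    that the functional polymorphism is located at
    position zj.
  • Assuming all positions to be equally likely, a
    priori, we can approximate the posterior
    probability by
    $$f(z_j \mid y, H) \approx \frac{f(y \mid H, z_j)}{\sum_{k=1}^{P} f(y \mid H, z_k)}$$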
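
The normalisation over candidate positions is straightforward to sketch. The version below works with log-likelihoods to avoid underflow and accepts an optional prior; with no prior supplied, all positions are treated as equally likely. The input values are hypothetical.

```python
import math

def position_posterior(log_liks, prior=None):
    """Posterior probability of each candidate position by Bayes theorem.

    log_liks : log of the likelihood at each position z1..zP
    prior    : prior probability of each position (uniform if omitted)
    """
    n = len(log_liks)
    if prior is None:
        prior = [1.0 / n] * n
    m = max(log_liks)                            # subtract the maximum for stability
    weights = [math.exp(l - m) * p for l, p in zip(log_liks, prior)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log-likelihoods at three candidate positions.
print(position_posterior([math.log(1.0), math.log(2.0), math.log(1.0)]))
```

Plotting these probabilities against position gives the localisation curve, from which a credible region for the functional polymorphism can be read off.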

27
Example: cystic fibrosis (CF)
  • CF is a fully penetrant recessive disease, most
    common in white populations with an incidence of
    one case in each 2500 live births.
  • Preliminary linkage analysis had suggested a
    1.8Mb candidate region for a single CF gene on
    chromosome 7q31.
  • More recently, a 3bp deletion, ΔF508, has been
    identified within this region in the CFTR gene.
  • It is now well established that ΔF508 accounts
    for 66% of all chromosomal mutations in
    individuals with CF, with the remainder made up
    of many rare mutations in the same gene.
  • Kerem et al. (1989) obtained marker haplotypes
    from 94 case chromosomes and 92 control
    chromosomes using 23 RFLPs in the candidate
    region.
  • Of the case chromosomes, 62 have now been
    confirmed as carrying the ΔF508 mutation.
  • Single locus analysis revealed strong evidence of
    association with CF of RFLPs in 300kb region
    flanking CFTR.

31
Comments
  • Flexibility of logistic regression modelling
    framework allows for covariates including
    environmental risk factors, polygenic effects and
    indicators of population structure.
  • Can allow for a range of different genetic models
    of disease at the hidden SNP.
  • Hidden SNP need not be fixed as binary. Could
    easily extend the method to allow for genetic
    heterogeneity by considering a multi-allelic
    hidden locus, or several tightly linked SNPs
    coding for risk haplotypes at functional
    polymorphisms.
  • With accurate estimation of haplotypes, method
    can be used to fine-map functional polymorphisms
    across entire chromosomes.
  • The hidden SNP approach can also be used to test
    for association between disease and an unobserved
    functional polymorphism at a fixed location by
    comparison of f(y|H,z) with the likelihood under
    the null model of no association.
  • With accurate information regarding patterns of
    LD from the International HapMap project, we can
    use the hidden SNP approach to test for
    association between disease and polymorphisms in
    HapMap that have not been typed in the
    association study (i.e. imputation).

32
Summary
  • Haplotype analysis of population-based
    association studies may have greater power than
    single-locus tests:
      • Take account of background patterns of LD
        between loci.
      • Correspond to the block-like structure of
        common genetic variation.
  • Haplotypes can be reconstructed from unphased
    genotype data within blocks of strong LD using
    statistical algorithms.
  • Power of haplotype-based tests of association can
    be improved by clustering haplotypes according to
    their similarity, used as a proxy for recent
    shared ancestry.
  • Fine-mapping of disease loci can be implemented
    by treating the functional polymorphism(s) as
    hidden SNP(s), and simulating the distribution of
    genotypes at these loci using population genetics
    theory.
  • Within a logistic regression framework, we can
    incorporate covariates to account for non-genetic
    risk factors, polygenic effects, and indicators
    of population structure.

33
References
  • Durrant C, Zondervan KT, Cardon LR, Hunt S,
    Deloukas P, Morris AP (2004). Linkage
    disequilibrium mapping via cladistic analysis of
    SNP haplotypes. Am J Hum Genet 75: 35-43.
  • Hosking LK et al. (2002). Linkage disequilibrium
    mapping identifies a 390kb region associated with
    CYP2D6 poor drug metabolising activity.
    Pharmacogenomics J 2: 165-175.
  • Kerem BS et al. (1989). Identification of the
    cystic fibrosis gene: genetic analysis. Science
    245: 1073-1080.
  • Li N, Stephens M (2003). Modelling linkage
    disequilibrium, and identifying recombination
    hotspots using SNP data. Genetics 165:
    2213-2233.
  • Morris AP (2005). Direct analysis of unphased
    SNP genotype data in population-based association
    studies via Bayesian partition modeling of
    haplotypes. Genet Epidemiol 29: 91-107.