Title: Haplotype analysis of population-based association studies
1Haplotype analysis of population-based
association studies
2Why study haplotypes?
- Much of common human genetic variation can be arranged on relatively few haplotypes within blocks of strong LD across the genome that are rarely disturbed by meiosis.
- Functional properties of a protein are determined by the linear sequence of amino acids, corresponding to DNA variation on a haplotype.
- Combinations of causal variants in cis in HPC2/ELAC2 increase risk of prostate cancer.
- Rare causal variants may reside on specific haplotype backgrounds that could not be identified through single-locus analysis.
- Here we will consider two specific applications of haplotypes in population-based association studies:
  - Testing for association of disease with SNPs in a candidate gene or small candidate region.
  - Fine-mapping within candidate regions.
3Testing for association of disease with SNPs in
a candidate gene
4Logistic regression framework
- Convenient to model log-odds of disease in a logistic regression framework, assuming multiplicative haplotype risks.
- Likelihood contribution of individual $i$ with phenotype $y_i$, carrying the pair of haplotypes $H_{i1}$ and $H_{i2}$, is given by
  $$f(y_i \mid H_{i1}, H_{i2}, \beta) = \frac{\exp[y_i(\beta_{H_{i1}} + \beta_{H_{i2}})]}{1 + \exp(\beta_{H_{i1}} + \beta_{H_{i2}})}$$
  where $\beta_{H_j}$ denotes the additive effect of haplotype $H_j$ on the log-odds of disease.
- Can be easily extended to incorporate covariates in the same way as for single-locus analysis.
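The likelihood contribution described on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the log-odds of disease for an individual is the sum of the effects of the two haplotypes carried; the function name, the haplotype labels and the effect values are hypothetical and do not come from the slides.

```python
import math

def likelihood_contribution(y, hap1, hap2, beta):
    """Probability of phenotype y (1 = case, 0 = control) for one individual.

    Assumes the log-odds of disease is beta[hap1] + beta[hap2], i.e. additive
    haplotype effects on the log-odds (multiplicative on the odds).
    """
    log_odds = beta[hap1] + beta[hap2]
    p_case = 1.0 / (1.0 + math.exp(-log_odds))
    return p_case if y == 1 else 1.0 - p_case

# Illustrative (made-up) haplotype effects on the log-odds scale.
beta = {"00000": 0.0, "10000": 0.4, "01000": -0.2}
print(likelihood_contribution(1, "10000", "01000", beta))
```

The likelihood for the whole sample would be the product of these contributions over individuals.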
5Testing for haplotype association with disease
- Under the null hypothesis of no association, each haplotype has the same risk of disease, i.e. $\beta_{H_j} = \beta$ is constant for all haplotypes $H_j$.
- Maximise the likelihood
  $$f(y \mid H, \beta) = \prod_i f(y_i \mid H_{i1}, H_{i2}, \beta)$$
  under the null and alternative models.
- Deviance has an approximate chi-squared distribution with n-1 degrees of freedom, where n is the number of distinct observed haplotypes in H.
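As a rough sketch of the test on this slide, the deviance can be computed from the two maximised log-likelihoods and referred to a chi-squared distribution with n-1 degrees of freedom. The code below assumes the log-likelihoods have already been maximised elsewhere; the function names are hypothetical, and the tail probability is computed with a closed-form expression for integer degrees of freedom so that only the standard library is needed.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-squared variable with integer df >= 1."""
    t = x / 2.0
    odd = df % 2 == 1
    a = 0.5 if odd else 0.0
    # Base term: erfc(sqrt(t)) for odd df, 0 for even df, then add series terms.
    tail = math.erfc(math.sqrt(t)) if odd else 0.0
    for k in range(df // 2):
        tail += t ** (a + k) * math.exp(-t) / math.gamma(a + k + 1.0)
    return tail

def deviance_test(loglik_null, loglik_alt, n_haplotypes):
    """Deviance, degrees of freedom and approximate p-value."""
    deviance = max(0.0, 2.0 * (loglik_alt - loglik_null))
    df = n_haplotypes - 1
    return deviance, df, chi2_sf(deviance, df)

# Made-up log-likelihoods for a model with three distinct haplotypes.
print(deviance_test(-100.0, -95.0, 3))
```

The chi-squared approximation is asymptotic, so the p-value should be treated with caution when some haplotypes are rare.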
6Unknown phase
- Haplotypes cannot generally be recovered from the unphased genotype data generated by current SNP typing technology.
- Statistical algorithms exist to infer haplotypes from unphased genotype data:
  - SNPHAP: maximum likelihood via an implementation of the E-M algorithm.
  - PHASE: pseudo-Bayesian MCMC algorithm using a population genetics model for haplotype evolution.
- Statistical algorithms demonstrated to work well in blocks of strong LD.
- Inferred haplotypes are just estimates of the true underlying phase, so it is important to address the error in the estimation process.
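To see why phase is ambiguous, it can help to enumerate the haplotype pairs that are consistent with one unphased genotype. The sketch below is illustrative only and is not how SNPHAP or PHASE work internally; it assumes each SNP genotype is coded as the count (0, 1 or 2) of allele "1", and the function name is hypothetical.

```python
from itertools import product

def consistent_haplotype_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with an unphased genotype.

    genotype: list of counts (0, 1 or 2) of allele '1' at each SNP.
    """
    het = [m for m, g in enumerate(genotype) if g == 1]
    base = [1 if g == 2 else 0 for g in genotype]
    pairs = set()
    # Try every assignment of the '1' allele at heterozygous SNPs.
    for phase in product([0, 1], repeat=len(het)):
        h1, h2 = base[:], base[:]
        for m, allele in zip(het, phase):
            h1[m], h2[m] = allele, 1 - allele
        s1 = "".join(map(str, h1))
        s2 = "".join(map(str, h2))
        pairs.add(tuple(sorted((s1, s2))))
    return sorted(pairs)

# Heterozygous at the first two SNPs, homozygous '0' at the third.
print(consistent_haplotype_pairs([1, 1, 0]))
```

An individual heterozygous at h SNPs has 2^(h-1) consistent configurations, which is why the number of candidate haplotypes grows quickly with the number of markers.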
7Unphased genotype data
- Consider all possible haplotype configurations consistent with unphased genotype data, weighted in the likelihood by the corresponding phase assignment probability.
- Likelihood contribution of individual $i$ with phenotype $y_i$ and unphased multi-locus genotype $G_i$ is given by
  $$f(y_i \mid G_i, \beta) = \sum_{(H_{i1}, H_{i2})} f(y_i \mid H_{i1}, H_{i2}, \beta)\, f(H_{i1}, H_{i2} \mid G_i)$$
  where the sum runs over haplotype pairs consistent with $G_i$, and $f(H_{i1}, H_{i2} \mid G_i)$ is estimated by one of the haplotype reconstruction algorithms.
- Naturally allows for missing genotype data.
- Deviance has an approximate chi-squared distribution with n-1 degrees of freedom, where n is the number of distinct haplotypes consistent with the observed unphased genotype data.
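The phase-weighted likelihood on this slide can be sketched as a weighted sum over haplotype pairs. The example assumes additive haplotype effects on the log-odds and takes the phase assignment probabilities as given; in practice they would come from a reconstruction algorithm. All names and numbers below are hypothetical.

```python
import math

def phased_likelihood(y, hap1, hap2, beta):
    """P(y | haplotype pair) with log-odds beta[hap1] + beta[hap2]."""
    p_case = 1.0 / (1.0 + math.exp(-(beta[hap1] + beta[hap2])))
    return p_case if y == 1 else 1.0 - p_case

def unphased_likelihood(y, phase_probs, beta):
    """P(y | unphased genotype): phased likelihoods weighted by phase probability.

    phase_probs: dict mapping (hap1, hap2) -> estimated P(hap1, hap2 | genotype).
    """
    return sum(weight * phased_likelihood(y, h1, h2, beta)
               for (h1, h2), weight in phase_probs.items())

# Made-up effects and phase probabilities for a double heterozygote.
beta = {"00": 0.0, "11": 0.8, "01": -0.3, "10": 0.1}
phase_probs = {("00", "11"): 0.9, ("01", "10"): 0.1}
print(unphased_likelihood(1, phase_probs, beta))
```

Missing genotypes fit the same pattern: they simply enlarge the set of haplotype pairs with non-zero weight.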
8Parsimony issues
- There may be many haplotypes consistent with observed unphased SNP genotype data, some of which will be very rare.
- Difficult to estimate and interpret disease odds for rare haplotypes.
- Each rare haplotype may contribute little information about association, but requires an additional degree of freedom in the deviance, leading to lack of power to detect association.
- Pool together rare haplotypes, and assign the same odds of disease to everything in this dustbin category. However, this may group together high- and low-risk rare haplotypes, which may mask association with disease.
- More powerful to group haplotypes based on their likely evolution within a block of strong LD.
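The pooling strategy described above can be sketched as follows. This is only an illustration of the "dustbin" idea; the frequency threshold, the pooled label and the function name are assumptions, not values taken from the slides.

```python
from collections import Counter

def pool_rare_haplotypes(haplotypes, min_freq=0.05, pooled_label="RARE"):
    """Replace haplotypes with sample frequency below min_freq by one pooled label."""
    counts = Counter(haplotypes)
    total = len(haplotypes)
    return [h if counts[h] / total >= min_freq else pooled_label
            for h in haplotypes]

# Twenty chromosomes: two common haplotypes and one seen only once.
sample = ["00000"] * 10 + ["01000"] * 9 + ["01100"]
pooled = pool_rare_haplotypes(sample, min_freq=0.10)
print(Counter(pooled))
```

Every pooled haplotype then shares one parameter in the regression, which saves degrees of freedom at the cost of possibly mixing high- and low-risk haplotypes.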
9Haplotype evolution
[Figure: tree of five-SNP haplotypes descending from the ancestral haplotype 0 0 0 0 0, with branches labelled 1 to 5. Haplotypes shown: 0 0 0 0 0, 1 0 0 0 0, 0 1 0 0 0, 0 1 1 0 0, 0 0 0 1 0 and 0 0 0 0 1.]
10Haplotype clustering
- We expect a pair of haplotypes carrying the same disease mutation to share more recent common ancestry than a random pair of haplotypes from the population.
- Reduce the dimensionality of the problem by taking advantage of the expectation that similar SNP haplotypes will have comparable risks in the region flanking the disease gene.
- Cluster haplotypes according to their similarity. Assign the same contribution to log-odds of disease to haplotypes in the same cluster, but allow haplotype effects to vary between clusters.
- Depends on availability of an appropriate metric of distance between pairs of haplotypes. A number of alternative metrics exist, most based on the proportion of SNPs at which pairs of haplotypes share the same allele.
11Measuring haplotype similarity
- Assuming that haplotype diversity is driven primarily by mutation, as opposed to recombination events, score similarity in terms of SNP alleles shared in common.
- Similarity between haplotypes $H_j$ and $H_k$, typed at $M$ SNPs, defined as
  $$S_{jk} = \frac{1}{M} \sum_{m=1}^{M} \delta_{jkm}$$
  where $\delta_{jkm} = 1$ if $H_j$ and $H_k$ carry the same allele at SNP $m$, and $\delta_{jkm} = 0$ otherwise.
- Could utilise alternative similarity metric. For example, incorporate allele frequencies to give greater weight to sharing rare alleles.
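A minimal sketch of an allele-sharing similarity score, assuming it is the proportion of SNPs at which two haplotypes carry the same allele. Haplotypes are represented here as equal-length strings, and the function name is hypothetical.

```python
def similarity(hap_j, hap_k):
    """Proportion of SNPs at which two haplotypes carry the same allele."""
    if len(hap_j) != len(hap_k):
        raise ValueError("haplotypes must be typed at the same SNPs")
    shared = sum(a == b for a, b in zip(hap_j, hap_k))
    return shared / len(hap_j)

print(similarity("01100", "01000"))  # differ at one of five SNPs
```

A frequency-weighted variant would replace the simple count of matches with a sum of per-SNP weights that are larger when the shared allele is rare.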
12Hierarchical clustering
- Standard hierarchical clustering algorithms are available to group haplotypes according to their similarity.
- Haplotypes are successively combined into increasingly diverse clusters, until ultimately all haplotypes form a single clade (equivalent to the null model). Represented as a dendrogram.
- Can test for haplotype association with disease via the deviance at each level of the dendrogram, where all haplotypes in the same cluster have the same additive effect on the log-odds of disease (see Durrant et al. 2004, for example).
- Multiple testing issue: can correct the maximal deviance using Bonferroni correction or by means of permutation methods.
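The successive merging of haplotypes into clusters can be sketched with a small agglomerative routine. Average linkage on the allele-sharing proportion is an assumption made here for illustration; the slides do not say which linkage rule is used, and the function names are hypothetical.

```python
def similarity(hap_j, hap_k):
    """Proportion of SNPs at which two haplotypes carry the same allele."""
    return sum(a == b for a, b in zip(hap_j, hap_k)) / len(hap_j)

def agglomerate(haplotypes):
    """Return the partition at every level, from singletons to one cluster."""
    clusters = [[h] for h in haplotypes]
    levels = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the highest average similarity.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [similarity(a, b)
                        for a in clusters[i] for b in clusters[j]]
                avg = sum(sims) / len(sims)
                if best is None or avg > best[0]:
                    best = (avg, i, j)
        _, i, j = best
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        levels.append([c[:] for c in clusters])
    return levels

haps = ["00000", "10000", "01000", "01100"]
for level in agglomerate(haps):
    print(level)
```

Each level defines one candidate model for the deviance test, so with L levels a Bonferroni correction would multiply the smallest p-value by L (capped at 1).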
13Distance
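[Figure: dendrogram of haplotype clusters plotted against distance, with levels labelled T1, T2, T4, T5, T6, T7 and Th9 (h9: clusters of identical haplotypes).]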
14Bayesian partition model
Specify number of clusters, K.
Specify K cluster centres, CK.
Allocate haplotypes to nearest cluster centre.
Allocate log-odds of disease, β, to each cluster.
Assign prior distribution to K, CK, and β, and sample from their joint posterior distribution given phenotype data and consistent SNP haplotypes under the logistic regression model.
Implemented in Bayesian MCMC algorithm, GENEBPM, developed by Morris (2005).
[Figure: example partition with K = 2 clusters of the haplotypes 1121, 1111, 2121, 2122, 2222 and 2112, with cluster log-odds β1 = 0.50 and β2 = -0.34.]
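The allocation step of the partition model can be sketched as follows. The haplotypes and the two log-odds values are those shown on the slide, but the choice of cluster centres, the tie-breaking rule and the function names are assumptions made for illustration; this is not the GENEBPM implementation.

```python
def similarity(hap_j, hap_k):
    """Proportion of SNPs at which two haplotypes carry the same allele."""
    return sum(a == b for a, b in zip(hap_j, hap_k)) / len(hap_j)

def allocate(haplotypes, centres):
    """Map each haplotype to the index of its most similar cluster centre.

    Ties are broken in favour of the lowest-indexed centre (an assumption).
    """
    allocation = {}
    for hap in haplotypes:
        sims = [similarity(hap, centre) for centre in centres]
        allocation[hap] = sims.index(max(sims))
    return allocation

haplotypes = ["1121", "1111", "2121", "2122", "2222", "2112"]
centres = ["1121", "2122"]      # hypothetical choice of K = 2 centres
cluster_beta = [0.50, -0.34]    # log-odds per cluster, as on the slide

allocation = allocate(haplotypes, centres)
# Log-odds for an individual carrying haplotypes 1111 and 2222.
log_odds = cluster_beta[allocation["1111"]] + cluster_beta[allocation["2222"]]
print(allocation, log_odds)
```

In the full model the number of clusters, the centres and the cluster log-odds are all sampled by MCMC rather than fixed as they are in this sketch.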
15Prior density function
- Under the null hypothesis of no association, there will be a single cluster of haplotypes, all with the same risk of disease.
- Assume a prior probability of 0.5 of a single cluster (i.e. K = 1) of haplotypes, and assume a truncated geometric distribution for K > 1.
- Prior probability of association is therefore 0.5, and the prior favours smaller numbers of clusters (i.e. more parsimonious models).
- Can reduce prior probability of association to account for testing multiple candidate genes or regions.
- Assume that each haplotype is equally likely to be chosen as a centre for any cluster.
- Assume independent N(μ, σβ) distributions for the log-odds of disease in each cluster.
- Hyperparameters: μ has a Uniform density, and σβ has an exponential distribution with expectation 1.
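The prior on the number of clusters can be sketched as a point mass of 0.5 at K = 1 plus a truncated geometric distribution over K > 1. The geometric ratio and the maximum number of clusters below are assumptions chosen for illustration; the slides do not give these values.

```python
def prior_num_clusters(k, k_max, ratio=0.5, p_null=0.5):
    """Prior probability of K = k clusters.

    P(K = 1) = p_null; for 2 <= k <= k_max the remaining mass follows a
    geometric decay with the given ratio, renormalised over 2..k_max.
    """
    if k < 1 or k > k_max:
        return 0.0
    if k == 1:
        return p_null
    norm = sum(ratio ** (j - 2) for j in range(2, k_max + 1))
    return (1.0 - p_null) * ratio ** (k - 2) / norm

probs = [prior_num_clusters(k, k_max=10) for k in range(1, 11)]
print(probs)
```

Lowering `p_null` below 0.5 would raise the prior probability of association, and raising it would make the analysis more conservative, for example when many candidate genes are tested.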
16Example CYP2D6
- The gene CYP2D6 on human chromosome 22q13 has an established role in drug metabolism.
- Hosking et al. (2002) genotyped 1,018 individuals at 32 SNP markers across an 890kb region flanking CYP2D6.
- By typing four functional polymorphisms in CYP2D6, 41 individuals were found to carry two mutations (not necessarily the same variant), and hence were predicted to be recessive poor drug metaboliser (PDM) cases.
- Standard single-locus analysis of SNP markers revealed highly significant evidence of association across a 400kb block of strong LD flanking the CYP2D6 locus.
- GENEBPM algorithm applied to the 32 SNP markers (excluding the known functional polymorphisms) to test for association between haplotypes in the region flanking CYP2D6 and PDM phenotype.
17GENEBPM analysis
- 878 haplotypes consistent with the observed unphased SNP genotype data. 41 common haplotypes with relative frequency greater than 0.5%.
- Posterior distribution of the number of clusters of haplotypes ranges from 3 to 22, with a mode of 5.
- Posterior probability of association > 99.9%.
18Posterior haplotype similarity
Log-odds-ratios (baseline haplotype 1) for
high-risk clusters (A) 7.25-7.28 (B) 2.12-3.05.
19Clustering of cases
Variant 1 associated with cluster (A). Variant 2
associated with cluster (B).
20When is GENEBPM analysis appropriate?
- Use of the GENEBPM algorithm relies on robust haplotype estimation, and assumes minimal evidence of ancestral recombination in calculating the haplotype similarity metric.
- GENEBPM is appropriate for testing for association within a strong block of LD, or within a single candidate gene subject to limited recombination.
- GENEBPM has been demonstrated to perform well, even in the presence of ancestral recombination (for example CYP2D6 application).
- For larger genetic regions:
  - Perform sliding window analysis, with independent analyses undertaken with overlapping sets of SNPs.
  - Break into non-overlapping blocks of strong LD, with independent analyses performed within each block.
  - Remember to take account of multiple testing.
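A sliding window analysis over a larger region can be set up with a small helper that yields overlapping sets of SNP indices. The window width, step size and function name are hypothetical choices for illustration; the slides do not specify them.

```python
def sliding_windows(n_snps, width, step):
    """Half-open (start, end) index ranges of overlapping SNP windows."""
    if width >= n_snps:
        return [(0, n_snps)]
    windows = [(start, start + width)
               for start in range(0, n_snps - width + 1, step)]
    # Add a final window if the last SNPs would otherwise be left out.
    if windows[-1][1] < n_snps:
        windows.append((n_snps - width, n_snps))
    return windows

windows = sliding_windows(n_snps=32, width=8, step=4)
print(windows)
# A simple Bonferroni threshold for one test per window.
print(0.05 / len(windows))
```

Because overlapping windows give correlated tests, a Bonferroni threshold of this kind is conservative; permutation methods are an alternative.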
21Fine-mapping within candidate regions
22Fine-mapping and the hidden SNP problem
- The goal of fine-mapping studies is to improve the resolution of estimates of the location of functional polymorphism(s) contributing to disease.
- It is possible that we have not typed the functional polymorphism itself, but we can use genotypes at nearby SNPs and local patterns of LD to infer genotypes at the unobserved locus (the so-called hidden SNP).
- By modelling association between disease and genotypes at the hidden SNP for many different possible locations across the candidate region, we can plot a statistical measure of support for association in order to fine-map the functional polymorphism.
23Likelihood formulation
- Logistic regression model parameterised in terms of the log-odds of disease, $\beta$, for each genotype at the hidden SNP.
- Calculate the likelihood of observed phenotype data, $y$, for a case-control sample of individuals, given their haplotypes, $H$, at marker SNPs as a summation over possible genotypes, $X$, at the hidden SNP at location $z$:
  $$f(y \mid H, z, \beta) = \sum_{X} f(y \mid X, \beta)\, f(X \mid H, z)$$
  where
  $$f(y \mid X, \beta) = \prod_i \frac{\exp(y_i \beta_{X_i})}{1 + \exp(\beta_{X_i})}$$
- We cannot evaluate the conditional distribution of hidden SNP genotypes, $f(X \mid H, z)$, directly. However, we can approximate the likelihood by taking samples of hidden SNP genotypes, $X_1, X_2, \ldots, X_T$, from this conditional distribution using population genetics theory.
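The sampling approximation can be sketched as a Monte Carlo average of the phenotype likelihood over sampled hidden SNP genotypes. The sketch assumes the samples have already been drawn from the conditional distribution by some other procedure, that genotypes are coded 0, 1 or 2, and that each genotype has its own log-odds of disease; all names and values are hypothetical.

```python
import math

def log_lik_given_genotypes(y, x, beta):
    """log f(y | X, beta): logistic model with one log-odds per genotype."""
    total = 0.0
    for y_i, x_i in zip(y, x):
        eta = beta[x_i]
        total += y_i * eta - math.log1p(math.exp(eta))
    return total

def monte_carlo_log_lik(y, samples, beta):
    """Log of the average of f(y | X_t, beta) over sampled genotype vectors."""
    logs = [log_lik_given_genotypes(y, x, beta) for x in samples]
    top = max(logs)
    # Log-sum-exp keeps the average stable when likelihoods are tiny.
    return top + math.log(sum(math.exp(v - top) for v in logs) / len(logs))

y = [1, 0, 1, 0]                          # phenotypes (1 = case)
samples = [[2, 0, 1, 0], [1, 0, 2, 0]]    # T = 2 made-up genotype samples
beta = [-1.0, 0.0, 1.0]                   # log-odds for genotypes 0, 1, 2
print(monte_carlo_log_lik(y, samples, beta))
```

Working on the log scale matters in practice because the likelihood of a whole case-control sample is a product of many probabilities.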
24Sampling genotypes at the hidden SNP
[Figure: tree of haplotypes descending from the ancestral haplotype 0 0 0 0 0 0, with branches labelled 1 to 5 for the marker SNPs and z for the hidden SNP. Haplotypes shown: 0 0 0 0 0 0, 1 0 0 0 0 0, 0 1 0 0 0 0, 0 1 0 1 0 0, 0 1 1 1 0 0, 0 0 0 0 1 0 and 0 0 0 0 0 1.]
25Sampling practicalities
- A common distribution for genealogical trees representing the ancestry of a sample of chromosomes is given by the coalescent process.
- However, we cannot sample directly from the conditional distribution of genealogies, given the observed marker SNP haplotype data.
- One attractive solution is to use the product of approximate conditionals (PAC) likelihood introduced by Li and Stephens (2003).
- Conditional on the estimated fine-scale recombination rate, ρ, across the candidate region, it follows that [equation not recoverable from source].
- Haplotypes are not observed directly from unphased SNP genotype data. However, PHASE can be used to infer haplotypes, and obtain estimates of the recombination rate across the candidate region.
26Localisation
- Consider many positions, $Z = \{z_1, z_2, \ldots, z_P\}$, for the functional polymorphism.
- Sample from the conditional distribution of hidden SNP genotypes at each position, given the observed haplotype data.
- Obtain the likelihood $f(y \mid H, z_j)$ at each position by integration over the hidden SNP genotype log-odds of disease, $\beta$.
- Posterior probability that the functional polymorphism is located at position $z_j$ given by Bayes' theorem:
  $$f(z_j \mid y, H) = \frac{f(y \mid H, z_j)\, f(z_j)}{\sum_{k=1}^{P} f(y \mid H, z_k)\, f(z_k)}$$
  where $f(z_j)$ denotes the prior probability that the functional polymorphism is located at position $z_j$.
- Assuming all positions to be equally likely, a priori, we can approximate the posterior probability by
  $$f(z_j \mid y, H) \approx \frac{f(y \mid H, z_j)}{\sum_{k=1}^{P} f(y \mid H, z_k)}$$
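The localisation step amounts to normalising the likelihood at each candidate position into a posterior probability via Bayes' theorem. The sketch below assumes the log-likelihood at each position has already been computed; the function name and the numbers are hypothetical.

```python
import math

def posterior_over_positions(log_liks, priors=None):
    """Posterior probability of each candidate position.

    log_liks: log f(y | H, z_j) at each position z_j.
    priors:   prior probabilities f(z_j); uniform if omitted.
    """
    n = len(log_liks)
    if priors is None:
        priors = [1.0 / n] * n
    log_post = [ll + math.log(p) for ll, p in zip(log_liks, priors)]
    top = max(log_post)
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = [math.exp(v - top) for v in log_post]
    total = sum(weights)
    return [w / total for w in weights]

# Made-up log-likelihoods at three candidate positions.
print(posterior_over_positions([-120.4, -118.1, -119.0]))
```

Plotting these probabilities against position gives the kind of support curve used to fine-map the functional polymorphism.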
27Example cystic fibrosis (CF)
- CF is a fully penetrant recessive disease, most common in white populations, with an incidence of one case in every 2500 live births.
- Preliminary linkage analysis had suggested a 1.8Mb candidate region for a single CF gene on chromosome 7q31.
- More recently, a 3bp deletion, ΔF508, has been identified within this region in the CFTR gene.
- It is now well established that ΔF508 accounts for 66% of all chromosomal mutations in individuals with CF, with the remainder made up of many rare mutations in the same gene.
- Kerem et al. (1989) obtained marker haplotypes from 94 case chromosomes and 92 control chromosomes using 23 RFLPs in the candidate region.
- Of the case chromosomes, 62 have now been confirmed as carrying the ΔF508 mutation.
- Single-locus analysis revealed strong evidence of association with CF of RFLPs in a 300kb region flanking CFTR.
31Comments
- Flexibility of logistic regression modelling framework allows for covariates including environmental risk factors, polygenic effects and indicators of population structure.
- Can allow for a range of different genetic models of disease at the hidden SNP.
- Hidden SNP need not be fixed as binary. Could easily extend the method to allow for genetic heterogeneity by considering a multi-allelic hidden locus, or several tightly linked SNPs coding for risk haplotypes at functional polymorphisms.
- With accurate estimation of haplotypes, method can be used to fine-map functional polymorphisms across entire chromosomes.
- The hidden SNP approach can also be used to test for association between disease and an unobserved functional polymorphism at a fixed location by comparison of f(y|H,z) with the likelihood under the null model of no association.
- With accurate information regarding patterns of LD from the International HapMap project, we can use the hidden SNP approach to test for association between disease and polymorphisms in HapMap that have not been typed in the association study (i.e. imputation).
32Summary
- Haplotype analysis of population-based association studies may have greater power than single-locus tests:
  - Take account of background patterns of LD between loci.
  - Correspond to the block-like structure of common genetic variation.
- Haplotypes can be reconstructed from unphased genotype data within blocks of strong LD using statistical algorithms.
- Power of haplotype-based tests of association can be improved by clustering haplotypes according to their similarity, used as a proxy for recent shared ancestry.
- Fine-mapping of disease loci can be implemented by treating the functional polymorphism(s) as hidden SNP(s), and simulating the distribution of genotypes at these loci using population genetics theory.
- Within a logistic regression framework, we can incorporate covariates to account for non-genetic risk factors, polygenic effects, and indicators of population structure.
33References
- Durrant C, Zondervan KT, Cardon LR, Hunt S, Deloukas P, Morris AP (2004). Linkage disequilibrium mapping via cladistic analysis of SNP haplotypes. Am J Hum Genet 75: 35-43.
- Hosking LK et al. (2002). Linkage disequilibrium mapping identifies a 390kb region associated with CYP2D6 poor drug metabolising activity. Pharmacogenomics J 2: 165-175.
- Kerem BS et al. (1989). Identification of the cystic fibrosis gene: genetic analysis. Science 245: 1073-1080.
- Li N, Stephens M (2003). Modelling linkage disequilibrium, and identifying recombination hotspots using SNP data. Genetics 165: 2213-2233.
- Morris AP (2005). Direct analysis of unphased SNP genotype data in population-based association studies via Bayesian partition modeling of haplotypes. Genet Epidemiol 29: 91-107.