Bayesian Inference of Epistatic Interactions in Case-control Studies

1
Bayesian Inference of Epistatic Interactions in
Case-control Studies

Yu Zhang & Jun S Liu
Nature Genetics 39, 1167 - 1173 (2007)
Presented by Yixuan Chen
2/29/08

2
Outline

Epistasis Background
Methods
The Bayesian marker partition model
MCMC sampling
B statistic
Results
Epistasis models and simulations
Comparisons
Genome-wide association study of AMD
Discussions

3
Epistasis

Epistasis is a phenomenon whereby the effects of
a given gene on a biological trait are masked or
enhanced by one or more other genes.
For complex traits such as diabetes, asthma,
hypertension, etc., the presence of epistasis is
a particular cause for concern.
An increasing number of reports have indicated
the presence of multilocus interactions in many
human complex traits.

4
Genome-wide Epistasis

The number of possible interaction combinations
is astronomical.
It is a daunting task to catch one or a very
few disease-related interactions.
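
To make "astronomical" concrete, the following minimal sketch counts the candidate marker sets for a given interaction order. The function name `n_interactions` and the marker counts in the loop are illustrative assumptions, not figures from the slides.

```python
from math import comb

def n_interactions(n_markers: int, order: int) -> int:
    """Number of distinct marker sets of size `order` among `n_markers` markers."""
    return comb(n_markers, order)

# Illustrative marker counts (assumed): a small panel and a genome-scale panel.
for n_markers in (1_000, 100_000):
    for order in (1, 2, 3):
        print(n_markers, order, n_interactions(n_markers, order))
```

Even at 1,000 markers there are already hundreds of thousands of pairs, which is why an exhaustive search over higher orders quickly becomes impractical.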

5
BEAM Algorithm

Bayesian Epistasis Association Mapping
A Bayesian partitioning model
B statistic and conditional B statistic
BEAM is significantly more powerful than the methods it is
compared against.
A genome-scale epistasis mapping is both feasible
and desirable.

6
Notations

Suppose $N_d$ cases and $N_u$ controls were genotyped
at L SNP markers.
$D = (d_1, \ldots, d_{N_d})$: case genotypes
$U = (u_1, \ldots, u_{N_u})$: control genotypes
$d_i = (d_{i1}, \ldots, d_{iL})$
$u_i = (u_{i1}, \ldots, u_{iL})$

7
Marker Partitioning

The L markers are partitioned into 3 groups:
group 0 contains markers unlinked to the disease;
group 1 contains markers contributing
independently to the disease risk;
group 2 contains markers that jointly influence
the disease risk (interactions).

8
Group Membership

Let $I = (I_1, \ldots, I_L)$ indicate the membership of the
markers, with $I_j = 0$, 1 or 2 for groups 0, 1 and 2, respectively.
The goal is to infer the set $\{j : I_j > 0\}$.
Let $l_0, l_1, l_2$ denote the number of markers in
each group ($l_0 + l_1 + l_2 = L$).
Let $D_0$, $D_1$ and $D_2$ denote the case genotypes of
markers in groups 0, 1 and 2, respectively.

9
The Bayesian marker partition model

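The likelihood model assumes independence between
markers in the control population.
Let $\Theta_1 = \{(\theta_{j1}, \theta_{j2}, \theta_{j3}) : I_j = 1\}$ be
the genotype frequencies of each biallelic marker
in group 1 in the disease population.
The likelihood of $D_1$ is
$P(D_1 \mid \Theta_1) = \prod_{j: I_j = 1} \prod_{k=1}^{3} \theta_{jk}^{n_{jk}}$,
where $n_{j1}, n_{j2}, n_{j3}$ are the genotype counts of marker
j among the cases.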

10
Dirichlet Distribution

A family of continuous multivariate probability
distributions parameterized by the vector $\alpha$ of
positive reals.
The probabilities of K rival events are $x_i$ ($i = 1, \ldots, K$).
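
For reference, the standard density of the Dirichlet distribution is given below; the slide's own formula did not survive in the transcript, so this is the textbook form with the same symbols ($K$, $x_i$, $\alpha$), and $B(\alpha)$ denotes the normalizing constant.

```latex
f(x_1, \ldots, x_K; \alpha)
  = \frac{1}{B(\alpha)} \prod_{i=1}^{K} x_i^{\alpha_i - 1},
\qquad
B(\alpha) = \frac{\prod_{i=1}^{K} \Gamma(\alpha_i)}{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)},
\qquad
x_i \ge 0,\ \ \sum_{i=1}^{K} x_i = 1 .
```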

11
Dirichlet Prior for Group 1

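Assume a Dirichlet($\alpha$) prior for $(\theta_{j1}, \theta_{j2}, \theta_{j3})$,
where $\alpha = (\alpha_1, \alpha_2, \alpha_3)$.
Integrate out the genotype frequencies $\Theta_1$ and obtain the marginal
probability
$P(D_1 \mid I) = \prod_{j: I_j = 1} \left[ \frac{\Gamma(|\alpha|)}{\Gamma(N_d + |\alpha|)} \prod_{k=1}^{3} \frac{\Gamma(n_{jk} + \alpha_k)}{\Gamma(\alpha_k)} \right]$,
with $|\alpha| = \alpha_1 + \alpha_2 + \alpha_3$.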
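
The integration over the group 1 frequencies yields a Dirichlet-multinomial marginal, a product of Gamma-function ratios. A minimal sketch of that computation on the log scale follows; the function name `log_marginal_independent` and the default pseudo-counts of 0.5 are assumptions for illustration, and complete genotype data is assumed.

```python
from math import lgamma

def log_marginal_independent(counts, alpha=(0.5, 0.5, 0.5)):
    """Log Dirichlet-multinomial marginal for markers treated independently.

    counts: list of (n1, n2, n3) genotype counts among cases, one triple per marker.
    alpha:  Dirichlet pseudo-counts; the default 0.5 is an assumed value.
    """
    alpha_sum = sum(alpha)
    total = 0.0
    for marker_counts in counts:
        n_cases = sum(marker_counts)
        # Gamma(|alpha|) / Gamma(N + |alpha|)
        total += lgamma(alpha_sum) - lgamma(n_cases + alpha_sum)
        # prod_k Gamma(n_k + alpha_k) / Gamma(alpha_k)
        total += sum(lgamma(n + a) - lgamma(a) for n, a in zip(marker_counts, alpha))
    return total

# Two markers with hypothetical genotype counts in 100 cases.
print(log_marginal_independent([(60, 30, 10), (81, 18, 1)]))
```

Working with `lgamma` rather than `gamma` avoids overflow once the counts reach the hundreds.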

12
Dirichlet Prior for Group 2

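There are $3^{l_2}$ possible genotype combinations over the markers in
group 2, each with its own frequency in the disease population
(collectively $\Theta_2$).
$n_k$: the number of cases with genotype combination k.
Assume a Dirichlet($\beta$) prior distribution for $\Theta_2$.
Integrate out $\Theta_2$ to obtain
$P(D_2 \mid I) = \frac{\Gamma(|\beta|)}{\Gamma(N_d + |\beta|)} \prod_{k=1}^{3^{l_2}} \frac{\Gamma(n_k + \beta_k)}{\Gamma(\beta_k)}$,
with $|\beta| = \sum_k \beta_k$.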
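
For the interaction group the same Dirichlet-multinomial integration applies, but over joint genotype combinations rather than single markers. The sketch below assumes a symmetric prior (every $\beta_k$ equal to a single value `beta`, an assumption made here for brevity) and genotypes coded 0, 1, 2; the function name `log_marginal_joint` is hypothetical.

```python
from collections import Counter
from math import lgamma

def log_marginal_joint(genotypes, beta=0.5):
    """Log marginal of case genotypes under a saturated joint model.

    genotypes: list of tuples, one per case, holding the 0/1/2 genotypes
               at the markers in the interaction group.
    beta:      common Dirichlet pseudo-count per genotype combination (assumed).
    """
    if not genotypes:
        return 0.0
    n_cells = 3 ** len(genotypes[0])          # 3^{l2} genotype combinations
    beta_sum = beta * n_cells
    total = lgamma(beta_sum) - lgamma(len(genotypes) + beta_sum)
    # Unobserved combinations contribute Gamma(beta)/Gamma(beta) = 1,
    # so only observed combinations need to be visited.
    for n_k in Counter(genotypes).values():
        total += lgamma(n_k + beta) - lgamma(beta)
    return total

# Hypothetical cases genotyped at two interacting markers.
print(log_marginal_joint([(0, 0), (2, 2), (2, 2), (0, 1)]))
```

Skipping empty cells matters in practice: with six markers there are 729 combinations, most of which are never observed.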

13
Dirichlet Prior for Group 0

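Markers in group 0 have the same genotype distributions in cases as
in controls.
Let $\Theta = (\theta_1, \ldots, \theta_L)$ be the genotype frequencies of the L markers in the
control population.
$n_{jk}$ and $m_{jk}$ are the numbers of individuals with
genotype k at marker j in D and U, respectively.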

14
Dirichlet Prior for Group 0 (c1)

Assume a Dirichlet prior for each $\theta_j$.
Integrate out $\Theta$ to obtain the marginal probability of the
group 0 case genotypes and the control genotypes.

15
Posterior Distribution

The posterior distribution of I
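
The slide's formula is not preserved in the transcript. By Bayes' rule the posterior must combine the three marginal probabilities with a prior on the partition; a sketch is given below. The product-form prior with per-marker probabilities $p_1$ and $p_2$ of belonging to groups 1 and 2 is an assumption, chosen because the transcript later quotes prior values for $p_1$ and $p_2$.

```latex
P(I \mid D, U) \;\propto\; P(D_1 \mid I)\, P(D_2 \mid I)\, P(D_0, U \mid I)\, P(I),
\qquad
P(I) \;\propto\; p_1^{\,l_1}\, p_2^{\,l_2}\, (1 - p_1 - p_2)^{\,l_0} .
```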

16
Markov chain Monte Carlo

Markov chain Monte Carlo (MCMC) methods are a
class of algorithms for sampling from probability
distributions.
They are based on constructing a Markov chain that has the
desired distribution as its stationary
distribution.
The state of the chain after a large number of
steps is then used as a sample from the desired
distribution.
The quality of the sample improves as a function
of the number of steps.

17
Metropolis-Hastings algorithm
18
M-H in BEAM

Two types of proposals are used:
randomly change a marker's group membership;
randomly exchange two markers between groups 0, 1
and 2.
The proposed move is accepted according to the
M-H ratio, which is just a ratio of Gamma
functions.
To improve the sampling efficiency:
set a lower bound on the number of markers in
group 2;
gradually reduce this bound to 0 during burn-in.
This forces the algorithm to explore the space of
high-order interactions.
An annealing strategy is also used in the burn-in
iterations:
a temperature is set high initially and gradually
reduced to 1.
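
The two proposal types can be sketched as a plain Metropolis-Hastings loop over membership vectors. This is a toy illustration, not the BEAM implementation: `log_post` is a hypothetical user-supplied function returning the log posterior up to a constant, the toy target and the 50/50 choice between move types are assumptions, and the group 2 lower bound and annealing schedule are omitted.

```python
import math
import random

def mh_partition(log_post, n_markers, n_iter, seed=0):
    """Metropolis-Hastings over membership vectors I in {0, 1, 2}^n_markers.

    Returns, for each marker, the fraction of iterations spent in each group.
    """
    rng = random.Random(seed)
    state = [0] * n_markers                    # start with every marker in group 0
    current = log_post(state)
    visits = [[0, 0, 0] for _ in range(n_markers)]
    for _ in range(n_iter):
        proposal = list(state)
        if rng.random() < 0.5:
            # Move 1: change one marker's membership to a different group.
            j = rng.randrange(n_markers)
            proposal[j] = rng.choice([g for g in (0, 1, 2) if g != state[j]])
        else:
            # Move 2: exchange the memberships of two markers.
            j, k = rng.sample(range(n_markers), 2)
            proposal[j], proposal[k] = proposal[k], proposal[j]
        candidate = log_post(proposal)
        # Both moves are symmetric, so the M-H ratio is the posterior ratio.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            state, current = proposal, candidate
        for j, g in enumerate(state):
            visits[j][g] += 1
    return [[count / n_iter for count in row] for row in visits]

# Toy target (assumed): the posterior strongly favours a known membership vector.
truth = [2, 2, 0, 1, 0, 0]
toy_log_post = lambda membership: 10.0 * sum(a == b for a, b in zip(membership, truth))
frequencies = mh_partition(toy_log_post, n_markers=len(truth), n_iter=5000)
print([max(range(3), key=row.__getitem__) for row in frequencies])
```

The per-marker frequencies play the role of posterior probabilities of group membership; in a real run the early burn-in iterations would be discarded before counting.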

19
The data simulated under Model 4 contain 1,000
markers genotyped in 1,000 cases and 1,000 controls, with
MAF = 0.1 and marginal effect size 0.4. BEAM was run
for 150,000 burn-in iterations plus 200,000 sampling
iterations in three chains, with priors p1 = p2 = 1/3.
20
Circles denote the overall posterior
probabilities of associations, with marginal and
joint associations combined. Plus signs denote
posterior probabilities of marginal associations.
Three circles on the top correspond to the three
simulated disease markers having interaction
effects.
21
B Statistic

A hypothesis-testing procedure to check each
marker or set of markers for significant
associations
For each set M of k markers to be tested, the
null hypothesis is that markers in M are not
associated with the disease.
Here, k = 1, 2, 3, ... corresponds to single-marker, two-way
and three-way interactions, etc.

22
B Statistic (c1)

Define the B statistic for the marker set M as
$B_M = \ln \frac{P_A(D_M, U_M)}{P_0(D_M, U_M)}$
$P_0(D_M, U_M)$ and $P_A(D_M, U_M)$ are
the marginal probabilities of the data, with
parameters integrated out from our Bayesian model,
under the null and the alternative models,
respectively; their ratio is a Bayes factor.

23
Bayes Factor

Given a model selection problem between two
models $M_1$ and $M_2$, on the basis of a data vector
x, the Bayes factor K is given by
$K = \frac{p(x \mid M_1)}{p(x \mid M_2)}$,
where $p(x \mid M_i)$ is the marginal likelihood for
$M_i$.
Similar to a likelihood-ratio test, but
instead of maximizing the likelihood,
Bayesians average it over the parameters.

24
B Statistic (c2)

Choose both $P_0$ and $P_A$ as an equal mixture of two
distributions:
one that assumes independence among markers in M,
$P_{ind}(X)$, of the form of $P(D_1 \mid I)$;
the other a saturated joint distribution of
genotype combinations across all markers in M,
$P_{join}(X)$, of the form of $P(D_2 \mid I)$.
Under the null hypothesis, the B statistic is
asymptotically distributed as a shifted $\chi^2$ with
$3^k - 1$ degrees of freedom.
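
A minimal sketch of a B-type statistic as a log Bayes factor is shown below. The slides do not spell out how the mixtures enter $P_0$ and $P_A$, so the composition used here is an assumption: under the null, cases and controls are pooled and scored by the equal mixture; under the alternative, cases follow the saturated joint model while controls follow the mixture. All function names and the symmetric pseudo-counts of 0.5 are hypothetical.

```python
from collections import Counter
from math import exp, lgamma, log

def log_p_ind(rows, alpha=0.5):
    """Log marginal assuming independent markers (Dirichlet-multinomial per marker)."""
    total = 0.0
    for column in zip(*rows):
        total += lgamma(3 * alpha) - lgamma(len(column) + 3 * alpha)
        total += sum(lgamma(n + alpha) - lgamma(alpha) for n in Counter(column).values())
    return total

def log_p_join(rows, beta=0.5):
    """Log marginal under a saturated joint model over genotype combinations."""
    if not rows:
        return 0.0
    cells = 3 ** len(rows[0])
    total = lgamma(cells * beta) - lgamma(len(rows) + cells * beta)
    total += sum(lgamma(n + beta) - lgamma(beta) for n in Counter(rows).values())
    return total

def log_equal_mixture(a, b):
    """log(0.5 * exp(a) + 0.5 * exp(b)), computed stably."""
    m = max(a, b)
    return m + log(0.5 * exp(a - m) + 0.5 * exp(b - m))

def b_statistic(cases, controls):
    """Log ratio of alternative to null marginal probabilities (assumed form)."""
    pooled = cases + controls
    log_p0 = log_equal_mixture(log_p_ind(pooled), log_p_join(pooled))
    log_pa = log_p_join(cases) + log_equal_mixture(log_p_ind(controls), log_p_join(controls))
    return log_pa - log_p0

# Hypothetical two-marker data: cases and controls carry different combinations.
cases = [(0, 0)] * 50 + [(2, 2)] * 50
controls = [(0, 2)] * 50 + [(2, 0)] * 50
print(b_statistic(cases, controls))
```

Turning the statistic into a P value would additionally require the shift of the asymptotic $\chi^2$ distribution, which the transcript does not give.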

25
Conditional B Statistic

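A set of k (= 2, 3, ...) markers may include t (< k)
markers that are significant through either
marginal or partial interaction associations.
The asymptotic null distribution of the conditional B statistic
$B_{M|T}$, where T is the subset of t markers already found
significant, is a shifted $\chi^2$ with $3^k - 3^t$ degrees of freedom.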

26
Simulated Epistasis Models
27
Simulated Epistasis Models (c1)
28
Simulated Epistasis Models (c2)

Model 6 is a 6-way interaction model.
Denote the genotypes of each SNP by 0, 1, and 2.
Code each genotype combination over the 6 disease
loci by an integer between 0 and 728.
Assign a disease effect of 50 to genotype
combinations 4, 5, 7, 111, 114, 253, 254, 360,
387, 603, and 630.
The effect is set to 50 so that these genotype combinations can
explain a non-trivial portion (>10%) of cases.
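
The integer coding of six-locus genotype combinations can be sketched as a base-3 number. The digit order (first locus as the most significant digit) is an assumption, since the slides do not state which convention was used; `encode` and `decode` are hypothetical helper names.

```python
def encode(genotypes):
    """Map a tuple of 0/1/2 genotypes to an integer, first locus most significant."""
    code = 0
    for g in genotypes:
        if g not in (0, 1, 2):
            raise ValueError(f"genotype must be 0, 1 or 2, got {g!r}")
        code = code * 3 + g
    return code

def decode(code, n_loci=6):
    """Inverse of encode for a fixed number of loci."""
    digits = []
    for _ in range(n_loci):
        code, g = divmod(code, 3)
        digits.append(g)
    return tuple(reversed(digits))

# Combinations listed on the slide, decoded under the assumed digit order.
for combination in (4, 5, 7, 111, 114, 253, 254, 360, 387, 603, 630):
    print(combination, decode(combination))
```

With six loci and three genotypes each there are 3^6 = 729 combinations, which matches the range 0 to 728.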

29
Simulated Epistasis Models (c3)

50 data sets for each disease model were
simulated under each setting
Marker minor allele frequencies (MAF) are
uniformly distributed in [0.05, 0.5].
Each untyped disease locus is linked to one
genotyped marker
The remaining markers are unlinked

30
Comparison Algorithms

The stepwise logistic regression approach:
All markers are individually tested and ranked
for marginal associations.
The top 10% of markers are selected, among which
all k-way (k = 2 or 3) interactions are tested and
ranked.
A $\chi^2$ test with two degrees of freedom is used to test for
single-marker associations.
A stepwise B-stat:
uses the same search strategy as stepwise logistic
regression;
uses the B statistic for testing significance.

31
Power Calculation

Define the power of each method as the proportion
of 50 data sets in which all truly associated
markers are identified and show statistically
significant associations (adjusted P values below
0.1) with the disease.
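
As a small sketch of this definition, power can be computed from per-data-set adjusted P values as follows. The data layout (one dictionary of marker to adjusted P value per simulated data set) and the function name `power` are assumptions for illustration.

```python
def power(results, true_markers, threshold=0.1):
    """Fraction of data sets in which every truly associated marker is significant.

    results:      list of dicts mapping marker name -> adjusted P value,
                  one dict per simulated data set; missing markers count as
                  not identified.
    true_markers: names of the truly associated markers.
    """
    hits = sum(
        all(result.get(marker, 1.0) < threshold for marker in true_markers)
        for result in results
    )
    return hits / len(results)

# Three hypothetical data sets; only the first detects both markers.
example = [{"a": 0.01, "b": 0.05}, {"a": 0.01, "b": 0.5}, {"a": 0.2}]
print(power(example, ["a", "b"]))
```

Note that this is a strict criterion: a data set in which only one of two interacting markers is found counts as a failure.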

32
A Hierarchical Procedure to Declare Significance

Marginal associations
report all markers with significant marginal
associations after a Bonferroni correction for
the number of markers, L.
2-way interactions
after the Bonferroni correction for L(L-1)/2
tests,
report all significant novel 2-way interactions
(neither marker has been reported earlier)
if one marker has been reported earlier,
compute its conditional B-statistic (or the
conditional log likelihood ratio (LLR) for
logistic regression)
report the interaction if significant after a
Bonferroni correction.

33
A Hierarchical Procedure to Declare Significance
(c1)

3-way interactions
report all novel 3-way interactions that are
significant after a Bonferroni correction for
L(L-1)(L-2)/6 tests
if t = 1 or 2 markers were already found
significant,
calculate the conditional B-statistic (or the
conditional LLR)
report the interaction if it is still significant
after a Bonferroni correction.
All p-values were estimated by a chi-square
distribution with $d = 3^k - 3^t$ degrees of freedom, for
k = 1, 2, 3; t = 0, 1, 2; t < k, and adjusted by Bonferroni
corrections.
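
The bookkeeping behind the hierarchical procedure, namely the number of tests at each interaction order, the Bonferroni adjustment, and the degrees of freedom, can be sketched as follows. The function names are hypothetical, and the sketch stops short of computing P values because the shift of the chi-square distribution is not given in the slides.

```python
from math import comb

def n_tests(n_markers: int, k: int) -> int:
    """Number of k-way marker sets: L, L(L-1)/2, L(L-1)(L-2)/6 for k = 1, 2, 3."""
    return comb(n_markers, k)

def bonferroni_adjust(p_value: float, n_markers: int, k: int) -> float:
    """Bonferroni-adjusted P value for one k-way test among all k-way sets."""
    return min(1.0, p_value * n_tests(n_markers, k))

def degrees_of_freedom(k: int, t: int = 0) -> int:
    """d = 3^k - 3^t for a k-way test conditional on t already-significant markers."""
    if not 0 <= t < k:
        raise ValueError("require 0 <= t < k")
    return 3 ** k - 3 ** t

for k in (1, 2, 3):
    print(k, n_tests(1000, k), [degrees_of_freedom(k, t) for t in range(k)])
```

For example, a raw P value of 1e-8 for a pair among 1,000 markers adjusts to about 0.005, whereas the same raw value for a triple adjusts to 1.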

34
Results

The power for detecting marginal associations was
not compromised by using the more complex models.

BEAM (B), the stepwise B-stat (S), the stepwise
logistic regression (L), and the 2-d.f. $\chi^2$ test (C).
Each data set contains 1,000 markers. Black bars
represent the power for 1,000 cases and 1,000
controls. Gray bars represent the power for 2,000
cases and 2,000 controls.
35
BEAM (B), the stepwise B-stat (S), the stepwise
logistic regression (L), and the 2-d.f. $\chi^2$ test (C).
36
BEAM (B), the stepwise B-stat (S), the stepwise
logistic regression (L), and the 2-d.f. $\chi^2$ test (C).
Each data set contains 1,000 markers. Black bars
represent the power for 1,000 cases and 1,000
controls. Gray bars represent the power for 2,000
cases and 2,000 controls.
37
Results (c1)

BEAM performs better especially when either
disease allele frequencies or marginal effects
were small.
The power of all methods decreases with the decay
of the LD (measured in $r^2$) between disease loci
and associated markers.

38
Type I Errors

All three epistasis mapping methods made similar
numbers of type I errors.
At the 0.1 significance level, they all made 10%
type I errors (after Bonferroni correction) when
searching only for marginally significant
markers.
All methods made far fewer than 10% type I
errors when searching for interactions.

39
(No Transcript)
40
Impact of mismatch in allele frequencies and LD

The power of association mapping can be greatly
hampered by the discrepancy of allele frequencies
between unobserved disease loci and associated
genotyped markers.
The impact was investigated based on model 2:
MAFs at the two interacting disease loci were both
0.1.
The marginal effect size per disease locus was
0.5.
One linked marker had the matched MAF, whereas
the other had an MAF ranging from 0.05 to 0.5.
The LD between disease loci and associated
markers was controlled to range from D′ = 0.7 to
D′ = 1.

41
(No Transcript)
42
Genome-wide association study of AMD

The AMD (age-related macular degeneration) data
set contains 116,204 SNPs genotyped for 96
affected individuals and 50 controls.
Remove nonpolymorphic SNPs and those that
significantly deviated from Hardy-Weinberg
equilibrium (HWE).
Remove additional SNPs containing more than five
missing genotypes.
After filtering, 96,932 SNPs remained.
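
The three filters can be sketched as a per-SNP predicate. The slides do not state how HWE deviation was tested or at what significance level, so the one-degree-of-freedom chi-square test and the cutoff of 3.84 (P = 0.05) are assumptions, as are the function names and the choice to test pooled genotype counts.

```python
def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic comparing genotype counts with HWE expectations."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)           # frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def keep_snp(n_aa, n_ab, n_bb, n_missing, max_missing=5, chi2_cutoff=3.84):
    """True if the SNP passes the polymorphism, HWE and missingness filters."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return False
    a_alleles = 2 * n_aa + n_ab
    if a_alleles == 0 or a_alleles == 2 * n:  # nonpolymorphic: a single allele observed
        return False
    if n_missing > max_missing:               # more than five missing genotypes
        return False
    return hwe_chi2(n_aa, n_ab, n_bb) <= chi2_cutoff

# Hypothetical genotype counts (AA, AB, BB, missing).
print(keep_snp(25, 50, 25, 0), keep_snp(50, 0, 50, 0), keep_snp(100, 0, 0, 0))
```

With only 146 individuals, an exact HWE test might be preferable to the chi-square approximation for SNPs with rare genotypes.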

43
(No Transcript)
44
Prior Calibration

With only 146 individuals and 100,000 SNPs, the
posterior probability of associations for each
marker is strongly influenced by the choice of
priors, although the order of these probabilities
is nearly invariant.

45
(No Transcript)
46
Simulation Based on AMD Data
47
Comparison with other epistasis mapping approaches

MDR identifies k-way interactions through an
exhaustive search and evaluates the association
between each interaction and the disease by
cross-validations.
Logic regression infers a tree-based relationship
between the disease status and a set of
markers. It evaluates the detected associations by
permutation tests.
BGTA uses a bootstrap-type resampling screening
procedure to select markers, and those markers
with return frequencies greater than the third
quartile plus 1.8 times the interquartile range
are deemed disease-associated markers.
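
The BGTA selection rule described above (third quartile plus 1.8 times the interquartile range) can be sketched as follows. The quartile convention (`method="inclusive"`) and the function name are assumptions; different quantile definitions give slightly different cutoffs.

```python
from statistics import quantiles

def select_by_return_frequency(frequencies, multiplier=1.8):
    """Indices of markers whose return frequency exceeds Q3 + multiplier * IQR."""
    q1, _, q3 = quantiles(frequencies, n=4, method="inclusive")
    cutoff = q3 + multiplier * (q3 - q1)
    return [i for i, f in enumerate(frequencies) if f > cutoff]

# Hypothetical return frequencies for eight markers; the last one stands out.
print(select_by_return_frequency([1, 2, 3, 4, 5, 6, 7, 100]))
```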

48
(No Transcript)
49
Discussions

The BEAM algorithm has two essential components:
a Bayesian epistasis inference tool implemented
via MCMC;
a novel test statistic for evaluating statistical
significance.
Natural advantages of the Bayesian approach:
it incorporates prior knowledge about each marker;
it quantifies all information and uncertainties in the
form of posteriors.
Evaluating the statistical significance of a
candidate finding via P values:
is more robust to model choice and prior assumptions;
can give the scientist peace of mind.

50
Discussions (c1)

The power of epistasis mapping depends
critically on
sample size
effects of disease mutations
any discrepancy in allele frequencies between
disease loci and associated markers
There are several issues that may affect the
accuracy
population substructures
genotyping errors
disease heterogeneities

51
THANK YOU
