Bayesian Inference of Epistatic Interactions in Case-control Studies

1
Bayesian Inference of Epistatic Interactions in
Case-control Studies
  • Yu Zhang & Jun S Liu
  • Nature Genetics 39, 1167-1173 (2007)
  • Presented by Yixuan Chen
  • 2/29/08

2
Outline
  • Epistasis Background
  • Methods
  • The Bayesian marker partition model
  • MCMC sampling
  • B statistic
  • Results
  • Epistasis models and simulations
  • Comparisons
  • Genome-wide association study of AMD
  • Discussions

3
Epistasis
  • Epistasis is a phenomenon whereby the effects of
    a given gene on a biological trait are masked or
    enhanced by one or more other genes.
  • For complex traits such as diabetes, asthma,
    hypertension, etc., the presence of epistasis is
    a particular cause for concern.
  • An increasing number of reports have indicated
    the presence of multilocus interactions in many
    human complex traits.

4
Genome-wide Epistasis
  • The number of possible interaction combinations
    is astronomical.
  • Finding the one or very few disease-related
    interactions among them is a daunting task.
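
To give a sense of scale, the number of k-way marker sets among L markers is the binomial coefficient C(L, k). The sketch below evaluates it for the marker counts quoted elsewhere in this presentation (1,000 simulated markers and 96,932 AMD SNPs); `n_interactions` is a hypothetical helper name, not part of BEAM.

```python
from math import comb

def n_interactions(n_markers: int, order: int) -> int:
    """Number of distinct marker sets of size `order` among `n_markers`."""
    return comb(n_markers, order)

# Pairwise and three-way search spaces for two marker-panel sizes.
for n_markers in (1_000, 96_932):
    for order in (2, 3):
        print(n_markers, order, n_interactions(n_markers, order))
```

Even the pairwise search space grows quadratically in L, which is why an exhaustive scan quickly becomes impractical at higher orders.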

5
BEAM Algorithm
  • Bayesian Epistasis Association Mapping
  • A Bayesian partitioning model
  • B statistic and conditional B statistic
  • BEAM is significantly more powerful than the
    comparison methods in the simulations.
  • A genome-scale epistasis mapping is both feasible
    and desirable.

6
Notations
  • Suppose $N_d$ cases and $N_u$ controls were
    genotyped at $L$ SNP markers.
  • $D = (d_1, \dots, d_{N_d})$: case genotypes
  • $U = (u_1, \dots, u_{N_u})$: control genotypes
  • $d_i = (d_{i1}, \dots, d_{iL})$
  • $u_i = (u_{i1}, \dots, u_{iL})$

7
Marker Partitioning
  • The L markers are partitioned into 3 groups
  • group 0 contains markers unlinked to the disease
  • group 1 contains markers contributing
    independently to the disease risk
  • group 2 contains markers that jointly influence
    the disease risk (interactions).

8
Group Membership
  • Let $I = (I_1, \dots, I_L)$ indicate the membership of
    the markers, with $I_j = 0$, 1 or 2 for groups 0, 1
    and 2, respectively.
  • The goal is to infer the set $\{j : I_j > 0\}$.
  • Let $l_0$, $l_1$, $l_2$ denote the number of markers in
    each group ($l_0 + l_1 + l_2 = L$)
  • Let $D_0$, $D_1$ and $D_2$ denote case genotypes of
    markers in group 0, 1 and 2, respectively.

9
The Bayesian marker partition model
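  • The likelihood model assumes independence between
    markers in the control population.
  • The genotype frequencies of each biallelic marker
    in group 1 in the disease population:
    $\Theta_1 = \{(\theta_{j1}, \theta_{j2}, \theta_{j3}) : I_j = 1\}$
  • The likelihood of $D_1$ is
    $P(D_1 \mid \Theta_1) = \prod_{j : I_j = 1} \prod_{k=1}^{3} \theta_{jk}^{n_{jk}}$
  • $n_{j1}$, $n_{j2}$, $n_{j3}$ are the genotype counts of
    marker $j$ in the cases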

10
Dirichlet Distribution
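  • A family of continuous multivariate probability
    distributions parameterized by the vector $\alpha$ of
    positive reals
  • Its density gives the belief that the probabilities
    of $K$ rival events are $x_i$:
    $f(x_1, \dots, x_K; \alpha) = \frac{\Gamma(\sum_{i=1}^{K} \alpha_i)}{\prod_{i=1}^{K} \Gamma(\alpha_i)} \prod_{i=1}^{K} x_i^{\alpha_i - 1}$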

11
Dirichlet Prior for Group 1
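  • Assume a Dirichlet($\alpha$) prior for
    $(\theta_{j1}, \theta_{j2}, \theta_{j3})$, where
    $\alpha = (\alpha_1, \alpha_2, \alpha_3)$
  • Integrate out $\Theta_1$ and obtain the marginal
    probability
    $P(D_1 \mid I) = \prod_{j : I_j = 1} \left[ \prod_{k=1}^{3} \frac{\Gamma(n_{jk} + \alpha_k)}{\Gamma(\alpha_k)} \right] \frac{\Gamma(|\alpha|)}{\Gamma(N_d + |\alpha|)}$,
    where $|\alpha| = \alpha_1 + \alpha_2 + \alpha_3$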
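
Integrating a Dirichlet prior against a multinomial likelihood leaves a ratio of Gamma functions, which is best evaluated on the log scale. The following is a minimal sketch of that calculation for group 1 markers, not the BEAM implementation; the function names and the default `alpha = (0.5, 0.5, 0.5)` are assumptions made for illustration.

```python
from math import lgamma

def log_marginal_counts(counts, alpha):
    """log of [prod_k Gamma(n_k + a_k) / Gamma(a_k)] * Gamma(|a|) / Gamma(N + |a|)."""
    total = sum(counts)
    alpha_sum = sum(alpha)
    out = lgamma(alpha_sum) - lgamma(total + alpha_sum)
    for n_k, a_k in zip(counts, alpha):
        out += lgamma(n_k + a_k) - lgamma(a_k)
    return out

def log_p_d1(case_counts_by_marker, alpha=(0.5, 0.5, 0.5)):
    """Sum the per-marker log marginals over the markers assigned to group 1.

    case_counts_by_marker: iterable of (n_j1, n_j2, n_j3) genotype counts in cases.
    alpha: hypothetical default hyperparameters, chosen here only for illustration.
    """
    return sum(log_marginal_counts(c, alpha) for c in case_counts_by_marker)

print(log_p_d1([(60, 30, 10), (25, 50, 25)]))
```

Working with `lgamma` avoids the overflow that direct Gamma evaluation would hit at sample sizes in the thousands.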

12
Dirichlet Prior for Group 2
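  • $3^{l_2}$ possible genotype combinations with
    frequencies $\Theta_2 = (\rho_1, \dots, \rho_{3^{l_2}})$
  • $n_k$: the number of cases with genotype combination $k$
  • Assume a Dirichlet($\beta$) prior distribution for
    $\Theta_2$, where $\beta = (\beta_1, \dots, \beta_{3^{l_2}})$
  • Integrate out $\Theta_2$:
    $P(D_2 \mid I) = \left[ \prod_{k=1}^{3^{l_2}} \frac{\Gamma(n_k + \beta_k)}{\Gamma(\beta_k)} \right] \frac{\Gamma(|\beta|)}{\Gamma(N_d + |\beta|)}$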
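
For group 2 the same Dirichlet-multinomial integral is taken over joint genotype combinations rather than single-marker genotypes. The sketch below counts the combinations observed in cases and evaluates the log marginal; it assumes a symmetric prior in which every cell has the same hyperparameter `beta_k` (a hypothetical default of 0.5), and the function name is illustrative only.

```python
from collections import Counter
from math import lgamma

def log_p_d2(case_genotypes, group2, beta_k=0.5):
    """Log marginal probability of the joint case genotypes at group 2 markers.

    case_genotypes: list of per-individual genotype tuples with values 0, 1 or 2.
    group2: indices of the markers assigned to group 2.
    beta_k: assumed common Dirichlet hyperparameter for all 3**len(group2) cells.
    """
    n_cells = 3 ** len(group2)
    counts = Counter(tuple(g[j] for j in group2) for g in case_genotypes)
    n_cases = len(case_genotypes)
    beta_sum = beta_k * n_cells
    out = lgamma(beta_sum) - lgamma(n_cases + beta_sum)
    # Empty cells contribute Gamma(beta_k) / Gamma(beta_k) = 1, so only observed cells matter.
    for n_k in counts.values():
        out += lgamma(n_k + beta_k) - lgamma(beta_k)
    return out

cases = [(0, 1, 2), (0, 1, 0), (2, 1, 2), (0, 1, 2)]
print(log_p_d2(cases, group2=[0, 2]))
```

Because only observed cells are visited, the cost depends on the number of cases rather than on the full table of 3^l2 cells.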

13
Dirichlet Prior for Group 0
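  • Markers in group 0 follow the same distributions in
    cases as in controls
  • The genotype frequencies of the $L$ markers in the
    control population: $\Theta = (\theta_1, \dots, \theta_L)$
  • $n_{jk}$ and $m_{jk}$ are the numbers of individuals with
    genotype $k$ at marker $j$ in $D$ and $U$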

14
Dirichlet Prior for Group 0 (c1)
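  • Assume Dirichlet priors for $\theta_j$ with parameters
    $\alpha = (\alpha_1, \alpha_2, \alpha_3)$
  • Integrate out $\Theta$:
    $P(D_0, U \mid I) = \prod_{j=1}^{L} \left[ \prod_{k=1}^{3} \frac{\Gamma(c_{jk} + \alpha_k)}{\Gamma(\alpha_k)} \right] \frac{\Gamma(|\alpha|)}{\Gamma(\sum_{k} c_{jk} + |\alpha|)}$,
    where $c_{jk} = n_{jk} + m_{jk}$ if $I_j = 0$ and $c_{jk} = m_{jk}$ otherwise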

15
Posterior Distribution
  • The posterior distribution of $I$:
    $P(I \mid D, U) \propto P(D_1 \mid I)\, P(D_2 \mid I)\, P(D_0, U \mid I)\, P(I)$

16
Markov chain Monte Carlo
  • Markov chain Monte Carlo (MCMC) methods are a
    class of algorithms for sampling from probability
    distributions
  • Based on constructing a Markov chain that has the
    desired distribution as its stationary
    distribution.
  • The state of the chain after a large number of
    steps is then used as a sample from the desired
    distribution.
  • The quality of the sample improves as a function
    of the number of steps.

17
Metropolis-Hastings algorithm
18
M-H in BEAM
  • Two types of proposals are used
  • Randomly change a marker's group membership
  • Randomly exchange two markers between groups 0, 1
    and 2.
  • The proposed move is accepted according to the
    M-H ratio, which is just a ratio of Gamma
    functions.
  • To improve the sampling efficiency
  • Set a lower bound on the number of markers in
    group 2
  • Gradually reduce this bound to 0 during burn-in
  • This forces the algorithm to explore the space of
    high-order interactions.
  • Also used an annealing strategy in burn-in
    iterations
  • a temperature set high initially and gradually
    reduced to 1
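
The two proposal types and the annealed acceptance rule can be sketched as follows. This is a toy illustration under stated assumptions, not the BEAM code: "exchange two markers" is interpreted as swapping the group labels of two markers, the lower bound on the size of group 2 is omitted, the linear temperature schedule is invented, and `toy_log_post` is a hypothetical stand-in for the real log posterior.

```python
import math
import random

def propose(membership, rng):
    """Return a new membership list using one of the two move types."""
    new = list(membership)
    if rng.random() < 0.5:
        # Move type 1: move one marker to a different group.
        j = rng.randrange(len(new))
        new[j] = rng.choice([g for g in (0, 1, 2) if g != new[j]])
    else:
        # Move type 2: exchange the group labels of two markers.
        i, j = rng.sample(range(len(new)), 2)
        new[i], new[j] = new[j], new[i]
    return new

def mh_step(membership, log_post, rng, temperature=1.0):
    """One Metropolis-Hastings step; both proposals are symmetric."""
    candidate = propose(membership, rng)
    log_ratio = (log_post(candidate) - log_post(membership)) / temperature
    if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
        return candidate
    return membership

def toy_log_post(membership):
    # Hypothetical target: favours marker 0 in group 1 and all others in group 0.
    target = [1] + [0] * (len(membership) - 1)
    return -3.0 * sum(a != b for a, b in zip(membership, target))

rng = random.Random(0)
state = [0] * 10
for step in range(2000):
    # Annealing during burn-in: temperature falls linearly from 5 to 1.
    temperature = max(1.0, 5.0 - 4.0 * step / 1000)
    state = mh_step(state, toy_log_post, rng, temperature)
print(state)
```

Because both moves are symmetric, the acceptance ratio reduces to the posterior ratio, which in the real model is the ratio of Gamma functions mentioned above.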

19
The simulated data on Model 4 contain 1,000
markers from 1,000 cases and 1,000 controls, with
MAF 0.1 and marginal effect size 0.4. BEAM was run
for 150,000 burn-in iterations plus 200,000
sampling iterations in three chains, with priors
$p_1 = p_2 = 1/3$.
20
Circles denote the overall posterior
probabilities of associations, with marginal and
joint associations combined. Plus signs denote
posterior probabilities of marginal associations.
Three circles on the top correspond to the three
simulated disease markers having interaction
effects.
21
B Statistic
  • A hypothesis-testing procedure to check each
    marker or set of markers for significant
    associations
  • For each set M of k markers to be tested, the
    null hypothesis is that markers in M are not
    associated with the disease.
  • Here, $k = 1, 2, 3, \dots$ represents single-marker,
    two-way and three-way interactions, etc.

22
B Statistic (c1)
  • Define the B statistic for the marker set $M$:
    $B_M = \ln \frac{P_A(D_M, U_M)}{P_0(D_M, U_M)}$
  • $P_0(D_M, U_M)$ and $P_A(D_M, U_M)$ are the marginal
    probabilities of the data, with parameters
    integrated out from our Bayesian model, under the
    null and the alternative models, respectively
  • Their ratio is a Bayes factor

23
Bayes Factor
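  • Consider a model selection problem between two
    models $M_1$ and $M_2$, on the basis of a data vector
    $x$. The Bayes factor $K$ is given by
    $K = \frac{p(x \mid M_1)}{p(x \mid M_2)} = \frac{\int p(\theta_1 \mid M_1)\, p(x \mid \theta_1, M_1)\, d\theta_1}{\int p(\theta_2 \mid M_2)\, p(x \mid \theta_2, M_2)\, d\theta_2}$
  • $p(x \mid M_i)$: the marginal likelihood for
    model $M_i$.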
  • Similar to a likelihood-ratio test
  • instead of maximizing the likelihood
  • Bayesians average it over the parameters
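
A small worked example may help here. The sketch below compares two models for a sequence of coin flips: M1 fixes the heads probability at 0.5, and M2 puts a uniform prior on it, so its marginal likelihood is the Beta function B(h+1, t+1). The coin example and the function names are illustrative assumptions and do not come from the presentation.

```python
from math import exp, lgamma, log

def log_marginal_fair(heads, tails):
    """M1: theta is fixed at 0.5, so there is nothing to integrate."""
    return (heads + tails) * log(0.5)

def log_marginal_uniform(heads, tails):
    """M2: theta ~ Uniform(0, 1); the integral equals the Beta function B(h+1, t+1)."""
    return lgamma(heads + 1) + lgamma(tails + 1) - lgamma(heads + tails + 2)

def bayes_factor(heads, tails):
    """K = p(x | M1) / p(x | M2), computed on the log scale for stability."""
    return exp(log_marginal_fair(heads, tails) - log_marginal_uniform(heads, tails))

print(bayes_factor(5, 5))   # balanced data: K > 1 favours the fair-coin model
print(bayes_factor(10, 0))  # one-sided data: K < 1 favours the flexible model
```

Averaging over the parameter penalises the flexible model automatically, which is the sense in which the Bayes factor differs from a maximised likelihood ratio.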

24
B Statistic (c2)
  • Choose both $P_0$ and $P_A$ as an equal mixture of two
    distributions
  • One that assumes independence among markers in $M$,
    $P_{ind}(X)$, of the form of $P(D_1 \mid I)$
  • The other a saturated joint distribution of
    genotype combinations across all markers in $M$,
    $P_{join}(X)$, as $P(D_2 \mid I)$
  • Under the null hypothesis, the B statistic is
    asymptotically distributed as a shifted $\chi^2$ with
    $3^k - 1$ degrees of freedom

25
Conditional B Statistic
  • A set of $k$ ($= 2, 3, \dots$) markers may include
    $t$ ($< k$) markers that are significant through either
    marginal or partial interaction associations.
  • The asymptotic null distribution of the conditional
    statistic $B_{M|T}$ is a shifted $\chi^2$ with
    $3^k - 3^t$ degrees of freedom.

26
Simulated Epistasis Models
27
Simulated Epistasis Models (c1)
28
Simulated Epistasis Models (c2)
  • Model 6 is a 6-way interaction model
  • Denote the genotypes of each SNP by 0, 1, and 2
  • Code each genotype combination over 6 disease
    loci by integers between 0 and 728
  • Assign a disease effect of 50 to genotype
    combinations 4, 5, 7, 111, 114, 253, 254, 360,
    387, 603, and 630.
  • The effect is set to 50 so that these genotype
    combinations can explain a non-trivial portion
    (>10%) of cases.
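
With genotypes coded 0, 1 and 2, a six-locus combination maps naturally to a six-digit base-3 integer, which gives the range 0 to 3^6 - 1 = 728. The sketch below shows one such coding; the digit order (first locus most significant) is an assumption, since the slide does not state which convention the authors used.

```python
def encode(genotypes):
    """Map a tuple of genotypes (each 0, 1 or 2) to an integer in base 3."""
    code = 0
    for g in genotypes:
        code = code * 3 + g
    return code

def decode(code, n_loci=6):
    """Inverse of encode: recover the genotype tuple from its integer code."""
    digits = []
    for _ in range(n_loci):
        code, g = divmod(code, 3)
        digits.append(g)
    return tuple(reversed(digits))

# Genotype tuples for the disease-associated codes listed on the slide.
for code in (4, 5, 7, 111, 114, 253, 254, 360, 387, 603, 630):
    print(code, decode(code))
```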

29
Simulated Epistasis Models (c3)
  • 50 data sets for each disease model were
    simulated under each setting
  • Marker minor allele frequencies (MAF) are
    uniformly distributed in [0.05, 0.5].
  • Each untyped disease locus is linked to one
    genotyped marker
  • The remaining markers are unlinked

30
Comparison Algorithms
  • The stepwise logistic regression approach
  • All markers are individually tested and ranked
    for marginal associations.
  • The top 10% of markers are selected, among which
    all $k$-way ($k = 2$ or 3) interactions are tested and
    ranked
  • A $\chi^2$ test with two degrees of freedom to test for
    single-marker associations.
  • A stepwise B-stat
  • the same search strategy as stepwise logistic
    regression
  • use the B statistic for testing significance

31
Power Calculation
  • Define the power of each method as the proportion
    of 50 data sets in which all truly associated
    markers are identified and show statistically
    significant associations (adjusted P values below
    0.1) with the disease.
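
This definition of power is a simple proportion and can be computed as below. The sketch assumes each method's output for a data set is available as a mapping from marker to adjusted P value; that data layout and the function name are hypothetical.

```python
def power(results, true_markers, threshold=0.1):
    """Fraction of data sets in which every true marker is significant.

    results: one dict per simulated data set, mapping marker -> adjusted P value
             for the markers a method reported (unreported markers count as misses).
    true_markers: the truly associated markers.
    """
    hits = sum(
        all(r.get(m, 1.0) < threshold for m in true_markers)
        for r in results
    )
    return hits / len(results)

example = [
    {1: 0.01, 2: 0.05},   # both found
    {1: 0.01},            # marker 2 missed
    {1: 0.20, 2: 0.01},   # marker 1 not significant
    {1: 0.001, 2: 0.09},  # both found
]
print(power(example, true_markers=[1, 2]))
```

Requiring all true markers in a data set makes this a strict criterion: partial detection of an interaction counts as a failure.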

32
A Hierarchical Procedure to Declare Significance
  • Marginal associations
  • report all markers with significant marginal
    associations after a Bonferroni correction for
    the number of markers, L.
  • 2-way interactions
  • after the Bonferroni correction for L(L-1)/2
    tests
  • report all significant novel 2-way interactions
  • Neither marker has been reported earlier
  • if one marker has been reported earlier
  • compute its conditional B-statistic (or the
    conditional log likelihood ratio (LLR) for
    logistic regression)
  • report the interaction if significant after a
    Bonferroni correction.

33
A Hierarchical Procedure to Declare Significance
(c1)
  • 3-way interactions
  • report all novel 3-way interactions that are
    significant after a Bonferroni correction for
    L(L-1)(L-2)/6 tests
  • if $t = 1$ or 2 markers were already found
    significant
  • calculate the conditional B-statistic (or the
    conditional LLR)
  • report the interaction if it is still significant
    after a Bonferroni correction.
  • All p-values were estimated by a chi-square
    distribution with $d = 3^k - 3^t$ degrees of freedom,
    for $k = 1, 2, 3$, $t = 0, 1, 2$, $t < k$, and adjusted by
    Bonferroni corrections.
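
The bookkeeping in this procedure is easy to get wrong, so a short sketch may help: the degrees of freedom for a k-marker test conditional on t already-reported markers, and the Bonferroni adjustment for novel k-way tests among L markers. The function names are hypothetical, and the adjustment shown covers only the novel (t = 0) case, where the number of tests is C(L, k).

```python
from math import comb

def degrees_of_freedom(k, t=0):
    """d = 3**k - 3**t for a k-marker set with t markers already significant."""
    if not 0 <= t < k:
        raise ValueError("need 0 <= t < k")
    return 3 ** k - 3 ** t

def bonferroni(p_value, n_markers, k):
    """Adjust a raw P value for all C(n_markers, k) novel k-way tests."""
    return min(1.0, p_value * comb(n_markers, k))

for k in (1, 2, 3):
    for t in range(k):
        print(k, t, degrees_of_freedom(k, t))
print(bonferroni(1e-8, n_markers=1000, k=2))
```

With t = 0 the formula reduces to 3^k - 1, matching the unconditional B statistic.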

34
Results
  • The power for detecting marginal associations was
    not compromised by using the more complex models.

BEAM (B), the stepwise B-stat (S), the stepwise
logistic regression (L) and the 2-d.f. $\chi^2$ test (C).
Each data set contains 1,000 markers. Black bars
represent the power for 1,000 cases and 1,000
controls. Gray bars represent the power for 2,000
cases and 2,000 controls.
35
BEAM (B), the stepwise B-stat (S), the stepwise
logistic regression (L) and the 2-d.f. $\chi^2$ test (C).
36
BEAM (B), the stepwise B-stat (S), the stepwise
logistic regression (L) and the 2-d.f. $\chi^2$ test (C).
Each data set contains 1,000 markers. Black bars
represent the power for 1,000 cases and 1,000
controls. Gray bars represent the power for 2,000
cases and 2,000 controls.
37
Results (c1)
  • BEAM performs better especially when either
    disease allele frequencies or marginal effects
    were small.
  • The power of all methods decreases with the decay
    of the LD (measured in $r^2$) between disease loci
    and associated markers.

38
Type I Errors
  • All three epistasis mapping methods made similar
    numbers of type I errors.
  • At the 0.1 significance level, they all made about
    10% type I errors (after Bonferroni correction) when
    searching only for marginally significant
    markers.
  • All methods made far fewer than 10% type I
    errors when searching for interactions.

40
Impact of mismatch in allele frequencies and LD
  • The power of association mapping can be greatly
    hampered by the discrepancy of allele frequencies
    between unobserved disease loci and associated
    genotyped markers.
  • Investigated the impact based on model 2
  • MAFs at two interacting disease loci were both
    0.1
  • The marginal effect size per disease locus was
    0.5
  • One linked marker had the matched MAF, whereas
    the other had an MAF ranging from 0.05 to 0.5.
  • The LD between disease loci and associated
    markers was controlled to range from $D' = 0.7$ to
    $D' = 1$.

42
Genome-wide association study of AMD
  • The AMD (age-related macular degeneration) data
    set contains 116,204 SNPs genotyped for 96
    affected individuals and 50 controls.
  • Remove nonpolymorphic SNPs and those that
    significantly deviated from Hardy-Weinberg
    Equilibrium (HWE)
  • Remove additional SNPs containing more than five
    missing genotypes.
  • After the filtration, 96,932 SNPs remained.
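
The three filters can be sketched per SNP as follows. This is an illustration under stated assumptions, not the authors' pipeline: the slide does not say which HWE test or significance cutoff was used, nor whether it was applied to controls only, so the 1-d.f. chi-square statistic and the `hwe_cutoff` value here are hypothetical choices.

```python
def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    expected = (n * p * p, 2 * n * p * (1 - p), n * (1 - p) ** 2)
    observed = (n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def keep_snp(genotypes, max_missing=5, hwe_cutoff=10.83):
    """genotypes: list of calls coded 0, 1 or 2, with None for a missing call.

    hwe_cutoff is an assumed threshold on the chi-square statistic.
    """
    called = [g for g in genotypes if g is not None]
    if len(genotypes) - len(called) > max_missing:
        return False                      # too many missing genotypes
    n_aa, n_ab, n_bb = (called.count(g) for g in (0, 1, 2))
    a_alleles = 2 * n_aa + n_ab
    if a_alleles == 0 or a_alleles == 2 * len(called):
        return False                      # nonpolymorphic (also covers no calls)
    return hwe_chi2(n_aa, n_ab, n_bb) <= hwe_cutoff

print(keep_snp([0, 1, 2, 1] * 5))   # polymorphic and in HWE proportions
print(keep_snp([0] * 20))           # nonpolymorphic
```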

44
Prior Calibration
  • With only 146 individuals and nearly 100,000 SNPs, the
    posterior probability of associations for each
    marker is strongly influenced by the choice of
    priors, although the order of these probabilities
    is nearly invariant.

46
Simulation Based on AMD Data
47
Comparison with other epistasis mapping approaches
  • MDR identifies k-way interactions through an
    exhaustive search and evaluates the association
    between each interaction and the disease by
    cross-validations.
  • Logic regression infers a tree-based relationship
    between the disease status and a set of
    markers. It evaluates the detected associations by
    permutation tests.
  • BGTA uses a bootstrap-type resampling screening
    procedure to select markers, and those markers
    with return frequencies greater than the third
    quartile plus 1.8 times the interquartile range
    are deemed disease-associated markers.
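
The selection rule described for BGTA is an upper outlier fence on the return frequencies. A minimal sketch of that rule alone (not of the BGTA resampling itself) is shown below; quartile definitions vary between software packages, so the "inclusive" method used here is an assumption, as are the function names.

```python
from statistics import quantiles

def outlier_threshold(return_frequencies, multiplier=1.8):
    """Third quartile plus `multiplier` times the interquartile range."""
    q1, _, q3 = quantiles(return_frequencies, n=4, method="inclusive")
    return q3 + multiplier * (q3 - q1)

def selected_markers(return_frequencies, multiplier=1.8):
    """Indices of markers whose return frequency exceeds the threshold."""
    cutoff = outlier_threshold(return_frequencies, multiplier)
    return [i for i, f in enumerate(return_frequencies) if f > cutoff]

freqs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(outlier_threshold(freqs), selected_markers(freqs))
```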

49
Discussions
  • The BEAM algorithm has two essential components
  • a Bayesian epistasis inference tool implemented
    via MCMC
  • a novel test statistic for evaluating statistical
    significance
  • A natural advantage of the Bayesian approach
  • incorporate prior knowledge about each marker
  • quantify all information and uncertainties in the
    form of posteriors
  • Evaluating the statistical significance of a
    candidate finding via P values
  • more robust to model choice and prior assumptions
  • can give the scientist peace of mind

50
Discussions (c1)
  • The power of epistasis mapping depends
    critically on
  • sample size
  • effects of disease mutations
  • any discrepancy in allele frequencies between
    disease loci and associated markers
  • There are several issues that may affect the
    accuracy
  • population substructures
  • genotyping errors
  • disease heterogeneities

51
THANK YOU