Title: MCMC for multilevel logistic regressions in MLwiN
Slide 1: Lecture 14
- MCMC for multilevel logistic regressions in MLwiN
Slide 2: Lecture Contents
- Recap of MCMC from day 3
- Recap of logistic regression from days 1 and 4
- Metropolis-Hastings sampling
- Reunion Island dataset (2- and 3-level)
- VPC for binomial models
- Method Comparison on Rodriguez/Goldman examples
Slide 3: MCMC Methods (recap)
- Goal: to sample from the joint posterior distribution.
- Problem: for complex models this involves multidimensional integration.
- Solution: it may be possible to sample from the conditional posterior distributions.
- It can be shown that, after convergence, such a sampling approach generates dependent samples from the joint posterior distribution.
Slide 4: Gibbs Sampling (recap)
- When we can sample directly from the conditional posterior distributions, such an algorithm is known as Gibbs sampling.
- This proceeds as follows for the variance components example:
- First, give all unknown parameters starting values.
- Next, loop through the following steps.
Slide 5: Gibbs Sampling for VC model
These steps are then repeated, with the generated values from this loop replacing the starting values. The chain of values produced by this procedure is known as a Markov chain. Note that β is generated as a block while each u_j is updated individually.
Slide 6: Algorithm Summary
- Repeat the following four steps (sketched in code below):
- 1. Generate β from its (multivariate) Normal conditional distribution.
- 2. Generate each u_j from its Normal conditional distribution.
- 3. Generate 1/σ²_u from its Gamma conditional distribution.
- 4. Generate 1/σ²_e from its Gamma conditional distribution.
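A minimal sketch of these four steps in Python, assuming a flat prior on β, Gamma(a, b) priors on both precisions, and a 0-indexed integer array `group` mapping each observation to its level 2 unit; the function name, priors and data layout are illustrative, not MLwiN's internal implementation.

```python
import numpy as np

def gibbs_vc(y, X, group, n_iter=5000, a=0.001, b=0.001, seed=1):
    """Gibbs sampler for the 2-level variance components model
    y_ij = x_ij' beta + u_j + e_ij, with a flat prior on beta and
    Gamma(a, b) priors on the precisions 1/sigma_u^2 and 1/sigma_e^2
    (illustrative choices, not MLwiN's defaults)."""
    rng = np.random.default_rng(seed)
    n, J = len(y), group.max() + 1
    beta = np.linalg.lstsq(X, y, rcond=None)[0]   # starting values
    u, sig2_u, sig2_e = np.zeros(J), 1.0, 1.0
    XtX_inv = np.linalg.inv(X.T @ X)
    chain = []
    for _ in range(n_iter):
        # 1. beta as a block from its multivariate Normal conditional
        mean_beta = XtX_inv @ (X.T @ (y - u[group]))
        beta = rng.multivariate_normal(mean_beta, sig2_e * XtX_inv)
        # 2. each u_j individually from its univariate Normal conditional
        fixed_resid = y - X @ beta
        for j in range(J):
            r_j = fixed_resid[group == j]
            prec = len(r_j) / sig2_e + 1.0 / sig2_u
            u[j] = rng.normal((r_j.sum() / sig2_e) / prec, np.sqrt(1.0 / prec))
        # 3. the level 2 precision from its Gamma conditional
        sig2_u = 1.0 / rng.gamma(a + J / 2, 1.0 / (b + 0.5 * (u ** 2).sum()))
        # 4. the level 1 precision from its Gamma conditional
        e = fixed_resid - u[group]
        sig2_e = 1.0 / rng.gamma(a + n / 2, 1.0 / (b + 0.5 * (e ** 2).sum()))
        chain.append((beta.copy(), sig2_u, sig2_e))
    return chain
```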
Slide 7: Logistic regression model
- A standard Bayesian logistic regression model (e.g. for the rat tumour example) can be written as shown below.
- Both MLwiN and WinBUGS can fit this model, but can we write out the conditional posterior distributions and use Gibbs sampling?
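One plausible reconstruction of such a model, assuming binomial counts y_i out of n_i and diffuse Normal priors (the covariates on the original slide are not recoverable):

```latex
y_i \sim \text{Binomial}(n_i, p_i), \qquad
\operatorname{logit}(p_i) = \beta_0 + \beta_1 x_i, \qquad
\beta_0, \beta_1 \sim \text{N}(0, 10^6)
```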
Slide 8: Conditional distribution for β0
This distribution (written out below) is not a standard distribution, and so we cannot simply simulate from a standard random number generator. However, both WinBUGS and MLwiN can fit this model using MCMC. We will in this lecture describe how MLwiN does this before considering WinBUGS in the next lecture.
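Under the model sketched on the previous slide, the conditional posterior for β0 takes the form (a reconstruction, up to proportionality):

```latex
p(\beta_0 \mid \mathbf{y}, \beta_1) \;\propto\;
p(\beta_0) \prod_i p_i^{\,y_i} (1 - p_i)^{\,n_i - y_i},
\qquad p_i = \frac{\exp(\beta_0 + \beta_1 x_i)}{1 + \exp(\beta_0 + \beta_1 x_i)},
```

which does not belong to any standard family of distributions.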
Slide 9: Metropolis-Hastings (MH) sampling
- An alternative, and more general, way to construct an MCMC sampler.
- A form of generalised rejection sampling (see later) where values are drawn from approximate distributions and corrected so that, asymptotically, they behave as random observations from the target distribution.
- MH sampling algorithms sequentially draw candidate observations from a proposal distribution, conditional on the current observations, thus inducing a Markov chain.
Slide 10: General MH algorithm step
- Let us focus on a single parameter θ and its posterior distribution p(θ|Y).
- Now at iteration t, θ takes value θ_t and we generate a new value θ* from a proposal distribution q.
- Then we accept this new value and let θ_{t+1} = θ* with acceptance probability α(θ*, θ_t); otherwise we set θ_{t+1} = θ_t.
- The acceptance probability is given below.
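The standard MH acceptance probability is:

```latex
\alpha(\theta^{*}, \theta_t) = \min\left(1,\;
\frac{p(\theta^{*} \mid Y)\, q(\theta_t \mid \theta^{*})}
     {p(\theta_t \mid Y)\, q(\theta^{*} \mid \theta_t)}\right)
```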
Slide 11: Choosing a proposal distribution
- Remarkably, the proposal distribution can have almost any form.
- There are some (silly) exceptions, e.g. a proposal that has point mass at one value, but assuming that the proposal allows the chain to explore the whole posterior and doesn't produce a reducible chain, we are OK.
- Three special cases of the Metropolis-Hastings algorithm are:
- Random walk Metropolis sampling.
- Independence sampling.
- Gibbs sampling!
Slide 12: Pure Metropolis Sampling
- The general MH sampling algorithm is due to Hastings (1970); however, this is a generalisation of pure Metropolis sampling (Metropolis et al., 1953).
- This special case is when q(θ_1|θ_2) = q(θ_2|θ_1), i.e. the proposal distribution is symmetric.
- This then reduces the acceptance probability to α(θ*, θ_t) = min(1, p(θ*|Y)/p(θ_t|Y)).
Slide 13: Random Walk Metropolis
- This is an example of pure Metropolis sampling.
- Here q(θ_1|θ_2) = q(θ_1 - θ_2).
- Typical examples of random walk proposals are Normal distributions centred around the current value of the parameter, i.e. q(θ_1|θ_2) = N(θ_2, s²) where s² is the (fixed) proposal variance that can be tuned to give particular acceptance rates.
- This is the method used within MLwiN; a sketch follows below.
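A minimal sketch of random walk Metropolis for a single parameter, assuming a user-supplied log-posterior function; `log_post`, the proposal SD and the toy target at the end are illustrative.

```python
import numpy as np

def rw_metropolis(log_post, theta0, prop_sd, n_iter=10000, seed=0):
    """Random walk Metropolis: propose from a Normal centred at the
    current value and accept with probability min(1, p(theta*|Y)/p(theta_t|Y)),
    the pure Metropolis rule, since the proposal is symmetric."""
    rng = np.random.default_rng(seed)
    theta, lp = theta0, log_post(theta0)
    chain, accepted = np.empty(n_iter), 0
    for t in range(n_iter):
        theta_star = rng.normal(theta, prop_sd)   # symmetric proposal
        lp_star = log_post(theta_star)
        if np.log(rng.uniform()) < lp_star - lp:  # accept/reject on the log scale
            theta, lp = theta_star, lp_star
            accepted += 1
        chain[t] = theta
    return chain, accepted / n_iter

# Toy example: sample from a N(2, 1) "posterior"
chain, rate = rw_metropolis(lambda t: -0.5 * (t - 2.0) ** 2, 0.0, prop_sd=2.4)
```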
Slide 14: Independence Sampler
- The independence sampler is so called because each proposal is independent of the current parameter value, i.e. q(θ_1|θ_2) = q(θ_1).
- This leads to the acceptance probability shown below.
- Example proposal distributions could be a Normal based around the ML estimate with inflated variance. The independence sampler can sometimes work very well but equally can work very badly!
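With an independence proposal the Hastings ratio simplifies to:

```latex
\alpha(\theta^{*}, \theta_t) = \min\left(1,\;
\frac{p(\theta^{*} \mid Y)\, q(\theta_t)}
     {p(\theta_t \mid Y)\, q(\theta^{*})}\right)
```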
Slide 15: Gibbs Sampling
- The Gibbs sampler that we have already studied is a special case of the MH algorithm. The proposal distribution is the full conditional distribution, which leads to an acceptance rate of 1, as shown below.
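Substituting the full conditional q(θ*|θ_t) = p(θ*|Y) into the Hastings ratio, everything cancels:

```latex
\alpha(\theta^{*}, \theta_t) = \min\left(1,\;
\frac{p(\theta^{*} \mid Y)\, p(\theta_t \mid Y)}
     {p(\theta_t \mid Y)\, p(\theta^{*} \mid Y)}\right) = 1
```

(In the multiparameter case the same cancellation holds with everything conditioned on the remaining parameters.)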
Slide 16: MH Sampling in MLwiN
- MLwiN actually uses a hybrid method:
- Gibbs sampling steps are used for variance parameters.
- MH steps are used for fixed effects and residuals.
- Univariate Normal proposal distributions are used.
- For the proposal standard deviation, a scaled IGLS standard error is initially used (multiplied by 5.8 on the variance scale).
- However, an adaptive method is used to tune these proposal distributions prior to the burn-in period.
Slide 17: MH Sampling for Normal model
- For the education dataset we can illustrate MH sampling on the VC model by modifying steps 1 and 2.
- Repeat the following four steps:
- 1. Generate each β_i by univariate Normal MH sampling.
- 2. Generate each u_j by univariate Normal MH sampling.
- 3. Generate 1/σ²_u from its Gamma conditional distribution.
- 4. Generate 1/σ²_e from its Gamma conditional distribution.
Slide 18: MH Sampling in MLwiN
- Here we see how to change method. Note that for Binomial responses this will change automatically and Gibbs sampling will not be available.
Slide 19: Trajectories plot
- Here we see MH sampling for β.
Slide 20: Adaptive Method (ad hoc)
- One way of finding a good proposal distribution is to choose a distribution that gives a particular acceptance rate. It has been shown that a good acceptance rate is often around 50%.
- MLwiN has incorporated an adaptive method that uses this fact to construct univariate Normal proposals with an acceptance rate of approximately 50%.
- Method:
- Before the burn-in we have an adaptation period where the sampler improves the proposal distribution. The adaptive method requires a desired acceptance rate, e.g. 50%, and a tolerance, e.g. 10%, resulting in an acceptable range of (40%, 60%).
Slide 21: Adaptive method algorithm
- Run the sampler for consecutive batches of 100 iterations. Compare the number accepted, N, with the desired acceptance rate, R.
- Repeat this procedure until 3 consecutive values of N lie within the acceptable range and then mark this parameter. When all the parameters are marked, the adaptation period is over.
- N.B. Proposal SDs are still modified after being marked until the adaptation period is over. A sketch of this batch-tuning loop follows below.
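A minimal sketch of the batch-tuning idea for one parameter; `run_batch` (assumed to run a given number of iterations at a given proposal SD and return the number accepted) and the multiplicative update rule are illustrative assumptions, not MLwiN's exact scheme.

```python
import numpy as np

def adapt_proposal_sd(run_batch, sd0, desired=0.5, tol=0.1,
                      batch=100, max_batches=50):
    """Tune a random-walk proposal SD in batches of 100 iterations,
    stopping after 3 consecutive batches land in the acceptable range.
    (The full algorithm keeps adjusting marked parameters until every
    parameter is marked; this single-parameter sketch just stops.)"""
    lo, hi = desired - tol, desired + tol        # e.g. (0.4, 0.6)
    sd, in_row = sd0, 0
    for _ in range(max_batches):
        rate = run_batch(sd, batch) / batch      # acceptance rate this batch
        in_row = in_row + 1 if lo <= rate <= hi else 0
        if in_row == 3:                          # parameter is "marked"
            return sd
        # accepting too often means the steps are too small: widen the
        # proposal; accepting too rarely means they are too big: narrow it
        sd *= np.exp(rate - desired)
    return sd
```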
Slide 22: Example of Adaptive period
- Adaptive method used on parameter β0:

Iteration  Proposal SD  Accepted  N in row
0          1.0          -         -
100        0.505        1         0
200        0.263        4         0
300        0.138        5         0
400        0.074        7         0
500        0.046        19        0
600        0.032        29        0
700        0.031        48        1
800        0.026        40        2
900        0.024        46        3
1000       0.021        51        3
1500       0.022        48        3
Slide 23: Comparison of Gibbs vs MH
- MLwiN also has a multivariate Normal (MVN) MH algorithm. A comparison of the effective sample size (ESS) for a run of 5,000 iterations of the VC model follows:

Parameter  Gibbs  MH    MH (MVN)
β0         216    33    59
β1         4413   973   303
σ²_u       2821   2140  2919
σ²_e       4712   4895  4728
Slide 24: 2-level Reunion Island dataset
- We have already studied this dataset with a continuous response.
- Here we consider a subset with only the 1st lactation for each cow, resulting in 2 levels: cows nested within herds.
- The (binary) response of interest is fscr: whether the first service results in a conception.
- There are two predictors: ai (whether insemination was natural or artificial) and heifer (the age of the cow).
Slide 25: MCMC algorithm
- Our MLwiN algorithm has 3 steps:
- 1. Generate each β_i by univariate Normal MH sampling.
- 2. Generate each u_j by univariate Normal MH sampling.
- 3. Generate 1/σ²_u from its Gamma conditional distribution.
Slide 26: MLwiN Demo
- The 2-level model is set up and run in IGLS, giving the following starting values.
Slide 27: Trajectories for 5k iterations
- Here we see some poor mixing, particularly for the variance.
Slide 28: DIC for binomial models
- We can use the DIC (defined below the table) to check whether we need random effects and whether to include heifer in the model. Note we only ran each model for 5,000 iterations.

Model             pD     DIC
VC + AI           18.37  2087.15 (2086.48 after 50k)
LR + AI           2.16   2095.70
VC + AI + Heifer  22.16  2087.43
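For reference, the DIC of Spiegelhalter et al. (2002) is:

```latex
\mathrm{DIC} = \bar{D} + p_D, \qquad p_D = \bar{D} - D(\bar{\theta}),
```

where D-bar is the posterior mean deviance and D(θ-bar) is the deviance evaluated at the posterior parameter means.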
Slide 29: VC Model after 50k iterations
- Here is a trace for the herd-level variance after 50k iterations that suggests we need to run even longer!
Slide 30: VPC for Binomial models
- The VPC is harder to calculate for Binomial models as the level 1 variance is part of the Binomial distribution, and hence related to the mean and on a different scale to the higher-level variances.
- Goldstein et al. (2002) propose 4 methods:
- Use a Taylor series approximation.
- Use a simulation-based approach.
- Switch to a Normal response model.
- Use the latent variable approach in Snijders and Bosker.
Slide 31: VPC for Binomial models
- Snijders and Bosker (1999) suggest the following: the variance of a standard logistic distribution is π²/3, and so the level 1 variance should be replaced by this value (see the formula below).
- In the Reunion Island example this gives a VPC of roughly 0.0263; in other words, 2.63% of the variation is at the herd level. The fact that there isn't a huge level 2 variance may in part explain the poor mixing of the MCMC algorithm.
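Under this latent variable approach the VPC for the 2-level model is:

```latex
\mathrm{VPC} = \frac{\sigma^2_u}{\sigma^2_u + \pi^2/3}
             \approx \frac{\sigma^2_u}{\sigma^2_u + 3.29},
```

which gives the 2.63% herd-level figure quoted above.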
Slide 32: 3-level Reunion Island dataset
- We will now fit the same models to the 3-level dataset. After 5,000 iterations we have the following rather worrying results.
Slide 33: Running for longer
- We ran the chains for 100k iterations after a burn-in of 5k and thinned the chains by a factor of 10. The results are still a little worrying.
Slide 34: Potential solutions
- In the last 2 slides we have seen some bad mixing behaviour for some parameters.
- The solution of running for longer seems to work, but we need to run for a very long time!
- In the next lecture we will look at WinBUGS methods for this model and also a reparameterisation of the model known as hierarchical centering.
- For this model it looks like MCMC is extremely computationally intensive to get reliable estimates. In what follows we look at an example where MCMC is clearly useful.
Slide 35: The Guatemalan Child Health dataset
- This consists of a subsample of 2,449 respondents from the 1987 National Survey of Maternal and Child Health, with a 3-level structure of births within mothers within communities.
- The subsample consists of all women from the chosen communities who had some form of prenatal care during pregnancy. The response variable is whether this prenatal care was modern (physician or trained nurse) or not.
- Rodriguez and Goldman (1995) use the structure of this dataset to consider how well quasi-likelihood methods compare with ignoring the multilevel structure and fitting a standard logistic regression.
- They perform this comparison by constructing simulated datasets based on the original structure but with known true values for the fixed effects and variance parameters.
- They consider the MQL method and show that the estimates of the fixed effects produced by MQL are worse than the estimates produced by standard logistic regression disregarding the multilevel structure!
Slide 36: The Guatemalan Child Health dataset
- Goldstein and Rasbash (1996) consider the same problem but use the PQL method. They show that the results produced by PQL 2nd-order estimation are far better than for MQL, but still biased.
- The model in this situation is given below.
- In this formulation i, j and k index the level 1, 2 and 3 units respectively.
- The variables x1, x2 and x3 are composite scales at each level, because the original model contained many covariates at each level.
- Browne (1998) considered the hybrid Metropolis-Gibbs method in MLwiN and two possible variance priors: Gamma⁻¹(ε, ε) and Uniform.
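A reconstruction of the model, chosen to be consistent with the parameters β0-β3, σ²_v and σ²_u reported on the next two slides (the slide's exact notation is not recoverable):

```latex
y_{ijk} \sim \text{Bernoulli}(p_{ijk}), \qquad
\operatorname{logit}(p_{ijk}) = \beta_0 + \beta_1 x_{1ijk}
  + \beta_2 x_{2jk} + \beta_3 x_{3k} + u_{jk} + v_k,
\qquad u_{jk} \sim \text{N}(0, \sigma^2_u), \quad v_k \sim \text{N}(0, \sigma^2_v)
```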
Slide 37: Simulation Results
- The following gives point estimates (MCSE) for 4 methods and 500 simulated datasets.

Parameter (True)  MQL1          PQL2          Gamma         Uniform
β0 (0.65)         0.474 (0.01)  0.612 (0.01)  0.638 (0.01)  0.655 (0.01)
β1 (1.00)         0.741 (0.01)  0.945 (0.01)  0.991 (0.01)  1.015 (0.01)
β2 (1.00)         0.753 (0.01)  0.958 (0.01)  1.006 (0.01)  1.031 (0.01)
β3 (1.00)         0.727 (0.01)  0.942 (0.01)  0.982 (0.01)  1.007 (0.01)
σ²_v (1.00)       0.550 (0.01)  0.888 (0.01)  1.023 (0.01)  1.108 (0.01)
σ²_u (1.00)       0.026 (0.01)  0.568 (0.01)  0.964 (0.02)  1.130 (0.02)
Slide 38: Simulation Results
- The following gives interval coverage probabilities (90%/95%) for 4 methods and 500 simulated datasets.

Parameter (True)  MQL1       PQL2       Gamma      Uniform
β0 (0.65)         67.6/76.8  86.2/92.0  86.8/93.2  88.6/93.6
β1 (1.00)         56.2/68.6  90.4/96.2  92.8/96.4  92.2/96.4
β2 (1.00)         13.2/17.6  84.6/90.8  88.4/92.6  88.6/92.8
β3 (1.00)         59.0/69.6  85.2/89.8  86.2/92.2  88.6/93.6
σ²_v (1.00)       0.6/2.4    70.2/77.6  89.4/94.4  87.8/92.2
σ²_u (1.00)       0.0/0.0    21.2/26.8  84.2/88.6  88.0/93.0
Slide 39: Summary of simulations
- The Bayesian approach yields excellent bias and coverage results.
- For the fixed effects, MQL performs badly but the other 3 methods all do well.
- For the random effects, MQL and PQL both perform badly but MCMC with both priors is much better.
- Note that this is an extreme scenario, with few level 1 units per level 2 unit yet high level 2 variance; in other examples MQL/PQL will not be so bad.
Slide 40: Introduction to Practical
- In the practical you will be let loose on MLwiN with two datasets:
- A dataset on contraceptive use in Bangladesh.
- A veterinary epidemiology dataset on pneumonia in pigs.