Title: Simple methods to improve MCMC efficiency in random effect models
1Simple methods to improve MCMC efficiency in
random effect models
- William Browne, Mousa Golalizadeh, Martin Green
and Fiona Steele - Universities of Bristol and Nottingham
- Thanks to ESRC for supporting this work
2Summary
- Introduction.
- Application 1 Clutch size in great tits.
- Method 1 Hierarchical centering.
- Method 2 Parameter expansion.
- Application 2 Mastitis incidence in dairy
cattle. - Method 3 Orthogonal predictors.
- Application 3 Contraceptive discontinuation in
Indonesia. - Conclusions.
3Introduction/Synopsis
- MCMC methods allow easy fitting of complex random
effects models - The simplest (default) MCMC algorithms can
produce poorly mixing chains. - By reparameterising the model one can greatly
improve mixing. - These reparameterisations are easy to implement
in WinBUGS (or MLwiN) - The choice of reparameterisation depends in part
on the model/dataset.
4Application 1 Great tit nesting behaviour
(crossed random effects)
- Original work was collaborative research with
Richard Pettifor (Institute of Zoology, London),
and Robin McCleery and Ben Sheldon (University of
Oxford).
5Application 1 Great tit nesting behaviour
(crossed random effects)
- A longitudinal study of great tits nesting in
Wytham Woods, Oxfordshire. - 6 responses 3 continuous 3 binary.
- Clutch size, lay date and mean nestling mass.
- Nest success, male and female survival.
- Data 4165 nesting attempts over a period of 34
years. - There are 4 higher-level classifications of the
data female parent, male parent, nestbox and
year. - We only consider Clutch size here
6Data background
The data structure can be summarised as follows
Note there is very little information on each
individual male and female bird but we can get
some estimates of variability via a random
effects model.
7MCMC efficiency for clutch size response
- The MCMC algorithm used in the univariate
analysis of clutch size was a simple 10-step
Gibbs sampling algorithm. - .
- To compare methods for each parameter we can look
at the effective sample sizes (ESS) which give an
estimate of how many independent samples we
have for each parameter as opposed to 50,000
dependent samples. - ESS of iterations/?,
8Clutch Size
Here we see that the average clutch size is just
below 9 eggs with large variability between
female birds and some variability between years.
Male birds and nest boxes have less impact.
9Effective Sample sizes
We will now consider methods that will improve
the ESS values for particular parameters. We will
firstly consider the fixed effect parameter.
10Trace and autocorrelation plots for fixed effect
using standard Gibbs sampling algorithm
11Hierarchical Centering
This method was devised by Gelfand et al. (1995)
for use in nested models. Basically (where
feasible) parameters are moved up the hierarchy
in a model reformulation. For example
is equivalent to
The motivation here is we remove the strong
negative correlation between the fixed and random
effects by reformulation.
12Hierarchical Centering
In our cross-classified model we have 4 possible
hierarchies up which we can move parameters. We
have chosen to move the fixed effect up the year
hierarchy as its variance had biggest ESS
although this choice is rather arbitrary.
The ESS for the fixed effect increases 50-fold
from 602 to 35,063 while for the year level
variance we have a smaller improvement from
29,604 to 34,626. Note this formulation also runs
faster 1864s vs 2601s (in WinBUGS).
13Trace and autocorrelation plots for fixed effect
using hierarchical centering formulation
14Parameter Expansion
- We next consider the variances and in particular
the between-male bird variance. - When the posterior distribution of a variance
parameter has some mass near zero this can hamper
the mixing of the chains for both the variance
parameter and the associated random effects. - The pictures over the page illustrate such poor
mixing. - One solution is parameter expansion (Liu et al.
1998). - In this method we add an extra parameter to the
model to improve mixing.
15Trace plots for between males variance and a
sample male effect using standard Gibbs sampling
algorithm
16Autocorrelation plot for male variance and a
sample male effect using standard Gibbs sampling
algorithm
17Parameter Expansion
In our example we use parameter expansion for all
4 hierarchies. Note the ? parameters have an
impact on both the random effects and their
variance.
The original parameters can be found by
Note the models are not identical as we now have
different prior distributions for the variances.
18Parameter Expansion
- For the between males variance we have a 20-fold
increase in ESS from 33 to 600. - The parameter expanded model has different prior
distributions for the variances although these
priors are still diffuse. - It should be noted that the point and interval
estimate of the level 2 variance has changed from
- 0.034 (0.002,0.126) to 0.064 (0.000,0.172).
- Parameter expansion is computationally slower
3662s vs 2601s for our example.
19Trace plots for between males variance and a
sample male effect using parameter expansion.
20Combining the two methods
Hierarchical centering and parameter expansion
can easily be combined in the same model. Here we
perform centering on the year classification and
parameter expansion on the other 3 hierarchies.
21Effective Sample sizes
As we can see below the effective sample sizes
for all parameters are improved for this
formulation while running time remains
approximately the same.
22Applications 2 3 Multilevel discrete-time
survival analysis
- Used to model the durations until the occurrence
of events e.g. length of time to death. - Different from standard regression due to
right-censoring and time varying covariates. - In many applications events may be repeatable and
the outcome is duration of continuous exposure to
the risk of an event. - In our applications cows may suffer mastitis more
than once and women can initiate and discontinue
use of contraceptives several times. - We begin with two levels of data episodes nested
within individuals. - With a discrete time approach we expand the data
to create a lower level time interval, which
represents a fixed period of time and is nested
within episode.
23Data Expansion
- Suppose we have time intervals (e.g. days, weeks)
indexed by t 1,...,K where K is the maximum
duration of an episode. - Let tij denote the number of intervals for which
individual i is observed in episode j. - Before modelling we now need to expand the data
for each episode ij to obtain tij records. - We have ytij0 for t1,,tij-1 and the response
for t tij is 1 if an episode ends in an event
and 0 for censored episodes. - Example Individual observed for 7 months and has
event after 4 months - Response vector (0,0,0,1,0,0,0),
- Time intervals (1,2,3,4,1,2,3)
24Modelling
- Having restructured the data we can analyse the
event occurrence indicator ytij using standard
methods for clustered binary response data. - Here we model the probability of an event
occurrence ptij as a function of duration (zt)
and covariates xtij with uj being an individual
specific random effect representing unobserved
time-invariant individual characteristics.
25MCMC algorithm for random effect logistic
regression models
Consider the following simple model
A standard MCMC algorithm (as used in MLwiN
(Rasbash et al. (2000), Browne (2003)) is Step 1
Update using random walk Metropolis
sampling. Step 2 Update using
random walk Metropolis sampling. Step 3 Update
from its inverse Gamma full conditional
using Gibbs Sampling
26Hierarchical centered formulation
We can reparameterise by replacing the residuals
uj with random effects uj ß0uj. Here the uj
are (hierarchically) centred around ß0.
This formulation allows Gibbs sampling to be used
for the fixed effect ß0 in addition to the random
effects variance. It is interesting to consider
how an MCMC algorithm for this centered
formulation can be expressed in terms of the
original parameterisation.
27MCMC Algorithm
- At iteration t1
- Step 1 Update ß0 from its normal conditional
distribution - Adjust
- to keep uj fixed.
- The other steps are unchanged by
reparameterisation as the random walk Metropolis
sampling for uj is equivalent to the same step
for uj and the final step is unchanged.
28Why should hierarchical centering be useful for
discrete time survival models ?
- As we have seen discrete time survival models are
a special case of random effects logistic
regression models. - Expansion results in many higher level predictors
that can be (hierarchically) centered around. - Hierarchical centering should improve mixing but
also speed up estimation. - Will be particularly useful when we have many
level 1 units per level 2 unit and large level 2
variance, where the speed up will be greatest
see application 2. - Not always useful but other potential solutions
see application 3.
29Application 2 Mastitis incidence in dairy cattle
- Mastitis inflammation of mammary gland of dairy
cows, usually caused by a bacterial infection.
Infections arising in dry (non-lactating) period
often result in clinical mastitis in early
lactation. - Green et al. (2007) use multilevel survival
models to investigate how cow, farm and
management factors in the dry period correlate
with mastitis incidence. - Data 2 years x 52 dairy farms 103 farm years
8,710 cow lactation periods following dry
periods expands to 256,382 records in total!
30Mastitis model
- The Model has many predictors.
- The duration terms zt consists of an intercept
plus polynomials in log time to order 3. - The 4 levels are farm, farm year, cow dry period
and weekly obs. although no random effects occur
at the dry period level. - The model can easily be converted to a
hierarchical centered formulation by centering
around the 10 farm-year level fixed effects
31Mastitis predictors
- Predictors at dry period level (6) are
- Parity of cow (5 categories 4 dummies)
- Dummies for Somatic cell count high before drying
off, and whether two SCC readings were available. - At the farm level (9)
- One dummy for cows remaining standing for 30
minutes after dry cow treatment, - Two dummy variables to indicate whether cubicle
bedding is disinfected in the early dry period
and whether this is not applicable due to the
system used. - Two dummy variables to indicate if transition cow
cubicles are bedded at least once daily and
whether there are in fact transition cow
cubicles. - A dummy variable as to whether the cubicle
bedding is disinfected in the transition dry
period - Three dummy variables to indicate whether the
transition cubicle feed and loaf area is scraped
daily, more often than daily or doesn't exist.
32Mastitis results
- The model was run for 50,000 iterations after an
adapting period and a further burnin of 500
iterations in MLwiN. - Due to the number of parameters in model we will
in the table that follows simply give effective
sample size (ESS) values although it should also
be noted that parameter values for the
non-centred formulation were rather variable due
to the small ESS - The estimation took 19 hours for the non-centred
formulation and 11½ hours for the centred
formulation.
33Effective sample sizes
34Method 3a Orthogonal polynomials
- It is noticeable that the parameters a1 -a3 have
small ESS. These correspond to the predictor
polynomials in log duration and the three
correlations between the predictors are -0.61,
0.79 and -0.90 respectively. - We can without altering the rest of the analysis
replace this group of predictors via a
reparameterisation that makes the predictors less
correlated and in fact we choose to make them
orthogonal i.e. - To do this we can keep the first predictor and
then add the rest in one at a time solving the
above at each stage. - This results in predictors z
- logt, logt20.94logt, logt3 1.50logt2-2.05logt
35Results of orthogonal polynomials
- Fitting this model will give coefficient a that
can be easily converted back to the coefficients
for the original predictors - a1 a10.94 a 2-2.05 a 3, a2 a21.50 a 3, ,
a3 a3 - Effective sample size comparison
36Application 3 Contraceptive discontinuation in
Indonesia
- Steele et al. (2004) use multilevel multistate
models to study transitions in and out of
contraceptive use in Indonesia. Here we consider
a simplification of their model which considers
only the transition from use to non-use, commonly
referred to as contraceptive discontinuation. - The data come from the 1997 Indonesia Demographic
and Health Survey. Contraceptive use histories
were collected retrospectively for the six-year
period prior to the survey, and include
information on the month and year of starting and
ceasing use, the method used, and the reason for
discontinuation. - The analysis is based on 17,833 episodes of
contraceptive use for 12,594 women, where an
episode is defined as a continuous period of
using the same method of contraception. - Restructuring the data to discrete-time format
with monthly time intervals leads to 365,205
records. To reduce the size, monthly intervals
are grouped into six-month intervals and a
binomial response is defined with denominator
equal to the number of months of use within a
six-month interval. Aggregation of intervals
leads to a dataset with 68,515 records.
37Model
- We here have intervals nested within episodes
nested within women.
Here we include discrete categories for the
duration terms zt a piecewise constant hazard
with categories representing 6-11 months, 12-23
months, 24-35 months and gt35 months with a base
category of 0-5 months.
38Predictors
- At the episode level we have
- Age categorized as lt25,25-34 and 34-49.
- Contraceptive method categorized as
pill/injectable, Norplant/IUD, other modern and
traditional. - At the woman level we have
- Education (3 categories).
- Type of region of residence (urban/rural).
- Socioeconomic status (low, medium or high).
39Results
The table below gives the effective sample sizes
based on runs of 250,000 iterations.
Here the hierarchical centered formulation does
really badly. This is because the cluster
variance s2u is very small estimates of 0.041
and 0.022 for the two methods
40A closer look at the residuals
- It is well known that hierarchical centering
works best if the cluster level variance is
substantial. - Here we see that both the variance is small and
the distribution of the residuals is not very
normal. - This is due to a few women who discontinue usage
very quickly and often, whilst many women never
discontinue!
Std residuals
Normal scores
41Simple logistic regression
- We will consider first removing the random
effects from the model (due to their small
variance) which will result in a simple logistic
regression model. - It will then be impossible to perform
hierarchical centering however we will consider
an extension to the orthogonalisation performed
in the previous application. - Note that Hills and Smith (1992) talk about using
orthogonal parameterisations and Roberts and
Gilks give it one sentence in MCMC in Practice.
Here we consider it in combination with the
simple (single site) random walk Metropolis
sampler where reduction of correlation in the
posterior is perhaps most important.
42Method 3bOrthogonal parameterisation
- For simplicity assume we have all predictors in
one matrix P and that we can write ztaxtijß as
ptij? where ?(a,ß). - Step 1 Number the predictors in some ordering
1,,N. - Step 2 Take each predictor in turn and replace
it with a predictor that is orthogonal to all the
already considered predictors. - For predictor pk.
- Note this requires solving k-1 equations in k-1
unknown w parameters. - A different orthogonal set of predictors results
from each ordering.
43Orthogonal parameterisation
- The second step of the algorithm produces both a
set of orthogonal predictors that span the same
space as the original predictors and a group of w
coefficients that can be combined to form a lower
diagonal matrix W. - We can fit this model and recover the
coefficients for the original predictors by
pre-multiplication by WT. - It is worth noting here that we use improper
Uniform priors for the coefficients and if we
used proper priors we would need to also
calculate the Jacobian for the reparameterisation
to ensure the same priors are used. - We ordered the predictors in what follows so that
the level 2 predictors were last before
performing reparameterisation.
44Results
- The following is based on 50,000 iterations
Here we see almost universal benefit of the
orthogonal parameterisation with virtually zero
time costs and very little programming!
45Combining orthogonalisation with parameter
expansion
- Combining orthogonalisation and parameter
expansion we have
We ran this model using WinBUGS and only 25,000
iterations following a burnin of 500 iterations
which took 34 hours compared to 23½ for 250k in
MLwiN without parameter expansion. The results
are given overleaf.
46Results for full model
- Here we compare simply using the orthogonal
approach in MLwiN for 250k with both orthogonal
predictors and parameter expansion in WinBUGS for
25k. Note this takes 1.5 times as long
47Trace plots of variance chain
Before Parameter Expansion
Here we see the far greater mixing of the
variance chain after parameter expansion. It is
worth noting that parameter expansion uses a
different prior for s2u and results in an
estimate of 0.059 (0.048) as opposed to 0.008
(0.006) without and earlier estimates of
0.041(0.026) and 0.022 (0.018) before
orthogonalisation. Note however that all
estimates bar parameter expansion are based on
very low ESS!
After Parameter Expansion
48Conclusions
- Hierarchical centering as is well known - works
well when the cluster variance is big but is no
good for small variances - Other research building on hierarchical centering
includes Papaspiliopoulus et al. (2003,2007)
showing how to construct partially non-centred
parameterisations. - Parameter expansion works well to improve mixing
when the cluster variance is small but results in
a different prior for the variance. - Orthogonalisation of predictors appears to be a
good idea generally but is slightly more involved
than the other reparameterisations i.e. the
predictors need orthogonalising outside WinBUGS
and the chains need transforming back. - An interesting area of further research is
choosing the order for orthogonalisation i.e.
which set of orthogonal predictors to use.
49References
- Browne, W.J. (2003). MCMC Estimation in MLwiN.
London Institute of Education, University of
London - Browne, W.J. (2004). An illustration of the use
of reparameterisation methods for improving MCMC
efficiency in crossed random effect models
Multilevel Modelling Newsletter 16 (1) 13-25 - Gamerman D. (1997) Sampling from the posterior
distribution in generalized linear mixed models.
Statistics and Computing. 7, 57--68. - Gelfand, A.E., Sahu, S.K. and Carlin, B.P. (1995)
Efficient parameterisations for normal linear
mixed models. Biometrika 83, 479--488 - Gelman, A., Huang, Z., van Dyk, D., and
Boscardin, W.J. (2007). Using redundant
parameterizations to fit hierarchical models.
Journal of Computational and Graphical Statistics
(to appear). - Green, M.J., Bradley, A.J., Medley, G.F. and
Browne, W.J. (2007) Cow, Farm and Management
Factors during the Dry Period that determine the
rate of Clinical Mastitis after calving. Journal
of Dairy Science 90 3764--3776. - Hills, S.E. and Smith, A.F.M. (1992)
Parameterization Issues in Bayesian Inference. In
Bayesian Statistics 4, (J M Bernardo, J O Berger,
A P Dawid, and A F M Smith, eds), Oxford
University Press, UK, pp. 227--246.
50References cont.
- Liu, C., Rubin, D.B., and Wu, Y.N. (1998)
Parameter expansion to accelerate EM The PX-EM
algorithm. Biometrika 85 (4) 755-770. - Liu, J.S., Wu, Y.N. (1999) Parameter Expansion
for Data Augmentation. Journal Of The American
Statistical Association 94 1264-1274 - Papaspiliopoulos, O, Roberts, G.O. and Skold, M.
(2003) Non-centred Parameterisations for
Hierarchical Models and Data Augmentation. In
Bayesian Statistics 7, (J M Bernardo, M J
Bayarri, J O Berger, A P Dawid, D Heckerman, A F
M Smith and M West, eds), Oxford University
Press, UK, pp. 307--32 - Papaspiliopoulos, O, Roberts, G.O. and Skold, M.
(2007) A General Framework for the
Parametrization of Hierarchical Models.
Statistical Science 22, 59--73. - Rasbash, J., Browne, W.J., Healy, M, Cameron, B
and Charlton, C. (2000). The MLwiN software
package version 1.10. London Institute of
Education, University of London. - Steele, F., Goldstein, H. and Browne, W.J.
(2004). A general multilevel multistate competing
risks model for event history data, with an
application to a study of contraceptive use
dynamics. Statistical Modelling 4 145--159 - Van Dyk, D.A., and Meng, X-L. (2001) The Art of
Data Augmentation. Journal of Computational and
Graphical Statistics. 10, 1--50.