Simple methods to improve MCMC efficiency in random effect models - PowerPoint PPT Presentation

About This Presentation
Title:

Simple methods to improve MCMC efficiency in random effect models

Description:

... of mammary gland of dairy cows, usually caused by a bacterial ... years x 52 dairy farms 103 farm years 8,710 cow lactation periods following ... – PowerPoint PPT presentation

Number of Views:81
Avg rating:3.0/5.0
Slides: 51
Provided by: Bro78
Category:

less

Transcript and Presenter's Notes

Title: Simple methods to improve MCMC efficiency in random effect models


1
Simple methods to improve MCMC efficiency in
random effect models
  • William Browne, Mousa Golalizadeh, Martin Green
    and Fiona Steele
  • Universities of Bristol and Nottingham
  • Thanks to ESRC for supporting this work

2
Summary
  • Introduction.
  • Application 1 Clutch size in great tits.
  • Method 1 Hierarchical centering.
  • Method 2 Parameter expansion.
  • Application 2 Mastitis incidence in dairy
    cattle.
  • Method 3 Orthogonal predictors.
  • Application 3 Contraceptive discontinuation in
    Indonesia.
  • Conclusions.

3
Introduction/Synopsis
  • MCMC methods allow easy fitting of complex random
    effects models
  • The simplest (default) MCMC algorithms can
    produce poorly mixing chains.
  • By reparameterising the model one can greatly
    improve mixing.
  • These reparameterisations are easy to implement
    in WinBUGS (or MLwiN)
  • The choice of reparameterisation depends in part
    on the model/dataset.

4
Application 1 Great tit nesting behaviour
(crossed random effects)
  • Original work was collaborative research with
    Richard Pettifor (Institute of Zoology, London),
    and Robin McCleery and Ben Sheldon (University of
    Oxford).

5
Application 1 Great tit nesting behaviour
(crossed random effects)
  • A longitudinal study of great tits nesting in
    Wytham Woods, Oxfordshire.
  • 6 responses 3 continuous 3 binary.
  • Clutch size, lay date and mean nestling mass.
  • Nest success, male and female survival.
  • Data 4165 nesting attempts over a period of 34
    years.
  • There are 4 higher-level classifications of the
    data female parent, male parent, nestbox and
    year.
  • We only consider Clutch size here

6
Data background
The data structure can be summarised as follows
Note there is very little information on each
individual male and female bird but we can get
some estimates of variability via a random
effects model.
7
MCMC efficiency for clutch size response
  • The MCMC algorithm used in the univariate
    analysis of clutch size was a simple 10-step
    Gibbs sampling algorithm.
  • .
  • To compare methods for each parameter we can look
    at the effective sample sizes (ESS) which give an
    estimate of how many independent samples we
    have for each parameter as opposed to 50,000
    dependent samples.
  • ESS of iterations/?,

8
Clutch Size
Here we see that the average clutch size is just
below 9 eggs with large variability between
female birds and some variability between years.
Male birds and nest boxes have less impact.
9
Effective Sample sizes
We will now consider methods that will improve
the ESS values for particular parameters. We will
firstly consider the fixed effect parameter.
10
Trace and autocorrelation plots for fixed effect
using standard Gibbs sampling algorithm
11
Hierarchical Centering
This method was devised by Gelfand et al. (1995)
for use in nested models. Basically (where
feasible) parameters are moved up the hierarchy
in a model reformulation. For example
is equivalent to
The motivation here is we remove the strong
negative correlation between the fixed and random
effects by reformulation.
12
Hierarchical Centering
In our cross-classified model we have 4 possible
hierarchies up which we can move parameters. We
have chosen to move the fixed effect up the year
hierarchy as its variance had biggest ESS
although this choice is rather arbitrary.
The ESS for the fixed effect increases 50-fold
from 602 to 35,063 while for the year level
variance we have a smaller improvement from
29,604 to 34,626. Note this formulation also runs
faster 1864s vs 2601s (in WinBUGS).
13
Trace and autocorrelation plots for fixed effect
using hierarchical centering formulation
14
Parameter Expansion
  • We next consider the variances and in particular
    the between-male bird variance.
  • When the posterior distribution of a variance
    parameter has some mass near zero this can hamper
    the mixing of the chains for both the variance
    parameter and the associated random effects.
  • The pictures over the page illustrate such poor
    mixing.
  • One solution is parameter expansion (Liu et al.
    1998).
  • In this method we add an extra parameter to the
    model to improve mixing.

15
Trace plots for between males variance and a
sample male effect using standard Gibbs sampling
algorithm
16
Autocorrelation plot for male variance and a
sample male effect using standard Gibbs sampling
algorithm
17
Parameter Expansion
In our example we use parameter expansion for all
4 hierarchies. Note the ? parameters have an
impact on both the random effects and their
variance.
The original parameters can be found by
Note the models are not identical as we now have
different prior distributions for the variances.
18
Parameter Expansion
  • For the between males variance we have a 20-fold
    increase in ESS from 33 to 600.
  • The parameter expanded model has different prior
    distributions for the variances although these
    priors are still diffuse.
  • It should be noted that the point and interval
    estimate of the level 2 variance has changed from
  • 0.034 (0.002,0.126) to 0.064 (0.000,0.172).
  • Parameter expansion is computationally slower
    3662s vs 2601s for our example.

19
Trace plots for between males variance and a
sample male effect using parameter expansion.
20
Combining the two methods
Hierarchical centering and parameter expansion
can easily be combined in the same model. Here we
perform centering on the year classification and
parameter expansion on the other 3 hierarchies.
21
Effective Sample sizes
As we can see below the effective sample sizes
for all parameters are improved for this
formulation while running time remains
approximately the same.
22
Applications 2 3 Multilevel discrete-time
survival analysis
  • Used to model the durations until the occurrence
    of events e.g. length of time to death.
  • Different from standard regression due to
    right-censoring and time varying covariates.
  • In many applications events may be repeatable and
    the outcome is duration of continuous exposure to
    the risk of an event.
  • In our applications cows may suffer mastitis more
    than once and women can initiate and discontinue
    use of contraceptives several times.
  • We begin with two levels of data episodes nested
    within individuals.
  • With a discrete time approach we expand the data
    to create a lower level time interval, which
    represents a fixed period of time and is nested
    within episode.

23
Data Expansion
  • Suppose we have time intervals (e.g. days, weeks)
    indexed by t 1,...,K where K is the maximum
    duration of an episode.
  • Let tij denote the number of intervals for which
    individual i is observed in episode j.
  • Before modelling we now need to expand the data
    for each episode ij to obtain tij records.
  • We have ytij0 for t1,,tij-1 and the response
    for t tij is 1 if an episode ends in an event
    and 0 for censored episodes.
  • Example Individual observed for 7 months and has
    event after 4 months
  • Response vector (0,0,0,1,0,0,0),
  • Time intervals (1,2,3,4,1,2,3)

24
Modelling
  • Having restructured the data we can analyse the
    event occurrence indicator ytij using standard
    methods for clustered binary response data.
  • Here we model the probability of an event
    occurrence ptij as a function of duration (zt)
    and covariates xtij with uj being an individual
    specific random effect representing unobserved
    time-invariant individual characteristics.

25
MCMC algorithm for random effect logistic
regression models
Consider the following simple model
A standard MCMC algorithm (as used in MLwiN
(Rasbash et al. (2000), Browne (2003)) is Step 1
Update using random walk Metropolis
sampling. Step 2 Update using
random walk Metropolis sampling. Step 3 Update
from its inverse Gamma full conditional
using Gibbs Sampling
26
Hierarchical centered formulation
We can reparameterise by replacing the residuals
uj with random effects uj ß0uj. Here the uj
are (hierarchically) centred around ß0.
This formulation allows Gibbs sampling to be used
for the fixed effect ß0 in addition to the random
effects variance. It is interesting to consider
how an MCMC algorithm for this centered
formulation can be expressed in terms of the
original parameterisation.
27
MCMC Algorithm
  • At iteration t1
  • Step 1 Update ß0 from its normal conditional
    distribution
  • Adjust
  • to keep uj fixed.
  • The other steps are unchanged by
    reparameterisation as the random walk Metropolis
    sampling for uj is equivalent to the same step
    for uj and the final step is unchanged.

28
Why should hierarchical centering be useful for
discrete time survival models ?
  • As we have seen discrete time survival models are
    a special case of random effects logistic
    regression models.
  • Expansion results in many higher level predictors
    that can be (hierarchically) centered around.
  • Hierarchical centering should improve mixing but
    also speed up estimation.
  • Will be particularly useful when we have many
    level 1 units per level 2 unit and large level 2
    variance, where the speed up will be greatest
    see application 2.
  • Not always useful but other potential solutions
    see application 3.

29
Application 2 Mastitis incidence in dairy cattle
  • Mastitis inflammation of mammary gland of dairy
    cows, usually caused by a bacterial infection.
    Infections arising in dry (non-lactating) period
    often result in clinical mastitis in early
    lactation.
  • Green et al. (2007) use multilevel survival
    models to investigate how cow, farm and
    management factors in the dry period correlate
    with mastitis incidence.
  • Data 2 years x 52 dairy farms 103 farm years
    8,710 cow lactation periods following dry
    periods expands to 256,382 records in total!

30
Mastitis model
  • The Model has many predictors.
  • The duration terms zt consists of an intercept
    plus polynomials in log time to order 3.
  • The 4 levels are farm, farm year, cow dry period
    and weekly obs. although no random effects occur
    at the dry period level.
  • The model can easily be converted to a
    hierarchical centered formulation by centering
    around the 10 farm-year level fixed effects

31
Mastitis predictors
  • Predictors at dry period level (6) are
  • Parity of cow (5 categories 4 dummies)
  • Dummies for Somatic cell count high before drying
    off, and whether two SCC readings were available.
  • At the farm level (9)
  • One dummy for cows remaining standing for 30
    minutes after dry cow treatment,
  • Two dummy variables to indicate whether cubicle
    bedding is disinfected in the early dry period
    and whether this is not applicable due to the
    system used.
  • Two dummy variables to indicate if transition cow
    cubicles are bedded at least once daily and
    whether there are in fact transition cow
    cubicles.
  • A dummy variable as to whether the cubicle
    bedding is disinfected in the transition dry
    period
  • Three dummy variables to indicate whether the
    transition cubicle feed and loaf area is scraped
    daily, more often than daily or doesn't exist.

32
Mastitis results
  • The model was run for 50,000 iterations after an
    adapting period and a further burnin of 500
    iterations in MLwiN.
  • Due to the number of parameters in model we will
    in the table that follows simply give effective
    sample size (ESS) values although it should also
    be noted that parameter values for the
    non-centred formulation were rather variable due
    to the small ESS
  • The estimation took 19 hours for the non-centred
    formulation and 11½ hours for the centred
    formulation.

33
Effective sample sizes
34
Method 3a Orthogonal polynomials
  • It is noticeable that the parameters a1 -a3 have
    small ESS. These correspond to the predictor
    polynomials in log duration and the three
    correlations between the predictors are -0.61,
    0.79 and -0.90 respectively.
  • We can without altering the rest of the analysis
    replace this group of predictors via a
    reparameterisation that makes the predictors less
    correlated and in fact we choose to make them
    orthogonal i.e.
  • To do this we can keep the first predictor and
    then add the rest in one at a time solving the
    above at each stage.
  • This results in predictors z
  • logt, logt20.94logt, logt3 1.50logt2-2.05logt

35
Results of orthogonal polynomials
  • Fitting this model will give coefficient a that
    can be easily converted back to the coefficients
    for the original predictors
  • a1 a10.94 a 2-2.05 a 3, a2 a21.50 a 3, ,
    a3 a3
  • Effective sample size comparison

36
Application 3 Contraceptive discontinuation in
Indonesia
  • Steele et al. (2004) use multilevel multistate
    models to study transitions in and out of
    contraceptive use in Indonesia. Here we consider
    a simplification of their model which considers
    only the transition from use to non-use, commonly
    referred to as contraceptive discontinuation.
  • The data come from the 1997 Indonesia Demographic
    and Health Survey. Contraceptive use histories
    were collected retrospectively for the six-year
    period prior to the survey, and include
    information on the month and year of starting and
    ceasing use, the method used, and the reason for
    discontinuation.
  • The analysis is based on 17,833 episodes of
    contraceptive use for 12,594 women, where an
    episode is defined as a continuous period of
    using the same method of contraception.
  • Restructuring the data to discrete-time format
    with monthly time intervals leads to 365,205
    records. To reduce the size, monthly intervals
    are grouped into six-month intervals and a
    binomial response is defined with denominator
    equal to the number of months of use within a
    six-month interval. Aggregation of intervals
    leads to a dataset with 68,515 records.

37
Model
  • We here have intervals nested within episodes
    nested within women.

Here we include discrete categories for the
duration terms zt a piecewise constant hazard
with categories representing 6-11 months, 12-23
months, 24-35 months and gt35 months with a base
category of 0-5 months.
38
Predictors
  • At the episode level we have
  • Age categorized as lt25,25-34 and 34-49.
  • Contraceptive method categorized as
    pill/injectable, Norplant/IUD, other modern and
    traditional.
  • At the woman level we have
  • Education (3 categories).
  • Type of region of residence (urban/rural).
  • Socioeconomic status (low, medium or high).

39
Results
The table below gives the effective sample sizes
based on runs of 250,000 iterations.
Here the hierarchical centered formulation does
really badly. This is because the cluster
variance s2u is very small estimates of 0.041
and 0.022 for the two methods
40
A closer look at the residuals
  • It is well known that hierarchical centering
    works best if the cluster level variance is
    substantial.
  • Here we see that both the variance is small and
    the distribution of the residuals is not very
    normal.
  • This is due to a few women who discontinue usage
    very quickly and often, whilst many women never
    discontinue!

Std residuals
Normal scores
41
Simple logistic regression
  • We will consider first removing the random
    effects from the model (due to their small
    variance) which will result in a simple logistic
    regression model.
  • It will then be impossible to perform
    hierarchical centering however we will consider
    an extension to the orthogonalisation performed
    in the previous application.
  • Note that Hills and Smith (1992) talk about using
    orthogonal parameterisations and Roberts and
    Gilks give it one sentence in MCMC in Practice.
    Here we consider it in combination with the
    simple (single site) random walk Metropolis
    sampler where reduction of correlation in the
    posterior is perhaps most important.

42
Method 3bOrthogonal parameterisation
  • For simplicity assume we have all predictors in
    one matrix P and that we can write ztaxtijß as
    ptij? where ?(a,ß).
  • Step 1 Number the predictors in some ordering
    1,,N.
  • Step 2 Take each predictor in turn and replace
    it with a predictor that is orthogonal to all the
    already considered predictors.
  • For predictor pk.
  • Note this requires solving k-1 equations in k-1
    unknown w parameters.
  • A different orthogonal set of predictors results
    from each ordering.

43
Orthogonal parameterisation
  • The second step of the algorithm produces both a
    set of orthogonal predictors that span the same
    space as the original predictors and a group of w
    coefficients that can be combined to form a lower
    diagonal matrix W.
  • We can fit this model and recover the
    coefficients for the original predictors by
    pre-multiplication by WT.
  • It is worth noting here that we use improper
    Uniform priors for the coefficients and if we
    used proper priors we would need to also
    calculate the Jacobian for the reparameterisation
    to ensure the same priors are used.
  • We ordered the predictors in what follows so that
    the level 2 predictors were last before
    performing reparameterisation.

44
Results
  • The following is based on 50,000 iterations

Here we see almost universal benefit of the
orthogonal parameterisation with virtually zero
time costs and very little programming!
45
Combining orthogonalisation with parameter
expansion
  • Combining orthogonalisation and parameter
    expansion we have

We ran this model using WinBUGS and only 25,000
iterations following a burnin of 500 iterations
which took 34 hours compared to 23½ for 250k in
MLwiN without parameter expansion. The results
are given overleaf.
46
Results for full model
  • Here we compare simply using the orthogonal
    approach in MLwiN for 250k with both orthogonal
    predictors and parameter expansion in WinBUGS for
    25k. Note this takes 1.5 times as long

47
Trace plots of variance chain
Before Parameter Expansion
Here we see the far greater mixing of the
variance chain after parameter expansion. It is
worth noting that parameter expansion uses a
different prior for s2u and results in an
estimate of 0.059 (0.048) as opposed to 0.008
(0.006) without and earlier estimates of
0.041(0.026) and 0.022 (0.018) before
orthogonalisation. Note however that all
estimates bar parameter expansion are based on
very low ESS!
After Parameter Expansion
48
Conclusions
  • Hierarchical centering as is well known - works
    well when the cluster variance is big but is no
    good for small variances
  • Other research building on hierarchical centering
    includes Papaspiliopoulus et al. (2003,2007)
    showing how to construct partially non-centred
    parameterisations.
  • Parameter expansion works well to improve mixing
    when the cluster variance is small but results in
    a different prior for the variance.
  • Orthogonalisation of predictors appears to be a
    good idea generally but is slightly more involved
    than the other reparameterisations i.e. the
    predictors need orthogonalising outside WinBUGS
    and the chains need transforming back.
  • An interesting area of further research is
    choosing the order for orthogonalisation i.e.
    which set of orthogonal predictors to use.

49
References
  • Browne, W.J. (2003). MCMC Estimation in MLwiN.
    London Institute of Education, University of
    London
  • Browne, W.J. (2004). An illustration of the use
    of reparameterisation methods for improving MCMC
    efficiency in crossed random effect models
    Multilevel Modelling Newsletter 16 (1) 13-25
  • Gamerman D. (1997) Sampling from the posterior
    distribution in generalized linear mixed models.
    Statistics and Computing. 7, 57--68.
  • Gelfand, A.E., Sahu, S.K. and Carlin, B.P. (1995)
    Efficient parameterisations for normal linear
    mixed models. Biometrika 83, 479--488
  • Gelman, A., Huang, Z., van Dyk, D., and
    Boscardin, W.J. (2007). Using redundant
    parameterizations to fit hierarchical models.
    Journal of Computational and Graphical Statistics
    (to appear).
  • Green, M.J., Bradley, A.J., Medley, G.F. and
    Browne, W.J. (2007) Cow, Farm and Management
    Factors during the Dry Period that determine the
    rate of Clinical Mastitis after calving. Journal
    of Dairy Science 90 3764--3776.
  • Hills, S.E. and Smith, A.F.M. (1992)
    Parameterization Issues in Bayesian Inference. In
    Bayesian Statistics 4, (J M Bernardo, J O Berger,
    A P Dawid, and A F M Smith, eds), Oxford
    University Press, UK, pp. 227--246.

50
References cont.
  • Liu, C., Rubin, D.B., and Wu, Y.N. (1998)
    Parameter expansion to accelerate EM The PX-EM
    algorithm. Biometrika 85 (4) 755-770.
  • Liu, J.S., Wu, Y.N. (1999) Parameter Expansion
    for Data Augmentation. Journal Of The American
    Statistical Association 94 1264-1274
  • Papaspiliopoulos, O, Roberts, G.O. and Skold, M.
    (2003) Non-centred Parameterisations for
    Hierarchical Models and Data Augmentation. In
    Bayesian Statistics 7, (J M Bernardo, M J
    Bayarri, J O Berger, A P Dawid, D Heckerman, A F
    M Smith and M West, eds), Oxford University
    Press, UK, pp. 307--32
  • Papaspiliopoulos, O, Roberts, G.O. and Skold, M.
    (2007) A General Framework for the
    Parametrization of Hierarchical Models.
    Statistical Science 22, 59--73.
  • Rasbash, J., Browne, W.J., Healy, M, Cameron, B
    and Charlton, C. (2000). The MLwiN software
    package version 1.10. London Institute of
    Education, University of London.
  • Steele, F., Goldstein, H. and Browne, W.J.
    (2004). A general multilevel multistate competing
    risks model for event history data, with an
    application to a study of contraceptive use
    dynamics. Statistical Modelling 4 145--159
  • Van Dyk, D.A., and Meng, X-L. (2001) The Art of
    Data Augmentation. Journal of Computational and
    Graphical Statistics. 10, 1--50.
Write a Comment
User Comments (0)
About PowerShow.com