Mixed effects models for a hierarchical analysis

of fMRI data, and Bayesian model

selection applied to activation detection

- Keith Worsley12, Chuanhong Liao1, John Aston123,
- Jean-Baptiste Poline4, Gary Duncan5, Vali Petre2,

- Frank Morales6, Alan Evans2, Ed George7
- 1Department of Mathematics and Statistics, McGill

University, - 2Brain Imaging Centre, Montreal Neurological

Institute, - 3Imperial College, London,
- 4Service Hospitalier Frédéric Joliot, CEA, Orsay,

- 5Centre de Recherche en Sciences Neurologiques,

Université de Montréal, - 6Cuban Neuroscience Centre
- 7Wharton School

fMRI data 120 scans, 3 scans each of hot, rest,

warm, rest, hot, rest,

T (hot warm effect) / S.d. t110 if no

effect

(No Transcript)

Results from 4 runs on the same subject

Run 1

Run 2

Run 3

Run 4

1

Effect,

0

E

i

_mag_ef

-1

0.2

Sd,

S

0.1

i

_mag_sd

0

5

T stat,

0

E

/ S

i

i

_mag_t

-5

MULTISTAT mixed effects linear model for

combining effects from different

runs/sessions/subjects

- Ei effect for run/session/subject i
- Si standard error of effect
- Mixed effects model
- Ei covariatesi c Si WNiF ? WNiR

from FMRILM

?

?

Usually 1, but could add group, treatment,

age, sex, ...

Random effect, due to variability from run to run

Fixed effects error, due to variability within

the same run

_mag_ef

very noisy sd

_mag_sd

and Tgt15.96 for Plt0.05 (corrected)

_mag_t

so no response is detected

REML estimation using the EM algorithm

- Slow to converge (10 iterations by default).
- Stable (maintains estimate ?2 gt 0 ), but
- ?2 biased if ?2 (random effect) is small, so
- Re-parameterize the variance model
- Var(Ei) Si2 ?2
- (Si2 minj Sj2) (?2

minj Sj2) - Si2

?2 - ?2 ?2 minj Sj2 (less biased estimate)

?

?

Solution Spatial regularization of the sd

- Basic idea increase df by spatial smoothing

(local pooling) of the sd. - Cant smooth the random effects sd directly, -

too much anatomical structure. - Instead,
- random effects sd
- fixed effects sd
- which removes the anatomical structure before

smoothing.

? )

sd smooth ?

fixed effects sd

Average Si

?

Random effects sd, 3 df

Fixed effects sd, 440 df

Mixed effects sd, 100 df

0.2

0.15

0.1

0.05

0

divide

multiply

Random sd / fixed sd

Smoothed sd ratio _sdratio

1.5

random effect, sd ratio 1.3

1

0.5

Effective df depends on smoothing

- dfratio dfrandom(2 1)
- 1 1 1
- dfeff dfratio dffixed

FWHMratio2 3/2 FWHMdata2

e.g. dfrandom 3, dffixed 4 ? 110

440, FWHMdata 8mm

fixed effects analysis, dfeff 440

dfeff

FWHM 19mm

Target 100 df

random effects analysis, dfeff 3

FWHMratio

Final result 19mm smoothing, 100 effective df

_mag_ef

_ef

less noisy sd

_sd

_mag_sd

and Tgt4.93 for Plt0.05 (corrected)

_t

_mag_t

and now we can detect a response!

Estimating the delay of the response

- Delay or latency to the peak of the HRF is

approximated by a linear combination of two

optimally chosen basis functions

delay

basis1

basis2

HRF

shift

- HRF(t shift) basis1(t) w1(shift)

basis2(t) w2(shift) - Convolve bases with the stimulus, then add to

the linear model

- Fit linear model, estimate w1 and w2
- Equate w2 / w1 to estimates, then solve for

shift (Hensen et al., 2002) - To reduce bias when the magnitude is small, use
- shift / (1 1/T2)
- where T w1 / Sd(w1) is the T statistic for the

magnitude - Shrinks shift to 0 where there is little

evidence for a response.

w2 / w1

w1

w2

Shift of the hot stimulus

T stat for magnitude _mag_t T stat for

shift _del_t

Shift (secs) _del_ef Sd of shift

(secs) _del_sd

Shift of the hot stimulus

T stat for magnitude _mag_t T stat for

shift _del_t

Tgt4

T2

Shift (secs) _del_ef Sd of shift

(secs) _del_sd

1 sec

/- 0.5 sec

Combining shifts of the hot stimulus (Contours

are T stat for magnitude gt 4)

_ef

_del_ef

_sd

_del_sd

_t

_del_t

Shift of the hot stimulus

Shift (secs) _del_ef

T stat for magnitude _mag_t gt 4.93

(No Transcript)

FWHM the local smoothness of the noise

voxel size (1 correlation)1/2

FWHM (2 log

2)1/2

(If the noise is modeled as white noise smoothed

with a Gaussian kernel, this would be its FWHM)

Volume FWHM3

P-values depend on Resels

Resels

Local maximum T 4.5

Clusters above t 3.0, search volume resels 500

0.1

0.1

0.08

0.08

0.06

0.06

P value of local max

P value of cluster

0.04

0.04

0.02

0.02

0

0

0

500

1000

0

0.5

1

1.5

2

Resels of search volume

Resels of cluster

STAT_SUMMARY

In between use Discrete Local Maxima (DLM)

High FWHM use Random Field Theory

Low FWHM use Bonferroni

Bonferroni

4.7

4.6

4.5

True

T, 10 df

4.4

Random Field Theory

4.3

T, 20 df

Discrete Local Maxima (DLM)

4.2

Gaussianized threshold

4.1

Gaussian

4

3.9

Bonferroni, NResels

3.8

3.7

0

1

2

3

4

5

6

7

8

9

10

FWHM of smoothing kernel (voxels)

STAT_SUMMARY

In between use Discrete Local Maxima (DLM)

High FWHM use Random Field Theory

Low FWHM use Bonferroni

0.12

Gaussian

T, 20 df

T, 10 df

0.1

Random Field Theory

Bonferroni

0.08

DLM can ½ P-value when FWHM 3 voxels

0.06

P-value

0.04

Discrete Local Maxima

True

0.02

Bonferroni, NResels

0

0

1

2

3

4

5

6

7

8

9

10

FWHM of smoothing kernel (voxels)

(No Transcript)

STAT_SUMMARY example single run, hot-warm

Detected by BON and DLM but not by RFT

Detected by DLM, but not by BON or RFT

Bayesian Model Selection (thanks to Ed George)

- Z-statistic SPM at voxel i is Zi N(mi,1), i

1, , n - Most of the mis are zero (unactivated voxels)

and a few are non-zero (activated voxels), but we

do not know which voxels are activated, and by

how much (mi) - This is a model selection problem, where we add

an extra model parameter (mi) for the mean of

each activated voxel - Simple Bayesian set-up
- each voxel is independently active with

probability p - the activation is itself drawn independently

from a Gaussian distribution mi N(0,c) - The hyperparameter p controls the expected

proportion of activated voxels, and c controls

their expected activation

- Surprise! This prior setup is related to the

canonical penalized sum-of-squares criterion - AF Sactivated voxels Zi2 F q
- where - q is the number of activated voxels and
- - F is a fixed penalty for adding an activated

voxel - Popular model selection criteria simply entail
- - maximizing AF for a particular choice of F
- - which is equivalent to thresholding the image

at vF - Some choices of F
- - F 0 all voxels activated
- - F 2 Mallows Cp and AIC
- F log n BIC
- F 2 log n RIC
- P(Z gt vF) 0.05/n Bonferroni (almost same as

RIC!)

- The Bayesian relationship with AF is obtained by

re-expressing the posterior of the activated

voxels, given the data - P(activated voxels Zs) a exp ( c/2(1c) AF )
- where
- F (1c)/c 2 log(1-p)/p log(1c)
- Since p and c control the expected number and

size of the activation, the dependence of F on p

and c provides an implicit connection between the

penalty F and the sorts of models for which its

value may be appropriate

- The awful truth p and c are unknown
- Empirical Bayes idea use p and c that maximize

the marginal likelihood, which simplifies to - L(p,c Zs) a ?i (1-p)exp(-Zi2/2)

p(1c)-1/2exp(-Zi2/2(1c) ) - This is identical to fitting a classic mixture

model with - - a probability of (1-p) that Zi N(0,1)
- - a probability of p that Zi N(0,c)
- - vF is the value of Z where the two components

are equal - Using these estimated values of p and c gives us

an adaptive penalty F, or equivalently a

threshold vF, that is implicitly based on the SPM - All we have to do is fit the mixture model but

does it work?

- Same data as before hot warm stimulus, four

runs - - proportion of activated voxels p 0.57
- - variance of activated voxels c 5.8 (sd

2.4) - - penalty F 1.59 (a bit like AIC)
- - threshold vF 1.26 (?) seems a bit low

AIC vF 2 FDR (0.05) vF 2.67 BIC vF

3.21 RIC vF 4.55 Bon (0.05) vF 4.66

Null model N(0,1)

vF threshold where components are equal

Mixture

Histogram of SPM (n30786)

57 activated voxels, N(0,5.8)

43 un- activated voxels, N(0,1)

Z

- Same data as before hot warm stimulus, one

run - - proportion of activated voxels p 0.80
- - variance of activated voxels c 1.55
- - penalty F -3.02 (?)
- - all voxels activated !!!!!! What is going on?

AIC vF 2 FDR (0.05) vF 2.67 BIC vF

3.21 RIC vF 4.55 Bon (0.05) vF 4.66

Null model N(0,1)

components are never equal!

Histogram of SPM (n30768)

80 activated voxels, N(0,1.55)

Mixture

20 un- activated voxels, N(0,1)

Z

Difference with SPM

- FMRISTAT second level analysis is univariate

(can only lead to a T test) SPM is multivariate

(can lead to an F test) - To do this, SPM has to assume various parameters

are global - Not clear if allowing for correlated contrasts

at second level improves inference for a single

contrast i.e. better T stats (in the end, most

stats are T stats, not F) in fact if the

correlations and models across subjects are

equal, nothing is gained - FMRISTAT uses spatial information to boost df

SPM is mass univariate

Tgt4.86

Tgt4.86

T gt 4.93 (P lt 0.05, corrected)

Tgt4.86

T gt 4.93 (P lt 0.05, corrected)

Tgt4.86

Conjunction Minimum Ti gt threshold

Minimum of Ti _conj

Average of Ti _mag_t

For P0.05, threshold 1.82

For P0.05, threshold 4.93

Efficiency 82

Efficiency optimum block design

Sd of hot stimulus

Sd of hot-warm

0.5

0.5

20

20

0.4

0.4

15

15

Optimum design

Magnitude

0.3

0.3

10

10

Optimum design

0.2

0.2

X

5

5

0.1

0.1

0

X

0

0

0

5

10

15

20

5

10

15

20

InterStimulus Interval (secs)

(secs)

(secs)

1

1

20

20

0.8

0.8

15

15

Delay

0.6

0.6

Optimum design X

Optimum design X

10

10

0.4

0.4

5

5

0.2

0.2

(Not enough signal)

(Not enough signal)

0

0

0

5

10

15

20

5

10

15

20

Stimulus Duration (secs)

Efficiency optimum event design

0.5

(Not enough signal)

uniform . . . . . . . . .

____ magnitudes . delays

random .. . ... .. .

0.45

concentrated

0.4

0.35

0.3

Sd of effect (secs for delays)

0.25

0.2

0.15

0.1

0.05

0

5

10

15

20

Average time between events (secs)

How many subjects?

- Largest portion of variance comes from the last

stage i.e. combining over subjects - sdrun2 sdsess2

sdsubj2 - nrun nsess nsubj nsess nsubj

nsubj - If you want to optimize total scanner time, take

more subjects. - What you do at early stages doesnt matter very

much!

- Comparison
- Different slice acquisition times
- Drift removal
- Temporal correlation
- Estimation of effects
- Rationale
- Random effects
- Map of the delay

- SPM99
- Adds a temporal derivative
- Low frequency cosines (flat at the ends)
- AR(1), global parameter, bias reduction not

necessary - Band pass filter, then least-squares, then

correction for temporal correlation - More robust,
- but lower df
- No regularization,
- low df, no conjuncs
- No

- fmristat
- Shifts the model
- Splines
- (free at the ends)
- AR(p), voxel parameters, bias reduction
- Pre-whiten, then least squares (no further

corrections needed) - More accurate, higher df
- Regularization, high df, conjuncs
- Yes

References

- http//www.math.mcgill.ca/keith/fmristat
- Worsley et al. (2002). A general statistical

analysis for fMRI data. NeuroImage, 151-15. - Liao et al. (2002). Estimating the delay of the

response in fMRI data. NeuroImage, 16593-606.

Functional connectivity

- Measured by the correlation between residuals at

every pair of voxels (6D data!) - Local maxima are larger than all 12 neighbours
- P-value can be calculated using random field

theory - Good at detecting focal connectivity, but
- PCA of residuals x voxels is better at detecting

large regions of co-correlated voxels

Activation only

Correlation only

Voxel 2

Voxel 2

Voxel 1

Voxel 1

Correlations gt 0.7, Plt10-10 (corrected)

First Principal Component gt threshold

False Discovery Rate (FDR)

- Benjamini and Hochberg (1995), Journal of the

Royal Statistical Society - Benjamini and Yekutieli (2001), Annals of

Statistics - Genovese et al. (2001), NeuroImage
- FDR controls the expected proportion of false

positives amongst the discoveries, whereas - Bonferroni / random field theory controls the

probability of any false positives - No correction controls the proportion of false

positives in the volume

P lt 0.05 (uncorrected), T gt 1.64 5 of volume is

false

Signal Gaussian white noise

Signal

True

Noise

False

FDR lt 0.05, T gt 2.82 5 of discoveries is false

P lt 0.05 (corrected), T gt 4.22 5 probability of

any false

Comparison of thresholds

- FDR depends on the ordered P-values
- P1 lt P2 lt lt Pn. To control the FDR at a

0.05, find - K max i Pi lt (i/n) a, threshold the

P-values at PK - Proportion of true 1 0.1 0.01

0.001 0.0001 - Threshold T 1.64 2.56 3.28

3.88 4.41 - Bonferroni thresholds the P-values at a/n
- Number of voxels 1 10 100 1000

10000 - Threshold T 1.64 2.58 3.29

3.89 4.42 - Random field theory resels volume / FHHM3
- Number of resels 0 1 10

100 1000 - Threshold T 1.64 2.82 3.46

4.09 4.65

P lt 0.05 (uncorrected), T gt 1.64 5 of volume is

false

FDR lt 0.05, T gt 2.67 5 of discoveries is false

P lt 0.05 (corrected), T gt 4.93 5 probability of

any false

(No Transcript)

PCA_IMAGE PCA of time ? space

1 exclude first frames

2 drift

3 long-range correlation or anatomical effect

remove by converting to of brain

4 signal?

FMRILM fits a linear model for fMRI time series

with AR(p) errors

- Linear model
- ?

? - Yt (stimulust HRF) b driftt c errort
- AR(p) errors
- ? ?

? - errort a1 errort-1 ap errort-p s WNt

unknown parameters

(No Transcript)

FMRIDESIGN example pain perception

(No Transcript)

FMRILM first step estimate the autocorrelation

?

- AR(1) model errort a1 errort-1 s WNt
- Fit the linear model using least squares
- errort Yt fitted Yt
- â1 Correlation ( errort , errort-1)
- Estimating errorts changes their correlation

structure slightly, so â1 is slightly biased

which_stats _cor - Raw autocorrelation Smoothed 12.4mm Bias

corrected â1 -

-0.05 0

Effective df depends on smoothing

- Variability in acor lowers df
- Df depends
- on contrast
- Smoothing acor brings df back up

dfacor dfresidual(2 1)

1 1 2 acor(contrast of data)2

dfeff dfresidual dfacor

FWHMacor2 3/2 FWHMdata2

Hot-warm stimulus

Hot stimulus

FWHMdata 8.79

Residual df 110

Residual df 110

100

100

Target 100 df

Target 100 df

Contrast of data, acor 0.79

50

50

Contrast of data, acor 0.61

dfeff

dfeff

0

0

0

10

20

30

0

10

20

30

FWHM 10.3mm

FWHM 12.4mm

FWHMacor

FWHMacor

FMRILM second step refit the linear model

Pre-whiten Yt Yt â1 Yt-1, then fit using

least squares

Hot - warm effect, _mag_ef

Sd of effect, _mag_sd

1

0.25

0.2

0.5

0.15

0

0.1

-0.5

0.05

-1

0

which_stats _mag_ef _mag_sd _mag_t

T effect / sd, 110 df _mag_t

6

4

2

0

-2

-4

-6

Higher order AR model? Try AR(3) _AR

a

a

a

1

2

3

0.3

0.2

AR(1) seems to be adequate

0.1

0

has little effect on the T statistics

-0.1

AR(1)

AR(2)

AR(3)

5

0

-5

Non-isotropic data (spatially varying FWHM)

- fMRI data is smoother in GM than WM
- VBM data is highly non-isotropic
- Has little effect on P-values for local maxima

(use average FWHM inside search region), but - Has a big effect on P-values for spatial extents

smooth regions ? big clusters,

rough regions ? small clusters, so - Replace cluster volume by cluster resels

volume / FWHM3

_fwhm

_fwhm

FWHM (mm) of scans (110 df)

FWHM (mm) of effects (3 df)

20

20

Resels1.90 P0.007

15

15

10

10

Resels0.57 P0.387

5

5

0

0

FWHM of effects (smoothed)

effects / scans FWHM (smoothed)

20

1.5

15

10

1

5

0

0.5