Title: Lecture 12. Bayesian Regression with conjugate and convenient priors
1. Lecture 12. Bayesian Regression with conjugate and convenient priors
- Conjugate Prior Analysis
- Convenient Priors
2. The Bayesian Setup
- For the normal linear model, we have
- yi ~ N(μi, σ²) for i = 1, …, n
- where μi is shorthand for the expression
- μi = β0 + β1X1i + … + βkXki
- The object of statistical inference is the posterior distribution of the parameters β0, …, βk and σ².
- By Bayes Rule, we know that this is simply
- p(β0, …, βk, σ² | Y, X) ∝ p(β0, …, βk, σ²) × ∏i p(yi | μi, σ²)
3. Conjugate priors and the normal linear model
- Suppose that instead of an improper prior, we decide to use the conjugate prior.
- For the normal regression model, the conjugate prior distribution for p(β0, …, βk, σ²) is the normal-inverse-gamma distribution.
- We've seen this distribution before when we studied the normal model with unknown mean and variance. We know that this distribution can be factored such that
- p(β0, …, βk, σ²) = p(β0, …, βk | σ²) p(σ²)
- p(β0, …, βk | σ²) ~ NMV(Bprior, Σprior),
- where Σprior is the prior covariance matrix for the βs,
- and p(σ²) ~ Inverse-Gamma(aprior, bprior).
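- In the usual parameterization of this family (an assumption made explicit here, since the slide does not state it), the conditional prior covariance scales with σ²:

  \[ p(B \mid \sigma^2) = N(B \mid B_{prior},\ \sigma^2 \Sigma_{prior}), \qquad \sigma^2 \sim \mathrm{Inverse\text{-}Gamma}(a_{prior}, b_{prior}) \]

  On this convention the conditional prior precision Σprior⁻¹ and the data cross-product XᵀX are measured on the same scale, which is why they appear as comparable weights on the next slide.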
4. Conjugate priors and the normal linear model
- If we use a conjugate prior, then the posterior distribution will have the same form as the prior. Thus, the posterior distribution will also be normal-inverse-gamma. If we integrate out σ², the marginal posterior for B will be a multivariate t-distribution.
- Notice that the posterior coefficients are essentially a weighted average of the prior coefficients described by Bprior and the standard OLS estimates B̂. The weights are provided by the conditional prior precision Σ⁻¹ and the data cross-product XᵀX. This should make clear that as we increase our prior precision (decrease our prior variance) for B, we place greater posterior weight on our prior beliefs relative to the data.
- Note: Zellner (1971) treats Bprior and the conditional prior variance Σ in the following way. Suppose you have two data sets, (Y1, X1) and (Y2, X2). He sets Bprior equal to the posterior mean from a regression analysis of (Y1, X1) with the improper prior 1/σ², and sets the conditional prior precision Σ⁻¹ equal to X1ᵀX1.
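- Written out (a standard result for this conjugate family, with Σ the conditional prior covariance and B̂ = (XᵀX)⁻¹XᵀY the OLS estimate):

  \[ B_{post} = (\Sigma^{-1} + X^{\top}X)^{-1}(\Sigma^{-1} B_{prior} + X^{\top}X \hat{B}) \]

  As Σ⁻¹ grows relative to XᵀX, Bpost moves toward Bprior; as the data accumulate, XᵀX dominates and Bpost approaches B̂.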
5. Conjugate priors and the normal linear model
- To summarize our uncertainty about the coefficients, we use the posterior variance-covariance matrix for B; the posterior standard deviations can be taken from the square root of its diagonal.
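- A standard way to write this matrix under the normal-inverse-gamma updating above (with apost = aprior + n/2; the four summands of bpost are the four terms discussed below):

  \[ \mathrm{Var}(B \mid Y, X) = \frac{b_{post}}{a_{post} - 1}\,(\Sigma^{-1} + X^{\top}X)^{-1} \]

  \[ b_{post} = b_{prior} + \tfrac{1}{2}\big[(Y - X\hat{B})^{\top}(Y - X\hat{B}) + (B_{post} - B_{prior})^{\top}\Sigma^{-1}(B_{post} - B_{prior}) + (\hat{B} - B_{post})^{\top}X^{\top}X(\hat{B} - B_{post})\big] \]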
The first term is simply the prior value bprior. The second term involves the residual sum of squares, which is n times the maximum likelihood estimate of the variance. The third term states that our variance estimates will be greater if our prior values for the regression coefficients differ from their posterior values, especially if we indicate a great deal of confidence in our prior beliefs by assigning small variances in the matrix Σ. The fourth term states that our variance estimates for the regression coefficients will be greater if the standard OLS estimates differ from the posterior values, especially if XᵀX is large.
WinBUGS implementation would proceed in a manner akin to the earlier example.
6. Example One: Party Activists and Partisan Polarization
- Parties reward their core supporters with policy in exchange for activists' assistance during the campaign.
- Research design: regression analysis by party.
- Dependent variable: Party-in-Government Ideology - the party's median DW-Nominate score for the 93rd through the 107th Congress.
- Independent variable: Mean Party Activist Ideology - average response to the NES 7-point liberal-conservative ideology scale among a party's identifiers who were active in the campaign.
- Independent variable: Mean Party Non-Activist Ideology - average response to the NES 7-point liberal-conservative ideology scale among a party's identifiers who were not active in the campaign.
7. Classical OLS Estimates
[Table of OLS estimates not reproduced.]
* Denotes statistical significance in a one-tailed test at p < .10 (n = 15). ** Denotes statistical significance in a one-tailed test at p < .05. *** Denotes statistical significance in a one-tailed test at p < .025.
8. Is there a fairer test?
- With the small sample size, it is difficult to say conclusively that the parties are not following their activists.
- To improve the statistical power of the test, we could wait.
- Someone more clever could come up with a better research design.
- We could go the Bayesian route and incorporate prior beliefs about the data-generating process into the model.
9. The probability model (conjugate prior analysis)
Likelihood: Party Ideologyi ~ iid Normal(μi, σ²), where μi = β0 + β1 Activist Ideologyi + β2 Non-Activist Ideologyi
Priors: (β0, β1, β2) | σ² ~ Normal(BParty, Σ), where BDem = (-100, .099, .234)ᵀ, BRep = (-100, .189, .301)ᵀ, and Σ = Diag(10000, prior var(β1), 1); σ² ~ Inv-Gamma(1, 1)
We will vary prior var(β1) from 1 down to a small number to see how strong our prior beliefs have to be to find a statistically significant posterior value for β1.
* Based on the bivariate regressions with significant Activist coefficients.
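- A WinBUGS sketch of this conjugate setup for the Democratic party (an illustration rather than the lecture's actual code: the data names dem.dw, act, and nonact are placeholders, priorvar1 would be supplied as data, and the conditional prior precisions are made proportional to tau so that the joint prior is normal-inverse-gamma):

  model {
    for (i in 1:15) {
      dem.dw[i] ~ dnorm(mu[i], tau)
      mu[i] <- b[1] + b[2]*act[i] + b[3]*nonact[i]
    }
    # conditional prior for B given sigma^2: precision proportional to tau = 1/sigma^2
    b[1] ~ dnorm(-100, p[1])
    b[2] ~ dnorm(0.099, p[2])
    b[3] ~ dnorm(0.234, p[3])
    p[1] <- tau / 10000
    p[2] <- tau / priorvar1   # the prior variance we vary from 1 downward
    p[3] <- tau / 1
    tau ~ dgamma(1, 1)        # i.e., sigma^2 ~ Inverse-Gamma(1, 1)
  }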
10. [Results table not reproduced.] * Significant at p < .1, one-tailed test
11. [Results table not reproduced.] * Significant at p < .1, one-tailed test
12. Example 2: Western and Jackman (1994)
- What explains cross-national variation in union density?
- Union density is defined as the percentage of the work force that belongs to a labor union.
- Competing theories:
- Wallerstein: union density depends on the size of the civilian labor force.
- Stephens: union density depends on industrial concentration.
- Note: these two predictors correlate at -.92.
- Control variable: presence of a pro-labor government.
- Sample: n = 20 countries with a continuous history of democracy since World War II.
13. Results with non-informative priors [table not reproduced]
14. Justification for the Bayesian Approach
- Wallerstein and Stephens reach an empirical impasse: because of the small sample size and the multicollinear predictors, they are not able to adjudicate between the two theories.
- The incorporation of prior information provides additional structure to the data, which helps to uniquely identify the two coefficients.
- Priors can be developed as equivalent to prior data sets, inflating the de facto n.
- The data set contains all available observations from a population of interest; it is not a random sample. More generally, cross-national data sets are not generated by a repeatable data-generating process.
- Frequentist inference about a statistic (e.g., a regression coefficient) is obtained through the assumption that the process generating the data could be repeated a large number of times.
- Specifically, frequentist inference is about the proportion of the time that, in the long run, realizations of this statistic will fall within some interval.
- If there is no long run, or possibility of repetition, then frequentist probabilistic summaries are not appropriate.
15. The probability model
- As best as I can tell (Western and Jackman don't specify the full probability model), we have
- union densityi ~ N(μi, σ²)
- μi = β0 + β1 Left Govti + β2 Labor Forcei + β3 Industrial Concentrationi
- βWallerstein ~ NMV((0, .3, -5, 0)ᵀ, Diag(100000, 0.15, 2.5, 100000))
- So, informative priors are chosen for Left Govt and Labor Force, while diffuse priors are chosen for the intercept and Industrial Concentration.
- βStephens ~ NMV((0, .3, 0, 10)ᵀ, Diag(100000, 0.15, 100000, 5))
- So, informative priors are chosen for Left Govt and Industrial Concentration, while diffuse priors are chosen for the intercept and Labor Force.
- I believe that σ² is assumed to be known.
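- With σ² treated as known, the multivariate normal prior is conjugate on its own, and the posterior has the standard closed form

  \[ B \mid Y \sim N\big((\Sigma_0^{-1} + \sigma^{-2} X^{\top}X)^{-1}(\Sigma_0^{-1} B_0 + \sigma^{-2} X^{\top}Y),\ (\Sigma_0^{-1} + \sigma^{-2} X^{\top}X)^{-1}\big) \]

  where B0 and Σ0 are the prior mean vector and covariance matrix given above (Wallerstein's or Stephens's).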
16. Results with Wallerstein's priors [table not reproduced]
17. Results with Stephens's priors [table not reproduced]
18. Final comments on Western and Jackman
- 1) Even with Stephens's priors, Wallerstein's hypothesis appears robust; however, the opposite relationship does not hold.
- 2) Western and Jackman reanalyze the data with the same prior means, but inflate the prior variances for the regression coefficients to see how sensitive the results are.
- 3) Western and Jackman report the Bayesian influence statistic, which describes the influence of the ith observation on the joint posterior distribution of the regression coefficients.
- This statistic is interesting because it shows how observations tend to become more influential with larger prior variances.
- Like the traditional Cook's distance and other measures of leverage, this statistic provides evidence about the effects of outliers on posterior inference.
19. Convenient, but non-conjugate priors
- In WinBUGS, we typically would not trick the program into implementing conjugate or improper priors.
- Instead, we would typically assume that our prior beliefs about the regression coefficients and the variance can be factored into separate distributions. Thus, p(β, τ) = p(β)p(τ).
- A common model assumes the following form for the likelihood:
- yi ~ N(μi, τ), where μi = β0 + β1X1i + … + βnXni for all i (following WinBUGS, the normal is parameterized here by its precision, τ = 1/σ²)
- The non-informative priors would be defined as follows:
- βj ~ N(0, .00001) for all j
- and τ ~ Gamma(.00001, .00001)
- WinBUGS implementation is straightforward, except that we may need to write out a list of initial values for our parameters, especially for τ. This is because when WinBUGS creates initial values from these priors, it is possible that the program will make implausible choices (e.g., τ = -10).
- See Congdon's code online and the WinBUGS help for examples.
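- For instance, an initial-values list for the model on the next slides might look like the following (a sketch; the node names must match those in the model):

  list(b = c(0, 0, 0, 0), tau = 1)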
20. Job Satisfaction: Congdon Example 4.3
- Theory: job satisfaction is a function of age, autonomy, and income.
- The likelihood:
- Satisfactioni ~ Normal(μi, τ)
- μi = β0 + β1Agei + β2Autonomyi + β3Incomei
- The priors:
- βj ~ N(0, .001) for all j, and τ ~ Gamma(.01, .01)
21. WinBUGS Code
model {
  # define the likelihood
  for (i in 1:68) {
    satisfaction[i] ~ dnorm(mu[i], tau)
    mu[i] <- b[1]*age[i] + b[2]*autonomy[i] + b[3]*income[i] + b[4]
  }
  # define the priors
  for (j in 1:4) {
    b[j] ~ dnorm(0, .001)
  }
  tau ~ dgamma(0.0001, 0.0001)
}
22. Bayesian Path Analysis
- Path analysis is a method that purports to examine causal effects for systems of equations. The causal models are assumed to look something like this:
[Path diagram: Variables 1-5, with Variable 2 affecting Variable 5 both directly and through Variables 3 and 4.]
- A regression model of the effects of Variables 2-4 on Variable 5 will often provide estimates of the regression coefficients with desirable properties, but may understate, for example, the effect of Variable 2 on Variable 5, since Variable 2 influences outcomes directly and through its effects on Variables 3 and 4.
- Basic methodological approach: estimate a separate regression for each variable that is dependent at some point in the system of equations, with every variable standardized. An independent variable's total effect on a particular dependent variable is the sum of its effects along all paths, as illustrated below.
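- For example, writing p(j→k) for the standardized coefficient on Variable j in the regression for Variable k, the total effect of Variable 2 on Variable 5 in the diagram above would be

  \[ p(2 \to 5) + p(2 \to 3)\,p(3 \to 5) + p(2 \to 4)\,p(4 \to 5) \]

  the direct effect plus the indirect effects through Variables 3 and 4; this is the same bookkeeping as the TotAge calculation on slide 25.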
23. Path Analysis of Job Satisfaction
- Theory:
- Age is an unmoved mover.
- Autonomyi = a0 + a1Agei
- Incomei = b0 + b1Agei + b2Autonomyi
- Satisfactioni = c0 + c1Agei + c2Autonomyi + c3Incomei
- Probability model:
- Autonomyi ~ Norm(a1 + a2Agei, τautonomy)
- aj ~ Norm(0, .001) for j = 1, 2, and τautonomy ~ Gamma(.001, .001)
- Incomei ~ Norm(b1 + b2Agei + b3Autonomyi, τincome)
- bj ~ Norm(0, .001) for j = 1, 2, 3, and τincome ~ Gamma(.001, .001)
- Satisfactioni ~ Norm(c1 + c2Agei + c3Autonomyi + c4Incomei, τsatisfaction)
- cj ~ Norm(0, .001) for j = 1, 2, 3, 4, and τsatisfaction ~ Gamma(.001, .001)
24. WinBUGS Implementation of Path Analysis
model {
  for (i in 1:68) {   # specify the likelihood
    Autonomy[i] ~ dnorm(muaut[i], tauaut)
    muaut[i] <- a[1] + a[2]*Age[i]
    Income[i] ~ dnorm(muinc[i], tauinc)
    muinc[i] <- b[1] + b[2]*Age[i] + b[3]*Autonomy[i]
    Satisfaction[i] ~ dnorm(musat[i], tausat)
    musat[i] <- c[1] + c[2]*Age[i] + c[3]*Autonomy[i] + c[4]*Income[i]
  }
  # specify priors
  for (j in 1:2) { a[j] ~ dnorm(0, .001) }
  for (j in 1:3) { b[j] ~ dnorm(0, .001) }
  for (j in 1:4) { c[j] ~ dnorm(0, .001) }
  tauaut ~ dgamma(0.001, 0.001)
  tauinc ~ dgamma(0.001, 0.001)
  tausat ~ dgamma(0.001, 0.001)
}
25. WinBUGS Implementation of Path Analysis (standardized variables)
model {
  for (i in 1:68) {   # specify the likelihood
    # standardize each variable (note: WinBUGS will not accept a node that is both
    # defined by <- and given a distribution, so in practice the standardized
    # variables would be computed in the data before fitting; they are shown
    # in-model here to keep the slide self-contained)
    Oldness[i] <- (Age[i] - mean(Age[])) / sd(Age[])
    Aut[i] <- (Autonomy[i] - mean(Autonomy[])) / sd(Autonomy[])
    Aut[i] ~ dnorm(muaut[i], tauaut)
    muaut[i] <- a[1] + a[2]*Oldness[i]
    Inc[i] <- (Income[i] - mean(Income[])) / sd(Income[])
    Inc[i] ~ dnorm(muinc[i], tauinc)
    muinc[i] <- b[1] + b[2]*Oldness[i] + b[3]*Aut[i]
    Sat[i] <- (Satisfaction[i] - mean(Satisfaction[])) / sd(Satisfaction[])
    Sat[i] ~ dnorm(musat[i], tausat)
    musat[i] <- c[1] + c[2]*Oldness[i] + c[3]*Aut[i] + c[4]*Inc[i]
  }
  # c[2] is the direct effect of age; c[3]*a[2] is the effect of age through
  # autonomy; c[4]*b[2] is the effect of age through income; and c[4]*b[3]*a[2]
  # is the effect of age through autonomy through income
  TotAge <- c[2] + c[3]*a[2] + c[4]*b[2] + c[4]*b[3]*a[2]
}