Title: Lecture 12. Bayesian Regression with conjugate and convenient priors
1. Lecture 12. Bayesian Regression with conjugate and convenient priors
- Conjugate Prior Analysis
- Convenient Priors
2. The Bayesian Setup
- For the normal linear model, we have
- yi ~ N(μi, σ²) for i = 1, …, n
- where μi is shorthand for the expression
- μi = β0 + β1X1i + … + βkXki
- The object of statistical inference is the posterior distribution of the parameters β0, …, βk and σ².
- By Bayes Rule, we know that this is simply
- p(β0, …, βk, σ² | Y, X) ∝ p(β0, …, βk, σ²) × ∏i p(yi | μi, σ²)
3. Conjugate priors and the normal linear model
- Suppose that instead of an improper prior, we decide to use the conjugate prior.
- For the normal regression model, the conjugate prior distribution for p(β0, …, βk, σ²) is the normal-inverse-gamma distribution.
- We've seen this distribution before when we studied the normal model with unknown mean and variance. We know that this distribution can be factored such that
- p(β0, …, βk, σ²) = p(β0, …, βk | σ²) p(σ²)
- p(β0, …, βk | σ²) ~ NMV(Bprior, Σprior),
- where Σprior is the prior covariance matrix for the βs,
- and p(σ²) ~ Inverse-Gamma(aprior, bprior).
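- In the usual parameterization of this family (an assumption made explicit here, since the slide does not state it), the conditional prior covariance scales with σ²:

  \[ p(B \mid \sigma^2) = N(B \mid B_{prior},\ \sigma^2 \Sigma_{prior}), \qquad \sigma^2 \sim \mathrm{Inverse\text{-}Gamma}(a_{prior}, b_{prior}) \]

  On this convention the conditional prior precision Σprior⁻¹ and the data cross-product XᵀX are measured on the same scale, which is why they appear as comparable weights on the next slide.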
4. Conjugate priors and the normal linear model
- If we use a conjugate prior, then the posterior distribution will have the same form as the prior. Thus, the posterior distribution will also be normal-inverse-gamma. If we integrate out σ², the marginal posterior for B will be a multivariate t-distribution.
- Notice that the posterior coefficients are essentially a weighted average of the prior coefficients described by Bprior and the standard OLS estimates B̂. The weights are provided by the conditional prior precision Σ⁻¹ and the data cross-product XᵀX. This should make clear that as we increase our prior precision (decrease our prior variance) for B, we place greater posterior weight on our prior beliefs relative to the data.
- Note: Zellner (1971) treats Bprior and the conditional prior variance Σ in the following way. Suppose you have two data sets, (Y1, X1) and (Y2, X2). He sets Bprior equal to the posterior mean from a regression analysis of (Y1, X1) with the improper prior 1/σ², and sets the conditional prior precision Σ⁻¹ equal to X1ᵀX1.
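- Written out (a standard result for this conjugate family, with Σ the conditional prior covariance and B̂ = (XᵀX)⁻¹XᵀY the OLS estimate):

  \[ B_{post} = (\Sigma^{-1} + X^{\top}X)^{-1}(\Sigma^{-1} B_{prior} + X^{\top}X \hat{B}) \]

  As Σ⁻¹ grows relative to XᵀX, Bpost moves toward Bprior; as the data accumulate, XᵀX dominates and Bpost approaches B̂.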
5. Conjugate priors and the normal linear model
- To summarize our uncertainty about the coefficients, we use the posterior variance-covariance matrix for B; the posterior standard deviations can be taken from the square root of its diagonal.
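- A standard way to write this matrix under the normal-inverse-gamma updating above (with apost = aprior + n/2; the four summands of bpost are the four terms discussed below):

  \[ \mathrm{Var}(B \mid Y, X) = \frac{b_{post}}{a_{post} - 1}\,(\Sigma^{-1} + X^{\top}X)^{-1} \]

  \[ b_{post} = b_{prior} + \tfrac{1}{2}\big[(Y - X\hat{B})^{\top}(Y - X\hat{B}) + (B_{post} - B_{prior})^{\top}\Sigma^{-1}(B_{post} - B_{prior}) + (\hat{B} - B_{post})^{\top}X^{\top}X(\hat{B} - B_{post})\big] \]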
The first term is simply the prior value bprior. The second term involves the residual sum of squares, which is n times the maximum likelihood estimate of the variance. The third term states that our variance estimates will be greater if our prior values for the regression coefficients differ from their posterior values, especially if we indicate a great deal of confidence in our prior beliefs by assigning small variances in the matrix Σ. The fourth term states that our variance estimates for the regression coefficients will be greater if the standard OLS estimates differ from the posterior values, especially if XᵀX is large.
WinBUGS implementation would proceed in a manner akin to the earlier example.
6. Example One: Party Activists and Partisan Polarization
- Parties reward their core supporters with policy in exchange for activists' assistance during the campaign.
- Research design: regression analysis by party.
- Dependent variable: Party-in-Government Ideology - the party's median DW-Nominate score for the 93rd through the 107th Congress.
- Independent variable: Mean Party Activist Ideology - average response to the NES 7-point liberal-conservative ideology scale among a party's identifiers who were active in the campaign.
- Independent variable: Mean Party Non-Activist Ideology - average response to the NES 7-point liberal-conservative ideology scale among a party's identifiers who were not active in the campaign.
7. Classical OLS Estimates
[Table of OLS estimates not reproduced.]
* Denotes statistical significance in a one-tailed test at p < .10 (n = 15). ** Denotes statistical significance in a one-tailed test at p < .05. *** Denotes statistical significance in a one-tailed test at p < .025.
8. Is there a fairer test?
- With the small sample size, it is difficult to say conclusively that the parties are not following their activists.
- To improve the statistical power of the test, we could wait.
- Someone more clever could come up with a better research design.
- We could go the Bayesian route and incorporate prior beliefs about the data-generating process into the model.
9. The probability model (conjugate prior analysis)
Likelihood: Party Ideologyi ~ iid Normal(μi, σ²), where μi = β0 + β1 Activist Ideologyi + β2 Non-Activist Ideologyi
Priors: (β0, β1, β2) | σ² ~ Normal(BParty, Σ), where BDem = (-100, .099, .234)ᵀ, BRep = (-100, .189, .301)ᵀ, and Σ = Diag(10000, prior var(β1), 1); σ² ~ Inv-Gamma(1, 1)
We will vary prior var(β1) from 1 down to a small number to see how strong our prior beliefs have to be to find a statistically significant posterior value for β1.
* Based on the bivariate regressions with significant Activist coefficients.
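- A WinBUGS sketch of this conjugate setup for the Democratic party (an illustration rather than the lecture's actual code: the data names dem.dw, act, and nonact are placeholders, priorvar1 would be supplied as data, and the conditional prior precisions are made proportional to tau so that the joint prior is normal-inverse-gamma):

  model {
    for (i in 1:15) {
      dem.dw[i] ~ dnorm(mu[i], tau)
      mu[i] <- b[1] + b[2]*act[i] + b[3]*nonact[i]
    }
    # conditional prior for B given sigma^2: precision proportional to tau = 1/sigma^2
    b[1] ~ dnorm(-100, p[1])
    b[2] ~ dnorm(0.099, p[2])
    b[3] ~ dnorm(0.234, p[3])
    p[1] <- tau / 10000
    p[2] <- tau / priorvar1   # the prior variance we vary from 1 downward
    p[3] <- tau / 1
    tau ~ dgamma(1, 1)        # i.e., sigma^2 ~ Inverse-Gamma(1, 1)
  }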
10. [Results table not reproduced.] * Significant at p < .1, one-tailed test
11. [Results table not reproduced.] * Significant at p < .1, one-tailed test
12. Example 2: Western and Jackman (1994)
- What explains cross-national variation in union density?
- Union density is defined as the percentage of the work force that belongs to a labor union.
- Competing theories:
- Wallerstein: union density depends on the size of the civilian labor force.
- Stephens: union density depends on industrial concentration.
- Note: these two predictors correlate at -.92.
- Control variable: presence of a pro-labor government.
- Sample: n = 20 countries with a continuous history of democracy since World War II.
13. Results with non-informative priors [table not reproduced]
14. Justification for the Bayesian Approach
- Wallerstein and Stephens reach an empirical impasse: because of the small sample size and the multicollinear predictors, they are not able to adjudicate between the two theories.
- The incorporation of prior information provides additional structure to the data, which helps to uniquely identify the two coefficients.
- Priors can be developed as equivalent to prior data sets, inflating the de facto n.
- The data set contains all available observations from a population of interest; it is not a random sample. More generally, cross-national data sets are not generated by a repeatable data-generating process.
- Frequentist inference about a statistic (e.g., a regression coefficient) is obtained through the assumption that the process generating the data could be repeated a large number of times.
- Specifically, frequentist inference is about the proportion of the time that, in the long run, realizations of this statistic will fall within some interval.
- If there is no long run, or possibility of repetition, then frequentist probabilistic summaries are not appropriate.
15. The probability model
- As best as I can tell (Western and Jackman don't specify the full probability model), we have
- union densityi ~ N(μi, σ²)
- μi = β0 + β1 Left Govti + β2 Labor Forcei + β3 Industrial Concentrationi
- βWallerstein ~ NMV((0, .3, -5, 0)ᵀ, Diag(100000, 0.15, 2.5, 100000))
- So, informative priors are chosen for Left Govt and Labor Force, while diffuse priors are chosen for the intercept and Industrial Concentration.
- βStephens ~ NMV((0, .3, 0, 10)ᵀ, Diag(100000, 0.15, 100000, 5))
- So, informative priors are chosen for Left Govt and Industrial Concentration, while diffuse priors are chosen for the intercept and Labor Force.
- I believe that σ² is assumed to be known.
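- With σ² treated as known, the multivariate normal prior is conjugate on its own, and the posterior has the standard closed form

  \[ B \mid Y \sim N\big((\Sigma_0^{-1} + \sigma^{-2} X^{\top}X)^{-1}(\Sigma_0^{-1} B_0 + \sigma^{-2} X^{\top}Y),\ (\Sigma_0^{-1} + \sigma^{-2} X^{\top}X)^{-1}\big) \]

  where B0 and Σ0 are the prior mean vector and covariance matrix given above (Wallerstein's or Stephens's).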
16. Results with Wallerstein's priors [table not reproduced]
17. Results with Stephens's priors [table not reproduced]
18. Final comments on Western and Jackman
- 1) Even with Stephens's priors, Wallerstein's hypothesis appears robust; however, the opposite relationship does not hold.
- 2) Western and Jackman reanalyze the data with the same prior means, but inflate the prior variances for the regression coefficients to see how sensitive the results are.
- 3) Western and Jackman report the Bayesian influence statistic, which describes the influence of the ith observation on the joint posterior distribution of the regression coefficients.
- This statistic is interesting because it shows how observations tend to become more influential with larger prior variances.
- Like the traditional Cook's distance and other measures of leverage, this statistic provides evidence about the effects of outliers on posterior inference.
19. Convenient, but non-conjugate priors
- In WinBUGS, we typically would not trick the program into implementing conjugate or improper priors.
- Instead, we would typically assume that our prior beliefs about the regression coefficients and the variance can be factored into separate distributions. Thus, p(β, τ) = p(β)p(τ).
- A common model assumes the following form for the likelihood:
- yi ~ N(μi, τ), where μi = β0 + β1X1i + … + βnXni for all i (following WinBUGS, the normal is parameterized here by its precision, τ = 1/σ²)
- The non-informative priors would be defined as follows:
- βj ~ N(0, .00001) for all j
- and τ ~ Gamma(.00001, .00001)
- WinBUGS implementation is straightforward, except that we may need to write out a list of initial values for our parameters, especially for τ. This is because when WinBUGS creates initial values from these priors, it is possible that the program will make implausible choices (e.g., τ = -10).
- See Congdon's code online and the WinBUGS help for examples.
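- For instance, an initial-values list for the model on the next slides might look like the following (a sketch; the node names must match those in the model):

  list(b = c(0, 0, 0, 0), tau = 1)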
20. Job Satisfaction: Congdon Example 4.3
- Theory: job satisfaction is a function of age, autonomy, and income.
- The likelihood:
- Satisfactioni ~ Normal(μi, τ)
- μi = β0 + β1Agei + β2Autonomyi + β3Incomei
- The priors:
- βj ~ N(0, .001) for all j, and τ ~ Gamma(.01, .01)
21. WinBUGS Code
model {
  # define the likelihood
  for (i in 1:68) {
    satisfaction[i] ~ dnorm(mu[i], tau)
    mu[i] <- b[1]*age[i] + b[2]*autonomy[i] + b[3]*income[i] + b[4]
  }
  # define the priors
  for (j in 1:4) {
    b[j] ~ dnorm(0, .001)
  }
  tau ~ dgamma(0.0001, 0.0001)
}
22. Bayesian Path Analysis
- Path analysis is a method that purports to examine causal effects for systems of equations. The causal models are assumed to look something like this:
[Path diagram: Variables 1-5, with Variable 2 affecting Variable 5 both directly and through Variables 3 and 4.]
- A regression model of the effects of Variables 2-4 on Variable 5 will often provide estimates of the regression coefficients with desirable properties, but may understate, for example, the effect of Variable 2 on Variable 5, since Variable 2 influences outcomes directly and through its effects on Variables 3 and 4.
- Basic methodological approach: estimate a separate regression for each variable that is dependent at some point in the system of equations, with every variable standardized. An independent variable's total effect on a particular dependent variable is the sum of its effects along all paths, as illustrated below.
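- For example, writing p(j→k) for the standardized coefficient on Variable j in the regression for Variable k, the total effect of Variable 2 on Variable 5 in the diagram above would be

  \[ p(2 \to 5) + p(2 \to 3)\,p(3 \to 5) + p(2 \to 4)\,p(4 \to 5) \]

  the direct effect plus the indirect effects through Variables 3 and 4; this is the same bookkeeping as the TotAge calculation on slide 25.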
23. Path Analysis of Job Satisfaction
- Theory:
- Age is an unmoved mover.
- Autonomyi = a0 + a1Agei
- Incomei = b0 + b1Agei + b2Autonomyi
- Satisfactioni = c0 + c1Agei + c2Autonomyi + c3Incomei
- Probability model:
- Autonomyi ~ Norm(a1 + a2Agei, τautonomy)
- aj ~ Norm(0, .001) for j = 1, 2, and τautonomy ~ Gamma(.001, .001)
- Incomei ~ Norm(b1 + b2Agei + b3Autonomyi, τincome)
- bj ~ Norm(0, .001) for j = 1, 2, 3, and τincome ~ Gamma(.001, .001)
- Satisfactioni ~ Norm(c1 + c2Agei + c3Autonomyi + c4Incomei, τsatisfaction)
- cj ~ Norm(0, .001) for j = 1, 2, 3, 4, and τsatisfaction ~ Gamma(.001, .001)
24. WinBUGS Implementation of Path Analysis
model {
  for (i in 1:68) {   # specify the likelihood
    Autonomy[i] ~ dnorm(muaut[i], tauaut)
    muaut[i] <- a[1] + a[2]*Age[i]
    Income[i] ~ dnorm(muinc[i], tauinc)
    muinc[i] <- b[1] + b[2]*Age[i] + b[3]*Autonomy[i]
    Satisfaction[i] ~ dnorm(musat[i], tausat)
    musat[i] <- c[1] + c[2]*Age[i] + c[3]*Autonomy[i] + c[4]*Income[i]
  }
  # specify priors
  for (j in 1:2) { a[j] ~ dnorm(0, .001) }
  for (j in 1:3) { b[j] ~ dnorm(0, .001) }
  for (j in 1:4) { c[j] ~ dnorm(0, .001) }
  tauaut ~ dgamma(0.001, 0.001)
  tauinc ~ dgamma(0.001, 0.001)
  tausat ~ dgamma(0.001, 0.001)
}
25. WinBUGS Implementation of Path Analysis (standardized variables)
model {
  for (i in 1:68) {   # specify the likelihood
    # standardize each variable (note: WinBUGS will not accept a node that is both
    # defined by <- and given a distribution, so in practice the standardized
    # variables would be computed in the data before fitting; they are shown
    # in-model here to keep the slide self-contained)
    Oldness[i] <- (Age[i] - mean(Age[])) / sd(Age[])
    Aut[i] <- (Autonomy[i] - mean(Autonomy[])) / sd(Autonomy[])
    Aut[i] ~ dnorm(muaut[i], tauaut)
    muaut[i] <- a[1] + a[2]*Oldness[i]
    Inc[i] <- (Income[i] - mean(Income[])) / sd(Income[])
    Inc[i] ~ dnorm(muinc[i], tauinc)
    muinc[i] <- b[1] + b[2]*Oldness[i] + b[3]*Aut[i]
    Sat[i] <- (Satisfaction[i] - mean(Satisfaction[])) / sd(Satisfaction[])
    Sat[i] ~ dnorm(musat[i], tausat)
    musat[i] <- c[1] + c[2]*Oldness[i] + c[3]*Aut[i] + c[4]*Inc[i]
  }
  # c[2] is the direct effect of age; c[3]*a[2] is the effect of age through
  # autonomy; c[4]*b[2] is the effect of age through income; and c[4]*b[3]*a[2]
  # is the effect of age through autonomy through income
  TotAge <- c[2] + c[3]*a[2] + c[4]*b[2] + c[4]*b[3]*a[2]
}