Title: Lecture 11: Bayesian Regression with conjugate and non-conjugate priors
1. Lecture 11. Bayesian Regression with conjugate and non-conjugate priors
- Set-up of the Bayesian Regression Model
- The standard improper non-informative prior
- Conjugate Prior Analysis
2. A really, really fast review of multiple regression
- Many studies concern the relationship between two or more observable quantities.
- How do changes in quantity y (the dependent variable) vary as a function of another quantity x (the independent variable)?
- Regression models allow us to examine the conditional distribution of y given x, parameterized as p(y | β, x), when the n observations (yi, xi) are exchangeable.
- The normal linear model arises when the distribution of y given X is normal with mean equal to a linear function of X: E(yi | β, X) = β1X1i + ... + βkXki for i = 1,...,n, where X1 is a vector of ones.
- The ordinary linear regression model arises when the variance of y given X and β is assumed to be constant over all observations.
- In other words, we have an ordinary linear regression model when
- yi ~ N(β1X1i + ... + βkXki, σ²) for i = 1,...,n
3. More review of regression
- If yi ~ N(β1X1i + ... + βkXki, σ²), then it is well known that the ordinary least squares estimates and the maximum likelihood estimates of the parameters β1,...,βk are equivalent.
- If β = (β1,...,βk)ᵀ, then the frequentist estimate of β is
- β̂ = (XᵀX)⁻¹Xᵀy
- We know by the Gauss-Markov theorem that this estimate is BLUE.
- The frequentist estimate of σ² is s² = (y − Xβ̂)ᵀ(y − Xβ̂)/(n − k)
- The uncertainty about the quantities β is summarized by the regression coefficients' standard errors, which are the square roots of the diagonal of the matrix Var(β̂) = s²(XᵀX)⁻¹
- Finally, we know that if Vi is the ith diagonal element of (XᵀX)⁻¹, then (β̂i − 0)/(s Vi^(1/2)) ~ t with n − k degrees of freedom.
- This statistic forms the basis for our hypothesis tests.
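To make these computations concrete, here is a minimal Python sketch using numpy. The data are simulated and the dimensions (n = 50, k = 3) are invented for illustration; they are not part of the lecture.

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated data (hypothetical): n = 50 observations, k = 3 predictors,
    # including the column of ones X1.
    n, k = 50, 3
    X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
    beta_true = np.array([1.0, 2.0, -0.5])
    y = X @ beta_true + rng.normal(scale=1.5, size=n)

    # OLS / maximum likelihood estimate: beta_hat = (X'X)^{-1} X'y
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y

    # Frequentist variance estimate: s^2 = (y - X b)'(y - X b) / (n - k)
    resid = y - X @ beta_hat
    s2 = resid @ resid / (n - k)

    # Standard errors and t-statistics: se_i = s * sqrt(V_i), t_i = beta_hat_i / se_i
    se = np.sqrt(s2 * np.diag(XtX_inv))
    t_stats = beta_hat / se
    print(beta_hat, se, t_stats)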
4. The Bayesian Setup
- For the normal linear model, we have
- yi ~ N(μi, σ²) for i = 1,...,n
- where μi is just shorthand for the expression
- μi = β0 + β1X1i + ... + βkXki
- The object of statistical inference is the posterior distribution of the parameters β0,...,βk and σ².
- By Bayes' Rule, we know that this is simply
- p(β0,...,βk, σ² | y, X) ∝ p(β0,...,βk, σ²) × Πi p(yi | μi, σ²)
5. Bayesian regression with standard non-informative priors
- By Bayes' Rule, the posterior distribution is
- p(β0,...,βk, σ² | y, X) ∝ p(β0,...,βk, σ²) × Πi p(yi | μi, σ²)
- To make inferences about the regression coefficients, we obviously need to choose a prior distribution for (β, σ²).
- The standard non-informative prior distribution is uniform on (β, log σ²), which is equivalent to
- p(β, σ²) ∝ σ⁻²
- This prior is a good choice for statistical models when you have many data points and only a few parameters.
- Why? Because if you have a lot of data and few parameters, the likelihood function is very sharply peaked, which means that the likelihood (the data) will dominate posterior inferences. With small sample sizes or a lot of parameters, prior distributions or hierarchical models become more important for the analysis.
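As an illustration of why the data dominate, here is a minimal Python sketch (not part of the lecture) of the unnormalized log posterior under this prior; for large n the likelihood term dwarfs the −log σ² contribution of the prior.

    import numpy as np

    def log_posterior(beta, sigma2, X, y):
        """Unnormalized log posterior for the normal linear model
        under the improper prior p(beta, sigma^2) ∝ sigma^{-2}."""
        n = len(y)
        resid = y - X @ beta
        log_prior = -np.log(sigma2)  # log of sigma^{-2}
        log_lik = (-0.5 * n * np.log(2 * np.pi * sigma2)
                   - resid @ resid / (2 * sigma2))
        return log_prior + log_lik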
6. Bayesian regression with non-informative priors
- If y | β, σ², X ~ N(Xβ, σ²I) and p(β, σ²) ∝ σ⁻², then it follows that the conditional posterior distribution p(β | σ², data) can be written
- p(β | σ², data) = Multivariate Normal(β̂, σ²(XᵀX)⁻¹), where β̂ = (XᵀX)⁻¹Xᵀy
- This follows by completing the square, as we have seen over and over again when dealing with the normal distribution.
- The posterior distribution of σ² can be written as
- p(σ² | data) = Scaled Inv-χ²(n − k, s²), where s² = (y − Xβ̂)ᵀ(y − Xβ̂)/(n − k)
- To make inferences about the marginal posterior of β, we can either
- 1) take repeated samples from the posterior of σ² and then from β | σ², like we did when we studied the normal model with unknown mean and variance, or
- 2) integrate out σ² from the conditional posterior for β and find that the marginal distribution of β is a multivariate t distribution:
- p(β | data) = multivariate t with n − k degrees of freedom, center β̂, and scale s²(XᵀX)⁻¹
- Notice the close correspondence with the classical results. The key difference is the interpretation of the standard errors.
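Approach 1) is straightforward to implement directly. Below is a minimal Python sketch, assuming only numpy, that draws σ² from its scaled inverse-χ² marginal and then β from its conditional multivariate normal:

    import numpy as np

    def posterior_samples(X, y, n_draws=1000, seed=1):
        """Draws from p(beta, sigma^2 | data) under p(beta, sigma^2) ∝ sigma^{-2}."""
        rng = np.random.default_rng(seed)
        n, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        resid = y - X @ beta_hat
        s2 = resid @ resid / (n - k)

        # sigma^2 | data ~ Scaled-Inv-chi^2(n - k, s^2), i.e.
        # (n - k) s^2 / sigma^2 ~ chi^2(n - k)
        sigma2 = (n - k) * s2 / rng.chisquare(n - k, size=n_draws)

        # beta | sigma^2, data ~ N(beta_hat, sigma^2 (X'X)^{-1})
        betas = np.array([rng.multivariate_normal(beta_hat, sig2 * XtX_inv)
                          for sig2 in sigma2])
        return betas, sigma2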
7. Congdon Example 4.1
- In Example 4.1, Congdon considers a peculiar example concerning a method for weighing small objects, where the method has an error that is assumed to be normal with mean zero.
- The dependent variable is the recorded weight, and the independent variables are dummy variables for the presence or absence of object A and object B: X1 = 1 if A is present and 0 otherwise; X2 = 1 if B is present and 0 otherwise. Both objects may be weighed together.
- Thus, yi ~ N(β1X1i + β2X2i, σ²) (note: no intercept)
- β1 = weight of object A and β2 = weight of object B
- Congdon adopts the reference prior p(β1, β2, σ²) ∝ σ⁻²
8. WinBUGS Implementation of Example 4.1
    model {
        # this set of statements generates the sample variance s2
        for (i in 1:18) {
            yhat[i] <- b[1] * X[i, 1] + b[2] * X[i, 2]
            e[i] <- y[i] - yhat[i]
            e2[i] <- pow(e[i], 2)
        }
        s2 <- sum(e2[]) / 16        # this equals sum(e^2) / (n - k)
        # mean of the regression coefficients
        for (i in 1:2) {
            b[i] <- inprod(XTX.INV[i, ], Xy[])
            Xy[i] <- inprod(X[, i], y[])
            for (j in 1:2) {
                Prec[i, j] <- XTX[i, j] / s2
                Disp[i, j] <- s2 * XTX.INV[i, j]
                XTX[i, j] <- inprod(X[, i], X[, j])
            }
        }
        XTX.INV[1:2, 1:2] <- inverse(XTX[, ])
    }
- Note: WinBUGS now has a different command for matrix inversion than the one used by Congdon.
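For readers without WinBUGS, the same closed-form quantities can be cross-checked directly. The Python sketch below uses an invented design (each of the three weighing patterns repeated six times) and invented true weights as stand-ins, since Congdon's 18 observations are not reproduced here:

    import numpy as np

    rng = np.random.default_rng(4)

    # Hypothetical stand-in for Congdon's data: 18 weighings of objects A and B,
    # alone or together, with invented true weights 5 and 3.
    X = np.tile(np.array([[1, 0], [0, 1], [1, 1]]), (6, 1)).astype(float)
    y = X @ np.array([5.0, 3.0]) + rng.normal(scale=0.2, size=18)

    n, k = X.shape                  # n = 18, k = 2, so n - k = 16
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y           # posterior mean of (beta1, beta2)
    e = y - X @ b
    s2 = e @ e / (n - k)            # the s2 computed in the BUGS code
    Disp = s2 * XtX_inv             # the posterior dispersion matrix Disp
    print(b, s2, Disp)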
9. Bayesian Model Evaluation
- Model evaluation proceeds just like it would in the traditional OLS framework. Models are largely evaluated based on their fitted (predicted) values.
- In this respect, Bayesian model evaluation is easy.
- The fitted value for an observation i is a random draw from a t distribution with mean Xiβ̂ and variance s²(1 + Xi(XᵀX)⁻¹Xiᵀ) on n − k degrees of freedom. In most cases, you can simply use Xiβ̂ as the fitted value and ignore this additional uncertainty.
- Other diagnostics that you should use are plots of errors versus covariates, histograms of the error term, etc., to make sure that the assumptions of the normal linear model hold.
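Here is a minimal Python sketch of drawing fitted values with this predictive uncertainty, following the t form described above (x_new, n_draws, and the seed are illustrative choices, not from the lecture):

    import numpy as np

    def predictive_draws(X, y, x_new, n_draws=1000, seed=2):
        """Predictive draws for a new row x_new from the t distribution with
        mean x'beta_hat, variance s^2 (1 + x'(X'X)^{-1} x), n - k d.o.f."""
        rng = np.random.default_rng(seed)
        n, k = X.shape
        XtX_inv = np.linalg.inv(X.T @ X)
        beta_hat = XtX_inv @ X.T @ y
        resid = y - X @ beta_hat
        s2 = resid @ resid / (n - k)
        center = x_new @ beta_hat
        scale = np.sqrt(s2 * (1 + x_new @ XtX_inv @ x_new))
        return center + scale * rng.standard_t(n - k, size=n_draws)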
10. Comment on the regression setup
- There are two subtle points regarding the Bayesian regression setup.
- First, a full Bayesian model includes a distribution for the independent variables X, say p(X | ψ). Therefore, we have a joint likelihood p(X, y | ψ, β, σ²) and a joint prior p(ψ, β, σ²). The fundamental assumption of the normal linear model is that ψ and (β, σ²) are independent in their prior distribution, such that the posterior distribution factors into p(ψ, β, σ² | X, y) = p(β, σ² | X, y) p(ψ | X, y)
- As a result, p(β, σ² | X, y) ∝ p(β, σ²) p(y | β, σ², X)
- Second, when we set up our probability model, we are implicitly conditioning on a model, call it H, which represents our beliefs about the data-generating process. Thus,
- p(β, σ² | X, y, H) ∝ p(β, σ² | H) p(y | β, σ², X, H)
- It is important to keep in mind that our inferences depend on H, and this is equally true for the frequentist perspective, where results can depend on the choice of likelihood function, covariates, etc.
11. Conjugate priors and the normal linear model
- Suppose that instead of an improper prior, we decide to use the conjugate prior.
- For the normal regression model, the conjugate prior distribution for p(β0,...,βk, σ²) is the normal-inverse-gamma distribution.
- We've seen this distribution before when we studied the normal model with unknown mean and variance. We know that this distribution can be factored such that
- p(β0,...,βk, σ²) = p(β0,...,βk | σ²) p(σ²)
- p(β0,...,βk | σ²) = Multivariate Normal(β_prior, Σ_prior),
- where Σ_prior is the conditional prior covariance matrix for the βs,
- and p(σ²) = Inverse-Gamma(a_prior, b_prior).
12. Conjugate priors and the normal linear model
- If we use a conjugate prior, the posterior distribution will have the same form as the prior. Thus, the posterior distribution will also be normal-inverse-gamma. If we integrate out σ², the marginal for β will be a multivariate t distribution.
- The posterior mean of the coefficients takes the form
- β_posterior = (Σ⁻¹ + XᵀX)⁻¹ (Σ⁻¹β_prior + XᵀXβ̂)
- Notice that the coefficients are essentially a weighted average of the prior coefficients described by β_prior and the standard OLS estimates β̂. The weights are provided by the conditional prior precision Σ⁻¹ and the data XᵀX. This should make clear that as we increase our prior precision (decrease our prior variance) for β, we place greater posterior weight on our prior beliefs relative to the data.
- Note: Zellner (1971) treats β_prior and the conditional prior variance Σ in the following way. Suppose you have two data sets, (y1, X1) and (y2, X2). He sets β_prior equal to the posterior mean from a regression analysis of (y1, X1) with the improper prior 1/σ², and sets the prior precision Σ⁻¹ equal to X1ᵀX1.
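Here is a minimal Python sketch of this conjugate update, under the common parameterization β | σ² ~ N(β_prior, σ²Σ) and σ² ~ Inverse-Gamma(a_prior, b_prior); the scaling of the prior covariance by σ² is an assumption of this sketch. The returned a_post and b_post are the posterior inverse-gamma parameters discussed on the next slide.

    import numpy as np

    def conjugate_update(X, y, beta_prior, Sigma_inv, a_prior, b_prior):
        """Normal-inverse-gamma conjugate update for the normal linear model."""
        n, k = X.shape
        XtX = X.T @ X
        beta_hat = np.linalg.solve(XtX, X.T @ y)

        # Posterior precision and mean: a weighted average of the prior
        # mean (weight Sigma^{-1}) and the OLS estimate (weight X'X).
        prec_post = Sigma_inv + XtX
        beta_post = np.linalg.solve(prec_post,
                                    Sigma_inv @ beta_prior + XtX @ beta_hat)

        # Posterior inverse-gamma parameters.
        a_post = a_prior + n / 2
        resid = y - X @ beta_hat
        b_post = b_prior + 0.5 * (
            resid @ resid
            + (beta_post - beta_prior) @ Sigma_inv @ (beta_post - beta_prior)
            + (beta_hat - beta_post) @ XtX @ (beta_hat - beta_post)
        )
        return beta_post, prec_post, a_post, b_post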
13. Conjugate priors and the normal linear model
- To summarize our uncertainty about the coefficients, the variance-covariance matrix for β under the marginal posterior is (for a_posterior > 1)
- Var(β | data) = [b_posterior/(a_posterior − 1)] (Σ⁻¹ + XᵀX)⁻¹
- The posterior standard deviations can be taken from the square roots of the diagonal of this matrix.
- The main object of interest here is the expression for the posterior scale parameter:
- b_posterior = b_prior + ½[(y − Xβ̂)ᵀ(y − Xβ̂) + (β_posterior − β_prior)ᵀΣ⁻¹(β_posterior − β_prior) + (β̂ − β_posterior)ᵀXᵀX(β̂ − β_posterior)]
- The second term is the residual sum of squares, i.e., n times the maximum likelihood estimate of the variance. The third term states that our variance estimates will be greater if our prior values for the regression coefficients differ from their posterior values, especially if we indicate a great deal of confidence in our prior beliefs by assigning small variances in the matrix Σ. The fourth term states that our variance estimates for the regression coefficients will be greater if the standard OLS estimates differ from the posterior values, especially if the elements of XᵀX are large (that is, if the data are informative).
- A WinBUGS implementation would proceed in a manner akin to the earlier example.