
Heteroskedasticity

- Outline
- 1) What is it?
- 2) What are the consequences for our Least Squares estimator when we have heteroskedasticity?
- 3) How do we test for heteroskedasticity?
- 4) How do we correct a model that has heteroskedasticity?

What is Heteroskedasticity?

Review the Gauss-Markov assumptions:

- Linear regression model: y = β1 + β2x + e
- The error term has a mean of zero: E(e) = 0, so E(y) = β1 + β2x
- The error term has constant variance: Var(e) = E(e²) = σ², a CONSTANT. In other words, we assume all the observations are equally reliable.
- The error term is not correlated with itself (no serial correlation): Cov(ei, ej) = E(eiej) = 0 for i ≠ j
- Data on X are not random and thus are uncorrelated with the error term: Cov(X, e) = E(Xe) = 0

The constant-variance condition, Var(e) = σ², is the assumption of a homoskedastic error.

A homoskedastic error is one that has constant variance. A heteroskedastic error is one that has a nonconstant variance.

Heteroskedasticity is more commonly a problem for cross-section data sets, although a time-series model can also have a non-constant variance.

This diagram shows a constant variance (our assumption) for the error term. The line shows E(y) (food expenditure) for any given X (income). Notice that a family making 500 is expected to deviate from its E(y) by the same amount as a family making 1500.

[Figure: densities f(y|x) centered on the line E(y|x), drawn at x = 500, 1000, and 1500; σ² takes the SAME value at each income level. Axes: y (food) against x (income).]

This diagram shows a non-constant variance for the error term that appears to increase as X increases. The variation in the Gates food budget from their average is greater than for the Correas.

[Figure: densities f(y|x) around E(y|x) at x = 500 (Correas), 1000 (Trump), and 1500 (Gates); σ² increases with income. Axes: y against x (income).]

What are the causes?

Direct:
- Scale effects
- Structural shifts
- Learning effects

Indirect:
- Omitted variables
- Outliers
- Parameter variation

Again, heteroskedasticity is more commonly a problem for cross-section data sets, although a time-series model can also have a non-constant variance.

What are the Implications for Least Squares?

- We have to ask: where did we use the assumption? Or, why was the assumption needed in the first place?
- We used the assumption in the derivation of the variance formulas for the least squares estimators, b1 and b2.
- For b2, the formula for Var(b2) was

Var(b2) = Σ(xt − x̄)² Var(et) / [Σ(xt − x̄)²]² = σ² / Σ(xt − x̄)²

BUT this last step uses the assumption that Var(et) = σt² is a constant σ². If this is not the case, then the formula is

Var(b2) = Σ(xt − x̄)² σt² / [Σ(xt − x̄)²]²

Therefore, if we ignore the problem of a heteroskedastic error and estimate the variance of b2 using the homoskedastic formula when in fact we should have used the heteroskedastic one, then our estimates of the variance of b2 are wrong. Any hypothesis tests or confidence intervals based on them will be invalid. Note, however, that the proof of unbiasedness, E(b2) = β2, did not use the assumption of a homoskedastic error. Therefore, a heteroskedastic error will not bias the coefficient estimates, but it will bias the estimates of their variances.

In other words, if there is heteroskedasticity:

- OLS estimators are still LINEAR and UNBIASED
- OLS estimators are NOT EFFICIENT
- The usual formulas give INCORRECT STANDARD ERRORS for OLS
- Any hypothesis tests or confidence intervals based on the usual formulas for the standard errors are WRONG

How Do We Test for a Heteroskedastic Error?

- 1) Visual inspection of the residuals
- Because we never observe actual values for the error term, we never know for sure whether it is heteroskedastic or not. However, we can run a least squares regression and examine the residuals to see if they show a pattern consistent with a non-constant variance.

For the food expenditure data, this regression yields residuals that can be plotted against the variable X (weekly income). It appears that the variation in the residuals increases with higher values of X, suggesting a heteroskedastic error.
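The same visual check can be put in numeric form. The Python sketch below is illustrative only: it uses simulated food-expenditure-style data (the coefficients, income range, and variance function are made up), not the actual dataset from these slides.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated data in the spirit of the food expenditure example:
# the error's standard deviation grows with income x (hypothetical numbers)
n = 40
x = np.linspace(500, 1500, n)
y = 40 + 0.13 * x + rng.normal(0, (x / 150) ** 2)

# Fit OLS and compute the residuals
X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ b

# Compare residual spread in the low-x and high-x halves:
# a clear increase is the "fanning out" pattern in numeric form
low_spread = resid[: n // 2].std()
high_spread = resid[n // 2 :].std()
print(low_spread, high_spread)
```

A scatter plot of `resid` against `x` would show the fanning-out pattern directly.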

- Formal tests for heteroskedasticity: there are many different tests that can be used for heteroskedasticity. We will look at 4 of them.
- 2) Goldfeld-Quandt Test
- a) Suppose we think that the error might be heteroskedastic. We examine the residuals and notice that the variance in the residuals appears to be larger for larger values of a (continuous) explanatory variable xj.
- Note that it is necessary to make some assumption about the form of the heteroskedasticity, that is, an assumption about how the variance of et changes. For the food expenditure problem, the residuals tell us that an increasing function of xt (weekly income) is a good candidate. Other models may have a variance that is a decreasing function of xt or is a function of some variable other than xt.

- The idea behind the Goldfeld-Quandt Test:
- We want to test whether there is heteroskedasticity of the kind that is proportional to xj.
- Sort the data in descending order by the variable xj that you think causes the heteroskedasticity, and then split the data in half, omitting a few of the middle observations.
- Run the regression on each half of the data.
- Conduct a formal hypothesis test, based on an examination of the SSE from each half, to decide whether or not there is a heteroskedastic error.
- If the error is heteroskedastic with a larger variance for the larger values of xt, then we should find

GQ = [SSElarge / (tlarge − K)] / [SSEsmall / (tsmall − K)] > 1

where SSElarge comes from the regression using the subset of large values of xt, which has tlarge observations, and SSEsmall comes from the regression using the subset of small values of xt, which has tsmall observations.

- Conducting the test:

H0: the error is homoskedastic, so that σt² = σ² for all t
H1: the error is heteroskedastic

It can be shown that the GQ statistic has an F-distribution with (tlarge − K) degrees of freedom in the numerator and (tsmall − K) degrees of freedom in the denominator. If GQ > Fc, we reject H0 and conclude that the error is heteroskedastic.
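The mechanics of these steps can be sketched in Python as a rough cross-check (this is not the SAS implementation from the slides; the dataset, coefficients, and split sizes below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def ols_sse(x, y):
    """Sum of squared residuals from a simple OLS regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    return float(resid @ resid)

# Simulated data whose error variance increases with x (hypothetical)
n = 44
x = rng.uniform(500, 1500, n)
y = 40 + 0.13 * x + rng.normal(0, (x / 500) ** 3)

# Sort descending by x, omit 4 middle observations, split into halves of 20
order = np.argsort(x)[::-1]
x_s, y_s = x[order], y[order]
k = 2                                        # parameters in each regression
sse_large = ols_sse(x_s[:20], y_s[:20])      # subset with large values of x
sse_small = ols_sse(x_s[-20:], y_s[-20:])    # subset with small values of x

gq = (sse_large / (20 - k)) / (sse_small / (20 - k))
# Compare GQ with the F(18, 18) 5% critical value, about 2.22
print(gq)
```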

Food Expenditure Example

This code sorts the data according to X, because we believe that the error variance is increasing in xt. The first DATA step keeps the first 20 observations, which are the observations with large values of xt; the second keeps the remaining 20 observations, which are the observations with small values of xt. Each subset gets its own regression.

proc sort data=food;
  by descending x;
run;

data food_large;
  set food;
  if _n_ le 20;
run;

proc reg data=food_large;
  bigvalues: model y = x;
run;

data food_small;
  set food;
  if _n_ ge 21;
run;

proc reg data=food_small;
  littlevalues: model y = x;
run;

The REG Procedure
Model: bigvalues
Dependent Variable: y

Analysis of Variance

                                Sum of        Mean
Source             DF          Squares      Square    F Value    Pr > F
Model               1       4756.81422  4756.81422       2.08    0.1663
Error              18            41147  2285.93938
Corrected Total    19            45904

Root MSE          47.81150    R-Square    0.1036
Dependent Mean   148.32250    Adj R-Sq    0.0538
Coeff Var         32.23483

Parameter Estimates

                      Parameter    Standard
Variable      DF       Estimate       Error    t Value    Pr > |t|
Intercept      1       48.17674    70.24191       0.69      0.5015
x              1        0.11767     0.08157       1.44      0.1663

The REG Procedure
Model: littlevalues
Dependent Variable: y

Analysis of Variance

                                Sum of        Mean
Source             DF          Squares      Square    F Value    Pr > F
Model               1       8370.95124  8370.95124      12.27    0.0025
Error              18            12284   682.45537
Corrected Total    19            20655

Root MSE          26.12385    R-Square    0.4053
Dependent Mean   112.30350    Adj R-Sq    0.3722
Coeff Var         23.26183

Parameter Estimates

                      Parameter    Standard
Variable      DF       Estimate       Error    t Value    Pr > |t|
Intercept      1       12.93884    28.96658       0.45      0.6604
x              1        0.18234     0.05206       3.50      0.0025

GQ = 2285.93938 / 682.45537 = 3.35 > Fc = 2.22 (see SAS) ⇒ Reject H0: the error is heteroskedastic.

- 3) Park Test
- This test is described in Gujarati (1995, p. 369). It proposes that the error variance is a log-log function of one (or more) explanatory variable(s), say X, i.e., of the form ln(σt²) = ln(σ²) + γ ln(Xt) + vt.
- Note that the relationship is NOT LINEAR, like before. Look at Figure 6.3(b) and Figure 6.3(c) in the book.
- Follow OLS estimation, and use the OLS estimated residuals êt in the auxiliary regression ln(êt²) = b0 + b1 ln(Xt) + ut.
- The test statistic is the t-ratio on the parameter estimate for b1. If the t-ratio shows that the estimated parameter b1 is significantly different from zero, then there is evidence of heteroskedasticity. Since this is an approximate test, it is appropriate to treat the test statistic as having an asymptotic normal distribution, so that at a 5% significance level the critical value is 1.96.
- The Park and Goldfeld-Quandt tests require precise beforehand knowledge of the cause of the heteroskedasticity: the xj variable(s) causing it AND its functional form. If you have such knowledge, then the Park and Goldfeld-Quandt tests are more powerful (less Type II error: less often do you accept a FALSE null compared to accepting a true alternative, an alternative that you specified) than the subsequent tests. The Park test is only asymptotically valid.
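The Park-style auxiliary regression can be sketched in Python (simulated data; the variance specification, coefficients, and sample size are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def ols(X, y):
    """OLS coefficients, residuals, and conventional standard errors."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ b
    n, k = X.shape
    s2 = float(resid @ resid) / (n - k)
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
    return b, resid, se

# Simulated data: the error standard deviation is a power function of x
n = 200
x = rng.uniform(500, 1500, n)
y = 40 + 0.13 * x + rng.normal(0, (x / 500) ** 2)

# Step 1: OLS on the original model, keep the residuals
X = np.column_stack([np.ones(n), x])
_, resid, _ = ols(X, y)

# Step 2: auxiliary regression ln(e_t^2) = b0 + b1 ln(x_t) + u_t
Z = np.column_stack([np.ones(n), np.log(x)])
b_aux, _, se_aux = ols(Z, np.log(resid ** 2))
t_ratio = b_aux[1] / se_aux[1]

# |t| > 1.96 at the 5% level is taken as evidence of heteroskedasticity
print(t_ratio)
```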

- 4) Breusch-Pagan Test: is there some variation in the squared residuals which can be explained by variation in some independent variables?
- Estimate the OLS regression and obtain the residuals.
- Use the squared residuals as the dependent variable in a secondary equation that includes the independent variables suspected of being related to the error term:

êt² = b0 + b1X1t + b2X2t + … + ut

- Test the joint hypothesis that the coefficients of ALL the Xs in the second regression are zero. (An F test of significance; use TEST in SAS. See Ch. 8.1 and 8.2.)
- Can also test nR² ~ χ²(df), where R² is the R-squared from the auxiliary regression and df = number of regressors (Xs) in the auxiliary regression.
- The Park and Goldfeld-Quandt tests require knowledge of the form of the heteroskedasticity, i.e., the particular functional form. If you have such knowledge, then the previous tests are more powerful (less Type II error: less often do you accept a FALSE null compared to accepting a true alternative, an alternative that you specified). The Breusch-Pagan test does not require knowledge of the functional form of the heteroskedasticity, but it still assumes that we know which variables cause it. It is also sensitive to deviations from normality.
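The nR² form of this test can be sketched as follows (Python, simulated data; a single suspect regressor and made-up coefficients are assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated data: the error variance depends on income (hypothetical numbers)
n = 200
inc = rng.uniform(500, 1500, n)
y = 40 + 0.13 * inc + rng.normal(0, (inc / 500) ** 2)

# Step 1: estimate the OLS regression and obtain the squared residuals
X = np.column_stack([np.ones(n), inc])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e2 = (y - X @ b) ** 2

# Step 2: regress the squared residuals on the suspect variable(s)
Z = np.column_stack([np.ones(n), inc])
g = np.linalg.lstsq(Z, e2, rcond=None)[0]
r2 = 1 - ((e2 - Z @ g) ** 2).sum() / ((e2 - e2.mean()) ** 2).sum()

# n * R^2 is compared with chi-square(df = regressors in auxiliary regression)
lm = n * r2
print(lm)   # compare with the chi-square(1) 5% critical value, 3.841
```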


- 5) White's Test: a variation of Breusch-Pagan, but using ALL the Xs.
- Estimate the OLS regression and obtain the residuals.
- Use the squared residuals as the dependent variable in a secondary equation that includes EVERY ONE of the explanatory variables, their squares, and all their pairwise cross products (i.e. x1x2, x1x3, and x2x3, but not x1x2x3). With regressors X and Z:

êt² = b0 + b1Xt + b2Zt + b3Xt² + b4Zt² + b5XtZt + ut

- Test the joint hypothesis that the coefficients of ALL the Xs in the second regression are zero. (An F test of significance; use TEST in SAS. See Ch. 8.1 and 8.2.)
- Can also test nR² ~ χ²(df), where R² is the R-squared from the auxiliary regression and df = number of regressors (Xs) in the auxiliary regression.
- This test does not assume knowledge of which variables cause the heteroskedasticity. If you have such knowledge, all the previous tests are more powerful (less Type II error). The White test is only asymptotically valid (it needs lots of data), and it is more commonly used now. SAS can do it automatically.

proc reg data=whatever;
  model whatever = whatever / spec;
run;
quit;
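For intuition about what this option tests, White's auxiliary regression can also be sketched by hand in Python (illustrative only, on simulated data with two made-up regressors):

```python
import numpy as np

rng = np.random.default_rng(3)

# Simulated model with two regressors; the variance rises with x (hypothetical)
n = 300
x = rng.uniform(1, 10, n)
z = rng.uniform(1, 10, n)
y = 2 + 1.5 * x - 0.5 * z + rng.normal(0, x)

# Step 1: OLS residuals
X = np.column_stack([np.ones(n), x, z])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e2 = (y - X @ b) ** 2

# Step 2: regress e^2 on ALL regressors, their squares, and the cross product x*z
Z = np.column_stack([np.ones(n), x, z, x ** 2, z ** 2, x * z])
g = np.linalg.lstsq(Z, e2, rcond=None)[0]
r2 = 1 - ((e2 - Z @ g) ** 2).sum() / ((e2 - e2.mean()) ** 2).sum()

lm = n * r2
print(lm)   # compare with the chi-square(5) 5% critical value, 11.07
```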

"I am running PROC REG with the ACOV and SPEC options to obtain heteroskedasticity-consistent ('White-corrected') test statistics. I need to collect these test statistics into a SAS dataset. The variance-covariance matrix that is output into the parameters dataset using the OUTEST and COVOUT options does not seem to be White-corrected. Any suggestions as to how I might pull out the test statistics? Thanks."

proc reg data=file1;
  model y = x / acov spec;
  ods output ParameterEstimates=the_parms
             AcovEst=the_acov
             SpecTest=the_spec;
run;

which will yield files named the_parms, the_acov, and the_spec with the tables from those sections of the output.

How Do We Correct for a Heteroskedastic Error?

- 1) Just redefine the variables (for example, use income per capita instead of income). This works sometimes.
- 2) Robust OLS estimation using the White standard errors. Earlier we saw that in the presence of heteroskedasticity, the correct formula for the variance of b2 is

Var(b2) = Σ(xt − x̄)² σt² / [Σ(xt − x̄)²]²

- So we just run OLS, and calculate the variance of the betas separately with the formula above. In this formula, we use the squared residual for each observation, êt², as the estimate of its variance σt²; these are called White's estimators of the error variance.
- Remember, the OLS parameter estimates are still UNBIASED: E(b) = true β.
- We will not do this by hand, though. Fortunately, White asymptotic covariance estimation can be performed with the ACOV option in SAS PROC REG. (Also explore PROC ROBUSTREG.)

proc reg data=thedata;
  model depvar = indepvars / acov;
run;
quit;

- HOWEVER: the variances are reported separately in the White var-cov section of the SAS output. BEWARE that the t statistics that SAS reports in the regular regression output are WRONG. So we have to calculate them manually, dividing estimate / sqrt(variance).
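What the ACOV option computes can be sketched by hand (Python, simulated data with made-up numbers): the sandwich formula with the squared residuals on the diagonal.

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulated heteroskedastic data (hypothetical numbers)
n = 100
x = rng.uniform(500, 1500, n)
y = 40 + 0.13 * x + rng.normal(0, x / 20)

X = np.column_stack([np.ones(n), x])
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

# White's heteroskedasticity-consistent covariance matrix:
# (X'X)^{-1} (X' diag(e_i^2) X) (X'X)^{-1}
xtx_inv = np.linalg.inv(X.T @ X)
meat = X.T @ (X * (e ** 2)[:, None])
acov = xtx_inv @ meat @ xtx_inv
robust_se = np.sqrt(np.diag(acov))

# Corrected t statistic: estimate divided by the robust standard error
t_b2 = b[1] / robust_se[1]
print(robust_se, t_b2)
```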

How Do We Correct for a Heteroskedastic Error?

- It is a pain to have to calculate the t statistics manually from the regression output. There is a way to make SAS do this for us too:

proc reg data=thedata;
  model depvar = firstX secondX thirdX / acov;
  test firstX = 0;
  test secondX = 0;
  test thirdX = 0;
run;
quit;

- This will provide us with the correct t-statistics and p-values for each of the regressors, so we do not have to calculate them manually.
- It is important to say that this only works for large samples (LOTS OF DATA!!!).

- 3) Generalized Least Squares (GLS)
- Idea: transform the model with a heteroskedastic error into a model with a homoskedastic error, then apply the method of least squares. This requires us to assume a specification for the error variance. As earlier, we will assume that the variance increases with xt:

yt = β1 + β2xt + et, where Var(et) = σt² = σ² xt

Transform the model by dividing every piece of it by the standard deviation of the error (up to the factor σ, i.e., by √xt):

yt/√xt = β1(1/√xt) + β2(xt/√xt) + et/√xt

This new model has an error term that is the original error term divided by the square root of xt. Its variance is constant:

Var(et/√xt) = Var(et)/xt = σ²xt/xt = σ²

This method is called Weighted Least Squares. It is more efficient than simply applying least squares to the untransformed model. Least squares gives equal weight to all observations; weighted least squares gives each observation a weight that is inversely related to its value of the square root of xt. Therefore, large values of xt, which we have assumed have a large variance, will get less weight than smaller values of xt when estimating the intercept and slope of the regression line.

We need to estimate the transformed model. This requires us to construct 3 new variables,

y*t = yt/√xt,  x*1t = 1/√xt,  x*2t = xt/√xt = √xt

and to estimate the model

y*t = β1 x*1t + β2 x*2t + e*t

Notice that it does NOT have an INTERCEPT, so use the /NOINT option in SAS.

It is possible to do this in SAS automatically using:

proc reg data=thedata;
  model depvar = indepvars / noint;
  weight variabletoweightby;
run;
quit;

/* or PROC MODEL or PROC GLM */

SAS code to perform weighted least squares (NOTE: look at 11.24 for another way to do it):

data whatever;
  set whatever;
  ystar  = y/sqrt(x);
  x1star = 1/sqrt(x);
  x2star = x/sqrt(x);
  output;
run;

proc reg data=whatever;
  foodgls: model ystar = x1star x2star / noint;  /* noint: run the model without an intercept */
run;
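The same transformation can be sketched in Python (simulated data; the true coefficients 40 and 0.13 and the income range are made up for the illustration):

```python
import numpy as np

rng = np.random.default_rng(5)

# Simulated model with var(e_t) = sigma^2 * x_t (here sigma^2 = 1)
n = 100
x = rng.uniform(500, 1500, n)
y = 40 + 0.13 * x + rng.normal(0, np.sqrt(x))

# Divide every piece of the model by sqrt(x_t)
ystar = y / np.sqrt(x)
x1star = 1 / np.sqrt(x)       # carries the intercept beta1
x2star = x / np.sqrt(x)       # carries the slope beta2

# Regress ystar on x1star and x2star with NO intercept (cf. /NOINT in SAS)
Xstar = np.column_stack([x1star, x2star])
b_wls = np.linalg.lstsq(Xstar, ystar, rcond=None)[0]
print(b_wls)   # estimates of [beta1, beta2]
```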


SAS code to test for heteroskedasticity another way:

/* We have the variables dep_var, inc and height from our dataset whatever. */
/* The following code just runs the tests. */
proc model data=whatever;
  parms a1 b1 b2;                  /* declares the parameters of a model; each parameter
                                      has a single value associated with it, which is the
                                      same for all observations */
  dep_var = a1 + b1*inc + b2*height;
  fit dep_var                      /* FIT estimates the model */
      / white pagan=(1 inc height) /* White's test, and Pagan's test on the vars inc and height */
      out=resid1 outresid;         /* output the residuals to an outside file */
run;
/* WHITE and PAGAN may also work with PROC REG; we may not necessarily have to use PROC MODEL. */

SAS code to perform weighted least squares another way:

proc model data=whatever;
  parms a1 b1 b2;
  inc2_inv = 1/inc2;       /* we create the weights; in this case they are 1/x_squared
                              because we are assuming the heteroskedasticity is of the
                              form sigma2_t = constant_sigma2 * X_jt2 */
  exp = a1 + b1*inc + b2*inc2;
  fit exp                  /* "fit exp" tells SAS to estimate just the dependent variable
                              exp; we could omit the "exp" and just write "fit" because
                              there is only one equation being fitted in this model */
      / white pagan=(1 inc inc2);
  weight inc2_inv;         /* tells SAS to weight the observations by inc2_inv */
run;

/* NOTE: the model above DOES have an intercept. This is because we assumed the form of the
   heteroskedasticity to be such that sigma depends on the SQUARE of the X. See also 11.26. */
/* NOTE II: WEIGHT vartoweightby works also with the PROC REG and PROC AUTOREG commands,
   right before the RUN statement. */

- If instead, the proportional heteroskedasticity is suspected to be of the form Var(et) = σt² = σ² xt²
- Then the transformed model would be yt/xt = β1(1/xt) + β2 + et/xt
- We could proceed by forming the variable x*1t = 1/xt (along with y*t = yt/xt), and then proceeding EXACTLY as we did before, with the model y*t = β1 x*1t + β2 + e*t

- Testing for heteroscedasticity: the White test in SAS
- The regression model is specified as yi = xi'b + ei, where the ei's are identically and independently distributed with E(ei) = 0 and Var(ei) = σ². If the ei's are not independent or their variances are not constant, the parameter estimates are unbiased, but the estimate of the covariance matrix is inconsistent. In the case of heteroscedasticity, the ACOV option provides a consistent estimate of the covariance matrix. If the regression data are from a simple random sample, the ACOV option produces the covariance matrix. This matrix is

(X'X)⁻¹ (X' diag(ei²) X) (X'X)⁻¹

where ei = yi − xi b.
- The SPEC option performs a model specification test. The null hypothesis for this test maintains that the errors are homoscedastic, independent of the regressors, and that several technical assumptions about the model specification are valid. For details, see theorem 2 and assumptions 1-7 of White (1980). When the model is correctly specified and the errors are independent of the regressors, the rejection of this null hypothesis is evidence of heteroscedasticity. In implementing this test, an estimator of the average covariance matrix (White 1980, p. 822) is constructed and inverted. The nonsingularity of this matrix is one of the assumptions in the null hypothesis about the model specification. When PROC REG determines this matrix to be numerically singular, a generalized inverse is used and a note to this effect is written to the log. In such cases, care should be taken in interpreting the results of this test.
- When you specify the SPEC option, tests listed in the TEST statement are performed with both the usual covariance matrix and the heteroscedasticity-consistent covariance matrix. Tests performed with the consistent covariance matrix are asymptotic. For more information, refer to White (1980).
- Both the ACOV and SPEC options can be specified in a MODEL or PRINT statement.