Regression Diagnostics (STAT611, Term I 09-10)

Transcript
1
Regression Diagnostics
  • Chapter 4
  • Contents
  • Residuals
  • Graphical Methods
  • Multicollinearity, Nonnormality,
    Heteroscedasticity, Autocorrelation
  • Measures of Influence

2
4.1 Introduction
  • The conditions required for the model must be
    checked. Violation of any conditions makes the
    inferences invalid.
  • Is the error variable normally distributed?
  • Is the error variance constant?
  • Are the errors independent?
  • Can we identify outliers?
  • Is multicollinearity (intercorrelation) a problem?

Draw a histogram of the residuals
Plot the residuals versus the time periods
3
4.2 Residuals
  • Ordinary least squares residuals:
    e_i = Y_i − Ŷ_i,
    where Ŷ_i is the ith fitted value
  • Studentized residuals:
    r_i = e_i / (σ̂ √(1 − p_ii))
  • Externally studentized residuals:
    r*_i = e_i / (σ̂_(i) √(1 − p_ii))

4
where p_ii is the ith diagonal element of the hat
matrix P = X(XᵀX)⁻¹Xᵀ, and σ̂_(i) is the estimate
of σ with the ith observation deleted.
4.2 Residuals
Once the studentized residuals are calculated,
the externally studentized residuals can be
calculated through the relationship
r*_i = r_i √((n − p − 2) / (n − p − 1 − r_i²)),
for a model with p predictors and an intercept.
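The course scripts are in R; purely as an illustration, here is a NumPy sketch of these quantities on made-up data (the design matrix, coefficients, and seed are all invented for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 2                                     # n observations, p predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # intercept column first
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

# Hat matrix P = X (X'X)^{-1} X' and its diagonal p_ii
P = X @ np.linalg.solve(X.T @ X, X.T)
p_ii = np.diag(P)

e = y - P @ y                                    # OLS residuals e_i = Y_i - Yhat_i
sigma2 = e @ e / (n - p - 1)                     # sigma-hat squared
r = e / np.sqrt(sigma2 * (1 - p_ii))             # studentized residuals

# externally studentized residuals via the relationship
# r*_i = r_i * sqrt((n - p - 2) / (n - p - 1 - r_i^2))
r_star = r * np.sqrt((n - p - 2) / (n - p - 1 - r**2))
```

The relationship lets you get the externally studentized residuals without refitting the model n times.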
5
4.3 Graphical Methods
  • "There is no single statistical tool that is as
    powerful as a well-chosen graph." Chambers et
    al. (1983)
  • "Eye-balling can give diagnostic insights no
    formal diagnostics will ever provide." Huber
    (1991)

6
Anscombe's Quartet: Four Data Sets Having the Same
Summary Statistics:
Mean(Y) = 7.501, Mean(X) = 9.0, Std(Y) = 2.031,
Std(X) = 3.32, Cor(Y, X) = 0.8, Ŷ = 3 + 0.5X, etc.
7
(No Transcript)
8
Graphical methods can be useful in many ways
  • Detect errors in the data (e.g., an outlying
    point may be the result of a typographical error),
  • Recognize patterns in the data (e.g., clusters,
    outliers),
  • Explore relationships among variables,
  • Discover new phenomena,
  • Confirm or negate assumptions,
  • Assess the adequacy of a fitted model,
  • Suggest remedial actions (e.g., transform the
    data),
  • Enhance numerical analysis in general.

9
Graphical methods can be classified into two
classes
  • Graphs before fitting the model. These are
    useful in correcting errors in the data and in
    selecting a model.
  • Graphs after fitting a model. These are
    particularly useful for checking the model
    assumptions and for assessing the goodness of the
    fit.

10
Functionality of common plots
  • Plot of Y vs. X_i, i = 1, …, p, to reveal the Y-X
    relationship.
  • Normal probability plot of studentized residuals
    for checking the normality assumption.
  • Scatter plots of studentized residuals against
    each of the predictor variables for checking
    linearity of Y-X relation, and constancy of error
    variance.

11
Functionality of common plots
  • Scatter plots of the studentized residuals versus
    the fitted values similar to the above.
  • Index plot of the studentized residuals for
    checking the independence assumption.
  • Matrix plot of the predictors for checking
    multicollinearity.

12
Example 4.1. Using R (lm() and plot()) to do the
following plots based on the motor inn data in
Example 3.1:
  • Plot Y vs., respectively, X1, X2, X3, X4, X5, X6.
  • Give a matrix plot of all X variables.
  • Plot the OLS residuals vs. X_i, i = 1, 2, …, 6.
  • Plot the OLS residuals vs. the fitted values.
  • Normal probability plot of the studentized residuals
    r (R command: qqnorm(r)).
  • Index plot of the studentized residuals.

13
Diagnostics: Multicollinearity
  • Example 4.2 Predicting house price (EX4-01.xls)
  • A real estate agent believes that a house selling
    price can be predicted using the house size,
    number of bedrooms, and lot size.
  • A random sample of 100 houses was drawn and data
    recorded.
  • Analyze the relationship among the four variables

14
Diagnostics: Multicollinearity
  • The proposed model is
    PRICE = β0 + β1 BEDROOMS + β2 H-SIZE + β3 LOTSIZE + ε

The model is valid overall, but no single variable is
significantly related to the selling price!
Why?
15
Diagnostics: Multicollinearity
  • Multicollinearity is found to be a problem.
  • Multicollinearity causes two kinds of
    difficulties:
  • The t statistics appear to be too small.
  • The b coefficients cannot be interpreted as
    slopes.

16
(No Transcript)
17
Remedying Violations of the Required Conditions
  • Nonnormality or heteroscedasticity can be
    remedied using transformations on the y variable.
  • The transformations can improve the linear
    relationship between the dependent variable and
    the independent variables.
  • Many computer software systems allow us to make
    the transformations easily.

18
Reducing Nonnormality by Transformations
  • A brief list of transformations:
  • Y′ = log Y (for Y > 0)
  • Use when σ_ε increases with Y, or
  • Use when the error distribution is positively
    skewed.
  • Y′ = Y²
  • Use when σ²_ε is proportional to E(Y), or
  • Use when the error distribution is negatively
    skewed.
  • Y′ = Y^(1/2) (for Y > 0)
  • Use when σ²_ε is proportional to E(Y).
  • Y′ = 1/Y
  • Use when σ²_ε increases significantly when Y
    increases beyond some critical value.
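A quick illustration of the first transformation on made-up data (not from the course examples): when the error spread grows with the level of Y, taking logs stabilizes it.

```python
import numpy as np

rng = np.random.default_rng(2)
mu = np.repeat([10.0, 100.0, 1000.0], 200)      # three groups of Y levels
y = mu * rng.lognormal(0, 0.3, mu.size)         # spread grows with the level

def group_sd(v):
    """Standard deviation of v within each level of mu."""
    return np.array([v[mu == m].std() for m in (10.0, 100.0, 1000.0)])

print(group_sd(y))          # sd grows roughly in step with the level
print(group_sd(np.log(y)))  # after Y' = log Y the sds are comparable
```

The same before/after comparison can be made on a residual plot: a funnel shape on the raw scale flattens out after the transformation.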

19
Durbin-Watson Test: Are the Errors
Autocorrelated?
  • This test detects first order autocorrelation
    between consecutive residuals in a time series
  • If autocorrelation exists, the error variables are
    not independent

d = Σ_{i=2}^{n} (e_i − e_{i−1})² / Σ_{i=1}^{n} e_i²,
where e_i is the residual at time i.
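The Durbin-Watson statistic d = Σ(e_i − e_{i−1})² / Σe_i² is simple to compute directly. A NumPy sketch (a hypothetical helper, not the course's R/Excel workflow):

```python
import numpy as np

def durbin_watson(e):
    """d = sum_{i=2}^n (e_i - e_{i-1})^2 / sum_{i=1}^n e_i^2."""
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# similar consecutive residuals -> d small (below 2);
# alternating residuals       -> d large (above 2)
print(durbin_watson([1.0, 1.1, 0.9, 1.0, 1.05]))
print(durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0]))
```

Values of d range from 0 to 4, with d near 2 indicating no first-order autocorrelation.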
20
Positive First Order Autocorrelation

(Plot: residuals vs. time)
Positive first order autocorrelation occurs when
consecutive residuals tend to be similar.
Then, the value of d is small (less than 2).
21
Negative First Order Autocorrelation
(Plot: residuals vs. time)
Negative first order autocorrelation occurs when
consecutive residuals tend to markedly differ.
Then, the value of d is large (greater than 2).
22
One-Tail Test for Positive First Order
Autocorrelation
  • If d < dL, there is enough evidence to show that
    positive first-order correlation exists
  • If d > dU, there is not enough evidence to show that
    positive first-order correlation exists
  • If d is between dL and dU, the test is
    inconclusive.

23
One-Tail Test for Negative First Order
Autocorrelation
  • If d > 4 − dL, negative first-order correlation
    exists
  • If d < 4 − dU, negative first-order correlation does
    not exist
  • If d falls between 4 − dU and 4 − dL, the test is
    inconclusive.

24
Two-Tail Test for First Order Autocorrelation
  • If d < dL or d > 4 − dL, first-order autocorrelation
    exists
  • If d falls between dL and dU, or between 4 − dU and
    4 − dL, the test is inconclusive
  • If d falls between dU and 4 − dU, there is no
    evidence of first-order autocorrelation
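The three decision rules can be written down directly. A sketch (hypothetical helper names; dL and dU still come from a Durbin-Watson table):

```python
def dw_one_tail_positive(d, dL, dU):
    """One-tail test for positive first-order autocorrelation."""
    if d < dL:
        return "positive autocorrelation"
    if d > dU:
        return "no evidence of positive autocorrelation"
    return "inconclusive"

def dw_one_tail_negative(d, dL, dU):
    """One-tail test for negative first-order autocorrelation."""
    if d > 4 - dL:
        return "negative autocorrelation"
    if d < 4 - dU:
        return "no evidence of negative autocorrelation"
    return "inconclusive"

def dw_two_tail(d, dL, dU):
    """Two-tail test for first-order autocorrelation."""
    if d < dL or d > 4 - dL:
        return "autocorrelation"
    if dU < d < 4 - dU:
        return "no evidence of autocorrelation"
    return "inconclusive"
```

With the Example 4.3 numbers below (d = 0.5931, dL = 1.10, dU = 1.54), the one-tail positive test signals positive autocorrelation.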

25
Testing the Existence of Autocorrelation, Example
  • Example 4.3 (EX4-03)
  • How does the weather affect the sales of lift
    tickets in a ski resort?
  • Data on the past 20 years' sales of tickets, along
    with the total snowfall and the average
    temperature during Christmas week in each year,
    were collected.
  • The model hypothesized was
  • TICKETS = β0 + β1 SNOWFALL + β2 TEMPERATURE + ε
  • Regression analysis yielded the following
    results

26
The Regression Equation: Assessment (I)
The model seems to be very poor:
  • R-square = 0.1200
  • It is not valid (significance F = 0.3373)
  • No variable is linearly related to Sales

27
Diagnostics: The Error Distribution
The histogram of the residuals suggests that
the errors may be normally distributed.
28
Diagnostics: Heteroscedasticity
29
Diagnostics: First Order Autocorrelation
The errors are not independent!
30
Diagnostics: First Order Autocorrelation
Using the computer (Excel):
Tools > Data Analysis > Regression (check the
residual option and then OK); Tools > Data
Analysis Plus > Durbin-Watson Statistic >
highlight the range of the residuals from the
regression run > OK.
Test for positive first-order autocorrelation: n
= 20, p = 2. From the Durbin-Watson table we have
dL = 1.10, dU = 1.54. The statistic is d = 0.5931.
Conclusion: because d < dL, there is sufficient
evidence to infer that positive first-order
autocorrelation exists.
31
The Modified Model: Time Included
The modified regression model (EX4-02mod.xls):
TICKETS = β0 + β1 SNOWFALL + β2 TEMPERATURE + β3 TIME + ε
  • All the required conditions are met for this
    model.
  • The fit of this model is high: R² = 0.7410.
  • The model is valid: significance F = .0001.
  • SNOWFALL and TIME are linearly related to
    ticket sales.
  • TEMPERATURE is not linearly related to ticket
    sales.

32
4.4. Leverage, Influence, and Outliers
  • Influential point: a point is influential if its
    deletion causes substantial changes in the fitted
    model (estimated coefficients, fitted values,
    t-tests, etc.).
  • Outliers in the response variable: observations
    with large standardized residuals are outliers in
    the response variable. A rule of thumb: larger
    than 3 sd away from the mean (zero).
  • Leverage value: p_ii, the ith diagonal element of
    the hat matrix P.
  • Outliers in the predictors: outliers in the
    predictors (the X-space) are defined based on the
    magnitude of p_ii.

33
4.4. Leverage, Influence, and Outliers
  • Specifically, if
  • p_ii > 2(p + 1)/n,
  • then the ith observation is an outlying
    observation with respect to the X variables.
  • This is because p_ii measures the distance of a
    point from the center of the X data. It is clearer
    in simple linear regression, where
    p_ii = 1/n + (x_i − x̄)² / Σ_j (x_j − x̄)².
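The rule p_ii > 2(p + 1)/n is easy to check numerically. A NumPy sketch on made-up data with one deliberately extreme x value:

```python
import numpy as np

def high_leverage(X):
    """Return hat-matrix diagonals p_ii and flags for p_ii > 2(p+1)/n.

    X must already include the intercept column, so it has p + 1 columns.
    """
    n, k = X.shape                               # k = p + 1
    P = X @ np.linalg.solve(X.T @ X, X.T)        # hat matrix
    p_ii = np.diag(P)
    return p_ii, p_ii > 2 * k / n

# simple linear regression with one x far from the rest
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 20.0])
X = np.column_stack([np.ones_like(x), x])
p_ii, flags = high_leverage(X)
print(p_ii.round(3), flags)   # only the last point is flagged
```

Note that a high-leverage point is an outlier in the X-space regardless of its Y value; whether it is also influential is the subject of Section 4.5.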

34
4.5. Measures of Influence
  • Let Ŷ_(i) and σ̂_(i) be the fitted values and the
    estimate of σ when we drop the ith observation.
  • Cook's Distance measures the influence of the
    ith observation by summarizing the differences
    between the fitted values obtained from the full
    data and the fitted values obtained by deleting
    the ith observation:
    C_i = Σ_j (Ŷ_j − Ŷ_j(i))² / ((p + 1) σ̂²).

35
4.5. Measures of Influence
  • Cook's Distance can be calculated through the
    relation
    C_i = (r_i² / (p + 1)) · (p_ii / (1 − p_ii)).
  • A rule of thumb: if C_i > 1, then the ith
    observation is influential.
  • A more flexible and informative way of detecting
    influential observations is to do an index plot
    of C_i.
  • There are other measures of influence; see text,
    Sec. 4.9.
  • See the R script file MotorInn.r on the course
    website for details on calculating various
    quantities and plotting.
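The relation C_i = (r_i²/(p + 1)) · (p_ii/(1 − p_ii)) avoids refitting the model n times. A NumPy sketch on made-up data, checking it against the leave-one-out definition by brute force:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta = np.linalg.solve(X.T @ X, X.T @ y)
yhat = X @ beta
e = y - yhat
sigma2 = e @ e / (n - p - 1)                     # sigma-hat squared
P = X @ np.linalg.solve(X.T @ X, X.T)
p_ii = np.diag(P)
r = e / np.sqrt(sigma2 * (1 - p_ii))             # studentized residuals

# Cook's distance via the relation C_i = r_i^2/(p+1) * p_ii/(1 - p_ii)
C = r**2 / (p + 1) * p_ii / (1 - p_ii)

# brute-force check against the definition: refit without observation i
C_def = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    b_i = np.linalg.solve(X[keep].T @ X[keep], X[keep].T @ y[keep])
    C_def[i] = np.sum((yhat - X @ b_i) ** 2) / ((p + 1) * sigma2)
```

An index plot of C (e.g. against observation number) then shows at a glance which observations dominate the fit.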