Topic 18: Remedies - PowerPoint PPT Presentation

Provided by: georgep56

Transcript and Presenter's Notes
1
Topic 18 Remedies
2
Outline
  • Regression diagnostics
  • Remedial Measures
  • Weighted regression
  • Ridge Regression
  • Robust Regression
  • Bootstrapping
  • Validation
  • Qualitative predictor
  • Piecewise Linear Model

3
Regression Diagnostics: Summary
  • Check normality of the residuals with a normal
    quantile plot
  • Plot the residuals versus predicted values,
    versus each of the Xs and (when appropriate)
    versus time
  • Examine the partial regression plots
  • If there appears to be a curvilinear pattern,
    generate the graphics version with a smooth

4
Regression Diagnostics: Summary
  • Examine:
  • the studentized deleted residuals (RSTUDENT in
    the output)
  • the hat matrix diagonals
  • DFFITS, Cook's D, and the DFBETAS
  • Check observations that are extreme on these
    measures relative to the other observations

5
Regression Diagnostics: Summary
  • Examine the tolerance for each X
  • If there are variables with low tolerance, you
    need to do some model building:
  • Recode variables
  • Variable selection

6
Remedial measures
  • Weighted least squares
  • Ridge regression
  • Robust regression
  • Nonparametric regression
  • Bootstrapping

7
Maximum Likelihood
8
Weighted regression
  • Maximization of L with respect to the βs
  • Equivalent to minimization of Σ wi(Yi − Xi′β)²
  • Weight of each observation: wi = 1/σi²

9
Weighted least squares
  • Least squares problem is to minimize the sum of
    wi times the squared residual for case i (similar
    to MLE)
  • Computations are easy, use the weight statement
    in proc reg
  • bw = (X′WX)⁻¹X′WY
  • where W is a diagonal matrix of the weights
  • The problem in many cases is to determine the
    weights
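The closed-form estimate bw = (X′WX)⁻¹X′WY can be sketched in NumPy. This is a minimal illustration with hypothetical data (the slides do this via the weight statement in proc reg); the variable names and simulated noise pattern are assumptions, not from the deck.

```python
import numpy as np

# Hypothetical data where the error standard deviation grows with x
rng = np.random.default_rng(0)
x = np.linspace(1.0, 10.0, 20)
X = np.column_stack([np.ones_like(x), x])       # design matrix with intercept
y = 2 + 3 * x + rng.normal(scale=x)             # noise sd proportional to x

w = 1.0 / x**2                                  # optimal weights: 1/variance
W = np.diag(w)
bw = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)  # bw = (X'WX)^-1 X'Wy
```

The same estimate falls out of ordinary least squares after scaling each row of X and y by the square root of its weight, which is one way to check the computation.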

10
Determination of weights
  • Find a relationship between the absolute residual
    and another variable and use this as a model for
    the standard deviation
  • Similarly for the squared residual and the
    variance
  • Use grouped data or approximately grouped data to
    estimate the variance

11
Determination of weights
  • With a model for the standard deviation or the
    variance, we can approximate the optimal weights
  • Optimal weights are proportional to the inverse
    of the variance

12
NKNW Example
  • NKNW p 406
  • Y is diastolic blood pressure
  • X is age
    n = 54 healthy adult women aged 20 to 60 years
    old

13
Get the data and check it
data a1;
  infile '../data/ch10ta01.dat';
  input age diast;
proc print data=a1;
run;
14
Plot the relationship
symbol1 v=circle i=sm70;
proc gplot data=a1;
  plot diast*age;
run;
15
Diastolic bp vs age
16
Run the regression
proc reg data=a1;
  model diast=age;
  output out=a2 r=resid;
run;
17
Regression output
Source   DF   F Value   Pr > F
Model     1     35.79   <.0001
Error    52
Total    53
18
Regression output
Root MSE    8.14   R-Square   0.40
Dep Mean    79.1   Adj R-Sq   0.39
Coeff Var   10.2
19
Regression output
Variable    DF   Par Est   St Err       t        P
Intercept    1      56.1     3.9     14.06   <.0001
age          1      0.58     0.09     5.98   <.0001
20
Use the output data set to get the absolute and
squared residuals
data a2;
  set a2;
  absr=abs(resid);
  sqrr=resid*resid;
21
Do the plots with a smooth
proc gplot data=a2;
  plot (resid absr sqrr)*age;
run;
22
Residuals vs age
23
Absolute value of the residuals vs age
24
Squared residuals vs age
25
Predict the standard deviation (absolute value of
the residual)
proc reg data=a2;
  model absr=age;
  output out=a3 p=shat;
run;
Note that a3 has the predicted standard
deviations (shat)
26
Compute the weights
data a3;
  set a3;
  wt=1/(shat*shat);
27
Regression with weights
proc reg data=a3;
  model diast=age / clb;
  weight wt;
run;
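The whole procedure on the preceding slides — fit, model the absolute residuals on X, take weights as one over the squared predicted standard deviation, refit with weights — can be sketched in Python. The data here are hypothetical stand-ins for the blood-pressure file (same n = 54 and age range), not the actual NKNW data.

```python
import numpy as np

# Hypothetical data mimicking the example: spread of diast grows with age
rng = np.random.default_rng(1)
age = rng.uniform(20, 60, 54)
diast = 56 + 0.58 * age + rng.normal(scale=0.05 * age)

X = np.column_stack([np.ones_like(age), age])
b_ols, *_ = np.linalg.lstsq(X, diast, rcond=None)          # 1: unweighted fit
resid = diast - X @ b_ols

b_abs, *_ = np.linalg.lstsq(X, np.abs(resid), rcond=None)  # 2: model |resid| on X
shat = X @ b_abs                                           # predicted std deviation
wt = 1.0 / np.maximum(shat, 1e-6)**2                       # 3: weights (guard vs <= 0)

W = np.diag(wt)
b_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ diast)      # 4: weighted refit
```

In practice this two-stage fit can be iterated once or twice until the weights stabilize.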
28
Output
Source   DF   F Value   Pr > F
Model     1     56.64   <.0001
Error    52
Total    53
29
Output
Root MSE          1.21302   R-Square   0.5214
Dependent Mean   73.55134   Adj R-Sq   0.5122
Coeff Var         1.64921
30
Output
Variable    Par Est   St Err       t        P
Intercept      55.5     2.5     22.04   <.0001
age            0.59     0.07     7.53   <.0001
31
Ridge regression
  • Similar to a very old idea in numerical analysis
  • If (X′X) is difficult to invert (near singular),
    then approximate by inverting (X′X + kI)
  • Estimators of coefficients are biased but more
    stable
  • For some value of k, the ridge regression
    estimator has a smaller mean square error than
    the ordinary least squares estimator
  • Interesting, but it has not turned out to be a
    useful method in practice
  • ridge=k is an option for the model statement
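The ridge estimator bk = (X′X + kI)⁻¹X′y is short enough to write out directly. This sketch uses hypothetical near-collinear data (the `ridge` helper and the simulated design are assumptions, not from the slides):

```python
import numpy as np

# Ridge estimator: (X'X + kI)^-1 X'y
def ridge(X, y, k):
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + k * np.eye(p), X.T @ y)

# Hypothetical data with two nearly identical columns (near-singular X'X)
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 3))
X[:, 2] = X[:, 1] + 0.01 * rng.normal(size=30)
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=30)

b0 = ridge(X, y, 0.0)   # k = 0 is ordinary least squares
b1 = ridge(X, y, 1.0)   # k > 0 is biased but more stable
```

Increasing k always shrinks the coefficient vector toward zero, which is the "more stable" behavior the slide describes.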

32
Robust regression
  • Basic idea is to have a procedure that is not
    sensitive to outliers
  • Alternatives to least squares: minimize
  • the sum of absolute values of the residuals
  • the median of the squares of the residuals
  • Do weighted regression with weights based on
    residuals, and iterate
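The last bullet — weighted regression with weights based on residuals, iterated — is iteratively reweighted least squares. A minimal sketch, assuming Huber weights as the downweighting rule (one common choice; the slides do not name a specific weight function):

```python
import numpy as np

def huber_weights(resid, c=1.345):
    s = np.median(np.abs(resid)) / 0.6745 + 1e-12   # robust scale (MAD-based)
    u = np.abs(resid) / s
    return np.where(u <= c, 1.0, c / u)             # downweight large residuals

def irls(X, y, n_iter=50):
    b = np.linalg.lstsq(X, y, rcond=None)[0]        # start from least squares
    for _ in range(n_iter):
        w = huber_weights(y - X @ b)                # weights from current residuals
        W = np.diag(w)
        b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return b

# Tiny example: a perfect line plus one gross outlier
x = np.arange(10.0)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x
y[-1] += 50

b_ols = np.linalg.lstsq(X, y, rcond=None)[0]        # slope pulled up by outlier
b_rob = irls(X, y)                                  # slope stays near 2
```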

33
Nonparametric regression
  • Several versions
  • We have used i=sm70
  • Interesting theory
  • All versions have some smoothing parameter,
    similar to the 70 in i=sm70
  • Confidence intervals and significance tests not
    fully developed

34
Bootstrap
  • Very important theoretical development that has
    had a major impact on applied statistics
  • Based on simulation
  • Sample with replacement from the data or
    residuals and get the distribution of the
    quantity of interest
  • CI usually based on quantiles of the sampling
    distribution
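The resample-and-take-quantiles idea can be sketched in a few lines. This version resamples cases (rows) with replacement to get a percentile CI for a regression slope; the data are hypothetical and the `slope` helper is an assumption for illustration:

```python
import numpy as np

# Hypothetical data: true slope is 2
rng = np.random.default_rng(3)
x = rng.uniform(0, 10, 40)
y = 1 + 2 * x + rng.normal(size=40)

def slope(xs, ys):
    Xs = np.column_stack([np.ones_like(xs), xs])
    return np.linalg.lstsq(Xs, ys, rcond=None)[0][1]

boot = np.empty(2000)
for b in range(2000):
    idx = rng.integers(0, len(x), len(x))   # sample cases with replacement
    boot[b] = slope(x[idx], y[idx])

lo, hi = np.percentile(boot, [2.5, 97.5])   # CI from quantiles of the
                                            # bootstrap distribution
```

Resampling residuals instead of cases (the other scheme the slide mentions) keeps the x values fixed and adds resampled residuals to the fitted values.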

35
Model validation
  • Three approaches to checking the validity of the
    model
  • Collect new data: does it fit the model?
  • Compare with theory, other data, simulation
  • Use some of the data for the basic analysis and
    some for the validity check

36
One qualitative explanatory variable
  • Indicator (or dummy) variables have the value 0
    when the quality is absent and 1 when the quality
    is present
  • Examples include:
  • Gender as an explanatory variable
  • Treatment versus placebo

37
Binary predictor
  • X1 has values 0 and 1 corresponding to two
    different groups
  • X2 is a continuous variable
  • Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
  • For X1 = 0: Y = β0 + β2X2 + ε
  • For X1 = 1: Y = (β0 + β1) + (β2 + β3)X2 + ε
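The interaction model above is just least squares on a design matrix with a product column. A noiseless sketch with made-up coefficients (group 0 line: 1 + 2·X2; group 1 line: 4 + 2.5·X2), so the fit recovers them exactly:

```python
import numpy as np

x1 = np.repeat([0.0, 1.0], 10)          # binary group indicator
x2 = np.tile(np.arange(10.0), 2)        # continuous predictor
# true model: Y = 1 + 3*X1 + 2*X2 + 0.5*X1*X2 (no error term here)
y = 1 + 3 * x1 + 2 * x2 + 0.5 * x1 * x2

# design matrix: intercept, X1, X2, and the interaction X1*X2
X = np.column_stack([np.ones_like(x2), x1, x2, x1 * x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]   # recovers (1, 3, 2, 0.5)
```

Here b[1] is the intercept shift between groups and b[3] is the slope shift, matching the decomposition on the slide.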

38
Binary predictor
  • H0: β1 = β3 = 0 tests the hypothesis that the
    lines are the same
  • H0: β1 = 0 tests equal intercepts
  • H0: β3 = 0 tests equal slopes

39
More models
  • If a categorical (qualitative) variable has k
    possible values, we need k − 1 indicator
    variables
  • These can be defined in many different ways; we
    will do this in Chapter 16
  • We also can have several categorical explanatory
    variables, interactions, etc

40
Piecewise Linear model
  • At some (known) point or points, the slope of the
    relationship changes
  • Consider NKNW p 476 (n = 8)
  • Y is unit cost
  • X1 is lot size
  • The slope is allowed to change at X1 = 500

41
Plot the data
42
Model
  • Our model has
  • An intercept
  • A coefficient for lot size (the slope)
  • An additional explanatory variable that will add
    a constant to the slope whenever lot size is
    greater than 500

43
Data step
data a1;
  set a1;
  if lotsize le 500 then cslope=0;
  if lotsize gt 500 then cslope=lotsize-500;
44
Check the data
Obs   cost   lotsize   cslope
  1   2.57       650      150
  2   4.40       340        0
  3   4.52       400        0
  4   1.39       800      300
  5   4.75       300        0
  6   3.55       570       70
  7   2.49       720      220
  8   3.77       480        0
45
Run the regression
proc reg data=a1;
  model cost=lotsize cslope;
  output out=a2 p=costhat;
run;
46
Output
Source   DF   F Value   Pr > F
Model     2     79.06   0.0002
Error     5
Total     7
47
Output
Root MSE         0.24494   R-Square   0.9693
Dependent Mean   3.43000   Adj R-Sq   0.9571
Coeff Var        7.14106
48
Output
Variable        Est       t        P
Intercept   5.89545    9.76   0.0002
lotsize    -0.00395   -2.65   0.0454
cslope     -0.00389   -1.69   0.1528
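The same piecewise fit can be reproduced outside SAS, since the eight data points appear on the "Check the data" slide. A NumPy sketch (the variable names are assumptions; the estimates should match the parameter output above):

```python
import numpy as np

# Data from the "Check the data" slide (NKNW p 476)
cost    = np.array([2.57, 4.40, 4.52, 1.39, 4.75, 3.55, 2.49, 3.77])
lotsize = np.array([650., 340., 400., 800., 300., 570., 720., 480.])
cslope  = np.maximum(lotsize - 500, 0)   # extra slope term beyond 500

# intercept, lot-size slope, and the change in slope past 500
X = np.column_stack([np.ones_like(lotsize), lotsize, cslope])
b, *_ = np.linalg.lstsq(X, cost, rcond=None)
# slope below 500 is b[1]; slope above 500 is b[1] + b[2]
```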
49
Plot data with fit
symbol1 v=circle i=none c=black;
symbol2 v=none i=join c=black;
proc sort data=a2;
  by lotsize;
proc gplot data=a2;
  plot (cost costhat)*lotsize / overlay;
run;
50
The plot
51
Background Reading
  • We used programs NKNW406.sas, NKNW459.sas, and
    NKNW476.sas to generate the output for today.