Title: Topic 18: Remedies
Topic 18: Remedies
Outline
- Regression diagnostics
- Remedial Measures
- Weighted regression
- Ridge Regression
- Robust Regression
- Bootstrapping
- Validation
- Qualitative predictor
- Piecewise Linear Model
Regression Diagnostics: Summary
- Check normality of the residuals with a normal quantile plot
- Plot the residuals versus predicted values, versus each of the Xs, and (when appropriate) versus time
- Examine the partial regression plots
- If there appears to be a curvilinear pattern, generate the graphics version with a smooth
Regression Diagnostics: Summary
- Examine
  - the studentized deleted residuals (RSTUDENT in the output)
  - the hat matrix diagonals
  - DFFITS, Cook's D, and the DFBETAS
- Check observations that are extreme on these measures relative to the other observations
Regression Diagnostics: Summary
- Examine the tolerance for each X
- If there are variables with low tolerance, you need to do some model building
  - Recode variables
  - Variable selection
Remedial measures
- Weighted least squares
- Ridge regression
- Robust regression
- Nonparametric regression
- Bootstrapping
Maximum Likelihood
Weighted regression
- Maximization of L with respect to the βs
- Equivalent to minimization of a weighted sum of squared residuals (written out below)
- Weight of each observation: wi = 1/σi²
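For the simple linear model with independent Yi ~ N(β0 + β1Xi, σi²), the likelihood referred to on the Maximum Likelihood slide and the criterion minimized here are, written out:

L(\beta_0, \beta_1) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left( -\frac{(Y_i - \beta_0 - \beta_1 X_i)^2}{2\sigma_i^2} \right)

Maximizing L over the βs is equivalent to minimizing

Q_w = \sum_{i=1}^{n} w_i (Y_i - \beta_0 - \beta_1 X_i)^2, \qquad w_i = \frac{1}{\sigma_i^2}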
Weighted least squares
- Least squares problem is to minimize the sum of wi times the squared residual for case i (similar to MLE)
- Computations are easy: use the weight statement in proc reg
- bw = (X′WX)⁻¹(X′WY), where W is a diagonal matrix of the weights
- The problem in many cases is to determine the weights
Determination of weights
- Find a relationship between the absolute residual and another variable and use this as a model for the standard deviation
- Similarly for the squared residual and the variance
- Use grouped data or approximately grouped data to estimate the variance
Determination of weights
- With a model for the standard deviation or the variance, we can approximate the optimal weights
- Optimal weights are proportional to the inverse of the variance (in symbols below)
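With the absolute-residual approach, the estimated weights are

\hat{w}_i = \frac{1}{\hat{s}_i^{\,2}}

where ŝi is the fitted value for case i from regressing the absolute residuals |ei| on the predictor; with the squared-residual approach, the fitted variance replaces ŝi².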
NKNW Example
- NKNW p 406
- Y is diastolic blood pressure
- X is age
- n = 54 healthy adult women aged 20 to 60 years old
Get the data and check it

data a1;
  infile '../data/ch10ta01.dat';
  input age diast;
proc print data=a1;
run;
Plot the relationship

symbol1 v=circle i=sm70;
proc gplot data=a1;
  plot diast*age;
run;
Diastolic bp vs age
Run the regression

proc reg data=a1;
  model diast=age;
  output out=a2 r=resid;
run;
Regression output

Source   DF   F Value   Pr > F
Model     1     35.79   <.0001
Error    52
Total    53
Regression output

Root MSE          8.14   R-Square   0.40
Dependent Mean    79.1   Adj R-Sq   0.39
Coeff Var         10.2
Regression output

Variable    DF   Estimate   Std Error   t Value   Pr > |t|
Intercept    1       56.1        3.9      14.06     <.0001
age          1       0.58       0.09       5.98     <.0001
Use the output data set to get the absolute and squared residuals

data a2; set a2;
  absr = abs(resid);
  sqrr = resid*resid;
run;
Do the plots with a smooth

proc gplot data=a2;
  plot (resid absr sqrr)*age;
run;
Residuals vs age
Absolute value of the residuals vs age
Squared residuals vs age
Predict the standard deviation (absolute value of the residual)

proc reg data=a2;
  model absr=age;
  output out=a3 p=shat;
run;

Note that a3 has the predicted standard deviations (shat).
Compute the weights

data a3; set a3;
  wt = 1/(shat*shat);
run;
Regression with weights

proc reg data=a3;
  model diast=age / clb;
  weight wt;
run;
Output

Source   DF   F Value   Pr > F
Model     1     56.64   <.0001
Error    52
Total    53
Output

Root MSE           1.21302   R-Square   0.5214
Dependent Mean    73.55134   Adj R-Sq   0.5122
Coeff Var          1.64921
Output

Variable    Estimate   Std Error   t Value   Pr > |t|
Intercept       55.5         2.5     22.04     <.0001
age             0.59        0.07      7.53     <.0001
Ridge regression
- Similar to a very old idea in numerical analysis
- If (X′X) is difficult to invert (near singular), then approximate by inverting (X′X + kI)
- Estimators of the coefficients are biased but more stable
- For some value of k, the ridge regression estimator has a smaller mean square error than the ordinary least squares estimator
- Interesting, but has not turned out to be a useful method in practice
- Ridge (k) is available as an option in proc reg (sketch below)
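A minimal sketch of how a ridge trace might be requested (the data set a1 and predictors x1-x3 here are hypothetical; in current SAS the RIDGE= list is given on the proc reg statement together with OUTEST=):

proc reg data=a1 outest=ridge_est outvif ridge=0 to 0.10 by 0.01;
  model y = x1 x2 x3;   * coefficients and VIFs are saved for each value of k;
run;
proc print data=ridge_est;
run;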
Robust regression
- Basic idea is to have a procedure that is not sensitive to outliers
- Alternatives to least squares minimize
  - the sum of absolute values of the residuals
  - the median of the squares of the residuals
- Do weighted regression with weights based on residuals, and iterate (sketch below)
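A minimal sketch using proc robustreg (a newer SAS procedure, not used elsewhere in these notes), whose M-estimation option carries out exactly this kind of iteratively reweighted fit; the blood pressure data set a1 is assumed:

proc robustreg data=a1 method=m;
  model diast = age;   * Huber-type M-estimation via iteratively reweighted least squares;
run;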
Nonparametric regression
- Several versions
- We have used i=sm70
- Interesting theory
- All versions have some smoothing parameter, similar to the 70 in i=sm70 (sketch below)
- Confidence intervals and significance tests not fully developed
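A minimal sketch with proc loess, another nonparametric smoother not used in the original example; smooth= plays the role of the smoothing parameter:

proc loess data=a1;
  model diast = age / smooth=0.5;   * larger values of smooth= give a smoother fit;
run;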
Bootstrap
- Very important theoretical development that has had a major impact on applied statistics
- Based on simulation
- Sample with replacement from the data or residuals and get the distribution of the quantity of interest (sketch below)
- CI usually based on quantiles of the sampling distribution
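A minimal sketch of a case-resampling bootstrap for the age slope in the blood pressure example (the number of replicates and the seed are arbitrary choices):

proc surveyselect data=a1 out=boot method=urs samprate=1
                  reps=1000 outhits seed=4321;
run;
proc reg data=boot outest=est noprint;
  model diast = age;
  by replicate;            * one fit per bootstrap sample;
run;
proc univariate data=est noprint;
  var age;                 * bootstrap distribution of the slope;
  output out=ci pctlpts=2.5 97.5 pctlpre=slope_;
run;
proc print data=ci;        * 95 percent percentile interval;
run;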
Model validation
- Three approaches to checking the validity of the model
  - Collect new data: does it fit the model?
  - Compare with theory, other data, simulation
  - Use some of the data for the basic analysis and some for the validity check (sketch below)
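A minimal sketch of the data-splitting idea for the blood pressure example (a hypothetical 50/50 random split; cases with a missing response are left out of the fit but still receive predicted values):

data both;
  set a1;
  if ranuni(123) < 0.5 then diast_fit = diast;   * training half;
  else diast_fit = .;                            * validation half, held out;
run;
proc reg data=both;
  model diast_fit = age;
  output out=pred p=yhat;
run;
data check;
  set pred;
  if diast_fit = . then sqerr = (diast - yhat)**2;   * held-out cases only;
run;
proc means data=check mean;   * mean squared prediction error;
  var sqerr;
run;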
One qualitative explanatory variable
- Indicator (or dummy) variables have the value 0 when the quality is absent and 1 when the quality is present
- Examples include
  - Gender as an explanatory variable
  - Placebo versus control
Binary predictor
- X1 has values 0 and 1 corresponding to two different groups
- X2 is a continuous variable
- Y = β0 + β1X1 + β2X2 + β3X1X2 + ε
- For X1 = 0: Y = β0 + β2X2 + ε
- For X1 = 1: Y = (β0 + β1) + (β2 + β3)X2 + ε
Binary predictor
- H0: β1 = β3 = 0 tests the hypothesis that the lines are the same (sketch below)
- H0: β1 = 0 tests equal intercepts
- H0: β3 = 0 tests equal slopes
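A minimal sketch of fitting and testing this model in proc reg (the data set b1 and the variable names y, x1, x2 are hypothetical):

data b2; set b1;
  x1x2 = x1*x2;                        * interaction term;
run;
proc reg data=b2;
  model y = x1 x2 x1x2;
  samelines: test x1 = 0, x1x2 = 0;    * H0: beta1 = beta3 = 0 (identical lines);
  sameslope: test x1x2 = 0;            * H0: beta3 = 0 (equal slopes);
run;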
More models
- If a categorical (qualitative) variable has k possible values, we need k-1 indicator variables (sketch below)
- These can be defined in many different ways; we will do this in Chapter 16
- We also can have several categorical explanatory variables, interactions, etc.
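A minimal sketch of one such coding for a hypothetical three-level variable group (levels 1, 2, 3), using level 1 as the reference group:

data c2; set c1;
  g2 = (group = 2);   * 1 for group 2, else 0;
  g3 = (group = 3);   * 1 for group 3, else 0;
run;
proc reg data=c2;
  model y = g2 g3 x;  * x is a continuous predictor;
run;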
Piecewise Linear Model
- At some (known) point or points, the slope of the relationship changes
- Consider NKNW p 476 (n = 8)
- Y is unit cost
- X1 is lot size
- The slope is allowed to change at X1 = 500
Plot the data
Model
- Our model has
  - An intercept
  - A coefficient for lot size (the slope)
  - An additional explanatory variable that will add a constant to the slope whenever lot size is greater than 500 (written out below)
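Written out (the variable cslope defined in the next data step is (X1 - 500)+, i.e. X1 - 500 when lot size exceeds 500 and 0 otherwise):

Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 (X_{i1} - 500)_{+} + \varepsilon_i

so the slope is β1 for lot sizes up to 500 and β1 + β2 beyond 500.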
Data step

data a1; set a1;
  if lotsize le 500 then cslope = 0;
  if lotsize gt 500 then cslope = lotsize - 500;
run;
Check the data

Obs    cost   lotsize   cslope
  1    2.57       650      150
  2    4.40       340        0
  3    4.52       400        0
  4    1.39       800      300
  5    4.75       300        0
  6    3.55       570       70
  7    2.49       720      220
  8    3.77       480        0
Run the regression

proc reg data=a1;
  model cost=lotsize cslope;
  output out=a2 p=costhat;
run;
Output

Source   DF   F Value   Pr > F
Model     2     79.06   0.0002
Error     5
Total     7
Output

Root MSE         0.24494   R-Square   0.9693
Dependent Mean   3.43000   Adj R-Sq   0.9571
Coeff Var        7.14106
Output

Variable    Estimate   t Value   Pr > |t|
Intercept    5.89545      9.76     0.0002
lotsize     -0.00395     -2.65     0.0454
cslope      -0.00389     -1.69     0.1528
Plot data with fit

symbol1 v=circle i=none c=black;
symbol2 v=none i=join c=black;
proc sort data=a2; by lotsize;
proc gplot data=a2;
  plot (cost costhat)*lotsize / overlay;
run;
The plot
Background Reading
- We used programs NKNW406.sas, NKNW459.sas, and
NKNW476.sas to generate the output for today.