1
S-Plus Interlude (cont'd)
  • Beer data: Calories vs. Alcohol and Light-ness
  • What does X look like?
  • Is this model any better?
  • Use options(contrasts = "contr.treatment")
  • How about Calories ~ Alcohol + Light?
  • How about Country? Which model is best? (See the sketch below.)
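
A minimal sketch of the fits above, assuming a data frame beer with columns Calories, Alcohol, Light, and Country (names hypothetical):

    # Treatment contrasts: the first level of a factor is the baseline
    options(contrasts = c("contr.treatment", "contr.poly"))

    fit1 <- lm(Calories ~ Alcohol, data = beer)
    fit2 <- lm(Calories ~ Alcohol + Light, data = beer)
    fit3 <- lm(Calories ~ Alcohol + Country, data = beer)

    model.matrix(fit2)   # what does X look like?
    summary(fit2)        # is this model any better?
    anova(fit1, fit2)    # F-test comparing the nested models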

2
Partial Regression Plots
  • Also called partial regression leverage plots
  • Plot (residuals of y, regressed on all x's but x_i) vs. (residuals of x_i, regressed on all x's but x_i)
  • Slope of the regression line is b_i; intercept is 0
  • Interpretation: the relationship of y and x_i, after removing the effect of the other x's
  • Shows how individual points affect the fit (see the sketch below)
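
A sketch of building one such plot by hand; the model y ~ x1 + x2 + x3 and data frame d are hypothetical, and the plot shown is for x1:

    ry <- resid(lm(y ~ x2 + x3, data = d))    # residuals of y on all x's but x1
    rx <- resid(lm(x1 ~ x2 + x3, data = d))   # residuals of x1 on the other x's

    plot(rx, ry)           # the partial regression (leverage) plot
    abline(lm(ry ~ rx))    # slope equals b1 from the full fit; intercept is 0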

3
Checking the Assumptions
  • Everything depends on the assumptions, especially with small sample sizes
  • Check all that you can
  • Usually wrong to some degree
  • Some can't be checked at all
  • Consequences of being wrong: biased estimates, bad estimates of SDs, invalid t- and F-tests, inefficiency (Hamilton, p. 113)

4
The Assumptions
  • 1. X is fixed: each y_i depends on its corresponding x's; repeating the same experiment gives the same X, different y
  • If not, the analysis is conditional on the observed values of the X's
  • Experiment vs. observational study
  • Test assumption by finding out how the data
    were collected

5
Assume the Model Is Correct
  • 1A. The linear model is correct: E[y] is a linear function of the x's in the model
  • If we have irrelevant x's in the model, our SEs are too big
  • If we have relevant x's not in the model, forget it: estimates are biased, SEs and t-tests are wrong, and there is no way to check
  • If the truth is non-linear, forget it, too

6
Assume E[e_i] = 0
  • 2. Errors have mean 0: E[e_i] = 0, for all i
  • If not, b_0 is a biased estimate of the true intercept
  • No other damage done
  • Impossible to check from the data (why?)
  • Conceivably, we can check by examining how the data are collected
  • If X is random, require cov(X_i, e_i) = 0 for all i; this, too, cannot be checked

7
Assume Homoscedasticity
  • 3. Errors have constant variance
  • var[e_i] = s^2, for all i
  • If not, the b's are unbiased, but SEs, t-tests, etc. fall apart
  • Check by plotting residuals against fitted values, or abs(residuals) vs. fitted values
  • Try lm(), l1fit(), lowess() to look for a trend in that plot

8
S-Plus Interlude
  • abs(residuals) vs. fitted values
  • lm() is already suspect
  • l1fit() minimizes the sum of the absolute residuals, instead of the sum of squared ones
  • Handy in general as a description, but doesn't give t-tests and the like
  • lowess() adds a smooth curve (see the sketch below)
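
A sketch of this spread check, assuming a fitted model object fit:

    f <- fitted(fit)
    r <- abs(resid(fit))

    plot(f, r)                          # look for a trend in the spread
    abline(lm(r ~ f))                   # least-squares trend line
    abline(coef(l1fit(f, r)), lty = 2)  # least-absolute-residuals line (S-Plus)
    lines(lowess(f, r))                 # smooth curve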

9
Weighted Least Squares
  • What if you know the variances of the e_i aren't equal?
  • Example: y_i is the mean of n_i observations. If each of the n_i observations has the same variance s^2, then y_i has variance s^2/n_i
  • Maybe the ys come from different pieces of
    equipment, each with its own SD
  • Assume that the variances are known

10
WLS and GLS
  • So we have y = Xb + e, e ~ N(0, W), where W might be diagonal or not
  • What if you pre-multiply by W^(-1/2)?
  • Then W^(-1/2)y = W^(-1/2)Xb + W^(-1/2)e, or y* = X*b + e*, and e* ~ N(0, I), so we can do OLS on y* and X*: b = (X*^T X*)^(-1) X*^T y*
  • In original units, b = (X^T W^(-1) X)^(-1) X^T W^(-1) y
  • In S-Plus, call lm() with weights (see the sketch below)
  • W need not be diagonal; then estimation is hard
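
A sketch of the mean-of-n_i example from the previous slide, assuming a data frame d with columns y, x, and n (hypothetical); lm() weights are proportional to inverse variances, so w_i = n_i here:

    wfit <- lm(y ~ x, data = d, weights = n)

    # The same estimate written out as b = (X^T W^-1 X)^-1 X^T W^-1 y,
    # with W^-1 proportional to diag(n_i)
    X    <- model.matrix(~ x, data = d)
    Winv <- diag(d$n)
    b    <- solve(t(X) %*% Winv %*% X, t(X) %*% Winv %*% d$y)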

11
Assume No Autocorrelation
  • 4. cov(e_i, e_j) = 0 if i is not equal to j
  • If not, the b's are unbiased, but SEs, t-tests, etc. fall apart (again)
  • Almost inevitable in data collected at points adjacent in time or space; requires time series analysis
  • Durbin-Watson test (Hamilton, p. 118), acf()
  • Basic idea: the correlation of all pairs (e_i, e_(i-1))
  • Pictures (smoothed r_i vs. time) help (see the sketch below)
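
A sketch of these checks for residuals from a fit to time-ordered data:

    e <- resid(fit)
    n <- length(e)

    acf(e)                   # autocorrelation at each lag
    cor(e[-1], e[-n])        # lag-1 correlation of the pairs (e_i, e_(i-1))
    plot(1:n, e)             # residuals vs. time,
    lines(lowess(1:n, e))    #   with a smooth curve added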

12
Assume Errors Are Normal
  • 5. Each error e_i ~ N(0, s^2)
  • If not, we can't use t- or F-tests, especially in small samples
  • Can test with histograms, qqplots, symmetry plots, K-S or chi-squared goodness-of-fit tests (see the sketch below)
  • Can be helped by transformation
  • Assumptions 2-5: e ~ MVN(0, s^2 I)
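
A sketch of those checks, using residuals from a fitted model fit; ks.gof() is the S-Plus goodness-of-fit test (ks.test() is the closest R analogue):

    e <- resid(fit)

    hist(e)        # rough shape
    qqnorm(e)      # roughly a straight line if the errors are normal
    qqline(e)
    ks.gof(e)      # K-S goodness-of-fit test against the normal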

13
The Moral
  • Your model is not 100% correct, but often it produces believable results
  • It's (maybe a good) approximation
  • The quality of the approximation depends on the
    degree to which the assumptions are met
  • The proof of quality is in prediction
  • Try cross-validation

14
Cross-validation
  • Ideally, you could fit the model on one set of data (the training set) and evaluate its quality on another (the test set)
  • What if the fit isn't so good? If you do this process again, you've already seen the test set
  • Another good plan: divide the data into, say, 10 parts

15
Cross-Validation (cont'd)
  • Leave out one part, fit the model with the other nine, then predict the missing tenth
  • Do this ten times, so that each part has been in nine times and out once
  • Measure average RSS (or RSE, or ...) across all 10 models: xval() (see the sketch below)
  • You can repeat this (maybe using common random
    numbers for the split)
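
A minimal 10-fold cross-validation sketch; xval() presumably packages something like this (the formula and data frame d are hypothetical):

    set.seed(1)                             # common random numbers for the split
    fold <- sample(rep(1:10, length = nrow(d)))

    press <- 0
    for (k in 1:10) {
      fit.k <- lm(y ~ x1 + x2, data = d[fold != k, ])    # fit on nine parts
      pred  <- predict(fit.k, newdata = d[fold == k, ])  # predict the tenth
      press <- press + sum((d$y[fold == k] - pred)^2)
    }
    press / nrow(d)                         # average squared prediction error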

16
Leverage and Influence
  • Some cases are more important than others
  • Predicted y: yhat = Xb = X(X^T X)^(-1) X^T y
  • X(X^T X)^(-1) X^T puts the hat on y, so it's called the hat matrix, H
  • Diagonal entries of H, the h_i, are the leverages
  • Leverage measures the ability to have influence; it's a consequence of the experiment's design

17
Leverage
  • Leverages (h_i) fall between 1/n and 1
  • Rule of thumb: beware of h_i > 2p/n
  • Another: beware when max(h_i) > 0.2
  • Compute with lm.influence(), hat() (see the sketch below)
  • Estimated variance of the i-th residual: s^2(1 - h_i)
  • High leverage -> good fit, small variance there
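
A sketch of the leverage computations, for a fitted model fit with p coefficients:

    h <- lm.influence(fit)$hat      # leverages h_i
    # or directly from the hat matrix H = X(X^T X)^-1 X^T
    X <- model.matrix(fit)
    H <- X %*% solve(t(X) %*% X) %*% t(X)
    range(diag(H) - h)              # the same numbers, up to rounding

    p <- length(coef(fit))
    which(h > 2 * p / length(h))    # rule-of-thumb flags
    max(h) > 0.2                    # the other warning sign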

18
Influence
  • A point with high leverage may or may not
    actually have high influence
  • A point is influential if it has a big effect on
    the regression
  • We can compute the b's, and s^2, leaving each case out in turn, one at a time
  • These come from lm.influence()

19
DFBETA
  • DFBETA: the change in a b that comes from omitting a case, expressed in SDs
  • DFBETA for case i and column k is DFBETA_(i,k) = (b_k - b_k(no i)) / (s(no i) / sqrt(RSS_k))
  • RSS_k is the RSS from regressing column k on all the other columns (and including point i)
  • By how many SDs does b change without case i?
  • Beware when |DFBETA| > 2/sqrt(n) (see the sketch below)
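
A sketch of that formula for one case i and the coefficient of x1 (model and names hypothetical):

    full  <- lm(y ~ x1 + x2, data = d)
    noi   <- lm(y ~ x1 + x2, data = d[-i, ])       # refit without case i
    s.noi <- summary(noi)$sigma                    # s(no i)
    rss.k <- sum(resid(lm(x1 ~ x2, data = d))^2)   # col. k on the other columns

    dfbeta.ik <- (coef(full)["x1"] - coef(noi)["x1"]) / (s.noi / sqrt(rss.k))
    abs(dfbeta.ik) > 2 / sqrt(nrow(d))             # rule-of-thumb flag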

20
Residuals
  • Not only the b's, but also the yhats and residuals, change when a case is omitted
  • Standardized residual: z_i = r_i / (s sqrt(1 - h_i))
  • Studentized residual: t_i uses s(no i) in place of s
  • t_i is a little t-test of "does case i shift the intercept significantly?"; use level α/n (see the sketch below)
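
A sketch of both kinds of residuals for a fitted model fit, with lm.influence() supplying the leave-one-out s(no i):

    r <- resid(fit)
    h <- lm.influence(fit)$hat
    s <- summary(fit)$sigma

    z  <- r / (s * sqrt(1 - h))                         # standardized residuals
    ti <- r / (lm.influence(fit)$sigma * sqrt(1 - h))   # studentized residuals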

21
Cook's Distance
  • Cook's distance measures influence
  • D_i = z_i^2 h_i / (p (1 - h_i))
  • D_i increases as z_i increases
  • When h_i is small, D_i is small; as h_i -> 1, D_i gets huge
  • This effect is on all the b's, not just one
  • Look for D_i > 1, or maybe D_i > 4/n (see the sketch below)
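
Continuing the residual sketch above (z and h as computed there):

    p <- length(coef(fit))
    D <- z^2 * h / (p * (1 - h))    # Cook's distance

    plot(D, type = "h")             # spike plot of the D_i
    which(D > 1)                    # serious trouble
    which(D > 4 / length(D))        # the milder flag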

22
S-Plus Interlude
  • plot.lm() gives a Cook's distance plot
  • lm.influence() gives leverages, new b's, new s's; dfbetas() is available
  • Q: What to do about influential points?
  • A: Do they change your conclusions?

23
Proportional Leverage Plot
  • This is the leverage plot (the relationship of y to one x, adjusting for the other x's), but...
  • Points have areas proportional to DFBETA (adjusted to lie between 1 and 100)
  • That area is the number of SDs by which that case affects the line
  • Shows influence problems, and maybe curvilinearity or heteroscedasticity