1
S-Plus Interlude (cont'd)
  • Beer data: Calories vs. Alcohol and Light-ness
  • What does X look like?
  • Is this model any better?
  • Use options(contrasts = "contr.treatment")
  • How about Calories ~ Alcohol + Light?
  • How about Country? Which model is best? (See the sketch below.)
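
A minimal sketch of the fits above, assuming a data frame beer with columns Calories, Alcohol, Light, and Country (names hypothetical):

    # Treatment contrasts: the first level of a factor is the baseline
    options(contrasts = c("contr.treatment", "contr.poly"))

    fit1 <- lm(Calories ~ Alcohol, data = beer)
    fit2 <- lm(Calories ~ Alcohol + Light, data = beer)
    fit3 <- lm(Calories ~ Alcohol + Country, data = beer)

    model.matrix(fit2)   # what does X look like?
    summary(fit2)        # is this model any better?
    anova(fit1, fit2)    # F-test comparing the nested models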

2
Partial Regression Plots
  • Also called partial regression leverage plots
  • Plot (residuals of y, regressed on all x's but x_i) vs. (residuals of x_i, regressed on all x's but x_i)
  • Slope of the regression line is b_i; intercept is 0
  • Interpretation: the relationship of y and x_i, after removing the effect of the other x's
  • Shows how individual points affect the fit (see the sketch below)
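
A sketch of building one such plot by hand; the model y ~ x1 + x2 + x3 and data frame d are hypothetical, and the plot shown is for x1:

    ry <- resid(lm(y ~ x2 + x3, data = d))    # residuals of y on all x's but x1
    rx <- resid(lm(x1 ~ x2 + x3, data = d))   # residuals of x1 on the other x's

    plot(rx, ry)           # the partial regression (leverage) plot
    abline(lm(ry ~ rx))    # slope equals b1 from the full fit; intercept is 0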

3
Checking the Assumptions
  • Everything depends on the assumptions, especially with small sample sizes
  • Check all that you can
  • Usually wrong to some degree
  • Some can't be checked at all
  • Consequences of being wrong: biased estimates, bad estimates of SDs, invalid t- and F-tests, inefficiency (Hamilton, p. 113)

4
The Assumptions
  • 1. X is fixed: each y_i depends on its corresponding x's; repeating the same experiment gives the same X, different y
  • If not, the analysis is conditional on the observed values of the X's
  • Experiment vs. observational study
  • Test assumption by finding out how the data
    were collected

5
Assume the Model Is Correct
  • 1A. The linear model is correct: E[y] is a linear function of the x's in the model
  • If we have irrelevant x's in the model, our SEs are too big
  • If we have relevant x's not in the model, forget it: estimates are biased, SEs and t-tests are wrong, and there is no way to check
  • If the truth is non-linear, forget it, too

6
Assume E[e_i] = 0
  • 2. Errors have mean 0: E[e_i] = 0, for all i
  • If not, b_0 is a biased estimate of the true intercept
  • No other damage done
  • Impossible to check from the data (why?)
  • Conceivably, we can check by examining how the data are collected
  • If X is random, require cov(X_i, e_i) = 0 for all i; this, too, cannot be checked

7
Assume Homoscedasticity
  • 3. Errors have constant variance
  • var[e_i] = s^2, for all i
  • If not, the b's are unbiased, but SEs, t-tests, etc. fall apart
  • Check by plotting residuals against fitted values, or abs(residuals) vs. fitted values
  • Try lm(), l1fit(), lowess() to look for a trend in that plot

8
S-Plus Interlude
  • abs(residuals) vs. fitted values
  • lm() is already suspect
  • l1fit() minimizes the sum of the absolute residuals, instead of the sum of squared ones
  • Handy in general as a description, but doesn't give t-tests and the like
  • lowess() adds a smooth curve (see the sketch below)
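
A sketch of this spread check, assuming a fitted model object fit:

    f <- fitted(fit)
    r <- abs(resid(fit))

    plot(f, r)                          # look for a trend in the spread
    abline(lm(r ~ f))                   # least-squares trend line
    abline(coef(l1fit(f, r)), lty = 2)  # least-absolute-residuals line (S-Plus)
    lines(lowess(f, r))                 # smooth curve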

9
Weighted Least Squares
  • What if you know the variances of the e_i aren't equal?
  • Example: y_i is the mean of n_i observations. If each of the n_i observations has the same variance s^2, then y_i has variance s^2/n_i
  • Maybe the ys come from different pieces of
    equipment, each with its own SD
  • Assume that the variances are known

10
WLS and GLS
  • So we have y = Xb + e, e ~ N(0, W), where W might be diagonal or not
  • What if you pre-multiply by W^(-1/2)?
  • Then W^(-1/2)y = W^(-1/2)Xb + W^(-1/2)e, or y* = X*b + e*, and e* ~ N(0, I), so we can do OLS on y* and X*: b = (X*^T X*)^(-1) X*^T y*
  • In original units, b = (X^T W^(-1) X)^(-1) X^T W^(-1) y
  • In S-Plus, call lm() with weights (see the sketch below)
  • W need not be diagonal; then estimation is hard
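
A sketch of the mean-of-n_i example from the previous slide, assuming a data frame d with columns y, x, and n (hypothetical); lm() weights are proportional to inverse variances, so w_i = n_i here:

    wfit <- lm(y ~ x, data = d, weights = n)

    # The same estimate written out as b = (X^T W^-1 X)^-1 X^T W^-1 y,
    # with W^-1 proportional to diag(n_i)
    X    <- model.matrix(~ x, data = d)
    Winv <- diag(d$n)
    b    <- solve(t(X) %*% Winv %*% X, t(X) %*% Winv %*% d$y)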

11
Assume No Autocorrelation
  • 4. cov(e_i, e_j) = 0 if i is not equal to j
  • If not, the b's are unbiased, but SEs, t-tests, etc. fall apart (again)
  • Almost inevitable in data collected at points adjacent in time or space; requires time series analysis
  • Durbin-Watson test (Hamilton, p. 118), acf()
  • Basic idea: the correlation of all pairs (e_i, e_(i-1))
  • Pictures (smoothed r_i vs. time) help (see the sketch below)
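
A sketch of these checks for residuals from a fit to time-ordered data:

    e <- resid(fit)
    n <- length(e)

    acf(e)                   # autocorrelation at each lag
    cor(e[-1], e[-n])        # lag-1 correlation of the pairs (e_i, e_(i-1))
    plot(1:n, e)             # residuals vs. time,
    lines(lowess(1:n, e))    #   with a smooth curve added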

12
Assume Errors Are Normal
  • 5. Each error e_i ~ N(0, s^2)
  • If not, we can't use t- or F-tests, especially in small samples
  • Can test with histograms, qqplots, symmetry plots, K-S or chi-squared goodness-of-fit tests (see the sketch below)
  • Can be helped by transformation
  • Assumptions 2-5: e ~ MVN(0, s^2 I)
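
A sketch of those checks, using residuals from a fitted model fit; ks.gof() is the S-Plus goodness-of-fit test (ks.test() is the closest R analogue):

    e <- resid(fit)

    hist(e)        # rough shape
    qqnorm(e)      # roughly a straight line if the errors are normal
    qqline(e)
    ks.gof(e)      # K-S goodness-of-fit test against the normal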

13
The Moral
  • Your model is not 100% correct, but often it produces believable results
  • It's (maybe a good) approximation
  • The quality of the approximation depends on the
    degree to which the assumptions are met
  • The proof of quality is in prediction
  • Try cross-validation

14
Cross-validation
  • Ideally, you could fit the model on one set of data (the training set) and evaluate its quality on another (the test set)
  • What if the fit isn't so good? If you do this process again, you've already seen the test set
  • Another good plan: divide the data into, say, 10 parts

15
Cross-Validation (cont'd)
  • Leave out one part, fit the model with the other nine, then predict the missing tenth
  • Do this ten times, so that each part has been in nine times and out once
  • Measure average RSS (or RSE, or ...) across all 10 models: xval() (see the sketch below)
  • You can repeat this (maybe using common random
    numbers for the split)
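
A minimal 10-fold cross-validation sketch; xval() presumably packages something like this (the formula and data frame d are hypothetical):

    set.seed(1)                             # common random numbers for the split
    fold <- sample(rep(1:10, length = nrow(d)))

    press <- 0
    for (k in 1:10) {
      fit.k <- lm(y ~ x1 + x2, data = d[fold != k, ])    # fit on nine parts
      pred  <- predict(fit.k, newdata = d[fold == k, ])  # predict the tenth
      press <- press + sum((d$y[fold == k] - pred)^2)
    }
    press / nrow(d)                         # average squared prediction error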

16
Leverage and Influence
  • Some cases are more important than others
  • Predicted y: yhat = Xb = X(X^T X)^(-1) X^T y
  • X(X^T X)^(-1) X^T puts the hat on y, so it's called the hat matrix, H
  • Diagonal entries of H, the h_i, are the leverages
  • Leverage measures the ability to have influence; it's a consequence of the experiment's design

17
Leverage
  • Leverages (h_i) fall between 1/n and 1
  • Rule of thumb: beware of h_i > 2p/n
  • Another: beware when max(h_i) > 0.2
  • Compute with lm.influence(), hat() (see the sketch below)
  • Estimated variance of the i-th residual: s^2(1 - h_i)
  • High leverage -> good fit, small variance there
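
A sketch of the leverage computations, for a fitted model fit with p coefficients:

    h <- lm.influence(fit)$hat      # leverages h_i
    # or directly from the hat matrix H = X(X^T X)^-1 X^T
    X <- model.matrix(fit)
    H <- X %*% solve(t(X) %*% X) %*% t(X)
    range(diag(H) - h)              # the same numbers, up to rounding

    p <- length(coef(fit))
    which(h > 2 * p / length(h))    # rule-of-thumb flags
    max(h) > 0.2                    # the other warning sign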

18
Influence
  • A point with high leverage may or may not
    actually have high influence
  • A point is influential if it has a big effect on
    the regression
  • We can compute the b's, and s^2, leaving each case out in turn, one at a time
  • These come from lm.influence()

19
DFBETA
  • DFBETA: the change in a b that comes from omitting a case, expressed in SDs
  • DFBETA for case i and column k is DFBETA_(i,k) = (b_k - b_k(no i)) / (s(no i) / sqrt(RSS_k))
  • RSS_k is the RSS from regressing column k on all the other columns (and including point i)
  • By how many SDs does b change without case i?
  • Beware when |DFBETA| > 2/sqrt(n) (see the sketch below)
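
A sketch of that formula for one case i and the coefficient of x1 (model and names hypothetical):

    full  <- lm(y ~ x1 + x2, data = d)
    noi   <- lm(y ~ x1 + x2, data = d[-i, ])       # refit without case i
    s.noi <- summary(noi)$sigma                    # s(no i)
    rss.k <- sum(resid(lm(x1 ~ x2, data = d))^2)   # col. k on the other columns

    dfbeta.ik <- (coef(full)["x1"] - coef(noi)["x1"]) / (s.noi / sqrt(rss.k))
    abs(dfbeta.ik) > 2 / sqrt(nrow(d))             # rule-of-thumb flag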

20
Residuals
  • Not only the b's, but also the yhats and residuals, change when a case is omitted
  • Standardized residual: z_i = r_i / (s sqrt(1 - h_i))
  • Studentized residual: t_i uses s(no i) in place of s
  • t_i is a little t-test of "does case i shift the intercept significantly?"; use level α/n (see the sketch below)
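
A sketch of both kinds of residuals for a fitted model fit, with lm.influence() supplying the leave-one-out s(no i):

    r <- resid(fit)
    h <- lm.influence(fit)$hat
    s <- summary(fit)$sigma

    z  <- r / (s * sqrt(1 - h))                         # standardized residuals
    ti <- r / (lm.influence(fit)$sigma * sqrt(1 - h))   # studentized residuals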

21
Cook's Distance
  • Cook's distance measures influence
  • D_i = z_i^2 h_i / (p (1 - h_i))
  • D_i increases as z_i increases
  • When h_i is small, D_i is small; as h_i -> 1, D_i gets huge
  • This effect is on all the b's, not just one
  • Look for D_i > 1, or maybe D_i > 4/n (see the sketch below)
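
Continuing the residual sketch above (z and h as computed there):

    p <- length(coef(fit))
    D <- z^2 * h / (p * (1 - h))    # Cook's distance

    plot(D, type = "h")             # spike plot of the D_i
    which(D > 1)                    # serious trouble
    which(D > 4 / length(D))        # the milder flag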

22
S-Plus Interlude
  • plot.lm() gives a Cook's distance plot
  • lm.influence() gives leverages, new b's, new s's; dfbetas() is available
  • Q: What to do about influential points?
  • A: Do they change your conclusions?

23
Proportional Leverage Plot
  • This is the leverage plot (the relationship of y to one x, adjusting for the other x's), but...
  • Points have areas proportional to DFBETA (adjusted to lie between 1 and 100)
  • That area is the number of SDs by which that case affects the line
  • Shows influence problems, and maybe curvilinearity or heteroscedasticity