Analysis of Covariance - PowerPoint PPT Presentation

Transcript and Presenter's Notes

1
Analysis of Covariance
  • Harry R. Erwin, PhD
  • School of Computing and Technology
  • University of Sunderland

2
Resources
  • Crawley, MJ (2005) Statistics: An Introduction
    Using R. Wiley.
  • Freund, RJ, and WJ Wilson (1998) Regression
    Analysis. Academic Press.
  • Gonick, L, and W Smith (1993) The Cartoon
    Guide to Statistics. HarperResource (for fun).

3
Introduction
  • Analysis of covariance (ANCOVA) combines
    regression and ANOVA
  • Response variable is continuous
  • One or more explanatory factors (the treatments)
  • One or more continuous explanatory variables
  • Usually done in a treatment study where
    explanatory variables are being included to
    improve the basic treatment/control comparison.
  • Interaction between the slope for an explanatory
    variable and the treatment is not wanted. (Life
    is hard.)
  • Maximal model includes estimating slopes and
    intercepts for each combination of the
    explanatory factors.
  • Model simplification is the goal.

4
Context
  • The goal of analysis of covariance is to reduce
    the error variance. This increases the power of
    tests and narrows the confidence intervals.
  • There may be measurable variables that affect the
    response but have nothing to do with the factors
    (treatments) in the experiment.
  • Analysis of covariance adjusts for those
    variables.
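
The variance-reduction claim above can be illustrated with a small simulation. This is a minimal sketch on made-up data, not the data set used later in these slides: the variable names (`treatment`, `covariate`, `response`) and all numeric settings are assumptions chosen for illustration.

```r
# Simulated data (hypothetical; not the slides' data set)
set.seed(1)
n <- 20
treatment <- factor(rep(c("control", "treated"), each = n))
covariate <- runif(2 * n, min = 5, max = 10)   # generated independently of the treatment
response  <- 10 + 5 * (treatment == "treated") + 3 * covariate +
  rnorm(2 * n, sd = 2)

anova_fit  <- lm(response ~ treatment)               # ignores the covariate
ancova_fit <- lm(response ~ treatment + covariate)   # adjusts for the covariate

# Residual standard error with and without the covariate
sigma_anova  <- summary(anova_fit)$sigma
sigma_ancova <- summary(ancova_fit)$sigma

# Width of the 95% confidence interval for the treatment effect
ci_width <- function(fit) unname(diff(confint(fit)["treatmenttreated", ]))
widths <- c(anova = ci_width(anova_fit), ancova = ci_width(ancova_fit))

print(c(sigma_anova = sigma_anova, sigma_ancova = sigma_ancova))
print(widths)
```

With a covariate that really drives the response, the residual standard error and the interval width should both shrink in the adjusted fit.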

5
The Covariance Model
  • For one treatment factor and one continuous
    control variable, $x_{ij}$, the model is
  • $y_{ij} = \beta_0 + \tau_i + \beta_1 x_{ij} + \varepsilon_{ij}$
  • This says the response is a constant ($\beta_0$)
    plus a second constant ($\tau_i$, depending on
    the factor level) plus a third constant
    ($\beta_1$) times the control variable (or
    covariate) plus an error ($\varepsilon_{ij}$).
  • The interest is in the differences between the
    treatment effects (the $\tau_i$), not in
    $\beta_0$ or $\beta_1$. You want to be able to
    reduce your model.
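
Written out for one factor level at a time, the model describes a set of parallel lines, one per level. The short derivation below assumes the errors have mean zero and uses $\tau_i$ for the factor effect; the symbol is an assumption, since the transcript lost the original Greek letters.

```latex
\begin{aligned}
E[y_{ij}] &= \beta_0 + \tau_i + \beta_1 x_{ij}
           = (\beta_0 + \tau_i) + \beta_1 x_{ij} \\
E[y_{ij}] - E[y_{kl}] &= \tau_i - \tau_k
  \qquad \text{when } x_{ij} = x_{kl}
\end{aligned}
```

Each level has its own intercept and all levels share one slope, so at any fixed value of the covariate the expected difference between two levels is just the difference of their treatment effects.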

6
Assumptions in ANCOVA
  • The covariate $x_{ij}$ is not affected by the
    experimental factors.
  • The regression relationship measured by $\beta_1$
    must be the same for all factor levels.
  • You need to verify these assumptions.

7
General Approach to ANCOVA
  • First look at the effect of $x_{ij}$. If it isn't
    significant, do an ANOVA and be done with it.
  • Check that $x_{ij}$ is not significantly
    affected by the factor levels.
  • Test that $\beta_1$ does not differ significantly
    between factor levels. A difference would be an
    interaction (a bad thing) between the factors and
    the covariates.
  • Order matters: the covariates come after the
    factors in the model because they're less
    important.
  • If both tests pass, do the ANCOVA.
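
The checks above can be sketched in R as follows. The data are simulated and the names (`group`, `x`, `y`) and the 0.05 cut-off are assumptions made for illustration, not part of the slides.

```r
# Simulated data (hypothetical): three factor levels, one covariate
set.seed(2)
n <- 15
group  <- factor(rep(c("A", "B", "C"), each = n))
x      <- rnorm(3 * n, mean = 50, sd = 10)   # covariate, independent of group
effect <- c(A = 0, B = 4, C = 8)
y <- 20 + unname(effect[as.character(group)]) + 0.5 * x + rnorm(3 * n, sd = 3)

# 1. Is the covariate related to the response at all?
p_covariate <- anova(lm(y ~ x))["x", "Pr(>F)"]

# 2. Is the covariate affected by the factor?
p_x_by_group <- anova(lm(x ~ group))["group", "Pr(>F)"]

# 3. Do the slopes differ between levels (factor-by-covariate interaction)?
full    <- lm(y ~ group * x)
reduced <- lm(y ~ group + x)
p_interaction <- anova(reduced, full)[2, "Pr(>F)"]

# 4. If checks 2 and 3 show no evidence of a problem, report the ANCOVA
if (p_x_by_group > 0.05 && p_interaction > 0.05) {
  print(anova(reduced))
}
```

Note that the factor is listed before the covariate in both model formulas, following the ordering advice on this slide.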

8
Example
  • Response variable is weight
  • Explanatory factor is sex
  • Continuous explanatory variable is age.
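  • $\text{weight}_{\text{male}} = a_{\text{male}} + b_{\text{male}} \times \text{age}$
  • $\text{weight}_{\text{female}} = a_{\text{female}} + b_{\text{female}} \times \text{age}$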
  • Six possible models.
  • The goal is to eliminate as many parameters as
    possible.
  • Reduce the model until all parameters are
    significant.
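
The slide does not list the six models. One common enumeration, assumed here, runs from separate slopes and intercepts for each sex down to a single grand mean. The sketch below writes each one as an R model formula and counts its parameters; the data are simulated purely so the code runs.

```r
# Simulated data (hypothetical), only so that the formulas can be fitted
set.seed(3)
n <- 30
sex    <- factor(rep(c("female", "male"), each = n))
age    <- runif(2 * n, min = 20, max = 60)
weight <- 55 + 12 * (sex == "male") + 0.2 * age + rnorm(2 * n, sd = 4)

# Candidate models, from the maximal model down to the null model
models <- list(
  two_slopes_two_intercepts = weight ~ sex * age,
  one_slope_two_intercepts  = weight ~ sex + age,
  two_slopes_one_intercept  = weight ~ age + sex:age,
  one_line                  = weight ~ age,
  two_means_no_slope        = weight ~ sex,
  grand_mean                = weight ~ 1
)

# Number of estimated coefficients in each model
n_params <- sapply(models, function(f) length(coef(lm(f))))
print(n_params)
```

Model simplification then means moving down this list for as long as the dropped parameters are not significant.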

9
Book Example
  • Notes
  • Use of plots to get insight into the significance
    of explanatory variables.
  • Note use of lm() in the models. It produces the
    same results as aov(), but with a different
    report.
  • Order matters: non-orthogonal data!
  • Use of summary.aov()
  • Eliminate interactions first.
  • anova() used in comparisons.
  • summary.lm() to provide the parameter estimates

10
Background
  • This experiment studies the ability of a plant to
    regrow and produce seeds after grazing.
  • The pregrazing size is the diameter of the top of
    the rootstock
  • Grazing has two levels: grazed or ungrazed.
  • Response is weight of seeds produced at the end
    of the growing season.
  • Size of plant is believed to matter and also
    whether it was grazed.

11
Step 1
  • compensation <- read.table(..., header=T)
  • attach(compensation)
  • names(compensation)
  • [1] "Root" "Fruit" "Grazing"
  • par(mfrow=c(2,2))
  • plot(Root,Fruit)
  • plot(Grazing,Fruit)

12
Plot 1

13
Step 2
  • model <- lm(Fruit ~ Root*Grazing)  # wrong way: inflates Grazing sum of sqs!
  • summary.aov(model)
  •              Df  Sum Sq Mean Sq  F value    Pr(>F)
  • Root          1 16795.0 16795.0 359.9681 < 2.2e-16
  • Grazing       1  5264.4  5264.4 112.8316 1.209e-12
  • Root:Grazing  1     4.8     4.8   0.1031      0.75
  • Residuals    36  1679.6    46.7
  • model <- lm(Fruit ~ Grazing*Root)  # right way: Grazing is more important.
  • summary.aov(model)
  •              Df  Sum Sq Mean Sq  F value    Pr(>F)
  • Grazing       1  2910.4  2910.4  62.3795 2.262e-09
  • Root          1 19148.9 19148.9 410.4201 < 2.2e-16
  • Grazing:Root  1     4.8     4.8   0.1031      0.75
  • Residuals    36  1679.6    46.7

14
Check to see if the interaction term is important
  • model2 <- lm(Fruit ~ Grazing + Root)
  • anova(model,model2)  # use anova to compare models
  • Analysis of Variance Table
  • Model 1: Fruit ~ Grazing * Root
  • Model 2: Fruit ~ Grazing + Root  (simpler model)
  •   Res.Df     RSS Df Sum of Sq      F Pr(>F)
  • 1     36 1679.65
  • 2     37 1684.46 -1     -4.81 0.1031   0.75
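
The F statistic in this table can be recomputed by hand from the two residual sums of squares. The sketch below uses only the numbers printed on this slide; the variable names are illustrative.

```r
# Residual sums of squares and degrees of freedom from the table above
rss_full    <- 1679.65; df_full    <- 36   # model with the interaction
rss_reduced <- 1684.46; df_reduced <- 37   # model without the interaction

# Extra sum of squares per extra parameter, relative to the
# residual mean square of the fuller model
extra_ss <- rss_reduced - rss_full
extra_df <- df_reduced - df_full
f_stat   <- (extra_ss / extra_df) / (rss_full / df_full)

# Upper-tail probability of the F distribution
p_val <- pf(f_stat, extra_df, df_full, lower.tail = FALSE)

print(round(c(F = f_stat, p = p_val), 4))
```

A p-value this large says that dropping the interaction loses essentially no explanatory power, so the simpler model is kept.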

15
Report
  • summary.lm(model2)
  • Coefficients:
  •                 Estimate Std. Error t value Pr(>|t|)
  • (Intercept)     -127.829      9.664  -13.23 1.35e-15
  • GrazingUngrazed   36.103      3.357   10.75 6.11e-13
  • Root              23.560      1.149   20.51  < 2e-16
  • Residual standard error: 6.747 on 37 degrees of
    freedom
  • Multiple R-squared: 0.9291, Adjusted R-squared:
    0.9252
  • F-statistic: 242.3 on 2 and 37 DF, p-value: < 2.2e-16
  • Row 1 is the intercept for the factor level that
    comes first in the alphabet (Grazed as opposed to
    Ungrazed). Row 2 is the difference in intercepts,
    Ungrazed minus Grazed. Row 3 is the slope of the
    graph of seed production against rootstock size.
    Row 4 (present only if the interaction term is
    kept in the model) is the difference in slopes.
    (The interaction is not significant here!)
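
To make the reading of the rows concrete, the sketch below turns the three estimates into the two fitted lines and evaluates them at one root diameter. The coefficients are copied from the table above; the function name `predict_fruit` and the root diameter of 8 are hypothetical choices for illustration.

```r
# Estimates copied from the coefficient table
intercept_grazed <- -127.829   # row 1: intercept for Grazed
diff_ungrazed    <-   36.103   # row 2: Ungrazed minus Grazed
slope            <-   23.560   # row 3: common slope against Root

# Intercept for the Ungrazed line
intercept_ungrazed <- intercept_grazed + diff_ungrazed

# Fitted seed production for a given root diameter and grazing level
predict_fruit <- function(root, grazing = c("Grazed", "Ungrazed")) {
  grazing   <- match.arg(grazing)
  intercept <- if (grazing == "Grazed") intercept_grazed else intercept_ungrazed
  intercept + slope * root
}

print(intercept_ungrazed)              # -91.726
print(predict_fruit(8, "Grazed"))      # 60.651
print(predict_fruit(8, "Ungrazed"))    # 96.754
```

At any given root diameter the two predictions differ by exactly the row 2 estimate, which is what parallel lines mean.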

16
What's Going On?
  • sf <- split(Fruit,Grazing)
  • sr <- split(Root,Grazing)
  • plot(Root,Fruit,type="n",ylab="Seed production",xlab="Initial root diameter")
  • points(sr[[1]],sf[[1]],pch=16)
  • points(sr[[2]],sf[[2]])
  • plot(Root,Fruit,type="n",ylab="Seed production",xlab="Initial root diameter")
  • points(sr[[1]],sf[[1]],pch=16)
  • points(sr[[2]],sf[[2]])
  • abline(-127.829,23.56)
  • abline(-127.829+36.103,23.56,lty=2)

17
Plot 2

18
Suppose we ignored the initial root size?
  • tapply(Fruit,Grazing,mean)
  •   Grazed Ungrazed
  •  67.9405  50.8805   (the opposite of the true situation!)
  • summary(aov(Fruit~Grazing))
  •             Df  Sum Sq Mean Sq F value  Pr(>F)
  • Grazing      1  2910.4  2910.4  5.3086 0.02678 *
  • Residuals   38 20833.4   548.2
  • ---
  • Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
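
The reversal seen here can be reproduced on simulated data in which the grazed plants happen to have larger roots. This is a sketch under assumed settings: the data and every coefficient below are made up and are not the compensation data.

```r
# Simulated data (hypothetical): grazed plants start with larger roots
set.seed(5)
n <- 20
grazing <- factor(rep(c("Grazed", "Ungrazed"), each = n))
root    <- c(seq(7, 10, length.out = n),   # grazed plants
             seq(5, 8,  length.out = n))   # ungrazed plants
fruit   <- -100 + 30 * (grazing == "Ungrazed") + 20 * root +
  rnorm(2 * n, sd = 7)

# Raw comparison, ignoring root size
raw_means <- tapply(fruit, grazing, mean)
raw_diff  <- unname(raw_means["Ungrazed"] - raw_means["Grazed"])

# Comparison adjusted for root size
adj_diff <- unname(coef(lm(fruit ~ grazing + root))["grazingUngrazed"])

print(c(raw_diff = raw_diff, adj_diff = adj_diff))
```

The raw difference comes out negative because root size is confounded with grazing, while the adjusted difference recovers the positive effect built into the simulation.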

19
Order Matters for Non-Orthogonal Data!
  • The total variation in the response (SSY) is the
    sum of:
  • the variation explained by the treatment (SSA),
  • the variation explained by the covariate,
  • the variation explained by the interaction between
    the factor levels and the covariate (hopefully
    small), and
  • the unexplained (error) variation.
  • Since the factor levels and the covariate are
    correlated in non-orthogonal data, fitting the
    covariate first inflates the variation attributed
    to the treatment, potentially producing an
    invalid positive result.
  • So put the treatment variable first in the model.
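
The order dependence comes from R's `anova()` reporting sequential sums of squares: each term is credited with whatever it explains after the terms listed before it. A minimal sketch on simulated, deliberately non-orthogonal data (the names `g`, `x`, `y` and all settings are assumptions):

```r
# Simulated data (hypothetical): the covariate is correlated with the factor
set.seed(4)
n <- 20
g <- factor(rep(c("a", "b"), each = n))
x <- rnorm(2 * n, mean = ifelse(g == "b", 6, 4), sd = 1)
y <- 2 + 3 * x - 2 * (g == "b") + rnorm(2 * n, sd = 1)

# Sequential sums of squares under the two orderings
ss_factor_first    <- anova(lm(y ~ g + x))[, "Sum Sq"]
ss_covariate_first <- anova(lm(y ~ x + g))[, "Sum Sq"]
names(ss_factor_first)    <- c("g", "x", "Residuals")
names(ss_covariate_first) <- c("x", "g", "Residuals")

print(ss_factor_first)
print(ss_covariate_first)
```

The residual and total sums of squares are identical in both fits; only the split between the factor and the covariate changes.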

20
Because Order Matters!
  • Do you fit the categorical (treatment, T) or the
    continuous (control, L) explanatory variable
    first? With non-orthogonal data, order matters.
  • Use a logical order. Hence fit the treatment
    variable first. You're interested in the effect
    of the treatment, not of the control variable.
  • If the interaction between the treatment and
    control variables is significant, stop! It means
    the slopes differ significantly, which is a
    (nasty) problem.

21
Reading the Summary
summary.lm(model2)
Call:
lm(formula = Fruit ~ Grazing + Root)
Residuals:
     Min       1Q   Median       3Q      Max
-17.1920  -2.8224   0.3223   3.9144  17.3290
Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -127.829      9.664  -13.23 1.35e-15
GrazingUngrazed   36.103      3.357   10.75 6.11e-13
Root              23.560      1.149   20.51  < 2e-16
Residual standard error: 6.747 on 37 degrees of freedom
Multiple R-squared: 0.9291, Adjusted R-squared: 0.9252
F-statistic: 242.3 on 2 and 37 DF, p-value: < 2.2e-16

22
Using split()
  • Applies to a vector or dataframe.
  • sd <- split(d, f) splits the dataframe (or
    vector), d, based on the factor, f.
  • sd will be a list of vectors. Each vector in the
    list will correspond to a value of the factor (in
    alphabetical order).
  • Each vector in sd can be plotted using its own
    symbol to give insight into the differences
    between factors.
  • Book example.
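
A minimal sketch of `split()` on a toy vector. The values and level names are made up, and the result is called `groups` here rather than `sd` so that it does not mask R's built-in `sd()` function.

```r
# Toy data (hypothetical): six measurements and their factor levels
d <- c(10, 12, 30, 33, 11, 31)
f <- factor(c("low", "low", "high", "high", "low", "high"))

# One list element per factor level, in alphabetical order of the levels
groups <- split(d, f)

print(names(groups))   # "high" "low"
print(groups$high)     # 30 33 31
print(groups$low)      # 10 12 11
```

Each element can then be passed to `points()` with its own plotting symbol, as in the plotting code earlier in the deck.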

23
The Moral
  • If you have covariates, use them. They will
    improve your confidence intervals or identify
    that you have a problem.
  • Order matters (it always does in regression).
  • Start by removing the highest-order interaction
    terms.
  • Use a logical order.
  • If the treatment (categorical) interacts
    significantly with the control (continuous), stop!