Inference About Regression Coefficients

1
  • Inference About Regression Coefficients

2
BENDRIX.XLS
  • This is a continuation of the Bendrix
    manufacturing example from the previous chapter.
  • As before, the response variable is Overhead and
    the explanatory variables are MachHrs and
    ProdRuns.
  • The data are contained in this file.
  • What inferences can we make about the regression
    coefficients?

3
Multiple Regression Output
  • We obtain the output from using StatPro's
    Multiple Regression procedure.

4
Multiple Regression Output -- continued
  • Regression coefficients estimate the true, but
    unobservable, population coefficients.
  • The standard error of bi indicates the accuracy
    of these point estimates.
  • For example, the estimated effect on Overhead of
    a one-unit increase in MachHrs is 43.536.
  • We are 95% confident that the coefficient is
    between 36.357 and 50.715. Similar statements can
    be made for the coefficient of ProdRuns and the
    intercept term.
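The interval above is just the point estimate plus or minus a t-multiple times the standard error. A minimal sketch of that arithmetic, using the rule-of-thumb multiplier of 2; the standard error of 3.59 is a hypothetical value inferred from the quoted interval, not a number shown in the output:

```python
# Rule-of-thumb 95% confidence interval for a regression coefficient:
# estimate +/- 2 * (standard error of the estimate).
b_machhrs = 43.536   # point estimate of the MachHrs coefficient (from the slide)
se_b = 3.59          # assumed standard error (inferred, not shown on the slide)

lower = b_machhrs - 2 * se_b
upper = b_machhrs + 2 * se_b
print(f"95% CI roughly ({lower:.3f}, {upper:.3f})")
```

The exact t-multiple from the output would differ slightly from 2, which is why the endpoints only approximately match the quoted interval.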

5
  • Multicollinearity

6
The Problem
  • We want to explain a person's height by means of
    foot length.
  • The response variable is Height, and the
    explanatory variables are Right and Left, the
    length of the right foot and the left foot,
    respectively.
  • What can occur when we regress Height on both
    Right and Left?

7
Multicollinearity
  • The relationship between the explanatory variable
    X and the response variable Y is not always
    accurately reflected in the coefficient of X; it
    depends on which other Xs are included or not
    included in the equation.
  • This is especially true when there is a linear
    relationship between two or more explanatory
    variables, in which case we have
    multicollinearity.
  • By definition, multicollinearity is the presence
    of a fairly strong linear relationship between
    two or more explanatory variables, and it can
    make estimation difficult.

8
Solution
  • Admittedly, there is no need to include both
    Right and Left in an equation for Height - either
    one would do - but we include both to make a
    point.
  • It is likely that there is a large correlation
    between height and foot size, so we would expect
    this regression equation to do a good job.
  • The R2 value will probably be large. But what
    about the coefficients of Right and Left? Here is
    a problem.

9
Solution -- continued
  • The coefficient of Right indicates the right
    foot's effect on Height in addition to the effect
    of the left foot. This additional effect is
    probably minimal. That is, after the effect of
    Left on Height has already been taken into
    account, the extra information provided by Right
    is probably minimal. But it goes the other way
    also. The extra effect of Left, in addition to
    that provided by Right, is probably minimal.

10
HEIGHT.XLS
  • To show what can happen numerically, we generated
    a hypothetical data set of heights and left and
    right foot lengths in this file.
  • We did this so that, except for random error,
    height is approximately 32 plus 3.2 times foot
    length (all expressed in inches).
  • As shown in the table to the right, the
    correlations between Height and either Right or
    Left in our data set are quite large, and the
    correlation between Right and Left is very close
    to 1.

11
Solution -- continued
  • The regression output when both Right and Left
    are entered in the equation for Height appears in
    this table.

12
Solution -- continued
  • This output tells a somewhat confusing story.
  • The multiple R and the corresponding R2 are about
    what we would expect, given the correlations
    between Height and either Right or Left.
  • In particular, the multiple R is close to the
    correlation between Height and either Right or
    Left. Also, the se value is quite good. It
    implies that predictions of height from this
    regression equation will typically be off by only
    about 2 inches.

13
Solution -- continued
  • However, the coefficients of Right and Left are
    not at all what we might expect, given that we
    generated heights as approximately 32 plus 3.2
    times foot length.
  • In fact, the coefficient of Left has the wrong
    sign - it is negative!
  • Besides this wrong sign, the tip-off that there
    is a problem is that the t-value of Left is quite
    small and the corresponding p-value is quite
    large.

14
Solution -- continued
  • Judging by this, we might conclude that Height
    and Left are either not related or are related
    negatively. But we know from the table of
    correlations that both of these are false.
  • In contrast, the coefficient of Right has the
    correct sign, and its t-value and associated
    p-value do imply statistical significance, at
    least at the 5% level.
  • However, this happened mostly by chance; slight
    changes in the data could change the results
    completely.

15
Solution -- continued
  • The problem is that although both Right and Left
    are clearly related to Height, it is impossible
    for the least squares method to distinguish their
    separate effects.
  • Note that the regression equation does estimate
    the combined effect fairly well; the sum of the
    coefficients is 3.178, which is close to the
    coefficient of 3.2 we used to generate the data.
  • Therefore, the estimated equation will work well
    for predicting heights. It just does not have
    reliable estimates of the individual coefficients
    of Right and Left.
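The instability of the individual coefficients, alongside a stable combined effect, is easy to reproduce. A minimal simulation of the setup described above; the data are generated here (not read from HEIGHT.XLS), and all parameters are illustrative:

```python
import random

# Height depends on true foot length, but we regress on two noisy,
# almost perfectly correlated measurements of it (Right and Left).
random.seed(1)
n = 100
foot = [random.gauss(11, 1) for _ in range(n)]
right = [f + random.gauss(0, 0.1) for f in foot]
left = [f + random.gauss(0, 0.1) for f in foot]
height = [32 + 3.2 * f + random.gauss(0, 2) for f in foot]

def solve3(a, b):
    # Gauss-Jordan elimination for a 3x3 system with partial pivoting.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(3):
        piv = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]

# Least squares via the normal equations (X'X)b = X'y, X = [1, Right, Left].
X = [[1.0, r, l] for r, l in zip(right, left)]
xtx = [[sum(X[k][i] * X[k][j] for k in range(n)) for j in range(3)] for i in range(3)]
xty = [sum(X[k][i] * height[k] for k in range(n)) for i in range(3)]
b0, b_right, b_left = solve3(xtx, xty)

# Individual coefficients are unreliable, but their sum recovers the
# combined effect of about 3.2, just as the slides observe.
print(b_right, b_left, b_right + b_left)
```

Rerunning with a different seed typically changes `b_right` and `b_left` dramatically while leaving their sum near 3.2.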

16
Solution -- continued
  • To see what happens when either Right or Left is
    excluded from the regression equation, we show
    the results of simple regression.
  • When Right is the only variable in the equation,
    it becomes Predicted Height = 31.546 +
    3.195Right
  • The R2 and se values are 81.6% and 2.005, and the
    t-value and p-value for the coefficient of Right
    are now 21.34 and 0.000 - very significant.

17
Solution -- continued
  • Similarly, when Left is the only variable in the
    equation, it becomes Predicted Height =
    31.526 + 3.197Left
  • The R2 and se values are 81.1% and 2.033, and the
    t-value and p-value for the coefficient of Left
    are 20.99 and 0.000 - again very significant.
  • Clearly, both of these equations tell almost
    identical stories, and they are much easier to
    interpret than the equation with both Right and
    Left included.

18
  • Include/Exclude Decisions

19
CATALOGS1.XLS
  • This file contains data on 1000 customers who
    purchased mail-order products from the HyTex
    Company in 1998.
  • Recall from Example 3.11 that HyTex is a direct
    marketer of stereo equipment, personal computers,
    and other electronic products.
  • HyTex advertises entirely by mailing catalogs to
    its customers, and all of its orders are taken
    over the telephone.
  • We want to estimate and interpret a regression
    equation for Spent98 based on all of these
    variables.

20
The Data
  • The company spends a great deal of money on its
    catalog mailings, and it wants to be sure that
    this is paying off in sales.
  • For each customer there are data on the following
    variables:
  • Age: age in years
  • Gender: coded as 1 for males, 0 for females
  • OwnHome: coded as 1 if the customer owns a home,
    0 otherwise
  • Married: coded as 1 if the customer is currently
    married, 0 otherwise

21
The Data -- continued
  • Close: coded as 1 if the customer lives reasonably
    close to a shopping area that sells similar
    merchandise, 0 otherwise
  • Salary: combined annual salary of the customer
    and spouse (if any)
  • Children: number of children living with the
    customer
  • Customer97: coded as 1 if the customer purchased
    from HyTex during 1997, 0 otherwise
  • Spent97: total amount of purchases made from
    HyTex during 1997
  • Catalogs: number of catalogs sent to the customer
    in 1998
  • Spent98: total amount of purchases made from
    HyTex during 1998

22
The Data -- continued
  • With this much data, 1000 observations, we can
    certainly afford to set aside part of the data
    set for validation.
  • Although any split could be used, let's base the
    regression on the first 250 observations and use
    the other 750 for validation.

23
The Regression
  • We begin by entering all of the potential
    explanatory variables.
  • Our goal then is to exclude variables that aren't
    necessary, based on their t-values and p-values.
    To do this we follow the Guidelines for Including
    / Excluding Variables in a Regression Equation.
  • The regression output with all explanatory
    variables included is provided on the following
    slide.

24
(No Transcript)
25
Analysis
  • This output indicates a fairly good fit. The R2
    value is 79.1% and se is about 424.
  • From the p-value column, we see that there are
    three variables, Age, OwnHome, and Married, that
    have p-values well above 0.05.
  • These are the obvious candidates for exclusion.
    It is often best to exclude one variable at a
    time starting with the variable with the highest
    p-value.
  • The regression output with all insignificant
    variables excluded is seen in the output on the
    next slide.
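The one-variable-at-a-time exclusion rule can be sketched as a simple loop. The p-values below are hypothetical placeholders (the actual output is on the slides, and in practice each removal requires refitting the regression and recomputing p-values):

```python
# Backward elimination: repeatedly drop the variable with the highest
# p-value until every remaining p-value is at or below the cutoff.
p_values = {
    "Age": 0.86, "OwnHome": 0.54, "Married": 0.41,      # hypothetical values
    "Gender": 0.01, "Close": 0.00, "Salary": 0.00,
    "Children": 0.00, "Customer97": 0.00, "Spent97": 0.00,
    "Catalogs": 0.00,
}

CUTOFF = 0.05
removed = []
while p_values:
    worst = max(p_values, key=p_values.get)
    if p_values[worst] <= CUTOFF:
        break
    removed.append(worst)   # drop the least significant variable...
    del p_values[worst]     # ...then (in practice) refit and recompute p-values

print(removed)   # variables excluded, highest p-value first
```

With these placeholder p-values the loop drops Age, OwnHome, and Married, matching the exclusions described above.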

26
(No Transcript)
27
Interpretation of Final Regression Equation
  • The coefficient of Gender implies that an average
    male customer spent about $130 less than the
    average female customer. Similarly, an average
    customer living close to stores with this type of
    merchandise spent about $288 less than those
    customers living far from stores.
  • The coefficient of Salary implies that, on
    average, about 1.5 cents of every salary dollar
    was spent on HyTex merchandise.

28
Interpretation of Final Regression Equation --
continued
  • The coefficient of Children implies that $158
    less was spent for every extra child living at
    home.
  • The Customer97 and Spent97 terms are somewhat
    more difficult to interpret.
  • First, both of these terms are 0 for customers
    who didn't purchase from HyTex in 1997.
  • For those that did, the terms become -724 +
    0.47Spent97
  • The coefficient 0.47 implies that each extra
    dollar spent in 1997 can be expected to
    contribute an extra 47 cents in 1998.

29
Interpretation of Final Regression Equation --
continued
  • The median spender in 1997 spent about $900. So
    if we substitute this for Spent97, we obtain
    -301.
  • Therefore, this median spender from 1997 can be
    expected to spend about $301 less in 1998 than
    a 1997 nonspender.
  • The coefficient of Catalogs implies that each
    extra catalog can be expected to generate about
    $43 in extra spending.
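The median-spender figure above is just the two 1997 terms evaluated at Spent97 = 900:

```python
# For a 1997 customer, the Customer97 and Spent97 terms of the equation
# contribute -724 + 0.47 * Spent97 (coefficients from the slides).
def spent97_adjustment(spent97):
    return -724 + 0.47 * spent97

print(spent97_adjustment(900))   # the median 1997 spender: about -301
```

A customer who spent nothing in 1997 but was still a 1997 customer gets the full -724, while heavier 1997 spenders recover part of it at 47 cents on the dollar.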

30
Cautionary Notes
  • When we validate this final regression equation
    with the 750 remaining customers, using the
    procedure from Section 11.7, we find R2 and se
    values of 75.7% and 485.
  • These aren't bad. They show little deterioration
    from the values based on the original 250
    customers.
  • We haven't tried all possibilities yet. We
    haven't tried nonlinear or interaction variables,
    nor have we looked at different coding schemes;
    we haven't checked for nonconstant error variance
    or looked at potential effects of outliers.

31
  • The Partial F Test

32
BANK.XLS
  • Recall from Example 11.3 that the Fifth National
    Bank has 208 employees.
  • The data for these employees are stored in this
    file.
  • In the previous chapter we ran several
    regressions for Salary to see whether there is
    convincing evidence of salary discrimination
    against females.
  • We will continue this analysis here.

33
Analysis Overview
  • First, we will regress Salary versus the Female
    dummy, YrsExper, and the interaction between
    Female and YrsExper, labeled Fem_YrsExper. This
    will be the reduced equation.
  • Then we'll see whether the JobGrade dummies Job_2
    to Job_6 add anything significant to the reduced
    equation. If so, we will then see whether the
    interactions between the Female dummy and the
    JobGrade dummies, labeled Fem_Job2 to Fem_Job6,
    add anything significant to what we already have.

34
Analysis Overview -- continued
  • If so, we'll finally see whether the education
    dummies Ed_2 to Ed_5 add anything significant to
    what we already have.

35
Solution
  • First, note that we created all of the dummies
    and interaction variables with StatPro's Data
    Utilities procedures.
  • Also, note that we have used three sets of
    dummies, for gender, job grade, and education
    level.
  • When we use these in a regression equation, the
    dummy for one category of each should always be
    excluded; it is the reference category. The
    reference categories we have used are male, job
    grade 1, and education level 1.

36
Solution -- continued
  • The output for the smallest equation using
    Female, YrsExper, and Fem_YrsExper as explanatory
    variables is shown here.

37
Solution -- continued
  • We're off to a good start. These three variables
    already explain 63.9% of the variation of Salary.
  • The output for the next equation, which adds the
    explanatory variables Job_2 to Job_6, is on the
    next slide.
  • This equation appears much better. For example,
    R2 has increased to 81.1%. We check whether it is
    significantly better with the partial F test in
    rows 26-30.

38
(No Transcript)
39
Solution -- continued
  • The degrees of freedom in cell C28 is the same as
    the value in cell C12, the degrees of freedom for
    SSE.
  • Then we calculate the F-ratio in cell C29 with
    the formula =((Reduced!D12-D12)/C27)/E12, where
    Reduced!D12 refers to SSE for the reduced
    equation from the Reduced sheet.
  • Finally, we calculate the corresponding p-value
    in cell C30 with the formula =FDIST(C29,C27,C28).
    It is practically 0, so there is no doubt that
    the job grade dummies add significantly to the
    explanatory power of the equation.
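The spreadsheet's F-ratio formula implements the standard partial F statistic. A sketch with hypothetical SSE values standing in for the cells referenced above:

```python
# Partial F test for a block of k extra variables:
# F = ((SSE_reduced - SSE_complete) / k) / (SSE_complete / df_complete)
def partial_f(sse_reduced, sse_complete, k, df_complete):
    return ((sse_reduced - sse_complete) / k) / (sse_complete / df_complete)

# Hypothetical numbers: the complete equation adds 5 variables and cuts
# SSE from 100 to 60 with 200 degrees of freedom remaining.
f_ratio = partial_f(sse_reduced=100.0, sse_complete=60.0, k=5, df_complete=200)
print(round(f_ratio, 2))
```

The p-value would then come from the F distribution with (k, df_complete) degrees of freedom, which is exactly what Excel's FDIST call on the slide computes.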

40
Solution -- continued
  • Do the interactions between the Female dummy and
    the job dummies add anything more?
  • We again use the partial F test, but now the
    previous complete equation becomes the new
    reduced equation, and the equation that includes
    the new interaction terms becomes the new
    complete equation.
  • The output for this new complete equation is
    shown on the next slide.
  • We perform the partial F test in rows 31-35
    exactly as before. The formula in C34 is
    =((Complete!D12-D12)/C32)/E12.

41
(No Transcript)
42
Solution -- continued
  • Again the p-value is extremely small, so there is
    no doubt that the interaction terms add
    significantly to what we already had.
  • Finally, we add the education dummies.
  • The resulting output is shown on the next slide.
    We see how the terms reduced and complete are
    relative.
  • This output now corresponds to the complete
    equation, and the previous output corresponds to
    the reduced equation.

43
(No Transcript)
44
Solution -- continued
  • The formula in cell C38 for the F-ratio is now
    =((MoreComplete!D12-D12)/C36)/E12. The R2 value
    increased from 84.0% to 84.7%. Also, the p-value
    is not extremely small.
  • According to the partial F test, this increase is
    not quite enough to qualify for statistical
    significance at the 5% level.
  • Based on this evidence, there is not much to gain
    from including the education dummies in the
    equation, so we would probably elect to exclude
    them.

45
Concluding Comments
  • First, the partial F test is the formal test of
    significance for an extra set of variables; many
    users instead look only at the R2 and/or se
    values to check whether extra variables are doing
    a good job.
  • Second, if the partial F test shows that a block
    of variables is significant, it does not imply
    that each variable in this block is significant.
    Some of these variables can have low t-values.

46
Concluding Comments -- continued
  • Third, producing all of these outputs and doing
    the partial F tests is a lot of work. Therefore,
    we included a Block option in StatPro to make
    life easier. To run the analysis in this example,
    use the StatPro/Regression analysis/Block menu
    item. After selecting Salary as the response
    variable, we see this dialog box.

47
Concluding Comments -- continued
  • We want four blocks of explanatory variables, and
    we want a given block to enter only if it passes
    the partial F test at the 5% level. In later
    dialog boxes we specify the explanatory
    variables. Once we have specified all this, the
    regression calculations are done in stages. The
    output appears on the next two slides, spanning
    two figures. Note that the output for Block 4 has
    been left off because it did not pass the F test
    at the 5% level.

48
(No Transcript)
49
(No Transcript)
50
Concluding Comments -- continued
  • Finally, we have concentrated on the partial F
    test and statistical significance in this
    example. We don't want you to lose sight,
    however, of the bigger picture. Once we have
    decided on a final regression equation, we need
    to analyze its implications for the problem at
    hand.
  • In this case the bank is interested in possible
    salary discrimination against females, so we
    should interpret the final equation in those
    terms. Our point is simply that you shouldn't get
    so caught up in the details of statistical
    significance that you lose sight of the original
    purpose of the analysis!

51
  • Outliers

52
Questions
  • Of the 208 employees at Fifth National Bank, are
    there any obvious outliers?
  • In what sense are they outliers?
  • Does it matter to the regression results,
    particularly those concerning gender
    discrimination, whether the outliers are removed?

53
BANK.XLS
  • There are several places we could look for
    outliers.
  • An obvious place is the Salary variable.
  • The boxplot shown here reveals that several
    employees make substantially more in salary than
    most of the employees.

54
Solution
  • We could consider these outliers and remove them,
    arguing perhaps that these are senior managers
    who shouldn't be included in the discrimination
    analysis.
  • We leave it to you to check whether the
    regression results are any different with these
    high salary employees than without them.
  • Another place to look is at the scatterplot of
    the residuals versus the fitted values. This type
    of plot shows points with abnormally large
    residuals.

55
Solution -- continued
  • For example, we ran the regression with Female,
    YrsExper, Fem_YrsExper, and the five job grade
    dummies, and we obtained the output and
    scatterplot shown here.

56
(No Transcript)
57
Solution -- continued
  • This scatterplot has several points that could be
    considered outliers, but we focus on the point
    identified in the figure.
  • The residual for this point is approximately -21.
  • Given that the se for this regression is
    approximately 5, this residual is over four
    standard errors below 0 - quite a lot.
  • This person turns out to be unusual, and special
    circumstances can explain the anomaly.
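The check above can be sketched as a standardized-residual calculation, using the slide's approximate numbers (salary in $1000s):

```python
# Compare a residual to the regression's standard error of estimate.
def standardized_residual(residual, se):
    return residual / se

z = standardized_residual(-21, 5)     # residual about -21, se about 5
is_outlier = abs(z) > 3               # a common rule of thumb; the cutoff is a judgment call
print(z, is_outlier)
```

At more than four standard errors below zero, this point would be flagged under essentially any reasonable cutoff.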

58
Solution -- continued
  • If we delete this employee and rerun the
    regression with the same variables, we obtain the
    output shown here.

59
Solution -- continued
  • Now, recalling that gender discrimination is the
    key issue in this example, we compare the
    coefficients of Female and Fem_YrsExper in the
    two outputs.
  • The coefficient of Female has dropped from 6.063
    to 4.353. In words, the Y-intercept for the
    female regression line used to be about $6000
    higher than for the male line; now it's only
    about $4350 higher.
  • More importantly, the coefficient of Fem_YrsExper
    has changed from -1.021 to -0.721. This
    coefficient indicates how much less steep the
    female line for Salary versus YrsExper is than
    the male line.

60
Solution -- continued
  • So a change from -1.021 to -0.721 indicates less
    discrimination against females now than before.
    In other words, this unusual female employee
    accounts for a good bit of the discrimination
    argument - although a strong argument still
    exists even without her.

61
  • Prediction

62
Questions
  • Consider the following three male employees at
    Fifth National:
  • Employee 5 makes $29,000, is in job grade 1, and
    has 3 years of experience at the bank.
  • Employee 156 makes $45,000, is in job grade 4,
    and has 6 years of experience at the bank.
  • Employee 198 makes $60,000, is in job grade 6,
    and has 12 years of experience at the bank.
  • Using a regression equation for Salary that
    includes the explanatory variables Female,
    YrsExper, Fem_YrsExper, and the job grade dummies
    Job_2 to Job_6, check that the predicted salaries
    for these three employees are close to their
    actual salaries.

63
Questions -- continued
  • Then predict the salaries these employees would
    obtain if they were females.
  • How large are the discrepancies?
  • When estimating the equation, exclude the last
    employee, employee 208, whom we diagnosed as an
    outlier in Section 14.9.

64
Solution
  • The analysis appears on the next slide.
  • The top part includes the variables we need for
    this analysis.
  • Note how employee 208 has been separated from the
    rest of the data, so that she is not included in
    the regression analysis.
  • The usual regression output is not shown, but the
    standard error of estimate and estimated
    coefficients have been copied to cell B216 and
    the range B218:J218.

65
(No Transcript)
66
Solution -- continued
  • The values for male employees 5, 156, and 198
    have been copied to the range B222:B224.
  • We can then substitute their values into the
    regression equation to obtain their predicted
    salaries in column A.
  • The formula in cell A222, for example,
    is =B218+SUMPRODUCT(C218:J218,C222:J222)
  • Clearly the predictions are quite good for these
    three employees. The worst prediction is off by
    less than $2000.
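That SUMPRODUCT formula is just the intercept plus a dot product of coefficients and explanatory values. A sketch with hypothetical coefficients (the spreadsheet's actual numbers live in row 218):

```python
# Prediction = intercept + sum over (coefficient * explanatory value).
def predict(intercept, coefs, values):
    return intercept + sum(c * v for c, v in zip(coefs, values))

# Illustrative coefficients for Female, YrsExper, Fem_YrsExper, and
# Job_2..Job_6 (salary in $1000s; values are NOT the slide's output).
coefs = [4.35, 1.0, -0.72, 2.0, 4.0, 8.0, 13.0, 20.0]
values = [0, 3, 0, 0, 0, 0, 0, 0]   # a male, job grade 1, 3 years' experience
print(predict(24.0, coefs, values))   # hypothetical intercept of 24
```

With these made-up numbers the male employee's prediction is driven entirely by the intercept and the YrsExper term, since all of his dummies are 0.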

67
Solution -- continued
  • To see what would happen if these employees were
    females, we need to adjust the values of the
    explanatory variables Female and Fem_YrsExper.
  • For each employee in rows 227-229, the value of
    Female becomes 1 and the value of Fem_YrsExper
    becomes the same as YrsExper. Copying the
    formula in A222 down to these rows gives the
    predicted salary for the females.
  • One way to compare females to males is to enter
    the formula =(A227-B222)/B216 in cell B227 and
    copy it down.

68
Solution -- continued
  • This is the number of standard errors the
    predicated female salary is above (if positive)
    or below (if negative) the actual male salary.
  • As we discussed earlier with this data set,
    females with only a few years experience actually
    tend to make more than males. But the opposite
    occurs for employees with many years of
    experience.
  • For example, male employee 198 is earning just
    about the regression equation predicts he should
    earn. But if he were female, we would predict a
    salary about 4500 below the male, almost a full
    standard error lower.
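The comparison metric above is a simple ratio. A sketch with illustrative numbers (salary in $1000s, se of about 5, as on the earlier slides):

```python
# How many standard errors the predicted female salary sits from the
# actual male salary; negative means the female prediction is lower.
def gap_in_std_errors(predicted_female, actual_male, se):
    return (predicted_female - actual_male) / se

# Hypothetical case: $4500 gap against an se of $5000.
print(round(gap_in_std_errors(55.5, 60.0, 5.0), 2))   # -0.9
```

A gap near one standard error, as for employee 198, is large relative to the equation's typical prediction error.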

69
  • Prediction

70
The Problem
  • Besides the 50 regions in the data set, Pharmex
    does business in five other regions, which have
    promotional expense indexes of 114, 98, 84, 122,
    and 101.
  • Find the predicted Sales and a 95% prediction
    interval for each of these regions.
  • Also, find the mean Sales for all regions with
    each of these values of Promote, along with 95%
    confidence intervals for these means.

71
PHARMEX.XLS
  • This example cannot be solved with StatPro, but
    it is relatively easy with Excel's built-in
    functions.
  • We illustrate the procedure in this file, shown
    on the next slide.
  • The original data appear in columns B and C. We
    use the range names SalesOld and PromoteOld for
    the data in these columns.
  • The new regions appear in rows 9-13. Their given
    values of Promote are in the range G9:G13, which
    we name PromoteNew.

72
(No Transcript)
73
Solution
  • To obtain the predicted sales for these regions,
    we use Excel's TREND function by highlighting the
    range H9:H13, typing the formula
    =TREND(SalesOld,PromoteOld,PromoteNew), and
    pressing Ctrl-Shift-Enter.
  • This substitutes the new values of the
    explanatory variable (in the third argument) into
    the regression equation based on the data from
    the first two arguments.

74
Solution -- continued
  • We can then use these same predictions in rows
    19-23 for the mean sales values.
  • For example, we predict the same Sales value of
    112.03 for a single region with Promote equal to
    114 as for the mean of all regions with this
    value of Promote.
  • The approximate standard error of prediction for
    any individual value is se, calculated in cell H6
    with the formula STEYX(SalesOld,PromoteOld)

75
Solution -- continued
  • The more exact standard error of prediction
    depends on the value of Promote. To calculate it
    we enter the formula
    =H6*SQRT(1+1/50+(G9-AVERAGE(PromoteOld))^2
    /(48*STDEV(PromoteOld)^2))
    in cell I9 and copy it down through cell I13.
  • We then calculate the lower and upper limits of
    95% prediction intervals in columns J and K.
    These use the t-multiple in cell I3, obtained
    with the formula =TINV(0.05,73).
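The spreadsheet formula follows the textbook expression for the standard error of predicting an individual value: se times the square root of 1 + 1/n + a term that grows as x0 moves away from the mean. A sketch with illustrative numbers (the denominator constant in the spreadsheet may differ slightly from the (n - 1) used here):

```python
import math

# Exact standard error of prediction for a new observation at x0:
# se * sqrt(1 + 1/n + (x0 - xbar)^2 / ((n - 1) * sx^2))
def se_prediction(se, n, x0, xbar, sx):
    return se * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / ((n - 1) * sx ** 2))

# At x0 = xbar the extra term vanishes and the result is barely above se.
se_pred = se_prediction(se=1.0, n=50, x0=100.0, xbar=100.0, sx=10.0)
print(round(se_pred, 4))
```

The further the new Promote value sits from the sample mean, the wider the resulting prediction interval, which is exactly why the spreadsheet computes a separate value in each of rows 9-13.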

76
Solution -- continued
  • The formulas in cells J9 and K9 are then
    =H9-I3*I9 and =H9+I3*I9, which we copy down to
    row 13.
  • The calculations for the mean predictions in rows
    19-23 are almost identical. The only difference
    is that the approximate standard error is se
    divided by the square root of 75, calculated in
    H16.
  • The more exact standard errors in column I are
    then calculated by entering the formula
    =H6*SQRT(1/50+(G9-AVERAGE(PromoteOld))^2
    /(48*STDEV(PromoteOld)^2))
    in cell I19 and copying it down.

77
Conclusions
  • We have gone through these rather tedious
    calculations to make several points.
  • First, the approximate standard errors (se for an
    individual prediction and se divided by the
    square root of the sample size for a mean
    prediction) are usually quite accurate. This is
    fortunate because the exact standard errors are
    difficult to calculate and are not always given
    in statistical software packages.
  • Second, a simple rule of thumb for calculating
    individual 95% prediction intervals is to go out
    an amount 2se on either side of the predicted
    value. Again, this is not exactly correct, but as
    the calculations in this example indicate, it
    works quite well.

78
Conclusions -- continued
  • Finally, we see from the wide prediction
    intervals how much uncertainty remains.
  • The reason is the relatively large standard error
    of estimate, se.
  • Contrary to what you may believe, this is not a
    sample-size problem.
  • The whole problem is that Promote is not highly
    correlated with Sales. The only way to decrease
    se and get more accurate predictions is to find
    other explanatory variables that are more closely
    related to Sales.