Title: Chapter 12: Multiple Regression and Model Building
Where We've Been
- Introduced the straight-line model relating a dependent variable y to an independent variable x
- Estimated the parameters of the straight-line model using least squares
- Assessed the model estimates
- Used the model to estimate a value of y given x
Where We're Going
- Introduce a multiple-regression model to relate a variable y to two or more x variables
- Present multiple regression models with both quantitative and qualitative independent variables
- Assess how well the multiple regression model fits the sample data
- Show how analyzing the model residuals can help detect problems with the model and the necessary modifications
12.1 Multiple Regression Models
- Analyzing a Multiple-Regression Model
- Step 1: Hypothesize the deterministic portion of the model by choosing the independent variables x1, x2, ..., xk.
- Step 2: Estimate the unknown parameters β0, β1, β2, ..., βk.
- Step 3: Specify the probability distribution of ε and estimate the standard deviation σ of this distribution.
- Step 4: Check that the assumptions about ε are satisfied; if not, make the required modifications to the model.
- Step 5: Statistically evaluate the usefulness of the model.
- Step 6: If the model is useful, use it for prediction, estimation and other purposes.
- Assumptions about the Random Error ε
- The mean is equal to 0.
- The variance is equal to σ².
- The probability distribution is a normal distribution.
- Random errors are independent of one another.
12.2 The First-Order Model: Estimating and Making Inferences about the β Parameters
- A First-Order Model in k Quantitative Independent Variables
- E(y) = β0 + β1x1 + β2x2 + ... + βkxk
- where x1, x2, ..., xk are all quantitative variables that are not functions of other independent variables.
- The parameters are estimated by finding the values for the βs that minimize SSE = Σ(y - ŷ)², the sum of squared errors.
Only a truly talented mathematician (or geek) would choose to solve the necessary system of simultaneous linear equations by hand. In practice, computers are left to do the complicated calculations required by multiple regression models.
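As an illustration of what the computer does, the sketch below solves the least squares normal equations (X'X)b = X'y by Gauss-Jordan elimination. The function name `fit_least_squares` and the small data set are made up for this example; they are not from the text, and statistical software uses more numerically careful methods.

```python
def fit_least_squares(X, y):
    """Return [b0, b1, ..., bk] minimizing SSE, via the normal equations."""
    rows = [[1.0] + list(r) for r in X]           # prepend the intercept column
    p = len(rows[0])                              # number of beta parameters (k + 1)
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    for col in range(p):
        # Partial pivoting: bring the largest remaining entry onto the diagonal
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical, noise-free data generated from y = 1 + 2*x1 + 3*x2
X = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
betas = fit_least_squares(X, y)
```

Because the made-up data contain no random error, the estimates recover the generating coefficients; with real data they would only approximate the unknown βs.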
- A collector of antique clocks hypothesizes that the auction price can be modeled as y = β0 + β1x1 + β2x2 + ε, where y is the auction price, x1 is the age of the clock, and x2 is the number of bidders.
- Based on the data in Table 12.1, the least squares prediction equation is the fitted equation that minimizes SSE.
- The estimate for β1 is interpreted as the expected change in y given a one-unit change in x1, holding x2 constant.
- The estimate for β2 is interpreted as the expected change in y given a one-unit change in x2, holding x1 constant.
- Since it makes no sense to sell a clock of age 0 at an auction with no bidders, the intercept term has no meaningful interpretation in this example.
Test of an Individual Parameter Coefficient in the Multiple Regression Model
Test of the Parameter Coefficient on the Number of Bidders
- H0: β2 = 0
- Ha: β2 > 0
Since the test statistic t exceeds the critical value tα (t > tα), reject the null hypothesis.
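For reference, the usual form of this test can be written out as below. The standard-error symbol and the degrees of freedom n - (k + 1) are standard textbook notation assumed here rather than taken from the slides.

```latex
H_0:\ \beta_i = 0
\qquad
H_a:\ \beta_i > 0 \ \ (\text{or } \beta_i < 0,\ \text{or } \beta_i \neq 0)

t = \frac{\hat{\beta}_i}{s_{\hat{\beta}_i}},
\qquad
\text{df} = n - (k + 1)

\text{Reject } H_0 \text{ if } t > t_{\alpha}
\ \ (\text{one-tailed})
\quad \text{or} \quad
|t| > t_{\alpha/2}
\ \ (\text{two-tailed})
```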
A 100(1 - α)% Confidence Interval for a β Parameter
- For β1 in the antique clock model, holding the number of bidders constant, the 90% confidence interval tells us that we can be 90% sure that the mean auction price will rise between $11.20 and $14.28 for each 1-year increase in age.
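The general form of the interval, and the arithmetic linking it to the reported endpoints, can be sketched as follows. Only the endpoints 11.20 and 14.28 come from the text; the standard-error notation and the split into midpoint and half-width are assumptions made for illustration.

```latex
\hat{\beta}_i \pm t_{\alpha/2}\, s_{\hat{\beta}_i}

\hat{\beta}_1 = \frac{11.20 + 14.28}{2} = 12.74,
\qquad
t_{.05}\, s_{\hat{\beta}_1} = \frac{14.28 - 11.20}{2} = 1.54
```

Read this way, the point estimate of the age coefficient is 12.74, with a margin of 1.54 on either side at the 90% confidence level.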
12.3 Evaluating Overall Model Utility
- Rejecting H0: βi = 0 provides evidence of a linear relationship between y and xi.
- Failing to reject H0: βi = 0 can mean any of the following:
  - There may be no relationship between y and xi.
  - A Type II error occurred.
  - The relationship between y and xi is more complex than a straight-line relationship.
- The multiple coefficient of determination, R², measures how much of the overall variation in y is explained by the least squares prediction equation.
- High values of R² suggest a good model, but the usefulness of R² falls as the number of observations becomes close to the number of parameters estimated.
- The adjusted coefficient of determination, Ra², adjusts for the number of observations and the number of parameter estimates. It will always have a value no greater than R².
- Rejecting the null hypothesis of the global F-test (H0: β1 = β2 = ... = βk = 0) means that something in your model helps explain variations in y, but it may be that another model provides more reliable estimates and predictions.
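The three utility measures discussed above can be computed from two sums of squares. The sketch below uses the standard formulas; the function name `model_utility` and the input numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def model_utility(ss_yy, sse, n, k):
    """Return (R^2, adjusted R^2, global F) for a model with k x-variables.

    ss_yy : total variation in y, sum of (y - y_bar)^2
    sse   : sum of squared errors of the fitted model
    n     : sample size
    """
    df_error = n - (k + 1)                          # error degrees of freedom
    r2 = 1 - sse / ss_yy
    r2_adj = 1 - (n - 1) / df_error * (1 - r2)      # penalizes extra parameters
    f_stat = (r2 / k) / ((1 - r2) / df_error)       # global F statistic
    return r2, r2_adj, f_stat


# Hypothetical values: SSyy = 100, SSE = 20, n = 25 observations, k = 4 variables
r2, r2_adj, f_stat = model_utility(ss_yy=100.0, sse=20.0, n=25, k=4)
```

With these made-up inputs, R² is 0.80, the adjusted value is 0.76, and F is 20, which shows the adjusted measure sitting below R² as the text states.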
- A collector of antique clocks hypothesizes that the auction price can be modeled as y = β0 + β1x1 + β2x2 + ε.
- The global F-test shows that something in the model is useful, but the F-test can't tell us which x-variables are individually useful.
- Checking the Utility of a Multiple-Regression Model
- Use the F-test to conduct a test of the adequacy of the overall model.
- Conduct t-tests on the most important β parameters.
- Examine Ra² and 2s to evaluate how well the model fits the data.
12.4 Using the Model for Estimation and Prediction
- The model of antique clock prices can be used to predict sale prices for clocks of a certain age with a particular number of bidders.
- What is the mean sale price for all 150-year-old clocks with 10 bidders?
- What is the auction sale price for a single 150-year-old clock with 10 bidders?
The average value of all clocks with these characteristics can be found by using the statistical software to generate a confidence interval; the price of a single clock calls for a prediction interval instead (see Figure 12.7). In this case, the prediction interval indicates that we can be 95% sure that the price of a single 150-year-old clock sold at auction with 10 bidders will be between $1,154.10 and $1,709.30.
- What is the predicted sale price for a single 50-year-old clock with 2 bidders?
Since 50 years of age and 2 bidders are both outside of the range of values in our data set, any prediction using these values would be unreliable.
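A simple guard against this kind of extrapolation is to check each requested x-value against the range observed in the sample before predicting. The sketch below is one way to do that; the function name and the (age, bidders) pairs are made up for illustration and are not the values in Table 12.1.

```python
def within_sample_range(point, sample):
    """True if every coordinate of `point` lies inside the observed
    min..max of the corresponding variable in `sample`."""
    columns = list(zip(*sample))                  # one tuple per x-variable
    return all(min(col) <= value <= max(col)
               for value, col in zip(point, columns))


# Hypothetical (age, bidders) observations
sample = [(108, 5), (127, 13), (150, 9), (182, 11), (194, 6)]

ok_inside = within_sample_range((150, 10), sample)   # age and bidders both in range
ok_outside = within_sample_range((50, 2), sample)    # both below the observed minimums
```

Passing this check is necessary but not sufficient: a combination of x-values can lie inside each individual range and still be far from any observed combination.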
12.5 Model Building: Interaction Models
- In some cases, the impact of an independent variable xi on y will depend on the value of some other independent variable xk.
- Interaction models include the cross-products of independent variables as well as the first-order terms.
- In the antique clock auction example, assume the collector has reason to believe that the impact of age (x1) on price (y) varies with the number of bidders (x2).
- The model is now y = β0 + β1x1 + β2x2 + β3x1x2 + ε.
- The MINITAB results are reported in Figure 12.11 in the text.
Once the interaction term has passed the t-test,
it is unnecessary to test the individual
independent variables.
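To see what the interaction term does, note that in the interaction model the change in E(y) for a one-unit increase in x1 is β1 + β3x2, so it depends on x2. The sketch below checks this numerically; the coefficient values are hypothetical and are not the MINITAB estimates from the text.

```python
def expected_price(age, bidders, b0, b1, b2, b3):
    """E(y) for the interaction model y = b0 + b1*x1 + b2*x2 + b3*x1*x2."""
    return b0 + b1 * age + b2 * bidders + b3 * age * bidders


def age_slope(bidders, b1, b3):
    """Change in E(y) per one-year increase in age at a given number of bidders."""
    return b1 + b3 * bidders


# Hypothetical coefficients, for illustration only
coef = dict(b0=100.0, b1=2.0, b2=10.0, b3=0.5)

slope_4 = age_slope(4, coef["b1"], coef["b3"])     # slope with 4 bidders
slope_10 = age_slope(10, coef["b1"], coef["b3"])   # slope with 10 bidders
one_year = expected_price(151, 4, **coef) - expected_price(150, 4, **coef)
```

With these made-up numbers the age slope is 4 at 4 bidders and 7 at 10 bidders, which is why β1 alone no longer describes the effect of age once the interaction term is in the model.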
12.6 Model Building: Quadratic and Other Higher-Order Models
- A quadratic (second-order) model includes the square of an independent variable: y = β0 + β1x + β2x² + ε.
- This allows more complex relationships to be modeled.
- β1 is the shift parameter and β2 is the rate of curvature.
- Example 12.7 considers whether home size (x) impacts electrical usage (y) in a positive but decreasing way.
- The MINITAB results are shown in Figure 12.13.
- The results give the least squares prediction equation, the equation that minimizes SSE for the 10 observations.
- Since 0 is not in the range of the independent variable (a house of 0 ft²?), the estimated intercept is not meaningful.
- The positive estimate of β1 indicates a positive relationship, although the slope is not constant (we've estimated a curve, not a straight line).
- The negative estimate of β2 indicates that the rate of increase in power usage declines for larger homes.
- The Global F-Test
- H0: β1 = β2 = 0
- Ha: At least one of the coefficients ≠ 0
- The test statistic is F = 189.71, p-value near 0.
- Reject H0.
- t-Test of β2
- H0: β2 = 0
- Ha: β2 < 0
- The test statistic is t = -7.62, p-value = .0001 (two-tailed).
- The one-tailed p-value is .0001/2 = .00005.
- Reject the null hypothesis.
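A positive β1 with a negative β2 means the fitted curve rises at a decreasing rate: its slope at any x is β1 + 2β2x. The sketch below evaluates that slope and the point where it reaches zero. The coefficients are hypothetical, not the estimates from Figure 12.13.

```python
def slope_at(x, b1, b2):
    """Slope of the quadratic model b0 + b1*x + b2*x^2 at a given x."""
    return b1 + 2 * b2 * x


def turning_point(b1, b2):
    """Value of x where the fitted curve stops rising (slope equals zero)."""
    return -b1 / (2 * b2)


# Hypothetical estimates: positive b1, negative b2
b1, b2 = 2.0, -0.001

slope_small_home = slope_at(500, b1, b2)    # still rising fairly steeply
slope_large_home = slope_at(900, b1, b2)    # rising, but more slowly
x_peak = turning_point(b1, b2)
```

If the turning point falls beyond the largest x in the sample, the apparent downturn after it is an artifact of extrapolating the curve and should not be interpreted.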
- Complete Second-Order Model with Two Quantitative Independent Variables
- E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²
- β0: y-intercept
- β1 and β2: changing β1 and β2 causes the surface to shift along the x1 and x2 axes
- β3: controls the rotation of the surface
- β4 and β5: signs and values of these parameters control the type of surface and the rates of curvature
12.7 Model Building: Qualitative (Dummy) Variable Models
- Qualitative variables can be included in regression models through the use of dummy variables.
- Each dummy variable is coded 1 for one category and 0 otherwise; the base level is the category for which every dummy variable equals 0.
- A Qualitative Independent Variable with k Levels
- E(y) = β0 + β1x1 + β2x2 + ... + β(k-1)x(k-1)
- where xi is the dummy variable for level i + 1, with xi = 1 if y is observed at level i + 1 and xi = 0 otherwise.
- For the golf ball example from Chapter 10, there were four levels (the brands). Testing differences in brands can be done with the model E(y) = β0 + β1x1 + β2x2 + β3x3, where x1, x2 and x3 are the dummy variables for Brands B, C and D.
- Brand A is the base level, so β0 represents the mean distance (μA) for Brand A, and
- β1 = μB - μA
- β2 = μC - μA
- β3 = μD - μA
- Testing that the four means are equal is equivalent to testing the significance of the βs:
- H0: β1 = β2 = β3 = 0
- Ha: At least one of the βs ≠ 0
The test statistic is the F-statistic. Here F = 43.99, p-value ≈ .000. Hence we reject the null hypothesis that the golf balls all have the same mean driving distance.
Remember that the maximum number of dummy variables is one less than the number of levels for the qualitative variable.
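The coding scheme can be sketched as below: four brands need three 0/1 dummy variables, and in a model containing only those dummies the least squares estimates reduce to the base-level mean and the differences from it. The distances are made-up numbers, not the Chapter 10 data, and the helper names are hypothetical.

```python
from statistics import mean

LEVELS = ["A", "B", "C", "D"]          # "A" is the base level


def dummies(brand):
    """Return (x1, x2, x3): 1 for the matching non-base brand, 0 otherwise."""
    return tuple(1 if brand == level else 0 for level in LEVELS[1:])


def betas_from_group_means(data):
    """b0 = mean of the base level; b_i = mean of level i+1 minus the base mean."""
    base = mean(data[LEVELS[0]])
    return [base] + [mean(data[level]) - base for level in LEVELS[1:]]


# Hypothetical driving distances for each brand
distances = {"A": [250, 252], "B": [260, 262], "C": [270, 268], "D": [249, 247]}

x_for_a = dummies("A")                 # base level: all dummies are 0
x_for_c = dummies("C")                 # only the Brand C dummy is 1
betas = betas_from_group_means(distances)
```

Note that the base level gets no dummy variable of its own, which is why only k - 1 dummies are needed for k levels.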
12.8 Model Building: Models with Both Quantitative and Qualitative Variables
- Suppose a first-order model is used to evaluate the impact on mean monthly sales of expenditures in three advertising media: television, radio and newspaper.
- Expenditure, x1, is a quantitative variable.
- Types of media, x2 and x3, are qualitative (dummy) variables; with k = 3 media types, only k - 1 = 2 dummy variables are needed.
- Suppose now a second-order model is used to evaluate the impact of expenditures in the three advertising media on sales.
- The relationship between expenditures, x1, and sales, y, is assumed to be curvilinear.
- In the model that contains only the expenditure terms (x1 and x1²), each medium is assumed to have the same impact on sales.
- In the model that adds the media dummy variables x2 and x3, the intercepts differ but the shapes of the curves are the same.
- In the model that also adds the interactions between the expenditure terms and the media dummy variables, the response curve for each media type is different; that is, advertising expenditure and media type interact, at varying rates.
12.9 Model Building: Comparing Nested Models
- Two models are nested if one model contains all the terms of the second model and at least one additional term. The more complex of the two models is called the complete model and the simpler of the two is called the reduced model.
- Recall the interaction model relating the auction price (y) of antique clocks to age (x1) and bidders (x2): y = β0 + β1x1 + β2x2 + β3x1x2 + ε
- If the relationship is not constant, a second-order model should be considered: E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2²
- If the complete model produces a better fit, then the βs on the quadratic terms should be significant.
- H0: β4 = β5 = 0
- Ha: At least one of β4 and β5 is non-zero
- F-Test for Comparing Nested Models
- F = [(SSER - SSEC) / (k - g)] / [SSEC / (n - (k + 1))] = [(SSER - SSEC) / (k - g)] / MSEC
- where
- SSER = sum of squared errors for the reduced model
- SSEC = sum of squared errors for the complete model
- MSEC = mean square error (s²) for the complete model
- k - g = number of β parameters specified in H0
- k + 1 = number of β parameters in the complete model
- n = sample size
- Rejection region: F > Fα, with k - g numerator and n - (k + 1) denominator degrees of freedom.
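The nested-model F statistic defined above is straightforward to compute once both models have been fitted. The sketch below follows the definitions in the list; the function name `nested_f` and the SSE values are hypothetical and chosen so the arithmetic can be checked by hand.

```python
def nested_f(sse_reduced, sse_complete, k, g, n):
    """F statistic for comparing a reduced model to a complete model.

    k : number of x-terms in the complete model (it has k + 1 betas)
    g : number of x-terms in the reduced model
    n : sample size
    """
    df_num = k - g                       # number of betas tested in H0
    df_den = n - (k + 1)                 # error df for the complete model
    mse_complete = sse_complete / df_den
    f_stat = ((sse_reduced - sse_complete) / df_num) / mse_complete
    return f_stat, df_num, df_den


# Hypothetical values: complete model has k = 5 terms, reduced has g = 3, n = 26
f_stat, df_num, df_den = nested_f(sse_reduced=120.0, sse_complete=80.0, k=5, g=3, n=26)
```

Here the drop in SSE is 40 over 2 tested parameters, and MSE for the complete model is 80/20 = 4, giving F = 5 on 2 and 20 degrees of freedom; that value would then be compared with the tabled Fα.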
- The growth of carnations (y) is assumed to be a function of the temperature (x1) and the amount of fertilizer (x2).
- The data are shown in Table 12.6 in the text.
The complete second-order model is E(y) = β0 + β1x1 + β2x2 + β3x1x2 + β4x1² + β5x2². The least squares prediction equation is estimated from the data in Table 12.6.
To test the significance of the contribution of the interaction and second-order terms, use
- H0: β3 = β4 = β5 = 0
- Ha: At least one of β3, β4 or β5 ≠ 0
This requires estimating the reduced model, which drops the terms whose parameters appear in the null hypothesis. Results are given in Figure 12.31.
Reject the null hypothesis: the complete model seems to provide better predictions than the reduced model.
- A parsimonious model is a general linear model with a small number of β parameters. In situations where two competing models have essentially the same predictive power (as determined by an F-test), choose the more parsimonious of the two.
If the models are not nested, the choice is more subjective, based on Ra², s, and an understanding of the theory behind the model.
12.10 Model Building: Stepwise Regression
- It is often unclear which independent variables have a significant impact on y.
- Screening variables in an attempt to identify the most important ones is known as stepwise regression.
- Stepwise regression must be used with caution.
- Many t-tests are conducted, leading to high probabilities of Type I or Type II errors.
- Usually, no interaction or higher-order terms are considered, and reality may not be that simple.
12.11 Residual Analysis: Checking the Regression Assumptions
- Regression analysis is based on the four assumptions about the random error ε considered earlier.
- The mean is equal to 0.
- The variance is equal to σ².
- The probability distribution is a normal distribution.
- Random errors are independent of one another.
- If these assumptions are not valid, the results of the regression estimation are called into question.
- Checking the validity of the assumptions involves analyzing the residuals of the regression.
- A regression residual is defined as the difference between an observed y-value and its corresponding predicted value: residual = y - ŷ.
- Properties of the Regression Residuals
- The mean of the residuals is equal to 0.
- The standard deviation of the residuals is equal to the standard deviation of the fitted regression model, s.
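Both properties can be checked numerically. For brevity the sketch below fits a straight-line model (k = 1) with the closed-form least squares formulas, although the same properties hold for a multiple regression model with an intercept. The data and function names are made up for illustration.

```python
from math import sqrt


def fit_line(x, y):
    """Least squares intercept and slope for a straight-line model."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


def residuals_and_s(x, y, k=1):
    """Residuals y - y_hat and the model standard deviation s."""
    b0, b1 = fit_line(x, y)
    res = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in res)
    s = sqrt(sse / (len(x) - (k + 1)))   # divide by n - (k + 1), not n
    return res, s


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
res, s = residuals_and_s(x, y)
residual_mean = sum(res) / len(res)
```

The residual mean comes out as zero up to floating-point rounding, and s is the yardstick used when judging whether a residual is unusually large.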
- If the model is misspecified, the mean of ε will not equal 0.
- Residual analysis may reveal this problem.
- The home-size electricity usage example illustrates this.
- The residual plot of the first-order model shows a curvilinear pattern, while that of the quadratic model shows a more random pattern.
- A pattern in the residual plot may indicate a problem with the model.
- A residual larger than 3s (in absolute value) is considered an outlier.
- Outliers will have an undue influence on the estimates.
- Possible causes of an outlier:
  - 1. Mistakenly recorded data
  - 2. An observation that is for some reason truly different from the others
  - 3. Random chance
- Leaving an outlier that should be removed in the data set will produce misleading estimates and predictions (causes 1 and 2 above).
- So will removing an outlier that actually belongs in the data set (cause 3 above).
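The 3s rule is easy to apply once the residuals and s are available. The sketch below only flags candidates; deciding whether to keep or remove a flagged observation still requires the judgment described above. The residuals and the value of s are hypothetical.

```python
def flag_outliers(residuals, s):
    """Return the positions of residuals larger than 3s in absolute value."""
    return [i for i, e in enumerate(residuals) if abs(e) > 3 * s]


# Hypothetical residuals with s = 2.0, so the cutoff is |residual| > 6.0
residuals = [0.5, -1.2, 7.4, 0.3, -0.8]
flagged = flag_outliers(residuals, s=2.0)
```

Only the third observation (index 2) exceeds the cutoff in this made-up example.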
- Residual plots should be centered on 0 and within 3s of 0.
- Residual histograms should be relatively bell-shaped.
- Residual normal probability plots should display straight lines.
Regression analysis is robust with respect to (small) nonnormal errors.
- Slight departures from normality will not seriously harm the validity of the estimates, but as the departure from normality grows, the validity falls.
- If the variance of ε changes as y changes, the constant variance assumption is violated.
- A first-order model is used to relate the salaries (y) of social workers to years of experience (x).
- The model seems to provide good predictions, but the residual plot reveals a non-random pattern.
- The spread of the residuals increases as the estimated mean salary increases, violating the constant variance assumption.
- Transforming the dependent variable often stabilizes the residual variance.
- Possible transformations of y:
  - Natural logarithm, ln(y)
  - Square root, √y
  - Arcsine square root, sin⁻¹(√y)
12.12 Some Pitfalls: Estimability, Multicollinearity and Extrapolation
Problem 1: Parameter Estimability
If x does not take on a sufficient number of different values, no single unique line can be estimated.
Problem 2: Multicollinearity
Multicollinearity exists when two or more of the independent variables in a regression are correlated.
If xi and xj move together in some way, finding the impact on y of a one-unit change in either of them holding the other constant will be difficult or impossible.
Multicollinearity can be detected in various ways. A simple check is to calculate the correlation coefficients (rij) for each pair of independent variables in the model. Any significant rij may indicate a multicollinearity problem.
- If severe multicollinearity exists, the result may be
- Significant F-values but insignificant t-values
- Signs on βs opposite to those expected
- Errors in β estimates, standard errors, etc.
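The pairwise correlation check described above can be sketched as follows. The variable names echo a tar, nicotine and weight setting, but the numbers are invented for illustration and the helper functions are hypothetical; statistical software would also report whether each rij is significant.

```python
from itertools import combinations
from math import sqrt


def pearson_r(a, b):
    """Sample correlation coefficient between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    ss_ab = sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b))
    ss_aa = sum((x - a_bar) ** 2 for x in a)
    ss_bb = sum((y - b_bar) ** 2 for y in b)
    return ss_ab / sqrt(ss_aa * ss_bb)


def pairwise_correlations(columns):
    """Map each pair of variable names to its correlation coefficient."""
    return {(u, v): pearson_r(columns[u], columns[v])
            for u, v in combinations(columns, 2)}


# Hypothetical independent variables
columns = {
    "tar": [1, 2, 3, 4, 5],
    "nicotine": [2.1, 3.9, 6.0, 8.2, 9.8],   # moves almost in step with tar
    "weight": [5, 3, 4, 1, 2],
}
r = pairwise_correlations(columns)
```

A pair with |r| close to 1, such as the made-up tar and nicotine columns here, is a warning that the individual β estimates for those variables may be unstable.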
- The Federal Trade Commission (FTC) ranks cigarettes according to their tar (x1), nicotine (x2), weight in grams (x3) and carbon monoxide (y) content.
- 25 data points (see Table 12.11) are used to estimate the model E(y) = β0 + β1x1 + β2x2 + β3x3.
- F = 78.98, p-value < .0001
- t for β1 = 3.97, p-value = .0007
- t for β2 = -0.67, p-value = .5072
- t for β3 = -0.03, p-value = .9735
The negative signs on two variables and the insignificant t-values are suggestive of multicollinearity.
- The coefficients of correlation, rij, provide further evidence:
- r(tar, nicotine) = .9766
- r(tar, weight) = .4908
- r(weight, nicotine) = .5002
- Each rij is significantly different from 0 at the α = .05 level.
- Possible Responses to Problems Created by Multicollinearity in Regression
- Drop one or more correlated independent variables from the model.
- If all the xs are retained,
  - Avoid making inferences about the individual β parameters from the t-tests.
  - Restrict inferences about E(y) and future y values to values of the xs that fall within the range of the sample data.
Problem 3: Extrapolation
The data used to estimate the model provide information only on the range of values in the data set. There is no reason to assume that the dependent variable's response will be the same over a different range of values.
Problem 4: Correlated Errors
If the error terms are not independent (a frequent problem in time series), the model tests and prediction intervals are invalid. Special techniques are used to deal with time series models.