Title: Chapter 8 Linear regression
1Chapter 8 Linear regression
2Scatterplot
- One Double Whopper contains 53 grams of protein, 65 grams of fat, and 1020 calories, so two Double Whoppers would supply about a day's worth of calories.
- Fat versus protein for 30 items on the Burger King menu
3The linear model
- Parameters
- Intercept: $b_0$
- Slope: $b_1$
- Model is NOT perfect!
- Predicted value: $\hat{y}$
- Residual = observed - predicted: $e = y - \hat{y}$
- Overestimate when residual < 0
- Underestimate when residual > 0
4The linear model (cont.)
- We write our model as $\hat{y} = b_0 + b_1 x$
- This model says that our predictions follow a straight line.
- If the model is a good fit, the data values will scatter closely around it.
5How did I get the line?
- Best fit line
- Minimize residuals overall
- The line of best fit is the line for which the sum of the squared residuals is smallest. The least squares line minimizes $\sum (y - \hat{y})^2$.
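The least squares idea can be sketched numerically. This is a minimal illustration on made-up data (the Burger King data set is not reproduced in these slides). The closed-form slope and intercept inside `least_squares` are the standard textbook solution rather than something stated on this slide, and the names `sse` and `least_squares` are illustrative.

```python
def sse(xs, ys, b0, b1):
    """Sum of squared residuals for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Return (b0, b1) of the least squares line (standard closed form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


# Made-up illustrative data, NOT the Burger King data.
xs = [10, 15, 20, 25, 30, 35]
ys = [14, 22, 25, 33, 34, 42]

b0, b1 = least_squares(xs, ys)
best = sse(xs, ys, b0, b1)

# Nudging the intercept or the slope in either direction
# should only make the sum of squared residuals larger.
nudged = [
    sse(xs, ys, b0 + 0.5, b1),
    sse(xs, ys, b0 - 0.5, b1),
    sse(xs, ys, b0, b1 + 0.05),
    sse(xs, ys, b0, b1 - 0.05),
]

print(f"least squares SSE: {best:.2f}")
print("nudged SSEs:", [round(s, 2) for s in nudged])
```

Every nudged line has a larger sum of squared residuals than the fitted one, which is what "best fit" means here.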
6How do I get the line? (cont.)
- The regression line: $\hat{y} = b_0 + b_1 x$
- Slope, in units of y per unit of x: $b_1 = r \, s_y / s_x$
- Intercept, in units of y: $b_0 = \bar{y} - b_1 \bar{x}$
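As a rough sketch of how the slope and intercept come out of the summary statistics, the snippet below computes the correlation, then the slope as r times the ratio of standard deviations, then the intercept from the two means. The data are made up for illustration, and all variable names are assumptions rather than anything defined in the slides.

```python
from statistics import mean, stdev

# Made-up illustrative data, NOT the Burger King data.
xs = [10, 15, 20, 25, 30, 35]
ys = [14, 22, 25, 33, 34, 42]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)

# Correlation: average product of z-scores, dividing by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

b1 = r * s_y / s_x         # slope: units of y per unit of x
b0 = y_bar - b1 * x_bar    # intercept: units of y

print(f"r = {r:.3f}, slope = {b1:.3f}, intercept = {b0:.3f}")
```

Because the intercept is built from the means, the fitted line always passes through the point (mean of x, mean of y).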
7Interpreting the regression line
- Slope
- Increasing x by 1 unit changes the predicted y by $b_1$ units.
- In particular, moving one standard deviation away from the mean in x moves us r standard deviations away from the mean in y.
- Intercept: predicted value of y when x = 0
- Predicted value at x: $\hat{y} = b_0 + b_1 x$
8Fat Versus Protein An Example
- The regression line for the Burger King data fits the data well.
- The equation is $\widehat{\text{fat}} = 6.8 + 0.97 \, \text{protein}$
- The predicted fat content for a BK Broiler chicken sandwich (30 g of protein) is 6.8 + 0.97(30) = 35.9 grams of fat.
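The prediction above, and a residual for an item whose actual fat content is known, can be checked with a few lines of code. This sketch uses the intercept 6.8 and slope 0.97 from this slide and the Double Whopper figures from the scatterplot slide (53 g protein, 65 g fat). The function name `predict_fat` is an illustrative assumption.

```python
def predict_fat(protein_g, intercept=6.8, slope=0.97):
    """Predicted fat (g) from protein (g) using the slide's BK line."""
    return intercept + slope * protein_g


# BK Broiler chicken sandwich: 30 g of protein.
broiler_pred = predict_fat(30)
print(f"BK Broiler predicted fat: {broiler_pred:.1f} g")  # 35.9 g

# Double Whopper: 53 g of protein, 65 g of fat observed.
whopper_pred = predict_fat(53)
whopper_resid = 65 - whopper_pred  # residual = observed - predicted
print(f"Double Whopper residual: {whopper_resid:.2f} g")  # 6.79 g
```

The Double Whopper's residual is positive, so the line underestimates its fat content.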
9How Big Can Predicted Values Get?
- r cannot be bigger than 1 (in absolute value), so each predicted y tends to be closer to its mean (in standard deviations) than its corresponding x was.
- This property of the linear model is called regression to the mean; the line is called the regression line.
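Regression to the mean is easiest to see in standard-deviation units, where the predicted z-score of y is r times the z-score of x. The sketch below uses r = 0.83, the Burger King correlation quoted later in the chapter. The function name `predicted_z` is an illustrative assumption.

```python
def predicted_z(r, z_x):
    """Predicted z-score of y for an x that lies z_x SDs from its mean."""
    return r * z_x


r = 0.83  # correlation for the Burger King data, as quoted in the slides

for z_x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    z_y = predicted_z(r, z_x)
    print(f"x is {z_x:+.1f} SDs from its mean -> predicted y is {z_y:+.2f} SDs")
```

An item 2 SDs above the mean in protein is predicted to be only 1.66 SDs above the mean in fat.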
10Sir Francis Galton
11Residuals Revisited
- The residuals are the part of the data that hasn't been modeled.
- Data = Model + Residual
- or (equivalently)
- Residual = Data - Model
- Or, in symbols, $e = y - \hat{y}$
12Residuals Revisited (cont.)
- When a regression model is appropriate, nothing interesting should be left behind.
- The residual plot should show no pattern, no bend, and no outliers.
- Residuals against x
- Residuals against predicted values
- The spread of the residual plot should be the same throughout.
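To see what a "bend" in the residuals looks like, here is a minimal sketch that fits a straight line to deliberately curved, made-up data (y = x squared). The helper name `fit_line` is an illustrative assumption, and it uses the standard least squares closed form.

```python
def fit_line(xs, ys):
    """Return (b0, b1) of the least squares line (standard closed form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Made-up data that is deliberately curved, so a line is the wrong model.
xs = [0, 1, 2, 3, 4]
ys = [x ** 2 for x in xs]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print(residuals)  # [2.0, -1.0, -2.0, -1.0, 2.0]
```

The residuals are positive at both ends and negative in the middle. That U-shape is the kind of pattern a residual plot should not show when a linear model is appropriate.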
13Residuals Revisited (cont.)
- If the residuals show no interesting pattern in the residual plot, we use the standard deviation of the residuals to measure how much the points spread around the regression line. The standard deviation of the residuals is given by $s_e = \sqrt{\sum e^2 / (n - 2)}$
- We need the Equal Variance Assumption for the standard deviation of the residuals. The associated condition to check is the Does the Plot Thicken? Condition.
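The standard deviation of the residuals can be computed directly from a fitted line. The sketch below uses a small made-up data set and assumes the usual convention of dividing by n - 2. The helper `fit_line` is an illustrative name.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (b0, b1) of the least squares line (standard closed form)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


# Made-up illustrative data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

# Standard deviation of the residuals, dividing by n - 2 (assumed convention).
n = len(xs)
s_e = sqrt(sum(e ** 2 for e in residuals) / (n - 2))
print(f"slope = {b1:.2f}, intercept = {b0:.2f}, s_e = {s_e:.3f}")
```

The residuals from a least squares line sum to zero, so s_e measures typical scatter around the line rather than around some other center.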
14Residual plot of regression stopping distance on
car speed
15How well does the linear model fit?
- The variation in the residuals is the key to assessing how well the model fits.
- Total fat (y): SD = 16.4 g
- Residuals: SD = 9.2 g
- The residuals have less variation than total fat.
- How much of the variation is accounted for by the model?
- How much is left in the residuals?
16Variation
- The squared correlation, $r^2$, gives the fraction of the data's variation explained by the model.
- We can view $1 - r^2$ as the percentage of variability in y that is NOT explained by the regression line, that is, the variability that has been left in the residuals.
- For the BK model, $r^2 = 0.83^2 = 0.69$, so 31% of the variability in total fat has been left in the residuals.
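The arithmetic here can be cross-checked against the two standard deviations quoted on the previous slide (16.4 g for total fat, 9.2 g for the residuals). This is only a rough check: the two standard deviations use slightly different divisors (n - 1 versus n - 2), so the ratio is approximate. Variable names are illustrative.

```python
r = 0.83                      # correlation for the BK data, from the slides
r_squared = r ** 2            # fraction of variation explained
unexplained = 1 - r_squared   # fraction left in the residuals

sd_fat = 16.4     # SD of total fat (g), from the slides
sd_resid = 9.2    # SD of the residuals (g), from the slides

# Ratio of variances: roughly the fraction of variation left over.
approx_unexplained = (sd_resid / sd_fat) ** 2

print(f"r^2 = {r_squared:.2f}")                              # 0.69
print(f"1 - r^2 = {unexplained:.2f}")                        # 0.31
print(f"(9.2 / 16.4)^2 is about {approx_unexplained:.2f}")   # 0.31
```

Both routes give about 31% of the variability left in the residuals.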
17$R^2$: The Variation Accounted For
- $R^2 = 0$: no variance explained
- $R^2 = 1$: all variance explained by the model
- How big should $R^2$ be to conclude that the model fits the data well?
- $R^2$ is always between 0% and 100%. What makes a good $R^2$ value depends on the kind of data you are analyzing and on what you want to do with it.
18Check the following conditions
- The two variables are both quantitative
- The relationship is linear (straight enough)
- Scatterplot
- Residual plot
- No outliers: are there very large residuals?
- Scatterplot
- Residual plot
- Equal variance: all residuals should share the same spread
- Residual plot: Does the Plot Thicken?
19Summary
- Is the linear model appropriate?
- Residual plot
- How well does the model fit?
- $R^2$
20Reality Check Is the Regression Reasonable?
- Statistics don't come out of nowhere. They are based on data.
- The results of a statistical analysis should reinforce your common sense, not fly in its face.
- If the results are surprising, then either you've learned something new about the world or your analysis is wrong.
- When you perform a regression, think about the coefficients and ask yourself whether they make sense.
21What Can Go Wrong?
- Don't fit a straight line to a nonlinear relationship.
- Beware extraordinary points (y-values that stand off from the linear pattern or extreme x-values).
- Don't extrapolate beyond the data: the linear model may no longer hold outside of the range of the data.
- Don't infer that x causes y just because there is a good linear model for their relationship: association is not causation.
- Don't choose a model based on $R^2$ alone.
22What have we learned?
- When the relationship between two quantitative variables is fairly straight, a linear model can help summarize that relationship.
- The regression line doesn't pass through all the points, but it is the best compromise in the sense that it has the smallest sum of squared residuals.
23What have we learned? (cont.)
- The correlation tells us several things about the regression:
- The slope of the line is based on the correlation, adjusted for the units of x and y.
- For each SD in x that we are away from the x mean, we expect to be r SDs in y away from the y mean.
- Since r is always between -1 and 1, each predicted y is fewer SDs away from its mean than the corresponding x was (regression to the mean).
- $R^2$ gives us the fraction of the variation in the response accounted for by the regression model.
24What have we learned? (cont.)
- The residuals also reveal how well the model works.
- If a plot of the residuals against predicted values shows a pattern, we should re-examine the data to see why.
- The standard deviation of the residuals quantifies the amount of scatter around the line.
25What have we learned? (cont.)
- The linear model makes no sense unless the Linear Relationship Assumption is satisfied.
- Also, we need to check the Straight Enough Condition and Outlier Condition with a scatterplot.
- For the standard deviation of the residuals, we must make the Equal Variance Assumption. We check it by looking at both the original scatterplot and the residual plot for the Does the Plot Thicken? Condition.
26TI-83
- Enter data as lists first
- Press STAT
- Then move the cursor to CALC
- Press 4 (LinReg(ax+b)) or press 8 (LinReg(a+bx))
- Then enter the names of the lists you want to use for the regression, e.g., L1, L2
- Press ENTER
- To see $r$ and $r^2$:
- Set DIAGNOSTICS ON
- 2ND 0 (CATALOG), move the cursor down to DiagnosticsOn
- Press ENTER (you will see DONE)
- Now repeat the above operations for linear regression; you will see the correlation coefficient and $r^2$.
27TI-83
- How to make the residual plot?
- Same as making a scatterplot
- Make the XLIST as the explanatory variable
- Make the YLIST as RESID
28Summary for Chapters 7 and 8
- How to read a scatter plot?
- Direction
- Form
- Strength
- Correlation coefficient
- When can you use it?
- How to calculate it?
- How to interpret it?
- Linear regression
- When can you use it?
- How to calculate it?
- How to interpret it?
- How to make predictions?
- How to read residual plot?