Describing Relationships: Regression, Prediction, and Causation - PowerPoint PPT Presentation

Provided by: vcujemaysw
1
Chapter 15
  • Describing Relationships: Regression,
    Prediction, and Causation

2
Linear Regression
  • Objective To quantify the linear relationship
    between an explanatory variable and a response
    variable.
  • We can then predict the average response for all
    subjects with a given value of the explanatory
    variable.
  • Regression equation: y = a + bx
  • x is the value of the explanatory variable
  • y is the average value of the response variable
  • note that a and b are just the intercept and
    slope of a straight line
  • note that r and b are not the same thing, but
    their signs will agree

3
Thought Question 1
How would you draw a line through the points?
How do you determine which line fits best?
4
Linear Equations
High School Teacher
5
The Linear Model
  • Remember from Algebra that a straight line can be
    written as y = mx + b.
  • In Statistics we use a slightly different
    notation: ŷ = b0 + b1x.
  • We write ŷ ("y-hat") to emphasize that the points
    that satisfy this equation are just our predicted
    values, not the actual data values.

6
Example: Fat Versus Protein
  • The following is a scatterplot of total fat
    versus protein for 30 items on the Burger King
    menu

7
Residuals
  • The model won't be perfect, regardless of the
    line we draw.
  • Some points will be above the line and some will
    be below.
  • The estimate made from a model is the predicted
    value (denoted as ŷ).

8
Residuals (cont.)
  • The difference between the observed value and its
    associated predicted value is called the
    residual.
  • To find the residuals, we always subtract the
    predicted value from the observed one:
    residual = observed − predicted = y − ŷ

9
Residuals (cont.)
  • A negative residual means the predicted value is
    too big (an overestimate).
  • A positive residual means the predicted value is
    too small (an underestimate).

10
Best Fit Means Least Squares
  • Some residuals are positive, others are negative,
    and, on average, they cancel each other out.
  • So, we can't assess how well the line fits by
    adding up all the residuals.
  • Similar to what we did with deviations, we square
    the residuals and add the squares.
  • The smaller the sum, the better the fit.
  • The line of best fit is the line for which the
    sum of the squared residuals is smallest.

11
Least Squares
  • Used to determine the best line
  • We want the line to be as close as possible to
    the data points in the vertical (y) direction
    (since that is what we are trying to predict)
  • Least Squares: use the line that minimizes the
    sum of the squares of the vertical distances of
    the data points from the line
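As a minimal numeric sketch (Python with NumPy, made-up data), we can check that the least-squares line really does have the smallest sum of squared residuals: nudging its slope or intercept only makes the sum larger.

```python
import numpy as np

# Made-up (x, y) data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

def sse(a, b):
    """Sum of squared vertical distances from the line y = a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

# Least-squares fit; np.polyfit returns (slope, intercept) for degree 1
b, a = np.polyfit(x, y, 1)

# Perturbing the least-squares line always increases the sum of squares
print(sse(a, b) <= sse(a + 0.1, b))  # True
print(sse(a, b) <= sse(a, b + 0.1))  # True
```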

12
The Linear Model (cont.)
  • We write b1 and b0 for the slope and intercept of
    the line. The b's are called the coefficients of
    the linear model.
  • The coefficient b1 is the slope, which tells us
    how rapidly ŷ changes with respect to x. The
    coefficient b0 is the intercept, which tells us
    where the line hits (intercepts) the y-axis.

13
The Least Squares Line
  • In our model, we have a slope (b1)
  • The slope is built from the correlation and the
    standard deviations: b1 = r (sy / sx)
  • Our slope is always in units of y per unit of x.
  • The slope has the same sign as the correlation
    coefficient.

14
The Least Squares Line (cont.)
  • In our model, we also have an intercept (b0).
  • The intercept is built from the means and the
    slope: b0 = ȳ − b1 x̄
  • Our intercept is always in units of y.
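The two formulas above can be checked with a short sketch (Python with NumPy, made-up data): build the slope from the correlation and standard deviations, the intercept from the means, and compare with a direct least-squares fit.

```python
import numpy as np

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]                    # correlation coefficient
sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)  # sample standard deviations

b1 = r * sy / sx               # slope: built from correlation and the SDs
b0 = y.mean() - b1 * x.mean()  # intercept: built from the means and the slope

# Same coefficients as a direct least-squares fit
slope, intercept = np.polyfit(x, y, 1)
print(np.isclose(b1, slope), np.isclose(b0, intercept))  # True True
```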

15
Example
  • Fill in the missing information in the table
    below

16
Interpretation of the Slope and Intercept
  • The slope indicates the amount by which ŷ
    changes when x changes by one unit.
  • The intercept is the value of y when x = 0. It is
    not always meaningful.

17
Example
  • The regression line for the Burger King data is
    ŷ = 6.8 + 0.97x.
  • Interpret the slope and the intercept.
  • Slope: For every one-gram increase in protein,
    the fat content increases by 0.97 g.
  • Intercept: A BK meal that has 0 g of protein
    contains 6.8 g of fat.

18
Thought Question 2
From a long-term study on several families,
researchers constructed a scatterplot of the
cholesterol level of a child at age 50 versus the
cholesterol level of the father at age 50. You
know the cholesterol level of your best friend's
father at age 50. How could you use this
scatterplot to predict what your best friend's
cholesterol level will be at age 50?
19
Predictions
  • In predicting a value of y based on some given
    value of x ...
  • 1. If there is not a linear correlation, the
    best predicted y-value is ȳ, the mean of the
    observed y-values.

2. If there is a linear correlation, the best
predicted y-value is found by substituting the
x-value into the regression equation.
20
Example: Fat Versus Protein
  • The regression line for the Burger King data fits
    the data well.
  • The equation is ŷ = 6.8 + 0.97x.
  • The predicted fat content for a BK Broiler
    chicken sandwich that contains 30g of protein is
    6.8 + 0.97(30) = 35.9 grams of fat.
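The prediction step is just a plug-in; a one-line sketch in Python using the slide's fitted line (6.8 + 0.97x):

```python
def predicted_fat(protein_g):
    """Predicted fat (grams) from the slide's Burger King line: y-hat = 6.8 + 0.97x."""
    return 6.8 + 0.97 * protein_g

print(predicted_fat(30))  # about 35.9 grams of fat, matching the slide
```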

21
Prediction via Regression Line
Husband and Wife Ages
Hand, et al., A Handbook of Small Data Sets,
London: Chapman and Hall
  • The regression equation is ŷ = 3.6 + 0.97x
  • ŷ is the average age of all husbands who have
    wives of age x
  • For all women aged 30, we predict the average
    husband age to be 32.7 years:
    3.6 + (0.97)(30) = 32.7 years
  • Suppose we know that an individual wife's age is
    30. What would we predict her husband's age to
    be?


22
The Least Squares Line (cont.)
  • Since regression and correlation are closely
    related, we need to check the same conditions for
    regressions as we did for correlations
  • Quantitative Variables Condition
  • Straight Enough Condition
  • Outlier Condition

23
Guidelines for Using The Regression Equation
  • 1. If there is no linear correlation, don't
    use the regression equation to make predictions.
  • 2. When using the regression equation for
    predictions, stay within the scope of the
    available sample data.
  • 3. A regression equation based on old data is
    not necessarily valid now.
  • 4. Dont make predictions about a population
    that is different from the population from
    which the sample data were drawn.

24
Definitions
  • Marginal Change - refers to the slope: the
    amount the response variable changes when the
    explanatory variable changes by one unit.
  • Outlier - A point lying far away from the other
    data points.
  • Influential Point - An outlier that has the
    potential to change the regression line.

25
Residuals Revisited
  • Residuals help us to see whether the model makes
    sense.
  • When a regression model is appropriate, nothing
    interesting should be left behind.
  • After we fit a regression model, we usually plot
    the residuals in the hope of finding... nothing.

26
Residual Plot Analysis
If a residual plot does not reveal any pattern,
the regression equation is a good representation
of the association between the two variables. If
a residual plot reveals some systematic pattern,
the regression equation is not a good
representation of the association between the two
variables.
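A numeric sketch of this idea (Python with NumPy, made-up noisy data): for a least-squares fit, the residuals are uncorrelated with x by construction, so a healthy residual plot shows no leftover trend.

```python
import numpy as np

# Made-up, roughly linear data with noise
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, size=x.size)

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)  # observed minus predicted

# The residuals carry no linear pattern in x; a plot of (x, residuals)
# for a well-fitting model should look like structureless scatter.
print(abs(np.corrcoef(x, residuals)[0, 1]) < 1e-8)  # True
```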
27
Residuals Revisited (cont.)
  • The residuals for the BK menu regression look
    appropriately boring

28
Coefficient of Determination (R2)
  • Measures usefulness of regression prediction
  • R2 (or r2, the square of the correlation)
    measures the percentage of the variation in the
    values of the response variable (y) that is
    explained by the regression line
  • r = 1: R2 = 1; the regression line explains all
    (100%) of the variation in y
  • r = 0.7: R2 = 0.49; the regression line explains
    almost half (49%) of the variation in y
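The claim that R2 is the squared correlation can be verified directly (Python with NumPy, made-up data), computing R2 as the fraction of the variation in y accounted for by the line:

```python
import numpy as np

# Made-up data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]
slope, intercept = np.polyfit(x, y, 1)
pred = intercept + slope * x

# R2 = 1 - (residual variation) / (total variation in y)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(np.isclose(r_squared, r ** 2))  # True: R2 is the squared correlation
```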

29
R2 (cont)
  • Along with the slope and intercept for a
    regression, you should always report R2 so that
    readers can judge for themselves how successful
    the regression is at fitting the data.
  • Statistics is about variation, and R2 measures
    the success of the regression model in terms of
    the fraction of the variation of y accounted for
    by the regression.

30
A Caution: Beware of Extrapolation
  • Sarah's height was plotted against her age
  • Can you predict her height at age 42 months?
  • Can you predict her height at age 30 years (360
    months)?

31
A Caution: Beware of Extrapolation
  • Regression line: ŷ = 71.95 + 0.383x
  • Height at age 42 months? ŷ = 88 cm.
  • Height at age 30 years? ŷ = 209.8 cm.
  • She is predicted to be 6' 10.5" at age 30.
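Plugging both ages into the slide's line (a quick Python sketch) shows why extrapolation fails: the arithmetic is identical, but only the first age lies inside the range of the data.

```python
def predicted_height_cm(age_months):
    """Slide's fitted line for Sarah's height: y = 71.95 + 0.383 * age (months)."""
    return 71.95 + 0.383 * age_months

print(round(predicted_height_cm(42), 1))   # about 88.0 cm: interpolation, plausible
print(round(predicted_height_cm(360), 1))  # about 209.8 cm: extrapolation, absurd
```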

32
Correlation Does Not Imply Causation
  • Even very strong correlations may not correspond
    to a real causal relationship.

33
Evidence of Causation
  • A properly conducted experiment establishes the
    connection
  • Other considerations
  • A reasonable explanation for a cause and effect
    exists
  • The connection happens in repeated trials
  • The connection happens under varying conditions
  • Potential confounding factors are ruled out
  • Alleged cause precedes the effect in time

34
Evidence of Causation
  • An observed relationship can be used for
    prediction without worrying about causation as
    long as the patterns found in past data continue
    to hold true.
  • We must make sure that the prediction makes
    sense.
  • We must be very careful of extreme extrapolation.

35
Reasons Two Variables May Be Related (Correlated)
  • Explanatory variable causes change in response
    variable
  • Response variable causes change in explanatory
    variable
  • The explanatory variable may be a contributing
    cause, but not the sole cause, of changes in the
    response variable
  • Confounding variables may exist
  • Both variables may result from a common cause
  • such as, both variables changing over time
  • The correlation may be merely a coincidence

36
Response causes Explanatory
  • Explanatory Hotel advertising dollars
  • Response Occupancy rate
  • Positive correlation? Does more advertising lead
    to an increased occupancy rate?
  • The actual correlation is negative: lower
    occupancy leads to more advertising

37
Explanatory is not the Sole Contributor
  • Explanatory Consumption of barbecued foods
  • Response Incidence of stomach cancer
  • Barbecued foods are known to contain carcinogens,
    but other lifestyle choices may also contribute

38
Common Response (both variables change due to a
common cause)
  • Explanatory Divorce among men
  • Response Percent abusing alcohol
  • Both may result from an unhappy marriage.

39
Both Variables are Changing Over Time
  • Both divorces and suicides have increased
    dramatically since 1900.
  • Are divorces causing suicides?
  • Are suicides causing divorces???
  • The population has increased dramatically since
    1900 (causing both to increase).
  • Better to investigate: Has the rate of divorce
    or the rate of suicide changed over time?

40
The Relationship May Be Just a Coincidence
  • We will see some strong correlations (or
    apparent associations) just by chance, even when
    the variables are not related in the population

41
Coincidence (?)
Vaccines and Brain Damage
  • A required whooping cough vaccine was blamed for
    seizures that caused brain damage
  • led to reduced production of vaccine (due to
    lawsuits)
  • A study of 38,000 children found no evidence for
    the accusations (reported in the New York Times)
  • People confused association with
    cause-and-effect
  • Virtually every kid received the vaccine; it was
    inevitable that, by chance, brain damage caused
    by other factors would occasionally occur in a
    recently vaccinated child

42
Key Concepts
  • Least Squares Regression Equation
  • R2
  • Correlation does not imply causation
  • Confirming causation
  • Reasons variables may be correlated