Title: Chapter 3 Association: Contingency, Correlation, and Regression
1 Chapter 3 Association: Contingency, Correlation, and Regression
- Section 3.1
- How Can We Explore the Association between Two Categorical Variables?
2 Learning Objectives
- Identify variable type: Response or Explanatory
- Define Association
- Contingency tables
- Calculate proportions and conditional proportions
3 Learning Objective 1: Response and Explanatory Variables
- Response variable (dependent variable)
- the outcome variable on which comparisons are made
- Explanatory variable (independent variable)
- defines the groups to be compared with respect to values on the response variable
- Example: Response / Explanatory
- Blood alcohol level / number of beers consumed
- Grade on test / amount of study time
- Yield of corn per bushel / amount of rainfall
4 Learning Objective 2: Association
- The main purpose of data analysis with two variables is to investigate whether there is an association and to describe that association
- An association exists between two variables if a particular value for one variable is more likely to occur with certain values of the other variable
5 Learning Objective 3: Contingency Table
- A contingency table
- Displays two categorical variables
- The rows list the categories of one variable
- The columns list the categories of the other variable
- Entries in the table are frequencies
6 Learning Objective 3: Contingency Table
What is the response variable? What is the explanatory variable?
7 Learning Objective 4: Calculate proportions and conditional proportions
8 Learning Objective 4: Calculate proportions and conditional proportions
- What proportion of organic foods contain pesticides?
- What proportion of conventionally grown foods contain pesticides?
- What proportion of all sampled items contain pesticide residues?
9 Learning Objective 4: Calculate proportions and conditional proportions
Use side-by-side bar charts to show conditional proportions. This allows for easy comparison of the explanatory variable categories with respect to the response variable.
10 Learning Objective 4: Calculate proportions and conditional proportions
- If there were no association between food type (organic or conventional) and pesticide status, then the proportions for the response variable categories would be the same for each food type
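The comparison of conditional proportions can be sketched in a few lines of Python. The counts below are hypothetical and chosen only for illustration (they are not the data from the slides), and the function names are our own.

```python
# Hypothetical counts, for illustration only (not the slide's data).
# Outer keys: food type (explanatory); inner keys: pesticide status (response).
table = {
    "organic": {"present": 30, "absent": 70},
    "conventional": {"present": 150, "absent": 50},
}

def conditional_proportions(table):
    """Proportion of each response category within each explanatory group."""
    result = {}
    for group, counts in table.items():
        total = sum(counts.values())
        result[group] = {cat: n / total for cat, n in counts.items()}
    return result

def marginal_proportion(table, category):
    """Proportion of all items, ignoring group, that fall in `category`."""
    hits = sum(counts[category] for counts in table.values())
    total = sum(sum(counts.values()) for counts in table.values())
    return hits / total

cond = conditional_proportions(table)
overall = marginal_proportion(table, "present")
for group, props in cond.items():
    print(group, round(props["present"], 2))
print("all items", round(overall, 2))
```

With these made-up counts the conditional proportions differ by food type (0.30 versus 0.75), which is what an association looks like. Equal conditional proportions would indicate no association.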
11 Chapter 3 Association: Contingency, Correlation, and Regression
- Section 3.2
- How Can We Explore the Association between Two Quantitative Variables?
12 Learning Objectives
- Constructing scatterplots
- Interpreting a scatterplot
- Correlation
- Calculating correlation
13 Learning Objective 1: Scatterplot
- Graphical display of relationship between two quantitative variables
- Horizontal axis: explanatory variable, x
- Vertical axis: response variable, y
14 Learning Objective 1: Internet Usage and Gross Domestic Product (GDP) Data Set
15 Learning Objective 1: Internet Usage and Gross Domestic Product (GDP)
- Enter values of explanatory variable (x) in L1
- Enter values of response variable (y) in L2
- STAT PLOT
- Plot 1 on
- Type: scatter plot
- X list: L1
- Y list: L2
- ZOOM
- 9: ZoomStat
- Graph
16 Learning Objective 1: Baseball Average and Team Scoring
17 Learning Objective 1: Baseball Average and Team Scoring
- Enter values of explanatory variable (x) in L1
- Enter values of response variable (y) in L2
- STAT PLOT
- Plot 1 on
- Type: scatter plot
- X list: L1
- Y list: L2
- ZOOM
- 9: ZoomStat
- Graph
Note: For this example, use L3 for x and L4 for y in place of L1 and L2. The data from the prior example will be used again later in the presentation.
18 Learning Objective 2: Interpreting Scatterplots
- You can describe the overall pattern of a scatterplot by the trend, direction, and strength of the relationship between the two variables
- Trend: linear, curved, clusters, no pattern
- Direction: positive, negative, no direction
- Strength: how closely the points fit the trend
- Also look for outliers from the overall trend
19 Learning Objective 2: Interpreting Scatterplots: Direction/Association
- Two quantitative variables x and y are
- Positively associated when
- High values of x tend to occur with high values of y
- Low values of x tend to occur with low values of y
- Negatively associated when high values of one variable tend to pair with low values of the other variable
20 Learning Objective 2: Example: 100 cars on the lot of a used-car dealership
- Would you expect a positive association, a negative association, or no association between the age of the car and the mileage on the odometer?
- Positive association
- Negative association
- No association
21 Learning Objective 2: Example: Did the Butterfly Ballot Cost Al Gore the 2000 Presidential Election?
22 Learning Objective 3: Linear Correlation, r
- Measures the strength and direction of the linear association between x and y
- A positive r value indicates a positive association
- A negative r value indicates a negative association
- An r value close to +1 or -1 indicates a strong linear association
- An r value close to 0 indicates a weak association
23 Learning Objective 3: Correlation coefficient: Measuring Strength & Direction of a Linear Relationship
24 Learning Objective 3: Properties of Correlation
- Always falls between -1 and +1
- Sign of correlation denotes direction
- (-) indicates negative linear association
- (+) indicates positive linear association
- Correlation is a unitless measure: it does not depend on the variables' units
- Two variables have the same correlation no matter which is treated as the response variable
- Correlation is not resistant to outliers
- Correlation only measures the strength of a linear relationship
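Several of these properties can be checked numerically. The sketch below computes r as the average product of z-scores (dividing by n - 1), using small hypothetical data sets that are not from the slides.

```python
import math

def correlation(xs, ys):
    """Pearson correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)                               # about 0.77
r_swapped = correlation(y, x)                       # roles of x and y exchanged
r_rescaled = correlation([100 * v for v in x], y)   # x in different units
r_with_outlier = correlation(x + [20], y + [0])     # one extreme point added

print(round(r, 3), round(r_swapped, 3), round(r_rescaled, 3))
print(round(r_with_outlier, 3))
```

Swapping the variables or changing units leaves r unchanged, while the single added point (20, 0) turns a positive correlation into a negative one, which illustrates why r is not resistant to outliers.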
25 Learning Objective 4: Calculating the Correlation Coefficient
Per Capita Gross Domestic Product and Average Life Expectancy for Countries in Western Europe
26 Learning Objective 4: Calculating the Correlation Coefficient
27 Learning Objective 4: Internet Usage and Gross Domestic Product (GDP)
- STAT CALC menu
- Choose 8: LinReg(a+bx)
- 1st number: x variable
- 2nd number: y variable
- Enter
Correlation: r = .889
28 Learning Objective 4: Baseball Average and Team Scoring
- Enter x data into L1
- Enter y data into L2
- STAT CALC menu
- Choose 8: LinReg(a+bx)
- 1st number: x variable
- 2nd number: y variable
- Enter
Correlation: r = .874
29 Learning Objective 4: Cereal Sodium and Sugar
30 Chapter 3 Association: Contingency, Correlation, and Regression
- Section 3.3
- How Can We Predict the Outcome of a Variable?
31 Learning Objectives
- Definition of a regression line
- Use a regression equation for prediction
- Interpret the slope and y-intercept of a regression line
- Identify the least-squares regression line as the one that minimizes the sum of squared residuals
- Calculate the least-squares regression line
32 Learning Objectives
- Compare roles of explanatory and response variables in correlation and regression
- Calculate r² and interpret
33 Learning Objective 1: Regression Analysis
- The first step of a regression analysis is to identify the response and explanatory variables
- We use y to denote the response variable
- We use x to denote the explanatory variable
34 Learning Objective 1: Regression Line
- A regression line is a straight line that describes how the response variable (y) changes as the explanatory variable (x) changes
- A regression line predicts the value of the response variable (y) for a given level of the explanatory variable (x)
- The y-intercept of the regression line is denoted by a
- The slope of the regression line is denoted by b
35 Learning Objective 2: Example: How Can Anthropologists Predict Height Using Human Remains?
- Regression Equation
- ŷ is the predicted height and x is the length of a femur (thighbone), measured in centimeters
- Use the regression equation to predict the height of a person whose femur length was 50 centimeters
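A prediction of this kind is a single evaluation of ŷ = a + bx. In the sketch below the slope of 2.4 is the value quoted on the slope slide of this deck. The intercept is a hypothetical placeholder, because the fitted equation itself does not appear in this text, so the numeric answer is only illustrative.

```python
# Slope quoted later in the deck: 2.4 cm of predicted height per cm of femur.
SLOPE_B = 2.4
# HYPOTHETICAL intercept: the fitted equation is missing from this slide,
# so this value is an assumption used only to make the sketch runnable.
INTERCEPT_A = 61.4

def predict_height(femur_cm, a=INTERCEPT_A, b=SLOPE_B):
    """Predicted height (cm) from the regression line y-hat = a + b*x."""
    return a + b * femur_cm

predicted = predict_height(50)
print(round(predicted, 1))
```

Under the assumed intercept, the prediction for a 50 cm femur is 61.4 + 2.4(50) = 181.4 cm. A different intercept shifts every prediction by the same amount.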
36 Learning Objective 3: Interpreting the y-Intercept
- y-Intercept
- The predicted value for y when x = 0
- Helps in plotting the line
- May not have any interpretative value if no observations had x values near 0
37 Learning Objective 3: Interpreting the Slope
- Slope measures the change in the predicted value of the response variable (y) for a 1 unit increase in the explanatory variable (x)
- Example: A 1 cm increase in femur length results in a 2.4 cm increase in predicted height
38 Learning Objective 3: Slope Values: Positive, Negative, Equal to 0
39 Learning Objective 3: Regression Line
- At a given value of x, the equation ŷ = a + bx
- Predicts a single value of the response variable
- But we should not expect all subjects at that value of x to have the same value of y
- Variability occurs in the y values!
40 Learning Objective 3: The Regression Line
- The regression line connects the estimated means of y at the various x values
- In summary, ŷ = a + bx
- Describes the relationship between x and the estimated means of y at the various values of x
41 Learning Objective 4: Residuals
- Measure the size of the prediction errors, the vertical distance between the point and the regression line
- Each observation has a residual
- Calculation for each residual: y - ŷ
- A large residual indicates an unusual observation
42 Learning Objective 4: Least Squares Method Yields the Regression Line
- Residual sum of squares: Σ(residuals)² = Σ(y - ŷ)²
- The least squares regression line is the line that minimizes the vertical distance between the points and their predictions, i.e., it minimizes the residual sum of squares
- Note: the sum of the residuals about the regression line will always be zero
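Both claims, that the least-squares line minimizes the residual sum of squares and that its residuals sum to zero, can be checked on a small example. This is a minimal sketch using hypothetical data and the standard closed-form solution. The function names are our own.

```python
def least_squares(xs, ys):
    """Return (a, b) for the line y-hat = a + b*x with the smallest
    residual sum of squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    return a, b

def residuals(xs, ys, a, b):
    """Residual = observed y minus predicted y-hat, one per observation."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

def residual_sum_of_squares(xs, ys, a, b):
    return sum(e ** 2 for e in residuals(xs, ys, a, b))

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = least_squares(x, y)
best_rss = residual_sum_of_squares(x, y, a, b)
shifted_rss = residual_sum_of_squares(x, y, a + 0.1, b)   # nudge the intercept
tilted_rss = residual_sum_of_squares(x, y, a, b + 0.1)    # nudge the slope

print(round(a, 2), round(b, 2))
print(round(sum(residuals(x, y, a, b)), 10))
print(round(best_rss, 2), round(shifted_rss, 2), round(tilted_rss, 2))
```

For this data the fitted line is ŷ = 2.2 + 0.6x with a residual sum of squares of 2.4. Nudging either coefficient away from the fitted value makes the sum larger.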
43 Learning Objective 5: Regression Formulas for y-Intercept and Slope
Slope: b = r(s_y / s_x)
y-intercept: a = ȳ - b(x̄)
Regression line always passes through (x̄, ȳ)
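The usual textbook formulas compute the slope from the correlation and the two standard deviations, b = r(s_y / s_x), and the intercept from the means, a = ȳ - b(x̄). The sketch below applies them to hypothetical data and checks that the resulting line passes through the point of means (x̄, ȳ).

```python
import statistics

def slope_intercept_from_summary(xs, ys):
    """Return (a, b, r) using b = r * s_y / s_x and a = mean_y - b * mean_x."""
    n = len(xs)
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    s_x = statistics.stdev(xs)   # sample standard deviation (n - 1)
    s_y = statistics.stdev(ys)
    r = sum((x - mean_x) * (y - mean_y)
            for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x
    a = mean_y - b * mean_x
    return a, b, r

# Hypothetical data, for illustration only.
x = [2, 4, 6, 8]
y = [3, 7, 8, 12]

a, b, r = slope_intercept_from_summary(x, y)
prediction_at_mean_x = a + b * statistics.mean(x)

print(round(a, 3), round(b, 3), round(r, 3))
print(round(prediction_at_mean_x, 3), statistics.mean(y))
```

Here the slope is 1.4 and the intercept 0.5, and the prediction at x̄ = 5 equals ȳ = 7.5.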
44 Learning Objective 5: Calculating the slope and y-intercept for the regression line
Slope: 26.4
y-intercept: -2.28
45 Learning Objective 5: Internet Usage and Gross Domestic Product (GDP)
46 Learning Objective 5: Internet Usage and Gross Domestic Product (GDP)
- Enter x data into L1
- Enter y data into L2
- STAT CALC menu
- Choose 8: LinReg(a+bx)
- 1st number: x variable
- 2nd number: y variable
- Enter
Regression equation: ŷ = 1.548x - 3.63
47 Learning Objective 5: Baseball Average and Team Scoring
48 Learning Objective 5: Baseball Average and Team Scoring
- Enter x data into L1
- Enter y data into L2
- STAT CALC menu
- Choose 8: LinReg(a+bx)
- 1st number: x variable
- 2nd number: y variable
- Enter
49 Learning Objective 5: Cereal Sodium and Sugar
50 Learning Objective 6: The Slope and the Correlation
- Correlation
- Describes the strength of the linear association between 2 variables
- Does not change when the units of measurement change
- Does not depend upon which variable is the response and which is the explanatory
51 Learning Objective 6: The Slope and the Correlation
- Slope
- Numerical value depends on the units used to measure the variables
- Does not tell us whether the association is strong or weak
- The two variables must be identified as response and explanatory variables
- The regression equation can be used to predict values of the response variable for given values of the explanatory variable
52 Learning Objective 7: The Squared Correlation
- When a strong linear association exists, the regression equation predictions tend to be much better than the predictions using only ȳ
- We measure the proportional reduction in error and call it r²
53 Learning Objective 7: The Squared Correlation
- r² measures the proportion of the variation in the y-values that is accounted for by the linear relationship of y with x
- A correlation of .9 means that
- 81% of the variation in the y-values can be explained by the explanatory variable, x
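The interpretation of r² as a proportional reduction in error can be verified directly: compare the squared error from predicting every y with ȳ to the squared error from the regression line. The sketch below uses hypothetical data and our own function name.

```python
import math

def fit_and_error_reduction(xs, ys):
    """Return (r, proportional reduction in squared prediction error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = mean_y - b * mean_x
    r = s_xy / math.sqrt(s_xx * s_yy)
    # Error when every prediction is simply the mean of y.
    error_using_mean = s_yy
    # Error when predictions come from the regression line.
    error_using_line = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    reduction = (error_using_mean - error_using_line) / error_using_mean
    return r, reduction

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r, reduction = fit_and_error_reduction(x, y)
print(round(r ** 2, 3), round(reduction, 3))
print(round(0.9 ** 2, 2))   # the slide's example: r = .9 gives r² = .81
```

For this data both quantities equal 0.6: the regression line removes 60% of the squared error left by predicting with ȳ alone.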
54 Chapter 3 Association: Contingency, Correlation, and Regression
- Section 3.4
- What Are Some Cautions in Analyzing Association?
55 Learning Objectives
- Extrapolation
- Outliers and Influential Observations
- Correlation does not imply causation
- Lurking variables and confounding
- Simpson's Paradox
56 Learning Objective 1: Extrapolation
- Extrapolation: Using a regression line to predict y-values for x-values outside the observed range of the data
- Riskier the farther we move from the range of the given x-values
- There is no guarantee that the relationship given by the regression equation holds outside the range of sampled x-values
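One simple safeguard is to flag any prediction whose x-value lies outside the observed range. The helper below is a hypothetical sketch. The coefficients and observed x-values are made up for illustration.

```python
def predict_with_range_check(x_new, observed_xs, a, b):
    """Return (prediction, is_extrapolation) for the line y-hat = a + b*x.

    A prediction counts as extrapolation when x_new falls outside the
    range of x-values that were used to fit the line.
    """
    is_extrapolation = x_new < min(observed_xs) or x_new > max(observed_xs)
    return a + b * x_new, is_extrapolation

# Hypothetical fitted line and observed x-values, for illustration only.
observed_x = [1, 2, 3, 4, 5]
a, b = 2.2, 0.6

inside = predict_with_range_check(3, observed_x, a, b)
outside = predict_with_range_check(50, observed_x, a, b)
print(inside[1], outside[1])
```

The flag does not say how wrong an extrapolated prediction is. It only marks where the data give no support for the fitted relationship.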
57 Learning Objective 2: Outliers and Influential Points
- Construct a scatterplot
- Search for data points that are well outside of the trend that the remainder of the data points follow
58 Learning Objective 2: Outliers and Influential Points
- A regression outlier is an observation that lies far away from the trend that the rest of the data follows
- An observation is influential if both of the following hold
- Its x value is relatively low or high compared to the remainder of the data
- The observation is a regression outlier
- Influential observations tend to pull the regression line toward that data point and away from the rest of the data
59 Learning Objective 2: Outliers and Influential Points
- Impact of removing an influential data point
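The effect of a single influential point can be seen by fitting the line with and without it. The data below are hypothetical. The added point (20, 0) has an extreme x-value and falls far from the trend of the other five points.

```python
def least_squares(xs, ys):
    """Return (a, b) for the least-squares line y-hat = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    b = s_xy / s_xx
    return mean_y - b * mean_x, b

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Same data plus one influential observation at (20, 0).
x_with = x + [20]
y_with = y + [0]

a_without, b_without = least_squares(x, y)
a_with, b_with = least_squares(x_with, y_with)

print("slope without the point:", round(b_without, 3))
print("slope with the point:   ", round(b_with, 3))
```

Without the extra point the slope is 0.6. With it, the slope becomes negative (about -0.2), so one observation reverses the apparent direction of the association.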
60 Learning Objective 3: Correlation does not Imply Causation
- A strong correlation between x and y means that there is a strong linear association between the two variables
- A strong correlation between x and y does not mean that x causes y
61 Learning Objective 3: Association does not imply causation
Data are available for all fires in Chicago last year on x = number of firefighters at the fires and y = cost of damages due to fire
- Would you expect the correlation to be negative, zero, or positive?
- If the correlation is positive, does this mean that having more firefighters at a fire causes the damages to be worse? Yes or No
- Identify a third variable that could be considered a common cause of x and y
- Distance from the fire station
- Intensity of the fire
- Size of the fire
62 Learning Objective 4: Lurking Variables & Confounding
- A lurking variable is a variable, usually unobserved, that influences the association between the variables of primary interest
- Ice cream sales and drowning: lurking variable = temperature
- Reading level and shoe size: lurking variable = age
- Childhood obesity rate and GDP: lurking variable = time
- When two explanatory variables are both associated with a response variable but are also associated with each other, there is said to be confounding
- Lurking variables are not measured in the study but have the potential for confounding
63 Learning Objective 5: Simpson's Paradox
- Simpson's Paradox
- When the direction of an association between two variables changes after we include a third variable and analyze the data at separate levels of that variable
64 Learning Objective 5: Simpson's Paradox Example: Is Smoking Actually Beneficial to Your Health?
Probability of Death of Smoker = 139/582 = 24%
Probability of Death of Nonsmoker = 230/732 = 31%
It can't be true that smoking improves your chances of living! What's going on?
65 Learning Objective 5: Simpson's Paradox Example: Break out Data by Age
66 Learning Objective 5: Simpson's Paradox Example
- An association can look quite different after adjusting for the effect of a third variable by grouping the data according to the values of the third variable
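The reversal can be reproduced with a short calculation. In the sketch below only the overall totals (139 deaths among 582 smokers, 230 among 732 nonsmokers) come from the example. The split into two age groups is hypothetical, chosen so that the pieces add up to those totals, and is not the age breakdown shown in the deck.

```python
# (deaths, total) per group. The split by age group is HYPOTHETICAL;
# only the overall totals per smoking status match the slide's figures.
data = {
    "smoker":    {"younger": (80, 500), "older": (59, 82)},
    "nonsmoker": {"younger": (30, 400), "older": (200, 332)},
}

def overall_rate(groups):
    """Death rate after pooling all age groups together."""
    deaths = sum(d for d, _ in groups.values())
    total = sum(t for _, t in groups.values())
    return deaths / total

def rate_by_age(groups):
    """Death rate within each age group separately."""
    return {age: d / t for age, (d, t) in groups.items()}

overall = {status: overall_rate(groups) for status, groups in data.items()}
by_age = {status: rate_by_age(groups) for status, groups in data.items()}

print("overall:", {s: round(v, 3) for s, v in overall.items()})
for age in ("younger", "older"):
    print(age, {s: round(by_age[s][age], 3) for s in data})
```

Pooled over age, smokers show the lower death rate (about 24% versus 31%). Within each hypothetical age group, smokers show the higher rate. The pooled comparison is misleading here because the made-up nonsmokers are concentrated in the older group, where death rates are high regardless of smoking status.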