Title: Chapter 2: Looking at Data Relationships; Section 9.1: Data Analysis for Two-Way Tables
1Chapter 2: Looking at Data Relationships; Section 9.1: Data Analysis for Two-Way Tables
2Relationships Between 2 Variables
- More than one variable can be measured on each individual.
- Examples
- Gender and Height
- Eye color and Major
- Size and Cost
- We want to look at the relationships among these variables.
3Relationships Between 2 Variables
- A response variable measures an outcome of a study. An explanatory variable explains or causes changes in the response variable.
- The explanatory variable influences the response variable.
- Examples
- Class attendance and the grade in STAT 303
- Gender and Height
- Smoking and lung cancer
4Relationships Between 2 Variables
- We may be interested in relationships between different types of variables.
- Categorical and Numeric
- E.g. Gender and Height
- Categorical and Categorical
- E.g. Eye color and Major
- Numeric and Numeric
- E.g. Size and Cost
51. Relationships Between Categorical and Numeric
Variables
- We are interested in comparing the numeric variable across the levels of the categorical variable.
- Examples
- Compare top speeds for 4 different car brands
- Compare prices for no-airbag, one-airbag, and two-airbag cars
- Compare GPR for 20 different majors
61. Relationships Between Categorical and Numeric
Variables
- We could look at summary statistics for each group.
- Example: prices for cars with no airbags, one airbag, and two airbags.
- Explanatory: airbag
- Response: price
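As a minimal sketch of "summary statistics for each group", the snippet below summarizes a numeric response within each level of a categorical explanatory variable. The prices are invented for illustration (they are not the data behind the slide), and the names `prices_by_airbags` and `summarize` are hypothetical.

```python
from statistics import mean, median, stdev

# Hypothetical prices (in $1000s), invented purely for illustration.
prices_by_airbags = {
    "none": [8.4, 9.1, 10.3, 11.0, 12.2],
    "one": [12.5, 14.0, 15.8, 17.1, 19.6],
    "two": [16.9, 18.3, 21.5, 24.0, 29.3],
}


def summarize(values):
    """Return basic measures of location and spread for one group."""
    return {
        "n": len(values),
        "mean": round(mean(values), 2),
        "median": median(values),
        "sd": round(stdev(values), 2),
        "min": min(values),
        "max": max(values),
    }


# One summary per level of the explanatory variable (airbag).
summaries = {group: summarize(vals) for group, vals in prices_by_airbags.items()}
for group, stats in summaries.items():
    print(group, stats)
```

Comparing the rows of the printout plays the same role as comparing boxes in a side-by-side boxplot: if the centers and spreads differ across groups, that suggests an association.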
71. Relationships Between Categorical and Numeric
Variables
Side by Side Boxplots
81. Relationships Between Categorical and Numeric
Variables
- Association: A categorical variable and a numeric variable are associated if the distribution of the numeric variable is not the same for all populations. (The populations are defined by the values the categorical explanatory variable takes on.)
- Example of no association
92. Relationships Between Two Categorical Variables
- Depending on the situation, one of the variables is the explanatory variable and the other is the response variable.
- Examples
- Gender and Tomato Preference
- Country of Origin and Marital Status
- Gender and Highest Degree Obtained
- Compare percentages in each level of one categorical variable across the levels of the other categorical variable.
10Two-Way Tables
- A two-way (contingency) table can summarize the data for relationships between two categorical variables.
- Example: Response is Tomatoes, Explanatory is Gender
112. Relationships Between Two Categorical Variables
- Association: Two categorical variables are associated if the relative frequencies in the response variable are not the same for all populations. (The populations are defined by the values the categorical explanatory variable takes on.)
- Percentages for the joint, marginal, and conditional distributions
- Joint Distribution: How likely are you to like tomatoes and be a male? Ans: 13/38
- Marginal Distribution: What is the percentage of people who like tomatoes? Ans: 21/38
- Conditional Distribution: If you are a female, how likely are you to like tomatoes? Ans: 8/19
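The three kinds of distribution can be sketched in a few lines. The cell counts below are not quoted on the slide; they are derived from the stated answers (13/38, 21/38, 8/19), which imply 19 females, 19 males, 6 males who dislike tomatoes, and 11 females who dislike tomatoes. The variable names are hypothetical.

```python
# Cell counts derived from the answers 13/38, 21/38 and 8/19 (an inference,
# not a table copied from the slide).
counts = {
    ("male", "like"): 13,
    ("male", "dislike"): 6,
    ("female", "like"): 8,
    ("female", "dislike"): 11,
}
total = sum(counts.values())  # 38 people in all

# Joint distribution: one cell divided by the grand total.
joint_male_like = counts[("male", "like")] / total

# Marginal distribution: a row or column total divided by the grand total.
like_total = sum(n for (_, pref), n in counts.items() if pref == "like")
marginal_like = like_total / total

# Conditional distribution: one cell divided by the total of its own group.
female_total = sum(n for (gender, _), n in counts.items() if gender == "female")
male_total = sum(n for (gender, _), n in counts.items() if gender == "male")
like_given_female = counts[("female", "like")] / female_total
like_given_male = counts[("male", "like")] / male_total

print(f"P(like and male) = {joint_male_like:.3f}")
print(f"P(like)          = {marginal_like:.3f}")
print(f"P(like | female) = {like_given_female:.3f}")
print(f"P(like | male)   = {like_given_male:.3f}")
```

Because the conditional proportions differ between the two groups (8/19 for females versus 13/19 for males), gender and tomato preference are associated in this sample.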
122. Relationships Between Two Categorical Variables
Eg 9.9: Hospital A loses 3%, B loses 2%. Choose B.
Eg 9.10: Good condition: A loses 1%, B 1.3%. Poor condition: A loses 3.8%, B 4%. Choose A.
132. Relationships Between Two Categorical Variables
- Lurking Variable: A variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
- E.g. Good/Poor Condition.
- Simpson's Paradox: An association or comparison that holds for all of several groups can reverse direction when the data are combined to form a single group. This reversal is called Simpson's Paradox. This can happen when a lurking variable is present.
- E.g. We chose Hospital B in 9.9, but chose Hospital A in 9.10.
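The reversal can be checked with a little arithmetic. The counts below are hypothetical: they were chosen only so that the resulting rates match the percentages quoted for Examples 9.9 and 9.10, and they should not be read as the examples' actual data.

```python
# Hypothetical (patients lost, patients treated) counts, chosen so the rates
# reproduce the quoted percentages; they are an assumption, not source data.
hospitals = {
    "A": {"good": (6, 600), "poor": (57, 1500)},
    "B": {"good": (8, 600), "poor": (8, 200)},
}


def loss_rate(lost, treated):
    """Percentage of treated patients who were lost."""
    return 100 * lost / treated


def overall_rate(groups):
    """Loss rate after combining all patient-condition groups."""
    lost = sum(l for l, _ in groups.values())
    treated = sum(t for _, t in groups.values())
    return loss_rate(lost, treated)


for name, groups in hospitals.items():
    by_condition = {cond: round(loss_rate(*cell), 1) for cond, cell in groups.items()}
    print(name, "overall:", round(overall_rate(groups), 1), "by condition:", by_condition)
```

With these counts Hospital A has the lower rate within each condition, yet the higher rate overall, because most of its patients arrive in poor condition. Patient condition is the lurking variable.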
143. Relationships Between Two Numeric Variables
- Depending on the situation, one of the variables is the explanatory variable and the other is the response variable.
- Examples
- Height and Weight
- Income and Age
- Time and Growth
- Amount of time spent studying for STAT 303 and exam scores
153. Relationships Between Two Numeric Variables
- Example
- Response: MPG
- Explanatory: Weight
This is called a scatter plot. Each individual in the data appears as one point in the plot. The response variable goes on the y-axis and the explanatory variable on the x-axis.
163. Relationships Between Two Numeric Variables
- Example
- Response: MPG
- Explanatory: Weight
173. Relationships Between Two Numeric Variables
- Example
- Response: Horsepower
- Explanatory: Weight
183. Relationships Between Two Numeric Variables
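- Correlation, or r, measures the direction and strength of the linear relationship between two numeric variables.
- If X represents the explanatory variable and Y represents the response, the correlation is calculated as
$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)$$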
193. Relationships Between Two Numeric Variables
- General Properties of Correlation
- It must be between -1 and 1, that is, -1 ≤ r ≤ 1.
- If r is negative, the relationship is negative.
- If r is -1, there is a perfect negative linear relationship.
- If r is positive, the relationship is positive.
- If r is 1, there is a perfect positive linear relationship.
- If r is 0, there is no linear relationship.
- If explanatory and response are switched, r remains the same.
- r has no units of measurement associated with it.
- Scale changes do not affect r.
- r measures ONLY linear relationships.
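Several of these properties can be checked numerically. The sketch below uses the standard sample formula, r = 1/(n-1) times the sum of products of standardized x and y values. The weight and MPG values are invented for illustration, and the function name `correlation` is a hypothetical helper.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Sample correlation: average product of standardized values (n - 1 divisor)."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    products = (((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y))
    return sum(products) / (n - 1)


# Hypothetical data: car weight in pounds and fuel economy in MPG.
weight_lb = [2000, 2500, 3000, 3500, 4000]
mpg = [38, 33, 27, 24, 19]

r = correlation(weight_lb, mpg)

# Switching explanatory and response leaves r unchanged.
r_switched = correlation(mpg, weight_lb)

# A change of scale (pounds to kilograms) leaves r unchanged.
weight_kg = [w * 0.4536 for w in weight_lb]
r_rescaled = correlation(weight_kg, mpg)

print(round(r, 4), round(r_switched, 4), round(r_rescaled, 4))
```

All three printed values agree, and the value is negative because heavier cars in this made-up data set get lower MPG.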
203. Relationships Between Two Numeric Variables
[Figure: scatter plots with r = 1, r = 0, and r = -1]
213. Relationships Between Two Numeric Variables
[Figure: scatter plots with r = 0.0489, r = 0.04, r = 0.4306, and r = -0.8428]
223. Relationships Between Two Numeric Variables
- It is possible for there to be a strong relationship between two variables and still have r = 0.
- Example: [Figure]
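One simple case, offered as an illustration rather than as the example originally pictured on the slide: if y is exactly x squared and the x values are symmetric about zero, y is completely determined by x, yet the correlation is 0. The helper name `correlation` is hypothetical.

```python
from math import sqrt
from statistics import mean


def correlation(x, y):
    """Sample correlation computed from sums of squares and cross products."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


# A perfect, but non-linear, relationship: y = x^2 on symmetric x values.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = correlation(x, y)
print(r)  # the positive and negative cross products cancel
```

The cross products on the left half of the parabola cancel those on the right half, so r reports no linear relationship even though the curve fits perfectly.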
233. Relationships Between Two Numeric Variables
- Important notes
- Association/correlation does not imply causation.
- Slope is not correlation.
- A scale change does not affect correlation, but it does affect slope.
- For correlation, it doesn't matter which variable is x and which is y. But for slope, it does matter.
- Correlation doesn't measure the strength of a non-linear relationship.
- Example plot: r = 0.46
243. Relationships Between Two Numeric Variables
- Association does not imply causation.
- For the world's nations,
- variable X: number of TV sets per person
- variable Y: average life expectancy
- There is a high positive correlation: nations with more TV sets have higher life expectancies.
- Is there causation? Can we lengthen the lives of people in poor nations by shipping them TV sets? No!
253. Relationships Between Two Numeric Variables
- Regression Line: a straight line that describes how a response variable Y changes as an explanatory variable X changes.
- General form: y = a + bx, where a is the intercept and b is the slope.
- Least-Squares Regression: the best-fitting line
- Association: Two numeric variables are associated if the distribution of the response is not the same for each value of the explanatory variable.
- We often use a regression line to predict the value of y for a given value of x.
- Regression, unlike correlation, requires that we have an explanatory variable and a response variable.
26Regression Line
- Fitting a line to data means drawing a line that comes as close as possible to the points.
- Extrapolation: the use of a regression line for prediction far outside the range of values of the explanatory variable x that you used to obtain the line.
- Such predictions are often NOT accurate.
273. Relationships Between Two Numeric Variables
Horsepower = -10.78 + 0.04 × weight (Equation of the line.)
Intercept: y-value or response (horsepower) when the line crosses the y-axis.
Slope: increase in response for a unit increase in the explanatory variable.
So if weight increases by one pound, horsepower increases by 0.04 units (on average).
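A fitted line like this one can be used directly for prediction. The sketch below uses the intercept and slope quoted above; the function name `predicted_horsepower` and the example weights are hypothetical, and the range of weights actually used to fit the line is not given in these notes.

```python
def predicted_horsepower(weight_lb):
    """Predicted horsepower from the fitted line, with weight in pounds."""
    intercept = -10.78
    slope = 0.04
    return intercept + slope * weight_lb


# Prediction for a plausible car weight (3000 lb is an assumed example value).
print(round(predicted_horsepower(3000), 2))

# The slope is the change in the prediction for a one-pound increase in weight.
print(round(predicted_horsepower(3001) - predicted_horsepower(3000), 2))

# Extrapolation: a weight of 0 lb is far outside any realistic data range,
# and the line returns a negative horsepower, which is meaningless.
print(predicted_horsepower(0))
```

The last line shows why extrapolation is risky: the intercept is simply where the line crosses the y-axis, not a sensible prediction for a car that weighs nothing.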
28Least-Squares Regression Line
- The least-squares regression line of y on x is the line that makes the sum of squares of the vertical distances of the data points from the line as small as possible.
- These vertical distances are called the residuals, or the errors in prediction, because they measure how far each point is from the line:
$$\text{residual} = y - \hat{y}$$
- where $y$ is the observed point and $\hat{y}$ is the predicted point.
29Least-Squares Regression Line
- The equation of the least-squares regression line of y on x is
$$\hat{y} = a + bx$$
- with slope $b = r \, \dfrac{s_y}{s_x}$ and intercept $a = \bar{y} - b\bar{x}$.
30Least-Squares Regression Line
- The expression for the slope, b, says that along the regression line, a change of one standard deviation in x corresponds to a change of r standard deviations in y.
- The slope, b, is the amount by which y changes when x increases by one unit.
- The intercept, a, is the value of y when $x = 0$.
- The least-squares regression line ALWAYS passes through the point $(\bar{x}, \bar{y})$.
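As a rough sketch of these facts, the snippet below computes the slope as r times s_y / s_x and the intercept as the mean of y minus b times the mean of x, then checks that the line passes through the point of means. The data values and the function name `least_squares` are hypothetical.

```python
from statistics import mean, stdev


def least_squares(x, y):
    """Return (intercept a, slope b, correlation r) for the line y-hat = a + b*x."""
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    s_x, s_y = stdev(x), stdev(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = sxy / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x      # slope: r standard deviations of y per s_x of x
    a = y_bar - b * x_bar  # intercept: forces the line through (x_bar, y_bar)
    return a, b, r


# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, r = least_squares(x, y)

# Residual = observed y minus predicted y.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]

print("intercept:", round(a, 3), "slope:", round(b, 3), "r:", round(r, 4))
print("prediction at x_bar:", round(a + b * mean(x), 3), "y_bar:", round(mean(y), 3))
print("sum of residuals:", round(sum(residuals), 10))
```

The prediction at the mean of x equals the mean of y, and the residuals sum to zero (up to rounding), which is what passing through the point of means implies.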
31r^2 in Regression
- The square of the correlation, r^2, is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
- Use r^2 as a measure of how successfully the regression explains the response.
- Interpret r^2 as the percent of variance explained.
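The "fraction of variation explained" reading can be verified numerically: one minus the ratio of residual variation to total variation equals the square of the correlation. The data and the function names below are hypothetical, chosen only to illustrate the identity.

```python
from math import sqrt
from statistics import mean


def sums(x, y):
    """Return (Sxx, Syy, Sxy): sums of squares and cross products about the means."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxx, syy, sxy


def fraction_explained(x, y):
    """1 - (residual sum of squares) / (total sum of squares of y)."""
    sxx, syy, sxy = sums(x, y)
    b = sxy / sxx                # least-squares slope (same as r * s_y / s_x)
    a = mean(y) - b * mean(x)    # least-squares intercept
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    return 1 - sse / syy


# Hypothetical data, invented for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 4, 2, 6, 5]

sxx, syy, sxy = sums(x, y)
r = sxy / sqrt(sxx * syy)

print("r^2 from correlation:", round(r ** 2, 4))
print("fraction of variation explained:", round(fraction_explained(x, y), 4))
```

The two printed values agree, so multiplying r^2 by 100 gives the percent of the variation in y explained by the line.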
32Relationships between 2 numeric variables
- Example
- How much of the variation is explained by the least-squares line of y on x? ______
- What is the correlation coefficient? ______
Horsepower = -10.78 + 0.04 × weight (Equation of the line.)
__________: y-value or response (horsepower) when the line crosses the y-axis.
_______: increase in response for a unit increase in the explanatory variable.
So if weight increases by one pound, horsepower increases by 0.04 units (on average).
333. Relationships Between Two Numeric Variables
- What is the effect of an outlier on correlation?
- Adding a point that is not near the line and is far from the other points (an outlier in the y direction)
r = 0.53. Note: This point does not greatly affect the estimated regression equation.
343. Relationships Between Two Numeric Variables
- What is the effect of an outlier on correlation?
- Adding a point that is near the line and is far from the other points (an outlier in the x direction)
r = 0.94. Note: This point greatly influences the estimated regression equation (an influential point).
35- How does the correlation change when adding a point to a data set?
- Adding a point at the mean doesn't change anything (not even the intercept or slope).
- The further a point is from the mean, the more the correlation changes.
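These effects can be explored with a small experiment. The data set and the three added points below are hypothetical and were chosen only to illustrate the pattern; they do not reproduce the r = 0.53 and r = 0.94 plots from the slides.

```python
from math import sqrt
from statistics import mean


def correlation(x, y):
    """Sample correlation computed from sums of squares and cross products."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / sqrt(sxx * syy)


def r_with_extra_point(x, y, point):
    """Correlation after appending one (x, y) point to the data."""
    px, py = point
    return correlation(x + [px], y + [py])


# Hypothetical base data with a moderate positive correlation.
x = [1, 2, 3, 4, 5, 6]
y = [3, 1, 4, 2, 6, 5]

r_base = correlation(x, y)
r_at_mean = r_with_extra_point(x, y, (mean(x), mean(y)))  # point at the means
r_y_outlier = r_with_extra_point(x, y, (3.5, 15))         # far from the line
r_x_outlier = r_with_extra_point(x, y, (20, 14))          # far out, near the line

print("base:", round(r_base, 3))
print("point at the mean:", round(r_at_mean, 3))
print("outlier in y direction:", round(r_y_outlier, 3))
print("outlier in x direction:", round(r_x_outlier, 3))
```

In this sketch the point at the means leaves r unchanged, the outlier in the y direction weakens the correlation, and the distant point that lies near the line strengthens it.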
36SUMMARY Descriptive Statistics
- ONE POPULATION (Chapter 1)
- Describing the distribution of a single variable
- Categorical variable
- Frequency table
- Pie chart
- Bar chart
- Relative frequencies, Mode
- Numeric variable
- Measures of location (center (mean, median), Q1, Q3, min, max)
- Measures of spread (standard deviation, range, IQR)
- Frequency table
- Stemplot
- Histogram
- Boxplot
- Normal quantile plot
37SUMMARY Descriptive Statistics
- COMPARING POPULATIONS (Chapter 2 and Section 9.1)
- Looking for associations between two variables
- Explanatory (independent) variable is categorical, Response (dependent) variable is numeric
- Measures of location and spread by category
- Side-by-side boxplot
- Explanatory variable is categorical, Response variable is also categorical
- Two-way table
- Simpson's Paradox
- Lurking variable
- Explanatory variable is numeric, Response variable is also numeric
- Scatter plot
- Correlation, r