Chapter 10: Analyzing the Association Between Categorical Variables

- Learn how to detect and describe associations between categorical variables

Section 10.1

- What Is Independence and What is Association?

Example: Is There an Association Between Happiness and Family Income?

- The percentages in a particular row of a table are called conditional percentages
- They form the conditional distribution for happiness, given a particular income level

- Guidelines when constructing tables with conditional distributions:
- Make the response variable the column variable
- Compute conditional proportions for the response variable within each row
- Include the total sample sizes
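
The guidelines above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own computation: the counts are hypothetical (the survey table from the example is not reproduced in this text), and the function name `conditional_distributions` is an assumption of the sketch. Rows are income levels (explanatory), columns are happiness levels (response), and each row keeps its sample size.

```python
def conditional_distributions(table):
    """Return, for each row, the conditional proportions of the response
    variable (the columns) together with the row's total sample size."""
    result = {}
    for row_label, counts in table.items():
        total = sum(counts.values())
        proportions = {col: n / total for col, n in counts.items()}
        result[row_label] = (proportions, total)
    return result


# Hypothetical counts, chosen only for illustration.
counts = {
    "above average": {"not too happy": 10, "pretty happy": 60, "very happy": 30},
    "below average": {"not too happy": 40, "pretty happy": 100, "very happy": 60},
}

dists = conditional_distributions(counts)
for income, (props, n) in dists.items():
    shown = ", ".join(f"{level}: {p:.2f}" for level, p in props.items())
    print(f"{income} (n={n}): {shown}")
```

Reporting the row totals alongside the proportions lets a reader judge how much sampling variability each conditional distribution carries.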

Independence vs Dependence

- For two variables to be independent, the population percentage in any category of one variable is the same for all categories of the other variable
- For two variables to be dependent (or associated), the population percentages in the categories are not all the same

Example: Happiness and Gender

Example: Belief in Life After Death

- Are race and belief in life after death independent or dependent?
- The conditional distributions in the table are similar but not exactly identical
- It is tempting to conclude that the variables are dependent
- But the definition of independence between variables refers to a population
- The table is a sample, not a population

Independence vs Dependence

- Even if variables are independent, we would not expect the sample conditional distributions to be identical
- Because of sampling variability, each sample percentage typically differs somewhat from the true population percentage

Section 10.2

- How Can We Test Whether Categorical Variables Are Independent?

A Significance Test for Categorical Variables

- The hypotheses for the test are:
- H0: The two variables are independent
- Ha: The two variables are dependent (associated)
- The test assumes random sampling and a large sample size

What Do We Expect for Cell Counts if the Variables Are Independent?

- The count in any particular cell is a random variable
- Different samples have different values for the count
- The mean of its distribution is called an expected cell count
- This is found under the presumption that H0 is true

How Do We Find the Expected Cell Counts?

- Expected cell count: for a particular cell, the expected cell count equals (row total × column total) / total sample size
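
As a minimal sketch, the expected cell count under independence, (row total × column total) / total sample size, can be computed for every cell of a table as follows. The 2x2 counts are hypothetical and the function name `expected_counts` is an assumption of this example.

```python
def expected_counts(observed):
    """Expected cell counts under independence:
    (row total * column total) / total sample size."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


# Hypothetical observed counts (rows = explanatory categories,
# columns = response categories).
observed = [[10, 20],
            [30, 40]]

expected = expected_counts(observed)
print(expected)  # [[12.0, 18.0], [28.0, 42.0]]
```

The expected counts keep the same row and column totals as the observed counts; only the way the totals are spread over the cells changes.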

Example: Happiness by Family Income

The Chi-Squared Test Statistic

- The chi-squared statistic summarizes how far the observed cell counts in a contingency table fall from the expected cell counts for a null hypothesis
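
A short Python sketch of the statistic: for each cell, take (observed count - expected count)² / expected count, then sum over all cells. The tables are hypothetical and the function name `chi_squared_statistic` is an assumption of this example; it is an illustration under those assumptions, not the software used in the deck.

```python
def chi_squared_statistic(observed):
    """Sum over all cells of (observed - expected)^2 / expected, where
    expected = row total * column total / total sample size."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            statistic += (obs - exp) ** 2 / exp
    return statistic


# Hypothetical table with a mild departure from independence.
x2 = chi_squared_statistic([[10, 20], [30, 40]])
print(round(x2, 3))  # 0.794

# Hypothetical table whose rows have identical conditional distributions.
x2_independent = chi_squared_statistic([[10, 20], [20, 40]])
print(x2_independent)  # 0.0
```

The second table shows the baseline: when every observed count equals its expected count, the statistic is zero, and it grows as the counts move away from what independence predicts.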

Example: Happiness and Family Income

- State the null and alternative hypotheses for this test
- H0: Happiness and family income are independent
- Ha: Happiness and family income are dependent (associated)

- Report the chi-squared statistic and explain how it was calculated
- To calculate the chi-squared statistic, for each cell, calculate (observed count - expected count)² / expected count
- Sum the values for all the cells
- The chi-squared value is 73.4

- The larger the chi-squared value, the greater the evidence against the null hypothesis of independence and in support of the alternative hypothesis that happiness and income are associated

The Chi-Squared Distribution

- To convert the chi-squared test statistic to a P-value, we use the sampling distribution of the chi-squared statistic
- For large sample sizes, this sampling distribution is well approximated by the chi-squared probability distribution

- Main properties of the chi-squared distribution:
- It falls on the positive part of the real number line
- The precise shape of the distribution depends on the degrees of freedom
- df = (r-1)(c-1), where r is the number of rows and c is the number of columns
- The mean of the distribution equals the df value
- It is skewed to the right
- The larger the chi-squared value, the greater the evidence against H0: independence

The Five Steps of the Chi-Squared Test of Independence

- 1. Assumptions
- Two categorical variables
- Randomization
- Expected counts ≥ 5 in all cells

- 2. Hypotheses
- H0: The two variables are independent
- Ha: The two variables are dependent (associated)

- 3. Test statistic: the chi-squared statistic, the sum over all cells of (observed count - expected count)² / expected count

- 4. P-value: Right-tail probability above the observed chi-squared value, for the chi-squared distribution with df = (r-1)(c-1)
- 5. Conclusion: Report the P-value and interpret in context
- If a decision is needed, reject H0 when P-value ≤ significance level
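
Steps 4 and 5 can be sketched with the Python standard library alone. For whole-number df the chi-squared right-tail probability has a finite closed form, which the hypothetical helper `chi2_right_tail` below uses; the function names, the 3x3 table shape, and the statistic of 12.0 are all assumptions made for illustration. In practice a statistics package would normally supply the P-value.

```python
import math


def degrees_of_freedom(n_rows, n_cols):
    """df = (r - 1)(c - 1) for an r x c contingency table."""
    return (n_rows - 1) * (n_cols - 1)


def chi2_right_tail(x, df):
    """Right-tail probability P(X >= x) for a chi-squared distribution
    with a positive integer df, using finite-sum closed forms."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite correction series.
    tail = math.erfc(math.sqrt(half))
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-half)
    for k in range(1, (df - 1) // 2 + 1):
        tail += term
        term *= x / (2 * k + 1)
    return tail


# Hypothetical inputs: a 3 x 3 table and a chi-squared statistic of 12.0.
alpha = 0.05
df = degrees_of_freedom(3, 3)
p_value = chi2_right_tail(12.0, df)
reject_h0 = p_value <= alpha
print(df, round(p_value, 4), reject_h0)
```

The decision rule in the last lines mirrors step 5: H0 is rejected only when the P-value is at or below the chosen significance level.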

Chi-Squared is Also Used as a Test of Homogeneity

- The chi-squared test does not depend on which is the response variable and which is the explanatory variable
- When a response variable is identified and the population conditional distributions are identical, they are said to be homogeneous
- The test is then referred to as a test of homogeneity

Example: Aspirin and Heart Attacks Revisited

- What are the hypotheses for the chi-squared test for these data?
- The null hypothesis is that whether a doctor has a heart attack is independent of whether he takes placebo or aspirin
- The alternative hypothesis is that there's an association

- Report the test statistic and P-value for the chi-squared test
- The test statistic is 25.01 with a P-value of 0.000 (rounded to three decimal places)
- This is very strong evidence that the population proportion of heart attacks differed for those taking aspirin and for those taking placebo
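
The reported P-value can be checked with a small sketch. It assumes the table is 2x2 (heart attack or not, by placebo or aspirin), so df = 1; the table itself is not reproduced in this text. With df = 1 the chi-squared variable is the square of a standard normal, so its right-tail probability equals erfc(sqrt(x/2)), which is available in Python's `math` module.

```python
import math

statistic = 25.01            # chi-squared statistic reported in the example
df = (2 - 1) * (2 - 1)       # assumed 2 x 2 table, so df = 1

# Right-tail probability for df = 1: P(Z^2 >= x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(statistic / 2))
print(f"{p_value:.3f}")  # 0.000
```

The P-value is not exactly zero; it is simply far smaller than 0.0005, so it prints as 0.000 at three decimal places.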

- The sample proportions indicate that the aspirin group had a lower rate of heart attacks than the placebo group

Limitations of the Chi-Squared Test

- If the P-value is very small, strong evidence exists against the null hypothesis of independence
- But the chi-squared statistic and the P-value tell us nothing about the nature or the strength of the association

- We know that there is statistical significance, but the test alone does not indicate whether there is practical significance as well

Section 10.3

- How Strong is the Association?

- In a study of the two variables (Gender and Happiness), which one is the response variable?
- Gender
- Happiness

- What is the expected cell count for females who are pretty happy?
- 898
- 801.5
- 902
- 521

- Calculate the chi-squared statistic
- 1.75
- 0.27
- 0.98
- 10.34

- At a significance level of 0.05, what is the correct decision?
- Gender and Happiness are independent
- There is an association between Gender and Happiness

Analyzing Contingency Tables

- Is there an association?
- The chi-squared test of independence addresses this
- When the P-value is small, we infer that the variables are associated
- How do the cell counts differ from what independence predicts?
- To answer this question, we compare each observed cell count to the corresponding expected cell count
- How strong is the association?
- Analyzing the strength of the association reveals whether the association is an important one, or if it is statistically significant but weak and unimportant in practical terms

Measures of Association

- A measure of association is a statistic or a parameter that summarizes the strength of the dependence between two variables

Difference of Proportions

- An easily interpretable measure of association is the difference between the proportions making a particular response
- Case (a) exhibits the weakest possible association: no association
- For the response Accept Credit Card, the difference of proportions is 0%
- Case (b) exhibits the strongest possible association
- For the response Accept Credit Card, the difference of proportions is 100%
- In practice, we don't expect data to follow either extreme (0% difference or 100% difference), but the stronger the association, the larger the absolute value of the difference of proportions

Example: Do Student Stress and Depression Depend on Gender?

- Which response variable, stress or depression, has the stronger sample association with gender?

- Stress
- The difference of proportions between females and males was 0.35 - 0.16 = 0.19
- Depression
- The difference of proportions between females and males was 0.08 - 0.06 = 0.02
- In the sample, stress (with a difference of proportions = 0.19) has a stronger association with gender than depression has (with a difference of proportions = 0.02)
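
The comparison above is simple arithmetic, sketched here in Python with the proportions quoted in the example. The function name `difference_of_proportions` is an assumption of the sketch.

```python
def difference_of_proportions(p1, p2):
    """Difference between two group proportions for the same response."""
    return p1 - p2


# Proportions quoted in the example (females first, then males).
stress = difference_of_proportions(0.35, 0.16)
depression = difference_of_proportions(0.08, 0.06)

# The stronger sample association has the larger absolute difference.
stronger = "stress" if abs(stress) > abs(depression) else "depression"
print(round(stress, 2), round(depression, 2), stronger)  # 0.19 0.02 stress
```

Comparing absolute values matters because the sign of the difference only reflects which group is listed first.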

The Ratio of Proportions: Relative Risk

- Another measure of association is the ratio of two proportions: p1/p2
- In medical applications in which the proportion refers to an adverse outcome, it is called the relative risk

Example: Relative Risk for Seat Belt Use and Outcome of Auto Accidents

- Treating the auto accident outcome as the response variable, find and interpret the relative risk

- The adverse outcome is death
- The relative risk is formed for that outcome
- For those who wore a seat belt, the proportion who died equaled 510/412,878 = 0.00124
- For those who did not wear a seat belt, the proportion who died equaled 1601/164,128 = 0.00975
- The relative risk is the ratio: 0.00124/0.00975 = 0.127
- The proportion of subjects wearing a seat belt who died was 0.127 times the proportion of subjects not wearing a seat belt who died

- Many find it easier to interpret the relative risk by reordering the rows of data so that the relative risk has a value above 1.0
- Reversing the order of the rows, we calculate the ratio: 0.00975/0.00124 = 7.9
- The proportion of subjects not wearing a seat belt who died was 7.9 times the proportion of subjects wearing a seat belt who died
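
The two relative risks can be reproduced from the counts quoted in the example. This is a minimal sketch; the function name `relative_risk` is an assumption, and the ratio is computed from the unrounded proportions, which gives the same rounded results as the slides.

```python
def relative_risk(events_1, total_1, events_2, total_2):
    """Ratio of two proportions, p1 / p2."""
    return (events_1 / total_1) / (events_2 / total_2)


# Counts quoted in the example: deaths and group sizes.
belt_deaths, belt_total = 510, 412_878
no_belt_deaths, no_belt_total = 1601, 164_128

rr = relative_risk(belt_deaths, belt_total, no_belt_deaths, no_belt_total)
rr_reversed = relative_risk(no_belt_deaths, no_belt_total, belt_deaths, belt_total)
print(round(rr, 3), round(rr_reversed, 1))  # 0.127 7.9
```

Reversing the rows simply inverts the ratio, so the two values describe the same association from opposite directions.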

- A relative risk of 7.9 represents a strong association
- This is far from the value of 1.0 that would occur if the proportion of deaths were the same for each group
- Wearing a seat belt has a practically significant effect in enhancing the chance of surviving an auto accident

Properties of the Relative Risk

- The relative risk can equal any nonnegative number
- When p1 = p2, the variables are independent and relative risk = 1.0
- Values farther from 1.0 (in either direction) represent stronger associations

A Large Chi-Squared Value Does Not Mean There's a Strong Association

- A large chi-squared value provides strong evidence that the variables are associated
- It does not imply that the variables have a strong association
- This statistic merely indicates (through its P-value) how certain we can be that the variables are associated, not how strong that association is

Section 10.4

- How Can Residuals Reveal the Pattern of Association?

Association Between Categorical Variables

- The chi-squared test and measures of association such as (p1 - p2) and p1/p2 are fundamental methods for analyzing contingency tables
- The P-value for the chi-squared statistic summarizes the strength of evidence against H0: independence
- If the P-value is small, then we conclude that somewhere in the contingency table the population cell proportions differ from independence
- The chi-squared test does not indicate whether all cells deviate greatly from independence or perhaps only some of them do so

Residual Analysis

- A cell-by-cell comparison of the observed counts with the counts that are expected when H0 is true reveals the nature of the evidence against H0
- The difference between an observed and expected count in a particular cell is called a residual
- The residual is negative when fewer subjects are in the cell than expected under H0
- The residual is positive when more subjects are in the cell than expected under H0
- To determine whether a residual is large enough to indicate strong evidence of a deviation from independence in that cell, we use an adjusted form of the residual: the standardized residual

- The standardized residual for a cell = (observed count - expected count)/se
- A standardized residual reports the number of standard errors that an observed count falls from its expected count
- Its formula is complex
- Software can be used to find its value
- A large value (in absolute value) provides evidence against independence in that cell
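
The slides leave the standard error to software. As a sketch of what such software commonly computes, the Python below assumes the usual adjusted-residual standard error, se = sqrt(expected × (1 - row proportion) × (1 - column proportion)); this formula is not stated in the text above, and the counts and the function name `standardized_residuals` are hypothetical.

```python
import math


def standardized_residuals(observed):
    """(observed - expected) / se for every cell, with the assumed
    se = sqrt(expected * (1 - row proportion) * (1 - column proportion))."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(observed):
        out_row = []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            se = math.sqrt(exp * (1 - row_totals[i] / n) * (1 - col_totals[j] / n))
            out_row.append((obs - exp) / se)
        residuals.append(out_row)
    return residuals


# Hypothetical 2 x 2 table of observed counts.
z = standardized_residuals([[10, 20], [30, 40]])
for row in z:
    print([round(value, 2) for value in row])
```

In a 2x2 table all four standardized residuals have the same absolute value, so the pattern is easiest to read in larger tables, where individual cells can stand out.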

Example: Standardized Residuals for Religiosity and Gender

- To what extent do you consider yourself a religious person?
- Interpret the standardized residuals in the table

- The table exhibits large positive residuals for the cells for females who are very religious and for males who are not at all religious
- In these cells, the observed count is much larger than the expected count
- There is strong evidence that the population has more subjects in these cells than if the variables were independent
- The table exhibits large negative residuals for the cells for females who are not at all religious and for males who are very religious
- In these cells, the observed count is much smaller than the expected count
- There is strong evidence that the population has fewer subjects in these cells than if the variables were independent

Section 10.5

- What if the Sample Size is Small? Fisher's Exact Test

Fisher's Exact Test

- The chi-squared test of independence is a large-sample test
- When the expected frequencies are small, any of them being less than about 5, small-sample tests are more appropriate
- Fisher's exact test is a small-sample test of independence
- The calculations for Fisher's exact test are complex
- Statistical software can be used to obtain the P-value for the test that the two variables are independent
- The smaller the P-value, the stronger is the evidence that the variables are associated

Example: Tea Tastes Better with Milk Poured First?

- This is an experiment conducted by Sir Ronald Fisher
- His colleague, Dr. Muriel Bristol, claimed that when drinking tea she could tell whether the milk or the tea had been added to the cup first
- Experiment
- Fisher asked her to taste eight cups of tea
- Four had the milk added first
- Four had the tea added first
- She was asked to indicate which four had the milk added first
- The order of presenting the cups was randomized

- The one-sided version of the test pertains to the alternative that her predictions are better than random guessing
- Does the P-value suggest that she had the ability to predict better than random guessing?

- The P-value of 0.243 does not give much evidence against the null hypothesis
- The data did not support Dr. Bristol's claim that she could tell whether the milk or the tea had been added to the cup first
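
The one-sided P-value for a 2x2 table with fixed margins can be sketched with the hypergeometric distribution. The results table is not reproduced in this text, so the sketch assumes she correctly identified three of the four milk-first cups, which is the outcome consistent with the reported P-value of 0.243. The function name `fisher_one_sided_p` is also an assumption.

```python
from math import comb


def fisher_one_sided_p(a, b, c, d):
    """One-sided Fisher exact P-value for the 2 x 2 table [[a, b], [c, d]]:
    the probability that the first cell is at least a, with all row and
    column totals held fixed (hypergeometric distribution)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favorable = sum(
        comb(col1, k) * comb(n - col1, row1 - k) for k in range(a, upper + 1)
    )
    return favorable / comb(n, row1)


# Assumed outcome: rows = actual (milk first, tea first),
# columns = her prediction (milk first, tea first).
p_value = fisher_one_sided_p(3, 1, 1, 3)
print(round(p_value, 3))  # 0.243

# For comparison, a perfect score of four out of four.
p_perfect = fisher_one_sided_p(4, 0, 0, 4)
print(round(p_perfect, 3))  # 0.014
```

With only eight cups, even three correct out of four is an outcome that random guessing produces about a quarter of the time, which is why the evidence is weak.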