How Can We Test whether Categorical Variables are Independent?


1
Section 10.2
  • How Can We Test whether Categorical Variables are
    Independent?

2
A Significance Test for Categorical Variables
  • The hypotheses for the test are:
  • H0: The two variables are independent
  • Ha: The two variables are dependent
    (associated)
  • The test assumes random sampling and a large
    sample size

3
What Do We Expect for Cell Counts if the
Variables Are Independent?
  • The count in any particular cell is a random
    variable
  • Different samples have different values for the
    count
  • The mean of its distribution is called an
    expected cell count
  • This is found under the presumption that H0 is
    true

4
How Do We Find the Expected Cell Counts?
  • Expected Cell Count
  • For a particular cell, the expected cell count
    equals (row total × column total) / total
    sample size
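
As a minimal sketch of this rule (row total times column total, divided by the total sample size), the snippet below uses the Gender and Happiness counts that appear in the review questions of this presentation. The function name `expected_counts` is just an illustrative choice, not something from the slides.

```python
# Observed counts from the Gender and Happiness table in this presentation.
# Rows: Females, Males. Columns: Not, Pretty, Very.
observed = [
    [163, 898, 502],
    [130, 705, 379],
]


def expected_counts(table):
    """Expected count per cell under H0: row total * column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for label, row in zip(["Females", "Males"], expected):
    print(label, [round(e, 1) for e in row])
```

The Females/Pretty cell comes out at about 902, which matches the answer given in the review question on that table.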

5
Example: Happiness by Family Income
6
The Chi-Squared Test Statistic
  • The chi-squared statistic summarizes how far the
    observed cell counts in a contingency table fall
    from the expected cell counts for a null
    hypothesis
  • X² = Σ (observed count - expected count)² /
    expected count, summed over all cells
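
A short sketch of the statistic, assuming the usual form: for each cell take (observed - expected)² / expected, then sum over all cells. The happiness and income table is not reproduced in this transcript, so the example uses the Gender and Happiness counts from the review questions instead. The function name is illustrative.

```python
# Gender and Happiness counts (rows: Females, Males; columns: Not, Pretty, Very).
observed = [
    [163, 898, 502],
    [130, 705, 379],
]


def chi_squared_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected count under H0
            x2 += (obs - exp) ** 2 / exp
    return x2


x2 = chi_squared_statistic(observed)
print(round(x2, 2))
```

For this table the sum works out to roughly 0.27, one of the answer choices in the review question.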

7
Example: Happiness and Family Income
8
Example: Happiness and Family Income
  • State the null and alternative hypotheses for
    this test
  • H0: Happiness and family income are independent
  • Ha: Happiness and family income are dependent
    (associated)

9
Example: Happiness and Family Income
  • Report the X² statistic and explain how it was
    calculated
  • To calculate the X² statistic, for each cell,
    calculate (observed count - expected count)² /
    expected count
  • Sum the values for all the cells
  • The X² value is 73.4

10
Example: Happiness and Family Income
  • The larger the X² value, the greater the
    evidence against the null hypothesis of
    independence and in support of the alternative
    hypothesis that happiness and income are
    associated

11
The Chi-Squared Distribution
  • To convert the X² test statistic to a
    P-value, we use the sampling distribution of the
    X² statistic
  • For large sample sizes, this sampling
    distribution is well approximated by the
    chi-squared probability distribution

12
The Chi-Squared Distribution
13
The Chi-Squared Distribution
  • Main properties of the chi-squared distribution
  • It falls on the positive part of the real number
    line
  • The precise shape of the distribution depends on
    the degrees of freedom
  • df = (r-1)(c-1), where r is the number of rows
    and c is the number of columns in the table

14
The Chi-Squared Distribution
  • Main properties of the chi-squared distribution
  • The mean of the distribution equals the df value
  • It is skewed to the right
  • The larger the X² value, the greater the
    evidence against H0: independence
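
The properties above can be checked with a small simulation. This sketch relies on a standard fact that the slides do not state: a chi-squared variable with df degrees of freedom has the same distribution as the sum of df squared independent standard normal variables. The seed, df value, and number of draws are arbitrary choices.

```python
import random


def simulate_chi_squared(df, n_draws, seed=0):
    """Draw chi-squared values as sums of df squared standard normals."""
    rng = random.Random(seed)
    return [
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))
        for _ in range(n_draws)
    ]


df = 4
draws = simulate_chi_squared(df, n_draws=20000)
sample_mean = sum(draws) / len(draws)
sample_median = sorted(draws)[len(draws) // 2]

print("smallest draw:", min(draws))      # never negative
print("sample mean:", sample_mean)       # close to df
print("sample median:", sample_median)   # below the mean, as in a right-skewed shape
```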

15
The Chi-Squared Distribution
16
The Five Steps of the Chi-Squared Test of
Independence
  • 1. Assumptions
  • Two categorical variables
  • Randomization
  • Expected counts ≥ 5 in all cells

17
The Five Steps of the Chi-Squared Test of
Independence
  • 2. Hypotheses
  • H0: The two variables are independent
  • Ha: The two variables are dependent (associated)

18
The Five Steps of the Chi-Squared Test of
Independence
  • 3. Test Statistic: X² = Σ (observed count -
    expected count)² / expected count

19
The Five Steps of the Chi-Squared Test of
Independence
  • 4. P-value: Right-tail probability above the
    observed X² value, for the chi-squared
    distribution with df = (r-1)(c-1)
  • 5. Conclusion: Report P-value and interpret in
    context
  • If a decision is needed, reject H0 when P-value
    ≤ significance level
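
Steps 4 and 5 can be sketched with the standard library alone. The helper below is an illustrative implementation, not the method used in the slides: it computes the right-tail probability for integer df from the recurrence Q(a+1, y) = Q(a, y) + y^a e^(-y) / Γ(a+1), starting from e^(-y) for even df and erfc(√y) for odd df, with y = X²/2. The example values (X² = 0.27 for a 2 × 3 table, significance level 0.05) correspond to the Gender and Happiness review questions.

```python
import math


def chi_squared_right_tail(x2, df):
    """P(chi-squared with integer df >= 1 exceeds x2)."""
    if df < 1 or int(df) != df:
        raise ValueError("df must be a positive integer")
    if x2 <= 0:
        return 1.0
    y = x2 / 2.0
    if df % 2 == 0:
        a, tail = 1.0, math.exp(-y)              # df = 2 case
    else:
        a, tail = 0.5, math.erfc(math.sqrt(y))   # df = 1 case
    # Step up two degrees of freedom at a time until a == df / 2.
    while a < df / 2.0:
        tail += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return tail


r, c = 2, 3                      # rows and columns of the table
df = (r - 1) * (c - 1)
x2 = 0.27                        # X² for the Gender and Happiness table
alpha = 0.05                     # significance level

p_value = chi_squared_right_tail(x2, df)
reject_h0 = p_value <= alpha
print(f"df = {df}, P-value = {p_value:.3f}, reject H0: {reject_h0}")
```

With a P-value far above 0.05, H0 is not rejected here: the data give little evidence of an association between gender and happiness.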

20
Chi-Squared is Also Used as a Test of
Homogeneity
  • The chi-squared test does not depend on which is
    the response variable and which is the
    explanatory variable
  • When a response variable is identified and the
    population conditional distributions are
    identical, they are said to be homogeneous
  • The test is then referred to as a test of
    homogeneity

21
Example: Aspirin and Heart Attacks Revisited
22
Example: Aspirin and Heart Attacks Revisited
  • What are the hypotheses for the chi-squared test
    for these data?
  • The null hypothesis is that whether a doctor has
    a heart attack is independent of whether he takes
    placebo or aspirin
  • The alternative hypothesis is that there's an
    association

23
Example: Aspirin and Heart Attacks Revisited
  • Report the test statistic and P-value for the
    chi-squared test
  • The test statistic is 25.01 with a P-value of
    0.000 (rounded to three decimal places)
  • This is very strong evidence that the population
    proportion of heart attacks differed for those
    taking aspirin and for those taking placebo
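
A reported P-value of 0.000 is a rounded value, not an exact zero. The sketch below assumes the table is 2 × 2 (aspirin or placebo by heart attack or not), so df = 1; the table itself is not reproduced in this transcript. For df = 1 the right-tail probability equals erfc(√(X²/2)).

```python
import math

x2 = 25.01   # test statistic reported on the slide
df = 1       # assumption: 2 x 2 table, so (2-1)(2-1) = 1

# For df = 1, P(chi-squared >= x2) = P(|Z| >= sqrt(x2)) = erfc(sqrt(x2 / 2)).
p_value = math.erfc(math.sqrt(x2 / 2.0))

print(f"{p_value:.3f}")   # rounded to three decimals, as software would report it
print(f"{p_value:.1e}")   # the same value in scientific notation
```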

24
Example: Aspirin and Heart Attacks Revisited
  • The sample proportions indicate that the aspirin
    group had a lower rate of heart attacks than the
    placebo group

25
Limitations of the Chi-Squared Test
  • If the P-value is very small, strong evidence
    exists against the null hypothesis of
    independence
  • But:
  • The chi-squared statistic and the P-value tell us
    nothing about the nature or the strength of the
    association

26
Limitations of the Chi-Squared Test
  • We know that there is statistical significance,
    but the test alone does not indicate whether
    there is practical significance as well

27
Section 10.3
  • How Strong is the Association?

28
The following is a table on Gender and Happiness
Gender   Not  Pretty  Very
Females  163  898     502
Males    130  705     379
  • In a study of the two variables (Gender and
    Happiness), which one is the response variable?
  • Gender
  • Happiness

29
The following is a table on Gender and Happiness
Gender   Not  Pretty  Very
Females  163  898     502
Males    130  705     379
  • What is the Expected Cell Count for Females who
    are Pretty Happy?
  • 898
  • 801.5
  • 902
  • 521

30
The following is a table on Gender and Happiness
Gender   Not  Pretty  Very
Females  163  898     502
Males    130  705     379
  • What is the Expected Cell Count for Females who
    are Pretty Happy?
  • 898
  • 801.5
  • 902 = N × [(898 + 705)/N] × [(163 + 898 + 502)/N],
    where N = 2777 is the total sample size
  • 521

31
The following is a table on Gender and Happiness
Gender   Not  Pretty  Very
Females  163  898     502
Males    130  705     379
  • Calculate the X² statistic
  • 1.75
  • 0.27
  • 0.98
  • 10.34

32
The following is a table on Gender and Happiness
Gender   Not  Pretty  Very
Females  163  898     502
Males    130  705     379
  • At a significance level of 0.05, what is the
    correct decision?
  • Gender and Happiness are independent
  • There is an association between Gender and
    Happiness

33
Analyzing Contingency Tables
  • Is there an association?
  • The chi-squared test of independence addresses
    this
  • When the P-value is small, we infer that the
    variables are associated

34
Analyzing Contingency Tables
  • How do the cell counts differ from what
    independence predicts?
  • To answer this question, we compare each observed
    cell count to the corresponding expected cell
    count

35
Analyzing Contingency Tables
  • How strong is the association?
  • Analyzing the strength of the association reveals
    whether the association is an important one, or
    if it is statistically significant but weak and
    unimportant in practical terms

36
Measures of Association
  • A measure of association is a statistic or a
    parameter that summarizes the strength of the
    dependence between two variables

37
Difference of Proportions
  • An easily interpretable measure of association is
    the difference between the proportions making a
    particular response

38
Difference of Proportions
39
Difference of Proportions
  • Case (a) exhibits the weakest possible
    association: no association
  • The difference of proportions is 0

        Accept Credit Card
Income  No   Yes
High    60   40
Low     60   40
40
Difference of Proportions
  • Case (b) exhibits the strongest possible
    association
  • The difference of proportions is 100%

        Accept Credit Card
Income  No   Yes
High    0    100
Low     100  0
41
Difference of Proportions
  • In practice, we don't expect data to follow
    either extreme (0% difference or 100%
    difference), but the stronger the association,
    the larger the absolute value of the difference of
    proportions

42
Example: Do Student Stress and Depression Depend
on Gender?
43
Example: Do Student Stress and Depression Depend
on Gender?
  • Which response variable, stress or depression,
    has the stronger sample association with gender?

44
Example: Do Student Stress and Depression Depend
on Gender?
  • Stress
  • The difference of proportions between females and
    males was 0.35 - 0.16 = 0.19

Gender  Yes  No
Female  35   65
Male    16   84
45
Example: Do Student Stress and Depression Depend
on Gender?
  • Depression
  • The difference of proportions between females and
    males was 0.08 - 0.06 = 0.02

Gender  Yes  No
Female  8    92
Male    6    94
46
Example: Do Student Stress and Depression Depend
on Gender?
  • In the sample, stress (with a difference of
    proportions = 0.19) has a stronger association
    with gender than depression has (with a
    difference of proportions = 0.02)
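
The comparison can be reproduced directly from the two tables in this example. This is a small sketch; the dictionary layout and function name are illustrative choices, and each row is treated as (Yes, No) values that sum to 100.

```python
# (Yes, No) values by gender, copied from the stress and depression tables.
stress = {"Female": (35, 65), "Male": (16, 84)}
depression = {"Female": (8, 92), "Male": (6, 94)}


def difference_of_proportions(table, group1="Female", group2="Male"):
    """Proportion answering Yes in group1 minus the same proportion in group2."""
    yes1, no1 = table[group1]
    yes2, no2 = table[group2]
    return yes1 / (yes1 + no1) - yes2 / (yes2 + no2)


diff_stress = difference_of_proportions(stress)
diff_depression = difference_of_proportions(depression)

print(f"stress: {diff_stress:.2f}, depression: {diff_depression:.2f}")
stronger = "stress" if abs(diff_stress) > abs(diff_depression) else "depression"
print("stronger sample association with gender:", stronger)
```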

47
Example: Relative Risk for Seat Belt Use and
Outcome of Auto Accidents
48
Example: Relative Risk for Seat Belt Use and
Outcome of Auto Accidents
  • Treating the auto accident outcome as the
    response variable, find and interpret the
    relative risk
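
The seat belt table is not reproduced in this transcript, so the sketch below illustrates the calculation on the stress proportions from the earlier example instead. It assumes the usual definition of relative risk as the ratio p1/p2 of two group proportions.

```python
def relative_risk(p1, p2):
    """Ratio of the proportion in group 1 to the proportion in group 2."""
    if p2 == 0:
        raise ValueError("relative risk is undefined when p2 is 0")
    return p1 / p2


# Proportions reporting stress, from the stress-by-gender example.
p_female = 0.35
p_male = 0.16

rr = relative_risk(p_female, p_male)
print(f"relative risk = {rr:.2f}")
```

A value of 1 would correspond to no association. Here the sample proportion reporting stress is a little more than twice as high for females as for males.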

49
A Large X² Does Not Mean There's a Strong
Association
  • A large chi-squared value provides strong
    evidence that the variables are associated
  • It does not imply that the variables have a
    strong association
  • This statistic merely indicates (through its
    P-value) how certain we can be that the variables
    are associated, not how strong that association is
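
One way to see this is to hold the cell proportions fixed and change only the sample size. In the sketch below, the Gender and Happiness counts from the review questions are multiplied by 10 and by 100. These scaled tables are hypothetical, made up purely for illustration. The difference of proportions stays the same while X² grows in proportion to the sample size.

```python
base = [
    [163, 898, 502],   # Females: Not, Pretty, Very
    [130, 705, 379],   # Males
]


def chi_squared_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


results = {}
for k in (1, 10, 100):
    table = [[k * count for count in row] for row in base]  # hypothetical for k > 1
    # Difference in the proportion "Very" happy, females minus males.
    diff = table[0][2] / sum(table[0]) - table[1][2] / sum(table[1])
    results[k] = (chi_squared_statistic(table), diff)
    print(f"k = {k:3d}: X2 = {results[k][0]:6.2f}, difference = {diff:.4f}")
```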

50
Section 10.4
  • How Can Residuals Reveal the Pattern of
    Association?

51
Association Between Categorical Variables
  • The chi-squared test and measures of association
    such as (p1 - p2) and p1/p2 are fundamental
    methods for analyzing contingency tables
  • The P-value for X² summarizes the strength of
    evidence against H0: independence

52
Association Between Categorical Variables
  • If the P-value is small, then we conclude that
    somewhere in the contingency table the population
    cell proportions differ from independence
  • The chi-squared test does not indicate whether
    all cells deviate greatly from independence or
    perhaps only some of them do so

53
Residual Analysis
  • A cell-by-cell comparison of the observed counts
    with the counts that are expected when H0 is true
    reveals the nature of the evidence against H0
  • The difference between an observed and expected
    count in a particular cell is called a residual

54
Residual Analysis
  • The residual is negative when fewer subjects are
    in the cell than expected under H0
  • The residual is positive when more subjects are
    in the cell than expected under H0

55
Residual Analysis
  • To determine whether a residual is large enough
    to indicate strong evidence of a deviation from
    independence in that cell, we use an adjusted form
    of the residual: the standardized residual

56
Residual Analysis
  • The standardized residual for a cell =
    (observed count - expected count)/se
  • A standardized residual reports the number of
    standard errors that an observed count falls from
    its expected count
  • Its formula is complex
  • Software can be used to find its value
  • A large value provides evidence against
    independence in that cell
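
The slides leave the standard error to software. As a hedged sketch, one commonly used form is se = √(expected × (1 - row proportion) × (1 - column proportion)); that formula is an assumption here and is not taken from the slides. The religiosity table is not reproduced in this transcript, so the example uses the Gender and Happiness counts from the review questions.

```python
import math

observed = [
    [163, 898, 502],   # Females: Not, Pretty, Very
    [130, 705, 379],   # Males
]


def standardized_residuals(table):
    """(observed - expected) / se for every cell, with the assumed se formula."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    result = []
    for i, row in enumerate(table):
        out_row = []
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            se = math.sqrt(
                exp * (1 - row_totals[i] / n) * (1 - col_totals[j] / n)
            )
            out_row.append((obs - exp) / se)
        result.append(out_row)
    return result


z = standardized_residuals(observed)
for label, row in zip(["Females", "Males"], z):
    print(label, [round(value, 2) for value in row])
```

Every residual in this table is well under 1 in absolute value, so no cell shows a notable departure from independence, in line with the small X² for these data.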

57
Example: Standardized Residuals for Religiosity
and Gender
  • To what extent do you consider yourself a
    religious person?

58
Example: Standardized Residuals for Religiosity
and Gender
59
Example: Standardized Residuals for Religiosity
and Gender
  • Interpret the standardized residuals in the table

60
Example: Standardized Residuals for Religiosity
and Gender
  • The table exhibits large positive residuals for
    the cells for females who are very religious and
    for males who are not at all religious.
  • In these cells, the observed count is much larger
    than the expected count
  • There is strong evidence that the population has
    more subjects in these cells than if the
    variables were independent

61
Example: Standardized Residuals for Religiosity
and Gender
  • The table exhibits large negative residuals for
    the cells for females who are not at all
    religious and for males who are very religious
  • In these cells, the observed count is much
    smaller than the expected count
  • There is strong evidence that the population has
    fewer subjects in these cells than if the
    variables were independent

62
Section 10.5
  • What if the Sample Size is Small? Fisher's Exact
    Test

63
Fisher's Exact Test
  • The chi-squared test of independence is a
    large-sample test
  • When the expected frequencies are small, any of
    them being less than about 5, small-sample tests
    are more appropriate
  • Fisher's exact test is a small-sample test of
    independence

64
Fisher's Exact Test
  • The calculations for Fisher's exact test are
    complex
  • Statistical software can be used to obtain the
    P-value for the test that the two variables are
    independent
  • The smaller the P-value, the stronger is the
    evidence that the variables are associated

65
Example: Tea Tastes Better with Milk Poured
First?
  • This is an experiment conducted by Sir Ronald
    Fisher
  • His colleague, Dr. Muriel Bristol, claimed that
    when drinking tea she could tell whether the milk
    or the tea had been added to the cup first

66
Example: Tea Tastes Better with Milk Poured
First?
  • Experiment
  • Fisher asked her to taste eight cups of tea
  • Four had the milk added first
  • Four had the tea added first
  • She was asked to indicate which four had the milk
    added first
  • The order of presenting the cups was randomized

67
Example: Tea Tastes Better with Milk Poured
First?
  • Results

68
Example: Tea Tastes Better with Milk Poured
First?
  • Analysis

69
Example: Tea Tastes Better with Milk Poured
First?
  • The one-sided version of the test pertains to the
    alternative that her predictions are better than
    random guessing
  • Does the P-value suggest that she had the ability
    to predict better than random guessing?

70
Example: Tea Tastes Better with Milk Poured
First?
  • The P-value of 0.243 does not give much evidence
    against the null hypothesis
  • The data did not support Dr. Bristol's claim that
    she could tell whether the milk or the tea had
    been added to the cup first
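
The results table is not reproduced in this transcript. The sketch below assumes she correctly identified 3 of the 4 milk-first cups, an assumption chosen because it reproduces the stated one-sided P-value of 0.243. Under random guessing, the number of correct milk-first picks follows a hypergeometric distribution, and the P-value is the probability of doing at least as well as observed.

```python
from math import comb


def tea_tasting_p_value(n_cups=8, n_milk_first=4, n_correct=3):
    """One-sided P-value: chance of at least n_correct correct
    milk-first picks when choosing n_milk_first cups purely at random."""
    n_tea_first = n_cups - n_milk_first
    total_ways = comb(n_cups, n_milk_first)   # all ways to pick 4 of the 8 cups
    p = 0.0
    for k in range(n_correct, n_milk_first + 1):
        # k correct picks among milk-first cups, the rest among tea-first cups
        ways = comb(n_milk_first, k) * comb(n_tea_first, n_milk_first - k)
        p += ways / total_ways
    return p


p_value = tea_tasting_p_value()
print(f"one-sided P-value = {p_value:.3f}")
```

Under the same setup, only a perfect score would give a small P-value (1/70, about 0.014), which shows how little information eight cups can provide.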