Title: Chapter 9.1: Using Chi-square analysis to analyze the fit of a proposed model
1. Chapter 9.1: Using Chi-square analysis to analyze
the fit of a proposed model
- Goals
- 1. For a Chi-square Goodness of Fit test
- a. Know when to use it.
- b. Know the assumptions.
- c. Be able to perform all 9 steps.
- d. Know how to make a statistical conclusion and
an interpretation of the results.
- 2. Be able to calculate the expected counts
for a cell.
- 3. Be able to calculate the chi-square value for
a cell as well as the chi-square value for the
model.
2. How to test one sample proportion vs. some
expected proportion
- Research question
- We are interested in simultaneously estimating
multiple unknown proportions for a population
based on a sample.
- Example
- From newspaper articles, we have read that the US
is approximately 55% Caucasian, 20% Hispanic, 15%
African-American, and 10% Other races. We want to
test if the University of Wyoming follows these
percentages.
3. Multiple comparisons?
- We could perform 4 studies, each testing one of
the proportions
- Test 1: Is the proportion Caucasian .55?
- Test 2: Is the proportion Hispanic .20? Etc.
- But this introduces the problem of multiple
comparisons, and it doesn't really test what we
want.
- We want one OVERALL test to answer "Does this
model (in its entirety) fit the data?"
- We want an overall test to look for any
differences among all the parameters in which we
are interested. We want something to analyze a
"Goodness of fit" for the model.
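To make the multiple-comparisons problem concrete, the sketch below computes the chance of at least one false rejection across four separate tests at alpha = .05. It assumes, purely for illustration, that the four tests are independent (the four proportions sum to 1, so in practice they are not); the Bonferroni bound shown alongside does not need that assumption. The function name is hypothetical.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false rejection, ASSUMING independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


alpha = 0.05
num_tests = 4  # one separate test per race category

fwer = familywise_error_rate(alpha, num_tests)
# Bonferroni upper bound on the same probability; holds without independence.
bonferroni_bound = min(1.0, alpha * num_tests)

print(f"Familywise error rate (independence assumed): {fwer:.4f}")
print(f"Bonferroni upper bound: {bonferroni_bound:.2f}")
```

Under these assumptions the overall error rate is roughly 19%, well above the intended 5%, which is one reason a single overall test is preferred.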
4. To answer this question, we will use the
following ideas
- We will set up our actual data in a table with
r rows and c columns.
- Using the null hypothesis, we will compute
another table for the expected counts for each
cell.
- We will compare each actual value vs. its
corresponding expected value.
- If these differences (in total) are large, then
we will reject the model.
- If these differences (in total) are small, then
we will not reject the model.
5. Calculating expected counts
- Using our problem above, we have heard that the
population proportions should be
- Caucasian   Hispanic   African-American   Other
- 55%         20%        15%                10%
- Let our sample size be equal to 500. Our sample
results are 440, 15, 20, and 25 (in the order
listed above). For each race,
how many people do we expect if our null
hypothesis is true?
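The question above can be answered with one multiplication per cell: expected count = sample size times the hypothesized proportion. The following is a minimal sketch using the proportions and sample size from the slide; the variable names are illustrative only.

```python
# Hypothesized population proportions under the null hypothesis (from the slide).
null_proportions = {
    "Caucasian": 0.55,
    "Hispanic": 0.20,
    "African-American": 0.15,
    "Other": 0.10,
}
sample_size = 500

# Expected count for a cell = sample size * hypothesized proportion.
expected_counts = {
    race: sample_size * proportion
    for race, proportion in null_proportions.items()
}

for race, count in expected_counts.items():
    print(f"{race}: {count:.1f}")
```

This gives expected counts of 275, 100, 75, and 50, which add up to the sample size of 500.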
6. Observed and expected frequencies
7. Chi Square
8. Chi Square, cont.
9. Example: Conduct a hypothesis test to see if the
students at UW follow the percentages 55/20/15/10,
using alpha = .05.
- Step 1. Hypotheses
- H0: The population proportions for Caucasian/
Hispanic/African-American/Other are .55, .20,
.15, and .10
- H1: There is at least one difference in the
population proportions
- Step 2. Find critical Chi Square for alpha = .05
- Step 3. Collect data
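Step 2 amounts to a table lookup. The sketch below assumes the usual degrees of freedom for a goodness-of-fit test (number of categories minus one, which the slide does not state explicitly) and uses critical values copied from a standard chi-square table for alpha = .05. The dictionary and function names are hypothetical.

```python
# Upper-tail critical values of the chi-square distribution at alpha = .05,
# copied from a standard chi-square table (df 1 through 6 only).
CRITICAL_VALUES_ALPHA_05 = {
    1: 3.841,
    2: 5.991,
    3: 7.815,
    4: 9.488,
    5: 11.070,
    6: 12.592,
}


def goodness_of_fit_df(num_categories: int) -> int:
    """Degrees of freedom for a goodness-of-fit test: categories minus one."""
    return num_categories - 1


df = goodness_of_fit_df(4)  # Caucasian, Hispanic, African-American, Other
critical_value = CRITICAL_VALUES_ALPHA_05[df]
print(f"df = {df}, critical chi-square = {critical_value}")
```

A test statistic larger than this critical value leads to rejecting the null hypothesis at alpha = .05.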
10. Checking assumptions
- Step 4. Assumptions
- a. Data is collected with an SRS (simple random sample)
- Reason: We want our data to be representative.
- b. Population (N) is at least 10 times sample
size (n)
- Reason: We want the probability of selecting a
"yes" to be independent from person to person,
thus the probability of a "yes" is constant.
- c. No more than 20% of the expected cell counts
are less than 5. All expected cell counts are at
least 1.
- Reason: We want to get an accurate estimate of the true
proportion.
- Step 5. Calculate the test statistic.
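Before computing the statistic in Step 5, the expected-count condition in Step 4c can be checked mechanically. The sketch below is one way to encode that rule; the function name is hypothetical, and the second list of counts is made up only to show a failing case.

```python
def expected_counts_ok(expected) -> bool:
    """Step 4c: at most 20% of expected counts below 5, and none below 1."""
    counts = list(expected)
    share_below_5 = sum(1 for e in counts if e < 5) / len(counts)
    return share_below_5 <= 0.20 and min(counts) >= 1


# Expected counts for the UW example: (.55, .20, .15, .10) * 500.
uw_expected = [275.0, 100.0, 75.0, 50.0]
print(expected_counts_ok(uw_expected))      # condition satisfied

# HYPOTHETICAL counts: half of the cells fall below 5.
sparse_expected = [2.0, 3.0, 40.0, 55.0]
print(expected_counts_ok(sparse_expected))  # condition violated
```

Conditions 4a and 4b concern how the data were collected, so they cannot be verified from the counts alone.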
11. Chi Square Calculations
12. Chi Square Calculations, cont.
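The calculation for Step 5 can be sketched as follows, using the standard Pearson form: each cell contributes (observed - expected)^2 / expected, and the chi-square value for the model is the sum over cells. The code assumes the sample results 440, 15, 20, and 25 are listed in the order Caucasian, Hispanic, African-American, Other; variable names are illustrative.

```python
# Observed sample counts (assumed to follow the category order on the slide).
observed = {"Caucasian": 440, "Hispanic": 15, "African-American": 20, "Other": 25}
null_proportions = {
    "Caucasian": 0.55,
    "Hispanic": 0.20,
    "African-American": 0.15,
    "Other": 0.10,
}
n = sum(observed.values())  # 500

# Chi-square value for each cell: (observed - expected)^2 / expected.
cell_chi_square = {}
for race, obs in observed.items():
    exp = n * null_proportions[race]
    cell_chi_square[race] = (obs - exp) ** 2 / exp

# Chi-square value for the model: the sum of the cell values.
chi_square_total = sum(cell_chi_square.values())

for race, value in cell_chi_square.items():
    print(f"{race}: {value:.2f}")
print(f"Total chi-square: {chi_square_total:.2f}")
```

With these counts the cell values are 99.00, 72.25, 40.33, and 12.50, for a total of about 224.08.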
13. Further steps
- Step 6. p-Value: From the table,
- the p-value is < .0005
- (since it is off the chart)
- Step 7. Comparison
- p < alpha
- Step 8. Statistical Conclusion
- Reject the null hypothesis
- Step 9. Interpretation
- There is significant evidence that the true model
for race is different from 55% Caucasian / 20%
Hispanic / 15% African-American / 10% Other in the
population, alpha = .05.
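Steps 6 through 8 can also be done without a printed table. The sketch below uses a closed-form expression for the chi-square upper-tail probability that holds for whole-number degrees of freedom, built from the standard library only; it is an illustration, not the method used in the slides, and the function name is hypothetical. The statistic is the sum of the four cell values implied by the sample counts (assuming the counts follow the category order on the slide), with df = 3.

```python
import math


def chi_square_upper_tail(x: float, df: int) -> float:
    """P(X >= x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= half / r
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2))
    #         + sqrt(2/pi) * exp(-x/2) * sum_r x^(r - 1/2) / (1*3*...*(2r-1))
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    tail = math.erfc(math.sqrt(half))
    return tail + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


statistic = 99.0 + 72.25 + 3025 / 75 + 12.5  # cell values for the UW sample
df = 3
alpha = 0.05

p_value = chi_square_upper_tail(statistic, df)
decision = "reject H0" if p_value < alpha else "fail to reject H0"
print(f"chi-square = {statistic:.2f}, p-value = {p_value:.3g}, decision: {decision}")
```

The computed p-value is far below .0005, which agrees with the table reading of "off the chart". In practice a statistics library would normally be used instead of a hand-written tail function.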
14. Final notes
- Note: If our data supported a "Fail to reject
the null hypothesis" decision, then our
interpretation would be the same but with NOT
added.
- Specifically, we would say: "There is not
significant evidence that the true model for race
is different from 55% Caucasian / 20% Hispanic / 15%
African-American / 10% Other in the population,
alpha = .05."
- Note: The actual counts for a table must be
integers.
- The expected counts for a table DO NOT have to be
integers.
- If we expected 55% of the sample to be Caucasian,
and our sample size were 50 people, then our expected
count for that cell would be (.55)(50) = 27.5 people.
15. Chi-square test of independence
- New Situation
- We are interested in simultaneously estimating
multiple unknown proportions and comparing these
multiple proportions to see if they are all the
same or if at least one of them is different.
- Example: Cancer rates for different cities in the
US (testing if multiple proportions are the
same)
- We want to investigate the claim that different
cities have different rates of cancer. Below is a
breakdown of the number of households affected by
cancer for six cities. A household is either
affected or not affected. We would like to know
if the proportions of affected households in the
different cities are the same or if at least one
of them is different from the rest.
16. Data
17. What does this mean?
- We want one OVERALL test to answer "Are all of
these proportions the same?"
- We want an overall test to look for any
differences among all the parameters in which we
are interested. We want something to analyze a
"Goodness of fit" for the model of all
proportions being equal.
18. To answer this question, we will use the
following ideas
- We will set up our actual data in a table with
r rows and c columns.
- Using the null hypothesis, we will compute
another table for the expected counts for each
cell.
- We will compare each actual value vs. its
corresponding expected value.
- If these differences (in total) are large, then
we will reject the model.
- If these differences (in total) are small, then
we will not reject the model.
19. Differences from goodness of fit test
- Using our problem above, we know that if cancer
rates are the same, then the population
proportions should be the same for each of the
six cities.
- Note: We are not saying that we know that the
population proportion is equal to some value like
p = .08. We are only saying that whatever that
value is, it is the same for all cities.
- Also note the difference between this statement
and the one from last lecture:
- (The proportions are equal to .55, .20, .15, .10
for Caucasian, Hispanic, African-American and
Other)
20. Row and column totals
21. Calculating expected values
22. Calculating Chi Square
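For a two-way table, the row and column totals drive the expected counts: under the null hypothesis of equal proportions, expected count = (row total * column total) / grand total, and the chi-square value is again the sum of (observed - expected)^2 / expected over all cells. The sketch below illustrates this with a small HYPOTHETICAL 3-city table; these are not the cancer study's counts, which are not reproduced here.

```python
# HYPOTHETICAL counts for illustration only; NOT the cancer study's data.
# Each row is a city; columns are (affected, not affected).
observed = [
    [10, 90],   # city A
    [20, 80],   # city B
    [30, 170],  # city C
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
grand_total = sum(row_totals)

# Expected count for a cell = (row total * column total) / grand total.
expected = [
    [row_total * column_total / grand_total for column_total in column_totals]
    for row_total in row_totals
]

# Chi-square value for each cell, then the total for the whole table.
chi_square_cells = [
    [(o - e) ** 2 / e for o, e in zip(observed_row, expected_row)]
    for observed_row, expected_row in zip(observed, expected)
]
chi_square_total = sum(sum(row) for row in chi_square_cells)

print("Expected counts:", expected)
print(f"Total chi-square: {chi_square_total:.3f}")
```

In this made-up table the overall affected proportion is 60/400 = .15, so each city's expected "affected" count is 15% of its row total.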
23. Example: Conduct a hypothesis test to see if the
cancer rates for the six cities are the same.
- Step 1. Hypotheses
- H0: The population proportions for being
affected by cancer are the same for the six
cities
- H1: There is at least one difference in the
population proportions
- These hypotheses can also be written with
symbols
- H0: p1 = p2 = p3 = p4 = p5 = p6
- H1: at least one p is different from the rest
24. How do we calculate degrees of freedom?
- df = number of cells that are free and not
pre-determined
- So for us, df = (r-1)(c-1) = (6-1)(2-1)
- df = 5
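The degrees-of-freedom rule above is short enough to write as a one-line function. This is a minimal sketch with a hypothetical function name; the 6-by-2 shape comes from the six cities and the two outcomes (affected, not affected).

```python
def independence_df(num_rows: int, num_columns: int) -> int:
    """Degrees of freedom for an r-by-c table: (r - 1)(c - 1)."""
    return (num_rows - 1) * (num_columns - 1)


# Six cities (rows) by two outcomes (columns).
df = independence_df(6, 2)
print(f"df = {df}")
```

Once the row and column totals are fixed, filling in 5 cells of the 6-by-2 table determines the remaining 7, which is what "free and not pre-determined" refers to.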
25. Data for cancer study (observed and expected counts)
26. Calculations
27. Calculations, cont.
28. Deciding if result is significant
- Step 6. p-Value: From the table, the p-value is
between .01 and .02
- Step 7. Comparison
- p < alpha
- Step 8. Statistical Conclusion
- Reject the null hypothesis
- Step 9. Interpretation
- There is significant evidence that the
proportions for getting cancer in the 6 cities
are different in the population, alpha = .05.
- Note: If our data supported a "Fail to reject
the null hypothesis" decision, then our
interpretation would be the same but with NOT
added.
- Specifically, we would say: "There is not
significant evidence that the proportions for
getting cancer in the 6 cities are different in
the population, alpha = .05."
29. Analyzing which cell contributes the most to a
significant overall chi-square value
- Look at the individual chi-square values for each
cell and find the largest.
- This is the cell in which observed and expected
are the most different, relative to the expected
count.
- Interpretation for our data above
- The chi-square value for the cell corresponding
to city 4 and "affected with cancer" is 11.338.
- For this cell, the observed value of 12 was much
greater than the expected value of 4.7 people.
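The value quoted for the city 4 cell can be reproduced from its observed and expected counts with the per-cell formula (observed - expected)^2 / expected. The sketch below uses only the two numbers given on the slide; variable names are illustrative.

```python
# City 4, "affected with cancer" cell (values from the slide).
observed_count = 12
expected_count = 4.7

# Chi-square value for a single cell: (observed - expected)^2 / expected.
cell_chi_square = (observed_count - expected_count) ** 2 / expected_count
print(f"Cell chi-square: {cell_chi_square:.3f}")
```

Because the difference is divided by the expected count, a gap of about 7 counts weighs heavily in a cell where fewer than 5 were expected.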
30. Testing if 2 variables are independent
- We want to test if the variables "city" and
"cancer status" are related.
- That is, are the variables "city" and "cancer
status" independent or dependent?
- Does knowing the value for "city" help us in
determining the proportion of individuals
affected with cancer?
- If the variables are independent, then the
proportion of people with cancer should be the
same for all cities. Thus, knowing the city
does NOT help us out.
- If the variables are dependent, then the
proportion of people with cancer should NOT be
the same for all cities. Thus, knowing the
city DOES help us out.