Lecture 7 Two-Way Tables Slides available from Statistics - PowerPoint PPT Presentation

1
Lecture 7: Two-Way Tables
Slides available from the Statistics & SPSS page of www.gpryce.com
  • Social Science Statistics Module I
  • Gwilym Pryce

2
Notices
  • Register

3
Aims and Objectives
  • Aim
  • This session introduces methods of examining
    relationships between categorical variables
  • Objectives
  • By the end of this session you should be able to
  • Understand how to examine relationships between
    categorical variables using
  • two-way tables
  • the chi-square test for independence.

4
Plan
  • 1. Independent events
  • 2. Contingent events
  • 3. Chi square test for independence
  • 4. Further Study

5
1. Probability of two Independent events occurring
  • If knowing that one event occurs does not affect
    the probability of another event, we say the two
    events are independent.
  • If A and B are independent, and we know the
    probability of each of them occurring, we can
    calculate the probability of them both occurring.

6
Example: You have a two-sided die and a coin,
find Pr(1 and H).
  • Answer: ½ × ½ = ¼
  • Rule: P(A ∩ B) = P(A) × P(B)

7
E.g. you have one fair coin which you toss twice:
what's the probability of getting two heads?
  • Suppose
  • A = 1st toss is a head
  • B = 2nd toss is a head
  • What is the probability of A ∩ B?
  • Answer: A and B are independent and are not
    disjoint (i.e. not mutually exclusive). P(A) =
    0.5 and P(B) = 0.5.
  • P(A ∩ B) = 0.5 × 0.5 = 0.25.
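
The multiplication rule can be checked by brute force. The sketch below is not part of the original slides: it enumerates the four equally likely outcomes of two tosses of a fair coin and counts the ones in which both events occur. The names `outcomes`, `prob`, `p_a`, `p_b` and `p_both` are illustrative only.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of tossing a fair coin twice.
outcomes = list(product("HT", repeat=2))

def prob(event):
    """Probability of an event as (favourable outcomes) / (all outcomes)."""
    favourable = [o for o in outcomes if event(o)]
    return Fraction(len(favourable), len(outcomes))

p_a = prob(lambda o: o[0] == "H")          # A: 1st toss is a head
p_b = prob(lambda o: o[1] == "H")          # B: 2nd toss is a head
p_both = prob(lambda o: o == ("H", "H"))   # A and B together

print(p_a, p_b, p_both)       # 1/2 1/2 1/4
print(p_both == p_a * p_b)    # True
```

Exact fractions are used so that the equality check is not disturbed by floating-point rounding.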

8
2. Probability of two contingent events occurring
  • If knowing that one event occurs does change the
    probability that the other occurs, then the two
    events are not independent and are said to be
    contingent upon each other
  • If events are contingent then we can say that
    there is some kind of relationship between them
  • So testing for contingency is one way of testing
    for a relationship

9
Example of contingent events
  • There is a 70% chance that a child will go to
    university if his/her parents are middle class,
    but only a 10% chance if his/her parents are
    working class. Suppose there is a 60% chance
    of a child's parents being working class.
  • What are the chances that a child will be born
    working class and go to University?
  • What proportion of people at university will be
    from working class backgrounds?

10
A tricky one...
11
This diagram illustrates graphically how the
probability of going to university is contingent
upon the social class of your parents.
12
6% of all children are both working class and end
up going to University
13
As percent of all children
                           Working class   Middle class
Go to University                 6%             28%
Do not go to University         54%             12%
14
% at Uni from WC parents?
  • Of all children, only 34% end up at university
    (6% WC + 28% MC)
  • i.e. 6 out of every 34 University students are
    from WC parents
  • 6/34 = 17.6% of University students are WC
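
The arithmetic behind these figures can be sketched in a few lines. This is an illustration added to the transcript rather than material from the slides, and the variable names are assumptions; the input probabilities are the ones stated in the example.

```python
# From the example: 60% of children have working-class (WC) parents;
# 10% of WC children and 70% of middle-class (MC) children go to university.
p_wc = 0.60
p_mc = 1 - p_wc
p_uni_given_wc = 0.10
p_uni_given_mc = 0.70

# Joint probabilities: P(class and Uni) = P(class) x P(Uni given class).
p_wc_and_uni = p_wc * p_uni_given_wc
p_mc_and_uni = p_mc * p_uni_given_mc

# Overall share of children who go to university.
p_uni = p_wc_and_uni + p_mc_and_uni

# Share of university students who come from WC backgrounds.
p_wc_given_uni = p_wc_and_uni / p_uni

print(round(p_wc_and_uni, 2))    # 0.06
print(round(p_uni, 2))           # 0.34
print(round(p_wc_given_uni, 3))  # 0.176
```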

15
  • Probability theory states that
  • if x and y are independent, then the probability
    of events x and y simultaneously occurring is
    simply equal to the product of the probabilities
    of the two events occurring
  • But, if x and y are not independent, then

Prob(x ∩ y) = Prob(x) × Prob(y given that x has occurred)
16
Test for independence
  • We can use these two rules to test whether events
    are independent
  • Does the distribution of observations across
    possible outcomes resemble
  • the random distribution we would get if events
    were independent?
  • I.e. if we assume independence and calculate the
    expected number of cases in each category, do
    these figures correspond fairly closely to the
    actual distribution of outcomes found in our
    data?
  • Or does it resemble a distribution of outcomes
    more akin to contingency,
  • i.e. one event contingent on the other?

17
Example 1: Is there a relationship between social
class and education? We might test this by
looking at categories in our data of WC, MC,
University, no University.
Suppose we have 300 observations distributed as
follows:
                           Working class   Middle class
Go to University                 18              84
Do not go to University         162              36
Given this distribution, would you say these two
variables are independent?
18
  • To do the test for independence we need to
    compare expected with observed.
  • But how do we calculate ei, the expected number
    of observations in category i?
  • ei = number of cases expected in cell i assuming
    that the two categorical variables are
    independent
  • ei is calculated simply as the probability of an
    observation falling into category i under the
    independence assumption, multiplied by the total
    number of observations.
  • I.e. no contingency

19
  • So, if (UNIY or UNIN) and (WC or MC) are
    independent (i.e. assuming H0) then

Prob(UNIY ∩ WC) = Prob(UNIY) × Prob(WC), so the
expected numbers of cases for the four mutually
exclusive categories are as follows:
                           Working class          Middle class
Go to University           P(UNIY) × P(WC) × n    P(UNIY) × P(MC) × n
Do not go to University    P(UNIN) × P(WC) × n    P(UNIN) × P(MC) × n
20
  • But how do we work out
  • Prob(UNIY) and Prob(WC)
  • which are needed to calculate Prob(UNIY ∩ WC)?
  • Prob(UNIY ∩ WC) = Prob(UNIY) × Prob(WC)
  • Answer: we assume independence and so estimate
    them from the data by simply dividing the total
    number of cases in the given category by the
    total number of observations
  • E.g. Prob(UNIY) = Total no. cases UNIY ÷ All
    observations
  • = (18 + 84) / 300 = 0.34
  • Prob(WC) is calculated the same way
  • E.g. Prob(WC) = Total no. cases WC ÷ All
    observations
  • = (18 + 162) / 300 = 0.6
  • Expected no. of cases (UNIY ∩ WC) = 0.34 × 0.6 × 300 = 61.2
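
As a rough illustration (not from the original slides), the marginal probabilities and the expected count for the UNIY and WC cell can be computed from the observed counts as follows. The dictionary layout and variable names are assumptions made for this sketch.

```python
# Observed counts from Example 1 (300 observations).
observed = {
    ("UNIY", "WC"): 18, ("UNIY", "MC"): 84,
    ("UNIN", "WC"): 162, ("UNIN", "MC"): 36,
}
n = sum(observed.values())

# Marginal probabilities: category total divided by all observations.
p_uniy = sum(v for (uni, _), v in observed.items() if uni == "UNIY") / n
p_wc = sum(v for (_, cls), v in observed.items() if cls == "WC") / n

# Expected count under independence: P(UNIY) x P(WC) x n.
expected_uniy_wc = p_uniy * p_wc * n

print(p_uniy, p_wc)                # 0.34 0.6
print(round(expected_uniy_wc, 1))  # 61.2
```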

21
                           Working class                             Middle class
Go to University           P(UNIY) × P(WC) × n                       P(UNIY) × P(MC) × n
                           = (no. at Uni / n) × (no. WC / n) × n     = (no. at Uni / n) × (no. MC / n) × n
Do not go to University    P(UNIN) × P(WC) × n                       P(UNIN) × P(MC) × n
                           = (no. not Uni / n) × (no. WC / n) × n    = (no. not Uni / n) × (no. MC / n) × n
22
                           Working class   Middle class   Total
Go to University                 18              84         102
Do not go to University         162              36         198
Total                           180             120         300
23
                           Working class                        Middle class
Go to University           P(UNIY) × P(WC) × n                  P(UNIY) × P(MC) × n
                           = (102 / 300) × (180 / 300) × 300    = (102 / 300) × (120 / 300) × 300
Do not go to University    P(UNIN) × P(WC) × n                  P(UNIN) × P(MC) × n
                           = (198 / 300) × (180 / 300) × 300    = (198 / 300) × (120 / 300) × 300
24
Expected count in each category
                           Working class                        Middle class
Go to University           (102 / 300) × (180 / 300) × 300      (102 / 300) × (120 / 300) × 300
                           = 0.34 × 0.6 × 300 = 61.2            = 0.34 × 0.4 × 300 = 40.8
Do not go to University    (198 / 300) × (180 / 300) × 300      (198 / 300) × (120 / 300) × 300
                           = 0.66 × 0.6 × 300 = 118.8           = 0.66 × 0.4 × 300 = 79.2
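
The whole table of expected counts can be produced in one step. The sketch below is an addition to the transcript; it relies on the fact that (row total / n) × (column total / n) × n simplifies to row total × column total / n. The function name `expected_counts` is hypothetical.

```python
def expected_counts(table):
    """Expected counts under independence: row total x column total / n."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: go / do not go to University; columns: working class, middle class.
observed = [[18, 84], [162, 36]]
expected = expected_counts(observed)

for row in expected:
    print([round(e, 1) for e in row])
# [61.2, 40.8]
# [118.8, 79.2]
```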
25
We have the actual count (i.e. from our data set)
                           Working class   Middle class
Go to University                 18              84
Do not go to University         162              36
26
And the expected count
(i.e. the numbers we'd expect if we assume class
and education to be independent of each other)
                           Working class   Middle class
Go to University                 61.2            40.8
Do not go to University         118.8            79.2
27
What does this table tell you?
                                             Working class   Middle class
Go to University           Actual count            18              84
                           Expected count          61.2            40.8
Do not go to University    Actual count           162              36
                           Expected count         118.8            79.2
28
  • It tells you that if class and education were
    indeed independent of each other
  • I.e. the outcome of one does not affect the
    chances of outcome of the other
  • Then you'd expect a lot more working class people
    in the data to have gone to university than
    actually recorded (61 people, rather than 18)
  • Conversely, you'd expect far fewer middle class
    people to have gone to university (about half the
    number actually recorded: 41 people rather than
    84).

29
But remember, all this is based on a sample, not
the entire population
  • Q/ Is this discrepancy due to sampling variation
    alone or does it indicate that we must reject the
    assumption of independence?
  • To answer this within the standardised hypothesis
    testing framework we need to know the chances of
    false rejection

30
3. Chi-square test for independence
(non-parametric, i.e. no presuppositions about the
distribution of the variables; sample size not
relevant, subject to the caveat on expected cell
counts below)
  • (1) H0: expected = actual ⇔ x and y are
    independent
  • I.e. Prob(x) is not affected by whether or not y
    occurs
  • H1: expected ≠ actual ⇔ there is some
    relationship
  • I.e. Prob(x) is affected by y occurring.
  • (2) α = 0.05

χ² = Σ (oi − ei)² / ei, summed over i = 1, ..., k,
with df = (r − 1)(c − 1), where
k = no. of categories
ei = expected (given H0) no. of sample observations in the ith category
oi = actual no. of sample observations in the ith category
d = no. of parameters that have to be estimated from the sample data
r = no. of rows in table
c = no. of columns
31
Chi-square distribution changes shape for
different df
32
  • (3) Reject H0 iff P < α
  • (4) Calculate P
  • P = Prob(χ² > χ²c)
  • N.B. Chi-square tests are always an upper tail
    test
  • χ² tables are usually set up like a t-table with
    df down the side, and the probabilities listed
    along the top row, with values of χ²c actually
    in the body of the table. So look up χ²c in the
    body of the table for the relevant df and then
    find the upper tail probability that heads that
    column.
  • SPSS: CDF.CHISQ(c2c,df) calculates Prob(χ² <
    χ²c), so use the following syntax
  • COMPUTE chi_prob = 1 - CDF.CHISQ(c2c,df).
  • EXECUTE.
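
For readers without SPSS, the same upper tail probability can be sketched in Python using only the standard library. This is an illustrative implementation, not the routine SPSS uses: it starts from the closed forms for df = 1 and df = 2 and steps df up by 2 with the standard recurrence for the incomplete gamma function. The function name `chi2_upper_tail` is hypothetical.

```python
import math

def chi2_upper_tail(x, df):
    """Prob(chi-square with integer df exceeds x), for x >= 0."""
    if df < 1 or x < 0:
        raise ValueError("need df >= 1 and x >= 0")
    z = x / 2.0
    if df % 2 == 1:
        a, q = 0.5, math.erfc(math.sqrt(z))   # closed form for df = 1
    else:
        a, q = 1.0, math.exp(-z)              # closed form for df = 2
    # Recurrence: Q(a + 1, z) = Q(a, z) + z^a * exp(-z) / Gamma(a + 1).
    while 2 * a < df:
        if z > 0:
            q += math.exp(a * math.log(z) - z - math.lgamma(a + 1))
        a += 1.0
    return q

# 3.841 is the familiar 5% critical value for df = 1.
print(abs(chi2_upper_tail(3.841, 1) - 0.05) < 0.001)   # True
print(chi2_upper_tail(115.51, 1) < 0.05)               # True
```

If SciPy is installed, `scipy.stats.chi2.sf(x, df)` returns the same upper tail probability and is the more robust choice for real work.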

33
Do a chi-square test on the following table
                                             Working class   Middle class
Go to University           Actual count            18              84
                           Expected count          61.2            40.8
Do not go to University    Actual count           162              36
                           Expected count         118.8            79.2
34
  • (1) H0: expected = actual
  • ⇔ class and Higher Education are independent
  • H1: expected ≠ actual
  • ⇔ there is some relationship between class and
    Higher Education

35
(2) State the formula and calculate χ²
  • χ² = (18 − 61.2)² / 61.2
  • + (84 − 40.8)² / 40.8
  • + (162 − 118.8)² / 118.8
  • + (36 − 79.2)² / 79.2

36
  • χ² = (18 − 61.2)² / 61.2 + (84 − 40.8)² / 40.8
    + (162 − 118.8)² / 118.8 + (36 − 79.2)² / 79.2
  • = 30.49 + 45.74 + 15.71 + 23.56
  • = 115.51
  • df = (r − 1)(c − 1) = 1
  • Sig = P(χ² > 115.51) ≈ 0
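
The calculation above can be reproduced with a short function. This is a sketch added to the transcript, with a hypothetical function name; it takes the observed and expected tables from the slides as given.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all cells."""
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

observed = [[18, 84], [162, 36]]
expected = [[61.2, 40.8], [118.8, 79.2]]

chi2 = chi_square_statistic(observed, expected)
df = (len(observed) - 1) * (len(observed[0]) - 1)   # (r - 1)(c - 1)

print(round(chi2, 2), df)   # 115.51 1
```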

37
  • (3) Reject H0 iff P < α
  • (4) Calculate P
  • COMPUTE chi_prob = 1 - CDF.CHISQ(115.51,1).
  • EXECUTE.
  • Sig = P(χ² > 115.51) ≈ 0
  • ⇒ Reject H0

38
Caveat
  • As with the 2 proportions tests, the chi-square
    test is
  • "an approximate method that becomes more
    accurate as the counts in the cells of the table
    get larger" (Moore, Basic Practice of Statistics,
    2000, p. 485)
  • Cell counts required for the chi-square test:
  • "You can safely use the chi-square test with
    critical values from the chi-square distribution
    when no more than 20% of the expected counts are
    less than 5 and all individual expected counts
    are 1 or greater. In particular, all four
    expected counts in a 2x2 table should be 5 or
    greater" (Moore, Basic Practice of Statistics,
    2000, p. 485)
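
The rule of thumb quoted above translates naturally into a check on the table of expected counts. The sketch below is one possible reading of that rule, added for illustration; the function name `chi_square_is_safe` and the second example table are hypothetical.

```python
def chi_square_is_safe(expected):
    """Apply the quoted rule of thumb to a table of expected counts."""
    cells = [e for row in expected for e in row]
    if len(expected) == 2 and len(expected[0]) == 2:
        # 2x2 table: all four expected counts should be 5 or greater.
        return all(e >= 5 for e in cells)
    # Larger tables: at most 20% of cells below 5, and none below 1.
    share_below_5 = sum(e < 5 for e in cells) / len(cells)
    return share_below_5 <= 0.20 and all(e >= 1 for e in cells)

print(chi_square_is_safe([[61.2, 40.8], [118.8, 79.2]]))   # True
print(chi_square_is_safe([[3.0, 12.0], [9.0, 36.0]]))      # False
```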

39
Example 2: Is there a relationship between
whether a borrower is a first time buyer and
whether they live in Durham or Cumberland?
  • The only real problem is how we calculate ei, the
    expected number of observations in category i
  • (i.e. the number of cases expected in i assuming
    that the variables are independent)
  • The formula for ei is the probability of an
    observation falling into category i multiplied by
    the total number of observations.

40
As noted earlier
  • Probability theory states that
  • if x and y are independent, then the probability
    of events x and y simultaneously occurring is
    simply equal to the product of the probabilities
    of the two events occurring
  • But, if x and y are not independent, then

Prob(x ∩ y) = Prob(x) × Prob(y given that x has occurred)
41
  • So, if (FTBY or FTBN) and (CountyD or CountyC)
    are independent (i.e. assuming H0) then
  • Prob(FTBY ∩ CountyD) = Prob(FTBY) × Prob(CountyD)
  • so the expected numbers of cases for the four
    mutually exclusive categories are as
    follows

42
Prob(FTBN) = Total no. cases FTBN ÷ All
observations
44
This gives us the expected count
To obtain this table in SPSS, go to Analyse,
Descriptive Statistics, Crosstabs, Cells, and
choose expected count rather than observed
45
(No Transcript)
46
  • What does this table tell you?
  • Does it suggest that the probability of being an
    FTB is independent of location?
  • Or does it suggest that the two factors are
    contingent on each other in some way?
  • Can it tell you anything about the direction of
    causation?
  • What about sampling variation?

48
Summary of Hypothesis test
  • (1) H0: FTB and County are independent
  • H1: there is some relationship
  • (2) α = 0.05
  • (3) Reject H0 iff P < α
  • (4) Calculate P
  • P = Prob(χ² > χ²c) = 0.29557 ⇒ Do not reject H0
  • I.e. if we were to reject H0, there would be
    roughly a 3 in 10 chance of us rejecting it
    incorrectly, and so we cannot do so. In other
    words, we have no evidence that FTB and County
    are related.

50
Contingency Tables in SPSS
52
  • Click the Cells button to select counts and
    percentages
  • If you select all three percentages (row, column
    and total), you will end up with

53
(No Transcript)
54
  • Click the Statistics button to choose which stats
    you want.
  • If you click Chi-square, the results of a range
    of tests will be listed

55
We have been calculating the Pearson Chi-square
56
4. For further study
  • The Pearson Chi square test only tests for the
    existence of a relationship
  • It tells you little about the strength of the
    relationship
  • SPSS includes a raft of measures that try to
    measure the level of association between
    categorical variables.
  • Click on the name of one of the statistics and
    SPSS will give you a brief definition (see below)
  • In the lab exercises, take a look at these
    statistics and copy and paste the definitions
    alongside your answers
  • Right click on the definition and select Copy.
    Then open up a Word document and paste along with
    your output.

57
Nominal variables
  • Contingency coefficient
  • A measure of association based on chi-square.
    The value ranges between zero and 1, with zero
    indicating no association between the row and
    column variables and values close to 1 indicating
    a high degree of association between the
    variables. The maximum value possible depends on
    the number of rows and columns in a table.
  • Phi and Cramér's V
  • Phi is a chi-square based measure of association
    that involves dividing the chi-square statistic
    by the sample size and taking the square root of
    the result. Cramér's V is a measure of
    association based on chi-square.
  • Lambda
  • A measure of association which reflects the
    proportional reduction in error when values of
    the independent variable are used to predict
    values of the dependent variable. A value of 1
    means that the independent variable perfectly
    predicts the dependent variable. A value of 0
    means that the independent variable is no help in
    predicting the dependent variable.
  • Uncertainty coefficient
  • A measure of association that indicates the
    proportional reduction in error when values of
    one variable are used to predict values of the
    other variable. For example, a value of 0.83
    indicates that knowledge of one variable reduces
    error in predicting values of the other variable
    by 83%. The program calculates both symmetric and
    asymmetric versions of the uncertainty
    coefficient.
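
The chi-square based measures can be sketched from the test statistic alone. Only the formula for phi is spelled out in the definitions above; the formulas used here for the contingency coefficient and Cramér's V are the standard textbook ones and are an assumption about what SPSS reports. The function name is hypothetical, and the inputs are the χ² and n from Example 1.

```python
import math

def chi_square_measures(chi2, n, n_rows, n_cols):
    """Chi-square based measures of association for an r x c table."""
    phi = math.sqrt(chi2 / n)
    contingency = math.sqrt(chi2 / (chi2 + n))
    cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))
    return phi, contingency, cramers_v

phi, contingency, cramers_v = chi_square_measures(115.51, 300, 2, 2)
print(round(phi, 2), round(contingency, 2), round(cramers_v, 2))
# 0.62 0.53 0.62
```

For a 2x2 table, Cramér's V coincides with phi, which is why the first and third values match.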

58
Ordinal Variables
  • Gamma
  • A symmetric measure of association between two
    ordinal variables that ranges between negative 1
    and 1. Values close to an absolute value of 1
    indicate a strong relationship between the two
    variables. Values close to zero indicate little
    or no relationship. For 2-way tables, zero-order
    gammas are displayed. For 3-way to n-way tables,
    conditional gammas are displayed.
  • Somers' d
  • A measure of association between two ordinal
    variables that ranges from -1 to 1. Values close
    to an absolute value of 1 indicate a strong
    relationship between the two variables, and
    values close to 0 indicate little or no
    relationship between the variables. Somers' d is
    an asymmetric extension of gamma that differs
    only in the inclusion of the number of pairs not
    tied on the independent variable. A symmetric
    version of this statistic is also calculated.
  • Kendall's tau-b
  • A nonparametric measure of association for
    ordinal or ranked variables that takes ties into
    account. The sign of the coefficient indicates
    the direction of the relationship, and its
    absolute value indicates the strength, with
    larger absolute values indicating stronger
    relationships. Possible values range from -1 to
    1, but a value of -1 or 1 can only be obtained
    from square tables.
  • Kendall's tau-c
  • A nonparametric measure of association for
    ordinal variables that ignores ties. The sign of
    the coefficient indicates the direction of the
    relationship, and its absolute value indicates
    the strength, with larger absolute values
    indicating stronger relationships. Possible
    values range from -1 to 1, but a value of -1 or
    1 can only be obtained from square tables.
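
To make the idea of an ordinal measure concrete, here is a minimal sketch of gamma using the standard Goodman and Kruskal definition, (C − D) / (C + D), where C and D are the numbers of concordant and discordant pairs. It is an illustration added to the transcript, not SPSS's implementation, and the 3x3 table of counts is hypothetical.

```python
def goodman_kruskal_gamma(table):
    """Gamma = (C - D) / (C + D) for a table with ordered rows and columns."""
    concordant = discordant = 0
    n_rows, n_cols = len(table), len(table[0])
    for i in range(n_rows):
        for j in range(n_cols):
            # Pair cell (i, j) with every cell in a lower row.
            for k in range(i + 1, n_rows):
                for m in range(n_cols):
                    if m > j:      # higher on both variables
                        concordant += table[i][j] * table[k][m]
                    elif m < j:    # higher on one, lower on the other
                        discordant += table[i][j] * table[k][m]
    return (concordant - discordant) / (concordant + discordant)

# Hypothetical counts: rows and columns both ordered low, medium, high.
table = [[20, 10, 5], [10, 20, 10], [5, 10, 20]]
print(round(goodman_kruskal_gamma(table), 3))   # 0.558
```

Pairs tied on either variable are ignored, which is why gamma tends to be larger in absolute value than the tau statistics for the same table.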

59
Correlations
  • Pearson correlation coefficient r
  • a measure of linear association between two
    variables
  • Spearman correlation coefficient
  • a measure of association between rank orders.
    Values of both range between -1 (a perfect
    negative relationship) and 1 (a perfect positive
    relationship). A value of 0 indicates no linear
    relationship.

60
When you have a dependent variable measured on an
interval scale and an independent variable with a
limited number of categories
  • Eta
  • A measure of association that ranges from 0 to
    1, with 0 indicating no association between the
    row and column variables and values close to 1
    indicating a high degree of association. Eta is
    appropriate for a dependent variable measured on
    an interval scale (e.g., income) and an
    independent variable with a limited number of
    categories (e.g., gender). Two eta values are
    computed: one treats the row variable as the
    interval variable; the other treats the column
    variable as the interval variable.