The Analysis of Categorical Data - PowerPoint PPT Presentation

Provided by: biolog8
1
The Analysis of Categorical Data
2
Categorical variables
  • When both predictor and response variables are
    categorical
  • Presence or absence
  • Color, etc.
  • The data in such a study are counts (frequencies)
    of observations in each category

3
Analysis
Data and corresponding analysis
  • A single categorical predictor variable: organized as a two-way contingency table and tested with a chi-square test or G-test
  • Multiple predictor variables (or complex models): organized as a multi-way contingency table and analyzed using either log-linear models or classification trees
4
Two way Contingency Tables
  • Analysis of contingency tables is done correctly
    only on the raw counts, not on the percentages,
    proportions, or relative frequencies of the data
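
One way to see why raw counts matter: Pearson's chi-square statistic scales with the total count, so converting a table to percentages changes the statistic even though the proportions are identical. The following is a minimal Python sketch, not part of the original slides; it uses the Sex × Death counts from the wildebeest example in this deck, and the function name `pearson_chi2` is an illustrative choice.

```python
def pearson_chi2(table):
    """Pearson chi-square statistic for a two-way table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n  # expected under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# Rows: FEMALE, MALE; columns: NPRED, PRED (raw counts, N = 226)
raw = [[48, 66], [45, 67]]
n = sum(map(sum, raw))

# The same table expressed as percentages of the grand total (sums to 100)
percent = [[100.0 * x / n for x in row] for row in raw]

chi2_raw = pearson_chi2(raw)
chi2_pct = pearson_chi2(percent)
print(f"raw counts:  {chi2_raw:.4f}")
print(f"percentages: {chi2_pct:.4f}")  # shrinks by a factor of 100 / N
```

The statistic computed from percentages is the raw-count statistic multiplied by 100/N, so any P-value derived from it would be wrong.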

5
Wildebeest carcasses from the Serengeti (Sinclair
and Arcese 1995)
6
Sex, cause of death, and bone marrow type
  • Sex (males / females)
  • Cause of death (predation / other)
  • Bone marrow type
    • Solid white fatty (SWF; healthy animal)
    • Opaque gelatinous (OG)
    • Translucent gelatinous (TG)

7
Data
Sex Marrow Death by predation
Male SWF Yes
Male OG Yes
Male TG Yes

8
Brief format
SEX MARROW DEATH COUNT
FEMALE SWF PRED 26
MALE SWF PRED 14
FEMALE OG PRED 32
MALE OG PRED 43
FEMALE TG PRED 8
MALE TG PRED 10
FEMALE SWF NPRED 6
MALE SWF NPRED 7
FEMALE OG NPRED 26
MALE OG NPRED 12
FEMALE TG NPRED 16
MALE TG NPRED 26
9
Contingency table
  • Sex × Death Crosstabulation

Dead
Sex NPRED PRED Total
FEMALE 48 66 114
MALE 45 67 112
Total 93 133 226
10
Contingency table
  • Sex × Marrow Crosstabulation

Marrow
Sex OG SWF TG Total
FEMALE 58 32 24 114
MALE 55 21 36 112
Total 113 53 60 226
11
Contingency table
  • Death × Marrow Crosstabulation

Marrow
Death OG SWF TG Total
NPRED 38 13 42 93
PRED 75 40 18 133
Total 113 53 60 226
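
The three two-way tables can be recovered from the brief-format counts by summing over the variable that is left out. A minimal Python sketch, assuming the counts are stored in a dictionary keyed by (sex, marrow, death); the helper name `collapse` is illustrative and not from the slides.

```python
from collections import defaultdict

# Counts from the "Brief format" slide, keyed by (sex, marrow, death)
COUNTS = {
    ("FEMALE", "SWF", "PRED"): 26, ("MALE", "SWF", "PRED"): 14,
    ("FEMALE", "OG", "PRED"): 32, ("MALE", "OG", "PRED"): 43,
    ("FEMALE", "TG", "PRED"): 8, ("MALE", "TG", "PRED"): 10,
    ("FEMALE", "SWF", "NPRED"): 6, ("MALE", "SWF", "NPRED"): 7,
    ("FEMALE", "OG", "NPRED"): 26, ("MALE", "OG", "NPRED"): 12,
    ("FEMALE", "TG", "NPRED"): 16, ("MALE", "TG", "NPRED"): 26,
}

def collapse(counts, keep):
    """Sum a multi-way table over every variable whose position is not in `keep`."""
    table = defaultdict(int)
    for key, n in counts.items():
        table[tuple(key[i] for i in keep)] += n
    return dict(table)

sex_death = collapse(COUNTS, keep=(0, 2))     # keys: (sex, death)
sex_marrow = collapse(COUNTS, keep=(0, 1))    # keys: (sex, marrow)
death_marrow = collapse(COUNTS, keep=(2, 1))  # keys: (death, marrow)

for name, table in [("Sex x Death", sex_death),
                    ("Sex x Marrow", sex_marrow),
                    ("Death x Marrow", death_marrow)]:
    print(name, table)
```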
12
Are the variables independent?
  • We want to know, for example, whether males are
    more likely to die by predation than females
  • Specifying the null hypothesis
  • The predictor and response variables are not
    associated with each other: the two variables are
    independent, and the observed degree of
    association is no stronger than we would expect
    by chance from random sampling

13
Calculating the expected values
  • The expected value is the total number of
    observations (N) times the probability of an
    individual being both male and dead by predation

14
The probability of two independent events
Because we have no information other than the
data, we estimate each of the probabilities on
the right-hand side of the equation from the
marginal totals
15
Contingency table
  • Sex × Death expected values

Dead
Sex NPRED PRED Total P
FEMALE 46.91 67.09 114 0.5044
MALE 46.09 65.91 112 0.4956
Total 93 133
P 0.4115 0.5885 N = 226
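
The expected values in this table follow from the marginal totals: each cell is N times the estimated row probability times the estimated column probability, which simplifies to (row total × column total) / N. A minimal Python sketch of that calculation; the function name `expected_counts` is illustrative.

```python
# Observed Sex x Death counts from the crosstabulation slide
observed = {
    "FEMALE": {"NPRED": 48, "PRED": 66},
    "MALE": {"NPRED": 45, "PRED": 67},
}

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = {r: sum(cols.values()) for r, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for c, n in cols.items():
            col_totals[c] = col_totals.get(c, 0) + n
    n_total = sum(row_totals.values())
    return {
        r: {c: row_totals[r] * col_totals[c] / n_total for c in col_totals}
        for r in row_totals
    }

expected = expected_counts(observed)
for sex, cols in expected.items():
    print(sex, {death: round(value, 2) for death, value in cols.items()})
```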
17
Testing the hypothesis: Pearson's chi-square test
$\chi^2 = 0.0866$, $P = 0.7685$
With Yates' continuity correction: $\chi^2 = 0.0253$, $P = 0.8736$
18
The degrees of freedom
$\text{df} = (I - 1)(J - 1) = (2 - 1)(2 - 1) = 1$ for a table with $I$ rows and $J$ columns
19
Calculating the P-value
  • We find the probability of obtaining a value of
    $\chi^2$ as large or larger than 0.0866 relative to a
    $\chi^2$ distribution with 1 degree of freedom
  • $P = 0.769$
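
The statistic and its P-value for the Sex × Death table can be reproduced with a short Python sketch. It relies on the fact that, for 1 degree of freedom, the upper-tail chi-square probability equals erfc(sqrt(x / 2)). The function names and the optional `yates` flag are illustrative choices, not something the slides specify.

```python
import math

# Rows: FEMALE, MALE; columns: NPRED, PRED
observed = [[48, 66], [45, 67]]

def pearson_chi2(table, yates=False):
    """Pearson chi-square for a two-way table; optionally apply Yates' correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            diff = abs(obs - exp)
            if yates:
                diff = max(diff - 0.5, 0.0)  # continuity correction
            stat += diff ** 2 / exp
    return stat

def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

chi2 = pearson_chi2(observed)
p_value = chi2_sf_1df(chi2)
chi2_yates = pearson_chi2(observed, yates=True)
p_yates = chi2_sf_1df(chi2_yates)

print(f"chi2 = {chi2:.4f}, P = {p_value:.4f}")
print(f"with continuity correction: chi2 = {chi2_yates:.4f}, P = {p_yates:.4f}")
```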

21
An alternative
  • The likelihood ratio test (G-test) compares the observed
    values with the distribution of expected values
    based on the multinomial probability distribution

$G = 0.0866$
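
For the same Sex × Death table, the likelihood ratio statistic can be computed as G = 2 Σ O ln(O / E), where O and E are the observed and expected counts. A minimal Python sketch under that standard definition; the function name `g_statistic` is illustrative.

```python
import math

# Rows: FEMALE, MALE; columns: NPRED, PRED
observed = [[48, 66], [45, 67]]

def g_statistic(table):
    """Likelihood ratio (G) statistic for independence in a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_totals[i] * col_totals[j] / n
            if obs > 0:  # a zero cell contributes nothing to the sum
                total += obs * math.log(obs / exp)
    return 2.0 * total

g = g_statistic(observed)
p_value = math.erfc(math.sqrt(g / 2.0))  # chi-square upper tail, 1 df
print(f"G = {g:.4f}, P = {p_value:.3f}")
```

With a table this close to independence, G and Pearson's chi-square agree to about three decimal places.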
22
Two way contingency tables
  • Sex × Death Crosstabulation
  • Sex × Marrow Crosstabulation
  • Marrow × Death Crosstabulation

23
Which test to choose?
Model Rows / Columns Sample size Test
I, II Not fixed, fixed / not fixed Small G-test, with corrections
I, II Not fixed, fixed / not fixed Large G-test, chi-square test
III Fixed Fisher exact test
24
Log-linear models: Multi-way Contingency Tables
25
Multiple two-way tables
Females Marrow
Death OG SWF TG Total
PRED 32 26 8 66
NPRED 26 6 16 48
Total 58 32 24 114
Males Marrow
Death OG SWF TG Total
PRED 43 14 10 67
NPRED 12 7 26 45
Total 55 21 36 112
26
Log-linear models
  • They treat the cell frequencies as counts
    distributed as a Poisson random variable
  • The expected cell frequencies are modeled against
    the variables using the log-link and Poisson
    error term
  • They are fit and parameters estimated using
    maximum likelihood techniques

27
Log-linear models
  • Do not distinguish response and predictor
    variables; all the variables are considered
    equally as response variables

28
However
  • A logit model with categorical variables can be
    analyzed as a log-linear model

29
Two way tables
  • For a two-way table ($I$ by $J$) we can fit two
    log-linear models
  • The first is a saturated (full) model
  • $\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY}$
  • $f_{ij}$ is the expected frequency in cell $ij$
  • $\lambda_i^X$ is the effect of category $i$ of variable $X$
  • $\lambda_j^Y$ is the effect of category $j$ of variable $Y$
  • $\lambda_{ij}^{XY}$ is the effect of any interaction between $X$
    and $Y$
  • This model fits the observed frequencies perfectly

30
Note
  • The effect does not imply any causality, just the
    influence of a variable or interaction between
    variables on the log of the expected number of
    observations in a cell

31
Two way tables
  • The second log-linear model represents
    independence of the two variables (X and Y) and
    is a reduced model
  • $\log f_{ij} = \text{constant} + \lambda_i^X + \lambda_j^Y$
  • The interpretation of this model is that the log
    of the expected frequency in any cell is a
    function of the mean of the log of all the
    expected frequencies plus the effect of variable
    x and the effect of variable y. This is an
    additive linear model with no interactions
    between the two variables

32
Interpretation
  • The parameters of the log-linear models are the
    effects of a particular category of each variable
    on the expected frequencies
  • i.e. a larger $\lambda$ means that the expected
    frequencies will be larger for that category.
  • These parameters are also deviations from the mean
    of the log of all expected frequencies
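
To make the "deviations from the mean" reading concrete, the sketch below computes the constant and the main-effect parameters for the Sex × Death independence model from the log expected frequencies. It assumes sum-to-zero (effect) coding, which is what the deviation interpretation implies; the slides do not state the coding, and the variable names are illustrative.

```python
import math

# Marginal totals of the Sex x Death table (N = 226)
row_totals = {"FEMALE": 114, "MALE": 112}
col_totals = {"NPRED": 93, "PRED": 133}
n = 226

# Log expected frequencies under independence
log_f = {
    (r, c): math.log(rt * ct / n)
    for r, rt in row_totals.items()
    for c, ct in col_totals.items()
}

# Constant: mean of the log expected frequencies over all cells
constant = sum(log_f.values()) / len(log_f)

# Main effects: deviation of each row (or column) mean from the constant
lam_x = {
    r: sum(log_f[r, c] for c in col_totals) / len(col_totals) - constant
    for r in row_totals
}
lam_y = {
    c: sum(log_f[r, c] for r in row_totals) / len(row_totals) - constant
    for c in col_totals
}

print("constant:", round(constant, 4))
print("lambda X (sex):", {k: round(v, 4) for k, v in lam_x.items()})
print("lambda Y (death):", {k: round(v, 4) for k, v in lam_y.items()})
```

Under this coding the effects within each variable sum to zero, and constant + lambda X + lambda Y reproduces the log expected frequency in every cell, which is the additive structure of the independence model.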

33
Null hypothesis of independence
  • The $H_0$ is that the sampling or experimental units
    come from a population of units in which the two
    variables (rows and columns) are independent of
    each other in terms of the cell frequencies
  • It is also a test that $\lambda_{ij}^{XY} = 0$
  • There is NO interaction between the two variables

34
Test
  • We can test this $H_0$ by comparing the fit of the
    model without this term to the saturated model
    that includes this term
  • We determine the fit of each model by calculating
    the expected frequencies under each model,
    comparing the observed and expected frequencies
    and calculating the log-likelihood of each model

35
Test
  • We then compare the fit of the two models with
    the likelihood ratio test statistic $\Lambda$
  • However, the sampling distribution of this ratio
    ($\Lambda$) is not well known, so instead we calculate
    the $G^2$ statistic
  • $G^2 = -2 \log \Lambda$
  • $G^2$ follows a $\chi^2$ distribution for reasonable
    sample sizes and can be generalized to
  • $G^2 = -2(\text{log-likelihood of reduced model} - \text{log-likelihood of full model})$
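
As an illustration of the log-likelihood form, the sketch below evaluates the Poisson log-likelihood of the Sex × Death counts under the independence (reduced) model and under the saturated (full) model, then takes minus twice the difference. The Poisson form is an assumption consistent with the log-linear slides; the function name `poisson_loglik` is illustrative.

```python
import math

# Cell order: FEMALE-NPRED, FEMALE-PRED, MALE-NPRED, MALE-PRED
observed = [48, 66, 45, 67]

# Reduced model: expected counts under independence (row total * column total / N)
expected_reduced = [114 * 93 / 226, 114 * 133 / 226, 112 * 93 / 226, 112 * 133 / 226]

# Full (saturated) model: expected counts equal the observed counts
expected_full = [float(o) for o in observed]

def poisson_loglik(obs, exp):
    """Log-likelihood of independent Poisson counts with the given means."""
    return sum(o * math.log(e) - e - math.lgamma(o + 1) for o, e in zip(obs, exp))

ll_reduced = poisson_loglik(observed, expected_reduced)
ll_full = poisson_loglik(observed, expected_full)
g2 = -2.0 * (ll_reduced - ll_full)

print(f"log-likelihood (reduced) = {ll_reduced:.4f}")
print(f"log-likelihood (full)    = {ll_full:.4f}")
print(f"G2 = {g2:.4f}")
```

Because both models reproduce the grand total, the terms that do not involve the expected counts cancel, and the result equals 2 Σ O ln(O / E).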

36
Degrees of freedom
  • The calculated $G^2$ is compared to a $\chi^2$
    distribution with $(I-1)(J-1)$ df.
  • This df, $(I-1)(J-1)$, is the difference between the
    number of free parameters in the full model, $IJ-1$,
    and in the reduced model, $(I-1)+(J-1)$

37
Akaike information criterion
Hirotugu Akaike
38
The full model
39
Complete table
Model $G^2$ df P AIC
1 D+S+M 42.76 7 <0.001 28.76
2 DS 42.68 6 <0.001 30.68
3 DM 13.24 5 0.021 3.24
4 SM 37.98 5 <0.001 27.98
5 DS+DM 13.16 4 0.01 5.16
6 DS+SM 37.89 4 <0.001 29.89
7 DM+SM 8.46 3 0.037 2.46
8 DS+DM+SM 7.19 2 0.027 3.19
9 Saturated (full) model 0 0
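
The slides do not say which software produced this table. As a hedged sketch, hierarchical log-linear models like these can be fitted by iterative proportional fitting (IPF), which yields the maximum likelihood expected frequencies. The code below makes two assumptions that are not stated in the slides: a label such as "DS" is read as the D × S interaction plus all remaining main effects (this reading reproduces the df column), and the AIC column is taken to be $G^2 - 2\,\text{df}$ (this matches every row of the table). All function and variable names are illustrative.

```python
import math
from itertools import combinations

VARS = ("D", "S", "M")             # Death, Sex, Marrow
LEVELS = {"D": 2, "S": 2, "M": 3}  # number of categories per variable

# Observed counts keyed by (death, sex, marrow)
OBS = {
    ("PRED", "FEMALE", "OG"): 32, ("PRED", "FEMALE", "SWF"): 26, ("PRED", "FEMALE", "TG"): 8,
    ("NPRED", "FEMALE", "OG"): 26, ("NPRED", "FEMALE", "SWF"): 6, ("NPRED", "FEMALE", "TG"): 16,
    ("PRED", "MALE", "OG"): 43, ("PRED", "MALE", "SWF"): 14, ("PRED", "MALE", "TG"): 10,
    ("NPRED", "MALE", "OG"): 12, ("NPRED", "MALE", "SWF"): 7, ("NPRED", "MALE", "TG"): 26,
}

def margin(table, term):
    """Sum a table over every variable not named in `term` (e.g. "DM")."""
    idx = [VARS.index(v) for v in term]
    out = {}
    for cell, n in table.items():
        key = tuple(cell[i] for i in idx)
        out[key] = out.get(key, 0.0) + n
    return out

def ipf(obs, terms, sweeps=200):
    """Fit a hierarchical log-linear model by matching each fitted margin in turn."""
    fit = dict.fromkeys(obs, 1.0)
    for _ in range(sweeps):
        for term in terms:
            idx = [VARS.index(v) for v in term]
            target, current = margin(obs, term), margin(fit, term)
            for cell in fit:
                key = tuple(cell[i] for i in idx)
                fit[cell] *= target[key] / current[key]
    return fit

def g2(obs, fit):
    """Deviance of the fitted model relative to the saturated model."""
    return 2.0 * sum(n * math.log(n / fit[cell]) for cell, n in obs.items() if n > 0)

def residual_df(terms):
    """(cells - 1) minus the free parameters implied by the hierarchical terms."""
    effects = set()
    for term in terms:
        for r in range(1, len(term) + 1):
            effects.update(combinations(term, r))
    n_params = sum(math.prod(LEVELS[v] - 1 for v in e) for e in effects)
    return (math.prod(LEVELS.values()) - 1) - n_params

# Model numbers follow the table; term lists are the assumed reading of its labels
MODELS = {
    1: ["D", "S", "M"], 2: ["DS", "M"], 3: ["DM", "S"], 4: ["SM", "D"],
    5: ["DS", "DM"], 6: ["DS", "SM"], 7: ["DM", "SM"], 8: ["DS", "DM", "SM"],
}

results = {}
for number, terms in MODELS.items():
    stat = g2(OBS, ipf(OBS, terms))
    df = residual_df(terms)
    results[number] = (stat, df, stat - 2 * df)  # (G2, df, assumed AIC)
    print(f"{number} {'+'.join(terms):10s} G2 = {stat:6.2f}  df = {df}  AIC = {stat - 2 * df:6.2f}")
```

IPF is only one way to obtain these fits; a Poisson generalized linear model with a log link, as described in the log-linear model slides, should give the same expected frequencies.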
40
Two way interactions (marginal independence)
Term Models compared $G^2$ Difference in $G^2$ df P
D+S+M 1 (reference) 42.76
DS 1 vs 2 42.6759 42.76 - 42.68 = 0.084 7 - 6 = 1 0.769
DM 1 vs 3 13.24 42.76 - 13.24 = 29.520 7 - 5 = 2 <0.001
SM 1 vs 4 37.98 42.76 - 37.98 = 4.778 7 - 5 = 2 0.092
41
Three way interaction
  • Death × Sex × Marrow
  • Models compared: 8 vs 9
  • $G^2 = 7.19$
  • df = 2
  • $P = 0.027$

42
Conditional independence
Term Models compared $G^2$ df P
DS 7 vs 8 1.28 1 0.259
DM 6 vs 8 30.71 2 <0.001
SM 5 vs 8 5.97 2 0.051
Death and marrow have a partial association
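
Each of these tests is a comparison of two nested models: the difference in $G^2$ is referred to a $\chi^2$ distribution whose df is the difference in the models' df. The sketch below applies that rule to the rounded values in the complete table, so the results can differ from the slides in the last digit. The `chi2_sf` helper uses the closed-form series for integer df; all names are illustrative.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-h) * sum_{k < df/2} h^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= h / k
            total += term
        return math.exp(-h) * total
    # Odd df: erfc(sqrt(h)) + exp(-h) * sum_{k=1}^{(df-1)/2} h^(k-1/2) / Gamma(k+1/2)
    total = 0.0
    term = math.sqrt(h) / math.gamma(1.5)
    for k in range(1, (df - 1) // 2 + 1):
        if k > 1:
            term *= h / (k - 0.5)
        total += term
    return math.erfc(math.sqrt(h)) + math.exp(-h) * total

# (G2, df) per model number, as printed in the complete table
TABLE = {1: (42.76, 7), 2: (42.68, 6), 3: (13.24, 5), 4: (37.98, 5),
         5: (13.16, 4), 6: (37.89, 4), 7: (8.46, 3), 8: (7.19, 2), 9: (0.0, 0)}

def compare(reduced, full):
    """Return (difference in G2, difference in df, P) for two nested models."""
    g_reduced, df_reduced = TABLE[reduced]
    g_full, df_full = TABLE[full]
    delta_g, delta_df = g_reduced - g_full, df_reduced - df_full
    return delta_g, delta_df, chi2_sf(delta_g, delta_df)

comparisons = {
    "DS marginal (1 vs 2)": compare(1, 2),
    "DM marginal (1 vs 3)": compare(1, 3),
    "SM marginal (1 vs 4)": compare(1, 4),
    "DS conditional (7 vs 8)": compare(7, 8),
    "DM conditional (6 vs 8)": compare(6, 8),
    "SM conditional (5 vs 8)": compare(5, 8),
    "three-way (8 vs 9)": compare(8, 9),
}
for label, (delta_g, delta_df, p) in comparisons.items():
    print(f"{label:26s} G2 = {delta_g:6.2f}  df = {delta_df}  P = {p:.3f}")
```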
43
Conditional independence
Females Marrow
Death OG SWF TG Total
PRED 32 26 8 66
NPRED 26 6 16 48
Total 58 32 24 114
Males Marrow
Death OG SWF TG Total
PRED 43 14 10 67
NPRED 12 7 26 45
Total 55 21 36 112
44
Odds ratios
Comparison Males 95% CI Females 95% CI
OG vs TG 0.107 0.041-0.283 0.406 0.150-1.097
SWF vs TG 0.192 0.060-0.616 0.115 0.034-0.395
SWF vs OG 0.558 0.184-1.693 3.521 1.261-9.836
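
The slides do not state how these values were computed. The OG vs TG row is reproduced by the odds of non-predation death in one marrow category relative to the other, with a Wald interval on the log scale, exp(ln OR ± 1.96 SE) where SE = sqrt(1/a + 1/b + 1/c + 1/d). The sketch below shows that calculation; treat the method as an inference from the numbers, and the function name as illustrative.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio (a*d)/(b*c) for a 2x2 table with a Wald interval on the log scale."""
    odds_ratio = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# OG vs TG: a = NPRED/OG, b = PRED/OG, c = NPRED/TG, d = PRED/TG
males = odds_ratio_wald(12, 43, 26, 10)
females = odds_ratio_wald(26, 32, 16, 8)

print("Males   OG vs TG: %.3f (%.3f-%.3f)" % males)
print("Females OG vs TG: %.3f (%.3f-%.3f)" % females)
```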

45
Complete independence
  • Models compared: 1 vs 8
  • $G^2 = 35.57$
  • df = 5
  • $P < 0.001$

46
Warning
  • Always fit a saturated model first, containing
    all the variables of interest and all the
    interactions involving the (potential) nuisance
    variables. Only delete from the model the
    interactions that involve the variables of
    interest.