Title: STA 107: Logistic Regression and Categorical Data Analysis
- Lecturer: Dr. Daisy Dai
- Department of Medical Research
Contents
- Binary Logit Analysis
- Simple Logistic Regression
- Multiple Logistic Regression
- Stepwise or Backward Model Selections
- Collinearity
Categorical Data Analysis
- Binomial Test
- Chi-square Test
- Fisher's Exact Test
- McNemar's Test
- Cochran-Mantel-Haenszel Test
Binomial Test
- Make inferences about a proportion of binary
outcomes by comparing the confidence interval of
the proportion with a target value.
Case Study: Genital Warts
- A company markets a therapeutic product for
genital warts with a known cure rate of 40% in
the general population. In a study of 25
patients with genital warts treated with this
product, patients were also given high doses of
vitamin C. As shown in the table on the next slide,
14 patients were cured. Is this consistent with
the cure rate in the general population?
Treatment of Genital Warts
ID Effectiveness
1 YES
2 NO
3 YES
4 NO
5 YES
6 YES
7 NO
8 YES
9 NO
10 NO
11 YES
12 NO
13 YES
14 NO
15 YES
16 NO
17 NO
18 YES
19 YES
20 NO
21 YES
22 YES
23 NO
24 YES
25 YES
Results
- 64% (16/25) of patients were cured by the
treatment.
- The 95% confidence interval extends from 44% to
80%.
- If the probability of "success" in each trial or
subject is 0.300, then the chance of observing 16
or more successes in 25 trials is 0.045
(p-value).
- The cure rate of genital warts with the experimental
therapy was significantly higher than 30%.
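The exact binomial calculation behind this kind of result can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the software used for the slides: the function name is hypothetical, and the inputs are the ones stated in the case study (14 cured out of 25, target rate 40%), so the printed value is not expected to reproduce the figures quoted in the Results.

```python
from math import comb

def binomial_upper_tail(successes, n, p0):
    """One-sided exact p-value: P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes, n + 1))

# Inputs taken from the case study statement (assumption: one-sided test)
cured, patients, target_rate = 14, 25, 0.40
p_value = binomial_upper_tail(cured, patients, target_rate)
print(f"observed rate = {cured / patients:.0%}, one-sided p = {p_value:.4f}")
```

Changing `cured` or `target_rate` shows how sensitive the conclusion is to the observed count and to the target proportion.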
Fisher's Exact Test
- A conservative non-parametric test about a
relationship between two categorical variables.
The groups in comparison should be independent.
           Responders   Non-responders   Total
Group 1    N11          N12              N11 + N12
Group 2    N21          N22              N21 + N22
Combined   N11 + N21    N12 + N22        N
Case Study: CHF Incidence
- A new adenosine-releasing agent (ARA), thought
to reduce side effects in patients undergoing
coronary artery bypass surgery (CABG), was
studied in a pilot trial.
           CHF       No CHF   Total
ARA        2 (6%)    33       35
Placebo    5 (20%)   20       25
Combined   7         53       60
Fisher's exact test: p = 0.0455
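Fisher's exact test sums hypergeometric probabilities over all tables with the same margins. The sketch below is a minimal standard-library version for a 2x2 table, using the common two-sided rule of adding up every table that is no more probable than the observed one; the function name and that rule are assumptions, not something stated on the slide. The result need not match the p-value quoted on the slide, which may reflect a different sidedness or software convention.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k):
        # Hypergeometric probability of k events in row 1 with fixed margins
        return comb(row1, k) * comb(row2, col1 - k) / total

    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Add up every table that is no more probable than the observed one
    return sum(prob(k) for k in range(k_min, k_max + 1)
               if prob(k) <= p_observed * (1 + 1e-9))

# Counts from the CHF table: ARA 2 / 33, Placebo 5 / 20
p_two_sided = fisher_exact_2x2(2, 33, 5, 20)
print(f"two-sided Fisher's exact p = {p_two_sided:.4f}")
```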
Chi-square Test
- Test a relationship between two categorical
variables. Groups should be independent. The
chi-square test assumes that the expected value
for each cell is five or higher.
Case Study: ADR Frequency with Antibiotic Treatment
- A study was conducted to monitor the incidence
of GI adverse drug reactions of a new antibiotic
used in lower respiratory tract infections.
                         Responders   Non-responders   Total
Test (new antibiotic)    22 (33%)     44               66
Control (erythromycin)   28 (54%)     24               52
Combined                 50 (42%)     68               118
Chi-square test: p = 0.0252; Fisher's exact test: p = 0.0385
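For a 2x2 table the Pearson chi-square statistic has a closed form, and with one degree of freedom its p-value can be obtained from the complementary error function. The sketch below applies this to the antibiotic table; the function name is hypothetical and no continuity correction is applied, which is an assumption chosen because it gives a p-value close to the 0.0252 quoted above.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    p = erfc(sqrt(chi2 / 2))
    # Smallest expected cell count, to check the "five or higher" assumption
    min_expected = min(a + b, c + d) * min(a + c, b + d) / n
    return chi2, p, min_expected

# Test group 22 / 44, Control group 28 / 24
chi2, p, min_expected = chi_square_2x2(22, 44, 28, 24)
print(f"chi-square = {chi2:.3f}, p = {p:.4f}, "
      f"smallest expected count = {min_expected:.1f}")
```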
McNemar's Test
- Compares response rates in binary data between
two related populations. It is analogous to the
chi-square test or Fisher's exact test for
independent populations.
Before \ After    Responders   Non-responders
Responders        A            B
Non-responders    C            D
Case Study: Bilirubin
- A study was conducted to evaluate the toxicity
side effects of an experimental therapy. Patients
(n = 86) were treated with the experimental drug
for 3 months. The clinical lab measured the bilirubin
level of each patient at baseline and 3 months
after therapy.
Results of McNemar's Test
Before \ After     Normal   Abnormally high
Normal             60       14
Abnormally high    6        6
- At baseline, 14% (12/86) of patients had an
abnormally high bilirubin level.
- At 3 months post treatment, 23% (20/86) of
patients had an abnormally high bilirubin level.
- P-value = 0.1175
- Odds ratio = 2.3, 95% CI 0.8 to 7.4
- There is not enough evidence to conclude that
the treatment increases the risk of high
bilirubin.
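McNemar's test uses only the discordant pairs: the 14 patients who moved from normal to abnormally high and the 6 who moved the other way. A minimal standard-library sketch follows; the function name is hypothetical, and the continuity correction is an assumption chosen because it gives a p-value close to the 0.1175 reported above.

```python
from math import erfc, sqrt

def mcnemar(b, c, continuity=True):
    """McNemar chi-square (1 df) from the two discordant cell counts."""
    diff = abs(b - c) - (1 if continuity else 0)
    chi2 = diff ** 2 / (b + c)
    # For 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2))
    return chi2, erfc(sqrt(chi2 / 2))

normal_to_high, high_to_normal = 14, 6
chi2, p = mcnemar(normal_to_high, high_to_normal)
odds_ratio = normal_to_high / high_to_normal  # ratio of discordant pairs
print(f"chi-square = {chi2:.2f}, p = {p:.4f}, odds ratio = {odds_ratio:.1f}")
```

The 60 and 6 concordant pairs do not enter the statistic, which is why the test is suited to before/after data on the same patients.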
Cochran-Mantel-Haenszel (CMH) Test
- The Cochran-Mantel-Haenszel test is a method to
compare the probability of an event among
independent groups in stratified samples.
- The stratification factor can be study center,
gender, race, age group, obesity status or
disease severity. These underlying
sub-populations can be confounding factors that
affect the associations between risk factors and
the outcome variables.
Case Study: Diabetic Ulcers
- A multi-center study with 4 centers is testing an
experimental treatment, Dermotel, used to
accelerate the healing of dermal foot ulcers in
diabetic patients. Sodium hyaluronate was used
in a control group. Patients who showed a
decrease in ulcer size of at least 90% by surface
area measurement after 20 weeks of treatment were
considered responders. The numbers of
responders in each group are shown in Table 19.2
for each study center. Is there an overall
difference in response rates between the Dermotel
and control groups?
Response Frequencies by Study Center
Study Center   Treatment Group   Response   Non-Response
1              Dermotel          26 (87%)   4
1              Control           18 (62%)   11
2              Dermotel          8 (73%)    3
2              Control           7 (58%)    5
3              Dermotel          7 (58%)    5
3              Control           4 (40%)    6
4              Dermotel          11 (65%)   6
4              Control           9 (64%)    5
- The interest in this study is to compare the
response rates of the two treatments. Because the
study was conducted in four centers, there is
concern that study center may influence the
response rate. By including the study center, the
researcher can examine the association between
the treatment and the response rate while
adjusting (controlling) for the effect of study
center.
- The Cochran-Mantel-Haenszel test assumes a common
odds ratio and tests the null hypothesis that the
explanatory variable X (treatment) and the
outcome variable Y (response) are
conditionally independent, given the control
variable Z (study center). In other words, CMH
tests whether the response is conditionally
independent of the explanatory variable when
adjusting for the control variable.
- One can also measure the average conditional
association between the explanatory variable
(treatment) and the response variable by
calculating the common odds ratio conditioned on
the control variable (study center).
Results
               Response Rates
Study Center   Active (n)    Control (n)   Chi-Square   p-Value
1              86.7% (30)    62.1% (29)    4.706        0.030
2              72.7% (11)    58.3% (12)    0.524        0.469
3              58.3% (12)    40.0% (10)    0.733        0.392
4              64.7% (17)    64.3% (14)    0.001        0.981
Overall        74.3% (70)    58.5% (65)    4.039        0.044
The Overall chi-square and p-value are from the CMH test.
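The CMH statistic pools, across strata, the difference between the observed and expected number of responders in the active group and divides the squared total by the pooled hypergeometric variance. The sketch below applies this to the four study centers using only the standard library. The function name is hypothetical, and omitting the continuity correction is an assumption, chosen because it gives a p-value close to the 0.044 in the Overall row. The Mantel-Haenszel estimate of the common odds ratio is returned as well.

```python
from math import erfc, sqrt

# (Dermotel responders, Dermotel non-responders,
#  Control responders, Control non-responders) for centers 1 to 4
centers = [(26, 4, 18, 11), (8, 3, 7, 5), (7, 5, 4, 6), (11, 6, 9, 5)]

def cmh_test(tables):
    """CMH chi-square (1 df, uncorrected), p-value and common odds ratio."""
    diff_sum = var_sum = or_num = or_den = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        row1, row2, col1, col2 = a + b, c + d, a + c, b + d
        diff_sum += a - row1 * col1 / n           # observed minus expected
        var_sum += row1 * row2 * col1 * col2 / (n * n * (n - 1))
        or_num += a * d / n                       # Mantel-Haenszel numerator
        or_den += b * c / n                       # Mantel-Haenszel denominator
    chi2 = diff_sum ** 2 / var_sum
    return chi2, erfc(sqrt(chi2 / 2)), or_num / or_den

chi2, p, common_or = cmh_test(centers)
print(f"CMH chi-square = {chi2:.3f}, p = {p:.3f}, "
      f"common odds ratio = {common_or:.2f}")
```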
Chi-Square Test, Ignoring Strata
Group     Response      Non-Response   Total
Active    52 (74.3%)    18             70
Control   38 (58.5%)    27             65
Total     90 (66.7%)    45             135
Chi-square value = 3.798, p = 0.051
Counter-Intuitive Combined Results
Stratum    Group   Responders   Non-Responders   Total   Response Rate
1          A       10           38               48      21%
1          B       4            21               25      16%
2          A       20           10               30      67%
2          B       27           17               44      61%
Combined   A       30           48               78      38%
Combined   B       31           38               69      45%
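The reversal in this table (often called Simpson's paradox) can be checked by recomputing the rates directly. The short sketch below uses the counts from the table; the variable names are illustrative only.

```python
# (responders, total) for each group within each stratum, from the table
strata = {
    1: {"A": (10, 48), "B": (4, 25)},
    2: {"A": (20, 30), "B": (27, 44)},
}

def rate(responders, total):
    return responders / total

# Within each stratum, group A has the higher response rate
for stratum, groups in strata.items():
    print(f"stratum {stratum}: A = {rate(*groups['A']):.0%}, "
          f"B = {rate(*groups['B']):.0%}")

# Pooling the strata reverses the direction of the comparison
combined = {
    g: (sum(strata[s][g][0] for s in strata),
        sum(strata[s][g][1] for s in strata))
    for g in ("A", "B")
}
print(f"combined: A = {rate(*combined['A']):.0%}, "
      f"B = {rate(*combined['B']):.0%}")
```

The reversal arises because the two groups are distributed very differently across the strata, which is the situation a stratified analysis such as the CMH test is meant to handle.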
Logistic Regression
- Logistic regression is a method to identify the
associations between a categorical outcome
variable and explanatory variables.
- In most cases, the outcome variable is
dichotomous. The explanatory variables can be
categorical or continuous. The probability of the
outcome can be predicted from the values
of the explanatory variables.
- Dichotomous outcome variable ~ explanatory
variables
- log(P/(1-P)) = a + b1 x1 + b2 x2 + b3 x3 + b4 x4
Odds Ratio
- Let Y be the dichotomous variable, where Y = 1
indicates an event and Y = 0 indicates no event.
- Odds = probability of an event / probability of no
event = P(Y=1)/P(Y=0) = P(Y=1)/(1 - P(Y=1))
- Odds ratio = odds in the test group / odds in the
control group
- Logistic model: log(odds of an event) ~
explanatory variables
- log(odds) = a + b1 x1 + b2 x2 + b3 x3 + b4 x4
Case Study: CHF Incidence
- Odds of CHF incidence in the ARA
group = (2/35)/(33/35) = 2/33 = 6%.
- Odds of CHF incidence in the Placebo
group = (5/25)/(20/25) = 5/20 = 25%.
- Odds ratio = odds in the ARA group / odds in the
Placebo group = (2/33)/(5/20) = 0.24
- The risk (odds) of CHF incidence in the ARA group
is only 24% of the risk (odds) in the Placebo group.
Properties of Odds Ratio
- The odds ratio is non-negative.
- If the odds ratio < 1, the risk is smaller than
in the control group.
- If the odds ratio > 1, the risk is larger than
in the control group.
- Odds ratio of no event = 1 / odds ratio of an event.
- One can calculate the confidence interval of an
odds ratio. The confidence interval of a
significant odds ratio does not contain 1.
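These properties can be illustrated with the CHF counts (ARA 2 CHF / 33 no CHF, Placebo 5 CHF / 20 no CHF). The sketch below computes the odds ratio and an approximate 95% confidence interval on the log scale (the Wald or Woolf method). The function name and the choice of interval method are assumptions; with counts this small the interval is only a rough approximation, and exact methods may give different limits.

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio (a/b)/(c/d) with a Wald confidence interval on the log scale."""
    odds_ratio = (a / b) / (c / d)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# ARA versus Placebo
or_ara, lower, upper = odds_ratio_ci(2, 33, 5, 20)
print(f"odds ratio = {or_ara:.2f}, 95% CI {lower:.2f} to {upper:.2f}")

# Swapping the groups gives the reciprocal odds ratio
or_placebo, _, _ = odds_ratio_ci(5, 20, 2, 33)
print(f"Placebo versus ARA = {or_placebo:.1f}")
```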
Case Study: CHF Incidence
- Odds ratio for ARA versus Control =
(2/33)/(5/20) = 0.24 < 1, so the risk of CHF
incidence in the ARA group is relatively smaller.
- One can also calculate the odds ratio for Control
versus ARA as 1/0.24 = 4.1 > 1, which indicates
that the risk (odds) of CHF in the Placebo group
is 4.1-fold the risk in the ARA group.
Logistic Probability Curve
- log(p/(1-p)) = a + bx
- p/(1-p) = exp(a + bx)
- p = 1/(1 + exp(-a - bx))
[Figure: logistic curve, probability (vertical axis) against X (horizontal axis)]
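The three forms of the logistic model are equivalent, which a few lines of Python can confirm. The coefficient values a and b below are hypothetical, chosen only to trace the S-shaped curve; they do not come from any data in these slides.

```python
from math import exp, log

def logistic(x, a, b):
    """Probability p = 1 / (1 + exp(-a - b x))."""
    return 1.0 / (1.0 + exp(-a - b * x))

def logit(p):
    """Log odds: log(p / (1 - p))."""
    return log(p / (1.0 - p))

a, b = -2.0, 0.5  # hypothetical intercept and slope

for x in range(0, 9, 2):
    p = logistic(x, a, b)
    # logit(p) recovers the linear predictor a + b x
    print(f"x = {x}: p = {p:.3f}, log odds = {logit(p):.2f}")

# A one-unit increase in x multiplies the odds by exp(b)
odds_at_1 = logistic(1, a, b) / (1 - logistic(1, a, b))
odds_at_2 = logistic(2, a, b) / (1 - logistic(2, a, b))
print(f"odds ratio per unit of x = {odds_at_2 / odds_at_1:.3f}")
```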
Logistic Regression vs. Linear Regression
Common: In regression we are looking for the
dependence of one variable, the dependent
variable, on others, the independent variable(s).
- In linear regression the dependent variable is
continuous.
- The relationship is summarized by a regression
equation consisting of a slope and an intercept.
The slope represents the amount by which the
dependent variable increases with a unit increase
in the independent variable, and the intercept
represents the value of the dependent variable
when the independent variable takes the value
zero.
- In logistic regression the dependent variable is
binary.
- In logistic regression the slope represents the
change in log odds for a unit increase in the
independent variable.
- In multiple regression we are interested in the
simultaneous relationship between one dependent
variable and a number of independent variables.
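To make the meaning of the logistic slope concrete, the sketch below fits a simple logistic regression to the CHF data from the earlier case study, coding ARA as x = 1 and Placebo as x = 0. It uses a hand-written Newton-Raphson loop rather than a statistics package; the function name, the fixed iteration count and the starting values are all assumptions made for illustration. With a single binary predictor, the fitted slope should equal the log of the odds ratio, (2/33)/(5/20).

```python
from math import exp, log

def fit_logistic_binary(groups, iterations=25):
    """Fit log(p/(1-p)) = a + b x by Newton-Raphson.

    groups: list of (x, events, n) with one entry per covariate value.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, events, n in groups:
            p = 1.0 / (1.0 + exp(-(a + b * x)))
            resid = events - n * p        # score contribution
            w = n * p * (1.0 - p)         # information weight
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system (information matrix) * step = score
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# (x, CHF events, patients): ARA coded 1, Placebo coded 0
chf_groups = [(1, 2, 35), (0, 5, 25)]
a, b = fit_logistic_binary(chf_groups)
print(f"intercept = {a:.3f}, slope = {b:.3f}, exp(slope) = {exp(b):.2f}")
```

Here exp(intercept) is the odds in the Placebo group and exp(slope) is the odds ratio for ARA versus Placebo.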
Case Study: Relapse Rate in AML
- One hundred and two patients with acute
myelogenous leukemia (AML) in remission were
enrolled in a study of a new antisense
oligonucleotide (asODN). The patients were
randomly assigned to receive a 10-day infusion of
asODN or no treatment (Control), and the effects
were followed for 90 days. The time of remission
from diagnosis or prior relapse (X, in months) at
study enrollment was considered an important
covariate in predicting response. The response
data are shown on the next slide, with Y = 1 indicating
relapse, death, or major intervention, such as
bone marrow transplant before Day 90. Is there
any evidence that administration of asODN is
associated with a decreased relapse rate?
Effect Selection Methods
- Statistical model selection facilitates the
selection and screening of explanatory variables
from a set of candidate variables. The commonly
used model selection methods include:
- Backward selection: start with all candidate
variables and test them one by one for
statistical significance, deleting any that are
not significant.
- Forward selection: start with no variables in
the model, try the variables one by one, and
include them if they are statistically
significant.
- Stepwise selection: a combination of both
methods. Select the most significant variable from
the candidate pool, and remove a variable if
it is not significant in the joint model. Repeat
this process step by step for all
remaining variables.
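The control flow of backward selection can be sketched independently of any particular fitting routine. In the outline below, `p_values_for` is a hypothetical callable that refits the model with the given variables and returns a p-value for each; the toy version used here simply looks up fixed, made-up p-values so the loop can be run, and the variable names are invented for illustration. A real analysis would refit the logistic model at every step.

```python
def backward_select(candidates, p_values_for, alpha=0.05):
    """Drop the least significant variable until all remaining ones pass."""
    selected = list(candidates)
    while selected:
        p_values = p_values_for(selected)   # p-values in the joint model
        worst = max(selected, key=lambda v: p_values[v])
        if p_values[worst] <= alpha:
            break                           # everything left is significant
        selected.remove(worst)
    return selected

# Toy stand-in for model fitting: fixed, made-up p-values (hypothetical)
made_up_p = {"treatment": 0.01, "remission_time": 0.03,
             "age": 0.40, "sex": 0.75}

def toy_p_values(selected):
    return {v: made_up_p[v] for v in selected}

kept = backward_select(made_up_p.keys(), toy_p_values)
print(kept)
```

Forward selection follows the same pattern in reverse: start from an empty list and add the candidate with the smallest p-value while it stays below the threshold.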
Flow Chart of Forward Selection
Multicollinearity
- Multicollinearity occurs when two or more
explanatory variables in a multiple regression
model are highly correlated. In other words,
there are redundant explanatory variables in the
multiple regression model.
- Multicollinearity can cause problematic estimates
of the individual effects. A high degree of
multicollinearity can also cause computer
software packages to be unable to perform the
matrix inversion that is required for computing
the regression coefficients, or it may make the
results of that inversion inaccurate.
- Note that in statements of the assumptions
underlying regression analyses such as ordinary
least squares, the phrase "no multicollinearity"
is sometimes used to mean the absence of perfect
multicollinearity, which is an exact
(non-stochastic) linear relation among the
regressors.
Detection of Multicollinearity
- Large changes in the estimated regression
coefficients when a predictor variable is added
or deleted.
- Tests of the individual effects of the affected
variables are not significant, but a global test
of the overall model is significant (using an
F-test).
- Use the variance inflation factor (VIF) to detect
multicollinearity: regress an explanatory variable
on all the other explanatory variables. A high
coefficient of determination, R^2, indicates that
the regressed explanatory variable is highly
correlated with the other explanatory variables.
Tolerance = 1 - R^2 and VIF = 1/tolerance. A
tolerance of less than 0.20 or 0.10, or a VIF of
5 or 10 and above, indicates a multicollinearity
problem.
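The relationship between R^2, tolerance and VIF is simple arithmetic, and the thresholds quoted above correspond to R^2 values of 0.80 and 0.90. A minimal sketch follows; the function name is hypothetical, and the R^2 values are chosen only to show the thresholds, not taken from any data.

```python
def tolerance_and_vif(r_squared):
    """Tolerance = 1 - R^2 and VIF = 1 / tolerance for one predictor."""
    tolerance = 1.0 - r_squared
    return tolerance, 1.0 / tolerance

# Illustrative R^2 values from regressing one predictor on the others
for r_squared in (0.50, 0.80, 0.90):
    tolerance, vif = tolerance_and_vif(r_squared)
    print(f"R^2 = {r_squared:.2f}: tolerance = {tolerance:.2f}, VIF = {vif:.1f}")
```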
Caveats
- Sometimes logistic regression is carried out when
a dependent variable has been dichotomized. It is
important that the cut point is not derived by
direct examination of the data, for example by
finding a gap in the data that maximizes the
discrimination between the selected groups, as
this can lead to biased results. It is best if
there are a priori grounds for choosing a
particular cut point.
References
- Common Statistical Methods for Clinical Research,
2nd Edition, by Glenn Walker
- Logistic Regression Using the SAS System, by Paul
Allison
- Medical Statistics, by Campbell et al.