1
STA 107: Logistic Regression and Categorical
Data Analysis
  • Lecturer: Dr. Daisy Dai
  • Department of Medical Research

2
Contents
  • Binary Logit Analysis
  • Simple Logistic Regression
  • Multiple Logistic Regression
  • Stepwise or Backward Model Selections
  • Collinearity

3
Categorical Data Analysis
  • Binomial Test
  • Chi-square Test
  • Fisher's Exact Test
  • McNemar's Test
  • Cochran-Mantel-Haenszel Test

4
Binomial test
  • Makes inferences about the proportion of a
    binary outcome by comparing the confidence
    interval for the proportion with a target value.

5
Case Study: Genital Warts
  • A company markets a therapeutic product for
    genital warts with a known cure rate of 40% in
    the general population. In a study of 25
    patients with genital warts treated with this
    product, patients were also given high doses of
    vitamin C. As shown in the table on the next
    slide, 14 patients were cured. Is this consistent
    with the cure rate in the general population?

6
Treatment of Genital Warts
ID Effectiveness
1 YES
2 NO
3 YES
4 NO
5 YES
6 YES
7 NO
8 YES
9 NO
10 NO
11 YES
12 NO
13 YES
14 NO
15 YES
16 NO
17 NO
18 YES
19 YES
20 NO
21 YES
22 YES
23 NO
24 YES
25 YES
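
The question posed in the case description can be sketched numerically as an exact binomial tail probability. This is a minimal sketch rather than the output of the lecturer's software: the function name `binom_upper_tail` is made up for illustration, and the inputs are the figures from the case description (14 cures among 25 patients, reference cure rate 0.40).

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Figures from the case description: 14 of 25 cured, reference rate 40%.
observed_rate = 14 / 25
p_upper = binom_upper_tail(14, 25, 0.40)
print(f"observed cure rate = {observed_rate:.2f}")
print(f"P(X >= 14 | n = 25, p = 0.40) = {p_upper:.4f}")
```

The value printed is a one-sided probability. A two-sided test would also count equally extreme outcomes in the lower tail, and other counts or reference rates can be substituted in the call.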
7
Results
  • 64% (16/25) of patients were cured by the
    treatment.
  • The 95% confidence interval extends from 44% to
    80%.
  • If the probability of "success" in each trial or
    subject is 0.300, then the chance of observing 16
    or more successes in 25 trials is 0.045
    (p-value).
  • The cure rate of genital warts with the
    experimental therapy was significantly higher
    than 30%.

8
Fisher's Exact Test
  • A conservative non-parametric test of the
    relationship between two categorical variables.
    The groups being compared should be independent.

           Responders   Non-responders   Total
Group 1    N11          N12              N11 + N12
Group 2    N21          N22              N21 + N22
Combined   N11 + N21    N12 + N22        N
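
The test can be sketched directly from the table layout above by enumerating every table that shares the observed margins. This is only an illustrative sketch: the function name is hypothetical, the two-sided rule used here (sum the probabilities of all tables no more likely than the observed one) is one common convention that the slides do not specify, and the counts in the usage line are made up.

```python
from math import comb

def fisher_exact_two_sided(n11, n12, n21, n22):
    """Two-sided Fisher's exact p-value for a 2x2 table.

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2 = n11 + n12, n21 + n22
    col1 = n11 + n21
    total = comb(row1 + row2, col1)

    def prob(x):  # probability that the top-left cell equals x
        return comb(row1, x) * comb(row2, col1 - x) / total

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(n11)
    # Small relative tolerance guards against floating-point ties.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))

# Made-up counts, for illustration only (not from the lecture).
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"two-sided p = {p_value:.4f}")
```

Because the p-value comes from exact enumeration rather than a large-sample approximation, the approach stays valid when cell counts are small.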
9
Case Study: CHF Incidence
  • A new adenosine-releasing agent (ARA), thought
    to reduce side effects in patients undergoing
    coronary artery bypass surgery (CABG), was
    studied in a pilot trial.

           CHF        No CHF   Total
ARA        2 (6%)     33       35
Placebo    5 (20%)    20       25
Combined   7          53       60
Fisher's exact test: p = 0.0455
10
Chi-square test
  • Tests the relationship between two categorical
    variables. Groups should be independent. The
    chi-square test assumes that the expected count
    in each cell is five or higher.

11
Case Study: ADR Frequency with Antibiotic
Treatment
  • A study was conducted to monitor the incidence
    of GI adverse drug reactions of a new antibiotic
    used in lower respiratory tract infections.

                         Responders   Non-responders   Total
Test (new antibiotic)    22 (33%)     44               66
Control (erythromycin)   28 (54%)     24               52
Combined                 50 (42%)     68               118
Chi-square test: p = 0.0252; Fisher's exact test: p = 0.0385
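
The chi-square result for this table can be checked by hand. The sketch below assumes the Pearson statistic without continuity correction (the slide does not say which variant was used) and a hypothetical helper name `chi_square_2x2`. It relies on the fact that, for one degree of freedom, the upper-tail probability equals `erfc(sqrt(stat / 2))`.

```python
from math import erfc, sqrt

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value

# Cell counts from the antibiotic table: 22 and 44 in the test row,
# 28 and 24 in the control row.
stat, p_value = chi_square_2x2(22, 44, 28, 24)
print(f"chi-square = {stat:.3f}, p = {p_value:.4f}")
```

Under these assumptions the p-value should agree, to rounding, with the chi-square p-value reported for the table.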
12
McNemar's test
  • Compares response rates in binary data between
    two related populations. It is analogous to the
    chi-square test or Fisher's exact test for
    independent populations.

Before \ After    Responders   Non-responders
Responders        A            B
Non-responders    C            D
13
Case Study: Bilirubin
  • A study was conducted to evaluate the toxic
    side effects of an experimental therapy. Patients
    (n = 86) were treated with the experimental drug
    for 3 months. The clinical lab measured the
    bilirubin level of each patient at baseline and
    3 months after therapy.

14
(No Transcript)
15
Results of McNemar's Test
Before \ After     Normal   Abnormally high
Normal             60       14
Abnormally high    6        6
  • At baseline, 14% (12/86) of patients had an
    abnormally high bilirubin level.
  • At 3 months post treatment, 23% (20/86) of
    patients had an abnormally high bilirubin level.
  • P-value = 0.1175
  • Odds ratio = 2.3; 95% CI: 0.8 to 7.4
  • There is not enough evidence to conclude that
    treatment increases the risk of high bilirubin.
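
McNemar's test uses only the discordant pairs, that is, the patients whose status changed between baseline and 3 months. The sketch below assumes the continuity-corrected chi-square form of the test; the slide does not state which variant was used, and the helper name is hypothetical.

```python
from math import erfc, sqrt

def mcnemar_corrected(b, c):
    """McNemar chi-square with continuity correction from the two discordant counts."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    p_value = erfc(sqrt(stat / 2))  # upper tail of chi-square with 1 df
    return stat, p_value

# Discordant cells from the bilirubin table:
# normal -> abnormally high = 14, abnormally high -> normal = 6.
stat, p_value = mcnemar_corrected(14, 6)
odds_ratio = 14 / 6  # ratio of the two discordant counts
print(f"chi-square = {stat:.2f}, p = {p_value:.4f}, odds ratio = {odds_ratio:.1f}")
```

The 60 patients who stayed normal and the 6 who stayed abnormally high do not enter the statistic, because they carry no information about a change.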

16
Cochran-Mantel-Haenszel (CMH) Test
  • The Cochran-Mantel-Haenszel test is a method to
    compare the probability of an event among
    independent groups in stratified samples.
  • The stratification factor can be study center,
    gender, race, age group, obesity status or
    disease severity. These underlying
    sub-populations can be confounding factors that
    affect the associations between risk factors and
    the outcome variables.

17
Case Study: Diabetic Ulcers
  • A multi-center study with 4 centers is testing an
    experimental treatment, Dermotel, used to
    accelerate the healing of dermal foot ulcers in
    diabetic patients. Sodium hyaluronate was used
    in a control group. Patients who showed a
    decrease in ulcer size of at least 90%, by
    surface area measurement, after 20 weeks of
    treatment were considered responders. The numbers
    of responders in each group are shown in Table
    19.2 for each study center. Is there an overall
    difference in response rates between the Dermotel
    and control groups?

18
Response Frequencies by Study Center
Study Center   Treatment Group   Response    Non-Response
1              Dermotel          26 (87%)    4
1              Control           18 (62%)    11
2              Dermotel          8 (73%)     3
2              Control           7 (58%)     5
3              Dermotel          7 (58%)     5
3              Control           4 (40%)     6
4              Dermotel          11 (65%)    6
4              Control           9 (64%)     5
19
  • The interest in this study is in comparing the
    response rates of the two treatments. Because the
    study was conducted in four centers, there is
    concern that study center may influence the
    response rate. By including study center in the
    analysis, the researcher can examine the
    association between treatment and response while
    adjusting (controlling) for the effect of study
    center.
  • The Cochran-Mantel-Haenszel test assumes a common
    odds ratio and tests the null hypothesis that the
    explanatory variable X (treatment) and the
    outcome variable Y (response) are conditionally
    independent, given the control variable Z (study
    center). In other words, CMH tests whether the
    response is conditionally independent of the
    explanatory variable when adjusting for the
    control variable.
  • One can also measure the average conditional
    association between the explanatory variable
    (treatment) and the response variable by
    calculating the common odds ratio conditioned on
    the control variable (study center).

20
Results
               Response Rate (n)
Study Center   Active        Control       Chi-Square   p-Value
1              86.7% (30)    62.1% (29)    4.706        0.030
2              72.7% (11)    58.3% (12)    0.524        0.469
3              58.3% (12)    40.0% (10)    0.733        0.392
4              64.7% (17)    64.3% (14)    0.001        0.981
Overall        74.3% (70)    58.5% (65)    4.039*       0.044
* P-value from CMH test
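
The overall CMH statistic can be sketched from the per-center counts. This is a minimal sketch that assumes the version of the statistic without continuity correction; the function name `cmh_test` and the tuple layout are illustrative choices, not taken from the lecture.

```python
from math import erfc, sqrt

def cmh_test(strata):
    """Cochran-Mantel-Haenszel chi-square (no continuity correction).

    Each stratum is (a, b, c, d): responders and non-responders in the
    active group, then responders and non-responders in the control group.
    """
    diff = 0.0      # sum over strata of observed minus expected active responders
    variance = 0.0  # sum over strata of the hypergeometric variance
    for a, b, c, d in strata:
        n = a + b + c + d
        row1, row2, col1, col2 = a + b, c + d, a + c, b + d
        diff += a - row1 * col1 / n
        variance += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    stat = diff ** 2 / variance
    return stat, erfc(sqrt(stat / 2))  # p-value for 1 degree of freedom

# Counts from the response-frequency table, one tuple per study center.
centers = [(26, 4, 18, 11), (8, 3, 7, 5), (7, 5, 4, 6), (11, 6, 9, 5)]
stat, p_value = cmh_test(centers)
print(f"CMH chi-square = {stat:.3f}, p = {p_value:.3f}")
```

Each center contributes its own expected count and variance, so the comparison of treatments is always made within a center and never across centers.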
21
Chi-Square Test, Ignoring Strata
Group     Response      Non-Response   Total
Active    52 (74.3%)    18             70
Control   38 (58.5%)    27             65
Total     90 (66.7%)    45             135
Chi-square value = 3.798, p = 0.051
22
Counter-Intuitive Combined Results
Stratum    Group   Responders   Non-Responders   Total   Response Rate
1          A       10           38               48      21%
1          B       4            21               25      16%
2          A       20           10               30      67%
2          B       27           17               44      61%
Combined   A       30           48               78      38%
Combined   B       31           38               69      45%
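
The reversal in the table can be reproduced with a few lines of arithmetic. The sketch below recomputes the response rates within each stratum and after pooling; the dictionary layout is just one convenient way to hold the counts.

```python
# Responders and totals from the combined-results table, per stratum and group.
counts = {
    1: {"A": (10, 48), "B": (4, 25)},
    2: {"A": (20, 30), "B": (27, 44)},
}

def rate(responders, total):
    return responders / total

for stratum, groups in counts.items():
    a, b = rate(*groups["A"]), rate(*groups["B"])
    print(f"stratum {stratum}: A = {a:.0%}, B = {b:.0%}")

# Pool the strata by adding responders and totals within each group.
pooled = {
    g: (sum(counts[s][g][0] for s in counts), sum(counts[s][g][1] for s in counts))
    for g in ("A", "B")
}
pooled_a, pooled_b = rate(*pooled["A"]), rate(*pooled["B"])
print(f"combined: A = {pooled_a:.0%}, B = {pooled_b:.0%}")
```

Group A has the higher rate inside each stratum but the lower rate after pooling, because most of group A's patients come from stratum 1, where response rates are low for both groups.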
23
Logistic Regression
24
Logistic Regression
  • Logistic regression is a method for identifying
    associations between a categorical outcome
    variable and explanatory variables.
  • In most cases, the outcome variable is
    dichotomous. The explanatory variables can be
    categorical or continuous. The probability of the
    outcome can be predicted from the values of the
    explanatory variables.
  • Dichotomous outcome variable ~ explanatory
    variables
  • $\log\left(\frac{P}{1-P}\right) = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$

25
Odds Ratio
  • Let Y be the dichotomous variable, where Y = 1
    indicates an event and Y = 0 indicates no event.
  • Odds = probability of an event / probability of
    no event
  • $\text{Odds} = \frac{P(Y=1)}{P(Y=0)} = \frac{P(Y=1)}{1 - P(Y=1)}$
  • Odds ratio = odds in the test group / odds in the
    control group
  • Logistic model: log(odds of an event) ~
    explanatory variables
  • $\log(\text{odds}) = a + b_1 x_1 + b_2 x_2 + b_3 x_3 + b_4 x_4$

26
Case Study: CHF Incidence
  • Odds of CHF in the ARA group =
    (2/35)/(33/35) = 2/33 = 0.06 (6%).
  • Odds of CHF in the placebo group =
    (5/25)/(20/25) = 5/20 = 0.25 (25%).
  • Odds ratio = odds in the ARA group / odds in the
    placebo group = (2/33)/(5/20) = 0.24
  • The odds of CHF in the ARA group are only 24% of
    the odds in the placebo group.

27
Properties of the Odds Ratio
  • The odds ratio is non-negative.
  • If the odds ratio < 1, the risk is smaller than
    in the control group.
  • If the odds ratio > 1, the risk is larger than
    in the control group.
  • Odds ratio of no event = 1 / odds ratio of an
    event.
  • One can calculate the confidence interval of an
    odds ratio. The confidence interval of a
    significant odds ratio does not contain 1.
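
The odds ratio and an approximate confidence interval can be sketched from the four cell counts. The Woolf (log-scale) interval used below is an assumption, since the slides do not say how the interval should be computed, and the function name is hypothetical. The counts are those of the CHF case study (ARA: 2 CHF, 33 no CHF; placebo: 5 CHF, 20 no CHF).

```python
from math import exp, log, sqrt

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio for [[a, b], [c, d]] with a Woolf (log-scale) confidence interval."""
    or_ = (a / b) / (c / d)
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lower = exp(log(or_) - z * se_log)
    upper = exp(log(or_) + z * se_log)
    return or_, lower, upper

# CHF case study counts: ARA 2 CHF / 33 no CHF, placebo 5 CHF / 20 no CHF.
or_ara, lower, upper = odds_ratio_ci(2, 33, 5, 20)
or_placebo, _, _ = odds_ratio_ci(5, 20, 2, 33)  # groups swapped
print(f"ARA vs placebo: OR = {or_ara:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
print(f"placebo vs ARA: OR = {or_placebo:.2f}")
```

Swapping the two groups yields the reciprocal odds ratio. The log-scale interval is a large-sample approximation, so with cell counts this small an exact interval may differ noticeably.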

28
Case Study: CHF Incidence
  • Odds ratio for ARA versus control =
    (2/33)/(5/20) = 0.24 < 1, so the risk of CHF in
    the ARA group is relatively smaller.
  • One can also calculate the odds ratio for control
    versus ARA as 1/0.24 = 4.1 > 1, which indicates
    that the odds of CHF in the placebo group are 4.1
    times the odds in the ARA group.

29
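Logistic Probability Curve
  • $\log\left(\frac{p}{1-p}\right) = a + bx$
  • $\frac{p}{1-p} = \exp(a + bx)$
  • $p = \frac{1}{1 + \exp(-a - bx)}$

[Figure: logistic probability curve, probability (vertical axis) against X (horizontal axis)]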
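
The three forms of the model above are algebraically equivalent, which a short numerical sketch can confirm. The coefficients below are hypothetical and chosen only to trace the S-shaped curve; the function names are illustrative.

```python
from math import exp, log

def logistic_probability(a, b, x):
    """p = 1 / (1 + exp(-a - b*x)), the inverse of log(p / (1 - p)) = a + b*x."""
    return 1 / (1 + exp(-a - b * x))

def log_odds(p):
    return log(p / (1 - p))

# Hypothetical coefficients, chosen only to illustrate the shape of the curve.
a, b = -2.0, 0.5
for x in (0, 2, 4, 6, 8):
    print(f"x = {x}: p = {logistic_probability(a, b, x):.3f}")

# A one-unit increase in x adds b to the log odds, i.e. multiplies the odds by exp(b).
step = log_odds(logistic_probability(a, b, 3)) - log_odds(logistic_probability(a, b, 2))
print(f"change in log odds per unit of x = {step:.3f}")
```

The probability stays between 0 and 1 for any x, while the log odds remain linear in x. This is why the slope b is interpreted on the log-odds scale.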
30
Logistic Regression vs. Linear Regression
Common: In regression we are looking for the
dependence of one variable, the dependent
variable, on others, the independent variable(s).
  • In linear regression the dependent variable is
    continuous.
  • The relationship is summarized by a regression
    equation consisting of a slope and an intercept.
    The slope represents the amount by which the
    dependent variable increases with a unit increase
    in the independent variable, and the intercept
    represents the value of the dependent variable
    when the independent variable takes the value
    zero.
  • In logistic regression the dependent variable is
    binary.
  • In logistic regression the slope represents the
    change in log odds for a unit increase in the
    independent variable.
  • In multiple regression we are interested in the
    simultaneous relationship between one dependent
    variable and a number of independent variables.

31
Case Study: Relapse Rate in AML
  • One hundred and two patients with acute
    myelogenous leukemia (AML) in remission were
    enrolled in a study of a new antisense
    oligonucleotide (asODN). The patients were
    randomly assigned to receive a 10-day infusion of
    asODN or no treatment (Control), and the effects
    were followed for 90 days. The time in remission
    since diagnosis or prior relapse (X, in months)
    at study enrollment was considered an important
    covariate in predicting response. The response
    data are shown on the next slide, with Y = 1
    indicating relapse, death, or major intervention,
    such as bone marrow transplant, before Day 90. Is
    there any evidence that administration of asODN
    is associated with a decreased relapse rate?

32
(No Transcript)
33
(No Transcript)
34
(No Transcript)
35
Effect Selection Methods
  • Statistical model selection facilitates the
    selection and screening of explanatory variables
    from a set of candidate variables. Commonly used
    model selection methods include:
  • Backward selection: start with all candidate
    variables, test them one by one for statistical
    significance, and delete any that are not
    significant.
  • Forward selection: start with no variables in
    the model, try the variables one by one, and
    include them if they are statistically
    significant.
  • Stepwise selection: a combination of both
    methods. Select the most significant variable
    from the candidate pool, and remove it if it is
    not significant in the joint model. Repeat this
    process step by step for all remaining variables.
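
The forward selection procedure described above can be sketched as a greedy loop. This is an illustration of the control flow only: `p_value_of` is a hypothetical caller-supplied function standing in for a real model fit, the 0.05 entry threshold is an assumption, and the toy p-values are made up.

```python
def forward_selection(candidates, p_value_of, alpha=0.05):
    """Greedy forward selection.

    p_value_of(selected, candidate) is a caller-supplied (hypothetical)
    function returning the p-value of `candidate` when it is added to a
    model that already contains the variables in `selected`.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        # Score every remaining variable against the current model.
        scored = [(p_value_of(selected, v), v) for v in remaining]
        best_p, best_var = min(scored)
        if best_p >= alpha:
            break  # nothing left meets the entry criterion
        selected.append(best_var)
        remaining.remove(best_var)
    return selected

# Toy p-values standing in for real model fits (made up for illustration).
toy_p = {"age": 0.001, "treatment": 0.020, "center": 0.400}
chosen = forward_selection(toy_p, lambda selected, v: toy_p[v])
print(chosen)
```

In a real analysis the p-value of a candidate depends on which variables are already in the model, which is why the scoring function receives the current selection. Backward and stepwise selection follow the same pattern with a removal step.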

36
Flow Chart of Forward Selection
37
Multicollinearity
  • Multicollinearity occurs when two or more
    explanatory variables in a multiple regression
    model are highly correlated. In other words,
    there are redundant explanatory variables in the
    multiple regression model.
  • Multicollinearity can make the estimates of
    individual effects unreliable. A high degree of
    multicollinearity can also cause computer
    software packages to be unable to perform the
    matrix inversion that is required for computing
    the regression coefficients, or it may make the
    results of that inversion inaccurate.
  • Note that in statements of the assumptions
    underlying regression analyses such as ordinary
    least squares, the phrase "no multicollinearity"
    is sometimes used to mean the absence of perfect
    multicollinearity, which is an exact
    (non-stochastic) linear relation among the
    regressors.

38
Detection of Multicollinearity
  • Large changes in the estimated regression
    coefficients when a predictor variable is added
    or deleted
  • Tests of the individual effects of affected
    variables are not significant, but a global test
    of the overall model is significant (using an
    F-test).
  • Use the variance inflation factor (VIF) to
    detect multicollinearity: regress an explanatory
    variable on all the other explanatory variables.
    A high coefficient of determination, $r^2$,
    indicates that the regressed explanatory variable
    is highly correlated with the other explanatory
    variables. Tolerance = $1 - r^2$ and VIF =
    1/tolerance. A tolerance of less than 0.20 or
    0.10, or a VIF of 5 or 10 and above, indicates a
    multicollinearity problem.
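
The tolerance and VIF arithmetic can be sketched for the simplest case of two explanatory variables, where the r² from regressing one on the other is just the squared Pearson correlation. With more than two variables a full regression would be needed to obtain r². The data and function names below are made up for illustration.

```python
def pearson_r(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

def tolerance_and_vif(r_squared):
    """Tolerance = 1 - r^2 and VIF = 1 / tolerance."""
    tolerance = 1 - r_squared
    return tolerance, 1 / tolerance

# With exactly two explanatory variables, the r^2 from regressing one on the
# other is the squared Pearson correlation. Values below are made up.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3]
tolerance, vif = tolerance_and_vif(pearson_r(x1, x2) ** 2)
print(f"tolerance = {tolerance:.3f}, VIF = {vif:.1f}")
```

The two thresholds quoted above are the same rule on two scales: a tolerance of 0.20 corresponds to a VIF of 5, and a tolerance of 0.10 to a VIF of 10.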

39
Caveats
  • Sometimes logistic regression is carried out
    after a dependent variable has been dichotomized.
    It is important that the cut point is not derived
    by direct examination of the data, for example by
    finding a gap in the data that maximizes the
    discrimination between the selected groups, as
    this can lead to biased results. It is best if
    there are a priori grounds for choosing a
    particular cut point.

40
References
  • Common Statistical Methods for Clinical Research,
    2nd Edition, by Glenn Walker
  • Logistic Regression Using the SAS System, by Paul
    Allison
  • Medical Statistics, by Campbell et al.