Stats 330: Lecture 25 - PowerPoint PPT Presentation

About This Presentation
Title:

Stats 330: Lecture 25

Description:

The study aimed to evaluate two different treatments for drug addiction. ... List of variables. Variable Description Codes/Values Name. Identification Code 1-575 ID ... – PowerPoint PPT presentation

Number of Views:56
Avg rating:3.0/5.0
Slides: 45
Provided by: T10877
Category:

less

Transcript and Presenter's Notes

Title: Stats 330: Lecture 25


1
Stats 330 Lecture 25
Multiple Logistic Regression Prediction and a
case study
2
Plan of the day
  • In todays lecture we discuss prediction and
    present a logistic regression case study.
  • Topics covered will be
  • Prediction in logistic regression
  • In-sample and out-of-sample error rates
  • Cross-validation and bootstrap estimates of error
    rates
  • Sensitivity and specificity
  • ROC curves
  • Then, a case study

3
Housekeeping
  • Error in slide 34 in lecture 23
  • Function is now influenceplots
  • Bug in ROC.curve download replacement from web
    page

4
Prediction
  • Suppose we have fitted a logistic model and we
    want to use the model to predict new cases. If a
    new case presents with explanatory variables x,
    how do we predict the y-value, 0 or 1?
  • Work out the estimated log-odds for the case
  • Work out probability Prob exp(log-odds)/(1exp(
    log.odds))
  • Predict
  • Y1 if prob gt 0.5 (equivalently log.odds gt0)
  • Y0 if prob lt 0.5 (equivalently log.odds lt0)

5
Estimating the prediction error
  • Prediction error is the probability of a wrong
    classification ( 0s predicted as 1s, 1s
    predicted as 0s)
  • As in linear regression, using the training data
    to estimate these proportions tends to give an
    optimistic estimate
  • We can use cross-validation or the bootstrap to
    improve the estimate see the case study

6
Sensitivity and specificity
  • Sensitivity probability of predicting a 1 when
    the case is truly a 1 the true positive rate
  • Specificity probability of predicting a 0 when
    the case is truly a 0 the true negative rate
    (1-specificity is called the false positive
    rate)
  • Ideally, want both to be close to 1
  • We would like to know what these would be for new
    data use cross-validation and the bootstrap as
    for normal regression

7
Calculating sensitivity and specificity
Model predicts Model predicts
Failure (0) Success (1)
Actual Failure (0) 100 200
Success ( 1) 250 600
Specificity 100/(100200) 33 Sensitivity
600/(600250) 70 In-sample error rate
(200250)/1150
8
ROC curves
  • We have predicted a success (Y1) if the
    log-odds are positive.
  • We can generalize this to predict a success if
    log-odds gtc for some constant c
  • If c is large and ve, almost every case will be
    predicted as a success (1)
  • Sensitivity close to 1, specificity close to 0
  • If c is large and ve, almost every case will be
    predicted as a failure (0)
  • Sensitivity close to 0, specificity close to 1
  • Allows a trade-off between sensitivity and
    specificity
  • As c varies, the sensitivity and specificity
    change.
  • ROC curve is a plot of the points (1-specificity,
    sensitivity)
  • as c changes.

9
(No Transcript)
10
ROC curves - cont
Perfect prediction
Worst case prediction
True positive rate
True positive rate
False positive rate
Predictor no help
False positive rate
True positive rate
False positive rate
11
Area under the curve
  • For a perfect predictor, the area under the ROC
    curve (AUC) is 1.
  • If the predictor is independent of the response,
    the sensitivity and specificity are both 0.5.
  • AUC curve serves as a measure of how good the
    model is at predicting.

12
Case study
  • The data comes from the University of
    Massachusetts AIDS Research Unit IMPACT study, a
    medical study performed in the US in the early
    90s.
  • The study aimed to evaluate two different
    treatments for drug addiction.
  • Reference Hosmer and Lemeshow, Applied Logistic
    Regression (2nd Ed), p28

13
List of variables
  • Variable Description Codes/Values Name
  • Identification Code 1-575 ID
  • Age at Enrollment Years AGE
  • Beck Depression Score 0-54 BECK
  • IV Drug Use History 1 Never, IVHX
  • at Admission 2 Previous, 3 Recent
  • No of prior Treatments 0-40 NDRUGTX
  • Subject's Race 0 White , RACE
  • 1
    Other
  • Treatment Duration 0short, 1 Long TREAT
  • Treatment Site 0 A, 1 B SITE
  • Remained Drug Free 1 Yes, 0 No DFREE

14
The variables
  • The response DFREE is binary records if subject
    is drug-free after conclusion of treatment.
  • There is a mix of categorical and continuous
    explanatory variables
  • Categorical IVHX, RACE, TREAT, SITE
  • Continuous AGE, BECK, NDRUGTX.

15
Questions
  • Is the longer treatment more effective?
  • Did Site A deliver the program more effectively
    than site B?
  • What other variables have an effect on successful
    rehabilitation of addicts?
  • Can we predict who is likely to be drug-free in
    12 months?

16
Analysis strategy
  • Preliminary plots, tables
  • Variable selection
  • Model fitting
  • Interpretation of coefficients
  • Evaluation as a predictor of recovery from
    addiction.

17
Preliminary plots
18
Preliminary plots (2)
19
Preliminary plots (3)
  • Seems like number of previous drug treatments
    have an effect
  • Seems like factors IVHX (Recent IV drug use),
    SITE (Site A or Site B) and TREAT (short or long
    treatment) have an effect

20
Preliminary fits (1)
  • Call
  • glm(formula DFREE . - IVHX - ID
    factor(IVHX), family binomial,
  • data drug.df)
  • Deviance Residuals
  • Min 1Q Median 3Q Max
  • -1.3465 -0.8091 -0.6326 1.1834 2.4231

Dont include ID!
21
Preliminary fits (2)
  • Coefficients
  • Estimate Std. Error z value
    Pr(gtz)
  • (Intercept) -2.4111283 0.5983427 -4.030
    5.59e-05
  • AGE 0.0504143 0.0174057 2.896
    0.00377
  • BECK 0.0002759 0.0107982 0.026
    0.97961
  • NDRUGTX -0.0615329 0.0256441 -2.399
    0.01642
  • RACE 0.2260262 0.2233685 1.012
    0.31159
  • TREAT 0.4424802 0.1992922 2.220
    0.02640
  • SITE 0.1489209 0.2176062 0.684
    0.49375
  • factor(IVHX)2 -0.6036962 0.2875974 -2.099
    0.03581
  • factor(IVHX)3 -0.7336591 0.2549893 -2.877
    0.00401
  • (Dispersion parameter for binomial family taken
    to be 1)
  • Null deviance 653.73 on 574 degrees of freedom
  • Residual deviance 619.25 on 566 degrees of
    freedom
  • AIC 637.25
  • Number of Fisher Scoring iterations 4

22
Preliminary conclusions
  • Important variables seem to be AGE, NDRUGTX,
    TREAT, IVHX
  • Data are ungrouped, cant assess goodness of fit
    with residual deviance
  • No extremely large residuals

23
Hosmer-Lemeshow test
  • gt HLstat(drug.glm)
  • Value of HL statistic 5.05
  • P-value 0.752

No evidence of a bad fit using this test
24
Variable selection (1)
  • gt anova(drug.glm, test"Chisq")
  • Analysis of Deviance Table
  • Model binomial, link logit
  • Response DFREE
  • Terms added sequentially (first to last)
  • Df Deviance Resid. Df Resid. Dev
    P(gtChi)
  • NULL 574 653.73
  • AGE 1 1.40 573 652.33
    0.24
  • BECK 1 0.57 572 651.76
    0.45
  • NDRUGTX 1 14.07 571 637.69
    0.0001760
  • RACE 1 3.06 570 634.63
    0.08
  • TREAT 1 4.96 569 629.67
    0.03
  • SITE 1 1.07 568 628.60
    0.30
  • factor(IVHX) 2 9.35 566 619.25
    0.01

25
Variable selection (2)
  • Step AIC 632.59
  • DFREE NDRUGTX IVHX AGE TREAT
  • Call glm(formula DFREE NDRUGTX IVHX AGE
    TREAT,
  • family binomial, data drug.df)
  • Degrees of Freedom 574 Total (i.e. Null) 569
    Residual
  • Null Deviance 653.7
  • Residual Deviance 620.6 AIC 632.6

26
Sub-model
  • gt sub.glmlt-glm(DFREE NDRUGTX factor(IVHX)
    AGE TREAT, familybinomial, datadrug.df)
  • gt summary(sub.glm)
  • Call
  • glm(formula DFREE NDRUGTX factor(IVHX)
    AGE TREAT, family binomial, data drug.df)
  • Deviance Residuals
  • Min 1Q Median 3Q Max
  • -1.2598 -0.8051 -0.6284 1.1401 2.4574

27
Sub-model (ii)
All variables significant, but use caution
  • Coefficients
  • Estimate Std. Error z value
    Pr(gtz)
  • (Intercept) -2.33276 0.54838 -4.254
    2.1e-05
  • NDRUGTX -0.06376 0.02563 -2.488
    0.012844
  • factor(IVHX)2 -0.62366 0.28470 -2.191
    0.028484
  • factor(IVHX)3 -0.80561 0.24453 -3.294
    0.000986
  • AGE 0.05259 0.01721 3.056
    0.002244
  • TREAT 0.45134 0.19860 2.273
    0.023048
  • (Dispersion parameter for binomial family taken
    to be 1)
  • Null deviance 653.73 on 574 degrees of freedom
  • Residual deviance 620.59 on 569 degrees of
    freedom
  • AIC 632.59
  • Number of Fisher Scoring iterations 4

28
Do we need interaction terms?
  • gt sub.glmlt-glm(DFREE NDRUGTX IVHX AGE
    TREAT , familybinomial, datadrug.df)
  • gt sub2.glmlt-glm(DFREE NDRUGTXIVHX AGEIVHX
    AGETREAT NDRUGTXTREAT , familybinomial,
    datadrug.df)
  • gt
  • gt anova(sub.glm, sub2.glm, test"Chisq")
  • Analysis of Deviance Table
  • Model 1 DFREE NDRUGTX IVHX AGE TREAT
  • Model 2 DFREE NDRUGTX IVHX AGE IVHX
    AGE TREAT NDRUGTX TREAT
  • Resid. Df Resid. Dev Df Deviance P(gtChi)
  • 1 569 620.59
  • 2 563 616.16 6 4.42 0.62

Model with interactions
Big p-value so interactions not required
29
Do we need to transform?
par(mfrowc(1,2)) sub.gamlt-gam(DFREE s(NDRUGTX)
factor(IVHX) s(AGE) TREAT ,
familybinomial, datadrug.df) plot(sub.gam)
30
Transforming
  • Suggests a possible quadratic in NDRUGTX

gt subq.glmlt-glm(DFREE poly(NDRUGTX,2) IVHX
AGE TREAT, familybinomial, datadrug.df) gt
summary(subq.glm) Coefficients
Estimate Std. Error z value Pr(gtz)
(Intercept) -2.70763 0.56715 -4.774
1.80e-06 poly(NDRUGTX, 2)1 -7.24501
2.93274 -2.470 0.01350 poly(NDRUGTX, 2)2
4.21319 2.69624 1.563 0.11814 IVHX2
-0.59018 0.28608 -2.063 0.03912
IVHX3 -0.76015 0.24725 -3.074
0.00211 AGE 0.05458 0.01730
3.154 0.00161 TREAT 0.44379
0.19904 2.230 0.02577
But term is not significant, so we stick with no
transformation
31
Diagnostics
Pt 85
7, 471
7, 471
32
Influence of 7, 471, 85
Effect on coefficients of removing cases
  • None 7 85 471
    All
  • (Intercept) -2.333 -2.295 -2.222 -2.447 -2.293
  • NDRUGTX -0.064 -0.084 -0.065 -0.075 -0.100
  • IVHX2 -0.624 -0.595 -0.635 -0.680 -0.662
  • IVHX3 -0.806 -0.783 -0.785 -0.795 -0.747
  • AGE 0.053 0.053 0.049 0.057 0.054
  • TREAT 0.451 0.434 0.441 0.479 0.450

None seem particularly influential! We will not
delete them
33
Over-dispersion
  • qsub.glmlt-glm(DFREE NDRUGTX IVHX AGE
    TREAT, familyquasibinomial, datadrug.df)
  • gt summary(qsub.glm)
  • Coefficients
  • Estimate Std. Error t value Pr(gtt)
  • (Intercept) -2.33276 0.55435 -4.208 2.99e-05
  • NDRUGTX -0.06376 0.02591 -2.461 0.01414
  • IVHX2 -0.62366 0.28780 -2.167 0.03065
  • IVHX3 -0.80561 0.24720 -3.259 0.00118
  • AGE 0.05259 0.01740 3.023 0.00262
  • TREAT 0.45134 0.20076 2.248 0.02495
  • ---
  • (Dispersion parameter for quasibinomial family
    taken to be 1.021892)

Very close to 1 so no overdispersion
34
Interpretation
  • Coefficients
  • Estimate Std. Error z value Pr(gtz)
  • (Intercept) -2.33276 0.54838 -4.254 2.1e-05
  • NDRUGTX -0.06376 0.02563 -2.488 0.012844
  • IVHX2 -0.62366 0.28470 -2.191 0.028484
  • IVHX3 -0.80561 0.24453 -3.294 0.000986
  • AGE 0.05259 0.01721 3.056 0.002244
  • TREAT 0.45134 0.19860 2.273 0.023048
  • As the number of prior treatments goes up, the
    probability of a drug-free recovery goes down
  • The probability of a drug-free recovery for
    persons with no IV drug use is more than for
    persons with previous IV drug use
  • The probability of a drug-free recovery for
    persons with previous IV drug use is more than
    for persons with recent IV drug use.
  • The probability of a drug-free recovery goes up
    with age
  • The probability of a drug-free recovery is higher
    for the long treatment

35
Interpreting p-values after model selection
  • We have seen that this is not valid, as model
    selection changes the distribution of the
    estimated coefficients.
  • We can use the bootstrap to examine the revised
    distribution
  • Leave TREAT in the model, use forward selection
    to select the other variables.

36
Procedure
  • Draw a bootstrap sample
  • Do forward selection, record value of regression
    coef for TREAT (forced to be in every model)
  • Repeat 200 times, draw histogram of the results

37
R code
  • bootstrap sample
  • n dim(drug.df)1
  • B200
  • beta.boot numeric(B)
  • for(b in 1B)
  • ni rmultinom(1, n ,probrep(1/n,n))
  • newdata drug.dfrep(1n,ni),
  • drug.boot.glm glm(DFREE NDRUGTX
    factor(IVHX) AGE BECK TREAT RACE SITE,
  • familybinomial, data newdata)
  • chosen step(drug.boot.glm, list(lower DFREE
    TREAT, upper formula(drug.boot.glm)),
  • direction forward", trace0)
  • k match("TREAT",names(coef(chosen)))
  • beta.bootb summary(chosen)coefficientsk,1

38
Histogram
gt mean(beta.boot) 1 0.4540209 gt
sd(beta.boot) 1 0.2019222 gt z.val
mean(beta.boot)/ sd(beta.boot) gt
2(1-pnorm(z.val)) 1 0.02454468 Compare Beta
0.45134 SE 0.19860 P-value 0.023048
39
Prediction
  • Sensitivity chance the model predicts a
    successful recovery (drug-free at end of
    program), when one will actually occur
  • Specificity chance the model predicts a failure
    (return to drug use before end of program), when
    one actually will occur

40
R code
  • sub.glmlt-glm(DFREE NDRUGTX IVHX AGE TREAT
    , familybinomial, datadrug.df)
  • gt pred predict(sub.glm, type"response")
  • gt predcode ifelse(predlt0.5, 0,1)
  • gt table(drug.dfDFREE,predcode)
  • predicted
  • 0 1
  • Actual 0 426 2
  • 1 144 3
  • Sensitivity 3/147 0.02040816
  • Specificity 426/428 0.9953271
  • Error rate 146/575 0.2539130
  • Proportion correctly classified 429/575
    0.746087

41
ROC curve
ROC.curve(DFREE NDRUGTX factor(IVHX) AGE
TREAT, data drug.df) in the R330 package
42
Prediction (2)
  • Use 10-fold cross-validation
  • Split data into 10 parts
  • Calculate sensitivity and specificity for each
    part, using model fitted to the remaining parts
  • Average results
  • Repeat for different splits, average repeats
  • E.g. for one part

43
CV and bootstrap Results
  • gt cross.val(DFREE NDRUGTX factor(IVHX) AGE
    TREAT, drug.df)
  • Mean Specificity 0.9908424
  • Mean Sensitivity 0.02854005
  • Mean Correctly classified 0.7446491
  • gt err.boot(DFREE NDRUGTX factor(IVHX) AGE
    TREAT, data drug.df)
  • err
  • 1 0.2539130
  • Err
  • 1 0.2552974

A poor classifier, but this doesnt mean that
the model fits poorly there are very few cases
with fitted probs over 0.5, and many with
fitted probabilities between 0.2 and 0.5. We
expect a moderate number of these to be
misclassified, as some events (being drug free)
with probs 0.2 to 0.5 have occurred.
Bootstrap estimate
Training set estimate
44
Overall conclusions
  • Model seems to fit well
  • Strong evidence that longer treatments are better
  • No apparent difference between sites
  • Age and prior IV drug use affect recovery
  • Model predicts poorly for the covariates in the
    data set effectively always predicts patients
    will not be drug free
Write a Comment
User Comments (0)
About PowerShow.com