Logistic Regression in STATA

About This Presentation

Title:

Logistic Regression in STATA

Description:

The resulting odds ratio will correspond to each one unit increase in the category coding. ... The p-value on the product term can be interpreted as a test of ... – PowerPoint PPT presentation

Number of Views:1574

Avg rating:3.0/5.0

Slides: 42

Provided by: stort

Category:

more less

Transcript and Presenter's Notes

Title: Logistic Regression in STATA

1
Logistic Regression in STATA

The logistic regression programs in STATA use
maximum likelihood estimation to generate the
logit (the logistic regression coefficient, which
corresponds to the natural log of the OR for each
one-unit increase in the level of the regressor
variable).
The resulting ORs are maximum-likelihood
estimates (MLEs) of the uniform effect (OR)
across strata of the model covariates. Thus they
are pooled (uniform, common) estimates of the OR
and in this sense are adjusted for all regressors
included in the model.

2
STATA Logistic Regression Commands

The logistic command in STATA yields odds
ratios.
. logistic sil hpv2 age
Logit estimates
Number of obs 595
LR chi2(2) 155.28
Prob gt chi2 0.0000
Log likelihood -332.23774 Pseudo R2
0.1894
--------------------------------------------------
----------------------------
sil Odds Ratio Std. Err. z
Pgtz 95 Conf. Interval
-------------------------------------------------
----------------------------
hpv2 17.23939 5.10605 9.61
0.000 9.647363 30.80598
age .9947965 .0146842 -0.35
0.724 .9664283 1.023997
--------------------------------------------------
----------------------------

3
STATA Logistic Regression Commands

The logit command in STATA yields the actual
beta coefficients.
logit sil hpv2 age
Logit estimates
Number of obs 595
LR chi2(2) 155.28
Prob gt chi2 0.0000
Log likelihood -332.23774 Pseudo
R2 0.1894
--------------------------------------------------
----------------------------
sil Coef. Std. Err. z
Pgtz 95 Conf. Interval
-------------------------------------------------
----------------------------
hpv2 2.847197 .2961851 9.61
0.000 2.266685 3.427709
age -.0052171 .014761 -0.35
0.724 -.0341482 .0237139
_cons -.2873639 .4075825 -0.71
0.481 -1.086211 .5114832
--------------------------------------------------
----------------------------
Note that 17.23939 exp2.847197 as in the
previous slide

4
STATA Commands for Multilevel Categorical
Variables in Logistic Regression Models

Categorized continuous variables should be
entered in regression models as a series of
indicator variables
for each category a variable is created in which
observations falling in that category are coded
1" and all other observations are coded 0",
thus the variable is represented in the model as
a series of indicator terms, with the reference
category left out of the model.
Any categories of a variable that get left out of
the model become part of the reference group
(because those observations will be coded 0 for
each indicator term left in the model).

5
STATA Commands for Multilevel Categorical
Variables in Logistic Regression Models

If categorized continuous variables are entered
in models as if they were continuous, that is, as
one term rather than a series of indicator
variables, the program will treat the values as a
continuous distribution, with each observation in
a category having the same value. The resulting
odds ratio will correspond to each one unit
increase in the category coding. This will not
produce a meaningful result unless the coding can
be interpreted as linear increments from one
category to another.
STATA has a convenient command that makes it
unnecessary to create the indicator terms for
multilevel categorical variables. The xi
command creates a series of indicator variables
for variables marked i.variablename by
recognizing each value as a category. When used
with the logistic or logit commands, STATA uses
the lowest value as the reference category, which
it drops out of the model. It is necessary to
make sure that the variable coding reflects the
desired categorization and reference level.

xilogistic sil i.partner
i.partner
Logit estimates
Number of obs 580
LR chi2(12) 51.48
Prob gt chi2 0.0000
Log likelihood -374.7638
Pseudo R2 0.0643
--------------------------------------------------
----------------------------
sil Odds Ratio Std. Err.
z Pgtz 95 Conf. Interval
-------------------------------------------------
----------------------------
_Ipartner_2 2.008163 .5687573 2.46
0.014 1.152708 3.498473
_Ipartner_3 2.231293 .6428015 2.79
0.005 1.26864 3.924413
_Ipartner_4 2.40158 .7180631 2.93
0.003 1.336568 4.315221
_Ipartner_5 3.322094 1.050245 3.80
0.000 1.787778 6.173199
_Ipartner_6 2.693878 1.338111 2.00
0.046 1.017575 7.13164
_Ipartner_7 7.836735 6.346424 2.54
0.011 1.602531 38.32338
_Ipartner_9 13.71429 14.85759 2.42
0.016 1.64063 114.6399
_Ipartner_10 7.183673 2.981345 4.75
0.000 3.184811 16.20353
_Ipartner_12 5.877551 6.864814 1.52
0.129 .5956855 57.99303

7
Note constant from logit model ln(odds of
reference category) logit ln(odds)-constant,
constant ln(odds of reference
value ORexp(logit)
8
Interpreting logistic regression model
coefficients for continuous variables

When a logistic regression model contains a
continuous independent variable, interpretation
of the estimated coefficient depends on how it is
entered into the model and the particular units
of the variable
To interpret the coefficient, we assume that the
logit is linear in the variable
The slope coefficient gives the change in the log
odds for an increase of 1 unit in x.

9
Interpreting logistic regression model
coefficients for continuous variables

Sometimes a unit increase may not be meaningful
or considered important
If we are interested in estimating the increased
odds instead for every 5 year increase. We can
use the formula
OR (c)Exp(cß1)
(95 CIexp(cß11.96cSEß1)

10
P-values for Trend

The term trend generally refers to a monotonic,
though not necessarily linear, association
between increasing levels of exposure and the
probability of the outcome.
Although examining effect estimates and
confidence intervals over levels of exposure is
the most informative manner of evaluating a
dose-response trend, it has been conventional to
report p-values for the null hypothesis of no
monotonic association between exposure and
disease, that is, a test for trend.
The p-value corresponding to the coefficient of a
variable entered on an appropriate continuous
scale in a regression model can be interpreted as
a p-value for trend. For statistical hypothesis
testing of trend in stratified analysis, see
Rothman and Greenland.

. logit sil lifepart
Iteration 0 log likelihood -407.27031
Iteration 1 log likelihood -398.61876
Iteration 2 log likelihood -397.39561
Iteration 3 log likelihood -397.33389
Iteration 4 log likelihood -397.33374
Logit estimates
Number of obs 591
LR chi2(1) 19.87
Prob gt chi2 0.0000
Log likelihood -397.33374
Pseudo R2 0.0244
--------------------------------------------------
----------------------------
sil Coef. Std. Err. z
Pgtz 95 Conf. Interval
-------------------------------------------------
----------------------------
lifepart .0493962 .0139827 3.53
0.000 .0219906 .0768017
_cons -.0989464 .1091532 -0.91
0.365 -.3128828 .11499
--------------------------------------------------
----------------------------

12
Interpretation

For every increase in the number of partners, the
risk or odds of SIL increases 5.
If we are interested in estimating the increased
odds instead for every 5 partners. We can use the
formula
OR (5)Exp(5 .0493962)3.43
For every increase of 5 additional partners, the
odds for SIL increases 3.43 times
Be careful.. Validity of statements may be
questionable since additional risk of SIL for 10
partners may be quite different for 5 partners,
but this is an unavoidable dilemma when
continuous variables are modeled linearly in the
logit.

13
Significance testing

Does the model that includes the variable in
question tell more about the outcome variable
than a model that does not include that variable?
In general you are comparing observed values of
the response variable to the predicted values
obtained from models with and without the
variables in question
Log likelihood ratio test is obtained by
multiplying the difference between the two values
by -2
Follows a chi square distribution with the
degrees of freedom the difference between the
number of parameters in the 2 models

14
LR Test
. xilogistic sil i.hpv2 i.hpv2
_Ihpv2_0-1 (naturally coded _Ihpv2_0
omitted) Logit estimates
Number of obs 595
LR
chi2(1) 155.15
Prob gt chi2
0.0000 Log likelihood -332.30027
Pseudo R2 0.1893 -----------------
--------------------------------------------------
----------- sil Odds Ratio Std. Err.
z Pgtz 95 Conf. Interval ---------
-------------------------------------------------
------------------- _Ihpv2_1 17.30026
5.121828 9.63 0.000 9.683894
30.90687 -----------------------------------------
------------------------------------- .
lrtest,saving (0)
15
LR Test

. xilogistic sil i.hpv2 income
i.hpv2 _Ihpv2_0-1 (naturally
coded _Ihpv2_0 omitted)
Logit estimates
Number of obs 592
LR chi2(2) 156.07
Prob gt chi2 0.0000
Log likelihood -329.84288
Pseudo R2 0.1913
--------------------------------------------------
----------------------------
sil Odds Ratio Std. Err. z
Pgtz 95 Conf. Interval
-------------------------------------------------
----------------------------
_Ihpv2_1 17.3712 5.14748 9.63
0.000 9.718504 31.04989
income .9664517 .0766365 -0.43
0.667 .8273373 1.128958
--------------------------------------------------
----------------------------
. lrtest, using (0)
Warning observations differ 595 vs. 592
Logistic likelihood-ratio test
chi2(-1) -4.91
Prob gt chi2 .

16
LR Test adding income to the model

-2(-332.30027 (-329.84288))2.457
P08

17
Wald test

Obtained by comparing the maximum likelihood
estimate of the slope parameter, ß1, to an
estimate of the standard error
W ß1/SE (ß1)
A Z test, the distribution is approximately
standard normal
Gives approximately equal answer to the LR test
in large samples but may be different in smaller
samples
LR test seems to perform better in most situations

18
Variable selection

Once we have obtained a model that contains
essential variables, examine the variables in the
model more closely
Then check for interactions- For example an
interaction between HPV status and race would
indicate that the slope coefficient for HPV would
be different by race/ethnicity

19
Interaction assessment

Can take many forms
When interaction is present, the association
between the risk factor and the outcome variable
differs, or depends in some way on the level of
the covariate
The covariate modifies the effect of the risk
factor
Consider a model with a dichotomous risk factor
and the covariate, age. If the association
between the covariate (age) and the outcome
variable is the same within each level of the
risk factor, then there is no interaction between
the covariate and the risk factor
Graphically, no interaction yields a model with
parallel lines

20
Interaction assessment

Need to be HWF
Decide whether the effect of each variable varies
importantly across race/ethnic categories.
To make this decision, first note the magnitude
of the difference between the ORs across strata.
The tests of homogeneity have low power for
detecting moderate effect-measure modification.
The p-value on this test indicates whether there
is adequate statistical power in the data to
detect a difference, but a high p-value does not
mean there is no effect-measure modification.
In general, the significance level for
heterogeneity worth exploring further should be
set around 0.20-0.25

21
Observe stratum-specific estimates using
stratified analysis.

cc sil hpv2, by (race)
OR 95
Conf. Interval M-H Weight
-------------------------------------------------
-----------------
white 28.88889 8.69751
148.2404 .8790698 (exact)
african- 7.520408 2.891786
21.71961 1.973154 (exact)
hispanic 20.5 6.885033
81.30018 1.073593 (exact)
-------------------------------------------------
-----------------
Crude 17.30026 9.556989
33.34827 (exact)
M-H combined 15.85477 8.815859
28.51382
--------------------------------------------------
-----------------
Test of homogeneity (M-H) chi2(2) 3.80
Prgtchi2 0.1494
Test that combined OR 1
Mantel-Haenszel chi2(1)
123.18
Prgtchi2 0.0000

22
Interaction assessment

Use product terms in a multivariable logistic
regression model in order to identify potential
effect-measure modification (interaction) while
adjusting for confounders.
The p-value on the product term can be
interpreted as a test of homogeneity.
To model a product term for two continuous
variables, a term must be created for the product
of the two variables. The product term is entered
into the model, along with each of the two
variables.
The xi command in STATA creates all of the
required product terms for modeling interaction
if at least one of the two variables is
categorical. With this command the two variables
do not have to be entered separately in the model
because STATA does it for you.
When entering a product term between a
categorical and a continuous variable in a
logistic regression model, we evaluate whether
the entire dose-response of the continuous
variable differs across strata of the categorized
variable.

. xilogistic sil i.hpv2 i.race
Logit estimates
Number of obs 595
LR chi2(3) 166.35
Prob gt chi2 0.0000
Log likelihood -326.70089 Pseudo R2
0.2029
--------------------------------------------------
----------------------------
sil Odds Ratio Std. Err. z
Pgtz 95 Conf. Interval
-------------------------------------------------
----------------------------
_Ihpv2_1 15.86857 4.726791 9.28
0.000 8.850939 28.45026
_Irace_2 .5410994 .1348136 -2.47
0.014 .3320491 .8817628
_Irace_3 .4955735 .1097656 -3.17
0.002 .3210507 .7649667
--------------------------------------------------
----------------------------

. xilogistic sil i.hpv2 i.race i.racehpv2
Logit estimates
Number of obs 595
LR chi2(5) 170.04
Prob gt chi2 0.0000
Log likelihood -324.85603
Pseudo R2 0.2074
--------------------------------------------------
----------------------------
sil Odds Ratio Std. Err.
z Pgtz 95 Conf. Interval
-------------------------------------------------
----------------------------
_Ihpv2_1 28.88889 17.72576 5.48
0.000 8.678548 96.16446
_Irace_2 .6467662 .1711091 -1.65
0.100 .3850814 1.086281
_Irace_3 .515873 .1214782 -2.81
0.005 .3251631 .8184354
_IracXhpv2_2 .2603218 .1996824 -1.75
0.079 .057888 1.170666
_IracXhpv2_3 .7096154 .5830384 -0.42
0.676 .1417926 3.551341
--------------------------------------------------
----------------------------

25
Interaction assessment

Use chunk test for entire collection of
interaction terms
Use LR test comparing main effects model with
fuller model
. lrtest, using (0)
Logistic likelihood-ratio test chi2(-2)
-3.69
Prob
gt chi2 0.15
If the interaction term is not significant then
drop the interaction term

26
Interaction Assessment

If the interaction term is retained in the model,
the estimated ORs for other variables confounded
by race or another modifier should not be
obtained from a model that enters race or
another modifier in suboptimal form for the
purpose of obtaining stratum-specific estimates.
In an analysis that aims to estimate effects of
several variables, we may use several different
models to estimate the effects of interest. In
this case, our goal is not the elaboration of a
final model.

27
Confounding Assessment

Confounding is used to describe a covariate that
is associated both with the outcome variable of
interest and the primary independent variable or
risk factor of interest, but is not an
intermediate variable in the causal pathway
When both variables are present, the relationship
is said to be confounded
Only appropriate when there is no interaction
Decisions regarding whether or not to adjust for
potential confounding variables will depend on a
combined assessment of prior knowledge, observed
associations in the data, and sample size
considerations

28
Confounding Assessment

In practice we look for empirical evidence of
confounding in data obtained from study
populations, however, we must keep in mind that
what we observe in such data may reflect
selection and information bias affecting observed
confounder-disease-exposure associations in a
similar way to how these biases affect observed
exposure-disease associations.
Therefore, it is necessary to rely on prior
knowledge of relevant associations in source
populations. For example, if a variable is a
known confounder but does not appear to be one in
the data, this should create uncertainty
regarding the validity of the data. Adjusting for
this factor may not change the relative risk
point estimate, but it may influence the standard
error for this estimate, thus appropriately
reflecting our uncertainty.
If prior knowledge suggests a variable should not
be a confounder but it appears to be one in the
data, the confounding may have been introduced by
the study methods (eg., as a result of matching
in a case-control design). In this case it would
be appropriate to adjust for this factor, if it
is not an intervening (intermediate) variable.

29
Confounding Assessment

When prior knowledge regarding exposure-covariate
associations is insufficient and the number of
covariates to consider is small, it may be
desirable to adjust for all variables that appear
to be important risk factors for the outcome, as
long as they are not intervening variables.
When a large number of potential confounders must
be considered, the change-in-estimate variable
selection strategy has been shown in simulation
studies to produce more valid results for
confounder detection than strategies that rely on
p-values, unless the significance level for the
p-value is raised to 0.2 or higher. There is more
than one reasonable approach to variable
selection in such situations the important thing
is that the criterion for selection should be
explicit and consistently applied.
When examining continuous variables as potential
confounders, careful assessment of the
dose-response for the purpose of choosing the
optimal scale or categorization should be carried
out prior to the assessment of confounding.
Inappropriate modeling of continuous variables
may lead to incorrect decisions about whether or
not to adjust for these variables and/or to
inadequate control of confounding.

30
Confounding Assessment

Begin with a model with the exposure-disease
relationship For example if we are interested in
the association between number of lifetime
partners and SIL
. xilogistic sil i.part1_4
i.part1_4 _Ipart1_4_0-3 (naturally
coded _Ipart1_4_0 omitted)
Logit estimates
Number of obs 593
LR chi2(3) 47.91
Prob gt chi2 0.0000
Log likelihood -384.71
Pseudo R2 0.0586
--------------------------------------------------
----------------------------
sil Odds Ratio Std. Err. z
Pgtz 95 Conf. Interval
-------------------------------------------------
----------------------------
_Ipart1_4_1 2.113856 .5010233 3.16
0.002 1.328387 3.363768
_Ipart1_4_2 2.793651 .6974642 4.11
0.000 1.712619 4.557047
_Ipart1_4_3 5.120594 1.277563 6.55
0.000 3.140146 8.350084
-------

31
Confounding Assessment

. . xilogistic sil i.curr_smk
i.curr_smk _Icurr_smk_0-1 (naturally
coded _Icurr_smk_0 omitted)
Logit estimates
Number of obs 595
LR chi2(1) 21.26
Prob gt chi2 0.0000
Log likelihood -399.24786
Pseudo R2 0.0259
--------------------------------------------------
----------------------------
sil Odds Ratio Std. Err. z
Pgtz 95 Conf. Interval
-------------------------------------------------
----------------------------
_Icurr_smk_1 2.349306 .4454758 4.50
0.000 1.620073 3.406784
--------------------------------------------------
----------------------------
.

32
Confounding Assessment

. xilogistic sil i.curr_smk i.hpv2
Logit estimates
Number of obs 595
LR chi2(2) 161.58
Prob gt
chi2 0.0000
Log likelihood -329.08559
Pseudo R2 0.1971
--------------------------------------------------
----------------------------
sil Odds Ratio Std. Err.
z Pgtz 95 Conf. Interval
-------------------------------------------------
----------------------------
_Icurr_smk_1 1.724126 .3709956 2.53
0.011 1.130859 2.628631
_Ihpv2_1 15.96255 4.749559 9.31
0.000 8.909072 28.6004
--------------------------------------------------
----------------------------

33
Confounding Assessment

Examine the OR for the main risk factor to
determine whether it is meaningfully different
than the OR when controlling for confounding
For previous examples the ORs are 2.34 vs. 1.72
Can examine using the LR test
May want to include the confounder if others
would not trust the results if did not perform
the adjustment
May want to use a decision rule, depending on the
subject matter (i.e., 10 change in the OR) and
report the criterion
Several have suggested a backward deletion
strategy whereby you enter all main effects and
confounders in the model and you eliminate
one-by-one confounder which makes the smallest
difference in the exposure-effect estimate
Biases resulting from multiple confounders may
cancel each other out or produce results that may
not be easy to disentangle. Consider the
following.

34
Model Checking

When run diagnostics and why?
This depends on what we want the model to do
If our goal is to obtain approximately valid
summary estimates of effect for a few key
relationships, less rigorous checking is required
If our goal is to predict outcomes for a given
set of factors, more detailed checking is required

35
Model Checking

Assuming the model contains those variables that
should be in the model, we want to know how
effectively the model we have describes the data
(goodness of fit)
A good-fitting model is not the same as a correct
model
Regression diagnostics can detect discrepancies
between a model and data only within the range of
the data and only if there are enough
observations for adequate diagnostic power
A model may appear to fit well in the central
range of the data, but produce poor predictions
for covariate values that are not well
represented in the data

36
Model Checking

Computation and evaluation of overall measures
Examination of individual components of the
summary statistics, often graphically
Examination of other measures of the difference
between components of the observed values versus
the predicted values

37
Model Checking

Check model results against results from
stratified analysis
Log-likelihood Ratio Test (also called Deviance
Test)
Tests of Regression
Test the hypothesis that all the regression
coefficients (except the intercept) are zero
The R2 not recommended o use- may give a
distorted impression

38
Model Checking

Tests of fit (see RG, pp. 409-10 for test
details)
Tests for nonrandom incompatibilities between a
model and data
Compares the fit of an index model to a more
elaborate reference model that contains it
Small p-value suggests that the index model fits
poorly relative to the reference model, that is,
that the additional terms in the reference model
improve the fit
Large p-value does not mean that the index model
fits well, only that the test did not detect an
improvement in fit from the additional terms in
the reference model
Pearson residual (examining obs-pred/SE)
Deviance residual
Hosmer-Lemeshow goodness of fit statistic
A grouping-based method based on values of the
estimated probabilities
Classification tables
Table is the result of cross-classifying the
outcome variable with a dichotomous variable
whose values are derived from the estimated
logistic probabilities

39
Multicolinearity

Occurs when one or more independent variables can
be approximately determined by some other
variables in the model
When there is multicolinearity, the estimated
regression coefficients of the fitted model can
be highly unreliable

40
Multiple testing

Occurs from the many tests of significance that
are typically carried out when selecting or
eliminating variables from the model
More likely to obtain statistical significance
even is no real association exists

41
Influential observations

Refers to data on individuals that may have a
large influence on the estimated regression
coefficients
Methods assessing the possibility of influential
observations should be considered

Write a Comment

User Comments (0)