Title: Short Course in Biostatistics 2005, Lesson 6: Introduction to Regression

1 Short Course in Biostatistics 2005, Lesson 6
Introduction to Regression
- Larry Magder
- April 15, 2005
2 Outline
- Brief review of parametric statistical methods
- Two purposes of regression models
- Modeling the effect of a quantitative predictor.
- Estimating the effect of one variable,
controlling for other variables.
3 Review of Parametric Statistical Methods
- Many scientific questions can be viewed as questions about the unknown probability distribution of random variables.
- These distributions can be characterized by a small number of parameters.
- Statistical methods:
  - 1. Summarize what the data say about the possible values of the parameters.
  - 2. Quantify the evidence in the data with respect to hypotheses about the parameters.
4 Regression Models, Purpose 1: Modeling the effect of a quantitative predictor
- Suppose you want to know whether a new drug increases blood pressure. Then you might focus on two parameters:
  - μ1 = mean BP change in those given the drug
  - μ0 = mean BP change in those not given the drug
- The null hypothesis would be H0: μ0 = μ1
- Given data, we can estimate the μs and assess the evidence with respect to the null hypothesis.
5 Modeling the effect of a quantitative predictor
- What if your predictor is a quantitative variable?
- E.g., we might be interested in the effect of different doses of a drug.
- For simplicity, assume there are three doses.
- One approach: assume
  - μ1 = mean BP change in those given dose 1
  - μ2 = mean BP change in those given dose 2
  - μ3 = mean BP change in those given dose 3
6 Modeling the effect of a quantitative predictor
- Alternatively, we might assume
  - μ(x) = β0 + β1x, where x is the dose
- Note that this expression implies that
  - μ0 = β0
  - μ1 = β0 + β1
  - μ2 = β0 + 2β1
  - μ3 = β0 + 3β1
- That is, each unit change in exposure increases the mean blood pressure change by an additional β1 units.
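The linear dose-response model above can be sketched in a few lines of code; the β values here are hypothetical, chosen only to illustrate the constant-increment property:

```python
# Linear dose-response model: mu(x) = beta0 + beta1 * x
# beta0 and beta1 are hypothetical values for illustration only.
beta0 = 2.0   # mean BP change at dose 0
beta1 = 1.5   # additional mean BP change per unit of dose

def mean_bp_change(dose):
    """Expected blood-pressure change at a given dose."""
    return beta0 + beta1 * dose

means = [mean_bp_change(x) for x in range(4)]   # doses 0, 1, 2, 3
diffs = [means[i + 1] - means[i] for i in range(3)]
# Every unit increase in dose adds exactly beta1 to the mean,
# which is the "smoothing assumption" of the linear model.
```

Whatever β values are plugged in, the per-unit increments are identical, which is exactly what the equations μ0 = β0, μ1 = β0 + β1, ... express.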
7 Modeling the effect of a quantitative predictor
- Graphically, this model appears as a straight line: mean BP change plotted against dose. (Figure not reproduced.)
8 Modeling the effect of a quantitative predictor
- Advantages of this model:
  - It is more parsimonious than assuming a separate mean at each level. (The effect of treatment is summarized by one parameter.)
  - The model is not confined to situations with only a small number of doses, as in this example.
- Disadvantages of this model:
  - It assumes a linear relationship between dose and effect. (Smoothing assumption)
9 Regression Models, Purpose 2: Parameterizing the effect of one variable, controlling for other variables
- Motivating example: Suppose you are interested in the effect of moderate alcohol use during pregnancy on the mean birth weight of infants.
- Results (table not reproduced)
10 Parameterizing the effect of one variable, controlling for other variables
- We might want to control for smoking.
- Let's redo the analysis stratifying by smoking status.
- What would be a good estimate of the effect of moderate drinking on birth weight, controlling for smoking?
11 Parameterizing the effect of one variable, controlling for other variables
- The effect of drinking controlling for smoking might be viewed as the effect of drinking after holding smoking constant.
- In these data, the effect of drinking controlling for smoking appears to be 50 g in one stratum and 39 g in the other stratum.
- If we are willing to assume the effect is the same in both strata (no effect modification), then we might take an average of 50 and 39.
- This average can be called the effect of drinking on birth weight, controlling for smoking.
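The pooling step described above is just an average of the stratum-specific effects; a minimal sketch, using the 50 g and 39 g figures from the slide:

```python
# Stratum-specific effects of moderate drinking on birth weight (grams),
# as reported on the slide: 50 g in one smoking stratum, 39 g in the other.
effect_stratum_1 = 50.0
effect_stratum_2 = 39.0

# Under the no-effect-modification assumption, a simple unweighted average
# serves as the single "effect of drinking, controlling for smoking":
pooled_effect = (effect_stratum_1 + effect_stratum_2) / 2
```

In practice a weighted average (weights reflecting stratum sizes or precision) would usually be preferred; the unweighted average mirrors the slide's description.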
12 Parameterizing the effect of one variable, controlling for other variables
- One way to model these data and incorporate the assumption of no effect modification is to parameterize it as a regression model.
13 Parameterizing the effect of one variable, controlling for other variables
- Let
  - X1 = 1 if mother drank, 0 otherwise
  - X2 = 1 if mother smoked, 0 otherwise
- Then, assume E(Y) = β0 + β1X1 + β2X2
- Note this implies:
  - For Smokers and Drinkers: E(Y) = β0 + β1 + β2
  - For Smokers and Non-Drinkers: E(Y) = β0 + β2
  - For Non-Smokers and Drinkers: E(Y) = β0 + β1
  - For Non-Smokers/Non-Drinkers: E(Y) = β0
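The four cell means implied by the additive model can be checked directly; the β values here are hypothetical and chosen only to show that the drinking effect is the same in both smoking strata:

```python
# Additive model: E(Y) = beta0 + beta1*X1 + beta2*X2
# X1 = 1 if mother drank, X2 = 1 if mother smoked.
# Coefficients (in grams) are hypothetical, for illustration only.
beta0, beta1, beta2 = 3400.0, -45.0, -200.0

def expected_birth_weight(drank, smoked):
    return beta0 + beta1 * drank + beta2 * smoked

# Effect of drinking within each smoking stratum (by subtraction):
effect_in_smokers = expected_birth_weight(1, 1) - expected_birth_weight(0, 1)
effect_in_nonsmokers = expected_birth_weight(1, 0) - expected_birth_weight(0, 0)
# Both differences equal beta1: the model builds in no effect modification.
```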
14 Parameterizing the effect of one variable, controlling for other variables
- By subtracting, it can be seen that the effect of drinking in each stratum defined by smoking is β1!
- Therefore, β1 is the effect of drinking controlling for smoking.
- The linear equation is a simple way to parameterize this effect of interest.
- The assumption that the effect of drinking is the same among smokers and non-smokers is another strong smoothing assumption.
15 General Multiple Regression Model
- More generally, we often fit regression models with many independent variables, i.e.
  - E(Y) = β0 + β1X1 + β2X2 + β3X3 + ... + βkXk
- The predictors can be dummy variables or continuous variables.
- Generally we also assume that the variance of Y is the same at all levels of the predictor variables.
- Generally we also assume that the distribution of Y is normal at all levels of the predictor variables (not essential unless the departure from normality is extreme).
16 General Multiple Regression Model
- In these models, it can be shown that the βs are interpretable as the effect of the corresponding predictor, controlling for all other variables in the model.
- All ANOVA models (discussed last week) can be parameterized as linear regression models.
17 Smoothing Assumptions of the General Multiple Regression Model
- It is important to realize that multiple regression models make smoothing assumptions of two types:
  - 1. The model assumes that the relationship between E(Y) and quantitative predictors is linear.
  - 2. The model assumes that the effects are simply additive (i.e., the effect of one variable is the same at all levels of the other variables). (No effect modification w.r.t. mean difference)
18 Advantages and Disadvantages of the Regression Approach
- Advantage
  - The smoothing assumptions make it possible to represent important research questions with a small number of key parameters, and to simultaneously consider many variables.
- Disadvantage
  - The smoothing assumptions impose strong simplifications on the true relationships.
19 Relaxing the Smoothing Assumptions
- Each type of smoothing assumption can be relaxed by adding additional complexity to the regression model.
- Relaxing the assumption of linearity: add quadratic terms, piecewise linear terms, or other creative approaches.
- Example:
  - μ(x) = β0 + β1x + β2x², where x is the dose
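Adding a quadratic term lets the dose effect bend; a minimal sketch with hypothetical coefficients shows that the per-unit increments are no longer constant:

```python
# Quadratic dose-response: mu(x) = beta0 + beta1*x + beta2*x**2
# Coefficients are hypothetical, for illustration only.
beta0, beta1, beta2 = 2.0, 1.5, -0.25

def mu(x):
    return beta0 + beta1 * x + beta2 * x ** 2

# Unlike the linear model, the increment per unit of dose now changes:
increments = [mu(x + 1) - mu(x) for x in range(3)]
# A negative beta2 makes each successive dose increment smaller
# (a diminishing-returns dose-response curve).
```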
20 Relaxing the Smoothing Assumptions
- Relaxing the assumption of no effect modification:
- Let
  - X1 = 1 if mother drank, 0 otherwise
  - X2 = 1 if mother smoked, 0 otherwise
- Then, assume E(Y) = β0 + β1X1 + β2X2 + β3X1X2
- Note this implies:
  - For Smokers and Drinkers: E(Y) = β0 + β1 + β2 + β3
  - For Smokers and Non-Drinkers: E(Y) = β0 + β2
  - For Non-Smokers and Drinkers: E(Y) = β0 + β1
  - For Non-Smokers/Non-Drinkers: E(Y) = β0
- Which implies:
  - Effect of drinking among smokers = β1 + β3
  - Effect of drinking among non-smokers = β1
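The interaction model can be checked by the same subtraction; the coefficients below are hypothetical, chosen so the stratum-specific effects match the 50 g and 39 g figures from the earlier slide:

```python
# Interaction model: E(Y) = beta0 + beta1*X1 + beta2*X2 + beta3*X1*X2
# X1 = drinking indicator, X2 = smoking indicator.
# Coefficients (grams) are hypothetical, for illustration only.
beta0, beta1, beta2, beta3 = 3400.0, -39.0, -200.0, -11.0

def expected_birth_weight(x1, x2):
    return beta0 + beta1 * x1 + beta2 * x2 + beta3 * x1 * x2

# The effect of drinking now differs by smoking stratum:
effect_among_smokers = expected_birth_weight(1, 1) - expected_birth_weight(0, 1)    # beta1 + beta3
effect_among_nonsmokers = expected_birth_weight(1, 0) - expected_birth_weight(0, 0) # beta1
```

With β3 = 0 the two effects coincide and the model reduces to the additive one, which is why testing β3 = 0 is a test of effect modification.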
21 Estimation, Confidence Intervals, P-values
- How do we estimate the terms in the model?
- We need to collect independent realizations of Y and the associated predictors X1, X2, ...
22 Estimation, Confidence Intervals, P-values
- Note: for any possible choice of estimates β̂0, β̂1, ..., β̂k, we would get a predicted value for Y, namely
  - Ŷ = β̂0 + β̂1X1 + ... + β̂kXk
- Choose the β̂s that result in estimates of Y that are closest to the observed values, on average.
- More specifically, we minimize the sum of the squared distances. ("Least Squares Regression")
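For simple linear regression the least-squares minimization has a closed-form solution; a minimal sketch on toy (made-up) data:

```python
# Least-squares fit of E(Y) = beta0 + beta1*x: choose the estimates that
# minimize the sum of squared distances between observed and predicted Y.
# The data below are toy values, for illustration only.
xs = [0, 1, 2, 3]
ys = [2.1, 3.4, 5.2, 6.3]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

# Closed-form least-squares solution for one predictor:
b1 = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
      / sum((x - xbar) ** 2 for x in xs))
b0 = ybar - b1 * xbar

# Sum of squared residuals at the least-squares solution:
residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
```

With several predictors there is no such simple formula and the computation is done by solving a linear system, which is the "by the computer" step of the next slide.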
23 Estimation, Confidence Intervals, P-values
- How is this done? By the computer!
- (It could take years to fit a single multiple regression model by hand.)
- The computer will also provide confidence intervals and p-values based on t-distributions.
- These inferences are approximate, but they are exactly correct if the underlying data are assumed to be normal.
24 Example
- "Head growth and neurodevelopment of infants born to HIV-1-infected drug-using women." Macmillan C, Magder LS, et al., Neurology, 2001.
- Using data from the WITS cohort study of HIV-1-infected women and their children, we fit regression models to estimate the effects of HIV-1 infection and in utero hard drug exposure on Bayley Scales of Infant Development, controlling for relevant confounders.
25 Example (continued)
- Data:
  - 1,094 infants born to HIV-1-positive mothers
  - 147 (13%) of the infants became HIV-1-positive
  - 383 (35%) were exposed in utero to opiates or cocaine (drug-positive)
- Bayley scores were measured repeatedly during the first 2 years of life.
- (Note that repeated measures on the same children violate the assumption of independence, so we had to use somewhat more complex methods.)
26 Example (continued)
- Box plots of Bayley Motor Scores by HIV status and in utero drug exposure. (A score of 100 is normal.)
- Panels: 4 months of age, 12 months of age, 24 months of age. (Plots not reproduced.)
27 Example (continued)
- Regression model for the 4-month Bayley Score:
  - E(Bayley Score) = β0 + β1·HIV/DRUG + β2·HIV/NoDRUG + β3·DRUGnoHIV + β4·AZT + β5·SMOKE + β6·ALCOHOL + ...
28 Example (continued): Results (table not reproduced)
29 Example (continued): Fitted Model
- Expected Bayley Motor Standardized Scores by age in 4 groups: HIV−/Drug−, HIV−/Drug+, HIV+/Drug−, HIV+/Drug+. (Plot not reproduced.)
30 Extension to Other Types of Random Variable
- The regression models described above were used to model a quantitative outcome.
- They can be extended to model binary outcomes simply by replacing E(Y) on the left-hand side with P(Y).
- For example, we might assume that
  - Prob(Y=1) = β0 + β1X1 + β2X2 + β3X3 + ... + βkXk
- This model assumes that an increase in X by one unit has an additive effect on the probability.
- The βs are interpretable as "risk differences".
- The model assumes a smooth relationship between quantitative predictors and the probability of the binary outcome.
- The βs are interpretable as the effect of the corresponding predictor on the probability of the outcome, controlling for all other variables in the model.
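The additive ("risk difference") model can be sketched directly; the coefficients are hypothetical, for illustration only:

```python
# Linear probability model: Prob(Y=1) = beta0 + beta1*X1
# Coefficients are hypothetical, for illustration only.
beta0 = 0.10   # baseline risk when X1 = 0
beta1 = 0.05   # risk difference per unit of X1

def risk(x1):
    return beta0 + beta1 * x1

# Each unit increase in X1 ADDS beta1 to the probability:
risk_difference = risk(1) - risk(0)
```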
31 Extension to Other Types of Random Variable
- A second approach would be the following:
  - Prob(Y=1) = β0 · β1^X1 · β2^X2 · β3^X3 · ... · βk^Xk
- This model assumes that an increase in X by one unit has a multiplicative effect on the probability.
- The βs are interpretable as "risk ratios".
- The model assumes a smooth relationship between quantitative predictors and the probability of the binary outcome.
- The βs are interpretable as the effect of the corresponding predictor on the probability of the outcome, controlling for all other variables in the model.
32 Logistic Regression
- However, most binary regression analyses are based on the odds ratio. The logistic regression model is as follows:
  - Odds(Y=1) = β0 · β1^X1 · β2^X2 · β3^X3 · ... · βk^Xk
- This model assumes that an increase in X by one unit has a multiplicative effect on the odds.
- The βs are interpretable as "odds ratios".
- The model assumes a smooth relationship between quantitative predictors and the odds of the outcome.
- The βs are interpretable as the effect of the corresponding predictor on the odds of the outcome, controlling for all other variables in the model.
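The multiplicative odds model above can be sketched in a few lines; the β values are hypothetical, for illustration only:

```python
# Logistic (multiplicative odds) model: Odds(Y=1) = beta0 * beta1**X1
# Coefficients are hypothetical, for illustration only.
beta0 = 0.25   # baseline odds when X1 = 0
beta1 = 2.0    # odds ratio per unit of X1

def odds(x1):
    return beta0 * beta1 ** x1

def prob(x1):
    o = odds(x1)
    return o / (1 + o)   # convert odds back to a probability

# Each unit increase in X1 MULTIPLIES the odds by beta1 (the odds ratio):
odds_ratio_per_unit = odds(1) / odds(0)
```

Because odds/(1 + odds) always lies between 0 and 1, the logistic model, unlike the linear probability model, can never predict probabilities outside [0, 1].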