Entering Multidimensional Space: Multiple Regression
1
Statistics for Health Research
Entering Multidimensional Space: Multiple Regression
Peter T. Donnan, Professor of Epidemiology and Biostatistics
2
Objectives of session
  • Recognise the need for multiple regression
  • Understand methods of selecting variables
  • Understand strengths and weaknesses of selection
    methods
  • Carry out multiple regression in SPSS and
    interpret the output

3
Why do we need multiple regression?
Research is rarely as simple as the effect of one
variable on one outcome, especially with
observational data. We need to assess many factors
simultaneously to build more realistic models.
4
Consider the fitted plane y = a + b1x1 + b2x2
[3-D plot: dependent variable (y) against two explanatory variables (x1 and x2)]
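For reference, a minimal formal statement of the model behind this plane (standard linear-model notation; the slides themselves do not show the error term):

\[
y_i = a + b_1 x_{1i} + b_2 x_{2i} + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),
\]

where the coefficients a, b1 and b2 are estimated by least squares.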
5
3-dimensional scatterplot from SPSS of Min LDL in
relation to baseline LDL and age
6
When to use multiple regression modelling (1)
Assess the relationship between two variables while
adjusting, or allowing, for another variable.
Sometimes the second variable is considered a
nuisance factor. Example: physical activity,
allowing for age and medications.
7
When to use multiple regression modelling (2)
In an RCT, whenever there is imbalance at baseline
between the arms of the trial in the characteristics
of subjects, e.g. survival in colorectal cancer on
two different randomised therapies, adjusted for
age, gender, stage and co-morbidity at baseline.
8
When to use multiple regression modelling (2, continued)
A special case of this is adjusting for the baseline
level of the primary outcome in an RCT. The baseline
level is added as a factor in the regression model.
This will be covered in the Trials part of the course.
9
When to use multiple regression modelling (3)
With observational data, in order to produce a
prognostic equation for future prediction of risk of
mortality, e.g. predicting future risk of CHD using
10-year data from the Framingham cohort.
10
When to use multiple regression modelling (4)
With observational designs, in order to adjust for
possible confounders, e.g. survival in colorectal
cancer in those with hypertension, adjusted for
age, gender, social deprivation and co-morbidity.
11
Definition of Confounding
A confounder is a factor which is related to both
the variable of interest (explanatory) and the
outcome, but is not an intermediary in a causal
pathway
12
Example of Confounding
[Diagram: Deprivation (exposure) and Lung Cancer (outcome), with Smoking related to both, i.e. a confounder]
13
But it is also worth adjusting for factors related
only to the outcome
[Diagram: Deprivation (exposure) and Lung Cancer (outcome), with Exercise related only to Lung Cancer]
14
Not worth adjusting for an intermediate factor in a
causal pathway
[Diagram: Exercise → Blood viscosity → Stroke]
In a causal pathway each factor is merely a
marker of the other factors, i.e. they are
correlated (collinearity).
15
SPSS: add both baseline LDL and age to the
Independent(s) box in Linear Regression
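A sketch of the equivalent SPSS syntax (the variable names MinLDL, BaselineLDL and Age are assumptions; edit them to match the dataset):

* Multiple linear regression of minimum LDL achieved
* on baseline LDL and age.
REGRESSION
  /STATISTICS COEFF R ANOVA
  /DEPENDENT MinLDL
  /METHOD=ENTER BaselineLDL Age.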
16
Output from SPSS linear regression on Age at
baseline
17
Output from SPSS linear regression on Baseline LDL
18
Output: Multiple regression
R2 now improved to 13%
Both variables remain significant INDEPENDENTLY of
each other
19
How do you select which variables to enter in the
model?
  • Usually consider what hypotheses you are testing
  • If there is a main exposure variable, enter it
    first and assess confounders one at a time
  • For derivation of a clinical prediction rule (CPR)
    you want powerful predictors
  • Also clinically important factors, e.g.
    cholesterol in CHD prediction
  • Significance is important, but it is acceptable to
    have an important variable without statistical
    significance

20
How do you decide which variables to enter in the
model? Correlations? With great difficulty!
21
3-dimensional scatterplot from SPSS of Time from
Surgery in relation to Dukes staging and age
22
Approaches to model building
  • 1. Let scientific or clinical factors guide
    selection
  • 2. Use automatic selection algorithms
  • 3. A mixture of the above
23
1) Let Science or Clinical factors guide selection
Baseline LDL cholesterol is an important factor
determining LDL outcome, so enter it first. Next
allow for age and gender. Add adherence as
important? Add BMI and smoking?
24
1) Let Science or Clinical factors guide
selection
  • Results in a model of:
  • Baseline LDL
  • age and gender
  • Adherence
  • BMI and smoking
  • Is this a good model?

25
1) Let Science or Clinical factors guide selection:
Final Model
Note: three of the variables entered are not
statistically significant
26
1) Let Science or Clinical factors guide selection
Is this the best model? Should I leave out the
non-significant factors (Model 2)?

Model   Adj R2   F from ANOVA   No. of parameters (p)
1       0.137    37.48          7
2       0.134    72.021         4

Adj R2 is lower, F has increased and the number of
parameters is smaller in the 2nd model. Is this
better?
27
Kullback-Leibler Information
Kullback and Leibler (1951) quantified the meaning
of information in relation to Fisher's sufficient
statistics. Basically we have reality f, and a model
g to approximate f; the K-L information is I(f,g).
[Diagram: reality f approximated by model g]
28
Kullback-Leibler Information
We want to minimise I(f,g) to obtain the best model
over other models. I(f,g) is the information lost,
or the "distance" between reality and a model, so it
needs to be minimised.
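For reference, the standard definition (not shown on the slides) for continuous distributions is

\[
I(f,g) = \int f(x) \log \frac{f(x)}{g(x \mid \theta)} \, dx ,
\]

which is zero only when the model g matches reality f exactly.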
29
Akaike's Information Criterion
It turns out that the function I(f,g) is related
to a very simple measure of goodness-of-fit:
Akaike's Information Criterion, or AIC.
30
Selection Criteria
  • With a large number of factors the type 1 error is
    inflated, so we are likely to end up with a model
    containing many variables
  • Two standard criteria:
  • 1) Akaike's Information Criterion (AIC)
  • 2) Schwarz's Bayesian Information Criterion (BIC)
  • Both penalise models with a large number of
    variables if the sample size is large (see the
    formulas below)
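The standard forms, with LL the maximised log likelihood, p the number of parameters and n the sample size, are

\[
\mathrm{AIC} = -2\,\mathrm{LL} + 2p, \qquad \mathrm{BIC} = -2\,\mathrm{LL} + p \log n .
\]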

31
Akaike's Information Criterion
  • AIC = -2LL + 2p, where p is the number of
    parameters and the -2 log likelihood (-2LL) is
    given in the output
  • Hence AIC penalises models with a large number of
    variables
  • Select the model that minimises AIC = -2LL + 2p

32
Generalized linear models
  • Unfortunately the standard REGRESSION procedure in
    SPSS does not give these statistics
  • Need to use:
  • Analyze > Generalized Linear Models...

33
Generalized linear models: the default is linear
  • Add Min LDL achieved as the dependent variable, as
    in REGRESSION in SPSS
  • Next go to the predictors...

34
Generalized linear models: Predictors
  • WARNING!
  • Make sure you add the predictors in the correct
    box
  • Categorical in FACTORS box
  • Continuous in COVARIATES box

35
Generalized linear models: Model
  • Add all factors and covariates in the model as
    main effects (a syntax sketch follows)
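A sketch of the equivalent GENLIN syntax, under the same assumed variable names; categorical predictors go after BY (the Factors box) and continuous ones after WITH (the Covariates box):

* Generalized linear model with normal errors and identity
* link, i.e. ordinary linear regression, but with AIC in the
* output. Variable names are assumed; edit to match the data.
GENLIN MinLDL BY Gender Smoking WITH BaselineLDL Age Adherence BMI
  /MODEL Gender Smoking BaselineLDL Age Adherence BMI
    DISTRIBUTION=NORMAL LINK=IDENTITY
  /PRINT FIT SOLUTION.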

36
Generalized Linear Models: Parameter Estimates
Note: identical to the REGRESSION output
37
Generalized Linear Models: Goodness-of-fit
Note: the output gives the log likelihood (-1409.6)
and AIC = 2835, i.e. AIC = -2LL + 2p. The footnote
explains that a smaller AIC is better.
38
Let Science or Clinical factors guide selection:
Optimal model
  • The log likelihood is a measure of
    GOODNESS-OF-FIT
  • Seek the optimal model that maximises the log
    likelihood or minimises the AIC

Model                                  Log likelihood   p   AIC
1 Full model                           -1409.6          7   2835.6
2 Non-significant variables removed    -1413.6          4   2837.2

The change in AIC is 1.6
39
1) Let Science or Clinical factors guide
selection
  • Key points
  • Results demonstrate a significant association
    with baseline LDL, Age and Adherence
  • Difficult choices with Gender, smoking and BMI
  • AIC only changes by 1.6 when removed
  • Generally changes of 4 or more in AIC are
    considered important

40
1) Let Science or Clinical factors guide
selection
  • Key points
  • Conclude there is little to choose between the
    models
  • AIC is actually lower with the larger model, and
    gender and BMI are considered important factors,
    so keep the larger model, but this has to be
    justified
  • Model building is manual, logical, transparent and
    under your control

41
2) Use automatic selection procedures
These are based on automatic mechanical algorithms,
usually driven by statistical significance. Common
ones are stepwise, forward and backward elimination.
They can be selected in SPSS using Method in the
dialogue box.
42
2) Use automatic selection procedures (e.g.
stepwise)
Select Method: Stepwise
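A sketch of the equivalent syntax (variable names assumed; PIN and POUT set the entry and removal significance thresholds):

* Stepwise selection: variables enter at p < .05
* and are removed at p > .10.
REGRESSION
  /CRITERIA=PIN(.05) POUT(.10)
  /DEPENDENT MinLDL
  /METHOD=STEPWISE BaselineLDL Age Adherence Gender BMI Smoking.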
43
2) Use automatic selection procedures (e.g.
stepwise)
1st step
2nd step
Final Model
44
2) Change in AIC with stepwise selection
Note: only available from Generalized Linear Models

Step   Model          Log likelihood   AIC      Change in AIC   No. of parameters (p)
1      Baseline LDL   -1423.1          2852.2   -               2
2      Adherence      -1418.0          2844.1   8.1             3
3      Age            -1413.6          2837.2   6.9             4
45
2) Advantages and disadvantages of stepwise
Advantages:
  • Simple to implement
  • Gives a parsimonious model
  • Selection is certainly objective
Disadvantages:
  • Selection is not stable: stepwise considers many
    models that are very similar
  • The p-value on entry may be smaller once the
    procedure is finished, so p-values are exaggerated
  • Predictions in an external dataset are usually
    worse for stepwise procedures, which tend to add
    bias
46
2) Automatic procedures: Backward elimination
  • Backward starts by eliminating the least
    significant factor from the full model and has a
    few advantages over forward:
  • The modeller has to consider the full model and
    sees results for all factors simultaneously
  • Correlated factors can remain in the model (in
    forward methods they may not even enter)
  • Criteria for removal tend to be more lax in
    backward, so you end up with more parameters

47
2) Use automatic selection procedures (e.g.
backward)
Select Method: Backward
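The corresponding syntax sketch, with the same assumed variable names:

* Backward elimination: start from the full model and
* remove the least significant factor at each step.
REGRESSION
  /CRITERIA=POUT(.10)
  /DEPENDENT MinLDL
  /METHOD=BACKWARD BaselineLDL Age Adherence Gender BMI Smoking.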
48
2) Backward elimination in SPSS
1st step: Gender removed
2nd step: BMI removed
Final Model
49
Summary of automatic selection
  • Automatic selection may not give optimal model
    (may leave out important factors)
  • Different methods may give different results
    (forward vs. backward elimination)
  • Backward elimination preferred as less stringent
  • Too easily fitted in SPSS!
  • Model assessment still requires some thought

50
3) A mixture of automatic procedures and
self-selection
  • Use automatic procedures as a guide
  • Think about what factors are important
  • Add important factors
  • Do not blindly follow statistical significance
  • Consider AIC

51
Summary of Model selection
  • Selection of factors for Multiple Linear
    regression models requires some judgement
  • Automatic procedures are available but treat
    results with caution
  • They are easily fitted in SPSS
  • Check AIC or log likelihood for fit

52
Summary
  • Multiple regression models are the most used
    analytical tool in quantitative research
  • They are easily fitted in SPSS
  • Model assessment requires some thought
  • Parsimony is better (Occam's Razor)
  • Donnelly LA, Palmer CNA, Whitley AL, Lang C,
    Doney ASF, Morris AD, Donnan PT. Apolipoprotein E
    genotypes are associated with lipid-lowering
    response to statin treatment in diabetes: a
    Go-DARTS study. Pharmacogenetics and Genomics
    2008; 18: 279-87.

53
Remember Occam's Razor
"Entia non sunt multiplicanda praeter
necessitatem" - entities must not be multiplied
beyond necessity.
William of Ockham (1288-1347), 14th-century friar
and logician
54
Practical on Multiple Regression
  • Read in LDL Data.sav
  • Try fitting a multiple regression model on Min LDL
    obtained, using forward and backward elimination (a
    starter syntax sketch follows this list). Are the
    results the same? Add factors other than those
    considered in the presentation, such as BMI and
    smoking. Remember the goal is to assess the
    association of APOE with LDL response.
  • Try fitting multiple regression models for Min
    Chol achieved. Is the model similar to that found
    for Min LDL?
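A hedged starting point for the practical; all variable names (MinLDL, BaselineLDL, etc.) are assumptions to check against LDL Data.sav:

* Forward then backward selection on minimum LDL achieved.
* Note: REGRESSION treats predictors as numeric, so a
* categorical APOE genotype needs dummy coding first
* (or use GENLIN with APOE listed after BY).
REGRESSION
  /DEPENDENT MinLDL
  /METHOD=FORWARD BaselineLDL Age Gender Adherence BMI Smoking.
REGRESSION
  /DEPENDENT MinLDL
  /METHOD=BACKWARD BaselineLDL Age Gender Adherence BMI Smoking.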