Title: Statistical, Practical, and Design Issues in Analysis with Missing Data
1 Statistical, Practical, and Design Issues in
Analysis with Missing Data
- John Graham
- The Methodology Center
- Dept. of Biobehavioral Health
- Penn State University
- SRCD, Atlanta, April 9, 2005
2 Recent Missing Data Work
- Collins, L. M., Schafer, J. L., & Kam, C. M.
(2001). A comparison of inclusive and
restrictive strategies in modern missing data
procedures. Psychological Methods, 6, 330-351.
- Little & Rubin (2002). 2nd Edition
- Schafer, J. L., & Graham, J. W. (2002). Missing
data: Our view of the state of the art.
Psychological Methods, 7, 147-177.
- Graham, J. W., Cumsille, P. E., & Elek-Fisk, E.
(2003). Methods for handling missing data. In
J. A. Schinka & W. F. Velicer (Eds.), Research
Methods in Psychology (pp. 87-114). Volume 2 of
Handbook of Psychology (I. B. Weiner,
Editor-in-Chief). New York: John Wiley & Sons.
3 Recent Planned Missingness Papers
- Graham, J. W., Taylor, B. J., & Cumsille, P. E.
(2001). Planned missing data designs in analysis
of change. In L. Collins & A. Sayer (Eds.), New
methods for the analysis of change (pp.
335-353). Washington, DC: American Psychological
Association.
- Graham, J. W., Taylor, B. J., Olchowski, A. E., &
Cumsille, P. E. (2005). Planned missing data
designs in psychological research. Manuscript
submitted for publication.
4 Other Recent Papers
- Graham, J. W. (2003). Adding missing-data
relevant variables to FIML-based structural
equation models. Structural Equation Modeling,
10, 80-100.
- Graham, J. W., & Schafer, J. L. (1999). On the
performance of multiple imputation for
multivariate data with small sample size. In R.
Hoyle (Ed.), Statistical Strategies for Small
Sample Research (pp. 1-29). Thousand Oaks, CA:
Sage.
5 Problem with Missing Data
- Analysis procedures were designed for complete
data. . .
6 Solution 1
- Design new procedures
- Missing Data and Parameter Estimation in One Step
- Full Information Maximum Likelihood (FIML): SEM
and Other Latent Variable Programs (Amos, Mx,
LISREL, Mplus, LTA)
7 Solution 2
- Missing data: Multiple Imputation (MI)
- Two Steps
- Step 1: Replace Missing Values with Plausible
Values
- Step 2: Analyze Data as if there were No Missing
Data
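The two steps can be sketched in Python with NumPy. This is a minimal illustration on simulated data, not the procedure used by NORM or any other package. It assumes a single incomplete variable `y` imputed from a complete variable `x` by stochastic regression, and it does not draw the regression parameters themselves, which a proper MI procedure would also do. The names `impute_once` and `analyze` are hypothetical.

```python
import numpy as np

def impute_once(x, y, rng):
    """Step 1: replace missing y with plausible values (prediction + random residual)."""
    obs = ~np.isnan(y)
    slope, intercept = np.polyfit(x[obs], y[obs], 1)
    resid_sd = np.std(y[obs] - (intercept + slope * x[obs]), ddof=2)
    y_imp = y.copy()
    n_mis = int((~obs).sum())
    y_imp[~obs] = intercept + slope * x[~obs] + rng.normal(0.0, resid_sd, n_mis)
    return y_imp

def analyze(x, y):
    """Step 2: analyze the data as if complete; return the regression b-weight."""
    return np.polyfit(x, y, 1)[0]

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 0.5 * x + rng.normal(size=n)      # simulated data, true b-weight 0.5
y[rng.random(n) < 0.3] = np.nan       # about 30% of y set to missing

m = 20                                # number of imputed data sets
estimates = [analyze(x, impute_once(x, y, rng)) for _ in range(m)]
b_mi = float(np.mean(estimates))      # point estimate averaged over imputations
print(round(b_mi, 3))
```

Each pass through the loop produces one imputed data set and one analysis; the m results are then combined into a single estimate.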
8 FAQ
- Aren't you somehow helping yourself with
imputation? . . .
9 NO. Missing data imputation . . .
- does NOT give you something for nothing
- DOES let you make use of all data you have
- . . .
10 FAQ
- Is the imputed value what the person would have
given?
11 NO. When we impute a value . . .
- We do not impute for the sake of the value itself
- We impute to preserve important characteristics
of the whole data set
- . . .
12 We want . . .
- unbiased parameter estimation
- e.g., b-weights
- Good estimate of variability
- e.g., standard errors
- best statistical power
13 MCAR (Missing Completely At Random)
- MCAR 1: Cause of missingness is a completely random
process (like a coin flip)
- MCAR 2: Cause is uncorrelated with variables of
interest
- Example: parents move
- No bias if cause omitted
14 MAR (Missing At Random)
- Missingness may be related to measured variables
- But no residual relationship with unmeasured
variables
- Example: reading speed
- No bias if you control for measured variables
(conditionally missing at random)
15 MNAR (Missing Not At Random)
- Even after controlling for measured variables ...
- Residual relationship with unmeasured variables
- Example: drug use is the reason for absence
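A small simulation can make the three mechanisms concrete. The sketch below is an illustration under assumed values, not an analysis from the talk: `x` stands for a measured variable, `y` for the variable of interest (true mean 0), and missingness in `y` is generated by a coin flip (MCAR), by `x` only (MAR), or by `y` itself (MNAR). The function names and the logistic missingness model are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)                     # measured variable
y = 0.6 * x + 0.8 * rng.normal(size=n)     # variable of interest, true mean 0

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

mechanisms = {
    "MCAR": rng.random(n) < 0.5,               # coin flip
    "MAR": rng.random(n) < sigmoid(2.0 * x),   # depends only on measured x
    "MNAR": rng.random(n) < sigmoid(2.0 * y),  # depends on unobserved y itself
}

def complete_case_mean(miss):
    """Mean of y using only cases where y is observed."""
    return float(y[~miss].mean())

def x_adjusted_mean(miss):
    """Regress y on x among observed cases, then average predictions over everyone."""
    slope, intercept = np.polyfit(x[~miss], y[~miss], 1)
    return float((intercept + slope * x).mean())

cc = {name: complete_case_mean(miss) for name, miss in mechanisms.items()}
adj = {name: x_adjusted_mean(miss) for name, miss in mechanisms.items()}
for name in mechanisms:
    print(name, round(cc[name], 3), round(adj[name], 3))
```

In this setup the complete-case mean is close to the true value only under MCAR. Controlling for the measured variable removes the bias under MAR, while under MNAR some bias remains even after controlling for `x`.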
16 MNAR Causes
- The recommended methods assume data are MAR
- Should these methods be used even when their
assumptions are not met?
- . . .
17 YES! These Methods Work!
- Suggested methods work better than old methods
- Multiple causes of missingness
- Only small part of missingness may be MNAR
- Suggested methods usually work very well
18 Practical Issues
- How much difference does it make?
- How easy is the "sell"?
- Which is better: FIML or MI?
- "Auxiliary" Variables (Collins, Schafer, & Kam,
2001; Graham, 2003)
- Small sample size (Graham & Schafer, 1999)
- Too many variables
- Automation
19 Some Practical Issues
20 Practical Issues: Biggest problems in multiple
imputation
- How do I write my data out of SPSS?
- How can I use MI with ANOVA?
- How do I use MI with SPSS, STATA, SUDAAN, EQS,
Mplus?
- Is there a less tedious way?
21 Practical Issues: How Easy is the "Sell"?
- Sell is getting easier all the time
- Pendulum starting to swing other way
22 Practical Issues: Which is better -- MI or FIML?
- MI is more generally applicable than FIML
- In theory (in long run), they are equivalent
- As practiced, they are sometimes a bit different
23 Recent Research (Collins, Schafer, & Kam, 2001)
Shows ...
- Include auxiliary variables in model
- highly correlated with variables of interest
- minimize loss of power
- reduce bias (MNAR)
24 FIML versus MI (revisited)
- In long run, FIML and MI equivalent
- AS PRACTICED, MI is better
- MI makes it easier to include auxiliary variables
(Collins, Schafer, & Kam, 2001)
- but models are available to allow auxiliary
variables with FIML
- see Graham (2003, Structural Equation Modeling)
25 FIML versus MI (revisited)
- In long run, FIML and MI equivalent
- AS PRACTICED, FIML may be better
- FIML estimates sometimes MAY have more power
- but if you set m (the number of imputations) high
enough (40 imputations?)
- power is equivalent
26 Practical Issues: Small Sample Size
- Graham & Schafer (1999)
- large regression model
- N = 50
- 50% missingness
- Multiple Imputation (NORM) worked fine
27 Practical Issues: Too Many Variables
- Longitudinal research
- 10 variables x 5 waves = 50 variables
- 1,325 parameters
- 10 latent variables x 4 manifest variables x
5 waves = 200 variables
- 20,300 parameters
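The parameter counts on this slide match k(k + 3)/2, the number of means, variances, and covariances for k variables. A short check, assuming that is the count intended (the function name is hypothetical):

```python
def saturated_parameter_count(k):
    """k means + k variances + k*(k-1)/2 covariances for k variables."""
    return k + k + k * (k - 1) // 2

n_scale_vars = 10 * 5       # 10 variables x 5 waves
n_item_vars = 10 * 4 * 5    # 10 latent variables x 4 manifest variables x 5 waves

print(n_scale_vars, saturated_parameter_count(n_scale_vars))  # 50 1325
print(n_item_vars, saturated_parameter_count(n_item_vars))    # 200 20300
```

Quadrupling the number of variables multiplies the number of parameters by about 15, which is why item-level models become unwieldy so quickly.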
28 One Solution (borrow Steve Peck's slide): Impute
Scales Rather Than Items
- Create scales based on partial data
- Require 100% of items if Alpha < .60
- Require 80% of items if Alpha = .70
- Require 67% of items if Alpha = .80
- Require 50% of items if Alpha = .90
- Good solution for regression analyses
- Not good solution for SEM
29 One Solution (borrow Steve Peck's slide): Impute
Scales Rather Than Items
- Create scales based on partial data
- Alpha     Items Required
- _____     ______________
- < .60     100%
- .70       80%
- .80       67%
- .90       50%
- Good solution for regression analyses
- Not good solution for SEM
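Scoring a scale from partial data might look like the sketch below. It assumes the scale score is the mean of the items a respondent answered, set to missing when the proportion answered falls below the required level; the function name `scale_score` and the example data are hypothetical, and only the thresholds come from the slide.

```python
import numpy as np

def scale_score(items, min_proportion):
    """Row-wise mean of answered items; NaN when too few items were answered."""
    items = np.asarray(items, dtype=float)
    answered = ~np.isnan(items)
    counts = answered.sum(axis=1)
    sums = np.where(answered, items, 0.0).sum(axis=1)
    means = np.divide(sums, counts,
                      out=np.full(items.shape[0], np.nan),
                      where=counts > 0)
    return np.where(answered.mean(axis=1) >= min_proportion, means, np.nan)

# Hypothetical 4-item scale with Alpha = .80, so 67% of items are required.
data = [
    [1, 2, 3, 4],                           # all items answered
    [1, np.nan, 3, 4],                      # 75% answered: scored
    [np.nan, np.nan, 3, 4],                 # 50% answered: left missing
    [np.nan, np.nan, np.nan, np.nan],       # nothing answered
]
scores = scale_score(data, min_proportion=0.67)
print(scores)
```

Scale scores left missing by this rule would then be handled in the imputation step along with the other scale-level variables.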
30 Too Many Variables: A Partial Solution
- Normally cannot impute separately
- assumes r = 0 between sets of variables
- What if r = 0 is true?
- (1) Separate variables so they are maximally
uncorrelated (principal components)
- (2) Use factor scores to represent unused
variables
31 Practical Issues: Biggest Problems in Multiple
Imputation
- NOT the theory
- NOT imputing the data
- NOT combining the results from m = 20 analyses
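For reference, the combining step is usually done with Rubin's rules: the point estimate is the average over imputations, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The sketch below assumes that approach; the function name and the five example values are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
import numpy as np

def combine(estimates, std_errors):
    """Combine m estimates and standard errors with Rubin's rules."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    m = len(est)
    pooled = est.mean()                     # pooled point estimate
    within = np.mean(se ** 2)               # average within-imputation variance
    between = est.var(ddof=1)               # between-imputation variance
    total = within + (1.0 + 1.0 / m) * between
    return float(pooled), float(np.sqrt(total))

# Hypothetical b-weights and standard errors from m = 5 imputed data sets.
b, se = combine([0.50, 0.54, 0.46, 0.52, 0.48], [0.10] * 5)
print(round(b, 3), round(se, 4))
```

The same function applies unchanged to m = 20 or m = 40 sets of results.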
32 Practical Issues: Biggest problems in multiple
imputation
- How do I write my data out of SPSS?
- How can I use MI with ANOVA?
- How do I use MI with SPSS, STATA, SUDAAN, EQS,
Mplus?
- Is there a less tedious way?
33 Now That We Have Good Solutions for Missing Data
Analysis ...
- Consider PLANNED missing data designs
34 Planned Missingness for Growth Modeling
- see Graham, J. W., Taylor, B. J., & Cumsille, P.
E. (2001). Planned missing data designs in
analysis of change. In L. Collins & A. Sayer
(Eds.), New methods for the analysis of change
(pp. 335-353). Washington, DC: American
Psychological Association.
35 Design 1: all combinations of 1 time missing (17%
missing)
- 1 1 1 1 1 57 0 0 0 0 0
- 1 1 1 1 0 57 1 1 1 1 1
- 1 1 1 0 1 57 1 1 1 1 1
- 1 1 0 1 1 57 1 1 1 1 1
- 1 0 1 1 1 57 1 1 1 1 1
- 0 1 1 1 1 57 1 1 1 1 1
- ___ N = 342
36 Design 3: all combinations of 2 times missing
(36% missing)
- 1 1 1 1 1 31 0 0 0 0 0
- 1 1 1 0 0 31 0 0 0 0 0
- 1 1 0 1 0 31 0 0 0 0 0
- 1 0 1 1 0 31 0 0 0 0 0
- 0 1 1 1 0 31 1 1 1 1 1
- 1 1 0 0 1 31 1 1 1 1 1
- 1 0 1 0 1 31 1 1 1 1 1
- 0 1 1 0 1 31 1 1 1 1 1
- 1 0 0 1 1 31 1 1 1 1 1
- 0 1 0 1 1 31 1 1 1 1 1
- 0 0 1 1 1 31 1 1 1 1 1
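The patterns and missingness percentages for Designs 1 and 3 can be reproduced with a short enumeration. The sketch assumes that the first five columns of each row are measurement occasions (1 = measured, 0 = missing) and that the sixth column is the number of subjects per pattern; the trailing five columns are not interpreted here. The function name `design` is hypothetical.

```python
from itertools import combinations

def design(n_waves, n_missing, n_per_pattern):
    """Complete-data pattern plus every pattern with exactly n_missing waves missing."""
    patterns = [(1,) * n_waves]
    for missing in combinations(range(n_waves), n_missing):
        patterns.append(tuple(0 if w in missing else 1 for w in range(n_waves)))
    total_n = n_per_pattern * len(patterns)
    missing_points = n_per_pattern * sum(p.count(0) for p in patterns)
    pct_missing = 100.0 * missing_points / (total_n * n_waves)
    return patterns, total_n, pct_missing

patterns1, n1, pct1 = design(n_waves=5, n_missing=1, n_per_pattern=57)
patterns3, n3, pct3 = design(n_waves=5, n_missing=2, n_per_pattern=31)
print(len(patterns1), n1, round(pct1))  # 6 342 17
print(len(patterns3), n3, round(pct3))  # 11 341 36
```

The order in which the patterns are generated differs from the order on the slides, but the set of patterns is the same.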
37 SE for b-wt Predicting Slope
- [Figure: SE by number of data points for missing data designs versus complete cases designs; annotations mark "same money" and "same SE" comparisons]
38 3-Form Design
- Student Received Item Set?
- ----------------------------
-         X    A    B    C
- Form 1  yes  yes  yes  NO
- Form 2  yes  yes  NO   yes
- Form 3  yes  NO   yes  yes
- Form 4  yes  yes  yes  yes
39 3-Form Design
- Item Sets  X   A   B   C   total
-            34  33  33  33  133
- form  X   A   B   C   total
- 1     34  33  33  0   100
- 2     34  33  0   33  100
- 3     34  0   33  33  100
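The key property of the 3-form design is that every pair of item sets appears together on at least one form, so every covariance can be estimated. The check below is a sketch using the item counts from the slide; the variable names are hypothetical.

```python
from itertools import combinations

item_counts = {"X": 34, "A": 33, "B": 33, "C": 33}
forms = {
    1: ("X", "A", "B"),   # Form 1 omits C
    2: ("X", "A", "C"),   # Form 2 omits B
    3: ("X", "B", "C"),   # Form 3 omits A
}

# Number of items each student answers on each form.
items_per_form = {f: sum(item_counts[s] for s in sets) for f, sets in forms.items()}

# For each pair of item sets, the forms on which both sets appear.
pair_coverage = {
    pair: [f for f, sets in forms.items() if set(pair) <= set(sets)]
    for pair in combinations(item_counts, 2)
}

print(items_per_form)
for pair, covered_by in pair_coverage.items():
    print(pair, covered_by)
```

Each student answers 100 of the 133 items, yet no pair of item sets goes unobserved.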
40 3-Form Design: Item Order
- Form 1: X A B
- Form 2: X C A
- Form 3: X B C
41 3-Form Design: Item Order
- Form 1: X A B C
- Form 2: X C A B
- Form 3: X B C A
42 3-Form Design: Item Order
- Form 1: X A B C
- Form 2: X C A B
- Form 3: X B C A
- Could pay some subjects to complete extra
questions
43 3-Form Design: Item Order
- Form 1: X A B C
- Form 2: X C A B
- Form 3: X B C A
- Give questions as shown, measure reasons for
non-completion
- poor reading
- low motivation
- "Managed" missingness
44 Expensive Measures II
- [Diagram: "Larger N, Less Expensive" measure and "Smaller N, More Expensive" measure; r = .30]
45 Example Study
- r = -.30 (smoking and health)
- Self-report Smoking
- two items
- Biochemical Smoking Measures
- Expired Air CO
- Saliva Cotinine
46 Example Study
- $15,050 for Measuring Smoking
- Self-Reports: $7.30 per subject
- CO / Cotinine: $16.78 per subject
- self-reports      bio-chem
- 625 x $7.30   +   625 x $16.78  =  $15,050
- 1200 x $7.30  +   375 x $16.78  ≈  $15,050
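The budget arithmetic can be checked directly. The sketch below works in cents to avoid floating-point rounding and assumes the per-subject costs are in dollars; the names are hypothetical.

```python
SELF_REPORT_CENTS = 730    # 7.30 per subject
BIOCHEM_CENTS = 1678       # 16.78 per subject (CO / cotinine)

def total_cost(n_self_report, n_biochem):
    """Total measurement cost for a design, in dollars."""
    cents = n_self_report * SELF_REPORT_CENTS + n_biochem * BIOCHEM_CENTS
    return cents / 100

complete_cases = total_cost(625, 625)       # everyone gets both measures
planned_missing = total_cost(1200, 375)     # everyone gets self-report, a subset gets bio-chem
print(complete_cases, planned_missing)      # 15050.0 15052.5
```

For essentially the same money, the planned missingness design nearly doubles the number of subjects with self-report data.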
47 [Figure with elements labeled D, C, A, B, E]
48
- Cheap Measures are Biased
- A: loadings .50, .70 (Cheap, Expensive)
- B: loadings .70, .70 (Cheap, Expensive)
- C: loadings .70, .50 (Cheap, Expensive)
- D: loadings .50, .50 (Cheap, Expensive)
- Cheap Measures Not Biased
- E: loadings .50, .70 (Cheap, Expensive)
49 Research Examples
- Smoking Research
- less expensive: Self-Reports
- more expensive: CO and Saliva Cotinine
- Alcohol Research
- less expensive: Brief Self-reports
- more expensive: Time Line Follow Back
50 Research Examples
- Blood Vessel Health (relevant for cardiovascular
health)
- Ultrasound Flow-mediated dilation
- $150 per subject
- BP approximation
- pulse wave contour analysis
- $15 per subject
51 Research Examples
- Nutrition Research
- less expensive: Brief Nutrition Survey
- more expensive: Extensive 24-hr Recall
- Survey Research I
- less expensive: Brief Mail Survey
- more expensive: Extensive Face-to-Face
Interview
52 Research Examples
- Exercise Research: Physical Conditioning
- less expensive: Self-report survey
- more expensive: VO2-max
- Survey Research II
- less expensive: Retrospective reports
- more expensive: Prospective reports
53
54 Recent Papers
- Schafer, J. L., & Graham, J. W. (2002). Missing
data: Our view of the state of the art.
Psychological Methods, 7, 147-177.
- Graham, J. W., Cumsille, P. E., & Elek-Fisk, E.
(2003). Methods for handling missing data. In
J. A. Schinka & W. F. Velicer (Eds.), Research
Methods in Psychology (pp. 87-114). Volume 2 of
Handbook of Psychology (I. B. Weiner,
Editor-in-Chief). New York: John Wiley & Sons.
- Graham, J. W., Taylor, B. J., Olchowski, A. E., &
Cumsille, P. E. (2005). Planned missing data
designs in psychological research. Manuscript
submitted for publication.
- http://mcgee.hhdev.psu.edu/publication_resources
- email: jgraham_at_psu.edu