PROC MI

Slides: 55

Provided by: HALL2

Learn more at: https://www.cpcug.org

1
PROC MI and PROC MIANALYZE
February 12, 2003 Charlie Hallahan
2
Overview
Multiple imputation is a strategy for dealing
with data sets with missing values. You
replace each missing value with a set of
plausible values that represent the uncertainty
about the right value to impute. You create
multiple imputed data sets, analyze them with
standard analyses, and then combine the results.
You produce valid statistical inferences that
properly reflect the uncertainty due to the
missing values.
3
Overview
PROC MI creates multiply imputed data sets for
incomplete p-dimensional multivariate data. It
offers three methods for creating the imputed
data sets: the regression method, the propensity
score method, and the Markov Chain Monte Carlo
(MCMC) method. The procedure creates an output
data set containing m imputed versions of the
original data. In each version, the missing
values are replaced with imputed values.
4
Overview
For the MCMC method, you can specify whether you
want a single chain for all m imputations or a
separate chain for each imputation. You can also
specify the initial estimates for the MCMC
method. After analyzing your imputed data with
standard procedures, you use PROC MIANALYZE to
combine the results. The MI procedure was
introduced in Release 8.1 and remains
experimental in Release 8.2, with various new
options and output displays available. Among
others, a new TRANSFORM statement enables you to
transform variables before imputation and
back-transform these variables before combining
inferences and creating output data sets. (The
procedure has production status in Release 9.0.)
5
Multiple Imputation Strategy
Rubin (1987):
- Replace each missing value with a set of
  plausible values that represent the
  uncertainty about the value to impute.
- Does not attempt to estimate each missing
  value.
- Represents a random sample of the missing
  values.
- Analyzes multiple imputed data sets using
  standard procedures for complete data.
- Combines results from standard analyses. No
  matter which complete-data analysis is used,
  the process of combining results is the same.
- Yields valid statistical inferences that
  properly reflect uncertainty due to missing
  values, e.g., confidence intervals.
6
Steps in Multiple Imputation Inference
1. Missing data are filled in m times to
   generate m complete data sets using the MI
   procedure.
2. The m complete data sets are analyzed using
   standard SAS procedures.
3. Results from the m complete data sets are
   combined for inference using the MIANALYZE
   procedure.
7
Multiple Imputation Methods
- Monotone missing patterns: a data set with
  ordered variables Y1, ..., Yp in which,
  whenever Yj is missing for an individual, all
  subsequent Yk, k > j, are missing for the same
  observation.
    - Regression method (Rubin 1987)
    - Propensity score method (Lavori, Dawson,
      and Shera 1995)
- Arbitrary missing patterns
    - Full-data imputation through MCMC
      (Schafer 1997)
    - Monotone-data imputation through MCMC
      (Schafer 1997)

8
Basic Assumption: Missing at Random
Missing at Random (MAR) means that the
probability that an observation is missing may
depend on Yobs but not on Ymiss. For example,
consider a data set in which Y1 and Y2 are fully
observed and Y3 has missing values. For any
observation, MAR assumes that Pr(Y3 is missing)
may be related to the values of Y1 and Y2, but
not to the value of Y3. Missing Completely at
Random (MCAR) is a special case of MAR in which
the missing data values are a simple random
sample of all data values.
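The practical difference between MCAR and MAR can be seen in a small simulation. The sketch below is only an illustration in Python: the variable names, the missingness probabilities, and the helper simulate are assumptions made up for this example. Y3 is correlated with a fully observed Y1, and under MAR the chance that Y3 is missing depends on Y1 alone.

```python
import random
import statistics

def simulate(mechanism, n=20000, seed=1):
    """Return (mean of all Y3, mean of the observed Y3) for one mechanism."""
    rng = random.Random(seed)
    all_y3, observed_y3 = [], []
    for _ in range(n):
        y1 = rng.gauss(0.0, 1.0)
        y3 = y1 + rng.gauss(0.0, 1.0)      # Y3 is correlated with Y1
        if mechanism == "MCAR":
            p_missing = 0.5                 # assumed rate, ignores the data
        else:
            # MAR: assumed rates that depend only on the observed Y1
            p_missing = 0.8 if y1 > 0 else 0.2
        all_y3.append(y3)
        if rng.random() >= p_missing:
            observed_y3.append(y3)
    return statistics.fmean(all_y3), statistics.fmean(observed_y3)

mcar_full, mcar_obs = simulate("MCAR")
mar_full, mar_obs = simulate("MAR")
print(f"MCAR: full mean {mcar_full:.3f}, observed-only mean {mcar_obs:.3f}")
print(f"MAR:  full mean {mar_full:.3f}, observed-only mean {mar_obs:.3f}")
```

Under MCAR the observed values remain a simple random sample, so their mean stays close to the full mean. Under MAR the observed-only mean drifts away, which is why the imputation model needs to use Y1.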
9
Getting Started
The Fitness data set has been altered to contain
an arbitrary pattern of missingness.
----------------- Data on Physical Fitness -----------------
These measurements were made on men involved in a physical
fitness course at N.C. State University. Only selected
variables of Oxygen (oxygen intake, ml per kg body weight
per minute), Runtime (time to run 1.5 miles in minutes), and
RunPulse (heart rate while running) are used. Certain values
were changed to missing for the analysis.
------------------------------------------------------------
Assume the data are multivariate normally
distributed and the missing data are missing at
random (MAR).
10
Getting Started
data FitMiss;
   input Oxygen RunTime RunPulse @@;
   datalines;
44.609  11.37  178     45.313  10.07  185
54.297   8.65  156     59.571    .      .
49.874   9.22    .     44.811  11.63  176
  .     11.95  176       .     10.85    .
39.442  13.08  174     60.055   8.63  170
50.541    .      .     37.388  14.03  186
44.754  11.12  176     47.273    .      .
51.855  10.33  166     49.156   8.95  180
40.836  10.95  168     46.672  10.00    .
46.774  10.25    .     50.388  10.08  168
39.407  12.63  174     46.080  11.17  156
45.441   9.63  164       .      8.92    .
45.118  11.08    .     39.203  12.88  168
45.790  10.47  186     50.545   9.93  148
48.673   9.40  186     47.920  11.50  170
47.467  10.50  170
;
11
Getting Started
proc mi data=FitMiss seed=501213 mu0=50 10 180 out=outmi;
   var Oxygen RunTime RunPulse;
run;
proc print data=outmi (obs=10);
   title 'First 10 Observations of the Imputed Data Set';
run;
By default, the procedure uses the Markov Chain
Monte Carlo (MCMC) method with a single chain to
create 5 imputations. The MI procedure takes 200
burn-in iterations before the first imputation
(to eliminate the dependence of the series on
the starting value of the chain and to reach the
stationary distribution) and 100 iterations
between imputations (to eliminate the dependence
between successive imputations). This is
reflected in the Model Information table of the
output.
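The arithmetic behind these defaults can be written out directly. The Python sketch below assumes that the first imputation is taken immediately after the burn-in iterations and each later one after a further 100 iterations. That is a reading of the description above, not a statement about PROC MI internals, and the helper name imputation_iterations is made up for this example.

```python
def imputation_iterations(n_impute=5, burn_in=200, between=100):
    """Iteration count at which each imputation is drawn from one chain."""
    # Assumption: imputation k (0-based) is taken after burn_in + k * between steps.
    return [burn_in + k * between for k in range(n_impute)]

single_chain = imputation_iterations()
print(single_chain)        # iteration at which each of the 5 imputations is taken
print(single_chain[-1])    # total iterations the single chain has to run
```

Under the same reading, running a separate chain for each imputation would need 5 x 200 = 1000 iterations instead of 600, since every chain repeats the burn-in.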
12
Getting Started
Model Information
Data Set                            WORK.FITMISS
Method                              MCMC
Multiple Imputation Chain           Single Chain
Initial Estimates for MCMC          EM Posterior Mode
Start                               Starting Value
Prior                               Jeffreys
Number of Imputations               5
Number of Burn-in Iterations        200
Number of Iterations                100
Seed for random number generator    501213
A summary of the missing data patterns is then
given in the Missing Data Patterns table (next
page).
13
Getting Started
(see page 10)
Missing Data Patterns
Group  Oxygen  RunTime  RunPulse  Freq  Percent
  1      X        X        X       21    67.74
  2      X        X        .        4    12.90
  3      X        .        .        3     9.68
  4      .        X        X        1     3.23
  5      .        X        .        2     6.45
Missing Data Patterns: Group Means
Group      Oxygen     RunTime     RunPulse
  1     46.353810   10.809524   171.666667
  2     47.109500   10.137500      .
  3     52.461667      .           .
  4        .        11.950000   176.000000
  5        .         9.885000      .
14
Getting Started
The Parameter Estimates table summarizes the
descriptive statistics for the imputed data sets.
Multiple Imputation Parameter Estimates
Variable        Mean   Std Error   95% Confidence Limits        DF
Oxygen     47.155085    0.989566    45.1241     49.1861      26.84
RunTime    10.544837    0.265540     9.9995     11.0901     26.523
RunPulse  172.180912    1.982233   168.0408    176.3210     19.613
Multiple Imputation Parameter Estimates
                                                 t for H0:
Variable     Minimum      Maximum         Mu0    Mean=Mu0   Pr > |t|
Oxygen     47.002999    47.395550   50.000000       -2.87     0.0078
RunTime    10.486098    10.600720   10.000000        2.05     0.0502
RunPulse  170.983775   173.005487  180.000000       -3.94     0.0008
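As a quick check on the test column, the t statistic for H0: Mean = Mu0 should equal (Mean - Mu0) / Std Error. The following Python sketch uses only the values printed in the table above. The dictionary layout and the function name t_statistic are assumptions for illustration.

```python
# (mean, standard error, mu0) copied from the Parameter Estimates table
estimates = {
    "Oxygen":   (47.155085, 0.989566, 50.0),
    "RunTime":  (10.544837, 0.265540, 10.0),
    "RunPulse": (172.180912, 1.982233, 180.0),
}

def t_statistic(mean, std_error, mu0):
    """t statistic for the hypothesis that the mean equals mu0."""
    return (mean - mu0) / std_error

t_values = {name: round(t_statistic(*row), 2) for name, row in estimates.items()}
print(t_values)
```

The rounded values agree with the t column of the table. The p-values additionally need the t distribution with the listed degrees of freedom, which this sketch does not compute.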
15
Getting Started
A listing of the first 10 observations shows
that the imputed values have a different
precision than the original data. This can be
corrected by using the ROUND option, e.g.,
ROUND=0.001 0.01 0.1 would be appropriate here.

Obs  _Imputation_   Oxygen   RunTime  RunPulse
  1       1        44.6090   11.3700   178.000
  2       1        45.3130   10.0700   185.000
  3       1        54.2970    8.6500   156.000
  4       1        59.5710    8.0747   155.925
  5       1        49.8740    9.2200   176.837
  6       1        44.8110   11.6300   176.000
  7       1        42.8857   11.9500   176.000
  8       1        46.9992   10.8500   173.099
  9       1        39.4420   13.0800   174.000
 10       1        60.0550    8.6300   170.000
16
Monotone Missing Patterns
Regression Method
17
Example Regression Method
/*----------- Fishes of Species Bream ----------*/
data Fish1;
   title 'Fish Measurement Data';
   input Length1 Length2 Length3 @@;
   datalines;
23.2 25.4 30.0    24.0 26.3 31.2    23.9 26.5 31.1
26.3 29.0 33.5    26.5 29.0   .     26.8 29.7 34.7
26.8   .    .     27.6 30.0 35.0    27.6 30.0 35.1
28.5 30.7 36.2    28.4 31.0 36.2    28.7   .    .
29.1 31.5   .     29.5 32.0 37.3    29.4 32.0 37.2
29.4 32.0 37.2    30.4 33.0 38.3    30.4 33.0 38.5
30.9 33.5 38.6    31.0 33.5 38.7    31.3 34.0 39.5
31.4 34.0 39.2    31.5 34.5   .     31.8 35.0 40.6
31.9 35.0 40.5    31.8 35.0 40.9    32.0 35.0 40.6
32.7 36.0 41.5    32.8 36.0 41.6    33.5 37.0 42.6
35.0 38.5 44.1    35.0 38.5 44.0    36.2 39.5 45.3
37.4 41.0 45.9    38.0 41.0 46.5
;
18
Example Regression Method
proc mi data=Fish1 seed=137851 out=outmi1;
   monotone method=regression;
   var Length1 Length2 Length3;
run;
The VAR statement specifies the (monotone) order
of the variables. The default number of
imputations is 5.
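The idea behind the regression method can be sketched in a few lines of Python. This is an illustration under simplifying assumptions (a single predictor, a noninformative prior, and a hypothetical helper named impute_by_regression), not PROC MI's implementation. It fits the regression on the observed cases, draws a residual variance and coefficients from their approximate posterior, and fills each missing value with a prediction plus random noise. The rows are a subset of the Fish1 data.

```python
import math
import random

def impute_by_regression(x, y, rng):
    """Return y with each None replaced by one random draw from a regression on x."""
    obs = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    n = len(obs)
    sx = sum(xi for xi, _ in obs)
    sy = sum(yi for _, yi in obs)
    sxx = sum(xi * xi for xi, _ in obs)
    sxy = sum(xi * yi for xi, yi in obs)
    det = n * sxx - sx * sx
    b1 = (n * sxy - sx * sy) / det           # least-squares slope
    b0 = (sy - b1 * sx) / n                  # least-squares intercept
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in obs)

    # Draw the residual standard deviation: sigma*^2 = SSE / chi-square(n - 2).
    sigma_star = math.sqrt(sse / rng.gammavariate((n - 2) / 2.0, 2.0))

    # Draw coefficients: beta* = beta_hat + sigma* L z, where L L' = (X'X)^-1.
    v00, v01, v11 = sxx / det, -sx / det, n / det
    l00 = math.sqrt(v00)
    l10 = v01 / l00
    l11 = math.sqrt(v11 - l10 * l10)
    z0, z1 = rng.gauss(0, 1), rng.gauss(0, 1)
    b0_star = b0 + sigma_star * l00 * z0
    b1_star = b1 + sigma_star * (l10 * z0 + l11 * z1)

    # Fill each missing value with its prediction plus fresh noise.
    return [yi if yi is not None
            else b0_star + b1_star * xi + sigma_star * rng.gauss(0, 1)
            for xi, yi in zip(x, y)]

# Rows 4-14 of Fish1: Length1 is complete, Length2 has two gaps (None).
length1 = [26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5, 28.4, 28.7, 29.1, 29.5]
length2 = [29.0, 29.0, 29.7, None, 30.0, 30.0, 30.7, 31.0, None, 31.5, 32.0]
completed = impute_by_regression(length1, length2, random.Random(137851))
print([round(v, 2) for v in completed])
```

Calling the function m times with a different random stream each time would give m completed versions of Length2. For Length3, the same step would be repeated with Length1 and Length2 as predictors, following the monotone order.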
Model Information
Data Set                            WORK.FISH1
Method                              Regression
Number of Imputations               5
Seed for random number generator    137851
Missing Data Patterns
Group  Length1  Length2  Length3  Freq  Percent
  1       X        X        X      30    85.71
  2       X        X        .       3     8.57
  3       X        .        .       2     5.71
19
Example Regression Method
Multiple Imputation Parameter Estimates
Variable       Mean   Std Error   95% Confidence Limits        DF
Length2   33.105285    0.662744   31.75558    34.45499     32.158
Length3   38.366222    0.703892   36.93271    39.79974     32.151
Variable    Minimum     Maximum
Length2   33.097416   33.112194
Length3   38.349572   38.382455
20
Example Regression Method
First 10 Observations of the Imputed Data Set
Obs  _Imputation_  Length1  Length2  Length3
  1       1          23.2   25.4000  30.0000
  2       1          24.0   26.3000  31.2000
  3       1          23.9   26.5000  31.1000
  4       1          26.3   29.0000  33.5000
  5       1          26.5   29.0000  33.7670
  6       1          26.8   29.7000  34.7000
  7       1          26.8   29.2644  34.0233
  8       1          27.6   30.0000  35.0000
  9       1          27.6   30.0000  35.1000
 10       1          28.5   30.7000  36.2000
Use the option ROUND=0.1 to force the imputed
values to have the same precision as the observed
values.
21
Monotone Missing Patterns
Propensity Score Method
22
Monotone Missing Patterns
Approximate Bayesian Bootstrap
23
Monotone Missing Patterns
Propensity Score Method
- Uses only covariate information associated
  with whether the imputed values are missing.
- Does not use correlations among the variables.
- Is effective for inferences about
  distributions of imputed variables, such as
  univariate analyses.
- Is not appropriate for analysis involving
  relationships among variables, such as
  regression analyses.
See the article by Paul Allison for an argument
against using the propensity score method:
http://www.ssc.upenn.edu/allison/Papers
24
Example Propensity Method
proc mi data=Fish1 seed=899603 out=outex2;
   monotone method=propensity;
   var Length1 Length2 Length3;
run;
proc print data=outex2(obs=10);
   title 'First 10 Observations of the Imputed Data Set';
run;
The MI Procedure
Model Information
Data Set                            WORK.FISH1
Method                              Propensity
Number of Imputations               5
Number of Groups on Propensity      5
Seed for random number generator    899603
25
Example Propensity Method
Missing Data Patterns
                                                -----------Group Means-----------
Group  Length1  Length2  Length3  Freq  Percent    Length1     Length2     Length3
  1       X        X        X      30    85.71   30.603333   33.436667   38.720000
  2       X        X        .       3     8.57   29.033333   31.666667      .
  3       X        .        .       2     5.71   27.750000      .           .
26
Example Propensity Method
First 10 Observations of the Imputed Data Set
Obs  _Imputation_  Length1  Length2  Length3
  1       1          23.2     25.4     30.0
  2       1          24.0     26.3     31.2
  3       1          23.9     26.5     31.1
  4       1          26.3     29.0     33.5
  5       1          26.5     29.0     38.6
  6       1          26.8     29.7     34.7
  7       1          26.8     29.0     35.1
  8       1          27.6     30.0     35.0
  9       1          27.6     30.0     35.1
 10       1          28.5     30.7     36.2
See page 17 to compare with the original data set
and page 20 to compare with the regression method.
Note that the ROUND option is not needed here
since all the imputed values are random draws from
the original data set.
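The reason the imputed values are always values already present in the data can be seen from a sketch of the approximate Bayesian bootstrap step. The Python below is a simplified illustration: it leaves out the grouping of observations by propensity score, and the function name and sample values are assumptions chosen for this example.

```python
import random

def approximate_bayesian_bootstrap(observed, n_missing, rng):
    """Draw n_missing imputations from the observed values in two stages."""
    # Stage 1: resample the observed values with replacement (same size).
    donor_pool = [rng.choice(observed) for _ in observed]
    # Stage 2: draw the imputations with replacement from that pool.
    return [rng.choice(donor_pool) for _ in range(n_missing)]

rng = random.Random(899603)
# A few observed Length3 values taken from the Fish1 data
observed_length3 = [33.5, 34.7, 35.0, 35.1, 36.2, 36.2, 37.3]
imputed = approximate_bayesian_bootstrap(observed_length3, 3, rng)
print(imputed)
```

Every imputed value is copied from an observed one, so it automatically has the same precision as the original data.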
27
Single Imputation with EM
The expectation-maximization (EM) algorithm is a
technique for maximum likelihood estimation in
parametric models for incomplete data. It can be
used in PROC MI to produce a single imputation
for each missing observation. This is done
either with the option NIMPUTE=0 on the PROC
statement or with the EM statement.
1. The expectation E-step: Given a set of
   parameter estimates, such as a mean vector and
   covariance matrix for a multivariate normal
   distribution, the E-step calculates the
   conditional expectation of the complete-data
   log likelihood given the observed data and the
   parameter estimates.
2. The maximization M-step: Given a complete-data
   log likelihood, the M-step finds the parameter
   estimates that maximize the complete-data log
   likelihood from the E-step.
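The two steps can be made concrete with a minimal Python sketch. It assumes a simplified setting that is not the general algorithm PROC MI uses: two jointly normal variables, the first fully observed and the second with gaps. The function name em_bivariate and the fixed iteration count are illustrative choices, and the rows are a subset of the Fish1 data.

```python
def em_bivariate(y1, y2, n_iter=200):
    """ML estimates (mu1, mu2, s11, s12, s22); entries of y2 may be None."""
    n = len(y1)
    observed = [v for v in y2 if v is not None]
    mu1 = sum(y1) / n
    s11 = sum((v - mu1) ** 2 for v in y1) / n
    # Start from available-case estimates with zero covariance.
    mu2 = sum(observed) / len(observed)
    s22 = sum((v - mu2) ** 2 for v in observed) / len(observed)
    s12 = 0.0
    for _ in range(n_iter):
        # E-step: expected sufficient statistics given the current parameters.
        slope = s12 / s11
        resid_var = s22 - s12 * slope
        sum_y2 = sum_y2sq = sum_y1y2 = 0.0
        for a, b in zip(y1, y2):
            if b is None:
                e_b = mu2 + slope * (a - mu1)     # E[y2 | y1]
                e_bb = e_b * e_b + resid_var      # E[y2^2 | y1]
            else:
                e_b, e_bb = b, b * b
            sum_y2 += e_b
            sum_y2sq += e_bb
            sum_y1y2 += a * e_b
        # M-step: re-estimate the parameters from the expected statistics.
        mu2 = sum_y2 / n
        s12 = sum_y1y2 / n - mu1 * mu2
        s22 = sum_y2sq / n - mu2 * mu2
    return mu1, mu2, s11, s12, s22

# Rows 4-14 of Fish1: Length1 is complete, Length2 has two gaps (None).
length1 = [26.3, 26.5, 26.8, 26.8, 27.6, 27.6, 28.5, 28.4, 28.7, 29.1, 29.5]
length2 = [29.0, 29.0, 29.7, None, 30.0, 30.0, 30.7, 31.0, None, 31.5, 32.0]
mu1, mu2, s11, s12, s22 = em_bivariate(length1, length2)
print(f"mean: {mu1:.4f} {mu2:.4f}  cov: {s11:.4f} {s12:.4f} {s22:.4f}")
```

A single imputation then replaces each missing Length2 with its conditional expectation mu2 + (s12 / s11) * (Length1 - mu1) evaluated at the final estimates.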
28
Single Imputation with EM
proc mi data=FitMiss seed=1518971 simple nimpute=0;
   em itprint outem=outem;
   var Oxygen RunTime RunPulse;
run;
proc print data=outem;
   title 'EM Estimates';
run;
(see page 10 for the missing data pattern for
this data set)
Model Information
Data Set                            WORK.FITMISS
Method                              MCMC
Multiple Imputation Chain           Single Chain
Initial Estimates for MCMC          EM Posterior Mode
Start                               Starting Value
Prior                               Jeffreys
Number of Imputations               0
Number of Burn-in Iterations        200
Number of Iterations                100
Seed for random number generator    1518971
29
Single Imputation with EM
(original data set has 31 observations)
Missing Data Patterns
Group  Oxygen  RunTime  RunPulse  Freq
  1      X        X        X       21
  2      X        X        .        4
  3      X        .        .        3
  4      .        X        X        1
  5      .        X        .        2
EM Estimates
Obs  _TYPE_  _NAME_      Oxygen   RunTime  RunPulse
  1   MEAN              47.1041   10.5549   171.382
  2   COV    Oxygen     27.7980   -6.4579   -18.031
  3   COV    RunTime    -6.4579    2.0155     3.516
  4   COV    RunPulse  -18.0308    3.5161    97.767
30
Markov Chain Monte Carlo (MCMC)
With arbitrary missing value patterns (i.e., not
a monotone missing pattern) the MCMC method must
be used. The MCMC method can be used either for
the full-data imputation or only up to the point
where the missing data pattern becomes monotone,
after which either the Regression or the
Propensity Score method completes the
imputations. In Bayesian inference, information
about unknown parameters is expressed in the form
of a posterior distribution. MCMC is used for
exploring posterior distributions. Through MCMC,
one can simulate the joint posterior distribution
of the unknown quantities and then obtain
estimates of posterior parameters.
31
Markov Chain Monte Carlo (MCMC)
Assuming the data are from a multivariate normal
distribution, data augmentation is applied to
Bayesian inference with missing data by repeating
the following steps:
- Imputation I-step
- Posterior P-step
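The alternation of the two steps can be sketched for the simplest possible case. The Python below assumes a single normal variable and the noninformative prior p(mu, sigma2) proportional to 1/sigma2. Both are simplifications chosen for this illustration, as is the function name data_augmentation. The multivariate version draws a mean vector and covariance matrix instead.

```python
import math
import random

def data_augmentation(values, n_iter, rng):
    """Alternate I-steps and P-steps; return the list of (mu, sigma2) draws."""
    observed = [v for v in values if v is not None]
    n = len(values)
    mu = sum(observed) / len(observed)
    sigma2 = sum((v - mu) ** 2 for v in observed) / (len(observed) - 1)
    draws = []
    for _ in range(n_iter):
        # I-step: draw each missing value from N(mu, sigma2).
        completed = [v if v is not None else rng.gauss(mu, math.sqrt(sigma2))
                     for v in values]
        # P-step: draw (mu, sigma2) from the posterior given the completed data
        # (assumed prior: proportional to 1/sigma2).
        mean = sum(completed) / n
        ss = sum((v - mean) ** 2 for v in completed)
        sigma2 = ss / rng.gammavariate((n - 1) / 2.0, 2.0)   # ss / chi-square(n-1)
        mu = rng.gauss(mean, math.sqrt(sigma2 / n))
        draws.append((mu, sigma2))
    return draws

# First ten Oxygen values of the FitMiss data (None marks a missing value)
oxygen = [44.609, 45.313, 54.297, 59.571, 49.874, 44.811, None, None, 39.442, 60.055]
draws = data_augmentation(oxygen, 500, random.Random(21355417))
kept = draws[200:]                     # discard an assumed burn-in of 200 draws
print(sum(mu for mu, _ in kept) / len(kept))
```

Each pass produces one draw of the parameters, and the completed data from a pass after burn-in would serve as one imputed data set.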
32
Markov Chain Monte Carlo (MCMC)
I-Step
33
Markov Chain Monte Carlo (MCMC)
P-Step
34
Markov Chain Monte Carlo (MCMC)
MCMC Method in Theory
35
Markov Chain Monte Carlo (MCMC)
MCMC Method in Practice
36
Markov Chain Monte Carlo (MCMC)
proc mi data=FitMiss seed=21355417 nimpute=6
        mu0=50 10 180 out=mcmc;
   mcmc chain=multiple displayinit initial=em(itprint);
   var Oxygen RunTime RunPulse;
run;
proc print data=mcmc(obs=10);
   title 'First 10 Observations of the Imputed Data Set';
run;
PROC MI options used:
   mu0=                    specifies values for hypothesis testing
                           with means of imputed values.
MCMC options:
   chain=single/multiple   specifies whether a single chain is used
                           for all imputations or a separate chain
                           is used for each imputation.
   displayinit             displays initial parameter values for
                           each imputation.
   initial=em              specifies initial mean and covariance
                           estimates calculated by the EM algorithm.
37
Markov Chain Monte Carlo (MCMC)
EM (Posterior Mode) Iteration History
_Iteration_    -2 Log L   -2 Log Posterior     Oxygen     RunTime     RunPulse
     0       254.482800      282.909590     47.104086   10.554864   171.381796
     1       255.081159      282.051588     47.104079   10.554859   171.381708
     2       255.271405      282.017488     47.104077   10.554858   171.381669
     3       255.318621      282.015372     47.104002   10.554524   171.381853
     4       255.330259      282.015232     47.103861   10.554388   171.382058
     5       255.333160      282.015222     47.103797   10.554341   171.382152
     6       255.333896      282.015222     47.103774   10.554325   171.382186
     7       255.334085      282.015222     47.103766   10.554320   171.382197
EM (Posterior Mode) Estimates
_TYPE_  _NAME_        Oxygen     RunTime     RunPulse
MEAN               47.103766   10.554320   171.382197
COV     Oxygen     24.549968   -5.726112   -15.926034
COV     RunTime    -5.726112    1.781407     3.124798
COV     RunPulse  -15.926034    3.124798    83.164044
Initial Parameter Estimates for MCMC
_TYPE_  _NAME_        Oxygen     RunTime     RunPulse
MEAN               47.103766   10.554320   171.382197
COV     Oxygen     24.549968   -5.726112   -15.926034
COV     RunTime    -5.726112    1.781407     3.124798
COV     RunPulse  -15.926034    3.124798    83.164044
38
Markov Chain Monte Carlo (MCMC)
Multiple Imputation Parameter Estimates
Variable        Mean   Std Error   95% Confidence Limits        DF
Oxygen     47.164819    0.994145    45.1212     49.2085     25.958
RunTime    10.549936    0.273312     9.9880     11.1118     25.902
RunPulse  170.969836    3.010920   163.9615    177.9782     7.5938
Multiple Imputation Parameter Estimates
           t for H0:
Variable   Mean=Mu0   Pr > |t|
Oxygen        -2.85     0.0084
RunTime        2.01     0.0547
RunPulse      -3.00     0.0182
39
Markov Chain Monte Carlo (MCMC)
First 10 Observations of the Imputed Data Set
Obs  _Imputation_   Oxygen   RunTime  RunPulse
  1       1        44.6090   11.3700   178.000
  2       1        45.3130   10.0700   185.000
  3       1        54.2970    8.6500   156.000
  4       1        59.5710    7.4870   128.991
  5       1        49.8740    9.2200   154.361
  6       1        44.8110   11.6300   176.000
  7       1        44.7499   11.9500   176.000
  8       1        49.7774   10.8500   181.244
  9       1        39.4420   13.0800   174.000
 10       1        60.0550    8.6300   170.000
40
Markov Chain Monte Carlo (MCMC)
Checking Convergence in MCMC
proc mi data=FitMiss seed=42037921 noprint nimpute=2;
   mcmc timeplot(mean(Oxygen)) acfplot(mean(Oxygen));
   var Oxygen RunTime RunPulse;
run;
timeplot option: plots successive parameter
estimates against iteration number i. Long-term
trends in the plot indicate that successive
iterations are highly correlated and that the
series of iterations has not converged.
acfplot option: plots the autocorrelation
function of successive iterates.
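What the autocorrelation plot measures can be reproduced with a short Python sketch. The function autocorrelation and the two artificial series are assumptions made for illustration and are not output from PROC MI: one series has no trend, the other drifts steadily.

```python
import random

def autocorrelation(series, max_lag):
    """Sample autocorrelation of a sequence for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [v - mean for v in series]
    denom = sum(c * c for c in centered)
    return [sum(centered[t] * centered[t + lag] for t in range(n - lag)) / denom
            for lag in range(max_lag + 1)]

rng = random.Random(42037921)
stable = [rng.gauss(47.0, 1.0) for _ in range(500)]    # independent draws, no trend
drifting = [47.0 + 0.01 * t for t in range(500)]       # steady long-term trend
acf_stable = autocorrelation(stable, 3)
acf_drifting = autocorrelation(drifting, 3)
print([round(r, 2) for r in acf_stable])
print([round(r, 2) for r in acf_drifting])
```

Autocorrelations near zero beyond lag 0 are what one hopes to see for the iterates of a converged chain. Values that stay close to 1, as in the drifting series, suggest the chain needs more iterations between imputations.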
41
Markov Chain Monte Carlo (MCMC)
No apparent trends in successive iterates
42
Markov Chain Monte Carlo (MCMC)
No autocorrelations in successive iterates
43
PROC MIANALYZE
PROC MIANALYZE combines the results of the
analyses of imputations and generates valid
statistical inferences. Typical sequence of
steps:
1. Run PROC MI to generate a data set with m
   sets of imputed values.
2. Run a statistical procedure, e.g., PROC REG,
   using BY _IMPUTATION_ and create an output
   data set with the combined results, e.g., m
   sets of regression estimates.
3. Run PROC MIANALYZE using the output data set
   from Step 2 as input.
44
PROC MIANALYZE
Combining Inferences from Imputed Data Sets
45
PROC MIANALYZE
Combining Inferences from Imputed Data Sets
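For reference, the combining rules usually attributed to Rubin (1987) can be written as follows. The notation is an assumption chosen for this note: $\hat{Q}_i$ and $\hat{U}_i$ are the point estimate and its variance estimate from the i-th imputed data set, and the within, between, and total quantities correspond to the variance columns that PROC MIANALYZE reports.

```latex
\begin{align*}
\bar{Q} &= \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, &
\bar{U} &= \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i \quad\text{(within)}, \\
B &= \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^{2} \quad\text{(between)}, &
T &= \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B \quad\text{(total)}, \\
r &= \frac{(1+m^{-1})\,B}{\bar{U}}, &
\nu &= (m-1)\Bigl(1+\frac{1}{r}\Bigr)^{2}.
\end{align*}
```

Inference is then based on treating $(Q-\bar{Q})/\sqrt{T}$ as approximately t-distributed with $\nu$ degrees of freedom, where $r$ is the relative increase in variance due to missing values.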
46
PROC MIANALYZE
Why are only a few imputations needed?
Many are surprised by the claim that only
3-10 imputations may be needed. Rubin (1987, p.
114) shows that the efficiency of an estimate
based on m imputations is approximately
$(1 + \gamma/m)^{-1}$, where $\gamma$ is the rate
of missing information for the quantity being
estimated. The efficiencies achieved for various
values of m and rates of missing information are
shown below. Unless the rate of missing
information is very high, there is simply little
advantage to producing and analyzing more than a
few imputed data sets.
47
PROC MIANALYZE
Why are only a few imputations needed?
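The efficiencies can be tabulated with a few lines of Python, using the approximation 1 / (1 + gamma/m) for the efficiency of m imputations relative to an infinite number, where gamma is the rate of missing information. The particular values of m and gamma below are choices made for this sketch and may differ from those on the original slide.

```python
def relative_efficiency(m, gamma):
    """Approximate efficiency of m imputations: 1 / (1 + gamma / m)."""
    return 1.0 / (1.0 + gamma / m)

rates = [0.1, 0.3, 0.5, 0.7, 0.9]      # assumed rates of missing information
print("  m  " + "  ".join(f"{g:>6.0%}" for g in rates))
for m in (3, 5, 10, 20):               # assumed numbers of imputations
    row = "  ".join(f"{relative_efficiency(m, g):.4f}" for g in rates)
    print(f"{m:>3}  {row}")
```

With 50% missing information, five imputations already give an efficiency of about 91%, which is the point the slide makes.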
48
PROC MIANALYZE
proc mi data=FitMiss noprint out=outmi seed=3237851;
   var Oxygen RunTime RunPulse;
run;
proc reg data=outmi outest=outreg covout noprint;
   model Oxygen = RunTime RunPulse;
   by _Imputation_;
run;
proc print data=outreg(obs=8);
   var _Imputation_ _Type_ _Name_ Intercept RunTime RunPulse;
   title 'Parameter Estimates from Imputed Data Sets';
run;
proc mianalyze data=outreg;
   var Intercept RunTime RunPulse;
run;
49
PROC MIANALYZE
Parameter Estimates from Imputed Data Sets
Obs  _Imputation_  _TYPE_  _NAME_     Intercept   RunTime   RunPulse
  1       1        PARMS                 86.544  -2.82231   -0.05873
  2       1        COV     Intercept    100.145  -0.53519   -0.55077
  3       1        COV     RunTime       -0.535   0.10774   -0.00345
  4       1        COV     RunPulse      -0.551  -0.00345    0.00343
  5       2        PARMS                 92.451  -2.89662   -0.08750
  6       2        COV     Intercept     64.527  -0.37466   -0.35512
  7       2        COV     RunTime       -0.375   0.10754   -0.00446
  8       2        COV     RunPulse      -0.355  -0.00446    0.00237
50
PROC MIANALYZE
The MIANALYZE Procedure
Model Information
Data Set                 WORK.OUTREG
Number of Imputations    5
Multiple Imputation Variance Information
           ------------Variance------------             Relative Increase
Parameter    Between      Within       Total        DF        in Variance
Intercept   7.405948   80.807859   89.694996    407.45           0.109979
RunTime     0.033768    0.114730    0.155252    58.716           0.353194
RunPulse    0.000182    0.002720    0.002938    727.07           0.080115
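The Total, DF, and Relative Increase columns follow from the Between and Within columns. The Python sketch below assumes the usual combining rules with m = 5: total = within + (1 + 1/m) * between, relative increase r = (1 + 1/m) * between / within, and df = (m - 1) * (1 + 1/r)^2. The function name combine is an illustrative choice. The RunPulse row is left out because its Between value is printed with too few significant digits to reproduce the remaining columns closely.

```python
import math

def combine(between, within, m):
    """Total variance, relative increase in variance, and degrees of freedom."""
    inflated_between = (1.0 + 1.0 / m) * between
    total = within + inflated_between
    rel_increase = inflated_between / within
    df = (m - 1) * (1.0 + 1.0 / rel_increase) ** 2
    return total, rel_increase, df

# (Between, Within) copied from the Variance Information table, m = 5
table = {"Intercept": (7.405948, 80.807859), "RunTime": (0.033768, 0.114730)}
results = {name: combine(b, w, 5) for name, (b, w) in table.items()}
for name, (total, r, df) in results.items():
    print(f"{name:<10} total={total:.6f} se={math.sqrt(total):.6f} "
          f"r={r:.6f} df={df:.2f}")
```

The square root of the total variance is the standard error that PROC MIANALYZE reports in its Parameter Estimates table.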
51
PROC MIANALYZE
Multiple Imputation Parameter Estimates
Parameter    Estimate   Std Error   95% Confidence Limits       DF
Intercept   91.396566    9.470744   72.77895    110.0142    407.45
RunTime     -2.980954    0.394020   -3.76947     -2.1924    58.716
RunPulse    -0.076286    0.054202   -0.18270      0.0301    727.07
52
For More Information
Visit the SAS website.
General information on Multiple Imputation in SAS:
http://support.sas.com/rnd/app/da/new/dami.html
SUGI paper by SAS developer:
http://support.sas.com/rnd/app/papers/multipleimputation.pdf
SAS Documentation on MI and MIANALYZE:
http://support.sas.com/rnd/app/papers/miv802.pdf
http://support.sas.com/rnd/app/papers/mianalyze.pdf
53
More Websites
Multiple Imputation Online
http://www.multiple-imputation.com/
Multiple Imputation FAQ Page
http://www.stat.psu.edu/jls/mifaq.html
Multiple Imputation paper by J. Hox and E. de Leeuw
http://www.fss.uu.nl/ms/jh/papers/zad4p1.pdf
A Paradox of Multiple Imputation by Phil Kott
http://www.nass.usda.gov/research/reports/MIPAP11.pdf
54
More Websites
Horton, N.J. and Lipsitz, S.R. (2001). Multiple
imputation in practice: Comparison of software
packages for regression models with missing
variables. American Statistician, 55, 244-254.
http://www.biostat.harvard.edu/horton/tasimpute.pdf
Homepage of Joseph Schafer. Many papers and
handouts from ASA courses on Multiple
Imputation.
http://www.stat.psu.edu/jls/
