Title: Assessing Disclosure Risk and Analytical Validity for the SIPP-SSA-IRS Public Use File Beta Version 4.1
1. Assessing Disclosure Risk and Analytical Validity for the SIPP-SSA-IRS Public Use File Beta Version 4.1
- John M. Abowd, U.S. Census Bureau and Cornell University
- CNSTAT Panel on the Census Bureau's Dynamics of Economic Well-being System
- January 26, 2007
2. Background
- Longstanding goal of the Census Bureau
- Statutory mandate to provide survey data used to study critical policy issues
- Focus of long-standing internal Census Bureau survey improvement project that is part of the LEHD Program
- This is the first Title 13/Chapter 5 predominant purpose for using IRS data
- Treasury Regulation Change, February 2001 (final regulation February 2003)
- New W-2 items authorized: SSN, EIN, Box 1, Box 3, Box 13, number of quarters, 1099R
- Creation of a public use data set that integrates survey and administrative data is the other predominant Title 13/Chapter 5 purpose
3. Team and Sponsorship
- The project was conducted by a team of researchers from the Census Bureau, IRS, Social Security Administration, and a consortium of university partners
- Main financial support provided by the Census Bureau, Social Security Administration, and the National Science Foundation
- Primary design decisions made by an inter-agency team led by Martha Stinson at the Census Bureau, with the participation of SSA, IRS, the Congressional Budget Office, and the Joint Committee on Taxation
4. Acknowledgements: Research Team
- Martha Stinson (Census Bureau), project manager
- Gary Benedetto, Lisa Dragoset, Sam Hawala, Bryan Ricchetti (Census Bureau)
- Karen Masken (IRS)
- Simon Woodcock (Simon Fraser University), Jerry Reiter (Duke University), Josep Domingo-Ferrer (University of Rovira and Virgili), Vicenc Torra (University of Barcelona), Lars Vilhuber (Cornell University and Census Bureau), consultants
5. Acknowledgements I: Agencies
- Kenneth Prewitt, C. Louis Kincannon, Hermann Habermann, Paula Schneider, Nancy Gordon, Frederick Knickerbocker, Cynthia Clark, Howard Hogan, and Thomas Mesenbourg, senior management, Census Bureau
- Susan Grad, Howard Iams, and Paul van de Water, senior management, SSA
- Mark Mazur and Nicholas Greenia, IRS senior management and IRS/SOI Census Bureau disclosure liaison
- Daniel Newlon, NSF project officer
6. Acknowledgements II: Agencies
- Chet Bowie, Al Tupek, Barry Sessamen, Dan Weinberg, Ron Prevost, Jeremy Wu, division and program management, Census Bureau
- Brian Greenberg, Dawn Haynes, SSA technical support, contract management, and disclosure officers
- Patricia Doyle, Judith Eargle, and Nancy Bates, Census Bureau SIPP research direction
- Charlene Leggieri and Sally Obenski, Census Bureau administrative records management
- Laura Zayatz, Census Bureau statistical disclosure research direction
- John Sabelhaus, Congressional Budget Office research direction
7. Conceptual Framework
- Link all SIPP panels from the 1990s
  - Five panels: 1990, 1991, 1992, 1993, 1996
- Link to IRS data
  - Summary Earnings Records (FICA taxable earnings 1937-1950, and 1951-2003 annual)
  - Detailed Earnings Record (job level data, uncapped, 1978-2003 annual)
- SSA benefits data
  - Master Beneficiary Record, Supplemental Security Record, Payment History Update System, 831 file (all available historical data through 2002)
- Create product that prevents individuals from being re-identified in the current public use SIPP files
8. Major Design Decisions
- Limit number of SIPP variables included
- Target national retirement and disability research communities
- Investigate disclosure avoidance methods to protect both survey and administrative data
- But, note that a re-identification in the current SIPP public use files is not a disclosure since those files have also been subjected to extensive disclosure avoidance procedures
- Very high hurdle
9. Latest Versions
- Gold Standard confidential file at release 4.0
  - All confidential data (person-level), all sources
- Beta Public Use File 4.1
  - All person-level SIPP, IRS variables from the Gold Standard Version 4.0
  - Benefit and type of benefit measures for initial SSA benefit (if any), benefit and type of benefit as of April 1, 2000
  - Consistent panel weight for civilian, non-institutional population as of April 1, 2000 (synthesized on each implicate)
  - Four missing data implicates with four synthetic implicates each (16 implicates total)
10. Summary of Discussion Today
- A tour of the methods used to complete and synthesize the SIPP-PUF
- Some disclosure avoidance results
- Selected analytical validity results
11. Multiple Imputation Confidentiality Protection: History
- Rubin (1993): treat unsampled individuals in population as missing the survey data, impute missing values (synthetic population), sample and release (fully synthetic data)
- Little (1993): treat sensitive values as missing, impute and release imputed values (partially synthetic data)
- Fienberg (1994): parametric Bayesian procedure eliminated the use of any actual values in synthetic data
- Raghunathan, Reiter, and Rubin (2003): adapted the Sequential Regression Multivariate Imputation method to synthetic data
- Reiter (2004): inference-valid combination of multiple imputation for missing and synthetic data
- Abowd and Woodcock (2001): applied SRMI to confidentiality protection of longitudinally linked employer-employee synthetic micro-data
12. Multiple Imputation Confidentiality Protection: Methods
- Denote confidential data by $Y$ and nonconfidential data by $X$ (may be empty)
- Both $Y$ and $X$ may contain missing data, so that $Y = (Y_{obs}, Y_{mis})$ and $X = (X_{obs}, X_{mis})$
- Assume database can be represented by joint density $p(Y, X, \theta)$
- Estimate the posterior predictive distribution $p(Y_{new}, X_{new} \mid Y_{obs}, X_{obs})$
- Sample multiple times from the posterior predictive distribution, release these samples
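The last two bullets can be illustrated with a minimal sketch. It assumes a toy univariate normal model with a flat prior, which the slides do not specify; the function name and defaults are illustrative only. Each implicate first draws the parameters from their posterior, then draws new data given those parameters.

```python
import random
import statistics


def posterior_predictive_implicates(y_obs, n_implicates=4, seed=0):
    """Draw synthetic copies of y_obs from a normal posterior predictive.

    Toy assumption (not from the slides): y ~ N(mu, sigma^2) with the
    flat prior p(mu, sigma^2) proportional to 1 / sigma^2.
    """
    rng = random.Random(seed)
    n = len(y_obs)
    ybar = statistics.fmean(y_obs)
    s2 = statistics.variance(y_obs)  # sample variance, divisor n - 1
    implicates = []
    for _ in range(n_implicates):
        # sigma^2 | y  ~  (n - 1) s^2 / chi-square(n - 1)
        chi2 = rng.gammavariate((n - 1) / 2.0, 2.0)
        sigma2 = (n - 1) * s2 / chi2
        # mu | sigma^2, y  ~  N(ybar, sigma^2 / n)
        mu = rng.gauss(ybar, (sigma2 / n) ** 0.5)
        # New data given the drawn parameters
        implicates.append([rng.gauss(mu, sigma2 ** 0.5) for _ in range(n)])
    return implicates


if __name__ == "__main__":
    y = [9.8, 10.4, 10.1, 9.5, 10.9, 10.2, 9.9, 10.6]
    for m, synthetic in enumerate(posterior_predictive_implicates(y), start=1):
        print(m, [round(v, 2) for v in synthetic])
```

Because the parameters are redrawn for every implicate, the variation across released samples reflects estimation uncertainty as well as sampling noise.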
13. Sequential Regression Multivariate Imputation (SRMI) Method
- Synthetic data values are draws from the posterior predictive density
- In practice, use a two-step procedure: (1) complete the missing data using SRMI; (2) draw synthetic data from the predictive density given the completed data
- Repeating the procedure yields multiple synthetic data implicates
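14. SRMI Method Details
- Specifying the joint density $p(Y, X, \theta)$ is unrealistic in most applications
- Instead, approximate the joint density by a sequence of conditional densities defined by generalized linear models
- Synthetic values of some variable $y_k$ are draws from $\tilde{y}_k \sim \int p_k(y_k \mid Y^m_{\sim k}, X^m, \theta_k)\, p_k(\theta_k \mid Y^m, X^m)\, d\theta_k$, where $Y^m, X^m$ are completed data ($Y^m_{\sim k}$ denotes all completed confidential variables other than $y_k$), and densities $p_k$ are defined by an appropriate generalized linear model and prior, a Dirichlet-multinomial model, or a Bayesian Bootstrap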
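The idea of replacing one joint density with a sequence of conditional models can be sketched as follows. This is a simplified illustration, not the production procedure: it assumes only two continuous variables, uses ordinary least squares for each conditional, and plugs in the fitted coefficients instead of drawing them from a posterior. All names and the default iteration count are hypothetical.

```python
import random


def fit_line(x, y):
    """Least-squares fit y = b0 + b1 * x; returns (b0, b1, residual sd)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    b0 = my - b1 * mx
    rss = sum((b - b0 - b1 * a) ** 2 for a, b in zip(x, y))
    return b0, b1, (rss / (n - 2)) ** 0.5


def sequential_regression_complete(rows, n_iter=9, seed=0):
    """Fill None entries in two-column rows by cycling through conditionals."""
    rng = random.Random(seed)
    missing = [[v is None for v in row] for row in rows]
    data = [list(row) for row in rows]
    # Starting values: observed column means
    for j in range(2):
        observed = [row[j] for row in rows if row[j] is not None]
        mean_j = sum(observed) / len(observed)
        for row in data:
            if row[j] is None:
                row[j] = mean_j
    for _ in range(n_iter):
        for j in range(2):
            other = 1 - j
            # Fit variable j on the other variable where j was observed
            fit = [i for i in range(len(data)) if not missing[i][j]]
            b0, b1, sd = fit_line([data[i][other] for i in fit],
                                  [data[i][j] for i in fit])
            # Redraw only the originally missing entries of variable j
            for i, row in enumerate(data):
                if missing[i][j]:
                    row[j] = b0 + b1 * row[other] + rng.gauss(0.0, sd)
    return data
```

A fuller version would draw the regression coefficients and variance from their posterior at each step, so that the imputations are proper posterior predictive draws.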
15. Maintaining Relationships in the Underlying Data
- Define a multilevel parent-child tree to describe the exact relationships in the data
- Variables at the root of this tree should have values for all individuals, completed and synthesized first (but as a function of all data)
- Child variables only completed or synthesized when appropriate given the parent variable
- For missing data, iterate nine times to complete all missing data, sample 4 implicates
- For synthetic data, condition on values from the completed data, sample 4 implicates per completed implicate
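A minimal sketch of the parent-child rule, under the assumption of a small hypothetical tree of marital-history variables (the variable names, the tree, and the toy draw function are illustrative and are not the PUF's actual variables). A child is drawn only when its parent takes the value that makes the child meaningful; otherwise it is left structurally missing.

```python
import random

# Hypothetical tree: each child is defined only when its parent equals "when".
# Insertion order puts every parent before its children.
TREE = {
    "ever_married": {"parent": None, "when": None},
    "marriage_year": {"parent": "ever_married", "when": True},
    "divorced": {"parent": "ever_married", "when": True},
    "divorce_year": {"parent": "divorced", "when": True},
}


def synthesize_record(draw, tree=TREE):
    """Walk the tree from the root, drawing a child only when appropriate."""
    record = {}
    for name, node in tree.items():
        parent = node["parent"]
        if parent is None or record.get(parent) == node["when"]:
            record[name] = draw(name, record)
        else:
            record[name] = None  # structurally missing, never synthesized
    return record


def make_toy_draw(rng):
    """Stand-in for the fitted conditional models (purely illustrative)."""
    def draw(name, record):
        if name in ("ever_married", "divorced"):
            return rng.random() < 0.5
        if name == "marriage_year":
            return rng.randint(1950, 1995)
        if name == "divorce_year":
            return rng.randint(record["marriage_year"], 2000)
        raise KeyError(name)
    return draw


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(synthesize_record(make_toy_draw(rng)))
```

Walking the tree in parent-first order guarantees that a synthetic record never contains, for example, a divorce year for someone who was never married.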
16. Maintaining Multivariate Distributions
- Automated creation and management of stratifying (grouping) variables and conditioning variables
- Bayesian bootstrap procedure for sets of related discrete variables estimated using the automated grouping
- SRMI procedure for most continuous variables using automated grouping, conditioning variable management, Bayesian model selection
17. Maintaining Univariate Distributions
- Automated management of sets of related continuous variables (e.g., earnings histories)
- Within stratifying groups, automated management of a non-parametric transform with inverse transform to preserve the univariate distribution of all continuous variables within group
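One way to picture a non-parametric transform with an inverse is a normal-scores mapping. The sketch below is an assumption for illustration: it uses the empirical CDF (ranks) within a single group, not whatever estimator the production system uses, and the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def to_normal_scores(values):
    """Map each value to a standard normal score via its empirical CDF rank."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = _STD_NORMAL.inv_cdf((rank - 0.5) / n)
    return scores


def from_normal_scores(z_values, reference):
    """Map normal scores back to quantiles of the reference sample."""
    ref = sorted(reference)
    n = len(ref)
    out = []
    for z in z_values:
        p = _STD_NORMAL.cdf(z)
        idx = min(n - 1, max(0, int(p * n)))
        out.append(ref[idx])
    return out


if __name__ == "__main__":
    earnings = [3.2, 1.5, 9.9, 4.4, 1.5, 7.0]
    z = to_normal_scores(earnings)
    print([round(v, 3) for v in z])
    print(from_normal_scores(z, earnings))
```

Modeling happens on the transformed scale, where a normal linear model is more plausible; the inverse step returns only values on the support of the original group, which is what keeps the within-group univariate distribution intact.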
18. SRMI Example: Date of Birth
- Link administrative birth date (more accurate)
- Take birth date from Bayesian bootstrap link of couple administrative records when SSN is not available
- Formulate grouping and control variable lists and hierarchy (two sets)
- Perform overall stratifications, sample size checks
19. SRMI Example: Date of Birth
- By unique values of the grouping variables
  - Estimate the pdf of birth date using a kernel density estimator
  - Transform birth date to normal using the estimated KDE
  - Estimate a linear regression of transformed birth date on the master list of control variables for this group
  - Use Bayesian model selection to prune variable list
  - Re-estimate the linear regression using the Bayesian Normal-Inverse Gamma natural conjugate posterior (flat priors)
  - Sample from the posterior distribution of $\beta$ and $\sigma^2$
  - Given $(\beta, \sigma^2)$, sample from the predictive distribution of transformed birth date
  - Invert the transformation on birth date
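The regression steps of this example (posterior draw of the coefficients and variance, then a predictive draw) can be sketched under simplifying assumptions: a single control variable, a flat prior, and an outcome that has already been transformed to normal. The KDE transform and Bayesian model selection steps are omitted, and the function name is hypothetical.

```python
import random


def draw_predictive(x, z, x_new, seed=0):
    """One posterior predictive draw of the transformed outcome at x_new.

    Model: z = b0 + b1 * (x - xbar) + e, e ~ N(0, sigma^2), with the flat
    prior p(b0, b1, sigma^2) proportional to 1 / sigma^2.
    """
    rng = random.Random(seed)
    n = len(x)
    xbar = sum(x) / n
    xc = [a - xbar for a in x]
    sxx = sum(a * a for a in xc)
    b0_hat = sum(z) / n
    b1_hat = sum(a * b for a, b in zip(xc, z)) / sxx
    rss = sum((b - b0_hat - b1_hat * a) ** 2 for a, b in zip(xc, z))
    # sigma^2 | data  ~  RSS / chi-square(n - 2)
    sigma2 = rss / rng.gammavariate((n - 2) / 2.0, 2.0)
    # Centering x makes X'X diagonal, so the two coefficients are
    # independent normals given sigma^2.
    b0 = rng.gauss(b0_hat, (sigma2 / n) ** 0.5)
    b1 = rng.gauss(b1_hat, (sigma2 / sxx) ** 0.5)
    # Predictive draw given the sampled (b0, b1, sigma^2)
    return [b0 + b1 * (a - xbar) + rng.gauss(0.0, sigma2 ** 0.5) for a in x_new]


if __name__ == "__main__":
    x = [float(i) for i in range(10)]
    z = [0.3 * a - 1.0 + (0.2 if i % 2 else -0.2) for i, a in enumerate(x)]
    print([round(v, 3) for v in draw_predictive(x, z, x_new=[2.0, 5.0, 8.0])])
```

The drawn values would then be passed through the inverse transformation to return to the birth-date scale.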
20. SRMI Example: Critical Dates
21. Bayesian Bootstrap Method Details
- The BB is a non-parametric method of taking draws from the posterior predictive distribution of a group of variables (Rubin 1981)
- Automated stratification into homogeneous groups
- Within groups do a Bayesian bootstrap of all variables to be synthesized at the same time
- Similar to a standard bootstrap except that it accounts for the fact that the multivariate distribution is measured with error in the sample
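A minimal sketch of the Bayesian bootstrap for one homogeneous group, assuming records are stored as tuples (the function name and example data are illustrative). Where a standard bootstrap resamples with equal probability 1/n, the Bayesian bootstrap first draws the selection probabilities from a Dirichlet(1, ..., 1) distribution and then resamples whole records with those probabilities.

```python
import random


def bayesian_bootstrap(records, n_draws, rng):
    """Resample whole records using one Dirichlet(1, ..., 1) weight vector."""
    # Normalized Exp(1) draws are a Dirichlet(1, ..., 1) sample
    gammas = [rng.expovariate(1.0) for _ in records]
    total = sum(gammas)
    weights = [g / total for g in gammas]
    # Drawing entire records keeps all variables in a record together
    return rng.choices(records, weights=weights, k=n_draws)


if __name__ == "__main__":
    group = [("F", 1, "HS"), ("F", 2, "BA"), ("M", 0, "HS"), ("M", 3, "MA")]
    print(bayesian_bootstrap(group, n_draws=6, rng=random.Random(0)))
```

Because every draw is a complete record, the relationships among the variables synthesized together are carried over from the donor.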
22. BB Example: Missing Administrative Data
- Stratify households with missing IRS and SSA data (no SSN) into
  - Single
  - Married missing both SSNs
  - Married missing one SSN
- For each set above, form grouping variable lists and hierarchy
- Check overall sample sizes and establish by-groups
23. BB Example: Missing Administrative Data
- For each unique value of variables in the grouping set
  - Impute the complete set of missing administrative records using BB from the sample of complete records in the same group
  - Couples are BB imputed together
  - When only one member of a couple has missing administrative data, the donor comes from a BB of couples with similar spouses (based on the grouping variables)
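The group-wise imputation described above can be sketched as follows, assuming each unit (a single person or a couple) is a dictionary with a hypothetical `group` label and an `admin` record that is `None` when missing. Field names, group labels, and values are illustrative only.

```python
import random
from collections import defaultdict


def bb_impute_admin(units, seed=0):
    """Fill missing 'admin' records from Bayesian bootstrap donors in-group.

    A couple is a single unit, so both spouses' administrative records are
    taken together from one donor couple.
    """
    rng = random.Random(seed)
    pools = defaultdict(list)
    for unit in units:
        if unit["admin"] is not None:
            pools[unit["group"]].append(unit["admin"])
    # One set of unnormalized Dirichlet(1, ..., 1) weights per group
    weights = {g: [rng.expovariate(1.0) for _ in pool]
               for g, pool in pools.items()}
    completed = []
    for unit in units:
        filled = dict(unit)
        if filled["admin"] is None:
            group = filled["group"]
            if group not in pools:
                # Corresponds to a failed by-group sample size check
                raise ValueError(f"no complete donors in group {group!r}")
            filled["admin"] = rng.choices(
                pools[group], weights=weights[group], k=1)[0]
        completed.append(filled)
    return completed


if __name__ == "__main__":
    units = [
        {"group": "single", "admin": (21000,)},
        {"group": "single", "admin": (35000,)},
        {"group": "single", "admin": None},
        {"group": "married", "admin": (40000, 18000)},
        {"group": "married", "admin": (52000, 0)},
        {"group": "married", "admin": None},
    ]
    for row in bb_impute_admin(units):
        print(row)
```

In this sketch the case of one spouse with missing data would be handled by building the group label from the observed spouse's characteristics, so that donors are couples with similar spouses.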
24. Steps after Synthesizing
- Two criteria for judging success
  - Confidentiality protection
  - Statistical usefulness (analytical validity)
- Perform two types of tests
  - Probabilistic record linkage re-identification tests: can SIPP respondents in synthetic data be linked back to already existing public use data?
  - Use synthetic data for analyses and compare results to results obtained using non-synthetic data
25. Confidentiality Protection
- Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based
- This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF
- Goals
  - Re-identification of SIPP records from the PUF should result in very few true matches
  - Any candidate match should have substantial uncertainty regarding its status as true or false
26. Disclosure Avoidance Analysis
- Uses probabilistic record linking and two types of distance-based record linking
- Each synthetic implicate is matched back to the gold standard
- All unsynthesized variables are used as blocking variables
- Different matching variable sets are used in the probabilistic record linking
- All synthesized variables are used in the distance-based record linking
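A minimal sketch of one distance-based linking test, under stated assumptions: records are dictionaries sharing a hypothetical `id` field used only to score the result, blocking is exact agreement on the unsynthesized variables, and distance is Euclidean on standardized synthesized variables. The slides mention two distance-based methods without detailing them here, so this is one plausible variant, not the study's procedure.

```python
import math
from collections import defaultdict


def reidentification_rate(gold, synthetic, block_keys, match_keys):
    """Share of synthetic records whose nearest gold record is the true source."""
    # Standardize each matching variable by its gold-standard spread
    scale = {}
    for key in match_keys:
        vals = [g[key] for g in gold]
        mean = sum(vals) / len(vals)
        sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))
        scale[key] = sd or 1.0
    # Blocking: only compare records that agree on unsynthesized variables
    blocks = defaultdict(list)
    for g in gold:
        blocks[tuple(g[k] for k in block_keys)].append(g)
    hits = 0
    for s in synthetic:
        candidates = blocks.get(tuple(s[k] for k in block_keys), [])
        if not candidates:
            continue
        best = min(
            candidates,
            key=lambda g: sum(((g[k] - s[k]) / scale[k]) ** 2 for k in match_keys),
        )
        hits += int(best["id"] == s["id"])
    return hits / len(synthetic)


if __name__ == "__main__":
    gold = [
        {"id": 1, "sex": "F", "earn": 10.0, "age": 30.0},
        {"id": 2, "sex": "F", "earn": 50.0, "age": 60.0},
        {"id": 3, "sex": "M", "earn": 20.0, "age": 40.0},
        {"id": 4, "sex": "M", "earn": 80.0, "age": 25.0},
    ]
    print(reidentification_rate(gold, gold, ["sex"], ["earn", "age"]))
```

A low rate, together with little separation between the best and second-best candidate, is what the confidentiality goals call for.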
27. Matching Variables and Associated M and U Probabilities
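The table for this slide did not survive in the transcript. As background on how M and U probabilities are typically used, here is a minimal sketch of the standard Fellegi-Sunter match weight, where m is the probability that a field agrees for a true match and u the probability that it agrees for a non-match. The field names and probability values below are hypothetical and are not the values from the study.

```python
import math


def match_weight(agreements, m, u):
    """Sum of log2 likelihood-ratio weights over the matching fields."""
    total = 0.0
    for field, agrees in agreements.items():
        if agrees:
            total += math.log2(m[field] / u[field])
        else:
            total += math.log2((1.0 - m[field]) / (1.0 - u[field]))
    return total


if __name__ == "__main__":
    # Hypothetical probabilities, for illustration only
    m = {"marital_status": 0.90, "education": 0.85}
    u = {"marital_status": 0.30, "education": 0.20}
    pair = {"marital_status": True, "education": False}
    print(round(match_weight(pair, m, u), 3))
```

Candidate pairs are then ranked by this score, with high scores treated as likely matches and low scores as likely non-matches.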
28. Probabilistic Record Linking Results
29. Distance-based Linking Results
30. Analytical Validity
- All univariate distributions
- Selected first, second and third-order interactions
- Selected linear and non-linear multivariate models
- Small micro-simulations
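Any of these analyses run on the PUF yields one estimate per implicate, which must then be combined across the nested structure (missing-data implicates, each with several synthetic implicates). The sketch below follows the combining rules of Reiter (2004) as commonly stated; the variance formula should be checked against that paper before use, and the function name is hypothetical.

```python
def combine_nested(q, u):
    """Combine estimates from m completed x r synthetic implicates.

    q[l][k] is the point estimate and u[l][k] its estimated variance from
    synthetic implicate k nested within completed-data implicate l.
    Returns (overall point estimate, estimated total variance).
    """
    m, r = len(q), len(q[0])
    q_bar_l = [sum(row) / r for row in q]  # mean within each nest
    q_bar = sum(q_bar_l) / m               # overall point estimate
    # Between-nest variance of the nest means
    b = sum((ql - q_bar) ** 2 for ql in q_bar_l) / (m - 1)
    # Average within-nest variance of the synthetic estimates
    w = sum(
        sum((qlk - ql) ** 2 for qlk in row) / (r - 1)
        for row, ql in zip(q, q_bar_l)
    ) / m
    # Average of the per-implicate variance estimates
    u_bar = sum(sum(row) for row in u) / (m * r)
    total_var = (1.0 + 1.0 / m) * b - w / r + u_bar
    return q_bar, total_var


if __name__ == "__main__":
    q = [[1.0, 3.0], [5.0, 7.0]]
    u = [[1.0, 1.0], [1.0, 1.0]]
    print(combine_nested(q, u))
```

Note that the variance estimate subtracts a within-nest term, so it can be negative in small examples; applications need a fallback for that case.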
35. Log Total Earnings: White Males
36. Log Total Earnings: Black Males
37. Log AIME/AMW: All Individuals
38. Log Initial MBA: All Retired Individuals
39. Log Initial MBA: Disabled Individuals
40. Logistic Regression: Has a DB or DC Pension
43. Lifetime Total FICA Earnings
44. Lifetime Total FICA Work Years
45. Micro-simulation of Retirement Accounts
46. Next Steps
- Census DRB has approved release
- IRS Disclosure Officer has completed review and will approve release
- SSA is negotiating the terms of the Beta and Final releases with the Census Bureau
- Released data will be fully supported on the Cornell Virtual Research Data Center
- Some models estimated on the Beta release will be re-estimated on the Gold Standard to further assess the Beta release's analytical validity