Assessing Disclosure Risk and Analytical Validity for the SIPP-SSA-IRS Public Use File Beta Version 4.1



Slides: 47


Learn more at: https://www.census.gov


1
Assessing Disclosure Risk and Analytical Validity
for the SIPP-SSA-IRS Public Use File Beta Version
4.1

John M. Abowd
U.S. Census Bureau and Cornell University
CNSTAT Panel on the Census Bureau's Dynamics of
Economic Well-being System
January 26, 2007

2
Background

Longstanding goal of the Census Bureau
Statutory mandate to provide survey data used to
study critical policy issues
Focus of a long-standing internal Census Bureau
survey improvement project that is part of the
LEHD Program
This is the first Title 13/Chapter 5 predominant
purpose for using IRS data
Treasury Regulation Change, February 2001 (final
regulation February 2003)
New W-2 items authorized: SSN, EIN, Box 1, Box 3,
Box 13, number of quarters, 1099R
Creation of a public use data set that integrates
survey and administrative data is the other
predominant Title 13/Chapter 5 purpose

3
Team and Sponsorship

The project was conducted by a team of
researchers from the Census Bureau, IRS, Social
Security Administration, and a consortium of
university partners
Main financial support provided by the Census
Bureau, Social Security Administration, and the
National Science Foundation
Primary design decisions made by an inter-agency
team led by Martha Stinson at the Census Bureau
and with the participation of SSA, IRS, the
Congressional Budget Office, and the Joint
Committee on Taxation

4
Acknowledgements: Research Team

Martha Stinson (Census Bureau), project manager
Gary Benedetto, Lisa Dragoset, Sam Hawala, Bryan
Ricchetti (Census Bureau)
Karen Masken (IRS)
Simon Woodcock (Simon Fraser University), Jerry
Reiter (Duke University), Josep Domingo-Ferrer
(University of Rovira and Virgili), Vicenc Torra
(University of Barcelona), Lars Vilhuber (Cornell
University and Census Bureau), consultants

5
Acknowledgements I: Agencies

Kenneth Prewitt, C. Louis Kincannon, Hermann
Habermann, Paula Schneider, Nancy Gordon,
Frederick Knickerbocker, Cynthia Clark, Howard
Hogan, and Thomas Mesenbourg, senior management,
Census Bureau
Susan Grad, Howard Iams, and Paul van de Water,
senior management, SSA
Mark Mazur and Nicholas Greenia, IRS senior
management and IRS/SOI Census Bureau disclosure
liaison
Daniel Newlon, NSF project officer

6
Acknowledgements II: Agencies

Chet Bowie, Al Tupek, Barry Sessamen, Dan
Weinberg, Ron Prevost, Jeremy Wu, division and
program management, Census Bureau
Brian Greenberg, Dawn Haynes, SSA technical
support, contract management, and disclosure
officers
Patricia Doyle, Judith Eargle and Nancy Bates,
Census Bureau SIPP research direction
Charlene Leggieri and Sally Obenski, Census
Bureau administrative records management
Laura Zayatz, Census Bureau statistical
disclosure research direction
John Sabelhaus, Congressional Budget Office
research direction

7
Conceptual Framework

Link all SIPP panels from the 1990s
Five panels: 1990, 1991, 1992, 1993, 1996
Link to IRS data
Summary Earnings Records (FICA taxable earnings
1937-1950, and 1951-2003 annual)
Detailed Earnings Record (job level data,
uncapped, 1978-2003 annual)
SSA benefits data
Master Beneficiary Record, Supplemental Security
Record, Payment History Update System, 831 file
(all available historical data through 2002)
Create product that prevents individuals from
being re-identified in the current public use
SIPP files

8
Major Design Decisions

Limit number of SIPP variables included
Target national retirement and disability
research communities
Investigate disclosure avoidance methods to
protect both survey and administrative data
But, note that a re-identification in the current
SIPP public use files is not a disclosure since
those files have also been subjected to extensive
disclosure avoidance procedures
Very high hurdle

9
Latest Versions

Gold Standard confidential file at release 4.0
All confidential data (person-level), all sources
Beta Public Use File 4.1
All person-level SIPP, IRS variables from the
Gold Standard Version 4.0
Benefit and type of benefit measures for initial
SSA benefit (if any), benefit and type of benefit
as of April 1, 2000
Consistent panel weight for civilian,
non-institutional population as of April 1, 2000
(synthesized on each implicate)
Four missing data implicates with four synthetic
implicates each (16 implicates total)

10
Summary of Discussion Today

A tour of the methods used to complete and
synthesize the SIPP-PUF
Some disclosure avoidance results
Selected analytical validity results

11
Multiple Imputation Confidentiality Protection:
History

Rubin (1993): treat unsampled individuals in
population as missing the survey data, impute
missing values (synthetic population), sample and
release (fully synthetic data)
Little (1993): treat sensitive values as missing,
impute and release imputed values (partially
synthetic data)
Fienberg (1994): parametric Bayesian procedure
eliminated the use of any actual values in
synthetic data
Raghunathan, Reiter, and Rubin (2003): adapted the
Sequential Regression Multivariate Imputation
method to synthetic data
Reiter (2004): inference-valid combination of
multiple imputation for missing and synthetic
data
Abowd and Woodcock (2001): applied SRMI to
confidentiality protection of longitudinally
linked employer-employee synthetic micro-data

12
Multiple Imputation Confidentiality Protection:
Methods

Denote confidential data by $Y$ and nonconfidential
data by $X$ (may be empty)
Both $Y$ and $X$ may contain missing data, so that
$Y = (Y_{obs}, Y_{mis})$ and $X = (X_{obs}, X_{mis})$
Assume database can be represented by joint
density $p(Y, X, \theta)$
Estimate the posterior predictive distribution
$p(Y_{new}, X_{new} \mid Y_{obs}, X_{obs})$
Sample multiple times from the posterior
predictive distribution, release these samples
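
Written out, the posterior predictive distribution integrates the model for new records over the posterior of the parameters. The form below is the standard textbook identity, not taken from the slides; it assumes new records are conditionally independent of the observed data given the parameter vector, written here as $\theta$.

```latex
p(Y_{new}, X_{new} \mid Y_{obs}, X_{obs})
  = \int p(Y_{new}, X_{new} \mid \theta)\,
         p(\theta \mid Y_{obs}, X_{obs})\, d\theta
```

Sampling therefore has two stages: draw $\theta$ from its posterior given the observed data, then draw new records given that $\theta$.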

13
Sequential Regression Multivariate Imputation
(SRMI) Method

Synthetic data values are draws from the
posterior predictive density
In practice, use a two-step procedure: 1)
complete the missing data using SRMI; 2) draw
synthetic data from predictive density given the
completed data
Repeating the procedure yields multiple synthetic
data implicates

14
SRMI Method Details

Specifying the joint density $p(Y, X, \theta)$ is
unrealistic in most applications
Instead, approximate the joint density by a
sequence of conditional densities defined by
generalized linear models
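Synthetic values of some variable $y_k$ are draws
from
$\tilde{y}_k \sim \int p_k(y_k \mid Y^m_{\sim k}, X^m, \theta_k)\, p_k(\theta_k \mid Y^m, X^m)\, d\theta_k$
where $Y^m, X^m$ are completed data ($Y^m_{\sim k}$
denotes $Y^m$ without $y_k$), and
densities $p_k$ are defined by an appropriate
generalized linear model and prior, a
Dirichlet-multinomial model, or a Bayesian
Bootstrap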
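
The sequence-of-conditionals idea can be sketched in a few lines. This is a minimal illustration, not the production system: it assumes just two continuous variables, a normal linear model with a flat prior standing in for the generalized linear models named above, and hypothetical function names (`draw_line`, `srmi_complete`). Each pass regresses one variable on the other using the currently completed data, draws the parameters from their posterior, and then redraws the missing values from the predictive distribution.

```python
import random

def draw_line(xs, ys, rng):
    """Draw (a, b, sigma2, xbar) from the posterior of y = a + b*(x - xbar) + e
    under the flat prior p(a, b, sigma2) proportional to 1/sigma2."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    b_hat = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / sxx
    sse = sum((y - ybar - b_hat * (x - xbar)) ** 2 for x, y in zip(xs, ys))
    sigma2 = sse / rng.gammavariate((n - 2) / 2.0, 2.0)  # SSE / chi-square(n - 2)
    a = rng.gauss(ybar, (sigma2 / n) ** 0.5)      # centering x makes a and b
    b = rng.gauss(b_hat, (sigma2 / sxx) ** 0.5)   # independent given sigma2
    return a, b, sigma2, xbar

def srmi_complete(data, n_iter=9, seed=0):
    """data: list of [y0, y1] rows, with None marking a missing value.
    Returns one completed copy; the input is left untouched."""
    rng = random.Random(seed)
    missing = [[v is None for v in row] for row in data]
    completed = [row[:] for row in data]
    # Start from observed means so every conditional model has full predictors.
    for j in range(2):
        observed = [row[j] for row in data if row[j] is not None]
        mean_j = sum(observed) / len(observed)
        for row in completed:
            if row[j] is None:
                row[j] = mean_j
    for _ in range(n_iter):
        for j in range(2):
            other = 1 - j
            fit = [r for r, m in zip(completed, missing) if not m[j]]
            a, b, sigma2, xbar = draw_line(
                [r[other] for r in fit], [r[j] for r in fit], rng)
            for r, m in zip(completed, missing):
                if m[j]:  # redraw only the values that were originally missing
                    r[j] = rng.gauss(a + b * (r[other] - xbar), sigma2 ** 0.5)
    return completed
```

Calling `srmi_complete` several times with different seeds gives several completed implicates. Synthesis follows the same pattern, except that the predictive draw replaces observed values as well.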

15
Maintaining Relationships in the Underlying Data

Define a multilevel parent-child tree to describe
the exact relationships in the data
Variables at the root of this tree should have
values for all individuals, completed and
synthesized first (but as a function of all data)
Child variables only completed or synthesized
when appropriate given the parent variable
For missing data, iterate nine times to complete
all missing data, sample 4 implicates
For synthetic data, condition on values from the
completed data, sample 4 implicates per completed
implicate
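
With synthetic implicates nested inside missing-data implicates, an analyst has to combine estimates across both layers. The sketch below assumes the variance formula usually attributed to Reiter (2004) for this nested design, T = (1 + 1/m) B - b/r + u; the function name and argument layout are hypothetical, and degrees of freedom are left out.

```python
def combine_nested(q, u):
    """q[i][j]: point estimate from synthetic implicate j nested within
    missing-data implicate i; u[i][j]: its estimated variance.
    Returns (overall estimate, estimated total variance)."""
    m, r = len(q), len(q[0])
    q_i = [sum(row) / r for row in q]            # mean within each nest
    q_bar = sum(q_i) / m                         # overall point estimate
    big_b = sum((qi - q_bar) ** 2 for qi in q_i) / (m - 1)   # between nests
    small_b = sum((q[i][j] - q_i[i]) ** 2
                  for i in range(m) for j in range(r)) / (m * (r - 1))  # within nests
    u_bar = sum(sum(row) for row in u) / (m * r)  # average sampling variance
    total_var = (1 + 1 / m) * big_b - small_b / r + u_bar
    return q_bar, total_var
```

For the design described here, `q` and `u` would each be 4 by 4. The estimate of total variance can come out negative in small examples because of the subtracted term, so it should be checked before use.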

16
Maintaining Multivariate Distributions

Automated creation and management of stratifying
(grouping) variables and conditioning variables
Bayesian bootstrap procedure for sets of related
discrete variables estimated using the automated
grouping
SRMI procedure for most continuous variables
using automated grouping, conditioning variable
management, Bayesian model selection

17
Maintaining Univariate Distributions

Automated management of sets of related
continuous variables (e.g., earnings histories)
Within stratifying groups, automated management
of a non-parametric transform with inverse
transform to preserve the univariate distribution
of all continuous variables within group

18
SRMI Example: Date of Birth

Link administrative birth date (more accurate)
Take birth date from Bayesian bootstrap link of
couple administrative records when SSN is not
available
Formulate grouping and control variable lists and
hierarchy (two sets)
Perform overall stratifications, sample size
checks

19
SRMI Example: Date of Birth (continued)

By unique values of the grouping variables
Estimate the pdf of birth date using a kernel
density estimator
Transform birth date to normal using the
estimated KDE
Estimate a linear regression of transformed birth
date on the master list of control variables for
this group
Use Bayesian model selection to prune variable
list
Re-estimate the linear regression using the
Bayesian Normal-Inverse Gamma natural conjugate
posterior (flat priors)
Sample from the posterior distribution of $\beta$ and
$\sigma^2$
Given $(\beta, \sigma^2)$, sample from the predictive
distribution of transformed birth date
Invert the transformation on birth date
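
The steps above can be sketched for a single group as follows. This is an illustration under stated assumptions, not the project's code: birth date is coded as a number (for example, days since some epoch), there is one control variable instead of a pruned master list, the Bayesian model selection step is omitted, the kernel is Gaussian with a Silverman-style bandwidth, and all function names are hypothetical.

```python
import random
from statistics import NormalDist, stdev

STD = NormalDist()  # standard normal, used for the kernel and the transform

def bandwidth(sample):
    # Silverman-style rule of thumb; an assumption, the slides name no rule.
    return 1.06 * stdev(sample) * len(sample) ** -0.2

def kde_cdf(x, sample, h):
    # CDF implied by a Gaussian kernel density estimate.
    return sum(STD.cdf((x - s) / h) for s in sample) / len(sample)

def kde_inverse_cdf(u, sample, h):
    # Invert the KDE CDF by bisection (it is continuous and increasing).
    lo, hi = min(sample) - 10 * h, max(sample) + 10 * h
    for _ in range(60):
        mid = (lo + hi) / 2
        if kde_cdf(mid, sample, h) < u:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def synthesize_birth(birth, control, seed=0):
    rng = random.Random(seed)
    n, h = len(birth), bandwidth(birth)
    # 1) Transform birth date to a normal scale through the estimated CDF.
    z = [STD.inv_cdf(kde_cdf(b, birth, h)) for b in birth]
    # 2) Regress transformed birth date on the (centered) control variable.
    cbar, zbar = sum(control) / n, sum(z) / n
    sxx = sum((c - cbar) ** 2 for c in control)
    b_hat = sum((c - cbar) * (zi - zbar) for c, zi in zip(control, z)) / sxx
    sse = sum((zi - zbar - b_hat * (c - cbar)) ** 2 for c, zi in zip(control, z))
    # 3) Draw sigma^2, then the coefficients, from the flat-prior posterior.
    sigma2 = sse / rng.gammavariate((n - 2) / 2.0, 2.0)
    a = rng.gauss(zbar, (sigma2 / n) ** 0.5)
    b = rng.gauss(b_hat, (sigma2 / sxx) ** 0.5)
    # 4) Draw from the predictive distribution and invert the transform.
    synthetic = []
    for c in control:
        z_new = rng.gauss(a + b * (c - cbar), sigma2 ** 0.5)
        synthetic.append(kde_inverse_cdf(STD.cdf(z_new), birth, h))
    return synthetic
```

Because the final step maps normal draws back through the estimated CDF, the synthetic values follow the smoothed univariate distribution of the original birth dates within the group, while the regression carries the relationship to the control variable.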

20
SRMI Example: Critical Dates
21
Bayesian Bootstrap Method Details

The BB is a non-parametric method of taking draws
from the posterior predictive distribution of a
group of variables (Rubin 1981)
Automated stratification into homogeneous groups
Within groups do a Bayesian bootstrap of all
variables to be synthesized at the same time
Similar to a standard bootstrap except that it
accounts for the fact that the multivariate
distribution is measured with error in the sample.
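
A minimal sketch of one Bayesian bootstrap draw within a single group, following Rubin (1981): the selection probabilities are drawn from a flat Dirichlet distribution, and whole records are then resampled with those probabilities. The function names are hypothetical, and the automated stratification step is assumed to have already produced the group.

```python
import random

def dirichlet_weights(n, rng):
    # Dirichlet(1, ..., 1) draw: normalize independent exponential gaps.
    gaps = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(gaps)
    return [g / total for g in gaps]

def bayesian_bootstrap(records, rng):
    """Return one BB replicate of `records` (all in the same group).
    Whole records are drawn, so every variable in a record is
    synthesized together and within-record relationships are kept."""
    weights = dirichlet_weights(len(records), rng)
    return rng.choices(records, weights=weights, k=len(records))
```

A standard bootstrap would use equal weights of 1/n on every draw. Redrawing the weights each time is what reflects uncertainty about the distribution itself.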

22
BB Example: Missing Administrative Data

Stratify households with missing IRS and SSA data
(no SSN) into:
Single
Married missing both SSNs
Married missing one SSN
For each set above, form grouping variable lists
and hierarchy
Check overall sample sizes and establish by-groups

23
BB Example: Missing Administrative Data (continued)

For each unique value of variables in the
grouping set
Impute the complete set of missing administrative
records using BB from the sample of complete
records in the same group
Couples are BB imputed together
When only one member of a couple has missing
administrative data, the donor comes from a BB of
couples with similar spouses (based on the
grouping variables)
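
The group-wise donor step can be sketched as below. The record layout is an assumption made for illustration: each unit (a single person, or a couple treated as one unit so that spouses are imputed together) carries a hypothetical `group` key built from the grouping variables and an `admin` field that is `None` when the administrative records are missing.

```python
import random
from collections import defaultdict

def bb_impute_admin(units, seed=0):
    """Fill missing 'admin' blocks with a complete block drawn by Bayesian
    bootstrap from donors in the same group. Input is not modified."""
    rng = random.Random(seed)
    donors = defaultdict(list)
    for unit in units:
        if unit["admin"] is not None:
            donors[unit["group"]].append(unit["admin"])
    # One Dirichlet(1, ..., 1) weight vector per group of donors.
    weights = {}
    for group, pool in donors.items():
        gaps = [rng.gammavariate(1.0, 1.0) for _ in pool]
        total = sum(gaps)
        weights[group] = [g / total for g in gaps]
    completed = []
    for unit in units:
        admin = unit["admin"]
        group = unit["group"]
        if admin is None and group in donors:
            # Copy the donor's whole administrative block at once.
            admin = dict(rng.choices(donors[group], weights=weights[group], k=1)[0])
        completed.append({**unit, "admin": admin})
    return completed
```

Units in a group with no complete donors are left missing in this sketch, which is the situation the sample size checks on the previous slide are meant to catch.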

24
Steps after Synthesizing

Two criteria for judging success
Confidentiality protection
Statistical usefulness (Analytical validity)
Perform two types of tests
Probabilistic record linkage re-identification
tests: can SIPP respondents in synthetic data be
linked back to already existing public use data?
Use synthetic data for analyses and compare
results to results obtained using non-synthetic
data

25
Confidentiality Protection

Protection is based on the inability of PUF users
to re-identify the SIPP record upon which the PUF
record is based
This prevents wholesale addition of SIPP data to
the IRS and SSA data in the PUF
Goals
re-identification of SIPP records from the PUF
should result in very few true matches
any candidate match should have substantial
uncertainty regarding its status as true or false

26
Disclosure Avoidance Analysis

Uses probabilistic record linking and two types
of distance-based record linking
Each synthetic implicate is matched back to the
gold standard
All unsynthesized variables are used as blocking
variables
Different matching variable sets are used in the
probabilistic record linking
All synthesized variables are used in the
distance-based record linking
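
One form of distance-based linking can be sketched as follows. The slides do not say which two distance measures were used, so this assumes a standardized Euclidean distance; the field names and the function `distance_link_rate` are hypothetical. Each synthetic record is compared only with gold standard records that agree on the blocking (unsynthesized) variables, the nearest candidate is declared the link, and the share of links that hit the true source record is reported.

```python
from statistics import pstdev

def distance_link_rate(gold, synthetic, block_keys, match_keys):
    """gold, synthetic: lists of dicts sharing an 'id' that records the true
    source record. Returns the fraction of synthetic records whose nearest
    gold record within the same block is their true source."""
    # Standardize each matching variable by its gold-standard spread.
    scale = {k: (pstdev([g[k] for g in gold]) or 1.0) for k in match_keys}
    blocks = {}
    for g in gold:
        blocks.setdefault(tuple(g[k] for k in block_keys), []).append(g)
    hits = 0
    for s in synthetic:
        candidates = blocks.get(tuple(s[k] for k in block_keys), [])
        if not candidates:
            continue  # no candidate in the block counts as no true match
        nearest = min(
            candidates,
            key=lambda g: sum(((s[k] - g[k]) / scale[k]) ** 2 for k in match_keys),
        )
        hits += nearest["id"] == s["id"]
    return hits / len(synthetic)
```

Running this once per synthetic implicate gives a true-match rate for each implicate. Lower rates mean the synthesized values do less to reveal the underlying record.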

27
Matching Variables and Associated M and U
Probabilities
28
Probabilistic Record Linking Results
29
Distance-based Linking Results
30
Analytical Validity

All univariate distributions
Selected first, second and third-order
interactions
Selected linear and non-linear multivariate
models
Small micro-simulations

35
Log Total Earnings: White Males
36
Log Total Earnings: Black Males
37
Log AIME/AMW: All Individuals
38
Log Initial MBA: All Retired Individuals
39
Log Initial MBA: Disabled Individuals
40
Logistic Regression: Has a DB or DC Pension
43
Lifetime Total FICA Earnings
44
Lifetime Total FICA Work Years
45
Micro-simulation of Retirement Accounts
46
Next Steps

Census DRB has approved release
IRS Disclosure Officer has completed review and
will approve release
SSA is negotiating with the Census Bureau the
terms of the Beta and Final releases
Released data will be fully supported on the
Cornell Virtual Research Data Center
Some models estimated on the Beta release will be
re-estimated on the Gold Standard to further
assess its analytical validity