Assessing Disclosure Risk and Analytical Validity for the SIPP-SSA-IRS Public Use File Beta Version 4.1

John M. Abowd, U.S. Census Bureau and Cornell University

Slides: 47
Learn more at: https://www.census.gov
Transcript and Presenter's Notes

1
Assessing Disclosure Risk and Analytical Validity
for the SIPP-SSA-IRS Public Use File Beta Version
4.1
  • John M. Abowd, U.S. Census Bureau and Cornell
    University
  • CNSTAT Panel on the Census Bureau's Dynamics of
    Economic Well-being System, January 26, 2007

2
Background
  • Longstanding goal of the Census Bureau
  • Statutory mandate to provide survey data used to
    study critical policy issues
  • Focus of a long-standing internal Census Bureau
    survey improvement project that is part of the
    LEHD Program
  • This is the first Title 13/Chapter 5 predominant
    purpose for using IRS data
  • Treasury Regulation Change, February 2001 (final
    regulation February 2003)
  • New W-2 items authorized: SSN, EIN, Box 1, Box 3,
    Box 13, number of quarters, 1099R
  • Creation of a public use data set that integrates
    survey and administrative data is the other
    predominant Title 13/Chapter 5 purpose

3
Team and Sponsorship
  • The project was conducted by a team of
    researchers from the Census Bureau, IRS, Social
    Security Administration, and a consortium of
    university partners
  • Main financial support provided by the Census
    Bureau, Social Security Administration, and the
    National Science Foundation
  • Primary design decisions made by an inter-agency
    team led by Martha Stinson at the Census Bureau
    and with the participation of SSA, IRS, the
    Congressional Budget Office, and the Joint
    Committee on Taxation

4
Acknowledgements: Research Team
  • Martha Stinson (Census Bureau), project manager
  • Gary Benedetto, Lisa Dragoset, Sam Hawala, Bryan
    Ricchetti (Census Bureau)
  • Karen Masken (IRS)
  • Simon Woodcock (Simon Fraser University), Jerry
    Reiter (Duke University), Josep Domingo-Ferrer
    (University of Rovira and Virgili), Vicenc Torra
    (University of Barcelona), Lars Vilhuber (Cornell
    University and Census Bureau), consultants

5
Acknowledgements I: Agencies
  • Kenneth Prewitt, C. Louis Kincannon, Hermann
    Habermann, Paula Schneider, Nancy Gordon,
    Frederick Knickerbocker, Cynthia Clark, Howard
    Hogan, and Thomas Mesenbourg, senior management
    Census Bureau
  • Susan Grad, Howard Iams, and Paul van de Water,
    senior management SSA
  • Mark Mazur and Nicholas Greenia, IRS senior
    management and IRS/SOI Census Bureau disclosure
    liaison
  • Daniel Newlon, NSF project officer

6
Acknowledgements II: Agencies
  • Chet Bowie, Al Tupek, Barry Sessamen, Dan
    Weinberg, Ron Prevost, Jeremy Wu, division and
    program management Census Bureau
  • Brian Greenberg, Dawn Haynes, SSA technical
    support, contract management, and disclosure
    officers
  • Patricia Doyle, Judith Eargle and Nancy Bates,
    Census Bureau SIPP research direction
  • Charlene Leggieri and Sally Obenski, Census
    Bureau administrative records management
  • Laura Zayatz, Census Bureau statistical
    disclosure research direction
  • John Sabelhaus, Congressional Budget Office
    research direction

7
Conceptual Framework
  • Link all SIPP panels from the 1990s
      • Five panels: 1990, 1991, 1992, 1993, 1996
  • Link to IRS data
      • Summary Earnings Records (FICA taxable earnings
        1937-1950, and 1951-2003 annual)
      • Detailed Earnings Record (job-level data,
        uncapped, 1978-2003 annual)
  • SSA benefits data
      • Master Beneficiary Record, Supplemental Security
        Record, Payment History Update System, 831 file
        (all available historical data through 2002)
  • Create product that prevents individuals from
    being re-identified in the current public use
    SIPP files

8
Major Design Decisions
  • Limit number of SIPP variables included
  • Target national retirement and disability
    research communities
  • Investigate disclosure avoidance methods to
    protect both survey and administrative data
  • But, note that a re-identification in the current
    SIPP public use files is not a disclosure since
    those files have also been subjected to extensive
    disclosure avoidance procedures
  • Very high hurdle

9
Latest Versions
  • Gold Standard confidential file at release 4.0
      • All confidential data (person-level), all sources
  • Beta Public Use File 4.1
      • All person-level SIPP and IRS variables from the
        Gold Standard Version 4.0
      • Benefit and type of benefit measures for initial
        SSA benefit (if any), benefit and type of benefit
        as of April 1, 2000
      • Consistent panel weight for civilian,
        non-institutional population as of April 1, 2000
        (synthesized on each implicate)
      • Four missing data implicates with four synthetic
        implicates each (16 implicates total)
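
Estimates from a nested release like this one (m = 4 missing-data implicates, each with r = 4 synthetic implicates) have to be combined across implicates before inference. The sketch below follows the combining rule for nested partially synthetic data as it is usually attributed to Reiter (2004), T = (1 + 1/m) B - b/r + u_bar. The function name and input layout are illustrative assumptions, and the rule should be checked against that paper before it is relied on.

```python
from statistics import mean

def combine_nested(q, u):
    """Combine estimates from nested implicates.

    q[i][j]: point estimate from synthetic implicate j of missing-data implicate i.
    u[i][j]: the variance estimate that accompanies q[i][j].
    Returns (overall point estimate, total variance estimate).
    """
    m = len(q)       # number of missing-data implicates
    r = len(q[0])    # synthetic implicates per missing-data implicate
    nest_means = [mean(row) for row in q]
    q_bar = mean(nest_means)
    # B: variance of the nest means across missing-data implicates
    B = sum((qi - q_bar) ** 2 for qi in nest_means) / (m - 1)
    # b: average variance across synthetic implicates within a nest
    b = sum((q[i][j] - nest_means[i]) ** 2
            for i in range(m) for j in range(r)) / (m * (r - 1))
    # u_bar: average of the within-implicate variance estimates
    u_bar = mean(mean(row) for row in u)
    T = (1 + 1 / m) * B - b / r + u_bar
    return q_bar, T
```

Because b/r is subtracted, T can come out negative in small samples, so a production analysis would need a fallback that this sketch does not include.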

10
Summary of Discussion Today
  • A tour of the methods used to complete and
    synthesize the SIPP-PUF
  • Some disclosure avoidance results
  • Selected analytical validity results

11
Multiple Imputation Confidentiality Protection:
History
  • Rubin (1993): treat unsampled individuals in
    population as missing the survey data, impute
    missing values (synthetic population), sample and
    release (fully synthetic data)
  • Little (1993): treat sensitive values as missing,
    impute and release imputed values (partially
    synthetic data)
  • Fienberg (1994): parametric Bayesian procedure
    eliminated the use of any actual values in
    synthetic data
  • Raghunathan, Reiter, and Rubin (2003): adapted the
    Sequential Regression Multivariate Imputation
    method to synthetic data
  • Reiter (2004): inference-valid combination of
    multiple imputation for missing and synthetic
    data
  • Abowd and Woodcock (2001): applied SRMI to
    confidentiality protection of longitudinally
    linked employer-employee synthetic micro-data

12
Multiple Imputation Confidentiality Protection:
Methods
  • Denote confidential data by Y and nonconfidential
    data by X (may be empty)
  • Both Y and X may contain missing data, so that
    $Y = (Y_{obs}, Y_{mis})$ and $X = (X_{obs}, X_{mis})$
  • Assume database can be represented by joint
    density $p(Y, X, \theta)$
  • Estimate the posterior predictive distribution
    $p(Y_{new}, X_{new} \mid Y_{obs}, X_{obs})$
  • Sample multiple times from the posterior
    predictive distribution, release these samples
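
For reference, the posterior predictive distribution named above can be written out as follows. Here θ stands for the parameters of the joint density, and the expression assumes that new records are independent of the observed ones given θ; that assumption is standard but is not stated on the slide.

```latex
p(Y_{new}, X_{new} \mid Y_{obs}, X_{obs})
  = \int p(Y_{new}, X_{new} \mid \theta)\,
         p(\theta \mid Y_{obs}, X_{obs})\, d\theta
```

Sampling "multiple times" then amounts to drawing θ from its posterior and, for each draw, drawing a data set from the model given that θ.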

13
Sequential Regression Multivariate Imputation
(SRMI) Method
  • Synthetic data values are draws from the
    posterior predictive density
  • In practice, use a two-step procedure: 1)
    complete the missing data using SRMI; 2) draw
    synthetic data from predictive density given the
    completed data
  • Repeating the procedure yields multiple synthetic
    data implicates

14
SRMI Method Details
  • Specifying the joint density $p(Y, X, \theta)$ is
    unrealistic in most applications
  • Instead, approximate the joint density by a
    sequence of conditional densities defined by
    generalized linear models
  • Synthetic values of some variable $y_k$ are draws
    from a conditional predictive density $p_k$ given
    $Y^m, X^m$, where $Y^m, X^m$ are completed data, and
    densities $p_k$ are defined by an appropriate
    generalized linear model and prior, a
    Dirichlet-multinomial model, or a Bayesian
    Bootstrap
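
Of the model families named above, the Dirichlet-multinomial is the shortest to illustrate. The sketch below synthesizes one discrete variable within a single group. The function name, the symmetric prior value, and the example categories are assumptions made for illustration and do not come from the SIPP-SSA-IRS production code.

```python
import random
from collections import Counter

def dirichlet_multinomial_synthesize(values, categories, n_out, prior=1.0, rng=None):
    """Draw n_out synthetic values of a discrete variable.

    The category probabilities are drawn from the Dirichlet posterior
    Dirichlet(prior + count_c), and the synthetic values are then drawn
    from the multinomial with those probabilities.
    """
    rng = rng or random.Random()
    counts = Counter(values)
    # A Dirichlet draw is a vector of independent Gamma draws, normalized.
    gammas = [rng.gammavariate(prior + counts[c], 1.0) for c in categories]
    total = sum(gammas)
    probs = [g / total for g in gammas]
    return rng.choices(categories, weights=probs, k=n_out)

# Hypothetical usage: synthesize a three-category variable for one group.
synthetic = dirichlet_multinomial_synthesize(
    ["a", "a", "b"], ["a", "b", "c"], n_out=5, rng=random.Random(0))
```

Drawing the probabilities first, rather than resampling the observed frequencies directly, is what carries parameter uncertainty into the synthetic values.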

15
Maintaining Relationships in the Underlying Data
  • Define a multilevel parent-child tree to describe
    the exact relationships in the data
  • Variables at the root of this tree should have
    values for all individuals, completed and
    synthesized first (but as a function of all data)
  • Child variables only completed or synthesized
    when appropriate given the parent variable
  • For missing data, iterate nine times to complete
    all missing data, sample 4 implicates
  • For synthetic data, condition on values from the
    completed data, sample 4 implicates per completed
    implicate
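
The parent-child rule can be sketched as a walk over variables in tree order, where a child is filled only when its parent's value makes it applicable. The three variable names and the `draw` callback below are hypothetical and stand in for whatever completion or synthesis model applies to each variable.

```python
# Hypothetical tree: (variable, parent, condition on the parent's value).
TREE = [
    ("birth_date", None, None),
    ("ever_married", None, None),
    ("marriage_date", "ever_married", lambda v: v == 1),
]

def process_record(record, draw):
    """Fill variables in tree order.

    draw(name, partial_record) returns a completed or synthetic value for
    one variable given everything filled in so far.
    """
    out = dict(record)
    for name, parent, applies in TREE:
        if parent is not None and not applies(out[parent]):
            # Structurally missing: never completed or synthesized.
            out[name] = None
        else:
            out[name] = draw(name, out)
    return out
```

Listing root variables first guarantees that a parent's value is already available whenever a child is reached.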

16
Maintaining Multivariate Distributions
  • Automated creation and management of stratifying
    (grouping) variables and conditioning variables
  • Bayesian bootstrap procedure for sets of related
    discrete variables estimated using the automated
    grouping
  • SRMI procedure for most continuous variables
    using automated grouping, conditioning variable
    management, Bayesian model selection

17
Maintaining Univariate Distributions
  • Automated management of sets of related
    continuous variables (e.g., earnings histories)
  • Within stratifying groups, automated management
    of a non-parametric transform with inverse
    transform to preserve the univariate distribution
    of all continuous variables within group
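
A minimal sketch of such a transform and its inverse, assuming a Gaussian kernel density estimate with a rule-of-thumb bandwidth. Both choices, and all function names, are assumptions; the slide does not say which kernel or bandwidth was used.

```python
from statistics import NormalDist, stdev

STD_NORMAL = NormalDist()

def silverman_bandwidth(sample):
    """Rule-of-thumb bandwidth (an assumed choice)."""
    return 1.06 * stdev(sample) * len(sample) ** (-0.2)

def kde_cdf(x, sample, h):
    """CDF of a Gaussian kernel density estimate with bandwidth h."""
    return sum(STD_NORMAL.cdf((x - xi) / h) for xi in sample) / len(sample)

def to_normal(x, sample, h):
    """Map x to a standard normal score through the KDE CDF."""
    return STD_NORMAL.inv_cdf(kde_cdf(x, sample, h))

def from_normal(z, sample, h):
    """Invert to_normal by bisection on the monotone KDE CDF."""
    target = STD_NORMAL.cdf(z)
    lo, hi = min(sample) - 10 * h, max(sample) + 10 * h
    for _ in range(200):
        mid = (lo + hi) / 2
        if kde_cdf(mid, sample, h) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```

Modeling happens on the normal scale, and mapping draws back through `from_normal` returns them to the original scale with approximately the group's estimated distribution. Scores far in the tails are clamped to the bisection bracket in this sketch.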

18
SRMI Example: Date of Birth
  • Link administrative birth date (more accurate)
  • Take birth date from Bayesian bootstrap link of
    couple administrative records when SSN is not
    available
  • Formulate grouping and control variable lists and
    hierarchy (two sets)
  • Perform overall stratifications, sample size
    checks

19
SRMI Example: Date of Birth
  • By unique values of the grouping variables
      • Estimate the pdf of birth date using a kernel
        density estimator
      • Transform birth date to normal using the
        estimated KDE
      • Estimate a linear regression of transformed birth
        date on the master list of control variables for
        this group
      • Use Bayesian model selection to prune variable
        list
      • Re-estimate the linear regression using the
        Bayesian Normal-Inverse Gamma natural conjugate
        posterior (flat priors)
      • Sample from the posterior distribution of $\beta$
        and $\sigma^2$
      • Given $(\beta, \sigma^2)$, sample from the predictive
        distribution of transformed birth date
      • Invert the transformation on birth date
20
SRMI Example: Critical Dates

21
Bayesian Bootstrap Method Details
  • The BB is a non-parametric method of taking draws
    from the posterior predictive distribution of a
    group of variables (Rubin 1981)
  • Automated stratification into homogeneous groups
  • Within groups do a Bayesian bootstrap of all
    variables to be synthesized at the same time
  • Similar to a standard bootstrap except that it
    accounts for the fact that the multivariate
    distribution is measured with error in the sample.
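
A minimal sketch of a Bayesian bootstrap draw within one homogeneous group, in the sense of Rubin (1981). Each record is kept whole, so all of its variables are resampled together. The function name and the example records are assumptions.

```python
import random

def bayesian_bootstrap(records, n_out, rng):
    """Resample whole records with Dirichlet(1, ..., 1) weights.

    A standard bootstrap gives every record probability 1/n. Here the
    probabilities are themselves random, which reflects uncertainty about
    the distribution that the sample only estimates.
    """
    # Normalized Exponential(1) draws are a Dirichlet(1, ..., 1) vector.
    g = [rng.expovariate(1.0) for _ in records]
    total = sum(g)
    weights = [gi / total for gi in g]
    return rng.choices(records, weights=weights, k=n_out)

# Hypothetical group of three records with two variables each.
group = [(1, "a"), (2, "b"), (3, "c")]
draws = bayesian_bootstrap(group, n_out=10, rng=random.Random(0))
```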

22
BB Example: Missing Administrative Data
  • Stratify households with missing IRS and SSA data
    (no SSN) into
      • Single
      • Married, missing both SSNs
      • Married, missing one SSN
  • For each set above, form grouping variable lists
    and hierarchy
  • Check overall sample sizes and establish by-groups

23
BB Example: Missing Administrative Data
  • For each unique value of variables in the
    grouping set
      • Impute the complete set of missing administrative
        records using BB from the sample of complete
        records in the same group
      • Couples are BB imputed together
      • When only one member of a couple has missing
        administrative data, the donor comes from a BB of
        couples with similar spouses (based on the
        grouping variables)
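
The group-wise imputation can be sketched as follows, with one set of Bayesian bootstrap weights per group and the administrative record treated as a single unit that is copied whole from a donor. A couple can be represented as one record whose administrative value holds both spouses' data, which keeps couples together. The field names are hypothetical.

```python
import random
from collections import defaultdict

def bb_impute(records, group_key, admin_key, rng):
    """Fill missing administrative data from donors in the same group."""
    # Complete records in each group are the donor pool for that group.
    donors = defaultdict(list)
    for rec in records:
        if rec[admin_key] is not None:
            donors[rec[group_key]].append(rec[admin_key])
    # One Bayesian bootstrap weight vector per group (unnormalized
    # Exponential(1) draws; random.choices normalizes them).
    weights = {g: [rng.expovariate(1.0) for _ in pool]
               for g, pool in donors.items()}
    completed = []
    for rec in records:
        rec = dict(rec)  # leave the input untouched
        pool = donors.get(rec[group_key])
        if rec[admin_key] is None and pool:
            rec[admin_key] = rng.choices(
                pool, weights=weights[rec[group_key]], k=1)[0]
        completed.append(rec)
    return completed
```

A group with no complete records is left missing here, which is the situation the sample-size checks are meant to catch before imputation.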

24
Steps after Synthesizing
  • Two criteria for judging success
      • Confidentiality protection
      • Statistical usefulness (analytical validity)
  • Perform two types of tests
      • Probabilistic record linkage re-identification
        tests: can SIPP respondents in synthetic data be
        linked back to already existing public use data?
      • Use synthetic data for analyses and compare
        results to results obtained using non-synthetic
        data

25
Confidentiality Protection
  • Protection is based on the inability of PUF users
    to re-identify the SIPP record upon which the PUF
    record is based
  • This prevents wholesale addition of SIPP data to
    the IRS and SSA data in the PUF
  • Goals
      • Re-identification of SIPP records from the PUF
        should result in very few true matches
      • Any candidate match should have substantial
        uncertainty regarding its status as true or false

26
Disclosure Avoidance Analysis
  • Uses probabilistic record linking and two types
    of distance-based record linking
  • Each synthetic implicate is matched back to the
    gold standard
  • All unsynthesized variables are used as blocking
    variables
  • Different matching variable sets are used in the
    probabilistic record linking
  • All synthesized variables are used in the
    distance-based record linking
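
A distance-based re-identification test of this kind can be sketched as a nearest-neighbor search within blocks. The slide does not name the two distance measures used, so plain Euclidean distance on unstandardized numeric variables is an assumption here, as are the function and field names. Row i of each file is assumed to describe the same person.

```python
import math

def reidentification_rate(synthetic, gold, block_keys, match_keys):
    """Share of synthetic records whose nearest gold-standard record,
    within the same block, is the record they were generated from."""
    hits = 0
    for i, s in enumerate(synthetic):
        best, best_d = None, math.inf
        for j, g in enumerate(gold):
            # Blocking: compare only records that agree on every
            # unsynthesized variable.
            if any(s[k] != g[k] for k in block_keys):
                continue
            d = math.sqrt(sum((s[k] - g[k]) ** 2 for k in match_keys))
            if d < best_d:
                best, best_d = j, d
        hits += (best == i)
    return hits / len(synthetic)
```

A low rate means the closest gold-standard record is usually the wrong one, which is the outcome the confidentiality goals call for.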

27
Matching Variables and Associated M and U
Probabilities
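
In Fellegi-Sunter probabilistic record linkage, m is the probability that a variable agrees when two records are a true match and u is the probability that it agrees when they are not. A minimal sketch of how these turn into a match weight follows. The variable names and probability values in the usage lines are made up for illustration and are not the values from this slide.

```python
import math

def match_weight(agreements, m, u):
    """Sum of log2 agreement and disagreement weights for one record pair.

    agreements maps each matching variable to True (agrees) or False.
    """
    total = 0.0
    for var, agrees in agreements.items():
        if agrees:
            total += math.log2(m[var] / u[var])
        else:
            total += math.log2((1 - m[var]) / (1 - u[var]))
    return total

# Hypothetical m and u probabilities for two matching variables.
m = {"birth_year": 0.9, "education": 0.8}
u = {"birth_year": 0.1, "education": 0.5}
w = match_weight({"birth_year": True, "education": False}, m, u)
```

Pairs with high total weight are declared links. Synthesis protects records by making agreement on the synthesized variables uninformative, so that m and u end up close to each other.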
28
Probabilistic Record Linking Results

29
Distance-based Linking Results

30
Analytical Validity
  • All univariate distributions
  • Selected first, second and third-order
    interactions
  • Selected linear and non-linear multivariate
    models
  • Small micro-simulations

35
Log Total Earnings: White Males

36
Log Total Earnings: Black Males

37
Log AIME/AMW: All Individuals

38
Log Initial MBA: All Retired Individuals

39
Log Initial MBA: Disabled Individuals

40
Logistic Regression: Has a DB or DC Pension

43
Lifetime Total FICA Earnings

44
Lifetime Total FICA Work Years

45
Micro-simulation of Retirement Accounts

46
Next Steps
  • Census DRB has approved release
  • IRS Disclosure Officer has completed review and
    will approve release
  • SSA is negotiating with the Census Bureau the
    terms of the Beta and Final releases
  • Released data will be fully supported on the
    Cornell Virtual Research Data Center
  • Some models estimated on the Beta release will be
    re-estimated on the Gold Standard to further
    assess its analytical validity