Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods

1
Confidentiality Protection of Social Science
Micro Data: Synthetic Data and Related Methods
  • John M. Abowd, Cornell University and Census
    Bureau
  • January 30, 2006, UCLA Institute for Digital
    Research and Education
  • Presentation

2
Acknowledgements
  • Many current and past LEHD staff and senior
    research fellows contributed to the development
    of the LEHD infrastructure system and the
    Quarterly Workforce Indicators. Kevin McKinney,
    Bryce Stephens and Lars Vilhuber were
    particularly responsible for the confidentiality
    protection system.
  • Fredrik Andersson and Marc Roemer at LEHD did the
    data analysis and implementation of the On the
    Map package. John Carpenter of Excensus, Inc.
    developed the mapping application.
  • Gary Benedetto, Lisa Dragoset, Martha Stinson and
    Bryan Ricchetti did the synthesis programming for
    the SIPP-PUF application.

3
Overview
  • What is the problem?
  • What are synthetic data?
  • How can the research community benefit from
    synthetic data?
  • The NSF-ITR synthetic data grant
  • The Census Bureau's synthetic data and related
    products
  • QWI Online
  • On the Map
  • The new SIPP-SSA-IRS Public Use File
  • Tools

4
Information Release and Data Protection are
Competing Objectives
  • Statisticians call this the Risk-Utility tradeoff
  • Economists prefer to distinguish between
    technological trade-offs and preference
    trade-offs
  • Information release and data protection are
    technological tradeoffs

5
A Simple Example of the Technological Trade-off
  • There are two outputs: information released and
    data protection
  • Consider a census with sampling as the release
    technology
  • The PPF measures the amount of information that
    must be sacrificed to get additional protection
  • The information measure is Shannon's H (or the
    Kullback-Leibler divergence between the census
    and the sample)
  • The protection measure is the maximum probability
    of an exact disclosure
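
A minimal sketch of this trade-off, under assumptions that are not in the slides: the census is a table of cell counts, the release is a simple random sample drawn without replacement, information loss is the Kullback-Leibler divergence from the census distribution to a smoothed sample distribution, and the protection measure is approximated by the chance that a given census record appears in the sample. The function names are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence sum(p * log(p / q)) over cells with p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def sampling_tradeoff(census_counts, rate, rng):
    """Release a simple random sample at `rate`; return (info loss, risk proxy)."""
    census_counts = np.asarray(census_counts, dtype=np.int64)
    n = int(census_counts.sum())
    n_sample = max(1, int(round(rate * n)))
    # Sampling records without replacement gives multivariate
    # hypergeometric cell counts.
    sample_counts = rng.multivariate_hypergeometric(census_counts, n_sample)
    p = census_counts / n
    # Add-one smoothing (an assumption) keeps the divergence finite.
    q = (sample_counts + 1) / (n_sample + len(census_counts))
    info_loss = kl_divergence(p, q)
    # Assumed risk proxy: probability that a given census record is released.
    disclosure_prob = n_sample / n
    return info_loss, disclosure_prob

rng = np.random.default_rng(0)
census = [500, 300, 150, 50]
for rate in (0.05, 0.25, 1.0):
    loss, risk = sampling_tradeoff(census, rate, rng)
    print(f"rate={rate:.2f}  info loss={loss:.4f}  disclosure prob={risk:.2f}")
```

Plotting the two returned values against each other for a grid of sampling rates traces out a frontier of the kind the slide calls the PPF.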

8
What Are Synthetic Data?
  • Public use micro data products that reproduce
    essential features of confidential micro data
    products
  • Essential features include:
  • Univariate distributions overall and in
    subpopulations
  • Multivariate relations among the variables

9
Some History
  • Original fully synthetic data idea was due to
    Rubin (JOS, 1993)
  • Synthesize the Decennial Census long form
    responses for the short form households, then
    release samples that do not include any actual
    long form records
  • Original partially synthetic data idea was due to
    Little (JOS, 1993)
  • Synthesize the sensitive values on the public use
    file
  • Critical refinement (Fienberg, 1994)
  • Use a parametric posterior predictive
    distribution (instead of a Bayesian bootstrap) to
    do the sampling
  • Other authors, particularly Raghunathan, Reiter,
    Rubin, Abowd, Woodcock
  • Partially synthetic data with missing data
    (Reiter)
  • Sequential Regression Multivariate Imputation
    (Raghunathan, Reiter, and Rubin; Abowd and
    Woodcock)

10
How Can You Preserve Confidentiality and
Multivariate Relations?
  • Fundamental trade-off
  • Better protection vs. better data quality
  • Protection results from summarizing the data with
    a complicated multivariate distribution, then
    sampling that distribution instead of the
    original data
  • The synthetic data are not any respondent's
    actual data
  • But, for some techniques, it may still be
    possible to re-identify the source record in the
    confidential data
  • New techniques address this problem
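
As a rough illustration of "summarize, then sample", the sketch below fits a multivariate normal to a simulated confidential file and releases draws from the fitted distribution instead of the original records. The normal model, the plug-in (non-Bayesian) sampling, and all names are assumptions made for illustration; production synthesizers use much richer models.

```python
import numpy as np

def synthesize_from_mvn(confidential, n_synthetic, rng):
    """Summarize the data by a fitted multivariate normal, then sample it."""
    mean = confidential.mean(axis=0)
    cov = np.cov(confidential, rowvar=False)
    return rng.multivariate_normal(mean, cov, size=n_synthetic)

rng = np.random.default_rng(42)
# Simulated stand-in for a confidential file with two correlated variables.
true_cov = np.array([[1.0, 0.6], [0.6, 2.0]])
confidential = rng.multivariate_normal([10.0, 5.0], true_cov, size=5000)
synthetic = synthesize_from_mvn(confidential, 5000, rng)

# No synthetic row is a respondent's record, but the correlation survives.
print(np.corrcoef(confidential, rowvar=False)[0, 1])
print(np.corrcoef(synthetic, rowvar=False)[0, 1])
```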

11
How Can the Research Community Benefit from
Synthetic Data?
  • Sophisticated research users must help develop
    the synthesizers in order to promote and improve
    analytic validity
  • Many more users will have access to the
    information because there is a public use micro
    data product.

12
The Research Synthetic Data Feedback Cycle
Confidentiality Protection
Scientific Modeling
Data Synthesis
Analytic Validity
13
The Multi-layer System
  • Basic confidential data
  • Fundamental product of virtually all Census
    programs
  • Leads to the publication of public-use products
    (summary data, micro data, narrative data)
  • Gold-standard confidential data
  • Edited, documented and archived research versions
    of confidential data
  • Used in internal Census research and at Research
    Data Centers

14
More Layers
  • Partially-synthetic micro data
  • Preserves the record structure or sampling frame
    of the gold standard micro data
  • Replaces the data elements with synthetic values
    sampled from an appropriate probability model
  • Fully-synthetic micro data
  • Uses only the population or record linkage
    structure of the gold standard micro data
  • Generates synthetic entities and data elements
    from appropriate probability models
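
A small sketch of the partially synthetic layer, under assumed details: each record and its non-sensitive variable (a hypothetical `age`) are kept, and the sensitive variable (a hypothetical `earnings`) is replaced by a draw from the posterior predictive distribution of a normal linear regression with the standard noninformative prior. A fully synthetic file would also generate the entities and their `age` values from a model.

```python
import numpy as np

def posterior_predictive_draw(X, y, rng):
    """One draw of y for every row of X from a normal linear model
    under the standard noninformative prior (an assumed model)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta_hat = xtx_inv @ X.T @ y
    resid = y - X @ beta_hat
    # Draw the error variance, then the coefficients given that variance.
    sigma2 = resid @ resid / rng.chisquare(n - k)
    beta = rng.multivariate_normal(beta_hat, sigma2 * xtx_inv)
    return X @ beta + rng.normal(0.0, np.sqrt(sigma2), size=n)

def partially_synthesize(age, earnings, rng):
    """Keep each record and its age; replace earnings with a synthetic draw."""
    X = np.column_stack([np.ones_like(age), age])
    return np.column_stack([age, posterior_predictive_draw(X, earnings, rng)])

rng = np.random.default_rng(3)
age = rng.uniform(20.0, 60.0, size=500)
earnings = 1.0 + 0.05 * age + rng.normal(0.0, 0.3, size=500)
synthetic_file = partially_synthesize(age, earnings, rng)
```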

15
The NSF Information Technologies Research Grant
  • A program that encourages innovative, high-payoff
    IT research and education
  • Our grant proposal cited the many research
    studies and data products created by previous NSF
    support for the Research Data Center network and
    the Longitudinal Employer-Household Dynamics
    Program

16
What Is It?
  • $2.9 million, 3-year grant to the RDC network
    (Cornell is the coordinating institution)
  • Provides core support for scientific activities
    at the RDCs
  • To develop public use, analytically valid
    synthetic data from many of the RDC-accessible
    data sets
  • To facilitate collaboration with RDC projects
    that help design and test these products

17
The Quarterly Workforce Indicators
  • QWI was the LEHD Program's first public use data
    product
  • QWI Online
  • Detailed labor force information by sub-state
    geography, detailed industry, ownership class,
    sex and age group.

18
The Confidentiality Protection System
  • All QWI protections are done by noise infusion of
    the micro-data
  • All micro-data items are distorted by at least a
    minimal percentage and at most a maximal percentage
  • Only the distorted items are used in the
    production of the release product
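
A sketch of multiplicative noise infusion with bounded distortion. The 1% and 5% bounds, the uniform distribution of the distortion size, and the function name are illustrative assumptions, not the parameters or distribution of the production system.

```python
import numpy as np

def draw_distortion_factors(n_units, min_pct, max_pct, rng):
    """Draw one multiplicative factor per unit whose distance from 1 is
    at least min_pct and at most max_pct percent (uniform size assumed)."""
    magnitude = rng.uniform(min_pct, max_pct, size=n_units) / 100.0
    sign = rng.choice([-1.0, 1.0], size=n_units)
    return 1.0 + sign * magnitude

rng = np.random.default_rng(11)
true_values = np.array([120.0, 8.0, 43.0, 15.0])
factors = draw_distortion_factors(len(true_values), 1.0, 5.0, rng)
# Only the distorted items feed the released tabulations.
distorted_values = true_values * factors
```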

19
Protection and Validity Principles
  • Cells with few businesses contributing or with
    few individuals contributing have been distorted
    in the cross-section but not the time-series
  • Bias in the cross-section is controlled and
    random; no analyst knows its sign
  • More information
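
One way to see how a cell can be distorted in the cross-section but not in the time-series: assume each contributing unit keeps the same multiplicative factor in every quarter. The factors and employment figures below are made up. For a cell with a single contributing unit the quarter-to-quarter ratios are then unchanged exactly; with several units they are only approximately preserved.

```python
import numpy as np

# Rows are units, columns are quarters (hypothetical employment counts).
employment = np.array([[12.0, 13.0, 15.0, 14.0],
                       [40.0, 42.0, 41.0, 45.0]])
# Hypothetical permanent factors, one per unit, reused in every quarter.
factors = np.array([[1.04], [0.97]])
distorted = employment * factors

growth_true = employment[:, 1:] / employment[:, :-1]
growth_distorted = distorted[:, 1:] / distorted[:, :-1]
print(np.allclose(growth_true, growth_distorted))  # levels differ, ratios agree
```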

20
Theoretical Distribution of the QWI Distortion
Factor
21
Theoretical Distribution of the QWI Distortion
Factor
22
Actual Confidentiality Protection Distortion
Employment, Beginning-of-Quarter
23
Table 8: Distribution of Error in First Order
Serial Correlation
24
Graph: Distribution of Error in First Order
Serial Correlation
25
Enhancements
  • The current product has suppressions for cells
    too small to protect by noise infusion
  • The enhanced product replaces these suppressions
    with synthetic data

26
Percentage of Data Items in County Level Release
File
27
Beginning of Period Employment in NAICS Sector 62
28
Full Quarter New Hires in NAICS4 3259
29
The Census Bureau's First Public Use Synthetic
Data Application
  • LEHD On the Map application
  • Shows commuting patterns at the Census Block
    level with characteristics of the origin and
    destination block groups
  • Origin block data are synthetic
  • Sampled from the posterior predictive
    distribution of origin blocks and origin
    characteristics given the destination block and
    the destination block characteristics.
  • On-the-map

30
Where people living in the selected area
(Mobile's neighboring communities of Daphne and
Fairhope) work
DRAFT Beta Test Document Only
Source: On the Map beta application,
Longitudinal Employer-Household Dynamics Program,
U.S. Census Bureau
September 23, 2005
31
Where people working in the selected area
(downtown Mobile) live
DRAFT Beta Test Document Only
Source: On the Map beta application,
Longitudinal Employer-Household Dynamics Program,
U.S. Census Bureau
September 23, 2005
32
Synthetic Data Model
  • y_ijk are the counts for residence block i,
    workplace block j, and characteristics k.
  • Characteristics are age groups, earnings groups,
    industry (NAICS sector), ownership sector.

33
Complications
  • Informative prior shape
  • Prior sample size
  • Work place counts must be compatible with the
    protection system used by Quarterly Workforce
    Indicators (QWI)
  • Dynamically consistent noise infusion
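
The transcript does not reproduce the model's equations. The sketch below assumes one common formulation that matches the ingredients named above: for a single workplace block, the counts over residence blocks are multinomial, the prior is Dirichlet with a stated shape and prior sample size, and synthetic counts are drawn from the resulting posterior predictive distribution. All names and numbers are hypothetical.

```python
import numpy as np

def synthesize_origins(origin_counts, prior_shape, prior_sample_size, rng):
    """Draw synthetic residence-block counts for one workplace block,
    assuming a Dirichlet prior and multinomial counts."""
    origin_counts = np.asarray(origin_counts)
    prior_shape = np.asarray(prior_shape, dtype=float)
    # Prior pseudo-counts: normalized shape scaled by the prior sample size.
    alpha = prior_sample_size * prior_shape / prior_shape.sum()
    # The posterior over origin probabilities is Dirichlet(alpha + counts).
    probs = rng.dirichlet(alpha + origin_counts)
    # Keep the workplace total; reallocate workers across residence blocks.
    return rng.multinomial(int(origin_counts.sum()), probs)

rng = np.random.default_rng(2005)
observed = [30, 5, 0, 1]            # workers by residence block
shape = [0.4, 0.3, 0.2, 0.1]        # hypothetical informative prior shape
synthetic_counts = synthesize_origins(observed, shape, 10.0, rng)
```

A larger prior sample size pulls the synthetic counts toward the prior shape, which increases protection at some cost in fidelity to the observed flows.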

36
Analytic Validity
  • Assess the bias
  • Assess the incremental variation

39
Confidentiality Protection
  • The reclassification index is a measure of how
    many workers were geographically relocated by the
    synthetic data.
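
The slide gives no formula for the index. One literal reading, offered only as an assumption, is the share of workers whose synthetic residence block differs from their actual one:

```python
import numpy as np

def reclassification_share(actual_blocks, synthetic_blocks):
    """Share of workers placed in a different residence block by synthesis."""
    actual = np.asarray(actual_blocks)
    synthetic = np.asarray(synthetic_blocks)
    return float(np.mean(actual != synthetic))

# Two of the four hypothetical workers were relocated.
share = reclassification_share([1, 1, 2, 3], [1, 2, 2, 4])
```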

41
SIPP-SSA-IRS Public Use File
  • Links IRS detailed earnings records and Social
    Security benefit data to public use SIPP data
  • Basic confidential data: SIPP (1990-1993, 1996),
    W-2 earnings data, SSA benefit data
  • Gold standard: completely linked, edited version
    of the data with variables drawn from all of the
    sources
  • Partially synthetic data: created using the
    record structure of the existing SIPP panels with
    all data elements synthesized using Bayesian
    bootstrap and sequential regression multivariate
    imputation methods

42
Multiple Imputation Confidentiality Protection
  • Denote confidential data by Y and disclosable
    data by X.
  • Both Y and X may contain missing data, so that
    Y = (Y_obs, Y_mis) and X = (X_obs, X_mis).
  • Assume the database can be represented by the joint
    density p(Y, X, θ).

43
Sequential Regression Multivariate Imputation
Method
  • Synthetic values of Y are draws from the
    posterior predictive density
  • In practice, use a two-step procedure:
    1) draw m completed datasets using SRMI (imputes
    values for all missing data);
    2) draw r synthetic datasets for each completed
    dataset from the predictive density given the
    completed data.
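
The nesting of the two steps can be sketched as follows. To stay short, the sketch replaces SRMI's sequence of regressions with a single-variable normal model under a noninformative prior; that substitution, and every name below, is an assumption made for illustration.

```python
import numpy as np

def predictive_draws(values, size, rng):
    """Draw `size` values from the posterior predictive of a normal model
    with unknown mean and variance (noninformative prior assumed)."""
    n = len(values)
    sigma2 = (n - 1) * values.var(ddof=1) / rng.chisquare(n - 1)
    mu = rng.normal(values.mean(), np.sqrt(sigma2 / n))
    return rng.normal(mu, np.sqrt(sigma2), size=size)

def two_step_synthesis(y, m, r, rng):
    """Return m * r synthetic copies of y; NaN marks missing values."""
    missing = np.isnan(y)
    implicates = []
    for _completed_index in range(m):
        # Step 1: complete the data by imputing the missing values.
        completed = y.copy()
        completed[missing] = predictive_draws(y[~missing], int(missing.sum()), rng)
        for _synthetic_index in range(r):
            # Step 2: synthesize every value given the completed data.
            implicates.append(predictive_draws(completed, len(y), rng))
    return implicates

rng = np.random.default_rng(1996)
y = rng.normal(10.0, 2.0, size=200)
y[:20] = np.nan                      # pretend the first 20 values are missing
implicates = two_step_synthesis(y, m=4, r=3, rng=rng)
```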

44
Confidentiality Protection
  • Protection is based on the inability of PUF users
    to re-identify the SIPP record upon which the PUF
    record is based.
  • This prevents wholesale addition of SIPP data to
    the IRS and SSA data in the PUF
  • Goal: re-identification of SIPP records from the
    PUF should result in true matches and false
    matches with equal probability

45
Disclosure Analysis
  • Uses probabilistic record linking
  • Each synthetic implicate is matched to the gold
    standard
  • All unsynthesized variables are used as blocking
    variables
  • Different matching variable sets are used
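
A simplified stand-in for this analysis: instead of probabilistic record linking, the sketch links each synthetic record to the nearest gold-standard record within its block and reports the share of links that hit the true source record. It assumes row i of the synthetic file derives from row i of the gold standard and that the blocking variable is unsynthesized, so it is identical in both files. The function name is hypothetical.

```python
import numpy as np

def true_match_rate(block, gold_vars, synth_vars):
    """Share of synthetic records whose nearest gold-standard record
    within the same block is their own source record."""
    block = np.asarray(block)
    hits = 0
    for i in range(len(block)):
        # Blocking: only gold records with the same block value are candidates.
        candidates = np.flatnonzero(block == block[i])
        dist = np.linalg.norm(gold_vars[candidates] - synth_vars[i], axis=1)
        hits += int(candidates[np.argmin(dist)] == i)
    return hits / len(block)

block = [0, 0, 1, 1]
gold = np.array([[0.0], [10.0], [0.0], [10.0]])
unprotected = true_match_rate(block, gold, gold)        # synthesis changed nothing
scrambled = true_match_rate(block, gold, gold[::-1])    # values swapped within blocks
```

Repeating the calculation for each implicate and for different sets of matching variables gives a table of true-match rates that can be compared with the stated protection goal.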

47
Testing Analytic Validity
  • Run analyses on each synthetic implicate.
  • Average coefficients
  • Combine standard errors using formulae that take
    account of the average variance of the estimates
    (within-implicate variance) and the variance of
    the estimates across implicates (between-implicate
    variance).
  • Run analyses on gold standard data.
  • Compare average synthetic coefficient and
    standard error to the same quantities for the
    gold standard.
  • Analytic validity is measured by the overlap in
    the coverage of the synthetic and gold standard
    confidence intervals for a parameter.
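
The slides do not state the combining formulae. The sketch below assumes the rule commonly used for partially synthetic data without a separate missing-data stage, T = u_bar + b / m, where u_bar is the within-implicate variance and b the between-implicate variance; a design that nests missing-data imputation and synthesis needs a different rule. The interval overlap uses normal-approximation 95% intervals, also an assumption.

```python
import numpy as np

def combine_implicates(estimates, variances):
    """Average the coefficient across m implicates and combine variances
    with T = u_bar + b / m (assumed partially synthetic rule)."""
    estimates = np.asarray(estimates, dtype=float)
    m = len(estimates)
    q_bar = estimates.mean()            # average coefficient
    u_bar = float(np.mean(variances))   # within-implicate variance
    b = estimates.var(ddof=1)           # between-implicate variance
    return q_bar, float(np.sqrt(u_bar + b / m))

def interval_overlap(est_a, se_a, est_b, se_b, z=1.96):
    """Average share of each confidence interval covered by their intersection."""
    lo_a, hi_a = est_a - z * se_a, est_a + z * se_a
    lo_b, hi_b = est_b - z * se_b, est_b + z * se_b
    shared = max(0.0, min(hi_a, hi_b) - max(lo_a, lo_b))
    return 0.5 * (shared / (hi_a - lo_a) + shared / (hi_b - lo_b))

# Hypothetical coefficient from four implicates and from the gold standard.
q_syn, se_syn = combine_implicates([1.0, 1.2, 0.8, 1.0], [0.04] * 4)
overlap = interval_overlap(q_syn, se_syn, 1.05, 0.2)
```

An overlap near 1 means the synthetic and gold-standard intervals nearly coincide; an overlap of 0 means they are disjoint.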

48
Log Annual Earnings Amount
49
Log Annual Benefit Amount
50
Tools
  • NSF-sponsored supercomputer
  • Virtual RDC
  • Cornell INFO 747

51
The NSF-sponsored Supercomputer on the RDC Network
  • NSF01 is a 64-processor (384GB memory)
    supercomputer
  • Installed and optimized for complex data
    synthesizing and simulation
  • Projects related to the ITR grant have access and
    priority

52
The Virtual RDC
  • Virtual RDC (news server)
  • The virtual RDC environment contains multiple
    servers that closely approximate an RDC compute
    server (e.g., NSF01)
  • Disclosure-proofed metadata and synthetic data
  • Now fully operational
  • Any current or potential RDC user can have an
    account

53
Cornell Information Science 747
  • INFO 747
  • Course available to any potential RDC user, on
    DVD and via internet feed
  • Training for using RDC-based data products
  • Training for creating and testing synthetic data

54
Conclusions
  • An important and challenging area that social
    scientists must be part of
  • Use of confidential data collected by a public
    agency carries with it an obligation to
    disseminate enough data to permit scientific
    discourse
  • Synthetic data is an important tool for this
    dissemination