Title: Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods
Slide 1: Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods
- John M. Abowd, Cornell University and Census Bureau
- January 30, 2006, UCLA Institute for Digital Research and Education presentation
Slide 2: Acknowledgements
- Many current and past LEHD staff and senior research fellows contributed to the development of the LEHD infrastructure system and the Quarterly Workforce Indicators. Kevin McKinney, Bryce Stephens, and Lars Vilhuber were particularly responsible for the confidentiality protection system.
- Fredrik Andersson and Marc Roemer at LEHD did the data analysis and implementation of the On the Map package. John Carpenter of Excensus, Inc. developed the mapping application.
- Gary Benedetto, Lisa Dragoset, Martha Stinson, and Bryan Ricchetti did the synthesis programming for the SIPP-PUF application.
Slide 3: Overview
- What is the problem?
- What are synthetic data?
- How can the research community benefit from synthetic data?
- The NSF-ITR synthetic data grant
- The Census Bureau's synthetic data and related products
  - QWI Online
  - On the Map
  - The new SIPP-SSA-IRS Public Use File
- Tools
Slide 4: Information Release and Data Protection Are Competing Objectives
- Statisticians call this the risk-utility trade-off
- Economists prefer to distinguish between technological trade-offs and preference trade-offs
- Information release and data protection involve a technological trade-off
Slide 5: A Simple Example of the Technological Trade-off
- There are two outputs: information released and data protection
- Consider a census with sampling as the release technology
- The PPF (production possibility frontier) measures the amount of information that must be sacrificed to get additional protection
- The information measure is Shannon's H (or the Kullback-Leibler divergence between the census and the sample)
- The protection measure is the maximum probability of an exact disclosure
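The trade-off on this slide can be made concrete with a small simulation: treat simple random sampling as the release technology, measure information loss by the Kullback-Leibler divergence between the census and the sample, and use one crude protection measure, the chance that a target's record appears in the released file at all. All numbers below are illustrative, not from the talk.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits between two discrete distributions (q > 0 where p > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

# Hypothetical census of N people over 4 demographic cells (invented numbers).
N = 10_000
census_counts = np.array([4000, 3000, 2000, 1000])
census_p = census_counts / N

def release(sampling_rate, rng):
    """Release technology: simple random sampling at the given rate."""
    n = int(N * sampling_rate)
    sample_counts = rng.multinomial(n, census_p)
    sample_p = sample_counts / n
    # Information sacrificed by releasing the sample instead of the census.
    info_loss = kl_divergence(census_p, np.clip(sample_p, 1e-12, None))
    # Crude disclosure-risk proxy: chance the target's record is in the file.
    risk = sampling_rate
    return info_loss, risk

rng = np.random.default_rng(42)
for rate in (0.5, 0.1, 0.01):
    loss, risk = release(rate, rng)
    print(f"rate={rate:5.2f}  info loss={loss:.4f} bits  disclosure risk={risk:.2f}")
```

Lower sampling rates buy more protection at the cost of a noisier (more divergent) released distribution, which is exactly the shape of the PPF described above.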
Slide 8: What Are Synthetic Data?
- Public use micro data products that reproduce essential features of confidential micro data products
- Essential features include:
  - Univariate distributions, overall and in subpopulations
  - Multivariate relations among the variables
Slide 9: Some History
- The original fully synthetic data idea was due to Rubin (JOS, 1993)
  - Synthesize the Decennial Census long-form responses for the short-form households, then release samples that do not include any actual long-form records
- The original partially synthetic data idea was due to Little (JOS, 1993)
  - Synthesize the sensitive values on the public use file
- Critical refinement (Fienberg, 1994)
  - Use a parametric posterior predictive distribution (instead of a Bayes bootstrap) to do the sampling
- Other contributors, particularly Raghunathan, Reiter, Rubin, Abowd, and Woodcock
  - Partially synthetic data with missing data (Reiter)
  - Sequential Regression Multivariate Imputation (Raghunathan, Reiter, and Rubin; Abowd and Woodcock)
Slide 10: How Can You Preserve Confidentiality and Multivariate Relations?
- Fundamental trade-off: better protection vs. better data quality
- Protection results from summarizing the data with a complicated multivariate distribution, then sampling that distribution instead of the original data
- The synthetic data are not any respondent's actual data
- But, for some techniques, it may still be possible to re-identify the source record in the confidential data
- New techniques address this problem
Slide 11: How Can the Research Community Benefit from Synthetic Data?
- Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity
- Many more users will have access to the information because there is a public use micro data product
Slide 12: The Research Synthetic Data Feedback Cycle
- (Diagram: a cycle linking Confidentiality Protection, Scientific Modeling, Data Synthesis, and Analytic Validity)
Slide 13: The Multi-layer System
- Basic confidential data
  - Fundamental product of virtually all Census programs
  - Leads to the publication of public-use products (summary data, micro data, narrative data)
- Gold-standard confidential data
  - Edited, documented, and archived research versions of confidential data
  - Used in internal Census research and at Research Data Centers
Slide 14: More Layers
- Partially synthetic micro data
  - Preserves the record structure or sampling frame of the gold-standard micro data
  - Replaces the data elements with synthetic values sampled from an appropriate probability model
- Fully synthetic micro data
  - Uses only the population or record linkage structure of the gold-standard micro data
  - Generates synthetic entities and data elements from appropriate probability models
Slide 15: The NSF Information Technology Research Grant
- A program that encourages innovative, high-payoff IT research and education
- Our grant proposal cited the many research studies and data products created by previous NSF support for the Research Data Center network and the Longitudinal Employer-Household Dynamics Program
Slide 16: What Is It?
- A $2.9 million, 3-year grant to the RDC network (Cornell is the coordinating institution)
- Provides core support for scientific activities at the RDCs
  - To develop public use, analytically valid synthetic data from many of the RDC-accessible data sets
  - To facilitate collaboration with RDC projects that help design and test these products
Slide 17: The Quarterly Workforce Indicators
- QWI was the LEHD Program's first public use data product
- QWI Online
  - Detailed labor force information by sub-state geography, detailed industry, ownership class, sex, and age group
Slide 18: The Confidentiality Protection System
- All QWI protections are done by noise infusion of the micro-data
- All micro-data items are distorted by at least a minimal percentage and up to a maximal percentage
- Only the distorted items are used in the production of the release product
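A minimal sketch of item-level noise infusion, assuming a simplified distortion distribution (uniform over the two allowed bands rather than the actual QWI ramp distribution) and illustrative bounds of 5% and 15%, which are not the production values:

```python
import numpy as np

def noise_factor(rng, a=0.05, b=0.15):
    """Draw a multiplicative distortion factor that moves an item by at
    least a and at most b. The uniform draw and the bounds a, b are
    simplifying assumptions, not the published QWI specification."""
    delta = rng.uniform(a, b)        # distortion size in [a, b]
    sign = rng.choice([-1.0, 1.0])   # direction chosen at random
    return 1.0 + sign * delta

rng = np.random.default_rng(1)
true_employment = np.array([120.0, 53.0, 880.0, 17.0])  # invented micro-data items
factors = np.array([noise_factor(rng) for _ in true_employment])
distorted = true_employment * factors  # only these enter published tables

# Every published item is off by between 5% and 15% of its true value.
pct_error = np.abs(distorted / true_employment - 1.0)
print(pct_error.round(3))
```

In the production system each reporting unit keeps the same factor over time, which is what slide 33 calls "dynamically consistent noise infusion": the distortion protects the cross-section while leaving growth rates nearly undistorted.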
Slide 19: Protection and Validity Principles
- Cells with few contributing businesses or few contributing individuals are distorted in the cross-section but not in the time series
- Bias in the cross-section is controlled and random; no analyst knows its sign
- More information
Slide 20: Theoretical Distribution of the QWI Distortion Factor
Slide 21: Theoretical Distribution of the QWI Distortion Factor
Slide 22: Actual Confidentiality Protection Distortion: Employment, Beginning-of-Quarter
Slide 23: Table 8: Distribution of Error in First-Order Serial Correlation
Slide 24: Graph: Distribution of Error in First-Order Serial Correlation
Slide 25: Enhancements
- The current product has suppressions for cells too small to protect by noise infusion
- The enhanced product replaces these suppressions with synthetic data
Slide 26: Percentage of Data Items in County-Level Release File
Slide 27: Beginning-of-Period Employment in NAICS Sector 62
Slide 28: Full-Quarter New Hires in NAICS4 3259
Slide 29: The Census Bureau's First Public Use Synthetic Data Application
- LEHD On the Map application
  - Shows commuting patterns at the Census block level with characteristics of the origin and destination block groups
- Origin block data are synthetic
  - Sampled from the posterior predictive distribution of origin blocks and origin characteristics given the destination block and destination block characteristics
- On the Map
Slide 30: Where people living in the selected area (Mobile's neighboring communities of Daphne and Fairhope) work
DRAFT: Beta Test Document Only
Source: On the Map beta application, Longitudinal Employer-Household Dynamics Program, U.S. Census Bureau, September 23, 2005
Slide 31: Where people working in the selected area (downtown Mobile) live
Slide 32: Synthetic Data Model
- y_ijk are the counts for residence block i, workplace block j, and characteristics k
- Characteristics are age group, earnings group, industry (NAICS sector), and ownership sector
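The posterior predictive sampling of origin blocks can be illustrated with a Dirichlet-multinomial sketch for a single workplace block. The counts, the prior shape, and the prior sample size below are all assumptions for illustration, not the production settings (slide 33 notes that the real informative prior is one of the complications).

```python
import numpy as np

rng = np.random.default_rng(3)

# Confidential counts y_i: workers commuting from residence block i (rows)
# to one fixed workplace block. Values are invented for illustration.
y = np.array([42, 17, 5, 1, 0])

# Assumed informative prior: a small Dirichlet pseudo-count per origin block.
alpha_prior = np.full(len(y), 0.5)

# Posterior over residence-block shares is Dirichlet(alpha_prior + y);
# draw a share vector, then synthetic origin counts from the predictive.
p = rng.dirichlet(alpha_prior + y)
synthetic_y = rng.multinomial(int(y.sum()), p)

print(synthetic_y)  # synthetic origin counts; workplace total is preserved
```

Because only the synthetic origin counts are mapped, small true cells (like the blocks with 1 or 0 workers) are smoothed toward the prior rather than published exactly.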
Slide 33: Complications
- Informative prior shape
- Prior sample size
- Workplace counts must be compatible with the protection system used by the Quarterly Workforce Indicators (QWI)
- Dynamically consistent noise infusion
Slide 36: Analytic Validity
- Assess the bias
- Assess the incremental variation
Slide 39: Confidentiality Protection
- The reclassification index is a measure of how many workers were geographically relocated by the synthetic data
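The slide does not give a formula, so the version below is an assumption, not necessarily the official definition: take the index to be the share of workers whose synthetic residence block differs from their true block.

```python
import numpy as np

rng = np.random.default_rng(11)

# Invented true vs. synthetic residence-block assignments for 1,000 workers:
# roughly 30% of workers are resampled to a random block by the synthesizer.
n_workers, n_blocks = 1000, 50
true_block = rng.integers(0, n_blocks, n_workers)
synthetic_block = np.where(rng.random(n_workers) < 0.3,
                           rng.integers(0, n_blocks, n_workers),
                           true_block)

# Assumed reclassification index: fraction of workers geographically relocated.
index = float(np.mean(synthetic_block != true_block))
print(f"reclassification index: {index:.3f}")
```

A higher index means more workers were moved, i.e. more protection for origin locations, at some cost to the accuracy of fine-geography commuting patterns.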
Slide 41: SIPP-SSA-IRS Public Use File
- Links IRS detailed earnings records and Social Security benefit data to public use SIPP data
- Basic confidential data: SIPP (1990-1993, 1996), W-2 earnings data, SSA benefit data
- Gold standard: completely linked, edited version of the data with variables drawn from all of the sources
- Partially synthetic data: created using the record structure of the existing SIPP panels, with all data elements synthesized using Bayesian bootstrap and sequential regression multivariate imputation methods
Slide 42: Multiple Imputation Confidentiality Protection
- Denote confidential data by Y and disclosable data by X
- Both Y and X may contain missing data, so that Y = (Y_obs, Y_mis) and X = (X_obs, X_mis)
- Assume the database can be represented by the joint density p(Y, X, θ)
Slide 43: Sequential Regression Multivariate Imputation Method
- Synthetic data values Y are draws from the posterior predictive density
- In practice, use a two-step procedure:
  1. Draw m completed datasets using SRMI (imputes values for all missing data)
  2. Draw r synthetic datasets for each completed dataset from the predictive density given the completed data
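The two-step procedure above can be sketched for a single synthesized variable with a linear-regression synthesizer. This toy version omits the posterior draw of the regression parameters (which a real SRMI implementation includes), uses invented data, and keeps m and r small for clarity.

```python
import numpy as np

rng = np.random.default_rng(5)

def fit_and_draw(x_fit, y_fit, x_new):
    """Fit y ~ x by least squares, then draw new y values at x_new from the
    estimated normal predictive density: one sequential-regression step.
    (A full SRMI step would also draw the coefficients from their posterior.)"""
    X = np.column_stack([np.ones_like(x_fit), x_fit])
    beta, *_ = np.linalg.lstsq(X, y_fit, rcond=None)
    sigma = np.sqrt(np.mean((y_fit - X @ beta) ** 2))
    X_new = np.column_stack([np.ones_like(x_new), x_new])
    return X_new @ beta + rng.normal(0.0, sigma, size=len(x_new))

# Toy gold-standard data with some missing y values (all values invented).
n = 500
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, n)
y[:50] = np.nan  # 50 records missing the confidential item

m, r = 2, 3  # m completed datasets, r synthetic datasets per completed one
synthetic_sets = []
for _ in range(m):
    # Step 1: complete the data (impute missing y from the observed cases).
    obs = ~np.isnan(y)
    y_complete = y.copy()
    y_complete[~obs] = fit_and_draw(x[obs], y[obs], x[~obs])
    # Step 2: synthesize r datasets from the completed data's predictive density.
    for _ in range(r):
        synthetic_sets.append(fit_and_draw(x, y_complete, x))

print(len(synthetic_sets))  # m * r synthetic implicates
```

The output is m × r = 6 synthetic implicates, each a full replacement of y; the combining rules on slide 47 then turn analyses of the implicates into valid inferences.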
Slide 44: Confidentiality Protection
- Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based
- This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF
- Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability
Slide 45: Disclosure Analysis
- Uses probabilistic record linking
- Each synthetic implicate is matched to the gold standard
- All unsynthesized variables are used as blocking variables
- Different matching variable sets are used
Slide 47: Testing Analytic Validity
- Run analyses on each synthetic implicate
  - Average the coefficients
  - Combine standard errors using formulae that account for the average variance of the estimates (within-implicate variance) and the variation of the estimates across implicates (between-implicate variance)
- Run analyses on the gold standard data
- Compare the average synthetic coefficient and standard error to the same quantities for the gold standard
- Analytic validity is measured by the overlap in the coverage of the synthetic and gold standard confidence intervals for a parameter
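The combining step can be written out directly. The rule below is the partially-synthetic-data combination T = v̄ + b/m (Reiter 2003), paired with a simple confidence-interval overlap score; the coefficient and variance values are invented for illustration.

```python
import numpy as np

def combine(estimates, variances):
    """Combine point estimates and variances from m synthetic implicates
    using the partially-synthetic rule T = v_bar + b/m (Reiter 2003)."""
    q = np.asarray(estimates, float)
    v = np.asarray(variances, float)
    m = len(q)
    q_bar = q.mean()            # averaged coefficient
    v_bar = v.mean()            # within-implicate variance
    b = q.var(ddof=1)           # between-implicate variance
    return q_bar, v_bar + b / m

def ci_overlap(lo1, hi1, lo2, hi2):
    """Average share of each interval covered by the other: a simple
    analytic-validity score (1.0 means identical intervals)."""
    inter = max(0.0, min(hi1, hi2) - max(lo1, lo2))
    return 0.5 * (inter / (hi1 - lo1) + inter / (hi2 - lo2))

# Illustrative regression coefficient from 4 implicates vs. the gold standard.
q_hat, T = combine([1.52, 1.48, 1.55, 1.50], [0.010, 0.012, 0.011, 0.009])
se = np.sqrt(T)
gold, gold_se = 1.49, 0.10
score = ci_overlap(q_hat - 1.96 * se, q_hat + 1.96 * se,
                   gold - 1.96 * gold_se, gold + 1.96 * gold_se)
print(f"combined estimate {q_hat:.3f} (se {se:.3f}), CI overlap {score:.2f}")
```

An overlap score near 1 means a researcher using the synthetic file would draw essentially the same inference as one using the gold standard.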
Slide 48: Log Annual Earnings Amount
Slide 49: Log Annual Benefit Amount
Slide 50: Tools
- NSF-sponsored supercomputer
- Virtual RDC
- Cornell INFO 747
Slide 51: The NSF-Sponsored Supercomputer on the RDC Network
- NSF01 is a 64-processor (384 GB memory) supercomputer
- Installed and optimized for complex data synthesizing and simulation
- Projects related to the ITR grant have access and priority
Slide 52: The Virtual RDC
- Virtual RDC (news server)
- The virtual RDC environment contains multiple servers that closely approximate an RDC compute server (e.g., NSF01)
- Disclosure-proofed metadata and synthetic data
- Now fully operational
- Any current or potential RDC user can have an account
Slide 53: Cornell Information Science 747
- INFO 747
- A course available to any potential RDC user, on DVD and via internet feed
- Training for using RDC-based data products
- Training for creating and testing synthetic data
Slide 54: Conclusions
- An important and challenging area that social scientists must be part of
- Use of confidential data collected by a public agency carries with it an obligation to disseminate enough data to permit scientific discourse
- Synthetic data are an important tool for this dissemination