Title: Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods
Slide 1: Confidentiality Protection of Social Science Micro Data: Synthetic Data and Related Methods
- John M. Abowd, Cornell University and Census Bureau
- January 30, 2006, UCLA Institute for Digital Research and Education presentation
Slide 2: Acknowledgements
- Many current and past LEHD staff and senior research fellows contributed to the development of the LEHD infrastructure system and the Quarterly Workforce Indicators. Kevin McKinney, Bryce Stephens, and Lars Vilhuber were particularly responsible for the confidentiality protection system.
- Fredrik Andersson and Marc Roemer at LEHD did the data analysis and implementation of the On the Map package. John Carpenter of Excensus, Inc. developed the mapping application.
- Gary Benedetto, Lisa Dragoset, Martha Stinson, and Bryan Ricchetti did the synthesis programming for the SIPP-PUF application.
Slide 3: Overview
- What is the problem?
- What are synthetic data?
- How can the research community benefit from synthetic data?
- The NSF-ITR synthetic data grant
- The Census Bureau's synthetic data and related products
  - QWI Online
  - On the Map
  - The new SIPP-SSA-IRS Public Use File
- Tools
Slide 4: Information Release and Data Protection Are Competing Objectives
- Statisticians call this the risk-utility trade-off
- Economists prefer to distinguish between technological trade-offs and preference trade-offs
- Information release and data protection involve a technological trade-off
Slide 5: A Simple Example of the Technological Trade-off
- There are two outputs: information released and data protection
- Consider a census with sampling as the release technology
- The PPF (production possibility frontier) measures the amount of information that must be sacrificed to get additional protection
- The information measure is Shannon's H (or the Kullback-Leibler divergence between the census and the sample)
- The protection measure is the maximum probability of an exact disclosure
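The trade-off on this slide can be made concrete with a small simulation: treat simple random sampling as the release technology, measure information loss by the Kullback-Leibler divergence between the census and the sample, and use one crude protection measure, the chance that a target's record appears in the released file at all. All numbers below are illustrative, not from the talk.

```python
import numpy as np

def kl_divergence(p, q):
    """D(p || q) in bits between two discrete distributions (q > 0 where p > 0)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / q[m])))

# Hypothetical census of N people over 4 demographic cells (invented numbers).
N = 10_000
census_counts = np.array([4000, 3000, 2000, 1000])
census_p = census_counts / N

def release(sampling_rate, rng):
    """Release technology: simple random sampling at the given rate."""
    n = int(N * sampling_rate)
    sample_counts = rng.multinomial(n, census_p)
    sample_p = sample_counts / n
    # Information sacrificed by releasing the sample instead of the census.
    info_loss = kl_divergence(census_p, np.clip(sample_p, 1e-12, None))
    # Crude disclosure-risk proxy: chance the target's record is in the file.
    risk = sampling_rate
    return info_loss, risk

rng = np.random.default_rng(42)
for rate in (0.5, 0.1, 0.01):
    loss, risk = release(rate, rng)
    print(f"rate={rate:5.2f}  info loss={loss:.4f} bits  disclosure risk={risk:.2f}")
```

Lower sampling rates buy more protection at the cost of a noisier (more divergent) released distribution, which is exactly the shape of the PPF described above.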
Slide 8: What Are Synthetic Data?
- Public use micro data products that reproduce essential features of confidential micro data products
- Essential features include:
  - Univariate distributions, overall and in subpopulations
  - Multivariate relations among the variables
Slide 9: Some History
- The original fully synthetic data idea was due to Rubin (JOS, 1993)
  - Synthesize the Decennial Census long-form responses for the short-form households, then release samples that do not include any actual long-form records
- The original partially synthetic data idea was due to Little (JOS, 1993)
  - Synthesize the sensitive values on the public use file
- Critical refinement (Fienberg, 1994)
  - Use a parametric posterior predictive distribution (instead of a Bayes bootstrap) to do the sampling
- Other contributors, particularly Raghunathan, Reiter, Rubin, Abowd, and Woodcock
  - Partially synthetic data with missing data (Reiter)
  - Sequential Regression Multivariate Imputation (Raghunathan, Reiter, and Rubin; Abowd and Woodcock)
Slide 10: How Can You Preserve Confidentiality and Multivariate Relations?
- Fundamental trade-off: better protection vs. better data quality
- Protection results from summarizing the data with a complicated multivariate distribution, then sampling that distribution instead of the original data
- The synthetic data are not any respondent's actual data
- But, for some techniques, it may still be possible to re-identify the source record in the confidential data
- New techniques address this problem
Slide 11: How Can the Research Community Benefit from Synthetic Data?
- Sophisticated research users must help develop the synthesizers in order to promote and improve analytic validity
- Many more users will have access to the information because there is a public use micro data product
Slide 12: The Research Synthetic Data Feedback Cycle
- (Diagram: a cycle linking Confidentiality Protection, Scientific Modeling, Data Synthesis, and Analytic Validity)
Slide 13: The Multi-layer System
- Basic confidential data
  - Fundamental product of virtually all Census programs
  - Leads to the publication of public-use products (summary data, micro data, narrative data)
- Gold-standard confidential data
  - Edited, documented, and archived research versions of confidential data
  - Used in internal Census research and at Research Data Centers
Slide 14: More Layers
- Partially synthetic micro data
  - Preserves the record structure or sampling frame of the gold-standard micro data
  - Replaces the data elements with synthetic values sampled from an appropriate probability model
- Fully synthetic micro data
  - Uses only the population or record linkage structure of the gold-standard micro data
  - Generates synthetic entities and data elements from appropriate probability models
Slide 15: The NSF Information Technology Research Grant
- A program that encourages innovative, high-payoff IT research and education
- Our grant proposal cited the many research studies and data products created by previous NSF support for the Research Data Center network and the Longitudinal Employer-Household Dynamics Program
Slide 16: What Is It?
- A $2.9 million, 3-year grant to the RDC network (Cornell is the coordinating institution)
- Provides core support for scientific activities at the RDCs
  - To develop public use, analytically valid synthetic data from many of the RDC-accessible data sets
  - To facilitate collaboration with RDC projects that help design and test these products
Slide 17: The Quarterly Workforce Indicators
- QWI was the LEHD Program's first public use data product
- QWI Online
  - Detailed labor force information by sub-state geography, detailed industry, ownership class, sex, and age group
Slide 18: The Confidentiality Protection System
- All QWI protections are done by noise infusion of the micro-data
- All micro-data items are distorted by at least a minimal percentage and up to a maximal percentage
- Only the distorted items are used in the production of the release product
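A minimal sketch of item-level noise infusion, assuming a simplified distortion distribution (uniform over the two allowed bands rather than the actual QWI ramp distribution) and illustrative bounds of 5% and 15%, which are not the production values:

```python
import numpy as np

def noise_factor(rng, a=0.05, b=0.15):
    """Draw a multiplicative distortion factor that moves an item by at
    least a and at most b. The uniform draw and the bounds a, b are
    simplifying assumptions, not the published QWI specification."""
    delta = rng.uniform(a, b)        # distortion size in [a, b]
    sign = rng.choice([-1.0, 1.0])   # direction chosen at random
    return 1.0 + sign * delta

rng = np.random.default_rng(1)
true_employment = np.array([120.0, 53.0, 880.0, 17.0])  # invented micro-data items
factors = np.array([noise_factor(rng) for _ in true_employment])
distorted = true_employment * factors  # only these enter published tables

# Every published item is off by between 5% and 15% of its true value.
pct_error = np.abs(distorted / true_employment - 1.0)
print(pct_error.round(3))
```

In the production system each reporting unit keeps the same factor over time, which is what slide 33 calls "dynamically consistent noise infusion": the distortion protects the cross-section while leaving growth rates nearly undistorted.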
Slide 19: Protection and Validity Principles
- Cells with few contributing businesses or few contributing individuals are distorted in the cross-section but not in the time series
- Bias in the cross-section is controlled and random; no analyst knows its sign
- More information
Slide 20: Theoretical Distribution of the QWI Distortion Factor
Slide 21: Theoretical Distribution of the QWI Distortion Factor
Slide 22: Actual Confidentiality Protection Distortion: Employment, Beginning-of-Quarter
Slide 23: Table 8: Distribution of Error in First-Order Serial Correlation
Slide 24: Graph: Distribution of Error in First-Order Serial Correlation
Slide 25: Enhancements
- The current product has suppressions for cells too small to protect by noise infusion
- The enhanced product replaces these suppressions with synthetic data
Slide 26: Percentage of Data Items in County-Level Release File
Slide 27: Beginning-of-Period Employment in NAICS Sector 62
Slide 28: Full-Quarter New Hires in NAICS4 3259
Slide 29: The Census Bureau's First Public Use Synthetic Data Application
- LEHD On the Map application
  - Shows commuting patterns at the Census block level with characteristics of the origin and destination block groups
- Origin block data are synthetic
  - Sampled from the posterior predictive distribution of origin blocks and origin characteristics given the destination block and destination block characteristics
- On the Map
Slide 30: Where people living in the selected area (Mobile's neighboring communities of Daphne and Fairhope) work
DRAFT: Beta Test Document Only
Source: On the Map beta application, Longitudinal Employer-Household Dynamics Program, U.S. Census Bureau, September 23, 2005
Slide 31: Where people working in the selected area (downtown Mobile) live
Slide 32: Synthetic Data Model
- y_ijk are the counts for residence block i, workplace block j, and characteristics k
- Characteristics are age group, earnings group, industry (NAICS sector), and ownership sector
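The posterior predictive sampling of origin blocks can be illustrated with a Dirichlet-multinomial sketch for a single workplace block. The counts, the prior shape, and the prior sample size below are all assumptions for illustration, not the production settings (slide 33 notes that the real informative prior is one of the complications).

```python
import numpy as np

rng = np.random.default_rng(3)

# Confidential counts y_i: workers commuting from residence block i (rows)
# to one fixed workplace block. Values are invented for illustration.
y = np.array([42, 17, 5, 1, 0])

# Assumed informative prior: a small Dirichlet pseudo-count per origin block.
alpha_prior = np.full(len(y), 0.5)

# Posterior over residence-block shares is Dirichlet(alpha_prior + y);
# draw a share vector, then synthetic origin counts from the predictive.
p = rng.dirichlet(alpha_prior + y)
synthetic_y = rng.multinomial(int(y.sum()), p)

print(synthetic_y)  # synthetic origin counts; workplace total is preserved
```

Because only the synthetic origin counts are mapped, small true cells (like the blocks with 1 or 0 workers) are smoothed toward the prior rather than published exactly.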
Slide 33: Complications
- Informative prior shape
- Prior sample size
- Workplace counts must be compatible with the protection system used by the Quarterly Workforce Indicators (QWI)
- Dynamically consistent noise infusion
Slide 36: Analytic Validity
- Assess the bias
- Assess the incremental variation
Slide 39: Confidentiality Protection
- The reclassification index is a measure of how many workers were geographically relocated by the synthetic data
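The slide does not give a formula, so the version below is an assumption, not necessarily the official definition: take the index to be the share of workers whose synthetic residence block differs from their true block.

```python
import numpy as np

rng = np.random.default_rng(11)

# Invented true vs. synthetic residence-block assignments for 1,000 workers:
# roughly 30% of workers are resampled to a random block by the synthesizer.
n_workers, n_blocks = 1000, 50
true_block = rng.integers(0, n_blocks, n_workers)
synthetic_block = np.where(rng.random(n_workers) < 0.3,
                           rng.integers(0, n_blocks, n_workers),
                           true_block)

# Assumed reclassification index: fraction of workers geographically relocated.
index = float(np.mean(synthetic_block != true_block))
print(f"reclassification index: {index:.3f}")
```

A higher index means more workers were moved, i.e. more protection for origin locations, at some cost to the accuracy of fine-geography commuting patterns.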
Slide 41: SIPP-SSA-IRS Public Use File
- Links IRS detailed earnings records and Social Security benefit data to public use SIPP data
- Basic confidential data: SIPP (1990-1993, 1996), W-2 earnings data, SSA benefit data
- Gold standard: completely linked, edited version of the data with variables drawn from all of the sources
- Partially synthetic data: created using the record structure of the existing SIPP panels, with all data elements synthesized using Bayesian bootstrap and sequential regression multivariate imputation methods
Slide 42: Multiple Imputation Confidentiality Protection
- Denote confidential data by Y and disclosable data by X
- Both Y and X may contain missing data, so that Y = (Y_obs, Y_mis) and X = (X_obs, X_mis)
- Assume the database can be represented by the joint density p(Y, X, θ)
Slide 43: Sequential Regression Multivariate Imputation Method
- Synthetic data values Y are draws from the posterior predictive density
- In practice, use a two-step procedure:
  1. Draw m completed datasets using SRMI (imputes values for all missing data)
  2. Draw r synthetic datasets for each completed dataset from the predictive density given the completed data
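The two-step procedure above can be sketched for a single synthesized variable with a linear-regression synthesizer. This toy version omits the posterior draw of the regression parameters (which a real SRMI implementation includes), uses invented data, and keeps m and r small for clarity.

```python
import numpy as np

rng = np.random.default_rng(5)

def fit_and_draw(x_fit, y_fit, x_new):
    """Fit y ~ x by least squares, then draw new y values at x_new from the
    estimated normal predictive density: one sequential-regression step.
    (A full SRMI step would also draw the coefficients from their posterior.)"""
    X = np.column_stack([np.ones_like(x_fit), x_fit])
    beta, *_ = np.linalg.lstsq(X, y_fit, rcond=None)
    sigma = np.sqrt(np.mean((y_fit - X @ beta) ** 2))
    X_new = np.column_stack([np.ones_like(x_new), x_new])
    return X_new @ beta + rng.normal(0.0, sigma, size=len(x_new))

# Toy gold-standard data with some missing y values (all values invented).
n = 500
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 0.5, n)
y[:50] = np.nan  # 50 records missing the confidential item

m, r = 2, 3  # m completed datasets, r synthetic datasets per completed one
synthetic_sets = []
for _ in range(m):
    # Step 1: complete the data (impute missing y from the observed cases).
    obs = ~np.isnan(y)
    y_complete = y.copy()
    y_complete[~obs] = fit_and_draw(x[obs], y[obs], x[~obs])
    # Step 2: synthesize r datasets from the completed data's predictive density.
    for _ in range(r):
        synthetic_sets.append(fit_and_draw(x, y_complete, x))

print(len(synthetic_sets))  # m * r synthetic implicates
```

The output is m × r = 6 synthetic implicates, each a full replacement of y; the combining rules on slide 47 then turn analyses of the implicates into valid inferences.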
Slide 44: Confidentiality Protection
- Protection is based on the inability of PUF users to re-identify the SIPP record upon which the PUF record is based
- This prevents wholesale addition of SIPP data to the IRS and SSA data in the PUF
- Goal: re-identification of SIPP records from the PUF should result in true matches and false matches with equal probability
Slide 45: Disclosure Analysis
- Uses probabilistic record linking
- Each synthetic implicate is matched to the gold standard
- All unsynthesized variables are used as blocking variables
- Different matching variable sets are used
Slide 47: Testing Analytic Validity
- Run analyses on each synthetic implicate
  - Average the coefficients
  - Combine standard errors using formulae that account for the average variance of the estimates (within-implicate variance) and the variation of the estimates across implicates (between-implicate variance)
- Run analyses on the gold standard data
- Compare the average synthetic coefficient and standard error to the same quantities for the gold standard
- Analytic validity is measured by the overlap in the coverage of the synthetic and gold standard confidence intervals for a parameter
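The combining step can be written out directly. The rule below is the partially-synthetic-data combination T = v̄ + b/m (Reiter 2003), paired with a simple confidence-interval overlap score; the coefficient and variance values are invented for illustration.

```python
import numpy as np

def combine(estimates, variances):
    """Combine point estimates and variances from m synthetic implicates
    using the partially-synthetic rule T = v_bar + b/m (Reiter 2003)."""
    q = np.asarray(estimates, float)
    v = np.asarray(variances, float)
    m = len(q)
    q_bar = q.mean()            # averaged coefficient
    v_bar = v.mean()            # within-implicate variance
    b = q.var(ddof=1)           # between-implicate variance
    return q_bar, v_bar + b / m

def ci_overlap(lo1, hi1, lo2, hi2):
    """Average share of each interval covered by the other: a simple
    analytic-validity score (1.0 means identical intervals)."""
    inter = max(0.0, min(hi1, hi2) - max(lo1, lo2))
    return 0.5 * (inter / (hi1 - lo1) + inter / (hi2 - lo2))

# Illustrative regression coefficient from 4 implicates vs. the gold standard.
q_hat, T = combine([1.52, 1.48, 1.55, 1.50], [0.010, 0.012, 0.011, 0.009])
se = np.sqrt(T)
gold, gold_se = 1.49, 0.10
score = ci_overlap(q_hat - 1.96 * se, q_hat + 1.96 * se,
                   gold - 1.96 * gold_se, gold + 1.96 * gold_se)
print(f"combined estimate {q_hat:.3f} (se {se:.3f}), CI overlap {score:.2f}")
```

An overlap score near 1 means a researcher using the synthetic file would draw essentially the same inference as one using the gold standard.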
Slide 48: Log Annual Earnings Amount
Slide 49: Log Annual Benefit Amount
Slide 50: Tools
- NSF-sponsored supercomputer
- Virtual RDC
- Cornell INFO 747
Slide 51: The NSF-Sponsored Supercomputer on the RDC Network
- NSF01 is a 64-processor (384 GB memory) supercomputer
- Installed and optimized for complex data synthesizing and simulation
- Projects related to the ITR grant have access and priority
Slide 52: The Virtual RDC
- Virtual RDC (news server)
- The virtual RDC environment contains multiple servers that closely approximate an RDC compute server (e.g., NSF01)
- Disclosure-proofed metadata and synthetic data
- Now fully operational
- Any current or potential RDC user can have an account
Slide 53: Cornell Information Science 747
- INFO 747
- A course available to any potential RDC user, on DVD and via internet feed
- Training for using RDC-based data products
- Training for creating and testing synthetic data
Slide 54: Conclusions
- An important and challenging area that social scientists must be part of
- Use of confidential data collected by a public agency carries with it an obligation to disseminate enough data to permit scientific discourse
- Synthetic data are an important tool for this dissemination