Title: Statistical confidentiality and privacy: 1. General considerations * * * Robert McCaa Minnesota Population Center rmccaa@umn.edu
1 Statistical confidentiality and privacy: 1. General considerations
Robert McCaa, Minnesota Population Center, rmccaa_at_umn.edu
Inadequate use of microdata has high costs --Len Cook (2003, Registrar General, ONS)
2 UNSD Principles and Recommendations (Rev. 1, 1997) endorse dissemination of census microdata
- 1.218: There are a range of methods that can be used to make such microdata available while still protecting individuals' rights to privacy. (Rev. 2 has a stronger statement.)
- In four decades of distributing microdata there is not a single allegation of a breach of confidentiality or privacy (includes 100 microdata stored at CELADE in Santiago, Chile).
3 Why disseminate microdata? Julia Lane, European Statisticians Conference (2003)
- 1. Analyze more realistic questions
- 2. Develop reality-based policy
- 3. Acquire new constituencies and stakeholders
- 4. Build trust; reduce suspicions of data cooking
- 5. Replicate findings
  - a. use standards of UNSD, Eurostat, ISCO, ISCED, etc.
  - b. facilitate comparative research in time and space
- 6. Calculate marginal effects
- 7. Assess data quality
- and much, much more.
4 Imagine!!! What's the problem?
- Confidentializing an integrated microdata base with
  - 200 samples of households (70 countries)
  - Containing ½ billion person records with thousands of variables
  - Available to tens of thousands of licensed users regardless of country of birth, citizenship, residence or place of work
  - Without a single allegation of violation of privacy or statistical confidentiality--Ever!!
5 Usage: off-site vs. on-site use (secure microdata laboratory)? Germany RDC, 2005-8 (Jan-Sept): ten-to-one
RDCs are expensive and attract few users.
6 ONS-UK gold standard
Statistical disclosure control methods may modify the data or the design of the statistic, or a combination of both. They will be judged sufficient when the guarantee of confidentiality can be maintained, taking account of information likely to be available to third parties, either from other sources or as previously released National Statistics outputs, against the following standard: It would take a disproportionate amount of time, effort and expertise for an intruder to identify a statistical unit to others, or to reveal information about that unit not already in the public domain.
--Protocols on Data Access and Confidentiality, pp. 7-8, ONS-UK (2004), www.statistics.gov.uk/about_ns/cop/downloads/prot_data_access_confidentiality.pdf
7 Risk assessment of household samples of UK 1991 census: attempts at matching are fruitless; few matches, many false positives
- After taking into account errors in the data, coding variability and changing of personal characteristics in time
- Dale and Elliott, JRSS-A (2003)
For a user of an outside database, attempting this sort of match with no opportunity for verification would prove fruitless. In the first place, the small degree of expected overlap would be a considerable deterrent to an intruder. However, if a match between the two files was attempted, the large number of apparent matches would be highly confusing, as an intruder would have no way of checking correct identification.
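The matching argument above can be illustrated with a small simulation. This is a minimal sketch under invented assumptions, not the procedure used by Dale and Elliott: the population size, the 2% sampling fraction, the four key variables, the 10% coding-error rate and the function name `simulate_match` are all hypothetical.

```python
import random
from collections import defaultdict


def simulate_match(pop_size=20000, sample_frac=0.02, intruder_size=200,
                   error_rate=0.10, seed=1):
    """Compare apparent and correct matches for a hypothetical intruder."""
    rng = random.Random(seed)

    # Hypothetical key variables: age group, sex, occupation group, region.
    def random_key():
        return (rng.randrange(18), rng.randrange(2),
                rng.randrange(10), rng.randrange(5))

    population = [random_key() for _ in range(pop_size)]

    # The released sample holds keys as recorded, with some coding error
    # or change over time (here: the age group is shifted by one).
    sample_ids = rng.sample(range(pop_size), int(pop_size * sample_frac))
    sample_index = defaultdict(list)
    for pid in sample_ids:
        age, sex, occ, region = population[pid]
        if rng.random() < error_rate:
            age = (age + 1) % 18
        sample_index[(age, sex, occ, region)].append(pid)

    # The intruder holds correct keys for an independently drawn set of people.
    intruder_ids = rng.sample(range(pop_size), intruder_size)
    sampled = set(sample_ids)
    overlap = sum(1 for pid in intruder_ids if pid in sampled)

    apparent = 0  # sample records that share a key with an intruder record
    correct = 0   # apparent matches that really are the same person
    for pid in intruder_ids:
        candidates = sample_index.get(population[pid], [])
        apparent += len(candidates)
        correct += candidates.count(pid)

    return {"overlap": overlap, "apparent": apparent, "correct": correct}


result = simulate_match()
print(result)
```

Under these assumed settings only about 2% of the intruder's file is expected to be in the sample at all, while many sample records share a key with someone else, so apparent matches far outnumber correct ones and the intruder has no way to tell them apart.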
8 Level of Anonymization (FSO-Germany)
[Figure: degree of confidentiality plotted against degree of analysis potential. Microdata levels shown: confidential microdata, complete microdata, de-facto anonymised microdata, fully anonymised microdata. Steps shown: delete direct identifier, anonymisation method, stronger anonymisation method.]
Trade-off between confidentiality and analysis potential: is it monotonic (as portrayed)?
9 Level of Anonymization: not monotonic
[Figure: the diagram from slide 8 with an added step, Construct sample. Values shown near the degree-of-confidentiality label: 95, 99, 99.9. Values shown near the degree-of-analysis-potential label: 50, 25, 45.]
Trade-off is not monotonic
10 Resources
- UN-ECE (2007), Managing Statistical Confidentiality & Microdata Access, http://www.unece.org/stats/documents/tfcm.htm
- IHSN Tools and Guidelines, anonymization, www.surveynetwork.org
- Eurostat (1999)
11 UN-ECE (2007) www.unece.org/stats/documents/tfcm.htm
12 IHSN www.Surveynetwork.org
13 IHSN www.Surveynetwork.org
14 IHSN www.Surveynetwork.org
- Remove variables
  - Identifiers: name, address, low-level administrative geography
  - Sensitive: tribe, disability
- Global recoding
  - Aggregate classes: age (5 yr groups), country of birth (continent), administrative geography, occupation (4 digit -> 3), etc.
  - Top and bottom coding (continuous variables--income, size of residence, number of rooms, etc.)
- Local suppression--sparse categories (population n < 250-2,500)
- Data swapping (household geography)
- Complex perturbations
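The first three techniques in the list above can be sketched in a few lines. This is an illustration only, not an IHSN tool: the field names, the `CONTINENT` lookup, the income ceiling and the tiny suppression threshold are hypothetical, and bottom coding, data swapping and complex perturbations are left out.

```python
from collections import Counter

# Hypothetical lookup table and thresholds, for illustration only.
CONTINENT = {"Kenya": "Africa", "Chile": "Americas", "France": "Europe"}
IDENTIFIERS = {"name", "address", "village"}
SENSITIVE = {"tribe", "disability"}
INCOME_TOP = 100000
MIN_CELL = 2  # a real threshold would be far larger (hundreds or thousands)


def recode(record):
    """Remove identifying/sensitive variables and apply global recoding."""
    out = {k: v for k, v in record.items()
           if k not in IDENTIFIERS and k not in SENSITIVE}
    low = (out.pop("age") // 5) * 5
    out["age_group"] = f"{low}-{low + 4}"             # 5-year age groups
    out["birth_region"] = CONTINENT.get(out.pop("birth_country"), "Other")
    out["occupation"] = out["occupation"][:3]         # 4-digit code -> 3 digits
    out["income"] = min(out["income"], INCOME_TOP)    # top coding
    return out


def suppress_sparse(records, field, min_cell=MIN_CELL):
    """Local suppression: blank a value shared by fewer than min_cell records."""
    counts = Counter(r[field] for r in records)
    return [{**r, field: None} if counts[r[field]] < min_cell else r
            for r in records]


raw = [
    {"name": "A", "address": "1 Main St", "village": "V1", "tribe": "T1",
     "disability": "none", "age": 37, "birth_country": "Kenya",
     "occupation": "2341", "income": 250000},
    {"name": "B", "address": "2 Main St", "village": "V1", "tribe": "T2",
     "disability": "none", "age": 35, "birth_country": "Kenya",
     "occupation": "2342", "income": 30000},
    {"name": "C", "address": "3 Main St", "village": "V2", "tribe": "T1",
     "disability": "sight", "age": 62, "birth_country": "Chile",
     "occupation": "6111", "income": 12000},
]

released = suppress_sparse([recode(r) for r in raw], "occupation")
for row in released:
    print(row)
```

In this toy input the third record is the only one in its occupation group, so its occupation is suppressed, while the first record keeps its occupation but has its income capped at the assumed ceiling.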
15 EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International
- 1. Restrict access to samples
- 2. Limit geographical detail
- 3. Re-code unique categories--top and bottom
- 4. Sign non-disclosure agreement
- 5. Prohibit redistribution to third parties
- 6. Prohibit attempts to identify individuals or the making of any claim to that effect
- 7. Require users to provide copies of publications
16 EUROSTAT statistical confidentiality standards (Thorogood, 1999) --all endorsed by IPUMS-International
- 8. Construct age from birthdate, if necessary
- 9. Do not identify date of birth
- 10. Do not identify precise place of birth
- 11. Migration timing/place not identified in detail
- 12. Identify place of residence by major civil division (pop > 20k, 60k, 100k, 1 million, i.e., national convention)
- 13. Do sensitivity analysis
- 14. Do confidentiality assessment (not yet)
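Items 8, 9 and 12 above lend themselves to a short sketch. The census reference date, the 20,000 population threshold, the field names and the division populations below are hypothetical; the actual threshold follows national convention.

```python
from datetime import date

CENSUS_DAY = date(2000, 6, 30)  # hypothetical census reference date
MIN_POPULATION = 20000          # hypothetical; national convention varies


def age_at_census(birthdate, census_day=CENSUS_DAY):
    """Completed years of age on the census reference date."""
    had_birthday = ((census_day.month, census_day.day)
                    >= (birthdate.month, birthdate.day))
    return census_day.year - birthdate.year - (0 if had_birthday else 1)


def release_record(record, division_population):
    """Replace birthdate with age; keep residence only for large divisions."""
    division = record["division"]
    if division_population.get(division, 0) < MIN_POPULATION:
        division = "Other"
    return {"age": age_at_census(record["birthdate"]), "division": division}


populations = {"North": 150000, "Islet": 4000}
person = {"birthdate": date(1970, 7, 1), "division": "Islet"}
print(release_record(person, populations))
# {'age': 29, 'division': 'Other'}
```

The released record carries age rather than date of birth, and a division below the assumed threshold is folded into a residual category.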
17 Countering Fear, Hysteria and Paranoia with reason
There has been no known attempt at identification with the 1991 SARs microdata samples of the UK, nor in any other countries that disseminate samples of microdata
--Elliott and Dale, Journal of the Royal Statistical Society, 1999
18 No official statistical microdata!!
Why not? Companies want linkable data with names, addresses, IDs, etc.
Probabilistic linking with 90% of the population missing is not good enough.
ChoicePoint: Data Sources and Clients. Source: Washington Post
http://www.choicepoint.com/
19 No statistical microdata!!
To play pizza video: http://www.aclu.org/pizza/
21 Statistical samples are innocuous. Nothing to be gained from matching.
Countering Fear, Hysteria and Paranoia with reason
There has been no known attempt at identification with the 1991 SARs microdata samples of the UK, nor in any other countries that disseminate samples of microdata
--Elliott and Dale, Journal of the Royal Statistical Society, 1999
22 Please allow me to invite you to think about producing (or permitting IPUMS to produce) anonymized, integrated samples for all the censuses of your country for which microdata survive.
Thank you
Contact: rmccaa_at_umn.edu
This ppt is available at www.hist.umn.edu/rmccaa/ipums-global (see Port of Spain workshop)