Title: Disclosure Limitation Methods and Information Loss for Tabular Data
1Disclosure Limitation Methods and Information
Loss for Tabular Data
- George T. Duncan, Stephen E. Fienberg,
- Ramayya Krishnan, Rema Padman
- and Stephen F. Roehrig
- Carnegie Mellon University
2Focus of the Talk
- Categorical data
- Compilations of surveys and other data-gathering efforts
- Tables of counts (e.g., number of females in Metropolis with income > 200,000)
- Cf. microdata
- Does the release of a table allow inference of a sensitive attribute value for an individual (e.g., Lois Lane's income)?
- Exact value
- Range of values
- Probability distribution
3Some Tough Questions
- Exact, interval or probabilistic disclosure?
- Should we analyze a data product in isolation? Or must we look at the suite of products released?
- Longitudinal data can be especially revealing. How can we know next year's data?
- What about linkage with external data sources? Are we responsible for everything that's out there?
4The SDL Problem
[Figure: risk-utility diagram. Axes: Disclosure Risk (information about confidential items) and Data Utility (information about legitimate items). Labeled: Original Data, Released Data, No Data, and Maximum Tolerable Risk.]
5Risk and Utility
- Disclosure risk depends on
- The definition of disclosure, and
- The ways disclosure could occur.
- Data utility is
- A measure of information loss, and
- Maximal for the original data.
- Often we can trade off disclosure risk and data
utility
6Sample Measure for Risk
- Risk for a cell is defined in terms of
- r(k), the risk of a snooper discovering that the cell value is k, and
- p(k), the probability of the cell having value k.
- The agency determines r(k), then tries to estimate the snooper's posterior for p(k), given the table release.
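One plausible way to combine r(k) and p(k) is an expectation, the sum over k of r(k) p(k). This is an assumption, not necessarily the measure used in the talk. The sketch below applies it to the distribution for cell (1,2) quoted on the controlled-rounding slide; the r(k) weights are hypothetical.

```python
# Assumption: cell risk is the expectation of r(k) under the snooper's p(k).
def cell_risk(r, p):
    """Return the sum over k of r(k) * p(k)."""
    return sum(r[k] * p[k] for k in p)

# Distribution for cell (1,2) quoted on the controlled-rounding slide.
p = {0: 0.436, 1: 0.347, 2: 0.217}

# Hypothetical agency weights: learning that a count is 1 is treated as
# the most damaging outcome, learning that it is 0 as harmless.
r = {0: 0.0, 1: 1.0, 2: 0.5}

risk = cell_risk(r, p)
print(round(risk, 4))
```

An agency would compare this number with its maximum tolerable risk before release.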
7Sample Measures for Utility
- Cell-oriented: mean square precision of the user's posterior distribution for a cell.
- Table-oriented: change in χ² for 2-way tables, other (or multiple) measures of association for n-way tables.
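The table-oriented measure can be sketched as the difference in Pearson's chi-square statistic (for independence) between an original and a protected 2-way table. The function name is illustrative; the data are the interior cells of the original and base-3 rounded tables from the controlled-rounding slide.

```python
def chi_square(table):
    """Pearson chi-square statistic for independence in a 2-way table.

    Assumes every row and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Interior cells (no margins) of the original and rounded tables.
original = [[15, 1, 3, 1], [20, 10, 10, 15], [3, 10, 10, 2], [12, 14, 7, 2]]
rounded = [[15, 0, 3, 0], [21, 9, 12, 15], [3, 12, 9, 0], [12, 15, 6, 3]]

change = chi_square(rounded) - chi_square(original)
print(chi_square(original), chi_square(rounded), change)
```

A small change suggests the protected table preserves the association structure an analyst would see in the original.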
8Disclosure Auditing
- Traditional risk assessment
- A data disseminator follows these steps
- Audit a proposed release for disclosures
- If potential disclosures exist, apply SDL
- Audit the result to ensure protection
- Repeat as necessary
- Once more, what is a disclosure?
9Disclosure Auditing (cont.)
- Disclosure may be a sensitive cell value known
- with certainty,
- to be in a narrow range, or
- with high probability.
- Let's examine some SDL techniques, considering
- The various definitions of disclosure,
- The difficulty of applying and auditing them,
- The utility of the disclosure-limited results, and
- Whether they are useful for higher-dimensional tables.
10Controlled Rounding (Zero-Restricted, Let's Say)
Original Table
15   1   3   1   20
20  10  10  15   55
 3  10  10   2   25
12  14   7   2   35
50  35  30  20  135

Published Table Rounded to Base 3
15   0   3   0   18
21   9  12  15   57
 3  12   9   0   24
12  15   6   3   36
51  36  30  18  135
11Controlled Rounding (cont.)
- No exact disclosures can occur.
- The feasibility interval of any cell is its published value ± (b−1), where b is the rounding base (except close to zero).
- Finding a rounding is easy for 2-way tables.
- Finding a rounding is harder (and one may not even exist) for higher-dimensional tables.
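The properties of a zero-restricted controlled rounding can be checked mechanically. The sketch below uses illustrative function names and takes tables with their margins in the last row and column. It tests that every entry is a multiple of the base, that entries already on a multiple are unchanged, that no entry moves by a full base or more, and that the published table still adds up. The data are the two tables from the controlled-rounding slide.

```python
def is_controlled_rounding(original, published, base):
    """Check a zero-restricted controlled rounding, margins included."""
    for orig_row, pub_row in zip(original, published):
        for o, p in zip(orig_row, pub_row):
            if p % base != 0:                # must be a multiple of the base
                return False
            if o % base == 0 and p != o:     # zero-restricted: multiples stay put
                return False
            if abs(p - o) >= base:           # rounded to an adjacent multiple
                return False
    body = [row[:-1] for row in published[:-1]]
    rows_ok = all(sum(r) == full[-1] for r, full in zip(body, published[:-1]))
    cols_ok = all(sum(c) == t for c, t in zip(zip(*body), published[-1][:-1]))
    total_ok = sum(published[-1][:-1]) == published[-1][-1]
    return rows_ok and cols_ok and total_ok

def feasibility_interval(value, base):
    """Published value +/- (base - 1), truncated at zero."""
    return max(0, value - (base - 1)), value + (base - 1)

original = [[15, 1, 3, 1, 20], [20, 10, 10, 15, 55], [3, 10, 10, 2, 25],
            [12, 14, 7, 2, 35], [50, 35, 30, 20, 135]]
published = [[15, 0, 3, 0, 18], [21, 9, 12, 15, 57], [3, 12, 9, 0, 24],
             [12, 15, 6, 3, 36], [51, 36, 30, 18, 135]]

print(is_controlled_rounding(original, published, 3))
print(feasibility_interval(published[0][1], 3))   # cell (1,2)
```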
12Controlled Rounding (cont.)
- There are 576,598,396 tables that could be rounded to the published table.
- How to determine a prior probability over this set?
- With a huge leap of faith about priors, cell (1,2) has this distribution:

q      0     1     2
Pr(q)  .436  .347  .217
13Cell Suppression
Original Table
15   1   3   1   20
20  10  10  15   55
 3  10  10   2   25
12  14   7   2   35
50  35  30  20  135

Published Table With Suppressions
15   s   3   s   20
20  10  10  15   55
 3   s  10   s   25
12   s   7   s   35
50  35  30  20  135
14Cell Suppression (cont.)
- Finding a suppression pattern can be hard computationally; heuristics may be untrustworthy.
- Auditing is often done with linear programming (LP), finding upper and lower cell bounds.
- In higher dimensions, LP may give fractional bounds: how to interpret?
- How does an analyst use a table with suppressions?
15Cell Suppression (cont.)
- Again, there are many possible true tables.
- For 2-way tables, they are easily enumerated.
- For n-way tables, it's quite hard.
- Again, it's difficult to specify priors (need to know the exact implementation of the suppression algorithm).
- Posterior distributions for suppressed cells can be had, but it's a lot of work.
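For the small 2-way example on the cell-suppression slide, the possible true tables can be enumerated by brute force. The sketch below does so and reports integer bounds for each suppressed cell. It substitutes exhaustive enumeration for LP, and the final tally assumes a uniform prior over feasible tables, which ignores how the suppression algorithm chose its pattern. Function and variable names are illustrative.

```python
from collections import Counter
from itertools import product

S = None  # marker for a suppressed cell

# Interior of the published table, with its row and column totals.
published = [[15, S, 3, S], [20, 10, 10, 15], [3, S, 10, S], [12, S, 7, S]]
row_totals = [20, 55, 25, 35]
col_totals = [50, 35, 30, 20]

def feasible_tables(table, row_totals, col_totals):
    """Return every nonnegative integer fill-in that matches the margins."""
    cells = [(i, j) for i, row in enumerate(table)
             for j, v in enumerate(row) if v is None]
    # Amount left in each row/column once the visible cells are subtracted.
    row_res = [t - sum(v for v in row if v is not None)
               for row, t in zip(table, row_totals)]
    col_res = [t - sum(row[j] for row in table if row[j] is not None)
               for j, t in enumerate(col_totals)]
    ranges = [range(min(row_res[i], col_res[j]) + 1) for i, j in cells]
    found = []
    for values in product(*ranges):
        rs = [0] * len(row_res)
        cs = [0] * len(col_res)
        for (i, j), v in zip(cells, values):
            rs[i] += v
            cs[j] += v
        if rs == row_res and cs == col_res:
            found.append(dict(zip(cells, values)))
    return found

tables = feasible_tables(published, row_totals, col_totals)
bounds = {cell: (min(t[cell] for t in tables), max(t[cell] for t in tables))
          for cell in tables[0]}
# Zero-based (0, 1) is cell (1,2) in the slides' numbering.
tally = Counter(t[(0, 1)] for t in tables)

print(len(tables), bounds)
print({value: count / len(tables) for value, count in sorted(tally.items())})
```

The search space grows with the product of the cell ranges, so this approach only suits small suppression patterns.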
16Publishing Only Some Margins of an N-Way Table
- Think of the n-way base table as being fully suppressed.
- The published marginal tables constrain the values in the base table.
- Auditing characterizes cells in the base table and/or other unpublished margins.
- Here's an example.
17An HMO Example
Table OfficeVisit

v    Patient  Doctor    Treatment
122  David    Christy   Compoz
123  John     Phillips  Fungicide
124  Israel   Christy   AZT
125  John     Hill      Compoz

[Figure: cube with axes Patient (i), Doctor (j) and Treatment (k); each cell holds x_ijk.]

x_ijk = count of visits over Patient i (i = 1, ..., I), Doctor j (j = 1, ..., J), Treatment k (k = 1, ..., K)
18The HMO Example (cont.)
- Obviously we don't broadcast Patient-Doctor-Treatment.
- The view Patient-Treatment is also sensitive.
- But the Accounting Dept. has Patient-Doctor.
- And the Physician Review Board has Doctor-Treatment.
- Ted works in Accounting, his wife Alice is on the Physician Review Board, and Israel is an occasional babysitter for them.
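What Ted and Alice could learn together can be sketched by joining the two views on Doctor. The code uses the four OfficeVisit rows from the example. The helper name is illustrative, and the join only lists candidate treatments; it does not compute bounds on counts.

```python
from collections import Counter

# The OfficeVisit rows: (visit, patient, doctor, treatment).
visits = [
    (122, "David", "Christy", "Compoz"),
    (123, "John", "Phillips", "Fungicide"),
    (124, "Israel", "Christy", "AZT"),
    (125, "John", "Hill", "Compoz"),
]

# The two released views, as count tables.
patient_doctor = Counter((p, d) for _, p, d, _ in visits)      # Accounting
doctor_treatment = Counter((d, t) for _, _, d, t in visits)    # Review Board

def candidate_treatments(patient_doctor, doctor_treatment):
    """For each (patient, doctor) pair, list treatments that doctor gave."""
    return {
        (patient, doctor): {t for (d, t) in doctor_treatment if d == doctor}
        for (patient, doctor) in patient_doctor
    }

candidates = candidate_treatments(patient_doctor, doctor_treatment)
for pair, treatments in sorted(candidates.items()):
    print(pair, sorted(treatments))
```

A pair with a single candidate is an exact disclosure. For Israel, the join narrows the treatment to one of the two that Dr. Christy prescribed.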
19More Generally
- An n-way table of sensitive data.
- Some collection of lower-dimensional marginal tables is proposed for publication.
- How to find bounds, or better, distributions, for the sensitive cells?
- Recall: linear programming often gives fractional bounds.
20Integer Linear Programming?
- Many techniques, but generally very slow compared to continuous LP.
- Empirically, Gomory cuts work well.
- Some special problems have the integer rounding property.
- Much more to be done here.
21Other Bounding Techniques
- Generalized shuttle algorithm
- The shuttle algorithm (Buzzigoli & Giusti) starts with loose upper/lower bounds, then tightens them.
- Dobra & Fienberg improved this (a lot), but it is still not completely general.

[Figure: successive BG upper bounds compared with the true continuous bound, the true integer bound and the true lower bound. Values shown: 16, 15, 14, 13.5, 13, 8.]
22Exploiting Structure
- Decomposable graphs
- Suppose a 3-D table (indices I, J, K): we publish IJ and JK, and want bounds for IJK.
- The Dobra-Fienberg graph looks like

[Figure: graph on nodes I, J and K, labeled with the margins IJ and JK.]

- Dobra and Fienberg show that if the graph has a separator (node J), and this separator is a clique, then Fréchet bounds are exact.
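For the IJ and JK case, the Fréchet bounds on a cell are max(0, n_ij+ + n_+jk − n_+j+) ≤ n_ijk ≤ min(n_ij+, n_+jk). A minimal sketch follows, using a hypothetical 2×2×2 table that is not taken from the talk; the function names are illustrative.

```python
# Hypothetical 3-way table of counts, indexed x[i][j][k].
x = [[[3, 1], [2, 4]],
     [[0, 5], [6, 2]]]
I, J, K = 2, 2, 2

# The two published margins, plus the J margin they share.
n_ij = [[sum(x[i][j][k] for k in range(K)) for j in range(J)] for i in range(I)]
n_jk = [[sum(x[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]
n_j = [sum(n_ij[i][j] for i in range(I)) for j in range(J)]

def frechet_bounds(i, j, k):
    """Lower and upper bounds on cell (i, j, k) given the IJ and JK margins."""
    lower = max(0, n_ij[i][j] + n_jk[j][k] - n_j[j])
    upper = min(n_ij[i][j], n_jk[j][k])
    return lower, upper

for i in range(I):
    for j in range(J):
        for k in range(K):
            print((i, j, k), x[i][j][k], frechet_bounds(i, j, k))
```

Only the margins enter the bound computation; the true counts are printed alongside for comparison.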
23Probabilities of Cell Values
- Diaconis and Sturmfels (1998) show how to sample from the space of tables that agree with known marginals.
- Not hard to extend to tables with suppressions.
- They use results from commutative algebra to find a Gröbner basis, a list of moves that change a table but leave the margins fixed.
- A random walk using these moves carries you uniformly through the space of tables.
- Tally the proportion of time a sensitive cell takes on different values.
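For a 2-way table with fixed row and column totals, the moves are simple: add 1 to two diagonal corners of a 2×2 sub-square and subtract 1 from the other two. The sketch below runs such a walk on a hypothetical 3×3 table and tallies one cell. It is a simplified illustration of the idea, not the 3-way computation described in the talk, and a short run gives only a rough estimate.

```python
import random
from collections import Counter

def random_walk_tally(table, cell, steps, seed=0):
    """Walk over tables with fixed margins; tally the values of one cell."""
    rng = random.Random(seed)
    t = [row[:] for row in table]
    n_rows, n_cols = len(t), len(t[0])
    tally = Counter()
    for _ in range(steps):
        # Ordered pairs of rows and columns, so both signs of each move occur.
        i1, i2 = rng.sample(range(n_rows), 2)
        j1, j2 = rng.sample(range(n_cols), 2)
        # Move: +1 at (i1,j1) and (i2,j2), -1 at (i1,j2) and (i2,j1).
        if t[i1][j2] > 0 and t[i2][j1] > 0:
            t[i1][j1] += 1
            t[i2][j2] += 1
            t[i1][j2] -= 1
            t[i2][j1] -= 1
        # A move that would create a negative count is rejected (stay put).
        tally[t[cell[0]][cell[1]]] += 1
    proportions = {value: count / steps for value, count in tally.items()}
    return t, proportions

start = [[3, 1, 2], [0, 4, 1], [2, 2, 3]]   # hypothetical feasible table
final, proportions = random_walk_tally(start, cell=(0, 1), steps=20000)
print(sorted(proportions.items()))
```

Because proposals are symmetric and infeasible moves are rejected in place, the walk's stationary distribution is uniform over the tables it can reach.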
243-D Table, 2-D Margins Known
[Figure: 3-D table with axes i, j, k (layers k1, k2, k3), drawn with its 2-D margins. One margin survives intact, with row, column and grand totals:]

 6   6   6   18
 6   7   6   19
 9   6   7   22
21  19  19   59
25Gröbner Bases Moves
- Suppose we know a table that matches the published margins (i.e., is feasible).
- How can we move to another feasible table?
- Example move

[Figure: example move, shown as arrays of zero and nonzero entries; the nonzero entries are not recoverable.]
26Computing the Gröbner Basis
- The general-purpose program Macaulay can find the 3×3×3 basis in about 7 hours (300 MHz PC).
- A specialized program does this in 25 ms.
- The 4×3×3 basis takes 20 minutes (628 moves).
- The 5×3×3 basis takes 3 months (3236 moves).
27Exploiting Structure Again
- If the independence graph of the released marginals is decomposable, the Gröbner basis is easily determined.
- If the graph is almost decomposable, the basis can be obtained by piecing together bases for smaller problems.
- Dobra demonstrates that these methods can be used to estimate sensitive cell distributions.
28Markov Perturbation
- Consider an elementary data square in a 2-way table.
- It might look like

 1  14   15
17  83  100
18  97  115
29Markov Perturbation (cont.)
- The cell values in the data square are stochastically modified so that
- the marginal totals remain unchanged, and
- the expected cell values equal the original values (unbiased).
- A single parameter determines how much mixing is done.
- By choosing elementary data squares randomly, then perturbing, the overall table is protected.
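The two stated properties (unchanged margins, unbiased cells) can be illustrated with a deliberately simple scheme. This is not the authors' Markov perturbation method. With probability `theta` (a hypothetical name for the mixing parameter) the square is shifted by +1 or −1 along its diagonal, each direction equally likely, and squares containing a zero are left alone so that both directions are always feasible.

```python
import random

def perturbation_distribution(square, theta):
    """Return (probability, square) outcomes of the simplified scheme."""
    (a, b), (c, d) = square
    if min(a, b, c, d) < 1:
        # A shift in one direction would go negative; leave unchanged.
        return [(1.0, square)]
    up = [[a + 1, b - 1], [c - 1, d + 1]]
    down = [[a - 1, b + 1], [c + 1, d - 1]]
    return [(1.0 - theta, square), (theta / 2, up), (theta / 2, down)]

def perturb(square, theta, rng):
    """Draw one perturbed square from the outcome distribution."""
    outcomes = perturbation_distribution(square, theta)
    weights = [w for w, _ in outcomes]
    squares = [s for _, s in outcomes]
    return rng.choices(squares, weights=weights, k=1)[0]

square = [[1, 14], [17, 83]]   # the data square from the slide, without margins
released = perturb(square, theta=0.5, rng=random.Random(0))
print(released)
```

Each outcome adds and subtracts the same amount within every row and column, so the margins are preserved. The symmetric ±1 shifts cancel in expectation, which gives unbiasedness.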
30Markov Perturbation (cont.)
- In the book chapter, we show a Bayesian analysis comparing
- Markov perturbation,
- Cell suppression, and
- Rounding.
- The resulting Risk-Utility Confidentiality Map shows some of the trade-offs in choosing an SDL method.
31Various SDL Methods Compared
32Directions for Research
- Distributions of cell values in protected tables.
- Examining the consequences of different user/intruder prior distributions on SDL method tradeoffs.
- New procedures with increased data utility while maintaining low risk.
- All of this for higher-dimensional tables.