Disclosure Limitation Methods and Information Loss for Tabular Data

About This Presentation

Title:

Disclosure Limitation Methods and Information Loss for Tabular Data

Description:

Title: An Overview of Disclosure Auditing in Categorical Databases. Author: The Heinz School. Last modified by: The Heinz School. Created Date: 7/17/2001 12:56:52 AM

Slides: 33

Provided by: TheH94

Learn more at: http://www.contrib.andrew.cmu.edu

Transcript and Presenter's Notes

Title: Disclosure Limitation Methods and Information Loss for Tabular Data

1
Disclosure Limitation Methods and Information
Loss for Tabular Data

George T. Duncan, Stephen E. Fienberg,
Ramayya Krishnan, Rema Padman
and Stephen F. Roehrig
Carnegie Mellon University

2
Focus of the Talk

Categorical data
Compilations of surveys and other data gathering
efforts
Tables of counts (e.g., number of females in
Metropolis with income > 200,000)
Cf. microdata
Does the release of a table allow inference of a
sensitive attribute value for an individual
(e.g., Lois Lane's income)?
Exact value
Range of values
Probability distribution

3
Some Tough Questions

Exact, interval or probabilistic disclosure?
Should we analyze a data product in isolation? Or
must we look at the suite of products released?
Longitudinal data can be especially revealing.
How can we know next year's data?
What about linkage with external data sources?
Are we responsible for everything that's out
there?

4
The SDL Problem
[Figure: risk-utility plot. Axes: Disclosure Risk (Information About Confidential Items) and Data Utility (Information About Legitimate Items). Labels: Original Data, Maximum Tolerable Risk, Released Data, No Data.]
5
Risk and Utility

Disclosure risk depends on
The definition of disclosure, and
The ways disclosure could occur.
Data utility is
A measure of information loss, and
Maximal for the original data.
Often we can trade off disclosure risk and data
utility

6
Sample Measure for Risk

Risk for a cell is [equation not preserved in this transcript], where
r(k) is the risk of a snooper discovering that the
cell value is k, and
p(k) is the probability of the cell having value
k.
The agency determines r(k), then tries to
estimate the snooper's posterior for p(k), given
the table release.
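
As a rough illustration only: the formula itself is missing from this transcript, so the sketch below assumes the cell risk is the expectation of r(k) under p(k). The r(k) values are hypothetical; the p(k) values reuse the illustrative distribution given for cell (1,2) on the controlled rounding slide.

```python
def cell_risk(r, p):
    """Expected risk, sum over k of r(k) * p(k).

    ASSUMPTION: the slide's lost formula is taken to be this expectation.
    """
    assert abs(sum(p.values()) - 1.0) < 1e-9, "p must sum to 1"
    return sum(r[k] * p[k] for k in p)

# Hypothetical agency-assigned risks for each possible cell value k.
r = {0: 0.2, 1: 1.0, 2: 0.8}
# Distribution for cell (1,2) from the controlled rounding slide.
p = {0: 0.436, 1: 0.347, 2: 0.217}

risk = cell_risk(r, p)
print(round(risk, 4))
```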

7
Sample Measures for Utility

Cell-oriented: mean square precision of the
user's posterior distribution for a cell.
Table-oriented: change in χ² for 2-way tables,
other (or multiple) measures of association for
n-way tables.
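
A minimal sketch of the table-oriented measure for a 2-way table, assuming "change in χ²" means the difference in the Pearson chi-square statistic for independence between the original and the released table. The two tables are the interior cells of the controlled rounding example in this talk; the function name is illustrative.

```python
def pearson_chi2(table):
    """Pearson chi-square statistic for independence in a 2-way table of counts."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    n = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Interior cells (no margins) of the original and base-3 rounded tables.
original = [[15, 1, 3, 1], [20, 10, 10, 15], [3, 10, 10, 2], [12, 14, 7, 2]]
rounded = [[15, 0, 3, 0], [21, 9, 12, 15], [3, 12, 9, 0], [12, 15, 6, 3]]

chi2_original = pearson_chi2(original)
chi2_rounded = pearson_chi2(rounded)
change = chi2_rounded - chi2_original
print(chi2_original, chi2_rounded, change)
```

A small absolute change suggests the rounded table supports roughly the same conclusion about association as the original.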

8
Disclosure Auditing

Traditional risk assessment
A data disseminator follows these steps
Audit a proposed release for disclosures
If potential disclosures exist, apply SDL
Audit the result to ensure protection
Repeat as necessary
Once more, what is a disclosure?

9
Disclosure Auditing (cont.)

Disclosure may be a sensitive cell value known
with certainty,
to be in a narrow range, or
with high probability.
Let's examine some SDL techniques, considering
The various definitions of disclosure,
The difficulty of applying and auditing them,
The utility of the disclosure-limited results,
and
Whether they are useful for higher-dimensional
tables.

10
Controlled Rounding (Zero-Restricted, Let's Say)

Original Table (last row and column are totals)
15   1   3   1 |  20
20  10  10  15 |  55
 3  10  10   2 |  25
12  14   7   2 |  35
50  35  30  20 | 135

Published Table Rounded to Base 3
15   0   3   0 |  18
21   9  12  15 |  57
 3  12   9   0 |  24
12  15   6   3 |  36
51  36  30  18 | 135

11
Controlled Rounding (cont.)

No exact disclosures can occur.
The feasibility interval of any cell is its
published value ± (b-1), where b is the rounding
base (except close to zero).
Finding a rounding is easy for 2-way tables.
Finding a rounding is harder (and may not even
exist) for higher-dimensional tables.
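
The properties above can be checked mechanically. The sketch below is illustrative (function names are not from the talk): it verifies that the published table from the example is a zero-restricted controlled rounding to base 3, and computes the feasibility interval of a published value, truncated at zero.

```python
def is_controlled_rounding(original, published, base):
    """Check a rounding of a 2-way table whose last row/column hold totals."""
    n_rows, n_cols = len(original), len(original[0])
    for i in range(n_rows):
        for j in range(n_cols):
            o, p = original[i][j], published[i][j]
            # Published values are multiples of the base, adjacent to the
            # original. This also forces exact multiples to stay unchanged
            # (the zero-restricted condition).
            if p % base != 0 or abs(p - o) >= base:
                return False
    # Additivity: the rounded cells must still sum to the rounded totals.
    rows_add = all(sum(row[:-1]) == row[-1] for row in published)
    cols_add = all(
        sum(published[i][j] for i in range(n_rows - 1)) == published[-1][j]
        for j in range(n_cols)
    )
    return rows_add and cols_add

def feasibility_interval(published_value, base):
    """Published value +/- (base - 1), truncated at zero."""
    return max(0, published_value - (base - 1)), published_value + (base - 1)

original = [[15, 1, 3, 1, 20], [20, 10, 10, 15, 55], [3, 10, 10, 2, 25],
            [12, 14, 7, 2, 35], [50, 35, 30, 20, 135]]
published = [[15, 0, 3, 0, 18], [21, 9, 12, 15, 57], [3, 12, 9, 0, 24],
             [12, 15, 6, 3, 36], [51, 36, 30, 18, 135]]

print(is_controlled_rounding(original, published, 3))
print(feasibility_interval(9, 3), feasibility_interval(0, 3))
```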

12
Controlled Rounding (cont.)

There are 576,598,396 tables that could be
rounded to the published table.
How to determine a prior probability over this
set?
With a huge leap of faith about priors, cell
(1,2) has this distribution:

q       0     1     2
Pr(q)  .436  .347  .217
13
Cell Suppression

Original Table (last row and column are totals)
15   1   3   1 |  20
20  10  10  15 |  55
 3  10  10   2 |  25
12  14   7   2 |  35
50  35  30  20 | 135

Published Table With Suppressions (s = suppressed)
15   s   3   s |  20
20  10  10  15 |  55
 3   s  10   s |  25
12   s   7   s |  35
50  35  30  20 | 135

14
Cell Suppression (cont.)

Finding a suppression pattern can be computationally
hard; heuristics may be untrustworthy.
Auditing is often done with linear programming
(LP), finding upper and lower cell bounds.
In higher dimensions, LP may give fractional
bounds. How should they be interpreted?
How does an analyst use a table with suppressions?

15
Cell Suppression (cont.)

Again, there are many possible true tables.
For 2-way tables, they are easily enumerated.
For n-way tables, it's quite hard.
Again, it's difficult to specify priors (need to
know the exact implementation of the suppression
algorithm).
Posterior distributions for suppressed cells can
be had, but it's a lot of work.
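
For the 2-way example with suppressions, the enumeration can be sketched by brute force. This is an illustration, not the auditing procedure used by the authors: it lists every nonnegative integer table consistent with the published cells and totals, then reads off exact integer bounds for each suppressed cell. The final tally gives each feasible table equal weight, which is an assumption; the slides caution that realistic priors depend on the suppression algorithm.

```python
from collections import Counter
from itertools import product

# Published table from the cell suppression slide; None marks a suppressed cell.
published = [
    [15, None, 3, None],
    [20, 10, 10, 15],
    [3, None, 10, None],
    [12, None, 7, None],
]
row_totals = [20, 55, 25, 35]
col_totals = [50, 35, 30, 20]

suppressed = [(i, j) for i, row in enumerate(published)
              for j, v in enumerate(row) if v is None]

def upper_cap(i, j):
    """A suppressed cell cannot exceed what its row or column leaves over."""
    row_known = sum(v for v in published[i] if v is not None)
    col_known = sum(r[j] for r in published if r[j] is not None)
    return min(row_totals[i] - row_known, col_totals[j] - col_known)

def feasible_tables():
    ranges = [range(upper_cap(i, j) + 1) for i, j in suppressed]
    for values in product(*ranges):
        table = [row[:] for row in published]
        for (i, j), v in zip(suppressed, values):
            table[i][j] = v
        rows_ok = all(sum(table[i]) == row_totals[i] for i in range(4))
        cols_ok = all(sum(r[j] for r in table) == col_totals[j] for j in range(4))
        if rows_ok and cols_ok:
            yield table

tables = list(feasible_tables())
bounds = {
    (i, j): (min(t[i][j] for t in tables), max(t[i][j] for t in tables))
    for i, j in suppressed
}
# Equal-weight tally for cell (1,2) in the slide's 1-based indexing.
tally = Counter(t[0][1] for t in tables)

print(len(tables), bounds, tally)
```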

16
Publishing Only Some Margins of an N-Way Table

Think of the n-way base table as being fully
suppressed.
The published marginal tables constrain the
values in the base table.
Auditing characterizes cells in the base table
and/or other unpublished margins.
Here's an example.

17
An HMO Example

Table OfficeVisit

v    Patient  Doctor    Treatment
122  David    Christy   Compoz
123  John     Phillips  Fungicide
124  Israel   Christy   AZT
125  John     Hill      Compoz

[Figure: 3-D array with axes Patient (i), Doctor (j), Treatment (k)]

x_ijk = count of visits over
Patient i, i = 1,...,I
Doctor j, j = 1,...,J
Treatment k, k = 1,...,K
18
The HMO Example (cont.)

Obviously we don't broadcast
Patient-Doctor-Treatment.
The view Patient-Treatment is also sensitive.
But the Accounting Dept. has Patient-Doctor.
And the Physician Review Board has
Doctor-Treatment.
Ted works in Accounting, his wife Alice is on the
Physician Review Board, and Israel is an
occasional babysitter for them.

19
More Generally

An n-way table of sensitive data.
Some collection of lower-dimensional marginal
tables are proposed for publication.
How to find bounds, or better, distributions, for
the sensitive cells?
Recall: linear programming often gives fractional
bounds.

20
Integer Linear Programming?

Many techniques, but generally very slow compared
to continuous LP.
Empirically, Gomory cuts work well.
Some special problems have the integer rounding
property.
Much more to be done here.

21
Other Bounding Techniques

Generalized shuttle algorithm
The shuttle algorithm (Buzzigoli & Giusti) starts
with loose upper/lower bounds, then tightens
them.
Dobra & Fienberg improved this (a lot), but it is
still not completely general.

[Figure: bounds on a cell value. Labels: true lower bound, true continuous bound, true integer bound, successive BG upper bounds. Values shown: 14, 13.5, 13, 15, 16, 8.]
22
Exploiting Structure

Decomposable graphs
Suppose we have a 3-D table (indices I, J, K),
publish IJ and JK, and want bounds for IJK.
The Dobra-Fienberg graph looks like
[Figure: graph with nodes I, J, K; edges I-J and J-K correspond to the published margins IJ and JK.]
Dobra and Fienberg show that if the graph has a
separator (node J), and this separator is a
clique, then Fréchet bounds are exact.
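
A minimal sketch of the Fréchet bounds for this case, using the standard form for two overlapping margins: the upper bound is min(n_ij+, n_+jk) and the lower bound is max(0, n_ij+ + n_+jk - n_+j+). The 2×2×2 table of counts is hypothetical and serves only to generate consistent margins.

```python
from itertools import product

def frechet_bounds(n_ij, n_jk):
    """Bounds on x[i][j][k] given margins n_ij (I x J) and n_jk (J x K)."""
    I, J, K = len(n_ij), len(n_jk), len(n_jk[0])
    # Margin of the separator J, computed from the IJ table.
    n_j = [sum(n_ij[i][j] for i in range(I)) for j in range(J)]
    bounds = {}
    for i, j, k in product(range(I), range(J), range(K)):
        lower = max(0, n_ij[i][j] + n_jk[j][k] - n_j[j])
        upper = min(n_ij[i][j], n_jk[j][k])
        bounds[(i, j, k)] = (lower, upper)
    return bounds

# Hypothetical true table x[i][j][k]; only its IJ and JK margins are "published".
x = [[[3, 1], [0, 2]],
     [[1, 0], [4, 1]]]
n_ij = [[sum(x[i][j]) for j in range(2)] for i in range(2)]
n_jk = [[sum(x[i][j][k] for i in range(2)) for k in range(2)] for j in range(2)]

bounds = frechet_bounds(n_ij, n_jk)
for cell, (lo, hi) in sorted(bounds.items()):
    print(cell, lo, hi)
```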
23
Probabilities of Cell Values

Diaconis and Sturmfels (1998) show how to sample
from the space of tables that agree with known
marginals.
Not hard to extend to tables with suppressions.
They use results from commutative algebra to find
a Gröbner basis, a list of moves that change a
table but leave the margins fixed.
A random walk using these moves carries you
uniformly through the space of tables.
Tally the proportion of time a sensitive cell
takes on different values.
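
A sketch of such a walk for the 2-way case only, where the moves are the familiar ±1 swaps on a 2×2 subtable; higher-way tables need the larger move sets discussed on the following slides. The starting table is the interior of the original 4×4 example, and the step count and seed are arbitrary choices. Moves that would create a negative cell are rejected and the current table is counted again, which keeps the proposal symmetric.

```python
import random
from collections import Counter

def random_walk_tally(table, cell, steps, seed=0):
    """Walk over tables with the same margins; tally the values seen in `cell`."""
    rng = random.Random(seed)
    t = [row[:] for row in table]
    tally = Counter()
    for _ in range(steps):
        i1, i2 = rng.sample(range(len(t)), 2)
        j1, j2 = rng.sample(range(len(t[0])), 2)
        # Move: +1 at (i1,j1) and (i2,j2), -1 at (i1,j2) and (i2,j1).
        # Row and column sums are unchanged; reject if a cell would go negative.
        if t[i1][j2] > 0 and t[i2][j1] > 0:
            t[i1][j1] += 1
            t[i2][j2] += 1
            t[i1][j2] -= 1
            t[i2][j1] -= 1
        tally[t[cell[0]][cell[1]]] += 1
    return t, tally

start = [[15, 1, 3, 1], [20, 10, 10, 15], [3, 10, 10, 2], [12, 14, 7, 2]]
steps = 20000
final, tally = random_walk_tally(start, cell=(0, 1), steps=steps)
proportions = {value: count / steps for value, count in sorted(tally.items())}
print(proportions)
```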

24
3-D Table, 2-D Margins Known
[Figure: 3×3×3 table drawn as layers k1, k2, k3, with its 2-D margins known.]

A 2-D margin (corner label i/j), with row, column, and grand totals:
 6   6   6 | 18
 6   7   6 | 19
 9   6   7 | 22
21  19  19 | 59
25
Gröbner Bases Moves

Suppose we know a table that matches the
published margins (i.e., is feasible).
How can we move to another feasible table?
Example move

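[Figure: an example move on a 3×3×3 table, shown as three 3×3 layers. The nonzero entries fall in the first two rows of the first two layers; their signs are not legible in this transcript. The third layer is all zeros.]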
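
One well-known move of this kind places alternating +1 and -1 entries on the corners of a 2×2×2 sub-block, which leaves all three 2-D margins unchanged. The sketch below assumes this is the sort of move shown on the slide (its entries are not legible in this transcript); the starting table of all 2s is hypothetical.

```python
from itertools import product

def margins(x):
    """Return the IJ, IK and JK margins of a 3-way table x[i][j][k]."""
    I, J, K = len(x), len(x[0]), len(x[0][0])
    ij = [[sum(x[i][j]) for j in range(J)] for i in range(I)]
    ik = [[sum(x[i][j][k] for j in range(J)) for k in range(K)] for i in range(I)]
    jk = [[sum(x[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]
    return ij, ik, jk

def apply_basic_move(x, i_pair, j_pair, k_pair):
    """Add (-1)**(a+b+c) at x[i_pair[a]][j_pair[b]][k_pair[c]].

    Returns the new table, or None if the move would create a negative count.
    """
    y = [[row[:] for row in layer] for layer in x]
    for a, b, c in product(range(2), repeat=3):
        y[i_pair[a]][j_pair[b]][k_pair[c]] += (-1) ** (a + b + c)
    if any(v < 0 for layer in y for row in layer for v in row):
        return None
    return y

# Hypothetical feasible 3x3x3 table.
x = [[[2] * 3 for _ in range(3)] for _ in range(3)]
y = apply_basic_move(x, (0, 1), (0, 1), (0, 1))
print(margins(x) == margins(y), y != x)
```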
26
Computing the Gröbner Basis

The general-purpose program Macaulay can find the
3×3×3 basis in about 7 hours (300 MHz PC).
A specialized program does this in 25 ms.
The 4×3×3 basis takes 20 minutes (628 moves).
The 5×3×3 basis takes 3 months (3236 moves).

27
Exploiting Structure Again

If the independence graph of the released
marginals is decomposable, the Gröbner basis is
easily determined.
If the graph is almost decomposable, the basis
can be obtained by piecing together bases for
smaller problems.
Dobra demonstrates that these methods can be used
to estimate sensitive cell distributions.

28
Markov Perturbation

Consider an elementary data square in a 2-way
table.
It might look like this (last row and column are totals):

 1  14 |  15
17  83 | 100
18  97 | 115
29
Markov Perturbation (cont.)

The cell values in the data square are
stochastically modified so that
the marginal totals remain unchanged, and
the expected cell values equal the original
values (unbiased).
A single parameter determines how much mixing
is done.
By choosing elementary data squares randomly,
then perturbing, the overall table is protected.
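
The two stated properties can be illustrated with a much simpler stand-in than the authors' method. The sketch below is not their Markov transition matrix: it shifts a data square by +1 or -1 along its diagonal with equal probability, so margins are preserved and the expected change is zero. The parameter name `mix` is hypothetical and only loosely plays the role of the single mixing parameter.

```python
import random

def perturb_square(square, mix, rng):
    """Perturb [[a, b], [c, d]] while keeping row and column totals fixed.

    With probability mix/2 add +1 to a and d (and -1 to b and c), with
    probability mix/2 do the opposite, otherwise leave the square unchanged.
    If any cell is 0 the square is left alone, so no cell goes negative and
    the +1/-1 shifts stay balanced (expected change is zero).
    """
    (a, b), (c, d) = square
    if min(a, b, c, d) < 1:
        return [[a, b], [c, d]]
    u = rng.random()
    if u < mix / 2:
        delta = 1
    elif u < mix:
        delta = -1
    else:
        delta = 0
    return [[a + delta, b - delta], [c - delta, d + delta]]

rng = random.Random(1)
square = [[1, 14], [17, 83]]  # the elementary data square from the example
perturbed = perturb_square(square, mix=0.5, rng=rng)
print(perturbed)
```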

30
Markov Perturbation (cont.)

In the book chapter, we show a Bayesian analysis
comparing
Markov perturbation
Cell suppression, and
Rounding.
The resulting Risk-Utility Confidentiality Map
shows some of the trade-offs in choosing an SDL
method.

31
Various SDL Methods Compared
32
Directions for Research

Distributions of cell values in protected tables.
Examining the consequences of different
user/intruder prior distributions on SDL method
tradeoffs.
New procedures with increased data utility while
maintaining low risk.
All of this for higher-dimensional tables.
