Disclosure Limitation Methods and Information Loss for Tabular Data

Title: An Overview of Disclosure Auditing in Categorical Databases Author: The Heinz School Last modified by: The Heinz School Created Date: 7/17/2001 12:56:52 AM – PowerPoint PPT presentation



1
Disclosure Limitation Methods and Information
Loss for Tabular Data
  • George T. Duncan, Stephen E. Fienberg,
  • Ramayya Krishnan, Rema Padman
  • and Stephen F. Roehrig
  • Carnegie Mellon University

2
Focus of the Talk
  • Categorical data
  • Compilations of surveys and other data gathering
    efforts
  • Tables of counts (e.g., number of females in
    Metropolis with income > 200,000)
  • Cf. microdata
  • Does the release of a table allow inference of a
    sensitive attribute value for an individual
    (e.g., Lois Lane's income)?
  • Exact value
  • Range of values
  • Probability distribution

3
Some Tough Questions
  • Exact, interval or probabilistic disclosure?
  • Should we analyze a data product in isolation? Or
    must we look at the suite of products released?
  • Longitudinal data can be especially revealing.
    How can we know next year's data?
  • What about linkage with external data sources?
    Are we responsible for everything that's out
    there?

4
The SDL Problem
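[Figure: risk-utility map. Vertical axis: Disclosure Risk (information about confidential items). Horizontal axis: Data Utility (information about legitimate items). Labeled points: Original Data, Released Data, No Data. Threshold line: Maximum Tolerable Risk.]
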
5
Risk and Utility
  • Disclosure risk depends on
  • The definition of disclosure, and
  • The ways disclosure could occur.
  • Data utility is
  • A measure of information loss, and
  • Maximal for the original data.
  • Often we can trade off disclosure risk and data
    utility

6
Sample Measure for Risk
  • Risk for a cell is $\sum_k r(k)\,p(k)$, where
  • r(k) is the risk of a snooper discovering that
    the cell value is k, and
  • p(k) is the probability of the cell having value
    k.
  • The agency determines r(k), then tries to
    estimate the snooper's posterior for p(k), given
    the table release.
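
The risk measure above can be sketched in a few lines, assuming it is read as an expectation: the sum of r(k) weighted by p(k). The function name `cell_risk` and the values of r(k) are hypothetical. The probabilities are the ones quoted later in the talk for cell (1,2) of the rounded table.

```python
def cell_risk(r, p):
    """Expected risk for one cell: sum over k of r(k) * p(k)."""
    return sum(r[k] * p.get(k, 0.0) for k in r)

# Hypothetical agency-chosen risks r(k): small counts are treated as riskier.
r = {0: 1.0, 1: 0.5, 2: 0.1}
# Snooper's posterior p(k), taken from the controlled-rounding example.
p = {0: 0.436, 1: 0.347, 2: 0.217}

risk = cell_risk(r, p)
print(round(risk, 4))
```

An agency could compare this number with its maximum tolerable risk before releasing the table.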

7
Sample Measures for Utility
  • Cell-oriented: mean square precision of the
    user's posterior distribution for a cell.
  • Table-oriented: change in χ² for 2-way tables,
    other (or multiple) measures of association for
    n-way tables.
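
As a rough illustration of the table-oriented measure, the sketch below computes Pearson's χ² statistic for a 2-way table of counts and reports how much it changes between an original and a released table. The helper names and the two small tables are hypothetical, and the sketch assumes tables are given without margins.

```python
def pearson_chi2(table):
    """Pearson chi-square statistic for independence in a 2-way count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def chi2_change(original, released):
    """Utility loss proxy: change in chi-square caused by the release method."""
    return pearson_chi2(released) - pearson_chi2(original)

# Hypothetical tables (interior cells only, no margins).
original = [[10, 20], [30, 40]]
released = [[9, 21], [30, 39]]
print(chi2_change(original, released))
```

A change near zero suggests the released table supports roughly the same conclusion about association as the original.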

8
Disclosure Auditing
  • Traditional risk assessment
  • A data disseminator follows these steps
  • Audit a proposed release for disclosures
  • If potential disclosures exist, apply SDL
  • Audit the result to ensure protection
  • Repeat as necessary
  • Once more, what is a disclosure?

9
Disclosure Auditing (cont.)
  • Disclosure may be a sensitive cell value known
  • with certainty,
  • to be in a narrow range, or
  • with high probability.
  • Let's examine some SDL techniques, considering
  • The various definitions of disclosure,
  • The difficulty of applying and auditing them,
  • The utility of the disclosure-limited results,
    and
  • Whether they are useful for higher-dimensional
    tables.

10
Controlled Rounding (Zero-Restricted, Let's Say)
Original Table (last row and column are totals)
  15   1   3   1 |  20
  20  10  10  15 |  55
   3  10  10   2 |  25
  12  14   7   2 |  35
  50  35  30  20 | 135
Published Table Rounded to Base 3
  15   0   3   0 |  18
  21   9  12  15 |  57
   3  12   9   0 |  24
  12  15   6   3 |  36
  51  36  30  18 | 135
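
The two tables on this slide can be checked mechanically. The sketch below assumes "zero-restricted controlled rounding" means three things: every published entry is a multiple of the base, it differs from the original by less than the base, and entries that were already multiples are unchanged. The published table must also stay additive. The function names are illustrative, not taken from the talk.

```python
ORIGINAL = [
    [15, 1, 3, 1, 20],
    [20, 10, 10, 15, 55],
    [3, 10, 10, 2, 25],
    [12, 14, 7, 2, 35],
    [50, 35, 30, 20, 135],
]
PUBLISHED = [
    [15, 0, 3, 0, 18],
    [21, 9, 12, 15, 57],
    [3, 12, 9, 0, 24],
    [12, 15, 6, 3, 36],
    [51, 36, 30, 18, 135],
]

def is_additive(table):
    """Last column holds row totals; last row holds column totals."""
    rows_ok = all(sum(row[:-1]) == row[-1] for row in table)
    cols_ok = all(sum(col[:-1]) == col[-1] for col in zip(*table))
    return rows_ok and cols_ok

def is_zero_restricted_rounding(original, published, base):
    """Check each entry moved only to an adjacent multiple of `base`."""
    for orig_row, pub_row in zip(original, published):
        for o, p in zip(orig_row, pub_row):
            if p % base != 0:
                return False          # not a multiple of the base
            if abs(o - p) >= base:
                return False          # jumped past an adjacent multiple
            if o % base == 0 and p != o:
                return False          # zero restriction violated
    return True

print(is_additive(PUBLISHED), is_zero_restricted_rounding(ORIGINAL, PUBLISHED, 3))
```
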
11
Controlled Rounding (cont.)
  • No exact disclosures can occur.
  • The feasibility interval of any cell is its
    published value ± (b-1), where b is the rounding
    base (except close to zero).
  • Finding a rounding is easy for 2-way tables.
  • Finding a rounding is harder (and may not even
    exist) for higher-dimensional tables.
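
A minimal sketch of the cell-wise feasibility interval, assuming the interval is the published value plus or minus (b-1) and that counts cannot be negative. The function name is hypothetical, and the additivity constraints of the full table may tighten these bounds further.

```python
def feasibility_interval(published, base):
    """Range of original counts consistent with one rounded cell."""
    lower = max(0, published - (base - 1))   # counts cannot go below zero
    upper = published + (base - 1)
    return lower, upper

print(feasibility_interval(9, 3))
print(feasibility_interval(0, 3))
```

For a published value of 0 with base 3, the interval is cut off at zero, which is the "except close to zero" case.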

12
Controlled Rounding (cont.)
  • There are 576,598,396 tables that could be
    rounded to the published table.
  • How to determine a prior probability over this
    set?
  • With a huge leap of faith about priors, cell
    (1,2) has this distribution

  q       0     1     2
  Pr(q)  .436  .347  .217

13
Cell Suppression
Original Table (last row and column are totals)
  15   1   3   1 |  20
  20  10  10  15 |  55
   3  10  10   2 |  25
  12  14   7   2 |  35
  50  35  30  20 | 135
Published Table With Suppressions (s = suppressed)
  15   s   3   s |  20
  20  10  10  15 |  55
   3   s  10   s |  25
  12   s   7   s |  35
  50  35  30  20 | 135

14
Cell Suppression (cont.)
  • Finding a suppression pattern can be
    computationally hard; heuristics may be
    untrustworthy.
  • Auditing is often done with linear programming
    (LP), finding upper and lower cell bounds.
  • In higher dimensions, LP may give fractional
    bounds---how to interpret?
  • How does an analyst use a table with suppressions?

15
Cell Suppression (cont.)
  • Again, there are many possible true tables.
  • For 2-way tables, they are easily enumerated.
  • For n-way tables, it's quite hard.
  • Again, it's difficult to specify priors (need to
    know the exact implementation of suppression
    algorithm).
  • Posterior distributions for suppressed cells can
    be had, but it's a lot of work.
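
For the small 2-way example in the cell-suppression slides, the possible true tables can be listed by brute force. The sketch below is an illustrative enumeration, not the LP audit mentioned in the talk. It fills each suppressed cell with every value allowed by what is left of its row and column totals, and keeps the fills that reproduce all margins. The function names are hypothetical.

```python
from itertools import product

# Published table from the cell-suppression example; None marks a suppressed cell.
PUBLISHED = [
    [15, None, 3, None],
    [20, 10, 10, 15],
    [3, None, 10, None],
    [12, None, 7, None],
]
ROW_TOTALS = [20, 55, 25, 35]
COL_TOTALS = [50, 35, 30, 20]

def suppressed_cells(published):
    return [(i, j) for i, row in enumerate(published)
            for j, v in enumerate(row) if v is None]

def feasible_tables(published, row_totals, col_totals):
    """Yield every non-negative integer table matching the published margins."""
    known = lambda cells: sum(v for v in cells if v is not None)
    row_left = [t - known(r) for t, r in zip(row_totals, published)]
    col_left = [t - known(c) for t, c in zip(col_totals, zip(*published))]
    holes = suppressed_cells(published)
    # A suppressed cell cannot exceed what remains in its row or column.
    ranges = [range(min(row_left[i], col_left[j]) + 1) for i, j in holes]
    for values in product(*ranges):
        table = [row[:] for row in published]
        for (i, j), v in zip(holes, values):
            table[i][j] = v
        if ([sum(r) for r in table] == row_totals
                and [sum(c) for c in zip(*table)] == col_totals):
            yield table

tables = list(feasible_tables(PUBLISHED, ROW_TOTALS, COL_TOTALS))
# Exact integer bounds for each suppressed cell, keyed by 0-indexed (row, col).
bounds = {(i, j): (min(t[i][j] for t in tables), max(t[i][j] for t in tables))
          for i, j in suppressed_cells(PUBLISHED)}
print(len(tables), bounds)
```

Tallying how often each value appears across `tables` would give a distribution for a suppressed cell, but only under an assumed uniform prior, which is exactly the kind of assumption the slide warns about.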

16
Publishing Only Some Margins of an N-Way Table
  • Think of the n-way base table as being fully
    suppressed.
  • The published marginal tables constrain the
    values in the base table.
  • Auditing characterizes cells in the base table
    and/or other unpublished margins.
  • Here's an example:

17
An HMO Example
Table OfficeVisit
  v    Patient  Doctor    Treatment
  122  David    Christy   Compoz
  123  John     Phillips  Fungicide
  124  Israel   Christy   AZT
  125  John     Hill      Compoz

[Figure: a cube with axes Patient (i), Doctor (j), and Treatment (k); a typical cell is x_ijk.]
x_ijk = count of visits over Patient i (i = 1,…,I), Doctor j (j = 1,…,J), Treatment k (k = 1,…,K)

18
The HMO Example (cont.)
  • Obviously we don't broadcast
    Patient-Doctor-Treatment.
  • The view Patient-Treatment is also sensitive.
  • But the Accounting Dept. has Patient-Doctor.
  • And the Physician Review Board has
    Doctor-Treatment.
  • Ted works in Accounting, his wife Alice is on the
    Physician Review Board, and Israel is an
    occasional babysitter for them.
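
The inference risk in this story can be made concrete with the four OfficeVisit records from the previous slide. The sketch below builds the two views that Ted and Alice can see, then lists the treatments a given patient could have received according to those views alone. The function name `candidate_treatments` is hypothetical.

```python
from collections import Counter

# (visit, patient, doctor, treatment) records from the OfficeVisit table.
VISITS = [
    (122, "David", "Christy", "Compoz"),
    (123, "John", "Phillips", "Fungicide"),
    (124, "Israel", "Christy", "AZT"),
    (125, "John", "Hill", "Compoz"),
]

# The two released views (2-way margins of the 3-way table).
patient_doctor = Counter((p, d) for _, p, d, _ in VISITS)
doctor_treatment = Counter((d, t) for _, _, d, t in VISITS)

def candidate_treatments(patient):
    """Treatments consistent with the Patient-Doctor and Doctor-Treatment views."""
    doctors = {d for (p, d), n in patient_doctor.items() if p == patient and n > 0}
    return {t for (d, t), n in doctor_treatment.items() if d in doctors and n > 0}

print(candidate_treatments("Israel"))
```

Combining the two views narrows Israel's treatment to whatever his doctor prescribed to anyone, even though neither view mentions Patient-Treatment directly.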

19
More Generally
  • An n-way table of sensitive data.
  • Some collection of lower-dimensional marginal
    tables are proposed for publication.
  • How to find bounds, or better, distributions, for
    the sensitive cells?
  • Recall: linear programming often gives fractional
    bounds.

20
Integer Linear Programming?
  • Many techniques, but generally very slow compared
    to continuous LP.
  • Empirically, Gomory cuts work well.
  • Some special problems have the integer rounding
    property.
  • Much more to be done here.

21
Other Bounding Techniques
  • Generalized shuttle algorithm
  • The shuttle algorithm (Buzzigoli & Giusti) starts
    with loose upper/lower bounds, then tightens
    them.
  • Dobra & Fienberg improved this (a lot), but it is
    still not completely general.

[Figure: number line of bounds for one cell. Labels: true lower bound, true continuous bound, true integer bound, successive BG upper bounds. Values shown: 8, 13, 13.5, 14, 15, 16.]

22
Exploiting Structure
  • Decomposable graphs
  • Suppose 3-D table (indices I,J,K), we publish IJ
    and JK, and want bounds for IJK.
  • The Dobra-Fienberg graph looks like
  • Dobra and Fienberg show that if the graph has a
    separator (node J), and this separator is a
    clique, then Fréchet bounds are exact.

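[Figure: graph with nodes I, J, and K; the published margin IJ links I and J, and the published margin JK links J and K.]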
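
For the case on this slide (publish IJ and JK, bound IJK), the Fréchet bounds can be written out directly. The sketch below assumes the standard form: the upper bound is the smaller of the two margin cells, and the lower bound is their sum minus the J total, floored at zero. The 2×2×2 table `x` is made-up data used only to produce margins.

```python
def frechet_bounds(n_ij, n_jk):
    """Bounds on x[i][j][k] given margins n_ij (sum over k) and n_jk (sum over i)."""
    n_j = [sum(row) for row in n_jk]          # one-way J totals
    bounds = {}
    for i in range(len(n_ij)):
        for j in range(len(n_jk)):
            for k in range(len(n_jk[0])):
                upper = min(n_ij[i][j], n_jk[j][k])
                lower = max(0, n_ij[i][j] + n_jk[j][k] - n_j[j])
                bounds[i, j, k] = (lower, upper)
    return bounds

# Hypothetical 2x2x2 table, indexed x[i][j][k].
x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
n_ij = [[sum(x[i][j]) for j in range(2)] for i in range(2)]
n_jk = [[sum(x[i][j][k] for i in range(2)) for k in range(2)] for j in range(2)]

bounds = frechet_bounds(n_ij, n_jk)
print(bounds[1, 1, 1])
```
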
23
Probabilities of Cell Values
  • Diaconis and Sturmfels (1998) show how to sample
    from the space of tables that agree with known
    marginals.
  • Not hard to extend to tables with suppressions.
  • They use results from commutative algebra to find
    a Gröbner basis, a list of moves that change a
    table but leave the margins fixed.
  • A random walk using these moves carries you
    uniformly through the space of tables.
  • Tally the proportion of time a sensitive cell
    takes on different values.
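
A minimal sketch of such a random walk for the 2-way case, where the basic moves add +1 and -1 around a 2×2 sub-square. A proposed move is rejected when it would make a cell negative. The table is the interior of the original table from the controlled-rounding slide. The function name, step count, and seed are arbitrary choices, and no burn-in or convergence check is attempted.

```python
import random
from collections import Counter

def random_walk_tally(table, cell, steps, seed=0):
    """Walk over non-negative tables sharing the margins of `table`;
    tally the values taken by `cell` = (row, col)."""
    rng = random.Random(seed)
    t = [row[:] for row in table]
    tally = Counter()
    for _ in range(steps):
        i1, i2 = rng.sample(range(len(t)), 2)
        j1, j2 = rng.sample(range(len(t[0])), 2)
        # Move: +1 at (i1,j1) and (i2,j2), -1 at (i1,j2) and (i2,j1).
        if t[i1][j2] > 0 and t[i2][j1] > 0:
            t[i1][j1] += 1
            t[i2][j2] += 1
            t[i1][j2] -= 1
            t[i2][j1] -= 1
        tally[t[cell[0]][cell[1]]] += 1   # rejected moves count as staying put
    return t, tally

BODY = [[15, 1, 3, 1], [20, 10, 10, 15], [3, 10, 10, 2], [12, 14, 7, 2]]
STEPS = 10_000
final, tally = random_walk_tally(BODY, cell=(0, 1), steps=STEPS)
proportions = {value: count / STEPS for value, count in sorted(tally.items())}
print(proportions)
```

The proportions estimate a distribution for the chosen cell given only the margins. Conditioning on a rounded or suppressed release would need extra constraints on the walk.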

24
3-D Table, 2-D Margins Known
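[Figure: 3×3×3 table drawn as layers k1, k2, k3 with rows and columns indexed by i/j; the 2-D margins are shown and the cell values are unknown. One margin, with totals in the last row and column:]
   6   6   6 | 18
   6   7   6 | 19
   9   6   7 | 22
  21  19  19 | 59
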
25
Gröbner Bases Moves
  • Suppose we know a table that matches the
    published margins (i.e., is feasible).
  • How can we move to another feasible table?
  • Example move

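[Figure: example move shown as three 3×3 layers. Two layers carry nonzero entries in their first two rows (the values were lost in extraction); all other entries, including the whole third layer, are 0.]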
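
One simple kind of move for a 3-way table with all 2-D margins fixed places alternating +1 and -1 on the corners of a 2×2×2 sub-cube. The sketch below builds such a move and confirms that each of its 2-D margins is zero, so adding it to a feasible table leaves the published margins unchanged. It is an illustration only and makes no claim that moves of this shape form a complete basis. The function names are hypothetical.

```python
def basic_move(shape, i_pair, j_pair, k_pair):
    """Alternating +1/-1 on the 2x2x2 sub-cube picked out by the index pairs."""
    I, J, K = shape
    move = [[[0] * K for _ in range(J)] for _ in range(I)]
    for a, i in enumerate(i_pair):
        for b, j in enumerate(j_pair):
            for c, k in enumerate(k_pair):
                move[i][j][k] = (-1) ** (a + b + c)
    return move

def two_way_margins(x):
    """Return the IJ, IK and JK margins of a 3-way array x[i][j][k]."""
    I, J, K = len(x), len(x[0]), len(x[0][0])
    ij = [[sum(x[i][j]) for j in range(J)] for i in range(I)]
    ik = [[sum(x[i][j][k] for j in range(J)) for k in range(K)] for i in range(I)]
    jk = [[sum(x[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]
    return ij, ik, jk

move = basic_move((3, 3, 3), (0, 1), (0, 1), (0, 1))
margins = two_way_margins(move)
print(all(v == 0 for m in margins for row in m for v in row))
```

A move is only usable from a given table if the result has no negative cells, which is the check a random walk performs before accepting it.
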
26
Computing the Gröbner Basis
  • The general-purpose program Macaulay can find the
    3×3×3 basis in about 7 hours (300 MHz PC).
  • A specialized program does this in 25 ms.
  • The 4×3×3 basis takes 20 minutes (628 moves)
  • The 5×3×3 basis takes 3 months (3236 moves)

27
Exploiting Structure Again
  • If the independence graph of the released
    marginals is decomposable, the Gröbner basis is
    easily determined.
  • If the graph is almost decomposable, the basis
    can be obtained by piecing together bases for
    smaller problems.
  • Dobra demonstrates that these methods can be used
    to estimate sensitive cell distributions.

28
Markov Perturbation
  • Consider an elementary data square in a 2-way
    table.
  • It might look like

   1  14 |  15
  17  83 | 100
  18  97 | 115

29
Markov Perturbation (cont.)
  • The cell values in the data square are
    stochastically modified so that
  • the marginal totals remain unchanged, and
  • the expected cell values equal the original
    values (unbiased).
  • A single parameter determines how much mixing
    is done.
  • By choosing elementary data squares randomly,
    then perturbing, the overall table is protected.
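
The two properties listed above (margins unchanged, expected values equal to the originals) can be illustrated with a much simpler scheme than the one in the talk. The sketch below is not the authors' transition scheme. It assumes a symmetric ±1 step on a data square, taken with probability `theta`, a hypothetical stand-in for the mixing parameter. The step is skipped whenever any cell of the square is zero, which keeps counts non-negative and keeps the step symmetric.

```python
import random

def perturb_square(table, i1, i2, j1, j2, theta, rng):
    """Symmetric +1/-1 perturbation of one data square, in place."""
    cells = (table[i1][j1], table[i1][j2], table[i2][j1], table[i2][j2])
    # Skip if a zero cell would break symmetry, or with probability 1 - theta.
    if min(cells) < 1 or rng.random() >= theta:
        return
    step = rng.choice((-1, 1))        # zero-mean, so cell expectations are preserved
    table[i1][j1] += step
    table[i2][j2] += step
    table[i1][j2] -= step             # opposite sign keeps row and column totals
    table[i2][j1] -= step

rng = random.Random(0)
square = [[1, 14], [17, 83]]          # interior of the data square shown earlier
for _ in range(100):
    perturb_square(square, 0, 1, 0, 1, theta=0.5, rng=rng)
print(square)
```

In this simplified scheme a square stops changing once one of its cells reaches zero, so a practical method would need a more careful transition rule.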

30
Markov Perturbation (cont.)
  • In the book chapter, we show a Bayesian analysis
    comparing
  • Markov perturbation
  • Cell suppression, and
  • Rounding.
  • The resulting Risk-Utility Confidentiality Map
    shows some of the trade-offs in choosing an SDL
    method.

31
Various SDL Methods Compared
32
Directions for Research
  • Distributions of cell values in protected tables.
  • Examining the consequences of different
    user/intruder prior distributions on SDL method
    tradeoffs.
  • New procedures with increased data utility while
    maintaining low risk.
  • All of this for higher-dimensional tables.