Title: From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases

1. From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
- Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Hoeteck Wee
2. Database Privacy
- Census data: a prototypical example
- Individuals provide information
- Census bureau publishes sanitized records
- Privacy is legally mandated; what utility can we achieve?
- Our goal:
- What do we mean by preservation of privacy?
- Characterize the trade-off between privacy and utility: disguise individual identifying information, preserve macroscopic properties
- Develop a good sanitizing procedure with theoretical guarantees
3. An outline of this talk
- A mathematical formalism
- What do we mean by privacy?
- Prior work
- An abstract model of datasets
- Isolation; good sanitizations
- A candidate sanitization
- A brief overview of results
- General argument for privacy of n-point datasets
- Open issues and concluding remarks
4. Privacy: a philosophical viewpoint
- Ruth Gavison: privacy includes protection from being brought to the attention of others
- This matches intuition; privacy is inherently desirable
- Attention invites further loss of privacy
- Privacy is assured to the extent that one blends in with the crowd
- An appealing definition, and one that can be converted into a precise mathematical statement!
5. Database Privacy
- Statistical approaches
- Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
- Additionally, erase values that reveal too much
- Query-based approaches
- Involve a permanent trusted third party
- Query monitoring: disallow queries that breach privacy
- Perturbation: add noise to the query output [Dinur-Nissim 03, Dwork-Nissim 04]
- Statistical perturbation plus adversarial analysis
- [Evfimievski et al. 03] combine statistical techniques with analysis similar to query-based approaches
6. Everybody's First Suggestion
- Learn the distribution, then output:
- A description of the distribution, or
- Samples from the learned distribution
- We want to reflect facts on the ground
- Statistically insignificant facts can be important for allocating resources
7. Our Approach
- Crypto-flavored definitions
- Mathematical characterization of the adversary's goal
- Precise definition of when a sanitization procedure fails
- Intuition: seeing the sanitized DB gives the adversary an advantage
- Statistical techniques
- Perturbation of attribute values
- Differs from previous work: perturbation amounts depend on the local densities of points
- A highly abstracted version of the problem
- If we can't understand this, we can't understand real life.
- If we get negative results here, the world is in trouble.
8. A geometric view
- Abstraction:
- Points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution
- Points are unlabeled; you are your collection of attributes
- Distance is everything
- Real Database (RDB), private:
- n unlabeled points in d-dimensional space
- Sanitized Database (SDB), public:
- n new points, possibly in a different space
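Under this abstraction the private database is nothing more than an array of i.i.d. points. A minimal sketch (the uniform distribution over the cube and the sizes n = 100, d = 10 are illustrative choices, not the talk's parameters):

```python
import random

def draw_rdb(n, d, seed=0):
    """RDB: n unlabeled points in d-dimensional space, drawn i.i.d.
    (here uniform over the cube [0,1]^d, one of the settings in the talk)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(d)] for _ in range(n)]

rdb = draw_rdb(n=100, d=10)  # 100 rows, each a vector of 10 attributes
```

The point is that "you are your collection of attributes": a row carries no label, only its position in R^d.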
9. The adversary, or Isolator
- Using SDB and auxiliary information (AUX), the Isolator outputs a point q
- q isolates a real point x if it is much closer to x than to x's neighbors
- Even if q looks similar to x, it may fail to isolate x if it looks just as similar to x's neighbors
- Tightly clustered points have a smaller radius of isolation
(Figure: RDB points with examples of isolating and non-isolating guesses)
10. The adversary, or Isolator
- I(SDB, AUX) = q
- x is isolated if B(q, cδ) contains fewer than T points, where δ = ‖q − x‖
- T-radius of x: the distance to its T-th nearest neighbor
- x is safe if δ_x > (T-radius of x)/(c − 1)
- then B(q, cδ_x) contains x's entire T-neighborhood
- c is a privacy parameter, e.g. 4
- Large T and small c is good
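The isolation condition above can be stated operationally. A self-contained sketch (Euclidean distance; the example points and the parameters c = 4, t = 3 are my choices for illustration):

```python
import math

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def t_radius(x, rdb, t):
    """Distance from x to its t-th nearest neighbor in the RDB."""
    return sorted(dist(x, y) for y in rdb if y is not x)[t - 1]

def isolates(q, x, rdb, c=4, t=3):
    """q isolates x if B(q, c*delta), with delta = ||q - x||,
    captures fewer than t database points."""
    delta = dist(q, x)
    return sum(1 for y in rdb if dist(q, y) <= c * delta) < t

# A lone outlier is easy to isolate; near a cluster, the ball
# B(q, c*delta) tends to drag the whole neighborhood in.
cluster = [[0.1 * i, 0.0] for i in range(5)]
outlier = [10.0, 0.0]
rdb = cluster + [outlier]
print(isolates([10.01, 0.0], outlier, rdb))    # True: the ball misses the crowd
print(isolates([0.25, 0.0], cluster[2], rdb))  # False: the ball swallows the cluster
```

This also shows why tightly clustered points have a smaller radius of isolation: a guess near a cluster must be very precise before B(q, cδ) excludes the neighbors.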
11. A good sanitization
- A sanitizing algorithm compromises privacy if the adversary can considerably increase his probability of isolating a point by looking at its output
- A rigorous (and too ideal) definition:
- ∀D ∀I ∃I′ such that w.o.p. over RDB ∈_R D^n, ∀ aux z, ∀ x ∈ RDB:
  Pr[I(SDB, z) isolates x] − Pr[I′(z) isolates x] ≤ ε/n
- The definition of ε can be forgiving, say 2^(−Ω(d)) or 1 in 1000
- Quantification over x: if aux reveals information about some x, the privacy of some other y should still be preserved
- Provides a framework for describing the power of a sanitization method, and hence for comparisons
12. The Sanitizer
- The privacy of x is linked to its T-radius
- So: randomly perturb x in proportion to its T-radius
- x′ = San(x) ∈_R S(x, T-rad(x))
(Figure: T = 1)
13. The Sanitizer
- The privacy of x is linked to its T-radius
- So: randomly perturb x in proportion to its T-radius
- x′ = San(x) ∈_R S(x, T-rad(x))
- Intuition:
- We are blending x in with its crowd
- If the number of dimensions d is large, there are many pre-images for x′; the adversary cannot conclusively pick any one
- We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
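The sanitizer is one line of geometry: move x to a random point at distance T-rad(x). A self-contained sketch (reading S(x, r) as the surface of the sphere of radius r around x, which is an assumption on my part; the direction comes from a normalized Gaussian vector):

```python
import math
import random

def dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def t_radius(x, rdb, t):
    """Distance from x to its t-th nearest neighbor in the RDB."""
    return sorted(dist(x, y) for y in rdb if y is not x)[t - 1]

def sanitize(x, rdb, t=1, seed=1):
    """San(x): a uniform random point on the sphere S(x, T-rad(x)).
    The noise has mean zero, and its magnitude tracks local density:
    crowded points move little, isolated points move a lot."""
    rng = random.Random(seed)
    r = t_radius(x, rdb, t)
    g = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(v * v for v in g))
    return [xi + r * gi / norm for xi, gi in zip(x, g)]

rdb = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
x = rdb[0]
x_san = sanitize(x, rdb, t=1)  # lands at distance 1 (the 1-radius of x) from x
```

Sampling the direction as a normalized Gaussian is the standard way to get a uniform point on a high-dimensional sphere, which is where the "many pre-images" intuition comes from.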
14. Results on privacy: an overview

Distribution | Num. points | Revealed to adversary | Auxiliary information
Uniform on the surface of a sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or the surface of a sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a cube | n | Exact histogram counts over subcells of sufficiently large size | Distribution
Uniform over a cube | n | Perturbations of n/2 points | Distribution, exact histogram counts of subcells

The adversary is computationally unbounded.
15. Results on utility: an overview

Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing the largest diameter | - | The optimal diameter, as well as approximations to it, increases by at most a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability, as long as the means are pairwise sufficiently far apart
16. A special case: one sanitized point
- RDB = {x1, …, xn}
- The adversary is given n − 1 real points x2, …, xn and one sanitized point x1′ (T = 1, c = 4, flat prior)
- Recall x1′ ∈_R S(x1, ‖x1 − y‖), where y is the nearest neighbor of x1
- Main idea:
- Consider the posterior distribution on x1
- Show that the adversary cannot isolate a large probability mass under this distribution
17. A special case: one sanitized point
- Let Z = {p ∈ R^d : p is a legal pre-image for x1′}
- Let Q = {p : if x1 = p, then x1 is isolated by q}
- We show that Pr[Q ∩ Z | x1′] ≤ 2^(−Ω(d)) · Pr[Z | x1′]
- Pr[x1 ∈ Q ∩ Z | x1′] = (probability mass contributed by Q ∩ Z) / (contribution from Z) ≤ 2^(1−d) / (1/4)
(Figure: real points x1, …, x5 and the region Q ∩ Z, with ‖p − q‖ ≤ (1/3)‖p − x1′‖)
18. Contribution from Z
- Pr[x1 = p | x1′] ∝ Pr[x1′ | x1 = p] ∝ 1/r^d, where r = ‖x1′ − p‖
- An increase in r means x1′ is randomized over a larger area, proportional to r^d; hence the inverse dependence
- Pr[x1 ∈ S | x1′] = Σ_{s ∈ S} 1/r^d ∝ the solid angle subtended by S at x1′
- Z subtends a solid angle of at least half a sphere at x1′
(Figure: region S and real points x2, …, x5 around x1′)
19. Contribution from Q ∩ Z
- The ellipsoid is roughly as far from x1′ as its longest radius
- The contribution from the ellipsoid is ≤ 2^(−d) × the total solid angle
- Therefore, Pr[x1 ∈ Q ∩ Z] / Pr[x1 ∈ Z] ≤ 2^(−d)
(Figure: the region Q ∩ Z at distance r from x1′, among real points x2, …, x5)
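The 2^(−d) factor is a measure-concentration effect: in high dimension, the set of directions within any fixed angle of a target region occupies an exponentially small fraction of the sphere. A quick Monte Carlo illustration (the 45° half-angle and the dimensions tried are arbitrary choices, not the talk's parameters):

```python
import math
import random

def cap_fraction(d, half_angle, trials=20000, seed=2):
    """Estimate the fraction of uniformly random directions in R^d
    that lie within half_angle of the first coordinate axis,
    by normalizing Gaussian vectors."""
    rng = random.Random(seed)
    thresh = math.cos(half_angle)
    hits = 0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        if g[0] / norm > thresh:
            hits += 1
    return hits / trials

for d in (3, 6, 12):
    print(d, cap_fraction(d, math.pi / 4))  # fraction shrinks rapidly with d
```

In d = 3 the 45° cap holds about 15% of the directions; by d = 12 it is already well under 5%, which is the mechanism behind "contribution from the ellipsoid ≤ 2^(−d) × total solid angle".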
20. The general case: n sanitized points
- The initial intuition is wrong:
- privacy of x1 given x1′ and the other points in the clear does not imply privacy of x1 given x1′ and the sanitizations of the others!
- Problem: sanitization is non-oblivious
- Other sanitized points reveal information about x if x is their nearest neighbor
- Solution: decouple the two kinds of information from x and the xi
21. The general case: n sanitized points
- The perturbation of L is a function of R
- What function of R would reveal no information about R?
- Answer: coarse-grained histogram information!
- Divide space into cells
- Histogram count of cell C = the number of points in R ∩ C
- Perturbation radius of a point p ∝ 1 / (density of points in the cell containing p)
22. Histogram-based sanitization
- Recursively divide space into cells until all cells have few points
- Reveal the EXACT count of points in each cell
- Contrast this with k-anonymity
(Figure: T = 6)
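The recursion above can be sketched directly. This is my reading of "divide until all cells have few points" (halving every axis at each step; the data, threshold t = 6, and stopping rule are illustrative assumptions):

```python
import random
from itertools import product

def histogram_cells(points, lo, hi, t):
    """Recursively split the box [lo, hi] into its 2^d half-cells until
    every cell holds fewer than t points; reveal EXACT leaf counts.
    (Assumes no t points coincide, so the recursion terminates.)"""
    if len(points) < t:
        return [(tuple(lo), tuple(hi), len(points))]
    mid = [(l + h) / 2 for l, h in zip(lo, hi)]
    buckets = {}
    for p in points:
        key = tuple(pi >= mi for pi, mi in zip(p, mid))
        buckets.setdefault(key, []).append(p)
    cells = []
    for key in product((False, True), repeat=len(lo)):
        sub_lo = [m if k else l for l, m, k in zip(lo, mid, key)]
        sub_hi = [h if k else m for m, h, k in zip(mid, hi, key)]
        cells.extend(histogram_cells(buckets.get(key, []), sub_lo, sub_hi, t))
    return cells

rng = random.Random(3)
pts = [[rng.random(), rng.random()] for _ in range(50)]
cells = histogram_cells(pts, [0.0, 0.0], [1.0, 1.0], t=6)
# the counts are exact: they partition the data, and every leaf is small
assert sum(c for *_, c in cells) == 50
```

Unlike k-anonymity, nothing about the individual points inside a cell is published: only the cell boundaries and their exact occupancy counts.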
23. Histogram-based sanitization
- The adversary outputs (q, r): a guess and a radius of isolation
- The adversary wins if the purple ball contains at least 1 point and the orange ball contains fewer than T points
24. Histogram-based sanitization
- We show:
- If the purple ball is large, then the orange ball contains the parent cell, hence at least T points
- If the purple ball is small, then the orange ball is exponentially larger than the purple ball
- ⇒ either the purple ball contains no points or the orange ball contains at least T points
- Recall: cells are d-dimensional
25. Results on privacy: an overview

Distribution | Num. points | Revealed to adversary | Auxiliary information
Uniform on the surface of a sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or the surface of a sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a cube | n | Exact histogram counts over subcells of sufficiently large size | Distribution
Uniform over a cube | n | Perturbations of n/2 points | Distribution, exact histogram counts of subcells
26. Future directions
- Extend the privacy argument to other nice distributions
- For what distributions is there no meaningful privacy/utility trade-off?
- Characterize acceptable auxiliary information
- Think of auxiliary information as an a priori distribution
- The low-dimensional case: is it inherently impossible?
- Discrete-valued attributes
- Our proofs require a spread in all attributes
- Extend the utility argument to other interesting macroscopic properties, e.g. correlations
27. Conclusions
- A first step towards understanding the privacy/utility trade-off
- A general and rigorous definition of privacy
- A work in progress!
28. Questions?