1. From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
- Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee
2. Database Privacy
- Census data is a prototypical example
- Individuals provide information
- Census bureau publishes sanitized records
- Privacy is legally mandated; what utility can we achieve?
- Our goal
- What do we mean by preservation of privacy?
- Characterize the trade-off between privacy and utility
- Disguise individual identifying information
- Preserve macroscopic properties
- Develop a good sanitizing procedure with theoretical guarantees
3. An outline of this talk
- A mathematical formalism
- What do we mean by privacy?
- Prior work
- An abstract model of datasets
- Isolation; good sanitizations
- A candidate sanitization
- A brief overview of results
- General argument for privacy of n-point datasets
- Open issues and concluding remarks
4. Everybody's First Suggestion
- Learn the distribution, then output:
- A description of the distribution, or
- Samples from the learned distribution
- But we want to reflect facts on the ground
- Statistically insignificant clusters can be important for allocating resources
5. Database Privacy
- A long-standing research problem with a wide variety of definitions and techniques
- Statistical approaches
- Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
- Additionally, erase values that reveal too much
- Query-based approaches
- Perturb output, or disallow queries that breach privacy
6. Privacy: a philosophical viewpoint
- Ruth Gavison: privacy is protection from being brought to the attention of others
- Attention invites further loss of privacy
- Privacy is assured to the extent that one blends in with the crowd
- An appealing definition that can be converted into a precise mathematical statement!
7. What is a breach of privacy?
- The statistical approach
- Inferring that the database contains too few (say, fewer than 3) people with a given set of characteristics
- The cryptographic approach
- Guessing a value with high probability
- Both definitions are unsatisfying
- Approximating a real-valued attribute may be sufficient to breach privacy
- A case of "one size fits all"
- A combination of the two
- Guessing enough attributes that these together match few records
8. A geometric view
- Abstraction
- Points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution
- Points are unlabeled; you are your collection of attributes
- Distance is everything
- Real Database (RDB), private
- n unlabeled points in d-dimensional space
- Sanitized Database (SDB), public
- n new points, possibly in a different space
9. The adversary, or Isolator
- Using SDB and auxiliary information (AUX), outputs a point q
- q isolates a real point x if it is much closer to x than to x's neighbors, i.e., if B(q, c*delta) contains fewer than T points, where delta = ||q - x||
- T-radius of x: distance to its T-th nearest neighbor
- x is safe if delta_x > (T-radius of x)/(c - 1)
- Then B(q, c*delta_x) contains x's entire T-neighborhood
- c: privacy parameter, e.g., 4
- Large T and small c are good
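The isolation test above can be sketched in code. A minimal sketch (not from the paper; the helper names `t_radius` and `isolates` are illustrative), treating points as rows of a NumPy array under Euclidean distance:

```python
import numpy as np

def t_radius(x, points, t):
    """Distance from x to its t-th nearest neighbor among `points`
    (when x itself is a row of `points`, dists[0] == 0 is x)."""
    dists = np.sort(np.linalg.norm(points - x, axis=1))
    return dists[t]

def isolates(q, x, points, c=4, t=10):
    """q isolates x if the ball B(q, c * ||q - x||) captures fewer than
    t of the real points, i.e. x is singled out from its t-neighborhood."""
    delta = np.linalg.norm(q - x)
    inside = int(np.sum(np.linalg.norm(points - q, axis=1) <= c * delta))
    return inside < t
```

An exact guess q = x isolates x (the ball degenerates to x alone), while a wildly wrong guess does not, because the ball B(q, c*delta) then swallows the whole crowd.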
10. A good sanitization
- A sanitizing algorithm compromises privacy if the adversary can considerably increase his probability of isolating a point by looking at its output
- A rigorous definition
- For every isolator I there is an I' such that for every distribution D, every auxiliary input z, and every x: | Pr[I(SDB, z) succeeds on x] - Pr[I'(z) succeeds on x] | is small
- The definition of "small" can be forgiving, say, n^-2
- Quantification over x: if aux reveals info about some x, the privacy of every other y should still be preserved
- Provides a framework for describing the power of a sanitization method, and hence for comparisons
11. The Sanitizer
- The privacy of x is linked to its T-radius
- So: randomly perturb x in proportion to its T-radius
- x' = San(x), drawn uniformly from B(x, T-rad(x))
- Intuition
- We are blending x in with its crowd
- If the number of dimensions d is large, there are many pre-images for x'; the adversary cannot conclusively pick any one
- We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
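The perturbation step can be sketched as follows. This is a minimal sketch, assuming Euclidean balls and uniform sampling; the function names are illustrative, not from the paper:

```python
import numpy as np

def sample_in_ball(center, radius, rng):
    """Uniform sample from the Euclidean ball B(center, radius)."""
    d = center.shape[0]
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)                     # uniform random direction
    r = radius * rng.uniform() ** (1.0 / d)    # radial density proportional to r^(d-1)
    return center + r * v

def sanitize(points, t, rng):
    """San(x): perturb each x uniformly within B(x, t-radius of x)."""
    out = np.empty_like(points)
    for i, x in enumerate(points):
        dists = np.sort(np.linalg.norm(points - x, axis=1))
        out[i] = sample_in_ball(x, dists[t], rng)   # dists[0] == 0 is x itself
    return out
```

By construction each sanitized point stays within the t-radius of its original, so dense regions move little while outliers are smeared over a wide ball.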
12. Flavor of Results (Preliminary)
- Assumptions
- Data arises from a mixture of Gaussians
- The dimension d and the number of points n are large; d = omega(log n)
- Results
- Privacy: an adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^-Omega(d)
- Several special cases; the general result is not yet proved
- Very different proof techniques from anything in the statistics or crypto literatures!
- Utility: an honest user who does not know the Gaussians can compute the means with high probability
13. Results on privacy: an overview

Distribution | Num. of points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a bounding box (under construction) | 2^Omega(d) | n/2 sanitized points | Distribution
Gaussian | 2^o(d) | n sanitized points | Distribution
14. Results on utility: an overview

Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing largest diameter | (none) | Optimal diameter, as well as approximations, increase by at most a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability, as long as means are pairwise sufficiently far
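For the worst-case objective above, a standard baseline is Gonzalez's farthest-first traversal, a classic greedy 2-approximation for minimizing the maximum cluster radius. The sketch below is illustrative background, not the paper's algorithm; the slides' factor-3 claim concerns how such a clustering degrades when run on sanitized rather than real data:

```python
import numpy as np

def farthest_first(points, k):
    """Gonzalez's farthest-first traversal: pick k centers greedily,
    each new center the point farthest from the centers chosen so far,
    then assign every point to its nearest center."""
    centers = [0]                                   # start from an arbitrary point
    d = np.linalg.norm(points - points[0], axis=1)  # distance to nearest center
    for _ in range(1, k):
        nxt = int(np.argmax(d))
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    diffs = points[:, None, :] - points[centers][None, :, :]
    labels = np.argmin(np.linalg.norm(diffs, axis=2), axis=1)
    return centers, labels
```

On well-separated blobs the greedy choice lands one center per blob, so the recovered clusters match the true ones.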
15. A representative case: one sanitized point
- RDB = {x1, ..., xn}
- The adversary is given the n-1 real points x2, ..., xn and one sanitized point x1' (T = 1, flat prior)
- Recall x1' is drawn uniformly from B(x1, ||x1 - y||), where y is the nearest neighbor of x1
- Main idea
- Consider the posterior distribution on x1
- Show that the adversary cannot isolate a large probability mass under this distribution
16. A representative case: one sanitized point
- Let Z = { p in R^d : p is a legal pre-image for x1' }
- Let Q = { p : if x1 = p, then x1 is isolated by q }
- We show that Pr[Q intersect Z | x1'] <= 2^-Omega(d) * Pr[Z | x1']
- Pr[x1 in Q intersect Z | x1'] = (probability mass contributed by Q intersect Z) / (contribution from Z) <= 2^(1-d) / (1/4)
- Here ||p - q|| <= (1/3) ||p - x1'||
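Putting the two contributions together, a hedged reconstruction of the slide's calculation (the constants 2^(1-d) and 1/4 are justified on the next two slides):

```latex
\Pr[x_1 \in Q \cap Z \mid x_1']
  = \frac{\text{mass of } Q \cap Z}{\text{mass of } Z}
  \le \frac{2^{1-d}}{1/4}
  = 2^{3-d}
  = 2^{-\Omega(d)}
```

So even conditioned on the sanitized point, the isolable pre-images carry an exponentially small fraction of the posterior mass.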
17. Contribution from Z
- Pr[x1' | x1 = p] is proportional to 1/r^d, where r = ||x1' - p||
- An increase in r means x1' gets randomized over a larger area, proportional to r^d; hence the inverse dependence
- Pr[x1 in S | x1'] is proportional to the sum over p in S of 1/r^d, i.e., to the solid angle subtended by S at x1'
- Z subtends a solid angle equal to at least half a sphere at x1'
18. Contribution from Q intersect Z
- The ellipsoid is roughly as far from x1' as its longest radius
- The contribution from the ellipsoid is at most 2^-d times the total solid angle
- Therefore, Pr[x1 in Q intersect Z | x1'] / Pr[x1 in Z | x1'] <= 2^-d
19. The general case: n sanitized points
- The initial intuition is wrong
- Privacy of x1 given x1' and all the other points in the clear does not imply privacy of x1 given x1' and sanitizations of the others!
- Sanitization is non-oblivious: other sanitized points reveal information about x if x is their nearest neighbor
- A possible approach: decouple the two kinds of information, from x and from the xi
20. The general case: n sanitized points
- The initial intuition is wrong
- Privacy of x1 given x1' and all the other points in the clear does not imply privacy of x1 given x1' and sanitizations of the others!
- Sanitization is non-oblivious: other sanitized points reveal information about x if x is their nearest neighbor
- Where we are now
- Consider some example of a safe sanitization (not necessarily using perturbations)
- Density regions? Histograms?
- Relate perturbations to the safe sanitization
21. Summary (1): Results on privacy

Distribution | Num. of points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a bounding box (under construction) | 2^Omega(d) | n/2 sanitized points | Distribution
Gaussian | 2^o(d) | n sanitized points | Distribution
22. Summary (2): Results on utility

Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing largest diameter | (none) | Optimal diameter, as well as approximations, increase by at most a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability, as long as means are pairwise sufficiently far
23. Future directions
- Extend the privacy argument to other nice distributions
- For what distributions is there no meaningful privacy/utility trade-off?
- Can revealing the distribution hurt privacy?
- Characterize the kind of auxiliary information that is acceptable
- This depends on the distribution on the data points
- Think of auxiliary information as an a priori distribution
- Our proofs: full knowledge about some real points, no knowledge about others
24. Future directions
- The low-dimensional case
- Is it inherently impossible?
- Dinur and Nissim show impossibility for the 1-dimensional case
- Discrete-valued attributes
- Real-world data is rarely real-valued
- Our proofs require a spread in all attributes
- A possible solution: convert binary values to probabilities (for example)
- Can the adversary gain an advantage by rounding off the values?
- Extend the utility argument to other interesting macroscopic properties, e.g., correlations
25. Questions?
26. Learning mixtures of Gaussians (spectral methods)
- Observation: the top eigenvectors of a matrix span a low-dimensional space that yields a good approximation of complex data sets, in particular Gaussian data
- Intuition
- Sampled points are close to the means of the corresponding Gaussians in any subspace
- The span of the top k singular vectors approximates the span of the means
- Distances between the means of the Gaussians are preserved
- Other distances shrink by a factor of sqrt(k/n)
- Our goal: show that the same algorithm works for clustering sanitized data
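The projection step can be sketched with a plain SVD. The data below (two spherical Gaussians separated along one coordinate) is an illustrative assumption, not data from the paper:

```python
import numpy as np

def spectral_project(data, k):
    """Project the centered data onto the span of its top-k right
    singular vectors (the top-k subspace of the data matrix)."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T    # coordinates in the top-k subspace

# Two spherical Gaussians in high dimension, means 10 apart along e1.
rng = np.random.default_rng(3)
d, n = 200, 400
mu = np.zeros(d)
mu[0] = 10.0
data = np.concatenate([rng.normal(0.0, 1.0, (n, d)),
                       rng.normal(0.0, 1.0, (n, d)) + mu])
proj = spectral_project(data, 2)
```

After projection the inter-mean distance is essentially intact, while the noise in the d - k discarded directions is gone, which is exactly the separation the clustering step relies on.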
27. Spectral techniques for perturbed data
- A sanitized point is the sum of two Gaussian-like variables: sample + noise
- W.h.p. the 1-radius of a point is less than the radius of its Gaussian
- So the variance of the noise is small
- Sanitized points are still close to their means (uses independence of direction)
- The span of the top k singular vectors still approximates the span of the means of the Gaussians
- Distances between means are preserved; others shrink
28. Current solutions
- Statistical approaches
- Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
- Additionally, erase values that reveal too much
- Query-based approaches
- Perturb output, or disallow queries that breach privacy
- Unsatisfying
- Overly constrained definitions; ad hoc techniques
- Ad hoc treatment of external sources of info
- Erasure can disclose information; refusal to answer may be revelatory
29. Our Approach
- Crypto-flavored definitions
- Mathematical characterization of the adversary's goal
- Precise definition of when a sanitization procedure fails
- Intuition: seeing the sanitized DB gives the adversary an advantage
- Statistical techniques
- Perturbation of attribute values
- Differs from previous work: perturbation amounts depend on local densities of points
- Highly abstracted version of the problem
- If we can't understand this, we can't understand real life
- If we get negative results here, the world is in trouble
30. The Sanitizer
- The privacy of x is linked to its T-radius
- So: randomly perturb x in proportion to its T-radius
- x' = San(x), drawn uniformly from B(x, T-rad(x))
- T = 1
31. The general case: n points
- Z = { p : p is a legal pre-image for x1' }
- Q = { p : x1 = p is isolated by q }
- Key observation: as ||q - x|| increases, Q becomes larger; but a larger distance from x implies a smaller probability mass, because x is randomized over a larger area
- The probability depends only on the solid angle subtended at x
32. The simplest interesting case
- RDB = {x, y}, with x and y drawn uniformly from B(o, r), where o is the origin
- T = 1, c = 4; SDB = {x', y'}
- The adversary knows x', y', r, and delta = ||x - y||
- We show
- There are m = 2^Omega(d) decoy pairs (xi, yi)
- The (xi, yi) are legal pre-images of (x', y'); that is, ||xi - yi|| = delta and Pr[(xi, yi) | (x', y')] = Pr[(x, y) | (x', y')]
- The adversary cannot know which of the (xi, yi) represents reality
- The adversary can only isolate one point in {x1, y1, ..., xm, ym} at a time
33. The simplest interesting case
- Consider a hyperplane H through x', y', and o
- Let xH, yH be the mirror reflections of x, y through H
- Note: reflections preserve distances!
- The world of (xH, yH) looks identical to the world of (x, y):
  Pr[(xH, yH) | (x', y')] = Pr[(x, y) | (x', y')]
34. The simplest interesting case
- Consider a hyperplane H through x', y', and o
- Let xH, yH be the mirror reflections of x, y through H
- Note: reflections preserve distances!
- The world of (xH, yH) looks identical to the world of (x, y)
- How many different H are there such that the corresponding xH are pairwise distant?
- It is sufficient to pick radius (2/3)d and angle 30 degrees
- Fact: there are 2^Omega(d) vectors in d dimensions at angle 60 degrees from each other
- So the probability that the adversary wins is 2^-Omega(d)
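The mirror-world argument relies on the fact that a reflection through any hyperplane containing the origin is an isometry: it preserves distances to the origin and between the reflected points. A small check (Householder reflection; variable names illustrative):

```python
import numpy as np

def reflect(p, v):
    """Householder reflection of p through the hyperplane through the
    origin with unit normal v: p - 2 (v . p) v."""
    return p - 2.0 * np.dot(v, p) * v

rng = np.random.default_rng(4)
d = 50
x, y = rng.normal(size=d), rng.normal(size=d)
v = rng.normal(size=d)
v /= np.linalg.norm(v)        # unit normal of the mirror hyperplane
xh, yh = reflect(x, v), reflect(y, v)
# Isometry: norms and the pairwise distance are unchanged, and
# reflecting twice recovers the original point.
```

Since distances to o and ||xH - yH|| match those of (x, y), each reflected pair is an equally likely pre-image of the sanitized pair.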
35. The general case: n sanitized points
- Claim 1 (privacy for L)
- Given all sanitizations, all points in R, and all but one point in L, the adversary cannot isolate the last point
- This follows from the proof for n-1 real points
- Claim 2 (privacy for R)
- Given all sanitizations, all points in L, and all but one point in R, the adversary cannot isolate the last point
- Work in progress
- Idea: show that the adversary cannot distinguish whether R contains some point x or not (an information-theoretic argument)