1. From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
- Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee
2. Database Privacy
- Census data is a prototypical example
- Individuals provide information
- Census bureau publishes sanitized records
- Privacy is legally mandated; what utility can we achieve?
- Our goal
- What do we mean by preservation of privacy?
- Characterize the trade-off between privacy and utility
- Disguise individual identifying information
- Preserve macroscopic properties
- Develop a good sanitizing procedure with theoretical guarantees
3. An outline of this talk
- A mathematical formalism
- What do we mean by privacy?
- Prior work
- An abstract model of datasets
- Isolation; good sanitizations
- A candidate sanitization
- A brief overview of results
- General argument for privacy of n-point datasets
- Open issues and concluding remarks
4. Everybody's First Suggestion
- Learn the distribution, then output:
- A description of the distribution, or
- Samples from the learned distribution
- But we want to reflect facts on the ground
- Statistically insignificant clusters can be important for allocating resources
5. Database Privacy
- A long-standing research problem with a wide variety of definitions and techniques
- Statistical approaches
- Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
- Additionally, erase values that reveal too much
- Query-based approaches
- Perturb output, or disallow queries that breach privacy
6. Privacy: a philosophical viewpoint
- Ruth Gavison: privacy is protection from being brought to the attention of others
- Attention invites further loss of privacy
- Privacy is assured to the extent that one blends in with the crowd
- An appealing definition that can be converted into a precise mathematical statement!
7. What is a breach of privacy?
- The statistical approach
- Inferring that the database contains too few (say, fewer than 3) people with a given set of characteristics
- The cryptographic approach
- Guessing a value with high probability
- Both definitions are unsatisfying
- Approximating a real-valued attribute may be sufficient to breach privacy
- A case of "one size fits all"
- A combination of the two
- Guessing enough attributes that these together match few records
8. A geometric view
- Abstraction
- Points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution
- Points are unlabeled; you are your collection of attributes
- Distance is everything
- Real Database (RDB), private
- n unlabeled points in d-dimensional space
- Sanitized Database (SDB), public
- n new points, possibly in a different space
9. The adversary, or Isolator
- Using SDB and auxiliary information (AUX), outputs a point q
- q isolates a real point x if it is much closer to x than to x's neighbors, i.e., if B(q, c*delta) contains fewer than T points, where delta = ||q - x||
- T-radius of x: distance to its T-th nearest neighbor
- x is safe if delta_x > (T-radius of x)/(c - 1)
- Then B(q, c*delta_x) contains x's entire T-neighborhood
- c: privacy parameter, e.g., 4
- Large T and small c are good
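The isolation test above can be sketched in code. A minimal sketch (not from the paper; the helper names `t_radius` and `isolates` are illustrative), treating points as rows of a NumPy array under Euclidean distance:

```python
import numpy as np

def t_radius(x, points, t):
    """Distance from x to its t-th nearest neighbor among `points`
    (when x itself is a row of `points`, dists[0] == 0 is x)."""
    dists = np.sort(np.linalg.norm(points - x, axis=1))
    return dists[t]

def isolates(q, x, points, c=4, t=10):
    """q isolates x if the ball B(q, c * ||q - x||) captures fewer than
    t of the real points, i.e. x is singled out from its t-neighborhood."""
    delta = np.linalg.norm(q - x)
    inside = int(np.sum(np.linalg.norm(points - q, axis=1) <= c * delta))
    return inside < t
```

An exact guess q = x isolates x (the ball degenerates to x alone), while a wildly wrong guess does not, because the ball B(q, c*delta) then swallows the whole crowd.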
10. A good sanitization
- A sanitizing algorithm compromises privacy if the adversary can considerably increase his probability of isolating a point by looking at its output
- A rigorous definition
- For every isolator I there is an I' such that for every distribution D, every auxiliary input z, and every x: | Pr[I(SDB, z) succeeds on x] - Pr[I'(z) succeeds on x] | is small
- The definition of "small" can be forgiving, say, n^-2
- Quantification over x: if aux reveals info about some x, the privacy of every other y should still be preserved
- Provides a framework for describing the power of a sanitization method, and hence for comparisons
11. The Sanitizer
- The privacy of x is linked to its T-radius
- So: randomly perturb x in proportion to its T-radius
- x' = San(x), drawn uniformly from B(x, T-rad(x))
- Intuition
- We are blending x in with its crowd
- If the number of dimensions d is large, there are many pre-images for x'; the adversary cannot conclusively pick any one
- We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
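The perturbation step can be sketched as follows. This is a minimal sketch, assuming Euclidean balls and uniform sampling; the function names are illustrative, not from the paper:

```python
import numpy as np

def sample_in_ball(center, radius, rng):
    """Uniform sample from the Euclidean ball B(center, radius)."""
    d = center.shape[0]
    v = rng.normal(size=d)
    v /= np.linalg.norm(v)                     # uniform random direction
    r = radius * rng.uniform() ** (1.0 / d)    # radial density proportional to r^(d-1)
    return center + r * v

def sanitize(points, t, rng):
    """San(x): perturb each x uniformly within B(x, t-radius of x)."""
    out = np.empty_like(points)
    for i, x in enumerate(points):
        dists = np.sort(np.linalg.norm(points - x, axis=1))
        out[i] = sample_in_ball(x, dists[t], rng)   # dists[0] == 0 is x itself
    return out
```

By construction each sanitized point stays within the t-radius of its original, so dense regions move little while outliers are smeared over a wide ball.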
12. Flavor of Results (Preliminary)
- Assumptions
- Data arises from a mixture of Gaussians
- The dimension d and the number of points n are large; d = omega(log n)
- Results
- Privacy: an adversary who knows the Gaussians and some auxiliary information cannot isolate any point with probability more than 2^-Omega(d)
- Several special cases; the general result is not yet proved
- Very different proof techniques from anything in the statistics or crypto literatures!
- Utility: an honest user who does not know the Gaussians can compute the means with high probability
13. Results on privacy: an overview

Distribution | Num. of points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a bounding box (under construction) | 2^Omega(d) | n/2 sanitized points | Distribution
Gaussian | 2^o(d) | n sanitized points | Distribution
14. Results on utility: an overview

Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing largest diameter | (none) | Optimal diameter, as well as approximations, increase by at most a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability, as long as means are pairwise sufficiently far
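For the worst-case objective above, a standard baseline is Gonzalez's farthest-first traversal, a classic greedy 2-approximation for minimizing the maximum cluster radius. The sketch below is illustrative background, not the paper's algorithm; the slides' factor-3 claim concerns how such a clustering degrades when run on sanitized rather than real data:

```python
import numpy as np

def farthest_first(points, k):
    """Gonzalez's farthest-first traversal: pick k centers greedily,
    each new center the point farthest from the centers chosen so far,
    then assign every point to its nearest center."""
    centers = [0]                                   # start from an arbitrary point
    d = np.linalg.norm(points - points[0], axis=1)  # distance to nearest center
    for _ in range(1, k):
        nxt = int(np.argmax(d))
        centers.append(nxt)
        d = np.minimum(d, np.linalg.norm(points - points[nxt], axis=1))
    diffs = points[:, None, :] - points[centers][None, :, :]
    labels = np.argmin(np.linalg.norm(diffs, axis=2), axis=1)
    return centers, labels
```

On well-separated blobs the greedy choice lands one center per blob, so the recovered clusters match the true ones.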
15. A representative case: one sanitized point
- RDB = {x1, ..., xn}
- The adversary is given the n-1 real points x2, ..., xn and one sanitized point x1' (T = 1, flat prior)
- Recall x1' is drawn uniformly from B(x1, ||x1 - y||), where y is the nearest neighbor of x1
- Main idea
- Consider the posterior distribution on x1
- Show that the adversary cannot isolate a large probability mass under this distribution
16. A representative case: one sanitized point
- Let Z = { p in R^d : p is a legal pre-image for x1' }
- Let Q = { p : if x1 = p, then x1 is isolated by q }
- We show that Pr[Q intersect Z | x1'] <= 2^-Omega(d) * Pr[Z | x1']
- Pr[x1 in Q intersect Z | x1'] = (probability mass contributed by Q intersect Z) / (contribution from Z) <= 2^(1-d) / (1/4)
- Here ||p - q|| <= (1/3) ||p - x1'||
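Putting the two contributions together, a hedged reconstruction of the slide's calculation (the constants 2^(1-d) and 1/4 are justified on the next two slides):

```latex
\Pr[x_1 \in Q \cap Z \mid x_1']
  = \frac{\text{mass of } Q \cap Z}{\text{mass of } Z}
  \le \frac{2^{1-d}}{1/4}
  = 2^{3-d}
  = 2^{-\Omega(d)}
```

So even conditioned on the sanitized point, the isolable pre-images carry an exponentially small fraction of the posterior mass.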
17. Contribution from Z
- Pr[x1' | x1 = p] is proportional to 1/r^d, where r = ||x1' - p||
- An increase in r means x1' gets randomized over a larger area, proportional to r^d; hence the inverse dependence
- Pr[x1 in S | x1'] is proportional to the sum over p in S of 1/r^d, i.e., to the solid angle subtended by S at x1'
- Z subtends a solid angle equal to at least half a sphere at x1'
18. Contribution from Q intersect Z
- The ellipsoid is roughly as far from x1' as its longest radius
- The contribution from the ellipsoid is at most 2^-d times the total solid angle
- Therefore, Pr[x1 in Q intersect Z | x1'] / Pr[x1 in Z | x1'] <= 2^-d
19. The general case: n sanitized points
- The initial intuition is wrong
- Privacy of x1 given x1' and all the other points in the clear does not imply privacy of x1 given x1' and sanitizations of the others!
- Sanitization is non-oblivious: other sanitized points reveal information about x if x is their nearest neighbor
- A possible approach: decouple the two kinds of information, from x and from the xi
20. The general case: n sanitized points
- The initial intuition is wrong
- Privacy of x1 given x1' and all the other points in the clear does not imply privacy of x1 given x1' and sanitizations of the others!
- Sanitization is non-oblivious: other sanitized points reveal information about x if x is their nearest neighbor
- Where we are now
- Consider some example of a safe sanitization (not necessarily using perturbations)
- Density regions? Histograms?
- Relate perturbations to the safe sanitization
21. Summary (1): Results on privacy

Distribution | Num. of points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a bounding box (under construction) | 2^Omega(d) | n/2 sanitized points | Distribution
Gaussian | 2^o(d) | n sanitized points | Distribution
22. Summary (2): Results on utility

Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing largest diameter | (none) | Optimal diameter, as well as approximations, increase by at most a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability, as long as means are pairwise sufficiently far
23. Future directions
- Extend the privacy argument to other nice distributions
- For what distributions is there no meaningful privacy/utility trade-off?
- Can revealing the distribution hurt privacy?
- Characterize the kind of auxiliary information that is acceptable
- This depends on the distribution on the data points
- Think of auxiliary information as an a priori distribution
- Our proofs: full knowledge about some real points, no knowledge about others
24. Future directions
- The low-dimensional case
- Is it inherently impossible?
- Dinur and Nissim show impossibility for the 1-dimensional case
- Discrete-valued attributes
- Real-world data is rarely real-valued
- Our proofs require a spread in all attributes
- A possible solution: convert binary values to probabilities (for example)
- Can the adversary gain an advantage by rounding off the values?
- Extend the utility argument to other interesting macroscopic properties, e.g., correlations
25. Questions?
26. Learning mixtures of Gaussians (spectral methods)
- Observation: the top eigenvectors of a matrix span a low-dimensional space that yields a good approximation of complex data sets, in particular Gaussian data
- Intuition
- Sampled points are close to the means of the corresponding Gaussians in any subspace
- The span of the top k singular vectors approximates the span of the means
- Distances between the means of the Gaussians are preserved
- Other distances shrink by a factor of sqrt(k/n)
- Our goal: show that the same algorithm works for clustering sanitized data
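The projection step can be sketched with a plain SVD. The data below (two spherical Gaussians separated along one coordinate) is an illustrative assumption, not data from the paper:

```python
import numpy as np

def spectral_project(data, k):
    """Project the centered data onto the span of its top-k right
    singular vectors (the top-k subspace of the data matrix)."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T    # coordinates in the top-k subspace

# Two spherical Gaussians in high dimension, means 10 apart along e1.
rng = np.random.default_rng(3)
d, n = 200, 400
mu = np.zeros(d)
mu[0] = 10.0
data = np.concatenate([rng.normal(0.0, 1.0, (n, d)),
                       rng.normal(0.0, 1.0, (n, d)) + mu])
proj = spectral_project(data, 2)
```

After projection the inter-mean distance is essentially intact, while the noise in the d - k discarded directions is gone, which is exactly the separation the clustering step relies on.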
27. Spectral techniques for perturbed data
- A sanitized point is the sum of two Gaussian-like variables: sample + noise
- W.h.p. the 1-radius of a point is less than the radius of its Gaussian
- So the variance of the noise is small
- Sanitized points are still close to their means (uses independence of direction)
- The span of the top k singular vectors still approximates the span of the means of the Gaussians
- Distances between means are preserved; others shrink
28. Current solutions
- Statistical approaches
- Alter the frequency (PRAN/DS/PERT) of particular features, while preserving means
- Additionally, erase values that reveal too much
- Query-based approaches
- Perturb output, or disallow queries that breach privacy
- Unsatisfying
- Overly constrained definitions; ad hoc techniques
- Ad hoc treatment of external sources of info
- Erasure can disclose information; refusal to answer may be revelatory
29. Our Approach
- Crypto-flavored definitions
- Mathematical characterization of the adversary's goal
- Precise definition of when a sanitization procedure fails
- Intuition: seeing the sanitized DB gives the adversary an advantage
- Statistical techniques
- Perturbation of attribute values
- Differs from previous work: perturbation amounts depend on local densities of points
- Highly abstracted version of the problem
- If we can't understand this, we can't understand real life
- If we get negative results here, the world is in trouble
30. The Sanitizer
- The privacy of x is linked to its T-radius
- So: randomly perturb x in proportion to its T-radius
- x' = San(x), drawn uniformly from B(x, T-rad(x))
- T = 1
31. The general case: n points
- Z = { p : p is a legal pre-image for x1' }
- Q = { p : x1 = p is isolated by q }
- Key observation: as ||q - x|| increases, Q becomes larger; but a larger distance from x implies a smaller probability mass, because x is randomized over a larger area
- The probability depends only on the solid angle subtended at x
32. The simplest interesting case
- RDB = {x, y}, with x and y drawn uniformly from B(o, r), where o is the origin
- T = 1, c = 4; SDB = {x', y'}
- The adversary knows x', y', r, and delta = ||x - y||
- We show
- There are m = 2^Omega(d) decoy pairs (xi, yi)
- The (xi, yi) are legal pre-images of (x', y'); that is, ||xi - yi|| = delta and Pr[(xi, yi) | (x', y')] = Pr[(x, y) | (x', y')]
- The adversary cannot know which of the (xi, yi) represents reality
- The adversary can only isolate one point in {x1, y1, ..., xm, ym} at a time
33. The simplest interesting case
- Consider a hyperplane H through x', y', and o
- Let xH, yH be the mirror reflections of x, y through H
- Note: reflections preserve distances!
- The world of (xH, yH) looks identical to the world of (x, y):
  Pr[(xH, yH) | (x', y')] = Pr[(x, y) | (x', y')]
34. The simplest interesting case
- Consider a hyperplane H through x', y', and o
- Let xH, yH be the mirror reflections of x, y through H
- Note: reflections preserve distances!
- The world of (xH, yH) looks identical to the world of (x, y)
- How many different H are there such that the corresponding xH are pairwise distant?
- It is sufficient to pick radius (2/3)d and angle 30 degrees
- Fact: there are 2^Omega(d) vectors in d dimensions at angle 60 degrees from each other
- So the probability that the adversary wins is 2^-Omega(d)
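The mirror-world argument relies on the fact that a reflection through any hyperplane containing the origin is an isometry: it preserves distances to the origin and between the reflected points. A small check (Householder reflection; variable names illustrative):

```python
import numpy as np

def reflect(p, v):
    """Householder reflection of p through the hyperplane through the
    origin with unit normal v: p - 2 (v . p) v."""
    return p - 2.0 * np.dot(v, p) * v

rng = np.random.default_rng(4)
d = 50
x, y = rng.normal(size=d), rng.normal(size=d)
v = rng.normal(size=d)
v /= np.linalg.norm(v)        # unit normal of the mirror hyperplane
xh, yh = reflect(x, v), reflect(y, v)
# Isometry: norms and the pairwise distance are unchanged, and
# reflecting twice recovers the original point.
```

Since distances to o and ||xH - yH|| match those of (x, y), each reflected pair is an equally likely pre-image of the sanitized pair.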
35. The general case: n sanitized points
- Claim 1 (privacy for L)
- Given all sanitizations, all points in R, and all but one point in L, the adversary cannot isolate the last point
- This follows from the proof for n-1 real points
- Claim 2 (privacy for R)
- Given all sanitizations, all points in L, and all but one point in R, the adversary cannot isolate the last point
- Work in progress
- Idea: show that the adversary cannot distinguish whether R contains some point x or not (an information-theoretic argument)