1
From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
  • Shuchi Chawla, Cynthia Dwork,
    Frank McSherry, Adam Smith, Larry
    Stockmeyer, Hoeteck Wee

2
Database Privacy
  • Census data: a prototypical example
  • Individuals provide information
  • Census bureau publishes sanitized records
  • Privacy is legally mandated; what utility can we achieve?
  • Our Goal
  • What do we mean by preservation of privacy?
  • Characterize the trade-off between privacy and
    utility
  • disguise individual identifying information
  • preserve macroscopic properties
  • Develop a good sanitizing procedure with
    theoretical guarantees

3
An outline of this talk
  • A mathematical formalism
  • What do we mean by privacy?
  • Prior work
  • An abstract model of datasets
  • Isolation; good sanitizations
  • A candidate sanitization
  • A brief overview of results
  • General argument for privacy of n-point datasets
  • Open issues and concluding remarks

4
Everybody's First Suggestion
  • Learn the distribution, then output
  • A description of the distribution, or,
  • Samples from the learned distribution
  • Want to reflect facts on the ground
  • Statistically insignificant clusters can be
    important for allocating resources

5
Database Privacy
  • A long-standing research problem; a wide variety of definitions and techniques
  • Statistical approaches
  • Alter the frequency (PRAN/DS/PERT) of particular
    features, while preserving means.
  • Additionally, erase values that reveal too much
  • Query-based approaches
  • Perturb output or disallow queries that breach
    privacy

6
Privacy a philosophical view-point
  • Ruth Gavison: Privacy is protection from being brought to the attention of others
  • Attention invites further loss of privacy
  • Privacy is assured to the extent that one blends
    in with the crowd
  • An appealing definition that can be converted into a precise mathematical statement!

7
What is a breach of privacy?
  • The statistical approach
  • Inferring that the database contains too few (≤ 3) people with a given set of characteristics
  • The cryptographic approach
  • Guessing a value with high probability
  • Unsatisfying definitions
  • Approximating a real-valued attribute may be
    sufficient to breach privacy
  • A case of one size fits all
  • A combination of the two
  • Guessing enough attributes such that these
    together match few records

8
A geometric view
  • Abstraction
  • Points in a high-dimensional metric space, say ℝ^d, drawn i.i.d. from some distribution
  • Points are unlabeled; you are your collection of attributes
  • Distance is everything
  • Real Database (RDB): private
  • n unlabeled points in d-dimensional space
  • Sanitized Database (SDB): public
  • n new points, possibly in a different space

9
The adversary or Isolator
  • Using SDB and auxiliary information (AUX), outputs a point q
  • q isolates a real point x if it is much closer to x than to x's neighbors
  • T-radius of x: the distance to its T-th nearest neighbor
  • x is safe if δ_x > (T-radius of x)/(c-1), where δ_x = ||q - x||:
    then B(q, c·δ_x) contains x's entire T-neighborhood
  • Isolation: q isolates x if B(q, c·δ_x) contains fewer than T points

c: privacy parameter, e.g., c = 4
Large T and small c is good (see the isolation-check sketch below)
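
The isolation predicate above is easy to state in code. Here is a minimal sketch (not from the talk); the function name, the Euclidean metric, and the convention of counting only real points other than x are my assumptions.

```python
import numpy as np

def isolates(q, x, rdb, T=1, c=4):
    """Sketch of the (c, T)-isolation check from this slide.

    q isolates x if the ball B(q, c * delta), with delta = ||q - x||,
    captures fewer than T of the other real points, i.e. q is much
    closer to x than to x's T-neighborhood.  Whether x itself is
    counted is a convention; counting only the others matches the
    T = 1 examples later in the talk.
    """
    delta = np.linalg.norm(q - x)
    others = [p for p in rdb if not np.array_equal(p, x)]
    inside = sum(np.linalg.norm(q - p) <= c * delta for p in others)
    return inside < T

# Example: a tight cluster plus one outlier.
rng = np.random.default_rng(0)
cluster = list(rng.normal(0.0, 1.0, size=(50, 10)))
outlier = np.full(10, 25.0)
rdb = cluster + [outlier]
print(isolates(outlier + 0.1, outlier, rdb))            # True: no other point nearby
print(isolates(cluster[0] + 2.0, cluster[0], rdb, T=5))  # False: B(q, c*delta) swallows the crowd
```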
10
A good sanitization
  • A sanitizing algorithm compromises privacy if the adversary can
    considerably increase his probability of isolating a point by looking
    at its output
  • A rigorous definition
  • ∀ I ∀ D ∀ aux z ∀ x ∃ I′:
    Pr[I(SDB, z) succeeds on x] - Pr[I′(z) succeeds on x] is small
  • The definition of "small" can be forgiving, say, n^(-2)
  • Quantification over x: if aux reveals info about some x, the privacy
    of some other y should still be preserved
  • Provides a framework for describing the power of
    a sanitization method, and hence for comparisons

11
The Sanitizer
  • The privacy of x is linked to its T-radius
  • ⇒ Randomly perturb x in proportion to its T-radius
  • x′ = San(x) ∈_R B(x, T-radius(x))
  • Intuition
  • We are blending x in with its crowd
  • If the number of dimensions d is large, there are many pre-images for
    x′; the adversary cannot conclusively pick any one
  • We are adding random noise with mean zero to x, so several macroscopic
    properties should be preserved (see the sketch below)
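
A minimal sketch of this perturbation step, assuming Euclidean balls and uniform sampling; the helper names (t_radius, sample_ball, sanitize) are made up for illustration and are not the authors' code. The last lines illustrate the mean-zero remark: the sample mean barely moves.

```python
import numpy as np

def t_radius(rdb, i, T=1):
    """Distance from rdb[i] to its T-th nearest neighbor (rdb is an (n, d) array)."""
    dists = np.sort(np.linalg.norm(rdb - rdb[i], axis=1))
    return dists[T]                          # dists[0] == 0 is the point itself

def sample_ball(center, radius, rng):
    """Uniform random sample from the ball B(center, radius) in R^d."""
    d = center.shape[0]
    direction = rng.standard_normal(d)
    direction /= np.linalg.norm(direction)
    r = radius * rng.random() ** (1.0 / d)   # uniform over the ball, not just the sphere
    return center + r * direction

def sanitize(rdb, T=1, seed=0):
    """x' = San(x), drawn uniformly from B(x, T-radius(x)), for each point."""
    rng = np.random.default_rng(seed)
    return np.array([sample_ball(rdb[i], t_radius(rdb, i, T), rng)
                     for i in range(len(rdb))])

# The perturbation is mean-zero, so macroscopic statistics such as the
# sample mean stay close to those of the real data:
rdb = np.random.default_rng(1).normal(size=(2000, 50))
sdb = sanitize(rdb)
print(np.linalg.norm(rdb.mean(axis=0) - sdb.mean(axis=0)))  # small compared to the data scale
```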

12
Flavor of Results (Preliminary)
  • Assumptions
  • Data arises from a mixture of Gaussians
  • The dimension d and the number of points n are large; d = ω(log n)
  • Results
  • Privacy: an adversary who knows the Gaussians and some auxiliary
    information cannot isolate any point with probability more than 2^(-Ω(d))
  • Several special cases; the general result is not yet proved
  • Very different proof techniques from anything in the statistics or
    crypto literatures!
  • Utility: an honest user who does not know the Gaussians can compute
    the means with high probability

13
Results on privacy: an overview

Distribution | Num. of points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a bounding box (under construction) | 2^Ω(d) | n/2 sanitized points | Distribution
Gaussian | 2^o(d) | n sanitized points | Distribution
14
Results on utility: an overview

Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing the largest diameter | - | The optimal diameter, as well as approximations, increases by at most a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability, as long as the means are pairwise sufficiently far apart
15
A representative case - one sanitized point
  • RDB = {x1, ..., xn}
  • The adversary is given the n-1 real points x2, ..., xn and one
    sanitized point x1′; T = 1; flat prior
  • Recall x1′ ∈_R B(x1, ||x1 - y||),
    where y is the nearest neighbor of x1
  • Main idea
  • Consider the posterior distribution on x1
  • Show that the adversary cannot isolate a large probability mass under
    this distribution

16
A representative case - one sanitized point
  • Let Z = {p ∈ ℝ^d : p is a legal pre-image for x1′}
  • Let Q = {p : if x1 = p, then x1 is isolated by q}
  • We show that Pr[Q ∩ Z | x1′] ≤ 2^(-Ω(d)) · Pr[Z | x1′]
  • Pr[x1 ∈ Q ∩ Z | x1′]
    = (probability-mass contribution from Q ∩ Z) / (contribution from Z)
    ≤ 2^(1-d) / (1/4) = 2^(3-d)

(Figure: points p ∈ Q satisfy ||p - q|| ≤ (1/3)·||p - x1′||.)
17
Contribution from Z
  • Pr[x1 = p | x1′] ∝ Pr[x1′ | x1 = p] ∝ 1/r^d, where r = ||x1′ - p||
  • An increase in r means x1′ gets randomized over a larger region,
    proportional to r^d; hence the inverse dependence
  • Pr[x1 ∈ S | x1′] ∝ ∫_S 1/r^d ∝ the solid angle subtended by S at x1′
  • Z subtends a solid angle of at least half a sphere at x1′

18
Contribution from Q ∩ Z
  • The ellipsoid is roughly as far from x1′ as its longest radius
  • The contribution from the ellipsoid is ≤ 2^(-d) × the total solid angle
  • Therefore, Pr[x1 ∈ Q ∩ Z] / Pr[x1 ∈ Z] ≤ 2^(-d)

19
The general case: n sanitized points
  • Initial intuition is wrong
  • Privacy of x1 given x1′ and all the other points in the clear does not
    imply privacy of x1 given x1′ and the sanitizations of the others!
  • Sanitization is non-oblivious: other sanitized points reveal
    information about x if x is their nearest neighbor
  • A possible approach: decouple the two kinds of information (from x′
    and from the xi′)

20
The general case: n sanitized points
  • Initial intuition is wrong
  • Privacy of x1 given x1′ and all the other points in the clear does not
    imply privacy of x1 given x1′ and the sanitizations of the others!
  • Sanitization is non-oblivious: other sanitized points reveal
    information about x if x is their nearest neighbor
  • Where we are now
  • Consider some example of a safe sanitization (not necessarily using
    perturbations)
  • Density regions? Histograms?
  • Relate perturbations to the safe sanitization

21
Summary (1): Results on privacy

Distribution | Num. of points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a bounding box (under construction) | 2^Ω(d) | n/2 sanitized points | Distribution
Gaussian | 2^o(d) | n sanitized points | Distribution
22
Summary (2): Results on utility

Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing the largest diameter | - | The optimal diameter, as well as approximations, increases by at most a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability, as long as the means are pairwise sufficiently far apart
23
Future directions
  • Extend the privacy argument to other nice distributions
  • For what distributions is there no meaningful privacy-utility trade-off?
  • Can revealing the distribution hurt privacy?
  • Characterize the kind of auxiliary information that is acceptable
  • Depends on the distribution on the data points
  • Think of auxiliary information as an a priori distribution
  • Our proofs: full knowledge about some real points; no knowledge about others

24
Future directions
  • The low-dimensional case
  • Is it inherently impossible?
  • Dinur and Nissim show impossibility for the 1-dimensional case
  • Discrete-valued attributes
  • Real-world data is rarely real-valued
  • Our proofs require a spread in all attributes
  • Possible solution: convert binary values to probabilities (for example)
  • Can the adversary gain an advantage from rounding off the values?
  • Extend the utility argument to other interesting macroscopic
    properties, e.g., correlations

25
Questions?
26
Learning mixtures of Gaussians (spectral methods)
  • Observation: the top eigenvectors of a matrix span a low-dimensional
    space that yields a good approximation of complex data sets, in
    particular Gaussian data
  • Intuition
  • Sampled points are close to the means of the corresponding Gaussians
    in any subspace
  • The span of the top k singular vectors approximates the span of the means
  • Distances between the means of the Gaussians are preserved
  • Other distances shrink by a factor of √(k/n)
  • Our goal: show that the same algorithm works for clustering sanitized
    data (a minimal sketch follows)
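
As a rough illustration of the spectral step named above (a sketch, not the talk's algorithm): project the data onto the span of its top-k right singular vectors and cluster there. The toy check compares against the true means only as a sanity check; in the real setting the means are unknown and one would run, e.g., k-means on the projected points.

```python
import numpy as np

def project_to_top_k(points, k):
    """Project the (n, d) data matrix onto the span of its top-k right
    singular vectors; per the slide, this subspace approximately contains
    the span of the Gaussian means."""
    _, _, vt = np.linalg.svd(points, full_matrices=False)
    top = vt[:k]                   # (k, d) orthonormal rows
    return points @ top.T @ top    # projected points, still expressed in R^d

# Toy check with two well-separated Gaussians: after projection, simple
# distance-to-mean assignment recovers the planted clustering.
rng = np.random.default_rng(0)
d, n, sep = 200, 1000, 10.0
means = np.stack([np.zeros(d), np.r_[sep, np.zeros(d - 1)]])
labels = rng.integers(0, 2, size=n)
points = means[labels] + rng.standard_normal((n, d))
proj = project_to_top_k(points, k=2)
recovered = np.linalg.norm(proj[:, None, :] - means[None], axis=2).argmin(axis=1)
print((recovered == labels).mean())   # close to 1.0
```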

27
Spectral techniques for perturbed data
  • A sanitized point is the sum of two Gaussian variables: sample + noise
  • w.h.p. the 1-radius of a point is less than the radius of its Gaussian
  • The variance of the noise is small
  • Sanitized points are still close to their means (uses independence of
    direction)
  • The span of the top k singular vectors still approximates the span of
    the means of the Gaussians
  • Distances between means are preserved; others shrink

28
Current solutions
  • Statistical approaches
  • Alter the frequency (PRAN/DS/PERT) of particular
    features, while preserving means.
  • Additionally, erase values that reveal too much
  • Query-based approaches
  • Perturb output or disallow queries that breach
    privacy
  • Unsatisfying
  • Overly constrained definitions; ad hoc techniques
  • Ad hoc treatment of external sources of info
  • Erasure can disclose information; refusal to answer may be revelatory

29
Our Approach
  • Crypto-flavored definitions
  • Mathematical characterization of the adversary's goal
  • Precise definition of when a sanitization procedure fails
  • Intuition: seeing the sanitized DB gives the adversary an advantage
  • Statistical techniques
  • Perturbation of attribute values
  • Differs from previous work: perturbation amounts depend on local
    densities of points
  • Highly abstracted version of the problem
  • If we can't understand this, we can't understand real life.
  • If we get negative results here, the world is in trouble.

30
The Sanitizer
  • The privacy of x is linked to its T-radius
  • ⇒ Randomly perturb x in proportion to its T-radius
  • x′ = San(x) ∈_R B(x, T-radius(x))

(Figure: example with T = 1)
31
The general case: n points
  • Z = {p : p is a legal pre-image for x1′}
  • Q = {p : if x1 = p, then x1 is isolated by q}

Key observation: as ||q - x|| increases, Q becomes larger. But a larger
distance from x implies a smaller probability mass, because x′ is then
randomized over a larger region.
The probability depends only on the solid angle subtended at x1′.
32
The simplest interesting case
  • RDB = {x, y}; x, y ∈_R B(o, r), where o = origin
  • T = 1, c = 4; SDB = {x′, y′}
  • The adversary knows x′, y′, r, and d = ||x - y||
  • We show
  • There are m = 2^Ω(d) decoy pairs (xi, yi)
  • The (xi, yi) are legal pre-images of (x′, y′),
    that is, ||xi - yi|| = d and Pr[(xi, yi) | x′, y′] = Pr[(x, y) | x′, y′]
  • The adversary cannot know which of the (xi, yi) represents reality
  • The adversary can only isolate one point of {x1, y1, ..., xm, ym} at a time

33
The simplest interesting case
  • Consider a hyperplane H through x′, y′, and o
  • xH, yH: the mirror reflections of x, y through H
  • Note: reflections preserve distances!
  • The world of (xH, yH) looks identical to the world of (x, y)

Pr[(xH, yH) | x′, y′] = Pr[(x, y) | x′, y′]   (see the sketch below)
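
A small numerical sketch of this reflection argument (illustrative only: x_s and y_s stand for the sanitized points x′ and y′, and the way they are generated here is not the actual sanitizer). The reflection is a Householder transform through a hyperplane that contains x_s, y_s, and the origin, so every distance the adversary can observe is unchanged.

```python
import numpy as np

def reflect(points, normal):
    """Reflect each row of `points` through the hyperplane (through the
    origin) whose unit normal is `normal` (a Householder reflection)."""
    n = normal / np.linalg.norm(normal)
    return points - 2.0 * (points @ n)[:, None] * n

rng = np.random.default_rng(0)
d = 100
x, y = rng.standard_normal(d), rng.standard_normal(d)
x_s, y_s = x + 0.1 * rng.standard_normal(d), y + 0.1 * rng.standard_normal(d)

# Pick a normal orthogonal to both sanitized points, so the hyperplane
# contains x_s, y_s, and the origin.
basis, _ = np.linalg.qr(np.stack([x_s, y_s]).T, mode='complete')
normal = basis[:, 2]

xH, yH = reflect(np.stack([x, y]), normal)
# Reflections preserve all pairwise distances, and x_s, y_s lie on the
# hyperplane, so their distances to the pre-image are unchanged:
print(np.isclose(np.linalg.norm(x - y), np.linalg.norm(xH - yH)))      # True
print(np.isclose(np.linalg.norm(x - x_s), np.linalg.norm(xH - x_s)))   # True
```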
34
The simplest interesting case
  • Consider a hyperplane H through x′, y′, and o
  • xH, yH: the mirror reflections of x, y through H
  • Note: reflections preserve distances!
  • The world of (xH, yH) looks identical to the world of (x, y)
  • How many different H are there such that the corresponding xH are
    pairwise distant?

Sufficient to pick r ≥ (2/3)·d and θ = 30°.
Fact: There are 2^Ω(d) vectors in d dimensions at angle 60° from each
other. ⇒ The probability that the adversary wins is ≤ 2^(-Ω(d)).
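
The counting fact can be illustrated numerically (my own demo, with m and d chosen arbitrarily, showing the phenomenon rather than the full 2^Ω(d) count): independent random unit vectors in high dimension stay at pairwise angles well above 60°.

```python
import numpy as np

# In high dimension, many unit vectors can be pairwise far apart in angle.
# Random unit vectors already demonstrate the effect for moderately large m.
rng = np.random.default_rng(0)
d, m = 500, 1000
vectors = rng.standard_normal((m, d))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

cosines = vectors @ vectors.T
np.fill_diagonal(cosines, 0.0)          # ignore each vector paired with itself
max_cos = np.abs(cosines).max()
print(np.degrees(np.arccos(max_cos)))   # minimum pairwise angle, comfortably above 60
```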
35
The general case: n sanitized points
(L, R: a partition of the real points)
  • Claim 1 (Privacy for L)
  • Given all sanitizations, all points in R, and all but one point in L,
    the adversary cannot isolate the last point
  • Follows from the proof for n-1 real points
  • Claim 2 (Privacy for R)
  • Given all sanitizations, all points in L, and all but one point in R,
    the adversary cannot isolate the last point
  • Work in progress

Idea: Show that the adversary cannot distinguish whether R contains some
point x or not (an information-theoretic argument).