1
From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases
  • Shuchi Chawla, Cynthia Dwork,
    Frank McSherry, Adam Smith, Hoeteck Wee

2
Database Privacy
  • Census data: a prototypical example
  • Individuals provide information
  • Census bureau publishes sanitized records
  • Privacy is legally mandated; what utility can we
    achieve?
  • Our Goal:
  • What do we mean by preservation of privacy?
  • Characterize the trade-off between privacy and
    utility:
  • disguise individual identifying information
  • preserve macroscopic properties
  • Develop a good sanitizing procedure with
    theoretical guarantees

3
An outline of this talk
  • A mathematical formalism
  • What do we mean by privacy?
  • Prior work
  • An abstract model of datasets
  • Isolation; good sanitizations
  • A candidate sanitization
  • A brief overview of results
  • General argument for privacy of n-point datasets
  • Open issues and concluding remarks

4
Privacy: a philosophical viewpoint
  • Ruth Gavison: privacy includes "protection from being
    brought to the attention of others"
  • Matches intuition: inherently desirable
  • Attention invites further loss of privacy
  • Privacy is assured to the extent that one blends
    in with the crowd
  • Appealing definition: can be converted into a
    precise mathematical statement!

5
Database Privacy
  • Statistical approaches
  • Alter the frequency (PRAN/DS/PERT) of particular
    features, while preserving means.
  • Additionally, erase values that reveal too much
  • Query-based approaches
  • involve a permanent trusted third party
  • Query monitoring: disallow queries that breach
    privacy
  • Perturbation: add noise to the query
    output [Dinur-Nissim 03, Dwork-Nissim 04]
  • Statistical perturbation + adversarial analysis
  • Evfimievski et al. 03: combine statistical
    techniques with analysis similar to query-based
    approaches

6
Everybody's First Suggestion
  • Learn the distribution, then output
  • A description of the distribution, or,
  • Samples from the learned distribution
  • Want to reflect facts on the ground
  • Statistically insignificant facts can be
    important for allocating resources

7
Our Approach
  • Crypto-flavored definitions
  • Mathematical characterization of the Adversary's goal
  • Precise definition of when sanitization procedure
    fails
  • Intuition: seeing the sanitized DB gives the Adversary
    an advantage
  • Statistical Techniques
  • Perturbation of attribute values
  • Differs from previous work: perturbation amounts
    depend on local densities of points
  • Highly abstracted version of problem
  • If we can't understand this, we can't understand
    real life.
  • If we get negative results here, the world is in
    trouble.

8
A geometric view
  • Abstraction
  • Points in a high-dimensional metric space, say R^d,
    drawn i.i.d. from some distribution
  • Points are unlabeled: you are your collection of
    attributes
  • Distance is everything
  • Real Database (RDB): private
  • n unlabeled points in d-dimensional space.
  • Sanitized Database (SDB): public
  • n new points possibly in a different space.

9
The adversary or Isolator
  • Using SDB and auxiliary information (AUX),
    outputs a point q
  • q isolates a real point x if it is much closer
    to x than to x's neighbors.
  • Even if q looks similar to x, it may fail to
    isolate x if it looks just as similar to x's
    neighbors.
  • Tightly clustered points have a smaller radius of
    isolation

[Figure: RDB points with an isolating and a non-isolating query point]
10
The adversary or Isolator
  • I(SDB, AUX) = q
  • x is isolated if B(q, cδ) contains fewer than T
    points, where δ = ||q − x||
  • T-radius of x: distance to its T-th nearest
    neighbor
  • x is safe if δ_x > (T-radius of x)/(c − 1)
  • then B(q, cδ_x) contains x's entire T-neighborhood

c: privacy parameter, e.g., 4
Large T and small c are good
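
Below is a minimal sketch of this isolation test, assuming RDB is an (n, d) NumPy array whose rows include x; the helper names and defaults are ours, and counting points other than x itself is our reading of "contains fewer than T points".

```python
import numpy as np

def t_radius(x, rdb, T=1):
    """Distance from x to its T-th nearest neighbor in rdb (x is a row of rdb)."""
    dists = np.sort(np.linalg.norm(rdb - x, axis=1))
    return dists[T]                       # dists[0] == 0 is x itself

def isolates(q, x, rdb, c=4, T=1):
    """q isolates x if B(q, c*delta), with delta = ||q - x||, contains fewer
    than T database points besides x itself (x always lies in that ball,
    so it is not counted)."""
    delta = np.linalg.norm(q - x)
    dists = np.linalg.norm(rdb - q, axis=1)
    others_in_ball = int((dists < c * delta).sum()) - 1   # exclude x
    return others_in_ball < T

def is_safe(q, x, rdb, c=4, T=1):
    """x is safe if ||q - x|| > T-radius(x) / (c - 1): the ball B(q, c*delta)
    then contains x's entire T-neighborhood, so q cannot isolate x."""
    return np.linalg.norm(q - x) > t_radius(x, rdb, T) / (c - 1)
```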
11
A good sanitization
  • Sanitizing algorithm compromises privacy if the
    adversary is able to considerably increase his
    probability of isolating a point by looking at
    its output
  • A rigorous (and too ideal) definition
  • ∀ D ∀ I ∃ I′ such that, w.o.p. over RDB ∈_R D^n:
    ∀ aux z, ∀ x ∈ RDB,
  • Pr[I(SDB, z) isolates x] − Pr[I′(z) isolates x] ≤ ε/n
  • Definition of ε can be forgiving, say, 2^−Ω(d) or
    (1 in 1000)
  • Quantification over x: if aux reveals info about
    some x, the privacy of some other y should still
    be preserved
  • Provides a framework for describing the power of
    a sanitization method, and hence for comparisons

12
The Sanitizer
  • The privacy of x is linked to its T-radius
  • ⇒ Randomly perturb it in proportion to its
    T-radius
  • x′ = San(x) ∈_R S(x, T-rad(x))

[Figure: example perturbation, T = 1]
13
The Sanitizer
  • The privacy of x is linked to its T-radius
  • ⇒ Randomly perturb it in proportion to its
    T-radius
  • x′ = San(x) ∈_R S(x, T-rad(x))
  • Intuition
  • We are blending x in with its crowd
  • If the number of dimensions (d) is large, there
    are many pre-images for x′; the adversary
    cannot conclusively pick any one.
  • We are adding random noise with mean zero to x,
    so several macroscopic properties should be
    preserved.
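
Below is a minimal sketch of this sanitizer, assuming Euclidean space and reading S(x, T-rad(x)) as the surface of the sphere of radius T-rad(x) around x (a solid ball would be handled analogously); the function names and defaults are ours.

```python
import numpy as np

def t_radius(x, rdb, T=1):
    """Distance from x to its T-th nearest neighbor in rdb."""
    return np.sort(np.linalg.norm(rdb - x, axis=1))[T]

def sanitize(rdb, T=1, rng=None):
    """Perturb every point along a uniformly random direction, scaled to its
    T-radius, i.e. draw San(x) from the sphere S(x, T-rad(x)).
    Dense regions get small perturbations, sparse regions large ones."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = rdb.shape
    sdb = np.empty_like(rdb, dtype=float)
    for i, x in enumerate(rdb):
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)            # uniform random direction
        sdb[i] = x + t_radius(x, rdb, T) * u
    return sdb

# Example: 1000 points in R^20, T = 1
# rdb = np.random.default_rng(1).normal(size=(1000, 20))
# sdb = sanitize(rdb)
```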

14
Results on privacy: an overview
Distribution | Num. points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a cube | n | Exact histogram counts over subcells of sufficiently large size | Distribution
Uniform over a cube | n | Perturbation of n/2 points | Distribution, exact histogram counts of subcells
Adversary is computationally unbounded
15
Results on utility: an overview
Distributional/Worst-case | Objective | Assumptions | Result
Worst-case | Find k clusters minimizing largest diameter | – | Optimal diameter as well as approximations increase by at most a factor of 3
Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability as long as means are pairwise sufficiently far
16
A special case: one sanitized point
  • RDB = {x1, …, xn}
  • The adversary is given n − 1 real points x2, …, xn
    and one sanitized point x1′; T = 1, c = 4,
    flat prior
  • Recall x1′ ∈_R S(x1, ||x1 − y||),
  • where y is the nearest neighbor of x1
  • Main idea
  • Consider the posterior distribution on x1
  • Show that the adversary cannot isolate a large
    probability mass under this distribution

17
A special case: one sanitized point
  • Let Z = {p ∈ R^d : p is a legal pre-image for x1′}
  • Q = {p : if x1 = p, then x1 is isolated by q}
  • We show that Pr[Q ∩ Z | x1′] ≤ 2^−Ω(d) · Pr[Z | x1′]
  • Pr[x1 ∈ Q ∩ Z | x1′]
    = (prob. mass contribution from Q ∩ Z) / (contribution
    from Z)
    ≤ 2^(1−d) / (1/4)

[Figure: points x1, …, x5 and a candidate pre-image p, with ||p − q|| ≤ (1/3)·||p − x1||]
18
Contribution from Z
  • Pr[x1 = p | x1′] ∝ Pr[x1′ | x1 = p] ∝ 1/r^d
    (r = ||x1′ − p||; Bayes' rule with the flat prior)
  • Increase in r ⇒ x1′ gets randomized over a larger
    area, proportional to r^d; hence the inverse
    dependence.
  • Pr[x1 ∈ S | x1′] ∝ ∫_S 1/r^d ∝ solid angle
    subtended by S at x1′
  • Z subtends a solid angle equal to at least half a
    sphere at x1′

[Figure: region S and the solid angle it subtends at x1′, among points x1, …, x5]
19
Contribution from Q ∩ Z
  • The ellipsoid Q ∩ Z is roughly as far from x1′ as
    its longest radius
  • Contribution from the ellipsoid is ≤ 2^−d × total
    solid angle
  • Therefore, Pr[x1 ∈ Q ∩ Z | x1′] / Pr[x1 ∈ Z | x1′]
    ≤ 2^−Ω(d)

[Figure: the ellipsoid Q ∩ Z at distance ≈ r (its longest radius) from x1′, among points x1, …, x5]
20
The general case: n sanitized points
  • Initial intuition is wrong
  • Privacy of x1 given x1′ and the other points in the
    clear does not imply
  • privacy of x1 given x1′ and sanitizations of the
    others!
  • Problem: sanitization is non-oblivious
  • Other sanitized points reveal information about x,
    if x is their nearest neighbor
  • Solution: decouple the two kinds of information,
    from x and from the xi

21
The general case: n sanitized points
  • Perturbation of L (one half of the points) is a
    function of R (the other half)
  • What function of R would reveal no information
    about R?
  • Answer: coarse-grained histogram information!
  • Divide space into cells
  • Histogram count of cell C = number of points in
    R ∩ C
  • Perturbation radius of a point p
    ∝ 1 / (density of points in the cell containing p)

22
Histogram-based sanitization
  • Recursively divide space into cells until all
    cells have few points
  • Reveal the EXACT count of points in each cell
  • Contrast this to k-anonymity

[Figure: recursive subdivision into cells, T = 6]
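
Below is a minimal sketch of this recursive subdivision, assuming axis-aligned cells over the unit cube; the cell representation, per-cell threshold, and depth cap are our own choices. Only the (low, high, count) triples, i.e., the exact counts per cell, would be released.

```python
import numpy as np

def histogram_cells(points, low, high, threshold=6, depth=0, max_depth=8):
    """Recursively halve the cell [low, high) along every axis until each
    cell contains at most `threshold` points, then report the EXACT count
    per cell.  Returns a list of (low, high, count) triples."""
    inside = points[np.all((points >= low) & (points < high), axis=1)]
    if len(inside) <= threshold or depth == max_depth:
        return [(low, high, len(inside))]
    cells, mid, d = [], (low + high) / 2, len(low)
    for corner in range(2 ** d):                 # the 2^d subcells
        bits = np.array([(corner >> j) & 1 for j in range(d)])
        cells += histogram_cells(inside,
                                 np.where(bits == 0, low, mid),
                                 np.where(bits == 0, mid, high),
                                 threshold, depth + 1, max_depth)
    return cells

# Example in d = 2 (the 2^d fan-out makes this sketch impractical for large d):
# pts = np.random.default_rng(0).random((200, 2))
# counts = histogram_cells(pts, np.zeros(2), np.ones(2))
```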
23
Histogram-based sanitization
  • Adversary outputs (q, r): a guess q and a radius of
    isolation r
  • Adversary wins if the purple ball contains ≥ 1 point
    and the orange ball contains < T points
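
A small sketch of this winning condition, assuming (as in the isolation definition) that the orange ball has radius c times the purple one; the defaults c = 4 and T = 6 echo the examples on earlier slides.

```python
import numpy as np

def adversary_wins(q, r, rdb, c=4, T=6):
    """(q, r) wins if the purple ball B(q, r) contains at least one real
    point while the orange ball B(q, c*r) contains fewer than T of them."""
    dists = np.linalg.norm(rdb - q, axis=1)
    return (dists < r).sum() >= 1 and (dists < c * r).sum() < T
```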

24
Histogram-based sanitization
  • We show
  • If the purple ball is large,
  • then the orange ball contains the parent cell ⇒ at
    least T points
  • If the purple ball is small,
  • then the orange ball is exponentially larger than
    the purple ball
  • ⇒ either the purple ball has < 1 point or the
    orange ball has ≥ T points

Recall cells are d-dimensional
25
Results on privacy: an overview
Distribution | Num. points | Revealed to adversary | Auxiliary information
Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius
Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points
Uniform over a cube | n | Exact histogram counts over subcells of sufficiently large size | Distribution
Uniform over a cube | n | Perturbation of n/2 points | Distribution, exact histogram counts of subcells
26
Future directions
  • Extend the privacy argument to other nice
    distributions
  • For what distributions is there no meaningful
    privacy-utility trade-off?
  • Characterize acceptable auxiliary information
  • Think of auxiliary information as an a priori
    distribution
  • The low-dimensional case: is it inherently
    impossible?
  • Discrete-valued attributes
  • Our proofs require a spread in all attributes
  • Extend the utility argument to other interesting
    macroscopic properties, e.g., correlations

27
Conclusions
  • A first step towards understanding the
    privacy-utility trade-off
  • A general and rigorous definition of privacy
  • A work in progress!

28
Questions?