Transcript and Presenter's Notes

1
Pinning Down Privacy: Defining Privacy in Statistical Databases
  • Adam Smith
  • Weizmann Institute of Science
  • http://theory.csail.mit.edu/asmith

2
Database Privacy
[Diagram: respondents (Alice, Bob, ..., you) → collection and sanitization → users (government, researchers, marketers, ...)]
  • Census problem
  • Two conflicting goals
  • Utility: users can extract global statistics
  • Privacy: individual information stays hidden
  • How can these be formalized?

3
Database Privacy
  • Census problem
  • Why privacy?
  • Ethical and legal obligation
  • Honest answers require respondents' trust

4
Trust is important
5
Database Privacy
  • Trusted collection agency
  • Published statistics may be tables, graphs,
    microdata, etc
  • May have noise or other distortions
  • May be interactive

6
Database Privacy
  • Variations on this model are studied in:
  • Statistics
  • Data mining
  • Theoretical CS
  • Cryptography
  • Different traditions for what "privacy" means

7
How can we formalize privacy?
  • Different people mean different things
  • Pin it down mathematically?

8
  • "I ask them to take a poem and hold it up to the
    light like a color slide
  • or press an ear against its hive.
  • But all they want to do is tie the poem to a
    chair with rope and torture a confession out of
    it.
  • They begin beating it with a hose to find out
    what it really means."
  •   - Billy Collins, "Introduction to Poetry"
  • Can we approach privacy scientifically?
  • Pin down a social concept
  • No perfect definition?
  • But lots of room for rigor
  • Too late? (see Adi's talk)

9
How can we formalize privacy?
  • Different people mean different things
  • Pin it down mathematically?
  • Goal 1: Rigor
  • Prove clear theorems about privacy
  • Few exist in the literature
  • Make clear (and refutable) conjectures
  • Sleep better at night
  • Goal 2: Interesting science
  • (New) computational phenomenon
  • Algorithmic problems
  • Statistical problems

10
Overview
  • Examples
  • Intuitions for privacy
  • Why crypto defs don't apply
  • A Partial Selection of Definitions
  • Conclusions

(partial = incomplete and biased)
11
Basic Setting
[Diagram: DB → San (uses random coins) → answers to users (government, researchers, marketers, ...)]
  • Database DB = table of n rows, each in domain D
  • D can be numbers, categories, tax forms, etc.
  • This talk: D = {0,1}^d
  • E.g. "Married?", "Employed?", "Over 18?", ...

12
Examples of sanitization methods
  • Input perturbation
  • Change data before processing
  • E.g. randomized response:
  • flip each bit of the table with probability p
    (see the sketch after this list)
  • Summary statistics
  • Means, variances
  • Marginal totals (# of people with blue eyes and
    brown hair)
  • Regression coefficients
  • Output perturbation
  • Summary statistics with noise
  • Interactive versions of the above
  • Auditor decides which queries are OK, type of
    noise
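A minimal sketch of randomized response on the {0,1}^d table from the previous slide; the flip probability p and the example rows are illustrative, not from the talk:

```python
import random

def randomized_response(table, p):
    """Flip each bit of the table independently with probability p."""
    return [[1 - bit if random.random() < p else bit for bit in row]
            for row in table]

# Illustrative database: n rows, each a vector of d binary attributes
# (e.g. Married?, Employed?, Over 18?).
db = [[1, 0, 1], [0, 0, 1], [1, 1, 1]]
print(randomized_response(db, p=0.25))
```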

13
Two Intuitions for Privacy
  • "If the release of statistics S makes it possible
    to determine the value of private information
    more accurately than is possible without access
    to S, a disclosure has taken place." [Dalenius]
  • Learning more about me should be hard
  • "Privacy is protection from being brought to the
    attention of others." [Gavison]
  • Safety is blending into a crowd

Remove Gavison def?
14
Why not use crypto definitions?
  • Attempt 1
  • Defn: for every entry i, no information about x_i
    is leaked (as if encrypted)
  • Problem: no information at all is revealed!
  • Tradeoff: privacy vs. utility
  • Attempt 2
  • Agree on summary statistics f(DB) that are safe
  • Defn: no information about DB except f(DB)
  • Problem: how to decide that f is safe?
  • Tautology trap
  • (Also: how do you figure out what f is? -- Yosi)

15
Overview
  • Examples
  • Intuitions for privacy
  • Why crypto defs don't apply
  • A Partial Selection of Definitions
  • Two straw men
  • Blending into the Crowd
  • An impossibility result
  • Attribute Disclosure and Differential Privacy
  • Conclusions
  • Criteria:
  • Understandable
  • Clear adversary's goals, prior knowledge /
    side information
  • I am a co-author...

(partial = incomplete and biased)
16
Straw man 1: Exact Disclosure
[Diagram: adversary A sends queries 1, ..., T to San, which holds DB = (x_1, ..., x_n) and uses random coins, and returns answers 1, ..., T]
  • Defn: safe if the adversary cannot learn any entry
    exactly
  • Leads to nice (but hard) combinatorial problems
  • Does not preclude learning a value with 99%
    certainty or narrowing it down to a small interval
  • Historically:
  • Focus: auditing interactive queries
  • Difficulty: understanding relationships between
    queries
  • E.g. two queries with a small difference (sketch below)
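The auditing difficulty can be made concrete: two individually aggregate-looking sum queries whose row sets differ by one entry reveal that entry exactly. A toy sketch (the data is illustrative):

```python
# Hypothetical salary column; neither query alone reveals an individual value.
salaries = [50, 60, 70, 80]

q1 = sum(salaries[0:4])   # "sum of rows 1..4"
q2 = sum(salaries[0:3])   # "sum of rows 1..3"

# The difference of the two answers exposes row 4 exactly,
# even though each query looks like a harmless aggregate.
print(q1 - q2)  # 80 = salaries[3]
```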

17
Straw man 2: Learning the distribution
  • Assume x_1, ..., x_n are drawn i.i.d. from an unknown
    distribution
  • Defn: San is safe if it only reveals the
    distribution
  • Implied approach (sketch below):
  • learn the distribution
  • release a description of the distribution
  • or re-sample points from the distribution
  • Problem: tautology trap
  • the estimate of the distribution depends on the data;
    why is it safe?
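A sketch of the implied approach, assuming independent binary attributes (an illustrative simplification); the point of the tautology trap is that the released estimate `freqs` is itself computed from the sensitive data:

```python
import random

def release_resampled(db, m):
    """Estimate per-attribute frequencies from db, then publish m fresh rows
    sampled from the estimated (product) distribution."""
    n, d = len(db), len(db[0])
    freqs = [sum(row[j] for row in db) / n for j in range(d)]  # estimate of the distribution
    return [[1 if random.random() < freqs[j] else 0 for j in range(d)]
            for _ in range(m)]

# The estimate `freqs` depends on the data -- why is releasing it safe?
synthetic = release_resampled([[1, 0, 1], [0, 0, 1], [1, 1, 1]], m=5)
print(synthetic)
```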

18
Blending into a Crowd
  • Intuition: I am safe in a group of k or more
  • k varies (3? 6? 100? 10,000?)
  • Many variations on the theme
  • Adv. wants a predicate g such that
    0 < #{i : g(x_i) = true} < k
  • g is called a breach of privacy (see the sketch below)
  • Why?
  • Fundamental:
  • R. Gavison: "protection from being brought to the
    attention of others"
  • A rare property helps me re-identify someone
  • Implicit: information about a large group is
    public
  • e.g. liver problems more prevalent among diabetics
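A direct reading of the breach condition, in the frequency-in-DB variant (function and data names are illustrative):

```python
def is_breach(db, g, k):
    """g is a breach if it isolates a nonempty group of fewer than k rows:
    0 < #{i : g(x_i) = True} < k."""
    count = sum(1 for x in db if g(x))
    return 0 < count < k

# Example: attribute 0 = "has liver problems"; a rare property isolates someone.
db = [[1, 0], [0, 1], [0, 0], [0, 1]]
print(is_breach(db, g=lambda x: x[0] == 1, k=3))  # True: only one matching row
```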

19
Blending into a Crowd
  • Intuition: I am safe in a group of k or more
  • k varies (3? 6? 100? 10,000?)
  • Many variations on the theme
  • Adv. wants a predicate g such that
    0 < #{i : g(x_i) = true} < k
  • g is called a breach of privacy
  • Why?
  • Fundamental:
  • R. Gavison: "protection from being brought to the
    attention of others"
  • A rare property helps me re-identify someone
  • Implicit: information about a large group is
    public
  • e.g. liver problems more prevalent among diabetics
  • Two variants:
  • frequency in DB
  • frequency in the underlying population
  • How can we capture this?
  • Syntactic definitions
  • Bayesian adversary
  • Crypto-flavored definitions

20
Syntactic Definitions
  • Given sanitization S, look at the set of all
    databases consistent with S
  • Defn: safe if no predicate is a breach for all
    consistent databases
  • k-anonymity [L. Sweeney]
  • Sanitization is a histogram of the data
  • Partition D into bins B_1, B_2, ..., B_t
  • Output cardinalities f_j = |DB ∩ B_j|
  • Safe if for all j, either f_j ≥ k or f_j = 0
    (see the sketch after the tables below)
  • Cell bound methods [statistics, 1990s]
  • Sanitization consists of marginal sums
  • Let f_z = #{i : x_i = z}. Then San(DB) = various
    sums of the f_z
  • Safe if for all z, either ∃ a consistent DB with f_z ≥ k
    or, in ∀ consistent DBs, f_z = 0
  • Large literature using algebraic and
    combinatorial techniques

            brown   blue   total
  blond       2      10      12
  brown      12       6      18
  total      14      16

            brown    blue    total
  blond    [0,12]   [0,12]     12
  brown    [0,14]   [0,16]     18
  total      14       16
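The k-anonymity condition on a histogram release can be checked mechanically; the binning function below is an illustrative stand-in for the partition B_1, ..., B_t:

```python
from collections import Counter

def histogram_release(db, bin_of):
    """Map each row to its bin and output the bin cardinalities f_j."""
    return Counter(bin_of(row) for row in db)

def is_k_anonymous(counts, k):
    """Safe if every released cardinality is either 0 (absent) or at least k."""
    return all(c >= k for c in counts.values())

db = [(25, 'blue'), (27, 'blue'), (26, 'blue'), (41, 'brown')]
counts = histogram_release(db, bin_of=lambda r: (r[0] // 10, r[1]))  # age decade x eye color
print(counts, is_k_anonymous(counts, k=3))  # the (40s, brown) bin has count 1 -> not safe
```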
21
Syntactic Definitions
  • Given sanitization S, look at the set of all
    databases consistent with S
  • Defn: safe if no predicate is a breach for all
    consistent databases
  • k-anonymity [L. Sweeney]
  • Sanitization is a histogram of the data
  • Partition D into bins B_1, B_2, ..., B_t
  • Output cardinalities f_j = |DB ∩ B_j|
  • Safe if for all j, either f_j ≥ k or f_j = 0
  • Cell bound methods [statistics, 1990s]
  • Sanitization consists of marginal sums
  • If f_z = #{i : x_i = z}, then output various sums
    of the f_z
  • Safe if for all z, either ∃ a consistent DB with f_z ≥ k
    or, in ∀ consistent DBs, f_z = 0
  • Large literature using algebraic and
    combinatorial techniques

            brown    blue    total
  blond    [0,12]   [0,12]     12
  brown    [0,14]   [0,16]     18
  total      14       16
  • Issues
  • If k is small: "all three Canadians at Weizmann
    sing in a choir."
  • Semantics?
  • Probability is not considered
  • What if I have side information?
  • The algorithm for making decisions is not considered
  • What adversary does this apply to?

22
Security for Bayesian adversaries
  • Goal
  • Adversary outputs a point z ∈ D
  • Score = 1/f_z if f_z > 0; 0 otherwise
  • Defn: sanitization is safe if E(score) ≤ ε
  • Procedure (sketch after this list)
  • Assume you know the adversary's prior distribution
    over databases
  • Given a candidate output (e.g. a set of marginal
    sums):
  • Update the prior conditioned on the output (via Bayes'
    rule)
  • If max_z E( score | output ) < ε then release
  • Else consider a new set of marginal sums
  • Extensive literature on computing the expected value
    (see Yosi's talk)
  • Issues
  • Restricts the type of predicates the adversary can
    choose
  • Must know the prior distribution
  • Can one scheme work for many distributions?
  • Sanitizer works harder than the adversary
  • Conditional probabilities don't consider previous
    iterations
  • Simulatability [KMN05]
  • Can this be fixed (with efficient computations)?
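A toy version of this release procedure, assuming (as the slide does) that the adversary's prior over databases is known; `san` stands in for the candidate summary (e.g. marginal sums), and all names are illustrative:

```python
def posterior(prior, candidate_output, san):
    """Bayes' rule: condition the prior over databases on seeing this output."""
    weights = {db: p for db, p in prior.items() if san(db) == candidate_output}
    total = sum(weights.values())
    return {db: p / total for db, p in weights.items()}

def expected_score(post, z):
    """E[score] for guess z, where score = 1/f_z if f_z > 0 and 0 otherwise."""
    def f_z(db):
        return sum(1 for x in db if x == z)
    return sum(p / f_z(db) for db, p in post.items() if f_z(db) > 0)

def safe_to_release(prior, candidate_output, san, domain, eps):
    """Release only if no guess z gives the adversary expected score >= eps."""
    post = posterior(prior, candidate_output, san)
    return max(expected_score(post, z) for z in domain) < eps
```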

23
Crypto-flavored Approach [CDMSW, CDMT, NS]
"If the release of statistics S makes it possible
to determine the value of private information
more accurately than is possible without access
to S, a disclosure has taken place." [Dalenius]
24
Crypto-flavored Approach [CDMSW, CDMT, NS]
  • [CDMSW] Compare to a simulator: ∀ distributions
    on databases DB, ∀ adversaries A, ∃ A' such that,
    ∀ subsets J ⊆ DB,
    Pr_{DB,S}[ A(S) finds a breach in J ] ≤
    Pr_{DB}[ A'() finds a breach in J ] + Δ
  • The definition says nothing if the adversary knows x_1
  • Require that it hold for all subsets of DB
  • No non-trivial examples satisfy this
    definition
  • Restrict the family of distributions to some class C
    of distributions
  • Try to make C as large as possible
  • Sufficient: i.i.d. from a "smooth" distribution

25
Crypto-flavored Approach [CDMSW, CDMT, NS]
  • [CDMSW] Compare to a simulator: ∀ distributions
    on databases DB ∈ C, ∀ adversaries A, ∃ A' such
    that, ∀ subsets J ⊆ DB,
    Pr_{DB,S}[ A(S) finds a breach in J ] ≤
    Pr_{DB}[ A'() finds a breach in J ] + Δ
  • [CDMSW, CDMT] Geometric data
  • Assume x_i ∈ R^d
  • Relax the definition
  • Ball predicates: g_{z,r} = {x : ||x - z|| ≤ r},
    g_{z,Cr} = {x : ||x - z|| ≤ C·r}
  • Breach if |DB ∩ g_{z,r}| > 0 and |DB ∩ g_{z,Cr}| < k
    (see the sketch below)
  • Several types of histograms can be released
  • Sufficient for metric problems: clustering,
    min. spanning tree, ...
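The relaxed geometric breach condition, checked directly on a small point set (the parameters and points are illustrative):

```python
import math

def ball_count(db, z, r):
    """Number of database points within distance r of center z."""
    return sum(1 for x in db if math.dist(x, z) <= r)

def is_geometric_breach(db, z, r, C, k):
    """Breach if the small ball around z is nonempty but even the C-times-larger
    ball around the same center contains fewer than k points."""
    return ball_count(db, z, r) > 0 and ball_count(db, z, C * r) < k

db = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
print(is_geometric_breach(db, z=(5.0, 5.0), r=0.5, C=2, k=3))  # isolated point -> True
```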

26
Crypto-flavored Approach [CDMSW, CDMT, NS]
  • [CDMSW] Compare to a simulator: ∀ distributions
    on databases DB ∈ C, ∀ adversaries A, ∃ A' such
    that, ∀ subsets J ⊆ DB,
    Pr_{DB,S}[ A(S) finds a breach in J ] ≤
    Pr_{DB}[ A'() finds a breach in J ] + Δ
  • [NS] No geometric restrictions
  • A lot of noise
  • Almost erase the data!
  • Strong privacy statement
  • Very weak utility
  • [CDMSW, CDMT, NS]: proven statements!
  • Issues
  • Works for a large class of prior distributions
    and side information
  • But not for all
  • Not clear if it helps with ordinary statistical
    calculations
  • Interesting utility requires geometric
    restrictions
  • Too messy?

27
Blending into a Crowd
  • Intuition: I am safe in a group of k or more
  • Pros
  • appealing intuition for privacy
  • seems fundamental
  • mathematically interesting
  • meaningful statements are possible!
  • Cons
  • does it rule out learning facts about a particular
    individual?
  • all results seem to make strong assumptions on the
    adversary's prior distribution
  • is this necessary? (yes)

28
Overview
  • Examples
  • Intuitions for privacy
  • Why crypto defs don't apply
  • A Partial Selection of Definitions
  • Two Straw men
  • Blending into the Crowd
  • An impossibility result
  • Attribute Disclosure and Differential Privacy
  • Conclusions

29
An impossibility result
  • An abstract schema
  • Define a privacy breach
  • ∀ distributions on databases, ∀ adversaries A, ∃
    A' such that Pr( A(San) finds a breach ) ≤ Pr( A'()
    finds a breach ) + Δ
  • Theorem [Dwork-Naor]
  • For any "reasonable" notion of breach, if San(DB) contains
    information about DB then some adversary breaks
    this definition
  • Example
  • Adv. knows Alice is 2 inches shorter than the average
    Lithuanian
  • but how tall are Lithuanians?
  • With the sanitized database, the probability of guessing
    her height goes up
  • Theorem: this is unavoidable

30
Proof sketch
  • Suppose
  • If DB is uniform, then the mutual information I( DB ; San(DB) )
    > 0
  • a breach is predicting a predicate g(DB)
  • Pick a hash function h : databases → {0,1}^{H(DB | San)}
  • The prior distribution is uniform conditioned on h(DB) = z
  • Then
  • h(DB) = z gives no info on g(DB)
  • San(DB) and h(DB) = z together determine DB
  • [DN] vastly generalize this

31
Preventing Attribute Disclosure
  • A large class of definitions
  • safe if the adversary can't learn "too much" about
    any entry
  • E.g.:
  • Cannot narrow X_i down to a small interval
  • For uniform X_i, mutual information I( X_i ; San(DB) )
    ≤ ε
  • How can we decide among these definitions?

32
Differential Privacy
  • The Lithuanians example
  • Adv. learns the height even if Alice is not in DB
  • Intuition [DM]:
  • Whatever is learned would be learned regardless
    of whether or not Alice participates
  • Dual: whatever is already known, the situation won't
    get worse

33
Differential Privacy
[Interactive-query diagram as on slide 16, but with one database entry replaced by 0]
  • Define n+1 games
  • Game 0: Adv. interacts with San(DB)
  • For each i, let DB_{-i} = (x_1, ..., x_{i-1}, 0, x_{i+1}, ..., x_n)
  • Game i: Adv. interacts with San(DB_{-i})
  • Bayesian adversary:
  • Given S and a prior distribution p() on DB, define n+1
    posterior distributions p_0(·|S), ..., p_n(·|S)

34
Differential Privacy
  • Definition: San is safe if ∀ prior distributions
    p() on DB, ∀ transcripts S, ∀ i ∈
    {1, ..., n}: StatDiff( p_0(·|S) , p_i(·|S) ) ≤ ε
    (see the sketch after this list)
  • Note that the prior distribution may be far from
    both
  • How can we satisfy this?
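In a toy discrete setting, the quantity in the definition can be computed exactly; a sketch, with `likelihood(db, S)` standing in for the probability that San(db) produces transcript S (all names are illustrative):

```python
def zero_out(db, i):
    """DB_{-i}: the database with entry i replaced by 0, as in Game i."""
    return db[:i] + (0,) + db[i + 1:]

def posterior(prior, likelihood, transcript, i=None):
    """Posterior over databases given transcript S, assuming S came from
    San(DB) (Game 0, i=None) or from San(DB_{-i}) (Game i)."""
    def lk(db):
        return likelihood(db if i is None else zero_out(db, i), transcript)
    joint = {db: p * lk(db) for db, p in prior.items()}
    total = sum(joint.values())
    return {db: w / total for db, w in joint.items()}

def stat_diff(p, q):
    """Statistical (total-variation) distance between two discrete distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Safety under this definition: for every prior, transcript S, and index i,
# stat_diff(posterior(prior, lik, S), posterior(prior, lik, S, i)) <= eps.
```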

35
Approach: Indistinguishability [DiNi, EGS, BDMN]
[Diagram: San is run on DB and on DB', which differ in 1 row; the two transcript distributions are at distance ≤ ε. The choice of distance measure is important.]
36
Approach: Indistinguishability [DiNi, EGS, BDMN]
Choice of distance measure is important
37
Approach: Indistinguishability [DiNi, EGS, BDMN]
  • Problem: ε must be large
  • By a hybrid argument, any two databases induce
    transcripts at distance ≤ nε
  • To get utility, need ε > 1/n
  • Statistical difference 1/n is not meaningful
  • Example: release a random point in the database
    (sketch below)
  • San(x_1, ..., x_n) = ( j, x_j ) for random j
  • For every i, changing x_i induces statistical
    difference 1/n
  • But some x_i is revealed with probability 1
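The counterexample can be run directly: each individual row shifts the output distribution by only 1/n in statistical distance, yet every run publishes some row verbatim (the data is illustrative):

```python
import random

def san_random_point(db):
    """Release one random row together with its index: San(x_1,...,x_n) = (j, x_j)."""
    j = random.randrange(len(db))
    return j, db[j]

# Changing any single x_i only affects the output when index i is drawn,
# i.e. with probability 1/n, so neighboring databases are at statistical
# distance 1/n. Yet some x_i is revealed exactly on every run.
db = [(1, 0), (0, 1), (1, 1)]
print(san_random_point(db))
```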

38
Formalizing Indistinguishability
[Diagram: adversary A interacts with San run on either DB or DB', receiving transcript S and answers 1, ..., T]
  • Definition: San is ε-indistinguishable if
  • ∀ A, ∀ DB, DB' which differ in 1 row, ∀ sets
    of transcripts E:
    p( San(DB) ∈ E ) ≤ e^ε · p( San(DB') ∈ E ),
    where e^ε ≈ (1 + ε) for small ε
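One standard way to meet this multiplicative condition for a counting query is to add Laplace noise of scale 1/ε: a neighboring database changes the true count by at most 1, and a Laplace(1/ε) density changes by at most a factor e^ε under a shift of 1. A sketch (the mechanism is standard, not taken from the talk):

```python
import math
import random

def laplace_noise(scale):
    """Sample from the Laplace(0, scale) distribution by inverse-CDF."""
    u = random.random() - 0.5          # uniform on (-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_count(db, predicate, eps):
    """Count of rows satisfying `predicate`, plus Laplace(1/eps) noise.
    Changing one row shifts the true count by at most 1, so
    Pr[ San(DB) in E ] <= e^eps * Pr[ San(DB') in E ] for neighbors DB, DB'."""
    return sum(1 for x in db if predicate(x)) + laplace_noise(1.0 / eps)

print(noisy_count([(1, 0), (0, 1), (1, 1)], predicate=lambda x: x[0] == 1, eps=0.5))
```
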
39
Indistinguishability ⇒ Differential Privacy
  • Definition: San is safe if ∀ prior distributions
    p() on DB, ∀ transcripts S, ∀ i ∈
    {1, ..., n}: StatDiff( p_0(·|S) , p_i(·|S) ) ≤ ε
  • We can use indistinguishability:
  • For every S and DB, the probabilities of producing S from
    DB and from DB_{-i} are within a factor e^ε of each other,
    so Bayes' rule bounds how far apart the posteriors can be
  • This implies StatDiff( p_0(·|S) , p_i(·|S) ) ≤ ε

40
Why does this help?
  • With relatively little noise (sketch below):
  • Averages
  • Histograms
  • Matrix decompositions
  • Certain types of clustering
  • See Kobbi's talk
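For example, an average of values in [0,1] moves by at most 1/n when one row changes, so Laplace noise of scale 1/(nε) suffices and the error shrinks as n grows. A sketch with illustrative data:

```python
import math
import random

def laplace_noise(scale):
    """Sample from the Laplace(0, scale) distribution by inverse-CDF."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1 - 2 * abs(u))

def noisy_average(values, eps):
    """Average of values in [0,1]; one row changes the true average by at most
    1/n, so Laplace(1/(n*eps)) noise suffices and vanishes as n grows."""
    n = len(values)
    return sum(values) / n + laplace_noise(1.0 / (n * eps))

print(noisy_average([0.2, 0.8, 0.5, 0.9], eps=0.5))
```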

41
Preventing Attribute Disclosure
  • Various ways to capture "no particular value
    should be revealed"
  • Differential criterion:
  • "Whatever is learned would be learned regardless
    of whether or not person i participates"
  • Satisfied by indistinguishability
  • Also implies protection from re-identification?
  • Two interpretations:
  • A given release won't make privacy worse
  • A rational respondent will answer if there is some
    gain
  • Can we preserve enough utility?

42
Overview
  • Examples
  • Intuitions for privacy
  • Why crypto defs don't apply
  • A Partial Selection of Definitions
  • Two Straw men
  • Blending into the Crowd
  • An impossibility result
  • Attribute Disclosure and Differential Privacy

(partial = incomplete and biased)
43
Things I Didn't Talk About
  • Economic perspective [KPR]
  • Utility of providing data: value vs. cost
  • May depend on whether others participate
  • When is it worth my while?
  • Specific methods for re-identification
  • Various other frameworks (e.g. L-diversity)
  • Other pieces of the big "data privacy" picture
  • Access control
  • Implementing the trusted collection center

44
Conclusions
  • Pinning down a social notion in a particular context
  • A biased survey of approaches to definitions
  • A taste of techniques along the way
  • Didn't talk about utility
  • The question has a different flavor from
  • the usual crypto problems
  • statisticians' traditional conception
  • Meaningful statements are possible!
  • Practical?
  • Do they cover everything? No

45
Conclusions
  • How close are we to converging?
  • e.g. s.f.e., encryption, Turing machines, ...
  • But we're after a social concept?
  • Silver bullet?
  • What are the big challenges?
  • Need cryptanalysis of these systems (Adi?)