Title: Pinning Down Privacy
1. Pinning Down Privacy: Defining Privacy in Statistical Databases
- Adam Smith
- Weizmann Institute of Science
- http://theory.csail.mit.edu/asmith
2. Database Privacy
[Figure: respondents (Alice, Bob, ..., You) send data to a collection-and-sanitization stage, whose output goes to users (government, researchers, marketers, ...)]
- Census problem
- Two conflicting goals
  - Utility: users can extract global statistics
  - Privacy: individual information stays hidden
- How can these be formalized?
3. Database Privacy
- Census problem
- Why privacy?
  - Ethical and legal obligation
  - Honest answers require respondents' trust
4. Trust is important
5. Database Privacy
- Trusted collection agency
- Published statistics may be tables, graphs, microdata, etc.
  - May have noise or other distortions
  - May be interactive
6. Database Privacy
- Variations on this model studied in
  - Statistics
  - Data mining
  - Theoretical CS
  - Cryptography
- Different traditions for what "privacy" means
7. How can we formalize privacy?
- Different people mean different things
- Pin it down mathematically?
8.
- "I ask them to take a poem / and hold it up to the light / like a color slide / or press an ear against its hive."
- "But all they want to do / is tie the poem to a chair with rope / and torture a confession out of it. / They begin beating it with a hose / to find out what it really means."
  - Billy Collins, "Introduction to Poetry"
- Can we approach privacy scientifically?
  - Pin down a social concept
  - No perfect definition?
  - But lots of room for rigor
  - Too late? (see Adi's talk)
9. How can we formalize privacy?
- Different people mean different things
- Pin it down mathematically?
- Goal 1: Rigor
  - Prove clear theorems about privacy
  - Few exist in the literature
  - Make clear (and refutable) conjectures
  - Sleep better at night
- Goal 2: Interesting science
  - (New) computational phenomena
  - Algorithmic problems
  - Statistical problems
10. Overview
- Examples
- Intuitions for privacy
- Why crypto definitions don't apply
- A Partial Selection of Definitions
- Conclusions
("partial" = incomplete and biased)
11. Basic Setting
[Figure: database DB feeds a sanitizer San (using random coins); users (government, researchers, marketers, ...) see only San's output]
- Database DB = table of n rows, each in domain D
- D can be numbers, categories, tax forms, etc.
- This talk: D = {0,1}^d
  - E.g. "Married?", "Employed?", "Over 18?", ...
12. Examples of sanitization methods
- Input perturbation
  - Change data before processing
  - E.g. randomized response: flip each bit of the table with probability p
- Summary statistics
  - Means, variances
  - Marginal totals (# people with blue eyes and brown hair)
  - Regression coefficients
- Output perturbation
  - Summary statistics with noise
- Interactive versions of the above
  - Auditor decides which queries are OK, type of noise
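The randomized-response step can be sketched in a few lines. A minimal illustration (function names are mine, not from the talk), including the standard trick of inverting the flips to recover an unbiased population estimate:

```python
import random

def randomized_response(row, p):
    """Flip each bit of a row independently with probability p."""
    return [bit ^ (random.random() < p) for bit in row]

def estimate_frequency(noisy_bits, p):
    """Unbiased estimate of the true fraction of 1s, inverting the flips
    (requires p != 1/2)."""
    observed = sum(noisy_bits) / len(noisy_bits)
    return (observed - p) / (1 - 2 * p)
```

Each respondent's reported bit is wrong with probability p, yet aggregate frequencies remain recoverable; this is the utility/privacy tradeoff in its simplest form.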
13. Two Intuitions for Privacy
- "If the release of statistics S makes it possible to determine the value of private information more accurately than is possible without access to S, a disclosure has taken place." [Dalenius]
  - Learning more about me should be hard
- "Privacy is protection from being brought to the attention of others." [Gavison]
  - Safety is blending into a crowd
14. Why not use crypto definitions?
- Attempt 1
  - Defn: for every entry i, no information about xi is leaked (as if encrypted)
  - Problem: no information at all is revealed!
  - Tradeoff: privacy vs. utility
- Attempt 2
  - Agree on summary statistics f(DB) that are "safe"
  - Defn: no information about DB except f(DB)
  - Problem: how to decide that f is safe?
    - Tautology trap
    - (Also: how do you figure out what f is? --Yosi)
15. Overview
- Examples
- Intuitions for privacy
- Why crypto definitions don't apply
- A Partial Selection of Definitions
  - Two straw men
  - Blending into the Crowd
  - An impossibility result
  - Attribute Disclosure and Differential Privacy
- Conclusions
- Criteria
  - Understandable
  - Clear adversary's goals / prior knowledge / side information
  - "I am a co-author..."
("partial" = incomplete and biased)
16. Straw man 1: Exact Disclosure
[Figure: adversary A sends queries 1..T to San, which answers using DB = (x1, ..., xn) and random coins]
- Defn: safe if the adversary cannot learn any entry exactly
  - Leads to nice (but hard) combinatorial problems
  - Does not preclude learning a value with 99% certainty, or narrowing it down to a small interval
- Historically
  - Focus: auditing interactive queries
  - Difficulty: understanding relationships between queries
  - E.g. two queries with small difference
17. Straw man 2: Learning the distribution
- Assume x1, ..., xn are drawn i.i.d. from an unknown distribution
- Defn: San is safe if it only reveals the distribution
- Implied approach
  - Learn the distribution
  - Release a description of the distribution, or re-sample points from it
- Problem: tautology trap
  - The estimate of the distribution depends on the data; why is it safe?
18. Blending into a Crowd
- Intuition: I am safe in a group of k or more
  - k varies (3? 6? 100? 10,000?)
- Many variations on the theme
  - Adversary wants a predicate g such that 0 < |{i : g(xi) = true}| < k
  - Such a g is called a breach of privacy
- Why?
  - Fundamental
    - R. Gavison: "protection from being brought to the attention of others"
    - A rare property helps me re-identify someone
  - Implicit: information about a large group is public
    - E.g. liver problems are more prevalent among diabetics
19. Blending into a Crowd (cont.)
- Two variants
  - Frequency in DB
  - Frequency in the underlying population
- How can we capture this?
  - Syntactic definitions
  - Bayesian adversary
  - Crypto-flavored definitions
20. Syntactic Definitions
- Given sanitization S, look at the set of all databases consistent with S
- Defn: safe if no predicate is a breach for all consistent databases
- k-anonymity [L. Sweeney]
  - Sanitization is a histogram of the data
  - Partition D into bins B1, B2, ..., Bt
  - Output cardinalities fj = |DB ∩ Bj|
  - Safe if for all j, either fj ≥ k or fj = 0
- Cell bound methods [statistics, 1990s]
  - Sanitization consists of marginal sums
  - Let fz = |{i : xi = z}|; then San(DB) = various sums of the fz
  - Safe if for all z, either ∃ a consistent DB with fz ≥ k, or ∀ consistent DBs, fz = 0
  - Large literature using algebraic and combinatorial techniques
- Example (hair color vs. eye color); true table:

          brown   blue   total
  blond     2      10     12
  brown    12       6     18
  total    14      16     30

- Cell bounds consistent with the released marginals:

          brown    blue    total
  blond   [0,12]   [0,12]   12
  brown   [0,14]   [0,16]   18
  total     14       16     30
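The k-anonymity condition on a histogram release can be checked mechanically. A small sketch (the helper name is mine; bins are taken to be the distinct rows themselves):

```python
from collections import Counter

def k_anonymous(rows, k):
    """Safe in the k-anonymity sense if every bin count fj is either 0
    (the bin simply does not appear) or at least k."""
    counts = Counter(tuple(row) for row in rows)
    return all(fj >= k for fj in counts.values())
```

Note the check only constrains the released counts; as the "Issues" below point out, it says nothing about probability or side information.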
21. Syntactic Definitions (cont.)
- Issues
  - If k is small: "all three Canadians at Weizmann sing in a choir"
  - Semantics?
    - Probability is not considered
    - What if I have side information?
    - The algorithm for making decisions is not considered
    - What adversary does this apply to?
22. Security for Bayesian adversaries
- Goal
  - Adversary outputs a point z ∈ D
  - Score = 1/fz if fz > 0; 0 otherwise
  - Defn: sanitization is safe if E(score) ≤ ε
- Procedure
  - Assume you know the adversary's prior distribution over databases
  - Given a candidate output (e.g. a set of marginal sums):
    - Update the prior conditioned on the output (via Bayes' rule)
    - If max_z E( score | output ) < ε, then release; else consider a new set of marginal sums
  - Extensive literature on computing the expected value (see Yosi's talk)
- Issues
  - Restricts the type of predicates the adversary can choose
  - Must know the prior distribution
    - Can one scheme work for many distributions?
  - The sanitizer works harder than the adversary
  - Conditional probabilities don't consider previous iterations
    - Simulatability [KMN05]
  - Can this be fixed (with efficient computations)?
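The release test can be sketched directly. A toy version, assuming the posterior is given explicitly as a list of (probability, database) pairs (names are mine):

```python
def expected_score(posterior, z):
    """E[score] for a target point z, where score = 1/fz if fz > 0 and
    0 otherwise, and fz counts rows of the database equal to z."""
    total = 0.0
    for prob, db in posterior:
        fz = sum(1 for x in db if x == z)
        if fz > 0:
            total += prob / fz
    return total

def safe_to_release(posterior, domain, eps):
    """Release only if no target z gives the adversary expected score > eps."""
    return max(expected_score(posterior, z) for z in domain) <= eps
```

In practice the posterior is implicit (e.g. all tables consistent with a set of marginals), which is exactly why computing this expectation has its own literature.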
23. Crypto-flavored Approach [CDMSW, CDMT, NS]
"If the release of statistics S makes it possible to determine the value of private information more accurately than is possible without access to S, a disclosure has taken place." [Dalenius]
24. Crypto-flavored Approach [CDMSW, CDMT, NS]
- [CDMSW] Compare to a simulator: ∀ distributions on databases DB, ∀ adversaries A, ∃ A' such that ∀ subsets J ⊆ DB:
    Pr_{DB,S}[ A(S) breaches privacy in J ] − Pr_{DB}[ A'() breaches privacy in J ] ≤ ε
- The definition says nothing if the adversary knows x1
  - Hence the requirement that it hold for all subsets of DB
- No non-trivial examples satisfy this definition
  - Restrict the family of distributions to some class C of distributions
  - Try to make C as large as possible
  - Sufficient: i.i.d. from a "smooth" distribution ∈ C
25. Crypto-flavored Approach [CDMSW, CDMT, NS]
- [CDMSW] Compare to a simulator: ∀ distributions on databases DB ∈ C, ∀ adversaries A, ∃ A' such that ∀ subsets J ⊆ DB:
    Pr_{DB,S}[ A(S) breaches privacy in J ] − Pr_{DB}[ A'() breaches privacy in J ] ≤ ε
- [CDMSW, CDMT] Geometric data
  - Assume xi ∈ R^d
  - Relax the definition to ball predicates: g_{z,r} = {x : ||x − z|| ≤ r} and g'_{z,r} = {x : ||x − z|| ≤ C·r}
  - Breach if |DB ∩ g_{z,r}| > 0 and |DB ∩ g'_{z,r}| < k
  - Several types of histograms can be released
  - Sufficient for metric problems: clustering, min. spanning tree, ...
26. Crypto-flavored Approach [CDMSW, CDMT, NS]
- [NS] No geometric restrictions
  - A lot of noise: almost erase the data!
  - Strong privacy statement, but very weak utility
- [CDMSW, CDMT, NS]: proven statements!
- Issues
  - Works for a large class of prior distributions and side information, but not for all
  - Not clear if it helps with ordinary statistical calculations
  - Interesting utility requires geometric restrictions
  - Too messy?
27. Blending into a Crowd
- Intuition: I am safe in a group of k or more
- Pros
  - Appealing intuition for privacy
  - Seems fundamental
  - Mathematically interesting
  - Meaningful statements are possible!
- Cons
  - Does it rule out learning facts about a particular individual?
  - All results seem to make strong assumptions on the adversary's prior distribution
  - Is this necessary? (Yes)
28. Overview
- Examples
- Intuitions for privacy
- Why crypto definitions don't apply
- A Partial Selection of Definitions
  - Two straw men
  - Blending into the Crowd
  - An impossibility result
  - Attribute Disclosure and Differential Privacy
- Conclusions
29. An impossibility result
- An abstract schema
  - Define a privacy breach
  - ∀ distributions on databases, ∀ adversaries A, ∃ A' such that
    Pr( A(San) finds a breach ) − Pr( A'() finds a breach ) ≤ ε
- Theorem [Dwork-Naor]
  - For any "reasonable" notion of breach: if San(DB) contains information about DB, then some adversary breaks this definition
- Example
  - Adversary knows "Alice is 2 inches shorter than the average Lithuanian"
    - ... but how tall are Lithuanians?
  - With the sanitized database, the probability of guessing Alice's height goes up
  - Theorem: this is unavoidable
30. Proof sketch
- Suppose
  - DB is uniform, and the mutual information I( DB ; San(DB) ) > 0
  - A breach is predicting a predicate g(DB)
- Pick a hash function h : databases → {0,1}^{H(DB | San)}
- The prior distribution is uniform, conditioned on h(DB) = z
- Then
  - h(DB) = z alone gives no information on g(DB)
  - San(DB) and h(DB) = z together determine DB
- [DN] vastly generalize this
31. Preventing Attribute Disclosure
[Figure: adversary A sends queries 1..T to San, which answers using DB = (x1, ..., xn) and random coins]
- A large class of definitions
  - Safe if the adversary can't learn "too much" about any entry
  - E.g.: cannot narrow Xi down to a small interval
  - E.g.: for uniform Xi, mutual information I( Xi ; San(DB) ) ≤ ε
- How can we decide among these definitions?
32. Differential Privacy
[Figure: adversary A sends queries 1..T to San, which answers using DB = (x1, ..., xn) and random coins]
- Lithuanians example
  - The adversary learns Alice's height even if Alice is not in DB
- Intuition [DM]
  - Whatever is learned would be learned regardless of whether or not Alice participates
  - Dual: whatever is already known, the situation won't get worse
33. Differential Privacy
[Figure: the same interaction, but one row xi of DB is replaced by 0]
- Define n+1 games
  - Game 0: adversary interacts with San(DB)
  - For each i, let DB_{-i} = (x1, ..., x_{i-1}, 0, x_{i+1}, ..., xn)
  - Game i: adversary interacts with San(DB_{-i})
- Bayesian adversary
  - Given transcript S and prior distribution p() on DB, define n+1 posterior distributions p0(·|S), ..., pn(·|S)
34. Differential Privacy
- Definition: San is safe if ∀ prior distributions p() on DB, ∀ transcripts S, ∀ i ∈ {1, ..., n}:
    StatDiff( p0(·|S), pi(·|S) ) ≤ ε
- Note that the prior distribution may be far from both
- How can we satisfy this?
35. Approach: Indistinguishability [DiNi, EGS, BDMN]
[Figure: two runs of San, on databases DB and DB' that differ in 1 row; the resulting transcript distributions must be at distance ≤ ε]
- Choice of distance measure is important
37. Approach: Indistinguishability [DiNi, EGS, BDMN]
- Problem: ε must be large
  - By a hybrid argument, any two databases induce transcripts at distance ≤ nε
  - To get utility, need ε > 1/n
- Statistical difference 1/n is not meaningful
- Example: release a random point in the database
  - San(x1, ..., xn) = (j, xj) for a random j
  - For every i, changing xi induces statistical difference 1/n
  - But some xi is revealed with probability 1
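The release-a-random-point counterexample is easy to state as code; a sketch showing why small per-row statistical difference is not enough:

```python
import random

def release_random_point(db):
    """San(x1, ..., xn) = (j, xj) for a uniformly random index j.
    Changing one row shifts the output distribution by only 1/n in
    statistical distance, yet the chosen row is revealed completely."""
    j = random.randrange(len(db))
    return (j, db[j])
```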
38. Formalizing Indistinguishability
[Figure: adversary A interacts with San on DB and on DB', receiving transcript S in each case]
- Definition: San is ε-indistinguishable if ∀ A, ∀ DB, DB' which differ in 1 row, ∀ sets of transcripts E:
    p( San(DB) ∈ E ) ≤ e^ε · p( San(DB') ∈ E )
  (note e^ε ≈ 1 + ε for small ε)
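As a concrete instance (my example, not from the slides), bit-flipping randomized response with flip probability p < 1/2 satisfies this definition: on databases differing in one row, any output's probability changes by a factor of at most (1 − p)/p, giving ε = ln((1 − p)/p):

```python
import math

def rr_epsilon(p):
    """epsilon for bit-flipping randomized response: the worst-case ratio
    of output probabilities on databases differing in one row is
    (1 - p) / p, assuming 0 < p < 1/2."""
    return math.log((1 - p) / p)
```

Less noise (smaller p) means larger ε, i.e. a weaker indistinguishability guarantee.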
39. Indistinguishability ⇒ Differential Privacy
- Recall: San is safe if ∀ prior distributions p() on DB, ∀ transcripts S, ∀ i ∈ {1, ..., n}: StatDiff( p0(·|S), pi(·|S) ) ≤ ε
- We can use indistinguishability
  - For every S and DB, the transcript probabilities under DB and DB_{-i} are within a factor e^ε
  - This implies StatDiff( p0(·|S), pi(·|S) ) ≤ ε
40. Why does this help?
- With relatively little noise, we can release:
  - Averages
  - Histograms
  - Matrix decompositions
  - Certain types of clustering
  - ...
- See Kobbi's talk
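For averages over 0/1 data, the standard recipe is to add noise scaled to the query's sensitivity; a minimal sketch (mine, not code from the talk) using Laplace noise:

```python
import math
import random

def noisy_sum(bits, eps):
    """Release the sum of 0/1 bits plus Laplace(1/eps) noise.  Changing
    one row changes the true sum by at most 1 (sensitivity 1), which
    yields eps-indistinguishability."""
    u = random.random() - 0.5  # uniform on [-0.5, 0.5)
    # Inverse-CDF sample from the Laplace distribution with scale 1/eps
    noise = -(1.0 / eps) * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)
    return sum(bits) + noise
```

The noise has standard deviation about 1.4/ε regardless of n, so for large n the relative error in the average is small: "relatively little noise".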
41. Preventing Attribute Disclosure
- Various ways to capture "no particular value should be revealed"
- Differential criterion
  - Whatever is learned would be learned regardless of whether or not person i participates
  - Satisfied by indistinguishability
  - Also implies protection from re-identification?
- Two interpretations
  - A given release won't make privacy worse
  - A rational respondent will answer if there is some gain
- Can we preserve enough utility?
42. Overview
- Examples
- Intuitions for privacy
- Why crypto definitions don't apply
- A Partial Selection of Definitions
  - Two straw men
  - Blending into the Crowd
  - An impossibility result
  - Attribute Disclosure and Differential Privacy
("partial" = incomplete and biased)
43. Things I Didn't Talk About
- Economic perspective [KPR]
  - Utility of providing data: value minus cost
  - May depend on whether others participate
  - When is it worth my while?
- Specific methods for re-identification
- Various other frameworks (e.g. L-diversity)
- Other pieces of the big "data privacy" picture
  - Access control
  - Implementing a trusted collection center
44. Conclusions
- Pinning down a social notion in a particular context
- A biased survey of approaches to definitions
  - A taste of techniques along the way
  - Didn't talk about utility
- The question has a different flavor from
  - usual crypto problems
  - statisticians' traditional conception
- Meaningful statements are possible!
  - Practical?
  - Do they cover everything? No
45. Conclusions
- How close are we to converging?
  - E.g. secure function evaluation, encryption, Turing machines, ...
  - But we're after a social concept?
  - Silver bullet?
- What are the big challenges?
- Need cryptanalysis of these systems (Adi?)