Transcript and Presenter's Notes

Title: Publishing Set-Valued Data via Differential Privacy


1
Publishing Set-Valued Data via Differential
Privacy
  • Rui Chen, Concordia University
  • Noman Mohammed, Concordia University
  • Benjamin C. M. Fung, Concordia University
  • Bipin C. Desai, Concordia University
  • Li Xiong, Emory University

VLDB 2011
2
Outline
  • Introduction
  • Preliminaries
  • Sanitization algorithm
  • Experimental results
  • Conclusions

3
Introduction
  • The problem: non-interactive set-valued data publication under differential privacy
  • Typical set-valued data: transaction data, web search queries

4
Introduction
  • Set-valued data refers to data in which each record owner is associated with a set of items drawn from an item universe.

TID Items
t1 {I1, I2, I3, I4}
t2 {I2, I4}
t3 {I2}
t4 {I1, I2}
t5 {I2}
t6 {I1}
t7 {I1, I2, I3, I4}
t8 {I2, I3, I4}
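As a rough illustration only (nothing here is prescribed by the paper), the example table could be encoded in Python as a mapping from record identifiers to item sets:

```python
# The example dataset: each record owner (TID) holds a set of items
# drawn from the universe {I1, I2, I3, I4}.
dataset = {
    "t1": {"I1", "I2", "I3", "I4"},
    "t2": {"I2", "I4"},
    "t3": {"I2"},
    "t4": {"I1", "I2"},
    "t5": {"I2"},
    "t6": {"I1"},
    "t7": {"I1", "I2", "I3", "I4"},
    "t8": {"I2", "I3", "I4"},
}
```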
5
Introduction
  • Existing works 1, 2, 3, 4, 5, 6, 7 on
    publishing set-valued data are based on
    partitioned-based privacy models 8.
  • They provide insufficient privacy protection.
  • Composition attack 8
  • deFinetti attack 9
  • Foreground knowledge attack 10
  • They are vulnerable to background knowledge.

6
Introduction
  • Differential privacy is independent of an adversary's background knowledge and computational power (with exceptions [11]).
  • The outcome of any analysis should not overly depend on a single data record.
  • Existing differentially private data publishing approaches are not adequate in terms of both utility and scalability for our problem.

7
Introduction
  • Problems of data-independent publishing
    approaches

Universe I = {I1, I2, I3}; possible itemsets: {I1}, {I2}, {I3}, {I1, I2}, {I1, I3}, {I2, I3}, {I1, I2, I3}
  • Scalability: O(2^|I|) itemsets must be considered (a small sketch of this blow-up follows below)
  • Utility: noise accumulates exponentially
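A tiny illustrative sketch (not part of the paper) of why data-independent approaches blow up: the number of itemsets such an approach must account for doubles with every item added to the universe.

```python
from itertools import combinations

def all_itemsets(universe):
    """Yield every non-empty itemset over the item universe."""
    items = sorted(universe)
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            yield set(combo)

universe = {"I1", "I2", "I3"}
print(len(list(all_itemsets(universe))))  # 7 = 2**3 - 1 non-empty itemsets
```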

8
Outline
  • Introduction
  • Preliminaries
  • Sanitization algorithm
  • Experimental results
  • Conclusions

9
Preliminaries
  • Context-free taxonomy tree
  • Each internal node represents the set of its leaf items, not necessarily a semantic generalization (a minimal sketch follows below)
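A minimal sketch of one way such a taxonomy tree could be represented, assuming a fan-out of 2 over the universe {I1, I2, I3, I4}; the class name and layout are illustrative, not the paper's data structure.

```python
class TaxonomyNode:
    """A context-free taxonomy node: it simply covers a set of leaf items."""
    def __init__(self, items, children=None):
        self.items = frozenset(items)    # leaf items beneath this node
        self.children = children or []   # empty list means this node is a leaf

leaves = [TaxonomyNode({i}) for i in ("I1", "I2", "I3", "I4")]
i_12 = TaxonomyNode({"I1", "I2"}, leaves[:2])   # internal node covering {I1, I2}
i_34 = TaxonomyNode({"I3", "I4"}, leaves[2:])   # internal node covering {I3, I4}
root = TaxonomyNode({"I1", "I2", "I3", "I4"}, [i_12, i_34])
```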

10
Preliminaries
  • Differential privacy [12]
  • D and D' are neighbours if they differ on at most one record

A non-interactive privacy mechanism A gives ε-differential privacy if, for all neighbours D and D', and for any possible sanitized database D̂ ∈ Range(A),
Pr[A(D) = D̂] ≤ exp(ε) × Pr[A(D') = D̂]
11
Preliminaries
  • Laplace mechanism [12]

Global sensitivity of a query Q: the maximum of ||Q(D) - Q(D')||_1 over all neighbours D and D'.
For example, a single counting query Q over a dataset D has sensitivity 1, so returning Q(D) + Laplace(1/ε) gives ε-differential privacy.
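A minimal sketch of the Laplace mechanism for a single counting query (sensitivity 1), using NumPy's Laplace sampler:

```python
import numpy as np

rng = np.random.default_rng()

def noisy_count(true_count, epsilon):
    """Laplace mechanism: a counting query has global sensitivity 1,
    so Laplace noise with scale 1/epsilon gives epsilon-differential privacy."""
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

print(noisy_count(4, epsilon=1.0))  # a noisy release of a true count of 4
```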
12
Preliminaries
  • Exponential mechanism [13]

Given a utility function q : (D × R) → R for a database instance D, the mechanism A that returns r with probability proportional to exp(ε · q(D, r) / (2Δq)) gives ε-differential privacy.
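A small generic sketch of the exponential mechanism; the candidate set and utility scores below are made up purely for illustration.

```python
import numpy as np

rng = np.random.default_rng()

def exponential_mechanism(candidates, quality, epsilon, sensitivity):
    """Sample a candidate r with probability proportional to
    exp(epsilon * q(r) / (2 * sensitivity))."""
    scores = np.array([quality(r) for r in candidates], dtype=float)
    # Subtracting the maximum keeps exp() numerically stable
    # without changing the resulting distribution.
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))
    probs = weights / weights.sum()
    return candidates[rng.choice(len(candidates), p=probs)]

scores = {"a": 3.0, "b": 5.0, "c": 1.0}   # made-up utility scores
print(exponential_mechanism(list(scores), scores.get, epsilon=1.0, sensitivity=1.0))
```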
13
Preliminaries
  • Composition properties [14]

Sequential composition: applying mechanisms with budgets ε_1, …, ε_k to the same data gives (Σ_i ε_i)-differential privacy
Parallel composition: applying mechanisms with budgets ε_1, …, ε_k to disjoint subsets of the data gives max_i(ε_i)-differential privacy
14
Preliminaries
  • Utility metrics

For a given itemset I' ⊆ I, a counting query Q over a dataset D is defined as Q(D) = |{t ∈ D : I' ⊆ t}|.
A privacy mechanism A is (α, δ)-useful if, with probability at least 1 - δ, for every counting query Q and every dataset D, |Q(D̂) - Q(D)| ≤ α, where D̂ = A(D). [15]
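For concreteness, a counting query over the toy dataset sketched earlier can be written directly from this definition:

```python
def counting_query(dataset, itemset):
    """Q(D): the number of records containing every item of the given itemset."""
    itemset = set(itemset)
    return sum(1 for items in dataset.values() if itemset <= items)

# On the toy dataset, {I2, I4} is contained in t1, t2, t7 and t8:
print(counting_query(dataset, {"I2", "I4"}))  # 4
```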
15
Outline
  • Introduction
  • Preliminaries
  • Sanitization algorithm
  • Experimental results
  • Conclusions

16
Sanitization Algorithm
  • Top-down partitioning

TID Items
t1 {I1, I2, I3, I4}
t2 {I2, I4}
t3 {I2}
t4 {I1, I2}
t5 {I2}
t6 {I1}
t7 {I1, I2, I3, I4}
t8 {I2, I3, I4}
  • Generalize all records to a single root partition
  • Keep partitioning non-empty partitions until leaf partitions are reached (a simplified sketch follows below)
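A deliberately simplified sketch of this top-down descent, reusing the TaxonomyNode and dataset sketches above. It only shows the recursion from the root to the leaf partitions; it omits the noisy-size test used to decide whether a partition is worth expanding, the privacy-budget bookkeeping, and the paper's exact sub-partitioning over hierarchy cuts.

```python
def descend(records, node, out):
    """Recursively follow the taxonomy, keeping only records that have
    an item covered by the current node."""
    covered = [r for r in records if r & node.items]
    if not covered:              # empty partition: stop expanding
        return
    if not node.children:        # leaf partition reached
        out[frozenset(node.items)] = len(covered)  # true size; noise is added in the real algorithm
        return
    for child in node.children:
        descend(covered, child, out)

leaf_sizes = {}
descend([set(items) for items in dataset.values()], root, leaf_sizes)
print(leaf_sizes)  # per-item counts, e.g. I2 appears in 7 of the 8 records
```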

17
Sanitization Algorithm
  • Privacy budget allocation
  • We reserve B/2 to obtain noisy sizes of leaf partitions and the remaining B/2 to guide the partitioning.
  • Assign less budget to more general partitions and
    more budget to more specific partitions.

18
Sanitization Algorithm
  • Privacy budget allocation

A hierarchy cut needs at most a bounded number of partition operations to reach leaf partitions.
Example: the cut {I{1,2}, I{3,4}} needs at most two partition operations to reach leaf partitions.
19
Sanitization Algorithm
  • Privacy budget allocation
  • We reserve B/2 to obtain noisy sizes of leaf partitions and the remaining B/2 to guide the partitioning.
  • Assign less budget to more general partitions and
    more budget to more specific partitions.

(B/2) / 3 = B/6
(B/2 - B/6) / 2 = B/6
B/6 + B/2 = 2B/3
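A purely numeric sketch of the allocation illustrated above, assuming the current hierarchy cut needs at most three further partition operations: each operation spends an equal share of the guiding budget still available, and whatever remains is added to the reserved B/2 at the leaf level.

```python
B = 1.0
guide = B / 2                            # half the budget guides the partitioning
step1 = guide / 3                        # B/6 at the most general level
step2 = (guide - step1) / 2              # B/6 one level further down
leaf = (guide - step1 - step2) + B / 2   # leftover B/6 plus reserved B/2 = 2B/3
print(step1, step2, leaf)                # 0.1666..., 0.1666..., 0.6666...
```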
20
Sanitization Algorithm
  • Sub-partition generation
  • For a non-leaf partition, we need to consider all
    possible sub-partitions to satisfy differential
    privacy.
  • Efficient implementation: handle empty and non-empty partitions separately (inspired by [16]), as sketched below.
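A rough sketch of one way the empty-partition shortcut could be realized (the threshold test and the function below are assumptions for illustration, not the paper's exact procedure): every empty sub-partition has a true count of 0, so one can compute the probability that its Laplace-noised count clears the threshold and draw the number of survivors in a single binomial step instead of materializing each empty partition.

```python
import numpy as np

rng = np.random.default_rng()

def surviving_empty_partitions(num_empty, epsilon, threshold):
    """How many of num_empty empty sub-partitions would pass the threshold
    after Laplace(1/epsilon) noise is added to their zero counts."""
    # For threshold >= 0: P[Laplace(0, 1/epsilon) >= threshold] = 0.5 * exp(-epsilon * threshold)
    p = 0.5 * np.exp(-epsilon * threshold)
    return rng.binomial(num_empty, p)

print(surviving_empty_partitions(num_empty=10**6, epsilon=0.5, threshold=10))
```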

21
Outline
  • Introduction
  • Preliminaries
  • Sanitization algorithm
  • Experimental results
  • Conclusions

22
Experiments
  • Two real-life set-valued datasets are used.
  • MSNBC is publicly available at the UCI machine learning repository (http://archive.ics.uci.edu/ml/index.html).
  • STM is provided by the Société de transport de Montréal (STM) (http://www.stm.info).

23
Experiments
  • Average relative error vs. privacy budget

(Plots for privacy budgets B = 0.5, 0.75, and 1.0)
24
Experiments
  • Utility for frequent itemset mining

(Plots for privacy budgets B = 0.5, 0.75, and 1.0)
25
Experiments
  • Scalability: O(|D| · |I|)

(Plots: runtime vs. |D| and runtime vs. |I|)
26
Outline
  • Introduction
  • Preliminaries
  • Sanitization algorithm
  • Experimental results
  • Conclusions

27
Conclusions
  • Differential privacy can be successfully applied
    to non-interactive set-valued data publishing
    with guaranteed utility.
  • Differential privacy can be achieved by
    data-dependent solutions with improved efficiency
    and accuracy.
  • The general idea of data-dependent solutions applies to other types of data, for example, relational data [17] and trajectory data [18].

28
References
  • [1] J. Cao, P. Karras, C. Raissi, and K.-L. Tan. ρ-uncertainty: Inference-proof transaction anonymization. In VLDB, pp. 1033-1044, 2010.
  • [2] G. Ghinita, Y. Tao, and P. Kalnis. On the anonymization of sparse high-dimensional data. In ICDE, pp. 715-724, 2008.
  • [3] Y. He and J. F. Naughton. Anonymization of set-valued data via top-down, local generalization. In VLDB, pp. 934-945, 2009.
  • [4] M. Terrovitis, N. Mamoulis, and P. Kalnis. Privacy-preserving anonymization of set-valued data. In VLDB, pp. 115-125, 2008.
  • [5] M. Terrovitis, N. Mamoulis, and P. Kalnis. Local and global recoding methods for anonymizing set-valued data. VLDBJ, 20(1):83-106, 2011.
  • [6] Y. Xu, B. C. M. Fung, K. Wang, A. W. C. Fu, and J. Pei. Publishing sensitive transactions for itemset utility. In ICDM, pp. 1109-1114, 2008.
  • [7] Y. Xu, K. Wang, A. W. C. Fu, and P. S. Yu. Anonymizing transaction databases for publication. In SIGKDD, pp. 767-775, 2008.

29
References
  • [8] S. R. Ganta, S. P. Kasiviswanathan, and A. Smith. Composition attacks and auxiliary information in data privacy. In SIGKDD, pp. 265-273, 2008.
  • [9] D. Kifer. Attacks on privacy and deFinetti's theorem. In SIGMOD, pp. 127-138, 2009.
  • [10] R. C. W. Wong, A. Fu, K. Wang, P. S. Yu, and J. Pei. Can the utility of anonymized data be used for privacy breaches? ACM Transactions on Knowledge Discovery from Data, to appear.
  • [11] D. Kifer and A. Machanavajjhala. No free lunch in data privacy. In SIGMOD, 2011.
  • [12] C. Dwork, F. McSherry, K. Nissim, and A. Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pp. 265-284, 2006.
  • [13] F. McSherry and K. Talwar. Mechanism design via differential privacy. In FOCS, pp. 94-103, 2007.
  • [14] F. McSherry. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In SIGMOD, pp. 19-30, 2009.
  • [15] A. Blum, K. Ligett, and A. Roth. A learning theory approach to non-interactive database privacy. In STOC, pp. 609-618, 2008.

30
References
  • [16] G. Cormode, M. Procopiuc, D. Srivastava, and T. T. L. Tran. Differentially private publication of sparse data. In CoRR, 2011.
  • [17] N. Mohammed, R. Chen, B. C. M. Fung, and P. S. Yu. Differentially private data release for data mining. In SIGKDD, 2011.
  • [18] R. Chen, B. C. M. Fung, and B. C. Desai. Differentially private trajectory data publication. ICDE, under review, 2012.

31
  • Thank you!
  • Q & A

32
  • Backup Slides

33
Lower Bound Results
  • In the interactive setting, only a limited number of queries can be answered; otherwise, an adversary would be able to precisely reconstruct almost the entire original database.
  • In the non-interactive setting, one can only guarantee the utility of restricted classes of queries.

34
(No Transcript)
35
(No Transcript)
36
Threshold Selection
  • We design the threshold as a function of the standard deviation of the noise and the height of a partition's hierarchy cut.

37
Relative error
  • (α, δ)-usefulness is effective for giving an overall estimate of utility, but fails to produce intuitive experimental results.
  • We experimentally measure the utility of sanitized data for counting queries by relative error, bounding the denominator from below by a sanity bound.
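A small sketch of this measure; the concrete sanity-bound value is an assumption here (a small fraction of the dataset size is a common choice), not a number taken from the paper.

```python
def relative_error(true_count, noisy_count, sanity_bound):
    """Relative error of a counting query, bounding the denominator from
    below by a sanity bound so that tiny true counts do not dominate."""
    return abs(noisy_count - true_count) / max(true_count, sanity_bound)

print(relative_error(true_count=4, noisy_count=5.3, sanity_bound=1.0))  # 0.325
```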
38
Experiments
  • Average relative error vs. taxonomy tree fan-out

(Plots for privacy budgets B = 0.5, 0.75, and 1.0)