Title: Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation
1. Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation
- Chris Giannella
- cgiannel AT acm DOT org
2. Talk Outline
- Introduction
- Privacy preserving data mining: what problem is it aimed to address?
- Focus of this talk: data transformation
- Some data transformation approaches
- My current research: Euclidean distance preserving data transformation
- Wrap-up: summary
3. An Example Problem
- The U.S. Census Bureau collects lots of data.
- If released in raw form, this data would provide:
  - a wealth of valuable information regarding broad population patterns
  - access to private information regarding individuals
- How to allow analysts to extract population patterns without learning private information?
4. Privacy-Preserving Data Mining
- The study of how to produce valid mining models and patterns without disclosing private information.
  - F. Giannotti and F. Bonchi, Privacy Preserving Data Mining, KDUbiq Summer School, 2006.
- Several broad approaches; this talk:
  - data transformation (the census model)
5. Data Transformation (the Census Model)
[Diagram: private data is transformed before release; the researcher/data miner sees only the transformed data.]
6. DT Objectives
- Minimize the risk of disclosing private information.
- Maximize the analytical utility of the transformed data.
- DT is also studied in the field of Statistical Disclosure Control.
7. Some things DT does not address
- Preventing unauthorized access to the private data (e.g. hacking).
- Securely communicating private data.
- DT and cryptography are quite different.
  - (Moreover, standard encryption does not solve the DT problem.)
8. Assessing Transformed Data Utility
- How accurately does a transformation preserve certain kinds of patterns (a small measurement sketch follows this list)? E.g.:
  - data mean, covariance
  - Euclidean distance between data records
  - underlying generating distribution
- How useful are the patterns at drawing conclusions/inferences?
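To make the first point concrete, here is a minimal sketch (not from the slides; it assumes NumPy and SciPy are available, and the function name utility_report is made up) that measures how well a transformation preserves the data mean, covariance, and pairwise Euclidean distances by comparing the original data X with the transformed data Y.

import numpy as np
from scipy.spatial.distance import pdist

def utility_report(X, Y):
    # X, Y: n_records x n_attributes arrays (original and transformed data).
    # Discrepancy in the data mean.
    mean_err = np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))
    # Discrepancy in the covariance matrix (Frobenius norm).
    cov_err = np.linalg.norm(np.cov(X, rowvar=False) - np.cov(Y, rowvar=False))
    # Relative discrepancy in all pairwise Euclidean distances.
    dX, dY = pdist(X), pdist(Y)
    dist_err = np.linalg.norm(dX - dY) / np.linalg.norm(dX)
    return {"mean_error": mean_err, "covariance_error": cov_err,
            "relative_distance_error": dist_err}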
9. Assessing Privacy Disclosure Risk
- There have been some efforts in the literature to develop rigorous definitions of disclosure risk, but no widely accepted definition has emerged.
- This talk will take an ad-hoc approach:
  - for a specific attack, how closely can any private data record be estimated?
10. Talk Outline
- Introduction
- Privacy preserving data mining: what problem is it aimed to address?
- Focus of this talk: data transformation
- Some data transformation approaches
- My current research: Euclidean distance preserving data transformation
- Wrap-up: summary
11. Some DT approaches
- Discussed in this talk:
  - Additive independent noise
  - Euclidean distance preserving transformation (my current research)
- Others:
  - Data swapping/shuffling, multiplicative noise, micro-aggregation, K-anonymization, replacement with synthetic data, etc.
12. Additive Independent Noise
- For each private data record (x1, …, xn), add independent random noise to each entry (a code sketch follows this list):
  - (y1, …, yn) = (x1 + e1, …, xn + en)
  - ei is generated independently as N(0, d·Var(i)), where Var(i) is the variance of attribute i
- Increasing d reduces privacy disclosure risk.
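As a rough illustration of the scheme above (a minimal sketch, not the implementation used in the talk; NumPy is assumed and the function name is made up), each attribute i receives independent Gaussian noise with variance d times the sample variance of that attribute:

import numpy as np

def add_independent_noise(X, d, rng=None):
    # X: n_records x n_attributes array of private data; d: noise scale factor.
    rng = np.random.default_rng() if rng is None else rng
    attr_var = X.var(axis=0)                        # Var(i) for each attribute i
    noise = rng.normal(0.0, np.sqrt(d * attr_var), size=X.shape)  # ei ~ N(0, d*Var(i))
    return X + noise                                # y = x + e, entrywise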
13. Additive Independent Noise (d = 0.5)
14. Additive Independent Noise
- It is difficult to set d so as to produce both:
  - low privacy disclosure risk
  - high data utility
- Some enhancements on the basic idea exist:
  - e.g. Muralidhar et al.
15. Talk Outline
- Introduction
- Privacy preserving data mining: what problem is it aimed to address?
- Focus of this talk: data transformation
- Some data transformation approaches
- My current research: Euclidean distance preserving data transformation (EDPDT)
- Wrap-up: summary
16. EDPDT: High Data Utility!
- Many data clustering algorithms use Euclidean distance to group records, e.g.:
  - K-means clustering, hierarchical agglomerative clustering
- If Euclidean distance is accurately preserved, these algorithms will produce the same clusters on the transformed data as on the original data (see the sketch below).
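As an illustration (a minimal sketch under the assumption that the transformation is an orthogonal matrix applied to the records; the data and dimensions here are made up), a random orthogonal matrix leaves every pairwise Euclidean distance unchanged, so a distance-based clusterer such as k-means groups the transformed records exactly as it groups the originals:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))                   # toy "private" data, records as rows
T, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal matrix
Y = X @ T.T                                      # Euclidean distance preserving transform
assert np.allclose(pdist(X), pdist(Y))           # all pairwise distances are unchanged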
17. EDPDT: High Data Utility!
18. EDPDT: Unclear Privacy Disclosure Risk
- This is the focus of the research. Approach:
  - Develop attacks combining the transformed data with plausible prior knowledge.
  - How well can these attacks estimate private data records?
19. Two Different Prior Knowledge Assumptions
- Known input: the attacker knows a small subset of the private data records.
  - Focus of this talk.
- Known sample: the attacker knows a set of data records drawn independently from the same underlying distribution as the private data records.
  - Happy to discuss offline.
20. Known Input Prior Knowledge
- Underlying assumption: individuals know
  - a) whether there is a record for them among the private data records, and
  - b) the attributes of the private data records.
- Hence, each individual knows one private record.
- Hence, a small group of malicious individuals could cooperate to produce a small subset of the private data records.
21. Known Input Attack
- Given Y1, …, Ym (transformed data records) and X1, …, Xk (known private data records):
  1) Determine the transformation constraints, i.e. which transformed records came from which known private records.
  2) Choose T randomly from the set of all distance-preserving transformations that satisfy the constraints.
  3) Apply T^-1 to the transformed data.
22. Known Input Attack: 2D data, 1 known private data record
23. Known Input Attack: General Case
- Y = MX
  - Each column of X (resp. Y) is a private (resp. transformed) data record.
  - M is an orthogonal matrix.
- [Y_kn | Y_un] = M [X_known | X_unknown]
- Attack (see the sketch below):
  - Choose T randomly from { T : T an orthogonal matrix, T X_known = Y_kn }.
  - Produce T^-1(Y_un).
24. Known Input Attack: Experiments
- 18,000-record, 16-attribute real data set.
- Given k known private data records, computed P_k, the probability that the attack estimates one unknown private record with > 85% accuracy.
  - P_2 = 0.16
  - P_4 = 1
  - …
  - P_16 = 1
25. Wrap-Up: Summary
- Introduction
- Privacy preserving data mining: what problem is it aimed to address?
- Focus of this talk: data transformation
- Some data transformation approaches
- My current research: Euclidean distance preserving data transformation
26. Thanks to
- You
  - for your attention
- Kun Liu
  - joint research; some material used in this presentation
- Krish Muralidhar
  - some material used in this presentation
- Hillol Kargupta
  - joint research
27. Distance Preserving Perturbation
[Figure: records-by-attributes data matrix and its perturbed version.]
28. Distance Preserving Perturbation
[Figure: the transformed data matrix Y.]
29. Known Sample Attack (more)
30. Known Sample Attack: Experiments (backup)
Fig.: Known sample attack for the Adult data set with 32,561 private tuples. The attacker has 2 samples from the same distribution. The average relative error of the recovered data is 0.1081 (10.81%).