Title: Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation
1. Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation
- Chris Giannella
- cgiannel AT acm DOT org
2. Talk Outline
- Introduction
- Privacy preserving data mining: what problem is it aimed to address?
- Focus of this talk: data transformation
- Some data transformation approaches
- My current research: Euclidean distance preserving data transformation
- Wrap-up: summary
3. An Example Problem
- The U.S. Census Bureau collects lots of data.
- If released in raw form, this data would provide:
  - a wealth of valuable information regarding broad population patterns
  - access to private information regarding individuals
- How to allow analysts to extract population patterns without learning private information?
4. Privacy-Preserving Data Mining
- The study of how to produce valid mining models and patterns without disclosing private information.
  - F. Giannotti and F. Bonchi, Privacy Preserving Data Mining, KDUbiq Summer School, 2006.
- Several broad approaches; this talk:
  - data transformation (the census model)
5. Data Transformation (the Census Model)
[Diagram: private data is transformed before release; the researcher/data miner sees only the transformed data.]
6. DT Objectives
- Minimize the risk of disclosing private information.
- Maximize the analytical utility of the transformed data.
- DT is also studied in the field of Statistical Disclosure Control.
7. Some things DT does not address
- Preventing unauthorized access to the private data (e.g. hacking).
- Securely communicating private data.
- DT and cryptography are quite different.
  - (Moreover, standard encryption does not solve the DT problem.)
8. Assessing Transformed Data Utility
- How accurately does a transformation preserve certain kinds of patterns (a small measurement sketch follows this list)? E.g.:
  - data mean, covariance
  - Euclidean distance between data records
  - underlying generating distribution
- How useful are the patterns at drawing conclusions/inferences?
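To make the first point concrete, here is a minimal sketch (not from the slides; it assumes NumPy and SciPy are available, and the function name utility_report is made up) that measures how well a transformation preserves the data mean, covariance, and pairwise Euclidean distances by comparing the original data X with the transformed data Y.

import numpy as np
from scipy.spatial.distance import pdist

def utility_report(X, Y):
    # X, Y: n_records x n_attributes arrays (original and transformed data).
    # Discrepancy in the data mean.
    mean_err = np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))
    # Discrepancy in the covariance matrix (Frobenius norm).
    cov_err = np.linalg.norm(np.cov(X, rowvar=False) - np.cov(Y, rowvar=False))
    # Relative discrepancy in all pairwise Euclidean distances.
    dX, dY = pdist(X), pdist(Y)
    dist_err = np.linalg.norm(dX - dY) / np.linalg.norm(dX)
    return {"mean_error": mean_err, "covariance_error": cov_err,
            "relative_distance_error": dist_err}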
9. Assessing Privacy Disclosure Risk
- There have been some efforts in the literature to develop rigorous definitions of disclosure risk, but no widely accepted definition has emerged.
- This talk will take an ad-hoc approach:
  - for a specific attack, how closely can any private data record be estimated?
10. Talk Outline
- Introduction
- Privacy preserving data mining: what problem is it aimed to address?
- Focus of this talk: data transformation
- Some data transformation approaches
- My current research: Euclidean distance preserving data transformation
- Wrap-up: summary
11. Some DT approaches
- Discussed in this talk:
  - Additive independent noise
  - Euclidean distance preserving transformation (my current research)
- Others:
  - Data swapping/shuffling, multiplicative noise, micro-aggregation, K-anonymization, replacement with synthetic data, etc.
12. Additive Independent Noise
- For each private data record (x1, …, xn), add independent random noise to each entry (a code sketch follows this list):
  - (y1, …, yn) = (x1 + e1, …, xn + en)
  - ei is generated independently as N(0, d·Var(i)), where Var(i) is the variance of attribute i
- Increasing d reduces privacy disclosure risk.
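As a rough illustration of the scheme above (a minimal sketch, not the implementation used in the talk; NumPy is assumed and the function name is made up), each attribute i receives independent Gaussian noise with variance d times the sample variance of that attribute:

import numpy as np

def add_independent_noise(X, d, rng=None):
    # X: n_records x n_attributes array of private data; d: noise scale factor.
    rng = np.random.default_rng() if rng is None else rng
    attr_var = X.var(axis=0)                        # Var(i) for each attribute i
    noise = rng.normal(0.0, np.sqrt(d * attr_var), size=X.shape)  # ei ~ N(0, d*Var(i))
    return X + noise                                # y = x + e, entrywise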
13. Additive Independent Noise (d = 0.5)
14. Additive Independent Noise
- It is difficult to set d so as to produce both:
  - low privacy disclosure risk
  - high data utility
- Some enhancements on the basic idea exist:
  - e.g. Muralidhar et al.
15. Talk Outline
- Introduction
- Privacy preserving data mining: what problem is it aimed to address?
- Focus of this talk: data transformation
- Some data transformation approaches
- My current research: Euclidean distance preserving data transformation (EDPDT)
- Wrap-up: summary
16. EDPDT: High Data Utility!
- Many data clustering algorithms use Euclidean distance to group records, e.g.:
  - K-means clustering, hierarchical agglomerative clustering
- If Euclidean distance is accurately preserved, these algorithms will produce the same clusters on the transformed data as on the original data (see the sketch below).
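As an illustration (a minimal sketch under the assumption that the transformation is an orthogonal matrix applied to the records; the data and dimensions here are made up), a random orthogonal matrix leaves every pairwise Euclidean distance unchanged, so a distance-based clusterer such as k-means groups the transformed records exactly as it groups the originals:

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))                   # toy "private" data, records as rows
T, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal matrix
Y = X @ T.T                                      # Euclidean distance preserving transform
assert np.allclose(pdist(X), pdist(Y))           # all pairwise distances are unchanged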
17. EDPDT: High Data Utility!
18. EDPDT: Unclear Privacy Disclosure Risk
- This is the focus of the research. Approach:
  - Develop attacks combining the transformed data with plausible prior knowledge.
  - How well can these attacks estimate private data records?
19. Two Different Prior Knowledge Assumptions
- Known input: the attacker knows a small subset of the private data records.
  - Focus of this talk.
- Known sample: the attacker knows a set of data records drawn independently from the same underlying distribution as the private data records.
  - Happy to discuss offline.
20. Known Input Prior Knowledge
- Underlying assumption: individuals know
  - a) whether there is a record for them among the private data records, and
  - b) the attributes of the private data records.
- Hence, each individual knows one private record.
- Hence, a small group of malicious individuals could cooperate to produce a small subset of the private data records.
21. Known Input Attack
- Given Y1, …, Ym (transformed data records) and X1, …, Xk (known private data records):
  1) Determine the transformation constraints, i.e. which transformed records came from which known private records.
  2) Choose T randomly from the set of all distance-preserving transformations that satisfy the constraints.
  3) Apply T^-1 to the transformed data.
22. Known Input Attack: 2D data, 1 known private data record
23. Known Input Attack: General Case
- Y = MX
  - Each column of X (resp. Y) is a private (resp. transformed) data record.
  - M is an orthogonal matrix.
- [Y_kn | Y_un] = M [X_known | X_unknown]
- Attack (see the sketch below):
  - Choose T randomly from { T : T an orthogonal matrix, T X_known = Y_kn }.
  - Produce T^-1(Y_un).
24. Known Input Attack: Experiments
- 18,000-record, 16-attribute real data set.
- Given k known private data records, computed P_k, the probability that the attack estimates one unknown private record with > 85% accuracy.
  - P_2 = 0.16
  - P_4 = 1
  - …
  - P_16 = 1
25. Wrap-Up: Summary
- Introduction
- Privacy preserving data mining: what problem is it aimed to address?
- Focus of this talk: data transformation
- Some data transformation approaches
- My current research: Euclidean distance preserving data transformation
26. Thanks to
- You
  - for your attention
- Kun Liu
  - joint research; some material used in this presentation
- Krish Muralidhar
  - some material used in this presentation
- Hillol Kargupta
  - joint research
27. Distance Preserving Perturbation
[Figure: records-by-attributes data matrix and its perturbed version.]
28. Distance Preserving Perturbation
[Figure: the transformed data matrix Y.]
29. Known Sample Attack (more)
30. Known Sample Attack: Experiments (backup)
Fig.: Known sample attack for the Adult data set with 32,561 private tuples. The attacker has 2 samples from the same distribution. The average relative error of the recovered data is 0.1081 (10.81%).