Privacy Preserving Data Mining: An Overview and Examination of Euclidean Distance Preserving Data Transformation

Transcript and Presenter's Notes

1
Privacy Preserving Data Mining: An Overview and
Examination of Euclidean Distance Preserving Data
Transformation
  • Chris Giannella
  • cgiannel AT acm DOT org

2
Talk Outline
  • Introduction
  • Privacy preserving data mining: what problem is
    it aimed to address?
  • Focus of this talk: data transformation
  • Some data transformation approaches
  • My current research: Euclidean distance
    preserving data transformation
  • Wrap-up: summary

3
An Example Problem
  • The U.S. Census Bureau collects lots of data
  • If released in raw form, this data would provide
  • a wealth of valuable information regarding broad
    population patterns (good), but also
  • access to private information regarding
    individuals (bad)
  • How to allow analysts to extract population
    patterns without learning private information?

4
Privacy-Preserving Data Mining
  • "The study of how to produce valid mining models
    and patterns without disclosing private
    information." (F. Giannotti and F. Bonchi,
    Privacy Preserving Data Mining, KDUbiq Summer
    School, 2006)
  • Several broad approaches; this talk covers
  • data transformation (the census model)

5
Data Transformation (the Census Model)
(Figure: the Private Data is transformed; the Researcher / Data Miner sees only the Transformed Data.)
6
DT Objectives
  • Minimize risk of disclosing private information
  • Maximize the analytical utility of the
    transformed data
  • DT is also studied in the field of Statistical
    Disclosure Control.

7
Some things DT does not address
  • Preventing unauthorized access to the private
    data (e.g. hacking).
  • Securely communicating private data.
  • DT and cryptography are quite different.
  • (Moreover, standard encryption does not solve the
    DT problem)

8
Assessing Transformed Data Utility
  • How accurately does a transformation preserve
    certain kinds of patterns? E.g.
  • data mean, covariance
  • Euclidean distance between data records
  • the underlying generating distribution
  • How useful are the patterns at drawing
    conclusions/inferences? (A simple check of the
    first kind is sketched below.)
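
A minimal sketch (my illustration, not part of the original slides) of how such utility checks could be computed with NumPy; the function name utility_report and the specific error measures are assumptions:

    import numpy as np

    def utility_report(X, Y):
        # X: original records (one per row); Y: transformed records (one per row).
        # Returns rough errors in the mean, the covariance, and the pairwise
        # Euclidean distances, i.e. the pattern types listed on this slide.
        mean_err = np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))
        cov_err = np.linalg.norm(np.cov(X, rowvar=False) - np.cov(Y, rowvar=False))

        def pairwise(D):
            # all pairwise Euclidean distances between rows of D
            diff = D[:, None, :] - D[None, :, :]
            return np.sqrt((diff ** 2).sum(axis=-1))

        dist_err = np.abs(pairwise(X) - pairwise(Y)).mean()
        return mean_err, cov_err, dist_err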

9
Assessing Privacy Disclosure Risk
  • Some efforts exist in the literature to develop
    rigorous definitions of disclosure risk,
  • but there is no widely accepted agreement
  • This talk will take an ad-hoc approach:
  • for a specific attack, how closely can any
    private data record be estimated?

10
Talk Outline
  • Introduction
  • Privacy preserving data mining: what problem is
    it aimed to address?
  • Focus of this talk: data transformation
  • Some data transformation approaches
  • My current research: Euclidean distance
    preserving data transformation
  • Wrap-up: summary

11
Some DT approaches
  • Discussed in this talk:
  • Additive independent noise
  • Euclidean distance preserving transformation
  • (my current research)
  • Others:
  • Data swapping/shuffling, multiplicative noise,
    micro-aggregation, K-anonymization, replacement
    with synthetic data, etc.

12
Additive Independent Noise
  • For each private data record (x1, ..., xn), add
    independent random noise to each entry:
  • (y1, ..., yn) = (x1 + e1, ..., xn + en)
  • ei is generated independently as N(0, d * Var(i)),
    where Var(i) is the variance of attribute i
  • Increasing d reduces privacy disclosure risk
    (a sketch of this scheme follows below)
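
A minimal sketch of the additive-noise scheme just described (my own Python/NumPy illustration, not the speaker's code; the function name is an assumption):

    import numpy as np

    def add_independent_noise(X, d, rng=None):
        # X: private data, one record (x1, ..., xn) per row.
        # Each entry gets independent Gaussian noise ei ~ N(0, d * Var(i)),
        # where Var(i) is the sample variance of attribute i.
        rng = np.random.default_rng() if rng is None else rng
        col_var = X.var(axis=0)                       # Var(i) per attribute
        noise = rng.normal(0.0, np.sqrt(d * col_var), size=X.shape)
        return X + noise                              # (y1,...,yn) = (x1+e1,...,xn+en)

Larger d masks the data more (lower disclosure risk) at the cost of more distortion (lower utility), which is the trade-off discussed on the following slides.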

13
Additive Independent Noise (d = 0.5)
14
Additive Independent Noise
  • Difficult to set d so as to produce both
  • low privacy disclosure risk and
  • high data utility
  • Some enhancements on the basic idea exist
  • E.g. Muralidhar et al.

15
Talk Outline
  • Introduction
  • Privacy preserving data mining: what problem is
    it aimed to address?
  • Focus of this talk: data transformation
  • Some data transformation approaches
  • My current research: Euclidean distance
    preserving data transformation (EDPDT)
  • Wrap-up: summary

16
EDPDT: High Data Utility!
  • Many data clustering algorithms use Euclidean
    distance to group records, e.g.
  • K-means clustering, hierarchical agglomerative
    clustering
  • If Euclidean distance is accurately preserved,
    these algorithms will produce the same clusters
    on the transformed data as on the original data
    (see the sketch below).
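
A minimal sketch illustrating this claim (my own illustration; the later slides write Y = MX with records as columns, while here records are rows, so the equivalent transform is Y = X M^T):

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                  # 100 records, 5 attributes

    # a random orthogonal (distance preserving) matrix via QR decomposition
    M, _ = np.linalg.qr(rng.normal(size=(5, 5)))
    Y = X @ M.T                                    # transformed records

    def pairwise(D):
        diff = D[:, None, :] - D[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))

    # all pairwise Euclidean distances agree, so distance-based clustering
    # (e.g. K-means) groups the transformed records exactly like the originals
    print(np.allclose(pairwise(X), pairwise(Y)))   # True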

17
EDPDT: High Data Utility!
(Figure: the original data and the transformed data.)

18
EDPDT: Unclear Privacy Disclosure Risk
  • Focus of the research ... the approach:
  • Develop attacks combining the transformed data
    with plausible prior knowledge.
  • How well can these attacks estimate private data
    records?

19
Two Different Prior Knowledge Assumptions
  • Known input: The attacker knows a small subset of
    the private data records.
  • Focus of this talk.
  • Known sample: The attacker knows a set of data
    records drawn independently from the same
    underlying distribution as the private data
    records.
  • Happy to discuss offline.

20
Known Input Prior Knowledge
  • Underlying assumption: Individuals know
  • a) whether there is a record for them among the
    private data records, and
  • b) the attributes of the private data records.
  • Therefore, each individual knows one private record.
  • Therefore, a small group of malicious individuals
    could cooperate to produce a small subset of the
    private data records.

21
Known Input Attack
  • Given: Y1, ..., Ym (transformed data records) and
    X1, ..., Xk (known private data records)
  • 1) Determine the transformation constraints,
  • i.e. which transformed records came from which
    known private records.
  • 2) Choose T randomly from the set of all distance
    preserving transformations that satisfy the
    constraints.
  • 3) Apply T^-1 to the transformed data.

22
Known Input Attack: 2D data, 1 known private data
record
23
Known Input Attack General Case
  • Y = MX
  • Each column of X (resp. Y) is a private (resp.
    transformed) data record.
  • M is an orthogonal matrix.
  • [Ykn | Yun] = M [Xknown | Xunknown]
  • Attack (a sketch follows below):
  • Choose T randomly from {T : T is an orthogonal
    matrix and T Xknown = Ykn}.
  • Produce T^-1(Yun).
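
A minimal sketch of the known-input attack (my own illustration). When the number of known records is at least the number of attributes, the constraints typically determine T, and it can be recovered with an orthogonal Procrustes fit; the slides' random choice over an under-determined constraint set is not reproduced here, and the variable names are illustrative:

    import numpy as np

    def known_input_attack(Y_known, Y_unknown, X_known):
        # Columns are records, as on this slide: Y_known = M @ X_known
        # for some unknown orthogonal matrix M.
        # Orthogonal Procrustes fit: the orthogonal T minimizing
        # ||T @ X_known - Y_known||_F is U @ Vt,
        # where Y_known @ X_known.T = U @ diag(s) @ Vt.
        U, _, Vt = np.linalg.svd(Y_known @ X_known.T)
        T = U @ Vt
        return T.T @ Y_unknown                     # T^-1 = T^T for orthogonal T

With enough known records this recovers the unknown private records essentially exactly, which is consistent with the P16 = 1 result on the next slide.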

24
Known Input Attack -- Experiments
  • 18,000-record, 16-attribute real data set.
  • Given k known private data records, computed Pk,
    the probability that the attack estimates one
    unknown private record with > 85% accuracy.
  • P2 = 0.16
  • P4 = 1
  • P16 = 1

25
Wrap-Up Summary
  • Introduction
  • Privacy preserving data mining: what problem is
    it aimed to address?
  • Focus of this talk: data transformation
  • Some data transformation approaches
  • My current research: Euclidean distance
    preserving data transformation

26
Thanks to
  • You
  • for your attention
  • Kun Liu
  • joint research; some material used in this
    presentation
  • Krish Muralidhar
  • some material used in this presentation
  • Hillol Kargupta
  • joint research

27
Distance Preserving Perturbation
(Figure: the data matrix of Records and Attributes is perturbed.)
28
Distance Preserving Perturbation
(Figure: the perturbed data matrix Y.)
29
Known Sample Attack
30
Known Sample Attack Experiments (backup)
Fig. Known sample attack for Adult data with
32,561 private tuples. The attacker has 2
samples from the same distribution. The average
relative error of the recovered data is 0.1081
(10.81%).