Data Transformation for Privacy-Preserving Data Mining

Privacy-Preserving Data Mining. Stanley R. M. Oliveira, Database Systems Laboratory, Computing Science Department, University of Alberta, Canada. PhD Thesis - Final ...

1
Data Transformation for Privacy-Preserving Data Mining
Database Laboratory

Stanley R. M. Oliveira
Database Systems Laboratory
Computing Science Department
University of Alberta, Canada

PhD Thesis - Final Examination, November 29th, 2004
2
Motivation

Scenario 1: A collaboration between an Internet marketing company and an on-line retail company.
Objective: find optimal customer targets.
Scenario 2: Companies sharing their transactions to build a recommender system.
Objective: provide recommendations to their customers.

3
A Taxonomy of the Existing Solutions
Data Partitioning
Data Modification
Data Restriction
Data Ownership
Fig. 1: A Taxonomy of PPDM Techniques
4
Problem Definition

To transform a database into a new one that
conceals sensitive information while preserving
general patterns and trends from the original
database.

5
Problem Definition (cont.)

Sub-Problem 1: Privacy-Preserving Association Rule Mining
I do not address privacy of individuals but the problem of protecting sensitive knowledge.
Sub-Problem 2: Privacy-Preserving Clustering
I protect the underlying attribute values of objects subjected to clustering analysis.

6
A Framework for Privacy PPDM
7
Privacy-Preserving Association Rule Mining
A taxonomy of sanitizing algorithms
8
Data Sharing-Based Algorithms: Problems
1. Hiding Failure
2. Misses Cost
3. Artifactual Patterns
4. Difference between D and D'
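The formulas behind these four measures did not survive in the transcript. The sketch below is one plausible reading, stated as an assumption rather than as the author's definitions: hiding failure is the share of sensitive patterns still mined from the sanitized database D', misses cost is the share of non-sensitive patterns lost, artifactual patterns is the share of patterns in D' that were never in D, and the difference is the relative drop in item frequencies. All function and variable names are hypothetical.

```python
from collections import Counter

def hiding_failure(sensitive, mined_after):
    """Share of sensitive patterns that can still be mined after sanitization."""
    return len(sensitive & mined_after) / len(sensitive) if sensitive else 0.0

def misses_cost(mined_before, mined_after, sensitive):
    """Share of non-sensitive patterns that were lost by sanitization."""
    non_sensitive = mined_before - sensitive
    if not non_sensitive:
        return 0.0
    return len(non_sensitive - mined_after) / len(non_sensitive)

def artifactual_patterns(mined_before, mined_after):
    """Share of patterns mined after sanitization that did not exist before."""
    if not mined_after:
        return 0.0
    return len(mined_after - mined_before) / len(mined_after)

def difference(db_before, db_after):
    """Relative drop in item frequencies between D and D'."""
    f_before = Counter(item for t in db_before for item in t)
    f_after = Counter(item for t in db_after for item in t)
    total = sum(f_before.values())
    return sum(f_before[i] - f_after[i] for i in f_before) / total

# Hypothetical toy data: patterns are frozensets of items.
mined_before = {frozenset("ABD"), frozenset("AB"), frozenset("AC"), frozenset("CD")}
sensitive = {frozenset("ABD")}
mined_after = {frozenset("AC"), frozenset("CD")}
db_before = [{"A", "B", "D"}, {"A", "C"}]
db_after = [{"A", "B"}, {"A", "C"}]
```

In this toy case the sensitive pattern is hidden, but the non-sensitive pattern {A, B} is lost as a side effect, which is exactly the trade-off the misses cost is meant to expose.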
9
Data Sharing-Based Algorithms

1. Scan a database and identify the sensitive transactions for each sensitive rule.
2. Based on the disclosure threshold ψ, compute the number of sensitive transactions to be sanitized.
3. For each sensitive rule, identify a candidate item that should be eliminated (victim item).
4. Based on the number found in step 2, remove the victim items from the sensitive transactions.
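The four steps can be sketched as a small routine. This is a minimal illustration under stated assumptions, not the thesis implementation: a rule is represented by the set of all its items, the threshold is passed as an integer percentage (`psi_percent`), the count in step 2 is rounded up, and the victim choice is delegated to a caller-supplied function (`pick_victim`). All names and the toy database are hypothetical.

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)

def sanitize(db, sensitive_rules, psi_percent, pick_victim):
    """db: dict mapping transaction id -> set of items.
    sensitive_rules: list of frozensets holding every item of a rule.
    psi_percent: disclosure threshold as an integer percentage (0..100).
    pick_victim: function(rule) -> item to remove (assumed strategy)."""
    sanitized = {tid: set(items) for tid, items in db.items()}
    for rule in sensitive_rules:
        # Step 1: a transaction is sensitive if it contains the whole rule.
        sensitive_tids = [tid for tid, items in db.items() if rule <= items]
        # Step 2: number of sensitive transactions to sanitize.
        n = ceil_div(len(sensitive_tids) * (100 - psi_percent), 100)
        # Step 3: choose the victim item for this rule.
        victim = pick_victim(rule)
        # Step 4: remove the victim from the selected transactions.
        for tid in sensitive_tids[:n]:
            sanitized[tid].discard(victim)
    return sanitized

# Hypothetical toy database and rules A,B->D and A,C->D.
db = {
    "T1": {"A", "B", "C", "D"},
    "T2": {"A", "B", "C"},
    "T3": {"A", "B", "D"},
    "T4": {"A", "C", "D"},
}
rules = [frozenset("ABD"), frozenset("ACD")]
full = sanitize(db, rules, 0, lambda rule: "D")
untouched = sanitize(db, rules, 100, lambda rule: "D")
```

With a threshold of 0% every sensitive transaction is altered, so neither rule can be mined any more; with 100% the database is returned unchanged.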

10
Measuring Effectiveness
Misses Cost under condition C1
Misses Cost under condition C3
11
Pattern Sharing-Based Algorithm
12
Pattern Sharing-Based Algorithms: Problems

1. Side Effect Factor (SEF) SEF
2. Recovery Factor (RF) RF 0, 1

13
Measuring Effectiveness
ψ = 0%, 6 sensitive rules
Dataset   S1    S2    S3    S4
Kosarak   IGA   DSA   DSA   DSA
Retail    DSA   DSA   DSA   IGA
Reuters   DSA   DSA   DSA   IGA
BMS-1     DSA   DSA   DSA   DSA
The best algorithm in terms of misses cost

ψ = 0%, varying # of rules
Dataset   S1          S2    S3          S4
Kosarak   IGA         DSA   DSA         IGA / DSA
Retail    DSA         DSA   IGA / DSA   IGA
Reuters   IGA / DSA   DSA   DSA         IGA
BMS-1     DSA         DSA   DSA         DSA
The best algorithm in terms of misses cost varying the number of rules to sanitize

ψ = 0%, 6 sensitive rules
Dataset   S1    S2    S3    S4
Kosarak   IGA   DSA   DSA   DSA
Retail    DSA   DSA   DSA   IGA
Reuters   DSA   DSA   DSA   IGA
BMS-1     DSA   DSA   DSA   DSA
The best algorithm in terms of side effect factor
14
Lessons Learned

Large datasets are our friends.
The benefit of the index: at most two scans to sanitize a dataset.
The data sanitization paradox.
The outstanding performance of IGA and DSA.
Rule sanitization does not change the support and
confidence of the shared rules.
DSA reduces the flexibility of information
sharing.

15
Privacy-Preserving Clustering (PPC)

PPC over Centralized Data:
The attribute values subjected to clustering are available in a central repository.
PPC over Vertically Partitioned Data:
There are k parties sharing data for clustering, where k ≥ 2.
The attribute values of the objects are split across the k parties.
Object IDs are revealed for join purposes only.
The values of the associated attributes are private.

16
Object Similarity-Based Representation (OSBR)
Example 1: Sharing data for research purposes (OSBR).
17
Object Similarity-Based Representation (OSBR)

Limitations of the OSBR:
Expensive in terms of communication cost: $O(m^2)$, where m is the number of objects under analysis.
Vulnerable to attacks in the case of vertically partitioned data.
Conclusion: The OSBR is effective for PPC over centralized data only.
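The quadratic communication cost follows from the representation itself: what is shared is the pairwise dissimilarity between every two objects. A minimal sketch, assuming Euclidean distance (the slide does not name the measure) and hypothetical function names:

```python
import math

def dissimilarity_matrix(objects):
    """Return the m x m matrix of pairwise Euclidean distances."""
    m = len(objects)
    dm = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i):
            # The matrix is symmetric, so each pair is computed once.
            dm[i][j] = dm[j][i] = math.dist(objects[i], objects[j])
    return dm

def shared_entries(m):
    """Number of distinct distances that must be transmitted: m(m-1)/2."""
    return m * (m - 1) // 2

# Hypothetical objects with two attributes each.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dm = dissimilarity_matrix(points)
```

For 1,000 objects `shared_entries` already gives 499,500 distances, which is why the approach is described as expensive.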

18
Dimensionality Reduction Transformation (DRBT)

Random projection from d to k dimensions:
$D'_{n \times k} = D_{n \times d} R_{d \times k}$ (linear transformation), where D is the original data, D' is the reduced data, and R is a random matrix.
R is generated by first setting each entry, as follows:
(R1) $r_{ij}$ is drawn from an i.i.d. N(0,1) and then normalizing the columns to unit length
(R2) $r_{ij}$
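The (R1) construction and the projection itself can be sketched in plain Python. The definition of (R2) is cut off in the transcript, so it is not reproduced here. Function names, the seed, and the sample data are hypothetical.

```python
import math
import random

def random_matrix_r1(d, k, rng):
    """Build a d x k matrix with i.i.d. N(0,1) entries, then scale
    every column to unit length, as described for (R1)."""
    r = [[rng.gauss(0.0, 1.0) for _ in range(k)] for _ in range(d)]
    for j in range(k):
        norm = math.sqrt(sum(r[i][j] ** 2 for i in range(d)))
        for i in range(d):
            r[i][j] /= norm
    return r

def project(data, r):
    """Compute D' = D R for an n x d dataset and a d x k matrix."""
    d, k = len(r), len(r[0])
    return [[sum(row[i] * r[i][j] for i in range(d)) for j in range(k)]
            for row in data]

# Hypothetical dataset: 4 objects with 6 numerical attributes, reduced to 3.
rng = random.Random(42)
data = [[rng.uniform(0, 100) for _ in range(6)] for _ in range(4)]
r = random_matrix_r1(6, 3, rng)
reduced = project(data, r)
```

Only `reduced` would be shared; the matrix `r` stays with the data owner.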

19
Dimensionality Reduction Transformation (DRBT)
ID    Att1     Att2     Att3     Att1     Att2      Att3
123   -50.40   17.33    12.31    -55.50   -95.26    -107.93
342   -37.08   6.27     12.22    -51.00   -84.29    -83.13
254   -55.86   20.69    -0.66    -65.50   -70.43    -66.97
446   -37.61   -31.66   -17.58   -85.50   -140.87   -72.74
286   -62.72   37.64    18.16    -88.50   -50.22    -102.76
20
Dimensionality Reduction Transformation (DRBT)

Security: A random projection from d to k dimensions, where k ≪ d, is a non-invertible linear transformation.
Space requirement is of the order O(m), where m is the number of objects.
Communication cost is of the order O(mkl), where l represents the size (in bits) required to transmit a dataset from one party to a central or third party.
Conclusion: The DRBT is effective for PPC over centralized data and vertically partitioned data.

21
DRBT: PPC over Centralized Data
Transformation   dr = 37   dr = 34   dr = 31   dr = 28   dr = 25   dr = 22   dr = 16
RP1              0.00      0.015     0.024     0.033     0.045     0.072     0.141
RP2              0.00      0.014     0.019     0.032     0.041     0.067     0.131
The error produced on the dataset Chess (do = 37)

Data             K = 2           K = 3           K = 4           K = 5
Transformation   Avg     Std     Avg     Std     Avg     Std     Avg     Std
RP2              0.941   0.014   0.912   0.009   0.881   0.010   0.885   0.006
Average of F-measure (10 trials) for the dataset Accidents (do = 18, dr = 12)

Data             K = 2           K = 3           K = 4           K = 5
Transformation   Avg     Std     Avg     Std     Avg     Std     Avg     Std
RP2              1.000   0.000   0.948   0.010   0.858   0.089   0.833   0.072
Average of F-measure (10 trials) for the dataset Iris (do = 5, dr = 3)
22
DRBT: PPC over Vertically Partitioned Data
No. Parties   RP1      RP2
1             0.0762   0.0558
2             0.0798   0.0591
3             0.0870   0.0720
4             0.0923   0.0733
The error produced on the dataset Pumsb (do = 74)

Data             K = 2           K = 3           K = 4           K = 5
Transformation   Avg     Std     Avg     Std     Avg     Std     Avg     Std
1                0.909   0.140   0.965   0.081   0.891   0.028   0.838   0.041
2                0.904   0.117   0.931   0.101   0.894   0.059   0.840   0.047
3                0.874   0.168   0.887   0.095   0.873   0.081   0.801   0.073
4                0.802   0.155   0.812   0.117   0.866   0.088   0.831   0.078
Average of F-measure (10 trials) for the dataset Pumsb (do = 74, dr = 38)
23
Contributions of this Research

Foundations for further research in PPDM.
A taxonomy of PPDM techniques.
A family of privacy-preserving methods.
A library of sanitizing algorithms.
Retrieval facilities.
A set of metrics.

24
Future Research

Privacy definition in data mining.
Combining sanitization and randomization for
PPARM.
Transforming data using one-way functions and
learning from the distorted data.
Privacy preservation in spoken language
databases.
Sanitization of document repositories on the Web.

25
PPDM: Increasing Number of Papers
26
PPDM: Privacy Violation

Privacy violation in data mining: misuse of data.
Defining privacy preservation in data mining:
Individual privacy preservation: protection of personally identifiable information.
Collective privacy preservation: protection of users' collective activity.

27
Privacy-Preserving Association Rule Mining
28
Heuristic 1: Degree of Sensitive Transactions

Definition: Let D be a transactional database and ST a set of all sensitive transactions in D. The degree of a sensitive transaction t, such that t ∈ ST, is defined as the number of sensitive association rules that can be found in t.

Degree(T1) = 2, Degree(T3) = 1, Degree(T4) = 1
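The definition translates directly into a count. The example database on the slide is not part of the transcript, so the transactions below are hypothetical, chosen only so that the degrees match the ones listed for the two sensitive rules A,B→D and A,C→D used in this presentation.

```python
def degree(transaction, sensitive_rules):
    """Number of sensitive rules whose items all occur in the transaction."""
    return sum(1 for rule in sensitive_rules if rule <= transaction)

# Each rule is represented by the set of all its items (assumed encoding).
sensitive_rules = [frozenset("ABD"), frozenset("ACD")]

# Hypothetical transactions.
db = {
    "T1": {"A", "B", "C", "D"},
    "T2": {"A", "B", "C"},
    "T3": {"A", "B", "D"},
    "T4": {"A", "C", "D"},
}
degrees = {tid: degree(items, sensitive_rules) for tid, items in db.items()}
```

T2 has degree 0 here, so it is not a sensitive transaction and would never be touched by sanitization.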
29
Data Sharing-Based Algorithms
Sensitive Rules (SR): Rule 1: A,B→D; Rule 2: A,C→D

The Round Robin Algorithm (RRA)
Step 1: Sensitive transactions: A,B→D: T1, T3; A,C→D: T1, T4
Step 2: Select the number of sensitive transactions: (a) ψ = 50%; (b) ψ = 0%
Step 3: Identify the victim items (taking turns):
ψ = 50%: Victim(T1) = A; Victim(T1) = A (Partial Sanitization)
ψ = 0%: Victim(T3) = B; Victim(T4) = C (Full Sanitization)
Step 4: Sanitize the marked sensitive transactions.
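"Taking turns" can be read as cycling through the items of a rule while walking over its sensitive transactions. The sketch below follows that reading, which is an assumption: the item order (antecedent first, then consequent), the rounding, and the function name are all hypothetical.

```python
from itertools import cycle

def round_robin_victims(rule_items, sensitive_tids, psi_percent):
    """Assign one victim item per selected transaction, taking turns
    over the items of the rule. psi_percent is an integer percentage."""
    # Number of transactions to sanitize, rounded up with integer arithmetic.
    n = -(-len(sensitive_tids) * (100 - psi_percent) // 100)
    turns = cycle(rule_items)
    return [(tid, next(turns)) for tid in sensitive_tids[:n]]

# Rule 1 (A,B -> D) with sensitive transactions T1, T3.
partial_rule1 = round_robin_victims(["A", "B", "D"], ["T1", "T3"], 50)
full_rule1 = round_robin_victims(["A", "B", "D"], ["T1", "T3"], 0)
# Rule 2 (A,C -> D) with sensitive transactions T1, T4.
full_rule2 = round_robin_victims(["A", "C", "D"], ["T1", "T4"], 0)
```

Under this reading, a threshold of 50% marks only T1 for each rule, and a threshold of 0% additionally marks T3 with B and T4 with C.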

30
Data Sharing-Based Algorithms (cont.)
Sensitive Rules (SR): Rule 1: A,B→D; Rule 2: A,C→D

The Random Algorithm (RA)
Step 1: Sensitive transactions: A,B→D: T1, T3; A,C→D: T1, T4
Step 2: Select the number of sensitive transactions: ψ = 0%
Step 3: Identify the victim items (randomly):
ψ = 50%: Victim(T1) = A; Victim(T1) = A (Partial Sanitization)
ψ = 0%: Victim(T3) = D; Victim(T4) = C (Full Sanitization)
Step 4: Sanitize the marked sensitive transactions.

31
Data Sharing-Based Algorithms (cont.)
Sensitive Rules (SR): Rule 1: A,B→D; Rule 2: A,C→D

The Item Grouping Algorithm (IGA)
Step 1: Sensitive transactions: A,B→D: T1, T3; A,C→D: T1, T4
Step 2: Select the number of sensitive transactions: ψ = 0%
Step 3: Identify the victim items (grouping sensitive rules):
Victim(T1) = D; Victim(T3) = D; Victim(T4) = D (Full Sanitization)
Step 4: Sanitize the marked sensitive transactions.

32
Heuristic 2: Size of Sensitive Transactions

For every group of K transactions:
Step 1: Distinguishing the sensitive transactions from the non-sensitive ones
Step 2: Selecting the victim item for each sensitive rule
Step 3: Computing the number of sensitive transactions to be sanitized
Step 4: Sorting the sensitive transactions by size
Step 5: Sanitizing the sensitive transactions.
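The windowing and the sort in Step 4 can be sketched as follows. The slide does not state the sort direction; ascending order (shortest transaction first) is assumed here because the SWA example in this presentation lists T3 before T1 after sorting. The function names and the toy data are hypothetical.

```python
def windows(transactions, k):
    """Split the list of transactions into consecutive groups of K."""
    for start in range(0, len(transactions), k):
        yield transactions[start:start + k]

def sensitive_by_size(window, rule):
    """Steps 1 and 4: keep the transactions that contain the whole rule,
    then order them by size (ascending order is an assumption)."""
    sensitive = [(tid, items) for tid, items in window if rule <= items]
    return sorted(sensitive, key=lambda pair: len(pair[1]))

# Hypothetical transactions as (id, items) pairs.
transactions = [
    ("T1", {"A", "B", "C", "D"}),
    ("T2", {"A", "B", "C"}),
    ("T3", {"A", "B", "D"}),
    ("T4", {"A", "C", "D"}),
]
groups = list(windows(transactions, 2))
ordered = [tid for tid, _ in sensitive_by_size(transactions, frozenset("ABD"))]
```

Processing one window at a time keeps memory use bounded by K, regardless of the size of the database.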

33
Novelties of this Approach

The notion of a disclosure threshold for every single pattern: Mining Permissions (MP).
Each mining permission $mp = \langle sr_i, \psi_i \rangle$, where $\forall i\; sr_i \in$ set of sensitive rules SR and $\psi_i \in [0 \ldots 1]$.
Mining Permissions allow a DBA to put different weights for different rules to hide.
All the thresholds $\psi_i$ can also be set to the same value, if needed.
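A set of mining permissions can be modelled as a mapping from each sensitive rule to its own threshold. The sketch below turns such a mapping into the number of transactions to sanitize per rule. Thresholds are written as integer percentages to avoid floating-point rounding; the rounding direction, the rule labels, and the transaction counts are hypothetical.

```python
def transactions_to_sanitize(mining_permissions, sensitive_counts):
    """mining_permissions: dict rule -> disclosure threshold in percent.
    sensitive_counts: dict rule -> number of sensitive transactions.
    Returns dict rule -> number of transactions to sanitize (rounded up)."""
    return {
        rule: -(-sensitive_counts[rule] * (100 - psi) // 100)
        for rule, psi in mining_permissions.items()
    }

# Hypothetical permissions: a stricter threshold for rule1 than for rule2.
permissions = {"rule1": 30, "rule2": 25, "rule3": 100}
counts = {"rule1": 10, "rule2": 8, "rule3": 5}
plan = transactions_to_sanitize(permissions, counts)
```

A threshold of 100% leaves a rule fully disclosed (nothing is sanitized), while 0% asks for every sensitive transaction of that rule to be altered.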

34
Data Sharing-Based Algorithms (cont.)
Sensitive Rules (SR): Rule 1: A,B→D; Rule 2: A,C→D

The Sliding Window Algorithm (SWA)
Step 1: Sensitive transactions: A,B→D: T1, T3; A,C→D: T1, T4
Step 2: Identify the victim items (based on frequencies of the items in SR):
Victim(T3) = B; Victim(T4) = A; Victim(T1) = D
Step 3: Select the number of sensitive transactions: ψ = 0%
Step 4: Sort sensitive transactions by size: A,B→D: T3, T1; A,C→D: T4, T1
Step 5: Sanitize the marked sensitive transactions.

35
Results and Evaluation: Datasets
Dataset     Records   Items    Avg. Length   Shortest Record   Longest Record
BMS         59,602    497      2.51          1                 145
Retail      88,162    16,470   10.30         1                 76
Kosarak     990,573   41,270   8.10          1                 1065
Reuters     7,774     26,639   46.81         1                 427
Accidents   340,183   468      33.81         18                51
Mushroom    8,124     119      23.00         23                23
Chess       3,196     75       37.00         37                37
Connect     67,557    129      43.00         43                43
Pumsb       49,046    2,113    74.00         74                74
Row groups: Association Rules; Clustering
Datasets used in our performance evaluation
36
Data Sharing-Based Algorithms

Item Grouping Algorithm (IGA): Oliveira & Zaïane, PSDM 2002.
Sliding Window Algorithm (SWA): Oliveira & Zaïane, ICDM 2003.
Round Robin Algorithm (RRA): Oliveira & Zaïane, IDEAS 2003.
Random Algorithm (RA): Oliveira & Zaïane, IDEAS 2003.
Algo2a: E. Dasseni et al., IHW 2001.

37
Methodology

The sensitive rules were selected based on four scenarios:
S1: Rules with items mutually exclusive.
S2: Rules selected randomly.
S3: Rules with very high support.
S4: Rules with low support.
The effectiveness of the algorithms was measured based on:
C1: ψ = 0%, fixed the minimum support (σ) and minimum confidence (φ).
C2: the same as C1 but varied the number of sensitive rules.
C3: ψ = 0%, fixed the minimum confidence (φ) and the number of sensitive rules, and varied the minimum support (σ).

38
Evaluating the Window Size for SWA
Disclosure threshold ψ = 25%
K = 40,000: window size representing 45.37% of the Retail dataset
K = 40,000: window size representing 4.04% of the Kosarak dataset
39
Special Cases of Data Sanitization
SWA: [rule1, 30%], [rule2, 25%], [rule3, 15%], [rule4, 45%], [rule5, 15%], [rule6, 20%]
K = 100,000
Metric        Kosarak   Retail   Reuters   BMS-1
MC            37.22     31.07    46.48     8.68
HF            5.57      7.45     0.01      21.84
Dif (D, D')   1.68      1.24     0.63      0.70
An example of different thresholds for the sensitive rules in scenario S3.

            ψ = 0%          ψ = 5%          ψ = 10%         ψ = 15%         ψ = 25%
Algorithm   MC      HF      MC      HF      MC      HF      MC      HF      MC      HF
IGA         66.31   0.00    64.77   0.66    63.23   0.83    60.94   1.32    56.26   1.99
RRA         64.02   0.00    61.18   7.28    58.15   6.46    55.12   7.62    46.46   15.73
RA          63.86   0.00    60.12   7.12    56.72   7.62    54.29   7.95    46.48   16.39
SWA         65.29   0.00    55.58   1.16    48.31   1.82    42.67   3.31    27.74   15.89
Effect of ψ on misses cost and hiding failure in the dataset Retail
40
CPU Time
Results of CPU time for the sanitization process
41
Pattern Sharing-Based Algorithm

Downright Sanitizing Algorithm (DSA): Oliveira & Zaïane, PAKDD 2004.
We used the data-sharing algorithm IGA for our comparison study.
Methodology
IGA:
We used IGA to sanitize the datasets.
We used Apriori to extract the rules to share (all the datasets).
DSA:
We used Apriori to extract the rules from the datasets.
We used DSA to sanitize the rules mined in the previous step.
42
CPU Time
Results of CPU time for the sanitization process
43
Basic Definitions

Definition 1: Frequent Itemset Graph
Definition 2: Itemset Level
Definition 3: Frequent Itemset Graph Level

44
Possible Inference Channels

Inferences: based on non-restrictive rules, someone tries to deduce one or more restrictive rules that are not supposed to be discovered.

45
Heuristic 3: Rule Sanitization

Step 1: Identifying the sensitive itemsets.
Step 2: Selecting the subsets to sanitize.
Step 3: Sanitizing the set of supersets of marked pairs in level 1.
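A rough sketch of the three steps on a set of frequent itemsets, under assumptions that the slide does not confirm: each sensitive itemset has at least two items, the subset marked in Step 2 is a pair, the pair is chosen as the first one in sorted order (a hypothetical rule), and Step 3 drops the marked pair together with every itemset that contains it.

```python
from itertools import combinations

def sanitize_itemsets(frequent_itemsets, sensitive_itemsets):
    """Return the itemsets that remain shareable after sanitization."""
    marked_pairs = set()
    for itemset in sensitive_itemsets:            # Step 1: sensitive itemsets
        pairs = combinations(sorted(itemset), 2)  # Step 2: candidate pairs
        marked_pairs.add(frozenset(next(pairs)))  # hypothetical: take the first
    # Step 3: remove every itemset that is a superset of a marked pair.
    return {
        itemset for itemset in frequent_itemsets
        if not any(pair <= itemset for pair in marked_pairs)
    }

# Hypothetical frequent itemsets; {A, B, D} is the sensitive one.
frequent = {frozenset(s) for s in ["A", "B", "D", "AB", "AD", "BD", "ABD"]}
shared = sanitize_itemsets(frequent, [frozenset("ABD")])
```

Removing the pair {A, B} along with {A, B, D} is what closes the inference channel: a recipient who saw {A, B}, {A, D} and {B, D} could otherwise suspect that {A, B, D} is frequent as well.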

46
Object Similarity-Based Representation (OSBR)

The Security of the OSBR
Lemma 1: Let $DM_{m \times m}$ be a dissimilarity matrix, where m is the number of objects. It is impossible to determine the coordinates of the two objects by knowing only the distance between them.
The Complexity of the OSBR
Communication cost is of the order $O(m^2)$, where m is the number of objects under analysis.

47
Dimensionality Reduction Transformation (DRBT)

General Assumptions:
The attribute values subjected to clustering are numerical only.
In PPC over centralized data, object IDs should be replaced by fictitious identifiers.
In PPC over vertically partitioned data, object IDs are used for the join purposes between the parties involved in the solution.
The transformation (random projection) applied to the data might slightly modify the distances between data points.

48
Dimensionality Reduction Transformation (DRBT)

PPC over Centralized Data (General Approach):
Step 1 - Suppressing identifiers.
Step 2 - Normalizing attribute values subjected to clustering.
Step 3 - Reducing the dimension of the original dataset by using random projection.
Step 4 - Computing the error that the distances in k-d space suffer from
PPC over Vertically Partitioned Data:
It is a generalization of the solution for PPC over centralized data.
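The error formula for Step 4 is cut off in the transcript. One common way to quantify how much a projection distorts pairwise distances is a stress-style ratio: the sum of squared differences between reduced and original distances, divided by the sum of squared original distances. The sketch below uses that ratio as an assumption, with Euclidean distance and hypothetical names.

```python
import math
from itertools import combinations

def distance_error(original, reduced):
    """Stress-style ratio over all object pairs (assumed definition):
    sum((d_reduced - d_original)^2) / sum(d_original^2)."""
    num = 0.0
    den = 0.0
    for i, j in combinations(range(len(original)), 2):
        d = math.dist(original[i], original[j])
        d_hat = math.dist(reduced[i], reduced[j])
        num += (d_hat - d) ** 2
        den += d ** 2
    return num / den

# Hypothetical example: dropping the second coordinate of two points.
original = [(0.0, 0.0), (3.0, 4.0)]
reduced = [(0.0,), (3.0,)]
error = distance_error(original, reduced)
```

A value of 0 means all pairwise distances were preserved exactly, so clustering on the reduced data would see the same geometry as on the original.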

49
Dimensionality Reduction Transformation (DRBT)

The Accuracy of the DRBT

Precision (P), Recall (R)
      C1        C2        ...   Ck
C1    freq1,1   freq1,2   ...   freq1,k
C2    freq2,1   freq2,2   ...   freq2,k
...
Ck    freqk,1   freqk,2   ...   freqk,k
F-measure (F)
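The precision, recall and F-measure formulas are missing from the transcript. The sketch below uses the usual clustering variant as an assumption: for each original cluster (row) and each cluster found after the transformation (column), precision is the cell divided by the column total, recall is the cell divided by the row total, F is their harmonic mean, and the overall score weights the best F of every row by the row size. Names are hypothetical.

```python
def f_measure(freq):
    """freq[i][j]: number of objects of original cluster i that end up
    in cluster j after the transformation (square k x k matrix)."""
    k = len(freq)
    total = sum(sum(row) for row in freq)
    col_sizes = [sum(freq[i][j] for i in range(k)) for j in range(k)]
    overall = 0.0
    for i in range(k):
        row_size = sum(freq[i])
        best = 0.0
        for j in range(k):
            if freq[i][j] == 0:
                continue  # precision and recall are both zero here
            precision = freq[i][j] / col_sizes[j]
            recall = freq[i][j] / row_size
            best = max(best, 2 * precision * recall / (precision + recall))
        overall += row_size / total * best
    return overall

# Hypothetical confusion matrices for k = 2.
perfect = f_measure([[5, 0], [0, 5]])
slightly_off = f_measure([[4, 1], [1, 4]])
```

A score of 1.0 means the clusters found in the reduced data coincide with those found in the original data.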
50
Evaluation DRBT

Methodology
Step 1: Attribute normalization.
Step 2: Dimensionality reduction (two approaches).
Step 3: Computation of the error produced on reduced datasets.
Step 4: Run K-means to find the clusters in the original and reduced datasets.
Step 5: Computation of F-measure (experiments repeated 10 times).
Step 6: Comparison of the clusters generated from the original and the reduced datasets.

51
Thank You!
