Data Transformation for Privacy-Preserving Data Mining - PowerPoint PPT Transcript

1
Data Transformation for Privacy-Preserving Data Mining
Database Laboratory
  • Stanley R. M. Oliveira
  • Database Systems Laboratory
  • Computing Science Department
  • University of Alberta, Canada

Graduate Seminar November 26th, 2004
2
Motivation
  • Changes in technology are making privacy harder to preserve.
  • The new challenge of Statistical Offices.
  • Data mining plays an outstanding role in business collaboration.
  • The traditional solution ("all or nothing") has been too rigid.
  • The need for techniques to enforce privacy concerns when data are shared for mining.

3
PPDM: Increasing Number of Papers
4
PPDM: Privacy Violation
  • Privacy violation in data mining = misuse of data.
  • Defining privacy preservation in data mining:
  • Individual privacy preservation: protection of personally identifiable information.
  • Collective privacy preservation: protection of users' collective activity.

5
A Few Examples of Scenarios in PPDM
  • Scenario 1: A hospital shares some data for research purposes.
  • Scenario 2: Outsourcing the data mining process.
  • Scenario 3: A collaboration between an Internet marketing company and an on-line retail company.

6
A Taxonomy of the Existing Solutions
Data Partitioning
Data Modification
Data Restriction
Data Ownership
Fig. 1: A Taxonomy of PPDM Techniques
7
Problem Definition
  • To transform a database into a new one that
    conceals sensitive information while preserving
    general patterns and trends from the original
    database.

8
Problem Definition (cont.)
  • Problem 1: Privacy-Preserving Association Rule Mining
  • I do not address the privacy of individuals, but the problem of protecting sensitive knowledge.
  • Assumptions:
  • The data owners have to know in advance some knowledge (rules) that they want to protect.
  • The individual data values (e.g., a specific item) are not restricted, but rather the relationships between items.

9
Problem Definition (cont.)
  • Problem 2: Privacy-Preserving Clustering
  • I protect the underlying attribute values of objects subjected to clustering analysis.
  • Assumptions:
  • Given a data matrix Dm×n, the goal is to transform D into D' so that the following restrictions hold:
  • A transformation T: D → D' must preserve the privacy of individual records.
  • The similarity between objects in D and D' must be the same or only slightly altered by the transformation process.

10
A Framework for Privacy-Preserving Data Mining
11
Privacy-Preserving Association Rule Mining
12
Privacy-Preserving Association Rule Mining
A taxonomy of sanitizing algorithms
13
Heuristic 1: Degree of Sensitive Transactions
  • Definition: Let D be a transactional database and ST the set of all sensitive transactions in D. The degree of a sensitive transaction t, such that t ∈ ST, is defined as the number of sensitive association rules that can be found in t.

Degree(T1) = 2; Degree(T3) = 1; Degree(T4) = 1
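
As an illustration, a minimal Python sketch of this definition (names are illustrative; transactions and rules are modeled as item sets):

```python
# A minimal sketch of Heuristic 1: the degree of a sensitive transaction
# is the number of sensitive rules whose items it fully contains.

def degree(transaction, sensitive_rules):
    """Count the sensitive rules supported by this transaction."""
    return sum(rule <= transaction for rule in sensitive_rules)

# The slide's example: rules A,B -> D and A,C -> D, written as item sets.
rules = [{"A", "B", "D"}, {"A", "C", "D"}]
t1 = {"A", "B", "C", "D"}    # supports both rules -> degree 2
t3 = {"A", "B", "D"}         # supports only the first -> degree 1

assert degree(t1, rules) == 2
assert degree(t3, rules) == 1
```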
14
Data Sharing-Based Algorithms
  1. Scan the database and identify the sensitive transactions for each restrictive pattern;
  2. Based on the disclosure threshold ψ, compute the number of sensitive transactions to be sanitized;
  3. For each restrictive pattern, identify a candidate item that should be eliminated (the victim item);
  4. Based on the number computed in step 2, remove the victim items from the sensitive transactions.
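
A hedged sketch of this four-step loop, assuming transactions are item sets; the victim-selection policy is the pluggable part that distinguishes the concrete algorithms (RRA, RA, IGA) shown next:

```python
# A sketch of the data sharing-based sanitization loop. The victim policy
# is a placeholder; each concrete algorithm supplies its own strategy.
import math
import random

def sanitize(database, sensitive_rules, psi, choose_victim):
    db = [set(t) for t in database]
    for rule in sensitive_rules:
        # Step 1: identify the sensitive transactions for this rule.
        sensitive = [t for t in db if rule <= t]
        # Step 2: with disclosure threshold psi, sanitize a fraction
        # (1 - psi) of them (psi = 0 means full sanitization).
        n_sanitize = math.ceil(len(sensitive) * (1 - psi))
        # Steps 3-4: choose a victim item and remove it.
        for t in sensitive[:n_sanitize]:
            t.discard(choose_victim(rule, t))
    return db

# Example with a random victim choice (roughly the RA policy):
db = [{"A", "B", "C", "D"}, {"A", "B", "D"}, {"A", "C", "D"}]
rules = [{"A", "B", "D"}, {"A", "C", "D"}]
print(sanitize(db, rules, psi=0.0,
               choose_victim=lambda rule, t: random.choice(sorted(rule))))
```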

15
Data Sharing-Based Algorithms
Sensitive Rules (SR): Rule 1: A,B → D; Rule 2: A,C → D
  • The Round Robin Algorithm (RRA)
  • Step 1: Sensitive transactions: A,B→D: {T1, T3}; A,C→D: {T1, T4}
  • Step 2: Select the number of sensitive transactions: (a) ψ = 50%; (b) ψ = 0%
  • Step 3: Identify the victim items (taking turns):
  • ψ = 50%: Victim(T1) = A; Victim(T1) = A (Partial Sanitization)
  • ψ = 0%: Victim(T3) = B; Victim(T4) = C (Full Sanitization)
  • Step 4: Sanitize the marked sensitive transactions.

16
Data Sharing-Based Algorithms (cont.)
Sensitive Rules (SR): Rule 1: A,B → D; Rule 2: A,C → D
  • The Random Algorithm (RA)
  • Step 1: Sensitive transactions: A,B→D: {T1, T3}; A,C→D: {T1, T4}
  • Step 2: Select the number of sensitive transactions: ψ = 0%
  • Step 3: Identify the victim items (randomly):
  • ψ = 50%: Victim(T1) = A; Victim(T1) = A (Partial Sanitization)
  • ψ = 0%: Victim(T3) = D; Victim(T4) = C (Full Sanitization)
  • Step 4: Sanitize the marked sensitive transactions.

17
Data Sharing-Based Algorithms (cont.)
Sensitive Rules (SR): Rule 1: A,B → D; Rule 2: A,C → D
  • The Item Grouping Algorithm (IGA)
  • Step 1: Sensitive transactions: A,B→D: {T1, T3}; A,C→D: {T1, T4}
  • Step 2: Select the number of sensitive transactions: ψ = 0%
  • Step 3: Identify the victim items (by grouping sensitive rules):
  • Victim(T1) = D; Victim(T3) = D; Victim(T4) = D (Full Sanitization)
  • Step 4: Sanitize the marked sensitive transactions.

18
Heuristic 2: Size of Sensitive Transactions
  • For every group of K transactions:
  • Step 1: Distinguishing the sensitive transactions from the non-sensitive ones;
  • Step 2: Selecting the victim item for each sensitive rule;
  • Step 3: Computing the number of sensitive transactions to be sanitized;
  • Step 4: Sorting the sensitive transactions by size;
  • Step 5: Sanitizing the sensitive transactions.

19
Novelties of this Approach
  • The notion of a disclosure threshold for every single pattern → Mining Permissions (MP).
  • Each mining permission mp = ⟨sri, ψi⟩, where:
  • ∀i, sri ∈ SR (the set of sensitive rules), and
  • ψi ∈ [0, 1].
  • Mining permissions allow a DBA to assign different weights to different rules to hide.
  • All the thresholds ψi can also be set to the same value, if needed.
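
A small sketch of how mining permissions could be represented (hypothetical names):

```python
# A sketch of mining permissions: each sensitive rule sr_i carries its
# own disclosure threshold psi_i in [0, 1]. Names are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class MiningPermission:
    rule: frozenset   # the items of the sensitive rule sr_i
    psi: float        # disclosure threshold psi_i, 0 <= psi_i <= 1

permissions = [
    MiningPermission(frozenset({"A", "B", "D"}), psi=0.30),
    MiningPermission(frozenset({"A", "C", "D"}), psi=0.25),
]
```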

20
Data Sharing-Based Algorithms (cont.)
Sensitive Rules (SR): Rule 1: A,B → D; Rule 2: A,C → D
  • The Sliding Window Algorithm (SWA)
  • Step 1: Sensitive transactions: A,B→D: {T1, T3}; A,C→D: {T1, T4}
  • Step 2: Identify the victim items (based on the frequencies of the items in SR):
  • Victim(T3) = B; Victim(T4) = A; Victim(T1) = D
  • Step 3: Select the number of sensitive transactions: ψ = 0%
  • Step 4: Sort the sensitive transactions by size: A,B→D: {T3, T1}; A,C→D: {T4, T1}
  • Step 5: Sanitize the marked sensitive transactions.

21
Data Sharing-Based Metrics
1. Hiding Failure (HF)
2. Misses Cost (MC)
3. Artifactual Patterns (AP)
4. Difference between D and D'
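
The slide lists the metrics without their formulas; a reconstruction in the form commonly used in this line of work (R_P denotes the sensitive rules, ~R_P the non-sensitive ones, P and P' the patterns mined from D and D', and f_X(i) the frequency of item i in database X):

```latex
\mathrm{HF} = \frac{|R_P(D')|}{|R_P(D)|}, \qquad
\mathrm{MC} = \frac{|{\sim}R_P(D)| - |{\sim}R_P(D')|}{|{\sim}R_P(D)|}, \qquad
\mathrm{AP} = \frac{|P'| - |P \cap P'|}{|P'|}, \qquad
\mathrm{Dif}(D, D') = \frac{\sum_i |f_D(i) - f_{D'}(i)|}{\sum_i f_D(i)}
```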
22
Pattern Sharing-Based Algorithm
23
Basic Definitions
  • Definition 1 Frequent Itemset Graph
  • Definition 2 Itemset Level
  • Definition 3 Frequent Itemset Graph Level

24
Possible Inference Channels
  • Inferences based on non-restrictive rules: an adversary tries to deduce one or more restrictive rules that are not supposed to be discovered.

25
Pattern Sharing-Based Metrics
  • 1. Side Effect Factor (SEF): measures the amount of non-sensitive rules removed as a side effect of the sanitization process.
  • 2. Recovery Factor (RF): RF ∈ {0, 1}; RF = 1 when a sensitive rule can be recovered from the shared patterns (all of its subsets are present), and 0 otherwise.

26
Heuristic 3: Rule Sanitization
  • Step 1: Identifying the sensitive itemsets;
  • Step 2: Selecting the subsets to sanitize;
  • Step 3: Sanitizing the set of supersets of marked pairs in level 1.
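
A hedged sketch of this heuristic, assuming frequent itemsets are represented as frozensets; the pair-selection policy here (lexicographically first pair) is a placeholder for the algorithm's actual criterion:

```python
# A sketch of Heuristic 3: mark one size-2 subset (a "pair" at level 1 of
# the frequent itemset graph) per sensitive itemset, then drop that pair
# and every superset of it from the shared itemsets.
from itertools import combinations

def sanitize_rules(frequent_itemsets, sensitive_itemsets):
    # Step 2: select one pair per sensitive itemset (placeholder policy).
    marked = {frozenset(min(combinations(sorted(s), 2)))
              for s in sensitive_itemsets}
    # Step 3: remove each marked pair and all of its supersets.
    return [f for f in frequent_itemsets
            if not any(pair <= f for pair in marked)]

shared = sanitize_rules(
    [frozenset(x) for x in ({"A"}, {"B"}, {"A", "B"}, {"A", "B", "D"})],
    [{"A", "B", "D"}],
)   # keeps {A} and {B}; drops {A,B} and {A,B,D}
```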

27
Privacy-Preserving Clustering (PPC)
  • PPC over Centralized Data:
  • The attribute values subjected to clustering are available in a central repository.
  • PPC over Vertically Partitioned Data:
  • There are k parties sharing data for clustering, where k ≥ 2.
  • The attribute values of the objects are split across the k parties.
  • Object IDs are revealed for join purposes only; the values of the associated attributes are private.

28
Object Similarity-Based Representation (OSBR)
Example 1: Sharing data for research purposes (OSBR).
29
Object Similarity-Based Representation (OSBR)
  • The Security of the OSBR:
  • Lemma 1: Let DMm×m be a dissimilarity matrix, where m is the number of objects. It is impossible to determine the coordinates of two objects by knowing only the distance between them.
  • The Complexity of the OSBR:
  • The communication cost is of the order O(m²), where m is the number of objects under analysis.
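
A minimal sketch of the OSBR idea, assuming Euclidean distance as the dissimilarity measure: the owner shares only the m×m matrix, never the attribute values. Its O(m²) entries are exactly the communication cost stated above.

```python
# A minimal OSBR-style sketch (assuming Euclidean dissimilarity): share
# the m x m distance matrix instead of the raw attribute values.
import numpy as np

def dissimilarity_matrix(data):
    """data: an (m, d) array of private attribute values."""
    diff = data[:, None, :] - data[None, :, :]   # pairwise differences
    return np.sqrt((diff ** 2).sum(axis=-1))     # (m, m) distances

objects = np.array([[1.0, 2.0], [4.0, 6.0], [0.0, 0.0]])
dm = dissimilarity_matrix(objects)   # only this matrix is shared
```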

30
Object Similarity-Based Representation (OSBR)
  • Limitations of the OSBR:
  • Lemma 2: Knowing the coordinates of a particular object i and the distance r between i and any other object j, it is possible to estimate the attribute values of j.
  • The OSBR is vulnerable to attacks in the case of vertically partitioned data (Lemma 2).
  • Conclusion → the OSBR is effective for PPC over centralized data only, but expensive.

31
Dimensionality Reduction Transformation (DRBT)
  • General Assumptions:
  • The attribute values subjected to clustering are numerical only.
  • In PPC over centralized data, object IDs should be replaced by fictitious identifiers.
  • In PPC over vertically partitioned data, object IDs are used for join purposes between the parties involved in the solution.
  • The transformation (random projection) applied to the data might slightly modify the distances between data points.

32
Dimensionality Reduction Transformation (DRBT)
  • Random projection from d to k dimensions:
  • D'n×k = Dn×d · Rd×k (a linear transformation), where:
  • D is the original data, D' is the reduced data, and R is a random matrix.
  • R is generated by first setting each entry rij as follows:
  • (R1) rij is drawn i.i.d. from N(0,1), and then the columns of R are normalized to unit length; or
  • (R2) rij = √3 × (+1 with probability 1/6; 0 with probability 2/3; −1 with probability 1/6).
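
A sketch of the two generation schemes (helper names are illustrative):

```python
# A sketch of the two ways of generating R described above.
import numpy as np

def random_matrix_r1(d, k, rng):
    # (R1): i.i.d. N(0,1) entries, columns normalized to unit length.
    R = rng.normal(0.0, 1.0, size=(d, k))
    return R / np.linalg.norm(R, axis=0)

def random_matrix_r2(d, k, rng):
    # (R2): sqrt(3) * {+1 w.p. 1/6, 0 w.p. 2/3, -1 w.p. 1/6}.
    entries = rng.choice([1.0, 0.0, -1.0], size=(d, k), p=[1/6, 2/3, 1/6])
    return np.sqrt(3.0) * entries

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 37))                 # original data, d = 37
D_red = D @ random_matrix_r2(37, 16, rng)      # reduced data, k = 16
```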

33
Dimensionality Reduction Transformation (DRBT)
  • PPC over Centralized Data (General Approach):
  • Step 1: Suppressing the identifiers.
  • Step 2: Normalizing the attribute values subjected to clustering.
  • Step 3: Reducing the dimension of the original dataset by using random projection.
  • Step 4: Computing the error that the distances in the k-dimensional space suffer.
  • PPC over Vertically Partitioned Data:
  • It is a generalization of the solution for PPC over centralized data.
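
One plausible way to compute Step 4's error is the average relative distortion of pairwise distances (a sketch; the actual stress function used may differ):

```python
# A sketch of Step 4: average relative distortion of the pairwise
# distances after random projection (one plausible error measure).
import numpy as np
from scipy.spatial.distance import pdist

def avg_distance_error(D, D_reduced):
    d_orig = pdist(D)           # condensed pairwise distances, original
    d_red = pdist(D_reduced)    # pairwise distances after projection
    return np.mean(np.abs(d_red - d_orig) / d_orig)
```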

34
Dimensionality Reduction Transformation (DRBT)
ID  | Att1    Att2   Att3  | Att1    Att2     Att3
123 | -50.40  17.33  12.31 | -55.50  -95.26  -107.93
342 | -37.08   6.27  12.22 | -51.00  -84.29   -83.13
254 | -55.86  20.69  -0.66 | -65.50  -70.43   -66.97
446 | -37.61 -31.66 -17.58 | -85.50 -140.87   -72.74
286 | -62.72  37.64  18.16 | -88.50  -50.22  -102.76
(Attribute values before the transformation on the left; after it on the right.)
35
Dimensionality Reduction Transformation (DRBT)
  • The Security of the DRBT:
  • Lemma 3: A random projection from d to k dimensions, where k ≪ d, is a non-invertible linear transformation.
  • The Complexity of the DRBT:
  • The space requirement is of order O(m), where m is the number of objects.
  • The communication cost is of order O(m·l·k), where l represents the size (in bits) required to transmit each value of the reduced dataset from one party to a central or third party.

36
Dimensionality Reduction Transformation (DRBT)
  • The Accuracy of the DRBT:

Precision (P) and Recall (R) are computed from the matrix of frequencies freq(i,j), where freq(i,j) counts the objects of original cluster Ci that are assigned to cluster C'j in the reduced data; the F-measure (F) combines the two.

        C'1       C'2       ...   C'k
C1      freq1,1   freq1,2   ...   freq1,k
C2      freq2,1   freq2,2   ...   freq2,k
...
Ck      freqk,1   freqk,2   ...   freqk,k
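
The formulas were dropped from the slide; in their standard form (m is the total number of objects, |Ci| and |C'j| the cluster sizes):

```latex
P(i,j) = \frac{freq_{i,j}}{|C'_j|}, \qquad
R(i,j) = \frac{freq_{i,j}}{|C_i|}, \qquad
F(i,j) = \frac{2\,P(i,j)\,R(i,j)}{P(i,j) + R(i,j)}, \qquad
F = \sum_{i} \frac{|C_i|}{m} \max_{j} F(i,j)
```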
37
Results and Evaluation
Dataset   | #Records | #Items | Avg. Length | Shortest Record | Longest Record
BMS       |   59,602 |    497 |        2.51 |               1 |            145
Retail    |   88,162 | 16,470 |       10.30 |               1 |             76
Kosarak   |  990,573 | 41,270 |        8.10 |               1 |           1065
Reuters   |    7,774 | 26,639 |       46.81 |               1 |            427
Accidents |  340,183 |    468 |       33.81 |              18 |             51
Mushroom  |    8,124 |    119 |       23.00 |              23 |             23
Chess     |    3,196 |     75 |       37.00 |              37 |             37
Connect   |   67,557 |    129 |       43.00 |              43 |             43
Pumsb     |   49,046 |  2,113 |       74.00 |              74 |             74
Datasets used in our performance evaluation (association rules: BMS, Retail, Kosarak, Reuters; clustering: Accidents, Mushroom, Chess, Connect, Pumsb).
38
Data Sharing-Based Algorithms
  • Item Grouping Algorithm (IGA): Oliveira & Zaïane, PSDM 2002.
  • Sliding Window Algorithm (SWA): Oliveira & Zaïane, ICDM 2003.
  • Round Robin Algorithm (RRA): Oliveira & Zaïane, IDEAS 2003.
  • Random Algorithm (RA): Oliveira & Zaïane, IDEAS 2003.
  • Algo2a: E. Dasseni et al., IHW 2001.

39
Methodology
  • The sensitive rules were selected based on four scenarios:
  • S1: Rules with mutually exclusive items.
  • S2: Rules selected randomly.
  • S3: Rules with very high support.
  • S4: Rules with low support.
  • The effectiveness of the algorithms was measured based on:
  • C1: ψ = 0, with the minimum support and minimum confidence fixed.
  • C2: the same as C1, but varying the number of sensitive rules.
  • C3: ψ = 0, with the minimum confidence and the number of sensitive rules fixed, varying the minimum support.

40
Evaluating the Window Size for SWA
Disclosure threshold ψ = 25%
K = 40,000: a window size representing 45.37% of the Retail dataset
K = 40,000: a window size representing 4.04% of the Kosarak dataset
41
Measuring Effectiveness
Misses Cost under condition C1
Misses Cost under condition C3
42
Special Cases of Data Sanitization
SWA: ⟨rule1, 30%⟩, ⟨rule2, 25%⟩, ⟨rule3, 15%⟩, ⟨rule4, 45%⟩, ⟨rule5, 15%⟩, ⟨rule6, 20%⟩; K = 100,000

Metric     | Kosarak | Retail | Reuters | BMS-1
MC         |   37.22 |  31.07 |   46.48 |   8.68
HF         |    5.57 |   7.45 |    0.01 |  21.84
Dif(D, D') |    1.68 |   1.24 |    0.63 |   0.70
An example of different thresholds for the sensitive rules in scenario S3.

          ψ = 0%       ψ = 5%       ψ = 10%      ψ = 15%      ψ = 25%
Algorithm MC     HF    MC     HF    MC     HF    MC     HF    MC     HF
IGA       66.31  0.00  64.77  0.66  63.23  0.83  60.94  1.32  56.26  1.99
RRA       64.02  0.00  61.18  7.28  58.15  6.46  55.12  7.62  46.46  15.73
RA        63.86  0.00  60.12  7.12  56.72  7.62  54.29  7.95  46.48  16.39
SWA       65.29  0.00  55.58  1.16  48.31  1.82  42.67  3.31  27.74  15.89
Effect of ψ on misses cost and hiding failure in the dataset Retail.
43
CPU Time
Results of CPU time for the sanitization process
44
Pattern Sharing-Based Algorithm
  • Downright Sanitizing Algorithm (DSA): Oliveira & Zaïane, PAKDD 2004.
  • We used the data sharing-based algorithm IGA for our comparison study.
  • Methodology:
  • We used IGA to sanitize the datasets.
  • We used Apriori to extract the rules to share (all the datasets).
  • We used Apriori to extract the rules from the datasets.
  • We used DSA to sanitize the rules mined in the previous step.

45
Measuring Effectiveness
Dataset | S1        | S2  | S3        | S4
Kosarak | IGA       | DSA | DSA       | DSA
Retail  | DSA       | DSA | DSA       | IGA
Reuters | DSA       | DSA | DSA       | IGA
BMS-1   | DSA       | DSA | DSA       | DSA
The best algorithm in terms of misses cost (ψ = 0, 6 sensitive rules).

Dataset | S1        | S2  | S3        | S4
Kosarak | IGA       | DSA | DSA       | IGA / DSA
Retail  | DSA       | DSA | IGA / DSA | IGA
Reuters | IGA / DSA | DSA | DSA       | IGA
BMS-1   | DSA       | DSA | DSA       | DSA
The best algorithm in terms of misses cost, varying the number of rules to sanitize (ψ = 0).

Dataset | S1        | S2  | S3        | S4
Kosarak | IGA       | DSA | DSA       | DSA
Retail  | DSA       | DSA | DSA       | IGA
Reuters | DSA       | DSA | DSA       | IGA
BMS-1   | DSA       | DSA | DSA       | DSA
The best algorithm in terms of side effect factor (ψ = 0, 6 sensitive rules).
46
CPU Time
Results of CPU time for the sanitization process
47
Lessons Learned
  • Large datasets are our friends.
  • The benefit of an index: at most two scans to sanitize a dataset.
  • The data sanitization paradox.
  • The outstanding performance of IGA and DSA.
  • Rule sanitization reduces inference channels and does not change the support and confidence of the shared rules.
  • DSA reduces the flexibility of information sharing.

48
Evaluation of the DRBT
  • Methodology:
  • Step 1: Attribute normalization.
  • Step 2: Dimensionality reduction (two approaches).
  • Step 3: Computation of the error produced on the reduced datasets.
  • Step 4: Running K-means to find the clusters in the original and reduced datasets.
  • Step 5: Computation of the F-measure (experiments repeated 10 times).
  • Step 6: Comparison of the clusters generated from the original and the reduced datasets.
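
A sketch of Steps 4-6, assuming scikit-learn's KMeans is an acceptable stand-in for the K-means implementation used; f_measure follows the formulas on slide 36:

```python
# A sketch of Steps 4-6: cluster the original and the reduced data,
# then compare the two partitions with the F-measure.
import numpy as np
from sklearn.cluster import KMeans

def f_measure(labels_orig, labels_red, k):
    m = len(labels_orig)
    freq = np.zeros((k, k))                  # freq[i, j] as on slide 36
    for i, j in zip(labels_orig, labels_red):
        freq[i, j] += 1
    size_o, size_r = freq.sum(axis=1), freq.sum(axis=0)
    total = 0.0
    for i in range(k):
        best = 0.0
        for j in range(k):
            if freq[i, j]:
                p, r = freq[i, j] / size_r[j], freq[i, j] / size_o[i]
                best = max(best, 2 * p * r / (p + r))
        total += size_o[i] / m * best        # weighted best match
    return total

def evaluate(D, D_reduced, k, trials=10):
    scores = []
    for seed in range(trials):               # Step 5: repeat 10 times
        lo = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(D)
        lr = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(D_reduced)
        scores.append(f_measure(lo, lr, k))  # Step 6: compare partitions
    return float(np.mean(scores)), float(np.std(scores))
```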

49
DRBT: PPC over Centralized Data
Transformation | dr = 37 | dr = 34 | dr = 31 | dr = 28 | dr = 25 | dr = 22 | dr = 16
RP1            |    0.00 |   0.015 |   0.024 |   0.033 |   0.045 |   0.072 |   0.141
RP2            |    0.00 |   0.014 |   0.019 |   0.032 |   0.041 |   0.067 |   0.131
The error produced on the dataset Chess (do = 37).

Transformation | K = 2 (Avg, Std) | K = 3 (Avg, Std) | K = 4 (Avg, Std) | K = 5 (Avg, Std)
RP2            | 0.941, 0.014     | 0.912, 0.009     | 0.881, 0.010     | 0.885, 0.006
Average F-measure (10 trials) for the dataset Accidents (do = 18, dr = 12).

Transformation | K = 2 (Avg, Std) | K = 3 (Avg, Std) | K = 4 (Avg, Std) | K = 5 (Avg, Std)
RP2            | 1.000, 0.000     | 0.948, 0.010     | 0.858, 0.089     | 0.833, 0.072
Average F-measure (10 trials) for the dataset Iris (do = 5, dr = 3).
50
DRBT: PPC over Vertically Partitioned Data
No. Parties | RP1    | RP2
1           | 0.0762 | 0.0558
2           | 0.0798 | 0.0591
3           | 0.0870 | 0.0720
4           | 0.0923 | 0.0733
The error produced on the dataset Pumsb (do = 74).

No. Parties | K = 2 (Avg, Std) | K = 3 (Avg, Std) | K = 4 (Avg, Std) | K = 5 (Avg, Std)
1           | 0.909, 0.140     | 0.965, 0.081     | 0.891, 0.028     | 0.838, 0.041
2           | 0.904, 0.117     | 0.931, 0.101     | 0.894, 0.059     | 0.840, 0.047
3           | 0.874, 0.168     | 0.887, 0.095     | 0.873, 0.081     | 0.801, 0.073
4           | 0.802, 0.155     | 0.812, 0.117     | 0.866, 0.088     | 0.831, 0.078
Average F-measure (10 trials) for the dataset Pumsb (do = 74, dr = 38).
51
Contributions of this Research
  • Foundations for further research in PPDM.
  • A taxonomy of PPDM techniques.
  • A family of privacy-preserving methods.
  • A library of sanitizing algorithms.
  • Retrieval facilities.
  • A set of metrics.

52
Future Research
  • Privacy definition in data mining.
  • Combining sanitization and randomization.
  • New methods for PPC (k-anonymity, isometries, data distortion).
  • Sanitization of document repositories.

53
Thank You!