Maintaining Data Privacy in Association Rule Mining

1
Maintaining Data Privacy in Association Rule Mining
VLDB 2002
Authors: Shariq J. Rizvi, Jayant R. Haritsa
  • Speaker: Minghua ZHANG

Oct. 11, 2002
2
Content
  • Background
  • Problem framework
  • MASK -- distortion part
  • MASK -- mining part
  • Performance
  • Conclusion

3
Background
  • In data mining, the accuracy of the input data is
    very important for obtaining valuable mining
    results.
  • However, in real life, there are many reasons
    which lead to inaccurate data.
  • For example, users may deliberately provide
    wrong information (age, income, illness, etc.)
    to protect their privacy.
  • Problem: how to protect user privacy while still
    getting accurate mining results?

4
Background (contd)
  • Privacy and accuracy are contradictory in nature.
  • A compromise is more feasible:
  • satisfactory (not 100%) privacy and satisfactory
    (not 100%) accuracy
  • This paper studied this problem in the context of
    mining association rules.

5
Overview of the Paper
  • The authors proposed a scheme --- MASK (Mining
    Associations with Secrecy Konstraints).
  • Major idea of MASK:
  • Apply a simple probabilistic distortion to the
    original data.
  • The distortion can be done at the user's machine.
  • The miner tries to find accurate mining results,
    given the following inputs:
  • The distorted data
  • A description of the distortion procedure

6
Problem Framework
  • Database model
  • Each customer transaction is a record in the
    database.
  • A record is a fixed-length sequence of 1s and
    0s.
  • E.g., for market-basket data:
  • Length of the record = the total number of items
    sold by the market.
  • 1 = the corresponding item was bought in the
    transaction.
  • 0 = the item was not bought.
  • The database can be regarded as a two-dimensional
    boolean matrix.

7
Problem Framework (contd)
  • The matrix is very sparse. Why not use
    item-lists?
  • The data will be distorted.
  • After the distortion, it will not be as sparse as
    the original (true) data.
  • Mining objective: find frequent itemsets
  • Itemsets whose support (frequency of appearance)
    in the database is larger than a threshold.

8
  • Background
  • Problem framework
  • MASK --- distortion part
  • MASK --- mining part
  • Performance
  • Conclusion

9
MASK --- Distortion Part
  • Distortion Procedure
  • Represent a customer record by a random vector.
  • Original record X = {Xi}, where Xi = 0 or 1.
  • Distorted record Y = {Yi}, where Yi = 0 or 1.
  • Yi = Xi (with probability p)
  • Yi = 1 - Xi (with probability 1 - p)
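
The distortion step on this slide can be sketched in a few lines of Python. This is only an illustration: the function name distort_record, the sample record, and the fixed seed are assumptions of the sketch and do not come from the paper.

```python
import random

def distort_record(record, p, rng=random):
    """Keep each bit with probability p, flip it with probability 1 - p."""
    return [x if rng.random() < p else 1 - x for x in record]

rng = random.Random(0)                 # fixed seed so the run is repeatable
original = [1, 0, 0, 1, 0, 0, 0, 0]    # hypothetical customer record
distorted = distort_record(original, p=0.9, rng=rng)
```

Each bit is handled independently, so the procedure can run on the user's machine one record at a time.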

10
Quantifying Privacy
  • Privacy metric
  • The probability of reconstructing the true data
  • Consider each individual item
  • With what probability can a given 1 or 0 in the
    true matrix database be reconstructed?
  • Calculate the reconstruction probability
  • Let si = Prob(a random customer C bought the ith
    item)
  • = the true support of item i
  • The probability of correctly reconstructing a
    1 in a random item i is
  • $R_1(p, s_i) = \frac{s_i p^2}{s_i p + (1 - s_i)(1 - p)} + \frac{s_i (1 - p)^2}{s_i (1 - p) + (1 - s_i) p}$

11
Reconstruction Probability
  • Reconstruction probability of a 1 across all
    items
  • $R_1(p) = \frac{\sum_i s_i R_1(p, s_i)}{\sum_i s_i}$
  • Suppose
  • s0 = the average support of an item
  • Replacing si by s0, we get
  • $R_1(p) = \frac{s_0 p^2}{s_0 p + (1 - s_0)(1 - p)} + \frac{s_0 (1 - p)^2}{s_0 (1 - p) + (1 - s_0) p}$

12
Reconstruction Probability (contd)
  • Relationship between R1(p) and p, s0
  • Observations
  • R1(p) is high when p is near 0 or 1, and it is
    lowest when p = 0.5.
  • The curves become flatter as s0 decreases.
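
The observations above can be checked numerically with a small sketch of the R1 formula with si replaced by s0. The function name r1 and the sample values are assumptions made for illustration.

```python
def r1(p, s):
    """Probability of correctly reconstructing a 1, for support s (0 < s < 1)."""
    kept = s * p ** 2 / (s * p + (1 - s) * (1 - p))           # bit survived as 1
    flipped = s * (1 - p) ** 2 / (s * (1 - p) + (1 - s) * p)  # bit was flipped to 0
    return kept + flipped

s0 = 0.01
curve = [(p / 10, r1(p / 10, s0)) for p in range(0, 11)]  # p = 0.0, 0.1, ..., 1.0
lowest_p = min(curve, key=lambda point: point[1])[0]
```

At p = 0.5 the two terms add up to exactly s, so the distorted data tells the miner nothing beyond the support itself.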

13
Privacy Measure
  • The reconstruction probability of a 0
  • R0(p) = func(p, s0).
  • The total reconstruction probability
  • $R(p) = a R_1(p) + (1 - a) R_0(p)$
  • a is the weight parameter.
  • Privacy
  • $P(p) = (1 - R(p)) \times 100$
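
A possible way to evaluate the privacy measure is sketched below. The slide does not spell out R0; the sketch assumes it has the same form as R1 with the roles of 1s and 0s swapped (that is, s0 replaced by 1 - s0). The function names are hypothetical.

```python
def r1(p, s):
    """Probability of correctly reconstructing a 1, for support s (0 < s < 1)."""
    kept = s * p ** 2 / (s * p + (1 - s) * (1 - p))
    flipped = s * (1 - p) ** 2 / (s * (1 - p) + (1 - s) * p)
    return kept + flipped

def r0(p, s):
    # Assumption of this sketch: a 0 occurs with probability 1 - s, so
    # reconstructing a 0 mirrors reconstructing a 1 with support 1 - s.
    return r1(p, 1 - s)

def privacy(p, s0, a):
    """P(p) = (1 - R(p)) x 100, with R(p) = a R1(p) + (1 - a) R0(p)."""
    total = a * r1(p, s0) + (1 - a) * r0(p, s0)
    return (1 - total) * 100

privacy_of_ones = privacy(p=0.9, s0=0.01, a=1.0)  # only 1s are weighted
```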

14
Privacy Measure (contd)
  • Privacy vs. p
  • Observations
  • For a given value of s0, the curve shape is
    fixed.
  • The value of a determines the absolute value of
    privacy.
  • The privacy is nearly constant for a large range
    of p.
  • This provides flexibility in choosing a p that
    minimizes the error in the later mining part.

Figure: P(p) for s0 = 0.01
15
  • Background
  • Problem framework
  • MASK --- distortion part
  • MASK --- mining part
  • Performance
  • Conclusion

16
MASK --- Mining Part
  • How to estimate the accurate supports of itemsets
    from a distorted database?
  • Remember that the miner knows the value of p.
  • Estimating 1-itemset supports
  • Estimating n-itemset supports
  • The whole mining process

17
Estimating 1-itemset Supports
  • Symbols
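  • T: the original true matrix; D: the distorted
    matrix
  • i: a random item
  • $C_1^T$ and $C_0^T$: the number of 1s and 0s in the ith
    column of T
  • $C_1^D$ and $C_0^D$: the number of 1s and 0s in the ith
    column of D
  • From the distortion method, we have
  • $C_1^D \approx C_1^T p + C_0^T (1 - p) \rightarrow C_1^D = C_1^T p + C_0^T (1 - p)$
  • $C_0^D \approx C_0^T p + C_1^T (1 - p) \rightarrow C_0^D = C_0^T p + C_1^T (1 - p)$
  • Let $M = \begin{bmatrix} p & 1 - p \\ 1 - p & p \end{bmatrix}$, $C^D = \begin{bmatrix} C_1^D \\ C_0^D \end{bmatrix}$, $C^T = \begin{bmatrix} C_1^T \\ C_0^T \end{bmatrix}$;
    then $C^D = M C^T$, so $C^T = M^{-1} C^D$.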
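
For the 1-itemset case the inverse of the 2x2 matrix can be written out by hand. The sketch below does so; the function name and the sample counts are made up for illustration, and p must differ from 0.5 for the matrix to be invertible.

```python
def estimate_true_counts(c1_d, c0_d, p):
    """Solve C^D = M C^T for C^T, with M = [[p, 1-p], [1-p, p]]."""
    det = 2 * p - 1                      # determinant of M: p^2 - (1-p)^2
    if det == 0:
        raise ValueError("p = 0.5 makes M singular")
    c1_t = (p * c1_d - (1 - p) * c0_d) / det
    c0_t = (p * c0_d - (1 - p) * c1_d) / det
    return c1_t, c0_t

# Hypothetical column: 100 true 1s and 900 true 0s, distorted with p = 0.9,
# give on average 180 ones and 820 zeros in D.
c1_t, c0_t = estimate_true_counts(180, 820, p=0.9)
```

With real data the observed counts fluctuate around their expected values, so the result is an estimate rather than an exact recovery.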

18
Estimating n-itemset Supports
  • Still use $C^T = M^{-1} C^D$ to estimate support.
  • Define
  • $C_k^T$ is the number of records in T that have the
    binary form of k for the given itemset.
  • E.g., for a 3-itemset that contains the first 3
    items:
  • $C^T$ has $2^3 = 8$ rows
  • $C_3^T$ is the number of records in T of the form {0,1,1}
  • $M_{i,j} = \mathrm{Prob}(C_j^T \rightarrow C_i^D)$
  • $M_{7,3} = p^2 (1 - p)$ ($C_3^T \rightarrow C_7^D$, or $C_{011}^T \rightarrow C_{111}^D$)
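
The n-itemset matrix can be built directly from the rule above: an entry depends only on how many bits differ between pattern j and pattern i. The sketch below also inverts M by using the fact that, with independent per-bit distortion, M is the n-fold Kronecker power of the 2x2 matrix, so its inverse is the Kronecker power of the 2x2 inverse. Function names and the sample counts are assumptions of the sketch, not taken from the paper.

```python
def flips(i, j):
    """Number of bit positions in which patterns i and j differ."""
    return bin(i ^ j).count("1")

def transition_matrix(n, p):
    """M[i][j] = Prob(true pattern j is distorted into pattern i)."""
    size = 2 ** n
    return [[p ** (n - flips(i, j)) * (1 - p) ** flips(i, j)
             for j in range(size)] for i in range(size)]

def inverse_matrix(n, p):
    """Inverse of M via the per-bit 2x2 inverse; requires p != 0.5."""
    size = 2 ** n
    det = (2 * p - 1) ** n
    return [[p ** (n - flips(i, j)) * (p - 1) ** flips(i, j) / det
             for j in range(size)] for i in range(size)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

n, p = 3, 0.9
M = transition_matrix(n, p)
true_counts = [500, 100, 100, 50, 100, 50, 50, 50]    # hypothetical C^T
distorted_counts = mat_vec(M, true_counts)            # expected C^D
estimated = mat_vec(inverse_matrix(n, p), distorted_counts)
```

Here M[7][3] corresponds to pattern 011 turning into 111: two bits are kept and one is flipped.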

19
Mining Process
  • Similar to Apriori algorithm
  • Difference
  • E.g., when counting supports of 2-itemsets:
  • Apriori only needs to count the number of records
    that have value 1 for both items, i.e. of the
    form 11.
  • MASK has to keep track of all 4 combinations
    00, 01, 10 and 11 for the corresponding items.
  • $C_{2^n-1}^T$ is estimated from $C_0^D, C_1^D, \ldots, C_{2^n-1}^D$.
  • MASK requires more time and space than Apriori.
  • Some optimizations (omitted)
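
The extra bookkeeping compared with Apriori can be illustrated with a small counting routine. It is a sketch only: count_patterns and the toy records are hypothetical, and the paper's optimizations are not reproduced.

```python
def count_patterns(records, itemset):
    """Count, for every 0/1 combination of the itemset, the matching records.

    The result has 2**len(itemset) entries; index k is the combination
    whose binary form is k (first item = most significant bit).
    """
    counts = [0] * (2 ** len(itemset))
    for record in records:
        k = 0
        for item in itemset:
            k = (k << 1) | record[item]
        counts[k] += 1
    return counts

# Toy distorted database with 3 items (columns) and 4 records.
records = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [1, 1, 1]]
counts = count_patterns(records, itemset=(0, 1))   # order: 00, 01, 10, 11
```

Apriori would keep only the last entry (form 11); MASK needs the whole vector as input to the estimation step.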

20
  • Background
  • Problem framework
  • MASK --- distortion part
  • MASK --- mining part
  • Performance
  • Conclusion

21
Performance
  • Data sets
  • Synthetic database
  • 1,000,000 records, 1000 items
  • s0 = 0.01
  • Real dataset
  • Click-stream data of a retailer web site
  • 600,000 records, about 500 items
  • s0 = 0.005

22
Performance (contd)
  • Error Metrics
  • Right class, wrong support
  • Infrequent itemsets: the error doesn't matter
  • Frequent itemsets
  • Support Error (ρ)
  • Wrong class
  • Identity Error (σ)
  • false positives (σ+)
  • false negatives (σ-)
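
The slide names the metrics without defining them. The sketch below assumes the usual reading: support error is the average relative error, in percent, over itemsets correctly identified as frequent, and the identity errors are the percentages of false positives and false negatives relative to the number of truly frequent itemsets. All names and sample values are hypothetical.

```python
def support_error(true_support, estimated_support):
    """Average relative support error (%) over itemsets present in both."""
    common = [s for s in true_support if s in estimated_support]
    if not common:
        return None          # nothing was correctly identified
    total = sum(abs(estimated_support[s] - true_support[s]) / true_support[s]
                for s in common)
    return 100.0 * total / len(common)

def identity_errors(true_frequent, found_frequent):
    """Return (false positives %, false negatives %) relative to |F|."""
    size = len(true_frequent)
    false_pos = 100.0 * len(found_frequent - true_frequent) / size
    false_neg = 100.0 * len(true_frequent - found_frequent) / size
    return false_pos, false_neg

true_sup = {"A": 0.10, "B": 0.20, "C": 0.05, "D": 0.04}   # truly frequent
est_sup = {"A": 0.11, "B": 0.18, "E": 0.03}               # found in D
rho = support_error(true_sup, est_sup)
sigma_plus, sigma_minus = identity_errors(set(true_sup), set(est_sup))
```

Because the false-positive rate is measured against the number of truly frequent itemsets, it can exceed 100.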

23
Performance (contd)
  • Parameters
  • sup = 0.25%, 0.5%
  • p = 0.9, 0.7
  • a = 1: only the privacy of 1s is of concern
  • r = 0%, 10%
  • Coverage may be more important than precision.
  • Use a smaller support threshold to mine the
    distorted database.
  • Support used to mine D = sup x (1 - r)

24
Performance (contd)
  • Synthetic dataset
  • Experiment 1: p = 0.9 (privacy 85%), sup = 0.25%

r = 0%
Level  |F|    ρ      σ-     σ+
1      689    3.31   1.16   1.16
2      2648   3.58   4.49   5.14
3      1990   1.71   4.57   2.16
4      1418   1.28   3.67   0.22
5      730    1.27   5.89   0
6      212    1.36   4.25   5.19
7      35     1.40   0      0
8      3      0.99   0      0

r = 10%
Level  |F|    ρ      σ-     σ+
1      689    3.37   0.73   3.19
2      2648   3.73   0.19   19.68
3      1990   1.76   0      28.09
4      1418   1.29   0      25.81
5      730    1.32   0      16.44
6      212    1.37   0      25.47
7      35     1.40   0      51.43
8      3      0.99   0      66.67
25
Performance (contd)
  • Synthetic dataset
  • Experiment 2: p = 0.9 (privacy 85%), sup = 0.5%

r = 0%
Level  |F|    ρ      σ-     σ+
1      560    2.60   1.25   0.89
2      470    2.13   5.53   4.89
3      326    1.22   3.07   0.31
4      208    1.34   1.44   0.48
5      125    1.81   0      0
6      43     2.62   0      0
7      10     3.44   10     0
8      1      4.50   0      0

r = 10%
Level  |F|    ρ      σ-     σ+
1      560    2.66   0.18   4.29
2      470    2.21   0      44.89
3      326    1.26   0      42.64
4      208    1.35   0      51.44
5      125    1.81   0      22.4
6      43     2.62   0      18.60
7      10     3.47   0      10
8      1      4.50   0      0
26
Performance (contd)
  • Synthetic dataset
  • Experiment 3: p = 0.7 (privacy 96%), sup = 0.25%, r = 10%

Level  |F|    ρ       σ-      σ+
1      689    10.16   2.61    7.84
2      2648   25.23   19.52   630.93
3      1990   26.93   42.86   172.71
4      1418   29.14   65.94   0.35
5      730    28.47   79.32   0
6      212    36.25   84.91   0
7      35     51.37   85.71   0
8      3      -       100     0
27
Performance (contd)
  • Real database
  • Experiment 1: p = 0.9 (privacy 89%), sup = 0.25%

r = 0%
Level  |F|   ρ      σ-      σ+
1      249   5.89   4.02    2.81
2      239   3.87   6.69    7.11
3      73    2.60   10.96   9.59
4      4     1.41   0       25.0

r = 10%
Level  |F|   ρ      σ-      σ+
1      249   6.12   1.2     0.40
2      239   4.04   1.26    23.43
3      73    2.93   0       45.21
4      4     1.41   0       75
28
Performance (contd)
  • Real database
  • Experiment 2: p = 0.9 (privacy 89%), sup = 0.5%

r = 0%
Level  |F|   ρ      σ-     σ+
1      150   4.23   0.67   4.67
2      45    2.42   2.22   4.44
3      6     1.07   0      16.66

r = 10%
Level  |F|   ρ      σ-     σ+
1      150   4.27   0      8
2      45    2.56   0      37.77
3      6     1.07   0      66.66
29
Performance (contd)
  • Real database
  • Experiment 3: p = 0.7 (privacy 97%), sup = 0.25%, r = 10%

Level  |F|   ρ       σ-      σ+
1      249   18.96   7.23    15.66
2      239   33.59   20.08   1907.53
3      73    32.87   30.14   2308.22
4      4     7.55    50      400
30
Performance (contd)
  • Summary
  • Good privacy and good accuracy can be achieved at
    the same time by careful selection of p.
  • In the experiments, p around 0.9 is the best
    choice.
  • A smaller p leads to large errors in the mining
    results.
  • A larger p reduces the privacy greatly.

31
Conclusion
  • This paper studies the problem of achieving a
    satisfactory privacy and accuracy simultaneously
    for association rule mining.
  • A probabilistic distortion of the true data is
    proposed.
  • Privacy is measured by a formula, which is a
    function of p and s0.

32
Conclusion (contd)
  • A mining process is put forward to estimate the
    real support from the distorted database.
  • Experimental results show that there is a small
    window of p (near 0.9) that can achieve good
    accuracy (90%) and privacy (80%) at the same
    time.

33
Related Works
  • On preventing sensitive rules from being inferred
    by the miner (output privacy)
  • Y. Saygin, V. Verykios and C. Clifton, Using
    Unknowns to Prevent Discovery of Association
    Rules, ACM SIGMOD Record, vol.30 no. 4, 2001
  • M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim
    and V. Verykios, Disclosure Limitation of
    Sensitive Rules, Proc. of IEEE Knowledge and
    Data Engineering Exchange Workshop, Nov. 1999

34
Related Works
  • On input data privacy in distributed databases
  • J. Vaidya and C. Clifton, Privacy Preserving
    Association Rule Mining in Vertically Partitioned
    Data, KDD2002
  • M. Kantarcioglu and C. Clifton,
    Privacy-preserving Distributed Mining of
    Association Rules on Horizontally Partitioned
    Data, Proc. of ACM SIGMOD Workshop on Research
    Issues in Data Mining and Knowledge Discovery,
    2002

35
Related Works
  • Privacy-preserving mining in the context of
    classification rules
  • D. Agrawal and C. Aggarwal, On the Design and
    Quantification of Privacy Preserving Data Mining
    Algorithms, PODS, 2001
  • A recent paper also appears in 2002
  • A. Evfimievski, R. Srikant, R. Agrawal and J.
    Gehrke, Privacy Preserving Mining of Association
    Rules, KDD2002

36
  • ?

37
More information
  • Distortion procedure
  • $Y_i = X_i \oplus \bar{r}_i$, where $\bar{r}_i$ is the complement of
    $r_i$,
  • $r_i$ is a random variable with density function
    $f(r) = \mathrm{bernoulli}(p)$ $(0 < p < 1)$
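
The XOR form is the same distortion as "keep with probability p, flip with probability 1 - p". A short enumeration makes this visible; the helper name xor_distort is an assumption of the sketch.

```python
def xor_distort(x, r):
    """Y = X XOR (complement of r), for single bits x and r."""
    return x ^ (1 - r)

# Enumerate all four (x, r) combinations.
table = {(x, r): xor_distort(x, r) for x in (0, 1) for r in (0, 1)}
```

When r = 1 (which happens with probability p) the bit is unchanged, and when r = 0 it is flipped.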

38
More Information
  • Reconstruction error bounds (1-itemsets)
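  • With probability $P_E(m, p, (2p-1)\epsilon/2) \times P_E(n, p, (2p-1)\epsilon/2)$,
    the error is less than $\epsilon$.
  • n = the real support count of the item
  • m = dbsize - n
  • $P_E(n, p, \epsilon) = \sum_{r = np - \epsilon}^{np + \epsilon} \binom{n}{r} p^r (1 - p)^{n - r}$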
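
The binomial sum can be evaluated directly for small n. In the sketch below the function name pe, the rounding of the summation limits to integers, and the sample arguments are all assumptions made for illustration.

```python
import math

def pe(n, p, eps):
    """Probability that a Binomial(n, p) count lies within eps of n * p."""
    low = max(0, math.ceil(n * p - eps))     # assumed: round limits inward
    high = min(n, math.floor(n * p + eps))
    return sum(math.comb(n, r) * p ** r * (1 - p) ** (n - r)
               for r in range(low, high + 1))

narrow = pe(10, 0.5, 0)    # only r = 5 is counted
wide = pe(10, 0.5, 10)     # every r from 0 to 10 is counted
```

For database-sized n a direct sum like this is slow and prone to floating-point underflow, so a production version would need a different evaluation strategy.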

39
  • Reconstruction probability of a 1 in a random
    item i
  • si = the true support of item i
  • = Prob(a random customer C bought the ith
    item)
  • Xi = the original entry for item i
  • Yi = the distorted entry for item i
  • The probability of correct reconstruction of a
    1 in a random item i is
  • $R_1(p, s_i) = \Pr\{Y_i = 1 \mid X_i = 1\} \times \Pr\{X_i = 1 \mid Y_i = 1\} + \Pr\{Y_i = 0 \mid X_i = 1\} \times \Pr\{X_i = 1 \mid Y_i = 0\}$
  • $= \frac{s_i p^2}{s_i p + (1 - s_i)(1 - p)} + \frac{s_i (1 - p)^2}{s_i (1 - p) + (1 - s_i) p}$