Transcript and Presenter's Notes

Title: Privacy-Preserving Data Mining


1
Privacy-Preserving Data Mining
  • Presenter: Li Cao
  • October 15, 2003

2
Presentation organization
  • Associate Data Mining with Privacy
  • Privacy Preserving Data Mining scheme using
    random perturbation
  • Privacy Preserving Data Mining using Randomized
    Response Techniques
  • Comparing these two cases

3
Privacy protection history / Privacy concerns nowadays
  • Citizens' attitudes
  • Scholars' attitudes

4
Internet users' attitudes
5
Privacy value
  • Filtering to weed out unwanted information
  • Better search results with less effort
  • Useful recommendations
  • Market trends
  • Example:
  • From the analysis of a large number of purchase-transaction records that include the customers' age and income, we can learn which kinds of customers prefer a given style or brand.

6
Motivation (Introducing Data Mining)
  • Data Mining's goal: discover knowledge, trends, and patterns from large amounts of data.
  • Data Mining's primary task: developing accurate models about aggregated data without access to the precise information in individual data records (not only discovering knowledge but also preserving privacy).

7
Presentation organization
  • Associate Data Mining with Privacy
  • Privacy Preserving Data Mining scheme using
    random perturbation
  • Privacy Preserving Data Mining using Randomized
    Response Techniques
  • Comparing these two cases

8
Privacy Preserving Data Mining scheme using
random perturbation
  • Basic idea
  • Reconstruction procedure
  • Decision-Tree classification
  • Three different algorithms

9
Attribute list of an example
10
Records of an example
11
Privacy preserving methods
  • Value-Class Membership: the values for an attribute are partitioned into a set of disjoint, mutually exclusive classes.
  • Value Distortion: report x_i + r instead of x_i, where r is a random value drawn from either
  • a) a Uniform distribution, or
  • b) a Gaussian distribution
  (A short Python sketch of value distortion follows below.)

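A minimal Python sketch of value distortion (illustrative only: the noise range, the sigma value, and the sample data below are assumptions, not values from the slides):

    import numpy as np

    rng = np.random.default_rng(0)

    def distort_uniform(x, alpha=0.5):
        # Report x_i + r, with r drawn uniformly from [-alpha, +alpha].
        return x + rng.uniform(-alpha, alpha, size=x.shape)

    def distort_gaussian(x, sigma=0.25):
        # Report x_i + r, with r drawn from a zero-mean Gaussian with std dev sigma.
        return x + rng.normal(0.0, sigma, size=x.shape)

    ages = np.array([23.0, 37.0, 45.0, 61.0])
    print(distort_gaussian(ages))  # only these perturbed values leave the user's machine
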
12
Basic idea
  • Original data → perturbed data (let users provide a modified value for sensitive attributes)
  • Estimate the distribution of the original data from the perturbed data
  • Build classifiers by using these reconstructed distributions (decision tree)

13
Basic Steps
14
Reconstruction problem
  1. View the n original data values x1, x2, ..., xn of a one-dimensional distribution as realizations of n independent, identically distributed (iid) random variables X1, X2, ..., Xn, each with the same distribution as the random variable X.
  2. To hide these data values, n independent random variables Y1, Y2, ..., Yn have been used, each with the same distribution as a different random variable Y.

15
  • Given x1 + y1, x2 + y2, ..., xn + yn (where yi is the realization of Yi) and the cumulative distribution function F_Y for Y, we would like to estimate the cumulative distribution function F_X for X.
  • In short: given a cumulative distribution F_Y and the realizations of n iid random samples X1 + Y1, X2 + Y2, ..., Xn + Yn, estimate F_X.

16
Reconstruction process
  • Let the value of Xi + Yi be wi (= xi + yi). Use Bayes' rule to estimate the posterior distribution function F_X1 (given that X1 + Y1 = w1) for X1, assuming we know the density functions f_X and f_Y for X and Y respectively.

17
  • To estimate the posterior distribution function F_X' given x1 + y1, x2 + y2, ..., xn + yn, we average the distribution functions for each of the Xi.

18
  • The corresponding posterior density function f_X' is obtained by differentiating F_X'.
  • Given a sufficiently large number of samples, f_X' will be very close to the real density function f_X. (The reconstructed formulas are written out below.)

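The slide's equations were images and are not in the transcript. Reconstructed from the definitions above, the standard Bayes estimate for this setting is (LaTeX notation):

    F_{X_1}'(a) = \frac{\int_{-\infty}^{a} f_Y(w_1 - z)\, f_X(z)\, dz}{\int_{-\infty}^{\infty} f_Y(w_1 - z)\, f_X(z)\, dz}

Averaging over all n samples gives F_X'(a) = \frac{1}{n} \sum_{i=1}^{n} F_{X_i}'(a), and differentiating with respect to a yields the density estimate

    f_X'(a) = \frac{1}{n} \sum_{i=1}^{n} \frac{f_Y(w_i - a)\, f_X(a)}{\int_{-\infty}^{\infty} f_Y(w_i - z)\, f_X(z)\, dz}
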
19
Reconstruction algorithm
  • f_X^0 := uniform distribution
  • j := 0 // iteration number
  • repeat
  •   j := j + 1
  •   update f_X^j from f_X^(j-1) using the Bayes density estimate above
  • until (stopping criterion met)
  (A discretized Python sketch follows below.)

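A discretized Python sketch of this loop (a minimal sketch: values are bucketed into equal-width bins, and the bin count, Gaussian noise model, and tolerance are illustrative assumptions, not the slides' settings):

    import numpy as np
    from scipy.stats import norm

    def reconstruct(w, f_y, bins, max_iter=500, tol=1e-6):
        # w: perturbed values w_i = x_i + y_i
        # f_y: noise density, evaluated pointwise
        # bins: midpoints of equal-width buckets approximating f_X
        fx = np.full(len(bins), 1.0 / len(bins))    # f_X^0 := uniform distribution
        for _ in range(max_iter):
            new = np.zeros_like(fx)
            for wi in w:
                post = f_y(wi - bins) * fx          # Bayes numerator, per bucket
                new += post / post.sum()            # normalize, accumulate over samples
            new /= len(w)                           # average -> next estimate f_X^j
            converged = np.abs(new - fx).sum() < tol
            fx = new
            if converged:                           # successive estimates nearly equal
                break
        return fx                                   # probability mass per bucket

    # Toy usage: Gaussian noise (sigma = 10) added to values in [0, 100].
    bins = np.linspace(0.0, 100.0, 50)
    w = np.array([30.1, 44.7, 58.2, 61.0, 72.5])
    estimate = reconstruct(w, lambda d: norm.pdf(d, scale=10.0), bins)
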
20
Stopping Criterion
  • Observed randomized distribution ≈ the result of randomizing the current estimate of the original distribution
  • The difference between successive estimates of the original distribution is very small

21
Reconstruction effect
22
(No Transcript)
23
Decision-Tree Classification
  • Two stages:
  •   (1) Growth  (2) Prune
  • Example

24
Tree-growth phase algorithm
  • Partition(Data S)
  • begin
  •   if (most points in S are of the same class) then return
  •   for each attribute A do
  •     evaluate splits on attribute A
  •   use the best split to partition S into S1 and S2
  •   Partition(S1)
  •   Partition(S2)
  • end
  (A runnable Python sketch follows below.)

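A runnable Python sketch of this growth phase (illustrative only: it handles a single numeric attribute and tries midpoints between consecutive sorted values as candidate splits; that evaluator and the toy data are assumptions, not the paper's split machinery):

    from collections import Counter

    def gini(labels):
        # gini(S) = 1 - sum_j p_j^2
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(S, threshold):
        # Weighted Gini index of the two subsets induced by the threshold.
        left = [c for v, c in S if v <= threshold]
        right = [c for v, c in S if v > threshold]
        n = len(S)
        return (len(left) / n) * gini(left) + (len(right) / n) * gini(right)

    def partition(S, depth=0):
        labels = [c for _, c in S]
        if len(set(labels)) <= 1:          # most points of the same class: stop
            print("  " * depth + f"leaf: {labels[0]}")
            return
        S = sorted(S)
        candidates = [(v1 + v2) / 2 for (v1, _), (v2, _) in zip(S, S[1:]) if v1 != v2]
        if not candidates:                 # identical values, mixed classes: stop
            print("  " * depth + f"leaf: {Counter(labels).most_common(1)[0][0]}")
            return
        best = min(candidates, key=lambda t: gini_split(S, t))
        print("  " * depth + f"split at {best}")
        partition([p for p in S if p[0] <= best], depth + 1)   # Partition(S1)
        partition([p for p in S if p[0] > best], depth + 1)    # Partition(S2)

    partition([(23, "A"), (37, "A"), (45, "B"), (61, "B")])
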
25
Choose the best split
  • Information gain (categorical attributes)
  • Gini index (continuous attributes)

26
Gini index calculation
  • gini(S) = 1 - Σ_j p_j²   (p_j is the relative frequency of class j in S)
  • If a split divides S into two subsets S1 and S2:
  •   gini_split(S) = (|S1|/|S|) · gini(S1) + (|S2|/|S|) · gini(S2)
  • Note: calculating this index requires only the distribution of the class values.
  (A worked example follows below.)

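A quick worked example with toy numbers (chosen purely for illustration). Suppose S has 10 records, 6 of class A and 4 of class B:

    gini(S)  = 1 - (0.6^2 + 0.4^2) = 1 - 0.52 = 0.48
    Split into S1 (5 records, all class A) and S2 (5 records: 1 A, 4 B):
    gini(S1) = 1 - 1.0^2 = 0.0
    gini(S2) = 1 - (0.2^2 + 0.8^2) = 0.32
    gini_split(S) = (5/10) * 0.0 + (5/10) * 0.32 = 0.16

The split is attractive because 0.16 is well below 0.48: a lower index means purer subsets.
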
27
(No Transcript)
28
When & how the original distributions are reconstructed
  • Global: reconstruct the distribution for each attribute once, then do decision-tree classification.
  • ByClass: for each attribute, first split the training data by class and reconstruct the distributions separately, then do decision-tree classification.
  • Local: the same as ByClass, except that instead of doing reconstruction only once, reconstruction is done at each node.

29
Example (ByClass and Local)
30
Comparing the three algorithms
Algorithm   Execution time    Accuracy
Global      Cheapest          Worst
ByClass     Middle            Middle
Local       Most expensive    Best
31
Presentation Organization
  • Associate Data Mining with Privacy
  • Privacy Preserving Data Mining scheme using
    random perturbation
  • Privacy Preserving Data Mining using Randomized
    Response Techniques
  • Comparing these two cases
  • Are there any other classification methods
    available?

32
Privacy Preserving Data Mining using Randomized
Response Techniques
  • Randomized Response
  • Building Decision-Tree
  • Key: Information Gain calculation
  • Experimental results

33
Randomized Response
  • A survey contains a sensitive attribute A.
  • Instead of asking whether the respondent has the attribute A, ask two related questions whose answers are opposite to each other ("have A" vs. "do not have A").
  • Respondents use a randomizing device to decide which question to answer. The device is designed in such a way that the probability of choosing the first question is θ. (A sketch of the device follows below.)

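A minimal Python sketch of the respondent-side device (theta is the design parameter; the interviewer sees only the reply, never the coin flip):

    import random

    def respond(has_A: bool, theta: float = 0.7) -> str:
        # With probability theta, answer the direct question ("do you have A?");
        # otherwise answer the opposite question. The respondent is truthful either way.
        if random.random() < theta:
            return "yes" if has_A else "no"
        return "no" if has_A else "yes"
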
34
  • To estimate the percentage of people who have the attribute A, we can use:
  •   P*(A = yes) = P(A = yes) · θ + P(A = no) · (1 - θ)
  •   P*(A = no)  = P(A = no) · θ + P(A = yes) · (1 - θ)
  • P*(A = yes): the proportion of "yes" responses obtained from the (disguised) survey data.
  • P(A = yes): the estimated proportion of true "yes" answers.
  • Our goal: P(A = yes) and P(A = no). (A solving sketch follows below.)

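The two equations above are linear in P(A = yes), so they can be inverted directly. A sketch (valid only for theta != 0.5; at 0.5 the system is singular and nothing can be recovered):

    def estimate_true_yes(p_star_yes: float, theta: float) -> float:
        # P*(yes) = P(yes) * theta + (1 - P(yes)) * (1 - theta)
        # Solving for P(yes): P(yes) = (P*(yes) + theta - 1) / (2 * theta - 1)
        return (p_star_yes + theta - 1.0) / (2.0 * theta - 1.0)

    # E.g. 52% observed "yes" responses with theta = 0.7
    # -> an estimated ~55% of respondents truly have attribute A.
    print(estimate_true_yes(0.52, 0.7))
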
35
Example
  • Sensitive attribute: Married?
  • Two questions:
  •   Question A: "Are you married?" (a married respondent truthfully answers Yes, an unmarried one No)
  •   Question B: the opposite question (a married respondent answers No, an unmarried one Yes)

36
Decision-Tree (Key Info Gain)
  • m: the number of classes (m classes assumed)
  • Q_j: the relative frequency of class j in S
  • v: any possible value of attribute A
  • S_v: the subset of S for which attribute A has value v
  • |S_v|: the number of elements in S_v
  • |S|: the number of elements in S
  (The standard formulas these symbols plug into appear below.)

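The slide's equation images are not in the transcript; for reference, the standard formulas these symbols plug into are:

    Entropy(S) = - sum_j Q_j * log2(Q_j)
    Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
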
37
  • P(E): the proportion of the records in the undisguised data set that satisfy E = true
  • P*(E): the proportion of the records in the disguised data set that satisfy E = true
  • Assume the class label is binary.
  • Entropy(S) can then be calculated; similarly Entropy(S_v). Finally, we get Gain(S, A). (The inversion step is sketched below.)

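The inversion mirrors the proportion estimate from the earlier slide. For the binary case, the disguised and true proportions are related linearly through theta (a reconstruction of the step, not a quote from the slide):

    P*(E) = P(E) * theta + (1 - P(E)) * (1 - theta)
    =>  P(E) = (P*(E) + theta - 1) / (2 * theta - 1)

With P(E) estimated this way for each class label, Entropy(S) and Entropy(S_v), and hence Gain(S, A), follow directly.
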
38
Experimental results
39
Comparing these two cases
Aspect                       Perturbation               Randomized Response
Attribute type               Continuous                 Categorical
Privacy-preserving method    Value distortion           Randomized response
Split criterion              Gini index                 Information gain
Inverse procedure            Reconstruct distribution   Estimate P(E) from P*(E)
40
Future work
  • Solve categorical problems with the first scheme
  • Solve continuous problems with the second scheme
  • Combine the two schemes to solve some problems
  • Explore other suitable classification methods