Privacy-preserving data mining (1)

About This Presentation

Slides: 24

Provided by: keke9

Transcript and Presenter's Notes

1
Privacy-preserving data mining (1)
2
Outline

A brief introduction to learning algorithms
Classification algorithms
Clustering algorithms
Addressing privacy issues in learning
Single dataset publishing
Distributed multiple datasets
How data is partitioned

3
A quick review

Machine learning algorithms
Supervised learning (classification)
Training data have class labels
Find the boundary between classes
Unsupervised learning (clustering)
Training data have no labels
Similarity measure is the key
Grouping records based on the similarity measure

4
A quick review

Good tutorials
http://www.cs.utexas.edu/~mooney/cs391L/
Top 10 data mining algorithms
www.cs.uvm.edu/~icdm/algorithms/10Algorithms-08.pdf
We will review the basic ideas of some algorithms

5
C4.5 decision tree (classification)

Based on ID3 algorithm
Convert decision tree to rule set
From the root to a leaf → a rule
Prune the rules
Cross validation

Split the data into N folds
In each round: training, validating, testing
Testing the generalization power
For choosing the best parameters
Final result: the average of the N testing results
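
A minimal sketch of the N-fold procedure, for illustration only. It assumes the simplest variant, where each round holds out one fold for testing and trains on the rest (the separate validation step is omitted). The names `n_fold_splits`, `cross_validate`, `train_majority`, and `accuracy` are hypothetical and do not come from the slides.

```python
from statistics import mean

def n_fold_splits(n_records, n_folds):
    """Yield one (train_indices, test_indices) pair per round."""
    indices = list(range(n_records))
    # Fold i takes every n_folds-th record, starting at offset i.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for i in range(n_folds):
        test_idx = folds[i]
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, test_idx

def cross_validate(records, labels, n_folds, train_fn, score_fn):
    """Return the average of the per-round test scores."""
    scores = []
    for train_idx, test_idx in n_fold_splits(len(records), n_folds):
        model = train_fn([records[j] for j in train_idx],
                         [labels[j] for j in train_idx])
        scores.append(score_fn(model,
                               [records[j] for j in test_idx],
                               [labels[j] for j in test_idx]))
    return mean(scores)

# Toy usage: a "model" that always predicts the majority training label.
def train_majority(rows, row_labels):
    return max(set(row_labels), key=row_labels.count)

def accuracy(model, rows, row_labels):
    return mean(1.0 if model == y else 0.0 for y in row_labels)

records = list(range(10))
labels = [0] * 8 + [1] * 2
avg_score = cross_validate(records, labels, 5, train_majority, accuracy)
```

To choose parameters, one would run `cross_validate` once per candidate setting and keep the setting with the best average score.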
6
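Naïve Bayes (classification)
Two classes 0/1, feature vector $x = (x_1, x_2, \ldots, x_n)$
Apply Bayes' rule: $P(\text{class} \mid x) \propto f(x \mid \text{class}) \, P(\text{class})$
Assume independent features: $f(x \mid \text{class}) = \prod_i f(x_i \mid \text{class})$
Easy to count $f(x_i \mid \text{class label})$ from the training data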
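
The counting step can be sketched as follows, assuming categorical feature values. The add-one smoothing is an assumption added here to avoid zero probabilities; it is not stated on the slide. `train_naive_bayes` and `predict` are hypothetical names.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[(i, c)][v]: number of class-c rows whose i-th feature is v
    feature_counts = defaultdict(Counter)
    values = defaultdict(set)  # distinct values seen for each feature
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[(i, c)][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values

def predict(model, row):
    """Pick the class with the largest log-posterior under independence."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_class, best_score = None, -math.inf
    for c, n_c in class_counts.items():
        score = math.log(n_c / total)  # log prior
        for i, v in enumerate(row):
            # Add-one smoothing (an assumption, see the lead-in).
            p = (feature_counts[(i, c)][v] + 1) / (n_c + len(values[i]))
            score += math.log(p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

rows = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1)]
labels = [1, 1, 1, 0, 0, 0]
model = train_naive_bayes(rows, labels)
```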
7
K-nearest neighbor (classification)
Instance-based learning
Classifying the point
Decision area Dz
More general kernel methods
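
A minimal sketch of k-nearest-neighbor classification, assuming Euclidean distance and an unweighted majority vote (the slide fixes neither choice). `knn_classify` is a hypothetical name.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, query, k=3):
    """Label the query by majority vote among its k nearest training points."""
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pl: math.dist(pl[0], query))
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["a", "a", "a", "b", "b", "b"]
```

No model is built in advance: all the work happens at query time, which is why the method is called instance-based.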
8
Linear classifier (classification)
$w^T x + b = 0$
$w^T x + b > 0$
$w^T x + b < 0$
$f(x) = \mathrm{sign}(w^T x + b)$

Examples
Perceptron
Linear discriminant analysis (LDA)
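
As an illustration of the perceptron example, here is a sketch of the classic mistake-driven update rule with labels in {-1, +1}. The learning rate, epoch limit, and function names are assumptions made for this example.

```python
def sign(value):
    return 1 if value > 0 else -1

def predict(w, b, x):
    """Evaluate sign(w^T x + b)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)) + b)

def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Adjust w and b whenever a training point is misclassified."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if predict(w, b, x) != y:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # every point classified correctly
            break
    return w, b

points = [(2, 2), (3, 3), (-1, -1), (-2, -3)]
labels = [1, 1, -1, -1]
w, b = train_perceptron(points, labels)
```

On linearly separable data such as this toy set, the loop stops as soon as an epoch produces no mistakes.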

9
There are infinitely many linear
separators. Which one is optimal?
10
Support Vector Machine (classification)

Distance from example $x_i$ to the separator is $r = y_i (w^T x_i + b) / \|w\|$, with labels $y_i \in \{-1, +1\}$
Examples closest to the hyperplane are support
vectors.
Margin $\rho$ of the separator is the distance between
support vectors.

Maximizing the margin $\rho$

Extended to handle
Nonlinear
Noisy margin
Large datasets
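
To make the margin idea concrete, the sketch below compares two candidate separators on a toy dataset using the standard point-to-hyperplane distance $y_i (w^T x_i + b) / \|w\|$. It only evaluates given separators and does not solve the SVM optimization problem. The function names and data are hypothetical.

```python
import math

def distance_to_separator(w, b, x, y):
    """Signed distance of a labeled example (x, y) from w^T x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def min_distance(w, b, points, labels):
    """Distance of the closest example; larger means a wider margin."""
    return min(distance_to_separator(w, b, x, y)
               for x, y in zip(points, labels))

points = [(1, 1), (2, 2), (-1, -1), (-2, -2)]
labels = [1, 1, -1, -1]

diagonal = min_distance((1, 1), 0, points, labels)  # separator x1 + x2 = 0
vertical = min_distance((1, 0), 0, points, labels)  # separator x1 = 0
```

Both separators classify every point correctly, but the diagonal one keeps the closest examples farther away, which is the property a support vector machine optimizes for.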

11
Boosting (classification)

Classifier ensembles
Average prediction of a set of classifiers
trained on the same set of data
Intuition
The output of a classifier has a certain amount of
variance
Averaging can reduce the variance → improve the
accuracy
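
The variance argument can be checked with a small simulation. It makes an idealized assumption: each ensemble member's output is the true value plus independent Gaussian noise. Classifiers trained on the same data only approximate this independence, so the real reduction is usually smaller. All names and constants are hypothetical.

```python
import random
from statistics import mean, pvariance

def noisy_prediction(rng, true_value=1.0, noise_std=1.0):
    """One member's output: the true value plus random error."""
    return true_value + rng.gauss(0.0, noise_std)

def ensemble_prediction(rng, n_members):
    """Average the outputs of n_members independent members."""
    return mean(noisy_prediction(rng) for _ in range(n_members))

rng = random.Random(0)
single = [noisy_prediction(rng) for _ in range(2000)]
averaged = [ensemble_prediction(rng, 10) for _ in range(2000)]

single_variance = pvariance(single)
ensemble_variance = pvariance(averaged)
```

With independent errors, averaging 10 members should cut the variance to roughly one tenth of a single member's variance.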

12
AdaBoost

Freund Y, Schapire RE (1997) A decision-theoretic
generalization of on-line learning and an
application to boosting. J Comput Syst Sci

Gradient boosting
J. Friedman, Stochastic gradient boosting,
http://citeseer.ist.psu.edu/old/126259.html

14
Challenges in Clustering

Definition of similarity measures
Point-wise
Euclidean
Cosine (document similarity)
Correlation
Set-wise
Min/max distance between two sets
Entropy based (categorical data)
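
The point-wise and set-wise measures listed above can be sketched in a few lines. The entropy-based measure for categorical data is left out. Function names are hypothetical.

```python
import math

def euclidean(a, b):
    return math.dist(a, b)

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def correlation(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    return cosine_similarity([x - mean_a for x in a],
                             [y - mean_b for y in b])

def min_set_distance(set_a, set_b):
    """Distance between the closest pair of points from the two sets."""
    return min(euclidean(p, q) for p in set_a for q in set_b)

def max_set_distance(set_a, set_b):
    """Distance between the farthest pair of points from the two sets."""
    return max(euclidean(p, q) for p in set_a for q in set_b)
```

Euclidean distance is sensitive to vector length while cosine similarity is not, which is one reason cosine is preferred for documents of different sizes.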

15
Challenges in Clustering

Hierarchical
1. Merging most similar pairs each step
2. Until reaching desired number of clusters
Partitioning (k-means)
1. Set initial centroids
2. Partition the data
3. Adjust the centroids
4. Iterate on 2 and 3 until converging
Another way to classify the algorithms
Agglomerative (bottom-up) methods
Divisive (partitional, top-down) methods
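
The four k-means steps can be sketched as below, assuming Euclidean distance and caller-supplied initial centroids. `assign`, `update`, and `kmeans` are hypothetical names, and the iteration cap is an added safeguard.

```python
import math

def assign(points, centroids):
    """Step 2: attach each point to its nearest centroid."""
    return [min(range(len(centroids)),
                key=lambda c: math.dist(p, centroids[c]))
            for p in points]

def update(points, assignment, centroids):
    """Step 3: move each centroid to the mean of its members."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, a in zip(points, assignment) if a == c]
        if members:
            new_centroids.append(
                tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centroids.append(old)  # keep the centroid of an empty cluster
    return new_centroids

def kmeans(points, initial_centroids, max_iter=100):
    centroids = [tuple(c) for c in initial_centroids]  # step 1
    for _ in range(max_iter):                          # step 4
        assignment = assign(points, centroids)
        new_centroids = update(points, assignment, centroids)
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, [(0, 0), (10, 10)])
```

The result depends on the initial centroids, so in practice the procedure is often repeated from several starting points.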

16
Challenges in Clustering

Efficiency of the algorithm on large datasets
Linear-cost algorithms: k-means
However, the costs of many algorithms are
quadratic
Perform three-phase processing
Sampling
Clustering
Labeling
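
A sketch of the three-phase idea follows. It assumes that a costly algorithm is run only on a random sample and that every record is then labeled with the nearest resulting cluster centroid. The centroid-linkage merging used for the clustering phase is only one possible choice, and all names are hypothetical.

```python
import math
import random

def centroid(cluster):
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def agglomerative(points, k):
    """Repeatedly merge the two clusters with the closest centroids
    until k clusters remain. Too costly to run on a very large dataset."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(
            ((a, b) for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda ab: math.dist(centroid(clusters[ab[0]]),
                                     centroid(clusters[ab[1]])))
        clusters[i].extend(clusters.pop(j))
    return clusters

def three_phase(points, k, sample_size, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(points, sample_size)                     # 1. sampling
    centroids = [centroid(c) for c in agglomerative(sample, k)]  # 2. clustering
    return [min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points]                                     # 3. labeling

blob_a = [(x, y) for x in range(5) for y in range(5)]
blob_b = [(x + 100, y + 100) for x in range(5) for y in range(5)]
labels = three_phase(blob_a + blob_b, k=2, sample_size=20)
```

Only the sample goes through the expensive step; the labeling pass is linear in the number of records.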

17
Challenges in Clustering

Irregularly shaped clusters and noise

18
Clustering algorithms

Typical ones
K-means
Expectation-Maximization (EM)
Many clustering algorithms address
different challenges
Good survey
A. K. Jain et al., Data Clustering: A Review, ACM
Computing Surveys, 1999

19
PPDM issues

How data is distributed
Single party releases data
Multiple parties collaboratively mining data
Pooling data
Cryptographic protocols
How data is partitioned
Horizontally
Vertically
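
The two partitioning schemes can be illustrated on a toy table. In this sketch, horizontal partitioning gives each party some of the rows with all attributes, and vertical partitioning gives each party all rows but only some attributes, joined by a shared key. The records, attribute names, and function names are made up for illustration.

```python
records = [
    {"id": 1, "age": 34, "income": 52000, "label": 0},
    {"id": 2, "age": 45, "income": 61000, "label": 1},
    {"id": 3, "age": 29, "income": 48000, "label": 0},
    {"id": 4, "age": 51, "income": 75000, "label": 1},
]

def horizontal_partition(rows, n_parties):
    """Each party holds a subset of the rows, with every attribute."""
    return [rows[i::n_parties] for i in range(n_parties)]

def vertical_partition(rows, attribute_groups, key="id"):
    """Each party holds every row, but only its own attributes plus the key."""
    return [[{key: r[key], **{a: r[a] for a in attrs}} for r in rows]
            for attrs in attribute_groups]

by_rows = horizontal_partition(records, 2)
by_columns = vertical_partition(records, [["age"], ["income", "label"]])
```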

20
Single party

Data perturbation
Rakesh00, for decision trees
Chen05, for many classifiers and clustering
algorithms
Anonymization
Top-down/bottom-up decision tree
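
As a generic illustration of data perturbation (not the specific method of either cited paper), the sketch below releases each value with independent zero-mean Gaussian noise added. Individual values are masked, while aggregates such as the mean stay approximately recoverable. The noise level, data, and function name are assumptions.

```python
import random
from statistics import mean

def perturb(values, noise_std, seed=0):
    """Release each value with independent zero-mean Gaussian noise added."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, noise_std) for v in values]

ages = [20 + (i % 50) for i in range(1000)]
released = perturb(ages, noise_std=10.0)

original_mean = mean(ages)
released_mean = mean(released)
```

The trade-off is between privacy and utility: more noise hides individual values better but makes the recovered statistics less precise.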

21
Multiple parties
[Figure: several users connected through a network, exchanging perturbed data]
Service-based computing
Peer-to-peer computing

Perturbation and anonymization
Papers 89,92,94,185,

Cryptographic approaches
Papers 95-99,104,107,108

22
How data is partitioned

Horizontally partitioned
All additive (and some multiplicative)
perturbation methods
Protocols
K-means, SVM, naïve Bayes, Bayesian network
Vertically partitioned
All additive perturbation methods
Protocols
K-means, Bayesian network

23
Challenges and opportunities

Many modeling methods have no privacy-preserving
version
Cost of protocol-based approaches
Limitation of column-based additive perturbation
Complexity
The advantage of geometric data perturbation
Covers many different modeling methods
