Transcript and Presenter's Notes

Title: K-medoid-style Clustering Algorithms for Supervised Summary Generation


1
K-medoid-style Clustering Algorithms for
Supervised Summary Generation
  • Nidal Zeidat and Christoph F. Eick
  • Dept. of Computer Science
  • University of Houston

2
Talk Outline
  1. What is Supervised Clustering?
  2. Representative-based Clustering Algorithms
  3. Benefits of Supervised Clustering
  4. Algorithms for Supervised Clustering
  5. Empirical Results
  6. Conclusion and Areas of Future Work

3
1. (Traditional) Clustering
  • Partition a set of objects into groups of similar objects. Each group is called a cluster.
  • Clustering is used to detect classes in a data set (unsupervised learning).
  • Clustering is based on a fitness function that relies on a distance measure and usually tries to minimize the distance between objects within a cluster.

4
(Traditional) Clustering (continued)
[Figure: three clusters A, B, and C in the attribute space (Attribute1 vs. Attribute2)]
5
Supervised Clustering
  • Assumes that clustering is applied to classified examples.
  • The goal of supervised clustering is to identify class-uniform clusters that have a high probability density → it prefers clusters whose members belong to a single class (low impurity).
  • We would also like to keep the number of clusters small.

6
Supervised Clustering (continued)
[Figure: the same data set in the attribute space (Attribute 1 vs. Attribute 2), partitioned by traditional clustering and by supervised clustering]
7
A Fitness Function for Supervised Clustering
  • q(X) = Impurity(X) + β * Penalty(k)

where
  k = number of clusters used
  n = number of examples in the data set
  c = number of classes in the data set
  β = weight for Penalty(k), 0 < β ≤ 2.0
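A minimal Python sketch of how this fitness function could be computed, assuming Impurity(X) is the fraction of examples that do not belong to the majority class of their cluster. Since the slide lists n and c as parameters, the sketch assumes a penalty of the form sqrt((k - c)/n) for k ≥ c (and 0 otherwise); this exact penalty form, and all names and signatures, are assumptions.

from collections import Counter
from math import sqrt

def impurity(cluster_ids, class_labels):
    # Fraction of examples that fall outside the majority class of their cluster.
    n = len(class_labels)
    minority = 0
    for cid in set(cluster_ids):
        classes = [class_labels[i] for i in range(n) if cluster_ids[i] == cid]
        minority += len(classes) - Counter(classes).most_common(1)[0][1]
    return minority / n

def q(cluster_ids, class_labels, beta, c):
    # q(X) = Impurity(X) + beta * Penalty(k), with 0 < beta <= 2.0 (per the slide).
    n = len(class_labels)
    k = len(set(cluster_ids))
    penalty = sqrt((k - c) / n) if k >= c else 0.0   # assumed penalty form
    return impurity(cluster_ids, class_labels) + beta * penalty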
8
2. Representative-Based Supervised Clustering
(RSC)
  • Aims at finding a set of objects (called representatives) in the data set that best represent all objects in the data set. Each representative corresponds to a cluster.
  • The remaining objects in the data set are then clustered around these representatives by assigning each object to the cluster of its closest representative (see the sketch after this slide).
  • Remark: The popular k-medoid algorithm, also called PAM, is a representative-based clustering algorithm.
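A minimal sketch of the assignment step described in the second bullet, assuming objects are hashable items (e.g., row indices into the data set) and d is any dissimilarity function; all names are illustrative.

def assign_to_representatives(objects, representatives, d):
    # Each object joins the cluster of its closest representative.
    clusters = {r: [] for r in representatives}
    for o in objects:
        closest = min(representatives, key=lambda r: d(o, r))
        clusters[closest].append(o)
    return clusters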

9
Representative Based Supervised Clustering
(Continued)
[Figure: a classified data set in the attribute space (Attribute1 vs. Attribute2)]
10
Representative Based Supervised Clustering
(Continued)
[Figure: the same data set with four representatives, numbered 1-4, in the attribute space (Attribute1 vs. Attribute2)]
11
Representative Based Supervised Clustering
(Continued)
[Figure: the same data set and representatives 1-4 in the attribute space (Attribute1 vs. Attribute2)]
Objective of RSC: Find a subset O_R of O such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
12
Why do we use Representative-Based Clustering
Algorithms?
  • Representatives themselves are useful:
  • they can be used for summarization
  • they can be used for dataset compression
  • Smaller search space compared with algorithms such as k-means.
  • Less sensitive to outliers.
  • Can be applied to data sets that contain nominal attributes (where computing means is not feasible).

13
3. Applications of Supervised Clustering
  • Enhance classification algorithms:
  • Use SC for dataset editing to enhance NN-classifiers [ICDM04]
  • Improve simple classifiers [ICDM03]
  • Learn sub-classes / summary generation
  • Distance function learning
  • Dataset compression/reduction
  • Measure the difficulty of a classification task

14
Representative Based Supervised Clustering → Dataset Editing
[Figure: two panels in the attribute space (Attribute1 vs. Attribute2); panel a shows clusters A-F of the original data set, panel b the edited data set]
a. Dataset clustered using supervised clustering.
b. Dataset edited using cluster representatives.
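A minimal sketch of the editing idea in panel b, assuming the edited training set consists only of the cluster representatives, each labeled with its cluster's (majority) class, and that a nearest-neighbor classifier is then run against this reduced set; the labeling rule and all names are assumptions.

def edit_dataset(representatives, label_of):
    # Keep only the representatives; label_of gives each one its cluster's class.
    return [(r, label_of[r]) for r in representatives]

def nn_classify(x, edited_set, d):
    # 1-NN classification against the edited (much smaller) training set.
    _, label = min(edited_set, key=lambda pair: d(x, pair[0]))
    return label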
15
Representative Based Supervised Clustering → Enhance Simple Classifiers
[Figure: a classified data set and its cluster representatives in the attribute space (Attribute1 vs. Attribute2)]
16
Representative Based Supervised Clustering → Learning Sub-classes
[Figure: a Ford/GMC data set in the attribute space (Attribute1 vs. Attribute2); supervised clustering separates sub-classes such as Ford Trucks, Ford Vans, GMC Trucks, and GMC Van]
17
4. Clustering Algorithms Currently Investigated
  1. Partitioning Around Medoids (PAM) → traditional clustering
  2. Supervised Partitioning Around Medoids (SPAM)
  3. Single Representative Insertion/Deletion Steepest Descent Hill Climbing with Randomized Restart (SRIDHCR)
  4. Top Down Splitting Algorithm (TDS)
  5. Supervised Clustering using Evolutionary Computing (SCEC)

18
Algorithm SRIDHCR
19
Trials in the first part of an iteration (add a non-medoid):
  Set of medoids                        q(X)
  {8, 42, 62, 148} (initial solution)   0.086
  {8, 42, 62, 148, 1}                   0.091
  {8, 42, 62, 148, 2}                   0.091
  ...
  {8, 42, 62, 148, 52}                  0.065
  ...
  {8, 42, 62, 148, 150}                 0.0715

Trials in the second part of an iteration (drop a medoid):
  Set of medoids    q(X)
  {42, 62, 148}     0.086
  {8, 62, 148}      0.073
  {8, 42, 148}      0.313
  {8, 42, 62}       0.333

Run  Set of medoids producing lowest q(X) in the run   q(X)    Purity
0    {8, 42, 62, 148} (initial solution)               0.086   0.947
1    {8, 42, 62, 148, 52}                              0.065   0.947
2    {8, 42, 62, 148, 52, 122}                         0.041   0.973
3    {42, 62, 148, 52, 122, 117}                       0.030   0.987
4    {8, 62, 148, 52, 122, 117}                        0.021   0.993
5    {8, 62, 148, 52, 122, 117, 87}                    0.016   1.000
6    {8, 62, 52, 122, 117, 87}                         0.014   1.000
7    {8, 62, 122, 117, 87}                             0.012   1.000
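A minimal sketch of the hill-climbing loop that produces traces like the one above, assuming q evaluates a set of medoids with the fitness function of slide 7 and that objects are row indices; the stopping rule, the size of the random initial solution, and all names are illustrative assumptions.

import random

def sridhcr_run(objects, q, init_size=4):
    # One run: steepest-descent hill climbing that, in each iteration, tries
    # inserting every non-medoid and deleting every medoid, then keeps the best move.
    medoids = frozenset(random.sample(objects, init_size))   # random initial solution
    best_q = q(medoids)
    while True:
        candidates = [medoids | {o} for o in objects if o not in medoids]
        if len(medoids) > 2:                                  # guard so a solution always remains
            candidates += [medoids - {m} for m in medoids]
        best = min(candidates, key=q)
        if q(best) < best_q:
            medoids, best_q = best, q(best)
        else:
            return medoids, best_q                            # no insertion/deletion improves q(X)

def sridhcr(objects, q, restarts=10):
    # Randomized restart: keep the best local optimum over several runs.
    return min((sridhcr_run(objects, q) for _ in range(restarts)), key=lambda run: run[1])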
20
Algorithm SPAM
21
Differences between SPAM and SRIDHCR
  1. SPAM tries to improve the current solution by replacing a representative with a non-representative, whereas SRIDHCR improves the current solution by removing a representative or by inserting a non-representative (see the sketch after this list).
  2. SPAM is run with the number of clusters k kept fixed, whereas SRIDHCR searches for a good value of k and therefore explores a larger solution space. However, in the case of SRIDHCR, which choices of k are good is somewhat restricted by the selection of the parameter β.
  3. SRIDHCR is run r times, each time starting from a random initial solution, whereas SPAM is only run once.
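For contrast with point 1, a rough sketch of SPAM's improvement step. The SPAM pseudocode itself is not in this transcript, so the given initial solution, the use of best-swap (rather than first-improvement) moves, and all names are assumptions.

def spam(objects, initial_medoids, q):
    # k stays fixed: the only move is swapping one medoid with one non-medoid.
    medoids = frozenset(initial_medoids)
    best_q = q(medoids)
    while True:
        best_swap = None
        for m in medoids:
            for o in objects:
                if o in medoids:
                    continue
                candidate = (medoids - {m}) | {o}   # replace m by o; k is unchanged
                if q(candidate) < best_q and (best_swap is None or q(candidate) < q(best_swap)):
                    best_swap = candidate
        if best_swap is None:
            return medoids, best_q                  # no swap improves q(X): terminate
        medoids, best_q = best_swap, q(best_swap)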

22
5. Performance Measures for the Experimental
Evaluation
  • The investigated algorithms were evaluated based on the following performance measures:
  • Cluster purity (majority %).
  • Value of the fitness function q(X).
  • Average dissimilarity between all objects and their representatives (cluster tightness); see the sketch after this list.
  • Wall-clock time (WCT): actual time, in seconds, that the algorithm took to finish the clustering task.
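A minimal sketch of the purity and tightness measures, assuming purity is the fraction of examples that belong to the majority class of their cluster and tightness averages the dissimilarity between each object and its cluster's representative; names and signatures are illustrative.

from collections import Counter

def cluster_purity(cluster_ids, class_labels):
    # Majority %: fraction of examples in the majority class of their cluster.
    n = len(class_labels)
    hits = 0
    for cid in set(cluster_ids):
        classes = [class_labels[i] for i in range(n) if cluster_ids[i] == cid]
        hits += Counter(classes).most_common(1)[0][1]
    return hits / n

def tightness(objects, representative_of, d):
    # representative_of maps each object to the representative of its cluster.
    return sum(d(o, representative_of[o]) for o in objects) / len(objects)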

23
Algorithm    Purity   q(X)     Tightness(X)

Iris-Plants data set, # clusters = 3
PAM          0.907    0.0933   0.081
SRIDHCR      0.981    0.0200   0.093
SPAM         0.973    0.0267   0.133

Vehicle data set, # clusters = 65
PAM          0.701    0.326    0.044
SRIDHCR      0.835    0.192    0.072
SPAM         0.764    0.263    0.097

Image-Segment data set, # clusters = 53
PAM          0.880    0.135    0.027
SRIDHCR      0.980    0.035    0.050
SPAM         0.944    0.071    0.061

Pima-Indian Diabetes data set, # clusters = 45
PAM          0.763    0.237    0.056
SRIDHCR      0.859    0.164    0.093
SPAM         0.822    0.202    0.086
Table 4: Traditional vs. Supervised Clustering (β = 0.1)
24
Algorithm    q(X)     Purity   Tightness(X)   WCT (Sec.)

IRIS-Flowers data set, # clusters = 3
PAM          0.0933   0.907    0.081          0.06
SRIDHCR      0.0200   0.980    0.093          11.00
SPAM         0.0267   0.973    0.133          0.32

Vehicle data set, # clusters = 65
PAM          0.326    0.701    0.044          372.00
SRIDHCR      0.192    0.835    0.072          1715.00
SPAM         0.263    0.764    0.097          1090.00

Segmentation data set, # clusters = 53
PAM          0.135    0.880    0.027          4073.00
SRIDHCR      0.035    0.980    0.050          11250.00
SPAM         0.071    0.944    0.061          1422.00

Pima-Indians-Diabetes data set, # clusters = 45
PAM          0.237    0.763    0.056          186.00
SRIDHCR      0.164    0.859    0.093          660.00
SPAM         0.202    0.822    0.086          58.00

Table 5: Comparative Performance of the Different Algorithms, β = 0.1
25
Algorithm    Avg. Purity   Tightness(X)   Avg. WCT (Sec.)

IRIS-Flowers data set, # clusters = 3
PAM          0.907         0.081          0.06
SRIDHCR      0.959         0.104          0.18
SPAM         0.973         0.133          0.33

Vehicle data set, # clusters = 56
PAM          0.681         0.046          505.00
SRIDHCR      0.762         0.081          22.58
SPAM         0.754         0.100          681.00

Segmentation data set, # clusters = 32
PAM          0.875         0.032          1529.00
SRIDHCR      0.946         0.054          169.39
SPAM         0.940         0.065          1053.00

Pima-Indians-Diabetes data set, # clusters = 2
PAM          0.656         0.104          0.97
SRIDHCR      0.795         0.109          5.08
SPAM         0.772         0.125          2.70

Table 6: Average Comparative Performance of the Different Algorithms, β = 0.4
26
Why is SRIDHCR performing so much better than
SPAM?
  • SPAM is relatively slow compared with a single run of SRIDHCR, so the same computational resources allow for 5-30 restarts of SRIDHCR. This enables SRIDHCR to conduct a more balanced exploration of the search space.
  • The fitness landscape induced by q(X) contains many plateau-like structures (q(X1) = q(X2)) and many local minima, and SPAM seems to get stuck in them more easily.
  • The fact that SPAM uses a fixed value of k does not seem beneficial for finding good solutions; e.g., SRIDHCR might explore {u1, u2, u3, u4} → {u1, u2, u3, u4, v1, v2} → {u3, u4, v1, v2}, whereas SPAM might terminate with the sub-optimal solution {u1, u2, u3, u4} if neither the replacement of u1 by v1 nor the replacement of u2 by v2 improves q(X).

27
Dataset       k    β         Ties Using q(X)   Ties Using Tightness(X)
Iris-Plants 10 0.00001 5.8 0.0004
Iris-Plants 10 0.4 5.7 0.0004
Iris-Plants 50 0.00001 20.5 0.0019
Iris-Plants 50 0.4 20.9 0.0018

Vehicle 10 0.00001 1.04 0.000001
Vehicle 10 0.4 1.06 0.000001
Vehicle 50 0.00001 1.78 0.000001
Vehicle 50 0.4 1.84 0.000001

Segmentation 10 0.00001 0.220 0.000000
Segmentation 10 0.4 0.225 0.000001
Segmentation 50 0.00001 0.626 0.000001
Segmentation 50 0.4 0.638 0.000000

Diabetes 10 0.00001 2.06 0.0
Diabetes 10 0.4 2.05 0.0
Diabetes 50 0.00001 3.43 0.0002
Diabetes 50 0.4 3.45 0.0002
Table 7 Ties distribution
28
Figure 2: How Purity and k Change as β Increases
29
6. Conclusions
  1. As expected, supervised clustering algorithms produced significantly better cluster purity than traditional clustering. Improvements range between 7% and 19% for the different data sets.
  2. Algorithms that explore the search space too greedily, such as SPAM, do not seem to be very suitable for supervised clustering. In general, algorithms that explore the search space more randomly seem to be more suitable.
  3. Supervised clustering can be used to enhance classifiers, summarize data sets, and learn better distance functions.

30
Future Work
  • Continue work on supervised clustering algorithms:
  • find better solutions
  • run faster
  • explain some of the observations
  • Use supervised clustering for summary generation / learning subclasses
  • Use supervised clustering to find compressed nearest-neighbor classifiers
  • Use supervised clustering to enhance simple classifiers
  • Distance function learning

31
K-Means Algorithm
[Figure: a k-means iteration in the attribute space (Attribute1 vs. Attribute2); clusters/centroids numbered 1-4]
32
K-Means Algorithm
[Figure: a later k-means iteration in the attribute space (Attribute1 vs. Attribute2); clusters/centroids numbered 1-4]
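For reference, a minimal sketch of the standard k-means loop the two figures above illustrate (assign each point to its nearest centroid, then move each centroid to the mean of its points), assuming points are numeric tuples; the iteration count and all names are illustrative.

import random

def kmeans(points, k, iterations=10):
    # Alternate assignment and centroid-update steps for a fixed number of iterations.
    centroids = random.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        centroids = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
                     for i, cl in enumerate(clusters)]
    return centroids, clusters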