Transcript and Presenter's Notes

Title: Supervised Clustering --- Algorithms and Applications


1
Supervised Clustering --- Algorithms and
Applications
  • Christoph F. Eick
  • Department of Computer Science
  • University of Houston
  • Organization of the Talk
  • Supervised Clustering
  • Representative-based Supervised Clustering
    Algorithms
  • Applications Using Supervised Clustering for
  • Dataset Editing
  • Class Decomposition
  • Distance Function Learning
  • Region Discovery in Spatial Datasets
  • Other Activities I am Involved With

2
List of Persons that Contributed to the Work
Presented in Today's Talk
  • Tae-Wan Ryu (former PhD student, now faculty
    member at Cal State Fullerton)
  • Ricardo Vilalta (colleague at UH since 2002;
    Co-Director of UH's Data Mining and Knowledge
    Discovery Group)
  • Murali Achari (former Master's student)
  • Alain Rouhana (former Master's student)
  • Abraham Bagherjeiran (current PhD student)
  • Chunshen Chen (current Master's student)
  • Nidal Zeidat (current PhD student)
  • Sujing Wang (current PhD student)
  • Kim Wee (current MS student)
  • Zhenghong Zhao (former Master's student)

3
Traditional Clustering
  • Partition a set of objects into groups of similar
    objects. Each group is called a cluster.
  • Clustering is used to detect classes in a data
    set (unsupervised learning).
  • Clustering is based on a fitness function that
    relies on a distance measure and usually tries to
    create tight clusters.

4
Different Forms of Clustering
Objective of Supervised Clustering: Minimize
cluster impurity while keeping the number of
clusters low (expressed by a fitness function
q(X)).
5
Motivation: Finding Subclasses using SC
[Figure: scatter plot over Attribute1/Attribute2 in which
supervised clustering separates the class Ford into the
subclasses Ford Trucks, Ford Vans, and Ford SUV, and the
class GMC into GMC Trucks, GMC Van, and GMC SUV.]
6
Related Work: Supervised Clustering
  • Sinkkonen's [SKN02] discriminative clustering and
    Tishby's information bottleneck method [TPB99,
    ST99] can be viewed as probabilistic supervised
    clustering algorithms.
  • There has been a lot of work in the area of
    semi-supervised clustering that centers on
    clustering with background information. Although
    the focus of that work is traditional clustering,
    there is still considerable similarity between the
    techniques and algorithms it investigates and
    the techniques and algorithms we investigate.

7
2. Representative-Based Supervised Clustering
  • Aims at finding a set of objects (called
    representatives) among all objects in the data set
    that best represent the objects in the data set.
    Each representative corresponds to a cluster.
  • The remaining objects in the data set are then
    clustered around these representatives by
    assigning each object to the cluster of its
    closest representative (a small sketch follows
    below).
  • Remark: The popular k-medoid algorithm, also
    called PAM, is a representative-based clustering
    algorithm.
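
A minimal sketch of this assignment step (illustrative names;
objects are numeric attribute vectors and Euclidean distance is
assumed):

  import math

  def assign_to_representatives(objects, representatives):
      """Cluster objects around representatives: each object joins
      the cluster of its closest representative."""
      clusters = {i: [] for i in range(len(representatives))}
      for o in objects:
          # index of the nearest representative under Euclidean distance
          nearest = min(range(len(representatives)),
                        key=lambda i: math.dist(o, representatives[i]))
          clusters[nearest].append(o)
      return clusters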

8
Representative-Based Supervised Clustering
(Continued)
[Figure: dataset over Attribute1/Attribute2 with four numbered
representatives (1-4) and the clusters formed around them.]
9
Representative-Based Supervised Clustering
(Continued)
[Figure: the same dataset with representatives 1-4 as above.]
Objective of RSC: Find a subset O_R of O such that
the clustering X obtained by using the objects
in O_R as representatives minimizes q(X).
10
SC Algorithms Currently Investigated
  1. Supervised Partitioning Around Medoids (SPAM)
  2. Single Representative Insertion/Deletion Steepest
    Descent Hill Climbing with Randomized Restart
    (SRIDHCR)
  3. Top Down Splitting Algorithm (TDS)
  4. Supervised Clustering using Evolutionary
    Computing (SCEC)
  5. Agglomerative Hierarchical Supervised Clustering
    (AHSC)
  6. Grid-Based Supervised Clustering (GRIDSC)

Remark: For a more detailed discussion of SCEC
and SRIDHCR see [EZZ04].
11
A Fitness Function for Supervised Clustering
  • q(X) = Impurity(X) + β · Penalty(k)
  • Impurity(X) = percentage of minority examples;
    Penalty(k) = sqrt((k − c)/n) if k > c, and 0
    otherwise.

k = number of clusters used; n = number of examples in
the dataset; c = number of classes in the dataset;
β = weight for Penalty(k), 0 < β ≤ 2.0.
Penalty(k) increases sub-linearly because increasing the
number of clusters from k to k+1 has a greater effect on
the end result when k is small than when it is large;
hence the formula above.
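
A small computational sketch of the fitness function above, with
clusters represented as lists of class labels:

  import math

  def fitness_q(clusters, c, beta):
      """q(X) = Impurity(X) + beta * Penalty(k); impurity is the
      fraction of examples not in their cluster's majority class."""
      n = sum(len(cl) for cl in clusters)
      k = len(clusters)
      minority = sum(len(cl) - max(cl.count(lbl) for lbl in set(cl))
                     for cl in clusters if cl)
      penalty = math.sqrt((k - c) / n) if k > c else 0.0
      return minority / n + beta * penalty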
12
Algorithm SRIDHCR (Greedy Hill Climbing)
  • Highlights:
  • k is not an input parameter; SRIDHCR searches for
    the best k within the range that is induced by β.
  • Reports the best clustering found in r runs
    (the search loop is sketched below).
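
A sketch of the insertion/deletion search loop, under illustrative
assumptions (q_eval scores a set of representative indices, lower
is better; the initial solution size and restart count are
placeholders):

  import random

  def sridhcr(n_objects, n_classes, q_eval, runs=10):
      """Steepest-descent hill climbing over sets of representatives:
      in each step try adding or removing one representative, move to
      the best neighbor, and restart from a random solution each run."""
      best, best_q = None, float("inf")
      for _ in range(runs):
          reps = frozenset(random.sample(range(n_objects), n_classes))
          while True:
              neighbors = [reps | {i} for i in range(n_objects) if i not in reps]
              neighbors += [reps - {i} for i in reps if len(reps) > 1]
              candidate = min(neighbors, key=q_eval)
              if q_eval(candidate) >= q_eval(reps):
                  break  # no single insertion/deletion improves q
              reps = candidate
          if q_eval(reps) < best_q:
              best, best_q = reps, q_eval(reps)
      return best, best_q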

13
Supervised Clustering using Evolutionary
Computing: SCEC
[Flow chart: an initial generation of candidate solutions evolves
via mutation, crossover, and copying into the next generation; the
best solution of the final generation is reported as the result.]
14
The complete flow chart of SCEC
15
Complex1 Dataset
16
Supervised Clustering Result
17
Supervised Clustering --- Algorithms and
Applications
  • Organization of the Talk
  • Supervised Clustering
  • Representative-based Supervised Clustering
    Algorithms
  • Applications Using Supervised Clustering for
  • Dataset Editing
  • Class Decomposition
  • Distance Function Learning
  • Region Discovery in Spatial Datasets
  • Other Activities I am Involved With

18
Nearest Neighbour Rule
Consider a two-class problem where each sample
consists of two measurements (x, y).
k = 1: For a given query point q, assign the class of
the nearest neighbour.
k = 3: Compute the k nearest neighbours and assign the
class by majority vote (a minimal sketch follows below).
Problem: requires a good distance function.
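
A minimal version of the rule (samples as ((x, y), label) pairs;
Euclidean distance assumed):

  import math
  from collections import Counter

  def knn_classify(query, samples, k=3):
      """Assign the majority class among the k nearest neighbours."""
      nearest = sorted(samples, key=lambda s: math.dist(query, s[0]))[:k]
      return Counter(label for _, label in nearest).most_common(1)[0][0]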
19
3a. Dataset Reduction: Editing
  • Training data may contain noise and overlapping
    classes.
  • Editing seeks to remove noisy points and produce
    smooth decision boundaries, often by retaining
    points far from the decision boundaries.
  • Main goal of editing: enhance the accuracy of the
    classifier (% of unseen examples classified
    correctly).
  • Secondary goal of editing: enhance the speed of a
    k-NN classifier.

20
Wilson Editing
  • Wilson 1972
  • Remove points that do not agree with the majority
    of their k nearest neighbours (sketched below).

[Figures: the earlier example and an overlapping-classes
example, each shown as original data and after Wilson
editing with k = 7.]
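
A sketch of Wilson editing in the same style, reusing knn_classify
from the nearest-neighbour slide (leave-one-out over the training
set):

  def wilson_edit(samples, k=7):
      """Keep only points whose label agrees with the majority vote
      of their k nearest neighbours among the remaining points."""
      kept = []
      for i, (point, label) in enumerate(samples):
          rest = samples[:i] + samples[i + 1:]
          if knn_classify(point, rest, k=k) == label:
              kept.append((point, label))
      return kept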
21
RSC → Dataset Editing
[Figure: two panels over Attribute1/Attribute2 with clusters
labeled A-F:
a. Dataset clustered using supervised clustering.
b. Dataset edited using cluster representatives.]
22
Experimental Evaluation
  • We compared a traditional 1-NN, 1-NN using Wilson
    Editing, Supervised Clustering Editing (SCE), and
    C4.5 (run with its default parameter settings).
  • A benchmark consisting of 8 UCI datasets was used
    for this purpose.
  • Accuracies were computed using 10-fold cross
    validation.
  • SRIDHCR was used for supervised clustering.
  • SCE was tested at different compression rates
    by associating different penalties with the
    number of clusters found (by setting parameter β
    to 0.1, 0.4, and 1.0).
  • Compression rates of SCE and Wilson Editing were
    computed as 1 − (k/n), with n being the size of
    the original dataset and k the size of the
    edited dataset (see the example below).
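
For example, compressing Glass (n = 214) down to the k = 6
representatives found at β = 1.0 (Table 3) gives a compression
rate of 1 − 6/214 ≈ 97.2%.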

23
Table 2: Prediction accuracy for the four classifiers.

Dataset (size)        β    SCE (NR)  Wilson  1-NN   C4.5
Glass (214)           0.1  0.636     0.607   0.692  0.677
                      0.4  0.589     0.607   0.692  0.677
                      1.0  0.575     0.607   0.692  0.677
Heart-Stat Log (270)  0.1  0.796     0.804   0.767  0.782
                      0.4  0.833     0.804   0.767  0.782
                      1.0  0.838     0.804   0.767  0.782
Diabetes (768)        0.1  0.736     0.734   0.690  0.745
                      0.4  0.736     0.734   0.690  0.745
                      1.0  0.745     0.734   0.690  0.745
Vehicle (846)         0.1  0.667     0.716   0.700  0.723
                      0.4  0.667     0.716   0.700  0.723
                      1.0  0.665     0.716   0.700  0.723
Heart-H (294)         0.1  0.755     0.809   0.783  0.802
                      0.4  0.793     0.809   0.783  0.802
                      1.0  0.809     0.809   0.783  0.802
Waveform (5000)       0.1  0.834     0.796   0.768  0.781
                      0.4  0.841     0.796   0.768  0.781
                      1.0  0.837     0.796   0.768  0.781
Iris-Plants (150)     0.1  0.947     0.936   0.947  0.947
                      0.4  0.973     0.936   0.947  0.947
                      1.0  0.953     0.936   0.947  0.947
Segmentation (2100)   0.1  0.938     0.966   0.956  0.968
                      0.4  0.919     0.966   0.956  0.968
                      1.0  0.890     0.966   0.956  0.968

(Wilson, 1-NN, and C4.5 do not depend on β; their accuracies
are repeated for ease of comparison.)
24
Table 3: Dataset compression rates for SCE and Wilson Editing.

Dataset (size)        β    Avg. k [min-max] for SCE  SCE compression rate (%)  Wilson compression rate (%)
Glass (214)           0.1  34 [28-39]                84.3                      27
                      0.4  25 [19-29]                88.4                      27
                      1.0  6 [6-6]                   97.2                      27
Heart-Stat Log (270)  0.1  15 [12-18]                94.4                      22.4
                      0.4  2 [2-2]                   99.3                      22.4
                      1.0  2 [2-2]                   99.3                      22.4
Diabetes (768)        0.1  27 [22-33]                96.5                      30.0
                      0.4  9 [2-18]                  98.8                      30.0
                      1.0  2 [2-2]                   99.7                      30.0
Vehicle (846)         0.1  57 [51-65]                97.3                      30.5
                      0.4  38 [26-61]                95.5                      30.5
                      1.0  14 [9-22]                 98.3                      30.5
Heart-H (294)         0.1  14 [11-18]                95.2                      21.9
                      0.4  2                         99.3                      21.9
                      1.0  2                         99.3                      21.9
Waveform (5000)       0.1  104 [79-117]              97.9                      23.4
                      0.4  28 [20-39]                99.4                      23.4
                      1.0  4 [3-6]                   99.9                      23.4
Iris-Plants (150)     0.1  4 [3-8]                   97.3                      6.0
                      0.4  3 [3-3]                   98.0                      6.0
                      1.0  3 [3-3]                   98.0                      6.0
Segmentation (2100)   0.1  57 [48-65]                97.3                      2.8
                      0.4  30 [24-37]                98.6                      2.8
                      1.0  14                        99.3                      2.8
25
Summary: SCE and Wilson Editing
  • Wilson editing enhances the accuracy of a
    traditional 1-NN classifier for six of the eight
    datasets tested. It achieved compression rates of
    approx. 25%, but much lower compression rates for
    easy datasets.
  • SCE achieved very high compression rates without
    loss in accuracy for 6 of the 8 datasets tested.
  • SCE accomplished a significant improvement in
    accuracy for 3 of the 8 datasets tested.
  • Surprisingly, many UCI datasets can be compressed
    to just a single representative per class
    without a significant loss in accuracy.
  • SCE tends to pick representatives that are in the
    center of a region that is dominated by a single
    class; it removes examples that are classified
    correctly as well as examples that are classified
    incorrectly from the dataset. This explains its
    much higher compression rates.
  • Remark: For a more detailed evaluation of SCE,
    Wilson Editing, and other editing techniques see
    [EZV04] and [ZWE05].

26
Future Direction of this Research
[Diagram: the original Data Set is transformed by a
preprocessing step p into a new data set; an inductive
learning algorithm (IDLA) is applied to both, yielding
classifiers C and C'.]
Goal: Find p such that C' is more accurate than C,
or C and C' have approximately the same accuracy,
but C' can be learnt more quickly and/or C'
classifies new examples more quickly.
27
Supervised Clustering vs. Clustering the Examples
of Each Class Separately
  • Approaches to discover subclasses of a given
    class:
  • Cluster the examples of each class separately
  • Use supervised clustering

Figure 4. Supervised clustering editing vs.
clustering each class (x and o) separately.
Remark: A traditional clustering algorithm, such
as k-medoids, picks its cluster representative
blindly with respect to how the examples of the
other classes are distributed, whereas supervised
clustering picks a different representative.
The traditional choice is not a good one for
editing, because it attracts points of the class
x, which leads to misclassifications.
28
Applications of Supervised Clustering: 3.b Class
Decomposition (see also [VAE03])
[Figure: three panels over Attribute 1/Attribute 2 showing how
decomposing each class into subclasses lets simple decision
boundaries separate the data.]
  • Simple classifiers:
  • Encompass a small class of approximating
    functions.
  • Limited flexibility in their decision boundaries
    (class decomposition, sketched below, compensates
    for this).
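
A brief sketch of the relabeling idea behind class decomposition
(cluster_fn is a hypothetical clustering helper that returns a
subclass id per example; a simple classifier is then trained on the
finer labels and its predictions are mapped back to the parent
class):

  def decompose_classes(X, y, cluster_fn):
      """Replace each class label with a (class, subclass) pair,
      where the subclass comes from clustering that class's examples;
      the finer labels give a simple classifier more flexible
      decision boundaries."""
      new_labels = list(y)
      for cls in set(y):
          idx = [i for i, lbl in enumerate(y) if lbl == cls]
          subids = cluster_fn([X[i] for i in idx])
          for i, sub in zip(idx, subids):
              new_labels[i] = (cls, sub)
      return new_labels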
29
Naïve Bayes vs. Naïve Bayes with Class
Decomposition
30
Example: How to Find Similar Patients?
3c. Using Clustering in Distance Function Learning
  • The following relation is given (with 10000
    tuples):
  • Patient(ssn, weight, height, cancer-sev,
    eye-color, age, ...)
  • Attribute Domains:
  • ssn: 9 digits
  • weight: between 30 and 650; μweight = 158,
    σweight = 24.20
  • height: between 0.30 and 2.20 in meters;
    μheight = 1.52, σheight = 19.2
  • cancer-sev: 4 = serious, 3 = quite_serious,
    2 = medium, 1 = minor
  • eye-color: brown, blue, green, grey
  • age: between 3 and 100; μage = 45, σage = 13.2
  • Task: Define Patient Similarity

31
CAL-FULL/UH Database Clustering Similarity
Assessment Environments
[Architecture diagram with the components: User Interface (type
and weight information; default choices and domain information),
Data Extraction Tool, DBMS, Object View, Training Data, a library
of similarity measures feeding a Similarity Measure Tool, a
library of clustering algorithms feeding a Clustering Tool, and a
Learning Tool connecting the training data, a set of clusters,
and the similarity measure; the learning path is marked
"Today's topic".]
For more details see [RE05].
32
Similarity Assessment Framework and Objectives
  • Objective: Learn a good distance function q for
    classification tasks.
  • Our approach: Apply a clustering algorithm with
    the distance function q to be evaluated; it
    returns a number of clusters k. The purer the
    obtained clusters are, the better the quality
    of q.
  • Our goal is to learn the weights of an object
    distance function q such that all the clusters
    are pure (or as pure as possible); for more
    details see the [ERBV05] and [BECV05] papers.
    A small sketch of the evaluation follows below.
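
A sketch of this evaluation idea, assuming a weighted Euclidean
object distance whose per-attribute weights are the quantities
being learned:

  import math

  def weighted_distance(a, b, weights):
      """Object distance q: weighted Euclidean over numeric attributes."""
      return math.sqrt(sum(w * (x - y) ** 2
                           for w, x, y in zip(weights, a, b)))

  def cluster_purity(clusters):
      """Fraction of examples belonging to their cluster's majority
      class; purer clusters indicate a better distance function."""
      n = sum(len(cl) for cl in clusters)
      majority = sum(max(cl.count(lbl) for lbl in set(cl))
                     for cl in clusters if cl)
      return majority / n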

33
Idea: Coevolving Clusters and Distance Functions
[Diagram: a loop in which a Weight Updating Scheme / Search
Strategy modifies the Distance Function Q, the distance function
induces a Clustering X, and the clustering evaluation q(X)
measures the goodness of Q. Two example clusterings of o/x points
contrast a bad distance function Q1 with a good distance
function Q2.]
34
Idea: Inside/Outside Weight Updating
o = examples belonging to the majority class;
x = non-majority-class examples.
Idea: Move examples of the majority class closer
to each other (one update step is sketched below).
Cluster1 distances with respect to Att1:
  xo oo ox
Action: Increase weight of Att1.
Cluster1 distances with respect to Att2:
  o o xx o o
Action: Decrease weight for Att2.
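
A sketch of a single update step; the multiplicative rule and the
learning rate alpha are illustrative choices, not necessarily the
exact scheme of [ERBV05]:

  def update_weight(weight, avg_inside, avg_outside, alpha=0.1):
      """Increase an attribute's weight when majority-class examples
      lie closer to each other (inside) than to non-majority examples
      (outside) along that attribute, as for Att1 above; otherwise
      decrease it, as for Att2."""
      if avg_inside < avg_outside:
          return weight * (1 + alpha)
      return weight * (1 - alpha)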
35
Sample Run of IOWU for Diabetes Dataset
Graph produced by Abraham Bagherjeiran
36
Research Framework: Distance Function Learning
[Diagram: distance function evaluation methods (K-Means,
Supervised Clustering, NN-Classifier, other research) paired
with weight-updating schemes / search strategies (Inside/Outside
Weight Updating [ERBV05], Randomized Hill Climbing, Adaptive
Clustering [BECV05], work by Karypis).]
37
3.d Discovery of Interesting Regions for
Spatial Data Mining
  • Task: 2D/3D datasets are given; discover
    interesting regions in the dataset that maximize
    a given fitness function. Examples of region
    discovery include:
  • Discover regions that have significant deviations
    from the prior probability of a class; e.g.
    regions in the state of Wyoming where people are
    very poor or not poor at all
  • Discover regions that have significant variation
    in income (fitness is defined based on the
    variance with respect to income in a region)
  • Discover regions for congressional redistricting
  • Discover congested regions for traffic control
  • Remark: We use (supervised) clustering to
    discover such regions; regions are implicitly
    defined by the set of points that belong to a
    cluster.

38
Wyoming Map
39
Household Income in 1999: Wyoming Park County
40
Clusters → Regions
Example: 2 clusters, in red and blue, are given;
regions are defined by using a Voronoi diagram
based on a NN classifier with k = 7; regions are
in grey and white.
41
An Evaluation Scheme for Discovering Regions that
Deviate from the Prior Probability of a Class C
Let prior(C) = |C|/n, and let p(c,C) be the percentage of
examples in region c that belong to class C. Reward(c) is
computed based on p(c,C), prior(C), and the parameters
γ1, γ2, R+, R− (γ1 ≤ 1 ≤ γ2; R+, R− ≥ 0), relying on the
following interpolation function t
(e.g. γ1 = 0.8, γ2 = 1.2, R+ = 1, R− = 1):

q_C(X) = Σ_{c∈X} t(p(c,C), prior(C), γ1, γ2, R+, R−) · |c|^β / n

with β > 1 (typically 1.0001 < β < 2); the idea is that
increases in cluster size are rewarded nonlinearly,
favoring clusters with more points as long as
|c| · t(...) increases. A small sketch of this scheme
follows below.
[Figure: the interpolation function t over p(c,C): Reward(c)
is 0 between prior(C)·γ1 and prior(C)·γ2, rises toward R+ as
p(c,C) approaches 1, and toward R− as p(c,C) approaches 0.]
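
A sketch of this evaluation scheme; the linear shape assumed for t
is consistent with the figure, and regions are encoded as lists of
booleans (True = example belongs to class C):

  def t(p, prior, g1, g2, r_plus, r_minus):
      """Reward 0 inside the band [prior*g1, prior*g2]; interpolate
      linearly toward R+ as p -> 1 and toward R- as p -> 0."""
      if p >= prior * g2:
          return r_plus * (p - prior * g2) / (1 - prior * g2)
      if p <= prior * g1:
          return r_minus * (prior * g1 - p) / (prior * g1)
      return 0.0

  def q_C(regions, prior, beta=1.01, g1=0.8, g2=1.2,
          r_plus=1.0, r_minus=1.0):
      """q_C(X) = (1/n) * sum over regions c of t(...) * |c|**beta."""
      n = sum(len(r) for r in regions)
      return sum(t(sum(r) / len(r), prior, g1, g2, r_plus, r_minus)
                 * len(r) ** beta for r in regions) / n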
42
Example: Discovery of Interesting Regions in
Wyoming Census 2000 Datasets
43
Supervised Clustering --- Algorithms and
Applications
  • Organization of the Talk
  • Supervised Clustering
  • Representative-based Supervised Clustering
    Algorithms
  • Applications Using Supervised Clustering for
  • Dataset Editing
  • Class Decomposition
  • Distance Function Learning
  • Region Discovery in Spatial Datasets
  • Other Activities I am Involved With

44
An Environment for Adaptive (Supervised)
Clustering for Summary Generation Applications
[Architecture diagram: a Clustering Algorithm receives Inputs
and produces a Clustering and a Summary; an Evaluation System,
relying on predefined fitness functions q(X), ..., reports
quality; feedback flows to an Adaptation System that, using
Past Experience and input from a Domain Expert, sends changes
back to the clustering algorithm's inputs.]
Idea: Development of a generic clustering/feedback/adaptation
architecture whose objective is to facilitate the search for
clusterings that maximize an internally and/or an externally
given reward function (for some initial ideas see [BECV05]).
45
Clustering Algorithm Inputs
  • Data Set Examples
  • Data Set Feature Representation
  • Distance Function
  • Clustering Algorithm Parameters
  • Fitness Function Parameters
  • Background Knowledge
46
Research Topics 2005/2006
  • Inductive Learning/Data Mining
  • Decision trees, nearest neighbor classifiers
  • Using clustering to enhance classification
    algorithms
  • Making sense of data
  • Supervised Clustering
  • Learning subclasses
  • Supervised clustering algorithms that learn
    clusters with arbitrary shape
  • Using supervised clustering for region discovery
  • Adaptive clustering
  • Tools for Similarity Assessment and Distance
    Function Learning
  • Data Set Compression and Creating Meta Knowledge
    for Local Learning Techniques
  • Comparative studies
  • Creating maps and other data set signatures for
    datasets based on editing, SC, and other
    techniques
  • Traditional Clustering
  • Data Mining and Information Retrieval for
    Structured Data
  • Other: Evolutionary Computing, File Prediction,
    Ontologies, Heuristic Search, Reinforcement
    Learning, Data Models.

Remark: Topics that were covered in this talk
are shown in blue.
47
Links to 7 Papers
[VAE03] R. Vilalta, M. Achari, C. Eick, Class
Decomposition via Clustering: A New Framework for
Low-Variance Classifiers, in Proc. IEEE
International Conference on Data Mining (ICDM),
Melbourne, Florida, November 2003.
http://www.cs.uh.edu/~ceick/kdd/VAE03.pdf
[EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised
Clustering --- Algorithms and Benefits, short
version appeared in Proc. International
Conference on Tools with AI (ICTAI), Boca Raton,
Florida, November 2004.
http://www.cs.uh.edu/~ceick/kdd/EZZ04.pdf
[EZV04] C. Eick, N. Zeidat, R. Vilalta, Using
Representative-Based Clustering for Nearest
Neighbor Dataset Editing, in Proc. IEEE
International Conference on Data Mining (ICDM),
Brighton, England, November 2004.
http://www.cs.uh.edu/~ceick/kdd/EZV04.pdf
[RE05] T. Ryu and C. Eick, A Clustering Methodology
and Tool, in Information Sciences 171(1-3): 29-59
(2005). http://www.cs.uh.edu/~ceick/kdd/RE05.doc
[ERBV05] C. Eick, A. Rouhana, A. Bagherjeiran, R.
Vilalta, Using Clustering to Learn Distance
Functions for Supervised Similarity Assessment,
in Proc. MLDM'05, Leipzig, Germany, July 2005.
http://www.cs.uh.edu/~ceick/kdd/ERBV05.pdf
[ZWE05] N. Zeidat, S. Wang, C. Eick, Editing
Techniques: a Comparative Study, submitted for
publication. http://www.cs.uh.edu/~ceick/kdd/ZWE05.pdf
[BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen,
R. Vilalta, Adaptive Clustering: Obtaining Better
Clusters Using Feedback and Past Experience,
submitted for publication.
http://www.cs.uh.edu/~ceick/kdd/BECV05.pdf