Title: Using Clustering to Learn Distance Functions for Supervised Similarity Assessment
Slide 1: Using Clustering to Learn Distance Functions for Supervised Similarity Assessment
- Christoph F. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta
- Department of Computer Science, University of Houston
- Organization of the Talk
  - Similarity Assessment
  - A Framework for Distance Function Learning
  - Inside/Outside Weight Updating
  - Distance Function Learning Research at UH-DMML
  - Experimental Evaluation
  - Other Distance Function Learning Research
  - Summary
Slide 2: 1. Similarity Assessment
- Definition: Similarity assessment is the task of determining which objects are similar to each other and which are dissimilar to each other.
- Goal of similarity assessment: construct a distance function!
- Applications of similarity assessment
  - Case-based reasoning
  - Classification techniques that rely on distance functions
  - Clustering
- Complications
  - Usually, there is no universal good distance function for a set of objects; the usefulness of a distance function depends on the task it is used for (no free lunch in similarity assessment either).
  - Defining the distance between objects is more an art than a science.
Slide 3: Motivating Example: How To Find Similar Patients?
- The following relation is given (with 10000 tuples): Patient(ssn, weight, height, cancer-sev, eye-color, age, ...)
- Attribute domains
  - ssn: 9 digits
  - weight: between 30 and 650; mean 158, standard deviation 24.20
  - height: between 0.30 and 2.20 (in meters); mean 1.52, standard deviation 19.2
  - cancer-sev: 4 = serious, 3 = quite_serious, 2 = medium, 1 = minor
  - eye-color: brown, blue, green, grey
  - age: between 3 and 100; mean 45, standard deviation 13.2
- Task: Define patient similarity (an illustrative sketch follows below).
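Defining such a distance is a design choice. The following is a minimal sketch (our illustration, not part of the talk) that z-normalizes the numeric attributes with the statistics above, treats cancer severity as an ordinal scale, treats eye color as a nominal attribute, and ignores ssn:

```python
# Illustrative sketch only: one of many possible Patient distance functions.
# Attribute statistics come from the slide; the functional form is ours.

CANCER_SEV = {"minor": 1, "medium": 2, "quite_serious": 3, "serious": 4}

def patient_distance(p, q):
    """p, q: dicts with keys weight, height, age, cancer_sev, eye_color."""
    d = 0.0
    # Numeric attributes: absolute difference after z-normalization.
    for attr, (mean, std) in {"weight": (158, 24.20),
                              "height": (1.52, 19.2),
                              "age": (45, 13.2)}.items():
        d += abs((p[attr] - mean) / std - (q[attr] - mean) / std)
    # Ordinal attribute: difference of severity levels, scaled to [0, 1].
    d += abs(CANCER_SEV[p["cancer_sev"]] - CANCER_SEV[q["cancer_sev"]]) / 3.0
    # Nominal attribute: 0 if equal, 1 otherwise.
    d += 0.0 if p["eye_color"] == q["eye_color"] else 1.0
    # ssn is an identifier and carries no similarity information.
    return d

a = {"weight": 150, "height": 1.60, "age": 40, "cancer_sev": "minor", "eye_color": "blue"}
b = {"weight": 190, "height": 1.80, "age": 70, "cancer_sev": "serious", "eye_color": "brown"}
print(patient_distance(a, b))
```

Whether a full eye-color mismatch should count as much as one standard deviation of age is exactly the kind of task-dependent judgment the previous slide warns about.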
Slide 4: CAL-FULL/UH Database Clustering & Similarity Assessment Environments
- For more details see [RE05].
- [Architecture diagram: a user interface and a data extraction tool on top of the DBMS provide an object view (using type and weight information, default choices, and domain information); a similarity measure tool backed by a library of similarity measures and a clustering tool backed by a library of clustering algorithms operate on that view; a learning tool derives a similarity measure from training data and a set of clusters. The learning tool is today's topic.]
Slide 5: 2. A Framework for Distance Function Learning
- Assumption: The distance between two objects is computed as the weighted sum of the distances with respect to their attributes (see the formula below).
- Objective: Learn a good distance function q for classification tasks.
- Our approach: Apply a clustering algorithm with the object distance function q to be evaluated; the algorithm returns k clusters.
- Our goal is to learn the weights of the object distance function q such that pure clusters are obtained (or clusters that are as pure as possible) --- a pure cluster contains examples belonging to a single class.
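Written out (our notation; the slide states this assumption only in words): for objects o and o' with p attributes, per-attribute distances d_i, and a weight vector w = (w_1, ..., w_p),

```latex
d_w(o, o') \;=\; \sum_{i=1}^{p} w_i \, d_i(o_i, o'_i)
```

Learning the distance function thus reduces to learning the weight vector w.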
Slide 6: Idea: Coevolving Clusters and Distance Functions
- [Diagram: a weight-updating scheme / search strategy produces a distance function Q; a clustering X is obtained with Q; the clustering evaluation q(X) measures the goodness of Q. A bad distance function Q1 yields clusters that mix majority-class (o) and non-majority-class (x) examples; a good distance function Q2 yields pure clusters. A purity-based evaluation sketch follows below.]
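A minimal sketch of this evaluation loop (our illustration, assuming scikit-learn's KMeans; folding the weights into the data by scaling each attribute with sqrt(w_i) stands in for a weighted object distance):

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_goodness(X, y, weights, k, random_state=0):
    """Cluster X under attribute weights and return the size-weighted average cluster purity.

    Illustration only: k-means on sqrt(w)-scaled attributes approximates a weighted
    (squared-)Euclidean object distance; the framework allows other clustering algorithms.
    """
    Xw = X * np.sqrt(weights)                     # fold the weights into the data
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(Xw)
    purities, sizes = [], []
    for c in range(k):
        members = y[labels == c]
        if len(members) == 0:
            continue
        _, counts = np.unique(members, return_counts=True)
        purities.append(counts.max() / len(members))   # fraction of the majority class
        sizes.append(len(members))
    return np.average(purities, weights=sizes)          # 1.0 means every cluster is pure
```

A weight vector that yields a higher average purity is, under this criterion, the better distance function.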
Slide 7: 3. Inside/Outside Weight Updating
- Legend: o = examples belonging to the majority class; x = non-majority-class examples.
- Idea: Move examples of the majority class closer to each other.
- Cluster1, distances with respect to Att1: xo oo ox --> Action: increase the weight of Att1.
- Cluster1, distances with respect to Att2: o o xx o o --> Action: decrease the weight of Att2.
Slide 8: Inside/Outside Weight Updating Algorithm
1. Cluster the dataset with k-means using the current weight vector w = (w1, ..., wp).
2. FOR EACH cluster-attribute pair DO: modify w using inside/outside weight updating.
3. IF NOT DONE, CONTINUE with Step 1; OTHERWISE, RETURN w.
(A sketch of this loop follows below.)
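A minimal sketch of the loop (our illustration, assuming scikit-learn's KMeans and NumPy arrays X, y). The exact update formula appears only as an image in the original slides, so the multiplicative update with rate alpha below is an assumption that merely mirrors the increase/decrease idea of Slide 7:

```python
import numpy as np
from sklearn.cluster import KMeans

def avg_pairwise(values):
    """Average absolute pairwise distance of a 1-d array (0.0 if fewer than 2 values)."""
    if len(values) < 2:
        return 0.0
    diffs = np.abs(values[:, None] - values[None, :])
    n = len(values)
    return diffs.sum() / (n * (n - 1))

def iowu(X, y, k=5, alpha=0.3, iterations=200):
    """Assumed sketch of inside/outside weight updating (not the paper's exact formula)."""
    n, p = X.shape
    w = np.ones(p)
    for _ in range(iterations):
        # Step 1: cluster under the current weights (sqrt(w)-scaling ~ weighted distance).
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X * np.sqrt(w))
        # Step 2: one update per cluster-attribute pair.
        for c in range(k):
            members = labels == c
            classes, counts = np.unique(y[members], return_counts=True)
            if counts.size == 0:
                continue
            majority = classes[np.argmax(counts)]
            inside = members & (y == majority)        # majority-class examples of the cluster
            for i in range(p):
                d_in = avg_pairwise(X[inside, i])     # spread of the majority class on attr. i
                d_all = avg_pairwise(X[members, i])   # spread of the whole cluster on attr. i
                if d_all > 0:
                    # attribute i is "good" when majority examples are relatively close:
                    # increase its weight, otherwise decrease it (rate alpha; assumption).
                    w[i] *= 1.0 + alpha * (d_all - d_in) / d_all
        w = np.clip(w, 1e-3, None)
        w *= p / w.sum()                              # keep the weights on a comparable scale
    return w
```

With alpha = 0.3 and 200 iterations this matches the settings reported on the evaluation slide.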
Slide 9: Inside/Outside Weight Updating Heuristic
- The weight wi of the i-th attribute is updated for a given cluster according to a weight-update formula (shown as an image in the original slide).
- Example 1 (distances with respect to an attribute): xo oo ox
- Example 2: o o xx o o
Slide 10: Idea: Inside/Outside Weight Updating
- [Diagram: a cluster Clusterk containing six objects (labeled 1-6), shown with respect to Attribute1, Attribute2, and Attribute3.]
- Initial weights: w1 = w2 = w3 = 1. Updated weights: w1 = 1.14, w2 = 1.32, w3 = 0.84.
Slide 11: Illustration: Net Effect of Weight Adjustments
- [Diagram: the same cluster Clusterk (objects 1-6), comparing the old object distances with the new object distances after the weight adjustments.]
Slide 12: A Slightly Enhanced Weight Update Formula
- [Formula shown as an image in the original slide.]
Slide 13: Sample Run of IOWU (Inside/Outside Weight Updating) for the Diabetes Dataset
- [Chart shown in the original slide.]
Slide 14: 4. Distance Function Learning Research at UH-DMML
- [Diagram pairing distance-function evaluation methods with weight-updating schemes / search strategies: K-Means, Supervised Clustering, an NN-Classifier, and Adaptive Clustering on the evaluation side; Inside/Outside Weight Updating, Randomized Hill Climbing, and other approaches on the search side. References in the diagram: ERBV04, EZZ04 (current research), work by Karypis, BECV05.]
Slide 15: 5. Experimental Evaluation
- Used a benchmark consisting of 7/15 UCI datasets.
- Inside/outside weight updating was run for 200 iterations.
- α was set to 0.3.
- Evaluation (10-fold cross-validation, repeated 10 times, was used to determine accuracy; a protocol sketch follows below):
  - Used a 1-NN classifier as the baseline classifier.
  - Used the learned distance function for a 1-NN classifier.
  - Used the learned distance function for an NCC classifier (new!).
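As an illustration of the protocol (our sketch, using scikit-learn; the weights would come from the learning procedure above):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate_1nn(X, y, weights):
    """Accuracy of a 1-NN classifier under a weighted Euclidean distance,
    estimated with 10-fold cross-validation repeated 10 times (as on the slide)."""
    Xw = X * np.sqrt(weights)   # sqrt(w)-scaling == weighted Euclidean distance for 1-NN
    clf = KNeighborsClassifier(n_neighbors=1)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(clf, Xw, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()
```

The 1-NN baseline corresponds to a weight vector of all ones; the learned-distance variant passes the weights produced by the updating scheme.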
Slide 16: NCC-Classifier
- Idea: the training set is replaced by k (centroid, majority class) pairs that are computed using k-means; the dataset generated in this way is then used to classify the examples in the test set (a sketch follows below).
- [Figure: (a) the dataset clustered by k-means; (b) the dataset edited down to the cluster centroids (A-F), each carrying the class label of its cluster's majority class. Both panels are plotted over Attribute1 and Attribute2.]
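A minimal sketch of the NCC idea (our illustration, assuming scikit-learn's KMeans, integer class labels, and no empty clusters; the choice of k is an input parameter, as discussed in the summary):

```python
import numpy as np
from sklearn.cluster import KMeans

class NCCClassifier:
    """The training set is replaced by k (centroid, majority-class) pairs obtained
    with k-means; test examples get the class of the nearest centroid."""

    def __init__(self, k=10):
        self.k = k

    def fit(self, X, y):
        km = KMeans(n_clusters=self.k, n_init=10, random_state=0).fit(X)
        self.centroids_ = km.cluster_centers_
        # majority class of each cluster (assumes integer class labels, no empty clusters)
        self.centroid_labels_ = np.array(
            [np.bincount(y[km.labels_ == c]).argmax() for c in range(self.k)])
        return self

    def predict(self, X):
        # distance of every test example to every centroid, then pick the nearest one
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.centroid_labels_[dists.argmin(axis=1)]

# usage: NCCClassifier(k=10).fit(X_train, y_train).predict(X_test)
```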
Slide 17: Experimental Evaluation
- [Results table shown in the original slide; statistically significant improvements are marked in red.]
Slide 18: DF-Learning With Randomized Hill Climbing
- Random: a random number; a: the rate of change, drawn for example from [-0.3, 0.3].
- Generate R solutions in the neighborhood of w and pick the best one to be the new weight vector w (a sketch follows below).
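A minimal sketch of this search (our illustration; R, the change range, and the maximization of a generic fitness are placeholders for whatever evaluation is plugged in):

```python
import numpy as np

def randomized_hill_climbing(w, fitness, R=20, change=0.3, iterations=50, rng=None):
    """Generate R random neighbors of the weight vector w by perturbing each weight
    by a factor drawn from [-change, +change]; move to the best neighbor whenever it
    improves the (to-be-maximized) fitness."""
    rng = rng or np.random.default_rng(0)
    best_w = np.asarray(w, dtype=float)
    best_f = fitness(best_w)
    for _ in range(iterations):
        neighbors = [best_w * (1 + rng.uniform(-change, change, size=len(best_w)))
                     for _ in range(R)]
        fits = [fitness(nb) for nb in neighbors]
        i = int(np.argmax(fits))
        if fits[i] > best_f:
            best_w, best_f = neighbors[i], fits[i]
    return best_w, best_f
```

Here, fitness could be, for instance, the cluster-purity score sketched for Slide 6 or the cross-validated 1-NN accuracy.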
Slide 19: Accuracy: IOWU and Randomized Hill Climbing
- [Results chart shown in the original slide.]
Slide 20: Distance Function Learning With Adaptive Clustering
- Uses reinforcement learning to adapt distance functions for k-means clustering.
- Employs search strategies that explore multiple paths in parallel. The algorithm maintains an open list with maximum size L --- bad performers are dropped from the open list. Currently, beam search is used: it creates 2p successors (increasing and decreasing the weight of each attribute exactly once), evaluates those 2pL successors, and keeps the best L of them.
- Discretizes the search space, in which states are (<weights>, <centroids>) tuples, into a grid, and memorizes and updates the fitness values of the grid cells; value iteration is limited to interesting states by employing prioritized sweeping.
- Weights are updated by increasing / decreasing the weight of an attribute by a randomly chosen percentage that falls within an interval [min-change, max-change]; our current implementation uses [25%, 50%].
- Employs entropy H(X) as the fitness function (low entropy -> pure clusters); an entropy sketch follows below.
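For the fitness H(X), a minimal sketch (our illustration) of the size-weighted class entropy of a clustering:

```python
import numpy as np

def clustering_entropy(labels, y):
    """Size-weighted class entropy H(X) of a clustering; 0 means every cluster is pure."""
    total, h = len(y), 0.0
    for c in np.unique(labels):
        members = y[labels == c]
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()                      # class distribution inside the cluster
        h += (len(members) / total) * -(p * np.log2(p)).sum()
    return h
```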
Slide 21: 6. Related Distance Function Learning Research
- Interactive approaches that use user feedback and reinforcement learning to derive a good distance function.
- Other work uses randomized hill climbing and neural networks to learn distance functions for classification tasks; mostly, NN-queries are used to evaluate the quality of a clustering.
- Other work, mostly in the area of semi-supervised clustering, adapts object distances to cope with constraints.
Slide 22: 7. Summary
- Described an approach that employs clustering for distance function evaluation.
- Introduced an attribute weight updating heuristic called inside/outside weight updating and evaluated its performance.
- The inside/outside weight updating approach enhanced a 1-NN classifier significantly for some UCI datasets, but not for all datasets that were tested.
- The quality of the employed approach depends on the number of clusters k, which is an input parameter; our current research centers on determining k automatically with a supervised clustering algorithm [EZZ04].
- The general idea of replacing a dataset by cluster representatives to enhance NN-classifiers shows a lot of promise in this research (as exemplified by the NCC classifier) and in other research we are currently conducting.
- Distance function learning is quite time-consuming: one run of 200 iterations of inside/outside weight updating takes between 5 seconds and 5 minutes, depending on dataset size and k-value, and other techniques we are currently investigating are significantly slower; therefore, we are currently moving to high-performance computing facilities for the empirical evaluation of the distance function learning approaches.
Slide 23: Links to 4 Papers
- [EZZ04] C. Eick, N. Zeidat, Z. Zhao, "Supervised Clustering --- Algorithms and Benefits," short version in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/ceick/kdd/EZZ04.pdf
- [RE05] T. Ryu and C. Eick, "A Clustering Methodology and Tool," Information Sciences 171(1-3): 29-59 (2005). http://www.cs.uh.edu/ceick/kdd/RE05.doc
- [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, "Using Clustering to Learn Distance Functions for Supervised Similarity Assessment," Proc. MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/ceick/kdd/ERBV05.pdf
- [BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, "Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience," submitted for publication. http://www.cs.uh.edu/ceick/kdd/BECV05.pdf
Slide 24: Questions?
Slide 25: Randomized Hill Climbing
- Fast start: the algorithm starts with a small neighborhood size and keeps it until it cannot find any better solutions; it then increases the neighborhood size by a factor of 3, hoping that a better solution can be found by trying more points.
- Shoulder condition: when the algorithm has moved onto a shoulder or a flat hill, it will keep getting solutions with the same fitness value. Our algorithm terminates when it has tried 3 times and still gets the same result; this prevents it from being trapped on a shoulder forever. (A sketch of both controls follows below.)
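A sketch of these two controls (our illustration; the factor of 3 for neighborhood growth and the 3-try shoulder test follow the slide, everything else is an assumption):

```python
import numpy as np

def fast_start_hill_climbing(w, fitness, R=20, initial_change=0.05,
                             max_rounds=200, rng=None):
    """Randomized hill climbing with a growing neighborhood ("fast start") and a
    shoulder test: stop after 3 rounds that keep returning the same fitness value."""
    rng = rng or np.random.default_rng(0)
    change, same_count = initial_change, 0
    best_w = np.asarray(w, dtype=float)
    best_f = fitness(best_w)
    for _ in range(max_rounds):
        neighbors = [best_w * (1 + rng.uniform(-change, change, size=len(best_w)))
                     for _ in range(R)]
        fits = [fitness(nb) for nb in neighbors]
        i = int(np.argmax(fits))
        if fits[i] > best_f:                  # better solution found: move there
            best_w, best_f, same_count = neighbors[i], fits[i], 0
        else:
            change *= 3                       # "fast start": widen the neighborhood
            if fits[i] == best_f:             # same fitness: possible shoulder / flat hill
                same_count += 1
                if same_count >= 3:           # tried 3 times, still flat: stop
                    break
    return best_w, best_f
```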
Slide 26: Randomized Hill Climbing
- [Sketch of an objective function over the state space, illustrating a flat hill and a shoulder.]
Slide 27: Purity in Clusters Obtained (Internal)
- [Results chart shown in the original slide.]
Slide 28: Purity in Clusters Obtained (Internal)
- [Results chart shown in the original slide.]
Slide 29: Different Forms of Clustering (Ch. Eick)
- Objective of supervised clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
Slide 30: A Fitness Function for Supervised Clustering
- q(X) = Impurity(X) + β · Penalty(k), where k is the number of clusters used, n the number of examples in the dataset, c the number of classes in the dataset, and β the weight of Penalty(k), with 0 < β ≤ 2.0.
- Penalty(k) increases sub-linearly, because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large; hence the formula above (a hedged reconstruction follows below).
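The formula for Penalty(k) itself is not reproduced in this text version. A hedged reconstruction, consistent with the sub-linear growth described above and with the supervised-clustering fitness used in [EZZ04] (treat the exact form as an assumption):

```latex
q(X) = \mathrm{Impurity}(X) + \beta \cdot \mathrm{Penalty}(k), \qquad
\mathrm{Impurity}(X) = \frac{\#\,\text{minority examples}}{n}, \qquad
\mathrm{Penalty}(k) =
\begin{cases}
  \sqrt{\dfrac{k - c}{n}} & \text{if } k \ge c \\[6pt]
  0 & \text{if } k < c
\end{cases}
```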