Title: Using Clustering to Learn Distance Functions for Supervised Similarity Assessment
Slide 1: Using Clustering to Learn Distance Functions for Supervised Similarity Assessment
- Christoph F. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta
- Department of Computer Science, University of Houston
- Organization of the Talk
  - Similarity Assessment
  - A Framework for Distance Function Learning
  - Inside/Outside Weight Updating
  - Distance Function Learning Research at UH-DMML
  - Experimental Evaluation
  - Other Distance Function Learning Research
  - Summary
Slide 2: 1. Similarity Assessment
- Definition: Similarity assessment is the task of determining which objects are similar to each other and which are dissimilar to each other.
- Goal of similarity assessment: construct a distance function!
- Applications of similarity assessment
  - Case-based reasoning
  - Classification techniques that rely on distance functions
  - Clustering
- Complications
  - Usually, there is no universal good distance function for a set of objects; the usefulness of a distance function depends on the task it is used for (no free lunch in similarity assessment either).
  - Defining the distance between objects is more an art than a science.
Slide 3: Motivating Example: How To Find Similar Patients?
- The following relation is given (with 10000 tuples): Patient(ssn, weight, height, cancer-sev, eye-color, age, ...)
- Attribute domains
  - ssn: 9 digits
  - weight: between 30 and 650; mean 158, standard deviation 24.20
  - height: between 0.30 and 2.20 (in meters); mean 1.52, standard deviation 19.2
  - cancer-sev: 4 = serious, 3 = quite_serious, 2 = medium, 1 = minor
  - eye-color: brown, blue, green, grey
  - age: between 3 and 100; mean 45, standard deviation 13.2
- Task: Define patient similarity (an illustrative sketch follows below).
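Defining such a distance is a design choice. The following is a minimal sketch (our illustration, not part of the talk) that z-normalizes the numeric attributes with the statistics above, treats cancer severity as an ordinal scale, treats eye color as a nominal attribute, and ignores ssn:

```python
# Illustrative sketch only: one of many possible Patient distance functions.
# Attribute statistics come from the slide; the functional form is ours.

CANCER_SEV = {"minor": 1, "medium": 2, "quite_serious": 3, "serious": 4}

def patient_distance(p, q):
    """p, q: dicts with keys weight, height, age, cancer_sev, eye_color."""
    d = 0.0
    # Numeric attributes: absolute difference after z-normalization.
    for attr, (mean, std) in {"weight": (158, 24.20),
                              "height": (1.52, 19.2),
                              "age": (45, 13.2)}.items():
        d += abs((p[attr] - mean) / std - (q[attr] - mean) / std)
    # Ordinal attribute: difference of severity levels, scaled to [0, 1].
    d += abs(CANCER_SEV[p["cancer_sev"]] - CANCER_SEV[q["cancer_sev"]]) / 3.0
    # Nominal attribute: 0 if equal, 1 otherwise.
    d += 0.0 if p["eye_color"] == q["eye_color"] else 1.0
    # ssn is an identifier and carries no similarity information.
    return d

a = {"weight": 150, "height": 1.60, "age": 40, "cancer_sev": "minor", "eye_color": "blue"}
b = {"weight": 190, "height": 1.80, "age": 70, "cancer_sev": "serious", "eye_color": "brown"}
print(patient_distance(a, b))
```

Whether a full eye-color mismatch should count as much as one standard deviation of age is exactly the kind of task-dependent judgment the previous slide warns about.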
Slide 4: CAL-FULL/UH Database Clustering & Similarity Assessment Environments
- For more details see [RE05].
- [Architecture diagram: a user interface and a data extraction tool on top of the DBMS provide an object view (using type and weight information, default choices, and domain information); a similarity measure tool backed by a library of similarity measures and a clustering tool backed by a library of clustering algorithms operate on that view; a learning tool derives a similarity measure from training data and a set of clusters. The learning tool is today's topic.]
Slide 5: 2. A Framework for Distance Function Learning
- Assumption: The distance between two objects is computed as the weighted sum of the distances with respect to their attributes (see the formula below).
- Objective: Learn a good distance function q for classification tasks.
- Our approach: Apply a clustering algorithm with the object distance function q to be evaluated; the algorithm returns k clusters.
- Our goal is to learn the weights of the object distance function q such that pure clusters are obtained (or clusters that are as pure as possible) --- a pure cluster contains examples belonging to a single class.
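Written out (our notation; the slide states this assumption only in words): for objects o and o' with p attributes, per-attribute distances d_i, and a weight vector w = (w_1, ..., w_p),

```latex
d_w(o, o') \;=\; \sum_{i=1}^{p} w_i \, d_i(o_i, o'_i)
```

Learning the distance function thus reduces to learning the weight vector w.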
Slide 6: Idea: Coevolving Clusters and Distance Functions
- [Diagram: a weight-updating scheme / search strategy produces a distance function Q; a clustering X is obtained with Q; the clustering evaluation q(X) measures the goodness of Q. A bad distance function Q1 yields clusters that mix majority-class (o) and non-majority-class (x) examples; a good distance function Q2 yields pure clusters. A purity-based evaluation sketch follows below.]
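A minimal sketch of this evaluation loop (our illustration, assuming scikit-learn's KMeans; folding the weights into the data by scaling each attribute with sqrt(w_i) stands in for a weighted object distance):

```python
import numpy as np
from sklearn.cluster import KMeans

def clustering_goodness(X, y, weights, k, random_state=0):
    """Cluster X under attribute weights and return the size-weighted average cluster purity.

    Illustration only: k-means on sqrt(w)-scaled attributes approximates a weighted
    (squared-)Euclidean object distance; the framework allows other clustering algorithms.
    """
    Xw = X * np.sqrt(weights)                     # fold the weights into the data
    labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(Xw)
    purities, sizes = [], []
    for c in range(k):
        members = y[labels == c]
        if len(members) == 0:
            continue
        _, counts = np.unique(members, return_counts=True)
        purities.append(counts.max() / len(members))   # fraction of the majority class
        sizes.append(len(members))
    return np.average(purities, weights=sizes)          # 1.0 means every cluster is pure
```

A weight vector that yields a higher average purity is, under this criterion, the better distance function.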
Slide 7: 3. Inside/Outside Weight Updating
- Legend: o = examples belonging to the majority class; x = non-majority-class examples.
- Idea: Move examples of the majority class closer to each other.
- Cluster1, distances with respect to Att1: xo oo ox --> Action: increase the weight of Att1.
- Cluster1, distances with respect to Att2: o o xx o o --> Action: decrease the weight of Att2.
Slide 8: Inside/Outside Weight Updating Algorithm
1. Cluster the dataset with k-means using the current weight vector w = (w1, ..., wp).
2. FOR EACH cluster-attribute pair DO: modify w using inside/outside weight updating.
3. IF NOT DONE, CONTINUE with Step 1; OTHERWISE, RETURN w.
(A sketch of this loop follows below.)
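A minimal sketch of the loop (our illustration, assuming scikit-learn's KMeans and NumPy arrays X, y). The exact update formula appears only as an image in the original slides, so the multiplicative update with rate alpha below is an assumption that merely mirrors the increase/decrease idea of Slide 7:

```python
import numpy as np
from sklearn.cluster import KMeans

def avg_pairwise(values):
    """Average absolute pairwise distance of a 1-d array (0.0 if fewer than 2 values)."""
    if len(values) < 2:
        return 0.0
    diffs = np.abs(values[:, None] - values[None, :])
    n = len(values)
    return diffs.sum() / (n * (n - 1))

def iowu(X, y, k=5, alpha=0.3, iterations=200):
    """Assumed sketch of inside/outside weight updating (not the paper's exact formula)."""
    n, p = X.shape
    w = np.ones(p)
    for _ in range(iterations):
        # Step 1: cluster under the current weights (sqrt(w)-scaling ~ weighted distance).
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X * np.sqrt(w))
        # Step 2: one update per cluster-attribute pair.
        for c in range(k):
            members = labels == c
            classes, counts = np.unique(y[members], return_counts=True)
            if counts.size == 0:
                continue
            majority = classes[np.argmax(counts)]
            inside = members & (y == majority)        # majority-class examples of the cluster
            for i in range(p):
                d_in = avg_pairwise(X[inside, i])     # spread of the majority class on attr. i
                d_all = avg_pairwise(X[members, i])   # spread of the whole cluster on attr. i
                if d_all > 0:
                    # attribute i is "good" when majority examples are relatively close:
                    # increase its weight, otherwise decrease it (rate alpha; assumption).
                    w[i] *= 1.0 + alpha * (d_all - d_in) / d_all
        w = np.clip(w, 1e-3, None)
        w *= p / w.sum()                              # keep the weights on a comparable scale
    return w
```

With alpha = 0.3 and 200 iterations this matches the settings reported on the evaluation slide.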
Slide 9: Inside/Outside Weight Updating Heuristic
- The weight wi of the i-th attribute is updated for a given cluster according to a weight-update formula (shown as an image in the original slide).
- Example 1 (distances with respect to an attribute): xo oo ox
- Example 2: o o xx o o
Slide 10: Idea: Inside/Outside Weight Updating
- [Diagram: a cluster Clusterk containing six objects (labeled 1-6), shown with respect to Attribute1, Attribute2, and Attribute3.]
- Initial weights: w1 = w2 = w3 = 1. Updated weights: w1 = 1.14, w2 = 1.32, w3 = 0.84.
Slide 11: Illustration: Net Effect of Weight Adjustments
- [Diagram: the same cluster Clusterk (objects 1-6), comparing the old object distances with the new object distances after the weight adjustments.]
Slide 12: A Slightly Enhanced Weight Update Formula
- [Formula shown as an image in the original slide.]
Slide 13: Sample Run of IOWU (Inside/Outside Weight Updating) for the Diabetes Dataset
- [Chart shown in the original slide.]
Slide 14: 4. Distance Function Learning Research at UH-DMML
- [Diagram pairing distance-function evaluation methods with weight-updating schemes / search strategies: K-Means, Supervised Clustering, an NN-Classifier, and Adaptive Clustering on the evaluation side; Inside/Outside Weight Updating, Randomized Hill Climbing, and other approaches on the search side. References in the diagram: ERBV04, EZZ04 (current research), work by Karypis, BECV05.]
Slide 15: 5. Experimental Evaluation
- Used a benchmark consisting of 7/15 UCI datasets.
- Inside/outside weight updating was run for 200 iterations.
- α was set to 0.3.
- Evaluation (10-fold cross-validation, repeated 10 times, was used to determine accuracy; a protocol sketch follows below):
  - Used a 1-NN classifier as the baseline classifier.
  - Used the learned distance function for a 1-NN classifier.
  - Used the learned distance function for an NCC classifier (new!).
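As an illustration of the protocol (our sketch, using scikit-learn; the weights would come from the learning procedure above):

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def evaluate_1nn(X, y, weights):
    """Accuracy of a 1-NN classifier under a weighted Euclidean distance,
    estimated with 10-fold cross-validation repeated 10 times (as on the slide)."""
    Xw = X * np.sqrt(weights)   # sqrt(w)-scaling == weighted Euclidean distance for 1-NN
    clf = KNeighborsClassifier(n_neighbors=1)
    cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
    scores = cross_val_score(clf, Xw, y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()
```

The 1-NN baseline corresponds to a weight vector of all ones; the learned-distance variant passes the weights produced by the updating scheme.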
Slide 16: NCC-Classifier
- Idea: the training set is replaced by k (centroid, majority class) pairs that are computed using k-means; the dataset generated in this way is then used to classify the examples in the test set (a sketch follows below).
- [Figure: (a) the dataset clustered by k-means; (b) the dataset edited down to the cluster centroids (A-F), each carrying the class label of its cluster's majority class. Both panels are plotted over Attribute1 and Attribute2.]
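A minimal sketch of the NCC idea (our illustration, assuming scikit-learn's KMeans, integer class labels, and no empty clusters; the choice of k is an input parameter, as discussed in the summary):

```python
import numpy as np
from sklearn.cluster import KMeans

class NCCClassifier:
    """The training set is replaced by k (centroid, majority-class) pairs obtained
    with k-means; test examples get the class of the nearest centroid."""

    def __init__(self, k=10):
        self.k = k

    def fit(self, X, y):
        km = KMeans(n_clusters=self.k, n_init=10, random_state=0).fit(X)
        self.centroids_ = km.cluster_centers_
        # majority class of each cluster (assumes integer class labels, no empty clusters)
        self.centroid_labels_ = np.array(
            [np.bincount(y[km.labels_ == c]).argmax() for c in range(self.k)])
        return self

    def predict(self, X):
        # distance of every test example to every centroid, then pick the nearest one
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.centroid_labels_[dists.argmin(axis=1)]

# usage: NCCClassifier(k=10).fit(X_train, y_train).predict(X_test)
```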
Slide 17: Experimental Evaluation
- [Results table shown in the original slide; statistically significant improvements are marked in red.]
Slide 18: DF-Learning With Randomized Hill Climbing
- Random: a random number; a: the rate of change, drawn for example from [-0.3, 0.3].
- Generate R solutions in the neighborhood of w and pick the best one to be the new weight vector w (a sketch follows below).
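A minimal sketch of this search (our illustration; R, the change range, and the maximization of a generic fitness are placeholders for whatever evaluation is plugged in):

```python
import numpy as np

def randomized_hill_climbing(w, fitness, R=20, change=0.3, iterations=50, rng=None):
    """Generate R random neighbors of the weight vector w by perturbing each weight
    by a factor drawn from [-change, +change]; move to the best neighbor whenever it
    improves the (to-be-maximized) fitness."""
    rng = rng or np.random.default_rng(0)
    best_w = np.asarray(w, dtype=float)
    best_f = fitness(best_w)
    for _ in range(iterations):
        neighbors = [best_w * (1 + rng.uniform(-change, change, size=len(best_w)))
                     for _ in range(R)]
        fits = [fitness(nb) for nb in neighbors]
        i = int(np.argmax(fits))
        if fits[i] > best_f:
            best_w, best_f = neighbors[i], fits[i]
    return best_w, best_f
```

Here, fitness could be, for instance, the cluster-purity score sketched for Slide 6 or the cross-validated 1-NN accuracy.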
Slide 19: Accuracy: IOWU and Randomized Hill Climbing
- [Results chart shown in the original slide.]
Slide 20: Distance Function Learning With Adaptive Clustering
- Uses reinforcement learning to adapt distance functions for k-means clustering.
- Employs search strategies that explore multiple paths in parallel. The algorithm maintains an open list with maximum size L --- bad performers are dropped from the open list. Currently, beam search is used: it creates 2p successors (increasing and decreasing the weight of each attribute exactly once), evaluates those 2pL successors, and keeps the best L of them.
- Discretizes the search space, in which states are (<weights>, <centroids>) tuples, into a grid, and memorizes and updates the fitness values of the grid cells; value iteration is limited to interesting states by employing prioritized sweeping.
- Weights are updated by increasing / decreasing the weight of an attribute by a randomly chosen percentage that falls within an interval [min-change, max-change]; our current implementation uses [25%, 50%].
- Employs entropy H(X) as the fitness function (low entropy -> pure clusters); an entropy sketch follows below.
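For the fitness H(X), a minimal sketch (our illustration) of the size-weighted class entropy of a clustering:

```python
import numpy as np

def clustering_entropy(labels, y):
    """Size-weighted class entropy H(X) of a clustering; 0 means every cluster is pure."""
    total, h = len(y), 0.0
    for c in np.unique(labels):
        members = y[labels == c]
        _, counts = np.unique(members, return_counts=True)
        p = counts / counts.sum()                      # class distribution inside the cluster
        h += (len(members) / total) * -(p * np.log2(p)).sum()
    return h
```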
Slide 21: 6. Related Distance Function Learning Research
- Interactive approaches that use user feedback and reinforcement learning to derive a good distance function.
- Other work uses randomized hill climbing and neural networks to learn distance functions for classification tasks; mostly, NN-queries are used to evaluate the quality of a clustering.
- Other work, mostly in the area of semi-supervised clustering, adapts object distances to cope with constraints.
Slide 22: 7. Summary
- Described an approach that employs clustering for distance function evaluation.
- Introduced an attribute weight updating heuristic called inside/outside weight updating and evaluated its performance.
- The inside/outside weight updating approach enhanced a 1-NN classifier significantly for some UCI datasets, but not for all datasets that were tested.
- The quality of the employed approach depends on the number of clusters k, which is an input parameter; our current research centers on determining k automatically with a supervised clustering algorithm [EZZ04].
- The general idea of replacing a dataset by cluster representatives to enhance NN-classifiers shows a lot of promise in this research (as exemplified by the NCC classifier) and in other research we are currently conducting.
- Distance function learning is quite time-consuming: one run of 200 iterations of inside/outside weight updating takes between 5 seconds and 5 minutes, depending on dataset size and k-value, and other techniques we are currently investigating are significantly slower; therefore, we are currently moving to high-performance computing facilities for the empirical evaluation of the distance function learning approaches.
Slide 23: Links to 4 Papers
- [EZZ04] C. Eick, N. Zeidat, Z. Zhao, "Supervised Clustering --- Algorithms and Benefits," short version in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/ceick/kdd/EZZ04.pdf
- [RE05] T. Ryu and C. Eick, "A Clustering Methodology and Tool," Information Sciences 171(1-3): 29-59 (2005). http://www.cs.uh.edu/ceick/kdd/RE05.doc
- [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, "Using Clustering to Learn Distance Functions for Supervised Similarity Assessment," Proc. MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/ceick/kdd/ERBV05.pdf
- [BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, "Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience," submitted for publication. http://www.cs.uh.edu/ceick/kdd/BECV05.pdf
Slide 24: Questions?
Slide 25: Randomized Hill Climbing
- Fast start: the algorithm starts with a small neighborhood size and keeps it until it cannot find any better solutions; it then increases the neighborhood size by a factor of 3, hoping that a better solution can be found by trying more points.
- Shoulder condition: when the algorithm has moved onto a shoulder or a flat hill, it will keep getting solutions with the same fitness value. Our algorithm terminates when it has tried 3 times and still gets the same result; this prevents it from being trapped on a shoulder forever. (A sketch of both controls follows below.)
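A sketch of these two controls (our illustration; the factor of 3 for neighborhood growth and the 3-try shoulder test follow the slide, everything else is an assumption):

```python
import numpy as np

def fast_start_hill_climbing(w, fitness, R=20, initial_change=0.05,
                             max_rounds=200, rng=None):
    """Randomized hill climbing with a growing neighborhood ("fast start") and a
    shoulder test: stop after 3 rounds that keep returning the same fitness value."""
    rng = rng or np.random.default_rng(0)
    change, same_count = initial_change, 0
    best_w = np.asarray(w, dtype=float)
    best_f = fitness(best_w)
    for _ in range(max_rounds):
        neighbors = [best_w * (1 + rng.uniform(-change, change, size=len(best_w)))
                     for _ in range(R)]
        fits = [fitness(nb) for nb in neighbors]
        i = int(np.argmax(fits))
        if fits[i] > best_f:                  # better solution found: move there
            best_w, best_f, same_count = neighbors[i], fits[i], 0
        else:
            change *= 3                       # "fast start": widen the neighborhood
            if fits[i] == best_f:             # same fitness: possible shoulder / flat hill
                same_count += 1
                if same_count >= 3:           # tried 3 times, still flat: stop
                    break
    return best_w, best_f
```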
Slide 26: Randomized Hill Climbing
- [Sketch of an objective function over the state space, illustrating a flat hill and a shoulder.]
Slide 27: Purity in Clusters Obtained (Internal)
- [Results chart shown in the original slide.]
Slide 28: Purity in Clusters Obtained (Internal)
- [Results chart shown in the original slide.]
Slide 29: Different Forms of Clustering (Ch. Eick)
- Objective of supervised clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
Slide 30: A Fitness Function for Supervised Clustering
- q(X) = Impurity(X) + β · Penalty(k), where k is the number of clusters used, n the number of examples in the dataset, c the number of classes in the dataset, and β the weight of Penalty(k), with 0 < β ≤ 2.0.
- Penalty(k) increases sub-linearly, because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large; hence the formula above (a hedged reconstruction follows below).
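The formula for Penalty(k) itself is not reproduced in this text version. A hedged reconstruction, consistent with the sub-linear growth described above and with the supervised-clustering fitness used in [EZZ04] (treat the exact form as an assumption):

```latex
q(X) = \mathrm{Impurity}(X) + \beta \cdot \mathrm{Penalty}(k), \qquad
\mathrm{Impurity}(X) = \frac{\#\,\text{minority examples}}{n}, \qquad
\mathrm{Penalty}(k) =
\begin{cases}
  \sqrt{\dfrac{k - c}{n}} & \text{if } k \ge c \\[6pt]
  0 & \text{if } k < c
\end{cases}
```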