Using Clustering to Learn Distance Functions for Supervised Similarity Assessment
1
Using Clustering to Learn Distance Functions for
Supervised Similarity Assessment
  • Christoph F. Eick, A. Rouhana, A. Bagherjeiran,
    R. Vilalta
  • Department of Computer Science
  • University of Houston
  • Organization of the Talk
  • Similarity Assessment
  • A Framework for Distance Function Learning
  • Inside Outside Weight Updating
  • Distance Function Learning Research at UH-DMML
  • Experimental Evaluation
  • Other Distance Function Learning Research
  • Summary

2
1. Similarity Assessment
  • Definition: Similarity assessment is the task of determining which objects are similar to each other and which are dissimilar to each other.
  • Goal of Similarity Assessment: Construct a distance function!
  • Applications of Similarity Assessment:
  • Case-based reasoning
  • Classification techniques that rely on distance functions
  • Clustering
  • Complications:
  • Usually, there is no universally good distance function for a set of objects; the usefulness of a distance function depends on the task it is used for ("no free lunch" in similarity assessment either).
  • Defining the distance between objects is more an art than a science.

3
Motivating Example How To Find Similar Patients?
  • The following relation is given (with 10000
    tuples)
  • Patient(ssn, weight, height, cancer-sev,
    eye-color, age,)
  • Attribute Domains
  • ssn 9 digits
  • weight between 30 and 650 mweight158
    sweight24.20
  • height between 0.30 and 2.20 in meters
    mheight1.52 sheight19.2
  • cancer-sev 4serious 3quite_serious 2medium
    1minor
  • eye-color brown, blue, green, grey
  • age between 3 and 100 mage45 sage13.2
  • Task Define Patient Similarity
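As a concrete illustration of the task, the sketch below combines per-attribute distances into one patient distance: numeric attributes are z-scored with the statistics given above, cancer-sev uses a scaled rank difference, and eye-color a 0/1 match. These per-attribute choices are illustrative assumptions, not the definition used in the talk; only the statistics come from the slide.

```python
import numpy as np

def patient_distance(p, q, w):
    """Hypothetical patient distance: weighted sum of per-attribute
    distances. p, q: dicts with keys weight, height, cancer_sev,
    eye_color, age; w: dict of attribute weights (ssn is an
    identifier and is ignored)."""
    # means and standard deviations as given on the slide
    stats = {"weight": (158, 24.20), "height": (1.52, 19.2),
             "age": (45, 13.2)}
    d = {}
    for a, (mu, sigma) in stats.items():
        # distance of z-scored numeric values
        d[a] = abs((p[a] - mu) / sigma - (q[a] - mu) / sigma)
    d["cancer_sev"] = abs(p["cancer_sev"] - q["cancer_sev"]) / 3  # ranks 1..4
    d["eye_color"] = 0.0 if p["eye_color"] == q["eye_color"] else 1.0
    return sum(w[a] * d[a] for a in d)
```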

4
CAL-FULL/UH Database Clustering & Similarity Assessment Environments
For more details see [RE05].
[Architecture diagram: a User Interface (taking type and weight information, default choices, and domain information) connects a Data Extraction Tool on top of a DBMS with a Similarity Measure Tool (backed by a library of similarity measures), a Clustering Tool (backed by a library of clustering algorithms), and a Learning Tool that derives a similarity measure from training data, an object view, and a set of clusters. The "today's topic" label marks the learning of the similarity measure.]
5
2. A Framework for Distance Function Learning
  • Assumption: The distance between two objects is computed as the weighted sum of the distances with respect to their attributes.
  • Objective: Learn a good distance function q for classification tasks.
  • Our approach: Apply a clustering algorithm with the object distance function q to be evaluated; the algorithm returns k clusters.
  • Our goal is to learn the weights of an object distance function q such that pure clusters are obtained (or clusters that are as pure as possible); a pure cluster contains examples belonging to a single class. A minimal sketch of the assumed weighted-sum form follows.
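The weighted-sum assumption can be written down directly. Below is a minimal sketch; the per-attribute distance (absolute difference) is an illustrative assumption, not the paper's definition.

```python
import numpy as np

def weighted_distance(x, y, w):
    """Distance between objects x and y as the weighted sum of
    per-attribute distances; the absolute difference serves as an
    illustrative per-attribute distance for numeric attributes."""
    x, y, w = np.asarray(x), np.asarray(y), np.asarray(w)
    return float(np.sum(w * np.abs(x - y)))

# e.g. weighted_distance([1.0, 0.5], [0.2, 0.7], w=[1.14, 0.84])
```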

6
Idea: Co-Evolving Clusters and Distance Functions
[Diagram: a feedback loop in which a clustering X is produced using distance function Q, the clustering is evaluated by q(X), and a weight-updating scheme / search strategy adjusts Q. Scatter plots contrast a bad distance function Q1, whose clusters mix o and x examples, with a good distance function Q2, whose clusters are pure; the goodness of the distance function Q is judged by the clustering it induces.]
7
3. Inside/Outside Weight Updating
o = examples belonging to the majority class; x = non-majority-class examples
Idea: Move examples of the majority class closer to each other.
Cluster1, distances with respect to Att1:
xo oo ox
Action: Increase the weight of Att1.
Cluster1, distances with respect to Att2:
o o xx o o
Action: Decrease the weight of Att2.
8
Inside/Outside Weight Updating Algorithm
  1. Cluster the dataset with k-means using a given weight vector w=(w1,...,wp).
  2. FOR EACH cluster-attribute pair DO: modify w using inside/outside weight updating.
  3. IF NOT DONE, CONTINUE with Step 1; OTHERWISE, RETURN w.
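A minimal sketch of this loop, assuming scikit-learn's KMeans and a simple spread-based update signal standing in for formula (W) on the next slide (which is not fully recoverable from this transcript):

```python
import numpy as np
from sklearn.cluster import KMeans

def inside_outside_weight_updating(X, y, k, alpha=0.3, iters=200):
    """Sketch of the IOWU loop. X: numeric data matrix, y: non-negative
    integer class labels. The per-attribute update signal below is an
    assumption, not the paper's formula (W)."""
    w = np.ones(X.shape[1])
    for _ in range(iters):
        # scaling columns by sqrt(w) makes k-means' squared Euclidean
        # distance attribute-weighted by w
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X * np.sqrt(w))
        for c in range(k):
            members, classes = X[labels == c], y[labels == c]
            if len(members) < 2:
                continue
            majority = np.bincount(classes).argmax()
            inside = members[classes == majority]
            for i in range(X.shape[1]):
                spread_all = members[:, i].std()
                spread_in = inside[:, i].std()
                if spread_all == 0:
                    continue
                # majority class tighter than the cluster overall
                # -> increase the weight of attribute i, else decrease it
                signal = np.clip((spread_all - spread_in) / spread_all, -1, 1)
                w[i] *= 1 + alpha * signal
        w = w / w.sum() * len(w)  # keep weights on a comparable scale
    return w
```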

9
Inside/Outside Weight Updating Heuristic
The weight of the i-th attribute, wi, is updated as follows for a given cluster:
[Formula (W) is an image that did not survive the transcript; per the idea above, wi is increased when the cluster's majority-class examples lie closer together than the other examples with respect to attribute i, and decreased otherwise.]
Example 1: xo oo ox (weight increased)
Example 2: o o xx o o (weight decreased)
10
Example: Inside/Outside Weight Updating
[Diagram: cluster k with six examples projected onto Attribute1, Attribute2, and Attribute3. Initial weights w1=w2=w3=1; updated weights w1=1.14, w2=1.32, w3=0.84.]
11
Illustration: Net Effect of Weight Adjustments
[Diagram: the same cluster k, contrasting the old object distances with the new object distances after the weight adjustment.]
12
A Slightly Enhanced Weight Update Formula [formula image not preserved in the transcript]
13
Sample Run of IOWU for the Diabetes Dataset
14
4. Distance Function Learning Research at UH-DMML
[Diagram: a map of UH-DMML work pairing distance function evaluation schemes (k-means, supervised clustering, NN-classifier, adaptive clustering) with weight-updating schemes / search strategies (inside/outside weight updating, randomized hill climbing). [ERBV04] combines k-means with inside/outside weight updating; current research [EZZ04] uses supervised clustering; [BECV05] covers adaptive clustering; related work by Karypis is listed under other research.]
15
5. Experimental Evaluation
  • Used a benchmark consisting of 7/15 UCI datasets.
  • Inside/outside weight updating was run for 200 iterations.
  • α was set to 0.3.
  • Evaluation: 10-fold cross-validation, repeated 10 times, was used to determine accuracy.
  • Used a 1-NN classifier as the baseline classifier.
  • Used the learned distance function for a 1-NN classifier.
  • Used the learned distance function for an NCC classifier (new!).

16
NCC-Classifier
Idea: the training set is replaced by k (centroid, majority class) pairs computed using k-means; the so-generated dataset is then used to classify the examples in the test set. A sketch follows.
[Diagram: (a) the dataset clustered by k-means; (b) the dataset edited down to the cluster centroids, each carrying the class label of its cluster's majority class.]
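The NCC idea maps directly to a few lines. A sketch assuming scikit-learn's KMeans and, for brevity, an unweighted Euclidean distance (in the experiments the learned distance function would be used instead):

```python
import numpy as np
from sklearn.cluster import KMeans

def ncc_fit(X_train, y_train, k):
    """Compress the training set into k (centroid, majority-class)
    pairs obtained with k-means; y_train: non-negative int labels."""
    km = KMeans(n_clusters=k, n_init=10).fit(X_train)
    majority = np.array([
        np.bincount(y_train[km.labels_ == c]).argmax()
        for c in range(k)
    ])
    return km.cluster_centers_, majority

def ncc_predict(centroids, majority, X_test):
    """Classify each test example with the class of its nearest centroid."""
    d = np.linalg.norm(X_test[:, None, :] - centroids[None, :, :], axis=2)
    return majority[d.argmin(axis=1)]
```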
17
Experimental Evaluation
Remark: Statistically significant improvements are shown in red. [The results table did not survive the transcript.]
18
DF-Learning With Randomized Hill Climbing
Random: a random rate of change per weight, drawn for example from [-0.3, 0.3].
  • Generate R solutions in the neighborhood of w and pick the best one to be the new weight vector w. A sketch follows.
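A minimal sketch of this neighborhood search, assuming a user-supplied fitness function (e.g. the accuracy of a classifier built with the candidate weights; higher is better):

```python
import numpy as np

def randomized_hill_climbing(w, fitness, R=20, rate=0.3, iters=100):
    """At each step, sample R neighbors of w by perturbing every
    weight with a random rate of change in [-rate, rate], and move
    to the best neighbor if it improves the fitness."""
    best, best_fit = np.asarray(w, dtype=float), fitness(w)
    for _ in range(iters):
        neighbors = best * (1 + np.random.uniform(-rate, rate,
                                                  size=(R, len(best))))
        fits = [fitness(n) for n in neighbors]
        i = int(np.argmax(fits))
        if fits[i] > best_fit:
            best, best_fit = neighbors[i], fits[i]
    return best
```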
19
Accuracy: IOWU and Randomized Hill Climbing
20
Distance Function Learning With Adaptive Clustering
  • Uses reinforcement learning to adapt distance functions for k-means clustering.
  • Employs search strategies that explore multiple paths in parallel. The algorithm maintains an open list with maximum size L; bad performers are dropped from the open list. Currently, beam search is used, which creates 2p successors per state (increasing and decreasing the weight of each attribute exactly once), evaluates those 2pL successors, and keeps the best L of them (see the sketch after this list).
  • Discretizes the search space, in which states are (<weights>, <centroids>) tuples, into a grid, and memorizes and updates the fitness values of the grid; value iteration is limited to interesting states by employing prioritized sweeping.
  • Weights are updated by increasing/decreasing the weight of an attribute by a randomly chosen percentage that falls within an interval [min-change, max-change]; our current implementation uses [25%, 50%].
  • Employs entropy H(X) as the fitness function (low entropy → pure clusters).
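A sketch of the described open-list beam search over weight vectors; the entropy-based fitness is passed in as a callable, and lower values are better. The grid discretization and prioritized sweeping are omitted here.

```python
import numpy as np

def beam_search_weights(w0, fitness, L=5, iters=50,
                        min_change=0.25, max_change=0.50):
    """Each state spawns 2p successors (each attribute weight
    increased and decreased once by a random percentage drawn from
    [min_change, max_change]); the best L of all successors form
    the next open list. fitness: lower is better (entropy)."""
    open_list = [np.asarray(w0, dtype=float)]
    for _ in range(iters):
        successors = []
        for w in open_list:
            for i in range(len(w)):
                for sign in (+1, -1):
                    pct = np.random.uniform(min_change, max_change)
                    nxt = w.copy()
                    nxt[i] *= 1 + sign * pct
                    successors.append(nxt)
        # keep the best L performers; the rest are dropped
        successors.sort(key=fitness)
        open_list = successors[:L]
    return open_list[0]
```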

21
6. Related Distance Function Learning Research
  • Interactive approaches that use user feedback and
    reinforcement learning to derive a good distance
    function.
  • Other work uses randomized hill climbing and neural networks to learn distance functions for classification tasks; mostly, NN-queries are used to evaluate the quality of a clustering.
  • Other work, mostly in the area of semi-supervised
    clustering, adapts object distances to cope with
    constraints.

22
7. Summary
  • Described an approach that employs clustering for distance function evaluation.
  • Introduced an attribute weight updating heuristic called inside/outside weight updating and evaluated its performance.
  • The inside/outside weight updating approach enhanced a 1-NN classifier significantly for some UCI datasets, but not for all data sets that were tested.
  • The quality of the employed approach depends on the number of clusters k, which is an input parameter; our current research centers on determining k automatically with a supervised clustering algorithm [EZZ04].
  • The general idea of replacing a dataset by cluster representatives to enhance NN-classifiers shows a lot of promise, in this research (as exemplified by the NCC classifier) and in other research we are currently conducting.
  • Distance function learning is quite time consuming: one run of 200 iterations of inside/outside weight updating takes between 5 seconds and 5 minutes, depending on dataset size and k-value; other techniques we are currently investigating are significantly slower. We are therefore moving to high-performance computing facilities for the empirical evaluation of the distance function learning approaches.

23
Links to 4 Papers
  • [EZZ04] C. Eick, N. Zeidat, Z. Zhao, Supervised Clustering: Algorithms and Benefits, short version appeared in Proc. International Conference on Tools with AI (ICTAI), Boca Raton, Florida, November 2004. http://www.cs.uh.edu/ceick/kdd/EZZ04.pdf
  • [RE05] T. Ryu and C. Eick, A Clustering Methodology and Tool, Information Sciences 171(1-3): 29-59 (2005). http://www.cs.uh.edu/ceick/kdd/RE05.doc
  • [ERBV04] C. Eick, A. Rouhana, A. Bagherjeiran, R. Vilalta, Using Clustering to Learn Distance Functions for Supervised Similarity Assessment, in Proc. MLDM'05, Leipzig, Germany, July 2005. http://www.cs.uh.edu/ceick/kdd/ERBV05.pdf
  • [BECV05] A. Bagherjeiran, C. Eick, C.-S. Chen, R. Vilalta, Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience, submitted for publication. http://www.cs.uh.edu/ceick/kdd/BECV05.pdf

24
Questions?
25
Randomized Hill Climbing
  • Fast start: the algorithm starts with a small neighborhood size until it cannot find any better solutions; it then triples its neighborhood size, hoping that a better solution can be found by trying more points.
  • Shoulder condition: when the algorithm has moved onto a shoulder or a flat hill, it keeps obtaining solutions with the same fitness value. Our algorithm terminates when it has tried 3 times and still gets the same result; this prevents it from being trapped on a shoulder forever.

26
Randomized Hill Climbing
[Diagram: objective function plotted over the state space, illustrating a flat hill and a shoulder.]
27
Purity in clusters obtained (internal)
28
Purity in clusters obtained (internal)
29
Different Forms of Clustering (Ch. Eick)
Objective of Supervised Clustering: minimize cluster impurity while keeping the number of clusters low (expressed by a fitness function q(X)).
30
A Fitness Function for Supervised Clustering
  • q(X) = Impurity(X) + β·Penalty(k)

k: number of clusters used; n: number of examples in the dataset; c: number of classes in the dataset.
β: weight for Penalty(k), 0 < β ≤ 2.0.
Penalty(k) increases sub-linearly, because increasing the number of clusters from k to k+1 has a greater effect on the end result when k is small than when it is large; hence the sub-linear form. A hedged reconstruction of the full formulas follows.
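Reconstructed in LaTeX; the Impurity and Penalty definitions below follow the supervised clustering paper [EZZ04] and are a hedged reconstruction, since the formula images did not survive the transcript:

```latex
q(X) = \mathrm{Impurity}(X) + \beta \cdot \mathrm{Penalty}(k), \qquad
\mathrm{Impurity}(X) = \frac{\#\,\text{minority examples}}{n}, \qquad
\mathrm{Penalty}(k) =
\begin{cases}
  \sqrt{\dfrac{k-c}{n}} & k \ge c \\[4pt]
  0 & k < c
\end{cases}
```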