1
A Probabilistic Framework for Semi-Supervised Clustering
  • Sugato Basu, Mikhail Bilenko, Raymond J. Mooney,
  • (UT Austin)
  • KDD 2004

2
Outline
  • Introduction
  • Background
  • Algorithm
  • Experiments

3
What is Semi-Supervised Clustering?
  • Use human input to provide labels for some of the
    data
  • Improve existing naive clustering methods
  • Use labeled data to guide clustering of
    unlabeled data
  • End result is a better clustering of data

4
Motivation
  • Large amounts of unlabeled data exist
  • More is being produced all the time
  • Generating labels for data is expensive
  • Usually requires human intervention
  • Want to use a limited amount of labeled data to
  • guide the clustering of the entire dataset
  • in order to provide a better clustering of the data

5
Semi-Supervised Clustering
  • Constraint-Based
  • Modify the objective function so that it includes
    a penalty for violating constraints,
  • enforcing constraints during the clustering or
    initialization process
  • Distance-Based
  • A distance function is trained on the supervised
    data to satisfy the labels or constraints,
  • and is then applied to the complete dataset

6
Method
  • Use both constraint-based and distance-based
    approaches in a unified method
  • Use a Hidden Markov Random Field (HMRF) to model
    the constraints
  • Use constraints for initialization and assignment
    of points to clusters
  • Use an adaptive distance function to learn
    the distance measure
  • Cluster the data to minimize an objective function
    defined over the distortion measure

7
Main Points of Method
  • Improved initialization
  • Initial clusters are formed based on constraints
  • Constraint-sensitive assignment of points to
    clusters
  • Points are assigned to clusters to minimize a
    distortion function,
  • while also minimizing the number of constraints
    violated
  • Iterative distance learning
  • Distortion measure is re-estimated after each
    iteration

8
Constraints
  • Pair-wise constraints with must-link or
    cannot-link labels
  • Set M of must-link constraints
  • Set C of cannot-link constraints
  • A list of associated costs for violating
    must-link or cannot-link requirements
  • Class labels do not have to be known,
  • but a user can still specify relationships between
    points (a minimal representation is sketched below)
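The sets M and C can be stored directly as weighted pairs. A minimal sketch of one way to do this, with per-pair violation costs; the variable names and weights below are illustrative assumptions, not the paper's code:

    # Illustrative constraint sets: keys are point-index pairs, values are
    # the costs charged when the constraint is violated.
    must_link = {(0, 3): 1.0, (2, 7): 1.0}      # M: pairs that should share a cluster
    cannot_link = {(1, 4): 1.0, (3, 8): 1.0}    # C: pairs that should be separated

    def violation_cost(labels, must_link, cannot_link):
        """Total cost of the constraints violated by a cluster assignment."""
        cost = 0.0
        for (i, j), w in must_link.items():
            if labels[i] != labels[j]:          # must-link broken
                cost += w
        for (i, j), w in cannot_link.items():
            if labels[i] == labels[j]:          # cannot-link broken
                cost += w
        return cost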

9
HMRF
10
Posterior probability
  • This is an incomplete-data problem
  • Cluster representatives as well as cluster labels
    are unknown
  • A popular method for solving this type of problem
    is Expectation Maximization (EM)
  • K-Means is equivalent to an EM algorithm with
    hard clustering assignments (the hard-assignment
    objective is written out below)
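For reference, the hard-assignment view of K-Means minimizes total distortion between points and their cluster representatives; the HMRF formulation on the following slides augments this objective with constraint-violation penalties:

    J_{\mathrm{KMeans}} = \sum_{x_i \in X} \lVert x_i - \mu_{l_i} \rVert^2,
    \qquad l_i = \arg\min_h \lVert x_i - \mu_h \rVert^2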

11
Must-Link Violation Cost Function
  • Ensures that the penalty for violating a
    must-link constraint between two points
  • that are far apart is higher than the penalty
  • between two points that are close together
  • Penalizes distance functions in which must-link
    points are far apart (a plausible form is written
    out below)
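One form consistent with this description; the symbols here are assumptions (w_ij a constraint weight, φ_D the current distortion measure, l_i and l_j the cluster labels):

    f_M(x_i, x_j) = w_{ij}\, \varphi_D(x_i, x_j)\;\mathbf{1}[\,l_i \neq l_j\,]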

12
Cannot-Link Violation Cost Function
  • Ensures that the penalty for violating a
    cannot-link constraint between points
  • that are nearby according to the current distance
    function is higher than the penalty
  • between distant points
  • Penalizes distance functions that place two
    cannot-link points close together (a plausible
    form is written out below)
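A matching form, again with assumed symbols, where φ_{D_max} denotes the maximum value of the distortion measure, so that nearby cannot-link pairs incur the largest penalty:

    f_C(x_i, x_j) = \bar{w}_{ij}\,\big(\varphi_{D_{\max}} - \varphi_D(x_i, x_j)\big)\;\mathbf{1}[\,l_i = l_j\,]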

13
Objective Function
  • Goal: find the cluster assignment and distance
    parameters that minimize the objective function
    (a plausible composite form is sketched below)
  • Supervised data is used in initialization
  • Constraints are used in cluster assignments
  • Distance learning happens in the M step
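Combining the distortion term with the two penalty terms above gives a composite objective of roughly this shape (the paper may include an additional normalizing term for the learned distortion, omitted in this sketch):

    J_{\mathrm{HMRF}} = \sum_{x_i \in X} \varphi_D(x_i, \mu_{l_i})
      + \sum_{(x_i, x_j) \in M,\; l_i \neq l_j} w_{ij}\, \varphi_D(x_i, x_j)
      + \sum_{(x_i, x_j) \in C,\; l_i = l_j} \bar{w}_{ij}\, \big(\varphi_{D_{\max}} - \varphi_D(x_i, x_j)\big)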

14
Algorithm
  • EM framework (a compact end-to-end sketch follows
    this list)
  • Initialization step
  • Use constraints to guide initial cluster
    formation
  • E step
  • Minimizes the objective function over cluster
    assignments
  • M step
  • Minimizes the objective function over cluster
    representatives
  • Minimizes the objective function over the
    parameters of the distortion measure
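As a concrete illustration, here is a deliberately simplified, self-contained version of the loop: squared Euclidean distortion, a fixed penalty weight, and random initialization (the paper's constraint-guided initialization and distance learning are sketched on the following slides). Function and variable names are my own, not the authors':

    import numpy as np

    def hmrf_kmeans_sketch(X, K, must_link, cannot_link, w=1.0, n_iters=20, seed=0):
        """Simplified HMRF-KMeans-style loop: ICM assignment + K-Means centroid update."""
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(seed)
        centroids = X[rng.choice(len(X), K, replace=False)]   # simple init, not the paper's
        labels = np.zeros(len(X), dtype=int)
        for _ in range(n_iters):
            # E step (ICM-style): for each point in turn, greedily pick the cluster
            # minimizing distortion plus the cost of constraints it would violate.
            for i in range(len(X)):
                costs = ((X[i] - centroids) ** 2).sum(axis=1)
                for (a, b) in must_link:
                    if i in (a, b):
                        j = b if a == i else a
                        costs[np.arange(K) != labels[j]] += w   # penalize splitting the pair
                for (a, b) in cannot_link:
                    if i in (a, b):
                        j = b if a == i else a
                        costs[labels[j]] += w                   # penalize joining the pair
                labels[i] = int(np.argmin(costs))
            # M step: re-estimate cluster representatives (constraints play no role here).
            for h in range(K):
                if np.any(labels == h):
                    centroids[h] = X[labels == h].mean(axis=0)
        return labels, centroids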

15
Initialization
  • Form the transitive closure of the must-link
    constraints
  • Gives a set of connected components consisting of
    points connected by must-link constraints
  • y connected components (see the sketch following
    this list)
  • If y < K (the number of clusters), the y connected
    neighborhoods are used to create y initial clusters
  • Remaining clusters are initialized by random
    perturbations of the global centroid of the data
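A minimal sketch of forming these must-link neighborhoods with a plain BFS over the must-link graph (the function name is illustrative); each component's mean can then serve as a candidate initial centroid:

    from collections import defaultdict, deque

    def must_link_neighborhoods(must_link):
        """Connected components of the must-link graph (constrained points only)."""
        adj = defaultdict(set)
        for i, j in must_link:
            adj[i].add(j)
            adj[j].add(i)
        seen, components = set(), []
        for start in adj:
            if start in seen:
                continue
            seen.add(start)
            queue, component = deque([start]), []
            while queue:
                u = queue.popleft()
                component.append(u)
                for v in adj[u]:
                    if v not in seen:
                        seen.add(v)
                        queue.append(v)
            components.append(component)
        return components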

16
Initialization (Continued)
  • If y > K (the number of clusters), K initial
    clusters are selected using the distance measure
  • Farthest-first traversal is a good heuristic
  • A weighted variant of farthest-first traversal is
    used (sketched below)
  • Distance between two centroids is multiplied by
    their corresponding weights
  • Weight of each centroid is proportional to the
    size of the corresponding neighborhood
  • Biased to select centroids that are relatively
    far apart and of a decent size
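A minimal NumPy sketch of that weighted traversal; the function name and the choice to start from the heaviest neighborhood are assumptions for illustration:

    import numpy as np

    def weighted_farthest_first(centroids, weights, k):
        """Pick k candidate centroids, biased toward large, well-separated neighborhoods."""
        centroids = np.asarray(centroids, dtype=float)
        weights = np.asarray(weights, dtype=float)
        chosen = [int(np.argmax(weights))]          # start from the largest neighborhood
        while len(chosen) < k:
            best, best_score = None, -np.inf
            for c in range(len(centroids)):
                if c in chosen:
                    continue
                # weighted distance to the closest already-chosen centroid
                score = min(np.linalg.norm(centroids[c] - centroids[s]) * weights[c] * weights[s]
                            for s in chosen)
                if score > best_score:
                    best, best_score = c, score
            chosen.append(best)
        return chosen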

17
Initialization (Continued)
  • Assuming consistency of the data
  • Augment set M with must-link constraints inferred
    from the transitive closure
  • For each pair of neighborhoods Np, Nq
  • that have at least one cannot-link constraint
    between them,
  • add cannot-link constraints between every member
    of Np and every member of Nq (see the sketch after
    this list)
  • Learn as much about the data through the
    constraints as possible
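A minimal sketch of this inference step using only the standard library; names are illustrative, not the paper's code:

    from itertools import combinations, product

    def augment_constraints(neighborhoods, must_link, cannot_link):
        """Infer extra must-link and cannot-link constraints from the neighborhoods."""
        must_link, cannot_link = set(must_link), set(cannot_link)
        # Augment M: every pair of points inside a neighborhood is must-linked.
        for neighborhood in neighborhoods:
            for i, j in combinations(neighborhood, 2):
                must_link.add((i, j))
        # Augment C: one cannot-link between two neighborhoods implies cannot-links
        # between every member of one and every member of the other.
        for np_, nq in combinations(neighborhoods, 2):
            if any((i, j) in cannot_link or (j, i) in cannot_link
                   for i, j in product(np_, nq)):
                for i, j in product(np_, nq):
                    cannot_link.add((i, j))
        return must_link, cannot_link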

18
Augment set M
19
Augment set C
20
E-step
  • Assignment of data points to clusters
  • Since the model incorporates interactions between
    points,
  • computing the point assignments that minimize the
    objective function is computationally intractable
  • Approximations are available
  • Iterated conditional modes (ICM), belief
    propagation, and linear programming relaxation
  • ICM uses a greedy strategy to
  • sequentially update the cluster assignment of
    each point
  • while keeping the other points fixed (the per-point
    update is written out below)
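In the notation of the earlier cost terms (again a paraphrase with assumed symbols, not the paper's exact notation), ICM's greedy per-point update is:

    l_i \leftarrow \arg\min_{h} \Big[ \varphi_D(x_i, \mu_h)
      + \sum_{(x_i, x_j) \in M,\; l_j \neq h} w_{ij}\, \varphi_D(x_i, x_j)
      + \sum_{(x_i, x_j) \in C,\; l_j = h} \bar{w}_{ij}\, \big(\varphi_{D_{\max}} - \varphi_D(x_i, x_j)\big) \Big]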

21
M-step
  • First, cluster representatives are re-estimated
    to decrease the objective function
  • Constraints do not factor into this step, so it
    is equivalent to the K-Means update
  • If a parameterized variant of the distance measure
    is used, it is updated here
  • Parameters can be found through partial
    derivatives of the distance function (see the
    sketch below)
  • The learning step modifies the distortion
    measure so that
  • similar points are brought closer together, while
    dissimilar points are pulled apart
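Concretely, for squared Euclidean distortion the representative update is the cluster mean, and each distortion parameter a_m can be moved by a generic gradient step (η is a step size; the paper derives the specific update for each distortion measure):

    \mu_h = \frac{1}{|X_h|} \sum_{x_i \in X_h} x_i,
    \qquad
    a_m \leftarrow a_m - \eta\, \frac{\partial J_{\mathrm{HMRF}}}{\partial a_m}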

22
Results
  • KMeans-I-C-D
  • Complete HMRF-KMeans algorithm, with supervised
    data in initialization (I), cluster
    assignment (C), and distance learning (D)
  • KMeans-I-C
  • HMRF-KMeans algorithm without distance learning
  • KMeans-I
  • HMRF-KMeans algorithm without distance learning
    and without supervised cluster assignment

23
Results (Continued)
24
Results (Continued)
25
Results (Continued)
26
Questions
  • Can all types of constraints be captured in
    pairwise associations?
  • Hierarchical structure?
  • Could other types of labels be included in this
    model?
  • Use class labels as well as pairwise constraints
  • How does this model handle noise in the
    data/labels?
  • E.g., Point A has a must-link constraint to Point B,
  • Point B has a must-link constraint to Point C, and
  • Point A has a cannot-link constraint to Point C

27
Conclusion
  • HMRF-KMeans performs well (compared to naïve
    K-Means) with a limited number of constraints
  • The goal of the algorithm was to provide a better
    clustering method
  • with the use of a limited number of constraints
  • HMRF-KMeans learns quickly from a limited number
    of constraints
  • Should be applicable to datasets where we want
    to limit the amount of labeling
  • to be done by humans, and where constraints can be
    specified as pair-wise labels