1
A Probabilistic Framework for Semi-Supervised
Clustering
  • Sugato Basu
  • Mikhail Bilenko
  • Raymond J. Mooney
  • Department of Computer Sciences
  • University of Texas at Austin
  • Presented by Jingting Zeng

2
Outline
  • Introduction
  • Background
  • Algorithm
  • Experiments
  • Conclusion

3
What is Semi-Supervised Clustering?
  • Use human input to provide labels for some of the
    data
  • Improve existing naive clustering methods
  • Use labeled data to guide clustering of unlabeled
    data
  • The end result is a better clustering of the data

4
Motivation
  • Large amounts of unlabeled data exist
  • More is being produced all the time
  • It is expensive to generate labels for data
  • Labeling usually requires human intervention
  • Goal: use a limited amount of labeled data to guide the clustering of the entire dataset and thereby produce a better clustering

5
Semi-Supervised Clustering
  • Constraint-Based
  • Modify the objective function to include a penalty for violated constraints, or enforce constraints during the clustering or initialization process
  • Distance-Based
  • A distance function is trained on the supervised data to satisfy the labels or constraints, and is then applied to the complete dataset

6
Method
  • Combine the constraint-based and distance-based approaches in a unified method
  • Model the cluster labels and constraints with a Hidden Markov Random Field (HMRF)
  • Use constraints during initialization and when assigning points to clusters
  • Use an adaptive distance function, learning the distance measure during clustering
  • Cluster the data by minimizing an objective function derived from the HMRF

7
Main Points of Method
  • Improved initialization
  • Initial clusters are formed based on the constraints
  • Constraint-sensitive assignment of points to clusters
  • Points are assigned to clusters so as to minimize a distortion function while also minimizing the number of violated constraints
  • Iterative distance learning
  • The distortion measure is re-estimated after each iteration

8
Constraints
  • Pairwise constraints with must-link or cannot-link labels
  • A set M of must-link constraints
  • A set C of cannot-link constraints
  • A list of associated costs for violating must-link or cannot-link requirements
  • Class labels do not have to be known; a user can still specify relationships between points (a minimal encoding is sketched below)
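
For illustration, a minimal encoding of this supervision might look like the following sketch; the variable names and uniform weights are assumptions for this example, not structures from the paper.

    # Hypothetical encoding of pairwise supervision (illustrative only).
    # Must-link set M: pairs of point indices that should share a cluster.
    M = {(0, 1), (1, 2)}
    # Cannot-link set C: pairs of point indices that should be separated.
    C = {(0, 5), (3, 4)}
    # Violation costs for each constraint; uniform weights assumed here.
    w_must = {pair: 1.0 for pair in M}
    w_cannot = {pair: 1.0 for pair in C}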

9
HMRF
  • [Diagram of the Hidden Markov Random Field omitted from the transcript: hidden cluster labels linked by constraint edges, with the observed data points generated conditioned on those labels]
10
Posterior probability
  • This is an incomplete-data problem
  • Both the cluster representatives and the cluster labels are unknown
  • A popular method for solving this type of problem is Expectation Maximization (EM)
  • K-Means is equivalent to an EM algorithm with hard clustering assignments (see the formulation sketched below)
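
In symbols (my notation, a sketch consistent with the paper's MAP formulation): finding the most probable label configuration L given the observed data X on the HMRF,

    \Pr(L \mid X) \;\propto\; \exp\!\big(-J_{\mathrm{obj}}(L)\big),

so maximizing the posterior is equivalent to minimizing an objective J_obj that combines a distortion term over points with penalty terms over violated pairwise constraints; EM with hard assignments carries out this minimization.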

11
Must-Link Violation Cost Function
  • Ensures that the penalty for violating a must-link constraint between two points that are far apart is higher than between two points that are close
  • Penalizes distance functions under which must-link points are far apart

12
Cannot-Link Violation Cost Function
  • Ensures that the penalty for violating a cannot-link constraint between points that are nearby under the current distance function is higher than between distant points
  • Penalizes distance functions that place two cannot-link points close together (both penalty terms are sketched below)
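
A sketch of the two penalty terms in my notation, consistent with the paper's construction, where \varphi_D is a non-decreasing function of the current distance and \varphi_{D_{\max}} is its maximum value:

    f_M(x_i, x_j) \;=\; w_{ij}\,\varphi_D(x_i, x_j) \qquad \text{if } l_i \neq l_j,\; (x_i, x_j) \in M
    f_C(x_i, x_j) \;=\; \bar{w}_{ij}\,\big(\varphi_{D_{\max}} - \varphi_D(x_i, x_j)\big) \qquad \text{if } l_i = l_j,\; (x_i, x_j) \in C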

13
Objective Function
  • Goal: find the label assignment and parameters that minimize the objective function
  • Supervised data is used in initialization
  • Constraints are used in cluster assignments
  • Distance learning occurs in the M-step (the combined objective is sketched below)
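
Putting the pieces together, the combined objective has the following shape (my notation; a sketch consistent with the HMRF-KMeans formulation): distortion plus must-link and cannot-link violation penalties.

    J_{\mathrm{obj}} \;=\; \sum_{x_i \in X} D(x_i, \mu_{l_i})
      \;+\; \sum_{(x_i, x_j) \in M,\; l_i \neq l_j} w_{ij}\,\varphi_D(x_i, x_j)
      \;+\; \sum_{(x_i, x_j) \in C,\; l_i = l_j} \bar{w}_{ij}\,\big(\varphi_{D_{\max}} - \varphi_D(x_i, x_j)\big)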

14
Algorithm
  • EM framework
  • Initialization step
  • Use constraints to guide the formation of initial clusters
  • E-step
  • Minimizes the objective function over cluster assignments
  • M-step
  • Minimizes the objective function over cluster representatives
  • Minimizes the objective function over the parameters of the distortion measure (a skeleton of the loop follows this list)
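
As a point of reference, the skeleton below is the plain hard-assignment EM (i.e., K-Means) loop that HMRF-KMeans extends; this is an illustrative sketch, not the authors' code. The constraint-guided initialization, constraint-sensitive E-step, and distance learning from the surrounding slides plug into the marked steps.

    import numpy as np

    def em_loop(X, K, max_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        # Random init; HMRF-KMeans instead derives seeds from constraint
        # neighborhoods (see the initialization slides).
        centroids = X[rng.choice(len(X), K, replace=False)].copy()
        labels = np.zeros(len(X), dtype=int)
        for _ in range(max_iter):
            # E-step: nearest-centroid assignment; HMRF-KMeans adds
            # constraint-violation penalties here (see the E-step slide).
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
            labels = d.argmin(axis=1)
            # M-step: re-estimate representatives; HMRF-KMeans also
            # re-estimates distance parameters here (see the M-step slide).
            for k in range(K):
                if np.any(labels == k):
                    centroids[k] = X[labels == k].mean(axis=0)
        return labels, centroids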

15
Initialization
  • Initialize:
  • Form the transitive closure of the must-link constraints
  • This yields a set of connected components, i.e., neighborhoods of points connected by must-link constraints
  • Let y be the number of connected components
  • If y < K (the number of clusters), the y connected neighborhoods are used to create y initial clusters
  • The remaining clusters are initialized by random perturbations of the global centroid of the data (see the sketch after this list)
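
A minimal sketch of the neighborhood computation; union-find is my choice of data structure here, the paper only specifies taking the transitive closure.

    def must_link_neighborhoods(n_points, M):
        """Connected components of the must-link graph (transitive closure)."""
        parent = list(range(n_points))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i

        for a, b in M:                   # union the endpoints of each must-link
            parent[find(a)] = find(b)

        groups = {}
        for i in range(n_points):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    # Example: must-links 0-1 and 1-2 chain into a single neighborhood.
    print(must_link_neighborhoods(5, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3], [4]]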

16
What If More Neighborhoods Than Clusters?
  • If y > K (the number of clusters), K initial clusters are selected using the distance measure
  • Farthest-first traversal is a good heuristic
  • A weighted variant of farthest-first traversal is used
  • The distance between two centroids is multiplied by their corresponding weights
  • The weight of each centroid is proportional to the size of its corresponding neighborhood
  • This biases selection toward centroids that are relatively far apart and come from large neighborhoods (sketched below)
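
A sketch of the weighted variant under the weighting described above (pairwise distance scaled by both centroid weights); the starting choice and tie-breaking are my assumptions.

    import numpy as np

    def weighted_farthest_first(centroids, weights, K):
        """Pick K seeds, preferring centroids that are far apart and heavy."""
        centroids = np.asarray(centroids, dtype=float)
        weights = np.asarray(weights, dtype=float)
        chosen = [int(np.argmax(weights))]   # start from the heaviest centroid
        while len(chosen) < K:
            best, best_score = -1, -1.0
            for i in range(len(centroids)):
                if i in chosen:
                    continue
                # Weighted distance to the nearest already-chosen seed.
                score = min(np.linalg.norm(centroids[i] - centroids[j])
                            * weights[i] * weights[j] for j in chosen)
                if score > best_score:
                    best, best_score = i, score
            chosen.append(best)
        return chosen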

17
Initialization Continued
  • Assuming consistency of the data:
  • Augment the set M with must-link constraints inferred from the transitive closure
  • For each pair of neighborhoods Np, Nq that have at least one cannot-link constraint between them, add cannot-link constraints between every member of Np and every member of Nq
  • The aim is to learn as much about the data from the constraints as possible (a code sketch follows the diagrams on the next two slides)

18
Augment set M
a
b
c
Must-link
Must-link
Inferred Must-link
19
Augment Set C
  • [Diagram: points a, b, c with must-link a-b and cannot-link b-c; a cannot-link a-c is inferred]
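
The same inference in code form, as a sketch (the function name and set representation are mine):

    def augment_cannot_links(neighborhoods, C):
        """Expand C: one cross cannot-link makes all cross pairs cannot-links."""
        inferred = set(C)
        for p in range(len(neighborhoods)):
            for q in range(p + 1, len(neighborhoods)):
                Np, Nq = neighborhoods[p], neighborhoods[q]
                crosses = any((a in Np and b in Nq) or (a in Nq and b in Np)
                              for (a, b) in C)
                if crosses:
                    inferred.update((a, b) for a in Np for b in Nq)
        return inferred

    # Matching the diagrams: must-link a-b (points 0, 1) and cannot-link b-c
    # (points 1, 2) yield an inferred cannot-link a-c (points 0, 2).
    print(augment_cannot_links([[0, 1], [2]], {(1, 2)}))  # contains (0, 2) and (1, 2)
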
20
E-step
  • Assigns data points to clusters
  • Because the model incorporates interactions between points, computing the assignment that minimizes the objective function is computationally intractable
  • Approximate inference options include iterated conditional modes (ICM), belief propagation, and linear programming relaxation
  • ICM uses a greedy strategy that sequentially updates the cluster assignment of each point while keeping all other points fixed (sketched below)
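
A runnable sketch of one ICM sweep; for simplicity a uniform penalty w stands in for the paper's distance-dependent violation costs.

    import numpy as np

    def icm_sweep(X, centroids, labels, M, C, w=1.0):
        """Greedily update each point's label, holding all others fixed."""
        K = len(centroids)
        for i in range(len(X)):
            costs = ((centroids - X[i]) ** 2).sum(axis=1)   # distortion term
            for (a, b) in M:                                # must-link penalties
                if i in (a, b):
                    j = b if i == a else a
                    costs[np.arange(K) != labels[j]] += w   # cost unless co-clustered
            for (a, b) in C:                                # cannot-link penalties
                if i in (a, b):
                    j = b if i == a else a
                    costs[labels[j]] += w                   # cost if co-clustered
            labels[i] = int(costs.argmin())
        return labels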

21
M-step
  • First, cluster representatives are re-estimated to decrease the objective function
  • Constraints do not factor into this step, so it is equivalent to the K-Means update
  • If a parameterized variant of the distance measure is used, it is updated here
  • The parameters can be found through partial derivatives of the distance function
  • The learning step modifies the distortion measure so that similar points are brought closer together while dissimilar points are pulled apart (a worked instance follows)
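
As one concrete instance (my notation and my choice of parameterization; the slide only says a parameterized distance may be used), a squared Euclidean distance with a diagonal parameter matrix A can be adapted by gradient steps on the objective:

    d_A(x, y) \;=\; (x - y)^{\top} A \, (x - y), \qquad A = \mathrm{diag}(a_1, \dots, a_d), \; a_m \ge 0
    a_m \;\leftarrow\; a_m \;-\; \eta \, \frac{\partial J_{\mathrm{obj}}}{\partial a_m}

Increasing a_m stretches dimension m, so such updates shrink dimensions along which must-linked points differ and stretch dimensions along which cannot-linked points differ.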

22
Results
  • KMeans-I-C-D
  • The complete HMRF-KMeans algorithm, with supervised data used in initialization (I) and cluster assignment (C), plus distance learning (D)
  • KMeans-I-C
  • The HMRF-KMeans algorithm without distance learning
  • KMeans-I
  • The HMRF-KMeans algorithm without distance learning and without supervised cluster assignment

23-25
Results
  • [Result charts comparing the three variants omitted from the transcript]
26
Conclusion
  • HMRF-KMeans performs well (compared to naïve K-Means) with a limited number of constraints
  • The goal of the algorithm was to provide a better clustering method using a limited number of constraints
  • HMRF-KMeans learns quickly from a limited number of constraints
  • It should be applicable to datasets where we want to limit the amount of human labeling and where supervision can be specified as pairwise labels

27
Questions
  • Can all types of constraints be captured by pairwise associations?
  • What about hierarchical structure?
  • Could other types of labels be included in this model?
  • For example, class labels as well as pairwise constraints
  • How does the model handle noise in the data or labels?
  • For example: point A has a must-link constraint to point B, point B has a must-link constraint to point C, yet point A has a cannot-link constraint to point C

28
More Questions
  • How does this apply to other types of data?
  • The authors mention wanting to apply the method to other types of data in the future, such as gene representations
  • Who provides the weights for constraint violations, and how are those weights determined?
  • The method is only compared with the naïve K-Means baseline
  • How does it compare with other semi-supervised clustering methods?

29
Reference
  • S. Basu, M. Bilenko, and R. J. Mooney, "A Probabilistic Framework for Semi-Supervised Clustering," Proc. 10th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), Aug. 2004.

30
  • Thank you!