
A Probabilistic Framework for Semi-Supervised Clustering

- Sugato Basu
- Mikhail Bilenko
- Raymond J. Mooney
- Department of Computer Sciences
- University of Texas at Austin
- Presented by Jingting Zeng

Outline

- Introduction
- Background
- Algorithm
- Experiments
- Conclusion

What is Semi-Supervised Clustering?

- Use human input to provide labels for some of the data
- Improve existing naive clustering methods
- Use labeled data to guide clustering of unlabeled data
- End result is a better clustering of the data

Motivation

- Large amounts of unlabeled data exist
- More is being produced all the time
- Expensive to generate labels for data
- Usually requires human intervention
- Want to use limited amounts of labeled data to guide the clustering of the entire dataset, in order to produce a better clustering

Semi-Supervised Clustering

- Constraint-based
- Modify the objective function to include the cost of violating constraints, and enforce constraints during the clustering or initialization process
- Distance-based
- A distance function is trained on the supervised data to satisfy the labels or constraints, and is then applied to the complete dataset

Method

- Use both constraint-based and distance-based approaches in a unified method
- Use a Hidden Markov Random Field (HMRF) to model the constraints
- Use constraints for initialization and for assignment of points to clusters
- Use an adaptive distance function to learn the distance measure
- Cluster the data to minimize an objective function over distances and constraint violations

Main Points of Method

- Improved initialization
- Initial clusters are formed based on constraints
- Constraint-sensitive assignment of points to clusters
- Points are assigned to clusters to minimize a distortion function while minimizing the number of constraints violated
- Iterative distance learning
- Distortion measure is re-estimated after each iteration

Constraints

- Pairwise constraints with must-link or cannot-link labels
- Set M of must-link constraints
- Set C of cannot-link constraints
- A list of associated costs for violating must-link or cannot-link requirements
- Class labels do not have to be known, but a user can still specify relationships between points
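As a concrete illustration, here is a minimal sketch of this constraint representation (mine, not the authors'; the variable names and cost values are assumptions):

```python
# Points are referred to by integer index into the dataset.
# Each constraint maps a point pair to its violation cost.
M = {(0, 1): 1.0, (2, 5): 1.0}   # must-link pairs with costs w_ij
C = {(0, 7): 1.0, (3, 4): 2.0}   # cannot-link pairs with costs w_bar_ij

def violation_cost(pairs, labels, must_link):
    """Sum the costs of violated constraints under a cluster labeling."""
    total = 0.0
    for (i, j), w in pairs.items():
        same_cluster = labels[i] == labels[j]
        if must_link and not same_cluster:
            total += w
        elif not must_link and same_cluster:
            total += w
    return total

# Example: labels[i] is the cluster of point i.
labels = [0, 0, 1, 1, 1, 1, 0, 0]
print(violation_cost(M, labels, must_link=True))    # 0.0: both pairs co-clustered
print(violation_cost(C, labels, must_link=False))   # 3.0: both pairs violated
```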

HMRF

- (Figure: graphical-model diagram of the Hidden Markov Random Field, with hidden cluster labels and observed data points.)

Posterior probability

- This is an incomplete-data problem: both the cluster representatives and the cluster labels are unknown
- A popular method for solving this type of problem is Expectation Maximization (EM)
- K-Means is equivalent to an EM algorithm with hard clustering assignments
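To make the connection explicit (a reconstruction from the slide's description, so the notation is approximate): the HMRF posterior over cluster labels L given data X factors into a likelihood term and a constraint prior,

```latex
\Pr(L \mid X) \;\propto\; \Pr(X \mid L)\,\Pr(L)
\;=\; \Big( \prod_{x_i \in X} p\big(x_i \mid \mu_{l_i}\big) \Big)
\cdot \frac{1}{Z} \exp\Big( -\sum_{(x_i, x_j)} V(x_i, x_j) \Big)
```

so taking the negative log turns MAP estimation into minimizing a distortion term plus constraint-violation potentials, which a K-Means-style hard EM can approximately optimize.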

Must-Link Violation Cost Function

- Ensures that the penalty for violating a must-link constraint between two points that are far apart is higher than between two points that are close
- Penalizes distance functions under which must-link points are far apart

Cannot-Link Violation Cost Function

- Ensures that the penalty for violating a cannot-link constraint between points that are nearby according to the current distance function is higher than between distant points
- Penalizes distance functions that place two cannot-link points close together
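A minimal sketch of these two penalty shapes, assuming a feature-weighted squared Euclidean distortion and an upper bound `d_max` on its value (the names `dist`, `a`, and `d_max` are mine, not the paper's):

```python
import numpy as np

def dist(x, y, a):
    """Feature-weighted squared Euclidean distortion with weights a."""
    d = np.asarray(x) - np.asarray(y)
    return float(np.dot(a * d, d))

def must_link_penalty(x_i, x_j, w, a):
    # Grows with distance: violating a must-link between far-apart points
    # costs more than violating one between nearby points.
    return w * dist(x_i, x_j, a)

def cannot_link_penalty(x_i, x_j, w_bar, a, d_max):
    # Shrinks with distance: violating a cannot-link between nearby points
    # costs more than violating one between far-apart points.
    return w_bar * (d_max - dist(x_i, x_j, a))
```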

Objective Function

- Goal: find the minimum of the objective function
- Supervised data: used in initialization
- Constraints: used in cluster assignments
- Distance learning: done in the M-step
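Combining the distortion term with the two penalty functions above, the objective has roughly the following shape (reconstructed from the slides; the paper's exact notation may differ):

```latex
J_{\mathrm{obj}} \;=\; \sum_{x_i \in X} D\big(x_i, \mu_{l_i}\big)
\;+\; \sum_{\substack{(x_i, x_j) \in M \\ l_i \neq l_j}} w_{ij}\, \varphi_D(x_i, x_j)
\;+\; \sum_{\substack{(x_i, x_j) \in C \\ l_i = l_j}} \bar{w}_{ij}\, \big( \varphi_{D_{\max}} - \varphi_D(x_i, x_j) \big)
```

where D is the (possibly learned) distortion between a point and its cluster representative, and φ_D is a non-decreasing function of the distance between two points.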

Algorithm

- EM framework
- Initialization step
- Use constraints to guide initial cluster formation
- E-step
- Minimizes the objective function over cluster assignments
- M-step
- Minimizes the objective function over cluster representatives
- Minimizes the objective function over the parameters of the distortion measure
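A simplified, self-contained sketch of this alternation (mine, not the authors' code: it uses random initialization instead of the constraint-based initialization of the next slides, a fixed squared Euclidean distortion, and one ICM sweep per E-step):

```python
import numpy as np

def sq_dist(x, y):
    d = x - y
    return float(d @ d)

def penalty(i, k, labels, M, C, X, d_max):
    """Constraint-violation cost if point i were assigned to cluster k."""
    cost = 0.0
    for (p, q), w in M.items():
        j = q if p == i else (p if q == i else None)
        if j is not None and labels[j] is not None and labels[j] != k:
            cost += w * sq_dist(X[p], X[q])            # violated must-link
    for (p, q), w in C.items():
        j = q if p == i else (p if q == i else None)
        if j is not None and labels[j] == k:
            cost += w * (d_max - sq_dist(X[p], X[q]))  # violated cannot-link
    return cost

def hmrf_kmeans(X, K, M, C, max_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = [None] * len(X)
    d_max = max(sq_dist(a, b) for a in X for b in X)
    for _ in range(max_iter):
        # E-step: ICM-style greedy sequential assignment
        for i in range(len(X)):
            costs = [sq_dist(X[i], mu[k]) + penalty(i, k, labels, M, C, X, d_max)
                     for k in range(K)]
            labels[i] = int(np.argmin(costs))
        # M-step: re-estimate representatives (distance learning omitted here)
        for k in range(K):
            members = [i for i, l in enumerate(labels) if l == k]
            if members:
                mu[k] = X[members].mean(axis=0)
    return np.array(labels), mu
```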

Initialization

- Form the transitive closure of the must-link constraints
- The result is a set of connected components (neighborhoods), each consisting of points connected by must-link constraints
- Let y be the number of connected components
- If y < K (the number of clusters), the y connected neighborhoods are used to create y initial clusters
- Remaining clusters are initialized by random perturbations of the global centroid of the data
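One way to compute these neighborhoods is a union-find pass over the must-link graph (a sketch, assuming points are indexed 0..n-1 and M is iterable as (i, j) pairs):

```python
def must_link_neighborhoods(n_points, M):
    """Connected components of the must-link graph (its transitive closure)."""
    parent = list(range(n_points))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in M:                          # union endpoints of each must-link
        parent[find(i)] = find(j)

    components = {}
    for i in range(n_points):
        components.setdefault(find(i), []).append(i)
    return list(components.values())

# Example: must-links (0,1) and (1,2) collapse into one neighborhood {0,1,2}.
print(must_link_neighborhoods(5, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3], [4]]
```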

What If More Neighborhoods Than Clusters?

- If y > K (the number of clusters), K initial clusters are selected using the distance measure
- Farthest-first traversal is a good heuristic
- A weighted variant of farthest-first traversal is used
- Distance between two centroids is multiplied by their corresponding weights
- Weight of each centroid is proportional to the size of the corresponding neighborhood
- Biased to select centroids that are relatively far apart and of a decent size
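A sketch of the weighted variant described above (illustrative; `centroids` are the neighborhood centroids and `weights` their sizes, both assumptions about the interface rather than the authors' code):

```python
import numpy as np

def weighted_farthest_first(centroids, weights, K):
    """Greedily pick K centroids, biased toward large, well-separated ones.

    Starts from the centroid of the largest neighborhood; each step adds the
    candidate whose weighted distance to the closest already-chosen centroid
    is largest.
    """
    centroids = np.asarray(centroids, dtype=float)
    weights = np.asarray(weights, dtype=float)
    chosen = [int(np.argmax(weights))]
    while len(chosen) < K:
        def score(c):
            # weighted distance of candidate c to the chosen set
            return min(weights[c] * weights[s] *
                       np.sum((centroids[c] - centroids[s]) ** 2)
                       for s in chosen)
        candidates = [c for c in range(len(centroids)) if c not in chosen]
        chosen.append(max(candidates, key=score))
    return chosen
```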

Initialization Continued

- Assuming consistency of the data
- Augment set M with must-link constraints inferred from the transitive closure
- For each pair of neighborhoods Np, Nq that have at least one cannot-link constraint between them, add cannot-link constraints between every member of Np and every member of Nq
- Learn as much about the data through the constraints as possible
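A sketch of this augmentation step (mine; inferred constraints are given a default weight here, which is an assumption rather than something the slides specify):

```python
from itertools import combinations

def augment_constraints(neighborhoods, M, C, w_default=1.0):
    """Infer extra constraints from the must-link neighborhoods.

    Adds must-links within each neighborhood (transitive closure) and,
    for every pair of neighborhoods joined by at least one cannot-link,
    cannot-links between all cross-pairs of their members.
    """
    M, C = dict(M), dict(C)
    for nbhd in neighborhoods:
        for i, j in combinations(sorted(nbhd), 2):
            M.setdefault((i, j), w_default)
    for p, q in combinations(range(len(neighborhoods)), 2):
        Np, Nq = neighborhoods[p], neighborhoods[q]
        linked = any((i, j) in C or (j, i) in C for i in Np for j in Nq)
        if linked:
            for i in Np:
                for j in Nq:
                    C.setdefault((min(i, j), max(i, j)), w_default)
    return M, C
```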

Augment Set M

- (Diagram: points a, b, c with must-link(a, b) and must-link(b, c); an inferred must-link(a, c) is added.)

Augment Set C

- (Diagram: points a, b, c with must-link(a, b) and cannot-link(b, c); an inferred cannot-link(a, c) is added.)

E-step

- Assigns data points to clusters
- Since the model incorporates interactions between points, computing the point assignments that minimize the objective function is computationally intractable
- Approximation methods include iterated conditional modes (ICM), belief propagation, and linear programming relaxation
- ICM uses a greedy strategy: it sequentially updates the cluster assignment of each point while keeping the assignments of all other points fixed
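Written out, one ICM update for a single point (all other assignments held fixed) picks the cluster that minimizes distortion plus the cost of the constraints it would violate; this is a reconstruction in the notation of the objective above:

```latex
l_i \;\leftarrow\; \arg\min_{k \in \{1,\dots,K\}} \Big[
  D\big(x_i, \mu_k\big)
  + \sum_{\substack{(x_i, x_j) \in M \\ l_j \neq k}} w_{ij}\, \varphi_D(x_i, x_j)
  + \sum_{\substack{(x_i, x_j) \in C \\ l_j = k}} \bar{w}_{ij}\, \big( \varphi_{D_{\max}} - \varphi_D(x_i, x_j) \big)
\Big]
```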

M-step

- First, cluster representatives are re-estimated to decrease the objective function
- Constraints do not factor into this sub-step, so it is equivalent to the K-Means centroid update
- If a parameterized variant of the distance measure is used, it is updated here
- Parameters can be found through partial derivatives of the distance function
- The learning step modifies the distortion measure so that similar points are brought closer together while dissimilar points are pulled apart
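For concreteness, a sketch of one gradient step on a diagonally weighted squared Euclidean distortion (illustrative only: the paper derives its updates from partial derivatives of its own objective, and this sketch treats d_max as constant and omits the normalization terms that keep the weights from collapsing to zero):

```python
import numpy as np

def update_metric(a, X, labels, mu, M, C, lr=0.01):
    """One gradient step on diagonal metric weights a (illustrative sketch).

    Distortion: d_a(x, y) = sum_m a_m * (x_m - y_m)^2. A descent step
    decreases weights on features where violated must-link pairs differ
    (pulling them together) and increases weights on features where violated
    cannot-link pairs differ (pushing them apart).
    """
    grad = np.zeros_like(a)
    for i, l in enumerate(labels):                  # point-to-centroid term
        grad += (X[i] - mu[l]) ** 2
    for (i, j), w in M.items():                     # violated must-links
        if labels[i] != labels[j]:
            grad += w * (X[i] - X[j]) ** 2
    for (i, j), w in C.items():                     # violated cannot-links
        if labels[i] == labels[j]:
            grad -= w * (X[i] - X[j]) ** 2
    return np.maximum(a - lr * grad, 1e-8)          # keep weights positive
```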

Results

- HMRF-KMeans-I-C-D
- Complete HMRF-KMeans algorithm, with supervised data in initialization (I) and cluster assignment (C), and distance learning (D)
- HMRF-KMeans-I-C
- HMRF-KMeans without distance learning
- HMRF-KMeans-I
- HMRF-KMeans without distance learning and without supervised cluster assignment

Results

- (Plots of clustering performance versus number of constraints omitted.)

Conclusion

- HMRF-KMeans performs well (compared to naïve K-Means) with a limited number of constraints
- The goal of the algorithm was to provide a better clustering method using a limited number of constraints
- HMRF-KMeans learns quickly from a limited number of constraints
- Should be applicable to datasets where we want to limit the amount of human labeling, and where supervision can be specified as pairwise constraints

Questions

- Can all types of constraints be captured in pairwise associations?
- What about hierarchical structure?
- Could other types of labels be included in this model?
- Use class labels as well as pairwise constraints
- How does the model handle noise in the data/labels?
- For example: point A has a must-link constraint to point B, point B has a must-link constraint to point C, but point A has a cannot-link constraint to point C

More Questions

- How does this apply to other types of data?
- The authors mention wanting to apply the method to other kinds of data in the future, such as gene representations
- Who provides the weights for constraint violations, and how are those weights determined?
- Only compared with the naïve K-Means method
- How does it compare with other semi-supervised clustering methods?

Reference

- S. Basu, M. Bilenko, and R. J. Mooney, "A Probabilistic Framework for Semi-Supervised Clustering," Proc. 10th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), Aug. 2004.

- Thank you!