# A Probabilistic Framework for Semi-Supervised Clustering


1
A Probabilistic Framework for Semi-Supervised Clustering
• Sugato Basu, Mikhail Bilenko, Raymond J. Mooney
• (UT Austin)
• KDD 2004

2
Outline
• Introduction
• Background
• Algorithm
• Experiments

3
What is Semi-Supervised Clustering?
• Use human input to provide labels for some of the data
• Improve existing naive clustering methods
• Use labeled data to guide the clustering of unlabeled data
• The end result is a better clustering of the data

4
Motivation
• Large amounts of unlabeled data exist
• More is being produced all the time
• Generating labels for data is expensive
• Usually requires human intervention
• Want to use a limited amount of labeled data to guide the clustering of the entire dataset,
• in order to provide a better clustering method

5
Semi-Supervised Clustering
• Constraint-based
• Modify the objective function so that it includes a penalty for violated constraints,
• or enforce constraints during the clustering or initialization process
• Distance-based
• A distance function is trained on the supervised data to satisfy the labels or constraints,
• and is then applied to the complete dataset

6
Method
• Use both constraint-based and distance-based approaches in a unified method
• Use a Hidden Markov Random Field (HMRF) to model the constraints
• Use constraints for initialization and for assignment of points to clusters
• Use an adaptive distance function to learn the distance measure
• Cluster the data to minimize an objective function over distortion and constraint violations

7
Main Points of the Method
• Improved initialization
• Initial clusters are formed based on the constraints
• Constraint-sensitive assignment of points to clusters
• Points are assigned to clusters to minimize a distortion function,
• while also minimizing the number of constraints violated
• Iterative distance learning
• The distortion measure is re-estimated after each iteration

8
Constraints
• Pairwise constraints with must-link or cannot-link labels
• A set M of must-link constraints
• A set C of cannot-link constraints
• A list of associated costs for violating must-link or cannot-link requirements
• Class labels do not have to be known,
• but a user can still specify relationships between points
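The constraint sets described above can be sketched as plain data structures. A minimal sketch; the names (`must_link`, `cannot_link`, `w`, `w_bar`) and unit costs are illustrative, not the paper's notation:

```python
must_link = {(0, 1), (2, 3)}      # set M: pairs that should share a cluster
cannot_link = {(0, 4)}            # set C: pairs that should be separated
w = {(0, 1): 1.0, (2, 3): 1.0}    # cost of violating each must-link pair
w_bar = {(0, 4): 1.0}             # cost of violating each cannot-link pair

def violation_cost(labels):
    """Total penalty for constraint violations under a label assignment."""
    cost = 0.0
    for (i, j) in must_link:
        if labels[i] != labels[j]:   # must-link pair split across clusters
            cost += w[(i, j)]
    for (i, j) in cannot_link:
        if labels[i] == labels[j]:   # cannot-link pair placed together
            cost += w_bar[(i, j)]
    return cost
```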

9
HMRF
10
Posterior Probability
• This is an incomplete-data problem
• Both the cluster representatives and the class labels are unknown
• A popular method for solving this type of problem is Expectation Maximization (EM)
• K-Means is equivalent to an EM algorithm with hard clustering assignments

11
Must-Link Violation Cost Function
• Ensures that the penalty for violating a must-link constraint between two points that are far apart
• is higher than between two points that are close
• Punishes distance functions in which must-link points are far apart

12
Cannot-Link Violation Cost Function
• Ensures that the penalty for violating a cannot-link constraint between points that are nearby (according to the current distance function)
• is higher than between distant points
• Punishes distance functions that place two cannot-link points close together
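The two cost functions on the slides above can be sketched as simple monotone penalties. The linear form, the unit weights, and `phi_max` (an assumed upper bound on the distance measure) are illustrative choices, not the paper's exact parameterization:

```python
def must_link_penalty(dist, weight=1.0):
    # a violated must-link pair costs more the farther apart the points are
    return weight * dist

def cannot_link_penalty(dist, phi_max, weight=1.0):
    # a violated cannot-link pair costs more the closer together the points are
    return weight * (phi_max - dist)
```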

13
Objective Function
• Goal: find the assignment that minimizes the objective function
• Supervised data is used in initialization
• Constraints are used in cluster assignments
• Distance learning happens in the M-step
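The combined objective can be sketched as distortion plus distance-scaled violation penalties. Squared Euclidean distortion and unit violation weights are simplifying assumptions here; the paper supports learned, parameterized distortion measures:

```python
def sq_dist(a, b):
    """Squared Euclidean distance (stand-in for the learned distortion)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def objective(points, labels, centroids, must_link, cannot_link, phi_max):
    # distortion of each point to its cluster representative
    J = sum(sq_dist(x, centroids[labels[i]]) for i, x in enumerate(points))
    for (i, j) in must_link:
        if labels[i] != labels[j]:
            J += sq_dist(points[i], points[j])  # far apart: big penalty
    for (i, j) in cannot_link:
        if labels[i] == labels[j]:
            J += phi_max - sq_dist(points[i], points[j])  # close: big penalty
    return J
```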

14
Algorithm
• EM framework
• Initialization step
• Use constraints to guide initial cluster formation
• E-step
• Minimizes the objective function over cluster assignments
• M-step
• Minimizes the objective function over cluster representatives
• Minimizes the objective function over the parameters of the distortion measure

15
Initialization
• Form the transitive closure of the must-link constraints
• This yields a set of connected components (neighborhoods) of points connected by must-link constraints
• Let y be the number of connected components
• If y < K (the number of clusters), the y connected neighborhoods are used to create y initial clusters
• The remaining clusters are initialized by random perturbations of the global centroid of the data
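The transitive-closure step above can be sketched with a small union-find over the must-link graph; the function name and the choice to drop singleton components are illustrative assumptions:

```python
from collections import defaultdict

def must_link_neighborhoods(n, must_link):
    """Connected components of the must-link graph over points 0..n-1."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in must_link:
        parent[find(i)] = find(j)          # union the two components

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(i)
    # keep only neighborhoods actually induced by constraints
    return [g for g in groups.values() if len(g) > 1]
```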

16
Initialization (Continued)
• If y > K (the number of clusters), K initial clusters are selected using the distance measure
• Farthest-first traversal is a good heuristic
• A weighted variant of farthest-first traversal is used
• The distance between two centroids is multiplied by their corresponding weights
• The weight of each centroid is proportional to the size of the corresponding neighborhood
• Biased to select centroids that are relatively far apart and of a decent size
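The weighted variant can be sketched as below. Scoring a candidate by the product of the two weights and the centroid distance is an assumption consistent with the slide, not the paper's exact formula:

```python
import math

def weighted_farthest_first(centroids, weights, k, dist=math.dist):
    """Pick k centroid indices: heaviest neighborhood first, then greedily
    the candidate whose weighted distance to the chosen set is largest."""
    chosen = [max(range(len(centroids)), key=lambda i: weights[i])]
    while len(chosen) < k:
        remaining = [i for i in range(len(centroids)) if i not in chosen]

        def score(i):
            # weighted distance: centroid distance scaled by both weights
            return min(weights[i] * weights[j] * dist(centroids[i], centroids[j])
                       for j in chosen)

        chosen.append(max(remaining, key=score))
    return chosen
```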

17
Initialization (Continued)
• Assuming consistency of the data
• Augment the set M with must-link constraints inferred from the transitive closure
• For each pair of neighborhoods Np and Np′ that have at least one cannot-link constraint between them,
• add cannot-link constraints between every member of Np and every member of Np′
• Learn as much about the data through the constraints as possible
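The constraint-inference step above can be sketched as follows: every within-neighborhood pair becomes an inferred must-link, and a single cannot-link between two neighborhoods spreads to all cross pairs. Pair ordering (smaller index first) is an illustrative convention:

```python
from itertools import combinations, product

def augment(neighborhoods, cannot_link):
    """Augment M with transitive-closure must-links and C with inferred
    cannot-links between constrained neighborhoods."""
    M = set()
    for group in neighborhoods:
        M.update(combinations(sorted(group), 2))   # all within-group pairs
    C = set(cannot_link)
    for ga, gb in combinations(neighborhoods, 2):
        # one cannot-link between the groups implies cannot-links everywhere
        if any((i, j) in cannot_link or (j, i) in cannot_link
               for i, j in product(ga, gb)):
            C.update(product(ga, gb))
    return M, C
```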

18
Augment set M
19
Augment set C
20
E-step
• Assignment of data points to clusters
• Since the model incorporates interactions between points,
• computing point assignments that minimize the objective function is computationally intractable
• Approximations are available:
• iterated conditional modes (ICM), belief propagation, and linear programming relaxation
• ICM uses a greedy strategy to sequentially update the cluster assignment of each point,
• while keeping the other points fixed
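The ICM sweep can be sketched as below, reusing squared Euclidean distortion as a stand-in for the learned measure; `phi_max` and unit violation weights are simplifying assumptions:

```python
def icm_step(points, labels, centroids, must_link, cannot_link, phi_max):
    """One greedy sweep: reassign each point to the cluster minimizing its
    local cost (distortion + constraint penalties), others held fixed."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for i, x in enumerate(points):
        def local_cost(h):
            cost = sq_dist(x, centroids[h])
            for (a, b) in must_link:
                if i in (a, b):
                    other = b if a == i else a
                    if labels[other] != h:            # must-link violated
                        cost += sq_dist(x, points[other])
            for (a, b) in cannot_link:
                if i in (a, b):
                    other = b if a == i else a
                    if labels[other] == h:            # cannot-link violated
                        cost += phi_max - sq_dist(x, points[other])
            return cost

        labels[i] = min(range(len(centroids)), key=local_cost)
    return labels
```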

21
M-step
• First, cluster representatives are re-estimated to decrease the objective function
• Constraints do not factor into this step, so it is equivalent to the K-Means update
• If a parameterized variant of a distance measure is used, it is updated here
• The parameters can be found through partial derivatives of the distance function
• The learning step modifies the distortion measure so that
• similar points are brought closer together, while dissimilar points are pulled apart
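The centroid half of the M-step matches the K-Means update and can be sketched as below; non-empty clusters are assumed, and the distance-learning half is omitted since it depends on the chosen parameterized measure:

```python
def update_centroids(points, labels, k):
    """Recompute each representative as the mean of its assigned points
    (constraint terms do not depend on the centroids)."""
    dim = len(points[0])
    centroids = []
    for h in range(k):
        members = [x for x, l in zip(points, labels) if l == h]
        centroids.append(tuple(sum(x[d] for x in members) / len(members)
                               for d in range(dim)))
    return centroids
```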

22
Results
• HMRF-KMeans-I-C-D
• The complete HMRF-KMeans algorithm, with supervised data in initialization (I), cluster assignment (C), and distance learning (D)
• HMRF-KMeans-I-C
• The HMRF-KMeans algorithm without distance learning
• HMRF-KMeans-I
• The HMRF-KMeans algorithm without distance learning or supervised cluster assignment

23
Results (Continued)
24
Results (Continued)
25
Results (Continued)
26
Questions
• Can all types of constraints be captured in pairwise associations?
• What about hierarchical structure?
• Could other types of labels be included in this model?
• E.g., use class labels as well as pairwise constraints
• How does this model handle noise in the data/labels?
• E.g., Point A has a must-link constraint to Point B, Point B has a must-link constraint to Point C, but Point A has a cannot-link constraint to Point C

27
Conclusion
• HMRF-KMeans performs well (compared to naïve K-Means) with a limited number of constraints
• The goal of the algorithm was to provide a better clustering method
• with the use of a limited number of constraints
• HMRF-KMeans learns quickly from a limited number of constraints
• Should be applicable to data sets where we want to limit the amount of labeling to be done by humans,
• and where constraints can be specified as pairwise labels