A Probabilistic Framework for Semi-Supervised Clustering - PowerPoint PPT Presentation

About This Presentation
Title:

A Probabilistic Framework for Semi-Supervised Clustering

Slides: 28
Provided by: mehedy
Transcript and Presenter's Notes

1
A Probabilistic Framework for Semi-Supervised Clustering
  • Sugato Basu, Mikhail Bilenko, Raymond J. Mooney (UT Austin)
  • KDD 2004

2
Outline
  • Introduction
  • Background
  • Algorithm
  • Experiments

3
What is Semi-Supervised Clustering?
  • Use human input to provide labels for some of the
    data
  • Improve existing naive clustering methods
  • Use labeled data to guide clustering of
    unlabeled data
  • End result is a better clustering of data

4
Motivation
  • Large amounts of unlabeled data exist
  • More is being produced all the time
  • Labels are expensive to generate
  • Labeling usually requires human intervention
  • Want to use limited amounts of labeled data to guide the clustering
    of the entire dataset and so produce a better clustering

5
Semi-Supervised Clustering
  • Constraint-based
  • Modify the objective function so that it includes satisfying the
    constraints, or enforce the constraints during the clustering or
    initialization process
  • Distance-based
  • A distance function is trained on the supervised data to satisfy
    the labels or constraints, and is then applied to the complete dataset

6
Method
  • Use both constraint-based and distance-based
    approaches in a unified method
  • Use Hidden Markov Random Fields (HMRFs) to model
    the constraints
  • Use constraints for initialization and assignment
    of points to clusters
  • Use an adaptive distance function to learn
    the distance measure
  • Cluster the data to minimize an objective
    function

7
Main Points of Method
  • Improved initialization
  • Initial clusters are formed based on constraints
  • Constraint-sensitive assignment of points to
    clusters
  • Points are assigned to clusters to minimize a
    distortion function while also minimizing the number of
    constraints violated
  • Iterative distance learning
  • Distortion measure is re-estimated after each
    iteration

8
Constraints
  • Pairwise constraints with must-link or
    cannot-link labels
  • Set M of must-link constraints
  • Set C of cannot-link constraints
  • A list of associated costs for violating
    must-link or cannot-link constraints
  • Class labels do not have to be known, but a user can still
    specify relationships between points

9
HMRF

10
Posterior probability
  • This is an incomplete-data problem
  • Cluster representatives as well as cluster labels
    are unknown
  • A popular method for solving this type of problem
    is Expectation Maximization (EM)
  • K-Means is equivalent to an EM algorithm with
    hard clustering assignments

11
Must-Link Violation Cost Function
  • Ensures that the penalty for violating a must-link constraint
    between two points that are far apart is higher than between
    two points that are close
  • Punishes distance functions under which must-link
    points are far apart

12
Cannot-Link Violation Cost Function
  • Ensures that the penalty for violating a cannot-link constraint
    between points that are nearby according to the current distance
    function is higher than between distant points
  • Punishes distance functions that place two
    cannot-link points close together
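
The two penalties described on this slide and the previous one can be sketched as follows. This is only an illustration under stated assumptions: squared Euclidean distance stands in for the distortion measure, `d_max` is assumed to be the largest distance in the data, and the function names and the `weight` parameter are hypothetical, not taken from the slides.

```python
def sq_dist(x, y):
    """Squared Euclidean distance (assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def must_link_penalty(x, y, weight=1.0):
    """Cost of putting two must-linked points in different clusters.

    Grows with distance, so far-apart must-link pairs cost more.
    """
    return weight * sq_dist(x, y)


def cannot_link_penalty(x, y, d_max, weight=1.0):
    """Cost of putting two cannot-linked points in the same cluster.

    Shrinks with distance, so nearby cannot-link pairs cost more.
    d_max (assumed: the largest distance in the data) keeps it >= 0.
    """
    return weight * (d_max - sq_dist(x, y))


near = ([0.0, 0.0], [1.0, 0.0])
far = ([0.0, 0.0], [3.0, 0.0])
d_max = 9.0
print(must_link_penalty(*far) > must_link_penalty(*near))
print(cannot_link_penalty(*near, d_max) > cannot_link_penalty(*far, d_max))
```

Both penalties depend on the current distance function, which is why relearning the distance measure also changes how costly each violation is.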

13
Objective Function
  • Goal: minimize the objective function
  • Supervised data in initialization
  • Constraints in cluster assignments
  • Distance learning in M-step
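
A rough sketch of such an objective is given below, assuming squared Euclidean distortion, a single uniform violation weight `w`, and a precomputed `d_max`. All names are hypothetical, and any normalization or prior terms the full model may include are left out.

```python
def sq_dist(x, y):
    """Squared Euclidean distance (assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def objective(points, labels, centroids, must, cannot, d_max, w=1.0):
    """Distortion plus penalties for violated pairwise constraints.

    must / cannot are lists of index pairs (i, j); w is a uniform
    violation weight (an assumption; costs could differ per pair).
    """
    # Distance of every point to its assigned cluster representative.
    total = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    # Must-link pairs split across clusters are violations.
    for i, j in must:
        if labels[i] != labels[j]:
            total += w * sq_dist(points[i], points[j])
    # Cannot-link pairs placed in the same cluster are violations.
    for i, j in cannot:
        if labels[i] == labels[j]:
            total += w * (d_max - sq_dist(points[i], points[j]))
    return total


points = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
centroids = [[0.0, 0.5], [5.0, 0.5]]
must, cannot = [(0, 1)], [(1, 2)]
print(objective(points, [0, 0, 1, 1], centroids, must, cannot, d_max=26.0))
print(objective(points, [0, 1, 1, 1], centroids, must, cannot, d_max=26.0))
```

The second labeling splits the must-link pair and moves a point far from its centroid, so it scores worse than the first.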

14
Algorithm
  • EM framework
  • Initialization step
  • Use constraints to guide initial cluster
    formations
  • E-step
  • Minimizes the objective function over cluster
    assignments
  • M-step
  • Minimizes the objective function over cluster
    representatives
  • Minimizes the objective function over the parameters
    of the distortion measure

15
Initialization
  • Form the transitive closure of the must-link
    constraints
  • This gives a set of connected components (neighborhoods)
    consisting of points connected by must-link constraints
  • Let y be the number of connected components
  • If y < K (number of clusters), the y connected
    neighborhoods are used to create y initial clusters
  • Remaining clusters are initialized by random
    perturbations of the global centroid of the data
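
The neighborhood-forming step can be sketched with a small union-find structure. The function names are hypothetical, and treating only components with at least two points as neighborhoods is an assumption of this sketch.

```python
def must_link_neighborhoods(n_points, must):
    """Group points into connected components of the must-link graph."""
    parent = list(range(n_points))

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in must:
        parent[find(i)] = find(j)

    groups = {}
    for i in range(n_points):
        groups.setdefault(find(i), []).append(i)
    # Keep only components that contain at least one constraint.
    return [g for g in groups.values() if len(g) > 1]


def neighborhood_centroid(points, group):
    """Mean of the points in one neighborhood (an initial cluster center)."""
    dim = len(points[0])
    return [sum(points[i][m] for i in group) / len(group) for m in range(dim)]


hoods = must_link_neighborhoods(6, [(0, 1), (1, 2), (4, 5)])
print(hoods)
```

Here points 0, 1 and 2 end up in one neighborhood even though no constraint links 0 and 2 directly, which is the effect of the transitive closure.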

16
Initialization (Continued)
  • If y > K (number of clusters), K initial clusters
    are selected from the y neighborhoods using the distance measure
  • Farthest-first traversal is a good heuristic
  • A weighted variant of farthest-first traversal is used
  • Distance between two centroids is multiplied by their
    corresponding weights
  • Weight of each centroid is proportional to the
    size of the corresponding neighborhood
  • Biased to select centroids that are relatively
    far apart and belong to reasonably large neighborhoods
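
One way the weighted traversal might look is sketched below. Starting from the largest neighborhood and using squared Euclidean distance are assumptions of this sketch, and the function name is hypothetical.

```python
def sq_dist(x, y):
    """Squared Euclidean distance (assumed distance measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def weighted_farthest_first(centroids, sizes, k):
    """Pick k neighborhood centroids, favoring large, distant ones.

    The weighted distance between two centroids is their distance times
    both weights; a weight is the neighborhood's share of the points.
    Starting from the largest neighborhood is an assumption.
    """
    total = float(sum(sizes))
    weights = [s / total for s in sizes]
    chosen = [max(range(len(centroids)), key=lambda i: sizes[i])]

    def score(i):
        # Weighted distance to the closest already-chosen centroid.
        return min(weights[i] * weights[j] * sq_dist(centroids[i], centroids[j])
                   for j in chosen)

    while len(chosen) < k:
        candidates = [i for i in range(len(centroids)) if i not in chosen]
        chosen.append(max(candidates, key=score))
    return chosen


centroids = [[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [10.5, 0.0]]
sizes = [5, 4, 3, 1]
print(weighted_farthest_first(centroids, sizes, 2))
```

In this example the tiny neighborhood at 10.5 is the farthest from the start, but the larger one at 10.0 is chosen because its weight outweighs the small gap in distance.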

17
Initialization (Continued)
  • Assuming consistency of data
  • Augment set M with must-link constraints inferred
    from the transitive closure
  • For each pair of neighborhoods Np, Np' that have at least one
    cannot-link constraint between them, add cannot-link constraints
    between every member of Np and every member of Np'
  • Learn as much about the data through the
    constraints as possible
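
The inference of extra constraints can be sketched as follows, assuming the neighborhoods have already been computed as lists of point indices and that the constraints are consistent. The function name and the choice to store pairs as sorted tuples are hypothetical.

```python
from itertools import combinations


def augment_constraints(neighborhoods, cannot):
    """Infer extra must-link and cannot-link pairs from neighborhoods.

    neighborhoods: lists of point indices (must-link components).
    cannot: list of (i, j) cannot-link pairs given by the user.
    Assumes consistency: no cannot-link pair inside one neighborhood.
    """
    owner = {p: n for n, group in enumerate(neighborhoods) for p in group}

    # Every pair inside a neighborhood is must-linked by transitivity.
    must_aug = {tuple(sorted(pair))
                for group in neighborhoods
                for pair in combinations(group, 2)}

    cannot_aug = {tuple(sorted(pair)) for pair in cannot}
    for i, j in cannot:
        if i in owner and j in owner and owner[i] != owner[j]:
            # One cannot-link between two neighborhoods separates
            # every member of one from every member of the other.
            for a in neighborhoods[owner[i]]:
                for b in neighborhoods[owner[j]]:
                    cannot_aug.add(tuple(sorted((a, b))))
    return must_aug, cannot_aug


must_aug, cannot_aug = augment_constraints([[0, 1, 2], [3, 4]], [(2, 3)])
print(sorted(must_aug))
print(sorted(cannot_aug))
```

A single cannot-link pair between two neighborhoods expands into one pair for every cross-neighborhood combination, which is how a few user constraints yield many inferred ones.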

18
Augment set M

19
Augment set C

20
E-step
  • Assignment of data points to clusters
  • Since the model incorporates interactions between points,
    computing point assignments that minimize the objective
    function is computationally intractable
  • Approximations are available
  • Iterated conditional modes (ICM), belief
    propagation, and linear programming relaxation
  • ICM uses a greedy strategy to sequentially update the cluster
    assignment of each point while keeping the other points fixed
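
A greedy ICM-style assignment pass might look like the sketch below. It assumes squared Euclidean distortion, a uniform violation weight `w`, a precomputed `d_max`, and a random visiting order; the function name and parameters are hypothetical.

```python
import random


def sq_dist(x, y):
    """Squared Euclidean distance (assumed distortion measure)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def icm_assign(points, centroids, labels, must, cannot, d_max,
               w=1.0, max_sweeps=10, seed=0):
    """Greedy ICM-style E-step: re-label one point at a time."""
    rng = random.Random(seed)
    labels = list(labels)
    # Index the constraints by point for quick lookup.
    ml = {i: [] for i in range(len(points))}
    cl = {i: [] for i in range(len(points))}
    for i, j in must:
        ml[i].append(j)
        ml[j].append(i)
    for i, j in cannot:
        cl[i].append(j)
        cl[j].append(i)

    def cost(i, k):
        # Distortion plus penalties for constraints violated if i joins k.
        c = sq_dist(points[i], centroids[k])
        c += sum(w * sq_dist(points[i], points[j])
                 for j in ml[i] if labels[j] != k)
        c += sum(w * (d_max - sq_dist(points[i], points[j]))
                 for j in cl[i] if labels[j] == k)
        return c

    for _ in range(max_sweeps):
        order = list(range(len(points)))
        rng.shuffle(order)  # visit points in random order
        changed = False
        for i in order:
            best = min(range(len(centroids)), key=lambda k: cost(i, k))
            if best != labels[i]:
                labels[i], changed = best, True
        if not changed:
            break  # no point moved, so a local minimum was reached
    return labels


points = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
centroids = [[0.5, 0.0], [4.5, 0.0]]
print(icm_assign(points, centroids, [0, 0, 0, 0], [], [], d_max=25.0))
```

Because each update only looks at one point with the others held fixed, the result is a local minimum and can depend on the visiting order.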

21
M-step
  • First, cluster representatives are re-estimated
    to decrease the objective function
  • Constraints do not factor into this step, so it
    is equivalent to K-Means
  • If a parameterized variant of a distance measure is
    used, its parameters are updated here
  • Parameters can be found through partial
    derivatives of the distance function
  • The learning step modifies the distortion measure so that
    similar points are brought closer together, while
    dissimilar points are pulled apart
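
The two parts of this step can be sketched as follows. The centroid update is the usual K-Means mean. The metric update is only an illustration under strong assumptions: a diagonally weighted squared Euclidean distance with weights `a`, a plain gradient step, `d_max` treated as a constant, and no normalization term. All names and parameters are hypothetical.

```python
def update_centroids(points, labels, k):
    """Mean of the points currently assigned to each cluster."""
    dim = len(points[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p, l in zip(points, labels):
        counts[l] += 1
        for m in range(dim):
            sums[l][m] += p[m]
    # Empty clusters get a zero vector here; a real system would reseed.
    return [[s / max(c, 1) for s in row] for row, c in zip(sums, counts)]


def metric_gradient_step(points, labels, centroids, must, cannot, a,
                         w=1.0, lr=0.01, floor=1e-6):
    """One gradient step on weights a of D_a(x, y) = sum_m a[m]*(x[m]-y[m])**2."""
    dim = len(a)
    grad = [0.0] * dim
    for p, l in zip(points, labels):
        for m in range(dim):
            grad[m] += (p[m] - centroids[l][m]) ** 2
    for i, j in must:
        if labels[i] != labels[j]:  # violated must-link: shrink the gap
            for m in range(dim):
                grad[m] += w * (points[i][m] - points[j][m]) ** 2
    for i, j in cannot:
        if labels[i] == labels[j]:  # violated cannot-link: widen the gap
            for m in range(dim):
                grad[m] -= w * (points[i][m] - points[j][m]) ** 2
    # Keep weights positive so D_a stays a valid distortion measure.
    return [max(a_m - lr * g, floor) for a_m, g in zip(a, grad)]


print(update_centroids([[0.0, 0.0], [2.0, 0.0], [10.0, 0.0]], [0, 0, 1], 2))
```

Without some normalization of the weights, repeated steps of this kind would tend to shrink all of them, so a fuller treatment needs a term that prevents that collapse.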

22
Results
  • KMeans-I-C-D
  • Complete HMRF-KMeans algorithm, with supervised
    data in initialization (I) and cluster
    assignment (C), and distance learning (D)
  • KMeans-I-C
  • HMRF-KMeans algorithm without distance learning
  • KMeans-I
  • HMRF-KMeans algorithm with neither distance learning
    nor supervised cluster assignment

23
Results (Continued)

24
Results (Continued)

25
Results (Continued)

26
Questions
  • Can all types of constraints be captured in
    pairwise associations?
  • Hierarchical structure?
  • Could other types of labels be included in this
    model?
  • Use class labels as well as pairwise constraints
  • How does this model handle noise in the
    data/labels?
  • For example, point A has a must-link constraint to point B,
    point B has a must-link constraint to point C, and
    point A has a cannot-link constraint to point C

27
Conclusion
  • HMRF-KMeans performs well (compared to naïve
    K-Means) with a limited number of constraints
  • The goal of the algorithm was to provide a better
    clustering method using a limited number of constraints
  • HMRF-KMeans learns quickly from a limited number
    of constraints
  • Should be applicable to datasets where we want
    to limit the amount of labeling humans must do, and where
    constraints can be specified as pairwise labels