1
A Probabilistic Framework for Semi-Supervised
Clustering
  • Sugato Basu
  • Mikhail Bilenko
  • Raymond J. Mooney
  • Department of Computer Sciences
  • University of Texas at Austin
  • Presented by Jingting Zeng

2
Outline
  • Introduction
  • Background
  • Algorithm
  • Experiments
  • Conclusion

3
What is Semi-Supervised Clustering?
  • Use human input to provide labels for some of the
    data
  • Improve existing naive clustering methods
  • Use labeled data to guide clustering of unlabeled
    data
  • The end result is a better clustering of the data

4
Motivation
  • Large amounts of unlabeled data exist
  • More is being produced all the time
  • It is expensive to generate labels for data
  • Labeling usually requires human intervention
  • Goal: use a limited amount of labeled data to guide the clustering of the entire dataset and thereby produce a better clustering

5
Semi-Supervised Clustering
  • Constraint-Based
  • Modify the objective function to include a penalty for violated constraints, or enforce constraints during the clustering or initialization process
  • Distance-Based
  • A distance function is trained on the supervised data to satisfy the labels or constraints, and is then applied to the complete dataset

6
Method
  • Combine the constraint-based and distance-based approaches in a unified method
  • Model the cluster labels and constraints with a Hidden Markov Random Field (HMRF)
  • Use constraints during initialization and when assigning points to clusters
  • Use an adaptive distance function, learning the distance measure during clustering
  • Cluster the data by minimizing an objective function derived from the HMRF

7
Main Points of Method
  • Improved initialization
  • Initial clusters are formed based on the constraints
  • Constraint-sensitive assignment of points to clusters
  • Points are assigned to clusters so as to minimize a distortion function while also minimizing the number of violated constraints
  • Iterative distance learning
  • The distortion measure is re-estimated after each iteration

8
Constraints
  • Pairwise constraints with must-link or cannot-link labels
  • A set M of must-link constraints
  • A set C of cannot-link constraints
  • A list of associated costs for violating must-link or cannot-link requirements
  • Class labels do not have to be known; a user can still specify relationships between points (a minimal encoding is sketched below)
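
For illustration, a minimal encoding of this supervision might look like the following sketch; the variable names and uniform weights are assumptions for this example, not structures from the paper.

    # Hypothetical encoding of pairwise supervision (illustrative only).
    # Must-link set M: pairs of point indices that should share a cluster.
    M = {(0, 1), (1, 2)}
    # Cannot-link set C: pairs of point indices that should be separated.
    C = {(0, 5), (3, 4)}
    # Violation costs for each constraint; uniform weights assumed here.
    w_must = {pair: 1.0 for pair in M}
    w_cannot = {pair: 1.0 for pair in C}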

9
HMRF
  • [Diagram of the Hidden Markov Random Field omitted from the transcript: hidden cluster labels linked by constraint edges, with the observed data points generated conditioned on those labels]
10
Posterior probability
  • This is an incomplete-data problem
  • Both the cluster representatives and the cluster labels are unknown
  • A popular method for solving this type of problem is Expectation Maximization (EM)
  • K-Means is equivalent to an EM algorithm with hard clustering assignments (see the formulation sketched below)
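
In symbols (my notation, a sketch consistent with the paper's MAP formulation): finding the most probable label configuration L given the observed data X on the HMRF,

    \Pr(L \mid X) \;\propto\; \exp\!\big(-J_{\mathrm{obj}}(L)\big),

so maximizing the posterior is equivalent to minimizing an objective J_obj that combines a distortion term over points with penalty terms over violated pairwise constraints; EM with hard assignments carries out this minimization.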

11
Must-Link Violation Cost Function
  • Ensures that the penalty for violating a must-link constraint between two points that are far apart is higher than between two points that are close
  • Penalizes distance functions under which must-link points are far apart

12
Cannot-Link Violation Cost Function
  • Ensures that the penalty for violating a cannot-link constraint between points that are nearby under the current distance function is higher than between distant points
  • Penalizes distance functions that place two cannot-link points close together (both penalty terms are sketched below)
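
A sketch of the two penalty terms in my notation, consistent with the paper's construction, where \varphi_D is a non-decreasing function of the current distance and \varphi_{D_{\max}} is its maximum value:

    f_M(x_i, x_j) \;=\; w_{ij}\,\varphi_D(x_i, x_j) \qquad \text{if } l_i \neq l_j,\; (x_i, x_j) \in M
    f_C(x_i, x_j) \;=\; \bar{w}_{ij}\,\big(\varphi_{D_{\max}} - \varphi_D(x_i, x_j)\big) \qquad \text{if } l_i = l_j,\; (x_i, x_j) \in C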

13
Objective Function
  • Goal: find the label assignment and parameters that minimize the objective function
  • Supervised data is used in initialization
  • Constraints are used in cluster assignments
  • Distance learning occurs in the M-step (the combined objective is sketched below)
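
Putting the pieces together, the combined objective has the following shape (my notation; a sketch consistent with the HMRF-KMeans formulation): distortion plus must-link and cannot-link violation penalties.

    J_{\mathrm{obj}} \;=\; \sum_{x_i \in X} D(x_i, \mu_{l_i})
      \;+\; \sum_{(x_i, x_j) \in M,\; l_i \neq l_j} w_{ij}\,\varphi_D(x_i, x_j)
      \;+\; \sum_{(x_i, x_j) \in C,\; l_i = l_j} \bar{w}_{ij}\,\big(\varphi_{D_{\max}} - \varphi_D(x_i, x_j)\big)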

14
Algorithm
  • EM framework
  • Initialization step
  • Use constraints to guide the formation of initial clusters
  • E-step
  • Minimizes the objective function over cluster assignments
  • M-step
  • Minimizes the objective function over cluster representatives
  • Minimizes the objective function over the parameters of the distortion measure (a skeleton of the loop follows this list)
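
As a point of reference, the skeleton below is the plain hard-assignment EM (i.e., K-Means) loop that HMRF-KMeans extends; this is an illustrative sketch, not the authors' code. The constraint-guided initialization, constraint-sensitive E-step, and distance learning from the surrounding slides plug into the marked steps.

    import numpy as np

    def em_loop(X, K, max_iter=20, seed=0):
        rng = np.random.default_rng(seed)
        # Random init; HMRF-KMeans instead derives seeds from constraint
        # neighborhoods (see the initialization slides).
        centroids = X[rng.choice(len(X), K, replace=False)].copy()
        labels = np.zeros(len(X), dtype=int)
        for _ in range(max_iter):
            # E-step: nearest-centroid assignment; HMRF-KMeans adds
            # constraint-violation penalties here (see the E-step slide).
            d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
            labels = d.argmin(axis=1)
            # M-step: re-estimate representatives; HMRF-KMeans also
            # re-estimates distance parameters here (see the M-step slide).
            for k in range(K):
                if np.any(labels == k):
                    centroids[k] = X[labels == k].mean(axis=0)
        return labels, centroids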

15
Initialization
  • Initialize:
  • Form the transitive closure of the must-link constraints
  • This yields a set of connected components, i.e., neighborhoods of points connected by must-link constraints
  • Let y be the number of connected components
  • If y < K (the number of clusters), the y connected neighborhoods are used to create y initial clusters
  • The remaining clusters are initialized by random perturbations of the global centroid of the data (see the sketch after this list)
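
A minimal sketch of the neighborhood computation; union-find is my choice of data structure here, the paper only specifies taking the transitive closure.

    def must_link_neighborhoods(n_points, M):
        """Connected components of the must-link graph (transitive closure)."""
        parent = list(range(n_points))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i

        for a, b in M:                   # union the endpoints of each must-link
            parent[find(a)] = find(b)

        groups = {}
        for i in range(n_points):
            groups.setdefault(find(i), []).append(i)
        return list(groups.values())

    # Example: must-links 0-1 and 1-2 chain into a single neighborhood.
    print(must_link_neighborhoods(5, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3], [4]]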

16
What If More Neighborhoods Than Clusters?
  • If y > K (the number of clusters), K initial clusters are selected using the distance measure
  • Farthest-first traversal is a good heuristic
  • A weighted variant of farthest-first traversal is used
  • The distance between two centroids is multiplied by their corresponding weights
  • The weight of each centroid is proportional to the size of its corresponding neighborhood
  • This biases selection toward centroids that are relatively far apart and come from large neighborhoods (sketched below)
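
A sketch of the weighted variant under the weighting described above (pairwise distance scaled by both centroid weights); the starting choice and tie-breaking are my assumptions.

    import numpy as np

    def weighted_farthest_first(centroids, weights, K):
        """Pick K seeds, preferring centroids that are far apart and heavy."""
        centroids = np.asarray(centroids, dtype=float)
        weights = np.asarray(weights, dtype=float)
        chosen = [int(np.argmax(weights))]   # start from the heaviest centroid
        while len(chosen) < K:
            best, best_score = -1, -1.0
            for i in range(len(centroids)):
                if i in chosen:
                    continue
                # Weighted distance to the nearest already-chosen seed.
                score = min(np.linalg.norm(centroids[i] - centroids[j])
                            * weights[i] * weights[j] for j in chosen)
                if score > best_score:
                    best, best_score = i, score
            chosen.append(best)
        return chosen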

17
Initialization Continued
  • Assuming consistency of the data:
  • Augment the set M with must-link constraints inferred from the transitive closure
  • For each pair of neighborhoods Np, Nq that have at least one cannot-link constraint between them, add cannot-link constraints between every member of Np and every member of Nq
  • The aim is to learn as much about the data from the constraints as possible (a code sketch follows the diagrams on the next two slides)

18
Augment set M
a
b
c
Must-link
Must-link
Inferred Must-link
19
Augment Set C
  • [Diagram: points a, b, c with must-link a-b and cannot-link b-c; a cannot-link a-c is inferred]
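
The same inference in code form, as a sketch (the function name and set representation are mine):

    def augment_cannot_links(neighborhoods, C):
        """Expand C: one cross cannot-link makes all cross pairs cannot-links."""
        inferred = set(C)
        for p in range(len(neighborhoods)):
            for q in range(p + 1, len(neighborhoods)):
                Np, Nq = neighborhoods[p], neighborhoods[q]
                crosses = any((a in Np and b in Nq) or (a in Nq and b in Np)
                              for (a, b) in C)
                if crosses:
                    inferred.update((a, b) for a in Np for b in Nq)
        return inferred

    # Matching the diagrams: must-link a-b (points 0, 1) and cannot-link b-c
    # (points 1, 2) yield an inferred cannot-link a-c (points 0, 2).
    print(augment_cannot_links([[0, 1], [2]], {(1, 2)}))  # contains (0, 2) and (1, 2)
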
20
E-step
  • Assigns data points to clusters
  • Because the model incorporates interactions between points, computing the assignment that minimizes the objective function is computationally intractable
  • Approximate inference options include iterated conditional modes (ICM), belief propagation, and linear programming relaxation
  • ICM uses a greedy strategy that sequentially updates the cluster assignment of each point while keeping all other points fixed (sketched below)
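
A runnable sketch of one ICM sweep; for simplicity a uniform penalty w stands in for the paper's distance-dependent violation costs.

    import numpy as np

    def icm_sweep(X, centroids, labels, M, C, w=1.0):
        """Greedily update each point's label, holding all others fixed."""
        K = len(centroids)
        for i in range(len(X)):
            costs = ((centroids - X[i]) ** 2).sum(axis=1)   # distortion term
            for (a, b) in M:                                # must-link penalties
                if i in (a, b):
                    j = b if i == a else a
                    costs[np.arange(K) != labels[j]] += w   # cost unless co-clustered
            for (a, b) in C:                                # cannot-link penalties
                if i in (a, b):
                    j = b if i == a else a
                    costs[labels[j]] += w                   # cost if co-clustered
            labels[i] = int(costs.argmin())
        return labels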

21
M-step
  • First, cluster representatives are re-estimated to decrease the objective function
  • Constraints do not factor into this step, so it is equivalent to the K-Means update
  • If a parameterized variant of the distance measure is used, it is updated here
  • The parameters can be found through partial derivatives of the distance function
  • The learning step modifies the distortion measure so that similar points are brought closer together while dissimilar points are pulled apart (a worked instance follows)
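
As one concrete instance (my notation and my choice of parameterization; the slide only says a parameterized distance may be used), a squared Euclidean distance with a diagonal parameter matrix A can be adapted by gradient steps on the objective:

    d_A(x, y) \;=\; (x - y)^{\top} A \, (x - y), \qquad A = \mathrm{diag}(a_1, \dots, a_d), \; a_m \ge 0
    a_m \;\leftarrow\; a_m \;-\; \eta \, \frac{\partial J_{\mathrm{obj}}}{\partial a_m}

Increasing a_m stretches dimension m, so such updates shrink dimensions along which must-linked points differ and stretch dimensions along which cannot-linked points differ.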

22
Results
  • KMeans-I-C-D
  • The complete HMRF-KMeans algorithm, with supervised data used in initialization (I) and cluster assignment (C), plus distance learning (D)
  • KMeans-I-C
  • The HMRF-KMeans algorithm without distance learning
  • KMeans-I
  • The HMRF-KMeans algorithm without distance learning and without supervised cluster assignment

23-25
Results
  • [Result charts comparing the three variants omitted from the transcript]
26
Conclusion
  • HMRF-KMeans performs well (compared to naïve K-Means) with a limited number of constraints
  • The goal of the algorithm was to provide a better clustering method using a limited number of constraints
  • HMRF-KMeans learns quickly from a limited number of constraints
  • It should be applicable to datasets where we want to limit the amount of human labeling and where supervision can be specified as pairwise labels

27
Questions
  • Can all types of constraints be captured by pairwise associations?
  • What about hierarchical structure?
  • Could other types of labels be included in this model?
  • For example, class labels as well as pairwise constraints
  • How does the model handle noise in the data or labels?
  • For example: point A has a must-link constraint to point B, point B has a must-link constraint to point C, yet point A has a cannot-link constraint to point C

28
More Questions
  • How does this apply to other types of data?
  • The authors mention wanting to apply the method to other types of data in the future, such as gene representations
  • Who provides the weights for constraint violations, and how are those weights determined?
  • The method is only compared with the naïve K-Means baseline
  • How does it compare with other semi-supervised clustering methods?

29
Reference
  • S. Basu, M. Bilenko, and R. J. Mooney, "A Probabilistic Framework for Semi-Supervised Clustering," Proc. 10th ACM SIGKDD Int'l Conf. on Knowledge Discovery and Data Mining (KDD), Aug. 2004.

30
  • Thank you!