Transcript and Presenter's Notes

Title: Semi-Supervised Clustering


1
Clustering I
Data Mining, Soongsil University
2
What is clustering?
3
What is a natural grouping among these objects?
4
What is a natural grouping among these objects?
Clustering is subjective
5
What is Similarity?
The quality or state of being similar; likeness; resemblance; as, a similarity of features.
Similarity is hard to define, but "we know it when we see it."
The real meaning of similarity is a philosophical question; we will take a more pragmatic approach.
6
Defining Distance Measures
Definition Let O1 and O2 be two objects from
the universe of possible objects. The distance
(dissimilarity) between O1 and O2 is a real
number denoted by D(O1,O2)
7
Unsupervised learning: Clustering
Black Box
8
2-dimensional clustering, showing three data
clusters
9
What is Cluster Analysis?
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

10
What Is A Good Clustering?
  • High intra-class similarity and low inter-class similarity
  • The answer depends on the similarity measure used
  • The ability to discover some or all of the hidden patterns

11
Requirements of Clustering
  • Scalability
  • Ability to deal with various types of attributes
  • Discovery of clusters with arbitrary shape
  • Minimal requirements for domain knowledge to
    determine input parameters

12
Requirements of Clustering
  • Able to deal with noise and outliers
  • Insensitive to order of input records
  • High dimensionality
  • Incorporation of user-specified constraints
  • Interpretability and usability

13
  • A technique demanded by many real-world tasks
  • Biology: taxonomy of living things such as kingdom, phylum, class, order, family, genus and species
  • Information retrieval: document/multimedia data clustering
  • Land use: identification of areas of similar land use in an earth observation database
  • Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
  • City planning: identify groups of houses according to their house type, value, and geographical location
  • Earthquake studies: observed earthquake epicenters should be clustered along continent faults
  • Climate: understand Earth's climate; find patterns in atmospheric and ocean data
  • Social network mining: special-interest group discovery

14
(No Transcript)
15
Data Matrix
  • For memory-based clustering
  • Also called object-by-variable structure
  • Represents n objects with p variables
    (attributes, measures)
  • A relational table

16
Dissimilarity Matrix
  • For memory-based clustering
  • Also called object-by-object structure
  • Proximities of pairs of objects
  • d(i, j): the dissimilarity between objects i and j
  • Nonnegative
  • Close to 0: highly similar (a small sketch of computing such a matrix follows below)

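The sketch below is an illustration in Python of how such an object-by-object matrix can be built from an n x p data matrix, assuming numeric attributes and plain Euclidean distance (neither of which is required by the definition):

    import numpy as np

    def dissimilarity_matrix(X):
        """n x n matrix of pairwise Euclidean distances for an n x p data matrix X.

        The result is symmetric, nonnegative, and zero on the diagonal;
        entries close to 0 mean the two objects are very similar.
        """
        diff = X[:, None, :] - X[None, :, :]      # shape (n, n, p)
        return np.sqrt((diff ** 2).sum(axis=-1))

    # Four objects described by two attributes
    X = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 11.0]])
    D = dissimilarity_matrix(X)                    # D[i, j] = d(i, j)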
17
How Good Is A Clustering?
  • Dissimilarity/similarity depends on distance
    function
  • Different applications have different functions
  • Judgment of clustering quality is typically
    highly subjective

18
Types of Attributes
  • There are different types of attributes
  • Nominal
  • Examples: ID numbers, eye color, zip codes
  • Ordinal
  • Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}
  • Interval
  • Examples: calendar dates, temperatures in Celsius or Fahrenheit
  • Ratio
  • Examples: length, time, counts

19
Types of Data in Clustering
  • Interval-scaled variables
  • Binary variables
  • Nominal, ordinal, and ratio variables
  • Variables of mixed types

20
Similarity and Dissimilarity Between Objects
  • Distances are the most commonly used measures
  • Minkowski distance: a generalization of common distance measures
  • If q = 2, d is the Euclidean distance
  • If q = 1, d is the Manhattan distance
  • Weighted distance: attributes may carry different weights (see the sketch after this list)

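Concretely, the Minkowski distance between objects i and j with attribute vectors x_i and x_j is d(i, j) = (sum_k w_k |x_ik - x_jk|^q)^(1/q), with all weights w_k = 1 in the unweighted case. A minimal Python sketch:

    import numpy as np

    def minkowski(x, y, q=2, w=None):
        """Weighted Minkowski distance; q = 1 is Manhattan, q = 2 is Euclidean."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        w = np.ones_like(x) if w is None else np.asarray(w, dtype=float)
        return float((w * np.abs(x - y) ** q).sum() ** (1.0 / q))

    print(minkowski([0, 0], [3, 4], q=1))   # 7.0 (Manhattan)
    print(minkowski([0, 0], [3, 4], q=2))   # 5.0 (Euclidean)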
21
Properties of Minkowski Distance
  • Nonnegative: d(i, j) ≥ 0
  • The distance of an object to itself is 0:
  • d(i, i) = 0
  • Symmetric: d(i, j) = d(j, i)
  • Triangle inequality:
  • d(i, j) ≤ d(i, k) + d(k, j)

22
Categories of Clustering Approaches (1)
  • Partitioning algorithms
  • Partition the objects into k clusters
  • Iteratively reallocate objects to improve the
    clustering
  • Hierarchy algorithms
  • Agglomerative: each object starts as its own cluster; merge clusters to form larger ones
  • Divisive: all objects start in one cluster; split it up into smaller clusters

23
Partitional Clustering
Original Points
24
Hierarchical Clustering
Traditional Hierarchical Clustering
Traditional Dendrogram
Non-traditional Hierarchical Clustering
Non-traditional Dendrogram
25
Categories of Clustering Approaches (2)
  • Density-based methods
  • Based on connectivity and density functions
  • Filter out noise, find clusters of arbitrary
    shape
  • Grid-based methods
  • Quantize the object space into a grid structure
  • Model-based
  • Use a model to find the best fit of data

26
Partitioning Algorithms: Basic Concepts
  • Partition n objects into k clusters
  • Optimize the chosen partitioning criterion
  • Global optimum: examine all partitions
  • (k^n - (k-1)^n - ... - 1) possible partitions; too expensive!
  • Heuristic methods: k-means and k-medoids
  • K-means: a cluster is represented by its center
  • K-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster

27
Overview of K-Means Clustering
  • K-Means is a partitional clustering algorithm
    based on iterative relocation that partitions a
    dataset into K clusters.
  • Algorithm:
  • Initialize K cluster centers randomly. Repeat until convergence:
  • Cluster assignment step: assign each data point x to the cluster X_l such that the L2 distance of x from μ_l (the center of X_l) is minimum
  • Center re-estimation step: re-estimate each cluster center μ_l as the mean of the points in that cluster (a compact sketch of this loop follows below)

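A compact sketch of the relocation loop just described (random initialization, L2 assignment, mean re-estimation), written in Python and assuming the data points are the rows of a NumPy array:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        """Plain K-Means; returns (centers, labels)."""
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]     # random init
        for _ in range(max_iter):
            # Cluster assignment step: nearest center in L2 distance
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
            labels = dists.argmin(axis=1)
            # Center re-estimation step: mean of the points in each cluster
            new_centers = np.array([X[labels == j].mean(axis=0)
                                    if np.any(labels == j) else centers[j]
                                    for j in range(k)])
            if np.allclose(new_centers, centers):                  # converged
                break
            centers = new_centers
        return centers, labels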
28
K-Means Objective Function
  • Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers (written out below)
  • Initialization of the K cluster centers:
  • Totally random
  • Random perturbation from the global mean
  • Heuristics to ensure well-separated centers

Source: J. Ye, 2006
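In symbols, with mu_l denoting the center of cluster X_l as in the assignment step above, the objective that is locally minimized can be written (in LaTeX notation) as

    J = \sum_{l=1}^{K} \sum_{x \in X_l} \lVert x - \mu_l \rVert_2^{2}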
29
K Means Example
30
K Means Example: Randomly Initialize Means
(Figure: cluster centers, marked x, placed at random)
31
Semi-Supervised Clustering Example
(Figure: example data points)
32
Semi-Supervised Clustering Example
(Figure: example data points)
33
Second Semi-Supervised Clustering Example
(Figure: example data points)
34
Second Semi-Supervised Clustering Example
(Figure: example data points)
35
Pros and Cons of K-means
  • Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n
  • Often terminates at a local optimum
  • Applicable only when the mean is defined
  • What about categorical data?
  • Need to specify k, the number of clusters, in advance
  • Unable to handle noisy data and outliers
  • Unsuitable for discovering clusters with non-convex shapes

36
Variations of the K-means
  • Aspects of variations
  • Selection of the initial k means
  • Dissimilarity calculations
  • Strategies to calculate cluster means
  • Handling categorical data: k-modes
  • Use the mode instead of the mean
  • Mode: the most frequent item(s)
  • A mixture of categorical and numerical data: the k-prototype method

37
Categorical Values
  • Handling categorical data: k-modes (Huang, 1998)
  • Replace the means of clusters with modes
  • Mode of an attribute: its most frequent value
  • Mode of a set of instances: for each attribute, the most frequent value
  • K-modes plays the same role for categorical data that K-means plays for numeric data
  • Use a frequency-based method to update the modes of clusters (a small sketch follows below)
  • A mixture of categorical and numerical data: the k-prototype method

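A small sketch of the frequency-based mode update mentioned above, in Python, with objects represented as tuples of categorical attribute values:

    from collections import Counter

    def cluster_mode(members):
        """Per-attribute most frequent value: the categorical analogue of the mean."""
        n_attrs = len(members[0])
        return tuple(Counter(obj[a] for obj in members).most_common(1)[0][0]
                     for a in range(n_attrs))

    cluster = [("red", "small"), ("red", "large"), ("blue", "small")]
    print(cluster_mode(cluster))   # ('red', 'small')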
38
A Problem of K-means

  • Sensitive to outliers
  • Outlier: an object with extremely large (or otherwise extreme) attribute values
  • Outliers may substantially distort the distribution of the data
  • K-medoids: represent each cluster by its most centrally located object instead of the mean

39
PAM A K-medoids Method
  • PAM: Partitioning Around Medoids
  • Arbitrarily choose k objects as the initial medoids
  • Repeat until no change:
  • (Re)assign each object to the cluster of its nearest medoid
  • Randomly select a non-medoid object o' and compute the total cost, S, of swapping a medoid o with o'
  • If S < 0, swap o with o' to form the new set of k medoids (a sketch of this swap test follows below)

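A sketch of the swap test in this loop, in Python; d is any distance function, for example d = lambda a, b: abs(a - b) for one-dimensional data like the example on the next slides:

    import random

    def total_cost(objects, medoids, d):
        """Sum of distances from every non-medoid object to its nearest medoid."""
        return sum(min(d(o, m) for m in medoids)
                   for o in objects if o not in medoids)

    def pam_swap_step(objects, medoids, d):
        """Pick a random non-medoid o_new and a medoid o_old; keep the swap
        only if it lowers the total cost (S < 0)."""
        o_new = random.choice([o for o in objects if o not in medoids])
        o_old = random.choice(list(medoids))
        candidate = (set(medoids) - {o_old}) | {o_new}
        S = total_cost(objects, candidate, d) - total_cost(objects, medoids, d)
        return candidate if S < 0 else set(medoids)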
40
K-Medoids example
  • Data: 1, 2, 6, 7, 8, 10, 15, 17, 20; break into 3 clusters (initial medoids 6, 7, 8)
  • Cluster 6: 1, 2
  • Cluster 7: (no other members)
  • Cluster 8: 10, 15, 17, 20
  • Random non-medoid 15 replaces 7 (total cost = -13)
  • Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 1 - 0 = 1)
  • Cluster 8: 10 (cost 0)
  • New cluster 15: 17 (cost 2 - 9 = -7), 20 (cost 5 - 12 = -7)
  • Replace medoid 7 with the new medoid 15 and reassign
  • Cluster 6: 1, 2, 7
  • Cluster 8: 10
  • Cluster 15: 17, 20

41
K-Medoids example (continued)
  • Random non-medoid 1 replaces 6 (total cost = 2)
  • Cluster 8: 7 (cost 6 - 1 = 5), 10 (cost 0)
  • Cluster 15: 17 (cost 0), 20 (cost 0)
  • New cluster 1: 2 (cost 1 - 4 = -3)
  • Random non-medoid 2 replaces 6 (total cost = 1)
  • Don't replace medoid 6
  • Cluster 6: 1, 2, 7
  • Cluster 8: 10
  • Cluster 15: 17, 20
  • Random non-medoid 7 replaces 6 (total cost = 2)
  • Cluster 8: 10 (cost 0)
  • Cluster 15: 17 (cost 0), 20 (cost 0)
  • New cluster 7: 6 (cost 1 - 0 = 1), 2 (cost 5 - 4 = 1)

42
K-Medoids example (continued)
  • Don't replace medoid 6
  • Cluster 6: 1, 2, 7
  • Cluster 8: 10
  • Cluster 15: 17, 20
  • Random non-medoid 10 replaces 8 (total cost = 2): don't replace
  • Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
  • Cluster 15: 17 (cost 0), 20 (cost 0)
  • New cluster 10: 8 (cost 2 - 0 = 2)
  • Random non-medoid 17 replaces 15 (total cost = 0): don't replace
  • Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
  • Cluster 8: 10 (cost 0)
  • New cluster 17: 15 (cost 2 - 0 = 2), 20 (cost 3 - 5 = -2)

43
K-Medoids example (continued)
  • Random non-medoid 20 replaces 15 (total cost = 6): don't replace
  • Cluster 6: 1 (cost 0), 2 (cost 0), 7 (cost 0)
  • Cluster 8: 10 (cost 0)
  • New cluster 20: 15 (cost 5 - 0 = 5), 17 (cost 3 - 2 = 1)
  • All other possible swaps also have positive costs
  • 1 replaces 15, 2 replaces 15, 1 replaces 8, ...
  • No more changes; the final clusters are:
  • Cluster 6: 1, 2, 7
  • Cluster 8: 10
  • Cluster 15: 17, 20

44
Semi-Supervised Clustering

45
Outline
  • Overview of clustering and classification
  • What is semi-supervised learning?
  • Semi-supervised clustering
  • Semi-supervised classification
  • Semi-supervised clustering
  • What is semi-supervised clustering?
  • Why semi-supervised clustering?
  • Semi-supervised clustering algorithms

Source: J. Ye, 2006
46
Supervised classification versus unsupervised
clustering
  • Unsupervised clustering Group similar objects
    together to find clusters
  • Minimize intra-class distance
  • Maximize inter-class distance
  • Supervised classification Class label for each
    training sample is given
  • Build a model from the training data
  • Predict class label on unseen future data points

Source: J. Ye, 2006
47
What is clustering?
  • Finding groups of objects such that the objects
    in a group will be similar (or related) to one
    another and different from (or unrelated to) the
    objects in other groups

Source: J. Ye, 2006
48
What is Classification?
Source: J. Ye, 2006
49
Clustering algorithms
  • K-Means
  • Hierarchical clustering
  • Graph based clustering (Spectral clustering)
  • Bi-clustering

Source: J. Ye, 2006
50
Classification algorithms
  • K-Nearest-Neighbor classifiers
  • Naïve Bayes classifier
  • Linear Discriminant Analysis (LDA)
  • Support Vector Machines (SVM)
  • Logistic Regression
  • Neural Networks

Source: J. Ye, 2006
51
Supervised Classification Example
(Figure: example data points)
52
Supervised Classification Example
(Figure: example data points)
53
Supervised Classification Example
(Figure: example data points)
54
Unsupervised Clustering Example
(Figure: example data points)
55
Unsupervised Clustering Example
(Figure: example data points)
56
Semi-Supervised Learning
  • Combines labeled and unlabeled data during
    training to improve performance
  • Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  • Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

57
Semi-Supervised Classification Example
(Figure: example data points)
58
Semi-Supervised Classification Example
(Figure: example data points)
59
Semi-Supervised Classification
  • Algorithms
  • Semi-supervised EM [Ghahramani, NIPS'94; Nigam et al., ML'00]
  • Co-training [Blum & Mitchell, COLT'98]
  • Transductive SVMs [Vapnik, 1998; Joachims, ICML'99]
  • Graph-based algorithms
  • Assumptions
  • A known, fixed set of categories is given in the labeled data.
  • The goal is to improve classification of examples into these known categories.

60
Semi-supervised clustering: problem definition
  • Input
  • A set of unlabeled objects, each described by a
    set of attributes (numeric and/or categorical)
  • A small amount of domain knowledge
  • Output
  • A partitioning of the objects into k clusters
    (possibly with some discarded as outliers)
  • Objective
  • Maximum intra-cluster similarity
  • Minimum inter-cluster similarity
  • High consistency between the partitioning and the
    domain knowledge

61
Why semi-supervised clustering?
  • Why not clustering?
  • The clusters produced may not be the ones
    required.
  • Sometimes there are multiple possible groupings.
  • Why not classification?
  • Sometimes there are insufficient labeled data.
  • Potential applications
  • Bioinformatics (gene and protein clustering)
  • Document hierarchy construction
  • News/email categorization
  • Image categorization

62
Semi-Supervised Clustering
  • Domain knowledge
  • Partial label information is given
  • Apply some constraints (must-links and
    cannot-links)
  • Approaches
  • Search-based Semi-Supervised Clustering
  • Alter the clustering algorithm using the
    constraints
  • Similarity-based Semi-Supervised Clustering
  • Alter the similarity measure based on the
    constraints
  • Combination of both

63
Search-Based Semi-Supervised Clustering
  • Alter the clustering algorithm that searches for a good partitioning by:
  • Modifying the objective function to give a reward for obeying labels on the supervised data [Demiriz, ANNIE'99]
  • Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff, ICML'00; Wagstaff, ICML'01]
  • Using the labeled data to initialize clusters in an iterative refinement algorithm (k-Means, ...) [Basu, ICML'02]

Source: J. Ye, 2006
64
(No Transcript)
65
(No Transcript)
66
K Means Example: Assign Points to Clusters
(Figure: each point assigned to its nearest center, centers marked x)
67
K Means Example: Re-estimate Means
(Figure: centers, marked x, moved to the mean of their assigned points)
68
K Means Example: Re-assign Points to Clusters
(Figure: points re-assigned to the nearest updated center)
69
K Means Example: Re-estimate Means
(Figure: centers updated again)
70
K Means Example: Re-assign Points to Clusters
(Figure: points re-assigned once more)
71
K Means Example: Re-estimate Means and Converge
(Figure: centers stop moving; the algorithm has converged)
72
Semi-Supervised K-Means
  • Partial label information is given
  • Seeded K-Means
  • Constrained K-Means
  • Constraints (Must-link, Cannot-link)
  • COP K-Means

73
Semi-Supervised K-Means for partially labeled data
  • Seeded K-Means
  • Labeled data provided by the user are used for initialization: the initial center of cluster i is the mean of the seed points having label i.
  • Seed points are used only for initialization, not in the subsequent steps.
  • Constrained K-Means
  • Labeled data provided by the user are used to initialize the K-Means algorithm.
  • Cluster labels of the seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated. (A sketch of both variants follows below.)

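A sketch of the shared initialization and the differing assignment step, in Python; seed_idx holds the indices of the labeled points in X and seed_labels their cluster labels 0..k-1:

    import numpy as np

    def seed_centers(X, seed_idx, seed_labels, k):
        """Initial center of cluster i = mean of the seed points labeled i."""
        seed_idx, seed_labels = np.asarray(seed_idx), np.asarray(seed_labels)
        return np.array([X[seed_idx[seed_labels == i]].mean(axis=0)
                         for i in range(k)])

    def assign(X, centers, seed_idx=None, seed_labels=None):
        """Nearest-center assignment; if seeds are passed (Constrained K-Means),
        their labels are clamped instead of being re-estimated."""
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        if seed_idx is not None:
            labels[np.asarray(seed_idx)] = np.asarray(seed_labels)
        return labels

Seeded K-Means calls seed_centers() once and then iterates with assign(X, centers); Constrained K-Means passes the seeds to assign() in every iteration so that their labels never change.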
74
Seeded K-Means
Use the labeled data to find the initial centroids and then run K-Means. The labels of the seed points may change.
Source: J. Ye, 2006
75
Seeded K-Means Example
76
Seeded K-Means Example: Initialize Means Using Labeled Data
(Figure: centers, marked x, placed at the means of the labeled seed points)
77
Seeded K-Means Example: Assign Points to Clusters
(Figure: all points assigned to their nearest center)
78
Seeded K-Means Example: Re-estimate Means
(Figure: centers moved to the means of their clusters)
79
Seeded K-Means Example: Assign Points to Clusters and Converge
(Figure: final assignment; the label of one seed point has changed)
80
Constrained K-Means
Use the labeled data to find the initial centroids and then run K-Means. The labels of the seed points will not change.
Source: J. Ye, 2006
81
Constrained K-Means Example
82
Constrained K-Means Example: Initialize Means Using Labeled Data
(Figure: centers, marked x, placed at the means of the labeled seed points)
83
Constrained K-Means Example: Assign Points to Clusters
(Figure: non-seed points assigned to their nearest center; seed labels kept fixed)
84
Constrained K-Means Example: Re-estimate Means and Converge
85
COP K-Means
  • COP K-Means [Wagstaff et al., ICML'01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
  • Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster).
  • Algorithm: during the cluster assignment step of COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort. (A sketch of this assignment step follows below.)

Source: J. Ye, 2006
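A sketch of this constrained assignment step, in Python; must_link and cannot_link are lists of index pairs, and each constraint is checked against the points assigned earlier in the same pass:

    import numpy as np

    def violates(i, c, labels, must_link, cannot_link):
        """Would assigning point i to cluster c break a constraint with an
        already-assigned point? (label -1 means not yet assigned)"""
        for a, b in must_link:
            if i in (a, b):
                other = b if i == a else a
                if labels[other] != -1 and labels[other] != c:
                    return True
        for a, b in cannot_link:
            if i in (a, b):
                other = b if i == a else a
                if labels[other] == c:
                    return True
        return False

    def cop_assign(X, centers, must_link, cannot_link):
        labels = np.full(len(X), -1)
        for i in range(len(X)):
            for c in np.argsort(np.linalg.norm(centers - X[i], axis=1)):
                if not violates(i, c, labels, must_link, cannot_link):
                    labels[i] = c          # nearest cluster that breaks nothing
                    break
            else:
                raise RuntimeError("No feasible assignment exists: abort")
        return labels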
86
COP K-Means Algorithm
87
Illustration
Determine its label
Must-link
(Figure: cluster centers, marked x)
Assign to the red class
88
Illustration
Determine its label
(Figure: cluster centers, marked x)
Cannot-link
Assign to the red class
89
Illustration
Determine its label
Must-link
(Figure: cluster centers, marked x)
Cannot-link
The clustering algorithm fails
90
Summary
  • Seeded and Constrained K-Means: partially labeled data
  • COP K-Means: constraints (must-link and cannot-link)
  • Constrained K-Means and COP K-Means require all the constraints to be satisfied.
  • They may not be effective if the seeds contain noise.
  • Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
  • It is less sensitive to noise in the seeds.
  • Experiments show that semi-supervised K-Means outperforms traditional K-Means.

91
References
  • Ye, Jieping: Introduction to Data Mining, Department of Computer Science and Engineering, Arizona State University, 2006.
  • Clifton, Chris: Introduction to Data Mining, Purdue University, 2006.
  • Zhu, Xingquan; Davidson, Ian: Knowledge Discovery and Data Mining, 2007.