1
Semi-supervised Learning
  • COMP 790-90 Seminar
  • Spring 2009

2
Overview
  • Semi-supervised learning
  • Semi-supervised classification
  • Semi-supervised clustering
  • Search-based methods
  • COP K-Means
  • Seeded K-Means
  • Constrained K-Means
  • Similarity-based methods

3
Supervised Classification Example
(figure: scatter plot of data points)
4
Supervised Classification Example
(figure: scatter plot of data points)
5
Supervised Classification Example
(figure: scatter plot of data points)
6
Unsupervised Clustering Example
(figure: scatter plot of data points)
7
Unsupervised Clustering Example
(figure: scatter plot of data points)
8
Semi-Supervised Learning
  • Combines labeled and unlabeled data during
    training to improve performance.
  • Semi-supervised classification: training on
    labeled data exploits additional unlabeled data,
    frequently resulting in a more accurate
    classifier.
  • Semi-supervised clustering: uses a small amount
    of labeled data to aid and bias the clustering
    of unlabeled data.

9
Semi-Supervised Classification Example
(figure: scatter plot of data points)
10
Semi-Supervised Classification Example
(figure: scatter plot of data points)
11
Semi-Supervised Classification
  • Algorithms
  • Semi-supervised EM (Ghahramani, NIPS94; Nigam, ML00)
  • Co-training (Blum, COLT98)
  • Transductive SVMs (Vapnik, 1998; Joachims, ICML99)
  • Assumptions
  • Known, fixed set of categories given in the
    labeled data.
  • Goal is to improve classification of examples
    into these known categories.

12
Semi-Supervised Clustering Example
(figure: scatter plot of data points)
13
Semi-Supervised Clustering Example
(figure: scatter plot of data points)
14
Second Semi-Supervised Clustering Example
(figure: scatter plot of data points)
15
Second Semi-Supervised Clustering Example
(figure: scatter plot of data points)
16
Semi-Supervised Clustering
  • Can group data using the categories in the
    initial labeled data.
  • Can also extend and modify the existing set of
    categories as needed to reflect other
    regularities in the data.
  • Can cluster a disjoint set of unlabeled data
    using the labeled data as a guide to the type
    of clusters desired.

17
Problem definition
  • Input
  • A set of unlabeled objects
  • Some domain knowledge
  • Output
  • A partitioning of the objects into clusters
  • Objective
  • Maximum intra-cluster similarity
  • Minimum inter-cluster similarity
  • High consistency between the partitioning and the
    domain knowledge

18
What is Domain Knowledge?
  • Must-link and cannot-link
  • Class labels
  • Ontology

19
Why semi-supervised clustering?
  • Why not clustering?
  • Clustering alone cannot incorporate prior
    knowledge into the clustering process.
  • Why not classification?
  • Sometimes there is insufficient labeled data.
  • Potential applications
  • Bioinformatics (gene and protein clustering)
  • Document hierarchy construction
  • News/email categorization
  • Image categorization

20
Semi-Supervised Clustering
  • Approaches
  • Search-based Semi-Supervised Clustering
  • Alter the clustering algorithm using the
    constraints
  • Similarity-based Semi-Supervised Clustering
  • Alter the similarity measure based on the
    constraints
  • Combination of both

21
Search-Based Semi-Supervised Clustering
  • Alter the clustering algorithm that searches for
    a good partitioning by:
  • Modifying the objective function to give a reward
    for obeying labels on the supervised data
    (Demiriz, ANNIE99).
  • Enforcing constraints (must-link, cannot-link) on
    the labeled data during clustering
    (Wagstaff, ICML00; Wagstaff, ICML01).
  • Using the labeled data to initialize clusters in
    an iterative refinement algorithm (KMeans, EM)
    (Basu, ICML02).

22
Unsupervised KMeans Clustering
  • KMeans iteratively partitions a dataset into K
    clusters.
  • Algorithm:
  • Initialize K cluster centers randomly.
    Repeat until convergence:
  • Cluster Assignment Step: assign each data point x
    to the cluster X_l whose center is nearest to x
    in L2 distance.
  • Center Re-estimation Step: re-estimate each
    cluster center as the mean of the points in
    that cluster.
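
For concreteness, here is a minimal NumPy sketch of
this loop (our own illustration, not code from the
slides; the empty-cluster guard is an added
assumption):

  import numpy as np

  def kmeans(X, k, n_iter=100, seed=0):
      rng = np.random.default_rng(seed)
      # Initialize K cluster centers randomly (here: random data points).
      centers = X[rng.choice(len(X), size=k, replace=False)]
      for _ in range(n_iter):
          # Cluster assignment step: nearest center in L2 distance.
          dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
          labels = dists.argmin(axis=1)
          # Center re-estimation step: mean of the points in each cluster
          # (keep the old center if a cluster happens to be empty).
          new_centers = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centers[j]
                                  for j in range(k)])
          if np.allclose(new_centers, centers):
              break  # converged
          centers = new_centers
      return labels, centers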

23
KMeans Objective Function
  • Locally minimizes the sum of squared distances
    between the data points and their corresponding
    cluster centers.
  • Initialization of K cluster centers
  • Totally random
  • Random perturbation from global mean
  • Heuristic to ensure well-separated centers etc.
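
The formula image on this slide is not preserved in
the transcript; the objective it refers to is the
standard KMeans sum-of-squares,

  J = \sum_{l=1}^{K} \sum_{x_i \in X_l} \lVert x_i - \mu_l \rVert^2

where \mu_l is the center of cluster X_l.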

24
K Means Example
25
K Means Example: Randomly Initialize Means
(figure: two randomly placed cluster centers marked x)
26
K Means Example: Assign Points to Clusters
(figure: points assigned to the two centers marked x)
27
K Means Example: Re-estimate Means
(figure: updated cluster centers marked x)
28
K Means Example: Re-assign Points to Clusters
(figure: points re-assigned to the two centers marked x)
29
K Means Example: Re-estimate Means
(figure: updated cluster centers marked x)
30
K Means Example: Re-assign Points to Clusters
(figure: points re-assigned to the two centers marked x)
31
K Means Example: Re-estimate Means and Converge
(figure: final cluster centers marked x)
32
Semi-Supervised K-Means
  • Constraints (Must-link, Cannot-link)
  • COP K-Means
  • Partial label information is given
  • Seeded K-Means (Basu, ICML02)
  • Constrained K-Means

33
COP K-Means
  • COP K-Means is K-Means with must-link (must be in
    the same cluster) and cannot-link (cannot be in
    the same cluster) constraints on data points.
  • Initialization: cluster centers are chosen
    randomly, but so that no must-link constraints
    are violated.
  • Algorithm: during the cluster assignment step,
    each point is assigned to its nearest cluster
    such that none of its constraints are violated.
    If no such assignment exists, abort.
  • Based on Wagstaff et al., ICML01.

34
COP K-Means Algorithm
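The algorithm was shown as an image that is not
preserved in this transcript. As a stand-in, here is
a hedged Python sketch of the constrained assignment
step described on the previous slide (our own
illustration; function names are ours, not Wagstaff
et al.'s):

  import numpy as np

  def violates(i, cluster, labels, must_link, cannot_link):
      # Would assigning point i to `cluster` break a constraint,
      # given the partial assignment in `labels` (-1 = unassigned)?
      for a, b in must_link:
          j = b if a == i else (a if b == i else None)
          if j is not None and labels[j] not in (-1, cluster):
              return True
      for a, b in cannot_link:
          j = b if a == i else (a if b == i else None)
          if j is not None and labels[j] == cluster:
              return True
      return False

  def cop_assignment(X, centers, must_link, cannot_link):
      labels = np.full(len(X), -1)
      for i in range(len(X)):
          # Try clusters from nearest to farthest; take the first
          # assignment that violates no constraint.
          order = np.argsort(np.linalg.norm(centers - X[i], axis=1))
          for c in order:
              if not violates(i, c, labels, must_link, cannot_link):
                  labels[i] = c
                  break
          else:
              # No feasible cluster exists: COP-KMeans aborts.
              raise RuntimeError('constraints cannot be satisfied')
      return labels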
35
Illustration
Determine the label of a new point: a must-link
constraint ties it to the red class, so it is
assigned to the red class.
(figure: two cluster centers marked x)
36
Illustration
Determine the label of a new point: a cannot-link
constraint forces assignment to the red class.
(figure: two cluster centers marked x)
37
Illustration
Determine the label of a new point: its must-link
and cannot-link constraints conflict, so no feasible
assignment exists and the clustering algorithm fails.
(figure: two cluster centers marked x)
38
Evaluation
  • The Rand index measures the agreement between two
    partitions, P1 and P2, of the same data set D.
  • Each partition is viewed as a collection of
    n(n-1)/2 pairwise decisions, where n is the size
    of D.
  • a is the number of pairs of objects that P1 and
    P2 both place in the same cluster.
  • b is the number of pairs that are placed in
    different clusters in both partitions.
  • Total agreement can then be calculated as
    Rand(P1, P2) = (a + b) / (n(n-1)/2).
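
A direct Python sketch of this computation (our own
illustration):

  from itertools import combinations

  def rand_index(p1, p2):
      # p1, p2: cluster labels for the same n objects.
      n = len(p1)
      a = b = 0
      for i, j in combinations(range(n), 2):
          same1 = p1[i] == p1[j]
          same2 = p2[i] == p2[j]
          if same1 and same2:
              a += 1   # same cluster in both partitions
          elif not same1 and not same2:
              b += 1   # different clusters in both partitions
      return (a + b) / (n * (n - 1) / 2)

  # Identical groupings agree on every pairwise decision:
  print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0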

39
Evaluation
40
Semi-Supervised K-Means
  • Seeded K-Means
  • Labeled data provided by the user are used for
    initialization: the initial center for cluster i
    is the mean of the seed points having label i.
  • Seed points are used only for initialization,
    not in subsequent steps.
  • Constrained K-Means
  • Labeled data provided by the user are used to
    initialize the K-Means algorithm.
  • Cluster labels of seed data are kept unchanged in
    the cluster assignment steps; only the labels of
    the non-seed data are re-estimated.
  • Based on Basu et al., ICML02.
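
A minimal sketch of how the two variants differ (our
own illustration; it assumes NumPy index arrays and
labels 0..K-1, which are not spelled out on the
slide):

  import numpy as np

  def seeded_init(X, seed_idx, seed_labels, k):
      # Initial center for cluster i = mean of seed points with label i.
      return np.array([X[seed_idx[seed_labels == i]].mean(axis=0)
                       for i in range(k)])

  def assign(X, centers, seed_idx=None, seed_labels=None):
      dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
      labels = dists.argmin(axis=1)
      if seed_idx is not None:
          # Constrained K-Means only: clamp seed points to their labels.
          labels[seed_idx] = seed_labels
      return labels

Seeded K-Means calls seeded_init and then runs the
ordinary assignment; Constrained K-Means additionally
clamps the seed labels in every assignment step.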

41
Seeded K-Means
Use labeled data to find the initial centroids
and then run K-Means. The labels for seeded
points may change.
42
Seeded K-Means Example
43
Seeded K-Means Example: Initialize Means Using
Labeled Data
(figure: two cluster centers marked x)
44
Seeded K-Means Example: Assign Points to Clusters
(figure: two cluster centers marked x)
45
Seeded K-Means Example: Re-estimate Means
(figure: updated cluster centers marked x)
46
Seeded K-Means Example: Assign Points to Clusters
and Converge
(figure: two cluster centers marked x; the label of
one seeded point is changed)
47
Constrained K-Means
Use labeled data to find the initial centroids
and then run K-Means. The labels for seeded
points will not change.
48
Constrained K-Means Example
49
Constrained K-Means Example: Initialize Means Using
Labeled Data
(figure: two cluster centers marked x)
50
Constrained K-Means Example: Assign Points to
Clusters
(figure: two cluster centers marked x)
51
Constrained K-Means Example: Re-estimate Means and
Converge
52
Datasets
  • Data sets
  • UCI Iris (3 classes, 150 instances)
  • CMU 20 Newsgroups (20 classes, 20,000 instances)
  • Yahoo! News (20 classes, 2,340 instances)
  • Data subsets created for experiments
  • Small-20 newsgroup: random sample of 100
    documents from each newsgroup, created to study
    the effect of data size on the algorithms.
  • Different-3 newsgroup: 3 very different
    newsgroups (alt.atheism, rec.sport.baseball,
    sci.space), created to study the effect of data
    separability on the algorithms.
  • Same-3 newsgroup: 3 very similar newsgroups
    (comp.graphics, comp.os.ms-windows,
    comp.windows.x).

53
Evaluation
  • Mutual information
  • Objective function

54
Results MI and Seeding
  • Zero noise in seeds (Small-20 NewsGroup):
    semi-supervised KMeans is substantially better
    than unsupervised KMeans.

55
Results Objective function and Seeding
  • User labeling consistent with KMeans assumptions
    (Small-20 NewsGroup): the objective function of
    the data partition increases exponentially with
    the seed fraction.

56
Results Objective Function and Seeding
  • User labeling inconsistent with KMeans
    assumptions (Yahoo! News): the objective function
    of the constrained algorithms decreases with
    seeding.

57
Similarity Based Methods
  • Question: given a set of points and their class
    labels, can we learn a distance metric such that
    intra-cluster distances are minimized and
    inter-cluster distances are maximized?

58
Distance metric learning
Define a new distance measure of the form
d_A(x, y) = sqrt((x - y)^T A (x - y)),
which is equivalent to a linear transformation of
the original data (x -> A^(1/2) x).
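
A small sketch of this measure and the equivalent
transformation (our own illustration; A is assumed
positive semidefinite, as in the Xing et al. work
cited below):

  import numpy as np

  def d_A(x, y, A):
      # d_A(x, y) = sqrt((x - y)^T A (x - y))
      diff = x - y
      return float(np.sqrt(diff @ A @ diff))

  def transform(X, A):
      # Map x -> A^(1/2) x, so plain Euclidean distance equals d_A.
      vals, vecs = np.linalg.eigh(A)
      A_half = vecs @ np.diag(np.sqrt(np.clip(vals, 0, None))) @ vecs.T
      return X @ A_half.T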
59
Distance metric learning
60
Semi-Supervised Clustering Example: Similarity Based
61
Semi-Supervised Clustering Example: Distances
Transformed by Learned Metric
62
Semi-Supervised Clustering Example: Clustering
Result with Trained Metric
63
Evaluation
Source: E. Xing et al., Distance metric learning
64
Evaluation
Source: E. Xing et al., Distance metric learning
65
Additional Readings
  • Combining similarity- and search-based
    semi-supervised clustering: "Comparing and
    Unifying Search-Based and Similarity-Based
    Approaches to Semi-Supervised Clustering,"
    Basu et al.
  • Ontology-based semi-supervised clustering: "A
    framework for ontology-driven subspace
    clustering," Liu et al.

66
References
  • UT machine learning group
  • http://www.cs.utexas.edu/~ml/publication/unsupervised.html
  • Semi-supervised Clustering by Seeding
  • http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf
  • Constrained K-means clustering with background
    knowledge
  • http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf
  • Some slides are from Jieping Ye at Arizona State.