1
Active Semi-Supervision for Pairwise Constrained
Clustering
  • Presented by David Chen and Xin Li
  • Apr 30, 2007

2
Topics
  • Semi-supervised learning and clustering
  • Pairwise Constrained Clustering
  • Clustering Algorithm
  • Active Learning Algorithm
  • Experiments

3
Semi-supervised learning and clustering
  • A large supply of unlabeled data, but only
    limited labeled data
  • Semi-supervised learning: learning from a
    combination of both labeled and unlabeled data
  • Semi-supervised clustering: clustering large
    amounts of unlabeled data in the presence of a
    small amount of supervised data
  • Two categories of methods: constraint-based and
    distance-based

4
Pairwise Constrained Clustering (PCC)
  • Introduce 2 sets of pairwise constraints on the
    data
  • The set of must-link pairs M
  • (x_i, x_j) in M implies x_i and x_j should be
    assigned to the same cluster; violating this
    constraint incurs a cost w_ij
  • The set of cannot-link pairs C
  • (x_i, x_j) in C implies x_i and x_j should be
    assigned to different clusters; violating this
    constraint incurs a cost w_bar_ij
  • The cost function (l_i is the cluster of x_i,
    mu_h is the centroid of cluster h):
    J = sum_i ||x_i - mu_{l_i}||^2
        + sum_{(x_i, x_j) in M} w_ij * 1[l_i != l_j]
        + sum_{(x_i, x_j) in C} w_bar_ij * 1[l_i = l_j]
    (a small implementation sketch follows below)
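A minimal Python sketch of this cost function, assuming must_link and
cannot_link are lists of index pairs and w / w_bar hold the per-pair
violation costs (all names here are illustrative, not from the slides):

```python
import numpy as np

def pcc_objective(X, labels, centroids, must_link, cannot_link, w, w_bar):
    # Distortion term: squared distance of each point to its cluster centroid.
    cost = sum(np.sum((X[i] - centroids[labels[i]]) ** 2) for i in range(len(X)))
    # Penalty w_ij for every must-link pair split across clusters.
    cost += sum(w[i, j] for (i, j) in must_link if labels[i] != labels[j])
    # Penalty w_bar_ij for every cannot-link pair placed in the same cluster.
    cost += sum(w_bar[i, j] for (i, j) in cannot_link if labels[i] == labels[j])
    return cost
```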

5
Clustering Algorithm
  • Initialization
  • Take the transitive closure of the must-link
    constraints to be the new set M.
  • The connected components in M are used to create
    neighborhood sets N_1, ..., N_lambda; from these
    sets create the k initial clusters.
  • For every pair of neighborhoods that have at
    least one cannot-link between them, add
    cannot-link constraints between every pair of
    points from the two neighborhoods.
  • Iterate as in k-means
  • In each step, assign each point x to the cluster
    that minimizes the objective function
    contribution of x (a sketch of one such
    iteration follows below).
  • Compute the new centroid of each cluster.
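A sketch of one such iteration under the objective above (greedy per-point
assignment given the other points' current labels, then a centroid update);
the function and variable names are assumptions for illustration, and labels
is assumed to be a NumPy integer array:

```python
import numpy as np

def pckmeans_assign(X, labels, centroids, must_link, cannot_link, w, w_bar):
    # Assign each point to the cluster that minimizes its contribution to the
    # objective: distance to the centroid plus constraint-violation penalties,
    # computed against the current labels of the other points.
    k = len(centroids)
    for i in range(len(X)):
        costs = np.array([np.sum((X[i] - centroids[h]) ** 2) for h in range(k)])
        for (a, b) in must_link:
            if i in (a, b):
                other = b if a == i else a
                costs += w[a, b] * (np.arange(k) != labels[other])
        for (a, b) in cannot_link:
            if i in (a, b):
                other = b if a == i else a
                costs += w_bar[a, b] * (np.arange(k) == labels[other])
        labels[i] = int(np.argmin(costs))
    return labels

def update_centroids(X, labels, k):
    # Recompute each centroid as the mean of its currently assigned points
    # (assumes every cluster keeps at least one point).
    return np.array([X[labels == h].mean(axis=0) for h in range(k)])
```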

6
(No Transcript)
7
Clustering Algorithm
  • Lemma
  • The algorithm PCKMeans converges to a local
    minimum of the objective function J.
  • Proof
  • Essentially the same as for k-means: in each
    iteration the objective function does not
    increase, and it is bounded below by 0, so the
    algorithm converges to a local minimum.

8
Active Learning Algorithm
  • Assume access to a noiseless oracle that can
    assign a must-link or cannot-link label to any
    given pair of points (a simulated oracle is
    sketched below).
  • Goal: improve the clustering performance with as
    few queries to the oracle as possible.
  • Try to get good initial clusters under a
    metric-space assumption.
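For simulation purposes such a noiseless oracle can be built from
ground-truth labels (a sketch; in the real setting the oracle is a human
annotator, and the helper name is hypothetical):

```python
def make_pairwise_oracle(true_labels):
    # Returns a query function that answers must-link / cannot-link
    # by consulting the (hidden) true class labels.
    def query(i, j):
        return "must-link" if true_labels[i] == true_labels[j] else "cannot-link"
    return query
```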

9
Active Learning Algorithm
  • The performance of algorithms like k-means or
    PCKMeans depends heavily on the initial clusters.
  • We estimate the centroid of each cluster by the
    mean of the data points assigned to that cluster.
  • Under certain generative model-based assumptions
    (e.g. each cluster is generated by a Gaussian
    distribution and each data point is sampled
    independently), more data points per cluster
    give a better estimate of the centroid (see the
    small simulation below).
  • We would like to get as many points per cluster
    as possible, while using very few queries to the
    oracle.
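A small simulation of that point, assuming a single isotropic Gaussian
cluster (the slide's generative assumption); the error of the centroid
estimate shrinks roughly like 1/sqrt(n) as more points are averaged:

```python
import numpy as np

rng = np.random.default_rng(0)
true_centroid = np.array([2.0, -1.0])
for n in (5, 50, 500):
    # n i.i.d. samples from the assumed Gaussian cluster around true_centroid
    pts = rng.normal(loc=true_centroid, scale=1.0, size=(n, 2))
    err = np.linalg.norm(pts.mean(axis=0) - true_centroid)
    print(n, round(err, 3))  # estimation error decreases roughly as 1/sqrt(n)
```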

10
Random initialization
  • The coupon collector's problem
  • To collect k distinct coupons, each draw returns
    a random coupon with equal probability 1/k.
  • The expected number of draws to collect all
    coupons is k ln k + O(k).
  • One can also show that within k ln k + O(k)
    draws, all coupons are collected with high
    probability (see the simulation below).
  • Generalization
  • If the probability of drawing each coupon is
    bounded below by 1/l, then all coupons can be
    collected with high probability in l ln k + O(l)
    draws.
  • Therefore we roughly need to sample l ln k
    points, and perform lk ln k queries (each
    sampled point needs up to about k queries to
    place it in a cluster).
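A quick simulation of the uniform case (names are illustrative), comparing
the empirical number of draws with k ln k:

```python
import math
import random

def coupon_collector_rounds(k, rng):
    # Draw uniform coupons until all k distinct coupons have been seen.
    seen, rounds = set(), 0
    while len(seen) < k:
        seen.add(rng.randrange(k))
        rounds += 1
    return rounds

k = 20
trials = [coupon_collector_rounds(k, random.Random(s)) for s in range(500)]
print(sum(trials) / len(trials))   # empirical mean, about 72 for k = 20
print(k * math.log(k))             # k ln k, about 60; the gap is the O(k) term
```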

11
Farthest-first traversal initialization
  • Assume the data lie in a metric space. First
    pick a random point and add it to the traversal
    set; then repeatedly select the point farthest
    from the set (using d(x, S) = min_{y in S}
    d(x, y)) and add it to the traversal set.
  • If the probability of sampling a point from each
    cluster is suitably bounded below, one can show
    that after l draws, with high probability, at
    least one point from each cluster has been
    sampled.
  • Therefore we roughly need to sample l points and
    perform lk queries, which is better than the
    random initialization (a sketch of the traversal
    follows below).
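A minimal sketch of farthest-first traversal in Euclidean space, with
d(x, S) = min over y in S of ||x - y||; names are illustrative:

```python
import numpy as np

def farthest_first(X, num_points, seed=0):
    # Start from a random point, then repeatedly add the point whose distance
    # to the current traversal set is largest.
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(len(X)))]
    dist_to_set = np.linalg.norm(X - X[chosen[0]], axis=1)
    while len(chosen) < num_points:
        nxt = int(np.argmax(dist_to_set))
        chosen.append(nxt)
        dist_to_set = np.minimum(dist_to_set, np.linalg.norm(X - X[nxt], axis=1))
    return chosen
```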

12
Farthest-first traversal Initialization
13
Getting more points
  • Once each cluster has at least 1 point, we can
    use random sampling to get more points for each
    cluster. For a new point, at most k-1 oracle
    queries (against a known point from each cluster
    in turn) decide which cluster it belongs to (a
    sketch of this step follows below).
  • By the coupon collector's problem, with roughly
    lk ln k queries we can get a new point for each
    cluster.
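A sketch of that assignment step, assuming one known representative point per
cluster and the simulated oracle sketched earlier (the helper name is
hypothetical):

```python
def assign_with_queries(new_point_idx, representatives, oracle):
    # representatives: dict mapping cluster id -> index of one known member.
    # Query the new point against one member per cluster; the first must-link
    # answer settles it, and k-1 cannot-link answers imply the last cluster.
    cluster_ids = list(representatives)
    for cid in cluster_ids[:-1]:          # at most k-1 oracle queries
        if oracle(new_point_idx, representatives[cid]) == "must-link":
            return cid
    return cluster_ids[-1]
```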

14
Getting more points
15
Experiments
  • Normalized mutual information (NMI)
  • NMI = I(C; K) / ((H(C) + H(K)) / 2) (computed in
    the sketch below)
  • C is the cluster assignment
  • K is the actual class label
  • I(X; Y) = H(X) - H(X|Y)
  • H(X) is the Shannon entropy of X
  • H(X|Y) is the conditional entropy of X given Y
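A self-contained sketch of this measure, using the equivalent identity
I(C; K) = H(C) + H(K) - H(C, K); names are illustrative:

```python
import numpy as np

def entropy_from_counts(counts):
    # Shannon entropy (in nats) of a distribution given by raw counts.
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def nmi(cluster_labels, class_labels):
    c, k = np.asarray(cluster_labels), np.asarray(class_labels)
    h_c = entropy_from_counts(np.unique(c, return_counts=True)[1])
    h_k = entropy_from_counts(np.unique(k, return_counts=True)[1])
    joint = np.unique(np.stack([c, k], axis=1), axis=0, return_counts=True)[1]
    i_ck = h_c + h_k - entropy_from_counts(joint)   # I(C; K)
    return i_ck / ((h_c + h_k) / 2)

print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))  # identical partitions up to relabeling -> 1.0
```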

16
Experiments
  • Documents extracted from the 20 Newsgroups
    collection
    (www.ai.mit.edu/people/jrennie/20Newsgroups); a
    sketch of assembling a comparable subset follows
    below
  • 3 similar newsgroups:
  • comp.graphics
  • comp.os.ms-windows
  • comp.windows.x
  • 100 documents from each group
  • 3225 dimensions
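A rough sketch of how a comparable subset could be assembled today with
scikit-learn; the exact preprocessing behind the talk's
100-documents-per-group, 3225-dimension setup is not reproduced, and the
scikit-learn name for the middle group is comp.os.ms-windows.misc:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

groups = ["comp.graphics", "comp.os.ms-windows.misc", "comp.windows.x"]
data = fetch_20newsgroups(subset="train", categories=groups,
                          remove=("headers", "footers", "quotes"))
# max_features=3225 only echoes the slide's dimensionality; it is not the
# talk's actual feature selection.
X = TfidfVectorizer(max_features=3225).fit_transform(data.data)
print(X.shape, len(data.target))
```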

17
Experiments - News-sim3
18
Experiments
19
Experiments - iproute
20
Experiments - iptable
21
Experiments - mpg321