1
Information Theoretic Clustering, Co-clustering and Matrix Approximations
Inderjit S. Dhillon, University of Texas, Austin
Data Mining Seminar Series, Mar 26, 2004
  • Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu & D. Modha


2
Clustering: Unsupervised Learning
  • Grouping together of similar objects
  • Hard clustering -- each object belongs to a single cluster
  • Soft clustering -- each object is probabilistically assigned to clusters (see the sketch below)
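A tiny made-up example of the two kinds of output (the arrays are illustrative, not from the talk):

```python
import numpy as np

# Hard clustering: one cluster label per object.
hard_labels = np.array([0, 0, 1, 2, 1])            # object i belongs to cluster hard_labels[i]

# Soft clustering: a probability distribution over clusters for each object.
soft_assign = np.array([[0.9, 0.1, 0.0],
                        [0.7, 0.2, 0.1],
                        [0.1, 0.8, 0.1],
                        [0.0, 0.2, 0.8],
                        [0.2, 0.6, 0.2]])
assert np.allclose(soft_assign.sum(axis=1), 1.0)   # each row sums to 1
```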

3
Contingency Tables
  • Let X and Y be discrete random variables
  • X and Y take values in {1, 2, ..., m} and {1, 2, ..., n}
  • p(X, Y) denotes the joint probability distribution; if not known, it is often estimated from co-occurrence data (see the sketch below)
  • Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
  • Key obstacles in clustering contingency tables:
  • High dimensionality, sparsity, noise
  • Need for robust and scalable algorithms
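A minimal sketch of how p(X, Y) might be estimated from co-occurrence counts, assuming a dense word-by-document count matrix (the toy numbers and variable names are illustrative):

```python
import numpy as np

# Toy word-by-document co-occurrence counts (rows = words, columns = documents).
counts = np.array([[5, 4, 0, 0],
                   [3, 6, 0, 1],
                   [0, 0, 7, 5],
                   [1, 0, 4, 6]], dtype=float)

# Estimate the joint distribution p(X, Y) by normalizing the counts.
p_xy = counts / counts.sum()

# Marginals p(X) and p(Y) are obtained by summing out the other variable.
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

assert np.isclose(p_xy.sum(), 1.0)
```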

4
Co-Clustering
  • Simultaneously
  • Cluster rows of p(X, Y) into k disjoint groups
  • Cluster columns of p(X, Y) into l disjoint
    groups
  • Key goal is to exploit the duality between row
    and column clustering to overcome sparsity and
    noise

5
Co-clustering Example for Text Data
  • Co-clustering clusters both words and documents
    simultaneously using the underlying co-occurrence
    frequency matrix

(Figure: a word-document co-occurrence matrix with rows grouped into word clusters and columns grouped into document clusters.)
6
Co-clustering and Information Theory
  • View the co-occurrence matrix as a joint probability distribution over row & column random variables
  • We seek a hard clustering of both rows and columns such that the information in the compressed matrix is maximized

7
Information Theory Concepts
  • Entropy of a random variable X with probability
    distribution p
  • The Kullback-Leibler (KL) Divergence or Relative
    Entropy between two probability distributions p
    and q
  • Mutual Information between random variables X and
    Y
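The formulas on this slide appeared as images; the standard definitions being referred to are:

```latex
H(X) = -\sum_{x} p(x)\,\log p(x)
\qquad
KL(p\,\|\,q) = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}
\qquad
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
```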

8
Optimal Co-Clustering
  • Seek random variables X̂ and Ŷ taking values in {1, 2, ..., k} and {1, 2, ..., l} such that the mutual information I(X̂; Ŷ) is maximized
  • where X̂ = R(X) is a function of X alone
  • where Ŷ = C(Y) is a function of Y alone
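Written as an optimization problem (a reconstruction consistent with the lemma on the later slide):

```latex
\max_{R,\;C} \; I(\hat{X};\hat{Y})
\quad\text{with}\quad \hat{X}=R(X),\;\hat{Y}=C(Y),
\quad\text{equivalently}\quad
\min_{R,\;C}\;\bigl[\,I(X;Y) - I(\hat{X};\hat{Y})\,\bigr].
```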

9
Related Work
  • Distributional Clustering
  • Pereira, Tishby & Lee (1993), Baker & McCallum (1998)
  • Information Bottleneck
  • Tishby, Pereira & Bialek (1999), Slonim, Friedman & Tishby (2001), Berkhin & Becher (2002)
  • Probabilistic Latent Semantic Indexing
  • Hofmann (1999), Hofmann & Puzicha (1999)
  • Non-Negative Matrix Approximation
  • Lee & Seung (2000)

10
Information-Theoretic Co-clustering
  • Lemma: the loss in mutual information equals the KL divergence between p and q (stated below)
  • p is the input distribution
  • q is an approximation to p determined by the co-clustering
  • It can be shown that q(x, y) is the maximum entropy approximation to p subject to the cluster constraints
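The equation itself was an image on the slide; as given in the Information-Theoretic Co-clustering paper (Dhillon, Mallela & Modha, KDD 2003) cited at the end, the lemma reads:

```latex
I(X;Y) - I(\hat{X};\hat{Y}) \;=\; KL\bigl(p(X,Y)\,\|\,q(X,Y)\bigr),
\qquad
q(x,y) \;=\; p(\hat{x},\hat{y})\;p(x\mid\hat{x})\;p(y\mid\hat{y}),
```

where x̂ and ŷ denote the clusters containing x and y.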

15
The parameters that determine q(x, y) are the co-cluster probabilities p(x̂, ŷ) and the conditionals p(x | x̂) and p(y | ŷ).
16
Decomposition Lemma
  • Question: how do we minimize the loss KL( p || q )?
  • The following lemma reveals the answer (reconstructed below)
  • Note that q(Y | x̂) may be thought of as the prototype of row cluster x̂
  • Similarly, q(X | ŷ) may be thought of as the prototype of column cluster ŷ
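The lemma appeared as an equation image; in the notation of the KDD 2003 paper it decomposes the objective into row-wise (and, symmetrically, column-wise) KL divergences to the cluster prototypes:

```latex
KL\bigl(p(X,Y)\,\|\,q(X,Y)\bigr)
 \;=\; \sum_{\hat{x}} \sum_{x\,:\,R(x)=\hat{x}} p(x)\; KL\bigl(p(Y\mid x)\,\|\,q(Y\mid\hat{x})\bigr)
 \;=\; \sum_{\hat{y}} \sum_{y\,:\,C(y)=\hat{y}} p(y)\; KL\bigl(p(X\mid y)\,\|\,q(X\mid\hat{y})\bigr)
```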

17
Co-Clustering Algorithm
  • Step 1: Set t = 0. Start with an initial co-clustering (R, C) and compute the resulting approximation q^(0).
  • Step 2: For every row x, assign it to the row cluster that minimizes KL( p(Y | x) || q^(t)(Y | x̂) ).
  • Step 3: With the updated row clustering, compute q^(t+1).
  • Step 4: For every column y, assign it to the column cluster that minimizes KL( p(X | y) || q^(t+1)(X | ŷ) ).
  • Step 5: With the updated column clustering, compute q^(t+2). Iterate Steps 2-5 (a code sketch follows).
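A minimal runnable sketch of these steps for a dense joint-distribution matrix p (NumPy); the function names, random initialization, and fixed iteration count are illustrative choices, not taken from the talk or paper:

```python
import numpy as np

def _prototypes(p, rows, cols, k, l, eps=1e-12):
    """Row-cluster prototypes q(Y | x_hat) = p(y_hat | x_hat) * p(y | y_hat)."""
    p_cc = np.zeros((k, l))                       # co-cluster mass p(x_hat, y_hat)
    for a in range(k):
        for b in range(l):
            p_cc[a, b] = p[np.ix_(rows == a, cols == b)].sum()
    p_y = p.sum(axis=0)
    p_yhat = np.array([p_y[cols == b].sum() for b in range(l)])
    p_xhat = p_cc.sum(axis=1)
    q = np.zeros((k, p.shape[1]))
    for a in range(k):
        for y in range(p.shape[1]):
            b = cols[y]
            q[a, y] = (p_cc[a, b] / max(p_xhat[a], eps)) * (p_y[y] / max(p_yhat[b], eps))
    return q

def _assign(conditionals, prototypes, eps=1e-12):
    """Assign each conditional distribution to the prototype with smallest KL divergence."""
    labels = np.zeros(conditionals.shape[0], dtype=int)
    for i, d in enumerate(conditionals):
        kl = [(d * np.log((d + eps) / (q + eps))).sum() for q in prototypes]
        labels[i] = int(np.argmin(kl))
    return labels

def itcc(p, k, l, n_iters=20, seed=0):
    """Alternate the row step (Step 2) and column step (Step 4), recomputing q in between."""
    rng = np.random.default_rng(seed)
    m, n = p.shape
    rows = rng.integers(0, k, size=m)                  # Step 1: initial row clustering
    cols = rng.integers(0, l, size=n)                  #         initial column clustering
    cond_rows = p / np.maximum(p.sum(axis=1, keepdims=True), 1e-12)      # p(Y | x)
    cond_cols = (p / np.maximum(p.sum(axis=0, keepdims=True), 1e-12)).T  # p(X | y)
    for _ in range(n_iters):
        rows = _assign(cond_rows, _prototypes(p, rows, cols, k, l))      # Step 2 (q recomputed inside)
        cols = _assign(cond_cols, _prototypes(p.T, cols, rows, l, k))    # Step 4, by row/column symmetry
    return rows, cols
```

For example, `itcc(p_xy, k=2, l=2)` with the toy p_xy from the earlier sketch would co-cluster its rows and columns into a 2 x 2 block structure.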

18
Properties of Co-clustering Algorithm
  • Main Theorem: co-clustering monotonically decreases the loss in mutual information
  • Co-clustering converges to a local minimum
  • Can be generalized to multi-dimensional
    contingency tables
  • q can be viewed as a low complexity
    non-negative matrix approximation
  • q preserves marginals of p, and co-cluster
    statistics
  • Implicit dimensionality reduction at each step helps overcome sparsity & high-dimensionality
  • Computationally economical
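The "preserves marginals" property, written out in the notation above (these identities follow directly from the definition of q):

```latex
q(x) = p(x), \qquad q(y) = p(y), \qquad q(\hat{x},\hat{y}) = p(\hat{x},\hat{y}).
```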

23
Applications -- Text Classification
  • Assigning class labels to text documents
  • Training and Testing Phases

(Diagram: a document collection grouped into classes Class-1, ..., Class-m serves as training data for the classifier; the trained classifier assigns a class label to each new document.)
24
Feature Clustering (dimensionality reduction)
  • Feature Selection
  • Select the best words; throw away the rest
  • Frequency-based pruning
  • Information-criterion-based pruning
  • Feature Clustering
  • Do not throw away words; cluster words instead
  • Use the word clusters as features (see the sketch below)
(Diagram: feature selection maps a document's bag-of-words to a vector over selected words Word1, ..., Wordk; feature clustering maps it to a vector over word clusters Cluster1, ..., Clusterk.)
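A minimal sketch of the contrast, assuming a document-term count matrix and a precomputed word-to-cluster assignment (all names and numbers are illustrative):

```python
import numpy as np

# Toy document-term counts: 3 documents x 6 words.
doc_term = np.array([[2, 1, 0, 0, 3, 0],
                     [0, 0, 4, 2, 0, 1],
                     [1, 0, 2, 3, 0, 2]])

# Feature selection: keep only a chosen subset of word columns, discard the rest.
selected = [0, 2, 4]                               # e.g. words that pass an information criterion
selected_features = doc_term[:, selected]          # shape (3, 3)

# Feature clustering: sum the counts of all words that fall in the same word cluster.
word_cluster = np.array([0, 0, 1, 1, 0, 1])        # cluster label per word
k = word_cluster.max() + 1
cluster_features = np.zeros((doc_term.shape[0], k), dtype=int)
for c in range(k):
    cluster_features[:, c] = doc_term[:, word_cluster == c].sum(axis=1)

print(selected_features)   # some words contribute nothing
print(cluster_features)    # every word still contributes, via its cluster
```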
25
Experiments
  • Data sets
  • 20 Newsgroups data: 20 classes, 20,000 documents
  • Classic3 data set: 3 classes (cisi, med and cran), 3893 documents
  • Dmoz Science HTML data: 49 leaves in the hierarchy, 5000 documents with 14538 words
  • Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
  • Implementation details
  • Bow toolkit used for indexing, co-clustering, clustering and classifying

26
Results (20Ng)
  • Classification Accuracy on 20 Newsgroups data
    with 1/3-2/3 test-train split
  • Divisive clustering beats feature selection
    algorithms by a large margin
  • The effect is more significant at lower numbers of features

27
Results (Dmoz)
  • Classification Accuracy on Dmoz data with 1/3-2/3
    test train split
  • Divisive Clustering is better at lower numbers of features
  • Note contrasting behavior of Naïve Bayes and SVMs

28
Results (Dmoz)
  • Naïve Bayes on Dmoz data with only 2% training data
  • Note that Divisive Clustering achieves a higher maximum than IG, a significant 13% increase
  • Divisive Clustering performs better than IG at lower amounts of training data

29
Hierarchical Classification
(Diagram: a topic hierarchy rooted at Science, with children Math, Physics and Social Science; Math contains Number Theory and Logic, Physics contains Quantum Theory and Mechanics, and Social Science contains Economics and Archeology.)
  • A flat classifier builds a single classifier over the leaf classes in the above hierarchy
  • A hierarchical classifier builds a classifier at each internal node of the hierarchy (see the sketch below)
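A minimal sketch of the routing idea behind the hierarchical classifier, with hypothetical per-node models (the Node/classify interface is illustrative, not from the talk):

```python
class Node:
    """An internal node holds a classifier that picks one of its children; leaves are classes."""
    def __init__(self, name, children=None, classifier=None):
        self.name = name
        self.children = children or {}     # child name -> Node
        self.classifier = classifier       # any model with .predict(doc) -> child name

def classify(node, doc):
    """Route a document from the root down to a leaf class."""
    while node.children:
        child = node.classifier.predict(doc)   # e.g. a Naive Bayes model trained at this node
        node = node.children[child]
    return node.name
```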

30
Results (Dmoz)
  • Hierarchical classifier (Naïve Bayes at each node)
  • The hierarchical classifier reaches 64.54% accuracy with just 10 features (the flat classifier achieves 64.04% accuracy with 1000 features)
  • The hierarchical classifier improves accuracy to 68.42%, from the 64.42% maximum achieved by flat classifiers

31
Anecdotal Evidence
Top few words in clusters obtained by the divisive and agglomerative approaches on 20 Newsgroups data:
  • Cluster 10, Divisive Clustering (rec.sport.hockey): team game play hockey season boston chicago pit van nhl
  • Cluster 9, Divisive Clustering (rec.sport.baseball): hit runs baseball base ball greg morris ted pitcher hitting
  • Cluster 12, Agglomerative Clustering (rec.sport.hockey and rec.sport.baseball): team detroit hockey pitching games hitter players rangers baseball nyi league morris player blues nhl shots pit vancouver buffalo ens
32
Co-Clustering Results (CLASSIC3)
33
Results Binary (subset of 20Ng data)
34
Precision 20Ng data
Dataset            Co-clustering   1D-clustering   IB-Double / IDC
Binary             0.98            0.64            0.70
Binary_Subject     0.96            0.67            0.85
Multi5             0.87            0.34            0.5
Multi5_Subject     0.89            0.37            0.88
Multi10            0.56            0.17            0.35
Multi10_Subject    0.54            0.19            0.55
35
Results Sparsity (Binary_subject data)
36
Results Sparsity (Binary_subject data)
37
Results (Monotonicity)
38
Conclusions
  • Information-theoretic approach to clustering, co-clustering and matrix approximation
  • Implicit dimensionality reduction at each step to overcome sparsity & high-dimensionality
  • The theoretical approach has the potential to extend to other problems:
  • Multi-dimensional co-clustering
  • MDL to choose the number of co-clusters
  • Generalized co-clustering via Bregman divergences

39
More Information
  • Email: inderjit@cs.utexas.edu
  • Papers are available at http://www.cs.utexas.edu/users/inderjit
  • Divisive Information-Theoretic Feature Clustering for Text Classification, Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD, 2002).
  • Information-Theoretic Co-clustering, Dhillon, Mallela & Modha, KDD, 2003.
  • Clustering with Bregman Divergences, Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April 2004.
  • A Generalized Maximum Entropy Approach to Bregman Co-clustering & Matrix Approximation, Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004.