Title: Information Theoretic Clustering, Co-clustering and Matrix Approximations (Inderjit S. Dhillon, University of Texas, Austin)
1. Information Theoretic Clustering, Co-clustering and Matrix Approximations
Inderjit S. Dhillon, University of Texas, Austin
Data Mining Seminar Series, Mar 26, 2004
- Joint work with A. Banerjee, J. Ghosh, Y. Guan, S. Mallela, S. Merugu and D. Modha
2. Clustering: Unsupervised Learning
- Grouping together of similar objects
- Hard Clustering -- each object belongs to a single cluster
- Soft Clustering -- each object is probabilistically assigned to clusters
3. Contingency Tables
- Let X and Y be discrete random variables
- X and Y take values in {1, 2, ..., m} and {1, 2, ..., n}
- p(X, Y) denotes the joint probability distribution; if not known, it is often estimated based on co-occurrence data
- Application areas: text mining, market-basket analysis, analysis of browsing behavior, etc.
- Key obstacles in clustering contingency tables:
  - High dimensionality, sparsity, noise
  - Need for robust and scalable algorithms
4. Co-Clustering
- Simultaneously:
  - Cluster rows of p(X, Y) into k disjoint groups
  - Cluster columns of p(X, Y) into l disjoint groups
- Key goal is to exploit the duality between row and column clustering to overcome sparsity and noise
5. Co-clustering Example for Text Data
- Co-clustering clusters both words and documents simultaneously using the underlying co-occurrence frequency matrix
- (Figure: word-document co-occurrence matrix, with rows grouped into word clusters and columns grouped into document clusters)
6. Co-clustering and Information Theory
- View the co-occurrence matrix as a joint probability distribution over row and column random variables
- We seek a hard clustering of both rows and columns such that the information in the compressed matrix is maximized
7. Information Theory Concepts
- Entropy of a random variable X with probability distribution p
- The Kullback-Leibler (KL) divergence, or relative entropy, between two probability distributions p and q
- Mutual information between random variables X and Y
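For reference, the standard definitions of these three quantities (sums run over the supports of the distributions):

  H(X) = -\sum_x p(x) \log p(x)

  D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}

  I(X; Y) = \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}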
8. Optimal Co-Clustering
- Seek random variables X̂ and Ŷ taking values in {1, 2, ..., k} and {1, 2, ..., l} such that the mutual information I(X̂; Ŷ) is maximized
  - where X̂ = R(X) is a function of X alone
  - where Ŷ = C(Y) is a function of Y alone
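Written as an optimization problem (following the co-clustering formulation cited on the final slide):

  \max_{R, C} \; I(\hat{X}; \hat{Y}) \quad \text{subject to} \quad \hat{X} = R(X) \in \{1, \dots, k\}, \;\; \hat{Y} = C(Y) \in \{1, \dots, l\}

which is equivalent to minimizing the loss in mutual information I(X; Y) - I(\hat{X}; \hat{Y}).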
9. Related Work
- Distributional Clustering
  - Pereira, Tishby & Lee (1993), Baker & McCallum (1998)
- Information Bottleneck
  - Tishby, Pereira & Bialek (1999), Slonim, Friedman & Tishby (2001), Berkhin & Becher (2002)
- Probabilistic Latent Semantic Indexing
  - Hofmann (1999), Hofmann & Puzicha (1999)
- Non-Negative Matrix Approximation
  - Lee & Seung (2000)
10. Information-Theoretic Co-clustering
- Lemma: the loss in mutual information equals the relative entropy D(p || q) (written out below), where
  - p is the input distribution
  - q is an approximation to p
- It can be shown that q(x, y) is a maximum entropy approximation of p subject to the cluster constraints
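The lemma, as stated in the Information-Theoretic Co-clustering paper (Dhillon, Mallela & Modha, KDD 2003), reads:

  I(X; Y) - I(\hat{X}; \hat{Y}) = D\big( p(X, Y) \,\|\, q(X, Y) \big)

where D denotes KL divergence and q has the factored form given on slide 15.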
11-14. (No transcript)
15. The parameters that determine q(x, y) are:
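Per the co-clustering paper cited on the final slide, q factors as

  q(x, y) = p(\hat{x}, \hat{y}) \, p(x \mid \hat{x}) \, p(y \mid \hat{y}), \qquad \hat{x} = R(x), \; \hat{y} = C(y),

so the determining parameters are the co-cluster probabilities p(x̂, ŷ) together with the within-cluster distributions p(x | x̂) and p(y | ŷ).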
16. Decomposition Lemma
- Question: how can the loss in mutual information, D(p || q), be minimized?
- The following lemma (stated below) reveals the answer
- Note that q(Y | x̂) may be thought of as the prototype of row cluster x̂
- Similarly, q(X | ŷ) may be thought of as the prototype of column cluster ŷ
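The decomposition, as given in the KDD 2003 co-clustering paper, is

  D\big( p(X, Y) \,\|\, q(X, Y) \big)
    = \sum_{\hat{x}} \sum_{x \,:\, R(x) = \hat{x}} p(x) \, D\big( p(Y \mid x) \,\|\, q(Y \mid \hat{x}) \big)
    = \sum_{\hat{y}} \sum_{y \,:\, C(y) = \hat{y}} p(y) \, D\big( p(X \mid y) \,\|\, q(X \mid \hat{y}) \big)

so the objective can be decreased one row (or one column) at a time by moving it to the cluster whose prototype is closest in KL divergence.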
17. Co-Clustering Algorithm
- Step 1: Set t = 0. Start with an initial co-clustering (R, C). Compute the resulting approximation q.
- Step 2: For every row x, assign it to the row cluster x̂ that minimizes D(p(Y | x) || q(Y | x̂)).
- Step 3: We now have a new row clustering. Recompute q.
- Step 4: For every column y, assign it to the column cluster ŷ that minimizes D(p(X | y) || q(X | ŷ)).
- Step 5: We now have a new column clustering. Recompute q. Iterate Steps 2-5 until convergence (a code sketch follows below).
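As a concrete illustration of Steps 1-5, here is a minimal NumPy sketch (the function itcc and all variable names are illustrative, not from the talk). It assumes p is a nonnegative m-by-n matrix normalized to sum to 1 and uses a fixed iteration count instead of a convergence test.

import numpy as np

def itcc(p, k, l, n_iter=20, seed=0, eps=1e-12):
    """Minimal sketch of information-theoretic co-clustering (Steps 1-5)."""
    rng = np.random.default_rng(seed)
    m, n = p.shape
    px, py = p.sum(axis=1), p.sum(axis=0)      # marginals p(x), p(y)
    row = rng.integers(k, size=m)              # Step 1: random initial row clustering R
    col = rng.integers(l, size=n)              #         random initial column clustering C

    def cocluster_joint(row, col):
        # p(xhat, yhat): sum of p(x, y) over each co-cluster
        R = np.eye(k)[row]                     # (m, k) row-cluster indicator
        C = np.eye(l)[col]                     # (n, l) column-cluster indicator
        return R.T @ p @ C

    for _ in range(n_iter):
        # Step 2: reassign each row to the closest row-cluster prototype q(Y | xhat)
        ph = cocluster_joint(row, col)
        q_y = (ph[:, col] / (ph.sum(1, keepdims=True) + eps)) \
              * (py / (ph.sum(0)[col] + eps))  # (k, n): q(y | xhat)
        p_y_x = p / (px[:, None] + eps)        # (m, n): p(y | x)
        row = (-p_y_x @ np.log(q_y + eps).T).argmin(axis=1)

        # Steps 3-4: recompute q, then reassign each column to q(X | yhat)
        ph = cocluster_joint(row, col)
        q_x = (ph.T[:, row] / (ph.sum(0)[:, None] + eps)) \
              * (px / (ph.sum(1)[row] + eps))  # (l, m): q(x | yhat)
        p_x_y = (p / (py[None, :] + eps)).T    # (n, m): p(x | y)
        col = (-p_x_y @ np.log(q_x + eps).T).argmin(axis=1)
        # Step 5: q is recomputed at the top of the next pass; iterate

    return row, col

On a word-document matrix, row and col would give word-cluster and document-cluster labels, respectively.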
18. Properties of the Co-clustering Algorithm
- Main Theorem: co-clustering monotonically decreases the loss in mutual information
- Co-clustering converges to a local minimum
- Can be generalized to multi-dimensional contingency tables
- q can be viewed as a low-complexity non-negative matrix approximation
- q preserves the marginals of p and the co-cluster statistics (a quick check appears below)
- Implicit dimensionality reduction at each step helps overcome sparsity and high dimensionality
- Computationally economical
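For instance, the marginal-preservation property follows directly from the factored form of q: summing over y gives

  \sum_y q(x, y) = p(x \mid \hat{x}) \sum_{\hat{y}} p(\hat{x}, \hat{y}) \sum_{y \,:\, C(y) = \hat{y}} p(y \mid \hat{y}) = p(x \mid \hat{x}) \, p(\hat{x}) = p(x),

and symmetrically \sum_x q(x, y) = p(y), while summing q over each co-cluster recovers p(x̂, ŷ).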
19-22. (No transcript)
23. Applications -- Text Classification
- Assigning class labels to text documents
- Training and testing phases
- (Figure: a document collection grouped into classes Class-1, ..., Class-m serves as training data; the classifier learns from the training data and outputs each new document with an assigned class)
24. Feature Clustering (dimensionality reduction)
- Feature Selection
  - Select the best words; throw away the rest
  - Frequency-based pruning
  - Information-criterion-based pruning
  - (Figure: each document's bag-of-words is mapped to a vector over the selected words Word1, ..., Wordk)
- Feature Clustering (see the sketch below)
  - Do not throw away words; cluster words instead
  - Use the clusters as features
  - (Figure: each document's bag-of-words is mapped to a vector over the word clusters Cluster1, ..., Clusterk)
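As a small illustration of the feature-clustering idea (illustrative code, not from the talk): given a document-by-word count matrix and a word-to-cluster assignment, the cluster features are simply the summed counts of the words in each cluster.

import numpy as np

def cluster_features(doc_word_counts, word_cluster, n_clusters):
    """Map a (docs x words) count matrix to (docs x clusters) features
    by summing the counts of all words assigned to the same cluster."""
    n_words = doc_word_counts.shape[1]
    indicator = np.zeros((n_words, n_clusters))
    indicator[np.arange(n_words), word_cluster] = 1.0   # word -> cluster membership
    return doc_word_counts @ indicator

# Toy usage: 3 documents, 5 words, 2 word clusters
counts = np.array([[2, 0, 1, 0, 3],
                   [0, 1, 0, 2, 0],
                   [1, 1, 1, 1, 1]])
labels = np.array([0, 1, 0, 1, 1])          # cluster label for each word
print(cluster_features(counts, labels, 2))  # -> 3 x 2 cluster-feature matrix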
25. Experiments
- Data sets
  - 20 Newsgroups data: 20 classes, 20000 documents
  - Classic3 data set: 3 classes (cisi, med and cran), 3893 documents
  - Dmoz Science HTML data: 49 leaves in the hierarchy, 5000 documents with 14538 words
    - Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
- Implementation details
  - Bow toolkit used for indexing, co-clustering, clustering and classifying
26. Results (20Ng)
- Classification accuracy on 20 Newsgroups data with a 1/3-2/3 test-train split
- Divisive clustering beats feature selection algorithms by a large margin
- The effect is more significant at smaller numbers of features
27. Results (Dmoz)
- Classification accuracy on Dmoz data with a 1/3-2/3 test-train split
- Divisive clustering is better at smaller numbers of features
- Note the contrasting behavior of Naïve Bayes and SVMs
28. Results (Dmoz)
- Naïve Bayes on Dmoz data with only 2% training data
- Note that Divisive Clustering achieves a higher maximum than IG, with a significant 13% increase
- Divisive Clustering performs better than IG with less training data
29. Hierarchical Classification
- (Figure: an example hierarchy rooted at Science, with internal nodes Math, Physics and Social Science and leaf classes Number Theory, Logic, Quantum Theory, Mechanics, Economics and Archeology)
- A flat classifier builds a classifier over the leaf classes in the above hierarchy
- A hierarchical classifier builds a classifier at each internal node of the hierarchy
30. Results (Dmoz)
- Hierarchical classifier (Naïve Bayes at each node)
- Hierarchical classifier: 64.54% accuracy with just 10 features (the flat classifier achieves 64.04% accuracy with 1000 features)
- Hierarchical classifier improves accuracy to 68.42% from the 64.42% (maximum) achieved by flat classifiers
31. Anecdotal Evidence
Top few words, sorted, in clusters obtained by the Divisive and Agglomerative approaches on 20 Newsgroups data:
- Cluster 10, Divisive Clustering (rec.sport.hockey): team game play hockey Season boston chicago pit van nhl
- Cluster 9, Divisive Clustering (rec.sport.baseball): hit runs Baseball base Ball greg morris Ted Pitcher Hitting
- Cluster 12, Agglomerative Clustering (rec.sport.hockey and rec.sport.baseball): team detroit hockey pitching Games hitter Players rangers baseball nyi league morris player blues nhl shots Pit Vancouver buffalo ens
32. Co-Clustering Results (CLASSIC3)
33. Results: Binary (subset of 20Ng data)
34. Precision: 20Ng data
Dataset          Co-clustering   1D-clustering   IB-Double / IDC
Binary               0.98            0.64             0.70
Binary_Subject       0.96            0.67             0.85
Multi5               0.87            0.34             0.5
Multi5_Subject       0.89            0.37             0.88
Multi10              0.56            0.17             0.35
Multi10_Subject      0.54            0.19             0.55
35. Results: Sparsity (Binary_subject data)
36. Results: Sparsity (Binary_subject data)
37. Results: Monotonicity
38. Conclusions
- Information-theoretic approach to clustering, co-clustering and matrix approximation
- Implicit dimensionality reduction at each step to overcome sparsity and high dimensionality
- The theoretical approach has the potential of extending to other problems:
  - Multi-dimensional co-clustering
  - MDL to choose the number of co-clusters
  - Generalized co-clustering by Bregman divergences
39. More Information
- Email: inderjit@cs.utexas.edu
- Papers are available at http://www.cs.utexas.edu/users/inderjit
- "Divisive Information-Theoretic Feature Clustering for Text Classification", Dhillon, Mallela & Kumar, Journal of Machine Learning Research (JMLR), March 2003 (also KDD, 2002)
- "Information-Theoretic Co-clustering", Dhillon, Mallela & Modha, KDD, 2003
- "Clustering with Bregman Divergences", Banerjee, Merugu, Dhillon & Ghosh, SIAM Data Mining Proceedings, April 2004
- "A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation", Banerjee, Dhillon, Ghosh, Merugu & Modha, working manuscript, 2004