Title: Fast Effective Clustering for Graphs and Documents
1. Fast Effective Clustering for Graphs and Documents
- William W. Cohen
- Machine Learning Dept. and Language Technologies Institute, School of Computer Science
- Carnegie Mellon University
- Joint work with Frank Lin and Ramnath Balasubramanyan
2. Introduction: trends in machine learning (1)
- Supervised learning: given data (x1,y1),…,(xn,yn), learn to predict y from x
- y is a real number or a member of a small set
- x is a (sparse) vector
- Semi-supervised learning: given data (x1,y1),…,(xk,yk), x(k+1),…,xn, learn to predict y from x
- Unsupervised learning: given data x1,…,xn, find a natural clustering
3. Introduction: trends in machine learning (2)
- Supervised learning: given data (x1,y1),…,(xn,yn), learn to predict y from x
- y is a real number or a member of a small set
- x is a (sparse) vector
- xs are all i.i.d., independent of each other
- y depends only on the corresponding x
- Structured learning: xs and/or ys are related to each other
4. Introduction: trends in machine learning (2)
- Structured learning: xs and/or ys are related to each other
- General case: x and y are in two parallel 1-d arrays
- xs are words in a document, y is a POS tag
- xs are words, y=1 if x is part of a company name
- xs are DNA codons, y=1 if x is part of a gene
- More general: xs are nodes in a graph, ys are labels for these nodes
5. Examples of classification in graphs
- x is a web page, edge is a hyperlink, y is topic
- x is a word, edge is co-occurrence in similar contexts, y is semantics (distributional clustering)
- x is a protein, edge is an interaction, y is subcellular location
- x is a person, edge is an email message, y is organization
- x is a person, edge is friendship, y=1 if x smokes
- x, y are anything; an edge from x1 to x2 indicates similarity between x1 and x2
6. Examples: Zachary's karate club, political books, protein-protein interactions, …
7. Political blog network
Adamic & Glance, "Divided They Blog", 2004
8. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model (Parkinnen et al., 2007)
- Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents
9. This talk
- Typical experiments
- For networks with known true labels: can unsupervised learning recover these labels?
10. Spectral Clustering: Graph = Matrix
[Figure: a 10-node graph with nodes A-J, shown next to its adjacency matrix; entry (i,j) is 1 iff nodes i and j are connected.]
11. Spectral Clustering: Graph = Matrix. Transitively Closed Components = "Blocks"
[Figure: the same graph with its adjacency matrix sorted by cluster, so the 1s form square blocks on the diagonal.]
Of course, we can't see the blocks unless the nodes are sorted by cluster.
12. Spectral Clustering: Graph = Matrix. Vector = node -> weight
[Figure: the adjacency matrix M next to a vector v that assigns a weight to each node, e.g. v(A)=3, v(B)=2, v(C)=3, …]
13. Spectral Clustering: Graph = Matrix. M v1 = v2 propagates weights from neighbors
[Figure: multiplying the adjacency matrix M by a weight vector v1 (with v1(A)=3, v1(B)=2, v1(C)=3, …) gives v2, in which each node's new weight combines the old weights of its neighbors.]
14. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
W is the adjacency matrix normalized so that columns sum to 1.
[Figure: multiplying W by v1 gives v2, in which each node's new weight is a weighted average of its neighbors' old weights.]
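The column-normalized propagation step can be sketched in numpy; the 4-node graph and the starting weights below are made-up toy values, not the slide's 10-node example:

```python
import numpy as np

# Toy symmetric adjacency matrix for a 4-node graph (no self-loops).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

# Normalize so each column sums to 1, as on the slide: W(i,j) = A(i,j)/deg(j).
W = A / A.sum(axis=0, keepdims=True)

v1 = np.array([3., 2., 3., 1.])  # arbitrary initial node weights
v2 = W @ v1                      # one propagation step: W v1 = v2

print(W.sum(axis=0))  # each column of W sums to 1
print(v2)
```

Each entry of v2 is a degree-weighted average of the neighbors' old weights, which is exactly the "averaging with neighbors" view used later in the talk.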
15. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
Q: How do I pick v to be an eigenvector for a block-stochastic matrix?
16. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
How do I pick v to be an eigenvector for a block-stochastic matrix?
17. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
[Figure: the eigenvalue spectrum λ2, λ3, λ4, λ5, … of W, with the corresponding eigenvectors e2, e3, …; an "eigengap" separates the few large eigenvalues from the rest.]
Shi & Meila, 2002
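The eigengap is easy to see numerically; the two-block toy graph and the 0.1 link weight below are our own illustrative choices:

```python
import numpy as np

# Two dense 5-node blocks joined by one weak edge: an approximately
# block-stochastic graph with k = 2 blocks.
n = 5
A = np.zeros((2 * n, 2 * n))
A[:n, :n] = 1.0
A[n:, n:] = 1.0
np.fill_diagonal(A, 0.0)
A[0, n] = A[n, 0] = 0.1  # weak inter-block link

# Random-walk normalization; eigenvalues are real because A is symmetric.
W = A / A.sum(axis=1, keepdims=True)
lam = np.sort(np.linalg.eigvals(W).real)[::-1]

# lam[0] = 1 (constant eigenvector e1); lam[1] is close to 1 (the block
# indicator e2); the "eigengap" lam[1] - lam[2] separates the informative
# eigenvectors from the rest of the spectrum.
print(lam[:3])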
18. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
[Figure: scatterplot of the graph's nodes in the (e2, e3) eigenvector coordinates; the points from the three underlying clusters (plotted as x, y, and z) form three well-separated groups.]
Shi & Meila, 2002
19. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
- If W is connected but roughly block-diagonal with k blocks, then
- the top eigenvector is a constant vector
- the next k eigenvectors are roughly piecewise constant, with pieces corresponding to blocks
20. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
- If W is connected but roughly block-diagonal with k blocks, then
- the top eigenvector is a constant vector
- the next k eigenvectors are roughly piecewise constant, with pieces corresponding to blocks
- Spectral clustering:
- Find the top k+1 eigenvectors v1,…,v(k+1)
- Discard the top one
- Replace every node a with the k-dimensional vector xa = <v2(a),…,v(k+1)(a)>
- Cluster with k-means
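The four-step recipe above can be sketched in numpy. The eigenvalue weighting and the deterministic k-means initialization are our own illustrative choices, not the exact NCut implementation used in the experiments:

```python
import numpy as np

def spectral_clusters(A, k, iters=30):
    """Sketch of the recipe: take the top k+1 eigenvectors, discard the
    constant top one, embed each node, and cluster with k-means."""
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))      # symmetric normalization of A
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1]         # eigenvalues, largest first
    # Drop e1; weight coordinates by eigenvalue so strong eigenvectors dominate.
    X = vecs[:, idx[1:k + 1]] * vals[idx[1:k + 1]]
    # Minimal k-means; initialize centers spread along the strongest coordinate.
    order = np.argsort(X[:, 0])
    C = X[order[np.linspace(0, len(X) - 1, k).astype(int)]]
    for _ in range(iters):
        lab = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[lab == j].mean(axis=0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return lab

# Two dense blocks joined by a weak edge: the recovered labels split them.
A = np.zeros((10, 10))
A[:5, :5] = 1.0; A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[0, 5] = A[5, 0] = 0.1
labels = spectral_clusters(A, k=2)
print(labels)
```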
21. Spectral Clustering: Pros and Cons
- Elegant, and well-founded mathematically
- Tends to avoid local minima
- Optimal solution to a relaxed version of the mincut problem (Normalized cut, aka NCut)
- Works quite well when relations are approximately transitive (like similarity or social connections)
- Expensive for very large datasets
- Computing eigenvectors is the bottleneck
- Approximate eigenvector computation is not always useful
- Noisy datasets sometimes cause problems
- Picking the number of eigenvectors and k is tricky
- Informative eigenvectors need not be in the top few
- Performance can drop suddenly from good to terrible
22. Experimental results: best-case assignment of class labels to clusters
23. Spectral Clustering: Graph = Matrix. M v1 = v2 propagates weights from neighbors
[Figure: applying M to the weight vector again; e.g. v1(A)=3, v1(B)=2, v1(C)=3 becomes v2(A)=5, v2(B)=6, v2(C)=5.]
24. Repeated averaging with neighbors as a clustering method
- Pick a vector v0 (maybe at random)
- Compute v1 = W v0
- i.e., replace v0(x) with the weighted average of v0(y) for the neighbors y of x
- Plot v1(x) for each x
- Repeat for v2, v3, …
- Variants are widely used for semi-supervised learning
- with clamping of labels for nodes with known labels
- Without clamping, the iteration converges to a constant vt
- What are the dynamics of this process?
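These dynamics can be simulated directly; the two-block toy graph below is our own example, not the slide's three-cluster dataset:

```python
import numpy as np

def power_iterate(A, steps, seed=0):
    """Repeated averaging with neighbors: v_t = W v_{t-1}, W = D^-1 A."""
    W = A / A.sum(axis=1, keepdims=True)   # row-normalized: W 1 = 1
    rng = np.random.default_rng(seed)
    v = rng.random(A.shape[0])
    v /= v.sum()
    history = [v]
    for _ in range(steps):
        v = W @ v
        history.append(v)
    return history

# Two dense 5-node blocks joined by one weak edge.
A = np.zeros((10, 10))
A[:5, :5] = 1.0; A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[0, 5] = A[5, 0] = 0.1

v10 = power_iterate(A, 10)[-1]
# After a few steps v_t is almost constant within each block while the two
# blocks still differ -- the structure PIC exploits before v_t flattens
# into a single constant vector.
gap = abs(v10[:5].mean() - v10[5:].mean())
spread = max(np.ptp(v10[:5]), np.ptp(v10[5:]))
print(gap, spread)
```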
25. Repeated averaging with neighbors on a sample problem
[Figure: a 2-D point set with three clusters, labeled blue, green, and red.]
- Create a graph connecting each point in the 2-D initial space to all other points
- Edges weighted by distance
- Run power iteration for 10 steps
- Plot node id x vs. v10(x), with nodes ordered by actual cluster number
26. Repeated averaging with neighbors on a sample problem
[Figure: plots of vt(x) after successive iterations; the values for the blue, green, and red clusters flatten out, with larger differences between clusters than within them.]
27. Repeated averaging with neighbors on a sample problem
[Figure: further iterations; vt(x) becomes nearly piecewise constant on the blue, green, and red clusters.]
28. Repeated averaging with neighbors on a sample problem
[Figure: after many iterations the remaining differences become very small as vt approaches a constant vector.]
29. PIC: Power Iteration Clustering. Run power iteration (repeated averaging with neighbors) with early stopping.
- v0: random start, or derived from the degree matrix D, or …
- Easy to implement and efficient
- Very easily parallelized
- Experimentally, often better than traditional spectral methods
- Surprising, since the embedded space is 1-dimensional!
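A minimal PIC sketch, under assumed stopping and normalization choices (the threshold, the degree-based start, and the toy graph are illustrative; see the Lin & Cohen paper for the exact settings):

```python
import numpy as np

def pic(A, k, eps=1e-6, max_iter=2000):
    """Power Iteration Clustering sketch: iterate v <- W v, stop early when
    the per-node velocity stabilizes, then cluster the 1-d embedding."""
    n = A.shape[0]
    W = A / A.sum(axis=1, keepdims=True)
    v = A.sum(axis=1)                      # degree-based start
    v = v / v.sum()
    delta_prev = np.full(n, np.inf)
    for _ in range(max_iter):
        v_new = W @ v
        v_new = v_new / np.abs(v_new).sum()
        delta = np.abs(v_new - v)
        v = v_new
        if np.max(np.abs(delta - delta_prev)) < eps / n:
            break                          # early stop: velocity is constant
        delta_prev = delta
    # k-means on the 1-dimensional embedding v
    centers = np.linspace(v.min(), v.max(), k)
    for _ in range(100):
        lab = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        centers = np.array([v[lab == j].mean() if np.any(lab == j)
                            else centers[j] for j in range(k)])
    return lab

# Asymmetric blocks (a 5-clique and a 4-clique) joined by a weak edge.
A = np.zeros((9, 9))
A[:5, :5] = 1.0; A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[0, 5] = A[5, 0] = 0.1
labels = pic(A, k=2)
print(labels)
```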
30. Experiments
- Network problems: natural graph structure
- PolBooks: 105 political books, 3 classes, linked by co-purchaser
- UMBCBlog: 404 political blogs, 2 classes, blogroll links
- AGBlog: 1222 political blogs, 2 classes, blogroll links
- Manifold problems: cosine distance between classification instances
- Iris: 150 flowers, 3 classes
- PenDigits01 / PenDigits17: 200 handwritten digits, 2 classes (0-1 or 1-7)
- 20ngA: 200 docs, misc.forsale vs soc.religion.christian
- 20ngB: 400 docs, misc.forsale vs soc.religion.christian
- 20ngC: 20ngB + 200 docs from talk.politics.guns
- 20ngD: 20ngC + 200 docs from rec.sport.baseball
31. Experimental results: best-case assignment of class labels to clusters
33. Experiments: run time and scalability
[Figure: run time in milliseconds for each method and dataset.]
34. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Experiments
- Analysis
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model …
- Variants: BlockLDA etc.
35. Analysis: why is this working?
36. Analysis: why is this working?
37. Analysis: why is this working?
[Figure: derivation of the PIC distance (an L2 distance between nodes' iterate values), annotated with notes on scaling, on differences that might cancel, and on noise terms.]
38. Analysis: why is this working?
- If:
- eigenvectors e2,…,ek are approximately piecewise constant on blocks
- λ2,…,λk are large and λ(k+1),… are small (e.g., if the matrix is block-stochastic)
- the ci's for v0 are bounded
- for any a, b from distinct blocks there is at least one ei with |ei(a) - ei(b)| large
- Then there exists an R so that:
- spec(a,b) small <=> R·pic(a,b) small
39. Analysis: why is this working?
- Sum of differences vs. sum-of-squared differences
- "Soft" eigenvector selection
40. Ncut with top k eigenvectors
Ncut with top 10 eigenvectors, weighted
41. PIC
42. Summary of results so far
- Both PIC and Ncut embed each graph node in a space where distance is meaningful
- Distances in PIC space and eigenspace are closely related
- At least for many graphs suited to spectral clustering
- PIC does a "soft" selection of eigenvectors
- Strong eigenvalues give high weights
- PIC gives comparable-quality clusters
- But is much faster
43. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model …
- Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents
44. Motivation: Experimental Datasets are…
- Network problems: natural graph structure
- PolBooks: 105 political books, 3 classes, linked by co-purchaser
- UMBCBlog: 404 political blogs, 2 classes, blogroll links
- AGBlog: 1222 political blogs, 2 classes, blogroll links
- Also: Zachary's karate club, citation networks, …
- Manifold problems: cosine distance between all pairs of classification instances
- Iris: 150 flowers, 3 classes
- PenDigits01 / PenDigits17: 200 handwritten digits, 2 classes (0-1 or 1-7)
- 20ngA: 200 docs, misc.forsale vs soc.religion.christian
- 20ngB: 400 docs, misc.forsale vs soc.religion.christian
Gets expensive fast.
45. Lazy computation of distances and normalizers
- Recall PIC's update:
- vt = W vt-1 = D^-1 A vt-1
- where D is the diagonal degree matrix: D = diag(A 1)
- My favorite distance metric for text is length-normalized TFIDF:
- Defn: A(i,j) = <vi, vj> / (||vi|| ||vj||)
- Let N(i,i) = ||vi|| and N(i,j) = 0 for i != j
- Let F(i,k) = TFIDF weight of word wk in document vi
- Then A = N^-1 F F^T N^-1
(1 is a column vector of 1s; <u,v> is the inner product; ||u|| is the L2 norm)
46. Lazy computation of distances and normalizers
Equivalent to using TFIDF/cosine on all pairs of examples, but requires only sparse matrices.
- Recall PIC's update:
- vt = W vt-1 = D^-1 A vt-1
- where D is the diagonal degree matrix: D = diag(A 1)
- Let F(i,k) = TFIDF weight of word wk in document vi
- Compute N(i,i) = ||vi|| and N(i,j) = 0 for i != j
- Don't compute A = N^-1 F F^T N^-1 explicitly
- Let D(i,i) = [N^-1 F F^T N^-1 1](i), where 1 is an all-1s vector
- Computed as N^-1 (F (F^T (N^-1 1))) for efficiency
- New update: vt = D^-1 A vt-1 = D^-1 N^-1 F F^T N^-1 vt-1
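The factored update can be checked against the dense computation in a few lines; here dense numpy arrays stand in for the sparse matrices, with F's rows as documents:

```python
import numpy as np

def lazy_pic_step(F, v):
    """One PIC update v <- D^-1 A v for A = N^-1 F F^T N^-1, computed as
    nested matrix-vector products so the dense n x n matrix A is never built."""
    n_inv = 1.0 / np.linalg.norm(F, axis=1)          # diagonal of N^-1
    d = n_inv * (F @ (F.T @ n_inv))                  # D = diag(A 1), lazily
    return (n_inv * (F @ (F.T @ (n_inv * v)))) / d   # D^-1 A v, lazily

# Check against the explicit dense computation on a toy TFIDF matrix.
rng = np.random.default_rng(0)
F = rng.random((4, 6))            # 4 documents, 6 terms
v = rng.random(4)
N_inv = np.diag(1.0 / np.linalg.norm(F, axis=1))
A = N_inv @ F @ F.T @ N_inv       # dense cosine-similarity matrix
dense = (A @ v) / (A @ np.ones(4))
print(np.allclose(lazy_pic_step(F, v), dense))
```

With a sparse F, each update costs one sparse mat-vec with F and one with F^T, instead of the quadratic cost of materializing A.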
47. Experimental results
- RCV1 text classification dataset
- 800k newswire stories
- Category labels from an industry vocabulary
- Took single-label documents and categories with at least 500 instances
- Result: 193,844 documents, 103 categories
- Generated 100 random category pairs
- Each is all documents from two categories
- Pairs range in size and difficulty
- Pick category 1, with m1 examples
- Pick category 2 such that 0.5·m1 < m2 < 2·m1
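The pair-generation step can be sketched as follows; the category names and sizes are made up, and only the size constraint 0.5·m1 < m2 < 2·m1 comes from the slide:

```python
import numpy as np

def sample_category_pairs(cat_sizes, n_pairs, seed=0):
    """Pick category 1 at random, then a category 2 whose size m2
    satisfies 0.5*m1 < m2 < 2*m1 (retrying when none qualifies)."""
    rng = np.random.default_rng(seed)
    cats = sorted(cat_sizes)
    pairs = []
    while len(pairs) < n_pairs:
        c1 = cats[rng.integers(len(cats))]
        m1 = cat_sizes[c1]
        ok = [c for c in cats
              if c != c1 and 0.5 * m1 < cat_sizes[c] < 2 * m1]
        if ok:                     # skip category 1 if nothing is close in size
            pairs.append((c1, ok[rng.integers(len(ok))]))
    return pairs

# Hypothetical category sizes, for illustration only.
sizes = {"I101": 700, "I202": 900, "I303": 1300}
pairs = sample_category_pairs(sizes, n_pairs=5)
print(pairs)
```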
48. Results
- NCUTevd: Ncut with exact eigenvectors
- NCUTiram: implicitly restarted Arnoldi method
- No statistically significant differences between NCUTevd and PIC
49. Results
50. Results
51. Results
52. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model (Parkinnen et al., 2007)
- Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents
53. Question: How to model this?
MMSBM of Airoldi et al.:
- Draw K^2 Bernoulli distributions
- Draw a θi for each protein
- For each entry i,j in the matrix:
- Draw zi from θi
- Draw zj from θj
- Draw mij from the Bernoulli associated with the pair of z's
54. Question: How to model this?
[Figure: protein-protein interaction matrix (index of protein 1 vs. index of protein 2); a filled entry means p1 and p2 do interact.]
MMSBM of Airoldi et al.:
- Draw K^2 Bernoulli distributions
- Draw a θi for each protein
- For each entry i,j in the matrix:
- Draw zi from θi
- Draw zj from θj
- Draw mij from the Bernoulli associated with the pair of z's
55. Question: How to model this? (we prefer this one)
[Figure: the same interaction matrix (index of protein 1 vs. index of protein 2); the multinomials define the blocks.]
Sparse block model of Parkinnen et al., 2007:
- Draw K multinomial distributions β
- For each row in the link relation:
- Draw a class pair (zL, zR) from a distribution over class pairs
- Draw a protein i from the left multinomial associated with zL
- Draw a protein j from the right multinomial associated with zR
- Add (i,j) to the link relation
56. Learning method: Gibbs sampling
- Pick random cluster labels (z1, z2) for each link
- Repeat until convergence:
- For each link (e1, e2):
- Re-estimate Pr(e1 | Z1 = z1) and Pr(e2 | Z2 = z2) from the current clusterings (easy to update!)
- Re-estimate Pr(Z1 = z1, Z2 = z2) from the current clusterings (easy to update!)
- Re-assign (e1, e2) to (z1, z2) randomly according to these estimates
57. Gibbs sampler for sparse block model
Sampling the class pair for a link: (probability of the class pair in the link corpus) x (probability of the two entities in their respective classes)
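A runnable sketch of this sampler; the smoothing hyperparameters, the count bookkeeping, and the synthetic links below are our own illustrative choices, not the exact estimator of Parkinnen et al.:

```python
import numpy as np

def gibbs_sparse_block(links, n_entities, K, iters=200,
                       alpha=1.0, beta=0.1, seed=0):
    """Gibbs sampling for the sparse block model: each link gets a class
    pair (zL, zR), sampled from (probability of the class pair in the link
    corpus) x (probability of the two entities in their classes)."""
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=(len(links), 2))
    pair = np.zeros((K, K))              # counts of class pairs over links
    left = np.zeros((K, n_entities))     # counts for the left multinomials
    right = np.zeros((K, n_entities))    # counts for the right multinomials
    for (i, j), (a, b) in zip(links, z):
        pair[a, b] += 1; left[a, i] += 1; right[b, j] += 1
    for _ in range(iters):
        for t, (i, j) in enumerate(links):
            a, b = z[t]                  # remove this link's current assignment
            pair[a, b] -= 1; left[a, i] -= 1; right[b, j] -= 1
            # easy-to-update smoothed estimates from the current counts
            p_left = (left[:, i] + beta) / (left.sum(axis=1) + beta * n_entities)
            p_right = (right[:, j] + beta) / (right.sum(axis=1) + beta * n_entities)
            p = (pair + alpha) * np.outer(p_left, p_right)
            p = (p / p.sum()).ravel()
            a, b = divmod(rng.choice(K * K, p=p), K)
            z[t] = a, b                  # re-assign and restore the counts
            pair[a, b] += 1; left[a, i] += 1; right[b, j] += 1
    return z

# Links drawn from two obvious blocks: entities 0-4 <-> 0-4 and 5-9 <-> 5-9.
rng = np.random.default_rng(1)
links = [(rng.integers(0, 5), rng.integers(0, 5)) for _ in range(30)] + \
        [(rng.integers(5, 10), rng.integers(5, 10)) for _ in range(30)]
z = gibbs_sparse_block(links, n_entities=10, K=2)
print(z[:5], z[-5:])
```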
58. How do these methods compare?
59. How do these methods compare?
60. Also: model entity-annotated text
English text
Vac1p coordinates Rab and phosphatidylinositol
3-kinase signaling in Vps45p-dependent vesicle
docking/fusion at the endosome. The vacuolar
protein sorting (VPS) pathway of Saccharomyces
cerevisiae mediates transport of vacuolar protein
precursors from the late Golgi to the
lysosome-like vacuole. Sorting of some vacuolar
proteins occurs via a prevacuolar endosomal
compartment and mutations in a subset of VPS
genes (the class D VPS genes) interfere with the
Golgi-to-endosome transport step. Several of the
encoded proteins, including Pep12p/Vps6p (an
endosomal target (t) SNARE) and Vps45p (a Sec1p
homologue), bind each other directly [1]. Another
of these proteins, Vac1p/Pep7p/Vps19p, associates
with Pep12p and binds phosphatidylinositol
3-phosphate (PI(3)P), the product of the Vps34
phosphatidylinositol 3-kinase (PI 3-kinase)
...... EP7, VPS45, VPS34, PEP12, VPS21
Protein annotations
61. BlockLDA: jointly modeling entity-entity links and entity-annotated text
Entity distributions are shared between blocks and topics.
63. Varying the Amount of Training Data
64. Another Performance Test
- Goal: predict functional categories of proteins
- 15 categories at the top level (e.g., metabolism, cellular communication, cell fate, …)
- Proteins have 2.1 categories on average
- Method for predicting categories:
- Run with 15 topics
- Using held-out labeled data, associate each topic with its closest category
- If a category has n true members, pick the top n proteins by probability of membership in the associated topic
- Metrics: F1, precision, recall
65. Performance
66. Another test: manual evaluation of topics by experts
67. Evaluation
Joint with Katie Rivard (MLD), John Woolford, and Jelena Jakovljevic (CMU Biology)
Topics from Block-LDA: trained on yeast publications + protein-protein interaction networks
Topics from plain vanilla LDA: trained on yeast publications only
- Evaluate topics by asking:
- Is the topic meaningful?
- If so:
- Which of the top 10 words are consistent with the topic's meaning?
- Which of the top 10 genes? Top 10 papers?
68. Let's ask people who know: Yeast Biologists
- Evaluate topics by asking:
- Are the top words for a topic meaningful?
- Are the top papers for a topic meaningful?
- Are the top genes for a topic meaningful?
70. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model (Parkinnen et al., 2007)
- Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents (in progress)
71. BlockLDA: adding regularization terms
72. BlockLDA: regularization
- Pseudo-observe low entropy for the role assignment to nodes -> slightly mixed membership
- A similar idea balances cluster sizes
75. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model (Parkinnen et al., 2007)
- Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents
76. Conclusions
- Two new methods
- PIC (≈ fast spectral clustering)
- Fast, robust
- Easily extends to bipartite graphs (e.g., document-term graphs)
- BlockLDA (≈ mixed-membership block models)
- Slower; longer convergence
- More flexible (mixed-membership) model
- Easier to extend to use side information
77. Thanks to
- NIH/NIGMS
- NSF
- Google
- Microsoft LiveLabs