Title: Fast Effective Clustering for Graphs and Documents
1. Fast Effective Clustering for Graphs and Documents
- William W. Cohen
- Machine Learning Dept. and Language Technologies Institute, School of Computer Science
- Carnegie Mellon University
- Joint work with Frank Lin and Ramnath Balasubramanyan
2. Introduction: trends in machine learning (1)
- Supervised learning: given data (x1,y1),…,(xn,yn), learn to predict y from x
- y is a real number or a member of a small set
- x is a (sparse) vector
- Semi-supervised learning: given data (x1,y1),…,(xk,yk), x(k+1),…,xn, learn to predict y from x
- Unsupervised learning: given data x1,…,xn, find a natural clustering
3. Introduction: trends in machine learning (2)
- Supervised learning: given data (x1,y1),…,(xn,yn), learn to predict y from x
- y is a real number or a member of a small set
- x is a (sparse) vector
- xs are all i.i.d., independent of each other
- y depends only on the corresponding x
- Structured learning: xs and/or ys are related to each other
4. Introduction: trends in machine learning (2)
- Structured learning: xs and/or ys are related to each other
- General case: x and y are in two parallel 1-d arrays
- xs are words in a document, y is a POS tag
- xs are words, y=1 if x is part of a company name
- xs are DNA codons, y=1 if x is part of a gene
- More general: xs are nodes in a graph, ys are labels for these nodes
5. Examples of classification in graphs
- x is a web page, edge is a hyperlink, y is topic
- x is a word, edge is co-occurrence in similar contexts, y is semantics (distributional clustering)
- x is a protein, edge is an interaction, y is subcellular location
- x is a person, edge is an email message, y is organization
- x is a person, edge is friendship, y=1 if x smokes
- x, y are anything; an edge from x1 to x2 indicates similarity between x1 and x2
6. Examples: Zachary's karate club, political books, protein-protein interactions, …
7. Political blog network
Adamic & Glance, "Divided They Blog", 2004
8. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model (Parkinnen et al., 2007)
- Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents
9. This talk
- Typical experiments
- For networks with known true labels: can unsupervised learning recover these labels?
10. Spectral Clustering: Graph = Matrix
[Figure: a 10-node graph with nodes A-J, shown next to its adjacency matrix; entry (i,j) is 1 iff nodes i and j are connected.]
11. Spectral Clustering: Graph = Matrix. Transitively Closed Components = "Blocks"
[Figure: the same graph with its adjacency matrix sorted by cluster, so the 1s form square blocks on the diagonal.]
Of course, we can't see the blocks unless the nodes are sorted by cluster.
12. Spectral Clustering: Graph = Matrix. Vector = node -> weight
[Figure: the adjacency matrix M next to a vector v that assigns a weight to each node, e.g. v(A)=3, v(B)=2, v(C)=3, …]
13. Spectral Clustering: Graph = Matrix. M v1 = v2 propagates weights from neighbors
[Figure: multiplying the adjacency matrix M by a weight vector v1 (with v1(A)=3, v1(B)=2, v1(C)=3, …) gives v2, in which each node's new weight combines the old weights of its neighbors.]
14. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
W is the adjacency matrix normalized so that columns sum to 1.
[Figure: multiplying W by v1 gives v2, in which each node's new weight is a weighted average of its neighbors' old weights.]
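The column-normalized propagation step can be sketched in numpy; the 4-node graph and the starting weights below are made-up toy values, not the slide's 10-node example:

```python
import numpy as np

# Toy symmetric adjacency matrix for a 4-node graph (no self-loops).
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 0.],
              [1., 1., 0., 1.],
              [0., 0., 1., 0.]])

# Normalize so each column sums to 1, as on the slide: W(i,j) = A(i,j)/deg(j).
W = A / A.sum(axis=0, keepdims=True)

v1 = np.array([3., 2., 3., 1.])  # arbitrary initial node weights
v2 = W @ v1                      # one propagation step: W v1 = v2

print(W.sum(axis=0))  # each column of W sums to 1
print(v2)
```

Each entry of v2 is a degree-weighted average of the neighbors' old weights, which is exactly the "averaging with neighbors" view used later in the talk.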
15. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
Q: How do I pick v to be an eigenvector for a block-stochastic matrix?
16. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
How do I pick v to be an eigenvector for a block-stochastic matrix?
17. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
[Figure: the eigenvalue spectrum λ2, λ3, λ4, λ5, … of W, with the corresponding eigenvectors e2, e3, …; an "eigengap" separates the few large eigenvalues from the rest.]
Shi & Meila, 2002
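The eigengap is easy to see numerically; the two-block toy graph and the 0.1 link weight below are our own illustrative choices:

```python
import numpy as np

# Two dense 5-node blocks joined by one weak edge: an approximately
# block-stochastic graph with k = 2 blocks.
n = 5
A = np.zeros((2 * n, 2 * n))
A[:n, :n] = 1.0
A[n:, n:] = 1.0
np.fill_diagonal(A, 0.0)
A[0, n] = A[n, 0] = 0.1  # weak inter-block link

# Random-walk normalization; eigenvalues are real because A is symmetric.
W = A / A.sum(axis=1, keepdims=True)
lam = np.sort(np.linalg.eigvals(W).real)[::-1]

# lam[0] = 1 (constant eigenvector e1); lam[1] is close to 1 (the block
# indicator e2); the "eigengap" lam[1] - lam[2] separates the informative
# eigenvectors from the rest of the spectrum.
print(lam[:3])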
18. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
[Figure: scatterplot of the graph's nodes in the (e2, e3) eigenvector coordinates; the points from the three underlying clusters (plotted as x, y, and z) form three well-separated groups.]
Shi & Meila, 2002
19. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
- If W is connected but roughly block-diagonal with k blocks, then
- the top eigenvector is a constant vector
- the next k eigenvectors are roughly piecewise constant, with pieces corresponding to blocks
20. Spectral Clustering: Graph = Matrix. W v1 = v2 propagates weights from neighbors
- If W is connected but roughly block-diagonal with k blocks, then
- the top eigenvector is a constant vector
- the next k eigenvectors are roughly piecewise constant, with pieces corresponding to blocks
- Spectral clustering:
- Find the top k+1 eigenvectors v1,…,v(k+1)
- Discard the top one
- Replace every node a with the k-dimensional vector xa = <v2(a),…,v(k+1)(a)>
- Cluster with k-means
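The four-step recipe above can be sketched in numpy. The eigenvalue weighting and the deterministic k-means initialization are our own illustrative choices, not the exact NCut implementation used in the experiments:

```python
import numpy as np

def spectral_clusters(A, k, iters=30):
    """Sketch of the recipe: take the top k+1 eigenvectors, discard the
    constant top one, embed each node, and cluster with k-means."""
    d = A.sum(axis=1)
    S = A / np.sqrt(np.outer(d, d))      # symmetric normalization of A
    vals, vecs = np.linalg.eigh(S)
    idx = np.argsort(vals)[::-1]         # eigenvalues, largest first
    # Drop e1; weight coordinates by eigenvalue so strong eigenvectors dominate.
    X = vecs[:, idx[1:k + 1]] * vals[idx[1:k + 1]]
    # Minimal k-means; initialize centers spread along the strongest coordinate.
    order = np.argsort(X[:, 0])
    C = X[order[np.linspace(0, len(X) - 1, k).astype(int)]]
    for _ in range(iters):
        lab = np.argmin(((X[:, None, :] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[lab == j].mean(axis=0) if np.any(lab == j) else C[j]
                      for j in range(k)])
    return lab

# Two dense blocks joined by a weak edge: the recovered labels split them.
A = np.zeros((10, 10))
A[:5, :5] = 1.0; A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[0, 5] = A[5, 0] = 0.1
labels = spectral_clusters(A, k=2)
print(labels)
```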
21. Spectral Clustering: Pros and Cons
- Elegant, and well-founded mathematically
- Tends to avoid local minima
- Optimal solution to a relaxed version of the mincut problem (Normalized cut, aka NCut)
- Works quite well when relations are approximately transitive (like similarity or social connections)
- Expensive for very large datasets
- Computing eigenvectors is the bottleneck
- Approximate eigenvector computation is not always useful
- Noisy datasets sometimes cause problems
- Picking the number of eigenvectors and k is tricky
- Informative eigenvectors need not be in the top few
- Performance can drop suddenly from good to terrible
22. Experimental results: best-case assignment of class labels to clusters
23. Spectral Clustering: Graph = Matrix. M v1 = v2 propagates weights from neighbors
[Figure: applying M to the weight vector again; e.g. v1(A)=3, v1(B)=2, v1(C)=3 becomes v2(A)=5, v2(B)=6, v2(C)=5.]
24. Repeated averaging with neighbors as a clustering method
- Pick a vector v0 (maybe at random)
- Compute v1 = W v0
- i.e., replace v0(x) with the weighted average of v0(y) for the neighbors y of x
- Plot v1(x) for each x
- Repeat for v2, v3, …
- Variants are widely used for semi-supervised learning
- with clamping of labels for nodes with known labels
- Without clamping, the iteration converges to a constant vt
- What are the dynamics of this process?
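These dynamics can be simulated directly; the two-block toy graph below is our own example, not the slide's three-cluster dataset:

```python
import numpy as np

def power_iterate(A, steps, seed=0):
    """Repeated averaging with neighbors: v_t = W v_{t-1}, W = D^-1 A."""
    W = A / A.sum(axis=1, keepdims=True)   # row-normalized: W 1 = 1
    rng = np.random.default_rng(seed)
    v = rng.random(A.shape[0])
    v /= v.sum()
    history = [v]
    for _ in range(steps):
        v = W @ v
        history.append(v)
    return history

# Two dense 5-node blocks joined by one weak edge.
A = np.zeros((10, 10))
A[:5, :5] = 1.0; A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[0, 5] = A[5, 0] = 0.1

v10 = power_iterate(A, 10)[-1]
# After a few steps v_t is almost constant within each block while the two
# blocks still differ -- the structure PIC exploits before v_t flattens
# into a single constant vector.
gap = abs(v10[:5].mean() - v10[5:].mean())
spread = max(np.ptp(v10[:5]), np.ptp(v10[5:]))
print(gap, spread)
```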
25. Repeated averaging with neighbors on a sample problem
[Figure: a 2-D point set with three clusters, labeled blue, green, and red.]
- Create a graph connecting each point in the 2-D initial space to all other points
- Edges weighted by distance
- Run power iteration for 10 steps
- Plot node id x vs. v10(x), with nodes ordered by actual cluster number
26. Repeated averaging with neighbors on a sample problem
[Figure: plots of vt(x) after successive iterations; the values for the blue, green, and red clusters flatten out, with larger differences between clusters than within them.]
27. Repeated averaging with neighbors on a sample problem
[Figure: further iterations; vt(x) becomes nearly piecewise constant on the blue, green, and red clusters.]
28. Repeated averaging with neighbors on a sample problem
[Figure: after many iterations the remaining differences become very small as vt approaches a constant vector.]
29. PIC: Power Iteration Clustering. Run power iteration (repeated averaging with neighbors) with early stopping.
- v0: random start, or derived from the degree matrix D, or …
- Easy to implement and efficient
- Very easily parallelized
- Experimentally, often better than traditional spectral methods
- Surprising, since the embedded space is 1-dimensional!
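A minimal PIC sketch, under assumed stopping and normalization choices (the threshold, the degree-based start, and the toy graph are illustrative; see the Lin & Cohen paper for the exact settings):

```python
import numpy as np

def pic(A, k, eps=1e-6, max_iter=2000):
    """Power Iteration Clustering sketch: iterate v <- W v, stop early when
    the per-node velocity stabilizes, then cluster the 1-d embedding."""
    n = A.shape[0]
    W = A / A.sum(axis=1, keepdims=True)
    v = A.sum(axis=1)                      # degree-based start
    v = v / v.sum()
    delta_prev = np.full(n, np.inf)
    for _ in range(max_iter):
        v_new = W @ v
        v_new = v_new / np.abs(v_new).sum()
        delta = np.abs(v_new - v)
        v = v_new
        if np.max(np.abs(delta - delta_prev)) < eps / n:
            break                          # early stop: velocity is constant
        delta_prev = delta
    # k-means on the 1-dimensional embedding v
    centers = np.linspace(v.min(), v.max(), k)
    for _ in range(100):
        lab = np.argmin(np.abs(v[:, None] - centers[None, :]), axis=1)
        centers = np.array([v[lab == j].mean() if np.any(lab == j)
                            else centers[j] for j in range(k)])
    return lab

# Asymmetric blocks (a 5-clique and a 4-clique) joined by a weak edge.
A = np.zeros((9, 9))
A[:5, :5] = 1.0; A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[0, 5] = A[5, 0] = 0.1
labels = pic(A, k=2)
print(labels)
```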
30. Experiments
- Network problems: natural graph structure
- PolBooks: 105 political books, 3 classes, linked by co-purchaser
- UMBCBlog: 404 political blogs, 2 classes, blogroll links
- AGBlog: 1222 political blogs, 2 classes, blogroll links
- Manifold problems: cosine distance between classification instances
- Iris: 150 flowers, 3 classes
- PenDigits01 / PenDigits17: 200 handwritten digits, 2 classes (0-1 or 1-7)
- 20ngA: 200 docs, misc.forsale vs soc.religion.christian
- 20ngB: 400 docs, misc.forsale vs soc.religion.christian
- 20ngC: 20ngB + 200 docs from talk.politics.guns
- 20ngD: 20ngC + 200 docs from rec.sport.baseball
31. Experimental results: best-case assignment of class labels to clusters
33. Experiments: run time and scalability
[Figure: run time in milliseconds for each method and dataset.]
34. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Experiments
- Analysis
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model …
- Variants: BlockLDA etc.
35. Analysis: why is this working?
36. Analysis: why is this working?
37. Analysis: why is this working?
[Figure: derivation of the PIC distance (an L2 distance between nodes' iterate values), annotated with notes on scaling, on differences that might cancel, and on noise terms.]
38. Analysis: why is this working?
- If:
- eigenvectors e2,…,ek are approximately piecewise constant on blocks
- λ2,…,λk are large and λ(k+1),… are small (e.g., if the matrix is block-stochastic)
- the ci's for v0 are bounded
- for any a, b from distinct blocks there is at least one ei with |ei(a) - ei(b)| large
- Then there exists an R so that:
- spec(a,b) small <=> R·pic(a,b) small
39. Analysis: why is this working?
- Sum of differences vs. sum-of-squared differences
- "Soft" eigenvector selection
40. Ncut with top k eigenvectors
Ncut with top 10 eigenvectors, weighted
41. PIC
42. Summary of results so far
- Both PIC and Ncut embed each graph node in a space where distance is meaningful
- Distances in PIC space and eigenspace are closely related
- At least for many graphs suited to spectral clustering
- PIC does a "soft" selection of eigenvectors
- Strong eigenvalues give high weights
- PIC gives comparable-quality clusters
- But is much faster
43. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model …
- Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents
44. Motivation: Experimental Datasets are…
- Network problems: natural graph structure
- PolBooks: 105 political books, 3 classes, linked by co-purchaser
- UMBCBlog: 404 political blogs, 2 classes, blogroll links
- AGBlog: 1222 political blogs, 2 classes, blogroll links
- Also: Zachary's karate club, citation networks, …
- Manifold problems: cosine distance between all pairs of classification instances
- Iris: 150 flowers, 3 classes
- PenDigits01 / PenDigits17: 200 handwritten digits, 2 classes (0-1 or 1-7)
- 20ngA: 200 docs, misc.forsale vs soc.religion.christian
- 20ngB: 400 docs, misc.forsale vs soc.religion.christian
Gets expensive fast.
45. Lazy computation of distances and normalizers
- Recall PIC's update:
- vt = W vt-1 = D^-1 A vt-1
- where D is the diagonal degree matrix: D = diag(A 1)
- My favorite distance metric for text is length-normalized TFIDF:
- Defn: A(i,j) = <vi, vj> / (||vi|| ||vj||)
- Let N(i,i) = ||vi|| and N(i,j) = 0 for i != j
- Let F(i,k) = TFIDF weight of word wk in document vi
- Then A = N^-1 F F^T N^-1
(1 is a column vector of 1s; <u,v> is the inner product; ||u|| is the L2 norm)
46. Lazy computation of distances and normalizers
Equivalent to using TFIDF/cosine on all pairs of examples, but requires only sparse matrices.
- Recall PIC's update:
- vt = W vt-1 = D^-1 A vt-1
- where D is the diagonal degree matrix: D = diag(A 1)
- Let F(i,k) = TFIDF weight of word wk in document vi
- Compute N(i,i) = ||vi|| and N(i,j) = 0 for i != j
- Don't compute A = N^-1 F F^T N^-1 explicitly
- Let D(i,i) = [N^-1 F F^T N^-1 1](i), where 1 is an all-1s vector
- Computed as N^-1 (F (F^T (N^-1 1))) for efficiency
- New update: vt = D^-1 A vt-1 = D^-1 N^-1 F F^T N^-1 vt-1
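The factored update can be checked against the dense computation in a few lines; here dense numpy arrays stand in for the sparse matrices, with F's rows as documents:

```python
import numpy as np

def lazy_pic_step(F, v):
    """One PIC update v <- D^-1 A v for A = N^-1 F F^T N^-1, computed as
    nested matrix-vector products so the dense n x n matrix A is never built."""
    n_inv = 1.0 / np.linalg.norm(F, axis=1)          # diagonal of N^-1
    d = n_inv * (F @ (F.T @ n_inv))                  # D = diag(A 1), lazily
    return (n_inv * (F @ (F.T @ (n_inv * v)))) / d   # D^-1 A v, lazily

# Check against the explicit dense computation on a toy TFIDF matrix.
rng = np.random.default_rng(0)
F = rng.random((4, 6))            # 4 documents, 6 terms
v = rng.random(4)
N_inv = np.diag(1.0 / np.linalg.norm(F, axis=1))
A = N_inv @ F @ F.T @ N_inv       # dense cosine-similarity matrix
dense = (A @ v) / (A @ np.ones(4))
print(np.allclose(lazy_pic_step(F, v), dense))
```

With a sparse F, each update costs one sparse mat-vec with F and one with F^T, instead of the quadratic cost of materializing A.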
47. Experimental results
- RCV1 text classification dataset
- 800k newswire stories
- Category labels from an industry vocabulary
- Took single-label documents and categories with at least 500 instances
- Result: 193,844 documents, 103 categories
- Generated 100 random category pairs
- Each is all documents from two categories
- Pairs range in size and difficulty
- Pick category 1, with m1 examples
- Pick category 2 such that 0.5·m1 < m2 < 2·m1
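The pair-generation step can be sketched as follows; the category names and sizes are made up, and only the size constraint 0.5·m1 < m2 < 2·m1 comes from the slide:

```python
import numpy as np

def sample_category_pairs(cat_sizes, n_pairs, seed=0):
    """Pick category 1 at random, then a category 2 whose size m2
    satisfies 0.5*m1 < m2 < 2*m1 (retrying when none qualifies)."""
    rng = np.random.default_rng(seed)
    cats = sorted(cat_sizes)
    pairs = []
    while len(pairs) < n_pairs:
        c1 = cats[rng.integers(len(cats))]
        m1 = cat_sizes[c1]
        ok = [c for c in cats
              if c != c1 and 0.5 * m1 < cat_sizes[c] < 2 * m1]
        if ok:                     # skip category 1 if nothing is close in size
            pairs.append((c1, ok[rng.integers(len(ok))]))
    return pairs

# Hypothetical category sizes, for illustration only.
sizes = {"I101": 700, "I202": 900, "I303": 1300}
pairs = sample_category_pairs(sizes, n_pairs=5)
print(pairs)
```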
48. Results
- NCUTevd: Ncut with exact eigenvectors
- NCUTiram: implicitly restarted Arnoldi method
- No statistically significant differences between NCUTevd and PIC
49. Results
50. Results
51. Results
52. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model (Parkinnen et al., 2007)
- Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents
53. Question: How to model this?
MMSBM of Airoldi et al.:
- Draw K^2 Bernoulli distributions
- Draw a θi for each protein
- For each entry i,j in the matrix:
- Draw zi from θi
- Draw zj from θj
- Draw mij from the Bernoulli associated with the pair of z's
54. Question: How to model this?
[Figure: protein-protein interaction matrix (index of protein 1 vs. index of protein 2); a filled entry means p1 and p2 do interact.]
MMSBM of Airoldi et al.:
- Draw K^2 Bernoulli distributions
- Draw a θi for each protein
- For each entry i,j in the matrix:
- Draw zi from θi
- Draw zj from θj
- Draw mij from the Bernoulli associated with the pair of z's
55. Question: How to model this? (we prefer this one)
[Figure: the same interaction matrix (index of protein 1 vs. index of protein 2); the multinomials define the blocks.]
Sparse block model of Parkinnen et al., 2007:
- Draw K multinomial distributions β
- For each row in the link relation:
- Draw a class pair (zL, zR) from a distribution over class pairs
- Draw a protein i from the left multinomial associated with zL
- Draw a protein j from the right multinomial associated with zR
- Add (i,j) to the link relation
56. Learning method: Gibbs sampling
- Pick random cluster labels (z1, z2) for each link
- Repeat until convergence:
- For each link (e1, e2):
- Re-estimate Pr(e1 | Z1 = z1) and Pr(e2 | Z2 = z2) from the current clusterings (easy to update!)
- Re-estimate Pr(Z1 = z1, Z2 = z2) from the current clusterings (easy to update!)
- Re-assign (e1, e2) to (z1, z2) randomly according to these estimates
57. Gibbs sampler for sparse block model
Sampling the class pair for a link: (probability of the class pair in the link corpus) x (probability of the two entities in their respective classes)
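A runnable sketch of this sampler; the smoothing hyperparameters, the count bookkeeping, and the synthetic links below are our own illustrative choices, not the exact estimator of Parkinnen et al.:

```python
import numpy as np

def gibbs_sparse_block(links, n_entities, K, iters=200,
                       alpha=1.0, beta=0.1, seed=0):
    """Gibbs sampling for the sparse block model: each link gets a class
    pair (zL, zR), sampled from (probability of the class pair in the link
    corpus) x (probability of the two entities in their classes)."""
    rng = np.random.default_rng(seed)
    z = rng.integers(K, size=(len(links), 2))
    pair = np.zeros((K, K))              # counts of class pairs over links
    left = np.zeros((K, n_entities))     # counts for the left multinomials
    right = np.zeros((K, n_entities))    # counts for the right multinomials
    for (i, j), (a, b) in zip(links, z):
        pair[a, b] += 1; left[a, i] += 1; right[b, j] += 1
    for _ in range(iters):
        for t, (i, j) in enumerate(links):
            a, b = z[t]                  # remove this link's current assignment
            pair[a, b] -= 1; left[a, i] -= 1; right[b, j] -= 1
            # easy-to-update smoothed estimates from the current counts
            p_left = (left[:, i] + beta) / (left.sum(axis=1) + beta * n_entities)
            p_right = (right[:, j] + beta) / (right.sum(axis=1) + beta * n_entities)
            p = (pair + alpha) * np.outer(p_left, p_right)
            p = (p / p.sum()).ravel()
            a, b = divmod(rng.choice(K * K, p=p), K)
            z[t] = a, b                  # re-assign and restore the counts
            pair[a, b] += 1; left[a, i] += 1; right[b, j] += 1
    return z

# Links drawn from two obvious blocks: entities 0-4 <-> 0-4 and 5-9 <-> 5-9.
rng = np.random.default_rng(1)
links = [(rng.integers(0, 5), rng.integers(0, 5)) for _ in range(30)] + \
        [(rng.integers(5, 10), rng.integers(5, 10)) for _ in range(30)]
z = gibbs_sparse_block(links, n_entities=10, K=2)
print(z[:5], z[-5:])
```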
58. How do these methods compare?
59. How do these methods compare?
60. Also: model entity-annotated text
English text
Vac1p coordinates Rab and phosphatidylinositol
3-kinase signaling in Vps45p-dependent vesicle
docking/fusion at the endosome. The vacuolar
protein sorting (VPS) pathway of Saccharomyces
cerevisiae mediates transport of vacuolar protein
precursors from the late Golgi to the
lysosome-like vacuole. Sorting of some vacuolar
proteins occurs via a prevacuolar endosomal
compartment and mutations in a subset of VPS
genes (the class D VPS genes) interfere with the
Golgi-to-endosome transport step. Several of the
encoded proteins, including Pep12p/Vps6p (an
endosomal target (t) SNARE) and Vps45p (a Sec1p
homologue), bind each other directly [1]. Another
of these proteins, Vac1p/Pep7p/Vps19p, associates
with Pep12p and binds phosphatidylinositol
3-phosphate (PI(3)P), the product of the Vps34
phosphatidylinositol 3-kinase (PI 3-kinase)
...... EP7, VPS45, VPS34, PEP12, VPS21
Protein annotations
61. BlockLDA: jointly modeling entity-entity links and entity-annotated text
Entity distributions are shared between blocks and topics.
63. Varying the Amount of Training Data
64. Another Performance Test
- Goal: predict functional categories of proteins
- 15 categories at the top level (e.g., metabolism, cellular communication, cell fate, …)
- Proteins have 2.1 categories on average
- Method for predicting categories:
- Run with 15 topics
- Using held-out labeled data, associate each topic with its closest category
- If a category has n true members, pick the top n proteins by probability of membership in the associated topic
- Metrics: F1, precision, recall
65. Performance
66. Another test: manual evaluation of topics by experts
67. Evaluation
Joint with Katie Rivard (MLD), John Woolford, and Jelena Jakovljevic (CMU Biology)
Topics from Block-LDA: trained on yeast publications + protein-protein interaction networks
Topics from plain vanilla LDA: trained on yeast publications only
- Evaluate topics by asking:
- Is the topic meaningful?
- If so:
- Which of the top 10 words are consistent with the topic's meaning?
- Which of the top 10 genes? Top 10 papers?
68. Let's ask people who know: Yeast Biologists
- Evaluate topics by asking:
- Are the top words for a topic meaningful?
- Are the top papers for a topic meaningful?
- Are the top genes for a topic meaningful?
70. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model (Parkinnen et al., 2007)
- Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents (in progress)
71. BlockLDA: adding regularization terms
72. BlockLDA: regularization
- Pseudo-observe low entropy for the role assignment to nodes -> slightly mixed membership
- A similar idea balances cluster sizes
75. Outline
- Spectral methods
- Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
- Variant: PIC for document clustering
- Stochastic block models
- Mixed-membership sparse block model (Parkinnen et al., 2007)
- Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents
76. Conclusions
- Two new methods
- PIC (≈ fast spectral clustering)
- Fast, robust
- Easily extends to bipartite graphs (e.g., document-term graphs)
- BlockLDA (≈ mixed-membership block models)
- Slower; longer convergence
- More flexible (mixed-membership) model
- Easier to extend to use side information
77. Thanks to
- NIH/NIGMS
- NSF
- Google
- Microsoft LiveLabs