Fast Effective Clustering for Graphs and Documents
Transcript and Presenter's Notes
1
Fast Effective Clustering for Graphs and Documents
  • William W. Cohen
  • Machine Learning Dept. and Language Technologies
    Institute
  • School of Computer Science
  • Carnegie Mellon University
  • Joint work with Frank Lin and
  • Ramnath Balasubramanyan

2
Introduction: trends in machine learning (1)
  • Supervised learning: given data (x1,y1),...,(xn,yn), learn to predict y from x
  • y is a real number or member of a small set
  • x is a (sparse) vector
  • Semi-supervised learning: given data (x1,y1),...,(xk,yk), xk+1,...,xn, learn to predict y from x
  • Unsupervised learning: given data x1,...,xn, find a natural clustering

3
Introduction: trends in machine learning (2)
  • Supervised learning: given data (x1,y1),...,(xn,yn), learn to predict y from x
  • y is a real number or member of a small set
  • x is a (sparse) vector
  • x's are all i.i.d., independent of each other
  • y depends only on the corresponding x
  • Structured learning: x's and/or y's are related to each other

4
Introduction: trends in machine learning (2)
  • Structured learning: x's and/or y's are related to each other
  • General case: x and y are in two parallel 1-d arrays
  • x's are words in a document, y is a POS tag
  • x's are words, y=1 if x is part of a company name
  • x's are DNA codons, y=1 if x is part of a gene
  • More general: x's are nodes in a graph, y's are labels for these nodes

5
Examples of classification in graphs
  • x is a web page, edge is hyperlink, y is topic
  • x is a word, edge is co-occurrence in similar
    contexts, y is semantics (distributional
    clustering)
  • x is a protein, edge is interaction, y is
    subcellular location
  • x is a person, edge is email message, y is
    organization
  • x is a person, edge is friendship, y=1 if x smokes
  • x,y are anything, edge from x1 to x2 indicates
    similarity between x1 and x2

6
Examples: Zachary's karate club, political books, protein-protein interactions, ...
7
Political blog network
Adamic & Glance, "Divided They Blog", 2004
8
Outline
  • Spectral methods
  • Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
  • Variant: PIC for document clustering
  • Stochastic block models
  • Mixed-membership sparse block model (Parkinnen et al., 2007)
  • Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents

9
This talk
  • Typical experiments
  • For networks with known true labels
  • can unsupervised learning recover these labels?

10
Spectral Clustering: Graph = Matrix
[Figure: a small undirected graph on nodes A-J and its adjacency matrix, with a 1 in entry (i,j) for each edge]
11
Spectral Clustering: Graph = Matrix
Transitively Closed Components = "Blocks"
[Figure: the adjacency matrix for nodes A-J with nodes ordered by cluster, so each transitively closed component appears as a block on the diagonal]
Of course we can't see the blocks unless the nodes are sorted by cluster (see the sketch below)
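A minimal Python sketch of the "graph = matrix" idea (the edge list below is invented for illustration, not taken from the slide): build an adjacency matrix from an edge list, and note that the block structure is only visible because the nodes happen to be listed cluster by cluster.

import numpy as np

nodes = list("ABCDEFGHIJ")                    # hypothetical node names
edges = [("A", "B"), ("A", "C"), ("B", "C"),  # hypothetical cluster 1
         ("D", "E"), ("D", "F"), ("E", "F"),  # hypothetical cluster 2
         ("G", "H"), ("H", "I"), ("I", "J"),  # hypothetical cluster 3
         ("C", "D")]                          # one cross-cluster edge
idx = {n: i for i, n in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for u, v in edges:                            # undirected graph: symmetric matrix
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0
print(A)  # nearly block diagonal, because the nodes are already sorted by cluster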
12
Spectral Clustering: Graph = Matrix
Vector = Node → Weight
[Figure: the adjacency matrix M for nodes A-J alongside a vector v assigning a weight to each node (e.g., A=3, B=2, C=3, ...)]
13
Spectral Clustering: Graph = Matrix
M v1 = v2 "propagates weights from neighbors"
[Figure: multiplying the adjacency matrix M by a weight vector v1 (A=3, B=2, C=3, ...) gives v2, where each node's new weight is the sum of its neighbors' weights]
14
Spectral Clustering: Graph = Matrix
W v1 = v2 "propagates weights from neighbors"
W: normalized so columns sum to 1
[Figure: multiplying the normalized matrix W (entries such as .3 and .5) by a weight vector v1 gives v2, where each node's new weight is a weighted average of its neighbors' weights]
15
Spectral Clustering: Graph = Matrix
W v1 = v2 "propagates weights from neighbors"
Q: How do I pick v to be an eigenvector for a block-stochastic matrix?
16
Spectral Clustering: Graph = Matrix
W v1 = v2 "propagates weights from neighbors"
How do I pick v to be an eigenvector for a block-stochastic matrix?
17
Spectral Clustering: Graph = Matrix
W v1 = v2 "propagates weights from neighbors"
[Figure: the spectrum of W, showing eigenvalues λ2, λ3, λ4, λ5, λ6, λ7, ... and the "eigengap" separating the leading eigenvalues (eigenvectors e2, e3) from the rest]
Shi & Meila, 2002
18
Spectral Clustering: Graph = Matrix
W v1 = v2 "propagates weights from neighbors"
[Figure: nodes plotted in the (e2, e3) eigenvector coordinates (roughly -0.4 to 0.4 on each axis); points from the three clusters (x, y, z) fall into well-separated groups]
Shi & Meila, 2002
19
Spectral Clustering: Graph = Matrix
W v1 = v2 "propagates weights from neighbors"
  • If W is connected but roughly block diagonal with k blocks, then
  • the top eigenvector is a constant vector
  • the next k eigenvectors are roughly piecewise constant, with pieces corresponding to blocks

20
Spectral Clustering: Graph = Matrix
W v1 = v2 "propagates weights from neighbors"
  • If W is connected but roughly block diagonal with k blocks, then
  • the top eigenvector is a constant vector
  • the next k eigenvectors are roughly piecewise constant, with pieces corresponding to blocks
  • Spectral clustering (a sketch follows below):
  • Find the top k+1 eigenvectors v1,...,vk+1
  • Discard the top one
  • Replace every node a with the k-dimensional vector xa = <v2(a),...,vk+1(a)>
  • Cluster with k-means

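A minimal sketch of the spectral clustering recipe above, using numpy and scikit-learn (my own illustration, not the authors' code). It assumes a dense symmetric adjacency matrix A, works with the symmetrically normalized matrix so that np.linalg.eigh applies, and maps the eigenvectors back to the random-walk form before running k-means.

import numpy as np
from sklearn.cluster import KMeans

def spectral_clusters(A, k, seed=0):
    d = A.sum(axis=1)                       # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ A @ D_inv_sqrt         # symmetric counterpart of W = D^-1 A
    vals, vecs = np.linalg.eigh(S)          # eigenvalues in ascending order
    top = vecs[:, -(k + 1):-1]              # top k+1 eigenvectors, minus the very top one
    X = D_inv_sqrt @ top                    # eigenvectors of W: roughly piecewise constant on blocks
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)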
21
Spectral Clustering Pros and Cons
  • Elegant, and well-founded mathematically
  • Tends to avoid local minima
  • Optimal solution to relaxed version of mincut
    problem (Normalized cut, aka NCut)
  • Works quite well when relations are approximately
    transitive (like similarity, social connections)
  • Expensive for very large datasets
  • Computing eigenvectors is the bottleneck
  • Approximate eigenvector computation not always
    useful
  • Noisy datasets sometimes cause problems
  • Picking number of eigenvectors and k is tricky
  • Informative eigenvectors need not be in top few
  • Performance can drop suddenly from good to
    terrible

22
Experimental results: best-case assignment of class labels to clusters
23
Spectral Clustering: Graph = Matrix
M v1 = v2 "propagates weights from neighbors"
[Figure: another propagation step; the weight vector (A=3, B=2, C=3, ...) becomes (A=5, B=6, C=5, ...) after multiplying by M]
24
Repeated averaging with neighbors as a clustering method
  • Pick a vector v0 (maybe at random)
  • Compute v1 = W v0
  • i.e., replace v0(x) with the weighted average of v0(y) for the neighbors y of x
  • Plot v1(x) for each x
  • Repeat for v2, v3, ...
  • Variants widely used for semi-supervised learning
  • clamping of labels for nodes with known labels
  • Without clamping, will converge to a constant vt
  • What are the dynamics of this process? (a sketch of the update follows below)

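A minimal sketch of the update above (my own illustration): W = D^-1 A, and each step replaces every node's value with the weighted average of its neighbors' values; the vector is rescaled each step so it does not simply shrink toward zero.

import numpy as np

def repeated_averaging(A, v0, steps=10):
    d = A.sum(axis=1)
    W = A / d[:, None]              # row-normalize: W = D^-1 A
    vs = [v0]
    for _ in range(steps):
        v = W @ vs[-1]              # each entry becomes an average of its neighbors' values
        vs.append(v / np.abs(v).sum())
    return vs                       # vs[t][x] can be plotted for each node x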
25
Repeated averaging with neighbors on a sample problem
[Figure: a 2-D point set with three clusters, labeled blue, green, and red]
  • Create a graph, connecting all points in the 2-D initial space to all other points
  • Weighted by distance
  • Run power iteration for 10 steps
  • Plot node id x vs. v10(x)
  • nodes are ordered by actual cluster number

26
Repeated averaging with neighbors on a sample problem
[Figure: vt(x) plotted for each node after early iterations, nodes ordered by cluster (blue, green, red); between-cluster differences are larger, within-cluster differences smaller]
27
Repeated averaging with neighbors on a sample problem
[Figure: vt(x) over successive iterations; nodes from the same cluster (blue, green, red) converge toward similar values]
28
Repeated averaging with neighbors on a sample problem
[Figure: after many iterations the remaining differences are very small, as vt approaches a constant vector]
29
PIC: Power Iteration Clustering
Run power iteration (repeated averaging w/ neighbors) with early stopping (a sketch follows below)
  • v0: a random start, or the degree matrix D, or ...
  • Easy to implement and efficient
  • Very easily parallelized
  • Experimentally, often better than traditional spectral methods
  • Surprising since the embedded space is 1-dimensional!

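A sketch of PIC in numpy/scikit-learn (my own illustration; the stopping rule below, stop when the per-node change in velocity is tiny, is one reasonable reading of "early stopping", and the exact rule and threshold in the paper may differ).

import numpy as np
from sklearn.cluster import KMeans

def pic(A, k, max_iter=1000, tol=1e-5, seed=0):
    n = A.shape[0]
    d = A.sum(axis=1)
    W = A / d[:, None]                          # W = D^-1 A
    v = d / d.sum()                             # start from the normalized degree vector
    delta_prev = np.full(n, np.inf)
    for _ in range(max_iter):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()            # keep the vector on a fixed scale
        delta = np.abs(v_new - v)               # per-node "velocity"
        if np.max(np.abs(delta - delta_prev)) < tol / n:
            v = v_new
            break                               # stop early, before convergence to a constant
        v, delta_prev = v_new, delta
    # the embedding is 1-dimensional: cluster the entries of v with k-means
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(v.reshape(-1, 1))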
30
Experiments
  • Network problems: natural graph structure
  • PolBooks: 105 political books, 3 classes, linked by co-purchaser
  • UMBCBlog: 404 political blogs, 2 classes, blogroll links
  • AGBlog: 1222 political blogs, 2 classes, blogroll links
  • "Manifold" problems: cosine distance between classification instances
  • Iris: 150 flowers, 3 classes
  • PenDigits01, PenDigits17: 200 handwritten digits, 2 classes (0-1 or 1-7)
  • 20ngA: 200 docs, misc.forsale vs soc.religion.christian
  • 20ngB: 400 docs, misc.forsale vs soc.religion.christian
  • 20ngC: 20ngB + 200 docs from talk.politics.guns
  • 20ngD: 20ngC + 200 docs from rec.sport.baseball

31
Experimental results: best-case assignment of class labels to clusters
32
(No Transcript)
33
Experiments: run time and scalability
(Time in milliseconds)
34
Outline
  • Spectral methods
  • Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
  • Experiments
  • Analysis
  • Variant: PIC for document clustering
  • Stochastic block models
  • Mixed-membership sparse block model ...
  • Variants: BlockLDA etc.

35
Analysis: why is this working?
36
Analysis: why is this working?
37
Analysis: why is this working?
[Figure: comparison of the PIC distance with the spectral (L2) distance, annotated with questions about scaling, whether differences might cancel, and noise terms]
38
Analysis: why is this working?
  • If:
  • eigenvectors e2,...,ek are approximately piecewise constant on blocks
  • λ2,...,λk are large and λk+1,... are small
  • e.g., if the matrix is block-stochastic
  • the ci's for v0 are bounded
  • for any a, b from distinct blocks there is at least one ei with |ei(a) - ei(b)| large
  • Then there exists an R so that:
  • spec(a,b) small ⇒ R·pic(a,b) small

39
Analysis: why is this working?
  • Sum of differences vs. sum-of-squared differences
  • "soft" eigenvector selection

40
Ncut with top k eigenvectors
Ncut with top 10 eigenvectors weighted
41
PIC
42
Summary of results so far
  • Both PIC and Ncut embed each graph node in a
    space where distance is meaningful
  • Distances in PIC space and Eigenspace are
    closely related
  • At least for many graphs suited to spectral
    clustering
  • PIC does soft selection of eigenvectors
  • Strong eigenvalues give high weights
  • PIC gives comparable-quality clusters
  • But is much faster

43
Outline
  • Spectral methods
  • Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
  • Variant: PIC for document clustering
  • Stochastic block models
  • Mixed-membership sparse block model ...
  • Variants: BlockLDA etc.

44
Motivation: the experimental datasets are ...
  • Network problems: natural graph structure
  • PolBooks: 105 political books, 3 classes, linked by co-purchaser
  • UMBCBlog: 404 political blogs, 2 classes, blogroll links
  • AGBlog: 1222 political blogs, 2 classes, blogroll links
  • Also: Zachary's karate club, citation networks, ...
  • "Manifold" problems: cosine distance between all pairs of classification instances
  • Iris: 150 flowers, 3 classes
  • PenDigits01, PenDigits17: 200 handwritten digits, 2 classes (0-1 or 1-7)
  • 20ngA: 200 docs, misc.forsale vs soc.religion.christian
  • 20ngB: 400 docs, misc.forsale vs soc.religion.christian

Gets expensive fast
45
Lazy computation of distances and normalizers
  • Recall PIC's update is
  • vt = W vt-1 = D^-1 A vt-1
  • where D is the diagonal degree matrix: D = A·1
  • My favorite distance metric for text is length-normalized TFIDF:
  • Defn: A(i,j) = <vi,vj> / (||vi|| ||vj||)
  • Let N(i,i) = ||vi||, and N(i,j) = 0 for i ≠ j
  • Let F(i,k) = TFIDF weight of word wk in document vi
  • Then A = N^-1 F F^T N^-1

1 is a column vector of 1's
<u,v> = inner product; ||u|| is the L2-norm
46
Lazy computation of distances and normalizers
Equivalent to using TFIDF/cosine on all pairs of examples, but requires only sparse matrices (a sketch follows below)
  • Recall PIC's update is
  • vt = W vt-1 = D^-1 A vt-1
  • where D is the diagonal degree matrix: D = A·1
  • Let F(i,k) = TFIDF weight of word wk in document vi
  • Compute N(i,i) = ||vi||, and N(i,j) = 0 for i ≠ j
  • Don't compute A = N^-1 F F^T N^-1
  • Let D(i,i) = [N^-1 F F^T N^-1 1]_i, where 1 is an all-1's vector
  • Computed as D = N^-1 (F (F^T (N^-1 1))) for efficiency
  • New update:
  • vt = D^-1 A vt-1 = D^-1 N^-1 F (F^T (N^-1 vt-1))

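A sketch of one lazy update in Python with scipy sparse matrices (my reconstruction of the slide's formulas, with F taken to be the sparse documents-by-terms TFIDF matrix): the dense similarity matrix A = N^-1 F F^T N^-1 is never materialized, and everything is computed right-to-left as matrix-vector products.

import numpy as np

def lazy_text_pic_step(F, v):
    """One PIC step v_t = D^-1 N^-1 F (F^T (N^-1 v)); F is a scipy.sparse docs-by-terms TFIDF matrix."""
    norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())  # N(i,i) = ||v_i|| (row L2 norms)
    n_inv = 1.0 / norms
    ones = np.ones(F.shape[0])
    deg = n_inv * (F @ (F.T @ (n_inv * ones)))       # D(i,i) = [N^-1 F F^T N^-1 1]_i
    v_new = (n_inv * (F @ (F.T @ (n_inv * v)))) / deg  # D^-1 A v, computed without forming A
    return v_new / np.abs(v_new).sum()               # rescale as in PIC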
47
Experimental results
  • RCV1 text classification dataset
  • 800k newswire stories
  • Category labels from industry vocabulary
  • Took single-label documents and categories with at least 500 instances
  • Result: 193,844 documents, 103 categories
  • Generated 100 random category pairs (a sketch follows below)
  • Each is all documents from two categories
  • Range in size and difficulty
  • Pick category 1, with m1 examples
  • Pick category 2 such that 0.5*m1 < m2 < 2*m1

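A small sketch of the pair-generation rule above (a hypothetical helper, my own illustration): repeatedly pick a first category, then a second whose size m2 satisfies 0.5*m1 < m2 < 2*m1.

import random

def sample_category_pairs(category_sizes, n_pairs=100, seed=0):
    rng = random.Random(seed)
    cats = list(category_sizes)
    pairs = []
    while len(pairs) < n_pairs:
        c1 = rng.choice(cats)                      # category 1, with m1 examples
        m1 = category_sizes[c1]
        ok = [c for c in cats if c != c1 and 0.5 * m1 < category_sizes[c] < 2 * m1]
        if ok:                                     # category 2 must be comparable in size
            pairs.append((c1, rng.choice(ok)))
    return pairs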
48
Results
  • NCUTevd: NCut with exact eigenvectors
  • NCUTiram: NCut with the implicitly restarted Arnoldi method
  • No statistically significant differences between NCUTevd and PIC

49
Results
50
Results
51
Results
52
Outline
  • Spectral methods
  • Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
  • Variant: PIC for document clustering
  • Stochastic block models
  • Mixed-membership sparse block model (Parkinnen et al., 2007)
  • Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents

53
Question: How to model this?
MMSBM of Airoldi et al.
  • Draw K^2 Bernoulli distributions
  • Draw a θi for each protein
  • For each entry (i,j) in the matrix:
  • Draw zi from θi
  • Draw zj from θj
  • Draw mij from the Bernoulli associated with the pair of z's

54
Question: How to model this?
MMSBM of Airoldi et al.
[Figure: protein-protein interaction matrix; the axes are the index of protein 1 and the index of protein 2, and a marked entry means p1, p2 do interact]
  • Draw K^2 Bernoulli distributions
  • Draw a θi for each protein
  • For each entry (i,j) in the matrix:
  • Draw zi from θi
  • Draw zj from θj
  • Draw mij from the Bernoulli associated with the pair of z's
55
Question: How to model this?
Sparse block model of Parkinnen et al., 2007 (the one we prefer); a generative sketch follows below
[Figure: protein-protein interaction matrix; the axes are the index of protein 1 and the index of protein 2, a marked entry means p1, p2 do interact, and the class pairs define the blocks]
  • Draw K multinomial distributions β
  • For each row in the link relation:
  • Draw a class pair (zL, zR) from the distribution over class pairs
  • Draw a protein i from the left multinomial associated with zL
  • Draw a protein j from the right multinomial associated with zR
  • Add (i,j) to the link relation
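A generative sketch of the sparse block model just described (my own illustration; the parameter names and shapes are assumptions): pi is a distribution over the K*K class pairs, and beta_L[z], beta_R[z] are multinomials over proteins for the left and right positions.

import numpy as np

def sample_links(pi, beta_L, beta_R, n_links, seed=0):
    rng = np.random.default_rng(seed)
    K = beta_L.shape[0]
    links = []
    for _ in range(n_links):
        pair = rng.choice(K * K, p=pi.ravel())          # draw the class pair (zL, zR)
        zL, zR = divmod(pair, K)
        i = rng.choice(beta_L.shape[1], p=beta_L[zL])   # left protein from its multinomial
        j = rng.choice(beta_R.shape[1], p=beta_R[zR])   # right protein from its multinomial
        links.append((i, j))                            # add (i, j) to the link relation
    return links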
56
Learning method: Gibbs sampling (a sketch follows below)
  • Pick random cluster labels (z1, z2) for each link
  • Repeat until convergence:
  • For each link (e1, e2):
  • Re-estimate Pr(e1 | Z1=z1) and Pr(e2 | Z2=z2) from the current clusterings (easy to update!)
  • Re-estimate Pr(Z1=z1, Z2=z2) from the current clusterings (easy to update!)
  • Re-assign (e1, e2) to (z1, z2) randomly according to these estimates
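A sketch of the sampler above in Python (my own illustration; the symmetric Dirichlet smoothing constants alpha and gamma are assumptions): counts for each link's class pair are removed, a new pair is drawn in proportion to Pr(class pair) * Pr(e1 | left class) * Pr(e2 | right class), and the counts are put back, which is what makes the estimates easy to update.

import numpy as np

def gibbs_block_model(links, n_entities, K, n_sweeps=200, alpha=1.0, gamma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    z = [tuple(rng.integers(K, size=2)) for _ in links]   # random initial class pairs
    pair_counts = np.full((K, K), alpha)                  # smoothed counts of class pairs
    left_counts = np.full((K, n_entities), gamma)         # entity counts per left class
    right_counts = np.full((K, n_entities), gamma)        # entity counts per right class
    for (e1, e2), (zl, zr) in zip(links, z):
        pair_counts[zl, zr] += 1
        left_counts[zl, e1] += 1
        right_counts[zr, e2] += 1
    for _ in range(n_sweeps):
        for idx, (e1, e2) in enumerate(links):
            zl, zr = z[idx]                                # remove this link's counts
            pair_counts[zl, zr] -= 1
            left_counts[zl, e1] -= 1
            right_counts[zr, e2] -= 1
            # Pr(class pair) * Pr(e1 | left class) * Pr(e2 | right class), up to a constant
            p = (pair_counts
                 * (left_counts[:, e1] / left_counts.sum(axis=1))[:, None]
                 * (right_counts[:, e2] / right_counts.sum(axis=1))[None, :])
            new = rng.choice(K * K, p=(p / p.sum()).ravel())
            zl, zr = divmod(int(new), K)
            z[idx] = (zl, zr)                              # re-assign and put the counts back
            pair_counts[zl, zr] += 1
            left_counts[zl, e1] += 1
            right_counts[zr, e2] += 1
    return z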
57
Gibbs sampler for the sparse block model
Sampling the class pair for a link:
(probability of the class pair in the link corpus) × (probability of the two entities in their respective classes)
58
How do these methods compare?
59
How do these methods compare?
60
Also: model entity-annotated text.
English text
Vac1p coordinates Rab and phosphatidylinositol
3-kinase signaling in Vps45p-dependent vesicle
docking/fusion at the endosome. The vacuolar
protein sorting (VPS) pathway of Saccharomyces
cerevisiae mediates transport of vacuolar protein
precursors from the late Golgi to the
lysosome-like vacuole. Sorting of some vacuolar
proteins occurs via a prevacuolar endosomal
compartment and mutations in a subset of VPS
genes (the class D VPS genes) interfere with the
Golgi-to-endosome transport step. Several of the
encoded proteins, including Pep12p/Vps6p (an
endosomal target (t) SNARE) and Vps45p (a Sec1p
homologue), bind each other directly 1. Another
of these proteins, Vac1p/Pep7p/Vps19p, associates
with Pep12p and binds phosphatidylinositol
3-phosphate (PI(3)P), the product of the Vps34
phosphatidylinositol 3-kinase (PI 3-kinase)
...... EP7, VPS45, VPS34, PEP12, VPS21
Protein annotations
61
BlockLDA: jointly modeling entity-entity links and entity-annotated text
Entity distributions shared between blocks and
topics
62
(No Transcript)
63
Varying The Amount of Training Data
64
Another Performance Test
  • Goal: predict functional categories of proteins
  • 15 categories at the top level (e.g., metabolism, cellular communication, cell fate, ...)
  • Proteins have 2.1 categories on average
  • Method for predicting categories (a sketch follows below):
  • Run with 15 topics
  • Using held-out labeled data, associate topics with the closest category
  • If a category has n true members, pick the top n proteins by probability of membership in the associated topic
  • Metrics: F1, Precision, Recall

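A sketch of this evaluation protocol (my own illustration; the way topics are matched to their "closest" category here, by a weighted vote over each topic's highest-probability proteins, is an assumption, since the slide does not spell it out).

import numpy as np

def topic_f1(theta, true_cats, n_categories, n_top=50):
    """theta: proteins-by-topics membership probabilities; true_cats: a set of
    category ids for each protein (proteins can belong to more than one)."""
    topic_to_cat = []
    for t in range(theta.shape[1]):                       # associate each topic with a category
        votes = np.zeros(n_categories)
        for p in np.argsort(-theta[:, t])[:n_top]:
            for c in true_cats[p]:
                votes[c] += theta[p, t]
        topic_to_cat.append(int(votes.argmax()))
    f1s = []
    for c in range(n_categories):
        members = {p for p, cats in enumerate(true_cats) if c in cats}
        topics = [t for t, cc in enumerate(topic_to_cat) if cc == c]
        if not members or not topics:
            continue
        score = theta[:, topics].max(axis=1)                 # membership in the associated topic(s)
        predicted = set(np.argsort(-score)[:len(members)])   # top n proteins, n = number of true members
        prec = rec = len(predicted & members) / len(members) # equal, since |predicted| = |members|
        f1s.append(0.0 if prec == 0 else 2 * prec * rec / (prec + rec))
    return float(np.mean(f1s))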
65
Performance
66
Another test: manual evaluation of topics by experts
67
Evaluation
Joint with Katie Rivard (MLD), John Woolford, Jelena Jakovljevic (CMU Biology)
Topics from Block-LDA: trained on yeast publications + protein-protein interaction networks
Topics from plain vanilla LDA: trained on only yeast publications
  • Evaluate topics by asking:
  • is the topic meaningful?
  • if so:
  • which of the top 10 words are consistent with the topic's meaning?
  • which of the top 10 genes? top 10 papers?

68
Let's ask people who know: yeast biologists
  • Evaluate topics by asking:
  • are the top words for a topic meaningful?
  • are the top papers for a topic meaningful?
  • are the top genes for a topic meaningful?

69
(No Transcript)
70
Outline
  • Spectral methods
  • Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
  • Variant: PIC for document clustering
  • Stochastic block models
  • Mixed-membership sparse block model (Parkinnen et al., 2007)
  • Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents (in progress)
71
BlockLDA: adding regularization terms
72
BlockLDA regularization
  • "Pseudo-observe" low entropy for role assignments to nodes → slightly mixed membership
  • A similar idea balances cluster sizes

73
(No Transcript)
74
(No Transcript)
75
Outline
  • Spectral methods
  • Variant: Power Iteration Clustering (Lin & Cohen, ICML 2010)
  • Variant: PIC for document clustering
  • Stochastic block models
  • Mixed-membership sparse block model (Parkinnen et al., 2007)
  • Variants: BlockLDA with entropic regularization, BlockLDA with annotated documents

76
Conclusions
  • Two new methods:
  • PIC (fast spectral clustering)
  • Fast, robust
  • Easily extends to bipartite graphs (e.g., document-term graphs)
  • BlockLDA (mixed-membership block models)
  • Slower, longer convergence
  • More flexible (mixed-membership) model
  • Easier to extend to use side information

77
Thanks to
  • NIH/NIGMS
  • NSF
  • Google
  • Microsoft LiveLabs