Title: Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models
1 Sparse Word Graphs: A Scalable Algorithm for Capturing Word Correlations in Topic Models
- Ramesh Nallapati
- Joint work with John Lafferty, Amr Ahmed, William Cohen and Eric Xing
- Machine Learning Department, Carnegie Mellon University
2 Introduction
- Statistical topic modeling is an attractive framework for topic discovery
- Completely unsupervised
- Models text very well
- Lower perplexity compared to unigram models
- Reveals meaningful semantic patterns
- Can help summarize and visualize document collections
- e.g. PLSA, LDA, DPM, DTM, CTM, PA
3 Introduction
- A common assumption in all the variants: exchangeability (the bag-of-words assumption)
- Topics are represented as a ranked list of words
- Consequences
- Word correlation information is lost
- e.g. "white house" as a unit vs. "white" and "house" separately
- Long-distance correlations are also lost
4 Introduction
- Objective
- To capture correlations between words within topics
- Motivation
- A more interpretable representation of topics: a network of words rather than a list
- Helps better visualize and summarize document collections
- May reveal unexpected relationships and patterns within topics
5 Past Work: Topic Models
- Bigram topic model (Wallach, ICML 2006)
- Requires KV(V-1) parameters
- Captures only local dependencies
- Does not model the sparsity of correlations
- Does not capture within-topic correlations
6 Past Work: Other Approaches
- Hyperspace Analog to Language (HAL)
- Lund and Burgess, Cognitive Science, 1996
- Word-pair correlation is measured as a weighted count of the number of times the two words occur within a fixed-length window
- Weight of an occurrence ∝ 1/(mutual distance), as in the sketch below
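A minimal sketch of that HAL-style weighted counting, assuming a fixed-length window; hal_cooccurrence and its window parameter are illustrative names, not from the original.

```python
from collections import defaultdict

def hal_cooccurrence(tokens, window=10):
    """Weighted co-occurrence counts: each ordered pair within the
    window contributes 1/(mutual distance), so closer pairs weigh more."""
    weights = defaultdict(float)
    for i, w1 in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            weights[(w1, tokens[j])] += 1.0 / (j - i)
    return weights

pairs = hal_cooccurrence("the white house press office staff".split(), window=4)
```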
7 Past Work: Other Approaches
- Hyperspace Analog to Language (HAL)
- Lund and Burgess, Cognitive Science, 1996
- Pluses
- Sparse solutions, scalability
- Minuses
- Only unearths global correlations, not sense-specific semantic correlations
- e.g. it conflates "river bank" and "bank check"
- Captures only local dependencies
8 Past Work: Other Approaches
- Query expansion in IR
- Similar in spirit: finds words that co-occur strongly with the query words
- However, it is not a corpus visualization tool; it requires a query context to operate on
- WordNet
- Semantic networks
- Human-labeled, hence not directly related to our goal
9 Our Approach
- L1-norm regularization
- Known to enforce sparse solutions
- Sparsity permits scalability
- Convex optimization problem
- Globally optimal solutions
- Recent advances in learning the structure of graphical models
- An L1-regularization framework asymptotically recovers the true structure
10 Background: LASSO
- Example: linear regression
- Regularization is used to improve generalizability
- e.g. 1: Ridge regression (L2-norm regularization)
- e.g. 2: Lasso (L1-norm regularization); both objectives are written out below
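For reference, a standard way to write the two objectives (a restatement, not reproduced from the slides; λ is the regularization strength):

```latex
\hat{\beta}_{\mathrm{ridge}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2
\qquad
\hat{\beta}_{\mathrm{lasso}} = \arg\min_{\beta}\; \|y - X\beta\|_2^2 + \lambda \|\beta\|_1
```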
11 Background: LASSO
- The Lasso encourages sparse solutions; a small demonstration follows
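A small synthetic demonstration of that sparsity, assuming scikit-learn's Ridge and Lasso estimators as stand-ins; the data and penalty values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
beta = np.zeros(50)
beta[:5] = 3.0                      # only 5 truly relevant features
y = X @ beta + rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print((np.abs(ridge.coef_) > 1e-6).sum())  # all 50: ridge only shrinks
print((np.abs(lasso.coef_) > 1e-6).sum())  # typically close to 5
```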
12 Background: Gaussian Random Fields
- Multivariate Gaussian distribution
- Random field structure G = (V, E)
- V: the set of all variables X_1, …, X_p
- (s, t) ∈ E ⟺ (Σ⁻¹)_st ≠ 0
- X_s ⊥ X_u | X_N(s) for every u ∉ N(s) (spelled out below)
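Spelled out under the usual conventions (N(s) is the neighborhood of s; this block is a restatement, not from the slides):

```latex
% Multivariate Gaussian density with precision matrix \Sigma^{-1}:
p(x) \propto \exp\!\big(-\tfrac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\big)
% Edge criterion and the resulting conditional independence:
(s,t) \in E \iff (\Sigma^{-1})_{st} \neq 0,
\qquad
X_s \perp X_u \mid X_{N(s)} \ \text{for all } u \notin N(s)
```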
13 Background: Gaussian Random Fields
- Estimating the graph structure of a GRF from data (Meinshausen and Bühlmann, Annals of Statistics, 2006)
- Regress each variable onto the others, imposing an L1 penalty to encourage sparsity
- The non-zero coefficients give the estimated neighborhood (see the sketch below)
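A minimal sketch of this neighborhood-selection procedure, assuming scikit-learn's Lasso and an illustrative tuning constant lam; the "OR" symmetrization is one of the two combination rules Meinshausen and Bühlmann discuss.

```python
import numpy as np
from sklearn.linear_model import Lasso

def neighborhood_selection(X, lam=0.1):
    """X: n-by-p data matrix; returns an estimated edge set."""
    n, p = X.shape
    edges = set()
    for s in range(p):
        others = np.delete(np.arange(p), s)
        coef = Lasso(alpha=lam).fit(X[:, others], X[:, s]).coef_
        for t, c in zip(others, coef):
            if abs(c) > 1e-8:
                edges.add((min(s, t), max(s, t)))  # "OR" rule for symmetry
    return edges
```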
14 Background: Gaussian Random Fields
Figure: estimated graph vs. true graph (courtesy Meinshausen and Bühlmann, Annals of Statistics, 2006)
15 Background: Gaussian Random Fields
- Application to topic models: the CTM
- Blei and Lafferty, NIPS, 2006
16 Background: Gaussian Random Fields
- Application to the CTM (Blei and Lafferty, Annals of Applied Statistics, 2007)
17 Structure Learning of an MRF
- Ising model
- L1-regularized conditional likelihood learns the true structure asymptotically
- Wainwright, Ravikumar and Lafferty, NIPS 2006 (a per-node sketch follows)
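A sketch of the per-node approach for binary data, assuming scikit-learn's L1-penalized LogisticRegression as the solver (the deck itself uses an interior-point solver, mentioned later); lam and the edge-symmetrization rule are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def ising_structure(X, lam=0.1):
    """X: n-by-p binary matrix; regress each variable on the rest
    with an L1 penalty and read edges off the non-zero weights."""
    n, p = X.shape
    edges = set()
    for s in range(p):
        y = X[:, s]
        if y.min() == y.max():      # skip degenerate (constant) variables
            continue
        others = np.delete(np.arange(p), s)
        clf = LogisticRegression(penalty="l1", C=1.0 / lam, solver="liblinear")
        clf.fit(X[:, others], y)
        for t, c in zip(others, clf.coef_[0]):
            if abs(c) > 1e-8:
                edges.add((min(s, t), max(s, t)))
    return edges
```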
18 Structure Learning of an MRF
Figure: courtesy Wainwright, Ravikumar and Lafferty, NIPS 2006
19 Sparse Word Graphs
- Algorithm
- Run LDA on the document collection and obtain topic assignments
- Convert the topic assignments for each document into K binary vectors X
- Assume an MRF for each topic with X as the underlying data
- Apply structure learning for the MRF using L1-regularized conditional likelihood (a sketch follows)
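A minimal end-to-end sketch under stated assumptions: gensim's LdaModel stands in for the LDA step, the most relevant topic per word stands in for a sampled topic assignment, and the resulting binary matrices feed a structure learner such as the ising_structure sketch above. All names here are illustrative.

```python
import numpy as np
from gensim import corpora, models

def topic_binary_vectors(texts, num_topics=10):
    """texts: list of tokenized documents (lists of strings)."""
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = models.LdaModel(corpus, id2word=dictionary, num_topics=num_topics)
    D, V = len(texts), len(dictionary)
    # X[k][d, v] = 1 iff word v in document d is assigned to topic k
    X = [np.zeros((D, V), dtype=np.int8) for _ in range(num_topics)]
    for d, bow in enumerate(corpus):
        _, word_topics, _ = lda.get_document_topics(bow, per_word_topics=True)
        for v, topic_ids in word_topics:
            if topic_ids:                      # most relevant topic first
                X[topic_ids[0]][d, v] = 1
    return X, dictionary

# Each X[k] can then be passed to a structure learner,
# e.g. ising_structure(X[k], lam=0.1) from the earlier sketch.
```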
20 Sparse Word Graphs
21 Sparse Word Graphs: Scalability
- We still run V logistic regression problems, each of size V, for each topic: O(KV²)!
- However, each example is very sparse
- The L1 penalty results in sparse solutions
- Each topic can be run in parallel (sketched below)
- Efficient interior-point-based L1-regularized logistic regression (Koh, Kim and Boyd, JMLR, 2007)
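Since the K per-topic problems are independent, any parallel map works; a sketch using joblib (an assumed choice, not from the slides), reusing the earlier illustrative functions:

```python
from joblib import Parallel, delayed

# X: the per-topic binary matrices from topic_binary_vectors;
# ising_structure: the per-node L1 structure learner sketched earlier.
graphs = Parallel(n_jobs=-1)(
    delayed(ising_structure)(Xk, lam=0.1) for Xk in X
)
```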
22 Experiments
- Small AP corpus
- 2.2K docs, 10.5K unique words
- Ran a 10-topic LDA model
- Used λ = 0.1 in the L1 logistic regression
- Took just 45 min. per topic
- Very sparse solutions
- Recovers under 0.1% of the total number of possible edges
23 Topic "Business": neighborhood of top LDA terms
24 Topic "Business": neighborhood of top edges
25 Topic "War": neighborhood of top LDA terms
26 Topic "War": neighborhood of top edges
27 Concluding Remarks
- Pros
- A highly scalable algorithm for capturing within-topic word correlations
- Captures both short-distance and long-distance correlations
- Makes topics more interpretable
- Cons
- Not a complete probabilistic model
- Building one is a significant modeling challenge, since the correlations are latent
28 Concluding Remarks
- Applications of Sparse Word Graphs
- A better document summarization and visualization tool
- Word sense disambiguation
- Semantic query expansion
- Future work
- Evaluation on a real task
- Build a unified statistical model