Generative Topic Models for Community Analysis
1
Generative Topic Models for Community Analysis
  • Ramesh Nallapati

2
Objectives
  • Provide an overview of topic models and their
    learning techniques
  • Mixture models, PLSA, LDA
  • EM, variational EM, Gibbs sampling
  • Convince you that topic models are an attractive
    framework for community analysis
  • 5 definitive papers

3
Outline
  • Part I: Introduction to Topic Models
  • Naive Bayes model
  • Mixture Models
  • Expectation Maximization
  • PLSA
  • LDA
  • Variational EM
  • Gibbs Sampling
  • Part II: Topic Models for Community Analysis
  • Citation modeling with PLSA
  • Citation Modeling with LDA
  • Author Topic Model
  • Author-Recipient-Topic Model
  • Modeling influence of Citations
  • Mixed membership Stochastic Block Model

4
Introduction to Topic Models
  • Multinomial Naïve Bayes

  • For each document d = 1,…,M
    • Generate C_d ~ Mult(π)
    • For each position n = 1,…,N_d
      • Generate w_n ~ Mult(β_{C_d})

  [Plate diagram: π → C → W1,…,WN per document, repeated over M documents, with word parameter β]
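This generative story can be sketched in a few lines of Python. A toy illustration only: the class prior `pi`, the per-class word distributions `beta`, and the vocabulary size are made-up numbers, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 2, 5                      # 2 classes, vocabulary of 5 word ids
pi = np.array([0.6, 0.4])        # class prior pi
beta = np.array([                # per-class word distributions beta[c]
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.2, 0.5],
])

def generate_doc(n_words):
    c = rng.choice(K, p=pi)                          # C_d ~ Mult(pi)
    words = rng.choice(V, size=n_words, p=beta[c])   # w_n ~ Mult(beta_{C_d})
    return c, words

docs = [generate_doc(20) for _ in range(3)]
```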
5
Introduction to Topic Models
  • Naïve Bayes model: compact representation

  [Plate diagrams: the unrolled model (C → W1,…,WN) and its compact plate form (C → W inside a plate of size N), each repeated over M documents, with parameters π and β]
6
Introduction to Topic Models
  • Multinomial naïve Bayes: learning
  • Maximize the log-likelihood of observed variables
    w.r.t. the parameters
  • Convex function: global optimum
  • Solution (closed form, relative frequencies):
    π̂_c ∝ #{d : C_d = c},  β̂_{c,w} ∝ count of word w in documents of class c
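The maximum-likelihood solution is just relative-frequency counting. A minimal sketch on a hypothetical labeled corpus (word ids and labels invented for illustration):

```python
import numpy as np

# toy labeled corpus: (class, list of word ids); 2 classes, vocabulary of 4
corpus = [(0, [0, 0, 1]), (0, [0, 2]), (1, [3, 3, 2])]
K, V = 2, 4

# pi_hat[c]      = fraction of documents with class c
# beta_hat[c, w] = relative frequency of word w inside class c
pi_hat = np.zeros(K)
beta_hat = np.zeros((K, V))
for c, words in corpus:
    pi_hat[c] += 1
    for w in words:
        beta_hat[c, w] += 1
pi_hat /= pi_hat.sum()
beta_hat /= beta_hat.sum(axis=1, keepdims=True)
```

Because the objective is convex, these counts are the global optimum; no iteration is needed.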

7
Introduction to Topic Models
  • Mixture model: unsupervised naïve Bayes model
  • Joint probability of words and classes
  • But classes are not visible

  [Plate diagram: π → Z → W with Z latent (the class C is no longer observed); plates of size N and M; word parameter β]
8
Introduction to Topic Models
  • Mixture model: learning
  • Not a convex function
  • No global optimum solution
  • Solution: Expectation Maximization
  • Iterative algorithm
  • Finds local optimum
  • Guaranteed to maximize a lower-bound on the
    log-likelihood of the observed data

9
Introduction to Topic Models
  • Quick summary of EM
  • Log is a concave function:
    log(0.5·x₁ + 0.5·x₂) ≥ 0.5·log(x₁) + 0.5·log(x₂)
  • Lower-bound is convex!
  • Optimize this lower-bound w.r.t. each variable
    instead

[Figure: the log curve and the chord between x₁ and x₂; the gap between log(0.5·x₁ + 0.5·x₂) and 0.5·log(x₁) + 0.5·log(x₂) is an entropy term H(·)]
10
Introduction to Topic Models
  • Mixture model: EM solution

E-step:  P(z | d) ∝ π_z ∏_{n=1}^{N_d} β_{z, w_n}
M-step:  π_z ∝ Σ_d P(z | d),   β_{z,w} ∝ Σ_d P(z | d) · n(d, w)
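The two steps can be sketched for a mixture of multinomials over a document-word count matrix. The data, number of components, and iteration count below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy document-word count matrix: M documents x V vocabulary words
counts = np.array([[5, 4, 0, 1],
                   [6, 3, 1, 0],
                   [0, 1, 5, 4],
                   [1, 0, 4, 5]], dtype=float)
M, V = counts.shape
K = 2

pi = np.full(K, 1.0 / K)                  # mixing weights pi_z
beta = rng.dirichlet(np.ones(V), size=K)  # per-component word distributions

for _ in range(50):
    # E-step: responsibilities P(z | d), computed in log space for stability
    log_r = np.log(pi) + counts @ np.log(beta).T   # shape (M, K)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi and beta from expected counts
    pi = r.sum(axis=0) / M
    beta = r.T @ counts + 1e-9             # tiny smoothing avoids log(0)
    beta /= beta.sum(axis=1, keepdims=True)
```

On this toy matrix the first two and last two documents use disjoint halves of the vocabulary, so EM separates them into the two components (a local optimum, as the slide notes).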
11
Introduction to Topic Models
12
Introduction to Topic Models
  • Probabilistic Latent Semantic Analysis Model

  • Select document d ~ Mult(μ)
  • For each position n = 1,…,N_d
    • Generate z_n ~ Mult(θ_d)
    • Generate w_n ~ Mult(β_{z_n})

  [Plate diagram: d → θ_d (topic distribution) → z → w; plates of size N and M; word parameter β]
13
Introduction to Topic Models
  • Probabilistic Latent Semantic Analysis Model
  • Learning using EM
  • Not a complete generative model
  • Has a distribution μ over the training set of
    documents: no new document can be generated!
  • Nevertheless, more realistic than the mixture model
  • Documents can discuss multiple topics!

14
Introduction to Topic Models
  • PLSA topics (TDT-1 corpus)

15
Introduction to Topic Models
16
Introduction to Topic Models
  • Latent Dirichlet Allocation

  • For each document d = 1,…,M
    • Generate θ_d ~ Dir(α)
    • For each position n = 1,…,N_d
      • Generate z_n ~ Mult(θ_d)
      • Generate w_n ~ Mult(β_{z_n})

  [Plate diagram: α → θ → z → w; plates of size N and M; word parameter β]
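The LDA generative process above translates almost line for line into code. A toy sketch: the topic-word distributions `beta` are drawn once at random here rather than given a prior, and all sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 8, 5          # topics, vocabulary size, documents
alpha = np.full(K, 0.1)    # Dirichlet prior on the topic mixture theta_d
beta = rng.dirichlet(np.ones(V), size=K)   # topic-word distributions

docs = []
for d in range(M):
    theta = rng.dirichlet(alpha)            # theta_d ~ Dir(alpha)
    n_d = 10
    z = rng.choice(K, size=n_d, p=theta)    # z_n ~ Mult(theta_d)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_n ~ Mult(beta_{z_n})
    docs.append((z, w))
```

Unlike PLSA, nothing here indexes into a fixed training set: any number of new documents can be generated, which is exactly the point made on the next slide.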
17
Introduction to Topic Models
  • Latent Dirichlet Allocation
  • Overcomes the issues with PLSA
  • Can generate any random document
  • Parameter learning
  • Variational EM
  • Numerical approximation using lower-bounds
  • Results in biased solutions
  • Convergence has numerical guarantees
  • Gibbs Sampling
  • Stochastic simulation
  • unbiased solutions
  • Stochastic convergence

18
Introduction to Topic Models
  • Variational EM for LDA
  • Approximate the posterior by a simpler
    distribution
  • A convex function in each parameter!

19
Introduction to Topic Models
  • Gibbs sampling
  • Applicable when joint distribution is hard to
    evaluate but conditional distribution is known
  • Sequence of samples comprises a Markov Chain
  • Stationary distribution of the chain is the joint
    distribution
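For LDA specifically, the widely used collapsed Gibbs sampler (the Griffiths-Steyvers update, not derived on this slide) resamples each topic assignment from its conditional given all the others. A minimal sketch on a toy corpus, with invented hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy corpus: documents as lists of word ids
docs = [[0, 0, 1, 1], [1, 2, 2, 3], [3, 3, 2, 2]]
K, V = 2, 4
alpha, eta = 0.5, 0.5    # symmetric Dirichlet hyperparameters

# count tables and random initial topic assignments
ndk = np.zeros((len(docs), K))   # document-topic counts
nkw = np.zeros((K, V))           # topic-word counts
nk = np.zeros(K)                 # topic totals
z = []
for d, doc in enumerate(docs):
    zd = rng.integers(K, size=len(doc))
    z.append(zd)
    for w, k in zip(doc, zd):
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):             # Gibbs sweeps over every token
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]          # remove the current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # conditional P(z_n = k | all other z), up to a constant
            p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
            k = rng.choice(K, p=p / p.sum())
            z[d][n] = k          # add back with the resampled topic
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
```

Each sweep is one step of the Markov chain; after burn-in, the samples of `z` come from the posterior over topic assignments.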

20
Introduction to Topic Models
  • LDA topics

21
Introduction to Topic Models
  • LDA's view of a document

22
Introduction to Topic Models
  • Perplexity comparison of various models

[Figure: perplexity comparison of unigram, mixture model, PLSA, and LDA; lower is better]
23
Introduction to Topic Models
  • Summary
  • Generative models for exchangeable data
  • Unsupervised models
  • Automatically discover topics
  • Well developed approximate techniques available
    for inference and learning

24
Outline
  • Part I: Introduction to Topic Models
  • Naive Bayes model
  • Mixture Models
  • Expectation Maximization
  • PLSA
  • LDA
  • Variational EM
  • Gibbs Sampling
  • Part II: Topic Models for Community Analysis
  • Citation modeling with PLSA
  • Citation Modeling with LDA
  • Author Topic Model
  • Author-Recipient-Topic Model
  • Modeling influence of Citations
  • Mixed membership Stochastic Block Model

25
Hyperlink modeling using PLSA
26
Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
  • Select document d ~ Mult(μ)
  • For each position n = 1,…,N_d
    • Generate z_n ~ Mult(θ_d)
    • Generate w_n ~ Mult(β_{z_n})
  • For each citation j = 1,…,L_d
    • Generate z_j ~ Mult(θ_d)
    • Generate c_j ~ Mult(γ_{z_j})

  [Plate diagram: d → θ_d → z → w and z → c; word plate of size N, citation plate of size L, repeated over M documents; parameters β and γ]
27
Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)

PLSA likelihood:
  L = Σ_d Σ_w n(d, w) · log Σ_z P(w | z) P(z | d)

New likelihood adds an analogous term over citations:
  L′ = Σ_d [ Σ_w n(d, w) · log Σ_z P(w | z) P(z | d) + Σ_c n(d, c) · log Σ_z P(c | z) P(z | d) ]

Learning using EM
28
Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)

Heuristic: weight the content term by α and the citation term by (1 − α), where 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks.
29
Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
  • Experiments: text classification
  • Datasets
  • WebKB
  • 6000 CS dept web pages with hyperlinks
  • 6 classes: faculty, course, student, staff, etc.
  • Cora
  • 2000 machine learning abstracts with citations
  • 7 classes: sub-areas of machine learning
  • Methodology
  • Learn the model on complete data and obtain θ_d
    for each document
  • Test documents classified into the label of the
    nearest neighbor in the training set
  • Distance measured as cosine similarity in the θ
    space
  • Measure the performance as a function of α
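The nearest-neighbor step of this methodology is easy to sketch. The topic-mixture vectors and labels below are hypothetical stand-ins for the learned per-document θ values:

```python
import numpy as np

# hypothetical topic-mixture vectors theta_d for training docs, with labels
train_theta = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
train_label = np.array([0, 0, 1])

def classify(theta):
    # cosine similarity to every training document in theta space
    sims = (train_theta @ theta) / (
        np.linalg.norm(train_theta, axis=1) * np.linalg.norm(theta))
    return int(train_label[np.argmax(sims)])   # label of the nearest neighbor

pred = classify(np.array([0.85, 0.15]))
```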

30
Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
  • Classification performance

[Figure: accuracy as a function of α on the two datasets, spanning the content-only and hyperlink-only extremes]
31
Hyperlink modeling using LDA
32
Hyperlink modeling using LDA (Erosheva, Fienberg, Lafferty, PNAS, 2004)
  • For each document d = 1,…,M
    • Generate θ_d ~ Dir(α)
    • For each position n = 1,…,N_d
      • Generate z_n ~ Mult(θ_d)
      • Generate w_n ~ Mult(β_{z_n})
    • For each citation j = 1,…,L_d
      • Generate z_j ~ Mult(θ_d)
      • Generate c_j ~ Mult(γ_{z_j})

  [Plate diagram: α → θ → z → w and z → c; plates of size N, L, and M; parameters β and γ]

Learning using variational EM
33
Hyperlink modeling using LDA (Erosheva, Fienberg, Lafferty, PNAS, 2004)
34
Author-Topic Model for Scientific Literature


35
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • For each author a = 1,…,A
    • Generate θ_a ~ Dir(α)
  • For each topic k = 1,…,K
    • Generate φ_k ~ Dir(β)
  • For each document d = 1,…,M
    • For each position n = 1,…,N_d
      • Generate author x ~ Unif(a_d)
      • Generate z_n ~ Mult(θ_x)
      • Generate w_n ~ Mult(φ_{z_n})

  [Plate diagram: document authors a_d → x → z → w; author-topic distributions θ over A authors, topic-word distributions φ over K topics with prior β]
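The author-topic generative process can be sketched as follows. Toy sizes and randomly drawn θ and φ, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

A, K, V = 3, 2, 6   # authors, topics, vocabulary size
theta = rng.dirichlet(np.full(K, 0.5), size=A)  # theta_a ~ Dir(alpha)
phi = rng.dirichlet(np.full(V, 0.5), size=K)    # phi_k  ~ Dir(beta)

def generate_doc(authors, n_words):
    xs, zs, ws = [], [], []
    for _ in range(n_words):
        x = int(rng.choice(authors))      # author x ~ Unif(a_d)
        z = int(rng.choice(K, p=theta[x]))  # z_n ~ Mult(theta_x)
        w = int(rng.choice(V, p=phi[z]))    # w_n ~ Mult(phi_{z_n})
        xs.append(x); zs.append(z); ws.append(w)
    return xs, zs, ws

# a document with author set {0, 2}
xs, zs, ws = generate_doc(authors=[0, 2], n_words=12)
```

The key difference from LDA: the topic mixture is attached to the sampled author, not to the document, so each word is credited to one of the document's authors.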
36
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • Learning: Gibbs sampling

  [Plate diagram: same model as the previous slide]
37
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • Perplexity results

38
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • Topic-author visualization

39
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • Application 1: Author similarity

40
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • Application 2: Author entropy

41
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
42
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
Gibbs sampling
43
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
  • Datasets
  • Enron email data
  • 23,488 messages between 147 users
  • McCallum's personal email
  • 23,488(?) messages with 128 authors

44
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
  • Topic visualization: Enron set

45
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
  • Topic visualization: McCallum's data

46
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
47
Modeling Citation Influences
48
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Copycat model

49
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Citation influence model

50
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Citation influence graph for LDA paper

51
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Words in LDA paper assigned to citations

52
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Performance evaluation
  • Data
  • 22 seed papers and 132 cited papers
  • Users labeled citations on a scale of 1-4
  • Models considered
  • Citation influence model
  • Copycat model
  • LDA-JS-divergence
  • Symmetric divergence in topic space
  • LDA-post
  • PageRank
  • TF-IDF
  • Evaluation measure
  • Area under the ROC curve
53
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Results

54
Mixed Membership Stochastic Block Models (Work in Progress)
  • A complete generative model for text and
    citations
  • Can model the topicality of citations
  • Topic Specific PageRank
  • Can also predict citations between unseen
    documents

55
Summary
  • Topic modeling is an interesting new framework
    for community analysis
  • Sound theoretical basis
  • Completely unsupervised
  • Simultaneous modeling of multiple fields
  • Discovers soft communities and clusters in
    terms of topic membership
  • Can also be used for predictive purposes