Generative Topic Models for Community Analysis
1
Generative Topic Models for Community Analysis
  • Ramesh Nallapati

2
Objectives
  • Provide an overview of topic models and their
    learning techniques
  • Mixture models, PLSA, LDA
  • EM, variational EM, Gibbs sampling
  • Convince you that topic models are an attractive
    framework for community analysis
  • 5 definitive papers

3
Outline
  • Part I: Introduction to Topic Models
  • Naive Bayes model
  • Mixture Models
  • Expectation Maximization
  • PLSA
  • LDA
  • Variational EM
  • Gibbs Sampling
  • Part II: Topic Models for Community Analysis
  • Citation modeling with PLSA
  • Citation Modeling with LDA
  • Author Topic Model
  • Author-Recipient-Topic Model
  • Modeling influence of Citations
  • Mixed membership Stochastic Block Model

4
Introduction to Topic Models
  • Multinomial Naïve Bayes

  • For each document d = 1,…,M
    • Generate C_d ~ Mult(π)
    • For each position n = 1,…,N_d
      • Generate w_n ~ Mult(β_{C_d})

  [Plate diagram: π → C → W1,…,WN per document, repeated over M documents, with word parameter β]
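This generative story can be sketched in a few lines of Python. A toy illustration only: the class prior `pi`, the per-class word distributions `beta`, and the vocabulary size are made-up numbers, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)

K, V = 2, 5                      # 2 classes, vocabulary of 5 word ids
pi = np.array([0.6, 0.4])        # class prior pi
beta = np.array([                # per-class word distributions beta[c]
    [0.5, 0.2, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.2, 0.5],
])

def generate_doc(n_words):
    c = rng.choice(K, p=pi)                          # C_d ~ Mult(pi)
    words = rng.choice(V, size=n_words, p=beta[c])   # w_n ~ Mult(beta_{C_d})
    return c, words

docs = [generate_doc(20) for _ in range(3)]
```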
5
Introduction to Topic Models
  • Naïve Bayes model: compact representation

  [Plate diagrams: the unrolled model (C → W1,…,WN) and its compact plate form (C → W inside a plate of size N), each repeated over M documents, with parameters π and β]
6
Introduction to Topic Models
  • Multinomial naïve Bayes: learning
  • Maximize the log-likelihood of observed variables
    w.r.t. the parameters
  • Convex function: global optimum
  • Solution (closed form, relative frequencies):
    π̂_c ∝ #{d : C_d = c},  β̂_{c,w} ∝ count of word w in documents of class c
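The maximum-likelihood solution is just relative-frequency counting. A minimal sketch on a hypothetical labeled corpus (word ids and labels invented for illustration):

```python
import numpy as np

# toy labeled corpus: (class, list of word ids); 2 classes, vocabulary of 4
corpus = [(0, [0, 0, 1]), (0, [0, 2]), (1, [3, 3, 2])]
K, V = 2, 4

# pi_hat[c]      = fraction of documents with class c
# beta_hat[c, w] = relative frequency of word w inside class c
pi_hat = np.zeros(K)
beta_hat = np.zeros((K, V))
for c, words in corpus:
    pi_hat[c] += 1
    for w in words:
        beta_hat[c, w] += 1
pi_hat /= pi_hat.sum()
beta_hat /= beta_hat.sum(axis=1, keepdims=True)
```

Because the objective is convex, these counts are the global optimum; no iteration is needed.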

7
Introduction to Topic Models
  • Mixture model: unsupervised naïve Bayes model
  • Joint probability of words and classes
  • But classes are not visible

  [Plate diagram: π → Z → W with Z latent (the class C is no longer observed); plates of size N and M; word parameter β]
8
Introduction to Topic Models
  • Mixture model: learning
  • Not a convex function
  • No global optimum solution
  • Solution: Expectation Maximization
  • Iterative algorithm
  • Finds local optimum
  • Guaranteed to maximize a lower-bound on the
    log-likelihood of the observed data

9
Introduction to Topic Models
  • Quick summary of EM
  • Log is a concave function:
    log(0.5·x₁ + 0.5·x₂) ≥ 0.5·log(x₁) + 0.5·log(x₂)
  • Lower-bound is convex!
  • Optimize this lower-bound w.r.t. each variable
    instead

[Figure: the log curve and the chord between x₁ and x₂; the gap between log(0.5·x₁ + 0.5·x₂) and 0.5·log(x₁) + 0.5·log(x₂) is an entropy term H(·)]
10
Introduction to Topic Models
  • Mixture model: EM solution

E-step:  P(z | d) ∝ π_z ∏_{n=1}^{N_d} β_{z, w_n}
M-step:  π_z ∝ Σ_d P(z | d),   β_{z,w} ∝ Σ_d P(z | d) · n(d, w)
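The two steps can be sketched for a mixture of multinomials over a document-word count matrix. The data, number of components, and iteration count below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy document-word count matrix: M documents x V vocabulary words
counts = np.array([[5, 4, 0, 1],
                   [6, 3, 1, 0],
                   [0, 1, 5, 4],
                   [1, 0, 4, 5]], dtype=float)
M, V = counts.shape
K = 2

pi = np.full(K, 1.0 / K)                  # mixing weights pi_z
beta = rng.dirichlet(np.ones(V), size=K)  # per-component word distributions

for _ in range(50):
    # E-step: responsibilities P(z | d), computed in log space for stability
    log_r = np.log(pi) + counts @ np.log(beta).T   # shape (M, K)
    log_r -= log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi and beta from expected counts
    pi = r.sum(axis=0) / M
    beta = r.T @ counts + 1e-9             # tiny smoothing avoids log(0)
    beta /= beta.sum(axis=1, keepdims=True)
```

On this toy matrix the first two and last two documents use disjoint halves of the vocabulary, so EM separates them into the two components (a local optimum, as the slide notes).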
11
Introduction to Topic Models
12
Introduction to Topic Models
  • Probabilistic Latent Semantic Analysis Model

  • Select document d ~ Mult(μ)
  • For each position n = 1,…,N_d
    • Generate z_n ~ Mult(θ_d)
    • Generate w_n ~ Mult(β_{z_n})

  [Plate diagram: d → θ_d (topic distribution) → z → w; plates of size N and M; word parameter β]
13
Introduction to Topic Models
  • Probabilistic Latent Semantic Analysis Model
  • Learning using EM
  • Not a complete generative model
  • Has a distribution μ over the training set of
    documents: no new document can be generated!
  • Nevertheless, more realistic than the mixture model
  • Documents can discuss multiple topics!

14
Introduction to Topic Models
  • PLSA topics (TDT-1 corpus)

15
Introduction to Topic Models
16
Introduction to Topic Models
  • Latent Dirichlet Allocation

  • For each document d = 1,…,M
    • Generate θ_d ~ Dir(α)
    • For each position n = 1,…,N_d
      • Generate z_n ~ Mult(θ_d)
      • Generate w_n ~ Mult(β_{z_n})

  [Plate diagram: α → θ → z → w; plates of size N and M; word parameter β]
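The LDA generative process above translates almost line for line into code. A toy sketch: the topic-word distributions `beta` are drawn once at random here rather than given a prior, and all sizes are invented:

```python
import numpy as np

rng = np.random.default_rng(0)

K, V, M = 3, 8, 5          # topics, vocabulary size, documents
alpha = np.full(K, 0.1)    # Dirichlet prior on the topic mixture theta_d
beta = rng.dirichlet(np.ones(V), size=K)   # topic-word distributions

docs = []
for d in range(M):
    theta = rng.dirichlet(alpha)            # theta_d ~ Dir(alpha)
    n_d = 10
    z = rng.choice(K, size=n_d, p=theta)    # z_n ~ Mult(theta_d)
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # w_n ~ Mult(beta_{z_n})
    docs.append((z, w))
```

Unlike PLSA, nothing here indexes into a fixed training set: any number of new documents can be generated, which is exactly the point made on the next slide.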
17
Introduction to Topic Models
  • Latent Dirichlet Allocation
  • Overcomes the issues with PLSA
  • Can generate any random document
  • Parameter learning
  • Variational EM
  • Numerical approximation using lower-bounds
  • Results in biased solutions
  • Convergence has numerical guarantees
  • Gibbs Sampling
  • Stochastic simulation
  • unbiased solutions
  • Stochastic convergence

18
Introduction to Topic Models
  • Variational EM for LDA
  • Approximate the posterior by a simpler
    distribution
  • A convex function in each parameter!

19
Introduction to Topic Models
  • Gibbs sampling
  • Applicable when joint distribution is hard to
    evaluate but conditional distribution is known
  • Sequence of samples comprises a Markov Chain
  • Stationary distribution of the chain is the joint
    distribution
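For LDA specifically, the widely used collapsed Gibbs sampler (the Griffiths-Steyvers update, not derived on this slide) resamples each topic assignment from its conditional given all the others. A minimal sketch on a toy corpus, with invented hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# toy corpus: documents as lists of word ids
docs = [[0, 0, 1, 1], [1, 2, 2, 3], [3, 3, 2, 2]]
K, V = 2, 4
alpha, eta = 0.5, 0.5    # symmetric Dirichlet hyperparameters

# count tables and random initial topic assignments
ndk = np.zeros((len(docs), K))   # document-topic counts
nkw = np.zeros((K, V))           # topic-word counts
nk = np.zeros(K)                 # topic totals
z = []
for d, doc in enumerate(docs):
    zd = rng.integers(K, size=len(doc))
    z.append(zd)
    for w, k in zip(doc, zd):
        ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

for _ in range(200):             # Gibbs sweeps over every token
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = z[d][n]          # remove the current assignment
            ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
            # conditional P(z_n = k | all other z), up to a constant
            p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
            k = rng.choice(K, p=p / p.sum())
            z[d][n] = k          # add back with the resampled topic
            ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1
```

Each sweep is one step of the Markov chain; after burn-in, the samples of `z` come from the posterior over topic assignments.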

20
Introduction to Topic Models
  • LDA topics

21
Introduction to Topic Models
  • LDA's view of a document

22
Introduction to Topic Models
  • Perplexity comparison of various models

[Figure: perplexity comparison of unigram, mixture model, PLSA, and LDA; lower is better]
23
Introduction to Topic Models
  • Summary
  • Generative models for exchangeable data
  • Unsupervised models
  • Automatically discover topics
  • Well developed approximate techniques available
    for inference and learning

24
Outline
  • Part I: Introduction to Topic Models
  • Naive Bayes model
  • Mixture Models
  • Expectation Maximization
  • PLSA
  • LDA
  • Variational EM
  • Gibbs Sampling
  • Part II: Topic Models for Community Analysis
  • Citation modeling with PLSA
  • Citation Modeling with LDA
  • Author Topic Model
  • Author-Recipient-Topic Model
  • Modeling influence of Citations
  • Mixed membership Stochastic Block Model

25
Hyperlink modeling using PLSA
26
Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
  • Select document d ~ Mult(μ)
  • For each position n = 1,…,N_d
    • Generate z_n ~ Mult(θ_d)
    • Generate w_n ~ Mult(β_{z_n})
  • For each citation j = 1,…,L_d
    • Generate z_j ~ Mult(θ_d)
    • Generate c_j ~ Mult(γ_{z_j})

  [Plate diagram: d → θ_d → z → w and z → c; word plate of size N, citation plate of size L, repeated over M documents; parameters β and γ]
27
Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)

PLSA likelihood:
  L = Σ_d Σ_w n(d, w) · log Σ_z P(w | z) P(z | d)

New likelihood adds an analogous term over citations:
  L′ = Σ_d [ Σ_w n(d, w) · log Σ_z P(w | z) P(z | d) + Σ_c n(d, c) · log Σ_z P(c | z) P(z | d) ]

Learning using EM
28
Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)

Heuristic: weight the content term by α and the citation term by (1 − α), where 0 ≤ α ≤ 1 determines the relative importance of content and hyperlinks.
29
Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
  • Experiments: text classification
  • Datasets
  • WebKB
  • 6000 CS dept web pages with hyperlinks
  • 6 classes: faculty, course, student, staff, etc.
  • Cora
  • 2000 machine learning abstracts with citations
  • 7 classes: sub-areas of machine learning
  • Methodology
  • Learn the model on complete data and obtain θ_d
    for each document
  • Test documents classified into the label of the
    nearest neighbor in the training set
  • Distance measured as cosine similarity in the θ
    space
  • Measure the performance as a function of α
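The nearest-neighbor step of this methodology is easy to sketch. The topic-mixture vectors and labels below are hypothetical stand-ins for the learned per-document θ values:

```python
import numpy as np

# hypothetical topic-mixture vectors theta_d for training docs, with labels
train_theta = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
train_label = np.array([0, 0, 1])

def classify(theta):
    # cosine similarity to every training document in theta space
    sims = (train_theta @ theta) / (
        np.linalg.norm(train_theta, axis=1) * np.linalg.norm(theta))
    return int(train_label[np.argmax(sims)])   # label of the nearest neighbor

pred = classify(np.array([0.85, 0.15]))
```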

30
Hyperlink modeling using PLSA (Cohn and Hofmann, NIPS 2001)
  • Classification performance

[Figure: accuracy as a function of α on the two datasets, spanning the content-only and hyperlink-only extremes]
31
Hyperlink modeling using LDA
32
Hyperlink modeling using LDA (Erosheva, Fienberg, Lafferty, PNAS, 2004)
  • For each document d = 1,…,M
    • Generate θ_d ~ Dir(α)
    • For each position n = 1,…,N_d
      • Generate z_n ~ Mult(θ_d)
      • Generate w_n ~ Mult(β_{z_n})
    • For each citation j = 1,…,L_d
      • Generate z_j ~ Mult(θ_d)
      • Generate c_j ~ Mult(γ_{z_j})

  [Plate diagram: α → θ → z → w and z → c; plates of size N, L, and M; parameters β and γ]

Learning using variational EM
33
Hyperlink modeling using LDA (Erosheva, Fienberg, Lafferty, PNAS, 2004)
34
Author-Topic Model for Scientific Literature


35
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • For each author a = 1,…,A
    • Generate θ_a ~ Dir(α)
  • For each topic k = 1,…,K
    • Generate φ_k ~ Dir(β)
  • For each document d = 1,…,M
    • For each position n = 1,…,N_d
      • Generate author x ~ Unif(a_d)
      • Generate z_n ~ Mult(θ_x)
      • Generate w_n ~ Mult(φ_{z_n})

  [Plate diagram: document authors a_d → x → z → w; author-topic distributions θ over A authors, topic-word distributions φ over K topics with prior β]
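The author-topic generative process can be sketched as follows. Toy sizes and randomly drawn θ and φ, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

A, K, V = 3, 2, 6   # authors, topics, vocabulary size
theta = rng.dirichlet(np.full(K, 0.5), size=A)  # theta_a ~ Dir(alpha)
phi = rng.dirichlet(np.full(V, 0.5), size=K)    # phi_k  ~ Dir(beta)

def generate_doc(authors, n_words):
    xs, zs, ws = [], [], []
    for _ in range(n_words):
        x = int(rng.choice(authors))      # author x ~ Unif(a_d)
        z = int(rng.choice(K, p=theta[x]))  # z_n ~ Mult(theta_x)
        w = int(rng.choice(V, p=phi[z]))    # w_n ~ Mult(phi_{z_n})
        xs.append(x); zs.append(z); ws.append(w)
    return xs, zs, ws

# a document with author set {0, 2}
xs, zs, ws = generate_doc(authors=[0, 2], n_words=12)
```

The key difference from LDA: the topic mixture is attached to the sampled author, not to the document, so each word is credited to one of the document's authors.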
36
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • Learning: Gibbs sampling

  [Plate diagram: same model as the previous slide]
37
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • Perplexity results

38
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • Topic-author visualization

39
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • Application 1: Author similarity

40
Author-Topic Model for Scientific Literature (Rosen-Zvi, Griffiths, Steyvers, Smyth, UAI 2004)
  • Application 2: Author entropy

41
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
42
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
Gibbs sampling
43
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
  • Datasets
  • Enron email data
  • 23,488 messages between 147 users
  • McCallum's personal email
  • 23,488(?) messages with 128 authors

44
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
  • Topic visualization: Enron set

45
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
  • Topic visualization: McCallum's data

46
Author-Recipient-Topic model for email data (McCallum, Corrada-Emmanuel, Wang, IJCAI 2005)
47
Modeling Citation Influences
48
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Copycat model

49
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Citation influence model

50
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Citation influence graph for LDA paper

51
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Words in LDA paper assigned to citations

52
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Performance evaluation
  • Data
  • 22 seed papers and 132 cited papers
  • Users labeled citations on a scale of 1-4
  • Models considered
  • Citation influence model
  • Copycat model
  • LDA-JS-divergence
  • Symmetric divergence in topic space
  • LDA-post
  • PageRank
  • TF-IDF
  • Evaluation measure
  • Area under the ROC curve
53
Modeling Citation Influences (Dietz, Bickel, Scheffer, ICML 2007)
  • Results

54
Mixed Membership Stochastic Block Models (Work in Progress)
  • A complete generative model for text and
    citations
  • Can model the topicality of citations
  • Topic Specific PageRank
  • Can also predict citations between unseen
    documents

55
Summary
  • Topic modeling is an interesting new framework
    for community analysis
  • Sound theoretical basis
  • Completely unsupervised
  • Simultaneous modeling of multiple fields
  • Discovers soft communities and clusters in
    terms of topic membership
  • Can also be used for predictive purposes