Transcript and Presenter's Notes

Title: Latent Dirichlet Allocation


1
Latent Dirichlet Allocation
2
What Do We Want to Do with Text Corpora?
  • clustering
  • summarization
  • classification
  • learn a separate model for each class
  • similarity judgement
  • e.g., essay grading
  • collaborative filtering
  • describe data via compact generative models

3
Bag-of-Words Assumption
  • Word order is irrelevant.
  • Theorem (De Finetti, 1935)
  • if (w1, w2, ..., wN) are infinitely exchangeable,
    then the joint probability p(w1, w2, ..., wN) can
    be represented as a mixture
  • for some random variable θ (see the sketch after
    this list)
  • Exchangeability
  • joint distribution is invariant to permutation of
    elements
  • Infinite exchangeability
  • every finite subsequence is exchangeable
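
A minimal sketch of the mixture representation stated in the theorem, reconstructed here because the slide's formula image is not in the transcript; θ is the latent random variable referred to above:

    p(w_1, \dots, w_N) = \int p(\theta) \left( \prod_{n=1}^{N} p(w_n \mid \theta) \right) d\theta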

4
What's Wrong with pLSI and the Topic Model?
  • Documents have no generative probabilistic
    semantics
  • i.e., document is just a symbol
  • Model has many parameters
  • linear in number of documents
  • need heuristic methods to prevent overfitting
  • Cannot generalize to new documents

5
Unigram Model
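
The unigram model's document probability, reconstructed from the LDA paper since the slide shows only a figure: every word of every document is drawn independently from a single multinomial:

    p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)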
6
Mixture of Unigrams
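
In the mixture of unigrams, each document first draws a single topic z and then draws all of its words from that topic's word distribution; the document probability (again a reconstruction in the LDA paper's notation) is:

    p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)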
7
Topic Model / Probabilistic LSI
  • d is a localist representation of (trained)
    documents
  • LDA provides a distributed representation
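
For reference, the pLSI/topic-model probability of a training-document index d and a word w_n, reconstructed in the LDA paper's notation (the slide shows only the graphical model):

    p(d, w_n) = p(d) \sum_{z} p(w_n \mid z) \, p(z \mid d)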

8
LDA
  • Vocabulary of V words
  • Document is a collection of words from the
    vocabulary
  • N words in document
  • w = (w1, ..., wN)
  • Latent topics
  • random variable z, with values 1, ..., k
  • Like the topic model, a document is generated by
    sampling a topic from a mixture and then sampling
    a word from that topic.
  • But the topic model assumes a fixed mixture of
    topics (multinomial distribution) for each document.
  • LDA assumes a random mixture of topics (drawn from
    a Dirichlet distribution) for each document (see
    the sketch after this list).
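
A minimal Python sketch of this generative process (the function name, argument names, and use of NumPy are illustrative assumptions, not from the slides):

    import numpy as np

    def generate_document(alpha, beta, num_words, seed=0):
        """Sample one document under the LDA generative model.

        alpha : Dirichlet parameters, shape (k,)
        beta  : topic-word probabilities, shape (k, V); rows sum to 1
        """
        rng = np.random.default_rng(seed)
        # Per-document topic mixture: theta ~ Dirichlet(alpha)
        theta = rng.dirichlet(alpha)
        words = []
        for _ in range(num_words):
            # Topic assignment for this word: z_n ~ Multinomial(theta)
            z = rng.choice(len(alpha), p=theta)
            # Word drawn from the chosen topic: w_n ~ Multinomial(beta[z])
            words.append(rng.choice(beta.shape[1], p=beta[z]))
        return words

    # Example: 3 topics over a 5-word vocabulary
    # generate_document(alpha=np.ones(3), beta=np.full((3, 5), 0.2), num_words=10)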

9
Generative Model
  • Plates indicate looping structure
  • Outer plate replicated for each document
  • Inner plate replicated for each word
  • Same conditional distributions apply for each
    replicate
  • Document probability (reconstructed below)
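
The document probability that the plate diagram encodes, reconstructed from the LDA paper since the slide's formula image is not in the transcript:

    p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta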

10
Fancier Version
11
(No Transcript)
12
Geometric Interpretation
(Figure: the word simplex)
13
Geometric Interpretation
(Figure: the topic simplex, with corners topic 1, topic 2, and topic 3, embedded in the word simplex)
14
Geometric Interpretation
  • Mixture of unigrams
  • each document placed at one corner of the topic
    simplex
  • pLSI
  • induces an empirical distribution on the topic
    simplex (x's)
  • LDA
  • places a smooth distribution on the topic simplex
    (contour lines)

15
Inference
16
Inference
  • In general, this formula is intractable
  • Expanded version (reconstructed below), using the
    notation w_n^j = 1 if w_n is the j'th vocabulary
    word, 0 otherwise
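
A reconstruction of the expanded likelihood from the LDA paper (the slide's equation image is not in the transcript):

    p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma(\sum_i \alpha_i)}{\prod_i \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta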
17
Variational Approximation Review
  • Given some equation E(x) that you can't solve
  • Find a solvable equation E'(x, ν) that is related
    to E(x)
  • e.g., E'(x, ν) ≤ E(x)
  • Find parameters ν that bring E' as close as
    possible to E

18
Variational Approximation
  • Computing the log likelihood and introducing
    Jensen's inequality: log E[x] ≥ E[log x]
  • Find a variational distribution q such that the
    resulting bound is computable
  • q parameterized by γ and φ_n
  • Maximize the bound with respect to γ and φ_n to
    obtain the best approximation to p(w | α, β)
    (the bound is sketched below)
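
The resulting lower bound, reconstructed as in the LDA paper; L denotes the bound that is maximized over the variational parameters:

    \log p(\mathbf{w} \mid \alpha, \beta) \ge E_q[\log p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)] - E_q[\log q(\theta, \mathbf{z})] = L(\gamma, \phi; \alpha, \beta)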

19
Variational Approximation
(Figure: graphical model of the variational distribution, with its Dirichlet and multinomial factors labeled)
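
The factorized variational distribution this figure depicts, written out as in the LDA paper (a reconstruction): q(θ | γ) is Dirichlet and each q(z_n | φ_n) is multinomial:

    q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n)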
20
Parameter Estimation via Variational EM
  • E Step
  • For each document, find the optimizing values of
    the variational parameters γ and φ, fixing α and β
    (a sketch of the updates follows this list)
  • M Step
  • Maximize the resulting lower bound on the log
    likelihood with respect to α and β for the values
    of γ and φ found in the E step.
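
A minimal Python sketch of the per-document E-step coordinate-ascent updates from the paper (γ_i = α_i + Σ_n φ_ni and φ_ni ∝ β_{i,w_n} exp(Ψ(γ_i))); the function name and the NumPy/SciPy usage are illustrative assumptions:

    import numpy as np
    from scipy.special import digamma

    def e_step(word_ids, alpha, beta, max_iter=100, tol=1e-6):
        """Coordinate ascent on the variational parameters for one document.

        word_ids : vocabulary indices of the document's N words
        alpha    : Dirichlet parameters, shape (k,)
        beta     : topic-word probabilities, shape (k, V)
        """
        k, n_words = len(alpha), len(word_ids)
        phi = np.full((n_words, k), 1.0 / k)      # q(z_n): start uniform
        gamma = alpha + n_words / k               # q(theta): standard initialization
        for _ in range(max_iter):
            gamma_old = gamma.copy()
            # phi_{ni} proportional to beta_{i, w_n} * exp(digamma(gamma_i))
            phi = beta[:, word_ids].T * np.exp(digamma(gamma))
            phi /= phi.sum(axis=1, keepdims=True)
            # gamma_i = alpha_i + sum_n phi_{ni}
            gamma = alpha + phi.sum(axis=0)
            if np.abs(gamma - gamma_old).sum() < tol:
                break
        return gamma, phi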

21
(No Transcript)
22
Modeling Documents
  • Build a model from a set of training documents
  • Evaluate how well the model characterizes a set of
    test documents
  • perplexity: lower score means better generalization
    (formula sketched below)
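
Perplexity as defined in the LDA paper (a reconstruction; M is the number of test documents and N_d the length of document d):

    \mathrm{perplexity}(D_{\text{test}}) = \exp\left( - \frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right)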

23
Data Sets
  • C. Elegans Community abstracts
  • 5,225 abstracts
  • 28,414 unique terms
  • TREC AP corpus (subset)
  • 16,333 newswire articles
  • 23,075 unique terms
  • Held-out data: 10%
  • Removed terms
  • 50 stop words, words appearing once

24
C. Elegans
Note the folding-in hack needed for pLSI to handle
novel documents: it involves refitting the p(z | d_new)
parameters -> sort of a cheat
25
AP