Transcript and Presenter's Notes

Title: 16,000 documents


1
Example
  • 16,000 documents
  • 100 topics
  • Picked the words with the largest p(w|z) (a sketch follows below)
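A minimal sketch of listing the highest-probability words per topic. It assumes a gensim LdaModel already trained on the corpus; gensim and the function name here are illustrative stand-ins, not the authors' own C implementation (see the Implementations slide).

    # List the top words under p(w|z) for each topic of a trained model.
    from gensim.models import LdaModel

    def print_top_words(lda: LdaModel, num_topics: int, topn: int = 10) -> None:
        for k in range(num_topics):
            # show_topic returns [(word, p(w | z = k)), ...] sorted by probability
            pairs = lda.show_topic(k, topn=topn)
            print(f"topic {k}:", ", ".join(word for word, _ in pairs))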

2
(No Transcript)
3
New document?
  • Given a new document, compute the variational parameters γ and φ
  • γ estimates how many words are allocated to each topic
  • φ_n approximates p(z_n | w)
  • Look at the cases where these values are relatively large
  • 4 topics found
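A sketch of inferring the topic mixture of an unseen document, again assuming a gensim LdaModel in place of the paper's variational inference code; doc2bow and get_document_topics are gensim calls, the rest are hypothetical names.

    # Infer topic proportions for a new document and keep the large ones.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    def topics_of_new_doc(lda: LdaModel, vocab: Dictionary, tokens: list[str]):
        bow = vocab.doc2bow(tokens)  # bag-of-words counts for the new document
        # get_document_topics runs inference on the unseen document only
        mixture = lda.get_document_topics(bow, minimum_probability=0.05)
        return sorted(mixture, key=lambda pair: -pair[1])  # largest first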

4
Unseen document (contd.)
  • Bag of words: the words of the "William Randolph Hearst Foundation"
    example sentence are assigned to different topics

5
Applications and empirical results
  • Document modeling
  • Document classification
  • Collaborative filtering

6
Document modeling
  • Task: density estimation, i.e., assign high likelihood to unseen
    documents
  • Measure of goodness: perplexity (defined below)
  • Perplexity decreases monotonically in the likelihood, so lower is
    better
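The paper's definition, for a test set of M documents, where N_d is the length of document d:

    \mathrm{perplexity}(D_{\mathrm{test}}) = \exp\left\{ -\,\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d} \right\}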

7
The experiment
Corpus                 Articles   Terms
Scientific abstracts   5,225      28,414
Newswire articles      16,333     23,075
8
The experiment (contd.)
  • Preprocessing (sketched below):
  • removed stop words
  • removed words appearing only once
  • 10% of the data held out for testing, models trained on the
    remaining 90%
  • all models trained with the same stopping criterion
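A minimal sketch of that preprocessing in plain Python; the stop-word set and the tokenized documents are assumed to be given.

    # Drop stop words and words that occur only once in the whole corpus.
    from collections import Counter

    def preprocess(docs: list[list[str]], stopwords: set[str]) -> list[list[str]]:
        counts = Counter(word for doc in docs for word in doc)
        return [[w for w in doc if w not in stopwords and counts[w] > 1]
                for doc in docs]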

9
Results
10
Overfitting in Mixture of unigrams
  • Posterior is peaked on the training set
  • An unseen document may contain an unseen word
  • That word gets a very small probability, which inflates perplexity
  • Remedy: smoothing (one standard form is shown below)
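As an illustration, Laplace (add-one) smoothing replaces the maximum-likelihood estimate with the formula below, where N is the number of training tokens and V the vocabulary size, so unseen words keep a small nonzero probability. (The paper itself smooths by placing a Dirichlet prior on the multinomial parameters.)

    \hat{p}(w) = \frac{\mathrm{count}(w) + 1}{N + V}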

11
Overfitting in pLSI
  • A mixture of topics is allowed
  • Marginalize over d to find p(w)
  • Restricted to the topic proportions seen in the training documents
  • Folding-in: ignore the p(z|d) parameters and refit p(z|d_new)

12
LDA
  • Documents can have different proportions of
    topics
  • No heuristics

13
(No Transcript)
14
Document classification
  • Generative or discriminative classification
  • Choice of features in document classification
  • LDA as a dimensionality-reduction technique
  • Use the posterior Dirichlet parameters γ*(w) as LDA features

15
The experiment
  • Binary classification
  • 8,000 documents, 15,818 words
  • LDA trained without using the true class labels
  • 50 topics
  • Trained an SVM on the LDA features
  • Compared with an SVM on all word features
  • LDA reduced the feature space by 99.6% (see the sketch below)
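A sketch of that comparison; scikit-learn's SVC is an assumed stand-in for the SVM used in the paper, and the feature matrices are hypothetical inputs.

    # Compare an SVM on 50 LDA features vs. 15,818 raw word features.
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import SVC

    def compare(X_words, X_lda, y):
        # X_words: (n_docs, 15818) term counts; X_lda: (n_docs, 50) topic
        # proportions; y: binary labels (e.g., EARN vs. NOT EARN).
        acc_words = cross_val_score(SVC(kernel="linear"), X_words, y, cv=5).mean()
        acc_lda = cross_val_score(SVC(kernel="linear"), X_lda, y, cv=5).mean()
        return acc_words, acc_lda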

16
GRAIN vs NOT GRAIN
17
EARN vs NOT EARN
18
LDA in document classification
  • Feature space reduced, performance improved
  • Results need further investigation
  • Use for feature selection

19
Collaborative filtering
  • A collection of users and the movies they prefer
  • Trained on a set of fully observed users
  • Task: given a new user and all of their preferred movies but one,
    predict the held-out movie
  • Only users who positively rated at least 100 movies were kept
  • Trained on 89% of the data

20
Some quantities required
  • Probability of the held-out movie, p(w | w_obs)
  • For mixture of unigrams and pLSI: sum out the topic variable
  • For LDA: sum out the topic and Dirichlet variables (a quantity that
    is efficient to compute; see below)
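For LDA this predictive probability integrates over both the topic assignment and the topic proportions θ, with p(θ | w_obs) approximated by the fitted variational posterior:

    p(w \mid \mathbf{w}_{\mathrm{obs}}) = \int \Big( \sum_{z} p(w \mid z)\, p(z \mid \theta) \Big)\, p(\theta \mid \mathbf{w}_{\mathrm{obs}})\, d\theta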

21
Results
22
Further work
  • Other approaches for inference and parameter
    estimation
  • Embedded in another model
  • Other types of data
  • Partial exchangeability

23
Example: visual words
  • Document = image
  • Words = image features (bars, circles)
  • Topics = object categories (face, airplane)
  • Bag of words: no spatial relationship between objects

24
Visual words
25
Identifying the visual words and topics
26
Conclusion
  • Exchangeability, de Finetti's theorem
  • Dirichlet distribution
  • Pro: generative model
  • Con: bag-of-words assumption
  • Con: independence assumption in the Dirichlet distribution
    (relaxed by correlated topic models)

27
Implementations
  • In C (by one of the authors):
  • http://www.cs.princeton.edu/~blei/lda-c/
  • In C and Matlab:
  • http://chasen.org/~daiti-m/dist/lda/

28
References
  • Latent Dirichlet Allocation. D. Blei, A. Ng, and M. Jordan. Journal
    of Machine Learning Research, 3:993-1022, 2003.
  • Discovering Object Categories in Image Collections. J. Sivic, B. C.
    Russell, A. A. Efros, A. Zisserman, W. T. Freeman. MIT AI Lab Memo
    AIM-2005-005, February 2005.
  • Correlated Topic Models. D. Blei and J. Lafferty. Advances in Neural
    Information Processing Systems 18, 2005.