1
Probabilistic Latent Semantic Analysis
  • Thomas Hofmann
  • Presented by
  • Quang Lam Nguyen
  • Based on Mummoorthy Murugesan, CS 6901

2
Outline
  • Background
  • LSA
  • PLSA
  • Model Fitting
  • Basic I: Maximum Likelihood Estimation
  • Basic II: EM Algorithm
  • Basic III: Overfitting
  • Experimental Results
  • Conclusion

3
Background (1/2)
Probabilistic Latent Semantic Analysis and
Latent Semantic Analysis
  • Latent: present but not evident, hidden
  • Semantic: meaning
  • Hidden meaning of terms and their
    occurrences in documents

4
Background (2/2)
[Figure: example terms in an N-dimensional lexical space (Sport, Muskelkater
"sore muscles", Kater "tomcat / hangover", Auto "car", Bank "bank / bench",
Wagen "car", Park, Einzahlung "deposit") are mapped into a semantic (latent)
space of K << N dimensions; the diagram illustrates polysemy (Kater, Bank) and
synonymy (Auto / Wagen; "Du hast nicht alle Tassen im Schrank" / "Du bist
verrückt", both meaning "you are crazy").]
5
The Setting
  • Set of N documents
  • D = {d_1, ..., d_N}
  • Set of M words
  • W = {w_1, ..., w_M}
  • Set of K latent classes
  • Z = {z_1, ..., z_K}

6
Latent Semantic Indexing (1/2)
  • Term-document matrix A of size N x M to represent
    the frequency counts
  • Singular Value Decomposition (SVD)
  • A (N x M) = U (N x N) E (N x M) V^T (M x M)
  • Keep only the k largest singular values in E
  • A' (N x M) = U (N x k) E (k x k) V^T (k x M)
  • A' ≈ A
  • Each term is represented by k factors, i.e. a vector in
    k-dimensional space (a small NumPy sketch follows below)
  • Terms with common meaning are mapped to the same
    direction
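
As a concrete illustration of the truncated SVD, here is a minimal NumPy sketch (the toy matrix and the variable names are this sketch's own, not from the slides):

  import numpy as np

  # Toy term-document count matrix A: rows = documents (N), columns = terms (M).
  A = np.array([[2., 0., 1., 0.],
                [1., 1., 0., 0.],
                [0., 3., 1., 1.],
                [0., 1., 2., 2.]])

  k = 2                                             # number of latent dimensions to keep
  U, s, Vt = np.linalg.svd(A, full_matrices=False)  # A = U E V^T

  # Keep only the k largest singular values and the corresponding singular vectors.
  A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]       # rank-k approximation A' ≈ A

  # Each term is now a k-dimensional vector; terms with similar meaning end up
  # pointing in similar directions of this reduced space.
  term_vectors = (np.diag(s[:k]) @ Vt[:k, :]).T     # shape (M, k)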


7
Latent Semantic Indexing (2/2)
  • LSI puts documents together even if they don't
    have common words
  • Disadvantages
  • Statistical foundation is missing
  • PLSA addresses this concern!

8
Probabilistic Latent Semantic Analysis
  • Overview
  • Aspect Model
  • Model fitting with EM and TEM
  • Basic I: Maximum Likelihood Estimation
  • Basic II: EM Algorithm
  • Basic III: Overfitting

9
PLSA Overview
  • Automated Document Indexing and Information
    Retrieval
  • Identification of Latent Classes using an
    Expectation Maximization (EM) Algorithm
  • Shown to solve
  • Polysemy and Synonymy
  • Has a better statistical foundation than LSA

10
PLSA Aspect Model (1/3)
  • Aspect Model
  • A document is a mixture of K underlying (latent)
    aspects
  • Each aspect is represented by a distribution over
    words, P(w|z)

11
Aspect Model (2/3)
  • Latent Variable model for general co-occurrence
    data
  • Associate each observation (w,d) with a class
    variable z ∈ Z = {z_1, ..., z_K}
  • Generative model for predicting words
  • Select a document d with probability P(d)
  • Pick a latent class z with probability P(z|d)
  • Generate a word w with probability P(w|z)

Graphical model: d → z → w, with probabilities P(d), P(z|d), P(w|z)
12
Aspect Model (3/3)
  • To get the joint probability model (see the
    formula below)
  • d and w are assumed to be conditionally independent
    given z
  • Now we have to compute P(z), P(z|d), P(w|z), but we
    are given just the documents (d) and words (w).
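
For reference, the joint probability model behind this slide (shown as an image in the original deck; restated here from the asymmetric parameterization of the aspect model) is

    P(d, w) = P(d) \, P(w | d), \qquad P(w | d) = \sum_{z \in Z} P(w | z) \, P(z | d)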

13
Basic I: Maximum Likelihood Estimation
  • The probability model is based on real data
  • → it has to be fit to the data → model fitting
  • Tuning the free parameters of the model to provide an
    optimal fit to the real-world data
  • Choose the parameters so that they make the observed
    data more likely than any other parameter values would
  • Prerequisite: the correct parameters are known!

14
Basic II: EM Algorithm (1/2)
  • Maximum Likelihood Estimation
  • BUT the correct parameters are not known
  • FOR they depend on unknown (hidden) properties!
  • Iterative procedure
  • 1. Expectation step
  • 2. Maximization step

15
Basic II: EM Algorithm (2/2)
  • E-Step (Expectation)
  • Estimate the hidden variables: the expectation of the
    likelihood function is calculated with the
    current parameter values
  • M-Step (Maximization)
  • Determine the actual parameters:
  • Find the parameters that maximize the likelihood
    function (Maximum Likelihood Estimation)

16
Model Fitting (1/3)
  • We have the equation for the log-likelihood function
    of the aspect model (restated below), and we need to
    maximize it.
  • Expectation Maximization (EM) is used for this
    purpose
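
The log-likelihood function referred to here can be restated as follows, where n(d, w) denotes the number of occurrences of word w in document d:

    L = \sum_{d \in D} \sum_{w \in W} n(d, w) \, \log P(d, w)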

17
E-Step: Model Fitting (2/3)
  • P(z|d,w) is the probability that an occurrence of
    word w in document d is explained by aspect z
  • (obtained via Bayes' rule; the formula follows below)
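
The E-step formula behind this slide is the usual Bayes-rule posterior of the aspect model:

    P(z | d, w) = \frac{P(z | d) \, P(w | z)}{\sum_{z'} P(z' | d) \, P(w | z')}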

18
M-Step: Model Fitting (3/3)
  • All of these equations (restated below) use P(z|d,w)
    calculated in the E-step
  • Converges to a local maximum of the likelihood
    function
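
The re-estimation equations referred to on this slide can be restated as follows (n(d, w) again denotes word counts, and n(d) = \sum_w n(d, w)):

    P(w | z) = \frac{\sum_{d} n(d, w) \, P(z | d, w)}{\sum_{d, w'} n(d, w') \, P(z | d, w')},
    \qquad
    P(z | d) = \frac{\sum_{w} n(d, w) \, P(z | d, w)}{n(d)}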

19
Basic III: Overfitting
  • Trade-off between predictive performance on the
    training data and on unseen new data
  • Actual aim: predict the correct output for UNSEEN
    data, too → generalization
  • Problem: the model may adjust too much to very specific
    random features of the training data → overfitting
  • → Tempered EM

20
TEM (Tempered EM)
  • Introduce a control parameter β (see the tempered
    E-step below)
  • β starts from the value 1 and decreases
  • Similar to simulated annealing, with
  • β as the temperature variable
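
In tempered EM, the E-step is modified by the control parameter β, which damps the posterior before normalization:

    P_\beta(z | d, w) = \frac{\left[ P(z | d) \, P(w | z) \right]^{\beta}}{\sum_{z'} \left[ P(z' | d) \, P(w | z') \right]^{\beta}}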

21
Choosing β
  • It controls the trade-off between
  • underfitting and overfitting
  • Simple solution: use held-out data (a part of the
    training data)
  • Train on the training data with β starting from 1
  • Test the model on the held-out data
  • If there is an improvement, continue with the same β
  • If there is no improvement, set β ← ηβ where η < 1
    (a sketch of this schedule follows below)
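
A compact NumPy sketch of one tempered-EM sweep, assuming the counts and distributions are stored as dense arrays (the names n_dw, p_z_d, p_w_z and the function itself are this sketch's own, not from the slides; the held-out check that drives the β schedule above is left to the caller):

  import numpy as np

  def tempered_em_step(n_dw, p_z_d, p_w_z, beta=1.0):
      """One tempered-EM sweep.
      n_dw: (N, M) word counts, p_z_d: (N, K) P(z|d), p_w_z: (K, M) P(w|z)."""
      # Tempered E-step: P(z|d,w) is proportional to [P(z|d) P(w|z)]^beta
      post = (p_z_d[:, :, None] * p_w_z[None, :, :]) ** beta    # shape (N, K, M)
      post /= post.sum(axis=1, keepdims=True) + 1e-12

      # M-step: re-estimate P(w|z) and P(z|d) from the expected counts
      expected = n_dw[:, None, :] * post                        # shape (N, K, M)
      p_w_z = expected.sum(axis=0)                              # sum over documents
      p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
      p_z_d = expected.sum(axis=2)                              # sum over words
      p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
      return p_z_d, p_w_z

Following the slide, one would repeat this sweep at β = 1 until the held-out performance stops improving, then set β ← ηβ and continue.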

22
Experimental Results
  • Perplexity Comparison
  • Polysemy
  • Information Retrieval

23
Perplexity Comparison (1/2)
  • What is perplexity?
  • An indicator of the quality of probability models
    (see the definition below)
  • Lower perplexity means the model is less surprised
    by the test examples
  • Assigning high probability to the test data gives
    lower perplexity and thus good predictions
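
One common way to define perplexity on held-out data (with n'(d, w) the held-out counts) is

    Perplexity = \exp\left( - \frac{\sum_{d, w} n'(d, w) \, \log P(w | d)}{\sum_{d, w} n'(d, w)} \right)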

24
Perplexity Comparison (2/2)
25
Polysemy
  • Occurrences of the word "segment" in two different
    contexts (image vs. sound) are assigned to different
    aspects

26
Information Retrieval
  • For natural language queries, simple term
    matching does not work effectively
  • Ambiguous terms
  • Queries for the same information vary due to
    personal style
  • Latent semantic indexing
  • creates a latent semantic space (hidden
    meaning)

27
Comparing PLSA and LSA
  • LSA and PLSA both perform dimensionality reduction
  • In LSA, by keeping only the K largest singular values
  • In PLSA, by having K aspects
  • Comparison to SVD (see the correspondence written
    out below)
  • The U matrix is related to P(d|z) (document to aspect)
  • The V matrix is related to P(w|z) (aspect to term)
  • The E matrix is related to P(z) (aspect strength)
  • The main difference is the way the approximation
    is done
  • PLSA generates a model (the aspect model) and
    maximizes its predictive power
  • Selecting the proper value of K is heuristic in
    LSA
  • Model selection in statistics can determine the
    optimal K in PLSA
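
The correspondence to SVD can be written out explicitly using the symmetric parameterization of the aspect model: collecting P(d_i | z_k) into U, P(w_j | z_k) into V, and P(z_k) on the diagonal of E gives a non-negative, probabilistic analogue of the SVD of the joint distribution:

    P(d_i, w_j) = \sum_{k} P(z_k) \, P(d_i | z_k) \, P(w_j | z_k)
    \quad \Longleftrightarrow \quad
    P = U \, E \, V^{\top}, \quad U_{ik} = P(d_i | z_k), \; V_{jk} = P(w_j | z_k), \; E = \mathrm{diag}(P(z_k))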

28
Conclusion
  • PLSI consistently outperforms LSI in the
    experiments
  • The precision gain is 100% compared to the baseline
    method in some cases
  • PLSA has a statistical theory to support it, and is
    thus better founded than LSA.