Title: A Bayesian Hierarchical Model for Learning Natural Scene Categories (L. Fei-Fei and P. Perona, CVPR 2005) and Discovering Objects and Their Location in Images (J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman, ICCV 2005)

Slide 1: Title
- A Bayesian Hierarchical Model for Learning Natural Scene Categories. L. Fei-Fei and P. Perona. CVPR 2005.
- Discovering Objects and Their Location in Images. J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. ICCV 2005.
- Tomasz Malisiewicz (tomasz_at_cmu.edu), Advanced Machine Perception, February 2006
Slide 2: Graphical Models, a Recent Trend in Machine Learning
- Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.
Slide 3: Outline
- Goals of both vision papers
- Techniques from statistical text modeling
  - pLSA vs. LDA
- Scene classification via LDA
- Object discovery via pLSA
Slide 4: Goal: Learn and Recognize Natural Scene Categories
- Classify a scene without first extracting objects
- Other techniques we know of:
  - Global frequency (Oliva and Torralba)
  - Texton histogram (Renninger, Malik et al.)
Slide 5: Goal: Discover Object Categories
- Discover what objects are present in a collection of images in an unsupervised way
- Find those same objects in novel images
- Determine which local image features correspond to which objects, i.e. segment the image
Slide 6: Enter the World of Statistical Text Modeling
- D. Blei, A. Ng, and M. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993-1022, January 2003.
- Bag-of-words approaches: the order of words in a document can be neglected
- Graphical model fun
Slide 7: Bag-of-words
- A document is a collection of M words
- A corpus (a collection of documents) is summarized in a term-document matrix (a small sketch follows below)
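To make the bag-of-words idea concrete, here is a minimal sketch (not from either paper) of building a term-document count matrix with numpy; the toy corpus and vocabulary are made up for illustration.

```python
import numpy as np

# Toy corpus: word order is ignored, only counts matter.
corpus = [
    "the beach has sand and water",
    "the mountain has snow and rock",
    "sand and rock and water",
]

# Build the vocabulary and the term-document matrix
# (here rows are documents, columns are words).
vocab = sorted({w for doc in corpus for w in doc.split()})
word_index = {w: j for j, w in enumerate(vocab)}

counts = np.zeros((len(corpus), len(vocab)), dtype=int)
for i, doc in enumerate(corpus):
    for w in doc.split():
        counts[i, word_index[w]] += 1

print(vocab)
print(counts)
```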
Slide 10: 1990: Latent Semantic Analysis (LSA)
- Goal: map high-dimensional count vectors to a lower-dimensional representation that reveals semantic relations between words
- The lower-dimensional space is called the latent semantic space
- Dim(latent space) = K
Slide 11: 1990: Latent Semantic Analysis (LSA)
- D = {d_1, ..., d_N}: N documents
- W = {w_1, ..., w_M}: M words
- N_ij = n(d_i, w_j): the N x M co-occurrence (term-document) matrix
Slide 12: What did we just do?
- Singular Value Decomposition
Slide 13: LSA Summary
- SVD on the term-document matrix
- Approximate N by setting all but the largest K singular values to zero
- This produces the rank-K approximation to N that is optimal in the L2-matrix (Frobenius) norm sense (a numpy sketch follows below)
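A minimal LSA sketch along the lines just described: take the SVD of a count matrix and keep only the K largest singular values to get the rank-K approximation. The matrix here is a random stand-in for a real term-document matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
N = rng.integers(0, 5, size=(8, 12)).astype(float)  # stand-in term-document matrix
K = 2                                                # dimension of the latent space

# SVD: N = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(N, full_matrices=False)

# Keep only the K largest singular values -> rank-K approximation,
# optimal in the Frobenius (L2 matrix) norm sense.
N_k = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# Documents (rows) projected into the K-dimensional latent semantic space.
doc_coords = U[:, :K] * s[:K]

print("approximation error:", np.linalg.norm(N - N_k))
print("latent-space document coordinates shape:", doc_coords.shape)
```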
Slide 14: LSA and Polysemy
- Polysemy: the ambiguity of an individual word or phrase that can be used (in different contexts) to express two or more different meanings
- Under the LSA model, the coordinates of a word in latent space can be written as a linear superposition of the coordinates of the documents that contain the word
Slide 15: Problems with LSA
- LSA does not define a properly normalized probability distribution
- There is no obvious interpretation of the directions in the latent space
- Statistically, the use of the L2 norm in LSA corresponds to a Gaussian error assumption, which is hard to justify in the context of count variables
- The polysemy problem remains
Slide 16: pLSA to the Rescue
- Probabilistic Latent Semantic Analysis
- pLSA relies on the likelihood function of multinomial sampling and aims at an explicit maximization of the predictive power of the model
Slide 17: pLSA to the Rescue
(Slide credit: Josef Sivic)
Slide 18: Learning the pLSA Parameters
- Observed: the counts n(w_i, d_j) of word i in document j
- Unlike LSA, pLSA does not minimize any type of squared deviation; the parameters are estimated in a probabilistically sound way
- Maximize the likelihood of the data using EM, which is equivalent to minimizing the KL divergence between the empirical distribution and the model
(Slide credit: Josef Sivic)
Slide 19: EM for pLSA (training on a corpus)
- E-step: compute the posterior probabilities P(z_k|d_j, w_i) of the latent variables
- M-step: maximize the expected complete-data log-likelihood, which re-estimates P(w_i|z_k) and P(z_k|d_j)
- (A code sketch of these updates follows below)
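A minimal EM sketch for pLSA (not the authors' code): the E-step computes P(z|d,w) from the current parameters, and the M-step re-estimates P(w|z) and P(z|d) from the expected counts.

```python
import numpy as np

def plsa_em(counts, K, n_iters=100, seed=0):
    """counts: (D, W) matrix of word counts n(d, w); K: number of topics."""
    rng = np.random.default_rng(seed)
    D, W = counts.shape
    # Random normalized initialization of P(w|z) and P(z|d).
    p_w_given_z = rng.random((K, W)); p_w_given_z /= p_w_given_z.sum(1, keepdims=True)
    p_z_given_d = rng.random((D, K)); p_z_given_d /= p_z_given_d.sum(1, keepdims=True)

    for _ in range(n_iters):
        # E-step: posterior P(z|d,w), shape (D, W, K).
        joint = p_z_given_d[:, None, :] * p_w_given_z.T[None, :, :]
        post = joint / np.maximum(joint.sum(2, keepdims=True), 1e-12)
        # M-step: re-estimate parameters from the expected counts n(d,w) P(z|d,w).
        expected = counts[:, :, None] * post                        # (D, W, K)
        p_w_given_z = expected.sum(0).T                              # (K, W)
        p_w_given_z /= np.maximum(p_w_given_z.sum(1, keepdims=True), 1e-12)
        p_z_given_d = expected.sum(1)                                # (D, K)
        p_z_given_d /= np.maximum(p_z_given_d.sum(1, keepdims=True), 1e-12)
    return p_w_given_z, p_z_given_d

# Example usage on a random stand-in corpus:
counts = np.random.default_rng(1).integers(0, 5, size=(10, 30))
p_w_given_z, p_z_given_d = plsa_em(counts, K=3)
```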
Slide 20: Graphical View of pLSA
- pLSA is a generative model:
  - Select a document d_i with probability P(d_i)
  - Pick a latent class z_k with probability P(z_k|d_i)
  - Generate a word w_j with probability P(w_j|z_k)
- (Figure: d and w are observed variables, z is the latent variable, and plates denote replication)
Slide 21: How does pLSA deal with previously unseen documents?
- The "folding-in" heuristic:
  - First train on the corpus to obtain P(w|z)
  - Now re-run the same training EM algorithm, but do not re-estimate P(w|z), and let D = {d_unseen}, so only P(z|d_unseen) is updated
- (A sketch follows below)
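A hedged sketch of the folding-in heuristic as described here: rerun the same E/M updates for a single unseen document, keeping the learned P(w|z) fixed and updating only P(z|d_unseen). It reuses the array shapes from the `plsa_em` sketch above.

```python
import numpy as np

def plsa_fold_in(counts_unseen, p_w_given_z, n_iters=50, seed=0):
    """counts_unseen: (W,) word counts of one unseen document.
    p_w_given_z: (K, W) topic-word distributions learned on the corpus (held fixed)."""
    rng = np.random.default_rng(seed)
    K = p_w_given_z.shape[0]
    p_z_given_d = rng.random(K); p_z_given_d /= p_z_given_d.sum()
    for _ in range(n_iters):
        # E-step for the single document: P(z|d_unseen, w), shape (W, K).
        joint = p_w_given_z.T * p_z_given_d
        post = joint / np.maximum(joint.sum(1, keepdims=True), 1e-12)
        # M-step: update only the document-specific mixing weights.
        p_z_given_d = (counts_unseen[:, None] * post).sum(0)
        p_z_given_d /= max(p_z_given_d.sum(), 1e-12)
    return p_z_given_d
```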
Slide 22: Problems with pLSA
- It is not a well-defined generative model of documents: d is a dummy index into the list of training documents (it takes as many values as there are training documents)
- There is no natural way to assign probability to a previously unseen document
- The number of parameters to be estimated grows with the size of the training set
Slide 23: LDA to the Rescue
- Latent Dirichlet Allocation treats the topic mixture weights as a k-parameter hidden random variable and places a Dirichlet prior on the multinomial mixing weights
- The Dirichlet distribution is conjugate to the multinomial distribution (the most natural prior to choose: the posterior distribution is also a Dirichlet!)
- (A sampling sketch of the LDA generative process follows below)
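To see what the Dirichlet prior buys us, here is a minimal sketch of LDA's generative process: per-document topic weights are drawn from Dir(alpha), each word's topic from those weights, and each word from that topic's multinomial. All sizes are toy values, and the topic-word distributions are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, doc_len = 3, 20, 50                   # topics, vocabulary size, words per document
alpha = np.full(K, 0.5)                     # Dirichlet prior on topic mixing weights
beta = rng.dirichlet(np.ones(V), size=K)    # K topic-word multinomials (stand-ins)

def generate_document():
    theta = rng.dirichlet(alpha)                          # topic mixture for this document
    z = rng.choice(K, size=doc_len, p=theta)              # a topic for each word position
    w = np.array([rng.choice(V, p=beta[k]) for k in z])   # a word from each chosen topic
    return theta, z, w

theta, z, w = generate_document()
print("topic mixture:", np.round(theta, 2))
print("first 10 words:", w[:10])
```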
Slide 24: Corpus-Level Parameters in LDA
- Alpha and beta are corpus-level parameters, sampled once in the process of generating a corpus (they sit outside the plates!)
- Alpha and beta must be estimated before we can find the topic mixing proportions of a previously unseen document
- (Figure: the LDA graphical model)
Slide 25: Getting Rid of Plates
- Thanks to Jonathan Huang for the un-plated LDA graphic
Slide 26: Inference in LDA
- Inference: estimation of the document-level parameters
- This is intractable to compute exactly, so we must employ approximate inference
Slide 27: Approximate Inference in LDA
- Variational methods: use Jensen's inequality to obtain a lower bound on the log-likelihood, indexed by a set of variational parameters
- The optimal (document-specific) variational parameters are obtained by minimizing the KL divergence between the variational distribution and the true posterior
- Variational methods are one way of doing this; Gibbs sampling (MCMC) is another (a Gibbs sampling sketch follows below)
- (Figure: the variational distribution)
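Since the slide mentions Gibbs sampling as an alternative, here is a minimal collapsed Gibbs sampling sketch for LDA (not the inference used in either paper); `docs` is a list of word-index arrays and the hyperparameters are illustrative.

```python
import numpy as np

def lda_gibbs(docs, K, V, alpha=0.1, eta=0.01, n_iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: list of 1-D arrays of word indices."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(0, K, size=len(d)) for d in docs]    # random initial topic assignments
    ndk = np.zeros((len(docs), K))                          # document-topic counts
    nkw = np.zeros((K, V))                                  # topic-word counts
    nk = np.zeros(K)                                        # per-topic totals
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                                 # remove the current assignment
                ndk[d, k] -= 1; nkw[k, w] -= 1; nk[k] -= 1
                # Full conditional: p(z=k | rest) is proportional to
                # (n_dk + alpha) * (n_kw + eta) / (n_k + V*eta).
                p = (ndk[d] + alpha) * (nkw[:, w] + eta) / (nk + V * eta)
                k = rng.choice(K, p=p / p.sum())            # resample the topic
                z[d][i] = k
                ndk[d, k] += 1; nkw[k, w] += 1; nk[k] += 1

    # Posterior mean estimates of topic-word and document-topic distributions.
    phi = (nkw + eta) / (nkw.sum(1, keepdims=True) + V * eta)
    theta = (ndk + alpha) / (ndk.sum(1, keepdims=True) + K * alpha)
    return phi, theta
```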
Slide 28: Look at some P(w|z) produced by LDA
- Show some pLSA and LDA results applied to text
- From an LDA project by Tomasz Malisiewicz and Jonathan Huang
- Search for the word "drive"
Slide 29: pLSA and LDA Applied to Images
- How can one apply these techniques to images?
Slide 30: Hierarchical Bayesian Text Models
- Probabilistic Latent Semantic Analysis (pLSA): Hofmann, 2001
- Latent Dirichlet Allocation (LDA): Blei et al., 2001
Slide 31: Hierarchical Bayesian Text Models
- Probabilistic Latent Semantic Analysis (pLSA), as used by Sivic et al., ICCV 2005
Slide 32: Hierarchical Bayesian Text Models
- Latent Dirichlet Allocation (LDA), as used by Fei-Fei et al., CVPR 2005
Slide 33: A Bayesian Hierarchical Model for Learning Natural Scene Categories
Slide 34: Flow Chart: Quick Overview
Slide 35: How to Generate an Image?
- Choose a scene (mountain, beach, ...)
- Given the scene, generate an intermediate probability vector over themes
- For each word:
  - Determine the current theme from the mixture of themes
  - Draw a codeword from that theme
Slide 36:
- Choose a category label c ~ p(c|eta)
  - eta is the multinomial prior over scene categories
- Choose pi ~ p(pi|c, theta)
  - pi is a multinomial distribution over themes
  - theta is a C x K (categories x themes) matrix; theta_c is the K-dimensional Dirichlet parameter conditioned on the category c
- For each of the N patches:
  - Choose theme z_n ~ Mult(pi)
  - Choose patch x_n ~ p(x_n|z_n, beta)
  - beta is a matrix of size K x T (themes x words)
- (A sampling sketch of this process follows below)
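A minimal sampling sketch of the generative process just described (an approximation of the paper's model, with made-up parameter values): pick a category, draw the theme proportions pi from that category's Dirichlet, then draw a theme and a codeword for each patch.

```python
import numpy as np

rng = np.random.default_rng(0)
C, K, T, N = 4, 5, 174, 100    # categories, themes, codewords, patches per image
eta = np.full(C, 1.0 / C)      # multinomial prior over scene categories
theta = rng.random((C, K)) + 0.5           # per-category Dirichlet parameters (C x K)
beta = rng.dirichlet(np.ones(T), size=K)   # theme-codeword distributions (K x T)

def generate_image():
    c = rng.choice(C, p=eta)                # scene category label
    pi = rng.dirichlet(theta[c])            # theme mixing proportions for this image
    z = rng.choice(K, size=N, p=pi)         # a theme for each patch
    x = np.array([rng.choice(T, p=beta[k]) for k in z])  # a codeword for each patch
    return c, pi, z, x

c, pi, z, x = generate_image()
print("category:", c, "first 5 codewords:", x[:5])
```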
Slide 37: How to Generate an Image?
Slide 38: Inference
- How do we make a decision on a novel image?
- Integrate over the latent variables to get the likelihood p(x|c) of the image under each category
- Requires approximate variational inference (not easy, though Gibbs sampling is supposed to be easier)
Slide 39: Codebook
- A codebook of 174 local image patches (codewords)
- Detection:
  - Evenly sampled grid
  - Random sampling
  - Saliency detector
  - Lowe's DoG detector
- Representation:
  - Normalized 11x11 gray values
  - 128-dim SIFT
Slide 40: Results: Average Performance of 64%
- 100 training examples and 50 test examples
- Rank statistic test: the probability that a test scene correctly belongs to one of the top N most probable categories (a small sketch of this test follows below)
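A small sketch of the rank statistic test described above: for each test image, check whether the true category falls among the top-N most probable categories under the model. The posterior matrix here is a random stand-in.

```python
import numpy as np

def rank_statistic(posteriors, true_labels, top_n):
    """posteriors: (num_images, num_categories) class probabilities per test image.
    Returns the fraction of images whose true category is among the top-n most probable."""
    # Indices of the top-n categories for each image, highest probability first.
    top = np.argsort(posteriors, axis=1)[:, ::-1][:, :top_n]
    hits = [true_labels[i] in top[i] for i in range(len(true_labels))]
    return np.mean(hits)

rng = np.random.default_rng(0)
post = rng.random((50, 13)); post /= post.sum(1, keepdims=True)  # stand-in posteriors
labels = rng.integers(0, 13, size=50)                            # stand-in ground truth
print(rank_statistic(post, labels, top_n=3))
```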
Slide 41: Results: The Distributions
- Theme distribution
- Codeword distribution
Slide 42: The Peak at 174 (codewords)
Slide 43: Summary of Detection and Representation Choices
- SIFT outperforms raw pixel gray values
- The sliding grid, which creates the largest number of patches, does best
Slide 44: Discovering Objects and Their Location in Images
Slide 45: Visual Words
- Vector-quantized SIFT descriptors computed on regions
- Regions come from elliptical shape adaptation around interest points and from the maximally stable regions of Matas et al.
- Both are elliptical regions represented at twice their detected scale
Slide 46: Building a Vocabulary
Slide 47: Building a Vocabulary
- K-means clustering of ~300K regions to get about 1K clusters for each of the Shape Adapted and Maximally Stable region types (a k-means sketch follows below)
- Vector quantization
(Slide credit: Josef Sivic)
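A minimal vector-quantization sketch in the spirit of this slide: cluster SIFT-like descriptors with k-means, then assign each new descriptor to its nearest cluster center to get a visual word index. Random data stands in for the real descriptors, and the cluster count is tiny for illustration.

```python
import numpy as np

def kmeans(X, k, n_iters=20, seed=0):
    """Plain k-means: X is (n_samples, dim); returns (k, dim) cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Assign every descriptor to its nearest center.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute each center as the mean of its assigned descriptors.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def quantize(descriptors, centers):
    """Map each descriptor to the index of its nearest center (its visual word)."""
    d = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

rng = np.random.default_rng(1)
train_descs = rng.random((1000, 128))   # stand-in for ~300K SIFT descriptors
centers = kmeans(train_descs, k=20)     # stand-in for ~1K visual words
words = quantize(rng.random((10, 128)), centers)
print(words)
```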
Slide 48: pLSA Training
- Sanity check: remember which quantities must be estimated? (The topic-word distributions P(w|z) and the per-document mixing weights P(z|d).)
Slide 49: Results 1: Topic Discovery
- This is just the training stage
- Obtain P(z_k|d_j) for each image, then classify the image as containing object k according to the maximum of P(z_k|d_j) over k (snippet below)
- 4 object categories, plus background
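The classification rule on this slide is just an argmax over the document-topic posterior; a minimal snippet, with a random stand-in for P(z|d):

```python
import numpy as np

rng = np.random.default_rng(0)
p_z_given_d = rng.random((6, 5))                       # stand-in P(z_k|d_j): 6 images, 5 topics
p_z_given_d /= p_z_given_d.sum(axis=1, keepdims=True)

# Classify each image as containing the object of its most probable topic.
predicted_topic = p_z_given_d.argmax(axis=1)
print(predicted_topic)
```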
Slide 50: Results 1: Topic Discovery
Slide 51: Results 2: Classifying New Images
- Object categories are learned on a corpus, then those categories are found in new images
- Anybody remember how this is done? (Recall the index d in the graphical model.)
Slide 52: How does pLSA deal with previously unseen documents?
- The "folding-in" heuristic:
  - First train on the corpus to obtain P(w|z)
  - Now re-run the same training EM algorithm, but do not re-estimate P(w|z), and let D = {d_unseen}
Slide 53: Results 2: Classifying New Images
- Train on one set and test on another
Slide 54: Results 3: Segmentation
- Localization and segmentation of objects
- For a word occurrence in a particular document we can examine the probability of different topics
- Keep words with P(z_k|d_j, w_i) > 0.8 (snippet below)
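A minimal sketch of that thresholding step: compute P(z|d,w) for each visual word in an image from P(w|z) and P(z|d), and keep the word occurrences whose posterior for the topic of interest exceeds 0.8. The distributions here are random stand-ins.

```python
import numpy as np

def segment_words(word_indices, p_w_given_z, p_z_given_d, topic, thresh=0.8):
    """word_indices: visual-word index of each region in one image.
    Returns a boolean mask of regions assigned to `topic` with posterior > thresh."""
    # P(z|d,w) is proportional to P(w|z) P(z|d) for each word occurrence.
    joint = p_w_given_z[:, word_indices].T * p_z_given_d        # (num_regions, K)
    post = joint / joint.sum(axis=1, keepdims=True)
    return post[:, topic] > thresh

rng = np.random.default_rng(0)
K, V = 4, 100
p_w_given_z = rng.dirichlet(np.ones(V), size=K)   # stand-in topic-word distributions
p_z_given_d = rng.dirichlet(np.ones(K))           # stand-in topic weights for this image
regions = rng.integers(0, V, size=30)             # visual word of each region
print(segment_words(regions, p_w_given_z, p_z_given_d, topic=0))
```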
Slide 55: Results 3: Segmentation
- Note: the words shown are not the most probable words for a topic; they are words that have both a high probability of occurring in the topic AND a high probability of occurring in the image
Slide 56: Results 3: Segmentation and Doublets
- Two-class image dataset consisting of half of the faces (218 images) and backgrounds (217 images)
- A 4-topic pLSA model is learned for all training faces and training backgrounds, with 3 fixed background topics, i.e. one (face) topic is learned in addition to the three fixed background topics
- A doublet vocabulary is then formed from the top 100 visual words of the face topic; a second 4-topic pLSA model is then learned for the combined vocabulary of singlets and doublets, with the background topics fixed (an illustrative sketch follows below)
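The doublet construction is only summarized above; as a loose illustration (not the paper's exact procedure), one can pair each occurrence of a top face-topic word with its spatially nearest other top-word occurrence and treat the pair of word identities as a new vocabulary entry:

```python
import numpy as np

def form_doublets(words, positions, top_words):
    """words: visual-word index per region; positions: (n, 2) region centers;
    top_words: set of visual words from the topic of interest.
    Returns a list of doublet labels (sorted pairs of word indices)."""
    idx = [i for i, w in enumerate(words) if w in top_words]
    doublets = []
    for i in idx:
        others = [j for j in idx if j != i]
        if not others:
            break
        # Pair this occurrence with its nearest other top-word occurrence.
        dists = [np.linalg.norm(positions[i] - positions[j]) for j in others]
        j = others[int(np.argmin(dists))]
        doublets.append(tuple(sorted((words[i], words[j]))))
    return doublets

rng = np.random.default_rng(0)
words = rng.integers(0, 1000, size=40)                             # stand-in visual words
positions = rng.random((40, 2))                                    # stand-in region centers
top_words = set(rng.choice(1000, size=100, replace=False).tolist())
print(form_doublets(words, positions, top_words)[:5])
```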
Slide 57: Doublets
- Face segmentation scores: singletons 0.49, doublets 0.61
- (Efros: didn't work as much as you'd think)
Slide 58: Conclusions
- Showed how both papers use bag-of-words approaches
- We're now ready to become experts on generative models like pLSA and LDA
- Graphical model fun! (Carlos Guestrin teaches Graphical Models)
Slide 59: Are you really into Graphical Models?
- Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS, Dec. 2005.
Slide 60: References
- A Bayesian Hierarchical Model for Learning Natural Scene Categories. L. Fei-Fei and P. Perona. CVPR 2005.
- Describing Visual Scenes using Transformed Dirichlet Processes. E. Sudderth, A. Torralba, W. Freeman, and A. Willsky. NIPS 2005.
- Discovering Objects and Their Location in Images. J. Sivic, B. Russell, A. Efros, A. Zisserman, and W. Freeman. ICCV 2005.
- Latent Dirichlet Allocation. D. Blei, A. Ng, and M. Jordan. JMLR, 3:993-1022, 2003.
- Unsupervised Learning by Probabilistic Latent Semantic Analysis. T. Hofmann. Machine Learning, 2001.