Object Recognition as Machine Translation: Matching Words and Pictures

Slides: 59

Provided by: heather192

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
Object Recognition as Machine Translation
Matching Words and Pictures

Heather Dunlop
16-721 Advanced Perception
April 17, 2006

2
Machine Translation

AltaVista's Babel Fish
There are three more weeks of classes!
Il y a seulement trois semaines supplémentaires
de classes!
Hay solamente tres más semanas de clases!
Ci sono soltanto tre nuove settimane dei codici
categoria!
Es gibt nur drei weitere Wochen Kategorien!

3
Statistical Machine Translation

Statistically link words in one language to words
in another
Requires aligned bitext
e.g. Hansard for the Canadian parliament

4
Statistical Machine Translation

Assuming an unknown one-to-one correspondence
between words, come up with a joint probability
distribution linking words in the two languages
Missing data problem: the solution is EM

5
Multimedia Translation

Data
Words are associated with images, but
correspondences are unknown

sun sea sky
sun sea sky
6
Auto-Annotation

Predicting words for the images

tiger grass cat
7
Region Naming

Can also be applied to object recognition
Requires a large data set

8
Browsing
9
Auto-Illustration
Moby Dick
10
Data Sets of Annotated Images

Corel data set
Museum image collections
News photos (with captions)

11
First Paper

Object Recognition as Machine Translation
Learning a Lexicon for a Fixed Image Vocabulary
by Pinar Duygulu, Kobus Barnard, Nando de
Freitas, David Forsyth
A simple model for annotation and correspondence

12
Overview
13
Input Representation

Segment with Normalized Cuts
Only use regions larger than a threshold
(typically 5-10 per image)
Form vector representation of each region
Cluster regions with k-means to form blob tokens
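
A minimal sketch of the last step, clustering region feature vectors into blob tokens with k-means. The function names, the toy 2-D features and the deterministic initialization (evenly spaced input vectors) are assumptions made for illustration; the slides do not say how k-means was initialized.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_center(vector, centers):
    # Index of the closest cluster center (this index is the blob token).
    return min(range(len(centers)), key=lambda k: squared_distance(vector, centers[k]))

def kmeans(vectors, k, iterations=20):
    # Assumption: initialize from evenly spaced input vectors (needs len(vectors) >= k).
    step = len(vectors) // k
    centers = [list(vectors[i * step]) for i in range(k)]
    for _ in range(iterations):
        # Assignment step: attach every region to its nearest center.
        labels = [nearest_center(v, centers) for v in vectors]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    labels = [nearest_center(v, centers) for v in vectors]
    return centers, labels

# Toy region features forming two well-separated groups.
regions = [[0.0, 0.1], [0.1, 0.0], [0.05, 0.05],
           [5.0, 5.1], [5.1, 5.0], [4.9, 5.0]]
centers, blob_tokens = kmeans(regions, k=2)
print(blob_tokens)  # -> [0, 0, 0, 1, 1, 1]
```

In the experiments described later, the same idea is applied with 500 clusters over the full region feature vectors.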

sun sky waves sea
word tokens
14
Input Representation

Represent each region with a feature vector
Size: portion of the image covered by the region
Position: coordinates of center of mass
Color: avg. and std. dev. of (R,G,B), (L,a,b) and
(r = R/(R+G+B), g = G/(R+G+B))
Texture: avg. and variance of 16 filter responses
Shape: area / perimeter^2, moment of inertia,
region area / area of convex hull
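
A rough sketch of how the size, position and color entries could be computed for one region. The image layout (rows of RGB tuples), the binary mask and all names are assumptions for illustration; the Lab color, texture and shape features are left out.

```python
from statistics import mean, pstdev

def region_features(image, mask):
    """image: rows of (R, G, B) tuples; mask: rows of 0/1 flags marking the region."""
    height, width = len(image), len(image[0])
    coords = [(y, x) for y in range(height) for x in range(width) if mask[y][x]]
    pixels = [image[y][x] for y, x in coords]
    features = {
        # Size: portion of the image covered by the region.
        "size": len(coords) / (height * width),
        # Position: center of mass, normalized by the image dimensions.
        "center": (mean(y for y, _ in coords) / height,
                   mean(x for _, x in coords) / width),
    }
    # Color: average and standard deviation of each RGB channel.
    for i, name in enumerate("RGB"):
        values = [p[i] for p in pixels]
        features["mean_" + name] = mean(values)
        features["std_" + name] = pstdev(values)
    # Chromaticity: r = R/(R+G+B), g = G/(R+G+B); max(..., 1) guards black pixels.
    features["mean_r"] = mean(p[0] / max(sum(p), 1) for p in pixels)
    features["mean_g"] = mean(p[1] / max(sum(p), 1) for p in pixels)
    return features

# Toy 2x2 image: the region is the orange top row.
image = [[(200, 100, 0), (200, 100, 0)],
         [(0, 0, 255), (0, 0, 255)]]
mask = [[1, 1],
        [0, 0]]
feats = region_features(image, mask)
print(feats["size"], feats["center"], feats["mean_R"])
```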

15
Tokenization
16
Assignments

Each word is predicted with some probability by
each blob

17
Expectation Maximization

Select word with highest probability to assign to
each blob

# of words
# of images
# of blobs
probability that blob b_ni translates to word w_nj
probability of obtaining word w_nj given instance
of blob b_ni
18
Expectation Maximization

Initialize to blob-word co-occurrences
Iterate
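
The two steps above can be sketched as a small EM loop over a blob-to-word translation table. This is a simplified sketch, not the paper's exact model: it assumes a uniform prior over which blob each word is assigned to (as in IBM Model 1), and the toy data and all names are made up for illustration.

```python
from collections import defaultdict

def normalize_rows(counts):
    # Turn per-blob counts into conditional probabilities p(word | blob).
    table = {}
    for blob, row in counts.items():
        total = sum(row.values())
        table[blob] = {word: c / total for word, c in row.items()}
    return table

def train_translation_table(images, iterations=20):
    # Initialize to blob-word co-occurrence counts.
    counts = defaultdict(lambda: defaultdict(float))
    for blobs, words in images:
        for b in blobs:
            for w in words:
                counts[b][w] += 1.0
    table = normalize_rows(counts)

    for _ in range(iterations):
        expected = defaultdict(lambda: defaultdict(float))
        for blobs, words in images:
            for w in words:
                # E-step: posterior that word w corresponds to each blob.
                total = sum(table[b][w] for b in blobs)
                for b in blobs:
                    expected[b][w] += table[b][w] / total
        # M-step: re-estimate p(word | blob) from the expected counts.
        table = normalize_rows(expected)
    return table

# Toy corpus: each image is (blob tokens, annotation words), correspondences unknown.
images = [
    (["b_sun", "b_sky"], ["sun", "sky"]),
    (["b_sky", "b_sea"], ["sky", "sea"]),
    (["b_sun", "b_sea"], ["sun", "sea"]),
]
table = train_translation_table(images)
best = {b: max(row, key=row.get) for b, row in table.items()}
print(best)
```

On this toy corpus each blob ends up paired with the word it consistently co-occurs with, even though no image states the pairing directly.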

19
Word Prediction

On a new image
Segment
For each region
Extract features
Find the corresponding blob token using nearest
neighbor
Use the word posterior probabilities to predict
words

20
Refusing to Predict

Require p(word | blob) > threshold
i.e. assign a null word to any blob whose best
predicted word lies below the threshold
Prunes vocabulary, so fit new lexicon
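
A minimal sketch of word prediction with refusal: each region is mapped to its nearest blob token, and the best word is kept only if its probability clears the threshold. The cluster centers, the probability table, the threshold value and the use of None as the null word are illustrative assumptions; refitting the lexicon after pruning is not shown.

```python
def predict_words(region_features, centers, table, threshold=0.5):
    predictions = []
    for features in region_features:
        # Nearest neighbor: find the closest blob token (cluster center).
        blob = min(range(len(centers)),
                   key=lambda k: sum((x - c) ** 2 for x, c in zip(features, centers[k])))
        # Most probable word for that blob under p(word | blob).
        word, prob = max(table[blob].items(), key=lambda item: item[1])
        # Refuse to predict (null word) when the best word is not confident enough.
        predictions.append(word if prob > threshold else None)
    return predictions

# Hypothetical blob centers and word posteriors.
centers = [[0.0, 0.0], [1.0, 1.0]]
table = {
    0: {"sky": 0.8, "sea": 0.2},
    1: {"tiger": 0.4, "cat": 0.35, "grass": 0.25},
}
print(predict_words([[0.1, 0.0], [0.9, 1.2]], centers, table))  # -> ['sky', None]
```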

21
Indistinguishable Words

Visually indistinguishable
cat and tiger, train and locomotive
Indistinguishable with our features
eagle and jet
Entangled correspondence
polar and bear
mare/foals and horse
Solution: cluster similar words
Obtain similarity matrix
Compare words with symmetrised KL divergence
Apply N-Cuts on matrix to get clusters
Replace word with its cluster label
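
A small sketch of the comparison step. It assumes each word is described by a probability distribution over blob tokens and uses the unhalved sum KL(p||q) + KL(q||p); the toy distributions and the smoothing constant are made up, and the N-Cuts clustering of the resulting matrix is not shown.

```python
import math

def symmetrised_kl(p, q, eps=1e-12):
    # KL(p || q) + KL(q || p), with a small eps to avoid log(0).
    total = 0.0
    for key in set(p) | set(q):
        pk = p.get(key, 0.0) + eps
        qk = q.get(key, 0.0) + eps
        total += pk * math.log(pk / qk) + qk * math.log(qk / pk)
    return total

# Hypothetical distributions over two blob tokens for three words.
blob_given_word = {
    "cat":   {0: 0.70, 1: 0.30},
    "tiger": {0: 0.65, 1: 0.35},
    "sun":   {0: 0.05, 1: 0.95},
}
words = sorted(blob_given_word)
# Pairwise dissimilarity matrix; small values mean hard-to-distinguish words.
matrix = [[symmetrised_kl(blob_given_word[a], blob_given_word[b]) for b in words]
          for a in words]
for word, row in zip(words, matrix):
    print(word, [round(value, 3) for value in row])
```

Here cat and tiger come out far closer to each other than either is to sun, which is the kind of pair the clustering step would merge.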

22
Experiments

Train with 4500 Corel images
4-5 words for each image
371 words in vocabulary
5-10 regions per image
500 blobs
Test on 500 images

23
Auto-Annotation

Determine most likely word for each blob
If probability of word is greater than some
threshold, use in annotation

24
Measuring Performance

Do we predict the right words?

25
Region Naming / Correspondence
26
Measuring Performance

Do we predict the right words?
Are they on the right blobs?
Difficult to measure because data set contains no
correspondence information
Must be done by hand on a smaller data set
Not practical to count false negatives

27
Successful Results
28
Successful Results
29
Unsuccessful Results
30
Refusing to Predict
31
Clustering
32
Merging Regions
33
Results
Light bar: average number of times blob predicts
word in correct place
Dark bar: average number of times blob predicts
word which is in the image
34
Second Paper

Matching Words and Pictures
by Kobus Barnard, Pinar Duygulu, Nando de
Freitas,
David Forsyth, David Blei, Michael I. Jordan
Comparing lots of different models for annotation
and correspondence

35
Annotation Models

Multi-modal hierarchical aspect models
Mixture of multi-modal LDA

36
Multi-Modal Hierarchical Aspect Model
cluster: a path from a leaf to the root
37
Multi-Modal Hierarchical Aspect Model

All observations are produced independently of one
another
I-0: as above
I-1: cluster-dependent level structure
p(l|d) replaced with p(l|c,d)
I-2: generative model
p(l|d) replaced with p(l|c)
allows prediction for documents not in the training
set

38
Multi-Modal Hierarchical Aspect Model

Model fitting is done with EM
Word prediction

39
Mixture of Multi-Modal LDA
40
Mixture of Multi-Modal LDA

Distribution parameters estimated with EM
Word prediction

posterior Dirichlet
posterior over mixture components
41
Correspondence Models

Discrete translation
Hierarchical clustering
Linking word and region emission probabilities
Paired word and region emission

42
Discrete Translation

Similar to first paper
Use k-means to vector-quantize the set of
features representing an image region
Construct a joint probability table linking word
tokens to blob tokens
Data set doesn't provide explicit correspondences
Missing data problem => EM

43
Hierarchical Clustering

Again, using vector-quantized image regions
Word prediction

44
Linking Word and Region Emission

Words emitted conditioned on observed blobs
D-0: as above (D for dependent)
D-1: cluster-dependent level distributions
Replace p(l|c,d) with p(l|d)
D-2: generative model
Replace p(l|d) with p(l)

B U W
45
Paired Word and Region Emission at Nodes

Observed words and regions are emitted in pairs
D(w,b)
C-0: as above (C for correspondence)
C-1: cluster-dependent level structure
p(l|d) replaced with p(l|c,d)
C-2: generative model
p(l|d) replaced with p(l|c)

46
Wow, That's a Lot of Models!

Multi-modal hierarchical: I-0, I-1, I-2
Multi-modal LDA
Discrete translation
Hierarchical clustering
Linked word and region emission: D-0, D-1, D-2
Paired word and region emission: C-0, C-1, C-2
Count: 12
Why so many?

47
Evaluation Methods

Annotation performance measures
KL divergence between predicted and target
distributions
Word prediction measure
n: # of words in image
r: # of words predicted correctly
# of words predicted is set to # of actual
keywords
Normalized classification score
w: # of words predicted incorrectly
N: vocabulary size
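
A short sketch of the two count-based measures. The formulas themselves did not survive in this transcript, so the code assumes the common definitions r/n for the word prediction measure and r/n - w/(N - n) for the normalized classification score; function names and the example annotation are illustrative.

```python
def word_prediction_measure(predicted, actual):
    # r / n: fraction of the actual keywords that were predicted.
    actual = set(actual)
    r = len(set(predicted) & actual)
    return r / len(actual)

def normalized_classification_score(predicted, actual, vocabulary_size):
    # r / n - w / (N - n): rewards correct words, penalizes wrong ones.
    predicted, actual = set(predicted), set(actual)
    n = len(actual)
    r = len(predicted & actual)   # words predicted correctly
    w = len(predicted - actual)   # words predicted incorrectly
    return r / n - w / (vocabulary_size - n)

actual = ["tiger", "grass", "cat", "water"]
predicted = ["tiger", "grass", "sky", "sun"]   # same count as the actual keywords
print(word_prediction_measure(predicted, actual))               # -> 0.5
print(normalized_classification_score(predicted, actual, 371))
```

Under this definition the normalized score is 1 for a perfect prediction and 0 for predicting either nothing or the entire vocabulary.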

48
Results

Methods using clustering are very reliant on
having images that are close to the training data
MoM-LDA has strong resistance to over-fitting
D-0 (linked word and region emission) appears to
give best results, taking all measures and data
sets into consideration

49
Successful Results
50
Unsuccessful Results
good annotation, poor correspondence
complete failure
51
N-cuts vs. Blobworld
Blobworld
Normalized Cuts
52
N-cuts vs. Blobworld
53
Browsing Results
Clustering by text only
Clustering by image features only
54
Browsing Results
Clustering by both text and image features
55
Search Results

query: tiger, river

56
Auto-Illustration Results

Passage from Moby Dick
The large importance attached to the
harpooneer's vocation is evinced by the fact,
that originally in the old Dutch Fishery, two
centuries and more ago, the command of a
whale-ship!
Words extracted from the passage using natural
language processing tools
large importance attached fact old dutch century
more command whale ship was per son was divided
officer word means fat cutter time made days was
general vessel whale hunting concern british
title old dutch official present rank such more
good american officer boat night watch ground
command ship deck grand political sea men mast

57
Auto-Illustration Results

Top-ranked images retrieved using all extracted
words

58
Conclusions

Lots of different models developed
Hard to tell which is best
Can be used with any set of features
Numerous applications
Auto-annotation
Region naming (aka object recognition)
Browsing
Searching
Auto-illustration
Improvements in translation from visual to
semantic representations lead to improvements in
image access