1
Object Recognition as Machine Translation
Matching Words and Pictures
  • Heather Dunlop
  • 16-721 Advanced Perception
  • April 17, 2006

2
Machine Translation
  • AltaVista's Babel Fish
  • There are only three more weeks of classes!
  • Il y a seulement trois semaines supplémentaires
    de classes! (French: only three more weeks of
    classes)
  • Hay solamente tres más semanas de clases!
    (Spanish: only three more weeks of classes, with
    garbled word order)
  • Ci sono soltanto tre nuove settimane dei codici
    categoria! (Italian: only three new weeks of the
    category codes!)
  • Es gibt nur drei weitere Wochen Kategorien!
    (German: only three more weeks categories!)

3
Statistical Machine Translation
  • Statistically link words in one language to words
    in another
  • Requires an aligned bitext
  • e.g., the Hansard transcripts of the Canadian
    parliament

4
Statistical Machine Translation
  • Assuming an unknown one-to-one correspondence
    between words, come up with a joint probability
    distribution linking words in the two languages
  • Missing data problem: the solution is EM
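A standard way to make this concrete (my addition; the slide itself gives no formula) is the IBM Model 1 likelihood, where each word f_j in one language is generated by some word e_i in the other through a hidden alignment, and EM estimates the translation table t:

\[ p(f \mid e) \propto \prod_{j=1}^{m} \sum_{i=1}^{\ell} t(f_j \mid e_i) \]

The same structure reappears below, with image blobs in place of source-language words.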

5
Multimedia Translation
  • Data
  • Words are associated with images, but
    correspondences are unknown

(example images annotated: sun sea sky)
6
Auto-Annotation
  • Predicting words for the images

(example predicted annotation: tiger grass cat)
7
Region Naming
  • Can also be applied to object recognition
  • Requires a large data set

8
Browsing
9
Auto-Illustration
Moby Dick
10
Data Sets of Annotated Images
  • Corel data set
  • Museum image collections
  • News photos (with captions)

11
First Paper
  • Object Recognition as Machine Translation
    Learning a Lexicon for a Fixed Image Vocabulary
  • by Pinar Duygulu, Kobus Barnard, Nando de
    Freitas, David Forsyth
  • A simple model for annotation and correspondence

12
Overview
13
Input Representation
  • Segment with Normalized Cuts
  • Only use regions larger than a threshold
    (typically 5-10 per image)
  • Form vector representation of each region
  • Cluster regions with k-means to form blob tokens

(example word tokens: sun, sky, waves, sea)
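A minimal sketch of the blob-tokenization step in Python, assuming region feature vectors have already been extracted (function and variable names are illustrative, not from the papers' code):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_blob_tokens(region_features, n_blobs=500, seed=0):
    """Vector-quantize region feature vectors into discrete blob tokens."""
    X = np.vstack(region_features)           # (num_regions, feature_dim)
    km = KMeans(n_clusters=n_blobs, random_state=seed, n_init=10).fit(X)
    return km                                # km.predict(x) maps a region to a blob id
```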
14
Input Representation
  • Represent each region with a feature vector:
  • Size: portion of the image covered by the region
  • Position: coordinates of the region's center of
    mass
  • Color: avg. and std. dev. of (R,G,B), (L,a,b), and
    chromaticity (r = R/(R+G+B), g = G/(R+G+B))
  • Texture: avg. and variance of 16 filter responses
  • Shape: area / perimeter², moment of inertia,
    region area / area of convex hull
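A hedged sketch of part of this feature vector, for an RGB image img of shape (H, W, 3) and a boolean region mask; the texture and shape terms are omitted for brevity, and the feature list above is the authoritative one:

```python
import numpy as np

def region_features(img, mask):
    ys, xs = np.nonzero(mask)
    h, w = img.shape[:2]
    size = len(xs) / (h * w)                       # fraction of image covered
    cy, cx = ys.mean() / h, xs.mean() / w          # normalized center of mass
    rgb = img[mask].astype(float)                  # (num_pixels, 3)
    chrom = rgb / np.maximum(rgb.sum(axis=1, keepdims=True), 1e-9)
    return np.concatenate([
        [size, cy, cx],
        rgb.mean(axis=0), rgb.std(axis=0),         # avg. and std. dev. of (R,G,B)
        chrom.mean(axis=0)[:2],                    # avg. chromaticity (r, g)
    ])
```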

15
Tokenization
16
Assignments
  • Each word is predicted with some probability by
    each blob

17
Expectation Maximization
  • Select word with highest probability to assign to
    each blob

The model maximizes the likelihood

\[ p(w \mid b) = \prod_{n=1}^{N} \prod_{j=1}^{M_n} \sum_{i=1}^{L_n} p(a_{nj} = i)\, t(w = w_{nj} \mid b = b_{ni}) \]

where N is the number of images, M_n the number of words
and L_n the number of blobs in image n, p(a_{nj} = i) the
probability that blob b_{ni} translates to word w_{nj}, and
t(w_{nj} | b_{ni}) the probability of obtaining word w_{nj}
given an instance of blob b_{ni}.
18
Expectation Maximization
  • Initialize t(w|b) from blob-word co-occurrence
    counts
  • Iterate the E- and M-steps until convergence
    (sketched below)
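A compact sketch of this EM loop (data layout and names are mine; initialization here is uniform rather than co-occurrence-based, for brevity):

```python
import numpy as np

def em_translation_table(images, n_blobs, n_words, n_iters=20):
    """images: list of (blob_ids, word_ids) index arrays, one pair per image."""
    t = np.full((n_blobs, n_words), 1.0 / n_words)   # t[b, w] ~ p(w | b)
    for _ in range(n_iters):
        counts = np.zeros_like(t)
        for blobs, words in images:
            blobs = np.asarray(blobs)
            for w in words:
                p = t[blobs, w]                      # E-step: each blob's
                p = p / p.sum()                      # responsibility for word w
                np.add.at(counts, (blobs, w), p)     # accumulate expected counts
        t = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1e-12)  # M-step
    return t
```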

19
Word Prediction
  • On a new image:
  • Segment
  • For each region:
  • Extract features
  • Find the corresponding blob token using nearest
    neighbor
  • Use the word posterior probabilities to predict
    words (a sketch follows this list)
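Putting the pieces together for a new image, as a sketch: km is the k-means model from the tokenization step, t the fitted translation table, and extract_regions a hypothetical stand-in for the segmentation pipeline (as is the threshold value). The threshold implements the "refusing to predict" idea on the next slide:

```python
def predict_words(img, km, t, vocab, threshold=0.2):
    words = set()
    for mask in extract_regions(img):          # hypothetical segmenter
        feats = region_features(img, mask)
        blob = km.predict(feats[None, :])[0]   # nearest blob token
        w = t[blob].argmax()                   # most probable word for this blob
        if t[blob, w] > threshold:             # below threshold -> null word
            words.add(vocab[w])
    return words
```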

20
Refusing to Predict
  • Require p(word | blob) > threshold
  • i.e., assign a null word to any blob whose best
    predicted word falls below the threshold
  • This prunes the vocabulary, so a new lexicon is
    fit

21
Indistinguishable Words
  • Visually indistinguishable:
  • cat and tiger, train and locomotive
  • Indistinguishable with our features:
  • eagle and jet
  • Entangled correspondence:
  • polar bear
  • mare/foals and horse
  • Solution: cluster similar words
  • Obtain a similarity matrix by comparing words
    with the symmetrised KL divergence
  • Apply N-cuts on the matrix to get clusters
  • Replace each word with its cluster label
    (a sketch follows this list)
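A sketch of the clustering fix, using each word's conditional distribution over blobs as its signature (my reading of the slide) and scikit-learn's SpectralClustering as the N-cuts stand-in:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_words(t, n_clusters=20, eps=1e-12):
    p = (t / np.maximum(t.sum(axis=0, keepdims=True), eps)).T + eps  # p_i ~ p(blob | word i)
    logp = np.log(p)
    kl = (p * logp).sum(axis=1)[:, None] - p @ logp.T   # KL(p_i || p_j)
    sym = kl + kl.T                                     # symmetrised KL divergence
    affinity = np.exp(-sym / sym.mean())                # similarity matrix
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(affinity)
```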

22
Experiments
  • Train with 4500 Corel images
  • 4-5 words for each image
  • 371 words in vocabulary
  • 5-10 regions per image
  • 500 blobs
  • Test on 500 images

23
Auto-Annotation
  • Determine most likely word for each blob
  • If probability of word is greater than some
    threshold, use in annotation

24
Measuring Performance
  • Do we predict the right words?

25
Region Naming / Correspondence
26
Measuring Performance
  • Do we predict the right words?
  • Are they on the right blobs?
  • Difficult to measure because data set contains no
    correspondence information
  • Must be done by hand on a smaller data set
  • Not practical to count false negatives

27
Successful Results
28
Successful Results
29
Unsuccessful Results
30
Refusing to Predict
31
Clustering
32
Merging Regions
33
Results
Light bar: average number of times a blob predicts the
word in the correct place. Dark bar: average number of
times a blob predicts a word that appears somewhere in
the image.
34
Second Paper
  • Matching Words and Pictures
  • by Kobus Barnard, Pinar Duygulu, Nando de
    Freitas, David Forsyth, David Blei, Michael I.
    Jordan
  • Compares many different models for annotation
    and correspondence

35
Annotation Models
  • Multi-modal hierarchical aspect models
  • Mixture of multi-modal LDA

36
Multi-Modal Hierarchical Aspect Model
A cluster corresponds to a path from a leaf to the root
37
Multi-Modal Hierarchical Aspect Model
  • All observations are produced independently of
    one another
  • I-0: as above
  • I-1: cluster-dependent level structure
  • p(l | d) replaced with p(l | c, d)
  • I-2: generative model
  • p(l | d) replaced with p(l | c)
  • allows prediction for documents not in the
    training set

38
Multi-Modal Hierarchical Aspect Model
  • Model fitting is done with EM
  • Word prediction (reconstruction below)
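The prediction formula on the slide was an image; structurally (a reconstruction of this model family, not a quote from the paper), the word posterior marginalizes over clusters c and levels l:

\[ p(w \mid d) = \sum_{c} p(c \mid d) \sum_{l} p(w \mid l, c)\, p(l \mid c, d) \]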

39
Mixture of Multi-Modal LDA
40
Mixture of Multi-Modal LDA
  • Distribution parameters estimated with EM
  • Word prediction

(prediction combines the posterior Dirichlet with the
posterior over mixture components)
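In outline (my reconstruction from the two quantities above), prediction mixes the topic-word distributions using the posterior Dirichlet mean \bar{\theta} and the posterior over mixture components k:

\[ p(w \mid B) \approx \sum_{k} p(k \mid B) \sum_{z} \bar{\theta}_{k,z}\; p(w \mid z, k) \]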
41
Correspondence Models
  • Discrete translation
  • Hierarchical clustering
  • Linking word and region emission probabilities
  • Paired word and region emission

42
Discrete Translation
  • Similar to the first paper
  • Use k-means to vector-quantize the set of
    features representing an image region
  • Construct a joint probability table linking word
    tokens to blob tokens
  • The data set doesn't provide explicit
    correspondences
  • Missing data problem → EM

43
Hierarchical Clustering
  • Again, using vector-quantized image regions
  • Word prediction

44
Linking Word and Region Emission
  • Words are emitted conditioned on observed blobs
  • D-0: as above (D for dependent)
  • D-1: cluster-dependent level distributions
  • p(l | d) replaced with p(l | c, d)
  • D-2: generative model
  • p(l | d) replaced with p(l)
(observations: the union B ∪ W of blobs and words)
45
Paired Word and Region Emission at Nodes
  • Observed words and regions are emitted in pairs
    (w, b)
  • C-0: as above (C for correspondence)
  • C-1: cluster-dependent level structure
  • p(l | d) replaced with p(l | c, d)
  • C-2: generative model
  • p(l | d) replaced with p(l | c)

46
Wow, That's a Lot of Models!
  • Multi-modal hierarchical: I-0, I-1, I-2
  • Multi-modal LDA
  • Discrete translation
  • Hierarchical clustering
  • Linked word and region emission: D-0, D-1, D-2
  • Paired word and region emission: C-0, C-1, C-2
  • Count: 12
  • Why so many?

47
Evaluation Methods
  • Annotation performance measures:
  • KL divergence between predicted and target
    distributions
  • Word prediction measure:
  • n = # of words in the image
  • r = # of words predicted correctly
  • # of words predicted is set to # of actual
    keywords
  • Normalized classification score (see below):
  • w = # of words predicted incorrectly
  • N = vocabulary size
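With these quantities, a consistent reading of the normalized classification score (the slide's formula was an image; this reconstruction matches the definitions above) is

\[ E_{NS} = \frac{r}{n} - \frac{w}{N - n} \]

so random guessing scores near zero and perfect prediction scores one.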

48
Results
  • Methods using clustering are very reliant on
    having images that are close to the training data
  • MoM-LDA has strong resistance to over-fitting
  • D-0 (linked word and region emission) appears to
    give best results, taking all measures and data
    sets into consideration

49
Successful Results
50
Unsuccessful Results
good annotation, poor correspondence
complete failure
51
N-cuts vs. Blobworld
Blobworld
Normalized Cuts
52
N-cuts vs. Blobworld
53
Browsing Results
Clustering by text only
Clustering by image features only
54
Browsing Results
Clustering by both text and image features
55
Search Results
  • query: "tiger, river"
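A natural scoring rule for such queries (a sketch, not necessarily the paper's exact formula) ranks each image d by the probability of its blobs B_d emitting every query word:

\[ \mathrm{score}(d) = \prod_{w \in \text{query}} p(w \mid B_d) \]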

56
Auto-Illustration Results
  • Passage from Moby Dick
  • "The large importance attached to the
    harpooneer's vocation is evinced by the fact,
    that originally in the old Dutch Fishery, two
    centuries and more ago, the command of a
    whale-ship…"
  • Words extracted from the passage using natural
    language processing tools
  • large importance attached fact old dutch century
    more command whale ship was person was divided
    officer word means fat cutter time made days was
    general vessel whale hunting concern british
    title old dutch official present rank such more
    good american officer boat night watch ground
    command ship deck grand political sea men mast
57
Auto-Illustration Results
  • Top-ranked images retrieved using all extracted
    words

58
Conclusions
  • Lots of different models developed
  • Hard to tell which is best
  • Can be used with any set of features
  • Numerous applications
  • Auto-annotation
  • Region naming (aka object recognition)
  • Browsing
  • Searching
  • Auto-illustration
  • Improvements in translation from visual to
    semantic representations lead to improvements in
    image access