Title: Robust Statistical Techniques for the Categorization of Images Using Associated Text
1. Robust Statistical Techniques for the Categorization of Images Using Associated Text
2. Text Categorization
- Text categorization (TC) refers to the automatic labeling of documents, using natural language text contained in or associated with each document, into one or more pre-defined categories.
- Idea: TC techniques can be applied to image captions or articles to label the corresponding images.
3. Clues for Indoor versus Outdoor: Text (as opposed to visual image features)
Denver Summit of Eight leaders begin their first
official meeting in the Denver Public Library,
June 21.
The two engines of an Amtrak passenger train lie in the mud at the edge of a marsh after the train, bound for Boston from Washington, derailed on the bank of the Hackensack River, just after crossing a bridge.
4. Two Paradigms of Research
- Machine learning (ML) techniques
- Common in the literature
- Usually involve the exploration of new algorithms applied to bag-of-words representations of documents
- Novel representation
- Rare in the literature
- Usually more specific, but often interesting and can lead to substantial improvement
- Important for certain tasks involving images!
5. Contributions
- General
- An in-depth exploration of the categorization of images based on associated text
- Incorporating research into Newsblaster
- Novel machine learning (ML) techniques
- The creation of two novel TC approaches
- The combination of high-precision/low-recall rules with other systems
- Novel representation
- The use of Natural Language Processing (NLP) techniques
- The use of low-level image features
6. Framework
- Collection of experiments
- Various tasks
- Multiple techniques
- No clear winner for all tasks
- Characteristics of tasks often dictate which techniques work best
- No Free Lunch
7. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
8. Corpus
- Raw data
- Postings from news-related Usenet newsgroups
- Over 2000 include embedded captioned images
- Data sets
- Multiple sets of categories representing various levels of abstraction
- Mutually exclusive and exhaustive categories
9. Indoor
Outdoor
10. Events Categories
Politics
Struggle
Disaster
Crime
Other
11. Subcategories for Disaster Images
Category | F1
Politics | 89
Struggle | 88
Disaster | 97
Crime | 90
Other | 59
12. Disaster Image Categories
Affected People
Workers Responding
Other
Wreckage
13. Subcategories for Politics Images
Category | F1
Politics | 89
Struggle | 88
Disaster | 97
Crime | 90
Other | 59
14. Politics Image Categories
Meeting
Civilians
Announcement
Other
Military
Politician Photographed
15. Collect Labels to Train Systems
16. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
17. Two Novel ML Approaches
- Density estimation
- Can be applied to the results of any other system that calculates a similarity score for every category
- Often improves performance
- Provides probabilistic confidence measures for predictions
- BINS
- Uses binning to estimate accurate term weights for words with scarce evidence
- Smoothing leads to robust performance
- Extremely competitive for two data sets in my corpus
18. Density Estimation
- First apply a standard system
- For each document, compute a similarity score for every category
- Apply to training documents as well as test documents
- For each test document
- Find all documents from the training set with similar category scores
- Use the categories of close training documents to predict the categories of test documents
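The steps above can be sketched in a few lines. This toy version is an assumption-laden simplification: it uses Euclidean distance between category score vectors and a vote fraction over the k closest training documents, whereas the real system derives calibrated probabilities.

```python
import math
from collections import Counter

def density_estimate(test_scores, train_scores, train_labels, k=3):
    """Predict the category of a test document from the categories of the
    k training documents whose category score vectors (produced by some
    standard first-pass system) lie closest to the test document's vector."""
    nearest = sorted(
        zip((math.dist(test_scores, s) for s in train_scores), train_labels)
    )[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / k   # prediction plus a crude confidence

# Toy score vectors over (Politics, Struggle, Disaster, Crime, Other)
train = [([100, 75, 20, 30, 5], "Struggle"),
         ([100, 40, 30, 90, 10], "Crime"),
         ([90, 25, 50, 110, 25], "Crime"),
         ([40, 30, 80, 25, 40], "Disaster")]
pred, conf = density_estimate([85, 35, 25, 95, 20],
                              [s for s, _ in train],
                              [c for _, c in train])
```

Here the two nearest training documents are Crime documents, so the test document is labeled Crime even if the first-pass system itself preferred another category.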
19. Density Estimation Example
[Diagram: the category score vector of a test document, e.g. (85, 35, 25, 95, 20) over (Politics, Struggle, Disaster, Crime, Other), is compared against the score vectors of six training documents with known categories (two Crime, two Struggle, one Disaster, one Politics); distances range from 20.0 for the closest to 106.4 for the farthest. Rocchio/TFIDF predicts Struggle for the test document, while density estimation predicts Crime with probability 0.679.]
20. Density Estimation Significantly Improves Performance for the Indoor versus Outdoor Data Set
21. Density Estimation Slightly Degrades Performance for the Events Data Set
22. Density Estimation Sometimes Improves Performance, Always Provides Confidence Measures
(Data sets: Indoor versus Outdoor; Events: Politics, Struggle, Disaster, Crime, Other)
23. Results of Density Estimation Experiments for the Indoor versus Outdoor Data Set
Confidence Range | Images | Overall Accuracy
High (P ≥ 0.9) | 285 | 92.6
Medium (0.9 > P ≥ 0.7) | 98 | 75.5
Low (0.7 > P ≥ 0.5) | 62 | 72.6

Results of Density Estimation Experiments for the Events Data Set
Confidence Range | Documents | Overall Accuracy
High (P ≥ 0.9) | 301 | 94.4
Medium (0.9 > P ≥ 0.7) | 68 | 79.4
Low (0.7 > P ≥ 0.5) | 60 | 53.3
Very Low (P < 0.5) | 14 | 42.9
24. BINS System: Naïve Bayes with Smoothing
- Binning based on smoothing in the speech recognition literature
- Not enough training data to estimate term weights for words with scarce evidence
- Words with similar statistical features are grouped into a common bin
- Estimate a single weight for each bin
- This weight is assigned to all words in the bin
- Credible estimates even for small (or zero) counts
25. Binning Uses Statistical Features of Words
Intuition | Word | Indoor Category Count | Outdoor Category Count | Quantized IDF
Clearly Indoor | conference | 14 | 1 | 4
Clearly Indoor | bed | 1 | 0 | 8
Clearly Outdoor | plane | 0 | 9 | 5
Clearly Outdoor | earthquake | 0 | 4 | 6
Unclear | speech | 2 | 2 | 6
Unclear | ceremony | 3 | 8 | 5
26. plane
- Sparse data
- plane does not occur in any Indoor training documents
- Infinitely more likely to be Outdoor???
- Assign plane to bins of words with similar features (e.g. IDF, category counts)
- In the first half of the training set, plane appears in
- 9 Outdoor documents
- 0 Indoor documents
27. Lambdas = Weights
- First half of training set: assign words to bins
- Second half of training set: estimate term weights
28. Lambdas for plane: 4.03 times more likely in an Outdoor document
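The 4.03 figure lines up with the λ difference of -2.01 shown for plane on the next slide if the lambdas are base-2 log likelihood ratios; that base is an inference from the numbers, not something stated on the slides.

```python
import math

# If lambdas are base-2 log likelihood ratios, a word that is 4.03 times
# more likely to appear in an Outdoor document than an Indoor one gets
# lambda_Indoor - lambda_Outdoor = -log2(4.03), matching the -2.01 shown
# for plane. (The base of the logarithm is inferred, not stated.)
ratio = 4.03                  # P(plane | Outdoor) / P(plane | Indoor)
lam_diff = -math.log2(ratio)  # lambda_Indoor - lambda_Outdoor
```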
29. Binning → Credible Log Likelihood Ratios
Intuition | Word | λ_Indoor - λ_Outdoor | Indoor Category Count | Outdoor Category Count | Quantized IDF
Clearly Indoor | conference | 4.84 | 14 | 1 | 4
Clearly Indoor | bed | 1.35 | 1 | 0 | 8
Clearly Outdoor | plane | -2.01 | 0 | 9 | 5
Clearly Outdoor | earthquake | -1.00 | 0 | 4 | 6
Unclear | speech | 0.84 | 2 | 2 | 6
Unclear | ceremony | -0.50 | 3 | 8 | 5
30. Lambdas Decrease with IDF
31. Methodology of BINS
- Divide the training set into two halves
- First half used to determine bins for words
- Second half used to determine lambdas for bins
- For each test document
- Map every word to a bin for each category
- Add lambdas, obtaining a score for each category
- Switch the halves of training and repeat
- Combine results and assign each document to the category with the highest score
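A minimal sketch of the two-half methodology for a single category. The words, counts, binning features, and add-half smoothing below are all illustrative assumptions; the point is only that every word pooled into a bin shares the bin's one smoothed weight.

```python
import math
from collections import defaultdict

# First half of training: each word's binning features,
# here (category count, quantized IDF).
features = {
    "plane":      (0, 5),   # never seen in an Indoor doc in half 1
    "glacier":    (0, 5),   # lands in the same bin as plane
    "conference": (14, 4),
}
# Second half of training: per word, (Indoor docs containing the word,
# total Indoor docs).
half2 = {"plane": (0, 50), "glacier": (1, 50), "conference": (12, 50)}

# Pool the second-half evidence of every word in a bin, then estimate
# one smoothed weight per bin (add-half smoothing is an assumption).
pooled = defaultdict(lambda: [0, 0])
for word, feat in features.items():
    k, n = half2[word]
    pooled[feat][0] += k
    pooled[feat][1] += n
bin_weight = {feat: math.log2((k + 0.5) / (n + 1))
              for feat, (k, n) in pooled.items()}

# Every word in a bin shares the bin's weight, so even a zero-count
# word like plane gets a credible (finite) estimate.
w_plane = bin_weight[features["plane"]]
```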
32. Binning Improves Performance for the Indoor versus Outdoor Data Set
33. Binning Improves Performance for the Events Data Set
34. BINS: A Robust Version of Naïve Bayes
(Data sets: Indoor versus Outdoor; Events: Politics, Struggle, Disaster, Crime, Other)
35. Combining Bin Weights and Naïve Bayes Weights
- Idea
- It might be better to use the Naïve Bayes weight when there is enough evidence for a word
- Back off to the bin weight otherwise
- BINS allows combinations of weights to be used based on the level of evidence
- How can we automatically determine when to use which weights???
- Entropy
- Minimum Squared Error (MSE)
36. BINS Allows User to Combine Weights
[Charts: combination weights based on Entropy and based on MSE, plus two fixed schemes, COMBO 1 and COMBO 2: use only the bin weight for evidence of 0, average the bin weight and NB weight for evidence of 1, and use only the NB weight for evidence of 2 or more.]
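The evidence-based back-off on this slide can be sketched as a small function. The thresholds follow the COMBO scheme shown; the function name and the assumption that evidence is an integer count are mine.

```python
def combined_weight(bin_w, nb_w, evidence):
    """Back off between the bin weight and the Naive Bayes weight by the
    amount of evidence (training occurrences) behind a word, following
    the COMBO scheme sketched on the slide."""
    if evidence == 0:
        return bin_w                  # no evidence: trust the bin weight
    if evidence == 1:
        return (bin_w + nb_w) / 2     # scarce evidence: average the two
    return nb_w                       # 2 or more: trust Naive Bayes
```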
37. Appropriately Combining the Bin Weight and the Naïve Bayes Weight Leads to the Best Performance Yet
(Data sets: Indoor versus Outdoor; Events: Politics, Struggle, Disaster, Crime, Other)
38. BINS Performs the Best of All Systems Tested
(Data sets: Indoor versus Outdoor; Events: Politics, Struggle, Disaster, Crime, Other)
39. How Can We Improve Results?
- One idea: Label more documents!
- Usually works
- Boring
- Another idea: Use unlabeled documents!
- Easily obtainable
- But can this really work???
- Maybe it can
40. Binning Using Unlabeled Documents
- Apply the system to unlabeled documents
- Choose documents with confident predictions
- Each word gets a new feature: the number of unlabeled documents containing the word that are confidently predicted to belong to each category (unlabeled category counts)
- Probably less important than regular category counts
- Binning provides a natural mechanism for weighting the new feature appropriately
41. Determining Confident Predictions
- BINS computes a score for each category
- BINS predicts the category with the highest score
- Confidence for the predicted category is the score of that category minus the score of the second-place category
- Confidence for a non-predicted category is the score of that category minus the score of the chosen category
- Cross-validation experiments can be used to determine a confidence cutoff for each category
- Maximize Fβ for the category
- A beta of 1 gives precision and recall equal weight; a lower beta weights precision higher
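The cutoff search described above can be sketched as follows. The function names and the toy held-out data are assumptions; beta defaults to 0.5 so precision is weighted more heavily than recall, as the slide suggests.

```python
def f_beta(precision, recall, beta):
    """F_beta = (1 + b^2) * P * R / (b^2 * P + R); beta < 1 favors precision."""
    if precision == 0 or recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

def best_cutoff(scored, beta=0.5):
    """scored: (confidence, belongs_to_category) pairs from held-out data.
    Return the confidence cutoff whose accepted set maximizes F_beta."""
    positives = sum(1 for _, y in scored if y)
    best = (-1.0, None)
    for cutoff in sorted({c for c, _ in scored}):
        accepted = [y for c, y in scored if c >= cutoff]
        tp = sum(accepted)
        p = tp / len(accepted) if accepted else 0.0
        r = tp / positives if positives else 0.0
        best = max(best, (f_beta(p, r, beta), cutoff))
    return best[1]

# Toy held-out predictions for one category
cut = best_cutoff([(0.9, True), (0.8, True), (0.6, False),
                   (0.4, True), (0.2, False)])
```

With this data, accepting everything at confidence 0.8 or above gives perfect precision at two-thirds recall, which maximizes F with beta 0.5.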
42. Use Fβ to Optimize Confidence Cutoffs (example for a single category)
43. Use Fβ to Optimize Confidence Cutoffs (important region of graph highlighted)
44. Should the New Feature Matter?
45. Does the New Feature Help?
- No
- Why???
- New features add info but make bins smaller
- Perhaps more data isn't needed in the first place
- Should more data matter?
- Hard to accumulate more labeled data
- Easy to try out less labeled data!
46. Does Size Matter?
47. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
48. Disaster Image Categories
Affected People
Workers Responding
Other
Wreckage
49. Performance of Standard Systems: Not Very Satisfying
50. Ambiguity for Disaster Images: Workers Responding vs. Affected People
Philippine rescuers carry a fire victim March 19
who perished in a blaze at a Manila disco.
Hypothetical alternative caption: A fire victim who perished in a blaze at a Manila disco is carried by Philippine rescuers March 19.
51. Summary of Observations About Task
Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.
- Need to distinguish foreground from background, determine the focus of the image
- Not all words are important; some are misleading
- Hypothesis: the main subject and verb are particularly useful for this task
- Problematic for bag-of-words approaches
- Need linguistic analysis to determine predicate-argument relationships
52. Hypothesis: Subject and Verb are Useful Clues
Subject | Verb | Category | Guessable?
Truck | makes | Wreckage | No
couple | mourn | Affected People | Yes
blocks | suffered | Wreckage | Yes
NAME | gather | Affected People | No
child | sleeps | Affected People | Yes
inspectors | search | Workers Responding | Yes
NAME | observes | Workers Responding | No
workers | confer | Workers Responding | Yes
child | covers | Affected People | Yes
chimney | stands | Wreckage | Yes
53. Experiments with Human Subjects: 4 Conditions (Test Hypothesis: Subject and Verb are Useful Clues)
SENT | First sentence of caption | Philippine rescuers carry a fire victim March 19 who perished in a blaze at a Manila disco.
RAND | All words from first sentence in random order | At perished disco who Manila a a in 19 carry Philippine blaze victim a rescuers March fire
IDF | Top two TFIDF words | disco rescuers
S-V | Subject and verb | subject: rescuers, verb: carry
54. Experiments with Human Subjects: Results (Hypothesis: Subject and Verb are Useful Clues)
- Syntax is important
- SENT > RAND
- S-V > IDF
- Subject and verb are especially important
Condition | Average Time (in seconds)
RAND | 68
SENT | 34
IDF | 22
S-V | 20
55. Experiments with Human Subjects: Results (Hypothesis: Subject and Verb are Useful Clues)
- More words are better than fewer words
- SENT, RAND > S-V, IDF
- Syntax is important
- SENT > RAND; S-V > IDF
Condition | Average Time (in seconds)
RAND | 68
SENT | 34
IDF | 22
S-V | 20
56. RAND is Very Slow!
Condition | Average Time (in seconds)
RAND | 68
SENT | 34
IDF | 22
S-V | 20
- Perhaps human subjects unscrambled the words, regaining syntactic information
57. Using Just Two Words (S-V) is Almost as Good as Using All the Words (Bag of Words)
58. Operational NLP Based System
- Extract subjects and verbs from all documents in the training set
- Pipeline: Sentence → POS tagger → CASS shallow parser → Extract subject and verb → WordNet maps to base form → Output
- Extraction accuracy: Subjects 83.9, Verbs 80.6
- For each test document
- Extract the subject and verb
- Compare to those from the training set using a novel method of word-to-word similarity
- Based on similarities, generate a score for every category
59. Word Similarity
- Examine a large extended corpus to generate many subject/verb pairs
- Use these to compute similarities
60. Choosing a Category
- For a given test document d, calculate a total score for every category c
- Choose the category with the highest score
- If the subject is a NAME, it's a bit more complicated
61. The NLP Based System Beats All Others by a Considerable Margin
62. Politics Image Categories
Meeting
Civilians
Announcement
Other
Military
Politician Photographed
63. The NLP Based System is in the Middle of the Pack for the Politics Image Data Set
64. Why is the Performance for the NLP Based System not as Strong for the Politics Image Data Set?
- A much wider range of performance scores
- Range for Politics images is 36% to 64.7%
- Range for Disaster images is 54% to 59.7%
- The top systems are harder to beat
- Too many proper names as subjects
- 60% of test instances for Politics images
- Only 13% of test instances for Disaster images
- For 60% of test documents, only one word (the main verb) is being used to determine the prediction
65. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
66. The Original Premise
- For the Disaster image data set, the performance of the NLP based system still leaves room for improvement
- The NLP based system achieves 65% overall accuracy for the Disaster image data set
- Humans viewing all words in random order achieve about 75%
- Humans viewing the full first sentence achieve over 90%
- The main subject and verb are particularly important, but sometimes other words might offer good clues
67. Higinio Guereca carries family photos he retrieved from his mobile home, which was destroyed as a tornado moved through the Central Florida community, early December 27.
68. Choosing Indicative Words
- Let x be the number of training documents containing a word w
- Let p be the proportion of these documents that belong to category c
- If x > X and p > P, then w is indicative of c
- X and P can be varied to generate lists of indicative words
- Lists can be pruned manually
69. Selected Indicative Words for the Disaster Image Data Set
Word | Indicated Category | Total Count (x) | Proportion (p)
her | Affected People | 7 | 1.0
his | Affected People | 7 | 0.86
family | Affected People | 6 | 0.83
relatives | Affected People | 6 | 1.0
rescue | Workers Responding | 15 | 1.0
search | Workers Responding | 9 | 1.0
similar | Other | 2 | 1.0
soldiers | Workers Responding | 6 | 1.0
workers | Workers Responding | 12 | 1.0
70. Selected Indicative Words for the Politics Image Data Set
Word | Indicated Category | Total Count (x) | Proportion (p)
hands | Meeting | 10 | 0.90
journalists | Announcement | 4 | 1.0
local | Civilians | 4 | 1.0
media | Announcement | 3 | 1.0
presidential | Politician Photographed | 9 | 0.78
press | Announcement | 7 | 0.71
reporters | Announcement | 8 | 0.88
meeting | Meeting | 15 | 0.73
session | Meeting | 6 | 0.83
victory | Politician Photographed | 6 | 0.83
waves | Politician Photographed | 4 | 1.0
wife | Politician Photographed | 6 | 1.0
71. High-Precision/Low-Recall Rules
- If a word w that indicates category c occurs in a document d, then assign d to c
- Every selected indicative word has an associated rule of the above form
- Each rule is very accurate but rarely applicable
- If only the rules are used
- most predictions will be correct (hence, high precision)
- most instances of most categories will remain unlabeled (hence, low recall)
72. Combining the High-Precision/Low-Recall Rules with Other Systems
- Two-pass approach
- Conduct a first pass using the indicative words and the high-precision/low-recall rules
- For documents that are still unlabeled, fall back to some other system
- Compared to the fall-back system
- If the rules are more accurate for the documents to which they apply, overall accuracy will improve!
- Intended to improve the NLP based system, but easy to test with other systems as well
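The two-pass combination is a small wrapper around any fall-back classifier. This is a sketch: the real system's handling of documents where several rules fire may differ, and the rules and stand-in fall-back below are invented.

```python
def two_pass(doc_words, rules, fall_back):
    """First pass: fire any high-precision rule whose indicative word
    appears in the document. Second pass: for documents no rule covers,
    defer to another classifier."""
    for word, category in rules.items():
        if word in doc_words:
            return category
    return fall_back(doc_words)

# Invented rules and a trivial stand-in for the fall-back system
rules = {"rescue": "Workers Responding", "her": "Affected People"}
always_other = lambda words: "Other"

first = two_pass({"rescue", "workers", "carry"}, rules, always_other)
second = two_pass({"tornado", "storm"}, rules, always_other)
```

If the rules are more accurate than the fall-back on the documents they cover, every such wrapper can only help overall accuracy.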
73. The Rules Improve Every Fall-Back System for the Disaster Image Data Set
74. The Rules Improve 7 of 8 Fall-Back Systems for the Politics Image Data Set
75. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
76. Low-Level Image Features
- Collaboration with Paek and Benitez
- They have provided me with information, pointers to resources, and code
- I have reimplemented some of their code
- Color histograms
- Based on entire images or image regions
- Can be used as input to machine learning approaches (e.g. kNN, SVMs)
77. Color
- Three components to color
- Red, green, blue (RGB)
- Hue, saturation, value (HSV)
- Can convert from RGB to HSV
- Can quantize HSV triples
- 18 hues x 3 saturations x 3 values + 4 grays = 166 slots
78. Color Histograms
- For each pixel of an image, compute its quantized HSV triple
- The color histogram of an image is a vector such that
- There are 166 dimensions
- Each dimension represents one possible HSV triple
- The value of a dimension is the proportion of pixels with the associated HSV triple
- Can be computed for image regions and concatenated together
- Can be input for machine learning techniques
79. Images Divided into 8 x 8 Rectangular Regions of Equal Size
80. Using Color Histograms to Predict Labels for the Indoor versus Outdoor Data Set
81. Combining Text and Image Features
- Combining systems has had mixed results in the TC literature, but
- Most attempts have involved systems that use the same features (bag of words)
- There is little reason to believe that indicative text is correlated with indicative low-level image features
- Most text based systems are beating the image based systems, but
- Distance from the optimal hyperplane can be used as a confidence measure for a support vector machine
- Predictions with high confidence may be more accurate than those of text systems
82. Accuracy of Support Vector Machine Approach Tends to be Higher when Confidence is Greater
Distance Cutoff | Overall Accuracy | % of Images Above Cutoff
3.5 | --- | 0.0
3.0 | 100.0 | 0.4
2.5 | 87.5 | 1.8
2.0 | 92.3 | 5.8
1.5 | 94.4 | 16.0
1.0 | 91.0 | 34.1
0.5 | 84.6 | 70.1
0.0 | 78.0 | 100.0
83. The Combination of Text and Image Beats Text Alone (most systems show small gains, one has a major improvement)
84. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
86. Newsblaster Categories
U.S. News
World News
Finance
Entertainment
Science/Technology
Sports
94. Newsblaster
- A pragmatic showcase for NLP
- My contributions
- Extraction of images and captions from web pages
- Image browsing interface
- Categorization of stories (clusters) and images
- Scripts that allow users to suggest labels for articles with incorrect predictions
95. Overview
- The Main Idea
- Description of Corpus
- Novel ML Systems
- NLP Based System
- High-Precision/Low-Recall Rules
- Image Features
- Newsblaster
- Conclusions and Future Work
96. Summary
- Examined several methods of categorizing images
- No clear winner for all tasks
- BINS is very competitive
- NLP can lead to substantial improvement, at least for certain tasks
- High-precision/low-recall rules are likely to improve performance for tough tasks
- Image features show promise
- Newsblaster demonstrates pragmatic benefits of my work
97. Conclusions
- TC techniques can be used to categorize images
- The approach that should be used depends on the specific task being considered
- Important and timely
- Increased prevalence of images on the web, large corpora of images, and personal collections of images
- Tools will be needed for better browsing, searching, and filtering
98. Future Work
- BINS
- Explore additional binning features
- Explore use of unlabeled data
- NLP and TC
- Improve the current system
- Explore additional categories
- Image features
- Explore additional low-level image features
- Explore better methods of combining text and image features
- Pragmatic benefits
- Investigate end user applications
- Expand to video (perhaps using closed captions)
99. And Now the Questions