Fisher kernels for image representation
1
Fisher kernels for image representation
generative classification models
  • Jakob Verbeek
  • December 11, 2009

2
Plan for this course
  • Introduction to machine learning
  • Clustering techniques
  • k-means, Gaussian mixture density
  • Gaussian mixture density continued
  • Parameter estimation with EM
  • Classification techniques 1
  • Introduction, generative methods, semi-supervised
  • Fisher kernels
  • Classification techniques 2
  • Discriminative methods, kernels
  • Decomposition of images
  • Topic models,

3
Classification
  • Training data consists of inputs, denoted x,
    and corresponding output class labels, denoted
    as y.
  • Goal is to correctly predict the class label for
    a test input.
  • Learn a classifier f(x) from the input data
    that outputs the class label or a probability
    over the class labels.
  • Example
  • Input image
  • Output category label, e.g. cat vs. no cat
  • Classification can be binary (two classes), or
    over a larger number of classes (multi-class).
  • In binary classification we often refer to one
    class as positive, and the other as negative.
  • A binary classifier creates boundaries in the
    input space between areas assigned to each class.

4
Example of classification
Given training images and their categories
What are the categories of these test images?
5
Discriminative vs generative methods
  • Generative probabilistic methods
  • Model the density of inputs x from each class:
    p(x|y)
  • Estimate class prior probability p(y)
  • Use Bayes rule to infer distribution over class
    given input
  • Discriminative (probabilistic) methods
  • Directly estimate class probability given input:
    p(y|x)
  • Some methods do not have a probabilistic
    interpretation,
  • e.g. they fit a function f(x), and assign to class
    1 if f(x) > 0, and to class 2 if f(x) < 0

6
Generative classification methods
  • Generative probabilistic methods
  • Model the density of inputs x from each class:
    p(x|y)
  • Estimate class prior probability p(y)
  • Use Bayes rule to infer distribution over class
    given input
  • Modeling class-conditional densities over the
    inputs x
  • Selection of model class
  • Parametric models, such as Gaussian (for
    continuous data) or Bernoulli (for binary data)
  • Semi-parametric models: mixtures of Gaussians,
    mixtures of Bernoullis, ...
  • Non-parametric models: histograms over
    one-dimensional or multi-dimensional data,
    nearest-neighbor method, kernel density estimator
  • Given the class-conditional models, classification
    is trivial: just apply Bayes' rule (written out
    after this list)
  • Adding new classes can be done by adding a new
    class conditional model
  • Existing class conditional models stay as they
    are
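
As a reminder, the Bayes rule step referred to above takes the standard form
(notation added here, not from the original slides):

```latex
p(y \mid x) \;=\; \frac{p(x \mid y)\, p(y)}{p(x)},
\qquad
p(x) \;=\; \sum_{y'} p(x \mid y')\, p(y').
```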

7
Histogram methods
  • Suppose we
  • have N data points
  • use a histogram with C cells
  • How to set the density level in each cell?
  • Maximum (log-)likelihood estimator:
  • proportional to the number of points n in the cell
  • inversely proportional to the volume V of the cell
    (written out after this list)
  • Problems with the histogram method
  • Number of cells scales exponentially with the
    dimension of the data
  • Discontinuous density estimate
  • How to choose cell size?
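
The maximum-likelihood histogram estimate described above can be written as
(standard form, added for completeness):

```latex
\hat{p}_i \;=\; \frac{n_i}{N\, V_i},
```

where n_i is the number of the N data points falling in cell i and V_i is the
volume of that cell.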

8
The curse of dimensionality
  • Number of bins increases exponentially with the
    dimensionality of the data.
  • Fine division of each dimension: many empty bins
  • Rough division of each dimension: poor density
    model
  • Probability distribution of D discrete variables
    takes at least 2^D values
  • At least 2 values for each variable
  • The number of cells may be reduced by assuming
    independence between the components of x: the
    naïve Bayes model (see the factorization after
    this list)
  • Model is naïve since it assumes that all
    variables are independent
  • Unrealistic for high dimensional data, where
    variables tend to be dependent
  • Poor density estimator
  • Classification performance can still be good
    using the derived p(y|x)
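
The naïve Bayes factorization mentioned above (standard form, added for
clarity):

```latex
p(x \mid y) \;=\; \prod_{d=1}^{D} p(x_d \mid y).
```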

9
Example of generative classification
  • Hand-written digit classification
  • Input: binary 28x28 scanned digit images,
    collected in a 784-dimensional vector
  • Desired output: class label of the image
  • Generative model
  • Independent Bernoulli model for each class
  • Probability per pixel per class
  • Maximum likelihood estimator is the average value
    per pixel per class (see the sketch after this
    list)
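
A minimal sketch of this independent-Bernoulli generative classifier, assuming
binary pixel vectors in NumPy arrays (the function names and the smoothing
constant eps are illustrative assumptions, not from the slides):

```python
import numpy as np

def fit_bernoulli_classifier(X, y, eps=1e-3):
    """Fit an independent-Bernoulli model per class.

    X: (N, 784) binary pixel matrix, y: (N,) integer class labels.
    Returns class labels, per-class pixel probabilities, class priors."""
    classes = np.unique(y)
    # ML estimate: average pixel value per class (clipped away from 0/1)
    theta = np.array([X[y == c].mean(axis=0) for c in classes])
    theta = np.clip(theta, eps, 1 - eps)
    priors = np.array([(y == c).mean() for c in classes])
    return classes, theta, priors

def predict(X, classes, theta, priors):
    """Classify by the class with the highest posterior (Bayes rule)."""
    # log p(x|y) under the independent Bernoulli model
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    log_post = log_lik + np.log(priors)      # unnormalized log posterior
    return classes[np.argmax(log_post, axis=1)]
```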

10
k-nearest-neighbor estimation method
  • Idea: fix the number of samples in the cell, and
    find the right cell size.
  • The probability P of finding a point in a sphere A
    centered on x with volume v
  • A smooth density is approximately constant in a
    small region, so P is approximately p(x) times v
  • Alternatively, estimate P from the fraction of
    training data falling in the sphere centered on x
  • Combine the above to obtain the estimate (the
    derivation is written out after this list)
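
The combined derivation sketched above, in standard notation (added here
since the slide formulas are not in the transcript):

```latex
P \;=\; \int_{A} p(x')\,\mathrm{d}x' \;\approx\; p(x)\, v,
\qquad
P \;\approx\; \frac{k}{N}
\quad\Longrightarrow\quad
\hat{p}(x) \;\approx\; \frac{k}{N\, v}.
```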

11
k-nearest-neighbor estimation method
  • Method in practice
  • Choose k
  • For a given x, compute the volume v which contains
    k samples.
  • Estimate the density with p(x) ≈ k / (N v)
  • Volume of a sphere with radius r in d dimensions
    (given after this list)
  • What effect does k have?
  • Data sampled from a mixture of Gaussians (plotted
    in green on the slide)
  • Larger k: larger region, smoother estimate
  • Selection of k
  • Leave-one-out cross-validation
  • Select the k that maximizes the data
    log-likelihood
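
The volume of a d-dimensional sphere of radius r, referred to above (standard
formula, added since the slide equation is missing from the transcript):

```latex
V_d(r) \;=\; \frac{\pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2} + 1\right)}\; r^{d}.
```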

12
k-nearest-neighbor classification rule
  • Use k-nearest-neighbor density estimation to find
    p(x|category)
  • Apply Bayes' rule for classification: the
    k-nearest-neighbor classification rule
  • Find the sphere volume v that captures k data
    points for the estimate
  • Use the same sphere for each class for the
    class-conditional estimates
  • Estimate global class priors
  • Calculate the class posterior distribution (the
    resulting estimates are written out after this
    list)
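
Writing out the estimates listed above (standard k-NN classification
derivation; k_c of the k neighbors belong to class c, and N_c of the N
training points carry label c):

```latex
\hat{p}(x) = \frac{k}{N v}, \qquad
\hat{p}(x \mid c) = \frac{k_c}{N_c v}, \qquad
\hat{p}(c) = \frac{N_c}{N}, \qquad
\hat{p}(c \mid x) = \frac{\hat{p}(x \mid c)\,\hat{p}(c)}{\hat{p}(x)} = \frac{k_c}{k}.
```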

13
k-nearest-neighbor classification rule
  • Effect of k on classification boundary
  • Larger number of neighbors
  • Larger regions
  • Smoother class boundaries

14
Kernel density estimation methods
  • Consider a simple estimator of the cumulative
    distribution function: the fraction of data points
    smaller than x
  • Its derivative gives an estimator of the density
    function, but this is just a set of delta peaks.
  • The derivative is defined as a limit over a window
    of width 2h around x, with h going to zero
  • Consider a non-limiting value of h
  • Each data point adds a block of height 1/(2hN)
    within distance h around it; the sum of the blocks
    gives the estimate (written out after this list)
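
Written out for one-dimensional data (the standard block/Parzen estimator;
notation added here):

```latex
\hat{p}(x) \;=\; \frac{1}{2hN} \sum_{n=1}^{N} \mathbf{1}\!\left[\, |x - x_n| \le h \,\right].
```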

15
Kernel density estimation methods
  • A kernel other than the block function can be used
    to obtain a smooth estimator.
  • A widely used kernel function is the (multivariate)
    Gaussian (see the sketch after this list)
  • Contribution decreases smoothly as a function of
    the distance to data point.
  • Choice of smoothing parameter
  • A larger kernel size gives a smoother density
    estimator
  • Use the average distance between samples.
  • Use cross-validation.
  • The method can be used for multivariate data
  • Or in a naïve Bayes model
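
A minimal sketch of the Gaussian kernel density estimator, assuming an
isotropic kernel with a given bandwidth h (the function name and interface
are illustrative, not from the slides):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Kernel density estimate at query points x (M, d) from samples
    data (N, d), using an isotropic Gaussian kernel of bandwidth h."""
    d = x.shape[1]
    # squared distances between every query point and every sample
    sq_dist = ((x[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * h ** 2) ** (-d / 2.0)   # Gaussian normalizer
    # average the kernel contributions of all samples
    return norm * np.exp(-sq_dist / (2.0 * h ** 2)).mean(axis=1)
```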

16
Summary generative classification methods
  • (Semi-)parametric models (e.g. p(data | category)
    is a Gaussian or a mixture)
  • No need to store data, but possibly too strong
    assumptions on data density
  • Can lead to poor fit on data, and poor
    classification result
  • Non-parametric models
  • Histograms
  • Only practical in low-dimensional spaces (< 5 or
    so)
  • High dimensional space will lead to many cells,
    many of which will be empty
  • Naïve Bayes modeling in higher dimensional cases
  • k-nearest neighbor and kernel density estimation
  • Need to store all training data
  • Need to find nearest neighbors or points with
    non-zero kernel evaluation (costly)

[Figure: density estimates obtained with a histogram, k-NN, and k.d.e.]
17
Discriminative vs generative methods
  • Generative probabilistic methods
  • Model the density of inputs x from each class:
    p(x|y)
  • Estimate class prior probability p(y)
  • Use Bayes rule to infer distribution over class
    given input
  • Discriminative (probabilistic) methods (next
    week)
  • Directly estimate class probability given input:
    p(y|x)
  • Some methods do not have a probabilistic
    interpretation,
  • e.g. they fit a function f(x), and assign to class
    1 if f(x) > 0, and to class 2 if f(x) < 0
  • Hybrid generative-discriminative models
  • Fit density model to data
  • Use properties of this model as input for
    classifier
  • Example: Fisher vectors for image representation

18
Clustering for visual vocabulary construction
  • Clustering of local image descriptors
  • using k-means or mixture of Gaussians
  • Recap of the image representation pipeline
  • Extract image regions at various locations and
    scales; compute a descriptor for each region
    (e.g. SIFT)
  • (Soft) assignment of each descriptor to clusters
  • Make a histogram for the complete image
  • Summing of the vector representations of each
    descriptor
  • Input to the image classification method (a
    hard-assignment sketch follows below)

[Figure: image regions mapped to cluster indexes]
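
A minimal sketch of the hard-assignment bag-of-words step of this pipeline,
assuming k-means cluster centers are already available (function and variable
names are illustrative):

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Bag-of-words image representation: hard-assign each local
    descriptor (N, D), e.g. SIFT, to its nearest of K cluster centers
    (K, D) and count the assignments."""
    # squared distance of every descriptor to every cluster center
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignment = d2.argmin(axis=1)                 # nearest visual word
    hist = np.bincount(assignment, minlength=len(centers))
    return hist / hist.sum()                       # normalized histogram
```
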
19
Fisher Vector motivation
  • Feature vector quantization is computationally
    expensive in practice
  • Run-time is linear in
  • N: nr. of feature vectors, ~10^3 per image
  • D: nr. of dimensions, ~10^2 (SIFT)
  • K: nr. of clusters, ~10^3 for recognition
  • So in total on the order of 10^8 multiplications
    per image to assign SIFT descriptors to visual
    words
  • We use a histogram of visual word counts
  • Can we do this more efficiently ?!
  • Reading material: "Fisher Kernels on Visual
    Vocabularies for Image Categorization",
    F. Perronnin and C. Dance, CVPR 2007,
    Xerox Research Centre Europe, Meylan

20
Fisher vector image representation
  • MoG / k-means stores nr of points per cell
  • Need many clusters to represent distribution of
    descriptors in image
  • But increases computational cost
  • Fisher vector adds 1st and 2nd order moments
  • More precise description of the regions assigned
    to each cluster
  • Fewer clusters needed for the same accuracy
  • Representation is (2D+1) times larger, at the same
    computational cost
  • Terms already calculated when computing
    soft-assignment

q_nk: soft-assignment of image region n to cluster k
(Gaussian mixture component); see the expression below.
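
The soft-assignment (responsibility) q_nk has the standard mixture-of-Gaussians
form (notation added here, consistent with the earlier clustering lectures):

```latex
q_{nk} \;=\; \frac{\pi_k \,\mathcal{N}(x_n;\, \mu_k, \Sigma_k)}
                  {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(x_n;\, \mu_j, \Sigma_j)}.
```
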
21
Image representation using Fisher kernels
  • General idea of the Fisher vector representation
  • Fit a probabilistic model to the data
  • Use the derivative of the data log-likelihood as
    the data representation, e.g. for classification
    (see the expression after this list)
  • Jaakkola & Haussler, "Exploiting generative
    models in discriminative classifiers", in
    Advances in Neural Information Processing Systems
    11, 1999.
  • Here, we use Mixture of Gaussians to cluster the
    region descriptors
  • Concatenate derivatives to obtain data
    representation
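
In the notation of Jaakkola & Haussler, the representation is the gradient of
the log-likelihood with respect to the model parameters, and the associated
Fisher kernel uses the Fisher information matrix F (standard formulation,
added here for reference):

```latex
g(X) \;=\; \nabla_{\theta} \log p(X \mid \theta),
\qquad
K(X, X') \;=\; g(X)^{\top} F^{-1} g(X'),
\qquad
F \;=\; \mathbb{E}\!\left[ g(X)\, g(X)^{\top} \right].
```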

22
Image representation using Fisher kernels
  • Extended representation of image descriptors
    using MoG
  • Displacement of descriptor from center
  • Squares of displacement from center
  • From 1 number per descriptor per cluster, to
    1 + D + D^2 numbers (D: data dimension)
  • Simplified version obtained when
  • using this representation for a linear classifier
  • with diagonal covariance matrices, the variance in
    each dimension given by a vector v_k
  • For a single image region descriptor
  • Summed over all descriptors this gives us
  • 1: soft count of regions assigned to the cluster
  • D: weighted average of assigned descriptors
  • D: weighted variance of descriptors in all
    dimensions (a sketch of these statistics follows
    below)
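
A minimal sketch of the per-cluster statistics listed above, given descriptors
and soft assignments; it omits the normalization by the mixture parameters
used in the full Fisher vector of Perronnin & Dance, and the names are
illustrative:

```python
import numpy as np

def fisher_statistics(X, q, means):
    """Per-cluster statistics behind the Fisher vector representation.

    X: (N, D) region descriptors, q: (N, K) soft assignments q_nk,
    means: (K, D) Gaussian mixture means. Returns a (2D+1)*K vector:
    soft counts, weighted displacements, weighted squared displacements."""
    counts = q.sum(axis=0)                            # (K,)   soft counts
    diff = X[:, None, :] - means[None, :, :]          # (N, K, D) displacements
    first = (q[:, :, None] * diff).sum(axis=0)        # (K, D) weighted displacement
    second = (q[:, :, None] * diff ** 2).sum(axis=0)  # (K, D) weighted squared displ.
    return np.concatenate([counts, first.ravel(), second.ravel()])
```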

23
Fisher vector image representation
  • MoG / k-means stores nr of points per cell
  • Need many clusters to represent distribution of
    descriptors in image
  • Fisher vector adds 1st and 2nd order moments
  • More precise description of the regions assigned
    to each cluster
  • Fewer clusters needed for the same accuracy
  • Representation is (2D+1) times larger, at the same
    computational cost
  • Terms already calculated when computing
    soft-assignment
  • Computational cost is O(NKD); we need the
    differences between all clusters and all data
    points

24
Images from categorization task PASCAL VOC
  • Yearly competition for image classification
    (also object localization, segmentation, and
    body-part localization)

25
Fisher Vector results
  • BOV-supervised learns a separate mixture model for
    each image class, which makes some of the visual
    words class-specific
  • MAP: assign the image to the class whose MoG
    assigns maximum likelihood to the region
    descriptors
  • The other results are based on a linear classifier
    over the image descriptions
  • Similar performance, using 16x fewer Gaussians
  • The unsupervised/universal representation works
    well

26
Plan for this course
  • Introduction to machine learning
  • Clustering techniques
  • k-means, Gaussian mixture density
  • Gaussian mixture density continued
  • Parameter estimation with EM
  • Classification techniques 1
  • Introduction, generative methods, semi-supervised
  • Reading for next week
  • Previous papers (!), nothing new
  • Available on the course website:
    http://lear.inrialpes.fr/verbeek/teaching
  • Classification techniques 2
  • Discriminative methods, kernels
  • Decomposition of images