Fisher kernels for image representation
1
Fisher kernels for image representation
generative classification models
  • Jakob Verbeek
  • December 11, 2009

2
Plan for this course
  • Introduction to machine learning
  • Clustering techniques
  • k-means, Gaussian mixture density
  • Gaussian mixture density continued
  • Parameter estimation with EM
  • Classification techniques 1
  • Introduction, generative methods, semi-supervised
  • Fisher kernels
  • Classification techniques 2
  • Discriminative methods, kernels
  • Decomposition of images
  • Topic models,

3
Classification
  • Training data consists of inputs, denoted x,
    and corresponding output class labels, denoted
    as y.
  • Goal is to correctly predict the class label for
    a test input.
  • Learn a classifier f(x) from the input data
    that outputs the class label or a probability
    over the class labels.
  • Example
  • Input image
  • Output category label, e.g. cat vs. no cat
  • Classification can be binary (two classes), or
    over a larger number of classes (multi-class).
  • In binary classification we often refer to one
    class as positive, and the other as negative.
  • A binary classifier creates boundaries in the
    input space between areas assigned to each class.

4
Example of classification
Given training images and their categories
What are the categories of these test images?
5
Discriminative vs generative methods
  • Generative probabilistic methods
  • Model the density of inputs x from each class:
    p(x|y)
  • Estimate class prior probability p(y)
  • Use Bayes rule to infer distribution over class
    given input
  • Discriminative (probabilistic) methods
  • Directly estimate class probability given input:
    p(y|x)
  • Some methods do not have a probabilistic
    interpretation,
  • e.g. they fit a function f(x), and assign to class
    1 if f(x) > 0, and to class 2 if f(x) < 0

6
Generative classification methods
  • Generative probabilistic methods
  • Model the density of inputs x from each class:
    p(x|y)
  • Estimate class prior probability p(y)
  • Use Bayes rule to infer distribution over class
    given input
  • Modeling class-conditional densities over the
    inputs x
  • Selection of model class
  • Parametric models, such as Gaussian (for
    continuous data) or Bernoulli (for binary data)
  • Semi-parametric models: mixtures of Gaussians,
    mixtures of Bernoullis, ...
  • Non-parametric models: histograms over
    one-dimensional or multi-dimensional data,
    nearest-neighbor method, kernel density estimator
  • Given the class-conditional models, classification
    is trivial: just apply Bayes' rule (written out
    after this list)
  • Adding new classes can be done by adding a new
    class conditional model
  • Existing class conditional models stay as they
    are
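
As a reminder, the Bayes rule step referred to above takes the standard form
(notation added here, not from the original slides):

```latex
p(y \mid x) \;=\; \frac{p(x \mid y)\, p(y)}{p(x)},
\qquad
p(x) \;=\; \sum_{y'} p(x \mid y')\, p(y').
```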

7
Histogram methods
  • Suppose we
  • have N data points
  • use a histogram with C cells
  • How to set the density level in each cell?
  • Maximum (log-)likelihood estimator:
  • proportional to the number of points n in the cell
  • inversely proportional to the volume V of the cell
    (written out after this list)
  • Problems with the histogram method
  • Number of cells scales exponentially with the
    dimension of the data
  • Discontinuous density estimate
  • How to choose cell size?
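
The maximum-likelihood histogram estimate described above can be written as
(standard form, added for completeness):

```latex
\hat{p}_i \;=\; \frac{n_i}{N\, V_i},
```

where n_i is the number of the N data points falling in cell i and V_i is the
volume of that cell.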

8
The curse of dimensionality
  • Number of bins increases exponentially with the
    dimensionality of the data.
  • Fine division of each dimension: many empty bins
  • Rough division of each dimension: poor density
    model
  • Probability distribution of D discrete variables
    takes at least 2^D values
  • At least 2 values for each variable
  • The number of cells may be reduced by assuming
    independence between the components of x: the
    naïve Bayes model (see the factorization after
    this list)
  • Model is naïve since it assumes that all
    variables are independent
  • Unrealistic for high dimensional data, where
    variables tend to be dependent
  • Poor density estimator
  • Classification performance can still be good
    using the derived p(y|x)
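
The naïve Bayes factorization mentioned above (standard form, added for
clarity):

```latex
p(x \mid y) \;=\; \prod_{d=1}^{D} p(x_d \mid y).
```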

9
Example of generative classification
  • Hand-written digit classification
  • Input: binary 28x28 scanned digit images,
    collected in a 784-dimensional vector
  • Desired output: class label of the image
  • Generative model
  • Independent Bernoulli model for each class
  • Probability per pixel per class
  • Maximum likelihood estimator is the average value
    per pixel per class (see the sketch after this
    list)
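
A minimal sketch of this independent-Bernoulli generative classifier, assuming
binary pixel vectors in NumPy arrays (the function names and the smoothing
constant eps are illustrative assumptions, not from the slides):

```python
import numpy as np

def fit_bernoulli_classifier(X, y, eps=1e-3):
    """Fit an independent-Bernoulli model per class.

    X: (N, 784) binary pixel matrix, y: (N,) integer class labels.
    Returns class labels, per-class pixel probabilities, class priors."""
    classes = np.unique(y)
    # ML estimate: average pixel value per class (clipped away from 0/1)
    theta = np.array([X[y == c].mean(axis=0) for c in classes])
    theta = np.clip(theta, eps, 1 - eps)
    priors = np.array([(y == c).mean() for c in classes])
    return classes, theta, priors

def predict(X, classes, theta, priors):
    """Classify by the class with the highest posterior (Bayes rule)."""
    # log p(x|y) under the independent Bernoulli model
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    log_post = log_lik + np.log(priors)      # unnormalized log posterior
    return classes[np.argmax(log_post, axis=1)]
```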

10
k-nearest-neighbor estimation method
  • Idea: fix the number of samples in the cell, and
    find the right cell size.
  • The probability P of finding a point in a sphere A
    centered on x with volume v
  • A smooth density is approximately constant in a
    small region, so P is approximately p(x) times v
  • Alternatively, estimate P from the fraction of
    training data falling in the sphere centered on x
  • Combine the above to obtain the estimate (the
    derivation is written out after this list)
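
The combined derivation sketched above, in standard notation (added here
since the slide formulas are not in the transcript):

```latex
P \;=\; \int_{A} p(x')\,\mathrm{d}x' \;\approx\; p(x)\, v,
\qquad
P \;\approx\; \frac{k}{N}
\quad\Longrightarrow\quad
\hat{p}(x) \;\approx\; \frac{k}{N\, v}.
```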

11
k-nearest-neighbor estimation method
  • Method in practice
  • Choose k
  • For a given x, compute the volume v which contains
    k samples.
  • Estimate the density with p(x) ≈ k / (N v)
  • Volume of a sphere with radius r in d dimensions
    (given after this list)
  • What effect does k have?
  • Data sampled from a mixture of Gaussians (plotted
    in green on the slide)
  • Larger k: larger region, smoother estimate
  • Selection of k
  • Leave-one-out cross-validation
  • Select the k that maximizes the data
    log-likelihood
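
The volume of a d-dimensional sphere of radius r, referred to above (standard
formula, added since the slide equation is missing from the transcript):

```latex
V_d(r) \;=\; \frac{\pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2} + 1\right)}\; r^{d}.
```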

12
k-nearest-neighbor classification rule
  • Use k-nearest-neighbor density estimation to find
    p(x|category)
  • Apply Bayes' rule for classification: the
    k-nearest-neighbor classification rule
  • Find the sphere volume v that captures k data
    points for the estimate
  • Use the same sphere for each class for the
    class-conditional estimates
  • Estimate global class priors
  • Calculate the class posterior distribution (the
    resulting estimates are written out after this
    list)
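
Writing out the estimates listed above (standard k-NN classification
derivation; k_c of the k neighbors belong to class c, and N_c of the N
training points carry label c):

```latex
\hat{p}(x) = \frac{k}{N v}, \qquad
\hat{p}(x \mid c) = \frac{k_c}{N_c v}, \qquad
\hat{p}(c) = \frac{N_c}{N}, \qquad
\hat{p}(c \mid x) = \frac{\hat{p}(x \mid c)\,\hat{p}(c)}{\hat{p}(x)} = \frac{k_c}{k}.
```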

13
k-nearest-neighbor classification rule
  • Effect of k on classification boundary
  • Larger number of neighbors
  • Larger regions
  • Smoother class boundaries

14
Kernel density estimation methods
  • Consider a simple estimator of the cumulative
    distribution function: the fraction of data points
    smaller than x
  • Its derivative gives an estimator of the density
    function, but this is just a set of delta peaks.
  • The derivative is defined as a limit over a window
    of width 2h around x, with h going to zero
  • Consider a non-limiting value of h
  • Each data point adds a block of height 1/(2hN)
    within distance h around it; the sum of the blocks
    gives the estimate (written out after this list)
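
Written out for one-dimensional data (the standard block/Parzen estimator;
notation added here):

```latex
\hat{p}(x) \;=\; \frac{1}{2hN} \sum_{n=1}^{N} \mathbf{1}\!\left[\, |x - x_n| \le h \,\right].
```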

15
Kernel density estimation methods
  • A kernel other than the block function can be used
    to obtain a smooth estimator.
  • A widely used kernel function is the (multivariate)
    Gaussian (see the sketch after this list)
  • Contribution decreases smoothly as a function of
    the distance to data point.
  • Choice of smoothing parameter
  • A larger kernel size gives a smoother density
    estimator
  • Use the average distance between samples.
  • Use cross-validation.
  • The method can be used for multivariate data
  • Or in a naïve Bayes model
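
A minimal sketch of the Gaussian kernel density estimator, assuming an
isotropic kernel with a given bandwidth h (the function name and interface
are illustrative, not from the slides):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Kernel density estimate at query points x (M, d) from samples
    data (N, d), using an isotropic Gaussian kernel of bandwidth h."""
    d = x.shape[1]
    # squared distances between every query point and every sample
    sq_dist = ((x[:, None, :] - data[None, :, :]) ** 2).sum(axis=2)
    norm = (2.0 * np.pi * h ** 2) ** (-d / 2.0)   # Gaussian normalizer
    # average the kernel contributions of all samples
    return norm * np.exp(-sq_dist / (2.0 * h ** 2)).mean(axis=1)
```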

16
Summary generative classification methods
  • (Semi-)parametric models (e.g. p(data | category)
    is a Gaussian or a mixture)
  • No need to store data, but possibly too strong
    assumptions on data density
  • Can lead to poor fit on data, and poor
    classification result
  • Non-parametric models
  • Histograms
  • Only practical in low-dimensional spaces (< 5 or
    so)
  • High dimensional space will lead to many cells,
    many of which will be empty
  • Naïve Bayes modeling in higher dimensional cases
  • k-nearest neighbor and kernel density estimation
  • Need to store all training data
  • Need to find nearest neighbors or points with
    non-zero kernel evaluation (costly)

[Figure: density estimates obtained with a histogram, k-NN, and k.d.e.]
17
Discriminative vs generative methods
  • Generative probabilistic methods
  • Model the density of inputs x from each class:
    p(x|y)
  • Estimate class prior probability p(y)
  • Use Bayes rule to infer distribution over class
    given input
  • Discriminative (probabilistic) methods (next
    week)
  • Directly estimate class probability given input:
    p(y|x)
  • Some methods do not have a probabilistic
    interpretation,
  • e.g. they fit a function f(x), and assign to class
    1 if f(x) > 0, and to class 2 if f(x) < 0
  • Hybrid generative-discriminative models
  • Fit density model to data
  • Use properties of this model as input for
    classifier
  • Example: Fisher vectors for image representation

18
Clustering for visual vocabulary construction
  • Clustering of local image descriptors
  • using k-means or mixture of Gaussians
  • Recap of the image representation pipeline
  • Extract image regions at various locations and
    scales; compute a descriptor for each region
    (e.g. SIFT)
  • (Soft) assignment of each descriptor to clusters
  • Make a histogram for the complete image
  • Summing of the vector representations of each
    descriptor
  • Input to the image classification method (a
    hard-assignment sketch follows below)

[Figure: image regions mapped to cluster indexes]
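
A minimal sketch of the hard-assignment bag-of-words step of this pipeline,
assuming k-means cluster centers are already available (function and variable
names are illustrative):

```python
import numpy as np

def bow_histogram(descriptors, centers):
    """Bag-of-words image representation: hard-assign each local
    descriptor (N, D), e.g. SIFT, to its nearest of K cluster centers
    (K, D) and count the assignments."""
    # squared distance of every descriptor to every cluster center
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignment = d2.argmin(axis=1)                 # nearest visual word
    hist = np.bincount(assignment, minlength=len(centers))
    return hist / hist.sum()                       # normalized histogram
```
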
19
Fisher Vector motivation
  • Feature vector quantization is computationally
    expensive in practice
  • Run-time is linear in
  • N: nr. of feature vectors, ~10^3 per image
  • D: nr. of dimensions, ~10^2 (SIFT)
  • K: nr. of clusters, ~10^3 for recognition
  • So in total on the order of 10^8 multiplications
    per image to assign SIFT descriptors to visual
    words
  • We use a histogram of visual word counts
  • Can we do this more efficiently ?!
  • Reading material: "Fisher Kernels on Visual
    Vocabularies for Image Categorization",
    F. Perronnin and C. Dance, CVPR 2007,
    Xerox Research Centre Europe, Meylan

20
Fisher vector image representation
  • MoG / k-means stores nr of points per cell
  • Need many clusters to represent distribution of
    descriptors in image
  • But increases computational cost
  • Fisher vector adds 1st and 2nd order moments
  • More precise description of the regions assigned
    to each cluster
  • Fewer clusters needed for the same accuracy
  • Representation is (2D+1) times larger, at the same
    computational cost
  • Terms already calculated when computing
    soft-assignment

q_nk: soft-assignment of image region n to cluster k
(Gaussian mixture component); see the expression below.
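
The soft-assignment (responsibility) q_nk has the standard mixture-of-Gaussians
form (notation added here, consistent with the earlier clustering lectures):

```latex
q_{nk} \;=\; \frac{\pi_k \,\mathcal{N}(x_n;\, \mu_k, \Sigma_k)}
                  {\sum_{j=1}^{K} \pi_j \,\mathcal{N}(x_n;\, \mu_j, \Sigma_j)}.
```
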
21
Image representation using Fisher kernels
  • General idea of the Fisher vector representation
  • Fit a probabilistic model to the data
  • Use the derivative of the data log-likelihood as
    the data representation, e.g. for classification
    (see the expression after this list)
  • Jaakkola & Haussler, "Exploiting generative
    models in discriminative classifiers", in
    Advances in Neural Information Processing Systems
    11, 1999.
  • Here, we use Mixture of Gaussians to cluster the
    region descriptors
  • Concatenate derivatives to obtain data
    representation
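
In the notation of Jaakkola & Haussler, the representation is the gradient of
the log-likelihood with respect to the model parameters, and the associated
Fisher kernel uses the Fisher information matrix F (standard formulation,
added here for reference):

```latex
g(X) \;=\; \nabla_{\theta} \log p(X \mid \theta),
\qquad
K(X, X') \;=\; g(X)^{\top} F^{-1} g(X'),
\qquad
F \;=\; \mathbb{E}\!\left[ g(X)\, g(X)^{\top} \right].
```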

22
Image representation using Fisher kernels
  • Extended representation of image descriptors
    using MoG
  • Displacement of descriptor from center
  • Squares of displacement from center
  • From 1 number per descriptor per cluster, to
    1 + D + D^2 numbers (D: data dimension)
  • Simplified version obtained when
  • using this representation for a linear classifier
  • with diagonal covariance matrices, the variance in
    each dimension given by a vector v_k
  • For a single image region descriptor
  • Summed over all descriptors this gives us
  • 1: soft count of regions assigned to the cluster
  • D: weighted average of assigned descriptors
  • D: weighted variance of descriptors in all
    dimensions (a sketch of these statistics follows
    below)
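
A minimal sketch of the per-cluster statistics listed above, given descriptors
and soft assignments; it omits the normalization by the mixture parameters
used in the full Fisher vector of Perronnin & Dance, and the names are
illustrative:

```python
import numpy as np

def fisher_statistics(X, q, means):
    """Per-cluster statistics behind the Fisher vector representation.

    X: (N, D) region descriptors, q: (N, K) soft assignments q_nk,
    means: (K, D) Gaussian mixture means. Returns a (2D+1)*K vector:
    soft counts, weighted displacements, weighted squared displacements."""
    counts = q.sum(axis=0)                            # (K,)   soft counts
    diff = X[:, None, :] - means[None, :, :]          # (N, K, D) displacements
    first = (q[:, :, None] * diff).sum(axis=0)        # (K, D) weighted displacement
    second = (q[:, :, None] * diff ** 2).sum(axis=0)  # (K, D) weighted squared displ.
    return np.concatenate([counts, first.ravel(), second.ravel()])
```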

23
Fisher vector image representation
  • MoG / k-means stores nr of points per cell
  • Need many clusters to represent distribution of
    descriptors in image
  • Fisher vector adds 1st and 2nd order moments
  • More precise description of the regions assigned
    to each cluster
  • Fewer clusters needed for the same accuracy
  • Representation is (2D+1) times larger, at the same
    computational cost
  • Terms already calculated when computing
    soft-assignment
  • Computational cost is O(NKD); we need the
    differences between all clusters and all data
    points

24
Images from categorization task PASCAL VOC
  • Yearly competition for image classification
    (also object localization, segmentation, and
    body-part localization)

25
Fisher Vector results
  • BOV-supervised learns a separate mixture model for
    each image class, which makes some of the visual
    words class-specific
  • MAP: assign the image to the class whose MoG
    assigns maximum likelihood to the region
    descriptors
  • The other results are based on a linear classifier
    over the image descriptions
  • Similar performance, using 16x fewer Gaussians
  • The unsupervised/universal representation works
    well

26
Plan for this course
  • Introduction to machine learning
  • Clustering techniques
  • k-means, Gaussian mixture density
  • Gaussian mixture density continued
  • Parameter estimation with EM
  • Classification techniques 1
  • Introduction, generative methods, semi-supervised
  • Reading for next week
  • Previous papers (!), nothing new
  • Available on the course website:
    http://lear.inrialpes.fr/verbeek/teaching
  • Classification techniques 2
  • Discriminative methods, kernels
  • Decomposition of images