Transcript and Presenter's Notes

Title: Part 3: Supervised Learning


1
Machine Learning Techniques for Computer Vision
  • Part 3: Supervised Learning

Christopher M. Bishop
Microsoft Research Cambridge
ECCV 2004, Prague
2
Overview of Part 3
  • Linear models for regression and classification
  • Decision theory
  • Discriminative versus generative methods
  • The curse of dimensionality
  • Sparse kernel machines, boosting
  • Neural networks

3
Linear Basis Function Models
  • Prediction given by linear combination of basis
    functions
  • Example: polynomials, so that the basis
    functions are powers of the input (see below)
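
  The equations on this slide were images; in standard linear-basis-function
  notation (symbols assumed, not taken from the slide) the model and the
  polynomial basis are

  \[
  y(\mathbf{x},\mathbf{w}) = \sum_{j=0}^{M-1} w_j \phi_j(\mathbf{x})
    = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}),
  \qquad
  \phi_j(x) = x^j .
  \]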

4
Least Squares
  • Minimize sum-of-squares error function
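
  A plausible reconstruction of the error function referred to here, in the
  same notation:

  \[
  E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}
    \bigl\{ t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) \bigr\}^2 .
  \]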

5
Least Squares Solution
  • Exact closed-form minimizer given below, where
    the design matrix collects the basis functions
    evaluated at the training inputs
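
  Assuming the notation above, the least-squares solution and design matrix
  are

  \[
  \mathbf{w}_{\mathrm{ML}}
    = \bigl(\boldsymbol{\Phi}^{\mathrm{T}}\boldsymbol{\Phi}\bigr)^{-1}
      \boldsymbol{\Phi}^{\mathrm{T}}\mathbf{t},
  \qquad
  \Phi_{nj} = \phi_j(\mathbf{x}_n).
  \]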

6
Model Complexity
7
Generalization Error
8
Regularization
  • Discourage large weight values by adding a penalty
    term to the error function
  • Also called ridge regression, shrinkage, or
    weight decay
  • The regularization coefficient λ now controls
    the effective model complexity
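
  The penalized error function described here, in the same notation, with
  regularization coefficient λ:

  \[
  \widetilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}
    \bigl\{ t_n - \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}_n) \bigr\}^2
    + \frac{\lambda}{2}\,\mathbf{w}^{\mathrm{T}}\mathbf{w}.
  \]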

9
Regularized M = 9 Polynomial
10
Regularized Parameters
11
Generalization
12
Probability Theory
  • Target values are corrupted with noise which is
    intrinsically unpredictable from the observed
    inputs
  • Inputs themselves may be noisy
  • The most complete description of the data is the
    joint distribution
  • The parameters of any model are also uncertain:
    Bayesian probabilities

13
Decision Theory
  • Loss function
  • loss incurred in choosing an output when the
    truth is something else
  • Minimize the average, or expected, loss
  • Two phases
  • Inference: model the probability distribution
    (hard)
  • Decision: choose the optimal output (easy)
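
  The expected loss on this slide was an image; with loss L(t, y(x)) for
  choosing y(x) when the truth is t, it is

  \[
  \mathbb{E}[L] = \iint L\bigl(t, y(\mathbf{x})\bigr)\,
    p(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t .
  \]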

14
Squared Loss
  • Common choice for regression is the squared
    loss
  • Minimum expected loss given by the conditional
    average
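
  In the same notation, the squared loss and its minimizer, the conditional
  average:

  \[
  L\bigl(t, y(\mathbf{x})\bigr) = \bigl\{ y(\mathbf{x}) - t \bigr\}^2,
  \qquad
  y^{\star}(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}]
    = \int t\, p(t \mid \mathbf{x})\,\mathrm{d}t .
  \]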

15
Squared Loss
16
Classification
  • Assign input vector to one of two or more
    classes
  • Joint distribution of data and classes
  • Any decision rule divides input space into
    decision regions separated by decision boundaries

17
Minimum Misclassification Rate
  • Simplest loss: minimize the number of
    misclassifications
  • For two classes
  • Since the joint factorizes into posterior times
    marginal, this says assign each input to the class
    for which the posterior probability is largest
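
  For two classes with decision regions R_1 and R_2, the misclassification
  probability is

  \[
  p(\text{mistake})
    = \int_{R_1} p(\mathbf{x}, C_2)\,\mathrm{d}\mathbf{x}
    + \int_{R_2} p(\mathbf{x}, C_1)\,\mathrm{d}\mathbf{x},
  \]

  and since \( p(\mathbf{x}, C_k) = p(C_k \mid \mathbf{x})\, p(\mathbf{x}) \),
  it is minimized by assigning each x to the class with the largest posterior
  \( p(C_k \mid \mathbf{x}) \).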

18
Minimum Misclassification Rate
19
General Loss Matrix for Classification
  • Loss in choosing class j when the true class is
    class k is denoted L_kj
  • Expected loss given below
  • Minimized by choosing the class which
    minimizes the posterior-weighted loss
  • again, this is trivial once we know the posterior
    probabilities
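
  With loss-matrix entries L_kj, the expected loss and its minimizer are

  \[
  \mathbb{E}[L] = \sum_k \sum_j \int_{R_j} L_{kj}\,
    p(\mathbf{x}, C_k)\,\mathrm{d}\mathbf{x},
  \qquad
  j^{\star}(\mathbf{x}) = \arg\min_j \sum_k L_{kj}\, p(C_k \mid \mathbf{x}).
  \]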

20
Generative vs. Discriminative Models
  • Generative approach: separately model the
    class-conditional densities and priors, then
    evaluate posterior probabilities using Bayes'
    theorem
  • Discriminative approaches
  • model posterior probabilities directly
  • just predict class label (no inference stage)
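
  Bayes' theorem as used here:

  \[
  p(C_k \mid \mathbf{x})
    = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}
           {\sum_j p(\mathbf{x} \mid C_j)\, p(C_j)} .
  \]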

21
Generative vs. Discriminative
22
Unlabelled Data
23
Generative Methods
  • + Relatively straightforward to characterize
    invariances
  • + They can handle partially labelled data
  • − They waste flexibility modelling variability
    which is unimportant for classification
  • − They scale badly with the number of classes and
    the number of invariant transformations (slow on
    test data)

24
Discriminative Methods
  • + They use the flexibility of the model in
    relevant regions of input space
  • + They can be extremely fast once trained
  • − They interpolate between training examples, and
    hence can fail if novel inputs are presented
  • − They don't easily handle compositionality (e.g.
    faces can have glasses and/or moustaches)

25
Advantages of Knowing Posterior Probabilities
  • No re-training if the loss matrix changes (e.g.
    screening)
  • inference is hard, the decision stage is easy
  • Reject option: don't make a decision when the
    largest probability is less than a threshold (e.g.
    screening)
  • Compensating for skewed class priors (e.g.
    screening)
  • Combining models, e.g. independent measurements

26
Curve Fitting Re-visited
  • Probabilistic formulation
  • Assume target data are generated from a deterministic
    function plus additive Gaussian noise
  • Conditional distribution
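
  The conditional distribution described by these bullets, assuming noise
  precision β:

  \[
  t = y(\mathbf{x}, \mathbf{w}) + \epsilon,
  \qquad
  p(t \mid \mathbf{x}, \mathbf{w}, \beta)
    = \mathcal{N}\bigl(t \mid y(\mathbf{x}, \mathbf{w}),\, \beta^{-1}\bigr).
  \]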

27
Maximum Likelihood
  • Training data set
  • Likelihood function
  • Log likelihood
  • Maximum likelihood equivalent to least squares
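
  With training set {(x_n, t_n)}, n = 1, ..., N, the likelihood and log
  likelihood referred to here are

  \[
  p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)
    = \prod_{n=1}^{N} \mathcal{N}\bigl(t_n \mid y(\mathbf{x}_n, \mathbf{w}),\, \beta^{-1}\bigr),
  \]
  \[
  \ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)
    = -\frac{\beta}{2}\sum_{n=1}^{N}\bigl\{ y(\mathbf{x}_n, \mathbf{w}) - t_n \bigr\}^2
      + \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi ,
  \]

  so maximizing over w is equivalent to minimizing the sum-of-squares error.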

28
Parameter Prior
  • Gaussian prior
  • Log posterior probability
  • MAP (maximum posterior) equivalent to regularized
    least squares with λ = α/β (see below)
  • Bayesian optimization of λ (model complexity)
  • requires marginalization over w
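
  Assuming a zero-mean isotropic Gaussian prior with precision α, the prior
  and log posterior are

  \[
  p(\mathbf{w} \mid \alpha) = \mathcal{N}\bigl(\mathbf{w} \mid \mathbf{0},\, \alpha^{-1}\mathbf{I}\bigr),
  \qquad
  \ln p(\mathbf{w} \mid \mathbf{t})
    = -\frac{\beta}{2}\sum_{n}\bigl\{ y(\mathbf{x}_n, \mathbf{w}) - t_n \bigr\}^2
      - \frac{\alpha}{2}\,\mathbf{w}^{\mathrm{T}}\mathbf{w} + \text{const},
  \]

  so the MAP solution matches regularized least squares with λ = α/β.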

29
Classification: Two Classes
  • Posterior class probability given by the function
    of the log odds shown below
  • Called the logistic sigmoid function
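
  In standard notation, with a the log odds,

  \[
  p(C_1 \mid \mathbf{x}) = \sigma(a) = \frac{1}{1 + \exp(-a)},
  \qquad
  a = \ln\frac{p(\mathbf{x} \mid C_1)\, p(C_1)}{p(\mathbf{x} \mid C_2)\, p(C_2)} .
  \]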

30
Logistic Regression
  • Fit parameterized model directly
  • Target variable
  • Class probability
  • Log likelihood function (cross-entropy)
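
  With targets t_n in {0, 1} and basis vectors φ_n, the model and the
  cross-entropy error are

  \[
  y_n = p(C_1 \mid \boldsymbol{\phi}_n)
      = \sigma\bigl(\mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}_n\bigr),
  \qquad
  E(\mathbf{w}) = -\sum_{n=1}^{N}\bigl\{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \bigr\}.
  \]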

31
Logistic Regression
  • Fixed non-linear basis functions
  • convex optimization problem
  • efficient Newton-Raphson method (IRLS), sketched
    below
  • decision boundaries linear in feature space but
    non-linear in input space
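
  A minimal Python sketch of the Newton-Raphson (IRLS) update for this model,
  given a fixed design matrix (function and variable names are illustrative,
  not from the slides):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def irls_logistic(Phi, t, n_iter=20):
        """Fit logistic-regression weights by iteratively reweighted least squares.
        Phi: (N, M) design matrix of basis-function activations.
        t:   (N,) binary targets in {0, 1}."""
        N, M = Phi.shape
        w = np.zeros(M)
        for _ in range(n_iter):
            y = sigmoid(Phi @ w)               # current class probabilities
            r = y * (1.0 - y)                  # diagonal of the weighting matrix R
            grad = Phi.T @ (y - t)             # gradient of the cross-entropy error
            H = Phi.T @ (Phi * r[:, None])     # Hessian = Phi^T R Phi
            w -= np.linalg.solve(H + 1e-8 * np.eye(M), grad)  # Newton step (small ridge for stability)
        return w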

32
Basis Functions
33
Classification: More Than Two Classes
  • Posterior probability given by the expression
    below
  • Called the softmax or normalized exponential
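
  In standard notation,

  \[
  p(C_k \mid \mathbf{x}) = \frac{\exp(a_k)}{\sum_j \exp(a_j)},
  \qquad
  a_k = \ln\bigl( p(\mathbf{x} \mid C_k)\, p(C_k) \bigr).
  \]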

34
Question
Why not simply use fixed basis function models
for all pattern recognition problems?
35
A History Lesson: the Perceptron (1957)
36
Perceptron Learning Algorithm
  • Perceptron function
  • For each misclassified pattern in turn, update
    the weights, where the target values are ±1
  • Guaranteed to converge in a finite number of
    steps, if there exists an exact solution
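
  The perceptron uses y(x) = f(wᵀφ(x)) with f a step function, and for each
  misclassified pattern applies the update w ← w + η φ(x_n) t_n, with targets
  t_n in {-1, +1}. A minimal Python sketch under these assumptions (names are
  illustrative):

    import numpy as np

    def perceptron_train(Phi, t, eta=1.0, max_epochs=100):
        """Perceptron learning: cycle through misclassified patterns and update w.
        Phi: (N, M) feature vectors (include a bias column); t: (N,) targets in {-1, +1}."""
        w = np.zeros(Phi.shape[1])
        for _ in range(max_epochs):
            mistakes = 0
            for phi_n, t_n in zip(Phi, t):
                if np.sign(phi_n @ w) != t_n:  # misclassified pattern
                    w += eta * phi_n * t_n     # perceptron update rule
                    mistakes += 1
            if mistakes == 0:                  # all patterns correct: converged
                break
        return w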

37
Perceptron Hardware
38
Perceptrons (1969)
"The perceptron has many features that
attract attention: its linearity, its intriguing
learning theorem, its clear paradigmatic
simplicity as a kind of parallel computation.
There is no reason to suppose that any of these
virtues carry over to the many-layered version.
Nevertheless, we consider it to be an important
research problem to elucidate (or reject) our
intuitive judgement that the extension is
sterile." (pp. 231-232)
39
Curse of Dimensionality
40
Intrinsic Dimensionality
  • Data often lives on a much lower-dimensional
    manifold
  • example: images of a rigid object
  • Also for most problems the outputs are smooth
    functions of the inputs, so we can use
    interpolation

41
Adaptive Basis Functions: Strategy 1
  • Position the basis functions in regions of input
    space occupied by the data
  • one basis function centred on each data point
  • Select from set of fixed candidates during
    training
  • Support Vector Machine (SVM)
  • Relevance Vector Machine (RVM)

42
Support Vector Machine
  • Consider two linearly-separable classes and a
    linear model
  • Maximizing the margin gives a sparse solution
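
  With targets t_n in {-1, +1}, the linear model and the margin-maximization
  problem referred to here are

  \[
  y(\mathbf{x}) = \mathbf{w}^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}) + b,
  \qquad
  \min_{\mathbf{w}, b}\ \tfrac{1}{2}\|\mathbf{w}\|^2
  \quad\text{subject to}\quad t_n\, y(\mathbf{x}_n) \ge 1 .
  \]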

43
Maximum Margin
  • Justification from statistical learning theory
  • Bayesian marginalization also gives a large
    margin
  • e.g. logistic regression

44
Quadratic Programming
  • Extend to non-linear feature space
  • Target values
  • Minimize dual quadratic form (convex
    optimization), subject to the constraints below
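
  In standard dual notation, with Lagrange multipliers a_n and kernel k, the
  quadratic program is

  \[
  \min_{\mathbf{a}}\ \frac{1}{2}\sum_{n}\sum_{m} a_n a_m t_n t_m\,
    k(\mathbf{x}_n, \mathbf{x}_m) - \sum_{n} a_n
  \quad\text{subject to}\quad a_n \ge 0,\ \ \sum_{n} a_n t_n = 0 .
  \]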

45
Overlapping Classes
  • Slack variables
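
  With slack variables ξ_n, the soft-margin problem is

  \[
  \min_{\mathbf{w}, b, \boldsymbol{\xi}}\ C\sum_{n}\xi_n + \tfrac{1}{2}\|\mathbf{w}\|^2
  \quad\text{subject to}\quad t_n\, y(\mathbf{x}_n) \ge 1 - \xi_n,\ \ \xi_n \ge 0 .
  \]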

46
Kernels
  • SVM solution depends only on dot products in
    feature space
  • Feature space can be high (infinite)
    dimensional
  • Kernels must be symmetric, positive definite
    (Mercer)
  • Examples
  • polynomial
  • Gaussian
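
  The kernel and the two example families, in standard notation (c, M and σ
  are free parameters, not taken from the slides):

  \[
  k(\mathbf{x}, \mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^{\mathrm{T}}\boldsymbol{\phi}(\mathbf{x}'),
  \qquad
  k(\mathbf{x}, \mathbf{x}') = \bigl(\mathbf{x}^{\mathrm{T}}\mathbf{x}' + c\bigr)^{M},
  \qquad
  k(\mathbf{x}, \mathbf{x}') = \exp\!\Bigl(-\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\sigma^2}\Bigr).
  \]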

47
Example Face Detection
  • Romdhani, Torr, Schölkopf and Blake (2001)
  • Cascade of ever more complex (slower) models
  • low false negative rate at each step
  • cf. boosting hierarchy of Viola and Jones (2001)

48
Face Detection
49
Face Detection
50
Face Detection
51
Face Detection
52
Face Detection
53
AdaBoost
Final classifier is a linear combination of weak
classifiers
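
  In standard AdaBoost notation, with weak classifiers h_m and weights α_m:

  \[
  H(\mathbf{x}) = \operatorname{sign}\Bigl(\sum_{m=1}^{M} \alpha_m\, h_m(\mathbf{x})\Bigr).
  \]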
54
Simple Features
  • Viola and Jones (2001)

55
Limitations of the SVM
  • Two classes
  • Large number of kernels (in spite of sparsity)
  • Kernels must satisfy Mercer criterion
  • Cross-validation to set parameters C (and ε)
  • Decisions at outputs instead of probabilities

56
Multiple Classes
57
Relevance Vector Machine
  • Linear model as for SVM
  • Regression
  • Classification
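
  The RVM uses the same kernel expansion as the SVM; in standard notation,

  \[
  y(\mathbf{x}) = \sum_{n=1}^{N} w_n\, k(\mathbf{x}, \mathbf{x}_n) + b,
  \]

  with \( p(t \mid \mathbf{x}) = \mathcal{N}\bigl(t \mid y(\mathbf{x}), \beta^{-1}\bigr) \)
  for regression and \( p(C_1 \mid \mathbf{x}) = \sigma\bigl(y(\mathbf{x})\bigr) \)
  for classification.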

58
Relevance Vector Machine
  • Gaussian prior for the weights, with one
    hyper-parameter per weight
  • Marginalize over the weights
  • sparse solution
  • automatic relevance determination
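
  The hierarchical prior referred to here, with one hyper-parameter α_i per
  weight:

  \[
  p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i} \mathcal{N}\bigl(w_i \mid 0,\ \alpha_i^{-1}\bigr);
  \]

  maximizing the marginal likelihood drives many α_i to infinity, pruning the
  corresponding basis functions (automatic relevance determination).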

59
SVM-RVM Comparison
SVM
RVM
60
RVM Tracking
  • Williams, Blake and Cipolla (2003)

61
RVM Tracking
62
Adaptive Basis Functions: Strategy 2
  • Neural networks
  • Use a small number of efficient planar basis
    functions
  • Adapt the parameters of the basis functions by
    global optimization of a cost function

63
Neural Networks for Regression
  • Simplest model has two layers of adaptive
    functions

not a probabilistic graphical model
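
  In standard two-layer notation, with hidden-unit activation h (e.g. tanh):

  \[
  y_k(\mathbf{x}, \mathbf{w})
    = \sum_{j} w^{(2)}_{kj}\, h\!\Bigl(\sum_{i} w^{(1)}_{ji} x_i + w^{(1)}_{j0}\Bigr)
      + w^{(2)}_{k0} .
  \]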
64
Neural Networks for Classification
  • For binary classification use logistic sigmoid
  • For K-class classification use softmax function

65
General Topologies
66
Error Minimization
  • For regression use sum-of-squares error
  • For classification use cross-entropy error
  • Minimize error function using
  • gradient descent
  • conjugate gradients
  • quasi-Newton methods
  • Requires derivatives of the error function
  • efficiently evaluated using error
    back-propagation, in O(W) operations for W weights
  • compared to O(W^2) for numerical differentiation

67
Error Back-propagation
  • Derived from the chain rule for partial derivatives
  • Three stages (see the sketch after this list)
  • Evaluate an error signal at the output units
  • Propagate the signal backwards through the
    network
  • Evaluate derivatives
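
  A minimal Python sketch of these three stages for a two-layer tanh network
  with linear outputs and sum-of-squares error (biases omitted; names are
  illustrative):

    import numpy as np

    def mlp_backprop(x, t, W1, W2):
        """One forward/backward pass. x: (D,) input, t: (K,) target,
        W1: (H, D) first-layer weights, W2: (K, H) second-layer weights."""
        a = W1 @ x                                     # hidden pre-activations
        z = np.tanh(a)                                 # hidden-unit outputs
        y = W2 @ z                                     # linear network outputs
        delta_out = y - t                              # 1. error signal at the output units
        delta_hid = (1.0 - z**2) * (W2.T @ delta_out)  # 2. propagate the signal backwards
        grad_W2 = np.outer(delta_out, z)               # 3. derivatives w.r.t. the weights
        grad_W1 = np.outer(delta_hid, x)
        return grad_W1, grad_W2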

68
Synthetic Data
69
Convolutional Neural Networks
Le Cun et al.
70
Classification
71
Noise Robustness
72
Face Detection
  • Osadchy, Miller and LeCun (2003)

73
Summary of Part 3
  • Decision theory
  • Generative versus discriminative approaches
  • Linear models and the curse of dimensionality
  • Selecting basis functions
  • support vector machine
  • relevance vector machine
  • Adapting basis functions
  • neural networks

74
Suggested Reading
Oxford University Press
75
New Book
  • Pattern Recognition and Machine Learning
  • Springer (2005)
  • 600 pages, hardback, four colour, low price
  • Graduate-level text book
  • worked solutions to all 250 exercises
  • complete lectures on www
  • Matlab software and companion text with Ian
    Nabney

76
Viewgraphs and papers
  • http://research.microsoft.com/cmbishop