Some Aspects of Bayesian Approach to Model Selection

1
Some Aspects of Bayesian Approach to Model
Selection
  • Vetrov Dmitry
  • Dorodnicyn Computing Centre of RAS, Moscow

2
Our research team
  • My colleague
  • Dmitry Kropotov, PhD student of MSU
  • Students
  • Nikita Ptashko
  • Pavel Tolpegin
  • Igor Tolstov

3
Overview
  • Problem formulation
  • Ways of solution
  • Bayesian paradigm
  • Bayesian regularization of kernel classifiers

4
Quality vs. Reliability
  • A general problem
  • Which tools should we use to solve a task:
    sophisticated and complex but accurate, or
    simple but reliable?
  • A trade-off between quality and reliability is
    needed

5
Machine learning interpretation
6
Regularization
  • The easiest way to reach a compromise is to
    regularize the criterion function with a
    heuristic regularizer

The general problem is HOW to express accuracy
and reliability in the same terms. In other words,
how should the regularization coefficient be defined?
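A regularized criterion is commonly written in the form below. This is only a sketch under assumed notation: the symbols $J$, $E$, $R$ and $\lambda$ are not taken from the slides. Here $E(w)$ stands for the training error (accuracy), $R(w)$ for the heuristic regularizer (reliability), and $\lambda$ for the regularization coefficient whose choice is the open question.

```latex
J(w) = E(w) + \lambda \, R(w), \qquad \lambda \ge 0
```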
7
General ways of compromise I
  • Structural Risk Minimization (SRM) penalizes
    the flexibility of a classifier, expressed as
    its VC-dimension.

Drawback: the VC-dimension is very difficult to
compute and its estimates are too rough. The
resulting upper bound on the test error is too
loose and often exceeds 1
8
General ways of compromise II
  • Minimum Description Length (MDL) penalizes the
    algorithmic complexity of a classifier. The
    classifier is treated as a coding algorithm. We
    encode both the training data and the algorithm
    itself, trying to minimize the total description
    length

9
Important aspect
  • All the schemes described above penalize the
    flexibility or complexity of a classifier, but
    is that what we really need?

"A complex classifier does not always mean a bad
classifier." (Ludmila Kuncheva, private
communication)
10
Maximum likelihood principle
  • The well-known maximum likelihood principle
    states that we should select the classifier with
    the largest likelihood (i.e. accuracy on the
    training sample)

11
Bayesian view
Posterior = Likelihood × Prior / Evidence
12
Model Selection
  • Suppose we have several classifier families
    and want to know which family is better without
    resorting to computationally expensive
    cross-validation.
  • This problem is known as the model selection
    task

13
Bayesian framework I
  • Find the best model, i.e. the optimal value of
    the hyperparameter
  • If all models are equally likely a priori, then
    the posterior probability of a model is
    proportional to its evidence
  • Note that it is exactly the evidence that should
    be maximized to find the best model
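Evidence maximization can be illustrated on a model where the evidence has a closed form. The sketch below is not the classifier setting of the slides: it assumes Bayesian linear regression with a Gaussian prior on the weights with precision `alpha` and a known noise variance, so that the evidence is a Gaussian density in the targets. The function name `log_evidence`, the polynomial features, and the grid of `alpha` values are all hypothetical choices made for illustration.

```python
import numpy as np

def log_evidence(Phi, y, alpha, noise_var):
    """Log marginal likelihood of y for y = Phi w + noise,
    with prior w ~ N(0, I / alpha) and noise ~ N(0, noise_var * I).
    Integrating out w gives y ~ N(0, C), C = noise_var * I + Phi Phi^T / alpha."""
    n = len(y)
    C = noise_var * np.eye(n) + (Phi @ Phi.T) / alpha
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    return -0.5 * (n * np.log(2.0 * np.pi) + logdet + quad)

# Toy data: a degree-5 polynomial with small Gaussian noise (assumed setup).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 30)
Phi = np.vander(x, 6, increasing=True)      # features 1, x, ..., x^5
w_true = rng.normal(0.0, 1.0, size=6)
y = Phi @ w_true + rng.normal(0.0, 0.1, size=30)

# Each value of the hyperparameter alpha defines one candidate model.
alphas = [1e-4, 1e-2, 1.0, 1e2, 1e4]
scores = {a: log_evidence(Phi, y, a, noise_var=0.01) for a in alphas}
best_alpha = max(scores, key=scores.get)    # model with the largest evidence
print(best_alpha, scores[best_alpha])
```

No cross-validation is involved: each candidate is scored on the training data alone, and the integral over the weights is what penalizes priors that are too tight or too loose.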

14
Bayesian framework II
Now compute the posterior distribution over the
parameters and the final likelihood of the test data
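In standard notation these two steps might be written as follows. This is a sketch with assumed symbols, none of which come from the slides: $w$ are the classifier parameters, $D$ the training data, $\alpha^{*}$ the hyperparameter value selected by evidence maximization, and $(x_{\text{new}}, t_{\text{new}})$ a test object and its label.

```latex
p(w \mid D, \alpha^{*}) = \frac{p(D \mid w)\, p(w \mid \alpha^{*})}{p(D \mid \alpha^{*})},
\qquad
p(t_{\text{new}} \mid x_{\text{new}}, D) = \int p(t_{\text{new}} \mid x_{\text{new}}, w)\, p(w \mid D, \alpha^{*})\, dw
```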
15
Why do we need model selection?
  • The answer is simple
  • Many classifiers (e.g. neural networks or support
    vector machines) require additional parameters
    to be set by the user before training starts.
  • IDEA: These parameters can be viewed as model
    hyperparameters, and the Bayesian framework can
    be applied to select their best values

16
What is evidence
The red model has the larger likelihood, but the
green model has the larger evidence. It is more
stable, so we may hope for better generalization
17
Support vector machines
  • The separating surface is defined as a linear
    combination of kernel functions
  • The weights are determined by solving a quadratic
    programming (QP) optimization problem
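The kernel expansion of the separating surface can be sketched as below. The Gaussian kernel, the function names, and the weights are assumptions made for illustration: the weights here are hand-picked and are not the solution of any QP problem.

```python
import numpy as np

def gaussian_kernel(a, b, width):
    """K(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 * width^2)) for all row pairs."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dist / (2.0 * width ** 2))

def decision_function(x_new, centres, weights, bias, width):
    """f(x) = sum_i weights[i] * K(x, centres[i]) + bias."""
    return gaussian_kernel(x_new, centres, width) @ weights + bias

# Two kernel centres with hypothetical weights of opposite sign.
centres = np.array([[0.0, 0.0], [2.0, 2.0]])
weights = np.array([1.0, -1.0])
x_new = np.array([[0.1, 0.0], [1.9, 2.1], [1.0, 1.0]])

scores = decision_function(x_new, centres, weights, bias=0.0, width=1.0)
labels = np.sign(scores)   # the separating surface is the set where f(x) = 0
print(scores, labels)
```

The third test point is equidistant from both centres, so it lies on the separating surface of this toy model.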

18
Bottlenecks of SVM
  • SVM has proved to be one of the best classifiers
    thanks to the maximal margin principle and the
    kernel trick, BUT
  • How do we choose the best kernel for a particular
    task, and the regularization coefficient C?
  • Bad kernels may lead to very poor performance due
    to overfitting or underfitting

19
Relevance Vector Machines
  • A probabilistic approach to kernel models. The
    weights are interpreted as random variables with
    Gaussian prior distributions
  • The maximal evidence principle is used to select
    the best values of the prior hyperparameters (the
    inverse variances of the weight priors). Most of
    them tend to infinity, so the corresponding
    weights become zero, which makes the classifier
    quite sparse
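The sparsity mechanism can be sketched on the regression variant of RVM, where the posterior over the weights is Gaussian and no approximation is needed. This is an assumption-laden illustration rather than the classification procedure of the slides: it uses the standard fixed-point re-estimation of the prior precisions, a known noise precision `beta`, a random toy design matrix instead of kernel functions, and a hypothetical pruning threshold `alpha_prune`.

```python
import numpy as np

def rvm_regression(Phi, t, beta, n_iter=300, alpha_prune=1e6):
    """Sparse Bayesian regression with one prior precision alpha[i] per weight.
    Weights whose precision exceeds alpha_prune are removed from the model."""
    m = Phi.shape[1]
    alpha = np.ones(m)
    active = np.arange(m)            # indices of basis functions still in the model
    weights = np.zeros(m)
    for _ in range(n_iter):
        P = Phi[:, active]
        # Gaussian posterior over the active weights.
        Sigma = np.linalg.inv(beta * P.T @ P + np.diag(alpha[active]))
        mu = beta * Sigma @ P.T @ t
        weights = np.zeros(m)
        weights[active] = mu
        # Evidence-maximizing update: alpha_i = gamma_i / mu_i^2.
        gamma = np.maximum(1.0 - alpha[active] * np.diag(Sigma), 0.0)
        alpha[active] = gamma / np.maximum(mu ** 2, 1e-300)
        active = active[alpha[active] < alpha_prune]
        if active.size == 0:
            break
    weights[alpha >= alpha_prune] = 0.0   # pruned weights are exactly zero
    return weights, alpha

# Toy problem: only 2 of 20 basis functions actually generate the targets.
rng = np.random.default_rng(1)
n, m = 60, 20
Phi = rng.normal(size=(n, m))
w_true = np.zeros(m)
w_true[[3, 11]] = [2.0, -3.0]
t = Phi @ w_true + rng.normal(scale=0.1, size=n)

weights, alpha = rvm_regression(Phi, t, beta=100.0)
n_relevant = np.count_nonzero(weights)
print(n_relevant, weights[[3, 11]])
```

Not every irrelevant basis function is guaranteed to be pruned: some may keep a finite precision and a small weight, so the sketch only shows the tendency toward sparsity.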

20
Sparseness of RVM
[Figure panels: SVM (C = 10), RVM]
21
Numerical implementation of RVM
  • We use the Laplace approximation to avoid
    integration. The likelihood can then be written as
  • where
  • The evidence can then be computed analytically.
    Iterative optimization of the hyperparameters
    becomes possible

22
Evidence interpretation
  • The evidence is then given by
  • but
  • This is exactly STABILITY with respect to changes
    in the weights! The larger the Hessian, the
    smaller the evidence
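The link between the Hessian and the evidence can be made explicit with the standard Laplace approximation. The notation here is assumed rather than recovered from the slides: $Q(w) = p(D \mid w)\, p(w \mid \alpha)$ is the integrand of the evidence, $w_{MP}$ its maximum point, $M$ the number of weights, and $H$ the Hessian of $-\log Q$ at $w_{MP}$.

```latex
\log Q(w) \approx \log Q(w_{MP}) - \tfrac{1}{2}\,(w - w_{MP})^{T} H\, (w - w_{MP}),
\qquad
H = -\nabla\nabla \log Q(w)\,\big|_{w = w_{MP}}

p(D \mid \alpha) = \int Q(w)\, dw \;\approx\; Q(w_{MP})\, (2\pi)^{M/2}\, |H|^{-1/2}
```

A sharply peaked $Q$ has a large $|H|$, so small changes in the weights destroy the fit and the evidence is small. A flat peak has a small $|H|$ and a correspondingly larger evidence.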

23
Kernel selection
  • IDEA: Use the same technique for kernel
    determination, e.g. to find the best width of a
    Gaussian kernel

24
Unexpected problem
  • It turned out that narrow Gaussians are more
    stable with respect to weight changes

25
Solution
  • We allow the kernel centres to be located at
    random points (relevant points). The trade-off
    between narrow (high accuracy on the training
    set) and wide (stable answers) Gaussians can
    then be found.
  • The resulting classifier turned out to be even
    sparser than RVM!

26
Sparseness of GRVM
[Figure panels: RVM, GRVM]
27
Some experimental results
28
Future work
  • Develop fast optimization procedures
  • Optimize the weight hyperparameters and the
    kernel width simultaneously during evidence
    maximization
  • Use a different width for each feature to get
    more sophisticated kernels
  • Apply this approach to polynomial kernels
  • Apply this approach to regression tasks

29
Thank you!
  • Contact information
  • VetrovD_at_yandex.ru, DKropotov_at_yandex.ru
  • http://vetrovd.narod.ru