# Some Aspects of Bayesian Approach to Model Selection


1
Some Aspects of Bayesian Approach to Model Selection
• Vetrov Dmitry
• Dorodnicyn Computing Centre of RAS, Moscow

2
Our research team
• My colleague
• Dmitry Kropotov, PhD student at MSU
• Students
• Nikita Ptashko
• Pavel Tolpegin
• Igor Tolstov

3
Overview
• Problem formulation
• Ways of solution
• Bayesian regularization of kernel classifiers

4
Quality vs. Reliability
• A general problem
• Which tools should we use to solve a task: sophisticated and complex but accurate ones, or simple but reliable ones?
• A trade-off between quality and reliability is needed

5
Machine learning interpretation
6
Regularization
• The easiest way to establish a compromise is to regularize the criterion function using some heuristic regularizer

The general problem is HOW to express accuracy and reliability in the same terms; in other words, how do we define the regularization coefficient?
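As a purely illustrative sketch of such a regularized criterion, here is ridge regression, where the coefficient `lam` plays the role of the heuristic regularization coefficient (the data and the values of `lam` are my assumptions, not from the slides):

```python
import numpy as np

def ridge_weights(X, y, lam):
    """Minimize ||Xw - y||^2 + lam * ||w||^2 (closed-form solution)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.normal(size=50)

# A larger regularization coefficient trades training accuracy
# for smaller, more "reliable" weights.
w_weak = ridge_weights(X, y, lam=0.01)
w_strong = ridge_weights(X, y, lam=100.0)
print(np.linalg.norm(w_weak) > np.linalg.norm(w_strong))  # True
```

The open question posed on this slide is precisely how to pick `lam` in a principled way rather than by trial and error.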
7
General ways of compromise I
• Structural Risk Minimization (SRM) penalizes the flexibility of classifiers, expressed via the VC-dimension of the given classifier

Drawback: the VC-dimension is very difficult to compute, and its estimates are too rough; the resulting upper bound on the test error is too high and often exceeds 1
8
General ways of compromise II
• Minimum Description Length (MDL) penalizes the algorithmic complexity of the classifier. The classifier is treated as a coding algorithm: we encode both the training data and the algorithm itself, trying to minimize the total description length

9
Important aspect
• All the described schemes penalize the flexibility or complexity of the classifier, but is that what we really need?

A complex classifier does not always mean a bad classifier (Ludmila Kuncheva, private communication)
10
Maximum likelihood principle
• The well-known maximum likelihood principle states that we should select the classifier with the largest likelihood (i.e., accuracy on the training sample)

11
Bayesian view
Posterior = Likelihood × Prior / Evidence:
p(w | D) = p(D | w) p(w) / p(D)
12
Model Selection
• Suppose we have different classifier families and want to know which family is better without performing computationally expensive cross-validation
• This problem is also known as model selection

13
Bayesian framework I
• Find the best model, i.e., the optimal value of the hyperparameter
• If all models are equally likely, the posterior over models is proportional to the evidence
• Note that it is exactly the evidence that should be maximized to find the best model
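In standard notation (my choice of symbols — α for the model hyperparameter, w for the parameters, D for the training data; the slide's own formulas did not survive conversion), this framework reads:

```latex
p(\alpha \mid D) = \frac{p(D \mid \alpha)\, p(\alpha)}{p(D)}
\;\propto\; p(D \mid \alpha) \quad \text{(if all models are equally likely),}
\qquad
p(D \mid \alpha) = \int p(D \mid w)\, p(w \mid \alpha)\, dw .
```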

14
Bayesian framework II
Now compute the posterior parameter distribution, p(w | D, α) ∝ p(D | w) p(w | α), and the final likelihood of the test data, p(t | D, α) = ∫ p(t | w) p(w | D, α) dw
15
Why do we need model selection?
• Many classifiers (e.g. neural networks or support vector machines) require some parameters to be set by the user before training starts
• IDEA: These parameters can be viewed as model hyperparameters, and the Bayesian framework can be applied to select their best values

16
What is evidence?
The red model has a larger likelihood, but the green model has better evidence: it is more stable, and we may hope for better generalization
17
Support vector machines
• The separating surface is defined as a linear combination of kernel functions, f(x) = Σ_i w_i k(x, x_i) + b
• The weights are determined by solving a QP optimization problem
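A minimal numerical sketch of such a decision function with a Gaussian kernel (the support points, weights, bias, and kernel width below are illustrative assumptions; in a real SVM the weights come from the QP dual solution):

```python
import numpy as np

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian (RBF) kernel k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def decision(x, points, weights, bias, sigma=1.0):
    """Separating surface f(x) = sum_i w_i k(x, x_i) + b; its sign gives the class."""
    return sum(w * rbf_kernel(x, xi, sigma) for w, xi in zip(weights, points)) + bias

points = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
weights = [1.0, -1.0]  # hand-picked here; from the QP dual in a real SVM
bias = 0.0

print(decision(np.array([0.1, 0.1]), points, weights, bias))  # > 0: class +1
```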

18
Bottlenecks of SVM
• SVM has proved to be one of the best classifiers due to the use of the maximal margin principle and the kernel trick, BUT:
• How do we define the best kernel for a particular task and the regularization coefficient C? A wrong choice leads to overfitting or undertraining

19
Relevance Vector Machines
• A probabilistic approach to kernel models: the weights are interpreted as random variables with a Gaussian prior distribution
• The maximal evidence principle is used to select the best hyperparameter values. Most of them tend to infinity; hence the corresponding weights take zero values, which makes the classifier quite sparse
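The pruning behaviour ("most hyperparameters tend to infinity") can be sketched with the standard type-II maximum-likelihood fixed-point updates on a toy linear regression model; the data, the fixed noise precision, the clipping bounds, and the iteration count are my assumptions, and the real RVM applies this machinery to kernel classifiers:

```python
import numpy as np

# Toy linear model: three basis functions, but only the first one
# generates the (noiseless, hence deterministic) targets.
rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 3))
t = 2.0 * Phi[:, 0]

beta = 400.0           # assumed (fixed) noise precision
alpha = np.ones(3)     # prior precisions of the three weights

for _ in range(50):
    # Gaussian posterior over weights given the current hyperparameters
    Sigma = np.linalg.inv(beta * Phi.T @ Phi + np.diag(alpha))
    mu = beta * Sigma @ Phi.T @ t
    # Evidence-maximization (type-II ML) fixed-point update for each alpha_i
    gamma = 1.0 - alpha * np.diag(Sigma)
    alpha = np.clip(gamma / np.maximum(mu ** 2, 1e-30), 1e-6, 1e12)

# alpha[1] and alpha[2] blow up, so the corresponding weights are pruned
print(alpha)
```

The precisions of the two irrelevant basis functions diverge (here they hit the clipping cap), driving their weights to zero; this is the sparsity mechanism the slide describes.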

20
Sparseness of RVM
SVM (C = 10) vs. RVM
21
Numerical implementation of RVM
• We use the Laplace approximation to avoid integration: the posterior over the weights is approximated by a Gaussian centred at the mode, with covariance given by the inverse Hessian of the negative log-posterior
• The evidence can then be computed analytically, and iterative optimization of the hyperparameters becomes possible

22
Evidence interpretation
• Under the Laplace approximation, the log-evidence contains the term −½ log det H, where H is the Hessian of the negative log-posterior at the mode
• This is exactly STABILITY with respect to weight changes! The larger the Hessian, the smaller the evidence
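A tiny numerical illustration of this relationship, assuming the usual Laplace-approximation formula for the log-evidence (the mode values and Hessians below are made up for illustration):

```python
import numpy as np

def laplace_log_evidence(log_joint_at_mode, hessian):
    """Laplace approximation: log ∫ exp(L(w)) dw ≈ L(w_MP) + (d/2) log(2π) − ½ log det H,
    where H is the Hessian of −L at the mode w_MP."""
    d = hessian.shape[0]
    sign, logdet = np.linalg.slogdet(hessian)
    return log_joint_at_mode + 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet

# Two hypothetical models with the same peak of likelihood × prior,
# but different curvature (Hessian) around the mode:
flat = laplace_log_evidence(-10.0, np.diag([1.0, 1.0]))
sharp = laplace_log_evidence(-10.0, np.diag([100.0, 100.0]))
print(flat > sharp)  # True: larger Hessian, smaller evidence
```

The flat (stable) model wins even though both fit the training data equally well at the mode.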

23
Kernel selection
• IDEA: use the same technique for kernel determination, e.g. for finding the best width of a Gaussian kernel

24
Sudden problem
• It turned out that narrow Gaussians are more stable with respect to weight changes

25
Solution
• We allow the centres of the kernels to be located at arbitrary points (relevant points). The trade-off between narrow (high accuracy on the training set) and wide (stable answers) Gaussians can finally be found
• The resulting classifier turned out to be even sparser than RVM!

26
Sparseness of GRVM
RVM vs. GRVM
27
Some experimental results
28
Future work
• Develop quick optimization procedures
• Optimize the kernel widths and the hyperparameters simultaneously during evidence maximization
• Use a different width for each feature to get more sophisticated kernels
• Apply this approach to polynomial kernels
• Apply this approach to regression tasks

29
Thank you!
• Contact information
• VetrovD@yandex.ru, DKropotov@yandex.ru
• http://vetrovd.narod.ru