Some Aspects of Bayesian Approach to Model Selection

Provided by: Vet7

1
Some Aspects of Bayesian Approach to Model
Selection

Vetrov Dmitry
Dorodnicyn Computing Centre of RAS, Moscow

2
Our research team

My colleague
Dmitry Kropotov, PhD student of MSU
Students
Nikita Ptashko
Pavel Tolpegin
Igor Tolstov

3
Overview

Problem formulation
Ways of solution
Bayesian paradigm
Bayesian regularization of kernel classifiers

4
Quality vs. Reliability

A general problem
Which tools should we use to solve a task:
sophisticated and complex but accurate, or
simple but reliable?
A trade-off between quality and reliability is
needed

5
Machine learning interpretation

6
Regularization

The easiest way to establish a compromise is to
regularize the criterion function using some
heuristic regularizer

The general problem is HOW to express accuracy
and reliability in the same terms. In other words,
how do we choose the regularization coefficient?
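As an illustration of a regularized criterion, here is a minimal sketch assuming a one-parameter linear model with a squared-error fit term and a quadratic (ridge) penalty. The function names, the toy data and the choice of penalty are assumptions made for this example; the slides do not say which regularizer was meant.

```python
def ridge_weight(xs, ys, lam):
    """Minimise sum((y - w*x)**2) + lam * w**2 over the single weight w.

    Setting the derivative to zero gives w = sum(x*y) / (sum(x*x) + lam).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)


def regularized_criterion(xs, ys, w, lam):
    """Fit term (accuracy) plus lam times the penalty (reliability)."""
    fit = sum((y - w * x) ** 2 for x, y in zip(xs, ys))
    return fit + lam * w * w


# Toy data, purely illustrative.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.2, 2.8]

for lam in (0.0, 1.0, 10.0):
    w = ridge_weight(xs, ys, lam)
    print(lam, round(w, 3), round(regularized_criterion(xs, ys, w, lam), 3))
```

The coefficient `lam` sets the exchange rate between the two terms: a larger value shrinks the weight toward zero at the cost of a worse fit, which is exactly why its choice is the hard part.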
7
General ways of compromise I

Structural Risk Minimization (SRM) penalizes
the flexibility of classifiers, expressed as the
VC dimension of the given classifier.

Drawback: the VC dimension is very difficult to
compute and its estimates are too rough. The
upper bound on the test error is too loose and often
exceeds 1

8
General ways of compromise II

Minimum Description Length (MDL) penalizes the
algorithmic complexity of a classifier. The classifier
is considered as a coding algorithm. We encode
both the training data and the algorithm itself, trying to
minimize the total description length

9
Important aspect

All the described schemes penalize the
flexibility or complexity of the classifier, but is
that what we really need?

"A complex classifier does not always mean a bad
classifier." (Ludmila Kuncheva, private
communication)

10
Maximum likelihood principle

The well-known maximum likelihood principle states
that we should select the classifier with the
largest likelihood (i.e. accuracy on the training
sample)

11
Bayesian view
[Formula: Bayes' theorem, with its terms labelled Likelihood, Prior and Evidence]
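The three labels on this slide are the terms of Bayes' theorem. A reconstruction in standard notation is given here, where the symbols are assumptions: w stands for the classifier parameters and D for the training data.

```latex
P(w \mid D)
  = \frac{\overbrace{P(D \mid w)}^{\text{likelihood}}\;
          \overbrace{P(w)}^{\text{prior}}}
         {\underbrace{P(D)}_{\text{evidence}}},
\qquad
P(D) = \int P(D \mid w)\, P(w)\, dw
```

The evidence is the likelihood averaged over the prior, so it does not depend on any single value of w.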
12
Model Selection

Suppose we have several classifier families
and want to know which family is better without
resorting to computationally expensive
cross-validation.
This problem is known as the model selection
task

13
Bayesian framework I

Find the best model, i.e. the optimal value of
the hyperparameter [formula]
If all models are equally likely, then [formula]
Note that it is exactly the evidence that should
be maximized to find the best model
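The formulas on this slide did not survive in the transcript. A plausible reconstruction in standard notation follows; the symbols are assumptions (α for the hyperparameter that indexes the model, w for the classifier parameters, D for the training data).

```latex
\alpha_{MP} = \arg\max_{\alpha} P(\alpha \mid D)
            = \arg\max_{\alpha} P(D \mid \alpha)\, P(\alpha)

% If all models are equally likely, P(\alpha) is constant and drops out:
\alpha_{MP} = \arg\max_{\alpha} P(D \mid \alpha),
\qquad
P(D \mid \alpha) = \int P(D \mid w)\, P(w \mid \alpha)\, dw
```

The quantity P(D | α) is the evidence of the model, which is why maximizing it selects the best model under a flat prior over models.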

14
Bayesian framework II
Now compute the posterior distribution of the parameters
and the final likelihood of the test data
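The two steps named here can be written out as follows. This is a sketch in standard notation, not the slide's own formulas: α_MP denotes the selected hyperparameter value, w the parameters, D the training data and D_test the test data, all of which are assumed symbols.

```latex
P(w \mid D, \alpha_{MP})
  = \frac{P(D \mid w)\, P(w \mid \alpha_{MP})}{P(D \mid \alpha_{MP})}

P(D_{test} \mid D)
  = \int P(D_{test} \mid w)\, P(w \mid D, \alpha_{MP})\, dw
```

The denominator of the first line is the same evidence that was maximized to select α_MP.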
15
Why do we need model selection?

The answer is simple:
many classifiers (e.g. neural networks or support
vector machines) require some additional
parameters to be set by the user before training
starts.
IDEA: these parameters can be viewed as model
hyperparameters, and the Bayesian framework can be
applied to select their best values

16
What is evidence?
The red model has a larger likelihood, but the green model
has larger evidence. It is more stable, and we may
hope for better generalization
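The contrast between the two models can be reproduced numerically. The sketch below assumes a single weight, a uniform prior on [-5, 5] and Gaussian-shaped likelihood curves; the peak heights and widths of the "red" and "green" models are invented for illustration and are not taken from the slide's figure.

```python
import math


def likelihood(w, peak, centre, width):
    """Gaussian-shaped likelihood curve over a single weight w."""
    return peak * math.exp(-0.5 * ((w - centre) / width) ** 2)


def evidence(peak, centre, width, lo=-5.0, hi=5.0, n=20000):
    """Midpoint-rule integral of likelihood * prior, prior uniform on [lo, hi]."""
    step = (hi - lo) / n
    prior = 1.0 / (hi - lo)
    total = 0.0
    for i in range(n):
        w = lo + (i + 0.5) * step
        total += likelihood(w, peak, centre, width) * prior * step
    return total


# Hypothetical models: "red" has a tall, narrow peak; "green" a lower, wide one.
red = dict(peak=1.0, centre=0.0, width=0.05)
green = dict(peak=0.6, centre=0.0, width=1.0)

print("max likelihood:", red["peak"], green["peak"])
print("evidence:", evidence(**red), evidence(**green))
```

With these numbers the red model wins on maximum likelihood while the green model has roughly ten times more evidence, because its likelihood stays high over a wide range of weights.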
17
Support vector machines

The separating surface is defined as a linear
combination of kernel functions
The weights are determined by solving a QP
optimization problem
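A separating surface built from kernel functions can be sketched as below. The Gaussian kernel, the centres, the weights and the bias are hand-picked assumptions for illustration; in an actual SVM the weights would come from the QP solver, which is not implemented here.

```python
import math


def gaussian_kernel(x, z, width):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * width ** 2))


def decision_value(x, centres, weights, bias, width):
    """Linear combination of kernel functions centred at `centres`."""
    return bias + sum(
        w * gaussian_kernel(x, c, width) for w, c in zip(weights, centres)
    )


def classify(x, centres, weights, bias, width):
    """The separating surface is the set of points where the value is zero."""
    return 1 if decision_value(x, centres, weights, bias, width) >= 0 else -1


# Hypothetical model: one positive and one negative kernel centre.
centres = [(0.0, 0.0), (2.0, 2.0)]
weights = [1.0, -1.0]

print(classify((0.1, 0.1), centres, weights, bias=0.0, width=1.0))
print(classify((1.9, 2.0), centres, weights, bias=0.0, width=1.0))
```

Both the kernel width and the weights shape the surface, which is why the choice of kernel matters as much as the training procedure.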

18
Bottlenecks of SVM

SVM has proved to be one of the best classifiers owing
to the maximal margin principle and the kernel
trick, BUT:
How do we choose the best kernel for a particular
task, and the regularization coefficient C?
Bad kernels may lead to very poor performance due
to overfitting or underfitting

19
Relevance Vector Machines

Probabilistic approach to kernel models. Weights
are interpreted as random variables with a Gaussian
prior distribution
The maximal evidence principle is used to select the best
values of the prior hyperparameters. Most of them tend
to infinity. Hence the corresponding weights are zero, which
makes the classifier quite sparse
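The prior formula is missing from the transcript. In the usual formulation of the relevance vector machine, each weight has its own zero-mean Gaussian prior whose inverse variance is a hyperparameter; the version below assumes that the slide followed this formulation, with α_i as the assumed symbol and m as the assumed number of weights.

```latex
P(w \mid \alpha) = \prod_{i=1}^{m} \mathcal{N}\!\left(w_i \mid 0,\ \alpha_i^{-1}\right)

% As \alpha_i \to \infty the prior variance \alpha_i^{-1} \to 0,
% so the prior forces w_i = 0 and the i-th kernel drops out of the model.
```

This is the mechanism behind the sparseness: every hyperparameter that diverges removes one kernel function from the classifier.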

20
Sparseness of RVM
[Figure panels: SVM (C=10), RVM]
21
Numerical implementation of RVM

We use the Laplace approximation to avoid
integration. Then the likelihood can be written as
[formula], where [formula]
Then the evidence can be computed analytically.
Iterative optimization of the hyperparameters becomes possible

22
Evidence interpretation

Then the evidence is given by [formula],
but [formula]
This is exactly STABILITY with respect to changes
of the weights! The larger the Hessian, the smaller the
evidence
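The link between the Hessian and the evidence can be checked in one dimension. Under the Laplace approximation the integral of a function Q(w) is replaced by Q(w_MP) * sqrt(2*pi / h), where h is the second derivative of -log Q at the maximum (in k dimensions this becomes (2*pi)^(k/2) divided by the square root of the Hessian determinant). The sketch below uses a made-up Q with an extra quartic term so that the approximation is not trivially exact; the function names and the test function are assumptions for this example only.

```python
import math


def laplace_evidence(log_q_max, hessian):
    """1-D Laplace approximation: integral of Q ~ Q(w_MP) * sqrt(2*pi / h)."""
    return math.exp(log_q_max) * math.sqrt(2.0 * math.pi / hessian)


def numeric_integral(log_q, lo=-10.0, hi=10.0, n=20000):
    """Midpoint-rule integral of exp(log_q(w)) for comparison."""
    step = (hi - lo) / n
    return sum(math.exp(log_q(lo + (i + 0.5) * step)) for i in range(n)) * step


def make_log_q(h):
    """log Q(w) = -h/2 * w^2 - 0.1 * w^4: maximum at w = 0, curvature h there."""
    return lambda w: -0.5 * h * w * w - 0.1 * w ** 4


for h in (1.0, 25.0):
    approx = laplace_evidence(0.0, h)  # log Q(0) = 0 for this Q
    exact = numeric_integral(make_log_q(h))
    print(h, round(approx, 4), round(exact, 4))
```

A larger curvature means the integrand falls off faster as the weight moves away from its optimum, so less mass is collected and the evidence is smaller. The approximation is also tighter in that case, because the quartic term matters less near a sharp peak.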

23
Kernel selection

IDEA: use the same technique for kernel
determination, e.g. for finding the best width of
a Gaussian kernel

24
An unexpected problem

It turned out that narrow Gaussians are more stable
with respect to weight changes

25
Solution

We allow the centres of the kernels to be located at
random points (relevant points). The trade-off
between narrow (high accuracy on the training
set) and wide (stable answers) Gaussians can
finally be found.
The resulting classifier turned out to be even sparser
than RVM!

26
Sparseness of GRVM
[Figure panels: RVM, GRVM]
27
Some experimental results
28
Future work