Transcript and Presenter's Notes

Title: Statistical Learning Theory


1
Statistical Learning Theory
2
Statistical Learning Theory
  • A model of supervised learning consists of three components:
  • a) Environment. It supplies an input vector x with a fixed but unknown pdf p(x).
  • b) Teacher. It provides a desired response d for every input x according to a conditional pdf p(d|x). Input and desired response are related by d = f(x, v),

3
Statistical Learning Theory
  • where v is a noise term.
  • c) Learning machine. It is capable of implementing a set of input/output mapping functions y = F(x, w),
  • where y is the actual response produced by the machine and w is a set of free parameters (weights) selected from the parameter (weight) space W.

4
Statistical Learning Theory
  • The supervised learning problem is that of selecting the particular function F(x, w) that approximates d in an optimum fashion. The selection itself is based on a set of iid training samples
  • T = {(x_i, d_i)}, i = 1, ..., N.
  • Each sample is drawn from the environment with a joint pdf p(x, d). (A sketch of such a sample follows below.)
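To make the model concrete, the short Python sketch below draws an iid training set from a hypothetical environment and teacher; the particular pdf p(x), the function f, the noise level, and the sample size are all assumptions made only for illustration.

    import math
    import random

    def draw_training_set(N, seed=0):
        """Draw N iid samples (x_i, d_i) from a hypothetical environment/teacher."""
        rng = random.Random(seed)
        samples = []
        for _ in range(N):
            x = rng.uniform(-1.0, 1.0)        # environment: x with assumed pdf p(x) = Uniform(-1, 1)
            v = rng.gauss(0.0, 0.1)           # noise term v (assumed Gaussian)
            d = math.sin(math.pi * x) + v     # teacher: desired response d = f(x, v), with f assumed
            samples.append((x, d))
        return samples

    T = draw_training_set(N=20)
    print(T[:3])                              # first few (x_i, d_i) pairs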

5
Statistical Learning Theory
  • Supervised learning depends on the following question:
  • Do the training examples contain enough information to construct a learning machine (LM) capable of good generalization?
  • To answer it, we view the problem as an approximation problem: we wish to find the function F(x, w) that is the best possible approximation to the desired mapping f(x).

6
Statistical Learning Theory
  • Let L(d, F(x, w)) denote a measure of the discrepancy (loss) between a desired response d corresponding to an input vector x and the actual response F(x, w) produced by the learning machine.
  • The expected value of the loss is defined by the risk functional
  • R(w) = E[L(d, F(x, w))] = ∫∫ L(d, F(x, w)) p(x, d) dx dd.

7
Statistical Learning Theory
  • The risk functional may be easily understood from
    the finite approximation
  • where denotes the probability of
    drawing the i-th sample.

8
Principle of Empirical Risk Minimization
  • Instead of using R(w) we use the empirical risk functional
  • R_emp(w) = (1/N) Σ_{i=1}^{N} L(d_i, F(x_i, w)).
  • This measure differs from R(w) in two desirable ways:
  • a) It does not depend on the unknown pdf p(x, d) explicitly. (A numerical sketch of both functionals follows below.)
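A minimal numerical sketch of the two functionals, assuming a one-parameter model family F(x, w) = w * x, a squared-error loss, and the same hypothetical environment/teacher as in the earlier sketch. The empirical risk is the average loss over a small training set; the true risk R(w) is approximated here by the average loss over a much larger Monte Carlo sample.

    import math
    import random

    def F(x, w):                     # assumed model family: y = F(x, w) = w * x
        return w * x

    def loss(d, y):                  # assumed loss L(d, F(x, w)): squared error
        return (d - y) ** 2

    def draw(N, rng):                # hypothetical environment/teacher, as before
        out = []
        for _ in range(N):
            x = rng.uniform(-1.0, 1.0)
            out.append((x, math.sin(math.pi * x) + rng.gauss(0.0, 0.1)))
        return out

    def empirical_risk(w, samples):  # R_emp(w) = (1/N) sum_i L(d_i, F(x_i, w))
        return sum(loss(d, F(x, w)) for x, d in samples) / len(samples)

    rng = random.Random(0)
    train = draw(20, rng)            # small training set
    big = draw(100000, rng)          # large sample: Monte Carlo stand-in for R(w)

    w = 2.0
    print("R_emp(w) =", round(empirical_risk(w, train), 4))
    print("R(w) (approx) =", round(empirical_risk(w, big), 4))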

9
Principle of Empirical Risk Minimization
  • b) In theory it can be minimized with respect to the weight vector w.
  • Let w_0 and F(x, w_0) denote the weight vector and the mapping that minimize the risk functional R(w).
  • Also, let w_emp and F(x, w_emp) denote the analogues for the empirical risk functional R_emp(w).
  • Both w_0 and w_emp belong to the weight space W. (A grid-search sketch follows below.)
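The sketch below finds w_emp by minimizing R_emp(w) over a coarse grid of candidate weights; a grid search stands in for whatever training algorithm the learning machine actually uses, and w_0 is approximated by minimizing the risk estimated on a large Monte Carlo sample. Model, loss, grid, and data source are all assumptions carried over from the previous sketch.

    import math
    import random

    def F(x, w): return w * x                          # assumed model family
    def loss(d, y): return (d - y) ** 2                # assumed loss

    def draw(N, rng):                                  # hypothetical environment/teacher
        return [(x, math.sin(math.pi * x) + rng.gauss(0.0, 0.1))
                for x in (rng.uniform(-1.0, 1.0) for _ in range(N))]

    def R_emp(w, S):                                   # empirical risk functional
        return sum(loss(d, F(x, w)) for x, d in S) / len(S)

    rng = random.Random(1)
    train = draw(50, rng)                              # training set for R_emp
    big = draw(20000, rng)                             # Monte Carlo stand-in for R

    grid = [i / 20.0 for i in range(-60, 61)]          # candidate weights in [-3, 3]
    w_emp = min(grid, key=lambda w: R_emp(w, train))   # minimizer of the empirical risk
    w_0 = min(grid, key=lambda w: R_emp(w, big))       # approximate minimizer of the true risk
    print("w_emp =", w_emp, " w_0 (approx) =", w_0)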

10
Principle of Empirical Risk Minimization
  • We must now consider under which conditions F(x, w_emp) is close to F(x, w_0), as measured by the mismatch between R(w_emp) and R(w_0).

11
Principle of Empirical Risk Minimization
  • 1. In place of the risk functional R(w), construct the empirical risk functional
  • R_emp(w) = (1/N) Σ_{i=1}^{N} L(d_i, F(x_i, w))
  • on the basis of the training set of iid samples (x_i, d_i), i = 1, ..., N.

12
Principle of Empirical Risk Minimization
  • 2. R(w_emp) converges in probability to the minimum possible value of R(w), w ∈ W, as N → ∞, provided that R_emp(w) converges uniformly to R(w).
  • 3. Uniform convergence, defined as
  • P( sup_{w ∈ W} |R(w) − R_emp(w)| > ε ) → 0 as N → ∞ for every ε > 0,
  • is necessary and sufficient for consistency of the principle of empirical risk minimization (PERM). (An empirical illustration follows below.)
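The sketch below illustrates (it does not prove) the uniform-convergence statement for a small finite family of threshold dichotomies on [0, 1]: the worst-case gap sup_w |R(w) − R_emp(w)| over the family tends to shrink as the sample size N grows. The family, the data distribution, and the 0-1 loss are illustrative assumptions.

    import random

    # Finite family of dichotomies on [0, 1]: F(x, w) = 1 if x >= w, for a handful of thresholds w.
    thresholds = [i / 10.0 for i in range(11)]

    def F(x, w):
        return 1 if x >= w else 0

    def true_risk(w):
        # Assumed environment: x ~ Uniform(0, 1); noiseless teacher d = F(x, 0.35).
        # Under the 0-1 loss, R(w) = P(F(x, w) != d) = |w - 0.35|.
        return abs(w - 0.35)

    def empirical_risk(w, sample):
        return sum(1 for x, d in sample if F(x, w) != d) / len(sample)

    rng = random.Random(0)
    for N in (10, 100, 1000, 10000):
        sample = [(x, F(x, 0.35)) for x in (rng.random() for _ in range(N))]
        worst_gap = max(abs(true_risk(w) - empirical_risk(w, sample)) for w in thresholds)
        print(N, round(worst_gap, 4))       # the gap shrinks (in probability) as N grows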

13
The Vapnik Chervonenkis Dimension
  • The theory of uniform convergence of R_emp(w) to R(w) includes rates of convergence based on a parameter called the VC (Vapnik-Chervonenkis) dimension.
  • It is a measure of the capacity or expressive power of the family of classification functions realized by the learning machine.

14
The Vapnik Chervonenkis Dimension
  • To describe the concept of VC dimension, let us consider a binary pattern classification problem for which the desired response is d ∈ {0, 1}.
  • A dichotomy is a binary-valued classification function. Let F = {F(x, w) : w ∈ W} denote the set of dichotomies implemented by the learning machine.

15
The Vapnik Chervonenkis Dimension
  • Let X = {x_1, ..., x_N} denote a set of N points in the m-dimensional space of input vectors x.
  • A dichotomy F(x, w) partitions X into two disjoint subsets X_0 and X_1 such that
  • F(x, w) = 0 for x ∈ X_0 and F(x, w) = 1 for x ∈ X_1.

16
The Vapnik Chervonenkis Dimension
  • Let Δ_F(X) denote the number of distinct dichotomies of X implemented by the learning machine.
  • Let Δ_F(N) denote the maximum of Δ_F(X) over all X with |X| = N; this is called the growth function.
  • X is said to be shattered by F if Δ_F(X) = 2^N, that is, if all the possible dichotomies of X can be induced by functions in F. (A counting sketch follows below.)
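A brute-force sketch of these definitions for the family of one-sided threshold dichotomies on the real line (an assumed family chosen because it is easy to enumerate): it counts the distinct dichotomies Δ_F(X) induced on a small set X and checks whether Δ_F(X) = 2^N, i.e., whether X is shattered.

    # Family F: threshold dichotomies on the real line, F(x, w) = 1 if x >= w, else 0.
    def dichotomies(points, thresholds):
        """Set of distinct dichotomies of `points` induced by the given thresholds."""
        return {tuple(1 if x >= w else 0 for x in points) for w in thresholds}

    X = [0.1, 0.4, 0.7]                                   # a set of N = 3 points
    candidate_w = [i / 100.0 for i in range(-10, 111)]    # a dense grid of thresholds

    D = dichotomies(X, candidate_w)
    print("Delta_F(X) =", len(D))                         # only N + 1 = 4 of the 2^N dichotomies occur
    print("shattered:", len(D) == 2 ** len(X))            # False: this family cannot shatter 3 points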

17
The Vapnik Chervonenkis Dimension
  • In the figure (not reproduced here) we illustrate a two-dimensional space consisting of four points (x_1, ..., x_4). The decision boundaries of F_0 and F_1 correspond to the classes 0 and 1 being true. F_0 induces one dichotomy of these four points,

18
The Vapnik Chervonenkis Dimension
  • while F_1 induces another. With the set X consisting of four points, the cardinality |X| = 4, so there are 2^4 = 16 possible dichotomies of X.
  • Hence F_0 and F_1 by themselves realize only 2 of these 16 dichotomies and do not shatter X.

19
The Vapnik Chervonenkis Dimension
  • We now formally define the VC dimension as follows:
  • The VC dimension of an ensemble of dichotomies F is the cardinality of the largest set X that is shattered by F.

20
The Vapnik Chervonenkis Dimension
  • In more familiar terms, the VC dimension of the set of classification functions F is the maximum number of training examples that can be learned by the machine without error, for all possible binary labelings of those examples. (A brute-force estimate follows below.)
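A brute-force sketch in the same spirit, for the family of indicator functions of closed intervals [a, b] on the real line (again an assumed example family): for each N it checks whether a set of N points can be shattered. The largest N for which this succeeds is the VC dimension, which for intervals is 2.

    def interval_dichotomies(points):
        """Distinct dichotomies of `points` induced by indicators of closed intervals [a, b]."""
        pts = sorted(points)
        # Endpoints at the points themselves, between them, and outside them suffice.
        cand = pts + [pts[0] - 1.0, pts[-1] + 1.0]
        cand += [(pts[i] + pts[i + 1]) / 2.0 for i in range(len(pts) - 1)]
        dichos = {tuple(0 for _ in points)}               # empty interval
        for a in cand:
            for b in cand:
                if a <= b:
                    dichos.add(tuple(1 if a <= x <= b else 0 for x in points))
        return dichos

    def shattered(points):
        return len(interval_dichotomies(points)) == 2 ** len(points)

    for N in (1, 2, 3):
        pts = [float(i) for i in range(N)]
        print(N, "points shattered:", shattered(pts))
    # Expected output: 1 and 2 points can be shattered, 3 cannot (the labeling 1,0,1
    # needs a disconnected set), so the VC dimension of intervals is 2.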

21
Importance of the VC Dimension
  • Roughly speaking, the number of examples needed
    to learn a class of interest reliably is
    proportional to the VC dimension.
  • In some cases the VC dimension is determined by the number of free parameters of a neural network.
  • In this regard, the following two results are of
    interest.

22
Importance of the VC Dimension
  • 1. For an arbitrary feedforward network built up from neurons with a threshold (Heaviside) activation function
  • φ(v) = 1 if v ≥ 0, and φ(v) = 0 if v < 0,
  • the VC dimension is O(W log W), where W is the total number of free parameters in the network.

23
Importance of the VC Dimension
  • 2. For a multilayer feedforward network whose neurons use a sigmoid activation function
  • φ(v) = 1 / (1 + exp(−v)),
  • the VC dimension is O(W^2), where W is the total number of free parameters in the network. (A parameter-counting sketch follows below.)
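As a small arithmetic illustration of the quantity W appearing in both bounds, the sketch below counts the free parameters (weights and biases) of a fully connected feedforward network with assumed layer sizes, and prints the corresponding orders of magnitude W log W and W^2. The architecture is arbitrary; only the counting is the point.

    import math

    def count_free_parameters(layer_sizes):
        """Weights plus biases of a fully connected feedforward network."""
        W = 0
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            W += n_in * n_out + n_out          # weight matrix plus one bias per neuron
        return W

    layers = [10, 20, 20, 1]                   # assumed: 10 inputs, two hidden layers, 1 output
    W = count_free_parameters(layers)
    print("W =", W)                            # here W = 661
    print("W log W ~", round(W * math.log(W))) # order of the VC bound for threshold units
    print("W^2     =", W ** 2)                 # order of the VC bound for sigmoid units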

24
Importance of the VC Dimension
  • In the case of binary pattern classification the loss function has only two possible values:
  • L(d, F(x, w)) = 0 if F(x, w) = d, and L(d, F(x, w)) = 1 otherwise.
  • The risk functional R(w) and the empirical risk functional R_emp(w) then assume the following interpretations:

25
Importance of the VC Dimension
  • R(w) is the probability of classification error, denoted by P(w).
  • R_emp(w) is the training error (the frequency of errors on the training set), denoted by ν(w).
  • Then (Haykin, p. 98) the uniform convergence condition takes the form
  • P( sup_{w ∈ W} |P(w) − ν(w)| > ε ) → 0 as N → ∞.

26
Importance of the VC Dimension
  • The notion of VC dimension provides a bound on the rate of uniform convergence. For a set of classification functions with VC dimension h, an inequality of the following (Vapnik-Chervonenkis) form holds:
  • P( sup_{w ∈ W} |P(w) − ν(w)| > ε ) ≤ 4 (2eN/h)^h exp(−ε^2 N / 8),   (vc.1)
  • where N is the size of the training sample. In other words, a finite VC dimension is a necessary and sufficient condition for uniform convergence of the principle of empirical risk minimization.

27
Importance of the VC Dimension
  • The factor (2eN/h)^h in (vc.1) represents a bound on the growth function Δ_F(2N) for the family of functions F, valid for 2N ≥ h (Sauer's lemma). Provided that this factor does not grow too fast, the right-hand side of (vc.1) goes to zero as N goes to infinity.
  • This requirement is satisfied if the VC dimension h is finite. (A numerical comparison follows below.)
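The sketch below compares, for a fixed VC dimension h, the exact Sauer-type bound on the growth function, sum_{i=0}^{h} C(N, i), with the closed-form bound (eN/h)^h (the factor in (vc.1) is this expression evaluated at 2N) and with the unrestricted count 2^N. The polynomial growth of the first two against the exponential growth of the last is what drives the right-hand side of (vc.1) to zero.

    import math

    def sauer_bound(N, h):
        """Sauer's lemma: the growth function is at most sum_{i=0}^{h} C(N, i)."""
        return sum(math.comb(N, i) for i in range(min(h, N) + 1))

    def closed_form_bound(N, h):
        """Convenient upper bound (e * N / h) ** h, valid for N >= h."""
        return (math.e * N / h) ** h

    h = 10
    print("N, Sauer bound, (eN/h)^h, 2^N")
    for N in (10, 50, 100, 200):
        print(N, sauer_bound(N, h), round(closed_form_bound(N, h)), 2 ** N)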

28
Importance of the VC Dimension
  • Thus, a finite VC dimension is a necessary and sufficient condition for uniform convergence of the principle of empirical risk minimization.
  • Let α denote the probability of occurrence of the event
  • sup_{w ∈ W} |P(w) − ν(w)| > ε.
  • Using the previous bound (vc.1) we find
  • α ≤ 4 (2eN/h)^h exp(−ε^2 N / 8).   (vc.2)

29
Importance of the VC Dimension
  • Let ε_1(N, h, α) denote the special value of ε that satisfies (vc.2) with equality. Then we obtain (Haykin, p. 99), with probability at least 1 − α,
  • P(w) ≤ ν(w) + ε_1(N, h, α) for all w ∈ W.
  • We refer to ε_1(N, h, α) as the confidence interval.

30
Importance of the VC Dimension
  • We may also write the bound as
  • P(w) ≤ ν(w) + ε_1(N, h, α),
  • where, solving (vc.2) for ε,
  • ε_1(N, h, α) = sqrt( (8/N) [ h ln(2eN/h) + ln(4/α) ] ).

31
Importance of the VC Dimension
  • Conclusions
  • 1. With probability at least 1 − α, the generalization error P(w) is bounded by the training error ν(w) plus a confidence term that grows with the VC dimension h and shrinks with the sample size N.
  • 2. For a small training error (close to zero), the confidence term shrinks quickly with N, roughly in proportion to h ln N / N.
  • 3. For a large training error (close to unity), the confidence term shrinks more slowly, roughly in proportion to sqrt(h ln N / N).

32
Structural Risk Minimization
  • The training error is the frequency of errors made by a machine with weight vector w during the training session.
  • The generalization error is the frequency of errors made by the machine when it is tested with examples not seen before.
  • Let these two errors be denoted by ν_train(w) and ν_gen(w), respectively.

33
Structural Risk Minimization
  • Let h be the VC dimension of a family of classification functions {F(x, w) : w ∈ W} with respect to the input space X.
  • Then, with probability at least 1 − α, the generalization error ν_gen(w) is lower than the guaranteed risk, defined by the sum of two competing terms
  • R_guarant(w) = ν_train(w) + ε_1(N, h, α),
  • where the confidence interval ε_1(N, h, α) is defined as before.

34
Structural Risk Minimization
  • For a fixed number of training samples N, the training error decreases monotonically as the capacity (VC dimension h) is increased, whereas the confidence interval increases monotonically. Their sum, the guaranteed risk, therefore attains its minimum at an intermediate capacity. (An illustrative sketch follows below.)
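A purely illustrative sketch of this trade-off. The training-error curve below is a stylized decreasing function of h, and the confidence interval uses a generic VC-type expression sqrt(h (ln(2N/h) + 1) / N); neither is measured on a real learning machine nor claimed to be Haykin's exact formula. The point is only that their sum, the guaranteed risk, is minimized at an intermediate capacity.

    import math

    N = 1000                                     # assumed size of the training sample

    def training_error(h):
        # Stylized stand-in: training error decreases monotonically with capacity h.
        return 0.5 / (1.0 + 0.2 * h)

    def confidence_interval(h):
        # Generic VC-type confidence term (an assumption, not the exact expression above).
        return math.sqrt(h * (math.log(2.0 * N / h) + 1.0) / N)

    best_h, best_risk = None, float("inf")
    for h in range(1, 101):
        guaranteed = training_error(h) + confidence_interval(h)   # guaranteed risk
        if guaranteed < best_risk:
            best_h, best_risk = h, guaranteed
    print("capacity h with the smallest guaranteed risk:", best_h)
    print("guaranteed risk at that h:", round(best_risk, 3))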

37
Structural Risk Minimization
  • The challenge in solving a supervised learning
    problem lies in realizing the best generalization
    performance by matching the machine capacity to
    the available amount of training data for the
    problem at hand. The method of structural risk
    minimization provides an inductive procedure to
    achieve this goal by making the VC dimension of
    the learning machine a control variable.

38
Structural Risk Minimization
  • Consider an ensemble of pattern classifiers {F(x, w) : w ∈ W} and define a nested structure of n such machines
  • F_k = {F(x, w) : w ∈ W_k}, k = 1, 2, ..., n,
  • such that we have F_1 ⊂ F_2 ⊂ ... ⊂ F_n.
  • Correspondingly, the VC dimensions of the individual pattern classifiers satisfy h_1 ≤ h_2 ≤ ... ≤ h_n,
  • which implies that the VC dimension of each classifier is finite (see the next figure).

39
  • Figure (not reproduced): illustration of the relationship between training error, confidence interval, and guaranteed risk as functions of the VC dimension h.

40
Structural Risk Minimization
  • Then the method of structural risk minimization proceeds as follows:
  • a) The empirical risk (training error) of each classifier F_k is minimized.
  • b) The pattern classifier F_k* with the smallest guaranteed risk is identified; this particular machine provides the best compromise between the training error (quality of approximation) and the confidence interval (complexity of the approximating function).

41
Structural Risk Minimization
  • Our goal is to find a network structure such that decreasing the VC dimension occurs at the expense of the smallest possible increase in training error.
  • We achieve this, for example, by varying h through the number of hidden neurons.
  • That is, we evaluate an ensemble of fully connected multilayer feedforward networks in which the number of neurons in one of the hidden layers is increased in a monotonic fashion. (A sketch of the resulting selection loop follows below.)
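A hedged sketch of this selection loop. Training real multilayer networks is beyond a few lines, so the nested structure below uses dyadic histogram classifiers on [0, 1] with k = 1, 2, 4, ... bins as stand-ins for networks with a growing hidden layer, with the VC dimension crudely taken as h = k; the data source and the generic VC-type confidence term are also assumptions. What the sketch keeps is the SRM procedure itself: minimize the training error within each member of the nested structure, add the confidence term, and keep the machine with the smallest guaranteed risk.

    import math
    import random

    def draw(N, rng):
        """Hypothetical source: x ~ Uniform(0, 1), label 1[x >= 0.5] flipped with prob. 0.1."""
        data = []
        for _ in range(N):
            x = rng.random()
            d = 1 if x >= 0.5 else 0
            if rng.random() < 0.1:
                d = 1 - d
            data.append((x, d))
        return data

    def fit_histogram(train, k):
        """Member F_k of the nested structure: majority training label in each of k equal bins."""
        votes = [[0, 0] for _ in range(k)]
        for x, d in train:
            votes[min(int(x * k), k - 1)][d] += 1
        return [0 if v[0] >= v[1] else 1 for v in votes]

    def training_error(model, train, k):
        return sum(1 for x, d in train if model[min(int(x * k), k - 1)] != d) / len(train)

    def confidence_interval(h, N):
        # Generic VC-type term standing in for the confidence interval defined earlier.
        return math.sqrt(h * (math.log(2.0 * N / h) + 1.0) / N)

    rng = random.Random(0)
    train = draw(200, rng)
    N = len(train)

    best = None
    for k in (1, 2, 4, 8, 16, 32, 64):                  # capacities of the nested structure
        model = fit_histogram(train, k)                 # a) minimize the training error within F_k
        guaranteed = training_error(model, train, k) + confidence_interval(k, N)
        print("k =", k, " guaranteed risk =", round(guaranteed, 3))
        if best is None or guaranteed < best[1]:
            best = (k, guaranteed)                      # b) keep the smallest guaranteed risk
    print("SRM selects k =", best[0])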

42
Structural Risk Minimization
  • The principle of SRM states that the best network
    in this ensemble is the one for which the
    guaranteed risk is the minimum.