Transcript and Presenter's Notes

Title: Statistical Learning Theory


1
Statistical Learning Theory
2
Statistical Learning Theory
  • A model of supervised learning consists of three components:
  • a) Environment. It supplies an input vector x with a fixed but unknown pdf p(x).
  • b) Teacher. It provides a desired response d for every input x according to a conditional pdf p(d|x). Input and desired response are related by d = f(x, v),

3
Statistical Learning Theory
  • where v is a noise term.
  • c) Learning machine. It is capable of implementing a set of input/output mapping functions y = F(x, w),
  • where y is the actual response produced by the machine and w is a set of free parameters (weights) selected from the parameter (weight) space W.

4
Statistical Learning Theory
  • The supervised learning problem is that of selecting the particular function F(x, w) that approximates d in an optimum fashion. The selection itself is based on a set of iid training samples
  • T = {(x_i, d_i)}, i = 1, ..., N.
  • Each sample is drawn from the environment with a joint pdf p(x, d). (A sketch of such a sample follows below.)
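To make the model concrete, the short Python sketch below draws an iid training set from a hypothetical environment and teacher; the particular pdf p(x), the function f, the noise level, and the sample size are all assumptions made only for illustration.

    import math
    import random

    def draw_training_set(N, seed=0):
        """Draw N iid samples (x_i, d_i) from a hypothetical environment/teacher."""
        rng = random.Random(seed)
        samples = []
        for _ in range(N):
            x = rng.uniform(-1.0, 1.0)        # environment: x with assumed pdf p(x) = Uniform(-1, 1)
            v = rng.gauss(0.0, 0.1)           # noise term v (assumed Gaussian)
            d = math.sin(math.pi * x) + v     # teacher: desired response d = f(x, v), with f assumed
            samples.append((x, d))
        return samples

    T = draw_training_set(N=20)
    print(T[:3])                              # first few (x_i, d_i) pairs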

5
Statistical Learning Theory
  • Supervised learning depends on the following question:
  • Do the training examples contain enough information to construct a learning machine (LM) capable of good generalization?
  • To answer it, we view the problem as an approximation problem: we wish to find the function F(x, w) that is the best possible approximation to the desired mapping f(x).

6
Statistical Learning Theory
  • Let L(d, F(x, w)) denote a measure of the discrepancy (loss) between a desired response d corresponding to an input vector x and the actual response F(x, w) produced by the learning machine.
  • The expected value of the loss is defined by the risk functional
  • R(w) = E[L(d, F(x, w))] = ∫∫ L(d, F(x, w)) p(x, d) dx dd.

7
Statistical Learning Theory
  • The risk functional may be easily understood from
    the finite approximation
  • where denotes the probability of
    drawing the i-th sample.

8
Principle of Empirical Risk Minimization
  • Instead of using R(w) we use the empirical risk functional
  • R_emp(w) = (1/N) Σ_{i=1}^{N} L(d_i, F(x_i, w)).
  • This measure differs from R(w) in two desirable ways:
  • a) It does not depend on the unknown pdf p(x, d) explicitly. (A numerical sketch of both functionals follows below.)
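A minimal numerical sketch of the two functionals, assuming a one-parameter model family F(x, w) = w * x, a squared-error loss, and the same hypothetical environment/teacher as in the earlier sketch. The empirical risk is the average loss over a small training set; the true risk R(w) is approximated here by the average loss over a much larger Monte Carlo sample.

    import math
    import random

    def F(x, w):                     # assumed model family: y = F(x, w) = w * x
        return w * x

    def loss(d, y):                  # assumed loss L(d, F(x, w)): squared error
        return (d - y) ** 2

    def draw(N, rng):                # hypothetical environment/teacher, as before
        out = []
        for _ in range(N):
            x = rng.uniform(-1.0, 1.0)
            out.append((x, math.sin(math.pi * x) + rng.gauss(0.0, 0.1)))
        return out

    def empirical_risk(w, samples):  # R_emp(w) = (1/N) sum_i L(d_i, F(x_i, w))
        return sum(loss(d, F(x, w)) for x, d in samples) / len(samples)

    rng = random.Random(0)
    train = draw(20, rng)            # small training set
    big = draw(100000, rng)          # large sample: Monte Carlo stand-in for R(w)

    w = 2.0
    print("R_emp(w) =", round(empirical_risk(w, train), 4))
    print("R(w) (approx) =", round(empirical_risk(w, big), 4))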

9
Principle of Empirical Risk Minimization
  • b) In theory it can be minimized with respect to the weight vector w.
  • Let w_0 and F(x, w_0) denote the weight vector and the mapping that minimize the risk functional R(w).
  • Also, let w_emp and F(x, w_emp) denote the analogues for the empirical risk functional R_emp(w).
  • Both w_0 and w_emp belong to the weight space W. (A grid-search sketch follows below.)
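The sketch below finds w_emp by minimizing R_emp(w) over a coarse grid of candidate weights; a grid search stands in for whatever training algorithm the learning machine actually uses, and w_0 is approximated by minimizing the risk estimated on a large Monte Carlo sample. Model, loss, grid, and data source are all assumptions carried over from the previous sketch.

    import math
    import random

    def F(x, w): return w * x                          # assumed model family
    def loss(d, y): return (d - y) ** 2                # assumed loss

    def draw(N, rng):                                  # hypothetical environment/teacher
        return [(x, math.sin(math.pi * x) + rng.gauss(0.0, 0.1))
                for x in (rng.uniform(-1.0, 1.0) for _ in range(N))]

    def R_emp(w, S):                                   # empirical risk functional
        return sum(loss(d, F(x, w)) for x, d in S) / len(S)

    rng = random.Random(1)
    train = draw(50, rng)                              # training set for R_emp
    big = draw(20000, rng)                             # Monte Carlo stand-in for R

    grid = [i / 20.0 for i in range(-60, 61)]          # candidate weights in [-3, 3]
    w_emp = min(grid, key=lambda w: R_emp(w, train))   # minimizer of the empirical risk
    w_0 = min(grid, key=lambda w: R_emp(w, big))       # approximate minimizer of the true risk
    print("w_emp =", w_emp, " w_0 (approx) =", w_0)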

10
Principle of Empirical Risk Minimization
  • We must now consider under which conditions F(x, w_emp) is close to F(x, w_0), as measured by the mismatch between R(w_emp) and R(w_0).

11
Principle of Empirical Risk Minimization
  • 1. In place of the risk functional R(w), construct the empirical risk functional
  • R_emp(w) = (1/N) Σ_{i=1}^{N} L(d_i, F(x_i, w))
  • on the basis of the training set of iid samples (x_i, d_i), i = 1, ..., N.

12
Principle of Empirical Risk Minimization
  • 2. R(w_emp) converges in probability to the minimum possible value of R(w), w ∈ W, as N → ∞, provided that R_emp(w) converges uniformly to R(w).
  • 3. Uniform convergence, defined as
  • P( sup_{w ∈ W} |R(w) − R_emp(w)| > ε ) → 0 as N → ∞ for every ε > 0,
  • is necessary and sufficient for consistency of the principle of empirical risk minimization (PERM). (An empirical illustration follows below.)
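The sketch below illustrates (it does not prove) the uniform-convergence statement for a small finite family of threshold dichotomies on [0, 1]: the worst-case gap sup_w |R(w) − R_emp(w)| over the family tends to shrink as the sample size N grows. The family, the data distribution, and the 0-1 loss are illustrative assumptions.

    import random

    # Finite family of dichotomies on [0, 1]: F(x, w) = 1 if x >= w, for a handful of thresholds w.
    thresholds = [i / 10.0 for i in range(11)]

    def F(x, w):
        return 1 if x >= w else 0

    def true_risk(w):
        # Assumed environment: x ~ Uniform(0, 1); noiseless teacher d = F(x, 0.35).
        # Under the 0-1 loss, R(w) = P(F(x, w) != d) = |w - 0.35|.
        return abs(w - 0.35)

    def empirical_risk(w, sample):
        return sum(1 for x, d in sample if F(x, w) != d) / len(sample)

    rng = random.Random(0)
    for N in (10, 100, 1000, 10000):
        sample = [(x, F(x, 0.35)) for x in (rng.random() for _ in range(N))]
        worst_gap = max(abs(true_risk(w) - empirical_risk(w, sample)) for w in thresholds)
        print(N, round(worst_gap, 4))       # the gap shrinks (in probability) as N grows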

13
The Vapnik Chervonenkis Dimension
  • The theory of uniform convergence of R_emp(w) to R(w) includes rates of convergence based on a parameter called the VC (Vapnik-Chervonenkis) dimension.
  • It is a measure of the capacity or expressive power of the family of classification functions realized by the learning machine.

14
The Vapnik Chervonenkis Dimension
  • To describe the concept of VC dimension, let us consider a binary pattern classification problem for which the desired response is d ∈ {0, 1}.
  • A dichotomy is a binary-valued classification function. Let F = {F(x, w) : w ∈ W} denote the set of dichotomies implemented by the learning machine.

15
The Vapnik Chervonenkis Dimension
  • Let X = {x_1, ..., x_N} denote a set of N points in the m-dimensional space of input vectors x.
  • A dichotomy F(x, w) partitions X into two disjoint subsets X_0 and X_1 such that
  • F(x, w) = 0 for x ∈ X_0 and F(x, w) = 1 for x ∈ X_1.

16
The Vapnik Chervonenkis Dimension
  • Let Δ_F(X) denote the number of distinct dichotomies of X implemented by the learning machine.
  • Let Δ_F(N) denote the maximum of Δ_F(X) over all X with |X| = N; this is called the growth function.
  • X is said to be shattered by F if Δ_F(X) = 2^N, that is, if all the possible dichotomies of X can be induced by functions in F. (A counting sketch follows below.)
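A brute-force sketch of these definitions for the family of one-sided threshold dichotomies on the real line (an assumed family chosen because it is easy to enumerate): it counts the distinct dichotomies Δ_F(X) induced on a small set X and checks whether Δ_F(X) = 2^N, i.e., whether X is shattered.

    # Family F: threshold dichotomies on the real line, F(x, w) = 1 if x >= w, else 0.
    def dichotomies(points, thresholds):
        """Set of distinct dichotomies of `points` induced by the given thresholds."""
        return {tuple(1 if x >= w else 0 for x in points) for w in thresholds}

    X = [0.1, 0.4, 0.7]                                   # a set of N = 3 points
    candidate_w = [i / 100.0 for i in range(-10, 111)]    # a dense grid of thresholds

    D = dichotomies(X, candidate_w)
    print("Delta_F(X) =", len(D))                         # only N + 1 = 4 of the 2^N dichotomies occur
    print("shattered:", len(D) == 2 ** len(X))            # False: this family cannot shatter 3 points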

17
The Vapnik Chervonenkis Dimension
  • In the figure (not reproduced here) we illustrate a two-dimensional space consisting of four points (x_1, ..., x_4). The decision boundaries of F_0 and F_1 correspond to the classes 0 and 1 being true. F_0 induces one dichotomy of these four points,

18
The Vapnik Chervonenkis Dimension
  • while F_1 induces another. With the set X consisting of four points, the cardinality |X| = 4, so there are 2^4 = 16 possible dichotomies of X.
  • Hence F_0 and F_1 by themselves realize only 2 of these 16 dichotomies and do not shatter X.

19
The Vapnik Chervonenkis Dimension
  • We now formally define the VC dimension as follows:
  • The VC dimension of an ensemble of dichotomies F is the cardinality of the largest set X that is shattered by F.

20
The Vapnik Chervonenkis Dimension
  • In more familiar terms, the VC dimension of the set of classification functions F is the maximum number of training examples that can be learned by the machine without error, for all possible binary labelings of those examples. (A brute-force estimate follows below.)
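A brute-force sketch in the same spirit, for the family of indicator functions of closed intervals [a, b] on the real line (again an assumed example family): for each N it checks whether a set of N points can be shattered. The largest N for which this succeeds is the VC dimension, which for intervals is 2.

    def interval_dichotomies(points):
        """Distinct dichotomies of `points` induced by indicators of closed intervals [a, b]."""
        pts = sorted(points)
        # Endpoints at the points themselves, between them, and outside them suffice.
        cand = pts + [pts[0] - 1.0, pts[-1] + 1.0]
        cand += [(pts[i] + pts[i + 1]) / 2.0 for i in range(len(pts) - 1)]
        dichos = {tuple(0 for _ in points)}               # empty interval
        for a in cand:
            for b in cand:
                if a <= b:
                    dichos.add(tuple(1 if a <= x <= b else 0 for x in points))
        return dichos

    def shattered(points):
        return len(interval_dichotomies(points)) == 2 ** len(points)

    for N in (1, 2, 3):
        pts = [float(i) for i in range(N)]
        print(N, "points shattered:", shattered(pts))
    # Expected output: 1 and 2 points can be shattered, 3 cannot (the labeling 1,0,1
    # needs a disconnected set), so the VC dimension of intervals is 2.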

21
Importance of the VC Dimension
  • Roughly speaking, the number of examples needed
    to learn a class of interest reliably is
    proportional to the VC dimension.
  • In some cases the VC dimension is determined by the number of free parameters of a neural network.
  • In this regard, the following two results are of
    interest.

22
Importance of the VC Dimension
  • 1. For an arbitrary feedforward network built up from neurons with a threshold (Heaviside) activation function
  • φ(v) = 1 if v ≥ 0, and φ(v) = 0 if v < 0,
  • the VC dimension is O(W log W), where W is the total number of free parameters in the network.

23
Importance of the VC Dimension
  • 2. For a multilayer feedforward network whose neurons use a sigmoid activation function
  • φ(v) = 1 / (1 + exp(−v)),
  • the VC dimension is O(W^2), where W is the total number of free parameters in the network. (A parameter-counting sketch follows below.)
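As a small arithmetic illustration of the quantity W appearing in both bounds, the sketch below counts the free parameters (weights and biases) of a fully connected feedforward network with assumed layer sizes, and prints the corresponding orders of magnitude W log W and W^2. The architecture is arbitrary; only the counting is the point.

    import math

    def count_free_parameters(layer_sizes):
        """Weights plus biases of a fully connected feedforward network."""
        W = 0
        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
            W += n_in * n_out + n_out          # weight matrix plus one bias per neuron
        return W

    layers = [10, 20, 20, 1]                   # assumed: 10 inputs, two hidden layers, 1 output
    W = count_free_parameters(layers)
    print("W =", W)                            # here W = 661
    print("W log W ~", round(W * math.log(W))) # order of the VC bound for threshold units
    print("W^2     =", W ** 2)                 # order of the VC bound for sigmoid units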

24
Importance of the VC Dimension
  • In the case of binary pattern classification the loss function has only two possible values:
  • L(d, F(x, w)) = 0 if F(x, w) = d, and L(d, F(x, w)) = 1 otherwise.
  • The risk functional R(w) and the empirical risk functional R_emp(w) then assume the following interpretations:

25
Importance of the VC Dimension
  • R(w) is the probability of classification error, denoted by P(w).
  • R_emp(w) is the training error (the frequency of errors on the training set), denoted by ν(w).
  • Then (Haykin, p. 98) the uniform convergence condition takes the form
  • P( sup_{w ∈ W} |P(w) − ν(w)| > ε ) → 0 as N → ∞.

26
Importance of the VC Dimension
  • The notion of VC dimension provides a bound on the rate of uniform convergence. For a set of classification functions with VC dimension h, an inequality of the following (Vapnik-Chervonenkis) form holds:
  • P( sup_{w ∈ W} |P(w) − ν(w)| > ε ) ≤ 4 (2eN/h)^h exp(−ε^2 N / 8),   (vc.1)
  • where N is the size of the training sample. In other words, a finite VC dimension is a necessary and sufficient condition for uniform convergence of the principle of empirical risk minimization.

27
Importance of the VC Dimension
  • The factor (2eN/h)^h in (vc.1) represents a bound on the growth function Δ_F(2N) for the family of functions F, valid for 2N ≥ h (Sauer's lemma). Provided that this factor does not grow too fast, the right-hand side of (vc.1) goes to zero as N goes to infinity.
  • This requirement is satisfied if the VC dimension h is finite. (A numerical comparison follows below.)
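The sketch below compares, for a fixed VC dimension h, the exact Sauer-type bound on the growth function, sum_{i=0}^{h} C(N, i), with the closed-form bound (eN/h)^h (the factor in (vc.1) is this expression evaluated at 2N) and with the unrestricted count 2^N. The polynomial growth of the first two against the exponential growth of the last is what drives the right-hand side of (vc.1) to zero.

    import math

    def sauer_bound(N, h):
        """Sauer's lemma: the growth function is at most sum_{i=0}^{h} C(N, i)."""
        return sum(math.comb(N, i) for i in range(min(h, N) + 1))

    def closed_form_bound(N, h):
        """Convenient upper bound (e * N / h) ** h, valid for N >= h."""
        return (math.e * N / h) ** h

    h = 10
    print("N, Sauer bound, (eN/h)^h, 2^N")
    for N in (10, 50, 100, 200):
        print(N, sauer_bound(N, h), round(closed_form_bound(N, h)), 2 ** N)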

28
Importance of the VC Dimension
  • Thus, a finite VC dimension is a necessary and sufficient condition for uniform convergence of the principle of empirical risk minimization.
  • Let α denote the probability of occurrence of the event
  • sup_{w ∈ W} |P(w) − ν(w)| > ε.
  • Using the previous bound (vc.1) we find
  • α ≤ 4 (2eN/h)^h exp(−ε^2 N / 8).   (vc.2)

29
Importance of the VC Dimension
  • Let ε_1(N, h, α) denote the special value of ε that satisfies (vc.2) with equality. Then we obtain (Haykin, p. 99), with probability at least 1 − α,
  • P(w) ≤ ν(w) + ε_1(N, h, α) for all w ∈ W.
  • We refer to ε_1(N, h, α) as the confidence interval.

30
Importance of the VC Dimension
  • We may also write the bound as
  • P(w) ≤ ν(w) + ε_1(N, h, α),
  • where, solving (vc.2) for ε,
  • ε_1(N, h, α) = sqrt( (8/N) [ h ln(2eN/h) + ln(4/α) ] ).

31
Importance of the VC Dimension
  • Conclusions
  • 1. With probability at least 1 − α, the generalization error P(w) is bounded by the training error ν(w) plus a confidence term that grows with the VC dimension h and shrinks with the sample size N.
  • 2. For a small training error (close to zero), the confidence term shrinks quickly with N, roughly in proportion to h ln N / N.
  • 3. For a large training error (close to unity), the confidence term shrinks more slowly, roughly in proportion to sqrt(h ln N / N).

32
Structural Risk Minimization
  • The training error is the frequency of errors made by a machine with weight vector w during the training session.
  • The generalization error is the frequency of errors made by the machine when it is tested with examples not seen before.
  • Let these two errors be denoted by ν_train(w) and ν_gen(w), respectively.

33
Structural Risk Minimization
  • Let h be the VC dimension of a family of classification functions {F(x, w) : w ∈ W} with respect to the input space X.
  • Then, with probability at least 1 − α, the generalization error ν_gen(w) is lower than the guaranteed risk, defined by the sum of two competing terms
  • R_guarant(w) = ν_train(w) + ε_1(N, h, α),
  • where the confidence interval ε_1(N, h, α) is defined as before.

34
Structural Risk Minimization
  • For a fixed number of training samples N, the training error decreases monotonically as the capacity (VC dimension h) is increased, whereas the confidence interval increases monotonically. Their sum, the guaranteed risk, therefore attains its minimum at an intermediate capacity. (An illustrative sketch follows below.)
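A purely illustrative sketch of this trade-off. The training-error curve below is a stylized decreasing function of h, and the confidence interval uses a generic VC-type expression sqrt(h (ln(2N/h) + 1) / N); neither is measured on a real learning machine nor claimed to be Haykin's exact formula. The point is only that their sum, the guaranteed risk, is minimized at an intermediate capacity.

    import math

    N = 1000                                     # assumed size of the training sample

    def training_error(h):
        # Stylized stand-in: training error decreases monotonically with capacity h.
        return 0.5 / (1.0 + 0.2 * h)

    def confidence_interval(h):
        # Generic VC-type confidence term (an assumption, not the exact expression above).
        return math.sqrt(h * (math.log(2.0 * N / h) + 1.0) / N)

    best_h, best_risk = None, float("inf")
    for h in range(1, 101):
        guaranteed = training_error(h) + confidence_interval(h)   # guaranteed risk
        if guaranteed < best_risk:
            best_h, best_risk = h, guaranteed
    print("capacity h with the smallest guaranteed risk:", best_h)
    print("guaranteed risk at that h:", round(best_risk, 3))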

37
Structural Risk Minimization
  • The challenge in solving a supervised learning
    problem lies in realizing the best generalization
    performance by matching the machine capacity to
    the available amount of training data for the
    problem at hand. The method of structural risk
    minimization provides an inductive procedure to
    achieve this goal by making the VC dimension of
    the learning machine a control variable.

38
Structural Risk Minimization
  • Consider an ensemble of pattern classifiers {F(x, w) : w ∈ W} and define a nested structure of n such machines
  • F_k = {F(x, w) : w ∈ W_k}, k = 1, 2, ..., n,
  • such that we have F_1 ⊂ F_2 ⊂ ... ⊂ F_n.
  • Correspondingly, the VC dimensions of the individual pattern classifiers satisfy h_1 ≤ h_2 ≤ ... ≤ h_n,
  • which implies that the VC dimension of each classifier is finite (see the next figure).

39
  • Figure (not reproduced): illustration of the relationship between training error, confidence interval, and guaranteed risk as functions of the VC dimension h.

40
Structural Risk Minimization
  • Then the method of structural risk minimization proceeds as follows:
  • a) The empirical risk (training error) of each classifier F_k is minimized.
  • b) The pattern classifier F_k* with the smallest guaranteed risk is identified; this particular machine provides the best compromise between the training error (quality of approximation) and the confidence interval (complexity of the approximating function).

41
Structural Risk Minimization
  • Our goal is to find a network structure such that decreasing the VC dimension occurs at the expense of the smallest possible increase in training error.
  • We achieve this, for example, by varying h through the number of hidden neurons.
  • That is, we evaluate an ensemble of fully connected multilayer feedforward networks in which the number of neurons in one of the hidden layers is increased in a monotonic fashion. (A sketch of the resulting selection loop follows below.)
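A hedged sketch of this selection loop. Training real multilayer networks is beyond a few lines, so the nested structure below uses dyadic histogram classifiers on [0, 1] with k = 1, 2, 4, ... bins as stand-ins for networks with a growing hidden layer, with the VC dimension crudely taken as h = k; the data source and the generic VC-type confidence term are also assumptions. What the sketch keeps is the SRM procedure itself: minimize the training error within each member of the nested structure, add the confidence term, and keep the machine with the smallest guaranteed risk.

    import math
    import random

    def draw(N, rng):
        """Hypothetical source: x ~ Uniform(0, 1), label 1[x >= 0.5] flipped with prob. 0.1."""
        data = []
        for _ in range(N):
            x = rng.random()
            d = 1 if x >= 0.5 else 0
            if rng.random() < 0.1:
                d = 1 - d
            data.append((x, d))
        return data

    def fit_histogram(train, k):
        """Member F_k of the nested structure: majority training label in each of k equal bins."""
        votes = [[0, 0] for _ in range(k)]
        for x, d in train:
            votes[min(int(x * k), k - 1)][d] += 1
        return [0 if v[0] >= v[1] else 1 for v in votes]

    def training_error(model, train, k):
        return sum(1 for x, d in train if model[min(int(x * k), k - 1)] != d) / len(train)

    def confidence_interval(h, N):
        # Generic VC-type term standing in for the confidence interval defined earlier.
        return math.sqrt(h * (math.log(2.0 * N / h) + 1.0) / N)

    rng = random.Random(0)
    train = draw(200, rng)
    N = len(train)

    best = None
    for k in (1, 2, 4, 8, 16, 32, 64):                  # capacities of the nested structure
        model = fit_histogram(train, k)                 # a) minimize the training error within F_k
        guaranteed = training_error(model, train, k) + confidence_interval(k, N)
        print("k =", k, " guaranteed risk =", round(guaranteed, 3))
        if best is None or guaranteed < best[1]:
            best = (k, guaranteed)                      # b) keep the smallest guaranteed risk
    print("SRM selects k =", best[0])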

42
Structural Risk Minimization
  • The principle of SRM states that the best network
    in this ensemble is the one for which the
    guaranteed risk is the minimum.