1
Pattern Recognition and Machine Learning
Chapter 3: Linear Models for Regression
2
Linear Basis Function Models (1)
  • Example: Polynomial Curve Fitting

3
Linear Basis Function Models (2)
  • Generally, y(x, w) = Σ_{j=0}^{M−1} wj φj(x) = wᵀφ(x),
  • where the φj(x) are known as basis functions.
  • Typically, φ0(x) = 1, so that w0 acts as a bias.
  • In the simplest case, we use linear basis
    functions: φd(x) = xd.

4
Linear Basis Function Models (3)
  • Polynomial basis functions: φj(x) = xʲ.
  • These are global: a small change in x affects all
    basis functions.

5
Linear Basis Function Models (4)
  • Gaussian basis functions: φj(x) = exp(−(x − μj)² / (2s²)).
  • These are local: a small change in x only affects
    nearby basis functions. μj and s control location
    and scale (width).

6
Linear Basis Function Models (5)
  • Sigmoidal basis functions: φj(x) = σ((x − μj) / s),
  • where σ(a) = 1 / (1 + exp(−a)).
  • These too are local: a small change in x only
    affects nearby basis functions. μj and s control
    location and scale (slope).
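A minimal numpy sketch of these three basis-function families (the grid of centres μj and the scale s below are illustrative choices, and a constant bias column φ0 = 1 would normally be prepended):

```python
import numpy as np

def polynomial_basis(x, M):
    """Global polynomial basis: phi_j(x) = x**j for j = 0..M-1 (phi_0 = 1 is the bias)."""
    return np.vander(x, M, increasing=True)

def gaussian_basis(x, mu, s):
    """Local Gaussian basis: phi_j(x) = exp(-(x - mu_j)**2 / (2 s**2))."""
    return np.exp(-0.5 * ((x[:, None] - mu[None, :]) / s) ** 2)

def sigmoidal_basis(x, mu, s):
    """Local sigmoidal basis: phi_j(x) = sigma((x - mu_j) / s)."""
    return 1.0 / (1.0 + np.exp(-(x[:, None] - mu[None, :]) / s))

x = np.linspace(-1.0, 1.0, 5)
mu = np.linspace(-1.0, 1.0, 4)            # basis-function locations (assumed)
Phi = gaussian_basis(x, mu, s=0.3)        # 5 x 4 design matrix
```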

7
Maximum Likelihood and Least Squares (1)
  • Assume observations from a deterministic function
    with added Gaussian noise: t = y(x, w) + ε with
    ε ~ N(0, β⁻¹),
  • which is the same as saying p(t | x, w, β) = N(t | y(x, w), β⁻¹).
  • Given observed inputs X = {x1, …, xN} and targets
    t = (t1, …, tN)ᵀ, we obtain the likelihood function
    p(t | X, w, β) = Πn N(tn | wᵀφ(xn), β⁻¹).

8
Maximum Likelihood and Least Squares (2)
  • Taking the logarithm, we get
    ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β ED(w),
  • where ED(w) = ½ Σn {tn − wᵀφ(xn)}²
  • is the sum-of-squares error.

9
Maximum Likelihood and Least Squares (3)
  • Computing the gradient and setting it to zero
    yields the normal equations.
  • Solving for w, we get wML = (ΦᵀΦ)⁻¹Φᵀt,
  • where Φ is the N × M design matrix with elements
    Φnj = φj(xn).
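A sketch of the maximum-likelihood solution on synthetic data (the noisy sinusoid and the cubic polynomial basis are illustrative assumptions); the pseudo-inverse is used for numerical stability, and the last line is the ML noise precision from the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # noisy sinusoid (illustrative)

Phi = np.vander(x, 4, increasing=True)          # cubic polynomial design matrix, phi_0 = 1
w_ml = np.linalg.pinv(Phi) @ t                  # w_ML = (Phi^T Phi)^{-1} Phi^T t
# equivalently: w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)  # ML noise precision: inverse mean squared residual
```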

10
Maximum Likelihood and Least Squares (4)
  • Maximizing with respect to the bias, w0, alone,
    we see that w0 compensates for the difference
    between the mean of the targets and the weighted
    sum of the basis-function means.
  • We can also maximize with respect to the noise
    precision, β, giving 1/βML = (1/N) Σn {tn − wMLᵀφ(xn)}².

11
Geometry of Least Squares
  • Consider the N-dimensional vector y = ΦwML, with
    elements y(xn, wML).
  • S is spanned by the columns of Φ, i.e. the basis
    functions φ1, …, φM evaluated at the training inputs.
  • wML minimizes the distance between t and its
    orthogonal projection on S, i.e. y.

(t lies in an N-dimensional space; the subspace S is M-dimensional.)
12
Sequential Learning
  • Data items are considered one at a time (a.k.a.
    online learning); use stochastic (sequential)
    gradient descent: w(τ+1) = w(τ) + η (tn − w(τ)ᵀφn) φn.
  • This is known as the least-mean-squares (LMS)
    algorithm. Issue: how to choose the learning rate η?
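A minimal sketch of the LMS update (the learning rate eta and the number of passes are illustrative; too large an eta diverges, too small converges slowly):

```python
import numpy as np

def lms(Phi, t, eta=0.05, n_passes=50):
    """Least-mean-squares: one stochastic gradient step per data point,
    w <- w + eta * (t_n - w @ phi_n) * phi_n."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_passes):
        for phi_n, t_n in zip(Phi, t):
            w = w + eta * (t_n - w @ phi_n) * phi_n
    return w
```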

13
Regularized Least Squares (1)
  • Consider the error function ED(w) + λ EW(w), i.e. a
    data-dependent term plus a regularization term.
  • With the sum-of-squares error function and a
    quadratic regularizer, we get
    E(w) = ½ Σn {tn − wᵀφ(xn)}² + (λ/2) wᵀw,
  • which is minimized by w = (λI + ΦᵀΦ)⁻¹Φᵀt.

λ is called the regularization coefficient.
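A sketch of the closed-form regularized (ridge) solution; lam stands for the regularization coefficient λ:

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """Minimizer of the regularized sum-of-squares error:
    w = (lambda * I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```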
14
Regularized Least Squares (2)
  • With a more general regularizer, we have
    E(w) = ½ Σn {tn − wᵀφ(xn)}² + (λ/2) Σj |wj|^q.
  • q = 1 gives the lasso; q = 2 gives the quadratic
    regularizer.
15
Regularized Least Squares (3)
  • Lasso tends to generate sparser solutions than a
    quadratic regularizer.
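A quick illustration of this sparsity effect, using scikit-learn's Ridge and Lasso as stand-ins for the q = 2 and q = 1 regularizers (the data, alpha values, and the choice of two relevant features are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 10))
w_true = np.array([3.0, -2.0] + [0.0] * 8)       # only the first two features matter
y = X @ w_true + 0.1 * rng.standard_normal(50)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.sum(np.abs(ridge.coef_) < 1e-6))        # typically no coefficients are exactly zero
print(np.sum(np.abs(lasso.coef_) < 1e-6))        # typically many coefficients are exactly zero
```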

16
Multiple Outputs (1)
  • Analogously to the single-output case, we have
    p(t | x, W, β) = N(t | Wᵀφ(x), β⁻¹I).
  • Given observed inputs X = {x1, …, xN} and targets
    T (an N × K matrix with rows tnᵀ), we obtain the
    log likelihood function.

17
Multiple Outputs (2)
  • Maximizing with respect to W, we obtain
    WML = (ΦᵀΦ)⁻¹ΦᵀT.
  • If we consider a single target variable, tk, we
    see that wk = (ΦᵀΦ)⁻¹Φᵀtk,
  • where tk = (t1k, …, tNk)ᵀ, which is identical to
    the single-output case.
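As a sketch, the multi-output solution is just the pseudo-inverse applied to the whole target matrix; each column of the result is the corresponding single-output solution:

```python
import numpy as np

def multi_output_ml(Phi, T):
    """W_ML = (Phi^T Phi)^{-1} Phi^T T for an N x K target matrix T;
    column k equals the single-output solution for target column t_k."""
    return np.linalg.pinv(Phi) @ T
```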

18
The Bias-Variance Decomposition (1)
  • Recall the expected squared loss,
    E[L] = ∫ {y(x) − h(x)}² p(x) dx + ∫∫ {h(x) − t}² p(x, t) dx dt,
  • where h(x) = E[t | x] = ∫ t p(t | x) dt.
  • The second term of E[L] corresponds to the noise
    inherent in the random variable t.
  • What about the first term?

19
The Bias-Variance Decomposition (2)
  • Suppose we were given multiple data sets, each of
    size N. Any particular data set, D, will give a
    particular function y(x; D). We then have

20
The Bias-Variance Decomposition (3)
  • Taking the expectation over D yields

21
The Bias-Variance Decomposition (4)
  • Thus we can write: expected loss = (bias)² + variance + noise,
  • where (bias)² measures how far the average prediction
    ED[y(x; D)] is from h(x), the variance measures how
    much individual solutions vary around that average,
    and the noise term is irreducible.

22
The Bias-Variance Decomposition (5)
  • Example: 25 data sets from the sinusoidal data set,
    varying the degree of regularization, λ.

23
The Bias-Variance Decomposition (6)
  • Example: 25 data sets from the sinusoidal data set,
    varying the degree of regularization, λ.

24
The Bias-Variance Decomposition (7)
  • Example: 25 data sets from the sinusoidal data set,
    varying the degree of regularization, λ.

25
The Bias-Variance Trade-off
  • From these plots, we note that an over-regularized
    model (large λ) will have a high bias, while an
    under-regularized model (small λ) will have a high
    variance.
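A small Monte Carlo sketch of this trade-off on the sinusoidal example (25 data sets; the Gaussian basis, noise level, and value of lam are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
L, N, lam = 25, 25, 1.0                            # data sets, points per set, regularization
x = np.linspace(0.0, 1.0, 100)
h = np.sin(2 * np.pi * x)                          # the true (sinusoidal) function
mu = np.linspace(0.0, 1.0, 9)
def phi(u):                                        # 9 Gaussian basis functions
    return np.exp(-0.5 * ((u[:, None] - mu[None, :]) / 0.1) ** 2)

preds = []
for _ in range(L):                                 # one regularized fit per data set
    xn = rng.uniform(0.0, 1.0, N)
    tn = np.sin(2 * np.pi * xn) + 0.3 * rng.standard_normal(N)
    Phi = phi(xn)
    w = np.linalg.solve(lam * np.eye(len(mu)) + Phi.T @ Phi, Phi.T @ tn)
    preds.append(phi(x) @ w)
preds = np.array(preds)

y_bar = preds.mean(axis=0)
bias2 = np.mean((y_bar - h) ** 2)                  # (bias)^2, averaged over x
variance = np.mean(preds.var(axis=0))              # variance, averaged over x
print(bias2, variance)                             # large lam: high bias; small lam: high variance
```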

26
Bayesian Linear Regression (1)
  • Define a conjugate (Gaussian) prior over w:
    p(w) = N(w | m0, S0).
  • Combining this with the likelihood function and
    using results for marginal and conditional
    Gaussian distributions gives the posterior
    p(w | t) = N(w | mN, SN),
  • where mN = SN(S0⁻¹m0 + βΦᵀt) and SN⁻¹ = S0⁻¹ + βΦᵀΦ.

27
Bayesian Linear Regression (2)
  • A common choice for the prior is the zero-mean
    isotropic Gaussian p(w | α) = N(w | 0, α⁻¹I),
  • for which mN = βSNΦᵀt and SN⁻¹ = αI + βΦᵀΦ.
  • Next we consider an example.
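A sketch of this posterior computation for the isotropic prior (alpha and beta are assumed known here; later slides treat them as hyperparameters to be estimated):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) under the prior N(w | 0, alpha^{-1} I):
    S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N
```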

28
Bayesian Linear Regression (3)
0 data points observed
Prior
Data Space
29
Bayesian Linear Regression (4)
1 data point observed
Likelihood
Posterior
Data Space
30
Bayesian Linear Regression (5)
2 data points observed
Likelihood
Posterior
Data Space
31
Bayesian Linear Regression (6)
20 data points observed
Likelihood
Posterior
Data Space
32
Predictive Distribution (1)
  • Predict t for new values of x by integrating over
    w: p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw
    = N(t | mNᵀφ(x), σN²(x)),
  • where σN²(x) = 1/β + φ(x)ᵀSNφ(x).
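A sketch of the predictive mean and variance at a new input, reusing the posterior m_N, S_N from the earlier sketch (names are assumptions of that sketch, not the slides):

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive distribution N(t | mean, var) at a new input with features phi_x:
    mean = m_N^T phi(x),  var = 1/beta + phi(x)^T S_N phi(x)."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var
```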

33
Predictive Distribution (2)
  • Example: Sinusoidal data, 9 Gaussian basis
    functions, 1 data point.

34
Predictive Distribution (3)
  • Example: Sinusoidal data, 9 Gaussian basis
    functions, 2 data points.

35
Predictive Distribution (4)
  • Example: Sinusoidal data, 9 Gaussian basis
    functions, 4 data points.

36
Predictive Distribution (5)
  • Example: Sinusoidal data, 9 Gaussian basis
    functions, 25 data points.

37
Equivalent Kernel (1)
  • The predictive mean can be written
    y(x, mN) = mNᵀφ(x) = Σn k(x, xn) tn.
  • This is a weighted sum of the training-data
    target values, tn.

k(x, x′) = βφ(x)ᵀSNφ(x′) is known as the equivalent kernel or smoother matrix.
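A sketch of the equivalent-kernel weights for one query point, again reusing m_N, S_N and the design matrix Phi from the earlier sketches:

```python
import numpy as np

def equivalent_kernel(phi_x, Phi, S_N, beta):
    """k(x, x_n) = beta * phi(x)^T S_N phi(x_n) for every training point x_n."""
    return beta * phi_x @ S_N @ Phi.T            # one weight per training target

# The predictive mean is then the weighted sum of targets:
# y(x) = equivalent_kernel(phi_x, Phi, S_N, beta) @ t
```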
38
Equivalent Kernel (2)
The weight of tn depends on the distance between x and
xn; nearby xn carry more weight.
39
Equivalent Kernel (3)
  • Non-local basis functions have local equivalent
    kernels

(Equivalent kernels shown for polynomial and sigmoidal basis functions.)
40
Equivalent Kernel (4)
  • The kernel as a covariance function: consider
    cov[y(x), y(x′)] = φ(x)ᵀSNφ(x′) = β⁻¹k(x, x′).
  • We can avoid the use of basis functions and
    define the kernel function directly, leading to
    Gaussian Processes (Chapter 6).

41
Equivalent Kernel (5)
  • The equivalent kernel sums to one, Σn k(x, xn) = 1,
    for all values of x; however, the equivalent kernel
    may be negative for some values of x.
  • Like all kernel functions, the equivalent kernel
    can be expressed as an inner product: k(x, z) = ψ(x)ᵀψ(z),
  • where ψ(x) = β^{1/2} SN^{1/2} φ(x).

42
Bayesian Model Comparison (1)
  • How do we choose the right model?
  • Assume we want to compare models Mi, i = 1, …, L,
    using data D; this requires computing the posterior
    p(Mi | D) ∝ p(Mi) p(D | Mi).
  • Bayes factor: the ratio of evidences for two models,
    p(D | Mi) / p(D | Mj).

43
Bayesian Model Comparison (2)
  • Having computed p(Mi | D), we can compute the
    predictive (mixture) distribution
    p(t | x, D) = Σi p(t | x, Mi, D) p(Mi | D).
  • A simpler approximation, known as model
    selection, is to use the single model with the
    highest evidence.

44
Bayesian Model Comparison (3)
  • For a model with parameters w, we get the model
    evidence by marginalizing over w:
    p(D | Mi) = ∫ p(D | w, Mi) p(w | Mi) dw.
  • Note that the evidence is precisely the normalizing
    constant in Bayes' theorem for the parameter
    posterior p(w | D, Mi).

45
Bayesian Model Comparison (4)
  • For a given model with a single parameter, w,
    consider the approximation
    p(D) ≈ p(D | wMAP) Δw_posterior / Δw_prior,
  • where the posterior is assumed to be sharply
    peaked (width Δw_posterior) and the prior flat
    (width Δw_prior).

46
Bayesian Model Comparison (5)
  • Taking logarithms, we obtain
    ln p(D) ≈ ln p(D | wMAP) + ln(Δw_posterior / Δw_prior).
  • With M parameters, all assumed to have the same
    ratio Δw_posterior / Δw_prior, we get
    ln p(D) ≈ ln p(D | wMAP) + M ln(Δw_posterior / Δw_prior).

The correction term is negative (Δw_posterior < Δw_prior);
with M parameters it is negative and linear in M.
47
Bayesian Model Comparison (6)
  • Matching data and model complexity

48
The Evidence Approximation (1)
  • The fully Bayesian predictive distribution is
    given by p(t | t) = ∫∫∫ p(t | w, β) p(w | t, α, β) p(α, β | t) dw dα dβ,
  • but this integral is intractable. Approximate
    with p(t | t) ≈ p(t | t, α̂, β̂),
  • where (α̂, β̂) is the mode of p(α, β | t), which is
    assumed to be sharply peaked; a.k.a. empirical
    Bayes, type II or generalized maximum likelihood,
    or the evidence approximation.

49
The Evidence Approximation (2)
  • From Bayes' theorem we have
    p(α, β | t) ∝ p(t | α, β) p(α, β),
  • and if we assume p(α, β) to be flat, we see that
    maximizing the posterior amounts to maximizing the
    evidence p(t | α, β).
  • General results for Gaussian integrals give a
    closed-form expression for ln p(t | α, β) (see the
    sketch below).
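A sketch of that log evidence, following the standard linear-Gaussian result
ln p(t | α, β) = (M/2) ln α + (N/2) ln β − E(mN) − ½ ln|A| − (N/2) ln 2π with A = αI + βΦᵀΦ:

```python
import numpy as np

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | alpha, beta) for the linear-Gaussian model."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi           # A = alpha*I + beta*Phi^T Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)           # posterior mean
    E_mN = 0.5 * beta * np.sum((t - Phi @ m_N) ** 2) + 0.5 * alpha * m_N @ m_N
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * np.linalg.slogdet(A)[1] - 0.5 * N * np.log(2 * np.pi))
```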

50
The Evidence Approximation (3)
  • Example: sinusoidal data, Mth-degree polynomial.

51
Maximizing the Evidence Function (1)
  • To maximise p(t | α, β) w.r.t. α and β, we
    define the eigenvector equation (βΦᵀΦ) ui = λi ui.
  • Thus A = αI + βΦᵀΦ
  • has eigenvalues α + λi.

52
Maximizing the Evidence Function (2)
  • We can now differentiate ln p(t | α, β)
    w.r.t. α and β, and set the results to zero, to
    get α = γ / (mNᵀmN) and
    1/β = (1/(N − γ)) Σn {tn − mNᵀφ(xn)}²,
  • where γ = Σi λi / (α + λi) is the effective
    number of parameters.

N.B. γ depends on both α and β, so these equations are solved by iterative re-estimation (sketched below).
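A sketch of that iterative re-estimation (the initial values of alpha, beta and the iteration count are illustrative; convergence is usually fast):

```python
import numpy as np

def evidence_maximization(Phi, t, alpha=1.0, beta=1.0, n_iter=100):
    """Fixed-point re-estimation of alpha and beta:
    gamma = sum_i lambda_i / (alpha + lambda_i),
    alpha = gamma / (m_N^T m_N),
    1/beta = sum_n (t_n - m_N^T phi_n)^2 / (N - gamma)."""
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)          # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        lam = beta * eig0                           # eigenvalues of beta * Phi^T Phi
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)  # posterior mean for current alpha, beta
        gamma = np.sum(lam / (alpha + lam))         # effective number of parameters
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta, gamma
```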
53
Effective Number of Parameters (1)
Likelihood
Prior
54
Effective Number of Parameters (2)
  • Example: sinusoidal data, 9 Gaussian basis
    functions, β = 11.1.

55
Effective Number of Parameters (3)
  • Example: sinusoidal data, 9 Gaussian basis
    functions, β = 11.1.

Test set error
56
Effective Number of Parameters (4)
  • Example: sinusoidal data, 9 Gaussian basis
    functions, β = 11.1.

57
Effective Number of Parameters (5)
  • In the limit N ≫ M, γ → M, and we can consider
    using the easy-to-compute approximations
    α = M / (2 EW(mN)) and β = N / (2 ED(mN)).

58
Limitations of Fixed Basis Functions
  • M basis functions along each dimension of a
    D-dimensional input space require M^D basis
    functions: the curse of dimensionality.
  • In later chapters, we shall see how we can get
    away with fewer basis functions, by choosing
    these using the training data.