Pattern Recognition and Machine Learning
Provided by: marku183
1
Pattern Recognition and Machine Learning
Chapter 1 Introduction
2
Expectations
3
Expectations
Conditional Expectation (discrete)
Approximate Expectation (discrete and continuous)
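The approximate expectation on this slide replaces the sum or integral with an average over samples. A minimal sketch of that idea, assuming a standard normal x and f(x) = x^2 (both chosen here only for illustration; the function and variable names are hypothetical):

```python
import random

def approximate_expectation(f, sampler, n_samples):
    """Approximate E[f] by averaging f over n_samples draws from sampler."""
    total = 0.0
    for _ in range(n_samples):
        total += f(sampler())
    return total / n_samples

rng = random.Random(0)  # fixed seed so the run is repeatable

# For a standard normal x, E[x^2] equals the variance, which is exactly 1.
estimate = approximate_expectation(
    lambda x: x * x,
    lambda: rng.gauss(0.0, 1.0),
    100_000,
)
print(estimate)
```

The estimate should land close to 1, and the approximation error shrinks roughly as one over the square root of the number of samples.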
4
Variances and Covariances
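The equations for this slide did not survive in the transcript. For reference, the standard definitions (as given in PRML Section 1.2.2) are sketched below; this is a reconstruction, not the slide's own text.

```latex
\begin{align}
\operatorname{var}[f] &= \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big]
                       = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2 \\
\operatorname{cov}[x, y] &= \mathbb{E}_{x,y}\big[\{x - \mathbb{E}[x]\}\{y - \mathbb{E}[y]\}\big]
                          = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y]
\end{align}
```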
5
Intrinsic Error
  • PRML 1.5.5

6
Curve Fitting Re-visited
7
Maximum Likelihood
Determine w_ML by minimizing the sum-of-squares
error, E(w).
8
Minimizing the loss function for regression
  • Using the squared error as the loss function
  • We want to choose y(x) to minimize the expected
    loss

9
  • Solving for y(x), we get y(x) = E[t | x], the conditional mean of t given x

10
The Squared Loss Function
11
  • The first term is minimized when we select y(x)
    as the conditional mean, E[t | x]
  • The second term is independent of y(x) and
    represents the intrinsic variability of the
    target
  • It is called the intrinsic error.
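The two terms discussed on this slide come from the standard decomposition of the expected squared loss (PRML Section 1.5.5). The slide's own equation is missing from the transcript, so the form below is a reconstruction, with E[t | x] denoting the conditional mean:

```latex
\mathbb{E}[L] = \int \{y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}]\}^2 \, p(\mathbf{x}) \, d\mathbf{x}
              + \int \operatorname{var}[t \mid \mathbf{x}] \, p(\mathbf{x}) \, d\mathbf{x}
```

The first integral vanishes when y(x) equals the conditional mean; the second does not involve y(x) at all.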

12
(No Transcript)
13
Inverse Problems
14
Bias, Variance
  • PRML 3.2

15
The Bias-Variance Decomposition (1)
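  • Recall the expected squared loss,
    E[L] = ∫ {y(x) - h(x)}^2 p(x) dx + ∫∫ {h(x) - t}^2 p(x, t) dx dt,
  • where h(x) = E[t | x] = ∫ t p(t | x) dt.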
  • The second term corresponds to the noise inherent
    in the random variable t.
  • What about the first term?

16
The Bias-Variance Decomposition (2)
  • Suppose we were given multiple data sets, each of
    size N. Any particular data set, D, will give a
    particular function y(x; D). We then have
    {y(x; D) - h(x)}^2 = {y(x; D) - E_D[y(x; D)] + E_D[y(x; D)] - h(x)}^2

17
The Bias-Variance Decomposition (3)
  • Taking the expectation over D yields
    E_D[{y(x; D) - h(x)}^2] = {E_D[y(x; D)] - h(x)}^2 + E_D[{y(x; D) - E_D[y(x; D)]}^2]

18
The Bias-Variance Decomposition (4)
  • Thus we can write: expected loss = (bias)^2 + variance + noise
  • where
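The definitions that followed "where" were lost in the transcript. A reconstruction of the three terms, following PRML Section 3.2 and using h(x) for the conditional mean E[t | x], reads:

```latex
\begin{align}
(\text{bias})^2 &= \int \{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2 \, p(\mathbf{x}) \, d\mathbf{x} \\
\text{variance} &= \int \mathbb{E}_{\mathcal{D}}\big[\{y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})]\}^2\big] \, p(\mathbf{x}) \, d\mathbf{x} \\
\text{noise} &= \iint \{h(\mathbf{x}) - t\}^2 \, p(\mathbf{x}, t) \, d\mathbf{x} \, dt
\end{align}
```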

19
  • Bias measures how much the prediction (averaged
    over all data sets) differs from the desired
    regression function.
  • Variance measures how much the predictions for
    individual data sets vary around their average.
  • There is a trade-off between bias and variance.
  • As we increase model complexity, bias decreases
    (a better fit to the data) and variance increases
    (the fit varies more with the data).

20
[Figure; labels: f, gi, g, bias, variance]
21
Reminder: Introduction to Overfitting (PRML 1.1)
  • Concepts: polynomial curve fitting,
  • overfitting, regularization,
  • training set size vs. model complexity

22
Polynomial Curve Fitting
23
Sum-of-Squares Error Function
24
0th Order Polynomial
25
1st Order Polynomial
26
3rd Order Polynomial
27
9th Order Polynomial
28
Over-fitting
Root-Mean-Square (RMS) Error
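The progression from 0th to 9th order can be reproduced with a short experiment. The sketch below assumes 10 training points drawn from sin(2πx) with Gaussian noise of standard deviation 0.3, and measures RMS error as the square root of the mean squared residual; the data sizes, noise level, and helper names are illustrative assumptions, not values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, noise_std=0.3):
    """Noisy samples of sin(2*pi*x) at n evenly spaced inputs."""
    x = np.linspace(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, size=n)
    return x, t

def fit_polynomial(x, t, order):
    """Least-squares polynomial coefficients, lowest power first."""
    A = np.vander(x, order + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

def rms_error(w, x, t):
    """Root-mean-square difference between predictions and targets."""
    y = np.vander(x, len(w), increasing=True) @ w
    return np.sqrt(np.mean((y - t) ** 2))

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

rms_train, rms_test = {}, {}
for order in (0, 1, 3, 9):
    w = fit_polynomial(x_train, t_train, order)
    rms_train[order] = rms_error(w, x_train, t_train)
    rms_test[order] = rms_error(w, x_test, t_test)
    print(order, rms_train[order], rms_test[order])
```

Training error can only go down as the order grows, because each lower-order model is a special case of the higher-order one. Test error is what reveals over-fitting.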
29
Polynomial Coefficients
30
Regularization
  • Penalize large coefficient values
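One common way to penalize large coefficients is to add a quadratic penalty (λ/2)·||w||^2 to the sum-of-squares error, which gives the closed-form solution w = (ΦᵀΦ + λI)^(-1) Φᵀ t. The sketch below assumes that penalty, a 9th order polynomial, and made-up data settings; for simplicity it also penalizes the constant term, which is often left out of the regularizer.

```python
import numpy as np

def fit_ridge_polynomial(x, t, order, lam):
    """Minimize 0.5 * sum of squared errors + 0.5 * lam * ||w||^2 in closed form."""
    A = np.vander(x, order + 1, increasing=True)
    return np.linalg.solve(A.T @ A + lam * np.eye(order + 1), A.T @ t)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.3, size=10)

norms = {}
for ln_lam in (-18.0, 0.0):
    w = fit_ridge_polynomial(x, t, order=9, lam=np.exp(ln_lam))
    norms[ln_lam] = np.linalg.norm(w)
    print(ln_lam, norms[ln_lam])
```

The coefficient norm shrinks as λ grows, which is the effect the coefficient tables on the later slides illustrate.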

31
Regularization
32
Regularization
33
Regularization vs.
34
Polynomial Coefficients
35
Back to Bias/Variance
36
The Bias-Variance Decomposition (5)
  • Example: 100 data sets, each with 25 data points
    from the sinusoidal h(x) = sin(2πx), varying the
    degree of regularization, λ.

37
The Bias-Variance Decomposition (6)
  • Example: 100 data sets, each with 25 data points
    from the sinusoidal h(x) = sin(2πx), varying the
    degree of regularization, λ.

38
The Bias-Variance Decomposition (7)
  • Example: 100 data sets, each with 25 data points
    from the sinusoidal h(x) = sin(2πx), varying the
    degree of regularization, λ.

39
The Bias-Variance Trade-off
  • From these plots, we note that an
    over-regularized model (large λ) will have a high
    bias, while an under-regularized model (small λ)
    will have a high variance.

The minimum value of (bias)^2 + variance is around
ln λ = -0.31. This is close to the value that gives the
minimum error on the test data.
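The experiment behind these plots can be approximated numerically. The sketch below follows the stated setup (100 data sets, 25 points each, h(x) = sin(2πx)) but fills in details the transcript does not give: 24 Gaussian basis functions of width 0.1 plus a constant term, noise standard deviation 0.3, and three illustrative values of ln λ. All of these are assumptions.

```python
import numpy as np

def design_matrix(x, centers, width):
    """Constant column followed by Gaussian basis functions."""
    gauss = np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)
    return np.hstack([np.ones((len(x), 1)), gauss])

def bias_variance(ln_lam, n_sets=100, n_points=25, noise_std=0.3, seed=0):
    """Estimate (bias)^2 and variance of ridge fits over many data sets."""
    rng = np.random.default_rng(seed)
    centers = np.linspace(0.0, 1.0, 24)
    x_grid = np.linspace(0.0, 1.0, 200)
    phi_grid = design_matrix(x_grid, centers, 0.1)
    h = np.sin(2 * np.pi * x_grid)
    lam = np.exp(ln_lam)
    preds = np.empty((n_sets, len(x_grid)))
    for i in range(n_sets):
        x = rng.uniform(0.0, 1.0, n_points)
        t = np.sin(2 * np.pi * x) + rng.normal(0.0, noise_std, n_points)
        phi = design_matrix(x, centers, 0.1)
        w = np.linalg.solve(phi.T @ phi + lam * np.eye(phi.shape[1]), phi.T @ t)
        preds[i] = phi_grid @ w
    mean_pred = preds.mean(axis=0)              # average fit over data sets
    bias2 = np.mean((mean_pred - h) ** 2)       # squared gap to h(x)
    variance = np.mean(preds.var(axis=0))       # spread around the average fit
    return bias2, variance

results = {ln_lam: bias_variance(ln_lam) for ln_lam in (-2.4, -0.31, 2.6)}
for ln_lam, (bias2, variance) in results.items():
    print(ln_lam, bias2, variance)
```

With heavy regularization the fits barely move between data sets but sit far from h(x); with light regularization the opposite holds. The exact numbers depend on the assumed basis and noise level, so they need not match the slides.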
40
[Figure; labels: f, gi, g, bias, variance]
41
Model Selection Procedures
  • Regularization (Breiman 1998): penalize complex
    models through the augmented error
  • augmented error = error on data + λ · model complexity
  • If λ is too large, we risk introducing bias
  • Use cross-validation to optimize λ
  • Structural Risk Minimization (Vapnik 1995):
  • Use a set of models ordered in terms of their
    complexities, e.g., by number of free parameters
    or VC dimension
  • Find the best model w.r.t. empirical error and
    model complexity.
  • Minimum Description Length Principle
  • Bayesian Model Selection: if we have some prior
    knowledge about the approximating function, it
    can be incorporated into the Bayesian approach in
    the form of p(model).

42
Bayesian Model Selection
  • Prior on models, p(model)
  • When the prior favors simpler models, the Bayesian
    approach, regularization, SRM, and MDL are equivalent.

43
Model Selection Procedures
  • Cross-validation: measure the total error, rather
    than bias/variance, on a validation set.
  • Train/Validation sets
  • K-fold cross validation
  • Leave-One-Out
  • No prior assumption about the models
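K-fold cross-validation can be sketched in a few lines. The example below assumes it is used to pick a polynomial order on noisy sinusoidal data with K = 5; the data settings and function names are illustrative assumptions. Setting K to the number of data points turns it into leave-one-out.

```python
import numpy as np

def k_fold_mse(x, t, order, folds):
    """Mean validation MSE of a polynomial of the given order over the folds."""
    errors = []
    for i, val in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        A = np.vander(x[train], order + 1, increasing=True)
        w, *_ = np.linalg.lstsq(A, t[train], rcond=None)
        pred = np.vander(x[val], order + 1, increasing=True) @ w
        errors.append(np.mean((pred - t[val]) ** 2))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, 40)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 40)
folds = np.array_split(rng.permutation(len(x)), 5)  # K = 5

scores = {order: k_fold_mse(x, t, order, folds) for order in (0, 1, 3, 9)}
best_order = min(scores, key=scores.get)
print(scores, best_order)
```

The same loop works for choosing λ: swap the polynomial order for a list of candidate regularization strengths.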

44
Polynomial Regression
Best fit: minimum error
45
Best fit, elbow
46
  • Averaging of multiple solutions which themselves
    have low bias
  • Lower variance due to averaging
  • Ensembles, mixture of experts, committee
    machines
  • Bayesian approach: weighted average
  • Increasing the number of data points N
  • Constrains the solutions, reducing variance
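The variance reduction from averaging is easy to check numerically. The sketch below assumes a hypothetical unbiased predictor whose output is the true value 1.0 plus independent unit-variance noise, and compares it with a committee of 10 such predictors.

```python
import random
import statistics

rng = random.Random(0)

def noisy_prediction():
    """Hypothetical low-bias predictor: the truth (1.0) plus unit-variance noise."""
    return 1.0 + rng.gauss(0.0, 1.0)

def committee_prediction(m):
    """Average of m independent noisy predictions."""
    return sum(noisy_prediction() for _ in range(m)) / m

single = [noisy_prediction() for _ in range(5000)]
committee = [committee_prediction(10) for _ in range(5000)]

var_single = statistics.pvariance(single)
var_committee = statistics.pvariance(committee)
print(var_single, var_committee)
```

The committee variance comes out near one tenth of the single-predictor variance. That factor relies on the errors being independent; correlated committee members give a smaller reduction.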

47
Data Set Size
9th Order Polynomial
48
Data Set Size
9th Order Polynomial
49
End of Lecture: Skip the Rest
50
Bayesian Linear Regression (1)
  • Define a conjugate prior over w
  • Combining this with the likelihood function and
    using results for marginal and conditional
    Gaussian distributions, gives the posterior
  • where
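The equations on this slide are missing from the transcript. A reconstruction following PRML Section 3.3.1, with Φ the design matrix, t the vector of targets, and β the noise precision, would be:

```latex
\begin{align}
p(\mathbf{w}) &= \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0) \\
p(\mathbf{w} \mid \mathbf{t}) &= \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) \\
\mathbf{m}_N &= \mathbf{S}_N \left( \mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^{\mathsf{T}} \mathbf{t} \right) \\
\mathbf{S}_N^{-1} &= \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^{\mathsf{T}} \boldsymbol{\Phi}
\end{align}
```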

51
Bayesian Linear Regression (2)
  • For simplicity, let's assume that the prior is
    p(w | α) = N(w | 0, α^(-1) I),
  • for which m_N = β S_N Φ^T t and S_N^(-1) = α I + β Φ^T Φ.
  • Next we consider an example
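A minimal sketch of such an example, assuming a zero-mean isotropic prior with precision α, so that S_N^(-1) = αI + βΦᵀΦ and m_N = βS_NΦᵀt. The straight-line model, the true parameters (-0.3, 0.5), the noise level, and the values α = 2 and β = 25 are illustrative assumptions rather than values stated in the transcript.

```python
import numpy as np

def posterior(phi, t, alpha, beta):
    """Posterior mean and covariance of w under a N(0, I / alpha) prior."""
    n_basis = phi.shape[1]
    s_n_inv = alpha * np.eye(n_basis) + beta * phi.T @ phi
    s_n = np.linalg.inv(s_n_inv)
    m_n = beta * s_n @ phi.T @ t
    return m_n, s_n

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                      # assumed prior and noise precisions
x = rng.uniform(-1.0, 1.0, 20)
t = -0.3 + 0.5 * x + rng.normal(0.0, 0.2, 20)
phi = np.column_stack([np.ones_like(x), x])  # basis functions: 1 and x

traces = {}
for n in (0, 1, 2, 20):
    # Use only the first n observations, mirroring the 0/1/2/20-point slides.
    m_n, s_n = posterior(phi[:n], t[:n], alpha, beta)
    traces[n] = np.trace(s_n)
    print(n, m_n, traces[n])
```

With no data the posterior equals the prior; each added point tightens the posterior covariance, which is what the sequence of posterior plots on the following slides depicts.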

52
Bayesian Linear Regression (3)
0 data points observed
Prior
Data Space
53
Bayesian Linear Regression (4)
1 data point observed
Likelihood
Posterior
Data Space
54
Bayesian Linear Regression (5)
2 data points observed
Likelihood
Posterior
Data Space
55
Bayesian Linear Regression (6)
20 data points observed
Likelihood
Posterior
Data Space
56
Predictive Distribution (1)
  • Predict t for new values of x by integrating over
    w
  • where
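The predictive distribution and the term after "where" were lost in extraction. A reconstruction following PRML Section 3.3.2, assuming the isotropic prior with precision α, is:

```latex
\begin{align}
p(t \mid \mathbf{x}, \mathbf{t}, \alpha, \beta)
  &= \int p(t \mid \mathbf{x}, \mathbf{w}, \beta) \, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta) \, d\mathbf{w}
   = \mathcal{N}\!\left(t \mid \mathbf{m}_N^{\mathsf{T}} \boldsymbol{\phi}(\mathbf{x}), \, \sigma_N^2(\mathbf{x})\right) \\
\sigma_N^2(\mathbf{x}) &= \frac{1}{\beta} + \boldsymbol{\phi}(\mathbf{x})^{\mathsf{T}} \mathbf{S}_N \, \boldsymbol{\phi}(\mathbf{x})
\end{align}
```

The first term of the variance is the noise on the data; the second reflects the remaining uncertainty in w.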

57
Predictive Distribution (2)
  • Example: sinusoidal data, 9 Gaussian basis
    functions, 1 data point

58
Predictive Distribution (3)
  • Example: sinusoidal data, 9 Gaussian basis
    functions, 2 data points

59
Predictive Distribution (4)
  • Example: sinusoidal data, 9 Gaussian basis
    functions, 4 data points

60
Predictive Distribution (5)
  • Example: sinusoidal data, 9 Gaussian basis
    functions, 25 data points
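The sequence of examples (1, 2, 4, and 25 data points) can be imitated with the sketch below. It assumes 9 Gaussian basis functions of width 0.1 centred evenly on [0, 1], a zero-mean isotropic prior with α = 2, and noise precision β = 25; the widths, precisions, and function names are assumptions, since the transcript gives only the number of basis functions and data points.

```python
import numpy as np

def gaussian_basis(x, centers, width):
    """Matrix of Gaussian basis functions evaluated at each x."""
    return np.exp(-0.5 * ((x[:, None] - centers[None, :]) / width) ** 2)

def predictive(x_new, x, t, centers, width, alpha, beta):
    """Predictive mean and variance at x_new given observations (x, t)."""
    phi = gaussian_basis(x, centers, width)
    s_n = np.linalg.inv(alpha * np.eye(len(centers)) + beta * phi.T @ phi)
    m_n = beta * s_n @ phi.T @ t
    phi_new = gaussian_basis(x_new, centers, width)
    mean = phi_new @ m_n
    # 1/beta is the noise variance; the second term is phi^T S_N phi per row.
    var = 1.0 / beta + np.sum((phi_new @ s_n) * phi_new, axis=1)
    return mean, var

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0
centers = np.linspace(0.0, 1.0, 9)
x = rng.uniform(0.0, 1.0, 25)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, 25)
x_new = np.linspace(0.0, 1.0, 50)

variances = {}
for n in (1, 2, 4, 25):
    mean, var = predictive(x_new, x[:n], t[:n], centers, 0.1, alpha, beta)
    variances[n] = var
    print(n, var.max())
```

The predictive variance never drops below the noise level 1/β, and it can only shrink as more points are observed, which matches the narrowing uncertainty band across the four slides.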