Ch 3. Linear Models for Regression (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.


1
Ch 3. Linear Models for Regression (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
  • Previously summarized by Yung-Kyun Noh
  • Modified and presented by Rhee, Je-Keun
  • Biointelligence Laboratory, Seoul National
    University
  • http://bi.snu.ac.kr/

2
Contents
  • 3.1 Linear Basis Function Models
  • 3.1.1 Maximum likelihood and least squares
  • 3.1.2 Geometry of least squares
  • 3.1.3 Sequential learning
  • 3.1.4 Regularized least squares
  • 3.1.5 Multiple outputs
  • 3.2 The Bias-Variance Decomposition
  • 3.3 Bayesian Linear Regression
  • 3.3.1 Parameter distribution
  • 3.3.2 Predictive distribution
  • 3.3.3 Equivalent kernel

3
Linear Basis Function Models
  • Linear regression: the simplest linear model, a linear combination of the input variables themselves
  • Linear model: linear in the parameters w
  • Using basis functions φj(x) allows y to be a nonlinear function of the input vector x.
  • Linearity in the parameters simplifies the analysis of this class of models, but also causes some significant limitations.
  • M: total number of parameters
  • φj(x): basis functions (φ0(x) = 1: dummy basis function)
  • w = (w0, ..., wM−1)ᵀ, φ = (φ0, ..., φM−1)ᵀ (see the model below)

4
Basis Functions
  • Polynomial functions
  • Global functions of the input variable
  • → spline functions
  • Gaussian basis functions
  • Sigmoidal basis functions
  • Logistic sigmoid functions
  • Fourier basis → wavelets
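The standard forms of these basis functions (from PRML; μj are centres and s a scale parameter):

```latex
% Polynomial, Gaussian, and sigmoidal basis functions
\phi_j(x) = x^j
\qquad
\phi_j(x) = \exp\!\Big(-\frac{(x-\mu_j)^2}{2s^2}\Big)
\qquad
\phi_j(x) = \sigma\!\Big(\frac{x-\mu_j}{s}\Big),\quad
\sigma(a) = \frac{1}{1+e^{-a}}
```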

5
Maximum Likelihood and Least Squares (1/3)
  • Assumption: Gaussian noise model
  • ε: zero-mean Gaussian random variable with precision (inverse variance) β
  • Result: a unimodal conditional distribution, with conditional mean E[t|x] = y(x, w)
  • For a dataset of inputs X = {x1, ..., xN} with target values t = (t1, ..., tN)ᵀ
  • Likelihood (dropping the explicit x), shown below
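The noise model and likelihood this slide displayed (standard form from PRML):

```latex
% Gaussian noise model and the resulting likelihood
t = y(\mathbf{x}, \mathbf{w}) + \epsilon,
\qquad
p(t \mid \mathbf{x}, \mathbf{w}, \beta)
  = \mathcal{N}\big(t \mid y(\mathbf{x}, \mathbf{w}),\, \beta^{-1}\big)
\\
p(\mathbf{t} \mid \mathbf{w}, \beta)
  = \prod_{n=1}^{N}
    \mathcal{N}\big(t_n \mid \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n),\, \beta^{-1}\big)
```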

6
Maximum Likelihood and Least Squares (2/3)
  • Log-likelihood
  • Maximization of the likelihood function under a
    conditional Gaussian noise distribution for a
    linear model is equivalent to minimizing a
    sum-of-squares error function.
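The log-likelihood and the sum-of-squares error it contains (from PRML):

```latex
% Log-likelihood and sum-of-squares error function
\ln p(\mathbf{t} \mid \mathbf{w}, \beta)
  = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta\, E_D(\mathbf{w}),
\qquad
E_D(\mathbf{w})
  = \frac{1}{2}\sum_{n=1}^{N}
    \big\{t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\big\}^2
```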

7
Maximum Likelihood and Least Squares (3/3)
  • The gradient of the log likelihood function
  • Setting this gradient to zero gives the maximum likelihood solution wML (the normal equations, below)
  • Φ: the N×M design matrix, with elements Φnj = φj(xn)
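The gradient and the normal equations this slide displayed (from PRML; (3.15) is cited later in the deck):

```latex
% Gradient of the log likelihood and the normal equations
\nabla \ln p(\mathbf{t} \mid \mathbf{w}, \beta)
  = \beta \sum_{n=1}^{N}
    \big\{t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\big\}
    \boldsymbol{\phi}(\mathbf{x}_n)^{\mathsf T}
\\
\mathbf{w}_{\mathrm{ML}}
  = \big(\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}
    \boldsymbol{\Phi}^{\mathsf T}\mathbf{t}
\quad (3.15),
\qquad \Phi_{nj} = \phi_j(\mathbf{x}_n)
```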

8
Bias and Precision Parameter by ML
  • Other quantities follow by setting the corresponding derivatives of the log likelihood to zero.
  • Bias w0, maximizing the log likelihood:
  • The bias compensates for the difference between the averages (over the training set) of the target values and the weighted sum of the averages of the basis function values.
  • Noise precision parameter β, maximizing the log likelihood:
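The two ML solutions this slide displayed (standard forms from PRML):

```latex
% ML solutions for the bias and the noise precision
w_0 = \bar{t} - \sum_{j=1}^{M-1} w_j\, \bar{\phi}_j,
\qquad
\bar{t} = \frac{1}{N}\sum_{n=1}^{N} t_n,
\quad
\bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N} \phi_j(\mathbf{x}_n)
\\
\frac{1}{\beta_{\mathrm{ML}}}
  = \frac{1}{N}\sum_{n=1}^{N}
    \big\{t_n - \mathbf{w}_{\mathrm{ML}}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\big\}^2
```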

9
Geometry of Least Squares
  • If the number M of basis functions is smaller than the number N of data points, then the M vectors φj will span a linear subspace S of dimensionality M.
  • φj: the jth column of Φ
  • y: a linear combination of the φj, so y lies in S
  • The least-squares solution for w corresponds to
    that choice of y that lies in subspace S and that
    is closest to t.
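In symbols, the least-squares prediction vector is the orthogonal projection of t onto S:

```latex
% Geometry of least squares: orthogonal projection of t onto S
\mathbf{y}
  = \boldsymbol{\Phi}\,\mathbf{w}_{\mathrm{ML}}
  = \boldsymbol{\Phi}
    \big(\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}
    \boldsymbol{\Phi}^{\mathsf T}\,\mathbf{t}
```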

10
Sequential Learning
  • On-line learning
  • Technique of stochastic gradient descent (or sequential gradient descent)
  • For the case of the sum-of-squares error function, this yields the least-mean-squares (LMS) algorithm, shown below
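The update rules this slide displayed (from PRML; η is the learning rate):

```latex
% Stochastic gradient descent, and the LMS update for squared error
\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\, \nabla E_n
\\
\mathbf{w}^{(\tau+1)}
  = \mathbf{w}^{(\tau)}
  + \eta\,\big(t_n - \mathbf{w}^{(\tau)\mathsf T}\boldsymbol{\phi}_n\big)\boldsymbol{\phi}_n,
\qquad \boldsymbol{\phi}_n = \boldsymbol{\phi}(\mathbf{x}_n)
```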

11
Regularized Least Squares
  • Regularized least squares
  • Controls over-fitting
  • Total error function: data error ED(w) plus λ times the regularizer EW(w) = ½wᵀw
  • Closed-form solution (setting the gradient to zero), shown below
  • This represents a simple extension of the least-squares solution.
  • A more general regularizer uses the penalty (λ/2)Σj|wj|^q
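The quadratic (q = 2) case has the closed form below (standard result from PRML), followed by a minimal NumPy sketch; the toy data and names (rng, Phi, lam) are illustrative, not from the slides.

```latex
% Regularized sum-of-squares error and its minimizer
\frac{1}{2}\sum_{n=1}^{N}
  \big\{t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\big\}^2
  + \frac{\lambda}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w}
\;\Longrightarrow\;
\mathbf{w}
  = \big(\lambda \mathbf{I} + \boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}
    \boldsymbol{\Phi}^{\mathsf T}\mathbf{t}
```

```python
import numpy as np

# Minimal sketch of the regularized least-squares solution
# w = (lambda*I + Phi^T Phi)^{-1} Phi^T t, on toy sinusoidal data.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)

M = 9                                   # number of polynomial basis functions
Phi = np.vander(x, M, increasing=True)  # design matrix: Phi[n, j] = x[n]**j
lam = 1e-3                              # regularization coefficient lambda

w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```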

12
General Regularizer
  • The case q = 1 of the general regularizer is known as the lasso in the statistical literature.
  • If λ is sufficiently large, some of the coefficients wj are driven to zero.
  • Sparse model: the corresponding basis functions play no role.
  • Equivalent view: minimizing the unregularized sum-of-squares error subject to the constraint below.
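The constrained formulation this slide displayed (from PRML; η is a bound related to λ):

```latex
% Equivalent constrained view of the general regularizer
\min_{\mathbf{w}} \;
  \frac{1}{2}\sum_{n=1}^{N}
  \big\{t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\big\}^2
\quad \text{subject to} \quad
  \sum_{j=1}^{M} |w_j|^{q} \le \eta
```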

[Figure: contours of the regularization term for different q; the lasso (q = 1) gives a sparse solution.]
13
Regularization complexity
  • Regularization allows complex models to be
    trained on data sets of limited size without
    severe over-fitting, essentially by limiting the
    effective model complexity.
  • However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient λ.

14
Multiple Outputs
  • For K > 1 target variables:
  • 1. Introduce a different set of basis functions for each component of t.
  • 2. Use the same set of basis functions to model all of the components of the target vector (W: M×K matrix of parameters).
  • For each target vector tk, the solution decouples, as shown below.
  • Φ†: the pseudo-inverse of Φ
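The shared-basis solution this slide displayed (from PRML; T is the N×K matrix of targets):

```latex
% Multiple outputs with a shared set of basis functions
\mathbf{y}(\mathbf{x}, \mathbf{W})
  = \mathbf{W}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}),
\qquad
\mathbf{W}_{\mathrm{ML}}
  = \big(\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}
    \boldsymbol{\Phi}^{\mathsf T}\mathbf{T}
\\
\mathbf{w}_k
  = \big(\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}\big)^{-1}
    \boldsymbol{\Phi}^{\mathsf T}\mathbf{t}_k
  = \boldsymbol{\Phi}^{\dagger}\,\mathbf{t}_k
```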

15
The Bias-Variance Decomposition (1/4)
  • Frequentist viewpoint of the model complexity issue: the bias-variance trade-off.
  • Expected squared loss
  • Bayesian view: the uncertainty in our model is expressed through a posterior distribution over w.
  • Frequentist view: make a point estimate of w based on the data set D.

In the expected squared loss (below), the second term arises from the intrinsic noise on the data; the first term depends on the particular dataset D through the choice of y(x).
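The decomposition this slide displayed (from PRML; h(x) = E[t|x] is the optimal regression function):

```latex
% Expected squared loss
\mathbb{E}[L]
  = \int \{y(\mathbf{x}) - h(\mathbf{x})\}^2\, p(\mathbf{x})\,\mathrm{d}\mathbf{x}
  + \iint \{h(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t
```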
16
The Bias-Variance Decomposition (2/4)
  • Bias
  • The extent to which the average prediction over
    all data sets differs from the desired regression
    function.
  • Variance
  • The extent to which the solutions for individual
    data sets vary around their average.
  • The extent to which the function y(x; D) is sensitive to the particular choice of data set.
  • Expected loss = (bias)² + variance + noise
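The three terms this slide displayed (standard forms from PRML):

```latex
% Bias-variance decomposition of the expected loss
(\text{bias})^2
  = \int \big\{\mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\big\}^2
    p(\mathbf{x})\,\mathrm{d}\mathbf{x}
\\
\text{variance}
  = \int \mathbb{E}_{\mathcal{D}}\Big[
      \big\{y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})]\big\}^2
    \Big] p(\mathbf{x})\,\mathrm{d}\mathbf{x}
\\
\text{noise}
  = \iint \{h(\mathbf{x}) - t\}^2\, p(\mathbf{x}, t)\,\mathrm{d}\mathbf{x}\,\mathrm{d}t
```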

17
The Bias-Variance Decomposition (3/4)
  • ⇒ bias-variance trade-off
  • Averaging many solutions for the complex model (M = 25) is a beneficial procedure.
  • A weighted averaging (although with respect to
    the posterior distribution of parameters, not
    with respect to multiple data sets) of multiple
    solutions lies at the heart of Bayesian approach.

18
The Bias-Variance Decomposition (4/4)
  • The average prediction over L data sets
  • Bias and variance, estimated from the ensemble as shown below
  • The bias-variance decomposition is based on averages with respect to ensembles of data sets (frequentist perspective). Given multiple data sets, we would be better off combining them into a single large training set.
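The ensemble estimates this slide displayed (from PRML):

```latex
% Average prediction and ensemble estimates of bias and variance
\bar{y}(x) = \frac{1}{L}\sum_{l=1}^{L} y^{(l)}(x)
\\
(\text{bias})^2
  = \frac{1}{N}\sum_{n=1}^{N} \big\{\bar{y}(x_n) - h(x_n)\big\}^2,
\qquad
\text{variance}
  = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{L}\sum_{l=1}^{L}
    \big\{y^{(l)}(x_n) - \bar{y}(x_n)\big\}^2
```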

19
Bayesian Linear Regression
  • Model complexity cannot be decided simply by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting.
  • Independent hold-out data can be used to
    determine model complexity, but this can be both
    computationally expensive and wasteful of
    valuable data.
  • A Bayesian treatment of linear regression will avoid the over-fitting problem of maximum likelihood, and will also lead to automatic methods of determining model complexity using the training data alone.

20
Parameter distribution (1/3)
  • Conjugate prior for the likelihood: a Gaussian over w
  • Posterior: also Gaussian
  • The maximum posterior weight vector: wMAP = mN
  • If S0 = α−1I with α → 0, the mean mN reduces to wML given by (3.15).
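The prior and posterior this slide displayed (standard forms from PRML):

```latex
% Gaussian prior and the resulting Gaussian posterior over w
p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)
\quad\Longrightarrow\quad
p(\mathbf{w} \mid \mathbf{t}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)
\\
\mathbf{m}_N
  = \mathbf{S}_N\big(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}\big),
\qquad
\mathbf{S}_N^{-1}
  = \mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}
```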

21
Parameter distribution (2/3)
  • Consider a zero-mean isotropic Gaussian prior p(w | α)
  • Corresponding posterior
  • Log of the posterior
  • Maximization of this posterior distribution with respect to w is equivalent to the minimization of the sum-of-squares error function with the addition of a quadratic regularization term with λ = α/β.
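The prior, posterior, and log posterior this slide displayed (from PRML):

```latex
% Isotropic prior, its posterior parameters, and the log posterior
p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}),
\qquad
\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^{\mathsf T}\mathbf{t},
\qquad
\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi}
\\
\ln p(\mathbf{w} \mid \mathbf{t})
  = -\frac{\beta}{2}\sum_{n=1}^{N}
      \big\{t_n - \mathbf{w}^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}_n)\big\}^2
    - \frac{\alpha}{2}\,\mathbf{w}^{\mathsf T}\mathbf{w} + \text{const}
```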

22
Parameter distribution (3/3)
  • Other forms of prior over parameters
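One such generalization considered in PRML is the exponential-power prior, for which maximizing the posterior corresponds to minimizing the regularized error with the general |wj|^q penalty:

```latex
% Generalized Gaussian prior over w (q = 2 recovers the Gaussian prior)
p(\mathbf{w} \mid \alpha)
  = \left[\frac{q}{2}\Big(\frac{\alpha}{2}\Big)^{1/q}\frac{1}{\Gamma(1/q)}\right]^{M}
    \exp\!\Big(-\frac{\alpha}{2}\sum_{j=1}^{M}|w_j|^{q}\Big)
```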

23
Predictive Distribution (1/2)
  • Our real interest: the predictive distribution of t for new inputs x, obtained by marginalizing over w.
  • Its variance contains a noise term 1/β and a term expressing the uncertainty associated with the parameters w, which goes to 0 as N → ∞.

[Figure: mean of the Gaussian predictive distribution (red line) and predictive uncertainty (shaded region) as the number of data points increases.]
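The predictive distribution this slide displayed (standard result from PRML):

```latex
% Predictive distribution and its variance
p(t \mid \mathbf{t}, \alpha, \beta)
  = \int p(t \mid \mathbf{w}, \beta)\, p(\mathbf{w} \mid \mathbf{t}, \alpha, \beta)\,\mathrm{d}\mathbf{w}
  = \mathcal{N}\big(t \mid \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(\mathbf{x}),\, \sigma_N^2(\mathbf{x})\big)
\\
\sigma_N^2(\mathbf{x})
  = \underbrace{\tfrac{1}{\beta}}_{\text{noise}}
  + \underbrace{\boldsymbol{\phi}(\mathbf{x})^{\mathsf T}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x})}
      _{\text{uncertainty in } \mathbf{w};\ \to\, 0 \text{ as } N \to \infty}
```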
24
Predictive Distribution (2/2)
Plots of y(x, w) for weight vectors w drawn as samples from the posterior distribution over w.
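A minimal NumPy sketch of this procedure, assuming Gaussian basis functions and a zero-mean isotropic prior; the data, precisions (alpha, beta), and all variable names are illustrative, not from the slides:

```python
import numpy as np

# Sample weight vectors from the posterior N(w | m_N, S_N) and
# evaluate the corresponding regression curves y(x, w) = w^T phi(x).
rng = np.random.default_rng(0)

def phi(x, mu, s=0.1):
    """Gaussian basis functions at inputs x, plus a constant bias term."""
    x = np.atleast_1d(x)
    return np.column_stack([np.ones_like(x)] +
                           [np.exp(-(x - m) ** 2 / (2 * s ** 2)) for m in mu])

alpha, beta = 2.0, 25.0            # assumed prior and noise precisions
mu = np.linspace(0, 1, 9)          # basis-function centres
x_train = rng.uniform(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(0, beta ** -0.5, 10)

Phi = phi(x_train, mu)
S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t_train

# Each sampled w gives one plausible regression curve.
w_samples = rng.multivariate_normal(m_N, S_N, size=5)
x_grid = np.linspace(0, 1, 100)
curves = phi(x_grid, mu) @ w_samples.T   # shape (100, 5)
```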
25
Equivalent Kernel (1/2)
  • If we substitute (3.53) into the expression (3.3), we see that the predictive mean can be written as a linear combination of the training-set target values tn, as shown below.
  • Mean of the predictive distribution at a point x.

k(x, x′): the smoother matrix or equivalent kernel
[Figure: equivalent kernels for polynomial and sigmoidal basis functions.]
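The linear-smoother form this slide displayed (standard result from PRML):

```latex
% Predictive mean as a linear smoother; the equivalent kernel
y(\mathbf{x}, \mathbf{m}_N)
  = \mathbf{m}_N^{\mathsf T}\boldsymbol{\phi}(\mathbf{x})
  = \sum_{n=1}^{N} \beta\,\boldsymbol{\phi}(\mathbf{x})^{\mathsf T}
      \mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}_n)\, t_n
  = \sum_{n=1}^{N} k(\mathbf{x}, \mathbf{x}_n)\, t_n
\\
k(\mathbf{x}, \mathbf{x}')
  = \beta\,\boldsymbol{\phi}(\mathbf{x})^{\mathsf T}\mathbf{S}_N\boldsymbol{\phi}(\mathbf{x}')
```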
26
Equivalent Kernel (2/2)
  • Instead of introducing a set of basis functions,
    which implicitly determines an equivalent kernel,
    we can instead define a localized kernel directly
    and use this to make predictions for new input
    vector x, given the observed training set.
  • This leads to a practical framework for
    regression (and classification) called Gaussian
    processes.
  • The equivalent kernel satisfies an important property shared by kernel functions in general, namely that it can be expressed in the form of an inner product with respect to a vector ψ(x) of nonlinear functions, as shown below.
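The inner-product form this slide displayed (from PRML):

```latex
% Equivalent kernel as an inner product of nonlinear functions
k(\mathbf{x}, \mathbf{z})
  = \boldsymbol{\psi}(\mathbf{x})^{\mathsf T}\boldsymbol{\psi}(\mathbf{z}),
\qquad
\boldsymbol{\psi}(\mathbf{x})
  = \beta^{1/2}\,\mathbf{S}_N^{1/2}\,\boldsymbol{\phi}(\mathbf{x})
```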