Title: Ch 3. Linear Models for Regression (1/2) Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
1. Ch 3. Linear Models for Regression (1/2)
Pattern Recognition and Machine Learning, C. M. Bishop, 2006.
- Previously summarized by Yung-Kyun Noh
- Modified and presented by Rhee, Je-Keun
- Biointelligence Laboratory, Seoul National University - http://bi.snu.ac.kr/
2. Contents
- 3.1 Linear Basis Function Models
- 3.1.1 Maximum likelihood and least squares
- 3.1.2 Geometry of least squares
- 3.1.3 Sequential learning
- 3.1.4 Regularized least squares
- 3.1.5 Multiple outputs
- 3.2 The Bias-Variance Decomposition
- 3.3 Bayesian Linear Regression
- 3.3.1 Parameter distribution
- 3.3.2 Predictive distribution
- 3.3.3 Equivalent kernel
3. Linear Basis Function Models
- Linear regression
- Linear model: y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = w^T φ(x)
- Linearity in the parameters
  - Simplifies the analysis of this class of models
  - Has some significant limitations
- Using basis functions allows y(x, w) to be a nonlinear function of the input vector x.
- M: total number of parameters
- φ_j(x): basis functions (φ_0(x) = 1: the dummy basis function)
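As a rough illustration (not from the slides), the sketch below builds y(x, w) = w^T φ(x) with a polynomial basis; the weight values and the choice M = 4 are arbitrary assumptions.

```python
# A minimal sketch of a linear basis function model with polynomial basis
# functions phi_j(x) = x^j, so that phi_0(x) = 1 is the dummy basis function.
import numpy as np

def polynomial_design_matrix(x, M):
    """N x M design matrix Phi with Phi[n, j] = x_n ** j."""
    return np.vander(x, M, increasing=True)

def predict(x, w):
    """Evaluate y(x, w) = w^T phi(x) for each input in x."""
    return polynomial_design_matrix(x, len(w)) @ w

# Example: a cubic model (M = 4 parameters, values assumed) on a few inputs.
w = np.array([0.5, -1.0, 0.3, 0.05])
x = np.linspace(-1.0, 1.0, 5)
print(predict(x, w))
```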
4. Basis Functions
- Polynomial functions
  - Global functions of the input variable; an alternative is spline functions, which divide the input space into regions
- Gaussian basis functions: φ_j(x) = exp(−(x − μ_j)^2 / (2s^2))
- Sigmoidal basis functions: φ_j(x) = σ((x − μ_j)/s)
  - Logistic sigmoid function: σ(a) = 1 / (1 + exp(−a))
- Fourier basis, wavelets
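A small sketch of the Gaussian and sigmoidal basis functions; the centres μ_j and the scale s below are arbitrary choices made only for illustration.

```python
# Gaussian and sigmoidal basis functions with centres mu_j and common scale s.
import numpy as np

def gaussian_basis(x, mu, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)); returns an N x M matrix."""
    return np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2.0 * s ** 2))

def sigmoidal_basis(x, mu, s):
    """phi_j(x) = sigma((x - mu_j) / s) with the logistic sigmoid."""
    a = (x[:, None] - mu[None, :]) / s
    return 1.0 / (1.0 + np.exp(-a))

x = np.linspace(-1.0, 1.0, 200)
mu = np.linspace(-1.0, 1.0, 9)      # 9 basis function centres (assumed)
print(gaussian_basis(x, mu, s=0.2).shape, sigmoidal_basis(x, mu, s=0.2).shape)
```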
5. Maximum Likelihood and Least Squares (1/3)
- Assumption: Gaussian noise model, t = y(x, w) + ε
  - ε: a zero-mean Gaussian random variable with precision (inverse variance) β
- Result: p(t | x, w, β) = N(t | y(x, w), β^{-1})
- Conditional mean (unimodal): E[t | x] = y(x, w)
- For a dataset X = {x_1, ..., x_N} with targets t = (t_1, ..., t_N)^T
- Likelihood (dropping the explicit x):
  p(t | w, β) = Π_{n=1}^N N(t_n | w^T φ(x_n), β^{-1})
6. Maximum Likelihood and Least Squares (2/3)
- Log-likelihood:
  ln p(t | w, β) = (N/2) ln β − (N/2) ln(2π) − β E_D(w),
  where E_D(w) = (1/2) Σ_{n=1}^N (t_n − w^T φ(x_n))^2
- Maximization of the likelihood function under a conditional Gaussian noise distribution for a linear model is equivalent to minimizing the sum-of-squares error function E_D(w).
7. Maximum Likelihood and Least Squares (3/3)
- The gradient of the log-likelihood function:
  ∇ ln p(t | w, β) = β Σ_{n=1}^N (t_n − w^T φ(x_n)) φ(x_n)^T
- Setting this gradient to zero gives the normal equations
  w_ML = (Φ^T Φ)^{-1} Φ^T t,
  where Φ is the N×M design matrix with elements Φ_{nj} = φ_j(x_n).
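A minimal sketch of this maximum likelihood solution via the pseudo-inverse of the design matrix; the sinusoidal data and polynomial basis are assumed only for illustration.

```python
# Maximum likelihood solution w_ML = (Phi^T Phi)^{-1} Phi^T t, computed with
# the Moore-Penrose pseudo-inverse of the design matrix.
import numpy as np

rng = np.random.default_rng(0)
N, M = 50, 4
x = rng.uniform(-1.0, 1.0, N)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.1, N)   # targets with Gaussian noise

Phi = np.vander(x, M, increasing=True)            # N x M polynomial design matrix
w_ml = np.linalg.pinv(Phi) @ t                    # least-squares / ML solution
print(w_ml)
```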
8. Bias and Precision Parameter by ML
- Some other solutions can be obtained by setting the corresponding derivatives to zero.
- Bias maximizing the log-likelihood:
  w_0 = t̄ − Σ_{j=1}^{M−1} w_j φ̄_j,  with t̄ = (1/N) Σ_n t_n and φ̄_j = (1/N) Σ_n φ_j(x_n)
- The bias compensates for the difference between the average (over the training set) of the target values and the weighted sum of the averages of the basis function values.
- Noise precision parameter maximizing the log-likelihood:
  1/β_ML = (1/N) Σ_{n=1}^N (t_n − w_ML^T φ(x_n))^2
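Under the same assumed setup (recreated below so the snippet runs on its own), β_ML is simply the inverse of the mean squared residual of the fitted model.

```python
# ML noise precision: 1/beta_ML is the mean squared residual of the ML fit.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 50)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.1, 50)  # noise std 0.1 (assumed)
Phi = np.vander(x, 4, increasing=True)
w_ml = np.linalg.pinv(Phi) @ t

beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)    # inverse mean squared residual
print(beta_ml)                                    # roughly 1 / 0.1^2 = 100
```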
9. Geometry of Least Squares
- If the number M of basis functions is smaller than the number N of data points, then the M vectors φ_j (the jth columns of Φ) span a linear subspace S of dimensionality M.
- y = Φw is a linear combination of the φ_j.
- The least-squares solution for w corresponds to the choice of y that lies in subspace S and is closest to t, i.e. the orthogonal projection of t onto S.
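A quick numerical check of this geometric picture (with arbitrary random data): the residual of the least-squares fit is orthogonal to every column of Φ, as expected for an orthogonal projection onto S.

```python
# The least-squares fit y = Phi w_ML is the orthogonal projection of t onto
# the column space of Phi, so the residual t - y is orthogonal to each phi_j.
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(50, 4))                 # N = 50 data points, M = 4 basis functions
t = rng.normal(size=50)
w_ml = np.linalg.pinv(Phi) @ t
residual = t - Phi @ w_ml
print(np.allclose(Phi.T @ residual, 0.0))      # True: residual is orthogonal to S
```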
10. Sequential Learning
- On-line learning
- Technique of stochastic gradient descent (or sequential gradient descent):
  w^(τ+1) = w^(τ) − η ∇E_n
- For the case of the sum-of-squares error function (least-mean-squares, or the LMS, algorithm):
  w^(τ+1) = w^(τ) + η (t_n − (w^(τ))^T φ(x_n)) φ(x_n)
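A minimal sketch of the LMS update; the learning rate η, the number of passes, and the synthetic data are all assumed values.

```python
# LMS / sequential gradient descent: w <- w + eta * (t_n - w^T phi_n) * phi_n,
# applied one data point at a time.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 200)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.1, 200)
Phi = np.vander(x, 4, increasing=True)

eta = 0.05                                     # learning rate (assumed)
w = np.zeros(Phi.shape[1])
for epoch in range(20):
    for phi_n, t_n in zip(Phi, t):             # sequential presentation of data
        w += eta * (t_n - w @ phi_n) * phi_n
print(w)
```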
11. Regularized Least Squares
- Regularized least squares: control over-fitting.
- Total error function: E_D(w) + λ E_W(w)
  = (1/2) Σ_{n=1}^N (t_n − w^T φ(x_n))^2 + (λ/2) w^T w
- Closed-form solution (setting the gradient to zero):
  w = (λI + Φ^T Φ)^{-1} Φ^T t
- This represents a simple extension of the least-squares solution.
- A more general regularizer:
  (1/2) Σ_{n=1}^N (t_n − w^T φ(x_n))^2 + (λ/2) Σ_{j=1}^M |w_j|^q
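A minimal sketch of the closed-form regularized (quadratic, q = 2) solution; the value of λ, the basis, and the data below are assumed for illustration.

```python
# Regularized least squares (ridge): w = (lambda*I + Phi^T Phi)^{-1} Phi^T t.
import numpy as np

def ridge_solution(Phi, t, lam):
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 50)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.1, 50)
Phi = np.vander(x, 10, increasing=True)           # deliberately flexible model
print(ridge_solution(Phi, t, lam=1e-3))           # lambda = 1e-3 (assumed)
```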
12. General Regularizer
- The case q = 1 in the general regularizer is known as the lasso in the statistical literature.
- If λ is sufficiently large, some of the coefficients w_j are driven to zero.
- Sparse model: the corresponding basis functions play no role.
- Equivalent view: minimizing the unregularized sum-of-squares error subject to the constraint Σ_{j=1}^M |w_j|^q ≤ η.
(Figure: contours of the regularization term for different values of q; the lasso gives a sparse solution.)
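For illustration only, the sketch below uses scikit-learn's Lasso (its alpha parameter plays the role of λ) on assumed polynomial features, to show coefficients being driven exactly to zero.

```python
# Lasso (q = 1 regularizer): sufficiently strong regularization produces a
# sparse weight vector, i.e. some coefficients are exactly zero.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 100)
t = np.sin(np.pi * x) + rng.normal(0.0, 0.1, 100)
Phi = np.vander(x, 10, increasing=True)[:, 1:]    # drop constant column; Lasso fits an intercept

model = Lasso(alpha=0.05)                         # alpha ~ lambda (assumed value)
model.fit(Phi, t)
print(model.coef_)                                # several entries are exactly 0 -> sparse model
```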
13. Regularization and Complexity
- Regularization allows complex models to be trained on data sets of limited size without severe over-fitting, essentially by limiting the effective model complexity.
- However, the problem of determining the optimal model complexity is then shifted from one of finding the appropriate number of basis functions to one of determining a suitable value of the regularization coefficient λ.
14. Multiple Outputs
- For K > 1 target variables:
- 1. Introduce a different set of basis functions for each component of t.
- 2. Use the same set of basis functions to model all of the components of the target vector (W: M×K matrix of parameters).
- With shared basis functions, W_ML = (Φ^T Φ)^{-1} Φ^T T.
- For each variable t_k: w_k = Φ^† t_k, where Φ^† is the pseudo-inverse of Φ.
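A minimal sketch of the shared-basis multiple-output solution; the two target components below are assumed. The problem decouples into one independent regression per output dimension.

```python
# Multiple outputs: W_ML = pinv(Phi) @ T for an N x K target matrix T.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, 50)
T = np.column_stack([np.sin(np.pi * x), np.cos(np.pi * x)])   # N x K targets, K = 2
T += rng.normal(0.0, 0.1, T.shape)
Phi = np.vander(x, 6, increasing=True)                        # N x M design matrix

W_ml = np.linalg.pinv(Phi) @ T                                # M x K parameter matrix
print(W_ml.shape)                                             # (6, 2)
```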
15. The Bias-Variance Decomposition (1/4)
- Frequentist viewpoint of the model complexity issue: the bias-variance trade-off.
- Expected squared loss:
  E[L] = ∫ (y(x) − h(x))^2 p(x) dx + ∫∫ (h(x) − t)^2 p(x, t) dx dt,  where h(x) = E[t | x]
  - The second term arises from the intrinsic noise on the data.
  - The first term depends on the choice of y(x), which is dependent on the particular dataset D.
- Bayesian: the uncertainty in our model is expressed through a posterior distribution over w.
- Frequentist: make a point estimate of w based on the data set D.
16. The Bias-Variance Decomposition (2/4)
- Bias: the extent to which the average prediction over all data sets differs from the desired regression function.
- Variance: the extent to which the solutions for individual data sets vary around their average, i.e. the extent to which the function y(x; D) is sensitive to the particular choice of data set.
- Expected loss = (bias)^2 + variance + noise
17. The Bias-Variance Decomposition (3/4)
- Very flexible models have low bias and high variance, while rigid models have high bias and low variance → the bias-variance trade-off.
- Averaging many solutions for the complex model (M = 25) is a beneficial procedure.
- A weighted averaging of multiple solutions (although with respect to the posterior distribution of parameters, not with respect to multiple data sets) lies at the heart of the Bayesian approach.
18. The Bias-Variance Decomposition (4/4)
- The average prediction: ȳ(x) = (1/L) Σ_{l=1}^L y^(l)(x)
- Bias and variance:
  (bias)^2 = (1/N) Σ_{n=1}^N (ȳ(x_n) − h(x_n))^2
  variance = (1/N) Σ_{n=1}^N (1/L) Σ_{l=1}^L (y^(l)(x_n) − ȳ(x_n))^2
- The bias-variance decomposition is based on averages with respect to ensembles of data sets (frequentist perspective). If we actually had multiple data sets, we would be better off combining them into a single large training set.
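A small simulation sketch of this decomposition, assuming a sinusoidal true function h(x), regularized polynomial fits, and L = 100 independently drawn data sets.

```python
# Estimate (bias)^2 and variance by averaging predictions over L data sets.
import numpy as np

rng = np.random.default_rng(0)
L, N, M, lam = 100, 25, 10, 1e-3
x_test = np.linspace(-1.0, 1.0, 100)
h = np.sin(np.pi * x_test)                         # true regression function h(x) (assumed)
Phi_test = np.vander(x_test, M, increasing=True)

preds = np.empty((L, x_test.size))
for l in range(L):
    x = rng.uniform(-1.0, 1.0, N)
    t = np.sin(np.pi * x) + rng.normal(0.0, 0.3, N)
    Phi = np.vander(x, M, increasing=True)
    w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
    preds[l] = Phi_test @ w                        # y^(l)(x) for data set l

y_bar = preds.mean(axis=0)                         # average prediction
bias_sq = np.mean((y_bar - h) ** 2)
variance = np.mean(preds.var(axis=0))
print(bias_sq, variance)
```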
19. Bayesian Linear Regression
- Model complexity cannot be decided simply by maximizing the likelihood function, because this always leads to excessively complex models and over-fitting.
- Independent hold-out data can be used to determine model complexity, but this can be both computationally expensive and wasteful of valuable data.
- A Bayesian treatment of linear regression will avoid the over-fitting problem of maximum likelihood, and will also lead to automatic methods of determining model complexity using the training data alone.
20. Parameter distribution (1/3)
- Conjugate prior of the likelihood: p(w) = N(w | m_0, S_0)
- Posterior: p(w | t) = N(w | m_N, S_N), where
  m_N = S_N (S_0^{-1} m_0 + β Φ^T t),  S_N^{-1} = S_0^{-1} + β Φ^T Φ
- The maximum posterior weight vector is simply w_MAP = m_N (the mode of a Gaussian coincides with its mean).
- If S_0 = α^{-1} I with α → 0, the mean m_N reduces to w_ML given by (3.15).
21. Parameter distribution (2/3)
- Consider the prior p(w | α) = N(w | 0, α^{-1} I)
- Corresponding posterior: p(w | t) = N(w | m_N, S_N), with
  m_N = β S_N Φ^T t,  S_N^{-1} = αI + β Φ^T Φ
- Log of the posterior:
  ln p(w | t) = −(β/2) Σ_{n=1}^N (t_n − w^T φ(x_n))^2 − (α/2) w^T w + const
- Maximization of this posterior distribution with respect to w is therefore equivalent to minimization of the sum-of-squares error function with the addition of a quadratic regularization term with λ = α/β.
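A minimal sketch of the posterior computation for this isotropic prior; the values of α and β and the synthetic data are assumed for illustration.

```python
# Posterior over w for the prior N(w | 0, alpha^{-1} I):
# S_N^{-1} = alpha*I + beta*Phi^T Phi,  m_N = beta * S_N Phi^T t.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 25.0                            # assumed prior and noise precisions
x = rng.uniform(-1.0, 1.0, 30)
t = np.sin(np.pi * x) + rng.normal(0.0, beta ** -0.5, 30)
Phi = np.vander(x, 6, increasing=True)

S_N = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t                       # posterior mean = MAP estimate
print(m_N)
```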
22. Parameter distribution (3/3)
- Other forms of prior over parameters
23. Predictive Distribution (1/2)
- Predictive distribution:
  p(t | t, α, β) = ∫ p(t | w, β) p(w | t, α, β) dw = N(t | m_N^T φ(x), σ_N^2(x))
- Predictive variance: σ_N^2(x) = 1/β + φ(x)^T S_N φ(x)
  - 1/β: the noise on the data
  - φ(x)^T S_N φ(x): the uncertainty associated with the parameters w; it goes to 0 as N → ∞
(Figure: mean of the Gaussian predictive distribution (red line) and predictive uncertainty (shaded region) as the number of data points increases.)
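A minimal sketch of the predictive mean and variance, repeating the assumed α, β and data from the previous sketch so that it runs on its own.

```python
# Predictive mean m_N^T phi(x) and variance 1/beta + phi(x)^T S_N phi(x).
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, M = 2.0, 25.0, 6                      # assumed values
x = rng.uniform(-1.0, 1.0, 30)
t = np.sin(np.pi * x) + rng.normal(0.0, beta ** -0.5, 30)
Phi = np.vander(x, M, increasing=True)

S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.linspace(-1.0, 1.0, 5)
Phi_new = np.vander(x_new, M, increasing=True)
pred_mean = Phi_new @ m_N
pred_var = 1.0 / beta + np.einsum('ij,jk,ik->i', Phi_new, S_N, Phi_new)
print(pred_mean, pred_var)
```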
24. Predictive Distribution (2/2)
- Draw samples from the posterior distribution over w.
25. Equivalent Kernel (1/2)
- If we substitute (3.53) into the expression (3.3), we see that the predictive mean can be written in the form
  y(x, m_N) = Σ_{n=1}^N k(x, x_n) t_n,
  i.e. the mean of the predictive distribution at a point x is a linear combination of the training-set target values.
- k(x, x') = β φ(x)^T S_N φ(x'): known as the smoother matrix or the equivalent kernel.
(Figure: equivalent kernels for polynomial and sigmoidal basis functions.)
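A small numerical check (under the same assumed setup as before): the matrix of equivalent-kernel values reproduces the predictive mean as a weighted sum of the training targets.

```python
# Equivalent kernel k(x, x') = beta * phi(x)^T S_N phi(x');
# the predictive mean equals sum_n k(x, x_n) t_n.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, M = 2.0, 25.0, 6                      # assumed values
x = rng.uniform(-1.0, 1.0, 30)
t = np.sin(np.pi * x) + rng.normal(0.0, beta ** -0.5, 30)
Phi = np.vander(x, M, increasing=True)
S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

x_new = np.linspace(-1.0, 1.0, 5)
Phi_new = np.vander(x_new, M, increasing=True)
K = beta * Phi_new @ S_N @ Phi.T                   # k(x_new_i, x_n)
print(np.allclose(K @ t, Phi_new @ m_N))           # True: same predictive mean
```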
26. Equivalent Kernel (2/2)
- Instead of introducing a set of basis functions, which implicitly determines an equivalent kernel, we can instead define a localized kernel directly and use it to make predictions for a new input vector x, given the observed training set.
- This leads to a practical framework for regression (and classification) called Gaussian processes.
- The equivalent kernel satisfies an important property shared by kernel functions in general, namely that it can be expressed in the form of an inner product with respect to a vector ψ(x) of nonlinear functions:
  k(x, z) = ψ(x)^T ψ(z),  where ψ(x) = β^{1/2} S_N^{1/2} φ(x).