1
Lecture 5&6: Using Optimization Theory for Parameter Estimation
  • Dr Martin Brown
  • Room: E1k
  • Email: martin.brown@manchester.ac.uk
  • Telephone: 0161 306 4672
  • http://www.eee.manchester.ac.uk/intranet/pg/coursematerial/

2
Lecture 5&6 Outline
  • Parameter estimation using optimization
  • Review of parameter estimation
  • MSE (mean squared error) performance function
  • Gradient descent parameter estimation
  • Convergence/stability analysis
  • Non-linear models, Newton's method and Quadratic Programming
  • The goal of this lecture is to show how gradient descent algorithms can
    be used to learn/estimate a model's parameters, and to consider their
    limitations and extensions.

3
Lecture 5&6 Resources
  • These slides are largely self-contained, but extra background material
    can be found in:
  • Chapter 2, Machine Learning (MIT OpenCourseWare), T. Jaakkola,
    http://ocw.mit.edu/OcwWeb/Electrical-Engineering-and-Computer-Science/6-867Machine-LearningFall2002/CourseHome/index.htm
  • An Introduction to the Conjugate Gradient Method Without the Agonizing
    Pain, J.S. Shewchuk, 1994, Technical Report,
    http://www.cs.cmu.edu/quake-papers/painless-conjugate-gradient.ps
  • Adaptive Signal Processing, Widrow & Stearns, Prentice Hall, 1985
  • Advanced:
  • Practical Methods of Optimization, R. Fletcher, 2nd Edition, Wiley

4
Parameter Estimation Framework
  • The basic goal of machine learning is:
  • Given a set of data that describes the problem, estimate the model's
    parameters in some "best" sense.
  • There exists a data set of the form (regression problem):
    D = {(xi, yi)}, i = 1, ..., l
  • There exists a model of the form ŷ(x) = m(θ, x), where θ is the
    parameter vector (generally including a bias term)
  • There exists a measure of performance, f(θ)
  • Note that there exist many variations on this basic form
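As a concrete illustration of this framework, here is a minimal Matlab sketch; the data set, the linear model and the 1/2 scaling of the MSE are illustrative assumptions, not part of the lecture.

    % Hypothetical regression data set {(xi, yi)}, i = 1..l
    l = 100;
    x = linspace(0, 1, l)';            % scalar input
    X = [ones(l,1), x];                % design matrix with a bias column
    theta_true = [0.5; 2.0];           % "true" parameters (assumed)
    y = X*theta_true + 0.1*randn(l,1); % noisy observations

    % Linear-in-the-parameters model: y_hat(x) = x'*theta
    theta = [0; 0];                    % an initial parameter estimate
    y_hat = X*theta;                   % model predictions

    % One possible performance measure: (half the) mean squared error
    f = mean((y - y_hat).^2) / 2;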



5
Parameter Estimation Goal
  • For a fixed model and data set, calculate the
    optimal value of the parameters that minimise the
    performance function
  • Open questions
  • How does f(θ) depend on θ?
  • How do we refine the current estimate, θk, of θ?

[Figure: one-parameter view of parameter optimisation, plotting f(θ) against θ with the current estimate θk marked]


6
Direct Solutions and Iterative Estimation
  • Direct solutions, such as the generalized inverse θ* = (XᵀX)⁻¹Xᵀy,
    are only possible for linear models with quadratic performance functions
  • In general, an iterative approach is needed
  • Desire: f(θk+1) ≤ f(θk)
  • On to gradient descent learning

[Figure: f(θ) plotted against θ, showing an iterative step from the current estimate θk to θk+1]
7
Review: Mean Squared Error Performance
  • The quadratic (mean) squared error performance function is the most
    widely used
  • This is because:
  • Closed form solution for linear models ŷ = xᵀθ
  • Simply invert the matrix
  • Local Taylor series approximation for non-linear models
  • Analytic representation of gradient and second order methods
  • Gaussian interpretation (log likelihood), as it represents a scaled
    (by l/2) estimate of the measurement noise variance
  • Exercise: show that mean(y - ŷ) = 0 for an optimal linear model
    (a numerical check follows below)
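A small numerical check of this exercise's claim; a sketch only, with assumed synthetic data (the property relies on the design matrix containing a bias column of ones):

    % Least-squares fit of a linear model y_hat = X*theta
    l = 200;
    X = [ones(l,1), randn(l,2)];          % design matrix including a bias column
    y = X*[1; -2; 0.5] + 0.2*randn(l,1);  % synthetic targets (assumed)

    theta_opt = (X'*X) \ (X'*y);          % closed-form least-squares solution
    residual  = y - X*theta_opt;

    mean(residual)   % approximately zero: mean(y - y_hat) = 0 at the optimum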




8
Review: Quadratic MSE for Linear Models
  • For a linear (in the parameters) model of the form ŷ = xᵀθ
  • The quadratic function is determined by:
  • The Hessian, which is the (scaled) variance/covariance matrix,
    of size (n+1)×(n+1)
  • The correlation vector, of size (n+1)×1

[Figure: quadratic MSE surface f plotted over the two parameters θ1 and θ2]
9
Review: Normal Equations
  • When the parameter vector is optimal, the gradient is zero: ∇f(θ*) = 0
  • For a quadratic MSE with a linear model
  • At optimality: XᵀXθ* = Xᵀy
  • These are the normal equations for least squares modelling
  • Using this, the performance function can be expressed as
    f(θ) = ½(θ - θ*)ᵀ H (θ - θ*) + c

for some constant c
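A sketch of the normal equations in Matlab (the synthetic data are an assumption; a 1/l scaling of the gradient is also assumed, since any positive scaling leaves the normal equations unchanged):

    l = 200;
    X = [ones(l,1), randn(l,2)];          % design matrix (assumed data)
    y = X*[1; -2; 0.5] + 0.2*randn(l,1);

    H = X'*X;                 % (scaled) Hessian of the quadratic MSE
    g = X'*y;                 % correlation vector
    theta_opt = H \ g;        % solves the normal equations X'*X*theta = X'*y

    grad = (H*theta_opt - g) / l;   % MSE gradient: zero (up to rounding) at theta_opt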
10
Gradient Descent Learning
  • The basic aim of gradient descent learning:
  • Given the current parameter estimate and the gradient vector of the
    performance function with respect to the parameters, update the
    parameters along the negative gradient:
    θk+1 = θk - η∇f(θk)
  • For a linear model with an MSE performance function
  • Batch gradient descent learning gives (similar to LMS):
    θk+1 = θk + η Xᵀ(y - Xθk)   (up to the scaling used in the MSE)

η > 0 is the learning rate
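A minimal batch gradient descent sketch for the linear model/MSE case (the data, the 1/l gradient scaling and the learning rate are illustrative assumptions):

    l = 200;
    X = [ones(l,1), randn(l,2)];
    y = X*[1; -2; 0.5] + 0.2*randn(l,1);

    eta   = 0.1;            % learning rate, eta > 0
    theta = zeros(3,1);     % initial parameter estimate
    for k = 1:100
        grad  = X'*(X*theta - y) / l;   % gradient of the MSE at theta_k
        theta = theta - eta*grad;       % step along the negative gradient
    end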
11
Gradients and Contours
[Figure: contours of f(θ), with the gradient vector drawn perpendicular to the contour tangent at a point]
  • The gradient/negative gradient is perpendicular to the tangent to the
    contour at the current point.
  • Exercise: Prove this using a Taylor series (one possible argument is
    sketched below)
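One way to carry out this exercise (a sketch, not necessarily the intended derivation): take a small step δ along the contour through θ and expand to first order,

    f(θ + δ) ≈ f(θ) + δᵀ∇f(θ)

Since f is constant along the contour, f(θ + δ) = f(θ), so δᵀ∇f(θ) = 0; that is, the gradient is orthogonal to every tangent direction δ of the contour.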

12
2D Visualisation of Gradient Descent
20 iterations of gradient descent learning:
θ0 = [-0.5, 1.5]ᵀ,  θ* = [1, 1]ᵀ,  H = [0.333 0.166; 0.166 0.333],  η = 1

  • Exercise: Obtain a gradient expression using only H, θ* and θk (see Slide 9)
  • Exercise: Implement this scenario as a Matlab script (a sketch follows below)
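A sketch of such a script, assuming (per the exercise and Slide 9) that the gradient of the quadratic MSE is H(θk - θ*); treat the details as one possible reading of the slide:

    theta_star = [1; 1];                    % optimal parameters
    H     = [0.333 0.166; 0.166 0.333];     % Hessian from the slide
    eta   = 1;                              % learning rate
    theta = [-0.5; 1.5];                    % initial estimate theta_0

    hist = theta;
    for k = 1:20
        grad  = H*(theta - theta_star);     % gradient of the quadratic MSE
        theta = theta - eta*grad;
        hist  = [hist, theta];              % store the learning history
    end
    plot(hist(1,:), hist(2,:), 'o-');       % trajectory in parameter space
    xlabel('\theta_1'); ylabel('\theta_2');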

13
Pictorial Stability Investigation
[Figure panels: gradient descent learning trajectories for η = 1, η = 3, η = 4 and η = 5]
14
Stability of Linear Gradient Descent
  • What values of η give a stable learning process?
  • What values of η reduce the parameter errors?
  • We need all the eigenvalues of (I - ηXᵀX) to have a magnitude < 1
  • Let σ1, σ2, ..., σn be the positive eigenvalues of XᵀX; then the
    eigenvalues of (I - ηXᵀX) are given by (1 - ησi)
  • These are all < 1 in magnitude when 0 < η < 2/σi for all i
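A quick numerical check of this bound (a sketch with assumed data; if the MSE carries a 1/l scaling, the bound rescales accordingly):

    l = 200;
    X = [ones(l,1), randn(l,2)];

    sigma = eig(X'*X);              % positive eigenvalues of X'X
    eta   = 0.9 * 2/max(sigma);     % any 0 < eta < 2/max(sigma) is stable
    M     = eye(size(X,2)) - eta*(X'*X);
    max(abs(eig(M)))                % < 1, so the parameter error contracts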

15
Non-linear Gradient Descent
  • When a non-linear model is used, the MSE is no longer a quadratic
    function of the parameters
  • Taylor series estimate:
    f(θ) ≈ f(θk) + (θ - θk)ᵀ∇f(θk) + ½(θ - θk)ᵀH(θk)(θ - θk)
  • Non-linear models can be locally approximated by a linear model and
    gradient descent applied in this situation (a sketch follows the figure
    below); however, the performance surface may now have:
  • local minima
  • plateau regions
  • a non-constant Hessian matrix

[Figure: non-quadratic performance surface f(θ) against θ, showing local minima and plateau regions]
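A gradient descent sketch for a simple non-linear model; the model ŷ = exp(θ1 + θ2·x), the data and the learning rate are all illustrative assumptions:

    l = 100;
    x = linspace(0, 1, l)';
    y = exp(0.5 + 1.5*x) + 0.05*randn(l,1);  % data from an assumed non-linear model

    theta = [0; 0];       % initial estimate
    eta   = 0.02;         % small learning rate: the MSE is no longer quadratic
    for k = 1:5000
        m  = exp(theta(1) + theta(2)*x);     % model predictions
        r  = y - m;                          % residuals
        J  = [m, x.*m];                      % Jacobian of m w.r.t. theta (l-by-2)
        grad  = -(J'*r) / l;                 % gradient of the MSE
        theta = theta - eta*grad;            % descend on the local approximation
    end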
16
2nd Order Methods: Newton's Algorithm
  • Gradient descent is known as a 1st order
    algorithm because it only uses knowledge about
    the 1st derivative (gradient)
  • Newton's method: Δθk = ηs, where s = -H⁻¹∇f(θk) is the Newton
    search direction
  • 2nd order
  • Requires a matrix inversion
  • Gradient descent moves perpendicular to the
    local contours
  • Slow for badly conditioned systems (steep sided,
    flat bottomed valleys)
  • Can we not move directly to the optimum?

[Figure: contours with the gradient descent direction and the Newton search direction s]
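A sketch of one damped Newton step for the linear model/MSE case (the data and scaling are assumptions; for a quadratic performance function a full step, η = 1, lands on the optimum):

    l = 200;
    X = [ones(l,1), randn(l,2)];
    y = X*[1; -2; 0.5] + 0.2*randn(l,1);

    theta = zeros(3,1);                % current estimate theta_k
    H     = X'*X / l;                  % Hessian of the MSE
    grad  = X'*(X*theta - y) / l;      % gradient at theta_k

    s     = -H \ grad;                 % Newton direction (matrix solve, not just the gradient)
    eta   = 1;                         % eta = 1 reaches the optimum of a quadratic
    theta = theta + eta*s;             % delta theta_k = eta*s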

17
Lecture 5&6 Summary
  • Often, the central problem of machine learning is
    parameter estimation
  • Using MSE and linear models is a quadratic
    optimisation problem for which there exists a
    closed form solution
  • Gradient descent can be used to recursively
    estimate parameters for both quadratic and
    non-quadratic performance functions, linear and
    non-linear models
  • Learning is stable when 0 < η < mini 2/σi
  • Learning is slow when σmax >> σmin (the ratio σmax/σmin is known as
    the condition number)
  • Higher order learning methods (Newton, ...) are possible; although they
    are more computationally costly, they converge much more quickly.

18
Lecture 5&6 Laboratory
  • Matlab
  • Obtain the gradient expression and implement the
    gradient descent learning scenario described on
    Slide 12. Make sure you obtain similar stability
    results to Slide 13.
  • Use the contour command to superimpose the
    contours of the quadratic performance function,
    on the learning history in (1).
  • Modify the LMS update algorithm from lab 2 so that the scenario now
    performs a single batch gradient descent step at the end of a complete
    pass through the data set (see Slide 10), rather than an update after
    every pattern is presented to the model. How do the learning
    trajectories compare (LMS vs. batch gradient descent)?
  • Theory
  • Show that H = XᵀX is positive definite (non-singular)
  • Show that the instantaneous Hessian Hk = xk xkᵀ has a single non-zero
    eigenvalue, and calculate its corresponding eigenvector. How does this
    value compare with the stability constraint for the LMS algorithm?
  • Do the exercises on Slides 7, 11 and 12.