1
Lectures 3 & 4: Linear Machine Learning Algorithms
  • Dr Martin Brown
  • Room E1k
  • Email martin.brown@manchester.ac.uk
  • Telephone 0161 306 4672
  • http://www.csc.umist.ac.uk/msc/intranet/EE-M016

2
Lectures 3 & 4: Outline
  • Linear classification using the Perceptron
  • Classification problem
  • Linear classifier and decision boundary
  • Perceptron learning rule
  • Proof of convergence
  • Recursive linear regression using LMS
  • Modelling and recursive parameter estimation
  • Linear models and quadratic performance function
  • LMS and NLMS learning rules
  • Proof of convergence

3
Lectures 3 & 4: Learning Objectives
  • Understand what classification and regression
    machine learning techniques are and their
    differences
  • Describe how linear models can be used for both
    classification and regression problems
  • Prove convergence of the learning algorithms for
    linear relationships, subject to restrictive
    conditions
  • Understand the restrictions of these basic proofs
  • Develop basic framework that will be expanded on
    in subsequent lectures

4
Lectures 3 & 4: Resources
  • Classification/Perceptron
  • An introduction to Support Vector Machines and
    other kernel-based learning methods, N
    Cristianini, J Shawe-Taylor, CUP, 2000
  • Regression/LMS
  • Adaptive Signal Processing, Widrow & Stearns,
    Prentice Hall, 1985
  • Many other sources are available (on-line).

5
What is Classification?
  • Classification is also known as (statistical)
    pattern recognition
  • The aim is to build a machine/algorithm that can
    assign appropriate qualitative labels to new,
    previously unseen quantitative data using a
    priori knowledge and/or information contained in
    a training set. The patterns to be classified
    are usually groups of measurements/observations
    that are believed to be informative for the
    classification task.
  • Example: face recognition

[Diagram: training data D = {X, y} and prior knowledge are used to design/learn a classifier m(θ, x), which then predicts the class label ŷ for a new pattern x]
6
Classification Training Data
  • To supply training data for a classifier,
    examples must be collected that contain both
    positive (examples of the class) and negative
    (examples of other classes) instances. These are
    qualitative target class values and are stored as
    +1 and -1 for the positive and negative
    instances respectively, generated by an expert or
    by observation.
  • The quantitative input features should be
    informative
  • The training set should contain enough examples
    to be able to build statistically significant
    decisions

How to encode qualitative target and input
features?
7
Bayes Class Priors
  • Classification is all about decision making using
    the concept of minimum risk
  • Imagine that the training data contains 100
    examples, 70 of them are class 1 (c1), 30 are
    class 2 (c2)
  • If I have to decide which class an unknown
    example belongs to, which decision is optimal?
  • Errors if the decision is class 1:
    1 - p(c1) = p(c2) = 0.3
  • Errors if the decision is class 2:
    1 - p(c2) = p(c1) = 0.7
  • Minimum risk decision is therefore class 1
  • p(c1) and p(c2) are known as the Bayes priors; they
    represent the baseline performance for any
    classifier. They are derived from the training
    data as simple percentages
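
As a rough illustration of the last point, a Matlab sketch (variable names and values are illustrative, not from the lecture) that estimates the priors from a label vector and makes the minimum-risk decision:

  % Estimate Bayes priors as simple percentages of the training labels
  y = [ones(70,1); -ones(30,1)];   % 70 class-1 and 30 class-2 examples
  pc1 = mean(y == 1);              % p(c1) = 0.7
  pc2 = mean(y == -1);             % p(c2) = 0.3
  if pc1 >= pc2                    % with no other information, pick the
      decision = +1;               % class with the larger prior
  else
      decision = -1;
  end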

8
Structure of a Linear Classifier
  • Given a set of quantitative features x, a linear
    classifier has the form ŷ = sgn(xTθ)
  • The sgn() function is used to produce the
    qualitative class label (+1/-1)
  • The class/decision boundary is determined when
    xTθ = 0
  • This is an (n-1)D hyperplane in feature space.
  • In 2-dimensional feature space the boundary is the
    line θ0 + θ1x1 + θ2x2 = 0 (see the code sketch below)
  • How does the sign and magnitude of θ affect the
    decision boundary?

[Plot: a linear decision boundary in the (x1, x2) feature plane]
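
A minimal Matlab sketch of the classifier structure above, with illustrative parameter values (not taken from the slide), showing the prediction and the gradient/intercept form of the 2-D boundary:

  theta = [-0.5; 1.0; 2.0];        % [theta0; theta1; theta2], bias included
  x = [1; 0.3; 0.4];               % augmented pattern [x0 = 1; x1; x2]
  yhat = sign(theta' * x);         % qualitative class label (+1/-1)
  % Boundary x'*theta = 0, i.e. x2 = -(theta0 + theta1*x1)/theta2
  gradient  = -theta(2)/theta(3);
  intercept = -theta(1)/theta(3);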
9
Simple Example: Fisher's Iris Data
  • Famous example of building classifiers for a
    problem with 3 types of Iris flowers and 4
    measurements about the flower
  • Sepal length and width
  • Petal length and width
  • 150 examples were collected, 50 from each class
  • Build 3 separate classifiers, one for recognizing
    examples of each class
  • Data is shown, plotted against last two features,
    as well as two linear classifiers for the Setosa
    and Virginica classes

Calculate θ in Laboratory 3 & 4
10
Perceptron Linear Classifier
  • The Perceptron linear classifier was devised by
    Rosenblatt in 1956
  • It comprises a linear classifier (as just
    discussed) and a simple parameter update rule of
    the form θ ← θ + η yk xk (see the sketch below)
  • Cyclically present each training pattern (xk, yk)
    to the linear classifier
  • When an error (misclassification) is made, update
    the parameters
  • where η > 0 is the learning rate.
  • The bias term can be included as θ0 with an extra
    feature x0 = 1
  • Continue until there are no prediction errors
  • Perceptron convergence theorem If the data set is
    linearly separable, the perceptron learning
    algorithm will converge to an optimal separator
    in a finite time
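
A minimal Matlab sketch of one error-driven update, under assumed values for the learning rate, pattern and target (not the slide's numbers):

  eta = 0.1;                       % learning rate, eta > 0
  theta = zeros(3,1);              % parameters, bias folded in as theta0
  xk = [1; 0; 1];                  % augmented pattern [x0 = 1; x1; x2]
  yk = -1;                         % target class label
  if sign(theta' * xk) ~= yk       % update only on a misclassification
      theta = theta + eta * yk * xk;   % move theta towards the missed pattern
  end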

11
Instantaneous Parameter Update
  • What does this look like?
  • The parameters are updated to make them more like
    the incorrect feature vector.
  • After updating, the parameters are closer to the
    correct decision

[Diagram: the error-driven update in feature/parameter space (x1, θ1), (x2, θ2), and the sgn() output y, ŷ taking values +1/-1]
12
Perceptron Convergence Proof Preamble
  • Basic aim is to minimise the number of
    mis-classifications
  • This is generally an NP-complete problem
  • We've assumed that there is an optimal solution
    with 0 errors
  • This is similar to Least Squares recursive
    estimation
  • Performance: Σi (yi - ŷi)² = 4 × (number of errors)
  • Except that the sgn() makes it a non-quadratic
    optimization problem
  • Updating only when there are errors is the same as
    applying an update proportional to the error,
    θ ← θ + (η/2)(yk - ŷk) xk, on every sample, with or
    without errors (the correction is zero when the
    prediction is correct)
  • Sometimes drawn as a network


This is error-driven parameter estimation: repeatedly
cycle through the data set D, drawing out each sample
(xk, yk)
[Network diagram: the input xk feeds the linear model, whose output ŷk is compared with the target yk to form the error that drives the parameter update]
13
Convergence Analysis of the Perceptron (i)
  • If a linearly separable data set D is repeatedly
    presented to a Perceptron, then the learning
    procedure is guaranteed to converge (no errors)
    in a finite time
  • If the data set is linearly separable, there
    exist optimal parameters θ* such that
    yi xiTθ* > 0 for all i = 1, ..., l
  • Note that positive scalar multiples of θ* are also
    optimal parameter vectors
  • Consider the positive quantity γ defined by
    γ = mini yi xiTθ*, such that ||θ*|| = 1
  • This is a concept known as the classification
    margin
  • Assume also that the feature vectors are bounded
    by ||xi|| ≤ R

14
Convergence Analysis of the Perceptron (ii)
  • To show convergence, we need to establish that at
    the kth iteration, when an error has occurred, the
    distance between the updated parameters and a
    (suitably scaled) optimal parameter vector θ*
    decreases by at least a fixed amount α > 0
  • Using the update formula θk+1 = θk + η yk xk, expand
    the squared parameter error and use the facts that
    yk xkTθk ≤ 0 (an error occurred), yk xkTθ* ≥ γ and
    ||xk|| ≤ R
[Diagram: the parameter vectors θk, θk+1 and θ* in (θ1, θ2) parameter space]
To finish the proof, select the scaling of θ* so that
the guaranteed decrease α is strictly positive
15
Convergence Analysis of the Perceptron (iii)
  • To show this terminates in a finite number of
    iterations, simply note that α is independent of
    the current training sample, so the parameter error
    must decrease by at least this amount at each update
    iteration. As the initial error is finite (θ0 = 0,
    say), there must exist a finite number of steps
    before the parameter error is reduced to zero.
  • Note also that α is proportional to the size of
    the feature vectors (R²) and inversely
    proportional to the size of the margin (γ). Both
    of these will influence the number of update
    iterations when the Perceptron is learning


16
Example of Perceptron (i)
  • Consider modelling the logical AND data using a
    Perceptron

Is the data linearly separable?



k = 0:  θ = (0.01, 0.1, 0.006)
k = 5:  θ = (-0.98, 1.11, 1.01)
k = 18: θ = (-2.98, 2.11, 1.01)
[Three plots: the decision boundary in the (x1, x2) plane after k = 0, 5 and 18 updates]
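
A Matlab sketch of cycling the same rule over the logical AND data until a clean pass; the 0/1 input encoding, zero initial parameters and the learning rate are assumptions rather than the slide's values:

  X = [ones(4,1) [0 0; 0 1; 1 0; 1 1]];   % four patterns with bias x0 = 1
  y = [-1; -1; -1; 1];                    % AND: positive only for (1,1)
  theta = zeros(3,1);                     % assumed initial parameters
  eta = 0.5;                              % assumed learning rate
  errors = 1;
  while errors > 0                        % cycle until a full pass has no mistakes
      errors = 0;
      for k = 1:4
          if sign(X(k,:) * theta) ~= y(k)
              theta = theta + eta * y(k) * X(k,:)';
              errors = errors + 1;
          end
      end
  end

The while loop is the convergence theorem in action: it terminates because the AND data are linearly separable.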
17
Example: Parameter Trajectory (ii)
Lab exercise: calculate by hand the first 4
iterations of the learning scheme
18
Classification Margin
  • In this proof, we assumed that there exists a
    single, optimal parameter vector.
  • In practice, when the data is linearly separable,
    there are an infinite number; simply requiring
    correct classification results in an ill-posed
    problem
  • The classification margin can be defined as the
    minimum distance of the decision boundary to a
    point in that class
  • Used in deriving Support Vector Machines

[Plots: decision boundaries and the classification margin in the (x1, x2) feature plane, with the sgn() output taking values +1/-1]
19
Classification Summary
  • Classification is the task of assigning an
    object, described by a feature vector, to one of
    a set of mutually exclusive groups
  • A linear classifier has a linear decision
    boundary
  • The perceptron training algorithm is guaranteed
    to converge in a finite time when the data set is
    linearly separable
  • The final boundary is determined by the initial
    values and the order of presentation of the data

20
Definition of Regression
  • Regression is a (statistical) methodology that
    utilizes the relation between two or more
    quantitative variables so that one variable can
    be predicted from the other, or others.
  • Examples
  • Sales of a product can be predicted by using the
    relationship between sales volume and amount of
    advertising
  • The performance of an employee can be predicted
    by using the relationship between performance and
    aptitude tests
  • The size of a child's vocabulary can be predicted
    by using the relationship between the vocabulary
    size, the child's age and the parents'
    educational input.

21
Regression Problem Visualisation
[Plot: training data and the fitted curve, showing the targets y and predictions ŷ against the input x]
  • Data generated by an underlying relationship plus
    noise
  • Estimate the model parameters
  • Predict a real value (fit a curve to the data)
  • Predictive performance: average error

22
Probabilistic Prediction Output
  • An output of 12 with rmse/standard deviation of
    1.5: within a small region close to the query
    point, the average target value was 12 and the
    standard deviation within that region was 1.5
    (variance 2.25)

m(y|x) = 12
2σ(e) = 3
95% of the data lies in the range m ± 2σ = 12 ± 2(1.5) = [9, 15]
23
Structure of a Linear Regression Model
  • Given a set of features x, a linear predictor has
    the form ŷ = xTθ
  • The output is a real-valued, quantitative
    variable
  • The bias term can be included as an extra feature
    x0 = 1. This renames the bias parameter as θ0.
  • Most linear control system models do not
    explicitly include a bias term; why is this?
  • Similar to the Toluca example in week 1.

24
Least Mean Squares Learning
  • Least Mean Squares (LMS) proposed by Widrow 1962
  • This is a (non-optimal) sequential parameter
    estimation procedure for a linear model
  • NB: compared to classification, both yk and ŷk
    are quantitative variables, so the error/noise
    signal (yk - ŷk) is generally non-zero. Similar to
    the Perceptron, but no sgn() threshold on xTθ. η is
    again the positive learning rate (see the sketch
    below).
  • Widely used in filtering/signal processing and
    adaptive control applications
  • Cheap version of sequential/recursive parameter
    estimation
  • The normalised version (NLMS) was developed by
    Kaczmarz in 1937
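
A minimal Matlab sketch of a single LMS step, with assumed variable names and values:

  eta = 0.1;                       % positive learning rate
  theta = zeros(2,1);              % parameters, bias folded in as theta0
  xk = [1; 0.7];                   % augmented input [x0 = 1; x1]
  yk = 0.4;                        % quantitative target
  e = yk - theta' * xk;            % error; no sgn() threshold on x'*theta
  theta = theta + eta * e * xk;    % LMS parameter update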




25
Proof of LMS Convergence (i)
  • If a noise-free data set containing a linear
    relationship x → y is repeatedly presented to a
    linear model, then the LMS algorithm is
    guaranteed to update the parameters so that they
    converge to their optimal values, assuming the
    learning rate is sufficiently small.
  • Note
  • Assume there is no measurement noise in the
    target data
  • Assume the data is generated from a linear
    relationship
  • Parameter estimation will take an infinite time
    to converge to the optimal values
  • Rate of convergence and stability depend on the
    learning rate

26
Proof of Convergence (ii)
  • To show convergence, we need to establish that at
    the kth iteration, when an error has occurred, the
    parameter error decreases
  • Using the update formula θk+1 = θk + η (yk - ŷk) xk,
    the parameter error decreases when the learning rate
    satisfies 0 < η < 2/(xkTxk)
27
Example: LMS Learning
  • Consider the target linear model y = 1 - 2x,
    where the inputs are drawn from a normal
    distribution with zero mean, unit variance
  • Data set consisted of 25 data points, and
    involved 10 cycles through the data set
  • η = 0.1

[Plots: the model fit ŷ against x after k = 0, 5 and 100 updates, and the parameter trajectories θ0 and θ1 against the iteration number k]
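
A Matlab sketch replicating the setup described on this slide (25 normally distributed inputs, 10 cycles, η = 0.1); since the data are random, the exact trajectory will differ from the plots:

  x = randn(25,1);                 % inputs: zero mean, unit variance
  y = 1 - 2*x;                     % noise-free targets from y = 1 - 2x
  X = [ones(25,1) x];              % bias folded in as x0 = 1
  theta = zeros(2,1);
  eta = 0.1;
  for cycle = 1:10                 % 10 cycles through the data set
      for k = 1:25
          e = y(k) - X(k,:) * theta;
          theta = theta + eta * e * X(k,:)';
      end
  end
  % theta should now be close to the true values [1; -2]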
28
Stability and NLMS
  • To normalise the LMS algorithm and remove the
    dependency of η on the input vector size,
    consider θ ← θ + η (yk - ŷk) xk / (xkTxk)
  • This learning algorithm is stable for 0 < η < 2
    (exercise).
  • When η = 1, the NLMS algorithm has the property
    that the error, on that datum, after adaptation
    is zero, i.e. the updated parameters fit that
    sample exactly (see the sketch below)
  • Exercise: prove this.
  • Is this desirable when the target contains
    (measurement) noise?
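
A Matlab sketch checking the η = 1 property on a single, arbitrarily chosen datum: after one NLMS step the error on that same datum is zero.

  theta = [0.3; -0.1];             % arbitrary current parameters
  xk = [1; 2];   yk = 0.5;         % one datum
  eta = 1;                         % NLMS with unit learning rate
  e_before = yk - theta' * xk;
  theta = theta + eta * e_before * xk / (xk' * xk);   % normalised LMS step
  e_after = yk - theta' * xk;      % zero (to rounding error) when eta = 1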

29
Regression Summary
  • Regression is a (statistical) technique for
    predicting real-valued outputs, given a
    quantitative feature vector
  • Typically, it is assumed that the dependent,
    target variable is corrupted by Gaussian noise,
    and this is unpredictable.
  • The aim is then to fit the underlying
    linear/non-linear signal.
  • The LMS algorithm is a simple, cheap gradient
    descent technique for updating the linear
    parameter estimates
  • The parameters will converge to their correct
    values when the target does not contain any
    noise, otherwise they will oscillate in a zone
    around the optimum.
  • Stability of the algorithm depends on the
    learning rate

30
Lectures 3 & 4: Summary
  • This lecture has looked at basic (linear)
    classification and regression techniques
  • Investigated basic linear model structure
  • Proposed simple, on-line learning rules
  • Proved convergence for simple environments
  • Discussed the practicality of the machine
    learning algorithms
  • While these algorithms are rarely used in this
    form, their structure has strongly influenced the
    development of more advanced techniques
  • Support vector machines
  • Multi-layer perceptrons
  • which will be studied in the coming weeks

31
Laboratory 3 & 4: Perceptron/LMS
  • Download the irisClassifier.m and iris.mat Matlab
    files that contain a simple GUI for displaying
    the Iris data and entering decision boundaries
  • Enter parameters that create suitable decision
    boundaries for both the Setosa and Virginica
    classes
  • Which of the three classes are linearly
    separable?
  • Make sure you can translate between the
    classifier's parameters, θ, and the
    gradient/intercept coordinate systems. Also
    ensure that the output is +1 (rather than -1) in
    the appropriate region
  • Download the irisPerceptron.m and perceptron.m
    Matlab files that contain the Perceptron
    algorithm for the Iris data
  • Run the algorithm and note how the decision
    boundary changes when a point is
    correctly/incorrectly classified
  • Modify the learning rate and note the effect it
    has on the convergence rate and final values

32
Laboratory 3 & 4: Perceptron/LMS (ii)
  • Copy and modify the irisPerceptron.m Matlab file
    so that it runs on the logical AND and OR
    classification functions (see slides 16 & 17).
    Each should contain 2 features and four training
    patterns. Make sure you can calculate the
    updates by hand, as required on Slide 17.
  • Create a Matlab implementation of the example given
    in Slide 27 for the LMS algorithm with a simple,
    single input linear model
  • What values of η cause the LMS algorithm to
    become unstable?
  • Can this ever happen with the Perceptron
    algorithm?
  • Modify this implementation to use the NLMS
    training rule
  • Verify that learning is always stable for 0 < η < 2.
  • Complete the two (pen and paper) exercises on
    Slide 28.
  • How might this insight be used with the
    Perceptron algorithm to implement a dynamic
    learning rate?