1
Lectures 3 & 4: Linear Machine Learning Algorithms
  • Dr Martin Brown
  • Room E1k
  • Email martin.brown@manchester.ac.uk
  • Telephone 0161 306 4672
  • http://www.csc.umist.ac.uk/msc/intranet/EE-M016

2
Lectures 3 & 4: Outline
  • Linear classification using the Perceptron
  • Classification problem
  • Linear classifier and decision boundary
  • Perceptron learning rule
  • Proof of convergence
  • Recursive linear regression using LMS
  • Modelling and recursive parameter estimation
  • Linear models and quadratic performance function
  • LMS and NLMS learning rules
  • Proof of convergence

3
Lectures 3 & 4: Learning Objectives
  • Understand what classification and regression
    machine learning techniques are and their
    differences
  • Describe how linear models can be used for both
    classification and regression problems
  • Prove convergence of the learning algorithms for
    linear relationships, subject to restrictive
    conditions
  • Understand the restrictions of these basic proofs
  • Develop basic framework that will be expanded on
    in subsequent lectures

4
Lectures 3 & 4: Resources
  • Classification/Perceptron
  • An introduction to Support Vector Machines and
    other kernel-based learning methods, N
    Cristianini, J Shawe-Taylor, CUP, 2000
  • Regression/LMS
  • Adaptive Signal Processing, Widrow & Stearns,
    Prentice Hall, 1985
  • Many other sources are available (on-line).

5
What is Classification?
  • Classification is also known as (statistical)
    pattern recognition
  • The aim is to build a machine/algorithm that can
    assign appropriate qualitative labels to new,
    previously unseen quantitative data using a
    priori knowledge and/or information contained in
    a training set. The patterns to be classified
    are usually groups of measurements/observations
    that are believed to be informative for the
    classification task.
  • Example: face recognition

[Diagram: training data D = {X, y} and prior knowledge are used to design/learn a classifier m(θ, x), which then predicts the class label ŷ for a new pattern x]
6
Classification Training Data
  • To supply training data for a classifier,
    examples must be collected that contain both
    positive (examples of the class) and negative
    (examples of other classes) instances. These are
    qualitative target class values and are stored as
    +1 and -1 for the positive and negative
    instances respectively, generated by an expert or
    by observation.
  • The quantitative input features should be
    informative
  • The training set should contain enough examples
    to be able to build statistically significant
    decisions

How to encode qualitative target and input
features?
7
Bayes Class Priors
  • Classification is all about decision making using
    the concept of minimum risk
  • Imagine that the training data contains 100
    examples, 70 of them are class 1 (c1), 30 are
    class 2 (c2)
  • If I have to decide which class an unknown
    example belongs to, which decision is optimal?
  • Errors if the decision is class 1:
    1 - p(c1) = p(c2) = 0.3
  • Errors if the decision is class 2:
    1 - p(c2) = p(c1) = 0.7
  • Minimum risk decision is therefore class 1
  • p(c1) and p(c2) are known as the Bayes priors; they
    represent the baseline performance for any
    classifier. They are derived from the training
    data as simple percentages
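
As a rough illustration of the last point, a Matlab sketch (variable names and values are illustrative, not from the lecture) that estimates the priors from a label vector and makes the minimum-risk decision:

  % Estimate Bayes priors as simple percentages of the training labels
  y = [ones(70,1); -ones(30,1)];   % 70 class-1 and 30 class-2 examples
  pc1 = mean(y == 1);              % p(c1) = 0.7
  pc2 = mean(y == -1);             % p(c2) = 0.3
  if pc1 >= pc2                    % with no other information, pick the
      decision = +1;               % class with the larger prior
  else
      decision = -1;
  end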

8
Structure of a Linear Classifier
  • Given a set of quantitative features x, a linear
    classifier has the form ŷ = sgn(xTθ)
  • The sgn() function is used to produce the
    qualitative class label (+1/-1)
  • The class/decision boundary is determined when
    xTθ = 0
  • This is an (n-1)D hyperplane in feature space.
  • In 2-dimensional feature space the boundary is the
    line θ0 + θ1x1 + θ2x2 = 0 (see the code sketch below)
  • How does the sign and magnitude of θ affect the
    decision boundary?

[Plot: a linear decision boundary in the (x1, x2) feature plane]
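
A minimal Matlab sketch of the classifier structure above, with illustrative parameter values (not taken from the slide), showing the prediction and the gradient/intercept form of the 2-D boundary:

  theta = [-0.5; 1.0; 2.0];        % [theta0; theta1; theta2], bias included
  x = [1; 0.3; 0.4];               % augmented pattern [x0 = 1; x1; x2]
  yhat = sign(theta' * x);         % qualitative class label (+1/-1)
  % Boundary x'*theta = 0, i.e. x2 = -(theta0 + theta1*x1)/theta2
  gradient  = -theta(2)/theta(3);
  intercept = -theta(1)/theta(3);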
9
Simple Example: Fisher's Iris Data
  • Famous example of building classifiers for a
    problem with 3 types of Iris flowers and 4
    measurements about the flower
  • Sepal length and width
  • Petal length and width
  • 150 examples were collected, 50 from each class
  • Build 3 separate classifiers, one for recognizing
    examples of each class
  • Data is shown, plotted against last two features,
    as well as two linear classifiers for the Setosa
    and Virginica classes

Calculate θ in Laboratory 3 & 4
10
Perceptron Linear Classifier
  • The Perceptron linear classifier was devised by
    Rosenblatt in 1956
  • It comprises a linear classifier (as just
    discussed) and a simple parameter update rule of
    the form θ ← θ + η yk xk (see the sketch below)
  • Cyclically present each training pattern (xk, yk)
    to the linear classifier
  • When an error (misclassification) is made, update
    the parameters
  • where η > 0 is the learning rate.
  • The bias term can be included as θ0 with an extra
    feature x0 = 1
  • Continue until there are no prediction errors
  • Perceptron convergence theorem If the data set is
    linearly separable, the perceptron learning
    algorithm will converge to an optimal separator
    in a finite time
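
A minimal Matlab sketch of one error-driven update, under assumed values for the learning rate, pattern and target (not the slide's numbers):

  eta = 0.1;                       % learning rate, eta > 0
  theta = zeros(3,1);              % parameters, bias folded in as theta0
  xk = [1; 0; 1];                  % augmented pattern [x0 = 1; x1; x2]
  yk = -1;                         % target class label
  if sign(theta' * xk) ~= yk       % update only on a misclassification
      theta = theta + eta * yk * xk;   % move theta towards the missed pattern
  end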

11
Instantaneous Parameter Update
  • What does this look like?
  • The parameters are updated to make them more like
    the incorrect feature vector.
  • After updating, the parameters are closer to the
    correct decision

[Diagram: the error-driven update in feature/parameter space (x1, θ1), (x2, θ2), and the sgn() output y, ŷ taking values +1/-1]
12
Perceptron Convergence Proof Preamble
  • Basic aim is to minimise the number of
    mis-classifications
  • This is generally an NP-complete problem
  • We've assumed that there is an optimal solution
    with 0 errors
  • This is similar to Least Squares recursive
    estimation
  • Performance: Σi (yi - ŷi)² = 4 × (number of errors)
  • Except that the sgn() makes it a non-quadratic
    optimization problem
  • Updating only when there are errors is the same as
    applying an update proportional to the error,
    θ ← θ + (η/2)(yk - ŷk) xk, on every sample, with or
    without errors (the correction is zero when the
    prediction is correct)
  • Sometimes drawn as a network


This is error-driven parameter estimation: repeatedly
cycle through the data set D, drawing out each sample
(xk, yk)
[Network diagram: the input xk feeds the linear model, whose output ŷk is compared with the target yk to form the error that drives the parameter update]
13
Convergence Analysis of the Perceptron (i)
  • If a linearly separable data set D is repeatedly
    presented to a Perceptron, then the learning
    procedure is guaranteed to converge (no errors)
    in a finite time
  • If the data set is linearly separable, there
    exist optimal parameters θ* such that
    yi xiTθ* > 0 for all i = 1, ..., l
  • Note that positive scalar multiples of θ* are also
    optimal parameter vectors
  • Consider the positive quantity γ defined by
    γ = mini yi xiTθ*, such that ||θ*|| = 1
  • This is a concept known as the classification
    margin
  • Assume also that the feature vectors are bounded
    by ||xi|| ≤ R

14
Convergence Analysis of the Perceptron (ii)
  • To show convergence, we need to establish that at
    the kth iteration, when an error has occurred, the
    distance between the updated parameters and a
    (suitably scaled) optimal parameter vector θ*
    decreases by at least a fixed amount α > 0
  • Using the update formula θk+1 = θk + η yk xk, expand
    the squared parameter error and use the facts that
    yk xkTθk ≤ 0 (an error occurred), yk xkTθ* ≥ γ and
    ||xk|| ≤ R
[Diagram: the parameter vectors θk, θk+1 and θ* in (θ1, θ2) parameter space]
To finish the proof, select the scaling of θ* so that
the guaranteed decrease α is strictly positive
15
Convergence Analysis of the Perceptron (iii)
  • To show this terminates in a finite number of
    iterations, simply note that α is independent of
    the current training sample, so the parameter error
    must decrease by at least this amount at each update
    iteration. As the initial error is finite (θ0 = 0,
    say), there must exist a finite number of steps
    before the parameter error is reduced to zero.
  • Note also that α is proportional to the size of
    the feature vectors (R²) and inversely
    proportional to the size of the margin (γ). Both
    of these will influence the number of update
    iterations when the Perceptron is learning


16
Example of Perceptron (i)
  • Consider modelling the logical AND data using a
    Perceptron

Is the data linearly separable?



k = 0:  θ = (0.01, 0.1, 0.006)
k = 5:  θ = (-0.98, 1.11, 1.01)
k = 18: θ = (-2.98, 2.11, 1.01)
[Three plots: the decision boundary in the (x1, x2) plane after k = 0, 5 and 18 updates]
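
A Matlab sketch of cycling the same rule over the logical AND data until a clean pass; the 0/1 input encoding, zero initial parameters and the learning rate are assumptions rather than the slide's values:

  X = [ones(4,1) [0 0; 0 1; 1 0; 1 1]];   % four patterns with bias x0 = 1
  y = [-1; -1; -1; 1];                    % AND: positive only for (1,1)
  theta = zeros(3,1);                     % assumed initial parameters
  eta = 0.5;                              % assumed learning rate
  errors = 1;
  while errors > 0                        % cycle until a full pass has no mistakes
      errors = 0;
      for k = 1:4
          if sign(X(k,:) * theta) ~= y(k)
              theta = theta + eta * y(k) * X(k,:)';
              errors = errors + 1;
          end
      end
  end

The while loop is the convergence theorem in action: it terminates because the AND data are linearly separable.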
17
Example: Parameter Trajectory (ii)
Lab exercise: calculate by hand the first 4
iterations of the learning scheme
18
Classification Margin
  • In this proof, we assumed that there exists a
    single, optimal parameter vector.
  • In practice, when the data is linearly separable,
    there are an infinite number; simply requiring
    correct classification results in an ill-posed
    problem
  • The classification margin can be defined as the
    minimum distance of the decision boundary to a
    point in that class
  • Used in deriving Support Vector Machines

[Plots: decision boundaries and the classification margin in the (x1, x2) feature plane, with the sgn() output taking values +1/-1]
19
Classification Summary
  • Classification is the task of assigning an
    object, described by a feature vector, to one of
    a set of mutually exclusive groups
  • A linear classifier has a linear decision
    boundary
  • The perceptron training algorithm is guaranteed
    to converge in a finite time when the data set is
    linearly separable
  • The final boundary is determined by the initial
    values and the order of presentation of the data

20
Definition of Regression
  • Regression is a (statistical) methodology that
    utilizes the relation between two or more
    quantitative variables so that one variable can
    be predicted from the other, or others.
  • Examples
  • Sales of a product can be predicted by using the
    relationship between sales volume and amount of
    advertising
  • The performance of an employee can be predicted
    by using the relationship between performance and
    aptitude tests
  • The size of a child's vocabulary can be predicted
    by using the relationship between the vocabulary
    size, the child's age and the parents'
    educational input.

21
Regression Problem Visualisation
[Plot: training data and the fitted curve, showing the targets y and predictions ŷ against the input x]
  • Data generated by an underlying relationship plus
    noise
  • Estimate the model parameters
  • Predict a real value (fit a curve to the data)
  • Predictive performance: average error

22
Probabilistic Prediction Output
  • An output of 12 with rmse/standard deviation of
    1.5: within a small region close to the query
    point, the average target value was 12 and the
    standard deviation within that region was 1.5
    (variance 2.25)

m(y|x) = 12
2σ(e) = 3
95% of the data lies in the range m ± 2σ = 12 ± 2(1.5) = [9, 15]
23
Structure of a Linear Regression Model
  • Given a set of features x, a linear predictor has
    the form ŷ = xTθ
  • The output is a real-valued, quantitative
    variable
  • The bias term can be included as an extra feature
    x0 = 1. This renames the bias parameter as θ0.
  • Most linear control system models do not
    explicitly include a bias term; why is this?
  • Similar to the Toluca example in week 1.

24
Least Mean Squares Learning
  • Least Mean Squares (LMS) proposed by Widrow 1962
  • This is a (non-optimal) sequential parameter
    estimation procedure for a linear model
  • NB: compared to classification, both yk and ŷk
    are quantitative variables, so the error/noise
    signal (yk - ŷk) is generally non-zero. Similar to
    the Perceptron, but no sgn() threshold on xTθ. η is
    again the positive learning rate (see the sketch
    below).
  • Widely used in filtering/signal processing and
    adaptive control applications
  • Cheap version of sequential/recursive parameter
    estimation
  • The normalised version (NLMS) was developed by
    Kaczmarz in 1937
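
A minimal Matlab sketch of a single LMS step, with assumed variable names and values:

  eta = 0.1;                       % positive learning rate
  theta = zeros(2,1);              % parameters, bias folded in as theta0
  xk = [1; 0.7];                   % augmented input [x0 = 1; x1]
  yk = 0.4;                        % quantitative target
  e = yk - theta' * xk;            % error; no sgn() threshold on x'*theta
  theta = theta + eta * e * xk;    % LMS parameter update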




25
Proof of LMS Convergence (i)
  • If a noise-free data set containing a linear
    relationship x → y is repeatedly presented to a
    linear model, then the LMS algorithm is
    guaranteed to update the parameters so that they
    converge to their optimal values, assuming the
    learning rate is sufficiently small.
  • Note
  • Assume there is no measurement noise in the
    target data
  • Assume the data is generated from a linear
    relationship
  • Parameter estimation will take an infinite time
    to converge to the optimal values
  • Rate of convergence and stability depend on the
    learning rate

26
Proof of Convergence (ii)
  • To show convergence, we need to establish that at
    the kth iteration, when an error has occurred, the
    parameter error decreases
  • Using the update formula θk+1 = θk + η (yk - ŷk) xk,
    the parameter error decreases when the learning rate
    satisfies 0 < η < 2/(xkTxk)
27
Example: LMS Learning
  • Consider the target linear model y = 1 - 2x,
    where the inputs are drawn from a normal
    distribution with zero mean, unit variance
  • Data set consisted of 25 data points, and
    involved 10 cycles through the data set
  • η = 0.1

[Plots: the model fit ŷ against x after k = 0, 5 and 100 updates, and the parameter trajectories θ0 and θ1 against the iteration number k]
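
A Matlab sketch replicating the setup described on this slide (25 normally distributed inputs, 10 cycles, η = 0.1); since the data are random, the exact trajectory will differ from the plots:

  x = randn(25,1);                 % inputs: zero mean, unit variance
  y = 1 - 2*x;                     % noise-free targets from y = 1 - 2x
  X = [ones(25,1) x];              % bias folded in as x0 = 1
  theta = zeros(2,1);
  eta = 0.1;
  for cycle = 1:10                 % 10 cycles through the data set
      for k = 1:25
          e = y(k) - X(k,:) * theta;
          theta = theta + eta * e * X(k,:)';
      end
  end
  % theta should now be close to the true values [1; -2]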
28
Stability and NLMS
  • To normalise the LMS algorithm and remove the
    dependency of η on the input vector size,
    consider θ ← θ + η (yk - ŷk) xk / (xkTxk)
  • This learning algorithm is stable for 0 < η < 2
    (exercise).
  • When η = 1, the NLMS algorithm has the property
    that the error, on that datum, after adaptation
    is zero, i.e. the updated parameters fit that
    sample exactly (see the sketch below)
  • Exercise: prove this.
  • Is this desirable when the target contains
    (measurement) noise?
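
A Matlab sketch checking the η = 1 property on a single, arbitrarily chosen datum: after one NLMS step the error on that same datum is zero.

  theta = [0.3; -0.1];             % arbitrary current parameters
  xk = [1; 2];   yk = 0.5;         % one datum
  eta = 1;                         % NLMS with unit learning rate
  e_before = yk - theta' * xk;
  theta = theta + eta * e_before * xk / (xk' * xk);   % normalised LMS step
  e_after = yk - theta' * xk;      % zero (to rounding error) when eta = 1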

29
Regression Summary
  • Regression is a (statistical) technique for
    predicting real-valued outputs, given a
    quantitative feature vector
  • Typically, it is assumed that the dependent,
    target variable is corrupted by Gaussian noise,
    and this is unpredictable.
  • The aim is then to fit the underlying
    linear/non-linear signal.
  • The LMS algorithm is a simple, cheap gradient
    descent technique for updating the linear
    parameter estimates
  • The parameters will converge to their correct
    values when the target does not contain any
    noise, otherwise they will oscillate in a zone
    around the optimum.
  • Stability of the algorithm depends on the
    learning rate

30
Lectures 3 & 4: Summary
  • This lecture has looked at basic (linear)
    classification and regression techniques
  • Investigated basic linear model structure
  • Proposed simple, on-line learning rules
  • Proved convergence for simple environments
  • Discussed the practicality of the machine
    learning algorithms
  • While these algorithms are rarely used in this
    form, their structure has strongly influenced the
    development of more advanced techniques
  • Support vector machines
  • Multi-layer perceptrons
  • which will be studied in the coming weeks

31
Laboratory 3 & 4: Perceptron/LMS
  • Download the irisClassifier.m and iris.mat Matlab
    files that contain a simple GUI for displaying
    the Iris data and entering decision boundaries
  • Enter parameters that create suitable decision
    boundaries for both the Setosa and Virginica
    classes
  • Which of the three classes are linearly
    separable?
  • Make sure you can translate between the
    classifier's parameters, θ, and the
    gradient/intercept coordinate systems. Also
    ensure that the output is +1 (rather than -1) in
    the appropriate region
  • Download the irisPerceptron.m and perceptron.m
    Matlab files that contain the Perceptron
    algorithm for the Iris data
  • Run the algorithm and note how the decision
    boundary changes when a point is
    correctly/incorrectly classified
  • Modify the learning rate and note the effect it
    has on the convergence rate and final values

32
Laboratory 3 & 4: Perceptron/LMS (ii)
  • Copy and modify the irisPerceptron.m Matlab file
    so that it runs on the logical AND and OR
    classification functions (see slides 16 & 17).
    Each should contain 2 features and four training
    patterns. Make sure you can calculate the
    updates by hand, as required on Slide 17.
  • Create a Matlab implementation of the example given
    in Slide 27 for the LMS algorithm with a simple,
    single input linear model
  • What values of η cause the LMS algorithm to
    become unstable?
  • Can this ever happen with the Perceptron
    algorithm?
  • Modify this implementation to use the NLMS
    training rule
  • Verify that learning is always stable for 0 < η < 2.
  • Complete the two (pen and paper) exercises on
    Slide 28.
  • How might this insight be used with the
    Perceptron algorithm to implement a dynamic
    learning rate?