1
Linear Discriminant Functions
Chapter 5 (Duda et al.)
CS479/679 Pattern Recognition, Dr. George Bebis
2
Generative vs. Discriminative Approach
  • Generative approaches estimate the discriminant
    function by first estimating the probability
    distribution of the data belonging to each class.
  • Discriminative approaches estimate the
    discriminant function explicitly, without
    assuming a probability distribution.

3
Linear Discriminants (case of two categories)
  • A linear discriminant has the following form: g(x) = w^t x + w0
  • Decide ω1 if g(x) > 0 and ω2 if g(x) < 0

If g(x) = 0, then x lies on the decision boundary and can be assigned to either class.
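As an illustration of the two-category rule above, here is a minimal NumPy sketch; the weight values and the function names (linear_discriminant, classify) are made up for the example and are not from the slides:

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Evaluate g(x) = w^t x + w0."""
    return np.dot(w, x) + w0

def classify(x, w, w0):
    """Decide omega_1 if g(x) > 0, omega_2 if g(x) < 0; boundary if g(x) == 0."""
    g = linear_discriminant(x, w, w0)
    if g > 0:
        return "omega_1"
    elif g < 0:
        return "omega_2"
    return "on the decision boundary"

# Example: a 2D discriminant with normal w = (1, 2) and bias w0 = -3
w, w0 = np.array([1.0, 2.0]), -3.0
print(classify(np.array([2.0, 2.0]), w, w0))   # g = 3 > 0  -> omega_1
print(classify(np.array([0.0, 0.0]), w, w0))   # g = -3 < 0 -> omega_2
```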
4
Decision Boundary
  • The decision boundary g(x) = 0 is a hyperplane.
  • The orientation of the hyperplane is determined
    by w and its location by w0.
  • w is the normal to the hyperplane.
  • If w0 = 0, the hyperplane passes through the origin.

5
Decision Boundary Estimation
  • Use learning algorithms to estimate w and w0
    from training data xk.
  • Let us suppose that zk denotes the true class label of xk (e.g., zk = 1 for ω1 and zk = -1 for ω2).
  • The solution can be found by minimizing an error function, e.g., the training error or empirical risk:

J(w, w0) = (1/n) Σk [ g(xk) - zk ]^2,   where zk is the true class label and g(xk) the predicted class label.
6
Geometric Interpretation
  • Let's look at g(x) from a geometric point of view.

Using vector algebra, x can be expressed as x = xp + r (w / ||w||), where xp is the projection of x onto the hyperplane and r is the signed distance from x to it.
Substituting this expression for x into g(x) gives g(x) = r ||w||.
7
Geometric Interpretation (contd)
8
Geometric Interpretation (contd)
  • The distance of x from the hyperplane is given by r = g(x) / ||w||.
  • g(x) provides an algebraic measure of the distance of x from the hyperplane.

Setting x = 0 gives r = w0 / ||w|| (i.e., the distance of the hyperplane from the origin).
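A small NumPy sketch of the distance formulas above; the weight vector chosen here is arbitrary, picked so the numbers are easy to check by hand:

```python
import numpy as np

def signed_distance(x, w, w0):
    """Signed distance r = g(x) / ||w|| of x from the hyperplane g(x) = 0."""
    return (np.dot(w, x) + w0) / np.linalg.norm(w)

w, w0 = np.array([3.0, 4.0]), -5.0          # ||w|| = 5
x = np.array([3.0, 4.0])
print(signed_distance(x, w, w0))             # (9 + 16 - 5) / 5 = 4.0
print(signed_distance(np.zeros(2), w, w0))   # w0 / ||w|| = -1.0 (distance of plane from origin)
```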
9
Linear Discriminant Functions: case of c categories
  • There are several ways to devise multi-category
    classifiers using linear discriminant functions
  • (1) One against the rest

How many decision boundaries are there?
c boundaries
But there is a problem: an ambiguous region.
10
Linear Discriminant Functions: case of c categories (contd)
  • (2) One against another

How many decision boundaries are there?
c(c-1)/2 boundaries
But there is a problem again: an ambiguous region.
11
Linear Discriminant Functions: case of c categories (contd)
  • To avoid the problem of ambiguous regions:
  • Define c linear functions gi(x), i = 1, 2, ..., c.
  • Assign x to ωi if gi(x) > gj(x) for all j ≠ i.
  • The resulting classifier is called a linear machine.

(see Chapter 2)
12
Linear Discriminant Functions: case of c categories (contd)
  • A linear machine divides the feature space into c convex decision regions.

If x is in region Ri, then gi(x) is the largest.
Note: although there are c(c-1)/2 region pairs, there are typically fewer decision boundaries (e.g., 8 instead of 10 in the five-class example above).
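A minimal sketch of a linear machine in NumPy, assuming the c weight vectors are stacked as rows of a matrix W and the biases collected in a vector w0 (the names W, w0, linear_machine are illustrative, not from the slides):

```python
import numpy as np

def linear_machine(x, W, w0):
    """Assign x to the class with the largest g_i(x) = w_i^t x + w_i0.

    W  : (c, d) matrix whose rows are the weight vectors w_i
    w0 : (c,) vector of biases w_i0
    """
    g = W @ x + w0            # all c discriminants at once
    return int(np.argmax(g))  # index of the winning class

# Toy example with c = 3 classes in d = 2 dimensions
W = np.array([[ 1.0,  0.0],
              [-1.0,  0.0],
              [ 0.0,  1.0]])
w0 = np.array([0.0, 0.0, -0.5])
print(linear_machine(np.array([2.0, 1.0]), W, w0))   # class 0 (g = [2, -2, 0.5])
```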
13
Geometric Interpretation
  • The decision boundary between adjacent regions Ri and Rj is a portion of the hyperplane Hij defined by gi(x) = gj(x).
  • (wi - wj) is normal to Hij.
  • The distance from x to Hij is (gi(x) - gj(x)) / ||wi - wj||.

14
Higher Order Discriminant Functions
  • Higher order discriminants yield more complex
    decision boundaries than linear discriminant
    functions.
  • Quadratic discriminant: add terms corresponding to products of pairs of components of x.
  • Polynomial discriminant: add even higher-order products, e.g., products of three or more components of x.

15
Linear Discriminants Revisited: A More General Definition
  • It is more convenient to have the decision boundary pass through the origin: augment the feature space!

Augmented representation: d+1 features y = (1, x1, ..., xd)^t and d+1 parameters a = (w0, w1, ..., wd)^t, instead of d features x and d parameters w (plus the bias w0).
16
Linear Discriminants Revisited: A More General Definition (contd)
  • Discriminant: g(x) = a^t y
  • Classification rule: decide ω1 if a^t y > 0 and ω2 if a^t y < 0.
  • This separates points in (d+1)-space by a hyperplane.
  • The decision boundary passes through the origin.
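To make the augmented notation concrete, here is a minimal NumPy sketch; the helper names augment and classify_augmented are illustrative only:

```python
import numpy as np

def augment(x):
    """Map x = (x1, ..., xd) to y = (1, x1, ..., xd) so the bias is absorbed into a."""
    return np.concatenate(([1.0], x))

def classify_augmented(x, a):
    """Decide omega_1 if a^t y > 0, omega_2 otherwise (boundary treated as omega_2 here)."""
    return "omega_1" if np.dot(a, augment(x)) > 0 else "omega_2"

# a = (w0, w) for the earlier example w = (1, 2), w0 = -3
a = np.array([-3.0, 1.0, 2.0])
print(classify_augmented(np.array([2.0, 2.0]), a))   # a^t y = 3 > 0 -> omega_1
```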

17
Generalized Discriminants
  • The main idea is to map the data to a space of higher dimensionality.
  • This can be done by transforming the data through properly chosen functions yi(x), i = 1, 2, ..., d̂ (called φ functions).

18
Generalized Discriminants (contd)
  • A generalized discriminant is defined as a linear discriminant in the d̂-dimensional space: g(x) = Σ ai yi(x) = a^t y, where y = (y1(x), ..., yd̂(x))^t.
19
Generalized Discriminants (contd)
  • Why are generalized discriminants attractive?
  • By properly choosing the φ functions, a problem which is not linearly separable in the d-dimensional space might become linearly separable in the d̂-dimensional space!

20
Example
  • The corresponding decision regions R1, R2 in the 1D space are not simply connected (i.e., not linearly separable).
  • Consider the following mapping and generalized discriminant:
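The slide's exact mapping is not reproduced in this transcript; the sketch below assumes the classic 1D-to-3D choice y = (1, x, x^2)^t with hand-picked weights, just to show how two disjoint 1D regions become linearly separable in the mapped space:

```python
import numpy as np

def phi(x):
    """Map a scalar x to y = (1, x, x^2): a quadratic set of phi functions (assumed example)."""
    return np.array([1.0, x, x * x])

def g(x, a):
    """Generalized discriminant: linear in y, quadratic in x."""
    return np.dot(a, phi(x))

# With a = (-1, 0, 1), g(x) = x^2 - 1 > 0 for |x| > 1: two disjoint
# 1D decision regions become a single half-space in the 3D y-space.
a = np.array([-1.0, 0.0, 1.0])
for x in (-2.0, 0.0, 2.0):
    print(x, "->", "omega_1" if g(x, a) > 0 else "omega_2")
```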

21
Example (contd)
22
Learning Linearly Separable Categories
  • Given a linear discriminant function g(x) = a^t y,
  • the goal is to learn the parameter (weight) vector a from a set of n labeled samples yi, where each yi has a class label ω1 or ω2.

23
Learning: effect of training examples
  • Every training sample yi places a constraint on the weight vector a.
  • Visualize the solution in the feature space:
  • a^t y = 0 defines a hyperplane in the feature space, with a being the normal vector.
  • Given n examples, the solution a must lie within a certain region (the shaded region in the example).

24
Learning: effect of training examples (contd)
  • Visualize the solution in the parameter space:
  • a^t y = 0 defines a hyperplane in the parameter space, with y being the normal vector.
  • Given n examples, the solution a must lie in the intersection of n half-spaces (shown by the red lines in the example).

parameter space (a1, a2)
25
Uniqueness of Solution
  • The solution vector a is usually not unique; we can impose additional constraints to enforce uniqueness, e.g.:
  • Find the unit-length weight vector a that maximizes the minimum distance from the training examples to the separating plane.

26
Learning Using Iterative Optimization
  • Minimize some error function J(a) with respect to a.
  • Minimize iteratively: a(k+1) = a(k) + η(k) pk, moving from a(k) to a(k+1) along the search direction pk with learning rate (step size) η(k).

How should we choose pk?
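A generic sketch of the iterative scheme with the steepest-descent choice pk = -∇J(a(k)) and a fixed learning rate; the quadratic J used here and the function names are only a toy example, not from the slides:

```python
import numpy as np

def iterative_minimize(grad_J, a0, eta=0.1, n_iters=100):
    """Iterative scheme a(k+1) = a(k) + eta(k) * p_k with p_k = -grad J(a(k))."""
    a = np.asarray(a0, dtype=float)
    for k in range(n_iters):
        p = -grad_J(a)          # steepest-descent search direction
        a = a + eta * p         # fixed learning rate eta(k) = eta
    return a

# Toy example: J(a) = ||a - [1, 2]||^2, whose gradient is 2 (a - [1, 2])
grad_J = lambda a: 2.0 * (a - np.array([1.0, 2.0]))
print(iterative_minimize(grad_J, np.zeros(2)))   # converges toward [1, 2]
```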
27
Choosing pk using the Gradient
  • The negative gradient -∇J(a) points in the direction of steepest decrease of J(a), so a natural choice is pk = -∇J(a(k)).

(Warning: the notation is reversed in the figures of J(a).)
28
Gradient Descent
(Warning: the notation is reversed in the figure of J(a).)
29
Gradient Descent (contd)
  • Gradient descent is very popular due to its simplicity, but it can get stuck in local minima.

(Warning: the notation is reversed in the figure of J(a).)
30
Gradient Descent
  • What is the effect of the learning rate η(k)?
  • If it is too small, it takes too many iterations; take bigger steps to converge faster.
  • If it is too big, it might overshoot the solution (and never find it), possibly leading to oscillations (no convergence); take smaller steps to avoid overshooting.

(Warning: the notation is reversed in the figure of J(a).)
31
Gradient Descent (contd)
  • Even a small change in the learning rate might lead to overshooting the solution.

(Figure: one learning rate converges to the solution, while a slightly larger one overshoots it.)
32
Gradient Descent (contd)
  • Could we choose η(k) adaptively?
  • Yes; let's review the Taylor series expansion first.

The Taylor series expands f(x) around x0 using derivatives: f(x) = f(x0) + f'(x0)(x - x0) + (1/2) f''(x0)(x - x0)^2 + ...
33
Gradient Descent (contd)
  • Expand J(a) around a0 = a(k) using a Taylor series (up to the second derivative): J(a) ≈ J(a(k)) + ∇J^t (a - a(k)) + (1/2) (a - a(k))^t H (a - a(k)), where H is the Hessian matrix (expensive to compute in practice!).
  • Evaluating J(a) at a = a(k+1) = a(k) - η(k) ∇J and minimizing with respect to η(k) yields the optimal learning rate η(k) = ||∇J||^2 / (∇J^t H ∇J).
34
Choosing pk using the Hessian
  • Newton's method chooses pk = -H^{-1} ∇J(a(k)), i.e., a(k+1) = a(k) - H^{-1} ∇J(a(k)).
  • This requires inverting H, which is expensive in practice!
  • Gradient descent can be seen as a special case of Newton's method assuming H = I.
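A minimal sketch of one Newton step; to avoid explicitly inverting H it solves the linear system instead, and the quadratic J is a toy example chosen so the one-step convergence claimed on the next slide is visible (all names here are illustrative):

```python
import numpy as np

def newton_step(a, grad_J, hess_J):
    """One Newton update: a <- a - H^{-1} grad J(a) (with eta(k) = 1)."""
    g = grad_J(a)
    H = hess_J(a)
    return a - np.linalg.solve(H, g)   # solve H p = g instead of inverting H

# Quadratic toy example: J(a) = 1/2 a^t Q a - b^t a, minimized at a* = Q^{-1} b
Q = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
grad_J = lambda a: Q @ a - b
hess_J = lambda a: Q
print(newton_step(np.zeros(2), grad_J, hess_J))   # reaches the minimum [0.2, 0.4] in one step
```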
35
Newton's Method
  • With η(k) = 1, if J(a) is quadratic, Newton's method converges in one iteration!
36
Gradient Descent vs. Newton's Method
(Figure: trajectories of Gradient Descent and Newton's method.)

Typically, Newton's method converges faster than gradient descent.
37
Dual Classification Problem
If a^t yi > 0, assign yi to ω1; else if a^t yi < 0, assign yi to ω2.
  • If yi is in ω2, replace yi by -yi.
  • Then find a such that a^t yi > 0 for all i.

Original problem: seeks a hyperplane that separates patterns from different categories.
Normalized problem: seeks a hyperplane that puts the normalized patterns on the same (positive) side.
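A small sketch of this normalization trick, assuming the augmented samples are stored as rows and the labels are 1 or 2 for ω1/ω2 (the names normalize_samples, Y, labels are illustrative):

```python
import numpy as np

def normalize_samples(Y, labels):
    """Replace y_i by -y_i for samples in omega_2, so a single condition
    a^t y_i > 0 (for all i) expresses correct classification."""
    Y = np.asarray(Y, dtype=float)
    signs = np.where(np.asarray(labels) == 1, 1.0, -1.0)   # +1 for omega_1, -1 for omega_2
    return Y * signs[:, None]

Y = np.array([[1.0, 2.0, 1.0],    # augmented samples (rows)
              [1.0, -1.0, 0.5]])
labels = [1, 2]                    # omega_1, omega_2
print(normalize_samples(Y, labels))   # second row is negated
```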
38
Perceptron rule
  • The perceptron rule minimizes the following error function: Jp(a) = Σ (-a^t y), where the sum is over Y(a), the set of samples misclassified by a.
  • If Y(a) is empty, Jp(a) = 0; otherwise, Jp(a) > 0.

Goal: find a such that a^t yi > 0 for all i.
39
Perceptron rule (contd)
  • Apply gradient descent using Jp(a).
  • The gradient of Jp(a) is ∇Jp(a) = Σ (-y), where the sum is over the misclassified samples y in Y(a).

40
Perceptron rule (contd)

The update rule becomes: a(k+1) = a(k) + η(k) Σ y, where the sum is over the misclassified examples in Y(a(k)).
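A minimal batch-perceptron sketch following the update above; it assumes the samples have already been augmented and normalized as on the earlier slides, and the toy data set is made up for the example:

```python
import numpy as np

def batch_perceptron(Y, eta=1.0, n_iters=1000):
    """Batch perceptron: a(k+1) = a(k) + eta * sum of misclassified (normalized) samples.

    Y : (n, d+1) matrix of augmented, normalized samples (rows); a solution
        satisfies a^t y_i > 0 for every row y_i.
    """
    a = np.zeros(Y.shape[1])
    for _ in range(n_iters):
        misclassified = Y[Y @ a <= 0]        # samples with a^t y <= 0
        if len(misclassified) == 0:
            break                            # J_p(a) = 0: all on the positive side
        a = a + eta * misclassified.sum(axis=0)
    return a

# Linearly separable toy set (already augmented and normalized)
Y = np.array([[1.0, 2.0], [1.0, 1.5], [-1.0, 0.5]])
a = batch_perceptron(Y)
print(a, Y @ a)    # all entries of Y @ a are positive
```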
41
Perceptron rule (contd)
  • Keep updating the orientation of the hyperplane
    until all training samples are on its positive
    side.

Example
42
Perceptron rule (contd)
The update is done using one misclassified example at a time (single-sample perceptron): a(k+1) = a(k) + η(k) y^k, where y^k is an example misclassified by a(k).
Perceptron Convergence Theorem: if the training samples are linearly separable, the perceptron algorithm will terminate at a solution vector in a finite number of steps.
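A sketch of the single-sample (fixed-increment) version, cycling through the normalized samples and updating on each mistake; it stops after a full error-free pass, and max_epochs (an assumed safeguard, not from the slides) guards against non-separable data:

```python
import numpy as np

def single_sample_perceptron(Y, eta=1.0, max_epochs=1000):
    """Single-sample perceptron on augmented, normalized samples (rows of Y)."""
    a = np.zeros(Y.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for y in Y:                       # one misclassified example at a time
            if np.dot(a, y) <= 0:
                a = a + eta * y
                errors += 1
        if errors == 0:                   # a full pass with no mistakes: done
            return a
    return a

Y = np.array([[1.0, 2.0], [1.0, 1.5], [-1.0, 0.5]])   # same toy set as above
print(single_sample_perceptron(Y))
```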
43
Perceptron rule (contd)

Order of examples considered: y2, y3, y1, y3.
The batch algorithm leads to a smoother trajectory in solution space.
44
Quiz
  • When: April 21st
  • What: Linear Discriminants