Title: Pattern Classification
Pattern Classification
All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (Part 1)
- Introduction
- Maximum-Likelihood Estimation
- Example of a Specific Case
- Gaussian Case: unknown μ and σ
- Bias
- Appendix: ML Problem Statement
Introduction
- Data availability in a Bayesian framework
- To design an optimal classifier, we need:
  - P(ωi) (priors)
  - P(x | ωi) (class-conditional densities)
- Unfortunately, we rarely have this complete information!
- Design a classifier from a training sample
  - Priors are easy to estimate
  - Samples are often too small to estimate the class-conditional densities (large dimension of the feature space!)
A priori information about the problem
- Normality of P(x | ωi): P(x | ωi) ~ N(μi, Σi)
- Characterized by 2 parameters
- Estimation techniques: Maximum-Likelihood (ML) and Bayesian estimation
- Results are nearly identical, but the approaches are different
ML vs. Bayesian Methods
- ML estimation
  - Parameters are fixed but unknown!
  - Obtain the best parameters by maximizing the probability of obtaining the samples observed: θ̂ = argmax_θ P(D | θ)
- Bayesian methods
  - View parameters as random variables having some known prior distribution
  - Compute the POSTERIOR distribution P(θ | D)
- In either approach, the classification rule uses P(ωi | x)
Maximum-Likelihood Estimation
- Good convergence properties as the sample size increases
- Simpler than many alternative techniques
- General principle
  - Assume we have c classes and P(x | ωj) ~ N(μj, Σj)
  - P(x | ωj) ≡ P(x | ωj, θj), where θj consists of the parameters (μj, Σj)
Details of ML Estimation
- Use the training samples to estimate θ = (θ1, θ2, ..., θc), where θi is the parameter vector associated with category i (i = 1, 2, ..., c)
- Suppose that D contains n samples: x1, x2, ..., xn
- The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ) (a small sketch of this maximization follows below)
- It is the value of θ that best agrees with the actually observed training sample
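As a rough illustration of this definition (not from the book; the data and the known-σ assumption below are made up), the sketch evaluates the log-likelihood l(θ) = ln P(D | θ) over a grid of candidate means for a 1-D Gaussian and keeps the candidate that maximizes it:

```python
import numpy as np

# Hypothetical 1-D training sample; sigma is assumed known in this sketch.
rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=50)
sigma = 1.0

def log_likelihood(mu, data, sigma):
    """l(theta) = ln P(D | theta) for a Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

# ML estimate by brute force: keep the candidate mean that maximizes l(theta).
candidates = np.linspace(-5, 5, 1001)
scores = [log_likelihood(mu, D, sigma) for mu in candidates]
mu_ml = candidates[int(np.argmax(scores))]
print(mu_ml, D.mean())   # the grid argmax is close to the sample mean
```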
Optimal Estimation
- θ = (θ1, θ2, ..., θp)^t
- ∇θ = (∂/∂θ1, ..., ∂/∂θp)^t is the gradient operator
- l(θ) = ln P(D | θ) is the log-likelihood function
- New problem statement: determine the θ that maximizes the log-likelihood, θ̂ = argmax_θ l(θ)
Necessary conditions for an optimum
- ∇θ l = 0
- Not sufficient (could be a local optimum, ...)
- Check the 2nd derivative
Specific case: unknown μ
- P(xi | μ) ~ N(μ, Σ) (samples drawn from a multivariate normal population)
- θ = μ
- The ML estimate for μ must satisfy ∑_{k=1}^{n} Σ^-1 (xk - μ̂) = 0
Specific case: unknown μ (cont.)
- Multiplying by Σ and rearranging: μ̂ = (1/n) ∑_{k=1}^{n} xk
- Just the arithmetic average of the training samples!
- Conclusion: if P(xk | ωj) ~ N(μj, Σj), j = 1, 2, ..., c, is a d-dimensional Gaussian, then estimate the vector θ = (θ1, θ2, ..., θc)^t = (μ1, μ2, ..., μc)^t and use it to perform optimal classification!
ML Estimation (unknown μ and σ)
- Gaussian case with both parameters unknown: θ = (θ1, θ2) = (μ, σ²)
- Setting ∇θ l = 0 gives the two conditions
  (1) ∑_{k=1}^{n} (xk - μ̂)/σ̂² = 0
  (2) -∑_{k=1}^{n} 1/σ̂² + ∑_{k=1}^{n} (xk - μ̂)²/σ̂⁴ = 0
Results
- Combining (1) and (2) gives: μ̂ = (1/n) ∑_{k=1}^{n} xk and σ̂² = (1/n) ∑_{k=1}^{n} (xk - μ̂)² (a numerical sketch follows below)
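A minimal numerical sketch of these closed-form results (NumPy; the sample itself is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=200)    # illustrative sample
n = len(x)

mu_hat = x.sum() / n                            # (1/n) * sum_k x_k
sigma2_hat = ((x - mu_hat)**2).sum() / n        # (1/n) * sum_k (x_k - mu_hat)^2

print(mu_hat, sigma2_hat)
# The same values come from np.mean(x) and np.var(x) (ddof=0 is the ML estimate).
```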
Bias
- The ML estimate for σ² is biased: E[(1/n) ∑_i (xi - x̄)²] = ((n - 1)/n) σ² ≠ σ² (a short simulation of the bias follows below)
- An elementary unbiased estimator for Σ is the sample covariance matrix C = (1/(n - 1)) ∑_{k=1}^{n} (xk - μ̂)(xk - μ̂)^t
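A quick simulation of this bias (synthetic data; the true σ², sample size, and number of trials are arbitrary choices): averaging the 1/n estimate over many repeated samples falls short of σ² by the factor (n - 1)/n, while the 1/(n - 1) estimator does not.

```python
import numpy as np

rng = np.random.default_rng(2)
true_sigma2, n, trials = 4.0, 5, 20000

ml_estimates, unbiased_estimates = [], []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_sigma2), size=n)
    m = x.mean()
    ml_estimates.append(((x - m)**2).sum() / n)          # biased: E[.] = (n-1)/n * sigma^2
    unbiased_estimates.append(((x - m)**2).sum() / (n - 1))

print(np.mean(ml_estimates))        # about 3.2 (= 4 * (n-1)/n with n = 5)
print(np.mean(unbiased_estimates))  # about 4.0
```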
ML Problem Statement
- Let D = {x1, x2, ..., xn}
- For independently drawn samples, P(x1, ..., xn | θ) = ∏_{k=1}^{n} P(xk | θ)
- |D| = n
- Goal: determine θ̂ (the value of θ that makes this sample the most representative)
|D| = n
[Figure: the n training samples x1, ..., xn are partitioned into class sample sets D1, ..., Dk, ..., Dc; the samples in set Dj are drawn from N(μj, Σj) = P(x | θj).]
Problem Statement
- θ = (θ1, θ2, ..., θc)
- Find θ̂ such that θ̂ = argmax_θ P(D | θ) = argmax_θ ∏_{k=1}^{n} P(xk | θ)
Bayesian Decision Theory, Chapter 2 (Sections 2.3-2.5)
- Minimum-Error-Rate Classification
- Classifiers, Discriminant Functions, Decision Surfaces
- The Normal Density
Minimum-Error-Rate Classification
- Actions are decisions on classes
  - If we take action αi and the true state of nature is ωj, then the decision is correct iff i = j (and in error otherwise)
- Seek a decision rule that minimizes the probability of error (aka the error rate)
Zero-one loss function
- λ(αi | ωj) = 0 if i = j and 1 if i ≠ j, for i, j = 1, ..., c
- Conditional risk: R(αi | x) = ∑_j λ(αi | ωj) P(ωj | x) = ∑_{j≠i} P(ωj | x) = 1 - P(ωi | x)
- The risk corresponding to this loss function is the average probability of error
Minimum Error Rate
- Since R(αi | x) = 1 - P(ωi | x), to minimize the risk, maximize P(ωi | x)
- For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i (a sketch of this rule follows below)
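A minimal sketch of this rule (the priors and the univariate Gaussian class-conditionals are made up for illustration): compute something proportional to each posterior via Bayes' rule and decide the class with the largest value.

```python
import numpy as np

# Hypothetical two-class problem with univariate Gaussian class-conditionals.
priors = np.array([0.6, 0.4])                   # P(w1), P(w2)
means = np.array([0.0, 3.0])
sigmas = np.array([1.0, 1.5])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def decide(x):
    # P(wi | x) is proportional to P(x | wi) P(wi); the shared normalizer P(x)
    # does not change the argmax.
    scores = gaussian_pdf(x, means, sigmas) * priors
    return int(np.argmax(scores)) + 1           # 1-based class index

print(decide(0.2), decide(2.5))                 # -> 1 2 for these made-up parameters
```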
Decision Boundary
- With 0/1 loss, decide ω1 if P(x | ω1)/P(x | ω2) > P(ω2)/P(ω1)
- More generally, if λ is the loss function, decide ω1 when the likelihood ratio exceeds the threshold [(λ12 - λ22)/(λ21 - λ11)] P(ω2)/P(ω1); the zero-one loss reduces this threshold to P(ω2)/P(ω1)
Classifiers, Discriminant Functions, and Decision Surfaces
- The multi-category case
  - Set of discriminant functions gi(x), i = 1, ..., c
  - The classifier assigns feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i
Max Discriminant
- Let gi(x) = -R(αi | x) (the max discriminant corresponds to the min risk!)
- For minimum error rate, use gi(x) = P(ωi | x) (the max discriminant corresponds to the max posterior!)
- gi(x) ∝ P(x | ωi) P(ωi)
- Use gi(x) = ln P(x | ωi) + ln P(ωi) (ln: natural logarithm)
Decision Regions
- Divide the feature space into c decision regions
  - If gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means: assign x to ωi)
- Two-category case
  - The classifier is a dichotomizer iff it has two discriminant functions g1 and g2
  - Let g(x) ≡ g1(x) - g2(x)
  - Decide ω1 if g(x) > 0; otherwise decide ω2
Computing g(x)
- g(x) = P(ω1 | x) - P(ω2 | x)
- Equivalently, g(x) = ln [P(x | ω1)/P(x | ω2)] + ln [P(ω1)/P(ω2)] (a sketch follows below)
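One way to make this concrete (a sketch; the two univariate Gaussian class-conditionals and the equal priors are assumed here, not taken from the slides): evaluate g(x) in its log form and decide ω1 whenever g(x) > 0.

```python
import numpy as np

# Assumed parameters for two univariate Gaussian class-conditionals.
mu1, s1, p1 = 0.0, 1.0, 0.5
mu2, s2, p2 = 2.0, 1.0, 0.5

def g(x):
    """Dichotomizer g(x) = g1(x) - g2(x) using the log-discriminants."""
    log_p1 = -0.5 * np.log(2 * np.pi * s1**2) - (x - mu1)**2 / (2 * s1**2)
    log_p2 = -0.5 * np.log(2 * np.pi * s2**2) - (x - mu2)**2 / (2 * s2**2)
    return (log_p1 + np.log(p1)) - (log_p2 + np.log(p2))

# With equal priors and equal variances the boundary is the midpoint of the means.
print(g(0.5) > 0, g(1.5) > 0)   # True, False -> decide w1, then w2
```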
Univariate Normal Density
- p(x) = (1/(√(2π) σ)) exp[-½ ((x - μ)/σ)²]
  where μ = mean (or expected value) of x and σ² = expected squared deviation, or variance
- Continuous density, analytically tractable
- Many processes are asymptotically Gaussian
  - Handwritten characters
  - Speech sounds
  - An ideal or prototype corrupted by a random process (central limit theorem)
Multivariate Normal Density
- The multivariate normal density in d dimensions is
  p(x) = (1/((2π)^(d/2) |Σ|^(1/2))) exp[-½ (x - μ)^t Σ^-1 (x - μ)]
  where
  - x = (x1, x2, ..., xd)^t (t stands for the transpose)
  - μ = (μ1, μ2, ..., μd)^t is the mean vector
  - Σ is the d×d covariance matrix
  - |Σ| and Σ^-1 are its determinant and inverse, respectively
  (a NumPy transcription follows below)
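A direct NumPy transcription of this density (the mean vector and covariance matrix below are illustrative only):

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """p(x) = exp(-0.5 (x-mu)^t Sigma^-1 (x-mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ inv @ diff) / norm_const

mu = np.array([0.0, 1.0])                        # illustrative mean vector
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])       # illustrative 2x2 covariance
print(mvn_density(np.array([0.5, 0.5]), mu, Sigma))
```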
Bayesian Decision Theory III, Chapter 2 (Sections 2.6-2.9)
- Discriminant Functions for the Normal Density
- Bayes Decision Theory: Discrete Features
Discriminant Functions for the Normal Density
- Recall that minimum error-rate classification is achieved by the discriminant function gi(x) = ln P(x | ωi) + ln P(ωi)
- For the multivariate normal: gi(x) = -½ (x - μi)^t Σi^-1 (x - μi) - (d/2) ln 2π - ½ ln |Σi| + ln P(ωi)
Special Case
- Independent variables, constant variance: Σi = σ² I (I: identity matrix)
- Linear discriminant function: gi(x) = wi^t x + wi0, where wi = μi/σ² and wi0 = -μi^t μi/(2σ²) + ln P(ωi)
- wi0 is the threshold (bias) for the ith category (a sketch follows below)
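A sketch of this special case (the class means, shared variance, and priors below are assumed): build wi and wi0 for each class and classify by the largest linear discriminant.

```python
import numpy as np

# Assumed class means, shared variance, and priors for a 3-class example.
mus = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
sigma2 = 1.0
priors = np.array([0.5, 0.25, 0.25])

W = mus / sigma2                                                   # w_i = mu_i / sigma^2
w0 = -np.sum(mus * mus, axis=1) / (2 * sigma2) + np.log(priors)   # thresholds w_i0

def classify(x):
    g = W @ x + w0                 # linear discriminants g_i(x) = w_i^t x + w_i0
    return int(np.argmax(g)) + 1   # 1-based class index

print(classify(np.array([2.5, 0.5])))   # -> 2 for these assumed parameters
```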
Linear Machine
- A classifier that uses linear discriminant functions is called a linear machine
- The decision surfaces for a linear machine are pieces of hyperplanes defined by gi(x) = gj(x)
Classification Region
- The hyperplane separating Ri and Rj passes through the point x0 = ½(μi + μj) - [σ²/‖μi - μj‖²] ln [P(ωi)/P(ωj)] (μi - μj)
- It is always orthogonal to the line linking the means!
- Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)
  - Hyperplane separating Ri and Rj
  - (The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!)
- Case Σi = arbitrary
  - The covariance matrices are different for each category
  - The discriminant functions are quadratic: gi(x) = x^t Wi x + wi^t x + wi0, where Wi = -½ Σi^-1, wi = Σi^-1 μi, and wi0 = -½ μi^t Σi^-1 μi - ½ ln |Σi| + ln P(ωi)
  - The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids (a sketch follows below)
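A sketch of this general case (the per-class means, covariances, and priors are made up): evaluate the quadratic discriminant for each class, dropping the common (d/2) ln 2π term, and pick the larger.

```python
import numpy as np

# Assumed parameters: each class has its own mean and covariance.
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.array([[1.0, 0.0], [0.0, 1.0]]),
          np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

def g(i, x):
    """Quadratic discriminant for class i (the shared (d/2) ln 2*pi term is dropped)."""
    diff = x - mus[i]
    Sinv = np.linalg.inv(Sigmas[i])
    return (-0.5 * diff @ Sinv @ diff
            - 0.5 * np.log(np.linalg.det(Sigmas[i]))
            + np.log(priors[i]))

x = np.array([1.5, 0.5])
print(0 if g(0, x) > g(1, x) else 1)   # index of the winning class
```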
Bayes Decision Theory: Discrete Features
- The components of x are binary or integer valued; x can take only one of m discrete values v1, v2, ..., vm
- Case of independent binary features in a 2-category problem
  - Let x = (x1, x2, ..., xd)^t, where each xi is either 0 or 1, with probabilities
    - pi = P(xi = 1 | ω1)
    - qi = P(xi = 1 | ω2)
- The discriminant function in this case is linear: g(x) = ∑_{i=1}^{d} wi xi + w0, where wi = ln [pi (1 - qi) / (qi (1 - pi))] and w0 = ∑_{i=1}^{d} ln [(1 - pi)/(1 - qi)] + ln [P(ω1)/P(ω2)]
- Decide ω1 if g(x) > 0 and ω2 otherwise (a sketch follows below)
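A sketch of this discriminant (the pi, qi, and priors below are made up): compute the weights wi and the bias w0, then decide ω1 exactly when g(x) > 0.

```python
import numpy as np

# Assumed per-feature probabilities P(x_i = 1 | w1) and P(x_i = 1 | w2), plus priors.
p = np.array([0.8, 0.7, 0.4])
q = np.array([0.3, 0.2, 0.5])
prior1, prior2 = 0.5, 0.5

w = np.log(p * (1 - q) / (q * (1 - p)))                        # per-feature weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)

def g(x):
    """Linear discriminant on binary features; decide w1 iff g(x) > 0."""
    return w @ x + w0

print(g(np.array([1, 1, 0])) > 0, g(np.array([0, 0, 1])) > 0)  # True, False
```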