Title: Pattern Classification
Pattern Classification
All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (Part 1)
- Introduction
- Maximum-Likelihood Estimation
- Example of a Specific Case
- Gaussian Case: unknown μ and σ
- Bias
- Appendix: ML Problem Statement
Introduction
- Data availability in a Bayesian framework
- To design an optimal classifier, we need:
  - P(ωi) (priors)
  - P(x | ωi) (class-conditional densities)
- Unfortunately, we rarely have this complete information!
- Design a classifier from a training sample
  - Priors are easy to estimate
  - Samples are often too small to estimate the class-conditional densities (large dimension of the feature space!)
A priori information about the problem
- Normality of P(x | ωi): P(x | ωi) ~ N(μi, Σi)
- Characterized by 2 parameters
- Estimation techniques: Maximum-Likelihood (ML) and Bayesian estimation
- Results are nearly identical, but the approaches are different
ML vs. Bayesian Methods
- ML estimation
  - Parameters are fixed but unknown!
  - Obtain the best parameters by maximizing the probability of obtaining the samples observed: θ̂ = argmax_θ P(D | θ)
- Bayesian methods
  - View parameters as random variables having some known prior distribution
  - Compute the POSTERIOR distribution P(θ | D)
- In either approach, the classification rule uses P(ωi | x)
Maximum-Likelihood Estimation
- Good convergence properties as the sample size increases
- Simpler than many alternative techniques
- General principle
  - Assume we have c classes and P(x | ωj) ~ N(μj, Σj)
  - P(x | ωj) ≡ P(x | ωj, θj), where θj consists of the parameters (μj, Σj)
Details of ML Estimation
- Use the training samples to estimate θ = (θ1, θ2, ..., θc), where θi is the parameter vector associated with category i (i = 1, 2, ..., c)
- Suppose that D contains n samples: x1, x2, ..., xn
- The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ) (a small sketch of this maximization follows below)
- It is the value of θ that best agrees with the actually observed training sample
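As a rough illustration of this definition (not from the book; the data and the known-σ assumption below are made up), the sketch evaluates the log-likelihood l(θ) = ln P(D | θ) over a grid of candidate means for a 1-D Gaussian and keeps the candidate that maximizes it:

```python
import numpy as np

# Hypothetical 1-D training sample; sigma is assumed known in this sketch.
rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.0, size=50)
sigma = 1.0

def log_likelihood(mu, data, sigma):
    """l(theta) = ln P(D | theta) for a Gaussian with known sigma."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (data - mu)**2 / (2 * sigma**2))

# ML estimate by brute force: keep the candidate mean that maximizes l(theta).
candidates = np.linspace(-5, 5, 1001)
scores = [log_likelihood(mu, D, sigma) for mu in candidates]
mu_ml = candidates[int(np.argmax(scores))]
print(mu_ml, D.mean())   # the grid argmax is close to the sample mean
```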
Optimal Estimation
- θ = (θ1, θ2, ..., θp)^t
- ∇θ = (∂/∂θ1, ..., ∂/∂θp)^t is the gradient operator
- l(θ) = ln P(D | θ) is the log-likelihood function
- New problem statement: determine the θ that maximizes the log-likelihood, θ̂ = argmax_θ l(θ)
Necessary conditions for an optimum
- ∇θ l = 0
- Not sufficient (could be a local optimum, ...)
- Check the 2nd derivative
Specific case: unknown μ
- P(xi | μ) ~ N(μ, Σ) (samples drawn from a multivariate normal population)
- θ = μ
- The ML estimate for μ must satisfy ∑_{k=1}^{n} Σ^-1 (xk - μ̂) = 0
Specific case: unknown μ (cont.)
- Multiplying by Σ and rearranging: μ̂ = (1/n) ∑_{k=1}^{n} xk
- Just the arithmetic average of the training samples!
- Conclusion: if P(xk | ωj) ~ N(μj, Σj), j = 1, 2, ..., c, is a d-dimensional Gaussian, then estimate the vector θ = (θ1, θ2, ..., θc)^t = (μ1, μ2, ..., μc)^t and use it to perform optimal classification!
ML Estimation (unknown μ and σ)
- Gaussian case with both parameters unknown: θ = (θ1, θ2) = (μ, σ²)
- Setting ∇θ l = 0 gives the two conditions
  (1) ∑_{k=1}^{n} (xk - μ̂)/σ̂² = 0
  (2) -∑_{k=1}^{n} 1/σ̂² + ∑_{k=1}^{n} (xk - μ̂)²/σ̂⁴ = 0
Results
- Combining (1) and (2) gives: μ̂ = (1/n) ∑_{k=1}^{n} xk and σ̂² = (1/n) ∑_{k=1}^{n} (xk - μ̂)² (a numerical sketch follows below)
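A minimal numerical sketch of these closed-form results (NumPy; the sample itself is synthetic):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=3.0, scale=2.0, size=200)    # illustrative sample
n = len(x)

mu_hat = x.sum() / n                            # (1/n) * sum_k x_k
sigma2_hat = ((x - mu_hat)**2).sum() / n        # (1/n) * sum_k (x_k - mu_hat)^2

print(mu_hat, sigma2_hat)
# The same values come from np.mean(x) and np.var(x) (ddof=0 is the ML estimate).
```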
Bias
- The ML estimate for σ² is biased: E[(1/n) ∑_i (xi - x̄)²] = ((n - 1)/n) σ² ≠ σ² (a short simulation of the bias follows below)
- An elementary unbiased estimator for Σ is the sample covariance matrix C = (1/(n - 1)) ∑_{k=1}^{n} (xk - μ̂)(xk - μ̂)^t
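A quick simulation of this bias (synthetic data; the true σ², sample size, and number of trials are arbitrary choices): averaging the 1/n estimate over many repeated samples falls short of σ² by the factor (n - 1)/n, while the 1/(n - 1) estimator does not.

```python
import numpy as np

rng = np.random.default_rng(2)
true_sigma2, n, trials = 4.0, 5, 20000

ml_estimates, unbiased_estimates = [], []
for _ in range(trials):
    x = rng.normal(0.0, np.sqrt(true_sigma2), size=n)
    m = x.mean()
    ml_estimates.append(((x - m)**2).sum() / n)          # biased: E[.] = (n-1)/n * sigma^2
    unbiased_estimates.append(((x - m)**2).sum() / (n - 1))

print(np.mean(ml_estimates))        # about 3.2 (= 4 * (n-1)/n with n = 5)
print(np.mean(unbiased_estimates))  # about 4.0
```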
ML Problem Statement
- Let D = {x1, x2, ..., xn}
- For independently drawn samples, P(x1, ..., xn | θ) = ∏_{k=1}^{n} P(xk | θ)
- |D| = n
- Goal: determine θ̂ (the value of θ that makes this sample the most representative)
|D| = n
[Figure: the n training samples x1, ..., xn are partitioned into class sample sets D1, ..., Dk, ..., Dc; the samples in set Dj are drawn from N(μj, Σj) = P(x | θj).]
Problem Statement
- θ = (θ1, θ2, ..., θc)
- Find θ̂ such that θ̂ = argmax_θ P(D | θ) = argmax_θ ∏_{k=1}^{n} P(xk | θ)
Bayesian Decision Theory, Chapter 2 (Sections 2.3-2.5)
- Minimum-Error-Rate Classification
- Classifiers, Discriminant Functions, Decision Surfaces
- The Normal Density
Minimum-Error-Rate Classification
- Actions are decisions on classes
  - If we take action αi and the true state of nature is ωj, then the decision is correct iff i = j (and in error otherwise)
- Seek a decision rule that minimizes the probability of error (aka the error rate)
Zero-one loss function
- λ(αi | ωj) = 0 if i = j and 1 if i ≠ j, for i, j = 1, ..., c
- Conditional risk: R(αi | x) = ∑_j λ(αi | ωj) P(ωj | x) = ∑_{j≠i} P(ωj | x) = 1 - P(ωi | x)
- The risk corresponding to this loss function is the average probability of error
Minimum Error Rate
- Since R(αi | x) = 1 - P(ωi | x), to minimize the risk, maximize P(ωi | x)
- For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i (a sketch of this rule follows below)
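A minimal sketch of this rule (the priors and the univariate Gaussian class-conditionals are made up for illustration): compute something proportional to each posterior via Bayes' rule and decide the class with the largest value.

```python
import numpy as np

# Hypothetical two-class problem with univariate Gaussian class-conditionals.
priors = np.array([0.6, 0.4])                   # P(w1), P(w2)
means = np.array([0.0, 3.0])
sigmas = np.array([1.0, 1.5])

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

def decide(x):
    # P(wi | x) is proportional to P(x | wi) P(wi); the shared normalizer P(x)
    # does not change the argmax.
    scores = gaussian_pdf(x, means, sigmas) * priors
    return int(np.argmax(scores)) + 1           # 1-based class index

print(decide(0.2), decide(2.5))                 # -> 1 2 for these made-up parameters
```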
Decision Boundary
- With 0/1 loss, decide ω1 if P(x | ω1)/P(x | ω2) > P(ω2)/P(ω1)
- More generally, if λ is the loss function, decide ω1 when the likelihood ratio exceeds the threshold [(λ12 - λ22)/(λ21 - λ11)] P(ω2)/P(ω1); the zero-one loss reduces this threshold to P(ω2)/P(ω1)
Classifiers, Discriminant Functions, and Decision Surfaces
- The multi-category case
  - Set of discriminant functions gi(x), i = 1, ..., c
  - The classifier assigns feature vector x to class ωi if gi(x) > gj(x) for all j ≠ i
Max Discriminant
- Let gi(x) = -R(αi | x) (the max discriminant corresponds to the min risk!)
- For minimum error rate, use gi(x) = P(ωi | x) (the max discriminant corresponds to the max posterior!)
- gi(x) ∝ P(x | ωi) P(ωi)
- Use gi(x) = ln P(x | ωi) + ln P(ωi) (ln: natural logarithm)
Decision Regions
- Divide the feature space into c decision regions
  - If gi(x) > gj(x) for all j ≠ i, then x is in Ri (Ri means: assign x to ωi)
- Two-category case
  - The classifier is a dichotomizer iff it has two discriminant functions g1 and g2
  - Let g(x) ≡ g1(x) - g2(x)
  - Decide ω1 if g(x) > 0; otherwise decide ω2
Computing g(x)
- g(x) = P(ω1 | x) - P(ω2 | x)
- Equivalently, g(x) = ln [P(x | ω1)/P(x | ω2)] + ln [P(ω1)/P(ω2)] (a sketch follows below)
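One way to make this concrete (a sketch; the two univariate Gaussian class-conditionals and the equal priors are assumed here, not taken from the slides): evaluate g(x) in its log form and decide ω1 whenever g(x) > 0.

```python
import numpy as np

# Assumed parameters for two univariate Gaussian class-conditionals.
mu1, s1, p1 = 0.0, 1.0, 0.5
mu2, s2, p2 = 2.0, 1.0, 0.5

def g(x):
    """Dichotomizer g(x) = g1(x) - g2(x) using the log-discriminants."""
    log_p1 = -0.5 * np.log(2 * np.pi * s1**2) - (x - mu1)**2 / (2 * s1**2)
    log_p2 = -0.5 * np.log(2 * np.pi * s2**2) - (x - mu2)**2 / (2 * s2**2)
    return (log_p1 + np.log(p1)) - (log_p2 + np.log(p2))

# With equal priors and equal variances the boundary is the midpoint of the means.
print(g(0.5) > 0, g(1.5) > 0)   # True, False -> decide w1, then w2
```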
Univariate Normal Density
- p(x) = (1/(√(2π) σ)) exp[-½ ((x - μ)/σ)²]
  where μ = mean (or expected value) of x and σ² = expected squared deviation, or variance
- Continuous density, analytically tractable
- Many processes are asymptotically Gaussian
  - Handwritten characters
  - Speech sounds
  - An ideal or prototype corrupted by a random process (central limit theorem)
Multivariate Normal Density
- The multivariate normal density in d dimensions is
  p(x) = (1/((2π)^(d/2) |Σ|^(1/2))) exp[-½ (x - μ)^t Σ^-1 (x - μ)]
  where
  - x = (x1, x2, ..., xd)^t (t stands for the transpose)
  - μ = (μ1, μ2, ..., μd)^t is the mean vector
  - Σ is the d×d covariance matrix
  - |Σ| and Σ^-1 are its determinant and inverse, respectively
  (a NumPy transcription follows below)
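A direct NumPy transcription of this density (the mean vector and covariance matrix below are illustrative only):

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """p(x) = exp(-0.5 (x-mu)^t Sigma^-1 (x-mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))."""
    d = len(mu)
    diff = x - mu
    inv = np.linalg.inv(Sigma)
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ inv @ diff) / norm_const

mu = np.array([0.0, 1.0])                        # illustrative mean vector
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])       # illustrative 2x2 covariance
print(mvn_density(np.array([0.5, 0.5]), mu, Sigma))
```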
Bayesian Decision Theory III, Chapter 2 (Sections 2.6-2.9)
- Discriminant Functions for the Normal Density
- Bayes Decision Theory: Discrete Features
Discriminant Functions for the Normal Density
- Recall that minimum error-rate classification is achieved by the discriminant function gi(x) = ln P(x | ωi) + ln P(ωi)
- For the multivariate normal: gi(x) = -½ (x - μi)^t Σi^-1 (x - μi) - (d/2) ln 2π - ½ ln |Σi| + ln P(ωi)
Special Case
- Independent variables, constant variance: Σi = σ² I (I: identity matrix)
- Linear discriminant function: gi(x) = wi^t x + wi0, where wi = μi/σ² and wi0 = -μi^t μi/(2σ²) + ln P(ωi)
- wi0 is the threshold (bias) for the ith category (a sketch follows below)
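A sketch of this special case (the class means, shared variance, and priors below are assumed): build wi and wi0 for each class and classify by the largest linear discriminant.

```python
import numpy as np

# Assumed class means, shared variance, and priors for a 3-class example.
mus = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
sigma2 = 1.0
priors = np.array([0.5, 0.25, 0.25])

W = mus / sigma2                                                   # w_i = mu_i / sigma^2
w0 = -np.sum(mus * mus, axis=1) / (2 * sigma2) + np.log(priors)   # thresholds w_i0

def classify(x):
    g = W @ x + w0                 # linear discriminants g_i(x) = w_i^t x + w_i0
    return int(np.argmax(g)) + 1   # 1-based class index

print(classify(np.array([2.5, 0.5])))   # -> 2 for these assumed parameters
```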
Linear Machine
- A classifier that uses linear discriminant functions is called a linear machine
- The decision surfaces for a linear machine are pieces of hyperplanes defined by gi(x) = gj(x)
Classification Region
- The hyperplane separating Ri and Rj passes through the point x0 = ½(μi + μj) - [σ²/‖μi - μj‖²] ln [P(ωi)/P(ωj)] (μi - μj)
- It is always orthogonal to the line linking the means!
- Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)
  - Hyperplane separating Ri and Rj
  - (The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!)
- Case Σi = arbitrary
  - The covariance matrices are different for each category
  - The discriminant functions are quadratic: gi(x) = x^t Wi x + wi^t x + wi0, where Wi = -½ Σi^-1, wi = Σi^-1 μi, and wi0 = -½ μi^t Σi^-1 μi - ½ ln |Σi| + ln P(ωi)
  - The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, and hyperhyperboloids (a sketch follows below)
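A sketch of this general case (the per-class means, covariances, and priors are made up): evaluate the quadratic discriminant for each class, dropping the common (d/2) ln 2π term, and pick the larger.

```python
import numpy as np

# Assumed parameters: each class has its own mean and covariance.
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
Sigmas = [np.array([[1.0, 0.0], [0.0, 1.0]]),
          np.array([[2.0, 0.5], [0.5, 1.0]])]
priors = [0.5, 0.5]

def g(i, x):
    """Quadratic discriminant for class i (the shared (d/2) ln 2*pi term is dropped)."""
    diff = x - mus[i]
    Sinv = np.linalg.inv(Sigmas[i])
    return (-0.5 * diff @ Sinv @ diff
            - 0.5 * np.log(np.linalg.det(Sigmas[i]))
            + np.log(priors[i]))

x = np.array([1.5, 0.5])
print(0 if g(0, x) > g(1, x) else 1)   # index of the winning class
```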
Bayes Decision Theory: Discrete Features
- The components of x are binary or integer valued; x can take only one of m discrete values v1, v2, ..., vm
- Case of independent binary features in a 2-category problem
  - Let x = (x1, x2, ..., xd)^t, where each xi is either 0 or 1, with probabilities
    - pi = P(xi = 1 | ω1)
    - qi = P(xi = 1 | ω2)
- The discriminant function in this case is linear: g(x) = ∑_{i=1}^{d} wi xi + w0, where wi = ln [pi (1 - qi) / (qi (1 - pi))] and w0 = ∑_{i=1}^{d} ln [(1 - pi)/(1 - qi)] + ln [P(ω1)/P(ω2)]
- Decide ω1 if g(x) > 0 and ω2 otherwise (a sketch follows below)
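A sketch of this discriminant (the pi, qi, and priors below are made up): compute the weights wi and the bias w0, then decide ω1 exactly when g(x) > 0.

```python
import numpy as np

# Assumed per-feature probabilities P(x_i = 1 | w1) and P(x_i = 1 | w2), plus priors.
p = np.array([0.8, 0.7, 0.4])
q = np.array([0.3, 0.2, 0.5])
prior1, prior2 = 0.5, 0.5

w = np.log(p * (1 - q) / (q * (1 - p)))                        # per-feature weights w_i
w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)

def g(x):
    """Linear discriminant on binary features; decide w1 iff g(x) > 0."""
    return w @ x + w0

print(g(np.array([1, 1, 0])) > 0, g(np.array([0, 0, 1])) > 0)  # True, False
```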