1
Pattern Classification
All materials in these slides were taken from Pattern Classification (2nd ed) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

2
Chapter 3: Maximum-Likelihood and Bayesian Parameter Estimation (part 1)
  • Introduction
  • Maximum-Likelihood Estimation
  • Example of a Specific Case
  • Gaussian Case: unknown μ and σ
  • Bias
  • Appendix: ML Problem Statement

3
Introduction
  • Data availability in a Bayesian framework
  • To design an optimal classifier, we need:
  • P(ωi) (priors)
  • P(x | ωi) (class-conditional densities)
  • Unfortunately, we rarely have this complete information!
  • Design a classifier from a training sample
  • The priors are easy to estimate
  • Samples are often too few to estimate the class-conditional densities (large dimension of the feature space!)

4
A priori information about the problem
  • Normality of P(x | ωi):
  • P(x | ωi) ~ N(μi, Σi)
  • Characterized by 2 parameters
  • Estimation techniques
  • Maximum-Likelihood (ML) and Bayesian estimation
  • Results are nearly identical, but the approaches are different

5
ML vs Bayesian Methods
  • ML estimation
  • Parameters are fixed but unknown!
  • Obtain the best parameters by maximizing the probability of obtaining the observed samples: argmax_θ P(D | θ)
  • Bayesian methods
  • View parameters as random variables having some known prior distribution
  • Compute the posterior distribution of the parameters
  • In either approach, the classification rule uses P(ωi | x)

6
Maximum-Likelihood Estimation
  • Good convergence properties as the sample size increases
  • Often simpler than alternative techniques
  • General principle
  • Assume we have c classes and
  • P(x | ωj) ~ N(μj, Σj)
  • P(x | ωj) ≡ P(x | ωj, θj), where θj consists of the components of μj and Σj

7
Details of ML Estimation
  • Use the training samples to estimate θ = (θ1, θ2, ..., θc)
  • θi is associated with category i (i = 1, 2, ..., c)
  • Suppose that D contains n samples: x1, x2, ..., xn
  • The ML estimate of θ is, by definition, the value θ̂ that maximizes P(D | θ)
  • It is the value of θ that best agrees with the actually observed training samples

8
(No Transcript)
9
Optimal Estimation
  • Let θ = (θ1, θ2, ..., θp)^t
  • ∇θ = gradient operator with respect to θ
  • l(θ) = ln P(D | θ) is the log-likelihood function
  • New problem statement:
  • Determine the θ̂ that maximizes the log-likelihood

10
Necessary Conditions for an Optimum
  • ∇θ l = 0
  • Necessary but not sufficient (could be only a local optimum, etc.)
  • Check the 2nd derivative
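
The following is a minimal Python sketch of this principle (not from the slides; the helper name log_likelihood and the grid search are illustrative assumptions). For i.i.d. Gaussian samples with known σ, the value of μ that maximizes l(μ) = ln P(D | μ) should agree with the analytic ML solution (the sample mean) derived on the next slides.

```python
import numpy as np

# Illustrative sketch: maximize the log-likelihood l(mu) = ln P(D | mu) for a
# univariate Gaussian with known sigma, by brute-force search over a grid.
rng = np.random.default_rng(0)
sigma = 1.0
D = rng.normal(loc=2.0, scale=sigma, size=200)          # D = {x_1, ..., x_n}

def log_likelihood(mu, x, sigma):
    # l(mu) = sum_k ln p(x_k | mu) for i.i.d. N(mu, sigma^2) samples
    return np.sum(-0.5 * np.log(2.0 * np.pi * sigma**2)
                  - (x - mu) ** 2 / (2.0 * sigma**2))

mu_grid = np.linspace(0.0, 4.0, 4001)                   # candidate values of mu
mu_hat = mu_grid[np.argmax([log_likelihood(m, D, sigma) for m in mu_grid])]

# The grid maximizer should (up to grid resolution) equal the sample mean,
# i.e. the point where the gradient of l vanishes.
print(f"grid ML estimate: {mu_hat:.3f}   sample mean: {D.mean():.3f}")
```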

11
Specific case: unknown μ
  • P(xi | μ) ~ N(μ, Σ)
  • (Samples are drawn from a multivariate normal population)
  • θ = μ
  • The ML estimate for μ must satisfy: Σk=1..n Σ⁻¹(xk − μ̂) = 0

12
Specific case: unknown μ (cont.)
  • Multiplying by Σ and rearranging: μ̂ = (1/n) Σk=1..n xk
  • Just the arithmetic average of the training samples!
  • Conclusion
  • If P(xk | ωj) (j = 1, 2, ..., c) is d-dimensional Gaussian, then estimate θ = (θ1, θ2, ..., θc)^t to perform optimal classification!

13
ML Estimation (unknown μ and σ)
  • Gaussian case, unknown μ and σ: θ = (θ1, θ2) = (μ, σ²)
  • Per-sample log-likelihood: ln P(xk | θ) = −½ ln(2πθ2) − (xk − θ1)² / (2θ2)
  • Setting ∇θ l = 0 gives
  • (1) Σk (1/θ2)(xk − θ1) = 0
  • (2) Σk [ −1/(2θ2) + (xk − θ1)² / (2θ2²) ] = 0

14
Results
  • Combining (1) and (2), we obtain: μ̂ = (1/n) Σk=1..n xk and σ̂² = (1/n) Σk=1..n (xk − μ̂)²

15
Bias
  • The ML estimate for σ² is biased: E[(1/n) Σk (xk − x̄)²] = ((n − 1)/n) σ² ≠ σ²
  • An elementary unbiased estimator for Σ is the sample covariance matrix C = (1/(n − 1)) Σk=1..n (xk − μ̂)(xk − μ̂)^t
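
A quick numerical check of this bias (illustrative, not part of the original slides): averaging the ML variance estimate over many small samples falls short of the true variance by the factor (n − 1)/n, while the n − 1 version does not.

```python
import numpy as np

# Compare E[ML variance estimate] (divide by n, biased) with the unbiased
# estimator (divide by n - 1) over many repeated small samples.
rng = np.random.default_rng(1)
n, true_var, trials = 10, 4.0, 50_000

ml_var = np.empty(trials)
unbiased_var = np.empty(trials)
for t in range(trials):
    x = rng.normal(0.0, np.sqrt(true_var), size=n)
    ml_var[t] = np.var(x)                # ddof=0: the ML estimate (biased)
    unbiased_var[t] = np.var(x, ddof=1)  # ddof=1: the unbiased estimate

print(f"mean ML estimate:       {ml_var.mean():.3f}  (expect (n-1)/n * 4.0 = 3.6)")
print(f"mean unbiased estimate: {unbiased_var.mean():.3f}  (expect 4.0)")
```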

16
ML Problem Statement
  • Let D = {x1, x2, ..., xn}
  • P(x1, ..., xn | θ) = Πk=1..n P(xk | θ) (independent samples)
  • |D| = n
  • Goal: determine θ̂, the value of θ that makes this sample the most representative

17
[Figure: the set D of n training samples x1, ..., xn is partitioned into class subsets D1, ..., Dk, ..., Dc; the samples of each subset are drawn from the corresponding normal density N(μj, Σj), i.e. P(xj | θj).]
18
Problem Statement
  • θ = (θ1, θ2, ..., θc)
  • Find θ̂ such that it maximizes the likelihood P(D | θ)
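
Assuming, as the figure above suggests, that the samples in Di carry no information about θj for j ≠ i, the problem splits into c independent ML problems, one per class. A minimal NumPy sketch under that assumption (helper names such as ml_gaussian_params are made up for illustration):

```python
import numpy as np

def ml_gaussian_params(X):
    """ML estimates (mu_hat, Sigma_hat) from samples X of shape (n, d)."""
    mu_hat = X.mean(axis=0)
    diff = X - mu_hat
    sigma_hat = diff.T @ diff / X.shape[0]   # divide by n (the ML, biased, estimate)
    return mu_hat, sigma_hat

# Toy two-class training set; in practice D_i holds the samples labeled omega_i.
rng = np.random.default_rng(2)
D1 = rng.multivariate_normal([0.0, 0.0], np.eye(2), size=100)
D2 = rng.multivariate_normal([3.0, 1.0], [[2.0, 0.5], [0.5, 1.0]], size=100)

theta_hat = [ml_gaussian_params(Di) for Di in (D1, D2)]   # (theta_1, theta_2)
for i, (mu, sigma) in enumerate(theta_hat, start=1):
    print(f"class {i}: mu_hat = {mu.round(2)}, Sigma_hat =\n{sigma.round(2)}")
```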

19
Bayesian Decision Theory: Chapter 2 (Sections 2.3-2.5)
  • Minimum-Error-Rate Classification
  • Classifiers, Discriminant Functions, Decision
    Surfaces
  • The Normal Density

20
Minimum-Error-Rate Classification
  • Actions are decisions on classes
  • If we take action αi and the true state of nature is ωj, then
  • the decision is correct iff i = j (otherwise it is in error)
  • Seek a decision rule that minimizes the probability of error (a.k.a. the error rate)

21
Zero-one loss function
  • Zero-one loss: λ(αi | ωj) = 0 if i = j, 1 if i ≠ j (i, j = 1, ..., c)
  • Conditional risk: R(αi | x) = Σj λ(αi | ωj) P(ωj | x) = Σj≠i P(ωj | x) = 1 − P(ωi | x)
  • The risk corresponding to this loss function is the average probability of error

22
Minimum Error Rate
  • Since R(αi | x) = 1 − P(ωi | x), to minimize the risk, maximize P(ωi | x)
  • For minimum error rate:
  • Decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i
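
A minimal sketch of this rule (illustrative numbers, not the book's): given the class-conditional likelihoods and priors for an observed x, compute the posteriors by Bayes' formula and pick the largest.

```python
import numpy as np

def min_error_rate_decision(likelihoods, priors):
    """Return the index i maximizing the posterior P(omega_i | x).

    likelihoods: P(x | omega_i) evaluated at the observed x, one entry per class
    priors:      P(omega_i), one entry per class
    """
    joint = np.asarray(likelihoods) * np.asarray(priors)
    posteriors = joint / joint.sum()          # Bayes: P(omega_i | x)
    return int(np.argmax(posteriors)), posteriors

# Example: two classes with P(x | omega_1) = 0.20 and P(x | omega_2) = 0.05
i, post = min_error_rate_decision([0.20, 0.05], [0.4, 0.6])
print(i, post)   # index 0 (omega_1); posteriors approx. [0.73, 0.27]
```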

23
Decision Boundary
  • Under the 0/1 loss, decide ω1 if the likelihood ratio P(x | ω1) / P(x | ω2) exceeds a threshold θλ
  • For the zero-one loss function this threshold is θλ = P(ω2) / P(ω1)

24
(No Transcript)
25
Classifiers, Discriminant Functions and Decision Surfaces
  • The multi-category case
  • Set of discriminant functions gi(x), i = 1, ..., c
  • The classifier assigns feature vector x to class ωi if gi(x) > gj(x) ∀ j ≠ i

26
(No Transcript)
27
Max Discriminant
  • Let gi(x) = −R(αi | x)
  • (maximum discriminant corresponds to minimum risk!)
  • For minimum error rate, use
  • gi(x) = P(ωi | x)
  • (maximum discriminant corresponds to maximum posterior!)
  • Equivalently, gi(x) = P(x | ωi) P(ωi)
  • or gi(x) = ln P(x | ωi) + ln P(ωi)
  • (ln: natural logarithm)

28
Decision Regions
  • Divide the feature space into c decision regions
  • If gi(x) > gj(x) ∀ j ≠ i, then x is in Ri
  • (x in Ri ⇒ assign x to ωi)
  • Two-category case
  • A classifier is a dichotomizer iff it has exactly two discriminant functions g1 and g2
  • Let g(x) ≡ g1(x) − g2(x)
  • Decide ω1 if g(x) > 0; otherwise decide ω2
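
A hedged sketch of such a dichotomizer (hypothetical univariate Gaussian class-conditionals and priors, chosen only for illustration), using the log form gi(x) = ln P(x | ωi) + ln P(ωi):

```python
import numpy as np

def gaussian_log_density(x, mu, var):
    # ln p(x) for a univariate normal N(mu, var)
    return -0.5 * np.log(2.0 * np.pi * var) - (x - mu) ** 2 / (2.0 * var)

def dichotomizer(x, params1, params2, prior1, prior2):
    """g(x) = g1(x) - g2(x), with g_i(x) = ln P(x | omega_i) + ln P(omega_i)."""
    g1 = gaussian_log_density(x, *params1) + np.log(prior1)
    g2 = gaussian_log_density(x, *params2) + np.log(prior2)
    return g1 - g2

x = 1.2
g = dichotomizer(x, params1=(0.0, 1.0), params2=(2.0, 1.0), prior1=0.5, prior2=0.5)
print("decide omega_1" if g > 0 else "decide omega_2")   # here g < 0, so omega_2
```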

29
Computing g(x)
  • g(x) = P(ω1 | x) − P(ω2 | x), or equivalently (same decision) g(x) = ln [P(x | ω1) / P(x | ω2)] + ln [P(ω1) / P(ω2)]
30
(No Transcript)
31
Univariate Normal Density
  • p(x) = (1 / (√(2π) σ)) exp[ −½ ((x − μ) / σ)² ]
    where μ = mean (or expected value) of x, σ² = expected squared deviation, or variance
  • Continuous density, analytically tractable
  • Many processes are asymptotically Gaussian
  • Handwritten characters
  • Speech sounds
  • An ideal or prototype corrupted by a random process (central limit theorem)

32
(No Transcript)
33
Multivariate Normal Density
  • The multivariate normal density in d dimensions is
  • p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) exp[ −½ (x − μ)^t Σ⁻¹ (x − μ) ]
  • where
  • x = (x1, x2, ..., xd)^t (t denotes the transpose)
  • μ = (μ1, μ2, ..., μd)^t is the mean vector
  • Σ is the d×d covariance matrix
  • |Σ| and Σ⁻¹ are its determinant and inverse, respectively
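
A small sketch evaluating this density directly with NumPy (the helper name is made up; scipy.stats.multivariate_normal.pdf would give the same value):

```python
import numpy as np

def multivariate_normal_pdf(x, mu, sigma):
    """p(x) = exp(-0.5 (x - mu)^t Sigma^-1 (x - mu)) / ((2 pi)^(d/2) |Sigma|^(1/2))."""
    d = len(mu)
    diff = x - mu
    maha_sq = diff @ np.linalg.inv(sigma) @ diff      # squared Mahalanobis distance
    norm = (2.0 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(sigma))
    return np.exp(-0.5 * maha_sq) / norm

mu = np.array([0.0, 0.0])
sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
print(multivariate_normal_pdf(np.array([0.5, -0.5]), mu, sigma))
```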

34
Bayesian Decision Theory III: Chapter 2 (Sections 2.6-2.9)
  • Discriminant Functions for the Normal Density
  • Bayes Decision Theory: Discrete Features

35
Discriminant Functions for the Normal Density
  • Recall that minimum error-rate classification is achieved by the discriminant function gi(x) = ln P(x | ωi) + ln P(ωi)
  • For the multivariate normal density this becomes gi(x) = −½ (x − μi)^t Σi⁻¹ (x − μi) − (d/2) ln 2π − ½ ln |Σi| + ln P(ωi)

36
Special Case
  • Case Σi = σ²I: independent variables with constant variance (I is the identity matrix)
  • The discriminant reduces to a linear discriminant function gi(x) = wi^t x + wi0,
    where wi = μi / σ² and wi0 = −(μi^t μi) / (2σ²) + ln P(ωi)
  • wi0 is the threshold (bias) for the ith category
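
A hedged sketch of this linear discriminant (the class means, σ² and priors below are made-up illustration values): compute gi(x) = wi^t x + wi0 for every class and assign x to the class with the largest value.

```python
import numpy as np

def linear_discriminants(x, means, sigma2, priors):
    """g_i(x) = w_i^t x + w_i0 for the case Sigma_i = sigma^2 I,
    with w_i = mu_i / sigma^2 and w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(omega_i)."""
    scores = []
    for mu, prior in zip(means, priors):
        w = mu / sigma2
        w0 = -mu @ mu / (2.0 * sigma2) + np.log(prior)
        scores.append(w @ x + w0)
    return np.array(scores)

means = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # illustrative class means
g = linear_discriminants(np.array([2.5, 2.0]), means, sigma2=1.0, priors=[0.5, 0.5])
print("assign x to class", int(np.argmax(g)) + 1)      # here: class 2
```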

37
Linear Machine
  • A classifier that uses linear discriminant functions is called a linear machine
  • The decision surfaces for a linear machine are pieces of hyperplanes defined by
  • gi(x) = gj(x)

38
(No Transcript)
39
Classification Region
  • The hyperplane separating Ri and Rj is always orthogonal to the line linking the means!
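
A minimal sketch of that hyperplane (assuming the standard result for the Σi = σ²I case, not shown in the surviving transcript: the boundary has normal w = μi − μj and passes through x0 = ½(μi + μj) − [σ² / ‖μi − μj‖²] ln[P(ωi)/P(ωj)] (μi − μj)):

```python
import numpy as np

def boundary_sigma2I(mu_i, mu_j, sigma2, prior_i, prior_j):
    """Hyperplane w^t (x - x0) = 0 separating R_i and R_j when Sigma_i = sigma^2 I."""
    w = mu_i - mu_j                                     # normal vector: along the line of the means
    shift = sigma2 / np.dot(w, w) * np.log(prior_i / prior_j)
    x0 = 0.5 * (mu_i + mu_j) - shift * w                # a point on the hyperplane
    return w, x0

w, x0 = boundary_sigma2I(np.array([0.0, 0.0]), np.array([4.0, 0.0]),
                         sigma2=1.0, prior_i=0.7, prior_j=0.3)
print(w, x0)   # with unequal priors, x0 moves toward the less probable class's mean
```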

40
(No Transcript)
41
(No Transcript)
42
  • Case Σi = Σ (the covariance matrices of all classes are identical but otherwise arbitrary!)
  • Hyperplane separating Ri and Rj:
  • (the hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!)

43
(No Transcript)
44
(No Transcript)
45
  • Case Σi = arbitrary
  • The covariance matrices are different for each category
  • (The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids)

46
(No Transcript)
47
(No Transcript)
48
Bayes Decision Theory: Discrete Features
  • The components of x are binary or integer valued; x can take only one of m discrete values
  • v1, v2, ..., vm
  • Case of independent binary features in the 2-category problem
  • Let x = [x1, x2, ..., xd]^t, where each xi is either 0 or 1, with probabilities
  • pi = P(xi = 1 | ω1)
  • qi = P(xi = 1 | ω2)

49
  • The discriminant function in this case is g(x) = Σi=1..d wi xi + w0, where
  • wi = ln [ pi (1 − qi) / (qi (1 − pi)) ], i = 1, ..., d
  • w0 = Σi=1..d ln [ (1 − pi) / (1 − qi) ] + ln [ P(ω1) / P(ω2) ]
  • Decide ω1 if g(x) > 0; otherwise decide ω2
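
A short hedged sketch of this discriminant (the pi, qi and priors are illustrative values only):

```python
import numpy as np

def binary_feature_discriminant(x, p, q, prior1, prior2):
    """g(x) = sum_i w_i x_i + w_0 for independent binary features."""
    p, q, x = np.asarray(p), np.asarray(q), np.asarray(x, dtype=float)
    w = np.log(p * (1 - q) / (q * (1 - p)))                        # per-feature weights
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(prior1 / prior2)
    return x @ w + w0

# Illustrative values: the features are more likely to be "on" under omega_1
g = binary_feature_discriminant(x=[1, 0, 1], p=[0.8, 0.6, 0.7],
                                q=[0.3, 0.4, 0.2], prior1=0.5, prior2=0.5)
print("decide omega_1" if g > 0 else "decide omega_2")
```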