Lecture 2/3 Bayesian Decision Theory
Transcript and Presenter's Notes
1
Lecture 2/3 Bayesian Decision Theory
2
Outline
  • Bayes Decision Theory
  • Fish example
  • Generalized Bayes decision theory
  • Two category classification
  • Likelihood ratio test
  • Minimum-Error-Rate classification
  • Classifiers, Discriminant Functions, and Decision
    Surfaces
  • Discriminant Functions for the Normal Density
  • Summary

3
Bayes Rule
  • Bayes rule shows how observing the value of x
    changes the a priori probability P(ωj) to the a
    posteriori probability P(ωj | x) as follows
  • P(ωj | x) = p(x | ωj) P(ωj) / p(x)
  • Posterior = (likelihood × prior) / evidence
  • Posterior ∝ likelihood × prior
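A minimal numerical sketch of this rule in Python; the two classes and all numbers (priors, likelihood values at one observed x) are made up for illustration:

```python
# Hypothetical two-class example with made-up numbers, illustrating
# posterior = likelihood * prior / evidence.
priors = {"w1": 2/3, "w2": 1/3}           # P(w_j), assumed
likelihoods = {"w1": 0.5, "w2": 1.5}      # p(x | w_j) at one observed x, assumed

# evidence: p(x) = sum_j p(x | w_j) P(w_j)
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# posterior: P(w_j | x) = p(x | w_j) P(w_j) / p(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
print(posteriors)   # ~{'w1': 0.4, 'w2': 0.6} -> decide w2
```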

4
Fish classification example
  • The sea bass/salmon classification
  • Four terms (prior, likelihood, evidence, posterior)
  • Prior probability of salmon vs. sea bass
    (prior: P(ωi))
  • Class-conditional density of the feature for a
    given class (likelihood: p(x | ωi))
  • Probability density of feature x (evidence: p(x))
  • The probability of class ωi for a given feature
    value x (posterior: P(ωi | x))
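A sketch of the four terms for a 1-D "lightness" feature; the Gaussian class-conditional densities and priors are made up and do not come from the lecture:

```python
from scipy.stats import norm

# Hypothetical 1-D "lightness" feature for the salmon / sea bass example.
priors = {"salmon": 0.6, "sea_bass": 0.4}                 # prior P(w_i), assumed
likelihood = {"salmon": norm(loc=3.0, scale=1.0).pdf,     # likelihood p(x | salmon), assumed
              "sea_bass": norm(loc=5.0, scale=1.0).pdf}   # likelihood p(x | sea bass), assumed

x = 4.2                                                   # one observed lightness value
evidence = sum(likelihood[c](x) * priors[c] for c in priors)              # evidence p(x)
posterior = {c: likelihood[c](x) * priors[c] / evidence for c in priors}  # posterior P(w_i | x)

print(posterior, "-> decide", max(posterior, key=posterior.get))
```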

5
(Figure: error regions E21 and E12)
6
Error Probabilities
  • Decision rule given the posterior probabilities
  • x is an observation for which
  • if P(ω1 | x) > P(ω2 | x), the true state of
    nature is ω1
  • if P(ω1 | x) < P(ω2 | x), the true state of
    nature is ω2
  • Therefore
  • whenever we observe a particular x, the
    probability of error is
  • P(error | x) = P(ω1 | x) if we decide ω2
  • P(error | x) = P(ω2 | x) if we decide ω1

7
  • Minimizing the probability of error
  • Therefore
  • P(error | x) = min[ P(ω1 | x), P(ω2 | x) ]
  • (Bayes decision)
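A rough numerical check of this rule on a grid, reusing the made-up two-Gaussian model from the fish sketch above: the overall error probability of the Bayes decision is the integral of min[P(ω1 | x), P(ω2 | x)] p(x) dx:

```python
import numpy as np
from scipy.stats import norm

# Same made-up two-class 1-D model as in the earlier sketch.
p1, p2 = 0.6, 0.4                      # priors P(w1), P(w2), assumed
f1 = norm(loc=3.0, scale=1.0).pdf      # p(x | w1)
f2 = norm(loc=5.0, scale=1.0).pdf      # p(x | w2)

xs = np.linspace(-3.0, 11.0, 20001)
dx = xs[1] - xs[0]
evidence = f1(xs) * p1 + f2(xs) * p2    # p(x)
post1 = f1(xs) * p1 / evidence          # P(w1 | x)
post2 = f2(xs) * p2 / evidence          # P(w2 | x)

# P(error) = integral of min[P(w1|x), P(w2|x)] p(x) dx  (Bayes error rate)
bayes_error = np.sum(np.minimum(post1, post2) * evidence) * dx
print(f"Bayes error rate = {bayes_error:.4f}")
```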

8
Error Probability and Decision Boundary
  • As the priors change, the decision boundary moves
    accordingly, from the left to the right side

9
Likelihood Ratio Test
10
(No Transcript)
11
Generalized Bayesian Decision Theory
  • Higher feature dimensions
  • Multiple classes
  • Allowing actions rather than only deciding on
    the state of nature
  • Introduce a loss function
  • more general than the probability of error
  • Different types of errors may have different costs

12
  • Allowing actions other than classification
    primarily allows the possibility of rejection
  • Refusing to make a decision in close or bad
    cases!
  • The loss function states how costly each action
    taken is

13
Representations
  • Let {ω1, ω2, ..., ωc} be the set of c states of
    nature (or categories)
  • Let {α1, α2, ..., αa} be the set of possible
    actions
  • Let λ(αi | ωj) be the loss incurred for taking
    action αi when the state of nature is ωj

14
Bayes Risk
  • R = overall risk = sum of all R(αi | x) for
    i = 1, ..., a
  • Minimizing R requires minimizing R(αi | x) for
    i = 1, ..., a
  • Conditional risk:
  • R(αi | x) = Σj λ(αi | ωj) P(ωj | x),
    for i = 1, ..., a
15
  • Select the action αi for which R(αi | x) is
    minimum
  • Then R is minimum, and R in this case is
    called the Bayes risk: the best
    performance that can be achieved!
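A minimal sketch of picking the action with the smallest conditional risk; the loss matrix, the reject action, and the posterior values are all hypothetical:

```python
import numpy as np

# Hypothetical setup: 2 states of nature, 3 actions (decide w1, decide w2, reject).
# lam[i, j] = loss lambda(a_i | w_j) for taking action a_i when the true state is w_j.
lam = np.array([[0.0, 1.0],    # a1: decide w1
                [1.0, 0.0],    # a2: decide w2
                [0.2, 0.2]])   # a3: reject (small fixed cost), assumed

posterior = np.array([0.55, 0.45])      # P(w_j | x) for some observed x, assumed

cond_risk = lam @ posterior             # R(a_i | x) = sum_j lambda(a_i | w_j) P(w_j | x)
best = int(np.argmin(cond_risk))        # Bayes decision: take the minimum-risk action
print("conditional risks:", cond_risk)  # -> [0.45, 0.55, 0.2]: reject has the lowest risk
print("take action index:", best)
```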

16
  • Two-category classification
  • α1 : deciding ω1
  • α2 : deciding ω2
  • λij = λ(αi | ωj)
  • loss incurred for deciding ωi when the true state
    of nature is ωj
  • Conditional risk:
  • R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
  • R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

17
  • Our rule is the following
  • if R(α1 | x) < R(α2 | x)
  • action α1 (decide ω1) is taken
  • This results in the equivalent rule
  • decide ω1 if
  • (λ21 - λ11) p(x | ω1) P(ω1) >
  • (λ12 - λ22) p(x | ω2) P(ω2)
  • and decide ω2 otherwise

18
Likelihood ratio
  • If p(x | ω1) / p(x | ω2) >
    [ (λ12 - λ22) / (λ21 - λ11) ] · P(ω2) / P(ω1)
  • Then take action α1 (decide ω1)
  • Otherwise take action α2 (decide ω2)
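A sketch of this likelihood ratio test with made-up Gaussian densities, priors, and losses (here λ12 > λ21, so deciding ω1 requires stronger evidence):

```python
from scipy.stats import norm

# Made-up two-category model for the likelihood ratio test above.
p_w1, p_w2 = 0.5, 0.5                              # priors, assumed
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0    # losses lambda_ij, assumed
f1 = norm(loc=3.0, scale=1.0).pdf                  # p(x | w1)
f2 = norm(loc=5.0, scale=1.0).pdf                  # p(x | w2)

def decide(x):
    ratio = f1(x) / f2(x)                                         # likelihood ratio
    threshold = (lam12 - lam22) / (lam21 - lam11) * (p_w2 / p_w1)
    return "w1" if ratio > threshold else "w2"

print(decide(3.5), decide(4.5))   # w1 w2
```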

19
Minimum-Error-Rate Classification
  • Actions are decisions on classes
  • If action αi is taken and the true state of
    nature is ωj, then
  • the decision is correct if i = j and in error if
    i ≠ j
  • Seek a decision rule that minimizes the
    probability of error, which is the error rate

20
  • Introduction of the zero-one loss function:
  • λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j
  • Therefore, the conditional risk is
  • R(αi | x) = Σj≠i P(ωj | x) = 1 - P(ωi | x)
  • The risk corresponding to this loss function is
    the average probability of error

21
  • Minimizing the risk requires maximizing P(ωi | x)
  • (since R(αi | x) = 1 - P(ωi | x))
  • For minimum error rate:
  • Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
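A small check that, with the zero-one loss, minimizing the conditional risk gives the same decision as maximizing the posterior; the three-class posterior values are made up:

```python
import numpy as np

posterior = np.array([0.2, 0.5, 0.3])       # P(w_j | x), assumed

lam = 1.0 - np.eye(3)                       # zero-one loss: 0 if i == j, else 1
cond_risk = lam @ posterior                 # R(a_i | x) = 1 - P(w_i | x)

print(cond_risk)                                       # [0.8 0.5 0.7]
print(np.argmin(cond_risk) == np.argmax(posterior))    # True: same decision
```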

22
  • Regions of decision and zero-one loss function,
    therefore
  • If λ is the zero-one loss function, which means

23
(No Transcript)
24
Classifiers, Discriminant Functions and Decision
Surfaces
  • Classifiers can be represented in terms of a set
    of discriminant functions gi(x), i = 1, ..., c
  • The classifier assigns a feature vector x to
    class ωi
  • if gi(x) > gj(x) for all j ≠ i
  • Visually it can be shown in Fig. 2.5
  • Many types: linear, non-linear, high-order,
    parametric, non-parametric, etc.

25
(No Transcript)
26
  • Let gi(x) = -R(αi | x)
  • (max. discriminant corresponds to min. risk!)
  • For the minimum error rate, we take
  • gi(x) = P(ωi | x)
  • (max. discriminant corresponds to max. posterior!)
  • gi(x) = p(x | ωi) P(ωi)
  • gi(x) = ln p(x | ωi) + ln P(ωi)
  • (ln: natural logarithm!)

27
  • Feature space divided into c decision regions
  • if gi(x) > gj(x) for all j ≠ i, then x is in Ri
  • (Ri means: assign x to ωi)
  • The two-category case
  • A classifier is a dichotomizer that has two
    discriminant functions g1 and g2
  • Let g(x) ≡ g1(x) - g2(x)
  • Decide ω1 if g(x) > 0; otherwise decide ω2
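A sketch of a dichotomizer built from the log discriminants gi(x) = ln p(x | ωi) + ln P(ωi); the 1-D Gaussian class-conditional densities and priors are made up:

```python
import numpy as np
from scipy.stats import norm

p = [0.6, 0.4]                                   # priors P(w1), P(w2), assumed
dens = [norm(3.0, 1.0), norm(5.0, 1.0)]          # class-conditional densities, assumed

def g(x, i):
    return dens[i].logpdf(x) + np.log(p[i])      # g_i(x) = ln p(x | w_i) + ln P(w_i)

def classify(x):
    return "w1" if g(x, 0) - g(x, 1) > 0 else "w2"   # g(x) = g1(x) - g2(x)

print([classify(x) for x in (2.0, 3.9, 4.3, 6.0)])   # ['w1', 'w1', 'w2', 'w2']
```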

28
  • The computation of g(x):
  • g(x) = P(ω1 | x) - P(ω2 | x)
  • g(x) = ln [ p(x | ω1) / p(x | ω2) ] +
    ln [ P(ω1) / P(ω2) ]

29
(No Transcript)
30
The Normal Density
  • Univariate density
  • Feature x is a one-dimensional variable
  • Density which is analytically tractable
  • Continuous density
  • A lot of processes are asymptotically Gaussian
  • Handwritten characters or speech sounds can be
    viewed as an ideal prototype corrupted by a random
    process (central limit theorem)
  • p(x) = (1 / (√(2π) σ)) exp[ -(1/2) ((x - μ)/σ)² ]
  • where
  • μ = mean (or expected value) of x
  • σ² = expected squared deviation, or variance
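A direct implementation of this univariate density, checked against SciPy (the μ and σ values are arbitrary):

```python
import math
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    # p(x) = 1 / (sqrt(2*pi) * sigma) * exp(-0.5 * ((x - mu) / sigma)**2)
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
           math.exp(-0.5 * ((x - mu) / sigma) ** 2)

mu, sigma = 3.0, 1.5
x = 4.2
print(gaussian_pdf(x, mu, sigma))          # ~0.1931
print(norm(loc=mu, scale=sigma).pdf(x))    # same value from scipy
```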

31
(No Transcript)
32
Multivariate density
  • p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) ·
    exp[ -(1/2) (x - μ)^t Σ^-1 (x - μ) ]
  • where
  • x = (x1, x2, ..., xd)^t (t stands for the
    transpose vector form)
  • μ = (μ1, μ2, ..., μd)^t = mean vector
  • Σ = d×d covariance matrix
  • |Σ| and Σ^-1 are its determinant and inverse,
    respectively
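A direct implementation of the multivariate density, with made-up 2-D parameters and a SciPy cross-check:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    # p(x) = 1 / ((2*pi)^(d/2) |Sigma|^(1/2)) * exp(-0.5 (x-mu)^t Sigma^-1 (x-mu))
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([1.0, 2.0])                     # mean vector, assumed
Sigma = np.array([[2.0, 0.3],                 # covariance matrix, assumed
                  [0.3, 1.0]])
x = np.array([1.5, 1.0])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))  # same value from scipy
```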

33
Discriminant Functions for the Normal Density
  • The minimum error-rate classification can be
    achieved by the discriminant function
  • gi(x) = ln p(x | ωi) + ln P(ωi)
  • Case of multivariate normal:
  • gi(x) = -(1/2)(x - μi)^t Σi^-1 (x - μi)
    - (d/2) ln 2π - (1/2) ln |Σi| + ln P(ωi)
34
Case 1: Σi = σ²·I
  • All features have equal variance and are
    statistically independent
  • gi(x) = -||x - μi||² / (2σ²) + ln P(ωi),
    which is linear in x:
  • gi(x) = wi^t x + wi0, with wi = μi / σ² and
    wi0 = -μi^t μi / (2σ²) + ln P(ωi)

35
  • A classifier that uses linear discriminant
    functions is called a linear machine
  • The decision surfaces for a linear machine are
    pieces of hyperplanes defined by
  • gi(x) = gj(x)
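A sketch of the Case 1 linear machine, with made-up means, a shared variance σ², and equal priors; with equal priors it reduces to a minimum Euclidean distance classifier:

```python
import numpy as np

sigma2 = 1.5                                     # shared variance, assumed
mus    = [np.array([0.0, 0.0]),                  # assumed class means
          np.array([3.0, 3.0]),
          np.array([0.0, 4.0])]
priors = [1/3, 1/3, 1/3]                         # equal priors, assumed

def g_linear(x, mu, prior):
    w  = mu / sigma2                                 # w_i  = mu_i / sigma^2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)   # w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)
    return w @ x + w0

x = np.array([1.0, 2.5])
scores = [g_linear(x, mus[i], priors[i]) for i in range(3)]
print("decide w%d" % (int(np.argmax(scores)) + 1))

# With equal priors this matches the nearest-mean (Euclidean) decision:
print(np.argmax(scores) == np.argmin([np.linalg.norm(x - m) for m in mus]))  # True
```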

36
(No Transcript)
37
  • The hyperplane separating Ri and Rj is
    always orthogonal to the line linking the means!

38
(No Transcript)
39
(No Transcript)
40
Case 2: Σi = Σ
  • (the covariance matrices of all classes are
    identical but otherwise arbitrary!)
  • gi(x) = -(1/2)(x - μi)^t Σ^-1 (x - μi) + ln P(ωi)
    (a squared Mahalanobis distance term plus the log
    prior), which is again linear in x
  • Hyperplane separating Ri and Rj (the hyperplane
    separating Ri and Rj is generally not orthogonal
    to the line between the means!)
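A sketch of Case 2 with a shared, made-up covariance matrix: with equal priors, the Bayes rule picks the class with the smallest squared Mahalanobis distance:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],                         # shared covariance, assumed
                  [0.8, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]    # assumed class means

def mahalanobis_sq(x, mu):
    diff = x - mu
    return diff @ Sigma_inv @ diff                    # (x - mu)^t Sigma^-1 (x - mu)

x = np.array([1.5, 1.0])
d2 = [mahalanobis_sq(x, m) for m in mus]
print(d2, "-> decide w%d" % (int(np.argmin(d2)) + 1))
```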

41
(No Transcript)
42
(No Transcript)
43
Case 3: Σi arbitrary
  • The covariance matrices are different for each
    category
  • The discriminant functions are quadratic and the
    decision surfaces are hyperquadrics (hyperplanes,
    pairs of hyperplanes, hyperspheres,
    hyperellipsoids, hyperparaboloids,
    hyperhyperboloids)

44
(No Transcript)
45
(No Transcript)
46
Summary
  • Bayes decision theory (BDT) is a general
    framework for statistical pattern recognition
    that allows us to minimize the Bayes risk (the
    cost of wrong decisions and wrong actions)
  • Posterior = (likelihood × prior) / evidence
  • Posterior ∝ likelihood × prior
  • BDT assumes that the conditional densities
    p(x | ωj) and a priori probabilities P(ωj) are
    known
  • Parameter estimation (mean, variance)
  • Non-parametric estimation

47
Key Concepts
  • Bayes rule
  • Prior, likelihood, evidence, posterior
  • Decision rule
  • Decision region
  • Decision boundary
  • Bayes error rates
  • Error probability
  • ROC
  • Generalized BDT
  • Bayes risk
  • Loss function
  • Classifier
  • Discriminant function
  • Linear discriminant function
  • Quadratic discriminant function
  • Minimum distance classifier
  • Distance measures
  • Mahalanobis distance (variance normalized)
  • Euclidean distance

48
Readings
  • Chapter 2, Pattern Classification by Duda, Hart,
    and Stork, 2001, Sections 2.1 - 2.6