Lecture 2/3 Bayesian Decision Theory
Transcript and Presenter's Notes
1
Lecture 2/3 Bayesian Decision Theory
2
Outline
  • Bayes Decision Theory
  • Fish example
  • Generalized Bayes decision theory
  • Two category classification
  • Likelihood ratio test
  • Minimum-Error-Rate classification
  • Classifiers, Discriminant Functions, and Decision
    Surfaces
  • Discriminant Functions for the Normal Density
  • Summary

3
Bayes Rule
  • Bayes rule shows how observing the value of x
    changes the a priori probability P(ωj) to the a
    posteriori probability P(ωj | x) as follows
  • P(ωj | x) = p(x | ωj) P(ωj) / p(x)
  • Posterior = (likelihood × prior) / evidence
  • Posterior ∝ likelihood × prior
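A minimal numerical sketch of this rule in Python; the two classes and all numbers (priors, likelihood values at one observed x) are made up for illustration:

```python
# Hypothetical two-class example with made-up numbers, illustrating
# posterior = likelihood * prior / evidence.
priors = {"w1": 2/3, "w2": 1/3}           # P(w_j), assumed
likelihoods = {"w1": 0.5, "w2": 1.5}      # p(x | w_j) at one observed x, assumed

# evidence: p(x) = sum_j p(x | w_j) P(w_j)
evidence = sum(likelihoods[c] * priors[c] for c in priors)

# posterior: P(w_j | x) = p(x | w_j) P(w_j) / p(x)
posteriors = {c: likelihoods[c] * priors[c] / evidence for c in priors}
print(posteriors)   # ~{'w1': 0.4, 'w2': 0.6} -> decide w2
```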

4
Fish classification example
  • The sea bass/salmon classification
  • Four terms (prior, likelihood, evidence, posterior)
  • Prior probability of salmon vs. sea bass
    (prior: P(ωi))
  • Class-conditional density of the feature for a
    given class (likelihood: p(x | ωi))
  • Probability density of feature x (evidence: p(x))
  • The probability of class ωi for a given feature
    value x (posterior: P(ωi | x))
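A sketch of the four terms for a 1-D "lightness" feature; the Gaussian class-conditional densities and priors are made up and do not come from the lecture:

```python
from scipy.stats import norm

# Hypothetical 1-D "lightness" feature for the salmon / sea bass example.
priors = {"salmon": 0.6, "sea_bass": 0.4}                 # prior P(w_i), assumed
likelihood = {"salmon": norm(loc=3.0, scale=1.0).pdf,     # likelihood p(x | salmon), assumed
              "sea_bass": norm(loc=5.0, scale=1.0).pdf}   # likelihood p(x | sea bass), assumed

x = 4.2                                                   # one observed lightness value
evidence = sum(likelihood[c](x) * priors[c] for c in priors)              # evidence p(x)
posterior = {c: likelihood[c](x) * priors[c] / evidence for c in priors}  # posterior P(w_i | x)

print(posterior, "-> decide", max(posterior, key=posterior.get))
```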

5
(Figure: error regions E21 and E12)
6
Error Probabilities
  • Decision rule given the posterior probabilities
  • x is an observation for which
  • if P(ω1 | x) > P(ω2 | x), the true state of
    nature is ω1
  • if P(ω1 | x) < P(ω2 | x), the true state of
    nature is ω2
  • Therefore
  • whenever we observe a particular x, the
    probability of error is
  • P(error | x) = P(ω1 | x) if we decide ω2
  • P(error | x) = P(ω2 | x) if we decide ω1

7
  • Minimizing the probability of error
  • Therefore
  • P(error | x) = min[ P(ω1 | x), P(ω2 | x) ]
  • (Bayes decision)
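A rough numerical check of this rule on a grid, reusing the made-up two-Gaussian model from the fish sketch above: the overall error probability of the Bayes decision is the integral of min[P(ω1 | x), P(ω2 | x)] p(x) dx:

```python
import numpy as np
from scipy.stats import norm

# Same made-up two-class 1-D model as in the earlier sketch.
p1, p2 = 0.6, 0.4                      # priors P(w1), P(w2), assumed
f1 = norm(loc=3.0, scale=1.0).pdf      # p(x | w1)
f2 = norm(loc=5.0, scale=1.0).pdf      # p(x | w2)

xs = np.linspace(-3.0, 11.0, 20001)
dx = xs[1] - xs[0]
evidence = f1(xs) * p1 + f2(xs) * p2    # p(x)
post1 = f1(xs) * p1 / evidence          # P(w1 | x)
post2 = f2(xs) * p2 / evidence          # P(w2 | x)

# P(error) = integral of min[P(w1|x), P(w2|x)] p(x) dx  (Bayes error rate)
bayes_error = np.sum(np.minimum(post1, post2) * evidence) * dx
print(f"Bayes error rate = {bayes_error:.4f}")
```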

8
Error Probability and Decision Boundary
  • As the priors change, the decision boundary moves
    accordingly, from the left to the right side

9
Likelihood Ratio Test
10
(No Transcript)
11
Generalized Bayesian Decision Theory
  • Higher feature dimensions
  • Multiple classes
  • Allowing actions rather than only deciding on
    the state of nature
  • Introduce a loss function
  • more general than the probability of error
  • Different types of errors may have different costs

12
  • Allowing actions other than classification
    primarily allows the possibility of rejection
  • Refusing to make a decision in close or bad
    cases!
  • The loss function states how costly each action
    taken is

13
Representations
  • Let {ω1, ω2, ..., ωc} be the set of c states of
    nature (or categories)
  • Let {α1, α2, ..., αa} be the set of possible
    actions
  • Let λ(αi | ωj) be the loss incurred for taking
    action αi when the state of nature is ωj

14
Bayes Risk
  • R = overall risk = sum of all R(αi | x) for
    i = 1, ..., a
  • Minimizing R requires minimizing R(αi | x) for
    i = 1, ..., a
  • Conditional risk:
  • R(αi | x) = Σj λ(αi | ωj) P(ωj | x),
    for i = 1, ..., a
15
  • Select the action αi for which R(αi | x) is
    minimum
  • Then R is minimum, and R in this case is
    called the Bayes risk: the best
    performance that can be achieved!
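A minimal sketch of picking the action with the smallest conditional risk; the loss matrix, the reject action, and the posterior values are all hypothetical:

```python
import numpy as np

# Hypothetical setup: 2 states of nature, 3 actions (decide w1, decide w2, reject).
# lam[i, j] = loss lambda(a_i | w_j) for taking action a_i when the true state is w_j.
lam = np.array([[0.0, 1.0],    # a1: decide w1
                [1.0, 0.0],    # a2: decide w2
                [0.2, 0.2]])   # a3: reject (small fixed cost), assumed

posterior = np.array([0.55, 0.45])      # P(w_j | x) for some observed x, assumed

cond_risk = lam @ posterior             # R(a_i | x) = sum_j lambda(a_i | w_j) P(w_j | x)
best = int(np.argmin(cond_risk))        # Bayes decision: take the minimum-risk action
print("conditional risks:", cond_risk)  # -> [0.45, 0.55, 0.2]: reject has the lowest risk
print("take action index:", best)
```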

16
  • Two-category classification
  • α1 : deciding ω1
  • α2 : deciding ω2
  • λij = λ(αi | ωj)
  • loss incurred for deciding ωi when the true state
    of nature is ωj
  • Conditional risk:
  • R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
  • R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

17
  • Our rule is the following
  • if R(α1 | x) < R(α2 | x)
  • action α1 (decide ω1) is taken
  • This results in the equivalent rule
  • decide ω1 if
  • (λ21 - λ11) p(x | ω1) P(ω1) >
  • (λ12 - λ22) p(x | ω2) P(ω2)
  • and decide ω2 otherwise

18
Likelihood ratio
  • If p(x | ω1) / p(x | ω2) >
    [ (λ12 - λ22) / (λ21 - λ11) ] · P(ω2) / P(ω1)
  • Then take action α1 (decide ω1)
  • Otherwise take action α2 (decide ω2)
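A sketch of this likelihood ratio test with made-up Gaussian densities, priors, and losses (here λ12 > λ21, so deciding ω1 requires stronger evidence):

```python
from scipy.stats import norm

# Made-up two-category model for the likelihood ratio test above.
p_w1, p_w2 = 0.5, 0.5                              # priors, assumed
lam11, lam12, lam21, lam22 = 0.0, 2.0, 1.0, 0.0    # losses lambda_ij, assumed
f1 = norm(loc=3.0, scale=1.0).pdf                  # p(x | w1)
f2 = norm(loc=5.0, scale=1.0).pdf                  # p(x | w2)

def decide(x):
    ratio = f1(x) / f2(x)                                         # likelihood ratio
    threshold = (lam12 - lam22) / (lam21 - lam11) * (p_w2 / p_w1)
    return "w1" if ratio > threshold else "w2"

print(decide(3.5), decide(4.5))   # w1 w2
```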

19
Minimum-Error-Rate Classification
  • Actions are decisions on classes
  • If action αi is taken and the true state of
    nature is ωj, then
  • the decision is correct if i = j and in error if
    i ≠ j
  • Seek a decision rule that minimizes the
    probability of error, which is the error rate

20
  • Introduction of the zero-one loss function:
  • λ(αi | ωj) = 0 if i = j, and 1 if i ≠ j
  • Therefore, the conditional risk is
  • R(αi | x) = Σj≠i P(ωj | x) = 1 - P(ωi | x)
  • The risk corresponding to this loss function is
    the average probability of error

21
  • Minimizing the risk requires maximizing P(ωi | x)
  • (since R(αi | x) = 1 - P(ωi | x))
  • For minimum error rate:
  • Decide ωi if P(ωi | x) > P(ωj | x) for all j ≠ i
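A small check that, with the zero-one loss, minimizing the conditional risk gives the same decision as maximizing the posterior; the three-class posterior values are made up:

```python
import numpy as np

posterior = np.array([0.2, 0.5, 0.3])       # P(w_j | x), assumed

lam = 1.0 - np.eye(3)                       # zero-one loss: 0 if i == j, else 1
cond_risk = lam @ posterior                 # R(a_i | x) = 1 - P(w_i | x)

print(cond_risk)                                       # [0.8 0.5 0.7]
print(np.argmin(cond_risk) == np.argmax(posterior))    # True: same decision
```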

22
  • Regions of decision and zero-one loss function,
    therefore
  • If λ is the zero-one loss function, which means

23
(No Transcript)
24
Classifiers, Discriminant Functions and Decision
Surfaces
  • Classifiers can be represented in terms of a set
    of discriminant functions gi(x), i = 1, ..., c
  • The classifier assigns a feature vector x to
    class ωi
  • if gi(x) > gj(x) for all j ≠ i
  • Visually it can be shown in Fig. 2.5
  • Many types: linear, non-linear, high-order,
    parametric, non-parametric, etc.

25
(No Transcript)
26
  • Let gi(x) = -R(αi | x)
  • (max. discriminant corresponds to min. risk!)
  • For the minimum error rate, we take
  • gi(x) = P(ωi | x)
  • (max. discriminant corresponds to max. posterior!)
  • gi(x) = p(x | ωi) P(ωi)
  • gi(x) = ln p(x | ωi) + ln P(ωi)
  • (ln: natural logarithm!)

27
  • Feature space divided into c decision regions
  • if gi(x) > gj(x) for all j ≠ i, then x is in Ri
  • (Ri means: assign x to ωi)
  • The two-category case
  • A classifier is a dichotomizer that has two
    discriminant functions g1 and g2
  • Let g(x) ≡ g1(x) - g2(x)
  • Decide ω1 if g(x) > 0; otherwise decide ω2
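A sketch of a dichotomizer built from the log discriminants gi(x) = ln p(x | ωi) + ln P(ωi); the 1-D Gaussian class-conditional densities and priors are made up:

```python
import numpy as np
from scipy.stats import norm

p = [0.6, 0.4]                                   # priors P(w1), P(w2), assumed
dens = [norm(3.0, 1.0), norm(5.0, 1.0)]          # class-conditional densities, assumed

def g(x, i):
    return dens[i].logpdf(x) + np.log(p[i])      # g_i(x) = ln p(x | w_i) + ln P(w_i)

def classify(x):
    return "w1" if g(x, 0) - g(x, 1) > 0 else "w2"   # g(x) = g1(x) - g2(x)

print([classify(x) for x in (2.0, 3.9, 4.3, 6.0)])   # ['w1', 'w1', 'w2', 'w2']
```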

28
  • The computation of g(x):
  • g(x) = P(ω1 | x) - P(ω2 | x)
  • g(x) = ln [ p(x | ω1) / p(x | ω2) ] +
    ln [ P(ω1) / P(ω2) ]

29
(No Transcript)
30
The Normal Density
  • Univariate density
  • Feature x is a one-dimensional variable
  • Density which is analytically tractable
  • Continuous density
  • A lot of processes are asymptotically Gaussian
  • Handwritten characters or speech sounds can be
    viewed as an ideal prototype corrupted by a random
    process (central limit theorem)
  • p(x) = (1 / (√(2π) σ)) exp[ -(1/2) ((x - μ)/σ)² ]
  • where
  • μ = mean (or expected value) of x
  • σ² = expected squared deviation, or variance
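A direct implementation of this univariate density, checked against SciPy (the μ and σ values are arbitrary):

```python
import math
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    # p(x) = 1 / (sqrt(2*pi) * sigma) * exp(-0.5 * ((x - mu) / sigma)**2)
    return (1.0 / (math.sqrt(2 * math.pi) * sigma)) * \
           math.exp(-0.5 * ((x - mu) / sigma) ** 2)

mu, sigma = 3.0, 1.5
x = 4.2
print(gaussian_pdf(x, mu, sigma))          # ~0.1931
print(norm(loc=mu, scale=sigma).pdf(x))    # same value from scipy
```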

31
(No Transcript)
32
Multivariate density
  • p(x) = 1 / ((2π)^(d/2) |Σ|^(1/2)) ·
    exp[ -(1/2) (x - μ)^t Σ^-1 (x - μ) ]
  • where
  • x = (x1, x2, ..., xd)^t (t stands for the
    transpose vector form)
  • μ = (μ1, μ2, ..., μd)^t = mean vector
  • Σ = d×d covariance matrix
  • |Σ| and Σ^-1 are its determinant and inverse,
    respectively
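A direct implementation of the multivariate density, with made-up 2-D parameters and a SciPy cross-check:

```python
import numpy as np
from scipy.stats import multivariate_normal

def mvn_pdf(x, mu, Sigma):
    # p(x) = 1 / ((2*pi)^(d/2) |Sigma|^(1/2)) * exp(-0.5 (x-mu)^t Sigma^-1 (x-mu))
    d = len(mu)
    diff = x - mu
    norm_const = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))
    return norm_const * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

mu = np.array([1.0, 2.0])                     # mean vector, assumed
Sigma = np.array([[2.0, 0.3],                 # covariance matrix, assumed
                  [0.3, 1.0]])
x = np.array([1.5, 1.0])

print(mvn_pdf(x, mu, Sigma))
print(multivariate_normal(mu, Sigma).pdf(x))  # same value from scipy
```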

33
Discriminant Functions for the Normal Density
  • The minimum error-rate classification can be
    achieved by the discriminant function
  • gi(x) = ln p(x | ωi) + ln P(ωi)
  • Case of multivariate normal:
  • gi(x) = -(1/2)(x - μi)^t Σi^-1 (x - μi)
    - (d/2) ln 2π - (1/2) ln |Σi| + ln P(ωi)
34
Case 1: Σi = σ²·I
  • All features have equal variance and are
    statistically independent
  • gi(x) = -||x - μi||² / (2σ²) + ln P(ωi),
    which is linear in x:
  • gi(x) = wi^t x + wi0, with wi = μi / σ² and
    wi0 = -μi^t μi / (2σ²) + ln P(ωi)

35
  • A classifier that uses linear discriminant
    functions is called a linear machine
  • The decision surfaces for a linear machine are
    pieces of hyperplanes defined by
  • gi(x) = gj(x)
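A sketch of the Case 1 linear machine, with made-up means, a shared variance σ², and equal priors; with equal priors it reduces to a minimum Euclidean distance classifier:

```python
import numpy as np

sigma2 = 1.5                                     # shared variance, assumed
mus    = [np.array([0.0, 0.0]),                  # assumed class means
          np.array([3.0, 3.0]),
          np.array([0.0, 4.0])]
priors = [1/3, 1/3, 1/3]                         # equal priors, assumed

def g_linear(x, mu, prior):
    w  = mu / sigma2                                 # w_i  = mu_i / sigma^2
    w0 = -(mu @ mu) / (2 * sigma2) + np.log(prior)   # w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)
    return w @ x + w0

x = np.array([1.0, 2.5])
scores = [g_linear(x, mus[i], priors[i]) for i in range(3)]
print("decide w%d" % (int(np.argmax(scores)) + 1))

# With equal priors this matches the nearest-mean (Euclidean) decision:
print(np.argmax(scores) == np.argmin([np.linalg.norm(x - m) for m in mus]))  # True
```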

36
(No Transcript)
37
  • The hyperplane separating Ri and Rj is
    always orthogonal to the line linking the means!

38
(No Transcript)
39
(No Transcript)
40
Case 2: Σi = Σ
  • (the covariance matrices of all classes are
    identical but otherwise arbitrary!)
  • gi(x) = -(1/2)(x - μi)^t Σ^-1 (x - μi) + ln P(ωi)
    (a squared Mahalanobis distance term plus the log
    prior), which is again linear in x
  • Hyperplane separating Ri and Rj (the hyperplane
    separating Ri and Rj is generally not orthogonal
    to the line between the means!)
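A sketch of Case 2 with a shared, made-up covariance matrix: with equal priors, the Bayes rule picks the class with the smallest squared Mahalanobis distance:

```python
import numpy as np

Sigma = np.array([[2.0, 0.8],                         # shared covariance, assumed
                  [0.8, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]    # assumed class means

def mahalanobis_sq(x, mu):
    diff = x - mu
    return diff @ Sigma_inv @ diff                    # (x - mu)^t Sigma^-1 (x - mu)

x = np.array([1.5, 1.0])
d2 = [mahalanobis_sq(x, m) for m in mus]
print(d2, "-> decide w%d" % (int(np.argmin(d2)) + 1))
```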

41
(No Transcript)
42
(No Transcript)
43
Case 3: Σi arbitrary
  • The covariance matrices are different for each
    category
  • The discriminant functions are quadratic and the
    decision surfaces are hyperquadrics (hyperplanes,
    pairs of hyperplanes, hyperspheres,
    hyperellipsoids, hyperparaboloids,
    hyperhyperboloids)

44
(No Transcript)
45
(No Transcript)
46
Summary
  • Bayes decision theory (BDT) is a general
    framework for statistical pattern recognition
    that allows us to minimize the Bayes risk (the
    cost of wrong decisions and wrong actions)
  • Posterior = (likelihood × prior) / evidence
  • Posterior ∝ likelihood × prior
  • BDT assumes that the conditional densities
    p(x | ωj) and a priori probabilities P(ωj) are
    known
  • Parameter estimation (mean, variance)
  • Non-parametric estimation

47
Key Concepts
  • Bayes rule
  • Prior, likelihood, evidence, posterior
  • Decision rule
  • Decision region
  • Decision boundary
  • Bayes error rates
  • Error probability
  • ROC
  • Generalized BDT
  • Bayes risk
  • Loss function
  • Classifier
  • Discriminant function
  • Linear discriminant function
  • Quadratic discriminant function
  • Minimum distance classifier
  • Distance measures
  • Mahalanobis distance (variance normalized)
  • Euclidean distance

48
Readings
  • Chapter 2, Pattern Classification by Duda, Hart,
    and Stork, 2001, Sections 2.1 - 2.6