Intro to Pattern Recognition: Bayesian Decision Theory

1
Intro to Pattern Recognition: Bayesian Decision
Theory
  • 2.1 Introduction
  • 2.2 Bayesian Decision Theory: Continuous Features

Materials used in this course were taken from the
textbook Pattern Classification by Duda et al.,
John Wiley & Sons, 2001, with the permission of
the authors and the publisher.
2
Credits and Acknowledgments
  • Materials used in this course were taken from the
    textbook Pattern Classification by Duda et al.,
    John Wiley & Sons, 2001, with the permission of
    the authors and the publisher, and also from
  • Other material on the web
  • Dr. A. Aydin Atalan, Middle East Technical
    University, Turkey
  • Dr. Djamel Bouchaffra, Oakland University
  • Dr. Adam Krzyzak, Concordia University
  • Dr. Joseph Picone, Mississippi State University
  • Dr. Robi Polikar, Rowan University
  • Dr. Stefan A. Robila, University of New Orleans
  • Dr. Sargur N. Srihari, State University of New
    York at Buffalo
  • David G. Stork, Stanford University
  • Dr. Godfried Toussaint, McGill University
  • Dr. Chris Wyatt, Virginia Tech
  • Dr. Alan L. Yuille, University of California, Los
    Angeles
  • Dr. Song-Chun Zhu, University of California, Los
    Angeles

3
TYPICAL APPLICATIONS
GENERALIZATION AND RISK
  • Optimal decision surface is still a line
  • Can we integrate prior knowledge about data,
    confidence, or willingness to take risk?

4
TYPICAL APPLICATIONS
FEATURES ARE CONFUSABLE
5
TYPICAL APPLICATIONS
IMAGE PROCESSING EXAMPLE
6
TYPICAL APPLICATIONS
LENGTH AS A DISCRIMINATOR
  • Length is a poor discriminator

7
TYPICAL APPLICATIONS
ADD ANOTHER FEATURE
  • Lightness is a better feature than length because
    it reduces the misclassification error.
  • Can we combine features in such a way that we
    improve performance? (Hint: correlation)

8
TYPICAL APPLICATIONS
WIDTH AND LIGHTNESS
  • Treat features as an N-tuple (two-dimensional
    vector)
  • Create a scatter plot
  • Draw a line (regression) separating the two
    classes

9
TYPICAL APPLICATIONS
WIDTH AND LIGHTNESS
  • Treat features as an N-tuple (two-dimensional
    vector)
  • Create a scatter plot
  • Draw a line (regression) separating the two
    classes

10
TYPICAL APPLICATIONS
DECISION THEORY
  • Can we do better than a linear classifier?
  • What is wrong with this decision surface? (hint:
    generalization)

11
TYPICAL APPLICATIONS
GENERALIZATION AND RISK
  • Why might a smoother decision surface be a better
    choice? (hint: Occam's Razor)
  • This course investigates how to find such
    optimal decision surfaces and how to provide
    system designers with the tools to make
    intelligent trade-offs.

12
TYPICAL APPLICATIONS
CORRELATION
13
TYPICAL APPLICATIONS
SPEECH RECOGNITION
14
FEATURE EXTRACTION
15
Application of Pattern Recognition: Speaker
Verification
  • Basic principle: this system extracts the
    uniqueness in the human voice and creates an
    individual voice signature.
  • Data Collection: Enrollment, Verification
  • Feature Extraction: MFCC acoustic features
    (Mel Frequency Cepstral Coefficients)
  • Classifier: Pattern matching (using likelihood
    scores)
  • Decision: Accept/Reject

16
Application of Pattern Recognition
  • CLASSIFIER USED IN THIS APPLICATION PERFORMS
    PATTERN MATCHING
  • The pattern matching process involves comparing a
    given set of input feature vectors against the
    speaker model for the claimed identity and
    computing a matching score.

A basic Speaker Verification System
17
2.2 Bayesian Decision Theory
18
Thomas Bayes
  • At the time of his death, Rev. Thomas Bayes (1702-
    1761) left behind two unpublished essays
    attempting to determine the probabilities of
    causes from observed effects. Forwarded to the
    British Royal Society, the essays had little
    impact and were soon forgotten.
  • When, several years later, the French
    mathematician Laplace independently rediscovered
    a very similar concept, the English scientists
    quickly reclaimed ownership of what is now
    known as Bayes' theorem.

19
BAYESIAN DECISION THEORY
PROBABILISTIC DECISION THEORY
  • Bayesian decision theory is a fundamental
    statistical approach to the problem of pattern
    classification.
  • Quantify the tradeoffs between various
    classification decisions using probability and
    the costs that accompany these decisions.
  • Assume all relevant probability distributions are
    known (later we will learn how to estimate these
    from data).
  • Can we exploit prior knowledge in our fish
    classification problem?
  • Is the sequence of fish predictable? (statistics)
  • Is each class equally probable? (uniform priors)
  • What is the cost of an error? (risk, optimization)

20
BAYESIAN DECISION THEORY
PRIOR PROBABILITIES
  • State of nature is prior information
  • Model it as a random variable, ω
  • ω = ω1: the event that the next fish is a sea
    bass
  • category 1: sea bass; category 2: salmon
  • P(ω1) = probability of category 1
  • P(ω2) = probability of category 2
  • P(ω1) + P(ω2) = 1
  • Exclusivity: ω1 and ω2 share no basic events
  • Exhaustivity: the union of all outcomes is the
    sample space (either ω1 or ω2 must occur)
  • If all incorrect classifications have an equal
    cost:
  • Decide ω1 if P(ω1) > P(ω2); otherwise, decide ω2

21
BAYESIAN DECISION THEORY
CLASS-CONDITIONAL PROBABILITIES
  • A decision rule with only prior information
    always produces the same result and ignores
    measurements.
  • If P(ω1) >> P(ω2), we will be correct most of
    the time.
  • Probability of error: P(E) = min(P(ω1), P(ω2)).
  • p(x|ω1) and p(x|ω2) describe the difference in
    lightness between the sea bass and salmon
    populations.

22
BAYESIAN DECISION THEORY
PROBABILITY FUNCTIONS
  • A probability density function is denoted in
    lowercase and represents a function of a
    continuous variable.
  • pX(x|ω), often abbreviated as p(x), denotes a
    probability density function for the random
    variable X. Note that pX(x|ω) and pY(y|ω) can be
    two different functions.
  • P(x|ω) denotes a probability mass function, and
    must obey the constraints shown after this list.
  • Probability mass functions are typically used for
    discrete random variables, while densities
    describe continuous random variables (the latter
    must be integrated).

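A standard statement of the constraints referred to above (nonnegativity and normalization), for the mass function and, for completeness, the density:

\[ P(x \mid \omega) \ge 0, \qquad \sum_{x} P(x \mid \omega) = 1 \]

\[ p(x \mid \omega) \ge 0, \qquad \int p(x \mid \omega)\, dx = 1 \]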
23
BAYESIAN DECISION THEORY
BAYES FORMULA
  • Suppose we know both P(ωj) and p(x|ωj), and we
    can measure x. How does this influence our
    decision?
  • The joint probability of finding a pattern that
    is in category ωj and that has a feature value
    of x is
  • Rearranging terms, we arrive at Bayes formula

where in the case of two categories
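Restated in the notation above (standard forms of these relations):

\[ p(\omega_j, x) = P(\omega_j \mid x)\, p(x) = p(x \mid \omega_j)\, P(\omega_j), \qquad
   P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)} \]

\[ p(x) = \sum_{j=1}^{2} p(x \mid \omega_j)\, P(\omega_j) \]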
24
BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
  • Bayes formula
  • can be expressed in words as
  • posterior = (likelihood × prior) / evidence
  • By measuring x, we can convert the prior
    probability, P(ωj), into a posterior probability,
    P(ωj|x).
  • Evidence can be viewed as a scale factor and is
    often ignored in optimization applications (e.g.,
    speech recognition).

25
BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
  • For every value of x, the posteriors sum to 1.0.
  • At x = 14, the probability that the pattern is in
    category ω2 is 0.08, and for category ω1 it is
    0.92.

26
BAYESIAN DECISION THEORY
BAYES DECISION RULE
  • Decision rule:
  • For an observation x, decide ω1 if P(ω1|x) >
    P(ω2|x); otherwise, decide ω2
  • Probability of error:
  • The average probability of error is given by
  • If for every x we ensure that P(error|x) is as
    small as possible, then the integral is as small
    as possible. Thus, the Bayes decision rule
    minimizes P(error).

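Written out, the two error quantities referred to above (standard two-category forms):

\[ P(\mathrm{error} \mid x) = \min\bigl[\,P(\omega_1 \mid x),\; P(\omega_2 \mid x)\,\bigr] \]

\[ P(\mathrm{error}) = \int_{-\infty}^{\infty} P(\mathrm{error} \mid x)\, p(x)\, dx \]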
27
Bayes Decision Rule
28
BAYESIAN DECISION THEORY
EVIDENCE
  • The evidence, p(x), is a scale factor that
    assures the conditional probabilities sum to 1:
  • P(ω1|x) + P(ω2|x) = 1
  • We can eliminate the scale factor (which appears
    on both sides of the equation):
  • Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2)
  • Special cases:
  • if p(x|ω1) = p(x|ω2): x gives us no useful
    information
  • if P(ω1) = P(ω2): the decision is based entirely
    on the likelihood, p(x|ωj)

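A minimal Python sketch of this rule and its two special cases; the Gaussian densities and priors below are illustrative assumptions, not values from the slides:

from scipy.stats import norm

# Hypothetical class-conditional densities for a scalar feature x.
p1 = norm(0.0, 1.0).pdf   # p(x | w1), assumed N(0, 1)
p2 = norm(2.0, 1.0).pdf   # p(x | w2), assumed N(2, 1)

def decide(x, P1, P2):
    """Decide w1 if p(x|w1) P(w1) > p(x|w2) P(w2), else w2 (the evidence cancels)."""
    return "w1" if p1(x) * P1 > p2(x) * P2 else "w2"

print(decide(0.5, 0.5, 0.5))  # equal priors: the likelihoods alone decide -> w1
print(decide(1.0, 0.9, 0.1))  # p(x|w1) = p(x|w2) at x = 1: only the priors matter -> w1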
29
CONTINUOUS FEATURES
GENERALIZATION OF TWO-CLASS PROBLEM
  • Generalization of the preceding ideas:
  • Use of more than one feature (e.g., length and
    lightness)
  • Use of more than two states of nature (e.g., N-way
    classification)
  • Allowing actions other than merely deciding the
    state of nature (e.g., rejection: refusing to
    take an action when alternatives are close or
    confidence is low)
  • Introduce a loss function which is more general
    than the probability of error (e.g., errors are
    not equally costly)
  • Let us replace the scalar x by the vector x in a
    d-dimensional Euclidean space, R^d, called the
    feature space.

30
CONTINUOUS FEATURES
LOSS FUNCTION 1
  • Let {ω1, ω2, ..., ωc} be the set of c categories
  • Let {α1, α2, ..., αa} be the set of a possible
    actions
  • Let λ(αi|ωj) be the loss incurred for taking
    action αi when the state of nature is ωj

31
Examples
  • Ex 1: Fish classification
  • X is the image of a fish
  • x = (brightness, length, fin, etc.)
  • ω is our belief of what the fish type is:
  • sea bass, salmon, trout, etc.
  • α is a decision for the fish type, in this
    case:
  • sea bass, salmon, trout, manual
    inspection needed, etc.
  • Ex 2: Medical diagnosis
  • X is all the available medical tests, imaging
    scans that a doctor can order for a patient
  • x = (blood pressure, glucose level, cough, x-ray,
    etc.)
  • ω is an illness type:
  • flu, cold, TB, pneumonia, lung
    cancer, etc.
  • α is a decision for treatment:
  • Tylenol, hospitalize, more tests
    needed, etc.

32
CONTINUOUS FEATURES
LOSS FUNCTION
  • λ(αi|ωj) is the loss incurred for taking action
    αi when the state of nature is ωj
  • The posterior, P(ωj|x), can be computed from
    Bayes formula

where the evidence is
  • The expected loss from taking action αi is

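Written out in the notation of this slide (standard forms):

\[ P(\omega_j \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \omega_j)\, P(\omega_j)}{p(\mathbf{x})}, \qquad
   p(\mathbf{x}) = \sum_{j=1}^{c} p(\mathbf{x} \mid \omega_j)\, P(\omega_j) \]

\[ R(\alpha_i \mid \mathbf{x}) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid \mathbf{x}) \]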
33
CONTINUOUS FEATURES
BAYES RISK
  • An expected loss is called a risk.
  • R(αi|x) is called the conditional risk.
  • A general decision rule is a function α(x) that
    tells us which action to take for every possible
    observation.
  • The overall risk is given by
  • If we choose α(x) so that R(α(x)|x) is as small as
    possible for every x, the overall risk will be
    minimized.
  • Compute the conditional risk for every α and
    select the action that minimizes R(αi|x). This is
    denoted R*, and is referred to as the Bayes risk.
  • The Bayes risk is the best performance that can
    be achieved.

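The overall risk referred to above, written out (standard form):

\[ R = \int R\bigl(\alpha(\mathbf{x}) \mid \mathbf{x}\bigr)\, p(\mathbf{x})\, d\mathbf{x} \]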
34
CONTINUOUS FEATURES
TWO-CATEGORY CLASSIFICATION
  • Let α1 correspond to ω1, α2 to ω2, and λij =
    λ(αi|ωj)
  • The conditional risk is given by:
  • R(α1|x) = λ11 P(ω1|x) + λ12 P(ω2|x)
  • R(α2|x) = λ21 P(ω1|x) + λ22 P(ω2|x)
  • Our decision rule is:
  • choose ω1 if R(α1|x) < R(α2|x); otherwise
    decide ω2
  • This results in the equivalent rule:
  • choose ω1 if (λ21 - λ11) P(ω1|x) > (λ12 - λ22)
    P(ω2|x);
  • otherwise decide ω2
  • If the loss incurred for making an error is
    greater than that incurred for being correct, the
    factors (λ21 - λ11) and (λ12 - λ22) are positive,
    and the ratio of these factors simply scales the
    posteriors.

35
CONTINUOUS FEATURES
LIKELIHOOD
  • By employing Bayes formula, we can replace the
    posteriors by the prior probabilities and
    conditional densities:
  • choose ω1 if
  • (λ21 - λ11) p(x|ω1) P(ω1) > (λ12 - λ22) p(x|ω2)
    P(ω2);
  • otherwise decide ω2
  • If λ21 - λ11 is positive, our rule becomes:
  • If the loss factors are identical, and the prior
    probabilities are equal, this reduces to a
    standard likelihood ratio:

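The rule referred to above, written out as a likelihood ratio (standard form):

\[ \text{choose } \omega_1 \text{ if } \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} >
   \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} \]

With identical loss factors and equal priors the right-hand side equals 1, giving the standard likelihood-ratio test p(x|ω1)/p(x|ω2) > 1.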
36
(No Transcript)
37
2.3 Minimum Error Rate Classification
38
Minimum Error Rate
MINIMUM ERROR RATE
  • Consider a symmetrical or zero-one loss function.
  • The conditional risk is:
  • The conditional risk is the average probability
    of error.
  • To minimize error, maximize P(ωi|x); this is also
    known as maximum a posteriori (MAP) decoding.

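The zero-one loss and the resulting conditional risk, written out (standard forms):

\[ \lambda(\alpha_i \mid \omega_j) = \begin{cases} 0 & i = j \\ 1 & i \ne j \end{cases}
   \qquad i, j = 1, \ldots, c \]

\[ R(\alpha_i \mid x) = \sum_{j \ne i} P(\omega_j \mid x) = 1 - P(\omega_i \mid x) \]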
39
Minimum Error Rate
LIKELIHOOD RATIO
  • Minimum error rate classification:
  • choose ωi if P(ωi|x) > P(ωj|x) for all j ≠ i

40
Example
  • 3. It is known that 1% of the population suffers
    from a particular disease. A blood test has a 97%
    chance of identifying the disease in a diseased
    individual, but also has a 6% chance of falsely
    indicating that a healthy person has the disease.
  • a. What is the probability that a random person
    has a positive blood test?
  • b. If a blood test is positive, what's the
    probability that the person has the disease?
  • c. If a blood test is negative, what's the
    probability that the person does not have the
    disease?

41
  • S is a Boolean RV indicating whether a person
    has the disease. P(S) = 0.01; P(¬S) = 0.99.
  • T is a Boolean RV indicating the test result (T =
    true indicates that the test is positive).
  • P(T|S) = 0.97; P(¬T|S) = 0.03
  • P(T|¬S) = 0.06; P(¬T|¬S) = 0.94
  • (a) P(T) = P(S) P(T|S) + P(¬S) P(T|¬S) =
    0.01·0.97 + 0.99·0.06 = 0.0691
  • (b) P(S|T) = P(T|S) P(S) / P(T) = 0.97·0.01/0.0691
    = 0.1403
  • (c) P(¬S|¬T) = P(¬T|¬S) P(¬S) / P(¬T) =
    P(¬T|¬S) P(¬S) / (1 - P(T)) =
    0.94·0.99/(1 - 0.0691) = 0.9997

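A short Python check of these numbers (a sketch that simply re-does the arithmetic above):

# Blood-test example: verify P(T), P(S|T), and P(not S | not T).
P_S = 0.01             # prior probability of having the disease
P_T_given_S = 0.97     # test sensitivity
P_T_given_notS = 0.06  # false-positive rate

P_T = P_S * P_T_given_S + (1 - P_S) * P_T_given_notS
P_S_given_T = P_T_given_S * P_S / P_T
P_notS_given_notT = (1 - P_T_given_notS) * (1 - P_S) / (1 - P_T)

print(round(P_T, 4), round(P_S_given_T, 4), round(P_notS_given_notT, 4))
# -> 0.0691 0.1404 0.9997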
42
  • A physician can take two possible actions after
    seeing the patient's test results:
  • A1 - Decide the patient is sick
  • A2 - Decide the patient is healthy
  • The costs of those actions are:
  • If the patient is healthy, but the doctor decides
    he/she is sick - 20,000.
  • If the patient is sick, but the doctor decides
    he/she is healthy - 100,000.
  • When the test is positive:
  • R(A1|T) = R(A1|S) P(S|T) + R(A1|¬S) P(¬S|T) =
    R(A1|¬S) P(¬S|T) = 20,000 · P(¬S|T) =
    20,000 · 0.8597 = 17,194.00
  • R(A2|T) = R(A2|S) P(S|T) + R(A2|¬S) P(¬S|T) =
    R(A2|S) P(S|T) = 100,000 · 0.1403 = 14,030.00

43
  • A physician can take three possible actions after
    seeing the patient's test results:
  • Decide the patient is sick
  • Decide the patient is healthy
  • Send the patient for another test
  • The costs of those actions are:
  • If the patient is healthy, but the doctor decides
    he/she is sick - 20,000.
  • If the patient is sick, but the doctor decides
    he/she is healthy - 100,000.
  • Sending the patient for another test costs
    15,000.

44
  • When the test is positive:
  • R(A1|T) = R(A1|S) P(S|T) + R(A1|¬S) P(¬S|T) =
    R(A1|¬S) P(¬S|T) = 20,000 · P(¬S|T) =
    20,000 · 0.8597 = 17,194.00
  • R(A2|T) = R(A2|S) P(S|T) + R(A2|¬S) P(¬S|T) =
    R(A2|S) P(S|T) = 100,000 · 0.1403 = 14,030.00
  • R(A3|T) = 15,000.00
  • When the test is negative:
  • R(A1|¬T) = R(A1|S) P(S|¬T) + R(A1|¬S) P(¬S|¬T) =
    R(A1|¬S) P(¬S|¬T) = 20,000 · 0.9997 = 19,994.00
  • R(A2|¬T) = R(A2|S) P(S|¬T) + R(A2|¬S) P(¬S|¬T) =
    R(A2|S) P(S|¬T) = 100,000 · 0.0003 = 30.00
  • R(A3|¬T) = 15,000.00

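A small Python sketch that reproduces these conditional risks and picks the minimum-risk action in each case (the action names and dictionary layout are just for illustration):

# Posteriors from the blood-test example.
P_S_given_T, P_notS_given_T = 0.1403, 0.8597
P_S_given_notT, P_notS_given_notT = 0.0003, 0.9997

# Losses lambda(action | state); zero when the decision matches the state.
loss = {
    ("sick",    "S"): 0,       ("sick",    "notS"): 20_000,
    ("healthy", "S"): 100_000, ("healthy", "notS"): 0,
    ("retest",  "S"): 15_000,  ("retest",  "notS"): 15_000,
}

def risks(p_S, p_notS):
    """Conditional risk R(action | test result) for each action."""
    return {a: loss[(a, "S")] * p_S + loss[(a, "notS")] * p_notS
            for a in ("sick", "healthy", "retest")}

for result, (pS, pnotS) in [("positive", (P_S_given_T, P_notS_given_T)),
                            ("negative", (P_S_given_notT, P_notS_given_notT))]:
    r = risks(pS, pnotS)
    print(result, r, "->", min(r, key=r.get))
# positive: {sick: 17194, healthy: 14030, retest: 15000} -> healthy
# negative: {sick: 19994, healthy: 30, retest: 15000} -> healthy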
45
Example
  • For the sea bass population, the lightness x is a
    normal random variable distributed according to
    N(4,1).
  • For the salmon population, x is distributed
    according to N(10,1).
  • Select the optimal decision where:
  • The two fish are equiprobable
  • P(sea bass) = 2 × P(salmon)
  • The cost of classifying a fish as a salmon when
    it truly is sea bass is 2, and the cost of
    classifying a fish as a sea bass when it is truly
    a salmon is 1.

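A hedged Python sketch of applying the minimum-risk rule from the earlier slides to these densities (a sketch only; the equal-prior assumption in the cost case and the printed decisions are computed from the stated parameters, not taken from the slides):

from scipy.stats import norm

# Class-conditional lightness densities from this slide.
p_bass = norm(4, 1).pdf     # p(x | sea bass) ~ N(4, 1)
p_salmon = norm(10, 1).pdf  # p(x | salmon)   ~ N(10, 1)

def decide(x, P_bass=0.5, P_salmon=0.5, loss_miss_bass=1.0, loss_miss_salmon=1.0):
    """Minimum-risk two-class rule: decide 'sea bass' when
    loss_miss_bass * p(x|bass) * P(bass) > loss_miss_salmon * p(x|salmon) * P(salmon),
    where loss_miss_bass is the cost of calling a true sea bass a salmon."""
    lhs = loss_miss_bass * p_bass(x) * P_bass
    rhs = loss_miss_salmon * p_salmon(x) * P_salmon
    return "sea bass" if lhs > rhs else "salmon"

# Equiprobable classes, zero-one loss: the boundary is halfway between the means (x = 7).
print(decide(6.9), decide(7.1))               # -> sea bass  salmon
# P(sea bass) = 2 * P(salmon): the boundary shifts toward the salmon mean.
print(decide(7.1, P_bass=2/3, P_salmon=1/3))  # -> sea bass
# Cost 2 for misclassifying a true sea bass (equal priors assumed): a similar shift.
print(decide(7.1, loss_miss_bass=2.0))        # -> sea bass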
46
(No Transcript)
47
(No Transcript)
48
(No Transcript)
49
Exercise
Consider a 2-class problem with P(C1) = 2/3,
P(C2) = 1/3, a scalar feature x, and three possible
actions a1, a2, a3 defined as: a1: choose C1; a2:
choose C2; a3: do not classify. Let the loss matrix
λ(ai|Cj) be

     a1   a2   a3
C1   0    1    1/4
C2   1    0    1/4

  • And let P(x|C1) = (2-x)/2, P(x|C2) = 1/2, 0 ≤
    x ≤ 2
  • Questions:
  • Which action to decide for a pattern x, 0 ≤
    x ≤ 2?
  • What is the proportion of patterns for which
    action a3 is performed (i.e., do not classify)?
  • Compute the total minimum risk.
  • If you decide to take action a1 for all x, then
    how much will the total risk change?

50
Solution
P(x) = (5-2x)/6; P(C1|x) = (4-2x)/(5-2x), 0 ≤ x
≤ 2; P(C2|x) = 1/(5-2x). This leads to the
conditional risks:
r1(x) = r(a1|x) = 0·P(C1|x) + 1·P(C2|x) = 1/(5-2x)
r2(x) = r(a2|x) = 1·P(C1|x) + 0·P(C2|x) = (4-2x)/(5-2x)
r3(x) = r(a3|x) = 1/4·P(C1|x) + 1/4·P(C2|x) = 1/4
The Bayes decision rule assigns to each x the action
with the minimum conditional risk. The conditional
risks are sketched in the following figure, and the
optimal decision rule is therefore:
51
(Figure: the conditional risks r1(x), r2(x), r3(x) plotted against x)
If 0 ≤ x ≤ 0.5, then take action a1: choose C1.
If 0.5 ≤ x ≤ 11/6, then take action a3: do not
classify.
If 11/6 ≤ x ≤ 2, then take action a2: choose C2.
52
In this particular case the "do not classify"
action is optimal whenever x is between 1/2 and
11/6.
2) Therefore, the do-not-classify action is
performed for about 60% of the input patterns.
3) Total minimum risk:
4) If instead of using the Bayes classifier we
choose to take a1 for all x, then the total risk
is:
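A short numerical check of items 3) and 4) (a sketch; the integrals are evaluated with scipy rather than by hand):

from scipy.integrate import quad

p  = lambda x: (5 - 2*x) / 6           # evidence P(x)
r1 = lambda x: 1 / (5 - 2*x)           # risk of a1 (choose C1)
r2 = lambda x: (4 - 2*x) / (5 - 2*x)   # risk of a2 (choose C2)
r3 = lambda x: 0.25                    # risk of a3 (do not classify)

# Bayes: take the minimum conditional risk at every x.
bayes_risk, _ = quad(lambda x: min(r1(x), r2(x), r3(x)) * p(x), 0, 2)
# Always taking action a1 instead:
always_a1_risk, _ = quad(lambda x: r1(x) * p(x), 0, 2)

print(round(bayes_risk, 4), round(always_a1_risk, 4))
# -> roughly 0.2361 and 0.3333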
53
Case 1: Σi = σ²I
GAUSSIAN CLASSIFIERS
  • Features are statistically independent, and all
    features have the same variance. Distributions
    are spherical in d dimensions.

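For reference, the standard discriminant functions for this case (from Duda et al.) are linear in x:

\[ g_i(\mathbf{x}) = -\frac{\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2}{2\sigma^2} + \ln P(\omega_i)
   = \mathbf{w}_i^{T}\mathbf{x} + w_{i0}, \qquad
   \mathbf{w}_i = \frac{\boldsymbol{\mu}_i}{\sigma^2}, \qquad
   w_{i0} = -\frac{\boldsymbol{\mu}_i^{T}\boldsymbol{\mu}_i}{2\sigma^2} + \ln P(\omega_i) \]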
54
GAUSSIAN CLASSIFIERS
THRESHOLD DECODING
  • This has a simple geometric interpretation
  • The decision boundary, when the priors are equal
    and the support regions are spherical, is simply
    halfway between the means (Euclidean distance).

55
GAUSSIAN CLASSIFIERS
56
GAUSSIAN CLASSIFIERS
Note how priors shift the boundary away from the
more likely mean !!!
57
GAUSSIAN CLASSIFIERS
58
Case 2: Σi = Σ
GAUSSIAN CLASSIFIERS
  • Covariance matrices are arbitrary, but equal to
    each other for all classes. Features then form
    hyper-ellipsoidal clusters of equal size and
    shape.
  • The discriminant function is linear:

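For reference, the standard linear discriminant for this case (from Duda et al.):

\[ g_i(\mathbf{x}) = \mathbf{w}_i^{T}\mathbf{x} + w_{i0}, \qquad
   \mathbf{w}_i = \Sigma^{-1}\boldsymbol{\mu}_i, \qquad
   w_{i0} = -\tfrac{1}{2}\,\boldsymbol{\mu}_i^{T}\Sigma^{-1}\boldsymbol{\mu}_i + \ln P(\omega_i) \]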
59
60
61
Case 3: Σi arbitrary
  • The covariance matrices are different for each
    category.
  • All bets are off! In the two-class case, the
    decision boundaries form hyperquadrics.
  • (Hyperquadrics are hyperplanes, pairs of
    hyperplanes, hyperspheres, hyperellipsoids,
    hyperparaboloids, and hyperhyperboloids.)

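For reference, the general quadratic discriminant for this case (from Duda et al.):

\[ g_i(\mathbf{x}) = \mathbf{x}^{T}\mathbf{W}_i\,\mathbf{x} + \mathbf{w}_i^{T}\mathbf{x} + w_{i0}, \qquad
   \mathbf{W}_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad
   \mathbf{w}_i = \Sigma_i^{-1}\boldsymbol{\mu}_i, \]
\[ w_{i0} = -\tfrac{1}{2}\,\boldsymbol{\mu}_i^{T}\Sigma_i^{-1}\boldsymbol{\mu}_i
   - \tfrac{1}{2}\ln\lvert\Sigma_i\rvert + \ln P(\omega_i) \]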
62
63
GAUSSIAN CLASSIFIERS
ARBITRARY COVARIANCES