
Bayesian Decision Theory (Sections 2.1-2.2)

- Decision problem posed in probabilistic terms
- Bayesian Decision Theory: Continuous Features
- All the relevant probability values are known


Probability Density

Course Outline

- MODEL INFORMATION
  - COMPLETE: Bayes Decision Theory → Optimal Rules
  - INCOMPLETE:
    - Supervised Learning
      - Parametric Approach → Plug-in Rules
      - Nonparametric Approach → Density Estimation, Geometric Rules (K-NN, MLP)
    - Unsupervised Learning
      - Parametric Approach → Mixture Resolving
      - Nonparametric Approach → Cluster Analysis (Hard, Fuzzy)

Introduction

- From the sea bass vs. salmon example to the abstract decision-making problem
- State of nature, a priori (prior) probability
- The state of nature (which type of fish will be observed next) is unpredictable, so it is a random variable
- The catch of salmon and sea bass is equiprobable
- P(ω1) = P(ω2) (uniform priors)
- P(ω1) + P(ω2) = 1 (exclusivity and exhaustivity)
- The prior probability reflects our prior knowledge about how likely we are to observe a sea bass or a salmon; these probabilities may depend on the time of year or the fishing area!

- Bayes decision rule with only the prior information
- Decide ω1 if P(ω1) > P(ω2), otherwise decide ω2
- Error rate = min { P(ω1), P(ω2) }
- Suppose now we have a measurement or feature on the state of nature, say the fish lightness value
- Use of the class-conditional probability density
- p(x | ω1) and p(x | ω2) describe the difference in the lightness feature between the populations of sea bass and salmon

The amount of overlap between the densities determines the goodness of the feature.

- Maximum likelihood decision rule
- Assign input pattern x to class ω1 if p(x | ω1) > p(x | ω2), otherwise to ω2
- How does the feature x influence our attitude (prior) concerning the true state of nature?

- Bayes decision rule

- Posterior probability, likelihood, evidence
- P(ωj, x) = P(ωj | x) p(x) = p(x | ωj) P(ωj)
- Bayes formula
- P(ωj | x) = p(x | ωj) P(ωj) / p(x)
- In words: Posterior = (Likelihood × Prior) / Evidence
- The evidence p(x) can be viewed as a scale factor that guarantees that the posterior probabilities sum to 1
- p(x | ωj) is called the likelihood of ωj with respect to x; the category ωj for which p(x | ωj) is large is more "likely" to be the true category
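As a concrete illustration of the Bayes formula, here is a minimal sketch in Python. The Gaussian lightness densities, their parameters, and the priors are made-up assumptions for illustration only, not values from the lecture.

```python
import numpy as np

# Hypothetical class-conditional densities for the lightness feature x:
# p(x | w1) for sea bass and p(x | w2) for salmon (illustrative parameters only).
def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

priors = {"w1": 2.0 / 3.0, "w2": 1.0 / 3.0}        # P(w1), P(w2)
params = {"w1": (7.0, 1.0), "w2": (4.0, 1.5)}      # (mean, std) of lightness per class

def posteriors(x):
    # Bayes formula: P(wj | x) = p(x | wj) P(wj) / p(x),
    # where the evidence p(x) = sum_j p(x | wj) P(wj) is just a scale factor.
    joint = {w: gaussian(x, *params[w]) * priors[w] for w in priors}
    evidence = sum(joint.values())
    return {w: joint[w] / evidence for w in joint}

x = 5.5                                            # an observed lightness value
post = posteriors(x)
decision = max(post, key=post.get)                 # Bayes rule: pick the larger posterior
print(post, "decide", decision, "P(error|x) =", min(post.values()))
```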


- P(ω1 | x) is the probability of the state of nature being ω1 given that feature value x has been observed
- A decision based on the posterior probabilities is called the optimal Bayes decision rule
- For a given observation (feature value) x
- if P(ω1 | x) > P(ω2 | x), decide ω1
- if P(ω1 | x) < P(ω2 | x), decide ω2
- To justify the above rule, calculate the probability of error
- P(error | x) = P(ω1 | x) if we decide ω2
- P(error | x) = P(ω2 | x) if we decide ω1
- So, for a given x, we can minimize the probability of error: decide ω1 if P(ω1 | x) > P(ω2 | x), otherwise decide ω2
- Therefore, P(error | x) = min { P(ω1 | x), P(ω2 | x) }
- Thus, for each observation x, the Bayes decision rule minimizes the probability of error
- The unconditional error P(error) is obtained by integrating P(error | x) over all x w.r.t. p(x)

- Optimal Bayes decision rule
- Decide ω1 if P(ω1 | x) > P(ω2 | x); otherwise decide ω2
- Special cases
- (i) P(ω1) = P(ω2): decide ω1 if p(x | ω1) > p(x | ω2), otherwise ω2
- (ii) p(x | ω1) = p(x | ω2): decide ω1 if P(ω1) > P(ω2), otherwise ω2

Bayesian Decision Theory Continuous Features

- Generalization of the preceding formulation
- Use of more than one feature (d features)
- Use of more than two states of nature (c classes)
- Allowing other actions besides deciding on the state of nature
- Introducing a loss function which is more general than the probability of error

- Allowing actions other than classification primarily allows the possibility of rejection: refusing to make a decision when it is difficult to decide between two classes or in noisy cases!
- The loss function specifies the cost of each action

- Let {ω1, ω2, ..., ωc} be the set of c states of nature (or "categories")
- Let {α1, α2, ..., αa} be the set of a possible actions
- Let λ(αi | ωj) be the loss incurred for taking action αi when the true state of nature is ωj
- A general decision rule α(x) specifies which action to take for every possible observation x

- Conditional Risk
- Overall risk
- R = expected value of R(αi | x) w.r.t. p(x)
- Minimizing R: minimize R(αi | x) for i = 1, ..., a

For a given x, suppose we take the action αi; if the true state is ωj, we will incur the loss λ(αi | ωj). P(ωj | x) is the probability that the true state is ωj, but any one of the c states is possible for the given x, so we average the loss over all of them.

Conditional risk: R(αi | x) = Σ_j λ(αi | ωj) P(ωj | x)

- Select the action αi for which R(αi | x) is minimum
- The overall risk R is then minimized, and the resulting risk is called the Bayes risk; it is the best performance that can be achieved!

- Two-category classification
- α1: deciding ω1
- α2: deciding ω2
- λij = λ(αi | ωj): loss incurred for deciding ωi when the true state of nature is ωj
- Conditional risk
- R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
- R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

- The Bayes decision rule is stated as: if R(α1 | x) < R(α2 | x), take action α1, i.e., decide ω1
- This results in the equivalent rule: decide ω1 if
- (λ21 − λ11) p(x | ω1) P(ω1) > (λ12 − λ22) p(x | ω2) P(ω2)
- and decide ω2 otherwise
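A minimal sketch of the two-category rule with a general loss function; the loss matrix, priors, and density values below are illustrative assumptions, not numbers from the slides.

```python
import numpy as np

# Loss matrix: lam[i][j] = loss for taking action a_{i+1} when the true state is w_{j+1}.
lam = np.array([[0.0, 2.0],    # lambda_11, lambda_12
                [1.0, 0.0]])   # lambda_21, lambda_22
priors = np.array([0.5, 0.5])  # P(w1), P(w2)

def decide(px_w1, px_w2):
    """Return 1 or 2 given the class-conditional density values at the observed x."""
    px = np.array([px_w1, px_w2])
    post = px * priors / np.dot(px, priors)        # posteriors P(wj | x)
    # Conditional risks R(a1|x), R(a2|x) = sum_j lambda_ij P(wj | x)
    risk = lam @ post
    # Equivalent likelihood-ratio form: decide w1 if
    # p(x|w1)/p(x|w2) > [(lam12 - lam22)/(lam21 - lam11)] * P(w2)/P(w1)
    return 1 if risk[0] < risk[1] else 2

print(decide(px_w1=0.30, px_w2=0.10))   # conditional risks favour action a1 -> decide w1
```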

- Likelihood ratio
- The preceding rule is equivalent to the following: if
- p(x | ω1) / p(x | ω2) > [(λ12 − λ22) / (λ21 − λ11)] · [P(ω2) / P(ω1)]
- then take action α1 (decide ω1); otherwise take action α2 (decide ω2)
- Note that the posterior probabilities are scaled by the loss differences.

- Interpretation of the Bayes decision rule
- If the likelihood ratio of class ω1 to class ω2 exceeds a threshold value (that is independent of the input pattern x), the optimal action is to decide ω1
- Maximum likelihood decision rule: the threshold value is 1 (0-1 loss function and equal class prior probabilities)

Bayesian Decision Theory (Sections 2.3-2.5)

- Minimum Error Rate Classification
- Classifiers, Discriminant Functions and Decision Surfaces
- The Normal Density

Minimum Error Rate Classification

- Actions are decisions on classes
- If action αi is taken and the true state of nature is ωj, then the decision is correct if i = j and in error if i ≠ j
- Seek a decision rule that minimizes the probability of error, i.e., the error rate

- Zero-one (0-1) loss function: no loss for a correct decision and a unit loss for any error
- The conditional risk can now be simplified to R(αi | x) = 1 − P(ωi | x)
- The risk corresponding to the 0-1 loss function is the average probability of error

- Minimizing the risk requires maximizing the posterior probability P(ωi | x), since R(αi | x) = 1 − P(ωi | x)
- For minimum error rate: decide ωi if P(ωi | x) > P(ωj | x) ∀ j ≠ i

- Decision boundaries and decision regions
- If λ is the 0-1 loss function, then the threshold involves only the priors


Classifiers, Discriminant Functions and Decision Surfaces

- There are many different ways to represent pattern classifiers; one of the most useful is in terms of discriminant functions
- The multi-category case
- Set of discriminant functions gi(x), i = 1, ..., c
- The classifier assigns a feature vector x to class ωi if gi(x) > gj(x) ∀ j ≠ i

Network Representation of a Classifier

- The Bayes classifier can be represented in this way, but the choice of discriminant function is not unique
- gi(x) = −R(αi | x) (maximum discriminant corresponds to minimum risk!)
- For the minimum error rate, we take gi(x) = P(ωi | x) (maximum discriminant corresponds to maximum posterior!)
- gi(x) ∝ p(x | ωi) P(ωi)
- gi(x) = ln p(x | ωi) + ln P(ωi) (ln: natural logarithm!)
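A small sketch of a discriminant-function classifier using gi(x) = ln p(x | ωi) + ln P(ωi); the 1-D Gaussian class models and priors are assumptions made only to have something concrete to evaluate.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 1-D, 3-class setup (the means, sigmas and priors are made up).
means, sigmas = [0.0, 2.0, 5.0], [1.0, 1.0, 2.0]
priors = [0.5, 0.3, 0.2]

def g(x, i):
    # Discriminant g_i(x) = ln p(x | w_i) + ln P(w_i); any monotone rescaling works too.
    return norm.logpdf(x, means[i], sigmas[i]) + np.log(priors[i])

def classify(x):
    scores = [g(x, i) for i in range(len(priors))]
    return int(np.argmax(scores)) + 1     # assign x to the class with the largest g_i(x)

print([classify(x) for x in (-1.0, 1.2, 4.0)])
```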

- The effect of any decision rule is to divide the feature space into c decision regions
- if gi(x) > gj(x) ∀ j ≠ i, then x is in Ri (region Ri means: assign x to ωi)
- The two-category case
- Here a classifier is a "dichotomizer" that has two discriminant functions g1 and g2
- Let g(x) ≡ g1(x) − g2(x)
- Decide ω1 if g(x) > 0; otherwise decide ω2

- So, a dichotomizer computes a single discriminant function g(x) and classifies x according to whether g(x) is positive or not.
- Computation of g(x) = g1(x) − g2(x)


The Normal Density

- Univariate density: N(μ, σ²)
- The normal density is analytically tractable
- Continuous density
- A number of processes are asymptotically Gaussian
- Patterns (e.g., handwritten characters, speech signals) can be viewed as randomly corrupted versions of a single typical or prototype pattern (Central Limit Theorem)
- p(x) = [1 / (√(2π) σ)] exp[ −(x − μ)² / (2σ²) ], where
- μ = mean (or expected value) of x
- σ² = variance (or expected squared deviation) of x


- Multivariate density: N(μ, Σ)
- Multivariate normal density in d dimensions:
- p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[ −(1/2) (x − μ)ᵗ Σ⁻¹ (x − μ) ], where
- x = (x1, x2, ..., xd)ᵗ (t stands for the transpose of a vector)
- μ = (μ1, μ2, ..., μd)ᵗ = mean vector
- Σ = d × d covariance matrix
- |Σ| and Σ⁻¹ are the determinant and inverse of Σ, respectively
- The covariance matrix is always symmetric and positive semidefinite; we assume Σ is positive definite, so the determinant of Σ is strictly positive
- The multivariate normal density is completely specified by d + d(d+1)/2 parameters
- If variables x1 and x2 are statistically independent, then the covariance of x1 and x2 is zero.
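The d-dimensional normal density above can be computed directly; a minimal sketch, with an arbitrary example μ and Σ.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Multivariate normal density, specified by d + d(d+1)/2 parameters."""
    d = len(mu)
    diff = x - mu
    maha_sq = diff @ np.linalg.inv(Sigma) @ diff          # squared Mahalanobis distance
    norm_const = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * maha_sq) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])                            # symmetric, positive definite
print(mvn_density(np.array([0.5, 0.5]), mu, Sigma))
```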

Multivariate Normal Density

Samples drawn from a normal population tend to fall in a single cloud or cluster; the cluster center is determined by the mean vector and its shape by the covariance matrix. The loci of points of constant density are hyperellipsoids whose principal axes are the eigenvectors of Σ.

Transformation of Normal Variables

Linear combinations of jointly normally distributed random variables are normally distributed. A coordinate transformation can convert an arbitrary multivariate normal distribution into a spherical one.
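The coordinate transformation referred to above is the whitening transform A_w = Φ Λ^(−1/2), built from the eigenvectors Φ and eigenvalues Λ of Σ; a sketch with an illustrative μ and Σ, verifying that the transformed samples have approximately identity covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[3.0, 1.2],
                  [1.2, 1.0]])

# Whitening transform A_w = Phi Lambda^(-1/2), where Phi holds the eigenvectors of
# Sigma and Lambda its eigenvalues; y = A_w^T (x - mu) then has identity covariance.
vals, vecs = np.linalg.eigh(Sigma)
A_w = vecs @ np.diag(vals ** -0.5)

x = rng.multivariate_normal(mu, Sigma, size=10000)
y = (x - mu) @ A_w
print(np.cov(y.T))        # approximately the 2x2 identity matrix (spherical distribution)
```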

Bayesian Decision Theory (Sections 2.6-2.9)

- Discriminant Functions for the Normal Density
- Bayes Decision Theory Discrete Features

Discriminant Functions for the Normal Density

- The minimum error-rate classification can be achieved by the discriminant function gi(x) = ln p(x | ωi) + ln P(ωi)
- In the case of multivariate normal densities:
- Case 1: Σi = σ²·I (I is the identity matrix)
- Features are statistically independent and each feature has the same variance

- A classifier that uses linear discriminant functions is called a "linear machine"
- The decision surfaces for a linear machine are pieces of hyperplanes defined by the linear equations gi(x) = gj(x)

- The hyperplane separating Ri and Rj is orthogonal to the line linking the means!
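A sketch of the resulting linear machine for Case 1: with Σi = σ²I the discriminant reduces to gi(x) = wiᵗx + wi0, where wi = μi/σ² and wi0 = −μiᵗμi/(2σ²) + ln P(ωi). The means, σ², and priors below are illustrative assumptions.

```python
import numpy as np

# Case 1: Sigma_i = sigma^2 * I.  Discriminants become linear: g_i(x) = w_i.x + w_i0,
# with w_i = mu_i / sigma^2 and w_i0 = -mu_i.mu_i / (2 sigma^2) + ln P(w_i).
sigma2 = 1.0
means = np.array([[0.0, 0.0], [3.0, 3.0]])        # illustrative class means
priors = np.array([0.6, 0.4])

W = means / sigma2
w0 = -np.sum(means ** 2, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    g = W @ x + w0                                 # one linear discriminant per class
    return int(np.argmax(g)) + 1

print(classify(np.array([1.0, 1.0])), classify(np.array([2.5, 2.0])))
```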


- Case 2: Σi = Σ (covariance matrices of all classes are identical but otherwise arbitrary!)
- Hyperplane separating Ri and Rj
- The hyperplane separating Ri and Rj is generally not orthogonal to the line between the means!
- To classify a feature vector x, measure the squared Mahalanobis distance from x to each of the c means; assign x to the category of the nearest mean
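For Case 2, a sketch of the minimum-Mahalanobis-distance classifier just described, assuming equal priors and an illustrative shared covariance matrix.

```python
import numpy as np

Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])                     # shared covariance (illustrative)
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0]), np.array([0.0, 4.0])]
Sigma_inv = np.linalg.inv(Sigma)

def classify(x):
    # Squared Mahalanobis distance from x to each class mean; the nearest mean wins
    # (assuming equal priors; unequal priors add a +ln P(w_i) correction).
    d2 = [(x - m) @ Sigma_inv @ (x - m) for m in means]
    return int(np.argmin(d2)) + 1

print(classify(np.array([1.0, 0.5])), classify(np.array([0.5, 3.0])))
```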


Discriminant Functions for 1D Gaussian

- Case 3: Σi = arbitrary
- The covariance matrices are different for each category
- In the 2-category case, the decision surfaces are hyperquadrics that can assume any of the general forms: hyperplanes, pairs of hyperplanes, hyperspheres, hyperellipsoids, hyperparaboloids, hyperhyperboloids
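For Case 3, a sketch of the general quadratic discriminant gi(x) = −½(x − μi)ᵗΣi⁻¹(x − μi) − ½ ln|Σi| + ln P(ωi); the per-class parameters are illustrative assumptions. Case 2 is recovered when all Σi are equal.

```python
import numpy as np

means = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.array([[1.0, 0.0], [0.0, 1.0]]),
        np.array([[3.0, 1.0], [1.0, 2.0]])]        # a different covariance per class
priors = [0.5, 0.5]

def g(x, i):
    diff = x - means[i]
    return (-0.5 * diff @ np.linalg.inv(covs[i]) @ diff
            - 0.5 * np.log(np.linalg.det(covs[i]))
            + np.log(priors[i]))

def classify(x):
    return int(np.argmax([g(x, i) for i in range(len(priors))])) + 1

# The boundary g_1(x) = g_2(x) is a hyperquadric (here, a curve in the plane).
print(classify(np.array([1.0, 1.0])), classify(np.array([3.0, 3.0])))
```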

Discriminant Functions for the Normal Density


Decision Regions for Two-Dimensional Gaussian Data

Error Probabilities and Integrals

- 2-class problem
- There are two types of errors: deciding ω2 when the true state of nature is ω1, and deciding ω1 when it is ω2
- Multi-class problem
- It is simpler to compute the probability of being correct (there are more ways to be wrong than to be right)
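A sketch of the 1-D, 2-class error integral, P(error) = ∫_{R2} p(x | ω1)P(ω1) dx + ∫_{R1} p(x | ω2)P(ω2) dx, evaluated numerically; the two Gaussian class densities and priors are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

priors = [0.5, 0.5]
dists = [norm(0.0, 1.0), norm(2.5, 1.0)]      # illustrative p(x|w1), p(x|w2)

xs = np.linspace(-10.0, 12.0, 20001)
dx = xs[1] - xs[0]
joint1 = priors[0] * dists[0].pdf(xs)         # p(x|w1) P(w1)
joint2 = priors[1] * dists[1].pdf(xs)         # p(x|w2) P(w2)

decide_w1 = joint1 >= joint2                  # Bayes region R1; R2 is the complement
# P(error) = integral over R2 of p(x|w1)P(w1) + integral over R1 of p(x|w2)P(w2)
p_error = np.sum(joint1[~decide_w1]) * dx + np.sum(joint2[decide_w1]) * dx
print("P(error) =", p_error, " P(correct) =", 1.0 - p_error)
```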

Error Probabilities and Integrals

Bayes optimal decision boundary in 1-D case

Error Bounds for Normal Densities

- The exact calculation of the error for the general Gaussian case (Case 3) is extremely difficult
- However, in the 2-category case the general error can be approximated analytically to give us an upper bound on the error

Error Rate of Linear Discriminant Function (LDF)

- Assume a 2-class problem
- Due to the symmetry of the problem (identical Σ), the two types of errors are identical
- (The decision rule and its equivalent forms were given as formulas on the slide)

Error Rate of LDF

- Let the discriminant defined on the slide be the quantity whose distribution we study
- Compute its expected values and variances when x comes from ω1 and when x comes from ω2
- The key quantity that appears is the squared Mahalanobis distance between the two class means
Error Rate of LDF

- Similarly


Chernoff Bound

- To derive a bound on the error, we need the following inequality: min(a, b) ≤ a^β b^(1−β) for a, b ≥ 0 and 0 ≤ β ≤ 1
- Assume the class-conditional probabilities are normal; the resulting bound takes the form P(error) ≤ P(ω1)^β P(ω2)^(1−β) exp(−k(β)), where k(β) is the exponent given on the slide

Chernoff Bound

The Chernoff bound for P(error) is found by determining the value of β that minimizes exp(−k(β)).
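A sketch of both bounds for two Gaussian classes. The closed form of k(β) used below comes from completing the square in ∫ p(x | ω1)^β p(x | ω2)^(1−β) dx, the standard Gaussian result; since the slide's equation is not in the transcript, treat the exact form as an assumption. Minimizing the bound over β gives the Chernoff bound; β = 1/2 gives the Bhattacharyya bound. All class parameters are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def k_beta(beta, mu1, Sigma1, mu2, Sigma2):
    """-log of the integral p(x|w1)^beta p(x|w2)^(1-beta) dx for two Gaussians,
    obtained by completing the square (this plays the role of k(beta))."""
    A = beta * np.linalg.inv(Sigma1) + (1 - beta) * np.linalg.inv(Sigma2)
    b = beta * np.linalg.inv(Sigma1) @ mu1 + (1 - beta) * np.linalg.inv(Sigma2) @ mu2
    c = beta * mu1 @ np.linalg.inv(Sigma1) @ mu1 + (1 - beta) * mu2 @ np.linalg.inv(Sigma2) @ mu2
    quad = c - b @ np.linalg.inv(A) @ b
    logdet = (np.log(np.linalg.det(A))
              + beta * np.log(np.linalg.det(Sigma1))
              + (1 - beta) * np.log(np.linalg.det(Sigma2)))
    return 0.5 * quad + 0.5 * logdet

def error_bound(beta, priors, mu1, S1, mu2, S2):
    # P(error) <= P(w1)^beta P(w2)^(1-beta) exp(-k(beta))
    return priors[0] ** beta * priors[1] ** (1 - beta) * np.exp(-k_beta(beta, mu1, S1, mu2, S2))

mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([3.0, 2.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
priors = (0.5, 0.5)

res = minimize_scalar(lambda b: error_bound(b, priors, mu1, S1, mu2, S2),
                      bounds=(0.0, 1.0), method="bounded")
print("Chernoff bound:", res.fun, "at beta =", res.x)
print("Bhattacharyya bound (beta = 1/2):", error_bound(0.5, priors, mu1, S1, mu2, S2))
```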

Error Bounds for Normal Densities

- Bhattacharyya Bound
- Assume β = 1/2
- Computationally simpler
- Slightly less tight bound
- Now, Eq. (73) takes the closed form k(1/2) given on the slide

When the two covariance matrices are equal, k(1/2) is proportional to the squared Mahalanobis distance between the two means (it equals one eighth of it).

Error Bounds for Gaussian Distributions

Chernoff and Bhattacharyya bounds for 2-category, 2D data: the best Chernoff error bound is 0.008190, the Bhattacharyya error bound (β = 1/2) is 0.008191, and the true error computed by numerical integration is 0.0021.

Neyman-Pearson Rule

(Source: Classification, Estimation and Pattern Recognition by Young and Calvert)


Signal Detection Theory

- We are interested in detecting a single weak pulse, e.g., a radar reflection; the internal signal (x) in the detector has mean m1 (m2) when the pulse is absent (present)

The detector uses a threshold x* to determine the presence of the pulse.

Discriminability: the ease of determining whether the pulse is present or not, d' = |m2 − m1| / σ.

For a given threshold, define the hit, false alarm, miss, and correct rejection rates.

Receiver Operating Characteristic (ROC)

- Experimentally compute the hit and false alarm rates for a fixed x*
- Changing x* will change the hit and false alarm rates
- A plot of the hit rate versus the false alarm rate is called the ROC curve
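A sketch of how ROC points arise, assuming (purely for illustration) unit-variance Gaussian internal signals with means m1 and m2; each threshold x* yields one (false alarm, hit) operating point.

```python
import numpy as np
from scipy.stats import norm

m1, m2, sigma = 0.0, 2.0, 1.0            # illustrative: pulse-absent / pulse-present means
d_prime = abs(m2 - m1) / sigma           # discriminability

thresholds = np.linspace(-3, 5, 9)       # candidate values of the threshold x*
for xstar in thresholds:
    hit = 1 - norm.cdf(xstar, m2, sigma)          # P(x > x* | pulse present)
    false_alarm = 1 - norm.cdf(xstar, m1, sigma)  # P(x > x* | pulse absent)
    print(f"x* = {xstar:5.2f}  hit = {hit:.3f}  false alarm = {false_alarm:.3f}")
# Plotting hit vs. false-alarm rate over all thresholds traces out the ROC curve;
# a larger d' pushes the curve toward the top-left corner.
```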

Performance shown at different operating points

Operating Characteristic

- In practice, the distributions may not be Gaussian and will be multidimensional; the ROC curve can still be plotted
- Vary a single control parameter of the decision rule and plot the resulting hit and false alarm rates

Bayes Decision Theory Discrete Features

- Components of x are binary or integer valued; x can take only one of m discrete values v1, v2, ..., vm
- Case of independent binary features for the 2-category problem
- Let x = (x1, x2, ..., xd)ᵗ, where each xi is either 0 or 1, with probabilities
- pi = P(xi = 1 | ω1)
- qi = P(xi = 1 | ω2)

- The discriminant function in this case is linear: g(x) = Σ_i wi xi + w0, with weight wi = ln[ pi(1 − qi) / (qi(1 − pi)) ] on feature xi; decide ω1 if g(x) > 0

Bayesian Decision for Three-dimensional Binary Data

- Consider a 2-class problem with three independent binary features; the class priors are equal and pi = 0.8, qi = 0.5, i = 1, 2, 3
- wi = 1.3863
- w0 = 1.2
- The decision surface g(x) = 0 is shown below
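A sketch of this example computed directly from the Bernoulli likelihoods (so no explicit w0 formula is assumed): with pi = 0.8, qi = 0.5 and equal priors, the per-feature weight comes out to ln 4 ≈ 1.3863, matching the slide.

```python
import numpy as np
from itertools import product

d = 3
p = np.full(d, 0.8)        # p_i = P(x_i = 1 | w1)
q = np.full(d, 0.5)        # q_i = P(x_i = 1 | w2)
P1 = P2 = 0.5              # equal class priors

def g(x):
    """g(x) = ln p(x|w1)P(w1) - ln p(x|w2)P(w2) for a binary feature vector x."""
    x = np.asarray(x)
    log_lik1 = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    log_lik2 = np.sum(x * np.log(q) + (1 - x) * np.log(1 - q))
    return log_lik1 + np.log(P1) - (log_lik2 + np.log(P2))

# Weight on each feature: w_i = ln[p_i(1-q_i) / (q_i(1-p_i))] = ln 4 = 1.3863 here.
print(np.log(p[0] * (1 - q[0]) / (q[0] * (1 - p[0]))))
for x in product([0, 1], repeat=3):
    print(x, "-> w1" if g(x) > 0 else "-> w2")
```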

Decision boundary for 3D binary features. The left figure shows the case pi = 0.8 and qi = 0.5. The right figure shows the case p3 = q3 (feature 3 provides no discriminatory information), so the decision surface is parallel to the x3 axis.

Handling Missing Features

- Suppose it is not possible to measure a certain feature for a given pattern
- Possible solutions
- Reject the pattern
- Approximate the missing feature, e.g., by the mean of all the available values for that feature
- Marginalize over the distribution of the missing feature
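A sketch of the marginalization option, reusing the independent-binary-feature model from the earlier example purely as an assumed stand-in: the missing ("bad") features are summed out of the class-conditional joint before forming the posterior.

```python
import numpy as np

p = np.array([0.8, 0.8, 0.8])    # P(x_i = 1 | w1)
q = np.array([0.5, 0.5, 0.5])    # P(x_i = 1 | w2)
priors = np.array([0.5, 0.5])

def bernoulli_lik(x, theta):
    return np.prod(theta ** x * (1 - theta) ** (1 - x))

def posterior_with_missing(x_good, good_idx):
    """P(w_i | x_good): sum the joint over all values of the missing (bad) features."""
    d = len(p)
    bad_idx = sorted(set(range(d)) - set(good_idx))
    post = np.zeros(2)
    for k, theta in enumerate((p, q)):
        total = 0.0
        for bits in range(2 ** len(bad_idx)):
            x = np.zeros(d)
            x[list(good_idx)] = x_good
            x[bad_idx] = [(bits >> j) & 1 for j in range(len(bad_idx))]
            total += bernoulli_lik(x, theta)       # sum of p(x_good, x_bad | w_k)
        post[k] = total * priors[k]
    return post / post.sum()

# Features 0 and 1 observed as 1; feature 2 is missing.
print(posterior_with_missing(x_good=np.array([1, 1]), good_idx=[0, 1]))
```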

Handling Missing Features

Other Topics

- Compound Bayes Decision Theory and Context
- Consecutive states of nature might not be statistically independent; in sorting two types of fish, the arrival of the next fish may not be independent of the previous fish
- Can we exploit such statistical dependence to gain improved performance? (use of context)
- Compound decision vs. sequential compound decision problems
- Markov dependence
- Sequential Decision Making
- The feature measurement process is sequential (as in medical diagnosis)
- Feature measurement cost
- Minimize the number of features to be measured while achieving sufficient accuracy, i.e., optimize a combination of feature measurement cost and classification accuracy

Context in Text Recognition