
Bayesian Decision Theory

- Z. Ghassabi

Outline

- What is pattern recognition?
- What is classification?
- Need for Probabilistic Reasoning
- Probabilistic Classification Theory
- What is Bayesian Decision Theory?
- HISTORY
- PRIOR PROBABILITIES
- CLASS-CONDITIONAL PROBABILITIES
- BAYES FORMULA
- A Casual Formulation
- Decision

Outline

- What is Bayesian Decision Theory?
- Decision for Two Categories

Outline

- What is classification?
- Classification by Bayesian Classification
- Basic Concepts
- Bayes Rule
- More General Forms of Bayes Rule
- Discriminant Functions
- Bayesian Belief Networks

What is pattern recognition?

TYPICAL APPLICATIONS OF PR

IMAGE PROCESSING EXAMPLE

Pattern Classification System

- Preprocessing
- Segment (isolate) fish from one another and from the background
- Feature Extraction
- Reduce the data by measuring certain features
- Classification
- Divide the feature space into decision regions


Classification

- Initially, use the length of the fish as a possible feature for discrimination

TYPICAL APPLICATIONS

LENGTH AS A DISCRIMINATOR

- Length is a poor discriminator

Feature Selection

- The length is a poor feature alone!
- Select the lightness as a possible feature


ADD ANOTHER FEATURE

- Lightness is a better feature than length because it reduces the misclassification error.
- Can we combine features in such a way that we improve performance? (Hint: correlation)

Threshold decision boundary and cost relationship

- Move the decision boundary toward smaller values of lightness in order to minimize the cost (reduce the number of sea bass that are classified as salmon!)
- This is the task of decision theory

Feature Vector

- Adopt the lightness and add the width of the fish to the feature vector
- Fish: xT = (x1, x2)

(Figure: the feature vector's components, width and lightness)


WIDTH AND LIGHTNESS

Straight line decision boundary

- Treat features as a N-tuple (two-dimensional

vector) - Create a scatter plot
- Draw a line (regression) separating the two

classes

Features

- We might add other features that are not highly correlated with the ones we already have. Be sure not to reduce the performance by adding noisy features
- Ideally, you might think the best decision boundary is the one that provides optimal performance on the training data (see the following figure)


DECISION THEORY

- Can we do better than a linear classifier?

Is this a good decision boundary?

- What is wrong with this decision surface? (hint: generalization)

Decision Boundary Choice

- Our satisfaction is premature because the central aim of designing a classifier is to correctly classify new (test) input
- The issue of generalization!


GENERALIZATION AND RISK

Better decision boundary

- Why might a smoother decision surface be a better choice? (hint: Occam's Razor)

- PR investigates how to find such optimal decision surfaces and how to provide system designers with the tools to make intelligent trade-offs.

Need for Probabilistic Reasoning

- Most everyday reasoning is based on uncertain evidence and inferences.
- Classical logic, which only allows conclusions to be strictly true or strictly false, does not account for this uncertainty or the need to weigh and combine conflicting evidence.
- Today's expert systems employ fairly ad hoc methods for reasoning under uncertainty and for combining evidence.

Probabilistic Classification Theory

- In classification, the goal is to assign each input sample to one of a set of known classes.
- In practice, some overlap between classes and random variation within classes occur; hence perfect separation between classes cannot be achieved, and misclassification may occur.
- Bayesian decision theory provides a decision rule that minimizes the cost (risk) incurred when a sample that truly belongs to class A is mistakenly assigned to class B (misclassification).

HISTORY

What is Bayesian Decision Theory?

- Bayesian probability was named after the Reverend Thomas Bayes (1702-1761).
- He proved a special case of what is currently known as Bayes' theorem.
- The term "Bayesian" came into use around the 1950s.

http://en.wikipedia.org/wiki/Bayesian_probability

HISTORY (Cont.)

- Pierre-Simon, Marquis de Laplace (1749-1827) independently proved a generalized version of Bayes' theorem.
- 1970: Bayesian Belief Networks at Stanford University (Judea Pearl, 1988)
- The ideas proposed above were not fully developed until later; BBNs became popular in the 1990s.

HISTORY (Cont.)

- Current uses of Bayesian networks:
- Microsoft's printer troubleshooter.
- Diagnosing diseases (Mycin).
- Predicting oil and stock prices.
- Controlling the space shuttle.
- Risk analysis: schedule and cost overruns.

BAYESIAN DECISION THEORY

PROBABILISTIC DECISION THEORY

- Bayesian decision theory is a fundamental statistical approach to the problem of pattern classification.
- It uses a probabilistic approach to help make decisions (e.g., classification) so as to minimize the risk (cost).
- Assume all relevant probability distributions are known (later we will learn how to estimate these from data).

BAYESIAN DECISION THEORY

PRIOR PROBABILITIES

- The state of nature is prior information
- ω denotes the state of nature
- Model it as a random variable, ω
- ω = ω1: the event that the next fish is a sea bass
- Category 1: sea bass; category 2: salmon
- A priori probabilities:
- P(ω1): probability of category 1
- P(ω2): probability of category 2
- P(ω1) + P(ω2) = 1 (either ω1 or ω2 must occur)
- Decision rule:
- Decide ω1 if P(ω1) > P(ω2); otherwise, decide ω2

But we know there will be many mistakes ...

http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm

BAYESIAN DECISION THEORY

CLASS-CONDITIONAL PROBABILITIES

- A decision rule with only prior information always produces the same result and ignores measurements.
- If P(ω1) >> P(ω2), we will be correct most of the time.

- Given a feature x (lightness), which is a continuous random variable, p(x|ω2) is the class-conditional probability density function.

- p(x|ω1) and p(x|ω2) describe the difference in lightness between the sea bass and salmon populations.

Let x be a continuous random variable. p(x|ω) is the probability density for x given the state of nature ω.

p(lightness | salmon) = ?

P(lightness | sea bass) = ?

BAYESIAN DECISION THEORY

BAYES FORMULA

How do we combine a priori and class-conditional probabilities to know the probability of a state of nature?

- Suppose we know both P(ωj) and p(x|ωj), and we can measure x. How does this influence our decision?
- The joint probability of finding a pattern that is in category j and has feature value x is p(ωj, x) = P(ωj|x) p(x) = p(x|ωj) P(ωj).

- Rearranging terms, we arrive at Bayes' formula.

- A Casual Formulation
- The prior probability reflects knowledge of the relative frequency of instances of a class
- The likelihood is a measure of the probability that a measurement value occurs in a class
- The evidence is a scaling term
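The formulas on this slide were lost in transcription; in the notation used above, they read:

```latex
% Joint probability of category \omega_j and feature value x:
p(\omega_j, x) \;=\; P(\omega_j \mid x)\, p(x) \;=\; p(x \mid \omega_j)\, P(\omega_j)

% Rearranging gives Bayes' formula:
P(\omega_j \mid x) \;=\; \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)},
\qquad
p(x) \;=\; \sum_{j=1}^{c} p(x \mid \omega_j)\, P(\omega_j)

% i.e.  posterior = (likelihood \times prior) / evidence
```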

BAYESIAN DECISION THEORY

POSTERIOR PROBABILITIES

- Bayes' formula can be expressed in words as: posterior = (likelihood × prior) / evidence
- By measuring x, we can convert the prior probability, P(ωj), into a posterior probability, P(ωj|x).
- The evidence can be viewed as a scale factor and is often ignored in optimization applications (e.g., speech recognition).

For two categories:

Bayes decision: choose ω1 if P(ω1|x) > P(ω2|x); otherwise choose ω2.

BAYESIAN THEOREM

- A special case of Bayes' theorem:
- P(A∩B) = P(B) × P(A|B)
- P(B∩A) = P(A) × P(B|A)
- Since P(A∩B) = P(B∩A),
- P(B) × P(A|B) = P(A) × P(B|A)
- ⇒ P(A|B) = P(A) × P(B|A) / P(B)

Preliminaries and Notations

ωi: a state of nature

P(ωi): prior probability

x: feature vector

p(x|ωi): class-conditional density

P(ωi|x): posterior probability

Decision

Decide ωi if P(ωi|x) > P(ωj|x) ∀ j ≠ i

The evidence, p(x), is a scale factor that assures the conditional probabilities sum to 1:
P(ω1|x) + P(ω2|x) = 1

We can eliminate the scale factor (which appears on both sides of the equation):

Decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) ∀ j ≠ i

- Special cases:
- P(ω1) = P(ω2) = ··· = P(ωc)
- p(x|ω1) = p(x|ω2) = ··· = p(x|ωc)

Two Categories

Decide ωi if P(ωi|x) > P(ωj|x) ∀ j ≠ i

Decide ωi if p(x|ωi)P(ωi) > p(x|ωj)P(ωj) ∀ j ≠ i

Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2

Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2

- Special cases:
- 1. P(ω1) = P(ω2): decide ω1 if p(x|ω1) > p(x|ω2); otherwise decide ω2
- 2. p(x|ω1) = p(x|ω2): decide ω1 if P(ω1) > P(ω2); otherwise decide ω2

Example

(Figure: decision regions R1 and R2 for equal priors, P(ω1) = P(ω2))

Example

P(ω1) = 2/3, P(ω2) = 1/3

Bayes Decision Rule

Decide ω1 if p(x|ω1)P(ω1) > p(x|ω2)P(ω2); otherwise decide ω2

BAYESIAN DECISION THEORY

POSTERIOR PROBABILITIES

- For every value of x, the posteriors sum to 1.0.
- At x = 14, the probability that the pattern is in category ω2 is 0.08, and for category ω1 it is 0.92.

BAYESIAN DECISION THEORY

BAYES DECISION RULE

Classification Error

- Decision rule:
- For an observation x, decide ω1 if P(ω1|x) > P(ω2|x); otherwise, decide ω2
- Probability of error: P(error|x) = min[P(ω1|x), P(ω2|x)]
- The average probability of error is given by P(error) = ∫ P(error|x) p(x) dx
- If for every x we ensure that P(error|x) is as small as possible, then the integral is as small as possible. Thus, the Bayes decision rule minimizes P(error).

CONTINUOUS FEATURES

GENERALIZATION OF TWO-CLASS PROBLEM

- Generalization of the preceding ideas:
- Use of more than one feature (e.g., length and lightness)
- Use of more than two states of nature (e.g., N-way classification)
- Allowing actions other than a decision on the state of nature (e.g., rejection: refusing to take an action when the alternatives are close or confidence is low)
- Introducing a loss function that is more general than the probability of error (e.g., errors are not equally costly)
- Let us replace the scalar x by the vector x in a d-dimensional Euclidean space, Rd, called the feature space.

The Generalization

{ω1, ..., ωc}: a set of c states of nature or c categories

{α1, ..., αa}: a set of a possible actions

LOSS FUNCTION

λ(αi|ωj): the loss incurred for taking action αi when the true state of nature is ωj.

Risk

The loss for a correct decision can be zero.

We want to minimize the expected loss in making a decision.

Examples

- Ex 1: Fish classification
- X is the image of a fish
- x = (brightness, length, fin, etc.)
- ω is our belief of what the fish type is: {sea bass, salmon, trout, etc.}
- α is a decision about the fish type, in this case: {sea bass, salmon, trout, manual inspection needed, etc.}

- Ex 2: Medical diagnosis
- X is all the available medical tests and imaging scans that a doctor can order for a patient
- x = (blood pressure, glucose level, cough, x-ray, etc.)
- ω is an illness type: {flu, cold, TB, pneumonia, lung cancer, etc.}
- α is a decision for treatment: {Tylenol, hospitalize, more tests needed, etc.}

Conditional Risk

Given x, the expected loss (risk) associated with taking action αi.
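The slide's formula did not survive transcription; reconstructing it from the standard definition:

```latex
R(\alpha_i \mid x) \;=\; \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)
```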

0/1 Loss Function

Decision

A general decision rule is a function α(x) that tells us which action to take for every possible observation.

Bayesian Decision Rule

Overall Risk

The overall risk is given by R = ∫ R(α(x)|x) p(x) dx.

- Compute the conditional risk for every α and select the action that minimizes R(αi|x). The resulting risk is denoted R* and is referred to as the Bayes risk.
- The Bayes risk is the best performance that can be achieved.

Decision function

If we choose α(x) so that R(α(x)|x) is as small as possible for every x, the overall risk will be minimized.

The Bayesian decision rule is the optimal one for minimizing the overall risk; its resulting overall risk is called the Bayes risk.

Two-Category Classification

- Let α1 correspond to ω1, α2 to ω2, and λij = λ(αi|ωj)

- The conditional risk is given by:
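The two conditional risks, written out in the standard notation (the slide's formulas were lost in transcription):

```latex
R(\alpha_1 \mid x) \;=\; \lambda_{11}\, P(\omega_1 \mid x) \;+\; \lambda_{12}\, P(\omega_2 \mid x)
R(\alpha_2 \mid x) \;=\; \lambda_{21}\, P(\omega_1 \mid x) \;+\; \lambda_{22}\, P(\omega_2 \mid x)
```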

Two-Category Classification

Our decision rule is: choose α1 if R(α1|x) < R(α2|x); otherwise decide α2.

Perform α1 if R(α2|x) > R(α1|x); otherwise perform α2.

Two-Category Classification

Perform α1 if R(α2|x) > R(α1|x); otherwise perform α2.

Posterior probabilities are scaled before comparison.

- If the loss incurred for making an error is greater than that incurred for being correct, the factors (λ21 − λ11) and (λ12 − λ22) are positive, and the ratio of these factors simply scales the posteriors.

Two-Category Classification

Perform α1 if R(α2|x) > R(α1|x); otherwise perform α2.

By employing Bayes' formula, we can replace the posteriors by the prior probabilities and conditional densities; the evidence p(x) appears on both sides and is irrelevant to the comparison.

Two-Category Classification

This slide will be recalled later.

Stated as: choose α1 if the likelihood ratio exceeds a threshold value independent of the observation x.

Perform α1 if the likelihood ratio p(x|ω1)/p(x|ω2) exceeds the threshold:
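In the standard notation (reconstructing the lost formula from the two-category risks above):

```latex
\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}
\;>\;
\underbrace{\frac{(\lambda_{12}-\lambda_{22})\, P(\omega_2)}
                 {(\lambda_{21}-\lambda_{11})\, P(\omega_1)}}_{\text{threshold } \theta}
```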


Loss Factors

- If the loss factors are identical and the prior probabilities are equal, this reduces to a standard likelihood ratio test: decide ω1 if p(x|ω1)/p(x|ω2) > 1.

MINIMUM ERROR RATE

The error rate (the probability of error) is to be minimized.

- Consider a symmetrical, or zero-one, loss function: λ(αi|ωj) = 0 if i = j and 1 otherwise.

- The conditional risk is then R(αi|x) = Σj≠i P(ωj|x) = 1 − P(ωi|x).

- The conditional risk is the average probability of error.
- To minimize error, maximize P(ωi|x); this is also known as maximum a posteriori (MAP) decoding.
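MAP decoding can be sketched in a few lines. The code below is an illustration, not part of the original slides; the class parameters are the hypothetical sea bass/salmon numbers used in the next example, and `map_decide` simply takes the argmax of p(x|ω)P(ω) (the evidence p(x) is common to all classes, so it can be dropped):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def map_decide(x, classes):
    """Return the class label with the largest p(x|w) * P(w).

    `classes` maps a label to (mu, sigma, prior).
    """
    return max(classes, key=lambda w: gaussian_pdf(x, *classes[w][:2]) * classes[w][2])

# Hypothetical two-class problem: sea bass ~ N(4,1), salmon ~ N(10,1), equal priors.
classes = {"sea bass": (4.0, 1.0, 0.5), "salmon": (10.0, 1.0, 0.5)}
print(map_decide(5.0, classes))  # x = 5 is much closer to the sea-bass mean
print(map_decide(8.0, classes))  # x = 8 is closer to the salmon mean
```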

Minimum Error Rate

LIKELIHOOD RATIO

- Minimum error rate classification: choose ωi if P(ωi|x) > P(ωj|x) for all j ≠ i

Example

- For the sea bass population, the lightness x is a normal random variable distributed according to N(4,1).
- For the salmon population, x is distributed according to N(10,1).
- Select the optimal decision where:
- The two fish are equiprobable
- P(sea bass) = 2 × P(salmon)
- The cost of classifying a fish as salmon when it truly is sea bass is 2, and the cost of classifying a fish as sea bass when it truly is salmon is 1.
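The slide's worked solution was lost in transcription; the sketch below recomputes it under my reading of the statement (ω1 = sea bass ~ N(4,1), ω2 = salmon ~ N(10,1); λ21 = 2 for calling a sea bass a salmon, λ12 = 1 for calling a salmon a sea bass). With equal variances the log likelihood ratio is linear in x, so the boundary has a closed form:

```python
import math

MU1, MU2, SIGMA = 4.0, 10.0, 1.0   # sea bass ~ N(4,1), salmon ~ N(10,1)

def decision_boundary(p1, p2, cost12=1.0, cost21=1.0):
    """x below which we decide 'sea bass' in the likelihood-ratio test
    p(x|w1)/p(x|w2) > theta, theta = (cost12 * P2) / (cost21 * P1).

    ln LR = (MU2^2 - MU1^2 - 2*x*(MU2 - MU1)) / (2*SIGMA^2); solving
    ln LR = ln(theta) for x gives the boundary below.
    """
    theta = (cost12 * p2) / (cost21 * p1)
    return (MU2**2 - MU1**2 - 2 * SIGMA**2 * math.log(theta)) / (2 * (MU2 - MU1))

print(decision_boundary(0.5, 0.5))                           # equiprobable, 0/1 loss -> 7.0
print(decision_boundary(2/3, 1/3, cost12=1.0, cost21=2.0))   # priors 2:1 plus the stated costs
```

With equal priors and 0/1 loss the boundary is the midpoint x = 7; shifting to priors 2:1 in favor of sea bass and the stated costs moves it to about x ≈ 7.23.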


- End of Section 1

The Multicategory Classification

How to define discriminant functions?

How do we represent pattern classifiers?

The most common way is through discriminant functions. Recall that we use ω1, ω2, ..., ωc to denote the possible states of nature.

For each class we create a discriminant function gi(x); the gi(x) are called the discriminant functions.

(Figure: a classifier as a network that computes the c discriminant functions g1(x), g2(x), ..., gc(x) and selects the action α(x) with the largest value.)

Our classifier is a network or machine that computes c discriminant functions.

The classifier: assign x to ωi if gi(x) > gj(x) for all j ≠ i.

Simple Discriminant Functions

If f(·) is a monotonically increasing function, then the f(gi(·)) are also discriminant functions.

Notice that the decision is the same if we replace every gi(x) by f(gi(x)), assuming f(·) is a monotonically increasing function.

Minimum risk case: gi(x) = −R(αi|x)

Minimum error-rate case: gi(x) = P(ωi|x)

Figure 2.5

Decision Regions

The net effect is to divide the feature space into c regions (one for each class). We then have c decision regions separated by decision boundaries.

Two-category example

Decision regions are separated by decision boundaries.

Figure 2.6

Bayesian Decision Theory (Classification)

- The Normal Distribution

Basics of Probability

Discrete random variable (X) — assume integer-valued

Probability mass function (pmf)

Cumulative distribution function (cdf)

Continuous random variable (X)

Probability density function (pdf) — its value is not a probability

Cumulative distribution function (cdf)

Probability mass function

- The graph of a probability mass function.
- All the values of this function must be non-negative and sum up to 1.

Probability density function

- A probability is calculated by taking the integral of the pdf f(x) over the integration interval of the input variable x.
- For example, the probability of the variable X being within the interval [4.3, 7.8] would be P(4.3 ≤ X ≤ 7.8) = ∫ from 4.3 to 7.8 of f(x) dx.

Expectations

Let g be a function of the random variable X; its expectation is E[g(X)].

The kth moment: E[X^k]

The 1st moment: E[X]

The kth central moment: E[(X − E[X])^k]

Important Expectations

Mean: μ = E[X]

Variance: σ² = E[(X − μ)²]

Fact: Var[X] = E[X²] − (E[X])²

Entropy

The entropy, H = −∫ p(x) ln p(x) dx, measures the fundamental uncertainty in the value of points selected randomly from a distribution.

Univariate Gaussian Distribution

X ~ N(μ, σ²)

E[X] = μ

Var[X] = σ²

- Properties:
- Maximizes the entropy (for a given mean and variance)
- Central limit theorem

Illustration of the central limit theorem

Let x1, x2, ..., xn be a sequence of n independent and identically distributed random variables, each having finite expectation μ and variance σ² > 0. The central limit theorem states that as the sample size n increases, the distribution of the sample average of these random variables approaches the normal distribution with mean μ and variance σ²/n, irrespective of the shape of the original distribution.
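A quick empirical illustration of the statement above (my own sketch, using a skewed Exponential(1) distribution, which has μ = 1 and σ² = 1): the averages of samples of size n concentrate around μ with variance close to σ²/n.

```python
import random
import statistics

random.seed(0)
n = 100          # sample size for each average
trials = 5000    # number of sample averages

# Exponential(1) is far from normal, yet its sample averages are nearly normal.
averages = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

# The averages should have mean close to mu = 1 and variance close to 1/n = 0.01.
print(round(statistics.fmean(averages), 3))
print(round(statistics.variance(averages), 4))
```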

Random Vectors

x = (x1, ..., xd)T: a d-dimensional random vector

Vector mean: μ = E[x]

Covariance matrix: Σ = E[(x − μ)(x − μ)T]

Multivariate Gaussian Distribution

X ~ N(μ, Σ), a d-dimensional random vector

E[X] = μ

E[(X − μ)(X − μ)T] = Σ

Properties of N(µ,S)

X ~ N(μ, Σ), a d-dimensional random vector

Let Y = ATX, where A is a d × k matrix.

Y ~ N(ATμ, ATΣA)


On Parameters of N(μ, Σ)

X ~ N(μ, Σ)

More On Covariance Matrix

Σ is symmetric and positive semidefinite, so it factors as Σ = ΦΛΦT, where:

Φ: an orthonormal matrix whose columns are the eigenvectors of Σ.

Λ: a diagonal matrix (of the eigenvalues).

Whitening Transform

X ~ N(μ, Σ)

Linear transform: Y = ATX, so Y ~ N(ATμ, ATΣA)

Let Aw = ΦΛ^(−1/2) (the whitening transform). Then AwTΣAw = I.

Whitening Transform

- The whitening transformation is a decorrelation method that converts the covariance matrix Σ of a set of samples into the identity matrix I.
- This effectively creates new random variables that are uncorrelated and have unit variance.
- The method is called the whitening transform because it transforms the input closer towards white noise.

This can be expressed as Aw = ΦΛ^(−1/2), where Φ is the matrix with the eigenvectors of Σ as its columns and Λ is the diagonal matrix of non-increasing eigenvalues.
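A sketch of the whitening transform with NumPy (variable names are my own; the construction Aw = ΦΛ^(−1/2) is the one described above, applied to the sample covariance):

```python
import numpy as np

rng = np.random.default_rng(0)

# Correlated 2-D samples with a known covariance.
cov = np.array([[4.0, 1.5],
                [1.5, 1.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=20000)

# Eigendecomposition S = Phi Lambda Phi^T of the sample covariance.
S = np.cov(X, rowvar=False)
eigvals, Phi = np.linalg.eigh(S)

# Whitening matrix A_w = Phi Lambda^{-1/2}; by construction A_w^T S A_w = I.
A_w = Phi @ np.diag(eigvals ** -0.5)
Y = X @ A_w

print(np.round(np.cov(Y, rowvar=False), 4))  # identity matrix (up to float error)
```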

White noise

- White noise is a random signal (or process) with a flat power spectral density.
- In other words, the signal contains equal power within a fixed bandwidth at any center frequency.

(Figure: energy spectral density)

Mahalanobis Distance

X ~ N(μ, Σ)

The squared Mahalanobis distance r² = (x − μ)TΣ^(−1)(x − μ) is constant on hyperellipsoids; their size depends on the value of r².

Mahalanobis distance

In statistics, the Mahalanobis distance is a distance measure introduced by P. C. Mahalanobis in 1936. It is based on correlations between variables by which different patterns can be identified and analyzed. It is a useful way of determining the similarity of an unknown sample set to a known one. It differs from Euclidean distance in that it takes into account the correlations of the data set and is scale-invariant, i.e., not dependent on the scale of the measurements.

Mahalanobis distance

- Formally, the Mahalanobis distance from a group of values with mean μ and covariance matrix Σ for a multivariate vector x is defined as DM(x) = √((x − μ)TΣ^(−1)(x − μ)).
- The Mahalanobis distance can also be defined as a dissimilarity measure between two random vectors x and y of the same distribution with covariance matrix Σ: d(x, y) = √((x − y)TΣ^(−1)(x − y)).
- If the covariance matrix is the identity matrix, the Mahalanobis distance reduces to the Euclidean distance. If the covariance matrix is diagonal, then the resulting distance measure is called the normalized Euclidean distance: d(x, y) = √(Σi (xi − yi)²/si²), where si is the standard deviation of xi over the sample set.
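The definitions above can be checked with a few lines of NumPy (my own sketch; the test points and covariances are arbitrary):

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance sqrt((x - mu)^T cov^{-1} (x - mu))."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

mu = np.array([0.0, 0.0])

# With the identity covariance it reduces to the Euclidean distance: 5.0 here.
print(mahalanobis([3.0, 4.0], mu, np.eye(2)))

# A diagonal covariance gives the normalized Euclidean distance:
# sqrt((3/2)^2 + (4/1)^2), with standard deviations 2 and 1.
print(mahalanobis([3.0, 4.0], mu, np.diag([4.0, 1.0])))
```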


Bayesian Decision Theory (Classification)

- Discriminant Functions for the Normal Populations

Normal Density

If the features are statistically independent and the variance is the same for all features, the discriminant function is simple and linear in nature. A classifier that uses linear discriminant functions is called a linear machine. The decision surfaces are pieces of hyperplanes defined by linear equations.

Minimum-Error-Rate Classification

- Assuming the measurements are normally distributed, we have Xi ~ N(μi, Σi).

Some Algebra to Simplify the Discriminants

- Since gi(x) = P(ωi|x) ∝ p(x|ωi)P(ωi),
- we take the natural logarithm to re-write the discriminant: gi(x) = ln p(x|ωi) + ln P(ωi).

Some Algebra to Simplify the Discriminants (continued)

Minimum-Error-Rate Classification

Three Cases

Case 1

The classes are centered at different means, and their feature components are pairwise independent and have the same variance.

Case 2

The classes are centered at different means, but have the same covariance.

Case 3

Arbitrary.

Case 1. Σi = σ²I

Terms that do not depend on i (such as the ln|Σi| and (d/2)ln 2π terms) are irrelevant and can be dropped, leaving gi(x) = −‖x − μi‖²/(2σ²) + ln P(ωi).

Case 1. Σi = σ²I

Boundary between ωi and ωj: the decision boundary will be a hyperplane perpendicular to the line between the means, of the form wT(x − x0) = 0. The prior-dependent offset is 0 if P(ωi) = P(ωj), in which case x0 is the midpoint between the means.

Case 1. Σi = σ²I

- The decision boundary when the priors are equal and the support regions are spherical is simply halfway between the means (Euclidean distance).

Minimum distance classifier (template matching)
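A minimal nearest-mean classifier sketch for this special case (equal priors, Σi = σ²I); the class means are the ones from the worked example later in the deck:

```python
import math

def nearest_mean(x, means):
    """Assign x to the class whose mean is closest in Euclidean distance.

    Equivalent to the Bayes rule when Sigma_i = sigma^2 I and priors are equal.
    """
    return min(means, key=lambda label: math.dist(x, means[label]))

means = {"w1": (3.0, 6.0), "w2": (3.0, -2.0)}
print(nearest_mean((3.0, 5.0), means))  # closer to (3, 6)
print(nearest_mean((2.0, 0.0), means))  # closer to (3, -2)
```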


Case 1. ?i ?2I

Note how the priors shift the boundary away from the more likely mean!


Case 2. Σi = Σ

- Covariance matrices are arbitrary, but equal to each other for all classes.
- Features then form hyper-ellipsoidal clusters of equal size and shape.

Mahalanobis Distance

The discriminant is based on the squared Mahalanobis distance (x − μi)TΣ^(−1)(x − μi); the prior term is irrelevant if P(ωi) = P(ωj) ∀ i, j, and terms that do not depend on i can likewise be dropped.

Case 2. Σi = Σ

- The discriminant hyperplanes are often not orthogonal to the segments joining the class means.


Case 3. Σi ≠ Σj

The covariance matrices are different for each category; in the two-class case, the decision boundaries form hyperquadrics.

- Decision surfaces are hyperquadrics, e.g.,
- hyperplanes
- hyperspheres
- hyperellipsoids
- hyperhyperboloids

Unlike Cases 1 and 2, the quadratic term xTΣi^(−1)x depends on i and cannot be dropped.

Case 3. Σi ≠ Σj

Non-simply connected decision regions can arise in one dimension for Gaussians having unequal variance.


Multi-Category Classification

Example: A Problem

- Exemplars (transposed):
- For ω1: (2, 6), (3, 4), (3, 8), (4, 6)
- For ω2: (1, -2), (3, 0), (3, -4), (5, -2)
- Calculated means (transposed):
- m1 = (3, 6)
- m2 = (3, -2)

Example: Covariance Matrices

Example: Inverse and Determinant for Each of the Covariance Matrices

Example: A Discriminant Function for Class 1

Example: A Discriminant Function for Class 2

Example: The Class Boundary

Example: A Quadratic Separator

Example: Plot of the Discriminant

Summary: Steps for Building a Bayesian Classifier

- Collect class exemplars
- Estimate class a priori probabilities
- Estimate class means
- Form covariance matrices; find the inverse and determinant for each
- Form the discriminant function for each class

Using the Classifier

- Obtain a measurement vector x
- Evaluate the discriminant function gi(x) for each class i = 1, ..., c
- Decide x is in class j if gj(x) > gi(x) for all i ≠ j
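These steps can be run end-to-end on the exemplar data given earlier. This is a sketch, not the slides' own computation: I use the biased 1/n sample covariance (the slides' lost matrices may use 1/(n−1) instead) and equal-count priors, with the standard Gaussian quadratic discriminant:

```python
import numpy as np

exemplars = {
    1: np.array([[2, 6], [3, 4], [3, 8], [4, 6]], dtype=float),
    2: np.array([[1, -2], [3, 0], [3, -4], [5, -2]], dtype=float),
}

means, inv_covs, log_dets, log_priors = {}, {}, {}, {}
n_total = sum(len(v) for v in exemplars.values())
for c, X in exemplars.items():
    means[c] = X.mean(axis=0)
    cov = np.cov(X, rowvar=False, bias=True)   # 1/n covariance (an assumption)
    inv_covs[c] = np.linalg.inv(cov)
    log_dets[c] = np.log(np.linalg.det(cov))
    log_priors[c] = np.log(len(X) / n_total)

def g(c, x):
    """Quadratic discriminant: -1/2 (x-m)^T S^{-1} (x-m) - 1/2 ln|S| + ln P(c)."""
    d = x - means[c]
    return -0.5 * d @ inv_covs[c] @ d - 0.5 * log_dets[c] + log_priors[c]

def classify(x):
    x = np.asarray(x, dtype=float)
    return max(exemplars, key=lambda c: g(c, x))

print(means[1], means[2])    # (3, 6) and (3, -2), matching the slide
print(classify([3.0, 5.0]))  # near the class-1 mean
print(classify([2.0, -1.0])) # near the class-2 mean
```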

Bayesian Decision Theory (Classification)

- Criteria

Bayesian Decision Theory (Classification)

- Minimum Error Rate Criterion

Minimum-Error-Rate Classification

- If action αi is taken and the true state is ωj, then the decision is correct if i = j and in error if i ≠ j.
- The error rate (the probability of error) is to be minimized.
- Symmetrical or zero-one loss function: λ(αi|ωj) = 0 if i = j and 1 otherwise.
- Conditional risk: R(αi|x) = 1 − P(ωi|x)

Minimum-Error-Rate Classification

Bayesian Decision Theory (Classification)

- Minimax Criterion

Bayesian Decision Rule: Two-Category Classification

Decide ω1 if the likelihood ratio p(x|ω1)/p(x|ω2) exceeds the threshold.

The minimax criterion deals with the case in which the prior probabilities are unknown.

Basic Concept of Minimax

Choose the worst-case prior probabilities (the maximum loss) and then pick the decision rule that minimizes the overall risk.

Minimize the maximum possible overall risk, so that the worst risk for any value of the priors is as small as possible.

Overall Risk

For a fixed decision boundary, the overall risk is linear in the prior: R = a·P(ω1) + b, where the values of a and b depend on the setting of the decision boundary. This line gives the overall risk for a particular P(ω1).

For the minimax solution, the decision boundary is chosen so that a = 0; the resulting risk, Rmm (the minimax risk), is then independent of the value of P(ωi).

Minimax Risk

Error Probability

Using the 0/1 loss function, the overall risk becomes the error probability.

Minimax Error-Probability

With the 0/1 loss, the minimax boundary equalizes the two conditional error probabilities, P(ω1|ω2) (deciding ω1 when ω2 is true) and P(ω2|ω1) (deciding ω2 when ω1 is true).

Bayesian Decision Theory (Classification)

- Neyman-Pearson Criterion

Bayesian Decision Rule: Two-Category Classification

Decide ω1 if the likelihood ratio p(x|ω1)/p(x|ω2) exceeds the threshold.

The Neyman-Pearson criterion deals with the case in which both the loss functions and the prior probabilities are unknown.

Signal Detection Theory

- The theory of signal detection evolved from the development of communications and radar equipment in the first half of the last century.
- It migrated to psychology, initially as part of sensation and perception, in the '50s and '60s, as an attempt to understand some of the features of human behavior when detecting very faint stimuli that were not being explained by traditional theories of thresholds.

The situation of interest

- A person is faced with a stimulus (signal) that is very faint or confusing.
- The person must make a decision: is the signal there or not?
- What makes this situation confusing and difficult is the presence of other activity that is similar to the signal. Let us call this activity noise.

Example

Noise is present both in the environment and in the sensory system of the observer. The observer reacts to the momentary total activation of the sensory system, which fluctuates from moment to moment, as well as responding to environmental stimuli, which may include a signal.

Signal Detection Theory

Suppose we want to detect a single pulse from a signal, and we assume the signal has some random noise. When the signal is present, we observe a normal distribution with mean μ2; when the signal is not present, we observe a normal distribution with mean μ1. We assume the same standard deviation σ.

Can we measure the discriminability of the problem? Can we do this independent of the threshold x*?

Discriminability: d′ = |μ2 − μ1| / σ
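A small sketch (my own numbers) relating d′, the criterion x*, and the hit/false-alarm rates through the normal CDF; note that d′ can be recovered from a single operating point via z-scores, independent of where x* is set:

```python
from statistics import NormalDist

def rates(mu1, mu2, sigma, x_star):
    """Hit and false-alarm rates for threshold x*: respond 'signal' when x > x*."""
    noise, signal = NormalDist(mu1, sigma), NormalDist(mu2, sigma)
    p_hit = 1.0 - signal.cdf(x_star)  # P(x > x* | signal present)
    p_fa = 1.0 - noise.cdf(x_star)    # P(x > x* | noise only)
    return p_hit, p_fa

mu1, mu2, sigma = 0.0, 1.0, 1.0
d_prime = (mu2 - mu1) / sigma         # discriminability, independent of x*
p_hit, p_fa = rates(mu1, mu2, sigma, x_star=0.5)

# Recover d' from the (hit, false-alarm) pair: d' = z(hit) - z(FA).
z = NormalDist().inv_cdf
print(round(z(p_hit) - z(p_fa), 6))   # equals d_prime = 1.0
```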

Example

- A radiologist is examining a CT scan, looking for evidence of a tumor.
- A hard job, because there is always some uncertainty.
- There are four possible outcomes:
- hit (tumor present and doctor says "yes")
- miss (tumor present and doctor says "no")
- false alarm (tumor absent and doctor says "yes")
- correct rejection (tumor absent and doctor says "no")

Two Types of Error

The Four Cases

Signal detection theory was developed to help us understand how a continuous and ambiguous signal can lead to a binary yes/no decision.

- Correct rejection: P(ω1|ω1) (respond "no" when there is no signal)
- Miss: P(ω1|ω2) (respond "no" when the signal is present)
- Hit: P(ω2|ω2) (respond "yes" when the signal is present)
- False alarm: P(ω2|ω1) (respond "yes" when there is no signal)

Decision Making

- Discriminability: d′
- Criterion: based on expectancy (decision bias)

- Hit: P(ω2|ω2)
- False alarm: P(ω2|ω1)


Signal Detection Theory

- How do we find d′ if we do not know μ1, μ2, or x*?
- From the data we can compute:
- P(x > x* | ω2): a hit.
- P(x > x* | ω1): a false alarm.
- P(x < x* | ω2): a miss.
- P(x < x* | ω1): a correct rejection.
- If we plot a point in a space representing the hit and false alarm rates, then we have a ROC (receiver operating characteristic) curve.
- With it we can distinguish between discriminability and bias.

ROC Curve (Receiver Operating Characteristic)

Hit rate: PH = P(ω2|ω2)

False alarm rate: PFA = P(ω2|ω1)

Neyman-Pearson Criterion

NP: maximize PH subject to PFA ≤ α

Hit rate: PH = P(ω2|ω2)

False alarm rate: PFA = P(ω2|ω1)
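For the two-Gaussian detection problem above, the Neyman-Pearson rule can be sketched directly (my own example values): with equal variances the likelihood-ratio test is a threshold on x, so we set the threshold to spend the entire false-alarm budget α, which maximizes the hit rate.

```python
from statistics import NormalDist

def neyman_pearson_threshold(mu1, sigma, alpha):
    """Smallest threshold x* with P(x > x* | w1) = alpha (the false-alarm budget)."""
    return NormalDist(mu1, sigma).inv_cdf(1.0 - alpha)

mu1, mu2, sigma, alpha = 0.0, 1.0, 1.0, 0.05
x_star = neyman_pearson_threshold(mu1, sigma, alpha)
p_fa = 1.0 - NormalDist(mu1, sigma).cdf(x_star)   # = alpha by construction
p_hit = 1.0 - NormalDist(mu2, sigma).cdf(x_star)  # the maximized hit rate

print(round(x_star, 4))  # about 1.6449
print(round(p_fa, 4))    # 0.05
print(round(p_hit, 4))   # about 0.2595
```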