# Bayesian Decision Theory - PowerPoint PPT Presentation

Title:

## Bayesian Decision Theory

Description:

### Bayesian Decision Theory Z. Ghassabi * – PowerPoint PPT presentation

Slides: 157
Provided by: cip89
Transcript and Presenter's Notes

Title: Bayesian Decision Theory

1
Bayesian Decision Theory
• Z. Ghassabi

2
Outline
• What is pattern recognition?
• What is classification?
• Need for Probabilistic Reasoning
• Probabilistic Classification Theory
• What is Bayesian Decision Theory?
• HISTORY
• PRIOR PROBABILITIES
• CLASS-CONDITIONAL PROBABILITIES
• BAYES FORMULA
• A Casual Formulation
• Decision

3
Outline
• What is Bayesian Decision Theory?
• Decision for Two Categories

4
Outline
• What is classification?
• Classification by Bayesian Classification
• Basic Concepts
• Bayes Rule
• More General Forms of Bayes Rule
• Discriminant Functions
• Bayesian Belief Networks

5
What is pattern recognition?
TYPICAL APPLICATIONS OF PR
IMAGE PROCESSING EXAMPLE
6
Pattern Classification System
• Preprocessing
• Segment (isolate) fishes from one another and
from the background
• Feature Extraction
• Reduce the data by measuring certain features
• Classification
• Divide the feature space into decision regions

8
Classification
• Initially use the length of the fish as a
possible feature for discrimination

9
TYPICAL APPLICATIONS
LENGTH AS A DISCRIMINATOR
• Length is a poor discriminator

10
Feature Selection
• The length is a poor feature alone!
• Select the lightness as a possible feature

11
TYPICAL APPLICATIONS
• Lightness is a better feature than length because
it reduces the misclassification error.
• Can we combine features in such a way that we
improve performance? (Hint: correlation)

12
Threshold decision boundary and cost relationship
• Move decision boundary toward smaller values of
lightness in order to minimize the cost (reduce
the number of sea bass that are classified
salmon!)

13
Feature Vector
• Add the width of the fish to the feature vector
• Fish: $x^T = [x_1, x_2]$, with lightness and width as the two components
14
TYPICAL APPLICATIONS
WIDTH AND LIGHTNESS
Straight line decision boundary
• Treat features as an N-tuple (here, a two-dimensional
vector)
• Create a scatter plot
• Draw a line (regression) separating the two
classes

15
Features
• We might add other features that are not highly
correlated with the ones we already have. Be sure
not to reduce the performance by adding noisy
features
• Ideally, you might think the best decision
boundary is the one that provides optimal
performance on the training data (see the
following figure)

16
TYPICAL APPLICATIONS
DECISION THEORY
• Can we do better than a linear classifier?

Is this a good decision boundary?
• What is wrong with this decision surface? (Hint:
generalization)

17
Decision Boundary Choice
• Our satisfaction is premature because the central
aim of designing a classifier is to correctly
classify new (test) input
• Issue of generalization!

18
TYPICAL APPLICATIONS
GENERALIZATION AND RISK
Better decision boundary
• Why might a smoother decision surface be a better
choice? (Hint: Occam's razor.)
• PR investigates how to find such optimal
decision surfaces and how to provide system
designers with the tools to make intelligent

19
Need for Probabilistic Reasoning
• Most everyday reasoning is based on uncertain
evidence and inferences.
• Classical logic, which only allows conclusions to
be strictly true or strictly false, does not
account for this uncertainty or the need to weigh
and combine conflicting evidence.
• Today's expert systems employ fairly ad hoc
methods for reasoning under uncertainty and for
combining evidence.

20
Probabilistic Classification Theory
• In practice, some overlap between classes and
random variation within classes occur; hence
perfect separation between classes cannot be
achieved, and misclassification may occur.

21
HISTORY
What is Bayesian Decision Theory?
• Bayesian Probability was named after Reverend
Thomas Bayes (1702-1761).
• He proved a special case of what is currently
known as the Bayes Theorem.
• The term Bayesian came into use around the
1950s.

http://en.wikipedia.org/wiki/Bayesian_probability
22
HISTORY (Cont.)
• Pierre-Simon, Marquis de Laplace (1749-1827)
independently proved a generalized version of
Bayes Theorem.
• 1970 Bayesian Belief Network at Stanford
University (Judea Pearl 1988)
• The ideas proposed above were not fully
developed until later. BBNs became popular in the
1990s.

23
HISTORY (Cont.)
• Current uses of Bayesian Networks
• Microsoft's printer troubleshooter.
• Diagnose diseases (Mycin).
• Used to predict oil and stock prices
• Control the space shuttle
• Risk analysis: schedule and cost overruns.

24
BAYESIAN DECISION THEORY
PROBABILISTIC DECISION THEORY
• Bayesian decision theory is a fundamental
statistical approach to the problem of pattern
classification.
• It uses a probabilistic approach to make
decisions (e.g., classification) so as to minimize
the risk (cost).
• Assume all relevant probability distributions are
known (later we will learn how to estimate these
from data).

25
BAYESIAN DECISION THEORY
PRIOR PROBABILITIES
• State of nature is prior information
• Let $\omega$ denote the state of nature
• Model it as a random variable, $\omega$
• $\omega = \omega_1$: the event that the next fish is a sea
bass
• Category 1: sea bass; category 2: salmon
• A priori probabilities
• $P(\omega_1)$ = probability of category 1
• $P(\omega_2)$ = probability of category 2
• $P(\omega_1) + P(\omega_2) = 1$ (either $\omega_1$ or $\omega_2$ must occur)
• Decision rule
• Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise, decide
$\omega_2$

But we know there will be many mistakes.
http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm
26
BAYESIAN DECISION THEORY
CLASS-CONDITIONAL PROBABILITIES
• A decision rule with only prior information
always produces the same result and ignores
measurements.
• If $P(\omega_1) \gg P(\omega_2)$, we will be correct most of
the time.
• Given a feature, $x$ (lightness), which is a
continuous random variable, $p(x|\omega_2)$ is the
class-conditional probability density function
• $p(x|\omega_1)$ and $p(x|\omega_2)$ describe the difference in
lightness between populations of sea bass and salmon.

27
Let $x$ be a continuous random variable. $p(x|\omega)$
is the probability density for $x$ given the state
of nature $\omega$.
$p(\text{lightness} \mid \text{salmon})$ = ?
$p(\text{lightness} \mid \text{sea bass})$ = ?
28
BAYESIAN DECISION THEORY
BAYES FORMULA
How do we combine a priori and class-conditional
probabilities to know the probability of a state
of nature?
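• Suppose we know both $P(\omega_j)$ and $p(x|\omega_j)$, and we
can measure $x$. How does this influence our
decision?
• The joint probability of finding a pattern
that is in category $\omega_j$ and that this pattern has a
feature value of $x$ is
$p(\omega_j, x) = P(\omega_j|x)\,p(x) = p(x|\omega_j)\,P(\omega_j)$
• Rearranging terms, we arrive at Bayes formula:
$P(\omega_j|x) = \dfrac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$, where $p(x) = \sum_j p(x|\omega_j)\,P(\omega_j)$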

29
• A Casual Formulation
• The prior probability reflects knowledge of the
relative frequency of instances of a class
• The likelihood is a measure of the probability
that a measurement value occurs in a class
• The evidence is a scaling term

BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
• Bayes formula
$P(\omega_j|x) = \dfrac{p(x|\omega_j)\,P(\omega_j)}{p(x)}$
• can be expressed in words as
posterior = (likelihood × prior) / evidence
• By measuring $x$, we can convert the prior
probability, $P(\omega_j)$, into a posterior probability,
$P(\omega_j|x)$.
• Evidence can be viewed as a scale factor and is
often ignored in optimization applications (e.g.,
speech recognition).

For two categories
Bayes Decision: Choose $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$;
otherwise choose $\omega_2$.
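
The two-category rule can be sketched in a few lines of Python. This is only an illustration: the Gaussian lightness densities, the prior values, and the helper names `posteriors` and `decide` are assumptions made for the example, not part of the slides.

```python
from statistics import NormalDist

def posteriors(x, priors, densities):
    """Return [P(w_j | x)] using Bayes formula: likelihood * prior / evidence."""
    joint = [d.pdf(x) * p for d, p in zip(densities, priors)]  # p(x|w_j) P(w_j)
    evidence = sum(joint)                                      # p(x)
    return [j / evidence for j in joint]

def decide(x, priors, densities):
    """Return the 0-based index of the class with the largest posterior."""
    post = posteriors(x, priors, densities)
    return max(range(len(post)), key=lambda j: post[j])

# Assumed (hypothetical) lightness models for the two categories.
priors = [2 / 3, 1 / 3]
densities = [NormalDist(mu=4.0, sigma=1.0), NormalDist(mu=6.0, sigma=1.0)]

for x in (3.0, 5.0, 7.0):
    post = posteriors(x, priors, densities)
    print(x, [round(p, 3) for p in post], "-> class", decide(x, priors, densities) + 1)
```

Because the evidence is the same for every class, `decide` would return the same answer if it compared the unnormalized products directly.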
30
BAYESIAN THEOREM
• A special case of Bayes' theorem
• $P(A \cap B) = P(B) \times P(A|B)$
• $P(B \cap A) = P(A) \times P(B|A)$
• Since $P(A \cap B) = P(B \cap A)$,
• $P(B) \times P(A|B) = P(A) \times P(B|A)$
• $\Rightarrow P(A|B) = P(A) \times P(B|A) / P(B)$

31
Preliminaries and Notations
$\omega_j$: a state of nature
$P(\omega_j)$: prior probability
$x$: feature vector
$p(x|\omega_j)$: class-conditional density
$P(\omega_j|x)$: posterior probability
32
Decision
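Decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ $\forall j \neq i$
The evidence, $p(x)$, is a scale factor that
assures conditional probabilities sum to 1:
$P(\omega_1|x) + P(\omega_2|x) = 1$
We can eliminate the scale factor (which appears
on both sides of the inequality)
Decide $\omega_i$ if $p(x|\omega_i)P(\omega_i) > p(x|\omega_j)P(\omega_j)$ $\forall j \neq i$
• Special cases
• $P(\omega_1) = P(\omega_2) = \cdots = P(\omega_c)$
• $p(x|\omega_1) = p(x|\omega_2) = \cdots = p(x|\omega_c)$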

33
Two Categories
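Decide $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ $\forall j \neq i$
Decide $\omega_i$ if $p(x|\omega_i)P(\omega_i) > p(x|\omega_j)P(\omega_j)$ $\forall j \neq i$
Decide $\omega_1$ if $P(\omega_1|x) > P(\omega_2|x)$; otherwise
decide $\omega_2$
Decide $\omega_1$ if $p(x|\omega_1)P(\omega_1) > p(x|\omega_2)P(\omega_2)$;
otherwise decide $\omega_2$
• Special cases
• 1. $P(\omega_1) = P(\omega_2)$
• Decide $\omega_1$ if $p(x|\omega_1) > p(x|\omega_2)$; otherwise
decide $\omega_2$
• 2. $p(x|\omega_1) = p(x|\omega_2)$
• Decide $\omega_1$ if $P(\omega_1) > P(\omega_2)$; otherwise decide $\omega_2$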

34
Example
$P(\omega_1) = P(\omega_2)$
(Figure: decision regions $R_1$ and $R_2$)
35
Example
$P(\omega_1) = 2/3$, $P(\omega_2) = 1/3$
Bayes Decision Rule
Decide $\omega_1$ if $p(x|\omega_1)P(\omega_1) > p(x|\omega_2)P(\omega_2)$;
otherwise decide $\omega_2$
36
BAYESIAN DECISION THEORY
POSTERIOR PROBABILITIES
• For every value of x, the posteriors sum to 1.0.
• At $x = 14$, the probability it is in category $\omega_2$ is
0.08, and for category $\omega_1$ is 0.92.

37
BAYESIAN DECISION THEORY
BAYES DECISION RULE
Classification Error
• Decision rule
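• For an observation $x$, decide $\omega_1$ if $P(\omega_1|x) >
P(\omega_2|x)$; otherwise, decide $\omega_2$
• Probability of error
$P(\text{error}|x) = P(\omega_1|x)$ if we decide $\omega_2$, and $P(\omega_2|x)$ if we decide $\omega_1$
• The average probability of error is given by
$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}, x)\,dx = \int_{-\infty}^{\infty} P(\text{error}|x)\,p(x)\,dx$
• If for every $x$ we ensure that $P(\text{error}|x)$ is as
small as possible, then the integral is as small
as possible. Thus, the Bayes decision rule
minimizes $P(\text{error})$.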
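
The average probability of error can be checked numerically. The sketch below assumes two unit-variance Gaussian class-conditional densities with equal priors (values chosen for illustration) and integrates the smaller of the two joint densities, which is what $P(\text{error}|x)\,p(x)$ equals under the Bayes rule. The helper name `bayes_error` is hypothetical.

```python
from statistics import NormalDist

def bayes_error(priors, densities, lo, hi, steps=20000):
    """Approximate P(error) = integral of min_j p(x|w_j) P(w_j) dx (two classes)
    with the trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps

    def integrand(x):
        # Under the Bayes rule, P(error|x) p(x) is the smaller joint density.
        return min(d.pdf(x) * p for d, p in zip(densities, priors))

    total = 0.5 * (integrand(lo) + integrand(hi))
    for k in range(1, steps):
        total += integrand(lo + k * h)
    return total * h

# Assumed example: equal priors, unit-variance Gaussians centered at 4 and 10.
densities = [NormalDist(4.0, 1.0), NormalDist(10.0, 1.0)]
err = bayes_error([0.5, 0.5], densities, lo=-5.0, hi=20.0)
print(err)                     # numerical estimate
print(NormalDist().cdf(-3.0))  # closed form for this symmetric case
```

In this symmetric case the boundary sits at the midpoint, three standard deviations from each mean, so the closed form is the standard normal tail below $-3$.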

38
CONTINUOUS FEATURES
GENERALIZATION OF TWO-CLASS PROBLEM
• Generalization of the preceding ideas
• Use more than one feature (e.g., length and
lightness)
• Use more than two states of nature (e.g., N-way
classification)
• Allow actions other than a decision
on the state of nature (e.g., rejection: refusing
to take an action when alternatives are close or
confidence is low)
• Introduce a loss function, which is more
general than the probability of error (e.g.,
errors are not equally costly)
• Let us replace the scalar $x$ by the vector $\mathbf{x}$ in a
$d$-dimensional Euclidean space, $\mathbb{R}^d$, called the
feature space.

39
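The Generalization
$\{\omega_1, \ldots, \omega_c\}$: a set of $c$ states of nature or $c$ categories
$\{\alpha_1, \ldots, \alpha_a\}$: a set of $a$ possible actions
LOSS FUNCTION
$\lambda(\alpha_i|\omega_j)$: the loss incurred for taking action $\alpha_i$ when the
true state of nature is $\omega_j$ (it can be zero).
Risk
We want to minimize the expected loss in making
a decision.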
40
Examples
• Ex 1: Fish classification
• $X$ is the image of the fish
• $x$ = (brightness, length, fin, etc.)
• $\omega$ is our belief about what the fish type is
• $\{\omega_j\}$ = {sea bass, salmon, trout, etc.}
• $\alpha$ is a decision for the fish type, in this
case
• $\{\alpha_i\}$ = {sea bass, salmon, trout, manual
inspection needed, etc.}
• Ex 2: Medical diagnosis
• $X$ = all the available medical tests and imaging scans
that a doctor can order for a patient
• $x$ = (blood pressure, glucose level, cough, x-ray,
etc.)
• $\omega$ is an illness type
• $\{\omega_j\}$ = {flu, cold, TB, pneumonia, lung
cancer, etc.}
• $\alpha$ is a decision for treatment
• $\{\alpha_i\}$ = {Tylenol, hospitalize, more tests
needed, etc.}

41
Conditional Risk
Given $x$, the expected loss (risk) associated with
taking action $\alpha_i$:
$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda(\alpha_i|\omega_j)\,P(\omega_j|x)$
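
A minimal sketch of the conditional risk, assuming a loss matrix indexed as `loss[i][j]` for action $i$ and true state $j$. The posterior values and both loss matrices are made up for illustration, and the function names are hypothetical.

```python
def conditional_risks(loss, post):
    """R(a_i | x) = sum_j loss[i][j] * P(w_j | x), one value per action a_i."""
    return [sum(l * p for l, p in zip(row, post)) for row in loss]

def bayes_action(loss, post):
    """0-based index of the action with minimum conditional risk."""
    risks = conditional_risks(loss, post)
    return min(range(len(risks)), key=lambda i: risks[i])

# Assumed posteriors at some observation x, and two hypothetical loss matrices.
post = [0.6, 0.4]
zero_one = [[0, 1], [1, 0]]    # all errors equally costly
asymmetric = [[0, 2], [1, 0]]  # choosing action 1 when state 2 is true costs 2

print(conditional_risks(zero_one, post), bayes_action(zero_one, post))
print(conditional_risks(asymmetric, post), bayes_action(asymmetric, post))
```

With the zero-one loss the minimum-risk action is the class with the largest posterior; with the asymmetric loss the same posteriors lead to the other action, which shows how costs can override the more probable class.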
42
0/1 Loss Function
43
Decision
A general decision rule is a function $\alpha(x)$ that
tells us which action to take for every possible
observation.
Bayesian Decision Rule: $\alpha(x) = \arg\min_{\alpha_i} R(\alpha_i|x)$
44
Overall Risk
The overall risk is given by
$R = \int R(\alpha(x)|x)\,p(x)\,dx$
• Compute the conditional risk for every $\alpha_i$ and
select the action that minimizes $R(\alpha_i|x)$. The resulting minimum overall risk is
denoted $R^*$, and is referred to as the Bayes risk
• The Bayes risk is the best performance that can
be achieved.

Decision function
If we choose $\alpha(x)$ so that $R(\alpha(x)|x)$ is as small as
possible for every $x$, the overall risk will be
minimized.
The Bayesian decision rule is the optimal one for
minimizing the overall risk. Its resulting overall
risk is called the Bayesian risk.
45
Two-Category Classification
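• Let $\alpha_1$ correspond to $\omega_1$, $\alpha_2$ to $\omega_2$, and $\lambda_{ij} =
\lambda(\alpha_i|\omega_j)$
• The conditional risk is given by
$R(\alpha_1|x) = \lambda_{11}P(\omega_1|x) + \lambda_{12}P(\omega_2|x)$
$R(\alpha_2|x) = \lambda_{21}P(\omega_1|x) + \lambda_{22}P(\omega_2|x)$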

46
Two-Category Classification
Our decision rule is: choose $\alpha_1$ if $R(\alpha_1|x) <
R(\alpha_2|x)$; otherwise decide $\alpha_2$
Perform $\alpha_1$ if $R(\alpha_2|x) > R(\alpha_1|x)$; otherwise
perform $\alpha_2$
47
Two-Category Classification
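Perform $\alpha_1$ if $R(\alpha_2|x) > R(\alpha_1|x)$; otherwise
perform $\alpha_2$
Equivalently, perform $\alpha_1$ if
$(\lambda_{21} - \lambda_{11})\,P(\omega_1|x) > (\lambda_{12} - \lambda_{22})\,P(\omega_2|x)$,
where both factors in parentheses are positive.
Posterior probabilities are scaled before
comparison.
• If the loss incurred for making an error is
greater than that incurred for being correct, the
factors $(\lambda_{21} - \lambda_{11})$ and $(\lambda_{12} - \lambda_{22})$ are positive,
and the ratio of these factors simply scales the
posteriors.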

48
Two-Category Classification
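Perform $\alpha_1$ if $R(\alpha_2|x) > R(\alpha_1|x)$; otherwise
perform $\alpha_2$
By employing Bayes formula, we can replace the
posteriors by the prior probabilities and
conditional densities (the evidence $p(x)$ is irrelevant because it appears on both sides):
perform $\alpha_1$ if $(\lambda_{21} - \lambda_{11})\,p(x|\omega_1)\,P(\omega_1) > (\lambda_{12} - \lambda_{22})\,p(x|\omega_2)\,P(\omega_2)$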
49
Two-Category Classification
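This slide will be recalled later.
Stated as: choose $\alpha_1$ if the likelihood ratio
exceeds a threshold value independent of the
observation $x$.
Perform $\alpha_1$ if
$\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}$
(left side: likelihood ratio; right side: threshold)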
51
Loss Factors
• If the loss factors are identical, and the prior
probabilities are equal, this reduces to a
standard likelihood ratio

52
MINIMUM ERROR RATE
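Error rate (the probability of error) is to be
minimized
• Consider a symmetrical or zero-one loss function:
$\lambda(\alpha_i|\omega_j) = 0$ if $i = j$, and $1$ if $i \neq j$
• The conditional risk is
$R(\alpha_i|x) = \sum_{j \neq i} P(\omega_j|x) = 1 - P(\omega_i|x)$
• The conditional risk is the average probability
of error.
• To minimize error, maximize $P(\omega_i|x)$, also known
as maximum a posteriori decoding (MAP).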

53
Minimum Error Rate
LIKELIHOOD RATIO
• Minimum error rate classification
• choose $\omega_i$ if $P(\omega_i|x) > P(\omega_j|x)$ for all $j \neq i$

54
Example
• For the sea bass population, the lightness $x$ is a
normal random variable distributed according to
$N(4, 1)$
• For the salmon population, $x$ is distributed
according to $N(10, 1)$
• Select the optimal decision where
• The two fish are equiprobable
• $P(\text{sea bass}) = 2 \times P(\text{salmon})$
• The cost of classifying a fish as a salmon when
it truly is a sea bass is 2, and the cost of
classifying a fish as a sea bass when it is truly
a salmon is 1.
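
One way to work this example is through the likelihood-ratio threshold. The sketch below treats sea bass as class 1 and salmon as class 2, reads the three bullets as three separate cases, and assumes equal priors for the cost case; those readings, and the function name `boundary`, are assumptions rather than something the slides state.

```python
import math

def boundary(mu1, mu2, sigma, p1, p2, l12=1.0, l21=1.0, l11=0.0, l22=0.0):
    """Lightness x* below which we decide class 1 (requires mu1 < mu2).

    Decide class 1 when p(x|w1) / p(x|w2) > theta, with
    theta = (l12 - l22) / (l21 - l11) * p2 / p1.
    For equal-variance Gaussians the log-likelihood ratio is linear in x.
    """
    theta = (l12 - l22) / (l21 - l11) * (p2 / p1)
    return (mu1 + mu2) / 2 - sigma ** 2 * math.log(theta) / (mu2 - mu1)

# Class 1 = sea bass ~ N(4, 1), class 2 = salmon ~ N(10, 1).
equal = boundary(4, 10, 1, 0.5, 0.5)
bass_twice = boundary(4, 10, 1, 2 / 3, 1 / 3)
# l21 = 2: cost of calling a sea bass a salmon; l12 = 1: the reverse error.
# Equal priors are assumed for this third case.
costly = boundary(4, 10, 1, 0.5, 0.5, l12=1.0, l21=2.0)
print(equal, bass_twice, costly)
```

Under these assumptions the boundary is the midpoint 7 for equal priors and moves toward the salmon mean, by $\ln 2 / 6$, both when sea bass is twice as likely and when mistaking a sea bass for a salmon costs twice as much.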

58
• End of Section 1

59
The Multicategory Classification
How to define discriminant functions?
How do we represent pattern classifiers?
The most common way is through discriminant
functions. Remember we use $\omega_1, \omega_2, \ldots, \omega_c$ to be
the possible states of nature.
For each class we create a discriminant function
$g_i(x)$.
The $g_i(x)$ are called the discriminant functions.
Our classifier is a network or machine that
computes $c$ discriminant functions $g_1(x), g_2(x), \ldots, g_c(x)$
and outputs the action $\alpha(x)$.
The classifier: assign $x$ to $\omega_i$ if
$g_i(x) > g_j(x)$ for all $j \neq i$.
60
Simple Discriminant Functions
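If $f(\cdot)$ is a monotonically increasing function,
then the $f(g_i(\cdot))$ are also discriminant
functions.
Notice the decision is the same if we replace
every $g_i(x)$ by $f(g_i(x))$, assuming $f(\cdot)$ is a
monotonically increasing function.
Minimum Risk case: $g_i(x) = -R(\alpha_i|x)$
Minimum Error-Rate case: $g_i(x) = P(\omega_i|x)$, or equivalently $g_i(x) = p(x|\omega_i)P(\omega_i)$ or $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$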
61
Figure 2.5
62
Decision Regions
The net effect is to divide the feature space
into c regions (one for each class). We then have
c decision regions separated by decision
boundaries.
Two-category example
Decision regions are separated by decision
boundaries.
63
Figure 2.6
64
Bayesian Decision Theory(Classification)
• The Normal Distribution

65
Basics of Probability
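Discrete random variable (X) - Assume integer
Probability mass function (pmf): $p(x) = P(X = x)$
Cumulative distribution function (cdf): $F(x) = P(X \le x) = \sum_{t \le x} p(t)$
Continuous random variable (X)
Probability density function (pdf): $p(x)$ (not a probability)
Cumulative distribution function (cdf): $F(x) = P(X \le x) = \int_{-\infty}^{x} p(t)\,dt$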
66
Probability mass function
• The graph of a probability mass function.
• All the values of this function must be
non-negative and sum up to 1.

67
Probability density function
• A probability can be calculated by integrating
the pdf $f(x)$ over an interval
of the input variable $x$.
• For example, the probability of the variable $X$
being within the interval $[4.3, 7.8]$ would be
$P(4.3 \le X \le 7.8) = \int_{4.3}^{7.8} f(x)\,dx$

68
Expectations
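Let $g$ be a function of random variable $X$: $E[g(X)] = \sum_x g(x)\,p(x)$ (discrete) or $\int g(x)\,p(x)\,dx$ (continuous).
The $k$th moment: $E[X^k]$
The 1st moment: $\mu_X = E[X]$
The $k$th central moment: $E[(X - \mu_X)^k]$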
69
Important Expectations
Mean: $\mu_X = E[X]$
Variance: $\sigma_X^2 = E[(X - \mu_X)^2]$
Fact: $\mathrm{Var}[X] = E[X^2] - (E[X])^2$
70
Entropy
The entropy measures the fundamental uncertainty
in the value of points selected randomly from a
distribution.
71
Univariate Gaussian Distribution
• Properties
• Maximize the entropy
• Central limit theorem

$X \sim N(\mu, \sigma^2)$
$E[X] = \mu$
$\mathrm{Var}[X] = \sigma^2$
72
Illustration of the central limit theorem
Let $x_1, x_2, \ldots, x_n$ be a sequence of $n$ independent
and identically distributed random variables,
each having finite expectation $\mu$ and
variance $\sigma^2 > 0$. The central limit theorem states
that as the sample size $n$ increases, the
distribution of the sample average of these
random variables approaches the normal
distribution with mean $\mu$ and variance $\sigma^2 / n$,
irrespective of the shape of the original
distribution.
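
A small simulation can make the statement concrete. This sketch assumes a uniform distribution on $[0, 1)$ as the original distribution (so $\mu = 1/2$ and $\sigma^2 = 1/12$) and uses a fixed seed; the sample size, trial count, and helper name `sample_means` are arbitrary choices for illustration.

```python
import random
import statistics

def sample_means(draw, n, trials, rng):
    """Return `trials` averages, each over n independent draws."""
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]

rng = random.Random(0)  # fixed seed so the run is repeatable
n, trials = 30, 5000
# Uniform(0, 1) is clearly non-normal: mu = 1/2, sigma^2 = 1/12.
means = sample_means(lambda r: r.random(), n, trials, rng)

print("mean of sample means:", statistics.fmean(means))          # close to 0.5
print("variance of sample means:", statistics.pvariance(means))  # close to (1/12) / 30
```

A histogram of `means` would look approximately bell-shaped even though each individual draw is uniform.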
73
Random Vectors
A $d$-dimensional random vector: $X = (X_1, X_2, \ldots, X_d)^T$
Vector Mean: $\mu = E[X]$
Covariance Matrix: $\Sigma = E[(X - \mu)(X - \mu)^T]$
74
Multivariate Gaussian Distribution
$X \sim N(\mu, \Sigma)$
A $d$-dimensional random vector
$E[X] = \mu$
$E[(X - \mu)(X - \mu)^T] = \Sigma$
75
Properties of $N(\mu, \Sigma)$
$X \sim N(\mu, \Sigma)$
A $d$-dimensional random vector
Let $Y = A^T X$, where $A$ is a $d \times k$ matrix.
$Y \sim N(A^T\mu, A^T\Sigma A)$
77
On Parameters of $N(\mu, \Sigma)$
$X \sim N(\mu, \Sigma)$
78
More On Covariance Matrix
$\Sigma$ is symmetric and positive semidefinite: $\Sigma = \Phi\Lambda\Phi^T$
$\Phi$: orthonormal matrix, whose columns are
eigenvectors of $\Sigma$.
$\Lambda$: diagonal matrix (eigenvalues).
79
Whitening Transform
$X \sim N(\mu, \Sigma)$
$Y = A^T X$
$Y \sim N(A^T\mu, A^T\Sigma A)$
Let $A_w = \Phi\Lambda^{-1/2}$; then $A_w^T\Sigma A_w = I$.
80
Whitening Transform
Whitening
$X \sim N(\mu, \Sigma)$
Linear Transform
$Y = A^T X$
$Y \sim N(A^T\mu, A^T\Sigma A)$
Let $A_w = \Phi\Lambda^{-1/2}$; then $A_w^T\Sigma A_w = I$.
Projection
81
Whitening Transform
• The whitening transformation is a decorrelation
method that converts the covariance matrix $\Sigma$ of a set of samples
into the identity matrix $I$.
• This effectively creates new random variables
that are uncorrelated and have unit variance.
• The method is called the whitening transform
because it transforms the input closer towards white noise.

This can be expressed as $A_w = \Phi\Lambda^{-1/2}$,
where $\Phi$ is the matrix with the eigenvectors of
$\Sigma$ as its columns and $\Lambda$ is the diagonal matrix
of non-increasing eigenvalues.
82
White noise
• White noise is a random signal (or process) with
a flat power spectral density.
• In other words, the signal contains equal power
within a fixed bandwidth at any center frequency.

(Figure: energy spectral density)
83
Mahalanobis Distance
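$X \sim N(\mu, \Sigma)$
$p(x) = \dfrac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}} \exp\left[-\dfrac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right]$
$r^2 = (x - \mu)^T\Sigma^{-1}(x - \mu)$ is the squared Mahalanobis distance.
The leading factor is constant; the density depends on $x$ only through the value of $r^2$.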
Mahalanobis distance
In statistics, Mahalanobis distance is a distance
measure introduced by P. C. Mahalanobis in
1936. It is based on correlations between
variables by which different patterns can be
identified and analyzed. It is a useful way of
determining similarity of an unknown sample set
to a known one. It differs from Euclidean
distance in that it takes into account the
correlations of the data set and is
scale-invariant, i.e. not dependent on the scale
of measurements .
85
Mahalanobis distance
• Formally, the Mahalanobis distance of a multivariate
vector $x$ from a group of values with mean $\mu$ and covariance matrix $\Sigma$
is defined as
$D_M(x) = \sqrt{(x - \mu)^T\Sigma^{-1}(x - \mu)}$
• Mahalanobis distance can also be defined as a
dissimilarity measure between two random vectors $x$ and $y$
of the same distribution with the
covariance matrix $\Sigma$:
$d(x, y) = \sqrt{(x - y)^T\Sigma^{-1}(x - y)}$
• If the covariance matrix is the identity matrix,
the Mahalanobis distance reduces to the Euclidean
distance. If the covariance matrix is diagonal,
then the resulting distance measure is called the
normalized Euclidean distance
$d(x, y) = \sqrt{\sum_i (x_i - y_i)^2 / \sigma_i^2}$
• where $\sigma_i$ is the standard deviation of the $x_i$ over
the sample set.
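
As a minimal two-dimensional sketch of the definition, the function below uses the closed-form inverse of a $2 \times 2$ matrix. The function name `mahalanobis_2d` and the example matrices are illustrative assumptions.

```python
import math

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance of 2-D point x from mean mu under 2x2 covariance cov."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T cov^{-1} (x - mu), using the closed-form 2x2 inverse.
    r2 = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(r2)

identity = [[1.0, 0.0], [0.0, 1.0]]
diagonal = [[4.0, 0.0], [0.0, 1.0]]
print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), identity))  # equals the Euclidean distance
print(mahalanobis_2d((2.0, 1.0), (0.0, 0.0), diagonal))  # normalized Euclidean distance
```

With the identity matrix the result matches the Euclidean distance, and with the diagonal matrix each coordinate is divided by its standard deviation first, as the two special cases describe.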

87
Bayesian Decision Theory(Classification)
• Discriminant Functions for the Normal Populations

88
Normal Density
If features are statistically independent and the
variance is the same for all features, the
discriminant function is simple and is linear in
nature. A classifier that uses linear
discriminant functions is called a linear
machine. The decision surfaces are pieces of
hyperplanes defined by linear equations.
89
Minimum-Error-Rate Classification
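• Assuming the measurements are normally
distributed, $X \sim N(\mu_i, \Sigma_i)$ for class $\omega_i$, we have
$g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i) = -\dfrac{1}{2}(x - \mu_i)^T\Sigma_i^{-1}(x - \mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$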
90
Some Algebra to Simplify the Discriminants
• Since
• We take the natural logarithm to re-write the
first term

91
Some Algebra to Simplify the Discriminants
(continued)
92
Minimum-Error-Rate Classification
Three Cases
Case 1: $\Sigma_i = \sigma^2 I$
Classes are centered at different means, and their
feature components are pairwise independent and
have the same variance.
Case 2: $\Sigma_i = \Sigma$
Classes are centered at different means, but have
the same covariance matrix.
Case 3: $\Sigma_i \neq \Sigma_j$
Arbitrary.
93
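Case 1. $\Sigma_i = \sigma^2 I$
$g_i(x) = -\dfrac{\|x - \mu_i\|^2}{2\sigma^2} - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$
The terms $-\frac{d}{2}\ln 2\pi$ and $-\frac{1}{2}\ln|\Sigma_i|$ are irrelevant because they are the same for every class.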
94
Case 1. $\Sigma_i = \sigma^2 I$
95
Case 1. $\Sigma_i = \sigma^2 I$
Boundary between $\omega_i$ and $\omega_j$
96
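Case 1. $\Sigma_i = \sigma^2 I$
The decision boundary will be a hyperplane
perpendicular to the line between the means, passing
through the point $x_0$.
Boundary between $\omega_i$ and $\omega_j$: $w^T(x - x_0) = 0$, with $w = \mu_i - \mu_j$ and
$x_0 = \dfrac{1}{2}(\mu_i + \mu_j) - \dfrac{\sigma^2}{\|\mu_i - \mu_j\|^2}\ln\dfrac{P(\omega_i)}{P(\omega_j)}\,(\mu_i - \mu_j)$
The first term of $x_0$ is the midpoint between the means; the second term is 0 if $P(\omega_i) = P(\omega_j)$.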
97
Case 1. $\Sigma_i = \sigma^2 I$
• When the priors are equal and
the support regions are spherical, the decision boundary is simply
halfway between the means (Euclidean distance).

Minimum distance classifier (template matching)
99
Case 1. $\Sigma_i = \sigma^2 I$
Note how priors shift the boundary away from the
more likely mean.
100
Case 1. $\Sigma_i = \sigma^2 I$
101
Case 1. $\Sigma_i = \sigma^2 I$
102
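Case 2. $\Sigma_i = \Sigma$
• Covariance matrices are arbitrary, but equal to
each other for all classes.
• Features then form hyper-ellipsoidal clusters of
equal size and shape.

$g_i(x) = -\dfrac{1}{2}(x - \mu_i)^T\Sigma^{-1}(x - \mu_i) + \ln P(\omega_i)$
The quadratic form is the squared Mahalanobis distance. The term $\ln P(\omega_i)$ is irrelevant if $P(\omega_i) = P(\omega_j)$ $\forall i, j$.
The terms $-\frac{d}{2}\ln 2\pi$ and $-\frac{1}{2}\ln|\Sigma|$ are irrelevant because they are the same for every class.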
103
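Case 2. $\Sigma_i = \Sigma$
Expanding the quadratic form, the term $x^T\Sigma^{-1}x$ is irrelevant because it is the same for every class, which leaves a linear discriminant $g_i(x) = w_i^T x + w_{i0}$ with $w_i = \Sigma^{-1}\mu_i$ and $w_{i0} = -\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i + \ln P(\omega_i)$.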
104
Case 2. $\Sigma_i = \Sigma$
• The discriminant hyperplanes are often not
orthogonal to the segments joining the class means

105
Case 2. $\Sigma_i = \Sigma$
106
Case 2. $\Sigma_i = \Sigma$
107
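Case 3. $\Sigma_i \neq \Sigma_j$
The covariance matrices are different for each
category. In the two-class case, the decision
surfaces are hyperquadrics, e.g.,
• hyperplanes
• hyperspheres
• hyperellipsoids
• hyperhyperboloids

$g_i(x) = -\dfrac{1}{2}(x - \mu_i)^T\Sigma_i^{-1}(x - \mu_i) - \dfrac{d}{2}\ln 2\pi - \dfrac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$
The quadratic term $-\frac{1}{2}x^T\Sigma_i^{-1}x$ now differs between classes; without this term the discriminant is linear, as in Cases 1 and 2. The term $-\frac{d}{2}\ln 2\pi$ is irrelevant.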
108
Case 3. $\Sigma_i \neq \Sigma_j$
Non-simply connected decision regions can arise
in one dimension for Gaussians having unequal
variance.
109
Case 3. $\Sigma_i \neq \Sigma_j$
110
Case 3. $\Sigma_i \neq \Sigma_j$
111
Case 3. $\Sigma_i \neq \Sigma_j$
112
Multi-Category Classification
113
Example: A Problem
• Exemplars (transposed)
• For $\omega_1$: (2, 6), (3, 4), (3, 8), (4, 6)
• For $\omega_2$: (1, -2), (3, 0), (3, -4), (5, -2)
• Calculated means (transposed)
• $m_1$ = (3, 6)
• $m_2$ = (3, -2)
114
Example: Covariance Matrices
115
Example: Covariance Matrices
116
Example: Inverse and Determinant for Each of the
Covariance Matrices
117
Example: A Discriminant Function for Class 1
118
Example
119
Example: A Discriminant Function for Class 2
120
Example
121
Example: The Class Boundary
122
123
Example: Plot of the Discriminant
124
Summary Steps for Building a Bayesian Classifier
• Collect class exemplars
• Estimate class a priori probabilities
• Estimate class means
• Form covariance matrices, find the inverse and
determinant for each
• Form the discriminant function for each class

125
Using the Classifier
• Obtain a measurement vector x
• Evaluate the discriminant function $g_i(x)$ for each
class $i = 1, \ldots, c$
• Decide $x$ is in the class $j$ if $g_j(x) > g_i(x)$ for
all $i \neq j$
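
The building and usage steps can be sketched for the two-dimensional exemplars of the worked example. The sketch assumes maximum-likelihood covariance estimates (dividing by the number of exemplars), priors estimated from class counts, and a discriminant with the class-independent constant dropped; the slides do not show which estimator they use, and all function names are hypothetical.

```python
import math

def mean(points):
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(2)]

def covariance(points, mu):
    """Maximum-likelihood (divide by n) 2x2 covariance matrix."""
    n = len(points)
    return [[sum((p[r] - mu[r]) * (p[c] - mu[c]) for p in points) / n
             for c in range(2)] for r in range(2)]

def make_discriminant(points, prior):
    """g(x) = -1/2 (x-mu)^T S^{-1} (x-mu) - 1/2 ln|S| + ln P(w), 2-D case."""
    mu = mean(points)
    (a, b), (c, d) = covariance(points, mu)
    det = a * d - b * c

    def g(x):
        dx, dy = x[0] - mu[0], x[1] - mu[1]
        r2 = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
        return -0.5 * r2 - 0.5 * math.log(det) + math.log(prior)

    return g

def classify(x, discriminants):
    """Return the 1-based index of the class with the largest g_i(x)."""
    scores = [g(x) for g in discriminants]
    return 1 + max(range(len(scores)), key=lambda i: scores[i])

w1 = [(2, 6), (3, 4), (3, 8), (4, 6)]
w2 = [(1, -2), (3, 0), (3, -4), (5, -2)]
total = len(w1) + len(w2)
gs = [make_discriminant(w1, len(w1) / total), make_discriminant(w2, len(w2) / total)]

print(mean(w1), covariance(w1, mean(w1)))
print(mean(w2), covariance(w2, mean(w2)))
print(classify((3, 5), gs), classify((3, 0), gs))
```

Under these assumptions the two covariance matrices differ, so this is a Case 3 problem and the boundary between the classes is a quadratic curve rather than a straight line.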

126
Bayesian Decision Theory (Classification)
• Criteria

127
Bayesian Decision Theory (Classification)
• Minimum Error Rate Criterion

128
Minimum-Error-Rate Classification
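• If action $\alpha_i$ is taken and the true state is
$\omega_j$, then the decision is correct if $i = j$ and in
error if $i \neq j$
• Error rate (the probability of error) is to be
minimized
• Symmetrical or zero-one loss function: $\lambda(\alpha_i|\omega_j) = 0$ if $i = j$, and $1$ if $i \neq j$
• Conditional risk: $R(\alpha_i|x) = 1 - P(\omega_i|x)$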

129
Minimum-Error-Rate Classification
130
Bayesian Decision Theory(Classification)
• Minimax Criterion

131
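Bayesian Decision Rule: Two-Category
Classification
Decide $\omega_1$ if
$\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}$
(left side: likelihood ratio; right side: threshold)
The minimax criterion deals with the case that the
prior probabilities are unknown.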
132
Basic Concept on Minimax
Choose the worst-case prior probabilities (those giving the
maximum loss) and then pick the decision rule
that minimizes the overall risk for them.
In other words, minimize the maximum possible overall risk,
so that the worst risk for any value of the
priors is as small as possible.
133
Overall Risk
134
Overall Risk
135
Overall Risk
136
Overall Risk
137
Overall Risk
$R(x) = ax + b$
Both $a$ and $b$ depend on the setting of the decision
boundary.
The overall risk for a particular $P(\omega_1)$.
138
Overall Risk
$R(x) = ax + b$
$a = 0$ for the minimax solution
$b = R_{mm}$, the minimax risk
Independent of the value of $P(\omega_i)$.
139
Minimax Risk
140
Error Probability
Use 0/1 loss function
141
Minimax Error-Probability
Use 0/1 loss function
$P(\alpha_1|\omega_2)$
$P(\alpha_2|\omega_1)$
142
Minimax Error-Probability
$P(\alpha_1|\omega_2)$
$P(\alpha_2|\omega_1)$
143
Minimax Error-Probability
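
With a 0/1 loss, the minimax rule places the boundary where the two conditional error probabilities are equal, so the overall error no longer depends on the priors. The sketch below assumes one-dimensional Gaussian class-conditional densities and a single threshold (decide $\omega_1$ for $x$ below it); the densities, the bracketing interval, and the function name are illustrative assumptions.

```python
from statistics import NormalDist

def minimax_threshold(d1, d2, lo, hi, tol=1e-10):
    """Threshold t (decide w1 if x < t) at which the two error probabilities match.

    P(error | w1) = 1 - F1(t) falls and P(error | w2) = F2(t) rises with t,
    so their difference changes sign once and bisection finds the crossing.
    """
    def gap(t):
        return (1.0 - d1.cdf(t)) - d2.cdf(t)

    while hi - lo > tol:
        mid = (lo + hi) / 2
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Assumed class-conditional densities (not taken from the slides).
d1, d2 = NormalDist(4.0, 1.0), NormalDist(10.0, 2.0)
t = minimax_threshold(d1, d2, lo=4.0, hi=10.0)
print(t, 1.0 - d1.cdf(t), d2.cdf(t))
```

For these two densities the threshold lands two standard deviations from each mean, which is why both printed error probabilities agree.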
144
Bayesian Decision Theory(Classification)
• Neyman-Pearson Criterion

145
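Bayesian Decision Rule: Two-Category
Classification
Decide $\omega_1$ if
$\dfrac{p(x|\omega_1)}{p(x|\omega_2)} > \dfrac{(\lambda_{12} - \lambda_{22})\,P(\omega_2)}{(\lambda_{21} - \lambda_{11})\,P(\omega_1)}$
(left side: likelihood ratio; right side: threshold)
The Neyman-Pearson criterion deals with the case that
both the loss functions and the prior probabilities
are unknown.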
146
Signal Detection Theory
• Signal detection theory evolved
from the development of communications and radar
equipment in the first half of the last century.
• It migrated to psychology, initially as part of
sensation and perception, in the 50's and 60's as
an attempt to understand some of the features of
human behavior when detecting very faint stimuli
that were not being explained by traditional
theories of thresholds.

147
The situation of interest
• A person is faced with a stimulus (signal) that
is very faint or confusing.
• The person must make a decision: is the signal
there or not?
• What makes this situation confusing and difficult
is the presence of other mess that is similar to
the signal. Let us call this mess noise.

148
Example
Noise is present both in the environment and in
the sensory system of the observer. The observer
reacts to the momentary total activation of the
sensory system, which fluctuates from moment to
moment, as well as responding to environmental
stimuli, which may include a signal.
149
Signal Detection Theory
Suppose we want to detect a single pulse from a
signal. We assume the signal has some random
noise. When the signal is present we observe a
normal distribution with mean $\mu_2$. When the signal
is not present we observe a normal distribution
with mean $\mu_1$. We assume the same standard deviation $\sigma$.
Can we measure the discriminability of the
problem? Can we do this independent of the
threshold $x^*$?
Discriminability: $d' = |\mu_2 - \mu_1| / \sigma$
150
Example
• A radiologist is examining a CT scan, looking for
evidence of a tumor.
• A Hard job, because there is always some
uncertainty.
• There are four possible outcomes
• hit (tumor present and doctor says "yes")
• miss (tumor present and doctor says "no")
• false alarm (tumor absent and doctor says "yes")
• correct rejection (tumor absent and doctor says
"no").

Two types of Error
151
The Four Cases
Signal detection theory was developed to help us
understand how a continuous and ambiguous signal
can lead to a binary yes/no decision.
Correct Rejection: $P(\alpha_1|\omega_1)$
Miss: $P(\alpha_1|\omega_2)$
False Alarm: $P(\alpha_2|\omega_1)$
Hit: $P(\alpha_2|\omega_2)$
152
Decision Making
Discriminability
Based on expectancy (decision bias)
Criterion
Hit: $P(\alpha_2|\omega_2)$
False Alarm: $P(\alpha_2|\omega_1)$
154
Signal Detection Theory
• How do we find $d'$ if we do not know $\mu_1$, $\mu_2$, or
$x^*$?
• From the data we can compute
• $P(x > x^* \mid \omega_2)$: a hit.
• $P(x > x^* \mid \omega_1)$: a false alarm.
• $P(x < x^* \mid \omega_2)$: a miss.
• $P(x < x^* \mid \omega_1)$: a correct rejection.
• If we plot points in a space representing hit
and false alarm rates, then we have a ROC (receiver operating
characteristic) curve.
• With it we can distinguish between
discriminability and bias.
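
For equal-variance Gaussians, one standard way to recover the discriminability from a single ROC point is to take the difference of the z-scores of the hit and false-alarm rates. The sketch below assumes that model; the means, thresholds, and function names are hypothetical.

```python
from statistics import NormalDist

STD = NormalDist()  # standard normal, used for z-scores

def rates(mu1, mu2, sigma, x_star):
    """Hit and false-alarm rates when we say 'signal' for x > x_star."""
    hit = 1.0 - NormalDist(mu2, sigma).cdf(x_star)          # P(x > x* | w2)
    false_alarm = 1.0 - NormalDist(mu1, sigma).cdf(x_star)  # P(x > x* | w1)
    return hit, false_alarm

def d_prime(hit, false_alarm):
    """Recover d' = |mu2 - mu1| / sigma from one ROC point (equal-variance Gaussians)."""
    return STD.inv_cdf(hit) - STD.inv_cdf(false_alarm)

# Hypothetical setup: noise mean 0, signal mean 2, sigma 1, so d' = 2.
for x_star in (0.5, 1.0, 1.5):
    h, fa = rates(0.0, 2.0, 1.0, x_star)
    print(x_star, round(h, 3), round(fa, 3), round(d_prime(h, fa), 6))
```

Moving the threshold changes the hit and false-alarm rates (the bias) but leaves the recovered discriminability unchanged, which is the separation the ROC curve makes visible.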

155
Hit: $P_H = P(\alpha_2|\omega_2)$
False Alarm: $P_{FA} = P(\alpha_2|\omega_1)$
156
Neyman-Pearson Criterion
Hit: $P_H = P(\alpha_2|\omega_2)$
False Alarm: $P_{FA} = P(\alpha_2|\omega_1)$
NP: maximize $P_H$ subject to $P_{FA} \le a$
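
A minimal sketch of the Neyman-Pearson choice of threshold, assuming equal-variance Gaussian noise and signal distributions with the signal mean above the noise mean, and the rule "say signal if $x > x^*$". The parameter values and the function name are illustrative assumptions.

```python
from statistics import NormalDist

def neyman_pearson(mu1, mu2, sigma, a):
    """Threshold x* and hit rate that maximize P_H subject to P_FA <= a.

    Assumes equal-variance Gaussians with mu2 > mu1 and the rule 'signal if x > x*'.
    P_H grows as x* decreases, so the optimum uses the whole budget: P_FA = a.
    """
    noise, signal = NormalDist(mu1, sigma), NormalDist(mu2, sigma)
    x_star = noise.inv_cdf(1.0 - a)   # P(x > x* | w1) = a
    p_hit = 1.0 - signal.cdf(x_star)  # P(x > x* | w2)
    return x_star, p_hit

x_star, p_hit = neyman_pearson(mu1=0.0, mu2=2.0, sigma=1.0, a=0.05)
print(x_star, p_hit)
```

No priors or losses appear anywhere in the computation; only the false-alarm budget and the two class-conditional densities are needed.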