Title: Comprehensive Introduction to the Evaluation of Neural Networks and other Computational Intelligence Decision Functions: Receiver Operating Characteristic, Jackknife, Bootstrap and other Statistical Methodologies
1. Comprehensive Introduction to the Evaluation of Neural Networks and other Computational Intelligence Decision Functions: Receiver Operating Characteristic, Jackknife, Bootstrap and other Statistical Methodologies
- David G. Brown and Frank Samuelson
- Center for Devices and Radiological Health, FDA
- 6 July 2014
2. Course Outline
- Performance measures for Computational Intelligence (CI) observers
  - Accuracy
  - Prevalence dependent measures
  - Prevalence independent measures
  - Maximization of performance: utility analysis / cost functions
- Receiver Operating Characteristic (ROC) analysis
  - Sensitivity and specificity
  - Construction of the ROC curve
  - Area under the ROC curve (AUC)
- Error analysis for CI observers
  - Sources of error
  - Parametric methods
  - Nonparametric methods
  - Standard deviations and confidence intervals
- Bootstrap methods
  - Theoretical foundation
  - Practical use
- References
3. What's the problem?
- Emphasis on algorithm innovation to the exclusion of performance assessment
- Use of subjective measures of performance ("beauty contest")
- Use of accuracy as a measure of success
- Lack of error bars: "My CIO is .01 better than yours" (+/- ?)
- Flawed methodology: training and testing on the same data
- Lack of appreciation for the many different sources of error that can be taken into account
4. Original image: Lena. Courtesy of the Signal and Image Processing Institute at the University of Southern California.
5. CI "improved" image: Baboon. Courtesy of the Signal and Image Processing Institute at the University of Southern California.
6. Panel of experts. Image: Garosha / Dreamstime.
7. I. Performance measures for computational intelligence (CI) observers
- Task-based (binary) discrimination task
- Two populations involved: normal and abnormal
- Accuracy: intuitive but incomplete
- Different consequences of success or failure for each population
- Some measures depend on the prevalence (Pr), some do not
  - Prevalence dependent: accuracy, positive predictive value, negative predictive value
  - Prevalence independent: sensitivity, specificity, ROC, AUC
- True optimization of performance requires knowledge of cost functions or utilities for successes and failures in both populations
8. How to make a CIO with >99% accuracy
- Medical problem: screening mammography (screening means testing in an asymptomatic population)
- Prevalence of breast cancer in the screening population: Pr = 0.5% (0.005)
- My CIO always says "normal"
- Accuracy (Acc) is 99.5% (accuracy of accepted present-day systems: ~75%)
- Accuracy in a diagnostic setting (Pr = 20%) is 80%
- Acc = 1 - Pr (for my CIO); a quick check is sketched below
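A minimal sketch (Python, not from the slides) checking the Acc = 1 - Pr arithmetic for a CIO that always answers "normal":

```python
# A CIO that always says "normal" has TPF = 0 and TNF = 1,
# so Acc = Pr * TPF + (1 - Pr) * TNF = 1 - Pr.
def accuracy_always_normal(prevalence):
    tpf, tnf = 0.0, 1.0
    return prevalence * tpf + (1 - prevalence) * tnf

for pr in (0.005, 0.20):  # screening vs. diagnostic prevalence
    print(f"Pr = {pr:.3f}: Acc = {accuracy_always_normal(pr):.3f}")
# Pr = 0.005: Acc = 0.995  (the ">99%" screening claim)
# Pr = 0.200: Acc = 0.800
```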
9. CIO operates on two different populations
[Figure: distributions of normal cases p(t|0) and abnormal cases p(t|1) along the t-axis, with decision threshold T]
10. Must consider effects on normal and abnormal populations separately
- CIO output: t
- p(t|0): probability distribution of t for the population of normals
- p(t|1): probability distribution of t for the population of abnormals
- Threshold T: everything to the right of T is called abnormal, everything to the left of T is called normal
- Area of p(t|0) to the left of T is the true negative fraction (TNF, specificity); to the right, the false positive fraction (FPF, type 1 error): TNF + FPF = 1
- Area of p(t|1) to the left of T is the false negative fraction (FNF, type 2 error); to the right, the true positive fraction (TPF, sensitivity): FNF + TPF = 1
- TNF, FPF, FNF, TPF are all prevalence independent, since each is some fraction of one of our two probability distributions
- Accuracy = Pr x TPF + (1 - Pr) x TNF (see the sketch below)
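A short sketch (Python; the score distributions are assumptions for illustration) of estimating the four fractions from CIO outputs and a threshold:

```python
import numpy as np

def fractions(t_normal, t_abnormal, T):
    """TNF, FPF, FNF, TPF for CIO outputs t, calling output > T abnormal."""
    tnf = np.mean(t_normal <= T)    # true negative fraction (specificity)
    fpf = np.mean(t_normal > T)     # false positive fraction (type 1 error)
    fnf = np.mean(t_abnormal <= T)  # false negative fraction (type 2 error)
    tpf = np.mean(t_abnormal > T)   # true positive fraction (sensitivity)
    return tnf, fpf, fnf, tpf

rng = np.random.default_rng(0)
t0 = rng.normal(0.0, 1.0, 10000)  # hypothetical p(t|0)
t1 = rng.normal(1.6, 1.0, 10000)  # hypothetical p(t|1)
tnf, fpf, fnf, tpf = fractions(t0, t1, T=0.8)
print(tnf + fpf, fnf + tpf)       # both pairs sum to 1, as on the slide
```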
11. [Figure: normal cases with TNF = 0.5 and FPF = 0.5, abnormal cases with TPF = 0.95 and FNF = 0.05, for threshold T on the t-axis]
12. Prevalence dependent measures
- Accuracy (Acc): Acc = Pr x TPF + (1 - Pr) x TNF
- Positive predictive value (PPV), the fraction of positives that are true positives: PPV = TPF x Pr / (TPF x Pr + FPF x (1 - Pr))
- Negative predictive value (NPV), the fraction of negatives that are true negatives: NPV = TNF x (1 - Pr) / (TNF x (1 - Pr) + FNF x Pr)
- Using Pr = .05 and the previous TPF, TNF, FNF, FPF values (TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5):
  - Acc = .05 x .95 + .95 x .5 = .52
  - PPV = .95 x .05 / (.95 x .05 + .5 x .95) = .10
  - NPV = .5 x .95 / (.5 x .95 + .05 x .05) = .995
13. Prevalence dependent measures
- Accuracy (Acc): Acc = Pr x TPF + (1 - Pr) x TNF
- Positive predictive value (PPV): PPV = TPF x Pr / (TPF x Pr + FPF x (1 - Pr))
- Negative predictive value (NPV): NPV = TNF x (1 - Pr) / (TNF x (1 - Pr) + FNF x Pr)
- Using the mammography screening Pr and previous TPF, TNF, FNF, FPF values (Pr = .005, TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5):
  - Acc = .005 x .95 + .995 x .5 = .50
  - PPV = .95 x .005 / (.95 x .005 + .5 x .995) = .01
  - NPV = .5 x .995 / (.5 x .995 + .05 x .005) = .9995
- (A sketch verifying both slides' numbers follows.)
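A quick check of the slide formulas (Python, parameter values taken from the slides):

```python
def acc_ppv_npv(pr, tpf=0.95, tnf=0.50):
    # Prevalence dependent measures from the slide formulas
    fpf, fnf = 1 - tnf, 1 - tpf
    acc = pr * tpf + (1 - pr) * tnf
    ppv = tpf * pr / (tpf * pr + fpf * (1 - pr))
    npv = tnf * (1 - pr) / (tnf * (1 - pr) + fnf * pr)
    return acc, ppv, npv

for pr in (0.05, 0.005):
    acc, ppv, npv = acc_ppv_npv(pr)
    print(f"Pr = {pr}: Acc = {acc:.2f}, PPV = {ppv:.2f}, NPV = {npv:.4f}")
# Pr = 0.05:  Acc = 0.52, PPV = 0.09, NPV = 0.9948
# Pr = 0.005: Acc = 0.50, PPV = 0.01, NPV = 0.9995
```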
14. Acc, PPV, NPV as functions of prevalence (screening mammography)
[Figure; curve parameters: TPF = .95, FNF = .05, TNF = 0.5, FPF = 0.5]
15. Acc and NPV as functions of prevalence (forced "normal" response CIO)
16. Prevalence independent measures
- Sensitivity = TPF
- Specificity = TNF = 1 - FPF
- Receiver Operating Characteristic (ROC): TPF as a function of FPF (sensitivity as a function of 1 - specificity)
- Area under the ROC curve (AUC): sensitivity averaged over all values of specificity
17. [Figure: entire ROC curve, TPF (sensitivity) vs. FPF (1 - specificity), with the ROC slope at an operating point; insets show the normal / class 0 and abnormal / class 1 score distributions]
18. Empirical ROC data for mammography screening in the US (Craig Beam et al. [18])
19. Maximization of performance
- Need to know utilities or costs of each type of decision outcome, but these are very hard to estimate accurately. You don't just maximize accuracy.
- Need prevalence
- For the mammography example:
  - TPF: prolongation of life, minus treatment cost
  - FPF: diagnostic work-up cost, anxiety
  - TNF: peace of mind
  - FNF: delay in treatment -> shortened life
- Hypothetical assignment of utilities for some decision threshold T:
  Utility_T = U(TPF) x TPF x Pr + U(FPF) x FPF x (1 - Pr) + U(TNF) x TNF x (1 - Pr) + U(FNF) x FNF x Pr
- U(TPF) = 100, U(FPF) = -10, U(TNF) = 4, U(FNF) = -20
- Utility_T = 100 x .95 x .05 - 10 x .50 x .95 + 4 x .50 x .95 - 20 x .05 x .05 = 1.85
- Now if we only knew how to trade off TPF versus FPF, we could optimize (?) medical performance.
20. Utility maximization (mammography example)
21. Choice of ROC operating point through utility analysis: screening mammography
22. Utility maximization (mammography example)
23. Utility maximization calculation
- u = (U_TPF x TPF + U_FNF x FNF) x Pr + (U_TNF x TNF + U_FPF x FPF) x (1 - Pr)
    = (U_TPF x TPF + U_FNF x (1 - TPF)) x Pr + (U_TNF x (1 - FPF) + U_FPF x FPF) x (1 - Pr)
- du/dFPF = (U_FPF - U_TNF) x (1 - Pr) + (U_TPF - U_FNF) x Pr x dTPF/dFPF
- Setting du/dFPF = 0 gives dTPF/dFPF = (U_TNF - U_FPF)(1 - Pr) / [(U_TPF - U_FNF) Pr]
- Pr = .005: dTPF/dFPF = 23.
- Pr = .05: dTPF/dFPF = 2.2
- (U_TPF = 100, U_FNF = -20, U_TNF = 4, U_FPF = -10, as on slide 19; a numerical check follows)
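A one-line numerical check of the optimal-slope formula (Python, utilities as on slide 19):

```python
def optimal_roc_slope(u_tpf, u_fnf, u_tnf, u_fpf, pr):
    # dTPF/dFPF at the utility-maximizing operating point
    return (u_tnf - u_fpf) * (1 - pr) / ((u_tpf - u_fnf) * pr)

for pr in (0.005, 0.05):
    print(pr, round(optimal_roc_slope(100, -20, 4, -10, pr), 1))
# 0.005 -> 23.2 ; 0.05 -> 2.2
```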
24. [Figure: entire ROC curve with ROC slope; TPF (sensitivity) vs. FPF (1 - specificity); normal and abnormal case distributions]
25. Estimators
- TPF, FPF, TNF, FNF, accuracy, the ROC curve, and AUC are all fractions or probabilities.
- Normally we have a finite sample of subjects on which to test our CIO. From this finite sample we try to estimate the above fractions.
- These estimates will vary depending upon the sample selected (statistical variation).
- Estimates can be nonparametric or parametric.
26. Estimators
- TPF (population value) and its estimate from a finite sample [formulas in original figure]
- Number in sample << Number in population (at least in theory)
27. II. Receiver Operating Characteristic (ROC)
- Receiver Operating Characteristic
- Binary classification
- Test result is compared to a threshold
28. [Figure: distribution of CIO output for all subjects, with threshold]
29. [Figure: distributions of output for normal / class 0 subjects, p(t|0), and abnormal / class 1 subjects, p(t|1), with threshold on the t-axis of CIO output]
30. [Figure: the two class distributions with the threshold]
31. [Figure: areas marked; specificity = true negative fraction (TNF) under p(t|0), sensitivity = true positive fraction (TPF) under p(t|1)]
32. [Figure: decisions D0 / D1 vs. truth H0 / H1; TNF = 0.50, TPF = 0.95 at the marked threshold]
33. [Figure: complementary areas; 1 - specificity = false positive fraction (FPF), 1 - sensitivity = false negative fraction (FNF)]
34. [Figure: TNF = 0.50, FPF = 0.50, FNF = 0.05, TPF = 0.95 at the marked threshold]
35. [Figure: threshold set for high sensitivity]
36. [Figure: threshold set so that sensitivity is comparable to specificity]
37. [Figure: threshold set for high specificity]
(Each panel shows the class 0 and class 1 score distributions and the resulting operating point on the ROC curve, TPF vs. FPF.)
38. Which CIO is best?
[Figure: three operating points on ROC axes]

        TPF    FPF
CIO 1   0.50   0.07
CIO 2   0.78   0.22
CIO 3   0.93   0.50
39. Do not compare rates of one class, e.g. TPF, at different rates of the other class (FPF).
[Figure: the same three operating points]

        TPF    FPF
CIO 1   0.50   0.07
CIO 2   0.78   0.23
CIO 3   0.93   0.50
40. [Figure: entire ROC curve, TPF (sensitivity) vs. FPF (1 - specificity)]
41. [Figure: ROC curves with AUC = 0.98, 0.85, and 0.5 (the chance line); larger AUC means greater discriminability, i.e., better CIO performance]
42. AUC (Area under the ROC Curve)
- AUC is a separation probability
- AUC = probability that
  - CIO output for an abnormal subject > CIO output for a normal subject
  - the CIO correctly tells which of 2 subjects is normal
- Estimating AUC from a finite sample (see the sketch below):
  - Select an abnormal subject's score x_i
  - Select a normal subject's score y_k
  - Is x_i > y_k ?
  - Average over all pairs (x_i, y_k)
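This pairwise recipe is the Wilcoxon-Mann-Whitney estimator; a minimal sketch (Python, counting ties as one half):

```python
import numpy as np

def auc_wmw(x_abnormal, y_normal):
    """Fraction of (abnormal, normal) score pairs ranked correctly."""
    x = np.asarray(x_abnormal)[:, None]
    y = np.asarray(y_normal)[None, :]
    return np.mean((x > y) + 0.5 * (x == y))

rng = np.random.default_rng(1)
print(auc_wmw(rng.normal(1.4, 1, 200), rng.normal(0, 1, 200)))
# near the Gaussian value Phi(1.4 / sqrt(2)) ~ 0.84
```

For large samples the pair average is usually computed via rank sums, which gives the identical value with less work.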
92. ROC as a Q-Q plot
- ROC plots in probability space
- ROC plots in quantile space
93. Linear Likelihood Ratio Observer for Gaussian Data
- When the input features of the data are distributed as Gaussians with equal variance:
  - the optimal discriminant, the log-likelihood ratio, is a linear function,
  - that linear discriminant is also distributed as a Gaussian, and
  - the signal-to-noise ratio (SNR) is easily calculated from the input data distributions and is a monotonic function of AUC.
- Can serve as a benchmark against which to measure CIO performance
94. Linear Ideal Observer
- p(x|0): probability distribution of the data x for the population of normals; p(x|1): probability distribution of x for the population of abnormals
- Components x_i independent, Gaussian distributed with means 0 and m_i respectively, and identical variances s_i^2
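The slide's formulas were shown graphically; a reconstruction of the standard result for this model:

```latex
\lambda(\mathbf{x}) = \ln\frac{p(\mathbf{x}\mid 1)}{p(\mathbf{x}\mid 0)}
  = \sum_i \frac{m_i x_i}{s_i^2} - \frac{1}{2}\sum_i \frac{m_i^2}{s_i^2},
\qquad
\mathrm{SNR}^2 = \sum_i \frac{m_i^2}{s_i^2},
\qquad
\mathrm{AUC} = \Phi\!\left(\frac{\mathrm{SNR}}{\sqrt{2}}\right)
```

So the log-likelihood ratio is linear in x, Gaussian under either class, and AUC is monotonic in SNR, as stated on slide 93.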
95. Maximum Likelihood CIO [derivation in original figure]
96. Linear Ideal Observer ROC [Figure: ROC curves; "D" label from original figure]
97. Likelihood Ratio = Slope of ROC
- The likelihood ratio of the decision variable t is the slope of the ROC curve
- ROC: TPF as a function of FPF, with TPF = 1 - P(T|1) and FPF = 1 - P(T|0), where P is the cumulative distribution of t
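In the same notation, the stated relation follows by differentiating along the curve (a reconstruction; the slide's derivation was graphical):

```latex
\frac{d\,\mathrm{TPF}}{d\,\mathrm{FPF}}
 = \frac{d\,\mathrm{TPF}/dT}{d\,\mathrm{FPF}/dT}
 = \frac{-\,p(T\mid 1)}{-\,p(T\mid 0)}
 = \frac{p(T\mid 1)}{p(T\mid 0)} = \mathrm{LR}(T)
```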
98. III. Error analysis for CI observers
- Sources of error
- Parametric methods
- Nonparametric methods
- Standard deviations and confidence intervals
- Hazards
99. Sources of error
- Test error: limited number of samples in the test set
- Training error: limited number of samples in the training set
- Incorrect parameters
- Incorrect feature selection, etc.
- Human observer error (when applicable)
  - Intraobserver
  - Interobserver
100. Parametric methods
- Use a known underlying probability distribution; may be exact for simulated data
- Assume a Gaussian distribution
- Other parameterizations: e.g., binomial, or ROC linearity in z-transformed coordinates (F^-1(TPF) versus F^-1(FPF), where F is the cumulative Gaussian distribution)
101. Binomial Estimates of Variance
- For single population measures f = TPF, FPF, FNF, TNF:
  Var(f) = f (1 - f) / N (numerical example below)
- For AUC: a back-of-envelope calculation of Var(AUC) [formula in original figure]
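A two-line example of the binomial error bar (Python; the sample size is an assumption for illustration):

```python
def binomial_se(f, n):
    # standard error of a fraction f estimated from n cases
    return (f * (1 - f) / n) ** 0.5

# TPF = 0.95 estimated from 100 abnormal test cases:
print(binomial_se(0.95, 100))  # ~0.022, i.e., TPF = 0.95 +/- 0.02
```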
102. Data rich case
- Repeat the experiment M times
- Estimate distribution parameters, e.g., for a Gaussian distributed performance measure f, G(m, s^2)
- Find error bars or confidence limits
103. Example: AUC
- Mean AUC
- Distribution variance
- Variance of the mean
- Error bars, confidence interval
[formulas in original figure]
104. [Figure: probability distribution for the calculation of AUC from 40 values]
105. [Figure: probability distribution for the calculation of SNR from 40 values]
106. But what's a poor boy to do?
- Reuse the data you have: resubstitution, resampling
- Two common approaches:
  - Jackknife
  - Bootstrap
107-114. Resampling [animated illustration; figures only]
115. Jackknife
- Have N observations
- Leave out m of these; we then have M = C(N, m) subsets of the N observations from which to calculate the mean and variance
- N = 10, m = 5: M = 252; N = 10, m = 1: M = 10
116. Round-robin jackknife bias derivation and variance [derivation in original figure; a leave-one-out sketch follows]
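A leave-one-out (m = 1) jackknife sketch in Python, using the usual bias and variance formulas; the statistic here is an arbitrary, deliberately biased example:

```python
import numpy as np

def jackknife(stat, data):
    """Leave-one-out jackknife bias correction and standard error."""
    n = len(data)
    theta = stat(data)
    loo = np.array([stat(np.delete(data, i)) for i in range(n)])
    bias = (n - 1) * (loo.mean() - theta)                # jackknife bias estimate
    var = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)  # jackknife variance
    return theta - bias, np.sqrt(var)

rng = np.random.default_rng(2)
x = rng.exponential(1.0, 30)
print(jackknife(lambda d: d.mean() ** 2, x))  # (bias-corrected value, SE)
```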
117. Fukunaga-Hayes bias derivation
- Divide both the normal and abnormal classes in half, yielding 4 possible pairings
118. Jackknife bias correction example: training error
- AUC estimates as a function of the number of cases N. The solid line is the multilayer perceptron result; open circles: jackknife; closed circles: Fukunaga-Hayes. The horizontal dotted line is the asymptotic ideal result.
119. IV. Bootstrap methods
- Theoretical foundation
- Practical use
120. Bootstrap variance
- What you have is what you've got: the data is your best estimate of the probability distribution
- Sample with replacement, M times
- Adequate number of samples: M > N
- Simple bootstrap (a sketch follows)
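A minimal simple-bootstrap sketch (Python). For a two-class measure such as AUC one would resample the normal and abnormal cases separately; here a generic statistic stands in:

```python
import numpy as np

def bootstrap_se(stat, data, M=2000, seed=0):
    """Resample with replacement M times; SD of the statistic over resamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [stat(data[rng.integers(0, n, n)]) for _ in range(M)]
    return np.std(reps, ddof=1)

rng = np.random.default_rng(3)
scores = rng.normal(0.0, 1.0, 50)
print(bootstrap_se(np.mean, scores))  # ~ 1/sqrt(50) ~ 0.14
```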
121. Bootstrap and jackknife error estimates
- Standard deviation of AUC. Solid line: simulation results; open circles: jackknife estimate; closed circles: bootstrap estimate. Note how much larger the jackknife error bars are than those provided by the bootstrap method.
123. .632 bootstrap for classifier performance evaluation
- Have N cases; draw M samples of size N with replacement for training (have on average .632 x N unique cases in each sample of size N)
- Test on the unused (~.368 x N) cases for each sample
- Expected fraction of cases left out of a bootstrap sample, (1 - 1/N)^N -> 1/e:

  N          5       10      20      100     infinity
  left out   .328    .349    .358    .366    .368
             1-.672  1-.651  1-.642  1-.634  1-.632
124. .632 bootstrap for classifier performance evaluation (2)
- Have N cases; draw M samples of size N with replacement for training (have on average .632 x N unique cases in each sample of size N)
- Test on the unused (~.368 x N) cases for each sample
- Get the bootstrap average result AUC_B
- Get the resubstitution result (testing on the training set) AUC_R
- AUC_.632 = .632 x AUC_B + .368 x AUC_R
- As the variance, take the AUC_B variance (see the sketch below)
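A sketch of the full recipe (Python). The classifier is a simple nearest-mean discriminant chosen only to make the example self-contained, and the data are simulated; both are assumptions, not the slides' setup:

```python
import numpy as np

def auc(x1, x0):
    # pairwise (Wilcoxon-Mann-Whitney) AUC of class-1 vs class-0 scores
    return np.mean((x1[:, None] > x0[None, :]) + 0.5 * (x1[:, None] == x0[None, :]))

def train(Xt, yt):
    # nearest-mean discriminant direction; score = x . w
    return Xt[yt == 1].mean(0) - Xt[yt == 0].mean(0)

rng = np.random.default_rng(4)
N, d = 40, 5
X = np.vstack([rng.normal(0.0, 1, (N // 2, d)), rng.normal(0.5, 1, (N // 2, d))])
y = np.repeat([0, 1], N // 2)

auc_b = []
for _ in range(200):                           # M bootstrap samples
    idx = rng.integers(0, N, N)                # train on a size-N resample
    oob = np.setdiff1d(np.arange(N), idx)      # ~0.368 N unused cases
    if len(set(y[idx])) < 2 or len(set(y[oob])) < 2:
        continue                               # need both classes present
    s = X[oob] @ train(X[idx], y[idx])
    auc_b.append(auc(s[y[oob] == 1], s[y[oob] == 0]))

s_all = X @ train(X, y)
auc_r = auc(s_all[y == 1], s_all[y == 0])      # resubstitution (optimistic)
print(0.632 * np.mean(auc_b) + 0.368 * auc_r,  # AUC_.632
      np.var(auc_b, ddof=1))                   # variance taken from AUC_B
```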
125. Traps for the unwary: overparameterization
- Cover's theorem
- For N < 2(d + 1), a hyperplane exists that will perfectly separate almost all possible dichotomies of N points in d-dimensional space (Cover's counting function is given below)
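Cover's counting function makes the statement quantitative (a standard result, restated here since the slide's own formula was in a figure):

```latex
f_d(N) \;=\; 2^{\,1-N} \sum_{i=0}^{d} \binom{N-1}{i},
\qquad
f_d(N) = 1 \ \text{for}\ N \le d+1,
\qquad
f_d\bigl(2(d+1)\bigr) = \tfrac{1}{2}
```

Here f_d(N) is the fraction of all dichotomies of N points in general position in d dimensions that are linearly separable; this is the f_d(N) plotted on the next slide.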
126. [Figure: f_d(N) for d = 1, 5, 25, 125, and the limit of large d. The abscissa x = N / (2(d + 1)) is scaled so that the values f_d(N) = 0.5 lie superposed at x = 1 for all d.]
127. Poor data hygiene
- Reporting results on training data / testing on training data
- Carrying out any part of the training process on data later used for testing
  - e.g., using all of the data to select a manageable feature set from among a large number of features, and then dividing the data into training and test sets
128. Overestimate of AUC from poor data hygiene
- Distributions of AUC values in 900 simulation experiments (left) and the mean ROC curves (right) for four validation methods:
  - Method 1: feature selection and classifier training on one dataset; classifier testing on another, independent dataset
  - Method 2: given perfect feature selection, classifier training on one dataset and classifier testing on another, independent dataset
  - Method 3: feature selection using the entire dataset; the dataset is then partitioned in two, one part for training and one for testing the classifier
  - Method 4: feature selection, classifier training, and testing all on the same dataset
129. Correct feature selection is hard to do
- Insight into feature selection performance in Method 1. The left plot shows the number of experiments (out of 900) in which each feature is selected; by design of the simulation population, the first 30 features are useful for classification and the remaining are useless. The right plot shows the distribution of the number of useful features (out of 30) selected in the 900 experiments.
130. Conclusions
- Accuracy and other prevalence dependent measures are inadequate
- ROC/AUC provide good measures of performance
- Uncertainty must be quantified
- Bootstrap and jackknife techniques are useful methods
131. V. References
- [1] K. Fukunaga, Statistical Pattern Recognition, 2nd ed. Boston: Harcourt Brace Jovanovich, 1990.
- [2] K. Fukunaga and R. R. Hayes, "Effects of sample size in classifier design," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-11, pp. 873–885, 1989.
- [3] D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics. New York: John Wiley & Sons, 1966.
- [4] J. P. Egan, Signal Detection Theory and ROC Analysis. New York: Academic Press, 1975.
- [5] C. E. Metz, "Basic principles of ROC analysis," Seminars in Nuclear Medicine, vol. VIII, no. 4, 1978.
- [6] H. H. Barrett and K. J. Myers, Foundations of Image Science. Hoboken: John Wiley & Sons, 2004, ch. 13, Statistical Decision Theory.
- [7] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Boca Raton: Chapman & Hall/CRC, 1993.
- [8] B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.
- [9] A. C. Davison and D. V. Hinkley, Bootstrap Methods and their Application. Cambridge: Cambridge University Press, 1997.
- [10] B. Efron, "Estimating the error rate of a prediction rule: Some improvements on cross-validation," Journal of the American Statistical Association, vol. 78, pp. 316–331, 1983.
- [11] B. Efron and R. J. Tibshirani, "Improvements on cross-validation: The .632+ bootstrap method," Journal of the American Statistical Association, vol. 92, no. 438, pp. 548–560, 1997.
- [12] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. New York: Springer, 2009.
- [13] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
- [14] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
- [15] R. F. Wagner, D. G. Brown, J.-P. Guedon, K. J. Myers, and K. A. Wear, "Multivariate Gaussian pattern classification: effects of finite sample size and the addition of correlated or noisy features on summary measures of goodness," in Information Processing in Medical Imaging, Proceedings of IPMI '93, 1993, pp. 507–524.
- [16] R. F. Wagner et al., "On combining a few diagnostic tests or features," in Proceedings of the SPIE, Image Processing, vol. 2167, 1994.
- [17] D. G. Brown, A. C. Schneider, M. P. Anderson, and R. F. Wagner, "Effects of finite sample size and correlated noisy input features on neural network pattern classification," in Proceedings of the SPIE, Image Processing, vol. 2167, 1994.
- [18] C. A. Beam, "Analysis of clustered data in receiver operating characteristic studies," Statistical Methods in Medical Research, vol. 7, pp. 324–336, 1998.
- [19] W. A. Yousef et al., "Assessing classifiers from two independent data sets using ROC analysis: A nonparametric approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1809–1817, 2006.
132. Appendix I
133. Searching suitcases
144. Previous class results