Title: Comprehensive Introduction to the Evaluation of Neural Networks and other Computational Intelligence Decision Functions: Receiver Operating Characteristic, Jackknife, Bootstrap and other Statistical Methodologies
1. Comprehensive Introduction to the Evaluation of Neural Networks and other Computational Intelligence Decision Functions: Receiver Operating Characteristic, Jackknife, Bootstrap and other Statistical Methodologies
- David G. Brown and Frank Samuelson
- Center for Devices and Radiological Health, FDA
- 6 July 2014
2. Course Outline
- Performance measures for Computational Intelligence (CI) observers
  - Accuracy
  - Prevalence dependent measures
  - Prevalence independent measures
  - Maximization of performance: utility analysis / cost functions
- Receiver Operating Characteristic (ROC) analysis
  - Sensitivity and specificity
  - Construction of the ROC curve
  - Area under the ROC curve (AUC)
- Error analysis for CI observers
  - Sources of error
  - Parametric methods
  - Nonparametric methods
  - Standard deviations and confidence intervals
- Bootstrap methods
  - Theoretical foundation
  - Practical use
- References
3. What's the problem?
- Emphasis on algorithm innovation to the exclusion of performance assessment
- Use of subjective measures of performance ("beauty contest")
- Use of accuracy as a measure of success
- Lack of error bars: "My CIO is .01 better than yours" (+/- ?)
- Flawed methodology: training and testing on the same data
- Lack of appreciation for the many different sources of error that can be taken into account
4. Original image: Lena. Courtesy of the Signal and Image Processing Institute at the University of Southern California.
5. CI "improved" image: Baboon. Courtesy of the Signal and Image Processing Institute at the University of Southern California.
6. Panel of experts. Image: Garosha / Dreamstime.
7. I. Performance measures for computational intelligence (CI) observers
- Task-based (binary) discrimination task
- Two populations involved: normal and abnormal
- Accuracy: intuitive but incomplete
- Different consequences of success or failure for each population
- Some measures depend on the prevalence (Pr), some do not
  - Prevalence dependent: accuracy, positive predictive value, negative predictive value
  - Prevalence independent: sensitivity, specificity, ROC, AUC
- True optimization of performance requires knowledge of cost functions or utilities for successes and failures in both populations
8. How to make a CIO with >99% accuracy
- Medical problem: screening mammography (screening means testing in an asymptomatic population)
- Prevalence of breast cancer in the screening population: Pr = 0.5% (0.005)
- My CIO always says "normal"
- Accuracy (Acc) is 99.5% (accuracy of accepted present-day systems: ~75%)
- Accuracy in a diagnostic setting (Pr = 20%) is 80%
- Acc = 1 - Pr (for my CIO); a quick check is sketched below
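A minimal sketch (Python, not from the slides) checking the Acc = 1 - Pr arithmetic for a CIO that always answers "normal":

```python
# A CIO that always says "normal" has TPF = 0 and TNF = 1,
# so Acc = Pr * TPF + (1 - Pr) * TNF = 1 - Pr.
def accuracy_always_normal(prevalence):
    tpf, tnf = 0.0, 1.0
    return prevalence * tpf + (1 - prevalence) * tnf

for pr in (0.005, 0.20):  # screening vs. diagnostic prevalence
    print(f"Pr = {pr:.3f}: Acc = {accuracy_always_normal(pr):.3f}")
# Pr = 0.005: Acc = 0.995  (the ">99%" screening claim)
# Pr = 0.200: Acc = 0.800
```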
9. CIO operates on two different populations
[Figure: distributions of normal cases p(t|0) and abnormal cases p(t|1) along the t-axis, with decision threshold T]
10. Must consider effects on normal and abnormal populations separately
- CIO output: t
- p(t|0): probability distribution of t for the population of normals
- p(t|1): probability distribution of t for the population of abnormals
- Threshold T: everything to the right of T is called abnormal, everything to the left of T is called normal
- Area of p(t|0) to the left of T is the true negative fraction (TNF, specificity); to the right, the false positive fraction (FPF, type 1 error): TNF + FPF = 1
- Area of p(t|1) to the left of T is the false negative fraction (FNF, type 2 error); to the right, the true positive fraction (TPF, sensitivity): FNF + TPF = 1
- TNF, FPF, FNF, TPF are all prevalence independent, since each is some fraction of one of our two probability distributions
- Accuracy = Pr x TPF + (1 - Pr) x TNF (see the sketch below)
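A short sketch (Python; the score distributions are assumptions for illustration) of estimating the four fractions from CIO outputs and a threshold:

```python
import numpy as np

def fractions(t_normal, t_abnormal, T):
    """TNF, FPF, FNF, TPF for CIO outputs t, calling output > T abnormal."""
    tnf = np.mean(t_normal <= T)    # true negative fraction (specificity)
    fpf = np.mean(t_normal > T)     # false positive fraction (type 1 error)
    fnf = np.mean(t_abnormal <= T)  # false negative fraction (type 2 error)
    tpf = np.mean(t_abnormal > T)   # true positive fraction (sensitivity)
    return tnf, fpf, fnf, tpf

rng = np.random.default_rng(0)
t0 = rng.normal(0.0, 1.0, 10000)  # hypothetical p(t|0)
t1 = rng.normal(1.6, 1.0, 10000)  # hypothetical p(t|1)
tnf, fpf, fnf, tpf = fractions(t0, t1, T=0.8)
print(tnf + fpf, fnf + tpf)       # both pairs sum to 1, as on the slide
```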
11. [Figure: normal cases with TNF = 0.5 and FPF = 0.5, abnormal cases with TPF = 0.95 and FNF = 0.05, for threshold T on the t-axis]
12. Prevalence dependent measures
- Accuracy (Acc): Acc = Pr x TPF + (1 - Pr) x TNF
- Positive predictive value (PPV), the fraction of positives that are true positives: PPV = TPF x Pr / (TPF x Pr + FPF x (1 - Pr))
- Negative predictive value (NPV), the fraction of negatives that are true negatives: NPV = TNF x (1 - Pr) / (TNF x (1 - Pr) + FNF x Pr)
- Using Pr = .05 and the previous TPF, TNF, FNF, FPF values (TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5):
  - Acc = .05 x .95 + .95 x .5 = .52
  - PPV = .95 x .05 / (.95 x .05 + .5 x .95) = .10
  - NPV = .5 x .95 / (.5 x .95 + .05 x .05) = .995
13. Prevalence dependent measures
- Accuracy (Acc): Acc = Pr x TPF + (1 - Pr) x TNF
- Positive predictive value (PPV): PPV = TPF x Pr / (TPF x Pr + FPF x (1 - Pr))
- Negative predictive value (NPV): NPV = TNF x (1 - Pr) / (TNF x (1 - Pr) + FNF x Pr)
- Using the mammography screening Pr and previous TPF, TNF, FNF, FPF values (Pr = .005, TPF = .95, TNF = 0.5, FNF = .05, FPF = 0.5):
  - Acc = .005 x .95 + .995 x .5 = .50
  - PPV = .95 x .005 / (.95 x .005 + .5 x .995) = .01
  - NPV = .5 x .995 / (.5 x .995 + .05 x .005) = .9995
- (A sketch verifying both slides' numbers follows.)
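A quick check of the slide formulas (Python, parameter values taken from the slides):

```python
def acc_ppv_npv(pr, tpf=0.95, tnf=0.50):
    # Prevalence dependent measures from the slide formulas
    fpf, fnf = 1 - tnf, 1 - tpf
    acc = pr * tpf + (1 - pr) * tnf
    ppv = tpf * pr / (tpf * pr + fpf * (1 - pr))
    npv = tnf * (1 - pr) / (tnf * (1 - pr) + fnf * pr)
    return acc, ppv, npv

for pr in (0.05, 0.005):
    acc, ppv, npv = acc_ppv_npv(pr)
    print(f"Pr = {pr}: Acc = {acc:.2f}, PPV = {ppv:.2f}, NPV = {npv:.4f}")
# Pr = 0.05:  Acc = 0.52, PPV = 0.09, NPV = 0.9948
# Pr = 0.005: Acc = 0.50, PPV = 0.01, NPV = 0.9995
```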
14. Acc, PPV, NPV as functions of prevalence (screening mammography)
[Figure; curve parameters: TPF = .95, FNF = .05, TNF = 0.5, FPF = 0.5]
15. Acc and NPV as functions of prevalence (forced "normal" response CIO)
16. Prevalence independent measures
- Sensitivity = TPF
- Specificity = TNF = 1 - FPF
- Receiver Operating Characteristic (ROC): TPF as a function of FPF (sensitivity as a function of 1 - specificity)
- Area under the ROC curve (AUC): sensitivity averaged over all values of specificity
17. [Figure: entire ROC curve, TPF (sensitivity) vs. FPF (1 - specificity), with the ROC slope at an operating point; insets show the normal / class 0 and abnormal / class 1 score distributions]
18. Empirical ROC data for mammography screening in the US (Craig Beam et al. [18])
19. Maximization of performance
- Need to know utilities or costs of each type of decision outcome, but these are very hard to estimate accurately. You don't just maximize accuracy.
- Need prevalence
- For the mammography example:
  - TPF: prolongation of life, minus treatment cost
  - FPF: diagnostic work-up cost, anxiety
  - TNF: peace of mind
  - FNF: delay in treatment -> shortened life
- Hypothetical assignment of utilities for some decision threshold T:
  Utility_T = U(TPF) x TPF x Pr + U(FPF) x FPF x (1 - Pr) + U(TNF) x TNF x (1 - Pr) + U(FNF) x FNF x Pr
- U(TPF) = 100, U(FPF) = -10, U(TNF) = 4, U(FNF) = -20
- Utility_T = 100 x .95 x .05 - 10 x .50 x .95 + 4 x .50 x .95 - 20 x .05 x .05 = 1.85
- Now if we only knew how to trade off TPF versus FPF, we could optimize (?) medical performance.
20. Utility maximization (mammography example)
21. Choice of ROC operating point through utility analysis: screening mammography
22. Utility maximization (mammography example)
23. Utility maximization calculation
- u = (U_TPF x TPF + U_FNF x FNF) x Pr + (U_TNF x TNF + U_FPF x FPF) x (1 - Pr)
    = (U_TPF x TPF + U_FNF x (1 - TPF)) x Pr + (U_TNF x (1 - FPF) + U_FPF x FPF) x (1 - Pr)
- du/dFPF = (U_FPF - U_TNF) x (1 - Pr) + (U_TPF - U_FNF) x Pr x dTPF/dFPF
- Setting du/dFPF = 0 gives dTPF/dFPF = (U_TNF - U_FPF)(1 - Pr) / [(U_TPF - U_FNF) Pr]
- Pr = .005: dTPF/dFPF = 23.
- Pr = .05: dTPF/dFPF = 2.2
- (U_TPF = 100, U_FNF = -20, U_TNF = 4, U_FPF = -10, as on slide 19; a numerical check follows)
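A one-line numerical check of the optimal-slope formula (Python, utilities as on slide 19):

```python
def optimal_roc_slope(u_tpf, u_fnf, u_tnf, u_fpf, pr):
    # dTPF/dFPF at the utility-maximizing operating point
    return (u_tnf - u_fpf) * (1 - pr) / ((u_tpf - u_fnf) * pr)

for pr in (0.005, 0.05):
    print(pr, round(optimal_roc_slope(100, -20, 4, -10, pr), 1))
# 0.005 -> 23.2 ; 0.05 -> 2.2
```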
24. [Figure: entire ROC curve with ROC slope; TPF (sensitivity) vs. FPF (1 - specificity); normal and abnormal case distributions]
25. Estimators
- TPF, FPF, TNF, FNF, accuracy, the ROC curve, and AUC are all fractions or probabilities.
- Normally we have a finite sample of subjects on which to test our CIO. From this finite sample we try to estimate the above fractions.
- These estimates will vary depending upon the sample selected (statistical variation).
- Estimates can be nonparametric or parametric.
26. Estimators
- TPF (population value) and its estimate from a finite sample [formulas in original figure]
- Number in sample << Number in population (at least in theory)
27. II. Receiver Operating Characteristic (ROC)
- Receiver Operating Characteristic
- Binary classification
- Test result is compared to a threshold
28. [Figure: distribution of CIO output for all subjects, with threshold]
29. [Figure: distributions of output for normal / class 0 subjects, p(t|0), and abnormal / class 1 subjects, p(t|1), with threshold on the t-axis of CIO output]
30. [Figure: the two class distributions with the threshold]
31. [Figure: areas marked; specificity = true negative fraction (TNF) under p(t|0), sensitivity = true positive fraction (TPF) under p(t|1)]
32. [Figure: decisions D0 / D1 vs. truth H0 / H1; TNF = 0.50, TPF = 0.95 at the marked threshold]
33. [Figure: complementary areas; 1 - specificity = false positive fraction (FPF), 1 - sensitivity = false negative fraction (FNF)]
34. [Figure: TNF = 0.50, FPF = 0.50, FNF = 0.05, TPF = 0.95 at the marked threshold]
35. [Figure: threshold set for high sensitivity]
36. [Figure: threshold set so that sensitivity is comparable to specificity]
37. [Figure: threshold set for high specificity]
(Each panel shows the class 0 and class 1 score distributions and the resulting operating point on the ROC curve, TPF vs. FPF.)
38. Which CIO is best?
[Figure: three operating points on ROC axes]

        TPF    FPF
CIO 1   0.50   0.07
CIO 2   0.78   0.22
CIO 3   0.93   0.50
39. Do not compare rates of one class, e.g. TPF, at different rates of the other class (FPF).
[Figure: the same three operating points]

        TPF    FPF
CIO 1   0.50   0.07
CIO 2   0.78   0.23
CIO 3   0.93   0.50
40. [Figure: entire ROC curve, TPF (sensitivity) vs. FPF (1 - specificity)]
41. [Figure: ROC curves with AUC = 0.98, 0.85, and 0.5 (the chance line); larger AUC means greater discriminability, i.e., better CIO performance]
42. AUC (Area under the ROC Curve)
- AUC is a separation probability
- AUC = probability that
  - CIO output for an abnormal subject > CIO output for a normal subject
  - the CIO correctly tells which of 2 subjects is normal
- Estimating AUC from a finite sample (see the sketch below):
  - Select an abnormal subject's score x_i
  - Select a normal subject's score y_k
  - Is x_i > y_k ?
  - Average over all pairs (x_i, y_k)
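This pairwise recipe is the Wilcoxon-Mann-Whitney estimator; a minimal sketch (Python, counting ties as one half):

```python
import numpy as np

def auc_wmw(x_abnormal, y_normal):
    """Fraction of (abnormal, normal) score pairs ranked correctly."""
    x = np.asarray(x_abnormal)[:, None]
    y = np.asarray(y_normal)[None, :]
    return np.mean((x > y) + 0.5 * (x == y))

rng = np.random.default_rng(1)
print(auc_wmw(rng.normal(1.4, 1, 200), rng.normal(0, 1, 200)))
# near the Gaussian value Phi(1.4 / sqrt(2)) ~ 0.84
```

For large samples the pair average is usually computed via rank sums, which gives the identical value with less work.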
92. ROC as a Q-Q plot
- ROC plots in probability space
- ROC plots in quantile space
93. Linear Likelihood Ratio Observer for Gaussian Data
- When the input features of the data are distributed as Gaussians with equal variance:
  - the optimal discriminant, the log-likelihood ratio, is a linear function,
  - that linear discriminant is also distributed as a Gaussian, and
  - the signal-to-noise ratio (SNR) is easily calculated from the input data distributions and is a monotonic function of AUC.
- Can serve as a benchmark against which to measure CIO performance
94. Linear Ideal Observer
- p(x|0): probability distribution of the data x for the population of normals; p(x|1): probability distribution of x for the population of abnormals
- Components x_i independent, Gaussian distributed with means 0 and m_i respectively, and identical variances s_i^2
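The slide's formulas were shown graphically; a reconstruction of the standard result for this model:

```latex
\lambda(\mathbf{x}) = \ln\frac{p(\mathbf{x}\mid 1)}{p(\mathbf{x}\mid 0)}
  = \sum_i \frac{m_i x_i}{s_i^2} - \frac{1}{2}\sum_i \frac{m_i^2}{s_i^2},
\qquad
\mathrm{SNR}^2 = \sum_i \frac{m_i^2}{s_i^2},
\qquad
\mathrm{AUC} = \Phi\!\left(\frac{\mathrm{SNR}}{\sqrt{2}}\right)
```

So the log-likelihood ratio is linear in x, Gaussian under either class, and AUC is monotonic in SNR, as stated on slide 93.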
95. Maximum Likelihood CIO [derivation in original figure]
96. Linear Ideal Observer ROC [Figure: ROC curves; "D" label from original figure]
97. Likelihood Ratio = Slope of ROC
- The likelihood ratio of the decision variable t is the slope of the ROC curve
- ROC: TPF as a function of FPF, with TPF = 1 - P(T|1) and FPF = 1 - P(T|0), where P is the cumulative distribution of t
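In the same notation, the stated relation follows by differentiating along the curve (a reconstruction; the slide's derivation was graphical):

```latex
\frac{d\,\mathrm{TPF}}{d\,\mathrm{FPF}}
 = \frac{d\,\mathrm{TPF}/dT}{d\,\mathrm{FPF}/dT}
 = \frac{-\,p(T\mid 1)}{-\,p(T\mid 0)}
 = \frac{p(T\mid 1)}{p(T\mid 0)} = \mathrm{LR}(T)
```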
98. III. Error analysis for CI observers
- Sources of error
- Parametric methods
- Nonparametric methods
- Standard deviations and confidence intervals
- Hazards
99. Sources of error
- Test error: limited number of samples in the test set
- Training error: limited number of samples in the training set
- Incorrect parameters
- Incorrect feature selection, etc.
- Human observer error (when applicable)
  - Intraobserver
  - Interobserver
100. Parametric methods
- Use a known underlying probability distribution; may be exact for simulated data
- Assume a Gaussian distribution
- Other parameterizations: e.g., binomial, or ROC linearity in z-transformed coordinates (F^-1(TPF) versus F^-1(FPF), where F is the cumulative Gaussian distribution)
101. Binomial Estimates of Variance
- For single population measures f = TPF, FPF, FNF, TNF:
  Var(f) = f (1 - f) / N (numerical example below)
- For AUC: a back-of-envelope calculation of Var(AUC) [formula in original figure]
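A two-line example of the binomial error bar (Python; the sample size is an assumption for illustration):

```python
def binomial_se(f, n):
    # standard error of a fraction f estimated from n cases
    return (f * (1 - f) / n) ** 0.5

# TPF = 0.95 estimated from 100 abnormal test cases:
print(binomial_se(0.95, 100))  # ~0.022, i.e., TPF = 0.95 +/- 0.02
```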
102. Data rich case
- Repeat the experiment M times
- Estimate distribution parameters, e.g., for a Gaussian distributed performance measure f, G(m, s^2)
- Find error bars or confidence limits
103. Example: AUC
- Mean AUC
- Distribution variance
- Variance of the mean
- Error bars, confidence interval
[formulas in original figure]
104. [Figure: probability distribution for the calculation of AUC from 40 values]
105. [Figure: probability distribution for the calculation of SNR from 40 values]
106. But what's a poor boy to do?
- Reuse the data you have: resubstitution, resampling
- Two common approaches:
  - Jackknife
  - Bootstrap
107-114. Resampling [animated illustration; figures only]
115. Jackknife
- Have N observations
- Leave out m of these; we then have M = C(N, m) subsets of the N observations from which to calculate the mean and variance
- N = 10, m = 5: M = 252; N = 10, m = 1: M = 10
116. Round-robin jackknife bias derivation and variance [derivation in original figure; a leave-one-out sketch follows]
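A leave-one-out (m = 1) jackknife sketch in Python, using the usual bias and variance formulas; the statistic here is an arbitrary, deliberately biased example:

```python
import numpy as np

def jackknife(stat, data):
    """Leave-one-out jackknife bias correction and standard error."""
    n = len(data)
    theta = stat(data)
    loo = np.array([stat(np.delete(data, i)) for i in range(n)])
    bias = (n - 1) * (loo.mean() - theta)                # jackknife bias estimate
    var = (n - 1) / n * np.sum((loo - loo.mean()) ** 2)  # jackknife variance
    return theta - bias, np.sqrt(var)

rng = np.random.default_rng(2)
x = rng.exponential(1.0, 30)
print(jackknife(lambda d: d.mean() ** 2, x))  # (bias-corrected value, SE)
```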
117. Fukunaga-Hayes bias derivation
- Divide both the normal and abnormal classes in half, yielding 4 possible pairings
118. Jackknife bias correction example: training error
- AUC estimates as a function of the number of cases N. The solid line is the multilayer perceptron result; open circles: jackknife; closed circles: Fukunaga-Hayes. The horizontal dotted line is the asymptotic ideal result.
119. IV. Bootstrap methods
- Theoretical foundation
- Practical use
120. Bootstrap variance
- What you have is what you've got: the data is your best estimate of the probability distribution
- Sample with replacement, M times
- Adequate number of samples: M > N
- Simple bootstrap (a sketch follows)
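A minimal simple-bootstrap sketch (Python). For a two-class measure such as AUC one would resample the normal and abnormal cases separately; here a generic statistic stands in:

```python
import numpy as np

def bootstrap_se(stat, data, M=2000, seed=0):
    """Resample with replacement M times; SD of the statistic over resamples."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [stat(data[rng.integers(0, n, n)]) for _ in range(M)]
    return np.std(reps, ddof=1)

rng = np.random.default_rng(3)
scores = rng.normal(0.0, 1.0, 50)
print(bootstrap_se(np.mean, scores))  # ~ 1/sqrt(50) ~ 0.14
```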
121. Bootstrap and jackknife error estimates
- Standard deviation of AUC. Solid line: simulation results; open circles: jackknife estimate; closed circles: bootstrap estimate. Note how much larger the jackknife error bars are than those provided by the bootstrap method.
123. .632 bootstrap for classifier performance evaluation
- Have N cases; draw M samples of size N with replacement for training (have on average .632 x N unique cases in each sample of size N)
- Test on the unused (~.368 x N) cases for each sample
- Expected fraction of cases left out of a bootstrap sample, (1 - 1/N)^N -> 1/e:

  N          5       10      20      100     infinity
  left out   .328    .349    .358    .366    .368
             1-.672  1-.651  1-.642  1-.634  1-.632
124. .632 bootstrap for classifier performance evaluation (2)
- Have N cases; draw M samples of size N with replacement for training (have on average .632 x N unique cases in each sample of size N)
- Test on the unused (~.368 x N) cases for each sample
- Get the bootstrap average result AUC_B
- Get the resubstitution result (testing on the training set) AUC_R
- AUC_.632 = .632 x AUC_B + .368 x AUC_R
- As the variance, take the AUC_B variance (see the sketch below)
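A sketch of the full recipe (Python). The classifier is a simple nearest-mean discriminant chosen only to make the example self-contained, and the data are simulated; both are assumptions, not the slides' setup:

```python
import numpy as np

def auc(x1, x0):
    # pairwise (Wilcoxon-Mann-Whitney) AUC of class-1 vs class-0 scores
    return np.mean((x1[:, None] > x0[None, :]) + 0.5 * (x1[:, None] == x0[None, :]))

def train(Xt, yt):
    # nearest-mean discriminant direction; score = x . w
    return Xt[yt == 1].mean(0) - Xt[yt == 0].mean(0)

rng = np.random.default_rng(4)
N, d = 40, 5
X = np.vstack([rng.normal(0.0, 1, (N // 2, d)), rng.normal(0.5, 1, (N // 2, d))])
y = np.repeat([0, 1], N // 2)

auc_b = []
for _ in range(200):                           # M bootstrap samples
    idx = rng.integers(0, N, N)                # train on a size-N resample
    oob = np.setdiff1d(np.arange(N), idx)      # ~0.368 N unused cases
    if len(set(y[idx])) < 2 or len(set(y[oob])) < 2:
        continue                               # need both classes present
    s = X[oob] @ train(X[idx], y[idx])
    auc_b.append(auc(s[y[oob] == 1], s[y[oob] == 0]))

s_all = X @ train(X, y)
auc_r = auc(s_all[y == 1], s_all[y == 0])      # resubstitution (optimistic)
print(0.632 * np.mean(auc_b) + 0.368 * auc_r,  # AUC_.632
      np.var(auc_b, ddof=1))                   # variance taken from AUC_B
```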
125. Traps for the unwary: overparameterization
- Cover's theorem
- For N < 2(d + 1), a hyperplane exists that will perfectly separate almost all possible dichotomies of N points in d-dimensional space (Cover's counting function is given below)
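Cover's counting function makes the statement quantitative (a standard result, restated here since the slide's own formula was in a figure):

```latex
f_d(N) \;=\; 2^{\,1-N} \sum_{i=0}^{d} \binom{N-1}{i},
\qquad
f_d(N) = 1 \ \text{for}\ N \le d+1,
\qquad
f_d\bigl(2(d+1)\bigr) = \tfrac{1}{2}
```

Here f_d(N) is the fraction of all dichotomies of N points in general position in d dimensions that are linearly separable; this is the f_d(N) plotted on the next slide.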
126. [Figure: f_d(N) for d = 1, 5, 25, 125, and the limit of large d. The abscissa x = N / (2(d + 1)) is scaled so that the values f_d(N) = 0.5 lie superposed at x = 1 for all d.]
127. Poor data hygiene
- Reporting results on training data / testing on training data
- Carrying out any part of the training process on data later used for testing
  - e.g., using all of the data to select a manageable feature set from among a large number of features, and then dividing the data into training and test sets
128. Overestimate of AUC from poor data hygiene
- Distributions of AUC values in 900 simulation experiments (left) and the mean ROC curves (right) for four validation methods:
  - Method 1: feature selection and classifier training on one dataset; classifier testing on another, independent dataset
  - Method 2: given perfect feature selection, classifier training on one dataset and classifier testing on another, independent dataset
  - Method 3: feature selection using the entire dataset; the dataset is then partitioned in two, one part for training and one for testing the classifier
  - Method 4: feature selection, classifier training, and testing all on the same dataset
129. Correct feature selection is hard to do
- Insight into feature selection performance in Method 1. The left plot shows the number of experiments (out of 900) in which each feature is selected; by design of the simulation population, the first 30 features are useful for classification and the remaining are useless. The right plot shows the distribution of the number of useful features (out of 30) selected in the 900 experiments.
130. Conclusions
- Accuracy and other prevalence dependent measures are inadequate
- ROC/AUC provide good measures of performance
- Uncertainty must be quantified
- Bootstrap and jackknife techniques are useful methods
131. V. References
- [1] K. Fukunaga, Statistical Pattern Recognition, 2nd ed. Boston: Harcourt Brace Jovanovich, 1990.
- [2] K. Fukunaga and R. R. Hayes, "Effects of sample size in classifier design," IEEE Trans. Pattern Anal. Machine Intell., vol. PAMI-11, pp. 873–885, 1989.
- [3] D. M. Green and J. A. Swets, Signal Detection Theory and Psychophysics. New York: John Wiley & Sons, 1966.
- [4] J. P. Egan, Signal Detection Theory and ROC Analysis. New York: Academic Press, 1975.
- [5] C. E. Metz, "Basic principles of ROC analysis," Seminars in Nuclear Medicine, vol. VIII, no. 4, 1978.
- [6] H. H. Barrett and K. J. Myers, Foundations of Image Science. Hoboken: John Wiley & Sons, 2004, ch. 13, Statistical Decision Theory.
- [7] B. Efron and R. J. Tibshirani, An Introduction to the Bootstrap. Boca Raton: Chapman & Hall/CRC, 1993.
- [8] B. Efron, The Jackknife, the Bootstrap and Other Resampling Plans. Philadelphia: Society for Industrial and Applied Mathematics, 1982.
- [9] A. C. Davison and D. V. Hinkley, Bootstrap Methods and their Application. Cambridge: Cambridge University Press, 1997.
- [10] B. Efron, "Estimating the error rate of a prediction rule: Some improvements on cross-validation," Journal of the American Statistical Association, vol. 78, pp. 316–331, 1983.
- [11] B. Efron and R. J. Tibshirani, "Improvements on cross-validation: The .632+ bootstrap method," Journal of the American Statistical Association, vol. 92, no. 438, pp. 548–560, 1997.
- [12] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. New York: Springer, 2009.
- [13] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer, 2006.
- [14] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Oxford University Press, 1995.
- [15] R. F. Wagner, D. G. Brown, J.-P. Guedon, K. J. Myers, and K. A. Wear, "Multivariate Gaussian pattern classification: effects of finite sample size and the addition of correlated or noisy features on summary measures of goodness," in Information Processing in Medical Imaging, Proceedings of IPMI '93, 1993, pp. 507–524.
- [16] R. F. Wagner et al., "On combining a few diagnostic tests or features," in Proceedings of the SPIE, Image Processing, vol. 2167, 1994.
- [17] D. G. Brown, A. C. Schneider, M. P. Anderson, and R. F. Wagner, "Effects of finite sample size and correlated noisy input features on neural network pattern classification," in Proceedings of the SPIE, Image Processing, vol. 2167, 1994.
- [18] C. A. Beam, "Analysis of clustered data in receiver operating characteristic studies," Statistical Methods in Medical Research, vol. 7, pp. 324–336, 1998.
- [19] W. A. Yousef et al., "Assessing classifiers from two independent data sets using ROC analysis: A nonparametric approach," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 11, pp. 1809–1817, 2006.
132. Appendix I
133. Searching suitcases
144. Previous class results