Comprehensive Introduction to the Evaluation of Neural Networks and other Computational Intelligence Decision Functions: Receiver Operating Characteristic, Jackknife, Bootstrap and other Statistical Methodologies

Transcript and Presenter's Notes



1
Comprehensive Introduction to the Evaluation of Neural Networks
and other Computational Intelligence Decision Functions:
Receiver Operating Characteristic, Jackknife, Bootstrap
and other Statistical Methodologies
  • David G. Brown and Frank Samuelson
  • Center for Devices and Radiological Health, FDA
  • 6 July 2014

2
Course Outline
  • Performance measures for Computational
    Intelligence (CI) observers
  • Accuracy
  • Prevalence dependent measures
  • Prevalence independent measures
  • Maximization of performance: utility
    analysis / cost functions
  • Receiver Operating Characteristic (ROC) analysis
  • Sensitivity and specificity
  • Construction of the ROC curve
  • Area under the ROC curve (AUC)
  • Error analysis for CI observers
  • Sources of error
  • Parametric methods
  • Nonparametric methods
  • Standard deviations and confidence intervals
  • Bootstrap methods
  • Theoretical foundation
  • Practical use
  • References

3
What's the problem?
  • Emphasis on algorithm innovation to the exclusion of
    performance assessment
  • Use of subjective measures of performance (the
    "beauty contest")
  • Use of accuracy as a measure of success
  • Lack of error bars: "My CIO is .01 better than
    yours" (+/- ?)
  • Flawed methodology: training and testing on the same
    data
  • Lack of appreciation for the many different
    sources of error that can be taken into account

4
Original image: Lena. Courtesy of the Signal and
Image Processing Institute at the University of
Southern California.
5
CI-improved image: Baboon. Courtesy of the Signal
and Image Processing Institute at the University
of Southern California.
6
Panel of experts. Image: Garosha / Dreamstime.
7
I. Performance measures for computational
intelligence (CI) observers
  • Task-based (binary) discrimination task
  • Two populations involved: normal and abnormal
  • Accuracy: intuitive but incomplete
  • Different consequences for success or failure in
    each population
  • Some measures depend on the prevalence (Pr); some
    do not
  • Prevalence dependent: accuracy, positive predictive
    value, negative predictive value
  • Prevalence independent: sensitivity, specificity, ROC, AUC
  • True optimization of performance requires
    knowledge of cost functions or utilities for
    successes and failures in both populations

8
How to make a CIO with >99% accuracy
  • Medical problem: screening mammography
    (screening means testing in an asymptomatic
    population)
  • Prevalence of breast cancer in the screening
    population: Pr ~ 0.5% (0.005)
  • My CIO always says "normal"
  • Accuracy (Acc) is 99.5% (accuracy of accepted
    present-day systems ~ 75%)
  • Accuracy in a diagnostic setting (Pr ~ 20%) is 80%
    -- Acc = 1 - Pr (for my CIO)

9
CIO operates on two different populations
Normal cases p(t0)
Abnormal cases p(t1)
Threshold t T
t-axis
10
Must consider effects on normal and abnormal
populations separately
  • CIO output: t
  • p(t|0): probability distribution of t for the
    population of normals
  • p(t|1): probability distribution of t for the
    population of abnormals
  • Threshold T: everything to the right of T is called
    abnormal, and everything to the left of T is called
    normal
  • Area of p(t|0) to the left of T is the true negative
    fraction (TNF, specificity); to the right, the
    false positive fraction (FPF, type I error)
  • TNF + FPF = 1
  • Area of p(t|1) to the left of T is the false negative
    fraction (FNF, type II error); to the right, the
    true positive fraction (TPF, sensitivity)
  • FNF + TPF = 1
  • TNF, FPF, FNF, TPF are all prevalence
    independent, since each is some fraction of one
    of our two probability distributions
  • Accuracy = Pr x TPF + (1-Pr) x TNF (a numerical
    sketch follows)
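A minimal Python sketch of these definitions; the score arrays, threshold, and prevalence below are hypothetical illustrations, not values from the presentation:

    import numpy as np

    # Hypothetical CIO outputs (t values) for the two populations.
    t_normal = np.array([-0.2, 0.1, 0.4, 0.9, 1.3])   # class 0
    t_abnormal = np.array([0.8, 1.1, 1.7, 2.2, 2.5])  # class 1
    T = 1.0                                            # decision threshold

    # Everything to the right of T is called "abnormal".
    TNF = np.mean(t_normal < T)      # specificity
    FPF = np.mean(t_normal >= T)     # type I error;  TNF + FPF = 1
    FNF = np.mean(t_abnormal < T)    # type II error; FNF + TPF = 1
    TPF = np.mean(t_abnormal >= T)   # sensitivity

    Pr = 0.05                        # assumed prevalence of the abnormal class
    accuracy = Pr * TPF + (1 - Pr) * TNF
    print(TNF, FPF, FNF, TPF, accuracy)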

11
[Figure: normal-case distribution split at threshold T into TNF (0.50)
and FPF (0.50); abnormal-case distribution split into FNF (0.05) and
TPF (0.95), both along the t-axis]
12
Prevalence dependent measures
  • Accuracy (Acc)
  • Acc = Pr x TPF + (1-Pr) x TNF
  • Positive predictive value (PPV): fraction of
    positives that are true positives
  • PPV = TPF x Pr / (TPF x Pr + FPF x (1-Pr))
  • Negative predictive value (NPV): fraction of
    negatives that are true negatives
  • NPV = TNF x (1-Pr) / (TNF x (1-Pr) + FNF x Pr)
  • Using the previous TPF, TNF, FNF, FPF values with
    Pr = 0.05: TPF = 0.95, TNF = 0.5, FNF = 0.05, FPF = 0.5
  • Acc = 0.05 x 0.95 + 0.95 x 0.5 ~ 0.52
  • PPV = 0.95 x 0.05 / (0.95 x 0.05 + 0.5 x 0.95) ~ 0.09
  • NPV = 0.5 x 0.95 / (0.5 x 0.95 + 0.05 x 0.05) ~ 0.995

13
Prevalence dependent measures
  • Accuracy (Acc)
  • Acc = Pr x TPF + (1-Pr) x TNF
  • Positive predictive value (PPV): fraction of
    positives that are true positives
  • PPV = TPF x Pr / (TPF x Pr + FPF x (1-Pr))
  • Negative predictive value (NPV): fraction of
    negatives that are true negatives
  • NPV = TNF x (1-Pr) / (TNF x (1-Pr) + FNF x Pr)
  • Using the mammography screening Pr and the previous
    TPF, TNF, FNF, FPF values: Pr = 0.005, TPF = 0.95,
    TNF = 0.5, FNF = 0.05, FPF = 0.5
  • Acc = 0.005 x 0.95 + 0.995 x 0.5 ~ 0.50
  • PPV = 0.95 x 0.005 / (0.95 x 0.005 + 0.5 x 0.995) ~ 0.01
  • NPV = 0.5 x 0.995 / (0.5 x 0.995 + 0.05 x 0.005) ~ 0.9995
    (see the sketch below)
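A short sketch of the prevalence-dependent measures as functions of Pr, using the operating point from the slides (TPF = 0.95, TNF = 0.5); the function name is ours:

    def prevalence_dependent_measures(Pr, TPF=0.95, TNF=0.50):
        """Acc, PPV, NPV at a fixed operating point, as functions of prevalence."""
        FPF, FNF = 1.0 - TNF, 1.0 - TPF
        acc = Pr * TPF + (1 - Pr) * TNF
        ppv = TPF * Pr / (TPF * Pr + FPF * (1 - Pr))
        npv = TNF * (1 - Pr) / (TNF * (1 - Pr) + FNF * Pr)
        return acc, ppv, npv

    for Pr in (0.05, 0.005):
        print(Pr, prevalence_dependent_measures(Pr))
    # Pr = 0.05  -> Acc ~ 0.52, PPV ~ 0.09, NPV ~ 0.995
    # Pr = 0.005 -> Acc ~ 0.50, PPV ~ 0.01, NPV ~ 0.9995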

14
Acc, PPV, NPV as functions of prevalence
(screening mammography)
  • TPF = 0.95
  • FNF = 0.05
  • TNF = 0.5
  • FPF = 0.5

15
Acc and NPV as functions of prevalence
(forced "normal" response CIO)
16
Prevalence independent measures
  • Sensitivity = TPF
  • Specificity = TNF = 1 - FPF
  • Receiver Operating Characteristic (ROC): TPF as
    a function of FPF (sensitivity as a function of
    1 - specificity)
  • Area under the ROC curve (AUC):
    sensitivity averaged over all values of
    specificity

17
[Figure: normal / class 0 and abnormal / class 1 distributions, with the
entire ROC curve and its slope plotted as TPF (sensitivity) versus
FPF (1 - specificity)]
18
Empirical ROC data for mammography screening in
the US (Craig Beam et al. [18])
19
Maximization of performance
  • Need to know utilities or costs of each type of
    decision outcome, but these are very hard to
    estimate accurately. You don't just maximize
    accuracy.
  • Need prevalence
  • For the mammography example:
  • TPF: prolongation of life minus treatment cost
  • FPF: diagnostic work-up cost, anxiety
  • TNF: peace of mind
  • FNF: delay in treatment -> shortened life
  • Hypothetical assignment of utilities for some
    decision threshold T:
  • Utility(T) = U(TPF) x TPF x Pr + U(FPF) x FPF x
    (1-Pr) + U(TNF) x TNF x (1-Pr)
    + U(FNF) x FNF x Pr
  • U(TPF) = 100, U(FPF) = -10, U(TNF) = 4, U(FNF) = -20
  • Utility(T) = 100 x .95 x .05 - 10 x .50 x .95
    + 4 x .50 x .95 - 20 x .05 x .05 = 1.85
    (a sketch of this calculation follows)
  • Now if we only knew how to trade off TPF versus
    FPF, we could optimize (?) medical performance.
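A sketch of the utility calculation above, using the utility values given on the slide; the function name is ours:

    def expected_utility(Pr, TPF, FPF,
                         U_TPF=100.0, U_FPF=-10.0, U_TNF=4.0, U_FNF=-20.0):
        """Expected utility of operating at (FPF, TPF) with prevalence Pr."""
        TNF, FNF = 1.0 - FPF, 1.0 - TPF
        return (U_TPF * TPF * Pr + U_FNF * FNF * Pr
                + U_TNF * TNF * (1 - Pr) + U_FPF * FPF * (1 - Pr))

    print(expected_utility(Pr=0.05, TPF=0.95, FPF=0.50))  # ~1.85, as on the slide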

20
Utility maximization (mammography example)
21
Choice of ROC operating point through utility
analysis: screening mammography
22
Utility maximization (mammography example)
23
Utility maximization calculation
  • u = (U_TPF x TPF + U_FNF x FNF) x Pr
    + (U_TNF x TNF + U_FPF x FPF) x (1-Pr)
  •   = (U_TPF x TPF + U_FNF x (1-TPF)) x Pr
    + (U_TNF x (1-FPF) + U_FPF x FPF) x (1-Pr)
  • du/dFPF = (U_FPF - U_TNF) x (1-Pr)
    + (U_TPF - U_FNF) x Pr x dTPF/dFPF = 0
    => dTPF/dFPF = (U_TNF - U_FPF) x (1-Pr) / ((U_TPF - U_FNF) x Pr)
  • Pr = 0.005 => dTPF/dFPF ~ 23
  • Pr = 0.05 => dTPF/dFPF ~ 2.2
  • (U_TPF = 100, U_FNF = -20, U_TNF = 4, U_FPF = -10;
    see the sketch below)
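A sketch of the utility-optimal ROC slope given by setting du/dFPF = 0, with the utility values from slide 19; the function name is ours:

    def optimal_roc_slope(Pr, U_TPF=100.0, U_FNF=-20.0, U_TNF=4.0, U_FPF=-10.0):
        """Slope dTPF/dFPF of the ROC curve at the utility-maximizing threshold."""
        return (U_TNF - U_FPF) * (1 - Pr) / ((U_TPF - U_FNF) * Pr)

    print(optimal_roc_slope(0.005))  # ~23
    print(optimal_roc_slope(0.05))   # ~2.2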

24
[Figure: normal and abnormal case distributions with the entire ROC curve
and the ROC slope at the chosen operating point, plotted as TPF
(sensitivity) versus FPF (1 - specificity)]
25
Estimators
  • TPF, FPF, TNF, FNF, Accuracy, the ROC curve, and
    AUC are all fractions or probabilities.
  • Normally we have a finite sample of subjects on
    which to test our CIO. From this finite sample
    we try to estimate the above fractions
  • These estimates will vary depending upon the
    sample selected (statistical variation).
  • Estimates can be nonparametric or parametric

26
Estimators
  • TPF (population value) and its finite-sample estimate
    [equations not transcribed]
  • Number in sample << number in population (at
    least in theory)

27
II. Receiver Operating Characteristic (ROC)
  • Receiver Operating Characteristic
  • Binary Classification
  • Test result is compared to a threshold

28
[Figure: distribution of CIO output for all subjects, with a decision
threshold on the computational-intelligence-observer output axis]
29
[Figure: distribution of output for normal / class 0 subjects, p(t|0), and
for abnormal / class 1 subjects, p(t|1), with the threshold on the t-axis
(CIO output)]
30
[Figure: distribution of output for normal / class 0 subjects, p(t|0), with
the threshold; abnormal / class 1 subjects shown alongside]
31
[Figure: p(t|0) split at the threshold into specificity (true negative
fraction, TNF); abnormal / class 1 distribution split into sensitivity
(true positive fraction, TPF)]
32
[Figure: decision matrix (truth H0/H1 versus decision D0/D1) overlaid on the
normal / class 0 and abnormal / class 1 distributions; specificity
TNF = 0.50, sensitivity TPF = 0.95]
33
[Figure: the same distributions with 1 - specificity (false positive
fraction, FPF) and 1 - sensitivity (false negative fraction, FNF) shaded]
34
[Figure: decision matrix (truth H0/H1 versus decision D0/D1) with
TNF = 0.50, FPF = 0.50, FNF = 0.05, TPF = 0.95]
35
[Figure: threshold setting giving high sensitivity, shown as a point on the
plot of TPF (sensitivity) versus FPF (1 - specificity)]
36
[Figure: threshold setting where sensitivity ~ specificity, shown on the
plot of TPF versus FPF]
37
[Figure: threshold setting giving high specificity, shown on the plot of
TPF versus FPF]
38
Which CIO is best?
[Figure: three operating points (CIO 1, 2, 3) plotted on TPF (sensitivity)
versus FPF (1 - specificity) axes]

            TPF     FPF
    CIO 1   0.50    0.07
    CIO 2   0.78    0.22
    CIO 3   0.93    0.50
39
Do not compare rates of one class, e.g. TPF, at
different rates of the other class (FPF).
[Figure: the same three operating points on TPF versus FPF axes]

            TPF     FPF
    CIO 1   0.50    0.07
    CIO 2   0.78    0.23
    CIO 3   0.93    0.50
40
[Figure: sweeping the threshold across the normal / class 0 and abnormal /
class 1 distributions traces out the entire ROC curve, TPF versus FPF]
41
[Figure: ROC curves with AUC = 0.98, AUC = 0.85, and the chance line
(AUC = 0.5); AUC measures discriminability, i.e., CIO performance, on
TPF versus FPF axes]
42
AUC (Area under the ROC Curve)
  • AUC is a separation probability
  • AUC = probability that
  • CIO output for an abnormal > CIO output for a normal
  • i.e., the CIO correctly tells which of 2 subjects is normal
  • Estimating AUC from a finite sample:
  • Select an abnormal subject's score x_i
  • Select a normal subject's score y_k
  • Is x_i > y_k?
  • Average over all pairs (x_i, y_k); a sketch follows
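A nonparametric sketch of this pairwise (Mann-Whitney) AUC estimate; the scores below are hypothetical:

    import numpy as np

    def auc_mann_whitney(abnormal_scores, normal_scores):
        """Fraction of (abnormal, normal) pairs with x_i > y_k, counting ties as 1/2."""
        x = np.asarray(abnormal_scores, float)[:, None]   # x_i
        y = np.asarray(normal_scores, float)[None, :]     # y_k
        return np.mean((x > y) + 0.5 * (x == y))

    print(auc_mann_whitney([0.8, 1.1, 1.7, 2.2], [-0.2, 0.1, 0.9, 1.3]))  # ~0.81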

43-91
(No transcript)
92
ROC as a Q-Q plot
  • ROC plots in probability space
  • ROC plots in quantile space

93
Linear Likelihood Ratio Observer for Gaussian Data
  • When the input features of the data are
    distributed as Gaussians with equal variance,
  • the optimal discriminant, the log-likelihood
    ratio, is a linear function,
  • that linear discriminant is also distributed as a
    Gaussian, and
  • the signal-to-noise ratio (SNR) is easily
    calculated from the input data distributions and
    is a monotonic function of AUC.
  • This can serve as a benchmark against which to
    measure CIO performance.

94
Linear Ideal Observer
  • p(x|0): probability distribution of the data x for the
    population of normals; p(x|1): probability
    distribution of x for the population of abnormals,
    with components x_i independent and Gaussian
    distributed with means 0 and m_i respectively and
    identical variances s_i^2 (see the sketch below)
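A sketch of this benchmark, assuming independent Gaussian features with class means 0 and m_i and common variances s_i^2; the relation AUC = Phi(SNR / sqrt(2)) is the standard result for a Gaussian-distributed linear discriminant, and the function name is ours:

    import numpy as np
    from scipy.stats import norm

    def linear_ideal_observer(means, variances):
        """SNR (detectability) and AUC of the log-likelihood-ratio observer for
        independent Gaussian features: class 0 ~ N(0, s_i^2), class 1 ~ N(m_i, s_i^2)."""
        m = np.asarray(means, dtype=float)
        s2 = np.asarray(variances, dtype=float)
        snr = np.sqrt(np.sum(m ** 2 / s2))      # SNR of the linear discriminant
        auc = norm.cdf(snr / np.sqrt(2.0))      # monotonic function of SNR
        return snr, auc

    print(linear_ideal_observer([1.0, 0.5], [1.0, 1.0]))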

95
Maximum Likelihood CIO
96
Linear Ideal Observer ROC
[Figure: ROC curves for the linear ideal observer, labeled by detectability D]
97
Likelihood Ratio = Slope of ROC
  • The likelihood ratio of the decision variable t
    is the slope of the ROC curve
  • ROC: TPF as a function of FPF, with TPF = 1 - P(T|1)
    and FPF = 1 - P(T|0), where P denotes the cumulative
    distribution; hence dTPF/dFPF = p(T|1) / p(T|0)

98
III. Error analysis for CI observers
  • Sources of error
  • Parametric methods
  • Nonparametric methods
  • Standard deviations and confidence intervals
  • Hazards

99
Sources of error
  • Test error: limited number of samples in the test
    set
  • Training error: limited number of samples in the
    training set
  • Incorrect parameters
  • Incorrect feature selection, etc.
  • Human observer error (when applicable)
  • Intraobserver
  • Interobserver

100
Parametric methods
  • Use a known underlying probability distribution
    (may be exact for simulated data)
  • Assume a Gaussian distribution
  • Other parameterizations, e.g., binomial, or ROC
    linearity in z-transformed coordinates
  • (F^(-1)(TPF) versus F^(-1)(FPF), where F is the
    cumulative Gaussian distribution)

101
Binomial Estimates of Variance
  • For single-population measures f = TPF, FPF,
    FNF, TNF:
  • Var(f) = f (1-f) / N
  • For AUC, a back-of-the-envelope calculation gives
    Var(AUC) [formula not transcribed] (see the sketch
    below)
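A sketch of the binomial error bar for a single-population fraction; the numbers used are hypothetical:

    import numpy as np

    def binomial_se(f, N):
        """Standard error of a fraction f (TPF, FPF, TNF or FNF) estimated from N cases."""
        return np.sqrt(f * (1.0 - f) / N)

    TPF, N_abnormal = 0.95, 200            # hypothetical values
    se = binomial_se(TPF, N_abnormal)
    print(TPF, "+/-", 1.96 * se)           # approximate 95% (Wald) interval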

102
Data-rich case
  • Repeat the experiment M times
  • Estimate distribution parameters, e.g., for a
    Gaussian-distributed performance measure f,
    G(m, s^2)
  • Find error bars or confidence limits

103
Example: AUC
  • Mean AUC
  • Distribution variance
  • Variance of the mean
  • Error bars, confidence interval
    (a sketch of this recipe follows)
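A sketch of the data-rich recipe for AUC; the 40 simulated AUC values below are hypothetical stand-ins for repeated experiments:

    import numpy as np

    rng = np.random.default_rng(0)
    auc = rng.normal(0.85, 0.02, size=40)     # hypothetical AUCs from 40 experiments

    mean_auc = auc.mean()
    var_auc = auc.var(ddof=1)                 # distribution variance
    var_mean = var_auc / len(auc)             # variance of the mean
    half_width = 1.96 * np.sqrt(var_mean)
    print(mean_auc, "+/-", half_width)        # ~95% confidence interval for mean AUC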

104
Probability distribution for calculation of AUC
from 40 values
105
Probability distribution for calculation of SNR
from 40 values
106
But what's a poor boy to do?
  • Reuse the data you have: resubstitution,
    resampling
  • Two common approaches:
  • Jackknife
  • Bootstrap

107-114
Resampling
[Animation sequence: repeated resampling from the dataset; figures not
transcribed]
115
Jackknife
  • Have N observations
  • Leave out m of these; there are then M subsets of the
    N observations from which to calculate the mean and
    variance (m, s^2)
  • N = 10, m = 5 => M = 252;  N = 10, m = 1 => M = 10
    (see the sketch below)
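A leave-one-out (m = 1) jackknife sketch for the mean and variance of a statistic; the helper name and the AUC values are ours:

    import numpy as np

    def jackknife(values, statistic=np.mean):
        """Leave-one-out jackknife: mean of the leave-one-out statistics and the
        jackknife estimate of the variance of the statistic."""
        values = np.asarray(values, float)
        n = len(values)
        loo = np.array([statistic(np.delete(values, i)) for i in range(n)])
        jk_mean = loo.mean()
        jk_var = (n - 1) / n * np.sum((loo - jk_mean) ** 2)
        return jk_mean, jk_var

    print(jackknife([0.81, 0.84, 0.86, 0.79, 0.88, 0.83]))   # hypothetical AUC values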

116
Round-robin jackknife bias derivation and variance
  • Given N datasets

117
Fukunaga-Hayes bias derivation
  • Divide both the normal and abnormal classes in
    half, yielding 4 possible pairings

118
Jackknife bias correction example: training error
  • AUC estimates as a function of the number of cases N.
    The solid line is the multilayer perceptron result;
    open circles, jackknife; closed circles,
    Fukunaga-Hayes. The horizontal dotted line is
    the asymptotic ideal result.

119
IV. Bootstrap methods
  • Theoretical foundation
  • Practical use

120
Bootstrap variance
  • What you have is what you've got: the data are your
    best estimate of the probability distribution
  • Sample with replacement, M times
  • Adequate number of samples: M > N
  • Simple bootstrap (see the sketch below)
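A simple-bootstrap sketch, sampling with replacement M times; the helper name and data are ours:

    import numpy as np

    def bootstrap_se(values, statistic=np.mean, M=1000, seed=0):
        """Simple bootstrap: resample the data with replacement M times and use
        the spread of the recomputed statistic as its standard error."""
        rng = np.random.default_rng(seed)
        values = np.asarray(values, float)
        stats = np.array([statistic(rng.choice(values, size=len(values), replace=True))
                          for _ in range(M)])
        return stats.std(ddof=1)

    print(bootstrap_se([0.81, 0.84, 0.86, 0.79, 0.88, 0.83]))   # hypothetical AUC values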

121
Bootstrap and jackknife error estimates
  • Standard deviation of AUC: solid line, simulation
    results; open circles, jackknife estimate; closed
    circles, bootstrap estimate. Note how much larger
    the jackknife error bars are than those provided
    by the bootstrap method.

122
(No Transcript)
123
.632 bootstrap for classifier performance
evaluation
  • Have N cases; draw M samples of size N with
    replacement for training (on average there are
    0.632 x N unique cases in each sample of size N)
  • Test on the unused (~0.368 x N) cases for each
    sample

    N                                  5      10     20     100    Infinity
    Fraction left out, (1-1/N)^N       .328   .349   .358   .366   .368
    Fraction unique, 1-(1-1/N)^N       .672   .651   .642   .634   .632
124
.632 bootstrap for classifier performance
evaluation (2)
  • Have N cases; draw M samples of size N with
    replacement for training (on average 0.632 x N
    unique cases in each sample of size N)
  • Test on the unused (~0.368 x N) cases for each
    sample
  • Get the bootstrap average result AUC_B
  • Get the resubstitution result (testing on the
    training set) AUC_R
  • AUC_.632 = 0.632 x AUC_B + 0.368 x AUC_R
  • As the variance, take the AUC_B variance
    (see the sketch below)
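A toy sketch of the .632 procedure, assuming a trivial 1-D "classifier" whose training step only learns the sign of the class-mean difference; the data, helper names, and classifier are illustrative, not the presentation's code:

    import numpy as np

    def auc_mw(pos, neg):
        x, y = np.asarray(pos, float)[:, None], np.asarray(neg, float)[None, :]
        return np.mean((x > y) + 0.5 * (x == y))

    def dot632_auc(x, y, M=200, seed=0):
        """.632 bootstrap AUC: average out-of-bag bootstrap AUC (AUC_B) combined
        with the resubstitution AUC (AUC_R)."""
        rng = np.random.default_rng(seed)
        x, y = np.asarray(x, float), np.asarray(y)
        n, all_idx = len(x), np.arange(len(x))

        def train_and_test(train, test):
            # "Training": orient the score so abnormals tend to score higher.
            sign = np.sign(x[train][y[train] == 1].mean() - x[train][y[train] == 0].mean())
            t = sign * x[test]
            return auc_mw(t[y[test] == 1], t[y[test] == 0])

        auc_R = train_and_test(all_idx, all_idx)          # test on the training set
        auc_B = []
        for _ in range(M):
            boot = rng.choice(n, size=n, replace=True)    # ~0.632*N unique cases
            out = np.setdiff1d(all_idx, boot)             # the unused ~0.368*N cases
            if len(set(y[boot])) == 2 and len(set(y[out])) == 2:
                auc_B.append(train_and_test(boot, out))
        return 0.632 * np.mean(auc_B) + 0.368 * auc_R

    x = np.r_[np.random.default_rng(1).normal(0, 1, 50),
              np.random.default_rng(2).normal(1, 1, 50)]
    y = np.r_[np.zeros(50), np.ones(50)]
    print(dot632_auc(x, y))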

125
Traps for the unwary: overparameterization
  • Cover's theorem
  • For N < 2(d+1), a hyperplane exists that will
    perfectly separate almost all possible
    dichotomies of N points in d-dimensional space
    (see the sketch below)
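A sketch of Cover's counting function, f_d(N) = 2^(1-N) * sum over k = 0..d of C(N-1, k), the fraction of dichotomies of N points in general position that a hyperplane in d dimensions can separate; this closed form is the standard statement of the theorem, and the function name is ours:

    from math import comb

    def f(d, N):
        """Fraction of the 2^N dichotomies of N points (general position, d dims)
        that are linearly separable."""
        return 2.0 ** (1 - N) * sum(comb(N - 1, k) for k in range(min(d, N - 1) + 1))

    print(f(2, 6))     # 0.5 exactly at N = 2(d + 1)
    print(f(25, 30))   # ~1: almost every dichotomy is separable when N < 2(d + 1)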

126
f_d(N) for d = 1, 5, 25, 125, and the limit of large d.
The abscissa x = N / (2(d+1)) is scaled so that the
values f_d(N) = 0.5 lie superposed at x = 1 for all d.
127
Poor data hygiene
  • Reporting results on training data / testing on
    training data
  • Carrying out any part of the training process on
    data later used for testing
  • e.g., using all of the data to select a
    manageable feature set from among a large number
    of features, and then dividing the data into
    training and test sets

128
Overestimate of AUC from poor data hygiene
Distributions of AUC values in 900 simulation
experiments (left) and the mean ROC curves
(right) for four validation methods.
Method 1: feature selection and classifier
training on one dataset, classifier testing on
another, independent dataset. Method 2: given
perfect feature selection, classifier training on
one dataset and classifier testing on another,
independent dataset. Method 3: feature selection
using the entire dataset, after which the dataset
is partitioned into two parts, one for training and
one for testing the classifier. Method 4: feature
selection, classifier training, and testing using
the same dataset.
129
Correct feature selection is hard to do
An insight into feature-selection performance in
Method 1. The left plot shows the number of
experiments (out of 900) in which each feature is
selected. By design of the simulation population,
the first 30 features are useful for
classification and the remaining features are
useless. The right plot shows the distribution of
the number of useful features (out of 30) selected
in the 900 experiments.
130
Conclusions
  • Accuracy and other prevalence dependent measures
    are inadequate
  • ROC/AUC provide good measures of performance
  • Uncertainty must be quantified
  • Bootstrap and jackknife techniques are useful
    methods

131
V. References
  • [1] K. Fukunaga, Statistical Pattern Recognition, 2nd Edition.
    Boston: Harcourt Brace Jovanovich, 1990.
  • [2] K. Fukunaga and R. R. Hayes, "Effects of sample size in
    classifier design," IEEE Trans. Pattern Anal. Machine Intell.,
    vol. PAMI-11, pp. 873-885, 1989.
  • [3] D. M. Green and J. A. Swets, Signal Detection Theory and
    Psychophysics. New York: John Wiley & Sons, 1966.
  • [4] J. P. Egan, Signal Detection Theory and ROC Analysis.
    New York: Academic Press, 1975.
  • [5] C. E. Metz, "Basic principles of ROC analysis," Seminars in
    Nuclear Medicine, vol. VIII, no. 4.
  • [6] H. H. Barrett and K. J. Myers, Foundations of Image Science.
    Hoboken: John Wiley & Sons, 2004, ch. 13, "Statistical Decision
    Theory."
  • [7] B. Efron and R. J. Tibshirani, An Introduction to the
    Bootstrap. Boca Raton: Chapman & Hall/CRC, 1993.
  • [8] B. Efron, The Jackknife, the Bootstrap and Other Resampling
    Plans. Philadelphia: Society for Industrial and Applied
    Mathematics, 1982.
  • [9] A. C. Davison and D. V. Hinkley, Bootstrap Methods and their
    Applications. Cambridge: Cambridge University Press, 1997.
  • [10] B. Efron, "Estimating the error rate of a prediction rule:
    Some improvements on cross-validation," Journal of the American
    Statistical Association, vol. 78, pp. 316-331, 1983.
  • [11] B. Efron and R. J. Tibshirani, "Improvements on
    cross-validation: The .632+ bootstrap method," Journal of the
    American Statistical Association, vol. 92, no. 438, pp. 548-560,
    1997.
  • [12] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of
    Statistical Learning, 2nd Edition. New York: Springer, 2009.
  • [13] C. M. Bishop, Pattern Recognition and Machine Learning.
    New York: Springer, 2006.
  • [14] C. M. Bishop, Neural Networks for Pattern Recognition.
    Oxford: Oxford University Press, 1995.
  • [15] R. F. Wagner, D. G. Brown, J.-P. Guedon, K. J. Myers, and
    K. A. Wear, "Multivariate Gaussian pattern classification:
    effects of finite sample size and the addition of correlated or
    noisy features on summary measures of goodness," in Information
    Processing in Medical Imaging, Proceedings of IPMI '93, 1993,
    pp. 507-524.
  • [16] R. F. Wagner, D. G. Brown, J.-P. Guedon, K. J. Myers, and
    K. A. Wear, "On combining a few diagnostic tests or features,"
    in Proceedings of the SPIE, Image Processing, vol. 2167, 1994.
  • [17] D. G. Brown, A. C. Schneider, M. P. Anderson, and
    R. F. Wagner, "Effects of finite sample size and correlated
    noisy input features on neural network pattern classification,"
    in Proceedings of the SPIE, Image Processing, vol. 2167, 1994.
  • [18] C. A. Beam, "Analysis of clustered data in receiver
    operating characteristic studies," Statistical Methods in
    Medical Research, vol. 7, pp. 324-336, 1998.
  • [19] W. A. Yousef et al., "Assessing classifiers from two
    independent data sets using ROC analysis: A nonparametric
    approach," IEEE Transactions on Pattern Analysis and Machine
    Intelligence, vol. 28, no. 11, pp. 1809-1817, 2006.

132
Appendix I
133
Searching suitcases
134-143
(No transcript)
144
Previous class results
145-149
(No transcript)