Title: CSI5388: A Critique of our Evaluation Practices in Machine Learning
1. CSI5388: A Critique of our Evaluation Practices in Machine Learning

2. Observations

- The way in which evaluation is conducted in Machine Learning / Data Mining has not been a primary concern in the community.
- This is very different from the way evaluation is approached in other applied fields such as Economics, Psychology, and Sociology.
- In those fields, researchers have been more concerned with the meaning and validity of their results than we have been in ours.

3. The Problem

- The objective value of our advances in Machine Learning may be different from what we believe it is.
- Our conclusions may be flawed or meaningless.
- ML methods may get undue credit, or may not get sufficiently recognized.
- The field may start stagnating.
- Practitioners in other fields or potential business partners may dismiss our approaches and results.
- We hope that with better evaluation practices, we can help the field of machine learning focus on more effective research and encourage more cross-discipline or cross-purpose exchanges.

4. Organization of the Lecture

- A review of the shortcomings of current evaluation methods:
  - Problems with Performance Evaluation
  - Problems with Confidence Estimation
  - Problems with Data Sets

5. Recommended Steps for Proper Evaluation

- Identify the interesting properties of the classifier.
- Choose an evaluation metric accordingly.
- Choose a confidence estimation method.
- Check that all the assumptions made by the evaluation metric and the confidence estimator are verified.
- Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
- Interpret the results with respect to the domain.

6. Commonly Followed Steps of Evaluation

- Identify the interesting properties of the classifier.
- Choose an evaluation metric accordingly.
- Choose a confidence estimation method.
- Check that all the assumptions made by the evaluation metric and the confidence estimator are verified.
- Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
- Interpret the results with respect to the domain.

These steps are typically considered, but only very lightly.

7. Overview

- What happens when bad choices of performance evaluation metrics are made? (Steps 1 and 2 are considered too lightly.) E.g.:
  - Accuracy
  - Precision/Recall
  - ROC Analysis
- Note: each metric solves the problem of the previous one, but introduces new shortcomings (usually caught by the previous metrics).
- What happens when bad choices of confidence estimators are made and the assumptions underlying these confidence estimators are not respected? (Step 3 is considered lightly and Step 4 is disregarded.) E.g.:
  - The t-test

8. A Short Review I: Confusion Matrix / Common Performance Evaluation Metrics

- Accuracy = (TP + TN) / (P + N)
- Precision = TP / (TP + FP)
- Recall (TP rate) = TP / P
- FP rate = FP / N
- ROC Analysis moves the threshold between the positive and negative class from a small FP rate to a large one. It plots the value of the Recall against that of the FP rate at each FP rate considered.

A confusion matrix (the sketch below computes the metrics above from its cells):

    True class →     Pos          Neg
    Hypothesized ↓
    Yes              TP           FP
    No               FN           TN
    Total            P = TP + FN  N = FP + TN
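
To make the definitions concrete, here is a minimal Python sketch that computes the four metrics from the cells of a confusion matrix. The function name and the example values (the left matrix of slide 10) are illustrative, not part of the lecture.

    # Minimal sketch: the four metrics above, computed from the cells
    # of a confusion matrix. Names and example values are illustrative.

    def evaluation_metrics(tp, fp, fn, tn):
        p = tp + fn  # total true positives in the data (P)
        n = fp + tn  # total true negatives in the data (N)
        return {
            "accuracy": (tp + tn) / (p + n),
            "precision": tp / (tp + fp),
            "recall": tp / p,   # also called the TP rate
            "fp_rate": fp / n,
        }

    # Example, using the left confusion matrix of slide 10:
    # accuracy = 0.6, precision = 0.667, recall = 0.4, FP rate = 0.2
    print(evaluation_metrics(tp=200, fp=100, fn=300, tn=400))
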
9. A Short Review II: Confidence Estimation / The t-Test

- The most commonly used approach to confidence estimation in Machine Learning is:
  - To run the algorithm using 10-fold cross-validation and to record the accuracy at each fold.
  - To compute a confidence interval around the average of the difference between these reported accuracies and a given gold standard, using the t-test, i.e., the following formula (sketched in code below):

        d̄ ± t_{N,9} · s_d

    where
  - d̄ is the average difference between the reported accuracies and the given gold standard,
  - t_{N,9} is a constant chosen according to the degree of confidence N desired, with 9 degrees of freedom for the 10 folds,
  - s_d = sqrt(1/90 · Σ_{i=1}^{10} (d_i − d̄)²), where d_i represents the difference between the reported accuracy and the given gold standard at fold i.
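
A minimal sketch of this interval computation, assuming d holds the ten per-fold accuracy differences. The critical value 2.262 (two-sided 95%, 9 degrees of freedom) is a standard table value; the function name is illustrative.

    import math

    def t_confidence_interval(d, t_crit=2.262):
        """Interval d_bar +/- t * s_d over per-fold differences d.

        t_crit = 2.262 is the two-sided 95% critical value for 9
        degrees of freedom (10 folds); other confidence levels
        require other table values.
        """
        n = len(d)
        d_bar = sum(d) / n
        # s_d = sqrt(1/(n*(n-1)) * sum (d_i - d_bar)^2); 1/90 for n = 10
        s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n * (n - 1)))
        return d_bar - t_crit * s_d, d_bar + t_crit * s_d
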
10. What's wrong with Accuracy?

Left classifier:

    True class →     Pos      Neg
    Hypothesized ↓
    Yes              200      100
    No               300      400
    Total            P = 500  N = 500

Right classifier:

    True class →     Pos      Neg
    Hypothesized ↓
    Yes              400      300
    No               100      200
    Total            P = 500  N = 500

- Both classifiers obtain 60% accuracy (reproduced in the sketch below).
- They exhibit very different behaviours:
  - On the left: weak positive recognition rate / strong negative recognition rate.
  - On the right: strong positive recognition rate / weak negative recognition rate.
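
A small sketch, assuming the two matrices above, that reproduces the claim: accuracy is identical while the per-class recognition rates diverge.

    # Sketch: same accuracy, very different per-class behaviour.
    # (tp, fp, fn, tn) follow the matrices above.
    for name, (tp, fp, fn, tn) in [("left", (200, 100, 300, 400)),
                                   ("right", (400, 300, 100, 200))]:
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        pos_rate = tp / (tp + fn)  # positive recognition rate (recall)
        neg_rate = tn / (fp + tn)  # negative recognition rate
        print(f"{name}: accuracy={accuracy:.2f}, "
              f"pos rate={pos_rate:.2f}, neg rate={neg_rate:.2f}")
    # left:  accuracy=0.60, pos rate=0.40, neg rate=0.80
    # right: accuracy=0.60, pos rate=0.80, neg rate=0.40
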
11. What's wrong with Precision/Recall?

Left classifier:

    True class →     Pos      Neg
    Hypothesized ↓
    Yes              200      100
    No               300      400
    Total            P = 500  N = 500

Right classifier:

    True class →     Pos      Neg
    Hypothesized ↓
    Yes              200      100
    No               300      0
    Total            P = 500  N = 100

- Both classifiers obtain the same precision and recall values of 66.7% and 40%.
- They exhibit very different behaviours:
  - Same positive recognition rate.
  - Extremely different negative recognition rates: strong on the left / nil on the right.
- Note: Accuracy has no problem catching this! (See the sketch below.)
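
The same kind of sketch for this pair, under the matrices above: precision and recall are identical, while accuracy exposes the nil negative recognition rate on the right.

    # Sketch: identical precision/recall; accuracy catches the difference.
    for name, (tp, fp, fn, tn) in [("left", (200, 100, 300, 400)),
                                   ("right", (200, 100, 300, 0))]:
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        print(f"{name}: precision={precision:.3f}, "
              f"recall={recall:.3f}, accuracy={accuracy:.3f}")
    # left:  precision=0.667, recall=0.400, accuracy=0.600
    # right: precision=0.667, recall=0.400, accuracy=0.333
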
12. What's wrong with ROC Analysis? (We consider single points in ROC space, not the entire ROC curve.)

Left classifier:

    True class →     Pos        Neg
    Hypothesized ↓
    Yes              200        10
    No               300        4,000
    Total            P = 500    N = 4,010

Right classifier:

    True class →     Pos        Neg
    Hypothesized ↓
    Yes              500        1,000
    No               300        400,000
    Total            P = 800    N = 401,000

- ROC Analysis and Precision yield contradictory results.
- In terms of ROC Analysis, the classifier on the right is a significantly better choice than the one on the left: the point representing the right classifier sits on the same vertical line (identical FP rate) but 22.5% higher (in Recall) than the point representing the left classifier.
- Yet, the classifier on the right has ridiculously low precision (33.3%) while the classifier on the left has excellent precision (95.24%). (Both computations appear in the sketch below.)
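
A sketch of the two ROC points and their precisions, using the matrices above: the FP rates coincide (same vertical line in ROC space), the right point is higher in Recall, yet its precision collapses.

    # Sketch: two single ROC points with the same FP rate; ROC Analysis
    # prefers the right classifier, precision strongly prefers the left.
    for name, (tp, fp, fn, tn) in [("left", (200, 10, 300, 4000)),
                                   ("right", (500, 1000, 300, 400000))]:
        recall = tp / (tp + fn)    # y-coordinate in ROC space
        fp_rate = fp / (fp + tn)   # x-coordinate in ROC space
        precision = tp / (tp + fp)
        print(f"{name}: recall={recall:.3f}, fp_rate={fp_rate:.5f}, "
              f"precision={precision:.3f}")
    # left:  recall=0.400, fp_rate=0.00249, precision=0.952
    # right: recall=0.625, fp_rate=0.00249, precision=0.333
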
13. What's wrong with the t-test?

Accuracy difference from the gold standard at each fold:

         Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10
    C1      5      -5       5      -5       5      -5       5      -5       5      -5
    C2     10      -5      -5       0       0       0       0       0       0       0

- Classifiers 1 and 2 yield the same mean difference and comparable confidence intervals.
- Yet, Classifier 1 is relatively stable, while Classifier 2 is not.
- Problem: the t-test assumes a normal distribution. The difference in accuracy between Classifier 2 and the gold standard is not normally distributed. (A quick normality check is sketched below.)
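
One way to make Step 4 concrete here is to test the per-fold differences for normality before trusting the t-test. Below is a sketch using SciPy's Shapiro-Wilk test; the choice of test and the 0.05 threshold are conventional assumptions, not prescribed by the lecture.

    # Sketch: check the t-test's normality assumption on the per-fold
    # differences before trusting the interval. A small p-value means
    # the normality assumption is doubtful and the t-test is suspect.
    from scipy import stats

    folds = {
        "C1": [5, -5, 5, -5, 5, -5, 5, -5, 5, -5],
        "C2": [10, -5, -5, 0, 0, 0, 0, 0, 0, 0],
    }

    for name, d in folds.items():
        stat, p_value = stats.shapiro(d)
        verdict = ("normality doubtful" if p_value < 0.05
                   else "no evidence against normality")
        print(f"{name}: Shapiro-Wilk p = {p_value:.4f} ({verdict})")
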
14. Discussion

- There is nothing intrinsically wrong with any of the performance evaluation measures or confidence tests discussed. It is all a matter of thinking about which one to use when, and what the results mean (both in terms of added value and limitations).
- A simple conceptualization of the problem with current evaluation practices: evaluation metrics and confidence measures summarize the results, so ML practitioners must understand the terms of these summarizations and verify that their assumptions hold.
- In certain cases, however, it is necessary to look further and, eventually, borrow practices from other disciplines. In yet other cases, it pays to devise our own methods. Both instances are discussed in what follows.