Title: CSI5388: A Critique of our Evaluation Practices in Machine Learning
1. CSI5388: A Critique of our Evaluation Practices in Machine Learning

2. Observations

- The way in which evaluation is conducted in Machine Learning / Data Mining has not been a primary concern in the community.
- This is very different from the way evaluation is approached in other applied fields such as Economics, Psychology, and Sociology.
- In those fields, researchers have been more concerned with the meaning and validity of their results than we have been in ours.

3. The Problem

- The objective value of our advances in Machine Learning may be different from what we believe it is.
- Our conclusions may be flawed or meaningless.
- ML methods may get undue credit, or may not get sufficiently recognized.
- The field may start stagnating.
- Practitioners in other fields or potential business partners may dismiss our approaches and results.
- We hope that with better evaluation practices, we can help the field of machine learning focus on more effective research and encourage more cross-discipline or cross-purpose exchanges.

4. Organization of the Lecture

- A review of the shortcomings of current evaluation methods:
  - Problems with Performance Evaluation
  - Problems with Confidence Estimation
  - Problems with Data Sets

5. Recommended Steps for Proper Evaluation

- Identify the interesting properties of the classifier.
- Choose an evaluation metric accordingly.
- Choose a confidence estimation method.
- Check that all the assumptions made by the evaluation metric and the confidence estimator are verified.
- Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
- Interpret the results with respect to the domain.

6. Commonly Followed Steps of Evaluation

- Identify the interesting properties of the classifier.
- Choose an evaluation metric accordingly.
- Choose a confidence estimation method.
- Check that all the assumptions made by the evaluation metric and the confidence estimator are verified.
- Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
- Interpret the results with respect to the domain.

These steps are typically considered, but only very lightly.

7. Overview

- What happens when bad choices of performance evaluation metrics are made? (Steps 1 and 2 are considered too lightly.) E.g.:
  - Accuracy
  - Precision/Recall
  - ROC Analysis
- Note: each metric solves the problem of the previous one, but introduces new shortcomings (usually caught by the previous metrics).
- What happens when bad choices of confidence estimators are made and the assumptions underlying these confidence estimators are not respected? (Step 3 is considered lightly and Step 4 is disregarded.) E.g.:
  - The t-test

8. A Short Review I: Confusion Matrix / Common Performance Evaluation Metrics

- Accuracy = (TP + TN) / (P + N)
- Precision = TP / (TP + FP)
- Recall (TP rate) = TP / P
- FP rate = FP / N
- ROC Analysis moves the threshold between the positive and negative class from a small FP rate to a large one. It plots the value of the Recall against that of the FP rate at each FP rate considered.

A confusion matrix (the sketch below computes the metrics above from its cells):

    True class →     Pos          Neg
    Hypothesized ↓
    Yes              TP           FP
    No               FN           TN
    Total            P = TP + FN  N = FP + TN
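
To make the definitions concrete, here is a minimal Python sketch that computes the four metrics from the cells of a confusion matrix. The function name and the example values (the left matrix of slide 10) are illustrative, not part of the lecture.

    # Minimal sketch: the four metrics above, computed from the cells
    # of a confusion matrix. Names and example values are illustrative.

    def evaluation_metrics(tp, fp, fn, tn):
        p = tp + fn  # total true positives in the data (P)
        n = fp + tn  # total true negatives in the data (N)
        return {
            "accuracy": (tp + tn) / (p + n),
            "precision": tp / (tp + fp),
            "recall": tp / p,   # also called the TP rate
            "fp_rate": fp / n,
        }

    # Example, using the left confusion matrix of slide 10:
    # accuracy = 0.6, precision = 0.667, recall = 0.4, FP rate = 0.2
    print(evaluation_metrics(tp=200, fp=100, fn=300, tn=400))
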
9. A Short Review II: Confidence Estimation / The t-Test

- The most commonly used approach to confidence estimation in Machine Learning is:
  - To run the algorithm using 10-fold cross-validation and to record the accuracy at each fold.
  - To compute a confidence interval around the average of the difference between these reported accuracies and a given gold standard, using the t-test, i.e., the following formula (sketched in code below):

        d̄ ± t_{N,9} · s_d

    where
  - d̄ is the average difference between the reported accuracies and the given gold standard,
  - t_{N,9} is a constant chosen according to the degree of confidence N desired, with 9 degrees of freedom for the 10 folds,
  - s_d = sqrt(1/90 · Σ_{i=1}^{10} (d_i − d̄)²), where d_i represents the difference between the reported accuracy and the given gold standard at fold i.
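
A minimal sketch of this interval computation, assuming d holds the ten per-fold accuracy differences. The critical value 2.262 (two-sided 95%, 9 degrees of freedom) is a standard table value; the function name is illustrative.

    import math

    def t_confidence_interval(d, t_crit=2.262):
        """Interval d_bar +/- t * s_d over per-fold differences d.

        t_crit = 2.262 is the two-sided 95% critical value for 9
        degrees of freedom (10 folds); other confidence levels
        require other table values.
        """
        n = len(d)
        d_bar = sum(d) / n
        # s_d = sqrt(1/(n*(n-1)) * sum (d_i - d_bar)^2); 1/90 for n = 10
        s_d = math.sqrt(sum((di - d_bar) ** 2 for di in d) / (n * (n - 1)))
        return d_bar - t_crit * s_d, d_bar + t_crit * s_d
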
10. What's wrong with Accuracy?

Left classifier:

    True class →     Pos      Neg
    Hypothesized ↓
    Yes              200      100
    No               300      400
    Total            P = 500  N = 500

Right classifier:

    True class →     Pos      Neg
    Hypothesized ↓
    Yes              400      300
    No               100      200
    Total            P = 500  N = 500

- Both classifiers obtain 60% accuracy (reproduced in the sketch below).
- They exhibit very different behaviours:
  - On the left: weak positive recognition rate / strong negative recognition rate.
  - On the right: strong positive recognition rate / weak negative recognition rate.
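
A small sketch, assuming the two matrices above, that reproduces the claim: accuracy is identical while the per-class recognition rates diverge.

    # Sketch: same accuracy, very different per-class behaviour.
    # (tp, fp, fn, tn) follow the matrices above.
    for name, (tp, fp, fn, tn) in [("left", (200, 100, 300, 400)),
                                   ("right", (400, 300, 100, 200))]:
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        pos_rate = tp / (tp + fn)  # positive recognition rate (recall)
        neg_rate = tn / (fp + tn)  # negative recognition rate
        print(f"{name}: accuracy={accuracy:.2f}, "
              f"pos rate={pos_rate:.2f}, neg rate={neg_rate:.2f}")
    # left:  accuracy=0.60, pos rate=0.40, neg rate=0.80
    # right: accuracy=0.60, pos rate=0.80, neg rate=0.40
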
11. What's wrong with Precision/Recall?

Left classifier:

    True class →     Pos      Neg
    Hypothesized ↓
    Yes              200      100
    No               300      400
    Total            P = 500  N = 500

Right classifier:

    True class →     Pos      Neg
    Hypothesized ↓
    Yes              200      100
    No               300      0
    Total            P = 500  N = 100

- Both classifiers obtain the same precision and recall values of 66.7% and 40%.
- They exhibit very different behaviours:
  - Same positive recognition rate.
  - Extremely different negative recognition rates: strong on the left / nil on the right.
- Note: Accuracy has no problem catching this! (See the sketch below.)
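
The same kind of sketch for this pair, under the matrices above: precision and recall are identical, while accuracy exposes the nil negative recognition rate on the right.

    # Sketch: identical precision/recall; accuracy catches the difference.
    for name, (tp, fp, fn, tn) in [("left", (200, 100, 300, 400)),
                                   ("right", (200, 100, 300, 0))]:
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        print(f"{name}: precision={precision:.3f}, "
              f"recall={recall:.3f}, accuracy={accuracy:.3f}")
    # left:  precision=0.667, recall=0.400, accuracy=0.600
    # right: precision=0.667, recall=0.400, accuracy=0.333
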
12. What's wrong with ROC Analysis? (We consider single points in ROC space, not the entire ROC curve.)

Left classifier:

    True class →     Pos        Neg
    Hypothesized ↓
    Yes              200        10
    No               300        4,000
    Total            P = 500    N = 4,010

Right classifier:

    True class →     Pos        Neg
    Hypothesized ↓
    Yes              500        1,000
    No               300        400,000
    Total            P = 800    N = 401,000

- ROC Analysis and Precision yield contradictory results.
- In terms of ROC Analysis, the classifier on the right is a significantly better choice than the one on the left: the point representing the right classifier sits on the same vertical line (identical FP rate) but 22.5% higher (in Recall) than the point representing the left classifier.
- Yet, the classifier on the right has ridiculously low precision (33.3%) while the classifier on the left has excellent precision (95.24%). (Both computations appear in the sketch below.)
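
A sketch of the two ROC points and their precisions, using the matrices above: the FP rates coincide (same vertical line in ROC space), the right point is higher in Recall, yet its precision collapses.

    # Sketch: two single ROC points with the same FP rate; ROC Analysis
    # prefers the right classifier, precision strongly prefers the left.
    for name, (tp, fp, fn, tn) in [("left", (200, 10, 300, 4000)),
                                   ("right", (500, 1000, 300, 400000))]:
        recall = tp / (tp + fn)    # y-coordinate in ROC space
        fp_rate = fp / (fp + tn)   # x-coordinate in ROC space
        precision = tp / (tp + fp)
        print(f"{name}: recall={recall:.3f}, fp_rate={fp_rate:.5f}, "
              f"precision={precision:.3f}")
    # left:  recall=0.400, fp_rate=0.00249, precision=0.952
    # right: recall=0.625, fp_rate=0.00249, precision=0.333
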
13. What's wrong with the t-test?

Accuracy difference from the gold standard at each fold:

         Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10
    C1      5      -5       5      -5       5      -5       5      -5       5      -5
    C2     10      -5      -5       0       0       0       0       0       0       0

- Classifiers 1 and 2 yield the same mean difference and comparable confidence intervals.
- Yet, Classifier 1 is relatively stable, while Classifier 2 is not.
- Problem: the t-test assumes a normal distribution. The difference in accuracy between Classifier 2 and the gold standard is not normally distributed. (A quick normality check is sketched below.)
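
One way to make Step 4 concrete here is to test the per-fold differences for normality before trusting the t-test. Below is a sketch using SciPy's Shapiro-Wilk test; the choice of test and the 0.05 threshold are conventional assumptions, not prescribed by the lecture.

    # Sketch: check the t-test's normality assumption on the per-fold
    # differences before trusting the interval. A small p-value means
    # the normality assumption is doubtful and the t-test is suspect.
    from scipy import stats

    folds = {
        "C1": [5, -5, 5, -5, 5, -5, 5, -5, 5, -5],
        "C2": [10, -5, -5, 0, 0, 0, 0, 0, 0, 0],
    }

    for name, d in folds.items():
        stat, p_value = stats.shapiro(d)
        verdict = ("normality doubtful" if p_value < 0.05
                   else "no evidence against normality")
        print(f"{name}: Shapiro-Wilk p = {p_value:.4f} ({verdict})")
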
14. Discussion

- There is nothing intrinsically wrong with any of the performance evaluation measures or confidence tests discussed. It is all a matter of thinking about which one to use when, and what the results mean (both in terms of added value and limitations).
- A simple conceptualization of the problem with current evaluation practices: evaluation metrics and confidence measures summarize the results, so ML practitioners must understand the terms of these summarizations and verify that their assumptions hold.
- In certain cases, however, it is necessary to look further and, eventually, borrow practices from other disciplines. In yet other cases, it pays to devise our own methods. Both instances are discussed in what follows.