1
CSI5388: A Critique of our Evaluation Practices
in Machine Learning
2
Observations
  • The way in which evaluation is conducted in
    Machine Learning/Data Mining has not been a
    primary concern in the community.
  • This is very different from the way evaluation
    is approached in other applied fields such as
    Economics, Psychology and Sociology.
  • In those fields, researchers have been more
    concerned with the meaning and validity of their
    results than researchers in ours have been.

3
The Problem
  • The objective value of our advances in Machine
    Learning may be different from what we believe it
    is.
  • Our conclusions may be flawed or meaningless.
  • ML methods may get undue credit, or may not get
    the recognition they deserve.
  • The field may start stagnating.
  • Practitioners in other fields or potential
    business partners may dismiss our
    approaches/results.
  • We hope that with better evaluation practices, we
    can help the field of machine learning focus on
    more effective research and encourage more
    cross-disciplinary and cross-purpose exchanges.

4
Organization of the Lecture
  • A review of the shortcomings of current
    evaluation methods
  • Problems with Performance Evaluation
  • Problems with Confidence Estimation
  • Problems with Data Sets

5
Recommended Steps for Proper Evaluation
  1. Identify the interesting properties of the
     classifier.
  2. Choose an evaluation metric accordingly.
  3. Choose a confidence estimation method.
  4. Check that all the assumptions made by the
     evaluation metric and the confidence estimator
     are satisfied.
  5. Run the evaluation method with the chosen metric
     and confidence estimator, and analyze the
     results.
  6. Interpret the results with respect to the domain.

6
Commonly Followed Steps of Evaluation
  1. Identify the interesting properties of the
     classifier.
  2. Choose an evaluation metric accordingly.
  3. Choose a confidence estimation method.
  4. Check that all the assumptions made by the
     evaluation metric and the confidence estimator
     are satisfied.
  5. Run the evaluation method with the chosen metric
     and confidence estimator, and analyze the
     results.
  6. Interpret the results with respect to the domain.

These steps are typically considered, but only very
lightly.
7
Overview
  • What happens when bad choices of performance
    evaluation metrics are made? (Steps 1 and 2 are
    considered too lightly)
  • Accuracy
  • Precision/Recall
  • ROC Analysis
  • Note: each metric solves the problem of the
    previous one, but introduces new shortcomings
    (usually shortcomings that the previous metrics
    caught).
  • What happens when bad choices of confidence
    estimators are made and the assumptions
    underlying these confidence estimators are not
    respected? (Step 3 is considered lightly and
    Step 4 is disregarded)
  • The t-test

8
A Short Review I: Confusion Matrix / Common
Performance Evaluation Metrics
  • Accuracy = (TP + TN) / (P + N)
  • Precision = TP / (TP + FP)
  • Recall / TP Rate = TP / P
  • FP Rate = FP / N
  • ROC Analysis moves the threshold between the
    positive and negative class from a small FP rate
    to a large one. It plots the value of the Recall
    against that of the FP Rate at each FP Rate
    considered. (A small computational sketch of
    these metrics follows the confusion matrix
    below.)

A Confusion Matrix (columns: true class, rows: hypothesized class):

                 Pos           Neg
Yes              TP            FP
No               FN            TN
                 P = TP + FN   N = FP + TN
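As a concrete illustration of the four metrics above, here is a minimal Python sketch; it is ours, not part of the original lecture, and the helper name is hypothetical. The example counts are those of the left-hand matrix used later in the accuracy example.

    # Minimal sketch: compute the metrics defined above from raw
    # confusion-matrix counts.
    def confusion_metrics(tp, fp, fn, tn):
        p = tp + fn                      # P = TP + FN (actual positives)
        n = fp + tn                      # N = FP + TN (actual negatives)
        return {
            "accuracy":  (tp + tn) / (p + n),
            "precision": tp / (tp + fp),
            "recall":    tp / p,         # also called the TP rate
            "fp_rate":   fp / n,
        }

    print(confusion_metrics(tp=200, fp=100, fn=300, tn=400))
    # accuracy 0.6, precision ~0.667, recall 0.4, fp_rate 0.2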
9
A Short Review II: Confidence Estimation / The
t-Test
  • The most commonly used approach to confidence
    estimation in Machine Learning is:
  • To run the algorithm using 10-fold
    cross-validation and to record the accuracy at
    each fold.
  • To compute a confidence interval around the
    average of the differences between these reported
    accuracies and a given gold standard, using the
    t-test, i.e., the following formula:
  • d̄ ± t_{N,9} · s_d, where
  • d̄ is the average difference between the reported
    accuracy and the given gold standard,
  • t_{N,9} is a constant chosen according to the
    degree of confidence desired,
  • s_d = sqrt( (1/90) · Σ_{i=1..10} (d_i − d̄)² ),
    where d_i represents the difference between the
    reported accuracy and the given gold standard at
    fold i.
    (A small computational sketch of this interval
    appears below.)
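The interval above can be written out directly. The sketch below is ours, not from the lecture: it assumes 10 folds (hence the 1/90 factor) and uses 2.262, the two-tailed t critical value at 95% confidence with 9 degrees of freedom, as t_{N,9}. The example per-fold differences are those of Classifier 1 from the t-test example later in the lecture.

    import math

    # Sketch of the interval d_bar +/- t_{N,9} * s_d over per-fold
    # differences between reported accuracy and a gold standard.
    def t_confidence_interval(diffs, t_crit=2.262):   # t_{N,9}: 95%, 9 d.o.f.
        k = len(diffs)                                 # k = 10 folds
        d_bar = sum(diffs) / k
        # s_d = sqrt( (1/(k*(k-1))) * sum_i (d_i - d_bar)^2 ) -> 1/90 for k = 10
        s_d = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (k * (k - 1)))
        return d_bar - t_crit * s_d, d_bar + t_crit * s_d

    print(t_confidence_interval([5, -5, 5, -5, 5, -5, 5, -5, 5, -5]))
    # roughly (-3.77, 3.77)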

10
What's wrong with Accuracy?
Left classifier:
True class →     Pos       Neg
Yes              200       100
No               300       400
                 P = 500   N = 500

Right classifier:
True class →     Pos       Neg
Yes              400       300
No               100       200
                 P = 500   N = 500
  • Both classifiers obtain 60% accuracy.
  • They exhibit very different behaviours:
  • On the left: weak positive recognition
    rate / strong negative recognition rate.
  • On the right: strong positive recognition
    rate / weak negative recognition rate.
    (See the sketch below.)
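A small sketch (ours, for illustration) reproducing the computation for these two matrices: both reach 60% accuracy, yet their per-class recognition rates are mirror images of each other.

    # Sketch: accuracy and per-class recognition rates for the two matrices.
    def rates(tp, fp, fn, tn):
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        pos_rate = tp / (tp + fn)        # positive recognition rate (recall)
        neg_rate = tn / (fp + tn)        # negative recognition rate
        return accuracy, pos_rate, neg_rate

    print(rates(tp=200, fp=100, fn=300, tn=400))   # left:  (0.6, 0.4, 0.8)
    print(rates(tp=400, fp=300, fn=100, tn=200))   # right: (0.6, 0.8, 0.4)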

11
What's wrong with Precision/Recall?
Left classifier:
True class →     Pos       Neg
Yes              200       100
No               300       400
                 P = 500   N = 500

Right classifier:
True class →     Pos       Neg
Yes              200       100
No               300       0
                 P = 500   N = 100
  • Both classifiers obtain the same precision and
    recall values of 66.7% and 40%.
  • They exhibit very different behaviours:
  • Same positive recognition rate.
  • Extremely different negative recognition rates:
    strong on the left / nil on the right.
  • Note: Accuracy has no problem catching this!
    (See the sketch below.)
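A corresponding sketch (ours) for the two matrices above: precision and recall are identical, but accuracy separates the two classifiers.

    # Sketch: precision, recall and accuracy for the two matrices above.
    def prec_rec_acc(tp, fp, fn, tn):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        accuracy = (tp + tn) / (tp + fp + fn + tn)
        return precision, recall, accuracy

    print(prec_rec_acc(tp=200, fp=100, fn=300, tn=400))  # left:  (~0.667, 0.4, 0.6)
    print(prec_rec_acc(tp=200, fp=100, fn=300, tn=0))    # right: (~0.667, 0.4, ~0.333)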

12
What's wrong with ROC Analysis? (We consider
single points in ROC space, not the entire ROC curve.)
Left classifier:
True class →     Pos       Neg
Yes              200       10
No               300       4,000
                 P = 500   N = 4,010

Right classifier:
True class →     Pos       Neg
Yes              500       1,000
No               300       400,000
                 P = 800   N = 401,000
  • ROC Analysis and Precision yield contradictory
    results:
  • In terms of ROC Analysis, the classifier on the
    right is a significantly better choice than the
    one on the left: the point representing the
    right classifier lies on the same vertical line
    (same FP rate) but 22.5% higher in Recall than
    the point representing the left classifier.
  • Yet, the classifier on the right has ridiculously
    low precision (33.3%), while the classifier on
    the left has excellent precision (95.24%).
    (See the sketch below.)
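The sketch below (ours) computes the ROC point (Recall, FP rate) and the precision for both matrices: the two points share the same FP rate, the right one sits 22.5 percentage points higher in Recall, and yet its precision collapses.

    # Sketch: ROC point (recall, FP rate) and precision for the two matrices.
    def roc_point_and_precision(tp, fp, fn, tn):
        recall = tp / (tp + fn)
        fp_rate = fp / (fp + tn)
        precision = tp / (tp + fp)
        return recall, fp_rate, precision

    print(roc_point_and_precision(tp=200, fp=10, fn=300, tn=4_000))
    # left:  recall 0.400, FP rate ~0.0025, precision ~0.952
    print(roc_point_and_precision(tp=500, fp=1_000, fn=300, tn=400_000))
    # right: recall 0.625, FP rate ~0.0025, precision ~0.333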

13
What's wrong with the t-test?
              Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Fold 6  Fold 7  Fold 8  Fold 9  Fold 10
Classifier 1     5      -5       5      -5       5      -5       5      -5       5      -5
Classifier 2    10      -5      -5       0       0       0       0       0       0       0
  • Classifiers 1 and 2 yield the same mean
    difference (0) and comparable confidence
    intervals, both containing 0.
  • Yet, Classifier 1 is relatively stable, while
    Classifier 2 is not.
  • Problem: the t-test assumes a normal
    distribution. The difference in accuracy between
    Classifier 2 and the gold standard is not
    normally distributed.
    (See the sketch below.)
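The sketch below (ours) applies the interval from the earlier review slide to the two rows of the table: both mean differences are 0 and both intervals contain 0, yet the summary says nothing about Classifier 2's single large deviation, which is what makes the normality assumption questionable.

    import math

    # Sketch: mean difference and t-based confidence interval for the two
    # per-fold difference profiles in the table above.
    c1 = [5, -5, 5, -5, 5, -5, 5, -5, 5, -5]    # stable, alternating
    c2 = [10, -5, -5, 0, 0, 0, 0, 0, 0, 0]      # dominated by one fold

    def mean_and_interval(diffs, t_crit=2.262):  # t_{N,9} at 95% confidence
        k = len(diffs)
        d_bar = sum(diffs) / k
        s_d = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (k * (k - 1)))
        return d_bar, (d_bar - t_crit * s_d, d_bar + t_crit * s_d)

    print(mean_and_interval(c1))   # mean 0.0, interval roughly (-3.8, 3.8)
    print(mean_and_interval(c2))   # mean 0.0, interval roughly (-2.9, 2.9)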

14
Discussion
  • There is nothing intrinsically wrong with any of
    the performance evaluation measures or confidence
    tests discussed. It is all a matter of thinking
    about which one to use when, and what the results
    mean (both in terms of added value and
    limitations).
  • A simple conceptualization of the problem with
    current evaluation practices:
  • Evaluation metrics and confidence measures
    summarize the results → ML practitioners must
    understand the terms of these summarizations and
    verify that their assumptions are satisfied.
  • In certain cases, however, it is necessary to
    look further and, eventually, borrow practices
    from other disciplines. In yet other cases, it
    pays to devise our own methods. Both instances
    are discussed in what follows.