1
Learning Algorithm Evaluation
2
Algorithm evaluation Outline
  • Why?
  • Overfitting
  • How?
  • Train/Test vs Cross-validation
  • What?
  • Evaluation measures
  • Who wins?
  • Statistical significance

3
Introduction
4
Introduction
  • A model should perform well on unseen data drawn
    from the same distribution

5
Classification accuracy
  • A performance measure
  • Success: the instance's class is predicted correctly
  • Error: the instance's class is predicted incorrectly
  • Error rate: #errors / #instances
  • Accuracy: #successes / #instances
  • Quiz: 50 examples, 10 classified incorrectly. Accuracy? Error rate? (see the sketch below)
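A minimal sketch of both measures, using the quiz numbers above and assuming plain Python:

  # Accuracy and error rate from counts of correct/incorrect predictions
  n_instances = 50                 # examples in the quiz
  n_errors = 10                    # classified incorrectly
  n_successes = n_instances - n_errors

  accuracy = n_successes / n_instances    # 40/50 = 0.8
  error_rate = n_errors / n_instances     # 10/50 = 0.2
  print(f"accuracy = {accuracy:.2f}, error rate = {error_rate:.2f}")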

6
Evaluation
Rule 1
Never evaluate on training data!
7
Train and Test
Step 1: Randomly split the data into a training set and a test set (e.g. 2/3 - 1/3).
The test set is also known as the holdout set.
8
Train and Test
Step 2: Train the model on the training data.
9
Train and Test
Step 3: Evaluate the model on the test data (see the sketch below).
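A minimal sketch of steps 1-3 (split, train, evaluate), assuming scikit-learn, its bundled iris data, and a decision tree learner; the 2/3 - 1/3 split mirrors the example above.

  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.metrics import accuracy_score

  X, y = load_iris(return_X_y=True)

  # Step 1: random 2/3 - 1/3 split into training and test (holdout) set
  X_train, X_test, y_train, y_test = train_test_split(
      X, y, test_size=1/3, random_state=42, stratify=y)

  # Step 2: train the model on the training data only
  model = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

  # Step 3: evaluate on the held-out test data
  print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))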
10
Train and Test
Quiz: Can I retry with other parameter settings?
11
Evaluation
Rule 1
Never evaluate on training data!
Rule 2
Never train on test data! (that includes
parameter setting or feature selection)
12
Train and Test
Step 4: Optimize parameters on a separate validation set.
(Figure: data split into training, validation, and testing portions.)
13
Test data leakage
  • Never use test data to create the classifier
  • Can be tricky, e.g., with social network data
  • The proper procedure uses three sets:
  • training set: train models
  • validation set: optimize algorithm parameters
  • test set: evaluate the final model (see the sketch below)
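A minimal sketch of the three-set procedure, assuming scikit-learn; the 60/20/20 split and the candidate max_depth values are made up for illustration.

  from sklearn.datasets import load_iris
  from sklearn.model_selection import train_test_split
  from sklearn.tree import DecisionTreeClassifier
  from sklearn.metrics import accuracy_score

  X, y = load_iris(return_X_y=True)

  # Split into training, validation, and test sets (60/20/20)
  X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
  X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

  # Optimize a parameter on the validation set (never on the test set)
  best_depth, best_acc = None, -1.0
  for depth in [1, 2, 3, 5, None]:          # hypothetical candidate values
      model = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_train, y_train)
      acc = accuracy_score(y_val, model.predict(X_val))
      if acc > best_acc:
          best_depth, best_acc = depth, acc

  # Evaluate the chosen model once on the untouched test set
  final = DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X_train, y_train)
  print("test accuracy:", accuracy_score(y_test, final.predict(X_test)))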

14
Making the most of the data
  • Once evaluation is complete, all the data can be used to build the final classifier
  • Trade-off: performance vs. evaluation accuracy
  • More training data: better model (but diminishing returns)
  • More test data: more accurate error estimate

15
Train and Test
Step 5: Build the final model on ALL the data (more data, better model).
16
Cross-Validation
17
k-fold Cross-validation
  • Split the data (stratified) into k folds
  • Use k-1 folds for training, 1 for testing
  • Repeat k times, so each fold is used for testing once
  • Average the results (see the sketch below)
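A minimal sketch of stratified k-fold cross-validation, assuming scikit-learn; k = 10 follows the next slide.

  from sklearn.datasets import load_iris
  from sklearn.model_selection import StratifiedKFold, cross_val_score
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)

  # Stratified 10-fold CV: each fold is used once for testing, 9 times for training
  cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
  scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

  print("per-fold accuracy:", scores)
  print("mean accuracy:", scores.mean())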

18
Cross-validation
  • Standard method: stratified ten-fold cross-validation
  • Why 10? Enough to reduce sampling bias
  • Experimentally determined

19
Leave-One-Out Cross-validation
  • A particular form of cross-validation:
  • #folds = #instances
  • For n instances, build the classifier n times
  • Makes best use of the data, no sampling bias
  • Computationally expensive (see the sketch below)
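A minimal leave-one-out sketch, again assuming scikit-learn; with n instances the classifier is fitted n times, one held-out instance per fit.

  from sklearn.datasets import load_iris
  from sklearn.model_selection import LeaveOneOut, cross_val_score
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)

  # #folds = #instances: each instance is held out exactly once
  scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=LeaveOneOut())
  print("LOO accuracy:", scores.mean())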

20
ROC Analysis
21
ROC Analysis
  • Stands for Receiver Operating Characteristic
  • From signal processing: the trade-off between hit rate and false alarm rate over a noisy channel
  • Compute FPR and TPR and plot them in ROC space
  • Every classifier is a point in ROC space
  • For probabilistic algorithms:
  • Collect many points by varying the prediction threshold (see the sketch below)
  • Or, make the algorithm cost-sensitive and vary the costs (see below)
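A minimal sketch of collecting ROC points by sweeping the prediction threshold, assuming scikit-learn, a toy two-class problem, and logistic regression as the probabilistic classifier; the threshold grid is made up for illustration.

  import numpy as np
  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=500, random_state=0)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

  # Predicted probability of the positive class on the test set
  p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

  # One (FPR, TPR) point per threshold: predict + if P(+) > t
  for t in [0.1, 0.3, 0.5, 0.7, 0.9]:
      pred = (p > t).astype(int)
      tp = np.sum((pred == 1) & (y_te == 1)); fn = np.sum((pred == 0) & (y_te == 1))
      fp = np.sum((pred == 1) & (y_te == 0)); tn = np.sum((pred == 0) & (y_te == 0))
      print(f"t={t}: TPR={tp/(tp+fn):.2f}, FPR={fp/(fp+tn):.2f}")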

22
Confusion Matrix
                   actual +              actual -
  predicted +      TP (true positive)    FP (false positive)
  predicted -      FN (false negative)   TN (true negative)

  TP rate (sensitivity) = TP / (TP + FN)
  FP rate (fall-out)    = FP / (FP + TN)
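A minimal sketch of the matrix and the two rates, assuming scikit-learn; the label vectors are made up for illustration.

  from sklearn.metrics import confusion_matrix

  y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # hypothetical actual classes
  y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # hypothetical predictions

  # confusion_matrix returns rows = actual, columns = predicted: [[TN, FP], [FN, TP]]
  tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

  tp_rate = tp / (tp + fn)   # sensitivity
  fp_rate = fp / (fp + tn)   # fall-out
  print(f"TP rate = {tp_rate:.2f}, FP rate = {fp_rate:.2f}")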
23
ROC space
(Figure: ROC space with one point per classifier, e.g. OneR, J48, and J48 with fitted parameters.)
24
ROC curves
Change the prediction threshold t: predict + if P(+) > t.
(Figure: ROC curve traced by varying t, with Area Under Curve (AUC) = 0.75.)
25
ROC curves
  • Alternative method (easier, but less intuitive):
  • Rank the test instances by predicted probability
  • Start the curve in (0,0) and move down the probability list
  • If the instance is positive, move up; if negative, move right
  • Jagged curve: one set of test data
  • Smooth curve: use cross-validation (see the sketch below)
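A minimal sketch of the ranking-based curve and its AUC, assuming scikit-learn; roc_curve performs the ranking and threshold sweep internally, and the toy data mirrors the earlier sketch.

  from sklearn.datasets import make_classification
  from sklearn.linear_model import LogisticRegression
  from sklearn.metrics import roc_curve, auc
  from sklearn.model_selection import train_test_split

  X, y = make_classification(n_samples=500, random_state=0)
  X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
  p = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

  # Rank test instances by predicted probability and trace the curve
  fpr, tpr, thresholds = roc_curve(y_te, p)
  print("AUC =", auc(fpr, tpr))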

26
ROC curvesMethod selection
  • Overall: use the method with the largest Area Under the ROC curve (AUROC)
  • If you aim to cover just 40% of the true positives in a sample: use method A
  • Large sample: use method B
  • In between: choose between A and B with appropriate probabilities

27
ROC Space and Costs
28
Different Costs
  • In practice, FP and FN errors incur different costs
  • Examples:
  • Medical diagnostic tests: does X have leukemia?
  • Loan decisions: approve mortgage for X?
  • Promotional mailing: will X buy the product?
  • Add a cost matrix to the evaluation that weighs TP, FP, FN, and TN (see the sketch below)

                 predicted +    predicted -
  actual +       c_TP = 0       c_FN = 1
  actual -       c_FP = 1       c_TN = 0
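A minimal sketch of weighting the confusion matrix by the cost matrix above (c_FP = c_FN = 1, correct predictions cost 0), assuming scikit-learn; the label vectors are made up for illustration.

  import numpy as np
  from sklearn.metrics import confusion_matrix

  y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]   # hypothetical actual classes
  y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]   # hypothetical predictions

  # confusion_matrix layout: rows = actual (-, +), columns = predicted (-, +)
  cm = confusion_matrix(y_true, y_pred)
  cost = np.array([[0, 1],     # c_TN = 0, c_FP = 1
                   [1, 0]])    # c_FN = 1, c_TP = 0

  total_cost = np.sum(cm * cost)            # weigh each cell by its cost
  print("average cost per instance:", total_cost / len(y_true))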
29
Statistical Significance
30
Comparing data mining schemes
  • Which of two learning algorithms performs better?
  • Note: this is domain dependent!
  • Obvious way: compare the 10-fold CV estimates
  • Problem: variance in the estimate
  • Variance can be reduced using repeated CV (see the sketch below)
  • However, we still don't know whether the results are reliable
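A minimal sketch of comparing two learners with repeated 10-fold CV, assuming scikit-learn; the two learners and the 10x10 setup are chosen for illustration, and the standard deviation shows the variance of the estimate.

  from sklearn.datasets import load_iris
  from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
  from sklearn.naive_bayes import GaussianNB
  from sklearn.tree import DecisionTreeClassifier

  X, y = load_iris(return_X_y=True)
  cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)

  for name, model in [("A: naive Bayes", GaussianNB()),
                      ("B: decision tree", DecisionTreeClassifier(random_state=0))]:
      scores = cross_val_score(model, X, y, cv=cv)
      print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")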

31
Significance tests
  • Significance tests tell us how confident we can be that there really is a difference
  • Null hypothesis: there is no real difference
  • Alternative hypothesis: there is a difference
  • A significance test measures how much evidence there is in favor of rejecting the null hypothesis
  • E.g., given 10 cross-validation scores for each: is B better than A?

(Figure: distributions P(perf) of the performance scores of Algorithm A and Algorithm B, with mean A and mean B marked.)
32
Paired t-test
  • Student's t-test tells us whether the means of two samples (e.g., 10 cross-validation scores) are significantly different
  • Use a paired t-test when the individual samples are paired
  • i.e., they use the same randomization
  • the same CV folds are used for both algorithms (see the sketch below)
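A minimal sketch of the paired test, assuming SciPy; the two score lists stand for 10 cross-validation scores obtained on the same folds and are made up for illustration.

  from scipy.stats import ttest_rel

  # Hypothetical accuracies of A and B on the same 10 CV folds (paired samples)
  scores_a = [0.80, 0.82, 0.78, 0.85, 0.81, 0.79, 0.83, 0.80, 0.84, 0.82]
  scores_b = [0.84, 0.85, 0.80, 0.88, 0.83, 0.82, 0.86, 0.83, 0.87, 0.85]

  t_stat, p_value = ttest_rel(scores_b, scores_a)
  print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
  # Reject the null hypothesis (no real difference) if p is below the chosen level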

William Gosset. Born 1876 in Canterbury; died 1937 in Beaconsfield, England. Worked as a chemist in the Guinness brewery in Dublin from 1899. Invented the t-test to handle small samples for quality control in brewing. Wrote under the name "Student".
33
Performing the test
  • Fix a significance level α
  • A significant difference at level α implies a (100 - α)% chance that there really is a difference
  • Scientific work: α of 5% or smaller (> 95% certainty)
  • Divide α by two (two-tailed test)
  • Look up the z-value corresponding to α/2
  • If t ≥ z or t ≤ -z, the difference is significant
  • i.e., the null hypothesis can be rejected

  α (%)    z
  0.1      4.3
  0.5      3.25
  1        2.82
  5        1.83
  10       1.38
  20       0.88
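A minimal sketch of the decision procedure above, assuming SciPy for the critical value; the paired differences are made up for illustration, and with 10 paired scores the test has 9 degrees of freedom.

  import numpy as np
  from scipy.stats import t as t_dist

  # Hypothetical paired differences between B's and A's 10 CV scores
  diffs = np.array([0.04, 0.03, 0.02, 0.05, 0.02, 0.03, 0.03, 0.01, 0.03, 0.03])

  n = len(diffs)
  t_stat = diffs.mean() / (diffs.std(ddof=1) / np.sqrt(n))

  alpha = 0.05                                    # chosen significance level
  z = t_dist.ppf(1 - alpha / 2, df=n - 1)         # two-tailed critical value, 9 df
  print(f"t = {t_stat:.2f}, critical value = {z:.2f}")
  print("significant" if abs(t_stat) >= z else "not significant")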