Title: Evaluation
1 Evaluation
2 Evaluation
- How predictive is the model we learned?
- Error on the training data is not a good indicator of performance on future data
- Simple solution that can be used if lots of (labeled) data is available: split data into training and test set
- However, (labeled) data is usually limited
- More sophisticated techniques need to be used
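The train/test split mentioned above can be sketched in a few lines (a minimal illustration; the function name and the toy dataset are my own, not from the slides):

```python
import random

def train_test_split(data, test_fraction=1/3, seed=0):
    """Shuffle the data, then hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

instances = list(range(30))          # toy "dataset" of 30 instances
train, test = train_test_split(instances)
print(len(train), len(test))         # 20 10
```

The classifier is then built on `train` only, and its error measured on `test`, which it has never seen.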
3 Training and testing
- Natural performance measure for classification problems: error rate
- Success: instance's class is predicted correctly
- Error: instance's class is predicted incorrectly
- Error rate: proportion of errors made over the whole set of instances
- Resubstitution error: error rate obtained from the training data
- Resubstitution error is (hopelessly) optimistic!
- Test set: independent instances that have played no part in the formation of the classifier
- Assumption: both training data and test data are representative samples of the underlying problem
4 Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier
- Generally, the larger the training data, the better the classifier
- The larger the test data, the more accurate the error estimate
- Holdout procedure: method of splitting the original data into training and test set
- Dilemma: ideally both training set and test set should be large!
5 Predicting performance
- Assume the estimated error rate is 25%. How close is this to the true error rate?
- Depends on the amount of test data
- Prediction is just like tossing a (biased!) coin
- "Head" is a success, "tail" is an error
- In statistics, a succession of independent events like this is called a Bernoulli process
- Statistical theory provides us with confidence intervals for the true underlying proportion
6 Confidence intervals
- We can say: p lies within a certain specified interval with a certain specified confidence
- Example: S = 750 successes in N = 1000 trials
- Estimated success rate: 75%
- How close is this to the true success rate p?
- Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
- Another example: S = 75 and N = 100
- Estimated success rate: 75%
- With 80% confidence, p ∈ [69.1%, 80.1%]
- I.e., the probability that p ∈ [69.1%, 80.1%] is 0.8
- The bigger the N, the more confident we are, i.e. the surrounding interval is smaller
- Above, for N = 100 we were less confident than for N = 1000
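The intervals quoted in these examples are slightly tighter than the plain normal approximation; they match the Wilson score interval, which these slides appear to use. A sketch (taking z ≈ 1.2816 for 80% two-sided confidence from standard normal tables):

```python
from math import sqrt

def wilson_interval(f, N, z):
    """Wilson score interval for a success proportion f observed over N trials."""
    center = f + z * z / (2 * N)
    spread = z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom = 1 + z * z / N
    return (center - spread) / denom, (center + spread) / denom

z80 = 1.2816  # z-value for 80% two-sided confidence
print(wilson_interval(0.75, 1000, z80))  # ≈ (0.732, 0.767)
print(wilson_interval(0.75, 100, z80))   # ≈ (0.691, 0.801)
```

Both example intervals from the slide are reproduced; only N changes between the two calls.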
7 Mean and Variance
- Let Y be the random variable with possible values 1 for success and 0 for error
- Let the probability of success be p
- Then the probability of error is q = 1 - p
- What's the mean?
- 1·p + 0·q = p
- What's the variance?
- (1 - p)²·p + (0 - p)²·q = q²·p + p²·q = pq(p + q) = pq
8 Estimating p
- Well, we don't know p. Our goal is to estimate p.
- For this we make N trials, i.e. tests.
- The more trials we do, the more confident we are.
- Let S be the random variable denoting the number of successes, i.e. S is the sum of N value samplings of Y.
- Now, we approximate p with the success rate in N trials, i.e. S/N.
- By the Central Limit Theorem, when N is big, the probability distribution of the random variable f = S/N is approximated by a normal distribution with mean p and variance pq/N.
9 Estimating p
- The c confidence interval [-z, z] for a random variable X with 0 mean is given by Pr[-z ≤ X ≤ z] = c
- With a symmetric distribution: Pr[-z ≤ X ≤ z] = 1 - 2·Pr[X ≥ z]
- Confidence limits for the normal distribution with 0 mean and a variance of 1: e.g. Pr[-1.65 ≤ X ≤ 1.65] = 90%
- To use this we have to transform our random variable f = S/N to have 0 mean and unit variance
10 Estimating p
- Thus Pr[-1.65 ≤ X ≤ 1.65] = 90%
- To use this we have to transform our random variable S/N to have 0 mean and unit variance: Pr[-1.65 ≤ (S/N - p) / σ_{S/N} ≤ 1.65] = 90%
- Now we solve the two equations: (S/N - p) / σ_{S/N} = 1.65 and (S/N - p) / σ_{S/N} = -1.65
11Estimating p
Let N100, and S70 ?S/N is sqrt(pq/N) and we
approximate it by sqrt(p'(1-p')/N) where p'
is the estimation of p, i.e. 0.7 So, ?S/N is
approximated by sqrt(.7.3/100) .046 The two
equations become (0.7 p) / .046 1.65 p .7
- 1.65.046 .624 (0.7 p) / .046 -1.65 p
.7 1.65.046 .776
Thus, we say With a 90 confidence we have that
the success rate p of the classifier will be
0.624 ? p ? 0.776
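The worked example above (the normal approximation with z = 1.65) is easy to check mechanically; a minimal sketch with an illustrative function name:

```python
from math import sqrt

def normal_approx_interval(p_hat, N, z):
    """Confidence interval for p using sigma ≈ sqrt(p'(1 - p')/N)."""
    sigma = sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - z * sigma, p_hat + z * sigma

lo, hi = normal_approx_interval(0.7, 100, 1.65)
print(round(lo, 3), round(hi, 3))  # 0.624 0.776
```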
12 Exercise
- Suppose I want to be 95% confident in my estimation.
- Looking at a detailed table we find Pr[-2 ≤ X ≤ 2] ≈ 95%
- Normalizing S/N, we need to solve:
- (S/N - p) / σ_{S/N} = 2
- (S/N - p) / σ_{S/N} = -2
- We approximate σ_{S/N} with sqrt(p′(1 - p′)/N), where p′ is the estimate of p through trials, i.e. S/N
13 Exercise
- Suppose N = 1000 trials, S = 590 successes
- p′ = S/N = 590/1000 = 0.59
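Carrying the exercise through with the recipe from the previous slide (z = 2 for 95% confidence; the resulting bounds below are my own computation, not given on the slide):

```python
from math import sqrt

N, S = 1000, 590
p_hat = S / N                          # p' = 0.59
sigma = sqrt(p_hat * (1 - p_hat) / N)  # ≈ 0.0156
lo, hi = p_hat - 2 * sigma, p_hat + 2 * sigma
print(round(lo, 3), round(hi, 3))      # ≈ 0.559 0.621
```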
14 Holdout estimation
- What to do if the amount of data is limited?
- The holdout method reserves a certain amount for testing and uses the remainder for training
- Usually: one third for testing, the rest for training
- Problem: the samples might not be representative
- Example: a class might be missing in the test data
- Advanced version uses stratification
- Ensures that each class is represented with approximately equal proportions in both subsets
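Stratification can be sketched by splitting each class separately and recombining (a toy illustration; names and the 2:1 toy dataset are my own):

```python
import random
from collections import defaultdict

def stratified_holdout(instances, labels, test_fraction=1/3, seed=0):
    """Split each class separately so both subsets keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append((x, y))
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        cut = round(len(items) * test_fraction)
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test

X = list(range(90))
y = ["a"] * 60 + ["b"] * 30            # 2:1 class ratio
train, test = stratified_holdout(X, y)  # both subsets preserve the 2:1 ratio
```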
15 Repeated holdout method
- The holdout estimate can be made more reliable by repeating the process with different subsamples
- In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
- The error rates on the different iterations are averaged to yield an overall error rate
- This is called the repeated holdout method
- Still not optimal: the different test sets overlap
- Can we prevent overlapping?
16 Cross-validation
- Cross-validation avoids overlapping test sets
- First step: split the data into k subsets of equal size
- Second step: use each subset in turn for testing, the remainder for training
- This is called k-fold cross-validation
- Often the subsets are stratified before the cross-validation is performed
- The error estimates are averaged to yield an overall error estimate
- Standard method for evaluation: stratified 10-fold cross-validation
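The two steps above can be sketched as follows (unstratified, for brevity; `train_and_error` stands in for whatever inducer is being evaluated):

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k disjoint test folds of (near-)equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_and_error):
    """Average the test error over k train/test splits."""
    errors = []
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        test_set = [data[i] for i in test_idx]
        train_set = [data[i] for i in range(len(data)) if i not in held_out]
        errors.append(train_and_error(train_set, test_set))
    return sum(errors) / k

# toy use: a dummy inducer whose error we pretend to measure
err = cross_validate(list(range(10)), 10, lambda tr, te: 0.0)
```

With k equal to the number of instances, this is exactly the Leave-One-Out procedure of the next slide.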
17 Leave-One-Out cross-validation
- Leave-One-Out: a particular form of cross-validation
- Set the number of folds to the number of training instances
- I.e., for n training instances, build the classifier n times
- Makes best use of the data
- Involves no random subsampling
- But computationally expensive
18 Leave-One-Out-CV and stratification
- Disadvantage of Leave-One-Out-CV: stratification is not possible
- It guarantees a non-stratified sample because there is only one instance in the test set!
- Extreme example: a completely random dataset split equally into two classes
- The best inducer predicts the majority class
- 50% accuracy on fresh data
- But the Leave-One-Out-CV estimate is 100% error!
- (Holding out the test instance leaves its class in the minority, so the majority-class predictor always picks the other class)
19 The bootstrap
- Cross-validation uses sampling without replacement
- The same instance, once selected, cannot be selected again for a particular training/test set
- The bootstrap uses sampling with replacement to form the training set
- Sample a dataset of n instances n times with replacement to form a new dataset of n instances
- Use this data as the training set
- Use the instances from the original dataset that don't occur in the new training set for testing
- Also called the 0.632 bootstrap (why?)
20 The 0.632 bootstrap
- Also called the 0.632 bootstrap
- A particular instance has a probability of 1 - 1/n of not being picked in one draw
- Thus its probability of ending up in the test data is (1 - 1/n)ⁿ ≈ e⁻¹ ≈ 0.368
- This means the training data will contain approximately 63.2% of the instances
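The limit behind the name is easy to verify numerically: as n grows, (1 - 1/n)ⁿ approaches e⁻¹ ≈ 0.368, so roughly 63.2% of the instances end up in the training set.

```python
from math import exp

for n in (10, 100, 1000, 10000):
    p_test = (1 - 1 / n) ** n   # chance an instance is never picked in n draws
    print(n, round(p_test, 4))
print(round(exp(-1), 4))        # the limit: 0.3679
```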
21 Estimating error with the bootstrap
- The error estimate on the test data will be very pessimistic
- Trained on just ~63% of the instances
- Therefore, combine it with the resubstitution error
- The resubstitution error gets less weight than the error on the test data: err = 0.632 × e_test + 0.368 × e_train
- Repeat the process several times with different replacement samples; average the results
- Probably the best way of estimating performance for very small datasets
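The weighted combination of test-set error and resubstitution error, with the standard 0.632/0.368 weights for this method, is a one-liner (the example error values are made up for illustration):

```python
def bootstrap_632_error(test_error, resubstitution_error):
    """Weighted error combination used by the 0.632 bootstrap."""
    return 0.632 * test_error + 0.368 * resubstitution_error

print(bootstrap_632_error(0.30, 0.05))  # 0.632*0.30 + 0.368*0.05 = 0.208
```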
22 Comparing data mining schemes
- Frequent question: which of two learning schemes performs better?
- Note: this is domain dependent!
- Obvious way: compare 10-fold cross-validation estimates
- Problem: variance in the estimate
- Variance can be reduced using repeated CV
- However, we still don't know whether the results are reliable (need to use a statistical test for that)
23 Counting the cost
- In practice, different types of classification errors often incur different costs
- Examples:
- Cancer detection: predicting "not cancer" is correct 99% of the time
- Loan decisions
- Fault diagnosis
- Promotional mailing
24 Counting the cost

                        Predicted class
                        Yes              No
  Actual class  Yes     True positive    False negative
  Actual class  No      False positive   True negative
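The four cells of the table can be counted directly from paired actual/predicted labels (a minimal sketch; the toy labels are my own):

```python
def confusion_matrix(actual, predicted, positive="yes"):
    """Count TP, FN, FP, TN for a two-class problem."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

actual    = ["yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 2)
```

Cost-sensitive evaluation then weights each cell by the cost of that kind of outcome instead of treating all errors equally.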
25 Paired t-test
- Student's t-test tells whether the means of two samples are significantly different
- In our case the samples are cross-validation samples for different datasets from the domain
- Use a paired t-test because the individual samples are paired
- The same cross-validation is applied twice
26 Distribution of the means
- x1, x2, …, xk and y1, y2, …, yk are the 2k samples for the k different datasets
- mx and my are the means
- With enough samples, the mean of a set of independent samples is normally distributed
- Estimated variances of the means are sx²/k and sy²/k
- If μx and μy are the true means, then (mx - μx) / sqrt(sx²/k) and (my - μy) / sqrt(sy²/k) are approximately normally distributed with mean 0 and variance 1
27 Distribution of the differences
- Let md = mx - my
- The difference of the means (md) has a Student's distribution with k - 1 degrees of freedom
- Let sd² be the estimated variance of the differences
- The standardized version of md is called the t-statistic: t = md / sqrt(sd²/k)
28 Performing the test
- Fix a significance level α
- If a difference is significant at the α% level, there is a (100 - α)% chance that the true means differ
- Divide the significance level by two because the test is two-tailed
- Look up the value z that corresponds to α/2
- If t ≤ -z or t ≥ z then the difference is significant
- I.e. the null hypothesis (that the difference is zero) can be rejected
29 Example
- We have compared two classifiers through cross-validation on 10 different datasets.
- The error rates are:

  Dataset   Classifier A   Classifier B   Difference
  1         10.6           10.2           0.4
  2          9.8            9.4           0.4
  3         12.3           11.8           0.5
  4          9.7            9.1           0.6
  5          8.8            8.3           0.5
  6         10.6           10.2           0.4
  7          9.8            9.4           0.4
  8         12.3           11.8           0.5
  9          9.7            9.1           0.6
  10         8.8            8.3           0.5
30 Example
- The critical value of t for a two-tailed statistical test with α = 10% and 9 degrees of freedom is 1.83
- The computed t-statistic 19.24 is way bigger than 1.83, so classifier B is much better than A.
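The t-statistic quoted above follows directly from the error rates on slide 29 and the formula t = md / sqrt(sd²/k) from slide 27:

```python
from math import sqrt

a = [10.6, 9.8, 12.3, 9.7, 8.8, 10.6, 9.8, 12.3, 9.7, 8.8]  # classifier A
b = [10.2, 9.4, 11.8, 9.1, 8.3, 10.2, 9.4, 11.8, 9.1, 8.3]  # classifier B
d = [x - y for x, y in zip(a, b)]                           # paired differences

k = len(d)
md = sum(d) / k                                  # mean difference: 0.48
sd2 = sum((x - md) ** 2 for x in d) / (k - 1)    # sample variance of differences
t = md / sqrt(sd2 / k)
print(round(t, 2))  # ≈ 19.24
```

Since 19.24 far exceeds the critical value 1.83, the null hypothesis of zero difference is rejected.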