Title: Evaluation
1 Evaluation
2 Evaluation
- How predictive is the model we learned?
- Error on the training data is not a good indicator of performance on future data
- Simple solution that can be used if lots of (labeled) data is available: split data into training and test set
- However, (labeled) data is usually limited
- More sophisticated techniques need to be used
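The train/test split mentioned above can be sketched in a few lines (a minimal illustration; the function name and the toy dataset are my own, not from the slides):

```python
import random

def train_test_split(data, test_fraction=1/3, seed=0):
    """Shuffle the data, then hold out a fraction for testing."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)

instances = list(range(30))          # toy "dataset" of 30 instances
train, test = train_test_split(instances)
print(len(train), len(test))         # 20 10
```

The classifier is then built on `train` only, and its error measured on `test`, which it has never seen.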
3 Training and testing
- Natural performance measure for classification problems: error rate
- Success: instance's class is predicted correctly
- Error: instance's class is predicted incorrectly
- Error rate: proportion of errors made over the whole set of instances
- Resubstitution error: error rate obtained from the training data
- Resubstitution error is (hopelessly) optimistic!
- Test set: independent instances that have played no part in the formation of the classifier
- Assumption: both training data and test data are representative samples of the underlying problem
4 Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier
- Generally, the larger the training data, the better the classifier
- The larger the test data, the more accurate the error estimate
- Holdout procedure: method of splitting the original data into training and test set
- Dilemma: ideally both training set and test set should be large!
5 Predicting performance
- Assume the estimated error rate is 25%. How close is this to the true error rate?
- Depends on the amount of test data
- Prediction is just like tossing a (biased!) coin
- "Head" is a success, "tail" is an error
- In statistics, a succession of independent events like this is called a Bernoulli process
- Statistical theory provides us with confidence intervals for the true underlying proportion
6 Confidence intervals
- We can say: p lies within a certain specified interval with a certain specified confidence
- Example: S = 750 successes in N = 1000 trials
- Estimated success rate: 75%
- How close is this to the true success rate p?
- Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
- Another example: S = 75 and N = 100
- Estimated success rate: 75%
- With 80% confidence, p ∈ [69.1%, 80.1%]
- I.e., the probability that p ∈ [69.1%, 80.1%] is 0.8
- The bigger the N, the more confident we are, i.e. the surrounding interval is smaller
- Above, for N = 100 we were less confident than for N = 1000
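The intervals quoted in these examples are slightly tighter than the plain normal approximation; they match the Wilson score interval, which these slides appear to use. A sketch (taking z ≈ 1.2816 for 80% two-sided confidence from standard normal tables):

```python
from math import sqrt

def wilson_interval(f, N, z):
    """Wilson score interval for a success proportion f observed over N trials."""
    center = f + z * z / (2 * N)
    spread = z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    denom = 1 + z * z / N
    return (center - spread) / denom, (center + spread) / denom

z80 = 1.2816  # z-value for 80% two-sided confidence
print(wilson_interval(0.75, 1000, z80))  # ≈ (0.732, 0.767)
print(wilson_interval(0.75, 100, z80))   # ≈ (0.691, 0.801)
```

Both example intervals from the slide are reproduced; only N changes between the two calls.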
7 Mean and Variance
- Let Y be the random variable with possible values 1 for success and 0 for error
- Let the probability of success be p
- Then the probability of error is q = 1 - p
- What's the mean?
- 1·p + 0·q = p
- What's the variance?
- (1 - p)²·p + (0 - p)²·q = q²·p + p²·q = pq(p + q) = pq
8 Estimating p
- Well, we don't know p. Our goal is to estimate p.
- For this we make N trials, i.e. tests.
- The more trials we do, the more confident we are.
- Let S be the random variable denoting the number of successes, i.e. S is the sum of N value samplings of Y.
- Now, we approximate p with the success rate in N trials, i.e. S/N.
- By the Central Limit Theorem, when N is big, the probability distribution of the random variable f = S/N is approximated by a normal distribution with mean p and variance pq/N.
9 Estimating p
- The c confidence interval [-z, z] for a random variable X with 0 mean is given by Pr[-z ≤ X ≤ z] = c
- With a symmetric distribution: Pr[-z ≤ X ≤ z] = 1 - 2·Pr[X ≥ z]
- Confidence limits for the normal distribution with 0 mean and a variance of 1: e.g. Pr[-1.65 ≤ X ≤ 1.65] = 90%
- To use this we have to transform our random variable f = S/N to have 0 mean and unit variance
10 Estimating p
- Thus Pr[-1.65 ≤ X ≤ 1.65] = 90%
- To use this we have to transform our random variable S/N to have 0 mean and unit variance: Pr[-1.65 ≤ (S/N - p) / σ_{S/N} ≤ 1.65] = 90%
- Now we solve the two equations: (S/N - p) / σ_{S/N} = 1.65 and (S/N - p) / σ_{S/N} = -1.65
11Estimating p
Let N100, and S70 ?S/N is sqrt(pq/N) and we
approximate it by sqrt(p'(1-p')/N) where p'
is the estimation of p, i.e. 0.7 So, ?S/N is
approximated by sqrt(.7.3/100) .046 The two
equations become (0.7 p) / .046 1.65 p .7
- 1.65.046 .624 (0.7 p) / .046 -1.65 p
.7 1.65.046 .776
Thus, we say With a 90 confidence we have that
the success rate p of the classifier will be
0.624 ? p ? 0.776
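The worked example above (the normal approximation with z = 1.65) is easy to check mechanically; a minimal sketch with an illustrative function name:

```python
from math import sqrt

def normal_approx_interval(p_hat, N, z):
    """Confidence interval for p using sigma ≈ sqrt(p'(1 - p')/N)."""
    sigma = sqrt(p_hat * (1 - p_hat) / N)
    return p_hat - z * sigma, p_hat + z * sigma

lo, hi = normal_approx_interval(0.7, 100, 1.65)
print(round(lo, 3), round(hi, 3))  # 0.624 0.776
```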
12 Exercise
- Suppose I want to be 95% confident in my estimation.
- Looking at a detailed table we find Pr[-2 ≤ X ≤ 2] ≈ 95%
- Normalizing S/N, we need to solve:
- (S/N - p) / σ_{S/N} = 2
- (S/N - p) / σ_{S/N} = -2
- We approximate σ_{S/N} with sqrt(p′(1 - p′)/N), where p′ is the estimate of p through trials, i.e. S/N
13 Exercise
- Suppose N = 1000 trials, S = 590 successes
- p′ = S/N = 590/1000 = 0.59
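Carrying the exercise through with the recipe from the previous slide (z = 2 for 95% confidence; the resulting bounds below are my own computation, not given on the slide):

```python
from math import sqrt

N, S = 1000, 590
p_hat = S / N                          # p' = 0.59
sigma = sqrt(p_hat * (1 - p_hat) / N)  # ≈ 0.0156
lo, hi = p_hat - 2 * sigma, p_hat + 2 * sigma
print(round(lo, 3), round(hi, 3))      # ≈ 0.559 0.621
```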
14 Holdout estimation
- What to do if the amount of data is limited?
- The holdout method reserves a certain amount for testing and uses the remainder for training
- Usually: one third for testing, the rest for training
- Problem: the samples might not be representative
- Example: a class might be missing in the test data
- Advanced version uses stratification
- Ensures that each class is represented with approximately equal proportions in both subsets
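Stratification can be sketched by splitting each class separately and recombining (a toy illustration; names and the 2:1 toy dataset are my own):

```python
import random
from collections import defaultdict

def stratified_holdout(instances, labels, test_fraction=1/3, seed=0):
    """Split each class separately so both subsets keep the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append((x, y))
    train, test = [], []
    for items in by_class.values():
        rng.shuffle(items)
        cut = round(len(items) * test_fraction)
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test

X = list(range(90))
y = ["a"] * 60 + ["b"] * 30            # 2:1 class ratio
train, test = stratified_holdout(X, y)  # both subsets preserve the 2:1 ratio
```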
15 Repeated holdout method
- The holdout estimate can be made more reliable by repeating the process with different subsamples
- In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
- The error rates on the different iterations are averaged to yield an overall error rate
- This is called the repeated holdout method
- Still not optimal: the different test sets overlap
- Can we prevent overlapping?
16 Cross-validation
- Cross-validation avoids overlapping test sets
- First step: split the data into k subsets of equal size
- Second step: use each subset in turn for testing, the remainder for training
- This is called k-fold cross-validation
- Often the subsets are stratified before the cross-validation is performed
- The error estimates are averaged to yield an overall error estimate
- Standard method for evaluation: stratified 10-fold cross-validation
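The two steps above can be sketched as follows (unstratified, for brevity; `train_and_error` stands in for whatever inducer is being evaluated):

```python
def k_fold_indices(n, k):
    """Partition indices 0..n-1 into k disjoint test folds of (near-)equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_and_error):
    """Average the test error over k train/test splits."""
    errors = []
    for test_idx in k_fold_indices(len(data), k):
        held_out = set(test_idx)
        test_set = [data[i] for i in test_idx]
        train_set = [data[i] for i in range(len(data)) if i not in held_out]
        errors.append(train_and_error(train_set, test_set))
    return sum(errors) / k

# toy use: a dummy inducer whose error we pretend to measure
err = cross_validate(list(range(10)), 10, lambda tr, te: 0.0)
```

With k equal to the number of instances, this is exactly the Leave-One-Out procedure of the next slide.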
17 Leave-One-Out cross-validation
- Leave-One-Out: a particular form of cross-validation
- Set the number of folds to the number of training instances
- I.e., for n training instances, build the classifier n times
- Makes best use of the data
- Involves no random subsampling
- But computationally expensive
18 Leave-One-Out-CV and stratification
- Disadvantage of Leave-One-Out-CV: stratification is not possible
- It guarantees a non-stratified sample because there is only one instance in the test set!
- Extreme example: a completely random dataset split equally into two classes
- The best inducer predicts the majority class
- 50% accuracy on fresh data
- But the Leave-One-Out-CV estimate is 100% error!
- (Holding out the test instance leaves its class in the minority, so the majority-class predictor always picks the other class)
19 The bootstrap
- Cross-validation uses sampling without replacement
- The same instance, once selected, cannot be selected again for a particular training/test set
- The bootstrap uses sampling with replacement to form the training set
- Sample a dataset of n instances n times with replacement to form a new dataset of n instances
- Use this data as the training set
- Use the instances from the original dataset that don't occur in the new training set for testing
- Also called the 0.632 bootstrap (why?)
20 The 0.632 bootstrap
- Also called the 0.632 bootstrap
- A particular instance has a probability of 1 - 1/n of not being picked in one draw
- Thus its probability of ending up in the test data is (1 - 1/n)ⁿ ≈ e⁻¹ ≈ 0.368
- This means the training data will contain approximately 63.2% of the instances
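The limit behind the name is easy to verify numerically: as n grows, (1 - 1/n)ⁿ approaches e⁻¹ ≈ 0.368, so roughly 63.2% of the instances end up in the training set.

```python
from math import exp

for n in (10, 100, 1000, 10000):
    p_test = (1 - 1 / n) ** n   # chance an instance is never picked in n draws
    print(n, round(p_test, 4))
print(round(exp(-1), 4))        # the limit: 0.3679
```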
21 Estimating error with the bootstrap
- The error estimate on the test data will be very pessimistic
- Trained on just ~63% of the instances
- Therefore, combine it with the resubstitution error
- The resubstitution error gets less weight than the error on the test data: err = 0.632 × e_test + 0.368 × e_train
- Repeat the process several times with different replacement samples; average the results
- Probably the best way of estimating performance for very small datasets
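The weighted combination of test-set error and resubstitution error, with the standard 0.632/0.368 weights for this method, is a one-liner (the example error values are made up for illustration):

```python
def bootstrap_632_error(test_error, resubstitution_error):
    """Weighted error combination used by the 0.632 bootstrap."""
    return 0.632 * test_error + 0.368 * resubstitution_error

print(bootstrap_632_error(0.30, 0.05))  # 0.632*0.30 + 0.368*0.05 = 0.208
```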
22 Comparing data mining schemes
- Frequent question: which of two learning schemes performs better?
- Note: this is domain dependent!
- Obvious way: compare 10-fold cross-validation estimates
- Problem: variance in the estimate
- Variance can be reduced using repeated CV
- However, we still don't know whether the results are reliable (need to use a statistical test for that)
23 Counting the cost
- In practice, different types of classification errors often incur different costs
- Examples:
- Cancer detection: predicting "not cancer" is correct 99% of the time
- Loan decisions
- Fault diagnosis
- Promotional mailing
24 Counting the cost

                        Predicted class
                        Yes              No
  Actual class  Yes     True positive    False negative
  Actual class  No      False positive   True negative
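The four cells of the table can be counted directly from paired actual/predicted labels (a minimal sketch; the toy labels are my own):

```python
def confusion_matrix(actual, predicted, positive="yes"):
    """Count TP, FN, FP, TN for a two-class problem."""
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    return tp, fn, fp, tn

actual    = ["yes", "yes", "no", "no",  "yes", "no"]
predicted = ["yes", "no",  "no", "yes", "yes", "no"]
print(confusion_matrix(actual, predicted))  # (2, 1, 1, 2)
```

Cost-sensitive evaluation then weights each cell by the cost of that kind of outcome instead of treating all errors equally.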
25 Paired t-test
- Student's t-test tells whether the means of two samples are significantly different
- In our case the samples are cross-validation samples for different datasets from the domain
- Use a paired t-test because the individual samples are paired
- The same cross-validation is applied twice
26 Distribution of the means
- x1, x2, …, xk and y1, y2, …, yk are the 2k samples for the k different datasets
- mx and my are the means
- With enough samples, the mean of a set of independent samples is normally distributed
- Estimated variances of the means are sx²/k and sy²/k
- If μx and μy are the true means, then (mx - μx) / sqrt(sx²/k) and (my - μy) / sqrt(sy²/k) are approximately normally distributed with mean 0 and variance 1
27 Distribution of the differences
- Let md = mx - my
- The difference of the means (md) has a Student's distribution with k - 1 degrees of freedom
- Let sd² be the estimated variance of the differences
- The standardized version of md is called the t-statistic: t = md / sqrt(sd²/k)
28 Performing the test
- Fix a significance level α
- If a difference is significant at the α% level, there is a (100 - α)% chance that the true means differ
- Divide the significance level by two because the test is two-tailed
- Look up the value z that corresponds to α/2
- If t ≤ -z or t ≥ z then the difference is significant
- I.e. the null hypothesis (that the difference is zero) can be rejected
29 Example
- We have compared two classifiers through cross-validation on 10 different datasets.
- The error rates are:

  Dataset   Classifier A   Classifier B   Difference
  1         10.6           10.2           0.4
  2          9.8            9.4           0.4
  3         12.3           11.8           0.5
  4          9.7            9.1           0.6
  5          8.8            8.3           0.5
  6         10.6           10.2           0.4
  7          9.8            9.4           0.4
  8         12.3           11.8           0.5
  9          9.7            9.1           0.6
  10         8.8            8.3           0.5
30 Example
- The critical value of t for a two-tailed statistical test with α = 10% and 9 degrees of freedom is 1.83
- The computed t-statistic 19.24 is way bigger than 1.83, so classifier B is much better than A.
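The t-statistic quoted above follows directly from the error rates on slide 29 and the formula t = md / sqrt(sd²/k) from slide 27:

```python
from math import sqrt

a = [10.6, 9.8, 12.3, 9.7, 8.8, 10.6, 9.8, 12.3, 9.7, 8.8]  # classifier A
b = [10.2, 9.4, 11.8, 9.1, 8.3, 10.2, 9.4, 11.8, 9.1, 8.3]  # classifier B
d = [x - y for x, y in zip(a, b)]                           # paired differences

k = len(d)
md = sum(d) / k                                  # mean difference: 0.48
sd2 = sum((x - md) ** 2 for x in d) / (k - 1)    # sample variance of differences
t = md / sqrt(sd2 / k)
print(round(t, 2))  # ≈ 19.24
```

Since 19.24 far exceeds the critical value 1.83, the null hypothesis of zero difference is rejected.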