1
Evaluation
2
Evaluation
  • How predictive is the model we learned?
  • Error on the training data is not a good
    indicator of performance on future data
  • Simple solution that can be used if lots of
    (labeled) data is available:
  • Split the data into a training set and a test set
  • However, (labeled) data is usually limited
  • More sophisticated techniques need to be used

3
Training and testing
  • Natural performance measure for classification
    problems: the error rate
  • Success: the instance's class is predicted correctly
  • Error: the instance's class is predicted incorrectly
  • Error rate: proportion of errors made over the
    whole set of instances (see the sketch below)
  • Resubstitution error: error rate obtained from the
    training data
  • Resubstitution error is (hopelessly) optimistic!
  • Test set: independent instances that have played
    no part in the formation of the classifier
  • Assumption: both training data and test data are
    representative samples of the underlying problem
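
A minimal sketch of the error-rate computation just defined; the function name and toy labels are illustrative, not from the slides:

```python
def error_rate(y_true, y_pred):
    """Proportion of instances whose class is predicted incorrectly."""
    errors = sum(t != p for t, p in zip(y_true, y_pred))
    return errors / len(y_true)

# 2 errors out of 5 instances -> error rate 0.4
print(error_rate(["a", "b", "a", "a", "b"],
                 ["a", "b", "b", "b", "b"]))
```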

4
Making the most of the data
  • Once evaluation is complete, all the data can be
    used to build the final classifier
  • Generally, the larger the training set, the
    better the classifier
  • The larger the test set, the more accurate the
    error estimate
  • Holdout procedure: a method of splitting the
    original data into a training set and a test set
  • Dilemma: ideally both the training set and the
    test set should be large!

5
Predicting performance
  • Assume the estimated error rate is 25%. How close
    is this to the true error rate?
  • Depends on the amount of test data
  • Prediction is just like tossing a (biased!) coin
  • Head is a success, tail is an error
  • In statistics, a succession of independent events
    like this is called a Bernoulli process
  • Statistical theory provides us with confidence
    intervals for the true underlying proportion

6
Confidence intervals
  • We can say p lies within a certain specified
    interval with a certain specified confidence
  • Example: S = 750 successes in N = 1000 trials
  • Estimated success rate: 75%
  • How close is this to the true success rate p?
  • Answer: with 80% confidence, p ∈ [73.2%, 76.7%]
  • Another example: S = 75 and N = 100
  • Estimated success rate: 75%
  • With 80% confidence, p ∈ [69.1%, 80.1%]
  • I.e., the probability that p ∈ [69.1%, 80.1%] is 0.8
  • The bigger the N, the more confident we are, i.e.,
    the surrounding interval is smaller
  • Above, for N = 100 we were less confident than for
    N = 1000
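
A sketch of how such intervals can be computed with the normal approximation developed on the next slides; SciPy is my assumption, and since the slide's figures appear to come from a slightly different (exact) formula, the last digits differ:

```python
from math import sqrt
from scipy.stats import norm

def proportion_ci(s, n, confidence=0.80):
    """Normal-approximation confidence interval for a success proportion."""
    p_hat = s / n
    z = norm.ppf(1 - (1 - confidence) / 2)  # ~1.28 for 80% confidence
    half = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

print(proportion_ci(750, 1000))  # ~(0.732, 0.768)
print(proportion_ci(75, 100))    # ~(0.694, 0.806): smaller N, wider interval
```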

7
Mean and Variance
  • Let Y be the random variable with possible values
  • 1 for success and
  • 0 for error.
  • Let the probability of success be p.
  • Then the probability of error is q = 1 − p.
  • What's the mean?
  • 1·p + 0·q = p
  • What's the variance?
  • (1 − p)²·p + (0 − p)²·q
  • = q²·p + p²·q
  • = pq·(q + p)
  • = pq
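
A quick Monte Carlo check of these two facts (NumPy assumed; not part of the slides):

```python
import numpy as np

p = 0.75
y = np.random.default_rng(0).random(100_000) < p  # Bernoulli(p) samples

print(y.mean())              # close to the mean p = 0.75
print(y.var(), p * (1 - p))  # both close to the variance pq = 0.1875
```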

8
Estimating p
  • Well, we don't know p. Our goal is to estimate p.
  • For this we make N trials, i.e., tests.
  • The more trials we do, the more confident we are.
  • Let S be the random variable denoting the number
    of successes, i.e., S is the sum of N value
    samplings of Y.
  • Now we approximate p with the success rate in N
    trials, i.e., S/N.
  • By the Central Limit Theorem, when N is big, the
    probability distribution of the random variable
    f = S/N is approximated by a normal distribution
    with
  • mean p and
  • variance pq/N.
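
The claim is easy to check empirically; a small simulation sketch (NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 0.75, 1000

# 10,000 independent samplings of f = S/N, each from N Bernoulli(p) trials.
f = rng.binomial(N, p, size=10_000) / N

print(f.mean(), p)               # mean of f is close to p
print(f.var(), p * (1 - p) / N)  # variance of f is close to pq/N
```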

9
Estimating p
  • The confidence interval [−z, z] at confidence
    level c for a random variable X with mean 0 is
    given by
  • Pr[−z ≤ X ≤ z] = c
  • With a symmetric distribution:
  • Pr[−z ≤ X ≤ z] = 1 − 2·Pr[X ≥ z]
  • Confidence limits for the normal distribution
    with mean 0 and a variance of 1:

Thus Pr[−1.65 ≤ X ≤ 1.65] = 90%. To use this we have
to reduce our random variable f = S/N to have mean 0
and unit variance.
10
Estimating p
Thus Pr[−1.65 ≤ X ≤ 1.65] = 90%. To use this we have
to reduce our random variable S/N to have mean 0 and
unit variance: Pr[−1.65 ≤ (S/N − p) / σ_{S/N} ≤ 1.65]
= 90%. Now we solve two equations:
(S/N − p) / σ_{S/N} = 1.65
(S/N − p) / σ_{S/N} = −1.65
11
Estimating p
Let N = 100 and S = 70. σ_{S/N} is sqrt(pq/N), and we
approximate it by sqrt(p′(1 − p′)/N), where p′ is the
estimate of p, i.e., 0.7. So σ_{S/N} is approximated
by sqrt(0.7 · 0.3 / 100) ≈ 0.046. The two equations
become:
(0.7 − p) / 0.046 = 1.65, so p = 0.7 − 1.65 · 0.046 ≈ 0.624
(0.7 − p) / 0.046 = −1.65, so p = 0.7 + 1.65 · 0.046 ≈ 0.776
Thus, we say: with 90% confidence, the success rate p
of the classifier satisfies
0.624 ≤ p ≤ 0.776
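
The same arithmetic as a runnable sketch (the variable names are mine):

```python
from math import sqrt

N, S, z = 100, 70, 1.65                # z = 1.65 for 90% confidence
p_hat = S / N                          # 0.7
sigma = sqrt(p_hat * (1 - p_hat) / N)  # ~0.046

print(p_hat - z * sigma)  # lower limit, ~0.624
print(p_hat + z * sigma)  # upper limit, ~0.776
```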
12
Exercise
  • Suppose I want to be 95% confident in my
    estimation.
  • Looking at a detailed table we find
    Pr[−2 ≤ X ≤ 2] ≈ 95%
  • Normalizing S/N, we need to solve
  • (S/N − p) / σ_{S/N} = 2
  • (S/N − p) / σ_{S/N} = −2
  • We approximate σ_{S/N} with sqrt(p′(1 − p′)/N),
    where p′ is the estimate of p through trials,
    i.e., S/N
  • So we need to solve
    (p′ − p) / sqrt(p′(1 − p′)/N) = ±2
  • So, p = p′ ∓ 2 · sqrt(p′(1 − p′)/N)

13
Exercise
  • Suppose N = 1000 trials, S = 590 successes
  • p′ = S/N = 590/1000 = 0.59
  • sqrt(p′(1 − p′)/N) = sqrt(0.59 · 0.41 / 1000) ≈ 0.0156
  • So, with 95% confidence, 0.59 − 2 · 0.0156 ≤ p ≤
    0.59 + 2 · 0.0156, i.e., 0.559 ≤ p ≤ 0.621
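
And the exercise checked in code (a sketch under the same normal approximation):

```python
from math import sqrt

N, S, z = 1000, 590, 2                 # Pr[-2 <= X <= 2] ~ 95%
p_hat = S / N                          # 0.59
sigma = sqrt(p_hat * (1 - p_hat) / N)  # ~0.0156

print(p_hat - z * sigma, p_hat + z * sigma)  # ~(0.559, 0.621)
```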

14
Holdout estimation
  • What to do if the amount of data is limited?
  • The holdout method reserves a certain amount for
    testing and uses the remainder for training
  • Usually one third for testing, the rest for
    training
  • Problem: the samples might not be representative
  • Example: a class might be missing in the test data
  • An advanced version uses stratification
  • Ensures that each class is represented with
    approximately equal proportions in both subsets
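
A minimal sketch of a stratified holdout split; scikit-learn and the iris data are my assumptions, not the slides':

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out one third for testing; stratify=y keeps the class
# proportions approximately equal in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)
```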

15
Repeated holdout method
  • Holdout estimate can be made more reliable by
    repeating the process with different subsamples
  • In each iteration, a certain proportion is
    randomly selected for training (possibly with
    stratification)
  • The error rates on the different iterations are
    averaged to yield an overall error rate
  • This is called the repeated holdout method
  • Still not optimal: the different test sets
    overlap
  • Can we prevent overlapping?
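
A sketch of repeated holdout; scikit-learn and the decision tree are illustrative choices of mine:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rates = []
for seed in range(10):  # 10 iterations with different random subsamples
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=1/3, stratify=y, random_state=seed)
    model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    rates.append(1 - model.score(X_te, y_te))  # error rate on this split

print(np.mean(rates))  # error rates averaged into an overall estimate
```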

16
Cross-validation
  • Cross-validation avoids overlapping test sets
  • First step: split the data into k subsets of
    equal size
  • Second step: use each subset in turn for testing,
    the remainder for training
  • This is called k-fold cross-validation
  • Often the subsets are stratified before the
    cross-validation is performed
  • The error estimates are averaged to yield an
    overall error estimate
  • Standard method for evaluation: stratified
    10-fold cross-validation
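
The standard method in one sketch (again assuming scikit-learn):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Stratified 10-fold cross-validation; each fold serves once as test set.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)
print(1 - scores.mean())  # overall error estimate
```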

17
Leave-One-Out cross-validation
  • Leave-One-Out: a particular form of
    cross-validation
  • Set number of folds to number of training
    instances
  • I.e., for n training instances, build classifier
    n times
  • Makes best use of the data
  • Involves no random subsampling
  • But, computationally expensive

18
Leave-One-Out-CV and stratification
  • Disadvantage of Leave-One-Out-CV: stratification
    is not possible
  • It guarantees a non-stratified sample because
    there is only one instance in the test set!
  • Extreme example: a completely random dataset
    split equally into two classes
  • The best inducer predicts the majority class
  • 50% accuracy on fresh data
  • But the Leave-One-Out-CV estimate is 100% error!
    (demonstrated in the sketch below)
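
The pathology is easy to reproduce; a sketch using scikit-learn's majority-class baseline (my choice of tooling):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Completely random data, 50 instances of each class.
X = np.random.default_rng(0).random((100, 2))
y = np.array([0] * 50 + [1] * 50)

# Under leave-one-out, whichever instance is held out, the *other*
# class holds the majority in the training set, so the majority-class
# predictor is always wrong: accuracy 0.0, i.e., 100% error.
clf = DummyClassifier(strategy="most_frequent")
print(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())  # 0.0
```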

19
The bootstrap
  • Cross-validation uses sampling without
    replacement
  • The same instance, once selected, cannot be
    selected again for a particular training/test set
  • The bootstrap uses sampling with replacement to
    form the training set
  • Sample a dataset of n instances n times with
    replacement to form a new dataset of n instances
  • Use this data as the training set
  • Use the instances from the original dataset that
    don't occur in the new training set for testing
  • Also called the 0.632 bootstrap (why?)
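
A sketch of the sampling step (NumPy assumed); it also previews the 0.632 figure explained on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
indices = np.arange(n)

# Sample n instances *with replacement* for the training set ...
train_idx = rng.choice(indices, size=n, replace=True)
# ... and test on the instances that never got picked.
test_idx = np.setdiff1d(indices, train_idx)

print(len(np.unique(train_idx)) / n)  # ~0.632 distinct instances in training
print(len(test_idx) / n)              # ~0.368 of the data left for testing
```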

20
The 0.632 bootstrap
  • Also called the 0.632 bootstrap
  • A particular instance has a probability of
    1 − 1/n of not being picked
  • Thus its probability of ending up in the test
    data is (1 − 1/n)^n ≈ e⁻¹ ≈ 0.368
  • This means the training data will contain
    approximately 63.2% of the (distinct) instances

21
Estimating error with the bootstrap
  • The error estimate on the test data will be very
    pessimistic
  • Trained on just ~63% of the instances
  • Therefore, combine it with the resubstitution
    error
  • The resubstitution error gets less weight than
    the error on the test data
  • Repeat the process several times with different
    replacement samples and average the results
  • Probably the best way of estimating performance
    for very small datasets
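
The standard weighting for the 0.632 bootstrap, shown as a sketch (the example error values are made up):

```python
def bootstrap_632_error(err_test, err_train):
    """Combine the pessimistic test (out-of-sample) error with the
    optimistic resubstitution error, the latter getting less weight."""
    return 0.632 * err_test + 0.368 * err_train

# E.g., 30% error on the test instances, 10% resubstitution error:
print(bootstrap_632_error(0.30, 0.10))  # 0.2264
```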

22
Comparing data mining schemes
  • Frequent question: which of two learning schemes
    performs better?
  • Note: this is domain dependent!
  • Obvious way: compare 10-fold cross-validation
    estimates
  • Problem: variance in the estimate
  • Variance can be reduced using repeated CV
  • However, we still don't know whether the results
    are reliable (we need to use a statistical test
    for that)

23
Counting the cost
  • In practice, different types of classification
    errors often incur different costs
  • Examples:
  • Cancer detection: predicting "not cancer" is
    correct 99% of the time
  • Loan decisions
  • Fault diagnosis
  • Promotional mailing

24
Counting the cost
  • The confusion matrix

                       Predicted class
                       Yes              No
  Actual class   Yes   True positive    False negative
  Actual class   No    False positive   True negative
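
A sketch of computing the matrix for a toy prediction (scikit-learn assumed):

```python
from sklearn.metrics import confusion_matrix

y_true = ["yes", "yes", "no", "no", "yes", "no"]
y_pred = ["yes", "no",  "no", "yes", "yes", "no"]

# Rows are the actual class, columns the predicted class.
print(confusion_matrix(y_true, y_pred, labels=["yes", "no"]))
# [[2 1]    2 true positives, 1 false negative
#  [1 2]]   1 false positive, 2 true negatives
```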
25
Paired t-test
  • Student's t-test tells whether the means of two
    samples are significantly different.
  • In our case the samples are cross-validation
    samples for different datasets from the domain
  • Use a paired t-test because the individual
    samples are paired
  • The same cross-validation is applied twice
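
SciPy offers this test directly; a sketch using the first five pairs of error rates from the example on slide 29:

```python
from scipy.stats import ttest_rel

# Error rates of two classifiers on the same folds/datasets.
errors_a = [10.6, 9.8, 12.3, 9.7, 8.8]
errors_b = [10.2, 9.4, 11.8, 9.1, 8.3]

t, p_value = ttest_rel(errors_a, errors_b)
print(t, p_value)  # a small p-value means the means differ significantly
```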

26
Distribution of the means
  • x1, x2, ..., xk and y1, y2, ..., yk are the 2k
    samples for the k different datasets
  • mx and my are the means
  • With enough samples, the mean of a set of
    independent samples is normally distributed
  • Estimated variances of the means are
  • sx²/k and sy²/k
  • If μx and μy are the true means, then
    (mx − μx) / sqrt(sx²/k) and (my − μy) / sqrt(sy²/k)
    are approximately normally distributed with
    mean 0 and variance 1

27
Distribution of the differences
  • Let md = mx − my
  • The difference of the means (md) has a Student's
    distribution with k − 1 degrees of freedom
  • Let sd² be the estimated variance of the
    differences
  • The standardized version of md is called the
    t-statistic: t = md / sqrt(sd²/k)

28
Performing the test
  • Fix a significance level α
  • If a difference is significant at the α% level,
    there is a (100 − α)% chance that the true means
    differ
  • Divide the significance level by two because the
    test is two-tailed
  • Look up the value of z that corresponds to α/2
  • If t ≥ z or t ≤ −z, then the difference is
    significant
  • I.e., the null hypothesis (that the difference is
    zero) can be rejected

29
Example
  • We have compared two classifiers through
    cross-validation on 10 different datasets.
  • The error rates are:

    Dataset   Classifier A   Classifier B   Difference
    1         10.6           10.2           0.4
    2          9.8            9.4           0.4
    3         12.3           11.8           0.5
    4          9.7            9.1           0.6
    5          8.8            8.3           0.5
    6         10.6           10.2           0.4
    7          9.8            9.4           0.4
    8         12.3           11.8           0.5
    9          9.7            9.1           0.6
    10         8.8            8.3           0.5

30
Example
  • md = 0.48
  • sd = 0.0789
  • t = md / (sd / √10) = 0.48 / 0.0249 ≈ 19.24
  • The critical value of t for a two-tailed
    statistical test at α = 10% and 9 degrees of
    freedom is 1.83
  • 19.24 is way bigger than 1.83, so classifier B is
    significantly better than A.
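
The numbers check out; a short verification sketch (NumPy assumed):

```python
import numpy as np

d = np.array([.4, .4, .5, .6, .5, .4, .4, .5, .6, .5])  # the 10 differences
k = len(d)

md = d.mean()               # 0.48
sd = d.std(ddof=1)          # ~0.0789, the sample standard deviation
t = md / (sd / np.sqrt(k))  # ~19.24
print(md, sd, t)
```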