1
Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?
  • Gary Weiss, Kate McCarthy, Bibi Zabar
  • Fordham University

2
Background
  • Highly skewed data is common
  • We are typically more interested in correctly
    classifying the minority-class examples
  • Without special measures, a classifier will
    rarely predict the minority class
  • A common approach: balance the data
    • Imposes non-uniform misclassification costs
    • Altering the training-set class distribution
      from 1:1 to 2:1 essentially applies a cost
      ratio of 2:1 (sketched below)

C. Elkan. The foundations of cost-sensitive
learning. IJCAI 2001.
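As an illustration of Elkan's equivalence, here is a minimal Python sketch (X and y are hypothetical NumPy arrays, with y == 1 marking the minority class) of replicating minority examples to shift a 1:1 training distribution to 2:1:

    # Duplicating each minority example once turns a 1:1 class
    # distribution into 2:1, which per Elkan (IJCAI 2001) acts like
    # charging false negatives twice as much as false positives.
    import numpy as np

    def shift_distribution(X, y, factor=2):
        """Replicate each minority example (factor - 1) extra times."""
        X_extra = np.repeat(X[y == 1], factor - 1, axis=0)
        y_extra = np.repeat(y[y == 1], factor - 1)
        return np.vstack([X, X_extra]), np.concatenate([y, y_extra])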
3
Two Competing Approaches
  • Cost-sensitive learning algorithm (see the
    sketch after this list)
    • The algorithm itself handles cost-sensitivity
    • Does not throw away any data
  • Sampling
    • Down-sample the majority class
      • Discards potentially useful data
    • Up-sample the minority class
      • Increases the amount of training data
      • Replicated examples may lead to overfitting
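The paper's experiments use cost-sensitive C5.0; purely as an illustration, the sketch below contrasts the approaches in scikit-learn, where a decision tree's class_weight parameter plays the cost-sensitive role and sklearn.utils.resample implements down-sampling (the names and the 4:1 cost ratio are illustrative assumptions, not the paper's code):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    # Cost-sensitive learning: the algorithm weights errors itself,
    # so no training data is discarded or replicated.
    cost_sensitive_clf = DecisionTreeClassifier(class_weight={0: 1, 1: 4})

    def downsample_majority(X, y, n_majority, seed=0):
        """Down-sampling: randomly drop majority-class (y == 0)
        examples, discarding potentially useful data."""
        X_maj, y_maj = resample(X[y == 0], y[y == 0], replace=False,
                                n_samples=n_majority, random_state=seed)
        return (np.vstack([X[y == 1], X_maj]),
                np.concatenate([y[y == 1], y_maj]))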

4
The Question
  • Which method is best?
    • Cost-sensitive learning algorithm
    • Up-sampling
    • Down-sampling
  • Most prior work compares sampling methods

5
Experiments
  • We assume that cost information is known
    • Since cost information is not actually provided
      with these data sets, we evaluate a variety of
      cost ratios and report all results
  • Classifier performance is evaluated using total
    cost (sketched after this list)
  • Used cost-sensitive C5.0
  • Evaluated scenarios where C_FN ≠ C_FP
  • All results are based on averages over 10 runs
  • For cost-sensitive learning, the cost information
    is passed to the learner
  • For sampling approaches
    • The training data is altered to impose the
      specified misclassification cost
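A minimal sketch of the total-cost metric (C_FN and C_FP are the assumed-known costs of false negatives and false positives, with the minority class taken as positive label 1):

    from sklearn.metrics import confusion_matrix

    def total_cost(y_true, y_pred, c_fn, c_fp):
        """Total cost = FN * C_FN + FP * C_FP;
        correct predictions cost nothing."""
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred,
                                          labels=[0, 1]).ravel()
        return fn * c_fn + fp * c_fp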

6
Fourteen Data Sets
Name            % Minority   Total Examples
Letter-a             4           20,000
Pendigits            8           13,821
Connect-4           10           11,258
Bridges1            15              102
Letter-vowel        19           20,000
Hepatitis           21              155
Contraceptive       23            1,473
Adult               24           21,281
Blackjack           36           15,000
Weather             40            5,597
Sonar               47              208
Boa1                50           11,000
Promoters           50              106
Coding              50           20,000
7
Results: Letter-a Data Set
(4% minority, 20,000 examples)
8
Weather Data Set
(40% minority, 5,597 examples)
9
Coding Data Set
(50% minority, 20,000 examples)
10
Blackjack Data Set
(36% minority, 15,000 examples)
11
Contraceptive Data Set
(23% minority, 1,473 examples)
12
Results: 1st/2nd/3rd Place Finishes
13
Comparison of 3 Methods
14
Discussion
  • Results vary widely based on the data set
    • No method consistently outperforms the other
      two, or even one of the other two
  • Are there any patterns based on the properties of
    the data sets?

15
Discussion II: Patterns
  • For the four smallest data sets (size < 209)
    • Up-sampling does by far the best
    • Down-sampling does poorly since it discards data
  • For the eight largest data sets (size > 10,000)
    • Cost-sensitive learning does best
    • Beats up-sampling on average by 5.5%
    • Beats down-sampling on average by 5.7%
  • No clear pattern based on the degree of class
    imbalance

16
Discussion III
  • Why might the cost-sensitive learning algorithm
    perform best for large data sets?
  • Perhaps this method requires accurate probability
    estimates in order to perform well
  • This requires many examples per classification
    rule (see the sketch below)
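To make this concrete (a sketch, not the paper's analysis): Elkan's result implies the cost-minimizing rule predicts the positive (minority) class whenever the estimated p(y=1|x) exceeds C_FP / (C_FP + C_FN). A decision-tree leaf holding only a handful of examples yields a noisy estimate of p(y=1|x), so comparisons against this threshold become unreliable:

    def predict_cost_sensitive(p_positive, c_fn, c_fp):
        """Predict the minority class when its estimated probability
        clears Elkan's cost threshold p* = C_FP / (C_FP + C_FN)."""
        threshold = c_fp / (c_fp + c_fn)
        return 1 if p_positive > threshold else 0

    # With a 2:1 cost ratio the threshold is 1/3; a leaf estimating
    # p = 0.4 from 5 examples and one estimating p = 0.4 from 500
    # examples both predict the minority class, but only the latter
    # estimate is trustworthy.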

17
Conclusion
  • No consistent winner between cost-sensitive
    learning and sampling methods
  • Substantial differences for specific data sets
  • Cost-sensitive learning may be best for large
    data sets
  • Up-sampling appears best for small data sets

18
Follow-up Questions
  • Why isn't cost-sensitive learning the best?
  • Can we identify problems with cost-sensitive
    learners?
  • Can we improve cost-sensitive learners?
  • Are we better off not using a cost-sensitive
    learner and using sampling instead?

19
Future Work
  • There are several areas for future work
    • Use additional cost-sensitive learners
    • Use larger data sets (would cost-sensitive
      learning then be best?)
    • Include more sophisticated sampling schemes
    • Don't assume known costs (ROC analysis)
  • I believe more comprehensive studies are needed
    and are underway