Title: Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?
1. Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?
- Gary Weiss, Kate McCarthy, Bibi Zabar
- Fordham University
2. Background
- Highly skewed data is common
- Typically we are more interested in correctly classifying the minority-class examples
- Without special measures, a classifier will rarely predict the minority class
- A common approach: balance the data
- Balancing imposes non-uniform misclassification costs
- If we alter the training-set class distribution from 1:1 to 2:1, we have essentially applied a cost ratio of 2:1
C. Elkan. The foundations of cost-sensitive
learning. IJCAI 2001.
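Elkan's result can be illustrated with a minimal sketch (the helper name `cost_threshold` is my own): predicting the positive (minority) class minimizes expected cost once its estimated probability exceeds C_FP / (C_FP + C_FN), which is why a 2:1 cost ratio has essentially the same effect as shifting the class distribution from 1:1 to 2:1.

```python
def cost_threshold(c_fp, c_fn):
    """Probability threshold above which predicting the positive (minority)
    class minimizes expected cost, per Elkan (2001): p* = C_FP / (C_FP + C_FN)."""
    return c_fp / (c_fp + c_fn)

# Uniform costs give the usual 0.5 threshold.
print(cost_threshold(1, 1))  # 0.5
# A 2:1 cost ratio (false negatives twice as costly) lowers the threshold,
# mimicking a doubled minority-class share of the training distribution.
print(cost_threshold(1, 2))
```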
3. Two Competing Approaches
- Cost-sensitive learning algorithm
- The algorithm itself handles cost-sensitivity
- Does not throw away any data
- Sampling
- Down-sample the majority class
- Discards potentially useful data
- Up-sample the minority class
- Increases amount of training data
- Replicated examples may lead to overfitting
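The two sampling strategies above can be sketched as follows; this is a minimal illustration using only the standard library, and the function names are my own:

```python
import random

def down_sample(majority, minority, seed=0):
    """Balance classes by randomly discarding majority-class examples.
    Simple, but throws away potentially useful data."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def up_sample(majority, minority, seed=0):
    """Balance classes by replicating minority-class examples (with
    replacement). Keeps all data, but duplicates may encourage overfitting."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

majority = list(range(100))  # 100 majority-class examples
minority = list(range(10))   # 10 minority-class examples
print(len(down_sample(majority, minority)))  # 20
print(len(up_sample(majority, minority)))    # 200
```

To impose a cost ratio other than 1:1, the target minority share is simply adjusted rather than fully balanced.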
4. The Question
- Which method is best?
- cost-sensitive learning algorithm
- up-sampling
- down-sampling
- Most prior work compares sampling methods
5. Experiments
- We assume that cost information is known
- Since cost info is not really provided, we evaluate a variety of cost ratios and report all results
- Classifier performance is evaluated using total cost
- Used cost-sensitive C5.0
- Evaluated scenarios where CFN ≠ CFP
- All results are based on averages over 10 runs
- For cost-sensitive learning, cost info is passed in to the learner
- For sampling approaches, we altered the training data to impose the specified misclassification cost
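Total cost, the evaluation metric used above, can be sketched as follows (the function name is my own; the minority class is coded as 1):

```python
def total_cost(y_true, y_pred, c_fn, c_fp):
    """Sum of misclassification costs: false negatives (missed minority
    examples) weighted by c_fn, false positives weighted by c_fp."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn * c_fn + fp * c_fp

# One missed minority example and one false alarm at a 4:1 cost ratio.
print(total_cost([1, 1, 0, 0], [0, 1, 1, 0], c_fn=4, c_fp=1))  # 5
```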
6. Fourteen Data Sets
Name            % Minority   Total Examples
Letter-a             4           20,000
Pendigits            8           13,821
Connect-4           10           11,258
Bridges1            15              102
Letter-vowel        19           20,000
Hepatitis           21              155
Contraceptive       23            1,473
Adult               24           21,281
Blackjack           36           15,000
Weather             40            5,597
Sonar               47              208
Boa1                50           11,000
Promoters           50              106
Coding              50           20,000
7. Results: Letter-a Data Set (4% minority, 20,000 examples) [chart]
8. Weather Data Set (40% minority, 5,597 examples) [chart]
9. Coding Data Set (50% minority, 20,000 examples) [chart]
10. Blackjack Data Set (36% minority, 15,000 examples) [chart]
11. Contraceptive Data Set (23% minority, 1,473 examples) [chart]
12. Results: 1st/2nd/3rd Place Finishes [chart]
13. Comparison of 3 Methods [chart]
14. Discussion
- Results vary widely based on the data set
- No method consistently outperforms the other two, or even one of the other two
- Are there any patterns based on the properties of the data sets?
15. Discussion II: Patterns
- For the four smallest data sets (size < 209)
- Up-sampling does by far the best
- Down-sampling does poorly since it discards data
- For the eight largest data sets (size > 10,000)
- Cost-sensitive learning does best
- Beats up-sampling on average by 5.5%
- Beats down-sampling on average by 5.7%
- No clear pattern based on the degree of class
imbalance
16. Discussion III
- Why might the cost-sensitive learning algorithm perform best for large data sets?
- Perhaps this method requires accurate probability estimates in order to perform well
- This requires many examples per classification rule
17. Conclusion
- No consistent winner between cost-sensitive learning and sampling methods
- Substantial differences for specific data sets
- Cost-sensitive learning may be best for large data sets
- Up-sampling appears best for small data sets
18. Follow-up Questions
- Why isn't cost-sensitive learning the best?
- Can we identify problems with cost-sensitive learners?
- Can we improve cost-sensitive learners?
- Are we better off not using a cost-sensitive learner and using sampling instead?!
19. Future Work
- There are several areas for future work
- Use additional cost-sensitive learners
- Use larger data sets (is cost-sensitive learning then best?)
- Include more sophisticated sampling schemes
- Don't assume known costs (use ROC analysis)
- I believe more comprehensive studies are needed, and some are underway