Title: Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?
1. Cost-Sensitive Learning vs. Sampling: Which is Best for Handling Unbalanced Classes with Unequal Error Costs?
- Gary Weiss, Kate McCarthy, Bibi Zabar
- Fordham University
2. Background
- Highly skewed data is common
- Typically we are more interested in correctly classifying the minority-class examples
- Without special measures, a classifier will rarely predict the minority class
- A common approach: balance the data
- Balancing imposes non-uniform misclassification costs
- If we alter the training-set class distribution from 1:1 to 2:1, we have essentially applied a cost ratio of 2:1
C. Elkan. The foundations of cost-sensitive
learning. IJCAI 2001.
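Elkan's result can be illustrated with a minimal sketch (the helper name `cost_threshold` is my own): predicting the positive (minority) class minimizes expected cost once its estimated probability exceeds C_FP / (C_FP + C_FN), which is why a 2:1 cost ratio has essentially the same effect as shifting the class distribution from 1:1 to 2:1.

```python
def cost_threshold(c_fp, c_fn):
    """Probability threshold above which predicting the positive (minority)
    class minimizes expected cost, per Elkan (2001): p* = C_FP / (C_FP + C_FN)."""
    return c_fp / (c_fp + c_fn)

# Uniform costs give the usual 0.5 threshold.
print(cost_threshold(1, 1))  # 0.5
# A 2:1 cost ratio (false negatives twice as costly) lowers the threshold,
# mimicking a doubled minority-class share of the training distribution.
print(cost_threshold(1, 2))
```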
3. Two Competing Approaches
- Cost-sensitive learning algorithm
- The algorithm itself handles cost-sensitivity
- Does not throw away any data
- Sampling
- Down-sample the majority class
- Discards potentially useful data
- Up-sample the minority class
- Increases amount of training data
- Replicated examples may lead to overfitting
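The two sampling strategies above can be sketched as follows; this is a minimal illustration using only the standard library, and the function names are my own:

```python
import random

def down_sample(majority, minority, seed=0):
    """Balance classes by randomly discarding majority-class examples.
    Simple, but throws away potentially useful data."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)) + list(minority)

def up_sample(majority, minority, seed=0):
    """Balance classes by replicating minority-class examples (with
    replacement). Keeps all data, but duplicates may encourage overfitting."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return list(majority) + list(minority) + extra

majority = list(range(100))  # 100 majority-class examples
minority = list(range(10))   # 10 minority-class examples
print(len(down_sample(majority, minority)))  # 20
print(len(up_sample(majority, minority)))    # 200
```

To impose a cost ratio other than 1:1, the target minority share is simply adjusted rather than fully balanced.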
4. The Question
- Which method is best?
- cost-sensitive learning algorithm
- up-sampling
- down-sampling
- Most prior work compares sampling methods
5. Experiments
- We assume that cost information is known
- Since cost info is not really provided, we evaluate a variety of cost ratios and report all results
- Classifier performance is evaluated using total cost
- Used cost-sensitive C5.0
- Evaluated scenarios where CFN ≠ CFP
- All results are based on averages over 10 runs
- For cost-sensitive learning, cost info is passed in to the learner
- For sampling approaches, we altered the training data to impose the specified misclassification cost
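Total cost, the evaluation metric used above, can be sketched as follows (the function name is my own; the minority class is coded as 1):

```python
def total_cost(y_true, y_pred, c_fn, c_fp):
    """Sum of misclassification costs: false negatives (missed minority
    examples) weighted by c_fn, false positives weighted by c_fp."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return fn * c_fn + fp * c_fp

# One missed minority example and one false alarm at a 4:1 cost ratio.
print(total_cost([1, 1, 0, 0], [0, 1, 1, 0], c_fn=4, c_fp=1))  # 5
```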
6. Fourteen Data Sets
Name            % Minority   Total Examples
Letter-a             4           20,000
Pendigits            8           13,821
Connect-4           10           11,258
Bridges1            15              102
Letter-vowel        19           20,000
Hepatitis           21              155
Contraceptive       23            1,473
Adult               24           21,281
Blackjack           36           15,000
Weather             40            5,597
Sonar               47              208
Boa1                50           11,000
Promoters           50              106
Coding              50           20,000
7. Results: Letter-a Data Set (4% minority, 20,000 examples) [chart]
8. Weather Data Set (40% minority, 5,597 examples) [chart]
9. Coding Data Set (50% minority, 20,000 examples) [chart]
10. Blackjack Data Set (36% minority, 15,000 examples) [chart]
11. Contraceptive Data Set (23% minority, 1,473 examples) [chart]
12. Results: 1st/2nd/3rd Place Finishes [chart]
13. Comparison of 3 Methods [chart]
14. Discussion
- Results vary widely based on the data set
- No method consistently outperforms the other two, or even one of the other two
- Are there any patterns based on the properties of the data sets?
15. Discussion II: Patterns
- For the four smallest data sets (size < 209)
- Up-sampling does by far the best
- Down-sampling does poorly since it discards data
- For the eight largest data sets (size > 10,000)
- Cost-sensitive learning does best
- Beats up-sampling on average by 5.5%
- Beats down-sampling on average by 5.7%
- No clear pattern based on the degree of class
imbalance
16. Discussion III
- Why might the cost-sensitive learning algorithm perform best for large data sets?
- Perhaps this method requires accurate probability estimates in order to perform well
- This requires many examples per classification rule
17. Conclusion
- No consistent winner between cost-sensitive learning and sampling methods
- Substantial differences for specific data sets
- Cost-sensitive learning may be best for large data sets
- Up-sampling appears best for small data sets
18. Follow-up Questions
- Why isn't cost-sensitive learning the best?
- Can we identify problems with cost-sensitive learners?
- Can we improve cost-sensitive learners?
- Are we better off not using a cost-sensitive learner and using sampling instead?!
19. Future Work
- There are several areas for future work
- Use additional cost-sensitive learners
- Use larger data sets (is cost-sensitive learning then best?)
- Include more sophisticated sampling schemes
- Don't assume known costs (use ROC analysis)
- I believe more comprehensive studies are needed, and some are underway