Data Set Balancing

1
Data Set Balancing
  • David L. Olson
  • Department of Management
  • University of Nebraska

2
Skewed Data Sets
  • Many interesting applications involve data with
    many cases in one category and few in another
  • Insurance claims: binary (fraudulent or not)
  • Cancer cases
  • Loan defaults: binary or other
  • Poor-performing employees: binary or other
  • Skewed data sets cause modeling problems
  • Can cause model degeneracy
  • e.g., the model calls all claims non-fraudulent

3
Test Domain
  • Models
  • Decision tree
  • Regression
  • Neural network
  • Data
  • Categorical or Continuous
  • Binary or Four-outcome

4
Data Sets
  • All generated for pedagogical purposes
  • Loan Application Data
  • 650 observations (400 training, 250 test)
  • Binary (0 = not on time, 1 = on time)
  • Proportion late or default: 0.1125
  • Insurance Fraud Data
  • 5000 observations (4000 training, 1000 test)
  • Binary (OK, fraudulent)
  • Proportion fraudulent: 0.0150
  • Job Application Data
  • 500 observations (250 training, 250 test)
  • Four outcomes (unacceptable, minimal, adequate,
    excellent)
  • Proportion excellent: 0.028
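
To make the degree of skew concrete, the proportions above can be turned into approximate counts of rare cases. This is only a sketch: the slides do not say whether each proportion refers to the training set or to the full data set, so the code ASSUMES the training set, and the helper name `rare_case_count` is hypothetical.

```python
# Sizes and rare-class proportions are taken from the slide.
# ASSUMPTION: each proportion applies to the training set (not stated in the source).
datasets = {
    "loan": (400, 0.1125),             # late or default
    "insurance fraud": (4000, 0.0150),  # fraudulent
    "job application": (250, 0.028),    # excellent
}


def rare_case_count(n_training, rare_proportion):
    """Return the implied number of rare-class cases (hypothetical helper)."""
    return round(n_training * rare_proportion)


rare_counts = {
    name: rare_case_count(n, p) for name, (n, p) in datasets.items()
}

for name, count in rare_counts.items():
    print(f"{name}: about {count} rare cases in training")
```

Under that assumption, each model would see only a few dozen rare cases at most, and fewer than ten for the job application data.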

5
Loan Application Data

6
Insurance Fraud Data

7
Job Application Data

8
Experiments
  • High degree of imbalance in each data set
  • Tested both categorical and continuous data
  • Categorical
  • Decision tree: See5
  • Logistic regression: Clementine
  • Neural network: Clementine
  • Continuous
  • Regression tree: See5
  • Discriminant analysis: Clementine
  • Neural network: Clementine

9
Procedure
  • Full model run
  • Training set reduced
  • Deleted cases from the most common outcome
  • Correct classification rate
  • Correct / total
  • Also identified the type of error (coincidence
    matrix)
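
The procedure above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used in the study (which ran See5 and Clementine): the function names, the `keep_fraction` parameter, and the random choice of which common-outcome cases to delete are all hypothetical.

```python
import random
from collections import Counter


def undersample(rows, labels, keep_fraction, seed=0):
    """Delete cases from the most common outcome, keeping keep_fraction of them.

    ASSUMPTION: deleted cases are chosen at random; the slides do not say how.
    """
    majority = Counter(labels).most_common(1)[0][0]
    majority_idx = [i for i, y in enumerate(labels) if y == majority]
    n_keep = round(len(majority_idx) * keep_fraction)
    keep = set(random.Random(seed).sample(majority_idx, n_keep))
    chosen = [i for i, y in enumerate(labels) if y != majority or i in keep]
    return [rows[i] for i in chosen], [labels[i] for i in chosen]


def coincidence_matrix(actual, predicted):
    """Count each (actual, predicted) pair, which shows the type of error."""
    return Counter(zip(actual, predicted))


def correct_classification_rate(actual, predicted):
    """Correct / total."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Toy data: 8 common cases and 2 rare ones.
rows = list(range(10))
labels = ["ok"] * 8 + ["fraud"] * 2
reduced_rows, reduced_labels = undersample(rows, labels, keep_fraction=0.5)

# Toy predictions, only to show the two evaluation measures.
actual = ["ok", "ok", "fraud", "fraud"]
predicted = ["ok", "fraud", "fraud", "ok"]
matrix = coincidence_matrix(actual, predicted)
rate = correct_classification_rate(actual, predicted)
```

The matrix keys separate the two kinds of mistake, for example `("ok", "fraud")` for a legitimate claim flagged as fraud and `("fraud", "ok")` for a missed fraud, which the single correct classification rate hides.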

10
Loan Application Data Set

11
Insurance Fraud Data Set

12
Job Application Data Set

13
Degeneracy
  • Model classifies all samples into the dominant
    category
  • The greater the data set skew, the greater the
    correct classification rate
  • BUT THE MODEL DOESN'T HELP
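
A small worked example shows why a degenerate model scores well yet is useless. It ASSUMES the 0.0150 fraud proportion from the insurance data also holds in the 1000-case test set, which the slides do not state; the variable names are illustrative only.

```python
# ASSUMPTION: 1.5% of the 1000 test cases are fraudulent (15 cases).
actual = ["fraud"] * 15 + ["ok"] * 985

# A degenerate model puts every case in the dominant category.
predicted = ["ok"] * len(actual)

correct = sum(a == p for a, p in zip(actual, predicted))
degenerate_rate = correct / len(actual)

# Number of fraudulent claims the model actually identifies.
frauds_caught = sum(
    1 for a, p in zip(actual, predicted) if a == "fraud" and p == "fraud"
)

print(f"correct classification rate: {degenerate_rate:.3f}")
print(f"fraudulent claims caught: {frauds_caught}")
```

In the binary case the degenerate rate is simply one minus the rare-class proportion, so it rises as the skew grows while the number of rare cases detected stays at zero.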

14
Comparison

15
Advanced Solutions
  • BAGGING
  • Combine several classifiers by majority vote
  • BOOSTING
  • Sequentially learn several classifiers
  • Each classifier focuses on data poorly
    classified by the previous classifier
  • Combine by weighted vote
  • STACKING
  • Combine outputs of multiple classifiers obtained
    by different learning algorithms
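
The two voting rules named above can be sketched as follows. This covers only the step that combines predictions for one case; it does not train the classifiers, resample the data for bagging, or compute boosting weights. The function names and the example weights are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Bagging-style combination: the label chosen by most classifiers wins."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Boosting-style combination: each classifier's vote counts by its weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Three classifiers' predictions for a single claim.
votes = ["ok", "fraud", "ok"]
bagged = majority_vote(votes)

# Illustrative weights only; a boosting algorithm would derive them
# from each classifier's error on the training data.
boosted = weighted_vote(votes, [0.2, 0.9, 0.3])
```

With these illustrative weights the single heavily weighted classifier outvotes the other two, so the two rules disagree on the same set of predictions.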

16
Conclusions
  • If data are highly unbalanced
  • Algorithms tend to degenerate
  • If data are balanced (by deleting common cases)
  • Reduces training set size
  • Can lead to degeneracy by eliminating rare cases
  • Accuracy rates tend to decline
  • Decision tree algorithms are the most robust