1
Skewed Class Distributions and Mislabeled Examples
  • Jason Van Hulse
  • Taghi M. Khoshgoftaar
  • Amri Napolitano
  • Department of Computer Science and Engineering
  • Florida Atlantic University

2
Overview
  • Introduction
  • Datasets
  • Noise Injection Procedure
  • Learners
  • Experimental Design
  • Results
  • Conclusions

3
Introduction
  • Our research considers a synthesis of two
    important and pervasive problems encountered in
    data mining.
  • Class imbalance or skewed class distributions:
    in the case of a binary dependent variable, class
    imbalance occurs when the examples in one class
    (dramatically) outnumber the examples in the
    other class.
  • Class noise or labeling errors occur when an
    example has an incorrect class label.

4
Introduction
  • Numerous works have considered the issues of
    class imbalance and class noise separately;
    however, there has been a lack of systematic
    study of the interaction between these two
    concepts.
  • Therefore, our work presents a comprehensive
    empirical evaluation of learning from imbalanced
    data that contains labeling errors.
  • Our experiments utilize a unique noise simulation
    methodology, allowing for the controlled
    evaluation of several important parameters
    governing the distribution of class noise.

5
Datasets
  • The datasets considered in these experiments come
    from the domain of empirical software engineering.
  • The software measurement data are from five NASA
    software projects: CM1, MW1, PC1, KC1, and KC3.
    The data were made available through the NASA
    Metrics Data Program.
  • Each module is characterized by 13 software
    measurements (independent variables). The quality
    of each module (the dependent variable) is
    described by its class label, i.e., nfp (not
    fault-prone) or fp (fault-prone).

6
Noise Injection Procedure
  • Before simulated noise was injected, a learnable
    subconcept was extracted from each dataset.
  • This was done because the original dataset may
    contain some class noise, and injecting noise
    into a dataset which is already noisy can be very
    problematic.
  • A technique called the RBCM-based noise filter,
    proposed in related work, was used to eliminate a
    subset of examples from each dataset.
  • After the examples were eliminated from each
    dataset, numerous classifiers were built on the
    reduced dataset and all achieved perfect or
    near-perfect classification accuracy,
    demonstrating that the target concept in the
    reduced dataset is learnable.
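  • As an illustration only (not part of the original
    study), the Python sketch below shows just this
    verification step, assuming a scikit-learn setup;
    the study itself used Weka, the RBCM-based filter
    from related work is not reproduced, and the name
    verify_learnable is hypothetical.

    # Illustrative sketch: check that the filtered dataset is learnable
    # by requiring several different classifiers to reach near-perfect
    # 10-fold cross-validated accuracy on it.
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    def verify_learnable(X_filtered, y_filtered, threshold=0.99):
        learners = [DecisionTreeClassifier(random_state=0),
                    GaussianNB(),
                    KNeighborsClassifier(n_neighbors=5)]
        for clf in learners:
            acc = cross_val_score(clf, X_filtered, y_filtered,
                                  cv=10, scoring="accuracy").mean()
            if acc < threshold:
                return False
        return True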

7
Noise Injection Procedure
  • Dataset statistics, by positive class (i.e.,
    fp) examples and negative class (i.e., nfp)
    examples, before and after filtering
  • Dataset CM1 originally contained 48 fp and 457
    nfp examples, while after filtering, CM1
    contained 39 fp and 277 nfp examples.

8
Noise Injection Procedure
  • Two noise simulation parameters were used:
  • The overall level of noise in the dataset,
    α ∈ {10%, 20%, 30%, 40%, 50%}.
  • The percentage of noise coming from the minority
    class, β ∈ {0%, 25%, 50%, 75%, 100%}.
  • Given α and β, randomly select 2 × α × |P| × β
    examples from class P and corrupt their class
    label to N, and randomly select
    2 × α × |P| × (1 − β) examples from class N and
    corrupt their class label to P (with α and β
    taken as fractions).
  • |P| is the number of minority class examples in
    the dataset.

9
Noise Injection Procedure
  • For example, suppose the dataset D (after
    filtering) contains 100 P and 900 N examples, and
    α = 30% and β = 25%.
  • Then 2 × 0.3 × 100 × 0.25 = 15 P examples from D
    are corrupted to class N, and
    2 × 0.3 × 100 × (1 − 0.25) = 45 N examples from D
    are corrupted to class P.
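  • A minimal Python sketch of this injection step is
    given below (an illustration only; the original
    experiments were run in Weka, and the helper name
    inject_noise is hypothetical).

    # Illustrative sketch of the noise injection described above:
    # 2*alpha*|P|*beta positives are flipped to N (P -> N) and
    # 2*alpha*|P|*(1 - beta) negatives are flipped to P (N -> P),
    # with alpha and beta given as fractions (e.g., 0.3 and 0.25).
    import numpy as np
    def inject_noise(y, alpha, beta, pos="fp", neg="nfp", seed=None):
        rng = np.random.default_rng(seed)
        y = np.asarray(y, dtype=object).copy()
        pos_idx = np.flatnonzero(y == pos)   # minority class P
        neg_idx = np.flatnonzero(y == neg)   # majority class N
        n_p_to_n = round(2 * alpha * len(pos_idx) * beta)
        n_n_to_p = round(2 * alpha * len(pos_idx) * (1 - beta))
        y[rng.choice(pos_idx, n_p_to_n, replace=False)] = neg
        y[rng.choice(neg_idx, n_n_to_p, replace=False)] = pos
        return y
    # For the worked example above (100 P, 900 N, alpha=0.3, beta=0.25),
    # inject_noise flips 15 P examples to N and 45 N examples to P.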

10
Noise Injection Procedure
  • The chart below shows a sample (for α = 10%) of
    the noise injection statistics for the filtered
    KC1 dataset.
  • In the second row, which shows α = 10% and
    β = 25%, a total of 55 instances are corrupted:
    14 have the class flipped from fp to nfp (denoted
    P → N), while 41 have the class flipped from nfp
    to fp (denoted N → P).
  • After noise corruption, there are 298 fp examples
    (13.6% of which are mislabeled) and 1066 nfp
    examples (1.3% of which are mislabeled).

11
Learners
  • Our experiments utilized 11 different learners,
    all constructed using the Weka data mining tool:
  • C4.5 with the default Weka parameters (C4.5D)
  • C4.5 without pruning and with Laplace smoothing
    (C4.5N)
  • Two k-nearest-neighbor learners, with k = 2 and
    k = 5 (2NN and 5NN)
  • Naïve Bayes (NB)
  • Logistic regression (LR)
  • Multi-layer perceptron (MLP)
  • Support vector machines, called SMO in Weka (SVM)
  • Random forests (RF)
  • RIPPER, a rule-based learner
  • Radial basis function network (RBF)

12
Experimental Design
  • For each of the five input datasets, 24 different
    combinations of α and β were considered (note
    that α = 50% with β = 100% could not be
    accommodated, since no minority class examples
    would remain).
  • All learners were evaluated using 10-fold cross
    validation (CV).
  • Noise was injected into the training data only,
    and performance was measured on the test fold
    using the area under the ROC curve (AUC).
  • The process of randomly corrupting instances
    according to the parameters α and β, and
    performing CV, was repeated 10 times.
  • 5 datasets × 10 runs of 10-fold CV results in 500
    uncorrupted training datasets, each of which is
    corrupted with 24 different combinations of α and
    β, for a total of 12,000 noisy datasets. 11
    learners were constructed on each noisy dataset,
    so in total 132,000 classifiers were built in
    these experiments.
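  • As a rough sketch of one run of this design, the
    Python code below assumes a scikit-learn setup
    (the study itself used Weka) and reuses the
    hypothetical inject_noise helper sketched
    earlier; run_one_cv is likewise a hypothetical
    name.

    # One run of 10-fold CV for a single (alpha, beta) combination:
    # noise is injected into the training folds only, and AUC is
    # computed on the untouched test fold. X and y are NumPy arrays;
    # inject_noise() is the helper sketched earlier in this document.
    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold
    from sklearn.naive_bayes import GaussianNB
    def run_one_cv(X, y, alpha, beta, learner=None, seed=0):
        learner = learner if learner is not None else GaussianNB()
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        aucs = []
        for train_idx, test_idx in skf.split(X, y):
            y_noisy = inject_noise(y[train_idx], alpha, beta, seed=seed)
            learner.fit(X[train_idx], y_noisy == "fp")
            scores = learner.predict_proba(X[test_idx])[:, 1]
            aucs.append(roc_auc_score(y[test_idx] == "fp", scores))
        return float(np.mean(aucs))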

13
Results
  • The x-axis is the percentage of noise coming from
    the positive class, labeled min0, ..., min100,
    and the y-axis represents the AUC.
  • Each level of noise is represented by a different
    line, labeled n10, ..., n50.
  • For almost all learners, increasing the level of
    noise α with β = 0% does not dramatically impact
    learner performance.
  • There is a very strong interaction between α and
    β with respect to their impact on learner
    performance.

14
Results
  • The above table shows the AUC for each learner,
    averaged over all five datasets and all values of
    β, for different levels of overall noise α.
  • Some learners (NB and MLP) are relatively robust
    to increases in class noise, while others (RBF,
    RIPPER, SVM, and C4.5D) suffer significantly as
    the percentage of class noise increases.
  • For example, the AUC for RIPPER decreases from
    0.965 at 10% class noise to 0.836 at 50%, a
    decrease of 13.37%.

15
Results
  • The above table shows the AUC for each learner,
    averaged over all five datasets and all values of
    α, for different levels of noise from the
    minority class β.
  • Again, some learners deteriorate significantly
    when more noise comes from the minority class
    (RIPPER, SVM, RBF, C4.5D, and LR), while NB, and
    to a lesser extent MLP and RF, are relatively
    robust.

16
Conclusions
  • P → N-type noise (i.e., examples whose correct
    class is P but which are mislabeled as N) has a
    significant impact on the performance of learners
    built from imbalanced data.
  • Conversely, the same level of N → P-type noise
    does not cause as much harm.
  • Therefore, if a data mining practitioner is
    considering filtering the training dataset when
    learning from imbalanced data, it may not be
    sensible to filter any positive examples:
  • The impact of mislabeled positive examples (i.e.,
    N → P noise) is relatively low.
  • The cost of mistakenly filtering a correctly
    labeled positive example is high. Since the
    positive class is already relatively rare,
    mistakenly filtering correctly labeled positive
    examples further exacerbates the difficulties
    encountered when learning from imbalanced data.

17
Conclusions
  • All of the learners were adversely impacted by
    noise; however, some learners showed more
    stability at higher levels of noise. A more
    thorough evaluation of the robustness of learners
    in the presence of noise is left for future work.
  • Future work should also consider multi-class
    problems.
  • Additional learning algorithms can also be
    considered, and methods to optimize the
    performance of classifiers in the presence of
    noisy and imbalanced data should be evaluated.
  • Finally, additional datasets from different
    application domains can be utilized.

18
Questions?