1
Skewed Class Distributions and Mislabeled Examples
  • Jason Van Hulse
  • Taghi M. Khoshgoftaar
  • Amri Napolitano
  • Department of Computer Science and Engineering
  • Florida Atlantic University

2
Overview
  • Introduction
  • Datasets
  • Noise Injection Procedure
  • Learners
  • Experimental Design
  • Results
  • Conclusions

3
Introduction
  • Our research considers a synthesis of two
    important and pervasive problems encountered in
    data mining.
  • Class imbalance or skewed class distributions:
    in the case of a binary dependent variable, class
    imbalance occurs when the examples in one class
    (dramatically) outnumber the examples in the
    other class.
  • Class noise or labeling errors occur when an
    example has an incorrect class label.

4
Introduction
  • Numerous works have considered the issues of
    class imbalance and class noise separately;
    however, there has been a lack of systematic
    study of the interaction between these two
    concepts.
  • Therefore, our work presents a comprehensive
    empirical evaluation of learning from imbalanced
    data that contains labeling errors.
  • Our experiments utilize a unique noise simulation
    methodology, allowing for the controlled
    evaluation of several important parameters
    governing the distribution of class noise.

5
Datasets
  • The datasets considered in these experiments come
    from the domain of empirical software engineering.
  • The software measurement data are from five NASA
    software projects: CM1, MW1, PC1, KC1, and KC3.
    The data were made available through the NASA
    Metrics Data Program.
  • Each module is characterized by 13 software
    measurements (independent variables). The quality
    of each module (the dependent variable) is
    described by its class label, i.e., nfp (not
    fault-prone) or fp (fault-prone).

6
Noise Injection Procedure
  • Before simulated noise was injected, a learnable
    subconcept was extracted from each dataset.
  • This was done because the original dataset may
    contain some class noise, and injecting noise
    into a dataset which is already noisy can be very
    problematic.
  • A technique called the RBCM-based noise filter,
    proposed in related work, was used to eliminate a
    subset of examples from each dataset.
  • After the examples were eliminated from each
    dataset, numerous classifiers were built on the
    reduced dataset and all achieved perfect or
    near-perfect classification accuracy,
    demonstrating that the target concept in the
    reduced dataset is learnable.
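  • As an illustration only (not part of the original
    study), the Python sketch below shows just this
    verification step, assuming a scikit-learn setup;
    the study itself used Weka, the RBCM-based filter
    from related work is not reproduced, and the name
    verify_learnable is hypothetical.

    # Illustrative sketch: check that the filtered dataset is learnable
    # by requiring several different classifiers to reach near-perfect
    # 10-fold cross-validated accuracy on it.
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    def verify_learnable(X_filtered, y_filtered, threshold=0.99):
        learners = [DecisionTreeClassifier(random_state=0),
                    GaussianNB(),
                    KNeighborsClassifier(n_neighbors=5)]
        for clf in learners:
            acc = cross_val_score(clf, X_filtered, y_filtered,
                                  cv=10, scoring="accuracy").mean()
            if acc < threshold:
                return False
        return True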

7
Noise Injection Procedure
  • Dataset statistics, by positive class (i.e.,
    fp) examples and negative class (i.e., nfp)
    examples, before and after filtering
  • Dataset CM1 originally contained 48 fp and 457
    nfp examples, while after filtering, CM1
    contained 39 fp and 277 nfp examples.

8
Noise Injection Procedure
  • Two noise simulation parameters were used:
  • The overall level of noise in the dataset,
    α ∈ {10%, 20%, 30%, 40%, 50%}.
  • The percentage of noise coming from the minority
    class, β ∈ {0%, 25%, 50%, 75%, 100%}.
  • Given α and β, randomly select 2 × α × |P| × β
    examples from class P and corrupt their class
    label to N, and randomly select
    2 × α × |P| × (1 − β) examples from class N and
    corrupt their class label to P (with α and β
    taken as fractions).
  • |P| is the number of minority class examples in
    the dataset.

9
Noise Injection Procedure
  • For example, suppose the dataset D (after
    filtering) contains 100 P and 900 N examples, and
    α = 30% and β = 25%.
  • Then 2 × 0.3 × 100 × 0.25 = 15 P examples from D
    are corrupted to class N, and
    2 × 0.3 × 100 × (1 − 0.25) = 45 N examples from D
    are corrupted to class P.
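  • A minimal Python sketch of this injection step is
    given below (an illustration only; the original
    experiments were run in Weka, and the helper name
    inject_noise is hypothetical).

    # Illustrative sketch of the noise injection described above:
    # 2*alpha*|P|*beta positives are flipped to N (P -> N) and
    # 2*alpha*|P|*(1 - beta) negatives are flipped to P (N -> P),
    # with alpha and beta given as fractions (e.g., 0.3 and 0.25).
    import numpy as np
    def inject_noise(y, alpha, beta, pos="fp", neg="nfp", seed=None):
        rng = np.random.default_rng(seed)
        y = np.asarray(y, dtype=object).copy()
        pos_idx = np.flatnonzero(y == pos)   # minority class P
        neg_idx = np.flatnonzero(y == neg)   # majority class N
        n_p_to_n = round(2 * alpha * len(pos_idx) * beta)
        n_n_to_p = round(2 * alpha * len(pos_idx) * (1 - beta))
        y[rng.choice(pos_idx, n_p_to_n, replace=False)] = neg
        y[rng.choice(neg_idx, n_n_to_p, replace=False)] = pos
        return y
    # For the worked example above (100 P, 900 N, alpha=0.3, beta=0.25),
    # inject_noise flips 15 P examples to N and 45 N examples to P.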

10
Noise Injection Procedure
  • The chart below shows a sample (for α = 10%) of
    the noise injection statistics for the filtered
    KC1 dataset.
  • In the second row, which shows α = 10% and
    β = 25%, a total of 55 instances are corrupted:
    14 have the class flipped from fp to nfp (denoted
    P → N), while 41 have the class flipped from nfp
    to fp (denoted N → P).
  • After noise corruption, there are 298 fp examples
    (13.6% of which are mislabeled) and 1066 nfp
    examples (1.3% of which are mislabeled).

11
Learners
  • Our experiments utilized 11 different learners,
    all constructed using the Weka data mining tool:
  • C4.5 with the default Weka parameters (C4.5D)
  • C4.5 without pruning and with Laplace smoothing
    (C4.5N)
  • Two k-nearest-neighbor learners, with k = 2 and
    k = 5 (2NN and 5NN)
  • Naïve Bayes (NB)
  • Logistic regression (LR)
  • Multi-layer perceptron (MLP)
  • Support vector machines, called SMO in Weka (SVM)
  • Random forests (RF)
  • RIPPER, a rule-based learner
  • Radial basis function network (RBF)

12
Experimental Design
  • For each of the five input datasets, 24 different
    combinations of α and β were considered (note
    that α = 50% with β = 100% could not be
    accommodated, since no minority class examples
    would remain).
  • All learners were evaluated using 10-fold cross
    validation (CV).
  • Noise was injected into the training data only,
    and performance was measured on the test fold
    using the area under the ROC curve (AUC).
  • The process of randomly corrupting instances
    according to the parameters α and β, and
    performing CV, was repeated 10 times.
  • 5 datasets × 10 runs of 10-fold CV results in 500
    uncorrupted training datasets, each of which is
    corrupted with 24 different combinations of α and
    β, for a total of 12,000 noisy datasets. 11
    learners were constructed on each noisy dataset,
    so in total 132,000 classifiers were built in
    these experiments.
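  • As a rough sketch of one run of this design, the
    Python code below assumes a scikit-learn setup
    (the study itself used Weka) and reuses the
    hypothetical inject_noise helper sketched
    earlier; run_one_cv is likewise a hypothetical
    name.

    # One run of 10-fold CV for a single (alpha, beta) combination:
    # noise is injected into the training folds only, and AUC is
    # computed on the untouched test fold. X and y are NumPy arrays;
    # inject_noise() is the helper sketched earlier in this document.
    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold
    from sklearn.naive_bayes import GaussianNB
    def run_one_cv(X, y, alpha, beta, learner=None, seed=0):
        learner = learner if learner is not None else GaussianNB()
        skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
        aucs = []
        for train_idx, test_idx in skf.split(X, y):
            y_noisy = inject_noise(y[train_idx], alpha, beta, seed=seed)
            learner.fit(X[train_idx], y_noisy == "fp")
            scores = learner.predict_proba(X[test_idx])[:, 1]
            aucs.append(roc_auc_score(y[test_idx] == "fp", scores))
        return float(np.mean(aucs))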

13
Results
  • The x-axis is the percentage of noise coming from
    the positive class, labeled min0, ..., min100,
    and the y-axis represents the AUC.
  • Each level of noise is represented by a different
    line, labeled n10, ..., n50.
  • For almost all learners, increasing the level of
    noise α with β = 0% does not dramatically impact
    learner performance.
  • There is a very strong interaction between α and
    β with respect to their impact on learner
    performance.

14
Results
  • The above table shows the AUC for each learner,
    averaged over all five datasets and all values of
    β, for different levels of overall noise α.
  • Some learners (NB and MLP) are relatively robust
    to increases in class noise, while others (RBF,
    RIPPER, SVM, and C4.5D) suffer significantly as
    the percentage of class noise increases.
  • For example, the AUC for RIPPER decreases from
    0.965 at 10% class noise to 0.836 at 50%, a
    decrease of 13.37%.

15
Results
  • The above table shows the AUC for each learner,
    averaged over all five datasets and all values of
    α, for different levels of noise from the
    minority class β.
  • Again, some learners deteriorate significantly
    when more noise comes from the minority class
    (RIPPER, SVM, RBF, C4.5D, and LR), while NB, and
    to a lesser extent MLP and RF, are relatively
    robust.

16
Conclusions
  • P → N-type noise (i.e., examples whose correct
    class is P but which are mislabeled as N) has a
    significant impact on the performance of learners
    built from imbalanced data.
  • Conversely, the same level of N → P-type noise
    does not cause as much harm.
  • Therefore, if a data mining practitioner is
    considering filtering the training dataset when
    learning from imbalanced data, it may not be
    sensible to filter any positive examples:
  • The impact of mislabeled positive examples (i.e.,
    N → P noise) is relatively low.
  • The cost of mistakenly filtering a correctly
    labeled positive example is high. Since the
    positive class is already relatively rare,
    mistakenly filtering correctly labeled positive
    examples further exacerbates the difficulties
    encountered when learning from imbalanced data.

17
Conclusions
  • All of the learners were adversely impacted by
    noise; however, some learners showed more
    stability at higher levels of noise. A more
    thorough evaluation of the robustness of learners
    in the presence of noise is left for future work.
  • Future work should also consider multi-class
    problems.
  • Additional learning algorithms can also be
    considered, and methods to optimize the
    performance of classifiers in the presence of
    noisy and imbalanced data should be evaluated.
  • Finally, additional datasets from different
    application domains can be utilized.

18
Questions?