Title: A particle swarm based hybrid system for imbalanced medical data sampling
1A particle swarm based hybrid system for
imbalanced medical data sampling
- Pengyi Yang
- yangpy_at_it.usyd.edu.au
School of Information Technologies The University
of Sydney, NSW 2006, Australia
NICTA, Australian Technology Park, Eveleigh NSW
2015, Australia
2Road map
- Imbalanced class distribution in medical data
- Sampling
- Over-sampling
- Under-sampling
- Converting feature selection techniques into a sampling strategy
- System overview
- Results
- Conclusion
3Imbalanced class distribution in medical data
- Medical data commonly have an imbalanced class distribution
- Why?
- Positive samples are special cases (rare) while negative samples are abundant
- (Conversely, only positive samples are collected)
- Data contain subtypes, each with limited samples
4Problem
- Building a classification model on an imbalanced dataset can cause the under-represented class to be overlooked or even ignored.
- Yet the rare classes often carry important biological implications.
- The difficulty becomes how to remedy the imbalanced class distribution.
5Remedy
- Via sampling, before the model building process
- Over-sampling increases the sample size of the minority class (could introduce noise and redundancy)
- Under-sampling decreases the sample size of the majority class (could remove representative samples)
- Via cost-sensitive learning, within the model building process
- Need to choose an appropriate cost metric (hard to determine a priori)
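The two sampling remedies can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the system presented in these slides; the function names and the 90/10 toy dataset are assumptions made for the example, and the code assumes the majority class is at least as large as the minority class.

```python
import random

def random_undersample(majority, minority, seed=0):
    """Drop majority samples at random until both classes have equal size."""
    rng = random.Random(seed)
    kept = rng.sample(majority, k=len(minority))  # drawn without replacement
    return kept, list(minority)

def random_oversample(majority, minority, seed=0):
    """Duplicate minority samples at random until both classes have equal size."""
    rng = random.Random(seed)
    n_extra = len(majority) - len(minority)
    extra = rng.choices(minority, k=n_extra)  # drawn with replacement
    return list(majority), list(minority) + extra

# Toy imbalanced dataset (hypothetical): 90 negatives, 10 positives.
majority = [("neg", i) for i in range(90)]
minority = [("pos", i) for i in range(10)]

under_maj, under_min = random_undersample(majority, minority)
over_maj, over_min = random_oversample(majority, minority)
print(len(under_maj), len(under_min))  # balanced by shrinking the majority
print(len(over_maj), len(over_min))    # balanced by growing the minority
```

Under-sampling here discards 80 of the 90 negatives, which is where representative samples can be lost; over-sampling adds 80 exact duplicates of positives, which is where redundancy comes from.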
6Current methods
- The most straightforward way: random over-sampling and under-sampling
- Naive methods, but they work well in many situations
- Clustering and sampling
- Cluster the dataset and sample according to the characteristics of each cluster
- Synthesizing new examples
- The most popular is SMOTE, which creates artificial samples to increase the size of the minority class
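The idea behind synthesizing minority examples can be illustrated with a simplified SMOTE-style interpolation: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. The sketch below is an approximation written for this note, not the reference SMOTE implementation; `smote_like`, its parameters, and the toy points are assumptions.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Candidate neighbours: every other minority sample.
        others = [p for p in minority if p is not x]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Toy minority class in two dimensions (hypothetical values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_like(minority, n_new=6)
print(len(new_points))
```

Because every synthetic point lies between two existing minority samples, the new points stay inside the region the minority class already occupies rather than being exact copies.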
7Our contribution: proposing a novel sampling strategy
- Converting a feature selection technique into a sampling strategy
- Selecting a subset of optimal samples from the majority class
[Diagram: imbalanced dataset -> supervised sample selection -> balanced dataset]
8Conceptual representation
9Particle swarm optimization
- Problem encoding
- Each particle is a subset of samples from the majority class
- m is the sample size of the majority class
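One way to picture this encoding is as a 0/1 vector of length m, where bit i says whether majority sample i is kept. The sketch below pairs that encoding with a standard binary PSO update (sigmoid of the velocity gives the probability of a bit being 1). It is an illustration under stated assumptions, not the system from these slides: the binary update rule, all parameter values, and especially `toy_fitness` are placeholders; a real system would score each particle by training and evaluating a classifier on the decoded dataset.

```python
import math
import random

def decode(particle, majority, minority):
    """Turn a 0/1 particle of length m into a training set."""
    selected = [s for bit, s in zip(particle, majority) if bit == 1]
    return selected + list(minority)

def toy_fitness(particle, n_minority):
    """Placeholder objective (assumption): prefer subsets whose size matches
    the minority class. Higher is better; 0 is the best possible value."""
    return -abs(sum(particle) - n_minority)

def binary_pso(m, n_minority, n_particles=20, n_iter=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(m)] for _ in range(n_particles)]
    vel = [[0.0] * m for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [toy_fitness(p, n_minority) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]
    for _ in range(n_iter):
        for i in range(n_particles):
            for d in range(m):
                # Velocity: inertia + pull to personal best + pull to global best.
                v = (w * vel[i][d]
                     + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                     + c2 * rng.random() * (gbest[d] - pos[i][d]))
                vel[i][d] = max(-4.0, min(4.0, v))  # clamp to avoid saturation
                prob_one = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob_one else 0
            fit = toy_fitness(pos[i], n_minority)
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit

# Toy data (hypothetical): m = 30 majority samples, 8 minority samples.
majority = [f"neg{i}" for i in range(30)]
minority = [f"pos{i}" for i in range(8)]
best, best_fit = binary_pso(m=len(majority), n_minority=len(minority))
train_set = decode(best, majority, minority)
print(sum(best), len(train_set), best_fit)
```

Swapping `toy_fitness` for a cross-validated classifier score is what would make the selection supervised; the encoding and update loop stay the same.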
10Final Schema
11Results
(1) PSO achieved better classification results.
(2) Different evaluation metrics can give different indications of performance.
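A small worked example shows how metrics can disagree on imbalanced data. The metrics chosen here (accuracy, sensitivity, specificity, G-mean) and the confusion-matrix counts are assumptions for illustration; they are not the results reported in this talk.

```python
import math

def metrics(tp, fn, tn, fp):
    """Compute common metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical test set: 5 positives, 95 negatives.
# Classifier A predicts "negative" for everything.
always_negative = metrics(tp=0, fn=5, tn=95, fp=0)
# Classifier B finds 4 of 5 positives at the cost of 15 false alarms.
balanced_model = metrics(tp=4, fn=1, tn=80, fp=15)

for name, m in [("A", always_negative), ("B", balanced_model)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

Classifier A reaches 0.95 accuracy while detecting no positive sample at all, so its G-mean is 0; classifier B has lower accuracy but a much higher G-mean. Ranking the two by accuracy and by G-mean gives opposite answers.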
12Results (continued)
(3) Different classifiers also perform differently under the same sampling method.
13Key observation
- The evaluation of a data sampling strategy is compounded by the type of classifier applied and the evaluation metric used.
- Therefore, caution is needed when a conclusion is drawn on the basis of a single type of classifier or evaluation metric.
14Conclusion
- The study shows that, with proper modification, feature selection techniques can be applied to the sampling of imbalanced data.
- Applying such a technique in the medical domain demonstrates that it can help increase classification accuracy, which is valuable for prediction and decision support systems.
15Publication
- P. Yang, L. Hsu, B. Zhou, Z. Zhang, A. Zomaya. A particle swarm based hybrid system for imbalanced medical data sampling. Accepted by BMC Genomics.
16Questions!