A Genetic Algorithms Approach to Feature Subset Selection Problem

1
A Genetic Algorithms Approach to Feature Subset
Selection Problem
  • by Hasan Dogu TASKIRAN
  • CS 550 Machine Learning Workshop
  • Department of Computer Engineering
  • Bilkent University
  • May 16, 2005

2
Outline
  • Motivation
  • Neural Networks
  • Feature Subset Selection
  • Genetic Algorithms
  • Methodology
  • Experiments and Results
  • Conclusions and Future Work

3
Motivation
  • It is not unusual to find problems involving
    hundreds of features
  • Beyond a point, the inclusion of additional
    features leads to worse rather than better
    performance
  • We need to differentiate between features that
    contribute new information and those that do not
  • Many current techniques, such as PCA and LDA,
    involve linear transformations to lower
    dimensions
  • A multi-objective genetic algorithm is needed to
  • Reduce the cost
  • Increase the accuracy (if applicable)

4
Neural Networks
  • An information processing paradigm that is
    inspired by the way biological nervous systems
    process information
  • A large number of highly interconnected
    processing elements (neurons) working in unison
    to solve specific problems
  • They are configured for a specific application
    through a learning process
  • Adjustments to synaptic connections that exist
    between the neurons

5
Neural Networks
  • The network may become extremely complex if the
    number of features used for classification
    grows very large
  • If the network becomes too complex, then
  • Size increases
  • Training time increases
  • Training set size increases
  • Classification time increases
  • Some optimization methods such as node pruning
    techniques exist for classification using ANNs

6
Feature Subset Selection
  • Reduce the number of features used in
    classification while maintaining acceptable
    classification accuracy
  • Considerable impact on the effectiveness of the
    resulting classification
  • Computational complexity is reduced as there is
    smaller number of inputs
  • Accuracy increases when the removed features
    hinder the classification process
  • Can be seen as a case of binary feature weighting

7
Genetic Algorithms
  • A family of computational models inspired by
    evolution
  • GAs are parallel iterative optimizers, and have
    been successfully applied to a broad spectrum of
    optimization problems
  • Focusing on the application of selection,
    mutation, and recombination to a population of
    competing problem solutions
  • A directed search rather than an exhaustive search

8
Genetic Algorithms
  • Given enough time and a well-bounded problem, a
    genetic algorithm can find a global optimum
  • Performance of genetic algorithm depends on a
    number of factors including
  • The choice of genetic representation and
    operators,
  • The fitness function,
  • The details of the fitness-dependent selection
    procedure,
  • Various user-determined parameters such as
    population size
  • All about representation and fitness

9
Methodology
  • Represent the feature subsets as binary strings
    where
  • A value of 1 represents the inclusion of a
    particular feature in the training process
  • A value of 0 represents its absence
  • The genetic algorithm will operate on a pool of
    binary strings
  • For each binary string we train a new neural
    network with the selected features as input nodes
    to evaluate the fitness of the resulting binary
    set
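
The representation above can be sketched in code. The original work used the Matlab toolboxes; the Python below is an illustrative stand-in, and the function and variable names are hypothetical:

```python
import numpy as np

# Hypothetical sketch: decode a binary chromosome into a feature mask
# and keep only the selected columns of a dataset before training.
def select_features(X, chromosome):
    """X: (n_samples, n_features) array; chromosome: 0/1 list, one bit per feature."""
    mask = np.asarray(chromosome, dtype=bool)
    return X[:, mask]

X = np.random.rand(5, 8)          # 5 samples, 8 candidate features
chrom = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = include the feature, 0 = exclude it
X_sub = select_features(X, chrom)
print(X_sub.shape)                # (5, 4): four features selected
```

A neural network would then be trained on `X_sub` to score this chromosome.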

10
Methodology
  • As a result of the training we obtain an error
    value e(x),
  • where 0 ≤ e(x) ≤ 1
  • A cost function for the network s(x) is obtained,
  • where again 0 ≤ s(x) ≤ 1
  • After training, the fitness of the feature subset
    is obtained through a function of e(x) and s(x)
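
The exact fitness formula did not survive the transcript. A common multi-objective choice, shown here purely as an assumption and not as the authors' formula, rewards both low error e(x) and low cost s(x) with a weighted sum:

```python
# Assumed fitness combination (not the formula from the slide):
# higher fitness is better; weights trade accuracy against network cost.
def fitness(e, s, w_err=0.75, w_cost=0.25):
    """e, s in [0, 1]: error value e(x) and cost function s(x)."""
    assert 0.0 <= e <= 1.0 and 0.0 <= s <= 1.0
    return w_err * (1.0 - e) + w_cost * (1.0 - s)

print(fitness(0.1, 0.2))  # 0.875
```

The weights w_err and w_cost are hypothetical parameters that set the accuracy/cost trade-off.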

11
Experiments and Results
  • We conducted an experiment that shows our results
    on a handwritten digit recognition problem
  • We implemented our methodology using the Matlab
    Neural Network and Genetic Algorithm toolboxes
  • The database we used in our experiments was the
    UCI database for handwritten digits
  • This database includes 200 samples per digit
    (2000 samples in total)
  • Each digit is represented as a 15 x 16 image

12
Experiments and Results
  • We randomly chose 100 samples of each digit for
    the training set and used the remaining 100 for
    testing our networks to obtain the necessary
    e(x) and s(x) values
  • We decided to use the pixels as our features,
    giving 240 features to evaluate
  • We create a pool of feature subsets represented
    as 240-bit bit-strings where 1s represent the
    inclusion of the associated pixel value and 0s
    represent the absence of it while training the
    network
  • For each binary string in the pool we create a
    new Feed-Forward back-propagation ANN with one
    hidden layer composed of 10 neurons.
  • We used logarithmic sigmoid transfer functions
    and gradient descent with momentum and adaptive
    learning rate back-propagation as the training
    function (the slowest in Matlab, namely
    traingdx)

13
Experiments and Results
  • The parameters for our GA are
  • Population Size: 50
  • Number of Generations: 100
  • Probability of Crossover: 0.6
  • Probability of Mutation: 0.001
  • Elite Count: 2
  • Type of Mutation: Uniform
  • Type of Selection: Rank-based
  • Stall Generations Limit: 10
  • Stall Time Limit: Infinite
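
A minimal GA skeleton wired with the parameter values listed above can be sketched as follows. This is a Python stand-in (the original used the Matlab Genetic Algorithm toolbox); the stall limits are omitted for brevity, and the fitness function is a placeholder that counts 1-bits so the sketch runs without the neural-network training step:

```python
import random

# Parameter values from the slide above.
POP_SIZE, N_GEN, N_BITS = 50, 100, 240
P_CROSS, P_MUT, ELITE = 0.6, 0.001, 2

def rank_select(pop, fits):
    # Rank-based selection: pick with probability proportional to rank.
    order = sorted(range(len(pop)), key=lambda i: fits[i])
    ranks = [0] * len(pop)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return pop[random.choices(range(len(pop)), weights=ranks, k=1)[0]]

def crossover(a, b):
    # One-point crossover, applied with probability P_CROSS.
    if random.random() < P_CROSS:
        cut = random.randrange(1, N_BITS)
        return a[:cut] + b[cut:]
    return a[:]

def mutate(chrom):
    # Uniform bit-flip mutation with per-bit probability P_MUT.
    return [bit ^ 1 if random.random() < P_MUT else bit for bit in chrom]

def evolve(fitness):
    pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP_SIZE)]
    for _ in range(N_GEN):
        fits = [fitness(c) for c in pop]
        best_idx = sorted(range(POP_SIZE), key=lambda i: fits[i], reverse=True)
        elite = [pop[i] for i in best_idx[:ELITE]]  # carry the 2 best over unchanged
        children = [mutate(crossover(rank_select(pop, fits),
                                     rank_select(pop, fits)))
                    for _ in range(POP_SIZE - ELITE)]
        pop = elite + children
    return max(pop, key=fitness)

best = evolve(lambda c: sum(c))   # placeholder fitness: maximize 1-bits
print(len(best), sum(best) > N_BITS // 2)
```

In the actual methodology each fitness evaluation would train a new feed-forward network on the selected pixels, which is why the stall limits matter in practice.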

14
Experiments and Results
15
Experiments and Results
Accuracy (%)                                   Training Dataset   Test Dataset
Full Feature Set (240 features, s(x) = 1.00)        99.7              89.9
Optimal Subset (53 features, s(x) = 0.221)          99.4              90.4
16
Conclusions
  • The proposed methodology succeeds in reducing
    the complexity of the feature set used by the
    ANN classifier
  • Genetic algorithms offer an attractive approach
    to solving the feature subset selection problem
  • This methodology finds application areas in cost
    sensitive design of classifiers for tasks such as
    medical diagnosis and computer vision
  • Other application areas include automated data
    mining and knowledge discovery from datasets with
    an abundance of irrelevant or redundant features
  • The GA-based approach to feature subset selection
    does not rely on monotonicity assumptions that
    are used in traditional approaches to feature
    subset selection

17
Future Work
  • Further analysis is still needed to improve the
    results obtained with GAs
  • Performance improvements and trials on other
    datasets may be included
  • Performance improvements should also be made to
    the genetic algorithms themselves
  • Another analysis could examine the fitness
    evaluation, where other fitness functions may
    be used
  • The approach may also be tried in the
    semi-supervised learning setting

18
Thanks for Listening
  • Questions?