A Genetic Algorithms Approach to Feature Subset Selection Problem

1
A Genetic Algorithms Approach to Feature Subset
Selection Problem
  • by Hasan Dogu TASKIRAN
  • CS 550 Machine Learning Workshop
  • Department of Computer Engineering
  • Bilkent University
  • May 16, 2005

2
Outline
  • Motivation
  • Neural Networks
  • Feature Subset Selection
  • Genetic Algorithms
  • Methodology
  • Experiments and Results
  • Conclusions and Future Work

3
Motivation
  • It is not unusual to find problems involving
    hundreds of features
  • Beyond a point, the inclusion of additional
    features leads to worse rather than better
    performance
  • We need to differentiate between features that
    contribute new information and those that do not
  • Many current techniques, such as PCA and LDA,
    involve linear transformations to lower
    dimensions
  • A multi-objective genetic algorithm is needed to
  • Reduce the cost
  • Increase the accuracy (if applicable)

4
Neural Networks
  • An information processing paradigm that is
    inspired by the way biological nervous systems
    process information
  • A large number of highly interconnected
    processing elements (neurons) working in unison
    to solve specific problems
  • They are configured for a specific application
    through a learning process
  • Adjustments to synaptic connections that exist
    between the neurons

5
Neural Networks
  • The network may become extremely complex if the
    number of features used for classification
    grows very large
  • If the network becomes too complex, then
  • Size increases
  • Training time increases
  • Training set size increases
  • Classification time increases
  • Some optimization methods such as node pruning
    techniques exist for classification using ANNs

6
Feature Subset Selection
  • Reduce the number of features used in
    classification while maintaining acceptable
    classification accuracy
  • Considerable impact on the effectiveness of the
    resulting classification
  • Computational complexity is reduced as there is
    smaller number of inputs
  • Accuracy increases when the removed features
    hinder the classification process
  • Can be seen as a case of binary feature weighting

7
Genetic Algorithms
  • A family of computational models inspired by
    evolution
  • GAs are parallel iterative optimizers, and have
    been successfully applied to a broad spectrum of
    optimization problems
  • Focusing on the application of selection,
    mutation, and recombination to a population of
    competing problem solutions
  • A directed search rather than an exhaustive search

8
Genetic Algorithms
  • Given enough time and a well-bounded problem, a
    genetic algorithm can find a global optimum
  • Performance of genetic algorithm depends on a
    number of factors including
  • The choice of genetic representation and
    operators,
  • The fitness function,
  • The details of the fitness-dependent selection
    procedure,
  • Various user-determined parameters such as
    population size
  • All about representation and fitness

9
Methodology
  • Represent the feature subsets as binary strings
    where
  • A value of 1 represents the inclusion of a
    particular feature in the training process
  • A value of 0 represents its absence
  • The genetic algorithm will operate on a pool of
    binary strings
  • For each binary string we train a new neural
    network with the selected features as input nodes
    to evaluate the fitness of the resulting binary
    set
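
The representation above can be sketched in code. The original work used the Matlab toolboxes; the Python below is an illustrative stand-in, and the function and variable names are hypothetical:

```python
import numpy as np

# Hypothetical sketch: decode a binary chromosome into a feature mask
# and keep only the selected columns of a dataset before training.
def select_features(X, chromosome):
    """X: (n_samples, n_features) array; chromosome: 0/1 list, one bit per feature."""
    mask = np.asarray(chromosome, dtype=bool)
    return X[:, mask]

X = np.random.rand(5, 8)          # 5 samples, 8 candidate features
chrom = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = include the feature, 0 = exclude it
X_sub = select_features(X, chrom)
print(X_sub.shape)                # (5, 4): four features selected
```

A neural network would then be trained on `X_sub` to score this chromosome.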

10
Methodology
  • As a result of the training we obtain an error
    value e(x),
  • where 0 ≤ e(x) ≤ 1
  • A cost function for the network s(x) is obtained,
  • where again 0 ≤ s(x) ≤ 1
  • After training, the fitness of the feature subset
    is obtained through a function of e(x) and s(x)
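
The exact fitness formula did not survive the transcript. A common multi-objective choice, shown here purely as an assumption and not as the authors' formula, rewards both low error e(x) and low cost s(x) with a weighted sum:

```python
# Assumed fitness combination (not the formula from the slide):
# higher fitness is better; weights trade accuracy against network cost.
def fitness(e, s, w_err=0.75, w_cost=0.25):
    """e, s in [0, 1]: error value e(x) and cost function s(x)."""
    assert 0.0 <= e <= 1.0 and 0.0 <= s <= 1.0
    return w_err * (1.0 - e) + w_cost * (1.0 - s)

print(fitness(0.1, 0.2))  # 0.875
```

The weights w_err and w_cost are hypothetical parameters that set the accuracy/cost trade-off.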

11
Experiments and Results
  • We conducted an experiment that shows our results
    on a handwritten digit recognition problem
  • We implemented our methodology using the Matlab
    Neural Network and Genetic Algorithm toolboxes
  • The database we used in our experiments was the
    UCI database for handwritten digits
  • This database includes 200 samples per digit
    (2000 samples in total)
  • Each digit is represented as a 15 x 16 image

12
Experiments and Results
  • We randomly chose 100 samples of each digit for
    the training set and used the remaining 100 for
    testing our networks to obtain the necessary
    e(x) and s(x) values
  • We decided to use the pixels as our features,
    giving 240 features to evaluate
  • We create a pool of feature subsets represented
    as 240-bit bit-strings where 1s represent the
    inclusion of the associated pixel value and 0s
    represent the absence of it while training the
    network
  • For each binary string in the pool we create a
    new Feed-Forward back-propagation ANN with one
    hidden layer composed of 10 neurons.
  • We used logarithmic sigmoid transfer functions
    and gradient descent with momentum and adaptive
    learning rate back-propagation as the training
    function (the slowest in Matlab, namely
    traingdx)

13
Experiments and Results
  • The parameters for our GA are
  • Population Size: 50
  • Number of Generations: 100
  • Probability of Crossover: 0.6
  • Probability of Mutation: 0.001
  • Elite Count: 2
  • Type of Mutation: Uniform
  • Type of Selection: Rank-based
  • Stall Generations Limit: 10
  • Stall Time Limit: Infinite
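
A minimal GA skeleton wired with the parameter values listed above can be sketched as follows. This is a Python stand-in (the original used the Matlab Genetic Algorithm toolbox); the stall limits are omitted for brevity, and the fitness function is a placeholder that counts 1-bits so the sketch runs without the neural-network training step:

```python
import random

# Parameter values from the slide above.
POP_SIZE, N_GEN, N_BITS = 50, 100, 240
P_CROSS, P_MUT, ELITE = 0.6, 0.001, 2

def rank_select(pop, fits):
    # Rank-based selection: pick with probability proportional to rank.
    order = sorted(range(len(pop)), key=lambda i: fits[i])
    ranks = [0] * len(pop)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return pop[random.choices(range(len(pop)), weights=ranks, k=1)[0]]

def crossover(a, b):
    # One-point crossover, applied with probability P_CROSS.
    if random.random() < P_CROSS:
        cut = random.randrange(1, N_BITS)
        return a[:cut] + b[cut:]
    return a[:]

def mutate(chrom):
    # Uniform bit-flip mutation with per-bit probability P_MUT.
    return [bit ^ 1 if random.random() < P_MUT else bit for bit in chrom]

def evolve(fitness):
    pop = [[random.randint(0, 1) for _ in range(N_BITS)] for _ in range(POP_SIZE)]
    for _ in range(N_GEN):
        fits = [fitness(c) for c in pop]
        best_idx = sorted(range(POP_SIZE), key=lambda i: fits[i], reverse=True)
        elite = [pop[i] for i in best_idx[:ELITE]]  # carry the 2 best over unchanged
        children = [mutate(crossover(rank_select(pop, fits),
                                     rank_select(pop, fits)))
                    for _ in range(POP_SIZE - ELITE)]
        pop = elite + children
    return max(pop, key=fitness)

best = evolve(lambda c: sum(c))   # placeholder fitness: maximize 1-bits
print(len(best), sum(best) > N_BITS // 2)
```

In the actual methodology each fitness evaluation would train a new feed-forward network on the selected pixels, which is why the stall limits matter in practice.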

14
Experiments and Results
15
Experiments and Results
Accuracy (%)                                   Training Dataset   Test Dataset
Full Feature Set (240 features, s(x) = 1.00)        99.7              89.9
Optimal Subset (53 features, s(x) = 0.221)          99.4              90.4
16
Conclusions
  • The proposed methodology succeeds in reducing
    the complexity of the feature set used by the
    ANN classifier
  • Genetic algorithms offer an attractive approach
    to solving the feature subset selection problem
  • This methodology finds application areas in cost
    sensitive design of classifiers for tasks such as
    medical diagnosis and computer vision
  • Other application areas include automated data
    mining and knowledge discovery from datasets with
    an abundance of irrelevant or redundant features
  • The GA-based approach to feature subset selection
    does not rely on monotonicity assumptions that
    are used in traditional approaches to feature
    subset selection

17
Future Work
  • Further analysis is still needed to improve the
    results obtained with GAs
  • Performance improvements and trials on other
    datasets may be included
  • Performance improvements should also be made to
    the genetic algorithms themselves
  • Another analysis could examine the fitness
    evaluation, where other fitness functions may
    be used
  • The approach may also be tried in the
    semi-supervised learning setting

18
Thanks for Listening
  • Questions?