Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution - PowerPoint PPT Presentation

1
Feature Selection for High-Dimensional Data: A
Fast Correlation-Based Filter Solution
  • Presented by Jingting Zeng
  • 11/26/2007

2
Outline
  • Introduction to Feature Selection
  • Feature Selection Models
  • Fast Correlation-Based Filter (FCBF) Algorithm
  • Experiment
  • Discussion
  • Reference

3
Introduction to Feature Selection
  • Definition
  • A process that chooses an optimal subset of
    features according to an objective function
  • Objectives
  • To reduce dimensionality and remove noise
  • To improve mining performance
  • Speed of learning
  • Predictive accuracy
  • Simplicity and comprehensibility of mined results

4
An Example for Optimal Subset
  • Data set (whole set)
  • Five Boolean features
  • C = F1 ∧ F2
  • F3 = F2, F5 = F4
  • Optimal subset
  • {F1, F2} or {F1, F3}
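The relationships in this toy example can be checked by enumerating the truth table. A minimal Python sketch, assuming the reconstructed relations C = F1 AND F2, F3 = F2, and F5 = F4 (the operators are garbled in the transcript, so these are illustrative guesses):

```python
from itertools import product

# Enumerate the assumed toy data set: F3 and F5 duplicate F2 and F4,
# and the class is assumed to be C = F1 AND F2 (a reconstruction).
rows = []
for f1, f2, f4 in product([0, 1], repeat=3):
    f3, f5 = f2, f4          # redundant copies
    c = f1 & f2              # assumed class function
    rows.append((f1, f2, f3, f4, f5, c))

# {F1, F2} determines C: each distinct (F1, F2) pair maps to one class value.
det_f1f2 = len({(f1, f2, c) for f1, f2, _, _, _, c in rows}) == \
           len({(f1, f2) for f1, f2, _, _, _, _ in rows})
print(det_f1f2)  # True, and the same holds for {F1, F3} since F3 == F2
```

Because F3 carries exactly the information of F2, either {F1, F2} or {F1, F3} is an optimal subset, while F4 and F5 are irrelevant to C.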

5
Models of Feature Selection
  • Filter model
  • Separating feature selection from classifier
    learning
  • Relying on general characteristics of data
    (information, distance, dependence, consistency)
  • No bias toward any learning algorithm, fast
  • Wrapper model
  • Relying on a predetermined classification
    algorithm
  • Using predictive accuracy as goodness measure
  • High accuracy, computationally expensive

6
Filter Model
7
Wrapper Model
8
Two Aspects for Feature Selection
  • How to decide whether a feature is relevant to
    the class or not
  • How to decide whether such a relevant feature is
    redundant or not compared to other features

9
Linear Correlation Coefficient
  • For a pair of variables (X, Y)
    r = Σi (xi − x̄)(yi − ȳ) /
        √(Σi (xi − x̄)² · Σi (yi − ȳ)²)
  • However, it may not be able to capture
    non-linear correlations
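The limitation is easy to demonstrate. A small self-contained sketch (the helper name `pearson` is mine): r is ≈ 1.0 for an exactly linear relation, but 0.0 for the deterministic non-linear relation y = x²:

```python
import math

def pearson(xs, ys):
    """Linear (Pearson) correlation coefficient r of paired samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-2, -1, 0, 1, 2]
print(pearson(xs, [2 * x + 1 for x in xs]))  # ≈ 1.0: exact linear relation
print(pearson(xs, [x * x for x in xs]))      # 0.0, even though y = x^2 is a
                                             # perfect (non-linear) dependency
```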

10
Information Measures
  • Entropy of variable X
    H(X) = − Σi P(xi) log2 P(xi)
  • Entropy of X after observing Y
    H(X|Y) = − Σj P(yj) Σi P(xi|yj) log2 P(xi|yj)
  • Information Gain
    IG(X|Y) = H(X) − H(X|Y)
  • Symmetrical Uncertainty
    SU(X, Y) = 2 · IG(X|Y) / (H(X) + H(Y))
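These four measures translate directly into code. A minimal sketch for discrete variables (function names are mine):

```python
import math
from collections import Counter

def entropy(xs):
    """H(X) = -sum_i P(x_i) * log2 P(x_i) for a discrete sample."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def cond_entropy(xs, ys):
    """H(X|Y): remaining entropy of X after observing Y."""
    n = len(ys)
    return sum(
        (cy / n) * entropy([x for x, y in zip(xs, ys) if y == yv])
        for yv, cy in Counter(ys).items()
    )

def info_gain(xs, ys):
    """IG(X|Y) = H(X) - H(X|Y)."""
    return entropy(xs) - cond_entropy(xs, ys)

def sym_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), normalized to [0, 1]."""
    hx, hy = entropy(xs), entropy(ys)
    return 2.0 * info_gain(xs, ys) / (hx + hy) if hx + hy else 1.0

f = [0, 0, 1, 1]
print(sym_uncertainty(f, f))             # 1.0: a variable determines itself
print(sym_uncertainty(f, [0, 1, 0, 1]))  # 0.0: the variables are independent
```

SU normalizes information gain by the two entropies, so it stays in [0, 1] and, unlike raw IG, does not favor features with many values.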

11
Fast Correlation-Based Filter (FCBF) Algorithm
  • How to decide whether a feature is relevant
    to the class C or not
  • Find a subset S′ such that every feature
    Fi ∈ S′ satisfies SU(Fi, C) ≥ δ, a predefined
    threshold
  • How to decide whether such a relevant feature
    is redundant
  • Use the correlation between features and the
    class as a reference

12
Definitions
  • Predominant correlation
  • The correlation SU(Fi, C) between a feature Fi
    and the class C is predominant iff
    SU(Fi, C) ≥ δ and there is no Fj (j ≠ i) with
    SU(Fj, Fi) ≥ SU(Fi, C)
  • Redundant peer (RP)
  • If SU(Fj, Fi) ≥ SU(Fi, C), Fj is a RP of Fi
  • Use SPi to denote the set of RPs for Fi,
    divided into SPi+ (those with SU(Fj, C) >
    SU(Fi, C)) and SPi− (the rest)

13
[Figure: feature-class correlations SU(Fi, C) and feature-feature correlations SU(Fj, Fi)]
14
Three Heuristics
  • If SPi+ = ∅, treat Fi as a predominant
    feature, remove all features in SPi−, and skip
    identifying redundant peers for them
  • If SPi+ ≠ ∅, process all the features in SPi+
    first. If none of them becomes predominant,
    follow the first heuristic
  • The feature with the largest SU(Fi, C) value
    is always a predominant feature and can be a
    starting point to remove other features.

15
[Figure: applying the heuristics to a feature Fi and the class C]
16
FCBF Algorithm
Best-case time complexity: O(N) in the number of features N
17
FCBF Algorithm (cont.)
Average time complexity: O(N log N)
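The algorithm can be sketched from the preceding slides: relevance filtering by SU against the class, then removal of redundant peers in order of decreasing SU with the class. A hedged Python sketch (identifiers are mine, not the paper's pseudocode):

```python
import math
from collections import Counter

def _entropy(xs):
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def su(xs, ys):
    """Symmetrical uncertainty SU(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y))."""
    n = len(ys)
    h_cond = sum(
        (cy / n) * _entropy([x for x, y in zip(xs, ys) if y == yv])
        for yv, cy in Counter(ys).items()
    )
    hx, hy = _entropy(xs), _entropy(ys)
    return 2.0 * (hx - h_cond) / (hx + hy) if hx + hy else 1.0

def fcbf(features, cls, delta=0.0):
    """Sketch of FCBF: features maps name -> list of discrete values,
    cls is the class labels, delta the relevance threshold on SU(Fi, C)."""
    # Relevance: keep features with SU(Fi, C) >= delta, in decreasing order.
    relevant = sorted(
        (f for f in features if su(features[f], cls) >= delta),
        key=lambda f: su(features[f], cls), reverse=True,
    )
    # Redundancy: a kept (predominant) feature Fp removes every later Fq
    # that is a redundant peer of it, i.e. SU(Fq, Fp) >= SU(Fq, C).
    selected = []
    while relevant:
        fp = relevant.pop(0)
        selected.append(fp)
        relevant = [fq for fq in relevant
                    if su(features[fq], features[fp]) < su(features[fq], cls)]
    return selected

# Toy data echoing slide 4: F3 duplicates F2, F5 duplicates F4, C = F1 AND F2.
features = {
    'F1': [0, 0, 0, 0, 1, 1, 1, 1],
    'F2': [0, 0, 1, 1, 0, 0, 1, 1],
    'F3': [0, 0, 1, 1, 0, 0, 1, 1],
    'F4': [0, 1, 0, 1, 0, 1, 0, 1],
    'F5': [0, 1, 0, 1, 0, 1, 0, 1],
}
C = [a & b for a, b in zip(features['F1'], features['F2'])]
print(fcbf(features, C))  # ['F1', 'F2']: redundant and irrelevant features removed
```

Because each surviving feature eliminates all of its redundant peers in one pass, most features never need pairwise comparison, which is where the better-than-quadratic complexity comes from.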
18
Experiments
  • FCBF is compared to ReliefF, CorrSF, and ConsSF
  • Summary of the 10 data sets

19
Results
20
Results (cont.)
21
Pros and Cons
  • Advantages
  • Very fast
  • Selects fewer features with higher accuracy
  • Disadvantages
  • May fail to detect some relevant features
  • On data with 4 features generated by 4
    Gaussian functions plus 4 added redundant
    features, FCBF selected only 3 features

22
Discussion
  • FCBF compares only individual features with
    each other
  • A possible extension: use PCA to capture
    groups of correlated features, then apply
    FCBF to the result
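That idea can be prototyped in a few lines. A rough sketch assuming plain PCA via SVD; the synthetic data, the 99% variance cut-off, and all names are illustrative choices of mine:

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))          # two independent source signals
noise = 0.01 * rng.normal(size=(100, 2))
X = np.hstack([base, base + noise])       # four features, two correlated groups

# PCA via SVD: center, decompose, keep components covering 99% of variance.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = S ** 2 / np.sum(S ** 2)
k = int(np.searchsorted(np.cumsum(explained), 0.99)) + 1
components = Xc @ Vt[:k].T                # decorrelated feature groups

print(k)  # 2: the four correlated columns collapse to two components,
          # which could then be discretized and passed to FCBF
```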

23
Reference
  • L. Yu and H. Liu. Feature Selection for
    High-Dimensional Data: A Fast Correlation-Based
    Filter Solution. In Proc. 12th Int. Conf. on
    Machine Learning (ICML-03), pages 856-863, 2003.
  • J. Biesiada and W. Duch. Feature Selection for
    High-Dimensional Data: A Kolmogorov-Smirnov
    Correlation-Based Filter Solution. In CORES'05,
    Advances in Soft Computing, Springer Verlag,
    pages 95-104, 2005.
  • www.cse.msu.edu/ptan/SDM07/Yu-Ye-Liu.pdf
  • www1.cs.columbia.edu/jebara/6772/proj/Keith.ppt

24
  • Thank you!
  • Q and A