1
Feature selection methods from correlation to causality
  • Isabelle Guyon
  • isabelle@clopinet.com

NIPS 2008 workshop on kernel learning
2
Acknowledgements and references
  • 1) Feature Extraction:
  •    Foundations and Applications
  •    I. Guyon, S. Gunn, et al.
  •    Springer, 2006.
  •    http://clopinet.com/fextract-book
  • 2) Causal feature selection
  •    I. Guyon, C. Aliferis, A. Elisseeff
  •    To appear in Computational Methods of Feature Selection,
  •    Huan Liu and Hiroshi Motoda, Eds.,
  •    Chapman and Hall/CRC Press, 2007.
  •    http://clopinet.com/causality

3
http://clopinet.com/causality
Constantin Aliferis, Alexander Statnikov, André Elisseeff,
Jean-Philippe Pellet, Gregory F. Cooper, Peter Spirtes
4
Introduction
5
Feature Selection
  • Thousands to millions of low-level features:
    select the most relevant ones to build better,
    faster, and easier-to-understand learning
    machines.

[Figure: data matrix X with m examples (rows) and n features (columns)]
6
Applications
[Figure: typical numbers of training examples and of variables/features
(ranging from 10 to about 10^6) for application domains: customer knowledge,
quality control, market analysis, OCR/HWR, machine vision, text categorization,
system diagnosis, bioinformatics]
7
Nomenclature
  • Univariate method: considers one variable
    (feature) at a time.
  • Multivariate method: considers subsets of
    variables (features) together.
  • Filter method: ranks features or feature subsets
    independently of the predictor (classifier).
  • Wrapper method: uses a classifier to assess
    features or feature subsets.

8
Univariate Filter Methods
9
Univariate feature ranking
[Figure: class-conditional densities P(Xi|Y=1) and P(Xi|Y=-1) along feature xi,
with means μ+, μ- and standard deviations σ+, σ-]
  • Normally distributed classes, equal variance σ²
    unknown, estimated from the data as σ²within.
  • Null hypothesis H0: μ+ = μ-
  • T statistic: if H0 is true,
    t = (μ+ - μ-) / (σwithin √(1/m+ + 1/m-)) ~ Student(m+ + m- - 2 d.f.)
    (m+, m-: number of examples in each class)
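A minimal sketch of this univariate t-statistic ranking in Python (scipy.stats.ttest_ind implements the same pooled-variance test; the data and function name here are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import ttest_ind

def t_rank(X, y):
    """Rank features by the two-sample t-statistic (pooled variance).

    X: (m, n) data matrix, y: labels in {-1, +1}.
    Returns feature indices sorted by decreasing |t|, and the t values."""
    t_values = np.array([
        ttest_ind(X[y == 1, i], X[y == -1, i], equal_var=True).statistic
        for i in range(X.shape[1])
    ])
    return np.argsort(-np.abs(t_values)), t_values

# Toy usage: 100 examples, 20 features, only feature 0 informative.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=100)
X = rng.normal(size=(100, 20))
X[:, 0] += y                      # shift the class-conditional means of feature 0
ranking, t_values = t_rank(X, y)
print(ranking[:5])                # feature 0 should come first
```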

10
Statistical tests (chap. 2)
[Figure: null distribution of the test statistic]
  • H0: X and Y are independent.
  • Relevance index ↔ test statistic.
  • P-value ↔ false positive rate FPR = nfp / nirr
  • Multiple testing problem: use the Bonferroni
    correction, pval → n · pval
  • False discovery rate: FDR = nfp / nsc ≤ FPR · n / nsc
  • Probe method: FPR ≈ nsp / np
    (nfp: false positives, nirr: irrelevant features,
    nsc: selected features, nsp: selected probes, np: probes)

(Guyon & Dreyfus, 2006)
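A rough illustration of these corrections, assuming per-feature p-values (or scores for random probes) have already been computed; the function names are hypothetical, and Benjamini-Hochberg is used here as one standard way to control the FDR:

```python
import numpy as np

def bonferroni(pvals):
    """Bonferroni correction: multiply each p-value by the number of tests."""
    n = len(pvals)
    return np.minimum(np.asarray(pvals) * n, 1.0)

def benjamini_hochberg(pvals, q=0.05):
    """Select features while controlling the false discovery rate at level q."""
    pvals = np.asarray(pvals)
    n = len(pvals)
    order = np.argsort(pvals)
    # largest k such that p_(k) <= k * q / n
    passed = pvals[order] <= (np.arange(1, n + 1) * q / n)
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    return order[:k]                      # indices of the selected features

def probe_fpr(scores_probe, threshold):
    """Probe method: estimate the FPR as the fraction of random probes
    whose relevance score exceeds the selection threshold."""
    return np.mean(np.asarray(scores_probe) >= threshold)
```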
11
Univariate Dependence
  • Independence:
    P(X, Y) = P(X) P(Y)
  • Measure of dependence (mutual information):
    MI(X, Y) = ∫ P(X, Y) log [ P(X, Y) / (P(X) P(Y)) ] dX dY
             = KL( P(X, Y) || P(X) P(Y) )
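For discrete variables this dependence measure can be estimated directly from empirical frequencies; a small sketch (continuous variables would require binning or a k-NN estimator such as scikit-learn's mutual_info_classif):

```python
import numpy as np

def mutual_information(x, y):
    """MI(X, Y) = sum_{x,y} P(x, y) * log( P(x, y) / (P(x) P(y)) )
    for discrete x and y, estimated from empirical frequencies (in nats)."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            px, py = np.mean(x == xv), np.mean(y == yv)
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return mi

x = np.array([0, 0, 1, 1])
print(mutual_information(x, x))                    # log(2) ≈ 0.693 (fully dependent)
print(mutual_information(x, np.array([0, 1, 0, 1])))  # 0.0 (empirically independent)
```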
12
Other criteria (chap. 3)
  • The choice of a feature selection / ranking method
    depends on:
  • the nature of the variables and the target (binary,
    categorical, continuous);
  • the problem (dependencies between variables,
    linear/non-linear relationships between variables
    and target);
  • the available data (number of examples and
    number of variables, noise in the data);
  • the available tabulated statistics.

(Wlodzislaw Duch, 2006)
13
Multivariate Methods
14
Univariate selection may fail
(Guyon & Elisseeff, JMLR 2004; Springer 2006)
15
Filters, Wrappers, and Embedded Methods
16
Relief
Relief = <Dmiss / Dhit>, averaged over examples
[Figure: for a given example, Dhit is the distance to its nearest hit
(closest example of the same class) and Dmiss the distance to its
nearest miss (closest example of the other class)]
(Kira and Rendell, 1992)
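A compact sketch of the Relief idea described above (simplified: a fixed number of randomly drawn examples, Euclidean distance for the neighbor search, per-feature absolute differences; not a faithful reproduction of Kira and Rendell's exact algorithm):

```python
import numpy as np

def relief(X, y, n_iter=100, rng=None):
    """Relief-style feature weights: reward features that separate an example
    from its nearest miss more than from its nearest hit."""
    rng = np.random.default_rng(rng)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        i = rng.integers(m)
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                              # exclude the example itself
        same, other = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dist, np.inf))   # nearest hit
        miss = np.argmin(np.where(other, dist, np.inf)) # nearest miss
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter
```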
17
Wrappers for feature selection
(Kohavi & John, 1997)
N features → 2^N possible feature subsets!
18
Search Strategies (chap. 4)
  • Exhaustive search.
  • Stochastic search (simulated annealing, genetic
    algorithms).
  • Beam search: keep the k best paths at each step.
  • Greedy search: forward selection or backward
    elimination.
  • Floating search: alternate forward and backward
    strategies.

(Juha Reunanen, 2006)
19
Forward Selection (wrapper)
[Figure: greedy forward search; at each step one of the n, then n-1,
then n-2, ..., 1 remaining candidate features is added to the subset]
Also referred to as SFS: Sequential Forward Selection.
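A minimal wrapper-style SFS sketch using cross-validated accuracy as the assessment (the classifier is passed in and is a placeholder; scikit-learn's SequentialFeatureSelector provides an off-the-shelf equivalent):

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, clf, n_features):
    """Greedily add the feature that most improves the cross-validated score."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = [
            (cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean(), j)
            for j in remaining
        ]
        best_score, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
        print(f"added feature {best_j}, CV score {best_score:.3f}")
    return selected
```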
20
Forward Selection (embedded)
[Figure: same forward search, following a single guided path]
Guided search: we do not consider alternative paths.
Typical examples: Gram-Schmidt orthogonalization and tree
classifiers.
21
Backward Elimination (wrapper)
Also referred to as SBS: Sequential Backward Selection.
[Figure: greedy backward search; starting from all n features, one of the
remaining n, then n-1, then n-2, ... features is eliminated at each step]
22
Backward Elimination (embedded)
Guided search: we do not consider alternative paths.
Typical example: recursive feature elimination (RFE-SVM).
[Figure: same backward elimination, following a single guided path
from all n features down to 1]
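RFE with a linear SVM is available off the shelf in scikit-learn; a brief usage sketch on toy data (all parameter values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# Toy data: 100 examples, 20 features, 5 of them informative.
X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)

# RFE repeatedly trains the linear SVM and drops the feature
# with the smallest weight (guided backward elimination).
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=5, step=1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = selected; larger = eliminated earlier
```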
23
Scaling Factors
  • Idea: transform a discrete space into a
    continuous space.

s = (s1, s2, s3, s4)
  • Discrete indicators of feature presence: si ∈ {0, 1}
  • Continuous scaling factors: si ∈ ℝ

Now we can do gradient descent!
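A toy sketch of this idea for a linear model: the scaling factors s multiply the inputs, and both the weights and s are updated by gradient descent on a squared loss with an L1 penalty on s (the functional, step size, and penalty strength are illustrative assumptions, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 10
X = rng.normal(size=(m, n))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=m)  # only 2 relevant features

w = np.zeros(n)          # model weights
s = np.ones(n)           # continuous scaling factors (relaxed feature indicators)
lr, lam = 0.01, 0.05

for _ in range(2000):
    pred = (X * s) @ w                    # prediction with scaled inputs
    err = pred - y
    grad_w = (X * s).T @ err / m          # gradient of the squared loss wrt w
    grad_s = (X * w).T @ err / m + lam * np.sign(s)   # loss gradient + L1 penalty on s
    w -= lr * grad_w
    s -= lr * grad_s

print(np.round(s, 2))    # scaling factors of irrelevant features are driven toward 0
```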
24
Formalism (chap. 5)
  • Many learning algorithms are cast as the
    minimization of some regularized functional:
    empirical error + regularization (capacity control).

Justification of RFE and many other embedded methods.
(Lal, Chapelle, Weston, Elisseeff, 2006)
25
Embedded method
  • Embedded methods are a good source of inspiration for
    designing new feature selection techniques for your own
    algorithm:
  • Find a functional that represents your prior
    knowledge about what a good model is.
  • Add the s weights to the functional and make
    sure it is either differentiable or amenable to an
    efficient sensitivity analysis.
  • Optimize alternately with respect to α and s.
  • Use early stopping (on a validation set) or your own
    stopping criterion to stop and select the subset
    of features.
  • Embedded methods are therefore not too far from
    wrapper techniques and can be extended to
    multiclass, regression, etc.

26
Causality
27
What can go wrong?
Guyon-Aliferis-Elisseeff, 2007
28
What can go wrong?
29
What can go wrong?
Guyon-Aliferis-Elisseeff, 2007
30
Local causal graph
31
What works and why?
32
Bilevel optimization
[Figure: data matrix with M samples and N variables/features]
Split the data into 3 sets: training, validation, and
test set.
  • 1) For each feature subset, train a predictor on the
    training data.
  • 2) Select the feature subset which performs best
    on the validation data.
  • Repeat and average if you want to reduce variance
    (cross-validation).
  • 3) Test on the test data.
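A sketch of this bilevel scheme on toy data, comparing a handful of candidate feature subsets on the validation split (the candidate subsets and the logistic-regression predictor are illustrative; in practice the candidates come from a search strategy such as forward selection):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=3, random_state=0)

# Split into training, validation, and test sets.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

candidate_subsets = [[0], [0, 1], [0, 1, 2], list(range(10))]  # illustrative

# 1) Train one predictor per candidate subset on the training set.
# 2) Keep the subset with the best validation accuracy.
best_subset = max(
    candidate_subsets,
    key=lambda S: LogisticRegression(max_iter=1000)
        .fit(X_tr[:, S], y_tr).score(X_val[:, S], y_val),
)

# 3) Report the test error, untouched by the selection process.
final = LogisticRegression(max_iter=1000).fit(X_tr[:, best_subset], y_tr)
print(best_subset, final.score(X_test[:, best_subset], y_test))
```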
33
Complexity of Feature Selection
With high probability:
Generalization_error ≤ Validation_error + ε(C/m2)
m2 = number of validation examples, N = total
number of features, n = feature subset size.
[Figure: error vs. feature subset size n]
Try to keep C of the order of m2.
34
Insensitivity to irrelevant features

Simple univariate predictive model, binary target
and features, all relevant features correlate
perfectly with the target, all irrelevant
features randomly drawn. With 98% confidence,
abs(feat_weight) < w and |Σi wi xi| < v.
ng = number of good (relevant) features; nb = number
of bad (irrelevant) features; m = number of training
examples.
35
Conclusion
  • Feature selection focuses on uncovering subsets
    of variables X1, X2, ... predictive of the target Y.
  • Multivariate feature selection is in principle
    more powerful than univariate feature selection,
    but not always in practice.
  • Taking a closer look at the type of dependencies,
    in terms of causal relationships, may help refine
    the notion of variable relevance.
  • Feature selection and causal discovery may be
    more harmful than useful.
  • Causality can help ML, but ML can also help
    causality.