Transcript and Presenter's Notes

Title: Variable - / Feature Selection


1
  • Variable - / Feature Selection
  • in Machine Learning
  • (Review)
  • Adapted from http//www.igi.tugraz.at/lehre/MLA/WS
    05/mla_2005_11_15.ppt

2
Overview
WHY ? WHAT ? HOW ?
  • Introduction/Motivation
  • Basic definitions, Terminology
  • Variable Ranking methods
  • Feature subset selection

3
Feature Selection in ML ?
  • Why even think about Feature Selection in ML?
  • The information about the target class is
    inherent in the variables!
  • Naive theoretical view: more features => more
    information => more discrimination power.
  • In practice there are many reasons why this is
    not the case!
  • Also: optimization is (usually) good, so why not
    try to optimize the input coding?

4
Feature Selection in ML ? YES!
  • Many explored domains have hundreds to tens of
    thousands of variables/features with many
    irrelevant and redundant ones!
  • - In domains with many features the underlying
    probability distribution can be very complex and
    very hard to estimate (e.g. dependencies between
    variables) !
  • Irrelevant and redundant features can confuse
    learners!
  • Limited training data!
  • Limited computational resources!
  • Curse of dimensionality!

5
Curse of dimensionality
6
Curse of dimensionality
  • The required number of samples (to achieve the
    same accuracy) grows exponentially with the
    number of variables!
  • In practice the number of training examples is
    fixed!
  • => the classifier's performance will usually
    degrade for a large number of features!

In many cases the information that is lost by
discarding variables is made up for by a more
accurate mapping/sampling in the
lower-dimensional space !
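A rough illustration of this growth (the numbers below are chosen for illustration only and are not from the slides): if each of d variables is discretized into b bins and we want at least k training samples per cell to estimate the class distribution, the required training-set size scales like m ≈ k · b^d. With b = 10 bins and k = 10 samples per cell, one variable needs about 10^2 examples, while ten variables already need about 10^11.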
7
Example for ML-Problem
  • Gene selection from microarray data
  • Variables: gene expression coefficients
    corresponding to the amount of mRNA in a
    patient's sample (e.g. tissue biopsy)
  • Task: separate healthy patients from cancer
    patients
  • Usually there are only about 100 examples
    (patients) available for training and testing
    (!!!)
  • Number of variables in the raw data: 6,000 -
    60,000
  • Does this work? [8]

[8] C. Ambroise, G.J. McLachlan: Selection bias
in gene extraction on the basis of microarray
gene-expression data. PNAS Vol. 99:
6562-6566 (2002)
8
Example for ML-Problem
  • Text categorization
  • Documents are represented by a vector whose
    dimension is the size of the vocabulary and
    whose entries are word frequency counts
    (see the sketch below)
  • Vocabulary: 15,000 words (i.e. each document is
    represented by a 15,000-dimensional vector)
  • Typical tasks
  • Automatic sorting of documents into
    web-directories
  • Detection of spam-email
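As a minimal sketch of the bag-of-words representation described above (the toy documents and the whitespace tokenization are illustrative assumptions; a real system would face the 15,000-word vocabularies mentioned on the slide):

from collections import Counter

def bag_of_words(documents):
    """Map each document to a word-frequency vector over a shared vocabulary."""
    # Build the vocabulary: one dimension per distinct word.
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    index = {word: i for i, word in enumerate(vocabulary)}

    vectors = []
    for doc in documents:
        counts = Counter(doc.split())
        vec = [0] * len(vocabulary)
        for word, freq in counts.items():
            vec[index[word]] = freq
        vectors.append(vec)
    return vocabulary, vectors

# toy usage
docs = ["cheap pills buy now", "meeting agenda for monday", "buy cheap meeting pills"]
vocab, X = bag_of_words(docs)
print(len(vocab), X[0])   # each document is a len(vocab)-dimensional count vector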

9
Motivation
  • Especially when dealing with a large number of
    variables there is a need for dimensionality
    reduction!
  • Feature Selection can significantly improve a
    learning algorithm's performance!

10
Overview
  • Introduction/Motivation
  • Basic definitions, Terminology
  • Variable Ranking methods
  • Feature subset selection

11
Problem setting
  • Classification/Regression (Supervised Learning)
  • Given empirical data (training data)
    (x1, y1), ..., (xm, ym) in X x Y,
  • a learner has to find a hypothesis h: X -> Y
  • that is used to assign a label y to unseen x.
  • Classification: y is an integer (e.g. Y = {-1, 1})
  • Regression: y is real-valued (e.g. Y = [-1, 1])

12
Features/Variables
  • Take a closer look at the data:
  • i.e. each instance x has n
    attributes (variables, features, dimensions)

13
Feature Selection - Definition
  • Given a set of features F = {f1, ..., fn},
  • the Feature Selection problem is
  • to find a subset F' of F that maximizes
    the learner's
  • ability to classify patterns.
  • Formally, F' should maximize some scoring
    function S: Gamma -> R
  • (where Gamma is the space of all
    possible feature subsets of F), i.e.
    F' = argmax over G in Gamma of S(G)

14
Feature Extraction-Definition
  • Given a set of features F = {f1, ..., fn},
  • the Feature Extraction (Construction) problem
  • is to map F to some feature set F'' that
    maximizes the
  • learner's ability to classify patterns.
  • (again F'' = argmax over G in Gamma of S(G))
  • This general definition subsumes feature
    selection (i.e. a feature selection algorithm
    also performs a mapping but can only map to
    subsets of the input variables)

here Gamma is the set of all possible feature
sets
15
Feature Selection / - Extraction
  • Feature Selection
  • Feature Extraction/Creation

16
Feature Selection Optimality ?
  • In theory the goal is to find an optimal
    feature-subset (one that maximizes the scoring
    function)
  • In real world applications this is usually not
    possible
  • For most problems it is computationally
    intractable to search the whole space of possible
    feature subsets
  • One usually has to settle for approximations of
    the optimal subset
  • Most of the research in this area is devoted to
    finding efficient search-heuristics
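To make this intractability concrete, a naive exhaustive wrapper-style search over all feature subsets would look like the sketch below; the hypothetical score() function stands in for the scoring function S, and the loop performs 2^n - 1 evaluations, which is hopeless beyond a few dozen features.

from itertools import combinations

def exhaustive_search(features, score):
    """Naive optimal subset search: evaluates every non-empty subset of `features`.

    `score(subset)` is assumed to return the value of the scoring function S
    for that subset (e.g. cross-validated accuracy of a classifier).
    """
    best_subset, best_score = None, float("-inf")
    for k in range(1, len(features) + 1):
        for subset in combinations(features, k):      # C(n, k) subsets of size k
            s = score(subset)
            if s > best_score:
                best_subset, best_score = subset, s
    return best_subset, best_score                     # 2^n - 1 evaluations in total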

17
Optimal feature subset
  • Often: definition of the optimal feature subset
    in terms of the classifier's performance
  • The best one can hope for theoretically is the
    Bayes error rate
  • Given a learner I and training data L with
    features
  • F = {f1, ..., fi, ..., fn}, an optimal feature
    subset Fopt is a subset of F such that the
    accuracy of the learner's hypothesis h is maximal
    (i.e. its performance is equal to an optimal
    Bayes classifier).
  • Fopt (under this definition) depends on I
  • Fopt need not be unique
  • Finding Fopt is usually computationally
    intractable

for this definition a possible scoring function
is 1 - true_error(h)
18
Relevance of features
  • Relevance of a variable/feature
  • There are several definitions of relevance in the
    literature
  • Relevance of one variable, relevance of a variable
    given other variables, relevance given a certain
    learning algorithm, ...
  • Most definitions are problematic, because there
    are problems where all features would be declared
    to be irrelevant
  • The authors of [2] define two degrees of
    relevance: weak and strong relevance.
  • A feature is relevant iff it is weakly or
    strongly relevant, and irrelevant (redundant)
    otherwise.

[1] R. Kohavi and G. John: Wrappers for feature
subset selection. Artificial Intelligence,
97(1-2): 273-324, December 1997
2 Ron Kohavi, George H. John Wrappers for
Feature Subset Selection. AIJ special issue on
relevance (1996)
19
Relevance of features
  • Strong Relevance of a variable/feature
  • Let Si = {f1, ..., fi-1, fi+1, ..., fn} be the set
    of all features except fi. Denote by si a
    value-assignment to all features in Si.
  • A feature fi is strongly relevant, iff there
    exist some xi, y and si for which
    p(fi = xi, Si = si) > 0 such that
  • This means that removal of fi alone will always
    result in a performance deterioration of an
    optimal Bayes classifier.

p(Y = y | fi = xi, Si = si) ≠ p(Y = y | Si = si)
20
Relevance of features
Si = {f1, ..., fi-1, fi+1, ..., fn}
  • Weak Relevance of a variable/feature
  • A feature fi is weakly relevant, iff it is not
    strongly relevant, and there exists a subset of
    features Si' of Si for which there exist some
    xi, y and si' with p(fi = xi, Si' = si') > 0 such
    that
  • This means that there exists a subset of
    features Si', such that the performance of an
    optimal Bayes classifier on Si' is worse than on
    Si' ∪ {fi}.

p(Y = y | fi = xi, Si' = si') ≠ p(Y = y | Si' = si')
21
Relevance of features
  • Relevance ≠ Optimality of Feature-Set
  • Classifiers induced from training data are likely
    to be suboptimal (no access to the real
    distribution of the data)
  • Relevance does not imply that the feature is in
    the optimal feature subset
  • Even irrelevant features can improve a
    classifier's performance
  • Defining relevance in terms of a given classifier
    (and therefore a hypothesis space) would be
    better.

22
Overview
  • Introduction/Motivation
  • Basic definitions, Terminology
  • Variable Ranking methods
  • Feature subset selection

23
Variable Ranking
  • Given a set of features F
  • Variable Ranking is the process of ordering the
    features by the value of some scoring function
    S: F -> R (which usually measures
    feature-relevance)
  • Resulting set: a permutation F' of F
    with S(f'1) >= S(f'2) >= ... >= S(f'n)
  • The score S(fi) is computed from the training
    data, measuring some criterion of feature fi.
  • By convention a high score is indicative of a
    valuable (relevant) feature.

24
Variable Ranking Feature Selection
  • A simple method for feature selection using
    variable ranking is to select the k highest
    ranked features according to S.
  • This is usually not optimal
  • but often preferable to other, more complicated
    methods
  • computationally efficient(!): only calculation
    and sorting of n scores

25
Ranking Criteria Correlation
  • Correlation Criteria
  • Pearson correlation coefficient:
    R(i) = cov(Xi, Y) / sqrt(var(Xi) var(Y))
  • Estimate for m samples: replace cov and var with
    the corresponding sums over the m training points
    (see the sketch below)

The higher the correlation between the feature
and the target, the higher the score!
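A minimal sketch of correlation-based ranking combined with the select-the-k-highest-ranked rule from the previous slide (plain NumPy; all names are illustrative):

import numpy as np

def pearson_scores(X, y):
    """Score each feature by the absolute Pearson correlation with the target.

    X: (m, n) array of m samples with n features, y: (m,) target vector.
    """
    Xc = X - X.mean(axis=0)                # center each feature
    yc = y - y.mean()
    num = Xc.T @ yc                        # per-feature cross term
    den = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(num / den)               # |R(i)| for each feature i

def select_top_k(X, y, k):
    """Variable-ranking feature selection: keep the k highest-scoring features."""
    scores = pearson_scores(X, y)
    return np.argsort(scores)[::-1][:k]    # indices of the k best features

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 4] - 2 * X[:, 7] + rng.normal(scale=0.1, size=100)
print(select_top_k(X, y, 2))               # likely features 4 and 7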
26
Ranking Criteria Correlation
27
Ranking Criteria Correlation
  • Correlation Criteria
  • mostly R(xi,y)² or R(xi,y) is used
  • measure for the goodness of linear fit of xi and
    y.
  • (can only detect linear dependencies between
    variable and target.)
  • what if y = XOR(x1,x2)? (each variable alone is
    uncorrelated with y, yet together they determine y)
  • often used for microarray data analysis

28
Ranking Criteria Correlation
  • Questions
  • Can variables with small score be automatically
    discarded ?
  • Can a useless variable (i.e. one with a small
    score) be useful together with others ?
  • Can two variables that are useless by themselves
    be useful together?

29
Ranking Criteria Correlation
  • correlation between variables and the target is
    not enough to assess relevance!
  • correlation / covariance between pairs of
    variables has to be considered too!
  • (potentially difficult)
  • diversity of features

30
Ranking Criteria Inf. Theory
  • Information Theoretic Criteria
  • Most approaches use (empirical estimates of)
    mutual information between features and the
    target
  • Case of discrete variables:
    I(xi, y) = sum over values of xi and y of
    P(xi, y) log [ P(xi, y) / (P(xi) P(y)) ]
  • (probabilities are estimated from frequency
    counts; see the sketch below)
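A minimal sketch of this plug-in estimate for a single discrete feature (frequency counts stand in for the true probabilities; names are illustrative):

import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) for discrete sequences xs, ys of equal length."""
    m = len(xs)
    p_xy = Counter(zip(xs, ys))            # joint frequency counts
    p_x, p_y = Counter(xs), Counter(ys)    # marginal frequency counts
    mi = 0.0
    for (x, y), c in p_xy.items():
        pxy = c / m
        mi += pxy * math.log2(pxy / ((p_x[x] / m) * (p_y[y] / m)))
    return mi

# toy usage: a feature that perfectly determines a binary target carries 1 bit
x = [0, 0, 1, 1] * 25
y = [0, 0, 1, 1] * 25
print(mutual_information(x, y))            # -> 1.0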

31
Ranking Criteria Inf. Theory
  • Mutual information can also detect non-linear
    dependencies among variables!
  • But it is harder to estimate than correlation!
  • It is a measure of how much information (in
    terms of entropy) two random variables share

32
mRMR
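The slide gives only the acronym. As background (not taken from these slides): mRMR, minimum Redundancy Maximum Relevance, builds on the mutual-information scores above and greedily adds, at each step, the feature f that is maximally relevant to the target y but minimally redundant with the already-selected set S, e.g.

  f* = argmax over f not in S of [ I(f; y) - (1/|S|) · sum over g in S of I(f; g) ]

so a candidate that shares much information with features already chosen is penalized.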
33
Variable Ranking - SVC
  • Single Variable Classifiers
  • Idea: select variables according to their
    individual predictive power
  • criterion: performance of a classifier built with
    1 variable
  • e.g. the value of the variable itself
  • (set a threshold on the value of the variable;
    see the sketch below)
  • predictive power is usually measured in terms of
    error rate (or criteria using fpr, fnr)
  • also combination of SVCs using ensemble methods
    (boosting, ...)
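A minimal sketch of scoring one variable by the best threshold classifier it supports, with error rate as the criterion (the exhaustive threshold search over observed values is an illustrative choice):

import numpy as np

def svc_score(x, y):
    """Score a single variable by the lowest error rate of a threshold classifier.

    x: (m,) feature values, y: (m,) binary labels in {0, 1}.
    Tries every observed value as a threshold, in both orientations.
    """
    best_error = 1.0
    for t in np.unique(x):
        for pred in (x >= t, x < t):                   # both orientations
            error = np.mean(pred.astype(int) != y)
            best_error = min(best_error, error)
    return 1.0 - best_error                            # higher score = more predictive

def rank_by_svc(X, y):
    """Rank variables by their single-variable predictive power."""
    scores = np.array([svc_score(X[:, i], y) for i in range(X.shape[1])])
    return np.argsort(scores)[::-1]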

34
Overview
  • Introduction/Motivation
  • Basic definitions, Terminology
  • Variable Ranking methods
  • Feature subset selection

35
Feature Subset Selection
  • Goal
  • - Find the optimal feature subset.
  • (or at least a good one.)
  • Classification of methods
  • Filters
  • Wrappers
  • Embedded Methods

36
Feature Subset Selection
  • You need
  • a measure for assessing the goodness of a feature
    subset (scoring function)
  • a strategy to search the space of possible
    feature subsets
  • Finding a minimal optimal feature set for an
    arbitrary target concept is NP-hard [9]
  • => Good heuristics are needed!

[9] E. Amaldi, V. Kann: The approximability of
minimizing nonzero variables and unsatisfied
relations in linear systems. (1997)
37
Feature Subset Selection
  • Filter Methods
  • Select subsets of variables as a pre-processing
    step, independently of the classifier used!
  • Note that Variable-Ranking feature selection is a
    filter method

38
Feature Subset Selection
  • Filter Methods
  • usually fast
  • provide generic selection of features, not tuned
    to a given learner (universal)
  • this is also often criticised (the feature set is
    not optimized for the classifier used)
  • sometimes used as a preprocessing step for other
    methods

39
Feature Subset Selection
  • Wrapper Methods
  • The learner is considered a black box
  • The interface of the black box is used to score
    subsets of variables according to the predictive
    power of the learner when using those subsets.
  • Results vary for different learners
  • One needs to define:
  • how to search the space of all possible variable
    subsets?
  • how to assess the prediction performance of a
    learner? (see the sketch below)
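A minimal sketch of a wrapper with greedy forward search and cross-validation as the performance estimate (scikit-learn and logistic regression are assumed here purely for illustration; any learner with a fit/predict interface would do):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, max_features=10, cv=5):
    """Greedy forward wrapper: repeatedly add the feature that most improves CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = 0.0
    while remaining and len(selected) < max_features:
        trial_scores = {
            f: cross_val_score(LogisticRegression(max_iter=1000),
                               X[:, selected + [f]], y, cv=cv).mean()
            for f in remaining
        }
        f_best, score = max(trial_scores.items(), key=lambda kv: kv[1])
        if score <= best_score:            # stop when no candidate improves the estimate
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = score
    return selected, best_score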

40
LASSO
Least Absolute Shrinkage and Selection Operator
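The slide gives only the name. The standard LASSO formulation (a well-known result, not shown in this transcript) fits a linear model with an L1 penalty on the weights:

  minimize over w:  (1/2) · || y - Xw ||² + lambda · || w ||1

The L1 term drives many weights to exactly zero, so the surviving non-zero weights act as an embedded form of feature selection; a larger lambda yields a sparser model (fewer selected features).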
41
1 and 2 Norm Regularization
42
Feature Selection
  • Existing Approaches
  • Filter-based methods: feature selection is
    independent of the prediction model (Information
    Gain, Minimum Redundancy Maximum Relevance, etc.)
  • Wrapper-based methods: feature selection is
    integrated in the prediction model, typically
    very slow, not able to incorporate domain
    knowledge (LASSO, SVM-RFE)
  • Our approach: Scalable Orthogonal Regression
  • Scalable: linear with respect to the numbers of
    features and samples
  • Interpretable: can incorporate domain knowledge

43
Feature Selection by Scalable Orthogonal
Regression
  • Model Accuracy
  • The selected risk factors are highly predictive
    of the target condition
  • Sparse feature selection through L1
    regularization
  • Minimal Correlations
  • Little correlation between the selected
    data-driven risk factors and existing
    knowledge-driven risk factors
  • Little correlation among the additional risk
    factors from data, to further ensure the quality
    of the additional factors

[Diagram of the SOR objective: a model-accuracy term, a sparse penalty, and penalties on the correlation among data-driven features and between data- and knowledge-driven features]
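Reading the four annotated terms together, a schematic form of such an objective (assembled only from the slide's annotations, not the exact formulation of the SDM 2012 paper) is:

  minimize over w:  || y - Xw ||²                                   (model accuracy)
                  + lambda1 · || w ||1                               (sparse penalty)
                  + lambda2 · correlation(data-driven, knowledge-driven features)
                  + lambda3 · correlation among the selected data-driven features

where the two correlation penalties push the selected data-driven risk factors to be nearly orthogonal to the knowledge-driven ones and to each other.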
Dijun Luo, Fei Wang, Jimeng Sun, Marianthi
Markatou, Jianying Hu, Shahram Ebadollahi: SOR:
Scalable Orthogonal Regression for Low-Redundancy
Feature Selection and its Healthcare
Applications. SDM 2012.
44
Experimental Setup
  • EHRs from 2003 to 2010 from an integrated
    care delivery network were utilized
  • 4,644 incident HF (heart failure) cases and 46K
    group-matched controls
  • Diagnoses, Symptoms, Medications and Labs are
    used
  • Evaluation metric is the Area under the ROC curve
    (AUC)
  • Knowledge-based risk factors

45
Prediction Results
  • AUC significantly improves as complementary
    data-driven risk factors are added to the
    existing knowledge-based risk factors.
  • A significant AUC increase occurs when we add the
    first 50 data-driven features

Jimeng Sun, Dijun Luo, Marianthi Markatou, Fei
Wang, Shahram Ebadollahi, Steven E. Steinhubl,
Zahra Daar, Walter F. Stewart. Combining
Knowledge and Data Driven Insights for
Identifying Risk Factors using Electronic Health
Records. AMIA (2012).
46
Top-10 Selected Data-driven Features
(the original table also gave each feature's relevancy to HF and its
category: Diagnosis, Medication, Lab, or Symptom)
  • Dyslipidemia
  • Thiazide-like Diuretics
  • Antihypertensive Combinations
  • Aminopenicillins
  • Bone density regulators
  • Natriuretic Peptide
  • Rales
  • Diuretic Combinations
  • S3 Gallop
  • NSAIDs
  • 9 out of 10 are considered relevant to HF
  • The data-driven features are complementary to the
    existing knowledge-driven features

47
Feature Subset Selection
  • Wrapper Methods

48
Feature Subset Selection
  • Embedded Methods
  • Specific to a given learning machine!
  • Perform variable selection (implicitly) in the
    process of training
  • E.g. the WINNOW algorithm
  • (a linear unit with multiplicative updates;
    see the sketch below)
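A minimal sketch of the Winnow update rule (the promotion/demotion variant with factor alpha = 2 and threshold theta = n is assumed; inputs are binary vectors):

import numpy as np

def winnow_train(X, y, alpha=2.0, epochs=10):
    """Winnow: a linear-threshold unit with multiplicative weight updates.

    X: (m, n) binary feature matrix, y: (m,) labels in {0, 1}.
    Weights of features active in a mistake are multiplied or divided by alpha,
    so weights of irrelevant features shrink quickly (implicit feature selection).
    """
    n = X.shape[1]
    w, theta = np.ones(n), float(n)        # initial weights 1, threshold n
    for _ in range(epochs):
        for x, target in zip(X, y):
            pred = int(w @ x >= theta)
            if pred == target:
                continue
            if target == 1:                 # false negative: promote active features
                w[x == 1] *= alpha
            else:                           # false positive: demote active features
                w[x == 1] /= alpha
    return w                                # large weights ~ selected (relevant) features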

49
THANK YOU!
50
Sources
  • Natasha Sigala, Nikos Logothetis: Visual
    categorization shapes feature selectivity in the
    primate visual cortex. Nature Vol. 415 (2002)
  • Ron Kohavi, George H. John: Wrappers for Feature
    Subset Selection. AIJ special issue on relevance
    (1996)
  • Isabelle Guyon, Steve Gunn: NIPS feature
    selection challenge.
    http://www.nipsfsc.ecs.soton.ac.uk/, 2003
  • Isabelle Guyon, Andre Elisseeff: An Introduction
    to Variable and Feature Selection. Journal of
    Machine Learning Research 3 (2003) 1157-1182
  • Daphne Koller, Mehran Sahami: Toward Optimal
    Feature Selection. 13th ICML (1996) p. 248-292
  • Nick Littlestone: Learning Quickly When
    Irrelevant Attributes Abound: A New
    Linear-threshold Algorithm. Machine Learning 2, p.
    285-318 (1987)
  • C. Ambroise, G.J. McLachlan: Selection bias in
    gene extraction on the basis of microarray
    gene-expression data. PNAS Vol. 99:
    6562-6566 (2002)
  • E. Amaldi, V. Kann: The approximability of
    minimizing nonzero variables and unsatisfied
    relations in linear systems. (1997)