Transcript and Presenter's Notes

Title: Guest lecture: Feature Selection


1
Guest lecture: Feature Selection
  • Alan Qi
  • yuanqi@mit.edu
  • Dec 2, 2004

2
Outline
  • Problems
  • Overview of feature selection (FS)
  • Filtering: correlation, information criteria
  • Wrapper approach: greedy FS, regularization
  • Classical Bayesian feature selection
  • New Bayesian approach: predictive-ARD

3
Feature selection
  • Gene expression: thousands of gene measurements
  • Documents: bag-of-words model with more than
    10,000 words
  • Images: histograms, colors, wavelet coefficients,
    etc.
  • Task: find a small subset of features for
    prediction

4
Gene Expression Classification
Task: classify gene expression datasets into
different categories, e.g., normal vs. cancer.
Challenge: thousands of genes are measured in the
microarray data, but only a small subset of those
genes is likely to be correlated with the
classification task.
5
Filtering approach
  • Feature ranking based on sensible criteria
  • Correlation between features and labels
  • Mutual information between features and labels
    (see the ranking sketch after this list)
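
A minimal sketch of filter-style ranking by mutual information. It assumes
scikit-learn-style inputs (X as a sample-by-feature array, y as labels); the
function name rank_features and the cutoff k are illustrative, not from the
slides.

    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def rank_features(X, y, k=100):
        # Score each feature by its mutual information with the labels,
        # then keep the indices of the k highest-scoring features.
        scores = mutual_info_classif(X, y)
        return np.argsort(scores)[::-1][:k]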

6
Wrapper Approach
  • Assess subsets of variables according to their
    usefulness to a given predictor. A combinatorial
    problem: 2^K subsets given K features.
  • Sequentially adding/removing features: Sequential
    Forward Selection (SFS), Sequential Backward
    Selection (SBS). (A greedy forward-selection
    sketch follows this list.)
  • Recursively adding/removing features: Sequential
    Forward Floating Selection (SFFS). (When to stop?
    Overfitting?)
  • Regularization: use a sparse prior to enhance the
    sparsity of the trained predictor (classifier).
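
A minimal sketch of greedy Sequential Forward Selection. It assumes a
scikit-learn-style estimator and uses cross-validated accuracy as the
usefulness score; the function name, the cv=5 choice, and the stopping rule
are illustrative assumptions.

    from sklearn.model_selection import cross_val_score

    def sequential_forward_selection(model, X, y, max_features=20):
        # Greedily add the single feature that most improves
        # cross-validated accuracy; stop when nothing helps.
        selected, remaining = [], list(range(X.shape[1]))
        best_score = -float("inf")
        while remaining and len(selected) < max_features:
            scores = {j: cross_val_score(model, X[:, selected + [j]], y, cv=5).mean()
                      for j in remaining}
            j_best = max(scores, key=scores.get)
            if scores[j_best] <= best_score:
                break
            best_score = scores[j_best]
            selected.append(j_best)
            remaining.remove(j_best)
        return selected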

7
Regularization
Labels: t = {t_1, ..., t_N}; inputs: X = {x_1, ..., x_N};
parameters: w. Likelihood for the data set
(for classification):

  p(t | X, w) = ∏_{i=1}^N p(t_i | x_i, w)

Regularization: combine the fit to the data with a
penalty for complexity by minimizing

  E(w) = -log p(t | X, w) + λ · penalty(w),

where a sparsity-inducing penalty (e.g., the L1 norm
||w||_1, the counterpart of a sparse prior) drives many
weights to exactly zero.
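
A concrete illustration of the penalized fit, using L1-regularized logistic
regression as a stand-in for a sparsity-inducing prior. The synthetic data,
the solver, and the value C=0.1 are assumptions for the sketch, not choices
made in the lecture.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    # Synthetic data: 500 samples, 100 features, only 5 of them informative.
    X, y = make_classification(n_samples=500, n_features=100,
                               n_informative=5, random_state=0)

    # The L1 penalty drives many weights to exactly zero, i.e. it selects features.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
    kept = np.flatnonzero(clf.coef_.ravel())
    print("kept", kept.size, "of", X.shape[1], "features")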
8
Bayesian feature selection
  • Background
  • Bayesian classification model
  • Automatic relevance determination (ARD)
  • Risk of Overfitting by optimizing hyperparameters
  • Predictive ARD by expectation propagation (EP)
  • Approximate prediction error
  • EP approximation
  • Experiments
  • Conclusions

9
Motivation
  • Task 1: classify high-dimensional datasets with
    many irrelevant features, e.g., normal vs. cancer
    microarray data.
  • Task 2: build sparse Bayesian kernel classifiers
    for fast test performance.

10
Bayesian Classification Model
Labels t, inputs X, parameters w. Likelihood for the
data set:

  p(t | X, w) = ∏_{i=1}^N Ψ(t_i w^T x_i)

Prior on the classifier weights w:

  p(w | α) = ∏_j N(w_j ; 0, 1/α_j)

where Ψ(·) is the cumulative distribution function of
a standard Gaussian.
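
A small numerical sketch of this likelihood, with Ψ implemented via scipy's
standard-normal CDF; labels are assumed to be coded as t_i in {-1, +1} and
the variable names are illustrative.

    import numpy as np
    from scipy.stats import norm

    def log_likelihood(w, X, t):
        # log p(t | X, w) = sum_i log Psi(t_i * w^T x_i), probit link.
        return np.sum(norm.logcdf(t * (X @ w)))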
11
Evidence and Predictive Distribution
The evidence, i.e., the marginal likelihood of the
hyperparameters, and the predictive posterior
distribution of the label for a new input:
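
In the notation of the previous slide, these quantities take the standard
forms (written out here as a reconstruction in that notation):

  p(t | X, α) = ∫ p(t | X, w) p(w | α) dw

  p(t_new | x_new, t, X, α) = ∫ p(t_new | x_new, w) p(w | t, X, α) dw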
12
Automatic Relevance Determination (ARD)
  • Give the classifier weights independent Gaussian
    priors whose variances, 1/α_j, control how far
    away from zero each weight is allowed to go.
  • Maximize p(t | X, α), the marginal likelihood of
    the model, with respect to α.
  • Outcome: many elements of α go to infinity, which
    naturally prunes irrelevant features in the data.

13
Two Types of Overfitting
  • Classical maximum likelihood:
  • Optimizing the classifier weights w can directly
    fit noise in the data, resulting in a complicated
    model.
  • Type II maximum likelihood (ARD):
  • Optimizing the hyperparameters corresponds to
    choosing which variables are irrelevant. Choosing
    one out of exponentially many models can also
    overfit if we maximize the model marginal
    likelihood.

14
Risk of Optimizing Hyperparameters
  • (Figure: X = Class 1 vs. O = Class 2)

15
Outline
  • Background
  • Bayesian classification model
  • Automatic relevance determination (ARD)
  • Risk of Overfitting by optimizing hyperparameters
  • Predictive ARD by expectation propagation (EP)
  • Approximate prediction error
  • EP approximation
  • Experiments
  • Conclusions

16
Predictive-ARD
  • Choosing the model with the best estimated
    predictive performance instead of the most
    probable model.
  • Expectation propagation (EP) estimates the
    leave-one-out predictive performance without
    performing any expensive cross-validation.

17
Estimate Predictive Performance
  • Predictive posterior given a test data point.
  • EP can estimate the predictive leave-one-out error
    probability, where q(w | t\i) is the approximate
    posterior obtained by leaving out the ith label.
  • EP can also estimate the predictive leave-one-out
    error count. (Both estimates are sketched below.)
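
In the notation above, the leave-one-out estimates take approximately the
following form (a sketch assuming the probit model of the earlier slides):

  p(t_i | x_i, t\i) ≈ ∫ Ψ(t_i w^T x_i) q(w | t\i) dw

  LOO error probability ≈ (1/N) Σ_{i=1}^N [ 1 - p(t_i | x_i, t\i) ]

  LOO error count ≈ Σ_{i=1}^N 1{ p(t_i | x_i, t\i) < 1/2 }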

18
Expectation Propagation in a Nutshell
  • Approximate a probability distribution by
    simpler parametric terms
  • Each approximation term lives in an exponential
    family (e.g. Gaussian), as sketched below
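
Sketched in formulas (assuming the Gaussian-family choice mentioned above),
EP replaces each exact likelihood term by a simpler approximate term:

  p(w | t) ∝ p(w) ∏_{i=1}^N p(t_i | w)   ≈   q(w) ∝ p(w) ∏_{i=1}^N t̃_i(w)

where each t̃_i(w) is an exponential-family (e.g., Gaussian) factor.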

19
EP in a Nutshell
  • Three key steps (sketched in formulas below):
  • Deletion step: approximate the leave-one-out
    predictive posterior for the ith point.
  • Moment matching: minimize a KL divergence by
    matching moments within the chosen exponential
    family.
  • Inclusion step: fold the refined term back into
    the approximate posterior.

The key observation: we can use the approximate
predictive posterior, obtained in the deletion
step, for model selection. No extra computation!
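
A compact sketch of the three steps in the notation above, assuming Gaussian
approximating factors t̃_i(w) as on the previous slide:

  Deletion:      q^{\i}(w) ∝ q(w) / t̃_i(w)

  Moment match:  q_new(w) = argmin over Gaussians q' of
                 KL( p(t_i | w) q^{\i}(w)  ||  q'(w) )

  Inclusion:     t̃_i(w) ∝ q_new(w) / q^{\i}(w)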
20
Comparison of different model selection criteria
for ARD training
The estimated leave-one-out error probabilities
and counts correlate better with the test error
than the evidence and the sparsity level do.
  • 1st row: test error
  • 2nd row: estimated leave-one-out error probability
  • 3rd row: estimated leave-one-out error counts
  • 4th row: evidence (model marginal likelihood)
  • 5th row: fraction of selected features

21
Gene Expression Classification
  • Task: classify gene expression datasets into
    different categories, e.g., normal vs. cancer.
  • Challenge: thousands of genes are measured in the
    microarray data, but only a small subset of genes
    is likely to be correlated with the classification
    task.

22
Classifying Leukemia Data
  • The task: distinguish acute myeloid leukemia
    (AML) from acute lymphoblastic leukemia (ALL).
  • The dataset: 47 and 25 samples of type ALL and
    AML, respectively, with 7129 features per sample.
  • The dataset was randomly split 100 times into 36
    training and 36 testing samples.

23
Classifying Colon Cancer Data
  • The task: distinguish normal from cancer samples.
  • The dataset: 22 normal and 40 cancer samples with
    2000 features per sample.
  • The dataset was randomly split 100 times into 50
    training and 12 testing samples.
  • SVM results are from Li et al. (2002).

24
Bayesian Sparse Kernel Classifiers
  • Using feature/kernel expansions defined on the
    training data points (see the expansion below).
  • Predictive-ARD-EP trains a classifier that
    depends on only a small subset of the training set.
  • Fast test performance.
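
Sketched in formulas, each training point contributes one basis function, so
the classifier takes a kernel-expansion form (the kernel k and the absence of
a bias term are assumptions of this sketch):

  f(x) = Σ_{i=1}^N w_i k(x, x_i)

Applying ARD to the weights w_i then prunes training points rather than input
features, which is what makes test-time evaluation fast.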

25
Test error rates and numbers of relevance vectors or
support vectors on the breast cancer dataset.
  • 50 partitionings of the data were used. All
    these methods use the same Gaussian kernel with
    kernel width 5. The trade-off parameter C in
    SVM is chosen via 10-fold cross-validation for
    each partition.

26
Test error rates on diabetes data
  • 100 partitionings of the data were used.
    Evidence and Predictive ARD-EPs use the Gaussian
    kernel with kernel width 5.

27
Summary
  • Two kinds of feature selection methods:
    filtering and wrapper methods.
  • Classical Bayesian feature selection:
  • An excellent classical approach: tuning the prior
    to prune features.
  • However, maximizing the marginal likelihood can
    lead to overfitting in the model space if there
    are a lot of features.
  • New Bayesian approach: Predictive-ARD, which
    focuses on prediction performance, for
  • feature selection
  • sparse kernel learning