1
Lecture 5: Feature Selection (adapted from Elena Marchiori's slides)
Bioinformatics Data Analysis and Tools
bvhoute@few.vu.nl
2
What is feature selection?
  • Reducing the feature space by removing some of
    the (non-relevant) features.
  • Also known as:
  • variable selection
  • feature reduction
  • attribute selection
  • variable subset selection

3
Why select features?
  • It is cheaper to measure fewer variables.
  • The resulting classifier is simpler and
    potentially faster.
  • Prediction accuracy may improve by discarding
    irrelevant variables.
  • Identifying relevant variables gives more insight
    into the nature of the corresponding
    classification problem (biomarker detection).
  • Alleviate the curse of dimensionality.

4
Why select features?
[Figure: correlation plots of the 3-class leukemia data, without feature selection and with the top 100 features selected by variance; colour scale from -1 to 1.]
5
The curse of dimensionality
  • Term introduced by Richard Bellman [1].
  • Problems caused by the exponential increase in
    volume associated with adding extra dimensions to
    a (mathematical) space.
  • So the size of the problem space grows exponentially with the number
    of variables/features.

[1] Bellman, R.E. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.
6
The curse of dimensionality
  • A high-dimensional feature space leads to problems in, for example:
  • Machine learning: danger of overfitting with too many variables.
  • Optimization: finding the global optimum is (virtually) infeasible in a
    high-dimensional space.
  • Microarray analysis: the number of features (genes) is much larger than
    the number of objects (samples), so a huge number of observations would
    be needed to obtain a good estimate of the function of a gene.

7
Approaches
  • Wrapper
  • Feature selection takes into account the
    contribution to the performance of a given type
    of classifier.
  • Filter
  • Feature selection is based on an evaluation criterion that quantifies
    how well features (or feature subsets) discriminate between the classes.
  • Embedded
  • Feature selection is part of the training
    procedure of a classifier (e.g. decision trees).

8
Embedded methods
  • Attempt to jointly or simultaneously train both a
    classifier and a feature subset.
  • Often optimize an objective function that jointly
    rewards accuracy of classification and penalizes
    use of more features.
  • Intuitively appealing.
  • Example: tree-building algorithms (see the sketch below).

Adapted from J. Fridlyand
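A minimal sketch of the embedded idea, assuming scikit-learn and a synthetic data matrix (neither is part of the original slides): a tree ensemble is trained on all features, and the importances it produces as a by-product of training are used to keep a small subset of genes.

# Embedded selection sketch: the classifier itself ranks the features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(38, 7129))    # 38 samples x 7129 gene expression values (synthetic)
y = rng.integers(0, 2, size=38)    # 0 = ALL, 1 = AML (synthetic labels)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_        # one impurity-based score per gene
top_genes = np.argsort(importances)[::-1][:100]  # indices of the 100 most-used genes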
9
Approaches to Feature Selection
[Diagram: in the filter approach, the input features are first reduced by a distance-metric score and the model is trained on the selected features; in the wrapper approach, a feature-selection search proposes feature sets, the model is trained on each, and the importance of the features given by the model feeds back into the search.]
Adapted from Shin and Jasso
10
Filter methods
[Diagram: feature selection reduces the p input features to a subset of S features (S << p), which is then passed to classifier design.]
  • Features are scored independently and the top S are used by the
    classifier (see the sketch below).
  • Scores: correlation, mutual information, t-statistic, F-statistic,
    p-value, tree importance statistic, etc.

Easy to interpret. Can provide some insight into
the disease markers.
Adapted from J. Fridlyand
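A minimal filter sketch, assuming NumPy/SciPy and binary 0/1 labels (assumptions, not from the original slides): each gene is scored independently with a two-sample t-statistic and the S highest-scoring genes are kept.

# Filter sketch: independent per-gene scores, keep the top S (S << p).
import numpy as np
from scipy import stats

def top_s_by_ttest(X, y, S=100):
    # X: samples x genes, y: binary class labels (0/1)
    t, _ = stats.ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    return np.argsort(np.abs(t))[::-1][:S]   # indices of the S most discriminative genes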
11
Problems with filter methods
  • Redundancy in selected features: features are considered independently
    and not measured on the basis of whether they contribute new information.
  • Interactions among features generally cannot be explicitly incorporated
    (some filter methods are smarter than others).
  • The classifier has no say in which features should be used: some scores
    may be more appropriate in conjunction with some classifiers than with
    others.

Adapted from J. Fridlyand
12
Wrapper methods
[Diagram: the p input features are reduced to a subset of S features (S << p); in the wrapper setting, feature selection and classifier design are performed iteratively.]
  • Iterative approach: many feature subsets are scored based on
    classification performance and the best one is used (see the sketch
    below).

Adapted from J. Fridlyand
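A minimal wrapper sketch, assuming scikit-learn (not part of the original slides): greedy forward selection, in which every candidate subset is scored by the cross-validated accuracy of the classifier itself (here a linear SVM). It also illustrates the cost discussed on the next slide, since each step refits the classifier once per remaining feature.

# Wrapper sketch: greedy forward selection driven by the classifier's CV accuracy.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def forward_select(X, y, n_features=10):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_features:
        scores = [(cross_val_score(LinearSVC(dual=False), X[:, selected + [f]], y, cv=5).mean(), f)
                  for f in remaining]                 # one classifier per candidate feature
        best_score, best_f = max(scores)              # keep the feature that helps most
        selected.append(best_f)
        remaining.remove(best_f)
    return selected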
13
Problems with wrapper methods
  • Computationally expensive: for each feature subset to be considered, a
    classifier must be built and evaluated.
  • No exhaustive search is possible (2^p subsets to consider); generally
    greedy algorithms only.
  • Easy to overfit.

Adapted from J. Fridlyand
14
Example: Microarray Analysis
[Workflow: 38 labeled bone marrow samples (27 ALL, 11 AML), each with 7129 gene expression values, are used to train a model (Neural Networks, Support Vector Machines, Bayesian nets, etc.) and to identify key genes; the trained model then classifies 34 new, unlabeled bone marrow samples as AML or ALL.]
15
Microarray Data: Challenges to Machine Learning Algorithms
  • Few samples for analysis (38 labeled).
  • Extremely high-dimensional data (7129 gene
    expression values per sample).
  • Noisy data.
  • Complex underlying mechanisms, not fully
    understood.

16
Some genes are more useful than others for
building classification models
Example: genes 36569_at and 36495_at are useful.
17
Some genes are more useful than others for
building classification models
Example: genes 36569_at and 36495_at are useful (scatter plot with the AML and ALL samples labelled).
18
Some genes are more useful than others for
building classification models
Example: genes 37176_at and 36563_at are not useful.
19
Importance of feature (gene) selection
  • The majority of genes are not directly related to leukemia.
  • Having a large number of features enhances the model's flexibility, but
    makes it prone to overfitting.
  • Noise and the small number of training samples make this even more
    likely.
  • Some types of models, like kNN, do not scale well with many features.

20
How do we choose the most relevant of the 7129 genes?
  1. Use a distance metric to capture class separation.
  2. Rank genes according to their distance metric score.
  3. Choose the top n ranked genes.

[Figure: an example gene with a HIGH score vs. a gene with a LOW score.]
21
Distance metrics
  • Tamayo's relative class separation
  • t-test
  • Bhattacharyya distance (see the formulas below)
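As a reminder of what such per-gene scores look like (standard textbook forms, stated here as an assumption since the original slide's exact formulas are not in the transcript): for a gene g with per-class means \bar{x}_{g,1}, \bar{x}_{g,2}, variances s_{g,1}^2, s_{g,2}^2 and class sizes n_1, n_2,

t_g = \frac{\bar{x}_{g,1} - \bar{x}_{g,2}}{\sqrt{s_{g,1}^2 / n_1 + s_{g,2}^2 / n_2}}
\qquad
D_B(g) = \frac{1}{4}\,\frac{(\bar{x}_{g,1} - \bar{x}_{g,2})^2}{s_{g,1}^2 + s_{g,2}^2}
       + \frac{1}{2}\,\ln\frac{s_{g,1}^2 + s_{g,2}^2}{2\,s_{g,1} s_{g,2}}

(the Welch t-statistic and the Bhattacharyya distance between two univariate Gaussians).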

22
SVM-RFE wrapper
  • Recursive Feature Elimination
  • Train a linear SVM → linear decision function.
  • Use the absolute values of the variable weights to rank the variables.
  • Remove the half of the variables with the lowest ranks.
  • Repeat the above steps (train, rank, remove) on the data restricted to
    the variables not yet removed.
  • Output: the remaining subset of variables.

23
SVM-RFE
  • Linear binary classifier decision function: f(x) = w · x + b, with the
    class given by the sign of f(x).
  • Recursive Feature Elimination (SVM-RFE): at each iteration,
  • eliminate the fraction (threshold) of variables with the lowest scores;
  • recompute the scores of the remaining variables (see the sketch below).
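A minimal SVM-RFE sketch, assuming scikit-learn's LinearSVC (any linear SVM exposing its weight vector would do): train, rank the genes by their squared weights, drop the lower-ranked half, and repeat on the survivors.

# SVM-RFE sketch: recursively eliminate the half of the genes with the smallest |w_i|.
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep=100):
    remaining = np.arange(X.shape[1])                  # indices of the genes still in play
    while remaining.size > n_keep:
        svm = LinearSVC(dual=False).fit(X[:, remaining], y)
        order = np.argsort(svm.coef_[0] ** 2)          # smallest weights first
        keep = max(n_keep, remaining.size // 2)        # never drop below n_keep
        remaining = np.sort(remaining[order[-keep:]])  # keep the better half
    return remaining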

24
SVM-RFE: I. Guyon et al., Machine Learning, 46, 389-422, 2002.
25
RELIEF
  • Idea: relevant features make (1) nearest examples of the same class
    closer and (2) nearest examples of opposite classes farther apart.
  • Initialize the weights of all features to zero.
  • For each example in the training set:
  • find the nearest example from the same class (hit) and from the
    opposite class (miss);
  • update the weight of each feature by adding
    abs(example - miss) - abs(example - hit).
RELIEF: Kira, K. and Rendell, L., 10th Int. Conf. on AI, 129-134, 1992.
26
RELIEF Algorithm
  • RELIEF assigns weights to variables based on how
    well they separate samples from their nearest
    neighbors (nnb) from the same and from the
    opposite class.
  • RELIEF
  • input: X (two classes)
  • output: W (weights assigned to variables)
  • nr_var = total number of variables
  • weights = zero vector of size nr_var
  • for all x in X do
  •   hit(x) = nnb of x from the same class
  •   miss(x) = nnb of x from the opposite class
  •   weights += abs(x - miss(x)) - abs(x - hit(x))
  • end
  • nr_ex = number of examples in X
  • return W = weights / nr_ex
  • Note: variables have to be normalized (e.g., divide each variable by its
    (max - min) value); a NumPy sketch follows below.
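A direct NumPy translation of the pseudocode above (an illustrative sketch; Manhattan distance is assumed for the nearest-neighbour search here, whereas the worked example on the following slides uses 1 - Pearson correlation):

# RELIEF sketch for two classes; X is assumed to be (max - min)-normalized already.
import numpy as np

def relief(X, y):
    weights = np.zeros(X.shape[1])
    for i, x in enumerate(X):
        dist = np.abs(X - x).sum(axis=1)                      # distance from x to every example
        dist[i] = np.inf                                      # never pick the example itself
        same, other = (y == y[i]), (y != y[i])
        hit  = X[np.where(same)[0][np.argmin(dist[same])]]    # nearest hit
        miss = X[np.where(other)[0][np.argmin(dist[other])]]  # nearest miss
        weights += np.abs(x - miss) - np.abs(x - hit)
    return weights / len(X)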

27
RELIEF example
Gene expressions for two types of leukemia:
- 3 patients with AML (Acute Myeloid Leukemia)
- 3 patients with ALL (Acute Lymphoblastic Leukemia)
  • What are the weights of genes 1-5, assigned by
    RELIEF?

28
RELIEF normalization
First, apply (max - min) normalization:
- identify the max and min value of each feature (gene);
- divide all values of each feature by the corresponding (max - min).
Example: a value of 3 for a gene with max 6 and min 1 becomes 3 / (6 - 1) = 0.6 (see the sketch below).
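The same scaling in NumPy (an illustrative sketch, not from the original slides):

# (max - min) normalization: divide each gene by its range, e.g. 3 / (6 - 1) = 0.6.
import numpy as np

def max_min_normalize(X):
    return X / (X.max(axis=0) - X.min(axis=0))   # X: samples x genes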
29
RELIEF distance matrix
Using the data after normalization, calculate the distance matrix.
Distance measure: 1 - Pearson correlation (see the sketch below).
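A one-line NumPy sketch of that distance matrix (assuming the rows of X are the samples):

# Pairwise sample distances: 1 - Pearson correlation between rows of X.
import numpy as np

def correlation_distance_matrix(X):
    return 1.0 - np.corrcoef(X)   # (n_samples x n_samples) matrix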
30
RELIEF 1st iteration
RELIEF, iteration 1: AML1
31
RELIEF 2nd iteration
RELIEF, iteration 2: AML2
32
RELIEF results (after 6th iteration)
Weights after the last iteration.
The last step is to sort the features by their weights and select the features with the highest ranks.
33
RELIEF
  • Advantages
  • Fast.
  • Easy to implement.
  • Disadvantages
  • Does not filter out redundant features, so
    features with very similar values could be
    selected.
  • Not robust to outliers.
  • Classic RELIEF can only handle data sets with two
    classes.

34
Extension of RELIEF: RELIEF-F
  • Extension for multi-class problems.
  • Instead of finding one near miss, the algorithm finds one near miss for
    each different class and averages their contributions when updating the
    weights.

RELIEF-F
  input: X (two or more classes C)
  output: W (weights assigned to variables)
  nr_var = total number of variables
  weights = zero vector of size nr_var
  for all x in X do
    hit(x) = nnb of x from the same class
    sum_miss = 0
    for all classes c in C other than the class of x do
      miss(x, c) = nnb of x from class c
      sum_miss += abs(x - miss(x, c)) / nr_examples(c)
    end
    weights += sum_miss - abs(x - hit(x))
  end
  nr_ex = number of examples in X
  return W = weights / nr_ex
(A NumPy sketch follows below.)
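As with RELIEF, a direct NumPy translation of this pseudocode (an illustrative sketch; Manhattan distance is again assumed for the nearest-neighbour search):

# RELIEF-F sketch: one near miss per other class, weighted by that class's size.
import numpy as np

def relieff(X, y):
    weights = np.zeros(X.shape[1])
    for i, x in enumerate(X):
        dist = np.abs(X - x).sum(axis=1)
        dist[i] = np.inf                              # exclude the example itself
        same = np.where(y == y[i])[0]
        hit = X[same[np.argmin(dist[same])]]          # nearest hit
        sum_miss = np.zeros(X.shape[1])
        for c in np.unique(y):
            if c == y[i]:
                continue
            idx = np.where(y == c)[0]
            miss = X[idx[np.argmin(dist[idx])]]       # nearest miss from class c
            sum_miss += np.abs(x - miss) / idx.size   # divide by nr_examples(c), as above
        weights += sum_miss - np.abs(x - hit)
    return weights / len(X)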