Feature Selection

1
Feature Selection
  • Carl Staelin

2
Motivation
  • Document vector spaces are HUGE
  • Many algorithms are sensitive to the number of
    parameters
  • It is likely that only a few of the parameters
    are really important anyway
  • So let's get rid of the useless parameters!

3
Feature Selection
  • Feature selection reduces the number of
    parameters
  • Usually by eliminating parameters
  • Sometimes by transforming parameters
  • Latent Semantic Indexing using Singular Value
    Decomposition is a variant on this theme and will
    be discussed in another lecture
  • Method may depend on problem type
  • For classification and filtering, may use
    information from example documents to guide
    selection

4
Question
  • How do you know which features can be pruned?
  • Genetic Algorithms
    • Try lots of subsets and choose the best
  • Mutual Information
    • Iteratively eliminate features with the least mutual
      information with the other remaining features

5
Problem Introduction
  • Many features are insignificant
  • e.g. "the", "a", "and", ...
  • Usually the significance of a feature is independent
    of the other features
  • Sometimes, pairs of features give far more
    information than either feature in isolation
  • Sometimes feature significance is related to the
    specific task or query

6
Simple Approach
  • Suppose you are building a document
    classification system
  • You have sample documents that are
    relevant/irrelevant to the class
  • There are N features in total
  • You will use classification algorithm A

7
Simple Approach (cont'd)
  • F = set of features
  • Evaluate precision/recall of A using F
  • Choose a precision/recall threshold
  • For each feature f in F
    • F' = F - {f}
    • Evaluate precision/recall of A using F'
    • If performance is still above the threshold
      • F = F' (a sketch of this loop follows below)
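A minimal Python sketch of this greedy backward-elimination loop, assuming the caller supplies the feature set and an evaluate() callable that trains classifier A with the given features and returns a precision/recall score (e.g. F1); both names are placeholders, not part of the original slides.

    def greedy_feature_elimination(features, evaluate, tolerated_drop=0.01):
        # `evaluate(feature_set)` is assumed to train classifier A and
        # return a precision/recall score such as F1.
        F = set(features)
        baseline = evaluate(F)
        threshold = baseline - tolerated_drop   # chosen precision/recall threshold

        for f in list(F):
            F_prime = F - {f}                   # F' = F - {f}
            if evaluate(F_prime) >= threshold:  # still above the threshold
                F = F_prime                     # drop f permanently: F = F'
        return F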

8
Feature Goodness Measures
  • Class-independent measures
    • Document frequency
    • Term strength
    • Feature mutual information
  • Measures relative to a class or classes of documents
    • Information gain
    • Mutual information
    • χ² statistic

9
Document Frequency
  • This is the simplest feature selection method
  • Based on Zipf's Law
  • Remove the N most common terms
  • Remove terms that appear in fewer than M
    documents (M usually 1 or 2); a sketch follows below
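A small sketch of document-frequency filtering; the parameter names n_most_common and min_docs stand in for the slide's N and M and are illustrative choices.

    from collections import Counter

    def document_frequency_filter(docs, n_most_common, min_docs=2):
        # `docs` is a list of tokenized documents (lists of terms)
        df = Counter()
        for doc in docs:
            df.update(set(doc))               # count each term once per document
        stop = {t for t, _ in df.most_common(n_most_common)}   # N most common terms
        return {t for t, n in df.items() if t not in stop and n >= min_docs}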

10
Term Strength
  • Term strength does not rely on predefined
    categories
  • For x and y any pair of related documents
    (documents which are closer than some distance)
  • s(t) = P(t ∈ y | t ∈ x)
  • Retain those features where s(t) > threshold
    (a sketch follows below)
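A rough sketch of estimating term strength from related document pairs; the related() predicate (for example, cosine similarity above some cutoff) and the retention threshold are assumptions, as is counting both orderings of each related pair.

    from itertools import combinations

    def term_strength(docs, related, thresh=0.5):
        # docs: list of documents, each a set of terms
        # related(x, y): True if the pair is "closer than some distance"
        in_x, in_both = {}, {}
        for x, y in combinations(docs, 2):
            if not related(x, y):
                continue
            for a, b in ((x, y), (y, x)):     # count both orderings of the pair
                for t in a:
                    in_x[t] = in_x.get(t, 0) + 1
                    if t in b:
                        in_both[t] = in_both.get(t, 0) + 1
        # s(t) = P(t in y | t in x), estimated over related pairs
        return {t for t, n in in_x.items() if in_both.get(t, 0) / n > thresh}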

11
Feature Mutual Information
  • A = number of documents containing both ti and tj
  • B = number of documents containing ti but not tj
  • C = number of documents containing tj but not ti
  • N = total number of documents
  • I(ti, tj) = log( P(ti and tj) / (P(ti) · P(tj)) )
  • I(ti, tj) ≈ log( A·N / ((A + C) · (A + B)) )
    (a sketch follows below)
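A one-function sketch of this co-occurrence approximation, assuming document-frequency counts are already available (df_i = A + B, df_j = A + C, df_both = A); the epsilon guard is an implementation detail, not part of the slide.

    import math

    def pairwise_mutual_information(df_i, df_j, df_both, n_docs, eps=1e-12):
        # I(ti, tj) ~ log( A*N / ((A + C) * (A + B)) )
        return math.log((df_both * n_docs + eps) / (df_i * df_j + eps))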

12
MST-FeatureSelection
  • ∀ i, j, i ≠ j: compute I(ti, tj)
  • Build the maximal spanning tree MST of the complete
    graph G with edge weights I(ti, tj)
  • FMST ← direct-arcs(MST), R ← {ti} (all features)
  • While (|R| > k) do
    • If ∃ v, where v is a node representing feature tv,
      that is not connected to any other node in FMST
      • Then R ← R - {tv}, FMST ← FMST - {v}
    • Else remove the least-weighted arc from FMST
      (a sketch follows below)
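A sketch of this pruning loop using networkx for the maximum spanning tree; the tree is kept undirected here (whether a node has become isolated is the same either way), and the pairwise mutual-information weights are assumed precomputed.

    import networkx as nx

    def mst_feature_selection(mi, k):
        # mi: dict mapping (ti, tj) pairs to I(ti, tj)
        g = nx.Graph()
        for (ti, tj), w in mi.items():
            g.add_edge(ti, tj, weight=w)
        tree = nx.maximum_spanning_tree(g, weight="weight")
        R = set(tree.nodes)

        while len(R) > k:
            isolated = [v for v in tree.nodes if tree.degree(v) == 0]
            if isolated:                      # drop a feature with no remaining arcs
                v = isolated[0]
                R.discard(v)
                tree.remove_node(v)
            else:                             # otherwise cut the weakest arc
                u, v, _ = min(tree.edges(data="weight"), key=lambda e: e[2])
                tree.remove_edge(u, v)
        return R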

13
PIL-FeatureSelection
  • ∀ i, j, i ≠ j: compute I(ti, tj)
  • R ← {ti} (all features)
  • While (|R| > k) do
    • ∀ ti ∈ R, compute σi = Σ over tj ∈ R - {ti} of I(ti, tj)
    • ti ← feature with minimal σi
    • R ← R - {ti} (see the sketch below)
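A short sketch of this greedy loop, assuming mi(ti, tj) returns the precomputed pairwise mutual information and k is the target vocabulary size.

    def pil_feature_selection(terms, mi, k):
        R = set(terms)
        while len(R) > k:
            # sigma_i = sum of I(ti, tj) over the other remaining features
            sigma = {ti: sum(mi(ti, tj) for tj in R if tj != ti) for ti in R}
            R.remove(min(sigma, key=sigma.get))   # drop the least-connected feature
        return R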

14
Information Gain
  • For term t and m categories ci
  • P(ci) = probability a document is in ci
  • P(t) = probability t appears in a document
  • P(ci|t) = probability a document is in ci given that
    t appears
  • P(ci|¬t) = probability a document is in ci given that
    t does not appear
  • G(t) = -Σ P(ci) log P(ci)
           + P(t) Σ P(ci|t) log P(ci|t)
           + P(¬t) Σ P(ci|¬t) log P(ci|¬t)
  • Retain those features where G(t) > threshold
    (a sketch follows below)
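A direct transcription of G(t) in Python, assuming the probabilities have already been estimated from the training documents; skipping zero probabilities to avoid log(0) is an implementation detail added here.

    import math

    def information_gain(p_c, p_t, p_c_given_t, p_c_given_not_t):
        # p_c, p_c_given_t, p_c_given_not_t: lists over the m categories ci
        def plogp(ps):
            return sum(p * math.log(p) for p in ps if p > 0)
        return (-plogp(p_c)
                + p_t * plogp(p_c_given_t)
                + (1.0 - p_t) * plogp(p_c_given_not_t))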

15
Mutual Information
  • A = number of documents in c containing t
  • B = number of documents not in c containing t
  • C = number of documents in c not containing t
  • N = total number of documents
  • I(t, c) = log( P(t and c) / (P(t) · P(c)) )
  • I(t, c) ≈ log( A·N / ((A + C) · (A + B)) )
  • Iavg(t) = Σ P(ci) · I(t, ci)
  • Imax(t) = max over i of I(t, ci)
  • Retain those features where I(t) > threshold
    (a sketch follows below)
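A sketch of the count-based approximation plus the averaged and maximum variants; counts_per_class is an assumed structure holding (A, B, C) for each category, and priors holds the P(ci) estimates.

    import math

    def mi(a, b, c, n, eps=1e-12):
        # I(t, c) ~ log( A*N / ((A + C) * (A + B)) )
        return math.log((a * n + eps) / ((a + c) * (a + b) + eps))

    def mi_avg_max(counts_per_class, priors, n):
        scores = [mi(a, b, c, n) for a, b, c in counts_per_class]
        return (sum(p * s for p, s in zip(priors, scores)),   # Iavg(t)
                max(scores))                                  # Imax(t)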

16
χ² Statistic
  • Measures the lack of independence between t and c
  • A, B, C, N as above
  • D = number of documents not in c and not containing t
  • χ²(t, c) = N·(AD - CB)² / ((A + C)(B + D)(A + B)(C + D))
  • χ²avg(t) = Σ P(ci) · χ²(t, ci)
  • χ²max(t) = max over i of χ²(t, ci)
  • Retain those features where χ²(t) > threshold
    (a sketch follows below)
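The same contingency counts plug directly into the χ² formula; guarding against a zero denominator is an implementation detail not on the slide.

    def chi_square(a, b, c, d, n):
        # chi^2(t, c) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        return n * (a * d - c * b) ** 2 / denom if denom else 0.0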

17
Genetic Algorithms
  • General-purpose optimization system
  • Define a problem representation
    • Classical GA uses binary bit-strings
  • Create an evaluator which returns a fitness
  • Create a random population
  • For each generation
    • Evaluate the fitness of each member of the population
    • Create new strings
      • Cross-over using fitness-weighted parent selection
      • Mutation

18
Genetic Feature Selection
  • Representation
    • Bit-string
    • Each 0/1 bit represents exclusion/inclusion of a feature
  • Fitness
    • Using proper experimental techniques, measure
      system performance with those features
    • Various queries, ... (a GA sketch follows below)
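A compact sketch of the bit-string GA described above; the fitness callable, population size, generation count, and mutation rate are all placeholder choices, fitness() is assumed to return non-negative scores, and the proper experimental evaluation over various queries would sit behind it.

    import random

    def genetic_feature_selection(n_features, fitness, pop_size=50,
                                  generations=100, mutation_rate=0.01):
        # Each individual is a bit-string: bit i includes/excludes feature i.
        population = [[random.randint(0, 1) for _ in range(n_features)]
                      for _ in range(pop_size)]

        for _ in range(generations):
            scores = [fitness(ind) for ind in population]   # measure each member

            def parent():
                # fitness-weighted (roulette-wheel) parent selection
                return random.choices(population, weights=scores, k=1)[0]

            next_gen = []
            for _ in range(pop_size):
                p1, p2 = parent(), parent()
                cut = random.randrange(1, n_features)       # single-point cross-over
                child = p1[:cut] + p2[cut:]
                child = [b ^ 1 if random.random() < mutation_rate else b
                         for b in child]                    # mutation
                next_gen.append(child)
            population = next_gen

        return max(population, key=fitness)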

19
Feature Selection Performance
20
Feature Selection Performance
21
Summary
  • Feature selection is useful for reducing the
    computational complexity
  • Feature selection can improve both performance
    and accuracy
  • The right selection algorithm may depend on the problem
    and on the classification/retrieval algorithm