Feature Selection

1
Feature Selection
  • Carl Staelin

2
Motivation
  • Document vector spaces are HUGE
  • Many algorithms are sensitive to the number of
    parameters
  • It is likely that only a few of the parameters
    are really important anyway
  • So let's get rid of the useless parameters!

3
Feature Selection
  • Feature selection reduces the number of
    parameters
  • Usually by eliminating parameters
  • Sometimes by transforming parameters
  • Latent Semantic Indexing using Singular Value
    Decomposition is a variant on this theme and will
    be discussed in another lecture
  • Method may depend on problem type
  • For classification and filtering, may use
    information from example documents to guide
    selection

4
Question
  • How do you know which features can be pruned?
  • Genetic Algorithms
    • Try lots of subsets and choose the best
  • Mutual Information
    • Iteratively eliminate features with the least mutual
      information with the other remaining features

5
Problem Introduction
  • Many features are insignificant
  • e.g. "the", "a", "and", ...
  • Usually the significance of a feature is independent
    of the other features
  • Sometimes, pairs of features give far more
    information than either feature in isolation
  • Sometimes feature significance is related to the
    specific task or query

6
Simple Approach
  • Suppose you are building a document
    classification system
  • You have sample documents that are
    relevant/irrelevant to the class
  • There are N features in total
  • You will use classification algorithm A

7
Simple Approach (cont'd)
  • F = set of features
  • Evaluate precision/recall of A using F
  • Choose a precision/recall threshold
  • For each feature f in F
    • F' = F - {f}
    • Evaluate precision/recall of A using F'
    • If performance is still above the threshold
      • F = F' (a sketch of this loop follows below)
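A minimal Python sketch of this greedy backward-elimination loop, assuming the caller supplies the feature set and an evaluate() callable that trains classifier A with the given features and returns a precision/recall score (e.g. F1); both names are placeholders, not part of the original slides.

    def greedy_feature_elimination(features, evaluate, tolerated_drop=0.01):
        # `evaluate(feature_set)` is assumed to train classifier A and
        # return a precision/recall score such as F1.
        F = set(features)
        baseline = evaluate(F)
        threshold = baseline - tolerated_drop   # chosen precision/recall threshold

        for f in list(F):
            F_prime = F - {f}                   # F' = F - {f}
            if evaluate(F_prime) >= threshold:  # still above the threshold
                F = F_prime                     # drop f permanently: F = F'
        return F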

8
Feature Goodness Measures
  • Class-independent measures
    • Document frequency
    • Term strength
    • Feature mutual information
  • Measures relative to a class or classes of documents
    • Information gain
    • Mutual information
    • χ² statistic

9
Document Frequency
  • This is the simplest feature selection method
  • Based on Zipf's Law
  • Remove the N most common terms
  • Remove terms that appear in fewer than M
    documents (M usually 1 or 2); a sketch follows below
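A small sketch of document-frequency filtering; the parameter names n_most_common and min_docs stand in for the slide's N and M and are illustrative choices.

    from collections import Counter

    def document_frequency_filter(docs, n_most_common, min_docs=2):
        # `docs` is a list of tokenized documents (lists of terms)
        df = Counter()
        for doc in docs:
            df.update(set(doc))               # count each term once per document
        stop = {t for t, _ in df.most_common(n_most_common)}   # N most common terms
        return {t for t, n in df.items() if t not in stop and n >= min_docs}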

10
Term Strength
  • Term strength does not rely on predefined
    categories
  • For x and y any pair of related documents
    (documents which are closer than some distance)
  • s(t) = P(t ∈ y | t ∈ x)
  • Retain those features where s(t) > threshold
    (a sketch follows below)
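A rough sketch of estimating term strength from related document pairs; the related() predicate (for example, cosine similarity above some cutoff) and the retention threshold are assumptions, as is counting both orderings of each related pair.

    from itertools import combinations

    def term_strength(docs, related, thresh=0.5):
        # docs: list of documents, each a set of terms
        # related(x, y): True if the pair is "closer than some distance"
        in_x, in_both = {}, {}
        for x, y in combinations(docs, 2):
            if not related(x, y):
                continue
            for a, b in ((x, y), (y, x)):     # count both orderings of the pair
                for t in a:
                    in_x[t] = in_x.get(t, 0) + 1
                    if t in b:
                        in_both[t] = in_both.get(t, 0) + 1
        # s(t) = P(t in y | t in x), estimated over related pairs
        return {t for t, n in in_x.items() if in_both.get(t, 0) / n > thresh}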

11
Feature Mutual Information
  • A = number of documents containing both ti and tj
  • B = number of documents containing ti but not tj
  • C = number of documents containing tj but not ti
  • N = total number of documents
  • I(ti, tj) = log( P(ti and tj) / (P(ti) · P(tj)) )
  • I(ti, tj) ≈ log( A·N / ((A + C) · (A + B)) )
    (a sketch follows below)
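A one-function sketch of this co-occurrence approximation, assuming document-frequency counts are already available (df_i = A + B, df_j = A + C, df_both = A); the epsilon guard is an implementation detail, not part of the slide.

    import math

    def pairwise_mutual_information(df_i, df_j, df_both, n_docs, eps=1e-12):
        # I(ti, tj) ~ log( A*N / ((A + C) * (A + B)) )
        return math.log((df_both * n_docs + eps) / (df_i * df_j + eps))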

12
MST-FeatureSelection
  • ∀ i, j, i ≠ j: compute I(ti, tj)
  • Build the maximal spanning tree MST of the complete
    graph G with edge weights I(ti, tj)
  • FMST ← direct-arcs(MST), R ← {ti} (all features)
  • While (|R| > k) do
    • If ∃ v, where v is a node representing feature tv,
      that is not connected to any other node in FMST
      • Then R ← R - {tv}, FMST ← FMST - {v}
    • Else remove the least-weighted arc from FMST
      (a sketch follows below)
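A sketch of this pruning loop using networkx for the maximum spanning tree; the tree is kept undirected here (whether a node has become isolated is the same either way), and the pairwise mutual-information weights are assumed precomputed.

    import networkx as nx

    def mst_feature_selection(mi, k):
        # mi: dict mapping (ti, tj) pairs to I(ti, tj)
        g = nx.Graph()
        for (ti, tj), w in mi.items():
            g.add_edge(ti, tj, weight=w)
        tree = nx.maximum_spanning_tree(g, weight="weight")
        R = set(tree.nodes)

        while len(R) > k:
            isolated = [v for v in tree.nodes if tree.degree(v) == 0]
            if isolated:                      # drop a feature with no remaining arcs
                v = isolated[0]
                R.discard(v)
                tree.remove_node(v)
            else:                             # otherwise cut the weakest arc
                u, v, _ = min(tree.edges(data="weight"), key=lambda e: e[2])
                tree.remove_edge(u, v)
        return R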

13
PIL-FeatureSelection
  • ∀ i, j, i ≠ j: compute I(ti, tj)
  • R ← {ti} (all features)
  • While (|R| > k) do
    • ∀ ti ∈ R, compute σi = Σ over tj ∈ R - {ti} of I(ti, tj)
    • ti ← feature with minimal σi
    • R ← R - {ti} (see the sketch below)
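A short sketch of this greedy loop, assuming mi(ti, tj) returns the precomputed pairwise mutual information and k is the target vocabulary size.

    def pil_feature_selection(terms, mi, k):
        R = set(terms)
        while len(R) > k:
            # sigma_i = sum of I(ti, tj) over the other remaining features
            sigma = {ti: sum(mi(ti, tj) for tj in R if tj != ti) for ti in R}
            R.remove(min(sigma, key=sigma.get))   # drop the least-connected feature
        return R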

14
Information Gain
  • For term t and m categories ci
  • P(ci) = probability a document is in ci
  • P(t) = probability t appears in a document
  • P(ci|t) = probability a document is in ci given that
    t appears
  • P(ci|¬t) = probability a document is in ci given that
    t does not appear
  • G(t) = -Σ P(ci) log P(ci)
           + P(t) Σ P(ci|t) log P(ci|t)
           + P(¬t) Σ P(ci|¬t) log P(ci|¬t)
  • Retain those features where G(t) > threshold
    (a sketch follows below)
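A direct transcription of G(t) in Python, assuming the probabilities have already been estimated from the training documents; skipping zero probabilities to avoid log(0) is an implementation detail added here.

    import math

    def information_gain(p_c, p_t, p_c_given_t, p_c_given_not_t):
        # p_c, p_c_given_t, p_c_given_not_t: lists over the m categories ci
        def plogp(ps):
            return sum(p * math.log(p) for p in ps if p > 0)
        return (-plogp(p_c)
                + p_t * plogp(p_c_given_t)
                + (1.0 - p_t) * plogp(p_c_given_not_t))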

15
Mutual Information
  • A = number of documents in c containing t
  • B = number of documents not in c containing t
  • C = number of documents in c not containing t
  • N = total number of documents
  • I(t, c) = log( P(t and c) / (P(t) · P(c)) )
  • I(t, c) ≈ log( A·N / ((A + C) · (A + B)) )
  • Iavg(t) = Σ P(ci) · I(t, ci)
  • Imax(t) = max over i of I(t, ci)
  • Retain those features where I(t) > threshold
    (a sketch follows below)
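A sketch of the count-based approximation plus the averaged and maximum variants; counts_per_class is an assumed structure holding (A, B, C) for each category, and priors holds the P(ci) estimates.

    import math

    def mi(a, b, c, n, eps=1e-12):
        # I(t, c) ~ log( A*N / ((A + C) * (A + B)) )
        return math.log((a * n + eps) / ((a + c) * (a + b) + eps))

    def mi_avg_max(counts_per_class, priors, n):
        scores = [mi(a, b, c, n) for a, b, c in counts_per_class]
        return (sum(p * s for p, s in zip(priors, scores)),   # Iavg(t)
                max(scores))                                  # Imax(t)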

16
χ² Statistic
  • Measures the lack of independence between t and c
  • A, B, C, N as above
  • D = number of documents not in c and not containing t
  • χ²(t, c) = N·(AD - CB)² / ((A + C)(B + D)(A + B)(C + D))
  • χ²avg(t) = Σ P(ci) · χ²(t, ci)
  • χ²max(t) = max over i of χ²(t, ci)
  • Retain those features where χ²(t) > threshold
    (a sketch follows below)
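The same contingency counts plug directly into the χ² formula; guarding against a zero denominator is an implementation detail not on the slide.

    def chi_square(a, b, c, d, n):
        # chi^2(t, c) = N (AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D))
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        return n * (a * d - c * b) ** 2 / denom if denom else 0.0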

17
Genetic Algorithms
  • General-purpose optimization system
  • Define a problem representation
    • Classical GA uses binary bit-strings
  • Create an evaluator which returns a fitness
  • Create a random population
  • For each generation
    • Evaluate the fitness of each member of the population
    • Create new strings
      • Cross-over using fitness-weighted parent selection
      • Mutation

18
Genetic Feature Selection
  • Representation
    • Bit-string
    • Each 0/1 bit represents exclusion/inclusion of a feature
  • Fitness
    • Using proper experimental techniques, measure
      system performance with those features
    • Various queries, ... (a GA sketch follows below)
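A compact sketch of the bit-string GA described above; the fitness callable, population size, generation count, and mutation rate are all placeholder choices, fitness() is assumed to return non-negative scores, and the proper experimental evaluation over various queries would sit behind it.

    import random

    def genetic_feature_selection(n_features, fitness, pop_size=50,
                                  generations=100, mutation_rate=0.01):
        # Each individual is a bit-string: bit i includes/excludes feature i.
        population = [[random.randint(0, 1) for _ in range(n_features)]
                      for _ in range(pop_size)]

        for _ in range(generations):
            scores = [fitness(ind) for ind in population]   # measure each member

            def parent():
                # fitness-weighted (roulette-wheel) parent selection
                return random.choices(population, weights=scores, k=1)[0]

            next_gen = []
            for _ in range(pop_size):
                p1, p2 = parent(), parent()
                cut = random.randrange(1, n_features)       # single-point cross-over
                child = p1[:cut] + p2[cut:]
                child = [b ^ 1 if random.random() < mutation_rate else b
                         for b in child]                    # mutation
                next_gen.append(child)
            population = next_gen

        return max(population, key=fitness)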

19
Feature Selection Performance
20
Feature Selection Performance
21
Summary
  • Feature selection is useful for reducing the
    computational complexity
  • Feature selection can improve both performance
    and accuracy
  • The right selection algorithm may depend on the problem
    and on the classification/retrieval algorithm