1
A Comparative Study on Feature Selection in Text
Categorization (Proc. 14th International
Conference on Machine Learning, 1997)
  • Paper By
  • Yiming Yang, CMU
  • Jan O. Pedersen, Verity, Inc.
  • Presented By
  • Prerak Sanghvi
  • Computer Science and Engineering Department
  • State University of New York at Buffalo

2
Introduction
  • This paper is a comparative study of feature
    selection methods in statistical learning of text
    categorization.
  • Five methods were evaluated
  • Document Frequency (DF)
  • Information Gain (IG)
  • Mutual Information (MI)
  • χ² test (CHI)
  • Term Strength (TS)

3
Document Frequency (DF)
  • Document Frequency is the number of documents in
    which a term occurs.
  • Terms whose document frequency is less than some
    predetermined threshold are removed from the
    feature space.
  • The basic assumption is that rare terms are
    either non-informative for category prediction
    or not influential in global performance. This
    assumption must be handled carefully, however,
    since low-frequency terms are often assumed to
    be informative in information retrieval (a
    minimal sketch of DF thresholding follows).
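As an illustration (not from the paper), here is a minimal Python sketch of DF thresholding; the toy corpus, set-based counting, and the min_df value are assumptions made for the example.

from collections import Counter

def df_filter(docs, min_df=2):
    # Keep terms whose document frequency is at least min_df.
    # docs: list of token lists; returns the retained vocabulary.
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return {t for t, n in df.items() if n >= min_df}

# Hypothetical toy corpus, for illustration only
docs = [["wheat", "farm"], ["wheat", "export"], ["oil", "price"]]
print(df_filter(docs, min_df=2))  # {'wheat'}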

4
Information Gain (IG)
  • IG measures the number of bits of information
    obtained for category prediction by knowing the
    presence or absence of a term in a document.
  • For a term t and a set of classes ci (i = 1, ..., m),
    the information gain is
    G(t) = − Σi=1..m Pr(ci) log Pr(ci)
           + Pr(t) Σi=1..m Pr(ci | t) log Pr(ci | t)
           + Pr(¬t) Σi=1..m Pr(ci | ¬t) log Pr(ci | ¬t)
    where ¬t denotes the absence of term t.

5
Information Gain (IG)
  • Given a training corpus, IG is computed for each
    unique term, and terms whose IG is less than some
    predetermined threshold are removed from the
    feature space (a sketch follows).
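A minimal Python sketch of the G(t) computation above, assuming documents are represented as token collections and labels as a parallel list; the function and variable names are illustrative, not from the paper.

import math
from collections import Counter

def information_gain(docs, labels, term):
    # G(t) = - Σi Pr(ci) log Pr(ci)
    #        + Pr(t)  Σi Pr(ci | t)  log Pr(ci | t)
    #        + Pr(¬t) Σi Pr(ci | ¬t) log Pr(ci | ¬t)
    n = len(docs)
    present = [l for d, l in zip(docs, labels) if term in d]
    absent = [l for d, l in zip(docs, labels) if term not in d]

    def sum_p_log_p(subset):
        # Σi Pr(ci | subset) log Pr(ci | subset); defined as 0 if empty
        if not subset:
            return 0.0
        return sum((c / len(subset)) * math.log(c / len(subset))
                   for c in Counter(subset).values())

    prior = -sum((c / n) * math.log(c / n)
                 for c in Counter(labels).values())
    pt = len(present) / n
    return prior + pt * sum_p_log_p(present) + (1 - pt) * sum_p_log_p(absent)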

6
Mutual Information (MI)
  • Each word is ranked according to its mutual
    information with respect to the class labels.
  • The mutual information criterion is defined as
    I(t, c) = log [ Pr(t ∧ c) / (Pr(t) · Pr(c)) ]
  • Category-specific scores are often combined as an
    average or a maximum (sketched below):
    Iavg(t) = Σi=1..m Pr(ci) I(t, ci)
    Imax(t) = maxi=1..m I(t, ci)
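A rough Python sketch of the MI scores, estimating the probabilities from document-level counts; the smoothing constant eps and all names are assumptions made for the example.

import math
from collections import Counter

def mutual_information(docs, labels, term, cls, eps=1e-12):
    # I(t, c) = log [ Pr(t ∧ c) / (Pr(t) · Pr(c)) ];
    # eps guards against log(0) and division by zero for unseen events.
    n = len(docs)
    joint = sum(1 for d, l in zip(docs, labels) if term in d and l == cls) / n
    pt = sum(1 for d in docs if term in d) / n
    pc = sum(1 for l in labels if l == cls) / n
    return math.log((joint + eps) / (pt * pc + eps))

def mi_avg(docs, labels, term):
    # Iavg(t) = Σi Pr(ci) I(t, ci)
    n = len(labels)
    return sum((cnt / n) * mutual_information(docs, labels, term, c)
               for c, cnt in Counter(labels).items())

def mi_max(docs, labels, term):
    # Imax(t) = max_i I(t, ci)
    return max(mutual_information(docs, labels, term, c) for c in set(labels))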

7
χ² statistic (CHI)
  • The χ² statistic measures the lack of
    independence between a term t and a category c
    (a sketch follows).
  • The χ² statistic is known to be unreliable for
    low-frequency terms.
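The paper computes χ² from the two-way contingency table of term occurrence versus category membership. Below is a Python sketch under that reading; the document-level counting and all names are assumptions made for the example.

def chi_square(docs, labels, term, cls):
    # χ²(t, c) = N (AD - CB)² / ((A+C)(B+D)(A+B)(C+D)), where
    # A: docs in c containing t,   B: docs outside c containing t,
    # C: docs in c without t,      D: docs outside c without t.
    n = len(docs)
    a = sum(1 for d, l in zip(docs, labels) if term in d and l == cls)
    b = sum(1 for d, l in zip(docs, labels) if term in d and l != cls)
    c = sum(1 for d, l in zip(docs, labels) if term not in d and l == cls)
    d = n - a - b - c
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom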

8
Term Strength (TS)
  • This method estimates term importance based on
    how commonly a term is likely to appear in
    closely-related documents.
  • It uses a training set of documents to derive
    document pairs whose similarity is above a
    threshold.
  • This criterion is based on document clustering,
    assuming that documents with many shared words
    are related, and that terms in the heavily
    overlapping area of related documents are
    relatively informative (a sketch follows).
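A rough Python sketch of term strength, read as s(t) = Pr(t occurs in y | t occurs in x) over related document pairs (x, y). The Jaccard overlap and the threshold value below are illustrative stand-ins for the paper's similarity criterion, not its actual choices.

from itertools import combinations

def term_strength(docs, term, sim_threshold=0.25):
    # docs: list of token sets. Two documents count as "related" when
    # their Jaccard overlap reaches sim_threshold (an assumed measure).
    def related(x, y):
        union = x | y
        return bool(union) and len(x & y) / len(union) >= sim_threshold

    trials = []
    for x, y in combinations(docs, 2):
        if not related(x, y):
            continue
        # the pair is unordered, so condition on each side in turn
        if term in x:
            trials.append(term in y)
        if term in y:
            trials.append(term in x)
    return sum(trials) / len(trials) if trials else 0.0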

9
Conclusion
  • IG and CHI were found to be the most effective
    for aggressive term removal without losing
    categorization accuracy, in experiments with kNN
    and LLSF (Linear Least Squares Fit) on the
    Reuters-22173 and OHSUMED collections.
  • DF was found comparable to IG and CHI with up to
    90% term removal, while TS was comparable only
    with up to 50-60% removal.
  • MI had inferior performance.