1
A Comparative Study on Feature Selection in Text
Categorization (Proc. 14th International
Conference on Machine Learning, 1997)
  • Paper By
  • Yiming Yang, CMU
  • Jan O. Pedersen, Verity, Inc.
  • Presented By
  • Prerak Sanghvi
  • Computer Science and Engineering Department
  • State University of New York at Buffalo

2
Introduction
  • This paper is a comparative study of feature
    selection methods in statistical learning of text
    categorization.
  • Five methods were evaluated
  • Document Frequency (DF)
  • Information Gain (IG)
  • Mutual Information (MI)
  • χ² test (CHI)
  • Term Strength (TS)

3
Document Frequency (DF)
  • Document Frequency is the number of documents in
    which a term occurs.
  • Terms whose document frequency is less than some
    predetermined threshold are removed from the
    feature space.
  • The basic assumption is that rare terms are
    either non-informative for category prediction
    or not influential in global performance. This
    assumption must be handled carefully, however,
    since low-frequency terms are often assumed to
    be informative in information retrieval (a
    minimal sketch of DF thresholding follows).
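As an illustration (not from the paper), here is a minimal Python sketch of DF thresholding; the toy corpus, set-based counting, and the min_df value are assumptions made for the example.

from collections import Counter

def df_filter(docs, min_df=2):
    # Keep terms whose document frequency is at least min_df.
    # docs: list of token lists; returns the retained vocabulary.
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term at most once per document
    return {t for t, n in df.items() if n >= min_df}

# Hypothetical toy corpus, for illustration only
docs = [["wheat", "farm"], ["wheat", "export"], ["oil", "price"]]
print(df_filter(docs, min_df=2))  # {'wheat'}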

4
Information Gain (IG)
  • IG measures the number of bits of information
    obtained for category prediction by knowing the
    presence or absence of a term in a document.
  • For a term t and a set of classes ci (i = 1, ..., m),
    the information gain is
    G(t) = − Σi=1..m Pr(ci) log Pr(ci)
           + Pr(t) Σi=1..m Pr(ci | t) log Pr(ci | t)
           + Pr(¬t) Σi=1..m Pr(ci | ¬t) log Pr(ci | ¬t)
    where ¬t denotes the absence of term t.

5
Information Gain (IG)
  • Given a training corpus, IG is computed for each
    unique term, and terms whose IG is less than some
    predetermined threshold are removed from the
    feature space (a sketch follows).
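A minimal Python sketch of the G(t) computation above, assuming documents are represented as token collections and labels as a parallel list; the function and variable names are illustrative, not from the paper.

import math
from collections import Counter

def information_gain(docs, labels, term):
    # G(t) = - Σi Pr(ci) log Pr(ci)
    #        + Pr(t)  Σi Pr(ci | t)  log Pr(ci | t)
    #        + Pr(¬t) Σi Pr(ci | ¬t) log Pr(ci | ¬t)
    n = len(docs)
    present = [l for d, l in zip(docs, labels) if term in d]
    absent = [l for d, l in zip(docs, labels) if term not in d]

    def sum_p_log_p(subset):
        # Σi Pr(ci | subset) log Pr(ci | subset); defined as 0 if empty
        if not subset:
            return 0.0
        return sum((c / len(subset)) * math.log(c / len(subset))
                   for c in Counter(subset).values())

    prior = -sum((c / n) * math.log(c / n)
                 for c in Counter(labels).values())
    pt = len(present) / n
    return prior + pt * sum_p_log_p(present) + (1 - pt) * sum_p_log_p(absent)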

6
Mutual Information (MI)
  • Each word is ranked according to its mutual
    information with respect to the class labels.
  • The mutual information criterion is defined as
    I(t, c) = log [ Pr(t ∧ c) / (Pr(t) · Pr(c)) ]
  • Category-specific scores are often combined as an
    average or a maximum (sketched below):
    Iavg(t) = Σi=1..m Pr(ci) I(t, ci)
    Imax(t) = maxi=1..m I(t, ci)
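A rough Python sketch of the MI scores, estimating the probabilities from document-level counts; the smoothing constant eps and all names are assumptions made for the example.

import math
from collections import Counter

def mutual_information(docs, labels, term, cls, eps=1e-12):
    # I(t, c) = log [ Pr(t ∧ c) / (Pr(t) · Pr(c)) ];
    # eps guards against log(0) and division by zero for unseen events.
    n = len(docs)
    joint = sum(1 for d, l in zip(docs, labels) if term in d and l == cls) / n
    pt = sum(1 for d in docs if term in d) / n
    pc = sum(1 for l in labels if l == cls) / n
    return math.log((joint + eps) / (pt * pc + eps))

def mi_avg(docs, labels, term):
    # Iavg(t) = Σi Pr(ci) I(t, ci)
    n = len(labels)
    return sum((cnt / n) * mutual_information(docs, labels, term, c)
               for c, cnt in Counter(labels).items())

def mi_max(docs, labels, term):
    # Imax(t) = max_i I(t, ci)
    return max(mutual_information(docs, labels, term, c) for c in set(labels))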

7
χ² statistic (CHI)
  • The χ² statistic measures the lack of
    independence between a term t and a category c
    (a sketch follows).
  • The χ² statistic is known to be unreliable for
    low-frequency terms.
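The paper computes χ² from the two-way contingency table of term occurrence versus category membership. Below is a Python sketch under that reading; the document-level counting and all names are assumptions made for the example.

def chi_square(docs, labels, term, cls):
    # χ²(t, c) = N (AD - CB)² / ((A+C)(B+D)(A+B)(C+D)), where
    # A: docs in c containing t,   B: docs outside c containing t,
    # C: docs in c without t,      D: docs outside c without t.
    n = len(docs)
    a = sum(1 for d, l in zip(docs, labels) if term in d and l == cls)
    b = sum(1 for d, l in zip(docs, labels) if term in d and l != cls)
    c = sum(1 for d, l in zip(docs, labels) if term not in d and l == cls)
    d = n - a - b - c
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - c * b) ** 2 / denom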

8
Term Strength (TS)
  • This method estimates term importance based on
    how commonly a term is likely to appear in
    closely-related documents.
  • It uses a training set of documents to derive
    document pairs whose similarity is above a
    threshold.
  • This criterion is based on document clustering,
    assuming that documents with many shared words
    are related, and that terms in the heavily
    overlapping area of related documents are
    relatively informative (a sketch follows).
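A rough Python sketch of term strength, read as s(t) = Pr(t occurs in y | t occurs in x) over related document pairs (x, y). The Jaccard overlap and the threshold value below are illustrative stand-ins for the paper's similarity criterion, not its actual choices.

from itertools import combinations

def term_strength(docs, term, sim_threshold=0.25):
    # docs: list of token sets. Two documents count as "related" when
    # their Jaccard overlap reaches sim_threshold (an assumed measure).
    def related(x, y):
        union = x | y
        return bool(union) and len(x & y) / len(union) >= sim_threshold

    trials = []
    for x, y in combinations(docs, 2):
        if not related(x, y):
            continue
        # the pair is unordered, so condition on each side in turn
        if term in x:
            trials.append(term in y)
        if term in y:
            trials.append(term in x)
    return sum(trials) / len(trials) if trials else 0.0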

9
Conclusion
  • IG and CHI were found to be the most effective
    for aggressive term removal without losing
    categorization accuracy, in experiments with kNN
    and LLSF (Linear Least Squares Fit) on the
    Reuters-22173 and OHSUMED collections.
  • DF was found comparable to IG and CHI with up to
    90% term removal, while TS was comparable only
    with up to 50-60% removal.
  • MI had inferior performance.