Sparsity Analysis of Term Weighting Schemes and Application to Text Classification
1
Sparsity Analysis of Term Weighting Schemes and
Application to Text Classification
  • Nataša Milić-Frayling,¹ Dunja Mladenić,²
    Janez Brank,² Marko Grobelnik²
    ¹ Microsoft Research, Cambridge, UK
    ² Jožef Stefan Institute, Ljubljana, Slovenia

2
Introduction
  • Feature selection in the context of text
    categorization
  • Comparing different feature ranking schemes
  • Characterizing feature rankings based on their
    sparsity behavior
  • Sparsity is defined as the average number of
    different words in a document (after feature
    selection has removed some words)

3
Feature Weighting Schemes
  • Odds ratio: OR(t) = log( odds(t|c) / odds(t|¬c) )
  • Information gain: IG(t; c) = entropy(c) −
    entropy(c|t)
  • χ²-statistic: χ²(t) = N (Ntc·N¬t¬c − Nt¬c·Nc¬t)² /
    (Nc·N¬c·Nt·N¬t)
    N = number of all documents; Ntc = number of
    documents from class c containing term t, etc.
    The numerator equals 0 if t and c are independent.
  • Robertson-Spärck Jones weighting: RSJ(t) =
    log[ (Ntc + 0.5)(N¬t¬c + 0.5) /
    ((Nc¬t + 0.5)(Nt¬c + 0.5)) ]
    (very similar to odds ratio)
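The following Python sketch illustrates how these four scores could be computed from document counts for one (term, class) pair. It is an illustration only, not the authors' code; the variable names and the small smoothing constant are our own assumptions.

```python
import math

def feature_scores(N, Nc, Nt, Ntc):
    """Score one term t for one class c from document counts.

    N   : total number of documents
    Nc  : documents belonging to class c
    Nt  : documents containing term t
    Ntc : documents in class c that contain term t
    """
    # Remaining cells of the 2x2 term/class contingency table.
    Nt_notc = Nt - Ntc              # t present, class not c
    Nnott_c = Nc - Ntc              # t absent, class c
    Nnott_notc = N - Nt - Nc + Ntc  # t absent, class not c
    eps = 1e-10                     # smoothing to avoid log(0) / division by zero

    # Odds ratio: OR(t) = log( odds(t|c) / odds(t|not c) )
    odds_c = (Ntc + eps) / (Nnott_c + eps)
    odds_notc = (Nt_notc + eps) / (Nnott_notc + eps)
    odds_ratio = math.log(odds_c / odds_notc)

    # Information gain: IG(t; c) = entropy(c) - entropy(c|t)
    def entropy(*counts):
        total = sum(counts)
        return -sum(n / total * math.log2(n / total) for n in counts if n > 0)

    info_gain = entropy(Nc, N - Nc) - (
        Nt / N * entropy(Ntc, Nt_notc)
        + (N - Nt) / N * entropy(Nnott_c, Nnott_notc)
    )

    # Chi-square: N (Ntc*Nnott_notc - Nt_notc*Nnott_c)^2 / (Nc*Nnotc*Nt*Nnott)
    chi2 = (N * (Ntc * Nnott_notc - Nt_notc * Nnott_c) ** 2
            / (Nc * (N - Nc) * Nt * (N - Nt) + eps))

    # Robertson-Sparck Jones weight: log of the 0.5-smoothed odds ratio
    rsj = math.log((Ntc + 0.5) * (Nnott_notc + 0.5)
                   / ((Nnott_c + 0.5) * (Nt_notc + 0.5)))

    return {"OR": odds_ratio, "IG": info_gain, "chi2": chi2, "RSJ": rsj}
```

For example, feature_scores(1000, 100, 50, 40) scores a term that occurs in 40 of the 100 documents of the target class and in 10 of the remaining 900.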

4
Feature Weighting Schemes
  • Weights based on word frequency:
  • DF: document frequency (the number of documents
    containing the word); this ranking uses the most
    common words first
  • IDF: inverse document frequency (use the least
    common words first)

5
Feature Weighting Schemes
  • Weights based on a linear classifier (w, b):
    prediction(d) = sgn( b + Σ_i wi · TF(ti, d) ),
    where the sum runs over all terms ti
  • If a weight wi is close to 0, the term ti has
    little influence on the predictions.
  • If it is not important for predictions, it is
    probably not important for learning either.
  • Thus, use |wi| as the score of the term ti.
  • We use linear models trained using SVM and
    perceptron.
  • It might be practical to train the model on only a
    subset of the full training set (e.g. ½ or ¼ of
    it).
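A minimal sketch of this weight-based ranking, assuming a scikit-learn linear SVM stands in for the linear model; the function name and the subsampling parameter are our own, and only illustrate the idea of training on a fraction of the data.

```python
import numpy as np
from sklearn.svm import LinearSVC

def weight_based_ranking(X, y, subsample=1.0, random_state=0):
    """Rank features by |w_i| of a linear model trained on a (sub)sample.

    X : (n_docs, n_terms) TF matrix (dense or sparse)
    y : binary labels for one category
    subsample : fraction of the training set to use (e.g. 0.5 or 0.25)
    """
    rng = np.random.default_rng(random_state)
    idx = rng.choice(X.shape[0], size=int(subsample * X.shape[0]), replace=False)
    w = LinearSVC().fit(X[idx], y[idx]).coef_.ravel()
    return np.argsort(-np.abs(w))   # feature indices, most influential first
```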

6
Characterization of Feature Rankings in terms of
Sparsity
  • We have a relatively good understanding of feature
    rankings based on odds ratio, information gain,
    etc., because they are based on explicit formulas
    for feature scores
  • How to better understand the rankings based on
    linear classifiers?
  • Let sparsity be the average number of different
    words per document, after some feature selection
    has been applied.
  • Equivalently, the avg. number of nonzero
    components per vector representing the document.
  • This has direct ties to memory consumption, as
    well as to CPU time consumption for computing
    norms, dot products, etc.
  • We can plot the sparsity curve showing how
    sparsity grows as we add more and more features
    from a given ranking.
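A sketch of how such a sparsity curve could be computed from a sparse term-document matrix: the average number of nonzeros per document after keeping the top k features is the sum of the document frequencies of those k features divided by the number of documents. Function and variable names are our own.

```python
import numpy as np
import scipy.sparse as sp

def sparsity_curve(X, ranking):
    """Sparsity (avg. nonzero components per document) after keeping
    only the top-k ranked features, for every k = 1 .. n_terms.

    X       : (n_docs, n_terms) sparse TF matrix
    ranking : feature indices, best first
    """
    X = sp.csc_matrix(X)
    df = X.getnnz(axis=0)        # nonzero count (document frequency) per feature
    return np.cumsum(df[ranking]) / X.shape[0]
```

Plotting sparsity_curve(X, ranking) against k = 1, 2, ... gives the sparsity curve of the ranking.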

7
Sparsity Curves
8
Sparsity as the independent variable
  • When discussing and comparing feature rankings,
    we often use the number of features as the
    independent variable.
  • What is the performance when using the first 100
    features? etc.
  • Somewhat unfair towards rankings that prefer (at
    least initially) less frequent features, such as
    odds ratio
  • Sparsity is much more directly connected to
    memory and CPU time requirements
  • Thus, we propose the use of sparsity as the
    independent variable when comparing feature
    rankings.

9
Performance as a function of the number of
features (Naïve Bayes, 16 categories of RCV2)
10
Performance as a function of sparsity
11
Sparsity as a cutoff criterion
  • Each category is treated as a binary
    classification problem (does the document belong
    to category c or not?)
  • Thus, a feature ranking method produces one
    ranking per category
  • We must choose how many of the top ranked
    features to use for learning and classification
  • Alternatively, we can define the cutoff in terms
    of sparsity.
  • The best number of features can vary greatly from
    one category to another
  • Does the best sparsity vary less between
    categories?
  • Suppose we want a constant number of features for
    each category. Is it better to use a constant
    sparsity for each category?
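A sparsity-based cutoff can be obtained by walking down the ranking until the target sparsity is reached. The sketch below reuses the cumulative-count idea from the sparsity curve above; it is an illustration, not the authors' procedure.

```python
import numpy as np
import scipy.sparse as sp

def cutoff_for_sparsity(X, ranking, target_sparsity):
    """Smallest number of top-ranked features whose retention yields at
    least `target_sparsity` nonzero components per document on average."""
    X = sp.csc_matrix(X)
    cum = np.cumsum(X.getnnz(axis=0)[ranking]) / X.shape[0]
    k = int(np.searchsorted(cum, target_sparsity)) + 1
    return min(k, len(ranking))
```

With such a helper, "constant sparsity per category" simply means calling it with the same target_sparsity for every category, while the resulting number of features is free to differ from category to category.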

12
Results
13
Conclusions
  • Sparsity is an interesting and useful concept
  • As a cutoff criterion, it is not any worse, and
    is often a little better, than the number of
    features
  • It offers more direct control over memory and CPU
    time consumption
  • When comparing feature selection methods, it is
    not biased in favour of methods which prefer more
    common features

14
Future work
  • Characterize feature ranking schemes in terms of
    other characteristics besides sparsity curves
  • E.g. cumulative information gain: how the sum of
    IG(t; c) over the first k terms t of the feature
    ranking grows with k.
  • The goal: define a set of characteristic curves
    that would explain why some feature rankings
    (e.g. SVM-based) are better than others.
  • If we know the characteristic curves of a good
    feature ranking, we can synthesize new rankings
    with approximately the same characteristic curves
  • Would they also perform comparably well?
  • With a good set of feature characteristics, we
    might be able to take the approximate
    characteristics of a good feature ranking and
    then synthesize comparably good rankings on other
    classes or datasets.
  • (Otherwise it can be expensive to get a really
    good feature ranking, such as the SVM-based one.)
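A cumulative information-gain curve of the kind proposed here could be computed directly from per-feature IG scores; the sketch below assumes those scores are already available (e.g. from a scoring routine like the one sketched after slide 3) and is purely illustrative.

```python
import numpy as np

def cumulative_ig_curve(ranking, ig_scores):
    """Sum of IG(t; c) over the first k terms of a feature ranking, for each k.

    ranking   : feature indices, best first
    ig_scores : IG(t; c) for every feature, indexed by feature id
    """
    return np.cumsum(np.asarray(ig_scores)[ranking])
```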