Sparsity Analysis of Term Weighting Schemes and Application to Text Classification
1
Sparsity Analysis of Term Weighting Schemes and
Application to Text Classification
  • Nataša Milić-Frayling,¹ Dunja Mladenić,²
    Janez Brank,² Marko Grobelnik²
    ¹ Microsoft Research, Cambridge, UK
    ² Jožef Stefan Institute, Ljubljana, Slovenia

2
Introduction
  • Feature selection in the context of text
    categorization
  • Comparing different feature ranking schemes
  • Characterizing feature rankings based on their
    sparsity behavior
  • Sparsity is defined as the average number of
    different words in a document (after feature
    selection has removed some words)

3
Feature Weighting Schemes
  • Odds ratio: OR(t) = log( odds(t|c) / odds(t|¬c) )
  • Information gain: IG(t; c) = entropy(c) −
    entropy(c|t)
  • χ²-statistic: χ²(t) = N (Ntc·N¬t¬c − Nt¬c·Nc¬t)² /
    (Nc·N¬c·Nt·N¬t)
    N = number of all documents; Ntc = number of
    documents from class c containing term t, etc.
    The numerator equals 0 if t and c are independent.
  • Robertson-Spärck Jones weighting: RSJ(t) =
    log[ (Ntc + 0.5)(N¬t¬c + 0.5) /
    ((Nc¬t + 0.5)(Nt¬c + 0.5)) ]
    (very similar to odds ratio)
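The following Python sketch illustrates how these four scores could be computed from document counts for one (term, class) pair. It is an illustration only, not the authors' code; the variable names and the small smoothing constant are our own assumptions.

```python
import math

def feature_scores(N, Nc, Nt, Ntc):
    """Score one term t for one class c from document counts.

    N   : total number of documents
    Nc  : documents belonging to class c
    Nt  : documents containing term t
    Ntc : documents in class c that contain term t
    """
    # Remaining cells of the 2x2 term/class contingency table.
    Nt_notc = Nt - Ntc              # t present, class not c
    Nnott_c = Nc - Ntc              # t absent, class c
    Nnott_notc = N - Nt - Nc + Ntc  # t absent, class not c
    eps = 1e-10                     # smoothing to avoid log(0) / division by zero

    # Odds ratio: OR(t) = log( odds(t|c) / odds(t|not c) )
    odds_c = (Ntc + eps) / (Nnott_c + eps)
    odds_notc = (Nt_notc + eps) / (Nnott_notc + eps)
    odds_ratio = math.log(odds_c / odds_notc)

    # Information gain: IG(t; c) = entropy(c) - entropy(c|t)
    def entropy(*counts):
        total = sum(counts)
        return -sum(n / total * math.log2(n / total) for n in counts if n > 0)

    info_gain = entropy(Nc, N - Nc) - (
        Nt / N * entropy(Ntc, Nt_notc)
        + (N - Nt) / N * entropy(Nnott_c, Nnott_notc)
    )

    # Chi-square: N (Ntc*Nnott_notc - Nt_notc*Nnott_c)^2 / (Nc*Nnotc*Nt*Nnott)
    chi2 = (N * (Ntc * Nnott_notc - Nt_notc * Nnott_c) ** 2
            / (Nc * (N - Nc) * Nt * (N - Nt) + eps))

    # Robertson-Sparck Jones weight: log of the 0.5-smoothed odds ratio
    rsj = math.log((Ntc + 0.5) * (Nnott_notc + 0.5)
                   / ((Nnott_c + 0.5) * (Nt_notc + 0.5)))

    return {"OR": odds_ratio, "IG": info_gain, "chi2": chi2, "RSJ": rsj}
```

For example, feature_scores(1000, 100, 50, 40) scores a term that occurs in 40 of the 100 documents of the target class and in 10 of the remaining 900.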

4
Feature Weighting Schemes
  • Weights based on word frequency:
  • DF: document frequency (the number of documents
    containing the word); this ranking uses the most
    common words first
  • IDF: inverse document frequency (use the least
    common words first)

5
Feature Weighting Schemes
  • Weights based on a linear classifier (w, b):
    prediction(d) = sgn( b + Σ_i wi · TF(ti, d) ),
    where the sum runs over all terms ti
  • If a weight wi is close to 0, the term ti has
    little influence on the predictions.
  • If it is not important for predictions, it is
    probably not important for learning either.
  • Thus, use |wi| as the score of the term ti.
  • We use linear models trained using SVM and
    perceptron.
  • It might be practical to train the model on only a
    subset of the full training set (e.g. ½ or ¼ of
    it).
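A minimal sketch of this weight-based ranking, assuming a scikit-learn linear SVM stands in for the linear model; the function name and the subsampling parameter are our own, and only illustrate the idea of training on a fraction of the data.

```python
import numpy as np
from sklearn.svm import LinearSVC

def weight_based_ranking(X, y, subsample=1.0, random_state=0):
    """Rank features by |w_i| of a linear model trained on a (sub)sample.

    X : (n_docs, n_terms) TF matrix (dense or sparse)
    y : binary labels for one category
    subsample : fraction of the training set to use (e.g. 0.5 or 0.25)
    """
    rng = np.random.default_rng(random_state)
    idx = rng.choice(X.shape[0], size=int(subsample * X.shape[0]), replace=False)
    w = LinearSVC().fit(X[idx], y[idx]).coef_.ravel()
    return np.argsort(-np.abs(w))   # feature indices, most influential first
```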

6
Characterization of Feature Rankings in terms of
Sparsity
  • We have a relatively good understanding of feature
    rankings based on odds ratio, information gain,
    etc., because they are based on explicit formulas
    for feature scores
  • How to better understand the rankings based on
    linear classifiers?
  • Let sparsity be the average number of different
    words per document, after some feature selection
    has been applied.
  • Equivalently, the avg. number of nonzero
    components per vector representing the document.
  • This has direct ties to memory consumption, as
    well as to CPU time consumption for computing
    norms, dot products, etc.
  • We can plot the sparsity curve showing how
    sparsity grows as we add more and more features
    from a given ranking.
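A sketch of how such a sparsity curve could be computed from a sparse term-document matrix: the average number of nonzeros per document after keeping the top k features is the sum of the document frequencies of those k features divided by the number of documents. Function and variable names are our own.

```python
import numpy as np
import scipy.sparse as sp

def sparsity_curve(X, ranking):
    """Sparsity (avg. nonzero components per document) after keeping
    only the top-k ranked features, for every k = 1 .. n_terms.

    X       : (n_docs, n_terms) sparse TF matrix
    ranking : feature indices, best first
    """
    X = sp.csc_matrix(X)
    df = X.getnnz(axis=0)        # nonzero count (document frequency) per feature
    return np.cumsum(df[ranking]) / X.shape[0]
```

Plotting sparsity_curve(X, ranking) against k = 1, 2, ... gives the sparsity curve of the ranking.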

7
Sparsity Curves
8
Sparsity as the independent variable
  • When discussing and comparing feature rankings,
    we often use the number of features as the
    independent variable.
  • What is the performance when using the first 100
    features? etc.
  • Somewhat unfair towards rankings that prefer (at
    least initially) less frequent features, such as
    odds ratio
  • Sparsity is much more directly connected to
    memory and CPU time requirements
  • Thus, we propose the use of sparsity as the
    independent variable when comparing feature
    rankings.

9
Performance as a function of the number of
features (Naïve Bayes, 16 categories of RCV2)
10
Performance as a function of sparsity
11
Sparsity as a cutoff criterion
  • Each category is treated as a binary
    classification problem (does the document belong
    to category c or not?)
  • Thus, a feature ranking method produces one
    ranking per category
  • We must choose how many of the top ranked
    features to use for learning and classification
  • Alternatively, we can define the cutoff in terms
    of sparsity.
  • The best number of features can vary greatly from
    one category to another
  • Does the best sparsity vary less between
    categories?
  • Suppose we want a constant number of features for
    each category. Is it better to use a constant
    sparsity for each category?
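A sparsity-based cutoff can be obtained by walking down the ranking until the target sparsity is reached. The sketch below reuses the cumulative-count idea from the sparsity curve above; it is an illustration, not the authors' procedure.

```python
import numpy as np
import scipy.sparse as sp

def cutoff_for_sparsity(X, ranking, target_sparsity):
    """Smallest number of top-ranked features whose retention yields at
    least `target_sparsity` nonzero components per document on average."""
    X = sp.csc_matrix(X)
    cum = np.cumsum(X.getnnz(axis=0)[ranking]) / X.shape[0]
    k = int(np.searchsorted(cum, target_sparsity)) + 1
    return min(k, len(ranking))
```

With such a helper, "constant sparsity per category" simply means calling it with the same target_sparsity for every category, while the resulting number of features is free to differ from category to category.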

12
Results
13
Conclusions
  • Sparsity is an interesting and useful concept
  • As a cutoff criterion, it is not any worse, and
    is often a little better, than the number of
    features
  • It offers more direct control over memory and CPU
    time consumption
  • When comparing feature selection methods, it is
    not biased in favour of methods which prefer more
    common features

14
Future work
  • Characterize feature ranking schemes in terms of
    other characteristics besides sparsity curves
  • E.g. cumulative information gain: how the sum of
    IG(t; c) over the first k terms t of the feature
    ranking grows with k.
  • The goal: define a set of characteristic curves
    that would explain why some feature rankings
    (e.g. SVM-based) are better than others.
  • If we know the characteristic curves of a good
    feature ranking, we can synthesize new rankings
    with approximately the same characteristic curves
  • Would they also perform comparably well?
  • With a good set of feature characteristics, we
    might be able to take the approximate
    characteristics of a good feature ranking and
    then synthesize comparably good rankings on other
    classes or datasets.
  • (Otherwise it can be expensive to get a really
    good feature ranking, such as the SVM-based one.)
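A cumulative information-gain curve of the kind proposed here could be computed directly from per-feature IG scores; the sketch below assumes those scores are already available (e.g. from a scoring routine like the one sketched after slide 3) and is purely illustrative.

```python
import numpy as np

def cumulative_ig_curve(ranking, ig_scores):
    """Sum of IG(t; c) over the first k terms of a feature ranking, for each k.

    ranking   : feature indices, best first
    ig_scores : IG(t; c) for every feature, indexed by feature id
    """
    return np.cumsum(np.asarray(ig_scores)[ranking])
```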