Title: Extracting Key-Substring-Group Features for Text Classification
1. Extracting Key-Substring-Group Features for Text Classification
KDD 2006
Dell Zhang and Wee Sun Lee
2. The Context
- Text Classification via Machine Learning (ML)
[Diagram: training documents feed the learning phase; the learned classifier predicts labels for test documents]
3. Text Data
To be, or not to be
4. Some Applications
- Non-Topical Text Classification
- Text Genre Classification
- Paper? Poem? Prose?
- Text Authorship Classification
- Washington? Adams? Jefferson?
How to exploit sub-word/super-word information?
5. Some Applications
- Asian-Language Text Classification
How to avoid the problem of word-segmentation?
6. Some Applications
(Pampapathi et al., 2006)
How to handle non-alphabetical characters etc.?
7. Some Applications
- Desktop Text Classification
How to deal with different types of files?
8. Learning Algorithms
- Generative
  - Naïve Bayes, Rocchio, ...
- Discriminative
  - Support Vector Machine (SVM), AdaBoost, ...
For word-based text classification, discriminative methods are often superior to generative methods. How about string-based text classification?
9. String-Based Text Classification
- Generative
  - Markov chain models (char-level)
    - fixed order: n-gram, ...
    - variable order: PST, PPM, ...
- Discriminative
  - SVM with string kernel (taking all substrings as features implicitly through the kernel trick)
  - limitations: (1) ridge problem, (2) feature redundancy, (3) difficulty with feature selection/weighting and advanced kernels
10. The Problem
[Diagram: word-based vs. string-based × generative vs. discriminative; discriminative string-based classification is the open question]
11. The Difficulty
- The number of substrings is O(n²), e.g.:

  d1: to_be      (5 characters, 15 substrings)
  d2: not_to_be  (9 characters, 45 substrings)
  total:         14 characters, 60 substrings
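The O(n²) count on this slide is easy to verify: a string of length k has k(k+1)/2 position-distinct non-empty substrings. A quick check on the slide's two toy documents:

```python
def substring_count(k: int) -> int:
    """Number of position-distinct non-empty substrings of a length-k string."""
    return k * (k + 1) // 2

docs = {"d1": "to_be", "d2": "not_to_be"}
for name, text in docs.items():
    print(name, len(text), substring_count(len(text)))

total_chars = sum(len(t) for t in docs.values())
total_subs = sum(substring_count(len(t)) for t in docs.values())
print("total", total_chars, total_subs)  # total 14 60
```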
12. Our Idea
- The substrings can be partitioned into statistical equivalence groups, e.g. for the corpus d1 = to_be, d2 = not_to_be:
  - { ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be }
  - { to, to_, to_b, to_be }
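The partition above can be reproduced by brute force on the toy corpus: two substrings are equivalent when they occur at exactly the same set of (document, start-position) pairs, which is precisely when they map to the same suffix tree node. A sketch (the suffix tree computes this grouping implicitly and in linear time; the quadratic enumeration here is only for illustration):

```python
from collections import defaultdict

def equivalence_groups(docs):
    """Group all substrings of the corpus by their exact occurrence sets."""
    occurrences = defaultdict(set)  # substring -> {(doc_id, start), ...}
    for doc_id, text in docs.items():
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                occurrences[text[i:j]].add((doc_id, i))
    groups = defaultdict(set)  # occurrence set -> substring group
    for sub, occ in occurrences.items():
        groups[frozenset(occ)].add(sub)
    return list(groups.values())

docs = {"d1": "to_be", "d2": "not_to_be"}
groups = equivalence_groups(docs)
print({"to", "to_", "to_b", "to_be"} in groups)  # True
```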
13. Suffix Tree
[Figure: suffix tree over the toy corpus, with edge labels spelling out the suffixes of not_to_be; a suffix tree node corresponds to a substring group]
14. Substring-Groups
The substrings in an equivalence group have identical distributions over the corpus, so each substring-group can be taken as a whole as a single feature for a statistical machine learning algorithm for text classification.
15. Substring-Groups
- The number of substring-groups is O(n)
  - n trivial substring-groups
    - leaf nodes
    - frequency 1
    - not so useful for learning
  - at most n-1 non-trivial substring-groups
    - internal (non-root) nodes
    - frequency > 1
    - to be selected as features
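These bounds can be sanity-checked by brute force on a single document, again identifying groups by identical occurrence sets. For n = 9 (the string not_to_be) there are exactly n trivial groups of frequency 1, and the non-trivial groups number at most n-1:

```python
from collections import defaultdict

def groups_with_freq(text):
    """Return (substring group, frequency) pairs for one string."""
    occ = defaultdict(set)  # substring -> set of start positions
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            occ[text[i:j]].add(i)
    by_occ = defaultdict(set)
    for sub, starts in occ.items():
        by_occ[frozenset(starts)].add(sub)
    # frequency of a group = how many times its substrings occur
    return [(subs, len(starts)) for starts, subs in by_occ.items()]

text = "not_to_be"
n = len(text)
trivial = [g for g, f in groups_with_freq(text) if f == 1]
nontrivial = [g for g, f in groups_with_freq(text) if f > 1]
print(len(trivial), len(nontrivial))  # 9 3
```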
16. Key-Substring-Groups
- Select the key (salient) substring-groups by five parameters:
  - -l: the minimum frequency freq(SG_v)
  - -h: the maximum frequency freq(SG_v)
  - -b: the minimum number of branches children_num(v)
  - -p: the maximum parent-child conditional probability freq(SG_v) / freq(SG_p(v))
  - -q: the maximum suffix-link conditional probability freq(SG_v) / freq(SG_s(v))
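Given per-node statistics, the selection itself is a plain conjunction of threshold tests. A sketch, assuming the node statistics have been computed on the tree beforehand (the dict field names here are illustrative, not from the paper):

```python
def is_key_group(node, l=80, h=8000, b=8, p=0.8, q=0.8):
    """node: dict with freq, children_num, parent_freq, suffix_link_freq.
    Applies the -l/-h/-b/-p/-q selection criteria from the slide."""
    return (node["freq"] >= l
            and node["freq"] <= h
            and node["children_num"] >= b
            and node["freq"] / node["parent_freq"] <= p
            and node["freq"] / node["suffix_link_freq"] <= q)

node = {"freq": 120, "children_num": 9,
        "parent_freq": 400, "suffix_link_freq": 200}
print(is_key_group(node))  # True
```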
17. Suffix Link
- s(c1 c2 ... ck) = c2 ... ck (drop the first character)
- v → s(v)
- following suffix links repeatedly leads to the root
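Viewed on strings rather than tree nodes, the suffix link simply drops the first character, and following it repeatedly from any substring reaches the empty string, i.e. the root. A minimal sketch:

```python
def suffix_link(s: str) -> str:
    """s(c1 c2 ... ck) = c2 ... ck; the root corresponds to the empty string."""
    return s[1:]

chain = []
v = "to_be"
while v:  # stop at the root (empty string)
    chain.append(v)
    v = suffix_link(v)
print(chain)  # ['to_be', 'o_be', '_be', 'be', 'e']
```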
18. Feature Extraction Algorithm
- Input
  - a set of documents
  - the parameters (l, h, b, p, q)
- Output
  - the key-substring-groups for each document
- Time complexity: O(n)
- Trick
  - make use of suffix links to traverse the tree
19Feature Extraction Algorithm
construct the (generalized) suffix tree T
using Ukkonens algorithm count frequencies
recursively select features recursively accumula
te features recursively for each document d
match d to T and get to the node v while v
is not the root output the features
associated with v move v to the next
node via the suffix link of v
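The per-document loop above can be imitated without a real suffix tree: once the key-substring-groups are selected, a group fires for a document exactly when the document's match path passes through that node, i.e. when the group's shortest substring occurs in the document. A brute-force equivalent of the suffix-link traversal (the real algorithm does this in linear time on the tree; group contents here are illustrative):

```python
def extract_features(doc, key_groups):
    """Return ids of the key-substring-groups that occur in doc.
    A group fires when its shortest member substring occurs in doc."""
    features = []
    for gid, group in enumerate(key_groups):
        shortest = min(group, key=len)
        if shortest in doc:
            features.append(gid)
    return features

key_groups = [{"to", "to_", "to_b", "to_be"}, {"o"}, {"_"}, {"t"}]
print(extract_features("to_be", key_groups))  # [0, 1, 2, 3]
print(extract_features("note", key_groups))   # [1, 3]
```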
20. Experiments
- Parameter Tuning
  - the number of features
  - the cross-validation performance
- Feature Weighting
  - TF×IDF (with L2 normalization)
- Learning Algorithm
  - LibSVM
  - linear kernel
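The feature-weighting step can be sketched in a few lines, assuming the standard TF×IDF formula tf · log(N/df) followed by L2 normalization (the paper's exact weighting variant may differ):

```python
import math

def tfidf_l2(doc_term_counts):
    """doc_term_counts: list of {term: count} dicts, one per document.
    Returns TF×IDF vectors with L2 (Euclidean) normalization."""
    n_docs = len(doc_term_counts)
    df = {}  # document frequency of each term
    for counts in doc_term_counts:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    vectors = []
    for counts in doc_term_counts:
        vec = {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors

docs = [{"to": 2, "be": 2, "or": 1}, {"to": 1, "be": 1, "not": 1}]
vecs = tfidf_l2(docs)
print(vecs[0])
```

Terms occurring in every document get IDF log(1) = 0, so only the discriminative terms survive the weighting.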
21. English Text Topic Classification
- Dataset
  - Reuters-21578 Top10 (ApteMod)
  - the home ground of word-based text classification
- Classes
  - (1) earn (2) acq (3) money-fx (4) grain (5) crude (6) trade (7) interest (8) ship (9) wheat (10) corn
- Parameters
  - -l 80 -h 8000 -b 8 -p 0.8 -q 0.8
- Features
  - 91,013 → 6,055 (extracted in < 30 seconds)
22. English Text Topic Classification
The distribution of substring-groups follows Zipf's law (power law).
23. English Text Topic Classification
The performance of linear kernel SVM with
key-substring-group features on the Reuters-21578
top10 dataset.
24. English Text Topic Classification
Comparing the experimental results of our proposed approach with some representative existing approaches.
25. English Text Topic Classification
The influence of the feature extraction parameters on the number of features and the text classification performance.
26. Chinese Text Topic Classification
- Dataset
  - TREC-5 People's Daily News
- Classes
  - (1) Politics, Law and Society (2) Literature and Arts (3) Education, Science and Culture (4) Sports (5) Theory and Academy (6) Economics
- Parameters
  - -l 20 -h 8000 -b 8 -p 0.8 -q 0.8
27. Chinese Text Topic Classification
- Performance (micro-averaged F1)
  - SVM with word segmentation: 82.0 (He et al., 2000; He et al., 2003)
  - char-level n-gram language model: 86.7 (Peng et al., 2004)
  - SVM with key-substring-group features: 87.3
28. Greek Text Authorship Classification
- Dataset
  - (Stamatatos et al., 2000)
- Classes
  - (1) S. Alaxiotis (2) G. Babiniotis (3) G. Dertilis (4) C. Kiosse (5) A. Liakos (6) D. Maronitis (7) M. Ploritis (8) T. Tasios (9) K. Tsoukalas (10) G. Vokos
29. Greek Text Authorship Classification
- Performance (accuracy)
  - deep natural language processing: 72 (Stamatatos et al., 2000)
  - char-level n-gram language model: 90 (Peng et al., 2004)
  - SVM with key-substring-group features: 92
30. Greek Text Genre Classification
- Dataset
  - (Stamatatos et al., 2000)
- Classes
  - (1) press editorial (2) press reportage (3) academic prose (4) official documents (5) literature (6) recipes (7) curriculum vitae (8) interviews (9) planned speeches (10) broadcast news
31. Greek Text Genre Classification
- Performance (accuracy)
  - deep natural language processing: 82 (Stamatatos et al., 2000)
  - char-level n-gram language model: 86 (Peng et al., 2004)
  - SVM with key-substring-group features: 94
32. Conclusion
- We propose
  - the concept of key-substring-group features, and
  - a linear-time (suffix tree based) algorithm to extract them
- We show that
  - our method works well for several text classification tasks
- Future directions: clustering etc.? gene/protein sequence data?