Title: Extracting Key-Substring-Group Features for Text Classification
1. Extracting Key-Substring-Group Features for Text Classification
KDD 2006
Dell Zhang and Wee Sun Lee
2. The Context
- Text Classification via Machine Learning (ML)
[Diagram: training documents feed the learning phase; the learned classifier predicts labels for test documents]
3. Text Data
To be, or not to be
4. Some Applications
- Non-Topical Text Classification
- Text Genre Classification
- Paper? Poem? Prose?
- Text Authorship Classification
- Washington? Adams? Jefferson?
How to exploit sub-word/super-word information?
5. Some Applications
- Asian-Language Text Classification
How to avoid the problem of word-segmentation?
6. Some Applications
(Pampapathi et al., 2006)
How to handle non-alphabetical characters etc.?
7. Some Applications
- Desktop Text Classification
How to deal with different types of files?
8. Learning Algorithms
- Generative
  - Naïve Bayes, Rocchio, ...
- Discriminative
  - Support Vector Machine (SVM), AdaBoost, ...
For word-based text classification, discriminative methods are often superior to generative methods. How about string-based text classification?
9. String-Based Text Classification
- Generative
  - Markov chain models (char-level)
    - fixed order: n-gram, ...
    - variable order: PST, PPM, ...
- Discriminative
  - SVM with string kernel (taking all substrings as features implicitly through the kernel trick)
  - limitations: (1) ridge problem, (2) feature redundancy, (3) difficulty with feature selection/weighting and advanced kernels
10. The Problem
[Diagram: word-based vs. string-based × generative vs. discriminative; discriminative string-based classification is the open question]
11. The Difficulty
- The number of substrings is O(n²), e.g.:

  d1: to_be      (5 characters, 15 substrings)
  d2: not_to_be  (9 characters, 45 substrings)
  total:         14 characters, 60 substrings
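The O(n²) count on this slide is easy to verify: a string of length k has k(k+1)/2 position-distinct non-empty substrings. A quick check on the slide's two toy documents:

```python
def substring_count(k: int) -> int:
    """Number of position-distinct non-empty substrings of a length-k string."""
    return k * (k + 1) // 2

docs = {"d1": "to_be", "d2": "not_to_be"}
for name, text in docs.items():
    print(name, len(text), substring_count(len(text)))

total_chars = sum(len(t) for t in docs.values())
total_subs = sum(substring_count(len(t)) for t in docs.values())
print("total", total_chars, total_subs)  # total 14 60
```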
12. Our Idea
- The substrings can be partitioned into statistical equivalence groups, e.g. for the corpus d1 = to_be, d2 = not_to_be:
  - { ot, ot_, ot_t, ot_to, ot_to_, ot_to_b, ot_to_be }
  - { to, to_, to_b, to_be }
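The partition above can be reproduced by brute force on the toy corpus: two substrings are equivalent when they occur at exactly the same set of (document, start-position) pairs, which is precisely when they map to the same suffix tree node. A sketch (the suffix tree computes this grouping implicitly and in linear time; the quadratic enumeration here is only for illustration):

```python
from collections import defaultdict

def equivalence_groups(docs):
    """Group all substrings of the corpus by their exact occurrence sets."""
    occurrences = defaultdict(set)  # substring -> {(doc_id, start), ...}
    for doc_id, text in docs.items():
        for i in range(len(text)):
            for j in range(i + 1, len(text) + 1):
                occurrences[text[i:j]].add((doc_id, i))
    groups = defaultdict(set)  # occurrence set -> substring group
    for sub, occ in occurrences.items():
        groups[frozenset(occ)].add(sub)
    return list(groups.values())

docs = {"d1": "to_be", "d2": "not_to_be"}
groups = equivalence_groups(docs)
print({"to", "to_", "to_b", "to_be"} in groups)  # True
```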
13. Suffix Tree
[Figure: suffix tree over the toy corpus, with edge labels spelling out the suffixes of not_to_be; a suffix tree node corresponds to a substring group]
14. Substring-Groups
The substrings in an equivalence group have identical distributions over the corpus, so each substring-group can be taken as a whole as a single feature for a statistical machine learning algorithm for text classification.
15. Substring-Groups
- The number of substring-groups is O(n)
  - n trivial substring-groups
    - leaf nodes
    - frequency 1
    - not so useful for learning
  - at most n-1 non-trivial substring-groups
    - internal (non-root) nodes
    - frequency > 1
    - to be selected as features
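These bounds can be sanity-checked by brute force on a single document, again identifying groups by identical occurrence sets. For n = 9 (the string not_to_be) there are exactly n trivial groups of frequency 1, and the non-trivial groups number at most n-1:

```python
from collections import defaultdict

def groups_with_freq(text):
    """Return (substring group, frequency) pairs for one string."""
    occ = defaultdict(set)  # substring -> set of start positions
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            occ[text[i:j]].add(i)
    by_occ = defaultdict(set)
    for sub, starts in occ.items():
        by_occ[frozenset(starts)].add(sub)
    # frequency of a group = how many times its substrings occur
    return [(subs, len(starts)) for starts, subs in by_occ.items()]

text = "not_to_be"
n = len(text)
trivial = [g for g, f in groups_with_freq(text) if f == 1]
nontrivial = [g for g, f in groups_with_freq(text) if f > 1]
print(len(trivial), len(nontrivial))  # 9 3
```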
16. Key-Substring-Groups
- Select the key (salient) substring-groups by five parameters:
  - -l: the minimum frequency freq(SG_v)
  - -h: the maximum frequency freq(SG_v)
  - -b: the minimum number of branches children_num(v)
  - -p: the maximum parent-child conditional probability freq(SG_v) / freq(SG_p(v))
  - -q: the maximum suffix-link conditional probability freq(SG_v) / freq(SG_s(v))
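Given per-node statistics, the selection itself is a plain conjunction of threshold tests. A sketch, assuming the node statistics have been computed on the tree beforehand (the dict field names here are illustrative, not from the paper):

```python
def is_key_group(node, l=80, h=8000, b=8, p=0.8, q=0.8):
    """node: dict with freq, children_num, parent_freq, suffix_link_freq.
    Applies the -l/-h/-b/-p/-q selection criteria from the slide."""
    return (node["freq"] >= l
            and node["freq"] <= h
            and node["children_num"] >= b
            and node["freq"] / node["parent_freq"] <= p
            and node["freq"] / node["suffix_link_freq"] <= q)

node = {"freq": 120, "children_num": 9,
        "parent_freq": 400, "suffix_link_freq": 200}
print(is_key_group(node))  # True
```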
17. Suffix Link
- s(c1 c2 ... ck) = c2 ... ck (drop the first character)
- v → s(v)
- following suffix links repeatedly leads to the root
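Viewed on strings rather than tree nodes, the suffix link simply drops the first character, and following it repeatedly from any substring reaches the empty string, i.e. the root. A minimal sketch:

```python
def suffix_link(s: str) -> str:
    """s(c1 c2 ... ck) = c2 ... ck; the root corresponds to the empty string."""
    return s[1:]

chain = []
v = "to_be"
while v:  # stop at the root (empty string)
    chain.append(v)
    v = suffix_link(v)
print(chain)  # ['to_be', 'o_be', '_be', 'be', 'e']
```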
18. Feature Extraction Algorithm
- Input
  - a set of documents
  - the parameters (l, h, b, p, q)
- Output
  - the key-substring-groups for each document
- Time complexity: O(n)
- Trick
  - make use of suffix links to traverse the tree
19Feature Extraction Algorithm
construct the (generalized) suffix tree T
using Ukkonens algorithm count frequencies
recursively select features recursively accumula
te features recursively for each document d
match d to T and get to the node v while v
is not the root output the features
associated with v move v to the next
node via the suffix link of v
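The per-document loop above can be imitated without a real suffix tree: once the key-substring-groups are selected, a group fires for a document exactly when the document's match path passes through that node, i.e. when the group's shortest substring occurs in the document. A brute-force equivalent of the suffix-link traversal (the real algorithm does this in linear time on the tree; group contents here are illustrative):

```python
def extract_features(doc, key_groups):
    """Return ids of the key-substring-groups that occur in doc.
    A group fires when its shortest member substring occurs in doc."""
    features = []
    for gid, group in enumerate(key_groups):
        shortest = min(group, key=len)
        if shortest in doc:
            features.append(gid)
    return features

key_groups = [{"to", "to_", "to_b", "to_be"}, {"o"}, {"_"}, {"t"}]
print(extract_features("to_be", key_groups))  # [0, 1, 2, 3]
print(extract_features("note", key_groups))   # [1, 3]
```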
20. Experiments
- Parameter Tuning
  - the number of features
  - the cross-validation performance
- Feature Weighting
  - TF×IDF (with L2 normalization)
- Learning Algorithm
  - LibSVM
  - linear kernel
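The feature-weighting step can be sketched in a few lines, assuming the standard TF×IDF formula tf · log(N/df) followed by L2 normalization (the paper's exact weighting variant may differ):

```python
import math

def tfidf_l2(doc_term_counts):
    """doc_term_counts: list of {term: count} dicts, one per document.
    Returns TF×IDF vectors with L2 (Euclidean) normalization."""
    n_docs = len(doc_term_counts)
    df = {}  # document frequency of each term
    for counts in doc_term_counts:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    vectors = []
    for counts in doc_term_counts:
        vec = {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        if norm > 0:
            vec = {t: w / norm for t, w in vec.items()}
        vectors.append(vec)
    return vectors

docs = [{"to": 2, "be": 2, "or": 1}, {"to": 1, "be": 1, "not": 1}]
vecs = tfidf_l2(docs)
print(vecs[0])
```

Terms occurring in every document get IDF log(1) = 0, so only the discriminative terms survive the weighting.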
21. English Text Topic Classification
- Dataset
  - Reuters-21578 Top10 (ApteMod)
  - the home ground of word-based text classification
- Classes
  - (1) earn (2) acq (3) money-fx (4) grain (5) crude (6) trade (7) interest (8) ship (9) wheat (10) corn
- Parameters
  - -l 80 -h 8000 -b 8 -p 0.8 -q 0.8
- Features
  - 91,013 → 6,055 (extracted in < 30 seconds)
22. English Text Topic Classification
The distribution of substring-groups follows Zipf's law (power law).
23. English Text Topic Classification
The performance of linear kernel SVM with
key-substring-group features on the Reuters-21578
top10 dataset.
24. English Text Topic Classification
Comparing the experimental results of our proposed approach with some representative existing approaches.
25. English Text Topic Classification
The influence of the feature extraction parameters on the number of features and the text classification performance.
26. Chinese Text Topic Classification
- Dataset
  - TREC-5 People's Daily News
- Classes
  - (1) Politics, Law and Society (2) Literature and Arts (3) Education, Science and Culture (4) Sports (5) Theory and Academy (6) Economics
- Parameters
  - -l 20 -h 8000 -b 8 -p 0.8 -q 0.8
27. Chinese Text Topic Classification
- Performance (micro-averaged F1)
  - SVM with word segmentation: 82.0 (He et al., 2000; He et al., 2003)
  - char-level n-gram language model: 86.7 (Peng et al., 2004)
  - SVM with key-substring-group features: 87.3
28. Greek Text Authorship Classification
- Dataset
  - (Stamatatos et al., 2000)
- Classes
  - (1) S. Alaxiotis (2) G. Babiniotis (3) G. Dertilis (4) C. Kiosse (5) A. Liakos (6) D. Maronitis (7) M. Ploritis (8) T. Tasios (9) K. Tsoukalas (10) G. Vokos
29. Greek Text Authorship Classification
- Performance (accuracy)
  - deep natural language processing: 72 (Stamatatos et al., 2000)
  - char-level n-gram language model: 90 (Peng et al., 2004)
  - SVM with key-substring-group features: 92
30. Greek Text Genre Classification
- Dataset
  - (Stamatatos et al., 2000)
- Classes
  - (1) press editorial (2) press reportage (3) academic prose (4) official documents (5) literature (6) recipes (7) curriculum vitae (8) interviews (9) planned speeches (10) broadcast news
31. Greek Text Genre Classification
- Performance (accuracy)
  - deep natural language processing: 82 (Stamatatos et al., 2000)
  - char-level n-gram language model: 86 (Peng et al., 2004)
  - SVM with key-substring-group features: 94
32. Conclusion
- We propose
  - the concept of key-substring-group features, and
  - a linear-time (suffix tree based) algorithm to extract them
- We show that
  - our method works well for several text classification tasks
- Future directions: clustering etc.? gene/protein sequence data?