Title: A Boosting Algorithm for Classification of Semi-Structured Text
1. A Boosting Algorithm for Classification of Semi-Structured Text
- Taku Kudo
- Yuji Matsumoto
- Nara Institute of Science and Technology
- Currently at NTT Communication Science Labs.
2. Background
- Text classification using machine learning
  - categories: topics (sports, finance, politics)
  - features: bag-of-words (BOW)
  - methods: SVM, Boosting, Naïve Bayes
- Changes in categories
  - modalities, subjectivities, or sentiments
- Changes in text size
  - document (large) → passage, sentence (small)
- Our claim: BOW is not sufficient
3. Background, cont.
- Straightforward extensions
  - add some structural features, e.g., fixed-length N-grams or fixed-length syntactic relations
- But:
  - ad-hoc and task-dependent
  - require careful feature selection
  - how do we determine the optimal size (length)?
  - use of larger substructures yields inefficiency
  - use of smaller substructures is the same as BOW
4. Our approach
- Semi-structured text
  - assume that text is represented as a tree
  - word sequence, dependency tree, base phrases, XML
- Propose a new ML algorithm that automatically captures relevant substructures in semi-structured text
- Characteristics
  - an instance is not a numerical vector but a tree
  - all subtrees are used as features, without any constraints
  - a compact and relevant feature set is selected automatically
5. Classifier for Trees
6. Tree classification problem
- Goal
  - induce a mapping f: X → {+1, -1} from given training data
- Training data
  - a set T of pairs of a tree x and a class label y (+1 or -1)
[Figure: training data T = { (tree, +1), (tree, -1), (tree, +1), (tree, -1) }, a set of labeled trees with nodes labeled a, b, c, d]
7. Labeled ordered tree, subtree
- Labeled ordered tree (or simply "tree")
  - labeled: each node is associated with a label
  - ordered: siblings are ordered
- Subtree
  - preserves the parent-daughter relation
  - preserves the sibling relation
  - preserves the labels
[Figure: a tree A and a smaller tree B over labels a, b, c, d]
B is a subtree of A; A is a supertree of B.
8. Decision stumps for trees
- A simple rule-based classifier
- <t, y> is a parameter (rule) of a decision stump:
  h_<t,y>(x) = y if t ⊆ x, and -y otherwise
[Figure: example rules <t1, +1> and <t2, -1> applied to a tree x; both return h(x) = 1, since t1 occurs in x and t2 does not]
9. Decision stumps for trees, cont.
- Training: select the optimal rule that maximizes the gain (or accuracy):
  <t̂, ŷ> = argmax over t ∈ F, y ∈ {+1, -1} of gain(<t, y>) = Σ_i y_i · h_<t,y>(x_i)
- F: the feature set (the set of all subtrees); see the sketch below
10. Decision stumps for trees, cont.
[Figure: the gain of a candidate rule <t, y> is evaluated against the four labeled training trees from slide 6]
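A minimal Python sketch of the stump and its gain, assuming a simple Node class; this is illustrative code, not the authors' bact implementation (the subtree test follows the definition on slide 7):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Node:
        label: str
        children: List["Node"] = field(default_factory=list)

    def matches_at(t: Node, n: Node) -> bool:
        # t matches at n if the labels agree and t's daughters embed,
        # in order, into n's daughters (parent-daughter relation and
        # sibling order preserved, as on slide 7)
        if t.label != n.label:
            return False
        i = 0
        for tc in t.children:
            while i < len(n.children) and not matches_at(tc, n.children[i]):
                i += 1
            if i == len(n.children):
                return False
            i += 1
        return True

    def contains(x: Node, t: Node) -> bool:
        # t ⊆ x iff t matches at some node of x
        return matches_at(t, x) or any(contains(c, t) for c in x.children)

    def stump(t: Node, y: int, x: Node) -> int:
        # decision stump: h_<t,y>(x) = y if t ⊆ x, else -y
        return y if contains(x, t) else -y

    def gain(t: Node, y: int, data, d):
        # weighted gain: sum_i y_i * d_i * h_<t,y>(x_i);
        # with uniform weights d this reduces to the accuracy-style
        # gain of slide 9
        return sum(yi * di * stump(t, y, xi)
                   for (xi, yi), di in zip(data, d))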
11. Boosting
- Decision stumps alone are too weak
- Boosting [Schapire 97]
  1. build a weak learner (a decision stump) h_j
  2. re-weight instances with respect to the error rate
  - repeat steps 1-2 K times
  - output a linear combination of h_1 ... h_K
- Redefine the gain to use Boosting, with instance weights d_i:
  gain(<t, y>) = Σ_i y_i · d_i · h_<t,y>(x_i)
  (see the AdaBoost-style sketch below)
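An AdaBoost-style sketch of the loop above; candidate_rules is a hypothetical stand-in for the search over F (the actual method finds the best rule by the branch-and-bound search of the next slides rather than enumerating an explicit list):

    import math

    def boost(data, K, candidate_rules):
        # data: list of (tree, label) pairs; candidate_rules: list of (t, y)
        L = len(data)
        d = [1.0 / L] * L                      # instance weights d_i
        model = []                             # [(alpha_j, t_j, y_j), ...]
        for _ in range(K):
            # step 1: pick the stump maximizing the weighted gain
            t, y = max(candidate_rules,
                       key=lambda r: gain(r[0], r[1], data, d))
            g = gain(t, y, data, d)            # weighted edge in (-1, 1)
            alpha = 0.5 * math.log((1 + g) / (1 - g))
            model.append((alpha, t, y))
            # step 2: re-weight; misclassified instances get heavier
            d = [di * math.exp(-alpha * yi * stump(t, y, xi))
                 for (xi, yi), di in zip(data, d)]
            Z = sum(d)
            d = [di / Z for di in d]           # renormalize
        return model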
12. Efficient Computation
13. How to find the optimal rule?
- F is too huge to be enumerated explicitly
- Need to find the optimal rule efficiently
14. Rightmost extension [Asai 02, Zaki 02]
- Extend a given tree of size (n-1) by adding a new node, to obtain trees of size n
  - the node is added to the rightmost path
  - the node is added as the rightmost sibling
15. Rightmost extension, cont.
- Recursive application of rightmost extensions creates a search space (see the sketch below)
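A sketch of rightmost extension over the Node class above; labels is the label alphabet. Attaching a new rightmost daughter to each node on the rightmost path generates every tree of size n from a tree of size n-1 without duplicates:

    import copy

    def rightmost_path(root: Node):
        # the nodes from the root down to the rightmost leaf
        path = [root]
        while path[-1].children:
            path.append(path[-1].children[-1])
        return path

    def rightmost_extensions(t: Node, labels):
        # all trees obtained by adding one node to t on its rightmost path
        out = []
        for depth in range(len(rightmost_path(t))):
            for lab in labels:
                t2 = copy.deepcopy(t)
                rightmost_path(t2)[depth].children.append(Node(lab))
                out.append(t2)
        return out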
16. Pruning
- For all supertrees t' ⊇ t, propose an upper bound μ(t) such that gain(<t', y>) ≤ μ(t)
- Can prune the node t if μ(t) < τ,
  - where τ is the suboptimal (best-so-far) gain
17. Upper bound of the gain (an extension of [Morishita 02])
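The formula on this slide did not survive extraction; the Morishita-style bound used in the paper has, to the best of my reading, the form μ(t) = max( 2·Σ_{i: y_i=+1, t⊆x_i} d_i − Σ_i y_i·d_i , 2·Σ_{i: y_i=−1, t⊆x_i} d_i + Σ_i y_i·d_i ). A sketch, continuing the code above:

    def gain_bound(t: Node, data, d):
        # mu(t): upper bound on gain(<t', y>) for every supertree t' of t
        # and either label y (Morishita-style bound; illustrative sketch)
        tot = sum(yi * di for (_, yi), di in zip(data, d))
        pos = sum(di for (xi, yi), di in zip(data, d)
                  if yi == +1 and contains(xi, t))
        neg = sum(di for (xi, yi), di in zip(data, d)
                  if yi == -1 and contains(xi, t))
        return max(2 * pos - tot, 2 * neg + tot)

    # branch and bound: during the rightmost-extension search, if
    # gain_bound(t, data, d) <= tau (the best gain found so far),
    # prune t: no extension of t can beat the current best rule.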
18. Relation to SVMs with Tree Kernel
19. Classification algorithm
- Modeled as a linear classifier: f(x) = sgn( Σ_{t ∈ F} w_t · I(t ⊆ x) + b )
  - w_t: weight of tree t
  - b: bias (the default class label)
- (see the sketch below)
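A sketch of classification with the (alpha, t, y) rules returned by the boost function above; the stump weights play the role of w_t:

    def classify(model, x: Node, b: float = 0.0) -> int:
        # f(x) = sgn( sum_j alpha_j * h_<t_j,y_j>(x) + b )
        s = b + sum(alpha * stump(t, y, x) for alpha, t, y in model)
        return +1 if s >= 0 else -1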
20. SVMs and Tree Kernel [Collins 02]
- Tree kernel: all subtrees are expanded implicitly
[Figure: a tree and its implicit feature vector (0, ..., 1, 1, ..., 1, ..., 1, ..., 1, ..., 1, ..., 0, ...), one dimension per subtree]
- Feature spaces are essentially the same
- Learning strategies are different
21. SVM vs. Boosting [Rätsch 01]
- Both are known as large-margin classifiers
- Metric of margin is different
  - SVM: L2-norm margin
  - Boosting: L1-norm margin
    - w is expressed in a small number of features
    - sparse solution in the feature space
22. SVM vs. Boosting, cont.
- Accuracy is task-dependent
- Practical advantages of Boosting
  - good interpretability
    - can analyze how the model performs and what kinds of features are useful
    - compact features (rules) are easy to deal with
  - fast classification
    - complexity depends on the small number of rules
    - kernel methods are too heavy
23. Experiments
24. Sentence classification
- PHS: cell-phone review classification (5,741 sentences)
  - domain: a Web-based BBS on PHS, a sort of cell phone
  - categories: positive review or negative review
- MOD: modality identification (1,710 sentences)
  - domain: editorial news articles
  - categories: assertion, opinion, or description
- Examples:
  - positive: "It is useful that we can know the date and time of e-mails."
  - negative: "I feel that the response is not so good."
  - assertion: "We should not hold an optimistic view of the success of POKEMON."
  - opinion: "I think that now is the best time for developing the blue print."
  - description: "Social function of education has been changing."
25. Sentence representations
- N-gram tree
  - each word simply modifies the next word
  - any subtree is an N-gram (N is unrestricted); see the sketch below
- Dependency tree
  - word-based dependency tree
  - a Japanese dependency parser, CaboCha, is used
- Bag-of-words (baseline)
[Figure: the N-gram tree and the dependency tree for the sentence "response is very good"]
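An illustrative construction of the N-gram tree: a chain in which each word is the single daughter of the word that follows it, so every connected subtree is a contiguous word N-gram:

    def ngram_tree(words: List[str]) -> Node:
        # "response is very good" -> good(very(is(response)))
        node = Node(words[0])
        for w in words[1:]:
            node = Node(w, [node])
        return node

    # e.g. ngram_tree("response is very good".split())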
26. Results
- Outperforms the baseline (BOW)
- dep vs. n-gram: comparable (no significant difference)
- SVMs show worse performance depending on the task
  - overfitting
27. Interpretability
PHS dataset with dependency trees
- A: subtrees that include "hard, difficult"
   0.0004  be hard to hang up
  -0.0006  be hard to read
  -0.0007  be hard to use
  -0.0017  be hard to
- B: subtrees that include "use"
   0.0027  want to use
   0.0002  use
   0.0002  be in use
   0.0001  be easy to use
  -0.0001  was easy to use
  -0.0007  be hard to use
  -0.0019  is easier to use than..
- C: subtrees that include "recharge"
   0.0028  recharging time is short
  -0.0041  recharging time is long
28. Interpretability, cont.
PHS dataset with dependency trees
Input: "The LCD is large, beautiful and easy to see"
  weight w    subtree t
   0.00368    be easy to
   0.00353    beautiful
   0.00237    be easy to see
   0.00174    is large
   0.00107    The LCD is large
   0.00074    The LCD is
   0.00057    The LCD
   0.00036    see
  -0.00001    large
29. Advantages
- Compact feature set
  - Boosting extracts only 1,783 unique features
  - the numbers of distinct 1-grams, 2-grams, and 3-grams are 4,211, 24,206, and 43,658, respectively
  - SVMs implicitly use a huge number of features
- Fast classification
  - Boosting: 0.531 sec. / 5,741 instances
  - SVM: 255.42 sec. / 5,741 instances
  - Boosting is about 480 times faster than SVMs
30. Conclusions
- Assume that text is represented as a tree
- Extension of decision stumps
  - all subtrees are potentially used as features
- Boosting
- Branch and bound
  - enables finding the optimal rule efficiently
- Advantages
  - good interpretability
  - fast classification
  - accuracy comparable to SVMs with tree kernels
31. Future work
- Other applications
  - information extraction
  - semantic-role labeling
  - parse-tree re-ranking
- Confidence-rated predictions for decision stumps
32. Thank you!
- An implementation of our method is available as open-source software at
  http://chasen.naist.jp/taku/software/bact/