1
A Boosting Algorithm for Classification of
Semi-Structured Text
  • Taku Kudo
  • Yuji Matsumoto
  • Nara Institute of Science and Technology
  • Currently, NTT Communication Science Labs.

2
Background
  • Text classification using machine learning
  • categories: topics (sports, finance, politics)
  • features: bag-of-words (BOW)
  • methods: SVM, Boosting, Naïve Bayes
  • Changes in categories
  • modalities, subjectivities, or sentiments
  • Changes in text size
  • document (large) → passage, sentence (small)
  • Our claim: BOW is not sufficient

3
Background, cont.
  • Straightforward extensions
  • Add structural features, e.g., fixed-length
    N-grams or fixed-length syntactic relations
  • But
  • ad hoc and task-dependent
  • requires careful feature selection
  • How to determine the optimal size (length)?
  • Using larger substructures is inefficient
  • Using smaller substructures is the same as BOW

4
Our approach
  • Semi-structured text
  • assume that text is represented as a tree
  • word sequence, dependency tree, base phrases, XML
  • Propose a new ML algorithm that automatically
    captures relevant substructures in
    semi-structured text
  • Characteristics
  • an instance is not a numerical vector but a tree
  • all subtrees are used as features, without any
    constraints
  • a compact and relevant feature set is
    selected automatically

5
Classifier for Trees
6
Tree classification problem
  • Goal
  • Induce a mapping f(x): X → {+1, -1} from given
    training data
  • Training data
  • A set T of pairs of a tree x and a class label y
    (+1 or -1)

[Figure: example training data T — four trees with nodes labeled a, b, c, d, each paired with a class label +1 or -1]
7
Labeled ordered tree, subtree
  • Labeled ordered tree (or simply tree)
  • labeled
  • each node is associated with a label
  • ordered
  • siblings are ordered
  • Subtree
  • preserves parent-daughter relation
  • preserves sibling relation
  • preserves the label

[Figure: two example trees A and B over the labels a, b, c, d]
B is a subtree of A; A is a supertree of B
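To make the data structure concrete, here is a minimal sketch of a labeled ordered tree and of the subtree relation defined above (the class name, the greedy matching routine, and the function names are my own illustration, not the authors' code):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tree:
    """A labeled ordered tree: a node label plus an ordered list of children."""
    label: str
    children: List["Tree"] = field(default_factory=list)


def matches_at(t: Tree, node: Tree) -> bool:
    """True if t can be embedded at `node`: the labels agree and t's children
    map, in order, onto an order-preserving subsequence of node's children,
    each embedded recursively -- i.e., labels, parent-daughter relations, and
    sibling order are all preserved, as required on the slide."""
    if t.label != node.label:
        return False
    i = 0  # scan position among node's children
    for child in t.children:
        while i < len(node.children) and not matches_at(child, node.children[i]):
            i += 1
        if i == len(node.children):
            return False
        i += 1
    return True


def is_subtree(t: Tree, x: Tree) -> bool:
    """True if t occurs as a subtree somewhere in x."""
    return matches_at(t, x) or any(is_subtree(t, c) for c in x.children)
```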
8
Decision stumps for trees
  • A simple rule-based classifier
  • ⟨t, y⟩ is a parameter (rule) of a decision stump
  • the stump answers h_⟨t,y⟩(x) = y if t is a subtree of x,
    and -y otherwise
[Figure: an input tree x and two example rules — ⟨t1, +1⟩, whose tree t1 occurs in x, and ⟨t2, -1⟩, whose tree t2 does not; so h_⟨t1,+1⟩(x) = 1 and h_⟨t2,-1⟩(x) = 1]
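A minimal sketch of this stump, building on the Tree/is_subtree sketch above (again my illustration, not the authors' code):

```python
def stump(t: Tree, y: int):
    """Decision stump with rule <t, y>: returns y when t is a subtree of the
    input tree, and -y otherwise."""
    def h(x: Tree) -> int:
        return y if is_subtree(t, x) else -y
    return h
```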
9
Decision stumps for trees, cont.
  • Training: select the optimal rule ⟨t, y⟩ that
    maximizes the gain (or accuracy)
  • F: the feature set (the set of all subtrees)

10
Decision stumps for trees, cont.
[Figure: the four training trees from slide 6 with their labels ±1, and the gain obtained by a candidate rule ⟨t, y⟩ on them; a sketch of the computation follows below]
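As a hedged illustration of this training step, the sketch below scores rules by the weighted gain Σ_i d_i · y_i · h_⟨t,y⟩(x_i) and picks the best one from an explicitly enumerated candidate set (uniform weights d_i recover plain decision stumps; the function names and signatures are assumptions of mine):

```python
from typing import Iterable, List, Tuple


def gain(t: Tree, y: int, data: List[Tuple[Tree, int]], d: List[float]) -> float:
    """Weighted gain of rule <t, y>: sum_i d_i * y_i * h_<t,y>(x_i)."""
    h = stump(t, y)
    return sum(d_i * y_i * h(x_i) for (x_i, y_i), d_i in zip(data, d))


def best_rule(candidates: Iterable[Tree],
              data: List[Tuple[Tree, int]],
              d: List[float]) -> Tuple[Tree, int]:
    """Brute-force search over an explicit feature set F; the actual algorithm
    avoids this enumeration (see the later slides on rightmost extension)."""
    return max(((t, y) for t in candidates for y in (+1, -1)),
               key=lambda rule: gain(rule[0], rule[1], data, d))
```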
11
Boosting
  • Decision stumps alone are too weak
  • Boosting [Schapire 97]
  • 1. build a weak learner (a decision stump) H_j
  • 2. re-weight the instances with respect to the error rate
  • repeat steps 1 and 2 K times
  • output a linear combination of H_1 ... H_K
  • Redefine the gain to use Boosting (see the sketch below)
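A hedged sketch of the boosting loop around the stump learner, using standard AdaBoost-style re-weighting (the paper's exact update may differ in detail; everything here builds on the earlier sketches):

```python
import math
from typing import List, Tuple


def boost(data: List[Tuple[Tree, int]],
          candidates: List[Tree],
          K: int) -> List[Tuple[float, Tree, int]]:
    """Returns (alpha, t, y) triples whose weighted vote classifies a tree."""
    L = len(data)
    d = [1.0 / L] * L                                   # instance weights
    model = []
    for _ in range(K):
        t, y = best_rule(candidates, data, d)           # weak learner
        h = stump(t, y)
        err = sum(d_i for (x_i, y_i), d_i in zip(data, d) if h(x_i) != y_i)
        err = min(max(err, 1e-10), 1 - 1e-10)           # guard against 0 or 1
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, y))
        # re-weighting: misclassified instances get larger weights
        d = [d_i * math.exp(-alpha * y_i * h(x_i))
             for (x_i, y_i), d_i in zip(data, d)]
        z = sum(d)
        d = [d_i / z for d_i in d]
    return model
```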

12
Efficient Computation
13
How to find the optimal rule?
  • F is too huge to be enumerated explicitly
  • Need to find the optimal rule efficiently

14
Rightmost extension [Asai 02, Zaki 02]
  • extend a given tree of size (n-1) by adding a new
    node, to obtain trees of size n
  • the new node is attached to a node on the rightmost path
  • the new node is added as the rightmost sibling
15
Rightmost extension, cont.
  • Recursive application of rightmost extension
    creates a search space over all subtrees (sketched below)
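A hedged sketch of one rightmost-extension step on the Tree structure from above (the helper names and the deep-copy approach are mine; the real miner also tracks where each candidate occurs in the data instead of re-scanning):

```python
import copy
from typing import List


def rightmost_path(tree: Tree) -> List[Tree]:
    """Nodes on the rightmost path, from the root downwards."""
    path = [tree]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def rightmost_extensions(tree: Tree, labels: List[str]) -> List[Tree]:
    """All trees of size n obtained from a tree of size n-1 by attaching one
    new node, with any label, as the rightmost child of a node on the
    rightmost path."""
    out = []
    for depth in range(len(rightmost_path(tree))):
        for label in labels:
            new = copy.deepcopy(tree)
            node = new
            for _ in range(depth):          # walk down the copy's rightmost path
                node = node.children[-1]
            node.children.append(Tree(label))
            out.append(new)
    return out
```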

16
Pruning
  • For every supertree t' of t, propose an
    upper bound μ(t) such that gain(⟨t', y⟩) ≤ μ(t)
  • Can prune the node t if μ(t) < τ,
  • where τ is a suboptimal gain (the best gain found so far)

17
Upper bound of the gain (an extension of
[Morishita 02])
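The bound itself did not survive the transcript; as a hedged reconstruction, a Morishita-style bound for this setting has the following shape, with d_i the boosting weights (the exact form should be checked against the paper):

```latex
\mu(t) = \max\Bigl(
  2\sum_{\{i \mid y_i = +1,\, t \subseteq x_i\}} d_i - \sum_{i=1}^{L} y_i d_i,\;
  2\sum_{\{i \mid y_i = -1,\, t \subseteq x_i\}} d_i + \sum_{i=1}^{L} y_i d_i
\Bigr),
\qquad
\mathrm{gain}(\langle t', y\rangle) \le \mu(t)\ \text{for every supertree } t' \supseteq t .
```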
18
Relation to SVMs with Tree Kernel
19
Classification algorithm
Modeled as a linear classifier over subtree features
w_t: weight of tree t
-b: bias (the default class label)
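A hedged sketch of applying the boosted model as such a linear classifier (the sign of the weighted vote of the selected rules; grouping per-rule contributions by subtree yields the weights w_t, and the bias handling here is my simplification):

```python
from typing import List, Tuple


def classify(model: List[Tuple[float, Tree, int]], x: Tree, bias: float = 0.0) -> int:
    """Sign of the weighted vote: sum of alpha * h_<t,y>(x) over selected rules."""
    score = sum(alpha * stump(t, y)(x) for alpha, t, y in model) + bias
    return 1 if score >= 0 else -1
```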
20
SVMs and Tree Kernel [Collins 02]
  • Tree Kernel: all subtrees are expanded implicitly
  • Feature spaces are essentially the same
  • Learning strategies are different
[Figure: an example tree mapped to its implicit subtree-indicator feature vector (0, ..., 1, 1, ..., 1, ..., 1, ..., 1, ..., 0, ...)]
21
SVM vs. Boosting [Rätsch 01]
  • Both are known as Large Margin Classifiers
  • The metric of the margin is different
  • SVM: L2-norm margin
  • Boosting: L1-norm margin
  • w is expressed with a small number of features
  • → a sparse solution in the feature space (see the note below)
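For concreteness, a hedged note on the two margins (these are the standard definitions behind the comparison, not text from the slide): for a linear scorer w·φ(x),

```latex
\text{SVM margin: } \min_i \frac{y_i\, \mathbf{w}\cdot\phi(x_i)}{\lVert \mathbf{w} \rVert_2},
\qquad
\text{Boosting margin: } \min_i \frac{y_i\, \mathbf{w}\cdot\phi(x_i)}{\lVert \mathbf{w} \rVert_1}.
```

Maximizing the L1-normalized margin favors weight vectors with few nonzero entries, which is why the learned rule set stays small.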

22
SVM vs. Boosting, cont.
  • Accuracy is task-dependent
  • Practical advantages of Boosting
  • Good interpretability
  • Can analyze how the model performs or what kinds
    of features are useful
  • Compact features (rules) are easy to deal with
  • Fast classification
  • Classification cost depends only on the small number of rules
  • Kernel methods are computationally heavy

23
Experiments
24
Sentence classification
  • PHS: cell-phone review classification (5,741
    sentences)
  • domain: a Web-based BBS about PHS, a kind of cell
    phone
  • categories: positive review or negative review
  • MOD: modality identification (1,710 sentences)
  • domain: editorial news articles
  • categories: assertion, opinion, or description

positive: It is useful that we can know the date and time of E-Mails.
negative: I feel that the response is not so good.
assertion: We should not hold an optimistic view of the success of POKEMON.
opinion: I think that now is the best time for developing the blueprint.
description: The social function of education has been changing.
25
Sentence representations
  • N-gram tree
  • each word simply modifies the next word
  • subtree is an N-gram (N is unrestricted)
  • dependency tree
  • word-based dependency tree
  • A Japanese dependency parser, CaboCha, is used
  • bag-of-words (baseline)

[Figure: the sentence "response is very good" drawn both as an N-gram (chain) tree and as a word-based dependency tree; a sketch of the N-gram tree follows below]
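A hedged sketch of the N-gram-tree representation described above: each word modifies the next, so the sentence becomes a chain whose connected subtrees are exactly the N-grams (the helper name and the rooting at the first word are my choices and do not affect which chains occur as subtrees):

```python
from typing import List


def ngram_tree(words: List[str]) -> Tree:
    """Chain tree in which each word's node has the next word as its only child,
    e.g. ngram_tree(["response", "is", "very", "good"])."""
    root = node = Tree(words[0])
    for w in words[1:]:
        child = Tree(w)
        node.children.append(child)
        node = child
    return root
```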
26
Results
  • outperforms the baseline (BOW)
  • dep vs. n-gram: comparable (no significant
    difference)
  • SVMs show worse performance depending on the task
  • overfitting

27
Interpretability
PHS dataset with dependency
A: subtrees that include "hard, difficult"
   0.0004  be hard to hang up
  -0.0006  be hard to read
  -0.0007  be hard to use
  -0.0017  be hard to

B: subtrees that include "use"
   0.0027  want to use
   0.0002  use
   0.0002  be in use
   0.0001  be easy to use
  -0.0001  was easy to use
  -0.0007  be hard to use
  -0.0019  is easier to use than ...

C: subtrees that include "recharge"
   0.0028  recharging time is short
  -0.0041  recharging time is long
28
Interpretability, cont.
PHS dataset with dependency
Input: "The LCD is large, beautiful and easy to see"

  weight w   subtree t
   0.00368   be easy to
   0.00353   beautiful
   0.00237   be easy to see
   0.00174   is large
   0.00107   The LCD is large
   0.00074   The LCD is
   0.00057   The LCD
   0.00036   see
  -0.00001   large
29
Advantages
  • Compact feature set
  • Boosting extracts only 1,783 unique features
  • The numbers of distinct 1-grams, 2-grams, and
    3-grams are 4,211, 24,206, and 43,658,
    respectively
  • SVMs implicitly use a huge number of features
  • Fast classification
  • Boosting: 0.531 sec. for 5,741 instances
  • SVM: 255.42 sec. for 5,741 instances
  • Boosting is about 480 times faster than SVMs

30
Conclusions
  • Assume that text is represented as a tree
  • Extension of decision stumps
  • all subtrees are potentially used as features
  • Boosting
  • Branch and bound
  • makes it possible to find the optimal rule efficiently
  • Advantages
  • good interpretability
  • fast classification
  • comparable accuracy to SVMs with kernels

31
Future work
  • Other applications
  • Information extraction
  • semantic-role labeling
  • parse tree re-ranking
  • Confidence-rated predictions for decision stumps

32
Thank you!
  • An implementation of our method is available as
    open-source software at
  • http://chasen.naist.jp/taku/software/bact/