1
A Boosting Algorithm for Classification of
Semi-Structured Text
  • Taku Kudo
  • Yuji Matsumoto
  • Nara Institute of Science and Technology
  • Currently, NTT Communication Science Labs.

2
Background
  • Text classification using machine learning
  • categories: topics (sports, finance, politics)
  • features: bag-of-words (BOW)
  • methods: SVM, Boosting, Naïve Bayes
  • Changes in categories
  • modalities, subjectivities, or sentiments
  • Changes in text size
  • document (large) → passage, sentence (small)
  • Our claim: BOW is not sufficient

3
Background, cont.
  • Straightforward extensions
  • Add structural features, e.g., fixed-length
    N-grams or fixed-length syntactic relations
  • But
  • ad hoc and task-dependent
  • requires careful feature selection
  • How to determine the optimal size (length)?
  • Using larger substructures is inefficient
  • Using smaller substructures is the same as BOW

4
Our approach
  • Semi-structured text
  • assume that text is represented as a tree
  • word sequence, dependency tree, base phrases, XML
  • Propose a new ML algorithm that automatically
    captures relevant substructures in
    semi-structured text
  • Characteristics
  • an instance is not a numerical vector but a tree
  • all subtrees are used as features, without any
    constraints
  • a compact and relevant feature set is
    selected automatically

5
Classifier for Trees
6
Tree classification problem
  • Goal
  • Induce a mapping f(x): X → {+1, -1} from given
    training data
  • Training data
  • A set T of pairs of a tree x and a class label y
    (+1 or -1)

[Figure: example training data T — four trees with nodes labeled a, b, c, d, each paired with a class label +1 or -1]
7
Labeled ordered tree, subtree
  • Labeled ordered tree (or simply tree)
  • labeled
  • each node is associated with a label
  • ordered
  • siblings are ordered
  • Subtree
  • preserves parent-daughter relation
  • preserves sibling relation
  • preserves the label

[Figure: two example trees A and B over the labels a, b, c, d]
B is a subtree of A; A is a supertree of B
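To make the data structure concrete, here is a minimal sketch of a labeled ordered tree and of the subtree relation defined above (the class name, the greedy matching routine, and the function names are my own illustration, not the authors' code):

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tree:
    """A labeled ordered tree: a node label plus an ordered list of children."""
    label: str
    children: List["Tree"] = field(default_factory=list)


def matches_at(t: Tree, node: Tree) -> bool:
    """True if t can be embedded at `node`: the labels agree and t's children
    map, in order, onto an order-preserving subsequence of node's children,
    each embedded recursively -- i.e., labels, parent-daughter relations, and
    sibling order are all preserved, as required on the slide."""
    if t.label != node.label:
        return False
    i = 0  # scan position among node's children
    for child in t.children:
        while i < len(node.children) and not matches_at(child, node.children[i]):
            i += 1
        if i == len(node.children):
            return False
        i += 1
    return True


def is_subtree(t: Tree, x: Tree) -> bool:
    """True if t occurs as a subtree somewhere in x."""
    return matches_at(t, x) or any(is_subtree(t, c) for c in x.children)
```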
8
Decision stumps for trees
  • A simple rule-based classifier
  • ⟨t, y⟩ is a parameter (rule) of a decision stump
  • the stump answers h_⟨t,y⟩(x) = y if t is a subtree of x,
    and -y otherwise
[Figure: an input tree x and two example rules — ⟨t1, +1⟩, whose tree t1 occurs in x, and ⟨t2, -1⟩, whose tree t2 does not; so h_⟨t1,+1⟩(x) = 1 and h_⟨t2,-1⟩(x) = 1]
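A minimal sketch of this stump, building on the Tree/is_subtree sketch above (again my illustration, not the authors' code):

```python
def stump(t: Tree, y: int):
    """Decision stump with rule <t, y>: returns y when t is a subtree of the
    input tree, and -y otherwise."""
    def h(x: Tree) -> int:
        return y if is_subtree(t, x) else -y
    return h
```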
9
Decision stumps for trees, cont.
  • Training: select the optimal rule ⟨t, y⟩ that
    maximizes the gain (or accuracy)
  • F: the feature set (the set of all subtrees)

10
Decision stumps for trees, cont.
[Figure: the four training trees from slide 6 with their labels ±1, and the gain obtained by a candidate rule ⟨t, y⟩ on them; a sketch of the computation follows below]
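As a hedged illustration of this training step, the sketch below scores rules by the weighted gain Σ_i d_i · y_i · h_⟨t,y⟩(x_i) and picks the best one from an explicitly enumerated candidate set (uniform weights d_i recover plain decision stumps; the function names and signatures are assumptions of mine):

```python
from typing import Iterable, List, Tuple


def gain(t: Tree, y: int, data: List[Tuple[Tree, int]], d: List[float]) -> float:
    """Weighted gain of rule <t, y>: sum_i d_i * y_i * h_<t,y>(x_i)."""
    h = stump(t, y)
    return sum(d_i * y_i * h(x_i) for (x_i, y_i), d_i in zip(data, d))


def best_rule(candidates: Iterable[Tree],
              data: List[Tuple[Tree, int]],
              d: List[float]) -> Tuple[Tree, int]:
    """Brute-force search over an explicit feature set F; the actual algorithm
    avoids this enumeration (see the later slides on rightmost extension)."""
    return max(((t, y) for t in candidates for y in (+1, -1)),
               key=lambda rule: gain(rule[0], rule[1], data, d))
```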
11
Boosting
  • Decision stumps alone are too weak
  • Boosting [Schapire 97]
  • 1. build a weak learner (a decision stump) H_j
  • 2. re-weight the instances with respect to the error rate
  • repeat steps 1 and 2 K times
  • output a linear combination of H_1 ... H_K
  • Redefine the gain to use Boosting (see the sketch below)
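A hedged sketch of the boosting loop around the stump learner, using standard AdaBoost-style re-weighting (the paper's exact update may differ in detail; everything here builds on the earlier sketches):

```python
import math
from typing import List, Tuple


def boost(data: List[Tuple[Tree, int]],
          candidates: List[Tree],
          K: int) -> List[Tuple[float, Tree, int]]:
    """Returns (alpha, t, y) triples whose weighted vote classifies a tree."""
    L = len(data)
    d = [1.0 / L] * L                                   # instance weights
    model = []
    for _ in range(K):
        t, y = best_rule(candidates, data, d)           # weak learner
        h = stump(t, y)
        err = sum(d_i for (x_i, y_i), d_i in zip(data, d) if h(x_i) != y_i)
        err = min(max(err, 1e-10), 1 - 1e-10)           # guard against 0 or 1
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, t, y))
        # re-weighting: misclassified instances get larger weights
        d = [d_i * math.exp(-alpha * y_i * h(x_i))
             for (x_i, y_i), d_i in zip(data, d)]
        z = sum(d)
        d = [d_i / z for d_i in d]
    return model
```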

12
Efficient Computation
13
How to find the optimal rule?
  • F is too huge to be enumerated explicitly
  • Need to find the optimal rule efficiently

14
Rightmost extension [Asai 02, Zaki 02]
  • extend a given tree of size (n-1) by adding a new
    node, to obtain trees of size n
  • the new node is attached to a node on the rightmost path
  • the new node is added as the rightmost sibling
15
Rightmost extension, cont.
  • Recursive application of rightmost extension
    creates a search space over all subtrees (sketched below)
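A hedged sketch of one rightmost-extension step on the Tree structure from above (the helper names and the deep-copy approach are mine; the real miner also tracks where each candidate occurs in the data instead of re-scanning):

```python
import copy
from typing import List


def rightmost_path(tree: Tree) -> List[Tree]:
    """Nodes on the rightmost path, from the root downwards."""
    path = [tree]
    while path[-1].children:
        path.append(path[-1].children[-1])
    return path


def rightmost_extensions(tree: Tree, labels: List[str]) -> List[Tree]:
    """All trees of size n obtained from a tree of size n-1 by attaching one
    new node, with any label, as the rightmost child of a node on the
    rightmost path."""
    out = []
    for depth in range(len(rightmost_path(tree))):
        for label in labels:
            new = copy.deepcopy(tree)
            node = new
            for _ in range(depth):          # walk down the copy's rightmost path
                node = node.children[-1]
            node.children.append(Tree(label))
            out.append(new)
    return out
```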

16
Pruning
  • For every supertree t' of t, propose an
    upper bound μ(t) such that gain(⟨t', y⟩) ≤ μ(t)
  • Can prune the node t if μ(t) < τ,
  • where τ is a suboptimal gain (the best gain found so far)

17
Upper bound of the gain (an extension of
[Morishita 02])
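The bound itself did not survive the transcript; as a hedged reconstruction, a Morishita-style bound for this setting has the following shape, with d_i the boosting weights (the exact form should be checked against the paper):

```latex
\mu(t) = \max\Bigl(
  2\sum_{\{i \mid y_i = +1,\, t \subseteq x_i\}} d_i - \sum_{i=1}^{L} y_i d_i,\;
  2\sum_{\{i \mid y_i = -1,\, t \subseteq x_i\}} d_i + \sum_{i=1}^{L} y_i d_i
\Bigr),
\qquad
\mathrm{gain}(\langle t', y\rangle) \le \mu(t)\ \text{for every supertree } t' \supseteq t .
```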
18
Relation to SVMs with Tree Kernel
19
Classification algorithm
Modeled as a linear classifier over subtree features
w_t: weight of tree t
-b: bias (the default class label)
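A hedged sketch of applying the boosted model as such a linear classifier (the sign of the weighted vote of the selected rules; grouping per-rule contributions by subtree yields the weights w_t, and the bias handling here is my simplification):

```python
from typing import List, Tuple


def classify(model: List[Tuple[float, Tree, int]], x: Tree, bias: float = 0.0) -> int:
    """Sign of the weighted vote: sum of alpha * h_<t,y>(x) over selected rules."""
    score = sum(alpha * stump(t, y)(x) for alpha, t, y in model) + bias
    return 1 if score >= 0 else -1
```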
20
SVMs and Tree Kernel [Collins 02]
  • Tree Kernel: all subtrees are expanded implicitly
  • Feature spaces are essentially the same
  • Learning strategies are different
[Figure: an example tree mapped to its implicit subtree-indicator feature vector (0, ..., 1, 1, ..., 1, ..., 1, ..., 1, ..., 0, ...)]
21
SVM vs. Boosting [Rätsch 01]
  • Both are known as Large Margin Classifiers
  • The metric of the margin is different
  • SVM: L2-norm margin
  • Boosting: L1-norm margin
  • w is expressed with a small number of features
  • → a sparse solution in the feature space (see the note below)
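For concreteness, a hedged note on the two margins (these are the standard definitions behind the comparison, not text from the slide): for a linear scorer w·φ(x),

```latex
\text{SVM margin: } \min_i \frac{y_i\, \mathbf{w}\cdot\phi(x_i)}{\lVert \mathbf{w} \rVert_2},
\qquad
\text{Boosting margin: } \min_i \frac{y_i\, \mathbf{w}\cdot\phi(x_i)}{\lVert \mathbf{w} \rVert_1}.
```

Maximizing the L1-normalized margin favors weight vectors with few nonzero entries, which is why the learned rule set stays small.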

22
SVM vs. Boosting, cont.
  • Accuracy is task-dependent
  • Practical advantages of Boosting
  • Good interpretability
  • Can analyze how the model performs or what kinds
    of features are useful
  • Compact features (rules) are easy to deal with
  • Fast classification
  • Classification cost depends only on the small number of rules
  • Kernel methods are computationally heavy

23
Experiments
24
Sentence classification
  • PHS: cell-phone review classification (5,741
    sentences)
  • domain: a Web-based BBS about PHS, a kind of cell
    phone
  • categories: positive review or negative review
  • MOD: modality identification (1,710 sentences)
  • domain: editorial news articles
  • categories: assertion, opinion, or description

positive: It is useful that we can know the date and time of E-Mails.
negative: I feel that the response is not so good.
assertion: We should not hold an optimistic view of the success of POKEMON.
opinion: I think that now is the best time for developing the blueprint.
description: The social function of education has been changing.
25
Sentence representations
  • N-gram tree
  • each word simply modifies the next word
  • subtree is an N-gram (N is unrestricted)
  • dependency tree
  • word-based dependency tree
  • A Japanese dependency parser, CaboCha, is used
  • bag-of-words (baseline)

[Figure: the sentence "response is very good" drawn both as an N-gram (chain) tree and as a word-based dependency tree; a sketch of the N-gram tree follows below]
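A hedged sketch of the N-gram-tree representation described above: each word modifies the next, so the sentence becomes a chain whose connected subtrees are exactly the N-grams (the helper name and the rooting at the first word are my choices and do not affect which chains occur as subtrees):

```python
from typing import List


def ngram_tree(words: List[str]) -> Tree:
    """Chain tree in which each word's node has the next word as its only child,
    e.g. ngram_tree(["response", "is", "very", "good"])."""
    root = node = Tree(words[0])
    for w in words[1:]:
        child = Tree(w)
        node.children.append(child)
        node = child
    return root
```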
26
Results
  • outperforms the baseline (BOW)
  • dep vs. n-gram: comparable (no significant
    difference)
  • SVMs show worse performance depending on the task
  • overfitting

27
Interpretability
PHS dataset with dependency
A: subtrees that include "hard, difficult"
   0.0004  be hard to hang up
  -0.0006  be hard to read
  -0.0007  be hard to use
  -0.0017  be hard to

B: subtrees that include "use"
   0.0027  want to use
   0.0002  use
   0.0002  be in use
   0.0001  be easy to use
  -0.0001  was easy to use
  -0.0007  be hard to use
  -0.0019  is easier to use than ...

C: subtrees that include "recharge"
   0.0028  recharging time is short
  -0.0041  recharging time is long
28
Interpretability, cont.
PHS dataset with dependency
Input: "The LCD is large, beautiful and easy to see"

  weight w   subtree t
   0.00368   be easy to
   0.00353   beautiful
   0.00237   be easy to see
   0.00174   is large
   0.00107   The LCD is large
   0.00074   The LCD is
   0.00057   The LCD
   0.00036   see
  -0.00001   large
29
Advantages
  • Compact feature set
  • Boosting extracts only 1,783 unique features
  • The numbers of distinct 1-grams, 2-grams, and
    3-grams are 4,211, 24,206, and 43,658,
    respectively
  • SVMs implicitly use a huge number of features
  • Fast classification
  • Boosting: 0.531 sec. for 5,741 instances
  • SVM: 255.42 sec. for 5,741 instances
  • Boosting is about 480 times faster than SVMs

30
Conclusions
  • Assume that text is represented as a tree
  • Extension of decision stumps
  • all subtrees are potentially used as features
  • Boosting
  • Branch and bound
  • makes it possible to find the optimal rule efficiently
  • Advantages
  • good interpretability
  • fast classification
  • comparable accuracy to SVMs with kernels

31
Future work
  • Other applications
  • Information extraction
  • semantic-role labeling
  • parse tree re-ranking
  • Confidence-rated predictions for decision stumps

32
Thank you!
  • An implementation of our method is available as
    open-source software at
  • http://chasen.naist.jp/taku/software/bact/