Title: Efficient Text Categorization with a Large Number of Categories
1. Efficient Text Categorization with a Large Number of Categories
- Rayid Ghani
- KDD Project Proposal
2. Text Categorization
3. How do people deal with a large number of classes?
- Use fast multiclass algorithms (Naïve Bayes)
- Builds one model per class
- Use binary classification algorithms (SVMs) and break an n-class problem into n binary problems
- What happens with a 1000-class problem?
- Can we do better?
4. ECOC to the Rescue!
- An n-class problem can be solved by solving log2(n) binary problems
- More efficient than one-per-class
- Does it actually perform better?
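The efficiency claim is easy to check with a quick calculation; the numbers below are illustrative for the 1000-class case mentioned above:

```python
import math

n_classes = 1000

# One-per-class (one-vs-rest) trains one binary classifier per class.
one_per_class = n_classes

# The shortest ECOC that still gives every class a distinct codeword
# has ceil(log2(n)) bits, i.e. that many binary problems.
ecoc_minimum = math.ceil(math.log2(n_classes))

print(one_per_class, ecoc_minimum)  # 1000 vs. 10 binary problems
```

In practice longer codes are used for their error-correcting power, but even a 63-bit code is far cheaper than 1000 one-vs-rest classifiers.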
5. What is ECOC?
- Solve multiclass problems by decomposing them into multiple binary problems (Dietterich & Bakiri 1995)
- Use a learner to learn the binary problems
6. Training and Testing ECOC

     f1 f2 f3 f4 f5
A     0  0  1  1  0
B     1  0  1  0  0
C     0  1  1  1  0
D     0  1  0  0  1

X     1  1  1  1  0
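Training can be sketched as follows: each column f_j defines a binary relabeling of the data, on which any binary learner is trained. The code matrix is the one reconstructed from the slide; the documents are toy placeholders.

```python
# Code matrix reconstructed from the slide: one codeword (row) per class.
code = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}

# Toy training documents (placeholders for real labeled text).
docs = [("doc about A", "A"), ("doc about B", "B"),
        ("doc about C", "C"), ("doc about D", "D")]

# Binary problem f_j: relabel every document with bit j of its
# class's codeword, then train any binary learner on the result.
binary_training_sets = [
    [(text, code[label][j]) for text, label in docs]
    for j in range(5)
]

# For f1, only class B's documents get the label 1:
print(binary_training_sets[0])
```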
7. ECOC - Picture

     f1 f2 f3 f4 f5
A     0  0  1  1  0
B     1  0  1  0  0
C     0  1  1  1  0
D     0  1  0  0  1
11. This Proposal

[Figure: efficiency vs. classification performance for NB and ECOC (as used in Berger 99), with preliminary results]

- Preliminary results: ECOC reduces the error of the Naïve Bayes classifier by 66% with NO increase in computational cost
12. Proposed Solutions
- Design codewords that minimize cost and maximize performance
- Investigate the assignment of codewords to classes
- Learn the decoding function
- Incorporate unlabeled data into ECOC
13. Use unlabeled data with a large number of classes
- How?
- Use EM
- Mixed Results
- Think Again!
- Use Co-Training
- Disastrous Results
- Think one more time
14. Use Unlabeled Data
- Current learning algorithms that use unlabeled data (EM, Co-Training) don't work well with a large number of categories
- ECOC works great with a large number of classes, but there is no framework for using unlabeled data
15. Use Unlabeled Data
- ECOC decomposes multiclass problems into binary problems
- Co-Training works great with binary problems
- ECOC + Co-Training: learn each binary problem in ECOC with Co-Training
- Preliminary results: not so great! (very sensitive to the initial labeled documents)
16. What Next?
- Use an improved version of co-training (gradient descent)
- Less prone to random fluctuations
- Uses all unlabeled data at every iteration
- Use Co-EM (Nigam & Ghani 2000) - a hybrid of EM and Co-Training
17. Work Plan
- Collect datasets - ?
- Codeword assignment - 2 weeks
- Learning the decoding - 1-2 weeks
- Using unlabeled data - 2 weeks
- Design codes - 2 weeks
- Project write-up - 1 week
18. Summary
- Use ECOC for efficient text classification with a large number of categories
- Reduce code length without sacrificing performance
- Fix code length and increase performance
- Generalize to domain-independent classification tasks involving a large number of categories
19. Testing ECOC
- To test a new instance:
- Apply each of the n classifiers to the new instance
- Combine the predictions to obtain a binary string (codeword) for the new point
- Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure)
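The three testing steps above can be sketched as a small decoder. The code matrix is the 4-class, 5-bit example from the earlier picture, and the classifiers' combined output is stubbed with the bit string from that slide:

```python
# Code matrix: one codeword per class (from the earlier picture).
code = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}

def hamming(a, b):
    # Number of bit positions where the two strings disagree.
    return sum(x != y for x, y in zip(a, b))

def decode(predicted_bits):
    # Classify to the class with the nearest codeword.
    return min(code, key=lambda cls: hamming(code[cls], predicted_bits))

# Steps 1-2: each binary classifier contributes one bit; combined
# they give this codeword for the new point X.
predicted = [1, 1, 1, 1, 0]
print(decode(predicted))  # "C" (Hamming distance 1; A and B are at 2)
```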
20. The Decoding Step
- Standard: map to the nearest codeword according to Hamming distance
- Can we do better?
21. The Real Question
- The tradeoff between the learnability of the binary problems and the error-correcting power of the code
22. Codeword Assignment
- Standard procedure: assign codewords to classes randomly
- Can we do better?
23. Goal of Current Research
- Improve classification performance without increasing cost
- Develop algorithms that increase performance without affecting code length
- Design short codes that perform well
24. Previous Results
- Performance increases with the length of the code
- Gives the same percentage increase in performance over NB regardless of training set size
- BCH codes > Random codes > Hand-constructed codes
25. Others have shown that ECOC
- Works great with arbitrarily long codes
- Longer codes -> more error-correcting power -> better performance
- Longer codes -> more computational cost
26. ECOC to the Rescue!
- An n-class problem can be solved by solving log2(n) binary problems
- More efficient than one-per-class
- Does it actually perform better?
27. Previous Results (Ghani 2000)

Industry Sector Data Set (accuracy, %):

    Naïve Bayes   Shrinkage(1)   ME(2)   ME w/ Prior(3)   ECOC 63-bit
    66.1          76             79      81.1             88.5

ECOC reduces the error of the Naïve Bayes classifier by 66% with no increase in computational cost.

1. (McCallum et al. 1998)   2, 3. (Nigam et al. 1999)
28. Design Codewords
- Maximize performance (accuracy, precision, recall, F1?)
- Minimize the length of the codes
- Search the space of codewords through gradient descent on an objective G(Error, Code_Length)
29. Codeword Assignment
- Generate the confusion matrix and use it to assign the most confusable classes the codewords that are farthest apart
- Pros
- Focusing more on confusable classes can help
- Cons
- Individual binary problems can be very hard
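One way to realize this assignment is to score candidate assignments by how far apart they place confusable pairs. A minimal sketch, with hypothetical confusion counts and the small code from the earlier slides (exhaustive search is only feasible for a handful of classes; a greedy or local-search heuristic would be needed at scale):

```python
from itertools import permutations

classes = ["A", "B", "C", "D"]
codewords = [(0, 0, 1, 1, 0), (1, 0, 1, 0, 0), (0, 1, 1, 1, 0), (0, 1, 0, 0, 1)]

# Hypothetical confusion counts: A/B and C/D are often confused.
confusion = {("A", "B"): 30, ("A", "C"): 2, ("A", "D"): 1,
             ("B", "C"): 3, ("B", "D"): 2, ("C", "D"): 25}

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def score(assignment):
    # Reward placing confusable pairs on codewords that are far apart.
    return sum(n * hamming(assignment[c1], assignment[c2])
               for (c1, c2), n in confusion.items())

# Try every assignment of codewords to classes and keep the best.
best = max((dict(zip(classes, perm)) for perm in permutations(codewords)),
           key=score)
print(best)
```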
30. The Decoding Step
- Weight the individual classifiers according to their training accuracies and do weighted majority decoding
- Pose the decoding as a separate learning problem and use regression / a neural network
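The first idea can be sketched as weighted decoding: a disagreement with a reliable classifier costs more than one with a noisy classifier, and plain Hamming decoding is the all-weights-equal special case. The accuracies below are hypothetical:

```python
# Code matrix from the earlier slides: one codeword per class.
code = {
    "A": [0, 0, 1, 1, 0],
    "B": [1, 0, 1, 0, 0],
    "C": [0, 1, 1, 1, 0],
    "D": [0, 1, 0, 0, 1],
}

# Hypothetical training accuracies of the five binary classifiers f1..f5.
accuracies = [0.90, 0.60, 0.80, 0.70, 0.95]

def weighted_distance(bits, codeword):
    # Sum the weights of the positions where prediction and codeword disagree.
    return sum(w for b, c, w in zip(bits, codeword, accuracies) if b != c)

def decode(bits):
    # Classify to the class with the smallest weighted disagreement.
    return min(code, key=lambda cls: weighted_distance(bits, code[cls]))

print(decode([1, 1, 1, 1, 0]))  # "C"
```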