
Some Recent Work

- ECOC for Text Classification
- Hybrids of EM & Co-Training (with Kamal Nigam)
- Learning to build a monolingual corpus from the web (with Rosie Jones)
- Effect of Smoothing on Naive Bayes for text classification (with Tong Zhang)
- Hypertext Categorization using link and extracted information (with Sean Slattery & Yiming Yang)

Using Error-Correcting Codes For Text Classification

- Rayid Ghani
- Center for Automated Learning and Discovery
- Carnegie Mellon University

This presentation can be accessed at http://www.cs.cmu.edu/~rayid/talks/

Outline

- Introduction to ECOC
- Intuition & Motivation
- Some Questions?
- Experimental Results
- Semi-Theoretical Model
- Types of Codes
- Drawbacks
- Conclusions

Introduction

- Decompose a multiclass classification problem into multiple binary problems
- One-Per-Class Approach (moderately expensive)
- All-Pairs (very expensive)
- Distributed Output Code (efficient, but what about performance?)
- Error-Correcting Output Codes (?)
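For a sense of scale: with the 105-class dataset used later in this talk, one-per-class trains 105 binary classifiers and all-pairs trains 105 x 104 / 2 = 5460, while a distributed output code needs only about ceil(log2(105)) = 7 bits to give every class a distinct codeword.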


Is it a good idea?

- Larger margin for error, since errors can now be corrected
- One-per-class is a code with minimum Hamming distance (HD) of 2
- Distributed codes have low HD
- The individual binary problems can be harder than before
- Useless unless number of classes > 5

Training ECOC

    f1 f2 f3 f4 f5
A    0  0  1  1  0
B    1  0  1  0  0
C    0  1  1  1  0
D    0  1  0  0  1

- Given m distinct classes:

1. Create an m x n binary matrix M.
2. Each class is assigned ONE row of M.
3. Each column of the matrix divides the classes into TWO groups.
4. Train the base classifiers to learn the n binary problems.

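A minimal sketch of these training steps, assuming a scikit-learn-style Naive Bayes base learner over bag-of-words counts; the matrix is the 4 x 5 example above, and X, y are illustrative placeholders for the training features and labels:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# The 4-class, 5-bit code matrix from the slide above.
classes = ["A", "B", "C", "D"]
M = np.array([[0, 0, 1, 1, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 1, 0, 0, 1]])  # one row per class, one column per bit

def train_ecoc(X, y, M, classes):
    """Train one binary base classifier per column of M.
    X: document features (e.g. word counts); y: class labels."""
    row = {c: i for i, c in enumerate(classes)}
    bit_clfs = []
    for j in range(M.shape[1]):
        # Column j relabels every document with the j-th bit of its
        # class's codeword, i.e. it splits the classes into two groups.
        y_bit = np.array([M[row[label], j] for label in y])
        bit_clfs.append(MultinomialNB().fit(X, y_bit))
    return bit_clfs
```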

Testing ECOC

- To test a new instance:
- Apply each of the n classifiers to the new instance
- Combine the predictions to obtain a binary string (codeword) for the new point
- Classify to the class with the nearest codeword (usually Hamming distance is used as the distance measure)
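Decoding, continuing the same sketch:

```python
def predict_ecoc(X, bit_clfs, M, classes):
    """Decode each test instance to the class with the nearest codeword."""
    # Each column of `bits` holds one bit classifier's 0/1 predictions.
    bits = np.column_stack([clf.predict(X) for clf in bit_clfs])
    preds = []
    for codeword in bits:
        hamming = (M != codeword).sum(axis=1)  # distance to every row of M
        preds.append(classes[int(hamming.argmin())])
    return preds
```

With the example matrix above, the test point X = 1 1 1 1 0 from the picture slides sits at Hamming distance 2, 2, 1, 4 from the codewords of A, B, C, D, so it decodes to class C.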


ECOC - Picture

    f1 f2 f3 f4 f5
A    0  0  1  1  0
B    1  0  1  0  0
C    0  1  1  1  0
D    0  1  0  0  1

X    1  1  1  1  0

- Single classifier learns a complex boundary once
- Ensemble learns a complex boundary multiple times
- ECOC learns a simple boundary multiple times

Questions?

- How well does it work?
- How long should the code be?
- Do we need a lot of training data?
- What kind of codes can we use?
- Are there intelligent ways of creating the code?

Previous Work

- Combine with Boosting: ADABOOST.OC (Schapire, 1997), (Guruswami & Sahai, 1999)
- Local Learners (Ricci & Aha, 1997)
- Text Classification (Berger, 1999)

Experimental Setup

- Generate the code
- BCH Codes
- Choose a Base Learner
- Naive Bayes Classifier as used in text classification tasks (McCallum & Nigam, 1998)
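For reference, scikit-learn ships a generic ECOC wrapper; note that it draws random codes rather than BCH codes, so this is only a comparable setup, not the talk's exact one. train_texts and train_labels are assumed placeholders:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.multiclass import OutputCodeClassifier
from sklearn.naive_bayes import MultinomialNB

# `train_texts` / `train_labels` are assumed to hold the documents and
# their class labels; vocabulary capped at 10,000 words as in the talk.
X = CountVectorizer(max_features=10000).fit_transform(train_texts)

# code_size is the ratio n/m: 0.6 * 105 classes ~ 63 bits.
ecoc = OutputCodeClassifier(MultinomialNB(), code_size=0.6, random_state=0)
ecoc.fit(X, train_labels)
```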

Dataset

- Industry Sector Dataset
- Consists of company web pages classified into 105 economic sectors
- Standard stoplist
- No stemming
- Skip all MIME headers and HTML tags
- Experimental approach similar to McCallum et al. (1998) for comparison purposes

Results

ECOC is 88% accurate!

Classification accuracies on five random 50-50 train-test splits of the Industry Sector dataset with a vocabulary size of 10,000.

Results

Industry Sector Data Set

Naïve Bayes   Shrinkage¹   ME²   ME w/ Prior³   ECOC 63-bit
66.1          76           79    81.1           88.5

ECOC reduces the error of the Naïve Bayes classifier by 66%

1. (McCallum et al., 1998)   2, 3. (Nigam et al., 1999)

The Longer the Better!

Table 2: Average classification accuracy on 5 random 50-50 train-test splits of the Industry Sector dataset with a vocabulary size of 10,000 words selected using Information Gain.

- Longer codes mean larger codeword separation
- The minimum Hamming distance of a code C is the smallest distance between any pair of distinct codewords in C
- If the minimum Hamming distance is h, then the code can correct ⌊(h-1)/2⌋ errors
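For example, the 63-bit code in the model below has minimum Hamming distance 31, so up to ⌊(31-1)/2⌋ = 15 of its 63 binary classifiers can be wrong on an instance and nearest-codeword decoding still recovers the correct class.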

Size Matters?

Size does NOT matter!

Semi-Theoretical Model

- Model ECOC by a Binomial Distribution B(n, p)
- n = length of the code
- p = probability of each bit being classified incorrectly
- (The table reports Pave = 1 - p, the average per-bit accuracy, and Emax = ⌊(Hmin-1)/2⌋, the number of correctable errors)

# of Bits   Hmin   Emax   Pave   Accuracy
15          5      2      .85    .59
15          5      2      .89    .80
15          5      2      .91    .84
31          11     5      .85    .67
31          11     5      .89    .91
31          11     5      .91    .94
63          31     15     .89    .99

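One plausible reading of this model as code: predicted accuracy is the binomial probability of at most Emax bit errors. The function name is ours, and the numbers land near, not exactly on, the table's values:

```python
from scipy.stats import binom

def predicted_accuracy(n_bits, h_min, p_ave):
    """P(at most floor((h_min - 1) / 2) of the n_bits bits are wrong),
    with per-bit error probability p = 1 - p_ave."""
    e_max = (h_min - 1) // 2
    return binom.cdf(e_max, n_bits, 1 - p_ave)

# In the same ballpark as the table (the slides do not spell out the
# exact calculation): ~0.78 for n=15, Pave=.89 and ~1.0 for n=63.
print(predicted_accuracy(15, 5, 0.89))
print(predicted_accuracy(63, 31, 0.89))
```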


[Figure: the binary partitions of four Newsgroups classes (Alt.atheism, Talk.misc.religion, Comp.os.windows, Comp.sys.ibm.hardware) induced by the columns of a code matrix]

Types of Codes

- Data-Independent: Algebraic, Random, Hand-Constructed
- Data-Dependent: Adaptive

What is a Good Code?

- Row Separation
- Column Separation (independence of errors for each binary classifier)
- Efficiency (for long codes)

Choosing Codes

            Random                       Algebraic
Row Sep     On average, for long codes   Guaranteed
Col Sep     On average, for long codes   Can be guaranteed
Efficiency  No                           Yes
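A small sketch for measuring both separations on a candidate matrix; only the example matrix is an assumption:

```python
import numpy as np
from itertools import combinations

def separation(M):
    """Min and max pairwise Hamming distances among rows and columns."""
    def dists(vecs):
        return [int((a != b).sum()) for a, b in combinations(vecs, 2)]
    rows, cols = dists(list(M)), dists(list(M.T))
    return min(rows), max(rows), min(cols), max(cols)

M = np.array([[0, 0, 1, 1, 0],   # the 4 x 5 example matrix from earlier
              [1, 0, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 1, 0, 0, 1]])
print(separation(M))  # (min row HD, max row HD, min col HD, max col HD)
```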

Experimental Results

Code            Min Row HD   Max Row HD   Min Col HD   Max Col HD   Error Rate
15-Bit BCH      5            15           49           64           20.6
19-Bit Hybrid   5            18           15           69           22.3
15-Bit Random   2 (1.5)      13           42           60           24.1

Interesting Questions?

- NB does not give good probability estimates; does using ECOC result in better estimates?
- Assignment of codewords to classes?
- Can decoding be posed as a supervised learning task?

Drawbacks

- Can be computationally expensive
- Random codes throw away the real-world nature of the data by picking random partitions to create artificial binary problems

Current Work

- Combine ECOC with Co-Training to use unlabeled data
- Automatically construct optimal / adaptive codes

Conclusion

- Performs well on text classification tasks
- Can be used when training data is sparse
- Algebraic codes perform better than random codes for a given code length
- Hand-constructed codes may not be the answer

Background

- Co-training seems to be the way to go when there is (and maybe even when there isn't) a feature split in the data
- Reported results on co-training only deal with very small (toy) problems, mostly binary classification tasks (Blum & Mitchell, 1998; Nigam & Ghani, 2000)

Co-Training Challenge

- Task: Apply co-training to a 65-class dataset containing 130,000 training examples
- Result: Co-training fails!

Solution?

- ECOC seems to work well when there are a large number of classes
- ECOC decomposes a multiclass problem into several binary problems
- Co-training works well with binary problems

Combine ECOC and Co-Training

Algorithm

- Learn each bit for ECOC using a co-trained classifier, as sketched below
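A hedged sketch of this idea, not the exact procedure from the talk: two Naive Bayes models, one per feature view, co-trained on a single bit. The confidence-based selection and the parameters rounds and k are illustrative choices, and the feature matrices are assumed to be sparse word counts:

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def cotrain_bit(X1, X2, y_bit, U1, U2, rounds=10, k=5):
    """Co-train two Naive Bayes models, one per feature view (e.g. title
    vs. description), on a single ECOC bit. Each round, each view picks
    its k most confident unlabeled examples; both views train on them."""
    for _ in range(rounds):
        c1 = MultinomialNB().fit(X1, y_bit)
        c2 = MultinomialNB().fit(X2, y_bit)
        if U1.shape[0] == 0:
            break
        p1, p2 = c1.predict_proba(U1), c2.predict_proba(U2)
        picked = np.unique(np.concatenate([
            np.argsort(p1.max(axis=1))[-k:],   # view 1's confident picks
            np.argsort(p2.max(axis=1))[-k:],   # view 2's confident picks
        ]))
        # Label each pick by whichever view is more confident about it.
        use1 = p1[picked].max(axis=1) >= p2[picked].max(axis=1)
        labels = np.where(use1,
                          c1.classes_[p1[picked].argmax(axis=1)],
                          c2.classes_[p2[picked].argmax(axis=1)])
        X1, X2 = vstack([X1, U1[picked]]), vstack([X2, U2[picked]])
        y_bit = np.concatenate([y_bit, labels])
        keep = np.setdiff1d(np.arange(U1.shape[0]), picked)
        U1, U2 = U1[keep], U2[keep]
    return c1, c2
```

Running this once per column of the code matrix yields a pair of co-trained classifiers per bit; their bit predictions (e.g. a vote between the two views) can then be decoded with nearest-codeword matching as in the earlier sketch.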

Dataset (Job Descriptions)

- 65 classes
- 32000 examples
- Two feature sets
- Title
- Description


Results

- 10% train, 50% unlabeled, 40% test
- NB: 40.3%
- ECOC: 48.9%
- EM: 30.83%
- CoTraining
- ECOC-EM
- ECOC-Cotrain
- ECOC-CoEM