Evaluation of Decision Forests on Text Categorization - PowerPoint PPT Presentation

1
Evaluation of Decision Forests on Text
Categorization
2
Text Categorization
  • Text Collection
  • Feature Extraction
  • Classification
  • Evaluation

3
Text Collection
  • Reuters
  • Newswires from Reuters in 1987
  • Training set 9603
  • Test set 3299
  • Categories 95
  • OHSUMED
  • Abstracts from medical journals
  • Training set 12327
  • Test set 3616
  • Categories 75 (within Heart Disease subtree)

4
Feature Extraction
  • Stop Word Removal
  • 430 stop words
  • Stemming
  • Porter's stemmer
  • Term Selection
  • by Document Frequency
  • Category independent selection
  • Category dependent selection
  • Feature Extraction
  • TF × IDF
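The extraction steps above can be sketched end to end. The stop-word list below is a toy stand-in for the deck's 430-word list and Porter stemming is omitted (noted in a comment), but the TF × IDF weighting follows the standard definition:

```python
import math
from collections import Counter

# Toy stand-in for the deck's 430-word stop list.
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}

def tokenize(text):
    # Lowercase and drop stop words; a real pipeline would also apply
    # Porter stemming to each surviving token here.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one sparse {term: tf * idf} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    return [
        {t: tf[t] * math.log(n / df[t]) for t in tf}
        for tf in (Counter(doc) for doc in tokenized)
    ]

vectors = tf_idf(["the price of oil rose",
                  "oil exports fell",
                  "the court ruled"])
```

Each document becomes a sparse {term: weight} map; terms occurring in every document get weight zero, mirroring the IDF intuition.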

5
Classification
  • Method
  • Each document may belong to multiple categories
  • Treating each category as a separate
    classification problem
  • Binary classification
  • Classifiers
  • kNN (k Nearest Neighbor)
  • C4.5 (Quinlan)
  • Decision Forest
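Treating each category as its own binary problem can be sketched as follows; the documents and their category labels are hypothetical toy data standing in for Reuters topics:

```python
# Multi-label corpus reduced to one binary problem per category.
doc_labels = [
    {"grain", "wheat"},   # categories of document 0
    {"oil"},              # document 1
    {"grain"},            # document 2
]
categories = sorted(set().union(*doc_labels))

def binary_targets(category):
    # 1 if the document belongs to the category, else 0.
    return [1 if category in labels else 0 for labels in doc_labels]

# One independent binary classification problem per category.
problems = {c: binary_targets(c) for c in categories}
```

Each classifier (kNN, C4.5, Decision Forest) is then trained once per category on its 0/1 targets.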

6
C4.5
  • A method to build decision trees
  • Training
  • Grow the tree by splitting the data set
  • Prune the tree back to prevent over-fitting
  • Testing
  • Test vector goes down the tree and arrives at a
    leaf.
  • Probability that the vector belongs to each
    category is estimated.
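The probability estimate at a leaf can be sketched as below. C4.5-style implementations often apply a Laplace correction when turning leaf counts into probabilities; the exact smoothing here is an assumed common choice, not taken from the deck:

```python
# Probability that a vector reaching a leaf belongs to the positive
# class, estimated from the training examples that fell into that leaf.
# The Laplace-style smoothing is an assumed (common) choice.
def leaf_probability(positives, total, smoothing=1, classes=2):
    return (positives + smoothing) / (total + smoothing * classes)
```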

7
Decision Forest
  • Consisting of many decision trees combined
    by averaging the class probability estimates at
    the leaves.
  • Each tree is constructed in a randomly chosen
    (coordinate) subspace of the feature space.
  • An oblique hyperplane is used as a discriminator
    at each internal node of the trees.
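A minimal sketch of the random-subspace idea, with two simplifications: each tree is reduced to a one-level stump, and the split is axis-aligned at the feature mean rather than an oblique hyperplane. Only the subspace sampling and the averaging of class-probability estimates mirror the method described:

```python
import random

class Stump:
    # Stand-in for a full decision tree: one split on one coordinate.
    def fit(self, X, y, dims):
        self.dim = random.choice(dims)        # coordinate from the subspace
        vals = [x[self.dim] for x in X]
        self.thresh = sum(vals) / len(vals)   # axis-aligned split at the mean
        left = [yi for x, yi in zip(X, y) if x[self.dim] <= self.thresh]
        right = [yi for x, yi in zip(X, y) if x[self.dim] > self.thresh]
        # Leaf probability estimate = fraction of positives at the leaf.
        self.p_left = sum(left) / len(left) if left else 0.5
        self.p_right = sum(right) / len(right) if right else 0.5
        return self

    def predict_proba(self, x):
        return self.p_left if x[self.dim] <= self.thresh else self.p_right

def forest_proba(X, y, x, n_trees=25, subspace=2, seed=0):
    # Average the leaf probability estimates of trees built in
    # randomly chosen coordinate subspaces.
    random.seed(seed)
    n_dims = len(X[0])
    probs = []
    for _ in range(n_trees):
        dims = random.sample(range(n_dims), subspace)  # random subspace
        probs.append(Stump().fit(X, y, dims).predict_proba(x))
    return sum(probs) / len(probs)
```

A real decision forest would train the trees once and reuse them across queries; the sketch refits per query only to keep the code short.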

8
Why choose these 3 classifiers?
  • We do not have a parametric model for the problem
    (we cannot assume Gaussian distributions etc.)
  • kNN and decision trees (C4.5) are the most popular
    nonparametric classifiers. We use them as
    baselines for comparison
  • We expect decision forest to do well since we
    have a high dimensional problem for which it is
    known to do well from previous studies

9
Evaluation
  • Measurements (a = true positives, b = false
    positives, c = false negatives)
  • Precision p = a / (a + b)
  • Recall r = a / (a + c)
  • F1 value F1 = 2rp / (r + p)
  • Tradeoff between Precision and Recall
  • kNN tends to have higher precision than recall,
    especially when k becomes larger.
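With a = true positives, b = false positives, and c = false negatives for one category, the three measurements compute as:

```python
# Precision, recall, and F1 from the contingency counts of one category:
# a = true positives, b = false positives, c = false negatives.
def f1_scores(a, b, c):
    p = a / (a + b) if a + b else 0.0    # precision
    r = a / (a + c) if a + c else 0.0    # recall
    f1 = 2 * r * p / (r + p) if r + p else 0.0
    return p, r, f1
```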

10
Averaging scores
  • Macro-averaging
  • Calculate precision/recall for each category
  • Average all the precision/recall values
  • Assign equal weight to each category
  • Micro-averaging
  • Sum up classification decision of each document
  • Calculate precision/recall from the summations
  • Assign equal weight to each document
  • This was used in experiment because the number of
    documents in each category varies considerably.
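The two averaging schemes can be contrasted on hypothetical per-category counts; note how micro-averaging lets the large, accurate category dominate, which is why it suits corpora with very uneven category sizes:

```python
# Macro- vs. micro-averaged precision over three hypothetical
# categories; each tuple is (a, b) = (true positives, false positives).
cats = [(90, 10), (1, 1), (2, 2)]

# Macro: average the per-category precisions (equal weight per category).
macro_p = sum(a / (a + b) for a, b in cats) / len(cats)

# Micro: pool the counts first (equal weight per document decision).
micro_p = sum(a for a, _ in cats) / sum(a + b for a, b in cats)
```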

11
Performance in F1 Value
12
Comparison between Classifiers
  • Decision Forest better than C4.5 and kNN
  • In category dependent case, C4.5 better than kNN
  • In category independent case, kNN better than C4.5

13
Category Dependent vs. Independent method
  • For Decision Forest and C4.5, category dependent
    better than independent.
  • But for kNN, category independent better than
    dependent.
  • No obvious explanation found.

14
Reuters vs. OHSUMED
  • All classifiers degrade from Reuters to OHSUMED
  • kNN degrades more (26%) than C4.5 (12%) and
    DF (12%)

15
Reuters vs. OHSUMED
  • OHSUMED is a harder problem because
  • Documents are more evenly distributed
  • This even distribution hurts the recall of kNN
    more than that of the other classifiers, because
    more confusable classes fall inside the fixed-size
    neighborhood.

16
Conclusion
  • Decision Forest is substantially better than C4.5
    and kNN in text categorization
  • Difficult to make comparison with results of
    other classifiers outside this experiment,
    because
  • Different ways of splitting training/test set
  • Different term selection methods