Evaluation of Decision Forests on Text Categorization - PowerPoint PPT Presentation

1
Evaluation of Decision Forests on Text
Categorization
2
Text Categorization
  • Text Collection
  • Feature Extraction
  • Classification
  • Evaluation

3
Text Collection
  • Reuters
  • Newswires from Reuters in 1987
  • Training set 9603
  • Test set 3299
  • Categories 95
  • OHSUMED
  • Abstracts from medical journals
  • Training set 12327
  • Test set 3616
  • Categories 75 (within Heart Disease subtree)

4
Feature Extraction
  • Stop Word Removal
  • 430 stop words
  • Stemming
  • Porter's stemmer
  • Term Selection
  • by Document Frequency
  • Category independent selection
  • Category dependent selection
  • Feature Extraction
  • TF × IDF
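The extraction steps above can be sketched end to end. The stop-word list below is a toy stand-in for the deck's 430-word list and Porter stemming is omitted (noted in a comment), but the TF × IDF weighting follows the standard definition:

```python
import math
from collections import Counter

# Toy stand-in for the deck's 430-word stop list.
STOP_WORDS = {"the", "a", "of", "in", "and", "to"}

def tokenize(text):
    # Lowercase and drop stop words; a real pipeline would also apply
    # Porter stemming to each surviving token here.
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def tf_idf(docs):
    """Return one sparse {term: tf * idf} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(t for doc in tokenized for t in set(doc))
    return [
        {t: tf[t] * math.log(n / df[t]) for t in tf}
        for tf in (Counter(doc) for doc in tokenized)
    ]

vectors = tf_idf(["the price of oil rose",
                  "oil exports fell",
                  "the court ruled"])
```

Each document becomes a sparse {term: weight} map; terms occurring in every document get weight zero, mirroring the IDF intuition.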

5
Classification
  • Method
  • Each document may belong to multiple categories
  • Treating each category as a separate
    classification problem
  • Binary classification
  • Classifiers
  • kNN (k Nearest Neighbor)
  • C4.5 (Quinlan)
  • Decision Forest
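Treating each category as its own binary problem can be sketched as follows; the documents and their category labels are hypothetical toy data standing in for Reuters topics:

```python
# Multi-label corpus reduced to one binary problem per category.
doc_labels = [
    {"grain", "wheat"},   # categories of document 0
    {"oil"},              # document 1
    {"grain"},            # document 2
]
categories = sorted(set().union(*doc_labels))

def binary_targets(category):
    # 1 if the document belongs to the category, else 0.
    return [1 if category in labels else 0 for labels in doc_labels]

# One independent binary classification problem per category.
problems = {c: binary_targets(c) for c in categories}
```

Each classifier (kNN, C4.5, Decision Forest) is then trained once per category on its 0/1 targets.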

6
C4.5
  • A method to build decision trees
  • Training
  • Grow the tree by splitting the data set
  • Prune the tree back to prevent over-fitting
  • Testing
  • Test vector goes down the tree and arrives at a
    leaf.
  • Probability that the vector belongs to each
    category is estimated.
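The probability estimate at a leaf can be sketched as below. C4.5-style implementations often apply a Laplace correction when turning leaf counts into probabilities; the exact smoothing here is an assumed common choice, not taken from the deck:

```python
# Probability that a vector reaching a leaf belongs to the positive
# class, estimated from the training examples that fell into that leaf.
# The Laplace-style smoothing is an assumed (common) choice.
def leaf_probability(positives, total, smoothing=1, classes=2):
    return (positives + smoothing) / (total + smoothing * classes)
```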

7
Decision Forest
  • Consisting of many decision trees combined
    by averaging the class probability estimates at
    the leaves.
  • Each tree is constructed in a randomly chosen
    (coordinate) subspace of the feature space.
  • An oblique hyperplane is used as a discriminator
    at each internal node of the trees.
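A minimal sketch of the random-subspace idea, with two simplifications: each tree is reduced to a one-level stump, and the split is axis-aligned at the feature mean rather than an oblique hyperplane. Only the subspace sampling and the averaging of class-probability estimates mirror the method described:

```python
import random

class Stump:
    # Stand-in for a full decision tree: one split on one coordinate.
    def fit(self, X, y, dims):
        self.dim = random.choice(dims)        # coordinate from the subspace
        vals = [x[self.dim] for x in X]
        self.thresh = sum(vals) / len(vals)   # axis-aligned split at the mean
        left = [yi for x, yi in zip(X, y) if x[self.dim] <= self.thresh]
        right = [yi for x, yi in zip(X, y) if x[self.dim] > self.thresh]
        # Leaf probability estimate = fraction of positives at the leaf.
        self.p_left = sum(left) / len(left) if left else 0.5
        self.p_right = sum(right) / len(right) if right else 0.5
        return self

    def predict_proba(self, x):
        return self.p_left if x[self.dim] <= self.thresh else self.p_right

def forest_proba(X, y, x, n_trees=25, subspace=2, seed=0):
    # Average the leaf probability estimates of trees built in
    # randomly chosen coordinate subspaces.
    random.seed(seed)
    n_dims = len(X[0])
    probs = []
    for _ in range(n_trees):
        dims = random.sample(range(n_dims), subspace)  # random subspace
        probs.append(Stump().fit(X, y, dims).predict_proba(x))
    return sum(probs) / len(probs)
```

A real decision forest would train the trees once and reuse them across queries; the sketch refits per query only to keep the code short.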

8
Why choose these 3 classifiers?
  • We do not have a parametric model for the problem
    (we cannot assume Gaussian distributions etc.)
  • kNN and decision trees (C4.5) are the most popular
    nonparametric classifiers. We use them as
    baselines for comparison
  • We expect decision forest to do well since we
    have a high dimensional problem for which it is
    known to do well from previous studies

9
Evaluation
  • Measurements (a = true positives, b = false
    positives, c = false negatives)
  • Precision p = a / (a + b)
  • Recall r = a / (a + c)
  • F1 value F1 = 2rp / (r + p)
  • Tradeoff between Precision and Recall
  • kNN tends to have higher precision than recall,
    especially when k becomes larger.
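With a = true positives, b = false positives, and c = false negatives for one category, the three measurements compute as:

```python
# Precision, recall, and F1 from the contingency counts of one category:
# a = true positives, b = false positives, c = false negatives.
def f1_scores(a, b, c):
    p = a / (a + b) if a + b else 0.0    # precision
    r = a / (a + c) if a + c else 0.0    # recall
    f1 = 2 * r * p / (r + p) if r + p else 0.0
    return p, r, f1
```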

10
Averaging scores
  • Macro-averaging
  • Calculate precision/recall for each category
  • Average all the precision/recall values
  • Assign equal weight to each category
  • Micro-averaging
  • Sum up classification decision of each document
  • Calculate precision/recall from the summations
  • Assign equal weight to each document
  • This was used in experiment because the number of
    documents in each category varies considerably.
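The two averaging schemes can be contrasted on hypothetical per-category counts; note how micro-averaging lets the large, accurate category dominate, which is why it suits corpora with very uneven category sizes:

```python
# Macro- vs. micro-averaged precision over three hypothetical
# categories; each tuple is (a, b) = (true positives, false positives).
cats = [(90, 10), (1, 1), (2, 2)]

# Macro: average the per-category precisions (equal weight per category).
macro_p = sum(a / (a + b) for a, b in cats) / len(cats)

# Micro: pool the counts first (equal weight per document decision).
micro_p = sum(a for a, _ in cats) / sum(a + b for a, b in cats)
```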

11
Performance in F1 Value
12
Comparison between Classifiers
  • Decision Forest better than C4.5 and kNN
  • In category dependent case, C4.5 better than kNN
  • In category independent case, kNN better than C4.5

13
Category Dependent vs. Independent method
  • For Decision Forest and C4.5, category dependent
    better than independent.
  • But for kNN, category independent better than
    dependent.
  • No obvious explanation found.

14
Reuters vs. OHSUMED
  • All classifiers degrade from Reuters to OHSUMED
  • kNN degrades more (26%) than C4.5 (12%) and
    DF (12%)

15
Reuters vs. OHSUMED
  • OHSUMED is a harder problem because
  • Documents are more evenly distributed
  • This even distribution hurts the recall of kNN
    more than that of the other classifiers, because
    more confusable classes fall inside the fixed-size
    neighborhood.

16
Conclusion
  • Decision Forest is substantially better than C4.5
    and kNN in text categorization
  • Difficult to make comparison with results of
    other classifiers outside this experiment,
    because
  • Different ways of splitting training/test set
  • Different term selection methods