Text Document Categorization by Term Association

University of Alberta, Canada. 2002 IEEE International Conference on Data Mining (ICDM'02) ... in classifying e-mails, memos or web pages in a yahoo-like manner. ...

1
Text Document Categorization by Term Association
  • Maria-Luiza Antonie, Osmar R. Zaiane
  • University of Alberta, Canada
  • 2002 IEEE International Conference on Data Mining
    (ICDM'02)

  • Presentation by Yu-Kai Lin

2
Outline
  • Introduction
  • Related work
  • Building an Associative Text Classifier
  • Experimental Results
  • Conclusion

3
Introduction
  • Text categorization is a necessity due to the
    very large number of text documents that we have
    to deal with daily.
  • A text categorization system can be used to
    index documents in support of information
    retrieval tasks, as well as to classify
    e-mails, memos or web pages in a Yahoo-like
    manner.

4
Introduction (cont.)
  • The data classification process
  • (a) Learning: training data are analyzed by a
    classification algorithm, and the learned model is
    represented in the form of classification rules.
    (Figure 1)
  • (b) Classification: test data are used to
    estimate the accuracy of the classification rules.
    (Figure 2)

5
Figure 1
Classification algorithm
Training data
Classification rules

name      age     income  Credit_rating
Jones     < 30    Low     Fair
Bill Lee  < 30    Low     Excellent
Fox       31..40  High    Excellent
Lake      > 40    Med     Fair

IF age = 31..40 AND income = high THEN Credit_rating = excellent

6
Figure 2
Classification rules
Training data
New data

name    age     income  Credit_rating
Frank   > 30    high    fair
Sylvia  < 30    low     fair
Anne    31..40  high    excellent

(John, 31..40, high): Credit rating? excellent

7
Related Work
  • Text classifier
  • Association Rule Mining

8
Related Work (cont.)
  • Text classifier
  • Naïve Bayesian classifier (chapter 7.4)
  • ID3 (decision tree, chapter 7.3)
  • C4.5 (chapter 7.6)
  • K-NN (chapter 7.7.1)
  • Neural Networks
  • Support Vector Machines (SVM)

9
Related Work (cont.)
  • Association Rule Mining
  • Association Rules Generation
  • Associative classifiers

10
Related Work (cont.)
  • Association Rules Generation
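  • Rule form: X => Y
  • support s: the fraction of transactions that
    contain both X and Y
  • confidence c: the fraction of transactions
    containing X that also contain Y
  • strong rules: rules that have a support and
    confidence greater than given thresholds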
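The two measures can be illustrated with a minimal Python sketch. The function names, the toy market-basket transactions, and the choice to treat the thresholds as inclusive minimums are assumptions made for illustration; none of them come from the slides.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    if not transactions:
        return 0.0
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def confidence(x, y, transactions):
    """Support of X union Y divided by the support of X, for the rule X => Y."""
    support_x = support(x, transactions)
    if support_x == 0:
        return 0.0
    return support(frozenset(x) | frozenset(y), transactions) / support_x


def is_strong(x, y, transactions, min_support, min_confidence):
    """A rule is strong when both measures reach the thresholds.

    Assumption: thresholds are inclusive minimums (>=).
    """
    s = support(frozenset(x) | frozenset(y), transactions)
    c = confidence(x, y, transactions)
    return s >= min_support and c >= min_confidence


# Hypothetical toy transactions.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
```

For the rule bread => milk in this toy data, two of the four transactions contain both items and three contain bread, so the support is 2/4 and the confidence is 2/3.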

11
Related Work (cont.)
  • Associative classifiers
  • The learning method is association rule
    mining
  • Discover strong patterns that are associated with
    the class labels
  • New objects are categorized by these patterns
    (the classifier)

12
Building an Association Text Classifier
(Diagram labels: Training Set, Testing Set, Preprocessing Phase,
Association Rule Mining, Model Validation, Associative Classifier)
13
Building an Association Text Classifier (cont.)
  • Data Collection and Preprocessing
  • Association Rules Generation
  • Pruning the Set of Association Rules
  • Prediction of Classes Associated with New
    Documents

14
Building an Association Text Classifier (cont.)
  • Data Collection and Preprocessing
  • Weed out uninteresting words
  • stopwording
  • stemming
  • Transform documents into transactions
  • category set C = {c1, c2, ..., cm}
  • term set T = {t1, t2, ..., tn}
  • document Di = {c1, c2, ..., cm, t1, t2, ..., tn}
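A rough Python sketch of this preprocessing step follows. The stopword list, the suffix-stripping `naive_stem` helper, and the `cat:`/`term:` prefixes are all hypothetical stand-ins chosen for illustration; the slides do not say which stopword list or stemmer was used.

```python
import re

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = frozenset({"the", "a", "an", "of", "in", "to", "and", "is", "are", "for"})


def naive_stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_transaction(text, categories):
    """Turn one labelled document into a transaction of category and term items."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = {naive_stem(t) for t in tokens if t not in STOPWORDS}
    category_items = frozenset(f"cat:{c}" for c in categories)
    term_items = frozenset(f"term:{t}" for t in terms)
    return category_items | term_items


example = to_transaction("Wheat prices are rising in the markets", ["grain"])
```

Prefixing the items keeps category labels and terms distinguishable inside a single transaction, which matters later when only rules ending in a category label are of interest.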

15
Building an Association Text Classifier (cont.)
  • Association Rules Generation
  • Apriori
  • Advantage
  • Performance studies show its efficiency and
    scalability
  • Drawbacks of using it on our transactions
  • Generates a large number of association rules
  • Most of them are irrelevant for classification
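For readers unfamiliar with Apriori, here is a compact level-wise sketch of its frequent-itemset phase in Python. It is a textbook-style illustration under simplifying assumptions (in-memory transactions, no optimized counting), not the implementation used in the paper; the function name and toy data are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset meeting `min_support`."""
    n = len(transactions)

    def frequent(candidates):
        # Count each candidate and keep those that reach the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    level = frequent({frozenset([item]) for t in transactions for item in t})
    result = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: merge frequent (k-1)-itemsets into k-itemset candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result


transactions = [
    frozenset({"a", "b", "c"}),
    frozenset({"a", "b"}),
    frozenset({"a", "c"}),
    frozenset({"b", "c"}),
]
frequent_sets = apriori(transactions, 0.5)
```

On this toy input every single item and every pair is frequent, while the triple {a, b, c} appears in only one of four transactions and is discarded. Even four short transactions yield six frequent itemsets, which hints at why rule counts grow quickly on real document collections.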

16
ARC-BC
  • Association Rule-based Categorizer By Category
    algorithm
  • Apriori-based
  • Interested in rules that indicate a category
    label (T => ci): strong rules
  • Prune the rules that are of no use for categorization
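The by-category idea can be sketched as follows: term sets are mined separately inside each category's documents, and only rules of the form T => ci are produced. This is an illustrative sketch, not the ARC-BC algorithm from the slides. Brute-force enumeration of small term sets stands in for Apriori, support is measured within the category, and confidence is measured over the whole training set; the last two choices, the `Rule` type, and `max_terms` are assumptions.

```python
from itertools import combinations
from typing import NamedTuple


class Rule(NamedTuple):
    terms: frozenset
    category: str
    support: float
    confidence: float


def mine_by_category(docs, min_support, min_confidence, max_terms=2):
    """docs: list of (set_of_terms, set_of_categories) pairs."""
    rules = []
    categories = sorted({c for _, cats in docs for c in cats})
    for cat in categories:
        in_cat = [terms for terms, cats in docs if cat in cats]
        vocab = sorted(set().union(*in_cat))
        for size in range(1, max_terms + 1):
            for combo in combinations(vocab, size):
                t = frozenset(combo)
                hits = sum(1 for terms in in_cat if t <= terms)
                sup = hits / len(in_cat)  # support inside this category
                if hits == 0 or sup < min_support:
                    continue
                total = sum(1 for terms, _ in docs if t <= terms)
                conf = hits / total  # share of matching docs labelled `cat`
                if conf >= min_confidence:
                    rules.append(Rule(t, cat, sup, conf))
    return rules


docs = [
    ({"wheat", "price"}, {"grain"}),
    ({"wheat", "export"}, {"grain"}),
    ({"oil", "price"}, {"crude"}),
    ({"oil", "barrel"}, {"crude"}),
]
rules = mine_by_category(docs, min_support=0.6, min_confidence=0.8)
```

In this toy data "price" occurs in both categories and never reaches the support threshold in either, so only wheat => grain and oil => crude survive.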

17
ARC-BC Algorithm
18
ARC-BC Algorithm
19
ARC-BC
(Diagram: association rules are generated separately for category 1,
..., category i, ..., category n; together they form the classifier,
which puts new documents in the correct class)
20
Examples of association rules composing the classifier
21
Building an Association Text Classifier (cont.)
  • Pruning the Set of Association Rules
  • The number of rules that can be generated in the
    association rule mining phase could be very large
  • Noisy information misleads the classification
    process
  • Makes classification time longer
  • Pruning method
  • Eliminate the specific rules and keep only those
    that are more general and with high confidence
  • Prune unnecessary rules by database coverage
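The two pruning ideas above can be sketched in Python. Rules are plain `(terms, category, confidence)` tuples here. Two details are assumptions rather than anything stated on the slides: a specific rule is dropped only when a more general rule for the same category has at least equal confidence, and rules are visited in order of decreasing confidence during database coverage.

```python
def prune_specific(rules):
    """Drop a rule when a more general, equally or more confident rule exists."""
    kept = []
    for r in rules:
        dominated = any(
            g is not r
            and g[1] == r[1]   # same category
            and g[0] < r[0]    # g's terms are a proper subset of r's terms
            and g[2] >= r[2]   # g is at least as confident
            for g in rules
        )
        if not dominated:
            kept.append(r)
    return kept


def prune_by_coverage(rules, docs):
    """Keep a rule only if it correctly covers a still-uncovered training doc."""
    ordered = sorted(rules, key=lambda r: (-r[2], len(r[0])))
    uncovered = list(range(len(docs)))
    kept = []
    for terms, cat, conf in ordered:
        covered = [
            i for i in uncovered
            if terms <= docs[i][0] and cat in docs[i][1]
        ]
        if covered:
            kept.append((terms, cat, conf))
            uncovered = [i for i in uncovered if i not in covered]
        if not uncovered:
            break
    return kept


rules = [
    (frozenset({"wheat"}), "grain", 0.9),
    (frozenset({"wheat", "price"}), "grain", 0.8),
    (frozenset({"oil"}), "crude", 0.95),
    (frozenset({"barrel"}), "crude", 0.7),
]
docs = [({"wheat", "price"}, {"grain"}), ({"oil", "barrel"}, {"crude"})]
general = prune_specific(rules)
final = prune_by_coverage(general, docs)
```

Here the rule {wheat, price} => grain is removed first because {wheat} => grain is more general and more confident; {barrel} => crude is then removed by coverage because {oil} => crude already covers the only crude document.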

22
Building an Association Text Classifier (cont.)
  • Pruning the Set of Association Rules: definition

23
Pruning the Set of Association Rules: Algorithm
24
Building an Association Text Classifier (cont.)
  • Prediction of Classes Associated with New
    Documents
  • Algorithm
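As a rough illustration of how a rule-based classifier can label a new document, the Python sketch below lets every matching rule vote for its category. It is not the algorithm from the slide: scoring each category by the average confidence of its matching rules, returning a single class, and breaking ties alphabetically are all assumptions, as are the rule tuples and the toy data.

```python
from collections import defaultdict


def predict(doc_terms, rules):
    """Return the best-scoring category, or None if no rule matches.

    rules: iterable of (terms, category, confidence) tuples.
    """
    matched = defaultdict(list)
    for terms, category, conf in rules:
        if terms <= doc_terms:  # the rule fires when all its terms are present
            matched[category].append(conf)
    if not matched:
        return None
    average = {c: sum(v) / len(v) for c, v in matched.items()}
    # Iterating over sorted names makes ties resolve alphabetically.
    return max(sorted(average), key=average.get)


rules = [
    (frozenset({"wheat"}), "grain", 0.9),
    (frozenset({"price"}), "grain", 0.6),
    (frozenset({"price"}), "crude", 0.7),
]
label = predict({"wheat", "price", "rise"}, rules)
```

For the document {wheat, price, rise}, the two grain rules average 0.75 against 0.7 for crude, so the sketch assigns "grain"; a document containing only "price" would go to "crude".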

25
Experimental results
  • 9,603 training documents and 3,299 testing
    documents

26
Conclusion
  • Its effectiveness is comparable to that of most
    well-known text classifiers
  • Relatively fast training time
  • The generated rules are understandable and can
    easily be updated manually
  • When retraining with a new document, only the
    categories concerned are adjusted, and the rules
    can be updated incrementally