Title: Text Document Categorization by Term Association
1 Text Document Categorization by Term Association
- Maria-Luiza Antonie, Osmar R. Zaiane
- University of Alberta, Canada
- 2002 IEEE International Conference on Data Mining (ICDM'02)
- Presentation by Yu-Kai Lin
2 Outline
- Introduction
- Related work
- Building an Associative Text Classifier
- Experimental Results
- Conclusion
3 Introduction
- Text categorization is a necessity because of the very large number of text documents that we have to deal with daily.
- A text categorization system can be used to index documents in support of information retrieval tasks, as well as to classify e-mails, memos, or web pages in a Yahoo-like manner.
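4 Introduction (cont.)
- The data classification process
- (a) Learning: training data are analyzed by a classification algorithm, and the learned model is represented in the form of classification rules. (Figure 1)
- (b) Classification: test data are used to estimate the accuracy of the classification rules, which can then be applied to new data. (Figure 2)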
5 Figure 1
Training data -> Classification algorithm -> Classification rules
Training data:
name      age     income  credit_rating
Jones     < 30    low     fair
Bill Lee  < 30    low     excellent
Fox       31..40  high    excellent
Lake      > 40    med     fair
Classification rule:
IF age = 31..40 AND income = high THEN credit_rating = excellent
6 Figure 2
Classification rules -> applied to test data and to new data
Test data:
name    age     income  credit_rating
Frank   > 30    high    fair
Sylvia  < 30    low     fair
Anne    31..40  high    excellent
New data: (John, 31..40, high)  Credit rating? excellent
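The two figures can be replayed in a few lines of Python. This is only a sketch: the function names `credit_rule` and `rule_accuracy` and the dictionary layout of a record are assumptions made for illustration, and the single rule is the one shown in Figure 1.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]


def credit_rule(record: Record) -> Optional[str]:
    """IF age = 31..40 AND income = high THEN credit_rating = excellent."""
    if record["age"] == "31..40" and record["income"] == "high":
        return "excellent"
    return None  # the rule does not fire for this record


def rule_accuracy(records: List[Record]) -> Optional[float]:
    """Accuracy of the rule over the labelled records on which it fires."""
    fired = [(credit_rule(r), r["credit_rating"]) for r in records]
    fired = [(pred, actual) for pred, actual in fired if pred is not None]
    if not fired:
        return None
    return sum(pred == actual for pred, actual in fired) / len(fired)


# Labelled records from Figure 2, used to check the rule.
test_data = [
    {"name": "Frank", "age": "> 30", "income": "high", "credit_rating": "fair"},
    {"name": "Sylvia", "age": "< 30", "income": "low", "credit_rating": "fair"},
    {"name": "Anne", "age": "31..40", "income": "high", "credit_rating": "excellent"},
]

# Unlabelled record from Figure 2.
new_record = {"name": "John", "age": "31..40", "income": "high"}

print(rule_accuracy(test_data))  # 1.0 (the rule fires only on Anne, correctly)
print(credit_rule(new_record))   # excellent
```

With a single rule, accuracy is measured only on the records the rule covers; a full classifier would also need a default class for uncovered records.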
7 Related Work
- Text classifier
- Association Rule Mining
8 Related Work (cont.)
- Text classifier
- Naïve Bayesian classifier (chapter 7.4)
- ID3 (decision tree, chapter 7.3)
- C4.5 (chapter 7.6)
- k-NN (chapter 7.7.1)
- Neural networks
- Support Vector Machines (SVM)
9 Related Work (cont.)
- Association Rule Mining
- Association Rules Generation
- Associative classifiers
10 Related Work (cont.)
- Association Rules Generation
- X => Y
- support s
- confidence c
- strong rules
- rules whose support and confidence are greater than given thresholds
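Support and confidence of a rule X => Y can be computed directly from a list of transactions. The sketch below uses the standard definitions (support as the fraction of transactions containing X and Y together, confidence as that support divided by the support of X). The function names and the toy transactions are assumptions for illustration, and thresholds are treated as "at least" rather than "strictly greater".

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(itemset: FrozenSet[str], transactions: List[Transaction]) -> float:
    """Fraction of transactions that contain every item of `itemset`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(x: FrozenSet[str], y: FrozenSet[str],
               transactions: List[Transaction]) -> float:
    """support(X and Y together) / support(X); 0.0 when X never occurs."""
    sup_x = support(x, transactions)
    if sup_x == 0.0:
        return 0.0
    return support(x | y, transactions) / sup_x


def is_strong(x: FrozenSet[str], y: FrozenSet[str],
              transactions: List[Transaction],
              min_sup: float, min_conf: float) -> bool:
    """A rule is strong when it meets both thresholds."""
    return (support(x | y, transactions) >= min_sup
            and confidence(x, y, transactions) >= min_conf)


# Toy transactions (hypothetical data).
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]
x, y = frozenset({"bread"}), frozenset({"milk"})
print(support(x | y, transactions))     # 0.5
print(confidence(x, y, transactions))   # 2 of the 3 bread transactions contain milk
print(is_strong(x, y, transactions, min_sup=0.4, min_conf=0.6))  # True
```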
11 Related Work (cont.)
- Associative classifiers
- The learning method is association rule mining
- Discover strong patterns that are associated with the class labels
- New objects are categorized by these patterns (the classifier)
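12 Building an Association Text Classifier
[Figure: the Training Set and the Testing Set both pass through the Preprocessing Phase; Association Rule Mining on the training set yields the Associative Classifier, which Model Validation checks against the testing set]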
13 Building an Association Text Classifier (cont.)
- Data collection and preprocessing
- Association Rules Generation
- Pruning the Set of Association Rules
- Prediction of Classes Associated with New Documents
14 Building an Association Text Classifier (cont.)
- Data collection and preprocessing
- Weed out uninteresting words
- stopwording
- stemming
- Transform documents into transactions
- category set C = {c1, c2, ..., cm}
- term set T = {t1, t2, ..., tn}
- document Di = {c1, c2, ..., cm, t1, t2, ..., tn}
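The preprocessing steps above can be sketched as follows. Everything named here is an assumption for illustration: the stopword list is a tiny sample, `toy_stem` is a crude suffix stripper standing in for a real stemmer, and the `cat:` / `term:` prefixes are just one way to keep category items and term items apart inside a single transaction.

```python
import re
from typing import FrozenSet, Iterable

# Tiny illustrative stopword list (an assumption; real lists are far longer).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are", "for", "on"}


def toy_stem(word: str) -> str:
    """Crude suffix stripping; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def to_transaction(text: str, categories: Iterable[str]) -> FrozenSet[str]:
    """Turn one labelled document into a set of category and term items."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = {toy_stem(tok) for tok in tokens if tok not in STOPWORDS}
    items = {f"cat:{c}" for c in categories} | {f"term:{t}" for t in terms}
    return frozenset(items)


doc = to_transaction("The prices of wheat are rising", ["grain"])
print(sorted(doc))
# ['cat:grain', 'term:pric', 'term:ris', 'term:wheat']
```

Each document becomes one transaction, so an association rule miner can treat categories and terms uniformly as items.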
15 Building an Association Text Classifier (cont.)
- Association Rules Generation
- Apriori
- Advantage: performance studies show its efficiency and scalability
- Drawbacks when applied to our transactions:
- It generates a large number of association rules
- Most of them are irrelevant for classification
16 ARC-BC
- Association Rule-based Categorizer By Category algorithm
- Apriori-based
- Interested in rules that indicate a category label (T => ci)
- Strong rules
- Prune the rules that are of no use for categorization
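A minimal sketch of the by-category idea follows: frequent term sets are mined separately inside each category's documents with an Apriori-style level-wise search, and each one becomes a candidate rule T => ci. It is not the authors' ARC-BC algorithm. The function names, the `max_len` cap, and in particular the choice to measure confidence over the whole training set are assumptions made for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence, Set, Tuple

Doc = Tuple[FrozenSet[str], Set[str]]      # (terms, categories)
Rule = Tuple[FrozenSet[str], str, float]   # (terms, category, confidence)


def frequent_itemsets(term_sets: Sequence[FrozenSet[str]], min_sup: float,
                      max_len: int = 2) -> List[FrozenSet[str]]:
    """Level-wise (Apriori-style) search for term sets with support >= min_sup."""
    n = len(term_sets)
    if n == 0:
        return []

    def sup(items: FrozenSet[str]) -> float:
        return sum(1 for t in term_sets if items <= t) / n

    singles = {frozenset([term]) for t in term_sets for term in t}
    level = [s for s in singles if sup(s) >= min_sup]
    frequent = list(level)
    k = 1
    while level and k < max_len:
        k += 1
        prev = set(level)
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in prev for s in combinations(c, k - 1))
                 and sup(c) >= min_sup]
        frequent.extend(level)
    return frequent


def arc_bc_rules(docs: Sequence[Doc], min_sup: float, min_conf: float,
                 max_len: int = 2) -> List[Rule]:
    """Mine rules T => category, one category at a time."""
    rules: List[Rule] = []
    categories = sorted({c for _, cats in docs for c in cats})
    for cat in categories:
        in_cat = [terms for terms, cats in docs if cat in cats]
        for itemset in frequent_itemsets(in_cat, min_sup, max_len):
            covered = [cats for terms, cats in docs if itemset <= terms]
            if not covered:
                continue
            # Assumption: confidence is measured over all training documents.
            conf = sum(1 for cats in covered if cat in cats) / len(covered)
            if conf >= min_conf:
                rules.append((itemset, cat, conf))
    return rules


# Hypothetical toy corpus.
docs = [
    (frozenset({"wheat", "export"}), {"grain"}),
    (frozenset({"wheat", "crop"}), {"grain"}),
    (frozenset({"oil", "export"}), {"crude"}),
    (frozenset({"oil", "barrel"}), {"crude"}),
]
for terms, cat, conf in arc_bc_rules(docs, min_sup=0.5, min_conf=0.8):
    print(sorted(terms), "=>", cat, conf)
```

In this toy corpus the term "export" is frequent in both categories but predicts neither with high confidence, so no rule with "export" alone on the left-hand side survives.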
17 ARC-BC Algorithm
18 ARC-BC Algorithm
19 ARC-BC
[Figure: the documents of each category (category 1, ..., category i, ..., category n) yield the association rules for that category; the rule sets together form the classifier, which puts new documents in the correct class]
20 Examples of association rules composing the classifier
21 Building an Association Text Classifier (cont.)
- Pruning the Set of Association Rules
- The number of rules generated in the association rule mining phase can be very large
- Noisy information misleads the classification process
- A large rule set makes classification time longer
- Pruning method
- Eliminate the specific rules and keep only those that are more general and have high confidence
- Prune unnecessary rules by database coverage
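The two pruning steps can be sketched as below. This is an illustration under stated assumptions, not the algorithm from the paper: a rule is taken to be more general than another when it predicts the same category from a subset of the other's terms, rules are ranked by confidence and then by length, and database coverage keeps a rule only if it correctly classifies at least one training document not yet covered. The names `drop_specific_rules` and `prune_by_coverage` are hypothetical.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple

Doc = Tuple[FrozenSet[str], Set[str]]      # (terms, categories)
Rule = Tuple[FrozenSet[str], str, float]   # (terms, category, confidence)


def drop_specific_rules(rules: Sequence[Rule]) -> List[Rule]:
    """Drop a rule when a more general rule has at least its confidence."""
    ranked = sorted(rules, key=lambda r: (-r[2], len(r[0])))
    kept: List[Rule] = []
    for terms, cat, conf in ranked:
        # Every kept rule ranks at least as high, so a kept subset rule for
        # the same category makes this one redundant.
        if any(k_cat == cat and k_terms <= terms for k_terms, k_cat, _ in kept):
            continue
        kept.append((terms, cat, conf))
    return kept


def prune_by_coverage(ranked_rules: Sequence[Rule],
                      docs: Sequence[Doc]) -> List[Rule]:
    """Keep rules that correctly classify some still-uncovered document."""
    remaining = list(docs)
    kept: List[Rule] = []
    for terms, cat, conf in ranked_rules:
        if not remaining:
            break
        correct = [d for d in remaining if terms <= d[0] and cat in d[1]]
        if correct:
            kept.append((terms, cat, conf))
            # Assumption: every document the rule matches counts as covered.
            remaining = [d for d in remaining if not terms <= d[0]]
    return kept


# Hypothetical rules and training documents.
rules = [
    (frozenset({"wheat"}), "grain", 1.0),
    (frozenset({"wheat", "export"}), "grain", 1.0),
    (frozenset({"wheat", "crop"}), "grain", 0.9),
    (frozenset({"oil"}), "crude", 0.9),
    (frozenset({"barrel"}), "crude", 0.9),
]
docs = [
    (frozenset({"wheat", "export"}), {"grain"}),
    (frozenset({"wheat", "crop"}), {"grain"}),
    (frozenset({"oil", "export"}), {"crude"}),
    (frozenset({"oil", "barrel"}), {"crude"}),
]
general = drop_specific_rules(rules)
final = prune_by_coverage(general, docs)
print([sorted(t) for t, _, _ in general])  # [['wheat'], ['oil'], ['barrel']]
print([sorted(t) for t, _, _ in final])    # [['wheat'], ['oil']]
```

The rule on "barrel" is dropped in the second step because the rules on "wheat" and "oil" already cover every training document.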
22 Building an Association Text Classifier (cont.)
- Pruning the Set of Association Rules: definition
23 Pruning the Set of Association Rules: Algorithm
24 Building an Association Text Classifier (cont.)
- Prediction of Classes Associated with New Documents
- Algorithm
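The prediction algorithm itself is not reproduced in these notes. Purely as an illustration of how a rule-based classifier can label a new document, the sketch below collects the rules whose terms all occur in the document, averages their confidences per category, and returns the best category plus any category whose score falls within a margin of the best. The scoring scheme, the `margin` parameter, and the function name `predict` are assumptions, not the authors' method.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence, Tuple

Rule = Tuple[FrozenSet[str], str, float]   # (terms, category, confidence)


def predict(doc_terms: FrozenSet[str], rules: Sequence[Rule],
            margin: float = 1.0) -> List[str]:
    """Categories whose mean rule confidence is >= margin * best mean."""
    by_cat: Dict[str, List[float]] = defaultdict(list)
    for terms, cat, conf in rules:
        if terms <= doc_terms:          # the rule applies to this document
            by_cat[cat].append(conf)
    if not by_cat:
        return []                       # no rule applies; left unlabelled
    scores = {cat: sum(c) / len(c) for cat, c in by_cat.items()}
    best = max(scores.values())
    return sorted(cat for cat, s in scores.items() if s >= best * margin)


# Hypothetical rule set.
rules = [
    (frozenset({"wheat"}), "grain", 1.0),
    (frozenset({"export"}), "grain", 0.5),
    (frozenset({"oil"}), "crude", 0.9),
]
print(predict(frozenset({"wheat", "export", "price"}), rules))  # ['grain']
print(predict(frozenset({"oil", "wheat"}), rules, margin=0.85)) # ['crude', 'grain']
```

A margin below 1.0 allows a document to receive more than one category, which matters for collections where documents can carry several labels.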
25 Experimental Results
- 9,603 training documents and 3,299 testing documents
26 Conclusion
- Its effectiveness is comparable to that of most well-known text classifiers
- Relatively fast training time
- The generated rules are understandable and can easily be updated manually
- When retraining with a new document, only the categories concerned are adjusted, and the rules can be updated incrementally