Title: A Survey on Text Classification
1. A Survey on Text Classification
- December 10, 2003
- 20033077 Dongho Kim
- KAIST
2. Contents
- Introduction
- Statistical Properties of Text
- Feature Selection
- Feature Space Reduction
- Classification Methods
- Using SVM and TSVM
- Hierarchical Text Classification
- Summary
3. Introduction
- Text classification
- Assign text to predefined categories based on content
- Types of text
- Documents (typical)
- Paragraphs
- Sentences
- WWW-Sites
- Different types of categories
- By topic
- By function
- By author
- By style
4. Text Classification Example
5. Computer-Based Text Classification Technologies
- Naive word-matching (Chute, Yang & Buntrock, 1994)
- Finding shared words between the text and the names of categories
- Weakest method
- Cannot capture any conceptual relations
- Thesaurus-based matching (Lindberg & Humphreys, 1990)
- Using lexical links
- Insensitive to the context
- High cost and low adaptivity across domains
6. Computer-Based Text Classification Technologies
- Empirical learning of term-category associations
- Learning from a training set
- Fundamentally different from word-matching
- Statistically capturing the semantic association between terms and categories
- Context-sensitive mapping from terms to categories
- For example
- Decision tree methods
- Bayesian belief networks
- Neural networks
- Nearest neighbor classification methods
- Least-squares regression techniques
7. Statistical Properties of Text
- There are stable, language-independent patterns in how people use natural language
- A few words occur very frequently; most occur rarely
- In general
- Top 2 words: 10-15% of all word occurrences
- Top 6 words: 20% of all word occurrences
- Top 50 words: 50% of all word occurrences
[Table: most common words from Tom Sawyer]
8. Statistical Properties of Text
- The most frequent words in one corpus may be rare words in another corpus
- Example: "computer" in CACM vs. National Geographic
- Each corpus has a different, fairly small working vocabulary
These properties hold in a wide range of languages
9. Statistical Properties of Text
- Summary
- Term usage is highly skewed, but in a predictable pattern
- Why is it important to know the characteristics of text?
- Optimization of data structures
- Statistical retrieval algorithms depend on them
10. Statistical Profiles
- Can act as a summarization device
- Indicate what a document is about
- Indicate what a collection is about
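11. Zipf's Law
- Zipf's Law relates a term's frequency to its rank
- Frequency $\propto$ 1/rank
- There is a constant $k$ such that frequency $\times$ rank $= k$
- Rank the terms in a vocabulary by frequency, in descending order
- Empirical observation: $p_r = f_r / N \approx A / r$, where $f_r$ is the frequency of the term at rank $r$ and $N$ is the total number of word occurrences
- Hence $f_r \cdot r \approx A N$
- $A \approx 0.1$ for English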
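The rank-frequency relation on this slide can be checked on any token list: if frequency is roughly proportional to 1/rank, the product rank × frequency should stay roughly constant. The sketch below is illustrative only; the function name `rank_frequency_table` and the toy corpus are assumptions, not part of the original slides.

```python
from collections import Counter

def rank_frequency_table(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for a stable order.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(rank, term, freq, rank * freq)
            for rank, (term, freq) in enumerate(ordered, start=1)]

# Hypothetical toy corpus built so that frequency is exactly 12 / rank.
tokens = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["to"] * 3
for row in rank_frequency_table(tokens):
    print(row)
```

On a real corpus the last column only stays approximately constant, and it typically drifts at the very top and the very bottom of the ranking.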
12. Precision and Recall
Evaluation Metrics
- Recall
- Percentage of all relevant documents that are found by a search
- Precision
- Percentage of retrieved documents that are relevant
13. F-measure
Evaluation Metrics
Harmonic average of precision and recall: $F = \frac{2PR}{P + R}$
- Rewards results that keep recall and precision close together
- R = 40, P = 60: R/P average = 50, F-measure = 48
- R = 45, P = 55: R/P average = 50, F-measure = 49.5
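The two numeric examples on this slide can be reproduced with a few lines. This is a minimal sketch of the harmonic mean of precision and recall; the function name `f_measure` is chosen here for illustration.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale, e.g. percent)."""
    if precision + recall == 0:
        return 0.0  # convention assumed here for the degenerate case
    return 2 * precision * recall / (precision + recall)

print(f_measure(60, 40))  # R = 40, P = 60 -> 48.0
print(f_measure(55, 45))  # R = 45, P = 55 -> 49.5
```

Both pairs have the same arithmetic average (50), but the pair with the smaller gap between recall and precision gets the higher F-measure.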
14. Break Even Point
Evaluation Metrics
- The point at which recall equals precision
- Evaluation metric: the value at this point
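As a rough illustration of the break-even point, assume the classifier outputs a ranked list of documents for one category and the cutoff is moved down that list. With `k` documents retrieved, precision is hits/k and recall is hits/(number of relevant documents), so the two coincide when `k` equals the number of relevant documents. The helper below is a sketch under that assumption; real evaluations often interpolate when no threshold gives exact equality.

```python
def breakeven_point(ranked_relevance):
    """ranked_relevance: 0/1 relevance flags, ordered by descending classifier score."""
    total_relevant = sum(ranked_relevance)
    if total_relevant == 0:
        raise ValueError("no relevant documents in the ranking")
    # Retrieve exactly as many documents as there are relevant ones.
    hits = sum(ranked_relevance[:total_relevant])
    precision = hits / total_relevant  # retrieved == total_relevant
    recall = hits / total_relevant
    assert precision == recall
    return precision

# Hypothetical ranking: 4 relevant documents, 3 of them in the top 4.
print(breakeven_point([1, 1, 0, 1, 0, 0, 1, 0]))
```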
15. Term Weights: A Brief Introduction
Feature Selection
- The words of a text are not equally indicative of its meaning
- Important: butterflies, monarchs, scientists, direction, compass
- Unimportant: most, think, kind, sky, determine, cues, learn
- Term weights reflect the (estimated) importance of each term
Most scientists think that butterflies use the
position of the sun in the sky as a kind of
compass that allows them to determine which way
is north. Scientists think that butterflies may
use other cues, such as the earth's magnetic
field, but we have a lot to learn about monarchs'
sense of direction.
16. Term Weights
Feature Selection
- Term frequency (TF)
- The more often a word occurs in a document, the better that term is at describing what the document is about
- Often normalized, e.g. by the length of the document
- Sometimes biased to the range 0.4..1.0 to represent the fact that even a single occurrence of a term is a significant event
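One way to realize the biased range mentioned above is to rescale the raw count by the largest count in the document and shift it so that every term that occurs gets at least 0.4. The slide does not give the exact formula, so the linear form below, the name `biased_tf`, and the `floor` parameter are assumptions made for illustration.

```python
from collections import Counter

def biased_tf(tokens, floor=0.4):
    """Map raw term counts of a non-empty document into the range (floor, 1.0]."""
    counts = Counter(tokens)
    max_count = max(counts.values())
    # A term with the maximum count gets 1.0; rarer terms approach the floor.
    return {term: floor + (1 - floor) * count / max_count
            for term, count in counts.items()}

print(biased_tf(["sun", "sun", "compass"]))
```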
17. Term Weights
Feature Selection
- Inverse document frequency (IDF)
- Terms that occur in many documents in the collection are less useful for discriminating among documents
- Document frequency (df): number of documents containing the term
- IDF often calculated as $\log(N / \mathrm{df})$, where $N$ is the number of documents in the collection
- TF and IDF are used in combination as a product
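The product of TF and IDF can be sketched as follows. The sketch assumes length-normalized term frequency and idf = log(N / df), which is one common choice among several; the function name `tf_idf` and the toy documents are illustrative and not taken from the slides.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of non-empty token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

docs = [
    ["butterflies", "use", "the", "sun"],
    ["scientists", "study", "the", "sun"],
    ["the", "compass", "points", "north"],
]
doc_weights = tf_idf(docs)
print(doc_weights[0])
```

A term that occurs in every document ("the" here) gets weight 0, while a term confined to one document gets the largest weight.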
18. Vector Space Similarity
Feature Selection
- Similarity is inversely related to the angle between the vectors
- Cosine of the angle between the two vectors
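A minimal sketch of the cosine measure for sparse document vectors, stored here as `{term: weight}` dictionaries (a representation assumed for illustration):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention assumed here for empty vectors
    return dot / (norm_u * norm_v)

print(cosine_similarity({"sun": 1.0, "sky": 1.0}, {"sun": 1.0}))
```

Vectors that share no terms score 0, and vectors pointing in the same direction score 1 regardless of document length.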
19. Feature Space Reduction
- Main reasons
- Improve accuracy of the algorithm
- Decrease the size of data set
- Control the computation time
- Avoid overfitting
- Feature space reduction techniques
- Stopword removal, stemming
- Information gain
- Natural language processing
20. Stopword Removal
Feature Space Reduction
- Stopwords: words that are discarded from a document representation
- Function words: a, an, and, as, for, in, of, the, to
- About 400 words in English
- Other frequent words: "Lotus" in a Lotus Support
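Stopword removal amounts to filtering tokens against a fixed list. The sketch below uses only the nine function words named on the slide; a realistic list would be much longer, and the names `STOPWORDS` and `remove_stopwords` are illustrative.

```python
# Only the function words listed on the slide; real lists hold several hundred words.
STOPWORDS = frozenset({"a", "an", "and", "as", "for", "in", "of", "the", "to"})

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Drop tokens whose lowercase form is in the stopword list."""
    return [token for token in tokens if token.lower() not in stopwords]

sentence = "Most scientists think that butterflies use the position of the sun in the sky"
print(remove_stopwords(sentence.split()))
```

Domain-specific frequent words can be handled the same way by passing an extended `stopwords` set.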
21. Stemming
Feature Space Reduction
- Group morphological variants
- Plural: streets → street
- Adverbs: fully → full
- Other inflected word forms: goes → go
- Grouping process is called conflation
- Current stemming algorithms make mistakes
- Conflating terms manually is difficult, time-consuming
- Automatic conflation using rules
- Porter Stemmer
- Porter stemming example: police, policy → polic
22. Information Gain
Feature Space Reduction
- Measuring information obtained by presence or absence of a term in a document
- Feature space reduction by thresholding
- Biased to common terms: a large reduction in the size of the data set cannot be achieved
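Information gain for a term can be sketched as the drop in category entropy once the presence or absence of the term is known. The version below is a minimal illustration; the function names, the set-of-terms document representation, and the toy data are assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of category labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(term, documents, labels):
    """documents: list of term sets; labels: one category per document."""
    with_term = [lab for doc, lab in zip(documents, labels) if term in doc]
    without_term = [lab for doc, lab in zip(documents, labels) if term not in doc]
    n = len(labels)
    # Expected entropy after observing whether the term is present.
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional

docs = [{"goal", "match"}, {"goal", "team"}, {"cpu", "disk"}, {"cpu", "team"}]
labels = ["sports", "sports", "computers", "computers"]
print(information_gain("goal", docs, labels), information_gain("team", docs, labels))
```

Feature space reduction by thresholding then keeps only the terms whose gain exceeds a chosen cutoff.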
23. Natural Language Processing
Feature Space Reduction
- Pick out the important words from a document
- For example: nouns, proper nouns, or verbs
- Ignoring all other parts
- Not biased to common terms: reduction in both feature space and size of data
- Named entities
- The subset of proper nouns consisting of people, locations, and organizations
- Effective in cases of news story classification
24. Experimental Results
Robert Cooley, Classification of News Stories Using Support Vector Machines, Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999
- Data set
- From six news media sources
- Two print sources (New York Times and Associated Press Wire)
- Two television sources (ABC World News Tonight and CNN Headline News)
- Two radio sources (Public Radio International and Voice of America)
25. Experimental Results
- Results
- NLP: significant loss in recall and precision
- SVM kNN (using full text or information gain)
- Binary weighting: significant loss in recall
26. kNN
Classification Methods
- Stands for k-nearest neighbor classification
- Algorithm
- Given a test document
- Find the k nearest neighbors among the training documents
- Calculate and sort the scores of candidate categories
- Thresholding on these scores
- Decision rule
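The steps above can be sketched as follows. This is one plausible reading rather than the exact decision rule from the slide: it assumes cosine similarity as the nearness measure and scores each category by summing the similarities of the neighbors that carry it. All names and the toy data are illustrative.

```python
import math
from collections import defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def knn_category_scores(test_vec, training, k):
    """training: list of (vector, set_of_categories) pairs."""
    # Find the k training documents most similar to the test document.
    scored = [(cosine(test_vec, vec), cats) for vec, cats in training]
    neighbors = sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]
    # Score each candidate category by summing neighbor similarities.
    scores = defaultdict(float)
    for similarity, cats in neighbors:
        for category in cats:
            scores[category] += similarity
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

def knn_classify(test_vec, training, k, threshold):
    # Keep the categories whose score reaches the threshold.
    return [c for c, s in knn_category_scores(test_vec, training, k) if s >= threshold]

training = [
    ({"goal": 1.0, "match": 1.0}, {"sports"}),
    ({"goal": 1.0, "team": 1.0}, {"sports"}),
    ({"cpu": 1.0, "disk": 1.0}, {"computers"}),
]
print(knn_classify({"goal": 1.0, "team": 1.0}, training, k=2, threshold=1.0))
```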
27. LLSF
Classification Methods
- Stands for Linear Least Squares Fit
- Obtain a matrix of word-category regression coefficients by LLSF
- $F_{LS}$ maps an arbitrary document vector to a vector of weighted categories
- By thresholding, as in kNN, assign categories
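28. Naïve Bayes
Classification Methods
- Assumption
- Words are drawn randomly from class-dependent lexicons (with replacement)
- Word independence
- Result
- Word independence: $P(W \mid Y) = \prod_i P(w_i \mid Y)$
- Classification rule: $\hat{y} = \arg\max_y P(Y = y) \prod_i P(w_i \mid Y = y)$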
29. Estimating the Parameters
Naïve Bayes
- Count frequencies in training data
- Estimating P(Y)
- Fraction of positive / negative examples in training data
- Estimating P(W|Y)
- Smoothing with Laplace estimate
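The estimation steps above can be sketched with plain counting. The code assumes a multinomial word model, add-one (Laplace) smoothing over the training vocabulary, and log-probabilities to avoid underflow; the function names and toy data are illustrative.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(documents, labels):
    """documents: list of token lists; labels: one class label per document."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(documents, labels):
        word_counts[label].update(tokens)
    vocabulary = {token for tokens in documents for token in tokens}
    # P(Y): fraction of training documents in each class.
    priors = {label: n / len(labels) for label, n in class_counts.items()}
    return priors, word_counts, vocabulary

def classify(tokens, priors, word_counts, vocabulary):
    best_label, best_score = None, -math.inf
    for label, prior in priors.items():
        total_words = sum(word_counts[label].values())
        score = math.log(prior)
        for token in tokens:
            if token not in vocabulary:
                continue  # skip words never seen in training
            # Laplace estimate of P(w | Y): add one to every count.
            p = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

docs = [["goal", "match", "goal"], ["team", "goal"], ["cpu", "disk"], ["cpu", "memory", "disk"]]
labels = ["sports", "sports", "computers", "computers"]
model = train_naive_bayes(docs, labels)
print(classify(["goal", "team"], *model))
```

Without smoothing, a single word unseen in one class would drive that class's probability to zero.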
30. Experiment Results
Yiming Yang and Xin Liu, A re-examination of text categorization methods, Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
31. Text Classification using SVM
T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines, Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- A statistical learning model of text
classification with SVMs
0 if linearly separable
32. Properties 1+2: Sparse Examples in High Dimension
- High dimensional feature vectors (30000 features)
- Sparse document vectors: only a few words of the whole language occur in each document
- SVMs use overfitting protection which does not depend on the dimension of the feature space
33. Property 3: Heterogeneous Use of Words
No pair of documents shares any words but "it", "the", "and", "of", "for", "an", "a", "not", "that", "in".
34. Property 4: High Level of Redundancy
Few features are irrelevant! Feature space
reduction causes loss of information
35. Property 5: Zipf's Law
Most words occur very infrequently!
36. TCat Concepts
Modeling real text-classification tasks. Used for previous proof.
TCat( [20:20:100],  # high freq.
[4:1:200], [1:4:200], [5:5:600],  # medium freq.
[9:1:3000], [1:9:3000], [10:10:4000]  # low freq.
)
37. TCat Concepts
- Margin of TCat-concepts
- By Zipf's law, we can bound $R^2$
- Intuitively: many words with low frequency
- Relatively short document vectors
Linearly separable
38. TCat Concepts
- Bound on Expected Error of SVM
39. Text Classification using TSVM
T. Joachims, Transductive Inference for Text Classification using Support Vector Machines, Proceedings of the International Conference on Machine Learning (ICML), 1999.
- How would you classify the test set?
- Training set: D1, D6
- Test set: D2, D3, D4, D5
40. Why Does Adding Test Examples Reduce Error?
41. Experiment Results
- Data set
- Reuters-21578 dataset, ModApte split
- Training: 9603, test: 3299
- WebKB collection of WWW pages
- Only the classes course, faculty, project, and student are used
- Stemming and stopword removal are not used
- Ohsumed corpus compiled by William Hersh
- Training: 10000, test: 10000
42. Experiment Results
P/R-breakeven point for Reuters categories
43. Experiment Results
Average P/R-breakeven point on WebKB
Average P/R-breakeven point on Ohsumed
44. Hierarchical Text Classification
- Real world classification: complex hierarchical structure
- Due to difficulties of training for many classes or features
[Figure: documents are assigned at Level 1 to Class 1, Class 2, or Class 3, and at Level 2 to subclasses such as Class 1-1, Class 1-2, Class 1-3, and Class 2-1]
45. Hierarchical Text Classification
- More accurate specialized classifiers
[Figure: documents are split into Computers (Hardware, Software, Chat) and Sports (Soccer, Football); the term "computer" is discriminating at the top level but not discriminating among the Computers subcategories]
46. Experiment Setting
S. Dumais and H. Chen, Hierarchical classification of Web content, Proceedings of SIGIR'00, August 2000, pp. 256-263.
- Data set: LookSmart's web directory
- Using short summary from search engine
- 370597 unique pages
- 17173 categories
- 7-level hierarchy
- Focus on 13 top-level and 150 second-level categories
47. Experiment Setting
- Using SVM
- Posterior probabilities by regularized maximum likelihood fitting
- Combining probabilities from the first and second level
- Boolean scoring function: P(L1) && P(L2), or
- Multiplicative scoring function: P(L1) * P(L2)
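The two ways of combining first-level and second-level probabilities can be sketched as simple decision functions. The reading assumed here is that the Boolean variant requires each level's probability to pass its own threshold, while the multiplicative variant thresholds the product; the function names and threshold values are hypothetical.

```python
def boolean_score(p_l1, p_l2, threshold_l1, threshold_l2):
    """Accept only if both levels individually pass their thresholds."""
    return p_l1 >= threshold_l1 and p_l2 >= threshold_l2

def multiplicative_score(p_l1, p_l2, threshold):
    """Accept if the product of the two probabilities passes one threshold."""
    return p_l1 * p_l2 >= threshold

# Hypothetical probabilities: confident at the top level, weaker at the second level.
p_top, p_second = 0.9, 0.4
print(boolean_score(p_top, p_second, 0.5, 0.5))
print(multiplicative_score(p_top, p_second, 0.3))
```

The multiplicative form lets a strong top-level match compensate for a weaker second-level one, which the Boolean form does not allow.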
48. Experiment Results
- Non-hierarchical (baseline): F1 = 0.476
- Hierarchical
- Top-level
- Training set: F1 = 0.649
- Test set: F1 = 0.572
- Second-level
- Multiplicative: F1 = 0.495
- Boolean: F1 = 0.497
- Assuming top-level classification is correct
- F1 = 0.711
49. Summary
- Feature space reduction
- Performance of SVM and TSVM is better than others
- TSVM has merits in text classification
- Hierarchical classification is helpful
- Other issues
- Sampling strategies
- Other kinds of feature selection
50. References
- T. Joachims, Text Categorization with Support Vector Machines: Learning with Many Relevant Features. Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
- T. Joachims, Transductive Inference for Text Classification using Support Vector Machines. Proceedings of the International Conference on Machine Learning (ICML), 1999.
- T. Joachims, A Statistical Learning Model of Text Classification with Support Vector Machines. Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- Robert Cooley, Classification of News Stories Using Support Vector Machines. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence Text Mining Workshop, August 1999.
- Yiming Yang and Xin Liu, A re-examination of text categorization methods. Proceedings of ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 1999.
- S. Dumais and H. Chen, Hierarchical classification of Web content. Proceedings of SIGIR'00, August 2000, pp. 256-263.