1. A Survey on Text Classification
- December 10, 2003
- 20033077 Dongho Kim
- KAIST
2. Contents
- Introduction
- Statistical Properties of Text
- Feature Selection
- Feature Space Reduction
- Classification Methods
- Using SVM and TSVM
- Hierarchical Text Classification
- Summary
3. Introduction
- Text classification
- Assign text to predefined categories based on content
- Types of text
- Documents (typical)
- Paragraphs
- Sentences
- WWW sites
- Different types of categories
- By topic
- By function
- By author
- By style
4. Text Classification Example
5. Computer-Based Text Classification Technologies
- Naive word-matching (Chute, Yang, Buntrock 1994)
- Finding shared words between the text and the names of categories
- Weakest method
- Cannot capture any conceptual relation
- Thesaurus-based matching (Lindberg, Humphreys 1990)
- Using lexical links
- Insensitive to context
- High cost and low adaptivity across domains
6. Computer-Based Text Classification Technologies
- Empirical learning of term-category associations
- Learning from a training set
- Fundamentally different from word-matching
- Statistically capturing the semantic association between terms and categories
- Context-sensitive mapping from terms to categories
- For example:
- Decision tree methods
- Bayesian belief networks
- Neural networks
- Nearest-neighbor classification methods
- Least-squares regression techniques
7. Statistical Properties of Text
- There are stable, language-independent patterns in how people use natural language
- A few words occur very frequently; most occur rarely
- In general:
- Top 2 words: 10-15% of all word occurrences
- Top 6 words: 20% of all word occurrences
- Top 50 words: 50% of all word occurrences

[Table: most common words from Tom Sawyer]
8. Statistical Properties of Text
- The most frequent words in one corpus may be rare words in another corpus
- Example: "computer" in CACM vs. National Geographic
- Each corpus has a different, fairly small working vocabulary
- These properties hold in a wide range of languages
9. Statistical Properties of Text
- Summary
- Term usage is highly skewed, but in a predictable pattern
- Why is it important to know the characteristics of text?
- Optimization of data structures
- Statistical retrieval algorithms depend on them
10. Statistical Profiles
- Can act as a summarization device
- Indicate what a document is about
- Indicate what a collection is about
11. Zipf's Law
- Zipf's Law relates a term's frequency to its rank
- Rank the terms in a vocabulary by frequency, in descending order
- Frequency ∝ 1/rank: there is a constant k such that f · r = k
- Empirical observation: the probability of the term of rank r is approximately A/r
- A ≈ 0.1 for English
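Zipf's law is easy to check empirically. The sketch below (a toy illustration; the corpus is invented for the example, not from the slides) ranks terms by frequency and reports rank × frequency, which the law predicts is roughly constant:

```python
from collections import Counter

def zipf_table(text, top=5):
    """Rank terms by frequency; Zipf's law predicts that
    rank * frequency is roughly constant."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common(top)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy corpus; real corpora (e.g. Tom Sawyer) show the effect far more clearly.
corpus = "the cat sat on the mat the cat ran the dog sat"
for row in zipf_table(corpus):
    print(row)
```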
12. Precision and Recall
Evaluation Metrics
- Recall
- Percentage of all relevant documents that are found by a search
- Precision
- Percentage of retrieved documents that are relevant
13. F-measure
Evaluation Metrics
- Harmonic average of precision and recall: F = 2PR / (P + R)
- Rewards results that keep recall and precision close together
- R = 40, P = 60: R/P average = 50, F-measure = 48
- R = 45, P = 55: R/P average = 50, F-measure = 49.5
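The two examples above can be reproduced with a minimal sketch of the F-measure:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The slide's examples: the closer P/R pair scores higher despite equal averages.
print(f_measure(60, 40))  # 48.0
print(f_measure(55, 45))  # 49.5
```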
14. Break-Even Point
Evaluation Metrics
- The point at which recall equals precision
- The value at this point is used as the evaluation metric
15. Term Weights: A Brief Introduction
Feature Selection
- The words of a text are not equally indicative of its meaning
- Important: butterflies, monarchs, scientists, direction, compass
- Unimportant: most, think, kind, sky, determine, cues, learn
- Term weights reflect the (estimated) importance of each term

"Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north. Scientists think that butterflies may use other cues, such as the earth's magnetic field, but we have a lot to learn about monarchs' sense of direction."
16. Term Weights
Feature Selection
- Term frequency (TF)
- The more often a word occurs in a document, the better that term describes what the document is about
- Often normalized, e.g. by the length of the document
- Sometimes biased to the range 0.4..1.0 to reflect that even a single occurrence of a term is a significant event
17. Term Weights
Feature Selection
- Inverse document frequency (IDF)
- Terms that occur in many documents in the collection are less useful for discriminating among documents
- Document frequency (df): number of documents containing the term
- IDF often calculated as idf_t = log(N / df_t), with N the number of documents in the collection
- TF and IDF are used in combination as the product tf · idf
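The TF and IDF bullets above can be sketched in a few lines (a toy implementation; length-normalized tf and idf = log(N/df) follow the slides, but exact variants differ across systems):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.
    tf is normalized by document length; idf = log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
w = tf_idf(docs)
# "the" appears in every document, so its idf (and hence its weight) is zero.
```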
18. Vector Space Similarity
Feature Selection
- Similarity is inversely related to the angle between the vectors
- Measured as the cosine of the angle between the two vectors
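The cosine measure between two term-weight vectors, as a minimal sketch:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine_similarity([2, 0], [1, 0]))  # 1.0 -- same direction
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 -- orthogonal, no shared terms
```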
19. Feature Space Reduction
- Main reasons
- Improve accuracy of the algorithm
- Decrease the size of the data set
- Control the computation time
- Avoid overfitting
- Feature space reduction techniques
- Stopword removal, stemming
- Information gain
- Natural language processing
20. Stopword Removal
Feature Space Reduction
- Stopwords: words that are discarded from a document representation
- Function words: a, an, and, as, for, in, of, the, to, ...
- About 400 words in English
- Other frequent words can be corpus-specific: "Lotus" in a Lotus Support collection
21. Stemming
Feature Space Reduction
- Group morphological variants
- Plurals: streets → street
- Adverbs: fully → full
- Other inflected word forms: goes → go
- The grouping process is called conflation
- Current stemming algorithms make mistakes
- Conflating terms manually is difficult and time-consuming
- Automatic conflation using rules
- Porter Stemmer
- Porter stemming example: police, policy → polic
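A toy suffix-stripping rule in the spirit of conflation (this is not the Porter algorithm, just a sketch of rule-based stemming; a real stemmer has many more rules and guards):

```python
def simple_stem(word):
    """Strip a couple of plural/inflection suffixes -- a toy conflation
    rule, far cruder than the real Porter stemmer."""
    for suffix in ("es", "s"):
        # Only strip when a reasonable stem (>= 2 letters) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

print(simple_stem("streets"))  # street
print(simple_stem("goes"))     # go
```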
22. Information Gain
Feature Space Reduction
- Measures the information obtained by the presence or absence of a term in a document
- Feature space reduction by thresholding on the gain
- Biased toward common terms → a large reduction in the size of the data set cannot be achieved
23. Natural Language Processing
Feature Space Reduction
- Pick out the important words from a document
- For example, nouns, proper nouns, or verbs
- Ignoring all other parts
- Not biased toward common terms → reduction in both feature space and size of data
- Named entities
- The subset of proper nouns consisting of people, locations, and organizations
- Effective in news-story classification
24. Experimental Results
Robert Cooley, "Classification of News Stories Using Support Vector Machines," Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999.
- Data set
- From six news media sources
- Two print sources (New York Times and Associated Press Wire)
- Two television sources (ABC World News Tonight and CNN Headline News)
- Two radio sources (Public Radio International and Voice of America)
25. Experimental Results
- Results
- NLP → significant loss in recall and precision
- SVM >> kNN (using full text or information gain)
- Binary weighting → significant loss in recall
26. kNN
Classification Methods
- Stands for k-nearest-neighbor classification
- Algorithm
- Given a test document:
- Find the k nearest neighbors among the training documents
- Calculate and sort the scores of the candidate categories
- Threshold on these scores
- Decision rule
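The steps above can be sketched as follows (a minimal sketch; the cosine similarity measure, k, the threshold value, and the example vectors are illustrative choices, not from the slides):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(test_vec, training, k=3, threshold=0.5):
    """Score each category by the summed similarity of the k nearest
    training documents belonging to it; keep categories over the threshold."""
    neighbors = sorted(training,
                       key=lambda item: cosine(test_vec, item[0]),
                       reverse=True)[:k]
    scores = {}
    for vec, categories in neighbors:
        sim = cosine(test_vec, vec)
        for c in categories:
            scores[c] = scores.get(c, 0.0) + sim
    return {c: s for c, s in scores.items() if s > threshold}

training = [([1.0, 0.0, 0.0], {"sports"}),
            ([0.9, 0.1, 0.0], {"sports"}),
            ([0.0, 1.0, 0.0], {"politics"})]
print(knn_classify([1.0, 0.0, 0.0], training, k=2))
```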
27. LLSF
Classification Methods
- Stands for Linear Least Squares Fit
- Obtain a matrix of word-category regression coefficients by a linear least-squares fit
- The fitted mapping takes an arbitrary document to a vector of weighted categories
- By thresholding as in kNN, assign categories
28. Naïve Bayes
Classification Methods
- Assumption
- Words are drawn randomly from class-dependent lexicons (with replacement)
- Word independence: P(w_1, ..., w_n | Y) = Π_i P(w_i | Y)
- Result
- Classification rule: choose the class Y maximizing P(Y) Π_i P(w_i | Y)
29. Estimating the Parameters
Naïve Bayes
- Count frequencies in the training data
- Estimating P(Y)
- Fraction of positive/negative examples in the training data
- Estimating P(W|Y)
- Smoothing with the Laplace estimate
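Putting the two naive Bayes slides together, a minimal multinomial classifier with Laplace smoothing might look like this (a sketch under the word-independence assumption; the toy training data are illustrative):

```python
import math
from collections import Counter

def train_nb(docs):
    """Train multinomial naive Bayes. docs: list of (token_list, label)."""
    labels = Counter(y for _, y in docs)          # for the prior P(Y)
    word_counts = {y: Counter() for y in labels}  # for P(W|Y)
    for tokens, y in docs:
        word_counts[y].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    return labels, word_counts, vocab, len(docs)

def classify_nb(model, tokens):
    labels, word_counts, vocab, n = model
    best, best_lp = None, float("-inf")
    for y in labels:
        lp = math.log(labels[y] / n)              # log prior
        total = sum(word_counts[y].values())
        for t in tokens:
            # Laplace estimate: (count + 1) / (total + |V|)
            lp += math.log((word_counts[y][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

docs = [(["ball", "goal", "team"], "sports"),
        (["vote", "election"], "politics"),
        (["goal", "score"], "sports")]
model = train_nb(docs)
print(classify_nb(model, ["goal", "team"]))  # sports
```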
30. Experiment Results
Yiming Yang and Xin Liu, "A re-examination of text categorization methods," Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
31. Text Classification using SVM
T. Joachims, "A Statistical Learning Model of Text Classification with Support Vector Machines," Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- A statistical learning model of text classification with SVMs
- Training error is 0 if the data are linearly separable
32. Properties 1-2: Sparse Examples in High Dimension
- High-dimensional feature vectors (30,000 features)
- Sparse document vectors: only a few words of the whole language occur in each document
- SVMs use overfitting protection that does not depend on the dimension of the feature space
33. Property 3: Heterogeneous Use of Words
No pair of documents shares any words except "it", "the", "and", "of", "for", "an", "a", "not", "that", "in".
34. Property 4: High Level of Redundancy
Few features are irrelevant! Feature space reduction causes loss of information.
35. Property 5: Zipf's Law
Most words occur very infrequently!
36. TCat Concepts
- Modeling real text-classification tasks; used for the previous proof

TCat([20:20:100] high freq.,
     [4:1:200], [1:4:200], [5:5:600] medium freq.,
     [9:1:3000], [1:9:3000], [10:10:4000] low freq.)
37. TCat Concepts
- Margin of TCat concepts
- By Zipf's law, we can bound R²
- Intuitively, many words with low frequency → relatively short document vectors
- TCat concepts are linearly separable with a provable margin
38. TCat Concepts
- Bound on Expected Error of SVM
39. Text Classification using TSVM
T. Joachims, "Transductive Inference for Text Classification using Support Vector Machines," Proceedings of the International Conference on Machine Learning (ICML), 1999.
- How would you classify the test set?
- Training set: D1, D6
- Test set: D2, D3, D4, D5
40. Why Does Adding Test Examples Reduce Error?
41. Experiment Results
- Data set
- Reuters-21578 dataset, "ModApte" split
- Training: 9,603; test: 3,299
- WebKB collection of WWW pages
- Only the classes "course", "faculty", "project", and "student" are used
- Stemming and stopword removal are not used
- Ohsumed corpus compiled by William Hersh
- Training: 10,000; test: 10,000
42. Experiment Results
P/R break-even point for Reuters categories
43. Experiment Results
Average P/R break-even point on WebKB
Average P/R break-even point on Ohsumed
44. Hierarchical Text Classification
- Real-world classification → complex hierarchical structure
- Due to the difficulty of training with many classes or features

[Diagram: documents feed into Level 1 classes (Class 1, Class 2, Class 3); Class 1 splits into Level 2 subclasses Class 1-1, 1-2, 1-3, and Class 2 into Class 2-1]
45. Hierarchical Text Classification
- More accurate specialized classifiers

[Diagram: documents split into top-level categories Computers (subcategories Hardware, Software, Chat) and Sports (subcategories Soccer, Football); the word "computer" is discriminating at the top level but not within the Computers subtree]
46. Experiment Setting
S. Dumais and H. Chen, "Hierarchical classification of Web content," Proceedings of SIGIR '00, August 2000, pp. 256-263.
- Data set: LookSmart's web directory
- Using short summaries from a search engine
- 370,597 unique pages
- 17,173 categories
- 7-level hierarchy
- Focus on 13 top-level and 150 second-level categories
47. Experiment Setting
- Using SVM
- Posterior probabilities by regularized maximum-likelihood fitting
- Combining probabilities from the first and second levels
- Boolean scoring function: P(L1) and P(L2) must each exceed a threshold
- Multiplicative scoring function: the product P(L1) · P(L2) must exceed a threshold
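The two combination rules can be sketched as follows (the threshold values here are illustrative assumptions, not figures from the paper):

```python
def boolean_score(p_l1, p_l2, t1=0.5, t2=0.5):
    """Boolean combination: both levels must clear their own thresholds.
    (Threshold values are illustrative.)"""
    return p_l1 > t1 and p_l2 > t2

def multiplicative_score(p_l1, p_l2, t=0.25):
    """Multiplicative combination: the product of the two posterior
    probabilities must clear a single threshold."""
    return p_l1 * p_l2 > t

print(boolean_score(0.6, 0.4))         # False -- second level too weak
print(multiplicative_score(0.6, 0.7))  # True  -- a strong second level compensates
```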
48. Experiment Results
- Non-hierarchical (baseline): F1 = 0.476
- Hierarchical
- Top level
- Training set: F1 = 0.649
- Test set: F1 = 0.572
- Second level
- Multiplicative: F1 = 0.495
- Boolean: F1 = 0.497
- Assuming top-level classification is correct: F1 = 0.711
49. Summary
- Feature space reduction
- Performance of SVM and TSVM is better than that of other methods
- TSVM has merits in text classification
- Hierarchical classification is helpful
- Other issues
- Sampling strategies
- Other kinds of feature selection
50. References
- T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
- T. Joachims, "Transductive Inference for Text Classification using Support Vector Machines," Proceedings of the International Conference on Machine Learning (ICML), 1999.
- T. Joachims, "A Statistical Learning Model of Text Classification with Support Vector Machines," Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- R. Cooley, "Classification of News Stories Using Support Vector Machines," Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence Text Mining Workshop, August 1999.
- Y. Yang and X. Liu, "A re-examination of text categorization methods," Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
- S. Dumais and H. Chen, "Hierarchical classification of Web content," Proceedings of SIGIR '00, August 2000, pp. 256-263.