1. A Survey on Text Classification
- December 10, 2003
- 20033077 Dongho Kim
- KAIST
2. Contents
- Introduction
- Statistical Properties of Text
- Feature Selection
- Feature Space Reduction
- Classification Methods
- Using SVM and TSVM
- Hierarchical Text Classification
- Summary
3. Introduction
- Text classification
- Assign text to predefined categories based on content
- Types of text
- Documents (typical)
- Paragraphs
- Sentences
- WWW sites
- Different types of categories
- By topic
- By function
- By author
- By style
4. Text Classification Example
5. Computer-Based Text Classification Technologies
- Naive word-matching (Chute, Yang, Buntrock 1994)
- Finding shared words between the text and the names of categories
- Weakest method
- Cannot capture any conceptual relation
- Thesaurus-based matching (Lindberg, Humphreys 1990)
- Using lexical links
- Insensitive to context
- High cost and low adaptivity across domains
6. Computer-Based Text Classification Technologies
- Empirical learning of term-category associations
- Learning from a training set
- Fundamentally different from word-matching
- Statistically capturing the semantic association between terms and categories
- Context-sensitive mapping from terms to categories
- For example:
- Decision tree methods
- Bayesian belief networks
- Neural networks
- Nearest-neighbor classification methods
- Least-squares regression techniques
7. Statistical Properties of Text
- There are stable, language-independent patterns in how people use natural language
- A few words occur very frequently; most occur rarely
- In general:
- Top 2 words: 10-15% of all word occurrences
- Top 6 words: 20% of all word occurrences
- Top 50 words: 50% of all word occurrences

[Table: most common words from Tom Sawyer]
8. Statistical Properties of Text
- The most frequent words in one corpus may be rare words in another corpus
- Example: "computer" in CACM vs. National Geographic
- Each corpus has a different, fairly small working vocabulary
- These properties hold in a wide range of languages
9. Statistical Properties of Text
- Summary
- Term usage is highly skewed, but in a predictable pattern
- Why is it important to know the characteristics of text?
- Optimization of data structures
- Statistical retrieval algorithms depend on them
10. Statistical Profiles
- Can act as a summarization device
- Indicate what a document is about
- Indicate what a collection is about
11. Zipf's Law
- Zipf's Law relates a term's frequency to its rank
- Rank the terms in a vocabulary by frequency, in descending order
- Frequency ∝ 1/rank: there is a constant k such that f · r = k
- Empirical observation: the probability of the term of rank r is approximately A/r
- A ≈ 0.1 for English
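Zipf's law is easy to check empirically. The sketch below (a toy illustration; the corpus is invented for the example, not from the slides) ranks terms by frequency and reports rank × frequency, which the law predicts is roughly constant:

```python
from collections import Counter

def zipf_table(text, top=5):
    """Rank terms by frequency; Zipf's law predicts that
    rank * frequency is roughly constant."""
    counts = Counter(text.lower().split())
    ranked = counts.most_common(top)
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy corpus; real corpora (e.g. Tom Sawyer) show the effect far more clearly.
corpus = "the cat sat on the mat the cat ran the dog sat"
for row in zipf_table(corpus):
    print(row)
```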
12. Precision and Recall
Evaluation Metrics
- Recall
- Percentage of all relevant documents that are found by a search
- Precision
- Percentage of retrieved documents that are relevant
13. F-measure
Evaluation Metrics
- Harmonic average of precision and recall: F = 2PR / (P + R)
- Rewards results that keep recall and precision close together
- R = 40, P = 60: R/P average = 50, F-measure = 48
- R = 45, P = 55: R/P average = 50, F-measure = 49.5
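The two examples above can be reproduced with a minimal sketch of the F-measure:

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the F1 measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# The slide's examples: the closer P/R pair scores higher despite equal averages.
print(f_measure(60, 40))  # 48.0
print(f_measure(55, 45))  # 49.5
```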
14. Break-Even Point
Evaluation Metrics
- The point at which recall equals precision
- The value at this point is used as the evaluation metric
15. Term Weights: A Brief Introduction
Feature Selection
- The words of a text are not equally indicative of its meaning
- Important: butterflies, monarchs, scientists, direction, compass
- Unimportant: most, think, kind, sky, determine, cues, learn
- Term weights reflect the (estimated) importance of each term

"Most scientists think that butterflies use the position of the sun in the sky as a kind of compass that allows them to determine which way is north. Scientists think that butterflies may use other cues, such as the earth's magnetic field, but we have a lot to learn about monarchs' sense of direction."
16. Term Weights
Feature Selection
- Term frequency (TF)
- The more often a word occurs in a document, the better that term describes what the document is about
- Often normalized, e.g. by the length of the document
- Sometimes biased to the range 0.4..1.0 to reflect that even a single occurrence of a term is a significant event
17. Term Weights
Feature Selection
- Inverse document frequency (IDF)
- Terms that occur in many documents in the collection are less useful for discriminating among documents
- Document frequency (df): number of documents containing the term
- IDF often calculated as idf_t = log(N / df_t), with N the number of documents in the collection
- TF and IDF are used in combination as the product tf · idf
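The TF and IDF bullets above can be sketched in a few lines (a toy implementation; length-normalized tf and idf = log(N/df) follow the slides, but exact variants differ across systems):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights for a list of tokenized documents.
    tf is normalized by document length; idf = log(N / df)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))          # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t])
                        for t in tf})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
w = tf_idf(docs)
# "the" appears in every document, so its idf (and hence its weight) is zero.
```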
18. Vector Space Similarity
Feature Selection
- Similarity is inversely related to the angle between the vectors
- Measured as the cosine of the angle between the two vectors
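The cosine measure between two term-weight vectors, as a minimal sketch:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine_similarity([2, 0], [1, 0]))  # 1.0 -- same direction
print(cosine_similarity([1, 0], [0, 1]))  # 0.0 -- orthogonal, no shared terms
```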
19. Feature Space Reduction
- Main reasons
- Improve accuracy of the algorithm
- Decrease the size of the data set
- Control the computation time
- Avoid overfitting
- Feature space reduction techniques
- Stopword removal, stemming
- Information gain
- Natural language processing
20. Stopword Removal
Feature Space Reduction
- Stopwords: words that are discarded from a document representation
- Function words: a, an, and, as, for, in, of, the, to, ...
- About 400 words in English
- Other frequent words can be corpus-specific: "Lotus" in a Lotus Support collection
21. Stemming
Feature Space Reduction
- Group morphological variants
- Plurals: streets → street
- Adverbs: fully → full
- Other inflected word forms: goes → go
- The grouping process is called conflation
- Current stemming algorithms make mistakes
- Conflating terms manually is difficult and time-consuming
- Automatic conflation using rules
- Porter Stemmer
- Porter stemming example: police, policy → polic
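A toy suffix-stripping rule in the spirit of conflation (this is not the Porter algorithm, just a sketch of rule-based stemming; a real stemmer has many more rules and guards):

```python
def simple_stem(word):
    """Strip a couple of plural/inflection suffixes -- a toy conflation
    rule, far cruder than the real Porter stemmer."""
    for suffix in ("es", "s"):
        # Only strip when a reasonable stem (>= 2 letters) remains.
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

print(simple_stem("streets"))  # street
print(simple_stem("goes"))     # go
```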
22. Information Gain
Feature Space Reduction
- Measures the information obtained by the presence or absence of a term in a document
- Feature space reduction by thresholding on the gain
- Biased toward common terms → a large reduction in the size of the data set cannot be achieved
23. Natural Language Processing
Feature Space Reduction
- Pick out the important words from a document
- For example, nouns, proper nouns, or verbs
- Ignoring all other parts
- Not biased toward common terms → reduction in both feature space and size of data
- Named entities
- The subset of proper nouns consisting of people, locations, and organizations
- Effective in news-story classification
24. Experimental Results
Robert Cooley, "Classification of News Stories Using Support Vector Machines," Proceedings of the 16th International Joint Conference on Artificial Intelligence Text Mining Workshop, 1999.
- Data set
- From six news media sources
- Two print sources (New York Times and Associated Press Wire)
- Two television sources (ABC World News Tonight and CNN Headline News)
- Two radio sources (Public Radio International and Voice of America)
25. Experimental Results
- Results
- NLP → significant loss in recall and precision
- SVM >> kNN (using full text or information gain)
- Binary weighting → significant loss in recall
26. kNN
Classification Methods
- Stands for k-nearest-neighbor classification
- Algorithm
- Given a test document:
- Find the k nearest neighbors among the training documents
- Calculate and sort the scores of the candidate categories
- Threshold on these scores
- Decision rule
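The steps above can be sketched as follows (a minimal sketch; the cosine similarity measure, k, the threshold value, and the example vectors are illustrative choices, not from the slides):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def knn_classify(test_vec, training, k=3, threshold=0.5):
    """Score each category by the summed similarity of the k nearest
    training documents belonging to it; keep categories over the threshold."""
    neighbors = sorted(training,
                       key=lambda item: cosine(test_vec, item[0]),
                       reverse=True)[:k]
    scores = {}
    for vec, categories in neighbors:
        sim = cosine(test_vec, vec)
        for c in categories:
            scores[c] = scores.get(c, 0.0) + sim
    return {c: s for c, s in scores.items() if s > threshold}

training = [([1.0, 0.0, 0.0], {"sports"}),
            ([0.9, 0.1, 0.0], {"sports"}),
            ([0.0, 1.0, 0.0], {"politics"})]
print(knn_classify([1.0, 0.0, 0.0], training, k=2))
```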
27. LLSF
Classification Methods
- Stands for Linear Least Squares Fit
- Obtain a matrix of word-category regression coefficients by a linear least-squares fit
- The fitted mapping takes an arbitrary document to a vector of weighted categories
- By thresholding as in kNN, assign categories
28. Naïve Bayes
Classification Methods
- Assumption
- Words are drawn randomly from class-dependent lexicons (with replacement)
- Word independence: P(w_1, ..., w_n | Y) = Π_i P(w_i | Y)
- Result
- Classification rule: choose the class Y maximizing P(Y) Π_i P(w_i | Y)
29. Estimating the Parameters
Naïve Bayes
- Count frequencies in the training data
- Estimating P(Y)
- Fraction of positive/negative examples in the training data
- Estimating P(W|Y)
- Smoothing with the Laplace estimate
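Putting the two naive Bayes slides together, a minimal multinomial classifier with Laplace smoothing might look like this (a sketch under the word-independence assumption; the toy training data are illustrative):

```python
import math
from collections import Counter

def train_nb(docs):
    """Train multinomial naive Bayes. docs: list of (token_list, label)."""
    labels = Counter(y for _, y in docs)          # for the prior P(Y)
    word_counts = {y: Counter() for y in labels}  # for P(W|Y)
    for tokens, y in docs:
        word_counts[y].update(tokens)
    vocab = {t for tokens, _ in docs for t in tokens}
    return labels, word_counts, vocab, len(docs)

def classify_nb(model, tokens):
    labels, word_counts, vocab, n = model
    best, best_lp = None, float("-inf")
    for y in labels:
        lp = math.log(labels[y] / n)              # log prior
        total = sum(word_counts[y].values())
        for t in tokens:
            # Laplace estimate: (count + 1) / (total + |V|)
            lp += math.log((word_counts[y][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = y, lp
    return best

docs = [(["ball", "goal", "team"], "sports"),
        (["vote", "election"], "politics"),
        (["goal", "score"], "sports")]
model = train_nb(docs)
print(classify_nb(model, ["goal", "team"]))  # sports
```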
30. Experiment Results
Yiming Yang and Xin Liu, "A re-examination of text categorization methods," Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
31. Text Classification using SVM
T. Joachims, "A Statistical Learning Model of Text Classification with Support Vector Machines," Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- A statistical learning model of text classification with SVMs
- Training error is 0 if the data are linearly separable
32. Properties 1-2: Sparse Examples in High Dimension
- High-dimensional feature vectors (30,000 features)
- Sparse document vectors: only a few words of the whole language occur in each document
- SVMs use overfitting protection that does not depend on the dimension of the feature space
33. Property 3: Heterogeneous Use of Words
No pair of documents shares any words except "it", "the", "and", "of", "for", "an", "a", "not", "that", "in".
34. Property 4: High Level of Redundancy
Few features are irrelevant! Feature space reduction causes loss of information.
35. Property 5: Zipf's Law
Most words occur very infrequently!
36. TCat Concepts
- Modeling real text-classification tasks; used for the previous proof

TCat([20:20:100] high freq.,
     [4:1:200], [1:4:200], [5:5:600] medium freq.,
     [9:1:3000], [1:9:3000], [10:10:4000] low freq.)
37. TCat Concepts
- Margin of TCat concepts
- By Zipf's law, we can bound R²
- Intuitively, many words with low frequency → relatively short document vectors
- TCat concepts are linearly separable with a provable margin
38. TCat Concepts
- Bound on Expected Error of SVM
39. Text Classification using TSVM
T. Joachims, "Transductive Inference for Text Classification using Support Vector Machines," Proceedings of the International Conference on Machine Learning (ICML), 1999.
- How would you classify the test set?
- Training set: D1, D6
- Test set: D2, D3, D4, D5
40. Why Does Adding Test Examples Reduce Error?
41. Experiment Results
- Data set
- Reuters-21578 dataset, "ModApte" split
- Training: 9,603; test: 3,299
- WebKB collection of WWW pages
- Only the classes "course", "faculty", "project", and "student" are used
- Stemming and stopword removal are not used
- Ohsumed corpus compiled by William Hersh
- Training: 10,000; test: 10,000
42. Experiment Results
P/R break-even point for Reuters categories
43. Experiment Results
Average P/R break-even point on WebKB
Average P/R break-even point on Ohsumed
44. Hierarchical Text Classification
- Real-world classification → complex hierarchical structure
- Due to the difficulty of training with many classes or features

[Diagram: documents feed into Level 1 classes (Class 1, Class 2, Class 3); Class 1 splits into Level 2 subclasses Class 1-1, 1-2, 1-3, and Class 2 into Class 2-1]
45. Hierarchical Text Classification
- More accurate specialized classifiers

[Diagram: documents split into top-level categories Computers (subcategories Hardware, Software, Chat) and Sports (subcategories Soccer, Football); the word "computer" is discriminating at the top level but not within the Computers subtree]
46. Experiment Setting
S. Dumais and H. Chen, "Hierarchical classification of Web content," Proceedings of SIGIR '00, August 2000, pp. 256-263.
- Data set: LookSmart's web directory
- Using short summaries from a search engine
- 370,597 unique pages
- 17,173 categories
- 7-level hierarchy
- Focus on 13 top-level and 150 second-level categories
47. Experiment Setting
- Using SVM
- Posterior probabilities by regularized maximum-likelihood fitting
- Combining probabilities from the first and second levels
- Boolean scoring function: P(L1) and P(L2) must each exceed a threshold
- Multiplicative scoring function: the product P(L1) · P(L2) must exceed a threshold
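The two combination rules can be sketched as follows (the threshold values here are illustrative assumptions, not figures from the paper):

```python
def boolean_score(p_l1, p_l2, t1=0.5, t2=0.5):
    """Boolean combination: both levels must clear their own thresholds.
    (Threshold values are illustrative.)"""
    return p_l1 > t1 and p_l2 > t2

def multiplicative_score(p_l1, p_l2, t=0.25):
    """Multiplicative combination: the product of the two posterior
    probabilities must clear a single threshold."""
    return p_l1 * p_l2 > t

print(boolean_score(0.6, 0.4))         # False -- second level too weak
print(multiplicative_score(0.6, 0.7))  # True  -- a strong second level compensates
```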
48. Experiment Results
- Non-hierarchical (baseline): F1 = 0.476
- Hierarchical
- Top level
- Training set: F1 = 0.649
- Test set: F1 = 0.572
- Second level
- Multiplicative: F1 = 0.495
- Boolean: F1 = 0.497
- Assuming top-level classification is correct: F1 = 0.711
49. Summary
- Feature space reduction
- Performance of SVM and TSVM is better than that of other methods
- TSVM has merits in text classification
- Hierarchical classification is helpful
- Other issues
- Sampling strategies
- Other kinds of feature selection
50. References
- T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," Proceedings of the European Conference on Machine Learning (ECML), Springer, 1998.
- T. Joachims, "Transductive Inference for Text Classification using Support Vector Machines," Proceedings of the International Conference on Machine Learning (ICML), 1999.
- T. Joachims, "A Statistical Learning Model of Text Classification with Support Vector Machines," Proceedings of the Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2001.
- R. Cooley, "Classification of News Stories Using Support Vector Machines," Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence Text Mining Workshop, August 1999.
- Y. Yang and X. Liu, "A re-examination of text categorization methods," Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval, 1999.
- S. Dumais and H. Chen, "Hierarchical classification of Web content," Proceedings of SIGIR '00, August 2000, pp. 256-263.