1
A Survey on Text Classification
  • December 10, 2003
  • 20033077 Dongho Kim
  • KAIST

2
Contents
  • Introduction
  • Statistical Properties of Text
  • Feature Selection
  • Feature Space Reduction
  • Classification Methods
  • Using SVM and TSVM
  • Hierarchical Text Classification
  • Summary

3
Introduction
  • Text classification
  • Assign text to predefined categories based on
    content
  • Types of text
  • Documents (typical)
  • Paragraphs
  • Sentences
  • WWW-Sites
  • Different types of categories
  • By topic
  • By function
  • By author
  • By style

4
Text Classification Example
5
Computer-Based Text Classification Technologies
  • Naive word-matching (Chute, Yang & Buntrock, 1994)
  • Finding shared words between the text and the names of categories
  • Weakest method
  • Cannot capture any conceptual relations
  • Thesaurus-based matching (Lindberg & Humphreys, 1990)
  • Using lexical links
  • Insensitive to the context
  • High cost and low adaptivity across domains

6
Computer-Based Text Classification Technologies
  • Empirical learning of term-category associations
  • Learning from a training set
  • Fundamentally different from word-matching
  • Statistically capturing the semantic association
    between terms and categories
  • Context sensitive mapping from terms to
    categories
  • For example,
  • Decision tree methods
  • Bayesian belief networks
  • Neural networks
  • Nearest neighbor classification methods
  • Least-squares regression techniques

7
Statistical Properties of Text
  • There are stable, language-independent patterns
    in how people use natural language
  • A few words occur very frequently; most occur rarely
  • In general
  • Top 2 words: 10–15% of all word occurrences
  • Top 6 words: 20% of all word occurrences
  • Top 50 words: 50% of all word occurrences

[Table: the most common words from Tom Sawyer with their frequencies]
8
Statistical Properties of Text
  • The most frequent words in one corpus may be rare
    words in another corpus
  • Example: "computer" in CACM vs. National Geographic
  • Each corpus has a different, fairly small
    working vocabulary

These properties hold in a wide range of languages
9
Statistical Properties of Text
  • Summary
  • Term usage is highly skewed, but in a predictable
    pattern
  • Why is it important to know the characteristics
    of text?
  • Optimization of data structures
  • Statistical retrieval algorithms depend on them

10
Statistical Profiles
  • Can act as a summarization device
  • Indicate what a document is about
  • Indicate what a collection is about

11
Zipf's Law
  • Zipf's Law relates a term's frequency to its rank
  • Frequency ∝ 1 / rank
  • Rank the terms in a vocabulary by frequency, in descending order
  • Empirical observation: frequency × rank ≈ constant
  • Hence, with frequency taken as the fraction of all word occurrences, the constant is roughly 0.1 for English (checked empirically in the sketch below)
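
As a rough illustration (not part of the original slides), the following Python sketch counts word frequencies in a corpus and prints frequency × rank for the top-ranked words; Zipf's law predicts this product to be roughly constant. The file name corpus.txt is an assumption.

from collections import Counter

# Assumed input: a plain-text corpus file; any sizable English text will do.
text = open("corpus.txt", encoding="utf-8").read().lower()
counts = Counter(text.split())
total = sum(counts.values())

# Print relative frequency and frequency*rank for the 20 most common words.
for rank, (word, freq) in enumerate(counts.most_common(20), start=1):
    rel = freq / total
    print(f"{rank:3d}  {word:<12}  f={rel:.4f}  f*r={rel * rank:.3f}")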

12
Precision and Recall
Evaluation Metrics
  • Recall
  • Percentage of all relevant documents that are
    found by a search
  • Precision
  • Percentage of retrieved documents that are
    relevant

13
F-measure
Evaluation Metrics
Harmonic average of precision and recall
  • Rewards results that keep recall and precision
    close together
  • R = 40%, P = 60%: arithmetic average = 50%, F-measure = 48%
  • R = 45%, P = 55%: arithmetic average = 50%, F-measure = 49.5% (see the sketch below)
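
The two examples above follow directly from the harmonic-mean formula F = 2PR / (P + R); a minimal Python check:

# F-measure: harmonic mean of precision P and recall R.
def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

print(f_measure(0.60, 0.40))   # 0.48   (arithmetic average would be 0.50)
print(f_measure(0.55, 0.45))   # 0.495  (arithmetic average would be 0.50)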

14
Break Even Point
Evaluation Metrics
  • The point at which recall equals precision

Evaluation metric: the recall (= precision) value at this point
15
Term Weights A Brief Introduction
Feature Selection
  • The words of a text are not equally indicative of
    its meaning
  • Important: butterflies, monarchs, scientists, direction, compass
  • Unimportant: most, think, kind, sky, determine, cues, learn
  • Term weights reflect the (estimated) importance
    of each term

Most scientists think that butterflies use the
position of the sun in the sky as a kind of
compass that allows them to determine which way
is north. Scientists think that butterflies may
use other cues, such as the earth's magnetic
field, but we have a lot to learn about monarchs'
sense of direction.
16
Term Weights
Feature Selection
  • Term frequency (TF)
  • The more often a word occurs in a document, the
    better that term is in describing what the
    document is about
  • Often normalized, e.g. by the length of the
    document
  • Sometimes biased to range 0.4..1.0 to represent
    the fact that even a single occurrence of a term
    is a significant event

17
Term Weights
Feature Selection
  • Inverse document frequency (IDF)
  • Terms that occur in many documents in the
    collection are less useful for discriminating
    among documents
  • Document frequency (df): the number of documents containing the term
  • IDF is often calculated as idf(t) = log(N / df(t)), where N is the number of documents in the collection
  • TF and IDF are used in combination as a product, tf × idf (see the sketch below)
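
A minimal from-scratch sketch of the TF × IDF product, using idf(t) = log(N / df(t)); the exact normalization and smoothing vary between implementations, and the toy documents are illustrative only.

import math
from collections import Counter

docs = ["butterflies use the sun as a compass",
        "scientists study the magnetic field",
        "butterflies and monarchs fly north"]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: number of documents containing each term.
df = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    tf = Counter(doc)
    # TF normalized by document length, multiplied by IDF = log(N / df).
    return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

print(tfidf(tokenized[0]))   # terms occurring in more documents get lower weight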

18
Vector Space Similarity
Feature Selection
  • Similarity is inversely related to the angle between the document vectors
  • Measured as the cosine of the angle between the two vectors, cos(x, y) = (x · y) / (‖x‖ ‖y‖) (see the sketch below)
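
A minimal sketch of the cosine measure on sparse term-weight vectors (dictionaries mapping term to weight); the example vectors are illustrative.

import math

def cosine(x, y):
    # cos(x, y) = (x . y) / (|x| * |y|) on sparse dict vectors.
    dot = sum(w * y[t] for t, w in x.items() if t in y)
    nx = math.sqrt(sum(w * w for w in x.values()))
    ny = math.sqrt(sum(w * w for w in y.values()))
    return dot / (nx * ny) if nx and ny else 0.0

print(cosine({"butterfly": 0.8, "compass": 0.5},
             {"butterfly": 0.6, "magnet": 0.9}))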

19
Feature Space Reduction
  • Main reasons
  • Improve accuracy of the algorithm
  • Decrease the size of data set
  • Control the computation time
  • Avoid overfitting
  • Feature space reduction techniques
  • Stopword removal, stemming
  • Information gain
  • Natural language processing

20
Stopword Removal
Feature Space Reduction
  • Stopwords: words that are discarded from a document representation
  • Function words: a, an, and, as, for, in, of, the, to, …
  • About 400 such words in English
  • Other frequent, domain-specific words: e.g. "Lotus" in a Lotus support collection

21
Stemming
Feature Space Reduction
  • Group morphological variants
  • Plurals: streets → street
  • Adverbs: fully → full
  • Other inflected word forms: goes → go
  • The grouping process is called conflation
  • Current stemming algorithms make mistakes
  • Conflating terms manually is difficult and time-consuming
  • Automatic conflation using rules
  • Porter Stemmer (see the sketch below)
  • Porter stemming example: police, policy → polic
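
A minimal sketch using NLTK's Porter stemmer (assumes the nltk package is installed); the exact stems depend on the stemmer version, so the conflation shown on the slide may differ slightly.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["streets", "fully", "goes", "police", "policy"]:
    # Print each surface form and the stem the Porter algorithm assigns to it.
    print(word, "->", stemmer.stem(word))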

22
Information Gain
Feature Space Reduction
  • Measures the information obtained by the presence or absence of a term in a document (see the sketch below)
  • Feature space reduction by thresholding the gain
  • Biased toward common terms → a large reduction in the size of the data set cannot be achieved
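
A minimal sketch of information gain for a single binary term feature, IG(t) = H(C) − P(t)·H(C | t present) − P(¬t)·H(C | t absent); the toy data are illustrative only.

import math

def entropy(labels):
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n)
                for c in (labels.count(l) for l in set(labels)))

def info_gain(term_present, labels):
    # Split the class labels by whether the term occurs in the document.
    pos = [y for t, y in zip(term_present, labels) if t]
    neg = [y for t, y in zip(term_present, labels) if not t]
    p = len(pos) / len(labels)
    return entropy(labels) - p * entropy(pos) - (1 - p) * entropy(neg)

present = [True, True, False, False, True, False]                  # term occurs?
classes = ["sports", "sports", "politics", "politics", "sports", "politics"]
print(info_gain(present, classes))   # 1.0 bit: the term splits the classes perfectly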

23
Natural Language Processing
Feature Space Reduction
  • Pick out the important words from a document
  • For example, nouns, proper nouns, or verbs
  • Ignoring all other parts
  • Not biased toward common terms → reduction in both feature space and data set size
  • Named entities
  • The subset of proper nouns consisting of people, locations, and organizations
  • Effective in news-story classification

24
Experimental Results
Robert Cooley, Classification of News Stories
Using Support Vector Machines, Proceedings of the
16th International Joint Conference on Artificial
Intelligence Text Mining Workshop, 1999
  • Data set
  • From six news media sources
  • Two print sources (New York Times and Associated
    Press Wire)
  • Two television sources (ABC World News Tonight
    and CNN Headline News)
  • Two radio sources (Public Radio International and
    Voice of America)

25
Experimental Results
Robert Cooley, Classification of News Stories
Using Support Vector Machines, Proceedings of the
16th International Joint Conference on Artificial
Intelligence Text Mining Workshop, 1999
  • Results
  • NLP → significant loss in recall and precision
  • SVM >> kNN (using full text or information gain)
  • Binary weighting → significant loss in recall

26
kNN
Classification Methods
  • Stands for k-nearest-neighbor classification
  • Algorithm
  • Given a test document,
  • Find the k nearest neighbors among the training documents
  • Calculate and sort the scores of the candidate categories
  • Threshold on these scores
  • Decision rule: assign the categories whose scores exceed the threshold (see the sketch below)
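
A minimal sketch of the procedure with scikit-learn (assumed available): TF-IDF document vectors, cosine distance for the neighbor search, and the library's default majority decision in place of explicit score thresholding; the toy training data are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_docs = ["stock markets fell sharply", "the team won the final match",
              "interest rates were raised", "the striker scored two goals"]
train_labels = ["finance", "sports", "finance", "sports"]

# k nearest neighbors under cosine distance on TF-IDF vectors.
knn = make_pipeline(TfidfVectorizer(),
                    KNeighborsClassifier(n_neighbors=3, metric="cosine"))
knn.fit(train_docs, train_labels)
print(knn.predict(["the goalkeeper saved the match"]))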

27
LLSF
Classification Methods
  • Stands for Linear Least Squares Fit
  • Obtain a matrix of word-category regression coefficients by a linear least-squares fit
  • F_LS maps an arbitrary document vector to a vector of weighted categories
  • Assign categories by thresholding, as in kNN (see the sketch below)
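
A minimal sketch of the fit with NumPy (assumed available): solve for the word-to-category coefficient matrix W that minimizes ‖AW − B‖², where A holds training document term vectors and B the corresponding category indicator vectors, then threshold the scores of a new document; all data here are illustrative.

import numpy as np

# A: training documents as term vectors (docs x terms).
A = np.array([[1, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 0, 1]], dtype=float)
# B: category indicator vectors (docs x categories).
B = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)

# Least-squares solution of A W = B gives the word-category coefficients.
W, *_ = np.linalg.lstsq(A, B, rcond=None)

x = np.array([1, 1, 1, 0], dtype=float)   # term vector of a new document
scores = x @ W                            # vector of weighted categories
print(scores, scores > 0.5)               # assign categories by thresholding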

28
Naïve Bayes
Classification Methods
  • Assumption
  • Words are drawn randomly from class-dependent lexicons (with replacement)
  • Word independence
  • Result: P(c | d) ∝ P(c) ∏_i P(w_i | c)

Word independence: P(w_1, …, w_n | c) = ∏_i P(w_i | c)
Classification rule: c* = argmax_c P(c) ∏_i P(w_i | c)
29
Estimating the Parameters
Naïve Bayes
  • Count frequencies in the training data
  • Estimating P(Y)
  • Fraction of positive / negative examples in the training data
  • Estimating P(W | Y)
  • Smoothing with the Laplace estimate (see the sketch below)
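
A minimal sketch of the two estimates with Laplace (add-one) smoothing on a multinomial word model; the toy documents are illustrative only.

from collections import Counter, defaultdict

docs = [("win the match", "sports"), ("stock prices fell", "finance"),
        ("the team can win", "sports"), ("prices of shares rose", "finance")]

vocab = {w for text, _ in docs for w in text.split()}
class_counts = Counter(y for _, y in docs)
word_counts = defaultdict(Counter)
for text, y in docs:
    word_counts[y].update(text.split())

def p_class(y):
    # P(Y = y): fraction of training documents with label y.
    return class_counts[y] / len(docs)

def p_word_given_class(w, y):
    # P(W = w | Y = y) with Laplace smoothing: (count + 1) / (total + |V|).
    total = sum(word_counts[y].values())
    return (word_counts[y][w] + 1) / (total + len(vocab))

print(p_class("sports"), p_word_given_class("win", "sports"))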

30
Experiment Results
Yiming Yang and Xin Liu, A re-examination of text
categorization methods, Proceedings of ACM SIGIR
Conference on Research and Development in
Information Retrieval, 1999.
31
Text Classification using SVM
T. Joachims, A Statistical Learning Model of Text
Classification with Support Vector
Machines, Proceedings of the Conference on
Research and Development in Information Retrieval
(SIGIR), ACM, 2001.
  • A statistical learning model of text
    classification with SVMs

[Equation: the SVM training objective; the error term is 0 if the data are linearly separable; see the sketch below]
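
A minimal sketch of a linear SVM text classifier with scikit-learn (assumed available), operating on the kind of high-dimensional, sparse TF-IDF vectors the analysis above is about; the toy training data are illustrative only.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

train_docs = ["oil prices rose on supply fears", "the champions won the cup",
              "central bank cuts interest rates", "injury forces striker to retire"]
train_labels = ["finance", "sports", "finance", "sports"]

# Linear-kernel SVM on sparse TF-IDF document vectors.
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(train_docs, train_labels)
print(clf.predict(["bank shares rose after the rate decision"]))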
32
Properties 1 & 2: Sparse Examples in High Dimension
T. Joachims, A Statistical Learning Model of Text
Classification with Support Vector
Machines, Proceedings of the Conference on
Research and Development in Information Retrieval
(SIGIR), ACM, 2001.
  • High-dimensional feature vectors (30,000 features)
  • Sparse document vectors: only a few words of the whole language occur in each document
  • SVMs use overfitting protection that does not depend on the dimensionality of the feature space

33
Property 3: Heterogeneous Use of Words
T. Joachims, A Statistical Learning Model of Text
Classification with Support Vector
Machines, Proceedings of the Conference on
Research and Development in Information Retrieval
(SIGIR), ACM, 2001.
No pair of documents shares any words except "it", "the", "and", "of", "for", "an", "a", "not", "that", and "in".
34
Property 4: High Level of Redundancy
T. Joachims, A Statistical Learning Model of Text
Classification with Support Vector
Machines, Proceedings of the Conference on
Research and Development in Information Retrieval
(SIGIR), ACM, 2001.
Few features are irrelevant! Feature space
reduction causes loss of information
35
Property 5: Zipf's Law
T. Joachims, A Statistical Learning Model of Text
Classification with Support Vector
Machines, Proceedings of the Conference on
Research and Development in Information Retrieval
(SIGIR), ACM, 2001.
Most words occur very infrequently!
36
TCat Concepts
T. Joachims, A Statistical Learning Model of Text
Classification with Support Vector
Machines, Proceedings of the Conference on
Research and Development in Information Retrieval
(SIGIR), ACM, 2001.
Modeling real text-classification tasks; used for the preceding proof
TCat([20:20:100] high freq.,
     [4:1:200], [1:4:200], [5:5:600] medium freq.,
     [9:1:3000], [1:9:3000], [10:10:4000] low freq.)
37
TCat Concepts
T. Joachims, A Statistical Learning Model of Text
Classification with Support Vector
Machines, Proceedings of the Conference on
Research and Development in Information Retrieval
(SIGIR), ACM, 2001.
  • Margin of TCat-concepts: TCat-concepts are linearly separable, with a lower bound on the margin δ determined by the concept parameters
  • By Zipf's law, we can bound R², the maximum squared length of the document vectors
  • Intuitively, many words with low frequency → relatively short document vectors
38
TCat Concepts
T. Joachims, A Statistical Learning Model of Text
Classification with Support Vector
Machines, Proceedings of the Conference on
Research and Development in Information Retrieval
(SIGIR), ACM, 2001.
  • Bound on the expected error of the SVM, derived from the margin lower bound and the bound on R² above

39
Text Classification using TSVM
T. Joachims, Transductive Inference for Text
Classification using Support Vector
Machines, Proceedings of the International
Conference on Machine Learning (ICML), 1999.
  • How would you classify the test set?
  • Training set: D1, D6
  • Test set: D2, D3, D4, D5

40
Why Does Adding Test Examples Reduce Error?
T. Joachims, Transductive Inference for Text
Classification using Support Vector
Machines, Proceedings of the International
Conference on Machine Learning (ICML), 1999.
41
Experiment Results
T. Joachims, Transductive Inference for Text
Classification using Support Vector
Machines, Proceedings of the International
Conference on Machine Learning (ICML), 1999.
  • Data set
  • Reuters-21578 dataset, ModApte split
  • Training: 9,603 documents; test: 3,299 documents
  • WebKB collection of WWW pages
  • Only the classes course, faculty, project, and student are used
  • Stemming and stopword removal are not used
  • Ohsumed corpus, compiled by William Hersh
  • Training: 10,000 documents; test: 10,000 documents

42
Experiment Results
T. Joachims, Transductive Inference for Text
Classification using Support Vector
Machines, Proceedings of the International
Conference on Machine Learning (ICML), 1999.
  • Results

P/R-breakeven point for Reuters categories
43
Experiment Results
T. Joachims, Transductive Inference for Text
Classification using Support Vector
Machines, Proceedings of the International
Conference on Machine Learning (ICML), 1999.
  • Results

Average P/R-breakeven point on WebKB
Average P/R-breakeven point on Ohsumed
44
Hierarchical Text Classification
  • Real-world classification → complex hierarchical structure
  • Due to the difficulty of training with many classes or features

[Diagram: documents are assigned to Level 1 classes (Class 1, Class 2, Class 3, …) and then to Level 2 subclasses (Class 1-1, Class 1-2, Class 1-3, Class 2-1, …)]
45
Hierarchical Text Classification
  • More accurate specialized classifiers

[Diagram: documents are first split into top-level categories (Computers vs. Sports), where the word "computer" is discriminating; within the Computers subtree (Hardware, Software, Chat) it is no longer discriminating]
46
Experiment Setting
S. Dumais and H. Chen, Hierarchical
classification of Web content. Proceedings of
SIGIR'00, August 2000, pp. 256-263.
  • Data set: LookSmart's web directory
  • Using short summaries from the search engine
  • 370,597 unique pages
  • 17,173 categories
  • 7-level hierarchy
  • Focus on the 13 top-level and 150 second-level categories

47
Experiment Setting
S. Dumais and H. Chen, Hierarchical
classification of Web content. Proceedings of
SIGIR'00, August 2000, pp. 256-263.
  • Using SVM
  • Posterior probabilities by regularized maximum-likelihood fitting
  • Combining probabilities from the first and second levels
  • Boolean scoring function: P(L1) > t1 AND P(L2) > t2
  • Multiplicative scoring function: P(L1) × P(L2) > t (see the sketch below)
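
A minimal sketch of the two combination rules; the threshold values are illustrative assumptions, not those tuned in the paper.

def boolean_decision(p_l1, p_l2, t1=0.5, t2=0.5):
    # Assign the leaf category only if both levels pass their thresholds.
    return p_l1 > t1 and p_l2 > t2

def multiplicative_decision(p_l1, p_l2, t=0.25):
    # Assign the leaf category if the product of the posteriors passes a threshold.
    return p_l1 * p_l2 > t

print(boolean_decision(0.8, 0.4), multiplicative_decision(0.8, 0.4))   # False True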

48
Experiment Results
S. Dumais and H. Chen, Hierarchical
classification of Web content. Proceedings of
SIGIR'00, August 2000, pp. 256-263.
  • Non-hierarchical (baseline): F1 = 0.476
  • Hierarchical
  • Top level
  • Training set: F1 = 0.649
  • Test set: F1 = 0.572
  • Second level
  • Multiplicative: F1 = 0.495
  • Boolean: F1 = 0.497
  • Assuming the top-level classification is correct: F1 = 0.711

49
Summary
  • Feature space reduction
  • The performance of SVM and TSVM is better than that of other methods
  • TSVM has merits in text classification
  • Hierarchical classification is helpful
  • Other issues
  • Sampling strategies
  • Other kinds of feature selection

50
References
  • T. Joachims, Text Categorization with Support
    Vector Machines Learning with Many Relevant
    Features. Proceedings of the European Conference
    on Machine Learning (ECML), Springer, 1998.
  • T. Joachims, Transductive Inference for Text
    Classification using Support Vector Machines.
    Proceedings of the International Conference on
    Machine Learning (ICML), 1999.
  • T. Joachims, A Statistical Learning Model of Text
    Classification with Support Vector Machines.
    Proceedings of the Conference on Research and
    Development in Information Retrieval (SIGIR),
    ACM, 2001.
  • Robert Cooley, Classification of News Stories
    Using Support Vector Machines. Proceedings of
    the Sixteenth International Joint Conference on
    Artificial Intelligence Text Mining Workshop,
    August 1999.
  • Yiming Yang and Xin Liu, A re-examination of text
    categorization methods. Proceedings of ACM SIGIR
    Conference on Research and Development in
    Information Retrieval (SIGIR), 1999.
  • S. Dumais and H. Chen, Hierarchical
    classification of Web content. Proceedings of
    SIGIR'00, August 2000, pp. 256-263.