Title: Practical Things to Do with Bags of Words: I. Text Classification
1. Practical Things to Do with Bags of Words
I. Text Classification / Spam Detection
II. The Vector Space Model for Information Retrieval
- Mitch Marcus
- CIS 530: Intro to NLP
2. TEXT CLASSIFICATION
- adapted from slides by
- Chris Manning and Massimo Poesio
3. Is this spam?
- From: <takworlld@hotmail.com>
- Subject: real estate is the only way... gem oalvgkay
- Anyone can buy real estate with no money down
- Stop paying rent TODAY!
- There is no need to spend hundreds or even thousands for similar courses
- I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
- Change your life NOW!
- Click below to order:
- http://www.wholesaledaily.com/sales/nmd.htm
4. Categorization/Classification
- Given:
- A description of an instance, x ∈ X, where X is the instance language or instance space.
- Issue: how to represent text documents.
- A fixed set of categories:
- C = {c1, c2, ..., cn}
- Determine:
- The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
- We want to know how to build categorization functions (classifiers).
5. A GRAPHICAL VIEW OF TEXT CLASSIFICATION
6. Document Classification
[Figure: a test document containing "planning language proof intelligence" is assigned to one of the classes (AI), (Programming), (HCI). The training data for each class covers topics such as ML, Planning, Semantics, Garb. Coll., Multimedia, and GUI, with example documents like "planning temporal reasoning plan language...", "programming semantics language proof...", "learning intelligence algorithm reinforcement network...", and "garbage collection memory optimization region...".]
7. EXAMPLES OF TEXT CATEGORIZATION
- LABELS = BINARY
- spam / not spam
- LABELS = TOPICS
- finance / sports / asia
- LABELS = OPINION
- like / hate / neutral
- LABELS = AUTHOR
- Shakespeare / Marlowe / Ben Jonson
- The Federalist Papers
8. Methods (1)
- Manual classification
- Used by Yahoo!, Looksmart, about.com, ODP, Medline
- very accurate when the job is done by experts
- consistent when the problem size and team are small
- difficult and expensive to scale
- Automatic document classification
- Hand-coded rule-based systems
- Used by Reuters, CIA, Verity, ...
- Commercial systems have complex query languages
9. Methods (2)
- Supervised learning of a document-label assignment function
- Used by Autonomy, Kana, MSN, Verity, ...
- Naive Bayes (simple, common method)
- k-Nearest Neighbors (simple, powerful)
- Support-vector machines (newer, more powerful)
- plus many other methods
- No free lunch: requires hand-classified training data
- But such data can be built (and refined) by amateurs
10. Bayesian Methods
- Learning and classification methods based on probability theory (see spelling / POS)
- Bayes' theorem plays a critical role
- Build a generative model that approximates how data is produced
- Uses the prior probability of each category given no information about an item
- Categorization produces a posterior probability distribution over the possible categories given a description of an item
11. Bayes' Rule once more
P(h | D) = P(D | h) P(h) / P(D)
12. Maximum a posteriori Hypothesis
h_MAP = argmax_h P(h | D) = argmax_h P(D | h) P(h) / P(D)
As P(D) is constant:
h_MAP = argmax_h P(D | h) P(h)
13. Maximum likelihood Hypothesis
- If all hypotheses are a priori equally likely, we only need to consider the P(D | h) term:
h_ML = argmax_h P(D | h)
14. Naive Bayes Classifiers
- Task: classify a new instance D, described by a tuple of attribute values D = (x1, x2, ..., xn), into one of the classes cj ∈ C:
c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, ..., xn) = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj)
15. Naïve Bayes Classifier: Assumptions
- P(cj)
- Can be estimated from the frequency of classes in the training examples.
- P(x1, x2, ..., xn | cj)
- O(|X|^n · |C|) parameters
- Could only be estimated if a very, very large number of training examples were available.
- Naïve Bayes Conditional Independence Assumption:
- Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).
16. The Naïve Bayes Classifier
- Conditional Independence Assumption: features are independent of each other given the class:
P(x1, x2, ..., xn | cj) = P(x1 | cj) · P(x2 | cj) · ... · P(xn | cj)
- This model is appropriate for binary variables
17. Learning the Model
- First attempt: maximum likelihood estimates
- simply use the frequencies in the data (plus smoothing, of course):
P(cj) = N(C = cj) / N
P(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)
18. Using Naive Bayes Classifiers to Classify Text: Basic Method
- Attributes are text positions, values are words.
- Still too many possibilities
- Assume that classification is independent of the positions of the words
- Use the same parameters for each position
- Result is the bag-of-words model
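The bag-of-words reduction can be sketched in a few lines; the tokenizer and example text below are illustrative, not from the slides:

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count occurrences.
    Word order is discarded; only per-word counts survive."""
    return Counter(text.lower().split())

bow = bag_of_words("Buy real estate NOW buy now")
# bow["buy"] == 2, bow["now"] == 2, bow["estate"] == 1
```

A real system would use a proper tokenizer (punctuation, stemming), but the counting step is the same.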
19. Text Classification Algorithms: Learning
- From the training corpus, extract Vocabulary
- Calculate the required P(cj) and P(xk | cj) terms
- For each cj in C do:
- docsj ← subset of documents for which the target class is cj
- P(cj) ← |docsj| / |total number of documents|
- Textj ← single document containing all docsj
- For each word xk in Vocabulary:
- nk ← number of occurrences of xk in Textj
- P(xk | cj) ← (nk + 1) / (n + |Vocabulary|), where n is the total number of word positions in Textj (add-one smoothed)
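The learning loop above can be sketched as follows (function and variable names are my own; `docs` is a list of token lists with a parallel list of class labels):

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate log P(c_j) and add-one-smoothed log P(x_k | c_j)."""
    vocab = {w for doc in docs for w in doc}                      # Vocabulary
    log_prior, log_likelihood = {}, {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]  # docs_j
        log_prior[c] = math.log(len(class_docs) / len(docs))      # P(c_j)
        counts = Counter(w for d in class_docs for w in d)        # Text_j
        n = sum(counts.values())               # word positions in Text_j
        # P(x_k | c_j) = (n_k + 1) / (n + |Vocabulary|)
        log_likelihood[c] = {w: math.log((counts[w] + 1) / (n + len(vocab)))
                             for w in vocab}
    return log_prior, log_likelihood, vocab
```

Storing log-probabilities directly anticipates the underflow issue discussed on slide 21.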
20. Naïve Bayes: Classifying
- positions ← all word positions in the current document which contain tokens found in Vocabulary
- Return cNB, where
cNB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)
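A minimal sketch of this argmax, assuming a model of per-class log-priors and per-class log-likelihood dicts (interface is my own choice); summing logs rather than multiplying probabilities also avoids floating-point underflow:

```python
import math

def classify_nb(tokens, log_prior, log_likelihood, vocab):
    """Return argmax_c [log P(c) + sum of log P(x_i | c) over
    in-vocabulary positions]."""
    best_class, best_score = None, float("-inf")
    for c in log_prior:
        score = log_prior[c]                    # log P(c_j)
        for w in tokens:
            if w in vocab:                      # skip out-of-vocabulary positions
                score += log_likelihood[c][w]   # + log P(x_i | c_j)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```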
21. Underflow Prevention
- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
- The class with the highest final un-normalized log probability score is still the most probable.
22. Feature Selection via Mutual Information
- We might not want to use all words, but just reliable, well-discriminating terms
- In the training set, choose the k words which best discriminate the categories.
- One way is using the terms with maximal Mutual Information with the classes.
- For each word w and each category c:
I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} P(e_w, e_c) log [ P(e_w, e_c) / (P(e_w) P(e_c)) ]
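A sketch of this computation from a 2x2 contingency table of document counts (the function name and interface are my own):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between a word indicator and a class indicator.
      n11 = in-class docs containing w    n10 = out-of-class docs containing w
      n01 = in-class docs without w       n00 = out-of-class docs without w"""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # each cell contributes P(e_w, e_c) * log2( P(e_w, e_c) / (P(e_w) P(e_c)) )
    for n_cell, n_w, n_c in [(n11, n11 + n10, n11 + n01),
                             (n10, n11 + n10, n10 + n00),
                             (n01, n01 + n00, n11 + n01),
                             (n00, n01 + n00, n10 + n00)]:
        if n_cell > 0:
            mi += (n_cell / n) * math.log2(n_cell * n / (n_w * n_c))
    return mi
```

An independent word scores 0; a word that perfectly predicts a binary class scores 1 bit.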
23. Feature Selection via MI (contd.)
- For each category we build a list of the k most discriminating terms.
- For example (on 20 Newsgroups):
- sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, ...
- rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ...
- Greedy selection does not account for correlations between terms
24. Feature Selection
- Mutual Information
- Clear information-theoretic interpretation
- May select rare uninformative terms
- Commonest terms
- No particular foundation
- In practice often 90% as good
- Other methods: Chi-square, etc.
- Modern methods use regularization
25. PANTEL AND LIN: SPAMCOP
- Uses a Naïve Bayes classifier
- M is spam if P(Spam | M) > P(NonSpam | M)
- Method:
- Tokenize message using the Porter Stemmer
- Estimate P(W | C) using the m-estimate (a form of smoothing)
- Remove words that do not satisfy certain conditions
- Train: 160 spams, 466 non-spams
- Test: 277 spams, 346 non-spams
- Results: ERROR RATE of 4.33%
- Worse results using trigrams
26. Naive Bayes is Not So Naive
- Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms
- Goal: a direct-mail response prediction model for the financial services industry: predict whether the recipient of mail will actually respond to the advertisement; 750,000 records.
- Robust to irrelevant features
- Irrelevant features cancel each other without affecting results
- Decision Trees, in contrast, can suffer heavily from this
- Very good in domains with many equally important features
- Decision Trees suffer from fragmentation in such cases, especially with little data
- A good, dependable baseline for text classification (but not the best)!
- Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
- Very fast: learning with one pass over the data; testing linear in the number of attributes and document collection size
- Low storage requirements
27. REFERENCES
- Mosteller, F., & Wallace, D. L. (1984). Applied Bayesian and Classical Inference: The Case of the Federalist Papers (2nd ed.). New York: Springer-Verlag.
- Pantel, P., & Lin, D. (1998). SpamCop: A spam classification and organization program. In Proc. of the 1998 AAAI Workshop on Learning for Text Categorization.
- Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.
28. A Gentle Introduction to Information Retrieval Using the Vector Space Model
- from slides by R. Ramakrishnan
- based on Larson and Hearst's slides at UC Berkeley
29. Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
- Each vector holds a place for every term in the collection
- Therefore, most vectors are sparse
30. Document Vectors: One Location for Each Word
Each row is a document (A-I); the columns are the terms nova, galaxy, heat, h'wood, film, role, diet, fur.
- A: 10 5 3
- B: 5 10
- C: 10 8 7
- D: 9 10 5
- E: 10 10
- F: 9 10
- G: 5 7 9
- H: 6 10 2 8
- I: 7 5 1 3
"Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, and "Heat" occurs 3 times in text A. (Blank means 0 occurrences.)
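Since most entries are blank, such vectors are naturally stored sparsely. A minimal sketch, encoding only document A's counts (the only ones spelled out above); a missing key means a count of 0:

```python
# Sparse vector for document A from the table above.
doc_a = {"nova": 10, "galaxy": 5, "heat": 3}

def term_count(doc_vector, term):
    """Blank cell in the table = count of 0."""
    return doc_vector.get(term, 0)
```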
31. Document Vectors
(The same table as on slide 30, with the document ids A-I highlighted.)
32. We Can Plot the Vectors
[Figure: documents plotted in a 2-D space with axes "Star" and "Diet": documents about movie stars and about astronomy lie toward the "Star" axis; documents about mammal behavior lie toward the "Diet" axis.]
Assumption: documents that are "close" in space are similar.
33. Vector Space Model
- Documents are represented as vectors in term space
- Terms are usually stems
- Documents represented by binary vectors of terms
- Queries represented the same as documents
- A vector distance measure between the query and documents is used to rank retrieved documents
- Query and document similarity is based on the length and direction of their vectors
- Vector operations can capture Boolean query conditions
- Terms in a vector can be weighted in many ways
34. Assigning Weights to Terms
- Binary weights
- Raw term frequency
- tf × idf
- Recall the Zipf distribution
- We want to weight terms highly if they are:
- frequent in relevant documents, BUT
- infrequent in the collection as a whole
35. TF × IDF Weights
- tf × idf measure:
- Term Frequency (tf)
- Inverse Document Frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Goal: assign a tf × idf weight to each term in each document
36. TF × IDF Calculation
w_ij = tf_ij × log(N / n_j)
where tf_ij = frequency of term j in document i, N = number of documents in the collection, and n_j = number of documents containing term j.
37. Inverse Document Frequency
- IDF provides high values for rare words and low values for common words
For a collection of 10,000 documents (base-10 logs):
- a term in all 10,000 documents: idf = log(10000/10000) = 0
- a term in 100 documents: idf = log(10000/100) = 2
- a term in 1 document: idf = log(10000/1) = 4
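The idf score is a one-liner; base-10 log is assumed here, so a 10,000-document collection gives scores between 0 and 4:

```python
import math

def idf(n_docs, doc_freq):
    """log(N / n_j): high for rare terms, 0 for terms in every document."""
    return math.log10(n_docs / doc_freq)
```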
38. TF × IDF Normalization
- Normalize the term weights (so longer documents are not unfairly given more weight)
- The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So we want to reduce the importance attached to a term appearing in a document based on the length of the document:
w_ij = tf_ij · log(N / n_j) / sqrt( Σ_k [tf_ik · log(N / n_k)]² )
39. Pair-wise Document Similarity
Each row is a document; the columns are the terms nova, galaxy, heat, h'wood, film, role, diet, fur.
- A: 1 3 1
- B: 5 2
- C: 2 1 5
- D: 4 1
How similar are two documents? The unnormalized similarity is the inner product of their weight vectors: sim(D1, D2) = Σ_i w_1i · w_2i.
40. Pair-wise Document Similarity (Cosine Normalization)
sim(D1, D2) = Σ_i (w_1i · w_2i) / ( sqrt(Σ_i w_1i²) · sqrt(Σ_i w_2i²) )
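The cosine measure can be sketched over sparse term-weight vectors (dicts mapping term to weight; the representation is my own choice):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0          # a vector of all zeros matches nothing
    return dot / (norm1 * norm2)
```

Dividing by both vector lengths normalizes away document length, so only the direction (the mix of terms) matters.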
41. Problems with the Vector Space Model
- There is no real theoretical basis for the assumption of a term space
- It is more for visualization than having any real basis
- Most similarity measures work about the same
- Terms are not really orthogonal dimensions
- Terms are not independent of all other terms; remember our discussion of correlated terms in text
42. Query Modification
- Problem: how can we reformulate the query to help a user who is trying several searches to get at the same information?
- Thesaurus expansion:
- Suggest terms similar to query terms
- Relevance feedback:
- Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant