Title: Practical Things to Do with Bags of Words: I. Text Classification
1. Practical Things to Do with Bags of Words
I. Text Classification / Spam Detection
II. The Vector Space Model for Information Retrieval
- Mitch Marcus
- CIS 530: Intro to NLP
2. TEXT CLASSIFICATION
- adapted from slides by
- Chris Manning and Massimo Poesio
3. Is this spam?
- From: <takworlld@hotmail.com>
- Subject: real estate is the only way... gem oalvgkay
- Anyone can buy real estate with no money down
- Stop paying rent TODAY!
- There is no need to spend hundreds or even thousands for similar courses
- I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
- Change your life NOW!
- Click below to order:
- http://www.wholesaledaily.com/sales/nmd.htm
4. Categorization/Classification
- Given:
- A description of an instance, x ∈ X, where X is the instance language or instance space.
- Issue: how to represent text documents.
- A fixed set of categories:
- C = {c1, c2, ..., cn}
- Determine:
- The category of x: c(x) ∈ C, where c(x) is a categorization function whose domain is X and whose range is C.
- We want to know how to build categorization functions (classifiers).
5. A GRAPHICAL VIEW OF TEXT CLASSIFICATION
6. Document Classification
[Figure: a test document containing "planning language proof intelligence" is assigned to one of the classes (AI), (Programming), (HCI). The training data for each class covers topics such as ML, Planning, Semantics, Garb. Coll., Multimedia, and GUI, with example documents like "planning temporal reasoning plan language...", "programming semantics language proof...", "learning intelligence algorithm reinforcement network...", and "garbage collection memory optimization region...".]
7. EXAMPLES OF TEXT CATEGORIZATION
- LABELS = BINARY
- spam / not spam
- LABELS = TOPICS
- finance / sports / asia
- LABELS = OPINION
- like / hate / neutral
- LABELS = AUTHOR
- Shakespeare / Marlowe / Ben Jonson
- The Federalist Papers
8. Methods (1)
- Manual classification
- Used by Yahoo!, Looksmart, about.com, ODP, Medline
- very accurate when the job is done by experts
- consistent when the problem size and team are small
- difficult and expensive to scale
- Automatic document classification
- Hand-coded rule-based systems
- Used by Reuters, CIA, Verity, ...
- Commercial systems have complex query languages
9. Methods (2)
- Supervised learning of a document-label assignment function
- Used by Autonomy, Kana, MSN, Verity, ...
- Naive Bayes (simple, common method)
- k-Nearest Neighbors (simple, powerful)
- Support-vector machines (newer, more powerful)
- plus many other methods
- No free lunch: requires hand-classified training data
- But such data can be built (and refined) by amateurs
10. Bayesian Methods
- Learning and classification methods based on probability theory (see spelling / POS)
- Bayes' theorem plays a critical role
- Build a generative model that approximates how data is produced
- Uses the prior probability of each category given no information about an item
- Categorization produces a posterior probability distribution over the possible categories given a description of an item
11. Bayes' Rule once more
P(h | D) = P(D | h) P(h) / P(D)
12. Maximum a posteriori Hypothesis
h_MAP = argmax_h P(h | D) = argmax_h P(D | h) P(h) / P(D)
As P(D) is constant:
h_MAP = argmax_h P(D | h) P(h)
13. Maximum likelihood Hypothesis
- If all hypotheses are a priori equally likely, we only need to consider the P(D | h) term:
h_ML = argmax_h P(D | h)
14. Naive Bayes Classifiers
- Task: classify a new instance D, described by a tuple of attribute values D = (x1, x2, ..., xn), into one of the classes cj ∈ C:
c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, ..., xn) = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj)
15. Naïve Bayes Classifier: Assumptions
- P(cj)
- Can be estimated from the frequency of classes in the training examples.
- P(x1, x2, ..., xn | cj)
- O(|X|^n · |C|) parameters
- Could only be estimated if a very, very large number of training examples were available.
- Naïve Bayes Conditional Independence Assumption:
- Assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).
16. The Naïve Bayes Classifier
- Conditional Independence Assumption: features are independent of each other given the class:
P(x1, x2, ..., xn | cj) = P(x1 | cj) · P(x2 | cj) · ... · P(xn | cj)
- This model is appropriate for binary variables
17. Learning the Model
- First attempt: maximum likelihood estimates
- simply use the frequencies in the data (plus smoothing, of course):
P(cj) = N(C = cj) / N
P(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)
18. Using Naive Bayes Classifiers to Classify Text: Basic Method
- Attributes are text positions, values are words.
- Still too many possibilities
- Assume that classification is independent of the positions of the words
- Use the same parameters for each position
- Result is the bag-of-words model
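The bag-of-words reduction can be sketched in a few lines; the tokenizer and example text below are illustrative, not from the slides:

```python
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split on whitespace, and count occurrences.
    Word order is discarded; only per-word counts survive."""
    return Counter(text.lower().split())

bow = bag_of_words("Buy real estate NOW buy now")
# bow["buy"] == 2, bow["now"] == 2, bow["estate"] == 1
```

A real system would use a proper tokenizer (punctuation, stemming), but the counting step is the same.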
19. Text Classification Algorithms: Learning
- From the training corpus, extract Vocabulary
- Calculate the required P(cj) and P(xk | cj) terms
- For each cj in C do:
- docsj ← subset of documents for which the target class is cj
- P(cj) ← |docsj| / |total number of documents|
- Textj ← single document containing all docsj
- For each word xk in Vocabulary:
- nk ← number of occurrences of xk in Textj
- P(xk | cj) ← (nk + 1) / (n + |Vocabulary|), where n is the total number of word positions in Textj (add-one smoothed)
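The learning loop above can be sketched as follows (function and variable names are my own; `docs` is a list of token lists with a parallel list of class labels):

```python
import math
from collections import Counter

def train_naive_bayes(docs, labels):
    """Estimate log P(c_j) and add-one-smoothed log P(x_k | c_j)."""
    vocab = {w for doc in docs for w in doc}                      # Vocabulary
    log_prior, log_likelihood = {}, {}
    for c in set(labels):
        class_docs = [d for d, l in zip(docs, labels) if l == c]  # docs_j
        log_prior[c] = math.log(len(class_docs) / len(docs))      # P(c_j)
        counts = Counter(w for d in class_docs for w in d)        # Text_j
        n = sum(counts.values())               # word positions in Text_j
        # P(x_k | c_j) = (n_k + 1) / (n + |Vocabulary|)
        log_likelihood[c] = {w: math.log((counts[w] + 1) / (n + len(vocab)))
                             for w in vocab}
    return log_prior, log_likelihood, vocab
```

Storing log-probabilities directly anticipates the underflow issue discussed on slide 21.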
20. Naïve Bayes: Classifying
- positions ← all word positions in the current document which contain tokens found in Vocabulary
- Return cNB, where
cNB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj)
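A minimal sketch of this argmax, assuming a model of per-class log-priors and per-class log-likelihood dicts (interface is my own choice); summing logs rather than multiplying probabilities also avoids floating-point underflow:

```python
import math

def classify_nb(tokens, log_prior, log_likelihood, vocab):
    """Return argmax_c [log P(c) + sum of log P(x_i | c) over
    in-vocabulary positions]."""
    best_class, best_score = None, float("-inf")
    for c in log_prior:
        score = log_prior[c]                    # log P(c_j)
        for w in tokens:
            if w in vocab:                      # skip out-of-vocabulary positions
                score += log_likelihood[c][w]   # + log P(x_i | c_j)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```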
21. Underflow Prevention
- Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
- Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
- The class with the highest final un-normalized log probability score is still the most probable.
22. Feature Selection via Mutual Information
- We might not want to use all words, but just reliable, well-discriminating terms
- In the training set, choose the k words which best discriminate the categories.
- One way is using the terms with maximal Mutual Information with the classes.
- For each word w and each category c:
I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} P(e_w, e_c) log [ P(e_w, e_c) / (P(e_w) P(e_c)) ]
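A sketch of this computation from a 2x2 contingency table of document counts (the function name and interface are my own):

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in bits) between a word indicator and a class indicator.
      n11 = in-class docs containing w    n10 = out-of-class docs containing w
      n01 = in-class docs without w       n00 = out-of-class docs without w"""
    n = n11 + n10 + n01 + n00
    mi = 0.0
    # each cell contributes P(e_w, e_c) * log2( P(e_w, e_c) / (P(e_w) P(e_c)) )
    for n_cell, n_w, n_c in [(n11, n11 + n10, n11 + n01),
                             (n10, n11 + n10, n10 + n00),
                             (n01, n01 + n00, n11 + n01),
                             (n00, n01 + n00, n10 + n00)]:
        if n_cell > 0:
            mi += (n_cell / n) * math.log2(n_cell * n / (n_w * n_c))
    return mi
```

An independent word scores 0; a word that perfectly predicts a binary class scores 1 bit.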
23. Feature Selection via MI (contd.)
- For each category we build a list of the k most discriminating terms.
- For example (on 20 Newsgroups):
- sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, ...
- rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ...
- Greedy selection does not account for correlations between terms
24. Feature Selection
- Mutual Information
- Clear information-theoretic interpretation
- May select rare uninformative terms
- Commonest terms
- No particular foundation
- In practice often 90% as good
- Other methods: Chi-square, etc.
- Modern methods use regularization
25. PANTEL AND LIN: SPAMCOP
- Uses a Naïve Bayes classifier
- M is spam if P(Spam | M) > P(NonSpam | M)
- Method:
- Tokenize message using the Porter Stemmer
- Estimate P(W | C) using the m-estimate (a form of smoothing)
- Remove words that do not satisfy certain conditions
- Train: 160 spams, 466 non-spams
- Test: 277 spams, 346 non-spams
- Results: ERROR RATE of 4.33%
- Worse results using trigrams
26. Naive Bayes is Not So Naive
- Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms
- Goal: a direct-mail response prediction model for the financial services industry: predict whether the recipient of mail will actually respond to the advertisement; 750,000 records.
- Robust to irrelevant features
- Irrelevant features cancel each other without affecting results
- Decision Trees, in contrast, can suffer heavily from this
- Very good in domains with many equally important features
- Decision Trees suffer from fragmentation in such cases, especially with little data
- A good, dependable baseline for text classification (but not the best)!
- Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
- Very fast: learning with one pass over the data; testing linear in the number of attributes and document collection size
- Low storage requirements
27. REFERENCES
- Mosteller, F., & Wallace, D. L. (1984). Applied Bayesian and Classical Inference: The Case of the Federalist Papers (2nd ed.). New York: Springer-Verlag.
- Pantel, P., & Lin, D. (1998). SpamCop: A spam classification and organization program. In Proc. of the 1998 AAAI Workshop on Learning for Text Categorization.
- Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.
28. A Gentle Introduction to Information Retrieval Using the Vector Space Model
- from slides by R. Ramakrishnan
- based on Larson and Hearst's slides at UC Berkeley
29. Document Vectors
- Documents are represented as "bags of words"
- Represented as vectors when used computationally
- Each vector holds a place for every term in the collection
- Therefore, most vectors are sparse
30. Document Vectors: One Location for Each Word
Each row is a document (A-I); the columns are the terms nova, galaxy, heat, h'wood, film, role, diet, fur.
- A: 10 5 3
- B: 5 10
- C: 10 8 7
- D: 9 10 5
- E: 10 10
- F: 9 10
- G: 5 7 9
- H: 6 10 2 8
- I: 7 5 1 3
"Nova" occurs 10 times in text A, "Galaxy" occurs 5 times in text A, and "Heat" occurs 3 times in text A. (Blank means 0 occurrences.)
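Since most entries are blank, such vectors are naturally stored sparsely. A minimal sketch, encoding only document A's counts (the only ones spelled out above); a missing key means a count of 0:

```python
# Sparse vector for document A from the table above.
doc_a = {"nova": 10, "galaxy": 5, "heat": 3}

def term_count(doc_vector, term):
    """Blank cell in the table = count of 0."""
    return doc_vector.get(term, 0)
```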
31. Document Vectors
(The same table as on slide 30, with the document ids A-I highlighted.)
32. We Can Plot the Vectors
[Figure: documents plotted in a 2-D space with axes "Star" and "Diet": documents about movie stars and about astronomy lie toward the "Star" axis; documents about mammal behavior lie toward the "Diet" axis.]
Assumption: documents that are "close" in space are similar.
33. Vector Space Model
- Documents are represented as vectors in term space
- Terms are usually stems
- Documents represented by binary vectors of terms
- Queries represented the same as documents
- A vector distance measure between the query and documents is used to rank retrieved documents
- Query and document similarity is based on the length and direction of their vectors
- Vector operations can capture Boolean query conditions
- Terms in a vector can be weighted in many ways
34. Assigning Weights to Terms
- Binary weights
- Raw term frequency
- tf × idf
- Recall the Zipf distribution
- We want to weight terms highly if they are:
- frequent in relevant documents, BUT
- infrequent in the collection as a whole
35. TF × IDF Weights
- tf × idf measure:
- Term Frequency (tf)
- Inverse Document Frequency (idf) -- a way to deal with the problems of the Zipf distribution
- Goal: assign a tf × idf weight to each term in each document
36. TF × IDF Calculation
w_ij = tf_ij × log(N / n_j)
where tf_ij = frequency of term j in document i, N = number of documents in the collection, and n_j = number of documents containing term j.
37. Inverse Document Frequency
- IDF provides high values for rare words and low values for common words
For a collection of 10,000 documents (base-10 logs):
- a term in all 10,000 documents: idf = log(10000/10000) = 0
- a term in 100 documents: idf = log(10000/100) = 2
- a term in 1 document: idf = log(10000/1) = 4
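The idf score is a one-liner; base-10 log is assumed here, so a 10,000-document collection gives scores between 0 and 4:

```python
import math

def idf(n_docs, doc_freq):
    """log(N / n_j): high for rare terms, 0 for terms in every document."""
    return math.log10(n_docs / doc_freq)
```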
38. TF × IDF Normalization
- Normalize the term weights (so longer documents are not unfairly given more weight)
- The longer the document, the more likely it is for a given term to appear in it, and the more often a given term is likely to appear in it. So we want to reduce the importance attached to a term appearing in a document based on the length of the document:
w_ij = tf_ij · log(N / n_j) / sqrt( Σ_k [tf_ik · log(N / n_k)]² )
39. Pair-wise Document Similarity
Each row is a document; the columns are the terms nova, galaxy, heat, h'wood, film, role, diet, fur.
- A: 1 3 1
- B: 5 2
- C: 2 1 5
- D: 4 1
How similar are two documents? The unnormalized similarity is the inner product of their weight vectors: sim(D1, D2) = Σ_i w_1i · w_2i.
40. Pair-wise Document Similarity (Cosine Normalization)
sim(D1, D2) = Σ_i (w_1i · w_2i) / ( sqrt(Σ_i w_1i²) · sqrt(Σ_i w_2i²) )
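The cosine measure can be sketched over sparse term-weight vectors (dicts mapping term to weight; the representation is my own choice):

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0          # a vector of all zeros matches nothing
    return dot / (norm1 * norm2)
```

Dividing by both vector lengths normalizes away document length, so only the direction (the mix of terms) matters.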
41. Problems with the Vector Space Model
- There is no real theoretical basis for the assumption of a term space
- It is more for visualization than having any real basis
- Most similarity measures work about the same
- Terms are not really orthogonal dimensions
- Terms are not independent of all other terms; remember our discussion of correlated terms in text
42. Query Modification
- Problem: how can we reformulate the query to help a user who is trying several searches to get at the same information?
- Thesaurus expansion:
- Suggest terms similar to query terms
- Relevance feedback:
- Suggest terms (and documents) similar to retrieved documents that have been judged to be relevant