Practical Things to Do with Bags of Words: I. Text Classification (PowerPoint PPT transcript)
1
Practical Things to Do with Bags of Words
I. Text Classification & Spam Detection
II. The Vector Space Model for Information Retrieval
  • Mitch Marcus
  • CIS 530: Intro to NLP

2
TEXT CLASSIFICATION
  • adapted from slides by
  • Chris Manning & Massimo Poesio

3
Is this spam?
  • From: "" <takworlld@hotmail.com>
  • Subject: real estate is the only way... gem
    oalvgkay
  • Anyone can buy real estate with no money down
  • Stop paying rent TODAY !
  • There is no need to spend hundreds or even
    thousands for similar courses
  • I am 22 years old and I have already purchased 6
    properties using the
  • methods outlined in this truly INCREDIBLE ebook.
  • Change your life NOW !
  • Click Below to order
  • http://www.wholesaledaily.com/sales/nmd.htm

4
Categorization/Classification
  • Given:
  • A description of an instance, x ∈ X, where X is the
    instance language or instance space.
  • Issue: how to represent text documents.
  • A fixed set of categories:
  • C = {c1, c2, ..., cn}
  • Determine:
  • The category of x: c(x) ∈ C, where c(x) is a
    categorization function whose domain is X and
    whose range is C.
  • We want to know how to build categorization
    functions (classifiers).

5
A GRAPHICAL VIEW OF TEXT CLASSIFICATION
6
Document Classification
[Diagram: a test document containing "planning language proof intelligence"
is routed to one of the classes ML, Planning, Semantics, Garb.Coll.,
Multimedia, GUI, grouped under the areas (AI), (Programming), and (HCI).
Sample training data per class:]
  • planning temporal reasoning plan language...
  • programming semantics language proof...
  • learning intelligence algorithm reinforcement network...
  • garbage collection memory optimization region...
7
EXAMPLES OF TEXT CATEGORIZATION
  • LABELS = BINARY
  • spam / not spam
  • LABELS = TOPICS
  • finance / sports / asia
  • LABELS = OPINION
  • like / hate / neutral
  • LABELS = AUTHOR
  • Shakespeare / Marlowe / Ben Jonson
  • the Federalist Papers

8
Methods (1)
  • Manual classification
  • Used by Yahoo!, Looksmart, about.com, ODP,
    Medline
  • very accurate when job is done by experts
  • consistent when the problem size and team is
    small
  • difficult and expensive to scale
  • Automatic document classification
  • Hand-coded rule-based systems
  • Reuters, CIA, Verity, ...
  • Commercial systems have complex query languages

9
Methods (2)
  • Supervised learning of a document-label assignment
    function: Autonomy, Kana, MSN, Verity, ...
  • Naive Bayes (simple, common method)
  • k-Nearest Neighbors (simple, powerful)
  • Support-vector machines (new, more powerful)
  • plus many other methods
  • No free lunch: requires hand-classified training
    data
  • But can be built (and refined) by amateurs

10
Bayesian Methods
  • Learning and classification methods based on
    probability theory (see spelling / POS)
  • Bayes' theorem plays a critical role
  • Build a generative model that approximates how
    data is produced
  • Uses prior probability of each category given no
    information about an item.
  • Categorization produces a posterior probability
    distribution over the possible categories given a
    description of an item.

11
Bayes Rule once more
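The slide's equation was an image; written out, Bayes' rule for a hypothesis h and observed data D is:

```latex
P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}
```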
12
Maximum a posteriori Hypothesis
As P(D) is constant
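Written out (the slide's equation was an image), the maximum a posteriori hypothesis over a hypothesis space H is:

```latex
h_{MAP} \equiv \operatorname*{argmax}_{h \in H} P(h \mid D)
       = \operatorname*{argmax}_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)}
       = \operatorname*{argmax}_{h \in H} P(D \mid h)\,P(h)
```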
13
Maximum likelihood Hypothesis
  • If all hypotheses are a priori equally likely, we
    only need to consider the P(D|h) term
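With equal priors the MAP hypothesis reduces to the maximum likelihood hypothesis:

```latex
h_{ML} = \operatorname*{argmax}_{h \in H} P(D \mid h)
```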

14
Naive Bayes Classifiers
  • Task: Classify a new instance D, described by a
    tuple of attribute values, into one of the
    classes cj ∈ C

15
Naïve Bayes Classifier Assumption
  • P(cj)
  • Can be estimated from the frequency of classes in
    the training examples.
  • P(x1, x2, ..., xn | cj)
  • O(|X|^n · |C|) parameters
  • Could only be estimated if a very, very large
    number of training examples was available.
  • Naïve Bayes Conditional Independence Assumption:
  • Assume that the probability of observing the
    conjunction of attributes is equal to the product
    of the individual probabilities P(xi | cj).

16
The Naïve Bayes Classifier
  • Conditional Independence Assumption: features are
    independent of each other given the class
  • This model is appropriate for binary variables
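Written out, the conditional independence assumption and the resulting classifier are:

```latex
P(x_1, x_2, \ldots, x_n \mid c_j) = \prod_{i} P(x_i \mid c_j),
\qquad
c_{NB} = \operatorname*{argmax}_{c_j \in C} P(c_j) \prod_{i} P(x_i \mid c_j)
```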

17
Learning the Model
  • First attempt: maximum likelihood estimates
  • simply use the frequencies in the data (plus
    smoothing, of course)
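The maximum likelihood estimates referred to here, with N(·) denoting counts in the training data, are:

```latex
\hat{P}(c_j) = \frac{N(C = c_j)}{N}
\qquad
\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i,\; C = c_j)}{N(C = c_j)}
```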

18
Using Naive Bayes Classifiers to Classify Text
Basic method
  • Attributes are text positions, values are words.
  • Still too many possibilities
  • Assume that classification is independent of the
    positions of the words
  • Use same parameters for each position
  • Result is bag of words model

19
Text Classification Algorithms Learning
  • From training corpus, extract Vocabulary
  • Calculate required P(cj) and P(xk | cj) terms
  • For each cj in C do:
  • docsj ← subset of documents for which the target
    class is cj
  • Textj ← single document containing all of docsj
  • For each word xk in Vocabulary:
  • nk ← number of occurrences of xk in Textj (must
    be smoothed)
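A minimal sketch of this learning step in Python (the function name is illustrative, and add-one smoothing is one choice for the smoothing the slide requires):

```python
from collections import Counter

def train_naive_bayes(docs, labels):
    """docs: list of token lists; labels: parallel list of class labels.
    Returns the Vocabulary, priors P(c), and smoothed likelihoods P(w | c)."""
    vocab = {w for d in docs for w in d}
    classes = set(labels)
    # P(cj): frequency of each class in the training examples
    prior = {c: labels.count(c) / len(labels) for c in classes}
    cond = {}
    for c in classes:
        # Text_c: one "mega-document" of all training docs labelled c
        text_c = [w for d, lab in zip(docs, labels) if lab == c for w in d]
        counts = Counter(text_c)
        n = len(text_c)
        # add-one (Laplace) smoothing over the vocabulary
        cond[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, prior, cond
```

Because of the smoothing, every vocabulary word gets nonzero probability in every class, and the per-class distributions still sum to one.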
20
Naïve Bayes Classifying
  • positions ← all word positions in the current
    document which contain tokens found in
    Vocabulary
  • Return cNB, where
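The returned class is given by the standard naïve Bayes decision rule (the slide's equation was an image):

```latex
c_{NB} = \operatorname*{argmax}_{c_j \in C} \; P(c_j) \prod_{i \in \mathit{positions}} P(x_i \mid c_j)
```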

21
Underflow Prevention
  • Multiplying lots of probabilities, which are
    between 0 and 1 by definition, can result in
    floating-point underflow.
  • Since log(xy) = log(x) + log(y), it is better to
    perform all computations by summing logs of
    probabilities rather than multiplying
    probabilities.
  • Class with highest final un-normalized log
    probability score is still the most probable.
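A sketch of classification in log space (names are illustrative; `prior` maps class → P(c) and `cond` maps class → {word: P(w|c)}, as produced by a trainer like the one above):

```python
import math

def classify_log(doc, prior, cond):
    """Score each class by summing log-probabilities instead of
    multiplying probabilities, avoiding floating-point underflow."""
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for w in doc:
            if w in cond[c]:          # skip tokens outside Vocabulary
                score += math.log(cond[c][w])
        scores[c] = score
    # the class with the highest un-normalized log score is most probable
    return max(scores, key=scores.get)
```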

22
Feature selection via Mutual Information
  • We might not want to use all words, but just
    reliable, good discriminating terms
  • In training set, choose k words which best
    discriminate the categories.
  • One way is using terms with maximal Mutual
    Information with the classes
  • For each word w and each category c
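One standard form of this criterion, with e_w and e_c the 0/1 indicators for word occurrence and class membership:

```latex
I(w, c) = \sum_{e_w \in \{0,1\}} \; \sum_{e_c \in \{0,1\}}
          P(e_w, e_c) \,\log \frac{P(e_w, e_c)}{P(e_w)\, P(e_c)}
```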

23
Feature selection via MI (contd.)
  • For each category we build a list of k most
    discriminating terms.
  • For example (on 20 Newsgroups)
  • sci.electronics circuit, voltage, amp, ground,
    copy, battery, electronics, cooling,
  • rec.autos car, cars, engine, ford, dealer,
    mustang, oil, collision, autos, tires, toyota,
  • Greedy does not account for correlations between
    terms

24
Feature Selection
  • Mutual Information
  • Clear information-theoretic interpretation
  • May select rare uninformative terms
  • Commonest terms:
  • No particular foundation
  • In practice often is 90% as good
  • Other methods: Chi-square, etc.
  • Modern methods use regularization

25
PANTEL AND LIN: SPAMCOP
  • Uses a Naïve Bayes classifier
  • M is spam if P(Spam|M) > P(NonSpam|M)
  • Method:
  • Tokenize message using the Porter stemmer
  • Estimate P(W|C) using the m-estimate (a form of
    smoothing)
  • Remove words that do not satisfy certain
    conditions
  • Train: 160 spams, 466 non-spams
  • Test: 277 spams, 346 non-spams
  • Results: ERROR RATE of 4.33%
  • Worse results using trigrams

26
Naive Bayes is Not So Naive
  • Naïve Bayes took first and second place in the
    KDD-CUP 97 competition, among 16 (then)
    state-of-the-art algorithms
  • Goal: financial services industry direct-mail
    response prediction model; predict if the
    recipient of mail will actually respond to the
    advertisement (750,000 records)
  • Robust to Irrelevant Features
  • Irrelevant features cancel each other without
    affecting results
  • Decision Trees, in contrast, can suffer heavily
    from this
  • Very good in domains with many equally important
    features
  • Decision Trees suffer from fragmentation in such
    cases, especially with little data
  • A good dependable baseline for text
    classification (but not the best)!
  • Optimal if the Independence Assumptions hold: if
    the assumed independence is correct, then it is
    the Bayes-optimal classifier for the problem
  • Very fast: learning takes one pass over the data;
    testing is linear in the number of attributes and
    the document collection size
  • Low storage requirements

27
REFERENCES
  • Mosteller, F., & Wallace, D. L. (1984). Applied
    Bayesian and Classical Inference: The Case of the
    Federalist Papers (2nd ed.). New York:
    Springer-Verlag.
  • Pantel, P., & Lin, D. (1998). SPAMCOP: A spam
    classification and organization program. In
    Proc. of the 1998 AAAI Workshop on Learning for
    Text Categorization.
  • Sebastiani, F. (2002). Machine learning in
    automated text categorization. ACM Computing
    Surveys, 34(1), 1-47.

28
A Gentle Introduction to Information Retrieval
using the Vector Space Model
  • from slides by R. Ramakrishan
  • based on Larson and Hearst's slides at
    UC-Berkeley

29
Document Vectors
  • Documents are represented as bags of words
  • Represented as vectors when used computationally
  • Each vector holds a place for every term in the
    collection
  • Therefore, most vectors are sparse

30
Document Vectors: One location for each word.
(columns: nova, galaxy, heat, h'wood, film, role, diet, fur; blank means 0 occurrences)
  • A: 10 5 3
  • B: 5 10
  • C: 10 8 7
  • D: 9 10 5
  • E: 10 10
  • F: 9 10
  • G: 5 7 9
  • H: 6 10 2 8
  • I: 7 5 1 3

"Nova" occurs 10 times in text A, "Galaxy" occurs
5 times in text A, "Heat" occurs 3 times in text A.
31
Document Vectors
Document ids A-I, one row per document (columns:
nova, galaxy, heat, h'wood, film, role, diet, fur):
  • A: 10 5 3
  • B: 5 10
  • C: 10 8 7
  • D: 9 10 5
  • E: 10 10
  • F: 9 10
  • G: 5 7 9
  • H: 6 10 2 8
  • I: 7 5 1 3
32
We Can Plot the Vectors
[Plot: documents as vectors in a two-dimensional term
space with axes "Star" and "Diet"; the doc about
astronomy and the doc about movie stars lie near the
Star axis, the doc about mammal behavior near the
Diet axis.]
Assumption: Documents that are close in space
are similar.
33
Vector Space Model
  • Documents are represented as vectors in term
    space
  • Terms are usually stems
  • Documents represented by binary vectors of terms
  • Queries represented the same as documents
  • A vector distance measure between the query and
    documents is used to rank retrieved documents
  • Query and Document similarity is based on length
    and direction of their vectors
  • Vector operations to capture boolean query
    conditions
  • Terms in a vector can be weighted in many ways

34
Assigning Weights to Terms
  • Binary Weights
  • Raw term frequency
  • tf x idf
  • Recall the Zipf distribution
  • Want to weight terms highly if they are
  • frequent in relevant documents BUT
  • infrequent in the collection as a whole

35
TF x IDF Weights
  • tf x idf measure
  • Term Frequency (tf)
  • Inverse Document Frequency (idf) -- a way to deal
    with the problems of the Zipf distribution
  • Goal: Assign a tf × idf weight to each term in
    each document

36
TF x IDF Calculation
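The slide's formula was an image; a common formulation, with tf_ij the frequency of term j in document i, N the number of documents in the collection, and n_j the number of documents containing term j, is:

```latex
w_{ij} = tf_{ij} \times \log\!\left(\frac{N}{n_j}\right)
```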
37
Inverse Document Frequency
  • IDF provides high values for rare words and low
    values for common words

For a collection of 10,000 documents:
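The idf values for this 10,000-document collection can be worked out directly (the base-10 log and the function name are illustrative conventions; the original slide's table of values was an image):

```python
import math

def idf(N, df):
    """Inverse document frequency: log10(collection size / document frequency).
    High for rare words (small df), zero for a word in every document."""
    return math.log10(N / df)

N = 10_000
for df in (10_000, 1_000, 100, 1):
    print(f"df={df:>6}: idf={idf(N, df):.1f}")
# df=10000 -> 0.0, df=1000 -> 1.0, df=100 -> 2.0, df=1 -> 4.0
```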
38
TF x IDF Normalization
  • Normalize the term weights (so longer documents
    are not unfairly given more weight)
  • The longer the document, the more likely it is
    for a given term to appear in it, and the more
    often a given term is likely to appear in it. So,
    we want to reduce the importance attached to a
    term appearing in a document based on the length
    of the document.

39
Pair-wise Document Similarity
Documents A-D, one row per document (columns: nova,
galaxy, heat, h'wood, film, role, diet, fur):
  • A: 1 3 1
  • B: 5 2
  • C: 2 1 5
  • D: 4 1

40
Pair-wise Document Similarity(cosine
normalization)
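The cosine-normalized similarity on this slide (its formula was an image) divides the dot product of two term-weight vectors by the product of their lengths; a minimal sketch:

```python
import math

def cosine_sim(u, v):
    """Cosine of the angle between two term-weight vectors:
    dot(u, v) / (|u| * |v|). Length normalization makes the
    score independent of document length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Identical vectors score 1.0, vectors with no terms in common score 0.0.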
41
Problems with Vector Space
  • There is no real theoretical basis for the
    assumption of a term space
  • It is more for visualization than having any real
    basis
  • Most similarity measures work about the same
  • Terms are not really orthogonal dimensions
  • Terms are not independent of all other terms;
    remember our discussion of correlated terms in
    text

42
Query Modification
  • Problem: How can we reformulate the query to help
    a user who is trying several searches to get at
    the same information?
  • Thesaurus expansion
  • Suggest terms similar to query terms
  • Relevance feedback
  • Suggest terms (and documents) similar to
    retrieved documents that have been judged to be
    relevant