Introduction to Automatic Text Classification - PowerPoint PPT Presentation

1
Introduction to Automatic Text Classification
  • George Ke
  • 13 Feb 2007

2
Overview
  • What is Text Classification (TC)
  • Motivation of Automatic TC
  • How Automatic TC is done
  • Preprocessing
  • k-Nearest Neighbour
  • How we know it works
  • Example: Email Classification
  • Summary

3
What is Text Classification
  • TC is commonly referred to as the task of
    classifying natural language documents into a
    pre-defined set of semantic categories.
  • For example: Entertainment, Health, Business,
    Technology, etc.

4
Motivation of Automatic TC
  • Categorised data are easier for users to browse
  • Organisational view of data provides more
    effective retrieval
  • Efficient search is not enough

6
Motivation of Automatic TC
  • Manual text classification is time-consuming and
    expensive
  • MEDLINE (National Library of Medicine) indexed
    over 600k citations in 2006 using MEdical Subject
    Headings (23,000 categories)
  • Yahoo! Directories: over 500k categories

7
Motivation of Automatic TC
  • Fatal drug mix killed US R&B star
  • Grammy-nominated R&B star Gerald Levert was
    killed by an accidental mixture of
    over-the-counter and prescription drugs according
    to a US coroner.
  • The singer, who died last November, had pain
    killers, anxiety medication and allergy drugs in
    his bloodstream, said Cleveland coroner Kevin
    Chartrand.
  • The official cause of death was acute
    intoxication, and the death was ruled to be
    accidental, he said.
  • Levert found fame in R&B trio LeVert, and had a
    UK top 10 hit with Casanova.
  • He also recorded as a solo artist, and worked
    with soul legends such as Anita Baker, Barry
    White and Patti LaBelle.
  • --- BBC, Sunday, 11 February 2007, 13:03 GMT

Category: Music? Health? Entertainment? R&B? USA?
Medicine? UK?
8
How Automatic TC is done: Learning Task
  • Binary setting
  • Simplest problem, e.g. spam vs non-spam
  • Multi-Class setting
  • E.g. the task of classifying a news story into
    one of the categories in the BBC directory
  • Can be treated as n binary tasks
  • Multi-Label setting
  • One document can belong to multiple categories,
    exactly one, or no category at all
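
The point that a multi-class task can be treated as n binary tasks can be sketched as follows. This is only an illustration: the function name `to_binary_tasks` and the four example labels are assumptions made up for the sketch, not part of the original slides.

```python
def to_binary_tasks(labels, categories):
    """Turn one multi-class labelling into one yes/no labelling per category."""
    return {c: [label == c for label in labels] for c in categories}


# Hypothetical gold labels for four news stories.
labels = ["Music", "Health", "Music", "Business"]
categories = ["Music", "Health", "Business"]

tasks = to_binary_tasks(labels, categories)
for category, decisions in tasks.items():
    print(category, decisions)
```

Each of the n lists can then be handed to a binary classifier of its own, one per category.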

9
How Automatic TC is done: Knowledge Engineering
  • In the late 1980s
  • Knowledge Engineering
  • Experts hand-craft classification rules
  • Rules
  • Rule 1: (R&B or star or soul) and (singer or
    artist) → Music
  • Rule 2: (drug or prescription) and medication
    → Medicine
  • Rule 3: (anxiety or pain or allergy) and acute
    → Health
  • Rule 4: (play or fame) and award
    → Entertainment
  • Rule ...
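
A minimal sketch of how such hand-crafted rules might be applied to the news story is given here. The tokenizer, the exact-word matching and the names `RULES`, `tokenize` and `classify` are assumptions for illustration; the slides do not say how the rules were evaluated.

```python
import re

# Each rule pairs a category with a test over the set of words in a document.
# The rules mirror Rules 1-4 above; matching is on exact lowercase words.
RULES = [
    ("Music", lambda w: bool(w & {"r&b", "star", "soul"}) and bool(w & {"singer", "artist"})),
    ("Medicine", lambda w: bool(w & {"drug", "prescription"}) and "medication" in w),
    ("Health", lambda w: bool(w & {"anxiety", "pain", "allergy"}) and "acute" in w),
    ("Entertainment", lambda w: bool(w & {"play", "fame"}) and "award" in w),
]


def tokenize(text):
    """Lowercase the text and return its set of words ('&' is kept for 'r&b')."""
    return set(re.findall(r"[a-z&]+", text.lower()))


def classify(text):
    """Return every category whose rule fires (a multi-label decision)."""
    words = tokenize(text)
    return [category for category, rule in RULES if rule(words)]


story = (
    "Grammy-nominated R&B star Gerald Levert was killed by an accidental "
    "mixture of over-the-counter and prescription drugs. The singer had pain "
    "killers, anxiety medication and allergy drugs in his bloodstream. "
    "The official cause of death was acute intoxication."
)
print(classify(story))
```

Note that without stemming the word "drugs" does not match the rule term "drug"; Rule 2 fires here only because "prescription" is present, which hints at why hand-written rules become brittle.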

10
How Automatic TC is done: Knowledge Engineering
  • Still inefficient and impractical when
  • Number of categories is large
  • Category definitions can change over time
  • Personalised application where an
    expert/knowledge engineer is unavailable
  • Inconsistency issues as rule set gets larger

11
How Automatic TC is done: Machine Learning
  • Since the 1990s
  • The learning algorithm is given a small set of
    manually classified documents (training
    documents/dataset)
  • Documents to be classified are the test
    documents/dataset
  • The algorithm produces a classification rule
    automatically
  • A.k.a. a supervised learning problem
  • But how do we make the learning algorithm learn
    from the training documents?

12
How Automatic TC is done: Machine Learning -
Preprocessing
  • Pre-processing
  • Representing Text
  • Bag-of-words approach: Term Frequency (TF)
  • Feature selection
  • Stopword removal
  • Feature construction
  • Stemming
  • Term weighting: DF, IDF
  • The bag-of-words approach may not be the best
    method for other languages
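
The preprocessing steps listed above can be sketched roughly as follows. Everything specific here is an assumption for illustration: the tiny stopword list, the crude suffix-stripping `stem` function (a stand-in for a real stemmer such as Porter's), and the IDF variant log(N / DF).

```python
import math
import re
from collections import Counter

# Assumed toy stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "was", "by", "in", "to", "had"}


def stem(word):
    """Crude suffix stripping, standing in for a proper stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Tokenize, remove stopwords, then stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


def term_frequencies(text):
    """Bag-of-words representation: term -> number of occurrences (TF)."""
    return Counter(preprocess(text))


def inverse_document_frequencies(docs):
    """IDF = log(N / DF), where DF is the number of documents containing the term."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(preprocess(doc)))
    return {term: math.log(n / count) for term, count in df.items()}


docs = [
    "Fatal drug mix killed US star",
    "The singer had prescription drugs and allergy drugs",
]
tf = term_frequencies(docs[1])
idf = inverse_document_frequencies(docs)
print(tf)
print(idf)
```

A term that occurs in every document (here "drug") gets an IDF of zero, so it contributes nothing to distinguishing the documents.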

13
How Automatic TC is done: Machine Learning -
Preprocessing
  • Fatal drug mix killed US R&B star
  • Grammy-nominated R&B star Gerald Levert was
    killed by an accidental mixture of
    over-the-counter and prescription drugs according
    to a US coroner.
  • The singer, who died last November, had pain
    killers, anxiety medication and allergy drugs in
    his bloodstream, said Cleveland coroner Kevin
    Chartrand.
  • The official cause of death was acute
    intoxication, and the death was ruled to be
    accidental, he said.
  • Levert found fame in R&B trio LeVert, and had a
    UK top 10 hit with Casanova.
  • He also recorded as a solo artist, and worked
    with soul legends such as Anita Baker, Barry
    White and Patti LaBelle.
  • --- BBC, Sunday, 11 February 2007, 13:03 GMT

19
How Automatic TC is done: Machine Learning - kNN
  • k-Nearest Neighbour (kNN)
  • Documents located close to each other are more
    likely to belong to the same class
  • k is a pre-defined parameter that determines
    how many neighbouring training documents are
    considered when classifying a test document
  • k is an integer, e.g. 1, 3, 5, 7, 10
  • Cosine similarity is commonly used to determine
    the closeness of two documents
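
As a rough illustration of the closeness measure, cosine similarity between two bag-of-words vectors could be computed like this. Representing a document as a dict from term to weight, and returning 0.0 for an empty document, are assumptions of the sketch.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for empty documents
    return dot / (norm_a * norm_b)


doc1 = {"drug": 2, "singer": 1}
doc2 = {"drug": 1}
doc3 = {"award": 1}
print(cosine_similarity(doc1, doc2))
print(cosine_similarity(doc1, doc3))
```

Documents that share no terms score 0, and documents whose term weights are proportional score 1, regardless of document length.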

20
How Automatic TC is done: Machine Learning - kNN

21
How Automatic TC is done: Machine Learning - kNN
  • Majority voting scheme
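
A minimal sketch of majority voting, assuming the labels of the k nearest neighbours have already been found. The function name is hypothetical, and ties are broken in favour of the label seen first, which is an arbitrary choice of this sketch.

```python
from collections import Counter


def majority_vote(neighbour_labels):
    """Return the label that occurs most often among the k nearest neighbours."""
    return Counter(neighbour_labels).most_common(1)[0][0]


print(majority_vote(["black", "white", "black"]))
```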

22
How Automatic TC is done: Machine Learning - kNN
  • Weighted-sum voting scheme

23
How Automatic TC is done: Machine Learning - kNN
  • The score for a category is the sum of the
    similarity scores between the point to be
    classified and all of its k-neighbours that
    belong to the given category.
  • To restate:
    $score(x, c) = \sum_{d \in kNN(x)} sim(x, d) \, I(d, c)$
    where
    x is the new point,
    c is a class (e.g. black or white),
    d is a classified point among the k-nearest
    neighbours of x,
    sim(x, d) is the similarity between x and d,
    I(d, c) = 1 if point d belongs to class c,
    I(d, c) = 0 otherwise.
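
The weighted-sum scheme described above might look like this in code. It is a sketch under the assumption that the k nearest neighbours are supplied as (similarity, label) pairs; the function name and the example values are made up.

```python
from collections import defaultdict


def weighted_sum_vote(neighbours):
    """Sum the similarities per class over the k nearest neighbours.

    neighbours: list of (similarity, label) pairs.
    Returns the best-scoring label and the per-class scores.
    """
    scores = defaultdict(float)
    for similarity, label in neighbours:
        scores[label] += similarity
    best = max(scores, key=scores.get)
    return best, dict(scores)


# Hypothetical neighbours for k = 3: one very close "black" point,
# two more distant "white" points.
neighbours = [(0.9, "black"), (0.5, "white"), (0.3, "white")]
label, scores = weighted_sum_vote(neighbours)
print(label, scores)
```

In this example majority voting would choose "white" (two votes to one), while the weighted sum chooses "black" because its single neighbour is much closer.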

24
Exercise
  • Imagine a language that is made up of five
    English letters, A, B, C, D and E, with B, D and E
    being stopwords. The kNN system has been
    trained with 3 training documents, which belong
    to TWO different categories (see below), and the
    task is to classify a new document (test
    document) into one of the two categories using
    the process of automatic text classification with
    kNN (k = 1).
  • Preprocessed Training Documents

Unpreprocessed Test Document
25
How we know it works
  • Given n test documents and m categories under
    consideration, a classifier makes n × m binary
    decisions. A two-by-two contingency table can be
    computed for each category
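
One way to compute the two-by-two contingency table for a single category is sketched here. It assumes each document has exactly one true and one predicted label; the function name and the example labels are hypothetical.

```python
def contingency_table(true_labels, predicted_labels, category):
    """Count TP, FP, FN and TN for one category over all test documents."""
    tp = fp = fn = tn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == category and truth == category:
            tp += 1  # correctly assigned to the category
        elif predicted == category:
            fp += 1  # assigned to the category, but belongs elsewhere
        elif truth == category:
            fn += 1  # belongs to the category, but was missed
        else:
            tn += 1  # correctly left out of the category
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}


true_labels = ["Music", "Health", "Music", "Business"]
predicted_labels = ["Music", "Music", "Health", "Business"]
table = contingency_table(true_labels, predicted_labels, "Music")
print(table)
```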

26
How we know it works
  • Performance measures
  • Precision (p)
  • Recall (r)
  • F1-measure
  • Accuracy

27
How we know it works
  • Precision = TP/(TP + FP), where TP + FP > 0
    (otherwise undefined).
  • Of the times we predicted it was in the class,
    how often were we correct?
  • Recall = TP/(TP + FN), where TP + FN > 0
    (otherwise undefined).
  • Did we find all of those that belonged in the
    class?
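
The two definitions translate directly into code. Returning `None` for the undefined case is an assumption of this sketch, as are the example counts.

```python
def precision(tp, fp):
    """TP / (TP + FP); None when no document was assigned to the class."""
    return tp / (tp + fp) if tp + fp > 0 else None


def recall(tp, fn):
    """TP / (TP + FN); None when no document belongs to the class."""
    return tp / (tp + fn) if tp + fn > 0 else None


# Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
print(precision(8, 2))
print(recall(8, 8))
```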

28
How we know it works
  • F1-measure = 2(p × r)/(p + r)
  • The harmonic mean of precision and recall, with
    both weighted equally
  • Single performance measure to compare different
    learning algorithms
  • Accuracy = (No. of TP for all categories) /
    (No. of all test documents)
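
A small sketch of both measures follows. Returning 0.0 when p + r = 0 is an assumed convention, and the accuracy function assumes one true and one predicted label per document, so that the true positives summed over all categories equal the number of correctly classified documents.

```python
def f1_measure(p, r):
    """Harmonic mean of precision p and recall r."""
    if p + r == 0:
        return 0.0  # assumed convention when both are zero
    return 2 * p * r / (p + r)


def accuracy(true_labels, predicted_labels):
    """Fraction of test documents whose predicted label is correct."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


print(f1_measure(0.8, 0.5))
print(accuracy(["a", "b", "a", "c"], ["a", "a", "a", "c"]))
```

Because the harmonic mean is pulled towards the smaller value, a classifier cannot get a high F1 by maximising only one of precision or recall.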

29
Example: Email Classification
  • Emails are classified into folders
  • Multi-class setting
  • Emails are constantly being received
  • kNN is updated weekly, i.e. add received emails
    that were foldered to the training dataset
  • Text in email body and sender field is used to
    represent an email
  • BOW representation, stemming but no stopword
    removal
  • Dataset: Enron Email Corpus
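
Since kNN keeps its training documents rather than building a separate model, the weekly update can be pictured as simply appending the newly foldered emails. The sketch below is an assumption-laden illustration, not the system used in the experiments: the class name `FolderKnn`, the weighted-sum voting, and the toy term-frequency dicts are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency dicts."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FolderKnn:
    def __init__(self, k=1):
        self.k = k
        self.training = []  # list of (term-frequency dict, folder name)

    def add_week(self, foldered_emails):
        """Weekly update: add the emails the user has filed into folders."""
        self.training.extend(foldered_emails)

    def classify(self, email):
        """Suggest a folder using the k most similar training emails."""
        ranked = sorted(self.training, key=lambda ex: cosine(email, ex[0]), reverse=True)
        scores = {}
        for features, folder in ranked[: self.k]:
            scores[folder] = scores.get(folder, 0.0) + cosine(email, features)
        return max(scores, key=scores.get) if scores else None


model = FolderKnn(k=1)
model.add_week([({"meeting": 1, "agenda": 1}, "work"), ({"party": 1, "friday": 1}, "social")])
first = model.classify({"agenda": 2})
model.add_week([({"invoice": 1, "payment": 1}, "finance")])
second = model.classify({"invoice": 1})
print(first, second)
```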

30
Example: Email Classification
  • Results
  • User ID 5 received 87 emails in 18 weeks and
    keeps them in 7 folders
  • kNN correctly classified 72 emails
  • Accuracy = 72 / 87 = 0.8276 = 82.76%
  • User ID 70 received 881 emails in 114 weeks and
    keeps them in 69 folders
  • kNN correctly classified 517 emails
  • Accuracy = 517 / 881 = 0.5868 = 58.68%
  • More folders mean a more complex classification
    problem

31
Summary
  • Categorised data means more effective retrieval
    and search
  • Exponential growth of the number of electronic
    documents makes automatic TC a must
  • Simple yet robust techniques can deliver
    practical solutions to real-world problems
  • kNN is one of the most effective methods (and
    arguably the simplest)
  • Personal Information Management (PIM) is a new
    direction for TC

32
Other Resources
  • Sebastiani, F. Machine Learning in Automated Text
    Categorization. ACM Computing Surveys, Vol. 34,
    No. 1, 2002.
  • Joachims, T. Learning to Classify Text Using
    Support Vector Machines: Methods, Theory and
    Algorithms. Kluwer Academic Publishers, 2002.