Transcript and Presenter's Notes

Title: I256 Applied Natural Language Processing, Fall 2009


1
I256 Applied Natural Language Processing, Fall 2009
  • Lecture 10
  • Classification

Barbara Rosario
2
Today
  • Classification tasks
  • Various issues regarding classification
  • Clustering vs. classification, binary vs.
    multi-way, flat vs. hierarchical classification,
    variants
  • Introduce the steps necessary for a
    classification task
  • Define classes (aka labels)
  • Label text
  • Define and extract features
  • Training and evaluation
  • NLTK example

3
Classification tasks
  • Assign the correct class label for a given
    input/object
  • In basic classification tasks, each input is
    considered in isolation from all other inputs,
    and the set of labels is defined in advance.
  • Examples

Problem                    Object      Labels (categories)
Tagging                    Word        POS
Sense disambiguation       Word        The word's senses
Information retrieval      Document    Relevant / not relevant
Sentiment classification   Document    Positive / negative
Text categorization        Document    Topics / classes
Author identification      Document    Authors
Language identification    Document    Language
Adapted from Foundations of Statistical NLP
(Manning et al)
4
Author identification
  • They agreed that Mrs. X should only hear of the
    departure of the family, without being alarmed on
    the score of the gentleman's conduct but even
    this partial communication gave her a great deal
    of concern, and she bewailed it as exceedingly
    unlucky that the ladies should happen to go away,
    just as they were all getting so intimate
    together.
  • Gas looming through the fog in divers places in
    the streets, much as the sun may, from the
    spongey fields, be seen to loom by husbandman and
    ploughboy. Most of the shops lighted two hours
    before their time--as the gas seems to know, for
    it has a haggard and unwilling look. The raw
    afternoon is rawest, and the dense fog is
    densest, and the muddy streets are muddiest near
    that leaden-headed old obstruction, appropriate
    ornament for the threshold of a leaden-headed old
    corporation, Temple Bar.

5
Author identification
  • Called Stylometry in the humanities
  • Jane Austen (1775-1817), Pride and Prejudice
  • Charles Dickens (1812-70), Bleak House

6
Author identification
  • Federalist papers
  • 77 short essays written in 1787-1788 by Hamilton,
    Jay and Madison to persuade NY to ratify the US
    Constitution, published under a pseudonym
  • The authorship of 12 papers was in dispute
    (disputed papers)
  • In 1964 Mosteller and Wallace solved the problem
  • They identified 70 function words as good
    candidates for authorship analysis
  • Using statistical inference they concluded the
    author was Madison

Mosteller and Wallace, 1964. Inference and
Disputed Authorship: The Federalist.
7
Function words for Author Identification
8
Function words for Author Identification
9
Language identification
  • Tutti gli esseri umani nascono liberi ed eguali
    in dignità e diritti. Essi sono dotati di ragione
    e di coscienza e devono agire gli uni verso gli
    altri in spirito di fratellanza.
  • Alle Menschen sind frei und gleich an Würde und
    Rechten geboren. Sie sind mit Vernunft und
    Gewissen begabt und sollen einander im Geist der
    Brüderlichkeit begegnen.
  • Universal Declaration of Human Rights, UN, in 363
    languages

10
Language identification
  • égaux
  • eguali
  • iguales
  • edistämään
  • Ü
  • How to determine, for a stretch of text, which
    language it is from?
  • Turns out to be really simple
  • Just a few character bigrams can do it (Sibun &
    Reynar, 1996); see the sketch below
  • Using special character sets helps a bit, but
    barely
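
As a rough illustration of the character-bigram idea (not Sibun & Reynar's actual method), a minimal language guesser might look like the sketch below; the training snippets and the guess_language helper are made up for the example:

    from collections import Counter

    def char_bigrams(text):
        """Frequency count of the character bigrams in a text."""
        return Counter(text[i:i + 2] for i in range(len(text) - 1))

    # Tiny made-up training samples; a real system would use far more text.
    profiles = {
        'italian': char_bigrams("tutti gli esseri umani nascono liberi ed eguali"),
        'german': char_bigrams("alle menschen sind frei und gleich an wuerde geboren"),
    }

    def guess_language(text):
        """Pick the profile whose bigram counts overlap most with the input's."""
        bigrams = char_bigrams(text.lower())
        return max(profiles, key=lambda lang: sum(min(count, profiles[lang][bg])
                                                  for bg, count in bigrams.items()))

    print(guess_language("Essi sono dotati di ragione e di coscienza"))  # likely 'italian'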

11
Language Identification
(Sibun & Reynar, 1996)
12
Confusion Matrix
  • A table that shows, for each class, which items
    your algorithm got right and which it got wrong

Gold standard
Algorithm's guess
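
NLTK provides a ConfusionMatrix helper; a minimal sketch with made-up gold-standard and predicted labels:

    import nltk

    # Made-up gold-standard labels and classifier guesses, for illustration only.
    gold = ['en', 'en', 'it', 'de', 'it', 'de', 'en']
    predicted = ['en', 'it', 'it', 'de', 'it', 'en', 'en']

    # Rows correspond to the gold standard, columns to the algorithm's guess.
    print(nltk.ConfusionMatrix(gold, predicted))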
13
(No Transcript)
14
Text categorization
  • Topic categorization: classify the document into
    semantic topics

The U.S. swept into the Davis Cup final on Saturday when twins Bob and Mike Bryan defeated Belarus's Max Mirnyi and Vladimir Voltchkov to give the Americans an unsurmountable 3-0 lead in the best-of-five semi-final tie.

One of the strangest, most relentless hurricane seasons on record reached new bizarre heights yesterday as the plodding approach of Hurricane Jeanne prompted evacuation orders for hundreds of thousands of Floridians and high wind warnings that stretched 350 miles from the swamp towns south of Miami to the historic city of St. Augustine.
15
Text Categorization Applications
  • Web pages organized into category hierarchies
  • Journal articles indexed by subject categories
    (e.g., the Library of Congress, MEDLINE, etc.)
  • Patents archived using International Patent
    Classification
  • Patient records coded using international
    insurance categories
  • E-mail message filtering
  • Spam vs. anti-spam
  • Customer service message classification
  • News events tracked and filtered by topics

16
News topic categorization
  • http://news.google.com/
  • Reuters
  • Gold standard
  • Collection of 21,578 newswire documents
  • For research purposes, a standard text collection
    used to compare systems and algorithms
  • 135 valid topic categories

17
Reuters
  • Top topics in Reuters

18
Reuters
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
19
Outline
  • Classification tasks
  • Various issues regarding classification
  • Clustering vs. classification, binary vs.
    multi-way, flat vs. hierarchical classification,
    variants
  • Introduce the steps necessary for a
    classification task
  • Define classes (aka labels)
  • Label text
  • Define and extract features
  • Training and evaluation
  • NLTK example

20
Classification vs. Clustering
  • Classification assumes labeled data: we know how
    many classes there are and we have examples for
    each class (labeled data)
  • Classification is supervised
  • In clustering we don't have labeled data; we just
    assume that there is a natural division in the
    data, and we may not know how many divisions
    (clusters) there are
  • Clustering is unsupervised

21
Classification
Class1
Class2
22
Classification
Class1
Class2
23
Classification
Class1
Class2
24
Classification
Class1
Class2
25
Clustering
26
Clustering
27
Clustering
28
Clustering
29
Clustering
30
Supervised classification
  • A classifier is called supervised if it is built
    based on training corpora containing the correct
    label for each input.

31
Binary vs. multi-way classification
  • Binary classification: two classes
  • Multi-way classification: more than two classes
  • Sometimes it can be convenient to treat a
    multi-way problem as a set of binary ones: one
    class versus all the others, for each class (see
    the sketch below)
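
A minimal one-versus-rest sketch in Python; the tiny labeled feature sets are made up purely for illustration:

    import nltk

    # Made-up (featureset, label) pairs for a three-class problem.
    labeled = [({'contains(goal)': True}, 'sport'),
               ({'contains(rain)': True}, 'weather'),
               ({'contains(vote)': True}, 'politics'),
               ({'contains(match)': True}, 'sport')]

    # One binary classifier per class: this class versus all the others.
    binary_classifiers = {}
    for c in {label for _, label in labeled}:
        binarized = [(feats, 'yes' if label == c else 'no')
                     for feats, label in labeled]
        binary_classifiers[c] = nltk.NaiveBayesClassifier.train(binarized)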

32
Flat vs. Hierarchical classification
  • Flat classification: relations between the
    classes are undetermined
  • Hierarchical classification: a hierarchy where
    each node is a sub-class of its parent node

33
Variants
  • In single-category text classification, each text
    belongs to exactly one category
  • In multi-category text classification, each text
    can have zero or more categories
  • In open-class classification, the set of labels
    is not defined in advance
  • In sequence classification, a list of inputs is
    jointly classified
  • E.g., POS tagging

34
Reuters (multi-category)
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE> CHICAGO, March 2 - </DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
35
Outline
  • Classification tasks
  • Various issues regarding classification
  • Clustering vs. classification, binary vs.
    multi-way, flat vs. hierarchical classification,
    variants
  • Introduce the steps necessary for a
    classification task
  • Define classes (aka labels)
  • Label text
  • Define and extract features
  • Training and evaluation
  • NLTK example

36
Classification
  • Define classes
  • Label text
  • Extract Features
  • Choose a classifier
  • The Naive Bayes Classifier
  • NN (perceptron)
  • SVM
  • ... (next class)
  • Train it (and test it)
  • Use it to classify new examples

37
Categories (Labels, Classes)
  • Labeling data
  • 2 problems
  • Decide the possible classes (which ones, how
    many)
  • Domain- and application-dependent
  • Trade-off between accuracy and coverage
  • Label text
  • Difficult, time-consuming, inconsistency between
    annotators

38
Cost of Manual Text Categorization
  • Time and money!
  • Yahoo!
  • 200 (?) people for manual labeling of Web pages
  • using a hierarchy of 500,000 categories
  • MEDLINE (National Library of Medicine)
  • $2 million/year for manual indexing of journal
    articles
  • using Medical Subject Headings (MeSH, 18,000
    categories)
  • Mayo Clinic
  • $1.4 million annually for coding patient-record
    events
  • using the International Classification of
    Diseases (ICD) for billing insurance companies
  • US Census Bureau decennial census (1990: 22
    million responses)
  • 232 industry categories and 504 occupation
    categories
  • $15 million if fully done by hand

39
Features
  • >>> text = "Seven-time Formula One champion
    Michael Schumacher took on the Shanghai circuit
    Saturday in qualifying for the first Chinese
    Grand Prix."
  • >>> label = 'sport'
  • >>> labeled_text = LabeledText(text, label)
  • Here the classification takes as input the whole
    string
  • What's the problem with that?
  • What are the features that could be useful for
    this example?

40
Feature terminology
  • Feature: an aspect of the text that is relevant
    to the task
  • Feature value: the realization of the feature in
    the text
  • Some typical features
  • Words present in text: Kerry, Schumacher, China
  • Frequency of a word: Kerry (10), Schumacher (1)
  • Are there dates? Yes/no
  • Capitalization (is the word capitalized?)
  • Are there PERSONS? Yes/no
  • Are there ORGANIZATIONS? Yes/no
  • WordNet: holonyms (China is part of Asia),
    synonyms (China, People's Republic of China,
    mainland China)
  • Chunks, parse trees, POS (see the example below)
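
As a concrete, made-up illustration, the Schumacher sentence above could be reduced to a dictionary of feature values like this; the feature names are invented for the example:

    # Hypothetical feature values extracted from the Schumacher sentence.
    features = {
        'contains(Schumacher)': True,    # boolean: is the word present?
        'count(Schumacher)': 1,          # integer: how often does it occur?
        'has_date': False,               # boolean: are there dates?
        'has_PERSON': True,              # boolean: does a PERSON entity occur?
        'has_ORGANIZATION': False,       # boolean: does an ORGANIZATION occur?
        'first_word_capitalized': True,  # boolean: capitalization clue
    }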

41
Feature Types
  • Boolean (or Binary) Features
  • Features that generate boolean (binary) values.
  • Boolean features are the simplest and the most
    common type of feature.
  • f1(text) = 1 if text contains "Kerry",
    0 otherwise
  • f2(text) = 1 if text contains a PERSON,
    0 otherwise (see the sketch below)
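
A minimal Python sketch of such boolean feature functions; the PERSON check is stubbed with a made-up name list, since in practice it would come from a named-entity recognizer:

    def f1(text):
        """1 if the text contains the word "Kerry", 0 otherwise."""
        return 1 if 'Kerry' in text.split() else 0

    def f2(text, person_names=('Kerry', 'Schumacher')):
        """1 if the text contains a (stubbed) PERSON mention, 0 otherwise."""
        return 1 if any(tok in person_names for tok in text.split()) else 0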

42
Feature Types
  • Integer Features
  • Features that generate integer values.
  • Integer features can be used to give classifiers
    access to more precise information about the
    text.
  • f1(text) = number of times text contains "Kerry"
  • f2(text) = number of times text contains a PERSON
    (see the sketch below)
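
The corresponding count-based versions, again with the PERSON detector stubbed out for illustration:

    def count_kerry(text):
        """Number of times the text contains the word "Kerry"."""
        return text.split().count('Kerry')

    def count_persons(text, person_names=('Kerry', 'Schumacher')):
        """Number of tokens a (stubbed) PERSON detector would flag."""
        return sum(1 for tok in text.split() if tok in person_names)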

43
Feature selection
  • Selecting relevant features and deciding how to
    encode them for a learning method can have an
    enormous impact on the learning method's ability
    to extract a good model
  • How do we choose the right features?
  • Typically, feature extractors are built through a
    process of trial-and-error, guided by intuitions
    about what information is relevant to the
    problem.
  • But there are also more principled ways of
    feature selection

44
Feature selection
  • There are usually limits to the number of
    features that you should use with a given
    learning algorithm: if you provide too many
    features, then the algorithm will have a higher
    chance of relying on idiosyncrasies of your
    training data that don't generalize well to new
    examples.
  • This problem is known as overfitting, and can be
    especially problematic when working with small
    training sets.

45
Feature selection
  • Once an initial set of features has been chosen,
    a very productive method for refining the feature
    set is error analysis. First, we select a
    development set, containing the corpus data for
    creating the model. This development set is then
    subdivided into the training set and the dev-test
    set.
  • The training set is used to train the model, and
    the dev-test set is used to perform error
    analysis.
  • Look at the errors, change features or the model
    (see the sketch below)
  • The test set serves in our final evaluation of
    the system.
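
A minimal sketch of this split, assuming a list of (featureset, label) pairs called featuresets has already been built (e.g. with a feature extractor like the one shown later); the 80/10/10 proportions are just an example:

    import nltk

    # Assume `featuresets` is a list of (features_dict, label) pairs.
    n = len(featuresets)
    train_set = featuresets[:int(0.8 * n)]                 # train the model
    dev_test_set = featuresets[int(0.8 * n):int(0.9 * n)]  # error analysis
    test_set = featuresets[int(0.9 * n):]                  # final evaluation only

    classifier = nltk.NaiveBayesClassifier.train(train_set)

    # Error analysis: look at the dev-test items the model gets wrong.
    errors = [(label, classifier.classify(feats), feats)
              for feats, label in dev_test_set
              if classifier.classify(feats) != label]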

46
Outline
  • Classification tasks
  • Various issues regarding classification
  • Clustering vs. classification, binary vs.
    multi-way, flat vs. hierarchical classification,
    variants
  • Introduce the steps necessary for a
    classification task
  • Define classes (aka labels)
  • Label text
  • Define and extract features
  • Training and evaluation
  • NLTK example

47
Training
  • Adaptation of the classifier to the data
  • Usually the classifier is defined by a set of
    parameters
  • Training is the procedure for finding a good
    set of parameters
  • Goodness is determined by an optimization
    criterion such as misclassification rate
  • Some classifiers are guaranteed to find the
    optimal set of parameters
  • (Next class)

48
(Linear) Classification
Class1
Linear classifier: g(x) = w·x + w0
Parameters: w, w0
Class2
49
(Linear) Classification
Class1
Linear classifier: g(x) = w·x + w0
Changing the parameters w, w0
Class2
50
(Linear) Classification
Class1
Linear classifier: g(x) = w·x + w0
Class2
For each set of parameters w, w0, calculate
error
51
(Linear) Classification
Class1
Linear classifier: g(x) = w·x + w0
Class2
For each set of parameters w, w0, calculate
error
52
(Linear) Classification
Choose the classifier with the lowest rate of
misclassification
Class1
Linear classifier: g(x) = w·x + w0
Class2
For each set of parameters w, w0, calculate
error
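
A minimal sketch of such a linear decision rule, with made-up 2-D feature vectors and parameters; how to choose w and w0 (training) is the subject of the next class:

    def g(x, w, w0):
        """Linear discriminant g(x) = w·x + w0."""
        return sum(wi * xi for wi, xi in zip(w, x)) + w0

    def classify(x, w, w0):
        """Assign Class1 if g(x) >= 0, otherwise Class2."""
        return 'Class1' if g(x, w, w0) >= 0 else 'Class2'

    # Made-up parameters and labeled points, for illustration only.
    w, w0 = [1.0, -0.5], 0.2
    points = [([2.0, 1.0], 'Class1'), ([0.1, 3.0], 'Class2')]

    # Misclassification rate for this particular choice of (w, w0).
    errors = sum(classify(x, w, w0) != label for x, label in points)
    print(errors / len(points))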
53
Testing: evaluation of the classifier
  • After choosing the parameters of the classifier
    (i.e. after training it) we need to test how well
    it's doing on a test set (not included in the
    training set)
  • How trustworthy the model is
  • Evaluation can also be an effective tool for
    guiding us in making future improvements to the
    model.

54
The Test Set
  • This test set typically has the same format as
    the training set
  • It is very important that the test set be
    distinct from the training corpus: if we simply
    re-used the training set as the test set, then a
    model that simply memorized its input, without
    learning how to generalize to new examples, would
    receive misleadingly high scores.
  • When building the test set, there is often a
    trade-off between the amount of data available
    for testing and the amount available for
    training.
  • The more training data the better, but we need to
    make sure the test set is diverse
  • Another consideration when choosing the test set
    is the degree of similarity between instances in
    the test set and those in the development set.
    The more similar these two datasets are, the less
    confident we can be that evaluation results will
    generalize to other datasets.
  • But they can't be totally different either!

55
Accuracy
  • The simplest metric, accuracy, measures the
    percentage of inputs in the test set that the
    classifier correctly labeled (see the sketch
    below).
  • For example, a spam classifier that correctly
    classifies 60 emails in a test set containing
    80 emails has an accuracy of 60/80 = 75%.
  • Important to take into consideration the
    frequencies of the individual class labels
  • If only 1/100 is spam, an accuracy of 90% is bad
  • If ½ is spam, an accuracy of 90% is good
  • This is also why we use precision, recall and
    F-measure
  • Important: compare with fair baselines
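
A minimal sketch of the accuracy computation with made-up gold and predicted labels (given a trained classifier and a labeled test set, nltk.classify.accuracy computes the same number):

    # Made-up gold labels and classifier predictions for 8 test e-mails.
    gold = ['spam', 'ham', 'ham', 'spam', 'ham', 'ham', 'spam', 'ham']
    predicted = ['spam', 'ham', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham']

    correct = sum(g == p for g, p in zip(gold, predicted))
    print(correct / len(gold))  # 0.75: six of the eight labels are right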

56
Evaluating classifiers
  • Contingency table for the evaluation of a binary
    classifier (see the sketch below)

                     GREEN is correct   RED is correct
GREEN was assigned   a                  b
RED was assigned     c                  d

  • Accuracy = (a + d) / (a + b + c + d)
  • Precision: P_GREEN = a / (a + b), P_RED = d / (c + d)
  • Recall: R_GREEN = a / (a + c), R_RED = d / (b + d)
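
A minimal sketch of these formulas in Python, with made-up counts for a, b, c and d:

    # Made-up contingency counts: a = GREEN assigned and correct, b = GREEN
    # assigned but RED correct, c = RED assigned but GREEN correct,
    # d = RED assigned and correct.
    a, b, c, d = 40, 10, 5, 45

    accuracy = (a + d) / (a + b + c + d)
    precision_green, recall_green = a / (a + b), a / (a + c)
    precision_red, recall_red = d / (c + d), d / (b + d)
    print(accuracy, precision_green, recall_green)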

57
Training size
  • The more the better! (usually)
  • Make sure that test set contains instances for
    all classes
  • Results for text classification

From "Improving the Performance of Naive Bayes
for Text Classification", Shen and Yang
58
Training size
From "Improving the Performance of Naive Bayes
for Text Classification", Shen and Yang
59
Training size
From "Improving the Performance of Naive Bayes
for Text Classification", Shen and Yang
60
Training Size
  • Author identification

"Authorship Attribution: a Comparison of Three
Methods", Matthew Care
61
Document classification NLTK example
  • Define a feature extractor: a feature for each
    word, indicating whether the document contains
    that word (see the sketch below)
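
The slide's code itself is not transcribed; a sketch in the spirit of the NLTK book's movie-review example (Chapter 6), which this lecture follows, might look like this — the corpus choice and the 2000-word cutoff are assumptions, not necessarily what the original slide used:

    import random

    import nltk
    from nltk.corpus import movie_reviews

    # Each document is (list of words, category), where category is 'pos' or 'neg'.
    documents = [(list(movie_reviews.words(fileid)), category)
                 for category in movie_reviews.categories()
                 for fileid in movie_reviews.fileids(category)]
    random.shuffle(documents)

    # Use the 2000 most frequent words in the corpus as the feature vocabulary.
    all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
    word_features = [w for w, _ in all_words.most_common(2000)]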

62
Document classification NLTK example
  • Define a feature extractor: a feature for each
    word, indicating whether the document contains
    that word (the extractor function is sketched
    below)
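
Continuing the same sketch, the feature extractor itself might be:

    def document_features(document):
        """Map a document (a list of words) to {'contains(word)': True/False}."""
        document_words = set(document)
        features = {}
        for word in word_features:
            features['contains({})'.format(word)] = (word in document_words)
        return features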

63
Document classification NLTK example
  • Now that we've defined our feature extractor, we
    can use it to train a classifier.
  • To check how reliable the resulting classifier
    is, we compute its accuracy on the test set (see
    the sketch below)
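
A sketch of the training and evaluation step, following the same NLTK book example (the 100-document test split is the book's choice, not necessarily the slide's):

    featuresets = [(document_features(d), c) for (d, c) in documents]
    train_set, test_set = featuresets[100:], featuresets[:100]

    classifier = nltk.NaiveBayesClassifier.train(train_set)
    print(nltk.classify.accuracy(classifier, test_set))  # accuracy on held-out reviews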

64
Document classification NLTK example
  • We can examine the classifier to determine which
    features it found most effective for
    distinguishing the reviews' sentiment (see below)
  • Apparently in this corpus, a review that mentions
    "Seagal" is almost 8 times more likely to be
    negative than positive, while a review that
    mentions "Damon" is about 6 times more likely to
    be positive.
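
In NLTK this is the classifier's show_most_informative_features method; the Seagal and Damon ratios quoted above come from output of this kind:

    # Print the five features with the highest label-likelihood ratios,
    # e.g. "contains(seagal) = True    neg : pos = 8.0 : 1.0".
    classifier.show_most_informative_features(5)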

65
Next class
  • Classification models
  • Reading: Chapter 6 of the NLTK book (especially
    section 6.4 onward)