SIMS 290-2: Applied Natural Language Processing

1
SIMS 290-2 Applied Natural Language Processing
Barbara Rosario, Sept 27, 2004
2
Today
  • Classification
  • Text categorization (and other applications)
  • Various issues regarding classification
  • Clustering vs. classification, binary vs.
    multi-way, flat vs. hierarchical classification
  • Introduce the steps necessary for a
    classification task
  • Define classes
  • Label text
  • Features
  • Training and evaluation of a classifier

3
Classification
  • Goal: Assign objects from a universe to two or
    more classes or categories
  • Examples (Problem | Object | Categories)
  • Tagging | Word | POS
  • Sense disambiguation | Word | The word's senses
  • Information retrieval | Document | Relevant/not relevant
  • Sentiment classification | Document | Positive/negative
  • Author identification | Document | Authors

4
Author identification
  • They agreed that Mrs. X should only hear of the
    departure of the family, without being alarmed on
    the score of the gentleman's conduct; but even
    this partial communication gave her a great deal
    of concern, and she bewailed it as exceedingly
    unlucky that the ladies should happen to go away,
    just as they were all getting so intimate
    together.
  • Gas looming through the fog in divers places in
    the streets, much as the sun may, from the
    spongey fields, be seen to loom by husbandman and
    ploughboy. Most of the shops lighted two hours
    before their time--as the gas seems to know, for
    it has a haggard and unwilling look. The raw
    afternoon is rawest, and the dense fog is
    densest, and the muddy streets are muddiest near
    that leaden-headed old obstruction, appropriate
    ornament for the threshold of a leaden-headed old
    corporation, Temple Bar.

5
Author identification
  • Jane Austen (1775-1817), Pride and Prejudice
  • Charles Dickens (1812-70), Bleak House

6
Author identification
  • Federalist papers
  • 77 short essays written in 1787-1788 by Hamilton,
    Jay and Madison to persuade NY to ratify the US
    Constitution, published under a pseudonym
  • The authorship of 12 papers was in dispute
    (the disputed papers)
  • In 1964 Mosteller and Wallace solved the problem
  • They identified 70 function words as good
    candidates for authorship analysis
  • Using statistical inference they concluded the
    author was Madison
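
As a rough illustration of the kind of evidence function words provide, the sketch below computes how often a few function words occur per 1,000 tokens. The word list, the tokenizer, the sample sentence, and the name function_word_rates are assumptions made for illustration; this is not the actual list of 70 words or the statistical model used by Mosteller and Wallace.

```python
import re
from collections import Counter

# Illustrative function words only; NOT the actual Mosteller and Wallace list.
FUNCTION_WORDS = ["the", "of", "and", "to", "upon", "by"]

def function_word_rates(text, words=FUNCTION_WORDS):
    """Return occurrences per 1,000 tokens for each function word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {w: 1000.0 * counts[w] / total for w in words}

# Made-up sample sentence (12 tokens).
sample = "The power of the people depends upon the consent of the governed."
rates = function_word_rates(sample)
```

Comparing such rate profiles between the disputed papers and papers of known authorship is one simple way to turn writing style into numeric features.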

7
Function words for Author Identification
8
Function words for Author Identification
9
Classification
  • Goal: Assign objects from a universe to two or
    more classes or categories
  • Examples (Problem | Object | Categories)
  • Author identification | Document | Authors
  • Language identification | Document | Language
10
Language identification
  • Tutti gli esseri umani nascono liberi ed eguali
    in dignità e diritti. Essi sono dotati di ragione
    e di coscienza e devono agire gli uni verso gli
    altri in spirito di fratellanza.
  • Alle Menschen sind frei und gleich an Würde und
    Rechten geboren. Sie sind mit Vernunft und
    Gewissen begabt und sollen einander im Geist der
    Brüderlichkeit begegnen.
  • Universal Declaration of Human Rights, UN, in 363
    languages

11
Language identification
  • égaux
  • eguali
  • iguales
  • edistämään
  • Ü
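
The examples above suggest that short character sequences can be enough to tell languages apart. A minimal sketch of that idea, assuming character trigram overlap as the scoring method (an assumption, not necessarily the method the lecture had in mind) and using only the two UDHR sentences quoted earlier as tiny profiles:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny profiles built from the UDHR sentences quoted on the earlier slide.
PROFILES = {
    "Italian": char_ngrams(
        "Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti."),
    "German": char_ngrams(
        "Alle Menschen sind frei und gleich an Würde und Rechten geboren."),
}

def identify_language(text, profiles=PROFILES):
    """Pick the language whose profile shares the most n-grams with the text."""
    grams = char_ngrams(text)

    def overlap(profile):
        # Counter returns 0 for n-grams that the profile has never seen.
        return sum(min(count, profile[g]) for g, count in grams.items())

    return max(profiles, key=lambda lang: overlap(profiles[lang]))
```

A realistic system would build profiles from much more text per language; the two-sentence profiles here only demonstrate the mechanics.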

12
Classification
  • Goal: Assign objects from a universe to two or
    more classes or categories
  • Examples (Problem | Object | Categories)
  • Author identification | Document | Authors
  • Language identification | Document | Language
  • Text categorization | Document | Topics

13
Text categorization
  • Topic categorization: classify the document into
    semantic topics
  • The U.S. swept into the Davis Cup final on
    Saturday when twins Bob and Mike Bryan defeated
    Belarus's Max Mirnyi and Vladimir Voltchkov to
    give the Americans an unsurmountable 3-0 lead in
    the best-of-five semi-final tie.
  • One of the strangest, most relentless hurricane
    seasons on record reached new bizarre heights
    yesterday as the plodding approach of Hurricane
    Jeanne prompted evacuation orders for hundreds of
    thousands of Floridians and high wind warnings
    that stretched 350 miles from the swamp towns
    south of Miami to the historic city of St.
    Augustine.
14
Text categorization
  • http://news.google.com/
  • Reuters
  • Collection of 21,578 newswire documents
  • For research purposes: a standard text collection
    to compare systems and algorithms
  • 135 valid topic categories

15
Reuters
  • Top topics in Reuters

16
Reuters
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN"
CGISPLIT="TRAINING-SET" OLDID="12981"
NEWID="798"> <DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS> <TEXT><TITLE>
AMERICAN PORK CONGRESS KICKS OFF
TOMORROW</TITLE> <DATELINE> CHICAGO, March 2 -
</DATELINE><BODY>The American Pork Congress kicks
off tomorrow, March 3, in Indianapolis with 160
of the nations pork producers from 44 member
states determining industry positions on a number
of issues, according to the National Pork
Producers Council, NPPC. Delegates to the
three day Congress will be considering 26
resolutions concerning various issues, including
the future direction of farm policy and the tax
law as it applies to the agriculture sector. The
delegates will also debate whether to endorse
concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the
congress, will feature the latest in technology
in all areas of the industry, the NPPC added.
Reuter &#3;</BODY></TEXT></REUTERS>
17
Text categorization examples
  • Topic categorization
  • http://news.google.com/
  • Reuters.
  • Spam filtering
  • Determine if a mail message is spam (or not)
  • Customer service message classification

18
Classification vs. Clustering
  • Classification assumes labeled data: we know how
    many classes there are and we have examples for
    each class (labeled data).
  • Classification is supervised
  • In clustering we don't have labeled data: we just
    assume that there is a natural division in the
    data, and we may not know how many divisions
    (clusters) there are
  • Clustering is unsupervised

21
Classification
Class1
Class2
22
Classification
Class1
Class2
27
Clustering
28
Categories (Labels, Classes)
  • Labeling data
  • 2 problems
  • Decide the possible classes (which ones, how
    many)
  • Domain and application dependent
  • http://news.google.com
  • Label text
  • Difficult, time consuming, inconsistency between
    annotators

29
Reuters
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN"
CGISPLIT="TRAINING-SET" OLDID="12981"
NEWID="798"> <DATE> 2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS> <TEXT><TITLE>
AMERICAN PORK CONGRESS KICKS OFF
TOMORROW</TITLE> <DATELINE> CHICAGO, March 2 -
</DATELINE><BODY>The American Pork Congress kicks
off tomorrow, March 3, in Indianapolis with 160
of the nations pork producers from 44 member
states determining industry positions on a number
of issues, according to the National Pork
Producers Council, NPPC. Delegates to the
three day Congress will be considering 26
resolutions concerning various issues, including
the future direction of farm policy and the tax
law as it applies to the agriculture sector. The
delegates will also debate whether to endorse
concepts of a national PRV (pseudorabies virus)
control and eradication program, the NPPC said.
A large trade show, in conjunction with the
congress, will feature the latest in technology
in all areas of the industry, the NPPC added.
Reuter &#3;</BODY></TEXT></REUTERS>
Why not topic = policy?
30
Binary vs. multi-way classification
  • Binary classification: two classes
  • Multi-way classification: more than two classes
  • Sometimes it can be convenient to treat a
    multi-way problem like a binary one: one class
    versus all the others, for all classes
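
The "one class versus all the others" reduction can be sketched as a relabeling step. The function name and the label list below are made up for illustration:

```python
def one_vs_rest_labels(labels, positive_class):
    """Relabel a multi-way problem as binary: positive_class vs. all others."""
    return [1 if label == positive_class else 0 for label in labels]

# Made-up labels for four documents.
labels = ["sport", "health", "world", "sport"]

# One binary problem per class ("for all classes").
binary_problems = {c: one_vs_rest_labels(labels, c) for c in sorted(set(labels))}
```

Each entry of binary_problems can then be handed to any binary classifier; at prediction time one would typically pick the class whose binary classifier is most confident.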

31
Flat vs. Hierarchical classification
  • Flat classification: relations between the
    classes are undetermined
  • Hierarchical classification: a hierarchy where
    each node is a sub-class of its parent node

32
Single- vs. multi-category classification
  • In single-category text classification each text
    belongs to exactly one category
  • In multi-category text classification, each text
    can have zero or more categories

33
LabeledText class in NLTK
  • LabeledText class
  • >>> text = "Seven-time Formula One champion
    Michael Schumacher took on the Shanghai circuit
    Saturday in qualifying for the first Chinese
    Grand Prix."
  • >>> label = "sport"
  • >>> labeled_text = LabeledText(text, label)
  • >>> labeled_text.text()
  • Seven-time Formula One champion Michael
    Schumacher took on the Shanghai circuit Saturday
    in qualifying for the first Chinese Grand Prix.
  • >>> labeled_text.label()
  • sport

34
NLTK The Classifier Interface
  • classify: determines which label is most
    appropriate for a given text token, and returns a
    labeled text token with that label.
  • labels: returns the list of category labels that
    are used by the classifier.
  • >>> token = Token("The World Health Organization
    is recommending more importance be attached to
    the prevention of heart disease and other
    cardiovascular ailments rather than focusing on
    treatment.")
  • >>> my_classifier.classify(token)
  • The World Health Organization is recommending
    more importance be attached to the prevention of
    heart disease and other cardiovascular ailments
    rather than focusing on treatment./health
  • >>> my_classifier.labels()
  • ("sport", "health", "world",)

35
Features
  • >>> text = "Seven-time Formula One champion
    Michael Schumacher took on the Shanghai circuit
    Saturday in qualifying for the first Chinese
    Grand Prix."
  • >>> label = "sport"
  • >>> labeled_text = LabeledText(text, label)
  • Here the classification takes as input the whole
    string
  • What's the problem with that?
  • What are the features that could be useful for
    this example?

36
Feature terminology
  • Feature: An aspect of the text that is relevant
    to the task
  • Some typical features
  • Words present in text
  • Frequency of words
  • Capitalization
  • Are there named entities (NE)?
  • WordNet
  • Others?

37
Feature terminology
  • Feature: An aspect of the text that is relevant
    to the task
  • Feature value: the realization of the feature in
    the text
  • Words present in text: Kerry, Schumacher, China
  • Frequency of word: Kerry(10), Schumacher(1)
  • Are there dates? Yes/no
  • Are there PERSONS? Yes/no
  • Are there ORGANIZATIONS? Yes/no
  • WordNet: Holonyms (China is part of Asia),
    Synonyms (China, People's Republic of China,
    mainland China)

38
Feature Types
  • Boolean (or Binary) Features
  • Features that generate boolean (binary) values.
  • Boolean features are the simplest and the most
    common type of feature.
  • f1(text) = 1 if text contains Kerry,
    0 otherwise
  • f2(text) = 1 if text contains PERSON,
    0 otherwise
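
A minimal sketch of the two boolean features in plain Python. The tokenizer and the KNOWN_PERSONS set are assumptions: the set is a hypothetical stand-in for a real named-entity recognizer, which the slide does not specify.

```python
# Hypothetical stand-in for a named-entity recognizer that tags PERSON.
KNOWN_PERSONS = {"Kerry", "Schumacher"}

def tokenize(text):
    """Split on whitespace and strip surrounding punctuation (a crude tokenizer)."""
    return [tok.strip(".,;:!?\"'") for tok in text.split()]

def f1(text):
    """1 if the text contains the word Kerry, 0 otherwise."""
    return 1 if "Kerry" in tokenize(text) else 0

def f2(text):
    """1 if the text contains a (known) PERSON, 0 otherwise."""
    return 1 if any(tok in KNOWN_PERSONS for tok in tokenize(text)) else 0

text = "Michael Schumacher took on the Shanghai circuit Saturday."
features = (f1(text), f2(text))
```

For the sample sentence, f1 is 0 (no Kerry) while f2 is 1 (Schumacher is in the assumed person list).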

39
Feature Types
  • Integer Features
  • Features that generate integer values.
  • Integer features can be used to give classifiers
    access to more precise information about the
    text.
  • f1(text) = number of times text contains Kerry
  • f2(text) = number of times text contains PERSON
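
An integer feature can be sketched as a simple count. The tokenizer and the sample text are assumptions for illustration; note that the corresponding boolean feature is recoverable from the count, but not the other way around.

```python
def tokenize(text):
    """Split on whitespace and strip surrounding punctuation (a crude tokenizer)."""
    return [tok.strip(".,;:!?\"'") for tok in text.split()]

def count_word(text, word):
    """Integer feature: number of times the word occurs in the text."""
    return tokenize(text).count(word)

# Made-up sample text.
text = "Kerry spoke in Ohio. Kerry then flew to Iowa."
kerry_count = count_word(text, "Kerry")
kerry_present = int(kerry_count > 0)  # the boolean feature derived from the count
```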

40
Features in NLTK
  • Feature Detectors
  • Features can be defined using feature detector
    functions, which map LabeledTexts to values
  • The detect method takes a labeled text and
    returns a feature value.
  • >>> def ball(ltext):
  • ...     return ("ball" in ltext.text())
  • >>> fdetector = FunctionFeatureDetector(ball)
  • >>> document1 = "John threw the ball over the
    fence".split()
  • >>> fdetector.detect(LabeledText(document1))
  • 1
  • >>> document2 = "Mary solved the
    equation".split()
  • >>> fdetector.detect(LabeledText(document2))
  • 0

41
Features in NLTK
  • Feature Detector Lists: data structures that
    represent the feature detector functions for a
    set of features.
  • Feature Value Lists

42
Feature selection
  • How do we choose the right features?
  • Next lecture

43
Classification
  • Define classes
  • Label text
  • Extract Features
  • Choose a classifier
  • >>> my_classifier.classify(token)
  • The Naive Bayes Classifier
  • NN (perceptron)
  • SVM
  • ... (next Monday)
  • Train it (and test it)
  • Use it to classify new examples

44
Training
  • (We'll see exactly what we mean by training
    when we talk about the algorithms)
  • Adaptation of the classifier to the data
  • Usually the classifier is defined by a set of
    parameters
  • Training is the procedure for finding a good
    set of parameters
  • Goodness is determined by an optimization
    criterion such as misclassification rate
  • Some classifiers are guaranteed to find the
    optimal set of parameters

45
(Linear) Classification
Class1
Linear classifier: g(x) = wx + w0
Parameters: w, w0
Class2
46
(Linear) Classification
Class1
Linear classifier: g(x) = wx + w0
Changing the parameters w, w0
Class2
47
(Linear) Classification
Class1
Linear classifier: g(x) = wx + w0
Class2
For each set of parameters w, w0, calculate the
error
48
(Linear) Classification
Class1
Linear classifier: g(x) = wx + w0
Class2
For each set of parameters w, w0, calculate the
error
49
(Linear) Classification
Choose the classifier with the lowest rate of
misclassification
Class1
Linear classifier: g(x) = wx + w0
Class2
For each set of parameters w, w0, calculate the
error
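
The procedure on these slides (try parameter settings, compute the error for each, keep the best) can be sketched as follows. The data points, the three candidate settings, and the sign convention (Class1 when g(x) >= 0) are all assumptions for illustration; real training algorithms search the parameter space far more efficiently than trying a hand-picked list.

```python
def g(x, w, w0):
    """Linear discriminant g(x) = w.x + w0 for a feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

def predict(x, w, w0):
    """Assign Class1 when g(x) >= 0, Class2 otherwise (assumed convention)."""
    return "Class1" if g(x, w, w0) >= 0 else "Class2"

def error_rate(data, w, w0):
    """Fraction of labeled points that the classifier gets wrong."""
    wrong = sum(1 for x, label in data if predict(x, w, w0) != label)
    return wrong / len(data)

# Made-up 2-D training points.
train = [((2.0, 2.0), "Class1"), ((3.0, 1.0), "Class1"),
         ((-1.0, -2.0), "Class2"), ((-2.0, 0.5), "Class2")]

# Hand-picked candidate parameter settings (w, w0).
candidates = [((1.0, 0.0), 0.0), ((0.0, 1.0), 0.0), ((-1.0, 0.0), 0.0)]

# Keep the setting with the lowest misclassification rate on the training data.
best_w, best_w0 = min(candidates, key=lambda p: error_rate(train, p[0], p[1]))
```

On this toy data the first candidate separates the two classes perfectly, the second misclassifies one point in four, and the third gets every point wrong.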
50
Testing, evaluation of the classifier
  • After choosing the parameters of the classifier
    (i.e. after training it) we need to test how well
    it's doing on a test set (not included in the
    training set)
  • Calculate the misclassification rate on the test
    set
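
A minimal sketch of held-out evaluation, assuming a random split and a classifier exposed as a plain function. The example data, the 25% test fraction, and the always_sport stand-in classifier are made up for illustration.

```python
import random

def train_test_split(examples, test_fraction=0.25, seed=0):
    """Shuffle the labeled examples and hold out a fraction for testing."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def misclassification_rate(classify, test_set):
    """Fraction of test examples whose predicted label is wrong."""
    wrong = sum(1 for item, label in test_set if classify(item) != label)
    return wrong / len(test_set)

# Made-up labeled examples (text, label).
examples = [("race circuit", "sport"), ("heart disease", "health"),
            ("grand prix", "sport"), ("flu vaccine", "health"),
            ("tennis final", "sport"), ("blood pressure", "health"),
            ("cup semi-final", "sport"), ("clinic visit", "health")]

train_set, test_set = train_test_split(examples)

def always_sport(text):
    """Hypothetical stand-in for a trained classifier."""
    return "sport"

test_error = misclassification_rate(always_sport, test_set)
```

The key point is that test_set is never seen during training, so test_error estimates how the classifier behaves on new examples.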

51
Evaluating classifiers
  • Contingency table for the evaluation of a binary
    classifier

                       GREEN is correct   RED is correct
  GREEN was assigned          a                 b
  RED was assigned            c                 d

  • Accuracy = (a+d)/(a+b+c+d)
  • Precision: P_GREEN = a/(a+b), P_RED = d/(c+d)
  • Recall: R_GREEN = a/(a+c), R_RED = d/(b+d)
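
The measures above follow directly from the four cells of the contingency table. In the sketch below the counts passed to evaluate are made up for illustration.

```python
def evaluate(a, b, c, d):
    """Accuracy, precision and recall from a 2x2 contingency table.

    a: GREEN assigned, GREEN correct    b: GREEN assigned, RED correct
    c: RED assigned, GREEN correct      d: RED assigned, RED correct
    """
    return {
        "accuracy": (a + d) / (a + b + c + d),
        "precision_green": a / (a + b),
        "precision_red": d / (c + d),
        "recall_green": a / (a + c),
        "recall_red": d / (b + d),
    }

# Made-up counts for 100 test examples.
metrics = evaluate(a=40, b=10, c=5, d=45)
```

With these counts, accuracy is 85/100, precision for GREEN is 40/50, and recall for GREEN is 40/45.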

52
Training size
  • The more the better! (usually)
  • Results for text classification

53
Training size
54
Training size
55
Training Size
  • Author identification

56
Next Time and Upcoming
  • Define classes
  • Label text
  • Features (Wednesday)
  • Classifiers (next week)
  • The Naive Bayes Classifier
  • NN (perceptron)
  • SVM
  • Decision trees
  • K nearest neighbor
  • Maximum Entropy models