1
SIMS 290-2: Applied Natural Language Processing
Marti Hearst, Sept 13, 2004
2
Today
  • Purpose of Part-of-Speech Tagging
  • Training and Testing Collections
  • Intro to N-grams and Language Modeling
  • Using NLTK for POS Tagging

3
Class Exercise
  • I will read off a few words from the beginning of
    a sentence
  • You should write down the very first 2 words that
    come to mind that should follow these words.
  • Example
  • I say: "One fish"
  • You write: "two fish"
  • Don't second-guess or try to be clever.
  • Note: there are no correct answers.

4
Terminology
  • Tagging
  • The process of associating labels with each token
    in a text
  • Tags
  • The labels
  • Tag Set
  • The collection of tags used for a particular task

5
Example
  • Typically a tagged text is a sequence of
    white-space separated base/tag tokens
  • The/at Pantheon's/np$ interior/nn ,/, still/rb in/in its/pp$
    original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at
    architectural/jj triumph/nn ./. Its/pp$ rotunda/nn forms/vbz a/at
    perfect/jj circle/nn whose/wp$ diameter/nn is/bez equal/jj to/in
    the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
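
Since each token is just base/tag joined by a slash, this format is easy to pull apart by hand. A minimal Python sketch (not from the slides; the string is abbreviated):

    # Split whitespace-separated base/tag tokens into (word, tag) pairs.
    tagged = "The/at interior/nn ,/, still/rb in/in its/pp$ original/jj form/nn"
    pairs = [tok.rsplit("/", 1) for tok in tagged.split()]
    print(pairs[:3])  # [['The', 'at'], ['interior', 'nn'], [',', ',']]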

6
What does Tagging do?
  • Collapses Distinctions
  • Lexical identity may be discarded
  • E.g., all personal pronouns tagged with PRP
  • Introduces Distinctions
  • Ambiguities may be removed
  • E.g., "deal" tagged with NN or VB
  • E.g., "deal" tagged with DEAL1 or DEAL2
  • Helps classification and prediction

7
Significance of Parts of Speech
  • A word's POS tells us a lot about the word and its neighbors
  • Limits the range of meanings ("deal"), pronunciation (OBject vs.
    obJECT), or both ("wind")
  • Helps in stemming
  • Limits the range of following words for Speech
    Recognition
  • Can help select nouns from a document for IR
  • Basis for partial parsing (chunked parsing)
  • Parsers can build trees directly on the POS tags
    instead of maintaining a lexicon

8
Choosing a tagset
  • The choice of tagset greatly affects the
    difficulty of the problem
  • Need to strike a balance between
  • Getting better information about context (best to introduce more
    distinctions)
  • Making it possible for classifiers to do their job (need to
    minimize distinctions)

9
Some of the best-known Tagsets
  • Brown corpus: 87 tags
  • Penn Treebank: 45 tags
  • Lancaster UCREL C5 (used to tag the BNC): 61 tags
  • Lancaster C7: 145 tags

10
The Brown Corpus
  • The first digital corpus (1961)
  • Francis and Kucera, Brown University
  • Contents: 500 texts, each 2000 words long
  • From American books, newspapers, magazines
  • Representing genres:
  • Science fiction, romance fiction, press reportage, scientific
    writing, popular lore

11
Penn Treebank
  • First syntactically annotated corpus
  • 1 million words from Wall Street Journal
  • Part of speech tags and syntax trees

12
How hard is POS tagging?
In the Brown corpus, 11.5% of word types and 40% of word tokens are
ambiguous.

Number of tags:       1      2     3    4   5   6  7
Number of word types: 35340  3760  264  61  12  2  1
13
Important Penn Treebank tags
14
Verb inflection tags
15
The entire Penn Treebank tagset
16
Quick test
DoCoMo and Sony are to develop a chip that would
let people pay for goods through their mobiles.
17
Tagging methods
  • Hand-coded
  • Statistical taggers
  • Brill (transformation-based) tagger

18
Reading Tagged Corpora
  • >>> corpus = brown.read('ca01')
  • >>> corpus['WORDS'][0:10]
  • [<The/at>, <Fulton/np-tl>, <County/nn-tl>, <Grand/jj-tl>,
    <Jury/nn-tl>, <said/vbd>, <Friday/nr>, <an/at>,
    <investigation/nn>, <of/in>]
  • >>> corpus['WORDS'][2]['TAG']
  • 'nn-tl'
  • >>> corpus['WORDS'][2]['TEXT']
  • 'County'

19
Default Tagger
  • We need something to use for unseen words
  • E.g., guess NNP for a word with an initial
    capital
  • How to do this?
  • Apply a sequence of regular expression tests
  • Assign the word to a suitable tag
  • If there are no matches
  • Assign the tag that is most frequent for unknown words: NN
  • Other common ones are verb, proper noun,
    adjective
  • Note the role of closed-class words in English
  • Prepositions, auxiliaries, etc.
  • New ones do not tend to appear.

20
A Default Tagger
  • >>> from nltk.tokenizer import *
  • >>> from nltk.tagger import *
  • >>> text_token = Token(TEXT="John saw 3 polar bears .")
  • >>> WhitespaceTokenizer().tokenize(text_token)
  • >>> NN_CD_tagger = RegexpTagger([(r'[0-9]+(\.[0-9]+)?', 'cd'),
    (r'.*', 'nn')])
  • >>> NN_CD_tagger.tag(text_token)
  • <<John/nn>, <saw/nn>, <3/cd>, <polar/nn>, <bears/nn>, <./nn>>
  • NN_CD_tagger assigns CD to numbers, otherwise NN.
  • Poor performance (20-30%) in isolation, but when used with other
    taggers it can significantly improve performance
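
The listing above uses the NLTK 1.x API that was current in 2004. In modern NLTK (3.x) the same idea looks roughly like this sketch:

    import nltk

    # A rough modern equivalent of the NN_CD tagger: numbers get 'cd',
    # anything else falls through to the default 'nn'.
    nn_cd_tagger = nltk.RegexpTagger([
        (r'^[0-9]+(\.[0-9]+)?$', 'cd'),  # cardinal numbers
        (r'.*', 'nn'),                   # fallback: common noun
    ])
    print(nn_cd_tagger.tag("John saw 3 polar bears .".split()))
    # [('John', 'nn'), ('saw', 'nn'), ('3', 'cd'), ('polar', 'nn'),
    #  ('bears', 'nn'), ('.', 'nn')]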

21
Finding the most frequent tag
  • >>> from nltk.probability import FreqDist
  • >>> from nltk.corpus import brown
  • >>> fd = FreqDist()
  • >>> corpus = brown.read('ca01')
  • >>> for token in corpus['WORDS']:
  • ...     fd.inc(token['TAG'])
  • >>> fd.max()
  • >>> fd.count(fd.max())
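
The same computation in current NLTK, as a sketch ('ca01' is one Brown file; tags in NLTK's Brown distribution are uppercase):

    import nltk
    from nltk.corpus import brown

    # Tally tag frequencies in one Brown file and report the winner.
    fd = nltk.FreqDist(tag for word, tag in brown.tagged_words(fileids='ca01'))
    print(fd.max(), fd[fd.max()])  # most frequent tag and its count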

22
Evaluating the Tagger
This gets 2 wrong out of 16, a 12.5% error rate; equivalently, an
accuracy of 87.5%.
23
Training vs. Testing
  • A fundamental idea in computational linguistics
  • Start with a collection labeled with the right
    answers
  • Supervised learning
  • Usually the labels are done by hand
  • "Train" ("teach") the algorithm on a subset of the labeled text.
  • Test the algorithm on a different set of data.
  • Why?
  • If memorization worked, we'd be done.
  • Need to generalize so the algorithm works on examples that you
    haven't seen yet.
  • Thus testing only makes sense on examples you didn't train on.
  • NLTK has an excellent interface for doing this
    easily.

24
Training the Unigram Tagger
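
The slide's code was not captured in this transcript. In current NLTK, training a unigram tagger looks roughly like this sketch:

    import nltk
    from nltk.corpus import brown

    # Learn the most common tag per word from tagged sentences.
    tagged_sents = list(brown.tagged_sents(categories='news'))
    unigram_tagger = nltk.UnigramTagger(tagged_sents)
    print(unigram_tagger.tag("The race for outer space .".split()))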
25
Creating Separate Training and Testing Sets
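
Again, the slide's code is missing from the transcript; here is a sketch of a train/test split in current NLTK (the 90/10 ratio is illustrative, not from the slide):

    import nltk
    from nltk.corpus import brown

    # Hold out the last 10% of sentences for testing.
    tagged_sents = list(brown.tagged_sents(categories='news'))
    cut = int(len(tagged_sents) * 0.9)
    train_sents, test_sents = tagged_sents[:cut], tagged_sents[cut:]
    tagger = nltk.UnigramTagger(train_sents)
    # .accuracy() is called .evaluate() in older NLTK releases.
    print(tagger.accuracy(test_sents))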
26
Evaluating a Tagger
  • Tagged tokens: the original data
  • Untag (exclude) the data
  • Tag the data with your own tagger
  • Compare the original and new tags
  • Iterate over the two lists checking for identity and counting
  • Accuracy = fraction correct (sketched below)
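
The same procedure, hand-rolled as a sketch (the names are illustrative):

    # Accuracy = fraction of tokens whose new tag matches the original.
    def tagger_accuracy(tagger, gold_sents):
        correct = total = 0
        for sent in gold_sents:
            words = [w for w, _ in sent]    # untag the data
            predicted = tagger.tag(words)   # re-tag with our own tagger
            for (_, gold_tag), (_, new_tag) in zip(sent, predicted):
                correct += (gold_tag == new_tag)  # check identity
                total += 1
        return correct / total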

27
Assessing the Errors
Why the tuple() method? Dictionaries cannot be indexed by lists, so
convert the lists to tuples. exclude() returns a new token containing
only the properties that are not named in the given list.
28
Assessing the Errors
29
Language Modeling
  • Another fundamental concept in NLP
  • Main idea
  • For a given language, some words are more likely
    than others to follow each other, or
  • You can predict (with some degree of accuracy)
    the probability that a given word will follow
    another word.
  • Illustration
  • Distributions of words in class-participation
    exercise.

30
N-Grams
  • The N stands for how many terms are used
  • Unigram: 1 term
  • Bigram: 2 terms
  • Trigram: 3 terms
  • Usually don't go beyond this
  • You can use different kinds of terms, e.g.:
  • Character-based n-grams
  • Word-based n-grams
  • POS-based n-grams
  • Ordering
  • Often adjacent, but not required
  • We use n-grams to help determine the context in
    which some linguistic phenomenon happens.
  • E.g., look at the words before and after the
    period to see if it is the end of a sentence or
    not.
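
A small illustration of word-based n-grams using nltk.util.ngrams from current NLTK (the sentence is arbitrary):

    from nltk.util import ngrams

    tokens = "one fish two fish red fish blue fish".split()
    print(list(ngrams(tokens, 2)))  # bigrams: ('one', 'fish'), ('fish', 'two'), ...
    print(list(ngrams(tokens, 3)))  # trigrams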

31
Features and Contexts
(Diagram: a window of words w_{n-2} w_{n-1} w_n w_{n+1} with their tags
t_{n-2} t_{n-1} t_n t_{n+1}; the target word w_n supplies the FEATURE,
and the surrounding words and tags supply the CONTEXT.)
32
Unigram Tagger
  • Trained using a tagged corpus to determine which
    tags are most common for each word.
  • E.g., in a tagged WSJ sample, "deal" is tagged with NN 11 times,
    with VB 1 time, and with VBP 1 time
  • Performance is highly dependent on the quality of its training set.
  • Can't be too small
  • Can't be too different from the texts we actually want to tag
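
The "deal" counts can be checked against the tagged WSJ sample that ships with current NLTK; a sketch (exact counts depend on the sample, so they may not match the slide's 11/1/1):

    import nltk
    from nltk.corpus import treebank

    # How often does each tag appear on the word "deal"?
    fd = nltk.FreqDist(tag for word, tag in treebank.tagged_words()
                       if word.lower() == 'deal')
    print(fd.most_common())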

33
Nth Order Tagging
  • Order refers to how much context is used
  • It's one less than the N in N-gram here because we use the target
    word itself as part of the context.
  • 0th order: unigram tagger
  • 1st order: bigrams
  • 2nd order: trigrams
  • Bigram tagger
  • For tagging, in addition to considering the token's type, the
    context also considers the tags of the n preceding tokens
  • What is the most likely tag for w_n, given w_{n-1} and t_{n-1}?
  • The tagger picks the tag which is most likely for
    that context.
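
In current NLTK a 1st-order tagger is nltk.BigramTagger; a sketch reusing train_sents/test_sents from the earlier split:

    import nltk

    # Conditions on the previous tag plus the current word. Without
    # backoff it returns None for unseen contexts, so accuracy is low.
    bigram_tagger = nltk.BigramTagger(train_sents)
    print(bigram_tagger.accuracy(test_sents))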

34
Reading the Bigram table
35
Tagging with lexical frequencies
  • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
    tomorrow/NN
  • People/NNS continue/VBP to/TO inquire/VB the/DT
    reason/NN for/IN the/DT race/NN for/IN outer/JJ
    space/NN
  • Problem: assign a tag to "race" given its lexical frequency
  • Solution: choose the tag with the greater probability:
  • P(race|VB)
  • P(race|NN)
  • Actual estimates from the Switchboard corpus:
  • P(race|NN) = .00041
  • P(race|VB) = .00003
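
These probabilities can be estimated as P(word | tag) = count(word, tag) / count(tag). A sketch using the Brown corpus shipped with NLTK rather than Switchboard (so the numbers will differ from the slide's):

    import nltk
    from nltk.corpus import brown

    pairs = [(w.lower(), t) for w, t in brown.tagged_words()]
    tag_counts = nltk.FreqDist(t for _, t in pairs)
    pair_counts = nltk.FreqDist(pairs)
    print(pair_counts[('race', 'NN')] / tag_counts['NN'])  # ~ P(race|NN)
    print(pair_counts[('race', 'VB')] / tag_counts['VB'])  # ~ P(race|VB)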

36
Combining Taggers
  • Use more accurate algorithms when we can; back off to
    wider-coverage methods when needed.
  • Try tagging the token with the 1st order tagger.
  • If the 1st order tagger is unable to find a tag
    for the token, try finding a tag with the 0th
    order tagger.
  • If the 0th order tagger is also unable to find a
    tag, use the NN_CD_Tagger to find a tag.

37
BackoffTagger class
  • >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
  • Construct the taggers:
  • >>> tagger1 = NthOrderTagger(1, SUBTOKENS='WORDS')
  • >>> tagger2 = UnigramTagger()  # 0th order
  • >>> tagger3 = NN_CD_Tagger()
  • Train the taggers:
  • >>> for tok in train_toks:
  • ...     tagger1.train(tok)
  • ...     tagger2.train(tok)

38
Backoff (continued)
  • Combine the taggers (in order, by specificity)
  • gt tagger BackoffTagger(tagger1, tagger2,
    tagger3)
  • Use the combined tagger
  • gt accuracy tagger_accuracy(tagger,
    unseen_tokens)
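
In current NLTK, backoff is a constructor argument rather than a separate BackoffTagger class; a sketch reusing train_sents/test_sents from earlier:

    import nltk

    # Each tagger falls back to the next when it has no answer.
    t0 = nltk.DefaultTagger('NN')                     # catch-all fallback
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)  # 0th order
    t2 = nltk.BigramTagger(train_sents, backoff=t1)   # 1st order
    print(t2.accuracy(test_sents))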

39
Rule-Based Tagger
  • The Linguistic Complaint
  • Where is the linguistic knowledge of a tagger?
  • Just a massive table of numbers
  • Aren't there any linguistic insights that could emerge from the
    data?
  • Could thus use handcrafted sets of rules to tag input sentences,
    for example: if the input follows a determiner, tag it as a noun.

40
The Brill tagger
  • An example of TRANSFORMATION-BASED LEARNING
  • Very popular (freely available, works fairly
    well)
  • A SUPERVISED method: requires a tagged corpus
  • Basic idea: do a quick job first (using frequency), then revise it
    using contextual rules

41
Brill Tagging In more detail
  • Start with simple (less accurate) rules; learn better ones from
    the tagged corpus
  • Tag each word initially with its most likely POS
  • Examine the set of transformations to see which most improves
    tagging decisions compared to the tagged corpus
  • Re-tag the corpus using the best transformation
  • Repeat until, e.g., performance doesn't improve
  • Result: a tagging procedure (an ordered list of transformations)
    which can be applied to new, untagged text

42
An example
  • Examples
  • It is expected to race tomorrow.
  • The race for outer space.
  • Tagging algorithm
  • Tag all uses of "race" as NN (the most likely tag in the Brown
    corpus)
  • It is expected to race/NN tomorrow
  • the race/NN for outer space
  • Use a transformation rule to replace the tag NN with VB for all
    uses of "race" preceded by the tag TO
  • It is expected to race/VB tomorrow
  • the race/NN for outer space

43
Transformation-based learning in the Brill tagger
  • Tag the corpus with the most likely tag for each
    word
  • Choose a TRANSFORMATION that deterministically
    replaces an existing tag with a new one such that
    the resulting tagged corpus has the lowest error
    rate
  • Apply that transformation to the training corpus
  • Repeat
  • Return a tagger that
  • first tags using unigrams
  • then applies the learned transformations in order

44
Examples of learned transformations
45
Templates
46
Additional issues
  • Most of the difference in performance between POS tagging
    algorithms depends on their treatment of UNKNOWN WORDS
  • Multiple-token words (Penn Treebank)
  • Class-based N-grams
47
Upcoming
  • I will email the procedures for turning in the first assignment
    on Wed Sept 15
  • Will be over the web
  • On Wed I'll discuss shallow parsing
  • Start reading the Chunking (Shallow Parsing) tutorial
  • I will assign homework from this on Wed, due in one week, on
    Sept 22.
  • Next Monday I'll briefly discuss syntactic parsing
  • There is a tutorial on this; feel free to read it
  • In the interests of reducing workload, I'm not assigning it,
    however