1
SIMS 290-2: Applied Natural Language Processing
Marti Hearst, Sept 13, 2004
2
Today
  • Purpose of Part-of-Speech Tagging
  • Training and Testing Collections
  • Intro to N-grams and Language Modeling
  • Using NLTK for POS Tagging

3
Class Exercise
  • I will read off a few words from the beginning of
    a sentence
  • You should write down the very first 2 words that
    come to mind that should follow these words.
  • Example
  • I say: "One fish"
  • You write: "two fish"
  • Don't second-guess or try to be clever.
  • Note: there are no correct answers.

4
Terminology
  • Tagging
  • The process of associating labels with each token
    in a text
  • Tags
  • The labels
  • Tag Set
  • The collection of tags used for a particular task

5
Example
  • Typically a tagged text is a sequence of
    white-space separated base/tag tokens
  • The/at Pantheon's/np$ interior/nn ,/, still/rb in/in its/pp$
    original/jj form/nn ,/, is/bez truly/ql majestic/jj and/cc an/at
    architectural/jj triumph/nn ./. Its/pp$ rotunda/nn forms/vbz a/at
    perfect/jj circle/nn whose/wp$ diameter/nn is/bez equal/jj to/in
    the/at height/nn from/in the/at floor/nn to/in the/at ceiling/nn ./.
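
Since each token is just base/tag joined by a slash, this format is easy to pull apart by hand. A minimal Python sketch (not from the slides; the string is abbreviated):

    # Split whitespace-separated base/tag tokens into (word, tag) pairs.
    tagged = "The/at interior/nn ,/, still/rb in/in its/pp$ original/jj form/nn"
    pairs = [tok.rsplit("/", 1) for tok in tagged.split()]
    print(pairs[:3])  # [['The', 'at'], ['interior', 'nn'], [',', ',']]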

6
What does Tagging do?
  • Collapses Distinctions
  • Lexical identity may be discarded
  • E.g., all personal pronouns tagged with PRP
  • Introduces Distinctions
  • Ambiguities may be removed
  • E.g., "deal" tagged with NN or VB
  • E.g., "deal" tagged with DEAL1 or DEAL2
  • Helps classification and prediction

7
Significance of Parts of Speech
  • A word's POS tells us a lot about the word and its neighbors
  • Limits the range of meanings ("deal"), pronunciation (OBject vs.
    obJECT), or both ("wind")
  • Helps in stemming
  • Limits the range of following words for Speech
    Recognition
  • Can help select nouns from a document for IR
  • Basis for partial parsing (chunked parsing)
  • Parsers can build trees directly on the POS tags
    instead of maintaining a lexicon

8
Choosing a tagset
  • The choice of tagset greatly affects the
    difficulty of the problem
  • Need to strike a balance between
  • Getting better information about context (best to introduce more
    distinctions)
  • Making it possible for classifiers to do their job (need to
    minimize distinctions)

9
Some of the best-known Tagsets
  • Brown corpus: 87 tags
  • Penn Treebank: 45 tags
  • Lancaster UCREL C5 (used to tag the BNC): 61 tags
  • Lancaster C7: 145 tags

10
The Brown Corpus
  • The first digital corpus (1961)
  • Francis and Kucera, Brown University
  • Contents: 500 texts, each 2000 words long
  • From American books, newspapers, magazines
  • Representing genres:
  • Science fiction, romance fiction, press reportage, scientific
    writing, popular lore

11
Penn Treebank
  • First syntactically annotated corpus
  • 1 million words from Wall Street Journal
  • Part of speech tags and syntax trees

12
How hard is POS tagging?
In the Brown corpus, 11.5% of word types and 40% of word tokens are
ambiguous.

Number of tags:       1      2     3    4   5   6  7
Number of word types: 35340  3760  264  61  12  2  1
13
Important Penn Treebank tags
14
Verb inflection tags
15
The entire Penn Treebank tagset
16
Quick test
DoCoMo and Sony are to develop a chip that would
let people pay for goods through their mobiles.
17
Tagging methods
  • Hand-coded
  • Statistical taggers
  • Brill (transformation-based) tagger

18
Reading Tagged Corpora
  • >>> corpus = brown.read('ca01')
  • >>> corpus['WORDS'][0:10]
  • [<The/at>, <Fulton/np-tl>, <County/nn-tl>, <Grand/jj-tl>,
    <Jury/nn-tl>, <said/vbd>, <Friday/nr>, <an/at>,
    <investigation/nn>, <of/in>]
  • >>> corpus['WORDS'][2]['TAG']
  • 'nn-tl'
  • >>> corpus['WORDS'][2]['TEXT']
  • 'County'

19
Default Tagger
  • We need something to use for unseen words
  • E.g., guess NNP for a word with an initial
    capital
  • How to do this?
  • Apply a sequence of regular expression tests
  • Assign the word to a suitable tag
  • If there are no matches
  • Assign the tag that is most frequent for unknown words: NN
  • Other common ones are verb, proper noun,
    adjective
  • Note the role of closed-class words in English
  • Prepositions, auxiliaries, etc.
  • New ones do not tend to appear.

20
A Default Tagger
  • >>> from nltk.tokenizer import *
  • >>> from nltk.tagger import *
  • >>> text_token = Token(TEXT="John saw 3 polar bears .")
  • >>> WhitespaceTokenizer().tokenize(text_token)
  • >>> NN_CD_tagger = RegexpTagger([(r'[0-9]+(\.[0-9]+)?', 'cd'),
    (r'.*', 'nn')])
  • >>> NN_CD_tagger.tag(text_token)
  • <<John/nn>, <saw/nn>, <3/cd>, <polar/nn>, <bears/nn>, <./nn>>
  • NN_CD_tagger assigns CD to numbers, otherwise NN.
  • Poor performance (20-30%) in isolation, but when used with other
    taggers it can significantly improve performance
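
The listing above uses the NLTK 1.x API that was current in 2004. In modern NLTK (3.x) the same idea looks roughly like this sketch:

    import nltk

    # A rough modern equivalent of the NN_CD tagger: numbers get 'cd',
    # anything else falls through to the default 'nn'.
    nn_cd_tagger = nltk.RegexpTagger([
        (r'^[0-9]+(\.[0-9]+)?$', 'cd'),  # cardinal numbers
        (r'.*', 'nn'),                   # fallback: common noun
    ])
    print(nn_cd_tagger.tag("John saw 3 polar bears .".split()))
    # [('John', 'nn'), ('saw', 'nn'), ('3', 'cd'), ('polar', 'nn'),
    #  ('bears', 'nn'), ('.', 'nn')]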

21
Finding the most frequent tag
  • >>> from nltk.probability import FreqDist
  • >>> from nltk.corpus import brown
  • >>> fd = FreqDist()
  • >>> corpus = brown.read('ca01')
  • >>> for token in corpus['WORDS']:
  • ...     fd.inc(token['TAG'])
  • >>> fd.max()
  • >>> fd.count(fd.max())
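
The same computation in current NLTK, as a sketch ('ca01' is one Brown file; tags in NLTK's Brown distribution are uppercase):

    import nltk
    from nltk.corpus import brown

    # Tally tag frequencies in one Brown file and report the winner.
    fd = nltk.FreqDist(tag for word, tag in brown.tagged_words(fileids='ca01'))
    print(fd.max(), fd[fd.max()])  # most frequent tag and its count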

22
Evaluating the Tagger
This gets 2 wrong out of 16, a 12.5% error rate; equivalently, an
accuracy of 87.5%.
23
Training vs. Testing
  • A fundamental idea in computational linguistics
  • Start with a collection labeled with the right
    answers
  • Supervised learning
  • Usually the labels are done by hand
  • "Train" ("teach") the algorithm on a subset of the labeled text.
  • Test the algorithm on a different set of data.
  • Why?
  • If memorization worked, we'd be done.
  • Need to generalize so the algorithm works on examples that you
    haven't seen yet.
  • Thus testing only makes sense on examples you didn't train on.
  • NLTK has an excellent interface for doing this
    easily.

24
Training the Unigram Tagger
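
The slide's code was not captured in this transcript. In current NLTK, training a unigram tagger looks roughly like this sketch:

    import nltk
    from nltk.corpus import brown

    # Learn the most common tag per word from tagged sentences.
    tagged_sents = list(brown.tagged_sents(categories='news'))
    unigram_tagger = nltk.UnigramTagger(tagged_sents)
    print(unigram_tagger.tag("The race for outer space .".split()))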
25
Creating Separate Training and Testing Sets
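
Again, the slide's code is missing from the transcript; here is a sketch of a train/test split in current NLTK (the 90/10 ratio is illustrative, not from the slide):

    import nltk
    from nltk.corpus import brown

    # Hold out the last 10% of sentences for testing.
    tagged_sents = list(brown.tagged_sents(categories='news'))
    cut = int(len(tagged_sents) * 0.9)
    train_sents, test_sents = tagged_sents[:cut], tagged_sents[cut:]
    tagger = nltk.UnigramTagger(train_sents)
    # .accuracy() is called .evaluate() in older NLTK releases.
    print(tagger.accuracy(test_sents))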
26
Evaluating a Tagger
  • Tagged tokens: the original data
  • Untag (exclude) the data
  • Tag the data with your own tagger
  • Compare the original and new tags
  • Iterate over the two lists checking for identity and counting
  • Accuracy = fraction correct (sketched below)
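
The same procedure, hand-rolled as a sketch (the names are illustrative):

    # Accuracy = fraction of tokens whose new tag matches the original.
    def tagger_accuracy(tagger, gold_sents):
        correct = total = 0
        for sent in gold_sents:
            words = [w for w, _ in sent]    # untag the data
            predicted = tagger.tag(words)   # re-tag with our own tagger
            for (_, gold_tag), (_, new_tag) in zip(sent, predicted):
                correct += (gold_tag == new_tag)  # check identity
                total += 1
        return correct / total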

27
Assessing the Errors
Why the tuple() method? Dictionaries cannot be indexed by lists, so
convert the lists to tuples. exclude() returns a new token containing
only the properties that are not named in the given list.
28
Assessing the Errors
29
Language Modeling
  • Another fundamental concept in NLP
  • Main idea
  • For a given language, some words are more likely
    than others to follow each other, or
  • You can predict (with some degree of accuracy)
    the probability that a given word will follow
    another word.
  • Illustration
  • Distributions of words in class-participation
    exercise.

30
N-Grams
  • The N stands for how many terms are used
  • Unigram: 1 term
  • Bigram: 2 terms
  • Trigram: 3 terms
  • Usually don't go beyond this
  • You can use different kinds of terms, e.g.:
  • Character-based n-grams
  • Word-based n-grams
  • POS-based n-grams
  • Ordering
  • Often adjacent, but not required
  • We use n-grams to help determine the context in
    which some linguistic phenomenon happens.
  • E.g., look at the words before and after the
    period to see if it is the end of a sentence or
    not.
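
A small illustration of word-based n-grams using nltk.util.ngrams from current NLTK (the sentence is arbitrary):

    from nltk.util import ngrams

    tokens = "one fish two fish red fish blue fish".split()
    print(list(ngrams(tokens, 2)))  # bigrams: ('one', 'fish'), ('fish', 'two'), ...
    print(list(ngrams(tokens, 3)))  # trigrams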

31
Features and Contexts
(Diagram: a window of words w_{n-2} w_{n-1} w_n w_{n+1} with their tags
t_{n-2} t_{n-1} t_n t_{n+1}; the target word w_n supplies the FEATURE,
and the surrounding words and tags supply the CONTEXT.)
32
Unigram Tagger
  • Trained using a tagged corpus to determine which
    tags are most common for each word.
  • E.g., in a tagged WSJ sample, "deal" is tagged with NN 11 times,
    with VB 1 time, and with VBP 1 time
  • Performance is highly dependent on the quality of its training set.
  • Can't be too small
  • Can't be too different from the texts we actually want to tag
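
The "deal" counts can be checked against the tagged WSJ sample that ships with current NLTK; a sketch (exact counts depend on the sample, so they may not match the slide's 11/1/1):

    import nltk
    from nltk.corpus import treebank

    # How often does each tag appear on the word "deal"?
    fd = nltk.FreqDist(tag for word, tag in treebank.tagged_words()
                       if word.lower() == 'deal')
    print(fd.most_common())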

33
Nth Order Tagging
  • Order refers to how much context is used
  • It's one less than the N in N-gram here because we use the target
    word itself as part of the context.
  • 0th order: unigram tagger
  • 1st order: bigrams
  • 2nd order: trigrams
  • Bigram tagger
  • For tagging, in addition to considering the token's type, the
    context also considers the tags of the n preceding tokens
  • What is the most likely tag for w_n, given w_{n-1} and t_{n-1}?
  • The tagger picks the tag which is most likely for
    that context.
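
In current NLTK a 1st-order tagger is nltk.BigramTagger; a sketch reusing train_sents/test_sents from the earlier split:

    import nltk

    # Conditions on the previous tag plus the current word. Without
    # backoff it returns None for unseen contexts, so accuracy is low.
    bigram_tagger = nltk.BigramTagger(train_sents)
    print(bigram_tagger.accuracy(test_sents))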

34
Reading the Bigram table
35
Tagging with lexical frequencies
  • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
    tomorrow/NN
  • People/NNS continue/VBP to/TO inquire/VB the/DT
    reason/NN for/IN the/DT race/NN for/IN outer/JJ
    space/NN
  • Problem: assign a tag to "race" given its lexical frequency
  • Solution: choose the tag with the greater probability:
  • P(race|VB)
  • P(race|NN)
  • Actual estimates from the Switchboard corpus:
  • P(race|NN) = .00041
  • P(race|VB) = .00003
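
These probabilities can be estimated as P(word | tag) = count(word, tag) / count(tag). A sketch using the Brown corpus shipped with NLTK rather than Switchboard (so the numbers will differ from the slide's):

    import nltk
    from nltk.corpus import brown

    pairs = [(w.lower(), t) for w, t in brown.tagged_words()]
    tag_counts = nltk.FreqDist(t for _, t in pairs)
    pair_counts = nltk.FreqDist(pairs)
    print(pair_counts[('race', 'NN')] / tag_counts['NN'])  # ~ P(race|NN)
    print(pair_counts[('race', 'VB')] / tag_counts['VB'])  # ~ P(race|VB)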

36
Combining Taggers
  • Use more accurate algorithms when we can; back off to
    wider-coverage methods when needed.
  • Try tagging the token with the 1st order tagger.
  • If the 1st order tagger is unable to find a tag
    for the token, try finding a tag with the 0th
    order tagger.
  • If the 0th order tagger is also unable to find a
    tag, use the NN_CD_Tagger to find a tag.

37
BackoffTagger class
  • >>> train_toks = TaggedTokenizer().tokenize(tagged_text_str)
  • Construct the taggers:
  • >>> tagger1 = NthOrderTagger(1, SUBTOKENS='WORDS')
  • >>> tagger2 = UnigramTagger()  # 0th order
  • >>> tagger3 = NN_CD_Tagger()
  • Train the taggers:
  • >>> for tok in train_toks:
  • ...     tagger1.train(tok)
  • ...     tagger2.train(tok)

38
Backoff (continued)
  • Combine the taggers (in order, by specificity)
  • gt tagger BackoffTagger(tagger1, tagger2,
    tagger3)
  • Use the combined tagger
  • gt accuracy tagger_accuracy(tagger,
    unseen_tokens)
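
In current NLTK, backoff is a constructor argument rather than a separate BackoffTagger class; a sketch reusing train_sents/test_sents from earlier:

    import nltk

    # Each tagger falls back to the next when it has no answer.
    t0 = nltk.DefaultTagger('NN')                     # catch-all fallback
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)  # 0th order
    t2 = nltk.BigramTagger(train_sents, backoff=t1)   # 1st order
    print(t2.accuracy(test_sents))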

39
Rule-Based Tagger
  • The Linguistic Complaint
  • Where is the linguistic knowledge of a tagger?
  • Just a massive table of numbers
  • Aren't there any linguistic insights that could emerge from the
    data?
  • Could thus use handcrafted sets of rules to tag input sentences,
    for example: if the input follows a determiner, tag it as a noun.

40
The Brill tagger
  • An example of TRANSFORMATION-BASED LEARNING
  • Very popular (freely available, works fairly
    well)
  • A SUPERVISED method: requires a tagged corpus
  • Basic idea: do a quick job first (using frequency), then revise it
    using contextual rules

41
Brill Tagging In more detail
  • Start with simple (less accurate) rules; learn better ones from
    the tagged corpus
  • Tag each word initially with its most likely POS
  • Examine the set of transformations to see which most improves
    tagging decisions compared to the tagged corpus
  • Re-tag the corpus using the best transformation
  • Repeat until, e.g., performance doesn't improve
  • Result: a tagging procedure (an ordered list of transformations)
    which can be applied to new, untagged text

42
An example
  • Examples
  • It is expected to race tomorrow.
  • The race for outer space.
  • Tagging algorithm
  • Tag all uses of "race" as NN (the most likely tag in the Brown
    corpus)
  • It is expected to race/NN tomorrow
  • the race/NN for outer space
  • Use a transformation rule to replace the tag NN with VB for all
    uses of "race" preceded by the tag TO
  • It is expected to race/VB tomorrow
  • the race/NN for outer space

43
Transformation-based learning in the Brill tagger
  • Tag the corpus with the most likely tag for each
    word
  • Choose a TRANSFORMATION that deterministically
    replaces an existing tag with a new one such that
    the resulting tagged corpus has the lowest error
    rate
  • Apply that transformation to the training corpus
  • Repeat
  • Return a tagger that
  • first tags using unigrams
  • then applies the learned transformations in order

44
Examples of learned transformations
45
Templates
46
Additional issues
  • Most of the difference in performance between POS tagging
    algorithms depends on their treatment of UNKNOWN WORDS
  • Multiple-token words (Penn Treebank)
  • Class-based N-grams
47
Upcoming
  • I will email the procedures for turning in the first assignment
    on Wed Sept 15
  • Will be over the web
  • On Wed I'll discuss shallow parsing
  • Start reading the Chunking (Shallow Parsing) tutorial
  • I will assign homework from this on Wed, due in one week, on
    Sept 22.
  • Next Monday I'll briefly discuss syntactic parsing
  • There is a tutorial on this; feel free to read it
  • In the interests of reducing workload, I'm not assigning it,
    however