WSTA Lecture 14: Part-of-speech Tagging


1
WSTA Lecture 14: Part-of-speech Tagging
  • Tags
  • introduction
  • tagged corpora, tagsets
  • Tagging
  • motivation
  • Simple unigram tagger
  • Markov model tagging
  • Rule based tagging
  • Evaluation

Slide credits: Steven Bird
2
NLP versus IR
  • Covered predominantly IR up until now
  • processing, stemming, indexing, querying, etc
  • mostly bag of words and vector space models
  • word order unimportant
  • word inflections unimportant
  • What do we mean by natural language processing?
  • and how does this differ from / overlap with IR?

3
Tags 1: ambiguity
  • time flies like an arrow
  • fruit flies like a banana
  • ambiguous headlines
  • http://www.snopes.com/humor/nonsense/head97.htm
  • British Left Waffles on Falkland Islands
  • Juvenile Court to Try Shooting Defendant

4
Tags 2: Representations to resolve ambiguity
5
Exercise: tag some headlines
  • British Left Waffles on Falkland Islands
  • Juvenile Court to Try Shooting Defendant

6
Tags 3: Tagged Corpora
  • The/DT limits/NNS to/TO legal/JJ absurdity/NN
    stretched/VBD another/DT notch/NN this/DT
    week/NN when/WRB the/DT Supreme/NNP Court/NNP
    refused/VBD to/TO hear/VB an/DT appeal/NN from/IN
    a/DT case/NN that/WDT says/VBZ corporate/JJ
    defendants/NNS must/MD pay/VB damages/NNS even/RB
    after/IN proving/VBG that/IN they/PRP could/MD
    not/RB possibly/RB have/VB caused/VBN the/DT
    harm/NN ./.
  • Source: Penn Treebank Corpus (nltk/data/treebank/wsj_0130)
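
The word/TAG notation above is easy to read back into (word, tag) pairs. A minimal sketch in plain Python follows; the helper name parse_tagged is an assumption made for illustration and is not part of NLTK (NLTK provides nltk.tag.str2tuple for converting a single token).

```python
def parse_tagged(text):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a token such as './.' still yields
        # the word '.' and the tag '.'.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


sample = "The/DT limits/NNS to/TO legal/JJ absurdity/NN stretched/VBD harm/NN ./."
tagged = parse_tagged(sample)
print(tagged[:3])
print(tagged[-1])
```

Splitting on the last slash rather than the first is the one design choice here: punctuation tokens and words that themselves contain a slash would otherwise be split in the wrong place.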

7
Another kind of tagging: Sense Tagging
  • The Pantheon's interior/a , still in its
    original/a form/a ,
  • interior: (a) inside a space (b) inside a
    country and at a distance from the coast or
    border (c) domestic (d) private.
  • original: (a) relating to the beginning of
    something (b) novel (c) that from which a copy
    is made (d) mentally ill or eccentric.
  • form: (a) definite shape or appearance (b) body
    (c) mould (d) particular structural character
    exhibited by something (e) a style as in music,
    art or literature (f) homogeneous polynomial in
    two or more variables ...

8
Significance of Parts of Speech
  • a word's POS tells us a lot about the word and
    its neighbors
  • limits the range of meanings (deal), pronunciations
    (object vs object), or both (wind)
  • helps in stemming
  • limits the range of following words for ASR
  • helps select nouns from a document for IR
  • More advanced uses (these won't make sense yet)
  • basis for chunk parsing
  • parsers can build trees directly on the POS tags
    instead of maintaining a lexicon
  • first step for many different NLP tasks

9
What does Tagging do?
  • Collapses Distinctions
  • Lexical identity may be discarded
  • e.g. all personal pronouns tagged with PRP
  • Introduces Distinctions
  • Ambiguities may be removed
  • e.g. deal tagged with NN or VB; deal tagged with
    DEAL1 or DEAL2
  • Helps classification and prediction
  • There are many tagsets. This is due to
  • the different ways to define a tag
  • the need to balance classification and prediction
  • harder/easier classification task vs more/less
    information about context

10
Tagged Corpora
  • Brown Corpus
  • The first digital corpus (1961), Francis and
    Kucera, Brown U
  • Contents: 500 texts, each 2000 words long
  • from American books, newspapers, magazines,
    representing 15 genres
  • science fiction, romance fiction, press reportage,
    scientific writing, popular lore.
  • See nltk/data/brown/
  • See reading for definition of Brown tags
  • Penn Treebank
  • First syntactically annotated corpus
  • Contents: 1 million words from WSJ; POS tags,
    syntax trees
  • See nltk/data/treebank/ (5% sample)

11
Tagged Corpora in other languages
  • Parsed treebanks in many other languages
  • Basque, Bulgarian, Chinese, Czech, Finnish,
    French
  • German, Greek, Hebrew, Hungarian, Irish, Italian
  • Japanese, Korean, Persian, Romanian, Spanish
  • Swedish and many more!
  • All with part-of-speech annotation
  • language specific tag sets
  • recent work on mapping to common tag set
  • https://code.google.com/p/universal-pos-tags/
  • http://universaldependencies.github.io/docs/

12
Application of tagged corpora: genre classification
13
Important Treebank Tags
  • NN noun
  • JJ adjective
  • NNP proper noun
  • CC coord conjunc (and/or/..)
  • DT determiner (the/a/..)
  • CD cardinal number
  • IN preposition (in/of/..)
  • PRP personal pronoun (I/you/..)
  • VB verb
  • RB adverb (gently, now)
  • -R comparative (better)
  • -S superlative (bravest) or plural
  • -$ possessive (my)

14
Verb Tags
  • VBP: base present (take)
  • VB: infinitive (take)
  • VBD: past (took)
  • VBG: present participle (taking)
  • VBN: past participle (taken)
  • VBZ: present 3sg (takes)
  • MD: modal (can, would)

15
Simple Tagging in NLTK
  • Reading Tagged Corpora
  • >>> from nltk.corpus import treebank
    >>> treebank.fileids()
    >>> treebank.tagged_sents('wsj_0001.mrg')[0]
    [(u'Pierre', u'NNP'), (u'Vinken', u'NNP'),
    (u',', u','), (u'61', u'CD'), (u'years', u'NNS'),
    (u'old', u'JJ'), (u',', u','), (u'will', u'MD'),
    (u'join', u'VB'), (u'the', u'DT'), ...
  • see also Brown corpus, Conll2000, Alpino and more
  • Tagging a string
  • >>> import nltk
    >>> nltk.tag.pos_tag('Fruit flies like a banana'.split())
    [('Fruit', 'NN'), ('flies', 'NNS'), ('like', 'IN'),
    ('a', 'DT'), ('banana', 'NN')]
  • (N.b. Uses a maximum entropy tagger)

16
Tagging Algorithms
  • rule based taggers
  • original methods, based on layers of rules about
    how to tag words based on their context (e.g.,
    Brill tagger)
  • unigram tagger
  • assign the tag which is the most probable for the
    word in question, based on frequency in a
    training corpus
  • bigram tagger, n-gram tagger
  • inspect one or more tags in the context (usually,
    immediate left context)
  • Maximum entropy and HMM taggers (next lecture)

17
Unigram Tagging
  • Unigram table of tag frequencies for each word
  • e.g. in tagged WSJ sample (from Penn Treebank)
  • deal NN (11) VB (1) VBP (1)
  • Training
  • load a corpus
  • count the occurrences of each (word, tag) in the
    corpus
  • Tagging
  • lookup the most common tag for each word to tag
  • Gets 90% accuracy!
  • See the code in nltk.tag.UnigramTagger
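
The training and tagging steps above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the NLTK implementation: the function names train_unigram and tag_unigram and the three-sentence toy corpus are invented for the example.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Count (word, tag) occurrences and keep the most frequent tag per word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_unigram(model, words, default=None):
    """Look up each word; unseen words receive the default tag."""
    return [(w, model.get(w, default)) for w in words]


# Toy training data (invented for illustration).
train = [
    [("the", "DT"), ("deal", "NN"), ("closed", "VBD")],
    [("they", "PRP"), ("deal", "VBP"), ("cards", "NNS")],
    [("a", "DT"), ("deal", "NN")],
]
model = train_unigram(train)
print(tag_unigram(model, ["the", "deal", "collapsed"]))
```

Note that "deal" always comes out as NN, whatever its context, and that a word never seen in training gets no useful tag at all.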

18
The problem with unigram taggers
  • what evidence do they consider when assigning a
    tag?
  • when does this method fail?

19
Fixing the problem using a bigram tagger
  • construct sentences involving a word which can
    have two different parts of speech
  • e.g. wind: noun, verb
  • The wind blew forcefully
  • I wind up the clock
  • gather statistics for current tag, based on
  • (i) current word (ii) previous tag
  • result: a 2-D array of frequency distributions
  • what does this look like?
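
One possible answer to the question above, as a sketch: the "2-D array" can be held as a dictionary keyed on (previous tag, current word), with each cell a frequency distribution over tags. The names train_bigram, tag_bigram and the START marker are assumptions made for this example, and the training data is just the two "wind" sentences.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for "no previous tag"


def train_bigram(tagged_sents):
    """Build a table: (previous tag, current word) -> Counter of tags."""
    table = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            table[(prev, word)][tag] += 1
            prev = tag
    return table


def tag_bigram(table, words):
    """Tag left to right, using the tag just assigned as context."""
    prev, out = START, []
    for w in words:
        dist = table.get((prev, w))
        tag = dist.most_common(1)[0][0] if dist else None  # unseen context
        out.append((w, tag))
        prev = tag
    return out


train = [
    [("The", "DT"), ("wind", "NN"), ("blew", "VBD"), ("forcefully", "RB")],
    [("I", "PRP"), ("wind", "VBP"), ("up", "RP"), ("the", "DT"), ("clock", "NN")],
]
table = train_bigram(train)
print(tag_bigram(table, ["The", "wind", "blew"]))
print(tag_bigram(table, ["I", "wind", "up"]))
```

In this sketch, once a context is unseen the tag is None, and every later lookup in that sentence fails as well, since None never occurs as a previous tag in training.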

20
Generalizing the context
21
Bigram and n-gram taggers
  • n-gram tagger: consider n-1 previous tags
  • how big does the model get?
  • how much data do we need to train it?
  • Sparse-data problem
  • As n gets large, the chance of having seen all
    possible patterns of tags during training
    diminishes (large: n > 3)
  • Approaches
  • Combine taggers (backoff, weighted average)
  • statistical estimation of the probability of
    unseen events
  • See nltk.tag.sequential.NgramTagger
  • and various others in nltk.tag package
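
Backoff, the first approach listed above, can be sketched as follows: try the bigram context first, fall back to the word's most frequent tag, then to a fixed default. The function names, the "<s>" start marker and the default tag "NN" are assumptions made for this illustration; NLTK's sequential taggers expose the same idea through a backoff argument.

```python
from collections import Counter, defaultdict


def train(tagged_sents):
    """Collect unigram (word) and bigram (previous tag, word) tag counts."""
    uni, bi = defaultdict(Counter), defaultdict(Counter)
    for sent in tagged_sents:
        prev = "<s>"
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev, word)][tag] += 1
            prev = tag
    return uni, bi


def best(dist):
    """Most frequent tag in a Counter."""
    return dist.most_common(1)[0][0]


def tag_with_backoff(uni, bi, words, default="NN"):
    """Bigram lookup, then unigram lookup, then a default tag."""
    prev, out = "<s>", []
    for w in words:
        if (prev, w) in bi:
            tag = best(bi[(prev, w)])
        elif w in uni:
            tag = best(uni[w])
        else:
            tag = default
        out.append((w, tag))
        prev = tag
    return out


# Toy training data (invented for illustration).
train_sents = [
    [("The", "DT"), ("wind", "NN"), ("blew", "VBD")],
    [("I", "PRP"), ("wind", "VBP"), ("the", "DT"), ("clock", "NN")],
    [("the", "DT"), ("wind", "NN"), ("dropped", "VBD")],
]
uni, bi = train(train_sents)
print(tag_with_backoff(uni, bi, ["They", "wind", "the", "clock"]))
```

Every word now receives some tag, so one unseen context no longer derails the rest of the sentence; the price is that backed-off decisions ignore context.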

22
Markov Model Taggers
  • Recall n-gram language model
  • similar problem of modelling next word given
    previous words, similar issues with sparsity and
    estimation
  • here we focus on generating tag sequences rather
    than words
  • both are instances of a Markov model
  • tag sequence modelled as a Markov chain
  • each tag is linked to word sequence
  • Can we just predict each tag in sequence?
  • need to know the preceding tag(s)
  • but these are unknown
  • Next lecture, we'll explore this further using
    Hidden Markov Models
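
To make "tag sequence modelled as a Markov chain" concrete, here is a small sketch that estimates first-order tag transition probabilities by relative frequency and scores a tag sequence. It covers only the tag chain, not the link from tags to words; the function names, the "<s>" start symbol and the toy tag sequences are assumptions made for the example.

```python
from collections import Counter, defaultdict


def transition_probs(tag_seqs):
    """Maximum-likelihood estimate of P(tag | previous tag)."""
    counts = defaultdict(Counter)
    for tags in tag_seqs:
        for prev, cur in zip(["<s>"] + tags, tags):
            counts[prev][cur] += 1
    return {
        prev: {t: c / sum(nxt.values()) for t, c in nxt.items()}
        for prev, nxt in counts.items()
    }


def sequence_prob(probs, tags):
    """Probability of a tag sequence under the chain; unseen transitions get 0."""
    p = 1.0
    for prev, cur in zip(["<s>"] + tags, tags):
        p *= probs.get(prev, {}).get(cur, 0.0)
    return p


# Toy tag sequences (invented for illustration).
seqs = [["DT", "NN", "VBD"], ["DT", "JJ", "NN", "VBD"], ["PRP", "VBD", "DT", "NN"]]
probs = transition_probs(seqs)
print(probs["DT"])
print(sequence_prob(probs, ["DT", "NN", "VBD"]))
```

The zero probability for unseen transitions is the same sparsity issue noted for n-gram language models.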

23
The Brill Rule-Based Tagger
  • The Linguistic Complaint
  • where is the linguistic knowledge of a tagger?
  • just a massive table of numbers
  • aren't there any linguistic insights that could
    emerge from the data?
  • Transformation-Based Tagging / Brill Tagging
  • Tag each word with its most likely tag
  • Repeatedly correct tags based on context
  • Example rule: NN VB PREVTAG TO
  • to/TO race/NN -> to/TO race/VB
  • Other contexts
  • PREV1OR2TAG, PREV1OR2WD, WDNEXTTAG, ...
  • See nltk.tag.brill.BrillTagger
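
Applying a single transformation of the PREVTAG kind can be sketched as below. This shows only how one rule rewrites an initial tagging, not how Brill's method learns and orders rules; the function name apply_prevtag_rule and the example sentence are assumptions made for illustration.

```python
def apply_prevtag_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag wherever the preceding word carries prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        # Simplification: the context check reads the current (possibly
        # already rewritten) tag of the preceding word.
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out


# Initial tagging, e.g. from a most-likely-tag baseline (invented example).
initial = [("expected", "VBN"), ("to", "TO"), ("race", "NN"), ("tomorrow", "NN")]
print(apply_prevtag_rule(initial, "NN", "VB", "TO"))
```

Only "race" changes: "tomorrow" is also NN, but its preceding tag is not TO.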

24
Evaluating Tagger Performance
  • Need an objective measure of performance
  • Commonly use per-token accuracy
  • measured against heldout gold standard data
  • fraction of words tagged correctly
  • Simple methods get 90% performance
  • 1 and 2-gram
  • Brill tagger
  • HMMs get 95% and CRFs get 97% performance
  • see nltk.tag.hmm, tnt, crf, stanford, senna, ...
  • Why can't we get 100%?
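
Per-token accuracy as described above is a single ratio. A minimal sketch follows, with an invented gold tagging for the "Fruit flies" sentence compared against the tagger output shown earlier; the function name accuracy and the gold tags are assumptions made for the example.

```python
def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold-standard tag."""
    assert len(gold) == len(predicted), "need the same number of sentences"
    correct = total = 0
    for g_sent, p_sent in zip(gold, predicted):
        for (_, g_tag), (_, p_tag) in zip(g_sent, p_sent):
            total += 1
            correct += g_tag == p_tag
    return correct / total if total else 0.0


# Hypothetical gold standard: 'like' as a verb.
gold = [[("Fruit", "NN"), ("flies", "NNS"), ("like", "VBP"), ("a", "DT"), ("banana", "NN")]]
# Tagger output from the earlier slide: 'like' as a preposition.
pred = [[("Fruit", "NN"), ("flies", "NNS"), ("like", "IN"), ("a", "DT"), ("banana", "NN")]]
print(accuracy(gold, pred))
```

A score of 0.8 here hides the fact that the one error is the word that decides the reading of the sentence, which is part of why per-token accuracy alone can flatter a tagger.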

25
Tagging: broader lessons
  • Tagging has several properties that are typical
    of NLP
  • classification (words have properties)
  • disambiguation through representation
  • sequence learning from annotated corpora
  • simple, general methods
  • conditional frequency distributions
  • Cool things you can do now: elementary NLU, NLG
  • Review
  • tokenization + tagging: segmentation and
    annotation of words
  • chunking: segmentation and annotation of word
    sequences

26
Readings
  • One of
  • Jurafsky & Martin, chapter 5
  • Manning & Schütze, chapter 10
  • NLTK tagging tutorial
  • http://www.nltk.org/book/ch05.html
  • Next lecture
  • tagging with (hidden) Markov models
  • other sequence tagging tasks
  • named entity tagging
  • shallow parsing