1
Part of Speech (POS) Tagging
CSC 9010 Special Topics. Natural Language
Processing. Paula Matuszek, Mary-Angela
Papalaskari Spring, 2005
2
Sources (and Resources)
  • Some slides adapted from:
  • Dorr, www.umiacs.umd.edu/christof/courses/cmsc723-fall04
  • Jurafsky, www.stanford.edu/class/linguist238
  • McCoy, www.cis.udel.edu/mccoy/courses/cisc882.03f
  • With some additional examples and ideas from:
  • Martin, www.cs.colorado.edu/martin/csci5832.html
  • Hearst, www.sims.berkeley.edu/courses/is290-2/f04/resources.html
  • Litman, www.cs.pitt.edu/litman/courses/cs2731f03/cs2731.html
  • Rich, www.cs.utexas.edu/users/ear/cs378NLP
  • You may find some or all of these useful
    resources throughout the course.

3
Word Classes and Part-of-Speech Tagging
  • What is POS tagging?
  • Why do we need POS?
  • Word Classes
  • Rule-based Tagging
  • Stochastic Tagging
  • Transformation-Based Tagging
  • Tagging Unknown Words
  • Evaluating POS Taggers

4
Parts of Speech
  • 8 traditional parts of speech (more or less)
  • Noun, verb, adjective, preposition, adverb,
    article, pronoun, conjunction.
  • This idea has been around for over 2000 years
    (Dionysius Thrax of Alexandria, c. 100 B.C.)
  • Called parts-of-speech, lexical category, word
    classes, morphological classes, lexical tags, POS
  • Actual categories vary by language, by reason
    for tagging, and by who you ask!

5
POS examples
  • N (noun): chair, bandwidth, pacing
  • V (verb): study, debate, munch
  • ADJ (adjective): purple, tall, ridiculous
  • ADV (adverb): unfortunately, slowly
  • P (preposition): of, by, to
  • PRO (pronoun): I, me, mine
  • DET (determiner): the, a, that, those

6
Definition of POS Tagging
The process of assigning a part-of-speech or
other lexical class marker to each word in a
corpus (Jurafsky and Martin)
7
POS Tagging example
  • WORD tag
  • the DET
  • koala N
  • put V
  • the DET
  • keys N
  • on P
  • the DET
  • table N

8
What does Tagging do?
  • Collapses Distinctions
  • Lexical identity may be discarded
  • e.g. all personal pronouns tagged with PRP
  • Introduces Distinctions
  • Ambiguities may be removed
  • e.g. deal tagged with NN or VB
  • e.g. deal tagged with DEAL1 or DEAL2
  • Helps classification and prediction

9
Significance of Parts of Speech
  • A word's POS tells us a lot about the word and
    its neighbors
  • Limits the range of meanings (deal),
    pronunciation (OBject vs. obJECT), or both (wind)
  • Helps in stemming
  • Limits the range of following words for Speech
    Recognition
  • Can help select nouns from a document for IR
  • Basis for partial parsing (chunked parsing)
  • Parsers can build trees directly on the POS tags
    instead of maintaining a lexicon

10
Word Classes
  • What are we trying to classify words into?
  • Classes based on
  • Syntactic properties. What can precede/follow.
  • Morphological properties. What affixes they
    take.
  • Not primarily by semantic coherence (Conjunction
    Junction notwithstanding!)
  • Broad "grammar" categories are familiar
  • NLP uses much richer "tagsets"

11
Open and closed class words
  • Two major categories of classes
  • Closed class: a relatively fixed membership
  • Prepositions: of, in, by, ...
  • Auxiliaries: may, can, will, had, been, ...
  • Pronouns: I, you, she, mine, his, them, ...
  • Usually function words (short common words which
    play a role in grammar)
  • Open class: new ones can be created all the time
  • English has 4: Nouns, Verbs, Adjectives, Adverbs
  • Many languages have all 4, but not all!

12
Open Class Words
  • Every known human language has nouns and verbs
  • Nouns: people, places, things
  • Classes of nouns:
  • proper vs. common
  • count vs. mass
  • Verbs: actions and processes
  • Adjectives: properties, qualities
  • Adverbs: a hodgepodge!
  • Unfortunately, John walked home extremely slowly
    yesterday

13
Closed Class Words
  • Idiosyncratic. Differ more from language to
    language.
  • Language strongly resists additions
  • Examples
  • prepositions: on, under, over, ...
  • particles: up, down, on, off, ...
  • determiners: a, an, the, ...
  • pronouns: she, who, I, ...
  • conjunctions: and, but, or, ...
  • auxiliary verbs: can, may, should, ...
  • numerals: one, two, three, third, ...

14
Prepositions from CELEX
15
English Single-Word Particles
16
Pronouns in CELEX
17
Conjunctions
18
Auxiliaries
19
POS Tagging: Choosing a Tagset
  • Many parts of speech, potential distinctions
  • To do POS tagging, need to choose a standard set
    of tags to work with
  • Sets vary in number of tags: from a dozen to
    over 200
  • Size of tag set depends on language, objectives,
    and purpose
  • Need to strike a balance between:
  • Getting better information about context (best
    to introduce more distinctions)
  • Making it possible for classifiers to do their
    job (need to minimize distinctions)

20
Some of the best-known Tagsets
  • Brown corpus: 87 tags
  • Penn Treebank: 45 tags
  • Lancaster UCREL C5 (used to tag the BNC): 61 tags
  • Lancaster C7: 145 tags

21
The Brown Corpus
  • The first digital corpus (1961)
  • Francis and Kucera, Brown University
  • Contents: 500 texts, each 2000 words long
  • From American books, newspapers, magazines
  • Representing genres:
  • Science fiction, romance fiction, press
    reportage, scientific writing, popular lore

22
Penn Treebank
  • First syntactically annotated corpus
  • 1 million words from Wall Street Journal
  • Part of speech tags and syntax trees

23
Tag Set Example: Penn Treebank
24
Example of Penn Treebank Tagging of Brown Corpus
Sentence
  • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
    number/NN of/IN other/JJ topics/NNS ./.
  • Book/VB that/DT flight/NN ./.
  • Does/VBZ that/DT flight/NN serve/VB dinner/NN ?/.

25
POS Tagging
  • Words often have more than one POS: back
  • The back door: JJ
  • On my back: NN
  • Win the voters back: RB
  • Promised to back the bill: VB
  • The POS tagging problem is to determine the POS
    tag for a particular instance of a word.

26
Word Class Ambiguity (in the Brown Corpus)
  • Unambiguous (1 tag): 35,340
  • Ambiguous (2-7 tags): 4,100

(DeRose, 1988)
27
Part-of-Speech Tagging
  • Rule-Based Tagger: ENGTWOL
  • Stochastic Tagger: HMM-based
  • Transformation-Based Tagger: Brill

28
Rule-Based Tagging
  • Basic Idea
  • Assign all possible tags to words
  • Remove tags according to a set of rules of the
    type: if word+1 is an adj, adv, or quantifier
    and the following is a sentence boundary and
    word-1 is not a verb like "consider", then
    eliminate non-adv, else eliminate adv.
  • Typically more than 1000 hand-written rules, but
    may be machine-learned.

29
Start With a Dictionary
  • she: PRP
  • promised: VBN, VBD
  • to: TO
  • back: VB, JJ, RB, NN
  • the: DT
  • bill: NN, VB
  • Etc. for the 100,000 words of English

30
Assign All Possible Tags
  • She/PRP promised/VBN,VBD to/TO back/VB,JJ,RB,NN
    the/DT bill/NN,VB

31
Write rules to eliminate tags
  • Eliminate VBN if VBD is an option when VBN|VBD
    follows "<start> PRP"
  • She/PRP promised/VBD to/TO back/VB,JJ,RB,NN
    the/DT bill/NN,VB
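
A minimal Python sketch of the assign-then-eliminate idea, using the toy six-word lexicon from slide 29 and only the single rule above (illustrative only, not the real ENGTWOL rule set):

    # Toy lexicon from slide 29 (a real system covers ~100,000 words).
    LEXICON = {
        "she": ["PRP"], "promised": ["VBN", "VBD"], "to": ["TO"],
        "back": ["VB", "JJ", "RB", "NN"], "the": ["DT"], "bill": ["NN", "VB"],
    }

    def tag_candidates(sentence):
        # Step 1: assign every tag the lexicon allows for each word.
        tags = [list(LEXICON[w.lower()]) for w in sentence]
        # Step 2: eliminate VBN when VBD is also possible and the previous
        # word is unambiguously PRP (or the sentence start).
        for i, options in enumerate(tags):
            prev = tags[i - 1] if i > 0 else ["<start>"]
            if "VBN" in options and "VBD" in options and prev in (["<start>"], ["PRP"]):
                options.remove("VBN")
        return tags

    sentence = "She promised to back the bill".split()
    for word, options in zip(sentence, tag_candidates(sentence)):
        print(word, options)   # "promised" keeps only VBD, matching the slide
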
32
Sample ENGTWOL Lexicon
33
Stochastic Tagging
  • Based on probability of certain tag occurring
    given various possibilities
  • Necessitates a training corpus
  • No probabilities for words not in corpus.
  • Training corpus may be too different from test
    corpus.

34
Stochastic Tagging (cont.)
  • Simple Method: Choose most frequent tag in
    training text for each word!
  • Result: 90% accuracy
  • Why?
  • Baseline: others will do better
  • HMM is an example
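
A sketch of this most-frequent-tag baseline, assuming the training data arrives as a list of (word, tag) pairs; the sample data below is made up:

    from collections import Counter, defaultdict

    def train_most_frequent_tag(tagged_words):
        # Count how often each tag appears with each word in the training corpus.
        counts = defaultdict(Counter)
        for word, tag in tagged_words:
            counts[word][tag] += 1
        # Keep only the single most frequent tag per word.
        return {word: c.most_common(1)[0][0] for word, c in counts.items()}

    def tag(words, model, default="NN"):
        # Words never seen in training get a default tag (cf. slide 58).
        return [(w, model.get(w, default)) for w in words]

    model = train_most_frequent_tag(
        [("the", "DT"), ("race", "NN"), ("race", "NN"), ("race", "VB")])
    print(tag(["the", "race", "ended"], model))
    # [('the', 'DT'), ('race', 'NN'), ('ended', 'NN')]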

35
HMM Tagger
  • Intuition: Pick the most likely tag for this
    word.
  • HMM Taggers choose the tag sequence that
    maximizes this formula:
  • P(word|tag) × P(tag|previous n tags)
  • Let T = t1,t2,...,tn and let W = w1,w2,...,wn
  • Find POS tags that generate a sequence of words,
    i.e., look for most probable sequence of tags T
    underlying the observed words W.

36
Conditional Probability
  • A brief digression
  • Conditional probability: how do we determine the
    likelihood of one event following another if they
    are not independent?
  • Example
  • I am trying to diagnose a rash in a 6-year-old
    child.
  • Is it measles?
  • In other words, given that the child has a rash,
    what is the probability that it is measles?

37
Conditional Probabilities cont.
  • What would affect your decision?
  • The overall frequency of rashes in 6-yr-olds
  • The overall frequency of measles in 6-yr-olds
  • The frequency with which 6-yr-olds with measles
    have rashes.
  • P(measles|rash) = P(rash|measles) × P(measles) / P(rash)
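
A tiny worked example of the arithmetic, with invented numbers (not real medical statistics):

    # All three inputs below are made-up values for illustration.
    p_rash_given_measles = 0.95   # how often measles produces a rash
    p_measles = 0.001             # overall frequency of measles in 6-yr-olds
    p_rash = 0.05                 # overall frequency of rashes in 6-yr-olds

    p_measles_given_rash = p_rash_given_measles * p_measles / p_rash
    print(p_measles_given_rash)   # ~0.019: a rash alone is weak evidence of measles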

38
Bayes' Theorem
  • Bayes' Theorem or Bayes' Rule formalizes this
    intuition
  • P(X|Y) = P(Y|X) × P(X) / P(Y)
  • P(X) and P(Y) are known as the "prior
    probabilities" or "priors".

39
Probabilities
  • We want the best set of tags for a sequence of
    words (a sentence)
  • W is a sequence of words
  • T is a sequence of tags

40
Probabilities
  • We want the best set of tags for a sequence of
    words (a sentence)
  • W is a sequence of words
  • T is a sequence of tags
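
In symbols, this is the standard Bayesian formulation (as in Jurafsky and Martin): we want the tag sequence that maximizes P(T|W), and since P(W) is the same for every candidate T it can be dropped from the maximization:

    \hat{T} = \arg\max_{T} P(T \mid W)
            = \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)}
            = \arg\max_{T} P(W \mid T)\,P(T)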

41
Tag Sequence P(T)
  • How do we get the probability of a specific tag
    sequence?
  • Count the number of times a sequence occurs and
    divide by the number of sequences of that length.
    Not likely.
  • Make a Markov assumption and use N-grams over
    tags...
  • P(T) is a product of the probability of N-grams
    that make it up

42
N-Grams
  • The N stands for how many terms are used
  • Unigram: 1 term; Bigram: 2 terms; Trigram: 3
    terms
  • Usually don't go beyond 3.
  • You can use different kinds of terms, e.g.
  • Character based n-grams
  • Word-based n-grams
  • POS-based n-grams
  • Ordering
  • Often adjacent, but not required
  • We use N-grams to help determine the context in
    which some linguistic phenomenon happens.
  • E.g., look at the words before and after the
    period to see if it is the end of a sentence or
    not.
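
A quick sketch of extracting word-based n-grams; the same function works for character-based or POS-based n-grams if you pass a different sequence:

    def ngrams(tokens, n):
        # Slide a window of length n across the sequence of tokens.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    words = "the koala put the keys on the table".split()
    print(ngrams(words, 2))   # bigrams: ('the', 'koala'), ('koala', 'put'), ...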

43
P(T) Bigram Example
  • Given a sentence
  • <s> Det Adj Adj Noun </s>
  • Probability is product of four N-grams
  • P(Det|<s>) × P(Adj|Det) × P(Adj|Adj) × P(Noun|Adj)
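
The same product in Python, with invented bigram probabilities standing in for corpus estimates (the next slide covers where the real numbers come from):

    # Made-up tag-bigram probabilities, for illustration only.
    BIGRAM_PROB = {("<s>", "Det"): 0.6, ("Det", "Adj"): 0.3,
                   ("Adj", "Adj"): 0.1, ("Adj", "Noun"): 0.5}

    def p_tag_sequence(tags):
        # Multiply the probability of each adjacent tag pair, starting from <s>.
        prob = 1.0
        for prev, cur in zip(["<s>"] + tags, tags):
            prob *= BIGRAM_PROB.get((prev, cur), 0.0)
        return prob

    print(p_tag_sequence(["Det", "Adj", "Adj", "Noun"]))   # 0.6 * 0.3 * 0.1 * 0.5 = 0.009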

44
Counts
  • Where do you get the N-gram counts?
  • From a large hand-tagged corpus.
  • For N-grams, count all the Tag_i, Tag_i+1 pairs
  • And smooth them to get rid of the zeroes
  • Alternatively, you can learn them from an
    untagged corpus
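
A sketch of getting those counts from a hand-tagged corpus, with add-one smoothing standing in for "getting rid of the zeroes" (the corpus and tagset here are toy examples):

    from collections import Counter

    def bigram_probabilities(tag_sequences, tagset):
        bigrams, unigrams = Counter(), Counter()
        for tags in tag_sequences:
            padded = ["<s>"] + list(tags)
            unigrams.update(padded[:-1])
            bigrams.update(zip(padded, padded[1:]))
        # Add-one (Laplace) smoothing: unseen tag pairs get a small nonzero probability.
        v = len(tagset)
        return {(t1, t2): (bigrams[(t1, t2)] + 1) / (unigrams[t1] + v)
                for t1 in list(tagset) + ["<s>"] for t2 in tagset}

    probs = bigram_probabilities([["DT", "NN", "VBD"], ["DT", "JJ", "NN"]],
                                 {"DT", "NN", "VBD", "JJ"})
    print(probs[("DT", "NN")], probs[("NN", "DT")])   # seen pair vs. smoothed unseen pair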

45
What about P(W|T)?
  • First, it's odd. It is asking the probability of
    seeing "The big red dog" given "Det Adj Adj Noun"
  • Collect up all the times you see that tag
    sequence and see how often "The big red dog"
    shows up. Again, not likely to work.

46
P(W|T)
  • We'll make the following assumption (because it's
    easy): each word in the sequence depends only on
    its corresponding tag. So:
  • P(W|T) ≈ P(w1|t1) × P(w2|t2) × ... × P(wn|tn)
  • How do you get the statistics for that?
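
One answer is relative frequency from the same hand-tagged corpus; a sketch with toy data:

    from collections import Counter

    def emission_probabilities(tagged_words):
        # P(word|tag) = count(word tagged with tag) / count(tag)
        pair_counts = Counter(tagged_words)
        tag_counts = Counter(tag for _, tag in tagged_words)
        return {(w, t): c / tag_counts[t] for (w, t), c in pair_counts.items()}

    corpus = [("race", "NN"), ("race", "VB"), ("dog", "NN"), ("the", "DT")]
    probs = emission_probabilities(corpus)
    print(probs[("race", "NN")])   # 0.5: half of the NN tokens in this toy corpus are "race"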

47
So
  • We start with: argmax_T P(W|T) × P(T)
  • And get: argmax_T ∏_i P(wi|ti) × P(ti|ti-1)

48
HMMs
  • This is a Hidden Markov Model (HMM)
  • The states in the model are the tags, and the
    observations are the words.
  • The state to state transitions are driven by the
    bigram statistics
  • The observed words are based solely on the state
    that you're in

49
An Example
  • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
    tomorrow/NN
  • People/NNS continue/VBP to/TO inquire/VB the/DT
    reason/NN for/IN the/DT race/NN for/IN outer/JJ
    space/NN
  • to/TO race/???  ...  the/DT race/???
  • ti = argmax_j P(tj|ti-1) P(wi|tj)
  • max[ P(VB|TO) P(race|VB), P(NN|TO) P(race|NN) ]
  • Brown corpus estimates:
  • P(NN|TO) = .021, P(race|NN) = .00041 → .000007
  • P(VB|TO) = .34, P(race|VB) = .00003 → .00001
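
The same comparison as a couple of lines of Python, using the Brown estimates quoted above:

    p_race_as_nn = 0.021 * 0.00041   # P(NN|TO) * P(race|NN)
    p_race_as_vb = 0.34 * 0.00003    # P(VB|TO) * P(race|VB)
    print("VB" if p_race_as_vb > p_race_as_nn else "NN")   # VB: "race" after to/TO is tagged as a verb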

50
Performance
  • This method has achieved 95-96% correct with
    reasonably complex English tagsets and reasonable
    amounts of hand-tagged training data.

51
Transformation-Based Tagging (Brill Tagging)
  • Combination of Rule-based and stochastic tagging
    methodologies
  • Like rule-based because rules are used to specify
    tags in a certain environment
  • Like the stochastic approach because machine
    learning is used, with a tagged corpus as input
  • Transformation-Based Learning (TBL)
  • Input
  • tagged corpus
  • dictionary (with most frequent tags)

52
Transformation-Based Tagging (cont.)
  • Basic Idea
  • Set the most probable tag for each word as a
    start value
  • Change tags according to rules of type if word-1
    is a determiner and word is a verb then change
    the tag to noun in a specific order
  • Training is done on a tagged corpus:
  • 1. Write a set of rule templates
  • 2. Among the set of rules, find the one with the
    highest score
  • 3. Continue from 2 until some lowest score
    threshold is passed
  • 4. Keep the ordered set of rules
  • Rules make errors, corrected by later rules

53
TBL Rule Application
  • Tagger labels every word with its most-likely tag
  • For example, race has the following probabilities
    in the Brown corpus:
  • P(NN|race) = .98
  • P(VB|race) = .02
  • Transformation rules make changes to tags
  • Change NN to VB when the previous tag is TO:
    is/VBZ expected/VBN to/TO race/NN tomorrow/NN
    becomes is/VBZ expected/VBN to/TO race/VB
    tomorrow/NN
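
A sketch of applying that single transformation to a tagged sentence (the rule representation here is simplified; real Brill rules are instantiated from templates, slides 55-56):

    def apply_transformation(tagged, from_tag, to_tag, prev_tag):
        # Change from_tag to to_tag whenever the preceding token carries prev_tag.
        out = list(tagged)
        for i in range(1, len(out)):
            word, tag = out[i]
            if tag == from_tag and out[i - 1][1] == prev_tag:
                out[i] = (word, to_tag)
        return out

    sent = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]
    print(apply_transformation(sent, "NN", "VB", "TO"))
    # race/NN becomes race/VB; tomorrow/NN is untouched since its predecessor is not TO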

54
TBL The Rule-Learning Algorithm
  • Step 1: Label every word with its most likely tag
    (from the dictionary)
  • Step 2: Check every possible transformation;
    select the one which most improves tagging
  • Step 3: Re-tag the corpus, applying the rules
  • Repeat 2-3 until some criterion is reached, e.g.,
    X% correct with respect to the training corpus
  • RESULT: Sequence of transformation rules
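
A compressed sketch of that loop; it assumes each candidate rule object has an apply() method that re-tags the corpus, and scores a rule by how many tags it fixes minus how many it breaks (names and interfaces are illustrative, not Brill's actual implementation):

    def learn_transformations(initial_tags, gold_tags, candidate_rules, min_gain=1):
        current, learned = list(initial_tags), []
        while True:
            # Step 2: score every candidate rule by its net improvement.
            gains = [(net_gain(rule, current, gold_tags), rule) for rule in candidate_rules]
            best_gain, best_rule = max(gains, key=lambda pair: pair[0])
            if best_gain < min_gain:             # stopping criterion
                break
            current = best_rule.apply(current)   # Step 3: re-tag with the chosen rule
            learned.append(best_rule)
        return learned                           # ordered sequence of transformation rules

    def net_gain(rule, current, gold):
        retagged = rule.apply(current)
        return (sum(r == g for r, g in zip(retagged, gold))
                - sum(c == g for c, g in zip(current, gold)))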

55
TBL Rule Learning (cont.)
  • Problem: Could apply transformations ad
    infinitum!
  • Constrain the set of transformations with
    templates:
  • "Replace tag X with tag Y, provided tag Z or word
    Z appears in some position"
  • Rules are learned in ordered sequence
  • Rules may interact.
  • Rules are compact and can be inspected by humans

56
Templates for TBL
57
TBL Problems
  • First 100 rules achieve 96.8% accuracy; first 200
    rules achieve 97.0% accuracy
  • Execution Speed: TBL tagger is slower than the
    HMM approach
  • Learning Speed: Brill's implementation can take
    over a day (600k tokens)
  • BUT
  • (1) Learns small number of simple,
    non-stochastic rules
  • (2) Can be made to work faster with FST
  • (3) Best performing algorithm on unknown words

58
Tagging Unknown Words
  • Major continuing issue in taggers
  • New words added to the language: 20 per month
  • Plus many proper names
  • Increases error rates by 1-2%
  • Method 1: assume they are nouns
  • Method 2: assume the unknown words have a
    probability distribution similar to words
    occurring only once in the training set.
  • Method 3: Use morphological information, e.g.,
    words ending with -ed tend to be tagged VBN.
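
A sketch combining Method 3 with Method 1 as the fallback; the suffix rules shown are illustrative examples, not a tuned set:

    def guess_unknown_tag(word):
        # Morphological clues for words never seen in training.
        if word[0].isupper():
            return "NNP"          # capitalized: likely a proper name
        if word.endswith("ed"):
            return "VBN"          # -ed words tend to be past participles
        if word.endswith("ing"):
            return "VBG"
        if word.endswith("ly"):
            return "RB"
        if word.endswith("s"):
            return "NNS"
        return "NN"               # Method 1: when all else fails, assume a noun

    print(guess_unknown_tag("rebooted"), guess_unknown_tag("Secretariat"))   # VBN NNP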

59
Evaluating performance
  • How do we know how well a tagger does?
  • Say we had a test sentence, or a set of test
    sentences, that were already tagged by a human (a
    Gold Standard)
  • We could run a tagger on this set of test
    sentences
  • And see how many of the tags we got right.
  • This is called Tag accuracy or Tag percent
    correct

60
Test set
  • We take a set of test sentences
  • Hand-label them for part of speech
  • The result is a Gold Standard test set
  • Who does this?
  • Brown corpus done by U Penn
  • Grad students in linguistics
  • Don't they disagree?
  • Yes! But on about 97% of tags there are no
    disagreements
  • And if you let the taggers discuss the remaining
    3%, they often reach agreement

61
So What's "Good"?
  • If we tag every word with its most frequent POS
    we get about 90% accuracy, so this is a minimum
    tagger.
  • Human taggers (without discussion) agree about
    97% of the time, so if we can get to 97% we have
    done about as well as we can.

62
Training and test sets
  • But we can't train our frequencies on the test
    set sentences. (Why?)
  • So for testing the Most-Frequent-Tag algorithm
    (or any other stochastic algorithm), we need 2
    things
  • A hand-labeled training set: the data that we
    compute frequencies from, etc.
  • A hand-labeled test set: the data that we use to
    compute our % correct.

63
Computing % correct
  • Of all the words in the test set:
  • For what percent of them did the tag chosen by
    the tagger equal the human-selected tag?
  • Human tag set (Gold Standard set)
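
A sketch of the computation, comparing predicted tags to the gold-standard tags token by token:

    def tag_accuracy(predicted_tags, gold_tags):
        # Percentage of test tokens whose predicted tag matches the gold tag.
        assert len(predicted_tags) == len(gold_tags)
        correct = sum(p == g for p, g in zip(predicted_tags, gold_tags))
        return 100.0 * correct / len(gold_tags)

    print(tag_accuracy(["DT", "NN", "VB"], ["DT", "NN", "NN"]))   # 66.66...: 2 of 3 tags correct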

64
Training and Test sets
  • Often they come from the same labeled corpus!
  • We just use 90% of the corpus for training and
    save out 10% for testing!
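
A sketch of the split, holding out the last 10% of the labeled sentences for testing (in practice you would also check that the held-out text is representative):

    def train_test_split(sentences, train_fraction=0.9):
        # Split whole sentences, not individual words, so no sentence straddles the boundary.
        cut = int(len(sentences) * train_fraction)
        return sentences[:cut], sentences[cut:]

    train, test = train_test_split(list(range(100)))   # stand-in for 100 tagged sentences
    print(len(train), len(test))   # 90 10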