CIS 530 Part of Speech Tagging

Slides: 34
Provided by: mitchel4


1
CIS 530 - Part of Speech Tagging
  • Reading JM 5.1-5.5.1 (MS 3.1, 4.3.2,
    10.1-10.3)
  • (A few slides adapted from Dan Jurafsky, Jim
    Martin, Dekang Lin, Rada Mihalcea, and Bonnie
    Dorr.)

2
Parts of Speech
  • 8 (ish) traditional parts of speech
  • Noun, verb, adjective, preposition, adverb,
    article, interjection, pronoun, conjunction, etc
  • This idea has been around for over 2000 years
    (Dionysius Thrax of Alexandria, c. 100 B.C.)
  • Also called parts of speech, lexical categories, word
    classes, morphological classes, lexical tags, POS
  • We'll use POS most frequently
  • (This and next 4 slides from Dan Jurafsky from
    slides by Jim Martin, Dekang Lin, and Bonnie
    Dorr.)

3
POS examples for English
  • N (noun): chair, bandwidth, pacing
  • V (verb): study, debate, munch
  • ADJ (adjective): purple, tall, ridiculous
  • ADV (adverb): unfortunately, slowly, ...
  • P (preposition): of, by, to
  • PRO (pronoun): I, me, mine
  • DET (determiner): the, a, that, those

4
Open vs. Closed classes
  • Open
  • Nouns, verbs, adjectives, adverbs
  • Why open?
  • Closed
  • Determiners: a, an, the
  • Pronouns: she, he, I
  • Prepositions: on, under, over, near, by, ...

5
Open Class Words
  • Every known human language has nouns and verbs
  • Nouns: people, places, things
  • Classes of nouns
  • proper vs. common
  • count vs. mass
  • Verbs: actions and processes
  • Adjectives: properties, qualities
  • Adverbs: hodgepodge!
  • "Unfortunately, John walked home extremely slowly
    yesterday"

6
Closed Class Words
  • Differ more from language to language than open
    class words
  • Examples
  • prepositions: on, under, over, ...
  • particles: up, down, on, off, ...
  • determiners: a, an, the, ...
  • pronouns: she, who, I, ...
  • conjunctions: and, but, or, ...
  • auxiliary verbs: can, may, should, ...
  • numerals: one, two, three, third, ...

7
Prepositions from CELEX
8
Pronouns in CELEX
9
Conjunctions
10
Auxiliaries
11
NLP Task I: Determining Part-of-Speech Tags
  • The Problem

12
POS Tagging Definition
  • The process of assigning a part-of-speech or
    lexical class marker to each word in a corpus

13
POS Tagging example
  • WORD tag
  • the DET
  • koala N
  • put V
  • the DET
  • keys N
  • on P
  • the DET
  • table N
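
One way to picture the output of a tagger is as a list of (word, tag) pairs. The sketch below reproduces the example above with a toy lookup table; the `LEXICON` dictionary, the `tag_sentence` helper, and the `UNK` label are hypothetical names introduced only for illustration, and a one-tag-per-word lookup ignores the ambiguity that makes real tagging hard.

```python
# Toy lexicon covering only the example sentence (hypothetical, one tag per word).
LEXICON = {
    "the": "DET",
    "koala": "N",
    "put": "V",
    "keys": "N",
    "on": "P",
    "table": "N",
}


def tag_sentence(words, lexicon=LEXICON, unknown="UNK"):
    """Pair each word with its lexicon tag; words not in the lexicon get `unknown`."""
    return [(w, lexicon.get(w.lower(), unknown)) for w in words]


tagged = tag_sentence("the koala put the keys on the table".split())
for word, tag in tagged:
    print(f"{word}\t{tag}")
```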

14
What is POS tagging good for?
  • Speech synthesis
  • How to pronounce "lead"?
  • INsult vs. inSULT
  • OBject vs. obJECT
  • OVERflow vs. overFLOW
  • DIScount vs. disCOUNT
  • CONtent vs. conTENT
  • Stemming for information retrieval
  • Knowing a word is a noun tells you it takes plurals
  • A search for "aardvarks" can also retrieve "aardvark"
  • Parsing, speech recognition, etc.
  • Possessive pronouns (my, your, her) are likely to be
    followed by nouns
  • Personal pronouns (I, you, he) are likely to be
    followed by verbs

15
Equivalent Problem in Bioinformatics
  • Durbin et al. Biological Sequence Analysis,
    Cambridge University Press.
  • Several applications, e.g. proteins
  • From primary structure: ATCPLELLLD
  • Infer secondary structure: HHHBBBBBC..

16
History (from Yair Halevi, Bar-Ilan U.)
  • Tagging methods (tagging accuracy, %)
  • Greene and Rubin, rule based: 70
  • HMM Tagging (CLAWS): 93-95
  • DeRose/Church, efficient HMM, sparse data: 95
  • Transformation Based Tagging (Eric Brill), rule based: 95
  • Tree-Based Statistics (Helmut Schmid), rule based: 96
  • Neural Network: 96
  • Trigram Tagger (Kempe): 96
  • Combined Methods: 98
  • Corpora and milestones
  • Brown Corpus created (EN-US), 1 million words
  • Brown Corpus tagged
  • LOB Corpus created (EN-UK), 1 million words
  • LOB Corpus tagged
  • POS tagging separated from other NLP
  • Penn Treebank Corpus (WSJ, 4.5M)
  • British National Corpus (tagged by CLAWS)
17
POS Tag Sets for English Design
18
Penn Treebank Tagset
19
A Simplified Tagset for English
  • Tagsets for English have grown progressively
    larger since the Brown Corpus until the Penn
    Treebank project.

20
Rationale behind British and European tag sets
  • "To provide distinct codings for all classes of
    words having distinct grammatical behaviour"
    (Garside et al. 1987)
  • The Lund tagset for adverbs distinguishes
    between
  • Adjunct: Process, Space, Time
  • Wh-type: Manner, Reason, Space, Time, Wh-type
    S
  • Conjunct: Appositional, Contrastive,
    Inferential, Listing, ...
  • Disjunct: Content, Style
  • Postmodifier: else
  • Negative: not
  • Discourse Item: Appositional, Expletive,
    Greeting, Hesitator, ...

21
One of Several Reasons for a Smaller Tagset
  • Many tags are unique to particular lexical items,
    and can be recovered automatically if desired.

22
Syntactic Recoverability
  • Prepositions vs. subordinating conjunctions
  • "Since the last meeting, things have changed."
  • "Since we first learned about stochastic methods,
    things have changed."
  • We tag both as IN
  • Subject vs. object pronouns
  • Recoverable from position in parse tree
  • "To" as preposition vs. "to" as auxiliary
  • Can be recovered by position in parse tree
  • BIG MISTAKE: the parser needs this information.

23
POS Tagging - Statistical Models
24
Task I: Determining Part-of-Speech Tags
  • The Problem
  • The Old Solution: combinatoric search.
  • If each of n words has k tags on average, try
    the k^n combinations until one works.
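
To make the cost of the combinatoric search concrete, here is a minimal sketch that enumerates every candidate tag sequence for a short sentence. The `CANDIDATE_TAGS` lexicon and the `all_taggings` helper are hypothetical and chosen only for illustration; the point is that the number of sequences is the product of the per-word tag counts, which grows exponentially with sentence length.

```python
from itertools import product

# Hypothetical ambiguity lexicon: the tags each word could take.
CANDIDATE_TAGS = {
    "time": ["N", "V"],
    "flies": ["N", "V"],
    "like": ["V", "P"],
    "an": ["DET"],
    "arrow": ["N"],
}


def all_taggings(words, candidates=CANDIDATE_TAGS):
    """Return an iterator over every possible tag sequence for `words`."""
    return product(*(candidates[w] for w in words))


words = "time flies like an arrow".split()
taggings = list(all_taggings(words))
print(len(taggings))  # 2 * 2 * 2 * 1 * 1 = 8

# With k tags for each of n words the count is k**n.
k, n = 3, 20
print(k ** n)  # 3486784401
```

Even at three tags per word, a 20-word sentence already has billions of candidate sequences, which is why exhaustive search does not scale.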

25
NLP Task I: Determining Part-of-Speech Tags
  • The Old Solution: depth-first search.
  • If each of n words has k tags on average, try
    the k^n combinations until one works.
  • Machine Learning Solutions: automatically learn
    part-of-speech (POS) assignment.
  • The best techniques achieve 96-97% accuracy per
    word on new materials, given large training
    corpora.

26
Simple Statistical Approaches: Idea 1
27
Simple Statistical Approaches: Idea 2
  • For a string of words
  • W = w_1 w_2 w_3 ... w_n
  • find the string of POS tags
  • T = t_1 t_2 t_3 ... t_n
  • which maximizes P(T|W)
  • i.e., the probability of tag string T given that
    the word string was W
  • i.e., that W was tagged T
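
Written as a formula, the objective above is an argmax over tag strings. The Bayes' rule step on the second line is not in the surviving slide text; it is the standard rewriting and is included here as an assumption about where the argument is heading.

```latex
\hat{T} \;=\; \arg\max_{T} \, P(T \mid W),
\qquad W = w_1 w_2 \dots w_n, \quad T = t_1 t_2 \dots t_n

% Standard rewriting (assumed, not from the slide): P(W) does not depend on T.
\hat{T} \;=\; \arg\max_{T} \, \frac{P(W \mid T)\, P(T)}{P(W)}
        \;=\; \arg\max_{T} \, P(W \mid T)\, P(T)
```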

28
Again, The Sparse Data Problem
  • A simple, impossible approach to computing P(T|W)
  • Count up instances of the string "heat oil in a
    large pot" in the training corpus, and pick the
    most common tag assignment to the string.

29
A Back-of-the-Envelope (BOTEC) Estimate of What We Can Estimate
  • What parameters can we estimate with a million
    words of hand-tagged training data?
  • Assume a uniform distribution of 5000 words and
    40 part-of-speech tags.
  • Rich models often require vast amounts of data
  • Good estimates of models with bad assumptions
    often outperform better models which are badly
    estimated
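
The arithmetic behind this estimate can be sketched as follows. The corpus size, vocabulary size, and tag count come from the slide; the particular list of models compared is an assumption made for illustration, since the slide's own table did not survive. Under the uniform assumption, the average number of training tokens per parameter is simply the corpus size divided by the parameter count.

```python
TRAINING_TOKENS = 1_000_000  # hand-tagged words (from the slide)
VOCAB = 5_000                # distinct words (from the slide)
TAGS = 40                    # part-of-speech tags (from the slide)

# Parameter counts for a few candidate models (this selection is illustrative).
models = {
    "tag unigrams P(t)": TAGS,
    "tag bigrams P(t_i | t_i-1)": TAGS ** 2,
    "tag trigrams P(t_i | t_i-2, t_i-1)": TAGS ** 3,
    "word given tag P(w | t)": VOCAB * TAGS,
    "word bigrams P(w_i | w_i-1)": VOCAB ** 2,
}

for name, n_params in models.items():
    per_param = TRAINING_TOKENS / n_params
    print(f"{name:36s} {n_params:>12,d} params {per_param:>12.2f} tokens/param")
```

Tag bigrams get hundreds of observations per parameter and word-given-tag gets a handful, while word bigrams have far more parameters than training tokens.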

30
A Practical Statistical Tagger
31
A Practical Statistical Tagger II
  • But we can't accurately estimate more than tag
    bigrams or so
  • Again, we change to a model that we CAN estimate

32
A Practical Statistical Tagger III
  • So, for a given string W = w_1 w_2 w_3 ... w_n, the tagger
    needs to find the string of tags T which maximizes
    the probability assigned by this model
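
The formula for this slide did not survive in the transcript. Assuming the model is the usual tag-bigram hidden Markov model (consistent with the remark that tag bigrams are about the most that can be estimated), the quantity to maximize would take the following form, where t_0 is an assumed sentence-start symbol.

```latex
\hat{T} \;=\; \arg\max_{t_1 \dots t_n} \;
\prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
```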

33
Training and Performance
  • The parameters of this model are estimated from
    counts in an annotated training corpus
  • Because many of these counts are small, smoothing
    is necessary for best results
  • Such taggers typically achieve about 95-96%
    correct tagging, for tag sets of 40-80 tags.
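
As a rough illustration of training from counts with smoothing, here is a minimal sketch of a tag-bigram HMM tagger with Viterbi decoding. Everything specific here is an assumption: the tiny `CORPUS`, the `<s>` start symbol, the function names, and the choice of add-one smoothing (the slides do not say which smoothing method is meant). It is a toy, not the tagger described in the lecture.

```python
import math
from collections import Counter

# Tiny hand-tagged corpus (hypothetical, for illustration only).
CORPUS = [
    [("the", "DET"), ("koala", "N"), ("put", "V"), ("the", "DET"),
     ("keys", "N"), ("on", "P"), ("the", "DET"), ("table", "N")],
    [("the", "DET"), ("keys", "N"), ("fell", "V")],
    [("the", "DET"), ("table", "N"), ("fell", "V"), ("on", "P"),
     ("the", "DET"), ("koala", "N")],
]
START = "<s>"  # assumed sentence-start symbol


def train(corpus):
    """Collect tag-bigram counts and word-given-tag counts."""
    trans, ctx, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    vocab = set()
    for sent in corpus:
        prev = START
        for word, tag in sent:
            trans[prev, tag] += 1   # count(prev tag, tag)
            ctx[prev] += 1          # count(prev tag as a context)
            emit[tag, word] += 1    # count(tag, word)
            tag_count[tag] += 1
            vocab.add(word)
            prev = tag
    return trans, ctx, emit, tag_count, vocab


def viterbi(words, model):
    """Return the most probable tag sequence under add-one smoothed estimates."""
    if not words:
        return []
    trans, ctx, emit, tag_count, vocab = model
    tags = sorted(tag_count)
    n_tags = len(tags)
    n_words = len(vocab) + 1  # one extra slot reserves mass for unseen words

    def log_trans(prev, tag):
        return math.log((trans[prev, tag] + 1) / (ctx[prev] + n_tags))

    def log_emit(tag, word):
        return math.log((emit[tag, word] + 1) / (tag_count[tag] + n_words))

    # best[t] = (log probability of the best path ending in t, that path)
    best = {t: (log_trans(START, t) + log_emit(t, words[0]), [t]) for t in tags}
    for word in words[1:]:
        new_best = {}
        for t in tags:
            score, path = max(
                (best[p][0] + log_trans(p, t), best[p][1]) for p in tags
            )
            new_best[t] = (score + log_emit(t, word), path + [t])
        best = new_best
    return max(best.values())[1]


model = train(CORPUS)
print(viterbi("the koala fell on the table".split(), model))
```

Decoding takes time proportional to n times the square of the number of tags, rather than enumerating every tag sequence.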