Parts of Speech presentation


1
Parts of Speech

Sudeshna Sarkar
7 Aug 2008

2
Why Do We Care about Parts of Speech?

Pronunciation
Hand me the lead pipe.
Predicting what words can be expected next
Personal pronoun (e.g., I, she) ____________
Stemming
-s means singular for verbs, plural for nouns
As the basis for syntactic parsing and then
meaning extraction
I will lead the group into the lead smelter.
Machine translation
(E) content N → (F) contenu N
(E) content Adj → (F) content Adj or
satisfait Adj

3
What is a Part of Speech?
Is this a semantic distinction? For example,
maybe Noun is the class of words for people,
places and things. Maybe Adjective is the class
of words for properties of nouns.
Consider: green book
book is a Noun; green is an Adjective
Now consider: book worm
This green is very soothing.
4
How Many Parts of Speech Are There?

A first cut at the easy distinctions
Open classes
nouns, verbs, adjectives, adverbs
Closed classes (function words)
conjunctions: and, or, but
pronouns: I, she, him
prepositions: with, on
determiners: the, a, an

5
Part of speech tagging

8 (ish) traditional parts of speech
Noun, verb, adjective, preposition, adverb,
article, interjection, pronoun, conjunction, etc
This idea has been around for over 2000 years
(Dionysius Thrax of Alexandria, c. 100 B.C.)
Called parts-of-speech, lexical category, word
classes, morphological classes, lexical tags, POS
We'll use POS most frequently
I'll assume that you all know what these are

6
POS examples

N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine
DET  determiner   the, a, that, those

7
Tagsets
Brown corpus tagset (87 tags)
http://www.scs.leeds.ac.uk/amalgam/tagsets/brown.html
Penn Treebank tagset (45 tags)
http://www.cs.colorado.edu/martin/SLP/Figures/ (8.6)
C7 tagset (146 tags)
http://www.comp.lancs.ac.uk/ucrel/claws7tags.html
8
POS Tagging Definition

The process of assigning a part-of-speech or
lexical class marker to each word in a corpus

9
POS Tagging example

WORD tag
the DET
koala N
put V
the DET
keys N
on P
the DET
table N

10
POS tagging: Choosing a tagset

There are so many parts of speech and potential
distinctions we can draw
To do POS tagging, we need to choose a standard set
of tags to work with
Could pick very coarse tagsets
N, V, Adj, Adv.
A more commonly used set is finer grained: the
UPenn TreeBank tagset, 45 tags
PRP, WRB, WP, VBG
Even more fine-grained tagsets exist

11
Penn TreeBank POS Tag set
12
Using the UPenn tagset

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT
number/NN of/IN other/JJ topics/NNS ./.
Prepositions and subordinating conjunctions are
marked IN (although/IN I/PRP ...)
Except the preposition/complementizer "to", which is
just marked TO.

13
POS Tagging

Words often have more than one POS: back
The back door = JJ
On my back = NN
Win the voters back = RB
Promised to back the bill = VB
The POS tagging problem is to determine the POS
tag for a particular instance of a word.

14
How hard is POS tagging? Measuring ambiguity
15
Algorithms for POS Tagging

Ambiguity: in the Brown corpus, 11.5% of the
word types are ambiguous (using 87 tags)

Worse, 40% of the tokens are ambiguous.
16
Algorithms for POS Tagging

Why can't we just look them up in a dictionary?
Words that aren't in the dictionary

http://story.news.yahoo.com/news?tmplstorycid578ncid578e1u/nm/20030922/ts_nm/iraq_usa_dc

One idea: P(ti | wi) = the probability that a
random hapax legomenon in the corpus has tag ti.
Nouns are more likely than verbs, which are more
likely than pronouns.
Another idea: use morphology.

17
Algorithms for POS Tagging - Knowledge

Dictionary
Morphological rules, e.g.,
_____-tion
_____-ly
capitalization
N-gram frequencies
to _____
DET _____ N
But what about rare words, e.g., smelt (two verb
forms, to melt and the past tense of smell, and one
noun form, a small fish)
Combining these
V _____-ing: I was gracking vs. Gracking
is fun.
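
The morphological and capitalization cues listed above can be sketched as a tiny tag guesser for unseen words. This is only an illustration under stated assumptions: the function name `guess_tags`, the suffix table and the coarse tag names are hypothetical and do not come from any tagger discussed in these slides.

```python
# Hypothetical suffix cues mapped to coarse candidate tags (illustrative only).
SUFFIX_CUES = [
    ("tion", ["N"]),       # _____-tion is usually a noun
    ("ly", ["ADV"]),       # _____-ly is usually an adverb
    ("ing", ["V", "N"]),   # "I was gracking" vs. "Gracking is fun"
]


def guess_tags(word, sentence_initial=False):
    """Return candidate coarse tags for a word using simple surface cues."""
    # Capitalization away from the sentence start suggests a (proper) noun.
    if word[0].isupper() and not sentence_initial:
        return ["N"]
    lower = word.lower()
    for suffix, tags in SUFFIX_CUES:
        if lower.endswith(suffix):
            return tags
    # No cue applies: fall back to all open classes.
    return ["N", "V", "ADJ", "ADV"]
```

The "-ing" case stays ambiguous on purpose: deciding between verb and noun needs the n-gram context mentioned above, which is why the knowledge sources have to be combined.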

18
POS Tagging - Approaches

Approaches
Rule-based tagging
(ENGTWOL)
Stochastic (Probabilistic) tagging
HMM (Hidden Markov Model) tagging
Transformation-based tagging
Brill tagger
Do we return one best answer or several answers
and let later steps decide?
How does the requisite knowledge get entered?

19
3 methods for POS tagging

1. Rule-based tagging
Example: Karlsson (1995) EngCG tagger, based on
the Constraint Grammar architecture and the ENGTWOL
lexicon
Basic Idea
Assign all possible tags to words (morphological
analyzer used)
Remove wrong tags according to set of constraint
rules (typically more than 1000 hand-written
constraint rules, but may be machine-learned)

20
3 methods for POS tagging

2. Transformation-based tagging
Example: Brill (1995) tagger, a combination of
rule-based and stochastic (probabilistic) tagging
methodologies
Basic Idea
Start with a tagged corpus and a dictionary (with
most frequent tags)
Set the most probable tag for each word as a
start value
Change tags according to rules of the type "if word-1
is a determiner and word is a verb, then change
the tag to noun", applied in a specific order (like
rule-based taggers)
Machine learning is used: the rules are
automatically induced from a previously tagged
training corpus (like the stochastic approach)
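
The two steps described above (most frequent tag first, then ordered rewrite rules) can be sketched as follows. This is a minimal sketch, not the Brill tagger itself: the toy lexicon, the single hand-written rule and the function names `initial_tags` and `apply_rules` are assumptions for illustration, whereas a real transformation-based tagger learns its rules from a tagged corpus.

```python
def initial_tags(words, most_frequent_tag, default="NN"):
    """Step 1: give every word its most frequent tag (default for unknown words)."""
    return [most_frequent_tag.get(w.lower(), default) for w in words]


def apply_rules(tags, rules):
    """Step 2: apply (previous_tag, from_tag, to_tag) rules in the given order."""
    tags = list(tags)
    for prev_tag, from_tag, to_tag in rules:
        for i in range(1, len(tags)):
            if tags[i - 1] == prev_tag and tags[i] == from_tag:
                tags[i] = to_tag
    return tags


# Toy lexicon and one hand-written rule (both hypothetical).
lexicon = {"the": "DT", "race": "VB", "is": "VBZ", "long": "JJ"}
rules = [("DT", "VB", "NN")]  # determiner followed by verb: retag the verb as noun

words = ["The", "race", "is", "long"]
tags = apply_rules(initial_tags(words, lexicon), rules)
print(list(zip(words, tags)))
```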

21
3 methods for POS tagging

3. Stochastic (Probabilistic) tagging
Example: HMM (Hidden Markov Model) tagging, where a
training corpus is used to compute the probability
(frequency) of a given word having a given POS
tag in a given context

22
Hidden Markov Model (HMM) Tagging

Using an HMM to do POS tagging
HMM is a special case of Bayesian inference
It is also related to the noisy channel model
in ASR (Automatic Speech Recognition)

23
Hidden Markov Model (HMM) Taggers

Goal: maximize P(word | tag) x P(tag | previous n
tags)
P(word | tag)
word/lexical likelihood
probability that, given this tag, we have this
word
NOT the probability that this word has this tag
modeled through language model (word-tag matrix)
P(tag | previous n tags)
tag sequence likelihood
probability that this tag follows these previous
tags
modeled through language model (tag-tag matrix)

Lexical information
Syntagmatic information
24
POS tagging as a sequence classification task

We are given a sentence (an observation or
sequence of observations)
Secretariat is expected to race tomorrow
sequence of n words w1…wn.
What is the best sequence of tags which
corresponds to this sequence of observations?
Probabilistic/Bayesian view
Consider all possible sequences of tags
Out of this universe of sequences, choose the tag
sequence which is most probable given the
observation sequence of n words w1…wn.

25
Getting to HMM

Let T = t1, t2, …, tn
Let W = w1, w2, …, wn
Goal: out of all sequences of tags t1…tn, get the
most probable sequence of POS tags T
underlying the observed sequence of words
w1, w2, …, wn
The hat (^) means "our estimate of the best, i.e. the
most probable, tag sequence"
argmax_x f(x) means "the x such that f(x) is
maximized"
it maximizes our estimate of the best tag
sequence
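
The equation this slide refers to did not survive extraction. Under the definitions of T and W given above, it is presumably the standard formulation below (a reconstruction, not a copy of the original slide image):

```latex
\hat{T} \;=\; \operatorname*{argmax}_{T} \; P(T \mid W)
        \;=\; \operatorname*{argmax}_{t_1 \ldots t_n} \; P(t_1 \ldots t_n \mid w_1 \ldots w_n)
```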

26
Getting to HMM

This equation is guaranteed to give us the best
tag sequence
But how do we make it operational? How do we
compute this value?
Intuition of Bayesian classification
Use Bayes rule to transform it into a set of
other probabilities that are easier to compute
Thomas Bayes: British mathematician (1702-1761)

27
Bayes Rule
Breaks down any conditional probability P(x | y)
into three other probabilities
P(x | y): the conditional probability of an event x
assuming that y has occurred
28
Bayes Rule
We can drop the denominator: it does not change
for each tag sequence, since we are looking for the
best tag sequence for the same observation, i.e. for
the same fixed set of words
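
The formulas on the Bayes Rule slides were images and are missing from this transcript. The steps they describe can be written out as follows; this is the standard derivation, with P(W | T) as the likelihood and P(T) as the prior, and it is a reconstruction rather than the original slide content.

```latex
P(x \mid y) \;=\; \frac{P(y \mid x)\, P(x)}{P(y)}

\hat{T} \;=\; \operatorname*{argmax}_{T} \; P(T \mid W)
        \;=\; \operatorname*{argmax}_{T} \; \frac{P(W \mid T)\, P(T)}{P(W)}
        \;=\; \operatorname*{argmax}_{T} \; \underbrace{P(W \mid T)}_{\text{likelihood}} \; \underbrace{P(T)}_{\text{prior}}
```

The last step is valid because P(W) is the same for every candidate tag sequence T.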
29
Bayes Rule
30
Likelihood and prior
31
Likelihood and prior: Further Simplifications
1. The probability of a word appearing depends
only on its own POS tag, i.e., it is independent of
the other words around it
2. BIGRAM assumption: the probability of a
tag appearing depends only on the previous tag
3. The most probable tag sequence estimated by
the bigram tagger
32
Likelihood and prior: Further Simplifications
1. The probability of a word appearing depends
only on its own POS tag, i.e., it is independent of
the other words around it
33
Likelihood and prior: Further Simplifications
2. BIGRAM assumption: the probability of a
tag appearing depends only on the previous tag
Bigrams are groups of two written letters, two
syllables, or two words; they are a special case
of N-grams. Bigrams are used as the basis for
simple statistical analysis of text. The bigram
assumption is related to the first-order Markov
assumption.
34
Likelihood and prior: Further Simplifications
3. The most probable tag sequence estimated by
the bigram tagger (using the bigram assumption)
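
The three simplifications can be written out as follows. The original equations are missing from the transcript, so this is the standard bigram HMM tagger formulation, with t_0 assumed to be a sentence-start marker:

```latex
\begin{align*}
P(W \mid T) &\approx \prod_{i=1}^{n} P(w_i \mid t_i)
  && \text{(1. word depends only on its own tag)} \\
P(T) &\approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})
  && \text{(2. bigram assumption)} \\
\hat{T} &\approx \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})
  && \text{(3. bigram tagger)}
\end{align*}
```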
35
Two kinds of probabilities (1)

Tag transition probabilities P(ti | ti-1)
Determiners are likely to precede adjectives and nouns
That/DT flight/NN
The/DT yellow/JJ hat/NN
So we expect P(NN | DT) and P(JJ | DT) to be high
But what do we expect P(DT | JJ) to be?

36
Two kinds of probabilities (1)

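Tag transition probabilities P(ti | ti-1)
Compute P(NN | DT) by counting in a labeled corpus:
P(NN | DT) = C(DT, NN) / C(DT)
where C(DT, NN) is the # of times DT is followed by NN
and C(DT) is the # of times DT occurs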
37
Two kinds of probabilities (2)

Word likelihood probabilities P(wi | ti)
P(is | VBZ) = probability of VBZ (3sg Pres verb)
being "is"
Compute P(is | VBZ) by counting in a labeled corpus:
P(is | VBZ) = C(VBZ, is) / C(VBZ)
If we were expecting a third person singular
verb, how likely is it that this verb would be
"is"?
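
Both kinds of probabilities come from counting in a labeled corpus. The sketch below estimates them by relative frequency from a two-sentence toy corpus built from the examples above. The function name `estimate_hmm` and the start marker `<s>` are assumptions made for illustration, and no smoothing is applied, so querying an unseen conditioning tag raises a division-by-zero error.

```python
from collections import Counter


def estimate_hmm(tagged_sentences, start="<s>"):
    """Return (transition, emission) functions estimated by relative frequency."""
    tag_count = Counter()    # C(t), with the start marker counted once per sentence
    pair_count = Counter()   # C(t_prev, t)
    emit_count = Counter()   # C(t, w)
    for sentence in tagged_sentences:
        prev = start
        tag_count[start] += 1
        for word, tag in sentence:
            pair_count[(prev, tag)] += 1
            emit_count[(tag, word.lower())] += 1
            tag_count[tag] += 1
            prev = tag

    def transition(tag, prev):
        # P(tag | prev) = C(prev, tag) / C(prev)
        return pair_count[(prev, tag)] / tag_count[prev]

    def emission(word, tag):
        # P(word | tag) = C(tag, word) / C(tag)
        return emit_count[(tag, word.lower())] / tag_count[tag]

    return transition, emission


corpus = [
    [("That", "DT"), ("flight", "NN")],
    [("The", "DT"), ("yellow", "JJ"), ("hat", "NN")],
]
transition, emission = estimate_hmm(corpus)
print(transition("NN", "DT"), transition("DT", "JJ"), emission("flight", "NN"))
```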
38
An Example: the verb race

Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
tomorrow/NR
People/NNS continue/VB to/TO inquire/VB the/DT
reason/NN for/IN the/DT race/NN for/IN outer/JJ
space/NN
How do we pick the right tag?

39
Disambiguating race
40
Disambiguating race

P(NN | TO) = .00047
P(VB | TO) = .83
The tag transition probabilities P(NN | TO) and
P(VB | TO) answer the question "How likely are we
to expect a verb/noun given the previous tag TO?"
P(race | NN) = .00057
P(race | VB) = .00012
Lexical likelihoods from the Brown corpus for
"race" given a POS tag NN or VB.
P(NR | VB) = .0027
P(NR | NN) = .0012
Tag sequence probability for the likelihood of an
adverb occurring given the previous tag verb or
noun
P(VB | TO) P(NR | VB) P(race | VB) = .00000027
P(NN | TO) P(NR | NN) P(race | NN) = .00000000032
Multiply the lexical likelihoods with the tag
sequence probabilities: the verb wins
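
The comparison above is plain arithmetic and can be checked in a few lines. The numbers are the ones quoted on the slide; the dictionary layout and the helper name `score` are illustrative choices only.

```python
# Probabilities quoted on the slide (Brown corpus estimates).
transition = {
    ("TO", "VB"): 0.83, ("TO", "NN"): 0.00047,   # P(tag | TO)
    ("VB", "NR"): 0.0027, ("NN", "NR"): 0.0012,  # P(NR | tag)
}
emission = {("VB", "race"): 0.00012, ("NN", "race"): 0.00057}  # P(race | tag)


def score(tag):
    """P(tag | TO) * P(NR | tag) * P(race | tag) for a candidate tag of 'race'."""
    return transition[("TO", tag)] * transition[(tag, "NR")] * emission[(tag, "race")]


scores = {tag: score(tag) for tag in ("VB", "NN")}
best = max(scores, key=scores.get)
print(best, scores)
```

Although "race" is more likely as a noun in isolation, the transition from TO makes the verb reading win by roughly three orders of magnitude.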

41
Hidden Markov Models

What we've described with these two kinds of
probabilities is a Hidden Markov Model (HMM)
Let's just spend a bit of time tying this into
the model
In order to define an HMM, we will first introduce
the Markov Chain, or observable Markov Model.

42
Definitions

A weighted finite-state automaton adds
probabilities to the arcs
The probabilities on the arcs leaving any state must
sum to one
A Markov chain is a special case of a weighted
automaton in which the input sequence uniquely
determines which states the automaton will go through
Markov chains can't represent inherently
ambiguous problems
Useful for assigning probabilities to unambiguous
sequences

43
Markov chain = First-order observed Markov
Model

a set of states
Q = q1, q2, …, qN; the state at time t is qt
a set of transition probabilities
a set of probabilities A = a01, a02, …, an1, …, ann
Each aij represents the probability of
transitioning from state i to state j
The set of these is the transition probability
matrix A
Distinguished start and end states
Special initial probability vector π
πi: the probability that the MM will start in
state i; each πi expresses the probability
P(qi | START)

44
Markov chain = First-order observed Markov
Model

Markov Chain for weather: Example 1
three types of weather: sunny, rainy, foggy
we want to find the following conditional
probabilities:
P(qn | qn-1, qn-2, …, q1)
- I.e., the probability of the unknown weather
on day n, depending on the (known) weather of
the preceding days
- We could infer this probability from the
relative frequency (the statistics) of past
observations of weather sequences
Problem: the larger n is, the more observations
we must collect.
Suppose that n = 6; then we have to collect
statistics for 3^(6-1) = 243 past histories

45
Markov chain = First-order observed Markov
Model

Therefore, we make a simplifying assumption,
called the (first-order) Markov assumption:
for a sequence of observations q1, …, qn, the
current state only depends on the previous state
the joint probability of certain past and current
observations
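
The two formulas this slide refers to are missing from the transcript. In the notation used above they are presumably the following, where P(q_1 | q_0) is assumed to stand for the initial probability of q_1:

```latex
P(q_n \mid q_{n-1}, q_{n-2}, \ldots, q_1) \;\approx\; P(q_n \mid q_{n-1})

P(q_1, \ldots, q_n) \;=\; \prod_{i=1}^{n} P(q_i \mid q_{i-1})
```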

46
Markov chain = First-order observable Markov
Model

47
Markov chain = First-order observed Markov
Model

Given that today the weather is sunny, what's
the probability that tomorrow is sunny and the
day after is rainy?
Using the Markov assumption and the
probabilities in table 1, this translates into
P(q2 = sunny, q3 = rainy | q1 = sunny)
= P(q3 = rainy | q2 = sunny) x P(q2 = sunny | q1 = sunny)

48
The weather figure: specific example

Markov Chain for weather: Example 2

49
Markov chain for weather

What is the probability of 4 consecutive rainy
days?
Sequence is rainy-rainy-rainy-rainy
I.e., state sequence is 3-3-3-3
P(3,3,3,3)
= π3 a33 a33 a33 = 0.2 x (0.6)^3 = 0.0432
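
The same computation as a small function: the probability of a state sequence in a Markov chain is the initial probability of the first state times one transition probability per step. Only the two numbers used in the slide's arithmetic are filled in (0.2 as the initial probability of rainy, 0.6 as the rainy-to-rainy transition); the rest of the weather model is in a figure that is not part of this transcript, and the function name is an illustrative choice.

```python
def sequence_probability(states, initial, transition):
    """P(q1..qn) = initial[q1] * product of transition[(q_prev, q_cur)]."""
    prob = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        prob *= transition[(prev, cur)]
    return prob


# Only the values needed for this example, read off the slide's arithmetic.
initial = {"rainy": 0.2}
transition = {("rainy", "rainy"): 0.6}

p = sequence_probability(["rainy"] * 4, initial, transition)
print(round(p, 4))
```

Four days involve only three transitions, which is why the exponent is 3.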

50
Hidden Markov Model

For Markov chains, the output symbols are the
same as the states.
See sunny weather: we're in state sunny
But in part-of-speech tagging (and other things)
The output symbols are words
But the hidden states are part-of-speech tags
So we need an extension!
A Hidden Markov Model is an extension of a Markov
chain in which the output symbols are not the
same as the states.
This means we don't know which state we are in.

51
Markov chain for weather
52
Markov chain for words
Observed events: words. Hidden events: tags
53
Hidden Markov Models

States Q = q1, q2, …, qN
Observations O = o1, o2, …, oN
Each observation is a symbol from a vocabulary V
= {v1, v2, …, vV}
Transition probabilities (prior)
Transition probability matrix A = {aij}
Observation likelihoods (likelihood)
Output probability matrix B = {bi(ot)}
a set of observation likelihoods, each
expressing the probability of an observation ot
being generated from a state i (emission
probabilities)
Special initial probability vector π
πi: the probability that the HMM will start in
state i; each πi expresses the probability
P(qi | START)

54
Assumptions

Markov assumption: the probability of a
particular state depends only on the previous
state
Output-independence assumption: the probability
of an output observation depends only on the
state that produced that observation

55
HMM for Ice Cream

You are a climatologist in the year 2799
Studying global warming
You can't find any records of the weather in
Boston, MA for the summer of 2007
But you find Jason Eisner's diary
Which lists how many ice-creams Jason ate every
day that summer
Our job: figure out how hot it was

56
Noam task

Given
Ice Cream Observation Sequence: 1,2,3,2,2,2,3
(cp. with output symbols)
Produce
Weather Sequence: C,C,H,C,C,C,H
(cp. with hidden states, causing states)

57
HMM for ice cream
58
Different types of HMM structure
Ergodic = fully-connected
Bakis = left-to-right
59
HMM Taggers

Two kinds of probabilities
A: transition probabilities (PRIOR)
B: observation likelihoods (LIKELIHOOD)
HMM Taggers choose the tag sequence which
maximizes the product of word likelihood and tag
sequence probability

60
Weighted FSM corresponding to hidden states of
HMM, showing A probs
61
B observation likelihoods for POS HMM
62
The A matrix for the POS HMM
63
The B matrix for the POS HMM
64
HMM Taggers

The probabilities are trained on hand-labeled
training corpora (training set)
Combine different N-gram levels
Evaluated by comparing their output from a test
set to human labels for that test set (Gold
Standard)

65
The Viterbi Algorithm

What is the best tag sequence for "John likes to fish
in the sea"?
The Viterbi algorithm efficiently computes the most
likely state sequence given a particular output sequence
It is based on dynamic programming

66
A smaller example
[Figure: small HMM over the output symbols a and b]

What is the best sequence of states for the input
string bbba?
Computing all possible paths and finding the one
with the max probability is exponential

67
A smaller example (cont)

For each state, store the most likely sequence
that could lead to it (and its probability)
Path probability matrix
An array of states versus time (tags versus
words)
That stores the prob. of being at each state at
each time in terms of the prob. for being in each
state at the preceding time.

Path probability table (input sequence / time across; each cell shows best sequence so far --> next state, path probability, and its computation)

                              | ε --> b               | b --> b                     | bb --> b                         | bbb --> a
leading to q, coming from q   | ε --> q 0.6 (1.0x0.6) | q --> q 0.108 (0.6x0.3x0.6) | qq --> q 0.01944 (0.108x0.3x0.6) | qrq --> q 0.012096 (0.1008x0.3x0.4)
leading to q, coming from r   |                       | r --> q 0 (0x0.5x0.6)       | qr --> q 0.1008 (0.336x0.5x0.6)  | qrr --> q 0.02688 (0.1344x0.5x0.4)
leading to r, coming from q   | ε --> r 0 (0x0.8)     | q --> r 0.336 (0.6x0.7x0.8) | qq --> r 0.06048 (0.108x0.7x0.8) | qrq --> r 0.014112 (0.1008x0.7x0.2)
leading to r, coming from r   |                       | r --> r 0 (0x0.5x0.8)       | qr --> r 0.1344 (0.336x0.5x0.8)  | qrr --> r 0.01344 (0.1344x0.5x0.2)
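
The table can be reproduced with a short Viterbi implementation. The model parameters below are read off the factors in the table cells (start in q with probability 1.0; transitions q-->q 0.3, q-->r 0.7, r-->q 0.5, r-->r 0.5; emission of b is 0.6 in q and 0.8 in r, emission of a is 0.4 in q and 0.2 in r). The function and variable names are illustrative, and this is a sketch of the technique rather than the code behind the slide.

```python
def viterbi(observations, states, start, trans, emit):
    """Return (best_path, probability) for an HMM via dynamic programming."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (start[s] * emit[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Extend each previous best path by one transition into s; keep the max.
            prob, path = max(
                ((best[p][0] * trans[p][s] * emit[s][obs], best[p][1]) for p in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob, path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


# Parameters inferred from the table cells (see lead-in).
states = ["q", "r"]
start = {"q": 1.0, "r": 0.0}
trans = {"q": {"q": 0.3, "r": 0.7}, "r": {"q": 0.5, "r": 0.5}}
emit = {"q": {"a": 0.4, "b": 0.6}, "r": {"a": 0.2, "b": 0.8}}

path, prob = viterbi("bbba", states, start, trans, emit)
print(path, prob)
```

Only two numbers per time step are carried forward (one per state), which is what keeps the work linear in the input length instead of exponential.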
68
Viterbi intuition: we are looking for the best
path
[Figure: states S1 to S5]
Slide from Dekang Lin
69
The Viterbi Algorithm
70
Intuition

The value in each cell is computed by taking the
MAX over all paths that lead to this cell.
An extension of a path from state i at time t-1
is computed by multiplying:
Previous path probability from the previous cell,
viterbi[t-1, i]
Transition probability aij from previous state i
to current state j
Observation likelihood bj(ot) that current state
j matches observation symbol ot
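
Written as a single recurrence, with v_t(j) standing for the cell viterbi[t, j] and N for the number of states, the three factors above combine as:

```latex
v_t(j) \;=\; \max_{1 \le i \le N} \; v_{t-1}(i)\; a_{ij}\; b_j(o_t)
```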

71
Viterbi example
72
Smoothing of probabilities

Data sparseness is a problem when estimating
probabilities based on corpus data.
The add-one smoothing technique

C = absolute frequency, N = no. of training
instances, B = no. of different types

Linear interpolation methods can compensate for
data sparseness with higher order models. A
common method is interpolating trigrams, bigrams
and unigrams

The lambda values are automatically determined
using a variant of the Expectation Maximization
algorithm.
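
The two formulas on this slide were images and are missing here. With C, N and B as defined above, add-one smoothing and trigram/bigram/unigram interpolation are usually written as below. The tag variables t_1, t_2, t_3, the hat notation for unsmoothed estimates and the weights λ are assumptions of this reconstruction.

```latex
P_{\text{add-one}}(x) \;=\; \frac{C(x) + 1}{N + B}

P(t_3 \mid t_1, t_2) \;=\; \lambda_1\, \hat{P}(t_3) \;+\; \lambda_2\, \hat{P}(t_3 \mid t_2) \;+\; \lambda_3\, \hat{P}(t_3 \mid t_1, t_2),
\qquad \lambda_1 + \lambda_2 + \lambda_3 = 1
```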

73
Viterbi for POS tagging

Let
n = number of words in the sentence to tag (number of
input tokens)
T = number of tags in the tag set (number of states)
vit = path probability matrix (viterbi)
vit[i, j] = probability of being at state
(tag) j at word i
state = matrix to recover the nodes of the best
path (best tag sequence)
state[i+1, j] = the state (tag) of the incoming
arc that led to this most probable state j at
word i+1
// Initialization
vit[1, PERIOD] = 1.0  // pretend that there is a period before
                      // our sentence (start tag = PERIOD)
vit[1, t] = 0.0 for t ≠ PERIOD

74
Viterbi for POS tagging (cont)

// Induction (build the path probability matrix)
for i = 1 to n step 1 do   // for all words in the sentence
  for all tags tj do       // for all possible tags
    // store the max prob of the path
    vit[i+1, tj] = max_{1≤k≤T} (vit[i, tk] x P(w[i+1] | tj) x P(tj | tk))
    // store the actual state (path is the back-pointer matrix)
    path[i+1, tj] = argmax_{1≤k≤T} (vit[i, tk] x P(w[i+1] | tj) x P(tj | tk))
  end
end
// Termination and path-readout
bestState[n+1] = argmax_{1≤j≤T} vit[n+1, j]
for j = n to 1 step -1 do  // for all the words in the sentence
  bestState[j] = path[j+1, bestState[j+1]]
end
P(bestState[1], …, bestState[n]) = max_{1≤j≤T} vit[n+1, j]

In the induction step:
P(w[i+1] | tj): emission probability
P(tj | tk): state transition probability
vit[i, tk]: probability of best path leading to state tk at
word i
75
Possible improvements

in bigram POS tagging, we condition a tag only on
the preceding tag
why not...
use more context (ex. use trigram model)
more precise
is clearly marked --> verb, past participle
he clearly marked --> verb, past tense
combine trigram, bigram, unigram models
condition on words too
but with an n-gram approach, this is too costly
(too many parameters to model)

76
Further issues with Markov Model tagging

Unknown words are a problem since we don't have
the required probabilities. Possible solutions:
Assign the word probabilities based on
corpus-wide distribution of POS
Use morphological cues (capitalization, suffix)
to assign a more calculated guess.
Using higher order Markov models
Using a trigram model captures more context
However, data sparseness is much more of a
problem.

77
TnT

Efficient statistical POS tagger developed by
Thorsten Brants, ANLP-2000
Underlying model
Trigram modelling
The probability of a POS only depends on its two
preceding POS
The probability of a word appearing at a
particular position given that its POS occurs at
that position is independent of everything else.

78
Training

Maximum likelihood estimates

Smoothing: context-independent variant of linear
interpolation.
79
Smoothing algorithm

Set λi = 0
For each trigram t1 t2 t3 with f(t1,t2,t3) > 0
Depending on the max of the following three
values:
Case (f(t1,t2,t3) - 1) / (f(t1,t2) - 1): incr λ3 by
f(t1,t2,t3)
Case (f(t2,t3) - 1) / (f(t2) - 1): incr λ2 by
f(t1,t2,t3)
Case (f(t3) - 1) / (N - 1): incr λ1 by
f(t1,t2,t3)
Normalize λi
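
The algorithm above can be sketched in Python as follows. This is an illustrative implementation, not TnT's code: it assumes the "minus one" in every denominator as in Brants (2000), treats a zero denominator as a value of 0, and breaks ties in favour of the lower-order model, which is an arbitrary choice. The function name and the toy tag sequences are hypothetical.

```python
from collections import Counter


def deleted_interpolation(tag_sequences):
    """Estimate [lambda1, lambda2, lambda3] from tag sequences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tags in tag_sequences:
        uni.update(tags)
        bi.update(zip(tags, tags[1:]))
        tri.update(zip(tags, tags[1:], tags[2:]))
    n = sum(uni.values())  # N: total number of tag tokens

    def ratio(num, den):
        # A zero denominator yields 0 by convention.
        return num / den if den > 0 else 0.0

    lambdas = [0.0, 0.0, 0.0]
    for (t1, t2, t3), f in tri.items():
        cases = [
            ratio(uni[t3] - 1, n - 1),             # unigram evidence -> lambda1
            ratio(bi[(t2, t3)] - 1, uni[t2] - 1),  # bigram evidence  -> lambda2
            ratio(f - 1, bi[(t1, t2)] - 1),        # trigram evidence -> lambda3
        ]
        # index() returns the first maximum, so ties go to the lower order.
        lambdas[cases.index(max(cases))] += f
    total = sum(lambdas)
    return [value / total for value in lambdas] if total else lambdas


toy = [["DT", "NN", "VBZ"], ["DT", "NN", "VBZ"], ["DT", "JJ", "NN"]]
lambdas = deleted_interpolation(toy)
print(lambdas)
```

Subtracting 1 removes the trigram's own occurrence from the counts, so each weight is credited by how well the remaining data would have predicted that trigram.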

80
Evaluation of POS taggers

compared with gold-standard of human performance
metric
accuracy = % of tags that are identical to gold
standard
most taggers: 96-97% accuracy
must compare accuracy to
ceiling (best possible results)
how do human annotators score compared to each
other? (96-97%)
so systems are not bad at all!
baseline (worst possible results)
what if we take the most-likely tag (unigram
model) regardless of previous tags? (90-91%)
so anything less is really bad

81
More on tagger accuracy

is 95% good?
that's 5 mistakes every 100 words
if on average, a sentence is 20 words, that's 1
mistake per sentence
when comparing tagger accuracy, beware of
size of training corpus
the bigger, the better the results
difference between training & testing corpora
(genre, domain)
the closer, the better the results
size of tag set
Prediction versus classification
unknown words
the more unknown words (not in dictionary), the
worse the results

82
Error Analysis

Look at a confusion matrix (contingency table)
E.g. 4.4% of the total errors caused by
mistagging VBD as VBN
See what errors are causing problems
Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
Adverb (RB) vs Particle (RP) vs Prep (IN)
Preterite (VBD) vs Participle (VBN) vs Adjective
(JJ)
ERROR ANALYSIS IS ESSENTIAL!!!

83
Tag indeterminacy
84
Major difficulties in POS tagging

Unknown words (proper names)
because we do not know the set of tags it can
take
and knowing this takes you a long way (cf.
baseline POS tagger)
possible solutions
assign all possible tags with a probability
distribution identical to the lexicon as a whole
use morphological cues to infer possible tags
ex. words ending in -ed are likely to be past
tense verbs or past participles
Frequently confused tag pairs
preposition vs particle
<running> <up> a hill (prep) / <running up> a
bill (particle)
verb, past tense vs. past participle vs.
adjective

85
Unknown Words

Most-frequent-tag approach.
What about words that don't appear in the
training set?
Suffix analysis
The probability distribution for a particular
suffix is generated from all words in the
training set that share the same suffix.
Suffix estimation: calculate the probability of
a tag t given the last i letters of an n-letter
word.
Smoothing: successive abstraction through
sequences of increasingly more general contexts
(i.e., omit more and more characters of the
suffix)
Use a morphological analyzer to get the
restriction on the possible tags.

86
Unknown words
87
Alternative graphical models for part of speech
tagging
88
Different Models for POS tagging

HMM
Maximum Entropy Markov Models
Conditional Random Fields

89
Hidden Markov Model (HMM): Generative Modeling
Source Model P(Y)
Noisy Channel P(X | Y)
y
x
90
Dependency (1st order)
91
Disadvantage of HMMs (1)

No Rich Feature Information
Rich information is required
When xk is complex
When data of xk is sparse
Example: POS Tagging
How to evaluate P(wk | tk) for unknown words wk?
Useful features
Suffix, e.g., -ed, -tion, -ing, etc.
Capitalization
Generative Model
Parameter estimation: maximize the joint
likelihood of training examples

92
Generative Models

Hidden Markov models (HMMs) and stochastic
grammars
Assign a joint probability to paired observation
and label sequences
The parameters are typically trained to maximize the
joint likelihood of training examples

93
Generative Models (cont'd)

Difficulties and disadvantages
Need to enumerate all possible observation
sequences
Not practical to represent multiple interacting
features or long-range dependencies of the
observations
Very strict independence assumptions on the
observations

Better Approach
Discriminative model which models P(y | x) directly
Maximize the conditional likelihood of training
examples

95
Maximum Entropy modeling

N-gram model: probabilities depend on the
previous few tokens.
We may identify a more heterogeneous set of
features which contribute in some way to the
choice of the current word (whether it is the
first word in a story, whether the next word is
"to", whether one of the last 5 words is a
preposition, etc.)
Maxent combines these features in a probabilistic
model.
The given features provide a constraint on the
model.
We would like to have a probability distribution
which, outside of these constraints, is as
uniform as possible, i.e. one that has the maximum
entropy among all models that satisfy these constraints.

96
Maximum Entropy Markov Model

Discriminative Sub Models
Unify two parameters in generative model into one
conditional model
Two parameters in generative model,
parameter in source model
and parameter in noisy channel
Unified conditional model
Employ maximum entropy principle


97
General Maximum Entropy Principle

Model
Model distribution P(Y | X) with a set of features
f1, f2, …, fl defined on X and Y
Idea
Collect information of features from training
data
Principle
Model what is known
Assume nothing else
→ Flattest distribution
→ Distribution with the maximum entropy

98
Example

(Berger et al., 1996) example
Model the translation of the word "in" from English to
French
Need to model P(word_French)
Constraints
1. Possible translations: dans, en, à, au cours
de, pendant
2. dans or en used 30% of the time
3. dans or à used 50% of the time

99
Features

Features
0-1 indicator functions
1 if (x, y) satisfies a predefined condition
0 if not
Example POS Tagging

100
Constraints

Empirical Information
Statistics from training data T

Expected Value
From the distribution P(Y | X) we want to model

Constraints

101
Maximum Entropy Objective

Entropy

Maximization Problem
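
The formulas for the constraints and the objective are missing from this transcript. In the conditional maximum entropy framework of Berger et al. (1996) they are usually written as below, using the training data T and the features f_1 … f_l named on the earlier slides. The empirical distribution \tilde{p}(x) and the exact notation are assumptions of this reconstruction.

```latex
\begin{align*}
\tilde{E}[f_i] &= \frac{1}{|T|} \sum_{(x, y) \in T} f_i(x, y)
  && \text{(empirical information)} \\
E_P[f_i] &= \sum_{x} \tilde{p}(x) \sum_{y} P(y \mid x)\, f_i(x, y)
  && \text{(expected value under the model)} \\
E_P[f_i] &= \tilde{E}[f_i], \quad i = 1, \ldots, l
  && \text{(constraints)} \\
H(P) &= -\sum_{x} \tilde{p}(x) \sum_{y} P(y \mid x) \log P(y \mid x)
  && \text{(entropy)} \\
P^{*} &= \operatorname*{argmax}_{P \text{ satisfying the constraints}} H(P)
  && \text{(maximization problem)}
\end{align*}
```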

102
Dual Problem

Dual Problem
Conditional model
Maximum likelihood of conditional data

Solution
Improved iterative scaling (IIS) (Berger et al.
1996)
Generalized iterative scaling (GIS) (McCallum et
al. 2000)

103
Maximum Entropy Markov Model

Use Maximum Entropy Approach to Model
1st order

Features
Basic features (like parameters in HMM)
Bigram (1st order) or trigram (2nd order) in
source model
State-output pair feature (Xk = xk, Yk = yk)
Advantage: incorporate other advanced features on
(xk, yk)

104
HMM vs MEMM (1st order)
Maximum Entropy Markov Model (MEMM)
HMM
105
Performance in POS Tagging

POS Tagging
Data set: WSJ
Features
HMM features, spelling features (like -ed, -tion,
-s, -ing, etc.)
Results (Lafferty et al. 2001)
1st order HMM
94.31% accuracy, 54.01% OOV accuracy
1st order MEMM
95.19% accuracy, 73.01% OOV accuracy

106
ME applications

Part of Speech (POS) Tagging (Ratnaparkhi, 1996)
P(POS tag | context)
Information sources
Word window (4)
Word features (prefix, suffix, capitalization)
Previous POS tags

107
ME applications

Abbreviation expansion (Pakhomov, 2002)
Information sources
Word window (4)
Document title
Word Sense Disambiguation (WSD) (Chao & Dyer,
2002)
Information sources
Word window (4)
Structurally related words (4)
Sentence Boundary Detection (Reynar &
Ratnaparkhi, 1997)
Information sources
Token features (prefix, suffix, capitalization,
abbreviation)
Word window (2)

108
Solution

Global Optimization
Optimize parameters in a global model
simultaneously, not in sub models separately
Alternatives
Conditional random fields
Application of perceptron algorithm

109
Why ME?

Advantages
Combine multiple knowledge sources
Local
Word prefix, suffix, capitalization (POS -
(Ratnaparkhi, 1996))
Word POS, POS class, suffix (WSD - (Chao & Dyer,
2002))
Token prefix, suffix, capitalization,
abbreviation (Sentence Boundary - (Reynar &
Ratnaparkhi, 1997))
Global
N-grams (Rosenfeld, 1997)
Word window
Document title (Pakhomov, 2002)
Structurally related words (Chao & Dyer, 2002)
Sentence length, conventional lexicon (Och & Ney,
2002)
Combine dependent knowledge sources

110
Why ME?

Advantages
Add additional knowledge sources
Implicit smoothing
Disadvantages
Computational
Expected value at each iteration
Normalizing constant
Overfitting
Feature selection
Cutoffs
Basic Feature Selection (Berger et al., 1996)

111
Conditional Models

Conditional probability P(label sequence y |
observation sequence x) rather than joint
probability P(y, x)
Specify the probability of possible label
sequences given an observation sequence
Allow arbitrary, non-independent features on the
observation sequence X
The probability of a transition between labels
may depend on past and future observations
Relax strong independence assumptions in
generative models

112
Discriminative Models: Maximum Entropy Markov
Models (MEMMs)

Exponential model
Given training set X with label sequence Y
Train a model θ that maximizes P(Y | X, θ)
For a new data sequence x, the predicted label y
maximizes P(y | x, θ)
Notice the per-state normalization

113
MEMMs (cont'd)

MEMMs have all the advantages of Conditional
Models
Per-state normalization: all the mass that
arrives at a state must be distributed among the
possible successor states (conservation of score
mass)
Subject to Label Bias Problem
Bias toward states with fewer outgoing transitions

114
Label Bias Problem

Consider this MEMM

P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro)
= P(2 | 1 and o) P(1 | r)
P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri)
= P(2 | 1 and i) P(1 | r)
Since P(2 | 1 and x) = 1 for all x, P(1 and 2 |
ro) = P(1 and 2 | ri)
In the training data, label value 2 is the only
label value observed after label value 1
Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for
all x
However, we expect P(1 and 2 | ri) to be
greater than P(1 and 2 | ro).
Per-state normalization does not allow the
required expectation

115
Solve the Label Bias Problem

Change the state-transition structure of the
model
Not always practical to change the set of states
Start with a fully-connected model and let the
training procedure figure out a good structure
Precludes the use of prior knowledge, which is very
valuable (e.g. in information extraction)

116
Random Field
117
Conditional Random Fields (CRFs)

CRFs have all the advantages of MEMMs without
label bias problem
MEMM uses per-state exponential model for the
conditional probabilities of next states given
the current state
CRF has a single exponential model for the joint
probability of the entire sequence of labels
given the observation sequence
Undirected acyclic graph
Allows some transitions to vote more strongly than
others, depending on the corresponding observations

118
Definition of CRFs
X is a random variable over data sequences to be
labeled
Y is a random variable over corresponding
label sequences
119
Example of CRFs
120
Graphical comparison among HMMs, MEMMs and CRFs
HMM MEMM CRF
121
Conditional Distribution
122
Conditional Distribution (cont'd)

CRFs use the observation-dependent
normalization Z(x) for the conditional
distributions

Z(x) is a normalization over the data sequence x
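
The formula for the conditional distribution is missing from this transcript. A commonly used linear-chain form is shown below as a reference point for Z(x). It assumes feature functions f_k with weights λ_k over adjacent labels; the exact split into edge and vertex features on the original slide cannot be recovered from the text.

```latex
P(y \mid x) \;=\; \frac{1}{Z(x)} \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y_{i-1}, y_i, x, i) \Big),
\qquad
Z(x) \;=\; \sum_{y'} \exp\!\Big( \sum_{i=1}^{n} \sum_{k} \lambda_k\, f_k(y'_{i-1}, y'_i, x, i) \Big)
```

Because Z(x) sums over whole label sequences rather than over the successors of a single state, probability mass is not forced to be conserved state by state.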
123
Parameter Estimation for CRFs

The paper provided iterative scaling algorithms
It turns out to be very inefficient
Prof. Dietterich's group applied a gradient
descent algorithm, which is quite efficient

124
Training of CRFs (From Prof. Dietterich)

Then, take the derivative of the above equation

For training, the first 2 items are easy to get.
For example, for each lk, fk is a sequence of
Boolean numbers, such as 00101110100111.
is just the total number of 1s in the
sequence.

The hardest thing is how to calculate Z(x)

125
Training of CRFs (From Prof. Dietterich) (cont'd)

Maximal cliques

126
POS tagging Experiments
127
POS tagging Experiments (cont'd)

Compared HMMs, MEMMs, and CRFs on Penn treebank
POS tagging
Each word in a given input sentence must be
labeled with one of 45 syntactic tags
Add a small set of orthographic features: whether
a spelling begins with a number or upper case
letter, whether it contains a hyphen, and whether it
contains one of the following suffixes: -ing,
-ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
oov = out-of-vocabulary (not observed in the
training set)
