Title: Parts of Speech

Parts of Speech
  • Sudeshna Sarkar
  • 7 Aug 2008

Why Do We Care about Parts of Speech?
  • Pronunciation
  • Hand me the lead pipe.
  • Predicting what words can be expected next
  • Personal pronoun (e.g., I, she) ____________
  • Stemming
  • -s means singular for verbs, plural for nouns
  • As the basis for syntactic parsing and then
    meaning extraction
  • I will lead the group into the lead smelter.
  • Machine translation
  • (E) content N → (F) contenu N
  • (E) content Adj → (F) content Adj or satisfait Adj

What is a Part of Speech?
Is this a semantic distinction? For example, maybe Noun is the class of words for people, places and things, and Adjective is the class of words for properties of nouns. Consider "green book": book is a Noun, green is an Adjective. Now consider "book worm" and "This green is very soothing."
How Many Parts of Speech Are There?
  • A first cut at the easy distinctions
  • Open classes
  • nouns, verbs, adjectives, adverbs
  • Closed classes (function words)
  • conjunctions: and, or, but
  • pronouns: I, she, him
  • prepositions: with, on
  • determiners: the, a, an

Part of speech tagging
  • 8 (ish) traditional parts of speech
  • Noun, verb, adjective, preposition, adverb,
    article, interjection, pronoun, conjunction, etc
  • This idea has been around for over 2000 years
    (Dionysius Thrax of Alexandria, c. 100 B.C.)
  • Called parts-of-speech, lexical category, word
    classes, morphological classes, lexical tags, POS
  • We'll use POS most frequently
  • I'll assume that you all know what these are

POS examples
  • N (noun): chair, bandwidth, pacing
  • V (verb): study, debate, munch
  • ADJ (adjective): purple, tall, ridiculous
  • ADV (adverb): unfortunately, slowly
  • P (preposition): of, by, to
  • PRO (pronoun): I, me, mine
  • DET (determiner): the, a, that, those

Brown corpus tagset (87 tags)
Penn Treebank tagset (45 tags)
C7 tagset (146 tags), http://www.comp.lancs
POS Tagging Definition
  • The process of assigning a part-of-speech or
    lexical class marker to each word in a corpus

POS Tagging example
  • WORD tag
  • the DET
  • koala N
  • put V
  • the DET
  • keys N
  • on P
  • the DET
  • table N

POS tagging Choosing a tagset
  • There are so many parts of speech, potential
    distinctions we can draw
  • To do POS tagging, need to choose a standard set
    of tags to work with
  • Could pick a very coarse tagset
  • N, V, Adj, Adv.
  • A more commonly used set is finer grained: the UPenn TreeBank tagset, with 45 tags
  • Even more fine-grained tagsets exist

Penn TreeBank POS Tag set
Using the UPenn tagset
  • The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
  • Prepositions and subordinating conjunctions are marked IN (although/IN I/PRP..)
  • Except that the preposition/complementizer "to" is just marked TO.

POS Tagging
  • Words often have more than one POS: "back"
  • The back door: JJ
  • On my back: NN
  • Win the voters back: RB
  • Promised to back the bill: VB
  • The POS tagging problem is to determine the POS
    tag for a particular instance of a word.

How hard is POS tagging? Measuring ambiguity
Algorithms for POS Tagging
  • Ambiguity: In the Brown corpus, 11.5% of the word types are ambiguous (using 87 tags)

Worse, 40% of the tokens are ambiguous.
Algorithms for POS Tagging
  • Why can't we just look them up in a dictionary?
  • Words that aren't in the dictionary

  • One idea: P(ti | wi) = the probability that a random hapax legomenon in the corpus has tag ti.
  • Nouns are more likely than verbs, which are more
    likely than pronouns.
  • Another idea use morphology.

Algorithms for POS Tagging - Knowledge
  • Dictionary
  • Morphological rules, e.g.,
  • _____-tion
  • _____-ly
  • capitalization
  • N-gram frequencies
  • to _____
  • DET _____ N
  • But what about rare words, e.g., smelt (two verb forms, melt and past tense of smell, and one noun form, a small fish)
  • Combining these
  • V _____-ing: "I was gracking" vs. "Gracking is fun."
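The knowledge sources listed above (morphological rules, capitalization, n-gram cues) can be combined in a small rule-based guesser. A minimal sketch; the specific rules, tag names, and the `guess_tags` helper are illustrative assumptions, not from the slides:

```python
import re

# Toy rule-based tag guesser for unknown words, combining suffix,
# capitalization, and local-context cues. Rules are invented examples.
SUFFIX_RULES = [
    (re.compile(r".+tion$"), "N"),    # _____-tion suggests a noun
    (re.compile(r".+ly$"), "ADV"),    # _____-ly suggests an adverb
    (re.compile(r".+ing$"), "V"),     # _____-ing suggests a verb form
]

def guess_tags(word, prev_word=None):
    """Return candidate tags for an unknown word, most plausible first."""
    guesses = []
    if word[0].isupper():             # capitalization: likely a proper noun
        guesses.append("NP")
    for pattern, tag in SUFFIX_RULES:
        if pattern.match(word.lower()):
            guesses.append(tag)
    if prev_word is not None and prev_word.lower() == "to":
        guesses.append("V")           # n-gram cue: "to _____" favors a verb
    return guesses or ["N"]           # default: nouns are the most likely class

print(guess_tags("gracking", prev_word="was"))
```

A real tagger would weight such cues probabilistically rather than apply them as hard rules, but the sketch shows how several weak knowledge sources combine.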

POS Tagging - Approaches
  • Approaches
  • Rule-based tagging
  • Stochastic (Probabilistic) tagging
  • HMM (Hidden Markov Model) tagging
  • Transformation-based tagging
  • Brill tagger
  • Do we return one best answer or several answers
    and let later steps decide?
  • How does the requisite knowledge get entered?

3 methods for POS tagging
  • 1. Rule-based tagging
  • Example Karlsson (1995) EngCG tagger based on
    the Constraint Grammar architecture and ENGTWOL
  • Basic Idea
  • Assign all possible tags to words (morphological
    analyzer used)
  • Remove wrong tags according to set of constraint
    rules (typically more than 1000 hand-written
    constraint rules, but may be machine-learned)

3 methods for POS tagging
  • 2. Transformation-based tagging
  • Example Brill (1995) tagger - combination of
    rule-based and stochastic (probabilistic) tagging
  • Basic Idea
  • Start with a tagged corpus + dictionary (with most frequent tags)
  • Set the most probable tag for each word as a start value
  • Change tags according to rules of the type "if word-1 is a determiner and word is a verb then change the tag to noun", applied in a specific order (like rule-based taggers)
  • Machine learning is used: the rules are automatically induced from a previously tagged training corpus (like the stochastic approach)

3 methods for POS tagging
  • 3. Stochastic (Probabilistic) tagging
  • Example HMM (Hidden Markov Model) tagging - a
    training corpus used to compute the probability
    (frequency) of a given word having a given POS
    tag in a given context

Hidden Markov Model (HMM) Tagging
  • Using an HMM to do POS tagging
  • HMM is a special case of Bayesian inference
  • It is also related to the noisy channel model
    in ASR (Automatic Speech Recognition)

Hidden Markov Model (HMM) Taggers
  • Goal: maximize P(word|tag) × P(tag|previous n tags)
  • P(word|tag)
  • word/lexical likelihood
  • probability that, given this tag, we have this word
  • NOT the probability that this word has this tag
  • modeled through a language model (word-tag matrix)
  • P(tag|previous n tags)
  • tag sequence likelihood
  • probability that this tag follows these previous tags
  • modeled through a language model (tag-tag matrix)

Lexical information
Syntagmatic information
POS tagging as a sequence classification task
  • We are given a sentence (an observation or
    sequence of observations)
  • Secretariat is expected to race tomorrow
  • sequence of n words w1…wn.
  • What is the best sequence of tags which
    corresponds to this sequence of observations?
  • Probabilistic/Bayesian view
  • Consider all possible sequences of tags
  • Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn.
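The probabilistic view above can be sketched by brute force: enumerate every possible tag sequence, score each one, and keep the most probable. The tiny transition and emission tables below are invented for illustration, and this enumeration is exponential in sentence length (which is exactly why Viterbi is needed later):

```python
from itertools import product

# Brute-force argmax over all tag sequences for a toy bigram HMM.
# All probability values here are made-up illustration numbers.
TAGS = ["DET", "N", "V"]
# p(tag | previous tag), with "<s>" as the start symbol
TRANS = {("<s>", "DET"): 0.8, ("<s>", "N"): 0.2,
         ("DET", "N"): 0.9, ("DET", "V"): 0.1,
         ("N", "V"): 0.8, ("N", "N"): 0.2}
# p(word | tag)
EMIT = {("the", "DET"): 0.7, ("dog", "N"): 0.4,
        ("barks", "V"): 0.3, ("barks", "N"): 0.01}

def score(words, tags):
    """P(W, T) under the bigram model: product of transition x emission."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= TRANS.get((prev, t), 0.0) * EMIT.get((w, t), 0.0)
        prev = t
    return p

words = ["the", "dog", "barks"]
best = max(product(TAGS, repeat=len(words)), key=lambda ts: score(words, ts))
print(best)
```

With T tags and n words there are T^n candidate sequences, so this only works for toy inputs.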

Getting to HMM
  • Let T = t1, t2, …, tn
  • Let W = w1, w2, …, wn
  • Goal: out of all sequences of tags t1…tn, get the single most probable sequence of POS tags T underlying the observed sequence of words W
  • The hat (^) means "our estimate of the best", i.e. the most probable tag sequence
  • argmaxx f(x) means "the x such that f(x) is maximized"; it maximizes our estimate of the best tag sequence

Getting to HMM
  • This equation is guaranteed to give us the best
    tag sequence
  • But how do we make it operational? How do we
    compute this value?
  • Intuition of Bayesian classification
  • Use Bayes rule to transform it into a set of
    other probabilities that are easier to compute
  • Thomas Bayes British mathematician (1702-1761)

Bayes Rule
Breaks down any conditional probability P(x|y) into three other probabilities: P(x|y) = P(y|x) P(x) / P(y). Here P(x|y) is the conditional probability of an event x given that y has occurred.
Bayes Rule
We can drop the denominator: it does not change for each tag sequence, since we are looking for the best tag sequence for the same observation, the same fixed set of words.
Bayes Rule
Likelihood and prior
Likelihood and prior Further Simplifications
1. the probability of a word appearing depends
only on its own POS tag, i.e, independent of
other words around it
2. BIGRAM assumption the probability of a
tag appearing depends only on the previous tag
3. The most probable tag sequence estimated by
the bigram tagger
Likelihood and prior: Further Simplifications
Bigrams are groups of two written letters, two syllables, or two words; they are a special case of the N-gram. Bigrams are used as the basis for simple statistical analysis of text. The bigram assumption corresponds to a first-order Markov model, and under it the most probable tag sequence is estimated by the bigram tagger.
Two kinds of probabilities (1)
  • Tag transition probabilities p(ti|ti-1)
  • Determiners likely to precede adjectives and nouns
  • That/DT flight/NN
  • The/DT yellow/JJ hat/NN
  • So we expect P(NN|DT) and P(JJ|DT) to be high
  • But what do we expect P(DT|JJ) to be?

Two kinds of probabilities (1)
  • Tag transition probabilities p(ti|ti-1)
  • Compute P(NN|DT) by counting in a labeled corpus:

P(NN|DT) = (# of times DT is followed by NN) / (# of times DT occurs)
Two kinds of probabilities (2)
  • Word likelihood probabilities p(wi|ti)
  • P(is|VBZ): probability of VBZ (3sg Pres verb) being "is"
  • Compute P(is|VBZ) by counting in a labeled corpus:

P(is|VBZ) = (# of VBZ tokens that are "is") / (# of VBZ tokens)

If we were expecting a third person singular verb, how likely is it that this verb would be "is"?
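Both kinds of probabilities are maximum-likelihood estimates obtained by counting in a labeled corpus. A minimal sketch; the two-sentence corpus is invented for illustration:

```python
from collections import Counter

# Estimate tag transition and word likelihood probabilities by counting
# in a tiny hand-labeled corpus (invented example data).
corpus = [[("the", "DT"), ("koala", "NN"), ("is", "VBZ"), ("small", "JJ")],
          [("the", "DT"), ("keys", "NNS"), ("are", "VBP"),
           ("on", "IN"), ("the", "DT"), ("table", "NN")]]

tag_counts, pair_counts, word_tag_counts = Counter(), Counter(), Counter()
for sentence in corpus:
    tags = [t for _, t in sentence]
    tag_counts.update(tags)                 # count(tag)
    pair_counts.update(zip(tags, tags[1:])) # count(tag followed by tag)
    word_tag_counts.update(sentence)        # count(word tagged as tag)

def p_transition(t2, t1):
    """P(t2 | t1) = count(t1 followed by t2) / count(t1)."""
    return pair_counts[(t1, t2)] / tag_counts[t1]

def p_emission(word, tag):
    """P(word | tag) = count(word tagged as tag) / count(tag)."""
    return word_tag_counts[(word, tag)] / tag_counts[tag]

print(p_transition("NN", "DT"))   # how often DT is followed by NN
print(p_emission("is", "VBZ"))
```

In this toy corpus DT occurs three times and is followed by NN twice, so P(NN|DT) = 2/3.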
An Example the verb race
  • Secretariat/NNP is/VBZ expected/VBN to/TO race/VB
  • People/NNS continue/VB to/TO inquire/VB the/DT
    reason/NN for/IN the/DT race/NN for/IN outer/JJ
  • How do we pick the right tag?

Disambiguating race
Disambiguating race
  • P(NN|TO) = .00047
  • P(VB|TO) = .83
  • The tag transition probabilities P(NN|TO) and P(VB|TO) answer the question: how likely are we to expect a verb/noun given the previous tag TO?
  • P(race|NN) = .00057
  • P(race|VB) = .00012
  • Lexical likelihoods from the Brown corpus for "race", given a POS tag NN or VB.
  • P(NR|VB) = .0027
  • P(NR|NN) = .0012
  • Tag sequence probabilities for the likelihood of an adverb occurring given the previous tag verb or noun.
  • P(VB|TO) × P(NR|VB) × P(race|VB) = .00000027
  • P(NN|TO) × P(NR|NN) × P(race|NN) = .00000000032
  • Multiplying the lexical likelihoods with the tag sequence probabilities, the verb wins.
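The arithmetic behind this comparison is just two three-way products, using the probabilities quoted above from the slide:

```python
# Disambiguating "race" after TO: multiply tag transition probability,
# following-adverb probability, and lexical likelihood for each candidate.
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)

print(f"VB: {p_vb:.8g}  NN: {p_nn:.8g}")
print("race is tagged", "VB" if p_vb > p_nn else "NN")
```

The verb reading wins by roughly three orders of magnitude, driven mostly by P(VB|TO) being so much larger than P(NN|TO).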

Hidden Markov Models
  • What we've described with these two kinds of probabilities is a Hidden Markov Model (HMM)
  • Let's just spend a bit of time tying this into the model
  • In order to define an HMM, we will first introduce the Markov Chain, or observable Markov Model.

  • A weighted finite-state automaton adds probabilities to the arcs
  • The probabilities leaving any state must sum to one
  • A Markov chain is a special case of a WFST in which the input sequence uniquely determines which states the automaton will go through
  • Markov chains can't represent inherently ambiguous problems
  • Useful for assigning probabilities to unambiguous sequences

Markov chain: First-order observable Markov model
  • a set of states
  • Q = q1, q2 … qN; the state at time t is qt
  • a set of transition probabilities
  • a set of probabilities A = a01, a02, …, an1, …, ann
  • Each aij represents the probability of transitioning from state i to state j
  • The set of these is the transition probability matrix A
  • Distinguished start and end states
  • Special initial probability vector π
  • πi: the probability that the MM will start in state i

Markov chain: First-order observable Markov model
  • Markov Chain for weather: Example 1
  • three types of weather: sunny, rainy, foggy
  • we want to find the following conditional probability:
  • P(qn|qn-1, qn-2, …, q1)
  • i.e., the probability of the unknown weather on day n, depending on the (known) weather of the preceding days
  • We could infer this probability from the relative frequency (the statistics) of past observations of weather sequences
  • Problem: the larger n is, the more observations we must collect.
  • Suppose that n = 6; then we have to collect statistics for 3^(6-1) = 243 past histories

Markov chain: First-order observable Markov model
  • Therefore, we make a simplifying assumption, called the (first-order) Markov assumption
  • for a sequence of observations q1, … qn:
  • the current state only depends on the previous state
  • the joint probability of past and current observations factors accordingly

Markov chain: First-order observable Markov model
  • Given that today the weather is sunny, what's
    the probability that tomorrow is sunny and the
    day after is rainy?
  • Using the Markov assumption and the probabilities in Table 1, this translates into P(q2 = sunny, q3 = rainy | q1 = sunny) = P(sunny|sunny) × P(rainy|sunny)
The weather figure specific example
  • Markov Chain for weather Example 2

Markov chain for weather
  • What is the probability of 4 consecutive rainy days?
  • The sequence is rainy-rainy-rainy-rainy
  • I.e., the state sequence is 3-3-3-3
  • P(3,3,3,3) = π3 × a33 × a33 × a33 = 0.2 × (0.6)^3 = 0.0432
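The computation is an initial probability followed by three self-transitions, using the numbers from the slide:

```python
# Probability of four consecutive rainy days in the weather Markov chain:
# start in state rainy, then stay there three times.
pi_rainy = 0.2        # initial probability of state rainy (state 3)
a_rainy_rainy = 0.6   # self-transition probability a33

p = pi_rainy * a_rainy_rainy ** 3
print(p)  # 0.0432, up to floating-point rounding
```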

Hidden Markov Model
  • For Markov chains, the output symbols are the same as the states.
  • See sunny weather: we're in state sunny
  • But in part-of-speech tagging (and other things):
  • The output symbols are words
  • But the hidden states are part-of-speech tags
  • So we need an extension!
  • A Hidden Markov Model is an extension of a Markov chain in which the output symbols are not the same as the states.
  • This means we don't know which state we are in.

Markov chain for weather
Markov chain for words
Observed events words Hidden events tags
Hidden Markov Models
  • States Q = q1, q2 … qN
  • Observations O = o1, o2 … oN
  • Each observation is a symbol drawn from a vocabulary V
  • Transition probabilities (prior)
  • Transition probability matrix A = {aij}
  • Observation likelihoods (likelihood)
  • Output probability matrix B = {bi(ot)}
  • a set of observation likelihoods, each expressing the probability of an observation ot being generated from a state i (emission probabilities)
  • Special initial probability vector π
  • πi: the probability that the HMM will start in state i, i.e. p(qi|START)

  • Markov assumption: the probability of a particular state depends only on the previous state
  • Output-independence assumption: the probability of an output observation depends only on the state that produced that observation

HMM for Ice Cream
  • You are a climatologist in the year 2799
  • Studying global warming
  • You can't find any records of the weather in Boston, MA for the summer of 2007
  • But you find Jason Eisner's diary
  • Which lists how many ice-creams Jason ate every day that summer
  • Our job: figure out how hot it was

Noam task
  • Given
  • Ice Cream Observation Sequence 1,2,3,2,2,2,3
  • (cp. with output symbols)
  • Produce
  • Weather Sequence C,C,H,C,C,C,H
  • (cp. with hidden states, causing states)

HMM for ice cream
Different types of HMM structure
Ergodic (fully-connected)
Bakis (left-to-right)
HMM Taggers
  • Two kinds of probabilities
  • A: transition probabilities (PRIOR)
  • B: observation likelihoods (LIKELIHOOD)
  • HMM taggers choose the tag sequence which maximizes the product of word likelihood and tag sequence probability

Weighted FSM corresponding to hidden states of
HMM, showing A probs
B observation likelihoods for POS HMM
The A matrix for the POS HMM
The B matrix for the POS HMM
HMM Taggers
  • The probabilities are trained on hand-labeled training corpora (training set)
  • Combine different N-gram levels
  • Evaluated by comparing their output on a test set to human labels for that test set (Gold Standard)

The Viterbi Algorithm
  • best tag sequence for "John likes to fish in the …"
  • efficiently computes the most likely state sequence given a particular output sequence
  • based on dynamic programming

A smaller example
  • What is the best sequence of states for the input string "bbba"?
  • Computing all possible paths and finding the one with the max probability is exponential

A smaller example (cont)
  • For each state, store the most likely sequence that could lead to it (and its probability)
  • Path probability matrix:
  • An array of states versus time (tags versus words)
  • That stores the probability of being at each state at each time, in terms of the probabilities of being in each state at the preceding time.

Best sequences by time step (input b b b a), for states q and r:

leading to q, from q:  e→q 0.6 (1.0×0.6)  | q→q 0.108 (0.6×0.3×0.6)  | qq→q 0.01944 (0.108×0.3×0.6)  | qrq→q 0.012096 (0.1008×0.3×0.4)
leading to q, from r:                     | r→q 0     (0×0.5×0.6)    | qr→q 0.1008  (0.336×0.5×0.6)  | qrr→q 0.02688  (0.1344×0.5×0.4)
leading to r, from q:  e→r 0   (0×0.8)    | q→r 0.336 (0.6×0.7×0.8)  | qq→r 0.06048 (0.108×0.7×0.8)  | qrq→r 0.014112 (0.1008×0.7×0.2)
leading to r, from r:                     | r→r 0     (0×0.5×0.8)    | qr→r 0.1344  (0.336×0.5×0.8)  | qrr→r 0.01344  (0.1344×0.5×0.2)

Viterbi intuition: we are looking for the best path.
Slide from Dekang Lin
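The table above can be reproduced with a short Viterbi sketch using the example's transition and emission numbers (start mass 1.0 in q; b/a emissions and q/r transitions as in the parenthesized products):

```python
# Viterbi over the two-state q/r example, keeping for each state the best
# probability and the state sequence that achieves it.
start = {"q": 1.0, "r": 0.0}
trans = {("q", "q"): 0.3, ("q", "r"): 0.7,
         ("r", "q"): 0.5, ("r", "r"): 0.5}
emit = {("q", "b"): 0.6, ("r", "b"): 0.8,
        ("q", "a"): 0.4, ("r", "a"): 0.2}

def viterbi(observations):
    states = list(start)
    # paths maps each state to (best probability, best state sequence so far)
    paths = {s: (start[s] * emit[(s, observations[0])], [s]) for s in states}
    for symbol in observations[1:]:
        # the comprehension reads the previous time step's 'paths'
        paths = {
            s: max(((p * trans[(prev, s)] * emit[(s, symbol)], seq + [s])
                    for prev, (p, seq) in paths.items()),
                   key=lambda x: x[0])
            for s in states}
    return max(paths.values(), key=lambda x: x[0])

prob, seq = viterbi("bbba")
print(prob, seq)
```

For "bbba" this recovers the winning cell of the table: probability 0.02688 along the state sequence q r r q.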
The Viterbi Algorithm
  • The value in each cell is computed by taking the MAX over all paths that lead to this cell.
  • An extension of a path from state i at time t-1 is computed by multiplying:
  • the previous path probability from the previous cell
  • the transition probability aij from previous state i to current state j
  • the observation likelihood bj(ot) that current state j matches observation symbol t

Viterbi example
Smoothing of probabilities
  • Data sparseness is a problem when estimating
    probabilities based on corpus data.
  • The add-one smoothing technique:

P = (C + 1) / (N + B)

C = absolute frequency; N = number of training instances; B = number of different types
  • Linear interpolation methods can compensate for data sparseness with higher order models. A common method is interpolating trigrams, bigrams and unigrams.
  • The lambda values are automatically determined using a variant of the Expectation Maximization algorithm
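Interpolating trigram, bigram, and unigram estimates is a weighted sum. A minimal sketch; the probability tables and lambda weights below are invented for illustration:

```python
# Linear interpolation of tag n-gram estimates:
# P(t3 | t1, t2) = l1*P(t3) + l2*P(t3|t2) + l3*P(t3|t1,t2)
def interpolated(t3, t2, t1, uni, bi, tri, lambdas):
    l1, l2, l3 = lambdas            # the lambdas must sum to 1
    return (l1 * uni.get(t3, 0.0)
            + l2 * bi.get((t2, t3), 0.0)
            + l3 * tri.get((t1, t2, t3), 0.0))

# Invented example estimates for P(NN), P(NN|DT), P(NN|IN,DT)
uni = {"NN": 0.3}
bi = {("DT", "NN"): 0.6}
tri = {("IN", "DT", "NN"): 0.8}

p = interpolated("NN", "DT", "IN", uni, bi, tri, (0.2, 0.3, 0.5))
print(p)
```

Because the lower-order estimates are never zero for tags seen in training, the interpolated probability stays nonzero even when the trigram was never observed.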

Viterbi for POS tagging
  • Let:
  • n = number of words in the sentence to tag (number of input symbols)
  • T = number of tags in the tag set (number of states)
  • vit = path probability matrix (Viterbi): vit[i, j] = probability of being at state (tag) j at word i
  • state = matrix to recover the nodes of the best path (best tag sequence): state[i+1, j] = the state (tag) of the incoming arc that led to this most probable state j at word i+1
  • // Initialization
  • vit[1, PERIOD] = 1.0 // pretend that there is a period before our sentence (start tag PERIOD)
  • vit[1, t] = 0.0 for t ≠ PERIOD

Viterbi for POS tagging (cont)
  • // Induction (build the path probability matrix)
  • for i = 1 to n step 1 do // for all words in the sentence
  • for all tags tj do // for all possible tags
  • // store the max prob of the path
  • vit[i+1, tj] = max(1≤k≤T) ( vit[i, tk] × P(wi+1|tj) × P(tj|tk) )
  • // store the actual state
  • path[i+1, tj] = argmax(1≤k≤T) ( vit[i, tk] × P(wi+1|tj) × P(tj|tk) )
  • end
  • end
  • // Termination and path-readout
  • bestState[n+1] = argmax(1≤j≤T) vit[n+1, j]
  • for j = n to 1 step -1 do // for all the words in the sentence
  • bestState[j] = path[j+1, bestState[j+1]]
  • end
  • P(bestState[1], …, bestState[n]) = max(1≤j≤T) vit[n+1, j]

In the induction step: P(wi+1|tj) is the emission probability, P(tj|tk) the state transition probability, and vit[i, tk] the probability of the best path leading to state tk at word i.
Possible improvements
  • in bigram POS tagging, we condition a tag only on
    the preceding tag
  • why not...
  • use more context (ex. use trigram model)
  • more precise
  • "is clearly marked" → verb, past participle
  • "he clearly marked" → verb, past tense
  • combine trigram, bigram, unigram models
  • condition on words too
  • but with an n-gram approach, this is too costly
    (too many parameters to model)

Further issues with Markov Model tagging
  • Unknown words are a problem since we don't have the required probabilities. Possible solutions:
  • Assign the word probabilities based on corpus-wide distribution of POS
  • Use morphological cues (capitalization, suffix) to assign a more calculated guess.
  • Using higher order Markov models:
  • Using a trigram model captures more context
  • However, data sparseness is much more of a problem
  • Efficient statistical POS tagger developed by
    Thorsten Brants, ANLP-2000
  • Underlying model
  • Trigram modelling
  • The probability of a POS only depends on its two
    preceding POS
  • The probability of a word appearing at a
    particular position given that its POS occurs at
    that position is independent of everything else.

  • Maximum likelihood estimates

Smoothing: context-independent variant of linear interpolation
Smoothing algorithm
  • Set λi = 0
  • For each trigram t1 t2 t3 with f(t1,t2,t3) > 0:
  • Depending on the max of the following three values:
  • Case (f(t1,t2,t3) − 1) / (f(t1,t2) − 1): increment λ3 by f(t1,t2,t3)
  • Case (f(t2,t3) − 1) / (f(t2) − 1): increment λ2 by f(t1,t2,t3)
  • Case (f(t3) − 1) / (N − 1): increment λ1 by f(t1,t2,t3)
  • Normalize the λi
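The lambda-setting loop above (deleted interpolation, in the style of Brants' TnT) can be sketched directly; the toy tag sequence is invented for illustration, and ties between cases simply go to the first maximum here:

```python
from collections import Counter

def estimate_lambdas(tags):
    """Deleted-interpolation lambda estimation from a tag sequence.
    Returns [lambda1 (unigram), lambda2 (bigram), lambda3 (trigram)]."""
    uni = Counter(tags)
    bi = Counter(zip(tags, tags[1:]))
    tri = Counter(zip(tags, tags[1:], tags[2:]))
    n = len(tags)
    lam = [0.0, 0.0, 0.0]
    for (t1, t2, t3), f in tri.items():
        # the three cases, with denominators guarded against division by zero
        cases = [
            (uni[t3] - 1) / (n - 1) if n > 1 else 0.0,
            (bi[(t2, t3)] - 1) / (uni[t2] - 1) if uni[t2] > 1 else 0.0,
            (f - 1) / (bi[(t1, t2)] - 1) if bi[(t1, t2)] > 1 else 0.0,
        ]
        lam[cases.index(max(cases))] += f   # increment the winning lambda
    total = sum(lam)
    return [x / total for x in lam] if total else lam

tags = ["DT", "NN", "VB", "DT", "NN", "VB", "DT", "JJ", "NN"]
print(estimate_lambdas(tags))
```

Each observed trigram votes, with weight equal to its frequency, for whichever order of model predicts it most reliably once that occurrence is removed from the counts.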

Evaluation of POS taggers
  • compared with a gold standard of human performance
  • metric:
  • accuracy = % of tags that are identical to the gold standard
  • most taggers: 96-97% accuracy
  • must compare accuracy to:
  • ceiling (best possible results)
  • how do human annotators score compared to each other? (96-97%)
  • so systems are not bad at all!
  • baseline (worst possible results)
  • what if we take the most-likely tag (unigram model) regardless of previous tags? (90-91%)
  • so anything less is really bad
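The unigram baseline mentioned above is easy to sketch: tag every token with the tag it carried most often in training, ignoring context entirely. The tiny training set is invented for illustration:

```python
from collections import Counter, defaultdict

# Most-frequent-tag baseline: per-word tag counts from (invented) training
# data, plus a default tag for unknown words.
train = [("the", "DT"), ("back", "NN"), ("back", "RB"),
         ("back", "NN"), ("bill", "NN"), ("promised", "VBD")]

counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1

def baseline_tag(word, default="NN"):
    """Most frequent training tag for this word; a default when unseen."""
    if word in counts:
        return counts[word].most_common(1)[0][0]
    return default

print([baseline_tag(w) for w in ["the", "back", "gracking"]])
```

Because it always predicts NN for "back", this baseline is guaranteed to mistag the adverbial and verbal uses, which is exactly the gap context-aware models close.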

More on tagger accuracy
  • is 95% good?
  • that's 5 mistakes every 100 words
  • if, on average, a sentence is 20 words, that's 1 mistake per sentence
  • when comparing tagger accuracy, beware of:
  • size of training corpus
  • the bigger, the better the results
  • difference between training and testing corpora (genre, domain)
  • the closer, the better the results
  • size of tag set
  • prediction versus classification
  • unknown words
  • the more unknown words (not in the dictionary), the worse the results

Error Analysis
  • Look at a confusion matrix (contingency table)
  • E.g., 4.4% of the total errors are caused by mistagging VBD as VBN
  • See what errors are causing problems:
  • Noun (NN) vs ProperNoun (NNP) vs Adj (JJ)
  • Adverb (RB) vs Particle (RP) vs Prep (IN)
  • Preterite (VBD) vs Participle (VBN) vs Adjective (JJ)

Tag indeterminacy
Major difficulties in POS tagging
  • Unknown words (proper names)
  • because we do not know the set of tags they can take
  • and knowing this takes you a long way (cf. baseline POS tagger)
  • possible solutions:
  • assign all possible tags with a probability distribution identical to the lexicon as a whole
  • use morphological cues to infer possible tags
  • ex. words ending in -ed are likely to be past tense verbs or past participles
  • Frequently confused tag pairs:
  • preposition vs particle
  • <running> <up> a hill (preposition) / <running up> a bill (particle)
  • verb, past tense vs. past participle vs. adjective
Unknown Words
  • Most-frequent-tag approach.
  • What about words that don't appear in the training set?
  • Suffix analysis:
  • The probability distribution for a particular suffix is generated from all words in the training set that share the same suffix.
  • Suffix estimation: calculate the probability of a tag t given the last i letters of an n-letter word.
  • Smoothing: successive abstraction through sequences of increasingly more general contexts (i.e., omit more and more characters of the suffix)
  • Use a morphological analyzer to get the restriction on the possible tags.

Unknown words
Alternative graphical models for part of speech
Different Models for POS tagging
  • HMM
  • Maximum Entropy Markov Models
  • Conditional Random Fields

Hidden Markov Model (HMM) Generative Modeling
Source Model P(Y)
Noisy Channel P(X|Y)
Dependency (1st order)
Disadvantage of HMMs (1)
  • No Rich Feature Information
  • Rich information is required:
  • When xk is complex
  • When data for xk is sparse
  • Example: POS Tagging
  • How to evaluate P(wk|tk) for unknown words wk?
  • Useful features:
  • Suffix, e.g., -ed, -tion, -ing, etc.
  • Capitalization
  • Generative Model:
  • Parameter estimation: maximize the joint likelihood of training examples

Generative Models
  • Hidden Markov models (HMMs) and stochastic grammars
  • Assign a joint probability to paired observation and label sequences
  • The parameters are typically trained to maximize the joint likelihood of training examples

Generative Models (contd)
  • Difficulties and disadvantages:
  • Need to enumerate all possible observation sequences
  • Not practical to represent multiple interacting features or long-range dependencies of the observations
  • Very strict independence assumptions on the observations

  • Better Approach:
  • Discriminative model, which models P(y|x) directly
  • Maximize the conditional likelihood of training examples

Maximum Entropy modeling
  • N-gram model probabilities depend on the
    previous few tokens.
  • We may identify a more heterogeneous set of features which contribute in some way to the choice of the current word (whether it is the first word in a story, whether the next word is "to", whether one of the last 5 words is a preposition, etc.)
  • Maxent combines these features in a probabilistic model.
  • The given features provide a constraint on the model.
  • We would like to have a probability distribution which, outside of these constraints, is as uniform as possible, i.e. has the maximum entropy among all models that satisfy these constraints.

Maximum Entropy Markov Model
  • Discriminative Sub Models
  • Unify two parameters in generative model into one
    conditional model
  • Two parameters in generative model,
  • parameter in source model
    and parameter in noisy channel
  • Unified conditional model
  • Employ maximum entropy principle
  • Maximum Entropy Markov Model

General Maximum Entropy Principle
  • Model
  • Model distribution P(Y|X) with a set of features f1, f2, …, fl defined on X and Y
  • Idea:
  • Collect information about features from training data
  • Principle:
  • Model what is known
  • Assume nothing else
  • → Flattest distribution
  • → Distribution with the maximum entropy

  • (Berger et al., 1996) example:
  • Model the translation of the word "in" from English to French
  • Need to model P(wordFrench), the distribution over French translations
  • Constraints:
  • 1. Possible translations: dans, en, à, au cours de, pendant
  • 2. dans or en used 30% of the time
  • 3. dans or à used 50% of the time

  • Features
  • 0-1 indicator functions
  • 1 if (x, y) satisfies a predefined condition
  • 0 if not
  • Example POS Tagging

  • Empirical Information
  • Statistics from training data T
  • Expected Value
  • From the distribution P(Y|X) we want to model
  • Constraints

Maximum Entropy Objective
  • Entropy
  • Maximization Problem

Dual Problem
  • Dual Problem
  • Conditional model
  • Maximum likelihood of conditional data
  • Solution
  • Improved iterative scaling (IIS) (Berger et al., 1996)
  • Generalized iterative scaling (GIS) (McCallum et al., 2000)

Maximum Entropy Markov Model
  • Use Maximum Entropy Approach to Model
  • 1st order
  • Features
  • Basic features (like parameters in HMM):
  • Bigram (1st order) or trigram (2nd order) in the source model
  • State-output pair features (Xk = xk, Yk = yk)
  • Advantage: incorporate other advanced features on (xk, yk)

HMM vs MEMM (1st order)
Maximum Entropy Markov Model (MEMM)
Performance in POS Tagging
  • POS Tagging
  • Data set: WSJ
  • Features:
  • HMM features, spelling features (like -ed, -tion, -s, -ing, etc.)
  • Results (Lafferty et al. 2001):
  • 1st order HMM: 94.31% accuracy, 54.01% OOV accuracy
  • 1st order MEMM: 95.19% accuracy, 73.01% OOV accuracy

ME applications
  • Part of Speech (POS) Tagging (Ratnaparkhi, 1996)
  • P(POS tag | context)
  • Information sources
  • Word window (4)
  • Word features (prefix, suffix, capitalization)
  • Previous POS tags

ME applications
  • Abbreviation expansion (Pakhomov, 2002)
  • Information sources
  • Word window (4)
  • Document title
  • Word Sense Disambiguation (WSD) (Chao & Dyer, 2002)
  • Information sources:
  • Word window (4)
  • Structurally related words (4)
  • Sentence Boundary Detection (Reynar & Ratnaparkhi, 1997)
  • Information sources
  • Token features (prefix, suffix, capitalization,
  • Word window (2)

  • Global Optimization
  • Optimize parameters in a global model
    simultaneously, not in sub models separately
  • Alternatives
  • Conditional random fields
  • Application of perceptron algorithm

Why ME?
  • Advantages
  • Combine multiple knowledge sources
  • Local:
  • Word prefix, suffix, capitalization (POS (Ratnaparkhi, 1996))
  • Word POS, POS class, suffix (WSD (Chao & Dyer, 2002))
  • Token prefix, suffix, capitalization, abbreviation (Sentence Boundary (Reynar & Ratnaparkhi, 1997))
  • Global:
  • N-grams (Rosenfeld, 1997)
  • Word window
  • Document title (Pakhomov, 2002)
  • Structurally related words (Chao & Dyer, 2002)
  • Sentence length, conventional lexicon (Och & Ney)
  • Combine dependent knowledge sources

Why ME?
  • Advantages
  • Add additional knowledge sources
  • Implicit smoothing
  • Disadvantages
  • Computational
  • Expected value at each iteration
  • Normalizing constant
  • Overfitting
  • Feature selection
  • Cutoffs
  • Basic Feature Selection (Berger et al., 1996)

Conditional Models
  • Conditional probability P(label sequence y | observation sequence x) rather than joint probability P(y, x)
  • Specify the probability of possible label
    sequences given an observation sequence
  • Allow arbitrary, non-independent features on the
    observation sequence X
  • The probability of a transition between labels
    may depend on past and future observations
  • Relax strong independence assumptions in
    generative models

Discriminative ModelsMaximum Entropy Markov
Models (MEMMs)
  • Exponential model
  • Given training set X with label sequence Y:
  • Train a model θ that maximizes P(Y|X, θ)
  • For a new data sequence x, the predicted label y maximizes P(y|x, θ)
  • Notice the per-state normalization

MEMMs (contd)
  • MEMMs have all the advantages of conditional models
  • Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states ("conservation of score mass")
  • Subject to the Label Bias Problem
  • Bias toward states with fewer outgoing transitions

Label Bias Problem
  • Consider this MEMM
  • P(1 and 2 | ro) = P(2 | 1 and ro) × P(1 | ro) = P(2 | 1 and o) × P(1 | r)
  • P(1 and 2 | ri) = P(2 | 1 and ri) × P(1 | ri) = P(2 | 1 and i) × P(1 | r)
  • Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
  • In the training data, label value 2 is the only label value observed after label value 1
  • Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
  • However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
  • Per-state normalization does not allow the required expectation

Solve the Label Bias Problem
  • Change the state-transition structure of the model
  • Not always practical to change the set of states
  • Start with a fully-connected model and let the training procedure figure out a good structure
  • Precludes the use of prior structural knowledge, which is very valuable (e.g. in information extraction)

Random Field
Conditional Random Fields (CRFs)
  • CRFs have all the advantages of MEMMs without
    label bias problem
  • MEMM uses per-state exponential model for the
    conditional probabilities of next states given
    the current state
  • CRF has a single exponential model for the joint
    probability of the entire sequence of labels
    given the observation sequence
  • Undirected acyclic graph
  • Allow some transitions to vote more strongly than others, depending on the corresponding observations

Definition of CRFs
X is a random variable over data sequences to be
labeled Y is a random variable over corresponding
label sequences
Example of CRFs
Graphical comparison among HMMs, MEMMs and CRFs
Conditional Distribution
Conditional Distribution (contd)
  • CRFs use the observation-dependent normalization Z(x) for the conditional distributions:

Z(x) is a normalization over the data sequence x
Parameter Estimation for CRFs
  • The paper provided iterative scaling algorithms
  • They turn out to be very inefficient
  • Prof. Dietterich's group applied a gradient descent algorithm, which is quite efficient

Training of CRFs (From Prof. Dietterich)
  • Then, take the derivative of the above equation
  • For training, the first 2 terms are easy to get.
  • For example, for each λk, fk is a sequence of Boolean numbers, such as 00101110100111.
  • The corresponding count is just the total number of 1s in the sequence.
  • The hardest thing is how to calculate Z(x)

Training of CRFs (From Prof. Dietterich) (contd)
  • Maximal cliques

POS tagging Experiments
POS tagging Experiments (contd)
  • Compared HMMs, MEMMs, and CRFs on Penn treebank
    POS tagging
  • Each word in a given input sentence must be
    labeled with one of 45 syntactic tags
  • Add a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies
  • oov out-of-vocabulary (not observed in the
    training set)

  • Discriminative models are prone to the label bias problem
  • CRFs provide the benefits of discriminative models
  • CRFs solve the label bias problem well, and demonstrate good performance