Title: HMM and n-gram tagger (Recap)
1 HMM and n-gram tagger (Recap)
2 HMM
- Two types of HMM
  - Arc-emission HMM
  - State-emission HMM
- The two types are equivalent.
- We normally use state-emission HMMs to build n-gram taggers.
3 Definition of state-emission HMM
- An HMM is a tuple (S, Σ, π, A, B):
  - A set of states S = {s1, s2, ..., sN}
  - A set of output symbols Σ = {w1, ..., wM}
  - Initial state probabilities π = {π_i}
  - Transition prob A = {a_ij}
  - Emission prob B = {b_jk}
- We use si and wk to refer to what is in an HMM structure.
- We use Xi and Oi to refer to what is in a particular HMM path and its output.
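As a concrete illustration (not from the slides), the tuple can be stored directly; the class and field names are assumptions:

```python
import numpy as np

class StateEmissionHMM:
    def __init__(self, states, symbols, pi, A, B):
        self.states = states      # [s1, ..., sN]
        self.symbols = symbols    # [w1, ..., wM]
        self.pi = np.asarray(pi)  # initial state probs, shape (N,)
        self.A = np.asarray(A)    # A[i, j] = P(sj | si), shape (N, N)
        self.B = np.asarray(B)    # B[j, k] = P(wk | sj), shape (N, M)
```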
4 An HMM structure
[Figure: an HMM structure with states s1, s2, ..., sN, each emitting output symbols such as w1, w2, w3, w5]
- Two kinds of parameters:
  - Transition probability P(sj | si)
  - Emission probability P(wk | si)
- # of parameters: O(NM + N^2)
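As an illustrative calculation (the numbers are assumed, not from the slides): with N = 45 tags (the Penn Treebank tagset) and M = 50,000 words, that is about 45 x 50,000 = 2,250,000 emission parameters plus 45^2 = 2,025 transition parameters.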
5 A path for an output sequence
- State sequence X1,n+1
- Output sequence O1,n
6 Definition of arc-emission HMM
- An HMM is a tuple (S, Σ, π, A, B):
  - A set of states S = {s1, s2, ..., sN}
  - A set of output symbols Σ = {w1, ..., wM}
  - Initial state probabilities π = {π_i}
  - Transition prob A = {a_ij}
  - Emission prob B = {b_ijk}
- We use si and wk to refer to what is in an HMM structure.
- We use Xi and Oi to refer to what is in a particular HMM path and its output.
7 Arc-emission vs. state-emission
- In an arc-emission HMM, the output symbol is emitted on a transition, so the emission prob b_ijk = P(wk | si, sj) depends on both states; in a state-emission HMM, it depends on a single state: b_jk = P(wk | sj).
8 Three fundamental questions for HMMs
- To train an HMM: learn the transition and emission probabilities
- To find the best state sequence for a given observation
- To compute the probability of a given observation
9 Training an HMM: estimating the probabilities
- Supervised learning
  - The state sequences in the training data are known
  - ML estimation by simple counting (see the sketch below)
- Unsupervised learning
  - The state sequences in the training data are unknown
  - The forward-backward algorithm
  - More in later slides
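A minimal sketch of the supervised case for a bigram tagger: relative-frequency (ML) estimates from counts. The function name and the BOS/EOS symbols are illustrative choices, not the slides' exact setup.

```python
from collections import defaultdict

def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def train_bigram_tagger(tagged_sentences):
    """tagged_sentences: a list of [(word, tag), ...] lists."""
    trans_count = defaultdict(lambda: defaultdict(int))
    emit_count = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sentences:
        prev = "BOS"
        for word, tag in sent:
            trans_count[prev][tag] += 1   # count(t_{i-1}, t_i)
            emit_count[tag][word] += 1    # count(t_i, w_i)
            prev = tag
        trans_count[prev]["EOS"] += 1     # close off the sentence
    # ML estimation by simple counting: relative frequencies.
    trans_prob = {t1: normalize(c) for t1, c in trans_count.items()}
    emit_prob = {t: normalize(c) for t, c in emit_count.items()}
    return trans_prob, emit_prob
```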
10 HMM as a parser: finding the best state sequence
- Given the observation O1,T = o1 ... oT, find the state sequence X1,T+1 = X1 ... XT+1 that maximizes P(X1,T+1 | O1,T).
- => the Viterbi algorithm
11 Viterbi algorithm
- δ_j(t): the probability of the best path that produces O1,t-1 while ending up in state sj:
  δ_j(t) = max over X1,t-1 of P(X1,t-1, O1,t-1, X_t = sj)
- Initialization: δ_j(1) = π_j
- Induction: δ_j(t+1) = max_i [δ_i(t) * a_ij * b_j(o_t)], keeping a back-pointer to the maximizing i
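A minimal Viterbi sketch, using the common convention that δ includes the emission of o_t at time t (a slight reindexing of the slides' O1,t-1 definition); the names are illustrative.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """pi: (N,) initial probs; A[i, j] = P(sj | si); B[j, k] = P(wk | sj);
    obs: a list of output-symbol indices. Returns (best path, its prob)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))            # delta[t, j]: best-path prob ending in j
    back = np.zeros((T, N), dtype=int)  # back-pointers
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        # scores[i, j] = delta[t-1, i] * a_ij * b_j(o_t)
        scores = delta[t - 1][:, None] * A * B[:, obs[t]][None, :]
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0)
    path = [int(delta[-1].argmax())]    # best final state
    for t in range(T - 1, 0, -1):       # follow back-pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta[-1].max())
```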
12 HMM as an LM: computing P(o1, ..., oT)
13 Definition of the forward probability
- α_i(t): the probability of producing O1,t-1 while ending up in state si:
  α_i(t) = P(O1,t-1, X_t = si)
14 Calculating forward probability
- Initialization: α_i(1) = π_i
- Induction: α_j(t+1) = sum_i α_i(t) * a_ij * b_j(o_t)
- P(O1,T) = sum_i α_i(T+1)
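A forward-probability sketch in the same setup and conventions as the viterbi() sketch above: the identical recursion, with sum in place of max.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Returns P(o_1 ... o_T) under the HMM (pi, A, B)."""
    alpha = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        # alpha_j <- sum_i alpha_i * a_ij * b_j(o_t)
        alpha = (alpha[:, None] * A * B[:, obs[t]][None, :]).sum(axis=0)
    return float(alpha.sum())
```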
15 HMM Summary
- Definition: hidden states, output symbols
- Two types of HMMs
- Three basic questions in HMM
  - Estimating the probabilities: MLE
  - Finding the best sequence: the Viterbi algorithm
  - Finding the probability of an observation: the forward probability
16 N-gram POS tagger
17 N-gram POS tagger
18 N-gram POS tagger (cont)
- Bigram model: P(w1,n, t1,n) ≈ Π_i P(ti | ti-1) * P(wi | ti)
- Trigram model: P(w1,n, t1,n) ≈ Π_i P(ti | ti-2, ti-1) * P(wi | ti)
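As an illustration of the bigram model (reusing the assumed names from train_bigram_tagger() above), the joint probability of a tagged sentence is the product of transition and emission probabilities:

```python
def score_bigram(trans_prob, emit_prob, tagged_sent):
    """P(words, tags) for tagged_sent = [(word, tag), ...]."""
    p, prev = 1.0, "BOS"
    for word, tag in tagged_sent:
        p *= (trans_prob.get(prev, {}).get(tag, 0.0)    # P(t_i | t_{i-1})
              * emit_prob.get(tag, {}).get(word, 0.0))  # P(w_i | t_i)
        prev = tag
    return p * trans_prob.get(prev, {}).get("EOS", 0.0)
```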
19 The bigram tagger
- States: each state corresponds to a POS tag, plus a state for BOS (and a state for EOS)
- Output symbols: each output symbol is a word, <s>, or </s>
- Initial probability: π
- Transition probability: a_ij = P(sj | si)
- Emission probability: b_jk = P(wk | sj)
20 The bigram tagger (cont)
21 The trigram tagger
- States: each state corresponds to a tag pair; a tag is a POS tag, BOS, or EOS
- Output symbols: words, <s>, </s>
- Initial probability: π
- Transition probability (see the sketch below):
  - a_ij = P(t3 | t1, t2), where si = (t1, t2) and sj = (t2, t3)
  - a_ij = 0, where si = (t1, t2') and sj = (t2, t3) with t2' != t2
- Emission probability:
  - b_jk = P(wk | t2), where sj = (t1, t2), for any t1
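A sketch of the trigram tagger's state space (the names are illustrative): states are tag pairs, and a transition gets probability mass only when the second tag of the source pair matches the first tag of the target pair.

```python
from itertools import product

def trigram_states_and_transitions(tags, trigram_prob):
    """trigram_prob: a dict mapping (t1, t2, t3) -> P(t3 | t1, t2)."""
    states = list(product(tags, tags))   # si = (t1, t2)
    trans = {}
    for t1, t2 in states:
        for t2b, t3 in states:
            if t2b == t2:                # the overlap constraint
                trans[((t1, t2), (t2b, t3))] = trigram_prob.get((t1, t2, t3), 0.0)
            # all other state pairs implicitly have probability 0
    return states, trans
```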
22 The trigram tagger (cont)
23 Training an n-gram tagger
24 Estimating the probability
25 Smoothing
- To handle unseen tag sequences
  => smooth the transition prob
- To handle unknown words
  => smooth the emission prob
- To handle unseen (word, tag) pairs, where both the word and the tag are known
  - There is a very low percentage of such pairs (e.g., 0.44%) in the PTB.
26 Handling unseen tag sequences
- Ex: to smooth P(t3 | t1, t2) for a trigram tagger.
- Can we use Good-Turing (GT) smoothing for this?
- How about interpolation?
  P_smooth(t3 | t1, t2) = λ1 * P(t3 | t1, t2) + λ2 * P(t3 | t2) + λ3 * P(t3), with λ1 + λ2 + λ3 = 1
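A minimal interpolation sketch for the formula above. The λ values here are placeholders; in practice the weights are tuned on held-out data (e.g., by deleted interpolation).

```python
def smoothed_trigram(t1, t2, t3, p_tri, p_bi, p_uni,
                     lambdas=(0.6, 0.3, 0.1)):
    """Interpolated P(t3 | t1, t2); the lambda values are placeholders."""
    l1, l2, l3 = lambdas   # must sum to 1
    return (l1 * p_tri.get((t1, t2, t3), 0.0)   # P(t3 | t1, t2)
            + l2 * p_bi.get((t2, t3), 0.0)      # P(t3 | t2)
            + l3 * p_uni.get(t3, 0.0))          # P(t3)
```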
27 Handling unknown words
- Introduce a new output symbol <unk>
- Estimate P(<unk> | t) for each tag t
  - Ex: split the training data into two sets; create the vocabulary from set1, and estimate P(<unk> | t) from set2.
- Add P(<unk> | t) to the emission prob and renormalize so that sum_w P(w | t) = 1.
  - Ex: keep P(<unk> | t) the same, and scale the other P(w | t) so that they sum to 1 - P(<unk> | t).
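A sketch of the two-way-split estimate of P(<unk> | t) described above (the function name is illustrative): build the vocabulary from set1, then count how often each tag appears with an out-of-vocabulary word in set2.

```python
from collections import defaultdict

def estimate_unk_prob(set1, set2):
    """set1, set2: lists of tagged sentences [(word, tag), ...]."""
    vocab = {w for sent in set1 for w, _ in sent}   # vocabulary from set1
    tag_count = defaultdict(int)
    unk_count = defaultdict(int)
    for sent in set2:
        for word, tag in sent:
            tag_count[tag] += 1
            if word not in vocab:                   # OOV under set1's vocabulary
                unk_count[tag] += 1
    return {t: unk_count[t] / tag_count[t] for t in tag_count}
```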