1
HMM and n-gram tagger (Recap)
  • LING 575
  • Week 1 1/08/08

2
HMM
  • Two types of HMM
  • Arc-emission HMM
  • State-emission HMM
  • The two types are equivalent.
  • We normally use state-emission HMM to build
    n-gram taggers.

3
Definition of state-emission HMM
  • An HMM is a tuple (S, Σ, Π, A, B):
  • A set of states S = {s1, s2, ..., sN}
  • A set of output symbols Σ = {w1, ..., wM}
  • Initial state probabilities Π = {πi}
  • Transition prob A = {aij}, where aij = P(sj | si)
  • Emission prob B = {bjk}, where bjk = P(wk | sj)
  • We use si and wk to refer to what is in an HMM
    structure.
  • We use Xi and Oi to refer to what is in a
    particular HMM path and its output
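As an aside (not part of the slides), the tuple above can be written down as a tiny container; the class and field names below are illustrative only.

  from dataclasses import dataclass
  from typing import Dict, List, Tuple

  @dataclass
  class HMM:
      # Minimal sketch of the (S, Σ, Π, A, B) tuple for a state-emission HMM.
      states: List[str]                    # S = {s1, ..., sN}
      symbols: List[str]                   # Σ = {w1, ..., wM}
      pi: Dict[str, float]                 # Π: initial state probabilities
      trans: Dict[Tuple[str, str], float]  # A: trans[(si, sj)] = P(sj | si)
      emit: Dict[Tuple[str, str], float]   # B: emit[(sj, wk)] = P(wk | sj)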

4
An HMM structure

[Figure: an HMM structure with states s1, s2, ..., sN, each emitting output symbols such as w1, w2, w3, w5]
  • Two kinds of parameters
  • Transition probability P(sj | si)
  • Emission probability P(wk | si)
  • # of Parameters: O(NM + N^2)
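For illustration only (the figures are hypothetical, not from the slides): with N = 45 tags and M = 40,000 word types, that is N^2 = 2,025 transition parameters, N·M = 1,800,000 emission parameters, and N = 45 initial probabilities, hence O(NM + N^2).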

5
A path for an output sequence
  • State sequence X1,n+1
  • Output sequence O1,n
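As a reconstruction consistent with the state-emission definition above (not copied from the slides), the joint probability of such a path and its output is:

  P(X_{1,n+1}, O_{1,n}) = \pi_{X_1} \prod_{t=1}^{n} P(O_t \mid X_t) \, P(X_{t+1} \mid X_t)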

6
Definition of arc-emission HMM
  • An HMM is a tuple (S, Σ, Π, A, B):
  • A set of states S = {s1, s2, ..., sN}
  • A set of output symbols Σ = {w1, ..., wM}
  • Initial state probabilities Π = {πi}
  • Transition prob A = {aij}, where aij = P(sj | si)
  • Emission prob B = {bijk}, where bijk = P(wk | si, sj)
  • We use si and wk to refer to what is in an HMM
    structure.
  • We use Xi and Oi to refer to what is in a
    particular HMM path and its output

7
Arc-emission vs. state-emission
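The comparison itself is not reproduced in this transcript; the key difference, in the notation above, is which distribution generates each output (a standard summary, not the slide's wording):

  \text{state-emission: } P(O_t \mid X_t), \qquad b_{jk} = P(w_k \mid s_j)
  \text{arc-emission: } P(O_t \mid X_t, X_{t+1}), \qquad b_{ijk} = P(w_k \mid s_i, s_j)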
8
Three fundamental questions for HMMs
  • To train an HMM to learn the transition and
    emission probabilities
  • To find the best state sequence for a given
    observation
  • To compute the probability of a given observation

9
Training an HMM: estimating the probabilities
  • Supervised learning
  • The state sequences in the training data are
    known
  • ML estimation by simple counting
  • Unsupervised learning
  • The state sequences in the training data are
    unknown
  • The forward-backward algorithm
  • More in later slides

10
HMM as a parser: finding the best state sequence
  • Given the observation sequence O1,T = o1 ... oT, find the state sequence
    X1,T+1 = X1 ... XT+1 that maximizes P(X1,T+1 | O1,T).
  • ⇒ Viterbi algorithm

11
Viterbi algorithm
  • The probability of the best path that produces
    O1,t-1 while ending up in state sj

Initialization
Induction
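The initialization and induction formulas are not reproduced in this transcript. One standard formulation consistent with the definition above, writing δj(t) for the probability of the best path that produces O1,t-1 and ends in state sj (a reconstruction, not a verbatim copy of the slide), is:

  \delta_j(1) = \pi_j
  \delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t) \, b_{i, o_t} \, a_{ij}

The best state sequence is then recovered from backpointers stored at each induction step.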
12
HMM as an LM: computing P(o1, ..., oT)
13
Definition of the forward probability
  • The probability of producing O1,t-1 while ending
    up in state si

14
Calculating forward probability
Initialization
Induction
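As with Viterbi, the formulas are not reproduced here; a standard reconstruction, writing αi(t) for the probability defined on the previous slide, is:

  \alpha_i(1) = \pi_i
  \alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t) \, b_{i, o_t} \, a_{ij}
  P(O_{1,T}) = \sum_{i=1}^{N} \alpha_i(T+1)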
15
HMM Summary
  • Definition: hidden states, output symbols
  • Two types of HMMs
  • Three basic questions in HMM
  • Estimate probability: MLE
  • Find the best sequence: Viterbi algorithm
  • Find the probability of an observation: forward probability

16
N-gram POS tagger
17
N-gram POS tagger
18
N-gram POS tagger (cont)
Bigram model
Trigram model
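The model equations themselves are not reproduced in this transcript; the standard forms they correspond to (a reconstruction under the usual independence assumptions) are:

  \text{bigram: } \hat{t}_{1..n} = \arg\max_{t_{1..n}} \prod_{i=1}^{n} P(t_i \mid t_{i-1}) \, P(w_i \mid t_i)
  \text{trigram: } \hat{t}_{1..n} = \arg\max_{t_{1..n}} \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1}) \, P(w_i \mid t_i)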
19
The bigram tagger
  • States: each state corresponds to a POS tag, plus a state for BOS (and a state for EOS)
  • Output symbols: each output symbol is a word, <s>, or </s>
  • Initial probability
  • Transition probability: aij = P(sj | si)
  • Emission probability: bjk = P(wk | sj)

20
The bigram tagger (cont)
21
The trigram tagger
  • States: each state corresponds to a tag pair, where a tag is a POS tag, BOS, or EOS
  • Output symbols: words, <s>, </s>
  • Initial probability
  • Transition probability
  • aij = P(t3 | t1, t2), where si = (t1, t2) and sj = (t2, t3)
  • aij = 0, where si = (t1, t2), sj = (t2', t3), and t2 != t2'
  • Emission probability
  • bjk = P(wk | t), where sj = (t', t) for any t'
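A minimal sketch of the state construction and transition constraint just described, using a hypothetical toy tag set (not code from the course):

  # States of a trigram tagger are tag pairs; a transition from (t1, t2) to
  # (t2', t3) can have nonzero probability only when t2 == t2'.
  tags = ["BOS", "EOS", "DT", "NN", "VB"]            # illustrative tag set
  states = [(t1, t2) for t1 in tags for t2 in tags]

  def transition_allowed(si, sj):
      # si = (t1, t2), sj = (t2', t3)
      return si[1] == sj[0]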

22
The trigram tagger (cont)
23
Training an n-gram tagger
24
Estimating the probability
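The estimation details are not reproduced in this transcript; below is a sketch of supervised MLE by simple counting for a bigram tagger, assuming tagged sentences given as lists of (word, tag) pairs (function and variable names are hypothetical, not the course's code):

  from collections import defaultdict

  def mle_bigram(tagged_sents):
      # tagged_sents: e.g. [[("the", "DT"), ("dog", "NN")], ...]  (hypothetical format)
      trans_count = defaultdict(int)   # c(t_prev, t_cur)
      emit_count  = defaultdict(int)   # c(t, w)
      prev_count  = defaultdict(int)   # c(t_prev): denominator for transition probs
      tag_count   = defaultdict(int)   # c(t):      denominator for emission probs
      for sent in tagged_sents:
          tags = ["BOS"] + [t for _, t in sent] + ["EOS"]
          for prev, cur in zip(tags, tags[1:]):
              trans_count[(prev, cur)] += 1
              prev_count[prev] += 1
          for w, t in sent:
              emit_count[(t, w)] += 1
              tag_count[t] += 1
      # MLE: P(cur | prev) = c(prev, cur) / c(prev);  P(w | t) = c(t, w) / c(t)
      trans = {(p, c): n / prev_count[p] for (p, c), n in trans_count.items()}
      emit  = {(t, w): n / tag_count[t]  for (t, w), n in emit_count.items()}
      return trans, emit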
25
Smoothing
  • To handle unseen tag sequences
  • ⇒ smooth the transition prob
  • To handle unknown words
  • ⇒ smooth the emission prob
  • To handle unseen (word, tag) pairs, where both the word and the tag are known
  • There is a very low percentage of such pairs (e.g., 0.44%) in the PTB.

26
Handling unseen tag sequences
  • Ex: to smooth P(t3 | t1, t2) for a trigram tagger.
  • Can we use GT smoothing for this?
  • How about interpolation?
  • P(t3 | t1, t2)
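The interpolation formula itself is not shown in this transcript; the standard linear interpolation it presumably refers to (a reconstruction) is:

  P(t_3 \mid t_1, t_2) \approx \lambda_1 P_{ML}(t_3 \mid t_1, t_2) + \lambda_2 P_{ML}(t_3 \mid t_2) + \lambda_3 P_{ML}(t_3), \quad \lambda_1 + \lambda_2 + \lambda_3 = 1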

27
Handling unknown words?
  • Introduce a new output symbol <unk>
  • Estimate P(<unk> | t) for each tag t
  • Ex: split the training data into two sets; create the vocabulary from set1, and estimate P(<unk> | t) from set2.
  • Add P(<unk> | t) to the emission prob and renormalize so that Σ_w P(w | t) = 1.
  • Ex: keep P(<unk> | t) the same, and make ...
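A sketch of the split-the-data recipe above, with hypothetical names (not the course's code): the vocabulary is built from set1, and P(<unk> | t) is estimated from the out-of-vocabulary tokens in set2.

  from collections import defaultdict

  def estimate_unk(set1, set2):
      # set1, set2: lists of tagged sentences, each a list of (word, tag) pairs
      vocab = {w for sent in set1 for w, _ in sent}
      unk_count = defaultdict(int)
      tag_count = defaultdict(int)
      for sent in set2:
          for w, t in sent:
              tag_count[t] += 1
              if w not in vocab:
                  unk_count[t] += 1
      # P(<unk> | t) = (# out-of-vocabulary tokens tagged t) / (# tokens tagged t)
      return {t: unk_count[t] / tag_count[t] for t in tag_count}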