LING 406 Intro to Computational Linguistics: Language Models, HMMs, Forward Algorithm, Viterbi Algorithm



1
LING 406 Intro to Computational Linguistics
Language Models, HMMs, Forward Algorithm, Viterbi Algorithm
  • Richard Sproat
  • URL: http://catarina.ai.uiuc.edu/L406_08/

2
This Lecture
  • Class-based language models
  • Hidden Markov models
  • FST equivalents
  • Forward algorithm
  • Part-of-speech tagging
  • Viterbi algorithm
  • Forward-backward algorithm and expectation
    maximization
  • Further issues in part-of-speech tagging

3
Class-based language models
  • Suppose your corpus does not contain every Monday,
    but it does contain every DAY-OF-WEEK for all the
    other days of the week.
  • A class-based language model can model this
    situation (see the sketch below):
  • P(w_i | C_i) P(C_i | C_0, C_1, ..., C_{i-1})
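A minimal Python sketch of the idea; the class memberships and probabilities below are invented for illustration, not taken from the slides. Because Monday shares the DAY-OF-WEEK class with the other day names, "every Monday" still receives probability mass even though the bigram never occurred.

```python
# Biclass model sketch: P(w_i | C_i) * P(C_i | C_{i-1}).
# All numbers below are invented for illustration.
p_word_given_class = {
    ("every", "QUANT"): 0.3,
    ("Monday", "DAY-OF-WEEK"): 1.0 / 7,   # day names treated as one class
    ("Tuesday", "DAY-OF-WEEK"): 1.0 / 7,
}
p_class_given_class = {
    ("<s>", "QUANT"): 0.1,
    ("QUANT", "DAY-OF-WEEK"): 0.05,       # seen with "every Tuesday", "every Friday", ...
}

def biclass_prob(tagged_words):
    """Probability of a (word, class) sequence under the biclass model."""
    prob, prev = 1.0, "<s>"
    for word, cls in tagged_words:
        prob *= p_class_given_class[(prev, cls)] * p_word_given_class[(word, cls)]
        prev = cls
    return prob

print(biclass_prob([("every", "QUANT"), ("Monday", "DAY-OF-WEEK")]))  # > 0
```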

4
What are classes?
  • A word can be in its own class
  • Part-of-speech
  • Semantic classes (DAY-OF-WEEK)

5
General statement of problem
Markov assumption: the biclass model
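A sketch of the standard formulation (notation mine; the slide's own equations may differ in detail): the goal is the probability of the word string, and the Markov assumption restricts the class history to the immediately preceding class, giving the biclass model.

$$
P(w_1 \ldots w_n) \;=\; \sum_{C_1 \ldots C_n} \prod_{i=1}^{n} P(w_i \mid C_i)\, P(C_i \mid C_0, \ldots, C_{i-1})
\;\approx\; \sum_{C_1 \ldots C_n} \prod_{i=1}^{n} P(w_i \mid C_i)\, P(C_i \mid C_{i-1})
$$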
6
Hidden Markov Models (HMMs)
  • Emission probabilities:
    P(dog|N) = 0.9   P(eats|N) = 0.1
    P(dog|V) = 0.1   P(eats|V) = 0.9
  • Transition probabilities:
    P(N|<S>) = 0.5   P(V|<S>) = 0.5
    P(V|N) = 0.8     P(N|N) = 0.1   P(</S>|N) = 0.1
    P(N|V) = 0.7     P(V|V) = 0.1   P(</S>|V) = 0.2
  • Example: <s> dog eats dog </s>, tagged N V N:
    1.0 × 0.5 × 0.9 × 0.8 × 0.9 × 0.7 × 0.9 × 0.1 ≈ 0.02
  • Note: set probabilities of starting in any state
    other than <s> to 0
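A short Python sketch that reproduces the slide's calculation; the dictionaries simply encode the transition and emission probabilities listed above.

```python
# Transition and emission probabilities from the slide's HMM.
trans = {("<s>", "N"): 0.5, ("<s>", "V"): 0.5,
         ("N", "V"): 0.8, ("N", "N"): 0.1, ("N", "</s>"): 0.1,
         ("V", "N"): 0.7, ("V", "V"): 0.1, ("V", "</s>"): 0.2}
emit = {("N", "dog"): 0.9, ("N", "eats"): 0.1,
        ("V", "dog"): 0.1, ("V", "eats"): 0.9}

def joint_prob(words, tags):
    """P(words, tags): product of transition and emission probabilities."""
    prob, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        prob *= trans[(prev, t)] * emit[(t, w)]
        prev = t
    return prob * trans[(prev, "</s>")]   # transition into the end state

print(joint_prob(["dog", "eats", "dog"], ["N", "V", "N"]))   # ~0.0204
```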
7
Why hidden?
  • But if we see dog eats dog we don't actually know
    what underlying tag sequence it came from
  • The true sequence is hidden
  • Another possibility would be
  • <s> V V V </s>
  • 1.0 × 0.5 × 0.1 × 0.1 × 0.9 × 0.1 × 0.1 × 0.2
    = 0.000009
  • So we need to consider all possibilities if we
    want an estimate of the probability of the
    sentence given the model

8
An equivalent WFST
  • States: <S>, N, V, </S>
  • Example arcs out of <S>:
    V:dog / P(V|<S>) P(dog|V)
    V:eats / P(V|<S>) P(eats|V)
  • Arcs are labeled with tag:word pairs
  • States represent the last seen tag
  • Arc costs are combined transition and emission
    costs

9
The probability of an observed sequence
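In the notation of the earlier slides, the quantity wanted here is (a sketch, summing the joint probability over all possible tag sequences):

$$
P(w_1 \ldots w_n) \;=\; \sum_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
$$

with $t_0 = \langle s \rangle$, and a final factor $P(\langle/s\rangle \mid t_n)$ if the end state is modeled, as on slide 6.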
10
Forward algorithm
  • cf. the probability of a particular sequence in
    the real semiring for WFSAs

11
Forward algorithm
12
Forward probability
13
Pseudocode for forward algorithm
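A minimal Python sketch of the forward algorithm, reusing the trans and emit tables from the sketch after slide 6; this is an illustrative implementation, not the slide's pseudocode verbatim.

```python
def forward(words, tags=("N", "V")):
    """Total probability of the word sequence, summed over all tag paths."""
    # alpha[t] = probability of the words seen so far, ending in tag t
    alpha = {t: trans[("<s>", t)] * emit[(t, words[0])] for t in tags}
    for w in words[1:]:
        alpha = {t: sum(alpha[p] * trans[(p, t)] for p in tags) * emit[(t, w)]
                 for t in tags}
    return sum(alpha[t] * trans[(t, "</s>")] for t in tags)

print(forward(["dog", "eats", "dog"]))   # ~0.0216: N-V-N, V-V-V, and every other path
```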
14
Example
15
Tritags, tetratags
  • You can extend to more than one previous tag of
    history by adding states so that each state
    remembers the last n tags seen.
  • We saw this before in our WFST implementation of
    a language model

16
Part-of-speech tagging
We don't want the probability of the observed
sequence; we want the part-of-speech sequence
that maximizes that probability.
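In symbols (a sketch consistent with the notation used above):

$$
\hat{T} \;=\; \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-1})\, P(w_i \mid t_i)
$$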
17
Viterbi algorithm
18
Viterbi algorithm pseudocode
(annotations on the pseudocode: backpointer for class j at time t;
reconstruct the best-scoring path)
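As with the forward algorithm, a minimal Python sketch, again reusing the trans and emit tables from the sketch after slide 6; this is an illustration, not the slide's pseudocode verbatim.

```python
def viterbi(words, tags=("N", "V")):
    """Best-scoring tag sequence for the words, and its probability."""
    # delta[t] = probability of the best path so far ending in tag t
    # back[i][t] = best previous tag for tag t at position i (the backpointers)
    delta = {t: trans[("<s>", t)] * emit[(t, words[0])] for t in tags}
    back = []
    for w in words[1:]:
        step, new_delta = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: delta[p] * trans[(p, t)])
            step[t] = best_prev
            new_delta[t] = delta[best_prev] * trans[(best_prev, t)] * emit[(t, w)]
        back.append(step)
        delta = new_delta
    last = max(tags, key=lambda t: delta[t] * trans[(t, "</s>")])
    best_prob = delta[last] * trans[(last, "</s>")]
    # follow the backpointers to reconstruct the best-scoring path
    path = [last]
    for step in reversed(back):
        path.append(step[path[-1]])
    return list(reversed(path)), best_prob

print(viterbi(["dog", "eats", "dog"]))   # (['N', 'V', 'N'], ~0.0204)
```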
19
Viterbi example
20
Forward-backward algorithm
  • Forward algorithm as before: compute the
    probability of getting to this point.
  • At the same time, compute the probability of
    getting to the end of the string from here: the
    backward probabilities
  • This allows you to compute the probability of a
    given tag, given the entire string (sketched below).
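In the usual notation (a sketch; here $\alpha_j(i)$ and $\beta_j(i)$ are the forward and backward probabilities for tag $j$ at position $i$):

$$
P(t_i = j \mid w_1 \ldots w_n) \;=\; \frac{\alpha_j(i)\,\beta_j(i)}{P(w_1 \ldots w_n)}
$$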

21
EM
  • The forward-backward probabilities are used to
    compute the expected frequency of each tag
    (sketched below)
  • You can then re-estimate the model from these
    expected frequencies so as to maximize the
    expected likelihood of the data.
  • This is Expectation Maximization (EM)
  • See RS, 6.3.2
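A sketch of the expected counts, using the posterior $\gamma_j(i) = \alpha_j(i)\beta_j(i)/P(W)$ from the previous slide (notation mine, not the slide's):

$$
E[\mathrm{count}(j)] \;=\; \sum_{i=1}^{n} \gamma_j(i)
$$

The model parameters are then re-estimated from such expected counts, and the forward-backward and re-estimation steps are iterated.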

22
More on part-of-speech tagging
  • Part-of-speech (POS) tagging is simply the
    problem of placing words into equivalence
    classes.
  • The notion of part-of-speech tags can be attributed
    to Dionysius Thrax, a 1st-century BC Greek
    grammarian who classified Greek words into eight
    classes:
  • noun, verb, pronoun, preposition, adverb,
    conjunction, participle and article.
  • Tagging is arguably easiest in languages with
    rich (inflectional) morphology (e.g. Spanish), for
    two reasons:
  • It's more obvious what the basic set of tags
    should be, since words fall into clearly marked
    morphological classes
  • The morphology gives important cues to what the
    part of speech is
  • cantaremos is highly likely to be a verb given
    the ending -ar-emos.
  • It's arguably hardest in languages with minimal
    (inflectional) morphology:
  • there are fewer cues in English than there are in
    Spanish
  • for some languages like Chinese, cues are almost
    completely absent
  • linguists can't even agree on whether (e.g.)
    Chinese distinguishes verbs from adjectives.

23
Part-of-speech tags
  • Linguists typically distinguish a relatively
    small set of basic categories (like Dionysius
    Thrax), sometimes just 4 in the case of Chomsky's
    ±N, ±V proposal.
  • But usually these analyses assume an additional
    set of morphosyntactic features.
  • Computational models of tagging usually involve a
    larger set, which in many cases can be thought of
    as the linguist's small set, plus the features
    squished into one term:
  • eat/VB, eat/VBP, eats/VBZ, ate/VBD, eaten/VBN
  • Tagset size has a clear effect on the performance of
    taggers:
  • the Penn Treebank project collapsed many tags
    compared to the original Brown tagset, and got
    better results.
    (http://www.ilc.cnr.it/EAGLES96/morphsyn/node18.html)
  • But choosing the right size tagset depends upon
    the intended application.
  • As far as I know, there is no demonstration of
    what the optimal tagset is.

24
The Penn Treebank tagset
  • 46 tags, collapsed from the Brown Corpus tagset
  • Some details:
  • to/TO not disambiguated
  • verbs and auxiliaries (have, be) not
    distinguished (though these were in the Brown
    tagset).
  • Some links:
  • http://www.computing.dcu.ie/acahill/tagset.html
  • http://www.mozart-oz.org/mogul/doc/lager/brill-tagger/penn.html
  • http://www.scs.leeds.ac.uk/amalgam/tagsets/upenn.html
  • Link for the original Brown corpus tags:
  • http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html
  • Motivations for the Penn tagset modifications:
  • "the Penn Treebank tagset is based on that of the
    Brown Corpus. However the stochastic orientation
    of the Penn Treebank and the resulting concern
    with sparse data led us to modify the Brown
    tagset by paring it down considerably" (Marcus,
    Santorini and Marcinkiewicz, 1993).
  • eliminated distinctions that were lexically
    recoverable: thus no separate tags for be, do,
    have
  • as well as distinctions that were syntactically
    recoverable (e.g. the distinction between subject
    and object pronouns)

25
Problematic cases
  • Even with a well-designed tagset, there are cases
    that even experts find difficult to agree on.
  • adjective or participle?
  • a seen event, a rarely seen event, an unseen
    event,
  • a child seat, a very child seat, this seat is
    child
  • but that's a very MIT paper, she's sooooooo
    California
  • Some cases are difficult to get right in the absence
    of further knowledge: preposition or particle?
  • he threw out the garbage
  • he threw the garbage out
  • he threw the garbage out the door
  • he threw the garbage the door out

26
Typical examples used to motivate tagging
  • Can they can cans?
  • May may leave
  • He does not shoot does
  • You might use all your might
  • I am arriving at 3 am

27
How hard is tagging?
28
Approaches to tagging
29
Approaches to tagging
  • Source-channel model
  • This is what we've seen already with the HMM
    tagger
  • But just to give a little more background (coz
    you should know this):
  • Want to maximize P(T|W)
  • From Bayes' rule we know that:

(the denominator P(W) is a constant for any sentence; see the derivation sketched below)
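A sketch of the Bayes-rule rewriting the slide is annotating (notation mine):

$$
\hat{T} \;=\; \arg\max_T P(T \mid W) \;=\; \arg\max_T \frac{P(W \mid T)\,P(T)}{P(W)} \;=\; \arg\max_T P(W \mid T)\,P(T)
$$

since $P(W)$ does not depend on the tag sequence $T$.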
30
Transformation-based learning
31
Transformation-based learning
32
Example rules from Brills thesis
33
Some problems with tagging for English
  • Prenominal NN, or NNP (proper name), or JJ
    (adjective)?
  • Brown Corpus
  • RP (particle) or RB (adverb) or IN (preposition)?
  • run up a pipe
  • run up a bill
  • VBD (past verb) versus VBN (past participle)
    versus JJ
  • The vase was broken yesterday; it was fine the
    day before.
  • The vase was broken yesterday but now it's
    fixed.

34
Summary
  • Various approaches to tagging
  • Source-channel (HMM) taggers
  • Hand-built constraint-based approaches
  • Transformation-based learning
  • For many years tagging was viewed as an end in
    itself, i.e. taggers were evaluated for their own
    sake without considering what they might be used
    for