Transcript and Presenter's Notes

Title: CPSC 503 Computational Linguistics


1
CPSC 503 Computational Linguistics
  • Finish HMMs
  • Part-of-Speech Tagging
  • Lecture 10
  • Giuseppe Carenini

2
Today 13/2
  • Finish HMMs: the three key problems
  • Part-of-speech tagging
  • What it is
  • Why we need it
  • How to do it

3
Hidden Markov Model (Arc Emission)
[Figure: a four-state arc-emission HMM (states s1-s4 plus Start), with transition probabilities on the arcs and output symbols a, b, i emitted on the arcs]
4
Hidden Markov Model
Formal specification as a five-tuple:
  • Set of states
  • Output alphabet
  • Initial state probabilities
  • State transition probabilities
  • Symbol emission probabilities
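For reference, here is the five-tuple written out in Manning/Schütze-style arc-emission notation; the slide's own formulas are not in the transcript, so the exact symbols below are an assumption:

```latex
\mu = (S, K, \Pi, A, B):
\begin{align*}
S   &= \{s_1, \dots, s_N\}                                   && \text{set of states}\\
K   &= \{k_1, \dots, k_M\}                                   && \text{output alphabet}\\
\Pi &= \{\pi_i\},\ \pi_i = P(X_1 = s_i)                      && \text{initial state probabilities}\\
A   &= \{a_{ij}\},\ a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i) && \text{state transition probabilities}\\
B   &= \{b_{ijk}\},\ b_{ijk} = P(o_t = k \mid X_t = s_i, X_{t+1} = s_j) && \text{symbol emission probabilities (on arcs)}
\end{align*}
```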
5
Three fundamental questions for HMMs
  • Decoding: finding the probability of an observation
  • brute force or the Forward/Backward algorithm
  • Finding the best state sequence
  • the Viterbi algorithm

Training: find the model parameters which best explain the observations
(Manning/Schütze, 2000: 325)
6
Computing the probability of an observation sequence
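A sketch of what this slide computes, in the same arc-emission notation (the brute-force sum; the slide's own equations are not in the transcript):

```latex
P(O \mid \mu) \;=\; \sum_{X} P(O \mid X, \mu)\, P(X \mid \mu)
             \;=\; \sum_{X_1 \cdots X_{T+1}} \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}}\, b_{X_t X_{t+1} o_t}
```

The sum ranges over N^(T+1) state sequences, which is why the forward/backward procedures are needed.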
7
Decoding Example
  s1, s1, s1: 0
  s1, s2, s1: 0
  ...
  s1, s4, s4: .6 × .7 × .6 × .4 × .5
  s2, s4, s3: 0
  s2, s1, s4: .4 × .4 × .7 × 1 × .5
  ...
(Manning/Schütze, 2000: 327)
8
The forward procedure
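The slide's equations are not in the transcript; a sketch of the standard forward recursion (arc-emission notation, following Manning/Schütze):

```latex
\alpha_i(t) = P(o_1 \cdots o_{t-1},\, X_t = i \mid \mu)
\alpha_i(1) = \pi_i, \quad 1 \le i \le N
\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le j \le N,\ 1 \le t \le T
P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)
```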
9
The backward procedure
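Likewise, a sketch of the standard backward recursion:

```latex
\beta_i(t) = P(o_t \cdots o_T \mid X_t = i, \mu)
\beta_i(T+1) = 1, \quad 1 \le i \le N
\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ij o_t}\, \beta_j(t+1), \quad 1 \le i \le N,\ 1 \le t \le T
P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)
```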
10
Combining backward and forward
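Combining the two (again a sketch in the same notation): for any time t,

```latex
P(O,\, X_t = i \mid \mu) = \alpha_i(t)\, \beta_i(t)
P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t), \quad 1 \le t \le T+1
```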
11
Finding the Best State Sequence
$\delta_j(t)$: probability of the most probable path that leads to node $j$ at time $t$
  • The Viterbi Algorithm (a code sketch follows below)
  • Initialization: $\delta_j(1) = \pi_j$, $1 \le j \le N$
  • Induction: $\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}$, $1 \le j \le N$
  • Store backtrace: $\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}$, $1 \le j \le N$
  • Termination and path readout:
    $\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$
    $\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$
    $P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$

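A minimal runnable sketch of the recursion above, using the arc-emission parameterisation; the array shapes and variable names are illustrative, not from the slides:

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state sequence for an arc-emission HMM.
    pi: (N,) initial probs; A: (N,N) transitions; B: (N,N,M) arc emissions;
    obs: list of T observation indices. Returns (best_path, best_prob)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((T + 1, N))           # delta[t, j] plays the role of delta_j(t+1)
    psi = np.zeros((T + 1, N), dtype=int)  # backtrace psi
    delta[0] = pi                          # initialization: delta_j(1) = pi_j
    for t in range(T):                     # induction over observations o_1..o_T
        scores = delta[t][:, None] * A * B[:, :, obs[t]]  # scores[i, j] = delta_i(t) a_ij b_ij(o_t)
        delta[t + 1] = scores.max(axis=0)
        psi[t + 1] = scores.argmax(axis=0)
    # termination and path readout
    path = [int(np.argmax(delta[T]))]
    for t in range(T, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta[T].max())
```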
12
Parameter Estimation
  • Find the values of the model parameters μ = (A, B, Π) which best explain the observations O
  • Using Maximum Likelihood Estimation, we want to find the values that maximize P(O | μ)
  • There is no known analytic method
  • Iterative hill-climbing algorithm known as Baum-Welch or the Forward-Backward algorithm (a special case of the EM Algorithm)

13
Baum-Welch Algorithm: Key ideas
  • 1) Start with some (perhaps randomly chosen) model.
  • 2) Now you can compute:
  •   the expected number of transitions from i to j
  •   the expected number of transitions from state i
  •   the expected number of transitions from i to j with k observed
  • 3) Now you can compute re-estimates of the model parameters (see the formulas below)
  • 4) Back to 1), now using the re-estimated model
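In symbols, the re-estimates in step 3 are the standard ratios of expected counts (a sketch; the slide's own formulas are not in the transcript):

```latex
\hat{a}_{ij} = \frac{\text{expected number of transitions from } i \text{ to } j}
                    {\text{expected number of transitions from } i}
\qquad
\hat{b}_{ijk} = \frac{\text{expected number of transitions from } i \text{ to } j \text{ with } k \text{ observed}}
                     {\text{expected number of transitions from } i \text{ to } j}
```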

14
Parts of Speech Tagging
  • What is it?
  • Why do we need it?
  • Word classes (Tags)
  • Distribution
  • Tagsets
  • How to do it
  • Rule-based
  • Stochastic
  • Transformation-based

15
Parts of Speech Tagging: What
  • Brainpower_NNP ,_, not_RB physical_JJ plant_NN
    ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ
    asset_NN ._.
  • Tag meanings
  • NNP (Proper N sing), RB (Adv), JJ (Adj), NN (N
    sing. or mass), VBZ (V 3sg pres), DT
    (Determiner), POS (Possessive ending), .
    (sentence-final punct)

16
Parts of Speech Tagging: Why?
  • Part-of-speech (word class, morph. class,
    syntactic category) gives a significant amount of
    info about the word and its neighbors

Useful in the following NLP tasks
  • As a basis for (Partial) Parsing
  • IR
  • Word-sense disambiguation
  • Speech synthesis
  • Improve language models (Spelling/Speech)

17
Parts of Speech
  • Eight basic categories
  • Noun, verb, pronoun, preposition, adjective,
    adverb, article, conjunction
  • These categories are based on
  • morphological properties (affixes they take)
  • distributional properties (what other words can
    occur nearby)
  • e.g., green: "It is so ___", "both ___", "The ___ is"
  • Not semantics!

18
Parts of Speech
  • Two kinds of category
  • Closed class (generally function words): very short, frequent, and important
  • Prepositions, articles, conjunctions, pronouns, determiners, auxiliaries, numerals
  • Open class: objects, actions, events, properties
  • Nouns (proper/common, mass/count), verbs, adjectives, adverbs
  • If you run across an unknown word, it most likely belongs to an open class
19
PoS Distribution
  • Parts of speech follow the usual frequency behavior in language: most words have only 1 PoS, some have 2, and a few have many PoS (unfortunately, those are very frequent words)
  • but luckily the different tags associated with a word are not equally likely
20
Sets of Parts of Speech: Tagsets
  • Most commonly used:
  • 45-tag Penn Treebank,
  • 61-tag C5,
  • 146-tag C7
  • The choice of tagset is based on the application (do you care about distinguishing between to as a preposition and to as an infinitive marker?)
  • Accurate tagging can be done even with large tagsets

21
PoS Tagging
Input text:
  • Brainpower, not physical plant, is now a firm's chief asset.

Tagset
Dictionary: word_i → set of possible tags
Output:
  • Brainpower_NNP ,_, not_RB physical_JJ plant_NN ,_, is_VBZ now_RB a_DT firm_NN 's_POS chief_JJ asset_NN ._.

22
Tagger Types
  • Rule-based
  • Stochastic
  • HMM tagger: > 92% accuracy
  • Transformation-based tagger (Brill): > 95%
  • Maximum Entropy Models: > 97%

23
Rule-Based (ENGTWOL 95)
  1. A lexicon transducer returns for each word all
    possible morphological parses
  2. A set of 1,000 constraints is applied to rule
    out inappropriate PoS

24
HMM Stochastic Tagging
  • Tags correspond to HMM states
  • Words correspond to the HMM alphabet symbols

Tagging: given a sequence of words (observations), find the most likely sequence of tags (states)
But this is exactly the "finding the best state sequence" problem (Viterbi)!
We need the state transition and symbol emission probabilities:
1) Tagged training corpus → estimate them by counting (MLE) (see the sketch below)
2) No corpus → parameter estimation (Baum-Welch)
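A minimal sketch of option 1), estimating the probabilities by counting in a tagged corpus. This uses a state-emission parameterisation (each tag emits a word), as HMM taggers commonly do; the corpus format, example tags, and unsmoothed normalisation are illustrative assumptions:

```python
from collections import defaultdict

def estimate_hmm(tagged_sentences):
    """MLE estimates for an HMM tagger.
    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    Returns transition P(tag_i | tag_{i-1}) and emission P(word | tag) dicts."""
    trans_counts = defaultdict(lambda: defaultdict(int))
    emit_counts = defaultdict(lambda: defaultdict(int))
    for sent in tagged_sentences:
        prev = "<s>"                         # sentence-start pseudo-tag
        for word, tag in sent:
            trans_counts[prev][tag] += 1
            emit_counts[tag][word.lower()] += 1
            prev = tag
    def normalize(counts):
        return {ctx: {x: c / sum(row.values()) for x, c in row.items()}
                for ctx, row in counts.items()}
    return normalize(trans_counts), normalize(emit_counts)

# Illustrative usage (tags are just examples):
# A, B = estimate_hmm([[("The", "DT"), ("race", "NN"), ("is", "VBZ"), ("over", "RB")]])
```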
25
Transformation-Based Learning (the Brill Tagger, '95-'97)
Combines rule-based and stochastic approaches
  • Rules specify tags for words based on context
  • Rules are automatically induced from a
    pre-tagged training corpus

26
TBL: How TBL rules are applied
Step 1: Assign each word the tag that is most likely given no contextual information.
Race example: P(NN | race) = .98, P(VB | race) = .02
Step 2: Apply transformation rules that use the context that was just established (see the sketch below).
Race example: Change NN to VB when the previous tag is TO.
Johanna is expected to race tomorrow. The race is already over.
...
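A toy sketch of the two steps, with the race rule hard-coded; the lexicon, tag names, and rule format are illustrative, not the Brill tagger's actual data structures:

```python
def tbl_tag(words, most_likely_tag, rules):
    """Step 1: assign each word its most likely tag (no context).
    Step 2: apply each transformation rule, in order, left to right."""
    tags = [most_likely_tag.get(w.lower(), "NN") for w in words]    # step 1
    for from_tag, to_tag, prev_tag in rules:                         # step 2
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(words, tags))

# Illustrative lexicon: P(NN|race)=.98 > P(VB|race)=.02, so race starts out as NN
most_likely = {"johanna": "NNP", "is": "VBZ", "expected": "VBN",
               "to": "TO", "race": "NN", "tomorrow": "NN"}
rules = [("NN", "VB", "TO")]   # change NN to VB when the previous tag is TO
print(tbl_tag("Johanna is expected to race tomorrow".split(), most_likely, rules))
```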
27
How TBL Rules are learned
  • Major stages (supervised!):
  • 0. Save the hand-tagged corpus.
  • 1. Label every word with its most-likely tag.
  • 2. Examine every possible transformation and select the one that most improves the tagging.
  • 3. Retag the data according to this rule.
  • 4. Repeat 2-3 until some stopping point is reached.

Output: an ordered list of transformations (see the sketch below)
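A compact sketch of the greedy learning loop described in stages 1-4; apply_rule, the error count, and min_gain are placeholders for illustration, not Brill's actual implementation:

```python
def learn_tbl(corpus_words, gold_tags, initial_tags, candidate_rules, apply_rule, min_gain=1):
    """Greedy TBL learning: repeatedly pick the candidate transformation
    that most improves agreement with the hand-tagged corpus."""
    current = list(initial_tags)            # stage 1: most-likely-tag labelling (done by caller)
    learned = []
    def errors(tags):
        return sum(t != g for t, g in zip(tags, gold_tags))
    while True:
        best_rule, best_tags = None, None
        for rule in candidate_rules:        # stage 2: try every transformation
            retagged = apply_rule(rule, corpus_words, current)
            if best_tags is None or errors(retagged) < errors(best_tags):
                best_rule, best_tags = rule, retagged
        if best_rule is None or errors(current) - errors(best_tags) < min_gain:
            break                           # stage 4: stop when improvement is too small
        current = best_tags                 # stage 3: retag the data with the chosen rule
        learned.append(best_rule)
    return learned                          # output: an ordered list of transformations
```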
28
The Universe of Possible Transformations?
Change tag a to b if
Huge search space!
29
Evaluating Taggers
  • Accuracy: percent correct (most current taggers 96-97%); test on unseen data!
  • Human ceiling: agreement rate of humans on the classification (96-97%)
  • Unigram baseline: assign each token the class it occurred in most frequently in the training set (race → NN).
  • What is causing the errors? Build a confusion matrix

30
Knowledge-Formalisms Map (including probabilistic formalisms)
  • Morphology: State Machines (and prob. versions) (Finite State Automata, Finite State Transducers, Markov Models)
  • Syntax: Rule systems (and prob. versions) (e.g., (Prob.) Context-Free Grammars)
  • Semantics: Logical formalisms (First-Order Logics)
  • Pragmatics (Discourse and Dialogue): AI planners
31
Next Time
  • Read about tagging unknown words
  • Read Chapter 9