IE With Undirected Models: the saga continues

1
IE With Undirected Models: the saga continues
  • William W. Cohen
  • CALD

2
Announcements
  • Upcoming assignments
  • Mon 2/23: Toutanova et al.
  • Wed 2/25: Klein & Manning; intro to max-margin theory
  • Mon 3/1: no writeup due
  • Wed 3/3: project proposal due
  • personnel; 1-2 pages
  • Spring break week, no class

3
Motivation for CMMs
[Figure: an HMM-style chain with state nodes S_{t-1}, S_t, S_{t+1} over observation nodes O_{t-1}, O_t, O_{t+1}. At position t the word is "Wisniewski": it ends in "-ski" and is part of a noun phrase.]
Possible features of a word: identity of the word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in a hyperlink anchor; ...
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.
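In symbols (notation mine, not from the slide), the local maxent model of a CMM/MEMM conditions the current state on the previous state and the current observation:

  P(s_t \mid s_{t-1}, o_t) = \frac{\exp\big(\sum_k \lambda_k f_k(s_t, s_{t-1}, o_t)\big)}{\sum_{s'} \exp\big(\sum_k \lambda_k f_k(s', s_{t-1}, o_t)\big)},
  \qquad P(s_{1:T} \mid o_{1:T}) = \prod_t P(s_t \mid s_{t-1}, o_t),

where the features f_k can be arbitrary functions of the observation, such as "ends in -ski" or "is capitalized".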
4
Implications of the model
  • Does this do what we want?
  • Q: does Y_{i-1} depend on X_{i+1}?
  • A node is conditionally independent of its non-descendants given its parents.

5
CRF model
[Figure: a linear-chain CRF over label nodes y1, y2, y3, y4, all conditioned on the observation x.]
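For reference (standard form, notation mine), the linear-chain CRF defines the conditional distribution over the whole label sequence:

  P(y_{1:T} \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_t \sum_k \lambda_k f_k(y_{t-1}, y_t, x, t)\Big),
  \qquad Z(x) = \sum_{y'} \exp\Big(\sum_t \sum_k \lambda_k f_k(y'_{t-1}, y'_t, x, t)\Big),

so normalization is global over sequences rather than local per state.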
6
Dependency Nets
7
Dependency nets
  • Learning is simple and elegant (if you know each node's Markov blanket): just learn a probabilistic classifier for P(X | pa(X)) for each node X.

[Figure: a chain of label nodes y1, y2, y3, y4 over a shared observation x, with local conditionals Pr(y1 | x, y2), Pr(y2 | x, y1, y3), Pr(y3 | x, y2, y4), Pr(y4 | x, y3).]
Learning is local, but inference is not, and need not be unidirectional.
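A minimal toy sketch of this recipe (my own code, not from the slides), assuming binary labels and a simple chain: train one local probabilistic classifier per node from its Markov blanket, then run Gibbs sampling for joint inference.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: n label chains of length T with one real-valued observation per position.
n, T = 500, 4
X = rng.normal(size=(n, T))
Y = (X + 0.5 * np.roll(X, 1, axis=1) > 0).astype(int)

def neighbor_features(X, Y, t):
    """Features for node t: its own observation plus the neighboring labels."""
    feats = [X[:, t]]
    if t > 0:
        feats.append(Y[:, t - 1])
    if t < T - 1:
        feats.append(Y[:, t + 1])
    return np.column_stack(feats)

# Local learning: one classifier per node, i.e. P(y_t | x_t, y_{t-1}, y_{t+1}).
models = [LogisticRegression().fit(neighbor_features(X, Y, t), Y[:, t]) for t in range(T)]

def gibbs(x, n_sweeps=50):
    """Joint inference is not local: repeatedly resample each label from its conditional."""
    y = rng.integers(0, 2, size=T)
    for _ in range(n_sweeps):
        for t in range(T):
            p1 = models[t].predict_proba(neighbor_features(x[None, :], y[None, :], t))[0, 1]
            y[t] = int(rng.random() < p1)
    return y

print("sampled labels:", gibbs(X[0]), " true labels:", Y[0])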
8
Toutanova, Klein, Manning, Singer
  • Dependency nets for POS tagging vs. CMMs.
  • Maxent is used for the local conditional models.
  • Goals:
  • An easy-to-train bidirectional model
  • A really good POS tagger

9
Toutanova et al
  • Don't use Gibbs sampling for inference; instead use a Viterbi variant (which is not guaranteed to produce the ML sequence)

Example: D = {(1,1), (1,1), (1,1), (1,2), (2,1), (3,3)}. The ML joint state is (1,1), but P(a=1|b=1) P(b=1|a=1) < 1 = P(a=3|b=3) P(b=3|a=3), so the product of local conditionals prefers (3,3).
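Working through the counts in D (my arithmetic, assuming the reconstruction above): b=1 occurs in four pairs, three of which have a=1, and symmetrically for a=1, so

  P(a=1 | b=1) P(b=1 | a=1) = (3/4)(3/4) = 9/16 < 1 = P(a=3 | b=3) P(b=3 | a=3),

even though (1,1) is three times more frequent than (3,3) in the data.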
10
Results with model
Final test-set results (from the slide's table): MXPost 47.6, 96.4, 86.2; CRF 95.7, 76.4
11
Klein & Manning: Conditional Structure vs. Estimation
12
Task 1: WSD (Word Sense Disambiguation)
"Bush's election-year ad campaign will begin this summer, with..." (sense 1)
"Bush whacking is tiring but rewarding: who wants to spend all their time on marked trails?" (sense 2)
Class is sense1/sense2; features are the context words.
13
Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model.
Use the conditional rule to predict sense s from the context-word observations o. Standard NB training maximizes the joint likelihood under the independence assumption.
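In symbols (my notation; the slide shows the corresponding equations as images):

  P(s, o) = P(s) \prod_j P(o_j \mid s),
  \qquad \hat{s} = \arg\max_s P(s \mid o) = \arg\max_s P(s) \prod_j P(o_j \mid s),

and joint-likelihood (JL) training maximizes \sum_i \log P(s_i, o_i), which has the familiar closed-form relative-frequency solution.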
14
Task 1: WSD (Word Sense Disambiguation)
Model 2: Keep the same functional form, but maximize conditional likelihood (sound familiar?)
...or maybe the SenseEval score
...or maybe even...
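The conditional-likelihood (CL) objective referred to here, written out in my notation:

  CL(\theta) = \sum_i \log P_\theta(s_i \mid o_i)
             = \sum_i \log \frac{P(s_i) \prod_j P(o_{ij} \mid s_i)}{\sum_{s'} P(s') \prod_j P(o_{ij} \mid s')}.

Unlike JL, this has no closed-form maximizer, which is why the next slide optimizes it with conjugate gradient.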
15
Task 1: WSD (Word Sense Disambiguation)
  • Optimize JL with standard NB learning
  • Optimize SCL and CL with conjugate gradient (a toy CL-optimization sketch follows this list)
  • Also over non-deficient models (?), using Lagrange penalties to enforce a soft version of the deficiency constraint
  • I think this makes sure the non-conditional version is a valid probability
  • Don't even try optimizing accuracy
  • Penalty for extreme predictions in SCL
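A minimal sketch of CL training by conjugate gradient (my own toy code, not the paper's implementation; the NB-shaped weights are left unnormalized, which is exactly the deficiency issue the Lagrange penalties above are meant to address):

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
V, n = 20, 300                                    # vocabulary size, number of examples
theta_true = rng.dirichlet(np.ones(V), size=2)    # per-sense word distributions
senses = rng.integers(0, 2, size=n)
counts = np.array([rng.multinomial(30, theta_true[s]) for s in senses])

def neg_cond_loglik(params):
    # params: 2 class log-weights followed by 2*V word log-weights (NB functional form).
    log_prior = params[:2]
    log_w = params[2:].reshape(2, V)
    scores = counts @ log_w.T + log_prior                           # (n, 2) class scores
    log_post = scores - np.logaddexp(scores[:, 0], scores[:, 1])[:, None]
    return -log_post[np.arange(n), senses].sum()

res = minimize(neg_cond_loglik, np.zeros(2 + 2 * V), method="CG")   # conjugate gradient
pred = (counts @ res.x[2:].reshape(2, V).T + res.x[:2]).argmax(axis=1)
print("training accuracy of the CL-trained model:", (pred == senses).mean())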

16
(No Transcript)
17
Task 2: POS Tagging
  • Sequential problem
  • Replace NB with an HMM model.
  • Standard algorithms maximize joint likelihood
  • Claim: keeping the same model but maximizing conditional likelihood leads to a CRF
  • Is this true?
  • Alternative is conditional structure (CMM)
18
Using conditional structure vs. maximizing conditional likelihood
The CMM factors Pr(s,o) into Pr(s|o) Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o); i.e., the JL estimate = the CL estimate of Pr(s|o).
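In symbols (my notation), the factorization behind this claim:

  P(s, o) = P(o) \, P(s \mid o) = P(o) \prod_t P(s_t \mid s_{t-1}, o),

so the joint log-likelihood splits into a term for P(o), which does not involve the state parameters, plus the conditional term; maximizing JL and CL therefore yields the same estimate of P(s | o), and extra dependencies among the observations only change the P(o) factor.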
19
Task 2: POS Tagging
Experiments with a simple feature set: for a fixed model, CL is preferred to JL (the CRF beats the HMM); for a fixed objective, the HMM is preferred to the MEMM/CMM.
20
Error analysis for POS tagging
  • Label bias is not the issue
  • state-state dependencies are weak compared to observation-state dependencies
  • too much emphasis on the observation, not enough on previous states (observation bias)
  • put another way: label bias predicts overprediction of states with few outgoing transitions, or, more generally, low entropy...

21
Error analysis for POS tagging
22
Background for next week: the last 20 years of learning theory
23
Milestones in learning theory
  • Valiant 1984 CACM
  • Turing machines and Turing tests: formal analysis of AI problems
  • Chernoff bound shows that Prob(error of h ≥ ε and h is consistent with m examples) < δ
  • So given m examples, can afford to examine about 2^m hypotheses (see the derivation after this list)
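Unpacking this (the standard Occam/union-bound argument; the derivation is mine, not from the slide):

  Pr[\exists h \in H:\ err(h) \ge \epsilon \ \text{and}\ h \text{ is consistent with } m \text{ examples}] \le |H|(1-\epsilon)^m \le |H| e^{-\epsilon m},

which is at most \delta whenever m \ge \frac{1}{\epsilon}\big(\ln|H| + \ln\frac{1}{\delta}\big); equivalently, m examples let you safely search a hypothesis class of size roughly e^{\epsilon m}, i.e. 2^{O(m)}.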

24
Milestones in learning theory
  • Haussler AAAI86
  • Pick a small hypothesis from a large set
  • Given m examples, can learn a hypothesis of size O(m) bits
  • Blumer, Ehrenfeucht, Haussler, Warmuth, STOC88
  • Generalize the notion of hypothesis size to VC-dimension.

25
More milestones....
  • Littlestone MLJ88: Winnow algorithm
  • Learning a small hypothesis in many dimensions, in the mistake-bounded model
  • Mistake bound ≥ VC-dim.
  • Blum COLT91
  • Learning over infinitely many attributes in
    mistake-bounded model
  • Learning as compression as learning...

26
More milestones....
  • Freund & Schapire 1996
  • boosting C4.5, even to extremes, does not overfit the data (!?) -- how does this reconcile with Occam's razor?
  • Vapnik's support vector machines
  • kernel representation of a function
  • true optimization in machine learning
  • boosting as iterative margin maximization

27
(No Transcript)
28
(No Transcript)
29
(No Transcript)
30
Comments
  • For bag-of-words text, R^2 ≈ the number of words in the doc
  • Vocabulary size does not matter
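A hedged note on where that R comes from (my reading, assuming R is the data radius in the usual margin bounds): the perceptron/SVM margin bounds scale as (R/γ)^2 with R = max_i ||x_i||, and for 0/1 bag-of-words vectors ||x||^2 is just the number of distinct words in the document, so R^2 is about document length and does not grow with the vocabulary.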

31
(No Transcript)