Transcript and Presenter's Notes

Title: Log-Linear Models in NLP


1
Log-Linear Models in NLP
  • Noah A. Smith
  • Department of Computer Science /
  • Center for Language and Speech Processing
  • Johns Hopkins University
  • nasmith@cs.jhu.edu

2
Outline
  • Maximum Entropy principle
  • Log-linear models
  • Conditional modeling for classification
  • Ratnaparkhi's tagger
  • Conditional random fields
  • Smoothing
  • Feature Selection

3
Data
For now, we're just talking about modeling data.
No task.
How to assign probability to each shape type?
4
Maximum Likelihood
[Table: empirical counts and relative frequencies for the 12 shape types; counts 3, 2, 0, 1, 4, 3, 1, 0, 0, 1, 0, 1 out of 16 give probabilities .19, .12, 0, .06, .25, .19, .06, 0, 0, .06, 0, .06]
Fewer parameters?
How to smooth?
11 degrees of freedom (12 − 1).
5
Some other kinds of models
11 degrees of freedom (1 + 4 + 6).
[Table: Pr(Color) = 0.5 / 0.5; Pr(Shape | Color) = 0.125, 0.375, 0.500 for each color; Pr(Size | Color, Shape) entries such as large 0.000 / small 1.000, large 0.333 / small 0.667, large 0.250 / small 0.750, large 1.000 / small 0.000. Two pairs of the conditional size distributions are the same.]
Pr(Color, Shape, Size) = Pr(Color) Pr(Shape | Color) Pr(Size | Color, Shape)
6
Some other kinds of models
9 degrees of freedom (1 + 2 + 6).
[Table: Pr(Color) = 0.5 / 0.5; Pr(Shape) = 0.125, 0.375, 0.500; Pr(Size | Color, Shape) entries as on the previous slide]
Pr(Color, Shape, Size) = Pr(Color) Pr(Shape) Pr(Size | Color, Shape)
7
Some other kinds of models
7 degrees of freedom (1 + 2 + 4).
[Table: Pr(Size) = 0.375 (large) / 0.625 (small); Pr(Shape | Size) and Pr(Color | Size) conditional tables, e.g. large 0.667 / 0.333 and small 0.462 / 0.538 / 0.077 / 0.385]
No zeroes here ...
Pr(Color, Shape, Size) = Pr(Size) Pr(Shape | Size) Pr(Color | Size)
8
Some other kinds of models
4 degrees of freedom (1 + 2 + 1).
[Table: Pr(Color) = 0.5 / 0.5; Pr(Size) = 0.375 (large) / 0.625 (small); Pr(Shape) = 0.125, 0.375, 0.500]
Pr(Color, Shape, Size) = Pr(Size) Pr(Shape) Pr(Color)
9
This is difficult.
  • Different factorizations affect
  • smoothing
  • parameters (model size)
  • model complexity
  • interpretability
  • goodness of fit ...

Usually, this isn't done empirically, either!
10
Desiderata
  • You decide which features to use.
  • Some intuitive criterion tells you how to use
    them in the model.
  • Empirical.

11
Maximum Entropy
  • Make the model as uniform as possible ...
  • but I noticed a few things that I want to model
    ...
  • so pick a model that fits the data on those
    things.

12
Occam's Razor
One should not increase, beyond what is
necessary, the number of entities required to
explain anything.
13
Uniform model

[Table: all 12 cells equal to 1/12 ≈ 0.083]
14
Constraint: Pr(small) = 0.625

[Table: each small-row cell rises to 0.104 (summing to 0.625); each large-row cell falls to 0.063]
15
Pr(⟨…⟩, small) = 0.048
[Table: the two small cells for that type drop to 0.024 each (summing to 0.048); the other small cells rise to 0.144; large cells stay at 0.063; Pr(small) is still 0.625]
16
Pr(large, ⟨…⟩) = 0.125
[Table: unchanged so far; small cells 0.024 / 0.144 / 0.144, large cells 0.063 each, with Pr(⟨…⟩, small) = 0.048 and Pr(small) = 0.625 still satisfied; the marginal for the new constraint is marked '?']
17
Questions
Is there an efficient way to solve this problem?
Does a solution always exist?
Is there a way to express the model succinctly?
What to do if it doesn't?
18
Entropy
  • A statistical measurement on a distribution.
  • Measured in bits.
  • H(p) ∈ [0, log₂|X|]
  • High entropy: close to uniform
  • Low entropy: close to deterministic
  • Concave in p.
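As a small illustration of these bullets, here is a minimal sketch (not from the slides) that computes the entropy in bits of a distribution given as a list of probabilities:

```python
import math

def entropy(p):
    """Entropy in bits of a discrete distribution (a list of probabilities)."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Uniform over the 12 shape types: the maximum, log2(12) ~ 3.58 bits.
print(entropy([1.0 / 12] * 12))
# A near-deterministic distribution: much lower entropy.
print(entropy([0.97, 0.01, 0.01, 0.01]))
```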

19
The Max Ent Problem
[Figure: the entropy H plotted over the probability simplex (axes p1, p2)]
20
The Max Ent Problem
  • Objective function: H
  • Picking a distribution: the probabilities sum to 1 ... and are nonnegative
  • n constraints: for each feature, the expected value under the model equals the expected value from the data
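In symbols, the constrained problem these callouts describe can be written as follows (a reconstruction, with features f_i and observed data D):

```latex
\max_{p}\; H(p) = -\sum_{x} p(x)\,\log_2 p(x)
\quad\text{subject to}\quad
\sum_{x} p(x) = 1,\qquad p(x) \ge 0 \;\;\forall x,
\qquad
\sum_{x} p(x)\, f_i(x) \;=\; \frac{1}{|D|}\sum_{x' \in D} f_i(x'),
\quad i = 1, \dots, n.
```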
21
The Max Ent Problem
[Figure: the entropy surface over the simplex (axes p1, p2), revisited]
22
About feature constraints
  • f(x) = 1 if x is large and light, 0 otherwise
  • f(x) = 1 if x is small, 0 otherwise
23
Mathematical Magic
constrained, |X| variables (p), concave in p  →  unconstrained, N variables (λ), concave in λ
24
What's the catch?
The model takes on a specific, parameterized
form. It can be shown that any max-ent model
must take this form.
25
Outline
  • Maximum Entropy principle
  • Log-linear models
  • Conditional modeling for classification
  • Ratnaparkhi's tagger
  • Conditional random fields
  • Smoothing
  • Feature Selection

26
Log-linear models
log Pr(x) is a linear function of the features (plus a constant): hence "log-linear".
27
Log-linear models
One parameter (λi) for each feature.
Unnormalized probability, or weight
Partition function
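Written out (a reconstruction of the formula the slide annotates), the model is:

```latex
p_{\lambda}(x) \;=\; \frac{1}{Z(\lambda)} \exp\!\Big( \sum_{i} \lambda_i f_i(x) \Big),
\qquad
Z(\lambda) \;=\; \sum_{x'} \exp\!\Big( \sum_{i} \lambda_i f_i(x') \Big),
```

so that log p(x) is linear in the features, up to the constant −log Z(λ).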
28
Mathematical Magic
Max ent problem: constrained, |X| variables (p), concave in p
Log-linear ML problem: unconstrained, N variables (λ), concave in λ
These are duals: they have the same solution.
29
What does MLE mean?
Independence among examples
Arg max is the same in the log domain
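The two callouts correspond to the two steps of the standard derivation (a reconstruction):

```latex
\hat{\lambda}
\;=\; \arg\max_{\lambda} \prod_{i=1}^{m} p_{\lambda}(x_i)      % independence among examples
\;=\; \arg\max_{\lambda} \sum_{i=1}^{m} \log p_{\lambda}(x_i)  % arg max unchanged in the log domain
```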
30
MLE Then and Now
Directed models                               Log-linear models
Concave                                       Concave
Constrained (simplex)                         Unconstrained
Count and normalize (closed-form solution)    Iterative methods
31
Iterative Methods
All of these methods are correct and will converge to the right answer; it's just a matter of how fast.
  • Generalized Iterative Scaling
  • Improved Iterative Scaling
  • Gradient Ascent
  • Newton/Quasi-Newton Methods
  • Conjugate Gradient
  • Limited-Memory Variable Metric
  • ...
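As an illustration of the simplest of these, here is a minimal sketch of plain gradient ascent for log-linear MLE over a small, enumerable domain (the names, learning rate, and iteration count are illustrative; real systems use L-BFGS or conjugate gradient):

```python
import math

def train_loglinear(data, domain, feats, n_feats, lr=0.5, iters=500):
    """Gradient ascent on the average log-likelihood of a log-linear model.
    `feats(x)` returns a length-n_feats list of feature values for x."""
    lam = [0.0] * n_feats
    # Observed (empirical) feature expectations.
    observed = [sum(feats(x)[k] for x in data) / len(data) for k in range(n_feats)]
    for _ in range(iters):
        # Model expectations: weight each x in the domain by exp(lambda . f(x)).
        weights = [math.exp(sum(l * f for l, f in zip(lam, feats(x)))) for x in domain]
        Z = sum(weights)
        expected = [sum(w * feats(x)[k] for w, x in zip(weights, domain)) / Z
                    for k in range(n_feats)]
        # Gradient of the average log-likelihood is (observed - expected).
        lam = [l + lr * (o - e) for l, o, e in zip(lam, observed, expected)]
    return lam
```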

32
Questions
Is there an efficient way to solve this problem? Yes, many iterative methods.
Does a solution always exist? Yes, if the constraints come from the data.
Is there a way to express the model succinctly? Yes, a log-linear model.
33
Outline
  • Maximum Entropy principle
  • Log-linear models
  • Conditional modeling for classification
  • Ratnaparkhi's tagger
  • Conditional random fields
  • Smoothing
  • Feature Selection

34
Conditional Estimation
Examples and labels.
Training objective and classification rule (written out below):
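The two boxes on this slide can be reconstructed as follows (with training pairs (x_i, y_i)):

```latex
\text{Training objective:}\quad
\hat{\lambda} = \arg\max_{\lambda} \sum_{i} \log p_{\lambda}(y_i \mid x_i)
\qquad\qquad
\text{Classification rule:}\quad
\hat{y}(x) = \arg\max_{y} p_{\lambda}(y \mid x)
```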
35
Maximum Likelihood
[Figure: the joint (label, object) grid]
36
Maximum Likelihood
[Figure: the joint (label, object) grid]
37
Maximum Likelihood
[Figure: the joint (label, object) grid]
38
Maximum Likelihood
[Figure: the joint (label, object) grid]
39
Conditional Likelihood
[Figure: the same grid; only the distribution of labels given each object is modeled]
40
Remember
  • log-linear models
  • conditional estimation

41
The Whole Picture
        Directed models                     Log-linear models
MLE     Count and normalize                 Unconstrained concave optimization
CLE     Constrained concave optimization    Unconstrained concave optimization
42
Log-linear models: MLE vs. CLE
MLE: the partition function sums over all example types × all labels.
CLE: the partition function sums over all labels only.
43
Classification Rule
  • Pick the most probable label y

We don't need to compute the partition function at test time!
But it does need to be computed during training.
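A minimal sketch of that point (the feature function `feats(x, y)`, returning a dict of active feature names and values, and the weight dict `lam` are assumed interfaces):

```python
def classify(x, labels, feats, lam):
    """Return the highest-scoring label. Z(x) is the same for every label,
    so the unnormalized score is enough at test time."""
    def score(y):
        return sum(lam.get(k, 0.0) * v for k, v in feats(x, y).items())
    return max(labels, key=score)
```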
44
Outline
  • Maximum Entropy principle
  • Log-linear models
  • Conditional modeling for classification
  • Ratnaparkhi's tagger
  • Conditional random fields
  • Smoothing
  • Feature Selection

45
Ratnaparkhi's POS Tagger (1996)
  • Probability model
  • Assume unseen words behave like rare words.
  • Rare words: count < 5
  • Training: GIS (Generalized Iterative Scaling)
  • Testing/decoding: beam search (sketched below)
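A sketch of a beam-search decoder in this style (the interface `score(tags_so_far, words, i, tag)`, returning log Pr(tag | context), and the beam width are assumptions, not details from the slide):

```python
def beam_search_tag(words, tagset, score, beam=5):
    """Left-to-right tagging, keeping only the `beam` best partial tag sequences."""
    hyps = [([], 0.0)]                       # (tags so far, cumulative log-prob)
    for i, _ in enumerate(words):
        expanded = [(tags + [t], lp + score(tags, words, i, t))
                    for tags, lp in hyps for t in tagset]
        expanded.sort(key=lambda h: h[1], reverse=True)
        hyps = expanded[:beam]               # prune to the beam
    return hyps[0][0]                        # best complete tag sequence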

46
Features common words
the stories about well-heeled communities and developers
DT NNS IN JJ NNS CC NNS
Features that fire when tagging "about" as IN:
  current word = about
  previous word = stories
  word two back = the
  next word = well-heeled
  word two ahead = communities
  previous tag = NNS
  previous two tags = DT NNS
47
Features rare words
the stories about well-heeled communities and developers
DT NNS IN JJ NNS CC NNS
Features that fire when tagging the rare word "well-heeled" as JJ:
  previous word = about
  word two back = stories
  next word = communities
  word two ahead = and
  previous tag = IN
  previous two tags = NNS IN
  contains a hyphen
  suffixes: -d, -ed, -led, -eled
  prefixes: w-, we-, wel-, well-
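Putting the templates above into code, a sketch of a rare-word feature extractor (feature names are illustrative; the digit and uppercase cues come from Ratnaparkhi's paper rather than this slide):

```python
def rare_word_features(words, i, prev_tag, prev2_tag):
    """Feature templates for a rare word at position i: surrounding context
    plus spelling cues (prefixes, suffixes, hyphen, digit, uppercase)."""
    w = words[i]
    feats = [f"prev_tag={prev_tag}",
             f"prev_two_tags={prev2_tag},{prev_tag}"]
    if i > 0:
        feats.append(f"prev_word={words[i-1]}")
    if i + 1 < len(words):
        feats.append(f"next_word={words[i+1]}")
    feats += [f"prefix={w[:k]}" for k in range(1, 5) if len(w) > k]
    feats += [f"suffix={w[-k:]}" for k in range(1, 5) if len(w) > k]
    if "-" in w:
        feats.append("contains_hyphen")
    if any(c.isdigit() for c in w):
        feats.append("contains_digit")
    if any(c.isupper() for c in w):
        feats.append("contains_uppercase")
    return feats
```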
48
The Label Bias Problem
Training data (10 sentences):
born to run → VBN TO VB (6)
born to wealth → VBN IN NN (4)
49
The Label Bias Problem
Scoring "born to wealth" with the locally normalized tagger:
Pr(VBN | born) × Pr(IN | VBN, to) × Pr(NN | VBN, IN, wealth) = 1 × .4 × 1 = .4
Pr(VBN | born) × Pr(TO | VBN, to) × Pr(VB | VBN, TO, wealth) = 1 × .6 × 1 = .6
[Figure: the tagging lattice for "born to ...", with states VBN, (VBN, IN), (IN, NN), (VBN, TO), (TO, VB)]
The wrong path (VBN TO VB) wins: once the model is in state (VBN, TO), it must move to VB with probability 1, no matter what the word is.
50
Is this symptomatic of log-linear models?
No!
51
Tagging Decisions
[Figure: a left-to-right lattice of tagging decisions (tag1, tag2 ∈ {A, B, C, D}, tag3, ..., tagn)]
At each decision point, the total weight is 1.
Choose the path with the greatest weight; you never pay a penalty for it!
You must choose tag2 = B, even if B is a terrible tag for word2: Pr(tag2 = B | anything at all!) = 1.
52
Tagging Decisions in an HMM
[Figure: the same lattice of tagging decisions, now scored by an HMM]
At each decision point, the total weight can be 0.
Choose the path with the greatest weight.
You may choose to discontinue this path if B can't tag word2, or pay a high cost.
53
Outline
  • Maximum Entropy principle
  • Log-linear models
  • Conditional modeling for classification
  • Ratnaparkhi's tagger
  • Conditional random fields
  • Smoothing
  • Feature Selection

54
Conditional Random Fields
  • Lafferty, McCallum, and Pereira (2001)
  • Whole sentence model with local features

55
Simple CRFs as Graphs
[Figure: a linear-chain CRF over "My cat begs silently" with tags PRP NN VBZ ADV: feature weights, added together]
Compare with an HMM over the same sentence: log-probs, added together.
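A sketch of the "weights, added together" idea: the unnormalized score of one tagging is the sum of λ-weights of the local features that fire at each position (the `local_feats` interface and the weight dict `lam` are assumptions):

```python
def path_score(words, tags, lam, local_feats):
    """Unnormalized CRF score of one tag sequence: lambda-weights of the
    active local features, summed over positions."""
    total, prev = 0.0, "<s>"
    for i, tag in enumerate(tags):
        for f in local_feats(words, i, prev, tag):
            total += lam.get(f, 0.0)
        prev = tag
    return total
```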
56
What can CRFs do that HMMs can't?
[Figure: "My cat begs silently" tagged PRP NN VBZ ADV]
57
An Algorithmic Connection
What is the partition function? The total weight of all paths.
58
CRF weight training
  • Maximize log-likelihood: requires the partition function, the total weight of all paths, computed by the forward algorithm (sketched below).
  • Gradient: requires expected feature values, computed by the forward-backward algorithm.
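A sketch of the forward algorithm for the partition function of a linear-chain CRF, in log space (the `log_potential` interface is an assumption; the gradient additionally needs forward-backward, not shown):

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_partition(words, tagset, log_potential):
    """log Z(x): total weight of all tag paths, accumulated left to right.
    `log_potential(prev_tag, tag, words, i)` is the sum of lambda-weighted
    local features at position i."""
    alpha = {t: log_potential("<s>", t, words, 0) for t in tagset}
    for i in range(1, len(words)):
        alpha = {t: logsumexp([a + log_potential(pt, t, words, i)
                               for pt, a in alpha.items()])
                 for t in tagset}
    return logsumexp(list(alpha.values()))
```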
59
Forward, Backward, and Expectations
fk is the total number of firings of feature k; each firing happens at some position.
By the Markovian property, the expected count decomposes position by position into forward weight × local weight × backward weight.
60
Forward, Backward, and Expectations
Each term is a forward weight times a backward weight; the forward weight carried to the final state is the total weight of all paths (the partition function).
61
Forward-Backward's Clients
Training a CRF                         Baum-Welch
supervised (labeled data)              unsupervised
concave                                bumpy
converges to global max                converges to local max
max p(y | x) (conditional training)    max p(x) (y unknown)
62
A Glitch
  • Suppose we notice that -ly words are always
    adverbs.
  • Call this feature 7.

The model's expectation can't exceed the maximum (it can't even reach it).
In the data, -ly words are all ADV: the observed count is maximal.
The gradient will always be positive.
63
The Dark Side of Log-Linear Models
If the gradient for λ7 is always positive, λ7 grows without bound.
64
Outline
  • Maximum Entropy principle
  • Log-linear models
  • Conditional modeling for classification
  • Ratnaparkhi's tagger
  • Conditional random fields
  • Smoothing
  • Feature Selection

65
Regularization
  • λ's shouldn't have huge magnitudes
  • Model must generalize to test data
  • Example: a quadratic penalty (written out below)
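With the quadratic penalty, the training objective becomes (a reconstruction; σ² is the shared variance of the Gaussian prior on the next slides):

```latex
\hat{\lambda}
= \arg\max_{\lambda}\; \sum_{i} \log p_{\lambda}(y_i \mid x_i)
\;-\; \sum_{k} \frac{\lambda_k^2}{2\sigma^2}
```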

66
Bayesian Regularization: Maximum A Posteriori Estimation
67
Independent Gaussian Priors (Chen and Rosenfeld, 2000)
Independence
Gaussian
0-mean, identical variance.
Quadratic penalty!
68
Alternatives
  • Different variances for different parameters
  • Laplacian prior (1-norm): not differentiable.
  • Exponential prior (Goodman, 2004): all λk ≥ 0.
  • Relax the constraints (Kazama and Tsujii, 2003)
69
Effect of the penalty
[Figure: the penalty as a function of λk]
70
Kazama and Tsujii's box constraints
The primal Max Ent problem
71
Sparsity
  • Fewer features → better generalization
  • E.g., support vector machines
  • Kazama and Tsujii's prior, and Goodman's, give sparsity.

72
Sparsity
[Figure: the 1-norm penalty plotted against λk: a cusp at λk = 0 where the function is not differentiable, and where the overall gradient can be 0]
73
Outline
  • Maximum Entropy principle
  • Log-linear models
  • Conditional modeling for classification
  • Ratnaparkhi's tagger
  • Conditional random fields
  • Smoothing
  • Feature Selection

74
Feature Selection
  • Sparsity from priors is one way to pick the
    features. (Maybe not a good way.)
  • Della Pietra, Della Pietra, and Lafferty (1997)
    gave another way.

75
Back to the original example.
76
Nine features.
λi = log counti
  • f1 = 1 if ⟨…⟩, 0 otherwise
  • f2 = 1 if ⟨…⟩, 0 otherwise
  • f3 = 1 if ⟨…⟩, 0 otherwise
  • f4 = 1 if ⟨…⟩, 0 otherwise
  • f5 = 1 if ⟨…⟩, 0 otherwise
  • f6 = 1 if ⟨…⟩, 0 otherwise
  • f7 = 1 if ⟨…⟩, 0 otherwise
  • f8 = 1 if ⟨…⟩, 0 otherwise
  • f9 = 1 unless some other feature fires; λ9 << 0

What's wrong here?
77
The Della Pietras and Lafferty's Algorithm
  1. Start out with no features.
  2. Consider a set of candidates:
     • Atomic features.
     • Current features conjoined with atomic features.
  3. Pick the candidate g with the greatest gain.
  4. Add g to the model.
  5. Retrain all parameters.
  6. Go to 2.
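A sketch of that loop (the `gain` and `retrain` callables are assumed helpers standing in for the paper's gain estimate and parameter re-estimation; candidate conjunctions are represented as tuples):

```python
def induce_features(atomic, gain, retrain, max_features=50):
    """Greedy feature induction: repeatedly add the candidate with the
    largest estimated gain, then retrain all parameters."""
    model = []
    while len(model) < max_features:
        candidates = list(atomic) + [(f, a) for f in model for a in atomic]
        best = max(candidates, key=lambda g: gain(model, g))
        if gain(model, best) <= 0:      # no candidate improves the model
            break
        model.append(best)              # add g to the model
        retrain(model)                  # re-fit every lambda
    return model
```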

78
Feature Induction Example
Selected features: [Figure: the CRF over "My cat begs silently" with tags PRP NN VBZ ADV]
Atomic features: the tags (PRP, NN, VBZ, ADV) and the words (My, cat, begs, silently)
Other candidates (conjunctions): PRP ∧ NN, NN ∧ cat, NN ∧ VBZ
79
Outline
  • Maximum Entropy principle
  • Log-linear models
  • Conditional modeling for classification
  • Ratnaparkhi's tagger
  • Conditional random fields
  • Smoothing
  • Feature Selection

80
Conclusions
The math is beautiful and easy to implement.
You pick the features; the rest is just math!
Log-linear models combine the strengths of both:
  • Probabilistic models: robustness, data-oriented, mathematically understood
  • Hacks: explanatory power, exploit the expert's choice of features, (can be) more data-oriented
81
Thank you!