Title: Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
1. Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
- Yoshimasa Tsuruoka, Junichi Tsujii, and Sophia Ananiadou
- University of Manchester
2. Log-linear models in NLP
- Maximum entropy models
  - Text classification (Nigam et al., 1999)
  - History-based approaches (Ratnaparkhi, 1998)
- Conditional random fields
  - Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
- Structured prediction
  - Parsing (Clark and Curran, 2004), semantic role labeling (Toutanova et al., 2005), etc.
3. Log-linear models
- Log-linear (a.k.a. maximum entropy) model
  - p(y | x; w) = exp( Σ_i w_i f_i(x, y) ) / Z(x)
  - f_i(x, y): feature function
  - w_i: weight
  - Z(x): partition function
- Training
  - Maximize the conditional likelihood of the training data
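As a reference for the SGD discussion below, the training objective and its gradient can be written out as follows (a sketch in standard maximum-entropy notation; the symbols match the labels above):

    L(\mathbf{w}) = \sum_{j=1}^{N} \log p(y_j \mid x_j; \mathbf{w}),
    \qquad
    \frac{\partial L}{\partial w_i}
      = \sum_{j=1}^{N} \Big( f_i(x_j, y_j) - \sum_{y} p(y \mid x_j; \mathbf{w})\, f_i(x_j, y) \Big)

That is, the gradient for each weight is the observed feature count minus the expected feature count under the model.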
4. Regularization
- To avoid overfitting to the training data
- Penalize the weights of the features
- L1 regularization
  - Most of the weights become zero
  - Produces sparse (compact) models
  - Saves memory and storage
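With L1 regularization, the training objective takes the following form (a sketch; C is the regularization strength, a hyperparameter not named on this slide):

    L_{\mathrm{L1}}(\mathbf{w})
      = \sum_{j=1}^{N} \log p(y_j \mid x_j; \mathbf{w}) - C \sum_i |w_i|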
5. Training log-linear models
- Numerical optimization methods
  - Gradient descent (steepest descent or hill climbing)
  - Quasi-Newton methods (e.g. BFGS, OWL-QN)
  - Stochastic Gradient Descent (SGD)
  - etc.
- Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.
6. Gradient Descent (Hill Climbing)
(Figure: gradient descent steps on the objective)
7. Stochastic Gradient Descent (SGD)
(Figure: SGD steps on the objective)
- Compute an approximate gradient using one training sample
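In other words, the full gradient, which sums over all N training samples, is replaced by the contribution of a single randomly drawn sample j (a sketch of the approximation; in practice the factor N is absorbed into the learning rate):

    \nabla L(\mathbf{w})
      = \sum_{j=1}^{N} \nabla \log p(y_j \mid x_j; \mathbf{w})
      \;\approx\; N \, \nabla \log p(y_j \mid x_j; \mathbf{w})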
8. Stochastic Gradient Descent (SGD)
- Weight update procedure
  - Very simple (similar to the perceptron algorithm)
  - w^{k+1} = w^k + η_k · ∂/∂w log p(y_j | x_j; w^k)   (η_k: learning rate)
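A minimal sketch of this update for a multiclass log-linear model, in Python (the toy feature function, data, and hyperparameters below are illustrative, not from the paper):

    # Minimal SGD for a multiclass log-linear model (toy sketch; no regularization yet).
    import math
    import random
    from collections import defaultdict

    def features(x, y):
        """Binary feature function: one feature fires per (attribute, label) pair."""
        return [(attr, y) for attr in x]

    def sgd_update(weights, x, gold, labels, eta):
        """One SGD step on log p(gold | x): observed minus expected feature counts."""
        scores = {y: sum(weights[f] for f in features(x, y)) for y in labels}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        for f in features(x, gold):                 # observed features
            weights[f] += eta
        for y in labels:                            # expected features under the model
            p = math.exp(scores[y] - log_z)
            for f in features(x, y):
                weights[f] -= eta * p

    # Toy usage
    labels = ["NP", "VP"]
    data = [(["word=dog", "prev=the"], "NP"), (["word=runs", "prev=dog"], "VP")]
    weights = defaultdict(float)
    for _ in range(10):
        random.shuffle(data)
        for x, gold in data:
            sgd_update(weights, x, gold, labels, eta=0.1)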
9. Using subgradients
10. Using subgradients
- Problems
  - The L1 penalty needs to be applied to all features (including the ones that are not used in the current sample).
  - Few weights become zero as a result of training.
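For reference, the subgradient-based update discussed on these two slides applies the regularization term to every weight at every step (a sketch following the paper's formulation; C is the regularization strength, N the number of training samples):

    \mathbf{w}^{k+1}
      = \mathbf{w}^{k}
      + \eta_k \, \frac{\partial}{\partial \mathbf{w}}
        \left( \log p(y_j \mid x_j; \mathbf{w}^{k}) - \frac{C}{N} \sum_i |w_i^{k}| \right)

where the subgradient of |w_i| is taken to be sign(w_i) for w_i ≠ 0 and 0 at w_i = 0.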
11. Clipping-at-zero approach
(Figure: weight w)
- Carpenter (2008)
- Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)
- Enables lazy update
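A sketch of the clipping-at-zero update (following the formulation in the paper): first take the plain SGD step, then apply the L1 penalty to each weight, clipping at zero so the weight cannot change sign:

    w_i^{k+\frac{1}{2}} = w_i^{k}
      + \eta_k \frac{\partial}{\partial w_i} \log p(y_j \mid x_j; \mathbf{w}^{k})

    w_i^{k+1} =
    \begin{cases}
      \max\!\left(0,\; w_i^{k+\frac{1}{2}} - \frac{C}{N}\,\eta_k\right) & \text{if } w_i^{k+\frac{1}{2}} > 0,\\
      \min\!\left(0,\; w_i^{k+\frac{1}{2}} + \frac{C}{N}\,\eta_k\right) & \text{if } w_i^{k+\frac{1}{2}} < 0.
    \end{cases}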
12. Clipping-at-zero approach
13. Number of non-zero features
- Text chunking
    Quasi-Newton                18,109
    SGD (Naive)                455,651
    SGD (Clipping-at-zero)      87,792
- Named entity recognition
    Quasi-Newton                30,710
    SGD (Naive)              1,032,962
    SGD (Clipping-at-zero)     279,886
- Part-of-speech tagging
    Quasi-Newton                50,870
    SGD (Naive)              2,142,130
    SGD (Clipping-at-zero)     323,199
14. Why it does not produce sparse models
- In SGD, weights are not updated smoothly
- Each update uses a noisy gradient computed from a single sample, so a weight that has just been clipped to zero is often pushed away from zero again by later updates, and few weights end up exactly at zero
15. Cumulative L1 penalty
- u_k: the absolute value of the total L1 penalty which should have been applied to each weight
- q_i: the total L1 penalty which has actually been applied to weight w_i
16. Applying L1 with cumulative penalty
- Penalize each weight according to the difference between u_k and q_i
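Concretely (a sketch following the paper's equations), with u_k the cumulative penalty per weight and q_i the penalty actually applied to w_i so far:

    u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t

    w_i^{k+1} =
    \begin{cases}
      \max\!\left(0,\; w_i^{k+\frac{1}{2}} - (u_k + q_i^{k-1})\right) & \text{if } w_i^{k+\frac{1}{2}} > 0,\\
      \min\!\left(0,\; w_i^{k+\frac{1}{2}} + (u_k - q_i^{k-1})\right) & \text{if } w_i^{k+\frac{1}{2}} < 0,
    \end{cases}
    \qquad
    q_i^{k} = q_i^{k-1} + \left(w_i^{k+1} - w_i^{k+\frac{1}{2}}\right)

where w_i^{k+1/2} is the weight after the plain SGD step on sample j.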
17. Implementation
- 10 lines of code!
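A Python sketch of the training loop with the cumulative penalty, following the paper's pseudocode. The per-sample gradient step is abstracted as grad_step, a placeholder assumed to update the weights in place and return the indices of the features it touched; the variable names u and q match the slides:

    # SGD with cumulative L1 penalty and lazy penalty application (sketch).
    from collections import defaultdict

    def apply_penalty(w, q, u, i):
        """Clip w[i] toward zero by the cumulative penalty it has not yet received."""
        z = w[i]
        if w[i] > 0:
            w[i] = max(0.0, w[i] - (u + q[i]))
        elif w[i] < 0:
            w[i] = min(0.0, w[i] + (u - q[i]))
        q[i] += w[i] - z

    def train(data, grad_step, C, eta0, alpha, n_passes):
        w = defaultdict(float)   # weights
        q = defaultdict(float)   # total L1 penalty actually applied to each weight
        u = 0.0                  # total L1 penalty each weight could have received
        N = len(data)
        k = 0
        for _ in range(n_passes):
            for sample in data:
                eta = eta0 * alpha ** (k / N)        # exponential decay (assumed form)
                u += eta * C / N
                touched = grad_step(w, sample, eta)  # plain SGD step on one sample
                for i in touched:                    # lazy update: only touched features
                    apply_penalty(w, q, u, i)
                k += 1
        return w

Because the penalty for a feature is settled only when that feature is next touched, each update costs time proportional to the number of active features in the current sample rather than the total number of features.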
18. Experiments
- Model: Conditional Random Fields (CRFs)
- Baseline: OWL-QN (Andrew and Gao, 2007)
- Tasks
  - Text chunking (shallow parsing)
    - CoNLL 2000 shared task data
    - Recognize base syntactic phrases (e.g. NP, VP, PP)
  - Named entity recognition
    - NLPBA 2004 shared task data
    - Recognize names of genes, proteins, etc.
  - Part-of-speech (POS) tagging
    - WSJ corpus (sections 0-18 for training)
19. CoNLL 2000 chunking task: objective
(Figure: objective value over training)
20. CoNLL 2000 chunking: non-zero features
(Figure: number of non-zero features over training)
21. CoNLL 2000 chunking
- Performance of the produced model

                                Passes   Obj.     Features   Time (sec)   F-score
  OWL-QN                           160   -1.583     18,109          598     93.62
  SGD (Naive)                       30   -1.671    455,651        1,117     93.64
  SGD (Clipping, Lazy Update)       30   -1.671     87,792          144     93.65
  SGD (Cumulative)                  30   -1.653     28,189          149     93.68
  SGD (Cumulative, ED)              30   -1.622     23,584          148     93.66

  (ED: exponential decay of the learning rate)

- Training is 4 times faster than OWL-QN
- The model is 4 times smaller than with the clipping-at-zero approach
- The objective is also slightly better
22. NLPBA 2004 named entity recognition

                                Passes   Obj.     Features   Time (sec)   F-score
  OWL-QN                           160   -2.448     30,710        2,253     71.76
  SGD (Naive)                       30   -2.537  1,032,962        4,528     71.20
  SGD (Clipping, Lazy Update)       30   -2.538    279,886          585     71.20
  SGD (Cumulative)                  30   -2.479     31,986          631     71.40
  SGD (Cumulative, ED)              30   -2.443     25,965          631     71.63

Part-of-speech tagging on WSJ

                                Passes   Obj.     Features   Time (sec)   Accuracy
  OWL-QN                           124   -1.941     50,870        5,623     97.16
  SGD (Naive)                       30   -2.013  2,142,130       18,471     97.18
  SGD (Clipping, Lazy Update)       30   -2.013    323,199        1,680     97.18
  SGD (Cumulative)                  30   -1.987     62,043        1,777     97.19
  SGD (Cumulative, ED)              30   -1.954     51,857        1,774     97.17
23. Discussions
- Convergence
  - Demonstrated empirically
  - The penalties applied are not i.i.d., so standard convergence results do not directly apply
- Learning rate
  - The need for tuning can be annoying
  - Rule of thumb: exponential decay over 30 passes with alpha = 0.85
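The exponential-decay schedule referred to above can be written as follows (assumed parameterization, with k the update count and N the number of training samples, so the rate is multiplied by alpha once per pass):

    \eta_k = \eta_0 \, \alpha^{k/N}, \qquad \alpha = 0.85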
24. Conclusions
- Stochastic gradient descent training for L1-regularized log-linear models
- Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available
- 3 to 4 times faster than OWL-QN
- Extremely easy to implement