Title: Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
1. Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty
- Yoshimasa Tsuruoka, Junichi Tsujii, and Sophia Ananiadou
- University of Manchester
2. Log-linear models in NLP
- Maximum entropy models
  - Text classification (Nigam et al., 1999)
  - History-based approaches (Ratnaparkhi, 1998)
- Conditional random fields
  - Part-of-speech tagging (Lafferty et al., 2001), chunking (Sha and Pereira, 2003), etc.
- Structured prediction
  - Parsing (Clark and Curran, 2004), semantic role labeling (Toutanova et al., 2005), etc.
3. Log-linear models
- Log-linear (a.k.a. maximum entropy) model
  - p(y | x; w) = exp( Σ_i w_i f_i(x, y) ) / Z(x)
  - f_i(x, y): feature function
  - w_i: weight
  - Z(x): partition function
- Training
  - Maximize the conditional likelihood of the training data
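As a reference for the SGD discussion below, the training objective and its gradient can be written out as follows (a sketch in standard maximum-entropy notation; the symbols match the labels above):

    L(\mathbf{w}) = \sum_{j=1}^{N} \log p(y_j \mid x_j; \mathbf{w}),
    \qquad
    \frac{\partial L}{\partial w_i}
      = \sum_{j=1}^{N} \Big( f_i(x_j, y_j) - \sum_{y} p(y \mid x_j; \mathbf{w})\, f_i(x_j, y) \Big)

That is, the gradient for each weight is the observed feature count minus the expected feature count under the model.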
4. Regularization
- To avoid overfitting to the training data
- Penalize the weights of the features
- L1 regularization
  - Most of the weights become zero
  - Produces sparse (compact) models
  - Saves memory and storage
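With L1 regularization, the training objective takes the following form (a sketch; C is the regularization strength, a hyperparameter not named on this slide):

    L_{\mathrm{L1}}(\mathbf{w})
      = \sum_{j=1}^{N} \log p(y_j \mid x_j; \mathbf{w}) - C \sum_i |w_i|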
5. Training log-linear models
- Numerical optimization methods
  - Gradient descent (steepest descent or hill climbing)
  - Quasi-Newton methods (e.g. BFGS, OWL-QN)
  - Stochastic Gradient Descent (SGD)
  - etc.
- Training can take several hours (or even days), depending on the complexity of the model, the size of the training data, etc.
6. Gradient Descent (Hill Climbing)
(Figure: gradient descent steps on the objective)
7. Stochastic Gradient Descent (SGD)
(Figure: SGD steps on the objective)
- Compute an approximate gradient using one training sample
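In other words, the full gradient, which sums over all N training samples, is replaced by the contribution of a single randomly drawn sample j (a sketch of the approximation; in practice the factor N is absorbed into the learning rate):

    \nabla L(\mathbf{w})
      = \sum_{j=1}^{N} \nabla \log p(y_j \mid x_j; \mathbf{w})
      \;\approx\; N \, \nabla \log p(y_j \mid x_j; \mathbf{w})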
8. Stochastic Gradient Descent (SGD)
- Weight update procedure
  - Very simple (similar to the perceptron algorithm)
  - w^{k+1} = w^k + η_k · ∂/∂w log p(y_j | x_j; w^k)   (η_k: learning rate)
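A minimal sketch of this update for a multiclass log-linear model, in Python (the toy feature function, data, and hyperparameters below are illustrative, not from the paper):

    # Minimal SGD for a multiclass log-linear model (toy sketch; no regularization yet).
    import math
    import random
    from collections import defaultdict

    def features(x, y):
        """Binary feature function: one feature fires per (attribute, label) pair."""
        return [(attr, y) for attr in x]

    def sgd_update(weights, x, gold, labels, eta):
        """One SGD step on log p(gold | x): observed minus expected feature counts."""
        scores = {y: sum(weights[f] for f in features(x, y)) for y in labels}
        log_z = math.log(sum(math.exp(s) for s in scores.values()))
        for f in features(x, gold):                 # observed features
            weights[f] += eta
        for y in labels:                            # expected features under the model
            p = math.exp(scores[y] - log_z)
            for f in features(x, y):
                weights[f] -= eta * p

    # Toy usage
    labels = ["NP", "VP"]
    data = [(["word=dog", "prev=the"], "NP"), (["word=runs", "prev=dog"], "VP")]
    weights = defaultdict(float)
    for _ in range(10):
        random.shuffle(data)
        for x, gold in data:
            sgd_update(weights, x, gold, labels, eta=0.1)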
9. Using subgradients
10. Using subgradients
- Problems
  - The L1 penalty needs to be applied to all features (including the ones that are not used in the current sample).
  - Few weights become zero as a result of training.
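For reference, the subgradient-based update discussed on these two slides applies the regularization term to every weight at every step (a sketch following the paper's formulation; C is the regularization strength, N the number of training samples):

    \mathbf{w}^{k+1}
      = \mathbf{w}^{k}
      + \eta_k \, \frac{\partial}{\partial \mathbf{w}}
        \left( \log p(y_j \mid x_j; \mathbf{w}^{k}) - \frac{C}{N} \sum_i |w_i^{k}| \right)

where the subgradient of |w_i| is taken to be sign(w_i) for w_i ≠ 0 and 0 at w_i = 0.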
11. Clipping-at-zero approach
(Figure: weight w)
- Carpenter (2008)
- Special case of the FOLOS algorithm (Duchi and Singer, 2008) and the truncated gradient method (Langford et al., 2009)
- Enables lazy update
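A sketch of the clipping-at-zero update (following the formulation in the paper): first take the plain SGD step, then apply the L1 penalty to each weight, clipping at zero so the weight cannot change sign:

    w_i^{k+\frac{1}{2}} = w_i^{k}
      + \eta_k \frac{\partial}{\partial w_i} \log p(y_j \mid x_j; \mathbf{w}^{k})

    w_i^{k+1} =
    \begin{cases}
      \max\!\left(0,\; w_i^{k+\frac{1}{2}} - \frac{C}{N}\,\eta_k\right) & \text{if } w_i^{k+\frac{1}{2}} > 0,\\
      \min\!\left(0,\; w_i^{k+\frac{1}{2}} + \frac{C}{N}\,\eta_k\right) & \text{if } w_i^{k+\frac{1}{2}} < 0.
    \end{cases}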
12. Clipping-at-zero approach
13. Number of non-zero features
- Text chunking
    Quasi-Newton                18,109
    SGD (Naive)                455,651
    SGD (Clipping-at-zero)      87,792
- Named entity recognition
    Quasi-Newton                30,710
    SGD (Naive)              1,032,962
    SGD (Clipping-at-zero)     279,886
- Part-of-speech tagging
    Quasi-Newton                50,870
    SGD (Naive)              2,142,130
    SGD (Clipping-at-zero)     323,199
14. Why it does not produce sparse models
- In SGD, weights are not updated smoothly
- Each update uses a noisy gradient computed from a single sample, so a weight that has just been clipped to zero is often pushed away from zero again by later updates, and few weights end up exactly at zero
15. Cumulative L1 penalty
- u_k: the absolute value of the total L1 penalty which should have been applied to each weight
- q_i: the total L1 penalty which has actually been applied to weight w_i
16. Applying L1 with cumulative penalty
- Penalize each weight according to the difference between u_k and q_i
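Concretely (a sketch following the paper's equations), with u_k the cumulative penalty per weight and q_i the penalty actually applied to w_i so far:

    u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t

    w_i^{k+1} =
    \begin{cases}
      \max\!\left(0,\; w_i^{k+\frac{1}{2}} - (u_k + q_i^{k-1})\right) & \text{if } w_i^{k+\frac{1}{2}} > 0,\\
      \min\!\left(0,\; w_i^{k+\frac{1}{2}} + (u_k - q_i^{k-1})\right) & \text{if } w_i^{k+\frac{1}{2}} < 0,
    \end{cases}
    \qquad
    q_i^{k} = q_i^{k-1} + \left(w_i^{k+1} - w_i^{k+\frac{1}{2}}\right)

where w_i^{k+1/2} is the weight after the plain SGD step on sample j.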
17. Implementation
- 10 lines of code!
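A Python sketch of the training loop with the cumulative penalty, following the paper's pseudocode. The per-sample gradient step is abstracted as grad_step, a placeholder assumed to update the weights in place and return the indices of the features it touched; the variable names u and q match the slides:

    # SGD with cumulative L1 penalty and lazy penalty application (sketch).
    from collections import defaultdict

    def apply_penalty(w, q, u, i):
        """Clip w[i] toward zero by the cumulative penalty it has not yet received."""
        z = w[i]
        if w[i] > 0:
            w[i] = max(0.0, w[i] - (u + q[i]))
        elif w[i] < 0:
            w[i] = min(0.0, w[i] + (u - q[i]))
        q[i] += w[i] - z

    def train(data, grad_step, C, eta0, alpha, n_passes):
        w = defaultdict(float)   # weights
        q = defaultdict(float)   # total L1 penalty actually applied to each weight
        u = 0.0                  # total L1 penalty each weight could have received
        N = len(data)
        k = 0
        for _ in range(n_passes):
            for sample in data:
                eta = eta0 * alpha ** (k / N)        # exponential decay (assumed form)
                u += eta * C / N
                touched = grad_step(w, sample, eta)  # plain SGD step on one sample
                for i in touched:                    # lazy update: only touched features
                    apply_penalty(w, q, u, i)
                k += 1
        return w

Because the penalty for a feature is settled only when that feature is next touched, each update costs time proportional to the number of active features in the current sample rather than the total number of features.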
18. Experiments
- Model: Conditional Random Fields (CRFs)
- Baseline: OWL-QN (Andrew and Gao, 2007)
- Tasks
  - Text chunking (shallow parsing)
    - CoNLL 2000 shared task data
    - Recognize base syntactic phrases (e.g. NP, VP, PP)
  - Named entity recognition
    - NLPBA 2004 shared task data
    - Recognize names of genes, proteins, etc.
  - Part-of-speech (POS) tagging
    - WSJ corpus (sections 0-18 for training)
19. CoNLL 2000 chunking task: objective
(Figure: objective value over training)
20. CoNLL 2000 chunking: non-zero features
(Figure: number of non-zero features over training)
21. CoNLL 2000 chunking
- Performance of the produced model

                                Passes   Obj.     Features   Time (sec)   F-score
  OWL-QN                           160   -1.583     18,109          598     93.62
  SGD (Naive)                       30   -1.671    455,651        1,117     93.64
  SGD (Clipping, Lazy Update)       30   -1.671     87,792          144     93.65
  SGD (Cumulative)                  30   -1.653     28,189          149     93.68
  SGD (Cumulative, ED)              30   -1.622     23,584          148     93.66

  (ED: exponential decay of the learning rate)

- Training is 4 times faster than OWL-QN
- The model is 4 times smaller than with the clipping-at-zero approach
- The objective is also slightly better
22. NLPBA 2004 named entity recognition

                                Passes   Obj.     Features   Time (sec)   F-score
  OWL-QN                           160   -2.448     30,710        2,253     71.76
  SGD (Naive)                       30   -2.537  1,032,962        4,528     71.20
  SGD (Clipping, Lazy Update)       30   -2.538    279,886          585     71.20
  SGD (Cumulative)                  30   -2.479     31,986          631     71.40
  SGD (Cumulative, ED)              30   -2.443     25,965          631     71.63

Part-of-speech tagging on WSJ

                                Passes   Obj.     Features   Time (sec)   Accuracy
  OWL-QN                           124   -1.941     50,870        5,623     97.16
  SGD (Naive)                       30   -2.013  2,142,130       18,471     97.18
  SGD (Clipping, Lazy Update)       30   -2.013    323,199        1,680     97.18
  SGD (Cumulative)                  30   -1.987     62,043        1,777     97.19
  SGD (Cumulative, ED)              30   -1.954     51,857        1,774     97.17
23. Discussions
- Convergence
  - Demonstrated empirically
  - The penalties applied are not i.i.d., so standard convergence results do not directly apply
- Learning rate
  - The need for tuning can be annoying
  - Rule of thumb: exponential decay over 30 passes with alpha = 0.85
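The exponential-decay schedule referred to above can be written as follows (assumed parameterization, with k the update count and N the number of training samples, so the rate is multiplied by alpha once per pass):

    \eta_k = \eta_0 \, \alpha^{k/N}, \qquad \alpha = 0.85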
24. Conclusions
- Stochastic gradient descent training for L1-regularized log-linear models
- Force each weight to receive the total L1 penalty that would have been applied if the true (noiseless) gradient were available
- 3 to 4 times faster than OWL-QN
- Extremely easy to implement