Stochastic Gradient Descent Training for L1-regularized Log-linear Models with Cumulative Penalty

1
Stochastic Gradient Descent Training for
L1-regularized Log-linear Models with Cumulative
Penalty
  • Yoshimasa Tsuruoka, Jun'ichi Tsujii, and Sophia
    Ananiadou
  • University of Manchester

2
Log-linear models in NLP
  • Maximum entropy models
  • Text classification (Nigam et al., 1999)
  • History-based approaches (Ratnaparkhi, 1998)
  • Conditional random fields
  • Part-of-speech tagging (Lafferty et al., 2001),
    chunking (Sha and Pereira, 2003), etc.
  • Structured prediction
  • Parsing (Clark and Curran, 2004), Semantic Role
    Labeling (Toutanova et al., 2005), etc.

3
Log-linear models
  • Log-linear (a.k.a. maximum entropy) model
  • Training
  • Maximize the conditional likelihood of the
    training data

p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_i w_i f_i(x, y) \right)
(f_i: feature function, w_i: weight, Z(x): partition function)
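As a concrete illustration of the model above (a minimal sketch, not from the slides; the sparse feature representation and the features callback are assumptions):

    import math

    def loglinear_prob(weights, features, labels, x, y):
        """p(y|x) = exp(sum_i w_i * f_i(x, y)) / Z(x).

        features(x, label) is assumed to return a sparse dict {feature_id: value};
        weights maps feature_id to its weight w_i.
        """
        def score(label):
            return sum(weights.get(f, 0.0) * v for f, v in features(x, label).items())
        z = sum(math.exp(score(label)) for label in labels)  # partition function Z(x)
        return math.exp(score(y)) / z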
4
Regularization
  • To avoid overfitting to the training data
  • Penalize the weights of the features
  • L1 regularization
  • Most of the weights become zero
  • Produces sparse (compact) models
  • Saves memory and storage

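Written out, the L1-regularized training objective maximized here takes the standard form (with C the regularization strength and N the number of training samples):

    \max_{\mathbf{w}} \; \sum_{j=1}^{N} \log p(y_j \mid x_j; \mathbf{w}) \;-\; C \sum_{i} |w_i|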
5
Training log-linear models
  • Numerical optimization methods
  • Gradient descent (steepest descent or
    hill-climbing)
  • Quasi-Newton methods (e.g. BFGS, OWL-QN)
  • Stochastic Gradient Descent (SGD)
  • etc.
  • Training can take several hours (or even days),
    depending on the complexity of the model, the
    size of training data, etc.

6
Gradient Descent (Hill Climbing)
(Figure: the objective function)
7
Stochastic Gradient Descent (SGD)
(Figure: the objective function)
Compute an approximate gradient using
one training sample
8
Stochastic Gradient Descent (SGD)
  • Weight update procedure
  • very simple (similar to the Perceptron algorithm)

\mathbf{w}^{(k+1)} = \mathbf{w}^{(k)} + \eta_k \, \nabla_{\mathbf{w}} \log p(y_j \mid x_j; \mathbf{w}^{(k)})
(\eta_k: learning rate)
9
Using subgradients
  • Weight update procedure

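The update formula itself is not preserved in this transcript; a standard per-sample subgradient update for the L1-regularized objective (with the subgradient of |w_i| taken as 0 at w_i = 0) would be:

    w_i \leftarrow w_i + \eta_k \left( \frac{\partial}{\partial w_i} \log p(y_j \mid x_j; \mathbf{w}) - \frac{C}{N}\,\mathrm{sign}(w_i) \right)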
10
Using subgradients
  • Problems
  • The L1 penalty needs to be applied to all
    features, including those that do not appear in
    the current sample.
  • Very few weights become exactly zero as a result
    of training.

11
Clipping-at-zero approach
(Figure: clipping of the weight w at zero)
  • Carpenter (2008)
  • Special case of the FOLOS algorithm (Duchi and
    Singer, 2008) and the truncated gradient method
    (Langford et al., 2009)
  • Enables lazy update

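A minimal sketch of the clipping-at-zero update for a single weight (names are illustrative; grad is the per-sample log-likelihood gradient for feature i, eta the learning rate, and c = C/N):

    def clip_at_zero_update(w_i, grad, eta, c):
        """Gradient step followed by an L1 penalty that is clipped so it
        never pushes the weight across zero."""
        w_i += eta * grad                  # usual SGD step
        if w_i > 0:
            w_i = max(0.0, w_i - eta * c)  # penalize, but stop at zero
        elif w_i < 0:
            w_i = min(0.0, w_i + eta * c)
        return w_i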
12
Clipping-at-zero approach
13
  • Text chunking
  • Named entity recognition
  • Part-of-speech tagging

Number of non-zero features
                         Chunking    NER          POS tagging
Quasi-Newton             18,109      30,710       50,870
SGD (Naive)              455,651     1,032,962    2,142,130
SGD (Clipping-at-zero)   87,792      279,886      323,199
14
Why it does not produce sparse models
  • In SGD, the weights are not updated smoothly:
    each step uses a noisy, single-sample gradient,
    so a weight fluctuates around zero and rarely
    happens to be exactly zero when training ends.

15
Cumulative L1 penalty
  • u_k: the absolute value of the total L1 penalty
    which should have been applied to each weight by
    the k-th update
  • q_i: the total L1 penalty which has actually been
    applied to weight w_i (formulas below)

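In the paper's notation (reconstructed here, since the slide's formulas are not preserved in the transcript), with w_i^{t+1/2} the weight after the gradient step of update t and w_i^{t+1} the weight after the penalty:

    u_k = \frac{C}{N} \sum_{t=1}^{k} \eta_t
    \qquad
    q_i^{k} = \sum_{t=1}^{k} \left( w_i^{t+1} - w_i^{t+\frac{1}{2}} \right)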
16
Applying L1 with cumulative penalty
  • Penalize each weight according to the difference
    between u_k (the penalty it should have received)
    and q_i (the penalty it has actually received)

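Concretely (again a reconstruction, since the slide's equations are images): after the gradient step gives w_i^{k+1/2}, the weight receives the part of the cumulative penalty it is still owed, clipped at zero:

    w_i^{k+1} =
    \begin{cases}
    \max\left(0,\; w_i^{k+\frac{1}{2}} - \left(u_k + q_i^{k-1}\right)\right) & \text{if } w_i^{k+\frac{1}{2}} > 0 \\
    \min\left(0,\; w_i^{k+\frac{1}{2}} + \left(u_k - q_i^{k-1}\right)\right) & \text{if } w_i^{k+\frac{1}{2}} < 0
    \end{cases}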
17
Implementation
10 lines of code!
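The code listing on this slide is not reproduced in the transcript; the following is a minimal Python sketch of the cumulative-penalty training loop described on the previous slides (function and variable names such as gradient and train_sgd_l1_cumulative are illustrative, not from the slide):

    def train_sgd_l1_cumulative(samples, gradient, num_features,
                                C=1.0, eta0=0.1, alpha=0.85, passes=30):
        """SGD with a cumulative L1 penalty, applied lazily per feature."""
        w = [0.0] * num_features   # weights
        q = [0.0] * num_features   # L1 penalty actually applied to each weight
        u = 0.0                    # L1 penalty each weight should have received
        N = len(samples)
        k = 0
        for _ in range(passes):
            for x, y in samples:
                eta = eta0 * alpha ** (k / N)      # exponentially decaying rate
                u += eta * C / N
                for i, g in gradient(w, x, y):     # sparse per-sample gradient
                    w[i] += eta * g                # usual SGD step
                    z = w[i]
                    if w[i] > 0:                   # apply the penalty still owed,
                        w[i] = max(0.0, w[i] - (u + q[i]))  # clipping at zero
                    elif w[i] < 0:
                        w[i] = min(0.0, w[i] + (u - q[i]))
                    q[i] += w[i] - z               # record what was actually applied
                k += 1
        return w

Here gradient(w, x, y) is assumed to yield (feature index, partial derivative of log p(y|x; w)) pairs for the features active in the current sample.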
18
Experiments
  • Model: Conditional Random Fields (CRFs)
  • Baseline: OWL-QN (Andrew and Gao, 2007)
  • Tasks
  • Text chunking (shallow parsing)
  • CoNLL 2000 shared task data
  • Recognize base syntactic phrases (e.g. NP, VP,
    PP)
  • Named entity recognition
  • NLPBA 2004 shared task data
  • Recognize names of genes, proteins, etc.
  • Part-of-speech (POS) tagging
  • WSJ corpus (sections 0-18 for training)

19
CoNLL 2000 chunking task objective
20
CoNLL 2000 chunking non-zero features
21
CoNLL 2000 chunking
  • Performance of the produced model

Method                        Passes   Obj.     Features   Time (sec)   F-score
OWL-QN                        160      -1.583   18,109     598          93.62
SGD (Naive)                   30       -1.671   455,651    1,117        93.64
SGD (Clipping + Lazy Update)  30       -1.671   87,792     144          93.65
SGD (Cumulative)              30       -1.653   28,189     149          93.68
SGD (Cumulative + ED)         30       -1.622   23,584     148          93.66
(ED: exponential decay of the learning rate)
  • Training is 4 times faster than OWL-QN
  • The model is 4 times smaller than the
    clipping-at-zero approach
  • The objective is also slightly better

22
NLPBA 2004 named entity recognition
Method                        Passes   Obj.     Features    Time (sec)   F-score
OWL-QN                        160      -2.448   30,710      2,253        71.76
SGD (Naive)                   30       -2.537   1,032,962   4,528        71.20
SGD (Clipping + Lazy Update)  30       -2.538   279,886     585          71.20
SGD (Cumulative)              30       -2.479   31,986      631          71.40
SGD (Cumulative + ED)         30       -2.443   25,965      631          71.63

Part-of-speech tagging on WSJ

Method                        Passes   Obj.     Features    Time (sec)   Accuracy
OWL-QN                        124      -1.941   50,870      5,623        97.16
SGD (Naive)                   30       -2.013   2,142,130   18,471       97.18
SGD (Clipping + Lazy Update)  30       -2.013   323,199     1,680        97.18
SGD (Cumulative)              30       -1.987   62,043      1,777        97.19
SGD (Cumulative + ED)         30       -1.954   51,857      1,774        97.17
23
Discussions
  • Convergence
  • Demonstrated empirically
  • Penalties applied are not i.i.d.
  • Learning rate
  • The need for tuning can be annoying
  • Rule of thumb
  • Exponential decay (30 passes, alpha = 0.85); see
    the sketch below

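A sketch of the exponential-decay schedule referred to above, assuming the common parameterization in which the learning rate is multiplied by alpha after every pass over the N training samples (the exact form used in the paper may differ slightly):

    \eta_k = \eta_0 \, \alpha^{k/N}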
24
Conclusions
  • Stochastic gradient descent training for
    L1-regularized log-linear models
  • Force each weight to receive the total L1 penalty
    that would have been applied if the true
    (noiseless) gradient were available
  • 3 to 4 times faster than OWL-QN
  • Extremely easy to implement