Using Neural Network Language Models for LVCSR - PowerPoint PPT Presentation

Transcript and Presenter's Notes
1
Using Neural Network Language Models for LVCSR
  • Holger Schwenk and Jean-Luc Gauvain
  • Presented by Erin Fitzgerald
  • CLSP Reading Group
  • December 10, 2004

2
Introduction
  • Build and use neural networks to estimate LM
    posterior probabilities for ASR tasks
  • Idea:
  • Project word indices onto a continuous space
  • The resulting smooth probability functions of the word representations
    generalize better to unseen n-grams
  • Still an n-gram approach, but posteriors are interpolated for any
    possible context; no backing off
  • Result: significant WER reduction at small computational cost

3
Architecture: Standard fully connected multilayer perceptron
4
Architecture
(Figure: the n-1 history words are projected onto a continuous space by the
projection layer c, followed by a hidden layer d of size H and an output
layer o over the N-word vocabulary; P is the projection dimension)
d = tanh(Mc + b)
o = Vd + k
p_i = exp(o_i) / sum_r exp(o_r) = P(w_j = i | h_j),  i = 1, ..., N
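A minimal NumPy sketch of this forward pass, with hypothetical variable names
(R is the N x P projection matrix, M/b the hidden-layer weights and bias, V/k
the output-layer weights and bias):

    import numpy as np

    def nnlm_forward(context_ids, R, M, b, V, k):
        """Forward pass of the neural net LM sketched in the figure above."""
        # Project each of the n-1 history words and concatenate the results.
        c = np.concatenate([R[i] for i in context_ids])
        d = np.tanh(M @ c + b)          # hidden layer: d = tanh(Mc + b)
        o = V @ d + k                   # output layer: o = Vd + k
        e = np.exp(o - o.max())         # softmax (numerically stable)
        return e / e.sum()              # p_i = P(w_j = i | h_j)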
5
Training
  • Train with the standard back-propagation algorithm
  • Error function: cross entropy
  • Weight decay regularization used
  • Targets set to 1 for w_j and to 0 otherwise
  • These outputs are shown to converge to the posterior probabilities
  • Back-propagating through the projection layer means the NN learns the
    best projection of words onto the continuous space for the probability
    estimation task
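A minimal sketch of this training criterion for a single example (the decay
value is illustrative; p is the network's softmax output and target_id the
index of the observed next word w_j):

    import numpy as np

    def cross_entropy_with_decay(p, target_id, weights, decay=1e-5):
        """Cross-entropy error against the 1-of-N target, plus weight decay."""
        ce = -np.log(p[target_id])                          # target is 1 for w_j, 0 otherwise
        l2 = decay * sum((W ** 2).sum() for W in weights)   # weight-decay regularization
        return ce + l2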

6
Optimizations
7
Fast Recognition
  • Techniques
  • Lattice Rescoring
  • Shortlists
  • Regrouping
  • Block mode
  • CPU optimization

8
Fast Recognition
  • Techniques
  • Lattice Rescoring
  • Decode with a standard backoff LM to build lattices
  • Shortlists
  • Regrouping
  • Block mode
  • CPU optimization

9
Fast Recognition
  • Techniques
  • Lattice Rescoring
  • Shortlists
  • The NN only predicts a high-frequency subset of the vocabulary
  • Regrouping
  • Block mode
  • CPU optimization

10
Shortlist optimization
(Figure: same network as before, but the output layer is restricted to a
shortlist S of the most frequent words, so the NN only computes
p_i = P(w_j = i | h_j) for words i in S)
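A sketch of how a shortlist can be combined with the back-off LM, assuming
the usual renormalization of the NN distribution by the back-off probability
mass on the shortlist; nn_prob and backoff_prob are hypothetical helpers
returning P(word | history):

    def shortlist_prob(word, history, shortlist, nn_prob, backoff_prob):
        """Score a word with the NN if it is in the shortlist, else fall back."""
        if word in shortlist:
            # Back-off probability mass of the shortlist, used to renormalize
            # the NN distribution so the combined model still sums to one.
            mass = sum(backoff_prob(w, history) for w in shortlist)
            return nn_prob(word, history) * mass
        return backoff_prob(word, history)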
11
Fast Recognition
  • Techniques
  • Lattice Rescoring
  • Shortlists
  • Regrouping (optimization of the lattice rescoring)
  • Collect and sort the LM probability requests
  • All probability requests with the same context h_t require only one
    forward pass (see the sketch after this list)
  • Block mode
  • CPU optimization
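A sketch of the regrouping step, assuming a hypothetical nn_forward(history)
helper that returns the NN distribution for one context (words are
vocabulary indices):

    from collections import defaultdict

    def regroup_requests(requests, nn_forward):
        """Group (history, word) probability requests by history so each
        distinct context needs only one forward pass."""
        by_history = defaultdict(list)
        for history, word in requests:
            by_history[tuple(history)].append(word)

        probs = {}
        for history, words in by_history.items():
            dist = nn_forward(history)        # one forward pass per distinct context
            for w in words:
                probs[(history, w)] = dist[w]
        return probs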

12
Fast Recognition
  • Techniques
  • Lattice Rescoring
  • Shortlists
  • Regrouping
  • Block mode
  • Several examples propagated through NN at once
  • Takes advantage of faster matrix operations
  • CPU optimization

13
Block mode calculations
(Figure: per-example forward pass, one context vector c at a time)
d = tanh(Mc + b)
o = Vd + k
14
Block mode calculations
(Figure: block-mode forward pass; the projected context vectors of a whole
bunch of examples form the columns of C)
D = tanh(MC + B)
O = VD + K
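A NumPy sketch of the block-mode computation, with the projected context
vectors of one bunch stacked as the columns of C (weights as in the earlier
figure):

    import numpy as np

    def block_forward(C, M, b, V, k):
        """Forward pass for a whole bunch at once: D = tanh(MC + B), O = VD + K."""
        D = np.tanh(M @ C + b[:, None])        # hidden activations, one column per example
        O = V @ D + k[:, None]                 # output activations
        O = O - O.max(axis=0, keepdims=True)   # column-wise softmax (numerically stable)
        P = np.exp(O)
        return P / P.sum(axis=0, keepdims=True)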
15
Fast Recognition Test Results
  • Techniques
  • Lattice rescoring: average of 511 nodes per lattice
  • Shortlists (2000 words): 90% prediction coverage
  • 3.8M 4-grams requested, 3.4M processed by the NN
  • Regrouping: only 1M forward passes required
  • Block mode: bunch size of 128
  • CPU optimization
  • Total processing time under 9 minutes (0.03 x RT)
  • Without the optimizations, 10x slower

16
Fast Training
  • Techniques
  • Parallel implementations
  • Full connections require low latency, which is very costly
  • Resampling techniques
  • Optimal floating-point performance requires contiguous memory locations

17
Fast Training
  • Techniques
  • Floating-point precision: 1.5x faster
  • Suppressing internal calculations: 1.3x faster
  • Bunch mode: 10x faster
  • Forward and back propagation for many examples at once
  • Multiprocessing: 1.5x faster
  • 47 hours reduced to 1 hour 27 minutes with a bunch size of 128

18
Application to CTS and BN LVCSR
19
Application to ASR
  • Neural net LM techniques focus on CTS because
  • There is far less in-domain training data, leading to data sparsity
  • The NN can only handle a small amount of training data
  • New Fisher CTS data: 20M words (vs. 7M)
  • BN data: 500M words

20
Application to CTS
  • Baseline: train standard backoff LMs for each domain and then
    interpolate them
  • Experiment 1: interpolate the CTS neural net LM with the in-domain
    back-off LM
  • Experiment 2: interpolate the CTS neural net LM with the full-data
    back-off LM
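The interpolation itself is a weighted sum of the two models' probabilities;
a minimal sketch (the weight lam is illustrative and would normally be tuned
on held-out data, e.g. by EM):

    def interpolate(p_nn, p_backoff, lam=0.5):
        """Linearly interpolate the neural net LM with a back-off LM for the
        same word and history."""
        return lam * p_nn + (1.0 - lam) * p_backoff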

21
Application to CTS - PPL
  • Baseline: train standard backoff LMs for each domain and then
    interpolate them
  • In-domain PPL: 50.1; full-data PPL: 47.5
  • Experiment 1: interpolate the CTS neural net LM with the in-domain
    back-off LM
  • In-domain PPL: 45.5
  • Experiment 2: interpolate the CTS neural net LM with the full-data
    back-off LM
  • Full-data PPL: 44.2
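For reference, the perplexity figures above come from the per-word log
probabilities of the (interpolated) model; a minimal sketch:

    import math

    def perplexity(log_probs):
        """PPL = exp(-(1/N) * sum_i log P(w_i | h_i)) over N scored words."""
        return math.exp(-sum(log_probs) / len(log_probs))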

22
Application to CTS - WER
  • Baseline: train standard backoff LMs for each domain and then
    interpolate them
  • In-domain WER: 19.9; full-data WER: 19.3
  • Experiment 1: interpolate the CTS neural net LM with the in-domain
    back-off LM
  • In-domain WER: 19.1
  • Experiment 2: interpolate the CTS neural net LM with the full-data
    back-off LM
  • Full-data WER: 18.8

23
Application to BN
  • Only a subset of the 500M available words could be used for NN
    training: a 27M-word training set
  • Still useful:
  • The NN LM gave a 12% PPL gain over the backoff LM trained on the small
    27M-word set
  • The NN LM gave a 4% PPL gain over the backoff LM trained on the full
    500M-word training set
  • Overall WER reduction of 0.3 absolute

24
Conclusion
  • Neural net LMs provide significant improvements in PPL and WER
  • The optimizations speed up NN training by 20x and allow lattice
    rescoring in less than 0.05 x RT
  • While the NN LM was developed for and works best on CTS, gains were
    found on the BN task as well