Title: An Alternative Approach of Finding Competing Hypotheses for Better Minimum Classification Error Training
Slide 1: An Alternative Approach of Finding Competing Hypotheses for Better Minimum Classification Error Training
Mr. Yik-Cheung Tam, Dr. Brian Mak
Slide 2: Outline
- Motivation
- Overview of MCE training
- Problem using N-best hypotheses
- Alternative: 1-nearest hypothesis
  - What?
  - Why?
  - How?
- Evaluation
- Conclusion
Slide 3: MCE Overview
- The MCE loss function is built on a distance (misclassification) measure d(X).
- The competing-hypothesis score G(X) may be computed from the N-best hypotheses.
- l(.): a 0-1 soft error-counting function (sigmoid).
- Gradient descent is used to obtain a better parameter estimate (a sketch of the formulation follows this list).
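For reference, a sketch of the standard MCE formulation that the slide's d(X), G(X), and l(.) refer to; here η is the usual smoothing parameter and γ and θ are the sigmoid slope and offset that appear on later slides. The exact form used in the talk may differ slightly:

```latex
% Misclassification measure: correct-string score g_c versus a
% smoothed score over the N-best competing hypotheses (eta > 0).
\[
d(X) = -g_c(X;\Lambda)
     + \log\Big[\frac{1}{N}\sum_{k=1}^{N} e^{\eta\, g_k(X;\Lambda)}\Big]^{1/\eta}
\]
% Soft error count: sigmoid with slope gamma and offset theta.
\[
\ell(d) = \frac{1}{1 + e^{-\gamma d + \theta}}
\]
```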
Slide 4: Problem Using N-best Hypotheses
- When d(X) gets large enough, it falls out of the steep, trainable region of the sigmoid, so the token contributes almost no gradient (see below).
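Why saturation stops training: the sigmoid's gradient with respect to the distance vanishes once |d| is large, so saturated tokens barely move the model parameters:

```latex
\[
\frac{\partial \ell}{\partial d}
  = \gamma\,\ell(d)\bigl(1-\ell(d)\bigr)
  \;\to\; 0 \quad \text{as } |d| \to \infty
\]
```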
Slide 5: What is the 1-nearest Hypothesis?
- The competing hypothesis whose score is closest to that of the correct transcription, so d(1-nearest) < d(1-best).
- The idea can be generalized to N-nearest hypotheses (a compact definition follows this list).
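One way to state the definition compactly, in my own notation: g_k is the log-likelihood of hypothesis W_k, W_c is the correct string, and "1-best" is taken here to mean the top-scoring incorrect hypothesis:

```latex
\[
W_{\text{1-nearest}}
 = \operatorname*{arg\,min}_{W_k \ne W_c}
   \bigl|\, g_k(X;\Lambda) - g_c(X;\Lambda) \,\bigr|,
\qquad
W_{\text{1-best}}
 = \operatorname*{arg\,max}_{W_k \ne W_c}\, g_k(X;\Lambda)
\]
```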
Slide 6: Using the 1-nearest Hypothesis
- Keeps the training data inside the steep, trainable region of the sigmoid.
(Figure: sigmoid curve with the trainable region marked.)
Slide 7: How to Find the 1-nearest Hypothesis?
- Method 1 (exact approach):
  - Stack-based N-best decoder.
  - Drawback: N may be very large → memory problems, so the size of N must be limited.
- Method 2 (approximate approach):
  - Modify the Viterbi algorithm with a special pruning scheme (sketched in code below).
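A minimal, illustrative sketch of Method 2's flavor, assuming a discrete-emission HMM and assuming the pruning rule keeps a partial path only while its accumulated log score stays within Beam(t) of the correct string's Viterbi score at the same frame (matching the notation on the next slide; the paper's exact scheme may differ). A real decoder would also have to discard a surviving path whose word sequence equals the correct transcription:

```python
import numpy as np

def nearest_hypothesis_viterbi(log_a, log_b, obs, v_correct, beam):
    """Viterbi search pruned around the correct string's score.

    log_a:     (S, S) log transition probabilities
    log_b:     (S, V) log emission probabilities (discrete symbols)
    obs:       observation sequence (symbol indices)
    v_correct: v_correct[t] = accumulated Viterbi score of the correct
               string at frame t (from a forced alignment)
    beam:      beam[t] = beam width applied at frame t
    Returns (state_path, score) of the survivor nearest the correct score.
    """
    S, T = log_a.shape[0], len(obs)
    V = np.full((T, S), -np.inf)          # V[t, j]: best score ending in j
    back = np.zeros((T, S), dtype=int)
    V[0] = log_b[:, obs[0]]               # flat initial prior (assumption)
    for t in range(1, T):
        for j in range(S):
            scores = V[t - 1] + log_a[:, j]
            i = int(np.argmax(scores))
            cand = scores[i] + log_b[j, obs[t]]
            # Assumed pruning rule: keep the path only while it stays
            # within Beam(t) of the correct string's accumulated score.
            if abs(cand - v_correct[t]) <= beam[t]:
                V[t, j] = cand
                back[t, j] = i
    alive = np.flatnonzero(np.isfinite(V[-1]))
    if alive.size == 0:
        return None, None                 # beam too tight: nothing survived
    # Among survivors, pick the final score nearest the correct one.
    j_end = int(alive[np.argmin(np.abs(V[-1, alive] - v_correct[-1]))])
    path, j = [j_end], j_end
    for t in range(T - 1, 0, -1):
        j = int(back[t, j])
        path.append(j)
    return path[::-1], float(V[-1, j_end])
```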
Slide 8: Approximated 1-nearest Hypothesis
- Notation (used in the recursion below):
  - V(t+1, j): accumulated score at time t+1 and state j
  - a_ij: transition probability from state i to state j
  - b_j(o_{t+1}): observation probability at time t+1 in state j
  - V_c(t+1): accumulated score of the Viterbi path of the correct string at time t+1
  - Beam(t+1): beam width applied at time t+1
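With this notation, the standard log-domain Viterbi recursion, together with the pruning test that I am assuming implements the "special pruning scheme" (the precise rule is the paper's; this is a plausible reading of the slide):

```latex
% Log-domain Viterbi recursion over the full search space:
\[
V(t+1, j) = \max_i \bigl[ V(t, i) + \log a_{ij} \bigr] + \log b_j(o_{t+1})
\]
% Assumed pruning test: state j survives at time t+1 only if
\[
\bigl|\, V(t+1, j) - V_c(t+1) \,\bigr| \;\le\; \mathrm{Beam}(t+1)
\]
```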
Slide 9: Approximated 1-nearest Hypothesis (cont.)
- There exists some nearest path within the pruned search space (the shaded area of the slide's figure).
Slide 10: System Evaluation
Slide 11: Corpus: Aurora
- Aurora: noisy connected digits derived from TIDIGITS.
- Multi-condition training (training on noisy conditions):
  - 4 noise types (subway, babble, car, exhibition) × 5 noise levels (clean, 20, 15, 10, 5 dB SNR)
  - 8,440 training utterances.
- Testing (matched noisy conditions):
  - Same conditions as above plus 0 dB and -5 dB (7 noise levels)
  - 28,028 testing utterances.
Slide 12: System Configuration
- Standard 39-dimensional MFCCs (cepstra + deltas + delta-deltas).
- 11 whole-word digit HMMs (0-9, "oh"):
  - 16 states, 3 Gaussians per state.
- 3-state silence HMM, 6 Gaussians per state.
- 1-state short-pause HMM tied to the 2nd state of the silence model.
- Baum-Welch training to obtain the initial HMMs.
- Corrective MCE training of the HMM parameters (generic update sketched below).
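The corrective MCE step is gradient descent on the sigmoid loss; written generically (ε is the learning rate, set to 0.05 or 1.25 on later slides; the talk may use a parameter-specific variant):

```latex
\[
\Lambda_{n+1} \;=\; \Lambda_n \;-\; \epsilon\,
\frac{\partial\, \ell\bigl(d(X;\Lambda)\bigr)}{\partial \Lambda}
\bigg|_{\Lambda=\Lambda_n}
\]
```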
Slide 13: System Configuration (cont.)
- Three kinds of competing hypotheses are compared:
  - 1-best hypothesis
  - Exact 1-nearest hypothesis
  - Approx. 1-nearest hypothesis
- Sigmoid parameters:
  - Various slopes γ (γ controls the steepness of the sigmoid)
  - Offset fixed at 0
Slide 14: Experiment I: Effect of the Sigmoid Slope
- Learning rate 0.05, with different sigmoid slopes γ:
  - γ = 0.1 (best test performance)
  - γ = 0.5 (steeper)
  - γ = 0.02 and γ = 0.004 (flatter)
- Word error rates:

  System              WER (%)
  Baseline            12.71
  1-best              11.01
  Approx. 1-nearest   10.71
  Exact 1-nearest     10.45
Slide 15: Effective Amount of Training Data
- A training token with soft error l(d) < 0.95 is defined to be effective.
- The 1-nearest approaches retain more effective training data when the sigmoid slope is relatively steep (a toy computation follows this list):
  - Exact 1-nearest: 67%
  - Approx. 1-nearest: 51%
  - 1-best: 40%
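A trivial illustration of this bookkeeping (the helper names are mine; the 0.95 threshold is from the slide):

```python
import math

def soft_error(d, gamma, theta=0.0):
    """Sigmoid soft error count l(d) = 1 / (1 + exp(-gamma * d + theta))."""
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))

def effective_fraction(distances, gamma, threshold=0.95):
    """Fraction of training tokens whose soft error stays below the
    threshold, i.e. tokens that still receive a usable gradient."""
    eff = sum(1 for d in distances if soft_error(d, gamma) < threshold)
    return eff / len(distances)

# Example: with a steep sigmoid (gamma = 0.5), tokens with large d(X)
# saturate toward l(d) = 1 and stop being effective.
print(effective_fraction([-10.0, -1.0, 0.5, 8.0, 40.0], gamma=0.5))
```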
Slide 16: Experiment II: Compensation With More Training Iterations
- With 100% effective training data (γ = 0.004, learning rate 0.05), apply more training iterations.
- Result: improvement is slow compared with the best case.
(Figure: convergence curves; the reference curve is exact 1-nearest with γ = 0.1.)
Slide 17: Experiment II: Compensation Using a Larger Learning Rate
- Use a larger learning rate (0.05 → 1.25).
- Fix γ = 0.004 (100% effective training data).
- Result: the 1-nearest approaches beat the 1-best approach after compensation.

  System              WER (%) before compensation   WER (%) after compensation
  Baseline            12.71                         12.71
  1-best              12.07                         11.55
  Approx. 1-nearest   12.27                         10.70
  Exact 1-nearest     12.16                         10.79
Slide 18: Using a Larger Learning Rate (cont.)
- Training performance: MCE loss versus the number of training iterations.
(Figure: MCE loss curves for 1-best, approx. 1-nearest, and exact 1-nearest.)
Slide 19: Using a Larger Learning Rate (cont.)
- Test performance: WER versus the number of training iterations.
(Figure: WER curves; final WERs: 1-best 11.55%, approx. 1-nearest 10.70%, exact 1-nearest 10.79%.)
Slide 20: Conclusion
- The 1-best and 1-nearest methods were compared in MCE training:
  - Effect of the sigmoid slope.
  - Compensation when using a flat sigmoid.
- The 1-nearest method is better than the 1-best approach:
  - More trainable data are available with the 1-nearest approach.
- The approximate and exact 1-nearest methods yield comparable performance.
Slide 21: Questions and Answers