1
An Alternative Approach of Finding Competing
Hypotheses for Better Minimum Classification
Error Training
Mr. Yik-Cheung Tam and Dr. Brian Mak
2
Outline
  • Motivation
  • Overview of MCE training
  • Problem using N-best hypotheses
  • Alternative: 1-nearest hypothesis
  • What?
  • Why?
  • How?
  • Evaluation
  • Conclusion

3
MCE Overview
  • The MCE loss function (reconstructed below)
  • Distance measure d(X)
  • G(X) may be computed using the N-best hypotheses.
  • l(·): a 0-1 soft error-counting function (sigmoid)
  • A gradient descent method is used to obtain a better
    estimate.
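The equations on this slide were images that did not survive the
transcript; in the standard MCE formulation (due to Juang and
Katagiri), which matches the quantities named above, they read

    d(X) = -g_{W_0}(X) + G(X)

    G(X) = \frac{1}{\eta} \log \Big[ \frac{1}{N} \sum_{k=1}^{N}
           \exp\big( \eta \, g_{W_k}(X) \big) \Big]

    \ell(d(X)) = \frac{1}{1 + \exp(-\gamma \, d(X) + \theta)}

where g_{W_0}(X) is the log-likelihood of the correct string W_0,
W_1, ..., W_N are the N-best competing hypotheses, and \gamma and
\theta are the slope and offset of the sigmoid.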

4
Problem Using N-best Hypotheses
  • When d(X) gets large enough, it falls out of the steep
    trainable region of the sigmoid, so the utterance contributes
    almost no gradient (illustrated below).
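A small Python sketch (our own illustration; gamma = 0.1 matches the
experiments later in the deck) of why this matters: the gradient of
the sigmoid, gamma * l * (1 - l), is nearly zero once d(X) leaves the
steep region, so such utterances barely move the model parameters.

    import numpy as np

    def soft_error(d, gamma=0.1, theta=0.0):
        # 0-1 soft error count: sigmoid of the misclassification distance d
        return 1.0 / (1.0 + np.exp(-gamma * d + theta))

    def soft_error_grad(d, gamma=0.1, theta=0.0):
        # dl/dd = gamma * l * (1 - l); vanishes for large |d|
        l = soft_error(d, gamma, theta)
        return gamma * l * (1.0 - l)

    for d in (0.0, 20.0, 100.0, 500.0):
        print(f"d = {d:5.1f}   gradient = {soft_error_grad(d):.2e}")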

5
What is 1-nearest Hypothesis?
  • d(1-nearest) < d(1-best) (one possible formalization below)
  • The idea can be generalized to N-nearest
    hypotheses.
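On one reading (an assumption on our part; the slide's formula was
lost), the 1-nearest hypothesis is the competing string whose score
is closest to the correct transcription's:

    W_{1\text{-nearest}} = \arg\min_{W \neq W_0}
        \; \lvert g_W(X) - g_{W_0}(X) \rvert

so its distance d stays small and inside the sigmoid's steep region,
whereas the 1-best competitor maximizes g_W(X) and can yield a
large d.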

6
Using 1-nearest Hypothesis
  • Keep the training data inside the steep trainable
    region.

[Figure: sigmoid curve with the steep trainable region marked]
7
How to Find 1-nearest Hypothesis?
  • Method 1 (exact approach)
  • Stack-based N-best decoder
  • Drawback
  • N may be very large -> memory problem
  • Need to limit the size of N.
  • Method 2 (approximated approach)
  • Modify the Viterbi algorithm with a special
    pruning scheme.

8
Approximated 1-nearest Hypothesis
  • Notation (used in the sketch below)
  • V(t+1, j): accumulated score at time t+1 and state j
  • a_ij: transition probability from state i to state j
  • b_j(o_{t+1}): observation probability at time t+1 and
    state j
  • V*(t+1): accumulated score of the Viterbi path of the
    correct string at time t+1
  • Beam(t+1): beam width applied at time t+1
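A minimal sketch of one plausible reading of this pruning scheme (our
assumption, not necessarily the authors' exact algorithm): at every
frame, states whose accumulated score falls more than Beam(t+1) below
the correct string's Viterbi score V*(t+1) are pruned, so the
surviving competing paths stay close in score to the correct path.

    import numpy as np

    def nearest_viterbi(log_obs, log_trans, v_star, beam):
        """Viterbi search pruned against the correct string's score.

        log_obs[t, j]   : log b_j(o_t), observation log-probability
        log_trans[i, j] : log a_ij, transition log-probability
        v_star[t]       : V*(t), Viterbi score of the correct string
        beam[t]         : Beam(t), allowed gap below V*(t)
        """
        T, S = log_obs.shape
        V = np.full((T, S), -np.inf)
        back = np.zeros((T, S), dtype=int)
        V[0] = log_obs[0]                      # uniform initial states assumed
        for t in range(1, T):
            for j in range(S):
                scores = V[t - 1] + log_trans[:, j]
                back[t, j] = int(np.argmax(scores))
                V[t, j] = scores[back[t, j]] + log_obs[t, j]
                if V[t, j] < v_star[t] - beam[t]:
                    V[t, j] = -np.inf          # prune: too far from correct path
        # Backtrace from the best surviving final state.
        path = [int(np.argmax(V[-1]))]
        for t in range(T - 1, 0, -1):
            path.append(int(back[t, path[-1]]))
        return path[::-1], float(np.max(V[-1]))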

9
Approximated 1-nearest Hypothesis (cont'd)
  • There exists some nearest path in the pruned search space
    (the shaded area of the original figure).

10
System Evaluation
11
Corpus: Aurora
  • Aurora
  • Noisy connected digits derived from TIDIGITS.
  • Multi-condition training (train on noisy conditions)
  • 4 noise types (subway, babble, car, exhibition) x 5 noise
    levels (clean, 20, 15, 10, 5 dB SNR)
  • 8,440 training utterances.
  • Testing (test on matched noisy conditions)
  • Same as above, except with additional samples at 0 and
    -5 dB (7 noise levels)
  • 28,028 testing utterances.

12
System Configuration
  • Standard 39-dimension MFCCs (cepstra + deltas +
    delta-deltas)
  • 11 whole-word digit HMMs (0-9, "oh")
  • 16 states, 3 Gaussians per state
  • 3-state silence HMM, 6 Gaussians per state
  • 1-state short-pause HMM tied to the 2nd state of
    the silence model.
  • Baum-Welch training to obtain the initial HMMs.
  • Corrective MCE training on the HMM parameters
    (summarized below).
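Restated as a compact configuration sketch (the dictionary and its
field names are ours, purely illustrative):

    aurora_config = {
        "features": "39-dim MFCC: 13 cepstra + deltas + delta-deltas",
        "digit_hmms": {                    # whole-word models for 0-9 and "oh"
            "count": 11,
            "states_per_model": 16,
            "gaussians_per_state": 3,
        },
        "silence_hmm": {"states": 3, "gaussians_per_state": 6},
        "short_pause_hmm": {"states": 1,
                            "tied_to": "2nd state of the silence model"},
        "training": ["Baum-Welch for the initial HMMs",
                     "corrective MCE on the HMM parameters"],
    }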

13
System Configuration (cont'd)
  • Compare 3 kinds of competing hypotheses
  • 1-best hypothesis
  • Exact 1-nearest hypothesis
  • Approx. 1-nearest hypothesis
  • Sigmoid parameters
  • Various γ (controls the slope of the sigmoid)
  • Offset = 0

14
Experiment I: Effect of the Sigmoid Slope
  • Learning rate 0.05, with different γ:
  • γ = 0.1 (best test performance)
  • γ = 0.5 (steeper)
  • γ = 0.02, 0.004 (flatter)

System               WER (%)
Baseline             12.71
1-best               11.01
Approx. 1-nearest    10.71
Exact 1-nearest      10.45
15
Effective Amount of Training Data
  • A training token with soft error < 0.95 is defined to be
    effective.
  • The 1-nearest approaches retain more effective training
    data when the sigmoid slope is relatively steep.

[Figure: effective training data - exact 1-nearest (67%),
approx. 1-nearest (51%), 1-best (40%)]
16
Experiment II: Compensation with More Training
Iterations
  • With 100% effective training data, apply more
    training iterations
  • γ = 0.004, learning rate 0.05
  • Result: slow improvement compared to the best case
    (exact 1-nearest with γ = 0.1).
17
Experiment II: Compensation Using a Larger
Learning Rate
  • Use a larger learning rate (0.05 -> 1.25)
  • Fix γ = 0.004 (100% effective training data)
  • Result: the 1-nearest approaches are better than the
    1-best approach after compensation.

System               WER (%), before    WER (%), after
Baseline             12.71              12.71
1-best               12.07              11.55
Approx. 1-nearest    12.27              10.70
Exact 1-nearest      12.16              10.79
18
Using a Larger Learning Rate (cont'd)
  • Training performance: MCE loss versus the number of
    training iterations.

[Figure: MCE loss vs. training iterations for approx. 1-nearest,
1-best, and exact 1-nearest]
19
Using a Larger Learning Rate (cont'd)
  • Test performance: WER versus the number of training
    iterations.

[Figure: WER vs. training iterations - 1-best (11.55%),
approx. 1-nearest (10.70%), exact 1-nearest (10.79%)]
20
Conclusion
  • The 1-best and 1-nearest methods were compared in MCE
    training
  • the effect of the sigmoid slope
  • compensation when using a flat sigmoid
  • The 1-nearest method is better than the 1-best approach.
  • More trainable data are available in the 1-nearest
    approach.
  • The approximated and exact 1-nearest methods yield
    comparable performance.

21
Questions and Answers