Title: An Alternative Approach of Finding Competing Hypotheses for Better Minimum Classification Error Training
Slide 1: An Alternative Approach of Finding Competing Hypotheses for Better Minimum Classification Error Training
Mr. Yik-Cheung Tam, Dr. Brian Mak
Slide 2: Outline
- Motivation
- Overview of MCE training
- Problem using N-best hypotheses
- Alternative: 1-nearest hypothesis
  - What?
  - Why?
  - How?
- Evaluation
- Conclusion
Slide 3: MCE Overview
- The MCE loss function is built on a distance (misclassification) measure d(X).
- The competing-hypothesis score G(X) may be computed from the N-best hypotheses.
- l(.): a 0-1 soft error-counting function (sigmoid).
- Gradient descent is used to obtain a better parameter estimate (a sketch of the formulation follows this list).
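For reference, a sketch of the standard MCE formulation that the slide's d(X), G(X), and l(.) refer to; here η is the usual smoothing parameter and γ and θ are the sigmoid slope and offset that appear on later slides. The exact form used in the talk may differ slightly:

```latex
% Misclassification measure: correct-string score g_c versus a
% smoothed score over the N-best competing hypotheses (eta > 0).
\[
d(X) = -g_c(X;\Lambda)
     + \log\Big[\frac{1}{N}\sum_{k=1}^{N} e^{\eta\, g_k(X;\Lambda)}\Big]^{1/\eta}
\]
% Soft error count: sigmoid with slope gamma and offset theta.
\[
\ell(d) = \frac{1}{1 + e^{-\gamma d + \theta}}
\]
```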
Slide 4: Problem Using N-best Hypotheses
- When d(X) gets large enough, it falls out of the steep, trainable region of the sigmoid, so the token contributes almost no gradient (see below).
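Why saturation stops training: the sigmoid's gradient with respect to the distance vanishes once |d| is large, so saturated tokens barely move the model parameters:

```latex
\[
\frac{\partial \ell}{\partial d}
  = \gamma\,\ell(d)\bigl(1-\ell(d)\bigr)
  \;\to\; 0 \quad \text{as } |d| \to \infty
\]
```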
Slide 5: What is the 1-nearest Hypothesis?
- The competing hypothesis whose score is closest to that of the correct transcription, so d(1-nearest) < d(1-best).
- The idea can be generalized to N-nearest hypotheses (a compact definition follows this list).
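One way to state the definition compactly, in my own notation: g_k is the log-likelihood of hypothesis W_k, W_c is the correct string, and "1-best" is taken here to mean the top-scoring incorrect hypothesis:

```latex
\[
W_{\text{1-nearest}}
 = \operatorname*{arg\,min}_{W_k \ne W_c}
   \bigl|\, g_k(X;\Lambda) - g_c(X;\Lambda) \,\bigr|,
\qquad
W_{\text{1-best}}
 = \operatorname*{arg\,max}_{W_k \ne W_c}\, g_k(X;\Lambda)
\]
```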
Slide 6: Using the 1-nearest Hypothesis
- Keeps the training data inside the steep, trainable region of the sigmoid.
(Figure: sigmoid curve with the trainable region marked.)
Slide 7: How to Find the 1-nearest Hypothesis?
- Method 1 (exact approach):
  - Stack-based N-best decoder.
  - Drawback: N may be very large → memory problems, so the size of N must be limited.
- Method 2 (approximate approach):
  - Modify the Viterbi algorithm with a special pruning scheme (sketched in code below).
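A minimal, illustrative sketch of Method 2's flavor, assuming a discrete-emission HMM and assuming the pruning rule keeps a partial path only while its accumulated log score stays within Beam(t) of the correct string's Viterbi score at the same frame (matching the notation on the next slide; the paper's exact scheme may differ). A real decoder would also have to discard a surviving path whose word sequence equals the correct transcription:

```python
import numpy as np

def nearest_hypothesis_viterbi(log_a, log_b, obs, v_correct, beam):
    """Viterbi search pruned around the correct string's score.

    log_a:     (S, S) log transition probabilities
    log_b:     (S, V) log emission probabilities (discrete symbols)
    obs:       observation sequence (symbol indices)
    v_correct: v_correct[t] = accumulated Viterbi score of the correct
               string at frame t (from a forced alignment)
    beam:      beam[t] = beam width applied at frame t
    Returns (state_path, score) of the survivor nearest the correct score.
    """
    S, T = log_a.shape[0], len(obs)
    V = np.full((T, S), -np.inf)          # V[t, j]: best score ending in j
    back = np.zeros((T, S), dtype=int)
    V[0] = log_b[:, obs[0]]               # flat initial prior (assumption)
    for t in range(1, T):
        for j in range(S):
            scores = V[t - 1] + log_a[:, j]
            i = int(np.argmax(scores))
            cand = scores[i] + log_b[j, obs[t]]
            # Assumed pruning rule: keep the path only while it stays
            # within Beam(t) of the correct string's accumulated score.
            if abs(cand - v_correct[t]) <= beam[t]:
                V[t, j] = cand
                back[t, j] = i
    alive = np.flatnonzero(np.isfinite(V[-1]))
    if alive.size == 0:
        return None, None                 # beam too tight: nothing survived
    # Among survivors, pick the final score nearest the correct one.
    j_end = int(alive[np.argmin(np.abs(V[-1, alive] - v_correct[-1]))])
    path, j = [j_end], j_end
    for t in range(T - 1, 0, -1):
        j = int(back[t, j])
        path.append(j)
    return path[::-1], float(V[-1, j_end])
```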
Slide 8: Approximated 1-nearest Hypothesis
- Notation (used in the recursion below):
  - V(t+1, j): accumulated score at time t+1 and state j
  - a_ij: transition probability from state i to state j
  - b_j(o_{t+1}): observation probability at time t+1 in state j
  - V_c(t+1): accumulated score of the Viterbi path of the correct string at time t+1
  - Beam(t+1): beam width applied at time t+1
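With this notation, the standard log-domain Viterbi recursion, together with the pruning test that I am assuming implements the "special pruning scheme" (the precise rule is the paper's; this is a plausible reading of the slide):

```latex
% Log-domain Viterbi recursion over the full search space:
\[
V(t+1, j) = \max_i \bigl[ V(t, i) + \log a_{ij} \bigr] + \log b_j(o_{t+1})
\]
% Assumed pruning test: state j survives at time t+1 only if
\[
\bigl|\, V(t+1, j) - V_c(t+1) \,\bigr| \;\le\; \mathrm{Beam}(t+1)
\]
```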
Slide 9: Approximated 1-nearest Hypothesis (cont.)
- There exists some nearest path within the pruned search space (the shaded area of the slide's figure).
Slide 10: System Evaluation
Slide 11: Corpus: Aurora
- Aurora: noisy connected digits derived from TIDIGITS.
- Multi-condition training (training on noisy conditions):
  - 4 noise types (subway, babble, car, exhibition) × 5 noise levels (clean, 20, 15, 10, 5 dB SNR)
  - 8,440 training utterances.
- Testing (matched noisy conditions):
  - Same conditions as above plus 0 dB and -5 dB (7 noise levels)
  - 28,028 testing utterances.
Slide 12: System Configuration
- Standard 39-dimensional MFCCs (cepstra + deltas + delta-deltas).
- 11 whole-word digit HMMs (0-9, "oh"):
  - 16 states, 3 Gaussians per state.
- 3-state silence HMM, 6 Gaussians per state.
- 1-state short-pause HMM tied to the 2nd state of the silence model.
- Baum-Welch training to obtain the initial HMMs.
- Corrective MCE training of the HMM parameters (generic update sketched below).
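The corrective MCE step is gradient descent on the sigmoid loss; written generically (ε is the learning rate, set to 0.05 or 1.25 on later slides; the talk may use a parameter-specific variant):

```latex
\[
\Lambda_{n+1} \;=\; \Lambda_n \;-\; \epsilon\,
\frac{\partial\, \ell\bigl(d(X;\Lambda)\bigr)}{\partial \Lambda}
\bigg|_{\Lambda=\Lambda_n}
\]
```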
Slide 13: System Configuration (cont.)
- Three kinds of competing hypotheses are compared:
  - 1-best hypothesis
  - Exact 1-nearest hypothesis
  - Approx. 1-nearest hypothesis
- Sigmoid parameters:
  - Various slopes γ (γ controls the steepness of the sigmoid)
  - Offset fixed at 0
Slide 14: Experiment I: Effect of the Sigmoid Slope
- Learning rate 0.05, with different sigmoid slopes γ:
  - γ = 0.1 (best test performance)
  - γ = 0.5 (steeper)
  - γ = 0.02 and γ = 0.004 (flatter)
- Word error rates:

  System              WER (%)
  Baseline            12.71
  1-best              11.01
  Approx. 1-nearest   10.71
  Exact 1-nearest     10.45
Slide 15: Effective Amount of Training Data
- A training token with soft error l(d) < 0.95 is defined to be effective.
- The 1-nearest approaches retain more effective training data when the sigmoid slope is relatively steep (a toy computation follows this list):
  - Exact 1-nearest: 67%
  - Approx. 1-nearest: 51%
  - 1-best: 40%
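A trivial illustration of this bookkeeping (the helper names are mine; the 0.95 threshold is from the slide):

```python
import math

def soft_error(d, gamma, theta=0.0):
    """Sigmoid soft error count l(d) = 1 / (1 + exp(-gamma * d + theta))."""
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))

def effective_fraction(distances, gamma, threshold=0.95):
    """Fraction of training tokens whose soft error stays below the
    threshold, i.e. tokens that still receive a usable gradient."""
    eff = sum(1 for d in distances if soft_error(d, gamma) < threshold)
    return eff / len(distances)

# Example: with a steep sigmoid (gamma = 0.5), tokens with large d(X)
# saturate toward l(d) = 1 and stop being effective.
print(effective_fraction([-10.0, -1.0, 0.5, 8.0, 40.0], gamma=0.5))
```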
Slide 16: Experiment II: Compensation With More Training Iterations
- With 100% effective training data (γ = 0.004, learning rate 0.05), apply more training iterations.
- Result: improvement is slow compared with the best case.
(Figure: convergence curves; the reference curve is exact 1-nearest with γ = 0.1.)
Slide 17: Experiment II: Compensation Using a Larger Learning Rate
- Use a larger learning rate (0.05 → 1.25).
- Fix γ = 0.004 (100% effective training data).
- Result: the 1-nearest approaches beat the 1-best approach after compensation.

  System              WER (%) before compensation   WER (%) after compensation
  Baseline            12.71                         12.71
  1-best              12.07                         11.55
  Approx. 1-nearest   12.27                         10.70
  Exact 1-nearest     12.16                         10.79
Slide 18: Using a Larger Learning Rate (cont.)
- Training performance: MCE loss versus the number of training iterations.
(Figure: MCE loss curves for 1-best, approx. 1-nearest, and exact 1-nearest.)
Slide 19: Using a Larger Learning Rate (cont.)
- Test performance: WER versus the number of training iterations.
(Figure: WER curves; final WERs: 1-best 11.55%, approx. 1-nearest 10.70%, exact 1-nearest 10.79%.)
Slide 20: Conclusion
- The 1-best and 1-nearest methods were compared in MCE training:
  - Effect of the sigmoid slope.
  - Compensation when using a flat sigmoid.
- The 1-nearest method is better than the 1-best approach:
  - More trainable data are available with the 1-nearest approach.
- The approximate and exact 1-nearest methods yield comparable performance.
Slide 21: Questions and Answers