Hyperparameter Estimation for Speech Recognition Based on Variational Bayesian Approach
Kei Hashimoto, Heiga Zen, Yoshihiko Nankaku, Akinobu Lee and Keiichi Tokuda (Nagoya Institute of Technology)
1. Introduction
- Recent speech recognition systems
  - ML (Maximum Likelihood) criterion
    → estimation accuracy is reduced when training data is limited
  - MDL (Minimum Description Length) criterion
    → based on an asymptotic approximation
- Variational Bayesian (VB) approach
  - Higher generalization ability
  - Appropriate model structures can be selected
  - However, performance depends on the hyperparameters of the prior distributions

Objective
- Estimate appropriate hyperparameters by maximizing the marginal likelihood
  → maximize the lower bound F w.r.t. the hyperparameters

3. Variational Bayesian Approach [Attias 1999]
- Approximate the true posterior distributions by the variational method
- Define a lower bound F on the log marginal likelihood
- Maximize F w.r.t. the variational posterior distributions

4. Hyperparameter Estimation
- Conventional: use monophone HMM state statistics
  → maximize F at the root node of the decision tree
- Proposed: use the statistics of all leaf nodes
  → maximize F of the whole tree structure
- Context clustering based on VB [Watanabe et al. 2002]
  - If the prior distributions have a tying structure
    → F works well for model selection
  - Otherwise
    → F increases monotonically as T increases

6. Experimental Results
- Relationships between F and recognition accuracy
  - F and recognition accuracy behaved similarly
  - The proposed technique gives a consistent improvement at a given value of F
  [Figure: F vs. recognition accuracy for the conventional and proposed methods, with the prior tied at the all / phone / state / leaf levels]
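The hyperparameter estimation above can be illustrated in a toy setting where the lower bound is tight and equals the log marginal likelihood: a 1-D Gaussian with known variance and a conjugate prior on its mean. Here `tau` (the prior pseudo-count) stands in for the poster's hyperparameter T; this is an assumed simplification, not the paper's Gauss-Wishart HMM case:

```python
import math

def log_evidence(data, mu0, tau, sigma2):
    """Log marginal likelihood of 1-D data under x ~ N(mu, sigma2),
    with conjugate prior mu ~ N(mu0, sigma2 / tau)."""
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)   # within-data scatter
    shrink = n * tau / (n + tau)              # prior-vs-data trade-off
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            + 0.5 * math.log(tau / (tau + n))
            - (ss + shrink * (xbar - mu0) ** 2) / (2 * sigma2))

def best_tau(data, mu0, sigma2, grid):
    """Empirical-Bayes choice: pick the hyperparameter maximizing the evidence."""
    return max(grid, key=lambda t: log_evidence(data, mu0, t, sigma2))
```

When the data agree with the prior mean the evidence favors a strong prior (large `tau`); when they disagree it favors a weak one, which is the behavior the maximization of F exploits.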
Tying structure of prior distributions
- Consider four kinds of tying structure: all / phone / state / leaf
  [Figure: phonetic decision tree; each node is split by a phonetic question Q (Yes / No)]
- Relationship between the tying structure and the amount of training data

2. Bayesian Framework
- Output probability distribution
  → Gaussian distribution
- Use a conjugate prior distribution
  → Gauss-Wishart distribution
- Model parameters are regarded as probabilistic variables, and inference is based on their posterior distributions
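Treating the model parameters as random variables with a conjugate prior makes the posterior available in closed form. As a minimal sketch (not the paper's full HMM setup), the one-dimensional analogue of the Gauss-Wishart prior is a Normal-Gamma pair; the function name and hyperparameter notation (`mu0`, `tau`, `a`, `b`) below are my own:

```python
def normal_gamma_update(data, mu0, tau, a, b):
    """Posterior hyperparameters for the 1-D analogue of the Gauss-Wishart prior:
    mean ~ N(mu0, 1/(tau*lam)), precision lam ~ Gamma(a, b) (conjugate pair).
    Returns the updated (mu_n, tau_n, a_n, b_n)."""
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)
    mu_n = (tau * mu0 + n * xbar) / (tau + n)  # prior mean pulled toward the data
    tau_n = tau + n                            # pseudo-count grows with the data
    a_n = a + n / 2
    b_n = b + 0.5 * ss + tau * n * (xbar - mu0) ** 2 / (2 * (tau + n))
    return mu_n, tau_n, a_n, b_n
```

The update shows why `tau` behaves like an amount of prior data: it is simply added to the observed count `n` when forming the posterior.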
- Advantages
  - Prior knowledge can be integrated
  - Model structures can be selected
  - Robust classification
- Disadvantage
  - Includes integral and expectation calculations
    → an effective approximation technique is required
- Likelihood function
  → proportional to a Gauss-Wishart distribution

- The appropriate tying structure of the prior distributions
  → depends on the amount of training data
- Define a new hyperparameter T representing the amount of prior data

5. Experimental Conditions

Database          JNAS (Japanese Newspaper Article Sentences)
Training data     20,000 / 2,500 / 200 sentences
Test data         100 sentences
Sampling rate     16 kHz
Window            Hamming window
Frame size/shift  25 ms / 10 ms
Feature vector    12-order MFCC + ΔMFCC + ΔEnergy (25 dimensions)

- Large training data set → tying few prior distributions is appropriate
- Small training data set → tying many prior distributions is appropriate
- VB clustering with an appropriate prior distribution improves the recognition performance
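The evidence-based context clustering can be sketched as a split test: a phonetic question is accepted only when the two child nodes raise F above the parent's. This is a simplified stand-in for the VB clustering of Watanabe et al. [2002], assuming 1-D Gaussian state statistics with known variance and one fixed prior shared by all nodes (the actual system uses Gauss-Wishart priors over HMM state statistics):

```python
import math

def log_evidence(data, mu0=0.0, tau=1.0, sigma2=1.0):
    """Log marginal likelihood of x ~ N(mu, sigma2) with mu ~ N(mu0, sigma2/tau)."""
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)
    return (-0.5 * n * math.log(2 * math.pi * sigma2)
            + 0.5 * math.log(tau / (tau + n))
            - (ss + (n * tau / (n + tau)) * (xbar - mu0) ** 2) / (2 * sigma2))

def accept_split(yes_data, no_data):
    """Accept a phonetic-question split when the two children explain the data
    better than the parent node, i.e. the evidence (F) increases."""
    parent = log_evidence(yes_data + no_data)
    return log_evidence(yes_data) + log_evidence(no_data) > parent
```

Because each child pays a prior (Occam) penalty, F does not grow just by adding leaves; it increases only when a split genuinely separates the data, which is the property that makes F usable for model selection when the priors are tied.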