A Bayesian Approach to HMM-Based Speech Synthesis

1
A Bayesian Approach to HMM-Based Speech Synthesis

Kei Hashimoto (1), Heiga Zen (1), Yoshihiko Nankaku (1), Takashi Masuko (2), and Keiichi Tokuda (1)
(1) Nagoya Institute of Technology
(2) Tokyo Institute of Technology

2
Background

HMM-based speech synthesis system
Spectrum, excitation and duration are modeled
Speech parameter seqs. are generated
Maximum likelihood (ML) criterion
Train HMMs and generate speech parameters
Point estimate ⇒ The over-fitting problem
Bayesian approach
Estimate posterior dist. of model parameters
Prior information can be used
⇒ Alleviate the over-fitting problem

3
Outline

Bayesian speech synthesis
Variational Bayesian method
Speech parameter generation
Bayesian context clustering
Prior distribution using cross validation
Experiments
Conclusion and future work

4
Bayesian speech synthesis (1/2)

Model training and speech synthesis

5
Bayesian speech synthesis (2/2)

Predictive distribution (marginal likelihood)

Variational Bayesian method [Attias, '99]
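
As a sketch of what such a predictive distribution looks like, in notation assumed here rather than taken from the slides: let O and S be the training data and its labels, o and s the synthesis data and its labels, Z and z the hidden state sequences, and Λ the model parameters. The predictive distribution is then a ratio of marginal likelihoods, with the parameters and state sequences integrated out instead of fixed to a point estimate.

```latex
% Assumed notation (not from the slides):
%   O, S = training data and labels;  o, s = synthesis data and labels;
%   Z, z = hidden state sequences;    \Lambda = model parameters.
\begin{align}
p(\mathbf{o} \mid \mathbf{s}, \mathbf{O}, \mathbf{S})
  &= \frac{p(\mathbf{o}, \mathbf{O} \mid \mathbf{s}, \mathbf{S})}
          {p(\mathbf{O} \mid \mathbf{S})}, \\
p(\mathbf{o}, \mathbf{O} \mid \mathbf{s}, \mathbf{S})
  &= \sum_{\mathbf{z}} \sum_{\mathbf{Z}} \int
     p(\mathbf{o}, \mathbf{z} \mid \mathbf{s}, \Lambda)\,
     p(\mathbf{O}, \mathbf{Z} \mid \mathbf{S}, \Lambda)\,
     p(\Lambda)\, d\Lambda .
\end{align}
```

The sum over state sequences combined with the integral over Λ is what makes this intractable in closed form, which motivates an approximation such as the variational Bayesian method.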
6
Variational Bayesian method (1/2)

Estimate approximate posterior dist.
⇒ Maximize a lower bound
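
The lower-bound idea can be checked numerically on a toy model. The sketch below assumes a single discrete hidden variable z and a fixed observation x, far simpler than an HMM; `log_evidence` and `lower_bound` are hypothetical helper names. It shows that any distribution q(z) gives a bound no larger than the log marginal likelihood, and that the bound is tight when q equals the true posterior.

```python
import math

def log_evidence(joint):
    """log p(x) = log sum_z p(x, z), with joint[z] = p(x, z) for a fixed x."""
    return math.log(sum(joint))

def lower_bound(joint, q):
    """F(q) = sum_z q(z) * log(p(x, z) / q(z)), obtained from Jensen's inequality."""
    return sum(
        qz * (math.log(pz) - math.log(qz))
        for pz, qz in zip(joint, q)
        if qz > 0.0  # terms with q(z) = 0 contribute nothing
    )

# Toy joint p(x, z) over three hidden states (illustrative numbers only).
joint = [0.1, 0.3, 0.2]
uniform_q = [1.0 / 3.0] * 3
posterior_q = [p / sum(joint) for p in joint]  # exact posterior p(z | x)

gap_uniform = log_evidence(joint) - lower_bound(joint, uniform_q)
gap_posterior = log_evidence(joint) - lower_bound(joint, posterior_q)
```

Here `gap_uniform` is strictly positive and `gap_posterior` is zero up to rounding, so maximizing the bound over q moves q toward the posterior.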

7
Variational Bayesian method (2/2)

Random variables are statistically independent
Optimal posterior distributions

Iterative updates, as in the EM algorithm
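
Under the independence assumption, the optimal factors take the standard mean-field form. The following is a sketch in assumed notation (O, S: training data and labels; o, s: synthesis data and labels; Z, z: hidden state sequences; Λ: model parameters; angle brackets denote expectation under the subscripted distribution), not a reproduction of the slide.

```latex
\begin{align}
Q(\mathbf{z}, \mathbf{Z}, \Lambda)
  &= Q(\mathbf{z})\, Q(\mathbf{Z})\, Q(\Lambda), \\
Q(\Lambda)
  &\propto p(\Lambda)\,
     \exp\!\Big(
       \big\langle \log p(\mathbf{o}, \mathbf{z} \mid \mathbf{s}, \Lambda) \big\rangle_{Q(\mathbf{z})}
       + \big\langle \log p(\mathbf{O}, \mathbf{Z} \mid \mathbf{S}, \Lambda) \big\rangle_{Q(\mathbf{Z})}
     \Big), \\
Q(\mathbf{z})
  &\propto \exp \big\langle \log p(\mathbf{o}, \mathbf{z} \mid \mathbf{s}, \Lambda) \big\rangle_{Q(\Lambda)}, \\
Q(\mathbf{Z})
  &\propto \exp \big\langle \log p(\mathbf{O}, \mathbf{Z} \mid \mathbf{S}, \Lambda) \big\rangle_{Q(\Lambda)} .
\end{align}
```

Each factor depends on expectations under the others, so they are updated in turn until the lower bound converges, much like alternating the E and M steps.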
8
Approximation for speech synthesis

Posterior dist. is dependent on synthesis data
⇒ Huge computational cost in the synthesis part
Ignore the dependency on synthesis data
⇒ Estimation from only training data

9
Prior distribution

Conjugate prior distribution
⇒ Posterior dist. belongs to the same family of dist. as the prior dist.
Determination using statistics of prior data
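
A minimal sketch of conjugacy, assuming a one-dimensional Gaussian with known unit variance and a Gaussian prior on its mean. The transcript does not show the actual output distributions or hyperparameters used, so `GaussianMeanPrior`, `prior_from_data`, and `posterior` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

@dataclass
class GaussianMeanPrior:
    """Gaussian over an unknown mean, for data with known unit variance."""
    count: float  # pseudo-count: how many observations the belief is worth
    mean: float   # current belief about the mean

def prior_from_data(prior_data, weight=1.0):
    """Set hyperparameters from statistics of prior data (assumed scheme)."""
    n = len(prior_data)
    return GaussianMeanPrior(count=weight * n, mean=sum(prior_data) / n)

def posterior(prior, data):
    """Conjugate update: the result has the same form as the prior."""
    n = len(data)
    sample_mean = sum(data) / n
    count = prior.count + n
    mean = (prior.count * prior.mean + n * sample_mean) / count
    return GaussianMeanPrior(count=count, mean=mean)

prior = prior_from_data([0.0, 2.0])              # count 2, mean 1
post = posterior(prior, [4.0, 4.0, 4.0, 4.0])    # count 6, mean between 1 and 4
```

Because the posterior is again a `GaussianMeanPrior`, it can serve directly as the prior for the next batch of data; the `weight` argument plays the role of a tuning parameter that controls how strongly the prior data is trusted.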

10
Speech parameter generation

Speech parameter
Consists of static and dynamic features
⇒ Only the static feature seq. is generated
Speech parameter generation based on the Bayesian approach
⇒ Maximize the lower bound
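
Generating only the static sequence can be sketched as a weighted least-squares problem. The code below assumes one-dimensional static features, a single delta window of 0.5 * (c[t+1] - c[t-1]) with zero padding at the edges, a fixed state alignment, and diagonal precisions; all function names are hypothetical. The `means` and `precisions` arguments stand for whatever the model supplies per frame for the static and delta features.

```python
def build_window_matrix(num_frames):
    """Rows 2t and 2t+1 map a static sequence c to [c_t, delta c_t]."""
    rows = []
    for t in range(num_frames):
        static = [0.0] * num_frames
        static[t] = 1.0
        delta = [0.0] * num_frames
        if t > 0:
            delta[t - 1] = -0.5
        if t < num_frames - 1:
            delta[t + 1] = 0.5
        rows.extend([static, delta])
    return rows

def solve(a, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def generate_static(means, precisions):
    """Solve (W' P W) c = W' P mu, with P = diag(precisions)."""
    num_frames = len(means) // 2
    w = build_window_matrix(num_frames)
    rows = range(len(w))
    a = [[sum(w[r][i] * precisions[r] * w[r][j] for r in rows)
          for j in range(num_frames)] for i in range(num_frames)]
    b = [sum(w[r][i] * precisions[r] * means[r] for r in rows)
         for i in range(num_frames)]
    return solve(a, b)
```

When the per-frame means are exactly consistent with some static sequence, `generate_static` recovers that sequence; otherwise the delta terms trade off against the static terms and smooth the trajectory.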

11
Relation between Bayes and ML

Compare with the ML criterion
Use of expectations of model parameters
Can be solved in the same fashion as ML

Output dist.
12
Outline

Bayesian speech synthesis
Variational Bayesian method
Speech parameter generation
Bayesian context clustering
Prior distribution using cross validation
Experiments
Conclusion and future work

13
Bayesian context clustering

Context clustering based on maximizing the lower bound
[Figure: decision tree with yes/no questions at each node]
⇒ Split node based on gain
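
Splitting a node on gain can be sketched with a toy score. The code below assumes one-dimensional data with known unit variance and a Gaussian prior on the mean, so the exact log marginal likelihood is available in closed form and stands in for the variational lower bound; real HMM states with hidden alignments would need the bound itself. The names `log_marginal` and `best_split`, and the prior hyperparameters, are assumptions for illustration.

```python
import math

def log_marginal(xs, prior_mean=0.0, prior_count=1.0):
    """Log marginal likelihood of xs under N(x | mu, 1), mu ~ N(prior_mean, 1/prior_count)."""
    n = len(xs)
    if n == 0:
        return 0.0
    xbar = sum(xs) / n
    scatter = sum((x - xbar) ** 2 for x in xs)
    shrink = prior_count * n / (prior_count + n)
    return (-0.5 * n * math.log(2.0 * math.pi)
            + 0.5 * math.log(prior_count / (prior_count + n))
            - 0.5 * (scatter + shrink * (xbar - prior_mean) ** 2))

def best_split(samples, questions):
    """samples: list of (context, value); questions: name -> predicate(context).

    Returns (question_name, gain), or (None, 0.0) if no split has positive gain.
    """
    parent = log_marginal([v for _, v in samples])
    best = (None, 0.0)
    for name, predicate in questions.items():
        yes = [v for c, v in samples if predicate(c)]
        no = [v for c, v in samples if not predicate(c)]
        if not yes or not no:
            continue  # a question that does not divide the node is useless
        gain = log_marginal(yes) + log_marginal(no) - parent
        if gain > best[1]:
            best = (name, gain)
    return best

questions = {
    "is_vowel": lambda c: c in "aiueo",
    "is_k": lambda c: c == "k",
}
samples = [("a", 5.0), ("o", 5.2), ("a", 4.8), ("k", -5.0), ("s", -5.1), ("k", -4.9)]
chosen, gain = best_split(samples, questions)
```

With this score, splitting stops on its own once no question yields positive gain, and how early it stops depends on the prior hyperparameters.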
14
Impact of prior distribution

Affects model selection, acting as tuning parameters
⇒ Requires a technique for determining the prior dist.
Conventional: maximize the marginal likelihood
Leads to the over-fitting problem, as with ML
Tuning parameters are still required
Technique for determining the prior distribution using cross validation [Hashimoto, '08]

15
Bayesian approach using CV

Prior distribution based on Cross Validation
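
One way to read a cross-validation-based prior is sketched below, under strong simplifying assumptions: one-dimensional data with known unit variance, a Gaussian prior on the mean, and hyperparameters for each fold set from the statistics of the remaining folds. This is an illustration of the idea, not the formulation from the cited work; `log_marginal` and `cv_log_marginal` are hypothetical names.

```python
import math

def log_marginal(xs, prior_mean, prior_count):
    """Log marginal likelihood of xs under N(x | mu, 1), mu ~ N(prior_mean, 1/prior_count)."""
    n = len(xs)
    xbar = sum(xs) / n
    scatter = sum((x - xbar) ** 2 for x in xs)
    shrink = prior_count * n / (prior_count + n)
    return (-0.5 * n * math.log(2.0 * math.pi)
            + 0.5 * math.log(prior_count / (prior_count + n))
            - 0.5 * (scatter + shrink * (xbar - prior_mean) ** 2))

def cv_log_marginal(xs, num_folds):
    """Sum over folds of the held-out log marginal likelihood.

    The prior for each held-out fold is built only from the other folds,
    so no data point helps set the prior it is then scored against.
    """
    folds = [xs[k::num_folds] for k in range(num_folds)]
    total = 0.0
    for k, held_out in enumerate(folds):
        rest = [x for j, fold in enumerate(folds) if j != k for x in fold]
        prior_mean = sum(rest) / len(rest)
        prior_count = float(len(rest))  # assumption: one pseudo-count per sample
        total += log_marginal(held_out, prior_mean, prior_count)
    return total

consistent = cv_log_marginal([1.0, 1.0, 1.0, 1.0], num_folds=2)
inconsistent = cv_log_marginal([1.0, 5.0, 1.0, 5.0], num_folds=2)
```

Data whose folds agree with each other scores higher than data whose folds disagree, and no hand-set hyperparameter appears in the score.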

16
Outline

Bayesian speech synthesis
Variational Bayesian method
Speech parameter generation
Bayesian context clustering
Prior distribution using cross validation
Experiments
Conclusion and future work

17
Experimental conditions (1/2)
Database: ATR Japanese speech database B-set
Speaker: MHT
Training data: 450 utterances
Test data: 53 utterances
Sampling rate: 16 kHz
Window: Blackman window
Frame size / shift: 25 ms / 5 ms
Feature vector: 24 mel-cepstrum + Δ + ΔΔ and log F0 + Δ + ΔΔ (78 dimensions)
HMM: 5-state left-to-right HMM without skip transitions
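
The arithmetic behind these conditions can be checked with a short sketch. It assumes that the 0th mel-cepstral coefficient is counted in addition to the 24 listed, which is the reading that reproduces the stated 78 dimensions; the function and constant names are hypothetical.

```python
SAMPLING_RATE_HZ = 16_000
FRAME_SIZE_MS = 25
FRAME_SHIFT_MS = 5

def samples_per(duration_ms, sampling_rate_hz=SAMPLING_RATE_HZ):
    """Number of samples covered by a duration given in milliseconds."""
    return sampling_rate_hz * duration_ms // 1000

def feature_dimension(mcep_order=24, include_c0=True, num_windows=3):
    """Static features (mel-cepstrum and log F0) times static/delta/delta-delta windows.

    include_c0=True is an assumption: 24 coefficients plus the 0th gives 25.
    """
    mcep_dim = mcep_order + (1 if include_c0 else 0)
    static_dim = mcep_dim + 1  # plus log F0
    return static_dim * num_windows

frame_size = samples_per(FRAME_SIZE_MS)    # samples per analysis window
frame_shift = samples_per(FRAME_SHIFT_MS)  # samples between successive frames
dimension = feature_dimension()
```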
18
Experimental conditions (2/2)

Compared approach
Mean Opinion Score (MOS) test
Subjects were 10 Japanese students
20 sentences were chosen at random

Approach     Training  Context clustering                 Number of states
ML-MDL       ML        MDL                                2,491
Bayes-Bayes  Bayes     Bayes using CV                     25,911
Bayes-MDL    Bayes     Bayes using CV (adjust threshold)  2,553
ML-Bayes     ML        MDL (adjust threshold)             27,106
19
Subjective listening test

[Figure: mean opinion score for each approach; bars labeled with the number of states: 2,491, 25,911, 27,106, 2,553]
20
Conclusions and future work

A new framework based on the Bayesian approach
All processes are derived from a single predictive distribution
Improves the naturalness of synthesized speech
Future work
Introduce HSMM instead of HMM
Investigate the relation between speech quality and model structures

22
Cross valid prior distribution

Marginal likelihood using cross validation
Alleviate over-fitting problem
Cross valid prior distribution

23
Experimental conditions (2/2)

Compared approach

Approach     Training  Context clustering
ML-MDL       ML        MDL
Bayes-Bayes  Bayes     Bayes using cross validation
Bayes-MDL    Bayes     Bayes using threshold
ML-Bayes     ML        MDL using threshold

Number of states

Approach     Spectrum  F0      Duration  Sum
ML-MDL       956       1,151   280       2,491
Bayes-Bayes  9,070     12,836  4,005     25,911
Bayes-MDL    1,941     565     47        2,553
ML-Bayes     15,077    8,844   3,185     27,106
24
Bayesian context clustering using CV