Title: Recent Progress on Speech Synthesis in USTC iFlytek Speech Lab
1Recent Progress on Speech Synthesis in USTC
iFlytek Speech Lab
- Ren-Hua Wang
- 2006,11,14
2CONTENTS
- Introduction to USTC iFlytek Speech Lab.
- Review of Speech Synthesis
- Recent Research Progress
- Applications
3USTC iFlytek Speech Laboratory
- Jointly affiliated with USTC and iFlytek
- University of Science and Technology of China
- Anhui USTC iFLYTEK Co.Ltd
- Research and Development
- Speech synthesis
- Speech recognition
- Standardization of human-machine speech
interaction technology
- Application of speech technology
4University of Science and Technology of China
- USTC is the only university under the Chinese
Academy of Sciences (CAS) and one of the 9
institutions in China entitled to government
support as internationally recognized research
universities
- Located in Hefei, the capital of Anhui province,
about 500 km from Shanghai
5USTC iFLYTEK CO. LTD
- Founded in 1999 out of the USTC Speech Laboratory
- Registered capital of 75 million yuan and a market
value of 300 million yuan today
- A leading provider of speech technology with a
focus on Chinese TTS, and a market leader in
speech interactive applications in China
- The only speech technology industrialization base
for the China National Hi-tech R&D Program (863
Program)
- Affiliate of the Chinese Speech Interactive Standard
Group, taking the lead in drafting the national
standard
6Review of Speech Synthesis
- Mainstream methods in the last ten years
- Corpus-based unit concatenation
- Statistical parametric synthesis
- Trainable speech synthesis
- Hidden Markov Model (HMM) based
7 Corpus-based Speech Synthesis
- Synthesized speech is generated by selecting the
optimal speech segments from the corpus and
concatenating them together
- Two problems
- What should the corpus include?
- Corpus design
- How to select the required synthesis units in the
corpus for a target sentence to be synthesized?
- Unit selection
- Link cost
8Block Diagram of Corpus-based TTS
[Block diagram. Components: Text input; Text Processing;
Lexicon and Syntax Rules; Prosody Prediction; Template
Corpus; Speech Corpus; Candidates for each syllable
(s1 s2 s3 s4 ...); Select best path by link cost;
Output speech]
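The "select best path by link cost" step can be pictured as a shortest-path search over the candidate lattice. The following is a minimal sketch, not the lab's actual implementation: the function name `select_units`, the optional per-unit `target_cost`, and the toy pitch-based link cost are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def select_units(
    candidates: Sequence[Sequence[str]],
    target_cost: Callable[[int, str], float],
    link_cost: Callable[[str, str], float],
) -> Tuple[List[str], float]:
    """Viterbi search: return the cheapest unit sequence and its total cost."""
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every syllable position needs at least one candidate")
    # Each layer holds (cost of best path ending in this unit, back-pointer).
    history = [[(target_cost(0, u), -1) for u in candidates[0]]]
    for pos in range(1, len(candidates)):
        prev_units = candidates[pos - 1]
        layer = []
        for unit in candidates[pos]:
            cost, back = min(
                (history[-1][k][0] + link_cost(prev_units[k], unit), k)
                for k in range(len(prev_units))
            )
            layer.append((cost + target_cost(pos, unit), back))
        history.append(layer)
    # Trace the back-pointers from the cheapest final unit.
    j = min(range(len(history[-1])), key=lambda k: history[-1][k][0])
    total = history[-1][j][0]
    path: List[str] = []
    for pos in range(len(candidates) - 1, -1, -1):
        path.append(candidates[pos][j])
        j = history[pos][j][1]
    path.reverse()
    return path, total


# Toy data (hypothetical): link cost = pitch mismatch at the join.
pitch = {"s1_a": 200.0, "s1_b": 120.0, "s2_a": 210.0, "s2_b": 125.0, "s3_a": 205.0}
candidates = [["s1_a", "s1_b"], ["s2_a", "s2_b"], ["s3_a"]]


def toy_link_cost(left: str, right: str) -> float:
    return abs(pitch[left] - pitch[right])


def toy_target_cost(pos: int, unit: str) -> float:
    return 0.0  # assumption: no prosody target in this toy example


path, total = select_units(candidates, toy_target_cost, toy_link_cost)
print(path, total)
```

In a real system the costs would come from the predicted prosody and from spectral and pitch continuity at the joins; dynamic programming keeps the search linear in the number of syllables.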
9Advantages and Disadvantages
- Excellent speech quality
- Synthesized units from natural speech
- Unstable performance
- Limited corpus size
- A long period for construction of a new corpus
- Lack of flexibility
10HMM-based Speech Synthesis
- Training stage
- Speech parameters (spectrum, pitch and duration)
are extracted from the speech waveforms of the
training data
- Spectrum, pitch and duration are modeled
simultaneously in a unified HMM framework
- Synthesis stage
- The parameters are generated from the HMMs by using
dynamic features under the maximum likelihood
criterion
- These parameters are sent to a parametric
synthesizer to generate speech waveforms
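For reference, the generation step with dynamic features is usually written as below in the HMM-synthesis literature. This is the standard textbook form, given here as a sketch; the symbols W, c, mu and Sigma are notation introduced for this note and do not appear on the slide.

```latex
% c      : sequence of static feature vectors to be generated
% W      : window matrix appending delta (dynamic) features, so o = W c
% \mu, \Sigma : stacked means and covariances of the chosen HMM state sequence
\hat{c} = \arg\max_{c} \, \mathcal{N}\!\left(W c \,;\, \mu, \Sigma\right)
\quad\Longrightarrow\quad
W^{\top} \Sigma^{-1} W \, \hat{c} = W^{\top} \Sigma^{-1} \mu
```

Because the dynamic features are linear functions of the static ones, the maximum likelihood solution is a smooth trajectory rather than a step-wise sequence of state means.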
11System Overview
12Strengths
- High smoothness and naturalness
- Small training set: 0.5-1 hour
- Automatic and fast training
- Language independent
- High flexibility: model adaptation and
interpolation
- Small footprint: 1MB system for embedded
applications
13Problems with baseline method
- Muffled synthesized speech
- Vocoder quality from parametric synthesizer
- Broadened formants caused by the averaging effect
of statistical modeling
- Overly flat prosody
- Non-ideal statistical modeling for speech synthesis
14Recent Progress
- USTC iFlytek Speech Lab has been working on
corpus-based TTS, and has focused more on
trainable TTS since 2001
- Following the framework of HMM-based synthesis,
several innovative results have been achieved
- Both our Chinese and English TTS systems hold a
clear lead over competing systems worldwide
15Text-To-Speech Flowchart
16Subjective Evaluation for corpus-based TTS
- Mean Opinion Score
- 5 EXCELLENT
- 4 GOOD
- 3 ACCEPTABLE
- 2 BAD
- 1 VERY BAD
Evaluated in July 2003: for the first time, the
synthesized speech was rated above the voice of an
ordinary speaker!
17P.1 Feature Parameters in HMM-based synthesis
- Spectral feature: Line Spectral Pairs (LSP)
- Relate more closely to formant positions
- Better temporal smoothness for each order
- Spectral enhancement based on LSP
- Enhances the formants of synthesized speech by
modifying the DAL (Differential of Adjacent LSP
orders) of the generated LSPs
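The slide does not give the exact modification rule, so the following is only a hypothetical illustration of the idea: closely spaced adjacent LSPs mark a formant, and narrowing their gap sharpens the corresponding spectral peak. The function name `enhance_lsp` and the parameters `alpha` and `threshold` are assumptions, not the lab's method.

```python
from typing import List


def enhance_lsp(lsp: List[float], alpha: float = 0.8, threshold: float = 0.3) -> List[float]:
    """Shrink narrow gaps between adjacent LSPs (in radians) by factor alpha.

    Each narrow pair is pulled toward its own midpoint, so formant
    positions stay roughly where they were and the ordering is preserved.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    shifts = [0.0] * len(lsp)
    for i in range(len(lsp) - 1):
        dal = lsp[i + 1] - lsp[i]  # differential of adjacent LSP orders
        if dal < threshold:
            move = (1.0 - alpha) * dal / 2.0
            shifts[i] += move       # lower LSP moves up
            shifts[i + 1] -= move   # upper LSP moves down
    return [w + s for w, s in zip(lsp, shifts)]


# Toy 6th-order LSP vector with two close pairs (two "formants").
generated = [0.30, 0.40, 1.00, 1.60, 1.70, 2.50]
enhanced = enhance_lsp(generated, alpha=0.5, threshold=0.3)
print(enhanced)
```

With `alpha=0.5` the two narrow gaps (0.1 rad each) are halved while the wide gaps are left alone.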
18Spectrum Smoothing
(a) 3D spectrum graph of natural voice /uo/
(b) 3D spectrum graph of synthesized voice /uo/
based on mel-spectrum
19Modeling With LSF Parameters
(a) 3D spectrum graph of natural voice /uo/
(c) 3D spectrum graph of synthesized voice /uo/
based on LSF
20Spectrum Enhancement
21P.2 Model Training in HMM-based synthesis
22
- Optimization of the feature and question set
- Minimum Generation Error criterion
- Instead of the ML criterion, the HMMs are
estimated to minimize the generation error, which
is defined as the distance between the generated
parameters and the natural ones for the sentences
in the training set
- Advantages
- Gives better consistency between model training
and the purpose of speech synthesis
- Takes the constraints between static and
dynamic features into account during HMM
training
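As a sketch, the criterion can be written as follows. The slide only says "distance"; the squared Euclidean distance used here is an assumption, and the symbols are notation introduced for this note.

```latex
% c_n                 : natural (extracted) parameters of training sentence n
% \hat{c}_n(\lambda)  : parameters generated from the HMM set \lambda for sentence n
% N                   : number of training sentences
D\!\left(c_n, \hat{c}_n(\lambda)\right) = \left\lVert c_n - \hat{c}_n(\lambda) \right\rVert^{2},
\qquad
\hat{\lambda}_{\mathrm{MGE}} = \arg\min_{\lambda} \sum_{n=1}^{N} D\!\left(c_n, \hat{c}_n(\lambda)\right)
```

Since the generated parameters already depend on the dynamic-feature constraints, minimizing this error ties training directly to what the synthesizer will output.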
23P.3 Others in HMM-based synthesis
- Improved duration modeling: duration prediction
combining the state duration model and the phone
duration model
- Vocoder: STRAIGHT
- Speech Transformation and Representation using
Adaptive Interpolation of weiGHTed spectrum
- High-performance speech analysis/synthesis method
24Blizzard Challenge
- An international competition for English speech
synthesis systems
- To better understand and compare research
techniques for building speech synthesis systems on
the SAME data
- Proposed by Prof. Alan W. Black (CMU) and Prof.
Tokuda (Nitech); held since 2005
25Blizzard Challenge 2006
- 14 entries from all around the world
- Two systems required for each entry
- Full-set system: 4273 utterances
- Subset system: 1082 utterances
- 1 month for system building
- Evaluation
- Internet based evaluation
- Intelligibility (WER) and naturalness (MOS)
- Experts, volunteers and native students
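Both metrics are simple to compute. The sketch below uses the standard definitions (WER as word-level edit distance divided by reference length, MOS as the mean of 1-5 ratings); the function names and sample data are illustrative and are not taken from the Blizzard evaluation scripts.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # edit distances for the empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


def mean_opinion_score(ratings: List[int]) -> float:
    """MOS = arithmetic mean of listener ratings on the 1-5 scale."""
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be non-empty and within 1..5")
    return sum(ratings) / len(ratings)


wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
mos = mean_opinion_score([4, 5, 3, 4])
print(f"WER = {wer:.3f}, MOS = {mos:.2f}")
```

Here the listener transcript has one substitution and one deletion against six reference words, so the WER is 2/6.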
26Results
- USTC system built using an improved HMM-based
synthesis method gives the best performance in
this competition
[Result charts: WER and MOS]
27A1. Text-To-Speech
- Multi-speaker synthesis
- Male/female, child/elderly voices
- Multi-lingual/accent synthesis
- English
- Sichuan accent
28A2. Speaker Interpolation
Speaker interpolation
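The slide gives no detail, so here is only a minimal sketch of one common form of speaker interpolation in HMM-based synthesis: a weighted average of the Gaussian mean vectors of two speakers' models. The function name and the toy values are assumptions.

```python
from typing import List


def interpolate_means(mean_a: List[float], mean_b: List[float], weight_a: float) -> List[float]:
    """Blend two speakers' mean vectors; weight_a=1.0 returns speaker A unchanged."""
    if len(mean_a) != len(mean_b):
        raise ValueError("mean vectors must have the same dimension")
    if not 0.0 <= weight_a <= 1.0:
        raise ValueError("weight_a must be in [0, 1]")
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(mean_a, mean_b)]


# Toy state means for the same HMM state in two speaker models.
speaker_a = [1.0, 2.0]
speaker_b = [3.0, 6.0]
blended = interpolate_means(speaker_a, speaker_b, weight_a=0.25)
print(blended)
```

Sweeping `weight_a` from 0 to 1 moves the synthesized voice gradually from speaker B to speaker A.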
29A3. Voice conversion
- MLLR based model adaptation
- 550 utterances of target speaker
- Simulation of celebrity voice
[Diagram: Source speaker TTS system + speech sample of
target speaker -> Model adaptation -> Target speaker
TTS system]
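In MLLR, the Gaussian means of the source model are moved toward the target speaker by a shared affine transform, mu' = A mu + b. The sketch below only applies a given transform; estimating A and b from the target speaker's utterances by maximum likelihood is not shown, and the toy numbers are made up.

```python
from typing import List


def mllr_adapt_mean(mean: List[float], A: List[List[float]], b: List[float]) -> List[float]:
    """Apply the affine MLLR mean transform: A @ mean + b."""
    if len(A) != len(b) or any(len(row) != len(mean) for row in A):
        raise ValueError("shape mismatch between A, b and mean")
    return [sum(a_ij * m_j for a_ij, m_j in zip(row, mean)) + b_i
            for row, b_i in zip(A, b)]


# One transform is shared by many Gaussians, which is why a modest amount
# of target-speaker data can adapt the whole source model.
A = [[1.0, 0.0],
     [0.0, 2.0]]
b = [0.5, -1.0]
source_means = [[2.0, 3.0], [0.0, 1.0]]
target_means = [mllr_adapt_mean(m, A, b) for m in source_means]
print(target_means)
```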
30A4. Expressive speech synthesis
- Emotional speech synthesis
- Sad
- Happy
- Singing TTS
- Based on the reading style synthesis system
- Prosody controlled according to the input musical
score
31Thank you!