Recent Progress on Speech Synthesis in USTC iFlytek Speech Lab

Slides: 32

Provided by: zhl7

Transcript and Presenter's Notes

1
Recent Progress on Speech Synthesis in USTC
iFlytek Speech Lab

Ren-Hua Wang
November 14, 2006

2
CONTENTS

Introduction to USTC iFlytek Speech Lab
Review of Speech Synthesis
Recent Research Progress
Applications

3
USTC iFlytek Speech Laboratory

Affiliated with both USTC and iFlytek
University of Science and Technology of China
Anhui USTC iFLYTEK Co., Ltd.
Research and Development
Speech synthesis
Speech recognition
Standardization of human-machine speech
interaction technology
Application of speech technology

4
University of Science and Technology of China

USTC is the only university under the Chinese Academy
of Sciences (CAS) and one of the 9 institutions in
China entitled to government support for
internationally recognized research
universities.
Located in Hefei, the capital of Anhui
Province, about 500 km from Shanghai

5
USTC iFLYTEK CO. LTD

Founded in 1999 out of the USTC Speech Laboratory
Registered capital of 75 million yuan and a market
value of 300 million yuan today
A leading provider of speech technology with a
focus on Chinese TTS, and a market leader in
speech interactive applications in China
The only speech technology industrialization base
for the China National Hi-tech R&D Program (863
Program)
Affiliate of the Chinese Speech Interactive Standard
Group, taking the lead in drafting the national
standard

6
Review of Speech Synthesis

Mainstream methods in the last ten years
Corpus-based unit concatenation
Statistical parametric synthesis
Trainable speech synthesis
Hidden Markov Model (HMM) based

7
Corpus-based Speech Synthesis

Synthesized speech is generated by selecting the
optimal speech segments from the corpus and
concatenating them
Two problems
What should the corpus include?
Corpus design
How to select the required synthesis units from the
corpus for a target sentence to be synthesized?
Unit selection
Link cost
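
The unit selection step can be sketched as a shortest-path search over the candidate units. This is a minimal illustration, not the lab's actual system: the function name `select_units`, the optional `target_cost` term, and the pitch-difference link cost in the usage example are all assumptions made for the sketch.

```python
def select_units(candidates, link_cost, target_cost=lambda i, u: 0.0):
    """Pick one unit per position so that the summed cost is minimal.

    candidates:  list of lists; candidates[i] holds the hashable units
                 available for position i (e.g. syllable i).
    link_cost:   function (prev_unit, unit) -> cost of joining them.
    target_cost: hypothetical extra term (position, unit) -> cost;
                 defaults to zero so only the link cost is used.
    """
    if not candidates or any(not c for c in candidates):
        raise ValueError("every position needs at least one candidate")

    # layers[i][unit] = (best cumulative cost ending in unit, best predecessor)
    layers = [{u: (target_cost(0, u), None) for u in candidates[0]}]
    for i in range(1, len(candidates)):
        layer = {}
        for u in candidates[i]:
            prev = min(layers[-1],
                       key=lambda p: layers[-1][p][0] + link_cost(p, u))
            cost = layers[-1][prev][0] + link_cost(prev, u) + target_cost(i, u)
            layer[u] = (cost, prev)
        layers.append(layer)

    # Backtrack from the cheapest final unit.
    last = min(layers[-1], key=lambda u: layers[-1][u][0])
    total = layers[-1][last][0]
    path = [last]
    for i in range(len(candidates) - 1, 0, -1):
        path.append(layers[i][path[-1]][1])
    path.reverse()
    return path, total


# Usage: units are (syllable, pitch in Hz); the link cost is the pitch jump.
candidates = [
    [("s1", 100), ("s1", 120)],
    [("s2", 118), ("s2", 90)],
    [("s3", 119), ("s3", 95)],
]
pitch_jump = lambda a, b: abs(a[1] - b[1])
best_path, best_cost = select_units(candidates, pitch_jump)
```

Dynamic programming keeps the search linear in the number of syllables, which matters when each syllable has many candidates in the corpus.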

8
Block Diagram of Corpus-based TTS

[Block diagram. Components: Text input, Lexicon and Syntax Rules, Text Processing, Prosody Prediction, Template Corpus, Speech Corpus, Candidates for each syllable (s1 ... s6), Select best path by link cost, Output Speech]

9
Advantages and Disadvantages

Excellent speech quality
Synthesized units from natural speech
Unstable performance
Limited corpus size
A long period for construction of a new corpus
Lack of flexibility

10
HMM-based Speech Synthesis

Training stage
Speech parameters (spectrum, pitch and duration)
are extracted from speech waveforms of training
data
Spectrum, pitch and duration are modeled
simultaneously in a unified framework of HMMs
Synthesis stage
The parameters are generated from HMMs by using
dynamic features under maximum likelihood
criterion
These parameters are sent into parametric
synthesizer to generate speech waveforms
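
The parameter generation step can be sketched as follows. This is a simplified illustration under stated assumptions: one feature dimension, one static and one delta feature with the window delta(t) = 0.5 * (c[t+1] - c[t-1]), diagonal variances, and a state sequence that has already been fixed. The names `build_window_matrix` and `generate_trajectory` are hypothetical.

```python
import numpy as np


def build_window_matrix(num_frames):
    """Matrix W that maps a static trajectory c to [static, delta] per frame."""
    W = np.zeros((2 * num_frames, num_frames))
    for t in range(num_frames):
        W[2 * t, t] = 1.0                      # static row
        if t > 0:
            W[2 * t + 1, t - 1] = -0.5         # delta row, left neighbour
        if t < num_frames - 1:
            W[2 * t + 1, t + 1] = 0.5          # delta row, right neighbour
    return W


def generate_trajectory(means, variances):
    """Maximum likelihood trajectory given per-frame Gaussian statistics.

    means, variances: arrays of shape (T, 2) holding [static, delta]
    statistics of the HMM state aligned to each frame.
    Solves (W' P W) c = W' P mu, where P is the diagonal precision matrix.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    W = build_window_matrix(means.shape[0])
    mu = means.reshape(-1)                     # interleaved [static, delta]
    precision = 1.0 / variances.reshape(-1)
    A = W.T @ (precision[:, None] * W)
    b = W.T @ (precision * mu)
    return np.linalg.solve(A, b)


# Usage: a step in the static means is smoothed by the delta constraint.
step_means = np.array([[0, 0], [0, 0], [0, 0], [1, 0], [1, 0], [1, 0]], float)
step_vars = np.tile([1.0, 0.1], (6, 1))
trajectory = generate_trajectory(step_means, step_vars)
```

Without the dynamic features the solution would simply follow the piecewise-constant state means; the delta term is what produces a smooth trajectory.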

11
System Overview
12
Strengths

High smoothness and naturalness
Small training set: 0.5-1 hour
Automatic and fast training
Language independent
High flexibility: model adaptation and
interpolation
Small footprint: 1 MB system for embedded
applications

13
Problems with baseline method

Muffled synthesized speech
Vocoder quality from the parametric synthesizer
Broadened formants caused by the averaging effect
of statistical modeling
Overly flat prosody
Non-ideal statistical modeling for speech synthesis

14
Recent Progress

The USTC iFlytek Speech Lab has been working on
corpus-based TTS, and has paid more attention to
trainable TTS since 2001.
Following the framework of HMM-based synthesis,
several innovative results have been achieved
Both our Chinese and English TTS systems
hold a clear lead over competitors
worldwide

15
Text-To-Speech Flowchart
16
Subjective Evaluation for corpus-based TTS

Mean Opinion Score
5 EXCELLENT
4 GOOD
3 ACCEPTABLE
2 BAD
1 VERY BAD

Evaluated in July 2003. For the first time, the
synthesized speech was rated above the voice of an
ordinary speaker!
17
P.1 Feature Parameters in HMM-based Synthesis

Spectral feature: Line Spectral Pairs (LSP)
Relate more closely to formant positions
Better temporal smoothness for each order
Spectral enhancement based on LSP
Enhances the formants of synthesized speech by
modifying the DAL (Differential of Adjacent LSP
orders) of the generated LSPs
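
The idea of modifying the DAL can be illustrated with a small sketch. The slides do not give the actual modification rule, so the power-law scaling below, the `beta` parameter, and the name `enhance_lsp` are assumptions chosen only to show the principle: closely spaced LSPs mark a formant, so narrowing already narrow gaps sharpens the spectral peak.

```python
import numpy as np


def enhance_lsp(lsp, beta=0.3):
    """Sharpen formants by shrinking narrow gaps between adjacent LSPs.

    lsp:  strictly increasing LSP frequencies (radians) for one frame.
    beta: hypothetical strength parameter; 0 leaves the frame unchanged.
    """
    lsp = np.asarray(lsp, dtype=float)
    dal = np.diff(lsp)                         # differences of adjacent LSP orders
    if np.any(dal <= 0):
        raise ValueError("LSPs must be strictly increasing")
    weights = (dal / dal.mean()) ** beta       # < 1 for narrow gaps, > 1 for wide
    new_dal = dal * weights
    new_dal *= dal.sum() / new_dal.sum()       # keep first and last LSP in place
    return np.concatenate(([lsp[0]], lsp[0] + np.cumsum(new_dal)))


# Usage on one illustrative frame of six LSPs.
frame = np.array([0.30, 0.35, 0.90, 1.00, 1.80, 2.60])
enhanced = enhance_lsp(frame)
```

Because every modified gap stays positive, the output remains ordered, which is the condition for a stable synthesis filter.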

18
Spectrum Smoothing
(a) 3D spectrum graph of natural voice /uo/
(b) 3D spectrum graph of synthesized voice /uo/
based on mel-spectrum
19
Modeling With LSF Parameters
(a) 3D spectrum graph of natural voice /uo/
(c) 3D spectrum graph of synthesized voice /uo/
based on LSF
20
Spectrum Enhancement
21
P.2 Model Training in HMM-based Synthesis
22

Optimization of feature and question set
Minimum Generation Error criterion
Instead of the ML criterion, the HMMs are
estimated to minimize the generation error which
is defined as the distance between generated
parameters and natural ones for the sentences in
training set
Advantages
To give better consistency between model training
and the purpose of speech synthesis
To take the constraints between static and
dynamic features into account during HMM
training
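
The criterion above can be written compactly. This is a sketch under assumptions: the slides only say "distance", so the Euclidean distance is assumed here, and the symbols W (the matrix that appends dynamic features to a static trajectory), \mu and U (the mean vector and covariance matrix of the aligned state sequence) are notation introduced for illustration.

```latex
% Generated static parameters for sentence n under model \lambda
\hat{c}_n(\lambda) = \left( W^\top U^{-1} W \right)^{-1} W^\top U^{-1} \mu

% Generation error against the natural parameters c_n (Euclidean distance assumed)
D\!\left( c_n, \hat{c}_n(\lambda) \right) = \left\| c_n - \hat{c}_n(\lambda) \right\|^2

% Minimum generation error estimate over the N training sentences
\lambda^{*} = \arg\min_{\lambda} \sum_{n=1}^{N} D\!\left( c_n, \hat{c}_n(\lambda) \right)
```

Since the generated trajectory depends on W, the static and dynamic statistics are tied together during training, which is the second advantage listed above.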

23
P.3 Other Improvements in HMM-based Synthesis

Improved duration modeling: duration prediction
combining a state duration model and a phone duration
model
Vocoder: STRAIGHT
Speech Transformation and Representation using
Adaptive Interpolation of weiGHTed spectrum
High performance speech analysis/synthesis method

24
Blizzard Challenge

An international competition for English speech
synthesis systems
To better understand and compare research
techniques in building speech synthesis systems on
the SAME data
Proposed by Prof. Alan W. Black (CMU) and Prof.
Tokuda (Nitech); held since 2005

25
Blizzard Challenge 2006

14 entries from all around the world
Two systems required for each entry
Full-set system: 4273 utterances
Subset system: 1082 utterances
1 month for system building
Evaluation
Internet-based evaluation
Intelligibility (WER) and naturalness (MOS)
Experts, volunteers and native students

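The two evaluation measures can be computed as sketched below. This is a generic illustration, not the Blizzard Challenge scoring code: WER is taken as the word-level edit distance divided by the reference length, MOS as the plain mean of listener ratings, and the function names are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance over words, kept one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                          # deletion
                           cur[j - 1] + 1,                       # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution
        prev = cur
    return prev[-1] / len(ref)


def mean_opinion_score(ratings):
    """Average of listener ratings on the 1 (very bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return sum(ratings) / len(ratings)


# Usage: a listener transcribes a synthesized sentence and misses one word.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
mos = mean_opinion_score([4, 5, 3, 4])
```
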
26
Results

The USTC system, built using an improved HMM-based
synthesis method, gave the best performance in
this competition

[Charts: WER and MOS]
27
A1. Text-To-Speech

Multi-speaker synthesis
Man/woman, child/elder
Multi-lingual/accent synthesis
English
Sichuan accent

28
A2. Speaker Interpolation
Speaker interpolation
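
One simple form of speaker interpolation is a weighted average of the Gaussian statistics of several speaker models. The slide does not say which variant was used, so the linear interpolation of means and variances below, and the name `interpolate_gaussians`, are assumptions for illustration.

```python
import numpy as np


def interpolate_gaussians(means, variances, weights):
    """Blend corresponding Gaussians from K speaker models.

    means, variances: arrays of shape (K, D), one row per speaker.
    weights:          K interpolation ratios that sum to one.
    Returns the interpolated mean and variance vectors of length D.
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to one")
    return weights @ means, weights @ variances


# Usage: a voice that is 25% speaker A and 75% speaker B.
mixed_mean, mixed_var = interpolate_gaussians(
    means=[[100.0, 1.0], [200.0, 3.0]],
    variances=[[4.0, 1.0], [8.0, 1.0]],
    weights=[0.25, 0.75],
)
```

Sweeping the weights from one speaker to the other produces a gradual change of voice identity.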
29
A3. Voice conversion

MLLR-based model adaptation
550 utterances of target speaker
Simulation of celebrity voice

Speech sample of target speaker
Target speaker TTS system
Source speaker TTS system
Model adaptation
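
MLLR adapts the source speaker's Gaussian means with an affine transform, mean' = A * mean + b, estimated from the target speaker's utterances. The sketch below is deliberately simplified: it assumes a single global transform, identity covariances, and a hard frame-to-Gaussian alignment, in which case the maximum likelihood estimate reduces to ordinary least squares. The function names are hypothetical.

```python
import numpy as np


def estimate_global_transform(source_means, frames, assignments):
    """Estimate W = [b | A] so that A @ mean + b fits the target frames.

    source_means: (M, D) means of the source speaker's Gaussians.
    frames:       (N, D) feature frames from the target speaker.
    assignments:  length-N index of the Gaussian aligned to each frame.
    """
    source_means = np.asarray(source_means, dtype=float)
    frames = np.asarray(frames, dtype=float)
    assignments = np.asarray(assignments)
    # Extended mean vectors [1, mean] for every adaptation frame.
    xi = np.hstack([np.ones((len(assignments), 1)), source_means[assignments]])
    W_transposed, *_ = np.linalg.lstsq(xi, frames, rcond=None)
    return W_transposed.T                      # shape (D, D + 1)


def adapt_means(source_means, W):
    """Apply the transform to all source means."""
    source_means = np.asarray(source_means, dtype=float)
    xi = np.hstack([np.ones((len(source_means), 1)), source_means])
    return xi @ W.T


# Usage with synthetic data generated from a known transform.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
true_A = np.array([[1.1, 0.2], [-0.1, 0.9]])
true_b = np.array([0.5, -0.3])
align = np.array([0, 1, 2, 3, 0, 1])
target_frames = src[align] @ true_A.T + true_b
W_hat = estimate_global_transform(src, target_frames, align)
adapted = adapt_means(src, W_hat)
```

Because one transform is shared by all Gaussians, a few hundred target utterances are enough to move the whole model toward the target speaker.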
30
A4. Expressive speech synthesis

Emotional speech synthesis
Sad
Happy
Singing TTS
Based on the reading style synthesis system
Prosody controlled according to the input musical
score
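
Score-driven prosody control can be sketched as a conversion from notes to a target pitch contour. The score format is not described in the slides, so the representation as (MIDI note number, duration in seconds) pairs, the 5 ms frame shift, and the function names are assumptions; only the equal-temperament formula with A4 = 440 Hz is standard.

```python
def midi_to_hz(note):
    """Equal-temperament frequency of a MIDI note number (A4 = 69 = 440 Hz)."""
    return 440.0 * 2 ** ((note - 69) / 12)


def score_to_f0(score, frame_shift=0.005):
    """Turn a list of (midi_note, duration_seconds) into a per-frame F0 target."""
    contour = []
    for note, duration in score:
        num_frames = round(duration / frame_shift)
        contour.extend([midi_to_hz(note)] * num_frames)
    return contour


# Usage: A4 for half a second followed by B4 for a quarter second.
f0_target = score_to_f0([(69, 0.5), (71, 0.25)])
```

In a singing system the note durations would also constrain the phone durations, so both prosodic streams come from the score rather than from the reading-style models.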
