Recent Progress on Speech Synthesis in USTC iFlytek Speech Lab

1
Recent Progress on Speech Synthesis in USTC
iFlytek Speech Lab
  • Ren-Hua Wang
  • 2006-11-14

2
CONTENTS
  • Introduction to the USTC iFlytek Speech Lab
  • Review of Speech Synthesis
  • Recent Research Progress
  • Applications

3
USTC iFlytek Speech Laboratory
  • Jointly affiliated with:
    • University of Science and Technology of China (USTC)
    • Anhui USTC iFLYTEK Co., Ltd.
  • Research and development:
    • Speech synthesis
    • Speech recognition
    • Standardization of human-machine speech
      interaction technology
    • Applications of speech technology

4
University of Science and Technology of China
  • USTC is the only university under the Chinese
    Academy of Sciences (CAS) and one of the nine
    institutions in China selected for government
    support to become internationally recognized
    research universities
  • Located in Hefei, the capital of Anhui province,
    about 500 km from Shanghai

5
USTC iFLYTEK CO. LTD
  • Founded in 1999 on the basis of the USTC Speech
    Laboratory
  • Registered capital of 75 million and a current
    market value of 300 million Chinese yuan
  • A leading provider of speech technology focused on
    Chinese TTS, and a market leader in speech
    interaction applications in China
  • The only speech technology industrialization base
    of the China National High-Tech R&D Program (863
    Program)
  • Member of the Chinese Speech Interactive Standard
    Group, taking the lead in drafting the national
    standard

6
Review of Speech Synthesis
  • Mainstream methods of the last ten years:
    • Corpus-based unit concatenation
    • Statistical parametric synthesis: trainable,
      Hidden Markov Model (HMM) based

7
Corpus-based Speech Synthesis
  • Synthesized speech is generated by selecting the
    optimal speech segments from the corpus and
    concatenating them
  • Two problems:
    • What should the corpus include? (corpus design)
    • How should the synthesis units required for a
      target sentence be selected from the corpus?
      (unit selection based on link cost)

8
Block Diagram of Corpus-based TTS
[Figure: Text input → Text Processing (Lexicon and Syntax Rules) → Prosody Prediction (Template Corpus) → Candidates for each syllable (Speech Corpus) → Select best path by link cost → Output Speech]
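The "select best path by link cost" step in the diagram is a shortest-path (Viterbi) search over the lattice of candidate units. A minimal Python sketch, assuming hypothetical `target_cost` and `link_cost` callables supplied by the caller; the slide names only the link cost, so the target-cost term is an added assumption (pass a function returning 0 to select purely by link cost):

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

Unit = TypeVar("Unit")


def select_units(
    candidates: Sequence[Sequence[Unit]],
    target_cost: Callable[[int, Unit], float],
    link_cost: Callable[[Unit, Unit], float],
) -> Tuple[List[Unit], float]:
    """Viterbi search for the lowest-cost unit sequence (hypothetical API).

    candidates[i] holds the corpus units that could realize syllable i.
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every syllable needs at least one candidate")

    # best[i][j]: cheapest total cost of any path ending at candidates[i][j]
    best = [[target_cost(0, u) for u in candidates[0]]]
    # back[i][j]: index of the predecessor on that cheapest path
    back: List[List[int]] = [[-1] * len(candidates[0])]

    for i in range(1, len(candidates)):
        prev_units, prev_best = candidates[i - 1], best[i - 1]
        row, ptr = [], []
        for u in candidates[i]:
            # Predecessor minimizing accumulated cost plus the link cost.
            k = min(range(len(prev_units)),
                    key=lambda p: prev_best[p] + link_cost(prev_units[p], u))
            row.append(prev_best[k] + link_cost(prev_units[k], u)
                       + target_cost(i, u))
            ptr.append(k)
        best.append(row)
        back.append(ptr)

    # Trace back from the cheapest candidate of the last syllable.
    j = min(range(len(best[-1])), key=lambda p: best[-1][p])
    total = best[-1][j]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    path.reverse()
    return path, total
```

For example, if each unit is a `(name, pitch)` tuple, a link cost of `abs(prev[1] - unit[1])` favours joins with small pitch jumps.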
9
Advantages and Disadvantages
  • Advantage: excellent speech quality, because the
    synthesized units come from natural speech
  • Disadvantages:
    • Unstable performance, due to the limited corpus
      size
    • Constructing a new corpus takes a long time
    • Lack of flexibility

10
HMM-based Speech Synthesis
  • Training stage
    • Speech parameters (spectrum, pitch and duration)
      are extracted from the speech waveforms of the
      training data
    • Spectrum, pitch and duration are modeled
      simultaneously in a unified HMM framework
  • Synthesis stage
    • The parameters are generated from the HMMs using
      dynamic features under the maximum likelihood
      criterion
    • These parameters are fed to a parametric
      synthesizer to generate speech waveforms
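Parameter generation with dynamic features has a closed form: with window matrix W, stacked per-frame means mu and diagonal covariance U, the maximum-likelihood static trajectory c solves (W^T U^-1 W) c = W^T U^-1 mu. A minimal pure-Python sketch for a single feature dimension, assuming the state sequence (and so the per-frame means and variances) is already fixed, a delta window delta_c[t] = 0.5 * (c[t+1] - c[t-1]) with out-of-range neighbours dropped, and no delta-delta stream; the function names are hypothetical:

```python
from typing import List, Sequence


def _solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def mlpg(static_mean: Sequence[float], static_var: Sequence[float],
         delta_mean: Sequence[float], delta_var: Sequence[float]) -> List[float]:
    """Maximum-likelihood static trajectory given per-frame Gaussians."""
    t_len = len(static_mean)
    # Each row of W maps the static trajectory c to one observation
    # (static or delta), stored sparsely as {frame_index: weight}.
    rows, means, precisions = [], [], []
    for t in range(t_len):
        rows.append({t: 1.0})
        means.append(static_mean[t])
        precisions.append(1.0 / static_var[t])
        delta_row = {}
        if t - 1 >= 0:
            delta_row[t - 1] = -0.5
        if t + 1 < t_len:
            delta_row[t + 1] = 0.5
        rows.append(delta_row)
        means.append(delta_mean[t])
        precisions.append(1.0 / delta_var[t])

    # Accumulate A = W^T U^-1 W and b = W^T U^-1 mu.
    a = [[0.0] * t_len for _ in range(t_len)]
    b = [0.0] * t_len
    for row, mu, prec in zip(rows, means, precisions):
        for i, wi in row.items():
            b[i] += wi * prec * mu
            for j, wj in row.items():
                a[i][j] += wi * prec * wj
    return _solve(a, b)
```

With a step in the static means, `mlpg([0, 0, 1, 1], [1] * 4, [0] * 4, [1] * 4)` returns roughly `[0.14, 0.14, 0.69, 0.83]`: the zero-mean delta constraints turn the jump into a smooth transition (the end values are also pulled down by the dropped boundary neighbours).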

11
System Overview
12
Strengths
  • High smoothness and naturalness
  • Small training set (0.5 to 1 hour)
  • Automatic and fast training
  • Language independent
  • High flexibility: model adaptation and
    interpolation
  • Small footprint: 1 MB system for embedded
    applications

13
Problems with baseline method
  • Muffled synthesized speech
    • Vocoder quality of the parametric synthesizer
    • Formants broadened by the averaging effect of
      statistical modeling
  • Prosody too flat
  • Statistical modeling not ideal for speech
    synthesis

14
Recent Progress
  • USTC iFlytek Speech Lab has been working on
    corpus-based TTS and, since 2001, has paid more
    attention to trainable TTS
  • Following the framework of HMM-based synthesis,
    several innovative results have been achieved
  • Both our Chinese and English TTS systems are well
    ahead of our competitors worldwide

15
Text-To-Speech Flowchart
16
Subjective Evaluation of Corpus-based TTS
  • Mean Opinion Score (MOS) scale:
    • 5 EXCELLENT
    • 4 GOOD
    • 3 ACCEPTABLE
    • 2 BAD
    • 1 VERY BAD

Evaluated in July 2003: for the first time, the
synthesized speech was rated above the voice of an
ordinary speaker!
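A MOS is the arithmetic mean of listener ratings on the five-point scale above. A minimal sketch with a hypothetical function name; the 95% confidence half-width uses a normal approximation (the 1.96 factor), which is an assumption about how results might be reported, not something stated on the slide:

```python
import math
from typing import Sequence, Tuple


def mean_opinion_score(ratings: Sequence[int]) -> Tuple[float, float]:
    """Return (MOS, 95% confidence half-width) for 1-5 listener ratings."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 (VERY BAD) to 5 (EXCELLENT)")
    n = len(ratings)
    mos = sum(ratings) / n
    if n < 2:
        return mos, math.nan  # spread is undefined for a single rating
    # Sample variance, then a normal-approximation interval on the mean.
    var = sum((r - mos) ** 2 for r in ratings) / (n - 1)
    return mos, 1.96 * math.sqrt(var / n)
```

With only a handful of listeners a Student-t factor would be more appropriate than 1.96; the normal approximation keeps the sketch dependency-free.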
17
P.1 Feature Parameters in HMM-based Synthesis
  • Spectral feature: Line Spectral Pairs (LSP)
    • Relate more closely to formant positions
    • Better temporal smoothness for each order
  • Spectral enhancement based on LSP
    • Enhances the formants of the synthesized speech
      by modifying the DAL (Differential of Adjacent
      LSP orders) of the generated LSPs
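Closely spaced LSPs correspond to sharp spectral peaks, so shrinking the small DAL values relative to the large ones sharpens the formants. The slide does not give the exact modification rule; the sketch below uses a hypothetical power-law rule (`alpha > 1` increases the contrast) and then rescales so the LSPs stay ordered inside (0, π). It illustrates the idea only and is not the lab's formula:

```python
import math
from typing import List, Sequence


def enhance_lsp(lsp: Sequence[float], alpha: float = 1.2) -> List[float]:
    """Hypothetical DAL-based formant sharpening for one frame of LSPs.

    lsp must be strictly increasing inside (0, pi), in radians.
    alpha > 1 shrinks small adjacent differences relative to large ones;
    alpha == 1 returns the input unchanged (up to rounding).
    """
    if any(not 0.0 < w < math.pi for w in lsp):
        raise ValueError("LSPs must lie strictly inside (0, pi)")
    if any(b <= a for a, b in zip(lsp, lsp[1:])):
        raise ValueError("LSPs must be strictly increasing")
    # DAL values, including the gaps to the band edges 0 and pi.
    points = [0.0] + list(lsp) + [math.pi]
    dal = [b - a for a, b in zip(points, points[1:])]
    # Emphasize the contrast between small and large DAL values.
    weighted = [d ** alpha for d in dal]
    # Rescale so the modified DAL values again span exactly (0, pi).
    scale = math.pi / sum(weighted)
    enhanced, position = [], 0.0
    for w in weighted[:-1]:
        position += w * scale
        enhanced.append(position)
    return enhanced
```

The rule would be applied frame by frame to the generated LSPs before they reach the synthesizer; `alpha` is a tuning knob in this sketch.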

18
Spectrum Smoothing
[Figure: 3D spectrum graphs of the vowel /uo/: (a) natural voice; (b) synthesized voice based on mel-spectrum]
19
Modeling With LSF Parameters
[Figure: 3D spectrum graphs of the vowel /uo/: (a) natural voice; (c) synthesized voice based on LSF]
20
Spectrum Enhancement
21
P.2 Model Training in HMM-based Synthesis
22
  • Optimization of the feature and question sets
  • Minimum Generation Error (MGE) criterion
    • Instead of using the ML criterion, the HMMs are
      estimated to minimize the generation error,
      defined as the distance between the generated
      parameters and the natural ones for the
      sentences in the training set
    • Advantages:
      • Better consistency between model training and
        the purpose of speech synthesis
      • The constraints between static and dynamic
        features are taken into account during HMM
        training
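Written out, the MGE criterion compares the trajectory produced by ML parameter generation with the natural one. A sketch of the usual formulation: for training sentence n under model λ, W is the window matrix, μ_n and U_n are the stacked means and covariances along the aligned state sequence, c_n is the natural static feature sequence, and N is the number of sentences. The plain gradient-descent update with step size ε_k is an assumption about the optimizer:

```latex
\begin{aligned}
\bar{\mathbf{c}}_n(\lambda) &= \left(\mathbf{W}^{\top}\mathbf{U}_n^{-1}\mathbf{W}\right)^{-1}\mathbf{W}^{\top}\mathbf{U}_n^{-1}\boldsymbol{\mu}_n \\
D(\lambda) &= \sum_{n=1}^{N} \left\lVert \bar{\mathbf{c}}_n(\lambda) - \mathbf{c}_n \right\rVert^{2} \\
\lambda^{(k+1)} &= \lambda^{(k)} - \epsilon_k \left.\frac{\partial D(\lambda)}{\partial \lambda}\right|_{\lambda=\lambda^{(k)}}
\end{aligned}
```

Because the error is measured after generation, the static/dynamic constraints encoded in W enter the training objective directly, which is the second advantage listed above.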

23
P.3 Other Improvements in HMM-based Synthesis
  • Improved duration modeling: duration prediction
    combining the state duration model and the phone
    duration model
  • Vocoder: STRAIGHT
    • Speech Transformation and Representation using
      Adaptive Interpolation of weiGHTed spectrum
    • High-performance speech analysis/synthesis method
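The duration-modeling item above combines a state duration model with a phone duration model. One way to do this, assumed here for illustration and not necessarily the lab's exact method, is to treat each state duration d_k ~ N(m_k, v_k) and the phone duration sum(d_k) ~ N(m_p, v_p), then maximize the joint log-likelihood. Setting the derivatives to zero gives a closed form: the phone duration is D = (M * v_p + S * m_p) / (v_p + S), where M and S are the summed state means and variances, and each state gets d_k = m_k + rho * v_k with rho = (D - M) / S. The function name is hypothetical:

```python
from typing import List, Sequence


def combine_durations(state_means: Sequence[float], state_vars: Sequence[float],
                      phone_mean: float, phone_var: float) -> List[float]:
    """Maximize sum_k log N(d_k; m_k, v_k) + log N(sum_k d_k; m_p, v_p)."""
    if len(state_means) != len(state_vars) or not state_means:
        raise ValueError("need matching, non-empty state statistics")
    m_sum = sum(state_means)
    v_sum = sum(state_vars)
    # Phone duration: precision-weighted compromise between the summed
    # state model (mean m_sum, variance v_sum) and the phone model.
    total = (m_sum * phone_var + v_sum * phone_mean) / (phone_var + v_sum)
    # Distribute the difference to states in proportion to their variances.
    rho = (total - m_sum) / v_sum
    # Real systems would also round to whole frames and enforce a minimum
    # duration per state; that is omitted here.
    return [m + rho * v for m, v in zip(state_means, state_vars)]
```

For state means [2, 3, 5] frames with variances [1, 1, 2] and a phone model of mean 12 and variance 4, the phone is stretched to 11 frames and the states become [2.25, 3.25, 5.5]: the states with larger variance absorb more of the change.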

24
Blizzard Challenge
  • An international competition for English speech
    synthesis systems
  • Aims to better understand and compare research
    techniques for building speech synthesis systems
    on the SAME data
  • Proposed by Prof. Alan W. Black (CMU) and Prof.
    Tokuda (NITech); held since 2005

25
Blizzard Challenge 2006
  • 14 entries from around the world
  • Two systems required for each entry:
    • Full-set system: 4273 utterances
    • Subset system: 1082 utterances
  • One month for system building
  • Evaluation:
    • Internet-based evaluation
    • Intelligibility (WER) and naturalness (MOS)
    • Listeners: experts, volunteers and native-speaker
      students
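WER is the word-level edit distance (substitutions + deletions + insertions) between what a listener typed and the reference text, divided by the number of reference words. A minimal sketch with a hypothetical function name; a real evaluation would also normalize case and punctuation first:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words processed so far
    # and the first j hypothesis words (one row of the DP table).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cur[j] = min(prev[j] + 1,             # deletion of a reference word
                         cur[j - 1] + 1,          # insertion of an extra word
                         prev[j - 1] + (r != h))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` is 2/6, about 0.33 (one substitution and one deletion).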

26
Results
  • The USTC system, built using an improved HMM-based
    synthesis method, gave the best performance in
    this competition

[Figure: WER and MOS results]
27
A1. Text-To-Speech
  • Multi-speaker synthesis: male/female, child/elderly
  • Multi-lingual/accent synthesis: English, Sichuan
    accent
28
A2. Speaker Interpolation
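Speaker interpolation mixes the output distributions of corresponding HMM states from several speakers' models. A minimal sketch, assuming the speaker models share one state structure with diagonal Gaussians and using a convex combination of both means and variances; this is one simple scheme (other published schemes weight the variances differently), and the function name is hypothetical:

```python
from typing import List, Sequence, Tuple


def interpolate_gaussians(
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
    weights: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Interpolate one state's diagonal Gaussian across K speaker models.

    means[k][d] and variances[k][d] belong to speaker k; weights must be
    non-negative and sum to 1.
    """
    if not (len(means) == len(variances) == len(weights)) or not weights:
        raise ValueError("one mean/variance vector and weight per speaker")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    dims = len(means[0])
    mean = [sum(w * m[d] for w, m in zip(weights, means)) for d in range(dims)]
    var = [sum(w * v[d] for w, v in zip(weights, variances)) for d in range(dims)]
    return mean, var
```

Sweeping the weights from [1, 0] to [0, 1] and applying this to every state would move the synthetic voice gradually from one speaker to the other.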
29
A3. Voice conversion
  • MLLR-based model adaptation
  • 550 utterances of the target speaker
  • Simulation of a celebrity voice

[Figure: Source speaker TTS system + speech samples of the target speaker → model adaptation → target speaker TTS system]
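MLLR adapts the source model's Gaussian means with a shared affine transform, mu' = A mu + b, estimated from the target speaker's adaptation data. The full estimate uses state-occupancy statistics and per-class covariances; the sketch below assumes a single global regression class, scalar features, equal variances and a hard alignment of each adaptation frame to one Gaussian, in which case the ML estimate reduces to ordinary least squares. The function names are hypothetical:

```python
from typing import List, Sequence, Tuple


def estimate_global_mean_transform(
    aligned_means: Sequence[float], observations: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares (a, b) so that observation ~ a * mean + b.

    aligned_means[t] is the source-model mean of the Gaussian that
    adaptation frame t is aligned to; observations[t] is the target
    speaker's feature value for that frame.
    """
    n = len(aligned_means)
    if n != len(observations) or n < 2:
        raise ValueError("need at least two aligned frames")
    mx = sum(aligned_means) / n
    my = sum(observations) / n
    sxx = sum((m - mx) ** 2 for m in aligned_means)
    if sxx == 0.0:
        raise ValueError("aligned means must not all be equal")
    sxy = sum((m - mx) * (o - my) for m, o in zip(aligned_means, observations))
    a = sxy / sxx
    return a, my - a * mx


def adapt_means(means: Sequence[float], a: float, b: float) -> List[float]:
    """Apply mu' = a * mu + b to every Gaussian mean in the source model."""
    return [a * m + b for m in means]
```

Because one transform is shared by all Gaussians, means never seen in the adaptation data are still moved toward the target speaker, which is what makes adaptation from a limited set of utterances workable.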
30
A4. Expressive speech synthesis
  • Emotional speech synthesis: sad, happy
  • Singing TTS
    • Based on the reading-style synthesis system
    • Prosody controlled according to the input
      musical score information
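Controlling prosody from a score amounts to turning each note into target F0 and duration values for the synthesizer. A minimal sketch, assuming notes are given as MIDI note numbers with lengths in beats, equal temperament tuned to A4 = 440 Hz, and a fixed tempo; the score format and function names are hypothetical:

```python
from typing import List, Sequence, Tuple


def note_to_f0(midi_note: int) -> float:
    """Equal-temperament frequency in Hz; MIDI 69 is A4 = 440 Hz."""
    return 440.0 * 2.0 ** ((midi_note - 69) / 12)


def note_duration_seconds(beats: float, tempo_bpm: float) -> float:
    """Length of a note lasting `beats` beats at `tempo_bpm` beats per minute."""
    return beats * 60.0 / tempo_bpm


def score_to_targets(score: Sequence[Tuple[str, int, float]],
                     tempo_bpm: float) -> List[Tuple[str, float, float]]:
    """Map (lyric syllable, MIDI note, beats) to (syllable, F0 in Hz, seconds)."""
    return [(syllable, note_to_f0(note), note_duration_seconds(beats, tempo_bpm))
            for syllable, note, beats in score]
```

In this sketch the score-derived targets would take the place of the F0 and duration values that the reading-style system would otherwise predict; effects such as vibrato are not modeled.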
31
Thank you!
  • Are there any questions?