Speech Synthesis in the SPACE Reading Tutor: Closing Symposium of the SPACE Project, 06 FEB 2009

Slides: 27
Provided by: esatKu
Transcript and Presenter's Notes

1
Speech Synthesis in the SPACE Reading Tutor
Closing Symposium of the SPACE Project, 06 FEB 2009
  • Yuk On Kong, Lukas Latacz, Werner Verhelst
  • Laboratory for Digital Speech and Audio Processing, Vrije Universiteit Brussel

2
Introduction
3
To Record or Not to Record: That's the Question
  • Pre-recorded speech in existing reading tutors
  • Advantages / disadvantages?

4
Application-specific TTS
  • Speaker / voice
  • Material in speech corpus
  • How to synthesize
  • Any extra mode necessary?
  • What if the child is too slow?
  • How to maximize quality

5
Speaker / Voice
  • Speaker
  • Appealing to children
  • Female speaker
  • Standard Flemish pronunciation, no noticeable
    regional accent
  • Experienced speaker

6
Material in Speech Corpus
  • Database (about 6 hours)
  • Material from stories for children
  • Words expected at 6 years of age
  • Diphones

7
How to synthesize
  • Based on the general unit selection paradigm.
  • Heterogeneous units: units can be of various sizes
  • Bases
  • Use of longer chunks leads to quality
    improvement.
  • Used for synthesizing domain-specific utterances.

Fig. The word "oma" to synthesize and its multi-tier segmentation: word (oma), syllables (o, ma) and segments (_-o, o-m, m-a)
8
How to synthesize
  • Basic algorithm
  • Search top-down and select longest sequence of
    targets at each level and go to lower levels if
    no candidates are found.
  • Coarticulation: even across word boundaries
  • Levels: diphone, syllable, word, phrase
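
The basic algorithm above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the target is represented as a hypothetical tree of (label, children) pairs, the speech database is reduced to sets of label sequences per level, the phrase level is left out for brevity, and runs are not matched across the boundaries of a unit once the search has descended into it.

```python
# Toy sketch of top-down, longest-match unit search (all names are hypothetical).
LEVELS = ["word", "syllable", "diphone"]  # phrase level omitted for brevity


def longest_match(labels, start, inventory):
    """Return the largest end such that labels[start:end] is in inventory, else start."""
    for end in range(len(labels), start, -1):
        if tuple(labels[start:end]) in inventory:
            return end
    return start


def select(nodes, level, db):
    """nodes: list of (label, children) at LEVELS[level]; returns the chosen chunks."""
    name = LEVELS[level]
    labels = [label for label, _ in nodes]
    chosen, i = [], 0
    while i < len(nodes):
        end = longest_match(labels, i, db[name])
        if end > i:                       # longest run found at this level
            chosen.append((name, tuple(labels[i:end])))
            i = end
        else:                             # nothing found: go one level down
            children = nodes[i][1]
            if not children:
                raise LookupError(f"no candidate for {labels[i]!r}")
            chosen.extend(select(children, level + 1, db))
            i += 1
    return chosen


# Assumed toy data: "oma" exists as a whole word, "opa" does not.
utterance = [
    ("oma", [("o", [("_-o", [])]), ("ma", [("o-m", []), ("m-a", [])])]),
    ("opa", [("o", [("a-o", [])]), ("pa", [("o-p", []), ("p-a", [])])]),
]
db = {"word": {("oma",)}, "syllable": {("pa",)}, "diphone": {("a-o",)}}
result = select(utterance, 0, db)
print(result)
```

With this toy inventory, "oma" is taken as a whole word, while "opa" falls back to a diphone for its first syllable and a syllable-sized unit for the second.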

9
How to synthesize
  • Example input: Als het flink vriest, kunnen we schaatsen. ("If it freezes hard, we can go skating.")
  • Front-end modules: Tokenisation, Text Normalisation, Phrase and Pause Prediction, Part of speech, Silence Insertion, Word Pronunciation, ToDI Intonation, Word Accent
  • Back-end modules: Unit Selection, Unit Concatenation
  • Speech DB
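
As a rough illustration of the first front-end stage only, the sketch below splits the example sentence into tokens and records the punctuation attached to each one. The field names are hypothetical; they merely mirror the token-level features (token punctuation, token prepunctuation, capitalisation) that appear among the word-level target costs later in the talk. The actual tokeniser of the system is not described in the slides.

```python
import string


def tokenise(text):
    """Split on whitespace and peel leading/trailing punctuation off each token."""
    tokens = []
    for raw in text.split():
        core = raw.lstrip(string.punctuation)
        prepunc = raw[: len(raw) - len(core)]      # punctuation before the word
        word = core.rstrip(string.punctuation)
        punc = core[len(word):]                    # punctuation after the word
        if word:
            tokens.append({
                "word": word,
                "prepunctuation": prepunc,
                "punctuation": punc,
                "is_capitalized": word[0].isupper(),
            })
    return tokens


tokens = tokenise("Als het flink vriest, kunnen we schaatsen.")
for t in tokens:
    print(t)
```

Later stages (normalisation, part-of-speech tagging, phrase prediction and so on) would add further fields to each token.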
10
How to synthesize
  • Target prosody is described symbolically
  • Best sequence of units is selected
  • Weighted sum of target and join costs
  • Viterbi search
  • Joins
  • Costs based on spectrum, pitch, energy, duration
    and adjacency
  • PSOLA-based algorithm with optimal coupling

Target costs per level
  • Segment: phonemic identity; pause type (if silence); position in syllable
  • Syllable: phoneme sequence; lexical stress; ToDI accent; Is_accented; onset and coda type; onset, nucleus and coda size; distance to next/previous stressed syllable (in syllables); number of stressed syllables until next/previous phrase break; distance to next/previous accented syllable (in syllables); number of accented syllables until next/previous phrase break
  • Word: position in phrase; part of speech; Is_content_word; Has_accented_syllable(s); Is_capitalized; token punctuation; token prepunctuation; number of words until next/previous phrase break; number of content words until next/previous phrase break
Features marked on the original slide are also calculated for the neighboring segments, syllables or words. Neighboring syllables are restricted to the syllables of the current word. For segments and words, three neighbors on the left and three on the right are taken into account.
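
The selection step on this slide (a weighted sum of target and join costs, minimised with a Viterbi search) can be sketched as follows. This is a minimal sketch under assumptions, not the system's code: units are hypothetical (name, pitch, database index) tuples, the target cost only compares pitch, and the join cost is zero for units that are adjacent in the database and a pitch-based penalty otherwise.

```python
def viterbi_select(candidates, target_cost, join_cost, w_target=1.0, w_join=1.0):
    """candidates[i] lists the candidate units for target position i.
    Returns (best unit sequence, total weighted cost)."""
    # best[i][j] = (cumulative cost of the best path ending in candidate j, backpointer)
    best = [[(w_target * target_cost(0, c), None) for c in candidates[0]]]
    for i in range(1, len(candidates)):
        row = []
        for c in candidates[i]:
            tc = w_target * target_cost(i, c)
            row.append(min(
                (best[i - 1][k][0] + w_join * join_cost(p, c) + tc, k)
                for k, p in enumerate(candidates[i - 1])
            ))
        best.append(row)
    # Backtrack from the cheapest final candidate.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    path.reverse()
    return path, total


# Assumed toy data: (name, pitch in Hz, index in the speech database).
target_pitch = [100, 110, 120]
candidates = [
    [("a1", 100, 0), ("a2", 130, 10)],
    [("b1", 140, 1), ("b2", 110, 20)],
    [("c1", 120, 2), ("c2", 125, 21)],
]


def target_cost(i, unit):
    return abs(unit[1] - target_pitch[i]) / 10


def join_cost(prev, unit):
    if unit[2] == prev[2] + 1:        # adjacent in the database: a free join
        return 0.0
    return 1.0 + abs(prev[1] - unit[1]) / 10


path, total = viterbi_select(candidates, target_cost, join_cost)
print([u[0] for u in path], total)
```

In this toy case the search prefers a1, b2, c2: it accepts one artificial join in exchange for a good pitch match and a free join between the adjacent units b2 and c2.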
11
Extra Modes?
  • Phoneme-by-phoneme mode
  • Stress
  • Syllable mode

12
Extra Modes?
  • Demonstration
  • Phoneme-by-phoneme
  • Stress on first phoneme
  • Syllable
  • Normal mode

Demo words: moeilijk ("difficult"), koffiezetapparaat ("coffee maker")




13
The Child is Too Slow
  • Choosing the appropriate reading speed for the
    child
  • Uniform WSOLA time-scaling
  • Insertion of additional silences between
    neighboring words
  • Reading along
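
The second slowing-down strategy, inserting additional silences between neighboring words, is simple enough to show directly. The sketch below assumes that each word is already available as a list of samples; the function name and parameters are illustrative only.

```python
def insert_silences(words, sample_rate, pause_ms):
    """Concatenate per-word sample lists with pause_ms of silence between neighbors."""
    gap = [0.0] * (sample_rate * pause_ms // 1000)
    out = []
    for i, samples in enumerate(words):
        if i > 0:                 # no silence before the first word
            out.extend(gap)
        out.extend(samples)
    return out


# Three dummy "words" at 16 kHz, separated by 250 ms pauses.
words = [[0.1] * 100, [0.2] * 50, [0.3] * 25]
slowed = insert_silences(words, 16000, 250)
print(len(slowed))
```

Unlike uniform time-scaling, this leaves the words themselves untouched and only lowers the overall speaking rate.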

14
The Child is Too Slow
15
How to Maximize Quality
  • Major synthesis problems
  • Join artifacts
  • Inappropriate prosody
  • Interactive tuning of synthesis
  • Assisted by quality management
  • User can make small changes to the input text

16
How to Maximize Quality
  • Approach
  • For each word, calculate average target and join
    costs
  • Predictor
  • Threshold based on the max and min of cost c
  • u_j usually lies between 0 and 1 because of the training settings.
  • Accept if u_j < 0.5 and reject otherwise.
  • Weights: linear regression
  • Best alphas found iteratively (maximizing F-score)
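
The predictor formula itself is not preserved in this transcript, so the sketch below only illustrates the kind of tuning loop the last bullet describes. It rests on assumptions that go beyond the slide: a single average cost per word, a rejection threshold placed between the minimum and maximum cost by a hypothetical parameter alpha, and a set of words hand-labelled as bad. The linear-regression weights are not modelled.

```python
def f_score(predicted, actual):
    """F1 score for boolean predictions against boolean labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_alpha(costs, is_bad, steps=20):
    """Try alpha = 0, 1/steps, ..., 1 and keep the one with the highest F-score."""
    lo, hi = min(costs), max(costs)
    best = (None, None, -1.0)
    for s in range(steps + 1):
        alpha = s / steps
        threshold = lo + alpha * (hi - lo)
        rejected = [c > threshold for c in costs]   # reject words above the threshold
        score = f_score(rejected, is_bad)
        if score > best[2]:
            best = (alpha, threshold, score)
    return best


# Made-up average costs per word and made-up listener judgements.
costs = [2, 3, 4, 9, 11, 12]
is_bad = [False, False, False, True, True, True]
alpha, threshold, score = best_alpha(costs, is_bad)
print(alpha, threshold, score)
```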

17
Other Special Aspects
  • Phrase and Silence Prediction
  • Context-dependent Weight Training

18
Phrase and Silence Prediction
  • Types of pauses: heavy, medium and light
  • Phrase breaks: both heavy and medium pauses
  • Training
  • No manual labeling, but based on the pauses
    automatically labeled in the speech database
  • Iterative classification based on these pauses
  • Training of memory-based learner (features such
    as POS, punctuation, ...)
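
A memory-based learner stores its training examples and classifies a new case by looking up the most similar stored ones. The sketch below shows that idea with a k-nearest-neighbour vote over symbolic features. The feature set (part of speech of the current and next word, plus punctuation), the labels and the training examples are all made up for illustration; the slides do not say which learner or feature encoding was used.

```python
from collections import Counter


def overlap_distance(a, b):
    """Number of feature positions in which two examples differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(memory, features, k=3):
    """Majority vote among the k stored examples closest to `features`."""
    nearest = sorted(memory, key=lambda ex: overlap_distance(ex[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up memory: (POS of word, POS of next word, punctuation) -> pause type.
memory = [
    (("noun", "conj", ","), "medium"),
    (("verb", "pron", ","), "medium"),
    (("verb", "verb", ","), "medium"),
    (("noun", "none", "."), "heavy"),
    (("verb", "none", "."), "heavy"),
    (("adj", "noun", ""), "none"),
    (("det", "noun", ""), "none"),
    (("verb", "det", ""), "none"),
]

# "vriest," followed by "kunnen": a verb, then a verb, with a comma in between.
after_vriest = classify(memory, ("verb", "verb", ","))
sentence_end = classify(memory, ("noun", "none", "."))
print(after_vriest, sentence_end)
```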

19
Context-dependent Weight Training
  • Automatic adaptation (tuning) of weights
  • Context-dependent weights
  • Context is described symbolically per phone
  • Training
  • Optimizing weights
  • Clustering optimized weights (decision trees)

20
Context-dependent Weight Training
  • 7 subjects
  • 4 conditions
  • Randomly selected corpus, context-dependent weights
  • Randomly selected corpus, untrained weights
  • Corpus selected based on word frequency, context-dependent weights
  • Corpus selected based on word frequency, untrained weights
  • 25 test utterances, AVI levels 1-5 (5 utterances per level)
  • Results
  • Results

Condition                                                             MOS
Randomly selected corpus, context-dependent weights                   3.1
Randomly selected corpus, untrained weights                           3.1
Corpus selected based on word frequency, context-dependent weights    3.3
Corpus selected based on word frequency, untrained weights            3.0
21
Demonstration
  • Hierarchical unit selection
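  • AVI 1: Dit is te gek, gilt ze. ("This is crazy, she shrieks.")
  • AVI 3: Toch had hij liever de hond gehad. ("Still, he would rather have had the dog.")
  • AVI 5: Roel ligt nog een paar dagen in het ziekenhuis. ("Roel will be in the hospital for a few more days.")
  • AVI 7: De kleine huizen staan dicht tegen elkaar aan. ("The small houses stand close up against each other.")
  • AVI 9: Nou Henk, zie je nu wel dat je moeder hier fantastisch verzorgd wordt! ("Well, Henk, now you see that your mother is being wonderfully cared for here!")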

22
WSOLA
Fig. Illustration of the WSOLA strategy. Top: original signal. Bottom: WSOLA time-scaled signal.
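
For readers unfamiliar with WSOLA (waveform similarity overlap-add), the sketch below shows the core idea in plain Python: output frames are laid down at a fixed hop, and for each one the input segment is picked within a small tolerance window so that it best resembles the natural continuation of the previously copied segment. This is a simplified illustration, not the implementation used in the tutor: frame length, tolerance and the use of an unnormalised cross-correlation as similarity measure are assumptions made for brevity.

```python
import math


def wsola(x, stretch, frame=256, tolerance=64):
    """Time-scale x by `stretch` (> 1 slows down) while keeping the local waveform."""
    hop = frame // 2
    # Periodic Hann window: copies shifted by `hop` add up to 1.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame) for n in range(frame)]
    out_len = int(len(x) * stretch)
    y = [0.0] * (out_len + frame)
    # Zero-pad so that every candidate segment stays inside the buffer.
    pad = tolerance
    xp = [0.0] * pad + list(x) + [0.0] * (pad + frame + hop)
    prev = pad
    for out_pos in range(0, out_len, hop):
        nominal = pad + int(round(out_pos / stretch))
        best = nominal
        if out_pos > 0:
            # What would naturally follow the previously copied segment.
            template = xp[prev + hop: prev + hop + frame]
            best_score = -math.inf
            for cand in range(nominal - tolerance, nominal + tolerance + 1):
                score = sum(a * b for a, b in zip(template, xp[cand:cand + frame]))
                if score > best_score:
                    best, best_score = cand, score
        for n in range(frame):                    # windowed overlap-add
            y[out_pos + n] += window[n] * xp[best + n]
        prev = best
    return y[:out_len]


# A sine with a 32-sample period, slowed down by 50%.
tone = [math.sin(2 * math.pi * n / 32) for n in range(2048)]
slow = wsola(tone, 1.5)
print(len(tone), len(slow))
```

Because the copied segments are aligned on waveform similarity, the slowed-down tone keeps its 32-sample period; only its duration changes.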
23
Other Application
  • Audio-visual TTS
  • Example: the sentence you hear is made out of many combinations of original sound and video, selected from the recordings of natural speech.
  • Database containing about 20 minutes (LIPS
    Challenge 08)
  • For better audio quality, the database should be
    much larger

24
Future Work
  • Optimizing synthesis
  • User feedback
  • Expressive speech synthesis
  • Automated prosodic annotations
  • Quality Management
  • Evaluation and optimization of the algorithm
  • Compare with the perceived quality of synthesized
    sentences (MOS)

25
Questions?
  • Thank you for your attention.
  • Acknowledgments
  • Prof. Wivine Decoster (our speaker)
  • Jacques, Leen and other SPACE members
  • Wesley and other DSSP people
  • IWT

26
  • THE END