CSE 551/651

1
CSE 551/651: Structure of Spoken Language
Lecture 16: Text-to-Speech (TTS) Technology
John-Paul Hosom, Fall 2005
2
Text-to-Speech (TTS) Synthesis

Having looked at theories of human speech
production and speech perception, now we'll look
at structures and algorithms currently used to
implement these technologies.
Text-to-Speech (TTS) has three main approaches:
(1) formant-based
(2) concatenative
(3) articulatory
All TTS approaches must address:
(a) text analysis: from text, predict phonemes,
stress, and phrase boundaries
(b) prosody: from text-analysis output, predict
pitch contour, energy contour, and duration of
each phoneme
(c) signal processing: given phoneme symbols and
timing, generate the speech waveform

3
Text-to-Speech (TTS) Synthesis

From a linguistic perspective, there may be many
more things to consider

(from Klatt 1987)
4
Text-to-Speech (TTS) Synthesis

Text Analysis
First, convert words to phonemes:
hello → h eh l ow
read → r iy dcl d or r eh dcl d
"Send $2.50 to 1024 Clough Dr. ASAP." → ??
Also, syllabification: aeon → ae/on, aortae →
a/or/tae, tasty → tas/ty, tast/y, or ta/sty?
Plus, stress: CONduct (noun) vs. conDUCT (verb)
And, accent: "I want to go now, not later!"
Tones: not in English (whew!)
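
The "Send $2.50 ..." example shows why text has to be normalized before phonemes can be predicted. The following is a minimal sketch of that step, not a description of any particular TTS system. The abbreviation table, the choice to read "Dr." as "Drive", and the rule that four-digit numbers are read as two pairs are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation table. A real system needs context to choose
# between readings such as "Doctor" and "Drive" for "Dr.".
ABBREVIATIONS = {"Dr.": "Drive", "ASAP": "as soon as possible"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n):
    """Spell out an integer in the range 0-99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def street_number(digits):
    """Read a 4-digit string as two pairs, e.g. '1024' -> 'ten twenty-four'."""
    first, second = int(digits[:2]), int(digits[2:])
    if second == 0:
        return two_digits(first) + " hundred"
    if second < 10:
        return two_digits(first) + " oh " + ONES[second]
    return two_digits(first) + " " + two_digits(second)


def money(dollars, cents):
    """Spell out a dollar amount below $100."""
    d, c = int(dollars), int(cents)
    text = two_digits(d) + (" dollar" if d == 1 else " dollars")
    if c:
        text += " and " + two_digits(c) + (" cent" if c == 1 else " cents")
    return text


def normalize(text):
    """Expand currency, 4-digit numbers, and known abbreviations, in that order."""
    text = re.sub(r"\$(\d{1,2})\.(\d{2})",
                  lambda m: money(m.group(1), m.group(2)), text)
    text = re.sub(r"\b(\d{4})\b", lambda m: street_number(m.group(1)), text)
    for abbreviation, expansion in ABBREVIATIONS.items():
        text = text.replace(abbreviation, expansion)
    return text


print(normalize("Send $2.50 to 1024 Clough Dr. ASAP."))
```

The order of the substitutions matters: currency is expanded first so that the digits in "$2.50" are not picked up by the plain-number rule.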

5
Text-to-Speech (TTS) Synthesis

Text Analysis
Three basic techniques/approaches:
(1) Dictionaries: including word pronunciation,
accent marks, syllabification, etc. But many
words are not in the dictionary, and some words
have multiple pronunciations depending on context.
(2) Rules: such as morphological analysis, e.g.
parallelization → parallel + ize + ation
(3) CART (Classification and Regression Trees):
use context to make a sequence of binary
decisions leading to a final decision

[Figure: example decision tree for "conduct", with questions such as noun?, sent_begin?, and prev_word = "the"?]
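
As a rough illustration of how binary context questions can pick a pronunciation, here is a hand-written decision tree for the stress of "conduct". It is only a sketch: a real CART is trained from labeled data, and the question order, the leaf labels, and the `NOUN_CUES` set are assumptions made for this example.

```python
# Hypothetical determiners and possessives that suggest a following noun.
NOUN_CUES = {"the", "a", "his", "her", "their", "this", "that"}


def conduct_stress(words, index):
    """Choose a stress pattern for 'conduct' at words[index] via binary questions."""
    sentence_initial = index == 0
    prev_word = words[index - 1].lower() if index > 0 else None

    # Question 1: sentence-initial? Assumed to be an imperative verb here,
    # as in "Conduct the test."
    if sentence_initial:
        return "conDUCT"
    # Question 2: is the previous word a noun cue such as "the"?
    if prev_word in NOUN_CUES:
        return "CONduct"
    # Default leaf: verb reading.
    return "conDUCT"


print(conduct_stress("the conduct was poor".split(), 1))   # noun reading
print(conduct_stress("they conduct research".split(), 1))  # verb reading
```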
6
Text-to-Speech (TTS) Synthesis

Prosody: Pitch, Duration, Energy, Allophones
Pitch modeling: superposition of phrase and accent
components (Fujisaki model)

(from van Santen, 2000)
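
The superposition idea can be sketched numerically. In the usual formulation of the Fujisaki model, log F0 is a baseline plus phrase components (impulse responses) plus accent components (step responses switched on and off). The constants `alpha`, `beta`, and `gamma` below are commonly quoted typical values, and the command times and amplitudes in the usage line are made up for illustration.

```python
import math


def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase-control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent-control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_hz, phrase_cmds, accent_cmds):
    """Return F0 in Hz at time t (seconds).

    phrase_cmds: list of (onset_time, amplitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    """
    log_f0 = math.log(base_hz)
    for onset, amplitude in phrase_cmds:
        log_f0 += amplitude * phrase_response(t - onset)
    for onset, offset, amplitude in accent_cmds:
        log_f0 += amplitude * (accent_response(t - onset)
                               - accent_response(t - offset))
    return math.exp(log_f0)


# Hypothetical utterance: one phrase command at 0 s, one accent from 0.2 to 0.5 s.
contour = [fujisaki_f0(i * 0.05, 100.0, [(0.0, 0.5)], [(0.2, 0.5, 0.4)])
           for i in range(30)]
print([round(f) for f in contour])
```

Because the components add in the log domain, the accent raises F0 by a ratio rather than by a fixed number of Hz.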
7
Text-to-Speech (TTS) Synthesis

Pitch contours
However, more detailed approaches are available
(van Santen), and pitch contour modeling is not a
solved problem.

(from Klatt 1987)
8
Text-to-Speech (TTS) Synthesis

Factors affecting duration include:
Current phoneme, Previous phoneme, Next phoneme,
Word stress, Phrase accent, Degree of emphasis,
Position in syllable (onset vs. coda, open vs.
closed), Position in word, Position in phrase,
Position in foot
Models of Duration
Hand-Tuned Rules (Klatt)
PRCNT is based on 11 rules: pause insertion,
clause-final lengthening, phrase-final
lengthening, non-word-final shortening,
polysyllabic shortening, non-initial-consonant
shortening, unstressed shortening, lengthening
for emphasis, postvocalic context of vowels,
shortening in clusters, lengthening due to
plosive aspiration
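
For context, Klatt's published rule system uses PRCNT to interpolate between a minimum and an inherent duration for each segment: DUR = MINDUR + (INHDUR - MINDUR) × PRCNT / 100. The sketch below applies that formula, assuming each rule that fires scales PRCNT multiplicatively. The millisecond values and rule percentages in the usage lines are invented for illustration and are not Klatt's table entries.

```python
def klatt_duration(inherent_ms, minimum_ms, rule_percents):
    """Duration in ms from DUR = MINDUR + (INHDUR - MINDUR) * PRCNT / 100.

    rule_percents lists the percentage contributed by each rule that fires,
    e.g. 140 for a lengthening rule or 70 for a shortening rule.
    """
    prcnt = 100.0
    for percent in rule_percents:
        prcnt = prcnt * percent / 100.0  # assumed: rules combine multiplicatively
    return minimum_ms + (inherent_ms - minimum_ms) * prcnt / 100.0


# Hypothetical vowel with inherent duration 200 ms and minimum duration 80 ms.
print(klatt_duration(200.0, 80.0, []))        # no rule fires
print(klatt_duration(200.0, 80.0, [140]))     # one lengthening rule
print(klatt_duration(200.0, 80.0, [70, 80]))  # two shortening rules
```

The minimum duration acts as a floor: however many shortening rules fire, only the part of the segment above MINDUR is compressed.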

9
Text-to-Speech (TTS) Synthesis

Duration Modeling using CART
Given a large corpus annotated with duration,
phoneme, stress, and phrase information, train a
CART classifier.
Doesn't generalize well to data not seen in
training.
Sums-of-Products Approach (van Santen)
Dur(/p/, c2, c3, c4, ...) = A11(/p/) + A21(/p/) × A22(c2) + A33(c3) × A34(c4)
where cn is a particular context
Duration is a combined function of contexts
Statistical, not rule-based; simple to train;
generalizes well

[Figure: example CART for duration, with questions such as plosive?, vowel?, and word-initial? ("more rules here") and leaf values such as 0.98 and 1.13]
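
A sums-of-products duration model can be sketched as a few table lookups. The example below reads the slide's formula as Dur = A11(/p/) + A21(/p/) × A22(c2) + A33(c3) × A34(c4), which is an assumption about where the lost operators go. The context names and every table value are hypothetical; in practice the parameters are estimated from a corpus.

```python
# Hypothetical parameter tables: additive terms in ms, the others unitless scales.
A11 = {"p": 40.0}
A21 = {"p": 30.0}
A22 = {"stressed": 1.2, "unstressed": 0.8}
A33 = {"word-initial": 10.0, "word-medial": 5.0}
A34 = {"phrase-final": 1.5, "phrase-medial": 1.0}


def sop_duration(phoneme, c2, c3, c4):
    """Duration in ms as a sum of products of per-context parameters."""
    return A11[phoneme] + A21[phoneme] * A22[c2] + A33[c3] * A34[c4]


print(sop_duration("p", "stressed", "word-initial", "phrase-final"))
print(sop_duration("p", "unstressed", "word-medial", "phrase-medial"))
```

Each table is small and is estimated across all contexts that share a factor level, which is one way to see why such a model can produce a duration for a combination of contexts that never occurred in training.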
10
Text-to-Speech (TTS) Synthesis

Generating a Waveform: Articulatory Synthesis
The vocal tract is divided into a large number of
short tubes, as in the electrical transmission
line analog (Lecture 11), which are then combined,
and the resonant frequencies are calculated.

(from Sinder, 1999; thesis work with Flanagan,
Rutgers)
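
The simplest special case of the tube model is a single uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of c / 4L. The sketch below computes only that case and is not the multi-tube calculation described above. The 17.5 cm length and 35000 cm/s speed of sound are conventional textbook values used here as assumptions.

```python
def uniform_tube_resonances(length_cm=17.5, speed_of_sound_cm_s=35000.0, count=3):
    """Resonant frequencies (Hz) of a uniform tube closed at one end, open at the other.

    F_n = (2n - 1) * c / (4 * L) for n = 1, 2, ..., count
    """
    return [(2 * n - 1) * speed_of_sound_cm_s / (4.0 * length_cm)
            for n in range(1, count + 1)]


print(uniform_tube_resonances())  # [500.0, 1500.0, 2500.0]
```

These values approximate the formants of a neutral vowel. Shortening the tube raises all resonances in proportion.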
11
Text-to-Speech (TTS) Synthesis

Generating a Waveform: Articulatory Synthesis
Vocal-tract sources include noise and a buzz
source for voiced sounds.
Articulatory synthesis is important for validating
the Motor Theory of Speech Perception.
Demos from 1976 and circa 1992 (Haskins
Labs)

12
Text-to-Speech (TTS) Synthesis

Generating a Waveform: Formant Synthesis
Instead of specifying mouth shapes, formant
synthesis specifies frequencies and bandwidths of
resonators, which are used to filter a source
waveform.
Formant frequency analysis is difficult;
bandwidth estimation is even more difficult. But
the biggest perceptual problem in formant
synthesis is not in the resonances, but in a
buzzy quality most likely due to the glottal
source model.
Formant synthesis can sound identical to a natural
utterance if details of the glottal source and
formants are well modeled.
Demo: natural speech vs. synthetic speech (John Holmes, 1973)

13
Text-to-Speech (TTS) Synthesis

Formant TTS Synthesis: Architecture
Formant-synthesis systems contain a number of
sound sources, which are passed to filters
arranged either in parallel or in cascade
(series). Each filter corresponds to one formant
(resonance) or anti-resonance.
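
A cascade arrangement can be sketched with second-order digital resonators of the kind used in Klatt-style synthesizers, where y[n] = A·x[n] + B·y[n-1] + C·y[n-2]. In this sketch the source is a bare impulse train rather than a realistic glottal model, and the formant frequencies, bandwidths, and 10 kHz sample rate are illustrative assumptions.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, sample_rate):
    """Coefficients (A, B, C) of a two-pole resonator with unity gain at 0 Hz."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bw_hz * t)
    b = 2.0 * math.exp(-math.pi * bw_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq_hz, bw_hz, sample_rate):
    """Filter a list of samples through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, sample_rate)
    y1 = y2 = 0.0  # previous two output samples
    out = []
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, sample_rate):
    """Crude voiced source: one unit impulse per pitch period."""
    period = int(round(sample_rate / f0_hz))
    n = int(round(duration_s * sample_rate))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def cascade_synthesize(formants, f0_hz=100.0, duration_s=0.05, sample_rate=10000):
    """Pass the source through each (frequency, bandwidth) resonator in series."""
    signal = impulse_train(f0_hz, duration_s, sample_rate)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, sample_rate)
    return signal


# Hypothetical neutral-vowel formants as (frequency Hz, bandwidth Hz) pairs.
samples = cascade_synthesize([(500.0, 60.0), (1500.0, 90.0), (2500.0, 150.0)])
print(len(samples), max(samples))
```

In a cascade, relative formant amplitudes come out of the filter chain automatically. A parallel arrangement would instead sum the resonator outputs and would need an explicit gain for each formant.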

(From Yamaguchi, 1993)
14
Text-to-Speech (TTS) Synthesis

Formant systems: Rule-Based Synthesis
For synthesis of arbitrary text, formants and
bandwidths for each phoneme are determined by
analyzing the speech of a single person.
The model of each phoneme may be a single set of
formant frequencies and bandwidths for a
canonical phoneme at a single point in time, or a
trajectory of frequencies, bandwidths, and source
models over time.
The formant frequencies for each phoneme are
combined over time using a model of
coarticulation, such as Klatt's modified locus
theory.
Duration, pitch, and energy rules are applied.
Result: something like this

15
Text-to-Speech (TTS) Synthesis

Despite great success in copy synthesis,
synthesis by rule using formants has severely
degraded quality. It's not clear why. A problem
with the glottal source? A problem with
coarticulation and formant transitions? A problem
with prosody?
Formant synthesis was the main TTS technique until
the early or mid 1990s, when increasing memory
size and CPU speed allowed concatenative
synthesis to become a viable approach.
Concatenative synthesis uses recordings of small
units of speech (typically the region from the
middle of one phoneme to the middle of the next
phoneme, that is, a diphone unit), and glues these
units together to form words and sentences.
Concatenative synthesis means that you don't have
to worry about glottal source models or
coarticulation, since the synthesis is just a
concatenation of different waveforms containing
natural glottal source and coarticulation.

16
Text-to-Speech (TTS) Synthesis

Concatenative Synthesis: Units
The basic unit for concatenative synthesis is the
diphone.
More recent TTS research is on using larger
units. Issues include:
(a) how to decide what units will be used?
(b) how to select the best unit from a very large
database?
With increasing size and variety of units, there
is an exponential growth in the database size.
Yet, despite massive databases that may take
months to record, coverage is nowhere near
complete. There is a very large number of
infrequent events in speech.

Example diphone sequence: sil-jh jh-aa aa-n n-sil
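
The mapping from a phoneme string to the diphone units that must be fetched from the database can be sketched in a few lines. The function name and the use of "sil" for utterance-boundary silence are assumptions chosen to match the example sequence above.

```python
def to_diphones(phonemes, silence="sil"):
    """Return the diphone names spanning a phoneme sequence, padded with silence."""
    padded = [silence] + list(phonemes) + [silence]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


print(to_diphones(["jh", "aa", "n"]))
# ['sil-jh', 'jh-aa', 'aa-n', 'n-sil']
```

An utterance of N phonemes needs N + 1 diphones. An inventory of P phonemes allows up to P × P distinct diphones, which gives a sense of how quickly the database grows once longer units are added.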
17
Text-to-Speech (TTS) Synthesis

Concatenative Synthesis: Signal Processing
Waveform-based: Pitch-Synchronous Overlap-Add
(PSOLA)
Perform pitch modification by changing the spacing
of pitch-synchronous units
Or, use Line Spectral Frequencies (LSFs), which
are conceptually the harmonics in a spectrogram
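
The re-spacing idea behind PSOLA can be sketched as follows: cut a two-period, Hann-windowed grain around each pitch mark, then overlap-add the grains at a new spacing. This is a simplified time-domain sketch that assumes the pitch marks and a constant pitch period are already known. Real systems must estimate the marks and handle a varying period.

```python
import math


def hann(n):
    """Periodic Hann window; copies shifted by n/2 samples sum to exactly 1."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / n) for i in range(n)]


def psola_respace(signal, marks, period, new_period):
    """Overlap-add two-period grains centered on pitch marks at a new spacing.

    marks: sample indices of pitch marks in the input (assumed given)
    period: input pitch period in samples (assumed constant)
    new_period: desired output pitch period in samples
    """
    window = hann(2 * period)
    out = [0.0] * len(signal)
    t = marks[0]  # position of the next synthesis pitch mark
    while t <= marks[-1]:
        # Reuse the analysis grain closest in time, so duration is preserved.
        src = min(marks, key=lambda m: abs(m - t))
        for i in range(2 * period):
            s = src - period + i  # read position in the input
            d = t - period + i    # write position in the output
            if 0 <= s < len(signal) and 0 <= d < len(out):
                out[d] += signal[s] * window[i]
        t += new_period
    return out


# Hypothetical input: unit pulses every 100 samples, re-spaced to every 80 samples.
marks = list(range(100, 901, 100))
pulses = [1.0 if i in marks else 0.0 for i in range(1000)]
raised = psola_respace(pulses, marks, period=100, new_period=80)
print(sum(1 for v in raised if v > 0.5))  # number of pulses after re-spacing
```

A smaller `new_period` raises the pitch and a larger one lowers it. With `new_period` equal to `period`, the windows sum to one and the interior of the signal is reproduced unchanged.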

18
Text-to-Speech (TTS) Synthesis

DEMOS
Klatt's DECtalk: sample 1
AT&T: sample 1, sample 2
Bell Labs: sample 1, sample 2
OGI: sample 1, sample 2
