CSE 551/651: - PowerPoint PPT Presentation

Title: CSE 551/651

Slides: 19
Provided by: hos1

Transcript and Presenter's Notes

1
CSE 551/651: Structure of Spoken Language
Lecture 16: Text-to-Speech (TTS) Technology
John-Paul Hosom, Fall 2005
2
Text-to-Speech (TTS) Synthesis
  • Having looked at theories of human speech
    production and speech perception, now we'll look
    at structures and algorithms currently used to
    implement these technologies.
  • Text-to-Speech (TTS) has three main approaches:
    (1) formant-based
    (2) concatenative
    (3) articulatory
  • All TTS approaches must address:
    (a) text analysis: from text, predicting phonemes,
        stress, and phrase boundaries
    (b) prosody: from text-analysis output, predicting
        pitch contour, energy contour, and duration of
        each phoneme
    (c) signal processing: given phoneme symbols and
        timing, generating the speech waveform

3
Text-to-Speech (TTS) Synthesis
  • From a linguistic perspective, there may be many
    more things to consider

(from Klatt 1987)
4
Text-to-Speech (TTS) Synthesis
  • Text Analysis
  • First, convert words to phonemes:
    hello → h eh l ow
    read → r iy dcl d or r eh dcl d
    "Send $2.50 to 1024 Clough Dr. ASAP." → ??
  • Also, syllabification:
    aeon → ae/on
    aortae → a/or/tae
    tasty → tas/ty, tast/y, or ta/sty?
  • Plus, stress: CONduct (noun) vs. conDUCT (verb)
  • And, accent: "I want to go now, not later!"
  • Tones: not in English (whew!)
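
The "Send $2.50 ..." sentence is hard because each token needs a different expansion strategy before any phonemes can be looked up. A minimal sketch of the first step, tagging each token with a category, is given below. The category names, the regular expressions, and the tiny abbreviation table are illustrative assumptions, not part of the lecture.

```python
import re

# Hypothetical abbreviation table: "Dr." is ambiguous, so context is needed later.
ABBREVIATIONS = {"Dr.": ["Drive", "Doctor"]}


def classify_token(token):
    """Assign a coarse category that decides how a token would be expanded."""
    if re.fullmatch(r"\$\d+(\.\d{2})?", token):
        return "currency"
    if re.fullmatch(r"\d+", token):
        return "number"
    if token in ABBREVIATIONS:
        return "abbreviation"
    if re.fullmatch(r"[A-Z]{2,}\.?", token):
        return "acronym"
    return "word"


def classify_sentence(sentence):
    """Split on whitespace and tag every token."""
    return [(token, classify_token(token)) for token in sentence.split()]


for token, category in classify_sentence("Send $2.50 to 1024 Clough Dr. ASAP."):
    print(f"{token:8s} {category}")
```

Tagging is only the start: "1024" in a street address is usually read differently from "1024" as a quantity, and "Dr." still has to be resolved from context.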

5
Text-to-Speech (TTS) Synthesis
  • Text Analysis
  • Three basic techniques/approaches:
  • (1) Dictionaries, including word pronunciation,
    accent marks, syllabification, etc.; but many
    words are not in the dictionary, and some words
    have multiple pronunciations depending on context
  • (2) Rules, such as morphological analysis:
    parallelization → parallel + ize + ation
  • (3) CART (Classification and Regression Trees):
    use context to make binary decisions leading to
    a final decision

[Figure: CART decision tree for the word "conduct", with binary questions such as noun?, sent_begin?, and prev_word=the?]
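
The idea of a tree of binary context questions can be sketched as follows for the noun/verb stress of "conduct". The tree here is hand-written for illustration only; a real CART is trained from annotated data, and the questions, their order, and the determiner list are assumptions.

```python
# Hypothetical determiner list used by one of the context questions.
DETERMINERS = {"the", "a", "an", "his", "her", "their", "this", "that"}

# Each internal node is (question, {answer: subtree}); each leaf is a string.
CONDUCT_TREE = (
    "sent_begin",
    {
        True: "conDUCT",  # e.g. imperative "Conduct the survey."
        False: (
            "prev_is_determiner",
            {True: "CONduct", False: "conDUCT"},
        ),
    },
)


def context_features(words, index):
    """Answer each binary question for the word at position index."""
    return {
        "sent_begin": index == 0,
        "prev_is_determiner": index > 0 and words[index - 1] in DETERMINERS,
    }


def classify(tree, features):
    """Walk from the root to a leaf, following the answer to each question."""
    while not isinstance(tree, str):
        question, branches = tree
        tree = branches[features[question]]
    return tree


def stress_of_conduct(sentence):
    """Predict the stress pattern of "conduct" in a sentence that contains it."""
    words = sentence.lower().rstrip(".!?").split()
    index = words.index("conduct")
    return classify(CONDUCT_TREE, context_features(words, index))


print(stress_of_conduct("His conduct was exemplary."))
print(stress_of_conduct("They conduct research."))
```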
6
Text-to-Speech (TTS) Synthesis
  • Prosody: Pitch, Duration, Energy, Allophones
  • Pitch modeling: superposition of phrase and accent
    components (Fujisaki model)

(from van Santen, 2000)
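
A minimal sketch of the Fujisaki model in its commonly cited form: log F0 is a baseline plus the responses to phrase commands (impulses) and accent commands (rectangular pulses). The function names, the constants (alpha, beta, gamma), and the command timings below are illustrative assumptions, not values from the lecture.

```python
import math


def phrase_component(t, alpha=2.0):
    """Response to a phrase command (impulse) issued at t = 0."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0


def accent_component(t, beta=20.0, gamma=0.9):
    """Response to an accent command (step) issued at t = 0, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def fujisaki_f0(t, base_hz, phrases, accents):
    """F0 in Hz at time t (seconds).

    phrases: list of (onset_s, amplitude)
    accents: list of (start_s, end_s, amplitude)
    """
    log_f0 = math.log(base_hz)
    for onset, amplitude in phrases:
        log_f0 += amplitude * phrase_component(t - onset)
    for start, end, amplitude in accents:
        log_f0 += amplitude * (accent_component(t - start) - accent_component(t - end))
    return math.exp(log_f0)


# Assumed example: one phrase command at 0 s and one accent from 0.3 s to 0.6 s.
for step in range(0, 11):
    t = step / 10.0
    print(f"{t:.1f} s  {fujisaki_f0(t, 100.0, [(0.0, 0.5)], [(0.3, 0.6, 0.4)]):.1f} Hz")
```

Because the components are added in the log domain, the phrase contour and the accent bumps combine multiplicatively in Hz.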
7
Text-to-Speech (TTS) Synthesis
  • Pitch contours
  • However, more detailed approaches are available
    (van Santen), and pitch-contour modeling is not a
    solved problem

(from Klatt 1987)
8
Text-to-Speech (TTS) Synthesis
  • Factors affecting duration include:
  • Current phoneme, Previous phoneme, Next phoneme,
  • Word stress, Phrase accent, Degree of emphasis,
  • Position in syllable (onset vs. coda, open vs.
    closed), Position in word, Position in phrase,
    Position in foot
  • Models of Duration:
  • Hand-Tuned Rules (Klatt)
  • PRCNT based on 11 rules: pause insertion,
    clause-final lengthening, phrase-final
    lengthening, non-word-final shortening,
    polysyllabic shortening, non-initial-consonant
    shortening, unstressed shortening, lengthening
    for emphasis, postvocalic context of vowels,
    shortening in clusters, lengthening due to
    plosive aspiration
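
In Klatt's rule system, PRCNT scales the stretchable part of a segment's duration: DUR = MINDUR + (INHDUR - MINDUR) × PRCNT / 100, where each applicable rule multiplies PRCNT by its own percentage. A minimal sketch follows; the inherent and minimum durations and the rule percentages are made-up illustrative numbers, not Klatt's published values.

```python
def klatt_duration(inherent_ms, minimum_ms, rule_percents):
    """Apply Klatt-style duration rules to one segment.

    inherent_ms:   duration of the segment with no rules applied
    minimum_ms:    floor below which the segment cannot be compressed
    rule_percents: percentage from each rule that fires (100 means no change)
    """
    prcnt = 100.0
    for percent in rule_percents:
        prcnt = prcnt * percent / 100.0
    return minimum_ms + (inherent_ms - minimum_ms) * prcnt / 100.0


# Assumed example: a vowel lengthened phrase-finally (140%) and
# shortened because it is unstressed (50%).
print(klatt_duration(200.0, 100.0, [140.0, 50.0]))
```

Only the portion above the minimum duration is scaled, so no combination of shortening rules can compress a segment below its floor.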

9
Text-to-Speech (TTS) Synthesis
  • Duration Modeling using CART:
  • Given a large corpus annotated with duration,
    phoneme, stress, and phrase information, train a
    CART classifier
  • Doesn't generalize well to data not seen in
    training
  • Sums-of-Products Approach (van Santen):
  • Dur(/p/, c2, c3, c4, ...), where cn is a
    particular context:
    $\mathrm{Dur}(/p/, c_2, c_3, c_4, \ldots) = A_{1,1}(/p/) + A_{2,1}(/p/) \times A_{2,2}(c_2) + A_{3,3}(c_3) \times A_{3,4}(c_4)$
  • duration is a combined function of contexts
  • statistical, not rule-based; simple to train;
    generalizes well
[Figure: CART duration tree with binary questions such as plosive?, vowel?, and word-initial?, and leaf values such as 0.98 and 1.13]
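
A sums-of-products duration model can be sketched as a list of product terms, each term a list of lookup tables indexed by one factor. The three terms below follow the parameter names on the slide (A11; A21 and A22; A33 and A34), but the way they are combined, the context labels, and every numeric value are illustrative assumptions.

```python
def sop_duration(phoneme, contexts, model):
    """Sum over terms of the product of each term's factor scales."""
    total = 0.0
    for term in model:
        product = 1.0
        for factor_name, table in term:
            key = phoneme if factor_name == "phoneme" else contexts[factor_name]
            product *= table[key]
        total += product
    return total


# Hypothetical parameter tables (milliseconds or unitless scales).
A11 = {"p": 40.0}
A21 = {"p": 20.0}
A22 = {"stressed": 1.5, "unstressed": 1.0}
A33 = {"word_initial": 10.0, "other": 0.0}
A34 = {"phrase_final": 2.0, "other": 1.0}

MODEL = [
    [("phoneme", A11)],
    [("phoneme", A21), ("c2", A22)],
    [("c3", A33), ("c4", A34)],
]

contexts = {"c2": "stressed", "c3": "word_initial", "c4": "phrase_final"}
print(sop_duration("p", contexts, MODEL))
```

Each table has only as many parameters as its factor has levels, which is why such a model can be estimated from far less data than a table with one cell per full context combination.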
10
Text-to-Speech (TTS) Synthesis
  • Generating a Waveform: Articulatory Synthesis
  • The vocal tract is divided into a large number of
    short tubes, as in the electrical transmission
    line analog (Lecture 11), which are then combined
    and resonant frequencies calculated.

(from Sinder, 1999; thesis work with Flanagan, Rutgers)
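
The simplest special case of the tube model is a single uniform tube, closed at the glottis and open at the lips, whose resonances fall at odd multiples of the quarter-wavelength frequency: F_n = (2n - 1) c / (4 L). The sketch below covers only this special case, not the full multi-tube calculation; the tract length (17.5 cm) and speed of sound (350 m/s) are assumed round values.

```python
def uniform_tube_resonances(length_cm=17.5, speed_cm_s=35000.0, count=3):
    """Resonances (Hz) of a lossless uniform tube closed at one end."""
    return [
        (2 * n - 1) * speed_cm_s / (4.0 * length_cm)
        for n in range(1, count + 1)
    ]


print(uniform_tube_resonances())  # [500.0, 1500.0, 2500.0]
```

These evenly spaced values correspond to a neutral, schwa-like vowel; changing the areas of individual tube sections is what moves the resonances toward other vowels.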
11
Text-to-Speech (TTS) Synthesis
  • Generating a Waveform: Articulatory Synthesis
  • Vocal-tract sources include noise and a buzz
    source for voiced sounds
  • Articulatory synthesis is important for validating
    the Motor Theory of Speech Perception
  • Demos from 1976 and circa 1992 (Haskins
    Labs)

12
Text-to-Speech (TTS) Synthesis
  • Generating a Waveform: Formant Synthesis
  • Instead of specifying mouth shapes, formant
    synthesis specifies frequencies and bandwidths of
    resonators, which are used to filter a source
    waveform.
  • Formant frequency analysis is difficult;
    bandwidth estimation is even more difficult. But
    the biggest perceptual problem in formant
    synthesis is not in the resonances, but in a
    buzzy quality most likely due to the glottal
    source model.
  • Formant synthesis can sound identical to a natural
    utterance if details of the glottal source and
    formants are well modeled.
    Demo: natural speech vs. synthetic speech (John Holmes, 1973)

13
Text-to-Speech (TTS) Synthesis
  • Formant TTS Synthesis Architecture
  • Formant-synthesis systems contain a number of
    sound sources, which are passed to filters
    connected either in parallel or in cascade
    (series). Each filter corresponds to one formant
    (resonance) or anti-resonance.

(From Yamaguchi, 1993)
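
A cascade formant synthesizer can be sketched with the standard second-order digital resonator, y[n] = A x[n] + B y[n-1] + C y[n-2], one per formant, driven by a source. The impulse-train source, the 10 kHz sampling rate, and the formant and bandwidth values are illustrative assumptions; a crude source like this one is exactly the kind that produces a buzzy quality.

```python
import math


def resonator_coeffs(freq_hz, bw_hz, fs):
    """Coefficients of a two-pole resonator with unity gain at 0 Hz."""
    period = 1.0 / fs
    c = -math.exp(-2.0 * math.pi * bw_hz * period)
    b = 2.0 * math.exp(-math.pi * bw_hz * period) * math.cos(2.0 * math.pi * freq_hz * period)
    a = 1.0 - b - c
    return a, b, c


def resonate(signal, freq_hz, bw_hz, fs):
    """Filter a signal through one formant resonator."""
    a, b, c = resonator_coeffs(freq_hz, bw_hz, fs)
    y1 = y2 = 0.0
    out = []
    for sample in signal:
        y = a * sample + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(f0_hz, duration_s, fs):
    """Very crude voiced source: one unit impulse per pitch period."""
    period = round(fs / f0_hz)
    return [1.0 if i % period == 0 else 0.0 for i in range(round(duration_s * fs))]


def cascade_synthesize(formants, f0_hz, duration_s, fs=10000):
    """Pass the source through each (frequency, bandwidth) resonator in series."""
    signal = impulse_train(f0_hz, duration_s, fs)
    for freq_hz, bw_hz in formants:
        signal = resonate(signal, freq_hz, bw_hz, fs)
    return signal


# Assumed formant and bandwidth values for an /aa/-like vowel.
wave = cascade_synthesize([(700, 130), (1220, 70), (2600, 160)], 100, 0.1)
print(len(wave), max(abs(s) for s in wave))
```

In a parallel arrangement the source would instead feed every resonator separately and the outputs would be summed, each with its own gain.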
14
Text-to-Speech (TTS) Synthesis
  • Formant systems: Rule-Based Synthesis
  • For synthesis of arbitrary text, formants and
    bandwidths for each phoneme are determined by
    analyzing speech of a single person.
  • The model of each phoneme may be a single set of
    formant frequencies and bandwidths for a
    canonical phoneme at a single point in time, or a
    trajectory of frequencies, bandwidths, and source
    models over time.
  • The formant frequencies for each phoneme are
    combined over time using a model of
    coarticulation, such as Klatt's modified locus
    theory.
  • Duration, pitch, and energy rules are applied.
  • Result: something like this

15
Text-to-Speech (TTS) Synthesis
  • Despite great success in copy synthesis,
    synthesis by rule using formants has severely
    degraded quality. It's not clear why: a problem
    with the glottal source? With coarticulation
    and formant transitions? With prosody?
  • Formant synthesis was the main TTS technique until
    the early or mid 1990s, when increasing memory
    size and CPU speed allowed concatenative
    synthesis to become a viable approach.
  • Concatenative synthesis uses recordings of small
    units of speech (typically a diphone unit: the
    region from the middle of one phoneme to the
    middle of the next phoneme), and glues these
    units together to form words and sentences.
  • Concatenative synthesis means that you don't have
    to worry about glottal source models or
    coarticulation, since the synthesis is just a
    concatenation of different waveforms containing
    natural glottal source and coarticulation.

16
Text-to-Speech (TTS) Synthesis
  • Concatenative Synthesis: Units
  • The basic unit for concatenative synthesis is the
    diphone.
  • More recent TTS research is on using larger
    units. Issues include:
    (a) how to decide what units will be used?
    (b) how to select the best unit from a very large
        database?
  • With increasing size and variety of units, there
    is an exponential growth in the database size.
    Yet, despite massive databases that may take
    months to record, coverage is nowhere near
    complete. There is a very large number of
    infrequent events in speech.

Example diphone sequence: sil-jh jh-aa aa-n n-sil
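
Turning a phoneme string into the diphone sequence that must be fetched from the database is a matter of padding with silence and pairing neighbors, as in this minimal sketch. The "sil" padding symbol follows the example sequence; the 40-phoneme inventory used for the size estimate is an assumption.

```python
def to_diphones(phonemes):
    """Pair each phoneme with its successor, padding with silence at both ends."""
    padded = ["sil"] + list(phonemes) + ["sil"]
    return [f"{left}-{right}" for left, right in zip(padded, padded[1:])]


def unit_inventory_size(n_phonemes, phonemes_per_unit):
    """Upper bound on distinct units spanning the given number of phonemes."""
    return n_phonemes ** phonemes_per_unit


print(to_diphones(["jh", "aa", "n"]))
print(unit_inventory_size(40, 2), unit_inventory_size(40, 3))
```

The second function shows why larger units strain coverage: each extra phoneme of span multiplies the upper bound on the inventory by the size of the phoneme set.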
17
Text-to-Speech (TTS) Synthesis
  • Concatenative Synthesis: Signal Processing
  • Waveform-based: Pitch-Synchronous Overlap-Add
    (PSOLA)
  • Perform pitch modification by changing the
    spacing of pitch-synchronous units
  • Or, use Line Spectral Frequencies (LSFs), which
    are conceptually the harmonics in a spectrogram
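
A simplified sketch of PSOLA-style pitch modification: Hann-windowed segments two periods long are cut around each pitch mark and overlap-added at a new spacing. It assumes a constant pitch period and known pitch marks, and it normalizes by the summed window weights; these are simplifications for illustration, not a description of any particular system.

```python
import math


def hann(length):
    """Symmetric Hann window."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (length - 1)) for i in range(length)]


def psola_pitch_shift(signal, pitch_marks, factor):
    """Scale F0 by factor while keeping the overall duration.

    pitch_marks: sample indices of pitch marks, assumed evenly spaced.
    """
    period = pitch_marks[1] - pitch_marks[0]
    new_period = max(1, round(period / factor))
    window = hann(2 * period + 1)
    out = [0.0] * len(signal)
    weight = [0.0] * len(signal)
    t = pitch_marks[0]
    while t <= pitch_marks[-1]:
        # Reuse the analysis segment whose pitch mark is closest to t.
        src = min(pitch_marks, key=lambda mark: abs(mark - t))
        for k in range(-period, period + 1):
            i, j = src + k, t + k
            if 0 <= i < len(signal) and 0 <= j < len(out):
                w = window[k + period]
                out[j] += signal[i] * w
                weight[j] += w
        t += new_period
    # Normalize so that extra overlap does not change the amplitude.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, weight)]


# Assumed test signal: a sine with a 100-sample period and marks every 100 samples.
sine = [math.sin(2.0 * math.pi * i / 100.0) for i in range(1000)]
marks = list(range(100, 901, 100))
unchanged = psola_pitch_shift(sine, marks, 1.0)
print(max(abs(unchanged[i] - sine[i]) for i in range(200, 801)))
```

With a factor of 1.0 the overlapping windows sum to one and the interior of the signal is reproduced; with a factor of 2.0 the same segments are placed twice as densely, doubling the pitch.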

18
Text-to-Speech (TTS) Synthesis
  • DEMOS
  • Klatt's DECtalk
  • sample 1
  • AT&T
  • sample 1 sample 2
  • Bell Labs
  • sample 1 sample 2
  • OGI
  • sample 1 sample 2