ECE 16:332:527 Digital Speech Processing Lecture 18 - PowerPoint PPT Presentation

Slides: 101
Provided by: larryr8
Transcript and Presenter's Notes

Title: ECE 16:332:527 Digital Speech Processing Lecture 18


1
ECE 16:332:527 Digital Speech Processing, Lecture 18
  • Text-to-Speech (TTS) Synthesis Systems

2
Text-to-Speech (TTS) Synthesis
  • GOAL: convert arbitrary textual messages to
    intelligible and natural sounding synthetic
    speech so as to transmit information from a
    machine to a person

3
Text Analysis Components
  • input: Raw English Input Text
  • Basic Text Processing: Document Structure Detection, Text
    Normalization, Linguistic Analysis
  • Dictionary
  • intermediate result: Tagged Text
  • Phonetic Analysis: Homograph Disambiguation, Grapheme-to-Phoneme
    Conversion
  • intermediate result: Tagged Phones
  • Prosodic Analysis: Pitch and Duration Rules, Stress and Pause
    Assignment
  • output: Synthesis Controls (Sequence of Sounds, Durations, Pitch)
4
Document Structure
  • end of sentence marked by . ? ! is not infallible
  • "The car is 72.5 in. long"
  • e-mail and web pages need special processing
  • "Larry: Sure.  I'll try to do it before Thursday
    :-) Ed"
  • multiple languages
  • insertion of foreign words, unusual accent and
    diacritical marks, etc.

5
Text Normalization
  • abbreviations and acronyms
  • "Dr." is pronounced either as "Doctor" or "Drive"
    depending on context ("Dr. Smith lives on Smith
    Dr.")
  • "St." is pronounced either as "Street" or "Saint"
    depending on context ("I live on Bourbon St. in
    St. Louis")
  • "DC" is either "direct current" or "District of
    Columbia"
  • "MIT" is pronounced as either "M I T" or
    "Massachusetts Institute of Technology" but never
    as "mitt"
  • "DEC" is pronounced as either "deck" or "Digital
    Equipment Company" but never as "D E C"
  • numbers
  • "370-1111" can be either "three seven oh" or
    "three seventy-model 1111" for the IBM 370
    computer
  • "1920" is either "nineteen-twenty" or "one
    thousand, nine hundred, twenty"
  • dates, times, currency, account numbers, ordinals,
    cardinals, math
  • "Feb. 15, 1983" needs to convert to "February
    fifteenth, nineteen eighty-three"
  • "$10.50" is pronounced as "ten dollars and fifty
    cents"
  • "part #10-50" needs to be pronounced as "part
    number 10 dash fifty" rather than "part pound
    sign ten to fifty"
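
The context-dependent expansions on this slide can be sketched with a few regular expressions. This is only an illustration under stated assumptions: the function names (`normalize`, `number_words`) are hypothetical, the "capitalized word follows" test is a crude stand-in for real context analysis, and only amounts below $100 are handled.

```python
import re

ONES = ("zero one two three four five six seven eight nine ten eleven twelve "
        "thirteen fourteen fifteen sixteen seventeen eighteen nineteen").split()
TENS = "_ _ twenty thirty forty fifty sixty seventy eighty ninety".split()


def number_words(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")


def _dollars(match):
    """Turn a matched $D.CC amount into words."""
    dollars, cents = int(match.group(1)), int(match.group(2))
    return f"{number_words(dollars)} dollars and {number_words(cents)} cents"


def normalize(text):
    """Expand Dr./St. and small dollar amounts (toy heuristic rules)."""
    # Assumed heuristic: before a capitalized word the abbreviation is a title.
    text = re.sub(r"\bDr\.(?=\s+[A-Z])", "Doctor", text)
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Anything left over is read as a street suffix.
    text = re.sub(r"\bDr\.", "Drive", text)
    text = re.sub(r"\bSt\.", "Street", text)
    # Currency: "$10.50" -> "ten dollars and fifty cents".
    return re.sub(r"\$(\d{1,2})\.(\d{2})\b", _dollars, text)


print(normalize("Dr. Smith lives on Smith Dr."))
print(normalize("I live on Bourbon St. in St. Louis"))
print(normalize("It costs $10.50"))
```

The heuristic fails as soon as a street suffix ends one sentence and a capitalized word starts the next, which is exactly why document structure detection has to run before normalization.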

6
Text Normalization
  • proper names
  • Rudy Prpch is pronounced as Rudy Perpich
  • Sorin Ducan is pronounced as Sorin Duchan
  • part of speech
  • "read" is pronounced as "reed" or "red"
  • "record" is pronounced as "rec-ard" or "ri-cord"
  • word decomposition
  • need to decompose complex words into base forms
    (morphemes) to determine pronunciation
    (indivisibility needs to be decomposed into
    in-di-visible-ity to determine pronunciation)

7
Text Normalization
  • proper handling of special symbols in text
  • punctuation, e.g., . , - -- _ ( )
    @ ! < > ? / \
  • string resolution
  • "10:20" can be pronounced as either "twenty after
    10" (as a time) or "ten to twenty" (as a sequence)

8
Linguistic Analysis
  • part-of-speech (POS)
  • word sense
  • phrases
  • anaphora
  • emphasis
  • style
  • a conventional parser could be used, but
    typically a simple shallow analysis is done for
    speed (parsers are not real-time!)

9
Homograph Disambiguation
  • "an absent boy" versus "do you choose to absent
    yourself?"
  • "they will abuse him" versus "they won't take
    abuse"
  • "an overnight bag" versus "are you staying
    overnight?"
  • "he is a learned man" versus "he learned to play
    piano"
  • "El Camino Real road" versus "real world"

10
Letter-to-Sound (LTS) Conversion
  • CART (Classification and Regression Tree)
    analysis
  • conventional dictionary search with
    letter-to-sound rules

[Flowchart: Input Word → Whole Word Dictionary Probe → Affix Stripping →
Root Dictionary Probe → Letter-to-Sound and Stress Rules → Affix
Reattachment → phonemes, stress, parts-of-speech; each probe has yes/no
branches]
This is only ONE way of doing LTS. FSMs are
another.
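
A minimal sketch of the dictionary-plus-rules path in the flowchart, assuming a toy lexicon, three toy suffixes, and a one-letter-one-phone fallback table. All of these tables and the function names are hypothetical; a real system uses context-sensitive rules and also assigns stress and part of speech.

```python
# Toy data: every entry here is an assumption made for illustration.
LEXICON = {"cat": ["K", "AE", "T"], "walk": ["W", "AO", "K"]}
SUFFIXES = {"s": ["S"], "ed": ["T"], "ing": ["IH", "NG"]}
LETTER_TO_PHONE = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    "AE B K D EH F G HH IH JH K L M N AA P K R S T AH V W K Y Z".split()))


def letter_to_sound(word):
    """Fallback rules: map each letter to one phone, ignoring context."""
    return [LETTER_TO_PHONE[ch] for ch in word if ch in LETTER_TO_PHONE]


def pronounce(word):
    """Whole-word probe, then affix stripping, root probe, rules, reattachment."""
    word = word.lower()
    if word in LEXICON:                                   # whole-word dictionary probe
        return list(LEXICON[word])
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root:                # affix stripping
            # Root dictionary probe, falling back to letter-to-sound rules.
            phones = LEXICON.get(root) or letter_to_sound(root)
            return phones + SUFFIXES[suffix]              # affix reattachment
    return letter_to_sound(word)                          # no affix found: rules only


print(pronounce("walking"))   # root found in the lexicon after stripping "ing"
print(pronounce("zob"))       # unknown word, handled by the fallback rules
```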
11
Prosody
  • pauses
  • to indicate phrases and to avoid running out of
    breath
  • pitch
  • fundamental frequency (F0) as a function of time
  • rate/relative duration
  • phoneme durations, timing, and rhythm
  • loudness
  • relative amplitude/volume

12
Symbolic and Phonetic Prosody
  • input: parsed text and phone string
  • Symbolic Prosody: Pauses, Prosodic Phrases, Accent, Tone, Tune
  • Speaking Style
  • Prosody Attributes: Pitch Range, Prominence, Declination
  • F0 Contour Generation
  • output: F0 Contour
13
ToBI tones
  • pitch accent tones
  • intermediate phrasal tones (L-, H-)
  • boundary tones (L-L%, L-H%, H-H%, H-L%, %H)

14
Marianna Made the Marmalade
15
F0 Contour Generation
  • anchor points from accent, tone, pitch range,
    prominence and declination
  • F0 contour is obtained by interpolation
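
As a sketch of the interpolation step, the function below draws a piecewise-linear F0 contour through a set of anchor points. Linear interpolation, the function name `f0_contour`, and the anchor values are assumptions for illustration; the slide does not say which interpolation the system uses.

```python
def f0_contour(anchors, frame_times):
    """Piecewise-linear F0 through (time_s, f0_hz) anchors with distinct times."""
    anchors = sorted(anchors)
    contour = []
    for t in frame_times:
        if t <= anchors[0][0]:            # before the first anchor: hold its value
            contour.append(anchors[0][1])
        elif t >= anchors[-1][0]:         # after the last anchor: hold its value
            contour.append(anchors[-1][1])
        else:
            for (t0, f0), (t1, f1) in zip(anchors, anchors[1:]):
                if t0 <= t <= t1:
                    w = (t - t0) / (t1 - t0)
                    contour.append(f0 + w * (f1 - f0))
                    break
    return contour


# Hypothetical anchors: phrase start, an accent peak, and a low phrase ending.
anchors = [(0.0, 120.0), (0.5, 180.0), (1.0, 100.0)]
print(f0_contour(anchors, [0.0, 0.25, 0.5, 0.75, 1.0]))
```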

16
Full TTS System
[Block diagram; block labels: Text, Text/Language, Text Analysis, Parsing,
Letter-to-Sound Rules, Prosody, Synthesis Backend, Speech, Face Sync]
Example input: "Dr. Smith lives at 23 Lakeshore Dr."
Synthesis model impacts TTS, as well as unit
representation.
  • choice of units: words, phones, diphones, dyads, syllables
  • choice of parameters: LPC, formants, waveform templates,
    articulatory parameters, sinusoidal parameters
  • method of computation: rules, concatenation
Diagram annotations
  • text and name dictionaries
  • timing and duration: rate, stress (position, context)
  • intonation/pitch (type of phrase: list, statement, ??)
  • phonetic substitution (vowel reduction, flapping rules)
  • loudness/amplitude (phrasal amp., local reduction)
  • numerical expansion (dates, times, ...)
  • abbreviations, acronyms
  • proper name id
Parsing
  • sentences, phrase and breath groups
  • semantic and syntactic accent, stress of compounds
  • intonation types, parts of speech
17
Word Concatenation Synthesis
  • words in sentences are much shorter than in
    isolation (up to 50% shorter) (see next page)
  • words cannot preserve sentence-level stress,
    rhythm or intonation patterns
  • too many words to store (1.7 million surnames),
    extended words using prefixes and suffixes

18
Word Concatenation
19
Proper Name Statistics
20
Statistics of Proper Name Coverage
name pronunciation based on etymology
21
Concatenative Word Synthesis
several examples of concatenated words and full
sentences--SBR
word-based synthesis does not work!
22
Speech Synthesis Methods
  • 1939: the VODER (Voice Operated DEmonstratoR), Homer
    Dudley
  • based on a simple model of speech sound
    production
  • select voicing source (with foot pedal control of
    pitch) or noise source
  • ten filters shaped the source to produce vocal or
    noise-excited sounds, controlled by finger motions
  • separate keys for stop sounds
  • wrist bar control of signal energy

23
The VODER
24
Articulatory Synthesis
  • in theory can create more natural and more
    realistic motions of the articulators (rather
    than formant parameters), thereby leading to more
    natural sounding synthetic speech
  • utilize physical constraints of articulator
    movements
  • use X-ray data to characterize individual speech
    sounds
  • model how articulatory parameters move smoothly
    between sounds
  • direct method: solve wave equation for sound
    pressure at lips
  • indirect method: convert to formants or LPC
    parameters for final synthesis in order to
    utilize existing synthesizers
  • use highly constrained motions of articulatory
    parameters

25
Articulatory Model
  • Articulatory Parameters of Model
  • lip opening: W
  • lip protrusion length: L
  • tongue body height and length: Y, X
  • velar closure: K
  • tongue tip height and length: B, R
  • jaw raising (dependent parameter)
  • velum opening: N

26
Articulatory Models
27
Articulatory Synthesis Using Formant Synthesizer
Backend
Cecil Coker--teaching computers to talk
Articulatory Synthesis of Speech: Cecil Coker
28
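A DIRECT Approach: Analysis of the Vocal Tract in
the Frequency Domain
  • the vocal tract is described by a chain (ABCD) matrix relating
    Pin, Uin to Pout, Uout
  • the VT transfer function is written in terms of the chain-matrix
    elements and the impedance at the lips
  • the matrix elements of a lossless section (length l and
    cross-section A = const.) depend on c (speed of sound) and
    ρ (density of air)
[equations not preserved in this transcript]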
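
Since the slide's equations did not survive, the sketch below uses the standard textbook chain matrix of a lossless uniform tube, under one assumed sign convention: [Pin, Uin] = K [Pout, Uout] with K = [[cos kl, j Z0 sin kl], [j sin kl / Z0, cos kl]], k = 2πf/c and Z0 = ρc/A. The slide may use the inverse convention. The constants, section sizes, and function names are illustrative assumptions.

```python
import math

C_SOUND = 350.0   # speed of sound in m/s (assumed value)
RHO = 1.2         # density of air in kg/m^3 (assumed value)


def section_matrix(freq_hz, length_m, area_m2):
    """Chain matrix of one lossless uniform tube section."""
    k = 2.0 * math.pi * freq_hz / C_SOUND
    z0 = RHO * C_SOUND / area_m2
    cos_kl, sin_kl = math.cos(k * length_m), math.sin(k * length_m)
    return ((cos_kl, 1j * z0 * sin_kl), (1j * sin_kl / z0, cos_kl))


def matmul2(m, n):
    """Product of two 2x2 matrices stored as nested tuples."""
    return ((m[0][0] * n[0][0] + m[0][1] * n[1][0], m[0][0] * n[0][1] + m[0][1] * n[1][1]),
            (m[1][0] * n[0][0] + m[1][1] * n[1][0], m[1][0] * n[0][1] + m[1][1] * n[1][1]))


def tract_matrix(freq_hz, sections):
    """Multiply section matrices in order from the glottis to the lips."""
    total = ((1, 0), (0, 1))
    for length_m, area_m2 in sections:
        total = matmul2(total, section_matrix(freq_hz, length_m, area_m2))
    return total


def volume_velocity_gain(freq_hz, sections, z_lips=0j):
    """Uout/Uin = 1 / (C * Z_lips + D); Z_lips = 0 models an ideal open end."""
    (_, _), (c, d) = tract_matrix(freq_hz, sections)
    return 1.0 / (c * z_lips + d)


# A 17.5 cm uniform tube split into five sections: resonances near 500, 1500 Hz.
tube = [(0.035, 5e-4)] * 5
for f in (250.0, 1000.0, 1250.0):
    print(f, abs(volume_velocity_gain(f, tube)))
```

With an ideal open end the gain is 1/cos(kL), so it grows without bound at the quarter-wavelength resonances; a realistic lip radiation impedance keeps the peaks finite.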
29
Vocal Tract Analysis
30
Articulatory Synthesis by Copying Measured Vocal
Tract Data
  • fully automatic closed-loop optimization
  • initialized from articulatory codebooks, neural
    nets
  • Schroeter and Sondhi, 1987
  • One example original re-synthesis

31
Articulatory Synthesis Issues
  • requires highly accurate models of glottis and
    vocal tract
  • requires rules for dynamics of the articulators

32
Vocal Tract to LPC
33
LPC Implementation
34
Serial Synthesis from LPC
Note: H(z) has unity gain at DC (ω = 0, z = 1)
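
A sketch of how the unity-DC-gain property can be obtained in a serial synthesizer, assuming the usual second-order digital resonator with pole radius r = exp(-πB/fs) and pole angle θ = 2πF/fs. Each section is scaled by its own denominator evaluated at z = 1. The formant values and function names are illustrative assumptions, not taken from the slides.

```python
import math


def resonator_coeffs(formant_hz, bandwidth_hz, fs_hz):
    """Return (gain, a1, a2) for H(z) = gain / (1 + a1 z^-1 + a2 z^-2)."""
    r = math.exp(-math.pi * bandwidth_hz / fs_hz)
    theta = 2.0 * math.pi * formant_hz / fs_hz
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    gain = 1.0 + a1 + a2          # denominator at z = 1, so H(1) = 1
    return gain, a1, a2


def cascade(signal, formants, fs_hz):
    """Pass the signal through one resonator per (formant_hz, bandwidth_hz) pair."""
    out = list(signal)
    for formant_hz, bandwidth_hz in formants:
        gain, a1, a2 = resonator_coeffs(formant_hz, bandwidth_hz, fs_hz)
        y1 = y2 = 0.0
        filtered = []
        for x in out:
            y = gain * x - a1 * y1 - a2 * y2
            filtered.append(y)
            y2, y1 = y1, y
        out = filtered
    return out


# A constant (DC) input should settle to the same constant at the output.
formants = [(500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)]
print(cascade([1.0] * 3000, formants, 10000.0)[-1])
```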
35
Source-Filter Synthesis Models
  • cascade/serial (formant) synthesis model

36
Serial/Formant Synthesis Model
37
Serial/Formant Synthesis Model
  • flaws in the serial/formant synthesis model
  • cannot handle voiced fricatives
  • no zeros for nasal sounds
  • no precise control for stop consonants
  • pitch pulse shape fixed, independent of pitch
  • spectral compensation is inadequate

To Be -Bell Labs
Daisy-Daisy with music
SPASS synthesis
JSRU Synthesis
OVE 1--Fant
We Wish You
38
Parallel Synthesis Model
A serial synthesizer is a good approach for open,
non-nasal vocal tracts (vowels, liquids). For
obstruents and nasals, we need to control the
amplitudes of each resonance, and to introduce
zeros in addition to the poles.
A parallel synthesizer provides more flexibility in
matching spectrum levels at formant frequencies
(via gain controls); however, zeros are introduced
into the spectrum.
39
Parallel Synthesis
  • issues
  • need individual resonance amplitudes
    (A1, ..., A4); if resonances are close, this is a
    messy calculation
  • phasing of resonances neglected (the Bk z^-1
    terms)
  • synthetic speech has both resonances and zeros
    (at frequencies between the resonances) that may
    be perceptible
  • better reproduction of complex consonants

Parallel synthesis from BYU
Parallel synthesis-Holmes
40
More Advanced Synthesizer
41
More Advanced Synthesizer
  • synthesizer features
  • glottal pulse modeled directly using several
    tunable parameters
  • breathiness component added to glottal source
  • aspiration source included in voiced sound loop
    to enable voiced fricative production
  • pole-zero model for voiced speech
  • radiation modeled separately for both voiced and
    unvoiced speech

42
More Versatile Synthesizer (Serial-Parallel)
43
Voiced Fricative Synthesis
Klatt Talk: MIT, 1986
44
Continuing Evolution (1959-1987)
  • Haskins, 1959
  • KTH Stockholm, 1962
  • Bell Labs, 1973
  • MIT, 1976
  • MIT-talk, 1979
  • Speak N spell, 1980
  • BELL Labs, 1985
  • Dec talk, 1987

45
Text-to-Speech Synthesis (TTS) Evolution
[Timeline figure; year axis from 1962 to 1997]
  • Formant Synthesis: poor intelligibility, poor naturalness (Bell Labs,
    Joint Speech Research Unit, MIT (DEC-Talk), Haskins Lab)
  • LPC-Based Diphone/Dyad Synthesis: good intelligibility, poor
    naturalness (Bell Labs, CNET, Bellcore, Berkeley Speech Technology)
  • Unit Selection Synthesis: good intelligibility, customer-quality
    naturalness (limited context) (ATR in Japan, CSTR in Scotland, BT in
    England, AT&T Labs (1998), L&H in Belgium)
46
Speech Synthesis: the 90s
  • what changed?
  • TTS was highly intelligible but extremely
    unnatural sounding
  • a decade of work had not changed the naturalness
    substantially
  • computation and memory grew with Moore's law,
    enabling highly complex concatenative systems to
    be created, implemented and perfected
  • concatenative systems showed themselves capable
    of producing (in some cases) extremely natural
    sounding synthetic speech

47
Concatenation TTS Systems
  • key idea: use segments of recorded speech for
    synthesis
  • data driven approach → more segments give better
    synthesis → using an infinite number of segments
    leads to perfect synthesis
  • key issues
  • what units to use
  • how to select units from natural speech
  • how to label and extract consistent units from a
    large database
  • what signal representation should be used for
    spectrally smoothing units (at junctures) and for
    prosody modification (pitch, duration, amplitude)

48
Concatenation Units
  • choice of units
  • Words: there are an infinite number of them
  • Syllables: there are about 10K in English
  • Phonemes: there are about 45 in English
  • Demi-syllables: there are about 2500 in English
  • Diphones: there are about 1500-2500 in English

49
Choice of Units
Unit               Number of units (English)
Allophone          60-80
Diphone            < 40^2 - 65^2
Triphone           < 40^3 - 65^3
Demisyllable       2K
Syllable           11K
VCV, 2-syllable    < (11K)^2
Word               100K-1.5M
Phrase, Sentence   ∞
From the top of the table to the bottom: rules and necessary unit
modifications go from many to few, unit length from short to long, and
quality from low to high.
50
Corpus Coverage by Unit Type
NOTE: depends on domain (here: SURNAMES)
[Figure: x-axis: Top N Surnames (rank), ticks at 1, 10K, 20K, 30K, 40K,
50K, 2 M; y-axis: Units (1 token/unit); slope annotations: 1.0, .83, .23,
.15, .11, .02, .003]
51
Concatenation Units
  • Words
  • no complete coverage for broad domains → words
    have to be supplemented with smaller units
  • limited ability to modify pitch, amplitude and
    duration without losing naturalness and
    intelligibility
  • need huge database to extract multiple versions
    of each word
  • Sub-Words
  • hard to isolate in context due to co-articulation
  • need allophonic variations to characterize units
    in all contexts
  • puts large burden on signal processing to smooth
    at unit join points

52
Concatenation Unit Representation
  • LPC
  • simple, easy to concatenate units, efficient for
    modification of pitch (since it is inherently
    separated from vocal tract spectra)
  • doesn't work well for nasals (lack of nasal
    zeros)
  • glottal excitation not correct (assumed pitch
    pulses)
  • doesn't work well for mixed excitation (basic LPC
    assumptions)
  • buzzy
  • TD-PSOLA: Time Domain, Pitch Synchronous Overlap
    Add Synthesis
  • efficient prosody modification (pitch
    synchronous)
  • no smoothing at join points

53
Speech Waveform Models
  • time-domain source-filter models (LPC)
  • filter represents the vocal tract
  • synthetic glottal pulse excites the filter
  • filter produces synthetic speech

Filter
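
The source-filter idea can be sketched in a few lines: a synthetic pulse train excites an all-pole filter, and the filter output is the synthetic speech. The sketch assumes the predictor form s[n] = G e[n] + Σ α_k s[n-k]; the coefficient values are made up for illustration, not taken from real speech.

```python
def impulse_train(num_samples, period):
    """Idealized glottal excitation: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(num_samples)]


def lpc_synthesize(excitation, predictor, gain=1.0):
    """All-pole synthesis: s[n] = gain * e[n] + sum_k predictor[k-1] * s[n-k]."""
    speech = []
    for n, e in enumerate(excitation):
        s = gain * e
        for k, alpha in enumerate(predictor, start=1):
            if n - k >= 0:
                s += alpha * speech[n - k]
        speech.append(s)
    return speech


# Hypothetical one-pole "vocal tract" excited with a pitch period of 4 samples.
print(lpc_synthesize(impulse_train(6, 4), [0.5]))
```

Changing `period` changes the pitch without touching the filter, which is the separation of pitch and vocal tract spectrum that makes LPC convenient for prosody modification.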
54
Speech Waveform Models
  • time-domain modification (e.g., PSOLA)
  • window and shift pitch pulses
  • pitch marks critical

[Figure: waveform (amplitude versus time) with zeros added to extend the
period]
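
A rough sketch of "window and shift pitch pulses", using one common formulation: two-period Hann windows centered on the pitch marks, re-spaced at the new period and overlap-added. It assumes a constant pitch period with marks on a regular grid and a signal several periods long; real TD-PSOLA works from detected pitch marks. The function names and the weight normalization are assumptions for illustration.

```python
import math


def hann(offset, half_width):
    """Hann weight for an offset in -half_width..half_width around a mark."""
    return 0.5 * (1.0 + math.cos(math.pi * offset / half_width))


def td_psola(signal, period, pitch_factor):
    """Change pitch by pitch_factor (> 1 raises it), keeping the duration."""
    analysis_marks = range(period, len(signal) - period, period)
    new_period = max(1, round(period / pitch_factor))
    output = [0.0] * len(signal)
    weight = [0.0] * len(signal)
    for mark in range(period, len(signal) - period, new_period):
        # Reuse the windowed pitch pulse whose analysis mark is closest in time.
        source = min(analysis_marks, key=lambda m: abs(m - mark))
        for offset in range(-period, period + 1):
            w = hann(offset, period)
            output[mark + offset] += w * signal[source + offset]
            weight[mark + offset] += w
    # Normalize by the summed window weights so the amplitude stays comparable.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(output, weight)]


period = 50
signal = [math.sin(2.0 * math.pi * n / period) for n in range(500)]
raised = td_psola(signal, period, 2.0)   # pulses now repeat every 25 samples
print(len(raised))
```

With `pitch_factor = 1.0` the windows sum to one and the interior of the signal is reproduced unchanged, which is a quick sanity check on the pitch marks.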
55
TD-PSOLA Synthesis
time domain modifications lead to spectral
distortions
56
Concatenation Mismatches of TD-PSOLA
  • phase mismatch: different relative position of
    OLA windows within left and right segments (LS
    and RS)
  • pitch mismatch: different F0 in LS and RS
  • timbre mismatch: different spectral envelopes in
    LS and RS
  • lacking smoothness across concatenation points:
    need to painfully optimize the segment database
    to get best segmental quality

57
Temporal Envelope Mismatch in TD-PSOLA
  • speech waveform changes abruptly from the left
    segment to the right segment
  • concatenation point is detectable
  • even when applied on LPC-residual and with
    smoothing spectral envelopes, audible glitches
    remain

58
MBROLA (T. Dutoit)
  • time-domain synthesis that combines the
    advantages of PSOLA (low computational cost, good
    prosody modification) with an off-line hybrid
    time/frequency algorithm for smoothing
    transitions.
  • speech in database has constant pitch (100 Hz)
  • amplitude and some general spectral
    smoothing >>>> http://tcts.fpms.ac.be/synthesis/

59
Speech Waveform Models
  • sinusoidal models
  • model signal as a sequence of time-varying
    sinusoids
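
A minimal sketch of the sinusoidal model for a single frame, assuming the amplitudes, frequencies, and phases are held constant within the frame. A full system interpolates these parameters from frame to frame to get time-varying sinusoids. The helper names and the example values are hypothetical.

```python
import math


def sinusoidal_frame(partials, num_samples, fs_hz):
    """Sum of sinusoids; partials is a list of (amplitude, freq_hz, phase_rad)."""
    return [sum(a * math.cos(2.0 * math.pi * f * n / fs_hz + phi)
                for a, f, phi in partials)
            for n in range(num_samples)]


def harmonic_partials(f0_hz, amplitudes):
    """Zero-phase partials at integer multiples of the fundamental."""
    return [(a, f0_hz * (k + 1), 0.0) for k, a in enumerate(amplitudes)]


# Three harmonics of a 100 Hz fundamental at 8 kHz: the frame repeats every 80 samples.
frame = sinusoidal_frame(harmonic_partials(100.0, [1.0, 0.5, 0.25]), 160, 8000.0)
print(frame[0], frame[80])
```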

60
HNM (Harmonic Noise Model) (Y. Stylianou)
  • harmonic (low band) and noise (high band)
    classification of speech
  • harmonic part modeled by a comb of sinusoids
  • noise part modeled by a parametric time-domain
    envelope and spectrally shaped by an AR-model
  • analysis/synthesis done pitch-synchronously
    without explicit pitch markers

61
Concatenation Unit Representation
  • Hybrid Time-Frequency Representation
  • easy to modify prosody
  • segments easily smoothed at concatenation
    points
  • can easily modify spectral envelope

62
Hybrid Representation
63
Practical Implementation of Hybrid Synthesis
[Block diagram; signal labels: Sr(ω), ai, fi, ep(n), sp(n), hp(n,m),
s(n), hr(n,m), er(n), sr(n); two further symbols are not recoverable from
this transcript]
64
Speech Waveform Modification
  • cannot cover all possible combinations of feature
    variables in a database
  • waveform modification is of vital importance for
    concatenative synthesis based on diphone units
  • some attributes can be modified easily with
    signal processing techniques

65
Powerful Voice Alteration
original modified
  • prosody modification capabilities
  • voice alteration female to child
  • voice alteration child to adult male

HNM
PSOLA
66
Hybrid Methods, Final Thoughts
  • 10-20 times more complex than TD-PSOLA
  • complexity issue can be addressed by using
    computationally expensive hybrid methods only
    offline during database generation while applying
    low-complexity time-domain-only synthesis online
    in the TTS system
  • e.g., MBROLA
  • for highest possible quality prosodic
    modifications, hybrid methods need to be applied
    online
  • e.g., HNM

67
Block Diagram of a Concatenative TTS System
Dictionary and Rules
Store of Sound Units
Text Analysis,Letter-to-Sound,Prosody
Speech Waveform Modification and Synthesis
Assemble Units that Match Input Targets
Speech
Message Text
Alphabetic Characters
Phonetic Symbols, Prosody Targets
Female Male
68
Concatenative Synthesis Unit Definition and
Extraction
/eh-s/ diphone
  • Waveform
  • Spectrogram
  • Symbolic Representation
  • word labels
  • tone labels
  • syllable and stress labels
  • phone labels
  • break indices

/s/ phone
69
Issues with Unit Type
  • sub-word units
  • usually cut from over-articulated sentences
    read in an almost monotone style
  • neglects effects of neighboring units
    (coarticulation)
  • must include several allophonic variations
  • neglects variations due to speaking style (news,
    announcements) and pitch
  • puts a large burden on signal processing
    (smoothing, prosody modifications)
  • small (30 minutes / 650 sentences) database,
    easy to label

70
Procedure for Concatenative Synthesis
  • off-line inventory preparation
  • record speech corpus and process with coding
    method of choice
  • determine location of speech units and store
    units in inventory
  • on-line synthesis from text
  • normalize input text (expand abbreviations, etc.)
  • letter-to-sound (pronunciation dictionary and
    rules)
  • prosody (melody/pitch, durations, stress
    patterns/amplitudes, ...)
  • select appropriate sequence of units from
    inventory
  • modify units (smooth at boundaries, match desired
    prosody)
  • synthesize and output speech signal

71
Unit Selection Synthesis
  • need to optimally match units at boundaries,
    e.g., fundamental frequency (pitch), and spectrum
  • need to automatically and efficiently select
    optimal sequence of units from database
  • issues in Unit Selection Synthesis
  • several examples in each unit category (from 10
    to 10^6)
  • waveform modification used sparingly (leads to
    perceived distortions)
  • high intelligibility must be maintained
  • customer quality attained with reasonable
    training set (1-10 hours)
  • natural quality attained with large training
    set (10s of hours)
  • unit selection algorithm must run in a fraction
    of real time on a state-of-the-art processor

72
Why is Online Unit Selection Necessary?
Single Feature Distribution (within same
category, here /ow/), e.g., pitch, duration,
emphasis, spectral tilt, ...
Assumption: no labeling and feature extraction
errors!
  • Impossible to capture broad range of naturally
    occurring features with just one or two examples

73
Unit Selection
  • given target features, automatically find
    sequence of units in the database that most
    closely match these features

[Figure: candidate units in the database (hh, eh, l, ow, ...) compared
with the target features using a trained perceptual distance metric]
74
Unit Selection
  • additionally, find sequence of units that best
    join each other
  • find optimal path using Viterbi Search (Dynamic
    Programming)

[Figure: lattice of candidate units (hh, eh, l, ow, ...) with the optimal
path through it]
75
(On-line) Unit Selection: Viterbi Search
[Figure: Viterbi lattice with candidate columns -a (1 to 3), a-u (1 to 4),
u- (1 to 3)]
  • transitional (concatenation) costs are based on
    acoustic distances
  • node (target) costs are based on linguistic
    identity of unit
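
The search can be sketched as a standard dynamic-programming (Viterbi) pass over the candidate lattice. The unit fields, the F0-based cost functions, and the tiny database below are assumptions made for illustration; real systems use trained, multi-feature costs. Units that were adjacent in the recorded database are given zero concatenation cost.

```python
from typing import NamedTuple


class Unit(NamedTuple):
    phone: str
    f0_start: float
    f0_end: float
    db_index: int          # position of the unit in the recorded database


def target_cost(target, unit):
    """Toy node cost: mismatch between desired F0 and the unit's mean F0."""
    return abs(target[1] - (unit.f0_start + unit.f0_end) / 2.0)


def join_cost(left, right):
    """Toy transition cost: F0 jump at the join; zero for adjacent database units."""
    if right.db_index == left.db_index + 1:
        return 0.0
    return abs(left.f0_end - right.f0_start)


def select_units(targets, candidates):
    """Return (cheapest unit sequence, total cost) over the candidate lattice."""
    # best[i][j] = (cost of the cheapest path ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for unit in candidates[i]:
            cost, back = min((best[i - 1][j][0] + join_cost(prev, unit), j)
                             for j, prev in enumerate(candidates[i - 1]))
            row.append((cost + target_cost(targets[i], unit), back))
        best.append(row)
    j = min(range(len(best[-1])), key=lambda idx: best[-1][idx][0])
    total = best[-1][j][0]
    path = []
    for i in range(len(targets) - 1, -1, -1):   # follow backpointers
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path)), total


targets = [("hh", 110.0), ("eh", 120.0), ("l", 115.0)]
candidates = [
    [Unit("hh", 100, 105, 0), Unit("hh", 118, 122, 10)],
    [Unit("eh", 106, 112, 1), Unit("eh", 119, 121, 20)],
    [Unit("l", 113, 116, 2), Unit("l", 121, 118, 30)],
]
path, total = select_units(targets, candidates)
print([u.db_index for u in path], total)
```

In this toy lattice the contiguous run (database indices 0, 1, 2) has free joins but poorer target matches, so the search has to trade the two kinds of cost against each other.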

76
Unit Selection Measures
  • USD: Unit Segmental Distortion → differences
    between desired spectral pattern of target and
    that of candidate unit, throughout whole unit
  • UCD: Unit Concatenative Distortion → spectral
    discontinuity across boundaries of the
    concatenated units
  • Example: source context: want /w ah n t/
  • target context: cart /k ah r t/
  • USD: (ah→n versus ah→r)

77
Concatenative Synthesis
  • concatenate recording chunks (senone, half-phone,
    diphone, phone, demisyllable, syllable, word,
    phrase, sentence)
  • adjacent units have zero concatenation cost.

[Figure: selected units (?j, ?j+1) are linked by the transition cost
(UCD); each selected unit is compared with its target unit tj through the
unit cost (USD)]
78
Concatenation Cost
79
Target Cost
80
(Off-line) Weight Training
81
Acoustic Target Cost
82
Modern TTS Systems (Natural Voices from AT&T)
  • Soliloquy from Hamlet
  • Gettysburg Address
  • Bob Story
  • German female
  • UK British female
  • Spanish female
  • Korean female
  • French male

83
Modern COMMERCIAL Systems
  • Lucent
  • AcuVoice
  • Festival
  • L&H RealSpeak
  • SpeechWorks female
  • SpeechWorks male
  • Cselt (Actor) - Italian

84
TTS Future Needs
  • TTS needs to know how things should be said
  • context-sensitive pronunciations of words
  • prosody prediction → emphasis
  • I gave the book to John [emphasis on "John"] (not someone else)
  • I gave the book to John [emphasis on "book"] (not the photos)
  • I gave the book to John [emphasis on "I"] (I did it, not someone
    else)
  • unit selection process → target cost captures
    mismatch between predicted unit specification
    (phoneme name, duration, pitch, spectral
    properties) and actual features of a candidate
    recorded unit → need better spectral distance
    measures that incorporate human perception
  • better signal processing → compress units for
    small footprint devices

85
Visual TTS
  • Applications: talking assistants/avatars,
    intelligent agents, video mail, email reading ...
  • Advantages of using a talking head
  • higher intelligibility and perceived quality of
    (audio) TTS!
  • enhanced user experience through multimedia
    communication
  • Approaches
  • sample-based image synthesis using library of
    video snippets
  • 3D head model including models of tongue and lips
    (Baldi)
  • in the future: sample-based image synthesis using
    Baldi for alignment (i.e., use best of each of
    the two technologies)

86
Visual Text-to-Speech Synthesis
  • personalized friendly agents provide an
    entertaining and effective user experience.
  • subjective tests confirm
  • Agents are preferred over text and audio
    interfaces
  • Agents are more trusted than text and audio
    interfaces
  • applications
  • Personal Assistant
  • Customer Service
  • Newscaster (Ananova)
  • E-commerce
  • Games

87
Talking Heads
  • 3D Talking Heads: flexible; easy to show in any pose; faces look
    cartoon-like.
  • Sample-based Talking Heads: look like a real person; require
    recording of real people; limited in pose that can be shown.
88
3D-Head Model (Baldi Family)
Baldi

Andrew
Katherine
Caesar
Cybatt
Cybel
89
VTTS Process
Text
Coarticulation Model/Library
phonemes
Text to Speech Synthesizer
Rendering
Lip shapes Emotions Movements ( FAPs)
3D model
Movements Emotions Model/Library
Sample-based model
stress
Conversation module
emotions
90
Two Rendering Techniques
  • Synthetic 3D models: parametrized shapes
  • keeps correct appearance under full range of views
  • hard to reproduce minute skin details, like
    wrinkles, that look absolutely natural
  • Sample-based: parametrized textures
  • reproduces photo-realistic appearances fast
  • range of views limited by planar approximation of
    parts
91
Sample-Based Model
  • concatenate snippets of video to synthesize
    talking heads
  • reduce the number of samples to store by
    decomposing recorded head into sub-parts.
  • use a background image of the whole head onto
    which parts are warped.
  • feathering (transparency gradient at border)
    helps smooth blending
  • smooth transitions of each object (e.g., mouth
    shape) across unit boundaries by using advanced
    morphing techniques

92
Model-Independent Animation
  • facial expressions

Fear
Disgust
Anger
Surprise
Sadness
Joy
93
Giving Machines High Quality Voices and Faces
U.S. English Female U.S. English Male Spanish
Female
Natural Speech
94
Customer Care Scenario
95
Visual TTS Demos
E-Mail Messages
Virtual Secretary
Au Clair de la Lune
96
Travel Domain Scenario
97
Business Drivers of TTS
  • cost reduction
  • TTS as a dialog component for customer care
  • TTS for delivering messages
  • TTS to replace expensive recorded IVR prompts
  • new products and services
  • location-based services
  • providing information in cars (e.g., driving
    directions, traffic reports)
  • unified Messaging (reading e-mail, fax)
  • voice Portals (enterprise, home, phone access to
    Web-based services)
  • e-commerce (automatic information agents)
  • customized News, Stock Reports, Sports Scores
  • devices

98
Reading Email
From: Marilyn Walker <walker@research.att.com>
To: David Ross <davidross@home.com>
Subject: Re: Today's Meeting
Date: Tuesday, December 01, 1998 4:25 PM
--------------------------------------------------
4:30 is fine for me. See you at the meeting.
Marilyn
-----Original Message-----
From: David Ross <davidross@home.com>
To: Marilyn Walker <walker@research.att.com>
Date: Tuesday, December 01, 1998 2:25 PM
Subject: Today's Meeting
Today's meeting has been changed from 4:00 to 4:30 PM. If
the time change is a problem, please send
me email at davidross@home.com.
Thanks, david ross
99
Reading Email (final)
From: Marilyn Walker <walker@research.att.com>
To: David Ross <davidross@home.com>
Subject: Re: Today's Meeting
Date: Tuesday, December 01, 1998 4:25 PM
--------------------------------------------------
4:30 is fine for me. See you at the meeting.
Marilyn Walker
-----Original Message-----
From: David Ross <davidross@home.com>
To: Marilyn Walker <walker@research.att.com>
Date: Tuesday, December 01, 1998 2:25 PM
Subject: Today's Meeting
Today's meeting has been changed from 4:00 to 4:30 PM. If
the time change is a problem, please send
me email at davidross@home.com.
Thanks, david ross
100
TTS Application Categories
Devices
  • PDAs, cellphones, gaming, talking appliances
  • driving directions, city and restaurant guides,
    location services (e.g., "Macy's has a
    sale!"); voice control of cell phones, VCRs, TVs
  • home information access over telephone (Home
    Voice Portals)
  • information access over the phone such as sales
    information, HR, internal phonebook, messaging
  • E-commerce, customer care (e.g., friendly
    automated talking web agents, FAQs, product
    information)
  • next-gen HMIHY automated operator services

Automotive Connectivity
Consumer Communications
Enterprise Communications
Voice-assisted E-Commerce
Call center Automation