ECE 16:332:527 Digital Speech Processing Lecture 18 (Transcript and Presenter's Notes)
1
ECE 16:332:527 Digital Speech Processing Lecture 18
  • Text-to-Speech (TTS) Synthesis Systems

2
Text-to-Speech (TTS) Synthesis
  • GOAL: convert arbitrary textual messages into
    intelligible and natural-sounding synthetic
    speech, so as to transmit information from a
    machine to a person

3
Text Analysis Components
Raw English Input Text
Basic Text Processing Document Structure
Detection Text Normalization Linguistic Analysis
Dictionary
Tagged Text
Phonetic Analysis Homograph disambiguation Graphem
e-to-Phoneme Conversion
Tagged Phones
Prosodic Analysis Pitch and Duration Rules Stress
and Pause Assignment
Synthesis Controls (Sequence of Sounds,
Durations, Pitch)
4
Document Structure
  • end of sentence marked by .?! is not infallible
  • The car is 72.5 in. long
  • e-mail and web pages need special processing
  • Larry: Sure. I'll try to do it before Thursday
    :-) Ed
  • multiple languages
  • insertion of foreign words, unusual accent and
    diacritical marks, etc.

5
Text Normalization
  • abbreviations and acronyms
  • Dr. is pronounced either as Doctor or Drive
    depending on context (Dr. Smith lives on Smith
    Dr.)
  • St. is pronounced either as Street or Saint
    depending on context (I live on Bourbon St. in
    St. Louis)
  • DC is either direct current or District of
    Columbia
  • MIT is pronounced as either M I T or
    Massachusetts Institute of Technology, but never
    as mitt
  • DEC is pronounced as either deck or Digital
    Equipment Corporation, but never as D E C
  • numbers
  • 370-1111 can be read either as three seven oh,
    one one one one (a phone number) or as three
    seventy, model 1111 (the IBM 370 computer)
  • 1920 is either nineteen twenty (a year) or one
    thousand nine hundred twenty (a quantity)
  • dates, times, currency, account numbers,
    ordinals, cardinals, math
  • Feb. 15, 1983 needs to be converted to February
    fifteenth, nineteen eighty-three
  • $10.50 is pronounced as ten dollars and fifty
    cents
  • part #10-50 needs to be pronounced as part
    number ten dash fifty rather than part pound
    sign ten to fifty
    (see the sketch after this slide)

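The abbreviation and number handling above can be sketched in a few lines of code. This is a toy illustration, not the system from the lecture: the ABBREV table, the capitalized-previous-word heuristic, and the under-$100 limit on spelled-out amounts are all simplifying assumptions.

```python
import re

# Toy abbreviation table: (expansion before a following name, expansion
# after a preceding name). Real systems use POS tags and wider context.
ABBREV = {"Dr.": ("Doctor", "Drive"), "St.": ("Saint", "Street")}

ONES = ("zero one two three four five six seven eight nine ten eleven "
        "twelve thirteen fourteen fifteen sixteen seventeen eighteen "
        "nineteen").split()
TENS = "zero ten twenty thirty forty fifty sixty seventy eighty ninety".split()

def two_digits(n):
    """Spell out 0-99."""
    if n < 20:
        return ONES[n]
    t, o = divmod(n, 10)
    return TENS[t] if o == 0 else TENS[t] + " " + ONES[o]

def expand_abbrevs(tokens):
    """Crude positional rule: 'Smith Dr.' -> Drive, 'Dr. Smith' -> Doctor."""
    out = []
    for i, tok in enumerate(tokens):
        if tok in ABBREV:
            title, street = ABBREV[tok]
            out.append(street if i > 0 and tokens[i - 1][:1].isupper() else title)
        else:
            out.append(tok)
    return out

def expand_currency(text):
    """'$10.50' -> 'ten dollars and fifty cents' (amounts under $100 only)."""
    return re.sub(r"\$(\d{1,2})\.(\d\d)",
                  lambda m: f"{two_digits(int(m.group(1)))} dollars and "
                            f"{two_digits(int(m.group(2)))} cents",
                  text)

print(" ".join(expand_abbrevs("Dr. Smith lives on Smith Dr.".split())))
# -> Doctor Smith lives on Smith Drive
print(expand_currency("The book costs $10.50"))
# -> The book costs ten dollars and fifty cents
```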
6
Text Normalization
  • proper names
  • Rudy Prpch is pronounced as Rudy Perpich
  • Sorin Ducan is pronounced as Sorin Duchan
  • part of speech
  • read is pronounced as reed or red
  • record is pronounced as rec-ard or ri-cord
  • word decomposition
  • need to decompose complex words into base forms
    (morphemes) to determine pronunciation
    (indivisibility needs to be decomposed into
    in-di-visible-ity to determine pronunciation)

7
Text Normalization
  • proper handling of special symbols in text
  • punctuation, e.g., . , - -- _ ( )
    @ ! < > ? / \
  • string resolution
  • 1020 can be pronounced either as twenty after
    ten (as a time) or as ten to twenty (as a sequence)

8
Linguistic Analysis
  • part-of-speech (POS)
  • word sense
  • phrases
  • anaphora
  • emphasis
  • style
  • a conventional parser could be used, but
    typically a simple shallow analysis is done for
    speed (parsers are not real-time!)

9
Homograph Disambiguation
  • an absent boy versus do you choose to absent
    yourself?
  • they will abuse him versus they won't take
    abuse
  • an overnight bag versus are you staying
    overnight?
  • he is a learned man versus he learned to play
    piano
  • El Camino Real road versus real world

10
Letter-to-Sound (LTS) Conversion
Input Word
  • CART (Classification and Regression Tree)
    analysis
  • conventional dictionary search with
    letter-to-sound rules

Flow: whole-word dictionary probe (if the word is
found, done); otherwise affix stripping and a root
dictionary probe (if the root is found, apply affix
reattachment); otherwise letter-to-sound and stress
rules. Output: phonemes, stress, parts of speech.
This is only ONE way of doing LTS; FSMs are another.
(A sketch of this dictionary-first flow follows this slide.)
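The sketch below assumes a toy lexicon and suffix table; real systems use large pronunciation dictionaries, CART- or FSM-based letter-to-sound rules, and stress assignment.

```python
# Minimal sketch of the dictionary-first LTS flow in the diagram above.
# The tiny dictionary, the suffix table, and the spell-it-out fallback
# are placeholders, not the lecture's actual rules.
DICT = {"cat": "K AE T", "run": "R AH N"}
SUFFIXES = {"s": "Z", "ing": "IH NG"}         # suffix -> phones (toy)

def letter_to_sound_rules(word):
    """Last-resort rule stage (placeholder: spell the word letter by letter)."""
    return " ".join(word.upper())

def lts(word):
    word = word.lower()
    if word in DICT:                           # whole-word dictionary probe
        return DICT[word]
    for suffix, phones in SUFFIXES.items():    # affix stripping + root probe
        root = word[: -len(suffix)]
        if word.endswith(suffix) and root in DICT:
            return DICT[root] + " " + phones   # affix reattachment
    return letter_to_sound_rules(word)         # letter-to-sound and stress rules

print(lts("cats"))   # -> 'K AE T Z' (root from the dictionary, plural reattached)
print(lts("dog"))    # -> falls through to the rule-based back end
```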
11
Prosody
  • pauses
  • to indicate phrases and to avoid running out of
    breath
  • pitch
  • fundamental frequency (F0) as a function of time
  • rate/relative duration
  • phoneme durations, timing, and rhythm
  • loudness
  • relative amplitude/volume

12
Symbolic and Phonetic Prosody
parsed text and phone string
→ Symbolic Prosody: pauses, prosodic phrases,
  accent, tone, tune, speaking style
→ prosody attributes: pitch range, prominence,
  declination
→ F0 Contour Generation
→ F0 Contour
13
ToBI tones
  • pitch accent tones
  • intermediate phrasal tones (L-, H-)
  • boundary tones (L-L%, L-H%, H-L%, H-H%, %H)

14
Marianna Made the Marmalade
15
F0 Contour Generation
  • anchor points from accent, tone, pitch range,
    prominence and declination
  • F0 contour is obtained by interpolation

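The interpolation step can be illustrated with a short sketch; the anchor times and F0 values below are made up, and a real system would derive them from the accent, tone, pitch-range, prominence, and declination rules above.

```python
# A minimal sketch of F0 contour generation by interpolation between
# anchor points; the anchor values are fabricated for illustration.
import numpy as np

anchor_times = np.array([0.00, 0.25, 0.60, 0.90, 1.20])      # seconds
anchor_f0    = np.array([180.0, 230.0, 170.0, 200.0, 140.0])  # Hz

frame_shift = 0.01                                   # one F0 value every 10 ms
t = np.arange(0.0, anchor_times[-1], frame_shift)
f0_contour = np.interp(t, anchor_times, anchor_f0)   # piecewise-linear interpolation
```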
16
Full TTS System
Pipeline: text (and language) → text analysis →
letter-to-sound rules → prosody → synthesis backend →
speech, with face sync for visual output. Example
input: Dr. Smith lives at 23 Lakeshore Dr.
  • text analysis: numerical expansion (dates, times,
    currency), abbreviations and acronyms, proper
    name identification
  • parsing: sentences, phrase and breath groups;
    semantic and syntactic accent/stress of
    compounds; intonation types; parts of speech
  • prosody: timing and duration (rate, stress
    position and context); intonation/pitch (phrase
    type: list, statement, question); phonetic
    substitution (vowel reduction, flapping rules);
    loudness/amplitude (phrasal amplitude, local
    reduction); text and name dictionaries
  • synthesis backend: the synthesis model and the
    unit representation both impact TTS quality
    - choice of units: words, phones, diphones,
      dyads, syllables
    - choice of parameters: LPC, formants, waveform
      templates, articulatory parameters, sinusoidal
      parameters
    - method of computation: rules, concatenation
17
Word Concatenation Synthesis
  • words in sentences are much shorter than in
    isolation (up to 50% shorter; see next page)
  • words cannot preserve sentence-level stress,
    rhythm or intonation patterns
  • too many words to store (1.7 million surnames
    alone), plus words extended by prefixes and
    suffixes

18
Word Concatenation
19
Proper Name Statistics
20
Statistics of Proper Name Coverage
name pronunciation based on etymology
21
Concatenative Word Synthesis
several examples of concatenated words and full
sentences--SBR
word-based synthesis does not work!
22
Speech Synthesis Methods
  • 1939: the VODER (Voice Operated DEmonstratoR),
    Homer Dudley
  • based on a simple model of speech sound
    production
  • select voicing source (with foot pedal control of
    pitch) or noise source
  • ten filters shaped the source to produce vocal or
    noise-excited sounds, controlled by finger motions
  • separate keys for stop sounds
  • wrist bar control of signal energy

23
The VODER
24
Articulatory Synthesis
  • in theory can create more natural and more
    realistic motions of the articulators (rather
    than formant parameters), thereby leading to more
    natural sounding synthetic speech
  • utilize physical constraints of articulator
    movements
  • use X-ray data to characterize individual speech
    sounds
  • model how articulatory parameters move smoothly
    between sounds
  • direct method: solve the wave equation for the
    sound pressure at the lips
  • indirect method: convert to formants or LPC
    parameters for final synthesis, in order to
    utilize existing synthesizers
  • use highly constrained motions of articulatory
    parameters

25
Articulatory Model
  • Articulatory Parameters of Model
  • lip opening: W
  • lip protrusion length: L
  • tongue body height and length: Y, X
  • velar closure: K
  • tongue tip height and length: B, R
  • jaw raising (dependent parameter)
  • velum opening: N

26
Articulatory Models
27
Articulatory Synthesis Using Formant Synthesizer
Backend
Cecil Coker: teaching computers to talk
Articulatory Synthesis of Speech (Cecil Coker)
28
A DIRECT Approach: Analysis of the Vocal Tract in
the Frequency Domain
Each tube section is described by a chain (abcd)
matrix relating input pressure and volume velocity
(P_in, U_in) to the output quantities (P_out, U_out):

  [P_in, U_in]^T = [[a, b], [c, d]] [P_out, U_out]^T

The VT transfer function is

  H(Ω) = U_out / U_in = 1 / (c·Z_L + d),

with Z_L the radiation impedance at the lips. The
matrix elements of a lossless section (length l and
cross-section A = const.) are

  a = d = cos(βl),
  b = j (ρc/A) sin(βl),
  c = j (A/(ρc)) sin(βl),   with β = Ω/c,

where c is the speed of sound and ρ the density of
air. (A numerical sketch follows.)
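The sketch below multiplies the section matrices and evaluates U_out/U_in = 1/(c·Z_L + d). The two-section area function and the near-zero lip impedance are illustrative choices, not values from the lecture.

```python
# Numerical sketch of the DIRECT frequency-domain approach: cascade the
# chain (abcd) matrices of lossless tube sections and evaluate the
# volume-velocity transfer function. Lengths, areas, and Z_L are toy values.
import numpy as np

RHO = 1.2        # density of air (kg/m^3)
C_SOUND = 350.0  # speed of sound (m/s)

def section_matrix(omega, length, area):
    """Chain matrix of one lossless tube section (constant cross-section)."""
    beta = omega / C_SOUND
    a = d = np.cos(beta * length)
    b = 1j * (RHO * C_SOUND / area) * np.sin(beta * length)
    c = 1j * (area / (RHO * C_SOUND)) * np.sin(beta * length)
    return np.array([[a, b], [c, d]])

def vt_transfer(omega, lengths, areas, z_lips):
    """U_out / U_in of the concatenated sections, terminated by Z_L at the lips."""
    K = np.eye(2, dtype=complex)
    for length, area in zip(lengths, areas):
        K = K @ section_matrix(omega, length, area)
    c, d = K[1]
    return 1.0 / (c * z_lips + d)

# crude two-tube /a/-like shape: narrow back cavity, wide front cavity
freqs = np.linspace(100, 4000, 400)
H = np.array([vt_transfer(2 * np.pi * f, lengths=[0.09, 0.08],
                          areas=[1e-4, 7e-4], z_lips=1e-6) for f in freqs])
# peaks of 20*log10(abs(H)) approximate the formant frequencies
```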
29
Vocal Tract Analysis
30
Articulatory Synthesis by Copying Measured Vocal
Tract Data
  • fully automatic closed-loop optimization
  • initialized from articulatory codebooks, neural
    nets
  • Schroeter and Sondhi, 1987
  • One example: original versus re-synthesis

31
Articulatory Synthesis Issues
  • requires highly accurate models of glottis and
    vocal tract
  • requires rules for dynamics of the articulators

32
Vocal Tract to LPC
33
LPC Implementation
34
Serial Synthesis from LPC
Note: H(z) has unity gain at DC (ω = 0, z = 1)
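A minimal sketch of serial synthesis as a cascade of second-order sections, each normalized to unity gain at DC as the note above states. The /a/-like formant frequencies and bandwidths are illustrative stand-ins; in a serial-from-LPC system they would come from the LPC polynomial or formant tracks.

```python
# Serial (cascade) synthesis sketch: an impulse-train excitation passes
# through second-order resonators in cascade, each with unity DC gain.
import numpy as np
from scipy.signal import lfilter

fs = 10000

def resonator(freq_hz, bw_hz):
    """Second-order digital resonator (b, a) with unity gain at DC."""
    r = np.exp(-np.pi * bw_hz / fs)
    a = [1.0, -2.0 * r * np.cos(2 * np.pi * freq_hz / fs), r * r]
    b = [sum(a)]               # b0 = 1 + a1 + a2  ->  H(z = 1) = 1
    return b, a

x = np.zeros(fs)               # one second of excitation
x[::fs // 100] = 1.0           # impulse train at F0 = 100 Hz

for freq, bw in [(730, 60), (1090, 90), (2440, 120)]:   # illustrative F1, F2, F3
    b, a = resonator(freq, bw)
    x = lfilter(b, a, x)
speech = x
```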
35
Source-Filter Synthesis Models
  • cascade/serial (formant) synthesis model

36
Serial/Formant Synthesis Model
37
Serial/Formant Synthesis Model
  • flaws in the serial/formant synthesis model
  • can't handle voiced fricatives
  • no zeros for nasal sounds
  • no precise control for stop consonants
  • pitch pulse shape is fixed, independent of pitch
  • spectral compensation is inadequate

Audio examples:
To Be (Bell Labs)
Daisy-Daisy with music
SPASS synthesis
JSRU synthesis
OVE I (Fant)
We Wish You
38
Parallel Synthesis Model
A serial synthesizer is a good approach for open,
non-nasal vocal tracts (vowels, liquids). For
obstruents and nasals, we need to control the
amplitudes of each resonance, and to introduce
zeros in addition to the poles.
A parallel synthesizer provides more flexibility in
matching spectrum levels at the formant frequencies
(via gain controls); however, zeros are introduced
into the spectrum.
39
Parallel Synthesis
  • issues
  • need individual resonance amplitudes
    (A1, ..., A4); if resonances are close, this is a
    messy calculation
  • phasing of resonances neglected (the Bk·z^-1
    terms)
  • synthetic speech has both resonances and zeros
    (at frequencies between the resonances) that may
    be perceptible
  • better reproduction of complex consonants

Parallel synthesis from BYU
Parallel synthesis-Holmes
40
More Advanced Synthesizer
41
More Advanced Synthesizer
  • synthesizer features
  • glottal pulse modeled directly using several
    tunable parameters
  • breathiness component added to glottal source
  • aspiration source included in voiced sound loop
    to enable voiced fricative production
  • pole-zero model for voiced speech
  • radiation modeled separately for both voiced and
    unvoiced speech

42
More Versatile Synthesizer (Serial-Parallel)
43
Voiced Fricative Synthesis
KlattTalk (MIT, 1986)
44
Continuing Evolution (1959-1987)
  • Haskins, 1959
  • KTH Stockholm, 1962
  • Bell Labs, 1973
  • MIT, 1976
  • MITalk, 1979
  • Speak & Spell, 1980
  • Bell Labs, 1985
  • DECtalk, 1987

45
Text-to-Speech Synthesis (TTS) Evolution
(Timeline, 1962-1997: quality progressed from poor
intelligibility and poor naturalness, through good
intelligibility with poor naturalness, to good
intelligibility with customer-quality naturalness in
limited contexts.)
  • Formant synthesis: Bell Labs, Joint Speech
    Research Unit, MIT (DECtalk), Haskins Lab
  • LPC-based diphone/dyad synthesis: Bell Labs,
    CNET, Bellcore, Berkeley Speech Technology
  • Unit selection synthesis: ATR in Japan, CSTR in
    Scotland, BT in England, AT&T Labs (1998), L&H
    in Belgium
46
Speech Synthesis: the 90s
  • what changed?
  • TTS was highly intelligible but extremely
    unnatural sounding
  • a decade of work had not changed the naturalness
    substantially
  • computation and memory grew with Moore's Law,
    enabling highly complex concatenative systems to
    be created, implemented and perfected
  • concatenative systems showed themselves capable
    of producing (in some cases) extremely natural
    sounding synthetic speech

47
Concatenation TTS Systems
  • key idea: use segments of recorded speech for
    synthesis
  • data-driven approach → more segments give better
    synthesis → using an infinite number of segments
    leads to perfect synthesis
  • key issues
  • what units to use
  • how to select units from natural speech
  • how to label and extract consistent units from a
    large database
  • what signal representation should be used for
    spectrally smoothing units (at junctures) and for
    prosody modification (pitch, duration, amplitude)

48
Concatenation Units
  • choice of units
  • Words: there are an effectively infinite number of them
  • Syllables: there are about 10K in English
  • Phonemes: there are about 45 in English
  • Demi-syllables: there are about 2,500 in English
  • Diphones: there are about 1,500-2,500 in English

49
Choice of Units
Unit (approximate number in English):
  Allophone: 60-80
  Diphone: < 40²-65²
  Triphone: < 40³-65³
  Demisyllable: ~2K
  Syllable: ~11K
  VCV / 2-syllable: < 11K²
  Word: 100K-1.5M
  Phrase, Sentence: unbounded (∞)
Going from short units to long units, the number of
rules and necessary unit modifications drops from
many to few, unit length grows from short to long,
and quality rises from low to high.
50
Corpus Coverage by Unit Type
(Chart: corpus coverage versus the number of units,
one token per unit; coverage depends on the domain,
here SURNAMES. The x-axis is the top N surnames by
rank, from 1 to roughly 2M; the slope of the coverage
curve falls from about 0.83 over the first 10K names
to about 0.003 in the 2M-name tail, reaching full
coverage (1.0) only when essentially all surnames are
included.)
51
Concatenation Units
  • Words
  • no complete coverage for broad domains → words
    have to be supplemented with smaller units
  • limited ability to modify pitch, amplitude and
    duration without losing naturalness and
    intelligibility
  • need huge database to extract multiple versions
    of each word
  • Sub-Words
  • hard to isolate in context due to co-articulation
  • need allophonic variations to characterize units
    in all contexts
  • puts large burden on signal processing to smooth
    at unit join points

52
Concatenation Unit Representation
  • LPC
  • simple, easy to concatenate units, efficient for
    modification of pitch (since it is inherently
    separated from vocal tract spectra)
  • doesn't work well for nasals (lack of nasal
    zeros)
  • glottal excitation not correct (assumed pitch
    pulses)
  • doesn't work well for mixed excitation (basic LPC
    assumptions)
  • buzzy
  • TD-PSOLA: Time-Domain Pitch-Synchronous
    Overlap-Add synthesis
  • efficient prosody modification (pitch
    synchronous)
  • no smoothing at join points

53
Speech Waveform Models
  • time-domain source-filter models (LPC)
  • filter represents the vocal tract
  • synthetic glottal pulse excites the filter
  • filter produces synthetic speech

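A minimal single-frame sketch of this source-filter idea, assuming librosa is available for the LPC analysis; the frame, model order, pitch, and gain rule are illustrative choices, not the lecture's method.

```python
# One-frame LPC analysis/re-synthesis sketch: estimate an all-pole filter
# from a frame of speech, then drive it with a synthetic pulse train.
import numpy as np
import librosa
from scipy.signal import lfilter

fs, order = 8000, 10

def resynthesize_frame(frame, f0=120.0):
    """`frame`: 1-D numpy array of speech samples at rate fs."""
    a = librosa.lpc(frame.astype(float), order=order)   # a[0] == 1
    residual = lfilter(a, [1.0], frame)                  # inverse filtering
    gain = np.sqrt(np.mean(residual ** 2))               # rough energy match
    excitation = np.zeros(len(frame))
    excitation[::int(fs / f0)] = 1.0                     # glottal pulse train
    return lfilter([gain], a, excitation)                # all-pole synthesis filter
```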
54
Speech Waveform Models
  • time-domain modification (e.g., PSOLA)
  • window and shift pitch pulses
  • pitch marks critical

(Figure: waveform amplitude versus time; zeros are
added to extend the period when the pitch is lowered.
A rough overlap-add sketch follows this slide.)
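This sketch assumes the pitch marks are already given (as the slide stresses, they are critical); it illustrates the overlap-add principle, not a production TD-PSOLA implementation.

```python
# Rough TD-PSOLA-style pitch modification: two-period Hann-windowed
# segments around each analysis pitch mark are overlap-added at synthesis
# marks spaced by the modified period.
import numpy as np

def td_psola_pitch(x, pitch_marks, pitch_factor):
    """Raise (factor > 1) or lower (factor < 1) pitch, keeping duration."""
    marks = np.asarray(pitch_marks, dtype=int)
    periods = np.diff(marks)
    y = np.zeros(len(x))
    t_syn = float(marks[0])
    while t_syn < marks[-1]:
        i = int(np.argmin(np.abs(marks - t_syn)))        # nearest analysis mark
        p = max(1, int(periods[min(i, len(periods) - 1)]))
        lo, hi = marks[i] - p, marks[i] + p
        if 0 <= lo and hi <= len(x):
            seg = x[lo:hi] * np.hanning(hi - lo)          # two-period window
            start = int(t_syn) - p
            if 0 <= start and start + len(seg) <= len(y):
                y[start:start + len(seg)] += seg
        t_syn += p / pitch_factor                         # denser marks -> higher F0
    return y
```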
55
TD-PSOLA Synthesis
time domain modifications lead to spectral
distortions
56
Concatenation Mismatches of TD-PSOLA
  • phase mismatch: different relative positions of
    the OLA windows within the left and right
    segments (LS and RS)
  • pitch mismatch: different F0 in LS and RS
  • timbre mismatch: different spectral envelopes in
    LS and RS
  • lacking smoothness across concatenation points:
    one needs to painstakingly optimize the segment
    database to get the best segmental quality

57
Temporal Envelope Mismatch in TD-PSOLA
  • speech waveform changes abruptly from the left
    segment to the right segment
  • concatenation point is detectable
  • even when applied to the LPC residual and with
    smoothing of the spectral envelopes, audible
    glitches remain

58
MBROLA (T. Dutoit)
  • time-domain synthesis that combines the
    advantages of PSOLA (low computational cost, good
    prosody modification) with an off-line hybrid
    time/frequency algorithm for smoothing
    transitions.
  • speech in database has constant pitch (100 Hz)
  • amplitude and some general spectral smoothing
    >>>> http://tcts.fpms.ac.be/synthesis/

59
Speech Waveform Models
  • sinusoidal models
  • model signal as a sequence of time-varying
    sinusoids

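A minimal sketch of synthesis from such a model: each frame carries (amplitude, frequency) pairs for a few partials, parameters are interpolated between frames, and the phase of each partial is accumulated sample by sample. The parameter tracks below are fabricated for illustration.

```python
# Sinusoidal-model synthesis sketch: a sum of time-varying sinusoids with
# linearly interpolated amplitude/frequency tracks between frames.
import numpy as np

fs, hop = 16000, 160                        # 10 ms frame hop
frames = [                                  # (amplitude, frequency in Hz) per partial
    [(0.5, 200.0), (0.3, 400.0), (0.2, 600.0)],
    [(0.5, 220.0), (0.3, 440.0), (0.2, 660.0)],
]

y = np.zeros(hop * (len(frames) - 1))
phase = np.zeros(len(frames[0]))
for f in range(len(frames) - 1):
    for n in range(hop):
        alpha = n / hop
        for k, ((a0, f0), (a1, f1)) in enumerate(zip(frames[f], frames[f + 1])):
            amp = (1 - alpha) * a0 + alpha * a1          # amplitude track
            freq = (1 - alpha) * f0 + alpha * f1         # frequency track
            phase[k] += 2 * np.pi * freq / fs            # accumulate phase
            y[f * hop + n] += amp * np.sin(phase[k])
```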
60
HNM (Harmonic Noise Model) (Y. Stylianou)
  • harmonic (low band) and noise (high band)
    classification of speech
  • harmonic part modeled by a comb of sinusoids
  • noise part modeled by a parametric time-domain
    envelope and spectrally shaped by an AR-model
  • analysis/synthesis done pitch-synchronously
    without explicit pitch markers

61
Concatenation Unit Representation
  • Hybrid Time-Frequency Representation
  • easy to modify prosody
  • segments easily smoothed at concatenation
    points
  • can easily modify spectral envelope

62
Hybrid Representation
63
Practical Implementation of Hybrid Synthesis
(Block diagram: harmonic parameters a_i, f_i, φ_i and
ω_0 drive a periodic excitation e_p(n) through a
time-varying filter h_p(n,m) to produce s_p(n); a
residual spectrum S_r(ω) shapes a random excitation
e_r(n) through h_r(n,m) to produce s_r(n); the two
components are summed to give s(n).)
64
Speech Waveform Modification
  • can't cover all possible combinations of feature
    variables in a database
  • waveform modification is of vital importance for
    concatenative synthesis based on diphone units
  • some attributes can be modified easily with
    signal processing techniques

65
Powerful Voice Alteration
original modified
  • prosody modification capabilities
  • voice alteration female to child
  • voice alteration child to adult male

HNM
PSOLA
66
Hybrid Methods, Final Thoughts
  • 10-20 times more complex than TD-PSOLA
  • complexity issue can be addressed by using
    computationally expensive hybrid methods only
    offline during database generation while applying
    low-complexity time-domain-only synthesis online
    in the TTS system
  • e.g., MBROLA
  • for highest possible quality prosodic
    modifications, hybrid methods need to be applied
    online
  • e.g., HNM

67
Block Diagram of a Concatenative TTS System
Message text (alphabetic characters)
→ Text Analysis, Letter-to-Sound, Prosody (using a
  dictionary and rules)
→ phonetic symbols, prosody targets
→ Assemble units that match the input targets (from
  a store of sound units, female or male voice)
→ Speech waveform modification and synthesis
→ Speech
68
Concatenative Synthesis Unit Definition and
Extraction
/eh-s/ diphone
  • Waveform
  • Spectrogram
  • Symbolic representation
  • word labels
  • tone labels
  • syllable and stress labels
  • phone labels
  • break indices

/s/ phone
69
Issues with Unit Type
  • sub-word units
  • usually cut from over-articulated sentences
    read in an almost monotone style
  • neglects effects of neighboring units
    (coarticulation)
  • must include several allophonic variations
  • neglects variations due to speaking style (news,
    announcements) and pitch
  • puts a large burden on signal processing
    (smoothing, prosody modifications)
  • small (30 minutes / 650 sentences) database,
    easy to label

70
Procedure for Concatenative Synthesis
  • off-line inventory preparation
  • record speech corpus and process with coding
    method of choice
  • determine location of speech units and store
    units in inventory
  • on-line synthesis from text
  • normalize input text (expand abbreviations, etc.)
  • letter-to-sound (pronunciation dictionary and
    rules)
  • prosody (melody/pitch, durations, stress
    patterns/amplitudes, ...)
  • select appropriate sequence of units from
    inventory
  • modify units (smooth at boundaries, match desired
    prosody)
  • synthesize and output speech signal

71
Unit Selection Synthesis
  • need to optimally match units at boundaries,
    e.g., fundamental frequency (pitch), and spectrum
  • need to automatically and efficiently select
    optimal sequence of units from database
  • issues in Unit Selection Synthesis
  • several examples in each unit category (from 10
    to 10^6)
  • waveform modification used sparingly (leads to
    perceived distortions)
  • high intelligibility must be maintained
  • customer quality attained with reasonable
    training set (1-10 hours)
  • natural quality attained with large training
    set (10s of hours)
  • unit selection algorithm must run in a fraction
    of real time on a state-of-the-art processor

72
Why is Online Unit Selection Necessary?
Single feature distribution (within the same
category, here /ow/), e.g., pitch, duration,
emphasis, spectral tilt, ...
/ow/
Assumption: no labeling and feature-extraction
errors!
  • Impossible to capture broad range of naturally
    occurring features with just one or two examples

73
Unit Selection
  • given target features, automatically find
    sequence of units in the database that most
    closely match these features

(Figure: a lattice of candidate units from the
database, with several candidates for each target
phone (hh, eh, l, ow, ...); a trained perceptual
distance metric scores how well each candidate
matches the target features.)
74
Unit Selection
  • additionally, find sequence of units that best
    join each other
  • find optimal path using Viterbi Search (Dynamic
    Programming)

(Figure: the same candidate lattice; the best-joining
sequence through the hh, eh, l, ow candidates is
found by the Viterbi search.)
75
(On-line) Unit Selection Viterbi Search
(Figure: Viterbi trellis over diphone candidates such
as -a, a-u, and u-, with several instances (1)-(4) of
each in the database.)
  • transitional (concatenation) costs are based on
    acoustic distances
  • node (target) costs are based on the linguistic
    identity of the unit
    (a dynamic-programming sketch follows this slide)

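The sketch below uses placeholder target_cost and concat_cost callables standing in for the USD/UCD-style measures on the next slide; they are assumptions, not the lecture's actual cost functions.

```python
# Compact dynamic-programming (Viterbi) sketch of unit selection.
def select_units(targets, candidates, target_cost, concat_cost):
    """candidates[i] is the list of database units available for target i."""
    # best[i][j] = (lowest path cost ending in candidates[i][j], backpointer)
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            prev_cost, prev_j = min(
                (best[i - 1][k][0] + concat_cost(v, u), k)
                for k, v in enumerate(candidates[i - 1])
            )
            row.append((prev_cost + target_cost(targets[i], u), prev_j))
        best.append(row)
    # backtrack from the cheapest final candidate
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(len(targets) - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][path[i]] for i in range(len(targets))]
```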
76
Unit Selection Measures
  • USD (Unit Segmental Distortion): differences
    between the desired spectral pattern of the
    target and that of the candidate unit, throughout
    the whole unit
  • UCD (Unit Concatenative Distortion): spectral
    discontinuity across the boundaries of the
    concatenated units
  • Example: source context want /w ah n t/,
    target context cart /k ah r t/;
    the USD reflects the /ah/ in an /n/ context
    versus an /r/ context
77
Concatenative Synthesis
  • concatenate recording chunks (senone, half-phone,
    diphone, phone, demisyllable, syllable, word,
    phrase, sentence)
  • adjacent units have zero concatenation cost.

(Figure: target units t_j aligned with the selected
units; each selected unit incurs a unit cost (USD)
against its target t_j and a transition cost (UCD) to
the neighboring selected unit.)
78
Concatenation Cost
79
Target Cost
80
(Off-line) Weight Training
81
Acoustic Target Cost
82
Modern TTS Systems (Natural Voices from AT&T)
  • Soliloquy from Hamlet
  • Gettysburg Address
  • Bob Story
  • German female
  • UK British female
  • Spanish female
  • Korean female
  • French male

83
Modern COMMERCIAL Systems
  • Lucent
  • AcuVoice
  • Festival
  • L&H RealSpeak
  • SpeechWorks female
  • SpeechWorks male
  • CSELT (Actor) - Italian

84
TTS Future Needs
  • TTS needs to know how things should be said
  • context-sensitive pronunciations of words
  • prosody prediction → emphasis
  • I gave the book to JOHN (not someone else)
  • I gave the BOOK to John (not the photos)
  • I gave the book to John (I did it, not someone
    else)
  • unit selection process → the target cost captures
    the mismatch between the predicted unit
    specification (phoneme name, duration, pitch,
    spectral properties) and the actual features of a
    candidate recorded unit → need better spectral
    distance measures that incorporate human
    perception
  • better signal processing → compress units for
    small-footprint devices

85
Visual TTS
  • Applications talking assistants/avatars,
    intelligent agents, video mail, email reading ...
  • Advantages of using a talking head
  • higher intelligibility and perceived quality of
    (audio) TTS!
  • enhanced user experience through multimedia
    communication
  • Approaches
  • sample-based image synthesis using library of
    video snippets
  • 3D head model including models of tongue and lips
    (Baldi)
  • in the future: sample-based image synthesis using
    Baldi for alignment (i.e., use the best of each
    of the two technologies)

86
Visual Text-to-Speech Synthesis
  • personalized friendly agents provide an
    entertaining and effective user experience.
  • subjective tests confirm that
  • agents are preferred over text and audio
    interfaces
  • agents are more trusted than text and audio
    interfaces
  • applications
  • Personal Assistant
  • Customer Service
  • Newscaster (Ananova)
  • E-commerce
  • Games

87
Talking Heads
  • 3D Talking Heads
  • Sample-based Talking Heads

Sample-based heads look like a real person, but they
require recording of real people and are limited in
the poses that can be shown. 3D heads are flexible
and easy to show in any pose, but the faces look
cartoon-like.
88
3D-Head Model (Baldi Family)
Baldi

Andrew
Katherine
Caesar
Cybatt
Cybel
89
VTTS Process
Text → Text-to-Speech Synthesizer → phonemes and
stress → Coarticulation Model/Library → lip shapes,
emotions, and movements (FAPs) → Rendering, using a
3D model or a sample-based model; a Conversation
module and a Movements/Emotions Model/Library supply
the emotion controls.
90
Two Rendering Techniques
  • Synthetic (3D models, parametrized shapes): keeps
    the correct appearance under a full range of
    views, but it is hard to reproduce minute skin
    details, like wrinkles, that look absolutely
    natural.
  • Sample-based (parametrized textures): reproduces
    photo-realistic appearances fast, but the range
    of views is limited by the planar approximation
    of parts.
91
Sample-Based Model
  • concatenate snippets of video to synthesize
    talking heads
  • reduce the number of samples to store by
    decomposing recorded head into sub-parts.
  • use a background image of the whole head onto
    which parts are warped.
  • feathering (transparency gradient at border)
    helps smooth blending
  • smooth transitions of each object (e.g., mouth
    shape) across unit boundaries by using advanced
    morphing techniques

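A small sketch of the feathering idea from the list above: a transparency ramp at the border of a warped sub-part so that it blends smoothly onto the background head image. Grayscale images and a fixed ramp width are assumed for simplicity.

```python
# Feathered (soft-border) alpha blending of an image patch onto a background.
import numpy as np

def feathered_paste(background, patch, top, left, ramp=8):
    """Alpha-blend `patch` onto `background` with feathered borders."""
    h, w = patch.shape
    alpha = np.ones((h, w))
    ramp_up = np.linspace(0.0, 1.0, ramp)
    alpha[:ramp, :] *= ramp_up[:, None]          # fade in at the top edge
    alpha[-ramp:, :] *= ramp_up[::-1][:, None]   # fade out at the bottom edge
    alpha[:, :ramp] *= ramp_up[None, :]          # left edge
    alpha[:, -ramp:] *= ramp_up[::-1][None, :]   # right edge
    out = background.astype(float)
    region = out[top:top + h, left:left + w]
    out[top:top + h, left:left + w] = (1 - alpha) * region + alpha * patch
    return out
```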
92
Model-Independent Animation
  • facial expressions

Fear
Disgust
Anger
Surprise
Sadness
Joy
93
Giving Machines High Quality Voices and Faces
U.S. English Female U.S. English Male Spanish
Female
Natural Speech
94
Customer Care Scenario
95
Visual TTS Demos
E-Mail Messages
Virtual Secretary
Au Claire de la Lune
96
Travel Domain Scenario
97
Business Drivers of TTS
  • cost reduction
  • TTS as a dialog component for customer care
  • TTS for delivering messages
  • TTS to replace expensive recorded IVR prompts
  • new products and services
  • location-based services
  • providing information in cars (e.g., driving
    directions, traffic reports)
  • unified messaging (reading e-mail, fax)
  • voice portals (enterprise, home, phone access to
    Web-based services)
  • e-commerce (automatic information agents)
  • customized news, stock reports, sports scores
  • devices

98
Reading Email
From: Marilyn Walker <walker@research.att.com>
To: David Ross <davidross@home.com>
Subject: Re: Today's Meeting
Date: Tuesday, December 01, 1998 4:25 PM
------------------------------------------------------------
4:30 is fine for me. See you at the meeting.
Marilyn
-----Original Message-----
From: David Ross <davidross@home.com>
To: Marilyn Walker <walker@research.att.com>
Date: Tuesday, December 01, 1998 2:25 PM
Subject: Today's Meeting
Today's meeting has been changed from 4:00 to 4:30
PM. If the time change is a problem, please send me
email at davidross@home.com.
Thanks, david ross
99
Reading Email (final)
From: Marilyn Walker <walker@research.att.com>
To: David Ross <davidross@home.com>
Subject: Re: Today's Meeting
Date: Tuesday, December 01, 1998 4:25 PM
------------------------------------------------------------
4:30 is fine for me. See you at the meeting.
Marilyn Walker
-----Original Message-----
From: David Ross <davidross@home.com>
To: Marilyn Walker <walker@research.att.com>
Date: Tuesday, December 01, 1998 2:25 PM
Subject: Today's Meeting
Today's meeting has been changed from 4:00 to 4:30
PM. If the time change is a problem, please send me
email at davidross@home.com.
Thanks, david ross
100
TTS Application Categories
Devices
  • PDAs, cellphones, gaming, talking appliances
  • driving directions, city and restaurant guides,
    location services (e.g., Macy's has a sale!);
    voice control of cell phones, VCRs, TVs
  • home information access over telephone (Home
    Voice Portals)
  • information access over the phone such as sales
    information, HR, internal phonebook, messaging
  • E-commerce, customer care (e.g., friendly
    automated talking web agents, FAQs, product
    information)
  • next-gen HMIHY automated operator services

Automotive Connectivity
Consumer Communications
Enterprise Communications
Voice-assisted E-Commerce
Call Center Automation