Speech Recognition Technology in the Ubiquitous/Wearable Computing Environment - PowerPoint PPT Presentation

Transcript and Presenter's Notes



1
Speech Recognition Technology in the Ubiquitous/Wearable Computing Environment
Sadaoki Furui
  • Tokyo Institute of Technology
  • Department of Computer Science
  • 2-12-1, O-okayama, Meguro-ku, Tokyo, 152-8552
    Japan
  • Tel/Fax: +81-3-5734-3480
  • furui@cs.titech.ac.jp
  • http://www.furui.cs.titech.ac.jp/

2
Outline
  • Speech recognition applications
  • Speech recognition in the ubiquitous/wearable
    computing environment
  • Hands-free communication problems
  • Speaker recognition technology
  • Audio indexing
  • Multimodal human-machine communication

3
Outline
  • Speech recognition applications
  • Speech recognition in the ubiquitous/wearable
    computing environment
  • Hands-free communication problems
  • Speaker recognition technology
  • Audio indexing
  • Multimodal human-machine communication

4
Major speech recognition applications
  • Conversational systems for accessing information
    services
  • Robust conversation using wireless
    handheld/hands-free devices in the real mobile
    computing environment
  • Multimodal speech recognition technology
  • Systems for transcribing, understanding and
    summarizing ubiquitous speech documents such as
    broadcast news, meetings, lectures, presentations
    and voicemails

5
Two-pass search structure used in the Japanese
broadcast-news transcription system
6
Model of human-computer interaction
7
Typical structure for task-specific voice control and dialogue systems
8
AT&T Communicator architecture
9
Voice portal environment
10
Voice portal components
11
What is VoiceXML?
  • VoiceXML is to the voice Web what HTML is to the visual Web
  • A collection of XML-based markup languages for implementing speech applications

12
[Architecture diagram: telephone callers reach a gateway that captures voice and replays audio; a voice browser (ASR, TTS, DTMF, grammars) interprets VoiceXML scripts fetched from a web server, which also serves HTML scripts and multimedia files to a web browser and performs information retrieval against a database server (MS SQL Server DB holding metadata and audio files).]
13
Why VoiceXML?
  • Simplifies application development
  • Minimizes Internet traffic
  • Separates user interaction code from application
    logic
  • Provides portability
  • Supports rapid prototyping and iterative
    refinement

14
Forms - Typical sequential dialog

System: Welcome to the electronic payment system.
System: Whom do you want to pay?
User: Ajax Drugstore
System: How much do you want to pay?
User: 36.95
System: Do you want to pay 36.95 to Ajax Drugstore?
User: Yes

<form>
  <prompt>Welcome to the electronic payment system.</prompt>
  <field name="recipient">
    <prompt>Whom do you want to pay?</prompt>
    <grammar src="www.valid_recipients.vxl"/>
  </field>
  <field name="amount">
    <prompt>How much do you want to pay?</prompt>
    <grammar src="www.valid_money.vxml"/>
  </field>
  <field name="validation" type="binary">
    <prompt>Do you want to pay <value expr="amount"/> to <value expr="recipient"/>?</prompt>
  </field>
</form>
15
W3C voice interface framework
[Diagram: the user speaks over the telephone system; ASR and a DTMF tone recognizer feed language understanding and context interpretation, which drive a dialog manager connected to the World Wide Web via VoiceXML 1.0; on the output side, media planning drives language generation with TTS and a prerecorded audio player back to the user.]
16
Taxonomy of system-level evaluation techniques
17
Outline
  • Speech recognition applications
  • Speech recognition in the ubiquitous/wearable
    computing environment
  • Hands-free communication problems
  • Speaker recognition technology
  • Audio indexing
  • Multimodal human-machine communication

18
The Major Trends in Computing (http://www.ubiq.com/hypertext/weiser/NomadicInteractive/Sld003.htm)
19
MIT wearable computing people (http://www.media.mit.edu/wearables/)
20
Features provided by Ubicomp vs. Wearables (http://rhodes.www.media.mit.edu/people/rhodes/papers/wearhive.html)
21
Speech recognition in the ubiquitous/wearable
computing environment
Ubiquitous computing environment
22
Distributed speech recognition (DSR)
23
Meeting synopsizing system using collaborative
speech recognizers
24
Outline
  • Speech recognition applications
  • Speech recognition in the ubiquitous/wearable
    computing environment
  • Hands-free communication problems
  • Speaker recognition technology
  • Audio indexing
  • Multimodal human-machine communication

25
Hands-free communications
26
Hands-free communication problems
  • Noise problem: additive background noise of various kinds, including unwanted speech from multiple sources
  • Colorization problem: change in the amplitude spectrum caused by the (short) impulse response of the room and/or the transducer
  • Reverberation problem: reverberation caused by the long impulse response of the room and its interaction with the quasi-stationarity of speech
  • Duplex system mode problem: echo cancellation for full-duplex communication (e.g. barge-in); speech activity detection; verification of attention; spatialization; real-time human-machine interaction

27
Outline
  • Speech recognition applications
  • Speech recognition in the ubiquitous/wearable
    computing environment
  • Hands-free communication problems
  • Speaker recognition technology
  • Audio indexing
  • Multimodal human-machine communication

28
Information present in a speech signal
29
Speaker recognition
voice key
30
Principal structure of speaker recognition systems
31
Basic structure of speaker recognition systems: (a) speaker identification
32
Basic structure of speaker recognition systems: (b) speaker verification
33
Example of typical intraspeaker and interspeaker
distance distributions
34
Four conditional probabilitiesin speaker
verification
35
Relationship between error rate and decision
criterion (threshold) in speaker verification
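The trade-off in this figure can be illustrated with a minimal sketch (all scores below are invented toy values, not from any real system): sweep the decision threshold over sampled intra-speaker (true) and inter-speaker (impostor) scores, and find where the false-acceptance and false-rejection rates cross — the equal error rate (EER).

```python
import numpy as np

rng = np.random.default_rng(0)
true_scores = rng.normal(2.0, 1.0, 1000)       # claimed speaker == actual speaker
impostor_scores = rng.normal(-2.0, 1.0, 1000)  # claimed speaker != actual speaker

def error_rates(threshold):
    """Accept when score >= threshold; return (false acceptance, false rejection)."""
    fa = np.mean(impostor_scores >= threshold)  # impostors wrongly accepted
    fr = np.mean(true_scores < threshold)       # true speakers wrongly rejected
    return fa, fr

# Sweep thresholds; the equal error rate (EER) is where FA and FR cross.
thresholds = np.linspace(-6, 6, 1201)
fa, fr = np.array([error_rates(t) for t in thresholds]).T
eer_idx = np.argmin(np.abs(fa - fr))
print(f"threshold ~ {thresholds[eer_idx]:.2f}, "
      f"EER ~ {(fa[eer_idx] + fr[eer_idx]) / 2:.3f}")
```

Raising the threshold trades false acceptances for false rejections, which is exactly the relationship the slide's curves show.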
36
Receiver operating characteristic (ROC) curves
performance examples of three speaker
verification systems A, B, and D
37
Recognition error rates as a function of
population size in speaker identification and
verification
38
Text-dependent vs. text-independent methods
Text-dependent methods are usually based on template-matching techniques, so the structure of these systems is rather simple. Since this approach can directly exploit the voice individuality associated with each phoneme or syllable, it generally achieves higher recognition performance than text-independent methods. Text-independent methods, on the other hand, can be used in applications in which predetermined key words cannot be used. Another advantage is that verification can proceed sequentially, until a desired significance level is reached, without the annoyance of the speaker repeating the key words again and again.
39
Basic structure of DTW/HMM-based text-dependent
speaker recognition methods
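As a hedged sketch of the DTW half of this structure (toy feature vectors, not the actual system in the figure): align a test sequence to a reference template with dynamic time warping and use the accumulated distance as the dissimilarity score.

```python
import numpy as np

def dtw_distance(ref, test):
    """Accumulated DTW distance between two feature sequences (frames x dims)."""
    n, m = len(ref), len(test)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(ref[i - 1] - test[j - 1])
            # symmetric local path: match, or stay on one axis
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

# Toy cepstral-like sequences: `warped` is a time-warped copy of the reference.
ref = np.array([[0.0, 0.1], [0.5, 0.4], [1.0, 0.9], [0.2, 0.3]])
warped = np.repeat(ref, [1, 2, 2, 1], axis=0)  # some frames held longer
other = ref[::-1]                              # a mismatched sequence
print(dtw_distance(ref, warped), dtw_distance(ref, other))
```

The warped copy aligns perfectly (distance 0) despite its different length, which is the point of DTW for text-dependent matching.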
40
Block diagram indicating principal operation of
speaker recognition method using time series of
cepstral coefficients and their orthogonal
polynomial coefficients
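The first-order orthogonal polynomial (delta-cepstral) coefficients named in the title can be sketched as a windowed linear-regression slope over each cepstral trajectory; the window half-width K below is an assumed illustrative value, and the formula is the standard delta-cepstrum regression.

```python
import numpy as np

def delta(cep, K=2):
    """First-order orthogonal-polynomial (delta) coefficients of a cepstral
    time series `cep` (frames x coefficients), regression window +/- K frames."""
    T = len(cep)
    # replicate edge frames so every frame has a full regression window
    padded = np.concatenate([cep[:1].repeat(K, axis=0), cep,
                             cep[-1:].repeat(K, axis=0)])
    ks = np.arange(-K, K + 1)
    denom = np.sum(ks ** 2)
    return np.stack([(ks[:, None] * padded[t:t + 2 * K + 1]).sum(axis=0) / denom
                     for t in range(T)])

cep = np.linspace(0.0, 1.0, 11)[:, None]  # toy track rising 0.1 per frame
d = delta(cep)
```

For this linearly rising toy track, the interior delta coefficients recover the slope of 0.1 per frame, i.e. the local spectral dynamics that complement the static cepstrum.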
41
Basic structures of text-independent speaker
recognition methods
42
Variation of the long-time averaged spectrum at
four sessions over eight months, and
corresponding spectral envelopes derived from
cepstrum coefficients weighted by the square root
of inverse variances
43
Vector quantization (VQ)-based text-independent
speaker recognition
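A minimal sketch of the VQ approach (hand-rolled k-means on invented 2-D toy data; real systems use cepstral features and larger codebooks): build a small codebook per enrolled speaker and score a test utterance by its average quantization distortion against each codebook.

```python
import numpy as np

def train_codebook(frames, k=4, iters=20, seed=0):
    """Tiny k-means: return k codewords for one speaker's training frames."""
    rng = np.random.default_rng(seed)
    code = frames[rng.choice(len(frames), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest codeword, then recompute centroids
        d = np.linalg.norm(frames[:, None] - code[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                code[j] = frames[labels == j].mean(axis=0)
    return code

def distortion(frames, code):
    """Average distance of frames to their nearest codeword (lower = better match)."""
    return np.linalg.norm(frames[:, None] - code[None], axis=-1).min(axis=1).mean()

rng = np.random.default_rng(1)
spk_a = rng.normal(0.0, 0.3, (200, 2))  # toy "cepstral" frames, speaker A
spk_b = rng.normal(2.0, 0.3, (200, 2))  # speaker B
book_a, book_b = train_codebook(spk_a), train_codebook(spk_b)

test = rng.normal(0.0, 0.3, (50, 2))    # new utterance, actually from speaker A
scores = {"A": distortion(test, book_a), "B": distortion(test, book_b)}
```

Because no temporal alignment is involved, the same codebooks score any utterance text, which is what makes the method text-independent.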
44
A five-state ergodic HMM for text-independent
speaker verification
45
Basic structures of text-independent speaker
recognition methods (cont.)
46
Text-prompted Speaker Recognition Method
This method uses speaker-specific phoneme models as basic acoustic units. The recognition system prompts each user with a new key sentence every time the system is used, and accepts the input utterance only when it decides that the registered speaker uttered the prompted sentence. Because the vocabulary is unlimited, prospective impostors cannot know in advance what sentence they will be asked to repeat, so a pre-recorded voice can easily be rejected. One of the major issues with this method is how to properly create the speaker-specific phoneme models from the limited amount of training utterances available for each speaker.
47
Block diagram of the text-prompted speaker
recognition method
48
Sound spectrograms for word utterances by several
speakers
Speaker S /kogeN/
Same (2 years later)
Speaker S /baNgo/
Speaker M /kogeN/
Speaker F /kogeN/
Speaker U /kogeN/
49
Intersession variability (variability over time)
  • Sources of variability: speakers, recording and transmission conditions, noise
  • Normalization approaches: parameter domain, distance/similarity domain

50
Parameter-domain normalization
  • Cepstral mean normalization (subtraction) (CMN, CMS): compensates for linear channel effects and long-term spectral variation
  • Delta-cepstral coefficients
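A hedged sketch of CMN on invented toy cepstra (assuming the channel is stationary within the utterance): subtracting the per-utterance mean cepstrum cancels any constant spectral tilt, because convolutional channel effects become additive in the cepstral domain.

```python
import numpy as np

def cmn(cep):
    """Cepstral mean normalization: subtract the utterance-level mean
    from every frame (cep has shape frames x coefficients)."""
    return cep - cep.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.normal(size=(100, 13))  # toy cepstral frames
channel = np.full(13, 0.7)          # constant channel offset (convolution in the
telephone = clean + channel         # spectrum = addition in the cepstrum)

# After CMN, the channel-corrupted and clean utterances coincide.
assert np.allclose(cmn(telephone), cmn(clean))
```

The delta-cepstral coefficients in the second bullet attack the same problem differently: differencing over time likewise removes any constant cepstral offset.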

51
Distance/similarity-domain normalization
52
Cohort speakers
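A sketch of distance/similarity-domain normalization with a cohort (toy one-dimensional Gaussian speaker models; all means, variances, and data are invented for illustration): the claimed speaker's log-likelihood is normalized by the average log-likelihood over a cohort of similar speakers, which stabilizes the decision threshold across sessions and conditions.

```python
import numpy as np

def gauss_loglik(x, mean, std):
    """Mean per-frame log-likelihood under a 1-D Gaussian speaker model."""
    return np.mean(-0.5 * ((x - mean) / std) ** 2
                   - np.log(std * np.sqrt(2 * np.pi)))

claimed = (0.0, 1.0)                             # claimed speaker's model
cohort = [(0.5, 1.0), (-0.6, 1.2), (0.3, 0.9)]   # acoustically similar speakers

def normalized_score(x):
    """Cohort-normalized score: claimed log-likelihood minus cohort average."""
    return gauss_loglik(x, *claimed) - np.mean(
        [gauss_loglik(x, m, s) for m, s in cohort])

rng = np.random.default_rng(0)
genuine = rng.normal(0.0, 1.0, 500)   # utterance from the true speaker
impostor = rng.normal(3.0, 1.0, 500)  # utterance from someone else
print(normalized_score(genuine), normalized_score(impostor))
```

Because both terms shift together under a channel or session change, the normalized score is far less sensitive to intersession variability than the raw likelihood.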
53
Conventional speaker verification system
54
Speaker verification system including verbal
information verification (VIV)
[Diagram: pass-phrases from the first few accesses ("Open sesame") are checked by verbal information verification and saved; the verified pass-phrases are used for HMM training, yielding a speaker-dependent HMM stored in a database (automatic enrollment). At verification time, an identity claim and a test pass-phrase ("Open sesame") go to the speaker verifier, which outputs scores.]
55
Outline
  • Speech recognition applications
  • Speech recognition in the ubiquitous/wearable
    computing environment
  • Hands-free communication problems
  • Speaker recognition technology
  • Audio indexing
  • Multimodal human-machine communication

56
BBN's Rough 'n' Ready audio indexing system
  • Processes recorded audio from broadcast news,
    meetings, etc.
  • Produces partial transcripts
  • Identifies entities spoken about (people,
    companies, etc.)
  • Indexes words, concepts, and speakers
  • Locates segments where each person is speaking
  • Assigns multiple, ranked topics to segments
  • Automatically creates structural summaries and
    stores them
  • Quickly and easily retrieves relevant information
  • Allows users to skim for topics without listening
    to the recording

57
Architecture of the Rough 'n' Ready audio indexing system
[Diagram: incoming audio is compressed and stored on an audio server; speaker segmentation and speech recognition feed speaker identification, clustering, name spotting, classification, and story segmentation; story indexing produces metadata as an XML index and XML corpus, which a database uploader loads into an MS SQL server and an IR index server; users query over a WAN, LAN, or local bus through information retrieval in MS Internet Explorer.]
58
The SCANMail architecture
59
Outline
  • Speech recognition applications
  • Speech recognition in the ubiquitous/wearable
    computing environment
  • Hands-free communication problems
  • Speaker recognition technology
  • Audio indexing
  • Multimodal human-machine communication

60
Multimodal human-machine communication (HMC)
61
Modality-oriented classification of multimodal
systems
62
Multimedia contents technology
63
Information extraction and retrieval of spoken language content (spoken document retrieval, information indexing, story segmentation, topic tracking, topic detection, etc.)
64
Architecture of multimodal human/computer
interaction
65
Multi-modal dialogue system structure
66
Dialogue system specifications
Task: shop/business information retrieval (names, addresses, phone numbers, specialties, etc.)
Keywords: district/station names, business names, special requests, and shop names
Acoustic model: task-independent triphone HMMs trained on phonetically balanced sentence utterances and dialogue utterances from many speakers
Language model: FSN (finite state network) with fillers, or class bigrams/trigrams; each word/phrase is a DAWG structure in the FSN LM case
67
Audio-visual speech recognition system using optical-flow analysis
68
Audio-only and audio-visual connected digit recognition results (λ: optimum audio-weighting factor)
69
  • Multi-stream HMM consisting of three audio and
    three video HMM states.
  • The corresponding product (composite) HMM that
    limits the possible asynchrony between the audio
    and visual state sequences to one state.
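The stream-weighted emission probability typically used with such multi-stream models can be written as follows (notation assumed rather than taken from the slide; λ is the audio weight, tuned to the relative reliability of the two streams):

```latex
b_j(\mathbf{o}_t) \;=\; \bigl[b_j^{A}(\mathbf{o}_t^{A})\bigr]^{\lambda}\,
                        \bigl[b_j^{V}(\mathbf{o}_t^{V})\bigr]^{1-\lambda},
\qquad 0 \le \lambda \le 1 .
```

With λ = 1 the model degenerates to audio-only recognition; lowering λ shifts weight to the visual stream, e.g. in noisy acoustic conditions.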

70
Late integration model
71
Overview of an audio-visual speech system
72
General flow chart of a talking head system
73
Summary
  • Speech and speaker recognition technology has many potential applications.
  • Multimodal human-computer communication and information extraction in the ubiquitous/wearable computing environment have a bright future.
  • With such systems, everybody will be able to access information services anytime, anywhere, and these services are expected to augment various human intellectual activities.
  • Robust conversation using wireless handheld/hands-free devices in the real mobile computing environment will be crucial.