Title: Can Advances in Speech Recognition make Spoken Language as Convenient and as Accessible as Online Text?
1. Can Advances in Speech Recognition make Spoken Language as Convenient and as Accessible as Online Text?
- Joseph Picone, PhD
- Professor, Electrical Engineering
- Mississippi State University
- Patti Price, PhD
- VP Business Development, BravoBrava
2. Outline
- Introduction and state of the art (Price)
- Research issues (Picone)
- Evaluation metrics
- Acoustic modeling
- Language modeling
- Practical issues
- Technology demands
- Conclusion and future directions (Price)
3. Introduction: What is Speech Recognition?
Goal: Automatically extract the string of words spoken from the speech signal.
- Speech recognition does NOT determine:
- Who the talker is (speaker recognition; Heck and Reynolds)
- How to produce speech output (speech synthesis; Fruchterman and Ostendorf)
- What the words mean (speech understanding)
4. Introduction: Speech in the Information Age
- Speech and text were revolutionary because they expanded access to information.
- New media and connectivity yield information overload.
- Can speech technology help?
[Figure: access to information over time, from "listen and remember" through "read books," "computer typing," and "careful spoken and written input," toward "conversational language"]
5. State of the Art: Initial and Current Applications (1997)
- Command and control
- Manufacturing
- Consumer products (http://www.speech.be.philips.com/)
- Database query
- Resource management
- Flight information
- Stock quotes (Nuance; American Airlines, 1-800-433-7300, touch 1)
- Dictation
- http://www.dragonsys.com
- http://www-4.ibm.com/software/speech
6. State of the Art: How Do You Measure?
- What benchmarks were used?
- Have other systems used the same ones?
- What was the training set? What was the test set?
- Were training and test independent?
- How large were the vocabulary and the sample size?
- What speakers? What style of speech? What kind of noise?
7. State of the Art: Factors that Affect Performance
8. Evaluation Metrics: Evolution
- Spontaneous telephone speech is still a grand challenge.
- Telephone-quality speech is still central to the problem.
- The vision for speech technology continues to evolve.
- Broadcast news is a very dynamic domain.
[Figure: word error rate (0-40%) vs. level of difficulty, rising through command and control, digits, letters and numbers, continuous digits, read speech, and broadcast news to conversational speech]
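Word error rate itself is computed by a dynamic-programming alignment of the hypothesis against the reference transcription. A minimal sketch in Python (a simplified scorer, not the official benchmark tooling):

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + insertions + deletions) / reference
    length, via Levenshtein distance over word sequences."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between the first i ref words and first j hyp words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i          # deleting i reference words
    for j in range(len(h) + 1):
        d[0][j] = j          # inserting j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that insertions can push WER above 100%, which is why early conversational-speech systems could report error rates near or above 100%.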
9. Evaluation Metrics: Human Performance
- Human performance exceeds machine performance by a factor ranging from 4x to 10x, depending on the task.
- On some tasks, such as credit card number recognition, machine performance exceeds human performance because of limits on human memory retrieval.
- The nature of the noise is as important as the SNR (e.g., cellular phones).
- A primary failure mode for humans is inattention.
- A second major failure mode is lack of familiarity with the domain (e.g., business terms and corporation names).
[Figure: word error rate (0-20%) vs. speech-to-noise ratio (quiet, 22 dB, 16 dB, 10 dB) on Wall Street Journal data with additive noise, comparing machines against a committee of human listeners]
10. Evaluation Metrics: Machine Performance
[Figure: benchmark word error rates, 1988-2003, on a log scale from 1% to 100%: read speech at 1k, 5k, and 20k-word vocabularies (including noisy and foreign-language conditions), broadcast speech, conversational speech, spontaneous speech, and varied microphones]
11. Evaluation Metrics: Beyond WER (Named Entity)
- Information extraction is the analysis of natural language to collect information about specified types of entities.
- As the focus shifts to providing enhanced annotations, WER may not be the most appropriate measure of performance (content-based scoring).
12. Recognition Architectures: Why Is Speech Recognition So Difficult?
- Our measurements of the signal are ambiguous.
- The region of overlap between classes represents classification errors.
- We reduce overlap by introducing acoustic and linguistic context (e.g., context-dependent phones).
[Figure: three phone classes (Ph_1, Ph_2, Ph_3) plotted in a two-dimensional feature space (Feature No. 1 vs. Feature No. 2) with overlapping regions]
13. Recognition Architectures: A Communication-Theoretic Approach
[Figure: Message Source → Linguistic Channel → Articulatory Channel → Acoustic Channel; the observables are Message → Words → Sounds → Features]
- Bayesian formulation for speech recognition:
  P(W|A) = P(A|W) P(W) / P(A)
- Objective: minimize the word error rate.
- Approach: maximize P(W|A) during training.
- Components:
- P(A|W): acoustic model (hidden Markov models, mixtures)
- P(W): language model (statistical, finite state networks, etc.)
- The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
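The decision rule above can be sketched directly: since P(A) is constant over candidate word strings, decoding just maximizes P(A|W) P(W), usually in log space to avoid underflow. A toy illustration (the candidate strings and scores are hypothetical, not output from a real recognizer):

```python
# Hypothetical log-probabilities for two candidate transcriptions of
# the same acoustics A (illustrative numbers only).
log_p_a_given_w = {"recognize speech": -12.0,    # acoustic model, log P(A|W)
                   "wreck a nice beach": -11.0}
log_p_w = {"recognize speech": -3.0,             # language model, log P(W)
           "wreck a nice beach": -9.0}

# Bayes decision rule: argmax over W of log P(A|W) + log P(W).
# log P(A) is identical for every candidate, so it can be dropped.
best = max(log_p_a_given_w,
           key=lambda w: log_p_a_given_w[w] + log_p_w[w])
```

Here the acoustic model slightly prefers the wrong string, but the language model's strong preference for the plausible word sequence wins the combined score.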
14. Recognition Architectures: Incorporating Multiple Knowledge Sources
[Figure: recognition architecture combining the input speech with knowledge sources such as the language model P(W)]
15. Acoustic Modeling: Feature Extraction
[Figure: Input Speech → Fourier Transform → Cepstral Analysis → Perceptual Weighting → Energy and Mel-Spaced Cepstrum; a Time Derivative yields Delta Energy and Delta Cepstrum, and a second Time Derivative yields Delta-Delta Energy and Delta-Delta Cepstrum]
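A bare-bones version of this front end can be sketched in pure Python. This is a simplification: a plain log power spectrum stands in for the mel-spaced, perceptually weighted filter bank of a real system, and the DFT is computed naively.

```python
import math

def frame_signal(x, frame_len=256, hop=128):
    """Split the waveform into overlapping frames."""
    return [x[i:i + frame_len] for i in range(0, len(x) - frame_len + 1, hop)]

def log_spectrum(frame):
    """Hamming window + DFT magnitude + log (simplified: no mel filter bank)."""
    n = len(frame)
    w = [s * (0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)))
         for k, s in enumerate(frame)]
    spec = []
    for f in range(n // 2):
        re = sum(v * math.cos(2 * math.pi * f * k / n) for k, v in enumerate(w))
        im = -sum(v * math.sin(2 * math.pi * f * k / n) for k, v in enumerate(w))
        spec.append(math.log(re * re + im * im + 1e-10))  # floor avoids log(0)
    return spec

def cepstrum(logspec, ncep=13):
    """Inverse cosine transform of the log spectrum gives cepstral coefficients."""
    n = len(logspec)
    return [sum(s * math.cos(math.pi * q * (k + 0.5) / n)
                for k, s in enumerate(logspec)) / n
            for q in range(ncep)]

def deltas(cep_frames):
    """First-order time derivative of the cepstral trajectory (frame differences)."""
    return [[c2 - c1 for c1, c2 in zip(a, b)]
            for a, b in zip(cep_frames, cep_frames[1:])]
```

Applying `deltas` twice gives the delta-delta features; stacking energy, cepstra, deltas, and delta-deltas yields the feature vector the acoustic models consume.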
16. Acoustic Modeling: Hidden Markov Models
- Acoustic models encode the temporal evolution of the features (spectrum).
- Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
- Phonetic model topologies are simple left-to-right structures.
- Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
- Sharing model parameters is a common strategy to reduce complexity.
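The left-to-right topology can be made concrete with the forward algorithm, which scores an observation sequence against such a model. This is a minimal discrete-symbol sketch; real systems use Gaussian mixture emission densities and work in log space.

```python
def forward(trans, emit, observations):
    """Forward algorithm: returns P(observations | HMM).

    trans[i][j] is the probability of moving from state i to state j
    (a left-to-right topology has trans[i][j] == 0 for j < i);
    emit[j][o] is the probability of emitting symbol o in state j.
    The model starts in state 0."""
    n = len(trans)
    # alpha[j] = P(observations so far, currently in state j)
    alpha = [emit[0][observations[0]] if j == 0 else 0.0 for j in range(n)]
    for o in observations[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return sum(alpha)
```

Running this inside EM (the Forward-Backward procedure on the next slide) re-estimates `trans` and `emit` so the training data becomes more likely.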
17. Acoustic Modeling: Parameter Estimation
- Closed-loop, data-driven modeling supervised only by a word-level transcription.
- The expectation-maximization (EM) algorithm is used to improve our parameter estimates.
- Computationally efficient training algorithms (Forward-Backward) have been crucial.
- Batch-mode parameter updates are typically preferred.
- Decision trees are used to optimize parameter sharing, system complexity, and the use of additional linguistic knowledge.
18. Language Modeling: Is A Lot Like Wheel of Fortune
19. Language Modeling: N-Grams, The Good, The Bad, and The Ugly
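The "good" of N-grams is that they are trivial to estimate from counts; the "bad and ugly" is that most word pairs never occur in training, so unsmoothed estimates assign them zero probability. A minimal bigram sketch with add-alpha smoothing (illustrative only; production systems use more sophisticated smoothing such as back-off or interpolation, mentioned on the search slide):

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over sentences padded with boundary tokens."""
    uni, bi = Counter(), Counter()
    for s in sentences:
        words = ["<s>"] + s.split() + ["</s>"]
        uni.update(words)
        bi.update(zip(words, words[1:]))   # adjacent word pairs
    return uni, bi

def bigram_prob(uni, bi, prev, word, vocab_size, alpha=1.0):
    """P(word | prev) with add-alpha smoothing: unseen bigrams still
    receive a small nonzero probability."""
    return (bi[(prev, word)] + alpha) / (uni[prev] + alpha * vocab_size)
```

During decoding, the recognizer queries exactly this kind of conditional probability to predict a small set of likely next words.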
20. Language Modeling: Integration of Natural Language
21. Implementation Issues: Search Is Resource Intensive
- Typical LVCSR systems have about 10M free parameters, which makes training a challenge.
- Large speech databases are required (several hundred hours of speech).
- Tying, smoothing, and interpolation are required.
22. Implementation Issues: Dynamic Programming-Based Search
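The dynamic-programming idea can be sketched with the Viterbi algorithm: at each time step, each state keeps only the single best-scoring path into it, so the search never enumerates all state sequences. This is a toy discrete-HMM version; real LVCSR decoders add beam pruning, lexical trees, and cross-word context.

```python
import math

def viterbi(states, log_start, log_trans, log_emit, observations):
    """Dynamic-programming search: for each state and time step, keep only
    the best-scoring partial path; any other path into that state can
    never overtake it later."""
    # best[s] = (log score of best path ending in s, that path)
    best = {s: (log_start[s] + log_emit[s][observations[0]], [s])
            for s in states}
    for o in observations[1:]:
        new = {}
        for s in states:
            # best predecessor for state s at this time step
            prev = max(states, key=lambda p: best[p][0] + log_trans[p][s])
            score = best[prev][0] + log_trans[prev][s] + log_emit[s][o]
            new[s] = (score, best[prev][1] + [s])
        best = new
    return max(best.values(), key=lambda v: v[0])[1]
```

With S states and T observations this costs O(S^2 T) instead of the O(S^T) of exhaustive enumeration, which is what makes decoding tractable at all.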
23. Implementation Issues: Cross-Word Decoding Is Expensive
- Cross-word decoding: since word boundaries don't occur in spontaneous speech, we must allow for sequences of sounds that span word boundaries.
- Cross-word decoding significantly increases memory requirements.
24. Implementation Issues: Decoding Example
25. Implementation Issues: Internet-Based Speech Recognition
26. Technology: Conversational Speech
- Conversational speech collected over the telephone contains background noise, music, fluctuations in the speech rate, laughter, partial words, hesitations, mouth noises, etc.
- WER has decreased from 100% to 30% in six years.
- Phenomena encountered include:
- Laughter
- Singing
- Unintelligible speech
- Spoonerisms
- Background speech
- No pauses
- Restarts
- Vocalized noise
- Coinages
27. Technology: Audio Indexing of Broadcast News
- Broadcast news offers some unique challenges:
- Lexicon: important information appears in infrequently occurring words.
- Acoustic modeling: variations in channel, particularly within the same segment (in the studio vs. on location).
- Language model: must adapt (Bush, Clinton, Bush, McCain, ???).
- Language: multilingual systems? Language-independent acoustic modeling?
28. Technology: Real-Time Translation
- From President Clinton's State of the Union address (January 27, 2000):
- "These kinds of innovations are also propelling our remarkable prosperity... Soon researchers will bring us devices that can translate foreign languages as fast as you can talk... molecular computers the size of a tear drop with the power of today's fastest supercomputers."
- Human language engineering: a sophisticated integration of many speech- and language-related technologies... a science for the next millennium.
29. Technology: Future Directions
- What are the algorithmic issues for the next decade?
- Better features by extracting articulatory information?
- Bayesian statistics? Bayesian networks?
- Decision trees? Information-theoretic measures?
- Nonlinear dynamics? Chaos?
30. To Probe Further: References
Journals and Conferences:
[1] N. Deshmukh, et al., "Hierarchical Search for Large-Vocabulary Conversational Speech Recognition," IEEE Signal Processing Magazine, vol. 1, no. 5, pp. 84-107, September 1999.
[2] N. Deshmukh, et al., "Benchmarking Human Performance for Continuous Speech Recognition," Proceedings of the Fourth International Conference on Spoken Language Processing, pp. SuP1P1.10, Philadelphia, Pennsylvania, USA, October 1996.
[3] R. Grishman, "Information Extraction and Speech Recognition," presented at the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, USA, February 1998.
[4] R. P. Lippmann, "Speech Recognition By Machines and Humans," Speech Communication, vol. 22, pp. 1-15, July 1997.
[5] M. Maybury (editor), "News on Demand," Communications of the ACM, vol. 43, no. 2, February 2000.
[6] D. Miller, et al., "Named Entity Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[7] D. Pallett, et al., "Broadcast News Benchmark Test Results," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[8] J. Picone, "Signal Modeling Techniques in Speech Recognition," IEEE Proceedings, vol. 81, no. 9, pp. 1215-1247, September 1993.
[9] P. Robinson, et al., "Overview: Information Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[10] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1998.
URLs and Resources:
[11] Speech Corpora, The Linguistic Data Consortium, http://www.ldc.upenn.edu.
[12] Technology Benchmarks, Spoken Natural Language Processing Group, The National Institute for Standards, http://www.itl.nist.gov/iaui/894.01/index.html.
[13] Signal Processing Resources, Institute for Signal and Information Technology, Mississippi State University, http://www.isip.msstate.edu.
[14] Internet-Accessible Speech Recognition Technology, http://www.isip.msstate.edu/projects/speech/index.html.
[15] A Public Domain Speech Recognition System, http://www.isip.msstate.edu/projects/speech/software/index.html.
[16] Remote Job Submission, http://www.isip.msstate.edu/projects/speech/experiments/index.html.
[17] The Switchboard Corpus, http://www.isip.msstate.edu/projects/switchboard/index.html.
31. Conclusion and Future Directions: Trends
- We need new technology to help with information overload.
- Speech information sources are everywhere:
- Voice mail messages
- Professional talks
- Lectures, broadcasts
- Speech sources of information will increase:
- As devices shrink
- As mobility increases
- New uses: annotation, documentation
32. Conclusion and Future Directions: Limitations on Applications
- Recognition performance, especially in error recovery
- Natural language understanding (speech differs from text):
- Speech unfolds linearly in time
- Speech is more indeterminate than text
- Speech has different syntax and semantics
- Prosody differs from punctuation
- Cost to develop applications (too few experts)
- Cost to integrate/interoperate with other technologies
- New capabilities:
- "When did he say Y, and was he angry?"
- Scanning, refocusing quickly (browsing)
- Proactive information: match past patterns, find novel aspects
- Gist, summarize, translate for different purposes
33. Conclusion and Future Directions: Applications on the Horizon
- Beginnings of speech as a source of information:
- ISLIP: http://www.mediasite.net/info/frames.htm
- Virage: http://www.virage.com
- Speech technology in education and training
- Cliff Stoll, High Tech Heretic:
- Good schools need no computers
- Bad schools won't be improved by them
- BravoBrava: co-evolving technology and people can:
- Dramatically reduce the cost of delivery of content
- Increase its timeliness, quality, and appropriateness
- Target the needs of an individual and/or group
- Reading Pal demo
34. Reading Pal
- The child reads; errors are shown in red.
- The child looks up the word "massive".
- Clicks "Listen To" to play back from "massive".
- Clicks "You" to play it back.
35. Summary: Goal, Speech Better Than Text
- A healthy loop exists between research and applications: research leads to applications, which lead to new research opportunities.
- We need collaboration: this is too much for one person, one site, or one country.
- Humans will probably continue to be better than machines at many things.
- Can we learn to use technology and training to augment human-human and human-machine collaboration?
- We need to micronize education and training.
- It's not a solved problem; further technology development is needed to enable the vision.