Can Advances in Speech Recognition make Spoken Language as Convenient and as Accessible as Online Text?


1
Can Advances in Speech Recognition make
Spoken Language as Convenient and as
Accessible as Online Text?
  • Joseph Picone, PhD
  • Professor, Electrical Engineering
  • Mississippi State University

  • Patti Price, PhD
  • VP Business Development, BravoBrava
2
Outline
  • Introduction and state of the art (Price)
  • Research issues (Picone)
  • Evaluation metrics
  • Acoustic modeling
  • Language modeling
  • Practical issues
  • Technology demands
  • Conclusion and future directions (Price)

3
Introduction: What is Speech Recognition?
Goal: Automatically extract the string of words
spoken from the speech signal
  • Speech recognition does NOT determine
  • Who the talker is (speaker recognition, Heck and
    Reynolds)
  • What the speech output should sound like (speech
    synthesis, Fruchterman and Ostendorf)
  • What the words mean (speech understanding)

4
Introduction: Speech in the Information Age
  • Speech and text were revolutionary because of
    information access
  • New media and connectivity yield information
    overload
  • Can speech technology help?

[Figure: access to information over time, from listening and remembering, to reading books, to computer typing, to careful spoken and written input, to conversational language]
5
State of the Art: Initial and Current Applications
1997
  • Command and control
  • Manufacturing
  • Consumer products

http://www.speech.be.philips.com/
  • Database query
  • Resource management
  • Flight information
  • Stock quote

Nuance, American Airlines 1-800-433-7300, touch 1
  • Dictation
  • http://www.dragonsys.com
  • http://www-4.ibm.com/software/speech

6
State of the Art: How Do You Measure?
  • What benchmarks?
  • Have other systems used the same ones?
  • What was the training set?
  • What was the test set?
  • Were training and test independent?
  • How large were the vocabulary and the sample size?
  • What speakers?
  • What style of speech?
  • What kind of noise?

7
State of the Art: Factors that Affect Performance
8
Evaluation Metrics: Evolution
[Figure: word error rate (0-40%) vs. level of difficulty, rising from digits, letters and numbers, command and control, and continuous digits, through read speech and broadcast news, up to conversational speech]
  • Spontaneous telephone speech is still a grand
    challenge.
  • Telephone-quality speech is still central to the
    problem.
  • The vision for speech technology continues
    to evolve.
  • Broadcast news is a very dynamic domain.
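Word error rate, the metric on these charts, is conventionally computed as (substitutions + deletions + insertions) divided by the number of reference words, via a word-level edit distance. A minimal sketch of the core computation (not the NIST scoring tool, just an illustration):

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed as a Levenshtein distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat", "the cat sat on"))  # 1 insertion / 3 words
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why early conversational-speech systems could score at or above 100%.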
9
Evaluation Metrics: Human Performance
  • Human performance exceeds machine
    performance by a factor ranging from
    4x to 10x, depending on the task.
  • On some tasks, such as credit card number
    recognition, machine performance exceeds humans'
    due to limits on human memory retrieval.
  • The nature of the noise is as important as the
    SNR (e.g., cellular phones).
  • A primary failure mode for humans is inattention.
  • A second major failure mode is lack of
    familiarity with the domain (e.g., business
    terms and corporation names).

[Figure: word error rate (0-20%) vs. speech-to-noise ratio (quiet, 22 dB, 16 dB, 10 dB) on the Wall Street Journal task with additive noise; machine error rates rise sharply with noise while a committee of human listeners remains far more robust]
10
Evaluation Metrics: Machine Performance
[Figure: benchmark word error rates, 1988-2003, on a log scale (1-100%), for read speech (1k, 5k, and 20k vocabularies, noisy, foreign), broadcast speech, conversational speech, spontaneous speech, and varied microphones; a 10x improvement is marked across the period]
11
Evaluation Metrics: Beyond WER (Named Entity)
  • Information extraction is the analysis of
    natural language to collect information
    about specified types of entities.
  • As the focus shifts to providing enhanced
    annotations, WER may not be the most
    appropriate measure of performance (content-based
    scoring).

12
Recognition Architectures: Why Is Speech
Recognition So Difficult?
[Figure: three phone classes (Ph_1, Ph_2, Ph_3) plotted in a two-dimensional feature space (Feature No. 1 vs. Feature No. 2), with overlapping regions]
  • Our measurements of the
    signal are ambiguous.
  • The region of overlap represents classification
    errors.
  • Overlap is reduced by introducing acoustic and
    linguistic context (e.g., context-dependent
    phones).
13
Recognition Architectures: A Communication
Theoretic Approach
[Figure: a message source passes through linguistic, articulatory, and acoustic channels, mapping the observable message to words, to sounds, to features]
  • Bayesian formulation for speech recognition:
    P(W|A) = P(A|W) P(W) / P(A)
  • Objective: minimize the word error rate
  • Approach: maximize P(W|A) during training
  • Components
  • P(A|W): acoustic model (hidden Markov models,
    mixtures)
  • P(W): language model (statistical, finite
    state networks, etc.)
  • The language model typically predicts a small set
    of next words based on
    knowledge of a finite number of previous words
    (N-grams).
14
Recognition Architectures: Incorporating Multiple
Knowledge Sources
[Figure: block diagram combining the input speech with the acoustic model and the language model P(W)]
15
Acoustic Modeling: Feature Extraction
[Figure: front-end pipeline: input speech passes through a Fourier transform, cepstral analysis, and perceptual weighting to produce energy and the mel-spaced cepstrum; time derivatives yield delta energy and delta cepstrum, and second derivatives yield delta-delta energy and delta-delta cepstrum]
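The front end above can be sketched as follows. This is a toy implementation with illustrative parameters: the filterbank here uses linear spacing rather than a true mel-frequency warping, and real systems add pre-emphasis, frame overlap, and perceptual weighting:

```python
import numpy as np

def mfcc_like(signal, frame_len=200, n_fft=256, n_mel=20, n_ceps=12):
    """Toy front end mirroring the pipeline: windowed FFT,
    mel-style filterbank, log, DCT (cepstrum), then delta features."""
    # Frame the signal and apply a Hamming window
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    spectra = np.abs(np.fft.rfft(frames * np.hamming(frame_len), n_fft)) ** 2

    # Triangular filterbank (simplified: linear spacing, not true mel)
    pts = np.linspace(0, spectra.shape[1] - 1, n_mel + 2).astype(int)
    fbank = np.zeros((n_mel, spectra.shape[1]))
    for m in range(n_mel):
        lo, mid, hi = pts[m], pts[m + 1], pts[m + 2]
        fbank[m, lo:mid] = np.linspace(0, 1, mid - lo, endpoint=False)
        fbank[m, mid:hi] = np.linspace(1, 0, hi - mid, endpoint=False)
    log_mel = np.log(spectra @ fbank.T + 1e-10)

    # Cepstrum via DCT-II, keeping the first n_ceps coefficients
    k = np.arange(n_mel)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), (k + 0.5)) / n_mel)
    ceps = log_mel @ dct.T

    # Delta (first time derivative) by simple frame differencing
    delta = np.diff(ceps, axis=0, prepend=ceps[:1])
    return np.hstack([ceps, delta])

feats = mfcc_like(np.random.randn(8000))
print(feats.shape)  # (40, 24): 40 frames, 12 cepstra + 12 deltas
```

A second differencing pass over `delta` would give the delta-delta features shown in the diagram.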
16
Acoustic Modeling: Hidden Markov Models
  • Acoustic models encode the temporal evolution of
    the features (spectrum).
  • Gaussian mixture distributions are used to
    account for variations in speaker, accent, and
    pronunciation.
  • Phonetic model topologies are simple
    left-to-right structures.
  • Skip states (time-warping) and multiple paths
    (alternate pronunciations) are also common
    features of models.
  • Sharing model parameters is a common strategy to
    reduce complexity.
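A minimal sketch of the left-to-right topology with self-loops and a skip transition, using illustrative numbers (real systems use mixtures of multivariate Gaussians over the cepstral features, not the single 1-D Gaussians here):

```python
import numpy as np

# Three emitting states in a left-to-right topology: each state may
# self-loop (time-warping), advance one state, or skip ahead one state.
A = np.array([
    [0.6, 0.3, 0.1],   # state 0: self-loop, advance, skip
    [0.0, 0.7, 0.3],   # state 1: left-to-right, so no transitions back
    [0.0, 0.0, 1.0],   # state 2: final absorbing state
])

# One Gaussian per state (1-D, illustrative numbers only)
means = np.array([0.0, 1.0, 2.0])
variances = np.array([1.0, 1.0, 1.0])

def emission_prob(state, x):
    """Likelihood of observation x under the state's Gaussian."""
    v = variances[state]
    return np.exp(-(x - means[state]) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

# The upper-triangular transition matrix is what enforces the
# left-to-right structure; each row is a proper distribution.
assert np.allclose(A, np.triu(A)) and np.allclose(A.sum(axis=1), 1.0)
```

Parameter sharing, in this picture, amounts to letting several context-dependent phone models point at the same `A` rows or Gaussian parameters.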

17
Acoustic Modeling: Parameter Estimation
  • Closed-loop, data-driven modeling, supervised only
    by a word-level transcription.
  • The expectation/maximization (EM) algorithm is
    used to improve our parameter estimates.
  • Computationally efficient training algorithms
    (Forward-Backward) have been crucial.
  • Batch mode parameter updates are typically
    preferred.
  • Decision trees are used to optimize
    parameter-sharing, system complexity, and the
    use of additional linguistic knowledge.
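The Forward-Backward training mentioned above rests on the forward recursion, which accumulates path probabilities efficiently instead of enumerating every state sequence. A minimal sketch on a toy two-state, discrete-emission model (illustrative numbers, not from the slides):

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward recursion: alpha[t, j] = P(o_1..o_t, state_t = j).
    B[j, o] is the probability of emitting symbol o from state j."""
    T, N = len(obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()   # total likelihood P(o_1..o_T)

A = np.array([[0.7, 0.3], [0.4, 0.6]])   # transition probabilities
B = np.array([[0.9, 0.1], [0.2, 0.8]])   # discrete emission probabilities
pi = np.array([0.5, 0.5])                # initial state probabilities
print(forward(A, B, pi, [0, 1, 0]))      # 0.099375
```

The backward pass mirrors this recursion from the end of the utterance; EM combines the two to re-estimate `A` and `B` in batch mode.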

18
Language Modeling Is a Lot Like Wheel of Fortune
19
Language Modeling: N-Grams, The Good, The Bad, and
The Ugly
20
Language Modeling: Integration of Natural Language
21
Implementation Issues: Search Is Resource-Intensive
  • Typical LVCSR systems have about 10M free
    parameters, which makes training a challenge.
  • Large speech databases are required (several
    hundred hours of speech).
  • Tying, smoothing, and interpolation are required.

22
Implementation Issues: Dynamic Programming-Based
Search
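Dynamic-programming search over an HMM is the Viterbi algorithm: at each frame, keep only the best-scoring path into each state. A minimal sketch on a toy two-state model (illustrative numbers, not from the slides):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Dynamic-programming search for the single best state sequence."""
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))             # best path score ending in each state
    back = np.zeros((T, N), dtype=int)   # best predecessor per state
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1, :, None] * A   # N x N candidate extensions
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    # Trace back the best path from the best final state
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
print(viterbi(A, B, pi, [0, 1, 1, 0]))  # [0, 1, 1, 0]
```

LVCSR decoders apply the same recursion over millions of states, which is why beam pruning and careful memory layout dominate the engineering effort.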
23
Implementation Issues: Cross-Word Decoding Is
Expensive
  • Cross-word decoding: since word boundaries don't
    occur in spontaneous speech, we must allow for
    sequences of sounds that span word boundaries.
  • Cross-word decoding significantly increases
    memory requirements.

24
Implementation Issues: Decoding Example
25
Implementation Issues: Internet-Based Speech
Recognition
26
Technology: Conversational Speech
  • Conversational speech collected over the
    telephone contains background
    noise, music, fluctuations in the speech rate,
    laughter, partial words,
    hesitations, mouth noises, etc.
  • WER has decreased from 100% to 30% in six years.
  • Laughter
  • Singing
  • Unintelligible
  • Spoonerism
  • Background Speech
  • No pauses
  • Restarts
  • Vocalized Noise
  • Coinage

27
Technology: Audio Indexing of Broadcast News
  • Broadcast news offers some unique
    challenges
  • Lexicon: important information lies in
    infrequently occurring words
  • Acoustic modeling: variations in channel,
    particularly within the same segment (in the
    studio vs. on location)
  • Language model: must adapt (Bush,
    Clinton, Bush, McCain, ???)
  • Language: multilingual systems?
    language-independent acoustic modeling?

28
Technology: Real-Time Translation
  • From President Clinton's State of the Union
    address (January 27, 2000):
  • "These kinds of innovations are also propelling
    our remarkable prosperity...
    Soon researchers will bring us devices that can
    translate foreign languages
    as fast as you can talk... molecular computers
    the size of a tear drop with the
    power of today's fastest supercomputers."
  • Human language engineering: a sophisticated
    integration of many speech- and
    language-related technologies... a science
    for the next millennium.

29
Technology: Future Directions
  • What are the algorithmic issues for the next
    decade?
  • Better features by extracting articulatory
    information?
  • Bayesian statistics? Bayesian networks?
  • Decision Trees? Information-theoretic measures?
  • Nonlinear dynamics? Chaos?

30
To Probe Further: References
Journals and Conferences
[1] N. Deshmukh, et al., "Hierarchical Search for Large-Vocabulary Conversational Speech Recognition," IEEE Signal Processing Magazine, vol. 1, no. 5, pp. 84-107, September 1999.
[2] N. Deshmukh, et al., "Benchmarking Human Performance for Continuous Speech Recognition," Proceedings of the Fourth International Conference on Spoken Language Processing, pp. SuP1P1.10, Philadelphia, Pennsylvania, USA, October 1996.
[3] R. Grishman, "Information Extraction and Speech Recognition," presented at the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, Virginia, USA, February 1998.
[4] R. P. Lippmann, "Speech Recognition By Machines and Humans," Speech Communication, vol. 22, pp. 1-15, July 1997.
[5] M. Maybury (editor), "News on Demand," Communications of the ACM, vol. 43, no. 2, February 2000.
[6] D. Miller, et al., "Named Entity Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[7] D. Pallett, et al., "Broadcast News Benchmark Test Results," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[8] J. Picone, "Signal Modeling Techniques in Speech Recognition," IEEE Proceedings, vol. 81, no. 9, pp. 1215-1247, September 1993.
[9] P. Robinson, et al., "Overview: Information Extraction from Broadcast News," presented at the DARPA Broadcast News Workshop, Herndon, Virginia, USA, February 1999.
[10] F. Jelinek, Statistical Methods for Speech Recognition, MIT Press, 1998.
URLs and Resources
[11] Speech Corpora, The Linguistic Data Consortium, http://www.ldc.upenn.edu.
[12] Technology Benchmarks, Spoken Natural Language Processing Group, The National Institute for Standards, http://www.itl.nist.gov/iaui/894.01/index.html.
[13] Signal Processing Resources, Institute for Signal and Information Technology, Mississippi State University, http://www.isip.msstate.edu.
[14] Internet-Accessible Speech Recognition Technology, http://www.isip.msstate.edu/projects/speech/index.html.
[15] A Public Domain Speech Recognition System, http://www.isip.msstate.edu/projects/speech/software/index.html.
[16] Remote Job Submission, http://www.isip.msstate.edu/projects/speech/experiments/index.html.
[17] The Switchboard Corpus, http://www.isip.msstate.edu/projects/switchboard/index.html.
31
Conclusion and Future Directions: Trends
  • We need new technology to help with information
    overload
  • Speech information sources are everywhere
  • Voice mail messages
  • Professional talk
  • Lectures, broadcasts
  • Speech sources of information will increase
  • As devices shrink
  • As mobility increases
  • New uses annotation, documentation

32
Conclusion and Future Directions: Limitations on
Applications
  • Recognition performance, especially in error
    recovery
  • Natural language understanding (speech differs
    from text)
  • Speech unfolds linearly in time
  • Speech is more indeterminate than text
  • Speech has different syntax and semantics
  • Prosody differs from punctuation
  • Cost to develop applications (too few experts)
  • Cost to integrate/interoperate with other
    technologies
  • New capabilities
  • "When did he say Y, and was he angry?"
  • Scanning, refocusing quickly (browsing)
  • Proactive information Match past pattern, find
    novel aspects
  • Gist, summarize, translate for different purposes

33
Conclusion and Future Directions: Applications on
the Horizon
  • Beginnings of speech as a source of information
  • ISLIP: http://www.mediasite.net/info/frames.htm
  • Virage: http://www.virage.com
  • Speech technology in education and training
  • Cliff Stoll, High Tech Heretic:
  • Good schools need no computers
  • Bad schools won't be improved by them
  • BravoBrava: co-evolving technology and people
    can
  • Dramatically reduce the cost of delivery of
    content
  • Increase its timeliness, quality, and
    appropriateness
  • Target needs of individual and/or group
  • Reading Pal demo

34
Reading Pal
Child reads; errors are shown in red
Looks up the word "massive"
Clicks "Listen To" to play back from "massive"
Clicks "You" to play it back
35
Summary Goal: Speech Better Than Text
  • Healthy loop between research and applications
  • Research leads to applications, which lead to new
    research opportunities
  • We need collaboration
  • Too much for one person, one site, one country
  • Humans will probably continue to be better than
    machines at many things
  • Can we learn to use technology and training to
    augment human-human and human-machine
    collaboration?
  • We need to micronize education and training
  • It's not a solved problem
  • Further technology development is needed to
    enable the vision