1
Automatic Speech Recognition (ASR): A Brief Overview
2
Radio Rex: 1920s ASR (a toy dog that sprang from its house at the sound of its name)
3
Statistical ASR
  • i_best = argmax_i P(M_i | X) = argmax_i P(X | M_i) P(M_i)
    (1st term: acoustic model; 2nd term: language model)
  • P(X | M_i) ≈ P(X, Q_i | M_i) (Viterbi approximation),
    where Q_i is the best state sequence in M_i
  • approximated by product of local likelihoods
    (Markov, conditional independence assumptions; a toy
    decision-rule sketch follows)
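A minimal sketch of this decision rule (the scores are made-up placeholders, not outputs of a real recognizer): pick the model whose combined acoustic and language-model log score is largest.

```python
import math

# Toy Bayes decision rule: choose the word/model M_i maximizing
# log P(X | M_i) + log P(M_i). All scores below are hypothetical.
candidates = {
    "cat": {"log_acoustic": -42.0, "log_lm": math.log(0.01)},
    "cap": {"log_acoustic": -45.5, "log_lm": math.log(0.02)},
}
best = max(candidates,
           key=lambda w: candidates[w]["log_acoustic"] + candidates[w]["log_lm"])
print(best)  # "cat": combined log score -46.6 beats -49.4
```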
4
Automatic Speech Recognition
Speech Production/Collection → Pre-processing → Feature Extraction → Hypothesis Generation → Cost Estimator → Decoding
5
Simplified Model of Speech Production
  • Periodic source (vocal vibration) or random source (turbulence): fine spectral structure
  • Filters (vocal tract, nasal tract, radiation): spectral envelope
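A minimal source-filter sketch under assumed values (8 kHz sampling, 100 Hz pitch, one resonance near 700 Hz; none of these numbers come from the slides): an impulse train or white noise excites an all-pole filter standing in for the vocal tract.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                      # sample rate (Hz), assumed for the sketch
n = fs // 2                    # half a second of signal
f0 = 100                       # pitch of the periodic source (Hz)

periodic = np.zeros(n)
periodic[::fs // f0] = 1.0     # impulse train: fine spectral structure
random = np.random.randn(n)    # turbulence: noise excitation

# All-pole filter standing in for the vocal tract (spectral envelope).
# A pole pair near e^{±j 2π F/fs} creates a resonance ("formant") near F Hz.
F, bw = 700, 100
r = np.exp(-np.pi * bw / fs)
a = [1, -2 * r * np.cos(2 * np.pi * F / fs), r * r]
voiced = lfilter([1.0], a, periodic)
unvoiced = lfilter([1.0], a, random)
```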
6
Pre-processing
Speech → Room Acoustics → Microphone → Linear Filtering → Sampling & Digitization
Issues: noise and reverb, and their effect on modeling
7
Framewise Analysis of Speech
The waveform is cut into successive frames; each frame yields a feature vector (Frame 1 → X1, Frame 2 → X2, ...).
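A minimal framing sketch; the 25 ms frame and 10 ms hop are typical values assumed here, not taken from the slides.

```python
import numpy as np

# Slice the waveform into overlapping frames, one feature vector per frame.
def frames(signal, fs, frame_ms=25, hop_ms=10):
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    return np.array([signal[i:i + frame_len]
                     for i in range(0, len(signal) - frame_len + 1, hop)])

x = np.random.randn(16000)          # one second of fake speech at 16 kHz
print(frames(x, 16000).shape)       # (98, 400): 98 frames of 400 samples
```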
8
Feature Extraction
Spectral Analysis → Auditory Model / Orthogonalization (cepstrum)
Issues: design for discrimination; insensitivity to scaling and simple distortions
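A minimal real-cepstrum sketch for one frame, one way to realize the "orthogonalize (cepstrum)" step: inverse FFT of the log-magnitude spectrum, keeping the low coefficients as spectral-envelope features.

```python
import numpy as np

# Low cepstral coefficients capture the spectral envelope (vocal tract);
# higher ones capture fine structure (excitation).
def real_cepstrum(frame):
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)   # small floor avoids log(0)
    return np.fft.irfft(log_mag)

frame = np.random.randn(400)          # placeholder frame
cep = real_cepstrum(frame)
features = cep[:13]                   # keep the first coefficients as features
```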
9
Representations are Important
  • Network fed the raw speech waveform: 23% of frames correct
  • Network fed PLP features: 70% of frames correct
10
Mel Frequency Scale
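The mel scale warps frequency toward auditory resolution; a commonly used analytic approximation (not necessarily the exact curve on the slide) is m = 2595 log10(1 + f/700).

```python
import numpy as np

# The common analytic form of the mel scale.
def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

print(hz_to_mel(1000))   # ~1000 mel: the scale is anchored near 1 kHz
```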
11
Spectral vs Temporal Processing
  • Spectral processing: analysis across frequency within a frame (e.g., cepstral analysis)
  • Temporal processing: processing across time within a frequency band (e.g., mean removal; sketched below)
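A minimal sketch of the mean-removal example: subtracting each cepstral coefficient's mean over the utterance (cepstral mean normalization) cancels a fixed linear channel, which is additive in the log-cepstral domain.

```python
import numpy as np

# Temporal processing: remove each coefficient's mean along the time axis.
def mean_removal(features):        # features: (num_frames, num_coeffs)
    return features - features.mean(axis=0, keepdims=True)

feats = np.random.randn(98, 13)    # placeholder feature matrix
normalized = mean_removal(feats)   # each column now has zero mean
```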
12
Hypothesis Generation
Candidate hypotheses: "cat", "dog", "a dog is not a cat"
Issue: models of language and task
13
Cost Estimation
  • Distances
  • Negative log probabilities, from:
    discrete distributions; Gaussians, mixtures;
    neural networks (a Gaussian example is sketched below)
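A minimal sketch of the Gaussian case: the cost of a frame is its negative log likelihood under a diagonal-covariance Gaussian (all parameters here are placeholders).

```python
import numpy as np

# Negative log likelihood of a frame x under a diagonal Gaussian.
def gaussian_cost(x, mean, var):
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = np.zeros(13)
cost = gaussian_cost(x, mean=np.zeros(13), var=np.ones(13))
# 0.5 * 13 * log(2*pi) ~ 11.95: lower cost means a better acoustic match
```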

14
Nonlinear Time Normalization
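A minimal dynamic-time-warping (DTW) sketch; DTW is the classic algorithm for nonlinear time normalization, aligning two feature sequences that differ in speaking rate.

```python
import numpy as np

# DTW: cumulative cost of the best monotonic alignment of A against B.
def dtw(A, B):
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]       # total cost of the best nonlinear alignment

A = np.random.randn(50, 13)   # a template and a test utterance,
B = np.random.randn(70, 13)   # different lengths, same feature space
print(dtw(A, B))
```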
15
Decoding
16
Pronunciation Models
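A toy illustration (entries invented for this sketch, in ARPAbet-style phones) of what a pronunciation model stores: each word maps to one or more phone sequences.

```python
# Hypothetical lexicon entries; real lexicons also carry variant probabilities.
lexicon = {
    "cat":    [["k", "ae", "t"]],
    "dog":    [["d", "ao", "g"]],
    "either": [["iy", "dh", "er"], ["ay", "dh", "er"]],  # pronunciation variants
}
```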
17
Language Models
  • Most likely words give the largest product:
    P(acoustics | words) × P(words)
  • P(words) = Π P(word | history)
  • bigram: history is the previous word
  • trigram: history is the previous 2 words
  • n-gram: history is the previous n-1 words
    (a toy bigram estimate is sketched below)
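A minimal bigram estimate on a toy corpus (plain counting, no smoothing, so unseen bigrams get probability zero):

```python
from collections import Counter

corpus = "a dog is not a cat a cat is not a dog".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

# Maximum-likelihood bigram: P(word | prev) = count(prev, word) / count(prev).
def p_bigram(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("dog", "a"))   # 0.5: half the occurrences of "a" precede "dog"
```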

18
ASR System Architecture
(Block diagram: the speech signal feeds a recognizer that draws on a pronunciation lexicon.)
19
HMMs for Speech
  • Math from Baum and others, 1966-1972
  • Applied to speech by Baker in the original CMU Dragon System (1974)
  • Developed by IBM (Baker, Jelinek, Bahl, Mercer, ...) (1970-1993)
  • Extended by others in the mid-1980s

20
Hidden Markov model (graphical form)
A chain of hidden states q1 → q2 → q3 → q4, each emitting one observation: x1, x2, x3, x4.
21
Hidden Markov Model (state machine form)
States q1, q2, q3 with transition probabilities P(q2 | q1), P(q3 | q2), P(q4 | q3).
22
Markov model
For two states q1 → q2 emitting x1 and x2:
P(x1, x2, q1, q2) = P(q1) P(x1 | q1) P(q2 | q1) P(x2 | q2)
(A Viterbi sketch over such a model follows.)
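A minimal Viterbi sketch for such a model: it finds the single best state sequence, whose probability the earlier slide uses to approximate P(X | M). All inputs here are log-domain placeholders.

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    T, S = log_emit.shape
    delta = log_init + log_emit[0]           # best score ending in each state
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans  # scores[i, j]: arrive at j from i
        back[t] = scores.argmax(axis=0)      # best predecessor for each state
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]             # backtrace from the best end state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], delta.max()           # best Q and its log-probability

S, T = 3, 10
log_init = np.log(np.full(S, 1.0 / S))
log_trans = np.log(np.full((S, S), 1.0 / S))
log_emit = np.random.randn(T, S)             # placeholder emission scores
states, logp = viterbi(log_init, log_trans, log_emit)
```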
23
HMM Training Steps
  • Initialize estimators and models
  • Estimate hidden variable probabilities
  • Choose estimator parameters to maximize model likelihoods
  • Assess and repeat steps as necessary
  • A special case of Expectation-Maximization (EM); a compact sketch follows
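A compact Baum-Welch (EM) sketch for a discrete-output HMM, following these steps on placeholder data; a real system would add scaled/log arithmetic and train over many utterances.

```python
import numpy as np

def baum_welch(obs, S, V, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(S, 1.0 / S)                               # initialize
    A = rng.random((S, S)); A /= A.sum(1, keepdims=True)   # transitions
    B = rng.random((S, V)); B /= B.sum(1, keepdims=True)   # emissions
    T = len(obs)
    for _ in range(iters):
        # E-step: forward-backward gives hidden-state posteriors
        alpha = np.zeros((T, S)); beta = np.zeros((T, S))
        alpha[0] = pi * B[:, obs[0]]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        beta[-1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        gamma = alpha * beta
        gamma /= gamma.sum(1, keepdims=True)               # P(state | obs)
        xi = alpha[:-1, :, None] * A * (B[:, obs[1:]].T * beta[1:])[:, None, :]
        xi /= xi.sum((1, 2), keepdims=True)                # P(i -> j | obs)
        # M-step: re-estimate parameters from expected counts
        pi = gamma[0]
        A = xi.sum(0) / gamma[:-1].sum(0)[:, None]
        for v in range(V):
            B[:, v] = gamma[obs == v].sum(0) / gamma.sum(0)
    return pi, A, B

obs = np.array([0, 1, 0, 0, 2, 1, 0, 2, 2, 1])   # placeholder symbol sequence
pi, A, B = baum_welch(obs, S=2, V=3)
```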

24
Progress in 3 Decades
  • From digits to 60,000 words
  • From single speakers to many
  • From isolated words to continuous speech
  • From no products to many products, some systems actually saving LOTS of money

25
Real Uses
  • Telephone: phone company services (collect versus credit card)
  • Telephone: call centers for query information (e.g., stock quotes, parcel tracking)
  • Dictation products: continuous recognition, speaker dependent/adaptive

26
But ...
  • Still <97% on "yes" for telephone
  • Unexpected rate of speech causes doubling or tripling of error rate
  • Unexpected accent hurts badly
  • Performance on unrestricted speech at 70% (with good acoustics)
  • Don't know when we know
  • Few advances in basic understanding

27
Why is ASR Hard?
  • Natural speech is continuous
  • Natural speech has disfluencies
  • Natural speech is variable over global rate, local rate, pronunciation within speaker, pronunciation across speakers, and phonemes in different contexts

28
Why is ASR Hard?(continued)
  • Large vocabularies are confusable
  • Out of vocabulary words inevitable
  • Recorded speech is variable over room acoustics, channel characteristics, and background noise
  • Large training times are not practical
  • User expectations are for equal to or greater than human performance

29
ASR Dimensions
  • Speaker dependent, independent
  • Isolated, continuous, keywords
  • Lexicon size and difficulty
  • Task constraints, perplexity
  • Adverse or easy conditions
  • Natural or read speech

30
Telephone Speech
  • Limited bandwidth (F vs S)
  • Large speaker variability
  • Large noise variability
  • Channel distortion
  • Different handset microphones
  • Mobile and handsfree acoustics

31
Hot Research Problems
  • Speech in noise
  • Multilingual conversational speech (EARS)
  • Portable (e.g., cellular) ASR
  • Question answering
  • Understanding meetings or at least browsing
    them

32
Hot Research Approaches
  • New (multiple) features and models
  • New statistical dependencies
  • Multiple time scales
  • Multiple (larger) sound units
  • Dynamic/robust pronunciation models
  • Long-range language models
  • Incorporating prosody
  • Incorporating meaning
  • Non-speech modalities
  • Understanding confidence

33
Multi-frame analysis
  • Incorporate multiple frames as a single
    observation
  • LDA is the most common approach
  • Neural networks
  • Bayesian networks (graphical models,
    including Buried Markov Models)

34
Linear Discriminant Analysis (LDA)
All variables for several frames (x1, ..., x5) are stacked into one observation X and transformed to fewer dimensions (y1, y2, y3); the transformation maximizes the ratio of between-class variance to within-class variance.
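A minimal sketch with scikit-learn on random placeholder data: stack several consecutive frames into one observation, then fit LDA, which maximizes exactly this variance ratio.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

frames = np.random.randn(1000, 13)               # 13 features per frame
labels = np.random.randint(0, 10, 1000)          # a phone class per frame

context = 5                                       # stack 5 consecutive frames
X = np.array([frames[i:i + context].ravel()       # 65-dimensional observations
              for i in range(len(frames) - context + 1)])
y = labels[context // 2: context // 2 + len(X)]   # label of the center frame

lda = LinearDiscriminantAnalysis(n_components=9)  # at most n_classes - 1
Y = lda.fit_transform(X, y)                       # (996, 9) discriminant features
```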
35
Multi-layer perceptron
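A minimal sketch of one common use of an MLP in ASR (the hybrid setup, an assumption here rather than something stated on the slide): train on labeled frames to estimate per-frame phone posteriors, which, scaled by phone priors, can stand in for HMM emission likelihoods. Data are random placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.randn(1000, 13)           # feature vectors (e.g., PLP)
y = np.random.randint(0, 10, 1000)      # phone labels per frame

# One hidden layer; the MLP's softmax outputs estimate P(phone | x).
mlp = MLPClassifier(hidden_layer_sizes=(64,), max_iter=200)
mlp.fit(X, y)
posteriors = mlp.predict_proba(X[:5])   # one posterior distribution per frame
```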
36

Buried Markov Models
37
Multi-stream analysis
  • Multi-band systems
  • Multiple temporal properties
  • Multiple data-driven temporal filters

38

Multi-band analysis
39
Temporally distinct features
40

Combining streams
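One simple way to combine streams (a weighted log-domain average, an assumption of this sketch rather than the slide's method): merge per-stream phone posteriors frame by frame before decoding.

```python
import numpy as np

# Weighted geometric mean of per-stream posteriors, renormalized per frame.
def combine(stream_posteriors, weights):
    logp = sum(w * np.log(p + 1e-10) for w, p in zip(weights, stream_posteriors))
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

low_band = np.random.dirichlet(np.ones(10), size=100)    # (frames, phones)
high_band = np.random.dirichlet(np.ones(10), size=100)   # placeholder streams
merged = combine([low_band, high_band], weights=[0.5, 0.5])
```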
41
Another novel approach: articulator dynamics
  • Natural representation of context
  • Production apparatus has mass, inertia
  • Difficult to accurately model
  • Can approximate with simple dynamics

42
Hidden Dynamic Models
We hold these truths to be self-evident: that speech is produced by an underlying dynamic system, that it is endowed by its production system with certain inherent dynamic qualities, among these are compactness, continuity, and the pursuit of target values for each phone class; that to exploit these characteristics Hidden Dynamic Models are instituted among men. We solemnly publish and declare, that these phone classes are and of right ought to be free and context-independent states. And for the support of this declaration, with a firm reliance on the acoustic theory of speech production, we mutually pledge our lives, our fortunes, and our sacred honor.
John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop, with apologies to Thomas Jefferson ...
(See http://www.clsp.jhu.edu/ws98/projects/dynamic/)
43
Hidden Dynamic Models
(Block diagram, top to bottom: segmentation, target switch, target values, filter, neural network, speech pattern.)
44
Sources of Optimism
  • Comparatively new research lines
  • Many examples of improvements
  • Moore's Law → much more processing
  • Points toward joint development of front end and statistical components

45
Summary
  • 2002: ASR based on 50 years of research
  • Core algorithms → mature systems: 10-30 years
  • Deeply difficult, but tasks can be chosen that are easier in SOME dimension
  • Much more yet to do