Title: Auditory Scene Analysis and Automatic Speech Recognition in Adverse Conditions
1 Auditory Scene Analysis and Automatic Speech Recognition in Adverse Conditions
Phil Green, Speech and Hearing Research Group, Department of Computer Science, University of Sheffield
With thanks to Martin Cooke, Guy Brown, Jon Barker...
2 Overview
- Visual and Auditory Scene Analysis
- Glimpsing in Speech Perception
- Missing Data ASR
- Finding the glimpses
- Current Sheffield Work
- Dealing with Reverberation
- Identifying Musical Instruments
- Multisource Decoding
- Speech Separation Challenge
3 Visual Scenes and Auditory Scenes
- Objects are opaque
- Each spatial pixel images a single object
- Object recognition has to cope with occlusion
- Sound is additive
- Each time/frequency pixel receives contributions from many sound sources
- Sound source recognition apparently requires reconstruction...
4 Glimpsing in auditory scenes: the dominance effect (Cooke)
- Although audio signals combine additively, the occlusion metaphor is a good approximation because of log-like compression in the auditory system.
- Consequently, most regions in a mixture are dominated by one or other source, leaving very few ambiguous regions, even for a pair of speech signals mixed at 0 dB.
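The dominance effect can be illustrated numerically. The sketch below is not from the slides: it assumes the two sources are uncorrelated so that their powers add in each time-frequency cell, and the function name `dominance_stats` and the 3 dB margin are illustrative choices. It measures how far the log of the mixture can be from the log of the stronger source, and what fraction of cells are clearly dominated by one source.

```python
import numpy as np

def dominance_stats(power_a, power_b, margin_db=3.0):
    """Compare a two-source mixture with its stronger source, cell by cell.

    power_a, power_b: arrays of per-cell power for the two sources.
    Assumption: the sources are uncorrelated, so mixture power = sum of powers.
    """
    db_a = 10 * np.log10(power_a)
    db_b = 10 * np.log10(power_b)
    db_mix = 10 * np.log10(power_a + power_b)
    # Error made by treating the mixture as "the louder source only".
    max_error_db = float(np.max(db_mix - np.maximum(db_a, db_b)))
    # Fraction of cells where one source exceeds the other by the margin.
    dominated = float(np.mean(np.abs(db_a - db_b) > margin_db))
    return max_error_db, dominated

# Hypothetical stand-in for two speech spectrograms at equal average level:
# log-power values spread over a wide dynamic range.
rng = np.random.default_rng(0)
power_a = 10 ** (rng.normal(0.0, 15.0, size=(100, 32)) / 10)
power_b = 10 ** (rng.normal(0.0, 15.0, size=(100, 32)) / 10)
max_error_db, dominated = dominance_stats(power_a, power_b)
```

Under the power-addition assumption the error of the "louder source only" approximation can never exceed 10 log10(2), about 3 dB, which is why the occlusion metaphor works after log compression.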
5 Can listeners handle glimpses?
6 The robustness problem in Automatic Speech Recognition
- Current ASR devices cannot tolerate additive noise, particularly if it is unpredictable
- Listeners' noise tolerance is 1 or 2 orders of magnitude better in equivalent conditions (Lippmann 97)
- Can glimpsing be used as the basis for robust ASR?
- Requirements:
- Adapt statistical ASR to the incomplete data case
- Identify the glimpses
7 Classification with Missing Data
A common problem: visual occlusion, sensor failure, transmission losses...
Need to evaluate the likelihood that observation vector x was generated by class C, f(x|C).
Assume x has been partitioned into reliable and unreliable parts, (x_r, x_u).
Two approaches:
Imputation: estimate x_u, then proceed as normal
Marginalisation: integrate over the possible range of x_u
Marginalisation is preferable if there is no need to reconstruct x.
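The two approaches can be written out as follows. This is a sketch only: the hat notation and the choice of the conditional mean as the imputed value are assumptions for illustration, not notation taken from the slides.

```latex
% Marginalisation: integrate the unreliable components out of the class density
f(x_r \mid C) = \int f(x_r, x_u \mid C)\, dx_u

% Imputation: replace x_u by an estimate (here, assumed to be its conditional mean),
% then score the completed vector as normal
\hat{x}_u = E[x_u \mid x_r, C], \qquad f(x \mid C) \approx f(x_r, \hat{x}_u \mid C)
```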
8 The Missing Data Likelihood Computation
- In ASR by continuous density HMMs, state distributions are Gaussian mixtures with diagonal covariance
- The marginal is just the reduced-dimensionality distribution
- The integral can be approximated by error functions (ERFs)
- This is computed independently for each mixture component in the state distribution
Cooke et al 2001
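A minimal sketch of this computation for one HMM state is given below. It is an illustration under stated assumptions, not the implementation of Cooke et al: the function name `md_log_likelihood`, the argument layout, and the choice to integrate each unreliable component from a lower bound up to the observed value are all hypothetical, and any normalisation of the bounded integral is omitted.

```python
import math
import numpy as np

def md_log_likelihood(x, reliable, weights, means, variances, lower=0.0):
    """Missing-data log-likelihood of x under a diagonal-covariance GMM.

    x: (D,) observation; reliable: (D,) boolean mask;
    weights: (M,); means, variances: (M, D).
    Reliable components use the Gaussian density. Unreliable components
    use the probability mass between `lower` and the observed value
    (an assumed form of the bounds), computed with error functions.
    """
    x = np.asarray(x, dtype=float)
    reliable = np.asarray(reliable, dtype=bool)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    sd = np.sqrt(variances)

    # Per-component, per-dimension log density for reliable dimensions.
    log_pdf = -0.5 * (np.log(2 * np.pi * variances) + (x - means) ** 2 / variances)

    # Per-component, per-dimension bounded integral for unreliable dimensions.
    erf = np.vectorize(math.erf)
    def cdf(v):
        return 0.5 * (1.0 + erf((v - means) / (sd * math.sqrt(2.0))))
    mass = np.clip(cdf(x) - cdf(lower), 1e-300, None)

    per_dim = np.where(reliable, log_pdf, np.log(mass))
    # Diagonal covariance: dimensions multiply, so log terms add.
    log_mix = np.log(weights) + per_dim.sum(axis=1)
    # Log-sum-exp over mixture components.
    top = log_mix.max()
    return float(top + np.log(np.exp(log_mix - top).sum()))
```

Because the covariance is diagonal, each mixture component factorises over dimensions, so the reliable and unreliable terms can be computed separately and summed in the log domain before combining components.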
9 Counter-evidence from bounds
Class C matches the reliable evidence well but
there is insufficient energy in the unreliable
components
10 Finding the glimpses
- Auditory scene analysis identifies spectral regions dominated by a single source
- Harmonicity
- Common amplitude modulation
- Sound source location
- Local SNR estimates can be used to compensate for predictable noise sources.
Cooke 91
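For the local SNR route, one simple possibility is sketched below. It is an assumption-laden illustration rather than the method used in the work cited: the noise spectrum is estimated from the first few frames (assumed to contain no speech), the speech power is estimated by spectral subtraction, and a cell is marked reliable when the estimated local SNR exceeds a threshold. The function name `snr_mask` and its default parameters are hypothetical.

```python
import numpy as np

def snr_mask(power_spec, n_noise_frames=10, threshold_db=0.0):
    """Binary missing-data mask from a local SNR estimate.

    power_spec: (T, F) array of spectral power, time by frequency.
    Assumption: the first n_noise_frames frames contain noise only and
    the noise is stationary, so their mean is a usable noise estimate.
    """
    power_spec = np.asarray(power_spec, dtype=float)
    noise = power_spec[:n_noise_frames].mean(axis=0)        # (F,)
    speech_est = np.maximum(power_spec - noise, 1e-12)      # spectral subtraction
    local_snr_db = 10 * np.log10(speech_est / np.maximum(noise, 1e-12))
    return local_snr_db > threshold_db                      # True = reliable
```

This only works for predictable noise: if the noise changes after the initial frames, the estimate and therefore the mask will be wrong, which is where the other grouping cues come in.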
11 Harmonicity Masks
- Only meaningful in voiced segments
- Can be combined with SNR masks
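One way a harmonicity mask could be derived is sketched below. This is an illustrative assumption, not the procedure behind the slide: it takes one frame of band-pass filtered channel signals and a pitch period that has already been estimated, and marks a channel as reliable when its normalised autocorrelation at the pitch lag is high. The name `harmonicity_mask` and the 0.7 threshold are hypothetical.

```python
import numpy as np

def harmonicity_mask(channels, pitch_lag, threshold=0.7):
    """Mark channels whose periodicity matches the pitch period.

    channels: (C, N) array, one frame of N samples per filterbank channel.
    pitch_lag: pitch period in samples (assumed to be estimated elsewhere).
    Returns a (C,) boolean array; True means the channel is treated as
    dominated by the voiced source with that pitch.
    """
    channels = np.asarray(channels, dtype=float)
    a = channels[:, :-pitch_lag]
    b = channels[:, pitch_lag:]
    num = (a * b).sum(axis=1)
    den = np.sqrt((a * a).sum(axis=1) * (b * b).sum(axis=1)) + 1e-12
    return num / den > threshold
```

In unvoiced segments there is no pitch lag to test against, which is consistent with the mask being meaningful only in voiced segments.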
12 Aurora Results (Sept 2001)
Average gain over clean baseline under all conditions: 65
13 Missing data masks from spatial location
Sue Harding, Guy Brown
- Cues for spatial location are used to separate a target source from masking sources
- Interaural Time Difference from cross-correlation between left and right binaural signals
- Interaural Level Difference from ratio of energy in left and right ears
- Soft masks
- Task
- Target source: male speaker straight ahead
- One or two masking sources (also male speakers) at other positions
- Added reverberation
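The two binaural cues listed above can be sketched as follows. This is a minimal illustration for a single pair of signals, assuming equal-length left and right inputs; the function name `itd_ild`, the lag search range, and the sign convention are assumptions, and a real system would compute these per frequency channel and per frame.

```python
import numpy as np

def itd_ild(left, right, fs, max_lag):
    """Estimate interaural time and level differences.

    left, right: equal-length 1-D signals; fs: sample rate in Hz;
    max_lag: largest lag searched, in samples.
    Returns (itd_seconds, ild_db). A positive ITD means the right-ear
    signal lags the left-ear signal (sign convention assumed here).
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    lags = np.arange(-max_lag, max_lag + 1)
    # Trim the edges so that the circular shift never wraps into the sum.
    core = slice(max_lag, len(left) - max_lag)
    xcorr = [np.dot(left[core], np.roll(right, -lag)[core]) for lag in lags]
    itd = lags[int(np.argmax(xcorr))] / fs
    # ILD as the ratio of energies in the two ears, in dB.
    ild = 10 * np.log10(np.sum(left ** 2) / np.sum(right ** 2))
    return itd, ild
```

For example, a right-ear signal that is a copy of the left-ear signal delayed by 5 samples and halved in amplitude yields an ITD of 5/fs seconds and an ILD of about 6 dB.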
14 Missing data masks from spatial location (2)
Conditions compared: oracle, ITD only, ILD only, combined ITD and ILD. Best performance is with combined ITD and ILD.
[Figure; x-axis: azimuth of masker (degrees)]
15 MD for reverberant conditions (1)
- Palomäki, Brown and Barker have applied MD to the problem of room reverberation
- Use spectral normalization to deal with distortion caused by early reflections
- Treat late reverberation as additive noise, and apply standard MD techniques.
- Select features which are uncontaminated by reverberation and contain strong speech energy.
- Approach based on modulation filtering
- Each rate map channel passed through modulation filter
- Identify periods with enough energy in the filtered output
- Use these to define mask on original rate map
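The three modulation-filtering steps above can be sketched as follows. The filter here is a stand-in: the slides do not specify the modulation filter, so this sketch assumes a crude band-pass built from the difference of a short and a long moving average, and a threshold set relative to each channel's peak. The names `moving_average` and `reverb_mask` and all parameter values are hypothetical.

```python
import numpy as np

def moving_average(x, width):
    """Smooth each column (channel) of x over time with a boxcar of given width."""
    kernel = np.ones(width) / width
    return np.apply_along_axis(
        lambda col: np.convolve(col, kernel, mode="same"), 0, x)

def reverb_mask(rate_map, fast=3, slow=21, rel_threshold=0.2):
    """Mask for a reverberant rate map, time by frequency.

    Step 1: pass each channel through a modulation filter (assumed here:
            short moving average minus long moving average).
    Step 2: find periods with enough energy in the filtered output.
    Step 3: return those periods as the mask on the original rate map.
    """
    rate_map = np.asarray(rate_map, dtype=float)
    filtered = moving_average(rate_map, fast) - moving_average(rate_map, slow)
    filtered = np.maximum(filtered, 0.0)   # keep energy rises, drop decays
    thresh = rel_threshold * filtered.max(axis=0, keepdims=True)
    return filtered > thresh
```

The intent is that sharp onsets and strong speech energy pass the filter, while the slowly decaying tails produced by late reverberation do not and are therefore treated as missing.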
16 MD for reverberant conditions (2)
- Recognition of connected digits (Aurora 2)
- Reverberated using recorded room impulse responses
- Performance comparable with Brian Kingsbury's hybrid HMM-MLP recognizer
- K. J. Palomäki, G. J. Brown and J. Barker (2004) Speech Communication 43 (1-2), pp. 123-142
17 MD for music analysis (1)
- Eggink and Brown have used MD techniques to identify concurrent musical instrument sounds
- Part of a system for transcribing chamber music
- Identify the F0 of the target note, and only keep its harmonics in the MD mask
- Uses a GMM classifier for each instrument, trained on isolated tones and short phrases
- Tested on tones, phrases and commercial CD
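The idea of keeping only the harmonics of the target F0 in the mask can be sketched as below. This is an illustration under assumptions, not Eggink and Brown's implementation: it supposes the features are indexed by channel centre frequency, and the function name `harmonic_mask` and the 3% relative tolerance are hypothetical.

```python
import numpy as np

def harmonic_mask(centre_freqs, f0, rel_tol=0.03):
    """Keep only feature channels that lie on a harmonic of f0.

    centre_freqs: channel centre frequencies in Hz.
    f0: fundamental frequency of the target note in Hz.
    A channel is kept when its centre frequency is within rel_tol
    (relative) of the nearest integer multiple of f0.
    """
    centre_freqs = np.asarray(centre_freqs, dtype=float)
    harmonic_number = np.maximum(np.round(centre_freqs / f0), 1)
    nearest = harmonic_number * f0
    return np.abs(centre_freqs - nearest) <= rel_tol * nearest

# Channels at 440, 880 and 1320 Hz are harmonics of a 440 Hz note;
# 660 and 1000 Hz are not.
mask = harmonic_mask([440, 660, 880, 1000, 1320], f0=440.0)
```

The resulting mask would then be handed to the per-instrument GMM classifier, with the masked-out channels treated as missing.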
18 MD for music analysis (2)
- Example duet for flute and clarinet
- All instrument tones correctly identified in this example
[Figure: fundamental frequency (Hz) against time (frames), with notes labelled Flute and Clarinet]
- J. Eggink and G. J. Brown (2003) Proc. ICASSP, Hong Kong, IV, pp. 553-556
- J. Eggink and G. J. Brown (2004) Proc. ICASSP, Montreal, V, pp. 217-220
19 Multisource Decoding
Use primitive ASA and local SNR to identify time-frequency regions (fragments) dominated by a single source, i.e. possible segregations S, but NOT to decide what the best segregation is.
Instead, jointly optimise over the word sequence W and S.
Decoding algorithm finds best subset of fragments to match speech source.
Based on missing data techniques: regions hypothesised as non-speech are missing.
Barker, Cooke & Ellis 2003
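The joint optimisation can be stated compactly. In this sketch the symbol Y for the observed mixture and the hat notation are assumptions introduced for illustration; W and S are as defined on the slide.

```latex
% Standard decoding searches over word sequences only:
%   \hat{W} = \arg\max_{W} P(W \mid Y)
% Multisource decoding searches jointly over word sequences and segregations:
(\hat{W}, \hat{S}) = \arg\max_{W,\,S} P(W, S \mid Y)
```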
20 Multisource decoding algorithm
Work forward in time, maintaining a set of alternative decodings: Viterbi searches based on a choice of speech fragments.
When a new fragment arrives, split decodings: speech or non-speech?
When a fragment ends, merge decoders which differ only in its interpretation.
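The split and merge bookkeeping can be sketched with a toy version in which each "decoder" is reduced to a single running score. This is a deliberate simplification and an assumption: in the real algorithm each hypothesis is a full Viterbi search over HMM states, whereas here a caller-supplied `frame_score` function (hypothetical) stands in for the missing-data acoustic score. The name `multisource_decode` and the data layout are also illustrative.

```python
def multisource_decode(n_frames, fragments, frame_score):
    """Toy split/merge search over speech/non-speech fragment labels.

    fragments: dict mapping fragment name -> (start, end), end exclusive.
    frame_score(t, speech): log score of frame t given the frozenset of
        currently active fragments labelled as speech.
    Returns (best_score, frozenset of fragments labelled speech).
    """
    # Key: active fragments labelled speech. Value: (score, all speech labels so far).
    hyps = {frozenset(): (0.0, frozenset())}
    for t in range(n_frames):
        # Split: a new fragment doubles the hypotheses (speech or non-speech).
        for name, (start, _end) in fragments.items():
            if start == t:
                split = {}
                for active, (score, chosen) in hyps.items():
                    split[active] = (score, chosen)                       # non-speech
                    split[active | {name}] = (score, chosen | {name})     # speech
                hyps = split
        # Advance every hypothesis by one frame.
        hyps = {a: (s + frame_score(t, a), c) for a, (s, c) in hyps.items()}
        # Merge: when a fragment ends, keep the better of each pair of
        # hypotheses that differ only in that fragment's label.
        for name, (_start, end) in fragments.items():
            if end == t + 1:
                merged = {}
                for active, (score, chosen) in hyps.items():
                    key = active - {name}
                    if key not in merged or score > merged[key][0]:
                        merged[key] = (score, chosen)
                hyps = merged
    return max(hyps.values(), key=lambda v: v[0])
```

Merging keeps the number of live hypotheses bounded by the number of fragments that overlap in time, rather than growing with the total number of fragments in the utterance.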
21 Multisource Decoding on Aurora 2
22 Multisource decoding with a competing speaker
- Andre Coy and Jon Barker
- Utterances of male and female speakers mixed at 0 dB
- Voiced regions: soft harmonicity masks from autocorrelation peaks
- Voiceless regions: fragments from image processing
- Gender-dependent HMMs.
- Separate decoding for male and female
- 73.7% accuracy on a connected digit task
23 Informing Multisource Decoding: Work in progress
- Ning Ma, Andre Coy, Phil Green
- HMM duration constraints
- Links between fragments: pitch continuity
- Speechiness
24 Speech separation challenge
- Organisers: Martin Cooke (University of Sheffield, UK), Te-Won Lee (UCSD, USA)
- see http://www.dcs.shef.ac.uk/martin
- Global comparison of techniques for separating and recognising speech
- Special session of Interspeech 2006 in Pittsburgh (USA), 17-21 September 2006.
- Task: recognise speech from a target talker in the presence of either stationary noise or other speech.
- Training and test data supplied.
- One signal per mixture (i.e. the task is "single microphone").
- Speech material: simple sentences from the Grid Task, e.g. "place white at L 3 now"