Auditory Scene Analysis and Automatic Speech Recognition in Adverse Conditions presentation

1
Auditory Scene Analysis and Automatic Speech
Recognition in Adverse Conditions
Phil Green, Speech and Hearing Research Group,
Department of Computer Science, University of
Sheffield. With thanks to Martin Cooke, Guy
Brown, Jon Barker...
2
Overview

Visual and Auditory Scene Analysis
Glimpsing in Speech Perception
Missing Data ASR
Finding the glimpses
Current Sheffield Work
Dealing with Reverberation
Identifying Musical Instruments
Multisource Decoding
Speech Separation Challenge

3
Visual Scenes and Auditory Scenes

Objects are opaque
Each spatial pixel images a single object
Object recognition has to cope with occlusion

Sound is additive
Each time/frequency pixel receives contributions
from many sound sources
Sound source recognition apparently requires
reconstruction...

4
Glimpsing in auditory scenes: the dominance
effect (Cooke)

Although audio signals combine additively, the
occlusion metaphor is a good approximation due to
log-like compression in the auditory system

Consequently, most regions in a mixture are
dominated by one or other source, leaving very
few ambiguous regions, even for a pair of speech
signals mixed at 0 dB.
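
The dominance effect can be illustrated numerically. The sketch below is not from the talk: it assumes two uncorrelated sources whose energies add in each time/frequency cell, and it models cell levels as independent Gaussians in dB with a 12 dB spread, which is a hypothetical stand-in for real speech. It shows that approximating the log of the mixture by the log of the stronger source is never off by more than about 3 dB, and that under these assumptions most cells are dominated by one source.

```python
import math
import random

def log_max_error_db(e1, e2):
    """Error (dB) made by replacing the mixture energy with the stronger source."""
    mix_db = 10.0 * math.log10(e1 + e2)
    max_db = 10.0 * math.log10(max(e1, e2))
    return mix_db - max_db

def fraction_dominated(levels_a_db, levels_b_db, margin_db=3.0):
    """Fraction of cells where one source exceeds the other by at least margin_db."""
    dominated = sum(1 for a, b in zip(levels_a_db, levels_b_db)
                    if abs(a - b) >= margin_db)
    return dominated / len(levels_a_db)

random.seed(0)
# Hypothetical cell levels for two sources mixed at 0 dB overall.
a = [random.gauss(0.0, 12.0) for _ in range(10000)]
b = [random.gauss(0.0, 12.0) for _ in range(10000)]

frac = fraction_dominated(a, b)
worst = log_max_error_db(1.0, 1.0)   # equal energies: 10*log10(2), about 3 dB
mild = log_max_error_db(1.0, 0.1)    # 10 dB level difference: well under 1 dB
print(frac, worst, mild)
```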
5
Can listeners handle glimpses?
6
The robustness problem in Automatic Speech
Recognition

Current ASR devices cannot tolerate additive
noise, particularly if it is unpredictable
Listeners' noise tolerance is 1 or 2 orders of
magnitude better in equivalent conditions
(Lippmann 97)
Can glimpsing be used as the basis for robust
ASR?
Requirements:
Adapt statistical ASR to the incomplete data case
Identify the glimpses

7
Classification with Missing Data
A common problem: visual occlusion, sensor
failure, transmission losses...
Need to evaluate the likelihood that observation
vector x was generated by class C, f(x|C)
Assume x has been partitioned into reliable and
unreliable parts, (x_r, x_u)
Two approaches:
Imputation: estimate x_u, then proceed as normal
Marginalisation: integrate over the possible range
of x_u
Marginalisation is preferable if there is no need
to reconstruct x
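
The two approaches can be contrasted on a toy example. This is a minimal sketch, assuming each class is a single Gaussian with diagonal covariance (the class parameters and the observation below are made up for illustration). With a diagonal covariance, marginalising over x_u simply drops the unreliable dimensions, and the simplest imputation replaces them with the class mean.

```python
import math

def gauss_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def loglik_full(x, means, variances):
    """log f(x | C) using every component, reliable or not."""
    return sum(gauss_logpdf(xi, m, v) for xi, m, v in zip(x, means, variances))

def loglik_marginal(x, reliable, means, variances):
    """log f(x_r | C): only the reliable components contribute."""
    return sum(gauss_logpdf(x[i], means[i], variances[i])
               for i in range(len(x)) if reliable[i])

def impute(x, reliable, means):
    """Replace unreliable components by the class mean (diagonal-covariance case)."""
    return [x[i] if reliable[i] else means[i] for i in range(len(x))]

# Hypothetical classes and an observation whose third component is noise-corrupted.
class_a = ([0.0, 0.0, 0.0], [1.0, 1.0, 1.0])
class_b = ([3.0, 3.0, 8.0], [1.0, 1.0, 1.0])
x = [0.1, -0.2, 9.0]
reliable = [True, True, False]

full_a, full_b = loglik_full(x, *class_a), loglik_full(x, *class_b)
marg_a, marg_b = (loglik_marginal(x, reliable, *class_a),
                  loglik_marginal(x, reliable, *class_b))
print(full_a < full_b, marg_a > marg_b)
```

In this example the corrupted component pulls the full likelihood towards class B, while the marginal likelihood favours class A, which matches the reliable evidence.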
8
The Missing Data Likelihood Computation

In ASR by continuous density HMMs:
State distributions are Gaussian mixtures with
diagonal covariance
The marginal is just the reduced-dimensionality
distribution
The integral can be approximated by error
functions (erfs)
This is computed independently for each mixture
in the state distribution

Cooke et al. 2001
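
A minimal sketch of this computation for one frame follows. It assumes that each unreliable component is bounded between a floor `lower` and the observed value, so that the integral over that range reduces to a difference of two erf-based Gaussian CDFs. The function names, the floor, and the absence of any normalisation of the integral are assumptions of this sketch, not details taken from the slides.

```python
import math

def gauss_logpdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def gauss_cdf(x, mean, var):
    """Gaussian cumulative distribution, written with the error function."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))

def md_loglik(x, reliable, weights, means, variances, lower=0.0):
    """Missing-data log-likelihood of one frame under a diagonal-covariance GMM.

    Reliable components use the Gaussian density at x[i]; unreliable ones use
    the probability mass between `lower` and x[i]. Each mixture component is
    evaluated independently and the results are combined with log-sum-exp.
    """
    per_mixture = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for i, xi in enumerate(x):
            if reliable[i]:
                ll += gauss_logpdf(xi, mu[i], var[i])
            else:
                mass = gauss_cdf(xi, mu[i], var[i]) - gauss_cdf(lower, mu[i], var[i])
                ll += math.log(max(mass, 1e-300))  # guard against log(0)
        per_mixture.append(ll)
    top = max(per_mixture)
    return top + math.log(sum(math.exp(v - top) for v in per_mixture))
```

When every component is marked reliable, this reduces to the ordinary GMM log-likelihood.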
9
Counter-evidence from bounds
Class C matches the reliable evidence well but
there is insufficient energy in the unreliable
components
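
The effect of the bounds can be shown with a single unreliable component. In this sketch (all values hypothetical), the observed energy of 2.0 is an upper bound on the speech energy in that component. A class that expects about that much energy keeps most of its probability mass, while a class that expects far more energy is strongly penalised.

```python
import math

def bounded_mass(upper, mean, var, lower=0.0):
    """Probability mass of a Gaussian between lower and upper, via erf."""
    def cdf(v):
        return 0.5 * (1.0 + math.erf((v - mean) / math.sqrt(2.0 * var)))
    return cdf(upper) - cdf(lower)

observed = 2.0
quiet_class = bounded_mass(observed, mean=1.0, var=1.0)   # expects little energy
loud_class = bounded_mass(observed, mean=10.0, var=1.0)   # expects much more energy
print(quiet_class, loud_class)
```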
10
Finding the glimpses

Auditory scene analysis identifies spectral
regions dominated by a single source
Harmonicity
Common amplitude modulation
Sound source location
Local SNR estimates can be used to compensate for
predictable noise sources.

Cooke 91
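
A local SNR mask for a predictable noise source might be sketched as follows. The assumptions here are not from the slides: the noise is stationary, the first few frames contain no speech, the speech energy is estimated by subtracting the noise estimate, and a cell is marked reliable when its estimated local SNR exceeds a threshold.

```python
import math

def estimate_noise(spectrogram, n_noise_frames):
    """Average the first n_noise_frames frames, assumed to be speech-free."""
    frames = spectrogram[:n_noise_frames]
    n_channels = len(frames[0])
    return [sum(f[c] for f in frames) / len(frames) for c in range(n_channels)]

def local_snr_db(energy, noise):
    """Local SNR estimate after subtracting the noise estimate from the cell energy."""
    speech = max(energy - noise, 0.0)
    if speech == 0.0:
        return float("-inf")
    if noise == 0.0:
        return float("inf")
    return 10.0 * math.log10(speech / noise)

def snr_mask(spectrogram, noise, threshold_db=0.0):
    """True marks a time/frequency cell as reliable (speech-dominated)."""
    return [[local_snr_db(e, n) > threshold_db for e, n in zip(frame, noise)]
            for frame in spectrogram]

# spectrogram[t][c]: energy in frame t, channel c (hypothetical values).
spec = [[1.0, 1.0], [1.0, 1.0], [10.0, 1.5]]
noise = estimate_noise(spec, n_noise_frames=2)
mask = snr_mask(spec, noise)
print(mask)
```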
11
Harmonicity Masks

Only meaningful in voiced segments
Can be combined with SNR masks

12
Aurora Results (Sept 2001)
Average gain over clean baseline under all
conditions: 65%
13
Missing data masks from spatial location
Sue Harding, Guy Brown

Cues for spatial location are used to separate a
target source from masking sources
Interaural Time Difference (ITD) from
cross-correlation between left and right binaural
signals
Interaural Level Difference (ILD) from ratio of
energy in left and right ears
Soft masks
Task:
Target source: male speaker straight ahead
One or two masking sources (also male speakers)
at other positions
Added reverberation
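
The two spatial cues can be sketched for a single frame of one frequency channel. This is an illustrative sketch only: a real system would compute the cues per channel of an auditory filterbank, and the function names and sign convention here are assumptions.

```python
import math

def estimate_itd(left, right, max_lag):
    """Lag (in samples) that maximises the left/right cross-correlation.

    A positive lag means the right-ear signal is delayed relative to the left.
    """
    n = len(left)
    best_lag, best_val = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        val = sum(left[t] * right[t + lag]
                  for t in range(max(0, -lag), min(n, n - lag)))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag

def estimate_ild_db(left, right):
    """Interaural level difference: ratio of left to right energy, in dB."""
    e_left = sum(v * v for v in left)
    e_right = sum(v * v for v in right)
    return 10.0 * math.log10(e_left / e_right)

# Hypothetical source nearer the left ear: the right ear hears it
# 3 samples later and at half the amplitude.
left = [0.0] * 64
left[10:15] = [1.0, 2.0, 3.0, 2.0, 1.0]
right = [0.0] * 3 + [0.5 * v for v in left[:-3]]

itd = estimate_itd(left, right, max_lag=8)
ild = estimate_ild_db(left, right)
print(itd, ild)
```

A soft mask could then weight each time/frequency cell by how closely its ITD and ILD match the values expected for the target direction.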

14
Missing data masks from spatial location (2)
Conditions compared: oracle, ITD only, ILD only,
combined ITD and ILD. Best performance is with
combined ITD and ILD
[Figure: results plotted against azimuth of masker (degrees)]
15
MD for reverberant conditions (1)

Palomäki, Brown and Barker have applied MD to the
problem of room reverberation
Use spectral normalization to deal with
distortion caused by early reflections
Treat late reverberation as additive noise, and
apply standard MD techniques.
Select features which are uncontaminated by
reverberation and contain strong speech energy.

Approach based on modulation filtering
Each rate map channel passed through modulation
filter
Identify periods with enough energy in the
filtered output
Use these to define mask on original rate map
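
The three steps above can be sketched as follows. The filter used here, a short moving average followed by a first difference, is only a crude stand-in chosen for illustration; it is not the modulation filter used by Palomäki, Brown and Barker, and the threshold is hypothetical.

```python
def modulation_filter(channel, smooth_len=3):
    """Crude band-pass: moving average (low-pass) then first difference (high-pass)."""
    smoothed = []
    for t in range(len(channel)):
        window = channel[max(0, t - smooth_len + 1): t + 1]
        smoothed.append(sum(window) / len(window))
    return [0.0] + [smoothed[t] - smoothed[t - 1] for t in range(1, len(smoothed))]

def reverberation_mask(rate_map, threshold):
    """rate_map[c][t]: mark a cell reliable where the filtered output exceeds threshold."""
    return [[v > threshold for v in modulation_filter(channel)]
            for channel in rate_map]

# One hypothetical channel: a sharp onset followed by a slowly decaying tail.
channel = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 8.0, 6.0, 4.0, 2.0, 0.0, 0.0]
mask = reverberation_mask([channel], threshold=1.0)
print(mask[0])
```

In this toy channel the rising onset is kept and the decaying tail, which is where late reverberation would accumulate, is marked as missing.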

16
MD for reverberant conditions (2)

Recognition of connected digits (Aurora 2)
Reverberated using recorded room impulse
responses
Performance comparable with Brian Kingsbury's
hybrid HMM-MLP recognizer
K. J. Palomäki, G. J. Brown and J. Barker
(2004) Speech Communication 43 (1-2), pp. 123-142

17
MD for music analysis (1)

Eggink and Brown have used MD techniques to
identify concurrent musical instrument sounds
Part of a system for transcribing chamber music
Identify the F0 of the target note, and only keep
its harmonics in the MD mask
Uses a GMM classifier for each instrument,
trained on isolated tones and short phrases
Tested on tones, phrases and commercial CD
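
Keeping only the harmonics of the target note in the mask might look like the sketch below. The channel centre frequencies, the relative tolerance, and the function name are assumptions made for illustration; the slides do not give these details.

```python
def harmonic_mask(centre_freqs_hz, f0_hz, tolerance=0.03):
    """True for channels whose centre frequency lies close to a harmonic of f0.

    `tolerance` is a relative deviation allowed around each harmonic.
    """
    mask = []
    for fc in centre_freqs_hz:
        k = max(1, round(fc / f0_hz))       # index of the nearest harmonic
        harmonic = k * f0_hz
        mask.append(abs(fc - harmonic) <= tolerance * harmonic)
    return mask

# Hypothetical channels for a target note with F0 = 200 Hz.
mask = harmonic_mask([200.0, 310.0, 405.0, 590.0, 1000.0], f0_hz=200.0)
print(mask)
```

Channels marked False would be treated as missing when the frame is scored against each instrument model.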

18
MD for music analysis (2)
Example duet for flute and clarinet
All instrument tones correctly identified in this
example
[Figure: flute and clarinet panels, fundamental frequency (Hz) against time (frames)]
J. Eggink and G. J. Brown (2003) Proc. ICASSP,
Hong Kong, IV, pp. 553-556
J. Eggink and G. J. Brown (2004) Proc. ICASSP,
Montreal, V, pp. 217-220
19
Multisource Decoding
Use primitive ASA and local SNR to identify
time-frequency regions (fragments) dominated by a
single source, i.e. possible segregations S,
but NOT to decide what the best segregation is
Instead, jointly optimise over the word sequence
W and the segregation S
Decoding algorithm finds the best subset of
fragments to match the speech source
Based on missing data techniques: regions
hypothesised as non-speech are missing
Barker, Cooke & Ellis 2003
20
Multisource decoding algorithm
Work forward in time, maintaining a set of
alternative decodings: Viterbi searches based
on a choice of speech fragments.
When a new fragment arrives, split decodings:
speech or non-speech?
When a fragment ends, merge decoders which differ
only in its interpretation.
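
The split and merge bookkeeping can be sketched on a toy problem. This is a heavily simplified illustration, not the decoder of Barker, Cooke and Ellis: the HMM state search is left out, and a caller-supplied `frame_score` (hypothetical) stands in for the missing-data likelihood of a frame given which active fragments are treated as speech.

```python
def multisource_decode(n_frames, fragments, frame_score):
    """Find the fragment labelling with the best total score.

    fragments: dict name -> (start, end), with end exclusive.
    frame_score(t, speech): log score of frame t when the active fragments in
    the frozenset `speech` are treated as speech.
    Returns (best_score, set of fragments labelled as speech).
    """
    # Key: labels of the currently active fragments.
    # Value: (score so far, all fragments labelled speech so far).
    hyps = {frozenset(): (0.0, frozenset())}
    for t in range(n_frames):
        # Merge: forget labels of fragments that have ended, keep the best score.
        ended = {f for f, (s, e) in fragments.items() if e == t}
        if ended:
            merged = {}
            for key, (score, chosen) in hyps.items():
                new_key = frozenset((f, lab) for f, lab in key if f not in ended)
                if new_key not in merged or score > merged[new_key][0]:
                    merged[new_key] = (score, chosen)
            hyps = merged
        # Split: each starting fragment is either speech or non-speech.
        for f in sorted(f for f, (s, e) in fragments.items() if s == t):
            split = {}
            for key, (score, chosen) in hyps.items():
                split[key | {(f, True)}] = (score, chosen | {f})
                split[key | {(f, False)}] = (score, chosen)
            hyps = split
        # Extend every surviving hypothesis by one frame.
        for key, (score, chosen) in list(hyps.items()):
            speech_now = frozenset(f for f, lab in key if lab)
            hyps[key] = (score + frame_score(t, speech_now), chosen)
    best_score, best_chosen = max(hyps.values(), key=lambda v: v[0])
    return best_score, set(best_chosen)
```

Because hypotheses are merged as soon as a fragment ends, the number of live hypotheses depends on how many fragments overlap in time rather than on the total number of fragments.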
21
Multisource Decoding on Aurora 2
22
Multisource decoding with a competing speaker

Andre Coy and Jon Barker
Utterances of male and female speakers mixed at
0 dB
Voiced regions: soft harmonicity masks from
autocorrelation peaks
Voiceless regions: fragments from image
processing
Gender-dependent HMMs.
Separate decoding for male and female
73.7% accuracy on a connected digit task
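
One way a soft harmonicity value could be derived from an autocorrelation peak is sketched below. It assumes the pitch lag of the target speaker is already known and uses the normalised autocorrelation at that lag, clipped to the range 0 to 1, as the mask value. This is an illustration of the idea, not the method used by Coy and Barker.

```python
import math

def normalised_autocorrelation(frame, lag):
    """Autocorrelation at `lag`, divided by the frame energy."""
    num = sum(frame[t] * frame[t + lag] for t in range(len(frame) - lag))
    den = sum(v * v for v in frame)
    return num / den if den > 0 else 0.0

def soft_harmonicity(frame, pitch_lag):
    """Soft mask value in [0, 1]: how periodic the frame is at the given pitch lag."""
    return min(1.0, max(0.0, normalised_autocorrelation(frame, pitch_lag)))

# Hypothetical channel output that is periodic with a period of 8 samples.
frame = [math.sin(2.0 * math.pi * t / 8.0) for t in range(64)]
at_pitch = soft_harmonicity(frame, pitch_lag=8)   # high: matches the period
off_pitch = soft_harmonicity(frame, pitch_lag=4)  # clipped to 0: half a period
print(at_pitch, off_pitch)
```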

23
Informing Multisource Decoding: Work in progress

Ning Ma, Andre Coy, Phil Green
HMM Duration constraints
Links between fragments: pitch continuity
Speechiness

24
Speech separation challenge

Organisers: Martin Cooke (University of
Sheffield, UK), Te-Won Lee (UCSD, USA)
See http://www.dcs.shef.ac.uk/martin
Global comparison of techniques for separating
and recognising speech
Special session of Interspeech 2006 in Pittsburgh
(USA) from 17-21 September, 2006.
Task: recognise speech from a target talker in
the presence of either stationary noise or other
speech.
Training and test data supplied.
One signal per mixture (i.e. the task is "single
microphone").
Speech material: simple sentences from the Grid
task, e.g. "place white at L 3 now"
