Auditory Scene Analysis and Automatic Speech Recognition in Adverse Conditions



1
Auditory Scene Analysis and Automatic Speech
Recognition in Adverse Conditions
Phil Green
Speech and Hearing Research Group,
Department of Computer Science, University of
Sheffield
With thanks to Martin Cooke, Guy
Brown, Jon Barker..
2
Overview
  • Visual and Auditory Scene Analysis
  • Glimpsing in Speech Perception
  • Missing Data ASR
  • Finding the glimpses
  • Current Sheffield Work
  • Dealing with Reverberation
  • Identifying Musical Instruments
  • Multisource Decoding
  • Speech Separation Challenge

3
Visual Scenes and Auditory Scenes
Visual scenes:
  • Objects are opaque
  • Each spatial pixel images a single object
  • Object recognition has to cope with occlusion
Auditory scenes:
  • Sound is additive
  • Each time/frequency pixel receives contributions
    from many sound sources
  • Sound source recognition apparently requires
    reconstruction..

4
Glimpsing in auditory scenes: the dominance
effect (Cooke)
  • Although audio signals are additive, the
    occlusion metaphor is a good approximation because
    of log-like compression in the auditory system

Consequently, most regions in a mixture are
dominated by one or other source, leaving very
few ambiguous regions, even for a pair of speech
signals mixed at 0 dB.
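The dominance effect can be checked numerically. The sketch below is an illustration only: it assumes two non-negative time-frequency energy arrays of the same shape, and the function names and the 3 dB margin are hypothetical choices, not taken from the talk. It measures how far log(a + b) is from max(log a, log b), and what fraction of cells is clearly dominated by one source.

```python
import numpy as np

def log_max_error_db(energy_a, energy_b):
    """Error in dB of approximating log(a + b) by max(log a, log b), per cell."""
    a = np.asarray(energy_a, dtype=float)
    b = np.asarray(energy_b, dtype=float)
    return 10.0 * np.log10(a + b) - 10.0 * np.log10(np.maximum(a, b))

def dominance_fraction(energy_a, energy_b, margin_db=3.0):
    """Fraction of cells where one source exceeds the other by margin_db or more."""
    a = np.asarray(energy_a, dtype=float)
    b = np.asarray(energy_b, dtype=float)
    diff_db = 10.0 * np.log10(a / b)
    return float(np.mean(np.abs(diff_db) >= margin_db))
```

The error is largest (about 3 dB) when the two sources have equal energy in a cell and shrinks quickly as one source dominates, which is why the occlusion metaphor holds up after log-like compression.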
5
Can listeners handle glimpses?
6
The robustness problem in Automatic Speech
Recognition
  • Current ASR devices cannot tolerate additive
    noise, particularly if it is unpredictable
  • Listeners' noise tolerance is 1 or 2 orders of
    magnitude better in equivalent conditions
    (Lippmann 97)
  • Can glimpsing be used as the basis for robust
    ASR?
  • Requirements:
  • Adapt statistical ASR to the incomplete data case
  • Identify the glimpses

7
Classification with Missing Data
A common problem: visual occlusion, sensor
failure, transmission losses..
Need to evaluate the likelihood that observation
vector x was generated by class C, f(x|C)
Assume x has been partitioned into reliable and
unreliable parts, (x_r, x_u)
Two approaches:
Imputation: estimate x_u, then proceed as normal
Marginalisation: integrate over the possible range
of x_u
Marginalisation is preferable if there is no need
to reconstruct x
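In symbols, marginalisation scores class C from the reliable part alone. The lines below are a sketch of the standard identity; the bound symbols x_low and x_high are assumed notation for "the possible range" and do not appear on the slide.

```latex
% Full marginalisation: integrate the unreliable part out entirely
f(x_r \mid C) = \int f(x_r, x_u \mid C)\, dx_u

% Bounded marginalisation: the unreliable part is only known to lie in a range
f(x_r \mid C) \int_{x_{\mathrm{low}}}^{x_{\mathrm{high}}} f(x_u \mid x_r, C)\, dx_u
```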
8
The Missing Data Likelihood Computation
  • In ASR by continuous-density HMMs,
  • State distributions are Gaussian mixtures with
    diagonal covariance
  • The marginal is just the reduced-dimensionality
    distribution
  • The integral can be approximated by error
    functions (erf)
  • This is computed independently for each mixture
    component in the state distribution

Cooke et al. 2001
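A minimal sketch of this likelihood computation for one frame, assuming a diagonal-covariance Gaussian mixture. The function name, argument layout and the choice of bounds (a fixed lower bound, with the observed value as the upper bound) are illustrative assumptions, and the integral is left unnormalised here; this is not claimed to be the exact formulation in Cooke et al.

```python
import math
import numpy as np

def _norm_cdf(z):
    """Standard normal CDF written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def md_log_likelihood(x, reliable, weights, means, variances, lower=0.0):
    """Log-likelihood of frame x under a diagonal GMM with missing data.

    Reliable dimensions contribute the usual Gaussian density.
    Unreliable dimensions contribute the probability mass between
    `lower` and the observed value (assumed to be an upper bound).
    """
    x = np.asarray(x, dtype=float)
    reliable = np.asarray(reliable, dtype=bool)
    total = 0.0
    for w, mu, var in zip(weights, means, variances):
        mu = np.asarray(mu, dtype=float)
        sd = np.sqrt(np.asarray(var, dtype=float))
        comp = w  # each mixture component is handled independently
        for d in range(len(x)):
            if reliable[d]:
                z = (x[d] - mu[d]) / sd[d]
                comp *= math.exp(-0.5 * z * z) / (sd[d] * math.sqrt(2.0 * math.pi))
            else:
                upper_mass = _norm_cdf((x[d] - mu[d]) / sd[d])
                lower_mass = _norm_cdf((lower - mu[d]) / sd[d])
                comp *= upper_mass - lower_mass
        total += comp
    return math.log(total) if total > 0.0 else -math.inf
```

With these bounds, a class whose mean in an unreliable channel lies well above the observed energy receives little probability mass there, so it scores lower than a class whose mean is compatible with the observation.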
9
Counter-evidence from bounds
Example: class C matches the reliable evidence well,
but there is too little energy in the unreliable
components to be consistent with C, so the bounds
count against it
10
Finding the glimpses
  • Auditory scene analysis identifies spectral
    regions dominated by a single source
  • Harmonicity
  • Common amplitude modulation
  • Sound source location
  • Local SNR estimates can be used to compensate for
    predictable noise sources.

Cooke 91
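A local SNR mask can be sketched as follows. This is an illustration under stated assumptions: the noise is taken to be stationary and is estimated from the first few frames, which are assumed to contain no speech. The function name, the frame count and the 0 dB threshold are hypothetical.

```python
import numpy as np

def snr_mask(noisy_energy, noise_frames=10, threshold_db=0.0):
    """Binary missing-data mask from a local SNR estimate.

    noisy_energy: array of shape (frames, channels), non-negative energies.
    Returns True where the estimated local SNR exceeds threshold_db.
    """
    y = np.asarray(noisy_energy, dtype=float)
    # Noise estimate: mean energy of the leading frames (assumed speech-free)
    noise = y[:noise_frames].mean(axis=0)
    # Speech estimate by spectral subtraction, floored to stay positive
    speech = np.maximum(y - noise, 1e-12)
    local_snr_db = 10.0 * np.log10(speech / np.maximum(noise, 1e-12))
    return local_snr_db > threshold_db
```

Cells marked True would be treated as reliable and the rest as missing.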
11
Harmonicity Masks
  • Only meaningful in voiced segments
  • Can be combined with SNR masks

12
Aurora Results (Sept 2001)
Average gain over clean baseline under all
conditions: 65%
13
Missing data masks from spatial location
Sue Harding, Guy Brown
  • Cues for spatial location are used to separate a
    target source from masking sources
  • Interaural Time Difference (ITD) from
    cross-correlation between left and right binaural
    signals
  • Interaural Level Difference (ILD) from ratio of
    energy in left and right ears
  • Soft masks
  • Task:
  • Target source: male speaker straight ahead
  • One or two masking sources (also male speakers)
    at other positions
  • Added reverberation
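The two binaural cues can be sketched for a single pair of equal-length signals (for example, one frequency channel of one frame). The function name, the brute-force lag search and the sign convention are assumptions made for this illustration, not the system described on the slide.

```python
import numpy as np

def itd_ild(left, right, sample_rate, max_lag):
    """Estimate ITD (seconds) and ILD (dB) from left and right ear signals.

    ITD is the lag that maximises the cross-correlation; a positive value
    means the right signal lags the left. ILD is the left/right energy ratio.
    """
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    n = len(left)
    lags = np.arange(-max_lag, max_lag + 1)
    # c(k) = sum over n of left[n] * right[n + k], for each candidate lag k
    xcorr = np.array([
        np.sum(left[max(0, -k): n - max(0, k)] * right[max(0, k): n - max(0, -k)])
        for k in lags
    ])
    best_lag = int(lags[np.argmax(xcorr)])
    itd_seconds = best_lag / sample_rate
    ild_db = 10.0 * np.log10(np.sum(left ** 2) / np.sum(right ** 2))
    return itd_seconds, ild_db
```

A soft mask could then be formed by mapping how closely each cell's ITD and ILD match the values expected for the target direction to a number between 0 and 1; that mapping is not shown here.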

14
Missing data masks from spatial location (2)
Oracle ITD only, ILD only, combined ITD and
ILD. Best performance is with combined ITD and
ILD
[Plot: results against azimuth of masker (degrees)]
15
MD for reverberant conditions (1)
  • Palomäki, Brown and Barker have applied MD to the
    problem of room reverberation
  • Use spectral normalization to deal with
    distortion caused by early reflections
  • Treat late reverberation as additive noise, and
    apply standard MD techniques.
  • Select features which are uncontaminated by
    reverberation and contain strong speech energy.
  • Approach based on modulation filtering
  • Each rate map channel passed through modulation
    filter
  • Identify periods with enough energy in the
    filtered output
  • Use these to define mask on original rate map

16
MD for reverberant conditions (2)
  • Recognition of connected digits (Aurora 2)
  • Reverberated using recorded room impulse
    responses
  • Performance comparable with Brian Kingsbury's
    hybrid HMM-MLP recognizer
  • K. J. Palomäki, G. J. Brown and J. Barker
    (2004) Speech Communication 43 (1-2), pp. 123-142

17
MD for music analysis (1)
  • Eggink and Brown have used MD techniques to
    identify concurrent musical instrument sounds
  • Part of a system for transcribing chamber music
  • Identify the F0 of the target note, and only keep
    its harmonics in the MD mask
  • Uses a GMM classifier for each instrument,
    trained on isolated tones and short phrases
  • Tested on tones, phrases and commercial CD
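Keeping only the harmonics of the target note in the mask can be sketched like this. The function name and the 3% relative tolerance are hypothetical, and the frequency channels are represented simply by their centre frequencies.

```python
import numpy as np

def harmonic_mask(f0_hz, centre_freqs_hz, tolerance=0.03):
    """True for channels whose centre frequency lies close to a harmonic of f0."""
    freqs = np.asarray(centre_freqs_hz, dtype=float)
    # Nearest harmonic number for each channel (at least the fundamental)
    harmonic_number = np.maximum(np.round(freqs / f0_hz), 1.0)
    nearest = harmonic_number * f0_hz
    return np.abs(freqs - nearest) <= tolerance * nearest
```

For a 440 Hz target, channels centred at 440, 880 and 1320 Hz are kept, while channels at 650 and 1000 Hz are treated as missing.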

18
MD for music analysis (2)
  • Example duet for flute and clarinet
  • All instrument tones correctly identified in this
    example
  • J. Eggink and G. J. Brown (2003) Proc. ICASSP,
    Hong Kong, IV, pp. 553-556
  • J. Eggink and G. J. Brown (2004) Proc. ICASSP,
    Montreal, V, pp. 217-220

[Figure: fundamental frequency (Hz) against time (frames), with the flute and clarinet parts labelled]
19
Multisource Decoding
Use primitive ASA and local SNR to identify
time-frequency regions (fragments) dominated by a
single source, i.e. possible segregations S,
but NOT to decide what the best segregation is
Instead, jointly optimise over the word sequence
W and S
Decoding algorithm finds the best subset of
fragments to match the speech source
Based on missing data techniques: regions
hypothesised as non-speech are missing
Barker, Cooke and Ellis 2003
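Written as a formula, the joint optimisation looks like the line below. The symbol Y for the observed acoustic features is assumed notation; only W and S come from the slide.

```latex
(\hat{W}, \hat{S}) = \arg\max_{W,\,S} P(W, S \mid Y)
```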
20
Multisource decoding algorithm
Work forward in time, maintaining a set of
alternative decodings: Viterbi searches, each based
on a choice of speech fragments.
When a new fragment arrives, split the decodings:
speech or non-speech?
When a fragment ends, merge decoders which differ
only in its interpretation.
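The split and merge bookkeeping can be sketched with a toy decoder. This is a simplification under stated assumptions: each hypothesis carries a single running score rather than a full Viterbi search over HMM states, and `frame_score` is a hypothetical callback standing in for the missing-data likelihood of a frame given which active fragments are treated as speech.

```python
def multisource_decode(fragments, num_frames, frame_score):
    """Toy fragment decoder.

    fragments: dict mapping fragment id -> (start, end), end exclusive.
    frame_score(t, speech_ids): log-score of frame t when the active
        fragments in the frozenset speech_ids are labelled speech.
    Returns (best_score, frozenset of fragment ids labelled speech).
    """
    # key: active fragments labelled speech; value: (score, all speech labels so far)
    hyps = {frozenset(): (0.0, frozenset())}
    for t in range(num_frames):
        # Split: every hypothesis forks when a fragment starts
        for fid, (start, end) in fragments.items():
            if start == t:
                split = {}
                for active, (score, chosen) in hyps.items():
                    split[active] = (score, chosen)                  # non-speech
                    split[active | {fid}] = (score, chosen | {fid})  # speech
                hyps = split
        # Extend every hypothesis by one frame
        hyps = {a: (s + frame_score(t, a), c) for a, (s, c) in hyps.items()}
        # Merge: when a fragment ends, keep the better of each pair of
        # hypotheses that differ only in how that fragment was labelled
        for fid, (start, end) in fragments.items():
            if end == t + 1:
                merged = {}
                for active, (score, chosen) in hyps.items():
                    key = active - {fid}
                    if key not in merged or score > merged[key][0]:
                        merged[key] = (score, chosen)
                hyps = merged
    return max(hyps.values(), key=lambda h: h[0])
```

Because merging happens as soon as a fragment ends, the number of live hypotheses depends on how many fragments overlap in time, not on the total number of fragments.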
21
Multisource Decoding on Aurora 2
22
Multisource decoding with a competing speaker
  • Andre Coy and Jon Barker
  • Utterances of male and female speakers mixed at
    0 dB
  • Voiced regions: soft harmonicity masks from
    autocorrelation peaks
  • Voiceless regions: fragments from image
    processing
  • Gender-dependent HMMs
  • Separate decoding for male and female
  • 73.7% accuracy on a connected digit task
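The autocorrelation cue behind the harmonicity masks can be sketched for one signal frame. The function name and the pitch search range are assumptions; using the normalised peak height as a soft voicing value is one plausible choice, not necessarily the one used by Coy and Barker.

```python
import numpy as np

def autocorr_pitch(frame, sample_rate, fmin=80.0, fmax=400.0):
    """Return (pitch in Hz, peak strength in [0, 1]) from the autocorrelation."""
    x = np.asarray(frame, dtype=float)
    x = x - x.mean()
    # Autocorrelation at lags 0 .. len(x) - 1
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    lo = int(sample_rate / fmax)   # shortest pitch period considered
    hi = int(sample_rate / fmin)   # longest pitch period considered
    lag = lo + int(np.argmax(ac[lo:hi + 1]))
    strength = ac[lag] / ac[0] if ac[0] > 0 else 0.0
    return sample_rate / lag, float(strength)
```

A strength near 1 indicates a strongly periodic (voiced) frame, and a value near 0 indicates little harmonic structure.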

23
Informing Multisource Decoding: Work in progress
  • Ning Ma, Andre Coy, Phil Green
  • HMM duration constraints
  • Links between fragments: pitch continuity
  • Speechiness

24
Speech separation challenge
  • Organisers: Martin Cooke (University of
    Sheffield, UK), Te-Won Lee (UCSD, USA)
  • See http://www.dcs.shef.ac.uk/martin
  • Global comparison of techniques for separating
    and recognising speech
  • Special session of Interspeech 2006 in Pittsburgh
    (USA), 17-21 September 2006
  • Task: recognise speech from a target talker in
    the presence of either stationary noise or other
    speech
  • Training and test data supplied
  • One signal per mixture (i.e. the task is "single
    microphone")
  • Speech material: simple sentences from the Grid
    task, e.g. "place white at L 3 now"