Perceptual compensation for reverberation: computer modelling

1
Perceptual compensation for reverberation: computer modelling
  • Guy Brown and Amy Beeston
  • Department of Computer Science
  • University of Sheffield

2
Overview
  • Work in this period (WP2)
  • Development of a computer model of perceptual
    compensation for reverberation
  • Metrics for assessing the amount of reverberation
    present
  • Models based on nonlinear cochlear filterbank
  • Work in the next period
  • WP2: within-band mechanisms
  • WP4: exploiting statistics of naturally occurring
    sounds

3
Work in this period
4
Modelling approach
  • Aim to build computer models that replicate the
    performance of human subjects in specific
    perceptual experiments
  • Current focus is compensation for the effects of
    reverberation in the sir/stir continuum
    (Watkins, 2005)
  • Could the auditory efferent system play a role?
  • Modelling the efferent system provides a useful
    framework for the modelling effort
  • factors that determine amount of efferent
    suppression
  • effect of linear vs. nonlinear filtering
  • within-band vs. across-band processing

5
Watkins (2005) sir/stir experiment
[Figure: a context phrase precedes a test word, and the sir/stir category boundary is recorded. Conditions shown: clean speech context and test; test reverberated (more sir responses); test and context reverberated (compensation: more stir responses).]
6
Watkins (2005) experiment 5
  • Compensation measured with contexts reverberated
    by time-reversed room impulse responses
  • slowly decaying tails do not occur at offsets
  • slowly growing heads added at onsets
  • is compensation related to reverberation tails?
  • Compensation measured with time-reversed speech
    carrier contexts
  • do words need to be identifiable for compensation
    to occur?

7
Experiment 5 pictorially
[Figure: four context conditions. Forward speech carrier with forward reverb or reverse reverb; reversed speech carrier with forward reverb or reverse reverb.]
8
Results from Watkins (2005) experiment 5
  • Compensation markedly reduced in reversed
    reverberation conditions
  • for both forward and reversed speech carrier
  • Compensation remains substantial in both forward
    and reversed speech carrier conditions
  • slight reduction in compensation with reversed
    speech carrier
  • Conclusions
  • Reverberation tails appear to be important for
    compensation
  • Intelligibility of the carrier is not required

9
Putative role of auditory efferent processing
  • Reverberating a speech signal reduces its dynamic
    range
  • Compensation could be characterised as
    restoration of dynamic range?
  • Efferent system implicated in control of dynamic
    range via a closed-loop feedback mechanism
    (Guinan & Gifford, 1988)
  • Evidence that compensation effects are primarily
    within-frequency-channel
  • Feedback from efferent system appears to be
    fairly narrowly tuned
  • Plausible time scales
  • Efferent feedback is sluggish, acting over
    100-200 ms, with longer-term effects over tens of
    seconds (Sridhar et al., 1997).

10
Main features of computer model
  • Medial olivocochlear (MOC) system is known to
    exert a suppressive influence on the basilar
    membrane
  • Adapted models of Ferry & Meddis (2007) and
    Ghitza (2007)
  • Suppression modelled by attenuation in nonlinear
    path of dual-resonance-nonlinear (DRNL)
    filterbank
  • Amount of efferent suppression determined by
    metric applied to auditory nerve response
  • May be within-channel or across-channel
  • Metric computed over appropriate time scale
  • Eventually this should be implemented as a
    closed-loop feedback system (currently not)

11
Schematic of the model
[Figure: block diagram. Labels: stimulus; auditory periphery (outer/middle ear, DRNL, hair cell); AN response; framing; DCT (optional); DTW-based recogniser; efferent system (metric, efferent attenuation).]
12
Dual resonance nonlinear filterbank (DRNL)
  • Originally proposed by Meddis, O'Mard and
    Lopez-Poveda (2001), human parameters from Meddis
    (2006)
  • Efferent attenuation introduced by Ferry and
    Meddis (2007)

13
Simple hair cell model
  • Hair cell is a simple threshold and rate limiter
    as described by Messing (2007).
  • Half-wave rectification.
  • Linear output between threshold and saturated
    firing rate.
  • Tuned to approximate low spontaneous rate AN
    fibres.
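
The hair cell stage described above can be sketched as follows. This is only an illustration in Python (the model itself was implemented in MATLAB): the function name and the `threshold`, `saturation` and `max_rate` values are assumptions chosen for the example, not the parameters of Messing (2007) or of the model.

```python
import numpy as np

def simple_hair_cell(bm, threshold=0.1, saturation=1.0, max_rate=200.0):
    """Threshold-and-rate-limiter sketch of a hair cell.

    bm: basilar membrane output (arbitrary units, hypothetical scale).
    Output is 0 below `threshold`, rises linearly up to `saturation`,
    and is clipped at `max_rate` (spikes/s) above it.
    """
    bm = np.asarray(bm, dtype=float)
    rectified = np.maximum(bm, 0.0)                    # half-wave rectification
    scaled = (rectified - threshold) / (saturation - threshold)
    return max_rate * np.clip(scaled, 0.0, 1.0)        # linear region, then limit

rates = simple_hair_cell([-1.0, 0.1, 0.55, 2.0])
```

Raising `threshold` relative to the input range is one way to mimic the high thresholds of low spontaneous rate fibres in such a sketch.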

14
Rate-level response: effect of suppression
15
Auditory spectrograms: effect of suppression
16
Control of efferent attenuation
  • Intend to implement a closed-loop system in which
    a metric is applied to the AN response over a
    sliding window and used to determine the efferent
    attenuation.
  • Currently:
  • Only investigated metrics applied to the AN
    response pooled over all channels; within-channel
    mechanisms will be addressed next.
  • Estimated metric over 1 s preceding the target
    word, and then used this to set the efferent
    attenuation for the remainder of the stimulus

17
Template-based speech recogniser
  • Use a template-based speech recogniser based on
    dynamic time warping (DTW) and cosine distance.
  • Templates are sir and stir from extreme ends
    of the dry (unreverberated) continuum
  • DTW not necessary at this stage, but later intend
    to model both fast and slow cases in Watkins'
    study.
  • Features for recognition were either:
  • Firing rate, computed at 5 ms intervals over 20
    ms Hann window
  • DCT-transformed firing rate (15 coefficients, not
    including the first)
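
A minimal sketch of the feature extraction and template matching listed above, written in Python for illustration. The function names, the unnormalised DCT-II, the DTW step pattern and the assumption that the DCT is taken across frequency channels are all choices made for this example; the slides do not specify them.

```python
import numpy as np

def frame_rate(an, fs, win_ms=20.0, hop_ms=5.0):
    """Hann-weighted mean firing rate per frame.

    an: array (channels, samples). Returns array (frames, channels).
    Assumes the signal is at least one window long.
    """
    an = np.asarray(an, dtype=float)
    win = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    w = np.hanning(win)
    w = w / w.sum()
    n_frames = 1 + (an.shape[1] - win) // hop
    return np.stack([an[:, i * hop:i * hop + win] @ w for i in range(n_frames)])

def dct_features(frames, n_coeffs=15):
    """DCT-II across channels, keeping coefficients 1..n_coeffs (0 is dropped)."""
    n = frames.shape[1]
    k = np.arange(1, n_coeffs + 1)[:, None]
    basis = np.cos(np.pi * k * (np.arange(n) + 0.5) / n)
    return frames @ basis.T

def cosine_distance(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return 1.0 - float(a @ b) / denom if denom > 0 else 1.0

def dtw_cost(x, y):
    """Accumulated cosine distance along the best warping path."""
    nx, ny = len(x), len(y)
    d = np.full((nx + 1, ny + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            c = cosine_distance(x[i - 1], y[j - 1])
            d[i, j] = c + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[nx, ny]

def classify(test, templates):
    """Return the name of the template with the lowest DTW cost."""
    return min(templates, key=lambda name: dtw_cost(test, templates[name]))
```

In use, `templates` would hold the feature sequences for the dry sir and stir end points, and `classify` would be called once per continuum step.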

18
Configuration of the model
  • Outer/middle ear and DRNL modified from code
    supplied by Ray Meddis and Robert Ferry (Essex
    University)
  • 80 frequency channels
  • Best frequencies in range 100 Hz to 8 kHz, log
    spaced
  • Stimuli presented at level of 56 dB SPL
  • Implemented in MATLAB
  • Invested some time developing framework for
    running simulations on Sheffield's computing grid
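
For reference, the channel layout above (80 best frequencies, log spaced from 100 Hz to 8 kHz) can be generated as in this short Python sketch. The variable name is an assumption, and the original MATLAB code may have computed the spacing differently.

```python
import numpy as np

# 80 best frequencies, equally spaced on a logarithmic axis, in Hz
best_freqs = np.geomspace(100.0, 8000.0, num=80)
```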

19
Scoring the category boundary
  • Watkins (2005) characterised listeners' responses
    in terms of shifts in the category boundary
  • 11 steps of continuum presented three times
  • category boundary computed as (total number of
    sir responses)/3 - 0.5, giving a step between -0.5
    and 10.5
  • Model produces same output for each presentation
    of 11 steps
  • 11 steps of continuum presented once
  • category boundary computed as (total number of
    sir responses) - 0.5, giving same range of steps
  • quantisation of step is greater for model
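
The scoring rule above is simple enough to state as code. This Python sketch uses a hypothetical function name; the arithmetic follows the two formulas on the slide.

```python
def category_boundary(n_sir_responses, n_presentations):
    """Category boundary in continuum steps.

    n_sir_responses: total number of "sir" responses over all 11 steps.
    n_presentations: how many times each step was presented
                     (3 for listeners, 1 for the model).
    """
    return n_sir_responses / n_presentations - 0.5

# Listeners: 11 steps x 3 presentations, so 0..33 "sir" responses
listener_range = (category_boundary(0, 3), category_boundary(33, 3))

# Model: each step presented once, so 0..11 "sir" responses
model_responses = ["sir"] * 7 + ["stir"] * 4
model_boundary = category_boundary(model_responses.count("sir"), 1)
```

With three presentations the boundary moves in thirds of a step, whereas the model's boundary can only move in whole steps, which is the coarser quantisation noted above.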

20
Manual tuning of efferent attenuation (rate)
[Figure panels: listeners (Watkins, 2005); computer model.]
21
Category boundary vs. attenuation (rate)
[Figure: category boundary vs. attenuation for forward and reverse reverberation, with forward and reversed speech carriers.]
22
Efferent attenuation vs mean-to-peak ratio
  • Need a function that maps the amount of
    reverberation in the context to the efferent
    attenuation
  • Should be:
  • insensitive to time reversal of speech
  • sensitive to reversal of the impulse response
  • Mean-to-peak ratio of pooled AN response gives
    reasonable linear fit
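
A sketch of how the mean-to-peak metric and its linear mapping to attenuation might be computed, in Python for illustration. Several details are assumptions: pooling is taken to be a sum over channels, the metric is measured over the 1 s before the test word as in the current open-loop setup, and the slope `m` and intercept `c` are placeholders because the fitted values are not given in the slides.

```python
import numpy as np

def mean_to_peak(an, fs, context_end_s, window_s=1.0):
    """Mean-to-peak ratio of the pooled AN response before the test word.

    an: array (channels, samples) of firing rates.
    context_end_s: time (s) at which the test word starts.
    """
    pooled = np.asarray(an, dtype=float).sum(axis=0)   # pool over channels
    stop = int(round(context_end_s * fs))
    start = max(0, stop - int(round(window_s * fs)))
    segment = pooled[start:stop]
    peak = segment.max()
    return float(segment.mean() / peak) if peak > 0 else 0.0

def efferent_attenuation_db(metric, m, c):
    """Linear map from metric to attenuation; m and c are hypothetical."""
    return m * metric + c
```

One would expect reverberation to fill the gaps between peaks and so raise the ratio, which is what makes a monotonic mapping to attenuation plausible.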

23
Results: autonomous model (firing rate)
attenuation = m × metric + c
  • Not a good fit to listener data, although for the
    near (0.32 m) context we get a shift in boundary
    when the test word is reverberated, and the right
    pattern of compensation for the far (10 m) test
    word.

24
Results: autonomous model (DCT)
attenuation = m × metric + c
  • Better fit of general pattern, but size of
    category boundary changes is not well matched

25
Other reverb metrics: kurtosis
  • Kurtosis is μ₄/σ⁴, where μ₄ is the fourth
    central moment and σ is the standard deviation
  • Measures the peakedness of the p.d.f. of a
    random variable
  • Used as a reverberation metric in a number of
    blind dereverberation studies
  • Reverberated speech has a lower kurtosis (more
    Gaussian)
  • Not well correlated with efferent attenuation
    (e.g., kurtosis is affected by reversing the
    context)
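
The kurtosis definition above, as a short Python sketch. This is the non-excess form (a Gaussian gives a value of about 3); whether the original analysis used this or the excess form is not stated.

```python
import numpy as np

def kurtosis(x):
    """Fourth central moment divided by the squared variance."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    variance = np.mean(d ** 2)
    return float(np.mean(d ** 4) / variance ** 2)

# A sparse, peaky signal has a higher kurtosis than Gaussian noise
peaky = kurtosis([0.0] * 9 + [10.0])
gaussian = kurtosis(np.random.default_rng(0).standard_normal(100000))
```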

26
Other reverb metrics: offset density
  • Reverberation reduces the occurrence of sharp
    offsets in the temporal envelope
  • Offset density: number of consecutive TF bins
    that differ by more than x dB in value
  • Coherence across frequency is enforced
  • Reasonable low-order fit should be possible
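
One possible reading of the offset density metric, sketched in Python. The slide leaves the details open, so everything here is an assumption: only falls in level between consecutive frames are counted, coherence across frequency is enforced by requiring a minimum fraction of channels to fall in the same frame, and the default `x_db` and `min_fraction` values are arbitrary.

```python
import numpy as np

def offset_density(level_db, x_db=6.0, min_fraction=0.5):
    """Count TF bins that take part in a sharp, frequency-coherent offset.

    level_db: array (channels, frames) of levels in dB.
    """
    level_db = np.asarray(level_db, dtype=float)
    drops = level_db[:, :-1] - level_db[:, 1:]        # positive when level falls
    is_offset = drops > x_db
    coherent = is_offset.mean(axis=0) >= min_fraction  # enough channels agree
    return int(np.sum(is_offset[:, coherent]))

sharp = offset_density([[60, 60, 40, 40], [60, 60, 40, 40]])
smeared = offset_density([[60, 57, 54, 51], [60, 57, 54, 51]])
```

In this toy case the abrupt 20 dB fall is counted in both channels, while the gradual decay typical of a reverberant tail produces no offsets at all.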

27
Other reverb metrics: measuring tails
  • A metric that emphasises reverberant tails may
    provide a better fit, sensitive to
    reverse-reverberation conditions
  • Mean-to-peak ratio of negative part of
    differentiated temporal envelope
  • Not well correlated with efferent attenuation
    (somewhat sensitive to reversal of context)
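
The tail metric above can be sketched as follows in Python. The function name is hypothetical and the envelope is assumed to be a 1-D array that has already been extracted and smoothed; how the envelope was obtained in the original work is not stated.

```python
import numpy as np

def tail_metric(envelope):
    """Mean-to-peak ratio of the falling slopes of a temporal envelope."""
    env = np.asarray(envelope, dtype=float)
    falling = np.maximum(-np.diff(env), 0.0)   # negative part of the derivative
    peak = falling.max()
    return float(falling.mean() / peak) if peak > 0 else 0.0

sharp_offset = tail_metric([0, 1, 1, 0, 0, 0, 0])
slow_tail = tail_metric([0, 1, 1, 0.75, 0.5, 0.25, 0])
```

A slowly decaying tail spreads the same total fall over many samples, so its ratio is higher than that of an abrupt offset.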

28
Conclusions
  • The mean-to-peak ratio of the pooled AN response
    gives a reverberation metric that is easy to map
    (linearly) to efferent attenuation
  • Key problem with the current model is steep
    function relating category boundary to amount of
    attenuation
  • Are the templates appropriate?
  • Could matching be done another way?
  • But
  • Currently this works on pooled AN response
  • Maybe a good fit to listener data cannot be
    obtained unless efferent attenuation is adjusted
    within individual channels?

29
Work in the next period: WP2 within-band
mechanisms
30
Summary of planned WP2 work
  • Within-channel mechanisms
  • Reverberation metrics that emphasize tails
  • Comparison of linear and nonlinear models
  • Distance metrics
  • Frequency-dependent suppression
  • Model of efferent processing as a front-end for
    automatic speech recognition

31
Distance metrics
  • Templates are currently matched using cosine
    distance based on firing rate or DCT features.
  • Perceptually motivated metrics may be more
    appropriate
  • A number of metrics in the literature emphasize
    formant peak frequencies (applied to vowel
    identification)
  • Weighted spectral slope metric (Klatt, 1982)
  • Weighted level metric (Assmann & Summerfield,
    1989)
  • Weighted negative second differential metric
    (Assmann & Summerfield, 1989)

32
Frequency-dependent suppression
  • Guinan & Gifford (1988) find a fall-off in the
    effect of efferent-induced threshold shift at
    low BFs (data from cat)
  • This will improve representation of
    low-frequency speech structure when efferent
    attenuation is high

33
Model of efferent processing as a front-end for
ASR
  • Have applied model of efferent processing to ASR
    in work with Ray Meddis and Robert Ferry (Essex)
  • Open loop, complex hair cell
  • Efferent suppression improves representation of
    speech in broadband noise, and hence recognition
    accuracy
  • Evaluate reverberation robustness (of simpler
    model) in next period

34
Work in the next period: WP4 exploiting
statistics of naturally occurring sounds
35
Summary of previous work
  • Previously in work with Kalle Palomäki we have
    used missing data to handle reverberation in ASR
  • Train acoustic models for ASR on clean speech
  • Compute a binary time-frequency (TF) mask in
    which TF units dominated by speech are labelled
    as reliable and TF units dominated by
    reverberation are labelled as unreliable
  • Treat reliable and unreliable regions differently
    during ASR decoding using missing feature theory
    (Cooke et al.)
  • Hynek (yesterday): are the least reverberated
    regions the most reliable?

36
Reverberation mask and oracle mask
[Figure: time-frequency panels: clean speech "98415"; reverberated speech (T60 = 0.7 s, S/R dist. = 3.05 m); oracle mask (15 dB criterion); reverberation mask. Axes: time and frequency.]
37
Schematic of current system
[Figure: block diagram. Labels: reverberated speech; auditory filterbank; firing rate; spectral normalization; spectral features; modulation filter; local slope of envelope; blurredness; GMM classifier; mask; missing data speech recognizer. Shadowed boxes indicate processing in multiple frequency channels.]
38
Mask derived from GMM classifier
  • For the development set of 300 utterances in 4
    reverberation conditions
  • Compute oracle masks that maximize speech
    recognition accuracy
  • Probability densities for reliable regions P_r,j
    and unreliable regions P_u,j in oracle masks are
    modelled with 3-mixture GMMs, trained using EM
    with feature vectors x_ij
  • GMMs trained separately for each channel
  • During testing, for feature vector x_ij set mask
    as
  • mask_ij = 1 iff P_r,j(x_ij) > P_u,j(x_ij), 0 otherwise
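
The decision rule above can be sketched for a single channel as follows, in Python. This is an illustration under stated assumptions rather than the system's implementation: the GMMs use diagonal covariances, the EM loop runs for a fixed number of iterations from a random initialisation, and all function names are hypothetical.

```python
import numpy as np

def component_log_probs(x, weights, means, variances):
    """Log of weight times Gaussian density, per sample and component."""
    diff = x[:, None, :] - means[None, :, :]
    log_gauss = -0.5 * np.sum(diff ** 2 / variances
                              + np.log(2 * np.pi * variances), axis=2)
    return np.log(weights) + log_gauss

def fit_gmm(x, n_mix=3, n_iter=50, seed=0):
    """Fit a diagonal-covariance GMM to x (samples, dims) with EM."""
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    means = x[rng.choice(n, n_mix, replace=False)]
    variances = np.tile(x.var(axis=0) + 1e-6, (n_mix, 1))
    weights = np.full(n_mix, 1.0 / n_mix)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each sample
        log_p = component_log_probs(x, weights, means, variances)
        log_norm = np.logaddexp.reduce(log_p, axis=1, keepdims=True)
        resp = np.exp(log_p - log_norm)
        # M-step: re-estimate weights, means and variances
        nk = resp.sum(axis=0) + 1e-10
        weights = nk / n
        means = (resp.T @ x) / nk[:, None]
        variances = np.maximum((resp.T @ x ** 2) / nk[:, None] - means ** 2, 1e-6)
    return weights, means, variances

def log_likelihood(x, gmm):
    x = np.asarray(x, dtype=float)
    return np.logaddexp.reduce(component_log_probs(x, *gmm), axis=1)

def reliability_mask(x, gmm_reliable, gmm_unreliable):
    """1 where the reliable model is more likely than the unreliable one."""
    return (log_likelihood(x, gmm_reliable)
            > log_likelihood(x, gmm_unreliable)).astype(int)
```

Because the GMMs are trained separately for each channel, a full system would hold one (reliable, unreliable) pair per channel and apply `reliability_mask` to that channel's feature vectors.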

39
Training procedure (for each channel)
40
Speech recognition experiments
  • Spoken digit strings from Aurora 2.0 corpus
  • HMM-based ASR system trained on clean training
    section of Aurora corpus
  • Word models trained for each digit (1-9 plus oh
    and zero) plus silence and short pause models
  • For testing, reverberated speech obtained by
    convolving Aurora clean1 utterances with
    recorded room impulse responses

41
Results (from ASA meeting 6/08)
  • Competitive with Kingsbury's hybrid HMM-MLP
    system
  • Oracle mask gives upper limit on performance
  • Gap between oracle mask and reverb mask
    performance suggests that reverberation masking
    algorithm can be further tuned

42
Planned work for WP4 (1)
  • What are the cues to reliable time-frequency
    (TF) regions?
  • Use the corpus of binaural room impulse responses
    (BRIRs) recorded at Reading
  • Collect statistics that relate reliable TF
    regions to cues that listeners may use to
    determine the amount of reverberation present
  • pitch structure
  • interaural coherence (requires binaural model)
  • Low-frequency modulation
  • Measure of reverberant tails

43
Planned work for WP4 (2)
  • Contextual effects are also important
  • The salience of a cue depends on its local
    context (as in the sir/stir case)
  • Auditory (and visual) perception based on
    relative values
  • Hermansky & Morgan (1994): "How else could you
    see the black-and-white movie on a white screen?"
  • Can this contextual information be captured in
    statistical models?

44
Sir/stir experiment with large vocab ASR (Kalle
Palomäki)
  • Preliminary experiment with existing large
    vocabulary recogniser at HUT
  • Justifications
  • Technical: large vocabulary ASR is an obvious
    technological application
  • Scientific: large vocabulary ASR has learned
    phoneme models close to those of humans
  • Using a well-tested ASR system that works
    reasonably well on broadcast news data
  • Recogniser was forced to choose between two
    sentences
  • "Next you'll get stir to click on"
  • "Next you'll get sir to click on"

45
Results (Kalle Palomäki)
  • Test on mildly reverberated signals (context
    near, test near)
  • For steps 0-6, the sir sentence was chosen
  • For steps 7-10, the stir sentence was chosen
  • Indicates that recogniser can correctly align
    phonemes to corresponding utterances
  • Free recognition too hard
  • The far reverberant case was also too hard in
    this preliminary test

46
Suggestions for further work
  • American English recogniser, should be British
    English
  • Sampling rate was only 8 kHz
  • We have the Wall Street Journal corpus in British
    English at 16 kHz, but training on it needs some
    work
  • Measure phoneme or state probability near the
    "st" in stir/sir
  • Transform phoneme boundary test to an
    intelligibility test
  • Apply SRT test and measure intelligibility after
    change in acoustic conditions
  • Could also compare ASR performance against
    intelligibility