1
An Auditory Scene Analysis Approach to Speech
Segregation and Restoration
  • DeLiang Wang
  • Perception and Neurodynamics Lab
  • Ohio State University
  • http://www.cse.ohio-state.edu/pnl

2
Outline of presentation
  • Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
  • Voiced speech segregation based on pitch tracking
    and amplitude modulation analysis
  • Unvoiced speech segregation
  • Phonemic restoration

3
Real-world audition
  • What?
  • Source type
  • Speech
  • message
  • speaker
  • age, gender, linguistic origin, mood, etc.
  • Music
  • Car passing by
  • Where?
  • Left, right, up, down
  • How close?
  • Channel characteristics
  • Environment characteristics
  • Room configuration
  • Ambient noise

4
Humans versus machines
  • Additionally
  • Car noise is not an effective speech masker
  • At 10 dB
  • At 0 dB
  • Human word error rate at 0 dB SNR is around 1%, as
    opposed to about 40% for recognizers with noise
    adaptation

Source: Lippmann (1997)
5
Speech segregation problem
  • In a natural environment, speech is usually
    corrupted by acoustic interference. Speech
    segregation is critical for many applications,
    such as automatic speech recognition (ASR) and
    hearing prosthesis
  • Most speech separation techniques, e.g.
    beamforming and blind source separation via
    independent component analysis, require multiple
    sensors. However, such techniques have clear
    limits
  • Suffer from configuration stationarity
  • Can't deal with single-microphone mixtures
  • Most speech enhancement methods developed for the
    monaural situation can deal only with stationary
    acoustic interference

6
Auditory scene analysis (Bregman, 1990)
  • Listeners are able to parse the complex mixture
    of sounds arriving at the ears in order to
    retrieve a mental representation of each sound
    source
  • Ball-room problem, Helmholtz (1863)
  • complicated beyond conception
  • Cocktail-party problem, Cherry (1953)
  • Two conceptual processes of auditory scene
    analysis (ASA)
  • Segmentation. Decompose the acoustic mixture into
    sensory elements (segments)
  • Grouping. Combine segments into groups, so that
    segments in the same group likely originate from
    the same environmental source

7
Auditory Scene Analysis (cont.)
  • Two grouping processes
  • Primitive grouping. Innate data-driven
    mechanisms, consistent with those described by
    Gestalt psychologists for visual perception
    (proximity, similarity, common fate, good
    continuation, etc.)
  • Schema-driven grouping. Application of learned
    knowledge about speech, music and other
    environmental sounds
  • Simultaneous vs. sequential organization
  • Simultaneous organization groups sound components
    that overlap in time. Main ASA cues include
    periodicity, temporal modulation, and
    onset/offset
  • Sequential organization groups sound components
    across time. Main ASA cues include location,
    pitch contour and other source characteristics
    (e.g. vocal tract size)

8
Computational auditory scene analysis
  • Computational ASA (CASA) systems approach sound
    separation based on ASA principles
  • Weintraub '85, Cooke '93, Brown & Cooke '94,
    Ellis '96, Wang & Brown '99
  • CASA progress: monaural segregation with minimal
    assumptions
  • CASA challenges
  • Broadband high-frequency mixtures
  • Reliable pitch tracking of noisy speech
  • Unvoiced speech
  • Sequential organization
  • Our model for voiced speech segregation (Hu &
    Wang, 2004) considers perceptual resolvability of
    harmonics

9
Resolved and unresolved harmonics
  • For voiced speech, lower harmonics are resolved
    while higher harmonics are not
  • For unresolved harmonics, the envelopes of filter
    responses fluctuate at the fundamental frequency
    of speech
  • Hence we apply different grouping mechanisms for
    low-frequency and high-frequency signals
  • Low-frequency signals are grouped based on
    periodicity and temporal continuity
  • High-frequency signals are grouped based on
    amplitude modulation (AM) and temporal continuity

10
Diagram of the Hu-Wang model
11
Cochleagram: Auditory peripheral model
Spectrogram
  • Spectrogram
  • Plot of log energy across time and frequency
    (linear frequency scale)
  • Cochleagram
  • Cochlear filtering by the gammatone filterbank
    (or another model of cochlear filtering), followed
    by a stage of nonlinear rectification; the latter
    corresponds to hair cell transduction, modeled by
    either a hair cell model or a simple compression
    operation (log or cube root)
  • Quasi-logarithmic frequency scale, and filter
    bandwidth is frequency-dependent
  • Previous work suggests better resilience to noise
    than the spectrogram (a rough computational sketch
    follows)

Cochleagram
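The following Python sketch illustrates the kind of cochleagram computation described above. It is a minimal illustration rather than the implementation behind these figures: the gammatone impulse response and ERB-rate channel spacing follow standard textbook formulas, cube-root compression of frame energies stands in for hair cell transduction, and the function names and parameter values (64 channels, 20-ms frames, 10-ms hop) are assumptions chosen for readability.

```python
import numpy as np

def gammatone_ir(fc, fs, dur=0.05, order=4):
    """Impulse response of a 4th-order gammatone filter centered at fc (Hz)."""
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)        # equivalent rectangular bandwidth
    b = 1.019 * erb                                 # bandwidth parameter
    return t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def cochleagram(x, fs, n_channels=64, fmin=80.0, fmax=5000.0, frame=0.020, hop=0.010):
    """Gammatone filtering followed by cube-root compression of frame energies."""
    # Center frequencies spaced uniformly on the ERB-rate (quasi-logarithmic) scale
    erb_rate = lambda f: 21.4 * np.log10(4.37 * f / 1000.0 + 1.0)
    inv_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37
    cfs = inv_erb(np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels))

    flen, fhop = int(frame * fs), int(hop * fs)
    n_frames = 1 + (len(x) - flen) // fhop
    cg = np.zeros((n_channels, n_frames))
    for c, fc in enumerate(cfs):
        y = np.convolve(x, gammatone_ir(fc, fs), mode="same")   # cochlear filtering
        for m in range(n_frames):
            seg = y[m * fhop: m * fhop + flen]
            cg[c, m] = np.sum(seg ** 2) ** (1.0 / 3.0)           # cube-root compression
    return cg, cfs
```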
12
Mid-level auditory representations
  • Mid-level representations form the basis for
    segment formation and subsequent grouping
  • Correlogram extracts periodicity and AM from
    simulated auditory nerve firing patterns
  • Summary correlogram is used to identify global
    pitch
  • Cross-channel correlation between adjacent
    correlogram channels identifies regions that are
    excited by the same harmonic or formant

13
Correlogram
  • Short-term autocorrelation of the output of each
    frequency channel of the cochleagram
  • Peaks in the summary correlogram indicate pitch
    periods (F0); see the sketch below
  • A standard model of pitch perception

Correlogram and summary correlogram of a double
vowel, showing the two F0s
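A minimal sketch of the correlogram and the summary-correlogram pitch pick described above, assuming `channel_outputs` holds the gammatone channel responses from the previous stage; the window length, lag range, and pitch search limits are illustrative rather than the published settings.

```python
import numpy as np

def correlogram(channel_outputs, fs, t_frame, max_lag_ms=12.5, win_ms=20.0):
    """Short-term autocorrelation of each filter output around time t_frame (s)."""
    max_lag = int(max_lag_ms * 1e-3 * fs)
    win = int(win_ms * 1e-3 * fs)
    start = int(t_frame * fs)
    acg = []
    for y in channel_outputs:                   # one gammatone channel per row
        seg = y[start:start + win]
        seg = np.pad(seg, (0, win - len(seg)))  # zero-pad near the signal end
        ac = [np.dot(seg[:win - lag], seg[lag:]) for lag in range(max_lag)]
        acg.append(ac)
    return np.array(acg)                        # shape: (channels, lags)

def global_pitch_period(acg, fs, fmin=80.0, fmax=400.0):
    """Pick a pitch period from the summary correlogram (sum across channels)."""
    summary = acg.sum(axis=0)
    lo, hi = int(fs / fmax), int(fs / fmin)     # plausible pitch lags
    lag = lo + int(np.argmax(summary[lo:hi]))
    return lag / fs                             # pitch period in seconds
```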
14
Initial segregation
  • Segments are formed based on temporal continuity
    and cross-channel correlation
  • Initial grouping into a foreground (target)
    stream and a background stream according to
    global pitch
  • Segments generated in this stage tend to reflect
    resolved harmonics, but not unresolved ones

15
Pitch tracking
  • Pitch periods of target speech are estimated from
    the segregated speech stream
  • Estimated pitch periods are checked and
    re-estimated using two psychoacoustically
    motivated constraints
  • Target pitch should agree with the periodicity of
    the time-frequency units in the initial speech
    stream
  • Pitch periods change smoothly, thus allowing for
    verification and interpolation
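A rough sketch of the smoothness constraint, assuming one pitch-period estimate per frame with 0 marking frames that lack a reliable estimate; the relative-jump threshold is an illustrative value, not the one used in the model.

```python
import numpy as np

def smooth_pitch_track(periods, max_rel_jump=0.2):
    """Reject pitch estimates that change too abruptly, then interpolate.

    periods : per-frame pitch periods in seconds (0 where no estimate exists)."""
    periods = np.asarray(periods, dtype=float)
    ok = periods > 0
    last = None
    for i in range(len(periods)):
        if not ok[i]:
            continue
        if last is not None and abs(periods[i] - periods[last]) > max_rel_jump * periods[last]:
            ok[i] = False                        # violates the smooth-change constraint
        else:
            last = i
    good, bad = np.flatnonzero(ok), np.flatnonzero(~ok)
    if len(good) >= 2 and len(bad) > 0:
        periods[bad] = np.interp(bad, good, periods[good])   # fill rejected/missing frames
    return periods
```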

16
Pitch tracking example
  • (a) Dominant pitch (line: pitch track of clean
    speech) for a mixture of target speech and a
    cocktail-party intrusion
  • (b) Estimated target pitch

17
T-F unit labeling
  • In the low-frequency range
  • A time-frequency (T-F) unit is labeled by
    comparing the periodicity of its autocorrelation
    with the estimated target pitch
  • In the high-frequency range
  • Due to their wide bandwidths, high-frequency
    filters respond to multiple harmonics. These
    responses are amplitude modulated due to beats
    and combination tones (Helmholtz, 1863)
  • A T-F unit in the high-frequency range is labeled
    by comparing its AM rate with the estimated
    target pitch (see the sketch below)
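A simplified sketch of labeling a single T-F unit against the estimated target pitch; the crude rectified envelope and the threshold `theta` are illustrative stand-ins for the model's actual periodicity and AM criteria.

```python
import numpy as np

def label_tf_unit(response, fs, target_period, high_freq, theta=0.85):
    """Label one T-F unit as target (1) or interference (0).

    Low-frequency units: compare the autocorrelation at the target pitch lag
    with its maximum over nonzero lags.  High-frequency units: apply the same
    test to the response envelope, whose AM rate should match the target pitch."""
    sig = np.asarray(response, dtype=float)
    if high_freq:
        sig = np.abs(sig)                         # crude envelope via rectification
    sig = sig - sig.mean()
    lag = int(round(target_period * fs))
    if lag <= 0 or lag >= len(sig):
        return 0
    ac = np.correlate(sig, sig, mode="full")[len(sig) - 1:]   # lags 0, 1, ...
    peak = ac[1:].max()
    return int(peak > 0 and ac[lag] / peak > theta)
```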

18
AM example
  • (a) The output of a gammatone filter (center
    frequency 2.6 kHz) in response to clean speech
  • (b) The corresponding autocorrelation function

19
Final segregation
  • New segments corresponding to unresolved
    harmonics are formed based on temporal continuity
    and cross-channel correlation of response
    envelopes (i.e. common AM). Then they are grouped
    into the foreground stream according to the AM
    criterion
  • Other units are grouped according to temporal and
    spectral continuity

20
Ideal binary mask for performance evaluation
  • Within a T-F unit, the ideal binary mask is 1 if
    target energy is stronger than interference
    energy, and 0 otherwise
  • Motivation: auditory masking, where a stronger
    signal masks a weaker one within a critical band
  • We have suggested using ideal binary masks as the
    ground truth for CASA performance evaluation (see
    the sketch below)
  • Consistent with recent speech intelligibility
    results (Roman et al., 2003; Brungart et al., 2005)
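A minimal sketch of the ideal binary mask definition and an SNR measure for a resynthesized target, assuming the premixed target and interference energies are available per T-F unit (which is what makes the mask "ideal"); the 0 dB local criterion corresponds to the definition on this slide.

```python
import numpy as np

def ideal_binary_mask(target_energy, noise_energy, lc_db=0.0):
    """IBM: 1 where target energy exceeds interference energy by the local criterion."""
    snr_local = 10 * np.log10((target_energy + 1e-12) / (noise_energy + 1e-12))
    return (snr_local > lc_db).astype(int)

def snr_db(reference, estimate):
    """SNR of a resynthesized estimate against the clean reference signal."""
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))
```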

21
Ideal binary mask illustration
22
Voiced speech segregation example
23
Systematic SNR results
(Chart: SNR in dB, Hu-Wang model)
  • Evaluation on a corpus of 100 mixtures (Cooke,
    1993): 10 voiced utterances x 10 noise intrusions
    (see next slide)
  • Average SNR gain: 12.3 dB; 5.2 dB better than the
    Wang-Brown model (1999), and 6.4 dB better than
    the spectral subtraction method

24
Monaural CASA progress via sound demo
  • 100 mixture set used by Cooke (1993)
  • 10 voiced utterances mixed with 10 noise
    intrusions (N0: tone, N1: white noise, N2: noise
    bursts, N3: cocktail party, N4: rock music, N5:
    siren, N6: telephone, N7: female utterance, N8:
    male utterance, N9: female utterance)

(Sound demos: original mixtures and segregation
results from Cooke (1993), Ellis (1996), Wang &
Brown (1999), and Hu & Wang (2004) for the
telephone, male, and female intrusions)
25
Segmentation and unvoiced speech segregation
  • To deal with unvoiced speech segregation, Hu and
    Wang (2004) recently proposed a model of auditory
    segmentation that applies to both voiced and
    unvoiced speech
  • Segmentation amounts to identifying onsets and
    offsets of individual T-F regions
  • Onset/offset analysis employs scale-space theory,
    which is a multiscale analysis commonly used in
    image segmentation
  • The strategy for general speech segregation is to
    first segregate voiced speech using the pitch
    cue, and then deal with unvoiced speech
  • To segregate unvoiced speech, we perform auditory
    segmentation, and then group segments that
    correspond to unvoiced speech
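A rough sketch of the scale-space idea behind onset/offset analysis: smooth one channel's intensity contour with a Gaussian, take its time derivative, and mark sufficiently large peaks and valleys as onset and offset candidates. The single detection scale and threshold are illustrative simplifications of the multiscale analysis used in the model.

```python
import numpy as np

def onsets_offsets(intensity, sigma=4, thresh=0.1):
    """Detect onset/offset candidates in one channel's frame-level intensity.

    intensity : 1-D array, e.g. log energy per frame
    sigma     : Gaussian smoothing width in frames (one scale of a scale space)
    Returns (onsets, offsets) as lists of frame indices."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    g = np.exp(-t ** 2 / (2.0 * sigma ** 2))
    smoothed = np.convolve(intensity, g / g.sum(), mode="same")
    d = np.gradient(smoothed)                       # time derivative
    onsets = [i for i in range(1, len(d) - 1)
              if d[i] > thresh and d[i] >= d[i - 1] and d[i] >= d[i + 1]]
    offsets = [i for i in range(1, len(d) - 1)
               if d[i] < -thresh and d[i] <= d[i - 1] and d[i] <= d[i + 1]]
    return onsets, offsets
```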

26
Example of segregating fricatives/affricates
Utterance: "That noise problem grows more
annoying each day". Interference: crowd noise with
music (IBM: ideal binary mask)
27
Phonemic restoration phenomenon
  • When an extraneous sound such as a cough replaces
    part of the speech, listeners believe they hear the
    missing speech sound; in addition, they cannot
    localize the extraneous sound (Warren, 1970)
  • If silence replaces a speech sound, the gap is
    correctly localized
  • Phonemic restoration depends on properties of
    noise source and linguistic skills of the
    listener
  • A sequential integration process involving
    top-down (schema-based) and bottom-up (primitive)
    continuity
  • With silence
  • With cough

28
A visual analogue (Bregman, 1981)
  • An instance of visual completion

29
Modeling phonemic restoration
  • The main motivation is to complement speech
    segregation in order to recover masked speech
  • Previous models of phonemic restoration use only
    temporal continuity (Cooke & Brown, 1993;
    Masuda-Katsuse & Kawahara, 1999)
  • Inability to deal with unvoiced speech
  • Our approach (Srinivasan & Wang, 2005) follows the
    interpretation that phonemic restoration uses
    intact portions of the speech signal to
    interpolate (synthesize) masked phonemes
  • Use lexical knowledge to hypothesize the noisy
    word and use the hypothesis to predict the masked
    phoneme

30
Schema-based model (Srinivasan & Wang, 2005)
31
Processing steps
  • Input is converted into a spectrogram
  • Identify reliable frames and T-F units
  • Missing-data ASR provides word level recognition
  • Select word template based on recognition
  • Dynamically time warp the template to the noisy
    word and replace unreliable T-F units
  • Pitch-based smoothing as postprocessing

32
Frame-level reliability labeling
  • Train a multilayer perceptron to label each frame
  • Input features are spectral flatness (SFM) and
    normalized energy (NE)
  • Output: frame labels indicating reliable (1) and
    unreliable (0) frames (a sketch follows)

Word "five" interrupted by a white noise burst
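A minimal sketch of the frame-level reliability classifier, using scikit-learn's MLPClassifier as a stand-in for the multilayer perceptron; the SFM/NE feature computation and the network size are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def frame_features(power_frame):
    """Spectral flatness (geometric/arithmetic mean ratio) and raw frame energy."""
    p = np.asarray(power_frame, dtype=float) + 1e-12
    sfm = np.exp(np.mean(np.log(p))) / np.mean(p)
    return sfm, p.sum()

def train_reliability_mlp(power_spectrogram, labels):
    """Train a small MLP mapping (SFM, normalized energy) -> reliable (1) / unreliable (0).

    power_spectrogram : (freq_bins, frames) array; labels : one 0/1 label per frame."""
    feats = np.array([frame_features(f) for f in power_spectrogram.T])
    feats[:, 1] /= feats[:, 1].max() + 1e-12        # normalized energy (NE)
    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000)
    return clf.fit(feats, labels)
```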
33
Analyzing unreliable frames
  • Kalman filtering is used to predict spectral
    coefficients in unreliable frames from the
    spectral trajectories of reliable frames
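An illustrative per-coefficient Kalman filter with a constant-velocity state model: reliable frames update the filter, while unreliable frames are bridged by prediction alone. This is a simplified stand-in for prediction along spectral trajectories, not the model of Masuda-Katsuse and Kawahara; the noise variances are arbitrary.

```python
import numpy as np

def kalman_fill(traj, reliable, q=1e-3, r=1e-2):
    """Track one spectral coefficient over frames and predict it where unreliable.

    traj     : 1-D array of the coefficient over frames
    reliable : boolean array; unreliable frames are predicted, not measured"""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])     # state transition (value, slope)
    H = np.array([[1.0, 0.0]])                 # we observe only the value
    Q, R = q * np.eye(2), np.array([[r]])
    x, P = np.array([traj[0], 0.0]), np.eye(2)
    out = np.empty(len(traj))
    for t in range(len(traj)):
        x = F @ x                              # predict
        P = F @ P @ F.T + Q
        if reliable[t]:                        # update only on reliable frames
            y = traj[t] - H @ x
            S = H @ P @ H.T + R
            K = P @ H.T @ np.linalg.inv(S)
            x = x + K @ y
            P = (np.eye(2) - K @ H) @ P
        out[t] = x[0]
    return out
```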

34
Missing data recognition
  • When speech is contaminated by additive noise,
    some T-F regions contain predominantly speech
    energy and the rest contain predominantly noise
    energy
  • The missing data method (Cooke et al., 2001) treats
    the noise-dominant T-F regions as missing or
    unreliable during recognition
  • The recognizer marginalizes the missing parts
    (sketched below)
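A minimal sketch of marginalization for one diagonal-Gaussian state: reliable dimensions contribute their density, while unreliable dimensions are either dropped or, in the bounded variant, integrated from minus infinity up to the observed (noise-dominated) value, which acts as an upper bound on the speech energy. The single-Gaussian setting and variable names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def marginal_log_likelihood(x, reliable_mask, mean, var, bounded=True):
    """Log-likelihood of spectral vector x under a diagonal Gaussian state,
    treating dimensions with reliable_mask == 0 as missing/unreliable."""
    x, mean, var = (np.asarray(a, dtype=float) for a in (x, mean, var))
    reliable = np.asarray(reliable_mask).astype(bool)
    std = np.sqrt(var)
    ll = norm.logpdf(x[reliable], mean[reliable], std[reliable]).sum()
    if bounded:
        # bounded marginalization: integrate each unreliable dimension up to x
        ll += norm.logcdf(x[~reliable], mean[~reliable], std[~reliable]).sum()
    return ll
```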

35
Missing data marginalization method
36
Word templates
  • Use linguistic knowledge stored in ASR
  • A word template corresponds to a speech schema
  • Missing-data speech recognition is used to
    recognize speech sounds as words based primarily
    on reliable portions of the input signal
  • A word template corresponding to the recognized
    word is then used to insert/induce relevant
    acoustic signal in the frames containing the
    extraneous sound

37
Training of ASR and word templates
  • The vocabulary for the task is the digits (1-9,
    zero, and oh) from the TIDigits corpus, plus
    silence and short pauses between words
  • A 10-state continuous-density HMM is used to model
    each word
  • Train two word-level templates for each word:
    speaker-independent (SI) and speaker-dependent
    (SD)
  • Each template is a dynamically time-warped
    cepstral average

38
Phonemic synthesis
  • Choose the word template corresponding to the
    noisy word and warp it to the noisy word segment
    in the input signal by dynamic time warping
  • The template T-F units corresponding to the
    masked T-F units are substituted for the masked
    units (see the sketch below)
  • Restored information may not conform to the
    speaking style and rate of the rest of the
    utterance. Hence, restored frames are further
    pitch-synchronized with the remainder of the
    utterance
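A rough sketch of the synthesis step, assuming the word template and the noisy word segment are (frames x channels) matrices: dynamically time warp the template to the segment, then copy template values into the unreliable T-F units along the alignment path. The distance measure is plain Euclidean and the pitch-synchronization postprocessing is omitted.

```python
import numpy as np

def dtw_path(template, segment):
    """DTW between template (T1 x F) and segment (T2 x F); returns the frame
    alignment as a list of (template_index, segment_index) pairs."""
    T1, T2 = len(template), len(segment)
    dist = np.linalg.norm(template[:, None, :] - segment[None, :, :], axis=2)
    D = np.full((T1 + 1, T2 + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, T1 + 1):
        for j in range(1, T2 + 1):
            D[i, j] = dist[i - 1, j - 1] + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    path, i, j = [], T1, T2
    while i > 0 and j > 0:                      # backtrack the optimal alignment
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def restore_masked_units(template, noisy_segment, unreliable_mask):
    """Replace unreliable T-F units of the noisy word with warped template values."""
    restored = noisy_segment.copy()
    for ti, si in dtw_path(template, noisy_segment):
        restored[si, unreliable_mask[si]] = template[ti, unreliable_mask[si]]
    return restored
```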

39
Example results
  • White noise intrusion
    (Sound demos: clean, masked phoneme, restoration
    using a Kalman filter (KF), and schema-based
    restoration)
  • Cough intrusion
    (Sound demos: clean, masked, restoration by KF,
    speaker-independent restoration, and
    speaker-dependent restoration)
40
Systematic evaluation results
  • Performance with white noise as the masker
  • Similar performance is obtained with clicks and
    cough
  • N: distance between clean and noisy speech
  • SD and SI: the performance of our model with
    speaker-dependent and speaker-independent
    templates, respectively
  • KF: the performance of the Kalman filter model
    of Masuda-Katsuse and Kawahara (1999)

41
Conclusion
  • CASA approach to the cocktail party problem
  • The monaural approach performs substantially
    better than previous CASA systems and other
    separation approaches
  • Onset/offset based segmentation and unvoiced
    speech segregation
  • A schema-based model for phonemic restoration
  • Models based on temporal continuity alone cannot
    restore phonemes that lack continuity with their
    neighboring phonemes

42
Acknowledgment
  • Joint work with Guoning Hu and Soundar Srinivasan
  • Funding by AFOSR/AFRL and NSF