Monaural Speech Segregation: Representation, Pitch, and Amplitude Modulation - PowerPoint PPT Presentation




Date added: 24 February 2020
Slides: 39
Provided by: Shzu


Transcript and Presenter's Notes

Title: Monaural Speech Segregation: Representation, Pitch, and Amplitude Modulation

Monaural Speech Segregation: Representation,
Pitch, and Amplitude Modulation
  • DeLiang Wang
  • The Ohio State University

Outline of Presentation
  • Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
  • A multistage model for computational ASA
  • On amplitude modulation and pitch tracking
  • Oscillatory correlation theory for ASA

Speech Segregation Problem
  • In a natural environment, target speech is
    usually corrupted by acoustic interference. An
    effective system for speech segregation has many
    applications, such as automatic speech
    recognition, audio retrieval, and hearing aids
  • Most speech separation techniques require
    multiple sensors
  • Speech enhancement developed for the monaural
    situation can deal with only specific acoustic
    interference

Auditory Scene Analysis (Bregman'90)
  • Listeners are able to parse the complex mixture
    of sounds arriving at the ears in order to
    retrieve a mental representation of each sound
    source
  • ASA is thought to take place in two conceptual
    processes:
  • Segmentation. Decompose the acoustic mixture into
    sensory elements (segments)
  • Grouping. Combine segments into groups, so that
    segments in the same group are likely to have
    originated from the same environmental source

Auditory Scene Analysis - continued
  • The grouping process involves two aspects:
  • Primitive grouping. Innate data-driven
    mechanisms, consistent with those described by
    Gestalt psychologists for visual perception
    (proximity, similarity, common fate, good
    continuation, etc.)
  • Schema-driven grouping. Application of learned
    knowledge about speech, music and other
    environmental sounds

Computational Auditory Scene Analysis
  • Computational ASA (CASA) systems approach sound
    separation based on ASA principles
  • Weintraub'85, Cooke'93, Brown & Cooke'94,
    Ellis'96, Wang'96
  • Previous CASA work suggests that:
  • Representation of the auditory scene is a key
    issue
  • Temporal continuity is important (although it is
    ignored in most frame-based sound processing
    algorithms)
  • Fundamental frequency (F0) is a strong cue for
    grouping

A Multi-stage Model (Wang & Brown'99)

Auditory Periphery Model
  • A bank of fourth-order gammatone filters
    (Patterson et al.'88)
  • Meddis hair cell model converts gammatone output
    to neural firing
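
A minimal sketch of the gammatone stage, assuming the commonly used impulse response g(t) = t^(n-1) exp(-2 pi b t) cos(2 pi fc t) with bandwidth b = 1.019 ERB(fc) and the Glasberg-Moore ERB formula. These constants, the 25 ms response length, and the function names are assumptions for illustration, not details from the slides. The Meddis hair cell model is not reproduced here.

```python
import math

def erb_bandwidth(fc_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg-Moore formula, assumed)."""
    return 24.7 * (4.37 * fc_hz / 1000.0 + 1.0)

def gammatone_impulse_response(fc_hz, fs_hz, duration_s=0.025, order=4):
    """Sampled, peak-normalized gammatone impulse response of the given order."""
    b = 1.019 * erb_bandwidth(fc_hz)  # bandwidth scaling: a common choice, assumed
    n_samples = int(duration_s * fs_hz)
    g = []
    for k in range(n_samples):
        t = k / fs_hz
        envelope = t ** (order - 1) * math.exp(-2.0 * math.pi * b * t)
        g.append(envelope * math.cos(2.0 * math.pi * fc_hz * t))
    peak = max(abs(v) for v in g)
    return [v / peak for v in g]

def filter_signal(signal, impulse_response):
    """Direct FIR convolution, truncated to the input length."""
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j in range(min(i + 1, len(impulse_response))):
            acc += impulse_response[j] * signal[i - j]
        out.append(acc)
    return out
```

One such filter would be built per channel; the direct convolution is slow but keeps the sketch free of external libraries.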

Auditory Periphery - Example
  • Hair cell response to the utterance "Why were you
    all weary?" mixed with phone ringing
  • 128 filter channels arranged on the ERB
    (equivalent rectangular bandwidth) scale
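
As an illustration of spacing channels by ERB, the sketch below places 128 center frequencies uniformly on the ERB-rate scale. The ERB-rate formula (Glasberg-Moore) and the 80 Hz to 5000 Hz range are assumptions; the slides give only the channel count.

```python
import math

def hz_to_erb_rate(f_hz):
    """Map frequency in Hz to ERB-rate units (assumed Glasberg-Moore form)."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(e):
    """Inverse of hz_to_erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def erb_spaced_center_frequencies(n_channels=128, f_low_hz=80.0, f_high_hz=5000.0):
    """Center frequencies equally spaced in ERB-rate between the two limits."""
    lo = hz_to_erb_rate(f_low_hz)
    hi = hz_to_erb_rate(f_high_hz)
    step = (hi - lo) / (n_channels - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_channels)]
```

With this spacing, adjacent channels are close together at low frequencies and progressively further apart at high frequencies.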

Mid-level Auditory Representations
  • Mid-level representations form the basis for
    segment formation and subsequent grouping
  • Correlogram extracts periodicity information from
    simulated auditory nerve firing patterns
  • Summary correlogram is used to identify F0
  • Cross-correlation between adjacent correlogram
    channels identifies regions that are excited by
    the same frequency component or formant
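
The three representations above can be sketched for a single time frame as follows. This is a simplified illustration: the autocorrelation is unnormalized, the F0 search range of 80 to 400 Hz is an assumption, and all function names are hypothetical.

```python
import math

def autocorrelation(frame, max_lag):
    """Unnormalized autocorrelation of one channel's frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[t] * frame[t + lag] for t in range(n - lag))
            for lag in range(max_lag + 1)]

def correlogram(channel_frames, max_lag):
    """One autocorrelation function per filter channel."""
    return [autocorrelation(frame, max_lag) for frame in channel_frames]

def summary_correlogram(corr):
    """Sum the correlogram across channels at each lag."""
    return [sum(ch[lag] for ch in corr) for lag in range(len(corr[0]))]

def estimate_f0(summary, fs_hz, f0_min_hz=80.0, f0_max_hz=400.0):
    """F0 from the largest summary peak within an assumed plausible pitch range."""
    lag_min = int(fs_hz / f0_max_hz)
    lag_max = min(int(fs_hz / f0_min_hz), len(summary) - 1)
    best_lag = max(range(lag_min, lag_max + 1), key=lambda lag: summary[lag])
    return fs_hz / best_lag

def _zero_mean_unit_norm(v):
    mean = sum(v) / len(v)
    centered = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centered))
    return [x / norm for x in centered] if norm > 0.0 else centered

def cross_channel_correlation(corr):
    """Normalized correlation between autocorrelations of adjacent channels."""
    normed = [_zero_mean_unit_norm(ch) for ch in corr]
    return [sum(a * b for a, b in zip(normed[i], normed[i + 1]))
            for i in range(len(normed) - 1)]
```

Adjacent channels driven by the same harmonic give a cross-channel correlation near 1, while channels driven by different components give a lower value.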

Mid-level Representations - Example
  • Correlogram and cross-channel correlation for the
    speech/telephone mixture

Oscillator Network Segmentation Layer
  • Horizontal weights are unity, reflecting temporal
    continuity, and vertical weights are unity if
    cross-channel correlation exceeds a threshold,
    otherwise 0
  • A global inhibitor ensures that different
    segments have different phases
  • A segment thus formed corresponds to acoustic
    energy in a local time-frequency region that is
    treated as an atomic component of an auditory
    scene
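
The connectivity described above can be illustrated without oscillator dynamics. The sketch below swaps in a plain breadth-first connected-component labeling for the oscillator network: it links stimulated T-F units along time unconditionally and across adjacent channels only where the cross-channel correlation exceeds a threshold. The `stimulated` grid, the threshold value, and the function name are assumptions for illustration.

```python
from collections import deque

def form_segments(stimulated, cross_corr, threshold=0.985):
    """Label connected T-F regions.

    stimulated[c][m]: True if unit (channel c, frame m) carries enough energy.
    cross_corr[c][m]: correlation between channels c and c + 1 at frame m.
    Returns labels[c][m]: 0 for no segment, otherwise a segment id starting at 1.
    """
    n_ch, n_fr = len(stimulated), len(stimulated[0])
    labels = [[0] * n_fr for _ in range(n_ch)]
    next_id = 0
    for c0 in range(n_ch):
        for m0 in range(n_fr):
            if not stimulated[c0][m0] or labels[c0][m0]:
                continue
            next_id += 1
            labels[c0][m0] = next_id
            queue = deque([(c0, m0)])
            while queue:
                c, m = queue.popleft()
                neighbors = []
                if m > 0:
                    neighbors.append((c, m - 1))      # horizontal link (time)
                if m + 1 < n_fr:
                    neighbors.append((c, m + 1))
                if c > 0 and cross_corr[c - 1][m] > threshold:
                    neighbors.append((c - 1, m))      # vertical link (frequency)
                if c + 1 < n_ch and cross_corr[c][m] > threshold:
                    neighbors.append((c + 1, m))
                for cc, mm in neighbors:
                    if stimulated[cc][mm] and not labels[cc][mm]:
                        labels[cc][mm] = next_id
                        queue.append((cc, mm))
    return labels
```

Each distinct label plays the role of one segment; in the oscillator network the same separation is expressed by different phases.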

Segmentation Layer - Example
  • Output of the segmentation layer in response to
    the speech/telephone mixture

Oscillator Network Grouping Layer
  • At each time frame, an F0 estimate from the
    summary correlogram is used to classify channels
    into two categories: those that are consistent
    with the F0, and those that are not
  • Connections are formed between pairs of channels:
    mutual excitation if the channels belong to the
    same F0 category, otherwise mutual inhibition
  • Strong excitation within each segment
  • The second layer embodies the grouping stage of
    ASA

Grouping Layer - Example
  • Two streams emerge from the grouping layer at
    different times or with different phases
  • Left: Foreground (original mixture)
  • Right: Background

Challenges Facing CASA
  • Previous systems, including the Wang-Brown model,
    have difficulty in:
  • Dealing with broadband high-frequency mixtures
  • Performing reliable pitch tracking for noisy
    speech
  • Retaining high-frequency energy of the target
    speech
  • Our next step considers perceptual resolvability
    of various harmonics

Resolved and Unresolved Harmonics
  • For voiced speech, lower harmonics are resolved
    while higher harmonics are not
  • For unresolved harmonics, the envelopes of filter
    responses fluctuate at the fundamental frequency
    of speech
  • Hence we apply different grouping mechanisms for
    low-frequency and high-frequency signals
  • Low-frequency signals are grouped based on
    periodicity and temporal continuity
  • High-frequency signals are grouped based on
    amplitude modulation (AM) and temporal continuity
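
A short worked identity shows why the envelope of an unresolved pair fluctuates at the fundamental. It assumes, for illustration only, that a filter passes just two adjacent harmonics k and k+1 of F0 = f_0 with equal amplitude:

```latex
\cos(2\pi k f_0 t) + \cos\bigl(2\pi (k+1) f_0 t\bigr)
  = 2\cos(\pi f_0 t)\,\cos\bigl(2\pi (k+\tfrac{1}{2}) f_0 t\bigr)
```

The carrier oscillates at (k + 1/2) f_0, while the envelope 2|cos(pi f_0 t)| repeats every 1/f_0 seconds, that is, at the fundamental frequency.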

Proposed System (Hu & Wang'02)

Envelope Representations - Example
  • (a) Correlogram and cross-channel correlation of
    hair cell response to clean speech
  • (b) Corresponding representations for response
    envelopes

Initial Segregation
  • The Wang-Brown model is used in this stage to
    generate segments and select the target speech
  • Segments generated in this stage tend to reflect
    resolved harmonics, but not unresolved ones

Pitch Tracking
  • Pitch periods of target speech are estimated from
    the segregated speech stream
  • Estimated pitch periods are checked and
    re-estimated using two psychoacoustically
    motivated constraints
  • Target pitch should agree with the periodicity of
    the time-frequency (T-F) units in the initial
    speech stream
  • Pitch periods change smoothly, thus allowing for
    verification and interpolation
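
One way to illustrate the smoothness constraint is sketched below. It is not the checking procedure from the slides: it swaps in a local median test to reject pitch periods that deviate sharply from their neighbors, then linearly interpolates across rejected or missing frames. The tolerance, window width, and function name are assumptions.

```python
from statistics import median

def verify_and_interpolate(periods, rel_tol=0.2, half_width=2):
    """periods: pitch period per frame (any unit), or None where no estimate exists."""
    checked = []
    for i, p in enumerate(periods):
        window = [q for q in periods[max(0, i - half_width): i + half_width + 1]
                  if q is not None]
        # Verification: keep a value only if it is close to the local median.
        if p is not None and abs(p - median(window)) <= rel_tol * median(window):
            checked.append(p)
        else:
            checked.append(None)
    result = list(checked)
    known = [i for i, p in enumerate(checked) if p is not None]
    # Interpolation: fill gaps between verified frames; ends are left as None.
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            result[i] = checked[left] + frac * (checked[right] - checked[left])
    return result
```

For example, a period that suddenly doubles for one frame is rejected and replaced by the value interpolated from its neighbors.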

Pitch Tracking - Example
  • (a) Global pitch (Line: pitch track of clean
    speech) for a mixture of target speech and
    cocktail-party intrusion
  • (b) Estimated target pitch

T-F Unit Labeling
  • In the low-frequency range
  • A T-F unit is labeled by comparing the
    periodicity of its autocorrelation with the
    estimated target pitch
  • In the high-frequency range
  • Due to their wide bandwidths, high-frequency
    filters generally respond to multiple harmonics.
    These responses are amplitude modulated due to
    beats and combination tones (Helmholtz, 1863)
  • A T-F unit in the high-frequency range is labeled
    by comparing its AM repetition rate with the
    estimated target pitch
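
The two labeling rules can be sketched as simple comparisons. The ratio test, the threshold of 0.85, the 10% tolerance, and the function names are assumptions chosen for illustration; the slides state only what is compared, not how.

```python
def label_unit_by_periodicity(autocorr, target_pitch_lag, min_lag, max_lag,
                              threshold=0.85):
    """Low-frequency rule: the unit's autocorrelation at the target pitch lag
    should be close to its largest value over the plausible pitch lags."""
    peak = max(autocorr[min_lag:max_lag + 1])
    if peak <= 0.0:
        return False
    return autocorr[target_pitch_lag] / peak > threshold

def label_unit_by_am_rate(am_rate_hz, target_f0_hz, rel_tol=0.1):
    """High-frequency rule: the unit's AM repetition rate should match the
    estimated target F0 within a relative tolerance."""
    return abs(am_rate_hz - target_f0_hz) <= rel_tol * target_f0_hz
```

A unit labeled True is treated as target-dominant; all other units are left to the interference.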

AM - Example
  • (a) The output of a gammatone filter (center
    frequency 2.6 kHz) to clean speech
  • (b) The corresponding autocorrelation function

AM Repetition Rates
  • To obtain AM repetition rates, a filter response
    is half-wave rectified and bandpass filtered
  • The resulting signal within a T-F unit is modeled
    by a single sinusoid using the gradient descent
    method. The frequency of the sinusoid indicates
    the AM repetition rate of the corresponding
    response
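
A simplified sketch of estimating an AM repetition rate follows. It departs from the slides in two labeled ways: the bandpass filter is replaced by restricting the search to an assumed 80 to 400 Hz band, and the gradient descent fit is replaced by a grid search that picks the frequency whose sinusoid best matches the rectified response. Function names and parameters are hypothetical.

```python
import math

def half_wave_rectify(x):
    """Keep positive samples and zero out the rest."""
    return [v if v > 0.0 else 0.0 for v in x]

def am_repetition_rate(response, fs_hz, f_min_hz=80.0, f_max_hz=400.0, step_hz=1.0):
    """Frequency in Hz of the single sinusoid that best fits the rectified response."""
    env = half_wave_rectify(response)
    mean = sum(env) / len(env)
    env = [v - mean for v in env]  # remove DC so it does not dominate the fit
    best_f, best_power = f_min_hz, -1.0
    n_steps = int(round((f_max_hz - f_min_hz) / step_hz))
    for k in range(n_steps + 1):
        f = f_min_hz + k * step_hz
        w = 2.0 * math.pi * f / fs_hz
        # Project onto cosine and sine at this frequency (amplitude and phase fit).
        a = sum(v * math.cos(w * n) for n, v in enumerate(env))
        b = sum(v * math.sin(w * n) for n, v in enumerate(env))
        power = a * a + b * b
        if power > best_power:
            best_f, best_power = f, power
    return best_f
```

For a 2.6 kHz carrier whose amplitude is modulated at 100 Hz, the estimate lands close to 100 Hz; short analysis windows bias it by a few Hz.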

Final Segregation
  • New segments corresponding to unresolved
    harmonics are formed based on temporal continuity
    and cross-channel correlation of response
    envelopes (i.e. common AM). Then they are grouped
    into the foreground stream according to AM
    repetition rates
  • The foreground stream is adjusted to remove the
    segments that do not agree with the estimated
    target pitch
  • Other units are grouped according to temporal and
    spectral continuity

Ideal Binary Mask for Performance Evaluation
  • Within a T-F unit, the ideal binary mask is 1 if
    target energy is stronger than interference
    energy, and 0 otherwise
  • Motivation: auditory masking - a stronger signal
    masks a weaker one within a critical band
  • Further motivation: ideal binary masks give
    excellent listening experience and automatic
    speech recognition performance
  • Thus, we suggest using ideal binary masks as
    ground truth for CASA performance evaluation
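
The mask definition above translates directly into code. The sketch assumes the target and interference energies are available separately for every T-F unit, which holds only when the mixture is constructed from known premixed signals; the function name is hypothetical.

```python
def ideal_binary_mask(target_energy, interference_energy):
    """Return mask[c][m] = 1 where target energy exceeds interference energy, else 0.

    Both inputs are 2-D lists indexed by channel and frame.
    """
    return [[1 if t > i else 0 for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(target_energy, interference_energy)]
```

Units where the two energies are equal receive 0, following the "stronger than" wording.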

Monaural Speech Segregation Example
  • Left: Segregated speech stream (original mixture)
  • Right: Ideal binary mask

Systematic Evaluation
  • Evaluated on a corpus of 100 mixtures (Cooke'93):
    10 voiced utterances x 10 noise intrusions
  • Noise intrusions have a large variety
  • Resynthesis stage allows estimation of target
    speech waveform
  • Evaluation is based on ideal binary masks

Signal-to-Noise Ratio (SNR) Results
  • Average SNR gain: 12.1 dB; average improvement
    over Wang-Brown: 5 dB
  • Major improvement occurs in target energy
    retention, particularly in the high-frequency
    range
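
A minimal sketch of how an SNR figure and an SNR gain could be computed from waveforms. It assumes the reference is the target resynthesized through the ideal binary mask and that all signals are time-aligned and of equal length; the function names are hypothetical and the exact evaluation procedure is not given in the slides.

```python
import math

def snr_db(reference, estimate):
    """10 log10 of reference energy over the energy of (reference - estimate)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    if noise == 0.0:
        return math.inf
    return 10.0 * math.log10(signal / noise)

def snr_gain_db(reference, mixture, segregated):
    """Improvement of the segregated output over the unprocessed mixture."""
    return snr_db(reference, segregated) - snr_db(reference, mixture)
```

The reference is assumed to have nonzero energy; a silent reference would make the ratio undefined.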

Segregation Examples
  • Mixture, Ideal Binary Mask, Wang-Brown, New System

How Does the Auditory System Perform ASA?
  • Information about acoustic features (pitch,
    spectral shape, interaural differences, AM, FM)
    is extracted in distributed areas of the auditory
    system
  • Binding problem: How are these features combined
    to form a perceptual whole (stream)?
  • Hierarchies of feature-detecting cells exist, but
    do not seem to constitute a solution to the
    binding problem

Oscillatory Correlation Theory (von der Malsburg &
Schneider'86; Wang'96)
  • Neural oscillators are used to represent auditory
    features
  • Oscillators representing features of the same
    source are synchronized (phase-locked with zero
    phase lag), and are desynchronized from
    oscillators representing different sources
  • Supported by growing experimental evidence, e.g.
    oscillations in auditory cortex measured by EEG,
    MEG and local field potentials

Oscillatory Correlation Representation
  • FD: Feature Detector

Oscillatory Correlation for ASA
  • LEGION dynamics (Terman & Wang'95) provides a
    computational foundation for the oscillatory
    correlation theory
  • The utility of oscillatory correlation has been
    demonstrated for speech separation
    (Wang & Brown'99), modeling auditory attention
    (Wrigley & Brown'01), etc.

  • Grouping is entirely pitch-based, hence limited
    to segregating voiced speech
  • How to group unvoiced speech?
  • Target pitch tracking in the presence of multiple
    voiced sources
  • Role of segmentation
  • We found increased robustness with segments as an
    intermediate representation between streams and
    T-F units

  • Multistage ASA approach to monaural speech
    segregation
  • Performs substantially better than previous CASA
    systems
  • Oscillatory correlation theory for ASA
  • Key issue is integration of various grouping cues

  • Recent work with Guoning Hu - The Ohio State
    University
  • Earlier work with Guy Brown - University of