1
Monaural Speech Segregation: Representation,
Pitch, and Amplitude Modulation
  • DeLiang Wang
  • The Ohio State University

2
Outline of Presentation
  • Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
  • A multistage model for computational ASA
  • On amplitude modulation and pitch tracking
  • Oscillatory correlation theory for ASA

3
Speech Segregation Problem
  • In a natural environment, target speech is
    usually corrupted by acoustic interference. An
    effective system for speech segregation has many
    applications, such as automatic speech
    recognition, audio retrieval, and hearing aid
    design
  • Most speech separation techniques require
    multiple sensors
  • Speech enhancement methods developed for the
    monaural case can deal only with specific types
    of acoustic interference

4
Auditory Scene Analysis (Bregman '90)
  • Listeners are able to parse the complex mixture
    of sounds arriving at the ears in order to
    retrieve a mental representation of each sound
    source
  • ASA is thought to take place in two conceptual
    processes:
  • Segmentation. Decompose the acoustic mixture into
    sensory elements (segments)
  • Grouping. Combine segments into groups, so that
    segments in the same group are likely to have
    originated from the same environmental source

5
Auditory Scene Analysis - continued
  • The grouping process involves two aspects:
  • Primitive grouping. Innate data-driven
    mechanisms, consistent with those described by
    Gestalt psychologists for visual perception
    (proximity, similarity, common fate, good
    continuation, etc.)
  • Schema-driven grouping. Application of learned
    knowledge about speech, music and other
    environmental sounds

6
Computational Auditory Scene Analysis
  • Computational ASA (CASA) systems approach sound
    separation based on ASA principles
  • Weintraub '85, Cooke '93, Brown & Cooke '94,
    Ellis '96, Wang '96
  • Previous CASA work suggests that:
  • Representation of the auditory scene is a key
    issue
  • Temporal continuity is important (although it is
    ignored in most frame-based sound processing
    algorithms)
  • Fundamental frequency (F0) is a strong cue for
    grouping

7
A Multi-stage Model (Wang & Brown '99)
8
Auditory Periphery Model
  • A bank of fourth-order gammatone filters
    (Patterson et al. '88)
  • Meddis hair cell model converts gammatone output
    to neural firing
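To make the periphery stage concrete, here is a minimal Python sketch of a fourth-order gammatone filterbank with ERB-spaced center frequencies. The function names, the 1.019 bandwidth factor, and the truncated-FIR implementation are illustrative assumptions rather than the talk's actual code; the Meddis hair cell stage is omitted here.

```python
import numpy as np

def erb(cf):
    """Equivalent rectangular bandwidth (Glasberg & Moore) at center frequency cf (Hz)."""
    return 24.7 * (4.37 * cf / 1000.0 + 1.0)

def gammatone_ir(cf, fs, duration=0.05, order=4):
    """Gammatone impulse response t^(n-1) exp(-2 pi b t) cos(2 pi cf t), truncated."""
    t = np.arange(int(duration * fs)) / fs
    b = 1.019 * erb(cf)  # standard bandwidth scaling for gammatone filters
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * cf * t)
    return g / np.max(np.abs(g))

def gammatone_filterbank(x, fs, n_channels=128, lo=80.0, hi=5000.0):
    """Filter x through n_channels gammatone filters spaced uniformly on the ERB scale."""
    erb_num = 21.4 * np.log10(4.37 * np.array([lo, hi]) / 1000.0 + 1.0)
    cfs = (10.0 ** (np.linspace(erb_num[0], erb_num[1], n_channels) / 21.4) - 1.0) * 1000.0 / 4.37
    responses = np.stack([np.convolve(x, gammatone_ir(cf, fs))[: len(x)] for cf in cfs])
    return responses, cfs
```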

9
Auditory Periphery - Example
  • Hair cell response to the utterance "Why were
    you all weary?" mixed with a telephone ringing
  • 128 filter channels with center frequencies
    arranged on the ERB scale

10
Mid-level Auditory Representations
  • Mid-level representations form the basis for
    segment formation and subsequent grouping
  • Correlogram extracts periodicity information from
    simulated auditory nerve firing patterns
  • Summary correlogram is used to identify F0
  • Cross-correlation between adjacent correlogram
    channels identifies regions that are excited by
    the same frequency component or formant
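A hedged sketch of these mid-level representations, assuming `responses` is a (channels x samples) array from the filterbank stage; the 20 ms window, 12.5 ms maximum lag, and function names are illustrative assumptions.

```python
import numpy as np

def correlogram(responses, fs, t_end, win=0.020, max_lag=0.0125):
    """Normalized autocorrelation of every channel over a window ending at t_end (s)."""
    n, L = int(win * fs), int(max_lag * fs)
    seg = responses[:, int(t_end * fs) - n : int(t_end * fs)]
    acf = np.array([[np.dot(ch[: n - k], ch[k:]) for k in range(L)] for ch in seg])
    return acf / (acf[:, :1] + 1e-12)  # normalize by lag-0 energy

def summary_correlogram(acf):
    """Pool across channels; the dominant nonzero-lag peak estimates the pitch period."""
    return acf.sum(axis=0)

def cross_channel_correlation(acf):
    """Correlation between autocorrelations of adjacent channels, one value per pair."""
    z = (acf - acf.mean(axis=1, keepdims=True)) / (acf.std(axis=1, keepdims=True) + 1e-12)
    return (z[:-1] * z[1:]).mean(axis=1)
```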

11
Mid-level Representations - Example
  • Correlogram and cross-channel correlation for the
    speech/telephone mixture

12
Oscillator Network: Segmentation Layer
  • Horizontal weights are unity, reflecting temporal
    continuity; vertical weights are unity if the
    cross-channel correlation exceeds a threshold,
    and 0 otherwise
  • A global inhibitor ensures that different
    segments have different phases
  • A segment thus formed corresponds to acoustic
    energy in a local time-frequency region that is
    treated as an atomic component of an auditory
    scene
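An illustrative sketch of how these weights could be laid out on the time-frequency grid. Here `cross_corr` is assumed to hold one cross-channel correlation value per adjacent channel pair and frame, and the 0.95 threshold is an assumed value.

```python
import numpy as np

def segmentation_weights(cross_corr, theta=0.95):
    """cross_corr: (channels - 1, frames) adjacent-channel correlations.
    Returns dicts mapping oscillator pairs (channel, frame) to unit weights."""
    n_pairs, n_frames = cross_corr.shape
    n_channels = n_pairs + 1
    # horizontal (time) connections: always unity, encoding temporal continuity
    horizontal = {((c, t), (c, t + 1)): 1.0
                  for c in range(n_channels) for t in range(n_frames - 1)}
    # vertical (frequency) connections: unity only where correlation is high
    vertical = {((c, t), (c + 1, t)): 1.0
                for c, t in zip(*np.nonzero(cross_corr > theta))}
    return horizontal, vertical
```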

13
Segmentation Layer - Example
  • Output of the segmentation layer in response to
    the speech/telephone mixture

14
Oscillator Network: Grouping Layer
  • At each time frame, an F0 estimate from the
    summary correlogram is used to classify channels
    into two categories: those that are consistent
    with the F0 and those that are not
  • Connections are formed between pairs of channels:
    mutual excitation if the channels belong to the
    same F0 category, mutual inhibition otherwise
  • Strong excitation within each segment
  • The second layer embodies the grouping stage of
    ASA
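A sketch of the per-frame classification and the resulting connection signs; `acf` is the frame's correlogram, `pitch_lag` the period from the summary correlogram, and the 0.85 threshold is an assumed value.

```python
import numpy as np

def classify_channels(acf, pitch_lag, theta=0.85):
    """acf: (channels, lags) normalized autocorrelations at one frame.
    A channel is consistent with the F0 when its autocorrelation at the
    pitch period is close to its own maximum over all lags."""
    return acf[:, pitch_lag] > theta * acf.max(axis=1)

def connection_sign(consistent):
    """Mutual excitation (+1) within an F0 category, mutual inhibition (-1) across."""
    return np.where(consistent[:, None] == consistent[None, :], 1.0, -1.0)
```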

15
Grouping Layer - Example
  • Two streams emerge from the grouping layer at
    different times or with different phases
  • Left: Foreground (original mixture)
  • Right: Background

16
Challenges Facing CASA
  • Previous systems, including the Wang-Brown model,
    have difficulty in
  • Dealing with broadband high-frequency mixtures
  • Performing reliable pitch tracking for noisy
    speech
  • Retaining high-frequency energy of the target
    speaker
  • Our next step considers perceptual resolvability
    of various harmonics

17
Resolved and Unresolved Harmonics
  • For voiced speech, lower harmonics are resolved
    while higher harmonics are not
  • For unresolved harmonics, the envelopes of filter
    responses fluctuate at the fundamental frequency
    of speech
  • Hence we apply different grouping mechanisms for
    low-frequency and high-frequency signals
  • Low-frequency signals are grouped based on
    periodicity and temporal continuity
  • High-frequency signals are grouped based on
    amplitude modulation (AM) and temporal continuity
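A small sketch of envelope extraction for the high-frequency channels, using the Hilbert transform; the actual system may instead use half-wave rectification followed by low-pass filtering.

```python
import numpy as np
from scipy.signal import hilbert

def response_envelopes(responses):
    """responses: (channels, samples) filter outputs. In channels where
    harmonics are unresolved, these envelopes fluctuate at the target F0,
    so the same correlogram machinery can be applied to them."""
    return np.abs(hilbert(responses, axis=1))
```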

18
Proposed System (Hu & Wang '02)
19
Envelope Representations - Example
  • (a) Correlogram and cross-channel correlation of
    the hair cell response to clean speech
  • (b) Corresponding representations for the
    response envelopes
20
Initial Segregation
  • The Wang-Brown model is used in this stage to
    generate segments and select the target speech
    stream
  • Segments generated in this stage tend to reflect
    resolved harmonics, but not unresolved ones

21
Pitch Tracking
  • Pitch periods of target speech are estimated from
    the segregated speech stream
  • Estimated pitch periods are checked and
    re-estimated using two psychoacoustically
    motivated constraints:
  • Target pitch should agree with the periodicity of
    the time-frequency (T-F) units in the initial
    speech stream
  • Pitch periods change smoothly, thus allowing for
    verification and interpolation
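A hedged sketch of the smoothness constraint: frames whose estimated period jumps too far from the last reliable frame are discarded and re-filled by interpolation. The 20% tolerance and the linear interpolation are assumptions for illustration.

```python
import numpy as np

def enforce_pitch_continuity(periods, tol=0.2):
    """periods: per-frame pitch periods in samples (0 marks unvoiced frames)."""
    p = periods.astype(float)
    voiced = np.flatnonzero(p > 0)
    if voiced.size == 0:
        return p
    reliable = [voiced[0]]
    for i in voiced[1:]:
        # keep a frame only if it is close to the last reliable estimate
        if abs(p[i] - p[reliable[-1]]) <= tol * p[reliable[-1]]:
            reliable.append(i)
    good = np.array(reliable)
    # re-estimate the discarded voiced frames by linear interpolation
    p[voiced] = np.interp(voiced, good, p[good])
    return p
```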

22
Pitch Tracking - Example
  • (a) Global pitch for a mixture of target speech
    and a cocktail-party intrusion (line: pitch track
    of clean speech)
  • (b) Estimated target pitch

23
T-F Unit Labeling
  • In the low-frequency range
  • A T-F unit is labeled by comparing the
    periodicity of its autocorrelation with the
    estimated target pitch
  • In the high-frequency range
  • Due to their wide bandwidths, high-frequency
    filters generally respond to multiple harmonics.
    These responses are amplitude-modulated due to
    beats and combination tones (Helmholtz, 1863)
  • A T-F unit in the high-frequency range is labeled
    by comparing its AM repetition rate with the
    estimated target pitch
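A sketch of the two labeling rules side by side; the 0.85 threshold and the 15% tolerance are illustrative assumptions.

```python
import numpy as np

def label_low_freq(acf_unit, pitch_lag, theta=0.85):
    """Low-frequency T-F unit: target-dominated when its autocorrelation at
    the estimated pitch period is close to its maximum over all lags."""
    return acf_unit[pitch_lag] > theta * acf_unit.max()

def label_high_freq(am_rate, f0, tol=0.15):
    """High-frequency T-F unit: target-dominated when its AM repetition
    rate matches the estimated target F0 within a relative tolerance."""
    return abs(am_rate - f0) < tol * f0
```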

24
AM - Example
  • (a) The response of a gammatone filter (center
    frequency 2.6 kHz) to clean speech
  • (b) The corresponding autocorrelation function

25
AM Repetition Rates
  • To obtain AM repetition rates, a filter response
    is half-wave rectified and bandpass filtered
  • The resulting signal within a T-F unit is modeled
    by a single sinusoid using the gradient descent
    method. The frequency of the sinusoid indicates
    the AM repetition rate of the corresponding
    response
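A minimal sketch of the fit: the rectified, bandpass-filtered response y[n] within a T-F unit is modeled as a*cos(wn) + b*sin(wn), and (a, b, w) are adjusted by gradient descent on the squared error. The step sizes, iteration count, and F0-based initialization are assumptions.

```python
import numpy as np

def fit_am_rate(y, fs, f0_init, steps=200, lr_ab=1e-3, lr_w=1e-9):
    """Fit a single sinusoid to y and return its frequency (Hz), i.e. the
    AM repetition rate of the corresponding filter response."""
    n = np.arange(len(y), dtype=float)
    w = 2.0 * np.pi * f0_init / fs  # initialize near the current pitch estimate
    a, b = 1.0, 0.0
    for _ in range(steps):
        c, s = np.cos(w * n), np.sin(w * n)
        r = a * c + b * s - y  # residual of the sinusoidal model
        a -= lr_ab * 2.0 * np.dot(r, c)
        b -= lr_ab * 2.0 * np.dot(r, s)
        w -= lr_w * 2.0 * np.dot(r, n * (b * c - a * s))
    return w * fs / (2.0 * np.pi)
```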

26
Final Segregation
  • New segments corresponding to unresolved
    harmonics are formed based on temporal continuity
    and cross-channel correlation of response
    envelopes (i.e. common AM). Then they are grouped
    into the foreground stream according to AM
    repetition rates
  • The foreground stream is adjusted to remove the
    segments that do not agree with the estimated
    target pitch
  • Other units are grouped according to temporal and
    spectral continuity

27
Ideal Binary Mask for Performance Evaluation
  • Within a T-F unit, the ideal binary mask is 1 if
    target energy is stronger than interference
    energy, and 0 otherwise
  • Motivation: auditory masking, whereby a stronger
    signal masks a weaker one within a critical band
  • Further motivation: ideal binary masks yield an
    excellent listening experience and automatic
    speech recognition performance
  • Thus, we suggest using ideal binary masks as
    ground truth for CASA performance evaluation
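Because the target and interference are available before mixing during evaluation, the mask is straightforward to compute; `target_energy` and `interference_energy` are assumed to be (channels x frames) local-energy maps from the same filterbank.

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy):
    """1 where target energy exceeds interference energy in a T-F unit, else 0."""
    return (target_energy > interference_energy).astype(float)
```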

28
Monaural Speech Segregation Example
  • Left: Segregated speech stream (original mixture)
  • Right: Ideal binary mask

29
Systematic Evaluation
  • Evaluated on a corpus of 100 mixtures (Cooke '93):
    10 voiced utterances x 10 noise intrusions
  • The noise intrusions vary widely in character
  • Resynthesis stage allows estimation of target
    speech waveform
  • Evaluation is based on ideal binary masks

30
Signal-to-Noise Ratio (SNR) Results
  • Average SNR gain: 12.1 dB; average improvement
    over the Wang-Brown model: 5 dB
  • Major improvement occurs in target energy
    retention, particularly in the high-frequency
    range
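A sketch of an SNR measure of this kind, assuming the waveform resynthesized from the ideal binary mask serves as the ground-truth signal and the two waveforms are time-aligned.

```python
import numpy as np

def snr_db(ground_truth, estimate):
    """SNR (dB) of an estimated waveform against a ground-truth waveform."""
    noise = ground_truth - estimate
    return 10.0 * np.log10(np.dot(ground_truth, ground_truth)
                           / (np.dot(noise, noise) + 1e-12))
```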

31
Segregation Examples
Panels (left to right): Mixture, Ideal Binary Mask, Wang-Brown, New System
32
How Does the Auditory System Perform ASA?
  • Information about acoustic features (pitch,
    spectral shape, interaural differences, AM, FM)
    is extracted in distributed areas of the auditory
    system
  • Binding problem: how are these features combined
    to form a perceptual whole (stream)?
  • Hierarchies of feature-detecting cells exist, but
    do not seem to constitute a solution to the
    binding problem

33
Oscillatory Correlation Theory (von der Malsburg &
Schneider '86; Wang '96)
  • Neural oscillators are used to represent auditory
    features
  • Oscillators representing features of the same
    source are synchronized (phase-locked with zero
    phase lag), and are desynchronized from
    oscillators representing different sources
  • Supported by growing experimental evidence, e.g.
    oscillations in auditory cortex measured by EEG,
    MEG and local field potentials

34
Oscillatory Correlation Representation
  • FD: Feature Detector

35
Oscillatory Correlation for ASA
  • LEGION dynamics (Terman & Wang '95) provides a
    computational foundation for the oscillatory
    correlation theory
  • The utility of oscillatory correlation has been
    demonstrated for speech separation (Wang & Brown
    '99), modeling auditory attention (Wrigley &
    Brown '01), etc.
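For concreteness, here is a minimal forward-Euler simulation of a single Terman-Wang relaxation oscillator, the building block of LEGION; the parameter values are typical choices from the literature rather than those of the cited systems, and network coupling is folded into the input I.

```python
import numpy as np

def terman_wang(I=0.8, eps=0.02, gamma=6.0, beta=0.1, dt=0.01, steps=20000):
    """Simulate one relaxation oscillator: x is the fast excitatory variable,
    y the slow recovery variable; the oscillator is active when I > 0."""
    x, y = -2.0, 0.0
    xs = np.empty(steps)
    for i in range(steps):
        dx = 3.0 * x - x ** 3 + 2.0 - y + I  # cubic fast nullcline
        dy = eps * (gamma * (1.0 + np.tanh(x / beta)) - y)
        x, y = x + dt * dx, y + dt * dy  # forward-Euler step
        xs[i] = x
    return xs
```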

36
Issues
  • Grouping is entirely pitch-based, hence limited
    to segregating voiced speech
  • How to group unvoiced speech?
  • Target pitch tracking in the presence of multiple
    voiced sources
  • Role of segmentation
  • We found increased robustness with segments as an
    intermediate representation between streams and
    T-F units

37
Summary
  • Multistage ASA approach to monaural speech
    segregation
  • Performs substantially better than previous CASA
    systems
  • Oscillatory correlation theory for ASA
  • Key issue is integration of various grouping cues

38
Collaborators
  • Recent work with Guoning Hu (The Ohio State
    University)
  • Earlier work with Guy Brown (University of
    Sheffield)