An Auditory Scene Analysis Approach to Speech Segregation
1
An Auditory Scene Analysis Approach to Speech
Segregation
  • DeLiang Wang
  • Perception and Neurodynamics Lab
  • The Ohio State University

2
Outline of presentation
  • Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
  • Voiced speech segregation based on pitch tracking
    and amplitude modulation analysis
  • Ideal binary mask as CASA goal
  • Unvoiced speech segregation
  • Auditory segmentation
  • Neurobiological basis of ASA

3
Real-world audition
  • What?
    • Source type
      • Speech: message; speaker (age, gender,
        linguistic origin, mood, ...)
      • Music
      • Car passing by
  • Where?
    • Left, right, up, down
    • How close?
  • Channel characteristics
    • Environment characteristics: room configuration,
      ambient noise

4
Humans versus machines
  • Additionally
  • Car noise is not a very effective speech masker
  • At 10 dB
  • At 0 dB
  • Human word error rate at 0 dB SNR is around 1%, as
    opposed to 100% for unmodified recognisers
    (around 40% with noise adaptation)

Source: Lippmann (1997)
5
Speech segregation problem
  • In a natural environment, speech is usually
    corrupted by acoustic interference. Speech
    segregation is critical for many applications,
    such as automatic speech recognition and hearing
    prosthesis
  • Most speech separation techniques, e.g.
    beamforming and blind source separation via
    independent component analysis, require multiple
    sensors. However, such techniques have clear
    limits
  • They rely on a stationary configuration of
    sources and sensors
  • They cannot deal with single-microphone mixtures,
    or with situations where multiple sounds arrive
    from directions close to one another
  • Most speech enhancement methods developed for the
    monaural situation can deal only with stationary
    acoustic interference

6
Auditory scene analysis (Bregman, 1990)
  • Listeners are able to parse the complex mixture
    of sounds arriving at the ears in order to
    retrieve a mental representation of each sound
    source
  • Ball-room problem (Helmholtz, 1863): "complicated
    beyond conception"
  • Cocktail-party problem (Cherry, 1953)
  • Two conceptual processes of auditory scene
    analysis (ASA)
  • Segmentation. Decompose the acoustic mixture into
    sensory elements (segments)
  • Grouping. Combine segments into groups, so that
    segments in the same group are likely to have
    originated from the same environmental source

7
Computational auditory scene analysis
  • Computational ASA (CASA) systems approach sound
    separation based on ASA principles
  • Weintraub (1985), Cooke (1993), Brown & Cooke
    (1994), Ellis (1996), Wang & Brown (1999)
  • CASA progress: monaural segregation with minimal
    assumptions
  • CASA challenges
  • Broadband high-frequency mixtures
  • Reliable pitch tracking of noisy speech
  • Unvoiced speech

8
Outline of presentation
  • Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
  • Voiced speech segregation based on pitch tracking
    and amplitude modulation analysis
  • Ideal binary mask as CASA goal
  • Unvoiced speech segregation
  • Auditory segmentation
  • Neurobiological basis of ASA

9
Resolved and unresolved harmonics
  • For voiced speech, lower harmonics are resolved
    while higher harmonics are not
  • For unresolved harmonics, the envelopes of filter
    responses fluctuate at the fundamental frequency
    of speech
  • Our model (Hu & Wang, 2004) applies different
    grouping mechanisms for low-frequency and
    high-frequency signals
  • Low-frequency signals are grouped based on
    periodicity and temporal continuity
  • High-frequency signals are grouped based on
    amplitude modulation (AM) and temporal continuity

10
Diagram of the Hu-Wang model
11
Cochleogram: Auditory peripheral model
  • Spectrogram
  • Plot of log energy across time and frequency
    (linear frequency scale)
  • Cochleogram
  • Cochlear filtering by the gammatone filterbank
    (or other models of cochlear filtering), followed
    by a stage of nonlinear rectification; the latter
    corresponds to hair cell transduction, modeled by
    either a hair cell model or simple compression
    operations (log or cube root); see the sketch
    below
  • Quasi-logarithmic frequency scale, and filter
    bandwidth is frequency-dependent
  • Previous work suggests better resilience to noise
    than the spectrogram

Cochleogram
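To make the cochleogram stage concrete, here is a minimal Python sketch (not the implementation used in the model): a gammatone filterbank applied by direct convolution, followed by framewise energy with cube-root compression as the nonlinear rectification. The ERB spacing, channel count, and frame sizes are illustrative assumptions.

import numpy as np
from scipy.signal import fftconvolve

def erb(f):
    # Equivalent rectangular bandwidth (Glasberg & Moore approximation)
    return 24.7 * (4.37 * f / 1000.0 + 1.0)

def gammatone_ir(fc, fs, dur=0.064):
    # Impulse response of a 4th-order gammatone filter at center frequency fc
    t = np.arange(int(dur * fs)) / fs
    b = 1.019 * erb(fc)
    return t**3 * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)

def cochleogram(x, fs, n_chan=64, fmin=80.0, fmax=5000.0, frame=0.020, hop=0.010):
    # Center frequencies spaced uniformly on the ERB-rate scale
    erb_rate = np.linspace(21.4 * np.log10(4.37e-3 * fmin + 1.0),
                           21.4 * np.log10(4.37e-3 * fmax + 1.0), n_chan)
    fcs = (10.0 ** (erb_rate / 21.4) - 1.0) / 4.37e-3
    flen, fhop = int(frame * fs), int(hop * fs)
    n_frames = 1 + (len(x) - flen) // fhop
    cg = np.zeros((n_chan, n_frames))
    for c, fc in enumerate(fcs):
        y = fftconvolve(x, gammatone_ir(fc, fs), mode='same')   # cochlear filtering
        for m in range(n_frames):
            seg = y[m * fhop: m * fhop + flen]
            cg[c, m] = np.sum(seg ** 2) ** (1.0 / 3.0)          # cube-root compression
    return cg, fcs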
12
Mid-level auditory representations
  • Mid-level representations form the basis for
    segment formation and subsequent grouping
  • Correlogram extracts periodicity and AM from
    simulated auditory nerve firing patterns
  • Summary correlogram is used to identify global
    pitch
  • Cross-channel correlation between adjacent
    correlogram channels identifies regions that are
    excited by the same harmonic or formant

13
Correlogram
  • Short-term autocorrelation of the output of each
    frequency channel of the cochleogram
  • Peaks in summary correlogram indicate pitch
    periods (F0)
  • A standard model of pitch perception

Correlogram and summary correlogram of a double
vowel, showing the two F0s
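A hedged sketch of the correlogram for one time frame, assuming responses holds the per-channel filter (or hair cell) outputs from the peripheral stage; the frame length and lag range assume a 16 kHz sampling rate and are illustrative, not the model's exact settings.

import numpy as np

def frame_correlogram(responses, start, flen=320, max_lag=200):
    # responses: (n_chan, n_samples) per-channel outputs; 16 kHz assumed
    # flen = 20 ms frame; max_lag = 12.5 ms, covering F0 down to 80 Hz
    n_chan = responses.shape[0]
    acf = np.zeros((n_chan, max_lag))
    for c in range(n_chan):
        seg = responses[c, start:start + flen + max_lag]
        base = seg[:flen]
        for tau in range(max_lag):
            acf[c, tau] = np.dot(base, seg[tau:tau + flen])   # short-term autocorrelation
    summary = acf.sum(axis=0)                                 # summary correlogram
    pitch_lag = np.argmax(summary[32:]) + 32                  # global pitch period (skip lags < 2 ms)
    return acf, summary, pitch_lag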
14
Cross-channel correlation
(a) Correlogram and cross-channel correlation of
the hair cell response to clean speech. (b) The
corresponding representations for response
envelopes
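The cross-channel correlation itself reduces to a normalized inner product between the autocorrelations of neighbouring channels. A small sketch under that assumption (the model's exact normalization may differ), reusing the per-frame ACF array from the sketch above:

import numpy as np

def cross_channel_correlation(acf):
    # acf: (n_chan, max_lag) autocorrelation functions for one time frame
    a = acf - acf.mean(axis=1, keepdims=True)
    a = a / (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    return np.sum(a[:-1] * a[1:], axis=1)   # correlation between channel c and c+1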
15
Initial segregation
  • Segments are formed based on temporal continuity
    and cross-channel correlation
  • Segments generated in this stage tend to reflect
    resolved harmonics, but not unresolved ones
  • Initial grouping into a foreground (target)
    stream and a background stream according to
    global pitch using the oscillatory correlation
    model of Wang and Brown (1999)

16
Pitch tracking
  • Pitch periods of target speech are estimated from
    the segregated speech stream
  • Estimated pitch periods are checked and
    re-estimated using two psychoacoustically
    motivated constraints
  • Target pitch should agree with the periodicity of
    the time-frequency units in the initial speech
    stream
  • Pitch periods change smoothly, thus allowing for
    verification and interpolation
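The sketch below illustrates the two constraints in a simplified form: a frame's pitch estimate is kept only if enough T-F units in the initial target stream are periodic at that lag, and rejected frames are re-filled by interpolating from reliable neighbours. The thresholds and the interpolation scheme are illustrative assumptions, not the model's actual values.

import numpy as np

def verify_pitch(pitch_lags, unit_acfs, agree_thresh=0.85, min_agree=0.5):
    # pitch_lags: (n_frames,) estimated pitch periods in samples (0 = no estimate)
    # unit_acfs: per frame, an (n_units, max_lag) array of ACFs of the T-F units
    #            belonging to the initial target stream
    verified = pitch_lags.astype(float)
    for m, lag in enumerate(pitch_lags):
        if lag == 0 or len(unit_acfs[m]) == 0:
            verified[m] = np.nan
            continue
        # a unit agrees if its ACF at the pitch lag is near its own maximum
        scores = unit_acfs[m][:, lag] / (unit_acfs[m].max(axis=1) + 1e-12)
        if np.mean(scores > agree_thresh) < min_agree:
            verified[m] = np.nan            # inconsistent with the target stream
    # smoothness constraint: interpolate rejected frames from reliable neighbours
    ok = ~np.isnan(verified)
    if ok.any():
        verified[~ok] = np.interp(np.where(~ok)[0], np.where(ok)[0], verified[ok])
    return verified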

17
Pitch tracking example
  • (a) Global pitch (line: pitch track of clean
    speech) for a mixture of target speech and
    cocktail-party intrusion
  • (b) Estimated target pitch

18
T-F unit labeling
  • In the low-frequency range
  • A time-frequency (T-F) unit is labeled by
    comparing the periodicity of its autocorrelation
    with the estimated target pitch
  • In the high-frequency range
  • Due to their wide bandwidths, high-frequency
    filters respond to multiple harmonics. These
    responses are amplitude modulated due to beats
    and combination tones (Helmholtz, 1863)
  • A T-F unit in the high-frequency range is labeled
    by comparing its AM repetition rate with the
    estimated target pitch
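A sketch of the labeling decision for a single T-F unit, with illustrative thresholds: a low-frequency unit is labeled target when its autocorrelation at the estimated pitch lag is close to its maximum, and a high-frequency unit when its AM repetition rate matches the pitch frequency.

import numpy as np

def label_unit(acf, am_rate_hz, pitch_lag, fs, is_high_freq,
               theta_p=0.85, theta_am=0.2):
    # acf: autocorrelation of one T-F unit; pitch_lag: target pitch period (samples)
    pitch_hz = fs / pitch_lag
    if not is_high_freq:
        # low-frequency unit: periodicity test against the target pitch
        return acf[pitch_lag] / (acf.max() + 1e-12) > theta_p
    # high-frequency unit: AM repetition rate must match the pitch frequency
    return abs(am_rate_hz - pitch_hz) / pitch_hz < theta_am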

19
AM example
  • (a) The output of a gammatone filter (center
    frequency 2.6 kHz) in response to clean speech
  • (b) The corresponding autocorrelation function

20
AM repetition rates
  • To obtain AM repetition rates, a filter response
    is half-wave rectified and bandpass filtered
  • The resulting signal within a T-F unit is modeled
    by a single sinusoid using the gradient descent
    method. The frequency of the sinusoid indicates
    the AM repetition rate of the corresponding
    response
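A sketch of the AM-rate estimate for one high-frequency T-F unit, with hypothetical filter settings and step sizes: half-wave rectify and bandpass the response, then fit a single sinusoid by gradient descent and read off its frequency.

import numpy as np
from scipy.signal import butter, filtfilt

def am_repetition_rate(unit_resp, fs, f0_init=150.0, iters=300):
    # unit_resp: filter response within one high-frequency T-F unit
    t = np.arange(len(unit_resp)) / fs
    env = np.maximum(unit_resp, 0.0)                       # half-wave rectification
    b, a = butter(2, [50.0 / (fs / 2), 400.0 / (fs / 2)], btype='band')
    env = filtfilt(b, a, env)                              # bandpass to plausible F0 range
    env = env / (np.std(env) + 1e-12)                      # scale-free gradient steps
    A, B, f = 1.0, 0.0, f0_init                            # single-sinusoid parameters
    for _ in range(iters):
        w = 2.0 * np.pi * f * t
        err = A * np.cos(w) + B * np.sin(w) - env          # fitting residual
        A -= 0.2 * np.mean(err * np.cos(w))
        B -= 0.2 * np.mean(err * np.sin(w))
        f -= 1.0 * np.mean(err * (B * np.cos(w) - A * np.sin(w)) * 2.0 * np.pi * t)
    return f                                               # AM repetition rate in Hz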

21
Final segregation
  • New segments corresponding to unresolved
    harmonics are formed based on temporal continuity
    and cross-channel correlation of response
    envelopes (i.e. common AM). Then they are grouped
    into the foreground stream according to AM
    repetition rates
  • Other units are grouped according to temporal and
    spectral continuity

22
Ideal binary mask for performance evaluation
  • Within a T-F unit, the ideal binary mask is 1 if
    target energy is stronger than interference
    energy, and 0 otherwise
  • Motivation: auditory masking, in which a stronger
    signal masks a weaker one within a critical band
  • We have suggested using ideal binary masks as
    ground truth for CASA performance evaluation
  • Consistent with recent speech intelligibility
    results (Roman et al., 2003; Brungart et al., 2005)
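The definition translates directly into code. A minimal sketch, assuming the premixed target and interference energies are available in the same T-F (cochleogram) domain; lc_db is a parameter added here for illustration, and the 0 dB default reproduces the definition above.

import numpy as np

def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0):
    # target_energy, interference_energy: (n_chan, n_frames) local T-F energies
    local_snr = 10.0 * np.log10((target_energy + 1e-12) /
                                (interference_energy + 1e-12))
    return (local_snr > lc_db).astype(np.uint8)   # 1 where target dominates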

23
Ideal binary mask illustration
24
Voiced speech segregation example
25
Systematic SNR results
(Chart: SNR in dB, Hu-Wang model)
  • Evaluation on a corpus of 100 mixtures (Cooke,
    1993): 10 voiced utterances x 10 noise intrusions
    (see next slide)
  • Average SNR gain: 12.3 dB; 5.2 dB better than the
    Wang-Brown model (1999), and 6.4 dB better than
    the spectral subtraction method
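For reference, the kind of SNR computation behind these gains; the evaluation's exact choice of reference signal may differ, and here the clean target is simply taken as the reference.

import numpy as np

def snr_db(reference, estimate):
    # SNR of an estimate relative to a reference signal, in dB
    noise = estimate - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / (np.sum(noise ** 2) + 1e-12))

# SNR gain = SNR of the segregated output minus SNR of the unprocessed mixture,
# e.g. snr_db(clean, segregated) - snr_db(clean, mixture)   (hypothetical variables)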

26
CASA progress on voiced speech segregation
  • 100 mixture set used by Cooke (1993)
  • 10 voiced utterances mixed with 10 noise
    intrusions (N0: tone, N1: white noise, N2: noise
    bursts, N3: cocktail party, N4: rock music, N5:
    siren, N6: telephone, N7: female utterance, N8:
    male utterance, N9: female utterance)

(Audio demos: original mixtures of voiced speech with
telephone, male, and female intrusions, and segregated
outputs from Cooke (1993), Ellis (1996), Wang & Brown
(1999), and Hu & Wang (2004))
27
Outline of presentation
  • Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
  • Voiced speech segregation based on pitch tracking
    and amplitude modulation analysis
  • Ideal binary mask as CASA goal
  • Unvoiced speech segregation
  • Auditory segmentation
  • Neurobiological basis of ASA

28
Segmentation and unvoiced speech segregation
  • To deal with unvoiced speech segregation, we (Hu &
    Wang, 2004) proposed a model of auditory
    segmentation that applies to both voiced and
    unvoiced speech
  • The task of segmentation is to decompose an
    auditory scene into contiguous T-F regions, each
    of which should contain signal from the same
    sound source
  • The definition of segmentation does not
    distinguish between voiced and unvoiced sounds
  • This is equivalent to identifying onsets and
    offsets of individual T-F regions, which
    generally correspond to sudden changes of
    acoustic energy
  • The segmentation strategy is based on onset and
    offset analysis

29
Scale-space analysis for auditory segmentation
  • From a computational standpoint, auditory
    segmentation is similar to image (visual)
    segmentation
  • Visual segmentation Finding bounding contours of
    visual objects
  • Auditory segmentation Finding onset and offset
    fronts of segments
  • Onset/offset analysis employs scale-space theory,
    which is a multiscale analysis commonly used in
    image segmentation
  • Smoothing
  • Onset/offset detection and onset/offset front
    matching
  • Multiscale integration
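A hedged sketch of the first two steps only, smoothing and onset/offset detection; front matching and multiscale integration are omitted, and the scales and threshold are illustrative assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter1d

def onsets_offsets(intensity_db, scales=(1.0, 4.0, 8.0), thresh=3.0):
    # intensity_db: (n_chan, n_frames) log intensity in each channel
    per_scale = []
    for sigma in scales:
        smoothed = gaussian_filter1d(intensity_db, sigma, axis=1)   # smoothing
        d = np.gradient(smoothed, axis=1)                           # temporal derivative
        onsets = (d > thresh) & (np.roll(d, 1, axis=1) <= thresh)   # rising crossings
        offsets = (d < -thresh) & (np.roll(d, 1, axis=1) >= -thresh)
        per_scale.append((onsets, offsets))
    return per_scale   # candidate onset/offset maps at each scale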

30
Example of auditory segmentation
31
Speech segregation
  • The general strategy for speech segregation is to
    first segregate voiced speech using the pitch
    cue, and then deal with unvoiced speech
  • To segregate unvoiced speech, we perform auditory
    segmentation, and then group segments that
    correspond to unvoiced speech

32
Segment classification
  • For nonspeech interference, grouping is in fact a
    classification task: classify segments as either
    speech or non-speech
  • The following features are used for
    classification:
  • Spectral envelope
  • Segment duration
  • Segment intensity
  • Training data
  • Speech: training part of the TIMIT database
  • Interference: 90 natural intrusions including
    street noise, crowd noise, wind, etc.
  • A Gaussian mixture model is trained for each
    phoneme, and for interference as well, which
    provides the basis for a likelihood ratio test
    (see the sketch below)
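A sketch of the classification machinery using scikit-learn GMMs (an assumption for illustration; the original work does not use this library): one model per phoneme plus one for interference, and a likelihood-ratio decision over a segment's feature vectors. The feature extraction and training corpora (TIMIT, the 90 intrusions) are assumed to exist elsewhere.

from sklearn.mixture import GaussianMixture

def train_models(phoneme_features, interference_features, n_comp=8):
    # phoneme_features: dict mapping phoneme label -> (n_samples, n_dims) features
    speech_models = {p: GaussianMixture(n_comp).fit(feats)
                     for p, feats in phoneme_features.items()}
    noise_model = GaussianMixture(n_comp).fit(interference_features)
    return speech_models, noise_model

def is_speech_segment(segment_feats, speech_models, noise_model, threshold=0.0):
    # likelihood ratio test: best phoneme model vs. the interference model
    ll_speech = max(m.score(segment_feats) for m in speech_models.values())
    ll_noise = noise_model.score(segment_feats)
    return ll_speech - ll_noise > threshold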

33
Example of segregating fricatives/affricates
Utterance: "That noise problem grows more
annoying each day". Interference: crowd noise with
music. (IBM: ideal binary mask)
34
Example of segregating stops
Utterance: "A good morrow to you, my
boy". Interference: rain
35
Outline of presentation
  • Introduction
  • Speech segregation problem
  • Auditory scene analysis (ASA) approach
  • Voiced speech segregation based on pitch tracking
    and amplitude modulation analysis
  • Ideal binary mask as CASA goal
  • Unvoiced speech segregation
  • Auditory segmentation
  • Neurobiological basis of ASA

36
How does the auditory system perform ASA?
  • Information about acoustic features (pitch,
    spectral shape, interaural differences, AM, FM)
    is extracted in distributed areas of the auditory
    system
  • Binding problem: how are these features combined
    to form a perceptual whole (stream)?
  • Hierarchies of feature-detecting cells exist, but
    do not seem to constitute a solution to the
    binding problem

37
Oscillatory correlation theory for ASA
  • Neural oscillators are used to represent auditory
    features
  • Oscillators representing features of the same
    source are synchronized, and are desynchronized
    from those representing different sources
  • Originally proposed by von der Malsburg and
    Schneider (1986), and further developed by Wang
    (1996)
  • Supported by growing experimental evidence

38
Oscillatory correlation representation
  • FD: feature detector

39
Oscillatory correlation for ASA
  • LEGION dynamics (Terman & Wang, 1995) provides a
    computational foundation for the oscillatory
    correlation theory
  • The utility of oscillatory correlation has been
    demonstrated for speech segregation
    (Wang & Brown, 1999), modeling auditory attention
    (Wrigley & Brown, 2004), etc.

40
Summary
  • CASA approach to monaural speech segregation
  • Performs substantially better than previous CASA
    systems for voiced speech segregation
  • AM cue and target pitch tracking are important
    for performance improvement
  • Early steps for unvoiced speech segregation
  • Auditory segmentation based on onset/offset
    analysis
  • Segregation using speech classification
  • Oscillatory correlation theory for ASA

41
Acknowledgment
  • Joint work with Guoning Hu
  • Funded by AFOSR/AFRL and NSF