An Auditory Scene Analysis Approach to Speech Segregation and Restoration

About This Presentation

Slides: 43

Provided by: Shzu

Transcript and Presenter's Notes

1
An Auditory Scene Analysis Approach to Speech
Segregation and Restoration

DeLiang Wang
Perception and Neurodynamics Lab
Ohio State University
http://www.cse.ohio-state.edu/pnl

2
Outline of presentation

Introduction
Speech segregation problem
Auditory scene analysis (ASA) approach
Voiced speech segregation based on pitch tracking
and amplitude modulation analysis
Unvoiced speech segregation
Phonemic restoration

3
Real-world audition

What?
Source type
Speech
message
speaker
age, gender, linguistic origin, mood,
Music
Car passing by
Where?
Left, right, up, down
How close?
Channel characteristics
Environment characteristics
Room configuration
Ambient noise

4
Humans versus machines

Additionally
Car noise is not an effective speech masker
At 10 dB
At 0 dB
Human word error rate at 0 dB SNR is around 1%, as
opposed to 40% for recognizers with noise
adaptation

Source: Lippmann (1997)
5
Speech segregation problem

In a natural environment, speech is usually
corrupted by acoustic interference. Speech
segregation is critical for many applications,
such as automatic speech recognition (ASR) and
hearing prosthesis
Most speech separation techniques, e.g.
beamforming and blind source separation via
independent component analysis, require multiple
sensors. However, such techniques have clear
limits:
They require configuration stationarity
They can't deal with single-microphone mixtures
Most speech enhancement methods developed for the
monaural situation can deal only with stationary
acoustic interference

6
Auditory scene analysis (Bregman, 1990)

Listeners are able to parse the complex mixture
of sounds arriving at the ears in order to
retrieve a mental representation of each sound
source
Ball-room problem, Helmholtz (1863):
"complicated beyond conception"
Cocktail-party problem, Cherry (1953)
Two conceptual processes of auditory scene
analysis (ASA)
Segmentation. Decompose the acoustic mixture into
sensory elements (segments)
Grouping. Combine segments into groups, so that
segments in the same group likely originate from
the same environmental source

7
Auditory Scene Analysis (cont.)

Two grouping processes
Primitive grouping. Innate data-driven
mechanisms, consistent with those described by
Gestalt psychologists for visual perception
(proximity, similarity, common fate, good
continuation, etc.)
Schema-driven grouping. Application of learned
knowledge about speech, music and other
environmental sounds
Simultaneous vs. sequential organization
Simultaneous organization groups sound components
that overlap in time. Main ASA cues include
periodicity, temporal modulation, and
onset/offset
Sequential organization groups sound components
across time. Main ASA cues include location,
pitch contour and other source characteristics
(e.g. vocal tract size)

8
Computational auditory scene analysis

Computational ASA (CASA) systems approach sound
separation based on ASA principles
(Weintraub, 1985; Cooke, 1993; Brown & Cooke, 1994;
Ellis, 1996; Wang & Brown, 1999)
CASA progress: monaural segregation with minimal
assumptions
CASA challenges
Broadband high-frequency mixtures
Reliable pitch tracking of noisy speech
Unvoiced speech
Sequential organization
Our model for voiced speech segregation (Hu &
Wang, 2004) considers perceptual resolvability of
harmonics

9
Resolved and unresolved harmonics

For voiced speech, lower harmonics are resolved
while higher harmonics are not
For unresolved harmonics, the envelopes of filter
responses fluctuate at the fundamental frequency
of speech
Hence we apply different grouping mechanisms for
low-frequency and high-frequency signals
Low-frequency signals are grouped based on
periodicity and temporal continuity
High-frequency signals are grouped based on
amplitude modulation (AM) and temporal continuity

10
Diagram of the Hu-Wang model
11
Cochleagram: Auditory peripheral model
Spectrogram

Spectrogram
Plot of log energy across time and frequency
(linear frequency scale)
Cochleagram
Cochlear filtering by the gammatone filterbank
(or other models of cochlear filtering), followed
by a stage of nonlinear rectification; the latter
corresponds to hair cell transduction, modeled by
either a hair cell model or simple compression
operations (log and cube root)
Quasi-logarithmic frequency scale, and filter
bandwidth is frequency-dependent
Previous work suggests better resilience to noise
than spectrogram
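
A minimal sketch of the cochleagram computation described above, assuming a 4th-order gammatone impulse response, the Glasberg-Moore ERB formula, half-wave rectification, and cube-root compression of frame energy. The channel count, frequency range, frame sizes, and the names `erb_space`, `gammatone_ir`, and `cochleagram` are illustrative assumptions, not the settings of the system in these slides.

```python
import numpy as np

def erb_space(low_hz, high_hz, n):
    """Center frequencies equally spaced on the ERB-rate (quasi-log) scale."""
    to_erb = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    from_erb = lambda e: (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    return from_erb(np.linspace(to_erb(low_hz), to_erb(high_hz), n))

def gammatone_ir(fc, fs, duration=0.05, order=4):
    """Unit-energy gammatone impulse response; bandwidth grows with fc."""
    t = np.arange(int(duration * fs)) / fs
    bandwidth = 1.019 * 24.7 * (4.37e-3 * fc + 1.0)
    g = t ** (order - 1) * np.exp(-2 * np.pi * bandwidth * t) * np.cos(2 * np.pi * fc * t)
    return g / np.sqrt(np.sum(g ** 2))

def cochleagram(x, fs, n_channels=32, low_hz=80.0, high_hz=5000.0,
                frame=0.020, hop=0.010):
    """Return (channels x frames) compressed energies and the center frequencies."""
    fcs = erb_space(low_hz, high_hz, n_channels)
    flen, fhop = int(frame * fs), int(hop * fs)
    n_frames = 1 + (len(x) - flen) // fhop
    out = np.zeros((n_channels, n_frames))
    for c, fc in enumerate(fcs):
        y = np.convolve(x, gammatone_ir(fc, fs))[: len(x)]  # cochlear filtering
        y = np.maximum(y, 0.0)                               # half-wave rectification
        for m in range(n_frames):
            seg = y[m * fhop : m * fhop + flen]
            out[c, m] = np.cbrt(np.mean(seg ** 2))           # cube-root compression
    return out, fcs
```

For a pure tone, the channel whose center frequency lies closest to the tone should carry the most energy.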

Cochleagram
12
Mid-level auditory representations

Mid-level representations form the basis for
segment formation and subsequent grouping
Correlogram extracts periodicity and AM from
simulated auditory nerve firing patterns
Summary correlogram is used to identify global
pitch
Cross-channel correlation between adjacent
correlogram channels identifies regions that are
excited by the same harmonic or formant
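
As an illustration of the cross-channel correlation above, the sketch below computes the normalized correlation between the autocorrelation functions of adjacent channels. The function name and the zero-mean, unit-norm normalization are assumptions made for this example.

```python
import numpy as np

def cross_channel_correlation(correlogram):
    """correlogram: (channels, lags) array of per-channel autocorrelations.

    Returns one value per adjacent channel pair; values near 1 suggest
    both channels respond to the same harmonic or formant.
    """
    a = correlogram - correlogram.mean(axis=1, keepdims=True)
    norm = np.linalg.norm(a, axis=1, keepdims=True)
    a = a / np.maximum(norm, 1e-12)          # guard against silent channels
    return np.sum(a[:-1] * a[1:], axis=1)
```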

13
Correlogram

Short-term autocorrelation of the output of each
frequency channel of the cochleagram
Peaks in summary correlogram indicate pitch
periods (F0)
A standard model of pitch perception
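
A rough sketch of the correlogram and summary correlogram, assuming one frame of filter output per channel is already available. The plausible F0 range (80 to 500 Hz) and the function names are assumptions for illustration.

```python
import numpy as np

def autocorrelation(frame, max_lag):
    """Short-term autocorrelation for lags 0..max_lag-1, normalized by lag 0."""
    n = len(frame)
    acf = np.array([np.dot(frame[: n - lag], frame[lag:]) for lag in range(max_lag)])
    return acf / acf[0] if acf[0] > 0 else acf

def summary_pitch(channel_frames, fs, f0_min=80.0, f0_max=500.0):
    """Sum per-channel autocorrelations and pick the strongest peak as F0."""
    max_lag = int(fs / f0_min) + 1
    correlogram = np.array([autocorrelation(ch, max_lag) for ch in channel_frames])
    summary = correlogram.sum(axis=0)        # summary correlogram
    lo = int(fs / f0_max)                    # shortest plausible pitch period
    lag = lo + int(np.argmax(summary[lo:]))
    return fs / lag, correlogram, summary
```

In a test with four channels each carrying one harmonic of 200 Hz, the summary peak falls at the 5 ms lag shared by all harmonics.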

Correlogram and summary correlogram of a double
vowel, showing F0s
14
Initial segregation

Segments are formed based on temporal continuity
and cross-channel correlation
Initial grouping into a foreground (target)
stream and a background stream according to
global pitch
Segments generated in this stage tend to reflect
resolved harmonics, but not unresolved ones

15
Pitch tracking

Pitch periods of target speech are estimated from
the segregated speech stream
Estimated pitch periods are checked and
re-estimated using two psychoacoustically
motivated constraints
Target pitch should agree with the periodicity of
the time-frequency units in the initial speech
stream
Pitch periods change smoothly, thus allowing for
verification and interpolation
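
One way to picture the smoothness constraint is sketched below: a pitch period that deviates strongly from the local median is treated as unreliable and then linearly interpolated from reliable neighbors. The 5-frame median window, the 20% tolerance, and the function name are assumptions for this example only, not the constraints used in the model.

```python
import numpy as np

def smooth_pitch_track(periods, max_rel_change=0.2, half_window=2):
    """Verify pitch periods against a local median, then interpolate outliers."""
    p = np.asarray(periods, dtype=float)
    reliable = np.zeros(len(p), dtype=bool)
    for t in range(len(p)):
        lo, hi = max(0, t - half_window), min(len(p), t + half_window + 1)
        med = np.median(p[lo:hi])
        reliable[t] = abs(p[t] - med) <= max_rel_change * med
    idx = np.arange(len(p))
    # Linear interpolation across unreliable frames.
    smoothed = np.interp(idx, idx[reliable], p[reliable])
    return smoothed, reliable
```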

16
Pitch tracking example

(a) Dominant pitch (Line: pitch track of clean
speech) for a mixture of target speech and
cocktail-party intrusion
(b) Estimated target pitch

17
T-F unit labeling

In the low-frequency range
A time-frequency (T-F) unit is labeled by
comparing the periodicity of its autocorrelation
with the estimated target pitch
In the high-frequency range
Due to their wide bandwidths, high-frequency
filters respond to multiple harmonics. These
responses are amplitude modulated due to beats
and combinational tones (Helmholtz, 1863)
A T-F unit in the high-frequency range is labeled
by comparing its AM rate with the estimated
target pitch
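
The labeling rule can be sketched as a ratio test: compare the autocorrelation at the estimated target pitch lag with its maximum over the plausible pitch range. For high-frequency channels, the same test would be applied to the autocorrelation of the response envelope (the AM rate). The 0.85 threshold and the function name are assumptions for illustration.

```python
import numpy as np

def label_unit(acf, target_lag, min_lag, max_lag, threshold=0.85):
    """Label a T-F unit as target (True) if its periodicity matches the target pitch.

    acf: autocorrelation of the unit's response (or of its envelope for
    high-frequency channels), indexed by lag in samples.
    """
    peak = np.max(acf[min_lag : max_lag + 1])
    if peak <= 0:
        return False
    return bool(acf[target_lag] / peak > threshold)
```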

18
AM example

(a) The output of a gammatone filter (center
frequency 2.6 kHz) in response to clean speech
(b) The corresponding autocorrelation function

19
Final segregation

New segments corresponding to unresolved
harmonics are formed based on temporal continuity
and cross-channel correlation of response
envelopes (i.e. common AM). Then they are grouped
into the foreground stream according to the AM
criterion
Other units are grouped according to temporal and
spectral continuity

20
Ideal binary mask for performance evaluation

Within a T-F unit, the ideal binary mask is 1 if
target energy is stronger than interference
energy, and 0 otherwise
Motivation: auditory masking, where a stronger
signal masks a weaker one within a critical band
We have suggested using ideal binary masks as
ground truth for CASA performance evaluation
Consistent with recent speech intelligibility
results (Roman et al., 2003; Brungart et al., 2005)
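
The definition above translates directly into code. This sketch assumes the premixed target and interference energies are available per T-F unit, which is what makes the mask "ideal"; the function name is illustrative.

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy):
    """1 where target energy is stronger than interference energy, 0 otherwise."""
    target_energy = np.asarray(target_energy, dtype=float)
    interference_energy = np.asarray(interference_energy, dtype=float)
    return (target_energy > interference_energy).astype(np.uint8)
```

Units where the two energies are equal receive 0, since the definition requires the target to be strictly stronger.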

21
Ideal binary mask illustration
22
Voiced speech segregation example
23
Systematic SNR results
SNR (in dB)
Hu-Wang model

Evaluation on a corpus of 100 mixtures (Cooke,
1993): 10 voiced utterances x 10 noise intrusions
(see next slide)
Average SNR gain: 12.3 dB. This is 5.2 dB better
than the Wang-Brown model (1999), and 6.4 dB
better than the spectral subtraction method
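
For reference, a generic SNR computation in dB is sketched below. Which signal serves as the reference (for example, the clean target or a target resynthesized from an ideal binary mask) is an assumption left to the caller; the slides do not specify it here.

```python
import numpy as np

def snr_db(reference, estimate):
    """SNR of an estimate against a reference signal, in dB."""
    reference = np.asarray(reference, dtype=float)
    noise = np.asarray(estimate, dtype=float) - reference
    return 10.0 * np.log10(np.sum(reference ** 2) / np.sum(noise ** 2))

def snr_gain_db(reference, mixture, segregated):
    """SNR improvement of the segregated output over the unprocessed mixture."""
    return snr_db(reference, segregated) - snr_db(reference, mixture)
```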

24
Monaural CASA progress via sound demo

100-mixture set used by Cooke (1993)
10 voiced utterances mixed with 10 noise
intrusions (N0: tone, N1: white noise, N2: noise
bursts, N3: cocktail party, N4: rock music, N5:
siren, N6: telephone, N7: female utterance, N8:
male utterance, N9: female utterance)

Sound demos: original mixture of voiced speech,
followed by outputs of Cooke (1993), Ellis (1996),
Wang & Brown (1999), and Hu & Wang (2004), for
telephone, male, and female intrusions
25
Segmentation and unvoiced speech segregation

To deal with unvoiced speech segregation, Hu and
Wang (2004) recently proposed a model of auditory
segmentation that applies to both voiced and
unvoiced speech
Segmentation amounts to identifying onsets and
offsets of individual T-F regions
Onset/offset analysis employs scale-space theory,
which is a multiscale analysis commonly used in
image segmentation
The strategy for general speech segregation is to
first segregate voiced speech using the pitch
cue, and then deal with unvoiced speech
To segregate unvoiced speech, we perform auditory
segmentation, and then group segments that
correspond to unvoiced speech
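
The onset/offset analysis can be sketched at a single scale: smooth a channel's intensity over time with a Gaussian, differentiate, and mark peaks of the derivative as onsets and valleys as offsets. A multiscale version would repeat this over several values of `sigma`. The threshold, the edge padding, and the function names are assumptions for this example.

```python
import numpy as np

def gaussian_kernel(sigma):
    radius = int(4 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-(t ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def onsets_offsets(intensity, sigma, threshold):
    """Return frame indices of onset and offset candidates at one scale."""
    radius = int(4 * sigma)
    padded = np.pad(np.asarray(intensity, dtype=float), radius, mode="edge")
    smooth = np.convolve(padded, gaussian_kernel(sigma), mode="valid")
    d = np.diff(smooth)                       # derivative of smoothed intensity
    i = np.arange(1, len(d) - 1)
    is_onset = (d[i] > d[i - 1]) & (d[i] >= d[i + 1]) & (d[i] > threshold)
    is_offset = (d[i] < d[i - 1]) & (d[i] <= d[i + 1]) & (d[i] < -threshold)
    return i[is_onset], i[is_offset]
```

Index `k` in the output refers to the change between frames `k` and `k + 1`.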

26
Example of segregating fricatives/affricates
Utterance: "That noise problem grows more
annoying each day"
Interference: Crowd noise with music
(IBM: Ideal binary mask)
27
Phonemic restoration phenomenon

When an extraneous sound such as a cough replaces
a part of speech, listeners believe they hear the
missing speech sound; in addition, they cannot
localize the extraneous sound (Warren, 1970)
If silence replaces a speech sound, the gap is
correctly localized
Phonemic restoration depends on properties of
noise source and linguistic skills of the
listener
A sequential integration process involving
top-down (schema-based) and bottom-up (primitive)
continuity

With silence

With cough

28
A visual analogue (Bregman, 1981)

An instance of visual completion

29
Modeling phonemic restoration

The main motivation is to complement speech
segregation in order to recover masked speech
Previous models for phonemic restoration only use
temporal continuity (Cooke & Brown, 1993;
Masuda-Katsuse & Kawahara, 1999)
Inability to deal with unvoiced speech
Our approach (Srinivasan & Wang, 2005) follows the
interpretation that phonemic restoration uses
intact portions of the speech signal to
interpolate (synthesize) masked phonemes
Use lexical knowledge to hypothesize the noisy
word and use the hypothesis to predict the masked
phoneme

30
Schema-based model (Srinivasan & Wang, 2005)
31
Processing steps

Input is converted into a spectrogram
Identify reliable frames and T-F units
Missing-data ASR provides word level recognition
Select word template based on recognition
Dynamically time warp the template to the noisy
word and replace unreliable T-F units
Pitch based smoothing as postprocessing

32
Frame-level reliability labeling

Train a multilayer perceptron to label each frame
Input features are spectral flatness (SFM) and
normalized energy (NE)
Output: frame labels indicating reliable (1) and
unreliable (0) frames
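
The two input features can be sketched as follows. Spectral flatness is taken here as the ratio of the geometric mean to the arithmetic mean of the power spectrum, and normalized energy as frame energy divided by the maximum frame energy in the utterance; both definitions and the function names are assumptions, and the multilayer perceptron itself is not shown.

```python
import numpy as np

def spectral_flatness(power_spectrum, eps=1e-12):
    """Geometric mean over arithmetic mean: near 1 for noise-like frames."""
    p = np.asarray(power_spectrum, dtype=float) + eps
    return float(np.exp(np.mean(np.log(p))) / np.mean(p))

def normalized_energy(frames):
    """Per-frame energy scaled by the largest frame energy in the utterance."""
    energy = np.sum(np.asarray(frames, dtype=float) ** 2, axis=1)
    return energy / np.max(energy)
```

A white noise burst has a flat spectrum and hence a high SFM, while voiced speech has a peaky spectrum and a low SFM.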

Word "five" interrupted by a white noise burst
33
Analyzing unreliable frames

Kalman filtering is used to predict spectral
coefficients in unreliable frames from the
spectral trajectories of reliable frames
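
A minimal sketch of this idea for a single spectral coefficient, assuming a constant-velocity state model (value and slope) and assuming the first frame is reliable. In unreliable frames the measurement update is skipped, so the output is the filter's prediction. The state model, noise settings, and the name `kalman_fill` are assumptions; the slides do not give the actual formulation.

```python
import numpy as np

def kalman_fill(z, reliable, q=1e-4, r=1e-4):
    """Track one coefficient over frames; predict through unreliable frames."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])   # value advances by slope each frame
    H = np.array([[1.0, 0.0]])               # only the value is observed
    Q = q * np.eye(2)
    x = np.array([z[0], 0.0])                # assumes frame 0 is reliable
    P = np.diag([r, 1e3])                    # slope initially unknown
    out = np.empty(len(z))
    out[0] = z[0]
    for t in range(1, len(z)):
        x = F @ x                            # predict
        P = F @ P @ F.T + Q
        if reliable[t]:                      # update only with reliable frames
            S = (H @ P @ H.T)[0, 0] + r
            K = (P @ H.T)[:, 0] / S
            x = x + K * (z[t] - x[0])
            P = P - np.outer(K, H @ P)
        out[t] = x[0]
    return out
```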

34
Missing data recognition

When speech is contaminated by additive noise,
some T-F regions contain predominantly speech
energy and the rest contain predominantly noise
energy
The missing data method (Cooke et al., 2001) treats
the noise-dominant T-F regions as missing or
unreliable during recognition
The recognizer marginalizes the missing parts
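
For a diagonal-covariance Gaussian, marginalizing the missing dimensions amounts to evaluating the likelihood over the reliable dimensions only. The sketch below shows this for a single Gaussian; a real recognizer would apply it inside each HMM state's mixture. The function name is an assumption.

```python
import numpy as np

def marginal_log_likelihood(x, reliable, mean, var):
    """Log-likelihood of x under a diagonal Gaussian, ignoring unreliable dims."""
    m = np.asarray(reliable, dtype=bool)
    d = np.asarray(x, dtype=float)[m] - np.asarray(mean, dtype=float)[m]
    v = np.asarray(var, dtype=float)[m]
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * v) + d ** 2 / v))
```

A noise-dominated value in an unreliable dimension therefore has no effect on the score.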

35
Missing data marginalization method
36
Word templates

Use linguistic knowledge stored in ASR
A word template corresponds to a speech schema
Missing-data speech recognition is used to
recognize speech sounds as words based primarily
on reliable portions of the input signal
A word template corresponding to the recognized
word is then used to insert/induce relevant
acoustic signal in the frames containing the
extraneous sound

37
Training of ASR and word templates

The vocabulary for the task is the digits (1-9,
zero and oh) from the TIDigits corpus, plus
silence and short pauses between words
A 10-state continuous-density HMM is used to model
each word
Train 2 word-level templates for each word:
speaker independent (SI) and speaker dependent
(SD)
Each template is a dynamically time-warped
cepstral average

38
Phonemic synthesis

Choose the word template corresponding to the
noisy word and warp it to the noisy word segment
in the input signal by dynamic time warping
The T-F units of the template corresponding to
the masked T-F units substitute the masked units
Restored information may not conform with the
speaking style and rate in the rest of the
utterance. Hence, restored frames are further
pitch-synchronized with the remainder of the
utterance
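
The warping and substitution steps can be sketched with a textbook dynamic time warping routine. In this sketch the frame distance is computed over reliable units only, so masked units do not distort the alignment; that choice, the array layout (frames x units), and the function names are assumptions. Pitch-based postprocessing is not shown.

```python
import numpy as np

def dtw_path(template, observed, reliable):
    """Align template frames to observed frames; returns (template, observed) index pairs."""
    n, m = len(template), len(observed)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = (template[i - 1] - observed[j - 1]) * reliable[j - 1]
            cost[i, j] = np.linalg.norm(diff) + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:                   # backtrack along the cheapest steps
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def restore(template, observed, reliable):
    """Replace masked T-F units with the aligned template's units."""
    restored = observed.copy()
    for i, j in dtw_path(template, observed, reliable):
        masked = ~reliable[j]
        restored[j, masked] = template[i, masked]
    return restored
```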

39
Example results

White noise intrusion

Demos: clean, masked phoneme, restoration using
Kalman filter (KF), schema-based restoration

Cough intrusion

Demos: clean, masked, restoration by KF,
speaker-independent restoration,
speaker-dependent restoration
40
Systematic evaluation results

Performance with white noise as the masker
Similar performance is obtained with clicks and
cough
N - Distance between clean and noisy speech
SD and SI - The performance of our model with
speaker-dependent and speaker-independent
templates respectively
KF - The performance of the Kalman filter model
of Masuda-Katsuse and Kawahara (1999)

41
Conclusion

CASA approach to the cocktail party problem
The monaural approach performs substantially
better than previous CASA systems and other
separation approaches
Onset/offset based segmentation and unvoiced
speech segregation
A schema-based model for phonemic restoration
Models based on temporal continuity alone cannot
restore phonemes that lack continuity with their
neighboring phonemes

42
Acknowledgment