SPEECH ENHANCEMENT presentation

1
SPEECH ENHANCEMENT

Chunjian Li
Aalborg University, Denmark

2
Introduction

Applications
Improving quality and intelligibility (hearing
aids, cockpit comm., video conferencing ...)
Source coding (mobile phone, video conferencing,
IP phone ...)
Pre-processor for other speech processing
applications (speech recognition, speaker
verification ...)

3
Introduction

Classification 1
Single channel
Multi-channel
with acoustic barrier (Adaptive
Noise Canceling)
without acoustic barrier (Beam forming, ICA)
Classification 2
Spectral domain methods (Power Spectral
Subtraction, Amplitude Spectral Subtraction,
Autocorrelation Subtraction, Non-causal IIR
Wiener Filtering)
Time domain methods (Adaptive noise canceling,
Kalman Filtering, non-stationary Wiener
filtering)

4
Spectral subtraction
5
Power Spectral Subtraction

Stochastic Model
Noise process broadband, stationary (or
short-time stationary), uncorrelated to speech,
additive.
Speech process short-time stationary.
Need short-time processing

6
Power Spectral Subtraction

Important relation in the Power Spectral Density
(PSD) domain
This is true only when the noise is
uncorrelated with the speech signal.
To be concise, the index m is dropped in the
following discussion
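In one common notation (the symbols here are assumptions, not taken from the slides), let $Y(\omega) = X(\omega) + D(\omega)$ be the short-time spectra of the noisy speech, the clean speech, and the noise, all zero mean. The relation then follows by expanding the expected power, and the cross term is what vanishes under the uncorrelatedness assumption:

```latex
\begin{aligned}
E\{|Y(\omega)|^2\}
  &= E\{|X(\omega)|^2\} + E\{|D(\omega)|^2\}
     + 2\,\mathrm{Re}\,E\{X(\omega)\,D^*(\omega)\} \\
  &= E\{|X(\omega)|^2\} + E\{|D(\omega)|^2\}
     \qquad \text{when } E\{X(\omega)\,D^*(\omega)\} = 0 .
\end{aligned}
```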

7
Power Spectral Subtraction
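$|\hat{X}(\omega)|^2 = |Y(\omega)|^2 - |\hat{D}(\omega)|^2$   (1)
where $Y(\omega)$ is the short-time spectrum of the noisy speech, $\hat{D}(\omega)$ the estimated noise spectrum, and $\hat{X}(\omega)$ the spectrum of the enhanced speech.
Power Spectral Subtraction methods use the
noisy phase spectrum to synthesize the enhanced
signal
8
Generalized Spectral Subtraction and its variants

Generalization
Eq. (1) can be written in the generalized form

$|\hat{X}(\omega)|^a = |Y(\omega)|^a - |\hat{D}(\omega)|^a$   (2)

Eq. (1) is the case a = 2. When a = 1, eq. (2) is called Amplitude Spectral
Subtraction (Boll, 1979).
Variant: Correlation subtraction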
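As a rough illustration of the subtraction rule for a general exponent a, the sketch below processes a single frame and reuses the noisy phase for synthesis. The function name `spectral_subtract` and its arguments are hypothetical, and windowing and overlap-add (needed for short-time processing of a real signal) are left out.

```python
import numpy as np

def spectral_subtract(noisy_frame, noise_mag, a=2.0):
    """Generalized spectral subtraction on one frame (illustrative sketch).

    noisy_frame: real-valued time-domain samples.
    noise_mag:   estimated noise magnitude spectrum, one value per rfft bin.
    a:           exponent; a=2 gives power subtraction, a=1 amplitude subtraction.
    """
    noisy_frame = np.asarray(noisy_frame, dtype=float)
    Y = np.fft.rfft(noisy_frame)
    # Subtract in the |.|^a domain.
    diff = np.abs(Y) ** a - np.asarray(noise_mag, dtype=float) ** a
    # Half-wave rectification: negative results are set to zero.
    diff = np.maximum(diff, 0.0)
    enhanced_mag = diff ** (1.0 / a)
    # The noisy phase is reused unchanged.
    X = enhanced_mag * np.exp(1j * np.angle(Y))
    return np.fft.irfft(X, n=len(noisy_frame))
```

With a zero noise estimate the frame passes through unchanged; with a noise estimate far above the signal level every bin is rectified to zero.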

9
Comments on Spectral Subtraction methods

Low complexity
Severe musical noise
Usually need further enhancement
- Smoothing in time and frequency
- Rectification

Amplitude Spectral Subtraction
Power Spectral Subtraction
Noisy speech sample (0 dB)
10
Comments on Spectral Subtraction methods

Oversuppressing and smoothing can reduce residual
noise but result in distortion to the speech
spectrum.
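A minimal sketch of these two remedies is given below. The over-suppression factor `beta`, the spectral floor `floor`, and the smoothing constant `alpha` are illustrative parameters that do not come from the slides; the floor in particular is an assumed substitute for plain rectification.

```python
import numpy as np

def oversubtract(noisy_power, noise_power, beta=2.0, floor=0.01):
    """Power subtraction with over-suppression (beta > 1) and a spectral floor."""
    noisy_power = np.asarray(noisy_power, dtype=float)
    noise_power = np.asarray(noise_power, dtype=float)
    cleaned = noisy_power - beta * noise_power
    # Instead of clipping at zero, keep a small fraction of the noisy power.
    return np.maximum(cleaned, floor * noisy_power)

def smooth_in_time(power_frames, alpha=0.7):
    """First-order recursive smoothing along the frame axis (axis 0)."""
    frames = np.asarray(power_frames, dtype=float)
    out = np.empty_like(frames)
    out[0] = frames[0]
    for m in range(1, len(frames)):
        out[m] = alpha * out[m - 1] + (1.0 - alpha) * frames[m]
    return out
```

Larger `beta` and `alpha` remove more residual noise, but both also change the speech spectrum, which is the distortion trade-off described above.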

Samples: oversuppressing ASS; oversuppressing PSS; smoothing in time
11
Wiener Filter
12
Wiener Filtering

The non-causal infinite impulse response Wiener
filter (hereafter Non-causal IIR Wiener
Filter) is regarded as a spectral domain method,
although the filtering problem originates in the time
domain.
The non-causal IIR Wiener filter with AR modeling of
speech can be employed in an iterative manner, such
that signal estimation and parameter estimation
each use the latest result of the other.

13
Non-causal IIR Wiener Filter

A linear Minimum Mean Squared Error Filter
Orthogonality principle

Wiener-Hopf equation
14
Non-causal Wiener Filter

Orthogonality principle (frequency domain)
Transfer function
MSE of the Wiener filter
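For reference, the standard textbook forms of these three results are sketched below for additive noise uncorrelated with the speech. The symbols are assumptions rather than the slide's own notation: $P_x$ and $P_d$ are the speech and noise PSDs, $P_y = P_x + P_d$, and $H$ is the filter.

```latex
\begin{aligned}
\text{Orthogonality:}\quad
  & E\{(x(t) - \hat{x}(t))\, y(t-\tau)\} = 0 \ \ \forall \tau
    \;\Longleftrightarrow\;
    P_{xy}(\omega) = H(\omega)\, P_y(\omega) \\
\text{Transfer function:}\quad
  & H(\omega) = \frac{P_x(\omega)}{P_x(\omega) + P_d(\omega)} \\
\text{MSE:}\quad
  & E\{(x(t) - \hat{x}(t))^2\}
    = \frac{1}{2\pi} \int_{-\pi}^{\pi}
      \frac{P_x(\omega)\, P_d(\omega)}{P_x(\omega) + P_d(\omega)}\, d\omega
\end{aligned}
```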

15
Comments on Non-causal WF

Requires estimates of the power spectra of speech
and noise.
Performance depends very much on the estimates of
the speech and noise spectra.
The WF over-suppresses the speech spectrum, which results in a
muffling effect.
The WF does not process the phase spectrum.

16
Comments on Non-causal WF

Muffling effect caused by over-suppression

Blue: original. Black: Wiener filter. Green:
square-root Wiener filter.
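To make the difference in suppression concrete, the sketch below compares the two gains as functions of the SNR `xi` (speech PSD divided by noise PSD in a bin). The function names are hypothetical, and the square-root variant is taken here to be the square root of the Wiener gain, which is an assumption about what the figure plots.

```python
import math

def wiener_gain(xi):
    """Wiener gain for a bin whose speech-to-noise PSD ratio is xi."""
    return xi / (1.0 + xi)

def sqrt_wiener_gain(xi):
    """Square-root Wiener gain; never smaller than the Wiener gain."""
    return math.sqrt(xi / (1.0 + xi))

def to_db(gain):
    """Amplitude gain expressed in decibels."""
    return 20.0 * math.log10(gain)

if __name__ == "__main__":
    for snr_db in (-10, 0, 10):
        xi = 10.0 ** (snr_db / 10.0)
        print(snr_db, round(to_db(wiener_gain(xi)), 1),
              round(to_db(sqrt_wiener_gain(xi)), 1))
```

Because the gain lies between 0 and 1, taking its square root halves the attenuation in dB, which is one way to reduce the muffling at low SNR.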
17
Comments on Non-causal WF

Roughness caused by phase noise
The phase spectrum is not processed, which results in
a loss of phase coherence in the voiced speech. The
effect is called roughness or reverberance.
Samples of muffling and roughness

Samples: clean; muffling; roughness; muffling and
roughness
18
Iterative Wiener Filtering

A parametric method using an all-pole model
A sequential MAP estimator of both speech
waveform and LP coefficients.
Lim, Oppenheim 1978

19
Iterative Wiener Filtering

All-pole modeling of speech
- The speech amplitude spectrum can be well modeled
by an all-pole transfer function (the vocal
tract) excited by white noise or a pulse train (the
glottal pulses). The coefficients of the all-pole
model can be found by Linear Prediction
analysis and are therefore called LP coefficients; the
excitation is called the residual.
- The LP model is minimum phase, which is
generally not the true phase of the vocal tract.

20
Iterative Wiener Filtering

The algorithm
Estimate the LP coef. from the noisy
observation samples. Estimate the noise spectrum
during non-speech activity.
Estimate the signal using the non-causal IIR WF
given the current estimate of LP coef. and
current estimate of the noise spectrum.
Estimate the LP coef. again given the current
estimate of the waveform.
Iterate until the convergence criterion is
satisfied.
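The loop above can be sketched for a single frame as follows. This is an illustrative simplification and not the Lim and Oppenheim implementation: the names `lpc` and `iterative_wiener` are hypothetical, a fixed iteration count stands in for the convergence criterion, and `noise_psd` is assumed to be given in the same units as the periodogram divided by the frame length.

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LP analysis. Returns (a, err), a[0] == 1."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:n + order] / n
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * r[0] * np.eye(order)          # light regularization
    coef = np.linalg.solve(R, -r[1:order + 1])
    err = r[0] + np.dot(coef, r[1:order + 1])  # prediction error power
    return np.concatenate(([1.0], coef)), max(err, 1e-12)

def iterative_wiener(noisy, noise_psd, order=10, n_iter=3):
    """Alternate LP estimation and non-causal Wiener filtering on one frame."""
    noisy = np.asarray(noisy, dtype=float)
    n = len(noisy)
    Y = np.fft.rfft(noisy)
    estimate = noisy.copy()                    # step 1 starts from the noisy frame
    for _ in range(n_iter):
        a, err = lpc(estimate, order)          # LP coefficients from current estimate
        A = np.fft.rfft(a, n)
        speech_psd = err / np.abs(A) ** 2      # all-pole model spectrum
        H = speech_psd / (speech_psd + noise_psd)
        estimate = np.fft.irfft(H * Y, n)      # zero-phase filtering of the noisy frame
    return estimate
```

Filtering by multiplying the DFT implies circular convolution, so frame-edge effects are ignored in this sketch.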

21
Iterative Wiener Filtering

Comments
Convergence is not guaranteed; a heuristic stopping
criterion is needed
Results in unrealistically sharp formants and
pole jittering
Suffers from musical noise
Needs some kind of smoothing

10 dB noisy sample
Iterative WF
Iterative WF with smoothing
22
Further enhancement to IWF

Constrained IWF (Hansen and Clements, 1987)
Apply spectral constraints inter-frame and
intra-frame using the LSP transformation.
Pole-zero modeling (Flanagan, 1972)
Replace WF with Kalman filtering (Gibson, 1991)
Vector quantization method (Gibson, 1988)
Use HMM (Ephraim, 1988)

23
Phase issues

The majority of the noise reduction methods only
process amplitude spectrum, while the noisy phase
spectrum is left unprocessed.
The reasons are
- Human ears are less sensitive to phase than to
the amplitude spectrum.
- Masking of amplitude to phase (6dB/0.6rad
threshold).

24
MMSE methods
25
MMSE approaches to speech enhancement

Wiener filtering
MMSE amplitude spectrum estimator
MMSE log-amplitude spectrum estimator
Non-Gaussian prior MMSE approaches
MMSE approaches are the dominant technique because they
perform better than the Spectral Subtraction
methods.
They need a priori information about the speech and noise
(i.e., covariance matrices or PSD).

26
MMSE amplitude spectrum estimator (Ephraim-Malah
filter)

Ephraim and Malah, 1984
The basis of the noise reduction function of
the MELPe coding standard
Consists of two parts: the Decision-Directed
method for estimating the speech spectrum, and the
MMSE Short-Time Spectral Amplitude (STSA)
estimator

27
MMSE STSA estimator

Assumptions
Stationary additive Gaussian noise with known
PSD.
An estimate of the speech spectrum is available.
Spectral components (DFT coefficients) are
statistically independent and each follows
Gaussian distribution (the DFT amplitude follows
Rayleigh distribution).
The DFT phase follows uniform distribution and is
independent of the amplitude.

The signal model

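Let $Y_k = R_k e^{j\vartheta_k}$, $X_k = A_k e^{j\alpha_k}$,
and $D_k$ denote the kth spectral component of
the noisy observation y(t), the signal x(t), and
the noise d(t).
28
MMSE STSA estimator
With the following PDFs
$p(Y_k \mid a_k, \alpha_k) = \frac{1}{\pi \lambda_d(k)} \exp\left(-\frac{|Y_k - a_k e^{j\alpha_k}|^2}{\lambda_d(k)}\right)$,
$p(a_k, \alpha_k) = \frac{a_k}{\pi \lambda_x(k)} \exp\left(-\frac{a_k^2}{\lambda_x(k)}\right)$
and Bayes rule, the estimator can be shown
to be
$\hat{A}_k = \Gamma(1.5)\, \frac{\sqrt{v_k}}{\gamma_k}\, \exp\left(-\frac{v_k}{2}\right) \left[(1 + v_k)\, I_0\left(\frac{v_k}{2}\right) + v_k\, I_1\left(\frac{v_k}{2}\right)\right] R_k$
Where $I_0$ and $I_1$ denote the modified
Bessel functions of zero and first order, and
$v_k$ is defined by
$v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k$
29
MMSE STSA estimator
Where $\xi_k$ and $\gamma_k$ are defined by
$\xi_k = \frac{\lambda_x(k)}{\lambda_d(k)}$, $\gamma_k = \frac{R_k^2}{\lambda_d(k)}$
Where $\lambda_x(k) = E\{|X_k|^2\}$ and $\lambda_d(k) = E\{|D_k|^2\}$.
$\xi_k$ and $\gamma_k$ are interpreted as the a priori
and a posteriori signal-to-noise ratio
respectively. $\xi_k$ is estimated by the
Decision-Directed method.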
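The gain implied by the Ephraim-Malah estimator can be evaluated numerically as sketched below, with `xi` the a priori SNR and `gamma` the a posteriori SNR of one bin. The helper names are hypothetical. To avoid depending on a special-function library, the exponentially scaled Bessel functions are computed from the integral representation $I_n(x) = \frac{1}{\pi}\int_0^{\pi} e^{x\cos\theta}\cos(n\theta)\,d\theta$ with a plain trapezoid rule, which is a convenience choice rather than part of the method.

```python
import math
import numpy as np

def scaled_bessel_i(order, x, num=4001):
    """exp(-x) * I_order(x) for integer order, by trapezoid integration."""
    theta = np.linspace(0.0, np.pi, num)
    f = np.exp(x * (np.cos(theta) - 1.0)) * np.cos(order * theta)
    step = theta[1] - theta[0]
    return float((f.sum() - 0.5 * (f[0] + f[-1])) * step / np.pi)

def stsa_gain(xi, gamma):
    """MMSE STSA gain, i.e. estimated amplitude divided by noisy amplitude."""
    v = xi / (1.0 + xi) * gamma
    i0 = scaled_bessel_i(0, v / 2.0)   # exp(-v/2) * I0(v/2)
    i1 = scaled_bessel_i(1, v / 2.0)   # exp(-v/2) * I1(v/2)
    return math.gamma(1.5) * math.sqrt(v) / gamma * ((1.0 + v) * i0 + v * i1)

def wiener_gain(xi):
    """Wiener gain for comparison; depends on the a priori SNR only."""
    return xi / (1.0 + xi)
```

At high a priori SNR the value of `stsa_gain` approaches `wiener_gain`, while at low a priori SNR it is larger and depends on `gamma`.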
30
Decision-Directed method

An estimate of the a priori SNR.
A combination of Power Spectrum Subtraction, half
wave rectification and inter-frame smoothing.
The smoothing factor $\alpha$ is usually chosen to be 0.98 in order to get
the best smoothing performance. The higher
$\alpha$ is, the less musical noise, but the more
distortion to the speech.
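In the form commonly quoted from Ephraim and Malah, the estimate for one bin combines the previous frame's enhanced power with the rectified power-subtraction estimate of the current frame. The sketch below follows that form; the function and argument names are hypothetical.

```python
def decision_directed(prev_amp_sq, noise_psd, gamma, alpha=0.98):
    """Decision-directed a priori SNR estimate for one frequency bin.

    prev_amp_sq: squared enhanced amplitude of this bin in the previous frame.
    noise_psd:   noise power estimate for this bin.
    gamma:       a posteriori SNR of the current frame (noisy power / noise power).
    alpha:       inter-frame smoothing factor.
    """
    # gamma - 1 is the power-subtraction SNR estimate; max(., 0) is the
    # half-wave rectification; alpha weights the previous frame.
    instantaneous = max(gamma - 1.0, 0.0)
    return alpha * prev_amp_sq / noise_psd + (1.0 - alpha) * instantaneous
```

With `alpha = 0.98` a single noisy bin in an otherwise silent stretch contributes only 2% of its instantaneous SNR, which is why isolated spectral peaks (musical noise) are suppressed.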

31
Comments on the MMSE STSA estimator

Comparison of the suppression gains of Wiener
filter and MMSE STSA

The instantaneous SNR can be interpreted as the a
priori SNR estimated without smoothing.
WF gains do not vary with the instantaneous SNR;
they vary only with the a priori SNR, whereas the MMSE
STSA gains vary with both the instantaneous SNR and the a
priori SNR.
When the a priori SNR is high, the MMSE STSA
estimator has gain curves very close to those of the WF.
When the a priori SNR is low, the MMSE STSA shows
a higher gain, which is strongly affected by the
instantaneous SNR.

32
Comments on the MMSE STSA estimator

A comparison of the suppression gains of PSS, WF
and MMSE STSA estimator

Figure: suppression gain versus estimated a priori SNR.
Curves: the MMSE STSA, where Rpost denotes the a priori SNR
estimated without smoothing (the instantaneous
SNR); solid line: power subtraction; dashed
line: Wiener filter.
33
Comments on the MMSE STSA estimator

The gain curve transitions smoothly between the power
subtraction curve and the Wiener curve. This
transition is controlled by the un-smoothed estimate
of the a priori SNR (Rpost). The larger Rpost, the
stronger the attenuation.
This counter-intuitive behavior manages to
flatten the spurious spectral peaks caused by the
noise in the low SNR part of the spectrum, whereas the
WF tends to sharpen the spurious peaks in the low
SNR part of the spectrum.
The phase of the noisy speech is used as the
phase of the enhanced speech, because of the
assumption of uniformly distributed phase. An
independent MMSE estimate of the phasor has
nonunity modulus and thus cannot be combined with
the MMSE STSA.
Suffers less from musical noise than the WF.

34
MMSE Log-Spectral Amplitude Estimator

A modification to the MMSE STSA based on the fact
that a distortion measure based on the
mean-square error of the log-spectra is more
suitable for speech processing.
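Minimize the distortion measure
$E\{(\log A_k - \log \hat{A}_k)^2\}$
The MMSE LSA estimator can be shown to be
$\hat{A}_k = \frac{\xi_k}{1 + \xi_k} \exp\left(\frac{1}{2} \int_{v_k}^{\infty} \frac{e^{-t}}{t}\, dt\right) R_k$
where $v_k = \frac{\xi_k}{1 + \xi_k}\, \gamma_k$, $R_k$ is the noisy spectral amplitude, and
$\xi_k$ and $\gamma_k$ are the a priori SNR and the a
posteriori SNR as defined before.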

35
MMSE Log-Spectral Amplitude Estimator

Comparison of the suppression gains of MMSE STSA
and MMSE LSA

- The gain curves of MMSE LSA are always lower
than those of MMSE STSA, resulting in lower
residual noise.
- When the a priori SNR is high,
the gain curve of MMSE LSA is very flat, which is
similar to the Wiener filter. When the a priori SNR
is low, the gain curve of the MMSE LSA varies
w.r.t. the instantaneous SNR as the MMSE STSA
does.
Decision-Directed Wiener Filter
MMSE LSA
Noisy sample (0 dB)
36
MMSE estimator with non-Gaussian prior
How well does Gaussian model fit the real
probability distribution of DFT coefficients?
Histogram of speech DFT amplitude.
Histogram of noise (recorded from market place)
DFT amplitude.
The histograms are taken from one hour of speech
37
MMSE estimator with non-Gaussian prior

The probability density function of the DFT
coefficients of speech can be better modeled by
super-Gaussian functions (e.g. Gamma or Laplace)
than by the Gaussian function (Rainer Martin, 2002,
2003).
An even more exact probability density function
is one tailored to fit the shape of the
histogram of the DFT coefficients (Lotter and Vary,
2003).
Using these density functions in place of the
Gaussian density function (for the speech or noise
processes) in the MMSE estimator can result in
better noise reduction.
The non-Gaussian prior MMSE estimator is nonlinear and
non-zero-phase.

38
MMSE estimator with non-Gaussian prior

Compared with WF
Better output SNR (Gaussian/Gamma)
Less musical noise (Laplace/Gamma)
Less distortion to the speech

39
Noise PSD estimation
Sørensen and Andersen, EURASIP J. Applied Signal
Processing, 2005
40
Noise PSD estimation

Most conventional speech enhancement methods rely
on good estimates of noise PSD.
When the noise is non-stationary, online tracking
of the noise PSD is needed.

41
Methods

Minimum statistics (Martin, 2001)
Minima Controlled Recursive Averaging (Cohen and
Berdugo, 2002)
Connected Region Speech Presence Detection
(CRSPD) (Sørensen and Andersen, 2005)
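To give a feel for online noise tracking, the sketch below follows the basic idea behind minimum-based trackers: smooth the periodogram over time, then take the minimum over a sliding window of past frames, on the assumption that speech is absent in each bin at some point within the window. It is a deliberately simplified illustration and not Martin's minimum statistics algorithm (no optimal smoothing and no bias compensation); the names and default values are hypothetical.

```python
import numpy as np

def track_noise_min(periodograms, alpha=0.85, window=50):
    """Crude noise PSD tracker: sliding-window minimum of a smoothed periodogram.

    periodograms: array of shape (frames, bins) holding |Y|^2 per frame.
    Returns an array of the same shape with a noise PSD estimate per frame.
    """
    P = np.asarray(periodograms, dtype=float)
    smoothed = np.empty_like(P)
    smoothed[0] = P[0]
    for m in range(1, len(P)):
        # Recursive smoothing over time, per frequency bin.
        smoothed[m] = alpha * smoothed[m - 1] + (1.0 - alpha) * P[m]
    noise = np.empty_like(P)
    for m in range(len(P)):
        start = max(0, m - window + 1)
        # Minimum over the most recent `window` smoothed frames.
        noise[m] = smoothed[start:m + 1].min(axis=0)
    return noise
```

A short high-energy burst does not raise the estimate as long as the window still contains frames from before the burst. The minimum is a biased (too low) estimate of the mean noise power, which the published methods correct for.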

42
CRSPD
43
Highway traffic noise (5 dB)
44
Smoothed periodogram
45
(No Transcript)
46
Estimated noise periodogram and noise PSD
47
Enhanced spectrogram
48
Demos

Highway traffic noise (0 dB)
Noisy sample
MMSE-LSA
CRSPD

Interior car noise (0 dB)
Noisy sample
MMSE-LSA
CRSPD

49
MMSE joint estimator for amplitude and phase
spectra
C. Li, S. V. Andersen, Inter-Frequency
Dependency in MMSE Speech Enhancement,
NORSIG04
C. Li, S. V. Andersen, A Block-Based
Linear MMSE Noise Reduction with a High Temporal
Resolution Modeling of the Speech Excitation,
EURASIP Journal on Applied Signal Processing
50
Why MMSE joint estimator?

Phase is found to be of importance for noise
reduction of low SNR sources. However, an independently
optimal amplitude estimator and an optimal phase
estimator do not coexist.
Finite frame length and temporal power
localization introduce correlation between
spectral components. This correlation can be
exploited to improve the estimate of low SNR
frequency bins.
Time localization can be modeled with the joint
MMSE estimator, but cannot be modeled by the
frequency domain Wiener filter. Time localization
indicates how much the phase is linearly related.

51
Formulation
Signal model
$y = F S + v$
Where F is the inverse Fourier matrix, S is the
Fourier coefficients vector, and v is a white
Gaussian noise vector.
The MMSE estimator of S can be shown to be
$\hat{S} = C_S (C_S + C_V)^{-1} F^{-1} y$
$C_S$ and $C_V$ being the spectral covariance matrices
of the signal and the noise, respectively (need
to be estimated).
52
Estimating covariance matrix
Let 1/A(Z) denote the transfer function of the
all-pole model of speech, r denote the LPC
residual, and H denote the Toeplitz analysis
matrix consisting of the coefficients of A(Z), such that
$r = H s$
The covariance matrix of r can be written as a
diagonal matrix $C_r$ with the square of r as its
diagonal elements. Then the covariance matrices of
s and S can be written respectively as
$C_s = H^{-1} C_r H^{-T}$ and $C_S = F^{-1} C_s F^{-H}$
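The construction can be sketched numerically as below, under stated assumptions: the residual and LP coefficients are taken as given, the noise is white with known variance, and NumPy's unnormalized DFT plays the role of the inverse of the inverse Fourier matrix, so white noise of variance `noise_var` has spectral covariance `noise_var * n * I`. All function names are hypothetical.

```python
import numpy as np

def analysis_matrix(a, n):
    """Lower-triangular Toeplitz matrix H with r = H s, for a = [1, a1, ..., ap]."""
    H = np.zeros((n, n))
    for k, coef in enumerate(a):
        H += coef * np.eye(n, k=-k)    # coefficient a_k on the k-th subdiagonal
    return H

def spectral_covariance(a, residual):
    """Full spectral covariance of the speech from LP coefficients and residual."""
    residual = np.asarray(residual, dtype=float)
    n = len(residual)
    H_inv = np.linalg.inv(analysis_matrix(a, n))
    C_r = np.diag(residual ** 2)       # diagonal residual covariance
    C_s = H_inv @ C_r @ H_inv.T        # time-domain speech covariance
    W = np.fft.fft(np.eye(n))          # DFT matrix, S = W s
    return W @ C_s @ W.conj().T

def joint_mmse(y, C_S, noise_var):
    """Linear MMSE estimate of the spectrum with a full covariance matrix."""
    n = len(y)
    C_V = noise_var * n * np.eye(n)    # white noise in the DFT domain
    S_hat = C_S @ np.linalg.solve(C_S + C_V, np.fft.fft(y))
    return np.fft.ifft(S_hat).real     # back to the time domain
```

Because the full matrix is kept, the result equals the time-domain linear MMSE estimate computed from the time-domain speech covariance; forcing the spectral covariance to be diagonal would reduce it to a per-bin Wiener gain.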
53
Joint estimator vs. spectral amplitude MMSE
estimators

In the joint estimator, the spectral covariance
matrix is assumed to be a full matrix, while
the Wiener filter and the MMSE LSA estimator assume
it is a diagonal matrix.
This allows the joint estimator to exploit the
correlation between frequency components, which
is ignored by the frequency domain MMSE
estimators.

54
Correlation of frequency components
The covariance matrix of the frequency components
55
Covariance in time and frequency
56
Estimated spectra
The TFE-MMSE estimator preserves the signal
spectrum better than the Wiener filter.
57
White Gaussian noise (10 dB)
58
Results

TFE-MMSE estimator
TFE-Kalman filtering
Compared to
WF
Noisy (10dB)

59
Iterative Kalman filtering
C. Li, S. V. Andersen, A Iterative Speech
Enhancement Scheme Based on Kalman Filtering,
EUSIPCO 2005
60
Motivation

Kalman filter is a time domain MMSE estimator,
which is a joint amplitude-phase estimator when
the non-stationarity of the signal (or the system
noise) is faithfully represented.
We designed a non-stationary Kalman filter with
high temporal resolution modeling of the
excitation.
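A minimal sketch of a Kalman filter for an AR speech model with a time-varying excitation variance is given below. It illustrates only the general idea of letting the system noise variance change from sample to sample; it is not the iterative scheme of the cited paper, and the function name, the initialization, and the assumption that the AR coefficients and variances are already known are all illustrative.

```python
import numpy as np

def kalman_ar(noisy, a, excitation_var, noise_var):
    """Kalman filter for s[t] = -sum_k a[k] s[t-k] + e[t],  y[t] = s[t] + v[t].

    a:              AR coefficients [a1, ..., ap] (without the leading 1).
    excitation_var: per-sample variance of e[t] (non-stationary system noise).
    noise_var:      variance of the white observation noise v[t].
    """
    a = np.asarray(a, dtype=float)
    p = len(a)
    F = np.zeros((p, p))
    F[0, :] = -a                        # companion-form transition matrix
    F[1:, :-1] = np.eye(p - 1)
    x = np.zeros(p)                     # state: the p most recent speech samples
    P = np.eye(p) * (excitation_var[0] + noise_var)   # assumed initial uncertainty
    out = np.empty(len(noisy))
    for t, y in enumerate(noisy):
        # Prediction step with the excitation variance of this sample.
        x = F @ x
        P = F @ P @ F.T
        P[0, 0] += excitation_var[t]
        # Update step; the observation picks out the first state element.
        gain = P[:, 0] / (P[0, 0] + noise_var)
        x = x + gain * (y - x[0])
        P = P - np.outer(gain, P[0, :])
        out[t] = x[0]
    return out
```

Passing a constant `excitation_var` gives the conventional stationary filter; a variance that follows the glottal pulses concentrates the filter's trust in the model between pulses.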

61
Two steps each iteration
62
The algorithm

Speech spectrum is estimated by the WPSS and LPC
block.
Speech excitation is estimated by the WPSS and
PEKF block.
Need only one iteration. Iterations are done
sequentially, exploiting the correlation between
consecutive spectra.

63
Results
64
Demo

White Gaussian noise (10 dB)
Noisy sample
Conventional Kalman filtering
Iterative Wiener filtering (EM)
Proposed IKF

65
Blind system identification for non-Gaussian
speech analysis
C. Li, S. V. Andersen, Efficient blind system
identification of non-Gaussian Autoregressive
models with HMM modeling of the excitation,
IEEE trans. Signal Processing 2006, submitted.
66
Motivation

Non-Gaussian model for voiced speech
Estimate the excitation model, vocal tract model
parameters, and the noise variance jointly.
Blind identification: the only known quantity is the noisy
observation.

67
E-HMARM model
68
Estimated spectra
Input SNR is 15 dB, white Gaussian noise.
69
Convergence
70
What is the significance of the E-HMARM?

A non-Gaussian speech analysis method (LPC is
not)
A noise robust speech analysis method (LPC is
not)
A blind deconvolution of the excitation from the
vocal tract filter
Potential in single channel source separation
(due to the non-Gaussian model)

71
Future work

Single channel BSS
Gaussian AR and Gaussian AR
Non-Gaussian AR and Gaussian AR
Non-Gaussian AR and non-Gaussian AR
More comprehensive non-Gaussian non-stationary
speech models
Speech manipulation
Speech interpolation
