Noise Compensation for Speech Recognition with Arbitrary Additive Noise
Ji Ming
School of Computer Science, Queen's University Belfast, Belfast BT7 1NN, UK
IEEE Trans. on Audio, Speech, and Language Processing, Vol. 14, No. 3, May 2006
Introduction
- Speech recognition performance is known to degrade dramatically when a mismatch occurs between training and testing conditions
- Traditional approaches for removing the mismatch, and thereby reducing the effect of noise on recognition, include
  - Removing the noise from the test signal
    - Noise filtering or speech enhancement: spectral subtraction, Wiener filtering, RASTA filtering
    - Assumes the availability of a priori knowledge about the noise
  - Constructing a new acoustic model to match the test environment
    - Noise or environment compensation: model adaptation, parallel model combination (PMC), multi-condition training, SPLICE
    - Real-world noisy training data is needed
- More recent studies focus on methods requiring less prior knowledge, since such knowledge can be difficult to obtain in real-world applications
Introduction (cont.)
- This paper investigates noise compensation for speech recognition
  - Involving additive noise of any corruption type (e.g. full, partial, stationary, or time-varying)
  - Assuming no knowledge of the noise characteristics and no training data from the noisy environment
- This paper proposes a method that focuses recognition only on reliable features, yet remains robust to full noise corruption affecting all time-frequency components of the speech representation
  - Combining artificial noise compensation with the missing-feature method, to accommodate mismatches between the simulated noise condition and the actual noise condition
  - This makes it possible to accommodate sophisticated spectral distortion, e.g. full, partial, white, colored, or none
  - Based only on clean speech training data and simulated noise data
- The method is named Universal Compensation (UC)
Methodology
- The UC method comprises three steps
  - Construct a set of models for short-time speech spectra using artificial multi-condition speech data, generated by corrupting the clean training data with artificial wide-band flat-spectrum noise at consecutive SNRs
  - Given a test spectrum, search for the spectral components in each model spectrum that best match the corresponding spectral components in the test spectrum, and produce a score based on the matched components for each model spectrum
  - Combine the scores from the individual model spectra to form an overall score for recognition
Methodology (cont.)
- Step 1: generate the noise by passing white noise through a low-pass filter
- Step 2: calculate a score for each model spectrum based only on the matched spectral components
- Step 3: combine the individual scores from the model spectra to produce an overall score

(Figure: a clean training spectrum corrupted by artificial wide-band flat-spectrum noise, compared against a noisy test spectrum)
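The multi-condition data generation of Step 1 can be sketched as follows. This is a minimal illustration, not the paper's exact setup: the brick-wall frequency-domain filter, 8-kHz sampling rate, and sine-wave stand-in utterance are assumptions (the paper low-pass filters white noise to a 3-dB bandwidth of 3.5 kHz).

```python
import numpy as np

def flat_spectrum_noise(n_samples, fs=8000, cutoff_hz=3500, rng=None):
    """White Gaussian noise, low-pass filtered to approximate a
    wide-band flat-spectrum noise (brick-wall filter is a sketch)."""
    rng = np.random.default_rng(rng)
    white = rng.standard_normal(n_samples)
    spec = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spec[freqs > cutoff_hz] = 0.0          # zero everything above the cutoff
    return np.fft.irfft(spec, n=n_samples)

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR in dB."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Expand a clean utterance into a multi-condition set at consecutive SNRs.
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # stand-in utterance
noisy_versions = {snr: add_noise_at_snr(clean, flat_spectrum_noise(len(clean)), snr)
                  for snr in range(20, 0, -2)}  # 20 dB down to 2 dB, 2-dB steps
```

The model set is then trained on the clean data plus every noisy version, so that each SNR level contributes its own model spectra.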
Methodology (cont.)
- A key to the success of the UC method is the accuracy of converting a full-band corruption into a partial-band corruption
- This accuracy is determined by two factors
  - The frequency-band resolution, which determines the bandwidth of each spectral component
    - The smaller the bandwidth, the more accurate the approximation of an arbitrary noise spectrum by piecewise flat spectra
    - But smaller bandwidths usually result in a loss of correlation between the spectral components, and thus poorer phonetic discrimination
    - An optimum frequency-band subdivision, in terms of a good balance between noise spectral resolution and phonetic discrimination, remains a topic for study
  - The amplitude resolution, which refers to the number of steps used to quantize the SNR
    - The finer the quantization steps, the more accurate the approximation of any given level of noise
    - But the use of a large number of SNRs may result in low computational efficiency
Formulation — A. Model and Training Algorithms
- Assume that each training frame is represented by a spectral vector x = (x_1, ..., x_B) consisting of B sub-band spectral components
- Assume that L levels of SNR are used to generate the wide-band flat-spectrum noise that forms the noisy training data
- Let p(x | s, l) represent a model spectrum, expressed as the probability distribution of the model spectral vector x, associated with speech state s and trained on SNR level l
- Let y = (y_1, ..., y_B) be a test spectral vector
- Recognition involves classifying each test spectrum y into an appropriate speech state s, based on the probabilities of the test spectrum associated with the individual model spectra within the state
  - Computing the probability p(y | s, l) for each model spectrum
  - Only the matched spectral components are retained; the mismatched components are ignored
Formulation (cont.) — A. Model and Training Algorithms
- The probability p(y | s, l) can be approximated by p(y_sub | s, l), the marginal distribution of the matched subset y_sub obtained from p(y | s, l) with the mismatched spectral components in y ignored, to improve mismatch robustness
- Given p(y_sub | s, l) for each model spectrum, the overall probability of y, associated with speech state s, can be obtained by combining over all the different SNRs:

  p(y | s) = sum_{l=1..L} c_{s,l} p(y_sub | s, l)   (1)

  where c_{s,l} is the mixture weight of SNR level l within state s
- For simplicity, assume that the individual spectral components are independent of one another, so the probability of any subset can be written as

  p(y_sub | s, l) = prod_{b in sub} p(y_b | s, l)   (2)
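A minimal sketch of (1) and (2), assuming diagonal-Gaussian model spectra; the Gaussian form and the log-domain combination are illustrative choices, not the paper's exact parameterization:

```python
import numpy as np

def log_gauss(y_b, mean, var):
    """Log density of one sub-band component under a 1-D Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (y_b - mean) ** 2 / var)

def log_p_subset(y, mean, var, subset):
    """Eq. (2): with independent components, the marginal of a matched
    subset is the product of its per-component densities (a sum in logs)."""
    return sum(log_gauss(y[b], mean[b], var[b]) for b in subset)

def log_p_state(y, means, variances, weights, subsets):
    """Eq. (1): combine the per-SNR subset scores over all levels l."""
    scores = [np.log(weights[l]) + log_p_subset(y, means[l], variances[l], subsets[l])
              for l in range(len(weights))]
    return np.logaddexp.reduce(scores)
```

Marginalizing to a subset in (2) costs nothing extra here: under independence, the marginal is obtained simply by dropping the mismatched factors from the product.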
Formulation (cont.) — A. Model and Training Algorithms
- The model spectrum may be constructed in two different ways
  - First, we may estimate each p(x | s, l) explicitly by using the training data corresponding to a specific SNR
  - Alternatively, we may build the model by pooling the training data from all SNR conditions together, and training the model as a usual mixture model on the mixed data set (more flexible)
    - The EM algorithm decides the association between the data and the mixture components and weights
Formulation (cont.) — B. Recognition Algorithm
- Given a test spectral vector y, compute the mixture probability in (1) using only a subset of the data for each of the mixture densities
  - This reduces the effect of mismatched noisy spectral components
- But we need to decide the matched subset y_sub that contains all the matched components for each model spectrum
- If we can assume that the matched subset produces a large probability, then y_sub may be defined as the subset that maximizes the probability among all possible subsets of y
- However, (2) indicates that the values of p(y_sub | s, l) for different-sized subsets are of different orders of magnitude and are thus not directly comparable
  - An appropriate normalization of the probability is needed
- A possible solution is to replace the conditional probability of the test subset with the posterior probability of the model spectrum, which always produces a value in the range [0, 1]
Formulation (cont.) — B. Recognition Algorithm
- By maximizing the posterior probability P(s, l | y_sub), we should be able to obtain the subset for model spectrum (s, l) that contains all the matched components; the mismatched components outside the subset are "don't care". The optimum decision is the MAP criterion:

  y_sub(s, l) = argmax_{sub in y} P(s, l | y_sub)   (3)

  assuming an equal prior p(s) for all the states
- The optimized posterior probability can then be incorporated into an HMM to form the state-based emission probability
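The MAP subset search of (3) can be sketched as below for toy diagonal-Gaussian model spectra with equal priors. The exhaustive subset enumeration is only feasible for a handful of sub-bands and stands in for whatever search the full system uses:

```python
import numpy as np
from itertools import combinations

def gauss(y_b, mean, var):
    return np.exp(-0.5 * (y_b - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def subset_likelihood(y, mean, var, sub):
    """Eq. (2): product of the per-component densities over the subset."""
    return float(np.prod([gauss(y[b], mean[b], var[b]) for b in sub]))

def all_subsets(n):
    return [s for r in range(1, n + 1) for s in combinations(range(n), r)]

def map_subsets(y, models):
    """Eq. (3): for each model spectrum (s, l), pick the subset maximizing
    the posterior P(s, l | y_sub).  Normalizing over all models makes
    different-sized subsets comparable, with values in [0, 1].
    `models` maps (s, l) -> (mean, var) of a diagonal Gaussian (assumed)."""
    best = {k: (0.0, None) for k in models}
    for sub in all_subsets(len(y)):
        likes = {k: subset_likelihood(y, m, v, sub) for k, (m, v) in models.items()}
        total = sum(likes.values())       # equal priors, as on the slide
        for k, lk in likes.items():
            post = lk / total
            if post > best[k][0]:
                best[k] = (post, sub)
    return best
```

Each model's best posterior then serves as its score; an HMM decoder would use these as the state emission probabilities.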
Experimental Evaluation — A. Databases
- Two databases are used to evaluate the performance of the UC method
  - The first database is Aurora 2, for speaker-independent recognition of digit sequences in noisy conditions
  - The second database contains the highly confusing E-set words, used as an example to further examine the ability of the new UC model to deal with acoustically confusing recognition tasks
    - The E-set words are b, c, d, e, g, p, t, v
Experimental Evaluation (cont.) — Acoustic Modeling for Aurora 2
- The performance of the UC model is compared with the performances of four baseline systems
  - The first is trained on the clean training set: 3 mixtures per state for the digits, 6 mixtures per state for the silence
  - The second is trained on the multi-condition training set: 3 mixtures per state for the digits, 6 mixtures per state for the silence
  - The third is an improved model corresponding to the complex back-end model: 20 mixtures per state for the digits, 36 mixtures per state for the silence
  - The fourth uses 32 mixtures for all the states, and thus has the same model complexity as the UC model
- The UC model is trained using only the clean training set, expanded by adding wide-band flat-spectrum noise to each of the utterances
  - 10 different SNR levels, from 20 dB down to 2 dB in 2-dB steps
  - The wide-band flat-spectrum noise is computer-generated white noise filtered by a low-pass filter with a 3-dB bandwidth of 3.5 kHz
Experimental Evaluation (cont.) — Acoustic Modeling for Aurora 2
- The speech is divided into frames of 25 ms at a frame rate of 10 ms
- For each frame
  - A 13-channel mel filter bank is applied to obtain 13 log filter-bank amplitudes
  - These 13 amplitudes are then decorrelated by a high-pass filter, resulting in 12 decorrelated log filter-bank amplitudes
- The bandwidth of a subband can be increased conveniently by grouping neighboring subband components together to form a new subband component; for example, the 12 amplitudes can be expressed as a 6-subband spectral vector with two components per subband
- In this paper, each feature vector consists of 18 components: 6 static subband spectra, 6 delta subband spectra, and 6 delta-delta subband spectra
- The overall size of the feature vector for a frame is 18 × 2 = 36
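The subband grouping above can be sketched as follows. The first-difference decorrelating filter and the np.gradient-based deltas are assumptions for illustration, since the slide does not specify the exact filters:

```python
import numpy as np

def static_subbands(log_mel_13):
    """High-pass (difference) filter across the 13 log filter-bank
    amplitudes -> 12 decorrelated values, grouped into 6 subbands of 2."""
    decorrelated = np.diff(log_mel_13)   # shape (12,)
    return decorrelated.reshape(6, 2)    # 6 subbands x 2 components

def add_dynamics(static_seq):
    """Append delta and delta-delta subband spectra over time:
    6 + 6 + 6 = 18 subband components x 2 values = 36 features per frame."""
    delta = np.gradient(static_seq, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.concatenate([static_seq, delta, delta2], axis=1)

frames = np.stack([static_subbands(np.random.rand(13)) for _ in range(5)])
features = add_dynamics(frames)          # shape (5, 18, 2)
```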
Experimental Evaluation (cont.) — Tests on Aurora 2
- The table shows the recognition results for clean test data
- For the clean data, the best accuracy rates were obtained by the multi-condition baseline models with 20 and 32 mixtures per state
- The UC model performed on average slightly better than the multi-condition model with 3 mixtures per state
Experimental Evaluation (cont.) — Tests on Aurora 2
- The tables show the recognition results on test set A and test set B
- The UC model improved significantly over the baseline model trained on clean data, and achieved an average performance close to that obtained by the multi-condition model with three mixtures per state
- Car noise exhibits a less sophisticated spectral structure than babble noise, and thus may be more accurately matched by the piecewise flat spectra as implemented in the UC model
Experimental Evaluation (cont.) — Tests on Aurora 2
- The table shows the recognition results on test set C
- The channel mismatch problem can be handled by Multi-20 and Multi-32
- The UC model also showed a capability of coping with this mismatch; its performance is little affected by the channel mismatch
- The figure summarizes the average word accuracy results for the five systems
Experimental Evaluation (cont.) — Tests on Noise Unseen in Aurora 2
- The purpose of this study is to further investigate the capability of the UC model to offer robustness to a wide variety of noises
- Three additional noises are used: a polyphonic mobile phone ring, a pop song segment, and a broadcast news segment
- The spectral characteristics of the three noises are shown in the following figure

(Figure: spectra of the polyphonic mobile phone ring, the pop song segment, and the broadcast news segment)
Experimental Evaluation (cont.) — Tests on Noise Unseen in Aurora 2
- The UC model offered improved accuracy over all three baseline models
- The UC model produced particularly good results for the ringtone noise, because this noise causes mainly partial corruption over the speech frequency band
- The table also indicates that increasing the number of mixtures in the mismatched baseline model produced only a small improvement for the news noise and no improvement for the phone-ring noise
Experimental Evaluation (cont.) — Tests on Noise Unseen in Aurora 2
- The UC model, with a complexity similar to that of Multi-32, performed similarly to Multi-3 trained in matched conditions
- The UC model was able to outperform Multi-32 in the case of unknown/mismatched noise conditions
Experimental Evaluation (cont.) — Discrimination Study on an E-Set Database
- This experiment examines the ability of the UC model to discriminate between acoustically confusing words
  - While the model reduces the mismatch between training and testing conditions, does it also reduce the discrimination between utterances of different words?
- The authors experimented on a new database, containing the highly confusing E-set words (b, c, d, e, g, p, t, v), extracted from the Connex speaker-independent alphabetic database provided by British Telecom
  - It contains three repetitions of each word by 104 speakers: 53 male and 51 female
  - Of the 104 speakers, 52 are used for training and the other 52 for testing
  - For each word, about 156 utterances are available for training
  - A total of 1219 utterances are available for testing
  - Four different noises from Aurora 2 test set A are artificially added
- Two baseline HMMs are built
  - One with the clean training set (1 mixture per state)
  - The other with the multi-condition training set (11 mixtures per state)
Experimental Evaluation (cont.) — Discrimination Study on an E-Set Database
- For the clean E-set, the UC model achieved a recognition accuracy close to that of the baseline model, with only a small loss in accuracy (84.91% → 83.33%)
- For the given noise conditions, the UC model achieved an average performance close to that obtained by the multi-condition baseline model
Experimental Evaluation (cont.) — Discrimination Study on an E-Set Database
- Finally, the performance of the UC model was tested with different resolutions for quantizing the SNR
- Three different training sets were generated with increasing SNR resolution
  - Coarse quantization (6 mixtures per state): only five different SNRs, from 20 dB to 4 dB with a 4-dB step
  - Medium-resolution quantization (11 mixtures per state): ten different SNRs, from 20 dB to 2 dB with a 2-dB step
  - Fine quantization (21 mixtures per state): twenty different SNRs, from 20 dB to 2 dB with a 1-dB step
- Additionally, all three sets also include the clean training data
Experimental Evaluation (cont.) — Discrimination Study on an E-Set Database
- The two models with medium and fine quantization produced quite similar recognition accuracy in many test conditions
- The model with coarse quantization, trained with 6 SNRs, produced poorer results than the other two models, but still showed a significant performance improvement over the baseline model trained on clean data
Summary
- This paper investigated noise compensation for speech recognition, assuming no knowledge about the noise characteristics and no training data from the noisy environment
- Universal Compensation (UC) is proposed as a possible solution to the problem
  - The UC method involves a novel combination of the principle of multi-condition training and the principle of the missing-feature method
- Experiments on Aurora 2 have shown that the UC model has the potential to achieve recognition performance close to that of the multi-condition model without assuming knowledge of the noise
- Further experiments with noises unseen in Aurora 2 have indicated the ability of the UC model to offer robust performance for a wide variety of noises
- Finally, the experimental results on an E-set database have demonstrated the ability of the UC model to deal with acoustically confusing recognition tasks