Noise Compensation for Speech Recognition with Arbitrary Additive Noise
Ji Ming
School of Computer Science, Queen's University Belfast, Belfast BT7 1NN, UK
IEEE Trans. on Audio, Speech, and Language Processing, Vol. 14, No. 3, May 2006
Introduction
- Speech recognition performance is known to degrade dramatically when a mismatch occurs between training and testing conditions
- Traditional approaches for removing the mismatch, and thereby reducing the effect of noise on recognition, include
  - Removing the noise from the test signal
    - Noise filtering or speech enhancement: spectral subtraction, Wiener filtering, RASTA filtering
    - Assumes the availability of a priori knowledge about the noise
  - Constructing a new acoustic model to match the test environment
    - Noise or environment compensation: model adaptation, parallel model combination (PMC), multi-condition training, SPLICE
    - Real-world noisy training data is needed
- More recent studies focus on methods requiring less prior knowledge, since such knowledge can be difficult to obtain in real-world applications
Introduction (cont.)
- This paper investigates noise compensation for speech recognition
  - Involving additive noise of any corruption type (e.g. full, partial, stationary, or time-varying)
  - Assuming no knowledge of the noise characteristics and no training data from the noisy environment
- This paper proposes a method that focuses recognition only on reliable features, yet remains robust to full noise corruption affecting all time-frequency components of the speech representation
  - Combining artificial noise compensation with the missing-feature method, to accommodate mismatches between the simulated noise condition and the actual noise condition
  - This makes it possible to accommodate sophisticated spectral distortion, e.g. full, partial, white, colored, or none
  - Based only on clean speech training data and simulated noise data
- The method is named Universal Compensation (UC)
Methodology
- The UC method comprises three steps
  - Construct a set of models for short-time speech spectra using artificial multi-condition speech data, generated by corrupting the clean training data with artificial wide-band flat-spectrum noise at consecutive SNRs
  - Given a test spectrum, search for the spectral components in each model spectrum that best match the corresponding spectral components in the test spectrum, and produce a score based on the matched components for each model spectrum
  - Combine the scores from the individual model spectra to form an overall score for recognition
Methodology (cont.)
- Step 1: generate the noise by passing white noise through a low-pass filter
- Step 2: calculate a score for each model spectrum based only on the matched spectral components
- Step 3: combine the individual scores from the model spectra to produce an overall score

(Figure: a clean training spectrum corrupted by artificial wide-band flat-spectrum noise, compared against a noisy test spectrum)
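The multi-condition data generation of Step 1 can be sketched as follows. This is a minimal illustration, not the paper's exact setup: the brick-wall frequency-domain filter, 8-kHz sampling rate, and sine-wave stand-in utterance are assumptions (the paper low-pass filters white noise to a 3-dB bandwidth of 3.5 kHz).

```python
import numpy as np

def flat_spectrum_noise(n_samples, fs=8000, cutoff_hz=3500, rng=None):
    """White Gaussian noise, low-pass filtered to approximate a
    wide-band flat-spectrum noise (brick-wall filter is a sketch)."""
    rng = np.random.default_rng(rng)
    white = rng.standard_normal(n_samples)
    spec = np.fft.rfft(white)
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)
    spec[freqs > cutoff_hz] = 0.0          # zero everything above the cutoff
    return np.fft.irfft(spec, n=n_samples)

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so the mixture has the requested SNR in dB."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

# Expand a clean utterance into a multi-condition set at consecutive SNRs.
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # stand-in utterance
noisy_versions = {snr: add_noise_at_snr(clean, flat_spectrum_noise(len(clean)), snr)
                  for snr in range(20, 0, -2)}  # 20 dB down to 2 dB, 2-dB steps
```

The model set is then trained on the clean data plus every noisy version, so that each SNR level contributes its own model spectra.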
Methodology (cont.)
- A key to the success of the UC method is the accuracy of converting a full-band corruption into a partial-band corruption
- This accuracy is determined by two factors
  - The frequency-band resolution, which determines the bandwidth of each spectral component
    - The smaller the bandwidth, the more accurate the approximation of an arbitrary noise spectrum by piecewise flat spectra
    - But smaller bandwidths usually result in a loss of correlation between the spectral components, and thus poorer phonetic discrimination
    - An optimum frequency-band subdivision, in terms of a good balance between noise spectral resolution and phonetic discrimination, remains a topic for study
  - The amplitude resolution, which refers to the number of steps used to quantize the SNR
    - The finer the quantization steps, the more accurate the approximation of any given level of noise
    - But the use of a large number of SNRs may result in low computational efficiency
Formulation — A. Model and Training Algorithms
- Assume that each training frame is represented by a spectral vector x = (x_1, ..., x_B) consisting of B sub-band spectral components
- Assume that L levels of SNR are used to generate the wide-band flat-spectrum noise that forms the noisy training data
- Let p(x | s, l) represent a model spectrum, expressed as the probability distribution of the model spectral vector x, associated with speech state s and trained on SNR level l
- Let y = (y_1, ..., y_B) be a test spectral vector
- Recognition involves classifying each test spectrum y into an appropriate speech state s, based on the probabilities of the test spectrum associated with the individual model spectra within the state
  - Computing the probability p(y | s, l) for each model spectrum
  - Only the matched spectral components are retained; the mismatched components are ignored
Formulation (cont.) — A. Model and Training Algorithms
- The probability p(y | s, l) can be approximated by p(y_sub | s, l), the marginal distribution of the matched subset y_sub obtained from p(y | s, l) with the mismatched spectral components in y ignored, to improve mismatch robustness
- Given p(y_sub | s, l) for each model spectrum, the overall probability of y, associated with speech state s, can be obtained by combining over all the different SNRs:

  p(y | s) = sum_{l=1..L} c_{s,l} p(y_sub | s, l)   (1)

  where c_{s,l} is the mixture weight of SNR level l within state s
- For simplicity, assume that the individual spectral components are independent of one another, so the probability of any subset can be written as

  p(y_sub | s, l) = prod_{b in sub} p(y_b | s, l)   (2)
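A minimal sketch of (1) and (2), assuming diagonal-Gaussian model spectra; the Gaussian form and the log-domain combination are illustrative choices, not the paper's exact parameterization:

```python
import numpy as np

def log_gauss(y_b, mean, var):
    """Log density of one sub-band component under a 1-D Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (y_b - mean) ** 2 / var)

def log_p_subset(y, mean, var, subset):
    """Eq. (2): with independent components, the marginal of a matched
    subset is the product of its per-component densities (a sum in logs)."""
    return sum(log_gauss(y[b], mean[b], var[b]) for b in subset)

def log_p_state(y, means, variances, weights, subsets):
    """Eq. (1): combine the per-SNR subset scores over all levels l."""
    scores = [np.log(weights[l]) + log_p_subset(y, means[l], variances[l], subsets[l])
              for l in range(len(weights))]
    return np.logaddexp.reduce(scores)
```

Marginalizing to a subset in (2) costs nothing extra here: under independence, the marginal is obtained simply by dropping the mismatched factors from the product.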
Formulation (cont.) — A. Model and Training Algorithms
- The model spectrum may be constructed in two different ways
  - First, we may estimate each p(x | s, l) explicitly by using the training data corresponding to a specific SNR
  - Alternatively, we may build the model by pooling the training data from all SNR conditions together, and training the model as a usual mixture model on the mixed data set (more flexible)
    - The EM algorithm decides the association between the data and the mixture components and weights
Formulation (cont.) — B. Recognition Algorithm
- Given a test spectral vector y, compute the mixture probability in (1) using only a subset of the data for each of the mixture densities
  - This reduces the effect of mismatched noisy spectral components
- But we need to decide the matched subset y_sub that contains all the matched components for each model spectrum
- If we can assume that the matched subset produces a large probability, then y_sub may be defined as the subset that maximizes the probability among all possible subsets of y
- However, (2) indicates that the values of p(y_sub | s, l) for different-sized subsets are of different orders of magnitude and are thus not directly comparable
  - An appropriate normalization of the probability is needed
- A possible solution is to replace the conditional probability of the test subset with the posterior probability of the model spectrum, which always produces a value in the range [0, 1]
Formulation (cont.) — B. Recognition Algorithm
- By maximizing the posterior probability P(s, l | y_sub), we should be able to obtain the subset for model spectrum (s, l) that contains all the matched components; the mismatched components outside the subset are "don't care". The optimum decision is the MAP criterion:

  y_sub(s, l) = argmax_{sub in y} P(s, l | y_sub)   (3)

  assuming an equal prior p(s) for all the states
- The optimized posterior probability can then be incorporated into an HMM to form the state-based emission probability
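The MAP subset search of (3) can be sketched as below for toy diagonal-Gaussian model spectra with equal priors. The exhaustive subset enumeration is only feasible for a handful of sub-bands and stands in for whatever search the full system uses:

```python
import numpy as np
from itertools import combinations

def gauss(y_b, mean, var):
    return np.exp(-0.5 * (y_b - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def subset_likelihood(y, mean, var, sub):
    """Eq. (2): product of the per-component densities over the subset."""
    return float(np.prod([gauss(y[b], mean[b], var[b]) for b in sub]))

def all_subsets(n):
    return [s for r in range(1, n + 1) for s in combinations(range(n), r)]

def map_subsets(y, models):
    """Eq. (3): for each model spectrum (s, l), pick the subset maximizing
    the posterior P(s, l | y_sub).  Normalizing over all models makes
    different-sized subsets comparable, with values in [0, 1].
    `models` maps (s, l) -> (mean, var) of a diagonal Gaussian (assumed)."""
    best = {k: (0.0, None) for k in models}
    for sub in all_subsets(len(y)):
        likes = {k: subset_likelihood(y, m, v, sub) for k, (m, v) in models.items()}
        total = sum(likes.values())       # equal priors, as on the slide
        for k, lk in likes.items():
            post = lk / total
            if post > best[k][0]:
                best[k] = (post, sub)
    return best
```

Each model's best posterior then serves as its score; an HMM decoder would use these as the state emission probabilities.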
Experimental Evaluation — A. Databases
- Two databases are used to evaluate the performance of the UC method
  - The first database is Aurora 2, for speaker-independent recognition of digit sequences in noisy conditions
  - The second database contains the highly confusing E-set words, used as an example to further examine the ability of the new UC model to deal with acoustically confusing recognition tasks
    - The E-set words are b, c, d, e, g, p, t, v
Experimental Evaluation (cont.) — Acoustic Modeling for Aurora 2
- The performance of the UC model is compared with the performances of four baseline systems
  - The first is trained on the clean training set: 3 mixtures per state for the digits, 6 mixtures per state for the silence
  - The second is trained on the multi-condition training set: 3 mixtures per state for the digits, 6 mixtures per state for the silence
  - The third is an improved model corresponding to the complex back-end model: 20 mixtures per state for the digits, 36 mixtures per state for the silence
  - The fourth uses 32 mixtures for all the states, and thus has the same model complexity as the UC model
- The UC model is trained using only the clean training set, expanded by adding wide-band flat-spectrum noise to each of the utterances
  - 10 different SNR levels, from 20 dB down to 2 dB in 2-dB steps
  - The wide-band flat-spectrum noise is computer-generated white noise filtered by a low-pass filter with a 3-dB bandwidth of 3.5 kHz
Experimental Evaluation (cont.) — Acoustic Modeling for Aurora 2
- The speech is divided into frames of 25 ms at a frame rate of 10 ms
- For each frame
  - A 13-channel mel filter bank is applied to obtain 13 log filter-bank amplitudes
  - These 13 amplitudes are then decorrelated by a high-pass filter, resulting in 12 decorrelated log filter-bank amplitudes
- The bandwidth of a subband can be increased conveniently by grouping neighboring subband components together to form a new subband component; for example, the 12 amplitudes can be expressed as a 6-subband spectral vector with two components per subband
- In this paper, each feature vector consists of 18 components: 6 static subband spectra, 6 delta subband spectra, and 6 delta-delta subband spectra
- The overall size of the feature vector for a frame is 18 × 2 = 36
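The subband grouping above can be sketched as follows. The first-difference decorrelating filter and the np.gradient-based deltas are assumptions for illustration, since the slide does not specify the exact filters:

```python
import numpy as np

def static_subbands(log_mel_13):
    """High-pass (difference) filter across the 13 log filter-bank
    amplitudes -> 12 decorrelated values, grouped into 6 subbands of 2."""
    decorrelated = np.diff(log_mel_13)   # shape (12,)
    return decorrelated.reshape(6, 2)    # 6 subbands x 2 components

def add_dynamics(static_seq):
    """Append delta and delta-delta subband spectra over time:
    6 + 6 + 6 = 18 subband components x 2 values = 36 features per frame."""
    delta = np.gradient(static_seq, axis=0)
    delta2 = np.gradient(delta, axis=0)
    return np.concatenate([static_seq, delta, delta2], axis=1)

frames = np.stack([static_subbands(np.random.rand(13)) for _ in range(5)])
features = add_dynamics(frames)          # shape (5, 18, 2)
```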
Experimental Evaluation (cont.) — Tests on Aurora 2
- The table shows the recognition results for clean test data
- For the clean data, the best accuracy rates were obtained by the multi-condition baseline models with 20 and 32 mixtures per state
- The UC model performed on average slightly better than the multi-condition model with 3 mixtures per state
Experimental Evaluation (cont.) — Tests on Aurora 2
- The tables show the recognition results on test set A and test set B
- The UC model improved significantly over the baseline model trained on clean data, and achieved an average performance close to that obtained by the multi-condition model with three mixtures per state
- Car noise exhibits a less sophisticated spectral structure than babble noise, and thus may be more accurately matched by the piecewise flat spectra as implemented in the UC model
Experimental Evaluation (cont.) — Tests on Aurora 2
- The table shows the recognition results on test set C
- The channel mismatch problem can be handled by Multi-20 and Multi-32
- The UC model also showed a capability of coping with this mismatch; its performance is little affected by the channel mismatch
- The figure summarizes the average word accuracy results for the five systems
Experimental Evaluation (cont.) — Tests on Noise Unseen in Aurora 2
- The purpose of this study is to further investigate the capability of the UC model to offer robustness to a wide variety of noises
- Three additional noises are used: a polyphonic mobile phone ring, a pop song segment, and a broadcast news segment
- The spectral characteristics of the three noises are shown in the following figure

(Figure: spectra of the polyphonic mobile phone ring, the pop song segment, and the broadcast news segment)
Experimental Evaluation (cont.) — Tests on Noise Unseen in Aurora 2
- The UC model offered improved accuracy over all three baseline models
- The UC model produced particularly good results for the ringtone noise, because this noise causes mainly partial corruption over the speech frequency band
- The table also indicates that increasing the number of mixtures in the mismatched baseline model produced only a small improvement for the news noise and no improvement for the phone-ring noise
Experimental Evaluation (cont.) — Tests on Noise Unseen in Aurora 2
- The UC model, with a complexity similar to that of Multi-32, performed similarly to Multi-3 trained in matched conditions
- The UC model was able to outperform Multi-32 in the case of unknown/mismatched noise conditions
Experimental Evaluation (cont.) — Discrimination Study on an E-Set Database
- This experiment examines the ability of the UC model to discriminate between acoustically confusing words
  - While the model reduces the mismatch between training and testing conditions, does it also reduce the discrimination between utterances of different words?
- The authors experimented on a new database, containing the highly confusing E-set words (b, c, d, e, g, p, t, v), extracted from the Connex speaker-independent alphabetic database provided by British Telecom
  - It contains three repetitions of each word by 104 speakers: 53 male and 51 female
  - Of the 104 speakers, 52 are used for training and the other 52 for testing
  - For each word, about 156 utterances are available for training
  - A total of 1219 utterances are available for testing
  - Four different noises from Aurora 2 test set A are artificially added
- Two baseline HMMs are built
  - One with the clean training set (1 mixture per state)
  - The other with the multi-condition training set (11 mixtures per state)
Experimental Evaluation (cont.) — Discrimination Study on an E-Set Database
- For the clean E-set, the UC model achieved a recognition accuracy close to that of the baseline model, with only a small loss in accuracy (84.91% → 83.33%)
- For the given noise conditions, the UC model achieved an average performance close to that obtained by the multi-condition baseline model
Experimental Evaluation (cont.) — Discrimination Study on an E-Set Database
- Finally, the performance of the UC model was tested with different resolutions for quantizing the SNR
- Three different training sets were generated with increasing SNR resolution
  - Coarse quantization (6 mixtures per state): only five different SNRs, from 20 dB to 4 dB with a 4-dB step
  - Medium-resolution quantization (11 mixtures per state): ten different SNRs, from 20 dB to 2 dB with a 2-dB step
  - Fine quantization (21 mixtures per state): twenty different SNRs, from 20 dB to 2 dB with a 1-dB step
- Additionally, all three sets also include the clean training data
Experimental Evaluation (cont.) — Discrimination Study on an E-Set Database
- The two models with medium and fine quantization produced quite similar recognition accuracy in many test conditions
- The model with coarse quantization, trained with 6 SNRs, produced poorer results than the other two models, but still showed a significant performance improvement over the baseline model trained on clean data
Summary
- This paper investigated noise compensation for speech recognition, assuming no knowledge about the noise characteristics and no training data from the noisy environment
- Universal Compensation (UC) is proposed as a possible solution to the problem
  - The UC method involves a novel combination of the principle of multi-condition training and the principle of the missing-feature method
- Experiments on Aurora 2 have shown that the UC model has the potential to achieve recognition performance close to that of the multi-condition model without assuming knowledge of the noise
- Further experiments with noises unseen in Aurora 2 have indicated the ability of the UC model to offer robust performance for a wide variety of noises
- Finally, the experimental results on an E-set database have demonstrated the ability of the UC model to deal with acoustically confusing recognition tasks