1
Noise Compensation for Speech Recognition with Arbitrary Additive Noise
Ji Ming
School of Computer Science, Queen's University Belfast, Belfast BT7 1NN, UK
  • Presented by Shih-Hsiang

IEEE Trans. on Audio, Speech, and Language
Processing, Vol. 14, No.3, May 2006
2
Introduction
  • Speech recognition performance is known to
    degrade dramatically when a mismatch occurs
    between training and testing conditions
  • Traditional approaches for removing the mismatch, and thereby reducing the effect of noise on recognition, include
  • Removing the noise from the test signal
  • Noise filtering or speech enhancement
  • Spectral subtraction, Wiener filtering, RASTA
    filtering
  • Assuming the availability of a priori knowledge
  • Constructing a new acoustic model to match the appropriate test environment
  • Noise or environment compensation
  • Model adaptation, Parallel model combination
    (PMC), Multi-condition training, SPLICE
  • Real-world noisy training data is needed
  • More recent studies are focused on the methods
    requiring less knowledge
  • Since this knowledge can be difficult to obtain in real-world applications

3
Introduction (cont.)
  • This paper investigates noise compensation for
    speech recognition
  • Involving additive noise, assuming any corruption
    type (e.g. full, partial, stationary, or time
    varying)
  • Assuming no knowledge about the noise
    characteristics and no training data from the
    noisy environment
  • This paper proposes a method that focuses recognition only on reliable features, yet remains robust to full noise corruption that affects all time-frequency components of the speech representation
  • Combining artificial noise compensation with the
    missing-feature method, to accommodate mismatches
    between the simulated noise condition and the
    actual noise condition
  • It is possible to accommodate sophisticated
    spectral distortion, e.g. full, partial, white,
    colored or none
  • Based on clean speech training data and simulated
    noise data
  • Namely, Universal Compensation (UC)

4
Methodology
  • The UC method comprises three steps
  • Construct a set of models for short-time speech
    spectra using artificial multi-condition speech
    data
  • Generated by corrupting the clean training data with artificial wide-band flat-spectrum noise at consecutive SNRs
  • Given a test spectrum
  • Search for the spectral components in each model
    spectrum that best match the corresponding
    spectral components in the test spectrum
  • Produce a score based on the matched components
    for each model spectrum
  • Combine the scores from the individual model
    spectra to form an overall score for recognition

5
Methodology (cont.)
  • Step 1
  • Generating noise by passing a white noise through
    a low-pass filter
  • Step 2
  • Calculating a score for each model spectrum based only on the matched spectral components
  • Step 3
  • Combining the individual scores from the model spectra to produce an overall score

(Figure: clean training spectrum + artificial wide-band flat-spectrum noise → noisy test spectrum)
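Step 1's artificial noise (white noise passed through a low-pass filter) can be sketched as follows; the short moving-average FIR kernel is only an illustrative stand-in for a properly designed low-pass filter, and the sample rate is an assumption:

```python
import numpy as np

def flat_spectrum_noise(n_samples, filter_len=8, seed=0):
    """White Gaussian noise passed through a crude FIR low-pass
    (moving average). The paper's actual filter has a 3-dB
    bandwidth of 3.5 kHz; this kernel is only illustrative."""
    rng = np.random.default_rng(seed)
    white = rng.standard_normal(n_samples)
    kernel = np.ones(filter_len) / filter_len  # moving-average low-pass
    return np.convolve(white, kernel, mode="same")

noise = flat_spectrum_noise(16000)  # one second at an assumed 16 kHz rate
```

A real implementation would design the filter so its 3-dB bandwidth matches the target speech band, as the paper does.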
6
Methodology (cont.)
  • A key to the success of the UC method is the accuracy of converting a full-band corruption into a partial-band corruption
  • This accuracy is determined by two factors
  • The frequency-band resolution
  • Determines the bandwidth for each spectral
    component
  • The smaller the bandwidth, the more accurate the approximation of an arbitrary noise spectrum by piecewise-flat spectra
  • But a small bandwidth usually results in a loss of correlation between the spectral components, thus giving poor phonetic discrimination
  • An optimum frequency-band subdivision, in terms of a good balance between noise spectral resolution and phonetic discrimination, remains a topic for study
  • The amplitude resolution
  • Refers to the number of steps used to quantize
    the SNR
  • The finer the quantizing steps, the more accurate
    the approximation for any given level of noise
  • The use of a large number of SNRs may result in a
    low computational efficiency
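The amplitude (SNR) resolution above amounts to corrupting the clean data at each SNR on a quantized grid. A minimal sketch, assuming the helper name `add_noise_at_snr` and a sine wave standing in for speech:

```python
import numpy as np

def add_noise_at_snr(clean, noise, snr_db):
    """Scale `noise` so that 10*log10(P_clean / P_noise) == snr_db,
    then add it to the clean signal."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(1)
clean = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # stand-in "speech"
noise = rng.standard_normal(8000)
snr_grid_db = list(range(20, 0, -2))  # 20, 18, ..., 2 dB
noisy_set = [add_noise_at_snr(clean, noise, s) for s in snr_grid_db]
```

The grid here matches the ten levels, 20 dB down to 2 dB in 2-dB steps, used in the evaluation later in the presentation.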

7
Formulation: A. Model and Training Algorithms
  • Assume that each training frame is represented by a spectral vector x = (x_1, ..., x_N) consisting of N sub-band spectral components
  • Assume that L levels of SNR are used to generate the wide-band flat-spectrum noise to form the noisy training data
  • Let p(x | s, l) represent a model spectrum, expressed as the probability distribution of the model spectral vector x, associated with speech state s and trained on SNR level l
  • Let y = (y_1, ..., y_N) be a test spectral vector
  • Recognition involves classifying each test spectrum into an appropriate speech state s, based on the probabilities of the test spectrum associated with the individual model spectra within the state
  • Computing the probability p(y | s, l) for each model spectrum
  • Only the matched spectral components are retained
  • The mismatched components are ignored

8
Formulation (cont.): A. Model and Training Algorithms
  • The probability p(y | s, l) can be approximated by p(y_A | s, l), the marginal distribution of the matched subset y_A obtained from p(y | s, l) with the mismatched spectral components of y ignored, to improve mismatch robustness
  • Given p(y_A | s, l) for each model spectrum, the overall probability of y associated with speech state s can be obtained by combining p(y_A | s, l) over all the different SNRs
  • For simplicity, assume that the individual spectral components are independent of one another, so the probability for any subset can be written as

p(y | s) = Σ_l P(l | s) p(y | s, l)            (1)
p(y_A | s, l) = Π_{n ∈ A} p(y_n | s, l)        (2)
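A toy numeric reading of the two formulas, assuming diagonal-Gaussian model spectra (an illustrative choice, not the paper's exact densities): the subset probability in (2) is a product of per-component densities, and the state score in (1) mixes the SNR-level models:

```python
import numpy as np

def component_density(y_n, mean, var):
    """Per-subband Gaussian density (illustrative modelling choice)."""
    return np.exp(-0.5 * (y_n - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def subset_likelihood(y, means, vars_, subset):
    """Eq. (2): product of component densities over the subset,
    i.e. mismatched components are simply left out."""
    return float(np.prod([component_density(y[n], means[n], vars_[n])
                          for n in subset]))

def state_likelihood(y, models, weights):
    """Eq. (1): mixture over the SNR-level model spectra of a state."""
    return sum(w * subset_likelihood(y, m, v, range(len(y)))
               for w, (m, v) in zip(weights, models))

# toy example: one state, two SNR-level model spectra, 3 subbands
y = np.array([0.1, 0.2, 5.0])  # last subband badly corrupted
models = [(np.zeros(3), np.ones(3)), (np.ones(3), np.ones(3))]
weights = [0.5, 0.5]
full = state_likelihood(y, models, weights)
partial = subset_likelihood(y, *models[0], subset=[0, 1])
```

Dropping the corrupted third subband yields a much larger likelihood than keeping it, which is exactly why the matched subset carries the useful evidence.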
9
Formulation (cont.): A. Model and Training Algorithms
  • The model spectrum may be constructed in two different ways
  • First, we may estimate each p(x | s, l) explicitly by using the training data corresponding to a specific SNR
  • Alternatively, we may build the model by pooling the training data from all SNR conditions together, and training the model as a usual mixture model on the mixed dataset (more flexible)
  • The EM algorithm is used to decide the association between the data, the mixtures, and the weights

10
Formulation (cont.): B. Recognition Algorithm
  • Given a test spectral vector y, the mixture probability in (1) is computed using only a subset y_A of the data for each of the mixture densities
  • Reducing the effect of mismatched noisy spectral components
  • But we need to decide the matched subset y_A that contains all the matched components for each model spectrum
  • If we can assume that the matched subset produces a large probability, then y_A may be defined as the subset that maximizes the probability among all possible subsets of y
  • However, (2) indicates that the values of p(y_A | s, l) for different-sized subsets are of different orders of magnitude and are thus not directly comparable
  • An appropriate normalization is needed for the probability
  • A possible solution is to replace the conditional probability of the test subset with the posterior probability of the model spectrum, P(s, l | y_A)

→ always producing a value in the range [0, 1]
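The suggested normalization can be sketched with toy likelihood values: dividing each model spectrum's subset likelihood by the sum over all states and SNR levels (equal priors assumed) yields a posterior that always lies in [0, 1], regardless of the subset's size:

```python
def posterior(liks_all, state, level):
    """Posterior P(s, l | y_A): the (s, l) subset likelihood
    normalized by the sum over all states and SNR levels
    (equal priors assumed). Values are toy numbers."""
    total = sum(liks_all.values())
    return liks_all[(state, level)] / total

# toy subset likelihoods p(y_A | s, l) for 2 states x 2 SNR levels
liks = {("s0", 0): 0.30, ("s0", 1): 0.10,
        ("s1", 0): 0.05, ("s1", 1): 0.05}
post = posterior(liks, "s0", 0)  # 0.30 / 0.50 = 0.6
```

Because every posterior shares the same denominator, posteriors computed from subsets of different sizes become directly comparable.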
11
Formulation (cont.): B. Recognition Algorithm
  • By maximizing the posterior probability P(s, l | y_A), we should be able to obtain the subset for model spectrum (s, l) that contains all the matched components. The following shows the optimum decision
  • The above optimized posterior probability can be incorporated into an HMM to form the state-based emission probability
→ MAP criterion:  A* = argmax_A P(s, l | y_A)            (3)
Assuming an equal prior p(s) for all the states, the prior term is a "don't care"
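For a small number of subbands, the MAP subset selection can be sketched by brute force: enumerate every non-empty component subset, score it by the normalized posterior (equal priors), and keep the maximizer. All models and values here are toy illustrations with Gaussian component densities:

```python
import numpy as np
from itertools import chain, combinations

def gauss(y, m, v):
    return np.exp(-0.5 * (y - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

def subset_lik(y, mean, var, subset):
    return float(np.prod([gauss(y[n], mean[n], var[n]) for n in subset]))

def best_subset(y, models):
    """For each model (s, l), find the subset A maximizing the posterior
    P(s, l | y_A) = p(y_A | s, l) / sum_{s', l'} p(y_A | s', l')
    (equal priors), per the MAP criterion."""
    n = len(y)
    subsets = list(chain.from_iterable(
        combinations(range(n), k) for k in range(1, n + 1)))
    best = {}
    for key, (m, v) in models.items():
        def post(a):
            denom = sum(subset_lik(y, m2, v2, a) for m2, v2 in models.values())
            return subset_lik(y, m, v, a) / denom
        best[key] = max(subsets, key=post)
    return best

# toy: 3 subbands; subband 2 matches model ("s1", 0), not ("s0", 0)
y = np.array([0.0, 0.1, 6.0])
models = {("s0", 0): (np.zeros(3), np.ones(3)),
          ("s1", 0): (np.full(3, 6.0), np.ones(3))}
choice = best_subset(y, models)
```

Exhaustive enumeration costs 2^N - 1 evaluations per model, which is feasible for the 6-subband vectors used later but would need pruning for finer band resolutions.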
12
Experimental Evaluation: A. Databases
  • Two databases are used to evaluate the performance of the UC method
  • The first database is Aurora 2
  • For speaker independent recognition of digit
    sequences in noisy conditions
  • The second database containing the highly
    confusing E-set words
  • Used as an example to further examine the ability
    of the new UC model to deal with acoustically
    confusing recognition tasks
  • E-set words include b, c, d, e, g, p, t, v

13
Experimental Evaluation (cont.): Acoustic Modeling for Aurora 2
  • The performance of UC model is compared with the
    performances of four baseline systems
  • The first one trained on the clean training set
  • 3 mixtures per state for the digits / 6 mixtures per state for the silence
  • The second one trained on the multi condition
    training set
  • 3 mixtures per state for the digits / 6 mixtures
    per state for the silence
  • The third one is an improved model corresponding to the complex back-end model
  • 20 mixtures per state for the digits / 36 mixtures per state for the silence
  • The fourth one uses 32 mixtures for all the states
  • Which thus has the same model complexity as the
    UC model
  • The UC model is trained using only the clean
    training set
  • Expanded by adding wide-band flat-spectrum noise to each of the utterances
  • 10 different SNR levels, from 20 dB to 2 dB, decreasing by 2 dB per level
  • The wide-band flat-spectrum noise is computer-generated white noise filtered by a low-pass filter with a 3-dB bandwidth of 3.5 kHz

14
Experimental Evaluation (cont.): Acoustic Modeling for Aurora 2
  • The speech is divided into frames of 25 ms at a
    frame rate of 10 ms
  • For each frame
  • A 13-channel mel filter bank is applied to obtain 13 log filter-bank amplitudes
  • These 13 amplitudes are then decorrelated by using a high-pass filter, resulting in 12 decorrelated log filter-bank amplitudes
  • The bandwidth of the subbands can be increased conveniently by grouping neighboring subband components together to form a new subband component; for example, the 12 components can be grouped to form a 6-subband spectral vector
  • In this paper, each feature vector consists of 18 components
  • 6 static subband spectra, 6 delta subband spectra, and 6 delta-delta subband spectra
  • The overall size of the feature vector for a frame is 18 × 2 = 36
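The feature construction just described can be sketched as follows; pairwise summing of neighbouring components and simple first-difference deltas are illustrative assumptions, not the paper's exact definitions:

```python
import numpy as np

def group_subbands(frames, group=2):
    """Group neighbouring filter-bank components: (T, 12) -> (T, 6),
    forming the 6-subband spectral vector by summing pairs."""
    T, n = frames.shape
    return frames.reshape(T, n // group, group).sum(axis=2)

def deltas(x):
    """Simple first-difference dynamic features (illustrative;
    real systems typically use a regression window)."""
    return np.diff(x, axis=0, prepend=x[:1])

# stand-in for 100 frames of 12 decorrelated log filter-bank amplitudes
log_fbank = np.random.default_rng(2).standard_normal((100, 12))
static = group_subbands(log_fbank)                   # (100, 6)
d = deltas(static)                                   # (100, 6)
dd = deltas(d)                                       # (100, 6)
features = np.concatenate([static, d, dd], axis=1)   # (100, 18)
```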

15
Experimental Evaluation (cont.): Tests on Aurora 2 Conditions
  • The table shows the recognition results for clean test data
  • For the clean data, the best accuracy rates were obtained by the multi-condition baseline models with 20 and 32 mixtures per state
  • The UC model performed on average slightly better than the multi-condition model with 3 mixtures per state

16
Experimental Evaluation (cont.): Tests on Aurora 2 Conditions
  • The tables show the recognition results on test set A and test set B
  • The UC model significantly improved over the baseline model trained on clean data, and achieved an average performance close to that obtained by the multi-condition model with three mixtures per state
  • Car noise exhibits a less sophisticated spectral structure than babble noise, and thus may be more accurately matched by the piecewise-flat spectra implemented in the UC model

17
Experimental Evaluation (cont.): Tests on Aurora 2 Conditions
  • The table shows the recognition results on test set C
  • The channel mismatch problem can be solved by
    Multi-20 and Multi-32
  • The UC model also showed a capability of coping
    with this mismatch
  • The performance is little affected by channel
    mismatch
  • The figure summarizes the average word accuracy results for the five systems

18
Experimental Evaluation (cont.): Tests on Noise Unseen in Aurora 2
  • The purpose of this study is to further investigate the capability of the UC model to offer robustness for a wide variety of noises
  • Three additional noises are used
  • A polyphonic mobile phone ring, a pop song segment, a broadcast news segment
  • The spectral characteristics of the three noises are shown in the following figure

A polyphonic mobile phone ring
A pop song segment
A broadcast news segment
19
Experimental Evaluation (cont.): Tests on Noise Unseen in Aurora 2
  • The UC model offered improved accuracy over all three baseline models
  • The UC model produced particularly good results for the ringtone noise
  • because this noise mainly causes partial corruption of the speech frequency band
  • The table also indicates that increasing the number of mixtures in the mismatched baseline model
  • produced only a small improvement for the news noise
  • and no improvement for the phone ring noise

20
Experimental Evaluation (cont.): Tests on Noise Unseen in Aurora 2
  • The UC model, with a complexity similar to that
    of Multi-32, performed similarly to Multi-3
    trained in matched conditions
  • The UC model was able to outperform Multi-32 in
    the case of unknown/mismatched noise conditions

21
Experimental Evaluation (cont.): Discrimination Study on an E-Set Database
  • This experiment examines the ability of the UC model to discriminate between acoustically confusing words
  • While the model reduces the mismatch between training and testing conditions, does it also reduce the discrimination between utterances of different words?
  • They experimented on a new database, containing
    the highly confusing E-set words (b, c, d, e, g,
    p, t, v), extracted from the Connex
    speaker-independent alphabetic database provided
    by British Telecom
  • Contains three repetitions of each word by 104
    speakers
  • 53 male and 51 female
  • Among 104 speakers, 52 for training and the other
    52 for testing
  • For each word, about 156 utterances are available
    for training
  • A total of 1219 utterances are available for
    testing
  • Four different noises from Aurora 2 test set A are artificially added
  • Two baseline HMMs are built
  • One with the clean training set (1 mixture per
    state)
  • The other with the multi-condition training set
    (11 mixtures per state)

22
Experimental Evaluation (cont.): Discrimination Study on an E-Set Database
  • For the clean E-set, the UC model achieved a recognition accuracy rate close to that obtained by the baseline model, with only a small loss in accuracy (84.91% → 83.33%)
  • For the given noise conditions, the UC model
    achieved an average performance close to that
    obtained by the multi-condition baseline model

23
Experimental Evaluation (cont.): Discrimination Study on an E-Set Database
  • Finally, the performance of the UC model was tested with different resolutions for quantizing the SNR
  • Three different training sets are generated with increasing SNR resolution
  • Coarse quantization (6 mixtures per state)
  • Including only five different SNRs, from 20 dB to 4 dB with a 4-dB step
  • Medium-resolution quantization (11 mixtures per state)
  • Including ten different SNRs, from 20 dB to 2 dB with a 2-dB step
  • Fine quantization (21 mixtures per state)
  • Including twenty different SNRs, from 20 dB to 2 dB with a 1-dB step
  • Additionally, all the three sets also include the
    clean training data

24
Experimental Evaluation (cont.): Discrimination Study on an E-Set Database
  • The two models with medium and fine quantization produced quite similar recognition accuracy in many test conditions
  • The model with coarse quantization, trained with 6 SNRs, produced poorer results than the other two models, but still showed a significant performance improvement compared to the baseline model trained on clean data

25
Summary
  • This paper investigated noise compensation for
    speech recognition
  • Assuming no knowledge about the noise
    characteristics and no training data from the
    noisy environment
  • Universal compensation (UC) is proposed as a
    possible solution to the problem
  • The UC method involves a novel combination of the
    principle of multi-condition training and the
    principle of the missing feature method
  • Experiments on Aurora 2 have shown that the UC model has the potential to achieve recognition performance close to that of the multi-condition model, without assuming knowledge of the noise
  • Further experiments with noises unseen in Aurora 2 have indicated the ability of the UC model to offer robust performance for a wide variety of noises
  • Finally, the experimental results on an E-set
    database have demonstrated the ability of the UC
    model to deal with acoustically confusing
    recognition tasks