Discriminative Feature Optimization for Speech Recognition - PowerPoint PPT Presentation

1
Discriminative Feature Optimization for Speech
Recognition
  • Bing Zhang
  • College of Computer and Information Science
    Northeastern University

2
Outline
  • Introduction
  • Problem to attack
  • Methodology
  • Region-dependent feature transform
  • Discriminative optimization of the feature
    transform
  • Implementation
  • System description and results
  • Conclusions

3
Introduction
  • Speech recognition
  • Goal: transcribe speech into text
  • Performance measurement: word error rate (WER)
  • Typical approach
  • Training: statistically model the acoustic and
    linguistic knowledge
  • Recognition: search for the most probable word
    sequence using the models
  • Speech feature extraction
  • Reason: raw signals cannot be robustly modeled
    due to their high dimensionality, so compact
    features have to be extracted
  • Two stages of feature extraction
  • speech analysis → cepstral coefficients
  • speech feature transformation
  • In this thesis: a better feature transformation
    approach is developed to reduce the WER of the
    speech recognition system
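
As a minimal sketch of how WER is commonly computed, the word-level edit distance between a reference and a hypothesis is divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the presentation, and the reference is assumed to be non-empty.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    # dist[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # deletion
                dist[i][j - 1] + 1,         # insertion
                dist[i - 1][j - 1] + cost,  # match or substitution
            )
    return dist[-1][-1] / len(ref)


# Hypothetical example: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```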

4
Introduction (cont.)
A typical speech recognition system
[Figure: block diagram with the labels Features, Acoustic Model, Language Model, and Word Sequence]
5
Language Model
  • N-grams
  • Models the conditional probability of any word
    given N-1 words in history
  • The product of N-gram probabilities can be used
    to approximate the probability of a sequence of
    words
  • $P(w_1, w_2, \ldots, w_k) \approx P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_N \mid w_1, \ldots, w_{N-1}) \cdots P(w_{k-1} \mid w_{k-N}, \ldots, w_{k-2})\, P(w_k \mid w_{k-(N-1)}, \ldots, w_{k-1})$
  • Special cases
  • Unigram: $P(w_i)$
  • Bigram: $P(w_i \mid w_{i-1})$
  • Trigram: $P(w_i \mid w_{i-2}, w_{i-1})$
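
The bigram case can be sketched in a few lines. This is an illustrative, unsmoothed maximum-likelihood estimate; the `<s>` start symbol, the function names, and the toy corpus are assumptions made for the example.

```python
from collections import Counter


def train_bigram(sentences):
    """Count histories and (history, word) pairs, with <s> as the start symbol."""
    histories, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        histories.update(words[:-1])                # every word that precedes another
        bigrams.update(zip(words[:-1], words[1:]))  # adjacent word pairs
    return histories, bigrams


def sequence_probability(sentence, histories, bigrams):
    """Product of bigram probabilities P(w_i | w_{i-1}) over the sentence."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for history, word in zip(words[:-1], words[1:]):
        if histories[history] == 0:
            return 0.0  # unseen history; a real system would smooth instead
        prob *= bigrams[(history, word)] / histories[history]
    return prob


# Toy corpus (hypothetical): "a" is followed by "b" twice and by "c" once.
histories, bigrams = train_bigram(["a b", "a c", "a b"])
p = sequence_probability("a b", histories, bigrams)
```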

6
HMM-based Acoustic Model
  • Repository of unit HMMs (hidden Markov models)
  • Each HMM is a probabilistic finite state machine
    with outputs at each hidden state
  • Transition probabilities
  • Observation probabilities (modeled by a mixture
    of Gaussians for each state)
  • Each HMM represents a basic unit of speech, e.g.,
    phoneme, crossword/non-crossword multiphones
  • HMM state-clusters specify which HMM states can
    share which parameters
  • Pronunciation dictionary: phonetic spelling of
    the words

7
Example of an HMM
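[Figure: a four-state left-to-right HMM running from Start to End. States 1 to 4 have self-loops a11, a22, a33, a44, forward transitions a12, a23, a34, and skip transitions a13, a24. The states emit the observations.]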
8
Example of an HMM
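[Figure: two state sequences through the HMM for the observations o1, ..., o6]
  • Path 1: states 1, 1, 2, 3, 3, 4 (transitions a11,
    a12, a23, a33, a34), with outputs b1(o1), b1(o2),
    b2(o3), b3(o4), b3(o5), b4(o6)
  • Path 2: states 1, 2, 2, 2, 4, 4 (transitions a12,
    a22, a24, a44), with outputs b1(o1), b2(o2),
    b2(o3), b2(o4), b4(o5), b4(o6)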
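
The likelihood of one fixed state sequence is the product of the transition probabilities a_ij along the path and the output probabilities b_j(o_t). The sketch below scores the two paths from the figure. All probability values are made up for illustration, discrete output distributions stand in for the Gaussian mixtures used in practice, and the Start and End transitions are ignored.

```python
import math


def path_log_likelihood(states, observations, trans, emit):
    """Sum of log a[s_{t-1}, s_t] and log b[s_t](o_t) along one fixed state path."""
    log_p = 0.0
    for t, (state, obs) in enumerate(zip(states, observations)):
        if t > 0:
            log_p += math.log(trans[(states[t - 1], state)])
        log_p += math.log(emit[state][obs])
    return log_p


# Hypothetical transition probabilities for the four-state topology.
trans = {
    (1, 1): 0.5, (1, 2): 0.3, (1, 3): 0.2,
    (2, 2): 0.5, (2, 3): 0.3, (2, 4): 0.2,
    (3, 3): 0.5, (3, 4): 0.5,
    (4, 4): 1.0,
}
# Hypothetical discrete output distributions over two symbols.
emit = {
    1: {"x": 0.9, "y": 0.1},
    2: {"x": 0.2, "y": 0.8},
    3: {"x": 0.6, "y": 0.4},
    4: {"x": 0.3, "y": 0.7},
}
observations = ["x", "x", "y", "x", "x", "y"]

ll_path1 = path_log_likelihood([1, 1, 2, 3, 3, 4], observations, trans, emit)
ll_path2 = path_log_likelihood([1, 2, 2, 2, 4, 4], observations, trans, emit)
```

Summing such path likelihoods over all state sequences gives the total likelihood of the observations, and taking the maximum gives the best path.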
10
Acoustic Training
  • Maximum likelihood (ML) training
  • Objective: maximize the conditional likelihood of
    the observed features given the model
  • Algorithm: expectation-maximization (EM)
  • Discriminative training
  • Objective: train the model to distinguish the
    correct word sequence from other hypotheses
  • Criterion
  • Minimum phoneme error (MPE)
  • Representation of hypotheses: lattices
  • Algorithm: extended EM

11
Feature Extraction
  • Speech analysis
  • Deals with the problem of extracting
    distinguishing characteristics (e.g., formant
    locations) of speech from digital signals
  • Examples: MFCC (Mel-frequency cepstral
    coefficients), PLP (perceptual linear prediction)
  • Resulting features: cepstral coefficients
  • Speech feature transformation
  • Applied on top of the cepstral coefficients
  • Transform the cepstral features to better fit the
    model
  • help the HMM to model the trajectory of the
    cepstral features
  • fit the diagonal covariance assumption of the
    Gaussian components

12
Commonly Used Feature Transforms
  • LDA (linear discriminant analysis)
  • Transform the features to maximize the distance
    between different classes while keeping each
    class as compact as possible
  • Assumes that all classes have equal covariance
  • HLDA (heteroscedastic linear discriminant
    analysis)
  • Removes the equal covariance assumption of LDA
  • Finds the feature transform that maximizes the
    likelihood of the data with respect to the
    acoustic model in the transformed space
  • Others
  • HDA (heteroscedastic discriminant analysis)
  • MLLT (maximum likelihood linear transform)

13
Drawbacks of Traditional Feature Transforms
  • Inaccurate assumptions about the acoustic model
  • LDA assumes equal class covariances
  • HDA and LDA ignore the diagonal covariance
    assumption
  • Linear transform
  • Linear transform has limited power for feature
    extraction
  • Using more powerful transforms can be risky when
    the criterion does not correlate with the WER
  • The criteria do not correlate with the WER
  • Performance degrades on high-dimensional input
    features
  • Experimental results in the thesis
  • Performance degrades on highly-correlated input
    features
  • Example on the next slide

14
Example
The data has a linear dependency between two
dimensions such that Z = 2X
[Figure: plots of the example data; axis labels X, Y, Z]
  • If projected to 1-D
  • HLDA will map all samples to one single point
  • LDA will fail to find the answer at all because
    the covariance matrix of each class is singular
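
The singularity is easy to check numerically. The sketch below builds a few hypothetical points with Z = 2X and shows that the covariance matrix over (X, Z) has zero determinant, so it cannot be inverted as LDA requires. The helper names and sample values are assumptions for illustration.

```python
def covariance_2d(points):
    """Population covariance matrix of a list of (x, z) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_z = sum(z for _, z in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_z = sum((z - mean_z) ** 2 for _, z in points) / n
    cov_xz = sum((x - mean_x) * (z - mean_z) for x, z in points) / n
    return [[var_x, cov_xz], [cov_xz, var_z]]


def determinant_2x2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# Hypothetical samples that satisfy the linear dependency Z = 2X exactly.
points = [(x, 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
det = determinant_2x2(covariance_2d(points))  # zero up to rounding: singular
```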

15
A Better Approach
  • Region-dependent transform
  • Nonlinear
  • Computationally inexpensive to train
  • Discriminative training of the feature transform
  • Criterion correlates well with the WER
  • Detailed acoustic model in feature training

16
Region Dependent Transform (RDT)
  • RDT
  • Divides the acoustic space into multiple regions
  • e.g., r1, r2, ..., rN
  • Applies a different transform based on which
    region the input feature vector belongs to
  • e.g., f1, f2, ..., fN

To avoid making hard decisions when choosing
which transform to apply, the posterior
probabilities of the regions are used to
interpolate the transformed results
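
A minimal sketch of this posterior-weighted interpolation follows, assuming the regions are components of a diagonal-covariance GMM and each region transform is a linear projection plus offset. The function names, data layout, and toy values are illustrative assumptions, not the thesis implementation.

```python
import math


def region_posteriors(x, regions):
    """Posterior of each region given x; regions are (weight, mean, variance) tuples."""
    log_scores = []
    for weight, mean, var in regions:
        log_p = math.log(weight)
        for xi, mi, vi in zip(x, mean, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        log_scores.append(log_p)
    peak = max(log_scores)  # subtract the maximum for numerical stability
    unnorm = [math.exp(s - peak) for s in log_scores]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def apply_linear(projection, offset, x):
    """Compute projection * x + offset for one region."""
    return [sum(p * xi for p, xi in zip(row, x)) + b
            for row, b in zip(projection, offset)]


def rdt(x, regions, transforms):
    """Interpolate the per-region outputs with the region posteriors."""
    post = region_posteriors(x, regions)
    outputs = [apply_linear(proj, off, x) for proj, off in transforms]
    dim = len(outputs[0])
    return [sum(g * out[d] for g, out in zip(post, outputs)) for d in range(dim)]


# Two hypothetical regions in a 2-D input space, each projecting to 1-D.
regions = [(0.5, [-1.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 0.0], [1.0, 1.0])]
transforms = [([[1.0, 0.0]], [0.0]), ([[0.0, 1.0]], [1.0])]

y_middle = rdt([0.0, 3.0], regions, transforms)   # equally close to both regions
y_left = rdt([-10.0, 3.0], regions, transforms)   # dominated by the first region
```

With a single region the posterior is always 1 and the sketch reduces to one global linear transform.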
17
More Details of RDT
  • Input features: long-span features
  • A long-span feature vector is formed by
    concatenating the cepstral features from
    consecutive frames, centered at the current frame
  • Advantage: contains information about the
    acoustic context of the current frame
  • Division of the regions: global Gaussian mixture
    model (GMM)
  • Trained via unsupervised clustering
  • Each Gaussian component in the GMM corresponds to
    a region
  • Region-specific transforms
  • In general, they can be any projections of
    long-span feature vectors
  • In this thesis, linear projections are studied
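
Frame concatenation can be sketched as follows. The function name is hypothetical, and padding the utterance edges by repeating the first or last frame is an assumption, since the slides do not say how edges are handled.

```python
def long_span_features(frames, context):
    """Concatenate each frame with `context` frames on either side.

    Edge frames are padded by repeating the first or last frame (an assumption).
    """
    last = len(frames) - 1
    spans = []
    for t in range(len(frames)):
        span = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), last)  # clamp to the valid frame range
            span.extend(frames[idx])
        spans.append(span)
    return spans


# Four hypothetical 2-dimensional frames, spanned with one frame of context.
frames = [[float(t), float(t) + 0.5] for t in range(4)]
spans = long_span_features(frames, context=1)
```

A 15-frame span, as used in the experiments later in the talk, corresponds to `context=7`.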

18
Special Cases of RDT
[Figure: hierarchy of RDT special cases. RDT uses a generic projection; RDLT restricts it to a linear projection. Special cases shown: SPLICE, fMPE (*), MPE-HLDA, and mean-offset fMPE. Edge labels: only one region; only offset; rotation matrix plus offset; P is not region-dependent.]
Note: (*) fMPE also includes a context-expansion
layer, which does not fit this categorization
(see thesis for details).
19
Projections vs. Offsets in RDT
The projection and the offset in RDT
[Equation not recovered: the RDT expression with its projection and offset terms labeled]
Different regions can share the same projections
and/or offsets, so the number of unique
projections/offsets can be less than the number
of regions.
20
Optimization Criterion of RDT
  • Minimum Phoneme Error (MPE) criterion
  • Gives significant gains when used to train the
    HMM
  • Correlates well with WER
  • Can be rewritten as a function of the feature
    transform

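[Figure labels: WER, MPE Score]
[Equation not recovered: the MPE criterion as a function of the feature transform]
Notation: O, Or: original feature vectors; λ: the HMM;
FRDT: the feature transform; a(Wrk): the accuracy
score of the hypothesized word sequence Wrk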
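
For orientation, the MPE objective is commonly written as an expected accuracy over the hypotheses of each training utterance. The form below is a sketch in the slide's symbols, not a transcription of the lost equation: the indexing ($r$ over training utterances, $k$ over the hypotheses in the lattice of utterance $r$), the symbol $\lambda$ for the HMM, and the exact form of the posterior are assumptions.

```latex
\mathcal{F}_{\mathrm{MPE}}(F_{\mathrm{RDT}}, \lambda)
  = \sum_{r} \sum_{k}
    P\bigl(W_{rk} \mid F_{\mathrm{RDT}}(O_r), \lambda\bigr)\, a(W_{rk})
```

Written this way, the criterion depends on the feature transform both through the transformed features $F_{\mathrm{RDT}}(O_r)$ and through any HMM parameters estimated from them.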
21
HMM Updating Methods
  • In MPE, the HMM depends on the transformed
    features, so we should define how it is updated
  • When choosing the HMM updating method, the aim
    is to make the trained transform more generic,
    i.e., reusable across different training setups,
    including
  • both ML and MPE training
  • different types of HMMs
  • If we can make the feature transform focus on
    separating the data, this goal can be achieved
  • To ensure that, the HMM should describe the data
    itself rather than anything else

22
HMM Updating Methods (cont.)
  • If the HMM is updated discriminatively, e.g.,
    under MPE
  • Some Gaussians in the HMM will model decision
    boundaries, lying away from the mass of the data
  • The feature transform will be misled away from
    separating the real data
  • The resulting transform is less generic
  • This method is OK if there is only one HMM to
    train
  • If the HMM is updated under ML
  • The Gaussians will stay on the data
  • The feature transform will also focus on the data
  • The resulting transform is more generic
  • This method is preferred if there are different
    HMMs to train
  • We assume ML updating of the HMM in this thesis

23
Example
[Figure: a discriminative model and an ML model, each shown before and after the transform]
Since the model is already discriminative,
nothing needs to be done here.
24
Training the Feature Transform
  • The transform is trained using a numerical
    optimization algorithm
  • Derivative of MPE with respect to the transform
  • Two terms in the derivative
  • MPE depends on the transformed features directly
    → direct derivative
  • MPE depends on the transform through the HMM,
    which in turn depends on the transformed features →
    indirect derivative
  • Two passes of data processing
  • The first pass computes the direct derivative
    using lattices
  • The second pass computes the indirect derivative
    using reference transcripts
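
The two terms can be written as a chain rule. This is a generic sketch under assumed notation, not the thesis derivation: $y_t$ is the transformed feature at frame $t$, $\theta_m$ ranges over the HMM parameters re-estimated from the transformed features (e.g., Gaussian means and variances), and $\phi$ stands for the parameters of the transform.

```latex
\frac{d\mathcal{F}_{\mathrm{MPE}}}{d y_t}
  = \underbrace{\frac{\partial \mathcal{F}_{\mathrm{MPE}}}{\partial y_t}}_{\text{direct}}
  + \underbrace{\sum_{m}
      \frac{\partial \mathcal{F}_{\mathrm{MPE}}}{\partial \theta_m}\,
      \frac{\partial \theta_m}{\partial y_t}}_{\text{indirect}},
\qquad
\frac{d\mathcal{F}_{\mathrm{MPE}}}{d \phi}
  = \sum_{t} \frac{d\mathcal{F}_{\mathrm{MPE}}}{d y_t}\,
    \frac{\partial y_t}{\partial \phi}
```

The second expression is what a numerical optimizer would consume: the per-frame derivatives are propagated back through the transform and accumulated over all frames.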

25
Training Procedure
Iterative update of RDT using numerical
optimization
26
Implementation
  • Feature transform network
  • A directed acyclic network of primitive
    components
  • Design goals
  • reuse primitive components (e.g., linear
    projection, frame-concatenation)
  • reuse the algorithm that applies the transform or
    computes the derivative
  • easy to extend to other transforms
  • efficient usage of CPU time and memory
  • Impact
  • enables numerical optimization of any
    differentiable components, including but not
    limited to RDT
  • simplifies the BBN system by providing a unified
    representation of various transforms
  • adds flexibility to the front-end processing in
    the BBN system

[Figure: example feature transform network with the components Cepstra, Concatenation, Projection, Gauss. Mixture, and RDT]
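
The idea of reusing one apply/derivative algorithm across primitive components can be sketched as below. This is a hypothetical illustration, not the BBN implementation: the class names are invented, the network is simplified to a chain (the simplest directed acyclic network), and only the derivative with respect to the input is propagated, not the derivatives of the component parameters.

```python
class Scale:
    """Primitive component: y = factor * x, element-wise."""

    def __init__(self, factor):
        self.factor = factor

    def forward(self, x):
        return [self.factor * v for v in x]

    def backward(self, x, grad_out):
        # dL/dx = factor * dL/dy
        return [self.factor * g for g in grad_out]


class Linear:
    """Primitive component: y = W x + b."""

    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def forward(self, x):
        return [sum(w * v for w, v in zip(row, x)) + b
                for row, b in zip(self.weights, self.bias)]

    def backward(self, x, grad_out):
        # dL/dx = W^T dL/dy
        return [sum(self.weights[i][j] * grad_out[i] for i in range(len(grad_out)))
                for j in range(len(x))]


def forward_chain(components, x):
    """Apply each component in order, remembering the input each one saw."""
    inputs = []
    for component in components:
        inputs.append(x)
        x = component.forward(x)
    return x, inputs


def backward_chain(components, inputs, grad_out):
    """Back-propagate a derivative through the chain in reverse order."""
    for component, x in zip(reversed(components), reversed(inputs)):
        grad_out = component.backward(x, grad_out)
    return grad_out


network = [Scale(2.0), Linear([[1.0, -1.0]], [0.5])]
y, seen = forward_chain(network, [3.0, 1.0])
grad_in = backward_chain(network, seen, [1.0])
```

Because every component exposes the same two methods, a new differentiable transform only has to implement `forward` and `backward` to take part in the same optimization.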
27
RDT and the State-of-the-art System
  • The state-of-the-art system at BBN
  • Two sub-systems
  • Speaker-independent (SI) system
  • Speaker-adaptive (SA) system
  • Two phases of training
  • ML (initializes MPE training)
  • MPE
  • Three-pass decoding
  • Three tied-mixture acoustic models
  • How RDT interacts with the system
  • Trained once, used in three types of acoustic
    models
  • Integrated with speaker adaptation

28
RDT in Speaker-independent (SI) Training
[Figure: flowchart comparing the SI training baseline with SI training with RDT. Boxes, in order: Bootstrapping; LDA+MLLT; Initial Transform; ML Training; ML-SI HMM; Lattice Generation; Lattices; MPE Training; MPE-SI HMM]
29
Experimental Setup
  • Data
  • Training: English Conversational Telephone Speech
    (CTS), 2300 hours of SWB+Fisher
  • Testing: Eval03+Dev04, 3 hours of SWB-II and 6 hours
    of Fisher
  • Analysis
  • 14 Perceptual Linear Prediction (PLP) cepstral
    coefficients and normalized energy
  • Vocal Tract Length Normalization (VTLN)
  • RDT
  • 15-frame long-span features projected to 60
    dimensions
  • initialized from LDA+MLLT
  • 1000 regions, one linear projection per region
  • crossword state-cluster tied model (SCTM), 7K
    clusters.
  • number of Gaussians per state-cluster in the HMM
    varies in different experiments

30
SI Results (ML)
  • Description
  • Two RDTs were trained using the HMMs with 12
    Gaussians per state-cluster (GPS) and 44 GPS,
    respectively
  • For decoding, several ML crossword SCTM models
    with different sizes were trained using either
    LDA+MLLT or RDT
  • Only the lattice-rescoring pass was run in
    decoding for simplicity
  • (*) After the other two models (STM, SCTM-NX) were
    retrained, the WER was further reduced to 20.4%,
    i.e., 9.3% relatively better than the LDA+MLLT
    result
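
The relation between absolute WER and relative reduction is simple arithmetic. In the sketch below, 20.4 and 9.3 are read as percentages, and the baseline WER is back-calculated rather than taken from the presentation, since the results table did not survive in this transcript. The function names are illustrative.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction: (baseline - new) / baseline."""
    return (baseline_wer - new_wer) / baseline_wer


def implied_baseline(new_wer, relative):
    """Baseline WER implied by a new WER and a relative reduction."""
    return new_wer / (1.0 - relative)


# Back-calculated, not reported: the LDA+MLLT WER implied by 20.4% and 9.3%.
ml_baseline = implied_baseline(20.4, 0.093)
```

Because the reported figures are rounded, the implied baseline of roughly 22.5% is only approximate.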

31
SI Results (MPE)
  • Description
  • Same as the ML experiments, except that the final
    models were trained under MPE
  • (*) After the other two models (STM, SCTM-NX) were
    trained, the WER was further reduced to 19.2%,
    i.e., 5.8% relatively better than the LDA+MLLT
    result

32
Speaker Adaptation
  • Speaker adaptation (figure)
  • Assumption: the speaker-dependent models are
    linearly transformed from an SI model
  • Variations
  • MLLR: assumes that only the Gaussian means are
    transformed
  • CMLLR: both means and covariances are transformed →
    equivalent to applying the inverse transform to
    the features while keeping the model fixed
  • Speaker-Adaptive Training (SAT)
  • The SI model is not optimal for adaptation
  • SAT tries to estimate a better model that, when
    transformed, gives the best likelihood of the data
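
The CMLLR equivalence can be checked numerically in one dimension. In this sketch a Gaussian with mean m and variance v is adapted with a scalar transform (a, b); the function names and values are assumptions for illustration. Transforming the model gives the same log-likelihood as applying the inverse transform to the feature, once the log-Jacobian of that inverse transform is included.

```python
import math


def gaussian_log_pdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def adapted_model_log_pdf(x, mean, var, a, b):
    """Transform the model: mean -> a * mean + b, variance -> a^2 * var."""
    return gaussian_log_pdf(x, a * mean + b, a * a * var)


def adapted_feature_log_pdf(x, mean, var, a, b):
    """Keep the model fixed and apply the inverse transform to the feature.

    The -log|a| term is the log-Jacobian of the inverse transform.
    """
    return gaussian_log_pdf((x - b) / a, mean, var) - math.log(abs(a))


# Hypothetical feature value, model parameters, and speaker transform.
x, mean, var, a, b = 1.3, 0.2, 0.7, 1.5, -0.4
model_side = adapted_model_log_pdf(x, mean, var, a, b)
feature_side = adapted_feature_log_pdf(x, mean, var, a, b)
```

Working on the feature side is what lets speaker-dependent transforms be stacked on top of another feature transform without touching the model.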

33
RDT in Speaker-adaptive Training (SAT)
Straightforward approach
  • Use SI-RDT transparently
  • Simple
  • But RDT is not optimized for SAT

[Figure: flowchart of the straightforward approach. Boxes, in order: Train SI RDT; SI RDT HMM; CMLLR Estimation; SD Transforms; ML SAT; ML-SAT HMM; MPE Training; MPE-SAT HMM]
34
RDT in Speaker-adaptive Training (SAT)
Iterative approach (SA-RDT)
  • Alternately update RDT and the speaker-dependent
    (SD) transforms
  • Back-propagation is used to compute the
    derivative, since SD transforms are applied on
    top of RDT
  • RDT is optimized for SAT

[Figure: flowchart of the iterative approach. Boxes, in order: Train SI RDT; SI RDT HMM; CMLLR Estimation; SD Transforms; ML SAT; ML-SAT HMM; Update RDT; SA RDT HMM; MPE Training; MPE-SAT HMM]
35
Adapted Results
  • Description
  • Same training and testing data, state-clusters and
    LM as the unadapted experiments
  • 10.9% relative WER reduction for the ML system
  • 7.0% relative WER reduction for the MPE system

36
Alternative Procedure for SA-RDT
Simplified SA-RDT
  • Similar to the original SA-RDT
  • But the speaker-dependent transforms are
    estimated using the baseline model and features

[Figure: flowchart of the simplified SA-RDT procedure. Boxes, in order: SI LDA+MLLT HMM; CMLLR Estimation; SD Transforms; ML SAT; ML-SAT HMM; Update RDT; SA RDT HMM; MPE Training; MPE-SAT HMM]
37
Adapted Results
  • Description
  • 500 hours of training data
  • Another set of SD transforms was used before
    LDA/RDT
  • SA-RDT1 used the simplified procedure
  • SA-RDT2 used the original procedure
  • The simplified procedure gave 2/3 of the gain while
    training the RDT only once

38
Conclusions
  • Original work
  • Region-dependent transform
  • Improved discriminative feature training that
    leads to a more generic feature transform
  • Improved SAT procedure using RDT
  • Impact
  • RDT encompasses several other feature transforms,
    including MPE-HLDA, SPLICE and the core of fMPE
    and mean-offset fMPE
  • The method gives a significant WER reduction: 7%
    relative reduction for the SAT-MPE English CTS
    system
  • The method is potentially helpful for exploring
    novel acoustic features
  • We do not have to worry about the negative effect
    when we add new features to the input of the
    feature transform, because the training will
    decide whether to use the new features and how to
    use them based on a criterion that is correlated
    with WER

39
Publications
  • B. Zhang, S. Matsoukas, J. Ma, and R. Schwartz.
    Long span features and minimum phoneme error
    heteroscedastic linear discriminant analysis. In
    Proceedings of EARS RT-04 Workshop, 2004.
  • B. Zhang and S. Matsoukas. Minimum phoneme error
    based heteroscedastic linear discriminant
    analysis for speech recognition. In Proceedings
    of ICASSP, 2005.
  • B. Zhang, S. Matsoukas and R. Schwartz.
    Discriminatively trained region-dependent
    transform for speech recognition. In Proceedings
    of ICASSP, 2006.
  • Nominated for the Student Paper Award
  • Awarded the Spoken Language Processing Grant by
    the IEEE Signal Processing Society
  • B. Zhang, S. Matsoukas and R. Schwartz. Recent
    progress on the discriminative region-dependent
    transform for speech feature extraction. In
    Proceedings of ICSLP, 2006.