An EKF-based algorithm for learning statistical hidden dynamic model parameters for phonetic recognition

Microsoft Research, Redmond. 10/13/09. ICASSP'2001.

1
An EKF-based algorithm for learning statistical
hidden dynamic model parameters for phonetic
recognition

Roberto Togneri
University of Western Australia
Li Deng
Microsoft Research, Redmond

2
Contents

The Hidden Dynamic Model (HDM)
Parameter Estimation by EM
Parameter Estimation by EKF
Comparison between EM and EKF
Phone Recognition Evaluations
Recognition Results
Model Convergence Results
Discussion of Results
Conclusion and Future Work

3
Hidden Dynamic Model (HDM)

Target-directed, VTR state dynamics
Static, non-linear mapping to MFCC
Switching state-space parameters
(Tj, Φj) parameters switch when crossing to a new
phone dynamic regime, j
Continuity of dynamic state, z(k)
z(k) is continuous across phone regimes
MLP non-linear mapping, h(.)
h(.) is a 3-layer MLP with z(k) on the input
layer and MFCC observations, O(k), on the output
layer
hyperbolic tangent activation function
Phone segmentation assumed known
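
The structure listed on this slide can be illustrated with a small noise-free simulation. This is a minimal sketch, not the authors' implementation: it assumes the commonly used target-directed form z(k+1) = phi * z(k) + (1 - phi) * T applied element-wise (diagonal time-constant, written phi here), a linear output layer with no biases, and made-up targets, time-constants and weights. The function names are hypothetical.

```python
import math

def mlp_tanh(z, w_hidden, w_out):
    """3-layer MLP: input z -> tanh hidden layer -> linear output (no biases)."""
    hidden = [math.tanh(sum(w * zi for w, zi in zip(row, z))) for row in w_hidden]
    return [sum(w * hj for w, hj in zip(row, hidden)) for row in w_out]

def simulate_hdm(z0, regimes):
    """Noise-free target-directed dynamics with a diagonal time-constant.

    regimes: list of (target, phi, n_frames). The state z is carried over
    unchanged at each regime boundary, so the trajectory stays continuous
    while the parameters (target, phi) switch.
    """
    z = list(z0)
    trajectory = [list(z)]
    for target, phi, n_frames in regimes:
        for _ in range(n_frames):
            # Assumed form: z(k+1) = phi * z(k) + (1 - phi) * T, element-wise
            z = [p * zi + (1.0 - p) * t for zi, t, p in zip(z, target, phi)]
            trajectory.append(list(z))
    return trajectory

# Hypothetical 3-dim targets and time-constants for two phone regimes
regimes = [
    ([0.5, 1.5, 2.5], [0.8, 0.8, 0.8], 40),
    ([0.3, 2.0, 2.8], [0.9, 0.9, 0.9], 40),
]
traj = simulate_hdm([0.0, 0.0, 0.0], regimes)

# Toy 3-2-2 network standing in for the state-to-MFCC mapping h(.)
w_hidden = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]
w_out = [[1.0, -1.0], [0.5, 0.5]]
observations = [mlp_tanh(z, w_hidden, w_out) for z in traj]
```

Within each regime the state relaxes towards that regime's target, and the first frame of the second regime starts from the last state of the first, which is the continuity property stated above.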

4
Parameter estimation by EM

E-step
Use EKF to provide estimates of the hidden
dynamic, Z, given the known observations, O, and
the current parameter estimates (Tj, Φj)
M-step
Maximise Q-function with respect to (Tj, Φj)
Solution of non-linear, high-order equations
Use a generalised form of EM
Gradient descent or Newton-Raphson
Backprop algorithm for h(.) MLP weights
After each EM iteration use back-propagation
algorithm to estimate MLP weights
smoothed EKF estimates of Z as input
given observations, O, as output
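
A gradient-based M-step of the kind mentioned above might look like the following for a scalar hidden state. This is a simplified sketch under stated assumptions: the smoothed state estimates are treated as exact point values (the covariance terms of the Q-function are ignored), the state equation is assumed to be z(k+1) = phi * z(k) + (1 - phi) * T, and the data, step size and starting values are made up. Maximising the Q-function then reduces to minimising the mean squared one-step prediction error.

```python
def make_states(target, phi, z0, n_frames):
    """Noise-free stand-in for the smoothed state estimates from the E-step."""
    z = [z0]
    for _ in range(n_frames):
        z.append(phi * z[-1] + (1.0 - phi) * target)
    return z

def loss_and_grad(z, target, phi):
    """Mean squared one-step prediction error and its gradient in (T, phi)."""
    n = len(z) - 1
    loss = g_target = g_phi = 0.0
    for k in range(n):
        r = z[k + 1] - phi * z[k] - (1.0 - phi) * target  # prediction residual
        loss += r * r / n
        g_target += -2.0 * r * (1.0 - phi) / n
        g_phi += -2.0 * r * (z[k] - target) / n
    return loss, g_target, g_phi

def m_step(z, target, phi, lr=0.5, n_iter=2000):
    """Plain gradient descent on (T, phi); lr and n_iter are illustrative."""
    for _ in range(n_iter):
        _, g_target, g_phi = loss_and_grad(z, target, phi)
        target -= lr * g_target
        phi -= lr * g_phi
    return target, phi

# Hypothetical "true" parameters T = 1.0, phi = 0.8; start from T = 0.8, phi = 0.6
z_smoothed = make_states(target=1.0, phi=0.8, z0=0.0, n_frames=30)
initial_loss, _, _ = loss_and_grad(z_smoothed, 0.8, 0.6)
t_hat, phi_hat = m_step(z_smoothed, target=0.8, phi=0.6)
final_loss, _, _ = loss_and_grad(z_smoothed, t_hat, phi_hat)
```

The residual is bilinear in T and phi, which is one small-scale illustration of why the M-step has no simple closed form in these parameters and is solved iteratively.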

5
Parameter Estimation by EKF

ADDENDUM to paper
Paper only covers EKF estimation of (Tj, Φj), but
implemented version also estimates MLP weights,
Wr, as described here
Augmented form of state vector
Augmented state equation
Observation equation
Nonlinear mapping hr(.) only depends on z(k) and
Wr(k)
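
The equations themselves do not appear in this transcript. One plausible form, consistent with the items listed above, is sketched below. It is an assumption rather than the slide's actual content: it takes the target-directed state equation with a diagonal time-constant matrix, written here as \Phi_j = diag(\phi_j), models the parameters as constant states, and uses w(k) and v(k) as assumed names for the state and observation noise.

```latex
\begin{aligned}
x(k) &= \begin{bmatrix} z(k) \\ T_j(k) \\ \phi_j(k) \\ W_r(k) \end{bmatrix} \\[4pt]
x(k+1) &= f\big(x(k)\big) + w(k)
        = \begin{bmatrix}
            \Phi_j(k)\, z(k) + \big(I - \Phi_j(k)\big)\, T_j(k) \\
            T_j(k) \\
            \phi_j(k) \\
            W_r(k)
          \end{bmatrix} + w(k),
        \qquad \Phi_j(k) = \operatorname{diag}\big(\phi_j(k)\big) \\[4pt]
O(k) &= h_r\big(z(k), W_r(k)\big) + v(k)
\end{aligned}
```

In this form the observation equation involves only the z(k) and W_r(k) blocks of the augmented state, matching the remark that h_r(.) depends only on z(k) and W_r(k).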

6
Parameter Estimation by EKF

State equation Jacobian matrix
m-dim z(k), p-dim Wr(k), n-dim O(k)
Observation equation Jacobian matrix
Initialisation
Parameter vector, ?(0|0)
State error co-variance matrix, P(0|0)
Important to control convergence
State noise co-variance, Q(k)
Set to zero for parameter equations
Observation noise co-variance, R(k)
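
For a scalar hidden state, the state-equation Jacobian and the zero-noise convention for the parameter equations can be sketched as follows. The augmented state x = [z, T, phi] and the state equation z(k+1) = phi * z(k) + (1 - phi) * T are assumptions made for illustration, the MLP weights are left out to keep the example small, and the finite-difference routine is only there to check the analytic Jacobian.

```python
def f_aug(x):
    """Augmented state map for a scalar hidden state: x = [z, T, phi]."""
    z, target, phi = x
    return [phi * z + (1.0 - phi) * target, target, phi]

def jacobian_f(x):
    """Analytic Jacobian of f_aug; the parameter rows are identity rows."""
    z, target, phi = x
    return [
        [phi, 1.0 - phi, z - target],  # d z(k+1) / d [z, T, phi]
        [0.0, 1.0, 0.0],               # T(k+1) = T(k)
        [0.0, 0.0, 1.0],               # phi(k+1) = phi(k)
    ]

def numeric_jacobian(func, x, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    n = len(x)
    cols = []
    for i in range(n):
        x_hi, x_lo = list(x), list(x)
        x_hi[i] += eps
        x_lo[i] -= eps
        f_hi, f_lo = func(x_hi), func(x_lo)
        cols.append([(a - b) / (2.0 * eps) for a, b in zip(f_hi, f_lo)])
    # cols[i][r] holds d f_r / d x_i; transpose so that rows index outputs
    return [[cols[i][r] for i in range(n)] for r in range(len(cols[0]))]

def process_noise(q_z):
    """Q(k): noise on the hidden state only, zero for the parameter equations."""
    return [[q_z, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]

x = [0.2, 1.0, 0.8]  # hypothetical values of z, T, phi
F = jacobian_f(x)
F_num = numeric_jacobian(f_aug, x)
Q = process_noise(0.01)
```

With Q set to zero on the parameter rows, the filter's only freedom to move the parameter estimates comes from the initial error covariance, which may be one reason the initialisation is described as important for convergence.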

7
Comparison between EM and EKF

Cons of EM algorithm
Requires additional back-propagation algorithm to
estimate MLP weights
Convergence problems in M-step due to non-linear
equations
Slower rate of convergence
Cons of EKF algorithm
Sensitive to initial conditions, especially
initialisation of P(0|0) and Q
Computationally expensive for large augmented
state vectors

8
Phone Recognition Evaluations

N-best rescoring
Use baseline HMM to provide time-aligned 5-best
and 100-best transcriptions
Optionally include the reference transcription
100-best, 100-best+ref, 5-best, 5-best+ref
Calculate log-likelihood score of HDM across all
transcriptions and select highest scoring
transcription to calculate the WER of the HDM
Perform forced alignment of HMM across all
transcriptions and select highest scoring
transcription to calculate the WER of the HMM
Baseline HMM
Context-Dependent phone HMM
3-state, left-right triphone model
cross-word triphone network
39-dim (13+13+13) MFCC observation vectors
HTK v2.2 software
Trained on all TIMIT training data
Tested on TIMIT dr8 testing data
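
The N-best rescoring procedure described on this slide can be sketched as follows. This is an illustrative outline only: the phone sequences and scores are made up, `toy_scores` stands in for the model log-likelihood of each time-aligned transcription, and the error rate is computed as a plain Levenshtein distance over phone labels.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phone sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def error_rate(ref, hyp):
    """Error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

def rescore(nbest, score_fn, reference=None):
    """Return the highest-scoring transcription.

    If a reference is given it is added to the candidate list, which
    corresponds to the "+ref" conditions.
    """
    candidates = list(nbest)
    if reference is not None and reference not in candidates:
        candidates.append(reference)
    return max(candidates, key=score_fn)

# Hypothetical utterance with a 2-best list and made-up log-likelihood scores
reference = ("sil", "k", "ae", "t", "sil")
nbest = [("sil", "k", "ah", "t", "sil"), ("sil", "g", "ae", "d", "sil")]
toy_scores = {nbest[0]: -120.0, nbest[1]: -135.0, reference: -110.0}

best_plain = rescore(nbest, toy_scores.get)
best_with_ref = rescore(nbest, toy_scores.get, reference=reference)
per_plain = error_rate(reference, best_plain)
per_with_ref = error_rate(reference, best_with_ref)
```

In the plain condition the rescoring model can do no better than the best entry already in the list, whereas in the "+ref" condition a model that scores the reference highest reaches zero errors on that utterance.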

9
Phone Recognition Evaluations

Evaluation HDM
3-dim VTR state vector
13-dim observation MFCC vector
Per phone model j
3-dim target, Tj
3-dim diagonal time-constant, Φj
HDMm variant
one 3-12-13 MLP per phone model
42 phone models
HDMc variant
one 3-16-13 MLP per broad-class
3 broad classes: Silence, Voiced, Unvoiced
Trained on TIMIT dr8 training data
Tested on TIMIT dr8 testing data
Differences in training data
baseline HMM performance superior when using all
TIMIT data
HDM computational requirements limit training
data to dr8 subset

10
Recognition Results

WER results
Both HMM and HDM perform little better than
chance when presented with the N-best list
phone recognition is a difficult problem
HDM performance improves significantly when
presented with the N-best+ref list
HDM is able to select the reference
transcription, whereas the HMM is unable to

11
Model Convergence Results

Generative properties of HDM
Compare MFCC acoustic feature vector with
generated outputs from HDMm and HDMc
HDMm and HDMc convergence to observed features is
evident

12
Modelling Results

Parsimony of HDM
HDM is a more structured modelling paradigm
HDMm parameters: 0.015 × HMM parameters
HDMc parameters: 0.0014 × HMM parameters
Identification of HDM parameters
The estimated (Tj, Φj) do not appear to have
converged to the expected values
The estimated Tj do not necessarily correspond to
the measured VTRs for phone j
Problems due to incorrect modelling assumption,
insufficient training, or over-specification of
model parameters
e.g. MLP observation non-linearity may be too
general
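
The parsimony point can be made concrete by counting the HDM parameters from the configuration given in the evaluation setup. This is a rough sketch under assumptions: it counts only the targets, the diagonal time-constants and the MLP weights, it leaves out biases and noise covariances by default, and it assumes 42 phone models for both variants. The HMM parameter count is not given in the slides, so the ratios quoted above are not reproduced here.

```python
def mlp_weights(layers, biases=False):
    """Number of weights in a fully connected MLP with the given layer sizes."""
    count = sum(a * b for a, b in zip(layers, layers[1:]))
    if biases:
        count += sum(layers[1:])
    return count

def hdm_parameter_count(n_phones, state_dim, n_mlps, layers, biases=False):
    """Targets and diagonal time-constants per phone, plus all MLP weights."""
    dynamics = n_phones * 2 * state_dim
    return dynamics + n_mlps * mlp_weights(layers, biases)

# HDMm: one 3-12-13 MLP per phone model
hdm_m = hdm_parameter_count(42, 3, 42, [3, 12, 13])
# HDMc: one 3-16-13 MLP per broad class (3 classes)
hdm_c = hdm_parameter_count(42, 3, 3, [3, 16, 13])
```

Under these assumptions HDMm has 8316 parameters and HDMc has 1020, so sharing the mapping across broad classes cuts the total by roughly a factor of eight.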

13
Discussion of Results

Baseline HMM fails to select reference
transcription whereas HDM is successful with a
WER reduction of 10
HDM represents a more parsimonious modelling
paradigm compared to baseline HMM
HDM does not yield physically reliable parameters
but does converge to the given observation
features
HDM is possibly not identifiable in its current
implementation
The N-best rescoring is not a reliable means of
evaluating system performance
HDM results based on sub-optimal time-aligned
transcriptions from HMM
WER results indicate the potential of the HDM
paradigm for acoustic modelling

14
Conclusion and Future Work

The HDM is a promising alternative to the current
state-of-the-art HMM
parsimonious model
requires less training data and fewer iterations
easier to adapt
better generalisation capabilities
More work is required to properly evaluate the
performance of the HDM
implement a lattice scoring algorithm with
optimal segmentation to produce own N-best
transcriptions
More work is required to find efficient and less
restricted estimation and decoding algorithms
compare EKF and EM estimation algorithms
estimation of segmentation boundaries
efficient decoding algorithms
train models on larger data-sets

15
Conclusion and Future Work

More work is needed to confirm the proposed
dynamic and observation modelling structure
More constrained non-linear mapping from hidden
state to MFCC observations
Less accurate but much more efficient linear
mapping from hidden state to MFCC observations
Reduce number of parameters to be estimated to
allow the system to be identified
Use known VTR resonance values as Tj
estimate Φj and Wr
Use known mapping from VTR resonances to MFCC
observations
estimate Φj and Tj