An EKF-based algorithm for learning statistical hidden dynamic model parameters for phonetic recognition

Microsoft Research, Redmond. 10/13/09. ICASSP'2001.

1
An EKF-based algorithm for learning statistical
hidden dynamic model parameters for phonetic
recognition
  • Roberto Togneri
  • University of Western Australia
  • Li Deng
  • Microsoft Research, Redmond

2
Contents
  • The Hidden Dynamic Model (HDM)
  • Parameter Estimation by EM
  • Parameter Estimation by EKF
  • Comparison between EM and EKF
  • Phone Recognition Evaluations
  • Recognition Results
  • Model Convergence Results
  • Discussion of Results
  • Conclusion and Future Work

3
Hidden Dynamic Model (HDM)
  • Target-directed, VTR state dynamics
  • Static, non-linear mapping to MFCC
  • Switching state-space parameters
  • (Tj, Φj) parameters switch when crossing to a new
    phone dynamic regime, j
  • Continuity of dynamic state, z(k)
  • z(k) is continuous across phone regimes
  • MLP non-linear mapping, h(.)
  • h(.) is a 3-layer MLP with z(k) on the input
    layer and MFCC observations, O(k), on the output
    layer
  • hyperbolic tangent activation function
  • Phone segmentation assumed known
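
A minimal generative sketch of the model on this slide. The slide does not show the state equation, so the code assumes the common first-order target-directed form z(k+1) = phi * z(k) + (1 - phi) * T + w(k), with `phi` standing for the diagonal time-constant parameter. It also assumes a tanh hidden layer with a linear output layer. The function name `simulate_hdm`, the random MLP weights, and all numeric values are illustrative, not taken from the presentation.

```python
import numpy as np

def simulate_hdm(targets, phis, durations, W1, b1, W2, b2, z0, noise_std=0.0, seed=0):
    """Generate hidden states z(k) and observations O(k) across phone regimes.

    targets[j], phis[j]: per-regime target and diagonal time constants (assumed form).
    The state is carried across regime boundaries, so z(k) stays continuous.
    """
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    states, obs = [], []
    for T, phi, n in zip(targets, phis, durations):
        for _ in range(n):
            # Target-directed update: z moves toward T at a rate set by phi.
            z = phi * z + (1.0 - phi) * T + noise_std * rng.standard_normal(z.shape)
            states.append(z.copy())
            # 3-layer MLP: input z, tanh hidden layer, linear output layer (assumed).
            obs.append(W2 @ np.tanh(W1 @ z + b1) + b2)
    return np.array(states), np.array(obs)

# Toy usage: two phone regimes, 3-dim state, 3-12-13 MLP with random weights.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(12, 3)), np.zeros(12)
W2, b2 = rng.normal(size=(13, 12)), np.zeros(13)
targets = [np.array([0.5, 1.5, 2.5]), np.array([0.3, 2.0, 2.8])]
phis = [np.full(3, 0.8), np.full(3, 0.9)]
Z, O = simulate_hdm(targets, phis, [40, 40], W1, b1, W2, b2, z0=np.zeros(3))
```

Only the parameters switch at the regime boundary; the state z is never reset, which is what keeps the trajectory continuous across phones.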

4
Parameter estimation by EM
  • E-step
  • Use EKF to provide estimates of the hidden
    dynamic, Z, given the known observations, O, and
    the current parameter estimates (Tj, Φj)
  • M-step
  • Maximise the Q-function with respect to (Tj, Φj)
  • Solution of non-linear, high-order equations
  • Use a generalised form of EM
  • Gradient descent or Newton-Raphson
  • Backprop algorithm for h(.) MLP weights
  • After each EM iteration use back-propagation
    algorithm to estimate MLP weights
  • smoothed EKF estimates of Z as input
  • given observations, O, as output
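
The gradient-based M-step can be sketched as follows, under simplifying assumptions. The code treats the smoothed EKF states as point estimates, assumes the first-order target-directed state equation with a diagonal time constant `phi`, and replaces the full Q-function with the mean squared prediction residual. The function `m_step_gradient`, the helper `residual_loss`, the step size, and the toy data are all hypothetical.

```python
import numpy as np

def m_step_gradient(Z, T, phi, lr=0.01, n_steps=500):
    """Gradient-based M-step sketch for one phone regime.

    Z: (K, m) smoothed state estimates from the E-step (point estimates only;
    the state covariances a full Q-function would include are ignored here).
    """
    z_prev, z_next = Z[:-1], Z[1:]
    T, phi = T.astype(float).copy(), phi.astype(float).copy()
    for _ in range(n_steps):
        # Prediction residual of the assumed target-directed state equation.
        e = z_next - phi * z_prev - (1.0 - phi) * T
        # Steps that reduce the squared residual (equivalently, increase Q).
        T += lr * np.mean(e * (1.0 - phi), axis=0)
        phi += lr * np.mean(e * (z_prev - T), axis=0)
    return T, phi

def residual_loss(Z, T, phi):
    e = Z[1:] - phi * Z[:-1] - (1.0 - phi) * T
    return 0.5 * np.mean(e ** 2)

# Toy usage: a noiseless trajectory generated with known parameters.
T_true, phi_true = np.array([0.5, 1.5, 2.5]), np.full(3, 0.8)
Z = [np.zeros(3)]
for _ in range(40):
    Z.append(phi_true * Z[-1] + (1.0 - phi_true) * T_true)
Z = np.array(Z)

T0, phi0 = T_true + 0.3, np.full(3, 0.6)
T_hat, phi_hat = m_step_gradient(Z, T0, phi0, lr=0.1, n_steps=200)
```

Because T and phi multiply each other in the state equation, the objective is not quadratic in the joint parameters, which is one reason an iterative update is used in place of a closed-form solution.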

5
Parameter Estimation by EKF
  • ADDENDUM to paper
  • The paper only covers EKF estimation of (Tj, Φj), but
    the implemented version also estimates the MLP
    weights, Wr, as described here
  • Augmented form of state vector
  • Augmented state equation
  • Observation equation
  • Nonlinear mapping hr(.) only depends on z(k) and
    Wr(k)
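
The equations on this slide did not survive in the transcript. One plausible reconstruction, offered as an assumption and not as the authors' exact notation, stacks the dynamic state and the parameters into one vector, with \Phi_j the diagonal time-constant matrix and \phi_j its diagonal:

```latex
% Augmented state (assumed layout)
x(k) = \begin{bmatrix} z(k) \\ T_j(k) \\ \phi_j(k) \\ W_r(k) \end{bmatrix}

% Augmented state equation: only z(k) evolves; the parameters are modelled as constant
\begin{aligned}
z(k+1)      &= \Phi_j(k)\, z(k) + \bigl(I - \Phi_j(k)\bigr)\, T_j(k) + w(k) \\
T_j(k+1)    &= T_j(k) \\
\phi_j(k+1) &= \phi_j(k) \\
W_r(k+1)    &= W_r(k)
\end{aligned}

% Observation equation
O(k) = h_r\bigl(z(k), W_r(k)\bigr) + v(k)
```

Here w(k) and v(k) are the state and observation noise terms. The first-order target-directed form of the z(k) equation is itself an assumption about the model.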

6
Parameter Estimation by EKF
  • State equation Jacobian matrix
  • m-dim z(k), p-dim Wr(k), n-dim O(k)
  • Observation equation Jacobian matrix
  • Initialisation
  • Parameter vector, θ(0|0)
  • State error co-variance matrix, P(0|0)
  • Important to control convergence
  • State noise co-variance, Q(k)
  • Set to zero for parameter equations
  • Observation noise co-variance, R(k)
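
A reduced sketch of joint state and parameter estimation with an EKF. To stay short it assumes a scalar hidden state, a known scalar observation function (tanh, in place of the MLP), and the first-order target-directed state equation; the MLP weights are not estimated. The augmented state is x = [z, T, phi]. The function `ekf_step` and all initial values are hypothetical.

```python
import numpy as np

def ekf_step(x, P, o, Q, R, h, h_jac):
    """One predict/update cycle on the augmented state x = [z, T, phi]."""
    z, T, phi = x
    # Predict: the dynamic state moves toward T; the parameters are held constant.
    x_pred = np.array([phi * z + (1.0 - phi) * T, T, phi])
    # State equation Jacobian with respect to [z, T, phi].
    F = np.array([[phi, 1.0 - phi, z - T],
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
    P_pred = F @ P @ F.T + Q
    # Update: the observation depends on z only, so the parameter columns of H are zero.
    H = np.array([[h_jac(x_pred[0]), 0.0, 0.0]])
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    x_new = x_pred + (K @ (np.atleast_1d(o) - h(x_pred[0]))).ravel()
    P_new = (np.eye(3) - K @ H) @ P_pred
    return x_new, P_new

h = np.tanh
h_jac = lambda z: 1.0 - np.tanh(z) ** 2

# Toy usage: noiseless observations from a trajectory with T = 1.0, phi = 0.8.
z_true, T_true, phi_true = 0.0, 1.0, 0.8
x = np.array([0.0, 0.5, 0.7])       # initial guess for [z, T, phi]
P = np.diag([0.1, 1.0, 0.01])       # initial state error covariance
Q = np.diag([1e-4, 0.0, 0.0])       # zero state noise on the parameter rows
R = np.array([[1e-2]])              # observation noise covariance
for _ in range(60):
    z_true = phi_true * z_true + (1.0 - phi_true) * T_true
    x, P = ekf_step(x, P, h(z_true), Q, R, h, h_jac)
```

With Q set to zero on the parameter rows, the filter can only revise T and phi through the initial covariance P, so how quickly and how far the parameters move depends strongly on that initialisation.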

7
Comparison between EM and EKF
  • Cons of EM algorithm
  • Requires additional back-propagation algorithm to
    estimate MLP weights
  • Convergence problems in M-step due to non-linear
    equations
  • Slower rate of convergence
  • Cons of EKF algorithm
  • Sensitive to initial conditions, especially
    initialisation of P(0|0) and Q
  • Computationally expensive for large augmented
    state vectors

8
Phone Recognition Evaluations
  • N-best rescoring
  • Use baseline HMM to provide time-aligned 5-best
    and 100-best transcriptions
  • Optionally include the reference transcription
  • 100-best, 100-best+ref, 5-best, 5-best+ref
  • Calculate log-likelihood score of HDM across all
    transcriptions and select highest scoring
    transcription to calculate the WER of the HDM
  • Perform forced alignment of HMM across all
    transcriptions and select highest scoring
    transcription to calculate the WER of the HMM
  • Baseline HMM
  • Context-Dependent phone HMM
  • 3-state, left-right triphone model
  • cross-word triphone network
  • 39-dim (13+13+13) MFCC observation vectors
  • HTK v2.2 software
  • Trained on all TIMIT training data
  • Tested on TIMIT dr8 testing data
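
The N-best rescoring procedure on this slide can be sketched as: score every candidate transcription with the model, keep the highest-scoring one, and compute the error rate against the reference by edit distance. The scores below are a made-up lookup table standing in for the HDM log-likelihood, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two label sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def rescore(nbest, score_fn):
    """Return the candidate transcription with the highest model score."""
    return max(nbest, key=score_fn)

def error_rate(refs, hyps):
    """Total edit errors divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return errors / sum(len(r) for r in refs)

# Toy usage with invented scores (not results from the presentation).
ref = ["sil", "k", "ae", "t", "sil"]
nbest = [["sil", "k", "ah", "t", "sil"], ["sil", "g", "ae", "t"]]
toy_scores = {
    tuple(nbest[0]): -120.0,
    tuple(nbest[1]): -135.5,
    tuple(ref): -110.2,
}
score = lambda cand: toy_scores[tuple(cand)]

best_without_ref = rescore(nbest, score)
best_with_ref = rescore(nbest + [ref], score)
```

Adding the reference to the candidate list tests whether the model ranks the correct transcription above the alternatives, separately from how good the N-best list itself is.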

9
Phone Recognition Evaluations
  • Evaluation HDM
  • 3-dim VTR state vector
  • 13-dim observation MFCC vector
  • Per phone model j
  • 3-dim target, Tj
  • 3-dim diagonal time-constant, Φj
  • HDMm variant
  • one 3-12-13 MLP per phone model
  • 42 phone models
  • HDMc variant
  • one 3-16-13 MLP per broad-class
  • 3 broad classes: Silence, Voiced, Unvoiced
  • Trained on TIMIT dr8 training data
  • Tested on TIMIT dr8 testing data
  • Differences in training data
  • baseline HMM performance superior when using all
    TIMIT data
  • HDM computational requirements limit training
    data to dr8 subset

10
Recognition Results
  • WER results
  • Both HMM and HDM perform little better than
    chance when presented with the N-best list
  • phone recognition is a difficult problem
  • HDM performance improves significantly when
    presented with the N-best+ref list
  • HDM is able to select the reference
    transcription, whereas the HMM is unable to

11
Model Convergence Results
  • Generative properties of HDM
  • Compare MFCC acoustic feature vector with
    generated outputs from HDMm and HDMc
  • HDMm and HDMc convergence to observed features is
    evident

12
Modelling Results
  • Parsimony of HDM
  • HDM is a more structured modelling paradigm
  • HDMm parameter count: 0.015 × that of the HMM
  • HDMc parameter count: 0.0014 × that of the HMM
  • Identification of HDM parameters
  • The estimated (Tj, Φj) do not appear to have
    converged to the expected values
  • The estimated Tj do not necessarily correspond to
    the measured VTRs for phone j
  • Problems may be due to incorrect modelling
    assumptions, insufficient training, or
    over-specification of model parameters
  • e.g. MLP observation non-linearity may be too
    general

13
Discussion of Results
  • Baseline HMM fails to select the reference
    transcription, whereas the HDM succeeds, with a
    WER reduction of 10%
  • HDM represents a more parsimonious modelling
    paradigm compared to baseline HMM
  • HDM does not yield physically reliable parameters
    but does converge to the given observation
    features
  • HDM is possibly not identifiable in its current
    implementation
  • The N-best rescoring is not a reliable means of
    evaluating system performance
  • HDM results based on sub-optimal time-aligned
    transcriptions from HMM
  • WER results indicate the potential of the HDM
    paradigm for acoustic modelling

14
Conclusion and Future Work
  • The HDM is a promising alternative to the current
    state-of-the-art HMM
  • parsimonious model
  • requires less training data and fewer iterations
  • easier to adapt
  • better generalisation capabilities
  • More work is required to properly evaluate the
    performance of the HDM
  • implement a lattice scoring algorithm with
    optimal segmentation to produce own N-best
    transcriptions
  • More work is required to find efficient and less
    restricted estimation and decoding algorithms
  • compare EKF and EM estimation algorithms
  • estimation of segmentation boundaries
  • efficient decoding algorithms
  • train models on larger data-sets

15
Conclusion and Future Work
  • More work is needed to confirm the proposed
    dynamic and observation modelling structure
  • More constrained non-linear mapping from hidden
    state to MFCC observations
  • Less accurate but much more efficient linear
    mapping from hidden state to MFCC observations
  • Reduce number of parameters to be estimated to
    allow the system to be identified
  • Use known VTR resonance values as Tj
  • estimate Φj and Wr
  • Use known mapping from VTR resonances to MFCC
    observations
  • estimate Φj and Tj