Articulatory Feature-Based Speech Recognition (PowerPoint presentation transcript)
1
Articulatory Feature-Based Speech Recognition
Team update, August 9, 2006
2
WS06 AFSR update, in brief
  • Audio-only (SVitchboard) work
  • Comparison of AF-based observation models in
    phone-based systems (Ozgur, Arthur, Simon)
  • Implementation of AF-based pronunciation models
    (Chris, Arthur, Nash, Lisa, Bronwyn)
  • Audio-visual (CUAVE) work
  • Implementation and testing of several
    phoneme-viseme and AF-based models (Partha,
    Ozgur, Mark, Karen)
  • Other
  • Tying of articulatory states (Partha, Chris)
  • Generation of forced feature alignments (Ari,
    Steve)

3
Observation modeling
  • Comparison of observation models in a phone-based
    system

[Figure: three observation-model structures. Fully generative: phoneState generates each feature variable (dg1, pl1, ..., rd), which in turn generates its own observation (odg1, opl1, ..., ord). Hybrid: the feature variables instead receive virtual evidence from MLPs, either p(f|o) or p(o|f) = p(f|o) p(o) / p(f). Tandem: phoneState generates PLPs appended with the KLT of the log MLP outputs.]
4
(Mostly) observation modeling: Hybrid models
(Simon)
  • Deterministic phoneState → feature mapping
  • p(dg1 | phoneState) = 1 if dg1 is phoneState's
    canonical value
  • Non-deterministic mapping
  • p(dg1 | phoneState) = learned distribution (both
    mappings are sketched below)
  • Hybrid + PLP

[Figure: two hybrid structures. Left: phoneState generates feature variables dg1, pl1, ..., rd, each with a virtual-evidence observation odg1, opl1, ..., ord. Right (hybrid + PLP): the same structure with an additional PLP observation.]
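As an illustration of the two mappings above, here is a minimal numpy sketch; the phone-state and feature inventories and the CANONICAL_DG1 table are hypothetical placeholders, not the project's actual sets, and the smoothed initializer is just one plausible starting point for the learned CPT.

import numpy as np

# Illustrative inventories (not the actual WS06 phone-state / feature sets).
PHONE_STATES = ["aa1", "aa2", "aa3", "b1", "b2", "b3"]
DG1_VALUES = ["vowel", "approximant", "fricative", "closure"]

# Hypothetical canonical degree-of-constriction value for each phone state.
CANONICAL_DG1 = {"aa1": "vowel", "aa2": "vowel", "aa3": "vowel",
                 "b1": "closure", "b2": "closure", "b3": "closure"}

def deterministic_cpt():
    """p(dg1 | phoneState) = 1 iff dg1 is the phone state's canonical value."""
    cpt = np.zeros((len(PHONE_STATES), len(DG1_VALUES)))
    for i, ps in enumerate(PHONE_STATES):
        cpt[i, DG1_VALUES.index(CANONICAL_DG1[ps])] = 1.0
    return cpt

def smoothed_init_cpt(mass_on_canonical=0.9):
    """One way to initialize the learned (non-deterministic) mapping:
    most of the mass on the canonical value, the rest spread uniformly.
    EM training would then re-estimate these entries."""
    cpt = deterministic_cpt() * mass_on_canonical
    cpt += (1.0 - mass_on_canonical) / len(DG1_VALUES)
    return cpt  # each row still sums to 1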
5
Hybrid observation models (Simon)
  • Requires tuning relative weights of different
    MLPs
  • Non-deterministic model is slow to train with
    dense CPTs. Instead:
  • Recipe 1
  • Train on 1000 utterances for 2 iterations
  • Make the dense CPTs more sparse by zeroing all
    entries less than 0.1 (this step is sketched below)
  • Using these parameters, run the genetic
    triangulation script to find a fast
    triangulation, given this particular sparsity of
    the DCPTs.
  • Starting from these parameters, train to 0.5
    tolerance (takes 8 iterations) on the full training set
  • Find a decoding graph triangulation using the
    final trained parameters.
  • Recipe 2
  • Using a faster triangulation, train model with
    fully dense CPTs
  • Make CPTs sparse by zeroing all entries less than
    0.1
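Both recipes share the sparsification step referenced above. A minimal numpy sketch of it, assuming the CPT is stored as rows of conditional distributions; the 0.1 threshold is the value quoted on this slide, and the renormalization is an assumption (the slide does not say whether the zeroed CPTs are renormalized):

import numpy as np

def sparsify_cpt(cpt, threshold=0.1):
    """Zero all CPT entries below `threshold`, then renormalize each row
    so it remains a proper conditional distribution."""
    sparse = np.where(cpt < threshold, 0.0, cpt)
    row_sums = sparse.sum(axis=1, keepdims=True)
    # If an entire row fell below the threshold, keep the original row
    # rather than produce an all-zero distribution.
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    return np.where(row_sums == 0, cpt, sparse / safe_sums)

# Example: the first row collapses onto its dominant entry.
cpt = np.array([[0.05, 0.90, 0.05],
                [0.30, 0.30, 0.40]])
print(sparsify_cpt(cpt))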

6
Tandem models (Ozgur, Arthur)
[Figure: three tandem structures. Standard (unfactored): phoneState generates a single observation of PLPs appended with the KLT of the log MLP outputs. Partially factored: phoneState generates PLPs plus the log outputs of the separate feature MLPs. Fully factored (not yet implemented): individual feature variables (pl1, rd, dg1, ...) each generate their own MLP outputs.]
7
Tandem observation models (Ozgur, Arthur)
  • Fisher-train: MLPs trained on 1776 hours of
    Fisher data
  • SVB-train: MLPs trained on SVitchboard data only
  • Phone MLP: tandem system using a phone MLP
    classifier (trained on SVitchboard) instead of
    feature MLPs

8
Reminder: phone-based models (Ozgur, Chris)
[Model unrolled over frames 0 ... i ... last frame. Variables and their values:
  word: one, two, ...
  wordTransition: 0, 1
  subWordState: 0, 1, 2, ...
  stateTransition: 0, 1
  phoneState: w1, w2, w3, s1, s2, s3, ...
  observation]
(Note: missing pronunciation variants)
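The variables above follow the usual word/sub-word bookkeeping. As a minimal sketch of those semantics (an illustration only, not the actual GMTK structure), assuming three sub-word states per word:

def advance(word_idx, sub_word_state, state_transition, n_states=3):
    """One frame of bookkeeping: stay in the current state unless
    stateTransition fires; on the last sub-word state, firing also
    sets wordTransition and moves to the next word."""
    if not state_transition:
        return word_idx, sub_word_state, 0      # same state, wordTransition = 0
    if sub_word_state + 1 < n_states:
        return word_idx, sub_word_state + 1, 0  # next state within the word
    return word_idx + 1, 0, 1                   # next word, wordTransition = 1

# Example: one 3-state word whose state transitions fire at frames 2, 4, 5.
word, sub = 0, 0
for frame, fires in enumerate([0, 0, 1, 0, 1, 1]):
    word, sub, word_tr = advance(word, sub, fires)
    print(frame, word, sub, word_tr)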
9
Pronunciation modeling (Arthur, Chris, Nash,
Lisa, Bronwyn)
[Figure: two-stream pronunciation model. A shared word variable and wordTransition drive a separate sub-word chain for each feature stream: wordTransitionL, subWordStateL, stateTransitionL, phoneStateL for stream L, and wordTransitionT, subWordStateT, stateTransitionT, phoneStateT for stream T, with an async variable coupling the two streams.]
(Differences from the actual model: 3rd feature stream, pronunciation variants, word transition bookkeeping)
10
Pronunciation models (Chris, Arthur)
  • Fisher-train: MLPs trained on 1776 hours of
    Fisher data
  • SVB-train: MLPs trained on SVitchboard data only
  • Phone MLP: tandem system using a phone MLP
    classifier (trained on SVitchboard) instead of
    feature MLPs

11
Summary of selected experiments
  • (Note: some models still being tuned)
  • For reference

12
Audio-only experiments: Ongoing
  • State tying for AF-based models (Chris)
  • Factored tandem models (Arthur)
  • Combining LTG-based pronunciation models with
    hybrid observation models (Simon, Steve)
  • Articulatory substitution modeling (Bronwyn)
  • Part-of-speech dependent asynchrony (Lisa)
  • Crossword asynchrony (Nash)

13
Audio-visual models (Partha, Ozgur, Mark, Karen)
[Figure: left, synchronous phoneme-viseme model: a single phoneState generates both the audio observation obsA and the video observation obsV. Right, audio-only / video-only model: phoneState generates a single observation obs.]
14
Asynchronous phoneme-viseme model with an
asynchrony variable
  • Analogous to AF-based model with asynchrony
    variables

[Figure: the audio and video streams each have their own chain (subWordStateA, phoneStateA, obsA and subWordStateV, phoneStateV, obsV), coupled through an async variable.]
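As a rough illustration of what the async variable contributes (a sketch only, assuming, as in the audio-only AF models, that it scores how far apart the two streams' sub-word state indices may drift; the prior values are placeholders, not trained numbers):

# Illustrative prior over degrees of asynchrony (0, 1, or 2 states apart).
P_ASYNC = {0: 0.7, 1: 0.2, 2: 0.1}

def async_score(sub_word_state_a, sub_word_state_v):
    """Per-frame contribution of the asynchrony variable: the prior on the
    current degree of desynchronization, or 0 if the streams have drifted
    further apart than any allowed degree."""
    degree = abs(sub_word_state_a - sub_word_state_v)
    return P_ASYNC.get(degree, 0.0)

print(async_score(2, 1))   # one state apart -> 0.2
print(async_score(4, 0))   # too far apart   -> 0.0 (configuration disallowed)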
15
Phoneme-viseme model with coupled HMM-based
asynchrony
[Figure: audio chain (subWordStateA, stateTransitionA, phoneStateA, obsA) and video chain (subWordStateV, stateTransitionV, phoneStateV, obsV) coupled through their state transitions, as in a coupled HMM.]
16
AF-based model with asynchrony variables
[Figure: AF-based audio-visual model. Feature streams L and T each have their own chain (subWordStateL, phoneStateL and subWordStateT, phoneStateT), coupled by an async variable, with audio and video observations obsA and obsV.]
17
CUAVE experimental setup
  • Training on clean data, number of Gaussians tuned
    on clean dev set
  • Audio/video weights tuned on noise-specific dev
    sets
  • Uniform language model
  • Decoding constrained to 10-word utterances
    (avoids language model scale/penalty tuning)

18
CUAVE selected development set results
19
Audio-visual experiments: Ongoing
  • AF models with CHMM-based asynchrony (Mark)
  • State tying for AF models (Partha)
  • Cross-word asynchrony (Partha)
  • Multi-rate modeling (Ozgur)
  • Stream weighting by framewise rejection modeling
    (Ozgur)

20
Other ongoing work
  • Generation of forced feature alignments and
    analysis tool (Ari)
  • Embedded training of MLPs (Simon, Ari, Joe
    Frankel (thanks!))
  • Analysis of recognition outputs (Lisa)
  • Structure learning (Steve)

21
Summary
  • Tandem observation models are the most promising so
    far
  • But also the simplest... Much work to be done on
    other models
  • Monofeat and hybrid models are approaching monophone
    models in performance
  • Main challenges to new models
  • Speed/memory → tuning of triangulations and
    pruning parameters
  • Tuning parameters differ widely across models →
    lots of cross-validation decoding runs
  • For asynchronous structures, low-occupancy states
    → tying
  • In the next week
  • Wrap-up of ongoing experiments
  • Testing on final test sets
  • Analysis of decoding outputs
  • Combination of most promising directions

22
  • Questions?
  • Comments?

23
  • EXTRA SLIDES

24
Project outline: Multistream AF-based models
25
Project outline: Asynchrony modeling
Coupled HMMs and variations
26
Project outline: Asynchrony modeling (2)
  • Instantaneous asynchrony constraints over feature
    subsets

Single asynchrony constraint over all features
Also: cross-word asynchrony, context-dependent
asynchrony
27
Project outline: Reduction/substitution modeling
  • Similar to pronunciation modeling in phone-based
    recognition, but on a per-stream/per-frame basis
  • Context-independent vs. context-dependent feature
    substitutions
  • What kind of context?
  • Phonetic
  • Articulatory
  • Higher-level: speaker, speaking rate, dialect...

28
Manual feature transcriptions (Xuemin Chi, Lisa
Lavoie, Karen)
  • Purpose: Testing of AF classifiers, automatic
    alignments
  • Main transcription guidelines
  • Should contain enough information for the speaker
    to reproduce the acoustics (up to 20ms shifts in
    boundaries)
  • Should correspond to what we would like our AF
    classifiers to detect

29
Manual feature transcriptions (Xuemin Chi, Lisa
Lavoie, Karen)
  • Details
  • 2 transcribers: a phonetician and a PhD student in
    the speech group
  • 78 SVitchboard utterances
  • 9 utterances from Switchboard Transcription
    Project for comparison
  • Multipass transcription using WaveSurfer (KTH)
  • 1st pass: Phone-feature hybrid
  • 2nd pass: All-feature
  • 3rd pass: Discussion, error correction
  • Transcription speed
  • 623 x RT for 1st pass
  • 335 x RT for 2nd pass
  • Why phone-feature hybrid in 1st pass?
  • In a preliminary test, > 2x slower to do
    all-feature transcription in 1st pass
  • Transcribers found all-feature format very tedious

30
Manual feature transcriptions: Analysis (Nash,
Lisa, Ari)
  • How does the multipass strategy affect agreement?
  • How well do transcribers agree?
  • How does agreement compare with phonetic
    transcriptions in STP?

31
Models implemented so far...
  • Phone-based models
  • Non-classifier-based AF models
  • Tandem models
  • Hybrid model

32
Phone-based models: monophone, word-internal
triphone (Ozgur, Chris)
[Model unrolled over frames 0 ... i ... last frame. Variables and their values:
  word: one, two, ...
  wordTransition: 0, 1
  subWordState: 0, 1, 2, ...
  stateTransition: 0, 1
  phoneState: w1, w2, w3, s1, s2, s3, ...
  observation]
(Note: missing pronunciation variants)
33
What are hybrid models?
  • Conventional HMMs generate observations via a
    likelihood p(O | state) or p(O | class) using a
    mixture of Gaussians
  • Hybrid models use another classifier (typically
    an MLP) to obtain the posterior P(class | O)
  • Dividing by the prior gives the likelihood, which
    can be used directly in the HMM; no Gaussians
    required (see the sketch after this list)
  • Advantages of hybrid models include
  • Can easily train the classifier discriminatively
  • Once trained, MLPs will compute P(class | O)
    relatively fast
  • MLPs can use a long window of acoustic input
    frames
  • MLPs don't require the input feature distribution to
    have diagonal covariance (e.g. can use filterbank
    outputs from computational auditory scene
    analysis front-ends)
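A minimal numpy sketch of the posterior-to-likelihood step referenced above (divide the MLP posterior by the class prior, usually in the log domain); the shapes and numbers are illustrative:

import numpy as np

def scaled_log_likelihoods(mlp_posteriors, class_priors, floor=1e-10):
    """Convert per-frame MLP posteriors P(class | O) into scaled likelihoods.

    By Bayes' rule, p(O | class) = P(class | O) * p(O) / P(class); p(O) is
    the same for every class in a frame, so dividing by the prior gives a
    quantity proportional to the likelihood, usable directly in the HMM."""
    post = np.maximum(mlp_posteriors, floor)       # avoid log(0)
    return np.log(post) - np.log(class_priors)     # shape: (frames, classes)

# Example: 2 frames, 3 phone classes; priors from training-frame counts.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.1, 0.8]])
priors = np.array([0.5, 0.3, 0.2])
print(scaled_log_likelihoods(posteriors, priors))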

34
Configurations
  • Standard hybrid
  • Train an MLP to classify phonemes, frame by frame
  • Decode the MLP output using simple HMMs
    (transition probabilities easily derived from
    phone duration statistics dont even need to
    train them)
  • Standard tandem
  • Instead of using MLP output to directly obtain
    the likelihood, just use it as a feature vector,
    after some transformations (e.g. taking logs) and
    dimensionality reduction
  • Append the resulting features to standard
    features, e.g. PLPs or MFCCs
  • Use this vector as the observation for a standard
    HMM with a mixture-of-Gaussians observation model
  • Currently used in state-of-the-art systems such as
    SRI's
  • Our configuration
  • Use ANNs to classify articulatory features
    instead of phones
  • 8 MLPs, classifying pl1, dg1, etc. frame-by-frame
    (the tandem feature construction is sketched below)
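A minimal sketch of the tandem feature construction, with PCA standing in for the KLT (the same transform up to estimation details); the dimensionalities and the random stand-in data are illustrative, not the project's actual configuration:

import numpy as np

def tandem_features(mlp_outputs, plps, keep_dims=26, eps=1e-10):
    """Log the (concatenated) MLP outputs, decorrelate and reduce them with
    a KLT/PCA, and append the standard acoustic features (PLPs)."""
    logp = np.log(np.maximum(mlp_outputs, eps))
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)            # KLT basis = eigenvectors
    eigvals, eigvecs = np.linalg.eigh(cov)          # of the covariance matrix
    order = np.argsort(eigvals)[::-1][:keep_dims]
    reduced = centered @ eigvecs[:, order]
    return np.hstack([reduced, plps])               # one observation per frame

# Example with random stand-ins: 8 feature MLPs' outputs stacked (40 dims)
# and 13-dimensional PLPs, over 100 frames.
rng = np.random.default_rng(0)
posts = rng.dirichlet(np.ones(40), size=100)
plps = rng.normal(size=(100, 13))
print(tandem_features(posts, plps).shape)           # (100, 39)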

35
In other news...
  • gmtkTie
  • Linking pronunciation and observation modeling
  • Structure learning
  • Other ongoing/planned work

36
gmtkTie (Simon)
  • General parameter clustering and tying tool for
    GMTK
  • Currently the most developed parts:
  • Decision-tree clustering of Gaussians, using same
    technique as HTK
  • Bottom-up agglomerative clustering
  • gmtkTie is more general than HTK
  • HTK asks questions about previous/next phone
    identity
  • HTK clusters states only within the same phone
  • gmtkTie can ask user-supplied questions about
    user-supplied features; no assumptions about
    states, triphones, or anything else
  • gmtkTie clusters user-defined groups of
    parameters, not just states
  • gmtkTie can compute cluster sizes and centroids
    in 101 different ways (approx.)
  • Will be stress-tested on various observation
    models using Gaussians
  • Can tie based on values of any variables in the
    graph, not just the phone state (e.g. feature
    values)

37
Linking pronunciation and observation models
  • Pronunciation model generates, from the words,
    three feature streams L, T, G
  • Observation model starts with 8 features (pl1,
    dg1, etc) and generates acoustics (possibly via
    MLP posteriors or tandem features)
  • How to link L,T,G with the 8 features?
  • Deterministic mapping
  • Learned mapping
  • Dense conditional probability table (CPT)
  • Sparse CPT (dense vs. sparse is sketched below)
  • Link them by choosing dependencies
    discriminatively?
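A minimal sketch of the dense-vs-sparse CPT options for one such link, say p(pl1 | T); the value inventories, probabilities, and threshold here are illustrative placeholders:

import numpy as np

T_VALUES = ["alveolar", "palatal", "velar"]               # hypothetical T values
PL1_VALUES = ["alveolar", "palatal", "velar", "silence"]  # hypothetical pl1 values

# Dense CPT: one full row of p(pl1 | T) per value of the T stream.
dense = np.array([[0.85, 0.10, 0.03, 0.02],
                  [0.08, 0.88, 0.02, 0.02],
                  [0.02, 0.03, 0.93, 0.02]])

def to_sparse(dense_cpt, threshold=0.1):
    """Sparse CPT: keep only entries above a threshold, stored as
    {T value: {pl1 value: prob}}, with each surviving row renormalized."""
    sparse = {}
    for i, t in enumerate(T_VALUES):
        row = {PL1_VALUES[j]: p for j, p in enumerate(dense_cpt[i]) if p >= threshold}
        total = sum(row.values())
        sparse[t] = {k: v / total for k, v in row.items()}
    return sparse

print(to_sparse(dense))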

38
Structure learning (Steve, Simon)
  • In various parts of our models, may want to learn
    which dependencies to include and which to omit
  • Between elements of observation vector (similar
    to Bilmes 2001)
  • Between L, T, G and pl1, dg1, ...
  • Among pl1, dg1, ... (similar to King et al.)
  • May not need all the dependencies that we think
    are necessary for a correct generative model
  • Currently investigating Bilmes' EAR measure,
    EAR(X, Y) = I(X; Y | C) - I(X; Y) (sketched after
    this list)
  • EAR is formulated for X, Y observed
  • We plan to use forced alignment to generate
    observed values for hidden variables
  • So far computed EAR between pairs of individual
    values of dg1, pl1, etc
  • Suggests adding links to a model with tandem
    observations
  • Also computed EAR between pairs of features
  • Less promising results; needs further
    investigation
  • Will also compute EAR on tandem observation
    features; can implement these arcs in GMTK as
    dlinks
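A minimal numpy sketch of the EAR measure as written above, EAR(X, Y) = I(X; Y | C) - I(X; Y), computed from empirical joint counts; in the plan described here, X and Y would be feature values taken from forced alignments and C the conditioning class:

import numpy as np

def mutual_info(joint):
    """I(X; Y) from a joint probability table p(x, y)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def ear(counts_xyc):
    """EAR(X, Y) = I(X; Y | C) - I(X; Y) from counts of shape (|X|, |Y|, |C|)."""
    joint = counts_xyc / counts_xyc.sum()
    pc = joint.sum(axis=(0, 1))
    cond_mi = sum(pc[c] * mutual_info(joint[:, :, c] / pc[c])
                  for c in range(joint.shape[2]) if pc[c] > 0)
    return cond_mi - mutual_info(joint.sum(axis=2))

# Toy example: X and Y are independent within each class but dependent
# marginally, so EAR is negative (the direct X-Y link adds little given C).
counts = np.zeros((2, 2, 2))
counts[:, :, 0] = np.outer([9, 1], [9, 1])
counts[:, :, 1] = np.outer([1, 9], [1, 9])
print(ear(counts))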

39
AVICAR: Front-end tuning (Mark)
  • Video-only WERs
  • Current status: Confirming results, then
    implementing audio-visual structures

40
AVICAR: Front-end tuning (Mark)
  • Audio-only WERs

41
CUAVE: Implemented models (Partha)
  • Synchronous model now running
  • In progress: Monofeat model with two observation
    vectors

[Figure: a single phoneState generates both the audio observation obsA and the video observation obsV.]
42
Summary
  • gmtkTie ready for prime-time
  • All basic models running
  • Tandem result encouraging
  • Experiments taking longer than expected
  • Too many choices of experiments!

43
Definitions: Pronunciation and observation
modeling
44
Types of observation model
  • Observation model can be one of:
  • 8 hidden random variables (one per feature) which
    each obtain virtual evidence (VE) from the ANN
  • 8 hidden RVs that generate tandem features using
    their own Gaussians
  • Tandem features for one RV use only the ANN for
    that feature, plus PLPs
  • Single tandem observation generated using
    Gaussians from a monophone/triphone HMM state
  • Tandem feature vector obtained by concatenating 8
    ANN output vectors plus PLPs
  • Compare this to a standard tandem feature vector
    derived using a phone-classifying ANN
  • Non-hybrid/tandem model: PLP observations
    generated using Gaussians