Articulatory Feature-Based Speech Recognition (PowerPoint presentation transcript)
1
Articulatory Feature-Based Speech Recognition
Team update, August 9, 2006
2
WS06 AFSR update, in brief
  • Audio-only (SVitchboard) work
  • Comparison of AF-based observation models in
    phone-based systems (Ozgur, Arthur, Simon)
  • Implementation of AF-based pronunciation models
    (Chris, Arthur, Nash, Lisa, Bronwyn)
  • Audio-visual (CUAVE) work
  • Implementation and testing of several
    phoneme-viseme and AF-based models (Partha,
    Ozgur, Mark, Karen)
  • Other
  • Tying of articulatory states (Partha, Chris)
  • Generation of forced feature alignments (Ari,
    Steve)

3
Observation modeling
  • Comparison of observation models in a phone-based
    system

[Figure: three observation-model structures. Fully generative: phoneState generates each feature variable (dg1, pl1, ..., rd), which in turn generates its own observation (odg1, opl1, ..., ord). Hybrid: the feature variables instead receive virtual evidence from MLPs, either p(f|o) or p(o|f) = p(f|o) p(o) / p(f). Tandem: phoneState generates PLPs appended with the KLT of the log MLP outputs.]
4
(Mostly) observation modeling: Hybrid models
(Simon)
  • Deterministic phoneState → feature mapping
  • p(dg1 | phoneState) = 1 if dg1 is phoneState's
    canonical value
  • Non-deterministic mapping
  • p(dg1 | phoneState) = learned distribution (both
    mappings are sketched below)
  • Hybrid + PLP

[Figure: two hybrid structures. Left: phoneState generates feature variables dg1, pl1, ..., rd, each with a virtual-evidence observation odg1, opl1, ..., ord. Right (hybrid + PLP): the same structure with an additional PLP observation.]
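As an illustration of the two mappings above, here is a minimal numpy sketch; the phone-state and feature inventories and the CANONICAL_DG1 table are hypothetical placeholders, not the project's actual sets, and the smoothed initializer is just one plausible starting point for the learned CPT.

import numpy as np

# Illustrative inventories (not the actual WS06 phone-state / feature sets).
PHONE_STATES = ["aa1", "aa2", "aa3", "b1", "b2", "b3"]
DG1_VALUES = ["vowel", "approximant", "fricative", "closure"]

# Hypothetical canonical degree-of-constriction value for each phone state.
CANONICAL_DG1 = {"aa1": "vowel", "aa2": "vowel", "aa3": "vowel",
                 "b1": "closure", "b2": "closure", "b3": "closure"}

def deterministic_cpt():
    """p(dg1 | phoneState) = 1 iff dg1 is the phone state's canonical value."""
    cpt = np.zeros((len(PHONE_STATES), len(DG1_VALUES)))
    for i, ps in enumerate(PHONE_STATES):
        cpt[i, DG1_VALUES.index(CANONICAL_DG1[ps])] = 1.0
    return cpt

def smoothed_init_cpt(mass_on_canonical=0.9):
    """One way to initialize the learned (non-deterministic) mapping:
    most of the mass on the canonical value, the rest spread uniformly.
    EM training would then re-estimate these entries."""
    cpt = deterministic_cpt() * mass_on_canonical
    cpt += (1.0 - mass_on_canonical) / len(DG1_VALUES)
    return cpt  # each row still sums to 1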
5
Hybrid observation models (Simon)
  • Requires tuning relative weights of different
    MLPs
  • Non-deterministic model is slow to train with
    dense CPTs. Instead:
  • Recipe 1
  • Train on 1000 utterances for 2 iterations
  • Make the dense CPTs more sparse by zeroing all
    entries less than 0.1 (this step is sketched below)
  • Using these parameters, run the genetic
    triangulation script to find a fast
    triangulation, given this particular sparsity of
    the DCPTs.
  • Starting from these parameters, train to 0.5
    tolerance (takes 8 iterations) on the full training set
  • Find a decoding graph triangulation using the
    final trained parameters.
  • Recipe 2
  • Using a faster triangulation, train model with
    fully dense CPTs
  • Make CPTs sparse by zeroing all entries less than
    0.1
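Both recipes share the sparsification step referenced above. A minimal numpy sketch of it, assuming the CPT is stored as rows of conditional distributions; the 0.1 threshold is the value quoted on this slide, and the renormalization is an assumption (the slide does not say whether the zeroed CPTs are renormalized):

import numpy as np

def sparsify_cpt(cpt, threshold=0.1):
    """Zero all CPT entries below `threshold`, then renormalize each row
    so it remains a proper conditional distribution."""
    sparse = np.where(cpt < threshold, 0.0, cpt)
    row_sums = sparse.sum(axis=1, keepdims=True)
    # If an entire row fell below the threshold, keep the original row
    # rather than produce an all-zero distribution.
    safe_sums = np.where(row_sums == 0, 1.0, row_sums)
    return np.where(row_sums == 0, cpt, sparse / safe_sums)

# Example: the first row collapses onto its dominant entry.
cpt = np.array([[0.05, 0.90, 0.05],
                [0.30, 0.30, 0.40]])
print(sparsify_cpt(cpt))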

6
Tandem models (Ozgur, Arthur)
[Figure: three tandem structures. Standard (unfactored): phoneState generates a single observation of PLPs appended with the KLT of the log MLP outputs. Partially factored: phoneState generates PLPs plus the log outputs of the separate feature MLPs. Fully factored (not yet implemented): individual feature variables (pl1, rd, dg1, ...) each generate their own MLP outputs.]
7
Tandem observation models (Ozgur, Arthur)
  • Fisher-train: MLPs trained on 1776 hours of
    Fisher data
  • SVB-train: MLPs trained on SVitchboard data only
  • Phone MLP: tandem system using a phone MLP
    classifier (trained on SVitchboard) instead of
    feature MLPs

8
Reminder: phone-based models (Ozgur, Chris)
[Model unrolled over frames 0 ... i ... last frame. Variables and their values:
  word: one, two, ...
  wordTransition: 0, 1
  subWordState: 0, 1, 2, ...
  stateTransition: 0, 1
  phoneState: w1, w2, w3, s1, s2, s3, ...
  observation]
(Note: missing pronunciation variants)
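The variables above follow the usual word/sub-word bookkeeping. As a minimal sketch of those semantics (an illustration only, not the actual GMTK structure), assuming three sub-word states per word:

def advance(word_idx, sub_word_state, state_transition, n_states=3):
    """One frame of bookkeeping: stay in the current state unless
    stateTransition fires; on the last sub-word state, firing also
    sets wordTransition and moves to the next word."""
    if not state_transition:
        return word_idx, sub_word_state, 0      # same state, wordTransition = 0
    if sub_word_state + 1 < n_states:
        return word_idx, sub_word_state + 1, 0  # next state within the word
    return word_idx + 1, 0, 1                   # next word, wordTransition = 1

# Example: one 3-state word whose state transitions fire at frames 2, 4, 5.
word, sub = 0, 0
for frame, fires in enumerate([0, 0, 1, 0, 1, 1]):
    word, sub, word_tr = advance(word, sub, fires)
    print(frame, word, sub, word_tr)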
9
Pronunciation modeling (Arthur, Chris, Nash,
Lisa, Bronwyn)
[Figure: two-stream pronunciation model. A shared word variable and wordTransition drive a separate sub-word chain for each feature stream: wordTransitionL, subWordStateL, stateTransitionL, phoneStateL for stream L, and wordTransitionT, subWordStateT, stateTransitionT, phoneStateT for stream T, with an async variable coupling the two streams.]
(Differences from the actual model: 3rd feature stream, pronunciation variants, word transition bookkeeping)
10
Pronunciation models (Chris, Arthur)
  • Fisher-train: MLPs trained on 1776 hours of
    Fisher data
  • SVB-train: MLPs trained on SVitchboard data only
  • Phone MLP: tandem system using a phone MLP
    classifier (trained on SVitchboard) instead of
    feature MLPs

11
Summary of selected experiments
  • (Note: some models still being tuned)
  • For reference

12
Audio-only experiments: Ongoing
  • State tying for AF-based models (Chris)
  • Factored tandem models (Arthur)
  • Combining LTG-based pronunciation models with
    hybrid observation models (Simon, Steve)
  • Articulatory substitution modeling (Bronwyn)
  • Part-of-speech dependent asynchrony (Lisa)
  • Crossword asynchrony (Nash)

13
Audio-visual models (Partha, Ozgur, Mark, Karen)
[Figure: left, synchronous phoneme-viseme model: a single phoneState generates both the audio observation obsA and the video observation obsV. Right, audio-only / video-only model: phoneState generates a single observation obs.]
14
Asynchronous phoneme-viseme model with an
asynchrony variable
  • Analogous to AF-based model with asynchrony
    variables

[Figure: the audio and video streams each have their own chain (subWordStateA, phoneStateA, obsA and subWordStateV, phoneStateV, obsV), coupled through an async variable.]
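As a rough illustration of what the async variable contributes (a sketch only, assuming, as in the audio-only AF models, that it scores how far apart the two streams' sub-word state indices may drift; the prior values are placeholders, not trained numbers):

# Illustrative prior over degrees of asynchrony (0, 1, or 2 states apart).
P_ASYNC = {0: 0.7, 1: 0.2, 2: 0.1}

def async_score(sub_word_state_a, sub_word_state_v):
    """Per-frame contribution of the asynchrony variable: the prior on the
    current degree of desynchronization, or 0 if the streams have drifted
    further apart than any allowed degree."""
    degree = abs(sub_word_state_a - sub_word_state_v)
    return P_ASYNC.get(degree, 0.0)

print(async_score(2, 1))   # one state apart -> 0.2
print(async_score(4, 0))   # too far apart   -> 0.0 (configuration disallowed)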
15
Phoneme-viseme model with coupled HMM-based
asynchrony
[Figure: audio chain (subWordStateA, stateTransitionA, phoneStateA, obsA) and video chain (subWordStateV, stateTransitionV, phoneStateV, obsV) coupled through their state transitions, as in a coupled HMM.]
16
AF-based model with asynchrony variables
[Figure: AF-based audio-visual model. Feature streams L and T each have their own chain (subWordStateL, phoneStateL and subWordStateT, phoneStateT), coupled by an async variable, with audio and video observations obsA and obsV.]
17
CUAVE experimental setup
  • Training on clean data, number of Gaussians tuned
    on clean dev set
  • Audio/video weights tuned on noise-specific dev
    sets
  • Uniform language model
  • Decoding constrained to 10-word utterances
    (avoids language model scale/penalty tuning)

18
CUAVE selected development set results
19
Audio-visual experiments: Ongoing
  • AF models with CHMM-based asynchrony (Mark)
  • State tying for AF models (Partha)
  • Cross-word asynchrony (Partha)
  • Multi-rate modeling (Ozgur)
  • Stream weighting by framewise rejection modeling
    (Ozgur)

20
Other ongoing work
  • Generation of forced feature alignments and
    analysis tool (Ari)
  • Embedded training of MLPs (Simon, Ari, Joe
    Frankel (thanks!))
  • Analysis of recognition outputs (Lisa)
  • Structure learning (Steve)

21
Summary
  • Tandem observation models are the most promising so
    far
  • But also the simplest... Much work to be done on
    other models
  • Monofeat and hybrid models are approaching monophone
    models in performance
  • Main challenges to new models
  • Speed/memory → tuning of triangulations and
    pruning parameters
  • Tuning parameters differ widely across models →
    lots of cross-validation decoding runs
  • For asynchronous structures, low-occupancy states
    → tying
  • In the next week
  • Wrap-up of ongoing experiments
  • Testing on final test sets
  • Analysis of decoding outputs
  • Combination of most promising directions

22
  • Questions?
  • Comments?

23
  • EXTRA SLIDES

24
Project outline: Multistream AF-based models
25
Project outline: Asynchrony modeling
Coupled HMMs and variations
26
Project outline: Asynchrony modeling (2)
  • Instantaneous asynchrony constraints over feature
    subsets

Single asynchrony constraint over all features
Also: cross-word asynchrony, context-dependent
asynchrony
27
Project outline: Reduction/substitution modeling
  • Similar to pronunciation modeling in phone-based
    recognition, but on a per-stream/per-frame basis
  • Context-independent vs. context-dependent feature
    substitutions
  • What kind of context?
  • Phonetic
  • Articulatory
  • Higher-level: speaker, speaking rate, dialect...

28
Manual feature transcriptions (Xuemin Chi, Lisa
Lavoie, Karen)
  • Purpose: Testing of AF classifiers, automatic
    alignments
  • Main transcription guidelines
  • Should contain enough information for the speaker
    to reproduce the acoustics (up to 20ms shifts in
    boundaries)
  • Should correspond to what we would like our AF
    classifiers to detect

29
Manual feature transcriptions (Xuemin Chi, Lisa
Lavoie, Karen)
  • Details
  • 2 transcribers: a phonetician and a PhD student in
    the speech group
  • 78 SVitchboard utterances
  • 9 utterances from Switchboard Transcription
    Project for comparison
  • Multipass transcription using WaveSurfer (KTH)
  • 1st pass: Phone-feature hybrid
  • 2nd pass: All-feature
  • 3rd pass: Discussion, error correction
  • Transcription speed
  • 623 x RT for 1st pass
  • 335 x RT for 2nd pass
  • Why phone-feature hybrid in 1st pass?
  • In a preliminary test, > 2x slower to do
    all-feature transcription in 1st pass
  • Transcribers found all-feature format very tedious

30
Manual feature transcriptions: Analysis (Nash,
Lisa, Ari)
  • How does the multipass strategy affect agreement?
  • How well do transcribers agree?
  • How does agreement compare with phonetic
    transcriptions in STP?

31
Models implemented so far...
  • Phone-based models
  • Non-classifier-based AF models
  • Tandem models
  • Hybrid model

32
Phone-based models: monophone, word-internal
triphone (Ozgur, Chris)
[Model unrolled over frames 0 ... i ... last frame. Variables and their values:
  word: one, two, ...
  wordTransition: 0, 1
  subWordState: 0, 1, 2, ...
  stateTransition: 0, 1
  phoneState: w1, w2, w3, s1, s2, s3, ...
  observation]
(Note: missing pronunciation variants)
33
What are hybrid models?
  • Conventional HMMs generate observations via a
    likelihood p(O | state) or p(O | class) using a
    mixture of Gaussians
  • Hybrid models use another classifier (typically
    an MLP) to obtain the posterior P(class | O)
  • Dividing by the prior gives the likelihood, which
    can be used directly in the HMM; no Gaussians
    required (see the sketch after this list)
  • Advantages of hybrid models include
  • Can easily train the classifier discriminatively
  • Once trained, MLPs will compute P(class | O)
    relatively fast
  • MLPs can use a long window of acoustic input
    frames
  • MLPs don't require the input feature distribution to
    have diagonal covariance (e.g. can use filterbank
    outputs from computational auditory scene
    analysis front-ends)
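A minimal numpy sketch of the posterior-to-likelihood step referenced above (divide the MLP posterior by the class prior, usually in the log domain); the shapes and numbers are illustrative:

import numpy as np

def scaled_log_likelihoods(mlp_posteriors, class_priors, floor=1e-10):
    """Convert per-frame MLP posteriors P(class | O) into scaled likelihoods.

    By Bayes' rule, p(O | class) = P(class | O) * p(O) / P(class); p(O) is
    the same for every class in a frame, so dividing by the prior gives a
    quantity proportional to the likelihood, usable directly in the HMM."""
    post = np.maximum(mlp_posteriors, floor)       # avoid log(0)
    return np.log(post) - np.log(class_priors)     # shape: (frames, classes)

# Example: 2 frames, 3 phone classes; priors from training-frame counts.
posteriors = np.array([[0.7, 0.2, 0.1],
                       [0.1, 0.1, 0.8]])
priors = np.array([0.5, 0.3, 0.2])
print(scaled_log_likelihoods(posteriors, priors))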

34
Configurations
  • Standard hybrid
  • Train an MLP to classify phonemes, frame by frame
  • Decode the MLP output using simple HMMs
    (transition probabilities easily derived from
    phone duration statistics dont even need to
    train them)
  • Standard tandem
  • Instead of using MLP output to directly obtain
    the likelihood, just use it as a feature vector,
    after some transformations (e.g. taking logs) and
    dimensionality reduction
  • Append the resulting features to standard
    features, e.g. PLPs or MFCCs
  • Use this vector as the observation for a standard
    HMM with a mixture-of-Gaussians observation model
  • Currently used in state-of-the-art systems such as
    SRI's
  • Our configuration
  • Use ANNs to classify articulatory features
    instead of phones
  • 8 MLPs, classifying pl1, dg1, etc. frame-by-frame
    (the tandem feature construction is sketched below)
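A minimal sketch of the tandem feature construction, with PCA standing in for the KLT (the same transform up to estimation details); the dimensionalities and the random stand-in data are illustrative, not the project's actual configuration:

import numpy as np

def tandem_features(mlp_outputs, plps, keep_dims=26, eps=1e-10):
    """Log the (concatenated) MLP outputs, decorrelate and reduce them with
    a KLT/PCA, and append the standard acoustic features (PLPs)."""
    logp = np.log(np.maximum(mlp_outputs, eps))
    centered = logp - logp.mean(axis=0)
    cov = np.cov(centered, rowvar=False)            # KLT basis = eigenvectors
    eigvals, eigvecs = np.linalg.eigh(cov)          # of the covariance matrix
    order = np.argsort(eigvals)[::-1][:keep_dims]
    reduced = centered @ eigvecs[:, order]
    return np.hstack([reduced, plps])               # one observation per frame

# Example with random stand-ins: 8 feature MLPs' outputs stacked (40 dims)
# and 13-dimensional PLPs, over 100 frames.
rng = np.random.default_rng(0)
posts = rng.dirichlet(np.ones(40), size=100)
plps = rng.normal(size=(100, 13))
print(tandem_features(posts, plps).shape)           # (100, 39)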

35
In other news...
  • gmtkTie
  • Linking pronunciation and observation modeling
  • Structure learning
  • Other ongoing/planned work

36
gmtkTie (Simon)
  • General parameter clustering and tying tool for
    GMTK
  • Currently the most developed parts:
  • Decision-tree clustering of Gaussians, using same
    technique as HTK
  • Bottom-up agglomerative clustering
  • gmtkTie is more general than HTK
  • HTK asks questions about previous/next phone
    identity
  • HTK clusters states only within the same phone
  • gmtkTie can ask user-supplied questions about
    user-supplied features; no assumptions about
    states, triphones, or anything else
  • gmtkTie clusters user-defined groups of
    parameters, not just states
  • gmtkTie can compute cluster sizes and centroids
    in 101 different ways (approx.)
  • Will be stress-tested on various observation
    models using Gaussians
  • Can tie based on values of any variables in the
    graph, not just the phone state (e.g. feature
    values)

37
Linking pronunciation and observation models
  • Pronunciation model generates, from the words,
    three feature streams L, T, G
  • Observation model starts with 8 features (pl1,
    dg1, etc) and generates acoustics (possibly via
    MLP posteriors or tandem features)
  • How to link L,T,G with the 8 features?
  • Deterministic mapping
  • Learned mapping
  • Dense conditional probability table (CPT)
  • Sparse CPT (dense vs. sparse is sketched below)
  • Link them by choosing dependencies
    discriminatively?
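A minimal sketch of the dense-vs-sparse CPT options for one such link, say p(pl1 | T); the value inventories, probabilities, and threshold here are illustrative placeholders:

import numpy as np

T_VALUES = ["alveolar", "palatal", "velar"]               # hypothetical T values
PL1_VALUES = ["alveolar", "palatal", "velar", "silence"]  # hypothetical pl1 values

# Dense CPT: one full row of p(pl1 | T) per value of the T stream.
dense = np.array([[0.85, 0.10, 0.03, 0.02],
                  [0.08, 0.88, 0.02, 0.02],
                  [0.02, 0.03, 0.93, 0.02]])

def to_sparse(dense_cpt, threshold=0.1):
    """Sparse CPT: keep only entries above a threshold, stored as
    {T value: {pl1 value: prob}}, with each surviving row renormalized."""
    sparse = {}
    for i, t in enumerate(T_VALUES):
        row = {PL1_VALUES[j]: p for j, p in enumerate(dense_cpt[i]) if p >= threshold}
        total = sum(row.values())
        sparse[t] = {k: v / total for k, v in row.items()}
    return sparse

print(to_sparse(dense))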

38
Structure learning (Steve, Simon)
  • In various parts of our models, may want to learn
    which dependencies to include and which to omit
  • Between elements of observation vector (similar
    to Bilmes 2001)
  • Between L, T, G and pl1, dg1, ...
  • Among pl1, dg1, ... (similar to King et al.)
  • May not need all the dependencies that we think
    are necessary for a correct generative model
  • Currently investigating Bilmes' EAR measure,
    EAR(X, Y) = I(X; Y | C) - I(X; Y) (sketched after
    this list)
  • EAR is formulated for X, Y observed
  • We plan to use forced alignment to generate
    observed values for hidden variables
  • So far computed EAR between pairs of individual
    values of dg1, pl1, etc
  • Suggests adding links to a model with tandem
    observations
  • Also computed EAR between pairs of features
  • Less promising results; needs further
    investigation
  • Will also compute EAR on tandem observation
    features; can implement these arcs in GMTK as
    dlinks
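A minimal numpy sketch of the EAR measure as written above, EAR(X, Y) = I(X; Y | C) - I(X; Y), computed from empirical joint counts; in the plan described here, X and Y would be feature values taken from forced alignments and C the conditioning class:

import numpy as np

def mutual_info(joint):
    """I(X; Y) from a joint probability table p(x, y)."""
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

def ear(counts_xyc):
    """EAR(X, Y) = I(X; Y | C) - I(X; Y) from counts of shape (|X|, |Y|, |C|)."""
    joint = counts_xyc / counts_xyc.sum()
    pc = joint.sum(axis=(0, 1))
    cond_mi = sum(pc[c] * mutual_info(joint[:, :, c] / pc[c])
                  for c in range(joint.shape[2]) if pc[c] > 0)
    return cond_mi - mutual_info(joint.sum(axis=2))

# Toy example: X and Y are independent within each class but dependent
# marginally, so EAR is negative (the direct X-Y link adds little given C).
counts = np.zeros((2, 2, 2))
counts[:, :, 0] = np.outer([9, 1], [9, 1])
counts[:, :, 1] = np.outer([1, 9], [1, 9])
print(ear(counts))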

39
AVICAR: Front-end tuning (Mark)
  • Video-only WERs
  • Current status: Confirming results, then
    implementing audio-visual structures

40
AVICAR: Front-end tuning (Mark)
  • Audio-only WERs

41
CUAVE: Implemented models (Partha)
  • Synchronous model now running
  • In progress: Monofeat model with two observation
    vectors

[Figure: a single phoneState generates both the audio observation obsA and the video observation obsV.]
42
Summary
  • gmtkTie ready for prime-time
  • All basic models running
  • Tandem result encouraging
  • Experiments taking longer than expected
  • Too many choices of experiments!

43
Definitions: Pronunciation and observation
modeling
44
Types of observation model
  • Observation model can be one of:
  • 8 hidden random variables (one per feature) which
    each obtain virtual evidence (VE) from the ANN
  • 8 hidden RVs that generate tandem features using
    their own Gaussians
  • Tandem features for one RV use only the ANN for
    that feature, plus PLPs
  • Single tandem observation generated using
    Gaussians from a monophone/triphone HMM state
  • Tandem feature vector obtained by concatenating 8
    ANN output vectors plus PLPs
  • Compare this to a standard tandem feature vector
    derived using a phone-classifying ANN
  • Non-hybrid/tandem model: PLP observations
    generated using Gaussians