SVitchboard (PowerPoint presentation transcript)
1
  • SVitchboard

2
Data: SVitchboard - Small Vocabulary Switchboard
  • SVitchboard [King, Bartels & Bilmes, 2005] is a
    collection of small-vocabulary tasks extracted
    from Switchboard 1
  • Closed vocabulary: no OOV issues
  • Various tasks of increasing vocabulary size,
    from 10 to 500 words
  • Pre-defined train/validation/test sets
  • and a 5-fold cross-validation scheme
  • Utterance fragments extracted from SWB 1
  • always surrounded by silence
  • Word alignments available (msstate)
  • Whole word HMM baselines already built
  • SVitchboard is abbreviated SVB below

3
SVitchboard: amount of data
4
SVitchboard: amount of data
5
SVitchboard: word frequency distributions
6
SVitchboard: number of words per utterance
7
SVitchboard: example utterances
  • 10 word task
  • oh
  • right
  • oh really
  • so
  • well the
  • 500 word task
  • oh how funny
  • oh no
  • i feel like they need a big home a nice place
    where someone can have the time to play with them
    and things but i can't give them up
  • oh
  • oh i know it's like the end of the world
  • i know i love mine too

8
SVitchboard: isn't it too easy (or too hard)?
  • No (and no).
  • Results on the 500 word task test set using a
    recent SRI system:
  • SVitchboard data included in the training set for
    this system
  • SRI system has 50k vocab
  • System not tuned to SVB in any way

9
SVitchboard: what is the point of a 10 word task?
  • Originally designed for debugging purposes
  • However, results on the 10 and 500 word tasks
    obtained in this workshop show good correlation
    between WERs on the two tasks

[Scatter plot: WER (%) on the 500 word task (y-axis, 50-85)
against WER (%) on the 10 word task (x-axis, 15-29)]
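The correlation claim can be quantified with a Pearson coefficient over paired system WERs. A minimal numpy sketch (the helper name and the sample values in the usage are illustrative, not data from the slides):

```python
import numpy as np

def wer_correlation(wer_10, wer_500):
    """Pearson correlation between paired WERs on the 10- and
    500-word tasks, one pair per system; the scatter plot above
    suggests this is strongly positive."""
    return float(np.corrcoef(wer_10, wer_500)[0, 1])

# Hypothetical, perfectly linear pairs give a coefficient of 1.0
x = np.array([15.0, 18.0, 21.0, 24.0, 27.0])
r = wer_correlation(x, 2.5 * x + 20.0)
```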
10
SVitchboard: pre-existing baseline word error rates
  • Whole word HMMs trained on SVitchboard
  • these results are from [King, Bartels & Bilmes,
    2005]
  • Built with HTK
  • Use MFCC observations

11
SVitchboard: experimental technique
  • We only performed task 1 of SVitchboard (the
    first of the 5 cross-validation folds)
  • Training set is known as ABC
  • Validation set is known as D
  • Test set is known as E
  • SVitchboard defines cross-validation sets
  • But these were too big for the very large number
    of experiments we ran
  • We mainly used a fixed 500 utterance
    randomly-chosen subset of D which we call the
    small validation set
  • All validation set results reported today are on
    this set, unless stated otherwise
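A fixed random subset like the small validation set can be drawn reproducibly with a seeded RNG. A sketch under stated assumptions: the helper name, the seed, and the utterance-id format are all illustrative; the slides do not say how the workshop's 500-utterance subset was actually chosen.

```python
import random

def small_validation_set(utterance_ids, size=500, seed=0):
    """Pick a fixed, reproducible random subset of validation set D.
    Seeding the RNG means every experiment scores the same subset."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(utterance_ids), size))

# Hypothetical utterance ids standing in for the full set D
subset = small_validation_set(f"utt{i:05d}" for i in range(4000))
```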

12
SVitchboard: experimental technique
  • SVitchboard includes word alignments.
  • We found that using these made training
    significantly faster, and gave improved results
    in most cases
  • Word alignments are only ever used during
    training
  • The results above are for a monophone HMM with
    PLP observations

13
SVitchboard: workshop baseline word error rates
  • Monophone HMMs trained on SVitchboard
  • PLP observations

14
SVitchboard: workshop baseline word error rates
  • Triphone HMMs trained on SVitchboard
  • PLP observations
  • 500 word task only
  • (GMTK system was trained without word alignments)

15
SVitchboard: baseline word error rates summary
  • Test set word error rates

16
  • gmtkTie

17
gmtkTie
  • General parameter clustering and tying tool for
    GMTK
  • Written for this workshop
  • Currently the most developed parts are:
  • Decision-tree clustering of Gaussians, using same
    technique as HTK
  • Bottom-up agglomerative clustering
  • Decision-tree tying was tested in this workshop
    on various observation models using Gaussians
  • Conventional triphone models
  • Tandem models, including with factored
    observation streams
  • Feature based models
  • Can tie based on values of any variables in the
    graph, not just the phone state (e.g. feature
    values)
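The HTK-style decision-tree criterion mentioned above scores each yes/no question by the log-likelihood gain of splitting a cluster modelled by a single diagonal-covariance Gaussian. A minimal numpy sketch of that scoring step (the function names and variance floor are illustrative; gmtkTie's internals are not shown in the slides):

```python
import numpy as np

def cluster_loglik(frames):
    """Approximate log likelihood of modelling `frames` with one
    diagonal-covariance Gaussian (the standard clustering objective)."""
    n, d = frames.shape
    var = frames.var(axis=0) + 1e-8   # variance floor avoids log(0)
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(frames, answers):
    """Log-likelihood gain from splitting a cluster by a yes/no
    question; `answers` marks the frames whose context answers yes.
    Tree building greedily applies the question with largest gain."""
    yes, no = frames[answers], frames[~answers]
    return cluster_loglik(yes) + cluster_loglik(no) - cluster_loglik(frames)
```

A question that separates genuinely different acoustics scores far higher than an arbitrary one, which is what drives the greedy tree growth.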

18
gmtkTie
  • gmtkTie is more general than HTK
  • HTK asks questions about previous/next phone
    identity
  • HTK clusters states only within the same phone
  • gmtkTie can ask user-supplied questions about
    user-supplied features: no assumptions about
    states, triphones, or anything else
  • gmtkTie clusters user-defined groups of
    parameters, not just states
  • gmtkTie can compute cluster sizes and centroids
    in lots of different ways
  • GMTK/gmtkTie triphone system built in this
    workshop is at least as good as HTK system

19
gmtkTie conclusions
  • It works!
  • Triphone performance at least as good as HTK
  • Can cluster arbitrary groups of parameters,
    asking questions about any feature the user can
    supply
  • Later in this presentation, we will see an
    example of separately clustering the Gaussians
    for two observation streams
  • Opens up new possibilities for clustering
  • Much to explore
  • Building different decision trees for various
    factorings of the acoustic observation vector
  • Asking questions about other contextual factors

20
  • Hybrid models

21
Hybrid models: introduction
  • Motivation
  • Want to use feature-based representation
  • In previous work, we have successfully recovered
    feature values from continuous speech using
    neural networks (MLPs)
  • MLPs alone are just frame-by-frame classifiers
  • Need some back end model to decode their output
    into words
  • Ways to use such classifiers
  • Hybrid models
  • Tandem observations

22
Hybrid models: introduction
  • Conventional HMMs generate observations via a
    likelihood p(O|state) or p(O|class) using a
    mixture of Gaussians
  • Hybrid models use another classifier (typically
    an MLP) to obtain the posterior P(class|O)
  • Dividing by the prior gives a scaled likelihood,
    which can be used directly in the HMM: no
    Gaussians required
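The posterior-to-likelihood step is a one-liner in the log domain: dividing P(class|O) by the class prior P(class) yields p(O|class)/p(O), which differs from the true likelihood only by a per-frame constant that decoding ignores. A minimal numpy sketch (the flooring constant is an illustrative choice):

```python
import numpy as np

def scaled_log_likelihood(posteriors, priors, floor=1e-10):
    """Convert frame-level MLP posteriors P(class|O) into scaled
    likelihoods p(O|class)/p(O) by dividing by class priors, the
    standard hybrid HMM/MLP trick; returned in the log domain.
    Posteriors are floored so log never sees an exact zero."""
    p = np.maximum(np.asarray(posteriors), floor)
    return np.log(p) - np.log(np.asarray(priors))
```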

23
Hybrid models: introduction
  • Advantages of hybrid models include:
  • Can easily train the classifier discriminatively
  • Once trained, MLPs compute P(class|O)
    relatively fast
  • MLPs can use a long window of acoustic input
    frames
  • MLPs don't require the input features to have
    diagonal covariance (e.g. can use filterbank
    outputs from computational auditory scene
    analysis front-ends)

24
Hybrid models: standard method
  • Standard phone-based hybrid
  • Train an MLP to classify phonemes, frame by frame
  • Decode the MLP output using simple HMMs for
    smoothing (transition probabilities are easily
    derived from phone duration statistics: don't
    even need to train them)
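One common way to derive those transition probabilities, assuming each phone is a single HMM state with a geometric duration distribution: a mean stay of d frames is 1/(1-p), so the self-loop probability is p = 1 - 1/d. A sketch under that assumption (the slides do not specify the exact duration model used):

```python
def self_loop_prob(mean_duration_frames):
    """Self-loop probability for a one-state phone HMM whose
    geometric duration model should match the given mean duration:
    mean = 1/(1 - p), hence p = 1 - 1/d."""
    return 1.0 - 1.0 / mean_duration_frames
```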

Hybrid models: our method
  • Feature-based hybrid
  • Use ANNs to classify articulatory features
    instead of phones
  • 8 MLPs, classifying pl1, dg1, etc., frame-by-frame
  • One of the motivations for using features is that
    it should be easier to build a multi-lingual /
    cross-language system this way

25
Hybrid models: using feature-classifying MLPs
[Diagram: a phoneState variable with learned, non-deterministic
CPTs p(dg1 | phoneState), ..., p(rd | phoneState) to the feature
variables dg1, pl1, ..., rd; below each feature variable is a
dummy variable at which the MLPs provide virtual evidence]
26
Hybrid models: training the MLPs
  • We use MLPs to classify speech into AFs,
    frame-by-frame
  • Must obtain targets for training
  • These are derived from phone labels
  • obtained by forced alignment using the SRI
    recogniser
  • this is less than ideal, but embedded training
    might help (results later)
  • MLPs were trained by Joe Frankel (Edinburgh/ICSI)
    and Mathew Magimai (ICSI)
  • Standard feedforward MLPs
  • Trained using Quicknet
  • Input to the nets is a 9-frame window of PLPs
    (with VTLN and per-speaker mean and variance
    normalisation)
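The 9-frame input window can be built by stacking each frame with its four neighbours on either side. A minimal numpy sketch, assuming edge frames are replicated at utterance boundaries (a common choice; the slides do not say how boundaries were handled):

```python
import numpy as np

def stack_context(feats, context=4):
    """Stack each frame with its +/-`context` neighbours (9 frames
    for context=4), replicating edge frames, to form the MLP input
    window.  Input (n, d) -> output (n, (2*context+1)*d)."""
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    n = len(feats)
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])
```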

27
Hybrid models: training the MLPs
  • Two versions of MLPs were initially trained
  • Fisher
  • Trained on all of Fisher but not on any data from
    Switchboard 1
  • SVitchboard
  • Trained only on the training set of SVB
  • The Fisher nets performed better, so were used in
    all hybrid experiments

28
Hybrid models: MLP details
  • MLP architecture is
  • input units x hidden units x
    output units

29
Hybrid models: MLP overall accuracies
  • Frame-level accuracies
  • MLPs trained on Fisher
  • Accuracy computed with respect to SVB test set
  • Silence frames excluded from this calculation
  • More detailed analysis coming up later

30
Hybrid models: experiments
  • Using MLPs trained on Fisher using original
    phone-derived targets
  • vs.
  • Using MLPs retrained on SVB data, which has been
    aligned using one of our models
  • Hybrid model
  • vs
  • Hybrid model plus PLP observation

31
Hybrid models: experiments with the basic model
  • Basic model is trained on activations from the
    original MLPs (Fisher-trained)
  • The only parameters in this DBN are the
    conditional probability tables (CPTs) describing
    how each feature depends on the phone state
  • Embedded training:
  • Use the model to realign the SVB data (500 word
    task)
  • Starting from the Fisher-trained nets, retrain on
    these new targets
  • Retrain the DBN on the new net activations

[Diagram: phoneState variable with arcs to the feature
variables dg1, pl1, ..., rd]
32
Hybrid models: 500 word results
33
Hybrid models: adding in PLPs
  • To improve accuracy, we combined the "pure"
    hybrid model with a standard monophone model
  • Can/must weight the contribution of the PLPs
  • Used a global weight on each of the 8 virtual
    evidences, and a fixed weight of 1.0 on the PLPs
  • Weight tuning worked best if done both during
    training and decoding
  • Computationally expensive: must train and
    cross-validate many different systems
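The weighting scheme above amounts to a log-linear combination of per-frame scores. A minimal sketch of that combination (the function name is illustrative; the slides describe only the weights, not the implementation):

```python
import numpy as np

def combined_log_score(plp_loglik, ve_log_scores, ve_weight):
    """Combine the PLP Gaussian log likelihood (weight fixed at 1.0)
    with the 8 virtual-evidence log scores, each scaled by one
    shared global weight, per the weighting scheme above."""
    return plp_loglik + ve_weight * np.sum(ve_log_scores)
```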

34
Hybrid models: adding PLPs
[Diagram: as before, a phoneState variable with learned,
non-deterministic CPTs p(dg1 | phoneState), ..., p(rd | phoneState)
to the feature variables dg1, pl1, ..., rd, plus a PLPs observation;
the MLP likelihoods enter at dummy variables (implemented via
virtual evidence in GMTK)]
35
Hybrid models: weighting virtual evidence vs PLPs
36
Hybrid models: experiments, basic model + PLP
  • Basic model is augmented with PLP observations
  • Generated from mixtures of Gaussians, initialised
    from a conventional monophone model
  • A big improvement over the hybrid-only model
  • A small improvement over the PLP-only monophone
    model

[Diagram: phoneState variable with arcs to the feature
variables dg1, pl1, ..., rd and to the PLPs observation]
37
Hybrid experiments: conclusions
  • Hybrid models perform reasonably well, but not
    yet as well as conventional models
  • But they have fewer parameters to be trained
  • So may be a viable approach for small databases
  • Train MLPs on large database (e.g. Fisher)
  • Train hybrid model on small database
  • Cross-language??
  • Embedded training gives good improvements for the
    pure hybrid model
  • Hybrid models augmented with PLPs perform better
    than baseline PLP-only models
  • But improvement is only small
  • The best way to use the MLPs trained on Fisher
    might be to construct tandem observation vectors

38
Using MLPs to transfer knowledge from larger
databases
  • Scenario:
  • We need to build a system for a
    domain/accent/language for which we have only a
    small amount of data
  • We have lots of data from other
    domains/accents/languages
  • Method:
  • Train MLP on large database
  • Use it in either a hybrid or a tandem system in
    the target domain

39
Using MLPs to transfer knowledge from larger
databases
  • Articulatory features
  • It is plausible that MLPs trained as AF
    classifiers could be more accent/language
    independent than phone classifiers
  • Tandem results coming up shortly will show that,
    across very similar domains (Fisher → SVB), AF
    nets perform as well as or better than phone nets

40
Hybrid models vs Tandem observations
  • Standard hybrid
  • Train an MLP to classify phonemes, frame by frame
  • Decode the MLP output using simple HMMs
    (transition probabilities easily derived from
    phone duration statistics: don't even need to
    train them)
  • Standard tandem
  • Instead of using MLP output to directly obtain
    the likelihood, just use it as a feature vector,
    after some transformations (e.g. taking logs) and
    dimensionality reduction
  • Append the resulting features to standard
    features, e.g. PLPs or MFCCs
  • Use this vector as the observation for a standard
    HMM with a mixture-of-Gaussians observation model
  • Currently used in state-of-the-art systems such
    as SRI's
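The tandem pipeline above can be sketched in a few lines of numpy. Assumptions worth flagging: plain PCA for the dimensionality reduction and a 25-dimensional target are illustrative choices, not necessarily the workshop's exact recipe.

```python
import numpy as np

def tandem_features(mlp_posteriors, plps, out_dim=25, floor=1e-10):
    """Standard tandem front end, sketched: take logs of the MLP
    outputs, decorrelate/reduce with PCA, and append the result to
    the standard PLP features for a Gaussian-mixture HMM."""
    logs = np.log(np.maximum(mlp_posteriors, floor))
    centred = logs - logs.mean(axis=0)
    cov = np.cov(centred, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)      # eigenvalues ascending
    proj = vecs[:, ::-1][:, :out_dim]     # keep top components
    return np.hstack([plps, centred @ proj])
```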
  • but first a look at
    structural modifications . . .