1
Bootstrap TDNN for Classification of Voiced Stop
Consonants (B,D,G)
Jun Hou, Lawrence Rabiner and Sorin Dusan
  • CAIP, Rutgers University
  • Oct.13, 2006

2
Outline
  • Review TDNN basics
  • Bootstrap TDNN using categories
  • Model lattice
  • Experiments
  • Discussion and future work

3
ASAT Paradigm
(Diagram: the five-stage ASAT paradigm; stage 5 is Overall System Prototypes and Common Platform.)
4
Previous Research
  • Frame-based method
  • used an MLP to detect the 14 Sound Pattern of
    English features for the 61-phoneme TIMIT
    alphabet using a single frame of MFCCs
  • Major problem
  • frames capture only static properties of speech
  • dynamic information is needed when detecting
    dynamic features and sounds
  • segment-based methods are needed rather than
    frame-based methods

5
Phoneme Hierarchy
  • Bottom-up approach

(Diagram: bottom-up phoneme hierarchy. Step 1 splits phonemes into
vowels (front/mid/back crossed with high/mid/low: IY IH EH AE ER AX
AA AO UW UH OW), diphthongs (AW AY EY OY), semivowels (W L R Y), and
consonants. Steps 2-4 split consonants into whisper (H), nasals
(M N NG), affricates (J CH), stops, and fricatives; stops split into
voiced (B D G) and unvoiced (P T K), fricatives into voiced
(V DH Z ZH) and unvoiced (F TH S SH).)

39 phonemes: 11 vowels, 4 diphthongs, 4 semivowels, 20 consonants
  • Classify voiced stop consonants /B/, /D/ and /G/
    using segment based methods (this is the hardest
    classification problem among the consonants)

6
Voiced Stop Consonants Classification
  • B, D, and G tokens in the TIMIT training and test
    sets, without the SA sentences
  • Tokens have the form (preceding context) + C + V
  • the preceding phoneme can be any sound
  • C - B, D, or G
  • V - vowel or diphthong
  • 10 msec windows, 150 msec segments (15 frames)
  • The beginning of the vowel is at the 10th frame

(Diagram: the 15-frame segment layout; any preceding phoneme(s) and
the burst occupy the 1st through 9th frames, the vowel begins at the
10th frame, and the segment ends at the 15th frame.)
  • Distribution (in tokens and percentage)

            Training set                 Test set
          B     D     G   Total      B     D     G   Total
tokens   1567  1460   658  3685     638   537   243  1418
%        42.5  39.6  17.9   100    45.0  37.8  17.1   100
7
Voiced Stop Consonants Classification
  • Example /b/ (speech wave, 10 log(energy),
    spectrogram)

  • Three tokens with long, short, and medium stop gaps:

"that big goose" - stop gap 2260 samples (141 msec, long):
  20580-20850 dh | 20850-22920 ae | 22920-25180 tcl | 25180-25460 b |
  25460-27320 ih | 27320-29270 gcl | 29270-29850 g | 29850-32418 ux |
  32418-34506 s

"and become" - stop gap 396 samples (24.74 msec, short):
  48215-48855 ix | 48855-49684 n | 49684-50080 bcl | 50080-50321 b |
  50321-51400 iy | 51400-52360 kcl | 52360-53463 k | 53463-54600 ah |
  54600-55560 m

"judged by" - stop gap 1100 samples (68.75 msec, medium):
  27770-28970 jh | 28970-31960 ah | 31960-32619 dcl | 32619-33810 jh |
  33810-34280 dcl | 34280-34840 d | 34840-35940 bcl | 35940-36180 b |
  36180-37800 ay

8
Voiced Stop Consonants Classification
  • Example /d/ (speech wave, 10 log(energy),
    spectrogram)

"scampered across" - stop gap 398 samples (24.88 msec, short):
  16560-17100 pcl | 17100-17530 p | 17530-18162 axr | 18162-18560 dcl |
  18560-18800 d | 18800-19360 ix | 19360-20150 kcl | 20150-21080 k |
  21080-21640 r

"A doctor" - stop gap 800 samples (50 msec, medium):
  2360-3720 ey | 3720-4520 dcl | 4520-4760 d | 4760-7060 aa |
  7060-8920 kcl | 8920-9480 t | 9480-10443 axr

"Does" - stop gap (silence) 1960 samples (122.5 msec, long):
  0-1960 h# | 1960-2440 d | 2440-3800 ah | 3800-5413 z
9
Voiced Stop Consonants Classification
  • Example /g/ (speech wave, 10 log(energy),
    spectrogram)

", give or take" - stop gap 3022 samples (188.88 msec, long):
  24800-27822 pau | 27822-28280 g | 28280-29080 ih | 29080-29960 v |
  29960-30840 axr | 30840-31480 tcl | 31480-32360 t | 32360-34200 ey |
  34200-34966 kcl | 34966-35400 k

"May I get" - stop gap 430 samples (26.88 msec, short):
  0-2080 h# | 2080-2720 m | 2720-4344 ey | 4344-6080 ay |
  6080-6510 gcl | 6510-6930 g | 6930-8141 ih | 8141-9170 tcl |
  9170-9492 t

"a good mechanic" - stop gap 960 samples (60 msec, medium):
  30340-36313 pau | 36313-36762 q | 36762-37720 ah | 37720-38680 gcl |
  38680-39101 g | 39101-40120 uh | 40120-41000 dcl | 41000-41600 m |
  41600-42200 ix
10
Time-Delay Neural Network
  • Developed by A. Waibel et al.
  • Effective in classifying dynamic sounds, like
    voiced stop consonants
  • Introduces delays into the input of each layer of
    a regular MLP
  • The inputs of a unit are multiplied by the
    un-delayed weights and the delayed weights, then
    summed and passed through a nonlinear function
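The delayed-weight computation described above can be sketched as a
single TDNN layer in NumPy. The sigmoid nonlinearity and random
weights are illustrative assumptions; the deck does not specify them.

```python
import numpy as np

def tdnn_layer(x, w_delays, b, delay):
    """One TDNN layer: each output frame combines `delay`+1
    consecutive input frames.  The inputs are multiplied by the
    un-delayed (d=0) and delayed (d=1..delay) weight matrices,
    summed with a bias, and passed through a sigmoid.

    x        : (n_frames, n_in) input feature frames
    w_delays : (delay+1, n_in, n_out) one weight matrix per delay tap
    b        : (n_out,) bias
    returns  : (n_frames - delay, n_out) output frames
    """
    n_frames = x.shape[0]
    out = []
    for t in range(n_frames - delay):
        # sum contributions from the un-delayed and delayed weights
        s = b + sum(x[t + d] @ w_delays[d] for d in range(delay + 1))
        out.append(1.0 / (1.0 + np.exp(-s)))  # sigmoid nonlinearity
    return np.array(out)

# Shapes from this deck: 15 frames of 13 MFCCs into a first hidden
# layer of 8 nodes with delay D1 = 2 -> 13 output frames of 8 units.
rng = np.random.default_rng(0)
x = rng.standard_normal((15, 13))
w = 0.1 * rng.standard_normal((3, 13, 8))
h = tdnn_layer(x, w, np.zeros(8), delay=2)
print(h.shape)  # (13, 8)
```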

11
Problems in TDNN Training
  • Slow convergence
  • during error back propagation, the weight
    increments are smeared when averaging on the sets
    of time-delayed weights
  • Requires staged batch training
  • initially trained on a small number of tokens
  • after convergence, gradually add more tokens to
    the training set

(Diagram: staged batch training moves from a small, balanced,
hand-selected token set toward the larger unbalanced set;
bootstrapping begins from the hand-selected stage.)
12
Training Solution
  • A well designed bootstrap training method

13
Bootstrap Training Introduction
  • A bootstrap sample utilizes tokens from the
    original dataset, by sampling with replacement
  • The statistics are calculated on each bootstrap
    sample
  • Estimate standard error, etc. on the bootstrap
    replications

(Diagram: from the training dataset X = (x1, x2, ..., xn), draw
bootstrap samples X*1, X*2, ..., X*B by sampling with replacement,
and compute the bootstrap replications S(X*1), S(X*2), ..., S(X*B).)
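The resampling idea can be sketched in a few lines of Python; the
statistic (the sample mean) and the replication count are
illustrative choices, not values from the deck.

```python
import numpy as np

def bootstrap_se(data, stat, n_reps=200, seed=0):
    """Estimate the standard error of `stat` by drawing n_reps
    bootstrap samples (sampling with replacement) and taking the
    standard deviation of the replications S(X*1), ..., S(X*B)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    reps = [stat(data[rng.integers(0, n, size=n)]) for _ in range(n_reps)]
    return np.std(reps, ddof=1)

data = np.random.default_rng(1).standard_normal(100)
se = bootstrap_se(data, np.mean)
# For the sample mean the bootstrap SE should be close to the
# textbook value std/sqrt(n)
print(se, data.std(ddof=1) / np.sqrt(len(data)))
```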
14
Bootstrap TDNN
  • Because of the slow convergence of a TDNN, it is
    difficult (and time consuming) to repeat the
    training of a TDNN many times (more than 200
    times for normal bootstrap training experiments)
  • Instead of resampling individual tokens, we build
    a bootstrap sample by resampling clusters of
    tokens
  • Need to find a way to partition the input space
    into a small number of clusters which we call
    categories

15
Bootstrap TDNN Concept
  • Begin with a good starting point
  • use a small set of hand selected tokens to train
    the initial TDNN
  • Partition the training set into subsets
  • use the initial TDNN to partition the training
    set into several subsets which we call
    Categories, and a subsequent TDNN is trained on
    each category
  • Expand each category
  • iteratively use the TDNN to partition the
    remaining (unutilized) training data into
    categories, merge the tokens in each category with
    the previous training data, and train a new TDNN
    based on the merged data
  • Merge final categories
  • merge the tokens in the categories (in a
    sequenced manner) and train the final TDNNs based
    on the union of the categories
  • Use an n-best list to combine the TDNN scores to
    give the final segment classification

16
Bootstrap TDNN Notation
  • Double thresholds: a high score threshold (fmax)
    and a low score threshold (fmin)
  • Good score and bad score: if, and only if, one
    phoneme has a score above fmax and the other two
    phonemes have scores below fmin, the
    classification score is considered a good
    score. All other cases are treated as bad scores.
  • Segmentation rule
  • Category 1: good score and correct classification
  • Category 2: good score and incorrect classification
  • Category 3: bad score and correct classification
  • Category 4: bad score and incorrect classification
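The double-threshold rule can be sketched as follows; the fmax and
fmin values here are placeholders, since the deck does not state the
actual thresholds.

```python
def categorize(scores, true_label, fmax=0.7, fmin=0.3):
    """Four-way double-threshold rule.  `scores` maps each phoneme
    ('B', 'D', 'G') to its TDNN output score; fmax/fmin are
    illustrative placeholder thresholds."""
    top = max(scores, key=scores.get)
    others = [p for p in scores if p != top]
    # good score: one phoneme above fmax AND the other two below fmin
    good = scores[top] > fmax and all(scores[p] < fmin for p in others)
    correct = (top == true_label)
    if good:
        return 1 if correct else 2
    return 3 if correct else 4

print(categorize({'B': 0.9, 'D': 0.1, 'G': 0.05}, 'B'))  # -> 1
print(categorize({'B': 0.6, 'D': 0.5, 'G': 0.1}, 'D'))   # -> 4
```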

17
Bootstrap TDNN Procedure
  • (1) Use the balanced training set of 99
    hand-selected tokens (i.e., 33 tokens each of
    /B/, /D/, and /G/) as the initial training set,
    and train a single TDNN
  • (2) The current set of TDNNs (one TDNN initially,
    four TDNNs at later stages of training) is used
    to score and segment the complete set of training
    tokens into 4 Categories (based on the double
    threshold procedure)
  • (3) Add selected and balanced (equal number of
    /B/, /D/ and /G/ tokens) training tokens from the
    above four categories to the old training data,
    and train a new set of four updated TDNNs.
  • (4) Iterate steps (2) and (3) until a stopping
    criterion is met
  • stopping criteria: there are no more new tokens
    to be added, or the desired TDNN performance is met
  • (5) Merge the tokens from the four training
    categories in a sequenced manner to obtain a new
    TDNN. Use a beam search to select the best
    sequence for merging the data from the four
    categories
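Steps (1)-(4) above can be sketched as a skeleton loop. Every helper
here (train, categorize_all, select_balanced, performance_ok) and the
toy stand-ins below are assumptions for illustration, not the deck's
actual implementations.

```python
def bootstrap_tdnn(all_tokens, hand_selected, train, categorize_all,
                   select_balanced, performance_ok):
    """Skeleton of steps (1)-(4); all helpers are caller-supplied
    stand-ins, not implementations from the presentation."""
    training_data = list(hand_selected)        # (1) balanced initial set
    tdnns = [train(training_data)]             #     one initial TDNN
    while True:
        # (2) score/segment the full training set into 4 categories
        categories = categorize_all(tdnns, all_tokens)
        # (3) pick balanced, not-yet-used tokens from the categories
        new_tokens = select_balanced(categories, exclude=training_data)
        if not new_tokens or performance_ok(tdnns):
            break                              # (4) stopping criteria
        training_data = training_data + new_tokens
        tdnns = [train(training_data + categories[c]) for c in range(4)]
    return tdnns

# Toy stand-ins: a "TDNN" is just the set of tokens it was trained
# on, and token t falls into category t % 4.
toy_all = list(range(20))
train = set
categorize_all = lambda tdnns, toks: {c: [t for t in toks if t % 4 == c]
                                      for c in range(4)}
def select_balanced(cats, exclude):
    used = set(exclude)
    return [c[0] for c in cats.values() if c and c[0] not in used]
performance_ok = lambda tdnns: False

models = bootstrap_tdnn(toy_all, [0, 1, 2], train, categorize_all,
                        select_balanced, performance_ok)
print(len(models))  # 4 category TDNNs after the iteration stops
```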

18
Bootstrap TDNN Illustration
  • Use the 99 hand selected tokens to train a TDNN,
    and partition the initial input space into 4
    categories

(Diagram: the initial TDNN, trained on the 99 hand-selected tokens,
partitions the input space into categories I(1), II(1), III(1), and
IV(1); the slide shows per-category token counts 954, 525, 309, 688
and 1081, 863, 429, 612, where a circle denotes the balanced tokens
in a category.)
19
Bootstrap TDNN Illustration
  • Merge 99 tokens and the balanced tokens in one
    category, and train a TDNN
  • Use the TDNN to partition the remaining space
    into 4 categories

(Diagram: the 99 tokens merged with the balanced tokens of category
I(1) train a new TDNN, which partitions the remaining space into
categories I(2), II(2), III(2), and IV(2).)

  • Iterate until the TDNN performance is met, or
    there are no more new and balanced training
    tokens to be added to the previous training data

(Diagram: after n iterations the TDNN, trained on the 99 tokens plus
I(1) through I(n), leaves residual categories II(n-1), III(n-1), and
IV(n-1).)
20
Merge Categories: Model Lattice
  • Use a forward lattice to merge the four different
    category TDNNs
  • after the initial categories are established, we
    create a bootstrap sample consisting of some or
    all of the categories, selected in a sequenced
    manner
  • Partial lattice: select category samples without
    replacement
  • Full lattice: select category samples with
    replacement

21
Partial Lattice
  • Starting point: 4 TDNNs, one trained on each of
    the 4 categories
  • At each step
  • select a category without replacement of the
    previous categories
  • merge the data in this category and the data in
    previous step
  • build a TDNN on the union of the data
  • Iterate until all the categories are merged
    together or the TDNN performance is met
  • Use a beam search to select the best path(s) for
    merging categories to obtain the best set of TDNNs
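The partial-lattice search above can be sketched as a beam search
over merge orders. Here train_and_score is an assumed stand-in for
training a TDNN on a token set and measuring its performance; the
toy "score" below just counts distinct tokens.

```python
from itertools import chain

def partial_lattice(categories, train_and_score, beam_width=1):
    """Beam search over orders of merging category datasets, without
    replacement.  `categories` maps a name to its token list;
    `train_and_score(tokens)` is a stand-in that would train a TDNN
    on `tokens` and return its performance."""
    # start: one single-category path per category
    beam = [((c,), train_and_score(categories[c])) for c in categories]
    beam.sort(key=lambda p: -p[1])
    beam = beam[:beam_width]
    for _ in range(len(categories) - 1):
        expanded = []
        for path, _ in beam:
            for c in categories:
                if c in path:
                    continue  # without replacement
                tokens = list(chain.from_iterable(
                    categories[x] for x in path + (c,)))
                expanded.append((path + (c,), train_and_score(tokens)))
        expanded.sort(key=lambda p: -p[1])
        beam = expanded[:beam_width]  # keep only the best path(s)
    return beam[0]

# Toy data: the "score" is the number of distinct tokens covered
cats = {'I': [1, 2], 'II': [2, 3], 'III': [4], 'IV': [5, 6]}
best_path, best_score = partial_lattice(cats, lambda t: len(set(t)))
print(best_path, best_score)
```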

22
Partial Lattice
  • Cross-node beam search: compare TDNNs that are
    trained on the same categories but in different
    sequences

(Diagram: Net 1 is trained on Cat.(I) and Net 2 on Cat.(II); their
union Cat.(I) U Cat.(II) can be reached as Net 12 (I then II) or
Net 21 (II then I), which are compared at node Net(1,2).)

If beam width = 1, select the better net between Net 12 and Net 21
  • Beam search comparison criterion
  • performance on the complete training set or
  • weighted sum of performances on the 99 hand
    selected tokens, the training set, and the
    complete training set.

23
Full Lattice
  • Select categories with replacement
  • Regular beam search

24
Experiments
  • Training set and test set
  • TIMIT training set and test set without the SA
    sentences
  • CV form tokens, where C denotes /B/, /D/ or /G/,
    V denotes any vowel or diphthong, and any sound
    may precede the token
  • /B/ - 1567, /D/ - 1460, /G/ - 658; 3685 tokens
    in the training set
  • /B/ - 638, /D/ - 537, /G/ - 243; 1418 tokens in
    the test set
  • 13 MFCCs calculated on a 10 msec window with 5
    msec window overlap
  • average adjacent frames resulting in 10 msec
    frame rate
  • segment length 150 msec (15 frames, with the
    beginning of the succeeding vowel at the 10th
    frame)
  • TDNN
  • inputs: 13 MFCCs x 15 frames = 195 input nodes
  • 1st hidden layer: 8 nodes, delay D1 = 2
  • 2nd hidden layer: 3 nodes, delay D2 = 4
  • output layer: 3 nodes, one each for /B/, /D/, and
    /G/
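The input layout above can be sketched as follows. The exact
frame-averaging and alignment conventions in extract_segment are
assumptions chosen to be consistent with the numbers in this slide.

```python
import numpy as np

def extract_segment(mfcc_5ms, vowel_onset_frame):
    """Build one 150 msec segment: average adjacent 5 msec-hop MFCC
    frames down to a 10 msec frame rate, then cut a 15-frame window
    so the vowel onset lands at the 10th frame (index 9)."""
    n = (len(mfcc_5ms) // 2) * 2
    mfcc_10ms = (mfcc_5ms[0:n:2] + mfcc_5ms[1:n:2]) / 2.0
    start = vowel_onset_frame - 9      # 9 frames precede the vowel
    seg = mfcc_10ms[start:start + 15]  # 15 frames x 13 MFCCs
    return seg.reshape(-1)             # flatten to 195 input nodes

frames = np.random.default_rng(0).standard_normal((60, 13))
x = extract_segment(frames, vowel_onset_frame=20)
print(x.shape)  # (195,)
```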

25
Baseline Results
  • A single TDNN trained using staged batch training
    of 3, 6, 9, 24, 99, 249, 780, 3685 tokens.
  • Train the TDNN on the 99 tokens and test on the
    same 99 tokens

B D G
B 32 0 1
D 0 32 1
G 0 2 31
  • After staged batch training
  • performance on the complete training set: 91.8%
  • performance on the test set: 82.0%
  • After bootstrap training with the lattice
    decision, only 68% of the training set is needed
    to achieve comparable results

26
Results on Building Categories
             99 hand-selected tokens     New training set
  Cat.         B     D     G               B     D     G
  (I)    B    32     0     1         B   135     0     1
         D     0    32     1         D     0   135     1
         G     0     1    32         G     0     1   135
  (II)   B    32     1     0         B   200     7     1
         D     2    31     0         D     9   199     0
         G    11    21     1         G    92   110     6
  (III)  B    32     0     1         B   173     2     1
         D     0    32     1         D     0   175     1
         G     0     1    32         G     0     2   174
  (IV)   B    33     0     0         B   234     2     1
         D     1    32     0         D     6   231     0
         G     0     1    32         G     2     3   232
  • Starting point: train the TDNN on the 99 tokens
    and test on the same 99 tokens

B D G
B 32 0 1
D 0 32 1
G 0 2 31
  • After one iteration, all categories except
    category (II) showed some degree of improvement

27
Results of the 4 Category TDNNs
  • TDNN performance after stopping criteria are met

28
Results of the 4 Category TDNNs
  • Number of tokens used in each category

29
Results on Building categories
  • At the end of bootstrapping categories,
    performance on the complete training set and the
    test set, in terms of percentage correct

                          Cat.1   Cat.2   Cat.3   Cat.4
Train set (%)              99.4    73.3    99.5    93.6
All training tokens (%)    53.2    79.9    74.0    84.5
Test set (%)               52.0    72.7    71.6    72.5
  • Number of training tokens in each category

Cat.1 Cat.2 Cat.3 Cat.4
tokens 480 1197 1056 1176
  • Category (IV) model achieves the best performance
    on the complete training set and the test set

30
Results on Partial Lattice
  • Using beam width 1, the best 4 sequences are
    shown in the table below. The total number of
    training tokens is 2517, i.e., 68% of the 3685
    tokens in the complete training set are used in
    training all the TDNNs
  • The best result provides a 34% error reduction
    over the best TDNN trained on a single category

Sequence of merging categories               1-3-4-2  1-3-2-4  1-4-2-3  3-2-4-1
Performance on the training set (%)           90.94    91.82    90.90    89.35
Performance on the complete training set (%)  89.7     89.6     89.0     87.8
Performance on the test set (%)               77.5     77.3     77.2     77.1
31
Results on Partial Lattice
  • Use an n-best list method: score all 4 models
    and choose the highest score for each of /B/,
    /D/ and /G/. The maximum of the 3 highest scores
    provides the final classification decision.
  • performance on the complete training set: 93.1%
  • performance on the complete test set: 82.1%
  • 35% error reduction on the complete training set
    over the best model trained using data from all 4
    categories
  • 18% error reduction on the complete training set
    compared with a single TDNN trained using all
    3685 tokens, which achieved 91.8% on the training
    set and 82.0% on the test set

                                    Cat 1          Cat 2         Cat 3          Cat 4
Lattice, complete train       2423 (65.75%)    21 (0.57%)   1014 (27.52%)   227 (6.16%)
Lattice, test                  630 (44.43%)    33 (2.33%)    534 (37.66%)   221 (15.59%)
Staged batch, complete train  2920 (79.24%)   105 (2.85%)    483 (13.11%)   177 (4.80%)
Staged batch, test             977 (68.90%)   123 (8.67%)    184 (12.98%)   134 (9.45%)
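The n-best combination rule described above can be sketched directly;
the four model outputs below are made-up numbers for illustration.

```python
def nbest_classify(model_scores):
    """Combine the 4 category TDNNs: for each phoneme take its
    highest score over all models, then pick the phoneme whose
    highest score is the overall maximum."""
    phones = ('B', 'D', 'G')
    best = {p: max(m[p] for m in model_scores) for p in phones}
    return max(best, key=best.get)

# Four hypothetical category-TDNN outputs for one token
models = [
    {'B': 0.2, 'D': 0.7, 'G': 0.1},
    {'B': 0.4, 'D': 0.3, 'G': 0.2},
    {'B': 0.1, 'D': 0.6, 'G': 0.8},
    {'B': 0.3, 'D': 0.5, 'G': 0.4},
]
print(nbest_classify(models))  # -> 'G' (0.8 is the overall maximum)
```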
32
Bootstrap - Discussion
  • Bootstrapping is an effective procedure to
    guarantee convergence in robust training of TDNNs
  • The problem with bootstrapping is that the TDNN
    needs to be trained several times, which is quite
    time consuming
  • In order to reduce the total number of training
    cycles, we use a beam search method to prune the
    path for merging different categories of data
  • The results showed that, although trained on a
    relatively small portion of all the training data
    (approximately 68%), the TDNN achieved better
    performance on the complete training set and
    concomitant improvement on the test set

33
A Few Issues: Starting Point
  • 99 hand selected tokens as the initial training
    data
  • or partition the voiced stop syllables into 2
    classes and train a TDNN for each class
  • those with a noticeable stop gap
  • those without a noticeable stop gap
  • threshold: 30 msec, close to the mean minus one
    standard deviation (a one-sigma threshold)
  • or use VCV structure tokens as the initial
    training set

Stop closure duration distribution
                  Voiced stops             Unvoiced stops
                /bcl/  /dcl/  /gcl/      /pcl/  /tcl/  /kcl/
Mean (msec)      62.7   51.0   49.7       69.6   54.6   59.8
Std. dev. (msec) 24.3   25.8   22.3       24.8   30.5   25.9
34
Starting Point
  • Short stop gap and long stop gap tokens
  • manually select 99 tokens (33 tokens each for
    /B/, /D/ and /G/) with a short stop gap and 99
    tokens with a long stop gap
  • train a separate TDNN for each training set
  • the TDNN for the short stop gap training data was
    found to have 100% classification accuracy on the
    99 training tokens, while the TDNN for the long
    stop gap training data achieved 96%
    classification accuracy

35
A Few Issues: Shifting
  • The difficulty in TDNN training lies in the
    shift-invariant nature of the TDNN. For the set
    of voiced stop consonants, the stop regions can
    appear at any 30 msec window during the 150 msec
    segment.
  • The previous vowel affects the BDG articulation
    and provides information useful for
    classification
  • We can make a TDNN converge faster (to a better
    solution) by appropriately shifting frames for
    tokens in categories (II), (III), and (IV).

36
Frame Shifting
  • Train a TDNN on the 99 long stop gap hand-selected
    tokens; test on the same 99 tokens
  • Shift right 4 frames for tokens in categories
    (II), (III) and (IV)
  • Train the TDNN again, and test on the 99 tokens

        Before shifting (tokens)    After shifting (tokens)
          B    D    G                 B    D    G
     B   32    1    0                31    1    1
     D    1   30    2                 0   33    0
     G    0    0   33                 0    0   33

Number of tokens in each category
Category                  (i)   (ii)  (iii)  (iv)
Before shift (tokens)      87     0     8      4
After shift (tokens)       94     0     3      2
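Shifting a token right by 4 frames can be sketched as below. The
choice of padding the vacated leading frames by repeating the
original first frame is an assumption the deck does not spell out.

```python
import numpy as np

def shift_right(segment, n_shift=4, n_frames=15, n_coef=13):
    """Shift a flattened (15 x 13) token right by n_shift frames:
    the last n_shift frames drop off, and the vacated leading frames
    are filled by repeating the original first frame (an assumed
    padding choice)."""
    seg = segment.reshape(n_frames, n_coef)
    shifted = np.vstack([np.repeat(seg[:1], n_shift, axis=0),
                         seg[:-n_shift]])
    return shifted.reshape(-1)

x = np.arange(15 * 13, dtype=float)
y = shift_right(x)
print(y.shape)  # (195,)
```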
37
Discussion and Future Work
  • Examine bootstrapping more closely, to reduce the
    total number of bootstrap iterations, and to
    improve model accuracy
  • Segment length affects classification accuracy:
    150 msec can contain more than one short stop
    consonant, so use DTW to map the input frames to a
    fixed number of frames and then use the aligned
    tokens to train a TDNN
  • Investigate TDNN for classification of other
    phoneme classes, e.g., voiced fricatives,
    diphthongs, etc.
  • Use frame-based methods for classification of
    static sounds (e.g., vowels, unvoiced fricatives);
    use segment-based methods to recognize dynamic
    sounds (e.g., voiced stop consonants, diphthongs)
  • Develop a bottom-up approach to build small but
    accurate classifiers first, then gradually
    classify broader classes in the phoneme hierarchy

38
  • Thank you!