Title: The development of the HTK Broadcast News transcription system: An overview
1 The development of the HTK Broadcast News transcription system: An overview
- Paper by P. C. Woodland.
- Appeared in Speech Communication 37 (2002), 47-67.
- Mathias.Creutz_at_hut.fi
- T-61.184 Audio Mining, 17 October 2002.
2 Motivation
- Transcription of broadcast radio and television news is challenging: different speech styles
  - read, spontaneous, conversational speech
  - native and non-native speakers
  - high- and low-bandwidth channels
  - ... with or without background music or other background noise
- → Solving these problems is of great utility in more general tasks.
3 Procedure
- Based on the HTK large vocabulary speech recognition system
  - by Woodland, Leggetter, Odell, Valtchev, Young, Gales, Pye; Cambridge University, Entropic Ltd., 1994-1996.
- Developed and evaluated in the NIST/DARPA Broadcast News and TREC SDR evaluations
  - 1996 (DARPA BN)
  - 1997 (DARPA BN)
  - 1998 (DARPA BN)
  - 1998, 10 x real time (TREC 7 spoken document retrieval)
  - 1999, 10 x real time (TREC 8 spoken document retrieval)
4 Standard HTK LVSR system (1)
- Acoustic feature extraction
  - Initially designed for clean speech tasks
  - Standard Mel-frequency cepstral coefficients (MFCC)
  - 39-dimensional feature vector
  - Cepstral mean normalization on an utterance-by-utterance basis
- Pronunciation dictionary
  - Based on the LIMSI 1993 WSJ pronunciation dictionary
  - 46 phones
  - Vocabulary of 65k words
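Cepstral mean normalization is simple enough to sketch directly (a minimal NumPy sketch; the array shapes and the toy "channel offset" are illustrative, only the 39-dimensional vector size comes from the slide):

```python
import numpy as np

def cepstral_mean_normalize(features):
    """Subtract the per-utterance mean from each cepstral dimension.

    features: (num_frames, num_coeffs) array of cepstral coefficients.
    A stationary convolutional channel adds a roughly constant offset
    in the cepstral domain, so removing the utterance mean removes it.
    """
    return features - features.mean(axis=0, keepdims=True)

# Toy utterance: 100 frames of 39-dimensional features plus a channel offset.
rng = np.random.default_rng(0)
utterance = rng.normal(size=(100, 39)) + 5.0   # constant offset = "channel"
normalized = cepstral_mean_normalize(utterance)
print(np.allclose(normalized.mean(axis=0), 0.0))  # True: per-dim mean removed
```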
5 Standard HTK LVSR system (2)
- Acoustic modelling
  - Hidden Markov models (HMMs)
  - State output distributions are Gaussian mixture models.
  - Embedded Baum-Welch re-estimation
  - Forced Viterbi alignment chooses between pronunciation variants in the dictionary, e.g., the: /ðə/ or /ði/.
- Transcription of "you speak":
  - Monophones: sil j u sp s p i k sil
  - Triphones: sil j+u j-u+s sp u-s+p s-p+i p-i+k i-k sil
  - Quinphones: sil j+u+s j-u+s+p sp j-u-s+p+i u-s-p+i+k s-p-i+k p-i-k sil
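The monophone-to-triphone/quinphone expansion above can be sketched as code (HTK-style l-p+r labels; treating sil and sp as context-transparent matches the example but is an assumption of this sketch):

```python
def expand_context(phones, n=1):
    """Expand a monophone sequence into context-dependent labels.

    n=1 gives triphones (l-p+r), n=2 gives quinphones (ll-l-p+r+rr).
    sil and sp are emitted unchanged and skipped as context phones.
    """
    skip = {"sil", "sp"}
    real = [i for i, p in enumerate(phones) if p not in skip]
    pos = {i: k for k, i in enumerate(real)}
    labels = []
    for i, p in enumerate(phones):
        if p in skip:
            labels.append(p)
            continue
        k = pos[i]
        left = [phones[j] for j in real[max(0, k - n):k]]
        right = [phones[j] for j in real[k + 1:k + 1 + n]]
        labels.append("-".join(left + [p]) + "".join("+" + r for r in right))
    return labels

phones = "sil j u sp s p i k sil".split()
print(" ".join(expand_context(phones, n=1)))
# sil j+u j-u+s sp u-s+p s-p+i p-i+k i-k sil
```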
6 Standard HTK LVSR system (3)
- Language modelling (LM)
  - N-gram models using Katz backoff
  - Class-based models based on automatically derived word classes
  - Dynamic language models based on a cache model
- Decoding
  - Time-synchronous decoders
  - Single pass, or generation and rescoring of word lattices
  - Early stages with triphone models and bigram or trigram LMs
  - Later stages with adapted quinphones and more advanced LMs
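A minimal sketch of a backoff bigram in the spirit of Katz backoff; note it substitutes absolute discounting for Katz's Good-Turing discounts to stay short, so the discount value and corpus are illustrative only:

```python
from collections import Counter

def train_backoff_bigram(tokens, discount=0.5):
    """Train a toy backoff bigram model.

    Backoff structure as in Katz: use the discounted bigram estimate
    when the bigram was seen in training, otherwise redistribute the
    freed probability mass over unseen followers in proportion to
    their unigram counts. Assumes the history word was followed by
    at least one word in training.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(word, prev):
        if (prev, word) in bigrams:
            return (bigrams[(prev, word)] - discount) / unigrams[prev]
        followers = {w for (p, w) in bigrams if p == prev}
        freed = discount * len(followers) / unigrams[prev]  # discounted mass
        unseen = sum(unigrams[w] for w in unigrams if w not in followers)
        return freed * unigrams[word] / unseen              # backoff to unigram
    return prob

# Toy corpus; the conditional distribution still sums to one.
tokens = "the cat sat on the mat the cat ran".split()
prob = train_backoff_bigram(tokens)
```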
7 Standard HTK LVSR system (4)
- Adaptation (new speaker or acoustic environment)
- MLLR (Maximum Likelihood Linear Regression)
  - adjust Gaussian means (and optionally variances) in the HMMs in order to increase the likelihood of the adaptation data
  - μ_adapted = A μ_original + b
  - use a single, global transform for all Gaussians, or separate transforms for different clusters of Gaussians
  - can be used in combination with multi-pass decoding:
    - decode
    - adapt
    - decode/rescore lattice with adapted models
    - adapt again, etc.
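The mean transform μ_adapted = A μ_original + b can be sketched end-to-end. Under the simplifying assumptions that covariances are identity and frame-to-Gaussian alignments are already fixed (a simplification of the full MLLR solution, which weights by the actual covariances), the maximum-likelihood estimate of W = [b A] reduces to ordinary least squares:

```python
import numpy as np

def estimate_mllr_mean_transform(means, frames):
    """Estimate a single global MLLR mean transform W = [b A].

    means:  (N, d) original Gaussian means, one per adaptation frame
            (frames are assumed already aligned to Gaussians).
    frames: (N, d) adaptation-data feature frames.
    With identity covariances, maximizing the adaptation-data
    likelihood reduces to least squares on extended means [1, mu].
    """
    ext = np.hstack([np.ones((means.shape[0], 1)), means])  # (N, d+1)
    W, *_ = np.linalg.lstsq(ext, frames, rcond=None)        # (d+1, d)
    return W.T                                              # (d, d+1) = [b A]

def adapt_means(W, means):
    """Apply mu_adapted = A @ mu + b, with W = [b A]."""
    ext = np.hstack([np.ones((means.shape[0], 1)), means])
    return ext @ W.T

# Toy check: frames generated by a known affine shift are recovered.
rng = np.random.default_rng(1)
mu = rng.normal(size=(200, 3))
A_true, b_true = 1.1 * np.eye(3), np.array([0.5, -0.2, 0.0])
frames = mu @ A_true.T + b_true
W = estimate_mllr_mean_transform(mu, frames)
print(np.allclose(adapt_means(W, mu), frames))  # True
```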
8 F-conditions
- Data supplied by the Linguistic Data Consortium (LDC)
- Part of the training data and all test data were hand-labelled according to the focus or F-conditions:
  F0 Baseline broadcast speech (clean, planned)
  F1 Spontaneous broadcast speech (clean)
  F2 Low-fidelity speech (mainly narrowband)
  F3 Speech in the presence of background music
  F4 Speech under degraded acoustic conditions
  F5 Non-native speakers (clean, planned)
  FX All other speech (e.g., spontaneous non-native)
9 Broadcast news data
                       1996                1997                  1998
Training data          35 hours            37 hours              71 hours
Development test set   -                   2 shows               4 shows
Test set               2 hours (4 shows)   3 hours (9 shows)     3 hours
Text for LM            132 million words   70(?) million words   60-70 million words
                                           + transcr. of acoustic data
- Underlined text: pre-partitioned at the speaker turn
- Green background: hand-labelled according to F-conditions
10 Initial experiments using 1995 BN data
- Speech recognition: existing HTK SR system
  - Wall Street Journal (WSJ) models, triphones, trigram LM
- Data: broadcast news data from the radio show Marketplace, marked according to
  - presence/absence of background music
  - full/reduced bandwidth
- Goals
  - Compare standard MFCC (Mel-Frequency Cepstral Coeffs) to MF-PLP (Mel-Frequency Perceptual Linear Prediction)
  - Try out unsupervised test-data MLLR adaptation
- Results
  - 12% word error rate reduction with MF-PLP
  - a further 26% using two-iteration MLLR adaptation
- Good. Let's use these techniques!
11 The 1996 BN evaluation (1)
- Acoustic environment adaptation (training data)
  - Adapt WSJ triphone and quinphone models to each of the focus conditions using mean and variance MLLR → data-type-specific model sets.
  - Automatically classify F2 (low-fidelity speech) as narrowband or wideband → adapt separate sets.
  - A couple of tricks for F5 (non-native) and FX (other speech), due to small amounts of data.
- (Mainly) speaker adaptation (test data)
  - Unknown speaker identities.
  - Cluster (bottom-up) similar segments until sufficient data is available in each group → robust unsupervised adaptation.
  - Each segment is represented by its mean and variance.
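The bottom-up clustering can be sketched as follows. The Euclidean distance between segment means and the min_frames threshold are simplifications of this sketch (the paper's segments carry means and variances, so a Gaussian-based distance would be used in practice):

```python
import numpy as np

def cluster_segments(segments, min_frames=300):
    """Bottom-up clustering of speech segments for MLLR adaptation.

    segments: list of (mean_vector, num_frames) pairs, each segment
    summarized by its feature mean. The closest pair of clusters is
    merged repeatedly until every cluster holds at least min_frames
    of adaptation data.
    """
    clusters = [(np.asarray(m, float), n) for m, n in segments]
    while len(clusters) > 1 and min(n for _, n in clusters) < min_frames:
        # Find the closest pair of cluster means.
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(clusters[i][0] - clusters[j][0])
                if best is None or d < best:
                    best, pair = d, (i, j)
        i, j = pair
        (mi, ni), (mj, nj) = clusters[i], clusters[j]
        merged = ((ni * mi + nj * mj) / (ni + nj), ni + nj)  # count-weighted mean
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Two nearby short segments merge; the long distant one stays alone.
segments = [([0.0, 0.0], 100), ([0.1, 0.0], 120), ([5.0, 5.0], 400)]
clusters = cluster_segments(segments, min_frames=200)
print([n for _, n in clusters])  # [400, 220]
```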
12 The 1996 BN evaluation (2)
- Language modelling
  - Static: bigram, trigram, 4-gram word-based LMs (Katz backoff)
  - Dynamic: unigram and bigram cache model
    - Woodland et al., 1996
    - Based on last recognition output
    - Operates on a per-show, per-focus-condition basis
    - Includes future and previous words and other word forms with the same stem
    - Excludes common words
    - Interpolated with the static 4-gram language model
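A minimal sketch of the cache interpolation (a unigram cache for brevity; the cache weight and the toy numbers are illustrative, not the paper's):

```python
from collections import Counter

def cache_interpolated_prob(word, history_words, static_prob, lam=0.1):
    """Interpolate a static LM probability with a unigram cache.

    history_words: words from earlier recognition output for the same
    show/focus condition, as described on the slide.
    lam: cache interpolation weight (illustrative value).
    """
    cache = Counter(history_words)
    p_cache = cache[word] / max(1, sum(cache.values()))
    return (1 - lam) * static_prob + lam * p_cache

# A word frequent in the recent output gets boosted above its static prob.
history = "the htk system the htk".split()
p = cache_interpolated_prob("htk", history, static_prob=0.001)
```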
13 The 1996 BN evaluation (3)
Pass MLLR HMMs LM WER (%) Imprv. (%)
P1 - triph. trigram 33.4 0
P2 1 transf. triph. trigram 31.1 6.9
P3 1 transf. triph. bigram 34.1 -2.1
P3 lat.rs. 1 transf. triph. fourgram 29.4 12.0
P4 lat.rs. 1 transf. quinph. fourgram 27.2 18.6
P5 lat.rs. 2 transf. quinph. fourgram 26.9 19.5
P6 lat.rs. 4 transf. quinph. fourgram 26.7 20.1
cache 4 transf. quinph. cache 26.6 20.4
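The Imprv. column is the relative WER reduction against the P1 baseline; a quick check with values from the table above:

```python
def relative_improvement(baseline_wer, wer):
    """Relative WER reduction (%) against the baseline pass."""
    return 100.0 * (baseline_wer - wer) / baseline_wer

print(round(relative_improvement(33.4, 26.6), 1))  # 20.4, the cache row
print(round(relative_improvement(33.4, 27.2), 1))  # 18.6, the P4 row
```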
14 Towards the 1997 BN evaluation (1)
- Information about data segmentation and type is not supplied.
- Goal: compare performance of condition-dependent and condition-independent models
- Test data: 1996 development test set
- Experiment with different acoustic models
  - Adapt WSJ models to each F-condition (cond.-dep.)
  - Train models on 1996 BN training data (cond.-indep.)
  - Train models on 1997 BN training data (cond.-indep.)
- Results
  - Condition-independent models slightly better than adapted condition-dependent models! (WER 32.0, 31.7, 29.6)
15 Towards the 1997 BN evaluation (2)
- Gender effect?
  - 2/3 male, 1/3 female speakers in BN data
  - 1/2 male, 1/2 female speakers in WSJ models
- Use gender-dependent models
  - gender of speakers in the data is known → assume that perfect gender determination is possible
- Results (1997 BN data)
  - Gender-indep.: All 29.6, Male 28.8, Female 31.1
  - Gender-dep.: All 28.1, Male 27.8, Female 28.8
16 1997 BN: Automatic segmentation and clustering
- Goal: convert the audio stream into clusters of reasonably sized homogeneous speech segments → each cluster shares a set of MLLR transforms.
- The audio stream is first classified into 3 broad categories:
  - wideband speech, narrowband speech, music (→ reject)
- Use a gender-dependent recognizer to locate silence portions and gender change points.
- Cluster segments separately for each gender and bandwidth combination for use in MLLR adaptation.
- Result: only 0.1% absolute higher WER than with manual segments.
17 The 1997 BN evaluation
- Language modelling
  - Bigram, trigram, 4-gram word-based LMs (Katz backoff)
  - Category language model
    - Kneser & Ney 93; Martin et al. 95; Niesler et al. 98
    - 1000 automatically generated word classes based on word bigram statistics in the training set
    - Trigram model
  - Interpolation of word 4-gram and class trigram models
    - weights 0.7 and 0.3
- Hypothesis combination
  - Different types of errors → combine triphone and quinphone results.
  - Use confidence scores and dynamic programming-based string alignment.
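The 0.7/0.3 interpolation can be written out directly. A class model scores a word as P(class | class history) * P(word | class); the probabilities passed in below are illustrative numbers, not values from the paper:

```python
def interpolated_prob(p_word, p_class_hist, p_word_given_class, w_word=0.7):
    """Interpolate a word 4-gram probability with a class trigram one.

    p_word: P(word | word history) from the word 4-gram.
    p_class_hist: P(class(word) | class history) from the class trigram.
    p_word_given_class: class membership probability P(word | class).
    Weights 0.7 / 0.3 follow the slide.
    """
    return w_word * p_word + (1.0 - w_word) * p_class_hist * p_word_given_class

p = interpolated_prob(p_word=0.010, p_class_hist=0.20, p_word_given_class=0.05)
```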
18 The 1997 BN evaluation (2)
Pass MLLR HMMs LM WER (%) Imprv. (%)
P1 - gi triph. trigram 21.4 0
P2 1 transf. gd triph. bigram 21.3 0.5
P2 lat.rs. 1 transf. gd triph. trigram 18.0 15.9
P2 lat.rs. 1 transf. gd triph. fourgram 17.3 19.2
P2 lat.rs. 1 transf. gd triph. inp.w4c3 16.8 21.5
P3 lat.rs. 1 transf. gd quin. inp.w4c3 16.4 23.4
P4 lat.rs. 2 transf. gd quin. inp.w4c3 16.2 24.3
P5 lat.rs. 4 transf. gd quin. inp.w4c3 16.2 24.3
cache 4 transf. gd quin. cache 16.2 24.3
ROVER 1 tr./4 tr. gd tri/qu. inp.w4c3 15.8 26.2
conf. combine
19 The 1998 BN evaluation (1)
- Vocal tract length normalization (VTLN)
  - Max. likelihood selection of the best warp factor (parabolic search)
  - 0.4% lower absolute WER (MLLR-adapted quinphones)
- Language modelling
  - Interpolate 3 separate word-based LMs (BN, newswire, acoustic data) instead of pooling them.
  - 0.5% lower absolute WER (adapted quinphones)
- Full-variance MLLR transforms
  - 0.2% lower absolute WER
- Speaker-adaptive training
  - further 0.1% lower absolute WER (in combination with full-variance transforms)
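The "parabolic search" for the VTLN warp factor can be sketched as a grid search refined by a parabolic fit around the best grid point (a common realization of such a search; the grid range, step, and toy likelihood here are assumptions, not the paper's values):

```python
def parabolic_peak(x1, x2, x3, y1, y2, y3):
    """Vertex of the parabola through three (warp, log-likelihood) points."""
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3**2 * (y1 - y2) + x2**2 * (y3 - y1) + x1**2 * (y2 - y3)) / denom
    return -b / (2 * a)

def best_warp_factor(log_likelihood, warps):
    """Pick the maximum-likelihood warp factor.

    Evaluate the likelihood on a grid of warp factors, then refine by
    fitting a parabola through the best point and its two neighbours.
    """
    scores = [log_likelihood(w) for w in warps]
    k = max(range(len(warps)), key=scores.__getitem__)
    if k in (0, len(warps) - 1):  # peak at a grid edge: no refinement
        return warps[k]
    return parabolic_peak(warps[k - 1], warps[k], warps[k + 1],
                          scores[k - 1], scores[k], scores[k + 1])

# Toy likelihood peaked at warp 1.04: the search recovers it.
peak = best_warp_factor(lambda w: -(w - 1.04) ** 2,
                        [0.88 + 0.02 * i for i in range(13)])
print(round(peak, 2))  # 1.04
```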
20 The 1998 BN evaluation (2)
Pass MLLR HMMs LM WER (%) Imprv. (%)
P1 - gi -V trip. trigram 19.9 0
P2 - gd V trip. trigram 17.5 12.1
P3 1 tr. FV gd V trip. bigram 19.1 4.0
P3 lat.rs. 1 tr. FV gd V trip. inp.w4c3 15.3 23.1
P4 lat.rs. 1 tr. FV gd V qui. inp.w4c3 14.9 25.1
P5 lat.rs. 1 tr. FV gd V qui. inp.w4c3 14.2 28.6
P6 lat.rs. 4 tr. FV gd V qui. inp.w4c3 14.2 28.6
ROVER 1-F/4F gd V tr/q. inp.w4c3 13.8 30.7
conf. combine
21 1998 TREC 7 evaluation
- Constraint: must operate in max. 10 x real time
- 1999 TREC 8: same architecture, larger vocabulary
Pass MLLR HMMs LM 1997 1998
P1 - gi -V trip. trigram
P1 lat.rs. - gi -V trip. fourgram 21.4 21.2
P2 1 tran. gd -V trip. trigram
P2 lat.rs. 1 tran. gd -V trip. inp.w4c3 15.8 16.1
Full (unconstrained) systems 15.8 13.8
22 Discussion and conclusion
- The HTK system had either the lowest overall error rate in every evaluation or a value not significantly different from the lowest.
- HTK was always the best for F0 speech (clean, planned).
- In worse conditions, the applied adaptation methods were shown to significantly reduce the error.
- Still a long way to go(?): word error rates for bulk transcription of BN data remain at about 20% for the best systems...
  - ... with very high WER for some audio conditions.
- What about languages other than English?
23 Project work
- Planned project
  - Literature study on language models used in audio mining (broadcast-news-quality speech)
  - How do they work?
  - What is their contribution to the overall error reduction?
24 Home assignment
- Briefly comment on the following claims in the light of Woodland's paper. (Simply answering true or false is not enough.)
  - The better overall performance of the 1997 system compared to the 1996 system was mainly due to the doubling of the amount of training data.
  - Triphone HMMs cannot be estimated unless a huge amount of training data is available.
  - Gender-dependent acoustic models are to be preferred over gender-independent models.
  - Quinphone HMMs are not created through two-model re-estimation.
  - MLLR (Maximum Likelihood Linear Regression) is an adaptation method that is sensitive to transcription errors.