Title: The development of the HTK Broadcast News transcription system: An overview
1 The development of the HTK Broadcast News transcription system: An overview
- Paper by P. C. Woodland.
- Appeared in Speech Communication 37 (2002), 47-67.
- Mathias.Creutz_at_hut.fi
- T-61.184 Audio Mining, 17 October 2002.
2 Motivation
- Transcription of broadcast radio and television news is challenging: different speech styles
  - read, spontaneous, conversational speech
  - native and non-native speakers
  - high- and low-bandwidth channels
  - ... with or without background music or other background noise
- → Solving these problems is of great utility in more general tasks.
3 Procedure
- Based on the HTK large vocabulary speech recognition system
  - by Woodland, Leggetter, Odell, Valtchev, Young, Gales, Pye; Cambridge University, Entropic Ltd., 1994-1996.
- Developed and evaluated in the NIST/DARPA Broadcast News and TREC SDR evaluations
  - 1996 (DARPA BN)
  - 1997 (DARPA BN)
  - 1998 (DARPA BN)
  - 1998, 10 x real time (TREC 7 spoken document retrieval)
  - 1999, 10 x real time (TREC 8 spoken document retrieval)
4 Standard HTK LVSR system (1)
- Acoustic feature extraction
  - Initially designed for clean speech tasks
  - Standard Mel-frequency cepstral coefficients (MFCC)
  - 39-dimensional feature vector
  - Cepstral mean normalization on an utterance-by-utterance basis
- Pronunciation dictionary
  - Based on the LIMSI 1993 WSJ pronunciation dictionary
  - 46 phones
  - Vocabulary of 65k words
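Cepstral mean normalization is simple enough to sketch directly (a minimal NumPy sketch; the array shapes and the toy "channel offset" are illustrative, only the 39-dimensional vector size comes from the slide):

```python
import numpy as np

def cepstral_mean_normalize(features):
    """Subtract the per-utterance mean from each cepstral dimension.

    features: (num_frames, num_coeffs) array of cepstral coefficients.
    A stationary convolutional channel adds a roughly constant offset
    in the cepstral domain, so removing the utterance mean removes it.
    """
    return features - features.mean(axis=0, keepdims=True)

# Toy utterance: 100 frames of 39-dimensional features plus a channel offset.
rng = np.random.default_rng(0)
utterance = rng.normal(size=(100, 39)) + 5.0   # constant offset = "channel"
normalized = cepstral_mean_normalize(utterance)
print(np.allclose(normalized.mean(axis=0), 0.0))  # True: per-dim mean removed
```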
5 Standard HTK LVSR system (2)
- Acoustic modelling
  - Hidden Markov models (HMMs)
  - State output distributions are Gaussian mixture models.
  - Embedded Baum-Welch re-estimation
  - Forced Viterbi alignment chooses between pronunciation variants in the dictionary, e.g., the: /ðə/ or /ði/.
- Transcription of "you speak":
  - Monophones: sil j u sp s p i k sil
  - Triphones: sil j+u j-u+s sp u-s+p s-p+i p-i+k i-k sil
  - Quinphones: sil j+u+s j-u+s+p sp j-u-s+p+i u-s-p+i+k s-p-i+k p-i-k sil
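The monophone-to-triphone/quinphone expansion above can be sketched as code (HTK-style l-p+r labels; treating sil and sp as context-transparent matches the example but is an assumption of this sketch):

```python
def expand_context(phones, n=1):
    """Expand a monophone sequence into context-dependent labels.

    n=1 gives triphones (l-p+r), n=2 gives quinphones (ll-l-p+r+rr).
    sil and sp are emitted unchanged and skipped as context phones.
    """
    skip = {"sil", "sp"}
    real = [i for i, p in enumerate(phones) if p not in skip]
    pos = {i: k for k, i in enumerate(real)}
    labels = []
    for i, p in enumerate(phones):
        if p in skip:
            labels.append(p)
            continue
        k = pos[i]
        left = [phones[j] for j in real[max(0, k - n):k]]
        right = [phones[j] for j in real[k + 1:k + 1 + n]]
        labels.append("-".join(left + [p]) + "".join("+" + r for r in right))
    return labels

phones = "sil j u sp s p i k sil".split()
print(" ".join(expand_context(phones, n=1)))
# sil j+u j-u+s sp u-s+p s-p+i p-i+k i-k sil
```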
6 Standard HTK LVSR system (3)
- Language modelling (LM)
  - N-gram models using Katz backoff
  - Class-based models based on automatically derived word classes
  - Dynamic language models based on a cache model
- Decoding
  - Time-synchronous decoders
  - Single pass, or generation and rescoring of word lattices
  - Early stages with triphone models and bigram or trigram LMs
  - Later stages with adapted quinphones and more advanced LMs
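A minimal sketch of a backoff bigram in the spirit of Katz backoff; note it substitutes absolute discounting for Katz's Good-Turing discounts to stay short, so the discount value and corpus are illustrative only:

```python
from collections import Counter

def train_backoff_bigram(tokens, discount=0.5):
    """Train a toy backoff bigram model.

    Backoff structure as in Katz: use the discounted bigram estimate
    when the bigram was seen in training, otherwise redistribute the
    freed probability mass over unseen followers in proportion to
    their unigram counts. Assumes the history word was followed by
    at least one word in training.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(word, prev):
        if (prev, word) in bigrams:
            return (bigrams[(prev, word)] - discount) / unigrams[prev]
        followers = {w for (p, w) in bigrams if p == prev}
        freed = discount * len(followers) / unigrams[prev]  # discounted mass
        unseen = sum(unigrams[w] for w in unigrams if w not in followers)
        return freed * unigrams[word] / unseen              # backoff to unigram
    return prob

# Toy corpus; the conditional distribution still sums to one.
tokens = "the cat sat on the mat the cat ran".split()
prob = train_backoff_bigram(tokens)
```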
7 Standard HTK LVSR system (4)
- Adaptation (new speaker or acoustic environment)
- MLLR (Maximum Likelihood Linear Regression)
  - adjust Gaussian means (and optionally variances) in the HMMs in order to increase the likelihood of the adaptation data
  - μ_adapted = A μ_original + b
  - use a single, global transform for all Gaussians, or separate transforms for different clusters of Gaussians
  - can be used in combination with multi-pass decoding:
    - decode
    - adapt
    - decode/rescore lattice with adapted models
    - adapt again, etc.
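The mean transform μ_adapted = A μ_original + b can be sketched end-to-end. Under the simplifying assumptions that covariances are identity and frame-to-Gaussian alignments are already fixed (a simplification of the full MLLR solution, which weights by the actual covariances), the maximum-likelihood estimate of W = [b A] reduces to ordinary least squares:

```python
import numpy as np

def estimate_mllr_mean_transform(means, frames):
    """Estimate a single global MLLR mean transform W = [b A].

    means:  (N, d) original Gaussian means, one per adaptation frame
            (frames are assumed already aligned to Gaussians).
    frames: (N, d) adaptation-data feature frames.
    With identity covariances, maximizing the adaptation-data
    likelihood reduces to least squares on extended means [1, mu].
    """
    ext = np.hstack([np.ones((means.shape[0], 1)), means])  # (N, d+1)
    W, *_ = np.linalg.lstsq(ext, frames, rcond=None)        # (d+1, d)
    return W.T                                              # (d, d+1) = [b A]

def adapt_means(W, means):
    """Apply mu_adapted = A @ mu + b, with W = [b A]."""
    ext = np.hstack([np.ones((means.shape[0], 1)), means])
    return ext @ W.T

# Toy check: frames generated by a known affine shift are recovered.
rng = np.random.default_rng(1)
mu = rng.normal(size=(200, 3))
A_true, b_true = 1.1 * np.eye(3), np.array([0.5, -0.2, 0.0])
frames = mu @ A_true.T + b_true
W = estimate_mllr_mean_transform(mu, frames)
print(np.allclose(adapt_means(W, mu), frames))  # True
```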
8 F-conditions
- Data supplied by the Linguistic Data Consortium (LDC)
- Part of the training data and all test data were hand-labelled according to the focus or F-conditions:
  F0 Baseline broadcast speech (clean, planned)
  F1 Spontaneous broadcast speech (clean)
  F2 Low-fidelity speech (mainly narrowband)
  F3 Speech in the presence of background music
  F4 Speech under degraded acoustic conditions
  F5 Non-native speakers (clean, planned)
  FX All other speech (e.g., spontaneous non-native)
9 Broadcast news data
                       1996                1997                  1998
Training data          35 hours            37 hours              71 hours
Development test set   -                   2 shows               4 shows
Test set               2 hours (4 shows)   3 hours (9 shows)     3 hours
Text for LM            132 million words   70(?) million words   60-70 million words
                                           + transcr. of acoustic data
- Underlined text: pre-partitioned at the speaker turn
- Green background: hand-labelled according to F-conditions
10 Initial experiments using 1995 BN data
- Speech recognition: existing HTK SR system
  - Wall Street Journal (WSJ) models, triphones, trigram LM
- Data: broadcast news data from the radio show Marketplace, marked according to
  - presence/absence of background music
  - full/reduced bandwidth
- Goals
  - Compare standard MFCC (Mel-Frequency Cepstral Coeffs) to MF-PLP (Mel-Frequency Perceptual Linear Prediction)
  - Try out unsupervised test-data MLLR adaptation
- Results
  - 12% word error rate reduction with MF-PLP
  - a further 26% using two-iteration MLLR adaptation
- Good. Let's use these techniques!
11 The 1996 BN evaluation (1)
- Acoustic environment adaptation (training data)
  - Adapt WSJ triphone and quinphone models to each of the focus conditions using mean and variance MLLR → data-type-specific model sets.
  - Automatically classify F2 (low-fidelity speech) as narrowband or wideband → adapt separate sets.
  - A couple of tricks for F5 (non-native) and FX (other speech), due to small amounts of data.
- (Mainly) speaker adaptation (test data)
  - Unknown speaker identities.
  - Cluster (bottom-up) similar segments until sufficient data is available in each group → robust unsupervised adaptation.
  - Each segment is represented by its mean and variance.
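The bottom-up clustering can be sketched as follows. The Euclidean distance between segment means and the min_frames threshold are simplifications of this sketch (the paper's segments carry means and variances, so a Gaussian-based distance would be used in practice):

```python
import numpy as np

def cluster_segments(segments, min_frames=300):
    """Bottom-up clustering of speech segments for MLLR adaptation.

    segments: list of (mean_vector, num_frames) pairs, each segment
    summarized by its feature mean. The closest pair of clusters is
    merged repeatedly until every cluster holds at least min_frames
    of adaptation data.
    """
    clusters = [(np.asarray(m, float), n) for m, n in segments]
    while len(clusters) > 1 and min(n for _, n in clusters) < min_frames:
        # Find the closest pair of cluster means.
        best, pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(clusters[i][0] - clusters[j][0])
                if best is None or d < best:
                    best, pair = d, (i, j)
        i, j = pair
        (mi, ni), (mj, nj) = clusters[i], clusters[j]
        merged = ((ni * mi + nj * mj) / (ni + nj), ni + nj)  # count-weighted mean
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Two nearby short segments merge; the long distant one stays alone.
segments = [([0.0, 0.0], 100), ([0.1, 0.0], 120), ([5.0, 5.0], 400)]
clusters = cluster_segments(segments, min_frames=200)
print([n for _, n in clusters])  # [400, 220]
```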
12 The 1996 BN evaluation (2)
- Language modelling
  - Static: bigram, trigram, 4-gram word-based LMs (Katz backoff)
  - Dynamic: unigram and bigram cache model
    - Woodland et al., 1996
    - Based on last recognition output
    - Operates on a per-show, per-focus-condition basis
    - Includes future and previous words and other word forms with the same stem
    - Excludes common words
    - Interpolated with the static 4-gram language model
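A minimal sketch of the cache interpolation (a unigram cache for brevity; the cache weight and the toy numbers are illustrative, not the paper's):

```python
from collections import Counter

def cache_interpolated_prob(word, history_words, static_prob, lam=0.1):
    """Interpolate a static LM probability with a unigram cache.

    history_words: words from earlier recognition output for the same
    show/focus condition, as described on the slide.
    lam: cache interpolation weight (illustrative value).
    """
    cache = Counter(history_words)
    p_cache = cache[word] / max(1, sum(cache.values()))
    return (1 - lam) * static_prob + lam * p_cache

# A word frequent in the recent output gets boosted above its static prob.
history = "the htk system the htk".split()
p = cache_interpolated_prob("htk", history, static_prob=0.001)
```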
13 The 1996 BN evaluation (3)
Pass MLLR HMMs LM WER (%) Imprv. (%)
P1 - triph. trigram 33.4 0
P2 1 transf. triph. trigram 31.1 6.9
P3 1 transf. triph. bigram 34.1 -2.1
P3 lat.rs. 1 transf. triph. fourgram 29.4 12.0
P4 lat.rs. 1 transf. quinph. fourgram 27.2 18.6
P5 lat.rs. 2 transf. quinph. fourgram 26.9 19.5
P6 lat.rs. 4 transf. quinph. fourgram 26.7 20.1
cache 4 transf. quinph. cache 26.6 20.4
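The Imprv. column is the relative WER reduction against the P1 baseline; a quick check with values from the table above:

```python
def relative_improvement(baseline_wer, wer):
    """Relative WER reduction (%) against the baseline pass."""
    return 100.0 * (baseline_wer - wer) / baseline_wer

print(round(relative_improvement(33.4, 26.6), 1))  # 20.4, the cache row
print(round(relative_improvement(33.4, 27.2), 1))  # 18.6, the P4 row
```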
14 Towards the 1997 BN evaluation (1)
- Information about data segmentation and type is not supplied.
- Goal: compare performance of condition-dependent and condition-independent models
- Test data: 1996 development test set
- Experiment with different acoustic models
  - Adapt WSJ models to each F-condition (cond.-dep.)
  - Train models on 1996 BN training data (cond.-indep.)
  - Train models on 1997 BN training data (cond.-indep.)
- Results
  - Condition-independent models slightly better than adapted condition-dependent models! (WER 32.0, 31.7, 29.6)
15 Towards the 1997 BN evaluation (2)
- Gender effect?
  - 2/3 male, 1/3 female speakers in BN data
  - 1/2 male, 1/2 female speakers in WSJ models
- Use gender-dependent models
  - gender of speakers in the data is known → assume that perfect gender determination is possible
- Results (1997 BN data)
  - Gender-indep.: All 29.6, Male 28.8, Female 31.1
  - Gender-dep.: All 28.1, Male 27.8, Female 28.8
16 1997 BN: Automatic segmentation and clustering
- Goal: convert the audio stream into clusters of reasonably sized homogeneous speech segments → each cluster shares a set of MLLR transforms.
- The audio stream is first classified into 3 broad categories:
  - wideband speech, narrowband speech, music (→ reject)
- Use a gender-dependent recognizer to locate silence portions and gender change points.
- Cluster segments separately for each gender and bandwidth combination for use in MLLR adaptation.
- Result: only 0.1% absolute higher WER than with manual segments.
17 The 1997 BN evaluation
- Language modelling
  - Bigram, trigram, 4-gram word-based LMs (Katz backoff)
  - Category language model
    - Kneser & Ney 93; Martin et al. 95; Niesler et al. 98
    - 1000 automatically generated word classes based on word bigram statistics in the training set
    - Trigram model
  - Interpolation of word 4-gram and class trigram models
    - weights 0.7 and 0.3
- Hypothesis combination
  - Different types of errors → combine triphone and quinphone results.
  - Use confidence scores and dynamic programming-based string alignment.
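The 0.7/0.3 interpolation can be written out directly. A class model scores a word as P(class | class history) * P(word | class); the probabilities passed in below are illustrative numbers, not values from the paper:

```python
def interpolated_prob(p_word, p_class_hist, p_word_given_class, w_word=0.7):
    """Interpolate a word 4-gram probability with a class trigram one.

    p_word: P(word | word history) from the word 4-gram.
    p_class_hist: P(class(word) | class history) from the class trigram.
    p_word_given_class: class membership probability P(word | class).
    Weights 0.7 / 0.3 follow the slide.
    """
    return w_word * p_word + (1.0 - w_word) * p_class_hist * p_word_given_class

p = interpolated_prob(p_word=0.010, p_class_hist=0.20, p_word_given_class=0.05)
```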
18 The 1997 BN evaluation (2)
Pass MLLR HMMs LM WER (%) Imprv. (%)
P1 - gi triph. trigram 21.4 0
P2 1 transf. gd triph. bigram 21.3 0.5
P2 lat.rs. 1 transf. gd triph. trigram 18.0 15.9
P2 lat.rs. 1 transf. gd triph. fourgram 17.3 19.2
P2 lat.rs. 1 transf. gd triph. inp.w4c3 16.8 21.5
P3 lat.rs. 1 transf. gd quin. inp.w4c3 16.4 23.4
P4 lat.rs. 2 transf. gd quin. inp.w4c3 16.2 24.3
P5 lat.rs. 4 transf. gd quin. inp.w4c3 16.2 24.3
cache 4 transf. gd quin. cache 16.2 24.3
ROVER 1 tr./4 tr. gd tri/qu. inp.w4c3 15.8 26.2
conf. combine
19 The 1998 BN evaluation (1)
- Vocal tract length normalization (VTLN)
  - Max. likelihood selection of the best warp factor (parabolic search)
  - 0.4% lower absolute WER (MLLR-adapted quinphones)
- Language modelling
  - Interpolate 3 separate word-based LMs (BN, newswire, acoustic data) instead of pooling them.
  - 0.5% lower absolute WER (adapted quinphones)
- Full-variance MLLR transforms
  - 0.2% lower absolute WER
- Speaker-adaptive training
  - further 0.1% lower absolute WER (in combination with full-variance transforms)
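The "parabolic search" for the VTLN warp factor can be sketched as a grid search refined by a parabolic fit around the best grid point (a common realization of such a search; the grid range, step, and toy likelihood here are assumptions, not the paper's values):

```python
def parabolic_peak(x1, x2, x3, y1, y2, y3):
    """Vertex of the parabola through three (warp, log-likelihood) points."""
    denom = (x1 - x2) * (x1 - x3) * (x2 - x3)
    a = (x3 * (y2 - y1) + x2 * (y1 - y3) + x1 * (y3 - y2)) / denom
    b = (x3**2 * (y1 - y2) + x2**2 * (y3 - y1) + x1**2 * (y2 - y3)) / denom
    return -b / (2 * a)

def best_warp_factor(log_likelihood, warps):
    """Pick the maximum-likelihood warp factor.

    Evaluate the likelihood on a grid of warp factors, then refine by
    fitting a parabola through the best point and its two neighbours.
    """
    scores = [log_likelihood(w) for w in warps]
    k = max(range(len(warps)), key=scores.__getitem__)
    if k in (0, len(warps) - 1):  # peak at a grid edge: no refinement
        return warps[k]
    return parabolic_peak(warps[k - 1], warps[k], warps[k + 1],
                          scores[k - 1], scores[k], scores[k + 1])

# Toy likelihood peaked at warp 1.04: the search recovers it.
peak = best_warp_factor(lambda w: -(w - 1.04) ** 2,
                        [0.88 + 0.02 * i for i in range(13)])
print(round(peak, 2))  # 1.04
```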
20 The 1998 BN evaluation (2)
Pass MLLR HMMs LM WER (%) Imprv. (%)
P1 - gi -V trip. trigram 19.9 0
P2 - gd V trip. trigram 17.5 12.1
P3 1 tr. FV gd V trip. bigram 19.1 4.0
P3 lat.rs. 1 tr. FV gd V trip. inp.w4c3 15.3 23.1
P4 lat.rs. 1 tr. FV gd V qui. inp.w4c3 14.9 25.1
P5 lat.rs. 1 tr. FV gd V qui. inp.w4c3 14.2 28.6
P6 lat.rs. 4 tr. FV gd V qui. inp.w4c3 14.2 28.6
ROVER 1-F/4F gd V tr/q. inp.w4c3 13.8 30.7
conf. combine
21 1998 TREC 7 evaluation
- Constraint: must operate in max. 10 x real time
- 1999 TREC 8: same architecture, larger vocabulary
Pass MLLR HMMs LM 1997 1998
P1 - gi -V trip. trigram
P1 lat.rs. - gi -V trip. fourgram 21.4 21.2
P2 1 tran. gd -V trip. trigram
P2 lat.rs. 1 tran. gd -V trip. inp.w4c3 15.8 16.1
Full (unconstrained) systems 15.8 13.8
22 Discussion and conclusion
- The HTK system had either the lowest overall error rate in every evaluation or a value not significantly different from the lowest.
- HTK was always the best for F0 speech (clean, planned).
- In worse conditions, the applied adaptation methods were shown to significantly reduce the error.
- Still a long way to go(?): word error rates for bulk transcription of BN data remain at about 20% for the best systems...
  - ... with very high WER for some audio conditions.
- What about languages other than English?
23 Project work
- Planned project
  - Literature study on language models used in audio mining (broadcast-news-quality speech)
  - How do they work?
  - What is their contribution to the overall error reduction?
24 Home assignment
- Briefly comment on the following claims in the light of Woodland's paper. (Simply answering true or false is not enough.)
  - The better overall performance of the 1997 system compared to the 1996 system was mainly due to the doubling of the amount of training data.
  - Triphone HMMs cannot be estimated unless a huge amount of training data is available.
  - Gender-dependent acoustic models are to be preferred over gender-independent models.
  - Quinphone HMMs are not created through two-model re-estimation.
  - MLLR (Maximum Likelihood Linear Regression) is an adaptation method that is sensitive to transcription errors.