The development of the HTK Broadcast News transcription system: An overview

1
The development of the HTK Broadcast News
transcription system: An overview
  • Paper by P. C. Woodland.
  • Appeared in Speech Communication 37 (2002),
    47-67.
  • Mathias.Creutz_at_hut.fi,
  • T-61.184 Audio Mining, 17 October 2002.

2
Motivation
  • Transcription of broadcast radio and television
    news is challenging
  • different speech styles
  • read, spontaneous, conversational speech
  • native and non-native speakers
  • high- and low-bandwidth channels
  • ... with or without background music or other
    background noise
  • → Solving these problems is of great utility
    in more general tasks.

3
Procedure
  • Based on the HTK large vocabulary speech
    recognition system
  • by Woodland, Leggetter, Odell, Valtchev, Young,
    Gales, Pye, Cambridge University, Entropic Ltd.,
    1994–1996.
  • Developed and evaluated in the NIST/DARPA
    Broadcast News TREC SDR evaluations
  • 1996 (DARPA BN)
  • 1997 (DARPA BN)
  • 1998 (DARPA BN)
  • 1998, 10 x real time (TREC 7 spoken docum.
    retrieval)
  • 1999, 10 x real time (TREC 8 spoken docum.
    retrieval)

4
Standard HTK LVSR system (1)
  • Acoustic feature extraction
  • Initially designed for clean speech tasks
  • Standard Mel-frequency cepstral coefficients
    (MFCC)
  • 39 dimensional feature vector
  • Cepstral mean normalization on an
    utterance-by-utterance basis
  • Pronunciation dictionary
  • Based on the LIMSI 1993 WSJ pronunciation
    dictionary
  • 46 phones
  • Vocabulary of 65k words
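The 39-dimensional vector is typically 13 static cepstra plus first and second time derivatives. A minimal NumPy sketch of the delta stacking and per-utterance cepstral mean normalization (np.gradient is a simple stand-in for HTK's regression-window deltas):

```python
import numpy as np

def add_deltas(static):
    """Stack static cepstra with their first and second time
    differences (delta and delta-delta): 13 -> 39 dimensions."""
    delta = np.gradient(static, axis=0)   # difference-based deltas
    delta2 = np.gradient(delta, axis=0)
    return np.hstack([static, delta, delta2])

def cmn(features):
    """Cepstral mean normalization on an utterance-by-utterance
    basis: subtract the mean feature vector of the utterance."""
    return features - features.mean(axis=0)

# Toy utterance: 100 frames of 13 static MFCCs.
static = np.random.randn(100, 13)
feats = cmn(add_deltas(static))
print(feats.shape)  # (100, 39)
```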

5
Standard HTK LVSR system (2)
  • Acoustic modelling
  • Hidden Markov models (HMMs)
  • States are implemented as Gaussian mixture
    models.
  • Embedded Baum-Welch re-estimation
  • Forced Viterbi alignment chooses between
    pronunciation variants in the dictionary, e.g.,
    "the" as /ðə/ or /ði/.
  • Transcription
  • Monophones, e.g., "you speak" → sil j u sp s p
    i k sil
  • Triphones → sil j+u j-u+s sp u-s+p s-p+i
    p-i+k i-k sil
  • Quinphones → sil j+u+s j-u+s+p sp j-u-s+p+i
    u-s-p+i+k s-p-i+k p-i-k sil
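A sketch of the monophone-to-triphone expansion, assuming HTK's l-c+r naming convention, with sil kept context-independent and the short pause sp transparent to context:

```python
def to_triphones(phones):
    """Expand a monophone sequence into HTK-style l-c+r triphones.
    'sil' stays context-independent; 'sp' is kept in the output
    but is invisible when picking left/right contexts."""
    ctx = [p for p in phones if p not in ("sil", "sp")]
    out, j = [], 0
    for p in phones:
        if p in ("sil", "sp"):
            out.append(p)
            continue
        left = ctx[j - 1] if j > 0 else None
        right = ctx[j + 1] if j + 1 < len(ctx) else None
        name = p
        if left:
            name = f"{left}-{name}"
        if right:
            name = f"{name}+{right}"
        out.append(name)
        j += 1
    return out

print(to_triphones("sil j u sp s p i k sil".split()))
# ['sil', 'j+u', 'j-u+s', 'sp', 'u-s+p', 's-p+i', 'p-i+k', 'i-k', 'sil']
```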

6
Standard HTK LVSR system (3)
  • Language modelling (LM)
  • N-gram models using Katz-backoff
  • Class-based models based on automatically derived
    word classes
  • Dynamic language models based on a cache model
  • Decoding
  • Time-synchronous decoders
  • Single pass, or generation and rescoring of
    word lattices
  • Early stages with triphone models and bigram or
    trigram LMs
  • Later stages with adapted quinphones and more
    advanced LMs
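As an illustration of the backoff idea (not the paper's exact formulation), a toy bigram model with a fixed absolute discount standing in for Katz's Good-Turing discounting:

```python
from collections import Counter

def backoff_bigram(tokens, discount=0.5):
    """Toy backoff bigram: seen bigrams get a discounted ML
    estimate; unseen ones fall back to a renormalized unigram
    distribution. Katz backoff proper derives the discount from
    Good-Turing counts; a fixed discount is a simplification."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))

    def prob(w1, w2):
        if (w1, w2) in bigrams:
            return (bigrams[w1, w2] - discount) / unigrams[w1]
        # Probability mass freed by discounting bigrams seen after w1:
        n_seen = sum(1 for (a, _) in bigrams if a == w1)
        alpha = discount * n_seen / unigrams[w1]
        # Spread it over words NOT seen after w1, in proportion
        # to their unigram counts:
        mass = sum(c for w, c in unigrams.items() if (w1, w) not in bigrams)
        return alpha * unigrams[w2] / mass

    return prob

prob = backoff_bigram("a b a b a c a b".split())
print(prob("a", "b"))  # seen bigram: discounted ML estimate
print(prob("a", "a"))  # unseen bigram: backed off to unigram
```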

7
Standard HTK LVSR system (4)
  • Adaptation (new speaker or acoustic environment)
  • MLLR (Maximum Likelihood Linear Regression)
  • adjust Gaussian means (and optionally variances)
    in HMMs in order to increase likelihood of
    adaptation data
  • μ_adapted = A μ_original + b
  • use a single, global transform for all Gaussians
  • or separate transforms for different clusters of
    Gaussians
  • can be used in combination with multi-pass
    decoding
  • decode
  • adapt
  • decode/rescore lattice with adapted models
  • adapt again, etc.
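The mean update μ' = A μ + b can be sketched as below. Only the application of a single global transform is shown; in the full system A and b are estimated by maximizing the likelihood of the adaptation data, which is omitted here:

```python
import numpy as np

def mllr_adapt(means, A, b):
    """Apply one global MLLR mean transform, mu' = A mu + b, to a
    (num_gaussians x dim) matrix of Gaussian means. Separate
    transforms per regression class would apply different (A, b)
    pairs to different clusters of Gaussians."""
    return means @ A.T + b

# Toy model set: three 2-dimensional Gaussian means.
means = np.array([[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]])
A = np.eye(2)                # identity "rotation" for illustration
b = np.array([0.5, -0.5])    # bias shifts every mean
print(mllr_adapt(means, A, b))
```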

8
F-conditions
  • Data supplied by the Linguistic Data Consortium
    (LDC)
  • Part of the training data and all test data were
    hand labelled according to the focus or
    F-conditions

F0 Baseline broadcast speech (clean, planned)
F1 Spontaneous broadcast speech (clean)
F2 Low-fidelity speech (mainly narrowband)
F3 Speech in the presence of background music
F4 Speech under degraded acoustical conditions
F5 Non-native speakers (clean, planned)
FX All other speech (e.g., spontaneous non-native)
9
Broadcast news data
                      1996               1997                 1998
Training data         35 hours           37 hours             71 hours
Development test set  2 shows            4 shows
Test set              2 hours (4 shows)  3 hours (9 shows)    3 hours
Text for LM           132 million words  70(?) million words  60-70 million words
                                         + transcr. of acoustic data
Underlined text: pre-partitioned at the speaker turn
Green background: hand labelled according to F-conditions
10
Initial experiments using 1995 BN data
  • Speech recognition: existing HTK SR system
  • Wall Street Journal (WSJ), triphone models,
    trigram LM
  • Data: broadcast news data from the radio show
    Marketplace, marked according to
  • presence/absence of background music
  • full/reduced bandwidth
  • Goals
  • Compare standard MFCC (Mel-frequency cepstral
    coeffs) to MF-PLP (Mel-frequency perceptual
    linear prediction)
  • Try out unsupervised test-data MLLR adaptation
  • Results
  • 12% word error rate reduction with MF-PLP
  • further 26% using two-iteration MLLR adaptation

Good. Let's use these techniques!
11
The 1996 BN evaluation (1)
  • Acoustic environment adaptation (training data)
  • Adapt WSJ triphone and quinphone models to each
    of the focus conditions using mean and variance
    MLLR → data-type-specific model sets.
  • Automatically classify F2 (low-fidelity speech)
    as narrowband or wideband → adapt separate sets.
  • A couple of tricks for F5 (non-native) and FX
    (other speech), due to small amounts of data.
  • (Mainly) Speaker adaptation (test data)
  • Unknown speaker identities.
  • Cluster (bottom-up) similar segments until
    sufficient data is available in each group →
    robust unsupervised adaptation.
  • Each segment is represented by its mean and
    variance.

12
The 1996 BN evaluation (2)
  • Language modelling
  • Static: bigram, trigram, 4-gram word-based LMs
    (Katz backoff)
  • Dynamic: unigram and bigram cache model
  • Woodland et al. 1996
  • Based on last recognition output
  • Operates on a per-show, per-focus-condition basis
  • Includes future and previous words + other word
    forms with the same stem
  • Excludes common words
  • Interpolated with the static 4-gram language model
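A toy version of the cache idea (unigram only, ignoring stemming and the per-show bookkeeping described above; the interpolation weight is illustrative, not from the paper):

```python
from collections import Counter

class UnigramCache:
    """Toy dynamic cache LM: words from recent recognition output
    get boosted probability via interpolation with a static model."""
    def __init__(self, lam=0.9):
        self.counts = Counter()
        self.lam = lam  # weight on the static model (assumed value)

    def update(self, words):
        """Feed in the last recognition output."""
        self.counts.update(words)

    def prob(self, w, static_prob):
        """Interpolate the static LM probability with the cache."""
        total = sum(self.counts.values())
        cache_p = self.counts[w] / total if total else 0.0
        return self.lam * static_prob + (1 - self.lam) * cache_p

cache = UnigramCache()
cache.update(["profit", "profit", "loss"])
print(cache.prob("profit", 0.001))  # boosted above its static 0.001
```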

13
The 1996 BN evaluation (3)
  • Multi-pass decoding

Pass        MLLR       HMMs     LM        WER (%)  Imprv. (%)
P1          -          triph.   trigram   33.4     0
P2          1 transf.  triph.   trigram   31.1     6.9
P3          1 transf.  triph.   bigram    34.1     -2.1
P3 lat.rs.  1 transf.  triph.   fourgram  29.4     12.0
P4 lat.rs.  1 transf.  quinph.  fourgram  27.2     18.6
P5 lat.rs.  2 transf.  quinph.  fourgram  26.9     19.5
P6 lat.rs.  4 transf.  quinph.  fourgram  26.7     20.1
cache       4 transf.  quinph.  cache     26.6     20.4
14
Towards the 1997 BN evaluation (1)
  • Information about data segmentation and type is
    not supplied.
  • Goal: compare performance of condition-dependent
    and condition-independent models
  • Test data: 1996 development test set
  • Experiment with different acoustic models
  • Adapt WSJ models to each F-condition (cond.-dep.)
  • Train models on 1996 BN training data
    (cond.-indep.)
  • Train models on 1997 BN training data
    (cond.-indep.)
  • Results
  • Condition-independent models slightly better than
    adapted condition-dependent models! (WER 32.0%,
    31.7%, 29.6%, respectively)

15
Towards the 1997 BN evaluation (2)
  • Gender effect?
  • 2/3 male, 1/3 female speakers in BN data
  • 1/2 male, 1/2 female speakers in WSJ models
  • Use gender-dependent models
  • gender of speakers in data is known → assume that
    perfect gender determination is possible
  • Results (1997 BN data)
  • Gender-indep.: All 29.6%, Male 28.8%, Female
    31.1%
  • Gender-dep.: All 28.1%, Male 27.8%, Female 28.8%

16
1997 BN Automatic segm. clustering
  • Goal: convert the audio stream into clusters of
    reasonably sized homogeneous speech segments →
    each cluster shares a set of MLLR transforms.
  • The audio stream is first classified into 3 broad
    categories
  • wideband speech, narrowband speech, music (→
    reject)
  • Use a gender-dependent recognizer to locate
    silence portions and gender change points.
  • Cluster segments separately for each gender and
    bandwidth combination for use in MLLR adaptation.
  • Result: only 0.1% absolute higher WER than manual
    segments.
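A greedy sketch of the bottom-up merging idea: repeatedly fold the smallest cluster into its nearest neighbour until every cluster has enough frames for robust MLLR adaptation. The nearest-neighbour distance here is Euclidean distance between cluster means; the actual system compares Gaussian segment statistics:

```python
import numpy as np

def cluster_segments(segments, min_frames):
    """Bottom-up clustering of (mean vector, frame count) segments.
    While any cluster holds fewer than min_frames frames, merge the
    smallest cluster with the cluster whose mean is closest to it,
    pooling their frame-weighted statistics."""
    clusters = [(np.asarray(m, float), n) for m, n in segments]
    while len(clusters) > 1 and min(n for _, n in clusters) < min_frames:
        i = min(range(len(clusters)), key=lambda k: clusters[k][1])
        mi, _ = clusters[i]
        j = min((k for k in range(len(clusters)) if k != i),
                key=lambda k: np.linalg.norm(clusters[k][0] - mi))
        m2, n2 = clusters.pop(max(i, j))
        m1, n1 = clusters.pop(min(i, j))
        # Pool the two clusters' statistics (frame-weighted mean).
        clusters.append(((m1 * n1 + m2 * n2) / (n1 + n2), n1 + n2))
    return clusters

# Toy segments: two short, similar ones and one long, distant one.
segs = [([0.0, 0.0], 10), ([0.1, 0.0], 10), ([5.0, 5.0], 50)]
print(cluster_segments(segs, min_frames=15))
```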

17
The 1997 BN evaluation
  • Language modelling
  • Bigram, trigram, 4-gram word-based LMs (Katz
    backoff)
  • Category language model
  • [Kneser & Ney 93; Martin et al. 95; Niesler et
    al. 98]
  • 1000 automatically generated word classes based
    on word bigram statistics in the training set
  • Trigram model
  • Interpolation of word 4-gram and class trigram
    models
  • weights 0.7 and 0.3
  • Hypothesis combination
  • Different types of errors → combine triphone and
    quinphone results.
  • Use confidence scores and dynamic
    programming-based string alignment.
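The class-model interpolation above can be sketched as follows. Only the 0.7/0.3 weights come from the slide; the toy probabilities and class map are illustrative stand-ins:

```python
def class_trigram_prob(w, hist, word2class, p_class, p_w_given_c):
    """P(w | hist) under a class trigram: probability of w's class
    given the class history, times P(w | class(w))."""
    c_hist = tuple(word2class[h] for h in hist[-2:])
    return p_class(word2class[w], c_hist) * p_w_given_c[w]

def interpolated_prob(w, hist, p_word4, p_class3, lam=0.7):
    """Linear interpolation of a word 4-gram and a class trigram
    with the 0.7 / 0.3 weights reported on the slide."""
    return lam * p_word4(w, hist[-3:]) + (1 - lam) * p_class3(w, hist)

# Toy stand-ins (assumed values, not the paper's actual models):
p_word4 = lambda w, hist: 0.01
word2class = {"profits": "NOUN", "rose": "VERB", "sharply": "ADV"}
p_class = lambda c, c_hist: 0.2
p_w_given_c = {"sharply": 0.05}
p_class3 = lambda w, hist: class_trigram_prob(
    w, hist, word2class, p_class, p_w_given_c)

print(interpolated_prob("sharply", ["profits", "rose"], p_word4, p_class3))
```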

18
The 1997 BN evaluation (2)
Pass        MLLR         HMMs        LM        WER (%)  Imprv. (%)
P1          -            gi triph.   trigram   21.4     0
P2          1 transf.    gd triph.   bigram    21.3     0.5
P2 lat.rs.  1 transf.    gd triph.   trigram   18.0     15.9
P2 lat.rs.  1 transf.    gd triph.   fourgram  17.3     19.2
P2 lat.rs.  1 transf.    gd triph.   inp.w4c3  16.8     21.5
P3 lat.rs.  1 transf.    gd quin.    inp.w4c3  16.4     23.4
P4 lat.rs.  2 transf.    gd quin.    inp.w4c3  16.2     24.3
P5 lat.rs.  4 transf.    gd quin.    inp.w4c3  16.2     24.3
cache       4 transf.    gd quin.    cache     16.2     24.3
ROVER conf. combine  1 tr./4 tr.  gd tri/qu.  inp.w4c3  15.8  26.2
(gi/gd = gender-independent/-dependent; inp.w4c3 = interpolated
word 4-gram + class trigram)
19
The 1998 BN evaluation (1)
  • Vocal tract length normalization (VTLN)
  • Max. likelihood selection of best warp factor
    (parabolic search)
  • 0.4% lower absolute WER (MLLR-adapted quinphones)
  • Language modelling
  • Interpolate 3 separate word-based LMs (BN,
    newswire, acoustic data) instead of pooling them.
  • 0.5% lower absolute WER (adapted quinphones)
  • Full variance MLLR transforms
  • 0.2% lower absolute WER
  • Speaker-adaptive training
  • further 0.1% lower absolute WER (in combination
    with full variance transforms)
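A sketch of the VTLN warp-factor search: evaluate a likelihood function on a grid of warp factors, then refine the maximum with a parabolic fit through the best point and its two neighbours. The `loglik` callable is a stand-in for the acoustic log-likelihood of a speaker's data under each warp factor:

```python
import numpy as np

def best_warp(loglik, lo=0.8, hi=1.2, steps=9):
    """Maximum-likelihood warp-factor selection: grid search over
    [lo, hi], refined by the vertex of a parabola fitted through
    the best grid point and its two neighbours."""
    xs = np.linspace(lo, hi, steps)
    ys = np.array([loglik(x) for x in xs])
    i = int(np.argmax(ys))
    if i == 0 or i == steps - 1:
        return float(xs[i])          # maximum sits on the grid edge
    h = xs[1] - xs[0]                # grid spacing
    y0, y1, y2 = ys[i - 1], ys[i], ys[i + 1]
    denom = y0 - 2 * y1 + y2
    if denom == 0:
        return float(xs[i])          # degenerate (flat) fit
    return float(xs[i] + h * (y0 - y2) / (2 * denom))

# Toy likelihood peaking at warp factor 1.03 (off the grid points):
print(best_warp(lambda a: -(a - 1.03) ** 2))  # ~1.03
```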

20
The 1998 BN evaluation (2)
Pass        MLLR       HMMs         LM        WER (%)  Imprv. (%)
P1          -          gi -V trip.  trigram   19.9     0
P2          -          gd V trip.   trigram   17.5     12.1
P3          1 tr. FV   gd V trip.   bigram    19.1     4.0
P3 lat.rs.  1 tr. FV   gd V trip.   inp.w4c3  15.3     23.1
P4 lat.rs.  1 tr. FV   gd V qui.    inp.w4c3  14.9     25.1
P5 lat.rs.  1 tr. FV   gd V qui.    inp.w4c3  14.2     28.6
P6 lat.rs.  4 tr. FV   gd V qui.    inp.w4c3  14.2     28.6
ROVER conf. combine  1-F/4F  gd V tr/q.  inp.w4c3  13.8  30.7
(V = VTLN; FV = full variance transforms)
21
1998 TREC 7 evaluation
  • Constraint: must operate in at most 10 x real time
  • 1999 TREC 8: same architecture, larger vocab.

Pass       MLLR     HMMs         LM        WER 1997 (%)  WER 1998 (%)
P1         -        gi -V trip.  trigram
P1 lat.    -        gi -V trip.  fourgram  21.4          21.2
P2         1 tran.  gd -V trip.  trigram
P2 lat.    1 tran.  gd -V trip.  inp.w4c3  15.8          16.1
Full-time systems                          15.8          13.8
22
Discussion and conclusion
  • The HTK system had either the lowest overall
    error rate in every evaluation or a value not
    significantly different from the lowest.
  • HTK was always the best for F0 speech (clean,
    planned).
  • In worse conditions, the applied adaptation
    methods were shown to significantly reduce the
    error.
  • Still a long way to go(?) Word error rates for
    bulk transcription of BN data remain at about
    20% for the best systems
  • ... with very high WERs for some audio conditions.
  • What about other languages than English?

23
Project work
  • Planned project
  • Literature study on language models used in audio
    mining (broadcast news quality speech)
  • How do they work?
  • What is their contribution to the overall error
    reduction?

24
Home assignment
  • Briefly comment on the following claims in the
    light of Woodland's paper. (Simply answering true
    or false is not enough.)
  • The better overall performance of the 1997 system
    compared to the 1996 system was mainly due to the
    doubling of the amount of training data.
  • Triphone HMMs cannot be estimated unless there is
    a huge amount of training data available.
  • Gender-dependent acoustic models are to be
    preferred over gender-independent models.
  • Quinphone HMMs are not created through two-model
    re-estimation.
  • MLLR (Maximum Likelihood Linear Regression) is an
    adaptation method that is sensitive to
    transcription errors.