1
Acoustic modeling on telephony speech corpora for directory assistance system applications
Børge Lindberg, Center for PersonKommunikation (CPK), Aalborg University, Denmark
lindberg@cpk.auc.dk
2
Outline
Part 1 - Acoustic modeling
  • Reference recogniser (COST 249)
Part 2 - Directory assistance
  • NaNu - Names & Numbers (Tele Danmark)
  • Acoustic model optimisation
  • Project and system details
3
The COST 249 SpeechDat Multilingual Reference Recogniser
http://www.telenor.no/fou/prosjekter/taletek/refrec
COST 249
  • F.T. Johansen, N. Warakagoda (Telenor, Kjeller,
    Norway),
  • B. Lindberg (CPK, Aalborg, Denmark),
  • G. Lehtinen (ETH, Zürich, Switzerland),
  • Z. Kacic, B. Imperl, A. Zgank (UMB, Maribor,
    Slovenia),
  • B. Milner, D. Chaplin (British Telecom, Ipswich,
    UK),
  • K. Elenius, G. Salvi (KTH, Stockholm, Sweden),
  • E. Sanders, F. de Wet (KUN, Nijmegen, The
    Netherlands)

4
What is the reference recogniser?
  • Phoneme-based recogniser design procedure
  • Language-independent
  • Fully automatic: one script works straight from the CDs
  • Standardised database format: SpeechDat(II), available in many languages worldwide
  • Oriented towards telephone applications
  • Commonly available recogniser toolkit: HTK
5
Motivation
  • A fast start for recognition research in new
    languages
  • Share experience, avoid making the same mistakes
  • Improve state-of-the-art
  • Share research efforts
  • Provide a benchmark for recogniser performance
    comparison across tasks and languages
  • Facilitate true multilingual recognition research

6
Related Work
  • COST 232: assumed a TIMIT-like segmented database
  • Reference verification systems: CAVE, PICASSO, COST 250
  • GlobalPhone (Schultz & Waibel, ICSLP 98): dictation-type multilingual databases; language-independent and -adaptive recognition
7
SpeechDat(II) databases
  • 20 FDBs (fixed network), 5 MDBs (mobile networks)
  • 500-5000 speakers, 4-8 minute recording sessions
  • Telephone information and transaction services
  • Compatible databases:
    SpeechDat(E): 5 Central and Eastern European languages
    SALA: 8 dialect zones in Latin America
    SpeechDat-Car: 9 languages, parallel GSM and in-car
    SpeechDat Australian English
8
Core Utterance Types in SpeechDat(II)
Number  Type                              Corpus code
1       isolated digit items              I
5       digit/number strings              B, C
1       natural numbers                   N
1       money amounts                     M
2       yes/no questions                  Q
3       dates                             D
2       times                             T
3       application keywords/keyphrases   A
1       word spotting phrase              E
5       directory assistance names        O
3       spellings                         L
4       phonetically rich words           W
9       phonetically rich sentences       S
40      in total
9
Recogniser design - version 0.95
  • Standard HTK tutorial features (39-dimensional MFCC_0_D_A), no normalisation
  • Word-internal triphone HMMs, 3 states per model
  • Decision-tree state clustering
  • Trained from flat start using only orthographic transcriptions and a SpeechDat lexicon
  • Difficult utterances removed from the training set
  • 1, 2, 4, 8, 16 and 32 diagonal-covariance Gaussian mixtures (training loop sketched below)
  • Re-training on re-segmented material
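The mixture-splitting and re-training steps above map onto standard HTK tools (HERest for Baum-Welch re-estimation, HHEd for raising the number of mixture components). The sketch below is a rough, simplified outline of that final stage driven from Python; the directory layout, file names and config are illustrative assumptions, not the actual refrec script.

```python
import subprocess
from pathlib import Path

# Rough sketch of the final training stage: repeatedly double the number of
# diagonal-covariance Gaussians per state (1 -> 2 -> ... -> 32) and re-estimate
# with HERest after each split.  Paths and file names are assumptions.

CONFIG, SCP, MLF, LIST = "config", "train.scp", "aligned.mlf", "tiedlist"

def reestimate(model_dir: Path, passes: int = 4) -> None:
    """Baum-Welch re-estimation; without -M, HERest updates the models in place."""
    for _ in range(passes):
        subprocess.run(
            ["HERest", "-C", CONFIG, "-I", MLF, "-t", "250.0", "150.0", "1000.0",
             "-S", SCP, "-H", str(model_dir / "macros"),
             "-H", str(model_dir / "hmmdefs"), LIST],
            check=True)

def split_mixtures(src: Path, dst: Path, n_mix: int) -> None:
    """HHEd's MU command raises the number of mixture components per state."""
    dst.mkdir(parents=True, exist_ok=True)
    script = Path(f"mix{n_mix}.hed")
    script.write_text(f"MU {n_mix} {{*.state[2-4].mix}}\n")   # states 2-4: the 3 emitting states
    subprocess.run(
        ["HHEd", "-H", str(src / "macros"), "-H", str(src / "hmmdefs"),
         "-M", str(dst), str(script), LIST],
        check=True)

models = Path("hmm/tied_1mix")                # assumed starting point: tied 1-mixture triphones
for n_mix in (2, 4, 8, 16, 32):               # the 1, 2, 4, ..., 32 schedule from the slide
    nxt = Path(f"hmm/tied_{n_mix}mix")
    split_mixtures(models, nxt, n_mix)
    reestimate(nxt)
    models = nxt
```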
10
MFCC_0_D_A - feature set
Pre-emphasis: 0.97
Frame shift: 10 ms
Analysis window: Hamming
Window length: 25 ms
Spectrum type: FFT magnitude
Filterbank type: Mel-scale
Filter shape: Triangular
Filterbank channels: 26
Cepstral coefficients: 12
Cepstral liftering: 22
Energy feature: C0
Deltas: 13
Delta-deltas: 13
Total features: 39
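These are the standard HTK tutorial front-end settings. For readers who want to reproduce a comparable 39-dimensional feature stream outside HTK, here is a minimal sketch using librosa at an 8 kHz telephone sampling rate (an assumption for illustration; refrec itself computes features with HTK, and librosa's filterbank and DCT details differ slightly).

```python
import librosa
import numpy as np

# Sketch of a 39-dimensional MFCC_0_D_A front end with the parameters above.
# The file name is a placeholder; the output will not match HTK bit-exactly.

SR = 8000                      # telephone speech
WIN = int(0.025 * SR)          # 25 ms analysis window
HOP = int(0.010 * SR)          # 10 ms frame shift

y, _ = librosa.load("utterance.wav", sr=SR)
y = librosa.effects.preemphasis(y, coef=0.97)

mfcc = librosa.feature.mfcc(
    y=y, sr=SR, n_mfcc=13,      # C0..C12 (C0 serves as the energy feature)
    n_fft=256, hop_length=HOP, win_length=WIN, window="hamming",
    n_mels=26, lifter=22)

delta = librosa.feature.delta(mfcc)            # 13 delta features
delta2 = librosa.feature.delta(mfcc, order=2)  # 13 delta-delta features

features = np.vstack([mfcc, delta, delta2])    # shape: (39, n_frames)
print(features.shape)
```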
11
Test design
Common test suite on SpeechDat:
  • I-test: isolated digit recognition (SVIP)
  • Q-test: yes/no recognition (SVIP)
  • A-test: recognition of 30 isolated application words (SVIP)
  • BC-test: unknown-length connected digit string recognition (SVWL)
  • O-test: city name recognition (MVIP)
  • W-test: recognition of phonetically rich words (MVIP)
Test procedures used:
  • SVIP: Small Vocabulary Isolated Phrase
  • MVIP: Medium Vocabulary Isolated Phrase
  • SVWL: Small Vocabulary Word Loop, NIST alignment (scoring sketched below)
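The word error rates reported on the following slides come from minimum edit-distance alignment of each hypothesis against its reference transcription. A minimal, self-contained sketch of that kind of scoring (an illustrative re-implementation, not the NIST alignment tool itself):

```python
# Word-error-rate scoring via minimum edit-distance alignment, in the spirit
# of the NIST alignment used for the SVWL digit-string test.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum substitutions + deletions + insertions needed to turn
    # the first i reference words into the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# one substitution and one deletion in a five-digit string -> 40% WER
print(word_error_rate("one two three four five", "one two six four"))
```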
12
Results
  • Six labs have completed the training procedure on the SpeechDat(II) databases
  • KUN has converted the Dutch Polyphone to SpeechDat(II) format: training only on phonetically rich sentences, tests only on digit strings
  • More details are available on the web
13
Training Statistics
  • External information available (either session list, pronunciation lexicon or a phoneme mapping - see web site)
  • Results are for Refrec v. 0.93
14
A typical training curve
15
Word error rates
Results are for Refrec. v. 0.93
Average number of phonemes in test vocabularies
16
Word error rates - cont.
17
Word error rates - cont.
18
Word error rates - cont.
19
Language-independent considerations
  • Performance probably below state-of-the-art systems
  • No whole-word modelling, no cross-word context (especially needed for connected digits)
  • A lot of noisy training data has been removed
  • No speaker-noise or filled-pause model
  • Feature analyser not robust enough
20
Language differences
  • Mobile database has 3-5 times the error rate of the FDBs; more robust modeling needed
  • Slovenian: high noise level on the recordings
21
Conclusion - part 1
  • Practical/logistic problems mostly solved
  • Future work:
    Improve language and database coverage
    More speakers (Swedish: 5000)
    More challenging tests, large vocabularies
    More analyses
    Improved training procedure, clustering
22
Directory assistance
NaNu
Børge Lindberg, Bo Nygaard Bai, Tom Brøndsted, Jesper Ø. Olsen
  • Recognition of Names & Numbers
  • In collaboration with Tele Danmark
  • Auto attendant/directory assistance applications
  • Large vocabulary - for the first time in Danish
  • Exploiting the SpeechDat(II) database

23
Acoustic modeling - Decision trees
(Ref HTK Book)
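The HTK Book figure referenced here shows phonetic questions about left/right context splitting triphone states into tied clusters. The toy sketch below illustrates the underlying idea of choosing the question with the largest likelihood gain over pooled single-Gaussian statistics; the state statistics and question sets are invented, and this is not the HTK implementation.

```python
import numpy as np

# Toy decision-tree state clustering: each triphone state is summarised by an
# occupancy count, mean and diagonal variance, and a question about the left
# context splits a node if it gives the largest likelihood gain.

def pooled_loglik(states):
    """Log-likelihood (up to constants) of modelling all data in `states`
    with one shared diagonal Gaussian."""
    occ = sum(s["occ"] for s in states)
    mean = sum(s["occ"] * s["mean"] for s in states) / occ
    var = sum(s["occ"] * (s["var"] + (s["mean"] - mean) ** 2) for s in states) / occ
    dim = mean.shape[0]
    return -0.5 * occ * (dim * (1 + np.log(2 * np.pi)) + np.sum(np.log(var)))

def best_question(states, questions):
    """Return the (question, gain) pair with the largest likelihood gain."""
    base = pooled_loglik(states)
    best = None
    for name, phones in questions.items():
        yes = [s for s in states if s["left"] in phones]
        no = [s for s in states if s["left"] not in phones]
        if yes and no:
            gain = pooled_loglik(yes) + pooled_loglik(no) - base
            if best is None or gain > best[1]:
                best = (name, gain)
    return best

rng = np.random.default_rng(0)
states = [{"left": p, "occ": 100.0,
           "mean": rng.normal(size=3), "var": np.ones(3)}
          for p in ["b", "p", "m", "a", "e", "i"]]            # invented contexts
questions = {"L_Stop": {"b", "p"}, "L_Vowel": {"a", "e", "i"}, "L_Nasal": {"m"}}
print(best_question(states, questions))
```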
24
Acoustic modeling of Danish diphthongs
25
Acoustic modeling - CMN
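Slide 25's figure (not recoverable from the transcript) concerns cepstral mean normalisation; the reference recogniser itself used no normalisation. Per-utterance CMN amounts to subtracting the utterance-level mean of each cepstral coefficient, as in this minimal sketch:

```python
import numpy as np

# Per-utterance cepstral mean normalisation: subtract the utterance-level mean
# of each cepstral coefficient so that channel effects that are constant over
# the call (handset, line) are largely removed.

def cmn(features: np.ndarray) -> np.ndarray:
    """features: (n_frames, n_coeffs) array of cepstral features."""
    return features - features.mean(axis=0, keepdims=True)

utterance = np.random.randn(300, 13) + 5.0    # fake features with a channel offset
print(cmn(utterance).mean(axis=0))            # per-dimension means are now ~0
```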
26
Acoustic modeling - decision trees
27
NaNu
  • Acoustic models
    SpeechDat - COST 249
    20k tied-mixture triphones, 6554 clusters
    16-mixture models - 100k mixture components
  • Database
    ¼ million subscribers (Århus and Næstved areas)
    Vocabulary extracted from the database (sketched below), for entries for which
      there is a minimum of two occurrences
      a transcription exists (Onomastica)
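A minimal sketch of the vocabulary-extraction step above: keep every name that occurs at least twice in the subscriber database and has a transcription in the pronunciation lexicon. The file formats and field names are assumptions, not the actual NaNu data layout.

```python
from collections import Counter
import csv

# Sketch of the NaNu vocabulary extraction: keep names occurring at least twice
# in the subscriber database and covered by the pronunciation lexicon
# (Onomastica).  File names, formats and field names are assumptions.

def build_vocabulary(subscriber_csv: str, lexicon_path: str, min_count: int = 2):
    counts = Counter()
    with open(subscriber_csv, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for field in ("first_name", "middle_name", "last_name", "street"):
                name = row.get(field, "").strip().lower()
                if name:
                    counts[name] += 1

    with open(lexicon_path, encoding="utf-8") as f:
        # assume one "word <tab> phonemes" entry per line
        transcribed = {line.split("\t")[0].lower() for line in f if line.strip()}

    return {w for w, c in counts.items() if c >= min_count and w in transcribed}

vocab = build_vocabulary("subscribers.csv", "onomastica.lex")
print(len(vocab), "vocabulary entries")
```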

28
Vocabulary and Coverage
NaNu Vocabulary
Unique database entries, Denmark (source: Tele Danmark)
29
SLANG
  • Recogniser - Spoken LANGuage
  • Speech Recognition Research Platform
  • For Dialogue Systems execution
  • Modular design and implementation (C)
  • Frame synchronous operation
  • Dynamic Tree Structured Decoder
  • Optimised towards large vocabulary recognition
    (Gaussian mixture selection)

30
NLP
  • N-best lists are parsed into semantic frames, and SQL queries are generated according to the following strategy (see the sketch below):
  • 1. simple 1-best match
  • 2. full search in all N-best lists
  • 3. underspecified query (street name and last name required to be contained in the N-best list)
  • Output is converted to synthetic speech.
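A rough sketch of the three-step query strategy, turning N-best name hypotheses into parameterised SQL queries that are tried in order. The table and column names and the N-best representation are assumptions, not the actual NaNu schema.

```python
# Sketch of the three-step N-best-to-SQL strategy described above.  The table
# and column names ("subscribers", "first_name", ...) are assumptions.

def build_queries(nbest_first, nbest_last, nbest_street):
    """Return (sql, parameters) pairs to try in order until one returns rows."""
    placeholders = lambda xs: ",".join("?" * len(xs))
    queries = []

    # 1. simple 1-best match on all slots
    queries.append((
        "SELECT * FROM subscribers "
        "WHERE first_name = ? AND last_name = ? AND street = ?",
        (nbest_first[0], nbest_last[0], nbest_street[0])))

    # 2. full search over every combination in the N-best lists
    queries.append((
        f"SELECT * FROM subscribers "
        f"WHERE first_name IN ({placeholders(nbest_first)}) "
        f"AND last_name IN ({placeholders(nbest_last)}) "
        f"AND street IN ({placeholders(nbest_street)})",
        (*nbest_first, *nbest_last, *nbest_street)))

    # 3. underspecified query: only last name and street name are required
    queries.append((
        f"SELECT * FROM subscribers "
        f"WHERE last_name IN ({placeholders(nbest_last)}) "
        f"AND street IN ({placeholders(nbest_street)})",
        (*nbest_last, *nbest_street)))

    return queries

for sql, params in build_queries(["jens"], ["hansen", "jensen"], ["vestergade"]):
    print(sql, params)
```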

31
Dialogue System
  • Java implementation of dialogue system and telephony server
  • Uses the SLANG speech recognition library in C
  • Connects to a public-domain SQL database (MySQL)
  • System-directed dialogue
  • One word per turn - high perplexity
  • Dynamic, parallel allocation of recognisers (a pooling sketch follows below)
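One way to realise "dynamic, parallel allocation of recognisers" is a pool of recogniser instances handed out to incoming calls and returned when the turn is done. The sketch below shows that pattern with a thread-safe queue; the SlangRecogniser class is a purely hypothetical stand-in for a binding to the SLANG C library.

```python
import queue
import threading

# Sketch of dynamic, parallel allocation of recognisers to incoming calls.
# `SlangRecogniser` is a hypothetical placeholder, not the real C interface.

class SlangRecogniser:
    def __init__(self, ident: int):
        self.ident = ident
    def recognise(self, audio) -> str:
        return f"<n-best list from recogniser {self.ident}>"

POOL_SIZE = 4
pool: "queue.Queue[SlangRecogniser]" = queue.Queue()
for i in range(POOL_SIZE):
    pool.put(SlangRecogniser(i))

def handle_call(call_id: int, audio) -> None:
    rec = pool.get()              # block until a recogniser is free
    try:
        print(f"call {call_id}: {rec.recognise(audio)}")
    finally:
        pool.put(rec)             # return the recogniser to the pool

threads = [threading.Thread(target=handle_call, args=(c, b"")) for c in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```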

32
Performance
  • Lack of test data - SpeechDat data were used (!)
  • Person names task: first name, optional middle name, last name
  • 434 test utterances (speaker independent)
  • Results from a predecessor configuration (10,646 last names, 2,777 first/middle names)
  • Recognition accuracy, 1-best: 39.1%

33
Conclusion - Part 2
  • A real system probably needs application-specific data - not to mention the dialogue aspect!
  • The effect of further acoustic model optimisation (on SpeechDat) may be marginal when N-best lists are used
  • Limited number of pronunciation variants available
  • Immediate steps are:
    test data!
    acoustic validation of retrieved candidates
  • Mixed-initiative dialogue - CPK's incentive to work on NaNu!