Title: 100K+ Words, Machine-Readable Pronunciation Dictionary for the Romanian Language

1
Title: 100K+ Words, Machine-Readable Pronunciation
Dictionary for the Romanian Language
Author: József DOMOKOS
2
CONTENTS
  • Introduction
  • Scope of the presentation
  • Used grapheme set
  • Used phoneme set
  • Pronunciation dictionary development
  • Testing
  • Conclusions, future works
  • Acknowledgments

3
INTRODUCTION
  • Pronunciation dictionaries are key resources for
    spoken language technology. They are widely used
    in automatic speech recognition (ASR) and
    text-to-speech (TTS) synthesis applications.
  • They form the basis of automated segmentation of
    speech at the phonetic level.
  • Predicting the pronunciation of a written word is
    an important sub-task of all speech production
    systems.

4
INTRODUCTION
  • To the best of our knowledge, no large,
    machine-readable pronunciation dictionary is
    available for the Romanian language.
  • For English, such dictionaries exist:
      • Carnegie Mellon University (CMU) Pronouncing
        Dictionary
      • British English Example Pronunciations (BEEP)
  • For under-resourced languages such as Romanian,
    the existence of a pronunciation dictionary can
    considerably speed up ASR and TTS system
    development.

5
SCOPE OF THE PRESENTATION
  • The scope of this presentation is to introduce a
    newly developed Romanian language pronunciation
    dictionary.
  • The dictionary contains more than 100k words from
    the DexOnline dictionary together with their
    phonetic transcriptions in the machine-readable
    SAMPA alphabet.
  • The pronunciation dictionary is freely available
    on the project website in HTK and Festival
    dictionary formats.

6
THE USED GRAPHEME SET
  • The 31 graphemes used for modern Romanian writing
    (according to DOOM II):

a ă â b c d
e f g h i î
j k l m n o
p q r s ș t
ț u v w x y
z
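The grapheme set above can be written down as a small lookup for checking word lists. This is a minimal sketch, not part of the original presentation: the names `ROMANIAN_GRAPHEMES` and `unknown_graphemes` are hypothetical, and the comma-below forms of ș and ț are assumed.

```python
# The 31 lowercase graphemes of modern Romanian writing.
# Assumption: ș and ț use the comma-below code points (U+0219, U+021B).
ROMANIAN_GRAPHEMES = set("aăâbcdefghiîjklmnopqrsștțuvwxyz")


def unknown_graphemes(word: str) -> set[str]:
    """Return the characters of `word` that are outside the grapheme set."""
    return set(word.lower()) - ROMANIAN_GRAPHEMES


if __name__ == "__main__":
    for candidate in ["aclimatizat", "hidro-log"]:
        print(candidate, sorted(unknown_graphemes(candidate)))
```

A word with an empty result uses only the listed graphemes; anything else (hyphens, digits, cedilla variants of ș and ț) is reported so it can be normalized or rejected before transcription.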
7
THE USED PHONEME SET
  • The phoneme set, presented in SAMPA coding,
    consists of:
      • the phonetically null unit sil
      • 7 vowels
      • 4 semivowels
      • 20 consonants

a @ 1 b k d
e e_X f g h i
i_0 j l m n o
o_X p r s S t
ts tS u v z Z dZ sil
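The inventory above can be used to sanity-check transcriptions. The sketch below is illustrative only: the schwa is written as SAMPA `@`, the split of the symbols into vowels, semivowels and consonants is inferred from the counts on the slide rather than stated by the author, and `invalid_symbols` is a hypothetical helper name.

```python
# Phoneme inventory from the slide, in SAMPA coding.
# Assumption: the grouping below is inferred from the stated counts (7 / 4 / 20).
VOWELS = {"a", "@", "1", "e", "i", "o", "u"}
SEMIVOWELS = {"e_X", "i_0", "j", "o_X"}
CONSONANTS = {
    "b", "k", "d", "f", "g", "h", "l", "m", "n", "p",
    "r", "s", "S", "t", "ts", "tS", "v", "z", "Z", "dZ",
}
SILENCE = {"sil"}
PHONEMES = VOWELS | SEMIVOWELS | CONSONANTS | SILENCE


def invalid_symbols(pronunciation: str) -> list[str]:
    """Return the space-delimited symbols that are not in the phoneme set."""
    return [p for p in pronunciation.split() if p not in PHONEMES]


if __name__ == "__main__":
    print(len(PHONEMES))
    print(invalid_symbols("h i d r o t e h n i k"))
```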
8
PRONUNCIATION DICTIONARY DEVELOPMENT
  • The word list consists of words exported from the
    DexOnline dictionary.
  • Importing an initial dictionary provides rules
    that are used to predict pronunciations for the
    words in the word list.
  • 5k words were transcribed using our previously
    presented ANN-based automated grapheme-to-phoneme
    transcription system and used as the initial
    dictionary.

9
PRONUNCIATION DICTIONARY DEVELOPMENT
  • We have recorded and segmented the Romanian
    phonemes of the phoneme set presented earlier.
  • These audio files were provided to Dictionary
    Maker as phoneme audio samples, so that it can
    generate an audible version of the transcription
    of each word in the word list.
  • The system runs through the word list word by
    word, predicts a pronunciation based on the rules
    extracted from the initial dictionary, and sounds
    out the phonemes of the word.
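The idea of extracting rules from an initial dictionary and applying them to new words can be illustrated with a deliberately simple stand-in. This is not Dictionary Maker's rule-extraction algorithm: the sketch assumes one phoneme per grapheme, learns only the most frequent phoneme for each grapheme, and uses hypothetical names (`learn_rules`, `predict`).

```python
from collections import Counter, defaultdict


def learn_rules(initial_dictionary):
    """Map each grapheme to the phoneme it is most often paired with.

    Simplification: only entries with exactly one phoneme per grapheme
    are used, so the two sequences can be aligned position by position.
    """
    counts = defaultdict(Counter)
    for word, phones in initial_dictionary.items():
        if len(word) != len(phones):
            continue  # cannot be aligned one-to-one; skip in this sketch
        for grapheme, phone in zip(word, phones):
            counts[grapheme][phone] += 1
    return {g: c.most_common(1)[0][0] for g, c in counts.items()}


def predict(word, rules):
    """Predict a pronunciation; unseen graphemes are marked with '?'."""
    return [rules.get(grapheme, "?") for grapheme in word]


if __name__ == "__main__":
    # Two entries taken from the dictionary examples in this presentation.
    initial = {
        "aclamat": "a k l a m a t".split(),
        "hidrolog": "h i d r o l o g".split(),
    }
    rules = learn_rules(initial)
    print(predict("calm", rules))
```

Context-free rules of this kind fail wherever a grapheme's pronunciation depends on its neighbours, which is one reason every prediction is still reviewed by a human listener.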

10
TESTING
  • The user listens to the generated pronunciation
    variant and judges the accuracy of the
    word-pronunciation pair by choosing one of the
    following verdicts:
      • Correct: The word is a valid word in the
        language concerned and its pronunciation as
        displayed is correct.
      • Invalid: The word is not a valid word (e.g. it
        is a URL), or it is misspelled, or it is only
        part of a word.
      • Uncertain: The user is unable to decide whether
        the word and its pronunciation are valid.
      • Ambiguous: There are multiple valid
        pronunciations, i.e. pronunciation variants.
      • Proper noun: The word is a proper noun.
      • Foreign: The word is a valid word in a foreign
        language, but not in the source language.
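The six verdicts map naturally onto an enumeration. The sketch below is a hypothetical illustration of how review results might be recorded and filtered; keeping only entries judged Correct is an assumption made here, not a policy stated in the presentation, and the sample reviews are invented for the example.

```python
from collections import Counter
from enum import Enum


class Verdict(Enum):
    CORRECT = "Correct"
    INVALID = "Invalid"
    UNCERTAIN = "Uncertain"
    AMBIGUOUS = "Ambiguous"
    PROPER_NOUN = "Proper noun"
    FOREIGN = "Foreign"


def accepted_entries(reviews):
    """Keep the (word, pronunciation) pairs judged Correct (assumed policy)."""
    return [(word, pron) for word, pron, verdict in reviews
            if verdict is Verdict.CORRECT]


def tally(reviews):
    """Count how many reviewed entries received each verdict."""
    return Counter(verdict for _, _, verdict in reviews)


if __name__ == "__main__":
    # Invented sample reviews, for illustration only.
    reviews = [
        ("aclamat", "a k l a m a t", Verdict.CORRECT),
        ("http", "h t t p", Verdict.INVALID),
    ]
    print(accepted_entries(reviews))
    print(tally(reviews))
```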

11
PRONUNCIATION DICTIONARY
  • The created dictionary was exported as text with
    UTF-8 character encoding and is available in two
    formats:
      • HTK dictionary format, with the following
        syntax for each line:
        <transcription> <space> <pronunciation>
      • Festival dictionary format, with the following
        layout for each line:
        (<transcription> <nil> (<pronunciation>))
  • Each phoneme is delimited by spaces in
    <pronunciation>.

HTK dictionary format example:

aclamat a k l a m a t
aclimata a k l i m a t a
aclimatare a k l i m a t a r e
aclimatiza a k l i m a t i z a
aclimatizat a k l i m a t i z a t

Festival dictionary format example:

("hidrolog" nil (h i d r o l o g))
("hidrotehnic" nil (h i d r o t e h n i k))
("higrofit" nil (h i g r o f i t))
("higrofobie" nil (h i g r o f o b i j e))
("himenorafie" nil (h i m e n o r a f i j e))
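Both line layouts are simple enough to read with a few lines of code. The following sketch handles only the plain forms shown in the examples above (a word followed by space-delimited phonemes, and a quoted word with `nil` and a parenthesized phoneme list); the function names are hypothetical and optional fields that full HTK or Festival lexicons may carry are not handled.

```python
import re

# Matches lines such as: ("hidrolog" nil (h i d r o l o g))
FESTIVAL_ENTRY = re.compile(r'^\("([^"]+)"\s+nil\s+\(([^)]*)\)\)$')


def parse_htk_line(line):
    """Split 'word p1 p2 ...' into (word, [p1, p2, ...])."""
    word, *phones = line.split()
    return word, phones


def parse_festival_line(line):
    """Parse '("word" nil (p1 p2 ...))' into (word, [p1, p2, ...])."""
    match = FESTIVAL_ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a Festival-style entry: {line!r}")
    return match.group(1), match.group(2).split()


def to_festival(word, phones):
    """Render an entry in the Festival layout shown above."""
    return f'("{word}" nil ({" ".join(phones)}))'


if __name__ == "__main__":
    word, phones = parse_htk_line("aclamat a k l a m a t")
    print(to_festival(word, phones))
    print(parse_festival_line('("hidrolog" nil (h i d r o l o g))'))
```

Because both formats carry the same information, converting between them is a matter of parsing one layout and rendering the other.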
12
CONCLUSIONS
  • We have created the first 100k-word,
    machine-readable pronunciation dictionary for the
    Romanian language, based on the words from the
    lexem table of the DexOnline dictionary.
  • The generated transcription dictionary, together
    with the grapheme set and the recorded and
    segmented audio samples for the phoneme set, can
    be freely downloaded from the NaviRO project
    website:
    http://users.utcluj.ro/jdomokos/naviro/
  • Any suggestions, corrections and observations are
    welcome.

13
FUTURE WORK
  • We intend to continually update the dictionary by
    correcting existing entries and adding new ones.
  • We are also interested in generating
    pronunciations for Romanian personal names and
    institution names.
  • We want to use this dictionary for automatic
    phonetic transcription of our Romanian-language
    broadcast speech database.

14
ACKNOWLEDGMENTS
  • This paper was supported by the project "Develop
    and support multidisciplinary postdoctoral
    programs in primordial technical areas of
    national strategy of the research - development -
    innovation" 4D-POSTDOC, contract nr.
    POSDRU/89/1.5/S/52603, a project co-funded by the
    European Social Fund through the Sectorial
    Operational Program Human Resources 2007-2013.