Title: The 10-million-words Spoken Dutch Corpus and its potential use in experimental phonetics
1. The 10-million-words Spoken Dutch Corpus and its
potential use in experimental phonetics
- Louis C.W. Pols
- Institute of Phonetic Sciences
- University of Amsterdam
100 Years of Experimental Phonetics in Russia
St. Petersburg State Univ., Febr. 1-4, 2001
2. Amsterdam city center
Herengracht 338
3. Overview
- Introduction
- Corpus design, recording, digitization
- Orthographic transcription
- Part-of-speech tagging, lemmatization and syntactic annotation
- Phonetic transcription
- Prosodic transcription
- Exploration
- Potential phonetic benefit
4. Introduction
- appropriate topic given long Russian tradition
- Dutch-Flemish initiative
- 10 Mƒ, 10 M words (about 1000 hrs of speech)
- start June 1998, 5 yrs, 7 releases (audio + annotations)
- many speaking styles, also over telephone; only adult speakers; ABN variants but no dialect
- for linguistics and speech/language technology
- rights with NTU (http://www.taalunie.nl)
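The size figures above imply an average speaking rate, which can be checked with a few lines of arithmetic. This is only a rough sketch: the two constants are the rounded figures from the slide, and the resulting rates include pauses and silences.

```python
# Rounded figures from the slide: 10 M words, about 1000 hours of speech.
TOTAL_WORDS = 10_000_000
TOTAL_HOURS = 1000

words_per_hour = TOTAL_WORDS / TOTAL_HOURS
words_per_minute = words_per_hour / 60
words_per_second = words_per_minute / 60

print(f"{words_per_hour:.0f} words/hour")
print(f"{words_per_minute:.1f} words/minute")
print(f"{words_per_second:.2f} words/second")
```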
5. Corpus design (number of words x 1000)
dialogues and multilogues
monologues
6. Recording, digitization
- mono or stereo using portable DAT-recorders
- 16 kHz and 16 bit (telephone recordings at 8 kHz and 8 bit)
- .WAV format in PRAAT
- meta data about recording and speaker
- 7 audio releases on CD-ROM, or DVD (future?)
- annotations updated with each release
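The sampling parameters above determine how much storage the audio needs. The following is a minimal sketch of that calculation for uncompressed PCM, ignoring file headers; the function name `pcm_bytes` is made up for this illustration and is not part of any corpus tool.

```python
def pcm_bytes(hours, sample_rate_hz, bits_per_sample, channels=1):
    """Size in bytes of uncompressed PCM audio, ignoring file headers."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * channels * bits_per_sample // 8


# One hour of mono speech at the two formats named on the slide.
wideband = pcm_bytes(1, 16_000, 16)    # 16 kHz, 16 bit
telephone = pcm_bytes(1, 8_000, 8)     # 8 kHz, 8 bit

print(f"16 kHz / 16 bit: {wideband / 1e6:.1f} MB per hour")
print(f" 8 kHz /  8 bit: {telephone / 1e6:.1f} MB per hour")
```

At these rates a mono hour of wideband speech takes roughly four times the space of a telephone hour, which is one reason the number of discs per release matters.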
7. Orthographic transcription (1)
- by trained students, checked by expert
- according to fixed protocol; no text interpretations
- transcription aligned in chunks of a few seconds; multiple tiers
- little punctuation; capitals for names only
- standard spelling conventions, checked vs. lexicon
- special mark-up symbols
- d: dialect words; z: regionally accented words
- t: interjection; a: truncated word; u: mispronunciation
- v: foreign words; n: new words; x: hardly intelligible
- ggg: speaker sounds; xxx: unintelligible word(part)(s)
8. Orthographic transcription (2)
9. Part-of-speech tagging
- all words in the text automatically tagged
- discontinuous verbs not recognized at this level
- Dutch tag set with 10 major word classes
- (noun, adjective, verb, pronoun, article, numeral, preposition, adverb, conjunction, and interjection)
- additional morpho-syntactic features per class
- (e.g., singular, diminutive and neuter for nouns)
- resulting in some 300 tags
- self-learning automatic tagger (given context)
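One way to picture such a tag set is as a major word class plus a list of morpho-syntactic features. The sketch below is illustrative only: the class names follow the ten classes listed above, but the `Tag` type and its string rendering are assumptions made for this example, not the corpus's actual tag notation.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class WordClass(Enum):
    NOUN = "noun"
    ADJECTIVE = "adjective"
    VERB = "verb"
    PRONOUN = "pronoun"
    ARTICLE = "article"
    NUMERAL = "numeral"
    PREPOSITION = "preposition"
    ADVERB = "adverb"
    CONJUNCTION = "conjunction"
    INTERJECTION = "interjection"


@dataclass(frozen=True)
class Tag:
    """Hypothetical tag: a major word class plus its feature values."""
    word_class: WordClass
    features: Tuple[str, ...] = ()

    def __str__(self):
        return f"{self.word_class.value}({','.join(self.features)})"


# The noun features given as an example on the slide.
tag = Tag(WordClass.NOUN, ("singular", "diminutive", "neuter"))
print(tag)
```

Combining ten classes with a handful of feature values per class is how a tag set of this kind grows to a few hundred distinct tags.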
10. Lemmatization
- all words autom. paired with base form (lemma)
- verbs → infinitive (gedaan → doen); other forms → stem (vijfde → vijf); truncated forms → full forms (zn → zijn)
- base form must be an independently existing form
- (hersenen, not hersen; meisje, not meis)
- discontinuous verbs and split prepositions are not recognized at this level (op...bellen, van...uit)
- one and only one base form per word
- (vliegen → verb vliegen or noun vlieg, depending on POS)
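The rule of exactly one base form per word, chosen by part of speech, can be sketched as a lookup keyed on the pair (word form, word class). This is a toy illustration using only the examples from the slide; the table `LEMMAS`, the function `lemmatize`, and the word-class labels given to vijfde and zn are assumptions, and the real lemmatizer is of course not a hand-written dictionary.

```python
# Toy lemma table built from the slide's examples.
# The POS labels for "vijfde" and "zn" are assumptions for illustration.
LEMMAS = {
    ("gedaan", "verb"): "doen",      # verb -> infinitive
    ("vijfde", "numeral"): "vijf",   # other form -> stem
    ("zn", "pronoun"): "zijn",       # truncated form -> full form
    ("vliegen", "verb"): "vliegen",  # same word form, two lemmas,
    ("vliegen", "noun"): "vlieg",    # disambiguated by POS
}


def lemmatize(word, pos):
    """Return the single lemma for a (word, POS) pair.

    Unknown pairs fall back to the word itself, which also covers forms
    such as "hersenen" whose shorter stem does not exist independently.
    """
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


print(lemmatize("vliegen", "verb"))
print(lemmatize("vliegen", "noun"))
print(lemmatize("hersenen", "noun"))
```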
11. Broad phonetic transcription (1)
- on 10% of the data (mainly dialogues)
- hand correction of automatic phonetic transcription
- across-word assimilation, levels of reduction?
- use of extended SAMPA
- within PRAAT
- word level respected
- die ik wel vind dat ze kloppen → di k wEl fInt_tAt s@ klOp@
- no hand segmentation at phoneme level
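Because the word level is respected, each orthographic word can be paired with its own stretch of the transcription. The sketch below does this for the example sentence, with schwa written as SAMPA @. It assumes that an underscore links the transcriptions of two adjacent words that share a boundary (as in fInt_tAt); that reading of the underscore, and the function name `pair_words`, are assumptions made for this illustration.

```python
import re


def pair_words(orthography, transcription):
    """Pair each orthographic word with its SAMPA transcription.

    Assumption: word transcriptions are separated by a space, or by an
    underscore where two words share a boundary.
    """
    words = orthography.split()
    phones = re.split(r"[ _]+", transcription.strip())
    if len(words) != len(phones):
        raise ValueError("word level not respected: counts differ")
    return list(zip(words, phones))


pairs = pair_words(
    "die ik wel vind dat ze kloppen",
    "di k wEl fInt_tAt s@ klOp@",
)
for word, sampa in pairs:
    print(f"{word:8s} {sampa}")
```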
12. Broad phonetic transcription (2)
13. Signal coupling, word alignment
- the phonetically transcribed part (1 M words) will be automatically aligned at word level
- using ASR techniques (forced alignment)
- this word alignment will be hand corrected
- pauses and noises will also be aligned
- geminate plosives are aligned separately, others shared (komt terug → kom t erug; is zeker → isseker)
- inserted phonemes are shared with neighbouring words (toen belde n ie naar huis → belden nie)
- all the rest may be automatically aligned only
- few seconds chunks are always accessible
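The idea behind forced alignment can be illustrated with a toy dynamic-programming search: the word sequence is known, and the task is only to decide where each word starts and ends. The sketch below is a deliberate simplification, not the procedure used for the corpus. It assumes a ready-made table `scores[t][w]` of log-scores saying how well frame t fits word w, whereas a real ASR system would derive such scores from acoustic models at the phone or state level.

```python
def forced_align(scores):
    """Assign every frame to a word, in order, maximizing the total score.

    scores[t][w] is a log-score for frame t belonging to word w.
    Each word gets at least one frame; words cannot be skipped or reordered.
    Returns a list with the word index chosen for each frame.
    """
    n_frames, n_words = len(scores), len(scores[0])
    if n_frames < n_words:
        raise ValueError("need at least one frame per word")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_words for _ in range(n_frames)]
    back = [[0] * n_words for _ in range(n_frames)]
    best[0][0] = scores[0][0]  # the first frame must belong to the first word
    for t in range(1, n_frames):
        for w in range(n_words):
            stay = best[t - 1][w]                           # remain in word w
            move = best[t - 1][w - 1] if w > 0 else neg_inf  # enter word w
            if move > stay:
                best[t][w], back[t][w] = move + scores[t][w], w - 1
            else:
                best[t][w], back[t][w] = stay + scores[t][w], w
    # Trace back from the last word at the last frame.
    path = [n_words - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def word_spans(path):
    """Convert a frame-to-word path into (start_frame, end_frame) spans."""
    spans, start = [], 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[t - 1]:
            spans.append((start, t))
            start = t
    return spans


# Six frames, three words; the high scores mark the intended segmentation.
toy_scores = [
    [0, -5, -5],
    [0, -5, -5],
    [-5, 0, -5],
    [-5, 0, -5],
    [-5, 0, -5],
    [-5, -5, 0],
]
path = forced_align(toy_scores)
print(path)
print(word_spans(path))
```

Multiplying the frame indices by the frame shift of the analysis gives the word boundaries in seconds, which is the kind of output a human corrector would then check.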
14. Syntactic annotation
- 10% will be semi-automatically annotated
- procedure still under development
- interactive annotation software from NEGRA project (Saarbrücken) will be used
- taking into account idiosyncrasies of speech, such as hesitations, false starts, clause extensions
- functional information (dependency labels)
- category information (in form of node labels)
15. Prosodic annotation
- manually, on 250K words subset only
- procedure still under development
- prosodic markers in orthography
- 1) prosodic boundaries
- long silences (?)
- phrase boundaries (?)
- other discontinuities, like (filled) pauses ()
- 2) prominence ( before vowel in prominent syllable)
- sp. A nee ? Jan heeft negen medailles ? zeven medailles. ?
- sp. B zeven ?
16. Exploration software
- COREX tool under development (Max Planck Inst.)
- both locally and internet-based (Java)
- 1) browser
- 2) viewer for orthography and annotations, plus waveform display and audio player (time synchr.)
- 3) search module, also on meta data
17. Potential phonetic benefit
- huge database, many speakers/styles, real speech
- easily accessible via orthography, plus audio
- partly accessible via phonetic transcription
- no segmentation at phoneme level (automatic?)
- automatic segmentation at word level
- after COREX search own additions possible
- e.g. spectro-temporal analyses via PRAAT scripts
- e.g. svarabhakti vowel, final n-deletion, assimilation
- e.g. vowel reduction, turn-taking behavior, etc.
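As an illustration of the kind of study the word-level transcription could support, the sketch below counts candidate cases of final n-deletion by comparing orthography with a SAMPA transcription. It is a crude heuristic written for this example only: it assumes schwa is written as @, treats every word spelled with final -en as a candidate, and uses made-up helper names.

```python
def final_n_deleted(word, sampa):
    """Heuristic: spelled with final -en, but transcribed ending in schwa."""
    return word.lower().endswith("en") and sampa.endswith("@")


def n_deletion_rate(pairs):
    """Share of -en words whose final n is absent in the transcription."""
    candidates = [(w, s) for w, s in pairs if w.lower().endswith("en")]
    if not candidates:
        return None
    deleted = sum(final_n_deleted(w, s) for w, s in candidates)
    return deleted / len(candidates)


# (word, SAMPA) pairs; the first is the slide's own example, the second
# is a constructed full pronunciation for contrast.
examples = [("kloppen", "klOp@"), ("kloppen", "klOp@n"), ("wel", "wEl")]
print(n_deletion_rate(examples))
```

A real analysis would restrict the candidates by word class and context, and could then go back to the audio for spectro-temporal measurements on the selected tokens.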
18. More information
- see references in paper
- see websites mentioned in paper
- second release Oct. 2000
- new releases every half year
- feedback from users group (workshops)
- useful for proposed INTAS project "Spontaneous speech of typologically unrelated languages (Russian, Finnish and Dutch): Comparison of phonetic properties" (De Silva, 2000)