The 10-million-word Spoken Dutch Corpus and its potential use in experimental phonetics

1
The 10-million-word Spoken Dutch Corpus and its
potential use in experimental phonetics
  • Louis C.W. Pols
  • Institute of Phonetic Sciences
  • University of Amsterdam

100 Years of Experimental Phonetics in Russia,
St. Petersburg State Univ., Feb. 1-4, 2001
2
Amsterdam city center
Herengracht 338
3
Overview
  • Introduction
  • Corpus design, recording, digitization
  • Orthographic transcription
  • Part-of-speech tagging, lemmatization and
    syntactic annotation
  • Phonetic transcription
  • Prosodic transcription
  • Exploration
  • Potential phonetic benefit

4
Introduction
  • appropriate topic given long Russian tradition
  • Dutch-Flemish initiative
  • 10 Mƒ, 10 M words (about 1000 hrs of speech)
  • start June 1998, 5 yrs, 7 releases (audio + annotations)
  • many speaking styles, also over telephone, only
    adult speakers, ABN variants but no dialect
  • for linguistics and speech/language technology
  • rights with NTU (http://www.taalunie.nl)

5
Corpus design (number of words x 1000)
dialogues and multilogues
monologues
6
Recording, digitization
  • mono or stereo using portable DAT-recorders
  • 16 kHz and 16 bit (telephone recordings at 8
    kHz and 8 bit)
  • .WAV format in PRAAT
  • metadata about recording and speaker
  • 7 audio releases on CD-ROM, or DVD (future?)
  • annotations updated with each release
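
The sampling figures above imply a rough storage requirement, which helps explain the CD-ROM versus DVD question. A minimal back-of-the-envelope sketch follows. It assumes that all 1000 hours are uncompressed mono audio at 16 kHz / 16 bit (the slides also mention stereo and 8 kHz telephone recordings, which are ignored here) and a nominal CD-ROM capacity of 650 MB; both figures are assumptions, not values from the slides.

```python
# Rough storage estimate for uncompressed PCM audio.
# ASSUMPTION: all 1000 hours are mono at 16 kHz / 16 bit; stereo and
# 8 kHz telephone recordings are ignored in this sketch.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2            # 16 bit
HOURS = 1_000
SECONDS_PER_HOUR = 3_600

total_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * HOURS * SECONDS_PER_HOUR
total_gb = total_bytes / 1e9

# ASSUMPTION: nominal CD-ROM capacity of 650 MB.
CD_ROM_BYTES = 650e6
discs = total_bytes / CD_ROM_BYTES

print(f"{total_gb:.1f} GB of audio, roughly {discs:.0f} CD-ROMs")
```

Under these assumptions the audio alone comes to about 115 GB, so the full corpus would span well over a hundred CD-ROMs.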

7
Orthographic transcription (1)
  • by trained students, checked by expert
  • according to fixed protocol; no text
    interpretations
  • transcription aligned in chunks of a few
    seconds; multiple tiers
  • few punctuation marks; capitals for names only
  • standard spelling conventions, checked vs.
    lexicon
  • special mark-up symbols:
  • d dialect words; z regionally accented words
  • t interjection; a truncated word; u
    mispronunciation
  • v foreign words; n new words; x hardly
    intelligible
  • ggg speaker sounds; xxx unintelligible
    word(part)(s)
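
The mark-up codes above lend themselves to simple automatic filtering. The sketch below shows one possible way to read them. The slide does not show how a code is attached to a word, so the "word*code" form used here (for example "okay*v") is an assumption, and `classify_token` is a hypothetical helper name.

```python
# Meanings of the single-letter codes listed on the slide.
MARKUP_CODES = {
    "d": "dialect word",
    "z": "regionally accented word",
    "t": "interjection",
    "a": "truncated word",
    "u": "mispronunciation",
    "v": "foreign word",
    "n": "new word",
    "x": "hardly intelligible",
}
# Stand-alone tokens that replace a word entirely.
SPECIAL_TOKENS = {
    "ggg": "speaker sound",
    "xxx": "unintelligible word(part)(s)",
}


def classify_token(token, sep="*"):
    """Return (word, label); label is None for an unmarked word.

    ASSUMPTION: a code is attached as word + sep + letter, e.g. "okay*v".
    """
    if token in SPECIAL_TOKENS:
        return token, SPECIAL_TOKENS[token]
    word, found, code = token.rpartition(sep)
    if found and code in MARKUP_CODES:
        return word, MARKUP_CODES[code]
    return token, None


for tok in "dat is okay*v ggg".split():
    print(classify_token(tok))
```

A filter like this would let a user drop or select, say, all foreign or hardly intelligible words before counting word frequencies.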

8
Orthographic transcription (2)
9
Part-of-speech tagging
  • all words in the text automatically tagged
  • discontinuous verbs not recognized at this level
  • Dutch tag set with 10 major word classes
    (noun, adjective, verb, pronoun, article,
    numeral, preposition, adverb, conjunction, and
    interjection)
  • additional morpho-syntactic features per class
    (e.g., singular, diminutive and neuter for nouns),
    resulting in some 300 tags
  • self-learning automatic tagger (given context)

10
Lemmatization
  • all words automatically paired with base form
    (lemma)
  • verbs → infinitive (gedaan → doen); other forms
    → stem (vijfde → vijf); truncated forms → full
    forms (zn → zijn)
  • base form must be an independently existing form
    (so hersenen is not reduced to hersen, nor meisje
    to meis)
  • discontinuous verbs and split prepositions are
    not recognized at this level (op...bellen;
    van...uit)
  • one and only one base form per word
    (vliegen → verb vliegen or noun vlieg, depending
    on POS)
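
The rule that each word receives exactly one base form, chosen by its part of speech, can be illustrated with a small lookup table. This is only a sketch: the entries are the examples from the slide, the class names are illustrative rather than the actual corpus tag set, and falling back to the word form itself for unknown words is an assumption.

```python
# Toy lemma table keyed by (word form, major word class).
# Entries come from the slide's examples; the class labels are
# illustrative names, not the corpus tag set.
LEMMAS = {
    ("gedaan", "verb"): "doen",
    ("vijfde", "numeral"): "vijf",
    ("zn", "pronoun"): "zijn",
    ("vliegen", "verb"): "vliegen",
    ("vliegen", "noun"): "vlieg",
    # Base forms must exist independently, so these stay unreduced.
    ("hersenen", "noun"): "hersenen",
    ("meisje", "noun"): "meisje",
}


def lemmatize(word, pos):
    """Return the single base form for a (word, POS) pair.

    ASSUMPTION: unknown pairs fall back to the lower-cased word itself.
    """
    key = (word.lower(), pos)
    return LEMMAS.get(key, word.lower())


print(lemmatize("vliegen", "verb"))
print(lemmatize("vliegen", "noun"))
```

Keying the table on the word class is what resolves the vliegen ambiguity: the same surface form maps to different lemmas once the tagger has decided between verb and noun.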

11
Broad phonetic transcription (1)
  • on 10% of the data (mainly dialogues)
  • hand correction of automatic phonetic
    transcription
  • across-word assimilation, levels of reduction?
  • use of extended SAMPA
  • within PRAAT
  • word level respected
  • die ik wel vind dat ze kloppen → di k wEl
    fInt_tAt s@ klOp@
  • no hand segmentation at phoneme level
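
Because the word level is respected, each orthographic word can be paired with its transcribed form. The sketch below does this for the example sentence, written here with SAMPA @ for schwa. It assumes that an underscore inside a transcription token (as in fInt_tAt) marks a boundary shared by two adjacent words; that reading and the helper name `pair_words` are assumptions, not part of the slides.

```python
def pair_words(orthography, transcription, shared="_"):
    """Pair each orthographic word with its transcribed form.

    ASSUMPTION: `shared` joins the forms of two adjacent words whose
    boundary is shared, so "fInt_tAt" covers two words.
    """
    words = orthography.split()
    phones = [
        part
        for token in transcription.split()
        for part in token.split(shared)
    ]
    if len(words) != len(phones):
        raise ValueError(
            f"{len(words)} words but {len(phones)} transcribed forms"
        )
    return list(zip(words, phones))


pairs = pair_words(
    "die ik wel vind dat ze kloppen",
    "di k wEl fInt_tAt s@ klOp@",
)
for word, phones in pairs:
    print(f"{word:8s} {phones}")
```

The length check is a cheap sanity test: if a transcriber merges or drops a word, the mismatch is reported instead of silently shifting all later pairs.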

12
Broad phonetic transcription (2)
13
Signal coupling, word alignment
  • the phonetically transcribed part (1 M words)
    will be automatically aligned at word level
  • using ASR techniques (forced alignment)
  • this word alignment will be hand corrected
  • pauses and noises will also be aligned
  • geminate plosives are aligned separately, others
    shared (komt terug → kom t erug; is zeker →
    isseker)
  • inserted phonemes are shared with neighbouring
    words (toen belde n ie naar huis → belden nie)
  • all the rest may be automatically aligned only
  • chunks of a few seconds are always accessible

14
Syntactic annotation
  • 10% will be semi-automatically annotated
  • procedure still under development
  • interactive annotation software from NEGRA
    project (Saarbrücken) will be used
  • taking into account idiosyncrasies of speech,
    such as hesitations, false starts, clause
    extensions
  • functional information (dependency labels)
  • category information (in form of node labels)

15
Prosodic annotation
  • manually, on 250K words subset only
  • procedure still under development
  • prosodic markers in orthography
  • 1) prosodic boundaries
  • long silences (?)
  • phrase boundaries (?)
  • other discontinuities, like (filled) pauses ()
  • 2) prominence ( before vowel in prominent
    syllable)
  • sp. A nee ? Jan heeft negen medailles ?
    zeven medailles. ?
  • sp. B zeven ?

16
Exploration software
  • COREX tool under development (Max Planck Inst.)
  • both locally and internet-based (Java)
  • 1) browser
  • 2) viewer for orthography and annotations, plus
    waveform display and audio player (time-synchronized)
  • 3) search module, also on metadata

17
Potential phonetic benefit
  • huge database, many speakers/styles, real speech
  • easily accessible via orthography, plus audio
  • partly accessible via phonetic transcription
  • no segmentation at phoneme level (automatic?)
  • automatic segmentation at word level
  • after COREX search own additions possible
  • e.g. spectro-temporal analyses via PRAAT scripts
  • e.g. svarabhakti vowel, final n-deletion,
    assimilation
  • e.g. vowel reduction, turn-taking behavior, etc.
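
As one illustration of such a user-added analysis, final n-deletion can be estimated from word/transcription pairs alone. The following is a deliberately crude sketch: it treats any word spelled with final -en as a candidate and counts it as n-deleted when its broad transcription ends in schwa (@). The heuristic, the helper names, and the sample pair for lopen are all assumptions made for illustration; only kloppen → klOp@ comes from the slides.

```python
def final_n_deleted(word, phones):
    """Heuristic: spelled with final -en, transcribed with final schwa."""
    return word.lower().endswith("en") and phones.endswith("@")


def deletion_rate(pairs):
    """Share of -en words whose final n is absent in the transcription.

    Returns None when there are no candidate words.
    """
    candidates = [(w, p) for w, p in pairs if w.lower().endswith("en")]
    if not candidates:
        return None
    deleted = sum(final_n_deleted(w, p) for w, p in candidates)
    return deleted / len(candidates)


# Sample data: "kloppen" is from the slides; "lopen" is an invented
# example with the final n retained.
sample_pairs = [("kloppen", "klOp@"), ("ze", "s@"), ("lopen", "lop@n")]
print(deletion_rate(sample_pairs))
```

A real study would restrict the candidates by part of speech and lemma (both available in the corpus annotations) before drawing conclusions from such a rate.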

18
More information
  • see references in paper
  • see websites mentioned in paper
  • second release Oct. 2000
  • new releases every half year
  • feedback from users group (workshops)
  • useful for proposed INTAS project "Spontaneous
    speech of typologically unrelated languages
    (Russian, Finnish and Dutch): Comparison of
    phonetic properties" (De Silva, 2000)