Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations

Learn more at: http://www.lrec-conf.org
1
Structural Metadata Annotation of Speech
Corpora: Comparing Broadcast News and Broadcast
Conversations
  • Jáchym Kolár, Jan Švec

University of West Bohemia in Pilsen, Czech
Republic
2
Talk Overview
  • Structural metadata annotation
  • Speech data
  • Statistics about fillers
  • Statistics about edit disfluencies
  • Statistics about sentence-like units
  • Summary

3
Structural Metadata Extraction
  • Metadata Extraction (MDE) research started as
    part of the DARPA EARS program
  • Metadata annotation scheme for MDE introduced
    by LDC (originally for English; we have extended
    it to Czech)
  • Ultimate goal of MDE:
      • Automatic conversion of raw speech recognition
        output to forms more useful to humans and
        downstream automatic processes

4
MDE Annotation Subtasks
  • Boundaries of syntactic/semantic units (SUs)
      • Statements, Interrogatives, Incompletes
      • Coordination breaks, Clausal breaks
  • Non-content words (fillers)
      • Filled pauses (FPs)
      • Discourse markers (DMs)
  • Speech disfluencies (edits)
      • Deletable regions (DelRegs), Interruption points,
        Explicit editing terms, Corrections
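
To make the subtasks concrete, here is a minimal Python sketch of how filler and DelReg labels could be used to clean up one utterance. The labels are hand-assigned for illustration to the first sentence of the annotation example; they are an assumption, not the official annotation, and the names `LABELED`, `REMOVABLE` and `clean_su` are hypothetical.

```python
# (word, label) pairs. Labels are hand-assigned for illustration only:
# "FP" = filled pause, "DM" = discourse marker, "DelReg" = deletable region,
# None = word that is kept.
LABELED = [
    ("but", None), ("I", None), ("you", "DM"), ("know", "DM"),
    ("really", None), ("pre-", "DelReg"), ("uh", "FP"), ("prefer", None),
    ("this", None), ("form", None), ("of", "DelReg"), ("of", None),
    ("um", "FP"), ("presentation", None),
]

REMOVABLE = {"FP", "DM", "DelReg"}


def clean_su(labeled, final_punct="."):
    """Drop fillers and deletable regions, then capitalize and punctuate."""
    kept = [word for word, label in labeled if label not in REMOVABLE]
    sentence = " ".join(kept)
    return sentence[:1].upper() + sentence[1:] + final_punct


print(clean_su(LABELED))
```

Under these assumed labels the sketch prints "But I really prefer this form of presentation."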

5
MDE Annotation Example
With SU boundary symbols:
but I you know really pre- uh prefer this
form of of um presentation/. she Sheila
told me on Tuesday no on Wednesday/, she
didn't/. so let's move on/, because we don't
have uh don't have time/. well do you like
this this example/?
Same utterance without annotation:
but I you know really pre- uh prefer this form
of of um presentation she Sheila told me
on Tuesday no on Wednesday she didn't so
let's move on because we don't have uh don't
have time well do you like this this example
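
The SU symbols in the example can be read mechanically. The sketch below splits the annotated text into SUs, assuming that "/." and "/?" close an SU (statement and interrogative) and that "/," marks an SU-internal clausal break. The slide does not spell out these meanings, so treat them as an assumption; the function name `split_sus` is hypothetical.

```python
# Assumed symbol meanings (not stated on the slide):
#   "/." statement, "/?" interrogative -> close an SU
#   "/," clausal break                  -> SU-internal, does not close an SU
SU_FINAL = {"/.", "/?"}
SU_INTERNAL = {"/,"}

ANNOTATED = (
    "but I you know really pre- uh prefer this form of of um presentation/. "
    "she Sheila told me on Tuesday no on Wednesday/, she didn't/. "
    "so let's move on/, because we don't have uh don't have time/. "
    "well do you like this this example/?"
)


def split_sus(text):
    """Return one (words, final_symbol, n_internal_breaks) tuple per SU."""
    sus, words, internal = [], [], 0
    for token in text.split():
        suffix = token[-2:]
        has_symbol = suffix in SU_FINAL or suffix in SU_INTERNAL
        words.append(token[:-2] if has_symbol else token)
        if suffix in SU_INTERNAL:
            internal += 1
        elif suffix in SU_FINAL:
            sus.append((words, suffix, internal))
            words, internal = [], 0
    if words:  # trailing words without a closing symbol
        sus.append((words, None, internal))
    return sus


for words, symbol, internal in split_sus(ANNOTATED):
    print(len(words), symbol, internal)
```

On this example the sketch yields four SUs of 14, 11, 12 and 7 words, two of which contain one internal break.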
6
Goal of This Paper
  • Analyse and compare two Czech MDE corpora from
    different domains in terms of metadata statistics
  • Compare Czech Broadcast News (BN) vs. Broadcast
    Conversations (BC)
  • Also compare Czech and English MDE corpora:
    English Broadcast News and Conversational
    Telephone Speech (CTS)

7
Czech Broadcast News Data
  • News from 3 TV channels and 4 radio stations
  • Both public and commercial broadcast companies
  • Differing in presentation style
  • 26 hours of transcribed speech
  • 300 speakers
  • Speech recordings and verbatim transcripts
    publicly available from LDC

8
Broadcast Conversation Data
  • 52 recordings of a Czech radio talk show
    Radioforum
  • 24 hours of transcribed speech
  • 100 speakers
  • 1-3 guests spontaneously answer questions asked
    by 1-2 interviewers
  • Mostly political debates
  • Currently being extended by an additional 20
    recordings (10 hours)

9
Statistics about Fillers
  • Filled pauses more frequent in Czech Broadcast
    Conversations (3.8% of words) than in News (0.5%)
      • English MDE: CTS 2.2%, BN 1.4%
  • Discourse markers also more frequent in Czech
    Conversations (1.6%) than in News (0.1%)
      • English MDE: CTS 4.4%, BN 0.5%
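
The filler rates are expressed as a percentage of words. A minimal sketch of that computation follows, using hand-assigned labels for the first sentence of the annotation example. The labels and the helper `percent_of_words` are illustrative assumptions, and a two-word discourse marker is counted here as two words, which may differ from how the corpus statistics were counted.

```python
def percent_of_words(labels, target):
    """Share of word tokens carrying `target`, as a percentage of all words."""
    if not labels:
        return 0.0
    return 100.0 * sum(label == target for label in labels) / len(labels)


# One label per word of "but I you know really pre- uh prefer this form
# of of um presentation"; hand-assigned for illustration only.
LABELS = [None, None, "DM", "DM", None, None, "FP", None,
          None, None, None, None, "FP", None]

fp_rate = percent_of_words(LABELS, "FP")
dm_rate = percent_of_words(LABELS, "DM")
print(f"FP: {fp_rate:.1f}% of words, DM: {dm_rate:.1f}% of words")
```

A single disfluent sentence gives far higher rates than the corpus-level figures, which average over many fluent stretches.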

10
Statistics about Edit Disfluencies
  • Deletable regions: 2.8% of words in
    Conversations and 0.2% in News
      • English MDE: 5.4% in CTS and 1.5% in BN
  • Percentage of disfluencies having a correction
    larger in News (94.6%) than in Conversations
    (83.8%)
  • Explicit editing terms rare in both corpora
      • occur in just 4% of disfluencies

11
POS Analysis of Edit Disfluencies
  • Tagged the Czech corpora using an automatic
    POS tagger
  • Czech uses structured tags with 15 positions
      • we used only the first position, which
        distinguishes 10 basic POS
  • Computed and compared three POS distributions
      • Whole corpus
      • Deletable regions only
      • Corrections only
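
The three distributions can be sketched as follows. This is an illustration under stated assumptions, not the authors' tooling: the tokens are invented toy data, the tags are placeholders built by the hypothetical helper `toy_tag`, and only the first of the 15 positions is read.

```python
from collections import Counter


def toy_tag(pos_letter):
    """Build a placeholder 15-position tag; only position 1 is meaningful."""
    return pos_letter + "-" * 14


# (tag, region) pairs; region is "delreg", "correction", or None.
# Invented toy data, not taken from the corpora.
TOKENS = [
    (toy_tag("P"), None),
    (toy_tag("V"), None),
    (toy_tag("R"), "delreg"),
    (toy_tag("N"), "delreg"),
    (toy_tag("R"), "correction"),
    (toy_tag("N"), "correction"),
    (toy_tag("V"), None),
    (toy_tag("R"), "delreg"),
    (toy_tag("R"), "correction"),
    (toy_tag("N"), None),
]


def pos_distribution(tokens, keep=lambda region: True):
    """Relative frequency of the first tag position over the kept tokens."""
    counts = Counter(tag[0] for tag, region in tokens if keep(region))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {pos: n / total for pos, n in counts.items()}


whole = pos_distribution(TOKENS)
delregs = pos_distribution(TOKENS, lambda r: r == "delreg")
corrections = pos_distribution(TOKENS, lambda r: r == "correction")
print(whole, delregs, corrections, sep="\n")
```

Comparing `delregs` and `corrections` against `whole` shows which POS classes are over- or under-represented inside disfluencies.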

12
POS Analysis of Edit Disfluencies
13
Statistics about SUs
  • Average SU length: Conversations (14.5 words)
    show longer SUs than News (13.0 words)
  • English BN (12.5) similar to Czech, but CTS shows
    much shorter SUs (7.0) than Broadcast
    Conversations
  • SU-internal breaks (clausal and coordination)
    more frequent in Conversations than in News
      • (49% vs. 31% of all SU symbols)
  • Hence complex and compound sentences are more common
    in spontaneous conversations than in prearranged news
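
Both SU statistics are simple ratios. The sketch below computes them for the four SUs of the annotation example, assuming that "all SU symbols" means SU-final symbols plus SU-internal breaks. That reading is an assumption, and `su_statistics` is a hypothetical helper name.

```python
def su_statistics(sus):
    """sus: list of (n_words, n_internal_breaks), one entry per SU."""
    n_sus = len(sus)
    total_words = sum(n_words for n_words, _ in sus)
    internal = sum(n_breaks for _, n_breaks in sus)
    # Assumption: one SU-final symbol per SU, plus all SU-internal breaks.
    all_symbols = n_sus + internal
    return {
        "avg_su_length": total_words / n_sus,
        "internal_break_share": 100.0 * internal / all_symbols,
    }


# Word and break counts of the four SUs in the annotation example.
EXAMPLE_SUS = [(14, 0), (11, 1), (12, 1), (7, 0)]

stats = su_statistics(EXAMPLE_SUS)
print(f"average SU length: {stats['avg_su_length']:.1f} words")
print(f"SU-internal breaks: {stats['internal_break_share']:.1f}% of SU symbols")
```

For the example this gives an average of 11.0 words per SU and internal breaks at 33.3% of SU symbols.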

14
Summary
  • Broadcast Conversations contain significantly
    more fillers and disfluencies than News
  • Conversations also show longer SUs and contain a
    higher number of complex sentences than News
  • Deletable regions and corrections in both corpora
    show different POS distributions in comparison
    with the general POS distributions
  • We plan to make Czech MDE corpora publicly
    available
