Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations

Learn more at: http://www.lrec-conf.org

1
Structural Metadata Annotation of Speech Corpora: Comparing Broadcast News and Broadcast Conversations

Jáchym Kolář, Jan Švec

University of West Bohemia in Pilsen, Czech Republic
2
Talk Overview

Structural metadata annotation
Speech data
Statistics about fillers
Statistics about edit disfluencies
Statistics about sentence-like units
Summary

3
Structural Metadata Extraction

Metadata Extraction (MDE) research started as part of the DARPA EARS program
Metadata annotation scheme for MDE introduced by LDC (originally for English; we have extended it to Czech)
Ultimate goal of MDE: automatic conversion of raw speech recognition output to forms more useful to humans and downstream automatic processes

4
MDE Annotation Subtasks

Boundaries of syntactic/semantic units (SUs): statements, interrogatives, incompletes; coordination breaks, clausal breaks
Non-content words (fillers): filled pauses (FPs), discourse markers (DMs)
Speech disfluencies (edits): deletable regions (DelRegs), interruption points, explicit editing terms, corrections

5
MDE Annotation Example

but I you know really pre- uh prefer this form of of um presentation/. she Sheila told me on Tuesday no on Wednesday/, she didn't/. so let's move on/, because we don't have uh don't have time/. well do you like this this example/?

but I you know really pre- uh prefer this form of of um presentation she Sheila told me on Tuesday no on Wednesday she didn't so let's move on because we don't have uh don't have time well do you like this this example
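
The relation between the two versions of the example can be sketched in a few lines of Python. This is only an illustration, not the authors' tooling: it assumes that "/." and "/?" close an SU while "/," marks an SU-internal break, and the names `strip_symbols` and `split_sus` are hypothetical. Apostrophes are written out in the sample text.

```python
import re

# Annotated example from the slide. Symbol meanings are ASSUMPTIONS:
# "/." (statement) and "/?" (interrogative) end an SU; "/," is SU-internal.
ANNOTATED = (
    "but I you know really pre- uh prefer this form of of um presentation/. "
    "she Sheila told me on Tuesday no on Wednesday/, she didn't/. "
    "so let's move on/, because we don't have uh don't have time/. "
    "well do you like this this example/?"
)

SU_FINAL = {"/.", "/?"}
SU_INTERNAL = {"/,"}
SYMBOL_RE = re.compile(r"/[.,?]")


def strip_symbols(text):
    """Return the raw word stream with every SU symbol removed."""
    return " ".join(SYMBOL_RE.sub(" ", text).split())


def split_sus(text):
    """Split text into SUs (lists of words) at SU-final symbols.

    SU-internal breaks are dropped; words after the last SU-final
    symbol, if any, are ignored in this sketch.
    """
    sus, start = [], 0
    for match in SYMBOL_RE.finditer(text):
        if match.group() in SU_FINAL:
            segment = text[start:match.start()]
            sus.append(SYMBOL_RE.sub(" ", segment).split())
            start = match.end()
    return sus


symbols = SYMBOL_RE.findall(ANNOTATED)
sus = split_sus(ANNOTATED)
avg_len = sum(len(su) for su in sus) / len(sus)
internal_share = sum(s in SU_INTERNAL for s in symbols) / len(symbols)

print(strip_symbols(ANNOTATED))
print(f"SUs: {len(sus)}, average length: {avg_len:.1f} words")
print(f"SU-internal breaks: {internal_share:.0%} of all SU symbols")
```

The same two quantities, average SU length and the share of SU-internal breaks among all SU symbols, are the ones reported for the full corpora later in the talk.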
6
Goal of This Paper

Analyse and compare two Czech MDE corpora from different domains in terms of metadata statistics
Compare Czech Broadcast News (BN) vs. Broadcast Conversations (BC)
Also compare Czech and English MDE corpora: English Broadcast News and Conversational Telephone Speech (CTS)

7
Czech Broadcast News Data

News from 3 TV channels and 4 radio stations
Both public and commercial broadcast companies
Differing in presentation style
26 hours of transcribed speech
300 speakers
Speech recordings and verbatim transcripts publicly available from LDC

8
Broadcast Conversation Data

52 recordings of the Czech radio talk show Radioforum
24 hours of transcribed speech
100 speakers
1-3 guests spontaneously answer questions asked by 1-2 interviewers
Mostly political debates
Currently being extended by an additional 20 recordings (10 hours)

9
Statistics about Fillers

Filled pauses more frequent in Czech Broadcast Conversations (3.8% of words) than in News (0.5%)
English MDE: CTS 2.2%, BN 1.4%
Discourse markers also more frequent in Czech Conversations (1.6%) than in News (0.1%)
English MDE: CTS 4.4%, BN 0.5%
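
To make the size of these gaps easier to see, the following sketch divides each conversational rate by the corresponding news rate. It assumes the figures above are percentages of words; the dictionary layout and the function name `conversational_to_news_ratio` are illustrative. Note that the English conversational corpus is telephone speech (CTS), not broadcast conversations, so the Czech and English ratios are not directly comparable.

```python
# Filler rates as given on the slide, read as percentages of words
# (ASSUMPTION: the percent signs were lost in the transcript).
RATES = {
    "filled pauses": {
        "Czech BC": 3.8, "Czech BN": 0.5,
        "English CTS": 2.2, "English BN": 1.4,
    },
    "discourse markers": {
        "Czech BC": 1.6, "Czech BN": 0.1,
        "English CTS": 4.4, "English BN": 0.5,
    },
}


def conversational_to_news_ratio(rates, conv_key, news_key):
    """How many times more frequent a filler is in conversational speech."""
    return rates[conv_key] / rates[news_key]


ratios = {}
for filler, rates in RATES.items():
    czech = conversational_to_news_ratio(rates, "Czech BC", "Czech BN")
    english = conversational_to_news_ratio(rates, "English CTS", "English BN")
    ratios[filler] = (czech, english)
    print(f"{filler}: Czech BC/BN = {czech:.1f}x, "
          f"English CTS/BN = {english:.1f}x")
```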

10
Statistics about Edit Disfluencies

Deletable regions: 2.8% of words in Conversations and 0.2% in News
English MDE: 5.4% in CTS and 1.5% in BN
Percentage of disfluencies having a correction larger in News (94.6%) than in Conversations (83.8%)
Explicit editing terms rare in both corpora: they occur in just 4% of disfluencies

11
POS Analysis of Edit Disfluencies

Tagged the Czech corpora employing an automatic POS tagger
Czech uses structured tags with 15 positions; we only used the first position, distinguishing 10 basic POS
Computed and compared three POS distributions: whole corpus, deletable regions only, corrections only
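
The comparison described above could be computed roughly as follows. This is a minimal sketch on made-up data, not the authors' code: `toy_tag` builds a placeholder 15-position tag (a POS letter plus 14 dashes) rather than a real tagger output, the POS letters are used only as labels, and the region names are hypothetical.

```python
from collections import Counter


def toy_tag(pos):
    """Placeholder 15-position tag: POS letter followed by 14 dashes."""
    return pos + "-" * 14


def pos_distribution(tags):
    """Relative frequency of the basic POS, taken from the first tag position."""
    counts = Counter(tag[0] for tag in tags if tag)
    total = sum(counts.values())
    return {pos: n / total for pos, n in counts.items()}


# Hypothetical tokens: (positional tag, region), where region is one of
# "none", "delreg" (deletable region) or "correction".
TOKENS = [
    (toy_tag("P"), "delreg"),
    (toy_tag("P"), "correction"),
    (toy_tag("V"), "none"),
    (toy_tag("R"), "delreg"),
    (toy_tag("R"), "correction"),
    (toy_tag("N"), "correction"),
    (toy_tag("N"), "none"),
    (toy_tag("V"), "none"),
    (toy_tag("J"), "none"),
    (toy_tag("N"), "none"),
]

whole = pos_distribution(tag for tag, _ in TOKENS)
delregs = pos_distribution(tag for tag, region in TOKENS if region == "delreg")
corrections = pos_distribution(
    tag for tag, region in TOKENS if region == "correction"
)

for name, dist in [("whole corpus", whole),
                   ("deletable regions", delregs),
                   ("corrections", corrections)]:
    summary = ", ".join(f"{pos}={share:.2f}" for pos, share in sorted(dist.items()))
    print(f"{name}: {summary}")
```

Keeping only the first tag position collapses the fine-grained morphological tags into the basic POS classes, which keeps the three distributions small enough to compare side by side.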

12
POS Analysis of Edit Disfluencies
13
Statistics about SUs

Average SU length: Conversations (14.5 words) show longer SUs than News (13.0)
English BN (12.5) similar to Czech, but CTS shows much shorter SUs (7.0) than Broadcast Conversations
SU-internal breaks (clausal and coordination) more frequent in Conversations than in News (49% vs. 31% of all SU symbols)
This suggests that complex and compound sentences are more common in spontaneous conversations than in prearranged news

14
Summary

Broadcast Conversations contain significantly more fillers and disfluencies than News
Conversations also show longer SUs and contain a higher number of complex sentences than News
Deletable regions and corrections in both corpora show different POS distributions in comparison with the general POS distributions
We plan to make Czech MDE corpora publicly available
