Title: Welcome to the Rich Transcription 2005 Spring Meeting Recognition Evaluation Workshop
1 Welcome to the Rich Transcription 2005 Spring Meeting Recognition Evaluation Workshop
- July 13, 2005
- Royal College of Physicians
- Edinburgh, UK
2 Today's Agenda
[Agenda table; closing entry: 1800 Meeting venue cleared]
- Updated July 5, 2005
3 Administrative Points
- Participants
  - Pick up the hard copy proceedings at the front desk
- Presenters
  - The agenda will be strictly followed
  - Time slots include Q&A time
  - Presenters should either
    - Load their presentations on the computer at the front, or
    - Test their laptops during the breaks prior to making their presentation
- We'd like to thank
  - The MLMI-05 organizing committee for hosting this workshop
  - Caroline Hastings for the workshop's administration
  - All the volunteers: evaluation participants, data providers, transcribers, annotators, paper authors, presenters, and other contributors
4 The Rich Transcription 2005 Spring Meeting Recognition Evaluation
http://www.nist.gov/speech/tests/rt/rt2005/spring/
- Jonathan Fiscus, Nicolas Radde, John Garofolo, Audrey Le, Jerome Ajot, Christophe Laprun
- July 13, 2005
- Rich Transcription 2005 Spring Meeting Recognition Workshop at MLMI 2005
5 Overview
- Rich Transcription Evaluation Series
- Research opportunities in the Meeting Domain
- RT-05S Evaluation
- Audio input conditions
- Corpora
- Evaluation tasks and results
- Conclusion/Future
6 The Rich Transcription Task
[Diagram: component recognition technologies applied to human-to-human speech feed Rich Transcription (Speech-To-Text plus Metadata), which produces readable transcripts for multiple applications: smart meeting rooms, translation, extraction, retrieval, and summarization.]
7 Rich Transcription Evaluation Series
- Goal
  - Develop recognition technologies that produce transcripts which are understandable by humans and useful for downstream processes
- Domains
  - Broadcast News (BN)
  - Conversational Telephone Speech (CTS)
  - Meeting Room speech
- Parameterized "Black Box" evaluations
  - Evaluations control input conditions to investigate weaknesses/strengths
  - Sub-test scoring provides finer-grained diagnostics
8 NIST STT Benchmark Test History (May 2005)
[Chart: word error rate (log scale, 1% to 100%) versus year (1988 to 2011) for the NIST STT benchmarks: read speech (1k, 5k, and 20k vocabularies; noisy; non-English), broadcast news (BNews English 1X/10X/unlimited, BNews Mandarin 10X, BNews Arabic 10X), conversational telephone speech (Switchboard II, Switchboard Cellular, CTS Fisher (UL), CTS Mandarin (UL), CTS Arabic (UL)), spontaneous speech, varied microphones, and meeting speech (SDM, MDM, IHM).]
9 Research Opportunities in the Meeting Domain
- Provide a fertile environment to advance the state-of-the-art in technologies for understanding human interaction
- Many potential applications
  - Meeting archives, interactive meeting rooms, remote collaborative systems
- Important Human Language Technology challenges not posed by other domains
  - Varied forums and vocabularies
  - Highly interactive and overlapping spontaneous speech
  - Far-field speech effects
    - Ambient noise
    - Reverberation
    - Participant movement
  - Varied room configurations
  - Many microphone conditions
  - Many camera views
  - Multimedia information integration
  - Person, face, and head detection/tracking
10 RT-05S Evaluation Tasks
- Focus on core speech technologies
  - Speech-to-Text Transcription
  - Diarization "Who Spoke When"
  - Diarization "Speech Activity Detection"
  - Diarization "Source Localization"
11 Five System Input Conditions
- Distant microphone conditions
  - Multiple Distant Microphones (MDM)
    - Three or more centrally located table mics
  - Multiple Source Localization Arrays (MSLA)
    - Inverted-T topology, 4-channel digital microphone array
  - Multiple Mark III digital microphone Arrays (MM3A)
    - Linear topology, 64-channel digital microphone array
- Contrastive microphone conditions
  - Single Distant Microphone (SDM)
    - Center-most MDM microphone
    - Gauges the performance benefit of using multiple table mics
  - Individual Head Microphones (IHM)
    - Performance on clean speech
    - Similar to Conversational Telephone Speech
    - One speaker per channel, conversational speech
12 Training/Development Corpora
- Corpora provided at no cost to participants
  - ICSI Meeting Corpus
  - ISL Meeting Corpus
  - NIST Meeting Pilot Corpus
  - Rich Transcription 2004 Spring (RT-04S) development and evaluation data
  - Topic Detection and Tracking Phase 4 (TDT4) corpus
  - Fisher English conversational telephone speech corpus
  - CHIL development test set
  - AMI development test set and training set
- Thanks to ELDA and LDC for making this possible
13 RT-05S Evaluation Test Corpora: Conference Room Test Set
- Goal-oriented small conference room meetings
  - Group meetings and decision-making exercises
  - Meetings involved 4-10 participants
- 120 minutes: ten excerpts, each twelve minutes in duration
  - Five sites donated two meetings each: the Augmented Multiparty Interaction (AMI) Program, Carnegie Mellon University (CMU), the International Computer Science Institute (ICSI), NIST, and Virginia Tech (VT)
  - No VT data was available for system development
  - Similar test set construction was used for the RT-04S evaluation
- Microphones
  - Participants wore head microphones
  - Microphones were placed on the table among participants
  - AMI meetings included an 8-channel circular microphone array on the table
  - NIST meetings included 3 Mark III digital microphone arrays
14 RT-05S Evaluation Test Corpora: Lecture Room Test Set
- Technical lectures in small meeting rooms
  - Educational events where a single lecturer briefs an audience on a particular topic
  - Meeting excerpts involve one lecturer and up to five participating audience members
- 150 minutes: 29 excerpts from 16 lectures
  - Two types of excerpts, selected by CMU
    - Lecturer excerpts: 89 minutes, 17 excerpts
    - Question & Answer (Q&A) excerpts: 61 minutes, 12 excerpts
  - All data collected at Karlsruhe University
- Sensors
  - The lecturer and at most two other participants wore head microphones
  - Microphones were placed on the table among participants
  - A source localization array was mounted on each of the room's four walls
  - A Mark III array was mounted on the wall opposite the lecturer
15 RT-05S Evaluation Participants

| Site ID  | Site Name                                                                                                              | STT | SPKR | SAD | SLOC |
|----------|------------------------------------------------------------------------------------------------------------------------|-----|------|-----|------|
| AMI      | Augmented Multiparty Interaction Program                                                                               |  X  |      |     |      |
| ICSI/SRI | International Computer Science Institute and SRI International                                                          |  X  |  X   |     |      |
| ITC-irst | Center for Scientific and Technological Research                                                                        |     |      |     |  X   |
| KU       | Karlsruhe University                                                                                                    |     |      |     |  X   |
| ELISA    | ELISA Consortium: Laboratoire Informatique d'Avignon (LIA), Communication Langagière et Interaction Personne-Système (CLIPS), and LIUM |     |  X   |  X  |      |
| MQU      | Macquarie University                                                                                                    |     |  X   |     |      |
| Purdue   | Purdue University                                                                                                       |     |      |  X  |      |
| TNO      | The Netherlands Organisation for Applied Scientific Research                                                            |     |  X   |  X  |      |
| TUT      | Tampere University of Technology                                                                                        |     |      |     |  X   |
16 Diarization "Who Spoke When" (SPKR) Task
- Task definition
  - Identify the number of participants in each meeting and create a list of speech time intervals for each such participant
- Several input conditions
  - Primary: MDM
  - Contrast: SDM, MSLA
- Four participating sites: ICSI/SRI, ELISA, MQU, TNO
17 SPKR System Evaluation Method
- Primary metric
  - Diarization Error Rate (DER): the ratio of incorrectly detected speaker time to total speaker time (see the formula after this slide)
- System output speaker segment sets are mapped to reference speaker segment sets so as to minimize the total error
- Errors consist of
  - Speaker assignment errors (i.e., speech detected but not assigned to the right speaker)
  - False alarm detections
  - Missed detections
- Systems were scored using the mdeval tool
  - Forgiveness collar of ±250 ms around reference segment boundaries
  - DER on non-overlapping speech is the primary metric
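
For reference, the DER decomposes in the standard way shown below; the notation is ours, not the slide's:

```latex
\mathrm{DER} = \frac{T_{\mathrm{miss}} + T_{\mathrm{fa}} + T_{\mathrm{spkr}}}{T_{\mathrm{ref}}}
```

Here T_miss is reference speech time the system failed to detect, T_fa is time the system hypothesized speech where none occurred, T_spkr is speech time assigned to the wrong speaker under the error-minimizing mapping, and T_ref is the total reference speaker time.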
18 RT-05S SPKR Results: Primary Systems, Non-Overlapping Speech
- The conference room SDM DER is less than the MDM DER
  - A sign test indicates the differences are not significant
- The primary ICSI/SRI lecture room system attributed the entire duration of each test excerpt to a single speaker
  - The ICSI/SRI contrastive system had a lower DER
19 Lecture Room Results: Broken Down by Excerpt Type
- Lecturer excerpt DERs are lower than Q&A excerpt DERs
20 Historical Best System SPKR Performance on Conference Data
- 20% relative reduction for MDM
- 43% relative reduction for SDM
21 Diarization "Speech Activity Detection" (SAD) Task
- Task definition
  - Create a list of speech time intervals where at least one person is talking
- Dry run evaluation for RT-05S
  - Proposed by CHIL
- Several input conditions
  - Primary: MDM
  - Contrast: SDM, MSLA, IHM
  - Systems designed for the IHM condition must detect speech and also reject cross-talk speech and breath noises; IHM systems are therefore not directly comparable to MDM or SDM systems
- Three participating sites: ELISA, Purdue, TNO
22 SAD System Evaluation Method
- Primary metric
  - Diarization Error Rate (DER)
  - Same formula and software as used for the SPKR task
  - Reduced to a two-class problem: speech vs. non-speech
    - No speaker assignment errors, just false alarms and missed detections (see the sketch after this slide)
  - Forgiveness collar of ±250 ms around reference segment boundaries
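
Since the SAD variant involves no speaker mapping, its DER reduces to missed plus false-alarm speech time divided by total reference speech time. Below is a minimal sketch of that computation on a fixed time grid; the function name, interval-list inputs, and grid step are illustrative choices of ours, and the ±250 ms no-score collar applied by mdeval is omitted for brevity.

```python
def sad_der(ref, hyp, step=0.01):
    """Two-class DER: (missed + false alarm speech time) / reference speech
    time, approximated by sampling both segmentations every `step` seconds.
    `ref` and `hyp` are lists of (start, end) speech intervals in seconds."""
    def is_speech(intervals, t):
        return any(s <= t < e for s, e in intervals)

    end = max(e for _, e in ref + hyp)
    miss = fa = ref_speech = 0.0
    t = 0.0
    while t < end:
        r, h = is_speech(ref, t), is_speech(hyp, t)
        if r:
            ref_speech += step
            if not h:
                miss += step          # reference speech the system missed
        elif h:
            fa += step                # hypothesized speech with no reference speech
        t += step
    return (miss + fa) / ref_speech

# One 10 s reference talkspurt vs. a hypothesis that starts 1 s late and
# runs 0.5 s long: DER = (1.0 + 0.5) / 10.0, approximately 0.15
print(sad_der([(0.0, 10.0)], [(1.0, 10.5)]))
```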
23 RT-05S SAD Results: Primary Systems
- DERs for the conference and lecture room MDM data are similar
- Purdue didn't compensate for breath noise and cross-talk
24 Speech-To-Text (STT) Task
- Task definition
  - Systems output a single stream of time-tagged word tokens
- Several input conditions
  - Primary: MDM
  - Contrast: SDM, MSLA, IHM
- Two participating sites: AMI and ICSI/SRI
25 STT System Evaluation Method
- Primary metric
  - Word Error Rate (WER): the ratio of inserted, deleted, and substituted words to the total number of words in the reference (see the sketch after this slide)
- System and reference words are normalized to a common form
- System words are mapped to reference words using a word-mediated dynamic programming string alignment program
- Systems were scored using the NIST Scoring Toolkit (SCTK) version 2.1
  - A Spring 2005 update to the SCTK alignment tool can now score most of the overlapping speech in the distant microphone test material
    - Can now handle up to 5 simultaneous speakers
    - 98% of the Conference Room test set can be scored
    - 100% of the Lecture Room test set can be scored
    - Greatly improved over the Spring 2004 prototype
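
As an illustration of the metric (a minimal sketch, not the SCTK implementation, which adds word-form normalization and multi-speaker overlap alignment), the following computes WER with the usual dynamic programming string alignment:

```python
def wer(ref, hyp):
    """Word error rate: (substitutions + deletions + insertions) divided by
    the number of reference words, via Levenshtein alignment."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all j hypothesis words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

# One deleted word out of six reference words: WER = 1/6, about 0.167
print(wer("the cat sat on the mat", "the cat sat on mat"))
```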
26 Simultaneous Speech for STT
- 98% of the Conference Room test set has < 5 overlapping speakers
- 100% of the Lecture Room test set has < 5 overlapping speakers
- The updated alignment tool ran in less than real time
27 RT-05S STT Results: Primary Systems (Incl. Overlaps)
[Chart: WER by microphone condition for the Conference Room and Lecture Room test sets]
- First evaluation for the AMI team
- IHM error rates for conference and lecture room data are comparable
- The ICSI/SRI lecture room MSLA WER is lower than the MDM/SDM WER
28 Historical STT Performance in the Meeting Domain
- Performance for ICSI/SRI has dramatically improved for all conditions
29 STT Error Rates: Effect of Simultaneous Speech
30 Diarization "Source Localization" (SLOC) Task
- Task definition
  - Systems track the three-dimensional position of the lecturer (using audio input only)
  - Constrained to the lecturer subset of the Lecture Room test set
  - Evaluation protocol and metrics defined in the CHIL "Speaker Localization and Tracking Evaluation Criteria" document
- Dry run pilot evaluation for RT-05S
  - Proposed by CHIL
  - CHIL provided the scoring software and annotated the evaluation data
- One evaluation condition
  - Multiple source localization arrays
  - Required calibration of source localization microphone positions and video cameras
- Three participating sites: ITC-irst, KU, TUT
31 SLOC System Evaluation Method
- Primary metric
  - Root Mean Squared Error (RMSE): a measure of the average Euclidean distance between the reference speaker position and the system-determined speaker position (see the sketch after this slide)
  - Measured in millimeters at 667 ms intervals
- IRST SLOC scoring software
- Maurizio Omologo will give further details this afternoon
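
A minimal sketch of that metric, assuming the reference and system positions have already been paired at the 667 ms scoring instants (the function name and data layout are ours, not the IRST scoring software's):

```python
import math

def sloc_rmse(ref_xyz, hyp_xyz):
    """RMSE over paired 3-D positions: the square root of the mean squared
    Euclidean distance between reference and system speaker positions.
    Both arguments are equal-length lists of (x, y, z) tuples in mm."""
    sq_dists = [sum((r - h) ** 2 for r, h in zip(rp, hp))  # squared distance per instant
                for rp, hp in zip(ref_xyz, hyp_xyz)]
    return math.sqrt(sum(sq_dists) / len(sq_dists))

# Two scoring instants, each 100 mm off along one axis -> RMSE = 100.0 mm
print(sloc_rmse([(0, 0, 1500), (500, 0, 1500)],
                [(100, 0, 1500), (500, 100, 1500)]))
```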
32 RT-05S SLOC Results: Primary Systems
- Issues
  - What accuracy and resolution are needed for successful beamforming?
  - What will performance be for multiple speakers?
33 Summary
- Nine sites participated in the RT-05S evaluation
  - Up from six in RT-04S
- Four evaluation tasks were supported across two meeting sub-domains
- Two experimental tasks, SAD and SLOC, were successfully completed
- Dramatically lower STT and SPKR error rates for RT-05S
34 Issues for RT-06 Meeting Eval
- Domain
  - Sub-domains
- Tasks
  - Require at least three sites per task
  - Agreed-upon primary condition for each task
- Data contributions
  - Source data and annotations
- Participation intent
  - Participation commitment
- Decision-making process
  - Only sites with intent to participate will have input to the task definition
35 Proposal for RT-06
- Encourage multi-modality systems
  - Publish video for all '05 meetings
  - Include video for the '06 eval
- Tasks
  - STT
  - SPKR
    - Score all speech for the SPKR task, not just the non-overlapping speech
  - Multi-stream STT: a marriage of STT and SPKR
  - SLOC: lecturer and audience
- Test sets
  - 2-hour conference meetings
  - 2-hour lecture meetings
- Data requirements
  - IHM, multiple table mics, video, ...
- Timetable
  - Evaluation: March/April
  - Workshop: May, East Coast US venue
- Participation
  - Data donation