1
Welcome to the Rich Transcription 2005 Spring
Meeting Recognition Evaluation Workshop
  • July 13, 2005
  • Royal College of Physicians
  • Edinburgh, UK

2
Today's Agenda
(Agenda table; final entry: 18:00, Meeting Venue Cleared.)
Updated July 5, 2005
3
Administrative Points
  • Participants
  • Pick up the hard-copy proceedings at the front
    desk
  • Presenters
  • The agenda will be strictly followed
  • Time slots include Q&A time.
  • Presenters should either
  • Load their presentations on the computer at the
    front, or
  • Test their laptops during the breaks prior to
    making their presentation
  • We'd like to thank
  • MLMI-05 organizing committee for hosting this
    workshop
  • Caroline Hastings for the workshop's
    administration
  • All the volunteers: evaluation participants, data
    providers, transcribers, annotators, paper
    authors, presenters, and other contributors

4
The Rich Transcription 2005 Spring Meeting
Recognition Evaluation
http://www.nist.gov/speech/tests/rt/rt2005/spring/
  • Jonathan Fiscus, Nicolas Radde, John Garofolo,
    Audrey Le, Jerome Ajot, Christophe Laprun
  • July 13, 2005
  • Rich Transcription 2005 Spring Meeting
    Recognition Workshop at MLMI 2005

5
Overview
  • Rich Transcription Evaluation Series
  • Research opportunities in the Meeting Domain
  • RT-05S Evaluation
  • Audio input conditions
  • Corpora
  • Evaluation tasks and results
  • Conclusion/Future

6
The Rich Transcription Task
(Diagram: Human-to-Human Speech feeds RICH TRANSCRIPTION (Speech-To-Text plus METADATA), built on component recognition technologies, yielding readable transcripts and supporting multiple applications: Smart Meeting Rooms, Translation, Extraction, Retrieval, and Summarization.)
7
Rich Transcription Evaluation Series
  • Goal
  • Develop recognition technologies that produce
    transcripts which are understandable by humans
    and useful for downstream processes.
  • Domains
  • Broadcast News (BN)
  • Conversational Telephone Speech (CTS)
  • Meeting Room speech
  • Parameterized Black Box evaluations
  • Evaluations control input conditions to
    investigate weaknesses/strengths
  • Sub-test scoring provides finer-grained
    diagnostics

8
NIST STT Benchmark Test History (May 2005)
(Chart: word error rate on a log scale, 1% to 100%, versus year, 1988-2011, for Read Speech (1k, 5k, and 20k vocabularies; noisy; non-English), Spontaneous Speech, Broadcast Speech (BNews English 1X, 10X, and unlimited; BNews Mandarin and Arabic 10X), Conversational Speech (Switchboard II, Switchboard Cellular, CTS Fisher (UL), CTS Mandarin and Arabic (UL)), Varied Microphones, and Meeting speech under the SDM, MDM, and IHM conditions.)
9
Research Opportunities in the Meeting Domain
  • Provide a fertile environment to advance the
    state of the art in technologies for
    understanding human interaction
  • Many potential applications
  • Meeting archives, interactive meeting rooms,
    remote collaborative systems
  • Important Human Language Technology challenges
    not posed by other domains
  • Varied forums and vocabularies
  • Highly interactive and overlapping spontaneous
    speech
  • Far-field speech effects
  • Ambient noise
  • Reverberation
  • Participant movement
  • Varied room configurations
  • Many microphone conditions
  • Many camera views
  • Multimedia information integration
  • Person, face, and head detection/tracking

10
RT-05S Evaluation Tasks
  • Focus on core speech technologies
  • Speech-to-Text Transcription
  • Diarization: Who Spoke When
  • Diarization: Speech Activity Detection
  • Diarization: Source Localization

11
Five System Input Conditions
  • Distant microphone conditions
  • Multiple Distant Microphones (MDM)
  • Three or more centrally located table mics
  • Multiple Source Localization Arrays (MSLA)
  • Inverted T topology, 4-channel digital
    microphone array
  • Multiple Mark III digital microphone Arrays
    (MM3A)
  • Linear topology, 64-channel digital microphone
    array
  • Contrastive microphone conditions
  • Single Distant Microphone (SDM)
  • Center-most MDM microphone
  • Gauges the performance benefit of using multiple
    table mics
  • Individual Head Microphones (IHM)
  • Performance on clean speech
  • Similar to Conversational Telephone Speech
  • One speaker per channel, conversational speech

12
Training/Development Corpora
  • Corpora provided at no cost to participants
  • ICSI Meeting Corpus
  • ISL Meeting Corpus
  • NIST Meeting Pilot Corpus
  • Rich Transcription 2004 Spring (RT-04S)
    Development Evaluation Data
  • Topic Detection and Tracking Phase 4 (TDT4)
    corpus
  • Fisher English conversational telephone speech
    corpus
  • CHIL development test set
  • AMI development test set and training set
  • Thanks to ELDA and LDC for making this possible

13
RT-05S Evaluation Test Corpora: Conference Room
Test Set
  • Goal-oriented small conference room meetings
  • Group meetings and decision-making exercises
  • Meetings involved 4-10 participants
  • 120 minutes: ten excerpts, each twelve minutes
    in duration
  • Five sites donated two meetings each
  • Augmented Multiparty Interaction (AMI) Program,
    Carnegie Mellon University (CMU), International
    Computer Science Institute (ICSI), NIST, and
    Virginia Tech (VT)
  • No VT data was available for system development
  • Similar test set construction used for RT-04S
    evaluation
  • Microphones
  • Participants wore head microphones
  • Microphones were placed on the table among
    participants
  • AMI meetings included an 8-channel circular
    microphone array on the table
  • NIST meetings included 3 Mark III digital
    microphone arrays

14
RT-05S Evaluation Test Corpora: Lecture Room
Test Set
  • Technical lectures in small meeting rooms
  • Educational events where a single lecturer is
    briefing an audience on a particular topic
  • Meeting excerpts involve one lecturer and up to
    five participating audience members
  • 150 minutes: 29 excerpts from 16 lectures
  • Two types of excerpts selected by CMU
  • Lecturer excerpts: 89 minutes, 17 excerpts
  • Question/Answer (Q&A) excerpts: 61 minutes, 12
    excerpts
  • All data collected at Karlsruhe University
  • Sensors
  • Lecturer and at most two other participants wore
    head microphones
  • Microphones were placed on the table among
    participants
  • A source localization array mounted on each of
    the room's four walls
  • A Mark III array mounted on the wall opposite the
    lecturer

15
RT-05S Evaluation Participants
Site ID  | Site Name                                                      | STT | SPKR | SAD | SLOC
AMI      | Augmented Multiparty Interaction Program                       |  X  |      |     |
ICSI/SRI | International Computer Science Institute and SRI International |  X  |  X   |     |
ITC-irst | Center for Scientific and Technological Research               |     |      |     |  X
KU       | Karlsruhe University                                           |     |      |     |  X
ELISA    | Laboratoire Informatique d'Avignon (LIA), Communication Langagière et Interaction Personne-Système (CLIPS), and LIUM | | X | X |
MQU      | Macquarie University                                           |     |  X   |     |
Purdue   | Purdue University                                              |     |      |  X  |
TNO      | The Netherlands Organisation for Applied Scientific Research   |     |  X   |  X  |
TUT      | Tampere University of Technology                               |     |      |     |  X
16
Diarization 'Who Spoke When' (SPKR) Task
  • Task definition
  • Identify the number of participants in each
    meeting and create a list of speech time
    intervals for each such participant
  • Several input conditions
  • Primary: MDM
  • Contrast: SDM, MSLA
  • Four participating sites: ICSI/SRI, ELISA, MQU,
    TNO

17
SPKR System Evaluation Method
  • Primary Metric
  • Diarization Error Rate (DER): the ratio of
    incorrectly detected speaker time to total
    reference speaker time (see the sketch below)
  • System output speaker segment sets are mapped to
    reference speaker segment sets so as to minimize
    the total error
  • Errors consist of
  • Speaker assignment errors (i.e., speech detected
    but attributed to the wrong speaker)
  • False alarm detections
  • Missed detections
  • Systems were scored using the mdeval tool
  • Forgiveness collar of ±250 ms around reference
    segment boundaries
  • DER on non-overlapping speech is the primary
    metric
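
For illustration only, a minimal Python sketch of the DER computation on a discrete time grid. This is not NIST's mdeval tool: it assumes the system-to-reference speaker mapping is already fixed, omits the ±250 ms collar, and the segments are made up:

from collections import defaultdict

STEP = 0.01  # score on a 10 ms time grid

def to_grid(segments):
    # segments: list of (start_sec, end_sec, speaker); returns bin -> speaker set
    bins = defaultdict(set)
    for start, end, spkr in segments:
        for i in range(round(start / STEP), round(end / STEP)):
            bins[i].add(spkr)
    return bins

def der(reference, system, t_max):
    ref, hyp = to_grid(reference), to_grid(system)
    missed = false_alarm = spkr_err = ref_time = 0.0
    for i in range(round(t_max / STEP)):
        r, s = ref.get(i, set()), hyp.get(i, set())
        ref_time += STEP * len(r)                      # total speaker time
        missed += STEP * max(len(r) - len(s), 0)       # missed detections
        false_alarm += STEP * max(len(s) - len(r), 0)  # false alarm detections
        spkr_err += STEP * (min(len(r), len(s)) - len(r & s))  # wrong speaker
    return (missed + false_alarm + spkr_err) / ref_time

ref = [(0.0, 5.0, "A"), (5.0, 9.0, "B")]   # hypothetical reference turns
hyp = [(0.0, 5.5, "A"), (5.5, 9.0, "B")]   # system hands 0.5 s of B's turn to A
print(f"DER = {der(ref, hyp, 10.0):.3f}")  # 0.5 / 9.0 = 0.056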

18
RT-05S SPKR Results: Primary Systems,
Non-Overlapping Speech
  • Conference room SDM DERs were lower than MDM DERs
  • A sign test indicates the differences are not
    significant (see the sketch below)
  • The primary ICSI/SRI lecture room system
    attributed the entire duration of each test
    excerpt to a single speaker
  • The ICSI/SRI contrastive system had a lower DER
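
For illustration only, a minimal Python sketch of such a sign test, applied to made-up per-excerpt MDM and SDM DERs (ties are dropped and an exact two-sided binomial p-value is computed); it sketches the technique, not the exact scoring procedure used here:

from math import comb

def sign_test_p(wins_a, wins_b):
    # Exact two-sided binomial test under the null hypothesis p = 0.5
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    return min(1.0, 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n)

# Hypothetical per-excerpt DERs for two microphone conditions
mdm = [0.21, 0.18, 0.25, 0.30, 0.22, 0.19, 0.28, 0.24, 0.26, 0.20]
sdm = [0.19, 0.20, 0.23, 0.31, 0.21, 0.18, 0.27, 0.25, 0.24, 0.22]
a = sum(m < s for m, s in zip(mdm, sdm))  # excerpts where MDM is better
b = sum(s < m for m, s in zip(mdm, sdm))  # excerpts where SDM is better
print(f"MDM wins {a}, SDM wins {b}, p = {sign_test_p(a, b):.2f}")  # p = 0.75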

19
Lecture Room Results: Broken Down by Excerpt Type
  • Lecturer excerpt DERs are lower than Q&A excerpt
    DERs

20
Historical Best System SPKR Performance on
Conference Data
  • 20% relative reduction for MDM
  • 43% relative reduction for SDM

21
Diarization Speech Activity Detection (SAD) Task
  • Task definition
  • Create a list of speech time intervals where at
    least one person is talking
  • Dry run evaluation for RT-05S
  • Proposed by CHIL
  • Several input conditions
  • Primary: MDM
  • Contrast: SDM, MSLA, IHM
  • Systems designed for the IHM condition must
    detect speech and also reject cross-talk speech
    and breath noises; therefore, IHM systems are not
    directly comparable to MDM or SDM systems
  • Three participating sites: ELISA, Purdue, TNO

22
SAD System Evaluation Method
  • Primary metric
  • Diarization Error Rate (DER)
  • Same formula and software as used for the SPKR
    task
  • Reduced to a two-class problem: speech vs.
    non-speech
  • No speaker assignment errors, just false alarms
    and missed detections
  • Forgiveness collar of ±250 ms around reference
    segment boundaries

23
RT-05S SAD Results: Primary Systems
  • DERs for conference and lecture room MDM data are
    similar
  • Purdue didn't compensate for breath noise and
    cross-talk

24
Speech-To-Text (STT) Task
  • Task definition
  • Systems output a single stream of time-tagged
    word tokens
  • Several input conditions
  • Primary: MDM
  • Contrast: SDM, MSLA, IHM
  • Two participating sites: AMI and ICSI/SRI

25
STT System Evaluation Method
  • Primary metric
  • Word Error Rate (WER): the ratio of inserted,
    deleted, and substituted words to the total
    number of words in the reference (see the sketch
    below)
  • System and reference words are normalized to a
    common form
  • System words are mapped to reference words using
    a word-mediated dynamic programming string
    alignment program
  • Systems were scored using the NIST Scoring
    Toolkit (SCTK) version 2.1
  • A Spring 2005 update to the SCTK alignment tool
    can now score most of the overlapping speech in
    the distant microphone test material
  • Can now handle up to 5 simultaneous speakers
  • 98% of the Conference Room test set can be scored
  • 100% of the Lecture Room test set can be scored
  • Greatly improved over Spring 2004 prototype
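
For illustration only, a minimal Python sketch of WER via the dynamic-programming alignment described above. This is not the SCTK sclite tool: it does no text normalization or overlap handling, and the example sentences are made up:

def wer(ref_words, hyp_words):
    R, H = len(ref_words), len(hyp_words)
    # d[i][j] = minimum edit cost of aligning ref[:i] to hyp[:j]
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i  # deletions
    for j in range(H + 1):
        d[0][j] = j  # insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return d[R][H] / R

ref = "the meeting starts at noon".split()
hyp = "the meeting start at noon today".split()
print(f"WER = {wer(ref, hyp):.2f}")  # (1 substitution + 1 insertion) / 5 = 0.40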

26
Simultaneous Speech for STT
  • 98% of the Conference Room test set has ≤ 5
    overlapping speakers
  • 100% of the Lecture Room test set has ≤ 5
    overlapping speakers
  • The updated alignment tool ran in less than real
    time

27
RT-05S STT Results: Primary Systems (Incl.
Overlaps)
(Charts: WER by microphone condition for the
Conference Room and Lecture Room test sets.)
  • First evaluation for the AMI team
  • IHM error rates for conference and lecture room
    data are comparable
  • ICSI/SRI lecture room MSLA WER was lower than
    MDM/SDM WER

28
Historical STT Performance in the Meeting Domain
  • Performance for ICSI/SRI has dramatically
    improved for all conditions

29
STT Error Rates: Effect of Simultaneous Speech
30
Diarization Source Localization (SLOC) Task
  • Task definition
  • Systems track the three-dimensional position of
    the lecturer (using audio input only)
  • Constrained to lecturer subset of the Lecture
    Room test set
  • Evaluation protocol and metrics defined in the
    CHIL Speaker Localization and Tracking
    Evaluation Criteria document
  • Dry run pilot evaluation for RT-05S
  • Proposed by CHIL
  • CHIL provided the scoring software and annotated
    the evaluation data
  • One evaluation condition
  • Multiple source localization arrays
  • Required calibration of source localization
    microphone positions and video cameras
  • Three participating sites: ITC-irst, KU, TUT

31
SLOC System Evaluation Method
  • Primary Metric
  • Root Mean Squared Error (RMSE): a measure of the
    average Euclidean distance between the reference
    speaker position and the system-determined
    speaker position (see the sketch below)
  • Measured in millimeters at 667 ms intervals
  • IRST SLOC scoring software
  • Maurizio Omologo will give further details this
    afternoon
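
For illustration only (not the IRST SLOC scoring software), a minimal Python sketch of the RMSE metric over hypothetical reference and system positions, assumed to be in millimeters with one sample per 667 ms frame:

import math

def rmse(ref_positions, sys_positions):
    # Each argument: list of (x, y, z) positions in mm, one per 667 ms frame
    sq = [sum((r - s) ** 2 for r, s in zip(rp, sp))
          for rp, sp in zip(ref_positions, sys_positions)]
    return math.sqrt(sum(sq) / len(sq))

ref = [(1000.0, 2000.0, 1700.0), (1100.0, 2000.0, 1700.0)]  # made-up positions
hyp = [(1050.0, 1990.0, 1700.0), (1080.0, 2020.0, 1710.0)]
print(f"RMSE = {rmse(ref, hyp):.1f} mm")  # about 41.8 mm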

32
RT-05S SLOC Results: Primary Systems
  • Issues
  • What accuracy and resolution are needed for
    successful beamforming?
  • What will performance be for multiple speakers?

33
Summary
  • Nine sites participated in the RT-05S evaluation
  • Up from six in RT-04S
  • Four evaluation tasks were supported across two
    meeting sub-domains
  • Two experimental tasks, SAD and SLOC, were
    successfully completed
  • Dramatically lower STT and SPKR error rates for
    RT-05S

34
Issues for RT-06 Meeting Eval
  • Domain
  • Sub-domains
  • Tasks
  • Require at least three sites per task
  • Agreed-upon primary condition for each task
  • Data contributions
  • Source data and annotations
  • Participation intent
  • Participation commitment
  • Decision making process
  • Only sites with intent to participate will have
    input to the task definition

35
Proposal for RT-06
  • Encourage multi-modality systems
  • Publish video for all '05 meetings
  • Include video for the '06 eval
  • Tasks
  • STT
  • SPKR
  • Score all speech for the SPKR task, not just the
    non-overlapping speech
  • Multi-stream STT: a marriage of STT and SPKR
  • SLOC: lecturer and audience
  • Test sets
  • 2 hours of conference meetings
  • 2 hours of lecture meetings
  • Data requirements
  • IHM, multiple table mics, video
  • Timetable
  • Evaluation: March/April
  • Workshop: May, East Coast US venue
  • Participation
  • Data donation