Design of a Multilingual MT for Realtime Broadcast Captioning

1
Design of a Multi-lingual MT for Real-time
Broadcast Captioning
  • Course Project for 11-731
  • Ying Zhang (Joy) Joy_at_cs.cmu.edu
  • Advisor: Eric Nyberg
  • April 18th, 2001

2
Project Description
  • A broadcasting company wishes to translate the
    captioning for their show. Translations must be
    provided from English to multiple target
    languages. Real-time, high-accuracy translations
    are required. If the captions are poorly formed,
    then they need not be translated, but the
    customer would like you to consider teaching a
    controlled language to the captioners, so that
    high-quality translation can be achieved.

3
Domain Analysis (Cont.)
  • Special requirements
  • Translating spoken language
  • The system must perform in real-time
  • The system cannot rely on pre-editing or post-editing
  • The system should provide positive information to
    users as much as possible

4
Domain Analysis
  • Characteristics amenable to MT
  • The domain is narrow
  • Possible to build a large-scale monolingual corpus
  • Not necessary to translate every utterance in the
    broadcast
  • Well-defined discourse structure (greetings, etc.)
  • Some expressions have no correspondence in another
    culture (e.g., "the bulls outrunning the bears
    today on Wall Street")

5
Domain Analysis-Data(1)
  • Fixed Patterns
  • 000084 WWW.MEDIACAPTIONING.COM CLOSED
    CAPTIONING
  • 000085 PROVIDED BY BELL ATLANTIC, THE
  • 000089 HEART OF COMMUNICATION >>
  • 000090 FROM CNN IN WASHINGTON, THIS
  • 000091 IS "WORLDVIEW." I'M BERNARD
  • 000094 SHAW. >> AND I'M JUDY WOODRUFF. >>>
  • 000095 TALKS BETWEEN PROTESTANT AND
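
Fixed patterns such as the show opening recur from broadcast to broadcast, so one option is to match them against pre-translated templates before any MT engine is invoked. The following is a minimal sketch under that assumption; the regular expressions, the `TEMPLATES` table, and the bracketed target strings are hypothetical placeholders, not part of the presented system.

```python
import re

# Hypothetical template table: (pattern over an upper-case caption sentence,
# target-side template). The bracketed strings stand in for real translations.
TEMPLATES = [
    (re.compile(r'^FROM (?P<net>\w+) IN (?P<city>[A-Z ]+), '
                r'THIS IS "(?P<show>[^".]+)\.?"\.?$'),
     "<opening: network={net}, city={city}, show={show}>"),
    (re.compile(r"^(?:AND )?I'M (?P<name>[A-Z ]+)\.$"),
     "<self-introduction: name={name}>"),
]

def match_fixed_pattern(sentence):
    """Return a filled template if the sentence is a known fixed pattern, else None."""
    for pattern, template in TEMPLATES:
        m = pattern.match(sentence)
        if m:
            return template.format(**m.groupdict())
    return None

print(match_fixed_pattern('FROM CNN IN WASHINGTON, THIS IS "WORLDVIEW."'))
print(match_fixed_pattern("I'M BERNARD SHAW."))
print(match_fixed_pattern("TALKS BETWEEN PROTESTANT AND"))
```

Sentences that match no template return `None` and would fall through to the regular translation path.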

6
Domain Analysis-Data(2)
  • Idioms, Phrases and Acronyms
  • 000218 SCHEDULE? >> THE DISARMAMENT OF ALL
  • 000220 PARAMILITARIES INCLUDING THE IRA
  • 000223 OR THE INSTITUTION OF A
  • 000225 CABINET FOR THE NEW NORTHERN
  • 000227 IRELAND ASSEMBLY INCLUDING SINN FEIN
  • 000270 ALL THIS INDICATE THAT THE
  • 000271 ORIGINAL GOOD FRIDAY AGREEMENT WAS
  • 000276 JUST NOT REALISTIC? >> NO,

7
Domain Analysis-Data(3)
  • Sentence Boundary
  • 000112 PARAMILITARIES MUST DISARM. MR. BLAIR AND
  • 00140 BEFORE TOO LONG JUDY, YOU CAN SEE
  • 000141 BEHIND ME THAT PEOPLE ARE
  • 000143 STILL AT WORK HERE ALMOST 24 HOURS
  • 000145 AFTER THE DEADLINE PASSED
  • 000147 WHERE THERE'S LIFE, THERE'S HOPE
  • 000149 AND WHEN THERE'S TALK, I GUESS
  • 000151 THERE IS LIFE. WE'RE TOLD BY
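
The data above shows two problems: caption lines break in mid-sentence, and periods also appear in abbreviations such as "MR.". A rough sketch of one way to recover sentences is given here; the abbreviation list and both function names are assumptions made for illustration, not something taken from the project.

```python
import re

# Assumed abbreviation list; a real system would need a much larger one.
ABBREVIATIONS = {"MR.", "MRS.", "MS.", "DR."}

def join_caption_lines(lines):
    """Strip the leading numeric line ID from each caption line and join the text."""
    texts = [re.sub(r"^\d+\s+", "", line).strip() for line in lines]
    return " ".join(t for t in texts if t)

def split_sentences(text):
    """Split after tokens ending in . ? or !, unless the token is a known abbreviation.

    Returns (complete_sentences, pending_text); pending_text is an unfinished
    sentence that should wait for the next caption line.
    """
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".?!" and token.upper() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    return sentences, " ".join(current)

text = join_caption_lines([
    "000149 AND WHEN THERE'S TALK, I GUESS",
    "000151 THERE IS LIFE. WE'RE TOLD BY",
])
print(split_sentences(text))
print(split_sentences("PARAMILITARIES MUST DISARM. MR. BLAIR AND"))
```

Returning the pending fragment separately lets a real-time loop hold it until the next caption line arrives instead of translating half a sentence.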

8
Assumptions (1)
  • Partial Translation is acceptable
  • Users may know some English, although their
    vocabulary size may not be large enough
  • Users have visual information
  • Users may have background information for the
    topic
  • Provide only positive information to users:
    translate only when confident, not everything

9
Assumptions (2)
  • A 10-second delay is acceptable

[Figure: two clock readings, 10:01:22 am and 10:01:12 am, ten seconds apart]

10
Risk Factors
  • Technical risks
  • Business risks

11
Technical Risks
  • Performance constraints
  • Real-time
  • High-quality, even if partial translation is
    acceptable
  • Interface with hardware and software in
    broadcasting system
  • Specialized user interface if a human translator
    works together with MT
  • The domain of news broadcasting may be too wide
    to be covered by current MT technology

12
Business Risks
  • If the quality or real-time requirement cannot
    be met, the customer will not accept the
    product
  • The population of potential customers who need
    partial translation results is not large enough
  • Human translators provided with transcribed
    captions can translate them in real time
  • The sales force does not think it can sell this
    translation service

13
Technical Rationale
  • Multi-engine machine translation system (the
    requirement for multiple target languages cannot
    be satisfied yet)
  • Automatically update the corpus/lexicon from news
    sources
  • Provide only positive information: untranslated
    text carries zero information, a wrong translation
    carries negative information!
  • Translate only chunks with high confidence
  • Translate only simple structures; leave
    conjunctions and prepositions in complex
    structures untranslated
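
The "positive information only" rule can be sketched as a selection step over the candidates proposed by several engines: keep the best candidate for a chunk only if its confidence clears a threshold, and otherwise pass the English through unchanged. Everything below is an illustrative assumption, including the `Candidate` type, the engine names, the 0.8 threshold, and the bracketed placeholder translations; the slides do not specify how confidence is computed.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    engine: str        # which engine proposed it (e.g. "KBMT", "EBMT")
    translation: str   # placeholder target text
    confidence: float  # assumed to be a score in [0, 1]

def select_output(source_chunks, candidates, threshold=0.8):
    """For each source chunk, emit the best candidate if it is confident enough,
    otherwise leave the source chunk untranslated."""
    out = []
    for src in source_chunks:
        best = max(candidates.get(src, []), key=lambda c: c.confidence, default=None)
        if best is not None and best.confidence >= threshold:
            out.append(best.translation)
        else:
            out.append(src)  # zero information is preferred to negative information
    return " ".join(out)

chunks = ["THE DEADLINE", "PASSED"]
cands = {
    "THE DEADLINE": [Candidate("KBMT", "<tgt: the deadline>", 0.93),
                     Candidate("EBMT", "<tgt: deadline>", 0.71)],
    "PASSED": [Candidate("EBMT", "<tgt: passed>", 0.41)],
}
print(select_output(chunks, cands))  # <tgt: the deadline> PASSED
```

The threshold trades coverage against the risk of showing a wrong translation, so in practice it would have to be tuned on held-out caption data.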

14
System Architecture
Nyberg and Mitamura (1997), "A Real-Time MT
System for Translating Broadcast
Captions," Proceedings of MT Summit VI

15
Extracting Lexicon/Phrase
  • The lexicon/phrase used in news domain changes
    rapidly
  • Comparable corpus exists
  • Extracting lexicon/phrase from comparable corpus
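
One common family of methods for comparable corpora compares the contexts in which words occur: a new source word and its translation tend to co-occur with words that are translations of each other. The sketch below illustrates that general idea with a small seed dictionary and cosine similarity. It is not claimed to be the method used in this project or the exact algorithm of any cited paper; the function names, the window size, and the toy tokens (`T_TALKS`, `X1`, and so on) are all hypothetical.

```python
from collections import Counter
from math import sqrt

def context_vector(word, sentences, seeds, window=3):
    """Count seed words that occur within `window` tokens of `word`."""
    counts = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok != word:
                continue
            for ctx in sent[max(0, i - window): i + window + 1]:
                if ctx in seeds and ctx != word:
                    counts[ctx] += 1
    return counts

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rank_candidates(word, src_sents, tgt_sents, seed_dict, candidates):
    """Rank target-language candidates for `word` by context similarity."""
    src_vec = context_vector(word, src_sents, set(seed_dict))
    projected = Counter()
    for src_seed, count in src_vec.items():
        projected[seed_dict[src_seed]] += count  # map context into the target vocabulary
    tgt_seeds = set(seed_dict.values())
    scored = [(cosine(projected, context_vector(c, tgt_sents, tgt_seeds)), c)
              for c in candidates]
    return sorted(scored, reverse=True)

src = [["talks", "on", "disarmament", "continue"], ["disarmament", "deadline", "passed"]]
tgt = [["T_TALKS", "X1", "go", "on"], ["X1", "T_DEADLINE", "over"], ["X2", "weather", "today"]]
seed = {"talks": "T_TALKS", "deadline": "T_DEADLINE"}
print(rank_candidates("disarmament", src, tgt, seed, ["X1", "X2"]))
```

With real data the two corpora would be news text on the same topics from the same period, and the seed dictionary would come from the existing lexicon.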

16
Comparable corpus

17
Plan Overview
  • Research on extracting a lexicon from comparable
    corpora
  • Augmenting rules for the news domain
  • Constructing a bilingual corpus
  • Research on the effects of partial translation
  • Training EBMT
  • Adjusting the chart manager for partial translation

18
Resources
  • Existing KBMT system
  • Existing EBMT software
  • Transcribed caption (monolingual) data
  • Dictionary

19
Bibliography
  • Nyberg and Mitamura, 1997, "A Real-Time MT System
    for Translating Broadcast Captions," Proceedings
    of MT Summit VI
  • Davide Turcato, "A Unified Example-Based and
    Lexicalist Approach to Machine Translation," TMI 99
  • Pascale Fung, "A Statistical View on Bilingual
    Lexicon Extraction: From Parallel Corpora to
    Non-parallel Corpora," Lecture Notes in Artificial
    Intelligence, Springer, vol. 1529, pp. 1-17

20
Thanks!
  • Questions?