BUILDING AN ENVIRONMENT FOR UNSUPERVISED AUTOMATIC EMAIL TRANSLATION (EAMT/CLAW, Dublin 2003)

1
BUILDING AN ENVIRONMENT FOR UNSUPERVISED AUTOMATIC EMAIL TRANSLATION
EAMT/CLAW, Dublin 2003
  • Salvador Climent, Joaquim Moré, Antoni Oliver
  • Interdisciplinary Internet Institute (IN3) / Universitat Oberta de Catalunya (UOC)
  • Av. Tibidabo 39, 08035 Barcelona
  • scliment@uoc.edu, jmore@uoc.edu, aoliverg@uoc.edu

2
What is the UOC?
  • UOC (www.uoc.edu) is a virtual university (17
    official university degrees, 1 Ph.D. program,
    several dozen other courses)
  • Communication between professors, students and
    supervisors: email, newsgroups
  • Languages used: Spanish and Catalan
  • Geographical coverage: Catalonia, rest of Spain
    and South America

3
The students' language profile
  • Catalan students who are competent in Catalan and
    Spanish.
  • Spanish speakers living in Catalonia who
    understand and speak Catalan but are not able to
    write it properly.
  • South American students who are not competent in
    Catalan at all.

4
The status of Catalan in the virtual classrooms
  • Statistical study of the language used when
    writing new emails and when replying to them
  • Relevant data: 42.9% out of the 68.9% of spontaneous
    Catalan users code-switch to Spanish when
    replying to messages in that language.
  • Conclusion: the UOC's expansion might lead to the
    gradual substitution of Catalan by Spanish in the
    virtual classrooms.

5
The Interlingua Project
  • Goal: to enable students to use their own
    language regardless of the receiver's. As a
    consequence, Catalan is not replaced by
    Spanish as the instrumental language among
    students.
  • How to reach the goal: machine translation of
    emails in both directions (Catalan-Spanish and
    Spanish-Catalan).

6
Challenges posed for MT
  • Impossibility of human intervention
      • Email communication admits no delay for human
        formatting, pre-editing or post-editing.
  • Specificity of the email register
      • Use of non-standard and incorrect language.
      • Asking users to exercise some kind of linguistic
        self-control is bound to fail.
  • Problems caused by bilingualism and languages in
    contact
      • Messages in both Catalan and Spanish (quoting).
      • Language interference in monolingual emails, caused
        by different levels of competence in either
        language.
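
The quoting problem on this slide can be illustrated with a minimal sketch. It assumes, purely for illustration, that quoted text is marked with a leading ">" (a common email convention, not something the slides specify); the function name `split_quoted` and the sample message are hypothetical. Separating the two parts would let each be handled according to its own language.

```python
def split_quoted(body: str) -> tuple[list[str], list[str]]:
    """Separate quoted lines (assumed to start with '>') from newly written lines."""
    quoted, fresh = [], []
    for line in body.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip blank lines
        if stripped.startswith(">"):
            # Drop the quote markers and any spaces that follow them.
            quoted.append(stripped.lstrip("> "))
        else:
            fresh.append(stripped)
    return quoted, fresh


# Hypothetical message: a Catalan reply quoting a Spanish question.
message = """Hola a tots,
> ¿Alguien sabe cuándo es el examen?
L'examen és dilluns."""

quoted, fresh = split_quoted(message)
print(quoted)
print(fresh)
```

A real system would also need to cope with other quoting styles (for example, indented or inline quotes), which this sketch ignores.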

7
Outline of the project
  • Evaluation of the MT system chosen to perform the
    task, in order to know:
      • The actual linguistic effects of email
        communication as they relate to MT
      • Which aspects of email communication are
        problematic for a standard MT system
      • What kinds of problems can be solved by
        customizing the system
  • System customization

8
Evaluation
  • System evaluated: Sail-Labs Incyta (currently
    working)
  • Translation directions: SPA-CAT / CAT-SPA
  • Environment chosen: Fòrum d'Informàtica
    (Computer Science newsgroup)
  • Guidelines
      • ISLE (ISLE00) standards suited to the needs of
        the project

9
Stages of the evaluation
  • Macro-evaluation: to know where we are, what can
    be expected and what we will eventually reach
  • Micro-evaluation: information to decide what kind
    of modules should be prioritized to achieve a
    greater improvement in translation quality

10
Micro-evaluation
  • Granularity
      • 1240 sentences out of the 130 emails for each
        direction
      • Two versions: pre-edited and non-pre-edited
  • ISLE items
      • Intelligibility
      • Fidelity
      • Terminological precision
      • Style
  • Error types
      • Characteristics of the input (attributable to the
        email writer)
      • Characteristics of the output (attributable to the
        MT system)

11
Outstanding problems according to the
micro-evaluation
  • Untranslated terminology
      • Domain terminology
          • CAT-SPA: 24.42%; SPA-CAT: 27.90%
          • Examples: server, script, ...
      • Speech-community terminology
          • CAT-SPA: 11.66%; SPA-CAT: 13.71%
          • Example: MIC (an acronym for an academic
            subject)

12
Outstanding problems according to the
micro-evaluation (Competence errors 1)
  • Orthographic errors
      • Problems related to accentuation
          • CAT-SPA: 22.87%; SPA-CAT: 29.81%
          • Examples:
            CAT: El professor està be, translated as "The teacher
            is sheep" (bé, "well", was intended)
            CAT: página (not translated) instead of pàgina (page)
            SPA: Estudio ingles II, translated as "I study groins
            II" (inglés, "English", was intended)
      • Phoneme/grapheme confusion
          • CAT-SPA: 4.46%; SPA-CAT: 0.19%
          • Example: CAT reçerca instead of recerca (research)

13
Outstanding problems according to the
micro-evaluation (Competence errors 2)
  • Lexical errors
      • Oral reproduction
          • CAT-SPA: 2.64%; SPA-CAT: 0.66%
          • Example: CAT avere instead of a veure (let's see)
      • Barbarisms
          • CAT-SPA: 1.54%; SPA-CAT: 0.76%
          • Example: CAT insertar instead of inserir
            (insertar is a Spanish word)

14
Outstanding problems according to the
micro-evaluation (Competence errors 3)
  • Syntactic errors
      • CAT-SPA: 3.28%; SPA-CAT: 4.57%
  • Cohesion errors
      • Punctuation errors
          • CAT-SPA: 2.27%; SPA-CAT: 1.81%

15
Outstanding problems according to the
micro-evaluation (Intentional deviations)
  • Orthographic innovations
      • CAT-SPA: 4.83%; SPA-CAT: 8.19%
      • Example: SPA tod@s (everybody)
  • Language shift
      • CAT-SPA: 2.18%; SPA-CAT: 4.38%
      • Examples: ciao, merci, help, ...
      • CAT: Vés a Inicio (Click on Start)
  • Problems related to the presence of smileys, SMS
    words, etc. are not quantitatively relevant
    (around 1%)

16
Outstanding problems according to the
micro-evaluation (final)
  • Typing errors
      • CAT-SPA: 8.38%; SPA-CAT: 5.23%
  • Translated proper nouns in the body of the email
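
Since the stated purpose of the micro-evaluation is to decide which modules to prioritize, the figures reported on slides 11 to 16 can be gathered and ranked per translation direction. The following is only a sketch of that bookkeeping: the dictionary name `ERROR_RATES`, the function `rank` and the shortened category labels are illustrative choices, not part of the presentation.

```python
# Figures as reported on slides 11-16, in the order (CAT-SPA, SPA-CAT).
ERROR_RATES = {
    "Domain terminology": (24.42, 27.90),
    "Speech-community terminology": (11.66, 13.71),
    "Accentuation": (22.87, 29.81),
    "Phoneme/grapheme confusion": (4.46, 0.19),
    "Oral reproduction": (2.64, 0.66),
    "Barbarisms": (1.54, 0.76),
    "Syntactic errors": (3.28, 4.57),
    "Punctuation errors": (2.27, 1.81),
    "Orthographic innovations": (4.83, 8.19),
    "Language shift": (2.18, 4.38),
    "Typing errors": (8.38, 5.23),
}


def rank(direction: str) -> list[tuple[str, float]]:
    """Return (error type, figure) pairs for one direction, highest first."""
    column = {"CAT-SPA": 0, "SPA-CAT": 1}[direction]
    pairs = [(name, rates[column]) for name, rates in ERROR_RATES.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


for direction in ("CAT-SPA", "SPA-CAT"):
    print(direction)
    for name, rate in rank(direction)[:3]:
        print(f"  {name}: {rate:.2f}")
```

On these figures, domain terminology, accentuation and speech-community terminology are the three largest categories in both directions.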

17
Problems to be solved by customizing the system
  • Pre-editing
      • Accent recovery
      • Typing-mistake recovery
      • Punctuation recovery
      • Language shifts
      • Proper noun resolution (confusion with other
        capitalized words)
  • Building domain and speech-community lexicons
  • Post-editing
      • Untranslated terminology treatment
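
As an illustration of what an accent-recovery step might look like, here is a minimal lexicon-based sketch. It is not the project's actual module: the tiny `CATALAN_LEXICON`, the function names and the matching strategy (compare accent-stripped forms and correct only when exactly one lexicon entry matches) are all assumptions made for this example.

```python
import re
import unicodedata


def strip_accents(word: str) -> str:
    """Remove combining marks, so that 'pàgina' and 'página' both give 'pagina'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def recover_accents(text: str, lexicon: set[str]) -> str:
    """Replace unknown words by the single lexicon entry with the same base form."""
    known = {entry.lower() for entry in lexicon}
    index: dict[str, set[str]] = {}
    for entry in known:
        index.setdefault(strip_accents(entry), set()).add(entry)

    def fix(match: re.Match) -> str:
        word = match.group(0)
        lower = word.lower()
        if lower in known:
            return word  # already a valid word: leave untouched
        candidates = index.get(strip_accents(lower), set())
        if len(candidates) != 1:
            return word  # no candidate, or ambiguous: leave untouched
        fixed = next(iter(candidates))
        return fixed.capitalize() if word[0].isupper() else fixed

    # [^\W\d_]+ matches runs of letters, including accented ones.
    return re.sub(r"[^\W\d_]+", fix, text)


# Toy lexicon, hypothetical and far too small for real use.
CATALAN_LEXICON = {"aneu", "a", "la", "pàgina", "el", "professor", "està", "be", "bé"}

print(recover_accents("Aneu a la página 3", CATALAN_LEXICON))
print(recover_accents("El professor està be", CATALAN_LEXICON))
```

The first call corrects página to pàgina, the case from slide 12. The second leaves be unchanged, because be is itself a valid word: errors such as be/bé or ingles/inglés cannot be recovered from a word list alone and would need contextual information.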

18
Outline of the translation process
  • Sketch of the environment