Title: BUILDING AN ENVIRONMENT FOR UNSUPERVISED AUTOMATIC EMAIL TRANSLATION, EAMT/CLAW Dublin 2003
1 BUILDING AN ENVIRONMENT FOR UNSUPERVISED
AUTOMATIC EMAIL TRANSLATION EAMT/CLAW Dublin
2003
- Salvador Climent, Joaquim Moré, Antoni Oliver
- Interdisciplinary Internet Institute (IN3) / Universitat Oberta de Catalunya (UOC)
- Av. Tibidabo 39, 08035 Barcelona
- scliment_at_uoc.edu, jmore_at_uoc.edu, aoliverg_at_uoc.edu
2 What is the UOC?
- UOC (www.uoc.edu) is a virtual university (17 official university degrees, 1 Ph.D. program, several dozen other courses)
- Communication between professors, students and supervisors: email, newsgroups
- Languages used: Spanish and Catalan
- Geographical coverage: Catalonia, rest of Spain and South America
3 The students' language profile
- Catalan students who are competent in Catalan and Spanish.
- Spanish speakers living in Catalonia who understand and speak Catalan but are not able to write it properly.
- South American students who are not competent in Catalan at all.
4 The status of Catalan in the virtual classrooms
- Statistical study on the language used when writing new emails and replying to them
- Relevant data: 42.9% out of 68.9% spontaneous Catalan users code-switch to Spanish when replying to messages in that language.
- Conclusion: the UOC expansion might lead to gradual substitution of Catalan by Spanish in the virtual classrooms.
5 The Interlingua Project
- Goal: to enable the student to use his/her own language independently of the receiver. As a consequence, Catalan is not to be replaced by Spanish as the instrumental language among students.
- How to achieve the goal: machine translation of emails in both directions, Catalan-Spanish and Spanish-Catalan.
6 Challenges posed for MT
- Impossibility of human intervention
  - Email communication admits no delay for human formatting, pre-editing or post-editing.
- Specificity of the email register
  - Use of non-standard and incorrect language.
  - Charging users with some kind of language self-control is bound to fail.
- Problems caused by bilingualism and languages in contact
  - Messages in both Catalan and Spanish (quoting).
  - Language interference in monolingual mails caused by different levels of competence in either of the languages.
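The quoting problem can be illustrated with a minimal sketch. It assumes, as is conventional in plain-text email but not stated in the slides, that quoted lines start with ">"; the function name `split_quoted` is hypothetical and is not part of the system described here. The idea is only that quoted and newly written text can be separated before any language-specific processing.

```python
from itertools import groupby


def split_quoted(body: str) -> list[tuple[bool, str]]:
    """Group consecutive lines into (is_quoted, text) segments.

    Assumption: quoted material is marked with a leading '>'.
    """
    segments = []
    lines = body.splitlines()
    # groupby merges runs of adjacent lines that share the same quoted status.
    for is_quoted, group in groupby(lines, key=lambda ln: ln.lstrip().startswith(">")):
        segments.append((is_quoted, "\n".join(group)))
    return segments


# A Catalan reply that quotes a Spanish message.
message = "> Hola a todos\n> ¿Alguien tiene el script?\nSí, el tinc jo.\nFins aviat"
for is_quoted, text in split_quoted(message):
    print("QUOTED" if is_quoted else "NEW", repr(text))
```

Each segment could then be handled on its own, for example by translating only the new text or by choosing the translation direction per segment.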
7 Outline of the project
- Evaluation of the MT system chosen to perform the task, in order to know:
  - The actual linguistic effects of email communication related to MT
  - What aspects of email communication are problematic for a standard MT system
  - What kinds of problems can be solved by customizing the system
- System customizing
8 Evaluation
- System evaluated: Sail-Labs Incyta (currently working)
- Translation directions: SPA-CAT / CAT-SPA
- Environment chosen: Fòrum d'Informàtica (Computer-Science Newsgroup)
- Guidelines
  - ISLE [ISLE00] standards suitable to the needs of the project
9 Stages of the evaluation
- Macro-evaluation: to know where we are, what can be expected and what we will eventually reach
- Micro-evaluation: information to decide what kind of modules should be prioritized to achieve a greater improvement in translation quality
10 Micro-evaluation
- Granularity
  - 1240 sentences out of the 130 emails for each direction.
  - Two versions: pre-edited and non-pre-edited
- ISLE items
  - Intelligibility
  - Fidelity
  - Terminological precision
  - Style
- Error types
  - Characteristics of the input (attributable to the email writer)
  - Characteristics of the output (attributable to the MT system)
11 Outstanding problems according to the micro-evaluation
- Untranslated terminology
  - Domain terminology
    - CAT-SPA: 24.42%, SPA-CAT: 27.90%
    - Examples: server, script, ...
  - Speech-community terminology
    - CAT-SPA: 11.66%, SPA-CAT: 13.71%
    - Example: MIC (an acronym for an academic subject)
12 Outstanding problems according to the micro-evaluation (Competence errors 1)
- Orthographic errors
  - Problems related to accentuation
    - CAT-SPA: 22.87%, SPA-CAT: 29.81%
    - Examples:
      - CAT: El professor està be, written for bé (translated as "The teacher is sheep")
      - CAT: página, written for pàgina, "page" (not translated)
      - SPA: Estudio ingles II, written for inglés (translated as "I study groins II")
  - Confusion phoneme/grapheme
    - CAT-SPA: 4.46%, SPA-CAT: 0.19%
    - Example: CAT reçerca, written for recerca (research)
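The accentuation and grapheme errors above suggest a simple lexicon lookup that ignores diacritics. The following is a minimal sketch under stated assumptions, not the method used in the project: the toy lexicon, the names `strip_accents`, `build_index` and `recover_accents`, and the rule "replace only when exactly one lexicon entry matches" are all illustrative choices.

```python
import unicodedata


def strip_accents(word: str) -> str:
    """Drop combining marks: 'pàgina' -> 'pagina', 'reçerca' -> 'recerca'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(lexicon: set[str]) -> dict[str, set[str]]:
    """Map each accent-stripped form to the lexicon entries that share it."""
    index: dict[str, set[str]] = {}
    for entry in lexicon:
        index.setdefault(strip_accents(entry), set()).add(entry)
    return index


def recover_accents(token: str, lexicon: set[str], index: dict[str, set[str]]) -> str:
    """Return the single lexicon entry that matches once diacritics are ignored.

    Known words, ambiguous tokens and unknown tokens are returned unchanged.
    A recovered form is returned in lowercase; restoring case is left out.
    """
    low = unicodedata.normalize("NFC", token).lower()
    if low in lexicon:
        return token
    candidates = index.get(strip_accents(low), set())
    return next(iter(candidates)) if len(candidates) == 1 else token


# Toy Catalan lexicon (illustrative only).
CATALAN = {"pàgina", "recerca", "be", "bé", "està"}
INDEX = build_index(CATALAN)

for word in ["página", "reçerca", "be"]:
    print(word, "->", recover_accents(word, CATALAN, INDEX))
```

In this sketch "página" and "reçerca" are mapped to "pàgina" and "recerca", while "be" is left alone because it is itself a valid word. Errors of that kind (be for bé, ingles for inglés) cannot be caught by lookup alone and would need contextual information.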
13 Outstanding problems according to the micro-evaluation (Competence errors 2)
- Lexical errors
  - Oral reproduction
    - CAT-SPA: 2.64%, SPA-CAT: 0.66%
    - Example: CAT avere, written for a veure (let's see)
  - Barbarisms
    - CAT-SPA: 1.54%, SPA-CAT: 0.76%
    - Example: CAT insertar instead of inserir. Insertar is a Spanish word.
14 Outstanding problems according to the micro-evaluation (Competence errors 3)
- Syntactic errors
  - CAT-SPA: 3.28%, SPA-CAT: 4.57%
- Cohesion errors
  - Punctuation errors
    - CAT-SPA: 2.27%, SPA-CAT: 1.81%
15 Outstanding problems according to the micro-evaluation (Intentional deviations)
- Orthographic innovations
  - CAT-SPA: 4.83%, SPA-CAT: 8.19%
  - Example: SPA tod@s (everybody)
- Language shift
  - CAT-SPA: 2.18%, SPA-CAT: 4.38%
  - Examples: ciao, merci, help, ...
  - CAT: Vés a Inicio (Click on Start)
- Problems related to the presence of smileys, SMS words, etc. are not quantitatively relevant (around 1%)
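An orthographic innovation such as tod@s could be normalized before translation with a small rewrite rule. This is a minimal sketch, not the project's rule: the function name `normalize_at_sign` is hypothetical, and mapping "@" to the generic masculine "o" is an assumption made only so that a standard MT lexicon would recognise the word.

```python
import re

# '@' preceded by a word character and followed by an optional plural 's'
# and then no further word character. Email addresses such as
# 'user@uoc.edu' do not match, because '@' is followed by more letters.
AT_SIGN = re.compile(r"(?<=\w)@(?=s?(?!\w))")


def normalize_at_sign(text: str) -> str:
    """Rewrite gender-inclusive '@' endings (tod@s) as 'o' (todos)."""
    return AT_SIGN.sub("o", text)


print(normalize_at_sign("Hola a tod@s, escribid a scliment@uoc.edu"))
```

The pattern is deliberately narrow. An address whose domain begins with a lone "s" followed by a dot would still be rewritten, so a real pre-editing step would need to protect email addresses first.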
16 Outstanding problems according to the micro-evaluation (final)
- Typing errors
  - CAT-SPA: 8.38%, SPA-CAT: 5.23%
- Translated proper nouns in the body of the email
17 Problems to be solved by customizing the system
- Pre-editing
  - Accent recovery
  - Typing mistakes recovery
  - Punctuation recovery
  - Language shifts
  - Proper noun resolution (confusion with other capitalized words)
- Building domain and speech community lexicons
- Post-editing
  - Untranslated terminology treatment
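The division into automatic pre-editing, translation and post-editing can be sketched as a chain of text-to-text steps. This is a minimal illustration under stated assumptions: `run_pipeline`, `fix_punctuation_spacing` and `toy_translate` are hypothetical names, and `toy_translate` is a word-lookup stand-in, not the Sail-Labs Incyta engine or its interface.

```python
import re
from typing import Callable, Sequence

Step = Callable[[str], str]


def run_pipeline(text: str, pre_steps: Sequence[Step], translate: Step,
                 post_steps: Sequence[Step]) -> str:
    """Apply pre-editing steps, then the MT call, then post-editing steps."""
    for step in pre_steps:
        text = step(text)
    text = translate(text)
    for step in post_steps:
        text = step(text)
    return text


def fix_punctuation_spacing(text: str) -> str:
    """Toy punctuation recovery: drop whitespace before punctuation marks."""
    return re.sub(r"\s+([,.;:!?])", r"\1", text)


def toy_translate(text: str) -> str:
    """Stand-in for the MT engine (hypothetical): word-by-word SPA-CAT lookup."""
    table = {"gracias": "gràcies", "todos": "tots"}
    return re.sub(r"\w+", lambda m: table.get(m.group(0).lower(), m.group(0)), text)


result = run_pipeline("gracias a todos .", [fix_punctuation_spacing], toy_translate, [])
print(result)
```

Keeping every step as a plain function from text to text means modules such as accent recovery or terminology treatment could be added, removed or reordered without touching the translation call.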
18 Outline of the translation process
- Sketch of the environment