CSA405: Advanced Topics in NLP

1
CSA405: Advanced Topics in NLP
  • Machine Translation I
  • Introduction to MT

2
Outline
  • MT: Machine Translation
  • Why MT is important
  • What MT is and why MT is difficult
  • MT and the Human Translator

3
Why Machine Translation is Important
4
Misconceptions about MT
  • There was/is an MT system which translated "The
    spirit is willing, but the flesh is weak" into the
    Russian equivalent of "The vodka is good, but the
    steak is lousy", and "hydraulic ram" into the French
    equivalent of "water goat". MT is useless.
  • MT is a waste of time because you will never make
    a machine that can translate Shakespeare.
  • Generally, the quality of translation you can
    get from an MT system is very low. This makes
    such systems useless in practice.
  • MT threatens the jobs of translators.
  • The Japanese have developed a system that you can
    talk to on the phone. It translates what you say
    into Japanese, and translates the other speaker's
    replies into English.
  • There is an amazing South American Indian
    language with a structure of such logical
    perfection that it solves the problem of
    designing MT systems.
  • MT systems are machines, and buying an MT system
    should be very much like buying a car.

5
Some Facts about MT
  • MT is useful. The METEO system has been in daily
    use since 1977. As of 1990, it was regularly
    translating around 45 000 words daily. It also
    produces high-quality output.
  • While MT systems sometimes produce howlers, there
    are many situations where the ability of MT
    systems to produce reliable, if less than
    perfect, translations at high speed is valuable.
  • MT does not threaten translators' jobs. The
    limitations of current MT systems are too great.
    However, MT systems can take over some of the
    boring, repetitive translation jobs and allow
    human translators to concentrate on more
    interesting specialist tasks.
  • Speech-to-Speech MT is still a research topic.
    Verbmobil has been developed in Germany.
  • Building an MT system is an arduous and
    time-consuming job, involving the construction of
    grammars and very large monolingual and bilingual
    dictionaries. There is no "magic solution" to
    this.
  • Before an MT system becomes really useful, a
    user will typically have to invest a considerable
    amount of effort in customization.

6
The Place for MT
  • Human translators are good at
      • Getting the right turn of phrase
      • Preserving translation equivalence
  • Human translators are bad at
      • Dictionary look-up
      • Consistency of translation
      • Translation of terminology
  • MT can exploit these weaknesses

7
Implications of Multilinguality
8
MT is important because...
  • There are too few human translators
  • Socio-political considerations require it.
  • Availability of materials in appropriate language
    has significant economic consequences.
  • Scientifically, it is still one of the best test
    areas for language technology
  • Philosophically, it demands practical solutions
    to old problems (e.g. role of knowledge and
    understanding in translation).

9
How much is MT used?
  • It is a myth that MT is not used
  • In 2000, MT specialist Scott Bennett said:
    "Altavista's BabelFish ... initiated 1997, is now
    used a million times per day."
  • In 2001, Softissimo announced that the Internet
    translation request volume processed by
    www.reverso.net has now reached several million
    (Web pages, e-mail, short texts and results of
    search engine requests) per month on its mail
    translation portal and the portals of its
    Internet partners.
  • V.d. Meer (2003): "Every day, portals like
    Altavista and Google process nearly 10 million
    requests for automatic translation."
  • MT usage is increasing

10
How much more could it be used?
  • Translation/localisation industry has so far
    focused largely on product documentation
  • This represents less than 20% of all text-based
    information repositories that need to be
    localised
  • → Corporate decision makers and governments will
    have to begin supporting multilingual
    communication initiatives and strategies.

11
Why Translation is Difficult
12
What is Translation?
  • The process of transforming text from one
    language into another language.
  • A written communication in a second language
    having the same meaning as the written
    communication in a first language
  • It is what translators actually do! (Martin Kay)

13
What Translators Actually Do: An Example of En/Fr
Translation
  • As recently as a decade ago it was widely
    believed that infectious disease was no longer
    much of a threat in the developed world. The
    remaining challenges to public health there, it
    was thought, stemmed from noninfectious
    conditions such as cancer, heart disease and
    degenerative diseases.
  • Il y a une dizaine d'années, on croyait que les
    pays industrialisés étaient débarrassés des risques
    liés aux maladies infectieuses et que la santé
    publique n'était menacée que par des maladies
    comme le cancer, les troubles cardiaques, et les
    anomalies génétiques.

14
Problems: Style and Meaning
  • English
      • Two sentences
      • infectious disease was no longer much of a
        threat in the developed world
      • The remaining challenges to public health there
      • noninfectious conditions
  • French
      • One sentence
      • les pays industrialisés étaient débarrassés des
        risques liés aux maladies infectieuses
      • la santé publique n'était menacée que
      • maladies

15
Problems: Contextual Interpretation
OPEN
16
Problems: Non-Equivalences, Lexical Gaps

  English                    French
  Room                       Salle / chambre / pièce
  I arrive / am arriving     J'arrive
  "Consumptions"             Consommations
  VAT                        TVA
  "Bits and pieces"          Petites fournitures
  I miss you                 Tu me manques

17
Cultural Models
  • English: health insurance; German:
    Krankenversicherung; French: assurance maladie
  • English: stamp; German: entwerten; French:
    oblitérer
18
Structural Ambiguity
  • I bought a car with four doors/liri
  • I forgot how good beer tastes
  • Time flies like an arrow
  • The councillors refused the women a permit
    because they advocated/feared violence.

19
Summary
  • Translation is about more than equivalence of
    meaning.
  • Translation may involve the resolution of
    ambiguity.
  • Preservation of intention involves cultural
    background as well as linguistic knowledge.
  • Translation is a hard problem for humans, let
    alone machines.

20
Similarities and Differences Between Languages
  • Differences
      • Morphology
      • Word order and syntactic structures
      • Marking of semantic distinctions
      • Lexical
  • Similarities
      • Communicative function for survival
      • Mechanisms for reference to people, eating,
        politeness, time.
      • Syntactic complexity
      • Nouns
      • Verbs

21
Differences in Morphology
  • Number of morphemes per word
      • One morpheme per word (Vietnamese)
      • Many morphemes per word (Maltese)
  • Segmentability of morphemes
      • Agglutinative (Turkish):
        uygarlaştıramadıklarımızdanmışsınızcasına,
        "behaving as if you are among those whom we
        could not cause to become civilised"
      • Fusional: a single affix carries multiple
        morphemes (Russian)
      • stolom, "with (a) table" (-om =
        SING/INSTR/DECL1)

22
Differences in Word Order
  • SVO (English): The man kicked the ball
  • SOV (Japanese): The man the ball kicked
  • Mixed (German): The man (has) the ball kicked must
  • VSO (Classical Arabic): Kicked the man the ball
  • Free word order (Latin)

23
Differences in Marking of Semantic Information
  • Head marking vs. dependent marking
      • In English the possessive relation is marked on
        the dependent (the possessor): the man's house
      • In Hungarian it is marked on the head (the
        possessed noun): the man house-his
      • his house / sa maison
  • Direction and manner of motion marking
      • He ran into the room (English)
      • He entered the room running (French)

24
Lexical Differences: Semantic Granularity
25
Hutchins & Somers (1992)
26
Lexical Differences
  • Lexical gaps: when a word exists in one language
    but not in another
      • Japanese does not have a word corresponding to
        "privacy".
      • English does not have a word for Japanese
        oyakoko (filial piety).
  • Sapir/Whorf hypothesis
      • Language constrains thought
      • Speakers of different languages employ different
        conceptual systems
      • Impossibility of translation in general.

27
Machine Translation and Human Translators
28
In the Beginning ... was the dream of FAMT
  • Fully Automatic (High Quality) Machine
    Translation (Bar-Hillel 1960)

[Diagram: Source-language text → FAHQMT → Target-language text]
29
FAMT
  • Basic Characteristics
      • No human intervention
      • Arbitrary text
  • Evaluation Criteria
      • Quality of output
      • Cost (per page)
      • Speed (pages/hour)

30
Translation Process 1
  • Pre-editing
  • Translation
  • Post-editing
  • No pre-editing → Lots of post-editing!
  • Lots of pre-editing → No(t much) post-editing!
  • GARBAGE IN, GARBAGE OUT!!!

31
Pre-editing
  • What constitutes good input?
      • Depends on the system.
      • Short, simple, grammatical sentences
  • Poor input: New toner units are held level during
    installation and, since they do not as supplied
    contain toner, must be filled prior to
    installation from a small cartridge.
  • Better input: Fill the new toner unit with toner
    from a toner cartridge. Hold the new toner unit
    level while you put it in the printer.

32
Pre-editing
  • Avoidance of ambiguous terms
  • Trend towards controlled languages and related
    tools
      • Spellcheckers
      • Grammar checkers
      • Critiquing systems
  • Controlled English: to make English accessible to
    and usable by the greatest number of people. Basic
    English, cf. Esperanto.
  • Main idea: to reduce the number of general words
    needed for writing anything to a few hundred, from
    75000 (average for skilled native speakers), by
    using operator verbs, e.g. "make perfect" instead
    of "perfect".
  • Xerox offers its technical writers one-day
    course, British Aerospace does the same in a few
    short sessions
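
A pre-editing aid of the kind listed above can be sketched as a simple checker. This is a minimal sketch, not a description of any real controlled-language tool: the word limit, the approved vocabulary and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical limit and vocabulary, chosen only for illustration.
MAX_WORDS_PER_SENTENCE = 20
APPROVED_WORDS = {
    "fill", "the", "new", "toner", "unit", "with", "from", "a",
    "cartridge", "hold", "level", "while", "you", "put", "it", "in",
    "printer",
}


def check_sentence(sentence):
    """Return a list of warnings for one sentence (empty if it passes)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    warnings = []
    if len(words) > MAX_WORDS_PER_SENTENCE:
        warnings.append(f"too long: {len(words)} words")
    unknown = sorted(set(words) - APPROVED_WORDS)
    if unknown:
        warnings.append("unapproved words: " + ", ".join(unknown))
    return warnings


def check_text(text):
    """Split text into sentences and map each failing sentence to its warnings."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    report = {}
    for sentence in sentences:
        warnings = check_sentence(sentence)
        if warnings:
            report[sentence] = warnings
    return report
```

With this toy vocabulary, the rewritten toner instructions from the Pre-editing slide pass without warnings, while the original single long sentence is flagged both for length and for unapproved words.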

33
Translation Process 2
  • Coordination
  • Communication
  • In theory, FAMT is meant to usurp pre-editing,
    translation, post-editing phases.
  • But even with current technology, no system can
    be built which satisfies all of FAMT's goals
    simultaneously

34
FAMT Success Story: TAUM METEO
  • Written by Chevalier et al. (1978).
  • Translation of weather reports from English to
    French
  • Highly constrained subset of English
  • Small number of senses for each word
  • Restricted syntactic constructions
  • System determines whether a given sentence is
    within its capabilities
  • Very fast, very accurate, no post-editing
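
The idea that a system decides whether a sentence is within its capabilities can be sketched as a coverage check against a closed lexicon. This is only an illustration under stated assumptions: the tiny lexicon and the word-for-word translation step are hypothetical and do not describe how METEO worked internally.

```python
# Hypothetical mini-lexicon for a weather-bulletin sublanguage.
LEXICON = {
    "cloudy": "nuageux",
    "with": "avec",
    "showers": "averses",
    "tonight": "ce soir",
    "tomorrow": "demain",
}


def in_coverage(sentence):
    """True if every word of the sentence is in the sublanguage lexicon."""
    return all(word in LEXICON for word in sentence.lower().split())


def translate_or_reject(sentence):
    """Translate a covered sentence; return None to route it to a human."""
    if not in_coverage(sentence):
        return None
    return " ".join(LEXICON[word] for word in sentence.lower().split())
```

Rejecting uncovered input, rather than guessing, is what lets such a restricted system skip post-editing for the sentences it does accept.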

35
FAMT: Moral
  • FAMT can work well but only if we give up one or
    more of the goals, e.g.
      • Unrestricted text input
      • High quality translation
  • This observation has led to research on
    sub-languages
  • And to the use of FALQT
36
Sublanguages
  • Restricted domain of reference
  • Restricted purpose and orientation
  • Restricted mode of communication (may include
    bandwidth considerations)
  • Community of users sharing specialised knowledge
  • See Kittredge (1985) for further details of what
    computational techniques are applicable to
    sublanguages

37
Fully Automatic Low Quality Translation (FALQT)
  • Can be used where translation volume is high.
  • Where the gist is more important than an accurate
    translation
  • Where we need to select a small group of
    documents from a large collection for subsequent
    high quality translation.
  • Must answer the question: could document X in
    collection Z be about Y?
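
The triage use of FALQT can be sketched as a rough word-for-word gloss followed by a topic test. This is a minimal sketch under stated assumptions: the glossary, the `min_hits` threshold and all function names are hypothetical, and a real gisting system would do far more than glossary lookup.

```python
import re

# Hypothetical French-to-English glossary, for illustration only.
GLOSSARY = {
    "maladies": "diseases",
    "infectieuses": "infectious",
    "risques": "risks",
    "cancer": "cancer",
    "budget": "budget",
    "taxes": "taxes",
}


def gist(text):
    """Low-quality word-for-word gloss; unknown words pass through unchanged."""
    return [GLOSSARY.get(word, word) for word in re.findall(r"\w+", text.lower())]


def could_be_about(text, topic_terms, min_hits=2):
    """Could this document be about the topic? True if enough topic terms appear."""
    return len(set(gist(text)) & set(topic_terms)) >= min_hits


def select_for_translation(collection, topic_terms):
    """Return ids of documents worth sending for high-quality translation."""
    return [doc_id for doc_id, text in collection.items()
            if could_be_about(text, topic_terms)]
```

The gloss does not need to be readable; it only needs to be good enough to decide which documents deserve a proper translation.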

38
FAMT is not the only way
  • FAMT lies at one extreme of a continuum of ways
    in which technology can be brought to bear upon
    the translation problem
  • At the other extreme there are word processing
    software, fax machines, and even mobile phones
  • Between these two extremes there are other points
    of interest where technology can radically affect
    the productivity of the individual translator.

39
MAHT and HAMT
  • Machine Aided Human Translation (MAHT)
  • Human Aided Machine Translation (HAMT).
  • The essential difference between these two lies
    not only in the way in which the person is
    involved but also in the extent of their
    involvement

40
MAHT
  • All initiative resides with the human.
  • Often based on a text editor with certain
    translation-specific functionalities such as
      • Simultaneous access to source and target texts
      • Online access to dictionaries, thesauri,
        terminological databases, and word concordance
        tools.
      • Identification of and access to secondary
        materials such as texts being worked on and
        other texts like it in both source and target
        forms.

41
MAHT - Translation Memories
  • Systems consist of a database in which each
    source sentence of a translation is stored
    together with the target sentence (this is called
    a translation memory "unit")
  • Any new source sentences will be searched for in
    the database and a match value is calculated.
  • When the match value is 100%, the translation of
    the source sentence from the database is inserted
    into the text being translated.

42
MAHT - Translation Memories
  • If the match value is below 100% and above a
    certain user-definable percentage (i.e., "fuzzy
    match"), the old translation will be inserted as
    a translation proposal for the translator to
    review and edit.
  • Sentences with match values below that margin
    have to be translated from scratch.
  • New and changed translation proposals will then
    be stored in the database for future use.
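
The lookup cycle described on the two Translation Memories slides can be sketched as follows. This is a minimal sketch, not how any commercial TM tool computes its match value: it assumes character-level similarity from Python's `difflib.SequenceMatcher` as the match value, and the class name, method names and default threshold are hypothetical.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy translation memory: source sentence -> target sentence units."""

    def __init__(self, fuzzy_threshold=75.0):
        self.units = {}
        # User-definable percentage below which no proposal is made.
        self.fuzzy_threshold = fuzzy_threshold

    def store(self, source, target):
        """Add or update a translation unit for future use."""
        self.units[source] = target

    def lookup(self, source):
        """Return (kind, match_value, proposal) for a new source sentence."""
        if source in self.units:
            return "exact", 100.0, self.units[source]
        best_value, best_target = 0.0, None
        for stored_source, stored_target in self.units.items():
            value = 100.0 * SequenceMatcher(None, source, stored_source).ratio()
            if value > best_value:
                best_value, best_target = value, stored_target
        if best_target is not None and best_value >= self.fuzzy_threshold:
            return "fuzzy", best_value, best_target
        # Below the threshold: translate from scratch.
        return "none", best_value, None
```

An exact match would be inserted directly, a fuzzy match offered to the translator for review, and the reviewed result passed back to `store`.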

43
MAHT - Translation Memories Advantages
  • Avoid redoing translation of repeated material
  • Use previous texts as a model for new
    translations
  • Ensure consistency throughout a translation

44
MAHT - Translation Memories - Drawbacks
  • If terminology changes between projects the
    content of a TM needs to be updated to reflect
    these changes.
  • Blind faith in exact matches (without validation)
    can generate incorrect translation since there is
    no verification of the context where the new
    segment is used compared to where the original
    one was used.

45
MAHT - Translation Memories - Remarks
  • Translation process: TM tools may not easily fit
    into existing translation or localization
    processes; they work best where work can be signed
    off in pieces rather than as a whole.
  • Customisation: TM tools rarely work straight out
    of the box. Menu adaptation and filters for
    desktop applications may require significant
    effort.
  • Investment costs are high.
  • Setup and maintenance of TMs has to be factored in.
  • OpenTag/TMX formats for exchanging TM data
    between competing systems
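
To make the exchange idea concrete, the sketch below reads source/target pairs from a small TMX document using Python's standard `xml.etree.ElementTree`. The sample content and the `read_tmx` helper are assumptions made for illustration; the element names (`tu`, `tuv`, `seg`) and the `xml:lang` attribute follow the TMX 1.4 layout.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the predefined XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample file content, for illustration only.
SAMPLE_TMX = """<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0"
          segtype="sentence" o-tmf="example" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Press the red button.</seg></tuv>
      <tuv xml:lang="fr"><seg>Appuyez sur le bouton rouge.</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_tmx(tmx_text, src_lang, tgt_lang):
    """Return a list of (source, target) segment pairs from TMX text."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segments = {}
        for tuv in tu.findall("tuv"):
            seg = tuv.find("seg")
            if seg is not None:
                segments[tuv.get(XML_LANG)] = "".join(seg.itertext())
        if src_lang in segments and tgt_lang in segments:
            pairs.append((segments[src_lang], segments[tgt_lang]))
    return pairs
```

Because the units are plain source/target pairs, a memory built in one tool can in principle be loaded into another.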

46
MAHT: Other Technology
  • Communication/coordination amongst translators
  • Integration of internet technologies and web
    services.
  • Database technology, smart indexing, and
    networking
  • Improvements can be achieved that are well within
    the scope of current technology.

47
HAMT: Human Assisted Machine Translation
  • Machine retains the initiative but works in
    collaboration with human consultant.
  • System translates autonomously until it
    recognises that a linguistic difficulty of a
    certain type has arisen, e.g.
      • ambiguity
      • pronoun reference
      • unknown word
      • unrecognised construction
  • At this point it seeks help from the consultant.
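
The control flow described above, where the machine keeps the initiative and calls on the consultant only when it detects a difficulty, can be sketched with a callback. This is a toy illustration under stated assumptions: the lexicon, the word-for-word strategy and the `ask` callback signature are hypothetical, and only two of the listed difficulty types (ambiguity and unknown word) are detected.

```python
# Hypothetical English-to-French lexicon; several entries mark an ambiguity.
LEXICON = {
    "the": ["la"],
    "room": ["salle", "chambre", "pièce"],
    "is": ["est"],
    "open": ["ouverte"],
}


def translate(sentence, ask):
    """Translate word for word, calling ask(kind, word, options) on difficulty."""
    output = []
    for word in sentence.lower().split():
        options = LEXICON.get(word, [])
        if len(options) == 1:
            # No difficulty: the machine proceeds on its own.
            output.append(options[0])
        elif options:
            output.append(ask("ambiguity", word, options))
        else:
            output.append(ask("unknown word", word, []))
    return " ".join(output)
```

In an interactive setting `ask` would prompt the consultant; in a test it can be any function that returns the chosen target word.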

48
HAMT Challenges
  • Reliable identification/classification of
    difficulty.
  • Reliable communication of difficulty to user.
  • Tradeoff between quality and scope of
    translation.

49
HAMT - Advantages
  • Modulo the challenges above, a high quality of
    translation can be guaranteed.
  • Speed, if large sections of text can be
    translated automatically.
  • The human consultant need not have all the skills
    of a human translator; native competence in one or
    both languages may suffice.

50
Summary
  • Machine Translation is a continuum
      • FAMT
      • HAMT
      • MAHT
  • The utility of a given type of system cannot be
    assessed with very simple criteria
  • The utility function involves at least the human
    cost, the machine cost, the quality of the
    result, and the nature of the translation
    requirements.

51
Some References
  • Jonathan Slocum, "Machine Translation: its
    History, Current Status, and Future Prospects",
    Proc. ACL 1984, Stanford University,
    http://acl.ldc.upenn.edu/P/P84/P84-1116.pdf
  • Martin Kay, "Machine Translation", Computational
    Linguistics, vol. 11, nos. 2-3, 1985.
  • Richard Kittredge, "Sublanguages", Computational
    Linguistics, vol. 11, nos. 2-3, 1985.