An Introduction to Machine Translation - PowerPoint PPT Presentation

Provided by: MTgr9
1
An Introduction to Machine Translation
  • Andy Way, DCU

2
The Rise and Fall of Different MT Paradigms
3
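Three main approaches to RBMT
[Figure: The Vauquois Pyramid. ANALYSIS rises from the source text on one side, GENERATION descends to the target text on the other; direct translation runs along the base, TRANSFER across the middle, and a language-neutral interlingua sits at the apex.]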
4
System Design Concerns
  • Multilingual vs. Bilingual
  • Multilingual
  • Extreme: Eurotra, i.e. 72 language pairs. Modest:
    EN → DE, FR, ES, i.e. 3 language pairs
  • Intermediate: EN, FR, DE, ES, JP, but not all
    combinations
  • Bilingual
  • Unidirectional vs. Bidirectional
  • EN → FR or FR → EN
  • Reversible vs. Non-reversible
  • EN ↔ FR, same EN, FR components for Analysis
    and Generation, and reversible transfer module
  • EN → FR and FR → EN, but different EN, FR components
    for Analysis and Generation, and different transfer
    modules; NB, lack of modularity
  • Direct vs. Transfer vs. Interlingua
  • Batch vs. Interactive

5
Advantages/Disadvantages of Direct Systems
  • Advantages
  • Engine's competence lies in its comparative
    grammar.
  • Highly robust. Does not break down or stop
    when it encounters unknown words, unknown
    grammatical constructs, or ill-formed input.
  • Designed for unidirectional translation between
    one pair of langs. Not conducive to genuine
    multilingual MT design.
  • Disadvantages
  • 'Word-for-word' translation plus local reordering
    gives poor translation, using a cheap bilingual
    dictionary and rudimentary knowledge of the target
    language.
  • Linguistically and computationally naive. No
    analysis of the internal structure of the input,
    especially w.r.t. the grammatical relationships
    between the main parts of sentences.

6
Advantages/Disadvantages of Interlingual Systems
  • Advantages
  • Intermediate representation (IR) fully specified,
    i.e. no need to 'look back' at Source in order to
    generate Target.
  • Easy to extend to other langs.
  • Built-in back translation useful for testing.
  • Disadvantages
  • How to define an Interlingua for closely related
    languages?
  • Truly universal Interlingua possible?

7
Advantages/Disadvantages of Transfer Systems
  • Advantages
  • No language-independent representations: the source
    IR is specific to a particular lang., as is the
    target lang. IR.
  • So complexity of Analysis and Generation components
    much reduced
  • Also, no necessary equivalence between source and
    target IRs for the same language!
  • Disadvantages
  • Not so easy to extend to other languages: n
    analysis modules, n generation modules, n × (n − 1)
    transfer modules, i.e. not much less than n²
  • No guaranteed built-in back translation.
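
The module counts on this slide can be checked with a short sketch. The function names are illustrative, and the interlingua figures (one analysis and one generation module per language, no transfer modules) are the usual textbook assumption rather than something stated on the slide.

```python
def transfer_modules(n: int) -> dict:
    """Module counts for a transfer system covering n languages."""
    # One transfer module per ordered language pair: n * (n - 1).
    return {"analysis": n, "generation": n, "transfer": n * (n - 1)}


def interlingua_modules(n: int) -> dict:
    """Module counts for an interlingua system (assumed: no transfer step)."""
    return {"analysis": n, "generation": n, "transfer": 0}


if __name__ == "__main__":
    for n in (2, 4, 9):
        t = transfer_modules(n)
        i = interlingua_modules(n)
        print(n, "languages:", t, "total", sum(t.values()),
              "| interlingua total", sum(i.values()))
```

With n = 9 the transfer count is 9 × 8 = 72, which matches the 72 language pairs quoted for Eurotra.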

8
Direct, or Indirect?
  • Direct
  • From manufacturer's viewpoint, better, as it's
    more robust
  • Indirect
  • Falls over more easily.
  • Development phase can be trying.
  • Commercially, must be supplemented with
    techniques for dealing with unseen Input.
  • What about Translation Quality?
  • Indirect systems clearly better in principle.
  • However, constructing MT engine requires
    considerable effort.
  • Direct Systems can achieve good performance.
  • Summary
  • Research mostly Transfer-based, with rules
    automatically acquired from data
  • Industrially we can expect highly-developed
    Direct Systems to survive for some years to come

9
Other Material
  • Arnold, D. et al. (1994) Machine Translation:
    An Introductory Guide. NCC Blackwell, Oxford.
  • Hutchins, J. and H. Somers (1992) An Introduction
    to MT. Academic Press, London.
  • Trujillo, A. (1999) Translation Engines.
    Springer, London.
  • Newer books include
  • Bowker, L. (2002) Computer-Aided Translation
    Technology. U. of Ottawa Press.
  • Somers, H. (2003) Computers and Translation: A
    translator's guide. John Benjamins.
  • Bond, F. (2005) Translating the Untranslatable.
    CSLI.
  • Quah, C. (2006) Translation and Technology.
    Palgrave Macmillan.

10
Why Corpus-Based MT?
  • the (relative) failure of rule-based approaches
  • the increasing availability of machine-readable
    text
  • the increase in capability of hardware (CPU,
    memory, disk space) with associated decrease in
    cost

11
Corpus-Based MT is here to stay
  • These approaches are now mainstream
  • Most researchers are developing corpus-based
    systems
  • First company to use SMT now exists:
    http://www.languageweaver.com
  • CNGL partner Traslán uses EBMT/SMT hybrid
  • In recent large-scale evaluations, corpus-based
    MT systems come first.
  • Two caveats
  • Most industrial systems are still rule-based (but
    cf. Google's systems, now all SMT)
  • Current mainstream evaluation metrics favour
    n-gram-based systems (i.e. bias towards SMT).
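
To see why n-gram-based metrics can favour systems that reproduce the reference wording, here is a minimal sketch of clipped n-gram precision, the core ingredient of BLEU-style metrics. The slide does not name a metric, so this choice is an assumption; the candidate sentence is an invented paraphrase, and the reference is the English example sentence used later in the deck.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate: str, reference: str, n: int) -> float:
    """Share of candidate n-grams also found in the reference.

    Each candidate n-gram is credited at most as often as it occurs
    in the reference ("clipping").
    """
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


if __name__ == "__main__":
    reference = "clients do not sell pharmaceuticals in Europe"
    paraphrase = "customers do not sell medicines in Europe"  # invented example
    for n in (1, 2):
        print(n, clipped_precision(paraphrase, reference, n))
```

The paraphrase is a reasonable translation, yet it loses credit for every n-gram that touches "customers" or "medicines", so a system tuned to reproduce reference n-grams is rewarded over one that merely conveys the meaning.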

12
Thanks to Kevin Knight
13
Centauri/Arcturan Exercise
Slides already on CA446 webpage
14
Centauri/Arcturan (Knight, 1997)
  • Your assignment: put these words in order
  • jjat, arrat, mat, bat, oloat, at-yurp
  • There are 6! different orders possible, so 720
    different translations.
  • Best order (according to placement in the TL side
    of the corpus) is as given above
  • Not just unigrams, but n-grams also
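
The ordering task can be sketched as a brute-force search: enumerate all 6! orders and keep the one whose adjacent word pairs are most frequent in the target-language side of the corpus. The three "corpus" lines below are invented for illustration and are not Knight's actual data; the function names are likewise hypothetical.

```python
from collections import Counter
from itertools import permutations


def bigram_counts(sentences):
    """Count adjacent word pairs in a list of whitespace-tokenised sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def best_order(words, counts):
    """Return the permutation whose adjacent pairs have the highest total count."""
    def score(order):
        return sum(counts[pair] for pair in zip(order, order[1:]))
    return max(permutations(words), key=score)


# Invented target-side corpus (NOT the real Centauri/Arcturan data).
toy_corpus = [
    "jjat arrat mat bat",
    "mat bat oloat at-yurp",
    "arrat mat bat oloat",
]
words = ["at-yurp", "bat", "oloat", "jjat", "mat", "arrat"]

if __name__ == "__main__":
    print(len(list(permutations(words))), "candidate orders")
    print(" ".join(best_order(words, bigram_counts(toy_corpus))))
```

Exhaustive search is only feasible for very short sentences, since the number of orders grows factorially with sentence length.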

15
It's Really Spanish-English!
Clients do not sell pharmaceuticals in Europe =>
Clientes no venden medicinas en Europa
 
17
Some more to try
  • iat lat pippat eneat hilat oloat at-yurp.
  • totat nnat forat arrat mat bat.
  • wat dat quat cat uskrat at-drubel.
  • if you have trouble sleeping at night!

18
What have we just seen?
  • what parallel corpora look like
  • how relevant parallel corpora are for MT
  • how to build bilingual dictionaries from parallel
    corpora
  • how cognate information may be useful in MT
  • how to do word alignment.
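
One standard way to turn the by-hand exercise into an algorithm is IBM Model 1, which estimates word translation probabilities from sentence pairs with expectation-maximisation. The slides do not say which method the course uses, so this is an assumption; the sketch omits the NULL word, and the three sentence pairs are toy fragments built from the Spanish-English example.

```python
from collections import defaultdict


def ibm_model1(pairs, iterations=10):
    """Estimate t(target_word | source_word) by EM (IBM Model 1, no NULL word)."""
    t = defaultdict(lambda: 1.0)  # uniform start; normalised during the E-step
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for src, tgt in pairs:
            for w in tgt:
                norm = sum(t[(w, s)] for s in src)
                for s in src:
                    frac = t[(w, s)] / norm  # expected alignment of w to s
                    count[(w, s)] += frac
                    total[s] += frac
        # M-step: renormalise expected counts per source word.
        t = defaultdict(float, {(w, s): c / total[s] for (w, s), c in count.items()})
    return t


def extract_dictionary(t):
    """Keep the most probable target word for each source word."""
    best = {}
    for (w, s), p in t.items():
        if s not in best or p > best[s][1]:
            best[s] = (w, p)
    return {s: w for s, (w, _) in best.items()}


# Toy English-Spanish fragments (invented for illustration).
toy_pairs = [
    (["clients"], ["clientes"]),
    (["clients", "sell"], ["clientes", "venden"]),
    (["clients", "sell", "pharmaceuticals"], ["clientes", "venden", "medicinas"]),
]

if __name__ == "__main__":
    print(extract_dictionary(ibm_model1(toy_pairs)))
```

The one-word pair pins down "clients", which in turn frees "venden" to be explained by "sell", and so on; this is the same elimination reasoning used when solving the exercise by hand.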

19
What else do we need to know?
  • about word alignment on a larger scale
  • about phrasal alignment, the norm in real
    translation data
  • about unknown words
  • the importance of knowing the target language
    (vs. source) in making fluent translations
  • about locality in word order shifts
  • how to guess the meanings/translations of unknown
    words
  • about how much uncertainty the machine faces in
    working with limited data
  • about working on different domains

21
Do such methods scale to real MT?
  • Availability of monolingual and bilingual
    corpora?
  • Possibility of sentence-aligning bilingual
    corpora?
  • Can we write an algorithm to extract the
    translation dictionary?
  • Can we write an algorithm to extract the
    monolingual word pair counts?
  • Can we write an algorithm to generate
    translations using our translation dictionary and
    word pair counts?
  • WILL THE TRANSLATIONS PRODUCED BE ANY GOOD?
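
The last algorithmic question can be made concrete with a brute-force sketch: try every combination of dictionary entries and every word order, and keep the output whose adjacent word pairs are most frequent in monolingual target text. The dictionary entries and pair counts below are invented, and `translate` is a hypothetical helper, not a real decoder.

```python
from itertools import permutations, product


def translate(source, dictionary, pair_counts):
    """Pick the lexical choices and word order with the highest word-pair score."""
    best, best_score = None, -1
    # One candidate translation per source word, in every combination...
    for choice in product(*(dictionary[word] for word in source)):
        # ...and in every possible target word order.
        for order in permutations(choice):
            score = sum(pair_counts.get(pair, 0) for pair in zip(order, order[1:]))
            if score > best_score:
                best, best_score = order, score
    return list(best), best_score


# Invented translation dictionary and target-language word pair counts.
toy_dictionary = {
    "clientes": ["clients", "customers"],
    "venden": ["sell", "sells"],
    "medicinas": ["pharmaceuticals", "medicines"],
}
toy_pair_counts = {
    ("clients", "sell"): 3,
    ("customers", "sell"): 1,
    ("sell", "pharmaceuticals"): 2,
    ("sell", "medicines"): 1,
}

if __name__ == "__main__":
    print(translate(["clientes", "venden", "medicinas"], toy_dictionary, toy_pair_counts))
```

Even this toy shows the division of labour: the dictionary proposes words, and the target-language counts decide among them and fix their order. Whether the output is any good depends entirely on the data behind both tables.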

22
Parallel Corpora
  • Hugely important but not available in a wide
    range of language pairs
  • Chinese-English: Hong Kong data
  • French-English: Canadian Hansards
  • Older EU pairs: Europarl (Koehn 04)
  • Newer EU pairs: JRC-Acquis Communautaire, and the
    very recently distributed updated Europarl
  • Arabic-English: LDC data
  • NIST, IWSLT, TC-STAR Evaluations

23
Caveat interpres!
  • Beware of sparse data!
  • Beware of unrepresentative corpora!
  • Beware of poor quality language!
  • If the corpora are small, or of poor quality, or
    are unrepresentative, then our statistical
    language models will be poor, so any results we
    achieve will be poor.
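
The sparse-data warning can be illustrated with a tiny bigram language model: under plain relative-frequency estimates, a single unseen word pair drives the probability of a whole sentence to zero. The two training sentences are toy data, and add-one smoothing is shown only as the simplest possible remedy, not as a method the slides recommend.

```python
from collections import Counter


def train(sentences):
    """Collect context counts, bigram counts and the vocabulary."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])          # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    vocab = {tok for s in sentences for tok in s.split()} | {"</s>"}
    return contexts, bigrams, vocab


def sentence_prob(sentence, model, smoothing=0.0):
    """Bigram probability of a sentence; smoothing=1.0 gives add-one estimates."""
    contexts, bigrams, vocab = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        num = bigrams[(prev, word)] + smoothing
        den = contexts[prev] + smoothing * len(vocab)
        p *= num / den if den else 0.0
    return p


# Toy training data (far too small, which is the point).
toy_model = train([
    "clients sell pharmaceuticals",
    "clients do not sell pharmaceuticals in Europe",
])

if __name__ == "__main__":
    for s in ("clients sell pharmaceuticals in Europe",
              "clients in Europe sell pharmaceuticals"):
        print(s, sentence_prob(s, toy_model), sentence_prob(s, toy_model, 1.0))
```

The second sentence is perfectly good English, but the pairs "clients in" and "Europe sell" never occur in the training data, so the unsmoothed model rejects it outright.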