Title: An Introduction to Machine Translation
1. An Introduction to Machine Translation
2. The Rise and Fall of Different MT Paradigms
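3. Three main approaches to RBMT
[Figure: The Vauquois Pyramid. Source text is mapped to target text by direct translation at the base, by TRANSFER between ANALYSIS and GENERATION at the intermediate level, or via a language-neutral interlingua at the apex.]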
4. System Design Concerns
- Multilingual vs. Bilingual
  - Multilingual
    - Extreme: Eurotra, i.e. 72 language pairs
    - Modest: EN → DE, FR, ES, i.e. 3 language pairs
    - Intermediate: EN, FR, DE, ES, JP, but not all combinations
  - Bilingual
- Unidirectional vs. Bidirectional
  - EN → FR or FR → EN
- Reversible vs. Non-reversible
  - EN ↔ FR: same EN, FR components for Analysis and Generation, and a reversible transfer module
  - EN → FR and FR → EN, but different EN, FR components for Analysis and Generation, and different transfer modules (NB: lack of modularity)
- Direct vs. Transfer vs. Interlingua
- Batch vs. Interactive
5. Advantages/Disadvantages of Direct Systems
- Advantages
  - Engine's competence lies in its comparative grammar.
  - Highly robust. Does not break down or stop when it encounters unknown words, unknown grammatical constructs, or ill-formed Input.
  - Designed for unidirectional translation between one pair of languages. Not conducive to genuine multilingual MT design.
- Disadvantages
  - 'Word-for-word' translation plus local reordering gives poor translation, using a cheap bilingual dictionary and rudimentary knowledge of the target language.
  - Linguistically and computationally naive. No analysis of the internal structure of Input, especially w.r.t. the grammatical relationships between the main parts of sentences.
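The trade-off above can be illustrated with a minimal sketch of a direct system: a dictionary lookup with no analysis of the input. The dictionary `TOY_DICTIONARY` and the function `direct_translate` are hypothetical names invented for this illustration, and the entries are toy data, not taken from any real system. The sketch omits the local reordering step.

```python
# Hypothetical toy bilingual dictionary (Spanish -> English), invented for illustration.
TOY_DICTIONARY = {
    "clientes": "clients",
    "no": "not",
    "venden": "sell",
    "medicinas": "pharmaceuticals",
    "en": "in",
}


def direct_translate(sentence, dictionary=TOY_DICTIONARY):
    """Translate word for word; unknown words are copied through unchanged."""
    output = []
    for word in sentence.split():
        # Robustness: an unknown word never stops the engine.
        output.append(dictionary.get(word.lower(), word))
    return " ".join(output)


print(direct_translate("Clientes no venden medicinas en Europa"))
```

The unknown word "Europa" passes through untouched, which shows the robustness, while the missing "do" in "clients not sell ..." shows why word-for-word output tends to be poor.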
6. Advantages/Disadvantages of Interlingual Systems
- Advantages
  - Intermediate representation (IR) fully specified, i.e. no need to 'look back' at Source in order to generate Target.
  - Easy to extend to other languages.
  - Built-in back translation useful for testing.
- Disadvantages
  - How to define an Interlingua for closely related languages?
  - Truly universal Interlingua possible?
7. Advantages/Disadvantages of Transfer Systems
- Advantages
  - No language-independent representations: the source IR is specific to a particular language, as is the target-language IR.
  - So the complexity of the Analysis and Generation components is much reduced.
  - Also, no necessary equivalence between source and target IRs for the same language!
- Disadvantages
  - Not so easy to extend to other languages: n analysis modules, n generation modules, n × (n-1) transfer modules, i.e. not much less than n².
  - No guaranteed built-in back translation.
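The module arithmetic above can be checked with a short sketch. The function names are hypothetical. The interlingual count assumes one analysis and one generation module per language with no transfer modules, which is the usual reading of "easy to extend" but is not spelled out on the slides. The value n = 9 is inferred from the 72 language pairs quoted for Eurotra (9 × 8 = 72).

```python
def transfer_module_counts(n):
    """Modules needed by a transfer system covering all directed pairs of n languages."""
    return {"analysis": n, "generation": n, "transfer": n * (n - 1)}


def interlingua_module_counts(n):
    """Assumed counts for an interlingual system: no transfer modules at all."""
    return {"analysis": n, "generation": n, "transfer": 0}


for n in (2, 4, 9):
    t = transfer_module_counts(n)
    i = interlingua_module_counts(n)
    print(n, "languages:", sum(t.values()), "modules (transfer) vs",
          sum(i.values()), "modules (interlingua)")
```

Adding a tenth language to the transfer design would require 18 new transfer modules (90 - 72), against two new modules under the interlingual assumption.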
8. Direct, or Indirect?
- Direct
  - From manufacturer's viewpoint, better, as it's more robust
- Indirect
  - Falls over more easily.
  - Development phase can be trying.
  - Commercially, must be supplemented with techniques for dealing with unseen Input.
- What about Translation Quality?
  - Indirect systems clearly better in principle.
  - However, constructing an MT engine requires considerable effort.
  - Direct Systems can achieve good performance.
- Summary
  - Research mostly Transfer-based, with rules automatically acquired from data
  - Industrially we can expect highly-developed Direct Systems to survive for some years to come
9. Other Material
- Arnold, D. et al. (1994) Machine Translation: An Introductory Guide. NCC Blackwell, Oxford.
- Hutchins, J. and H. Somers (1992) An Introduction to MT. Academic Press, London.
- Trujillo, A. (1999) Translation Engines. Springer, London.
- Newer books include:
  - Bowker, L. (2002) Computer-Aided Translation Technology. U. of Ottawa Press.
  - Somers, H. (2003) Computers and Translation: A translator's guide. John Benjamins.
  - Bond, F. (2005) Translating the Untranslatable. CSLI.
  - Quah, C. (2006) Translation and Technology. Palgrave MacMillan.
10. Why Corpus-Based MT?
- the (relative) failure of rule-based approaches
- the increasing availability of machine-readable text
- the increase in capability of hardware (CPU, memory, disk space) with associated decrease in cost
11. Corpus-Based MT is here to stay
- These approaches are now mainstream
  - Most researchers are developing corpus-based systems
  - First company to use SMT now exists: http://www.languageweaver.com
  - CNGL partner Traslán uses EBMT/SMT hybrid
  - In recent large-scale evaluations, corpus-based MT systems come first.
- Two caveats
  - Most industrial systems are still rule-based (but cf. Google's systems, now all SMT)
  - Current mainstream evaluation metrics favour n-gram-based systems (i.e. bias towards SMT).
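To make the second caveat concrete, here is a minimal sketch of clipped n-gram precision, the kind of overlap count that BLEU-style metrics are built on. The slides do not name a specific metric, so this is an assumed stand-in and not a full metric (there is no brevity penalty and no combination over n). The function names and the candidate sentence are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams also found in the reference (counts clipped)."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


reference = "clients do not sell pharmaceuticals in europe"
candidate = "clients not sell pharmaceuticals in europe"  # hypothetical system output
print(clipped_precision(candidate, reference, 1))
print(clipped_precision(candidate, reference, 2))
```

A score of this kind rewards output that shares surface word sequences with the reference, which is plausibly why systems trained on n-gram statistics tend to do well under it.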
12. Thanks to Kevin Knight
13. Centauri/Arcturan Exercise
Slides already on CA446 webpage
14. Centauri/Arcturan [Knight, 1997]
- Your assignment: put these words in order
  - jjat, arrat, mat, bat, oloat, at-yurp
- There are 6! different orders possible, so 720 different translations.
- Best order (according to placement in the TL side of the corpus) is as given above
- Not just unigrams, but n-grams also
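The ordering task can be sketched as a brute-force search: enumerate all 720 orders and score each by how many of its adjacent word pairs were seen in target-language text. The three sentences in `TOY_CORPUS` are Arcturan-looking sentences from these slides, used here as a stand-in for the TL side of the corpus, which is an assumption. The scoring function is a simple illustration, not the procedure from Knight's exercise.

```python
from itertools import permutations

WORDS = ["jjat", "arrat", "mat", "bat", "oloat", "at-yurp"]

# Stand-in target-language corpus (assumption: treated as the TL side).
TOY_CORPUS = [
    "iat lat pippat eneat hilat oloat at-yurp",
    "totat nnat forat arrat mat bat",
    "wat dat quat cat uskrat at-drubel",
]


def bigrams(tokens):
    """Adjacent word pairs of a token sequence."""
    return list(zip(tokens, tokens[1:]))


SEEN_BIGRAMS = {bg for sentence in TOY_CORPUS for bg in bigrams(sentence.split())}


def score(order):
    """Number of adjacent pairs in this order that occur in the corpus."""
    return sum(1 for bg in bigrams(order) if bg in SEEN_BIGRAMS)


orders = list(permutations(WORDS))
best_score = max(score(order) for order in orders)
best_orders = [order for order in orders if score(order) == best_score]

print(len(orders), "candidate orders")
print(len(best_orders), "orders share the best score of", best_score)
```

With so little data, several orders tie for the best score, and the order given on the slide is among them. More target-language text would be needed to separate them.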
15. It's Really Spanish-English!
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
17. Some more to try
- iat lat pippat eneat hilat oloat at-yurp.
- totat nnat forat arrat mat bat.
- wat dat quat cat uskrat at-drubel.
- if you have trouble sleeping at nights!
18. What have we just seen?
- what parallel corpora look like
- how relevant parallel corpora are for MT
- how to build bilingual dictionaries from parallel corpora
- how cognate information may be useful in MT
- how to do word alignment.
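One simple way to get a bilingual dictionary out of sentence-aligned text is to pair each source word with the target word it co-occurs with most consistently. The sketch below uses the Dice coefficient for that, which is a heuristic chosen here for illustration and not necessarily the method intended by the exercise. `TOY_PARALLEL_CORPUS` is invented toy data and `extract_dictionary` is a hypothetical name.

```python
from collections import Counter
from itertools import product

# Invented toy sentence-aligned corpus (Spanish, English).
TOY_PARALLEL_CORPUS = [
    ("clientes venden medicinas", "clients sell pharmaceuticals"),
    ("clientes compran coches", "clients buy cars"),
    ("empresas venden coches", "companies sell cars"),
    ("empresas compran medicinas", "companies buy pharmaceuticals"),
]


def extract_dictionary(corpus):
    """Map each source word to the target word with the highest Dice score."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Count every source/target word pair seen in the same sentence pair.
        pair_count.update(product(src_words, tgt_words))

    dictionary = {}
    for s in src_count:
        dictionary[s] = max(
            tgt_count,
            key=lambda t, s=s: 2 * pair_count[(s, t)] / (src_count[s] + tgt_count[t]),
        )
    return dictionary


for source_word, target_word in sorted(extract_dictionary(TOY_PARALLEL_CORPUS).items()):
    print(source_word, "->", target_word)
```

The toy corpus is constructed so that every word has one clear partner. In real data, co-occurrence alone leaves many ties and errors, which is where proper word alignment comes in.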
19. What else do we need to know?
- about word alignment on a larger scale
- about phrasal alignment, the norm in real translation data
- about unknown words
- the importance of knowing the target language (vs. source) in making fluent translations
- about locality in word order shifts
- how to guess the meanings/translations of unknown words
- about how much uncertainty the machine faces in working with limited data
- about working on different domains
21. Do such methods scale to real MT?
- Availability of monolingual and bilingual corpora?
- Possibility of sentence-aligning bilingual corpora?
- Can we write an algorithm to extract the translation dictionary?
- Can we write an algorithm to extract the monolingual word pair counts?
- Can we write an algorithm to generate translations using our translation dictionary and word pair counts?
- WILL THE TRANSLATIONS PRODUCED BE ANY GOOD?
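The last two algorithmic questions can be sketched together: count adjacent word pairs in monolingual target text, then use those counts to choose among candidate translations. Everything here is a toy under stated assumptions: `TOY_MONOLINGUAL` and `TOY_CANDIDATES` are invented data, the function names are hypothetical, word order is kept monotone (no reordering), and the score is a plain sum of pair counts, not a probability model.

```python
from collections import Counter
from itertools import product

# Invented monolingual target-language sentences (toy data).
TOY_MONOLINGUAL = [
    "clients sell pharmaceuticals in europe",
    "companies sell cars in europe",
    "clients do not sell cars",
]

# Invented dictionary offering several candidate translations per source word.
TOY_CANDIDATES = {
    "clientes": ["clients", "customers"],
    "venden": ["sell", "sells"],
    "medicinas": ["medicines", "pharmaceuticals"],
}


def word_pair_counts(sentences):
    """Count adjacent word pairs (bigrams) in monolingual text."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def generate(source, candidates, pair_counts):
    """Pick the candidate combination whose adjacent pairs were seen most often."""
    options = [candidates.get(word, [word]) for word in source.split()]

    def score(sequence):
        return sum(pair_counts[pair] for pair in zip(sequence, sequence[1:]))

    return " ".join(max(product(*options), key=score))


counts = word_pair_counts(TOY_MONOLINGUAL)
print(generate("clientes venden medicinas", TOY_CANDIDATES, counts))
```

The exhaustive search over `product(*options)` grows exponentially with sentence length, so this only works for toy inputs. Whether the output is any good depends entirely on the dictionary and on how well the monolingual text covers the domain.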
22. Parallel Corpora
- Hugely important but not available in a wide range of language pairs
- Chinese-English: Hong Kong data
- French-English: Canadian Hansards
- Older EU pairs: Europarl [Koehn 04]
- Newer EU pairs: JRC-Acquis Communautaire; very recently distributed updated Europarl
- Arabic-English: LDC Data
- NIST, IWSLT, TC-STAR Evaluations
23. Caveat interpres!
- Beware of sparse data!
- Beware of unrepresentative corpora!
- Beware of poor quality language!
- If the corpora are small, or of poor quality, or
are unrepresentative, then our statistical
language models will be poor, so any results we
achieve will be poor.
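The sparse-data warning can be seen in a minimal sketch of an unsmoothed bigram language model. `TINY_CORPUS` is invented toy data and `bigram_model` is a hypothetical name. The estimate is plain relative frequency, chosen here because it makes the problem visible.

```python
from collections import Counter

# Invented, deliberately tiny target-language corpus.
TINY_CORPUS = [
    "clients sell pharmaceuticals",
    "clients sell cars",
]


def bigram_model(sentences):
    """Return a function giving the relative-frequency estimate P(word | prev)."""
    history_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        tokens = sentence.split()
        history_counts.update(tokens[:-1])  # words that have a successor
        pair_counts.update(zip(tokens, tokens[1:]))

    def prob(prev, word):
        if history_counts[prev] == 0:
            return 0.0  # the history itself was never seen
        return pair_counts[(prev, word)] / history_counts[prev]

    return prob


prob = bigram_model(TINY_CORPUS)
print(prob("clients", "sell"))    # seen in every sentence
print(prob("sell", "cars"))       # seen once out of two continuations
print(prob("sell", "medicines"))  # perfectly plausible, but never seen
```

A perfectly reasonable continuation such as "sell medicines" receives probability zero simply because the corpus never contained it, so any translation using it would be ruled out.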