Introduction to Machine Translation

1
Introduction to Machine Translation
  • Mitch Marcus
  • CIS 530
  • Some slides adapted from slides by
  • John Hutchins, Bonnie Dorr, Martha Palmer

2
Why use computers in translation?
  • Too much translation for humans
  • Technical materials too boring for humans
  • Greater consistency required
  • Need results more quickly
  • Not everything needs to be top quality
  • Reduce costs
  • Any one of these may justify machine translation
    or computer aids

3
The Early History of NLP (Hutchins): MT in the
1950s and 1960s
  • Sponsored by government bodies in USA and USSR
    (also CIA and KGB)
  • assumed goal was fully automatic quality output
    (i.e. of publishable quality): dissemination
  • actual need was translation for information
    gathering: assimilation
  • Survey by Bar-Hillel of MT research:
  • criticised assumption of FAHQT (fully automatic
    high-quality translation) as goal
  • demonstrated non-feasibility of FAHQT (without
    unrealisable encyclopedic knowledge bases)
  • advocated man-machine symbiosis, i.e. HAMT
    (human-aided MT) and MAHT (machine-aided human
    translation)
  • ALPAC 1966, set up by disillusioned funding
    agencies:
  • compared latest systems with early unedited MT
    output (IBM-GU demo, 1954), criticised for still
    needing post-editing
  • advocated machine aids, and no further support of
    MT research
  • but failed to identify the actual needs of
    funders: assimilation
  • therefore failed to see that output of IBM-USAF
    Translator and Georgetown systems was used and
    appreciated

4
Consequences of ALPAC
  • MT research virtually ended in US
  • identification of actual needs:
  • assimilation vs. dissemination
  • recognition that perfectionism (FAHQT) had
    neglected:
  • operational factors and requirements
  • expertise of translators
  • machine aids for translators
  • henceforth three strands of MT:
  • translation tools (HAMT, MAHT)
  • operational systems (post-editing, controlled
    languages, domain-specific systems)
  • research (new approaches, new methods)
  • computational linguistics born in the aftermath

5
Machine Translation (Pass 0 From Intro Lectures)
6
Why use computers in translation?
  • Too much translation for humans
  • Technical materials too boring for humans
  • Greater consistency required
  • Need results more quickly
  • Not everything needs to be top quality
  • Reduce costs
  • Any one of these may justify machine translation
    or computer aids
  • (next several slides adapted from Language Weaver)

7
Statistical Machine Translation Technology
(Diagram: Spanish/English Bilingual Text and English Text each feed a
Statistical Analysis step; example input: Que hambre tengo yo)
8
How A Statistical MT System Learns
9
Translating a New Document
10
Language Weaver
Source: Aljazeera, January 8, 2005
11
Translingual Chat: Instant Messaging
(Screenshot panels: Original, Translation)
12
Language Weaver (Al Jazeera 8/2007)
LanguageWeaver Demo Website
13
Language Weaver Hybrid Translation Technology
  • Chinese Source Text, Sample 1: (Chinese characters
    lost in transcript extraction)
  • Language Weaver Experimental Syntax MT, Sample 1:
  • The motor show, has always been the 'barometer' of a
    national car consumption and market potential.
    The Beijing International Auto Show has more than
    1,200 exhibitors from 24 countries and 8 days of
    receiving more than 40 million visitors, setting
    a new record in China's auto show, are deeply
    aware of the automobile market signals.
  • "China is one of the largest automobile market in the
    world. Over the years, this phrase implies more
    auto businessmen. But now, more and more facts
    indicates that it is to become a reality.
  • Data from the Motor Show is very convincing. The
    Beijing Qingnian Bao Report on-the-spot
    investigation showed that about 35 percent of
    35-year-old visitors, 62.1 percent of the
    respondents said that the truck was mainly to buy
    a car in the near future to collect information,
    even at the exhibition may purchase or suitable
    products 76% of respondents indicated in the
    past two years to buy private cars.
  • Since the beginning of this year, the strong growth of the
    domestic car market. According to the figures
    released by the National Bureau of Statistics, in
    the first four months, the country produced
    267,900 vehicles, up 27.6 percent in particular,
    in April, the production of 90,000 vehicles, an
    increase of 50.5% over the same period last year,
    setting a record high for the monthly output
    growth over the past 10-odd years. In terms of
    sales in the first quarter, manufacturing
    enterprises in the country sold 188,000 cars, up
    22 percent over the same period of last year, up
    10.5 percent 11,000 vehicles, dropping by nearly
    25 percent lower than the beginning of the year.

14
Broadcast Monitoring BBN MAPS Language Weaver
MT
16
Three MT Approaches: Direct, Transfer,
Interlingual (Vauquois triangle)
17
Examples of Three Approaches
  • Direct
  • I checked his answers against those of the
    teacher → Yo comparé sus respuestas a las de la
    profesora
  • Rule: check X against Y → comparar X a Y
  • Transfer
  • Ich habe ihn gesehen → I have seen him
  • Rule: clause agt aux obj pred → clause agt aux
    pred obj
  • Interlingual
  • I like Mary → Mary me gusta a mí
  • Rep: BeIdent (I ATIdent (I, Mary) Likeingly)
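
The transfer example can be sketched in a few lines of Python. This is only a toy illustration of the rule "clause agt aux obj pred → clause agt aux pred obj": the role labels attached to each German word and the four-entry lexicon are assumptions made for the sketch, not part of any real transfer system.

```python
# Toy transfer-based translation for the German example on the slide.
# The role tagging and the lexicon are illustrative assumptions.

# Source clause as (role, word) pairs in German order: agt aux obj pred
german_clause = [("agt", "Ich"), ("aux", "habe"), ("obj", "ihn"), ("pred", "gesehen")]

# Structural transfer rule from the slide:
#   clause agt aux obj pred -> clause agt aux pred obj
TRANSFER_RULES = {("agt", "aux", "obj", "pred"): ("agt", "aux", "pred", "obj")}

# Hypothetical bilingual lexicon (lowercased German -> English)
LEXICON = {"ich": "I", "habe": "have", "ihn": "him", "gesehen": "seen"}


def transfer(clause):
    """Reorder constituents with a transfer rule, then substitute words."""
    roles = tuple(role for role, _ in clause)
    target_order = TRANSFER_RULES.get(roles, roles)  # no matching rule: keep order
    word_for_role = dict(clause)
    return " ".join(LEXICON[word_for_role[role].lower()] for role in target_order)


print(transfer(german_clause))  # I have seen him
```

A direct system would instead substitute words first and patch the order afterwards, which is where the "major restructuring after lexical substitution" cost comes from.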

18
Direct MT: Pros and Cons
  • Pros
  • Fast
  • Simple
  • Inexpensive
  • Cons
  • Unreliable
  • Not powerful
  • Rule proliferation
  • Requires too much context
  • Major restructuring after lexical substitution

19
Transfer MT: Pros and Cons
  • Pros
  • Don't need to find language-neutral rep
  • No translation rules hidden in lexicon
  • Relatively fast
  • Cons
  • N² sets of transfer rules: difficult to extend
  • Proliferation of language-specific rules in
    lexicon and syntax
  • Cross-language generalizations lost

20
Interlingual MT: Pros and Cons
  • Pros
  • Portable (avoids N² problem)
  • Lexical rules and structural transformations
    stated more simply on normalized representation
  • Explanatory Adequacy
  • Cons
  • Difficult to deal with terms on primitive level:
    universals?
  • Must decompose and reassemble concepts
  • Useful information lost (paraphrase)
  • (Is thought really language neutral??)

21
MT Challenges: Ambiguity
  • Syntactic Ambiguity: I saw the man on the hill
    with the telescope
  • Lexical Ambiguity
  • E: book
  • S: libro, reservar
  • Semantic Ambiguity
  • Homography: ball (E) = pelota, baile (S)
  • Polysemy: kill (E) = matar, acabar (S)
  • Semantic granularity: esperar (S) = wait, expect,
    hope (E); be (E) = ser, estar (S); fish (E) = pez,
    pescado (S)

22
MT Challenges: Divergences
  • Meaning of two translationally equivalent phrases
    is distributed differently in the two languages
  • Example
  • English: RUN INTO ROOM
  • Spanish: ENTER IN ROOM RUNNING

23
Spanish/Arabic Divergences
24
Divergence Frequency
  • 32% of sentences in UN Spanish/English Corpus
    (5K)
  • 35% of sentences in TREC El Norte Corpus (19K)
  • Divergence Types
  • Categorial (X tener hambre → X have hunger):
    98%
  • Conflational (X dar puñaladas a Z → X stab Z):
    83%
  • Structural (X entrar en Y → X enter Y): 35%
  • Head Swapping (X cruzar Y nadando → X swim
    across Y): 8%
  • Thematic (X gustar a Y → Y like X): 6%

25
MT Lexical Choice: WSD (word sense disambiguation)
  • Iraq lost the battle.
  • Ilakuka centwey ciessta.
  • Iraq battle lost.
  • John lost his computer.
  • John-i computer-lul ilepelyessta.
  • John computer misplaced.

26
WSD with Source Language Semantic Class
Constraints
lose1 (Agent, Patient: competition) → ciessta
lose2 (Agent, Patient: physobj) → ilepelyessta
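
A minimal sketch of how such constraints could drive lexical choice, assuming a small hand-written table of semantic classes. The `SEMANTIC_CLASS` entries and the function name `translate_lose` are hypothetical; only the two senses and their Korean verbs come from the slide.

```python
# Choose the Korean verb for English "lose" from the semantic class of its Patient.
# SEMANTIC_CLASS is a hypothetical stand-in for a real ontology or lexicon.
SEMANTIC_CLASS = {
    "battle": "competition",
    "game": "competition",
    "computer": "physobj",
    "keys": "physobj",
}

# Sense inventory from the slide: (sense, required Patient class, Korean verb)
LOSE_SENSES = [
    ("lose1", "competition", "ciessta"),
    ("lose2", "physobj", "ilepelyessta"),
]


def translate_lose(patient):
    """Return (sense, Korean verb) whose Patient constraint matches, else None."""
    patient_class = SEMANTIC_CLASS.get(patient)
    for sense, required_class, korean_verb in LOSE_SENSES:
        if patient_class == required_class:
            return sense, korean_verb
    return None  # unknown Patient class: the constraint cannot decide


print(translate_lose("battle"))    # ('lose1', 'ciessta')
print(translate_lose("computer"))  # ('lose2', 'ilepelyessta')
```

The sketch returns `None` for a Patient it has no class for, which is exactly the coverage problem a hand-built class inventory runs into.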
27
Lexical Gaps: English to Chinese
  • break
  • smash
  • shatter
  • snap
  • ?
  • da po - irregular pieces
  • da sui - small pieces
  • pie duan - line segments

28
A Gentle Introduction to Statistical MT: 1949 to
1988
29
Warren Weaver 1949 Memorandum I
  • Proposes Local Word Sense Disambiguation!
  • If one examines the words in a book, one at a
    time through an opaque mask with a hole in it one
    word wide, then it is obviously impossible to
    determine, one at a time, the meaning of words.
    "Fast" may mean "rapid" or it may mean
    "motionless" and there is no way of telling
    which.
  • But, if one lengthens the slit in the opaque
    mask, until one can see not only the central word
    in question but also say N words on either side,
    then, if N is large enough one can unambiguously
    decide the meaning. . .
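
Weaver's widening slit can be mimicked with a toy context-window classifier. The cue-word lists below are invented for illustration and the scoring is a simple count; this is a sketch of the idea, not a method taken from the memorandum.

```python
# Disambiguate "fast" by looking at N words on either side of it.
# The cue words are illustrative assumptions, not a real sense inventory.
CUES = {
    "rapid": {"run", "runner", "car", "speed"},
    "motionless": {"stuck", "held", "tied", "glued"},
}


def disambiguate(tokens, index, n):
    """Return the sense supported by the window of n words per side, or None."""
    window = tokens[max(0, index - n):index] + tokens[index + 1:index + 1 + n]
    scores = {sense: sum(word in cues for word in window) for sense, cues in CUES.items()}
    best = max(scores, key=scores.get)
    # No cue in view, or a tie between senses: the meaning stays undecided.
    if scores[best] == 0 or list(scores.values()).count(scores[best]) > 1:
        return None
    return best


sentence = "the runner was very fast today".split()
position = sentence.index("fast")
print(disambiguate(sentence, position, 0))  # None: the slit is one word wide
print(disambiguate(sentence, position, 1))  # None: "very" and "today" carry no cue
print(disambiguate(sentence, position, 3))  # rapid: "runner" is now visible
```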

30
Warren Weaver 1949 Memorandum II
  • Proposes Interlingua for Machine Translation!
  • Thus it may be true that the way to translate
    from Chinese to Arabic, or from Russian to
    Portuguese, is not to attempt the direct route,
    shouting from tower to tower. Perhaps the way is
    to descend, from each language, down to the
    common base of human communication - the real but
    as yet undiscovered universal language - and then
    re-emerge by whatever particular route is
    convenient.

31
Warren Weaver 1949 Memorandum III
  • Proposes Machine Translation using Information
    Theory!
  • It is very tempting to say that a book written
    in Chinese is simply a book written in English
    which was coded into the "Chinese code." If we
    have useful methods for solving almost any
    cryptographic problem, may it not be that with
    proper interpretation we already have useful
    methods for translation?
  • Weaver, W. (1949): Translation. Repr. in
    Locke, W.N. and Booth, A.D. (eds.) Machine
    translation of languages: fourteen essays
    (Cambridge, Mass.: Technology Press of the
    Massachusetts Institute of Technology, 1955), pp.
    15-23.

32
IBM Adopts Statistical MT Approach I (early
1990s)
  • In 1949, Warren Weaver proposed that statistical
    techniques from the emerging field of information
    theory might make it possible to use modern
    digital computers to translate text from one
    natural language to another automatically.
    Although Weaver's scheme foundered on the rocky
    reality of the limited computer resources of the
    day, a group of IBM researchers in the late
    1980's felt that the increase in computer power
    over the previous forty years made reasonable a
    new look at the applicability of statistical
    techniques to translation. Thus the "Candide"
    project, aimed at developing an experimental
    machine translation system, was born at IBM TJ
    Watson Research Center.

33
IBM Adopts Statistical MT Approach II
  • The Candide group adopted an information-theoretic
    perspective on the MT problem, which goes as
    follows. In speaking a French sentence F, a
    French speaker originally thought up a sentence E
    in English, but somewhere in the noisy channel
    between his brain and mouth, the sentence E got
    "corrupted" to its French translation F. The task
    of an MT system is to discover
    $E = \arg\max_{E'} p(F \mid E')\, p(E')$, that is,
    the MAP-optimal English sentence, given the
    observed French sentence.
    This approach involves constructing a model of
    likely English sentences, and a model of how
    English sentences translate to French sentences.
    Both these tasks are accomplished automatically
    with the help of a large amount of bilingual
    text.
  • As wacky as this perspective might sound, it's no
    stranger than the view that an English sentence
    gets corrupted into an acoustic signal in passing
    from the person's brain to his mouth, and this
    perspective is now essentially universal in
    automatic speech recognition.
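
The MAP-optimal sentence in the Candide description follows from Bayes' rule. A short derivation using the same symbols (E, E', F), where the only step added here is dropping p(F) because it does not depend on the candidate E':

```latex
\begin{align*}
E &= \arg\max_{E'} \; p(E' \mid F) \\
  &= \arg\max_{E'} \; \frac{p(F \mid E')\, p(E')}{p(F)} \\
  &= \arg\max_{E'} \; p(F \mid E')\, p(E')
\end{align*}
```

Here p(E') is the model of likely English sentences and p(F | E') is the model of how English sentences translate to French.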

34
The Channel Model for Machine Translation
this and following 3 out of 4 slides from
original 1990 IBM MT paper
35
Noisy Channel - Why useful?
  • Word reordering in translation handled by P(S)
  • P(S) factor frees P(T | S) from worrying about
    word order in the Source language
  • Word choice in translation handled by P(T | S)
  • P(T | S) factor frees P(S) from worrying about
    picking the right translation
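
A toy decoder makes this division of labour concrete. It is a sketch under stated assumptions: it uses the E/F notation of the Candide slide, with p(E) playing the role of P(S) and p(F | E) the role of P(T | S); all probabilities are invented, the language model is a lookup table over three candidate sentences, and the translation model is an order-free product in the style of IBM Model 1.

```python
from math import prod

# Hypothetical lexical translation probabilities t(f | e).
T = {
    ("le", "the"): 0.9,
    ("chien", "dog"): 0.8,
    ("noir", "black"): 0.8,
    ("chien", "cat"): 0.01,
}
FLOOR = 1e-4  # probability assumed for any unlisted word pair

# Hypothetical language-model probabilities p(E) for whole sentences.
LM = {"the black dog": 0.02, "the dog black": 0.0001, "the black cat": 0.02}


def translation_model(french, english):
    """Order-free p(F | E): each French word is explained by any English word."""
    f_words, e_words = french.split(), english.split()
    return prod(
        sum(T.get((f, e), FLOOR) for e in e_words) / len(e_words) for f in f_words
    )


def decode(french, candidates):
    """Pick the candidate E maximizing p(F | E) * p(E)."""
    return max(candidates, key=lambda e: translation_model(french, e) * LM[e])


print(decode("le chien noir", list(LM)))  # the black dog
```

The translation model cannot tell "the black dog" from "the dog black" because it ignores order, so the language model settles word order; the language model likes "the black cat" just as much as "the black dog", so the translation model settles word choice.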

36
An Alignment
(Figure labels: distortion, fertility)
37
Fertilities and Lexical Probabilities for not
38
Fertilities and Lexical Probabilities for hear
39
Schematic of Translation Model
(Figure stages: fertility, null cepts, translation, distortion)
(From "What's New in Statistical Machine Translation", Kevin Knight and
Philipp Koehn, tutorial at HLT/NAACL 2003)
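
The four stages named in the schematic can be walked through deterministically. The sketch below is an illustration only: the sentence pair and every table value are fixed by hand as assumptions, whereas a real model would assign probabilities to each choice and learn them from bilingual text.

```python
# Deterministic walk through fertility, NULL insertion, translation, distortion.
# All tables are hand-set illustrative assumptions, not learned parameters.
ENGLISH = "Mary did not slap the green witch".split()

# Fertility: how many target words each English word produces.
FERTILITY = {"Mary": 1, "did": 0, "not": 1, "slap": 3, "the": 1, "green": 1, "witch": 1}

# Index in the fertility-expanded string where one NULL-generated word is inserted.
NULL_INSERT_AT = 5

# Translation choices, consumed in order for repeated source words.
TRANSLATION = {
    "Mary": ["Maria"], "not": ["no"], "slap": ["daba", "una", "bofetada"],
    "NULL": ["a"], "the": ["la"], "green": ["verde"], "witch": ["bruja"],
}

# Distortion: target position of the i-th translated word.
DISTORTION = [0, 1, 2, 3, 4, 5, 6, 8, 7]


def apply_fertility(words):
    return [w for w in words for _ in range(FERTILITY[w])]


def insert_null(words):
    return words[:NULL_INSERT_AT] + ["NULL"] + words[NULL_INSERT_AT:]


def translate(words):
    choices = {w: iter(options) for w, options in TRANSLATION.items()}
    return [next(choices[w]) for w in words]


def distort(words):
    output = [None] * len(words)
    for i, word in enumerate(words):
        output[DISTORTION[i]] = word
    return output


step1 = apply_fertility(ENGLISH)  # Mary not slap slap slap the green witch
step2 = insert_null(step1)        # Mary not slap slap slap NULL the green witch
step3 = translate(step2)          # Maria no daba una bofetada a la verde bruja
step4 = distort(step3)            # Maria no daba una bofetada a la bruja verde
print(" ".join(step4))
```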
40
How do we evaluate MT?
  • Human-based Metrics
  • Semantic Invariance
  • Pragmatic Invariance
  • Lexical Invariance
  • Structural Invariance
  • Spatial Invariance
  • Fluency
  • Accuracy: Number of Human Edits required
  • HTER: Human Translation Error Rate
  • Do you get it?
  • Automatic Metrics: Bleu

41
BiLingual Evaluation Understudy (BLEU: Papineni,
2001)
  • Automatic Technique, but ...
  • Requires the pre-existence of Human (Reference)
    Translations
  • Compare n-gram matches between candidate
    translation and 1 or more reference translations

42
Bleu Metric
Chinese-English Translation Example
Candidate 1: It is a guide to action which ensures that the
military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the
activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the
military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the
military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to
heed the directions of the party.
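
The n-gram comparison behind Bleu can be tried on this example. The sketch below computes only the clipped ("modified") n-gram precision, one ingredient of the metric; it leaves out the brevity penalty and the averaging over n-gram orders. The tokenizer (lowercase, letters only) is a simplifying assumption.

```python
import re
from collections import Counter
from fractions import Fraction

CANDIDATE_1 = ("It is a guide to action which ensures that the military "
               "always obeys the commands of the party.")
CANDIDATE_2 = ("It is to insure the troops forever hearing the activity "
               "guidebook that party direct.")
REFERENCES = [
    "It is a guide to action that ensures that the military will forever "
    "heed Party commands.",
    "It is the guiding principle which guarantees the military forces "
    "always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions "
    "of the party.",
]


def tokenize(text):
    """Lowercase and keep alphabetic tokens only (a simplifying assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each n-gram counts at most as often as it
    appears in the single reference where it is most frequent."""
    candidate_counts = ngrams(tokenize(candidate), n)
    max_ref_counts = Counter()
    for reference in references:
        for gram, count in ngrams(tokenize(reference), n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in candidate_counts.items())
    return Fraction(clipped, sum(candidate_counts.values()))


for n in (1, 2):
    print(n, modified_precision(CANDIDATE_1, REFERENCES, n),
          modified_precision(CANDIDATE_2, REFERENCES, n))
```

With this tokenization, Candidate 1 scores 17/18 on unigrams and 10/17 on bigrams, while Candidate 2 scores 8/14 (printed as 4/7) and 1/13, so the better translation is clearly separated from the worse one.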