Machine Translation - 3 - PowerPoint PPT Presentation
Transcript and Presenter's Notes

Title: Machine Translation - 3


1
Machine Translation - 3
  • Autumn 2008

Lecture 18, 8 Sep 2008
2
Translation Steps

3
IBM Models 1-5
  • Model 1: bag of words
  • Unique local maxima
  • Efficient EM algorithm (Models 1-2)
  • Model 2: general alignment
  • Model 3: fertility n(k | e)
  • No full EM, count only neighbors (Models 3-5)
  • Deficient (Models 3-4)
  • Model 4: relative distortion, word classes
  • Model 5: extra variables to avoid deficiency

4
IBM Model 1
  • Model parameters
  • T(fj | eaj): translation probability of the foreign
    word fj given the English word eaj that generated it

5
IBM Model 1
  • Generative story
  • Given e:
  • Pick m = |f|, where all lengths m are equally
    probable
  • Pick A with probability P(A | e) = 1/(l+1)^m, since
    all alignments are equally likely given l and m
  • Pick f1...fm with probability
    P(f | A, e) = ∏j T(fj | eaj)
  • where T(fj | eaj) is the translation
    probability of fj given the English word it is
    aligned to
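
Putting the three steps together gives the joint probability of f and A under Model 1 (the standard Brown et al., 1993 form; ε is the small constant probability of picking length m):

$$P(f, A \mid e) \;=\; \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} T(f_j \mid e_{a_j})$$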

6
IBM Model 1 Example
  • e = blue witch

7
IBM Model 1 Example
  • e = blue witch
  • f = f1 f2

Pick m = |f| = 2
8
IBM Model 1 Example
  • e = blue witch
  • f = f1 f2

Pick A = (2, 1) with probability 1/(l+1)^m
9
IBM Model 1 Example
  • e = blue witch
  • f = bruja f2

Pick f1 = bruja with probability t(bruja | witch)
10
IBM Model 1 Example
  • e = blue witch
  • f = bruja azul

Pick f2 = azul with probability t(azul | blue)
11
IBM Model 1 Parameter Estimation
  • How does this generative story help us to
    estimate P(f | e) from the data?
  • Since the model for P(f | e) contains the parameter
    T(fj | eaj), we first need to estimate
    T(fj | eaj)

12
IBM Model 1 Parameter Estimation
  • How to estimate T(fj | eaj) from the data?
  • If we had the data and the alignments A, along
    with P(A | f,e), then we could estimate T(fj | eaj)
    using expected counts as follows:
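
The slide's formula image was not transcribed; the usual expected-count estimate it refers to is:

$$\hat{T}(f \mid e) \;=\; \frac{E[c(f,e)]}{\sum_{f'} E[c(f',e)]},
\qquad
E[c(f,e)] \;=\; \sum_{A} P(A \mid f, e) \sum_{j=1}^{m} \mathbf{1}[f_j = f \wedge e_{a_j} = e],$$

summed over all training sentence pairs.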

13
IBM Model 1 Parameter Estimation
  • How to estimate P(A | f,e)?
  • P(A | f,e) = P(A,f | e) / P(f | e)
  • But P(f | e) = ΣA P(A,f | e), which we do not have
  • So we need to compute P(A,f | e)
  • This is given by the Model 1 generative story

14
IBM Model 1 Example
  • e = the blue witch
  • f = la bruja azul

P(A | f,e) = P(f,A | e) / P(f | e)
15
IBM Model 1 Parameter Estimation
  • So, in order to estimate P(f | e), we first need to
    estimate the model parameter T(fj | eaj)
  • In order to compute T(fj | eaj), we need to
    estimate P(A | f,e)
  • And in order to compute P(A | f,e), we need to
    estimate T(fj | eaj)

16
IBM Model 1 Parameter Estimation
  • Training data is a set of pairs <ei, fi>
  • Log likelihood of training data given model
    parameters is L(T) = Σi log P(fi | ei)
  • To maximize log likelihood of training data given
    model parameters, use EM:
  • hidden variables: alignments A
  • model parameters: translation probabilities T

17
EM
  • Initialize model parameters T(f | e)
  • Calculate alignment probabilities P(A | f,e) under
    current values of T(f | e)
  • Calculate expected counts from alignment
    probabilities
  • Re-estimate T(f | e) from these expected counts
  • Repeat until log likelihood of training data
    converges to a maximum
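
Below is a minimal Python sketch of this loop for Model 1 (function and variable names are mine, not from the lecture). Run on the two-sentence corpus introduced on the next slides, it reproduces the iteration-5 probabilities shown further below:

```python
from collections import defaultdict

def train_model1(corpus, iterations=5):
    """Minimal IBM Model 1 EM trainer, with a NULL source word."""
    f_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # T(f|e), uniform init
    for _ in range(iterations):
        count = defaultdict(float)                # expected counts tc(f, e)
        total = defaultdict(float)                # normalizer per English word e
        # E-step: distribute each foreign word's count over the English
        # words (including NULL) in proportion to the current T(f|e).
        for es, fs in corpus:
            es = ['NULL'] + es
            for f in fs:
                denom = sum(t[(f, e)] for e in es)
                for e in es:
                    c = t[(f, e)] / denom
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-estimate T(f|e) from the expected counts.
        for f, e in count:
            t[(f, e)] = count[(f, e)] / total[e]
    return t

corpus = [("the dog".split(), "le chien".split()),
          ("the cat".split(), "le chat".split())]
t = train_model1(corpus, iterations=5)
print(round(t[('le', 'the')], 4))     # 0.7556, cf. slide 25
print(round(t[('chien', 'dog')], 4))  # 0.8381
```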

18
IBM Model 1 Example
  • Parallel corpus:
  • the dog → le chien
  • the cat → le chat
  • Steps 1-2 (collect candidates and initialize
    uniformly):
  • P(le | the) = P(chien | the) = P(chat | the) = 1/3
  • P(le | dog) = P(chien | dog) = P(chat | dog) = 1/3
  • P(le | cat) = P(chien | cat) = P(chat | cat) = 1/3
  • P(le | NULL) = P(chien | NULL) = P(chat | NULL) = 1/3

19
IBM Model 1 Example
  • Step 3: Iterate
  • NULL the dog → le chien
  • j=1:
  • total = P(le | NULL) + P(le | the) + P(le | dog) = 1
  • tc(le | NULL) = 0 + .333/1 = 0.333
  • tc(le | the) = 0 + .333/1 = 0.333
  • tc(le | dog) = 0 + .333/1 = 0.333
  • j=2:
  • total = P(chien | NULL) + P(chien | the) + P(chien | dog) = 1
  • tc(chien | NULL) = 0 + .333/1 = 0.333
  • tc(chien | the) = 0 + .333/1 = 0.333
  • tc(chien | dog) = 0 + .333/1 = 0.333

20
IBM Model 1 Example
  • NULL the cat → le chat
  • j=1:
  • total = P(le | NULL) + P(le | the) + P(le | cat) = 1
  • tc(le | NULL) = 0.333 + .333/1 = 0.666
  • tc(le | the) = 0.333 + .333/1 = 0.666
  • tc(le | cat) = 0 + .333/1 = 0.333
  • j=2:
  • total = P(chat | NULL) + P(chat | the) + P(chat | cat) = 1
  • tc(chat | NULL) = 0 + .333/1 = 0.333
  • tc(chat | the) = 0 + .333/1 = 0.333
  • tc(chat | cat) = 0 + .333/1 = 0.333

21
IBM Model 1 Example
  • Re-compute translation probabilities
  • total(the) = tc(le | the) + tc(chien | the) +
    tc(chat | the)
  • = 0.666 + 0.333 + 0.333 = 1.333
  • P(le | the) = tc(le | the)/total(the)
  • = 0.666 / 1.333 = 0.5
  • P(chien | the) = tc(chien | the)/total(the)
  • = 0.333 / 1.333 = 0.25
  • P(chat | the) = tc(chat | the)/total(the)
  • = 0.333 / 1.333 = 0.25
  • total(dog) = tc(le | dog) + tc(chien | dog) = 0.666
  • P(le | dog) = tc(le | dog)/total(dog)
  • = 0.333 / 0.666 = 0.5
  • P(chien | dog) = tc(chien | dog)/total(dog)
  • = 0.333 / 0.666 = 0.5

22
IBM Model 1 Example
  • Iteration 2
  • NULL the dog → le chien
  • j=1:
  • total = P(le | NULL) + P(le | the) + P(le | dog)
  • = 0.5 + 0.5 + 0.5 = 1.5
  • tc(le | NULL) = 0 + .5/1.5 = 0.333
  • tc(le | the) = 0 + .5/1.5 = 0.333
  • tc(le | dog) = 0 + .5/1.5 = 0.333
  • j=2:
  • total = P(chien | NULL) + P(chien | the) + P(chien | dog)
  • = 0.25 + 0.25 + 0.5 = 1
  • tc(chien | NULL) = 0 + .25/1 = 0.25
  • tc(chien | the) = 0 + .25/1 = 0.25
  • tc(chien | dog) = 0 + .5/1 = 0.5

23
IBM Model 1 Example
  • NULL the cat → le chat
  • j=1:
  • total = P(le | NULL) + P(le | the) + P(le | cat)
  • = 0.5 + 0.5 + 0.5 = 1.5
  • tc(le | NULL) = 0.333 + .5/1.5 = 0.666
  • tc(le | the) = 0.333 + .5/1.5 = 0.666
  • tc(le | cat) = 0 + .5/1.5 = 0.333
  • j=2:
  • total = P(chat | NULL) + P(chat | the) + P(chat | cat)
  • = 0.25 + 0.25 + 0.5 = 1
  • tc(chat | NULL) = 0 + .25/1 = 0.25
  • tc(chat | the) = 0 + .25/1 = 0.25
  • tc(chat | cat) = 0 + .5/1 = 0.5

24
IBM Model 1 Example
  • Re-compute translation probabilities (iteration 2)
  • total(the) = tc(le | the) + tc(chien | the) +
    tc(chat | the)
  • = 0.666 + 0.25 + 0.25 = 1.166
  • P(le | the) = tc(le | the)/total(the)
  • = 0.666 / 1.166 = 0.571
  • P(chien | the) = tc(chien | the)/total(the)
  • = 0.25 / 1.166 = 0.214
  • P(chat | the) = tc(chat | the)/total(the)
  • = 0.25 / 1.166 = 0.214
  • total(dog) = tc(le | dog) + tc(chien | dog)
  • = 0.333 + 0.5 = 0.833
  • P(le | dog) = tc(le | dog)/total(dog)
  • = 0.333 / 0.833 = 0.4
  • P(chien | dog) = tc(chien | dog)/total(dog)
  • = 0.5 / 0.833 = 0.6

25
IBM Model 1 Example
  • After 5 iterations:
  • P(le | NULL) = 0.755608028335301
  • P(chien | NULL) = 0.122195985832349
  • P(chat | NULL) = 0.122195985832349
  • P(le | the) = 0.755608028335301
  • P(chien | the) = 0.122195985832349
  • P(chat | the) = 0.122195985832349
  • P(le | dog) = 0.161943319838057
  • P(chien | dog) = 0.838056680161943
  • P(le | cat) = 0.161943319838057
  • P(chat | cat) = 0.838056680161943

26
IBM Model 1 Recap
  • IBM Model 1 allows for an efficient computation
    of translation probabilities
  • No notion of fertility, i.e., it's possible that
    the same English word is the best translation for
    all foreign words
  • No positional information, i.e., depending on the
    language pair, there might be a tendency that
    words occurring at the beginning of the English
    sentence are more likely to align to words at the
    beginning of the foreign sentence

27
IBM Model 2
  • Model parameters
  • T(fj | eaj): translation probability of foreign
    word fj given English word eaj that generated it
  • d(i | j,l,m): distortion probability, or
    probability that fj is aligned to ei, given l
    and m
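
Schematically, Model 2 keeps Model 1's translation term but replaces the uniform alignment probability with the distortion term:

$$P(f, A \mid e) \;=\; \epsilon \prod_{j=1}^{m} T(f_j \mid e_{a_j}) \; d(a_j \mid j, l, m)$$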

28
IBM Model 3
  • Model parameters
  • T(fj | eaj): translation probability of foreign
    word fj given English word eaj that generated it
  • r(j | i,l,m): reverse distortion probability, or
    probability of position j (of fj), given its
    alignment to ei, l, and m
  • n(k | ei): fertility of word ei, or probability
    that k foreign words align to ei
  • p1: probability of generating a foreign word by
    alignment with the NULL English word

29
IBM Model 3
  • IBM Model 3 offers two additional features
    compared to IBM Model 1
  • How likely is an English word e to align to k
    foreign words (fertility)?
  • Positional information (distortion): how likely
    is a word in position i to align to a word in
    position j?

30
IBM Model 3 Fertility
  • The best Model 1 alignment could be that a single
    English word aligns to all foreign words
  • This is clearly not desirable and we want to
    constrain the number of words an English word can
    align to
  • Fertility: a probability distribution n(k | e) that
    word e aligns to k words
  • Consequence: translation probabilities cannot be
    computed independently of each other anymore
  • IBM Model 3 has to work with full alignments;
    note there are up to (l+1)^m different alignments

31
IBM Model 3
  • Generative Story
  • Choose fertilities for each English word
  • Insert spurious words according to probability of
    being aligned to the NULL English word
  • Translate English words → foreign words
  • Reorder words according to reverse distortion
    probabilities

32
IBM Model 3 Example
  • Consider the following example from Knight
    1999
  • Maria did not slap the green witch

33
IBM Model 3 Example
  • Maria did not slap the green witch
  • Maria not slap slap slap the green witch
  • Choose fertilities: φ(Maria) = 1

34
IBM Model 3 Example
  • Maria did not slap the green witch
  • Maria not slap slap slap the green witch
  • Maria not slap slap slap NULL the green witch
  • Insert spurious words: p(NULL)

35
IBM Model 3 Example
  • Maria did not slap the green witch
  • Maria not slap slap slap the green witch
  • Maria not slap slap slap NULL the green witch
  • Maria no dio una bofetada a la verde bruja
  • Translate words: t(verde | green)

36
IBM Model 3 Example
  • Maria no dio una bofetada a la verde bruja
  • Maria no dio una bofetada a la bruja verde
  • Reorder words
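
Multiplying the factors walked through in the last four slides gives the Model 3 score of this translation and alignment. The Python sketch below shows the shape of that computation; the probability tables are assumed to have been learned already, and the full model's combinatorial and NULL-insertion factors are omitted for brevity:

```python
def model3_score(e_words, f_words, align, n, t, d):
    """Simplified IBM Model 3 score: fertility x translation x distortion.

    align[j] is the 1-based English position generating foreign word j
    (0 = NULL). Tables: n[(k, e)] = n(k|e), t[(f, e)] = t(f|e),
    d[(j, i)] = distortion for foreign position j and English position i."""
    p = 1.0
    for i, e in enumerate(e_words, start=1):
        k = sum(1 for a in align if a == i)        # fertility of e_i
        p *= n.get((k, e), 0.0)                    # n(k | e)
    for j, (f, a) in enumerate(zip(f_words, align), start=1):
        if a > 0:                                  # a == 0 means NULL
            p *= t.get((f, e_words[a - 1]), 0.0)   # t(f_j | e_{a_j})
            p *= d.get((j, a), 0.0)                # distortion, l and m fixed
    return p
```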

37
IBM Model 3
  • For models 1 and 2
  • We can compute exact EM updates
  • For models 3 and 4
  • Exact EM updates cannot be efficiently computed
  • Use best alignments from previous iterations to
    initialize each successive model
  • Explore only the subspace of potential alignments
    that lies within same neighborhood as the initial
    alignments

38
IBM Model 4
  • Model parameters
  • Same as model 3, except uses more complicated
    model of reordering (for details, see Brown et
    al. 1993)

39
(No Transcript)
40
IBM Model 1 → Model 3
  • Iterating over all possible alignments is
    computationally infeasible
  • Solution: compute the best alignment with Model 1
    and change some of the alignments to generate a
    set of likely alignments (pegging)
  • Model 3 takes this restricted set of alignments
    as input

41
Pegging
  • Given an alignment a, we can derive additional
    alignments from it by making small changes:
  • Changing a link (j,i) to (j,i')
  • Swapping a pair of links (j1,i1) and (j2,i2) to
    (j1,i2) and (j2,i1)
  • The resulting set of alignments is called the
    neighborhood of a
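
A minimal sketch of neighborhood generation, assuming an alignment is a tuple a where a[j] is the English position (0 = NULL) of foreign word j, and l is the English sentence length:

```python
def neighborhood(a, l):
    """All alignments reachable from a by one move or one swap."""
    neighbors = set()
    m = len(a)
    # Moves: re-link one foreign word j to a different English position i.
    for j in range(m):
        for i in range(l + 1):
            if i != a[j]:
                b = list(a)
                b[j] = i
                neighbors.add(tuple(b))
    # Swaps: exchange the links of two foreign words j1 and j2.
    for j1 in range(m):
        for j2 in range(j1 + 1, m):
            if a[j1] != a[j2]:
                b = list(a)
                b[j1], b[j2] = a[j2], a[j1]
                neighbors.add(tuple(b))
    return neighbors

print(neighborhood((2, 1), l=2))  # neighbors of the slide-8 alignment
```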

42
IBM Model 3 Distortion
  • The distortion factor determines how likely it is
    that an English word in position i aligns to a
    foreign word in position j, given the lengths of
    both sentences
  • d(j | i, l, m)
  • Note: positions are absolute positions

43
Deficiency
  • Problem with IBM Model 3: it assigns probability
    mass to impossible strings
  • Well-formed string: "This is possible"
  • Ill-formed but possible string: "This possible
    is"
  • Impossible string:
  • Impossible strings are due to distortion values
    that generate different words at the same
    position
  • Impossible strings can still be filtered out in
    later stages of the translation process

44
Limitations of IBM Models
  • Only 1-to-N word mapping
  • Handling fertility-zero words (difficult for
    decoding)
  • Almost no syntactic information
  • Word classes
  • Relative distortion
  • Long-distance word movement
  • Fluency of the output depends entirely on the
    English language model

45
Decoding
  • How to translate new sentences?
  • A decoder uses the parameters learned on a
    parallel corpus
  • Translation probabilities
  • Fertilities
  • Distortions
  • In combination with a language model the decoder
    generates the most likely translation
  • Standard algorithms can be used to explore the
    search space (A*, greedy search, ...)
  • Similar to the traveling salesman problem

46
Three Problems for Statistical MT
  • Language model
  • Given an English string e, assigns P(e) by
    formula
  • good English string → high P(e)
  • random word sequence → low P(e)
  • Translation model
  • Given a pair of strings <f,e>, assigns P(f | e)
    by formula
  • <f,e> look like translations → high P(f | e)
  • <f,e> don't look like translations → low P(f | e)
  • Decoding algorithm
  • Given a language model, a translation model, and
    a new sentence f, find the translation e maximizing
    P(e) · P(f | e)

Slide from Kevin Knight
47
The Classic Language Model: Word N-Grams
  • Goal of the language model -- choose among
  • He is on the soccer field
  • He is in the soccer field
  • Is table the on cup the
  • The cup is on the table
  • Rice shrine
  • American shrine
  • Rice company
  • American company

Slide from Kevin Knight
48
Intuition of phrase-based translation (Koehn et
al. 2003)
  • Generative story has three steps
  • Group words into phrases
  • Translate each phrase
  • Move the phrases around

49
Generative story again
  • Group English source words into phrases e1, e2,
    ..., en
  • Translate each English phrase ei into a Spanish
    phrase fj.
  • The probability of doing this is φ(fj | ei)
  • Then (optionally) reorder each Spanish phrase
  • We do this with a distortion probability
  • A measure of distance between positions of a
    corresponding phrase in the 2 languages.
  • What is the probability that a phrase in
    position X in the English sentence moves to
    position Y in the Spanish sentence?

50
Distortion probability
  • The distortion probability is parameterized by
    ai - b(i-1)
  • where ai is the start position of the foreign
    (Spanish) phrase generated by the ith English
    phrase ei,
  • and b(i-1) is the end position of the foreign
    (Spanish) phrase generated by the (i-1)th English
    phrase e(i-1).
  • We'll call the distortion probability d(ai - b(i-1)).
  • And we'll have a really stupid model:
  • d(ai - b(i-1)) = α^|ai - b(i-1) - 1|
  • where α is some small constant.
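
As a sketch (the value of α here is an arbitrary illustration, not from the lecture):

```python
ALPHA = 0.5  # "some small constant" -- illustrative value

def distortion(a_i, b_prev):
    """d(ai - b(i-1)) = alpha^|ai - b(i-1) - 1|: phrases translated in
    source order (a_i == b_prev + 1) incur no penalty."""
    return ALPHA ** abs(a_i - b_prev - 1)

print(distortion(4, 3))  # adjacent, in order -> 1.0
print(distortion(7, 3))  # jumped 3 positions  -> 0.125
```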

51
Final translation model for phrase-based MT
  • Let's look at a simple example with no distortion

52
Phrase-based MT
  • Language model: P(E)
  • Translation model: P(F | E)
  • Model: Ê = argmaxE P(E) · P(F | E)
  • How to train the model?
  • Decoder: finding the sentence E that is most
    probable

53
Training P(F | E)
  • What we mainly need to train is φ(fj | ei)
  • Suppose we had a large bilingual training corpus
  • A bitext
  • In which each English sentence is paired with a
    Spanish sentence
  • And suppose we knew exactly which phrase in
    Spanish was the translation of which phrase in
    the English
  • We call this a phrase alignment
  • If we had this, we could just count-and-divide,
    as sketched below
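
A sketch of that count-and-divide estimate, assuming (hypothetically) phrase-aligned input:

```python
from collections import Counter

def phrase_table(phrase_aligned_bitext):
    """Estimate phi(f_phrase | e_phrase) by counting co-occurrences of
    aligned phrase pairs and normalizing per English phrase."""
    pair_counts, e_counts = Counter(), Counter()
    for sentence_pairs in phrase_aligned_bitext:   # list of (e_phr, f_phr)
        for e_phr, f_phr in sentence_pairs:
            pair_counts[(e_phr, f_phr)] += 1
            e_counts[e_phr] += 1
    return {(e, f): c / e_counts[e] for (e, f), c in pair_counts.items()}

bitext = [[("the", "la"), ("green witch", "bruja verde")],
          [("the", "la"), ("green witch", "la bruja verde")]]
table = phrase_table(bitext)
print(table[("green witch", "bruja verde")])  # 0.5
```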

54
But we don't have phrase alignments
  • What we have instead are word alignments

55
Getting phrase alignments
  • To get phrase alignments
  • We first get word alignments
  • Then we symmetrize the word alignments into
    phrase alignments

56
How to get Word Alignments
  • Word alignment: a mapping between the source
    words and the target words in a set of parallel
    sentences.
  • Restriction: each foreign word comes from exactly
    1 English word
  • Advantage: we can represent an alignment by the index
    of the English word that the French word comes from
  • The alignment above is thus 2,3,4,5,6,6,6

57
One addition: spurious words
  • A word in the foreign sentence
  • that doesn't align with any word in the English
    sentence
  • is called a spurious word.
  • We model these by pretending they are generated
    by an English word e0

58
More sophisticated models of alignment
59
Computing word alignments IBM Model 1
  • For phrase-based machine translation
  • We want a word-alignment
  • To extract a set of phrases
  • A word alignment algorithm gives us P(F | E)
  • We want this to train our phrase probabilities
    φ(fj | ei) as part of P(F | E)
  • But a word-alignment algorithm can also be part
    of a mini-translation model itself.

60
IBM Model 1
61
IBM Model 1
62
How does the generative story assign P(F | E) for a
Spanish sentence F?
  • Terminology
  • Suppose we had done steps 1 and 2, i.e., we
    already knew the Spanish length J and the
    alignment A (and the English source E)

63
Lets formalize steps 1 and 2
  • We want P(A | E), the probability of an alignment A
    (of length J) given an English sentence E
  • IBM Model 1 makes the (very) simplifying
    assumption that each alignment is equally likely.
  • How many possible alignments are there between an
    English sentence of length I and a Spanish sentence
    of length J?
  • Hint: each Spanish word must come from one of the
    English source words (or the NULL word)
  • (I+1)^J
  • Let's assume the probability of choosing length J is
    a small constant epsilon

64
Model 1 continued
  • Prob of choosing a length and then one of the
    possible alignments: P(A | E) = ε / (I+1)^J
  • Combining with step 3 (translating each word):
    P(F, A | E) = ε / (I+1)^J · ∏j t(fj | e_aj)
  • The total probability of a given foreign sentence
    F: P(F | E) = ΣA P(F, A | E)
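
Because each a_j is chosen independently, the sum over all (I+1)^J alignments factors into a product of J small sums; this is what makes Model 1's likelihood cheap to compute:

$$P(F \mid E) \;=\; \frac{\epsilon}{(I+1)^J} \sum_{A} \prod_{j=1}^{J} t(f_j \mid e_{a_j}) \;=\; \frac{\epsilon}{(I+1)^J} \prod_{j=1}^{J} \sum_{i=0}^{I} t(f_j \mid e_i)$$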

65
Decoding
  • How do we find the best A?
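
For Model 1 the answer factorizes: each foreign word can pick its best English word independently. A minimal sketch, reusing the t table from the EM sketch earlier:

```python
def best_alignment(t, es, fs):
    """Viterbi alignment under Model 1: for each f_j choose the
    English position i (0 = NULL) maximizing t(f_j | e_i)."""
    es = ['NULL'] + es
    return [max(range(len(es)), key=lambda i: t[(f, es[i])]) for f in fs]

print(best_alignment(t, "the dog".split(), "le chien".split()))
# chien aligns to dog; le ties between NULL and the
```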

66
Training alignment probabilities
  • Step 1: get a parallel corpus
  • Hansards
  • Canadian parliamentary proceedings, in French and
    English
  • Hong Kong Hansards: English and Chinese
  • Step 2: sentence alignment
  • Step 3: use EM (Expectation Maximization) to
    train word alignments

67
Step 1: Parallel corpora
  • Example from DE-News (8/1/1996)

English → German:
  • Diverging opinions about planned tax reform →
    Unterschiedliche Meinungen zur geplanten Steuerreform
  • The discussion around the envisaged major tax reform
    continues . → Die Diskussion um die vorgesehene
    grosse Steuerreform dauert an .
  • The FDP economics expert , Graf Lambsdorff , today
    came out in favor of advancing the enactment of
    significant parts of the overhaul , currently planned
    for 1999 . → Der FDP - Wirtschaftsexperte Graf
    Lambsdorff sprach sich heute dafuer aus , wesentliche
    Teile der fuer 1999 geplanten Reform vorzuziehen .
Slide from Christof Monz
68
Step 2: Sentence Alignment
  • The old man is happy. He has fished many times.
    His wife talks to him. The fish are jumping.
    The sharks await.
  • Intuition
  • - use length in words or chars
  • - together with dynamic programming
  • - or use a simpler MT model

El viejo está feliz porque ha pescado muchos
veces. Su mujer habla con él. Los tiburones
esperan.
Slide from Kevin Knight
69
Sentence Alignment
  1. The old man is happy.
  2. He has fished many times.
  3. His wife talks to him.
  4. The fish are jumping.
  5. The sharks await.

El viejo está feliz porque ha pescado muchos
veces. Su mujer habla con él. Los tiburones
esperan.
Slide from Kevin Knight
70
Sentence Alignment
  1. The old man is happy.
  2. He has fished many times.
  3. His wife talks to him.
  4. The fish are jumping.
  5. The sharks await.

El viejo está feliz porque ha pescado muchos
veces. Su mujer habla con él. Los tiburones
esperan.
Slide from Kevin Knight
71
Sentence Alignment
  1. The old man is happy. He has fished many times.
  2. His wife talks to him.
  3. The sharks await.

El viejo está feliz porque ha pescado muchos
veces. Su mujer habla con él. Los tiburones
esperan.
Note that unaligned sentences are thrown out,
and sentences are merged in n-to-m alignments (n,
m > 0).
Slide from Kevin Knight
72
Step 3: word alignments
  • It turns out we can bootstrap alignments
  • from a sentence-aligned bilingual corpus
  • using the Expectation-Maximization (EM)
    algorithm

73
EM for training alignment probs
la maison → the house   la maison bleue → the blue house   la fleur → the flower
All word alignments equally likely.
All P(french-word | english-word) equally likely.
Slide from Kevin Knight
74
EM for training alignment probs
la maison → the house   la maison bleue → the blue house   la fleur → the flower
"la" and "the" are observed to co-occur
frequently, so P(la | the) is increased.
Slide from Kevin Knight
75
EM for training alignment probs
la maison → the house   la maison bleue → the blue house   la fleur → the flower
"house" co-occurs with both "la" and "maison",
but P(maison | house) can be raised without
limit, to 1.0, while P(la | house) is limited
because of "the" (pigeonhole principle)
Slide from Kevin Knight
76
EM for training alignment probs
la maison → the house   la maison bleue → the blue house   la fleur → the flower
Settling down after another iteration.
Slide from Kevin Knight
77
EM for training alignment probs
la maison → the house   la maison bleue → the blue house   la fleur → the flower
  • Inherent hidden structure revealed by EM
    training!
  • For details, see:
  • Section 24.6.1 in the chapter
  • "A Statistical MT Tutorial Workbook" (Knight,
    1999)
  • "The Mathematics of Statistical Machine
    Translation" (Brown et al., 1993)
  • Software: GIZA

Slide from Kevin Knight
78
Statistical Machine Translation
la maison → the house   la maison bleue → the blue house   la fleur → the flower
P(juste | fair) = 0.411
P(juste | correct) = 0.027
P(juste | right) = 0.020
Possible English translations, to be rescored by
language model
new French sentence
Slide from Kevin Knight
79
A more complex model: IBM Model 3 (Brown et al.,
1993)
Generative approach:
Mary did not slap the green witch
   n(3 | slap)
Mary not slap slap slap the green witch
   P-Null
Mary not slap slap slap NULL the green witch
   t(la | the)
Maria no dió una bofetada a la verde bruja
   d(j | i)
Maria no dió una bofetada a la bruja verde
Probabilities can be learned from raw bilingual
text.
80
How do we evaluate MT? Human tests for fluency
  • Rating tests: give the raters a scale (1 to 5)
    and ask them to rate
  • Or distinct scales for
  • Clarity, Naturalness, Style
  • Or check for specific problems
  • Cohesion (Lexical chains, anaphora, ellipsis)
  • Hand-checking for cohesion.
  • Well-formedness
  • 5-point scale of syntactic correctness
  • Comprehensibility tests
  • Noise test
  • Multiple choice questionnaire
  • Readability tests
  • cloze

81
How do we evaluate MT? Human tests for fidelity
  • Adequacy
  • Does it convey the information in the original?
  • Ask raters to rate on a scale
  • Bilingual raters: give them the source and target
    sentence, ask how much information is preserved
  • Monolingual raters: give them the target and a good
    human translation
  • Informativeness
  • Task-based: is there enough info to do some task?
  • Give raters multiple-choice questions about
    content

82
Evaluating MT Problems
  • Asking humans to judge sentences on a 5-point
    scale for 10 factors takes time and money (weeks
    or months!)
  • We can't build language engineering systems if we
    can only evaluate them once every quarter!!!!
  • We need a metric that we can run every time we
    change our algorithm.
  • It would be OK if it wasn't perfect, but it should
    tend to correlate with the expensive human
    metrics, which we could still run quarterly.

Bonnie Dorr
83
Automatic evaluation
  • Miller and Beebe-Center (1958)
  • Assume we have one or more human translations of
    the source passage
  • Compare the automatic translation to these human
    translations
  • Bleu
  • NIST
  • Meteor
  • Precision/Recall

84
BiLingual Evaluation Understudy (BLEU; Papineni,
2001)
http://www.research.ibm.com/people/k/kishore/RC22176.pdf
  • Automatic technique, but ...
  • Requires the pre-existence of human (reference)
    translations
  • Approach:
  • Produce corpus of high-quality human translations
  • Judge closeness numerically (word-error rate)
  • Compare n-gram matches between candidate
    translation and 1 or more reference translations

Slide from Bonnie Dorr
85
BLEU Evaluation Metric (Papineni et al, ACL-2002)
Reference (human) translation The U.S. island
of Guam is maintaining a high state of alert
after the Guam airport and its offices both
received an e-mail from someone calling himself
the Saudi Arabian Osama bin Laden and threatening
a biological/chemical attack against public
places such as the airport .
  • N-gram precision (score is between 0 and 1)
  • What percentage of machine n-grams can be found
    in the reference translation?
  • An n-gram is a sequence of n words
  • Not allowed to use the same portion of the reference
    translation twice (can't cheat by typing out "the
    the the the the")
  • Brevity penalty
  • Can't just type out the single word "the" (precision
    1.0!)
  • Amazingly hard to "game" the system (i.e.,
    find a way to change machine output so that BLEU
    goes up, but quality doesn't)

Machine translation The American ?
international airport and its the office all
receives one calls self the sand Arab rich
business ? and so on electronic mail , which
sends out The threat will be able after public
place and so on the airport to start the
biochemistry attack , ? highly alerts after the
maintenance.
Slide from Bonnie Dorr
86
BLEU Evaluation Metric (Papineni et al, ACL-2002)
Reference (human) translation The U.S. island
of Guam is maintaining a high state of alert
after the Guam airport and its offices both
received an e-mail from someone calling himself
the Saudi Arabian Osama bin Laden and threatening
a biological/chemical attack against public
places such as the airport .
  • BLEU4 formula
  • (counts n-grams up to length 4)
  • BLEU4 = exp( 1.0 · log p1
  •            + 0.5 · log p2
  •            + 0.25 · log p3
  •            + 0.125 · log p4
  •            − max(words-in-reference / words-in-machine − 1, 0) )
  • p1 = 1-gram precision
  • p2 = 2-gram precision
  • p3 = 3-gram precision
  • p4 = 4-gram precision

Machine translation The American ?
international airport and its the office all
receives one calls self the sand Arab rich
business ? and so on electronic mail , which
sends out The threat will be able after public
place and so on the airport to start the
biochemistry attack , ? highly alerts after the
maintenance.
Slide from Bonnie Dorr
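
A minimal Python sketch of this metric (the function names are mine; the weights and brevity term follow the slide's formula, and the choice of reference length is an assumption, real BLEU uses corpus-level closest-reference lengths):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most
    as many times as it appears in any single reference."""
    cand = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for gram, c in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(1, sum(cand.values()))

def bleu4(candidate, references):
    """BLEU4 with the slide's weights (1.0, 0.5, 0.25, 0.125) and
    brevity term max(ref_len / cand_len - 1, 0)."""
    logs = 0.0
    for n, w in zip(range(1, 5), [1.0, 0.5, 0.25, 0.125]):
        p = modified_precision(candidate, references, n)
        if p == 0:
            return 0.0
        logs += w * math.log(p)
    ref_len = min(len(r) for r in references)  # assumption: shortest reference
    brevity = max(ref_len / len(candidate) - 1, 0)
    return math.exp(logs - brevity)
```
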
87
Multiple Reference Translations
Slide from Bonnie Dorr
88
BLEU in Action
???????? (Foreign Original)
the gunman was shot to death by the police .
(Reference Translation)
#1: the gunman was police kill .
#2: wounded police jaya of
#3: the gunman was shot dead by the police .
#4: the gunman arrested by police kill .
#5: the gunmen were killed .
#6: the gunman was shot to death by the police .
#7: gunmen were killed by police <SUB>0 <SUB>0
#8: al by the police .
#9: the ringer is killed by the police .
#10: police killed the gunman .
Slide from Bonnie Dorr
89
BLEU in Action
???????? (Foreign Original)
the gunman was shot to death by the police .
(Reference Translation)
#1: the gunman was police kill .
#2: wounded police jaya of
#3: the gunman was shot dead by the police .
#4: the gunman arrested by police kill .
#5: the gunmen were killed .
#6: the gunman was shot to death by the police .
#7: gunmen were killed by police <SUB>0 <SUB>0
#8: al by the police .
#9: the ringer is killed by the police .
#10: police killed the gunman .
green = 4-gram match (good!)   red = word not
matched (bad!)
Slide from Bonnie Dorr
90
Bleu Comparison
Chinese-English Translation Example:
Candidate 1: It is a guide to action which ensures
that the military always obeys the commands of the
party.
Candidate 2: It is to insure the troops forever
hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures
that the military will forever heed Party
commands.
Reference 2: It is the guiding principle which
guarantees the military forces always being under
the command of the Party.
Reference 3: It is the practical guide for the
army always to heed the directions of the party.
Slide from Bonnie Dorr
91
How Do We Compute Bleu Scores?
  • Intuition: what percentage of words in the candidate
    occurred in some human translation?
  • Proposal: count up the # of candidate translation
    words (unigrams) in any reference translation,
    divide by the total # of words in the candidate
    translation
  • But we can't just count the total # of overlapping
    N-grams!
  • Candidate: the the the the the the
  • Reference 1: The cat is on the mat
  • Solution: a reference word should be considered
    exhausted after a matching candidate word is
    identified.

Slide from Bonnie Dorr
92
Modified n-gram precision
  • For each word compute:
  • (1) the total number of times it occurs in any
    single reference translation
  • (2) the number of times it occurs in the candidate
    translation
  • Instead of using count (2), use the minimum of (2)
    and (1), i.e., clip the counts at the max for the
    reference translation
  • Now use that modified count.
  • And divide by the number of candidate words.

Slide from Bonnie Dorr
93
Modified Unigram Precision: Candidate 1
It(1) is(1) a(1) guide(1) to(1) action(1)
which(1) ensures(1) that(2) the(4) military(1)
always(1) obeys(0) the commands(1) of(1) the
party(1)
Reference 1: It is a guide to action that ensures
that the military will forever heed Party
commands.
Reference 2: It is the guiding principle which
guarantees the military forces always being under
the command of the Party.
Reference 3: It is the practical guide for the
army always to heed the directions of the party.
What's the answer?
17/18
Slide from Bonnie Dorr
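
Using the `modified_precision` sketch from the BLEU code a few slides back (with lowercased, punctuation-free tokenization so the numbers line up with the slide):

```python
refs = [
    "it is a guide to action that ensures that the military"
    " will forever heed party commands".split(),
    "it is the guiding principle which guarantees the military"
    " forces always being under the command of the party".split(),
    "it is the practical guide for the army always to heed"
    " the directions of the party".split(),
]
cand1 = ("it is a guide to action which ensures that the military"
         " always obeys the commands of the party").split()
print(modified_precision(cand1, refs, 1))  # 17/18 = 0.944... ("obeys" unmatched)
```
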
94
Modified Unigram Precision: Candidate 2
It(1) is(1) to(1) insure(0) the(4) troops(0)
forever(1) hearing(0) the activity(0)
guidebook(0) that(2) party(1) direct(0)
Reference 1: It is a guide to action that ensures
that the military will forever heed Party
commands.
Reference 2: It is the guiding principle which
guarantees the military forces always being under
the command of the Party.
Reference 3: It is the practical guide for the
army always to heed the directions of the party.
What's the answer?
8/14
Slide from Bonnie Dorr
95
Modified Bigram Precision: Candidate 1
It is(1) is a(1) a guide(1) guide to(1) to
action(1) action which(0) which ensures(0)
ensures that(1) that the(1) the military(1)
military always(0) always obeys(0) obeys the(0)
the commands(0) commands of(0) of the(1) the
party(1)
Reference 1: It is a guide to action that ensures
that the military will forever heed Party
commands.
Reference 2: It is the guiding principle which
guarantees the military forces always being under
the command of the Party.
Reference 3: It is the practical guide for the
army always to heed the directions of the party.
What's the answer?
10/17
Slide from Bonnie Dorr
96
Modified Bigram Precision: Candidate 2
It is(1) is to(0) to insure(0) insure the(0) the
troops(0) troops forever(0) forever hearing(0)
hearing the(0) the activity(0) activity
guidebook(0) guidebook that(0) that party(0)
party direct(0)
Reference 1: It is a guide to action that ensures
that the military will forever heed Party
commands.
Reference 2: It is the guiding principle which
guarantees the military forces always being under
the command of the Party.
Reference 3: It is the practical guide for the
army always to heed the directions of the party.
What's the answer?
1/13
Slide from Bonnie Dorr
97
Catching Cheaters
the(2) the(0) the(0) the(0) the(0) the(0) the(0)
Reference 1: The cat is on the mat
Reference 2: There is a cat on the mat
What's the unigram answer?
2/7
What's the bigram answer?
0/6
Slide from Bonnie Dorr
98
Bleu distinguishes human from machine translations
Slide from Bonnie Dorr
99
Bleu problems with sentence length
  • Candidate: of the
  • Solution: brevity penalty; prefers candidate
    translations which are the same length as one of
    the references

Reference 1: It is a guide to action that ensures
that the military will forever heed Party
commands.
Reference 2: It is the guiding principle which
guarantees the military forces always being under
the command of the Party.
Reference 3: It is the practical guide for the
army always to heed the directions of the party.
Problem: modified unigram precision is 2/2,
bigram 1/1!
Slide from Bonnie Dorr
100
BLEU Tends to Predict Human Judgments
(variant of BLEU)
slide from G. Doddington (NIST)
101
Summary
  • Intro and a little history
  • Language Similarities and Divergences
  • Four main MT Approaches
  • Transfer
  • Interlingua
  • Direct
  • Statistical
  • Evaluation

102
Classes
  • LINGUIST 139M/239M. Human and Machine
    Translation. (Martin Kay)
  • CS 224N. Natural Language Processing (Chris
    Manning)