Statistical Machine Translation: IBM Models and the Alignment Template System

Provided by: Victoria176


1
Statistical Machine Translation IBM Models and
the Alignment Template System
2
Statistical Machine Translation

Goal:
Given foreign sentence f:
Maria no dio una bofetada a la bruja verde
Find the most likely English translation e:
Maria did not slap the green witch

3
Statistical Machine Translation

Most likely English translation e is given by
e* = argmax_e P(e | f)
P(e | f) estimates the conditional probability of any e
given f

4
Statistical Machine Translation

How to estimate P(e | f)?
Noisy channel
Decompose P(e | f) into P(f | e) * P(e) / P(f)
Estimate P(f | e) and P(e) separately using
parallel corpus
Direct
Estimate P(e | f) directly using parallel corpus
(more on this later)

5
Noisy Channel Model

Translation Model
P(f | e)
How likely is f to be a translation of e?
Estimate parameters from bilingual corpus
Language Model
P(e)
How likely is e to be an English sentence?
Estimate parameters from monolingual corpus
Decoder
Given f, what is the best translation e?

6
Noisy Channel Model

Generative story
Generate e with probability p(e)
Pass e through noisy channel
Out comes f with probability p(f | e)
Translation task
Given f, deduce most likely e that produced f, or
e* = argmax_e p(e) * p(f | e)

7
Translation Model

How to model P(f | e)?
Learn parameters of P(f | e) from a bilingual
corpus S of sentence pairs <e_i, f_i>
<e_1, f_1> = <the blue witch, la bruja azul>
<e_2, f_2> = <green, verde>
<e_S, f_S> = <the witch, la bruja>

8
Translation Model

Insufficient data in parallel corpus to estimate
P(f | e) at the sentence level (Why?)
Decompose process of translating e -> f into
small steps whose probabilities can be estimated

9
Translation Model

English sentence e = e_1 ... e_l
Foreign sentence f = f_1 ... f_m
Alignment A = a_1 ... a_m, where a_j ∈ {0, ..., l}
A indicates which English word generates each
foreign word

10
Alignments

e = the blue witch
f = la bruja azul

A = 1,3,2 (intuitively good alignment)
11
Alignments

e = the blue witch
f = la bruja azul

A = 1,1,1 (intuitively bad alignment)
12
Alignments

e = the blue witch
f = la bruja azul

(illegal alignment!)
13
Alignments

Question: How many possible alignments are there
for a given e and f, where |e| = l and |f| = m?

14
Alignments

Question: How many possible alignments are there
for a given e and f, where |e| = l and |f| = m?
Answer:
Each foreign word can align with any one of the |e| =
l English words, or it can remain unaligned
Each foreign word has (l+1) choices for an
alignment, and there are |f| = m foreign words
So, there are (l+1)^m alignments for a given e
and f
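The count can be checked by brute force. This is a minimal sketch, assuming alignment value 0 stands for "unaligned"; the helper name `all_alignments` is illustrative and not part of the slides.

```python
from itertools import product

def all_alignments(l, m):
    """Return an iterator over every alignment a_1..a_m with a_j in {0, ..., l}.

    The value 0 means the foreign word is left unaligned.
    """
    return product(range(l + 1), repeat=m)

e = "the blue witch".split()   # l = 3
f = "la bruja azul".split()    # m = 3
alignments = list(all_alignments(len(e), len(f)))

# Both numbers should agree with (l+1)^m.
print(len(alignments), (len(e) + 1) ** len(f))
```

The "intuitively good" alignment 1,3,2 and the "intuitively bad" alignment 1,1,1 both appear in this enumeration; nothing in the count distinguishes them.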

15
Alignments

Question: If all alignments are equally likely,
what is the probability of any one alignment,
given e?

16
Alignments

Question: If all alignments are equally likely,
what is the probability of any one alignment,
given e?
Answer:
P(A | e) = p(|f| = m) * 1/(l+1)^m
If we assume that p(|f| = m) is uniform over all
possible values of |f|, then we can let p(|f| =
m) = C
P(A | e) = C / (l+1)^m

17
Generative Story

e = blue witch
f = bruja azul

?
How do we get from e to f?
18
IBM Model 1

Model parameters
T(f_j | e_{a_j}): translation probability of foreign
word given English word that generated it

19
IBM Model 1

Generative story
Given e
Pick m = |f|, where all lengths m are equally
probable
Pick A with probability P(A | e) = 1/(l+1)^m, since
all alignments are equally likely given l and m
Pick f_1 ... f_m with probability
P(f | A, e) = prod_{j=1..m} T(f_j | e_{a_j})
where T(f_j | e_{a_j}) is the translation
probability of f_j given the English word it is
aligned to
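The generative story multiplies the alignment probability by one translation probability per foreign word. The sketch below computes that product for the "blue witch" example; the function name `model1_joint` and the numbers in the table `T` are invented for illustration, and the length constant is ignored.

```python
def model1_joint(f, e, alignment, T):
    """P(f, A | e) under Model 1, ignoring the constant for choosing m.

    alignment[j] is the 1-based English position that generates f[j];
    0 refers to the NULL word.
    """
    l, m = len(e), len(f)
    e_with_null = ["NULL"] + e
    prob = 1.0 / (l + 1) ** m              # P(A | e): all alignments equally likely
    for f_word, a_j in zip(f, alignment):
        prob *= T[(f_word, e_with_null[a_j])]   # T(f_j | e_{a_j})
    return prob

# Hypothetical translation probabilities, keyed as (foreign, english).
T = {("bruja", "witch"): 0.8, ("azul", "blue"): 0.7,
     ("bruja", "blue"): 0.1, ("azul", "witch"): 0.1}

e = ["blue", "witch"]
f = ["bruja", "azul"]
good = model1_joint(f, e, [2, 1], T)   # bruja <- witch, azul <- blue
bad = model1_joint(f, e, [1, 2], T)    # bruja <- blue,  azul <- witch
print(good, bad)
```

Both alignments receive the same prior 1/(l+1)^m, so under these assumed values the translation table alone makes A = 2,1 the more probable one.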

20
IBM Model 1 Example

e = blue witch

21
IBM Model 1 Example

e = blue witch
f = f_1 f_2

Pick m = |f| = 2
22
IBM Model 1 Example

e = blue witch
f = f_1 f_2

Pick A = 2,1 with probability 1/(l+1)^m
23
IBM Model 1 Example

e = blue witch
f = bruja f_2

Pick f_1 = bruja with probability t(bruja | witch)
24
IBM Model 1 Example

e = blue witch
f = bruja azul

Pick f_2 = azul with probability t(azul | blue)
25
IBM Model 1 Parameter Estimation

How does this generative story help us to
estimate P(f | e) from the data?
Since the model for P(f | e) contains the parameter
T(f_j | e_{a_j}), we first need to estimate
T(f_j | e_{a_j})

26
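IBM Model 1 Parameter Estimation

How to estimate T(f_j | e_{a_j}) from the data?
If we had the data and the alignments A, along
with P(A | f, e), then we could estimate T(f_j | e_{a_j})
using expected counts as follows
T(f | e) = Count(f, e) / sum_{f'} Count(f', e)
where Count(f, e) is the expected number of times f is
aligned to e, with each alignment A weighted by P(A | f, e)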

27
IBM Model 1 Parameter Estimation

How to estimate P(A | f, e)?
P(A | f, e) = P(A, f | e) / P(f | e)
But P(f | e) = sum_A P(A, f | e)
So we need to compute P(A, f | e)
This is given by the Model 1 generative story

28
IBM Model 1 Example

e = the blue witch
f = la bruja azul

P(A | f, e) = P(f, A | e) / P(f | e)
29
IBM Model 1 Parameter Estimation

So, in order to estimate P(f | e), we first need to
estimate the model parameter
T(f_j | e_{a_j})
In order to compute T(f_j | e_{a_j}), we need to
estimate P(A | f, e)
And in order to compute P(A | f, e), we need to
estimate T(f_j | e_{a_j})

30
IBM Model 1 Parameter Estimation

Training data is a set of pairs <e_i, f_i>
Log likelihood of training data given model
parameters is
sum_i log P(f_i | e_i) = sum_i log sum_A P(A, f_i | e_i)
To maximize log likelihood of training data given
model parameters, use EM
hidden variable: alignments A
model parameters: translation probabilities T

31
EM

Initialize model parameters T(f | e)
Calculate alignment probabilities P(A | f, e) under
current values of T(f | e)
Calculate expected counts from alignment
probabilities
Re-estimate T(f | e) from these expected counts
Repeat until log likelihood of training data
converges to a maximum
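The EM loop can be sketched in a few lines for Model 1. Because each foreign word is generated independently under Model 1, the posterior over alignments factorizes per word, so expected counts are collected without enumerating all (l+1)^m alignments. This is an illustrative sketch, not the training code behind the slides: the name `train_model1` is assumed, and a fixed iteration count stands in for the log-likelihood convergence test.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """EM for IBM Model 1. corpus is a list of (english_words, foreign_words)."""
    f_vocab = {f for _, fs in corpus for f in fs}
    # Initialize T(f | e) uniformly over the foreign vocabulary.
    T = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of f being aligned to e
        total = defaultdict(float)   # expected count of e generating any word
        for es, fs in corpus:
            es = ["NULL"] + es
            for f in fs:
                # E-step: posterior probability that f is aligned to each e.
                z = sum(T[(f, e)] for e in es)
                for e in es:
                    p = T[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # M-step: re-estimate T(f | e) from the expected counts.
        for (f, e), c in count.items():
            T[(f, e)] = c / total[e]
    return T

corpus = [("the blue witch".split(), "la bruja azul".split()),
          ("green".split(), "verde".split()),
          ("the witch".split(), "la bruja".split())]
T = train_model1(corpus)
print(T[("azul", "blue")], T[("la", "blue")])
```

On this three-sentence corpus, `azul` becomes the preferred translation of `blue`. However, `la` and `bruja` always co-occur with both `the` and `witch`, so EM has no evidence to tell them apart.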

32
IBM Model 2

Model parameters
T(f_j | e_{a_j}): translation probability of foreign
word f_j given English word e_{a_j} that generated it
d(i | j, l, m): distortion probability, or
probability that f_j is aligned to e_i, given l
and m

33
IBM Model 3

Model parameters
T(f_j | e_{a_j}): translation probability of foreign
word f_j given English word e_{a_j} that generated it
r(j | i, l, m): reverse distortion probability, or
probability of position f_j, given its alignment
to e_i, l, and m
n(e_i): fertility of word e_i, or number of
foreign words aligned to e_i
p_1: probability of generating a foreign word by
alignment with the NULL English word

34
IBM Model 3

Generative Story
Choose fertilities for each English word
Insert spurious words according to probability of
being aligned to the NULL English word
Translate English words -> foreign words
Reorder words according to reverse distortion
probabilities

35
IBM Model 3 Example

Consider the following example from Knight
1999
Maria did not slap the green witch

36
IBM Model 3 Example

Maria did not slap the green witch
Maria not slap slap slap the green witch
Choose fertilities: phi(Maria) = 1

37
IBM Model 3 Example

Maria did not slap the green witch
Maria not slap slap slap the green witch
Maria not slap slap slap NULL the green witch
Insert spurious words: p(NULL)

38
IBM Model 3 Example

Maria did not slap the green witch
Maria not slap slap slap the green witch
Maria not slap slap slap NULL the green witch
Maria no dio una bofetada a la verde bruja
Translate words: t(verde | green)

39
IBM Model 3 Example

Maria no dio una bofetada a la verde bruja
Maria no dio una bofetada a la bruja verde
Reorder words

40
IBM Model 3

For models 1 and 2
We can compute exact EM updates
For models 3 and 4
Exact EM updates cannot be efficiently computed
Use best alignments from previous iterations to
initialize each successive model
Explore only the subspace of potential alignments
that lies within same neighborhood as the initial
alignments

41
IBM Model 4

Model parameters
Same as model 3, except uses more complicated
model of reordering (for details, see Brown et
al. 1993)

42
Language Model

Given an English sentence e_1, e_2, ..., e_l
P(e_1, e_2, ..., e_l) =
P(e_1)
* P(e_2 | e_1)
* ...
* P(e_l | e_1, e_2, ..., e_{l-1})
N-gram model
Assume P(e_i) depends only on the N-1 previous
words, so that P(e_i | e_1, e_2, ..., e_{i-1}) =
P(e_i | e_{i-N+1}, ..., e_{i-1})

43
N = 2: Bigram Language Model

P(Maria did not slap the green witch) =
P(Maria | START)
* P(did | Maria)
* P(not | did)
* ...
* P(END | witch)
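A bigram model of this kind can be estimated by relative frequency from a monolingual corpus. The sketch below is a minimal, unsmoothed version; the three-sentence corpus and the names `train_bigram` and `sentence_prob` are made up for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram model: P(w | prev) = count(prev, w) / count(prev)."""
    bigrams, contexts = Counter(), Counter()
    for s in sentences:
        tokens = ["START"] + s.split() + ["END"]
        for prev, w in zip(tokens, tokens[1:]):
            bigrams[(prev, w)] += 1
            contexts[prev] += 1

    def p(w, prev):
        # Unseen contexts get probability 0 (no smoothing in this sketch).
        return bigrams[(prev, w)] / contexts[prev] if contexts[prev] else 0.0
    return p

def sentence_prob(p, sentence):
    """Product of bigram probabilities, including START and END."""
    tokens = ["START"] + sentence.split() + ["END"]
    prob = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        prob *= p(w, prev)
    return prob

# Toy monolingual corpus, invented for illustration.
corpus = ["Maria did not slap the green witch",
          "Maria did not slap the witch",
          "the green witch did not slap Maria"]
p = train_bigram(corpus)
prob = sentence_prob(p, "Maria did not slap the green witch")
print(prob)
```

Without smoothing, any sentence containing an unseen bigram gets probability zero, which is why practical language models add smoothing or back-off.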

44
Word-Based MT

Word: fundamental unit of translation
Weaknesses
no explicit modeling of word context
word-by-word translation may not accurately
convey meaning of phrase
il ne va pas -> he does not go
IBM models prevent alignment of foreign words
with >1 English word
aller -> to go

45
Phrase-Based MT

Phrase: basic unit of translation
Strengths
explicit modeling of word context
captures local reorderings, local dependencies

46
Example Rules

English: he does not go
Foreign: il ne va pas
ne va pas -> does not go

47
Alignment Template System

Och and Ney, 2004
Alignment template
Pair of source and target language phrases
Word alignment among words within those phrases
Formally, an alignment template is a triple
(F, E, A)
F: words on foreign side
E: words on English side
A: alignments among words on the foreign and
English sides

48
Estimating P(e | f)

Noisy channel
Decompose P(e | f) into P(f | e) and P(e)
Estimate P(f | e) and P(e) separately
Direct
Estimate P(e | f) directly from training corpus
Use log-linear model

49
Log-linear Models for MT

Compute best translation as follows
e* = argmax_e sum_i λ_i h_i(e, f)
where h_i are the feature functions and λ_i are the
model parameters
Typical feature functions include
phrase translation probabilities
lexical translation probabilities
language model probability
reordering model
word penalty

50
Log-linear Models for MT

Noisy Channel model is a special case of
Log-Linear model where
h_1 = log P(f | e), λ_1 = 1
h_2 = log P(e), λ_2 = 1
Then
e* = argmax_e [log P(f | e) + log P(e)] = argmax_e P(f | e) * P(e)
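The special case can be checked numerically: with two log-probability features and both weights set to 1, the log-linear argmax picks the same candidate as the noisy-channel product. The candidate sentences, the probabilities and the feature names `tm` and `lm` below are invented for illustration.

```python
from math import log

def loglinear_score(features, weights):
    """Weighted sum of feature values: sum_i lambda_i * h_i(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical candidates for one foreign sentence, with made-up
# translation-model (tm) and language-model (lm) probabilities.
candidates = {
    "Maria did not slap the green witch": {"tm": 0.02, "lm": 0.010},
    "Maria not slap the witch green":     {"tm": 0.03, "lm": 0.001},
}

def best(cands, weights):
    """Candidate with the highest log-linear score, using h_i = log(probability)."""
    return max(cands, key=lambda e: loglinear_score(
        {name: log(p) for name, p in cands[e].items()}, weights))

# Noisy channel: maximize P(f | e) * P(e) directly.
noisy_channel_best = max(
    candidates, key=lambda e: candidates[e]["tm"] * candidates[e]["lm"])

print(best(candidates, {"tm": 1.0, "lm": 1.0}) == noisy_channel_best)
```

Changing the weights changes the decision: with the language model weight set to 0, the second candidate wins on translation probability alone. This is the extra freedom the log-linear formulation offers.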

51
Alignment Template System

Word-align training corpus
Extract phrase pairs
Assign probabilities to phrase pairs
Train language model
Decode

52
Word-Align Training Corpus

Run GIZA++ word alignment in normal direction,
from e -> f

[Figure: alignment matrix with foreign words il, ne, va, pas as columns and English words he, does, not, go as rows]
53
Word-Align Training Corpus

Run GIZA++ word alignment in inverse direction,
from f -> e

[Figure: alignment matrix with foreign words il, ne, va, pas as columns and English words he, does, not, go as rows]
54
Alignment Symmetrization

Merge bi-directional alignments using some
heuristic between intersection and union
Question: What is the tradeoff in precision/recall
using intersection/union?
Here, we use union

[Figure: alignment matrix with foreign words il, ne, va, pas as columns and English words he, does, not, go as rows]
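The two extremes of symmetrization are plain set operations on alignment links. In the sketch below the two directional alignments are assumptions, since the original matrices did not survive in this transcript; they were chosen so that the union links ne and pas to both does and not.

```python
def symmetrize(e2f, f2e, method="union"):
    """Merge two directional alignments given as sets of (english, foreign) index pairs."""
    if method == "intersection":
        return e2f & f2e      # fewer links, higher precision
    if method == "union":
        return e2f | f2e      # more links, higher recall
    raise ValueError(f"unknown method: {method}")

english = ["he", "does", "not", "go"]
foreign = ["il", "ne", "va", "pas"]

# Assumed directional alignments (0-based indices), for illustration only.
# e -> f: every foreign word is linked to exactly one English word.
e2f = {(0, 0), (1, 1), (3, 2), (2, 3)}
# f -> e: every English word is linked to exactly one foreign word.
f2e = {(0, 0), (1, 3), (2, 1), (3, 2)}

intersection = symmetrize(e2f, f2e, "intersection")
union = symmetrize(e2f, f2e, "union")
for i, j in sorted(union):
    print(english[i], "-", foreign[j])
```

The intersection keeps only the links both directions agree on (he-il and go-va here). The union adds the many-to-many links that neither directional model can express alone.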
55
Alignment Template System

Word-align training corpus
Extract phrase pairs
Assign probabilities to phrase pairs
Train language model
Decode

56
Extract phrase pairs

Extract all phrase pairs (E,F) consistent with
word alignments, where consistency is defined as
follows
(1) Each word in English phrase is aligned only
with words in the foreign phrase
(2) Each word in foreign phrase is aligned only
with words in the English phrase
Phrase pairs must consist of contiguous words in
each language

[Figure: alignment matrix with foreign words il, ne, va, pas as columns and English words he, does, not, go as rows]
57
Extract phrase pairs

Question: Why is the illustrated phrase pair
inconsistent with the alignment matrix?

[Figure: alignment matrix with foreign words il, ne, va, pas as columns and English words he, does, not, go as rows]
58
Extract phrase pairs

Question: Why is the illustrated phrase pair
inconsistent with the alignment matrix?
Answer: ne is aligned with not, which is
outside the phrase pair; also, does is aligned
with pas, which is outside the phrase pair

[Figure: alignment matrix with foreign words il, ne, va, pas as columns and English words he, does, not, go as rows]
59
Extract phrase pairs

<he, il>

[Figure: alignment matrix with foreign words il, ne, va, pas as columns and English words he, does, not, go as rows]
60
Extract phrase pairs

<he, il>
<go, va>

[Figure: alignment matrix with foreign words il, ne, va, pas as columns and English words he, does, not, go as rows]
61
Extract phrase pairs

<he, il>
<go, va>
<does not go, ne va pas>

[Figure: alignment matrix with foreign words il, ne, va, pas as columns and English words he, does, not, go as rows]
62
Extract phrase pairs

<he, il>
<go, va>
<does not go, ne va pas>
<he does not go, il ne va pas>

[Figure: alignment matrix with foreign words il, ne, va, pas as columns and English words he, does, not, go as rows]
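The consistency test can be sketched as a brute-force search over contiguous spans. The alignment links below are an assumption, chosen to agree with the earlier answer that ne is aligned with not and does with pas. The requirement that a pair contain at least one link is a common convention added here, not something stated on the slides.

```python
def extract_phrases(e_words, f_words, alignment, max_len=4):
    """Return all phrase pairs (E, F) consistent with the word alignment.

    alignment is a set of (english_index, foreign_index) pairs, 0-based.
    """
    pairs = set()
    for e1 in range(len(e_words)):
        for e2 in range(e1, min(len(e_words), e1 + max_len)):
            for f1 in range(len(f_words)):
                for f2 in range(f1, min(len(f_words), f1 + max_len)):
                    inside = [(i, j) for i, j in alignment
                              if e1 <= i <= e2 and f1 <= j <= f2]
                    # A link violates consistency if exactly one of its
                    # two ends lies inside the candidate phrase pair.
                    crossing = [(i, j) for i, j in alignment
                                if (e1 <= i <= e2) != (f1 <= j <= f2)]
                    if inside and not crossing:
                        pairs.add((" ".join(e_words[e1:e2 + 1]),
                                   " ".join(f_words[f1:f2 + 1])))
    return pairs

english = ["he", "does", "not", "go"]
foreign = ["il", "ne", "va", "pas"]
# Assumed union alignment: he-il, does-ne, does-pas, not-ne, not-pas, go-va.
alignment = {(0, 0), (1, 1), (1, 3), (2, 1), (2, 3), (3, 2)}

phrases = extract_phrases(english, foreign, alignment)
for e_phrase, f_phrase in sorted(phrases):
    print(f"<{e_phrase}, {f_phrase}>")
```

Under this assumed alignment the search yields exactly the four phrase pairs listed on the slide. A pair such as <does not, ne va pas> is rejected because va is linked to go, which lies outside the English phrase.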
63
Alignment Template System

Word-align training corpus
Extract phrase pairs
Assign probabilities to phrase pairs
Train language model
Decode

64
Probability Assignment

Use relative frequency estimation
P(F, E, A | F) = Count(F, E, A) / sum_{E', A'} Count(F, E', A')
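Relative frequency estimation is a ratio of counts over the extracted phrase pairs. The sketch below simplifies by dropping the internal alignment A from each template; keeping it would only mean counting triples instead of pairs. The extracted list is hypothetical.

```python
from collections import Counter

def relative_frequencies(extracted):
    """Estimate P(E | F) = Count(F, E) / Count(F) from a list of (E, F) phrase pairs."""
    pair_counts = Counter(extracted)
    f_counts = Counter(f for _, f in extracted)
    return {(e, f): c / f_counts[f] for (e, f), c in pair_counts.items()}

# Hypothetical output of phrase extraction over a whole corpus.
extracted = [("does not go", "ne va pas"),
             ("does not go", "ne va pas"),
             ("is not going", "ne va pas"),
             ("he", "il")]
probs = relative_frequencies(extracted)
for (e, f), p in sorted(probs.items()):
    print(f"P({e} | {f}) = {p:.3f}")
```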

65
Alignment Template System

Word-align training corpus
Extract phrase pairs
Assign probabilities to phrase pairs
Train language model
Decode

66
Language Model

Use N-gram language model P(e), just as for
word-based MT

67
Alignment Template System

Word-align training corpus
Extract phrase pairs
Assign probabilities to phrase pairs
Train language model
Decode

68
Decode

Beam search
State space
set of possible partial translation hypotheses
Start state
initial empty translation of foreign input
Expansion operation
extend existing English hypothesis one phrase at
a time, by translating a phrase in foreign
sentence into English
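The search can be sketched as a small stack decoder: hypotheses are grouped by the number of foreign words covered, each stack is pruned to a beam, and every surviving hypothesis is extended by one phrase. This is a simplified illustration, not the Alignment Template or Pharaoh decoder. The phrase table is invented, there is no language model, and the score is just the phrase log-probability minus an assumed distortion penalty equal to the jump distance.

```python
import heapq
from math import log

def beam_decode(foreign, phrase_table, beam_size=5, max_phrase_len=3):
    """Beam search over hypotheses (score, covered_positions, last_end, english)."""
    n = len(foreign)
    stacks = [[] for _ in range(n + 1)]           # stacks[k]: k foreign words covered
    stacks[0].append((0.0, frozenset(), 0, ()))   # start state: empty translation
    for k in range(n):
        # Beam pruning: expand only the best hypotheses in this stack.
        beam = heapq.nlargest(beam_size, stacks[k], key=lambda h: h[0])
        for score, covered, last_end, english in beam:
            for i in range(n):
                for j in range(i + 1, min(n, i + max_phrase_len) + 1):
                    if any(pos in covered for pos in range(i, j)):
                        continue                  # span overlaps covered words
                    source = tuple(foreign[i:j])
                    for target, prob in phrase_table.get(source, []):
                        # Phrase log-probability minus jump-distance penalty.
                        new_score = score + log(prob) - abs(i - last_end)
                        new_covered = covered | frozenset(range(i, j))
                        stacks[len(new_covered)].append(
                            (new_score, new_covered, j,
                             english + tuple(target.split())))
    if not stacks[n]:
        return None                               # no full-coverage hypothesis found
    return " ".join(max(stacks[n], key=lambda h: h[0])[3])

# Hypothetical phrase table: foreign phrase -> [(English phrase, probability)].
phrase_table = {
    ("Maria",): [("Mary", 0.9)],
    ("no",): [("did not", 0.7)],
    ("dio", "una", "bofetada"): [("slap", 0.8)],
    ("a", "la"): [("the", 0.9)],
    ("bruja", "verde"): [("green witch", 0.6)],
}
f = "Maria no dio una bofetada a la bruja verde".split()
translation = beam_decode(f, phrase_table)
print(translation)
```

Marking foreign words as covered corresponds to the `covered` set. A fuller decoder would add the language model score when extending a hypothesis and would recombine hypotheses that share the same coverage and recent English words.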

69
Decoder Example

Start
f: Maria no dio una bofetada a la bruja verde
e: (empty)
Expand English translation
translate Maria -> Mary or bruja -> witch
mark foreign words as covered
update probabilities

70
Decoder Example
Example from Koehn 2003
71
BLEU MT Evaluation Metric

BLEU measures n-gram precision against a set of k
reference English translations
What percentage of n-grams (where n ranges from 1
through 4, typically) in the MT English output
are also found in a reference translation?
Brevity penalty: penalizes English translations
with fewer words than the reference translations
Why is this metric so widely used?
Correlates surprisingly well with human judgment
of machine-generated translations
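The two ingredients, clipped n-gram precision and the brevity penalty, can be sketched for a single sentence as follows. This is a simplified illustration: BLEU proper is computed over a whole test corpus, and the function names and the default `max_n` are assumptions of this sketch.

```python
from collections import Counter
from math import exp, log

def ngrams(tokens, n):
    """Multiset of all n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in a reference, with clipped counts."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, c in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], c)
    clipped = sum(min(c, max_ref[gram]) for gram, c in cand.items())
    return clipped / sum(cand.values())

def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU: brevity penalty times geometric mean of precisions."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    c = len(candidate)
    # Reference length closest to the candidate length (shorter one on ties).
    r = len(min(references, key=lambda ref: (abs(len(ref) - c), len(ref))))
    brevity_penalty = 1.0 if c > r else exp(1 - r / c)
    return brevity_penalty * exp(sum(log(p) for p in precisions) / max_n)

reference = "Maria did not slap the green witch".split()
print(bleu("Maria did not slap the witch".split(), [reference]))
```

Clipping prevents a candidate from being rewarded for repeating a matching word. Without smoothing, a single sentence with no matching n-gram at some order scores zero, which is one reason the metric is normally reported at the corpus level.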

72
References

Brown et al. 1990. A Statistical Approach to Machine Translation.
Brown et al. 1993. The Mathematics of Statistical Machine Translation.
Collins 2003. Lecture Notes from 6.891 Fall 2003: Machine Learning Approaches for Natural Language Processing.
Knight 1999. A Statistical MT Workbook.
Knight and Koehn 2004. A Statistical Machine Translation Tutorial.
Koehn, Och and Marcu 2003. A Phrase-Based Statistical Machine Translation System.
Koehn 2003. Pharaoh: A Phrase-Based Decoder.
Och and Ney 2004. The Alignment Template System.
Och and Ney 2003. Discriminative Training and Maximum Entropy Models for Statistical Machine Translation.