1
Statistical Machine Translation: IBM Models and
the Alignment Template System
2
Statistical Machine Translation
  • Goal:
  • Given foreign sentence f:
  • Maria no dio una bofetada a la bruja verde
  • Find the most likely English translation e:
  • Maria did not slap the green witch

3
Statistical Machine Translation
  • The most likely English translation e is given by:
    ê = argmax_e P(e|f)
  • P(e|f) estimates the conditional probability of any e
    given f

4
Statistical Machine Translation
  • How to estimate P(e|f)?
  • Noisy channel:
  • Decompose P(e|f) into P(f|e) P(e) / P(f)
  • Estimate P(f|e) and P(e) separately using
    parallel corpus
  • Direct:
  • Estimate P(e|f) directly using parallel corpus
    (more on this later)
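
As a worked equation, the noisy-channel decomposition is a standard Bayes-rule step; the denominator P(f) is constant for a fixed input f, so it drops out of the argmax:

  \hat{e} = \arg\max_e P(e \mid f)
          = \arg\max_e \frac{P(f \mid e)\, P(e)}{P(f)}
          = \arg\max_e P(f \mid e)\, P(e)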

5
Noisy Channel Model
  • Translation Model
  • P(f|e)
  • How likely is f to be a translation of e?
  • Estimate parameters from bilingual corpus
  • Language Model
  • P(e)
  • How likely is e to be an English sentence?
  • Estimate parameters from monolingual corpus
  • Decoder
  • Given f, what is the best translation e?

6
Noisy Channel Model
  • Generative story:
  • Generate e with probability P(e)
  • Pass e through the noisy channel
  • Out comes f with probability P(f|e)
  • Translation task:
  • Given f, deduce the most likely e that produced f, or:
    ê = argmax_e P(e) P(f|e)
7
Translation Model
  • How to model P(f|e)?
  • Learn parameters of P(f|e) from a bilingual
    corpus S of sentence pairs <e_i, f_i>:
  • <e_1, f_1> = <the blue witch, la bruja azul>
  • <e_2, f_2> = <green, verde>
  • <e_S, f_S> = <the witch, la bruja>

8
Translation Model
  • There is insufficient data in a parallel corpus to
    estimate P(f|e) at the sentence level (Why?)
  • Decompose the process of translating e -> f into
    small steps whose probabilities can be estimated

9
Translation Model
  • English sentence e = e_1 ... e_l
  • Foreign sentence f = f_1 ... f_m
  • Alignment A = a_1 ... a_m, where a_j ∈ {0, ..., l}
  • A indicates which English word generates each
    foreign word

10
Alignments
  • e = the blue witch
  • f = la bruja azul

A = (1, 3, 2) (intuitively good alignment)
11
Alignments
  • e = the blue witch
  • f = la bruja azul

A = (1, 1, 1) (intuitively bad alignment)
12
Alignments
  • e = the blue witch
  • f = la bruja azul

(illegal alignment! each foreign word must align to exactly one English position, or to NULL)
13
Alignments
  • Question: how many possible alignments are there
    for a given e and f, where |e| = l and |f| = m?

14
Alignments
  • Question: how many possible alignments are there
    for a given e and f, where |e| = l and |f| = m?
  • Answer:
  • Each foreign word can align with any one of the
    |e| = l English words, or it can remain unaligned
  • Each foreign word has (l + 1) choices for an
    alignment, and there are |f| = m foreign words
  • So, there are (l+1)^m alignments for a given e
    and f
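
A quick sanity check of this count, as a minimal sketch that simply enumerates every alignment vector for toy lengths:

  from itertools import product

  def count_alignments(l, m):
      # Enumerate all vectors a_1..a_m, where each a_j points to an
      # English position 1..l or to 0 (the NULL / unaligned case).
      alignments = list(product(range(l + 1), repeat=m))
      assert len(alignments) == (l + 1) ** m
      return len(alignments)

  print(count_alignments(3, 3))  # |e| = 3, |f| = 3 -> 64 alignments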

15
Alignments
  • Question: If all alignments are equally likely,
    what is the probability of any one alignment,
    given e?

16
Alignments
  • Question: If all alignments are equally likely,
    what is the probability of any one alignment,
    given e?
  • Answer:
  • P(A|e) = p(|f| = m) × 1/(l+1)^m
  • If we assume that p(|f| = m) is uniform over all
    possible values of |f|, then we can let
    p(|f| = m) = C
  • P(A|e) = C/(l+1)^m

17
Generative Story
  • e = blue witch
  • f = bruja azul

How do we get from e to f?
18
IBM Model 1
  • Model parameters:
  • T(f_j | e_{a_j}) = translation probability of a foreign
    word given the English word that generated it

19
IBM Model 1
  • Generative story:
  • Given e:
  • Pick m = |f|, where all lengths m are equally
    probable
  • Pick A with probability P(A|e) = 1/(l+1)^m, since
    all alignments are equally likely given l and m
  • Pick f_1 ... f_m with probability
    P(f | A, e) = ∏_{j=1..m} T(f_j | e_{a_j})
  • where T(f_j | e_{a_j}) is the translation
    probability of f_j given the English word it is
    aligned to
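
The joint probability P(f, A | e) assigned by this story can be written out directly. A minimal sketch, assuming a translation table t stored as a dict (all names here are illustrative):

  def model1_joint_prob(f, a, e, t, length_prob=1.0):
      # P(f, A | e) under IBM Model 1: uniform length, uniform
      # alignment 1/(l+1)^m, times per-word translation probabilities.
      # e = [NULL, e_1, ..., e_l]; a[j] in 0..l picks the generator of f[j].
      l, m = len(e) - 1, len(f)
      prob = length_prob / (l + 1) ** m
      for j, aj in enumerate(a):
          prob *= t[(f[j], e[aj])]
      return prob

  e = ["NULL", "blue", "witch"]
  f = ["bruja", "azul"]
  t = {("bruja", "witch"): 0.8, ("azul", "blue"): 0.7}
  print(model1_joint_prob(f, [2, 1], e, t))  # alignment A = (2, 1)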

20
IBM Model 1 Example
  • e = blue witch

21
IBM Model 1 Example
  • e = blue witch
  • f = f1 f2

Pick m = |f| = 2
22
IBM Model 1 Example
  • e = blue witch
  • f = f1 f2

Pick A = (2, 1) with probability 1/(l+1)^m
23
IBM Model 1 Example
  • e = blue witch
  • f = bruja f2

Pick f1 = bruja with probability T(bruja|witch)
24
IBM Model 1 Example
  • e = blue witch
  • f = bruja azul

Pick f2 = azul with probability T(azul|blue)
25
IBM Model 1 Parameter Estimation
  • How does this generative story help us to
    estimate P(f|e) from the data?
  • Since the model for P(f|e) contains the parameter
    T(f_j | e_{a_j}), we first need to estimate
    T(f_j | e_{a_j})

26
IBM Model 1 Parameter Estimation
  • How to estimate T(f_j | e_{a_j}) from the data?
  • If we had the data and the alignments A, along
    with P(A|f,e), then we could estimate T(f_j | e_{a_j})
    using expected counts as follows:
    T(f | e) = E[Count(f, e)] / Σ_f' E[Count(f', e)]

27
IBM Model 1 Parameter Estimation
  • How to estimate P(A|f,e)?
  • P(A|f,e) = P(A, f|e) / P(f|e)
  • But P(f|e) = Σ_A P(A, f|e)
  • So we need to compute P(A, f|e)
  • This is given by the Model 1 generative story

28
IBM Model 1 Example
  • e = the blue witch
  • f = la bruja azul

P(A|f,e) = P(f, A|e) / P(f|e)
29
IBM Model 1 Parameter Estimation
  • So, in order to estimate P(f|e), we first need to
    estimate the model parameter T(f_j | e_{a_j})
  • In order to compute T(f_j | e_{a_j}), we need to
    estimate P(A|f,e)
  • And in order to compute P(A|f,e), we need to
    estimate T(f_j | e_{a_j})

30
IBM Model 1 Parameter Estimation
  • Training data is a set of pairs <e_i, f_i>
  • The log likelihood of the training data given the
    model parameters is:
    L(T) = Σ_i log P(f_i | e_i)
  • To maximize the log likelihood of the training data
    given the model parameters, use EM:
  • hidden variable: alignments A
  • model parameters: translation probabilities T

31
EM
  • Initialize model parameters T(f|e)
  • Calculate alignment probabilities P(A|f,e) under
    the current values of T(f|e)
  • Calculate expected counts from the alignment
    probabilities
  • Re-estimate T(f|e) from these expected counts
  • Repeat until the log likelihood of the training data
    converges to a maximum
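
A minimal EM sketch for Model 1 along these lines. It is illustrative rather than the exact formulation on the slides: it uses the standard efficient E-step for Model 1, which accumulates per-word posteriors instead of enumerating all (l+1)^m alignments:

  from collections import defaultdict

  def train_model1(corpus, iterations=10):
      # corpus: list of (e, f) sentence pairs, each a list of words.
      # Initialize T(f|e) uniformly over the foreign vocabulary.
      f_vocab = {fw for _, f in corpus for fw in f}
      t = defaultdict(lambda: 1.0 / len(f_vocab))
      for _ in range(iterations):
          count = defaultdict(float)   # expected counts c(f, e)
          total = defaultdict(float)   # expected counts c(e)
          for e, f in corpus:
              e = ["NULL"] + e         # position 0 is the NULL word
              for fw in f:
                  # E-step: P(a_j = i | f, e) is proportional to t(fw | e_i)
                  norm = sum(t[(fw, ew)] for ew in e)
                  for ew in e:
                      c = t[(fw, ew)] / norm
                      count[(fw, ew)] += c
                      total[ew] += c
          # M-step: re-estimate t(f|e) from the expected counts
          for fw, ew in count:
              t[(fw, ew)] = count[(fw, ew)] / total[ew]
      return t

  corpus = [(["the", "blue", "witch"], ["la", "bruja", "azul"]),
            (["green"], ["verde"]),
            (["the", "witch"], ["la", "bruja"])]
  t = train_model1(corpus)
  print(round(t[("verde", "green")], 3))  # approaches 1.0: green co-occurs only with verde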

32
IBM Model 2
  • Model parameters:
  • T(f_j | e_{a_j}) = translation probability of foreign
    word f_j given the English word e_{a_j} that generated it
  • d(i | j, l, m) = distortion probability, or the
    probability that f_j is aligned to e_i, given l
    and m

33
IBM Model 3
  • Model parameters:
  • T(f_j | e_{a_j}) = translation probability of foreign
    word f_j given the English word e_{a_j} that generated it
  • r(j | i, l, m) = reverse distortion probability, or the
    probability of position j for f_j, given its alignment
    to e_i, l, and m
  • n(e_i) = fertility of word e_i, or the number of
    foreign words aligned to e_i
  • p1 = probability of generating a foreign word by
    alignment with the NULL English word

34
IBM Model 3
  • Generative Story
  • Choose fertilities for each English word
  • Insert spurious words according to probability of
    being aligned to the NULL English word
  • Translate English words -> foreign words
  • Reorder words according to reverse distortion
    probabilities

35
IBM Model 3 Example
  • Consider the following example from Knight 1999:
  • Maria did not slap the green witch

36
IBM Model 3 Example
  • Maria did not slap the green witch
  • Maria not slap slap slap the green witch
  • Choose fertilities: φ(Maria) = 1, φ(did) = 0,
    φ(slap) = 3, ...

37
IBM Model 3 Example
  • Maria did not slap the green witch
  • Maria not slap slap slap the green witch
  • Maria not slap slap slap NULL the green witch
  • Insert spurious words: NULL inserted with probability p1

38
IBM Model 3 Example
  • Maria did not slap the green witch
  • Maria not slap slap slap the green witch
  • Maria not slap slap slap NULL the green witch
  • Maria no dio una bofetada a la verde bruja
  • Translate words: T(verde|green)

39
IBM Model 3 Example
  • Maria no dio una bofetada a la verde bruja
  • Maria no dio una bofetada a la bruja verde
  • Reorder words

40
IBM Model 3
  • For models 1 and 2:
  • We can compute exact EM updates
  • For models 3 and 4:
  • Exact EM updates cannot be efficiently computed
  • Use the best alignments from previous iterations to
    initialize each successive model
  • Explore only the subspace of potential alignments
    that lies within the same neighborhood as the initial
    alignments

41
IBM Model 4
  • Model parameters
  • Same as Model 3, except it uses a more complicated
    model of reordering (for details, see Brown et
    al. 1993)

42
Language Model
  • Given an English sentence e_1, e_2 ... e_l:
    P(e_1, e_2 ... e_l) =
    P(e_1) ×
    P(e_2 | e_1) × ... ×
    P(e_l | e_1, e_2 ... e_{l-1})
  • N-gram model:
  • Assume P(e_i) depends only on the N-1 previous
    words, so that P(e_i | e_1, e_2, ... e_{i-1}) =
    P(e_i | e_{i-N+1}, ... e_{i-1})

43
N = 2 Bigram Language Model
  • P(Maria did not slap the green witch) =
  • P(Maria | START) ×
  • P(did | Maria) ×
  • P(not | did) × ... ×
  • P(END | witch)
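
A minimal bigram-model sketch with unsmoothed relative-frequency estimates; the START/END markers play the role of START and END on the slide:

  from collections import defaultdict

  def train_bigram_lm(sentences):
      # Unsmoothed relative-frequency bigram estimates.
      bigram, unigram = defaultdict(int), defaultdict(int)
      for s in sentences:
          words = ["START"] + s + ["END"]
          for prev, cur in zip(words, words[1:]):
              bigram[(prev, cur)] += 1
              unigram[prev] += 1
      def p(cur, prev):
          return bigram[(prev, cur)] / unigram[prev] if unigram[prev] else 0.0
      return p

  sentences = [["Maria", "did", "not", "slap", "the", "green", "witch"]]
  p = train_bigram_lm(sentences)
  print(p("not", "did"))  # P(not | did) = 1.0 on this tiny corpus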

44
Word-Based MT
  • Word = fundamental unit of translation
  • Weaknesses:
  • no explicit modeling of word context
  • word-by-word translation may not accurately
    convey the meaning of a phrase:
  • il ne va pas -> he does not go
  • IBM models prevent alignment of foreign words
    with >1 English word:
  • aller -> to go

45
Phrase-Based MT
  • Phrase = basic unit of translation
  • Strengths:
  • explicit modeling of word context
  • captures local reorderings, local dependencies

46
Example Rules
  • English: he does not go
  • Foreign: il ne va pas
  • ne va pas -> does not go

47
Alignment Template System
  • Och and Ney, 2004
  • Alignment template:
  • Pair of source- and target-language phrases
  • Word alignment among the words within those phrases
  • Formally, an alignment template is a triple
    (F, E, A):
  • F = words on the foreign side
  • E = words on the English side
  • A = alignments among words on the foreign and
    English sides

48
Estimating P(ef)
  • Noisy channel:
  • Decompose P(e|f) into P(f|e) and P(e)
  • Estimate P(f|e) and P(e) separately
  • Direct:
  • Estimate P(e|f) directly from the training corpus
  • Use a log-linear model

49
Log-linear Models for MT
  • Compute the best translation as follows:
    ê = argmax_e Σ_i λ_i h_i(e, f)
  • where h_i are the feature functions and λ_i are the
    model parameters
  • Typical feature functions include:
  • phrase translation probabilities
  • lexical translation probabilities
  • language model probability
  • reordering model
  • word penalty
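
A minimal sketch of this scoring; the feature functions below are hypothetical stand-ins (assumed log-probabilities and a length penalty), not the actual features of any particular system. Decoding picks the candidate e maximizing the score:

  import math

  def loglinear_score(e, f, features, weights):
      # Weighted sum of feature functions h_i(e, f).
      return sum(w * h(e, f) for h, w in zip(features, weights))

  # Hypothetical feature functions for illustration only:
  features = [lambda e, f: math.log(0.4),    # phrase translation score
              lambda e, f: math.log(0.2),    # language model score
              lambda e, f: -len(e.split())]  # word penalty
  weights = [1.0, 1.0, 0.5]
  print(loglinear_score("he does not go", "il ne va pas", features, weights))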

50
Log-linear Models for MT
  • The Noisy Channel model is a special case of the
    Log-Linear model where:
  • h1 = log P(f|e), λ1 = 1
  • h2 = log P(e), λ2 = 1
  • Then:
    ê = argmax_e [h1 + h2] = argmax_e P(f|e) P(e)

51
Alignment Template System
  • Word-align training corpus
  • Extract phrase pairs
  • Assign probabilities to phrase pairs
  • Train language model
  • Decode

52
Word-Align Training Corpus
  • Run GIZA word alignment in the normal direction,
    from e -> f

[Figure: e -> f alignment matrix for "il ne va pas" x "he does not go"]
53
Word-Align Training Corpus
  • Run GIZA word alignment in the inverse direction,
    from f -> e

[Figure: f -> e alignment matrix for "il ne va pas" x "he does not go"]
54
Alignment Symmetrization
  • Merge the bidirectional alignments using some
    heuristic between intersection and union
  • Question: what is the tradeoff in precision/recall
    when using intersection vs. union?
  • Here, we use union

[Figure: union of the two directional alignment matrices]
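
A minimal sketch of this merge step, assuming each directional alignment is a set of (English index, foreign index) links:

  def symmetrize(e2f, f2e, method="union"):
      # Intersection favors precision; union favors recall.
      return e2f & f2e if method == "intersection" else e2f | f2e

  # he does not go / il ne va pas (word indices; illustrative links)
  e2f = {(0, 0), (2, 1), (3, 2)}                  # he-il, not-ne, go-va
  f2e = {(0, 0), (2, 1), (3, 2), (1, 3), (2, 3)}  # plus does-pas, not-pas
  print(sorted(symmetrize(e2f, f2e)))             # union keeps all five links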
55
Alignment Template System
  • Word-align training corpus
  • Extract phrase pairs
  • Assign probabilities to phrase pairs
  • Train language model
  • Decode

56
Extract phrase pairs
  • Extract all phrase pairs (E, F) consistent with the
    word alignments, where consistency is defined as
    follows:
  • (1) Each word in the English phrase is aligned only
    with words in the foreign phrase
  • (2) Each word in the foreign phrase is aligned only
    with words in the English phrase
  • Phrase pairs must consist of contiguous words in
    each language (a sketch of the procedure follows the
    figure below)

[Figure: symmetrized alignment matrix for "il ne va pas" x "he does not go"]
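
A minimal extraction sketch under this consistency definition; the alignment below is the union alignment from the earlier slide, and max_len is an assumed phrase-length cap:

  def extract_phrases(alignment, e_len, f_len, max_len=4):
      # alignment: set of (e_index, f_index) links. A phrase-pair box
      # is consistent iff every link touching its rows or columns
      # lies entirely inside the box.
      pairs = []
      for e1 in range(e_len):
          for e2 in range(e1, min(e1 + max_len, e_len)):
              for f1 in range(f_len):
                  for f2 in range(f1, min(f1 + max_len, f_len)):
                      touching = [(i, j) for (i, j) in alignment
                                  if e1 <= i <= e2 or f1 <= j <= f2]
                      if touching and all(e1 <= i <= e2 and f1 <= j <= f2
                                          for (i, j) in touching):
                          pairs.append(((e1, e2), (f1, f2)))
      return pairs

  # he does not go (0-3) / il ne va pas (0-3), union alignment
  alignment = {(0, 0), (2, 1), (3, 2), (1, 3), (2, 3)}
  print(extract_phrases(alignment, 4, 4))
  # -> the boxes for <he, il>, <go, va>, <does not go, ne va pas>,
  #    and <he does not go, il ne va pas>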
57
Extract phrase pairs
  • Question: why is the illustrated phrase pair
    inconsistent with the alignment matrix?

[Figure: an inconsistent phrase pair highlighted in the alignment matrix]
58
Extract phrase pairs
  • Question: why is the illustrated phrase pair
    inconsistent with the alignment matrix?
  • Answer: "ne" is aligned with "not", which is
    outside the phrase pair; also, "does" is aligned
    with "pas", which is outside the phrase pair

[Figure: the inconsistent phrase pair in the alignment matrix]
59
Extract phrase pairs
  • <he, il>

[Figure: extracted phrase pair highlighted in the alignment matrix]
60
Extract phrase pairs
  • <he, il>
  • <go, va>

[Figure: extracted phrase pairs highlighted in the alignment matrix]
61
Extract phrase pairs
  • <he, il>
  • <go, va>
  • <does not go, ne va pas>

[Figure: extracted phrase pairs highlighted in the alignment matrix]
62
Extract phrase pairs
  • <he, il>
  • <go, va>
  • <does not go, ne va pas>
  • <he does not go, il ne va pas>

[Figure: extracted phrase pairs highlighted in the alignment matrix]
63
Alignment Template System
  • Word-align training corpus
  • Extract phrase pairs
  • Assign probabilities to phrase pairs
  • Train language model
  • Decode

64
Probability Assignment
  • Use relative frequency estimation:
  • P(E, A | F) = Count(F, E, A) / Σ_{E', A'} Count(F, E', A')
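
A minimal sketch of relative-frequency estimation over extracted phrase pairs (the word alignments inside templates are omitted here for brevity):

  from collections import Counter

  def phrase_probs(extracted):
      # extracted: list of (F, E) phrase pairs collected over the corpus.
      pair_counts = Counter(extracted)
      f_counts = Counter(F for F, _ in extracted)
      return {(F, E): c / f_counts[F] for (F, E), c in pair_counts.items()}

  pairs = [("ne va pas", "does not go"), ("ne va pas", "not go"),
           ("ne va pas", "does not go")]
  print(phrase_probs(pairs)[("ne va pas", "does not go")])  # 2/3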

65
Alignment Template System
  • Word-align training corpus
  • Extract phrase pairs
  • Assign probabilities to phrase pairs
  • Train language model
  • Decode

66
Language Model
  • Use an N-gram language model P(e), just as for
    word-based MT

67
Alignment Template System
  • Word-align training corpus
  • Extract phrase pairs
  • Assign probabilities to phrase pairs
  • Train language model
  • Decode

68
Decode
  • Beam search
  • State space:
  • set of possible partial translation hypotheses
  • Start state:
  • initial empty translation of the foreign input
  • Expansion operation:
  • extend an existing English hypothesis one phrase at
    a time, by translating a phrase in the foreign
    sentence into English
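
A skeleton of such a decoder, as a minimal sketch: each hypothesis carries a score, the set of covered foreign positions, and a partial English string. The phrase_table and lm_score interfaces are assumptions for illustration, not Pharaoh's actual API:

  import heapq

  def beam_search_decode(f_words, phrase_table, lm_score, beam_size=10):
      # phrase_table maps a foreign phrase (tuple of words) to a list
      # of (english, logprob) options; lm_score scores the new phrase
      # against the hypothesis so far.
      beam = [(0.0, (), "")]           # (score, covered indices, english)
      finished = []
      while beam:
          next_beam = []
          for score, covered, english in beam:
              if len(covered) == len(f_words):
                  finished.append((score, english))
                  continue
              # Expand: translate any uncovered contiguous foreign span
              for i in range(len(f_words)):
                  for j in range(i + 1, len(f_words) + 1):
                      span = tuple(range(i, j))
                      if set(span) & set(covered):
                          continue
                      for eng, logp in phrase_table.get(tuple(f_words[i:j]), []):
                          next_beam.append((score + logp + lm_score(eng, english),
                                            covered + span,
                                            (english + " " + eng).strip()))
          # Pruning: keep only the best beam_size hypotheses
          beam = heapq.nlargest(beam_size, next_beam)
      return max(finished) if finished else None

  pt = {("maria",): [("Mary", -0.1)], ("no",): [("not", -0.3)]}
  print(beam_search_decode(["maria", "no"], pt, lambda e, h: 0.0))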

69
Decoder Example
  • Start:
  • f: Maria no dio una bofetada a la bruja verde
  • e: (empty)
  • Expand the English translation:
  • translate Maria -> Mary or bruja -> witch
  • mark foreign words as covered
  • update probabilities

70
Decoder Example
[Figure: decoder search example, from Koehn 2003]
71
BLEU MT Evaluation Metric
  • BLEU: measure n-gram precision against a set of k
    reference English translations
  • What percentage of n-grams (where n ranges from 1
    through 5, typically) in the MT English output
    are also found in a reference translation?
  • Brevity penalty: penalize English translations
    with fewer words than the reference translations
  • Why is this metric so widely used?
  • It correlates surprisingly well with human judgment
    of machine-generated translations
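
A minimal sketch of one BLEU component, modified n-gram precision (each candidate n-gram is clipped by its maximum count in any single reference; the brevity penalty and the geometric mean over n are omitted):

  from collections import Counter

  def ngram_precision(candidate, references, n):
      def ngrams(words):
          return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
      cand = ngrams(candidate)
      max_ref = Counter()
      for ref in references:
          for g, c in ngrams(ref).items():
              max_ref[g] = max(max_ref[g], c)
      clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
      return clipped / max(sum(cand.values()), 1)

  cand = "Maria did not slap the green witch".split()
  refs = ["Maria did not slap the green witch".split()]
  print(ngram_precision(cand, refs, 2))  # 1.0 against an identical reference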

72
References
  • Brown et al. 1990. A Statistical Approach to
    Machine Translation.
  • Brown et al. 1993. The Mathematics of
    Statistical Machine Translation.
  • Collins 2003. Lecture Notes from 6.891, Fall
    2003: Machine Learning Approaches for Natural
    Language Processing.
  • Knight 1999. A Statistical MT Tutorial Workbook.
  • Knight and Koehn 2004. A Statistical Machine
    Translation Tutorial.
  • Koehn, Och, and Marcu 2003. A Phrase-Based
    Statistical Machine Translation System.
  • Koehn 2003. Pharaoh: A Phrase-Based Decoder.
  • Och and Ney 2004. The Alignment Template
    System.
  • Och and Ney 2003. Discriminative Training and
    Maximum Entropy Models for Statistical Machine
    Translation.