Statistical Machine Translation: IBM Models and the Alignment Template System

1
Statistical Machine Translation: IBM Models and
the Alignment Template System
2
Statistical Machine Translation
  • Goal
  • Given foreign sentence f
  • Maria no dio una bofetada a la bruja verde
  • Find the most likely English translation e
  • Maria did not slap the green witch

3
Statistical Machine Translation
  • The most likely English translation e is given by
    e = argmax_e P(e|f)
  • P(e|f) estimates the conditional probability of any e
    given f

4
Statistical Machine Translation
  • How to estimate P(e|f)?
  • Noisy channel
  • Decompose P(e|f) into P(f|e) P(e) / P(f)
  • Estimate P(f|e) and P(e) separately using a
    parallel corpus
  • Direct
  • Estimate P(e|f) directly using a parallel corpus
    (more on this later)

5
Noisy Channel Model
  • Translation Model
  • P(f|e)
  • How likely is f to be a translation of e?
  • Estimate parameters from bilingual corpus
  • Language Model
  • P(e)
  • How likely is e to be an English sentence?
  • Estimate parameters from monolingual corpus
  • Decoder
  • Given f, what is the best translation e?

6
Noisy Channel Model
  • Generative story
  • Generate e with probability p(e)
  • Pass e through noisy channel
  • Out comes f with probability p(f|e)
  • Translation task
  • Given f, deduce the most likely e that produced f:
    e = argmax_e P(e|f) = argmax_e P(f|e) P(e)
    (a minimal decoding sketch follows below)
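A minimal sketch of this noisy-channel argmax, assuming a tiny hand-made candidate list and toy probability tables (none of these numbers come from the slides); a real decoder searches a huge space of candidate translations rather than a fixed list.

```python
# Toy noisy-channel decoding: pick the e maximizing P(f|e) * P(e).
# Candidates and probabilities are invented for illustration only.

def best_translation(f, candidates, p_f_given_e, p_e):
    # argmax_e P(e|f) = argmax_e P(f|e) * P(e); P(f) is constant, so ignored
    return max(candidates,
               key=lambda e: p_f_given_e.get((f, e), 0.0) * p_e.get(e, 0.0))

f = "Maria no dio una bofetada a la bruja verde"
candidates = ["Maria did not slap the green witch",
              "Maria not slap the witch green"]
p_e = {"Maria did not slap the green witch": 1e-6,   # language model
       "Maria not slap the witch green": 1e-9}
p_f_given_e = {(f, e): 1e-4 for e in candidates}     # translation model

print(best_translation(f, candidates, p_f_given_e, p_e))
```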

7
Translation Model
  • How to model P(f|e)?
  • Learn parameters of P(f|e) from a bilingual
    corpus S of sentence pairs <e_i, f_i>
  • <e1, f1> = <the blue witch, la bruja azul>
  • <e2, f2> = <green, verde>
  • <eS, fS> = <the witch, la bruja>

8
Translation Model
  • Insufficient data in the parallel corpus to estimate
    P(f|e) at the sentence level (Why?)
  • Decompose the process of translating e -> f into
    small steps whose probabilities can be estimated

9
Translation Model
  • English sentence e = e1 ... el
  • Foreign sentence f = f1 ... fm
  • Alignment A = a1 ... am, where aj ∈ {0, ..., l}
  • A indicates which English word generates each
    foreign word

10
Alignments
  • e = the blue witch
  • f = la bruja azul

A = (1, 3, 2) (intuitively good alignment)
11
Alignments
  • e = the blue witch
  • f = la bruja azul

A = (1, 1, 1) (intuitively bad alignment)
12
Alignments
  • e = the blue witch
  • f = la bruja azul

(illegal alignment!)
13
Alignments
  • Question: how many possible alignments are there
    for a given e and f, where |e| = l and |f| = m?

14
Alignments
  • Question: how many possible alignments are there
    for a given e and f, where |e| = l and |f| = m?
  • Answer
  • Each foreign word can align with any one of the
    |e| = l words, or it can remain unaligned
  • Each foreign word has (l + 1) choices for an
    alignment, and there are |f| = m foreign words
  • So, there are (l+1)^m alignments for a given e
    and f (see the enumeration sketch below)
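A quick sketch that enumerates every alignment for the toy e/f pair from the earlier slides and confirms the (l+1)^m count; the assertion is just a sanity check.

```python
# Enumerate all alignments a_1..a_m with a_j in {0, ..., l} (0 = NULL).
from itertools import product

e = ["the", "blue", "witch"]     # l = 3
f = ["la", "bruja", "azul"]      # m = 3
l, m = len(e), len(f)

alignments = list(product(range(l + 1), repeat=m))
print(len(alignments))           # 64 = (l+1)^m = 4^3
assert len(alignments) == (l + 1) ** m
```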

15
Alignments
  • Question: If all alignments are equally likely,
    what is the probability of any one alignment,
    given e?

16
Alignments
  • Question: If all alignments are equally likely,
    what is the probability of any one alignment,
    given e?
  • Answer
  • P(A|e) = p(|f| = m) × 1/(l+1)^m
  • If we assume that p(|f| = m) is uniform over all
    possible values of |f|, then we can let
    p(|f| = m) = C
  • P(A|e) = C / (l+1)^m

17
Generative Story
  • e = blue witch
  • f = bruja azul

How do we get from e to f?
18
IBM Model 1
  • Model parameters
  • T(fj | e_aj) = translation probability of a foreign
    word given the English word that generated it

19
IBM Model 1
  • Generative story
  • Given e
  • Pick m = |f|, where all lengths m are equally
    probable
  • Pick A with probability P(A|e) = 1/(l+1)^m, since
    all alignments are equally likely given l and m
  • Pick f1 ... fm with probability
    P(f | A, e) = product over j of T(fj | e_aj),
    where T(fj | e_aj) is the translation
    probability of fj given the English word it is
    aligned to (see the Model 1 sketch below)
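A small sketch of the resulting Model 1 joint probability P(f, A | e), assuming a hand-made translation table t and the constant C from the earlier slide; the table entries are illustrative, not estimated from data.

```python
# P(f, A | e) = C / (l+1)^m * prod_j t(f_j | e_{a_j}); a_j = 0 denotes NULL.

def model1_joint(f, a, e, t, C=1.0):
    l, m = len(e), len(f)
    e_null = ["NULL"] + e                       # position 0 is the NULL word
    p = C / (l + 1) ** m
    for j, fj in enumerate(f):
        p *= t.get((fj, e_null[a[j]]), 1e-12)   # tiny floor for unseen pairs
    return p

t = {("bruja", "witch"): 0.8, ("azul", "blue"): 0.7}
# e = blue witch, f = bruja azul, A = (2, 1): bruja <- witch, azul <- blue
print(model1_joint(["bruja", "azul"], [2, 1], ["blue", "witch"], t))
```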

20
IBM Model 1 Example
  • e = blue witch

21
IBM Model 1 Example
  • e = blue witch
  • f = f1 f2

Pick m = |f| = 2
22
IBM Model 1 Example
  • e = blue witch
  • f = f1 f2

Pick A = (2, 1) with probability 1/(l+1)^m
23
IBM Model 1 Example
  • e = blue witch
  • f = bruja f2

Pick f1 = bruja with probability T(bruja | witch)
24
IBM Model 1 Example
  • e = blue witch
  • f = bruja azul

Pick f2 = azul with probability T(azul | blue)
25
IBM Model 1 Parameter Estimation
  • How does this generative story help us to
    estimate P(f|e) from the data?
  • Since the model for P(f|e) contains the parameter
    T(fj | e_aj), we first need to estimate T(fj | e_aj)

26
IBM Model 1 Parameter Estimation
  • How to estimate T(fj | e_aj) from the data?
  • If we had the data and the alignments A, along
    with P(A | f, e), then we could estimate T(fj | e_aj)
    using expected counts as follows:
    T(f | e) = expected count(e aligned to f) /
    expected count(e)

27
IBM Model 1 Parameter Estimation
  • How to estimate P(A | f, e)?
  • P(A | f, e) = P(A, f | e) / P(f | e)
  • But P(f | e) = sum over A of P(A, f | e)
  • So we need to compute P(A, f | e)
  • This is given by the Model 1 generative story

28
IBM Model 1 Example
  • e = the blue witch
  • f = la bruja azul

P(A | f, e) = P(f, A | e) / P(f | e)
29
IBM Model 1 Parameter Estimation
  • So, in order to estimate P(f|e), we first need to
    estimate the model parameter T(fj | e_aj)
  • In order to compute T(fj | e_aj), we need to
    estimate P(A | f, e)
  • And in order to compute P(A | f, e), we need to
    estimate T(fj | e_aj)

30
IBM Model 1 Parameter Estimation
  • Training data is a set of pairs <e_i, f_i>
  • Log likelihood of the training data given the model
    parameters is sum over i of log P(f_i | e_i)
  • To maximize the log likelihood of the training data
    given the model parameters, use EM
  • hidden variables: alignments A
  • model parameters: translation probabilities T

31
EM
  • Initialize model parameters T(f|e)
  • Calculate alignment probabilities P(A|f,e) under
    the current values of T(f|e)
  • Calculate expected counts from the alignment
    probabilities
  • Re-estimate T(f|e) from these expected counts
  • Repeat until the log likelihood of the training data
    converges to a maximum (see the EM sketch below)
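A compact sketch of this EM loop for Model 1, using the three-pair toy corpus from the earlier slide and a fixed number of iterations (an assumption; a real run checks the log likelihood). It exploits the fact that Model 1's posterior factorizes over foreign words, so expected counts can be collected without enumerating alignments.

```python
from collections import defaultdict

corpus = [(["the", "blue", "witch"], ["la", "bruja", "azul"]),
          (["green"],                ["verde"]),
          (["the", "witch"],         ["la", "bruja"])]

# Initialize T(f|e) uniformly; include NULL on the English side.
e_vocab = {w for e, _ in corpus for w in e} | {"NULL"}
f_vocab = {w for _, f in corpus for w in f}
T = {(f, e): 1.0 / len(f_vocab) for e in e_vocab for f in f_vocab}

for _ in range(20):                        # fixed iteration count for the sketch
    count = defaultdict(float)             # expected count(f, e)
    total = defaultdict(float)             # expected count(e)
    for e_sent, f_sent in corpus:
        e_null = ["NULL"] + e_sent
        for fj in f_sent:
            z = sum(T[(fj, ei)] for ei in e_null)   # normalizer
            for ei in e_null:
                p = T[(fj, ei)] / z                 # P(a_j = i | f, e)
                count[(fj, ei)] += p
                total[ei] += p
    T = {(fw, ew): (count[(fw, ew)] / total[ew]) if total[ew] else 0.0
         for (fw, ew) in T}

print(round(T[("azul", "blue")], 3))       # rises well above its uniform start
```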

32
IBM Model 2
  • Model parameters
  • T(fj | e_aj) = translation probability of a foreign
    word fj given the English word e_aj that generated it
  • d(i | j, l, m) = distortion probability, or
    probability that fj is aligned to ei, given l
    and m

33
IBM Model 3
  • Model parameters
  • T(fj | e_aj) = translation probability of a foreign
    word fj given the English word e_aj that generated it
  • r(j | i, l, m) = reverse distortion probability, or
    probability of the position of fj, given its alignment
    to ei, l, and m
  • n(ei) = fertility of word ei, or the number of
    foreign words aligned to ei
  • p1 = probability of generating a foreign word by
    alignment with the NULL English word

34
IBM Model 3
  • Generative Story
  • Choose fertilities for each English word
  • Insert spurious words according to probability of
    being aligned to the NULL English word
  • Translate English words -> foreign words
  • Reorder words according to reverse distortion
    probabilities

35
IBM Model 3 Example
  • Consider the following example from Knight
    (1999)
  • Maria did not slap the green witch

36
IBM Model 3 Example
  • Maria did not slap the green witch
  • Maria not slap slap slap the green witch
  • Choose fertilities: phi(Maria) = 1

37
IBM Model 3 Example
  • Maria did not slap the green witch
  • Maria not slap slap slap the green witch
  • Maria not slap slap slap NULL the green witch
  • Insert spurious words: p(NULL)

38
IBM Model 3 Example
  • Maria did not slap the green witch
  • Maria not slap slap slap the green witch
  • Maria not slap slap slap NULL the green witch
  • Maria no dio una bofetada a la verde bruja
  • Translate words: T(verde | green)

39
IBM Model 3 Example
  • Maria no dio una bofetada a la verde bruja
  • Maria no dio una bofetada a la bruja verde
  • Reorder words

40
IBM Model 3
  • For models 1 and 2
  • We can compute exact EM updates
  • For models 3 and 4
  • Exact EM updates cannot be efficiently computed
  • Use best alignments from previous iterations to
    initialize each successive model
  • Explore only the subspace of potential alignments
    that lies within the same neighborhood as the
    initial alignments

41
IBM Model 4
  • Model parameters
  • Same as Model 3, except it uses a more complicated
    model of reordering (for details, see Brown et
    al. 1993)

42
Language Model
  • Given an English sentence e1, e2, ..., el
  • P(e1, e2, ..., el) =
  • P(e1) ×
  • P(e2 | e1) × ... ×
  • P(el | e1, e2, ..., el-1)
  • N-gram model
  • Assume P(ei) depends only on the N-1 previous
    words, so that P(ei | e1, e2, ..., ei-1) ≈
    P(ei | ei-N+1, ..., ei-1)

43
N = 2: Bigram Language Model
  • P(Maria did not slap the green witch) =
  • P(Maria | START) ×
  • P(did | Maria) ×
  • P(not | did) × ... ×
  • P(END | witch)
    (a small bigram LM sketch follows below)
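A small sketch of a bigram model with maximum-likelihood estimates over a two-sentence toy corpus (invented for illustration); a real language model is trained on far more text and smoothed.

```python
from collections import Counter

corpus = [["Maria", "did", "not", "slap", "the", "green", "witch"],
          ["the", "witch", "did", "not", "go"]]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    tokens = ["START"] + sent + ["END"]
    unigrams.update(tokens[:-1])
    bigrams.update(zip(tokens[:-1], tokens[1:]))

def p_bigram(w, prev):
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def sentence_prob(sent):
    tokens = ["START"] + sent + ["END"]
    p = 1.0
    for prev, w in zip(tokens[:-1], tokens[1:]):
        p *= p_bigram(w, prev)       # P(w | prev)
    return p

print(sentence_prob(["Maria", "did", "not", "go"]))   # 0.25 on this toy corpus
```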

44
Word-Based MT
  • Word = fundamental unit of translation
  • Weaknesses
  • no explicit modeling of word context
  • word-by-word translation may not accurately
    convey the meaning of a phrase
  • il ne va pas -> he does not go
  • IBM models prevent alignment of foreign words
    with >1 English word
  • aller -> to go

45
Phrase-Based MT
  • Phrase = basic unit of translation
  • Strengths
  • explicit modeling of word context
  • captures local reorderings, local dependencies

46
Example Rules
  • English: he does not go
  • Foreign: il ne va pas
  • ne va pas -> does not go

47
Alignment Template System
  • Och and Ney (2004)
  • Alignment template
  • Pair of source and target language phrases
  • Word alignment among the words within those phrases
  • Formally, an alignment template is a triple
    (F, E, A)
  • F = words on the foreign side
  • E = words on the English side
  • A = alignments among the words on the foreign and
    English sides

48
Estimating P(e|f)
  • Noisy channel
  • Decompose P(e|f) into P(f|e) and P(e)
  • Estimate P(f|e) and P(e) separately
  • Direct
  • Estimate P(e|f) directly from the training corpus
  • Use a log-linear model

49
Log-linear Models for MT
  • Compute the best translation as follows:
    e = argmax_e sum_i lambda_i h_i(e, f)
  • where the h_i are feature functions and the lambda_i
    are the model parameters
  • Typical feature functions include
  • phrase translation probabilities
  • lexical translation probabilities
  • language model probability
  • reordering model
  • word penalty
    (a scoring sketch follows below)
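A sketch of log-linear scoring over a fixed candidate list, using two stand-in feature functions and hand-picked weights; all of these are assumptions for illustration, whereas real systems use the features listed above and tune the weights on held-out data.

```python
def score(e, f, features, weights):
    # sum_i lambda_i * h_i(e, f); the best translation maximizes this sum
    return sum(w * h(e, f) for h, w in zip(features, weights))

# Two toy feature functions standing in for real LM / word-penalty features.
h_lm = lambda e, f: -0.5 * len(e.split())                     # pretend LM score
h_wp = lambda e, f: -abs(len(e.split()) - len(f.split()))     # word penalty

f = "il ne va pas"
candidates = ["he does not go", "he not go"]
weights = [1.0, 2.0]                                          # lambda_1, lambda_2
best = max(candidates, key=lambda e: score(e, f, [h_lm, h_wp], weights))
print(best)                                                   # "he does not go"
```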

50
Log-linear Models for MT
  • The Noisy Channel model is a special case of the
    Log-Linear model where
  • h1 = log P(f|e), lambda_1 = 1
  • h2 = log P(e), lambda_2 = 1
  • Then argmax_e (lambda_1 h1 + lambda_2 h2) =
    argmax_e P(f|e) P(e)

51
Alignment Template System
  • Word-align training corpus
  • Extract phrase pairs
  • Assign probabilities to phrase pairs
  • Train language model
  • Decode

52
Word-Align Training Corpus
  • Run GIZA word alignment in the normal direction,
    from e -> f

53
Word-Align Training Corpus
  • Run GIZA word alignment in the inverse direction,
    from f -> e

54
Alignment Symmetrization
  • Merge bi-directional alignments using some
    heuristic between intersection and union
  • Question what is tradeoff in precision/recall
    using intersection/union?
  • Here, we use union

55
Alignment Template System
  • Word-align training corpus
  • Extract phrase pairs
  • Assign probabilities to phrase pairs
  • Train language model
  • Decode

56
Extract phrase pairs
  • Extract all phrase pairs (E, F) consistent with the
    word alignments, where consistency is defined as
    follows:
  • (1) Each word in the English phrase is aligned only
    with words in the foreign phrase
  • (2) Each word in the foreign phrase is aligned only
    with words in the English phrase
  • Phrase pairs must consist of contiguous words in
    each language (see the extraction sketch below)
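A sketch of this extraction procedure on the "il ne va pas" example, assuming a 0-based word alignment consistent with the discussion on the following slides (the alignment matrix itself is not reproduced in this transcript).

```python
def extract_phrases(e, f, alignment, max_len=4):
    """All phrase pairs consistent with the alignment (a set of (i, j) links)."""
    pairs = set()
    for e_start in range(len(e)):
        for e_end in range(e_start, min(e_start + max_len, len(e))):
            # foreign positions linked to any word of the English span
            f_pos = [j for (i, j) in alignment if e_start <= i <= e_end]
            if not f_pos:
                continue
            f_start, f_end = min(f_pos), max(f_pos)
            # consistent iff no link in the foreign span leaves the English span
            consistent = all(e_start <= i <= e_end
                             for (i, j) in alignment if f_start <= j <= f_end)
            if consistent and f_end - f_start < max_len:
                pairs.add((" ".join(e[e_start:e_end + 1]),
                           " ".join(f[f_start:f_end + 1])))
    return pairs

e = "he does not go".split()
f = "il ne va pas".split()
links = {(0, 0), (2, 1), (2, 3), (1, 3), (3, 2)}  # he-il, not-ne, not-pas, does-pas, go-va
for pair in sorted(extract_phrases(e, f, links)):
    print(pair)
```

On this assumed alignment the extracted pairs are exactly the four listed on the next slides: (he, il), (go, va), (does not go, ne va pas), and (he does not go, il ne va pas).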

57
Extract phrase pairs
  • Question: why is the illustrated phrase pair
    inconsistent with the alignment matrix?

58
Extract phrase pairs
  • Question: why is the illustrated phrase pair
    inconsistent with the alignment matrix?
  • Answer: "ne" is aligned with "not", which is
    outside the phrase pair; also, "does" is aligned
    with "pas", which is outside the phrase pair

59
Extract phrase pairs
  • <he, il>

60
Extract phrase pairs
  • <he, il>
  • <go, va>

61
Extract phrase pairs
  • <he, il>
  • <go, va>
  • <does not go, ne va pas>

62
Extract phrase pairs
  • <he, il>
  • <go, va>
  • <does not go, ne va pas>
  • <he does not go, il ne va pas>

63
Alignment Template System
  • Word-align training corpus
  • Extract phrase pairs
  • Assign probabilities to phrase pairs
  • Train language model
  • Decode

64
Probability Assignment
  • Use relative frequency estimation (see the sketch below)
  • P(F, E, A | F) = Count(F, E, A) / Σ over (E, A) of Count(F, E, A)
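A sketch of relative-frequency estimation over a bag of extracted phrase pairs; the pairs and their counts are invented for illustration, and the internal word alignments of each template are ignored here.

```python
from collections import Counter

# (foreign phrase, English phrase) instances collected from a word-aligned corpus
extracted = [("ne va pas", "does not go"),
             ("ne va pas", "does not go"),
             ("ne va pas", "not go"),
             ("il", "he")]

pair_count = Counter(extracted)
f_count = Counter(f for f, _ in extracted)

def p_e_given_f(e, f):
    return pair_count[(f, e)] / f_count[f]        # Count(F, E) / Count(F)

print(p_e_given_f("does not go", "ne va pas"))    # 2/3
```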

65
Alignment Template System
  • Word-align training corpus
  • Extract phrase pairs
  • Assign probabilities to phrase pairs
  • Train language model
  • Decode

66
Language Model
  • Use N-gram language model P(e), just as for
    word-based MT

67
Alignment Template System
  • Word-align training corpus
  • Extract phrase pairs
  • Assign probabilities to phrase pairs
  • Train language model
  • Decode

68
Decode
  • Beam search
  • State space:
  • set of possible partial translation hypotheses
  • Start state:
  • initial empty translation of the foreign input
  • Expansion operation:
  • extend an existing English hypothesis one phrase at
    a time, by translating a phrase in the foreign
    sentence into English
    (a simplified decoder sketch follows below)
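A much-simplified decoder sketch, assuming a hand-made phrase table and scoring hypotheses by phrase translation probabilities alone; a real decoder such as Pharaoh also uses a language model, a distortion model, and future-cost estimates when pruning.

```python
phrase_table = {                                 # foreign phrase -> [(english, prob)]
    ("Maria",): [("Mary", 0.9)],
    ("no",): [("not", 0.6), ("no", 0.4)],
    ("dio", "una", "bofetada"): [("slapped", 0.7)],
    ("a", "la"): [("the", 0.5)],
    ("bruja",): [("witch", 0.9)],
    ("verde",): [("green", 0.9)],
}

def decode(f, beam_size=5):
    # A hypothesis = (covered foreign positions, English words so far, probability)
    stacks = [[] for _ in range(len(f) + 1)]     # indexed by number of covered words
    stacks[0].append((frozenset(), (), 1.0))
    for n in range(len(f)):
        # prune each stack to the best few hypotheses
        stacks[n] = sorted(stacks[n], key=lambda h: -h[2])[:beam_size]
        for covered, english, prob in stacks[n]:
            for start in range(len(f)):
                for end in range(start + 1, len(f) + 1):
                    span = tuple(f[start:end])
                    if span not in phrase_table:
                        continue
                    if any(j in covered for j in range(start, end)):
                        continue                 # span already translated
                    for e_phrase, p in phrase_table[span]:
                        new = (covered | set(range(start, end)),
                               english + tuple(e_phrase.split()),
                               prob * p)
                        stacks[len(new[0])].append(new)
    complete = stacks[len(f)]
    return " ".join(max(complete, key=lambda h: h[2])[1]) if complete else None

f = "Maria no dio una bofetada a la bruja verde".split()
print(decode(f))        # prints one complete translation built from the toy table
```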

69
Decoder Example
  • Start
  • f: Maria no dio una bofetada a la bruja verde
  • e:
  • Expand the English translation
  • translate Maria -> Mary or bruja -> witch
  • mark foreign words as covered
  • update probabilities

70
Decoder Example
Example from Koehn (2003)
71
BLEU MT Evaluation Metric
  • BLEU measures n-gram precision against a set of k
    reference English translations
  • What percentage of n-grams (where n typically ranges
    from 1 through 4) in the MT English output
    are also found in a reference translation?
  • Brevity penalty: penalize English translations
    with fewer words than the reference translations
  • Why is this metric so widely used?
  • It correlates surprisingly well with human judgment
    of machine-generated translations
    (a simplified BLEU sketch follows below)
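A simplified sketch of BLEU: modified n-gram precision for n = 1..4 against a single reference, combined geometrically, with a brevity penalty. Real BLEU is computed at the corpus level and over multiple references; the smoothing floor and example sentences here are assumptions for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    c, r = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(c, n), ngrams(r, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)   # floor avoids log(0)
    bp = 1.0 if len(c) > len(r) else math.exp(1 - len(r) / max(len(c), 1))
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

print(round(bleu("Maria did not slap the green witch",
                 "Maria did not slap the green witch"), 3))   # 1.0
print(round(bleu("Maria no slap witch green",
                 "Maria did not slap the green witch"), 3))   # much lower
```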

72
References
  • Brown et al. (1990). A Statistical Approach to
    Machine Translation.
  • Brown et al. (1993). The Mathematics of
    Statistical Machine Translation.
  • Collins (2003). Lecture Notes from 6.891, Fall
    2003: Machine Learning Approaches for Natural
    Language Processing.
  • Knight (1999). A Statistical MT Workbook.
  • Knight and Koehn (2004). A Statistical Machine
    Translation Tutorial.
  • Koehn, Och and Marcu (2003). A Phrase-Based
    Statistical Machine Translation System.
  • Koehn (2003). Pharaoh: A Phrase-Based Decoder.
  • Och and Ney (2004). The Alignment Template
    System.
  • Och and Ney (2003). Discriminative Training and
    Maximum Entropy Models for Statistical Machine
    Translation.