1
Decoding Algorithms for Statistical Machine
Translation
  • Dr. Joy Ying Zhang
  • Carnegie Mellon University

2
Carnegie Mellon University Silicon Valley
3
CMU SV Campus
4
In a few years
5
Outline
  • Overview
  • Monotone decoder
  • Decoding with reordering
  • Jumping window
  • Decoding with ITG
  • Hierarchical decoder
  • Decoder for mobile devices

6
Phrase-based SMT
7
Decoding is NP Complete
  • Even the simplest decoding algorithm is NP-complete: the complexity is exponential in the sentence length, just like the Travelling Salesman Problem (TSP) [Knight et al. 99]

8
Decoding as TSP
  • In a word-to-word translation model
  • Choosing the next source word to translate is like choosing the next city to visit
  • Choosing the target translation is like choosing which hotel to stay at in a city
  • The optimal translation corresponds to the optimal city/hotel choices
  • We can only afford a suboptimal solution
  • Let's start with the simplest one

9
Monotone Decoding
  • No reordering is allowed; decoding proceeds from left to right
  • Apply the translation model over the test sentence to build up a lattice
  • Search the lattice for the best path given all knowledge sources (translation model, language model, sentence length model, ...)

10
Monotone Decoding
  • Traverse the lattice from left to right
  • Build partial translation hypotheses for each node (what are good translations up to this source position)
  • Output the best hypothesis that covers the complete sentence as the final translation

11
Probability/Score of a Partial Hyp
  • Depends on the models used in the decoder
  • Translation model scores under the independence assumption
  • E.g. P(e1...en | f1...fm) ≈ P(e1..e3 | f1..f4) · P(e4..e5 | f5..f6) · ...
  • Language model: P(e1...en)
  • Sentence length model: score(n | source length)
  • Distortion model
  • And be creative ... (one possible score combination is sketched below)
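
As a rough illustration of how such knowledge sources could be combined (not the decoder's actual code), the sketch below sums weighted log scores; the feature names and weights are made up for the example.

```python
import math

def partial_hyp_score(phrase_pairs, lm_logprob, weights):
    """Toy log-linear score of a partial hypothesis (higher is better).

    phrase_pairs: list of (src_phrase, tgt_phrase, tm_logprob) used so far
    lm_logprob:   LM log-probability of the target words produced so far
    weights:      hypothetical scaling factors for each model
    """
    tm = sum(logp for _, _, logp in phrase_pairs)          # independence assumption over phrase pairs
    n_tgt_words = sum(len(tgt.split()) for _, tgt, _ in phrase_pairs)
    return (weights["tm"] * tm
            + weights["lm"] * lm_logprob
            + weights["word_bonus"] * n_tgt_words)         # simple "the more the better" length bonus

# Example: one phrase pair translated so far
score = partial_hyp_score([("ich komme", "I will come", math.log(0.4))],
                          lm_logprob=math.log(0.01),
                          weights={"tm": 1.0, "lm": 0.8, "word_bonus": 0.1})
```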

12
Sentence Length Model
  • Different languages have different levels of wordiness
  • A histogram over source sentence length vs. target sentence length shows that the distribution is rather flat -> p(J | I) is not very helpful
  • Very simple sentence length model: the more the better
  • i.e. give a bonus for each word (not a probabilistic model)
  • Balances the shortening effect of the LM
  • Can be applied immediately, as absolute length is not important
  • However, this is insensitive to what's in the sentence
  • Optimizes the length of translations for the entire test set, not for each sentence
  • Long sentences are made longer to compensate for sentences which are too short

13
Partial Hypotheses Recombination
  • For each source word and phrase, there are |t| translation alternatives
  • If we simply combine them, the final node will have |t|^J hyps to explore
  • However, many partial hyps are not distinguishable to the decoder models
  • If using only the TM and a 3-gram LM:
  • I will come to office
  • I came to office

14
Recombination of Hypotheses
  • Recombination: of two hypotheses, keep only the better one if no future information can switch their current ranking
  • Notice this depends on the models
  • Model score depends on the current partial translation and the extension, e.g. LM
  • Model score depends on global features known only at the sentence end, e.g. sentence length model
  • The models define equivalence classes for the hypotheses
  • Expand only the best hypothesis in each equivalence class

15
Recombination of Hypotheses Example
  • TM and n-gram LM only
  • Hypotheses
  • H1 I would like to go
  • H2 I would not like to go
  • Assume as possible expansions Eto the movies
    to the cinema and watch a film
  • LMscore is identical for H1Expansion as for
    H2Expansion for bi, tri, four-gram LMs
  • E.g 3-gram LMscore Expansion 1 is-logpr( to
    to go ) - logpr( the go to ) logpr( movies
    to the)
  • Therefore Cost(H1) lt Cost(H2) gt Cost(H1E) lt
    Cost(H2E) for all possible expansions E
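
A minimal Python sketch of recombination under these assumptions (TM plus n-gram LM): the equivalence class is the source coverage together with the last n-1 target words. The Hyp record and its fields are illustrative, not the decoder's real data structures.

```python
from collections import namedtuple

# Minimal hypothesis record: covered source positions, target words, model score
Hyp = namedtuple("Hyp", "coverage target score")

def recombine(hyps, lm_order=3):
    """Keep only the best hypothesis per equivalence class.

    Two hypotheses are recombined if no future information can change
    their relative ranking: with a TM and an n-gram LM this means the
    same source coverage and the same last (n-1) target words.
    """
    best = {}
    for h in hyps:
        state = (h.coverage, tuple(h.target[-(lm_order - 1):]))
        if state not in best or h.score > best[state].score:
            best[state] = h   # the loser is only needed as a back pointer for n-best output
    return list(best.values())

# H1/H2 from the slide share the LM state ("to", "go") and are recombined
hyps = [Hyp(5, ("I", "would", "like", "to", "go"), -4.2),
        Hyp(5, ("I", "would", "not", "like", "to", "go"), -5.0)]
print(recombine(hyps))   # keeps only H1
```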

16
Beam Pruning
  • Still a lot of partial hyps to explore, even after recombination, for each node in the lattice (source sentence up to this position)
  • To a not-so-good partial hyp: "Sorry, we don't give you any more chances since you failed this mid-term"
  • Prune H if it is not among the top B best hyps --- beam size pruning
  • Prune H if its score is lower than factor × best score --- beam factor pruning
  • Pruning reduces the number of partial hyps to explore -> faster decoding
  • But it eliminates hyps that might become good translations later on (a sketch of both pruning styles follows below)
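
Both pruning styles can be sketched in a few lines; the beam size and factor below are illustrative defaults, and hypotheses are assumed to carry log-domain scores.

```python
import math

def beam_prune(scored_hyps, beam_size=100, beam_factor=1e-3):
    """Apply both pruning styles from the slide (thresholds are illustrative).

    scored_hyps: list of (log_score, hypothesis) pairs for one lattice node
    - beam-size pruning: keep at most the top beam_size hypotheses
    - beam-factor pruning: drop hypotheses whose probability is below
      beam_factor * probability of the best hypothesis
    """
    if not scored_hyps:
        return scored_hyps
    kept = sorted(scored_hyps, key=lambda sh: sh[0], reverse=True)[:beam_size]
    threshold = kept[0][0] + math.log(beam_factor)   # log-domain version of factor * best
    return [sh for sh in kept if sh[0] >= threshold]
```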

17
Beam Pruning
18
Rest-Cost Estimation
  • In pruning we compare hyps which are not strictly equivalent under the models
  • Risk: we prefer hypotheses which have covered the easy parts
  • Remedy: estimate the remaining cost for each hypothesis
  • Want to know the minimum expected cost (similar to A* search)
  • Gives a bound for pruning
  • However, not possible with acceptable effort
  • Want to include as many models as possible
  • Translation model costs, word count, phrase count
  • Language model costs
  • Distortion model costs
  • Calculate the expected cost R(l, r) for each span (l, r)

19
Rest Cost for Translation Models
  • Translation model, word count and phrase count features are local costs
  • They depend only on the current phrase pair
  • Strictly additive: R(l, m) + R(m, r) = R(l, r)
  • Minimize over alternative translations
  • For each source phrase span (l, r), initialize with the cost of the best translation
  • Combine adjacent spans, take the best combination (see the sketch below)
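
A possible implementation of this span combination, assuming a precomputed table of best single-phrase costs per source span (the dictionary layout is an assumption of the sketch):

```python
def tm_rest_cost(best_phrase_cost, sent_len):
    """Rest cost R(l, r) for the local (TM, word count, phrase count) features.

    best_phrase_cost: dict (l, r) -> cost of the best single phrase pair
                      covering source words l .. r-1 (absent if none exists)
    Spans are initialized with the best phrase translation, then adjacent
    spans are combined, keeping the best split point.
    """
    INF = float("inf")
    R = {(l, r): best_phrase_cost.get((l, r), INF)
         for l in range(sent_len) for r in range(l + 1, sent_len + 1)}
    for length in range(2, sent_len + 1):              # shorter spans are ready first
        for l in range(sent_len - length + 1):
            r = l + length
            for m in range(l + 1, r):                  # R(l, r) <= R(l, m) + R(m, r)
                R[(l, r)] = min(R[(l, r)], R[(l, m)] + R[(m, r)])
    return R
```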

20
Rest Cost for Language Models
  • We do not have the history -> only an approximation
  • For each span (l, r) calculate the LM score without history
  • Combine LM scores for adjacent spans
  • Notice: p(e1 ... em) · p(em+1 ... en) != p(e1 ... en) beyond a 1-gram LM
  • Alternative: fast monotone decoding with the TM-best translations
  • History available
  • Then R(l, r) = R(1, r) - R(1, l)

21
Rest Cost for Distance-Based DM
  • The distance-based DM rest cost depends on the coverage pattern
  • Too many different coverage patterns; cannot pre-calculate
  • Estimate by jumping to the first gap, then filling the gaps in sequence
  • [Moore & Quirk 2007]: DM cost plus rest cost

[Figure: S = current phrase, S' = previous phrase, S'' = gap-free initial segment; L(.) = length of a phrase, D(.,.) = distance between phrases]
S adjacent to S':      d = 0
S left of S'':         d = 2 L(S)
S subsequence of S'':  d = 2 (D(S, S'') + L(S))
Otherwise:             d = 2 (D(S, S') + L(S))
22
Rest Cost for Lexicalized DM
  • Lexicalized DM: per phrase
  • DM(f, e) scores: in-mon, in-swap, in-dist, out-mon, out-swap, out-dist
  • Treat as a local cost for each span (l, r)
  • Minimize over alternative translations and different orientations (in- and out-)

23
Effect of Rest-Cost Estimation
  • From Richard Zens 2008
  • LM is important, DM is important

24
Output Best Translation
  • The optimal hypothesis is in the last node of the lattice
  • We need to keep the back pointers

25
Monotone Decoding Algorithm
  • Apply the TM on sentence f1 ... fJ
  • For j = 1 to J
  • For each incoming edge e that enters node j
  • Edge e: i -> j
  • For each partial hyp h in node i
  • Extend h with edge e
  • Estimate the hyp prob/score for h+e
  • Store <h+e, prob/score, back pointer to h> in node j
  • Prune the partial hyps in node j
  • In node J, find the best hyp
  • Follow the back pointers and output the final translation (a Python sketch of this loop follows below)
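
A condensed Python sketch of this loop. The lattice layout (edges_into), the lm_cost helper and the hypothesis tuples are assumptions of the sketch, not the actual decoder interface.

```python
def monotone_decode(edges_into, J, lm_cost, prune):
    """Monotone decoding loop from the slide, as a rough sketch.

    edges_into[j]: list of (i, tgt_phrase, tm_cost) edges covering source words i .. j-1
    lm_cost(prev_words, tgt_phrase): assumed helper returning the LM cost of the extension
    prune(hyps): any pruning function over a list of hypotheses
    A hypothesis is a tuple (cost, target_words, back_pointer); lower cost is better.
    """
    nodes = {0: [(0.0, (), None)]}                       # empty hypothesis at node 0
    for j in range(1, J + 1):
        hyps = []
        for (i, tgt, tm_cost) in edges_into.get(j, []):  # incoming edges i -> j
            for h in nodes.get(i, []):
                cost, words, _ = h
                hyps.append((cost + tm_cost + lm_cost(words, tgt),
                             words + tuple(tgt.split()),
                             h))                         # back pointer to the extended hyp
        nodes[j] = prune(hyps)
    best = min(nodes[J], key=lambda h: h[0])             # best hyp in the last node (assumes it exists)
    return " ".join(best[1])
```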

26
Output N-best List
  • When traversing back from the last node, the decoder can output the top N hyps for the whole sentence: the N-best list
  • Model scores do not correlate well with external scores such as BLEU
  • In a 1000-best list, the hyp with the highest BLEU ranks, on average, at position 489.38 according to the model scores

27
N-Best List
28
N-Best Rescoring
  • Generate an n-best list
  • Use a different TM and/or LM to rescore each translation -> reordering of the translations, i.e. a different best translation (a rescoring sketch follows below)
  • Different TMs
  • Use the IBM1 lexicon for the entire translation
  • Use HMM-FB and IBM4 lexicons
  • Forced alignment with the HMM alignment model
  • Different LMs
  • Very large LM (Distributed Language Model)
  • Link grammar (too slow)
  • Other syntax-based LMs, e.g. Charniak's parser?
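
A minimal sketch of rescoring with one additional model; the extra_lm_logprob function and the single interpolation weight are placeholders for whatever new TM/LM features are actually used.

```python
def rescore_nbest(nbest, extra_lm_logprob, weight=1.0):
    """Re-rank an n-best list with one additional model.

    nbest: list of (translation, base_model_score) pairs from the decoder
    extra_lm_logprob: function returning the new model's log-probability
                      for a full translation (e.g. a much larger LM)
    Returns the list sorted by the combined score.
    """
    rescored = [(trans, base + weight * extra_lm_logprob(trans))
                for trans, base in nbest]
    return sorted(rescored, key=lambda x: x[1], reverse=True)
```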

29
Problem with N-Best Generation
  • Duplicates from different transducers
  • @Lex A B 0.5
  • @ISA A B 0.7
  • -> Two identical translations with different scores, or even the same score (when rescoring all translations with the same lexicon)
  • Spurious ambiguities
  • us companies and other institutions
  • us companies and other institutions
  • us companies and other institutions
  • us companies and other institutions
  • . . .
  • Example run: 1000 n-best -> 400 different strings on average; extreme case: only 10 unique strings
  • Possible solution: check uniqueness during backtracking (see the sketch below)
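
One way such a uniqueness check could look (the candidate format is assumed; a real decoder would do this while backtracking through the lattice):

```python
def unique_nbest(candidates, n):
    """Collect the top-n *distinct* translation strings.

    candidates: (translation, score) derivations in descending score order;
    different derivations may yield the same surface string.
    """
    seen, out = set(), []
    for translation, score in candidates:
        if translation not in seen:        # skip duplicates from spurious ambiguity
            seen.add(translation)
            out.append((translation, score))
        if len(out) == n:
            break
    return out
```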

30
Oracle Score of N-best List
31
Using Distributed LM for Reranking Systems
  • Large training data available
  • Distributed computing clusters
  • Distributed language modeling (Zhang and Vogel, 2006; Emami, 2007; Brants et al., 2007)

32
Rerank the N-Best List using LM Features
33
Rerank N-best List
34
Rerank N-best List
35
Rerank N-best List
  • Considering long-distance dependencies

36
Reranking N-best List
37
Tuning the SMT System
  • We use different models in the SMT system
  • Models have simplifications
  • Trained on different amounts of data
  • => Different levels of reliability
  • => Give different weights to the different models: Q = c1 Q1 + c2 Q2 + ... + cn Qn
  • Find the optimal scaling factors c1 ... cn
  • Optimal means: highest score for the chosen evaluation metric

38
Automatic Tuning
  • Many algorithms to find (near) optimal solutions are available
  • Simplex
  • Maximum entropy
  • Minimum error training
  • Minimum Bayes risk training
  • Genetic algorithms
  • Note: the models are not improved, only their combination
  • A large number of full translations is required => still problematic when decoding is slow

39
Automatic Tuning on N-best List
  • Generate n-best lists, e.g. for each of 500 source sentences 1000 translations
  • Loop
  • Changing the scaling factors results in re-ranking the n-best lists (see the sketch below)
  • Evaluate the new 1-best translations
  • Apply any of the standard optimization techniques
  • Advantage: much faster
  • Can pre-calculate the counts (e.g. n-gram matches) for each translation to speed up evaluation
  • For the Bleu or NIST metric with a global length penalty, do local hill climbing for each individual n-best list
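
A sketch of the inner step: re-rank cached n-best lists under new scaling factors and return the 1-best translations, which an outer optimizer then scores with the chosen metric. The data layout is an assumption of the sketch.

```python
def one_best_under_weights(nbest_lists, weights):
    """Pick the 1-best translation of each sentence under given scaling factors.

    nbest_lists: per sentence, a list of (translation, feature_vector) pairs,
                 where feature_vector holds the pre-computed model scores Q_k
    weights:     scaling factors c_1 ... c_n
    """
    def combined(feats):
        return sum(c * q for c, q in zip(weights, feats))
    return [max(nbest, key=lambda tf: combined(tf[1]))[0] for nbest in nbest_lists]

# An outer optimizer (simplex, MERT, ...) repeatedly calls
# one_best_under_weights with new weights and scores the output with BLEU/NIST.
```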

40
Minimum Error Training
  • For each scaling factor we have Q = ck Qk + QRest
  • For different values, different hyps have the lowest score
  • Different hyps lead to different MT eval scores (see the sweep sketch below)
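
Illustrative only: since Q is linear in ck, the 1-best hypothesis changes at a finite number of crossing points; exact minimum error training enumerates those points, while the sketch below simply tries a grid of values for one factor, reusing one_best_under_weights from the previous sketch.

```python
def sweep_one_weight(nbest_lists, weights, k, grid, error):
    """Try several values for one scaling factor c_k and keep the best.

    error: function mapping a list of 1-best translations to an error score
           (e.g. 1 - BLEU). A grid search is an approximation; exact MERT
           would enumerate the crossing points of the linear score functions.
    """
    best_val, best_err = weights[k], float("inf")
    for value in grid:
        trial = list(weights)
        trial[k] = value
        err = error(one_best_under_weights(nbest_lists, trial))
        if err < best_err:
            best_val, best_err = value, err
    weights[k] = best_val
    return weights
```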

41
Decoding with Reordering
  • Languages have different word orders
  • 1澳洲/Australia 2是/is 与/with 3北韩/North Korea 4有/has 5邦交/diplomatic relationship 6的/of 7少数/a few 8国家/countries 9之一/one of
  • Australia is one of the few countries that have diplomatic relationship with North Korea
  • To generate the right English translation, we need to translate the source in the order 1 2 9 6 7 8 4 5
  • Reordering: either change the order in which to translate the source, or equivalently re-arrange the partial translations
  • Knowledge sources
  • Reordering models
  • Language models
  • Syntax

42
Reordering Strategies
  • All permutations
  • Any re-ordering possible
  • Complexity of the traveling salesman problem -> only possible for very short sentences
  • Small jumps ahead, filling in the gaps fairly soon
  • Only local word reordering
  • Implemented in the current decoder
  • Leaving a small number of gaps, filled in at any time
  • Allows for global but limited reordering
  • Similar decoding complexity: exponential in the number of gaps
  • IBM-style reordering (described in an IBM patent)
  • Merging neighboring regions with swaps, no gaps at all
  • Allows for global reordering
  • Complexity lower than for strategy 1, but higher than for 2 and 3

43
Decoding with Reordering Window
  • Word and phrase reordering within a given window
  • From the first untranslated source word: the next k positions
  • Window length 1: monotone decoding
  • Restrict the total number of reorderings (typically 3 per 10 words)
  • Simple jump model
  • One reordering typically involves two jumps
  • The jump distance D depends on the gap and also on the phrase length; the distance is measured from center of phrase to center of phrase
  • Simple Gaussian-like distribution: p(D) ∝ exp(-|D - 1|)
  • Lexicalized jump model

44
Jumping ahead in the Lattice
  • Hypothesis describes a partial translation
  • Coverage information, Back-trace information,
    Score
  • Expand hypothesis over uncovered position

[Lattice figure: source "ich komme morgen zu dir" with phrase translation edges such as "I", "come", "I come", "I will come", "tomorrow", "to", "you", "to your office"]
h: c = 11000, t = "I will come"
h: c = 11011, t = "I will come to your office"
h: c = 11111, t = "I will come to your office tomorrow"
45
Word Order Coverage Info
  • Need to know which source words have already been translated
  • Don't want to miss any words
  • Don't want to translate words twice
  • Can compare hypotheses which cover the same words
  • Use a coverage vector to store this information
  • Essentially a bit vector (see the sketch below)
  • For small jumps ahead: position of the first gap plus a short bit vector
  • For a small number of gaps: array of positions of uncovered words
  • For merging neighboring regions: left and right position
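
A small sketch of a bit-vector coverage representation and of enumerating the next spans to translate inside a reordering window; the window and phrase-length limits are illustrative.

```python
def expansion_spans(coverage, sent_len, window=4, max_phrase_len=3):
    """Enumerate source spans that a hypothesis may translate next.

    coverage: integer bit vector; bit j set means source word j is translated
    window:   reordering window, counted from the first untranslated word
    Yields (start, end) spans of untranslated words, end exclusive.
    """
    first_gap = next(j for j in range(sent_len + 1)
                     if j == sent_len or not (coverage >> j) & 1)
    for start in range(first_gap, min(first_gap + window, sent_len)):
        if (coverage >> start) & 1:
            continue                                   # word already covered
        for end in range(start + 1, min(start + max_phrase_len, sent_len) + 1):
            span_bits = ((1 << (end - start)) - 1) << start
            if coverage & span_bits:
                break                                  # would translate a covered word twice
            yield start, end
            # extending a hypothesis: new_coverage = coverage | span_bits

# Example: words 0 and 1 covered (binary 00011), sentence of 5 words
print(list(expansion_spans(0b00011, 5)))
```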

46
Decoding with Inversion Transduction Grammar (ITG)
  • Translation model: phrase-to-phrase translation
  • May include lexicalized reordering probabilities
  • Grammar: X -> <F1 F2, E1 E2>, X -> <F1 F2, E2 E1>, X -> <f, e>

47
Combine Adjacent Edges
  • Take adjacent edges el and er and create a new edge e
  • e.FromNode = el.FromNode
  • e.ToNode = er.ToNode
  • e.Translation = el.Translation + er.Translation

[Lattice figure: source "ich komme morgen zu dir" with phrase translation edges, now including the combined edge "I will come tomorrow"]
hl: c = (0,2), t = "I will come"
hr: c = (2,3), t = "tomorrow"
h:  c = (0,3), t = "I will come tomorrow"
48
And Allow For Reordering
  • Create an additional edge
  • e.FromNode = el.FromNode
  • e.ToNode = er.ToNode
  • e.Translation = er.Translation + el.Translation

[Lattice figure: same source "ich komme morgen zu dir", with the additional inverted edge "tomorrow I will come"]
hl: c = (0,2), t = "I will come"
hr: c = (2,3), t = "tomorrow"
h:  c = (0,3), t = "tomorrow I will come"
49
Chart-Decoder for Simple ITG
  • Recall: simple ITG = binary tree
  • Word reordering: straight and inverted subtrees
  • Allows long-distance reordering: first -> last word, last -> first word
  • Generation of partial hypotheses
  • Initialize with phrase translations
  • Combine adjacent areas into longer translations
  • Allow for swaps
  • Requires a different organization of the decoder (see the chart sketch below)
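
A rough chart-decoder sketch for simple ITG that initializes spans with phrase translations and combines adjacent spans in straight and inverted order. It ignores the LM entirely (the following slides address LM integration), so the two orders get the same cost here; the data layout is assumed.

```python
def itg_chart_decode(phrase_options, sent_len):
    """CKY-style chart combination with straight and inverted order (simple ITG).

    phrase_options: dict (l, r) -> list of (cost, translation) phrase pairs
                    for the source span l .. r-1
    Returns a chart mapping each span to its best (cost, translation);
    chart[(0, sent_len)] covers the whole sentence.
    """
    chart = {}
    for l in range(sent_len):                             # initialize with phrase translations
        for r in range(l + 1, sent_len + 1):
            options = phrase_options.get((l, r), [])
            chart[(l, r)] = min(options) if options else None
    for length in range(2, sent_len + 1):                 # combine adjacent areas, bottom-up
        for l in range(sent_len - length + 1):
            r = l + length
            best = chart[(l, r)]
            for m in range(l + 1, r):
                left, right = chart[(l, m)], chart[(m, r)]
                if left is None or right is None:
                    continue
                cost = left[0] + right[0]
                for cand in ((cost, left[1] + " " + right[1]),    # straight
                             (cost, right[1] + " " + left[1])):   # inverted (swap)
                    if best is None or cand[0] < best[0]:
                        best = cand
            chart[(l, r)] = best
    return chart
```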

50
Chart Decoder
51
LM in Chart-Based Translation
  • Language model states on both sides
  • History has not been seen
  • Combine h(0,2) = "a b c" with h(2,5) = "d e" to give h(0,5) = "a b c d e"
  • Calculated was p(a) · p(b|a) · p(c|a b) and p(d) · p(e|d)
  • But now needed: p(d|b c), p(e|c d)
  • Partly undo the calculation
  • subtract the wrong log probs p(d), p(e|d)
  • add the correct log probs p(d|b c), p(e|c d)
  • For short extensions, just extend from the left hypothesis
  • For long extensions, it is faster to correct the LM score (see the correction sketch below)
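
A sketch of that correction for an n-gram LM, assuming an lm_logprob(word, context) lookup; it subtracts the boundary terms scored without history and adds them back with the full cross-boundary context.

```python
def lm_join_correction(lm_logprob, left_words, right_words, order=3):
    """Log-prob correction when concatenating two chart hypotheses.

    lm_logprob(word, context): assumed n-gram LM lookup (context is a tuple)
    The first (order-1) words of the right hypothesis were originally scored
    without the history from the left hypothesis.
    """
    combined = list(left_words) + list(right_words)
    delta = 0.0
    for i in range(min(order - 1, len(right_words))):
        old_ctx = tuple(right_words[max(0, i - order + 1):i])           # history inside the right hyp only
        new_ctx = tuple(combined[:len(left_words) + i][-(order - 1):])  # full history across the boundary
        delta += lm_logprob(right_words[i], new_ctx) - lm_logprob(right_words[i], old_ctx)
    return delta
```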

52
Effect of Reordering
  • Arabic dev-test set (203 sentences)
  • Chinese test set 2002 (878 sentences)
  • Reordering mainly improves fluency, i.e. a stronger effect on Bleu
  • Improvement for Arabic: 4.8 (NIST) and 12.7 (Bleu)
  • Less improvement for Chinese: 5 in Bleu

53
Effect of Reordering
Arabic/English translation
54
Effect of Reordering
55
Hierarchical Decoding
  • Translation model: phrase pairs with holes (phrases of phrases)
  • Consider hierarchical phrase pairs as translation rules
  • Decoding is CYK parsing: find the optimal synchronous parse tree (a rule-application sketch follows below)
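
A toy illustration of applying a hierarchical rule with one nonterminal gap during chart decoding; the rule itself ("ne X pas" -> "do not X") and the data layout are hypothetical.

```python
# Illustrative only: a hierarchical phrase pair with one nonterminal gap.
RULE = {"src": ("ne", "X", "pas"), "tgt": ("do", "not", "X")}

def apply_rule(rule, gap_translation_words):
    """Substitute the gap's translation into the rule's target side.

    In a CYK-style hierarchical decoder, gap_translation_words comes from
    the chart cell of the smaller source span that the nonterminal X covers.
    """
    out = []
    for w in rule["tgt"]:
        if w == "X":
            out.extend(gap_translation_words)    # plug in the sub-derivation
        else:
            out.append(w)
    return out

print(apply_rule(RULE, ["go"]))                  # -> ['do', 'not', 'go']
```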

56
Hierarchical Decoding (no LM)
57
Decoding as Parsing (Hiero)
58
SMT Decoder for Mobile Devices
  • Mobile speech translators
  • Fast (close to real-time) speech translation
  • Domain-limited, but should not be limited to pre-recorded sentences
  • Two-way translation
  • Challenges
  • Weaker CPU (e.g. iPhone 3GS: 600 MHz)
  • Tiny RAM: a few MB, up to 256 MB
  • No numerical co-processors
  • Pandora decoder
  • Minimum on-device computing
  • Integerized computation
  • Compact data structures

59
Summary
  • Decoder
  • Generates the translation lattice
  • Finds the best path
  • Limited word reordering
  • Generation of the N-best list
  • Especially used for tuning the system
  • May also be used for downstream NLP modules
  • Tuning of the system
  • Find the optimal set of scaling factors
  • Done on the n-best list for speed
  • Direct minimization of any MT eval metric