1
Decoding Algorithms for Statistical Machine
Translation
  • Dr. Joy Ying Zhang
  • Carnegie Mellon University

2
Carnegie Mellon University Silicon Valley
3
CMU SV Campus
4
In a few years
5
Outline
  • Overview
  • Monotone decoder
  • Decoding with reordering
  • Jumping window
  • Decoding with ITG
  • Hierarchical decoder
  • Decoder for mobile devices

6
Phrase-based SMT
7
Decoding is NP Complete
  • Even the simplest decoding algorithm is NP-complete: the complexity is exponential in the sentence length, just like the Travelling Salesman Problem (TSP) [Knight et al. 99]

8
Decoding as TSP
  • In a word-to-word translation model
  • Choosing the next source word to translate is like choosing the next city to visit
  • Choosing the target translation is like choosing which hotel to stay at in a city
  • The optimal translation corresponds to the optimal city/hotel choices
  • We can only afford a suboptimal solution
  • Let's start with the simplest one

9
Monotone Decoding
  • No reordering is allowed; decoding proceeds from left to right
  • Apply the translation model over the test sentence to build up a lattice
  • Search the lattice for the best path given all knowledge sources (translation model, language model, sentence length model, ...)

10
Monotone Decoding
  • Traverse the lattice from left to right
  • Build partial translation hypotheses for each node (what are good translations up to this source position)
  • Output the best hypothesis that covers the complete sentence as the final translation

11
Probability/Score of a Partial Hyp
  • Depends on the models used in the decoder
  • Translation model scores under the independence assumption
  • E.g. P(e1...en | f1...fm) ≈ P(e1..e3 | f1..f4) · P(e4..e5 | f5..f6) · ...
  • Language model: P(e1...en)
  • Sentence length model: score(n | source length)
  • Distortion model
  • And be creative ... (one possible score combination is sketched below)
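
As a rough illustration of how such knowledge sources could be combined (not the decoder's actual code), the sketch below sums weighted log scores; the feature names and weights are made up for the example.

```python
import math

def partial_hyp_score(phrase_pairs, lm_logprob, weights):
    """Toy log-linear score of a partial hypothesis (higher is better).

    phrase_pairs: list of (src_phrase, tgt_phrase, tm_logprob) used so far
    lm_logprob:   LM log-probability of the target words produced so far
    weights:      hypothetical scaling factors for each model
    """
    tm = sum(logp for _, _, logp in phrase_pairs)          # independence assumption over phrase pairs
    n_tgt_words = sum(len(tgt.split()) for _, tgt, _ in phrase_pairs)
    return (weights["tm"] * tm
            + weights["lm"] * lm_logprob
            + weights["word_bonus"] * n_tgt_words)         # simple "the more the better" length bonus

# Example: one phrase pair translated so far
score = partial_hyp_score([("ich komme", "I will come", math.log(0.4))],
                          lm_logprob=math.log(0.01),
                          weights={"tm": 1.0, "lm": 0.8, "word_bonus": 0.1})
```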

12
Sentence Length Model
  • Different languages have different levels of wordiness
  • A histogram over source sentence length vs. target sentence length shows that the distribution is rather flat -> p(J | I) is not very helpful
  • Very simple sentence length model: the more the better
  • i.e. give a bonus for each word (not a probabilistic model)
  • Balances the shortening effect of the LM
  • Can be applied immediately, as absolute length is not important
  • However, this is insensitive to what's in the sentence
  • Optimizes the length of translations for the entire test set, not for each sentence
  • Long sentences are made longer to compensate for sentences which are too short

13
Partial Hypotheses Recombination
  • For each source word and phrase, there are |t| translation alternatives
  • If we simply combine them, the final node will have |t|^J hyps to explore
  • However, many partial hyps are not distinguishable to the decoder models
  • If using only the TM and a 3-gram LM:
  • I will come to office
  • I came to office

14
Recombination of Hypotheses
  • Recombination: of two hypotheses, keep only the better one if no future information can switch their current ranking
  • Notice this depends on the models
  • Model score depends on the current partial translation and the extension, e.g. LM
  • Model score depends on global features known only at the sentence end, e.g. sentence length model
  • The models define equivalence classes for the hypotheses
  • Expand only the best hypothesis in each equivalence class

15
Recombination of Hypotheses Example
  • TM and n-gram LM only
  • Hypotheses
  • H1 I would like to go
  • H2 I would not like to go
  • Assume as possible expansions Eto the movies
    to the cinema and watch a film
  • LMscore is identical for H1Expansion as for
    H2Expansion for bi, tri, four-gram LMs
  • E.g 3-gram LMscore Expansion 1 is-logpr( to
    to go ) - logpr( the go to ) logpr( movies
    to the)
  • Therefore Cost(H1) lt Cost(H2) gt Cost(H1E) lt
    Cost(H2E) for all possible expansions E
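
A minimal Python sketch of recombination under these assumptions (TM plus n-gram LM): the equivalence class is the source coverage together with the last n-1 target words. The Hyp record and its fields are illustrative, not the decoder's real data structures.

```python
from collections import namedtuple

# Minimal hypothesis record: covered source positions, target words, model score
Hyp = namedtuple("Hyp", "coverage target score")

def recombine(hyps, lm_order=3):
    """Keep only the best hypothesis per equivalence class.

    Two hypotheses are recombined if no future information can change
    their relative ranking: with a TM and an n-gram LM this means the
    same source coverage and the same last (n-1) target words.
    """
    best = {}
    for h in hyps:
        state = (h.coverage, tuple(h.target[-(lm_order - 1):]))
        if state not in best or h.score > best[state].score:
            best[state] = h   # the loser is only needed as a back pointer for n-best output
    return list(best.values())

# H1/H2 from the slide share the LM state ("to", "go") and are recombined
hyps = [Hyp(5, ("I", "would", "like", "to", "go"), -4.2),
        Hyp(5, ("I", "would", "not", "like", "to", "go"), -5.0)]
print(recombine(hyps))   # keeps only H1
```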

16
Beam Pruning
  • Still a lot of partial hyps to explore, even after recombination, for each node in the lattice (source sentence up to this position)
  • To a not-so-good partial hyp: "Sorry, we don't give you any more chances since you failed this mid-term"
  • Prune H if it is not among the top B best hyps --- beam size pruning
  • Prune H if its score is lower than factor × best score --- beam factor pruning
  • Pruning reduces the number of partial hyps to explore -> faster decoding
  • But it eliminates hyps that might become good translations later on (a sketch of both pruning styles follows below)
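
Both pruning styles can be sketched in a few lines; the beam size and factor below are illustrative defaults, and hypotheses are assumed to carry log-domain scores.

```python
import math

def beam_prune(scored_hyps, beam_size=100, beam_factor=1e-3):
    """Apply both pruning styles from the slide (thresholds are illustrative).

    scored_hyps: list of (log_score, hypothesis) pairs for one lattice node
    - beam-size pruning: keep at most the top beam_size hypotheses
    - beam-factor pruning: drop hypotheses whose probability is below
      beam_factor * probability of the best hypothesis
    """
    if not scored_hyps:
        return scored_hyps
    kept = sorted(scored_hyps, key=lambda sh: sh[0], reverse=True)[:beam_size]
    threshold = kept[0][0] + math.log(beam_factor)   # log-domain version of factor * best
    return [sh for sh in kept if sh[0] >= threshold]
```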

17
Beam Pruning
18
Rest-Cost Estimation
  • In pruning we compare hyps which are not strictly equivalent under the models
  • Risk: we prefer hypotheses which have covered the easy parts
  • Remedy: estimate the remaining cost for each hypothesis
  • Want to know the minimum expected cost (similar to A* search)
  • Gives a bound for pruning
  • However, not possible with acceptable effort
  • Want to include as many models as possible
  • Translation model costs, word count, phrase count
  • Language model costs
  • Distortion model costs
  • Calculate the expected cost R(l, r) for each span (l, r)

19
Rest Cost for Translation Models
  • Translation model, word count and phrase count features are local costs
  • They depend only on the current phrase pair
  • Strictly additive: R(l, m) + R(m, r) = R(l, r)
  • Minimize over alternative translations
  • For each source phrase span (l, r), initialize with the cost of the best translation
  • Combine adjacent spans, take the best combination (see the sketch below)
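
A possible implementation of this span combination, assuming a precomputed table of best single-phrase costs per source span (the dictionary layout is an assumption of the sketch):

```python
def tm_rest_cost(best_phrase_cost, sent_len):
    """Rest cost R(l, r) for the local (TM, word count, phrase count) features.

    best_phrase_cost: dict (l, r) -> cost of the best single phrase pair
                      covering source words l .. r-1 (absent if none exists)
    Spans are initialized with the best phrase translation, then adjacent
    spans are combined, keeping the best split point.
    """
    INF = float("inf")
    R = {(l, r): best_phrase_cost.get((l, r), INF)
         for l in range(sent_len) for r in range(l + 1, sent_len + 1)}
    for length in range(2, sent_len + 1):              # shorter spans are ready first
        for l in range(sent_len - length + 1):
            r = l + length
            for m in range(l + 1, r):                  # R(l, r) <= R(l, m) + R(m, r)
                R[(l, r)] = min(R[(l, r)], R[(l, m)] + R[(m, r)])
    return R
```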

20
Rest Cost for Language Models
  • We do not have the history -> only an approximation
  • For each span (l, r) calculate the LM score without history
  • Combine LM scores for adjacent spans
  • Notice: p(e1 ... em) · p(em+1 ... en) != p(e1 ... en) beyond a 1-gram LM
  • Alternative: fast monotone decoding with the TM-best translations
  • History available
  • Then R(l, r) = R(1, r) - R(1, l)

21
Rest Cost for Distance-Based DM
  • The distance-based DM rest cost depends on the coverage pattern
  • Too many different coverage patterns; cannot pre-calculate
  • Estimate by jumping to the first gap, then filling the gaps in sequence
  • [Moore & Quirk 2007]: DM cost plus rest cost

[Figure: S = current phrase, S' = previous phrase, S'' = gap-free initial segment; L(.) = length of a phrase, D(.,.) = distance between phrases]
S adjacent to S':      d = 0
S left of S'':         d = 2 L(S)
S subsequence of S'':  d = 2 (D(S, S'') + L(S))
Otherwise:             d = 2 (D(S, S') + L(S))
22
Rest Cost for Lexicalized DM
  • Lexicalized DM: per phrase
  • DM(f, e) scores: in-mon, in-swap, in-dist, out-mon, out-swap, out-dist
  • Treat as a local cost for each span (l, r)
  • Minimize over alternative translations and different orientations (in- and out-)

23
Effect of Rest-Cost Estimation
  • From Richard Zens 2008
  • LM is important, DM is important

24
Output Best Translation
  • The optimal hypothesis is in the last node of the lattice
  • We need to keep the back pointers

25
Monotone Decoding Algorithm
  • Apply the TM on sentence f1 ... fJ
  • For j = 1 to J
  • For each incoming edge e that enters node j
  • Edge e: i -> j
  • For each partial hyp h in node i
  • Extend h with edge e
  • Estimate the hyp prob/score for h+e
  • Store <h+e, prob/score, back pointer to h> in node j
  • Prune the partial hyps in node j
  • In node J, find the best hyp
  • Follow the back pointers and output the final translation (a Python sketch of this loop follows below)
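
A condensed Python sketch of this loop. The lattice layout (edges_into), the lm_cost helper and the hypothesis tuples are assumptions of the sketch, not the actual decoder interface.

```python
def monotone_decode(edges_into, J, lm_cost, prune):
    """Monotone decoding loop from the slide, as a rough sketch.

    edges_into[j]: list of (i, tgt_phrase, tm_cost) edges covering source words i .. j-1
    lm_cost(prev_words, tgt_phrase): assumed helper returning the LM cost of the extension
    prune(hyps): any pruning function over a list of hypotheses
    A hypothesis is a tuple (cost, target_words, back_pointer); lower cost is better.
    """
    nodes = {0: [(0.0, (), None)]}                       # empty hypothesis at node 0
    for j in range(1, J + 1):
        hyps = []
        for (i, tgt, tm_cost) in edges_into.get(j, []):  # incoming edges i -> j
            for h in nodes.get(i, []):
                cost, words, _ = h
                hyps.append((cost + tm_cost + lm_cost(words, tgt),
                             words + tuple(tgt.split()),
                             h))                         # back pointer to the extended hyp
        nodes[j] = prune(hyps)
    best = min(nodes[J], key=lambda h: h[0])             # best hyp in the last node (assumes it exists)
    return " ".join(best[1])
```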

26
Output N-best List
  • When traversing back from the last node, the decoder can output the top N hyps for the whole sentence: the N-best list
  • Model scores do not correlate well with external scores such as BLEU
  • In a 1000-best list, the hyp with the highest BLEU ranks, on average, at position 489.38 according to the model scores

27
N-Best List
28
N-Best Rescoring
  • Generate an n-best list
  • Use a different TM and/or LM to rescore each translation -> reordering of the translations, i.e. a different best translation (a rescoring sketch follows below)
  • Different TMs
  • Use the IBM1 lexicon for the entire translation
  • Use HMM-FB and IBM4 lexicons
  • Forced alignment with the HMM alignment model
  • Different LMs
  • Very large LM (Distributed Language Model)
  • Link grammar (too slow)
  • Other syntax-based LMs, e.g. Charniak's parser?
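
A minimal sketch of rescoring with one additional model; the extra_lm_logprob function and the single interpolation weight are placeholders for whatever new TM/LM features are actually used.

```python
def rescore_nbest(nbest, extra_lm_logprob, weight=1.0):
    """Re-rank an n-best list with one additional model.

    nbest: list of (translation, base_model_score) pairs from the decoder
    extra_lm_logprob: function returning the new model's log-probability
                      for a full translation (e.g. a much larger LM)
    Returns the list sorted by the combined score.
    """
    rescored = [(trans, base + weight * extra_lm_logprob(trans))
                for trans, base in nbest]
    return sorted(rescored, key=lambda x: x[1], reverse=True)
```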

29
Problem with N-Best Generation
  • Duplicates from different transducers
  • @Lex A B 0.5
  • @ISA A B 0.7
  • -> Two identical translations with different scores, or even the same score (when rescoring all translations with the same lexicon)
  • Spurious ambiguities
  • us companies and other institutions
  • us companies and other institutions
  • us companies and other institutions
  • us companies and other institutions
  • . . .
  • Example run: 1000 n-best -> 400 different strings on average; extreme case: only 10 unique strings
  • Possible solution: check uniqueness during backtracking (see the sketch below)
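
One way such a uniqueness check could look (the candidate format is assumed; a real decoder would do this while backtracking through the lattice):

```python
def unique_nbest(candidates, n):
    """Collect the top-n *distinct* translation strings.

    candidates: (translation, score) derivations in descending score order;
    different derivations may yield the same surface string.
    """
    seen, out = set(), []
    for translation, score in candidates:
        if translation not in seen:        # skip duplicates from spurious ambiguity
            seen.add(translation)
            out.append((translation, score))
        if len(out) == n:
            break
    return out
```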

30
Oracle Score of N-best List
31
Using Distributed LM for Reranking Systems
  • Large training data available
  • Distributed computing clusters
  • Distributed language modeling (Zhang and Vogel, 2006; Emami, 2007; Brants et al., 2007)

32
Rerank the N-Best List using LM Features
33
Rerank N-best List
34
Rerank N-best List
35
Rerank N-best List
  • Considering long-distance dependencies

36
Reranking N-best List
37
Tuning the SMT System
  • We use different models in the SMT system
  • Models have simplifications
  • Trained on different amounts of data
  • => Different levels of reliability
  • => Give different weights to the different models: Q = c1 Q1 + c2 Q2 + ... + cn Qn
  • Find the optimal scaling factors c1 ... cn
  • Optimal means: highest score for the chosen evaluation metric

38
Automatic Tuning
  • Many algorithms to find (near) optimal solutions are available
  • Simplex
  • Maximum entropy
  • Minimum error training
  • Minimum Bayes risk training
  • Genetic algorithms
  • Note: the models are not improved, only their combination
  • A large number of full translations is required => still problematic when decoding is slow

39
Automatic Tuning on N-best List
  • Generate n-best lists, e.g. for each of 500 source sentences 1000 translations
  • Loop
  • Changing the scaling factors results in re-ranking the n-best lists (see the sketch below)
  • Evaluate the new 1-best translations
  • Apply any of the standard optimization techniques
  • Advantage: much faster
  • Can pre-calculate the counts (e.g. n-gram matches) for each translation to speed up evaluation
  • For the Bleu or NIST metric with a global length penalty, do local hill climbing for each individual n-best list
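
A sketch of the inner step: re-rank cached n-best lists under new scaling factors and return the 1-best translations, which an outer optimizer then scores with the chosen metric. The data layout is an assumption of the sketch.

```python
def one_best_under_weights(nbest_lists, weights):
    """Pick the 1-best translation of each sentence under given scaling factors.

    nbest_lists: per sentence, a list of (translation, feature_vector) pairs,
                 where feature_vector holds the pre-computed model scores Q_k
    weights:     scaling factors c_1 ... c_n
    """
    def combined(feats):
        return sum(c * q for c, q in zip(weights, feats))
    return [max(nbest, key=lambda tf: combined(tf[1]))[0] for nbest in nbest_lists]

# An outer optimizer (simplex, MERT, ...) repeatedly calls
# one_best_under_weights with new weights and scores the output with BLEU/NIST.
```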

40
Minimum Error Training
  • For each scaling factor we have Q = ck Qk + QRest
  • For different values, different hyps have the lowest score
  • Different hyps lead to different MT eval scores (see the sweep sketch below)
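
Illustrative only: since Q is linear in ck, the 1-best hypothesis changes at a finite number of crossing points; exact minimum error training enumerates those points, while the sketch below simply tries a grid of values for one factor, reusing one_best_under_weights from the previous sketch.

```python
def sweep_one_weight(nbest_lists, weights, k, grid, error):
    """Try several values for one scaling factor c_k and keep the best.

    error: function mapping a list of 1-best translations to an error score
           (e.g. 1 - BLEU). A grid search is an approximation; exact MERT
           would enumerate the crossing points of the linear score functions.
    """
    best_val, best_err = weights[k], float("inf")
    for value in grid:
        trial = list(weights)
        trial[k] = value
        err = error(one_best_under_weights(nbest_lists, trial))
        if err < best_err:
            best_val, best_err = value, err
    weights[k] = best_val
    return weights
```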

41
Decoding with Reordering
  • Languages have different word orders
  • 1澳洲/Australia 2是/is 与/with 3北韩/North Korea 4有/has 5邦交/diplomatic relationship 6的/of 7少数/a few 8国家/countries 9之一/one of
  • Australia is one of the few countries that have diplomatic relationship with North Korea
  • To generate the right English translation, we need to translate the source in the order 1 2 9 6 7 8 4 5
  • Reordering: either change the order in which to translate the source, or equivalently re-arrange the partial translations
  • Knowledge sources
  • Reordering models
  • Language models
  • Syntax

42
Reordering Strategies
  • All permutations
  • Any re-ordering possible
  • Complexity of the traveling salesman problem -> only possible for very short sentences
  • Small jumps ahead, filling in the gaps fairly soon
  • Only local word reordering
  • Implemented in the current decoder
  • Leaving a small number of gaps, filled in at any time
  • Allows for global but limited reordering
  • Similar decoding complexity: exponential in the number of gaps
  • IBM-style reordering (described in an IBM patent)
  • Merging neighboring regions with swaps, no gaps at all
  • Allows for global reordering
  • Complexity lower than for strategy 1, but higher than for 2 and 3

43
Decoding with Reordering Window
  • Word and phrase reordering within a given window
  • From the first untranslated source word: the next k positions
  • Window length 1: monotone decoding
  • Restrict the total number of reorderings (typically 3 per 10 words)
  • Simple jump model
  • One reordering typically involves two jumps
  • The jump distance D depends on the gap and also on the phrase length; the distance is measured from center of phrase to center of phrase
  • Simple Gaussian-like distribution: p(D) ∝ exp(-|D - 1|)
  • Lexicalized jump model

44
Jumping ahead in the Lattice
  • Hypothesis describes a partial translation
  • Coverage information, Back-trace information,
    Score
  • Expand hypothesis over uncovered position

[Lattice figure: source "ich komme morgen zu dir" with phrase translation edges such as "I", "come", "I come", "I will come", "tomorrow", "to", "you", "to your office"]
h: c = 11000, t = "I will come"
h: c = 11011, t = "I will come to your office"
h: c = 11111, t = "I will come to your office tomorrow"
45
Word Order Coverage Info
  • Need to know which source words have already been translated
  • Don't want to miss any words
  • Don't want to translate words twice
  • Can compare hypotheses which cover the same words
  • Use a coverage vector to store this information
  • Essentially a bit vector (see the sketch below)
  • For small jumps ahead: position of the first gap plus a short bit vector
  • For a small number of gaps: array of positions of uncovered words
  • For merging neighboring regions: left and right position
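
A small sketch of a bit-vector coverage representation and of enumerating the next spans to translate inside a reordering window; the window and phrase-length limits are illustrative.

```python
def expansion_spans(coverage, sent_len, window=4, max_phrase_len=3):
    """Enumerate source spans that a hypothesis may translate next.

    coverage: integer bit vector; bit j set means source word j is translated
    window:   reordering window, counted from the first untranslated word
    Yields (start, end) spans of untranslated words, end exclusive.
    """
    first_gap = next(j for j in range(sent_len + 1)
                     if j == sent_len or not (coverage >> j) & 1)
    for start in range(first_gap, min(first_gap + window, sent_len)):
        if (coverage >> start) & 1:
            continue                                   # word already covered
        for end in range(start + 1, min(start + max_phrase_len, sent_len) + 1):
            span_bits = ((1 << (end - start)) - 1) << start
            if coverage & span_bits:
                break                                  # would translate a covered word twice
            yield start, end
            # extending a hypothesis: new_coverage = coverage | span_bits

# Example: words 0 and 1 covered (binary 00011), sentence of 5 words
print(list(expansion_spans(0b00011, 5)))
```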

46
Decoding with Inversion Transduction Grammar (ITG)
  • Translation model: phrase-to-phrase translation
  • May include lexicalized reordering probabilities
  • Grammar: X -> <F1 F2, E1 E2>, X -> <F1 F2, E2 E1>, X -> <f, e>

47
Combine Adjacent Edges
  • Take adjacent edges el and er and create a new edge e
  • e.FromNode = el.FromNode
  • e.ToNode = er.ToNode
  • e.Translation = el.Translation + er.Translation

[Lattice figure: source "ich komme morgen zu dir" with phrase translation edges, now including the combined edge "I will come tomorrow"]
hl: c = (0,2), t = "I will come"
hr: c = (2,3), t = "tomorrow"
h:  c = (0,3), t = "I will come tomorrow"
48
And Allow For Reordering
  • Create an additional edge
  • e.FromNode = el.FromNode
  • e.ToNode = er.ToNode
  • e.Translation = er.Translation + el.Translation

[Lattice figure: same source "ich komme morgen zu dir", with the additional inverted edge "tomorrow I will come"]
hl: c = (0,2), t = "I will come"
hr: c = (2,3), t = "tomorrow"
h:  c = (0,3), t = "tomorrow I will come"
49
Chart-Decoder for Simple ITG
  • Recall: simple ITG = binary tree
  • Word reordering: straight and inverted subtrees
  • Allows long-distance reordering: first -> last word, last -> first word
  • Generation of partial hypotheses
  • Initialize with phrase translations
  • Combine adjacent areas into longer translations
  • Allow for swaps
  • Requires a different organization of the decoder (see the chart sketch below)
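
A rough chart-decoder sketch for simple ITG that initializes spans with phrase translations and combines adjacent spans in straight and inverted order. It ignores the LM entirely (the following slides address LM integration), so the two orders get the same cost here; the data layout is assumed.

```python
def itg_chart_decode(phrase_options, sent_len):
    """CKY-style chart combination with straight and inverted order (simple ITG).

    phrase_options: dict (l, r) -> list of (cost, translation) phrase pairs
                    for the source span l .. r-1
    Returns a chart mapping each span to its best (cost, translation);
    chart[(0, sent_len)] covers the whole sentence.
    """
    chart = {}
    for l in range(sent_len):                             # initialize with phrase translations
        for r in range(l + 1, sent_len + 1):
            options = phrase_options.get((l, r), [])
            chart[(l, r)] = min(options) if options else None
    for length in range(2, sent_len + 1):                 # combine adjacent areas, bottom-up
        for l in range(sent_len - length + 1):
            r = l + length
            best = chart[(l, r)]
            for m in range(l + 1, r):
                left, right = chart[(l, m)], chart[(m, r)]
                if left is None or right is None:
                    continue
                cost = left[0] + right[0]
                for cand in ((cost, left[1] + " " + right[1]),    # straight
                             (cost, right[1] + " " + left[1])):   # inverted (swap)
                    if best is None or cand[0] < best[0]:
                        best = cand
            chart[(l, r)] = best
    return chart
```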

50
Chart Decoder
51
LM in Chart-Based Translation
  • Language model states on both sides
  • History has not been seen
  • Combine h(0,2) = "a b c" with h(2,5) = "d e" to give h(0,5) = "a b c d e"
  • Calculated was p(a) · p(b|a) · p(c|a b) and p(d) · p(e|d)
  • But now needed: p(d|b c), p(e|c d)
  • Partly undo the calculation
  • subtract the wrong log probs p(d), p(e|d)
  • add the correct log probs p(d|b c), p(e|c d)
  • For short extensions, just extend from the left hypothesis
  • For long extensions, it is faster to correct the LM score (see the correction sketch below)
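
A sketch of that correction for an n-gram LM, assuming an lm_logprob(word, context) lookup; it subtracts the boundary terms scored without history and adds them back with the full cross-boundary context.

```python
def lm_join_correction(lm_logprob, left_words, right_words, order=3):
    """Log-prob correction when concatenating two chart hypotheses.

    lm_logprob(word, context): assumed n-gram LM lookup (context is a tuple)
    The first (order-1) words of the right hypothesis were originally scored
    without the history from the left hypothesis.
    """
    combined = list(left_words) + list(right_words)
    delta = 0.0
    for i in range(min(order - 1, len(right_words))):
        old_ctx = tuple(right_words[max(0, i - order + 1):i])           # history inside the right hyp only
        new_ctx = tuple(combined[:len(left_words) + i][-(order - 1):])  # full history across the boundary
        delta += lm_logprob(right_words[i], new_ctx) - lm_logprob(right_words[i], old_ctx)
    return delta
```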

52
Effect of Reordering
  • Arabic dev-test set (203 sentences)
  • Chinese test set 2002 (878 sentences)
  • Reordering mainly improves fluency, i.e. a stronger effect on Bleu
  • Improvement for Arabic: 4.8 (NIST) and 12.7 (Bleu)
  • Less improvement for Chinese: 5 in Bleu

53
Effect of Reordering
Arabic/English translation
54
Effect of Reordering
55
Hierarchical Decoding
  • Translation model: phrase pairs with holes (phrases of phrases)
  • Consider hierarchical phrase pairs as translation rules
  • Decoding is CYK parsing: find the optimal synchronous parse tree (a rule-application sketch follows below)
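
A toy illustration of applying a hierarchical rule with one nonterminal gap during chart decoding; the rule itself ("ne X pas" -> "do not X") and the data layout are hypothetical.

```python
# Illustrative only: a hierarchical phrase pair with one nonterminal gap.
RULE = {"src": ("ne", "X", "pas"), "tgt": ("do", "not", "X")}

def apply_rule(rule, gap_translation_words):
    """Substitute the gap's translation into the rule's target side.

    In a CYK-style hierarchical decoder, gap_translation_words comes from
    the chart cell of the smaller source span that the nonterminal X covers.
    """
    out = []
    for w in rule["tgt"]:
        if w == "X":
            out.extend(gap_translation_words)    # plug in the sub-derivation
        else:
            out.append(w)
    return out

print(apply_rule(RULE, ["go"]))                  # -> ['do', 'not', 'go']
```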

56
Hierarchical Decoding (no LM)
57
Decoding as Parsing (Hiero)
58
SMT Decoder for Mobile Devices
  • Mobile speech translators
  • Fast (close to real-time) speech translation
  • Domain-limited, but should not be limited to pre-recorded sentences
  • Two-way translation
  • Challenges
  • Weaker CPU (e.g. iPhone 3GS: 600 MHz)
  • Tiny RAM: a few MB, up to 256 MB
  • No numerical co-processors
  • Pandora decoder
  • Minimum on-device computing
  • Integerized computation
  • Compact data structures

59
Summary
  • Decoder
  • Generates the translation lattice
  • Finds the best path
  • Limited word reordering
  • Generation of the N-best list
  • Especially used for tuning the system
  • May also be used for downstream NLP modules
  • Tuning of the system
  • Find the optimal set of scaling factors
  • Done on the n-best list for speed
  • Direct minimization of any MT eval metric