Title: Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation
1. Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation
- Marine CARPUAT and Dekai WU
- Human Language Technology Center
- Department of Computer Science and Engineering
- HKUST
2. New resources for SMT: context-dependent phrasal translation lexicons
- A key new resource: Phrase Sense Disambiguation (PSD) for SMT (Carpuat & Wu 2007)
  - entirely automatically acquired
  - consistently improves 8 translation quality metrics (EMNLP 2007)
  - fully phrasal, just like conventional SMT lexicons (TMI 2007)
  - but much larger than conventional lexicons!
- Why is this extremely large resource necessary?
- Is its contribution observably useful?
- Is it used by the SMT system differently than conventional SMT lexicons?
3. Our finding: context-dependent lexicons directly improve lexical choice in SMT
- Exploit the available vocabulary better for phrasal segmentation
  - more and longer phrases are used in decoding
  - consistent with other findings (TMI 2007): fully phrasal context-dependent lexicons yield more reliable improvements than single-word lexicons
- Select better translation candidates
  - even after compensating for differences in phrasal segmentation
  - improvements in BLEU, TER, METEOR, etc. really reflect improved lexical choice
4. Problems with current SMT systems
- Input: [Chinese source sentence]
- Ref.: Prof. Zhang gave a lecture on China and India to a packed audience.
- SMT1: Prof. Zhang to a group of people on China and India class.
- SMT2: Prof. Zhang and a group of people go into class on China and India.
5. Translation lexicons in SMT are independent of context!
[Chinese source sentence]
Prof. Zhang gave a lecture on China and India to a packed audience.
[Chinese source sentence]
Everyone is welcome to attend class tomorrow, on the topic China and India.
6. Phrasal lexicons in SMT are independent of context too!
[Chinese source sentence]
Prof. Zhang gave a lecture on China and India to a packed audience.
[Chinese source sentence]
Everyone is welcome to attend class tomorrow, on the topic China and India.
7. Current SMT systems are hurt by very weak models of context
- Translation disambiguation models are too simplistic
  - phrasal lexicon translation probabilities are static, so not sensitive to context
- Context in the input language is modeled only weakly, by phrase segments
- Context in the output language is modeled only weakly, by n-grams
- Error analysis reveals many lexical choice errors
- Yet there have been few attempts at directly modeling context
8. Today's SMT systems ignore the contextual features that would help lexical choice
- No full sentential context: merely local n-gram context
- No POS information: merely the surface form of words
- No structural information: merely word n-gram identities
9. Correct translation disambiguation requires rich context features
[Chinese source sentence]
Prof. Zhang gave a lecture on China and India to a packed audience.
[Chinese source sentence]
Everyone is welcome to attend class tomorrow, on the topic China and India.
11. Today's SMT systems ignore context in their phrasal translation lexicons
- c_j(f): the entire input sentence context of each occurrence of input phrase f is ignored
12. But context-dependent lexical choice does not necessarily improve translation quality
- Early pilot study (Brown et al. 1991)
  - used the single most discriminative feature to disambiguate between 2 English translations of a French word
  - WSD improved French-English translation quality, but not on a significant vocabulary, and allowing only 2 senses
- Context-dependent lexical choice helps word alignment, but not really translation quality (García Varea et al. 2001, 2002)
  - a maximum-entropy-trained bilexicon replaces the IBM-4/5 translation probabilities
  - improves AER on the Canadian Hansards and Verbmobil tasks
  - small improvements in WER and PER by rescoring n-best lists, but not statistically significant (García Varea & Casacuberta 2005)
13. Context-dependent modeling improves quality of statistical MT (Carpuat & Wu 2007)
- Introduced context-dependent phrasal lexicons for SMT
  - leverage WSD techniques for SMT lexical choice
  - generalize conventional WSD to Phrase Sense Disambiguation
- Context-dependent modeling always improves SMT accuracy
  - on all tasks: 3 different IWSLT06 datasets, NIST04
  - on all 8 common automatic metrics: BLEU, NIST, METEOR, METEOR(synsets), TER, WER, PER, CDER
14. No other WSD-for-SMT approach improves translation quality as consistently
- Until recently, using WSD to improve SMT quality met with mixed or disappointing results
  - Carpuat & Wu ACL-2005; Cabezas & Resnik (unpublished)
- Last year, for the first time, different approaches showed that WSD can help translation quality
  - WSD improved BLEU (but how about other metrics?) on 3 Chinese-English tasks (Carpuat et al. IWSLT-2006)
  - WSD improved BLEU (but how about other metrics?) on the Chinese-English NIST task (Chan et al. ACL-2007)
  - WSD improved METEOR (but not BLEU!) on the Spanish-English Europarl task (Giménez & Màrquez WMT-2007)
  - phrasal WSD improved BLEU, NIST, METEOR (but how about error rates?) on the Italian-English and Chinese-English IWSLT tasks (Stroppa et al. TMI-2007)
- But no other approach improves on 8 metrics on 4 different tasks
15. But how useful are the context-dependent lexicons as resources?
- Improving translation quality is great, but
  - metrics aggregate the impact of many different factors
  - metrics ignore how translation hypotheses are generated
- Context-dependent lexicons are more expensive to train, so
  - are their contributions observably useful?
- Direct analysis needed: how do SMT systems use context-dependent vs. conventional lexicons?
16. Learning context-dependent vs. conventional lexicons for SMT
- Both lexicons are learned from the same word-aligned parallel data
- Both cover the same phrasal input vocabulary
- Both know the same phrasal translation candidates
- The only difference: an additional context-dependent parameter
  - dynamically computed vs. static conventional scores
  - WSD modeling vs. MLE in conventional lexicons
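The static-vs-dynamic contrast can be made concrete. The sketch below (with made-up phrase pairs, not the paper's data) shows the conventional side: a phrase table's p(e|f) estimated once by relative frequency (MLE), so every occurrence of a source phrase gets the same distribution regardless of its sentence. The context-dependent lexicon would instead rescore the same candidates per input sentence.

```python
from collections import Counter, defaultdict

def mle_phrase_table(phrase_pairs):
    """Conventional lexicon: static p(e|f) by relative frequency (MLE)
    over extracted phrase pairs, as in a Pharaoh/Moses phrase table."""
    joint = Counter(phrase_pairs)
    marginal = Counter(f for f, _ in phrase_pairs)
    table = defaultdict(dict)
    for (f, e), count in joint.items():
        table[f][e] = count / marginal[f]
    return table

# Toy phrase pairs (hypothetical pinyin/English): MLE collapses every
# occurrence into one context-independent distribution.
pairs = [("jiang ke", "gave a lecture"),
         ("jiang ke", "gave a lecture"),
         ("jiang ke", "attend class")]
table = mle_phrase_table(pairs)
# p(e|f): 2/3 "gave a lecture", 1/3 "attend class" -- for every sentence
```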
17. Word Sense Disambiguation provides appropriate models of context
- WSD has long targeted the questions of
  - how to design context features
  - how to combine contextual evidence into a sense prediction
- Senseval/SemEval have extensively evaluated WSD systems
  - with different feature sets
  - with different machine learning classifiers
- Senseval multilingual lexical sample tasks
  - use observable lexical translations as senses
  - just like lexical choice in SMT
  - e.g. Senseval-2003 English-Hindi, SemEval-2007 Chinese-English
18. Leveraging a Senseval WSD system
- Top Senseval-3 Chinese Lexical Sample system (Carpuat et al. 2004)
- Standard classification models
  - maximum entropy, SVM, boosted decision stumps, naïve Bayes
- Rich lexical and syntactic features
  - bag-of-words sentence context
  - position-sensitive co-occurring words and POS tags
  - basic syntactic dependency features
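A minimal sketch of the feature types listed above. The feature names and the ±2 window are illustrative assumptions, not the authors' exact configuration, and the syntactic dependency features are omitted here:

```python
def wsd_features(tokens, pos_tags, i):
    """Build a sparse feature dict for the target word at index i:
    bag-of-words sentence context plus position-sensitive neighboring
    words and their POS tags."""
    # bag-of-words context: every other token in the sentence
    feats = {f"bow={w}": 1 for j, w in enumerate(tokens) if j != i}
    # position-sensitive co-occurring words and POS tags (window of 2)
    for off in (-2, -1, 1, 2):
        j = i + off
        if 0 <= j < len(tokens):
            feats[f"w[{off}]={tokens[j]}"] = 1
            feats[f"pos[{off}]={pos_tags[j]}"] = 1
    return feats

feats = wsd_features(["the", "lecture", "was", "packed"],
                     ["DT", "NN", "VBD", "VBN"], 1)
```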
19. Generalizing WSD to PSD for context-dependent phrasal translation lexicons
- One PSD model per input language phrase
  - regardless of POS, length, etc.
- A generalization of standard WSD models
  - sense candidates are the phrase translation candidates seen in training
- The sense candidates are extracted just like the conventional SMT phrasal lexicon
  - typically, output language phrases consistent with the intersection of bidirectional IBM alignments
23. Extracting PSD senses and training examples from word-aligned parallel text
? ?? ?? ?? ? ? ???? ? ? ?
is there a new - age music concert within the next few days ?
Extracted PSD training instances:
? ?? ?? ?? <t sense="within">?</t> ? ???? ? ? ?
? ?? ?? ?? ? ? <t sense="new - age music">????</t> ? ? ?
? <t sense="within the next few days">?? ?? ?? ?</t> ? ???? ? ? ?
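The extraction illustrated above can be sketched as follows. This is an illustrative simplification of the standard consistency check over word-alignment links, not the authors' actual code; every consistent source span yields one PSD instance whose sense label is the aligned target phrase and whose context is the whole source sentence:

```python
def extract_psd_instances(src, tgt, align, max_len=5):
    """Emit (source phrase, sense, context) triples from one sentence pair.
    `align` holds (src_idx, tgt_idx) word-alignment links. A source span
    [i, j] is kept only if its links fall inside one target span and no
    word of that target span aligns back outside [i, j]."""
    instances = []
    for i in range(len(src)):
        for j in range(i, min(i + max_len, len(src))):
            linked = [t for s, t in align if i <= s <= j]
            if not linked:
                continue
            lo, hi = min(linked), max(linked)
            # consistency: the target span must not align outside [i, j]
            if all(i <= s <= j for s, t in align if lo <= t <= hi):
                instances.append((" ".join(src[i:j + 1]),   # PSD target phrase
                                  " ".join(tgt[lo:hi + 1]), # its sense label
                                  tuple(src)))              # sentence context
    return instances

# Toy sentence pair with a monotone 1-to-1 alignment (illustrative only):
src = ["zui", "jin", "ji", "tian"]
tgt = ["the", "next", "few", "days"]
instances = extract_psd_instances(src, tgt, [(0, 0), (1, 1), (2, 2), (3, 3)])
```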
24. Integrating context-dependent lexicons into phrase-based SMT architectures
- The context-dependent phrasal lexicon probabilities
  - are conditional translation probabilities
  - can naturally be added as a feature in log-linear translation models
- Unlike conventional translation probabilities, they are
  - dynamically computed
  - dependent on full-sentence context
- Decoding can make full use of context-dependent phrasal lexicon predictions at all stages of decoding
  - unlike in n-best reranking
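The integration amounts to one extra term in the standard log-linear score. Feature names and values below are made up for illustration; the point is that the PSD feature is recomputed per input sentence while the phrase-table features stay fixed:

```python
import math

def loglinear_score(feature_values, weights):
    """Standard log-linear model score: weighted sum of log feature values."""
    return sum(weights[name] * math.log(v)
               for name, v in feature_values.items())

weights = {"p(e|f)": 1.0, "p(f|e)": 1.0, "lm": 1.0, "psd": 1.0}

# Static phrase-table features for one phrase pair (hypothetical values):
static = {"p(e|f)": 0.4, "p(f|e)": 0.3, "lm": 0.05}
base = loglinear_score(static, weights)

# The PSD feature is recomputed per input sentence, so the same phrase
# pair can score differently in different sentences:
with_psd = loglinear_score({**static, "psd": 0.7}, weights)
```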
25. Evaluating context-dependent phrasal translation lexicons
- Lexical choice only
  - vs. translation quality (Carpuat & Wu EMNLP 2007)
- Integrated evaluation in SMT
  - vs. stand-alone as in Senseval (Carpuat et al. 2004)
- Fully phrasal lexicons only
  - vs. single-word context-dependent lexicons (Carpuat & Wu TMI 2007)
- Translation task
  - test set: NIST-04 Chinese-English text translation
  - 1788 sentences, 4 reference translations
  - standard phrase-based SMT decoder (Moses)
26. Experimental setup: learning the lexicons
- Standard conventional lexicon learning
  - newswire Chinese-English corpus, 2M sentences
  - standard word-alignment methodology: GIZA++
  - intersection using grow-diag heuristics (Koehn et al. 2003)
  - standard Pharaoh/Moses phrase table: maximum phrase length 10; translation probabilities in both directions, lexical weights
- Context-dependent lexicons
  - use the exact same word-aligned parallel data
  - train a WSD model for each known phrase
27. Step 1: evaluating phrasal segmentation with context-dependent vs. conventional lexicons
- Goal: compare the phrasal segmentations of the input sentence used to produce the top hypothesis
- Method
  - we do not evaluate accuracy: there is no gold-standard phrasal segmentation!
  - instead, we analyze how the input phrases available in the lexicons are used
28. SMT uses longer input phrases with context-dependent lexicons
- Context-dependent lexicons help use longer, less ambiguous phrases
29. SMT uses more input phrase types with context-dependent lexicons
- 26% of the phrase types used with the context-dependent lexicon are not used with the conventional lexicon
- 96% of those lexicon entries are truly phrasal (not single words)
- Context-dependent lexicons make better use of the available input language vocabulary
30. SMT uses more rare phrases with context-dependent lexicons
- With context modeling, less training data is needed for phrases to be used
31. Step 2: comparing translation selection
- Goal: compare translation selection only
- Method
  - we compare the accuracy of translation selection for identical segments only, because different lexicons yield different phrasal segmentations
  - a translation is considered accurate if it matches any of the reference translations, because the input sentence and references are not word-aligned
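The matching criterion can be sketched as below. Substring matching against the references is a simplification of what any such protocol must do; the variable names are illustrative, not from the paper:

```python
def selection_accuracy(phrase_choices, references):
    """A phrase translation counts as correct if it occurs in any reference
    translation of the sentence -- a proxy made necessary because input and
    references are not word-aligned."""
    hits = sum(any(choice in ref for ref in refs)
               for choice, refs in zip(phrase_choices, references))
    return hits / len(phrase_choices)

# Two identically-segmented spans, each checked against the same
# hypothetical reference:
refs = ["prof. zhang gave a lecture on china and india to a packed audience ."]
acc = selection_accuracy(["gave a lecture", "go into class"], [refs, refs])
# only "gave a lecture" occurs in the reference, so accuracy is 0.5
```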
32. Context-dependent lexicon predictions match references better
- Context-dependent lexicons yield more matches than conventional lexicons
- 48% of the errors made with conventional lexicons are corrected with context-dependent lexicons

                         Conventional match   Conventional no match
Context-dep. match             1435                  2139
Context-dep. no match           683                  2272
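The 48% figure follows directly from the contingency table above: of the segments where the conventional lexicon fails to match any reference, count the fraction the context-dependent lexicon gets right.

```python
# Counts copied from the contingency table above (rows: context-dependent
# lexicon; columns: conventional lexicon).
both_match = 1435       # both lexicons match a reference
ctx_only_match = 2139   # context-dependent corrects a conventional error
conv_only_match = 683   # conventional matched, context-dependent did not
neither_match = 2272

conventional_errors = ctx_only_match + neither_match
corrected_fraction = ctx_only_match / conventional_errors
# 2139 / 4411, i.e. roughly 48% of conventional-lexicon errors corrected
```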
33. Conclusion: context-dependent phrasal translation lexicons are useful resources for SMT
- A key new resource: Phrase Sense Disambiguation (PSD) for SMT (Carpuat & Wu 2007)
  - entirely automatically acquired
  - consistently improves 8 translation quality metrics (EMNLP 2007)
  - fully phrasal, just like conventional SMT lexicons (TMI 2007)
  - but much larger than conventional lexicons!
- Why is this extremely large resource necessary?
- Is its contribution observably useful?
- Is it used by the SMT system differently than conventional SMT lexicons?
34. Conclusion: context-dependent phrasal translation lexicons are useful resources for SMT
- Improve phrasal segmentation
  - exploit the available input vocabulary better
  - more phrases, longer phrases, and more rare phrases are used in decoding
  - consistent with other findings: fully phrasal context-dependent lexicons yield more reliable improvements than single-word lexicons (Carpuat & Wu TMI 2007)
- Improve translation candidate selection
  - even after compensating for differences in phrasal segmentation
- Genuinely improve lexical choice
  - not just BLEU and other metrics!
35. Evaluation of Context-Dependent Phrasal Translation Lexicons for Statistical Machine Translation
- Marine CARPUAT and Dekai WU
- Human Language Technology Center
- Department of Computer Science and Engineering
- HKUST
36. Translation quality evaluation: not just BLEU, but 8 automatic metrics
- N-gram matching metrics
- BLEU4
- NIST
- METEOR
- METEORsynsets
- augmented with WordNet synonym matching
- Edit distances
- TER
- WER
- PER
- CDER
37. Context-dependent modeling consistently improves translation quality

Test set  Experiment  BLEU   NIST   METEOR  METEOR (no syn)  TER    WER    PER    CDER
IWSLT 1   SMT         42.21  7.888  65.40   63.24            40.45  45.58  37.80  40.09
          SMT+WSD     42.38  7.902  65.73   63.64            39.98  45.30  37.60  39.91
IWSLT 2   SMT         41.49  8.167  66.25   63.85            40.95  46.42  37.52  40.35
          SMT+WSD     41.97  8.244  66.35   63.86            40.63  46.14  37.25  40.10
IWSLT 3   SMT         49.91  9.016  73.36   70.70            35.60  40.60  32.30  35.46
          SMT+WSD     51.05  9.142  74.13   71.44            34.68  39.75  31.71  34.58
NIST      SMT         20.41  7.155  60.21   56.15            76.76  88.26  61.71  70.32
          SMT+WSD     20.92  7.468  60.30   56.79            71.34  83.37  57.29  67.38
38. Results are statistically significant
- NIST results are statistically significant at the 95% level
- Tested using paired bootstrap resampling
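Paired bootstrap resampling can be sketched as below. Sentence-level scores are assumed additive here for simplicity; for a corpus-level metric like BLEU one would resample the sufficient statistics instead. This is an illustration of the test, not the authors' implementation:

```python
import random

def paired_bootstrap(scores_a, scores_b, trials=1000, seed=0):
    """Resample the test set with replacement `trials` times and count
    how often system A outscores system B on the resampled set.
    A proportion above 0.95 means A is better at the 95% level."""
    rng = random.Random(seed)
    n, wins = len(scores_a), 0
    for _ in range(trials):
        idx = [rng.randrange(n) for _ in range(n)]  # paired resample
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / trials

# Toy scores where A beats B on every sentence:
p = paired_bootstrap([0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.3, 0.2], trials=500)
```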
39. Translations with context-dependent phrasal lexicons often differ from SMT translations

Test set  Translations changed by context modeling (%)
IWSLT 1   25.49
IWSLT 2   30.40
IWSLT 3   29.25
NIST      95.74
40. Context-dependent modeling helps even for the small, single-domain IWSLT task
- IWSLT is a single-domain task with very short sentences
- Even in these conditions, context-dependent phrasal lexicons are helpful
  - there are genuine sense ambiguities, e.g. "turn" vs. "transfer"
  - context features are available: 19 observed features per occurrence of a Chinese phrase
41. The most useful context features are not available in standard SMT
- The 3 most useful context feature types are
  - the POS tag of the word preceding the target phrase
  - the POS tag of the word following the target phrase
  - bag-of-words context
- We use the weights learned by the maximum entropy classifier to determine the most useful features
  - we normalize the feature weights for each WSD model
  - and then compute the average weight of each feature type
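The normalize-then-average procedure can be sketched as below. The feature-name convention (type before "=") and the use of absolute weights are assumptions for illustration:

```python
from collections import defaultdict

def average_weight_by_type(models):
    """Normalize absolute maxent feature weights within each per-phrase
    WSD model, then average the normalized weight per feature type
    across models, to rank feature types by usefulness."""
    totals, counts = defaultdict(float), defaultdict(int)
    for weights in models:                 # one weight dict per phrase model
        z = sum(abs(w) for w in weights.values()) or 1.0
        for feat, w in weights.items():
            ftype = feat.split("=")[0]     # e.g. "pos[-1]", "bow"
            totals[ftype] += abs(w) / z
            counts[ftype] += 1
    return {t: totals[t] / counts[t] for t in totals}

# Two toy per-phrase models with hypothetical learned weights:
avg = average_weight_by_type([{"pos[-1]=DT": 2.0, "bow=bank": 1.0},
                              {"pos[-1]=NN": 1.0, "bow=the": 1.0}])
```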
42. Dynamic context-dependent sense predictions are better than static predictions
- Context-dependent modeling often helps rank the correct translation first
- Even when context-dependent modeling picks the same translation candidate, the WSD scores are more discriminative than the baseline translation probabilities
  - better at overriding incorrect LM predictions
  - gives higher confidence to translate longer input phrases when appropriate
43. Context-dependent modeling improves phrasal lexical choice: examples
45. Context-dependent modeling prefers longer phrases
- Input: [Chinese source sentence]
- Ref.: No parliament members voted against him.
- SMT: Without any congressmen voted against him.
- SMT+WSD: No congressmen voted against him.
48. Context-dependent modeling prefers longer phrases
- The average length of the Chinese phrases used is higher with the context-dependent phrasal lexicon
- This confirms that
  - context-dependent predictions for all phrases are useful
  - context-dependent predictions should be available at all stages of decoding
- This explains why using WSD for single words only has a less reliable impact on translation quality
  - as in Cabezas & Resnik 2005, Carpuat et al. 2006
49. Context-dependent lexicons should be phrasal to always help translation

Test set  Experiment    BLEU   NIST   METEOR  METEOR (no syn)  TER    WER    PER    CDER
1         SMT           42.21  7.888  65.40   63.24            40.45  45.58  37.80  40.09
          word lex.     41.94  7.911  65.55   63.52            40.59  45.61  37.75  40.09
          phrasal lex.  42.38  7.902  65.73   63.64            39.98  45.30  37.60  39.91
2         SMT           41.49  8.167  66.25   63.85            40.95  46.42  37.52  40.35
          word lex.     41.31  8.161  66.23   63.72            41.34  46.82  37.98  40.69
          phrasal lex.  41.97  8.244  66.35   63.86            40.63  46.14  37.25  40.10
3         SMT           49.91  9.016  73.36   70.70            35.60  40.60  32.30  35.46
          word lex.     49.73  9.017  73.32   70.82            35.72  40.61  32.10  35.30
          phrasal lex.  51.05  9.142  74.13   71.44            34.68  39.75  31.71  34.58
51. Context-dependent modeling improves quality of statistical MT
- Presented context-dependent phrasal lexicons for SMT
  - leverage WSD techniques for SMT lexical choice
- Context-dependent modeling always improves SMT accuracy
  - on all tasks: 3 different IWSLT06 datasets, NIST04
  - on all 8 common automatic metrics: BLEU, NIST, METEOR, METEOR(synsets), TER, WER, PER, CDER
- Why?
  - the most useful context features are unavailable to current SMT systems
  - better phrasal segmentation
  - better phrasal lexical choice: more accurate rankings, more discriminative scores
52. Maxent-based sense disambiguation in Candide (Berger et al. 1996)
- No evaluation of impact on translation quality
  - only 2 example sentences; no contrastive evaluation by human judgment nor any automatic metric
  - the extension by García Varea et al. does not significantly improve translation quality
- Still does not model input language context
- Overly simplified context model
  - does not use full sentential context: only 3 words to the left and 3 words to the right
  - does not generalize over word identities: only words, no POS tags
  - does not generalize to phrasal disambiguation targets: only single words
- Does not augment the existing SMT model
  - only replaces the context-independent translation probability