Title: Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation, Chapters 1-4
1 Johns Hopkins 2003 Summer Workshop on Syntax and Statistical Machine Translation, Chapters 1-4
2 Outline
- Introduction
- Implicit Syntactic Feature Functions
- Shallow Syntactic Feature Functions
- Deep Syntactic Feature Functions
3 Introduction
- Motivation
- The Plan
- The Baseline system
4 Motivation
Introduction
- Statistical MT systems are the current state of the art, but they often make stupid syntax errors
  - Missing content words
    - Correct: "Ukraine condemns US interference in its internal affairs"
    - Output: "Condemns US interference in its internal affairs"
  - Missing articles
    - Correct: "he is fully able to activate the team"
    - Output: "he is fully able to activate team"
  - Incorrect dependencies
    - Correct: "particularly those players who cheat the audience"
    - Output: "particularly those who cheat the audience players"
- These systems are data-driven, so they reflect only implicit syntactic properties of language, as captured by n-grams and alignment models
- What if we incorporate some explicit syntactic knowledge?
5 The Plan
Introduction
- Investigate the effect of integrating syntactic rules on the performance of an SMT system
  - Analyze errors in a baseline SMT system
  - Develop syntactically motivated feature functions to target specific errors
  - Observe the effect of each feature on the system score
  - Hope for improved results!
- Measuring improvement: BLEU
  - Is this always an effective/appropriate metric?
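Since every result in these slides is reported as a BLEU score, it may help to see roughly what the metric computes. The following is a minimal sketch only: it assumes a single reference, sentence-level scoring, and no smoothing, whereas BLEU is normally computed over a whole corpus and often against several references. The function names are made up for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(hypothesis, reference, max_n=4):
    """Single-reference, sentence-level BLEU with uniform n-gram weights."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram is credited at most as often
        # as it occurs in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # no smoothing in this sketch
        log_precision_sum += math.log(matches / total) / max_n
    # Brevity penalty punishes hypotheses shorter than the reference.
    if len(hyp) >= len(ref):
        brevity = 1.0
    else:
        brevity = math.exp(1.0 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision_sum)

reference = "he is fully able to activate the team"
hypothesis = "he is fully able to activate team"
score = simple_bleu(hypothesis, reference)
```

On the missing-article example from the Motivation slide, dropping the single word "the" breaks several higher-order n-grams and triggers the brevity penalty, so the score falls well below 1 even though the sentence is still easy to understand.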
6 The Baseline System
Introduction
- The Model
- Feature Functions
- Training and Test corpora
- Results
7 The Model
Baseline System
- Alignment template system
  - Och 2002; Och, Tillmann, and Ney 1999; Och and Ney 2004
- How it works: segment the input sentence into phrases, translate the phrases, and reorder them in the target language
- Uses a log-linear modelling approach for direct translation
- Basic idea of the experiments: model each syntactic feature as a function and plug it into the model
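The "plug it into the model" idea can be sketched as a weighted sum of feature values, with the best hypothesis being the one with the highest sum. This is an illustration under assumptions: the feature names, values, and weights below are invented, and choosing from a fixed candidate list stands in for the real search.

```python
def loglinear_score(features, weights):
    """Weighted sum of feature values: sum over m of lambda_m * h_m(e, f)."""
    return sum(weights[name] * value for name, value in features.items())

def pick_best(candidates, weights):
    """Return the candidate translation with the highest log-linear score."""
    return max(candidates, key=lambda c: loglinear_score(c["features"], weights))

# Hypothetical candidates with made-up feature values (log domain).
candidates = [
    {"text": "he is fully able to activate team",
     "features": {"lm": -12.0, "tm": -8.0, "syntax": -2.0}},
    {"text": "he is fully able to activate the team",
     "features": {"lm": -12.5, "tm": -8.2, "syntax": 0.0}},
]

baseline_weights = {"lm": 1.0, "tm": 1.0, "syntax": 0.0}   # syntax feature off
extended_weights = {"lm": 1.0, "tm": 1.0, "syntax": 0.5}   # syntax feature on

best_baseline = pick_best(candidates, baseline_weights)["text"]
best_extended = pick_best(candidates, extended_weights)["text"]
```

With the syntax weight at zero the first candidate wins; once the hypothetical syntax feature gets a positive weight, the second candidate overtakes it. Adding a new feature function amounts to adding one more key and one more weight.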
8 Fig. 1.1: System architecture based on log-linear modelling
9 Alignment Templates
Baseline System
- Used to do the phrase-based translation
- Sentences e and f are decomposed into K phrase pairs, and a template z is assigned to translate each pair
- Parameters
  - Segmentation points in both e and f
    - The search for the optimal segmentation is included in the Global Search component of the model (Fig. 1.1)
  - Set of templates 1 through K: z_1^K
  - Permutation of templates 1 through K: π_1^K
    - This parameter allows for reordering of phrases
- The z and π parameters are added as hidden variables to the model
10 Fig. 1.2: Sample segmentation of e and f and translation into alignment templates
11 Fig. 1.3: Dependencies in the alignment template model
12 Feature Functions
Baseline System
- Alignment template selection
- Word selection
- Phrase alignment
- Language Model features
- Word/Phrase penalty
- Phrases from conventional lexicon
- Additional features
13 Alignment template selection
Baseline System
- Feature function: product of alignment template probabilities
- Notice there are no insertions or deletions on the phrase level, just permutations
14 Word selection
Baseline System
- Feature function: product of word translation probabilities
- Notice that i and j are included to capture dependence on word positions ("übermorgen -> the day after tomorrow" should be weighted higher than "übermorgen -> after the day tomorrow")
- E_i is the word class for word e_i
- A is the matrix of word alignments, A = A(π_1^K, z_1^K)
15 Phrase Alignment
Baseline System
- Feature function for phrase alignment
  - Sum over the distances (in the source) between alignment templates that are consecutive in the target
- Measures the non-monotonicity of the phrases
- Takes into account that monotone alignments are very often the correct alignments
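The distance sum described above can be sketched as follows. This is one plausible reading, not the workshop's exact definition: source spans are assumed to be 1-based and inclusive, listed in target order, and any final jump to the end of the source sentence is left out. The function name is hypothetical.

```python
def phrase_alignment_cost(source_spans):
    """Sum of jump distances in the source between phrases that are
    consecutive in the target.

    source_spans: list of (start, end) source positions, 1-based and
    inclusive, given in the order the phrases appear in the target.
    """
    cost = 0
    prev_end = 0  # position just before the first source word
    for start, end in source_spans:
        # A monotone continuation starts right after the previous phrase
        # ended, which contributes a distance of zero.
        cost += abs(start - 1 - prev_end)
        prev_end = end
    return cost

monotone = phrase_alignment_cost([(1, 2), (3, 5), (6, 6)])
reordered = phrase_alignment_cost([(3, 5), (1, 2), (6, 6)])
```

A fully monotone translation scores 0, and every reordering adds to the cost, so a negative weight on this feature expresses the preference for monotone alignments.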
16 Language Model features
Baseline System
- Standard word-based trigram for the language model feature
- The report mentions a total of four language models; I don't know what the other three are (or are they just four variations on the trigram?)
17 Word-phrase penalty
Baseline System
- Number of produced words, i.e., the length of the target sentence
- Number of produced phrases
  - Can be arranged to prefer long or short phrases (I imagine this means smaller K for longer phrases, and larger K for shorter phrases...?)
18 Additional Features
Baseline System
- Phrases from conventional lexicon
  - Entries in the Chinese-English lexicon provided by the Linguistic Data Consortium can be potential phrase translation pairs in the alignment template system
  - A feature function is included that counts the number of times each lexical entry is used in training
- The model allows further addition of any number of feature functions, for example other syntactic features (numbers of verb arguments), semantic features, pragmatic features
19 Training and Test corpora
Baseline System
- Three corpora
  - training corpus (train)
    - 170M English words
  - development corpus (dev)
    - 993 sentences (25K words) in both languages
    - 5765 sentences (175K words) for use in post-workshop experiments
  - test corpus (test)
- And
  - unseen test corpus (blind-test)
    - for experiments on completely unseen data
20 Preprocessing
Baseline System
- Some additional tweaking was needed to get the system ready to roll
  - Segmentation and POS tagging: used standard tools distributed by LDC
    - Slightly different tag sets are appropriate for English and Chinese data (no NN vs. NNS distinction in Chinese, no M for measure words in English)
  - Parsing: used Collins 1999 for English, Bikel 2002 for Chinese
  - Chunking: used the fnTBL chunker
  - Case issues: used an HMM to insert upper case back into the baseline system output
  - Tokenization issues: normalize hyphenation and other formatting for I/O into the various systems
21 The Baseline Result
Baseline System
- BLEU score: 31.6
- This is the score that every experiment will be compared against
- SPOILER! You're not going to see anything much different from this...
- (But hold on anyway... here we go!)
22 Outline
- Introduction
- Implicit Syntactic Feature Functions
- Shallow Syntactic Feature Functions
- Deep Syntactic Feature Functions
23 Implicit Syntactic Feature Functions
- A Trio for Punctuation
- Specific Word Penalty
- Model 1 Score
- Missing Content Words
- Multi-Sequence Alignment (MSA) of Hypotheses
24 Punctuation
Implicit Functions
- Problem: Ungrammatical punctuation in hypotheses affects the syntactic quality of the output
- Idea 1: Count of unmatched or empty parens and quotes
  - Feature function penalizes ungrammatical punctuation
- Idea 2: Percent overlap between groups in e and f
  - penalizes word movement around punctuation
  - penalizes punctuation deletion
- Idea 3: Add hypotheses to correct bad punctuation
  - delete unaligned parens and quotes
  - insert an opening paren/quote before the first word aligned to the first Chinese word inside the parens
  - insert a closing paren/quote after the last word aligned to the last Chinese word inside the parens
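Idea 1 is simple enough to sketch directly. The version below is an assumption about how such a count might be implemented, not the workshop code: it works on whitespace-tokenized output, treats only "(", ")" and the straight double quote as punctuation of interest, and counts each unmatched or empty item once.

```python
def punctuation_penalty(tokens):
    """Count unmatched or empty parentheses and double quotes."""
    penalty = 0
    open_paren_positions = []
    open_quote_position = None
    for i, tok in enumerate(tokens):
        if tok == "(":
            open_paren_positions.append(i)
        elif tok == ")":
            if not open_paren_positions:
                penalty += 1  # closing paren with no opener
            elif open_paren_positions.pop() == i - 1:
                penalty += 1  # empty pair "( )"
        elif tok == '"':
            if open_quote_position is None:
                open_quote_position = i
            else:
                if open_quote_position == i - 1:
                    penalty += 1  # empty pair of quotes
                open_quote_position = None
    penalty += len(open_paren_positions)  # openers that were never closed
    if open_quote_position is not None:
        penalty += 1  # odd number of quotes
    return penalty

clean = punctuation_penalty("he said ( yes )".split())
messy = punctuation_penalty('( ) he said " hello'.split())
```

The returned count can be used directly as a feature value; a negative weight then penalizes hypotheses with broken punctuation.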
25 Punctuation - Results
- BLEU score: no statistically significant improvement
- Ideas 1 and 2 are restricted in their application
  - They have little discriminating power when most of the hypotheses for a Chinese sentence make similar punctuation mistakes
  - Idea 2 doesn't work when the punctuation deletion is at the border of a sentence or next to another punctuation mark
- Idea 3: the added hypotheses apparently make only trivial changes to feature function values (and hence to the n-best score)
- Conclusion: Punctuation soundness has little influence on BLEU
26 Specific Word Penalty
Implicit Functions
- Problem: Errant (i.e., wrongly placed, inserted, or deleted) content words
- Idea: Use counts of the ten most common non-content words as feature functions
  - Individually: 10 counts -> 10 functions
  - Combined into one count, to avoid overfitting
- Results, compared to the 31.6 baseline
  - Using individual counts as features: 31.1
  - Combined into one feature value: 31.7
- Conclusion: BLEU drops with the individual counts and barely moves (+0.1) with the combined count
- But! They did find that "that" and "a" were more commonly systematically mistranslated than others. Maybe further experiments can be done on a larger list of non-content words
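The two variants (ten separate counts versus one combined count) might look like the sketch below. The word list is hypothetical: the slides name only "that" and "a", so the other eight entries are placeholders chosen for illustration.

```python
from collections import Counter

# Hypothetical list; only "that" and "a" are named in the slides.
FUNCTION_WORDS = ["the", "of", "to", "and", "in", "a", "that", "is", "for", "on"]

def word_count_features(hypothesis, words=FUNCTION_WORDS):
    """Return (individual counts, combined count) for the given words."""
    counts = Counter(hypothesis.lower().split())
    individual = {w: counts[w] for w in words}  # one feature per word
    combined = sum(individual.values())         # a single pooled feature
    return individual, combined

individual, combined = word_count_features("he is fully able to activate the team")
```

The individual variant adds ten weights to tune, while the combined variant adds just one, which is the overfitting trade-off the slide refers to.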
27 Model 1 Score
Implicit Functions
- Idea: Use IBM Model 1 for two feature functions
  - Model 1 gives the sum of all possible alignment probabilities
  - Feature functions: p(f|e) and p(e|f)
- Trained with a subset of the training corpus for the baseline system: 30M English words
- Smoothing constant t(f_j|e_i) = 10^-40 used for unknown words
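A Model 1 score of this kind can be computed without enumerating alignments, because the sum over all alignments factorizes into a per-word sum. The sketch below assumes a lexical table stored as a dictionary keyed by (f word, e word), a NULL token on the conditioning side, and omits the constant sentence-length term; the toy tokens and probabilities are invented.

```python
import math

SMOOTHING = 1e-40  # value used for unseen word pairs, as on the slide

def model1_log_prob(f_words, e_words, t_table):
    """log p(f | e) under IBM Model 1, without the constant length term."""
    e_with_null = ["NULL"] + list(e_words)
    log_prob = 0.0
    for f in f_words:
        # Summing t(f | e_i) over every e_i covers all possible alignments of f.
        total = sum(t_table.get((f, e), SMOOTHING) for e in e_with_null)
        log_prob += math.log(total / len(e_with_null))
    return log_prob

# Toy lexical table with made-up tokens and probabilities.
t_table = {("f1", "e1"): 0.8, ("f2", "e2"): 0.9}
known = model1_log_prob(["f1", "f2"], ["e1", "e2"], t_table)
unknown = model1_log_prob(["f3"], ["e1", "e2"], t_table)
```

Swapping the roles of the two sentences (and using a table trained in the other direction) gives the second feature, p(e|f).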
28 Model 1 - Results
- Compared to the 31.6 baseline
  - p(f|e) yields 32.5 on average, p(e|f) 30.6
- One of the best-performing features in the workshop
- Breakdown for different training sizes
  - (numbers for p(e|f) are a little jumpy; may be a bug in the eval script)
29 Missing Content Words
Implicit Functions
- Problem: Those missing content words are really annoying.
  - Sentences missing content words can have a higher overall probability ranking than those with the correct content words
- Idea: Implement a feature function that counts the number of content words missing in a candidate translation
- Results: 31.9 BLEU score
  - 0.3 improvement over the 31.6 baseline
  - Comparatively large improvement, yet not statistically significant
- Discussion: The BLEU score is not significantly better, but on manual inspection the adequacy of the resulting sentences is much better. Perhaps BLEU is not the best metric for evaluating this feature function.
30 MSA for Hypotheses
Implicit Functions
- Problem: No real range of diversity in translation sentences
- Idea: Use Multi-Sequence Alignment (MSA) lattices to recombine subparts of existing hypotheses into new ones. Three features:
  - Path weight of each hypothesis
    - Arc weight: number of hypotheses that agree with that arc
  - Binary feature: does the arc represent the majority of hypotheses?
  - Number of arcs on which a hypothesis agreed with the consensus path
31 MSA - Results
- BLEU scores: Meh.
- Conclusion: SMT is not constrained enough to be a very good fit for MSA
32 Implicit Syntax Results
33 Outline
- Introduction
- Implicit Syntactic Feature Functions
- Shallow Syntactic Feature Functions
- Deep Syntactic Feature Functions
34 Shallow Syntactic Feature Functions
- Overview
- Part-Of-Speech and Chunk Tag Counts
- Tag Fertility Models
- Projected POS Language Model
- Aligned POS-Tag Sequences
35 Overview
Shallow Functions
- Shallow features depend on POS tagging or chunking
- Motivations
  - Overcome data sparseness
  - Generalize from the behavior of words to the behavior of tags and chunks
  - Make stronger generalizations about syntactic behavior than what is observed in the training corpus
- Disadvantages: It may not be possible to capture more information than is already implicitly modeled in the baseline
  - POS and baseline systems are trained on the same input
  - Chunker output is not at a much higher granularity than alignment templates
- Advantages
  - Efficiency of POS and chunking systems
  - Decisions are local, so better for noisy hypotheses
  - Lots of available input data (1.3M tagged parallel sentences available for training)
  - Simpler models allow quicker reaction to problems and contrastive error analysis
36 Part-Of-Speech and Chunk Tag Counts
Shallow Functions
- Problem: The baseline systematically under- and over-generates certain POS and chunk types
- Idea: Favor sentences with more or fewer of certain tags (depending on under- or over-generation). For example:
  - Number of NPs in English
  - Difference in number of NPs from Chinese to English
  - Number of Chinese N tags translated only to non-N tags in English
- Results: Meh.
- Conclusions
  - Individual tag-count features are probably already encoded in the trigram models
  - Combined tag-count features do better, maybe because they counteract biases in more sophisticated features
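The first two example features can be sketched as plain counts over chunker output. This assumes each sentence is already reduced to a list of chunk labels; the sign convention for the difference (English minus Chinese) and the feature names are assumptions.

```python
def np_count_features(english_chunks, chinese_chunks):
    """Count NP chunks in the English hypothesis and compare with the source."""
    n_english = english_chunks.count("NP")
    n_chinese = chinese_chunks.count("NP")
    return {
        "np_count_e": n_english,
        # Positive when the hypothesis has more NPs than the source.
        "np_count_diff": n_english - n_chinese,
    }

# Hypothetical chunk label sequences for one sentence pair.
features = np_count_features(["NP", "VP", "NP", "PP", "NP"], ["NP", "VP", "NP"])
```

A learned weight on such a count lets the model push against systematic over- or under-generation of that chunk type.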
37 POS and Chunk counts - Results
38 Tag Fertility Models
Shallow Functions
- Problem: Tag distribution again
- Idea: Model the expected English tag distribution, with and without given Chinese tags
  - Single feature consisting of the product of various probability distributions for English tags
  - Some bag-o-tags models, e.g., P(NN_e = 2)
  - Some conditional on given Chinese tags, e.g., P(NP_e = 2 | NP_f = 1)
39 Tag Fertility - Results
- Not as good as hoped
- Discussion
  - Parameter estimation was rather simplistic: obviously related probabilities were calculated independently
  - Fewer free parameters might be tried
40 Projected POS Language Model
Shallow Functions
- Problem: The word movement model in the baseline system is pretty weak.
- Idea: Since Chinese words are too sparse to model movement, use POS instead
  - Use word alignment to project Chinese POS sequences into English
  - Similar to the HMM alignment model of Vogel, Ney, and Tillmann 1996, but with POS instead of words
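The projection step can be sketched as follows. The details are assumptions made for illustration: the alignment is a set of 0-based (English index, Chinese index) pairs, an English word linked to several Chinese words takes the tag of the leftmost one, and unaligned English words get a placeholder tag. A language model over the resulting tag sequences would then supply the feature value.

```python
def project_pos(chinese_tags, alignment, english_length):
    """Project Chinese POS tags onto English word positions.

    alignment: set of (english_index, chinese_index) pairs, 0-based.
    Returns one projected tag per English position.
    """
    projected = []
    for i in range(english_length):
        linked = sorted(j for (e_idx, j) in alignment if e_idx == i)
        # Assumption: use the leftmost linked Chinese word, or a placeholder.
        projected.append(chinese_tags[linked[0]] if linked else "UNALIGNED")
    return projected

chinese_tags = ["NN", "VV", "NN"]
alignment = {(0, 0), (1, 2), (2, 1)}
projected = project_pos(chinese_tags, alignment, english_length=4)
```

Because the projected sequence is in English word order, it records how the Chinese tags were moved around, which is what the feature is meant to score.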
41 Projected POS - Results
- This is a little better...
- Conclusion
  - Results are better simply because movement handling in the baseline is so poor
  - Strongest-performing of the shallow features
  - Should be investigated further; indicates a possible move from purely word-based models to ones based on shallow syntax
42 Aligned POS-Tag Sequences
Shallow Functions
- Problem: Alignments in the baseline are computed on the word level; however, the lexical item distribution is always sparse
- Idea: Use POS tag sequence alignments instead
  - Replace words in alignment templates with POS tags, and use the following alignment models
  - Unigram: p(f,e) = product of all p(s_f, s_e)
  - Conditional: p(e,f) = p(f) x product of all p(s_e | s_f)
43 Aligned POS - Results
- Unigram model average: 31.6
- Conditional model average: 31.4
- Conclusion: Maybe more input is needed for training the models
  - The baseline system does not output alignment information for words translated by rules, so these particular alignments cannot be recovered
  - Performance of these feature functions may improve if the baseline system can be reconfigured to output more alignments
44 Shallow Syntax Results
45 Outline
- Introduction
- Implicit Syntactic Feature Functions
- Shallow Syntactic Feature Functions
- Deep Syntactic Feature Functions
46 Deep Syntactic Feature Functions
- Grammaticality Test of English Parser
- Parser Probability
- Parser Probability / Unigram LM Scores
- Syntax-based Translation Models
- Tree to String
- Tree to Tree Alignment
- Dependency Tree-to-Tree Alignments
47 Overview
- Deep syntactic features depend on parser output
- Grammaticality is measured by parse trees
- How to use parser output
- simple features
- model-based features
- dependency-based features
- other complex features
- tricky features (Chapter 5, Ethan)
48 Grammaticality Test of English Parser
Deep Functions
- Idea: Grammatical sentences should have a higher parse probability
  - Try the parse probability of the sentence by itself
  - Try the parse probability of the sentence divided by the unigram probability of the words in the sentence
- Result: Worse than baseline! Guess these probs are not really related...
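In the log domain the second variant is just a subtraction. The sketch below assumes the parser and the unigram model both report natural-log probabilities; the function name, feature names, and numbers are hypothetical.

```python
def grammaticality_features(parse_log_prob, unigram_log_probs):
    """Return the raw parse score and the parse score normalized by the
    unigram probability of the words (a ratio, so a difference of logs)."""
    normalized = parse_log_prob - sum(unigram_log_probs)
    return {"parse_log_prob": parse_log_prob, "parse_over_unigram": normalized}

# Made-up log probabilities for a three-word hypothesis.
features = grammaticality_features(-50.0, [-5.0, -6.0, -7.0])
```

Dividing by the unigram probability is meant to factor out how rare the words themselves are, so that the remaining score reflects structure rather than vocabulary.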
49 Tree to String Model
Deep Functions
- Idea: Incorporate the syntax-based Tree-to-String model as a feature function (Yamada and Knight 2001, 2002)
  - Theta is the set of reorderings, insertions, and leaf-word translation operations
- Results: Average 31.7 BLEU
- Conclusion
  - Results are not bad, but this is computationally very expensive! The expense makes it impractical for this model.
  - Try reducing the cost by fragmenting long sentences with a tool called machete; kinks are still being worked out of this tool, but it may be promising
50 Tree to Tree Alignment
Deep Functions
- Idea: Use (Gildea 2003) tree alignment probabilities as a feature function
  - Remember, Gildea's model includes cloning, and many-to-one and one-to-many node mappings
- Experiment
  - Lexical translation probs for leaf nodes were trained using IBM Model 1
  - Some tweaks for performance: max fan-out of 6, max sentence length of 60
- Results: 31.6 BLEU
51 Dependency Tree-to-Tree Alignments
Deep Functions
- Idea: Try a gaggle of dependency-derived features (listed in the results table, next slide)
  - By representing relationships between words, dependency trees for source and target sentences supposedly have fewer conflicting structures than constituency trees
- Results: Not much different from baseline
- Conclusion: A lot of the lack of gain for this approach is probably accounted for by errors in the parsing tools. Fixing these errors would likely improve the results of using this feature.
52 Dependency Tree Alignments - Results