1. Recent Trends in MT Evaluation: Linguistic Information and Machine Learning
- Jason Adams
- 11-734
- 2008-03-05
- Instructors: Alon Lavie, Stephan Vogel
2. Outline
- Background
- Machine Learning
- Linguistic Information
- Combinations
- Conclusions
3. Background
- Fully automatic MT evaluation is as hard as MT itself
- If we could judge with certainty that a translation is correct, we could reverse the process and generate a correct translation
- Reference translations help to close this gap
4. Background: Adequacy and Fluency
- Adequacy
  - How much of the meaning of the source sentence is preserved in the hypothesis
  - Reference translations are assumed to achieve this sufficiently
- Fluency
  - How closely the hypothesis sentence conforms to the norms of the target language
  - Reference translations are a subset of the target language
5. Background: Human Judgments
- Judges rate translations on a scale for adequacy and for fluency
- Agreement between judges is low
- Judgment scores are therefore normalized (Blatz et al. 2003)
6. Background: Evaluating Metrics
- Correlation with human assessments (judgments), sketched below
  - Pearson correlation
  - Spearman rank correlation
- Adding more references helps BLEU but hurts NIST (Finch et al. 2004)
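Both correlations are cheap to compute. A minimal sketch with SciPy; the score lists are made up purely for illustration:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system metric scores and mean human ratings.
metric_scores = [0.31, 0.45, 0.27, 0.52, 0.40]
human_scores = [3.1, 3.8, 2.9, 4.2, 3.5]

r, _ = pearsonr(metric_scores, human_scores)     # linear correlation
rho, _ = spearmanr(metric_scores, human_scores)  # rank correlation
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```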
7. Background: BLEU
- Papineni et al. (2002)
- First automatic MT metric to be widely adopted
- Geometric mean of modified n-gram precisions (sketched below)
- Criticisms
  - Poor sentence-level correlation
  - Favors statistical systems
  - Ignores recall
  - Treats local word choice as more important than global accuracy
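A toy sentence-level sketch of the scoring idea: n-gram precision clipped by reference counts, combined by a geometric mean with a brevity penalty. Real BLEU aggregates counts over a whole corpus; this is only the per-sentence arithmetic.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, refs, n):
    hyp_counts = Counter(ngrams(hyp, n))
    # Clip each hypothesis n-gram count by its max count in any one reference.
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return clipped / max(sum(hyp_counts.values()), 1)

def bleu(hyp, refs, max_n=4):
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision zeroes the geometric mean
    closest = min(refs, key=lambda r: abs(len(r) - len(hyp)))
    bp = min(1.0, math.exp(1 - len(closest) / len(hyp)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(bleu(hyp, refs))  # 0.0: no 4-gram matches, showing the sentence-level brittleness
```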
8. Background: METEOR
- Banerjee and Lavie (2005)
- Addresses some of the shortcomings of BLEU
- Uses recall against the best-matching reference
- Attempts to align hypothesis and reference (sketched below)
- Better correlation with human judgments
- Optionally uses WordNet and Porter stemming
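A sketch of METEOR's core scoring for the exact-match stage only; the stemming and WordNet stages are omitted, and the greedy alignment here is a stand-in for METEOR's minimal-crossing alignment:

```python
def meteor_exact(hyp, ref):
    # Greedy left-to-right one-to-one alignment of identical tokens.
    used, alignment = set(), []
    for i, w in enumerate(hyp):
        for j, v in enumerate(ref):
            if v == w and j not in used:
                used.add(j)
                alignment.append((i, j))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    p, r = m / len(hyp), m / len(ref)
    fmean = 10 * p * r / (r + 9 * p)  # harmonic mean weighted toward recall
    # Chunks: maximal runs of matches contiguous in both strings.
    chunks = 1 + sum(
        1 for (i1, j1), (i2, j2) in zip(alignment, alignment[1:])
        if (i2, j2) != (i1 + 1, j1 + 1)
    )
    penalty = 0.5 * (chunks / m) ** 3  # fragmentation penalty
    return fmean * (1 - penalty)

print(meteor_exact("the cat sat on the mat".split(),
                   "the cat was sitting on the mat".split()))
```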
9. Outline
- Background
- Machine Learning
- Linguistic Information
- Combinations
- Conclusions
10. Machine Learning: Kulesza & Shieber (2004)
- Frame MT evaluation as a classification task
- Can we predict whether a sentence was generated by a human or a machine by comparing it against reference translations?
11. Machine Learning: Kulesza & Shieber (2004)
- Derived a set of features, partially based on BLEU (sketched below)
  - Unmodified n-gram precisions (n = 1 to 5)
  - Minimum and maximum ratio of hypothesis length to reference length
  - Word error rate: minimum edit distance between the hypothesis and any reference
  - Position-independent word error rate: the shorter translation's words are removed from the longer one, and the size of the remaining set is returned
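A sketch of that feature vector under one plausible reading of the definitions above; the exact normalizations in the paper may differ:

```python
from collections import Counter

def edit_distance(a, b):
    # Token-level Levenshtein distance.
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return d[-1]

def features(hyp, refs):
    feats = []
    for n in range(1, 6):  # unmodified n-gram precisions, n = 1..5
        hyp_grams = [tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1)]
        ref_grams = {tuple(r[i:i + n]) for r in refs for i in range(len(r) - n + 1)}
        feats.append(sum(g in ref_grams for g in hyp_grams) / max(len(hyp_grams), 1))
    ratios = [len(hyp) / len(r) for r in refs]
    feats += [min(ratios), max(ratios)]  # length-ratio features
    # WER: minimum edit distance to any reference, normalized by length.
    feats.append(min(edit_distance(hyp, r) for r in refs) / len(hyp))
    # PER: hypothesis words left over after removing reference words,
    # ignoring position.
    feats.append(min(sum((Counter(hyp) - Counter(r)).values())
                     for r in refs) / len(hyp))
    return feats

print(features("the cat sat".split(), ["the cat sits".split()]))
```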
12. Machine Learning: Kulesza & Shieber (2004)
- Trained an SVM as a classifier
  - Positive class: human translations
  - Negative class: machine translations
- Score is the output of the SVM (sketched below)
  - Distance to the hyperplane is treated as a measure of confidence
- Classification accuracy
  - 59% on human examples (positive)
  - 70% on machine examples (negative)
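A sketch of the scoring step with scikit-learn, on synthetic stand-in data; in the paper, the rows of X would be feature vectors like `features()` above, and the kernel choice here is illustrative rather than the paper's exact setup:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))            # stand-in feature vectors
y = (X.sum(axis=1) > 0).astype(int)      # stand-in labels: 1 human, 0 machine

clf = SVC(kernel="rbf").fit(X, y)
# The signed distance to the hyperplane doubles as the quality score:
# farther on the "human" side means higher confidence, hence a better score.
scores = clf.decision_function(X[:5])
print(scores)
```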
13. Machine Learning: Kulesza & Shieber (2004)
- Compared to BLEU, WER, PER, and F-measure at the sentence level
14. Outline
- Background
- Machine Learning
- Linguistic Information
- Combinations
- Conclusions
15. Linguistic Information: Liu & Gildea (2005)
- Introduce syntactic information into evaluation
- Parse hypothesis and reference translations with the Collins parser
- Examine three different metrics for comparing trees
16. Linguistic Information: Liu & Gildea (2005)
- Subtree Metric (STM), sketched below
  - D is the maximum depth of subtrees considered
  - A subtree's count is the number of times it appears in any reference
  - The clipped count limits this to the maximum number of times the subtree appears in any one reference
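A sketch of STM under one plausible reading of the definition, taking a depth-d subtree to be a node's subtree truncated at depth d. Trees are nested tuples, a toy representation rather than Collins parser output:

```python
from collections import Counter

# Trees as nested tuples: ("S", ("NP", "she"), ("VP", "sings")).
def nodes(t):
    yield t
    if not isinstance(t, str):
        for c in t[1:]:
            yield from nodes(c)

def height(t):
    return 1 if isinstance(t, str) else 1 + max(height(c) for c in t[1:])

def truncate(t, d):
    # The subtree rooted at t, cut off at depth d.
    if isinstance(t, str):
        return t
    if d == 1:
        return t[0]
    return (t[0],) + tuple(truncate(c, d - 1) for c in t[1:])

def stm(hyp_tree, ref_trees, D=3):
    score = 0.0
    for d in range(1, D + 1):
        hyp = Counter(truncate(n, d) for n in nodes(hyp_tree) if height(n) >= d)
        clip = Counter()
        for ref in ref_trees:
            for t, c in Counter(truncate(n, d)
                                for n in nodes(ref) if height(n) >= d).items():
                clip[t] = max(clip[t], c)  # max count in any one reference
        score += sum(min(c, clip[t]) for t, c in hyp.items()) / max(sum(hyp.values()), 1)
    return score / D  # average clipped precision over depths 1..D

hyp = ("S", ("NP", "she"), ("VP", "sings"))
ref = ("S", ("NP", "she"), ("VP", "sings", ("ADVP", "well")))
print(stm(hyp, [ref]))
```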
17. Linguistic Information: Liu & Gildea (2005)
- Kernel-based Subtree Metric (TKM)
  - H(t) is a vector of counts of all subtrees of t
  - The dot product H(t1) · H(t2) counts the subtrees the two trees have in common
  - Convolution kernels (Collins & Duffy 2001) compute this in polynomial time (sketched below); enumerating all subtrees directly would be exponential in the size of the trees
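A sketch of the kernel computation: C(n1, n2) recursively counts the common subtrees rooted at a node pair, and memoization keeps the double sum polynomial. Trees are the same nested tuples as in the STM sketch; the cosine normalization at the end is one plausible way to turn the kernel into a score, not necessarily the paper's exact formulation:

```python
import math
from functools import cache

def nodes(t):
    yield t
    if not isinstance(t, str):
        for c in t[1:]:
            yield from nodes(c)

def production(t):
    # A node's label plus the labels of its immediate children.
    return (t[0],) + tuple(c if isinstance(c, str) else c[0] for c in t[1:])

@cache
def common_rooted(n1, n2):
    # C(n1, n2): common subtrees rooted at this node pair (Collins & Duffy).
    if isinstance(n1, str) or isinstance(n2, str):
        return 0
    if production(n1) != production(n2):
        return 0
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):
            count *= 1 + common_rooted(c1, c2)
    return count

def tree_kernel(t1, t2):
    # H(t1) . H(t2), computed without enumerating the subtree space.
    return sum(common_rooted(a, b) for a in nodes(t1) for b in nodes(t2))

def tkm_score(hyp, ref):
    return tree_kernel(hyp, ref) / math.sqrt(
        tree_kernel(hyp, hyp) * tree_kernel(ref, ref))

t1 = ("S", ("NP", "she"), ("VP", "sings"))
t2 = ("S", ("NP", "she"), ("VP", "runs"))
print(tkm_score(t1, t2))  # 0.5
```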
18. Linguistic Information: Liu & Gildea (2005)
- Headword Chain Metric (HWCM), sketched below
  - Convert the phrase-structure parse into a dependency parse
  - Each mother-daughter relationship in the dependency parse is a headword chain of length 2
  - No siblings are included in any headword chain
  - The score is computed in the same fashion as STM
- The other two metrics also have dependency-based versions
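A sketch of HWCM with a dependency parse reduced to a head-to-dependents dict, a toy representation that conflates repeated words:

```python
from collections import Counter

def chains(deps, max_len):
    # All head-to-dependent chains of up to max_len words.
    words = set(deps) | {d for ds in deps.values() for d in ds}
    out = []
    def extend(chain):
        out.append(tuple(chain))
        if len(chain) < max_len:
            for dep in deps.get(chain[-1], []):
                extend(chain + [dep])
    for w in words:
        extend([w])
    return out

def hwcm(hyp_deps, ref_deps_list, k=4):
    score = 0.0
    for n in range(1, k + 1):  # clipped precision per chain length, as in STM
        hyp = Counter(c for c in chains(hyp_deps, k) if len(c) == n)
        clip = Counter()
        for ref in ref_deps_list:
            for c, v in Counter(c for c in chains(ref, k) if len(c) == n).items():
                clip[c] = max(clip[c], v)
        score += sum(min(v, clip[c]) for c, v in hyp.items()) / max(sum(hyp.values()), 1)
    return score / k

hyp = {"sings": ["she", "loudly"]}
ref = {"sings": ["she", "well"]}
print(hwcm(hyp, [ref], k=2))
```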
19. Linguistic Information: Liu & Gildea (2005)
- Data is from MT03 and the JHU Summer Workshop (2003)
- [Table: correlation with overall judgments for one MT system (E15)]
- [Table: correlation with fluency judgments for one MT system (E15)]
20. Linguistic Information: Liu & Gildea (2005)
- Corpus-level judgments for MT03
21. Linguistic Information: Pozar & Charniak (2006)
- Propose the Bllip metric
- Intuition: meaning-preserving transformations of a sentence should not heavily impact its dependency structure
- Perhaps intuitive, but unsubstantiated
22. Linguistic Information: Pozar & Charniak (2006)
- Parse hypothesis and reference translations with the Charniak parser
- Construct dependency parses from the output parse trees
- A lexical head pair (w1, w2) is a dependency if
  - w1 ≠ w2
  - w1 is the lexical head of a constituent immediately dominating the constituent of which w2 is the head
23. Linguistic Information: Pozar & Charniak (2006)
- Construct all dependency pairs for the hypothesis and reference translation
- With multiple reference translations, compare against each one at a time
- Compute precision and recall to score (sketched below)
  - The exact formula is not explicitly stated, but is probably F1
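A sketch of that scoring, with F1 as the presumed combination; the dependency pairs are hypothetical examples:

```python
from collections import Counter

def bllip_f1(hyp_pairs, ref_pairs):
    # Treat each parse as a bag of headword dependency pairs (w1, w2).
    hyp_c, ref_c = Counter(hyp_pairs), Counter(ref_pairs)
    overlap = sum((hyp_c & ref_c).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(hyp_c.values())
    r = overlap / sum(ref_c.values())
    return 2 * p * r / (p + r)

hyp = [("sings", "she"), ("sings", "loudly")]
ref = [("sings", "she"), ("sings", "softly")]
print(bllip_f1(hyp, ref))  # 0.5
```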
24. Linguistic Information: Pozar & Charniak (2006)
- Evaluation compared the biggest discrepancies between Bllip and BLEU and judged which metric was more accurate in each case
- Results suggest Bllip makes better choices than BLEU
- Results aren't given directly
25. Linguistic Information: Pozar & Charniak (2006)
- Fairly weak paper
- Evaluation is essentially eyeballed
- Still, simple headword bi-chains seem to perform as well as BLEU
- Unfortunately, the results cannot be reliably compared
26. Linguistic Information: Owczarzak et al. (2007)
- Extended the work of Liu & Gildea (2005), who used unlabeled dependency parses
- Insight: more information about grammatical relations might be helpful
  - X is the subject of Y
  - X is a determiner of Y
27. Linguistic Information: Owczarzak et al. (2007)
- Used an LFG parser to generate f-structures, which contain information about grammatical relations
28. Linguistic Information: Owczarzak et al. (2007)
- Types of dependencies
  - Predicate only
  - Predicate-value pairs, i.e., grammatical relations
  - Non-predicate
    - Tense
    - Passive
    - Adjectival degree (comparative, superlative)
    - Verb particle
    - Etc.
- Extended HWCM from Liu & Gildea (2005) to use these labeled dependencies
29. Linguistic Information: Owczarzak et al. (2007)
- How do you account for parser noise?
- The positions of adjuncts should not affect the f-structure of an LFG parse
- Constructed re-orderings of 100 English sentences
  - Each re-ordered sentence is treated as a translation hypothesis
  - The original sentence is treated as the reference translation
30. Linguistic Information: Owczarzak et al. (2007)
31. Linguistic Information: Owczarzak et al. (2007)
- Solution: use n-best parses
  - Trade-off with computation time
  - Used 10-best parses
32. Linguistic Information: Owczarzak et al. (2007)
- Obtained precision and recall for each hypothesis-reference pair
  - Four examples for each machine hypothesis
- Extended matching with WordNet synonyms
- Extended with partial matches: one part of a grammatical relation matches, and the other may or may not
- Computed F1 (sketched below)
  - Tried different weights in the weighted harmonic mean but saw no significant improvement*
* Personal communication with Karolina Owczarzak
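A simplified sketch of scoring over labelled dependency triples with optional partial-match credit; this is a loose reading of the slide's description, not the authors' implementation, and the triples are hypothetical examples:

```python
def labelled_dep_score(hyp, ref, partial=False):
    # Each parse is a set of (relation, head, dependent) triples
    # extracted from the LFG f-structure.
    hyp, ref = set(hyp), set(ref)
    matched = hyp & ref
    if partial:
        for (rel, h, d) in hyp - matched:
            # Credit a partial match: relation plus one argument agree.
            if any(rel == r and (h == rh or d == rd) for (r, rh, rd) in ref):
                matched.add((rel, h, d))
    if not matched:
        return 0.0
    p = len(matched) / len(hyp)
    r = len(matched) / len(ref)
    return 2 * p * r / (p + r)

hyp = [("subj", "sing", "she"), ("tense", "sing", "past")]
ref = [("subj", "sing", "she"), ("tense", "sing", "pres")]
print(labelled_dep_score(hyp, ref, partial=True))  # 1.0 with partial credit
```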
33. Linguistic Information: Owczarzak et al. (2007)
- Evaluated using Pearson correlation with un-normalized human judgment scores (values ranging from 1 to 5)
- Their metric with 50-best parses and WordNet performed best on fluency
- METEOR with WordNet performed best on adequacy and overall
  - Their 50-best partial-matching variant performed only slightly lower than METEOR overall*
- Significantly outperformed BLEU
* Personal communication with Karolina Owczarzak
34. Outline
- Background
- Machine Learning
- Linguistic Information
- Combinations
- Conclusions
35. Combinations: Albrecht & Hwa (2007)
- Extended the work of Kulesza & Shieber (2004)
- Incorporated Liu & Gildea's headword-chain features
- Compared classification to regression using SVMs
36. Combinations: Albrecht & Hwa (2007)
- Classification attempts to learn decision boundaries
- Regression attempts to learn a continuous function
- MT evaluation metrics are continuous: there is no clear boundary between good and bad
- Instead of classifying translations as human or machine (a human-likeness classifier), try to learn the function underlying human judgments (sketched below)
- Score each hypothesis according to a rating scale
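A sketch of the regression framing with scikit-learn's SVR on synthetic stand-in data; the feature set and exact kernel/parameters of the paper are not reproduced here:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR

# Stand-in data: rows of X would be metric features (HWCM, STM, string
# overlap, ...); y would be human ratings on a 1-5 scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
y = np.clip(3 + X[:, 0] + 0.3 * rng.normal(size=500), 1, 5)

model = SVR(kernel="rbf").fit(X[:400], y[:400])
pred = model.predict(X[400:])

# Evaluate the learned metric as the paper does: rank correlation between
# predicted scores and held-out human judgments.
rho, _ = spearmanr(pred, y[400:])
print(f"Spearman rho = {rho:.3f}")
```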
37. Combinations: Albrecht & Hwa (2007)
- Features
  - Syntax-based features compared against the references
    - HWCM
    - STM
  - String-based metrics over a large English corpus
  - Syntax-based metrics over a dependency treebank
38. Combinations: Albrecht & Hwa (2007)
- Data was LDC Multiple-Translation Chinese Part 4
- Used Spearman correlation instead of Pearson
- Classification accuracy is positively related to correlation, but it is possible to improve classification accuracy without improving correlation
- Human-likeness classification seems inconsistent
39. Combinations: Albrecht & Hwa (2007)
- It is possible to train with regression using reasonably sized sets of training instances
- Regression generalizes across data sets
- Results showed the highest overall correlation among the metrics compared
40. Combinations: Albrecht & Hwa (2007)
41. Outline
- Background
- Machine Learning
- Linguistic Information
- Combinations
- Conclusions
42. Conclusions
- Evaluating the performance of MT evaluation metrics still has plenty of room for improvement
- Given that humans don't agree well on MT quality, correlation with human judgments is inherently limited
43. Conclusions
- Machine learning
  - Only scratching the surface of the possibilities
  - Finding the right way to frame the problem is not straightforward
  - Learning the function by which humans assess translations performs better than trying to classify a translation as human or machine
44. Conclusions
- Linguistic information
  - Intuitively, this should be helpful
  - METEOR performs very well with limited linguistic information (synonymy)
  - Automatic parsers and other NLP tools are noisy, so they may compound the problem
45. Conclusions
- Linguistic information and machine learning
  - Combining the two leads to good results (Albrecht & Hwa 2007)
46. Conclusions
- New directions
  - Machine learning with richer linguistic information
    - Labeled dependencies
    - Paraphrases
  - Are other machine learning algorithms better suited than SVMs?
  - Are there better ways of framing the evaluation question?
  - How well can these approaches be extended to task-specific evaluation?
47. Questions?
48. References
- Joshua S. Albrecht and Rebecca Hwa. 2007. A re-examination of machine learning approaches for sentence-level MT evaluation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-2007).
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, June.
- John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2003. Confidence estimation for machine translation. Technical Report, Natural Language Engineering Workshop Final Report, Johns Hopkins University.
- Alex Kulesza and Stuart M. Shieber. 2004. A learning approach to improving sentence-level MT evaluation. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), Baltimore, MD, October.
49. References
- Ding Liu and Daniel Gildea. 2005. Syntactic features for evaluation of machine translation. In ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, June.
- Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Labelled dependencies in machine translation evaluation. In Proceedings of the ACL 2007 Workshop on Statistical Machine Translation, pages 104-111, Prague, Czech Republic.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA.
- Michael Pozar and Eugene Charniak. 2006. Bllip: An improved evaluation metric for machine translation. Master's thesis, Brown University.