1. Recent Trends in MT Evaluation: Linguistic Information and Machine Learning
- Jason Adams
- 11-734
- 2008-03-05
- Instructors: Alon Lavie, Stephan Vogel
2. Outline
- Background
- Machine Learning
- Linguistic Information
- Combinations
- Conclusions
3. Background
- Fully automatic MT evaluation is as hard as MT itself
- If we could judge with certainty that a translation is correct, we could reverse the process and generate a correct translation
- Reference translations help to close this gap
4. Background: Adequacy and Fluency
- Adequacy
  - How much of the meaning of the source sentence is preserved in the hypothesis
  - Reference translations are assumed to achieve this sufficiently
- Fluency
  - How closely the hypothesis sentence conforms to the norms of the target language
  - Reference translations are a subset of the target language
5. Background: Human Judgments
- Judges rate translations on a scale for adequacy and for fluency
- Agreement between judges is low
- Judgment scores are therefore normalized (Blatz et al. 2003)
6. Background: Evaluating Metrics
- Correlation with human assessments (judgments), sketched below
  - Pearson correlation
  - Spearman rank correlation
- Adding more references helps BLEU but hurts NIST (Finch et al. 2004)
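Both correlations are cheap to compute. A minimal sketch with SciPy; the score lists are made up purely for illustration:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system metric scores and mean human ratings.
metric_scores = [0.31, 0.45, 0.27, 0.52, 0.40]
human_scores = [3.1, 3.8, 2.9, 4.2, 3.5]

r, _ = pearsonr(metric_scores, human_scores)     # linear correlation
rho, _ = spearmanr(metric_scores, human_scores)  # rank correlation
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```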
7. Background: BLEU
- Papineni et al. (2002)
- First automatic MT metric to be widely adopted
- Geometric mean of modified n-gram precisions (sketched below)
- Criticisms
  - Poor sentence-level correlation
  - Favors statistical systems
  - Ignores recall
  - Treats local word choice as more important than global accuracy
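A toy sentence-level sketch of the scoring idea: n-gram precision clipped by reference counts, combined by a geometric mean with a brevity penalty. Real BLEU aggregates counts over a whole corpus; this is only the per-sentence arithmetic.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(hyp, refs, n):
    hyp_counts = Counter(ngrams(hyp, n))
    # Clip each hypothesis n-gram count by its max count in any one reference.
    max_ref = Counter()
    for ref in refs:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
    return clipped / max(sum(hyp_counts.values()), 1)

def bleu(hyp, refs, max_n=4):
    precisions = [modified_precision(hyp, refs, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # any zero precision zeroes the geometric mean
    closest = min(refs, key=lambda r: abs(len(r) - len(hyp)))
    bp = min(1.0, math.exp(1 - len(closest) / len(hyp)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

hyp = "the cat sat on the mat".split()
refs = ["the cat is on the mat".split(), "there is a cat on the mat".split()]
print(bleu(hyp, refs))  # 0.0: no 4-gram matches, showing the sentence-level brittleness
```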
8. Background: METEOR
- Banerjee and Lavie (2005)
- Addresses some of the shortcomings of BLEU
- Uses recall against the best-matching reference
- Attempts to align hypothesis and reference (sketched below)
- Better correlation with human judgments
- Optionally uses WordNet and Porter stemming
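A sketch of METEOR's core scoring for the exact-match stage only; the stemming and WordNet stages are omitted, and the greedy alignment here is a stand-in for METEOR's minimal-crossing alignment:

```python
def meteor_exact(hyp, ref):
    # Greedy left-to-right one-to-one alignment of identical tokens.
    used, alignment = set(), []
    for i, w in enumerate(hyp):
        for j, v in enumerate(ref):
            if v == w and j not in used:
                used.add(j)
                alignment.append((i, j))
                break
    m = len(alignment)
    if m == 0:
        return 0.0
    p, r = m / len(hyp), m / len(ref)
    fmean = 10 * p * r / (r + 9 * p)  # harmonic mean weighted toward recall
    # Chunks: maximal runs of matches contiguous in both strings.
    chunks = 1 + sum(
        1 for (i1, j1), (i2, j2) in zip(alignment, alignment[1:])
        if (i2, j2) != (i1 + 1, j1 + 1)
    )
    penalty = 0.5 * (chunks / m) ** 3  # fragmentation penalty
    return fmean * (1 - penalty)

print(meteor_exact("the cat sat on the mat".split(),
                   "the cat was sitting on the mat".split()))
```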
9. Outline
- Background
- Machine Learning
- Linguistic Information
- Combinations
- Conclusions
10. Machine Learning: Kulesza & Shieber (2004)
- Frame MT evaluation as a classification task
- Can we predict whether a sentence was generated by a human or a machine by comparing it against reference translations?
11. Machine Learning: Kulesza & Shieber (2004)
- Derived a set of features, partially based on BLEU (sketched below)
  - Unmodified n-gram precisions (n = 1 to 5)
  - Minimum and maximum ratio of hypothesis length to reference length
  - Word error rate: minimum edit distance between the hypothesis and any reference
  - Position-independent word error rate: the shorter translation's words are removed from the longer one, and the size of the remaining set is returned
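A sketch of that feature vector under one plausible reading of the definitions above; the exact normalizations in the paper may differ:

```python
from collections import Counter

def edit_distance(a, b):
    # Token-level Levenshtein distance.
    d = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, y in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (x != y))
    return d[-1]

def features(hyp, refs):
    feats = []
    for n in range(1, 6):  # unmodified n-gram precisions, n = 1..5
        hyp_grams = [tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1)]
        ref_grams = {tuple(r[i:i + n]) for r in refs for i in range(len(r) - n + 1)}
        feats.append(sum(g in ref_grams for g in hyp_grams) / max(len(hyp_grams), 1))
    ratios = [len(hyp) / len(r) for r in refs]
    feats += [min(ratios), max(ratios)]  # length-ratio features
    # WER: minimum edit distance to any reference, normalized by length.
    feats.append(min(edit_distance(hyp, r) for r in refs) / len(hyp))
    # PER: hypothesis words left over after removing reference words,
    # ignoring position.
    feats.append(min(sum((Counter(hyp) - Counter(r)).values())
                     for r in refs) / len(hyp))
    return feats

print(features("the cat sat".split(), ["the cat sits".split()]))
```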
12. Machine Learning: Kulesza & Shieber (2004)
- Trained an SVM as a classifier
  - Positive class: human translations
  - Negative class: machine translations
- Score is the output of the SVM (sketched below)
  - Distance to the hyperplane is treated as a measure of confidence
- Classification accuracy
  - 59% on human examples (positive)
  - 70% on machine examples (negative)
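A sketch of the scoring step with scikit-learn, on synthetic stand-in data; in the paper, the rows of X would be feature vectors like `features()` above, and the kernel choice here is illustrative rather than the paper's exact setup:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))            # stand-in feature vectors
y = (X.sum(axis=1) > 0).astype(int)      # stand-in labels: 1 human, 0 machine

clf = SVC(kernel="rbf").fit(X, y)
# The signed distance to the hyperplane doubles as the quality score:
# farther on the "human" side means higher confidence, hence a better score.
scores = clf.decision_function(X[:5])
print(scores)
```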
13. Machine Learning: Kulesza & Shieber (2004)
- Compared to BLEU, WER, PER, and F-measure at the sentence level
14. Outline
- Background
- Machine Learning
- Linguistic Information
- Combinations
- Conclusions
15. Linguistic Information: Liu & Gildea (2005)
- Introduce syntactic information into evaluation
- Parse hypothesis and reference translations with the Collins parser
- Examine three different metrics for comparing trees
16. Linguistic Information: Liu & Gildea (2005)
- Subtree Metric (STM), sketched below
  - D is the maximum depth of subtrees considered
  - A subtree's count is the number of times it appears in any reference
  - The clipped count limits this to the maximum number of times the subtree appears in any one reference
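A sketch of STM under one plausible reading of the definition, taking a depth-d subtree to be a node's subtree truncated at depth d. Trees are nested tuples, a toy representation rather than Collins parser output:

```python
from collections import Counter

# Trees as nested tuples: ("S", ("NP", "she"), ("VP", "sings")).
def nodes(t):
    yield t
    if not isinstance(t, str):
        for c in t[1:]:
            yield from nodes(c)

def height(t):
    return 1 if isinstance(t, str) else 1 + max(height(c) for c in t[1:])

def truncate(t, d):
    # The subtree rooted at t, cut off at depth d.
    if isinstance(t, str):
        return t
    if d == 1:
        return t[0]
    return (t[0],) + tuple(truncate(c, d - 1) for c in t[1:])

def stm(hyp_tree, ref_trees, D=3):
    score = 0.0
    for d in range(1, D + 1):
        hyp = Counter(truncate(n, d) for n in nodes(hyp_tree) if height(n) >= d)
        clip = Counter()
        for ref in ref_trees:
            for t, c in Counter(truncate(n, d)
                                for n in nodes(ref) if height(n) >= d).items():
                clip[t] = max(clip[t], c)  # max count in any one reference
        score += sum(min(c, clip[t]) for t, c in hyp.items()) / max(sum(hyp.values()), 1)
    return score / D  # average clipped precision over depths 1..D

hyp = ("S", ("NP", "she"), ("VP", "sings"))
ref = ("S", ("NP", "she"), ("VP", "sings", ("ADVP", "well")))
print(stm(hyp, [ref]))
```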
17. Linguistic Information: Liu & Gildea (2005)
- Kernel-based Subtree Metric (TKM)
  - H(t) is a vector of counts of all subtrees of t
  - The dot product H(t1) · H(t2) counts the subtrees the two trees have in common
  - Convolution kernels (Collins & Duffy 2001) compute this in polynomial time (sketched below); enumerating all subtrees directly would be exponential in the size of the trees
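A sketch of the kernel computation: C(n1, n2) recursively counts the common subtrees rooted at a node pair, and memoization keeps the double sum polynomial. Trees are the same nested tuples as in the STM sketch; the cosine normalization at the end is one plausible way to turn the kernel into a score, not necessarily the paper's exact formulation:

```python
import math
from functools import cache

def nodes(t):
    yield t
    if not isinstance(t, str):
        for c in t[1:]:
            yield from nodes(c)

def production(t):
    # A node's label plus the labels of its immediate children.
    return (t[0],) + tuple(c if isinstance(c, str) else c[0] for c in t[1:])

@cache
def common_rooted(n1, n2):
    # C(n1, n2): common subtrees rooted at this node pair (Collins & Duffy).
    if isinstance(n1, str) or isinstance(n2, str):
        return 0
    if production(n1) != production(n2):
        return 0
    count = 1
    for c1, c2 in zip(n1[1:], n2[1:]):
        if not isinstance(c1, str):
            count *= 1 + common_rooted(c1, c2)
    return count

def tree_kernel(t1, t2):
    # H(t1) . H(t2), computed without enumerating the subtree space.
    return sum(common_rooted(a, b) for a in nodes(t1) for b in nodes(t2))

def tkm_score(hyp, ref):
    return tree_kernel(hyp, ref) / math.sqrt(
        tree_kernel(hyp, hyp) * tree_kernel(ref, ref))

t1 = ("S", ("NP", "she"), ("VP", "sings"))
t2 = ("S", ("NP", "she"), ("VP", "runs"))
print(tkm_score(t1, t2))  # 0.5
```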
18. Linguistic Information: Liu & Gildea (2005)
- Headword Chain Metric (HWCM), sketched below
  - Convert the phrase-structure parse into a dependency parse
  - Each mother-daughter relationship in the dependency parse is a headword chain of length 2
  - No siblings are included in any headword chain
  - The score is computed in the same fashion as STM
- The other two metrics also have dependency-based versions
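A sketch of HWCM with a dependency parse reduced to a head-to-dependents dict, a toy representation that conflates repeated words:

```python
from collections import Counter

def chains(deps, max_len):
    # All head-to-dependent chains of up to max_len words.
    words = set(deps) | {d for ds in deps.values() for d in ds}
    out = []
    def extend(chain):
        out.append(tuple(chain))
        if len(chain) < max_len:
            for dep in deps.get(chain[-1], []):
                extend(chain + [dep])
    for w in words:
        extend([w])
    return out

def hwcm(hyp_deps, ref_deps_list, k=4):
    score = 0.0
    for n in range(1, k + 1):  # clipped precision per chain length, as in STM
        hyp = Counter(c for c in chains(hyp_deps, k) if len(c) == n)
        clip = Counter()
        for ref in ref_deps_list:
            for c, v in Counter(c for c in chains(ref, k) if len(c) == n).items():
                clip[c] = max(clip[c], v)
        score += sum(min(v, clip[c]) for c, v in hyp.items()) / max(sum(hyp.values()), 1)
    return score / k

hyp = {"sings": ["she", "loudly"]}
ref = {"sings": ["she", "well"]}
print(hwcm(hyp, [ref], k=2))
```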
19. Linguistic Information: Liu & Gildea (2005)
- Data is from MT03 and the JHU Summer Workshop (2003)
- [Table: correlation with overall judgments for one MT system (E15)]
- [Table: correlation with fluency judgments for one MT system (E15)]
20. Linguistic Information: Liu & Gildea (2005)
- Corpus-level judgments for MT03
21. Linguistic Information: Pozar & Charniak (2006)
- Propose the Bllip metric
- Intuition: meaning-preserving transformations of a sentence should not heavily impact its dependency structure
- Perhaps intuitive, but unsubstantiated
22. Linguistic Information: Pozar & Charniak (2006)
- Parse hypothesis and reference translations with the Charniak parser
- Construct dependency parses from the output parse trees
- A lexical head pair (w1, w2) is a dependency if
  - w1 ≠ w2
  - w1 is the lexical head of a constituent immediately dominating the constituent of which w2 is the head
23. Linguistic Information: Pozar & Charniak (2006)
- Construct all dependency pairs for the hypothesis and reference translation
- With multiple reference translations, compare against each one at a time
- Compute precision and recall to score (sketched below)
  - The exact formula is not explicitly stated, but is probably F1
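A sketch of that scoring, with F1 as the presumed combination; the dependency pairs are hypothetical examples:

```python
from collections import Counter

def bllip_f1(hyp_pairs, ref_pairs):
    # Treat each parse as a bag of headword dependency pairs (w1, w2).
    hyp_c, ref_c = Counter(hyp_pairs), Counter(ref_pairs)
    overlap = sum((hyp_c & ref_c).values())
    if overlap == 0:
        return 0.0
    p = overlap / sum(hyp_c.values())
    r = overlap / sum(ref_c.values())
    return 2 * p * r / (p + r)

hyp = [("sings", "she"), ("sings", "loudly")]
ref = [("sings", "she"), ("sings", "softly")]
print(bllip_f1(hyp, ref))  # 0.5
```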
24. Linguistic Information: Pozar & Charniak (2006)
- Evaluation compared the biggest discrepancies between Bllip and BLEU and judged which metric was more accurate in each case
- Results suggest Bllip makes better choices than BLEU
- Results aren't given directly
25. Linguistic Information: Pozar & Charniak (2006)
- Fairly weak paper
- Evaluation is essentially eyeballed
- Still, simple headword bi-chains seem to perform as well as BLEU
- Unfortunately, the results cannot be reliably compared
26. Linguistic Information: Owczarzak et al. (2007)
- Extended the work of Liu & Gildea (2005), who used unlabeled dependency parses
- Insight: more information about grammatical relations might be helpful
  - X is the subject of Y
  - X is a determiner of Y
27. Linguistic Information: Owczarzak et al. (2007)
- Used an LFG parser to generate f-structures, which contain information about grammatical relations
28. Linguistic Information: Owczarzak et al. (2007)
- Types of dependencies
  - Predicate only
  - Predicate-value pairs, i.e., grammatical relations
  - Non-predicate
    - Tense
    - Passive
    - Adjectival degree (comparative, superlative)
    - Verb particle
    - Etc.
- Extended HWCM from Liu & Gildea (2005) to use these labeled dependencies
29. Linguistic Information: Owczarzak et al. (2007)
- How do you account for parser noise?
- The positions of adjuncts should not affect the f-structure of an LFG parse
- Constructed re-orderings of 100 English sentences
  - Each re-ordered sentence is treated as a translation hypothesis
  - The original sentence is treated as the reference translation
30. Linguistic Information: Owczarzak et al. (2007)
31. Linguistic Information: Owczarzak et al. (2007)
- Solution: use n-best parses
  - Trade-off with computation time
  - Used 10-best parses
32. Linguistic Information: Owczarzak et al. (2007)
- Obtained precision and recall for each hypothesis-reference pair
  - Four examples for each machine hypothesis
- Extended matching with WordNet synonyms
- Extended with partial matches: one part of a grammatical relation matches, and the other may or may not
- Computed F1 (sketched below)
  - Tried different weights in the weighted harmonic mean but saw no significant improvement*
* Personal communication with Karolina Owczarzak
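A simplified sketch of scoring over labelled dependency triples with optional partial-match credit; this is a loose reading of the slide's description, not the authors' implementation, and the triples are hypothetical examples:

```python
def labelled_dep_score(hyp, ref, partial=False):
    # Each parse is a set of (relation, head, dependent) triples
    # extracted from the LFG f-structure.
    hyp, ref = set(hyp), set(ref)
    matched = hyp & ref
    if partial:
        for (rel, h, d) in hyp - matched:
            # Credit a partial match: relation plus one argument agree.
            if any(rel == r and (h == rh or d == rd) for (r, rh, rd) in ref):
                matched.add((rel, h, d))
    if not matched:
        return 0.0
    p = len(matched) / len(hyp)
    r = len(matched) / len(ref)
    return 2 * p * r / (p + r)

hyp = [("subj", "sing", "she"), ("tense", "sing", "past")]
ref = [("subj", "sing", "she"), ("tense", "sing", "pres")]
print(labelled_dep_score(hyp, ref, partial=True))  # 1.0 with partial credit
```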
33. Linguistic Information: Owczarzak et al. (2007)
- Evaluated using Pearson correlation with un-normalized human judgment scores (values ranging from 1 to 5)
- Their metric with 50-best parses and WordNet performed best on fluency
- METEOR with WordNet performed best on adequacy and overall
  - Their 50-best partial-matching variant performed only slightly lower than METEOR overall*
- Significantly outperformed BLEU
* Personal communication with Karolina Owczarzak
34. Outline
- Background
- Machine Learning
- Linguistic Information
- Combinations
- Conclusions
35. Combinations: Albrecht & Hwa (2007)
- Extended the work of Kulesza & Shieber (2004)
- Incorporated Liu & Gildea's headword-chain features
- Compared classification to regression using SVMs
36. Combinations: Albrecht & Hwa (2007)
- Classification attempts to learn decision boundaries
- Regression attempts to learn a continuous function
- MT evaluation metrics are continuous: there is no clear boundary between good and bad
- Instead of classifying translations as human or machine (a human-likeness classifier), try to learn the function underlying human judgments (sketched below)
- Score each hypothesis according to a rating scale
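A sketch of the regression framing with scikit-learn's SVR on synthetic stand-in data; the feature set and exact kernel/parameters of the paper are not reproduced here:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.svm import SVR

# Stand-in data: rows of X would be metric features (HWCM, STM, string
# overlap, ...); y would be human ratings on a 1-5 scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 12))
y = np.clip(3 + X[:, 0] + 0.3 * rng.normal(size=500), 1, 5)

model = SVR(kernel="rbf").fit(X[:400], y[:400])
pred = model.predict(X[400:])

# Evaluate the learned metric as the paper does: rank correlation between
# predicted scores and held-out human judgments.
rho, _ = spearmanr(pred, y[400:])
print(f"Spearman rho = {rho:.3f}")
```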
37. Combinations: Albrecht & Hwa (2007)
- Features
  - Syntax-based features compared against the references
    - HWCM
    - STM
  - String-based metrics over a large English corpus
  - Syntax-based metrics over a dependency treebank
38. Combinations: Albrecht & Hwa (2007)
- Data was LDC Multiple-Translation Chinese Part 4
- Used Spearman correlation instead of Pearson
- Classification accuracy is positively related to correlation, but it is possible to improve classification accuracy without improving correlation
- Human-likeness classification seems inconsistent
39. Combinations: Albrecht & Hwa (2007)
- It is possible to train with regression using reasonably sized sets of training instances
- Regression generalizes across data sets
- Results showed the highest overall correlation among the metrics compared
40. Combinations: Albrecht & Hwa (2007)
41. Outline
- Background
- Machine Learning
- Linguistic Information
- Combinations
- Conclusions
42. Conclusions
- Evaluating the performance of MT evaluation metrics still has plenty of room for improvement
- Given that humans don't agree well on MT quality, correlation with human judgments is inherently limited
43. Conclusions
- Machine learning
  - Only scratching the surface of the possibilities
  - Finding the right way to frame the problem is not straightforward
  - Learning the function by which humans assess translations performs better than trying to classify a translation as human or machine
44. Conclusions
- Linguistic information
  - Intuitively, this should be helpful
  - METEOR performs very well with limited linguistic information (synonymy)
  - Automatic parsers and other NLP tools are noisy, so they may compound the problem
45. Conclusions
- Linguistic information and machine learning
  - Combining the two leads to good results (Albrecht & Hwa 2007)
46. Conclusions
- New directions
  - Machine learning with richer linguistic information
    - Labeled dependencies
    - Paraphrases
  - Are other machine learning algorithms better suited than SVMs?
  - Are there better ways of framing the evaluation question?
  - How well can these approaches be extended to task-specific evaluation?
47. Questions?
48. References
- Joshua S. Albrecht and Rebecca Hwa. 2007. A re-examination of machine learning approaches for sentence-level MT evaluation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL-2007).
- Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, June.
- John Blatz, Erin Fitzgerald, George Foster, Simona Gandrabur, Cyril Goutte, Alex Kulesza, Alberto Sanchis, and Nicola Ueffing. 2003. Confidence estimation for machine translation. Technical Report, Natural Language Engineering Workshop Final Report, Johns Hopkins University.
- Alex Kulesza and Stuart M. Shieber. 2004. A learning approach to improving sentence-level MT evaluation. In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), Baltimore, MD, October.
49. References
- Ding Liu and Daniel Gildea. 2005. Syntactic features for evaluation of machine translation. In ACL 2005 Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, June.
- Karolina Owczarzak, Josef van Genabith, and Andy Way. 2007. Labelled dependencies in machine translation evaluation. In Proceedings of the ACL 2007 Workshop on Statistical Machine Translation, pages 104-111, Prague, Czech Republic.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA.
- Michael Pozar and Eugene Charniak. 2006. Bllip: An improved evaluation metric for machine translation. Master's thesis, Brown University.