Recent Trends in MT Evaluation: Linguistic Information and Machine Learning (presentation transcript)

1
Recent Trends in MT Evaluation: Linguistic Information and Machine Learning
  • Jason Adams
  • 11-734
  • 2008-03-05
  • Instructors
  • Alon Lavie
  • Stephan Vogel

2
Outline
  • Background
  • Machine Learning
  • Linguistic Information
  • Combinations
  • Conclusions

3
Background
  • Fully automatic MT Eval is as hard as MT
  • If we could judge with certainty that a
    translation is correct, we could reverse the
    process and generate a correct translation
  • Reference translations help to close this gap

4
Background: Adequacy and Fluency
  • Adequacy
  • How much of the meaning in the source sentence
    is preserved in the hypothesis
  • Reference translations are assumed to achieve
    this sufficiently
  • Fluency
  • How closely the hypothesis sentence conforms to
    the norms of the target language
  • Reference translations are a subset of the
    target language

5
Background: Human Judgments
  • Judge on a scale for adequacy and fluency
  • Agreement between judges is low
  • Judgment scores are normalized
  • Blatz et al. (2003)

6
Background: Evaluating Metrics
  • Correlation with human assessments (judgments)
  • Pearson Correlation
  • Spearman Rank Correlation
  • Adding more references helps BLEU but hurts NIST
    (Finch et al. 2004)
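As a concrete illustration of this evaluation setup, here is a minimal Python sketch that correlates a list of automatic metric scores with a list of human judgment scores using SciPy; the numbers are made-up placeholders, not data from any of the papers discussed.

```python
# Minimal sketch: correlating metric scores with human judgments.
# The score lists are illustrative placeholders only.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.31, 0.45, 0.27, 0.52, 0.40]   # hypothetical automatic scores
human_scores = [3.1, 3.8, 2.9, 4.2, 3.5]         # hypothetical adequacy judgments

pearson_r, _ = pearsonr(metric_scores, human_scores)      # linear correlation
spearman_rho, _ = spearmanr(metric_scores, human_scores)  # rank correlation
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_rho:.3f}")
```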

7
Background: BLEU
  • Papineni et al. (2002)
  • First automatic MT metric to be widely adopted
  • Geometric mean of modified n-gram precision
  • Criticisms
  • Poor sentence level correlation
  • Favors statistical systems
  • Ignores recall
  • Weights local word choice more heavily than
    global accuracy
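For reference, a simplified sketch of the BLEU computation (clipped n-gram precision, geometric mean, brevity penalty) is given below. It illustrates the idea only; it is not the official scoring script, and it omits smoothing.

```python
# A simplified BLEU-style score: clipped n-gram precision, geometric mean,
# and brevity penalty. Illustrative sketch, no smoothing.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, references, max_n=4):
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hypothesis, n)
        # clip each hypothesis n-gram at its maximum count in any single reference
        max_ref_counts = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref_counts[gram] = max(max_ref_counts[gram], count)
        clipped = sum(min(count, max_ref_counts[gram])
                      for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # brevity penalty against the reference closest in length
    ref_len = min((len(r) for r in references),
                  key=lambda length: abs(length - len(hypothesis)))
    bp = 1.0 if len(hypothesis) > ref_len else math.exp(1 - ref_len / max(len(hypothesis), 1))
    return bp * math.exp(sum(log_precisions) / max_n)

hyp = "the cat sat on the mat".split()
ref = "the cat sat on a mat".split()
print(round(bleu(hyp, [ref]), 3))
```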

8
Background: METEOR
  • Banerjee and Lavie (2005)
  • Addresses some of the shortcomings of BLEU
  • Uses recall of best reference
  • Attempts to align hypothesis and reference
  • Better correlation with human judgments
  • Optionally uses WordNet and Porter stemming
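The sketch below shows only the scoring formula from Banerjee and Lavie (2005), assuming the unigram alignment (match and chunk counts) has already been computed; the alignment stage with stemming and WordNet synonyms is not shown.

```python
# METEOR scoring formula given an alignment: recall-weighted harmonic mean
# of unigram precision and recall, times a fragmentation penalty.
def meteor(matches, hyp_len, ref_len, chunks):
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # harmonic mean with recall weighted 9 times precision
    f_mean = 10 * precision * recall / (recall + 9 * precision)
    # fragmentation penalty: fewer, longer matched chunks are rewarded
    penalty = 0.5 * (chunks / matches) ** 3
    return f_mean * (1 - penalty)

# e.g. 6 matched unigrams in 2 contiguous chunks, both sentences of length 8
print(round(meteor(matches=6, hyp_len=8, ref_len=8, chunks=2), 3))
```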

9
Outline
  • Background
  • Machine Learning
  • Linguistic Information
  • Combinations
  • Conclusions

10
Machine Learning: Kulesza & Shieber (2004)
  • Frame the MT Evaluation problem as a
    classification task
  • Can we predict if a sentence is generated by a
    human or a machine by comparing against reference
    translations?

11
Machine Learning: Kulesza & Shieber (2004)
  • Derived a set of features (partially based on
    BLEU)
  • Unmodified n-gram precisions (1 to 5)
  • Min and max ratio of hypothesis to reference
    length
  • Word error rate: minimum edit distance between
    the hypothesis and any reference
  • Position-independent word error rate: the shorter
    translation removed from the longer and the size
    of the remaining set returned (both sketched
    below)
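A minimal sketch of the two error-rate features, following the descriptions above; normalizing both by the reference length is an assumption made here for readability, not necessarily the exact normalization used in the paper.

```python
# WER as word-level edit distance; PER as a position-independent leftover count.
from collections import Counter

def wer(hyp, ref):
    # standard dynamic-programming edit distance over words, one row at a time
    row = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev_diag, row[0] = row[0], i
        for j, r in enumerate(ref, 1):
            cur = min(row[j] + 1,            # deletion
                      row[j - 1] + 1,        # insertion
                      prev_diag + (h != r))  # substitution / match
            prev_diag, row[j] = row[j], cur
    return row[-1] / len(ref)

def per(hyp, ref):
    # remove the shorter translation from the longer, ignoring word positions
    longer, shorter = (hyp, ref) if len(hyp) >= len(ref) else (ref, hyp)
    leftover = Counter(longer) - Counter(shorter)
    return sum(leftover.values()) / len(ref)

hyp = "the cat sat on mat".split()
ref = "the cat sat on the mat".split()
print(wer(hyp, ref), per(hyp, ref))
```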

12
Machine Learning: Kulesza & Shieber (2004)
  • Trained an SVM as a classifier
  • Positive: human translation
  • Negative: machine translation
  • Score is the output of the SVM
  • Distance to the hyperplane is treated as a
    measure of confidence (see the sketch below)
  • Classification accuracy
  • 59% for human examples (positive)
  • 70% for machine examples (negative)
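A minimal sketch of this setup using scikit-learn (not the authors' code); the feature matrix is a random placeholder standing in for the n-gram precision, length-ratio, WER, and PER features listed earlier.

```python
# Train an SVM to separate human from machine translations, then reuse the
# signed distance to the hyperplane as a sentence-level quality score.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((200, 9))               # placeholder feature vectors
y = np.array([1] * 100 + [-1] * 100)   # +1 = human translation, -1 = machine

classifier = SVC(kernel="rbf").fit(X, y)
scores = classifier.decision_function(X)   # distance to hyperplane as confidence
labels = classifier.predict(X)             # human-vs-machine decision
print(scores[:3], labels[:3])
```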

13
Machine Learning: Kulesza & Shieber (2004)
  • Compared to BLEU, WER, PER, F-Measure at the
    sentence level

14
Outline
  • Background
  • Machine Learning
  • Linguistic Information
  • Combinations
  • Conclusions

15
Linguistic Information: Liu & Gildea (2005)
  • Introduce syntactic information
  • Use Collins parser on hypothesis and reference
    translations
  • Looked at three different metrics for comparing
    trees

16
Linguistic Information: Liu & Gildea (2005)
  • Subtree Metric (STM)
  • D: the depth of trees considered
  • Count is the number of times a subtree appears in
    any reference
  • Clipped count limits the count to the maximum
    number of times it appears in any one reference
    (sketched below)
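A sketch of the clipped-count arithmetic behind STM. It assumes the subtrees have already been extracted from the Collins-parser output and grouped by depth (for example, encoded as nested tuples); the extraction step itself is not shown.

```python
# Clipped-count scoring over pre-extracted subtrees, averaged over depths 1..D.
from collections import Counter

def stm(hyp_subtrees, ref_subtrees_list, D):
    """hyp_subtrees: {depth: Counter of subtrees in the hypothesis parse};
    ref_subtrees_list: one such dict per reference translation."""
    score = 0.0
    for depth in range(1, D + 1):
        hyp = hyp_subtrees.get(depth, Counter())
        if not hyp:
            continue
        clipped = 0
        for subtree, count in hyp.items():
            # clip at the maximum count of this subtree in any single reference
            max_in_one_ref = max((ref.get(depth, Counter())[subtree]
                                  for ref in ref_subtrees_list), default=0)
            clipped += min(count, max_in_one_ref)
        score += clipped / sum(hyp.values())
    return score / D
```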

17
Linguistic Information: Liu & Gildea (2005)
  • Kernel-based Subtree Metric (TKM)
  • H(t) is a vector of counts for all subtrees of t
  • H(t1) · H(t2) counts subtrees in common
  • Use convolution kernels (Collins & Duffy, 2001)
    to compute it in polynomial time
  • counting all subtrees would be exponential in the
    size of the trees

18
Linguistic Information: Liu & Gildea (2005)
  • Headword Chain Metric (HWCM)
  • Convert phrase-structure parse into dependency
    parse
  • Each mother-daughter relationship in the
    dependency parse is a headword chain of length 2
  • No siblings included in any headword chain
  • Score computed in the same fashion as STM (see
    the sketch below)
  • The other two metrics also have dependency-based
    versions
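A sketch of headword-chain extraction and clipped-precision scoring, assuming the dependency parse is already available as (head, dependent) word pairs and that words are unique within a sentence; this is an illustration of the idea rather than Liu and Gildea's exact formulation.

```python
# Extract headword chains (paths down the dependency tree) and score the
# hypothesis chains against the reference chains with clipped precision.
from collections import Counter

def headword_chains(dependencies, max_len):
    children = {}
    words = set()
    for head, dep in dependencies:
        children.setdefault(head, []).append(dep)
        words.update((head, dep))
    chains = Counter()

    def walk(chain):
        chains[tuple(chain)] += 1
        if len(chain) < max_len:
            for child in children.get(chain[-1], []):
                walk(chain + [child])

    for word in words:          # chains of length 1..max_len starting at every node
        walk([word])
    return chains

def hwcm(hyp_deps, ref_deps, max_len=4):
    hyp = headword_chains(hyp_deps, max_len)
    ref = headword_chains(ref_deps, max_len)
    score = 0.0
    for n in range(1, max_len + 1):
        hyp_n = {c: k for c, k in hyp.items() if len(c) == n}
        if not hyp_n:
            continue
        clipped = sum(min(k, ref[c]) for c, k in hyp_n.items())
        score += clipped / sum(hyp_n.values())
    return score / max_len

hyp = [("sat", "cat"), ("cat", "the"), ("sat", "mat"), ("mat", "on")]
ref = [("sat", "cat"), ("cat", "the"), ("sat", "on"), ("on", "mat")]
print(round(hwcm(hyp, ref), 3))
```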

19
Linguistic Information: Liu & Gildea (2005)
  • Data is from MT03 and JHU Summer Workshop (2003)

(Tables: correlation with overall judgments and with
fluency judgments for one MT system, E15)
20
Linguistic Information: Liu & Gildea (2005)
  • Corpus-level judgments for MT03

21
Linguistic Information: Pozar & Charniak (2006)
  • Propose the Bllip metric
  • Intuition: meaning-preserving transformations in
    sentences should not heavily impact dependency
    structure
  • Perhaps intuitive, but unsubstantiated

22
Linguistic Information: Pozar & Charniak (2006)
  • Parse hypothesis and reference translations with
    the Charniak parser
  • Construct dependency parses from the output parse
    trees
  • Given a lexical head pair (w1, w2), it is a
    dependency if
  • w1 ≠ w2
  • w1 is the lexical head of a constituent
    immediately dominating the constituent of which
    w2 is the head

23
Linguistic Information: Pozar & Charniak (2006)
  • Construct all dependency pairs for the hypothesis
    and reference translation
  • If multiple reference translations, compare them
    one at a time
  • Compute precision and recall to score
  • The formula for combining them is not explicitly
    stated, but is probably F1 (see the sketch below)
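A sketch of that scoring step: precision and recall over the overlap of dependency pairs, combined with F1. The F1 combination is an assumption, since the paper does not state the formula.

```python
# Score a hypothesis against one reference by overlap of dependency pairs.
from collections import Counter

def dependency_f1(hyp_pairs, ref_pairs):
    hyp, ref = Counter(hyp_pairs), Counter(ref_pairs)
    overlap = sum((hyp & ref).values())   # multiset intersection of (head, dependent) pairs
    if overlap == 0:
        return 0.0
    precision = overlap / sum(hyp.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

hyp = [("sat", "cat"), ("sat", "mat"), ("cat", "the")]
ref = [("sat", "cat"), ("sat", "on"), ("cat", "the"), ("on", "mat")]
print(round(dependency_f1(hyp, ref), 3))   # 0.571
```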

24
Linguistic Information: Pozar & Charniak (2006)
  • Evaluation was performed by comparing the biggest
    discrepancies between Bllip and BLEU and
    determining which was more accurate
  • Results suggest Bllip makes better choices than
    BLEU
  • Results aren't given directly

25
Linguistic Information: Pozar & Charniak (2006)
  • Fairly weak paper
  • Evaluation is basically just eyeballed
  • But simple headword bi-chains seem to perform as
    well as BLEU
  • Unfortunately, the results cannot be reliably
    compared

26
Linguistic Information: Owczarzak et al. (2007)
  • Extended work by Liu & Gildea (2005)
  • They used unlabeled dependency parses
  • Insight: having more information about
    grammatical relations might be helpful
  • X is the subject of Y
  • X is a determiner of Y

27
Linguistic Information: Owczarzak et al. (2007)
  • Used an LFG parser to generate f-structures that
    contain information about grammatical relations

28
Linguistic Information: Owczarzak et al. (2007)
  • Types of dependencies
  • Predicate only
  • Predicate-value pair, i.e. grammatical relations
  • Non-predicate
  • Tense
  • Passive
  • Adjectival degree (comparative, superlative)
  • Verb particle
  • Etc.
  • Extended HWCM from Liu & Gildea (2005) to use
    these labeled dependencies

29
Linguistic Information: Owczarzak et al. (2007)
  • How do you account for parser noise?
  • The positions of adjuncts should not affect
    f-structure in an LFG parse
  • Constructed re-orderings for 100 English
    sentences
  • Re-ordered sentence treated as translation
    hypothesis
  • Original sentence treated as reference
    translation

30
Linguistic Information: Owczarzak et al. (2007)
31
Linguistic Information: Owczarzak et al. (2007)
  • Solution: introduce n-best parses
  • Tradeoff with computation time
  • Used 10-best

32
Linguistic Information: Owczarzak et al. (2007)
  • Obtained precision and recall for each
    (hypothesis, reference) pair
  • Four examples for each machine hypothesis
  • Extended matching using WordNet synonyms
  • Extended with partial matches
  • One part of a grammatical relation matches and
    the other may or may not
  • Computed F1
  • Tried different values for the weighted harmonic
    mean but saw no significant improvement (sketched
    below)

Personal communication with Karolina Owczarzak
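For reference, the weighted harmonic mean (F-beta) they experimented with looks like this; beta = 1 recovers the plain F1 reported in the paper.

```python
# Weighted harmonic mean of dependency-match precision and recall.
def f_beta(precision, recall, beta=1.0):
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

print(f_beta(0.8, 0.6), f_beta(0.8, 0.6, beta=2.0))  # F1 vs recall-weighted F2
```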
33
Linguistic Information: Owczarzak et al. (2007)
  • Evaluated using Pearson correlation with
    un-normalized human judgment scores
  • Values ranging from 1 to 5
  • Their metric using 50-best parses and WordNet
    performed the best on fluency
  • METEOR with WordNet performed best on adequacy
    and overall
  • 50-best partial matching performed slightly
    lower than METEOR overall
  • Significantly outperformed BLEU

Personal communication with Karolina Owczarzak
34
Outline
  • Background
  • Machine Learning
  • Linguistic Information
  • Combinations
  • Conclusions

35
Combinations: Albrecht & Hwa (2007)
  • Extended work by Kulesza & Shieber (2004)
  • Included work by Liu and Gildea with headword
    chains
  • Compared classification to regression using SVMs

36
Combinations: Albrecht & Hwa (2007)
  • Classification attempts to learn decision
    boundaries
  • Regression attempts to learn a continuous
    function
  • MT evaluation metrics are continuous
  • No clear boundary between good and bad
  • Instead of trying to classify as human or machine
    (Human-Likeness Classifier), try to learn the
    function of human judgments
  • Score the hypothesis according to a rating scale
    (see the sketch below)
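A minimal sketch of the regression framing using scikit-learn's SVR (not the authors' code); features and ratings are random placeholders.

```python
# Learn a continuous function from feature vectors to human assessment scores
# instead of a human-vs-machine decision boundary.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.random((300, 12))            # placeholder feature vectors
y = rng.uniform(1.0, 5.0, size=300)  # placeholder human ratings on a 1-5 scale

regressor = SVR(kernel="rbf").fit(X, y)
predicted_quality = regressor.predict(X)   # continuous quality estimates per sentence
print(predicted_quality[:3])
```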

37
Combinations: Albrecht & Hwa (2007)
  • Features
  • Syntax-based, compared to the reference
  • HWCM
  • STM
  • String-based metrics over a large English corpus
  • Syntax-based metrics over a dependency treebank

38
Combinations: Albrecht & Hwa (2007)
  • Data was LDC Multiple Translation Chinese Part 4
  • Spearman correlation instead of Pearson
  • Classification accuracy
  • Positively related, but it's possible to improve
    classification accuracy without improving
    correlation
  • Human-Likeness classification seems inconsistent

39
Combinations: Albrecht & Hwa (2007)
  • It is possible to train using regression with
    reasonably sized sets of training instances
  • Regression generalizes across data sets
  • Results showed the highest overall correlation of
    the metrics compared

40
Combinations: Albrecht & Hwa (2007)
41
Outline
  • Background
  • Machine Learning
  • Linguistic Information
  • Combinations
  • Conclusions

42
Conclusions
  • Evaluating the performance of MT evaluation
    metrics still has plenty of room for improvement
  • Given that humans don't agree well on MT quality,
    correlation with human judgments is inherently
    limited

43
Conclusions
  • Machine learning
  • Only scratching the surface of possibilities
  • Finding the right way to frame the problem is not
    straightforward
  • Learning the function of how humans assess
    translations performs better than attempting to
    classify a translation as human or machine

44
Conclusions
  • Linguistic Information
  • Intuitively, this should be helpful
  • METEOR performs very well with limited linguistic
    information (synonymy)
  • Automatic parsers/NLP tools are noisy, so they
    may compound the problem

45
Conclusions
  • Linguistic Information and Machine Learning
  • Combining the two leads to good results
    (Albrecht & Hwa 2007)

46
Conclusions
  • New directions
  • Machine learning with richer linguistic
    information
  • Labeled dependencies
  • Paraphrases
  • Are other machine learning algorithms better
    suited than SVMs?
  • Are there better ways of framing the evaluation
    question?
  • How well can these approaches be extended to
    task-specific evaluation?

47
Questions?
48
References
  • Joshua S. Albrecht and Rebecca Hwa. 2007. A
    re-examination of machine learning approaches for
    sentence-level MT evaluation. In Proceedings of
    the 45th Annual Meeting of the Association for
    Computational Linguistics (ACL-2007).
  • Satanjeev Banerjee and Alon Lavie. 2005. Meteor:
    An automatic metric for MT evaluation with
    improved correlation with human judgments. In ACL
    2005 Workshop on Intrinsic and Extrinsic
    Evaluation Measures for Machine Translation
    and/or Summarization, June.
  • John Blatz, Erin Fitzgerald, George Foster,
    Simona Gandrabur, Cyril Goutte, Alex Kulesza,
    Alberto Sanchis, and Nicola Ueffing. 2003.
    Confidence estimation for machine translation.
    Technical Report Natural Language Engineering
    Workshop Final Report, Johns Hopkins University.
  • Alex Kulesza and Stuart M. Shieber. 2004. A
    learning approach to improving sentence-level MT
    evaluation. In Proceedings of the 10th
    International Conference on Theoretical and
    Methodological Issues in Machine Translation
    (TMI), Baltimore, MD, October.

49
References
  • Ding Liu and Daniel Gildea. 2005. Syntactic
    features for evaluation of machine translation.
    In ACL 2005 Workshop on Intrinsic and Extrinsic
    Evaluation Measures for Machine Translation
    and/or Summarization, June.
  • Karolina Owczarzak, Josef van Genabith, and Andy
    Way. 2007. Labelled Dependencies in Machine
    Translation Evaluation. In Proceedings of the ACL
    2007 Workshop on Statistical Machine Translation,
    pages 104-111, Prague, Czech Republic.
  • Kishore Papineni, Salim Roukos, Todd Ward, and
    Wei-Jing Zhu. 2002. Bleu: a method for automatic
    evaluation of machine translation. In Proceedings
    of the 40th Annual Meeting of the Association for
    Computational Linguistics, Philadelphia, PA.
  • Michael Pozar and Eugene Charniak. 2006. Bllip:
    An Improved Evaluation Metric for Machine
    Translation. Master's Thesis, Brown University.