1
Automatic Evaluation of Machine Translation
  • Sandeep Kakarla

2
Evaluation
  • Helps MT developers monitor the effects of daily
    changes to the system
  • Useful for comparing alternative systems or
    ideas
  • Used to select one output from multiple
    candidate translations

3
  • Aspects of translation usually considered in
    evaluation:
  • adequacy
  • fidelity
  • fluency

4
Automatic Evaluation
  • Quick
  • Inexpensive
  • Language Independent
  • Correlates highly with Human Evaluation

5
  • The quality of a machine translation is its
    closeness to one or more human reference
    translations, measured by a numerical metric.
  • Evaluation requires:
  • A numerical translation closeness metric
  • A corpus of good-quality human reference
    translations

6
Example 1
  • Candidate 1: It is a guide to action which
    ensures that the military always obeys the
    commands of the party.
  • Candidate 2: It is to insure the troops forever
    hearing the activity guidebook that party direct.
  • Reference 1: It is a guide to action that ensures
    that the military will forever heed Party
    commands.
  • Reference 2: It is the guiding principle which
    guarantees the military forces always being under
    the command of the Party.
  • Reference 3: It is the practical guide for the
    army always to heed the directions of the party.

7
  • The basic idea behind the metric is to compare
    the n-grams of the candidate with the n-grams of
    the reference translations and count the number
    of matches.
  • These matches are position independent.
  • The more matches, the better the candidate
    translation.
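
As a rough illustration of position-independent n-gram matching, the Python sketch below counts how many candidate n-grams appear anywhere in a single reference. The helper names (tokenize, ngrams, count_matches) and the lowercasing, punctuation-stripping tokenizer are assumptions made for this example; they are not taken from the presentation.

```python
import re

def tokenize(text):
    # Assumption: lowercase the text and keep only alphanumeric runs as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngrams(tokens, n):
    # All contiguous windows of n tokens, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_matches(candidate, reference, n):
    # Position independent: a candidate n-gram counts as a match if it
    # occurs anywhere in the reference. No clipping is applied here.
    reference_ngrams = set(ngrams(tokenize(reference), n))
    return sum(1 for gram in ngrams(tokenize(candidate), n) if gram in reference_ngrams)

candidate = ("It is a guide to action which ensures that the military "
             "always obeys the commands of the party.")
reference = ("It is a guide to action that ensures that the military "
             "will forever heed Party commands.")

print("unigram matches:", count_matches(candidate, reference, 1))
print("bigram matches:", count_matches(candidate, reference, 2))
```

The sketch compares against one reference only; the slides that follow extend the idea to several references.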

8
Precision
  • The cornerstone of the evaluation metric is the
    precision measure.
  • To compute precision, count the number of
    candidate translation words that occur in any
    reference translation and divide by the total
    number of words in the candidate translation.
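
The precision computation described above can be sketched in Python as follows. The function name unigram_precision and the simple tokenizer are assumptions for illustration only. The function returns the numerator and denominator separately so the result can be read as a fraction.

```python
import re

def tokenize(text):
    # Assumption: lowercase the text and keep only alphanumeric runs as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def unigram_precision(candidate, references):
    # Count candidate words that occur in ANY reference (no clipping),
    # and return (matched words, total candidate words).
    candidate_tokens = tokenize(candidate)
    reference_vocabulary = set()
    for reference in references:
        reference_vocabulary.update(tokenize(reference))
    matched = sum(1 for word in candidate_tokens if word in reference_vocabulary)
    return matched, len(candidate_tokens)

references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
candidate_1 = ("It is a guide to action which ensures that the military "
               "always obeys the commands of the party.")

matched, total = unigram_precision(candidate_1, references)
print(f"unigram precision: {matched}/{total}")
```

With this tokenizer, only "obeys" in Candidate 1 is missing from every reference.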

9
Example 1
  • Candidate 1: It is a guide to action which
    ensures that the military always obeys the
    commands of the party.
  • Candidate 2: It is to insure the troops forever
    hearing the activity guidebook that party direct.
  • Reference 1: It is a guide to action that ensures
    that the military will forever heed Party
    commands.
  • Reference 2: It is the guiding principle which
    guarantees the military forces always being under
    the command of the Party.
  • Reference 3: It is the practical guide for the
    army always to heed the directions of the party.
  • Unigram precision of Candidate 1 is 17/18.

10
  • Example 2
  • Candidate: the the the the the the the.
  • Reference 1: The cat is on the mat.
  • Reference 2: There is a cat on the mat.
  • Unigram precision of the candidate is 7/7, even
    though the candidate is clearly a poor
    translation.
  • A reference word should be considered exhausted
    after a matching candidate word is identified.

11
Modified Precision
  • Count the maximum number of times a word occurs
    in any single reference translation. Next, clip
    the total count of each candidate word by its
    maximum reference count, add these clipped
    counts up, and divide by the total (unclipped)
    number of candidate words.
  • Modified n-gram precision is calculated
    similarly for any n. Candidate n-gram counts are
    clipped by their corresponding reference maximum
    value, summed, and divided by the total number
    of candidate n-grams.
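
The clipping procedure can be sketched in Python as below. The names tokenize, ngram_counts and modified_precision are assumptions for this example, as is the simple tokenizer. The function returns the clipped count and the total count separately.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and keep only alphanumeric runs as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngram_counts(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    candidate_counts = ngram_counts(tokenize(candidate), n)
    # For each n-gram, the maximum count seen in any single reference.
    max_reference_counts = Counter()
    for reference in references:
        for gram, count in ngram_counts(tokenize(reference), n).items():
            max_reference_counts[gram] = max(max_reference_counts[gram], count)
    # Clip each candidate count by its reference maximum, then sum.
    clipped = sum(min(count, max_reference_counts[gram])
                  for gram, count in candidate_counts.items())
    total = sum(candidate_counts.values())
    return clipped, total

references = ["The cat is on the mat", "There is a cat on the mat."]
candidate = "the the the the the the the"

for n in (1, 2):
    clipped, total = modified_precision(candidate, references, n)
    print(f"modified {n}-gram precision: {clipped}/{total}")
```

For a candidate made of "the" repeated seven times, the unigram count is clipped to 2, because "the" occurs at most twice in any single reference.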

12
  • Example 1
  • Candidate 1: It is a guide to action which
    ensures that the military always obeys the
    commands of the party.
  • Candidate 2: It is to insure the troops forever
    hearing the activity guidebook that party direct.
  • Reference 1: It is a guide to action that ensures
    that the military will forever heed Party
    commands.
  • Reference 2: It is the guiding principle which
    guarantees the military forces always being under
    the command of the Party.
  • Reference 3: It is the practical guide for the
    army always to heed the directions of the party.
  • Modified unigram precision of Candidate 1 is
    17/18; modified bigram precision is 10/17.
  • Modified unigram precision of Candidate 2 is
    8/14; modified bigram precision is 1/13.

13
  • Example 2
  • Candidate: the the the the the the the.
  • Reference 1: The cat is on the mat.
  • Reference 2: There is a cat on the mat.
  • Modified unigram precision of the candidate is
    2/7.
  • Modified bigram precision of the candidate is 0.
  • High modified unigram precision indicates
    adequacy: the candidate uses the same words as
    the references.
  • High modified n-gram precision for larger n
    indicates fluency.

14
  • Modified n-gram precision on blocks of text
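
For a block of text, modified n-gram precision is usually computed by summing the clipped n-gram counts over all candidate sentences and dividing by the total number of candidate n-grams in the block, rather than by averaging per-sentence precisions. A minimal Python sketch under that assumption follows; the function names and the tokenizer are illustrative only.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and keep only alphanumeric runs as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_modified_precision(candidates, reference_sets, n):
    # candidates: list of candidate sentences.
    # reference_sets: one list of reference sentences per candidate.
    clipped_total = 0
    count_total = 0
    for candidate, references in zip(candidates, reference_sets):
        candidate_counts = ngram_counts(tokenize(candidate), n)
        max_reference_counts = Counter()
        for reference in references:
            # Counter union keeps the element-wise maximum count.
            max_reference_counts = max_reference_counts | ngram_counts(tokenize(reference), n)
        # Counter intersection keeps the element-wise minimum, i.e. clipping.
        clipped_total += sum((candidate_counts & max_reference_counts).values())
        count_total += sum(candidate_counts.values())
    return clipped_total, count_total

candidates = ["the the the the the the the", "I always do."]
reference_sets = [
    ["The cat is on the mat", "There is a cat on the mat."],
    ["I always do.", "I invariably do.", "I perpetually do."],
]
for n in (1, 2):
    clipped, total = corpus_modified_precision(candidates, reference_sets, n)
    print(f"block-level modified {n}-gram precision: {clipped}/{total}")
```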

15
  • 4 reference translations for each of 127 source
    sentences (Chinese)

16
  • 2 reference translations for each of 500 source
    sentences, which were translated by 2 human
    translators and 3 MT systems
  • H2: Human translation by a native English speaker
  • H1: Human translation by someone lacking native
    proficiency in both English and Chinese
  • S1-S3: Translations by MT systems 1-3

17
  • The modified n-gram precision decays roughly
    exponentially with n: the modified unigram
    precision is much larger than the modified
    bigram precision, which in turn is much larger
    than the modified trigram precision.
  • A reasonable averaging scheme that takes this
    exponential decay into account is a weighted
    average of the logarithms of the modified
    precisions.
  • A candidate translation should be neither too
    long nor too short, and the evaluation metric
    should enforce this. Modified n-gram precision
    already penalizes candidates that are too long,
    but not candidates that are too short.
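
A weighted average of log precisions, mapped back with an exponential, is a weighted geometric mean. The Python sketch below shows one way to compute it. The function name, the default of uniform weights, and the choice to return 0.0 when any precision is zero are assumptions for this example.

```python
import math

def log_average(precisions, weights=None):
    # Weighted average of log precisions, exponentiated back to [0, 1].
    # Assumption: uniform weights when none are given.
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    # Assumption: log(0) is undefined, so treat any zero precision as score 0.
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Modified unigram and bigram precisions of Candidate 1 in Example 1.
print(log_average([17 / 18, 10 / 17]))
```

Compared with an arithmetic mean, the geometric mean keeps the much smaller higher-order precisions from being swamped by the unigram precision.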

18
  • Example 3
  • Candidate: of the
  • Reference 1: It is a guide to action that ensures
    that the military will forever heed Party
    commands.
  • Reference 2: It is the guiding principle which
    guarantees the military forces always being under
    the command of the Party.
  • Reference 3: It is the practical guide for the
    army always to heed the directions of the party.
  • The modified unigram precision would be 2/2 and
    the modified bigram precision would be 1/1.
  • Traditionally, precision is paired with recall to
    overcome length-related issues.

19
  • Example 4
  • Candidate 1: I always invariably perpetually do.
  • Candidate 2: I always do.
  • Reference 1: I always do.
  • Reference 2: I invariably do.
  • Reference 3: I perpetually do.
  • Though the first candidate recalls more words
    from the references, it is a poorer translation.
    So recall cannot be used to handle the length
    problem.
  • To handle the length factor, we introduce a
    brevity penalty factor, BP.
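
One hedged way to see the problem in Example 4 is a naive pooled recall, defined for this sketch only as the fraction of distinct words across all references that also appear in the candidate. That definition, the function name and the tokenizer are assumptions for illustration; the presentation does not specify a recall formula.

```python
import re

def tokenize(text):
    # Assumption: lowercase the text and keep only alphanumeric runs as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def pooled_recall(candidate, references):
    # Hypothetical recall: distinct reference words (pooled over all
    # references) that the candidate also contains.
    reference_words = set()
    for reference in references:
        reference_words.update(tokenize(reference))
    candidate_words = set(tokenize(candidate))
    return len(reference_words & candidate_words), len(reference_words)

references = ["I always do.", "I invariably do.", "I perpetually do."]
for candidate in ("I always invariably perpetually do.", "I always do."):
    found, total = pooled_recall(candidate, references)
    print(f"{candidate!r}: pooled recall {found}/{total}")
```

Under this definition the first candidate scores higher simply because it combines the alternative word choices of all three references, which is why recall over multiple references rewards the worse translation.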

20
Brevity Penalty
  • r is the effective reference length of the test
    corpus, which is the sum of the best match
    lengths for each candidate sentence in the
    corpus.
  • c is the total length of the candidate
    translation corpus.
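
In the usual BLEU formulation, the brevity penalty is BP = 1 if c > r, and BP = e^(1 - r/c) otherwise. The Python sketch below computes the best match length and the penalty under that formulation. The function names, the rule of breaking ties toward the shorter reference, and the handling of an empty candidate are assumptions for this example.

```python
import math

def best_match_length(candidate_length, reference_lengths):
    # Reference length closest to the candidate length.
    # Assumption: ties are broken in favor of the shorter reference.
    return min(reference_lengths,
               key=lambda length: (abs(length - candidate_length), length))

def brevity_penalty(c, r):
    # c: total candidate length, r: effective reference length.
    if c == 0:
        return 0.0  # Assumption: an empty candidate receives the full penalty.
    return 1.0 if c > r else math.exp(1.0 - r / c)

# Example 3: the 2-word candidate "of the" against references of
# 16, 18 and 16 words.
r = best_match_length(2, [16, 18, 16])
print("best match length:", r)
print("brevity penalty:", brevity_penalty(2, r))
```

For a corpus, r is the sum of best_match_length over all candidate sentences and c is the sum of candidate lengths; the penalty is applied once to the whole corpus rather than sentence by sentence.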

21
BLEU Metric
  • p_n is the modified n-gram precision
  • N is the maximum n-gram length
  • w_n is the positive weight assigned to p_n

22
BLEU Evaluation
  • N = 4 and w_n = 1/N
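
Putting the pieces together, BLEU is commonly written as BLEU = BP * exp(sum over n = 1..N of w_n * log p_n). The Python sketch below is a minimal corpus-level version with N = 4 and w_n = 1/N. The function names, the tokenizer, the tie-break toward the shorter reference and the choice to return 0.0 when any n-gram order has no matches are assumptions for this example, not the presenter's implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: lowercase the text and keep only alphanumeric runs as tokens.
    return re.findall(r"[a-z0-9]+", text.lower())

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidates, reference_sets, max_n=4):
    weights = [1.0 / max_n] * max_n          # w_n = 1/N
    clipped = [0] * max_n                    # clipped n-gram matches per order
    totals = [0] * max_n                     # candidate n-grams per order
    c = 0                                    # total candidate length
    r = 0                                    # effective reference length
    for candidate_text, reference_texts in zip(candidates, reference_sets):
        candidate = tokenize(candidate_text)
        references = [tokenize(text) for text in reference_texts]
        c += len(candidate)
        # Best match length; ties go to the shorter reference (assumption).
        r += min((len(ref) for ref in references),
                 key=lambda length: (abs(length - len(candidate)), length))
        for n in range(1, max_n + 1):
            candidate_counts = ngram_counts(candidate, n)
            max_reference_counts = Counter()
            for ref in references:
                max_reference_counts = max_reference_counts | ngram_counts(ref, n)
            clipped[n - 1] += sum((candidate_counts & max_reference_counts).values())
            totals[n - 1] += sum(candidate_counts.values())
    # Assumption: without smoothing, a zero precision gives a score of 0.
    if c == 0 or any(k == 0 for k in clipped):
        return 0.0
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    log_mean = sum(w * math.log(k / t) for w, k, t in zip(weights, clipped, totals))
    return bp * math.exp(log_mean)

references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
candidate_1 = ("It is a guide to action which ensures that the military "
               "always obeys the commands of the party.")
print(bleu([candidate_1], [references]))
```

BLEU is designed as a corpus-level score; on a single sentence, as in this usage example, any n-gram order with no matches drives the unsmoothed score to zero.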

23
  • The t-statistic compares each system with its
    left neighbor in the table.
  • A t value greater than 1.7 is significant at the
    95% level, so the differences between the scores
    are significant.

24
Human Evaluation
  • 500 sentences, 5 translations each
  • 2500 pairs of Chinese source and English
    translations
  • Monolingual group: 10 native speakers of English
  • Bilingual group: 10 native speakers of Chinese
    who had lived in the United States for 7 years
  • The sentence pairs were given to these two groups
    to rate from 1 (very bad) to 5 (very good). The
    monolingual group made judgments based only on
    readability and fluency.

25
Monolingual group pairwise judgments
  • Monolingual judgments: pairwise differential
    comparison

26
Bilingual group pairwise judgments
  • Bilingual judgments: pairwise differential
    comparison

27
BLEU vs Human Evaluation
  • Correlation coefficient: 0.99

28
BLEU vs Human Evaluation
  • Correlation coefficient: 0.96

29
BLEU vs Human Evaluation