Automatic Evaluation of Machine Translation

1
Automatic Evaluation of Machine Translation

Sandeep Kakarla

2
Evaluation

Helps MT developers monitor the effects of daily changes to the system
Useful for comparing alternative systems or ideas
Used to select one output from multiple translations

Aspects of translation usually considered for evaluation:
adequacy
fidelity
fluency

4
Automatic Evaluation

Quick
Inexpensive
Language Independent
Correlates highly with Human Evaluation

The quality of a machine translation is its closeness to one or more human reference translations, measured according to a numerical metric.
Evaluation requires:
A numerical translation closeness metric
A corpus of good-quality human reference translations

6
Example 1

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.

The basic idea of the metric is to compare the n-grams of the candidate with the n-grams of the reference translations and count the number of matches.
These matches are position-independent.
The more matches there are, the better the candidate translation is.

8
Precision

The cornerstone of the evaluation metric is the precision measure.
To compute precision, count the number of candidate translation words that occur in any reference translation and divide by the total number of words in the candidate translation.
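The plain precision described above can be sketched in a few lines. This is only an illustration: the `tokenize` helper (lowercasing and stripping punctuation) is an assumption, since the slides do not say how words are split, and the function name `unigram_precision` is hypothetical.

```python
from fractions import Fraction

def tokenize(sentence):
    # Assumed tokenizer: split on whitespace, lowercase, strip punctuation.
    stripped = (w.strip(".,;:!?").lower() for w in sentence.split())
    return [w for w in stripped if w]

def unigram_precision(candidate, references):
    """Fraction of candidate words that occur in any reference."""
    cand = tokenize(candidate)
    if not cand:
        return Fraction(0)
    ref_vocab = set()
    for ref in references:
        ref_vocab.update(tokenize(ref))
    matched = sum(1 for w in cand if w in ref_vocab)
    return Fraction(matched, len(cand))

REFERENCES = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
CANDIDATE_1 = ("It is a guide to action which ensures that the military "
               "always obeys the commands of the party.")

print(unigram_precision(CANDIDATE_1, REFERENCES))  # 17/18
```

Because every repeated candidate word is counted as a match, this measure can be gamed by repeating a common word, which is the weakness the next examples expose.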

9
Example 1

Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Unigram precision of Candidate 1 is 17/18.

Example 2
Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Unigram precision of the candidate is 7/7.
A reference word should be considered exhausted after a matching candidate word is identified.

11
Modified Precision

Count the maximum number of times a word occurs in any single reference translation. Next, clip the total count of each candidate word by its maximum reference count, add these clipped counts up, and divide by the total (unclipped) number of candidate words.
Modified n-gram precision is calculated similarly for any n. Candidate n-gram counts are clipped by their corresponding reference maximum value, summed, and divided by the total number of candidate n-grams.
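The clipping procedure can be sketched as follows. The tokenizer (lowercase, punctuation stripped) and the function names are assumptions made for illustration; the function returns the clipped count and the total count separately so that the fractions quoted on the slides are easy to recognize.

```python
from collections import Counter

def tokenize(sentence):
    # Assumed tokenizer: split on whitespace, lowercase, strip punctuation.
    stripped = (w.strip(".,;:!?").lower() for w in sentence.split())
    return [w for w in stripped if w]

def ngrams(tokens, n):
    # All contiguous n-grams of the token list, as tuples.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Return (clipped matches, total candidate n-grams)."""
    cand_counts = Counter(ngrams(tokenize(candidate), n))
    # For each n-gram, the maximum count seen in any single reference.
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(tokenize(ref), n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    # Clip each candidate count by its maximum reference count.
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped, sum(cand_counts.values())

candidate = "the the the the the the the."
references = ["The cat is on the mat.", "There is a cat on the mat."]
print(modified_precision(candidate, references, 1))  # (2, 7)
print(modified_precision(candidate, references, 2))  # (0, 6)
```

The word "the" occurs at most twice in a single reference, so only two of the seven candidate occurrences are credited.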

Example 1
Candidate 1: It is a guide to action which ensures that the military always obeys the commands of the party.
Candidate 2: It is to insure the troops forever hearing the activity guidebook that party direct.
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
Modified unigram precision of Candidate 1 is 17/18; bigram precision is 10/17.
Modified unigram precision of Candidate 2 is 8/14; bigram precision is 1/13.

Example 2
Candidate: the the the the the the the.
Reference 1: The cat is on the mat.
Reference 2: There is a cat on the mat.
Modified unigram precision of the candidate is 2/7.
Modified bigram precision of the candidate is 0.
A high modified unigram precision indicates adequacy.
A high modified n-gram precision for larger n indicates fluency.

Modified n-gram precision on blocks of text

4 reference translations for each of 127 source sentences (Chinese)

2 reference translations for 500 source sentences translated by 2 human translators and 3 MT systems
H2: Human translation by a native English speaker
H1: Human translation by someone lacking proficiency in both English and Chinese
S1-S3: Translations by MT Systems 1-3

We observe that the modified n-gram precision decays roughly exponentially with n: the modified unigram precision is much larger than the modified bigram precision, which in turn is much larger than the modified trigram precision.
A reasonable averaging scheme that takes this exponential decay into account is a weighted average of the logarithms of the modified precisions.
A candidate translation should be neither too long nor too short, and the evaluation metric should enforce this. Modified n-gram precision already penalizes candidates that are too long.
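The averaging scheme mentioned above can be sketched as a weighted average of log precisions, exponentiated back so the result is on the same 0 to 1 scale (a weighted geometric mean). The function name `log_average` and the choice to return 0 when any precision is 0 are assumptions for illustration, not something the slides specify.

```python
import math

def log_average(precisions, weights=None):
    """exp(sum_n w_n * log p_n); uniform weights when none are given."""
    if weights is None:
        weights = [1.0 / len(precisions)] * len(precisions)
    # Assumption: log(0) is undefined, so treat any zero precision as score 0.
    if any(p == 0 for p in precisions):
        return 0.0
    return math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))

# Modified unigram and bigram precisions of Candidate 1 from Example 1.
score = log_average([17 / 18, 10 / 17])
print(round(score, 3))
```

Averaging in log space keeps the much smaller higher-order precisions from being swamped by the unigram precision, which is what an arithmetic mean would do.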

Example 3
Candidate: of the
Reference 1: It is a guide to action that ensures that the military will forever heed Party commands.
Reference 2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
Reference 3: It is the practical guide for the army always to heed the directions of the party.
The modified unigram precision would be 2/2 and the modified bigram precision would be 1/1.
Traditionally, precision is paired with recall to overcome length-related issues.

Example 4
Candidate 1: I always invariably perpetually do.
Candidate 2: I always do.
Reference 1: I always do.
Reference 2: I invariably do.
Reference 3: I perpetually do.
Though the first candidate recalls more words from the references, it is a poorer translation. So recall cannot be used to handle the length problem.
To handle this length factor, we introduce a brevity penalty factor, BP.

20
Brevity Penalty

r is the effective reference length of the test corpus, which is the sum of the best match lengths for each candidate sentence in the corpus.
c is the total length of the candidate translation corpus.
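The formula on this slide appears to have been an image that did not survive in the transcript. In the standard BLEU definition, which these slides follow, the brevity penalty in terms of r and c is:

```latex
\mathrm{BP} =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
```

A candidate corpus longer than the effective reference length is not penalized here, since modified n-gram precision already handles overly long output. Shorter corpora are penalized exponentially in r/c.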

21
BLEU Metric

p_n is the modified n-gram precision
N is the maximum n-gram length
W_n is the positive weight assigned to p_n
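The metric itself is missing from the transcript (it was presumably an image on the slide). In the standard BLEU definition, using the modified precisions, the maximum n-gram length N, the weights defined on this slide, and the brevity penalty BP, it reads:

```latex
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left( \sum_{n=1}^{N} W_n \log p_n \right)
\qquad\Longleftrightarrow\qquad
\log \mathrm{BLEU} = \min\!\left(1 - \frac{r}{c},\, 0\right) + \sum_{n=1}^{N} W_n \log p_n
```

The score lies between 0 and 1, and reaches 1 only when the candidate matches a reference exactly.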

22
BLEU Evaluation

N = 4 and W_n = 1/N
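With N = 4 and uniform weights, a corpus-level score can be sketched as below. This is a minimal illustration rather than a reference implementation: the tokenizer, the tie-break toward the shorter reference when picking the best match length, and returning 0 when any n-gram order has no matches (no smoothing) are all assumptions, and the name `corpus_bleu_sketch` is hypothetical.

```python
import math
from collections import Counter

def tokenize(sentence):
    # Assumed tokenizer: split on whitespace, lowercase, strip punctuation.
    stripped = (w.strip(".,;:!?").lower() for w in sentence.split())
    return [w for w in stripped if w]

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def corpus_bleu_sketch(candidates, references_list, max_n=4):
    clipped = [0] * max_n   # clipped n-gram matches, summed over the corpus
    totals = [0] * max_n    # candidate n-gram counts, summed over the corpus
    c = 0                   # total candidate length
    r = 0                   # effective reference length
    for cand, refs in zip(candidates, references_list):
        cand_tokens = tokenize(cand)
        ref_tokens = [tokenize(ref) for ref in refs]
        c += len(cand_tokens)
        # Best match length: reference length closest to the candidate's
        # (assumption: ties go to the shorter reference).
        r += min((abs(len(rt) - len(cand_tokens)), len(rt))
                 for rt in ref_tokens)[1]
        for n in range(1, max_n + 1):
            cand_counts = Counter(ngrams(cand_tokens, n))
            max_ref = Counter()
            for rt in ref_tokens:
                for gram, count in Counter(ngrams(rt, n)).items():
                    max_ref[gram] = max(max_ref[gram], count)
            clipped[n - 1] += sum(min(count, max_ref[gram])
                                  for gram, count in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())
    # Assumption: no smoothing, so any empty or unmatched order gives 0.
    if c == 0 or any(k == 0 for k in clipped):
        return 0.0
    bp = 1.0 if c > r else math.exp(1 - r / c)
    log_avg = sum(math.log(k / t) / max_n for k, t in zip(clipped, totals))
    return bp * math.exp(log_avg)

REFERENCES = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
CANDIDATE_1 = ("It is a guide to action which ensures that the military "
               "always obeys the commands of the party.")
CANDIDATE_2 = ("It is to insure the troops forever hearing the activity "
               "guidebook that party direct.")

print(corpus_bleu_sketch([CANDIDATE_1], [REFERENCES]))
print(corpus_bleu_sketch([CANDIDATE_2], [REFERENCES]))
```

On a single sentence the unsmoothed score collapses to 0 as soon as one n-gram order has no match, as happens for Candidate 2, which is one reason the metric is meant to be computed over a whole test corpus.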

The t-statistic compares each system with its left neighbor in the table.
A t value greater than 1.7 is significant at the 95% level, so the differences between the scores are significant.

24
Human Evaluation

500 sentences, 5 translations each
2500 pairs of Chinese source and English translations.
Monolingual group: 10 native speakers of English
Bilingual group: 10 native speakers of Chinese who had lived in the United States for 7 years
The sentence pairs were given to these two groups to rate from 1 (very bad) to 5 (very good). The monolingual group made judgments based only on readability and fluency.

25
Monolingual group pairwise judgments