1
ParaEval: Using Paraphrases to Improve Machine
Translation and Summarization Evaluations
  • Liang Zhou
  • USC/ISI

2
Introduction
  • My short career in NLP so far.
  • Summarization
  • Headline generation.
  • Biography creation.
  • Question-focused multi-document summarization.
  • Online discussion summarization.
  • Evaluation
  • Summary content fragments (Basic Elements).
  • Better text comparisons (ParaEval) using
    paraphrases.
  • Other things
  • Sentence segmentation or simplification for SMT.
  • Speech processing for summarization purposes.

3
Talk Outline
  • Why evaluation?
  • From solving problems.
  • To looking at results.
  • MT vs. summarization evaluations.
  • What's the difference?
  • Which one is more difficult?
  • ParaEval for MT evaluation.
  • ParaEval for summarization evaluation.
  • Conclusion.

4
Solving problems
I am the king of the world!
5
Evaluating results
6
MT and Summ Evaluations
  • Seriously important.
  • MT: an $8 billion industry (according to
    Slate.com).
  • Summarization: ?
  • Which systems are good? Which are bad?
  • Manual evaluations are better
  • Done by humans, trusted by humans.
  • Too costly and too slow.
  • Automatic methods
  • Have enabled the rapid development in both
    fields.
  • Fast turn-around for design-eval cycles.
  • Good for hill-climbing on single systems.

7
MT vs. Summarization
  • MT
  • State-of-the-art systems translate one sentence
    at a time.
  • Evaluation
  • Peer and reference sentences are nicely aligned.
  • Summarization
  • Shrinking a whole lot of information into 100 or
    200 words.
  • Leaving lots of possibilities for the summarizer
    (both human and machine) to explore.
  • A reason that a good summary is hard to define.
  • Evaluation
  • No sentence alignments.
  • Entire summaries as the limit.

Hard!
Even harder!
8
Current Evaluation Methods
  • Counting n-gram co-occurrence statistics
  • BLEU (Papineni et al., 2002).
  • ROUGE (Lin and Hovy, 2003).

Lexical identity
  • Limitation: no support for measuring semantic
    closeness.
  • BLEU
  • Multiple references are used to lessen the
    problem.
  • Need to recognize paraphrases within the peer and
    the references individually, and also between them.
  • ROUGE
  • Similar problem. But more problematic due to the
    length of the summaries.
  • This has led to more manual annotations being
    carried out.
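To make the counting above concrete, here is a minimal sketch of BLEU-style clipped (modified) n-gram precision; the function and variable names are illustrative, not taken from BLEU's reference implementation.

```python
from collections import Counter

def modified_ngram_precision(peer_tokens, ref_token_lists, n=1):
    """BLEU-style modified n-gram precision: each peer n-gram is credited
    at most as many times as it appears in any single reference (clipping),
    divided by the total number of peer n-grams."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    peer_counts = ngrams(peer_tokens)
    # Clip each n-gram by its maximum count over all references.
    max_ref_counts = Counter()
    for ref in ref_token_lists:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in peer_counts.items())
    total = sum(peer_counts.values())
    return clipped / total if total else 0.0

# Example: repeated "the" is clipped against the reference's two occurrences.
print(modified_ngram_precision("the the the cat".split(),
                               [["the", "cat", "sat", "on", "the", "mat"]]))  # 0.75
```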

9
Semantic Closeness
  • Using paraphrases as an approximation
  • Alternative verbalizations conveying the same
    information.
  • Required by many NLP applications: QA,
    summarization, generation, MT, etc.
  • Incorporating paraphrase recognition and matching
    into the evaluation process
  • Comparing more content from texts.
  • Better correlation with human assessments (we
    hope!).
  • Paraphrase acquisition
  • Domain-independent.
  • Unsupervised: cheap and delivered in a timely
    fashion.
  • Large enough to be general.

10
Paraphrases
  • Previous acquisition efforts
  • Manual: domain-specific.
  • Lexical resources (WordNet): we need more than
    single-word synsets.
  • Derivation from corpora
  • Multiple translations: Barzilay and McKeown
    (2001) and Pang et al. (2003). Nice, but cannot
    be utilized on a large scale.
  • Our approach: also in (Bannard and
    Callison-Burch, 2005).
  • Use a parallel corpus and extract phrases in both
    languages.
  • Find phrases with the same translation.
  • Phrases that often share the same translation are
    treated as paraphrases.
11
Collecting Paraphrases
Bannard and Callison-Burch (2005) for MT.

(Figure: paraphrase collection pipeline. Word alignments from GIZA
(Och and Ney, 2004); direct substitution from an n-gram language model.)
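A minimal sketch of the pivot idea, assuming a hypothetical phrase table that maps each English phrase to its foreign translations with counts; this is not the authors' actual pipeline, which works from GIZA word alignments over a parallel corpus.

```python
from collections import defaultdict

def pivot_paraphrases(phrase_table, min_prob=0.05):
    """Pivot-style paraphrase collection: two English phrases that share a
    foreign translation are paired as paraphrase candidates, scored by
    summing p(f|e1) * p(e2|f) over all pivot phrases f.

    `phrase_table` is assumed to be a dict: english_phrase -> {foreign_phrase: count}.
    """
    # p(f|e): normalize translation counts per English phrase.
    p_f_given_e = {
        e: {f: c / sum(fs.values()) for f, c in fs.items()}
        for e, fs in phrase_table.items()
    }
    # p(e|f): invert the table and normalize per foreign phrase.
    by_foreign = defaultdict(dict)
    for e, fs in phrase_table.items():
        for f, c in fs.items():
            by_foreign[f][e] = c
    p_e_given_f = {
        f: {e: c / sum(es.values()) for e, c in es.items()}
        for f, es in by_foreign.items()
    }

    paraphrases = defaultdict(float)
    for e1, fs in p_f_given_e.items():
        for f, p1 in fs.items():
            for e2, p2 in p_e_given_f[f].items():
                if e2 != e1:
                    paraphrases[(e1, e2)] += p1 * p2
    return {pair: p for pair, p in paraphrases.items() if p >= min_prob}
```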
12
ParaEval for MT
  • Incorporate the paraphrases into MT eval

(Figure: word matches. Grey: matched by BLEU. Color plus grey: matched by ParaEval.)
13
BLEU-style Match Counting
  • BLEU details (Papineni et al., 2002)
  • Modified n-gram precision
  • Clipping.
  • ParaEval
  • Adopting this clipping philosophy (sketched below)
  • Unigram clipping.
  • Paraphrase clipping.
  • Counting unit: unigram.
  • Our goals
  • Paraphrase support.
  • Recall computation.
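A sketch of paraphrase clipping, assuming a hypothetical `paraphrase_table` that maps each phrase to a set of its known paraphrases (in practice this comes from the pivot extraction). As with BLEU's unigram clipping, each reference occurrence may be consumed only once.

```python
from collections import Counter

def paraphrase_clipped_matches(peer_phrases, ref_phrases, paraphrase_table):
    """Paraphrase clipping, analogous to BLEU's unigram clipping: a peer
    phrase is credited via a paraphrase at most as many times as that
    paraphrase actually occurs in the reference.  Counting unit: unigram."""
    ref_counts = Counter(ref_phrases)   # how many times each reference phrase is available
    matched_unigrams = 0
    for phrase in peer_phrases:
        candidates = paraphrase_table.get(phrase, set()) | {phrase}
        for cand in candidates:
            if ref_counts[cand] > 0:    # clip: consume one reference occurrence
                ref_counts[cand] -= 1
                matched_unigrams += len(phrase.split())
                break
    return matched_unigrams
```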

14
The Paraphrase Effect
  • Paraphrase matching on top of BLEU
  • Alas, only unigrams:
  • Isolates the examination of the paraphrase effect
    directly.
  • Not complicated by a weighted n-gram scheme.
  • Evaluating ParaEval
  • Calculate the correlation with the human ranking
    of MT systems (sketched below).
  • Need to distinguish
  • Good systems from bad ones.
  • Systems from humans.
  • Used NIST 2003 Chinese MT evaluation
  • 8 MT systems. 4 sets of reference translations.
  • NIST assessments on fluency and adequacy
    rankings.
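One common way to compute such a metric-to-human correlation is Spearman's rank correlation over per-system scores; the slide does not specify the exact statistic used, so the following is only an illustrative sketch.

```python
def spearman_rank_correlation(metric_scores, human_scores):
    """Spearman's rank correlation between two score lists (one value per
    MT system): rank both lists, then compute Pearson correlation on the
    ranks.  Ties are broken arbitrarily, which is fine for a sketch."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy)

    return pearson(ranks(metric_scores), ranks(human_scores))
```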

15
Correlation Comparison
  • Footnote on ParaEval
  • Even better when using a combination of higher
    n-grams (> trigrams) for paraphrase matching.
  • Could indicate mistakes by the word-alignment
    program on lower n-grams (unigrams and bigrams).
  • Waiting for results from (Liang et al., 2006):
    bidirectional discriminative training to achieve
    consensus on word alignments, i.e., more accurate
    single-word alignments.

(Correlation comparison chart; highlighted value: 0.7619.)
16
Rank Humans and Systems
(Ranking chart: humans separated from systems, all of this from precision alone.)
17
Recall Support
  • BLEU: no recall.
  • Multiple references to overcome the limitation
    of matching on lexical identity.
  • Use brevity penalty.
  • ParaEval
  • We have lots of paraphrases.
  • Single reference + paraphrases ≈ multiple
    verbalizations.
  • Correlate with Adequacy assessment only
  • Just want to see how much I got right.
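A toy sketch of paraphrase-supported recall against a single reference, reusing the hypothetical `paraphrase_table` format from the earlier sketch; real ParaEval matches multi-word phrases, not just single words.

```python
def paraphrase_recall(peer_tokens, ref_tokens, paraphrase_table):
    """Recall over the reference: the fraction of reference unigrams covered
    either by identical peer words or by a paraphrase of a peer word."""
    peer_vocab = set(peer_tokens)
    # Everything the peer can "express": its own words plus their paraphrases.
    expressible = set(peer_vocab)
    for word in peer_vocab:
        expressible |= paraphrase_table.get(word, set())

    covered = sum(1 for token in ref_tokens if token in expressible)
    return covered / len(ref_tokens) if ref_tokens else 0.0
```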

18
Using Single Reference
  • Lots of paraphrases for MT eval lead to:
  • Using a single reference translation is enough.
  • But make sure the single reference is good.

single ref!
19
Intermission
  • So far, we have talked about ParaEval for MT eval.
  • Next: ParaEval for summarization eval.
  • Shrinking a whole lot of information into 100 or
    200 words.
  • Evaluation
  • No sentence alignments.
  • Entire summaries as the limit.

20
Previous Summ. Eval Work
  • Manual annotations
  • SEE (Lin and Hovy, 2001): sentence-level,
    partial credit allowed.
  • Factoid (Van Halteren and Teufel, 2003).
  • Pyramid (Nenkova and Passonneau, 2004).
  • Automated
  • Matching based on lexical identity.
  • ROUGE (Lin and Hovy, 2003)
  • Summary units are of fixed length (unigrams,
    bigrams, etc.).
  • BE (Hovy et al., 2005)
  • Summary units of variable size, yet still
    meaningful.

nuggets
21
Human Evaluator
  • What do humans do naturally?
  • In text understanding and comparison.
  • The ability to recognize semantic equivalence:
    paraphrasing.

(Nenkova and Passonneau, 2004)
22
Comparison Strategy
  • Comparison strategy
  • Top: multi-word paraphrase match (optimal).
  • Paraphrase table.
  • Dynamic programming.
  • Counting unit: unigram.
  • Middle: 1-to-N, N-to-1, 1-to-1 paraphrase match
    (greedy).
  • Bottom: lexical n-gram match (unigram).
  • On text segments not consumed by paraphrase
    matching.
  • A ROUGE-1 guarantee (important).
  • Overall score: recall (see the sketch below).
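A simplified sketch of the lower two tiers (greedy paraphrase matching, then the lexical unigram fallback). The top tier, optimal multi-word matching via dynamic programming, is omitted for brevity, and the `paraphrase_table` format is again hypothetical.

```python
def tiered_summary_match(peer_tokens, ref_tokens, paraphrase_table):
    """Tiered comparison sketch: (1) greedily match reference words against
    peer paraphrases, (2) match leftover words by lexical identity (the
    ROUGE-1 fallback).  Returns unigram recall over the reference."""
    remaining_ref = list(ref_tokens)
    remaining_peer = list(peer_tokens)
    covered = 0

    # Middle tier (greedy paraphrase match): a peer word covers a reference
    # word if the reference word is among its known paraphrases.
    for p in list(remaining_peer):
        for r in list(remaining_ref):
            if r in paraphrase_table.get(p, set()):
                remaining_ref.remove(r)      # no double counting on either side
                remaining_peer.remove(p)
                covered += 1
                break

    # Bottom tier (lexical unigram match) on whatever neither side consumed.
    for r in list(remaining_ref):
        if r in remaining_peer:
            remaining_peer.remove(r)
            remaining_ref.remove(r)
            covered += 1

    return covered / len(ref_tokens) if ref_tokens else 0.0
```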

23
Complicated Process
24
Paraphrase Matching
  • Optimization problem
  • Favoring recall: content coverage.
  • Peer phrases are matched as well as possible with
    phrases from the reference:
  • No double counting on either side.

25
ParaEval-summ Validation
  • Document Understanding Conference (DUC)
  • Systems compete in summarization tasks.
  • NIST assessors make judgments.
  • Validation process
  • Grade each summary produced by each system.
  • Assign an overall score for each system.
  • Compare with humans' scores.
  • DUC-2003: 18 systems, 4 sets of refs, 30 doc
    sets.

26
Correlation
27
Detailed Comparison
28
Quality of the Paraphrases
  • How good or bad are the paraphrases?
  • We expect them to be noisy.
  • But we see they are effective in MT and summ
    evaluations.
  • Phrases matched in MT eval:
  • Matching is done at the sentence level.
  • Many words are matched by lexical identity.
  • This leaves little room for paraphrase mistakes.
  • Human assessment for phrases in summ eval
  • 128 pairs of paraphrases.
  • Sentences as contexts.
  • 3 judges.

29
Manual Results
  • Results
  • Precision: 68%.
  • Kappa: 0.582.
  • Difficult to judge:
  • Semantic equivalence.
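With three judges, Fleiss' kappa is one standard agreement statistic; the slides do not say which kappa variant was used, so this is only an illustrative sketch with a toy input format.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-annotator agreement with a fixed number of
    raters per item.  `ratings` is a list of dicts: for each item, a mapping
    from category (e.g. 'paraphrase' / 'not-paraphrase') to the number of
    judges who chose it."""
    n_items = len(ratings)
    n_raters = sum(ratings[0].values())
    categories = {c for item in ratings for c in item}

    # Mean per-item observed agreement.
    p_bar = sum(
        (sum(c * c for c in item.values()) - n_raters) / (n_raters * (n_raters - 1))
        for item in ratings
    ) / n_items

    # Chance agreement from the overall category distribution.
    p_e = sum(
        (sum(item.get(cat, 0) for item in ratings) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 judges per pair, two categories.
pairs = [{"yes": 3}, {"yes": 2, "no": 1}, {"no": 3}, {"yes": 1, "no": 2}]
print(fleiss_kappa(pairs))  # ~0.33
```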

30
Conclusion and Future Work
  • Advancing the evaluations for MT and summ
  • By using paraphrases extracted with MT techniques.
  • Getting more textual content compared.
  • Evaluation metrics turn out to be good.
  • Future directions
  • Paraphrase induction for MT
  • I don't know this word/phrase, but I have been in
    similar situations before. What was the
    word/phrase I used that time?
  • Longer phrases. Possibly to the level of the
    Microsoft paraphrases.
  • Measuring word-alignment errors, leading to better
    paraphrases.
  • Text generation: recognizing more, and more
    precise, redundancy.

31
Thank you!
Questions?