1
ParaEval: Using Paraphrases to Improve Machine
Translation and Summarization Evaluations
  • Liang Zhou
  • USC/ISI

2
Introduction
  • My short career in NLP so far.
  • Summarization
  • Headline generation.
  • Biography creation.
  • Question-focused multi-document summarization.
  • Online discussion summarization.
  • Evaluation
  • Summary content fragments (Basic Elements).
  • Better text comparisons (ParaEval) using
    paraphrases.
  • Other things
  • Sentence segmentation or simplification for SMT.
  • Speech processing for summarization purposes.

3
Talk Outline
  • Why evaluation?
  • From solving problems.
  • To looking at results.
  • MT vs. summarization evaluations.
  • What's the difference?
  • Which one is more difficult?
  • ParaEval for MT evaluation.
  • ParaEval for summarization evaluation.
  • Conclusion.

4
Solving problems
I am the king of the world!
5
Evaluating results
6
MT and Summ Evaluations
  • Seriously important.
  • MT: an $8 billion industry (according to
    Slate.com).
  • Summarization: ?
  • Which systems are good? Which are bad?
  • Manual evaluations are better
  • Done by humans, trusted by humans.
  • Too costly and too slow.
  • Automatic methods
  • Have enabled the rapid development in both
    fields.
  • Fast turn-around for design-eval cycles.
  • Good for hill-climbing on single systems.

7
MT vs. Summarization
  • MT
  • State-of-the-art systems translate one sentence
    at a time.
  • Evaluation
  • Peer and reference sentences are nicely aligned.
  • Summarization
  • Shrinking a whole lot of information into 100 or
    200 words.
  • Leaving lots of possibilities for the summarizer
    (both human and machine) to explore.
  • A reason that a good summary is hard to define.
  • Evaluation
  • No sentence alignments.
  • Entire summaries as the limit.

Hard!
Even harder!
8
Current Evaluation Methods
  • Counting n-gram co-occurrence statistics
  • BLEU (Papineni et al., 2002).
  • ROUGE (Lin and Hovy, 2003).

Lexical identity
  • Limitation: no support for measuring semantic
    closeness.
  • BLEU
  • Multiple references are used to lessen the
    problem.
  • Need to recognize paraphrases within the peer and
    the references individually, and also between them.
  • ROUGE
  • Similar problem. But more problematic due to the
    length of the summaries.
  • This has led to more manual annotations being
    carried out.
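To make the counting above concrete, here is a minimal sketch of BLEU-style clipped (modified) n-gram precision; the function and variable names are illustrative, not taken from BLEU's reference implementation.

```python
from collections import Counter

def modified_ngram_precision(peer_tokens, ref_token_lists, n=1):
    """BLEU-style modified n-gram precision: each peer n-gram is credited
    at most as many times as it appears in any single reference (clipping),
    divided by the total number of peer n-grams."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    peer_counts = ngrams(peer_tokens)
    # Clip each n-gram by its maximum count over all references.
    max_ref_counts = Counter()
    for ref in ref_token_lists:
        for gram, count in ngrams(ref).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)

    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in peer_counts.items())
    total = sum(peer_counts.values())
    return clipped / total if total else 0.0

# Example: repeated "the" is clipped against the reference's two occurrences.
print(modified_ngram_precision("the the the cat".split(),
                               [["the", "cat", "sat", "on", "the", "mat"]]))  # 0.75
```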

9
Semantic Closeness
  • Using paraphrases as an approximation
  • Alternative verbalizations conveying the same
    information.
  • Required by many NLP applications: QA,
    summarization, generation, MT, etc.
  • Incorporating paraphrase recognition and matching
    into the evaluation process
  • Comparing more content from texts.
  • Better correlation with human assessments (we
    hope!).
  • Paraphrase acquisition
  • Domain-independent.
  • Unsupervised: cheap and delivered in a timely
    fashion.
  • Large enough to be general.

10
Paraphrases
  • Previous acquisition efforts
  • Manual: domain-specific.
  • Lexical resources (WordNet): we need more than
    single-word synsets.
  • Derivation from corpora
  • Multiple translations: Barzilay and McKeown
    (2001) and Pang et al. (2003). Nice, but cannot
    be utilized on a large scale.
  • Our approach: also in (Bannard and
    Callison-Burch, 2005).
  • Use a parallel corpus and extract phrases in both
    languages.
  • Find phrases with the same translation.
  • Phrases that often share the same translation are
    treated as paraphrases.
11
Collecting Paraphrases
Bannard and Callison-Burch (2005) for MT.

(Figure: paraphrase collection pipeline. Word alignments from GIZA
(Och and Ney, 2004); direct substitution from an n-gram language model.)
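A minimal sketch of the pivot idea, assuming a hypothetical phrase table that maps each English phrase to its foreign translations with counts; this is not the authors' actual pipeline, which works from GIZA word alignments over a parallel corpus.

```python
from collections import defaultdict

def pivot_paraphrases(phrase_table, min_prob=0.05):
    """Pivot-style paraphrase collection: two English phrases that share a
    foreign translation are paired as paraphrase candidates, scored by
    summing p(f|e1) * p(e2|f) over all pivot phrases f.

    `phrase_table` is assumed to be a dict: english_phrase -> {foreign_phrase: count}.
    """
    # p(f|e): normalize translation counts per English phrase.
    p_f_given_e = {
        e: {f: c / sum(fs.values()) for f, c in fs.items()}
        for e, fs in phrase_table.items()
    }
    # p(e|f): invert the table and normalize per foreign phrase.
    by_foreign = defaultdict(dict)
    for e, fs in phrase_table.items():
        for f, c in fs.items():
            by_foreign[f][e] = c
    p_e_given_f = {
        f: {e: c / sum(es.values()) for e, c in es.items()}
        for f, es in by_foreign.items()
    }

    paraphrases = defaultdict(float)
    for e1, fs in p_f_given_e.items():
        for f, p1 in fs.items():
            for e2, p2 in p_e_given_f[f].items():
                if e2 != e1:
                    paraphrases[(e1, e2)] += p1 * p2
    return {pair: p for pair, p in paraphrases.items() if p >= min_prob}
```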
12
ParaEval for MT
  • Incorporate the paraphrases into MT eval

(Figure: word matches. Grey: matched by BLEU. Color plus grey: matched by ParaEval.)
13
BLEU-style Match Counting
  • BLEU details (Papineni et al., 2002)
  • Modified n-gram precision
  • Clipping.
  • ParaEval
  • Adopting this clipping philosophy (sketched below)
  • Unigram clipping.
  • Paraphrase clipping.
  • Counting unit: unigram.
  • Our goals
  • Paraphrase support.
  • Recall computation.
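A sketch of paraphrase clipping, assuming a hypothetical `paraphrase_table` that maps each phrase to a set of its known paraphrases (in practice this comes from the pivot extraction). As with BLEU's unigram clipping, each reference occurrence may be consumed only once.

```python
from collections import Counter

def paraphrase_clipped_matches(peer_phrases, ref_phrases, paraphrase_table):
    """Paraphrase clipping, analogous to BLEU's unigram clipping: a peer
    phrase is credited via a paraphrase at most as many times as that
    paraphrase actually occurs in the reference.  Counting unit: unigram."""
    ref_counts = Counter(ref_phrases)   # how many times each reference phrase is available
    matched_unigrams = 0
    for phrase in peer_phrases:
        candidates = paraphrase_table.get(phrase, set()) | {phrase}
        for cand in candidates:
            if ref_counts[cand] > 0:    # clip: consume one reference occurrence
                ref_counts[cand] -= 1
                matched_unigrams += len(phrase.split())
                break
    return matched_unigrams
```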

14
The Paraphrase Effect
  • Paraphrase matching on top of BLEU
  • Alas, only unigrams:
  • Isolates the examination of the paraphrase effect
    directly.
  • Not complicated by a weighted n-gram scheme.
  • Evaluating ParaEval
  • Calculate the correlation with the human ranking
    of MT systems (sketched below).
  • Need to distinguish
  • Good systems from bad ones.
  • Systems from humans.
  • Used NIST 2003 Chinese MT evaluation
  • 8 MT systems. 4 sets of reference translations.
  • NIST assessments on fluency and adequacy
    rankings.
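One common way to compute such a metric-to-human correlation is Spearman's rank correlation over per-system scores; the slide does not specify the exact statistic used, so the following is only an illustrative sketch.

```python
def spearman_rank_correlation(metric_scores, human_scores):
    """Spearman's rank correlation between two score lists (one value per
    MT system): rank both lists, then compute Pearson correlation on the
    ranks.  Ties are broken arbitrarily, which is fine for a sketch."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order, start=1):
            r[i] = float(rank)
        return r

    def pearson(xs, ys):
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        vx = sum((x - mx) ** 2 for x in xs) ** 0.5
        vy = sum((y - my) ** 2 for y in ys) ** 0.5
        return cov / (vx * vy)

    return pearson(ranks(metric_scores), ranks(human_scores))
```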

15
Correlation Comparison
  • Footnote on ParaEval
  • Even better when using a combination of higher
    n-grams (> trigrams) for paraphrase matching.
  • Could indicate mistakes by the word-alignment
    program on lower n-grams (unigrams and bigrams).
  • Waiting for results from (Liang et al., 2006):
    bidirectional discriminative training to achieve
    consensus on word alignments, i.e., more accurate
    single-word alignments.

(Correlation comparison chart; highlighted value: 0.7619.)
16
Rank Humans and Systems
(Ranking chart: humans separated from systems, all of this from precision alone.)
17
Recall Support
  • BLEU: no recall.
  • Multiple references to overcome the limitation
    of matching on lexical identity.
  • Use brevity penalty.
  • ParaEval
  • We have lots of paraphrases.
  • Single reference + paraphrases ≈ multiple
    verbalizations.
  • Correlate with Adequacy assessment only
  • Just want to see how much I got right.
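A toy sketch of paraphrase-supported recall against a single reference, reusing the hypothetical `paraphrase_table` format from the earlier sketch; real ParaEval matches multi-word phrases, not just single words.

```python
def paraphrase_recall(peer_tokens, ref_tokens, paraphrase_table):
    """Recall over the reference: the fraction of reference unigrams covered
    either by identical peer words or by a paraphrase of a peer word."""
    peer_vocab = set(peer_tokens)
    # Everything the peer can "express": its own words plus their paraphrases.
    expressible = set(peer_vocab)
    for word in peer_vocab:
        expressible |= paraphrase_table.get(word, set())

    covered = sum(1 for token in ref_tokens if token in expressible)
    return covered / len(ref_tokens) if ref_tokens else 0.0
```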

18
Using Single Reference
  • Lots of paraphrases for MT eval lead to:
  • Using a single reference translation is enough.
  • But make sure the single reference is good.

single ref!
19
Intermission
  • So far, we have talked about ParaEval for MT eval.
  • Next: ParaEval for summarization eval.
  • Shrinking a whole lot of information into 100 or
    200 words.
  • Evaluation
  • No sentence alignments.
  • Entire summaries as the limit.

20
Previous Summ. Eval Work
  • Manual annotations
  • SEE (Lin and Hovy, 2001): sentence-level,
    partial credit allowed.
  • Factoid (Van Halteren and Teufel, 2003).
  • Pyramid (Nenkova and Passonneau, 2004).
  • Automated
  • Matching based on lexical identity.
  • ROUGE (Lin and Hovy, 2003)
  • Summary units are of fixed length (unigrams,
    bigrams, etc.).
  • BE (Hovy et al., 2005)
  • Summary units of variable size, yet still
    meaningful.

nuggets
21
Human Evaluator
  • What do humans do naturally?
  • In text understanding and comparison.
  • The ability to recognize semantic equivalence:
    paraphrasing.

(Nenkova and Passonneau, 2004)
22
Comparison Strategy
  • Comparison strategy
  • Top: multi-word paraphrase match (optimal).
  • Paraphrase table.
  • Dynamic programming.
  • Counting unit: unigram.
  • Middle: 1-to-N, N-to-1, 1-to-1 paraphrase match
    (greedy).
  • Bottom: lexical n-gram match (unigram).
  • On text segments not consumed by paraphrase
    matching.
  • A ROUGE-1 guarantee (important).
  • Overall score: recall (see the sketch below).
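A simplified sketch of the lower two tiers (greedy paraphrase matching, then the lexical unigram fallback). The top tier, optimal multi-word matching via dynamic programming, is omitted for brevity, and the `paraphrase_table` format is again hypothetical.

```python
def tiered_summary_match(peer_tokens, ref_tokens, paraphrase_table):
    """Tiered comparison sketch: (1) greedily match reference words against
    peer paraphrases, (2) match leftover words by lexical identity (the
    ROUGE-1 fallback).  Returns unigram recall over the reference."""
    remaining_ref = list(ref_tokens)
    remaining_peer = list(peer_tokens)
    covered = 0

    # Middle tier (greedy paraphrase match): a peer word covers a reference
    # word if the reference word is among its known paraphrases.
    for p in list(remaining_peer):
        for r in list(remaining_ref):
            if r in paraphrase_table.get(p, set()):
                remaining_ref.remove(r)      # no double counting on either side
                remaining_peer.remove(p)
                covered += 1
                break

    # Bottom tier (lexical unigram match) on whatever neither side consumed.
    for r in list(remaining_ref):
        if r in remaining_peer:
            remaining_peer.remove(r)
            remaining_ref.remove(r)
            covered += 1

    return covered / len(ref_tokens) if ref_tokens else 0.0
```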

23
Complicated Process
24
Paraphrase Matching
  • Optimization problem
  • Favoring recall: content coverage.
  • Peer phrases are matched as well as possible with
    phrases from the reference:
  • No double counting on either side.

25
ParaEval-summ Validation
  • Document Understanding Conference (DUC)
  • Systems compete in summarization tasks.
  • NIST assessors make judgments.
  • Validation process
  • Grade each summary produced by each system.
  • Assign an overall score for each system.
  • Compare with humans' scores.
  • DUC-2003: 18 systems, 4 sets of refs, 30 doc
    sets.

26
Correlation
27
Detailed Comparison
28
Quality of the Paraphrases
  • How good or bad are the paraphrases?
  • We expect them to be noisy.
  • But we see they are effective in MT and summ
    evaluations.
  • Phrases matched in MT eval:
  • Matching is done at the sentence level.
  • Many words are matched by lexical identity.
  • This leaves little room for paraphrase mistakes.
  • Human assessment for phrases in summ eval
  • 128 pairs of paraphrases.
  • Sentences as contexts.
  • 3 judges.

29
Manual Results
  • Results
  • Precision: 68%.
  • Kappa: 0.582.
  • Difficult to judge:
  • Semantic equivalence.
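With three judges, Fleiss' kappa is one standard agreement statistic; the slides do not say which kappa variant was used, so this is only an illustrative sketch with a toy input format.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for inter-annotator agreement with a fixed number of
    raters per item.  `ratings` is a list of dicts: for each item, a mapping
    from category (e.g. 'paraphrase' / 'not-paraphrase') to the number of
    judges who chose it."""
    n_items = len(ratings)
    n_raters = sum(ratings[0].values())
    categories = {c for item in ratings for c in item}

    # Mean per-item observed agreement.
    p_bar = sum(
        (sum(c * c for c in item.values()) - n_raters) / (n_raters * (n_raters - 1))
        for item in ratings
    ) / n_items

    # Chance agreement from the overall category distribution.
    p_e = sum(
        (sum(item.get(cat, 0) for item in ratings) / (n_items * n_raters)) ** 2
        for cat in categories
    )
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 3 judges per pair, two categories.
pairs = [{"yes": 3}, {"yes": 2, "no": 1}, {"no": 3}, {"yes": 1, "no": 2}]
print(fleiss_kappa(pairs))  # ~0.33
```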

30
Conclusion and Future Work
  • Advancing the evaluations for MT and summ
  • By using paraphrases extracted with MT techniques.
  • Getting more textual content compared.
  • Evaluation metrics turn out to be good.
  • Future directions
  • Paraphrase induction for MT
  • I don't know this word/phrase, but I have been in
    similar situations before. What was the
    word/phrase I used that time?
  • Longer phrases. Possibly to the level of the
    Microsoft paraphrases.
  • Measuring word-alignment errors, leading to better
    paraphrases.
  • Text generation: recognizing more, and more
    precise, redundancy.

31
Thank you!
Questions?