1
Automatic Evaluation of Summaries Using N-gram
Co-Occurrence Statistics
  • By Chin-Yew Lin and Eduard Hovy

2
The Document Understanding Conference
  • In 2002 there were two main tasks:
  • Summarization of single documents
  • Summarization of multiple documents

3
DUC Single Document Summarization
  • Summarization of single documents
  • Generate a 100-word summary
  • Training: 30 sets of 10 documents, each with
    100-word summaries
  • Test against 30 unseen documents

4
DUC Multi-Document Summarization
  • Summarization of multiple documents about a
    single subject
  • Generate 50-, 100-, 200-, and 400-word summaries
  • Four types: single natural disaster, single
    event, multiple instances of a type of event,
    information about an individual
  • Training: 30 sets of 10 documents with their
    50-, 100-, 200-, and 400-word summaries
  • Test: 30 unseen documents

5
DUC Evaluation Material
  • For each document set, one human summary was
    created as the ideal summary for each length
  • Two additional human summaries were created at
    each length
  • Baseline summaries were created automatically for
    each length as reference points
  • Lead baseline: took the first n words of the last
    document (for the multi-document task)
  • Coverage baseline: used the first sentence of each
    document until the target length was reached (see
    the sketch after this list)
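
The two baselines can be pictured with a minimal Python sketch; the whitespace tokenization and period-based sentence splitting are simplifying assumptions for illustration, not the actual DUC tooling:

```python
def lead_baseline(documents, length):
    """Lead baseline: the first `length` words of the last document."""
    words = documents[-1].split()
    return " ".join(words[:length])

def coverage_baseline(documents, length):
    """Coverage baseline: take the first sentence of each document
    until the target word length is reached."""
    summary_words = []
    for doc in documents:
        first_sentence = doc.split(".")[0].strip()  # naive sentence split
        summary_words.extend(first_sentence.split())
        if len(summary_words) >= length:
            break
    return " ".join(summary_words[:length])
```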

6
SEE: Summary Evaluation Environment
  • A tool that allows assessors to compare system
    text (peer) with ideal text (model)
  • Can rank quality and content
  • The assessor marks each system unit sharing content
    with the model as all, most, some, or hardly any
  • The assessor rates the quality of grammaticality,
    cohesion, and coherence as all, most, some, hardly
    any, or none

7
SEE interface
8
Making a Judgement
  • From Chin-Yew Lin / MT Summit IX, 2003-09-27

9
Evaluation Metrics
  • One idea is simple sentence recall, but it cannot
    differentiate system performance (it pays to be
    overproductive)
  • Recall is measured relative to the model text
  • E is the average of the coverage scores (see the
    formulas after this list)
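
The two quantities on this slide can be written out. This is a reconstruction from the slide's wording rather than the original formulas: N is the number of judged units and c_i the coverage score assigned to unit i (the notation is assumed).

```latex
\text{Recall} = \frac{\text{number of model sentences matched in the peer summary}}
                     {\text{number of sentences in the model summary}},
\qquad
E = \frac{1}{N}\sum_{i=1}^{N} c_i
```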

10
Machine Translation and Summarization Evaluation
  • Machine Translation
  • Inputs
  • Reference translation
  • Candidate translation
  • Methods
  • Manually compare two translations in
  • Accuracy
  • Fluency
  • Informativeness
  • Auto evaluation using
  • Blue/NIST scores
  • Auto Summarization
  • Inputs
  • Reference summary
  • Candidate summary
  • Methods
  • Manually compare two summaries in
  • Content Overlap
  • Linguistic Qualities
  • Auto Evaluation ?
  • ?

11
NIST BLEU
  • Goal: measure the translation closeness between a
    candidate translation and a set of reference
    translations with a numeric metric
  • Method: use a weighted average of variable-length
    n-gram matches between the system translation and
    the set of human reference translations
  • BLEU correlates highly with human assessments
  • Would like to make the same assumption for
    summaries: the closer a summary is to a
    professional summary, the better it is

12
BLEU
  • A promising automatic scoring metric for
    summary evaluation
  • Basically a precision metric
  • Measures how well a candidate (peer) overlaps a
    model using n-gram co-occurrence statistics
  • Uses a Brevity Penalty (BP) to prevent short
    translations that try to maximize their precision
    score
  • In the formulas below, c is the candidate length
    and r is the reference length
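
The formulas referred to here are the standard BLEU definitions (Papineni et al., 2002), reproduced in the slide's c/r notation; p_n is the modified n-gram precision and w_n the n-gram weights (uniform in the original formulation):

```latex
BP =
\begin{cases}
1 & \text{if } c > r \\
e^{\,1 - r/c} & \text{if } c \le r
\end{cases}
\qquad
\mathrm{BLEU} = BP \cdot \exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right)
```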

13
Anatomy of BLEU Matching Score
  • From Chin-Yew Lin / MT Summit IX, 2003-09-27

14
ROUGE: Recall-Oriented Understudy for Gisting
Evaluation
  • From Chin-Yew Lin / MT Summit IX, 2003-09-27
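
For orientation, the recall-oriented n-gram score that ROUGE builds on counts matches relative to the reference side. This is the standard ROUGE-N formula as later published by Lin, not necessarily the exact formula shown on this slide:

```latex
\mathrm{ROUGE\text{-}N} =
\frac{\sum_{S \in \{\text{Reference summaries}\}} \; \sum_{gram_n \in S} \mathrm{Count}_{match}(gram_n)}
     {\sum_{S \in \{\text{Reference summaries}\}} \; \sum_{gram_n \in S} \mathrm{Count}(gram_n)}
```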

15
What makes a good metric?
  • Automatic evaluation should correlate highly,
    positively, and consistently with human
    assessments (see the correlation sketch after
    this list)
  • If a human recognizes a good system, so will the
    metric
  • The statistical significance of automatic
    evaluations should be a good predictor of the
    statistical significance of human assessments,
    with high reliability
  • The automatic evaluation can then be used to
    assist in system development in place of humans
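
A minimal sketch of such a correlation check, assuming paired per-system scores; the variable names and numbers are illustrative only, not data from the paper:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical per-system scores (illustrative values, not real results).
human_scores = [0.31, 0.42, 0.38, 0.55, 0.47]      # e.g. mean coverage E
automatic_scores = [0.28, 0.40, 0.35, 0.52, 0.49]  # e.g. Ngram(1,1) score

# Pearson measures linear correlation; Spearman measures rank agreement.
r, r_p = pearsonr(human_scores, automatic_scores)
rho, rho_p = spearmanr(human_scores, automatic_scores)

print(f"Pearson r = {r:.3f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
```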

16
ROUGE vs. BLEU
  • ROUGE: recall based
  • Separately evaluates 1-, 2-, 3-, and 4-grams
  • No length penalty
  • Verified for extractive summaries
  • Focuses on content overlap
  • BLEU: precision based
  • Mixed n-grams
  • Uses a brevity penalty to penalize system
    translations that are shorter than the average
    reference length
  • Favors longer n-grams for grammaticality or word
    order (the contrast is sketched in code below)
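
A minimal Python sketch of the contrast at the unigram level; the whitespace tokenization and toy sentences are illustrative assumptions, not the metrics' official implementations:

```python
import math
from collections import Counter

def unigram_recall(candidate, reference):
    """ROUGE-1-style score: matched unigrams / unigrams in the reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matched = sum(min(cand[w], ref[w]) for w in ref)
    return matched / max(sum(ref.values()), 1)

def unigram_precision_with_bp(candidate, reference):
    """BLEU-1-style score: clipped matches / candidate length, scaled by a
    brevity penalty when the candidate is shorter than the reference."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    matched = sum(min(cand[w], ref[w]) for w in cand)
    c, r = sum(cand.values()), sum(ref.values())
    precision = matched / max(c, 1)
    bp = 1.0 if c > r else math.exp(1 - r / max(c, 1))
    return bp * precision

reference = "the cat sat on the mat"
candidate = "the cat sat"
print(unigram_recall(candidate, reference))             # recall penalizes missing content
print(unigram_precision_with_bp(candidate, reference))  # BP penalizes the short candidate
```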

17
By all measures
18
Findings
  • Ngram(1,4) is a weighted variable-length n-gram
    match score similar to BLEU (see the sketch after
    this list)
  • Simple unigram, Ngram(1,1), and bigram, Ngram(2,2),
    scores consistently outperformed Ngram(1,4) in the
    single- and multiple-document tasks when stopwords
    are ignored
  • Weighted-average n-gram scores fall between the
    bigram and trigram scores, suggesting summaries are
    over-penalized by the weighted average due to a
    lack of longer n-gram matches
  • Excluding stopwords when computing n-gram
    statistics generally achieves better correlation
    than including them
  • Ngram(1,1) and Ngram(2,2) are good automatic
    scoring metrics based on statistical predictive
    power
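
A sketch of what an Ngram(1,4)-style score might look like, computed as a geometric mean of per-length n-gram recall against the model summary. The uniform weighting, stopword list, and tokenization are assumptions made for illustration and may differ from the paper's exact formulation:

```python
import math
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "to", "in", "and", "on", "by", "were"}  # illustrative list

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def ngram_recall(peer_tokens, model_tokens, n):
    """Fraction of the model's n-grams that also occur in the peer."""
    peer, model = ngrams(peer_tokens, n), ngrams(model_tokens, n)
    matched = sum(min(peer[g], model[g]) for g in model)
    total = sum(model.values())
    return matched / total if total else 0.0

def ngram_score(peer, model, i=1, j=4, drop_stopwords=True):
    """Ngram(i, j)-style score: geometric mean of n-gram recall for n = i..j."""
    tokenize = lambda text: [w for w in text.lower().split()
                             if not (drop_stopwords and w in STOPWORDS)]
    peer_toks, model_toks = tokenize(peer), tokenize(model)
    scores = [ngram_recall(peer_toks, model_toks, n) for n in range(i, j + 1)]
    if any(s == 0.0 for s in scores):
        return 0.0  # the geometric mean collapses if any length has no match
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

model = "wild fires destroyed hundreds of homes in southern california"
peer = "hundreds of homes were destroyed by wild fires in california"
print(ngram_score(peer, model, 1, 1))  # Ngram(1,1): unigram recall only
print(ngram_score(peer, model, 2, 2))  # Ngram(2,2): bigram recall only
print(ngram_score(peer, model, 1, 4))  # Ngram(1,4): collapses for lack of long matches
```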