Title: Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems
1. Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems
- Paul Over
- Retrieval Group
- Information Access Division
- National Institute of Standards and Technology
2. Document Understanding Conferences (DUC)
- Summarization has always been a major TIDES component
- An evaluation roadmap was completed in the summer of 2000 following the spring TIDES PI meeting
- DUC-2000 occurred in November 2000
  - research reports
  - planning for first evaluation using the roadmap
3. Summarization road map
- Specifies a series of annual cycles, with
  - progressively more demanding text data
  - both direct (intrinsic) and indirect (extrinsic, task-based) evaluations
  - increasing challenge in tasks
- Year 1 (September 2001)
  - Intrinsic evaluation of generic summaries
    - of newswire/paper stories
    - for single and multiple documents
    - with fixed lengths of 50, 100, 200, and 400 words
  - 60 sets of 10 documents used
    - 30 for training
    - 30 for test
4. DUC-2001 schedule
- Preliminary call out via ACL
  - over 55 responses
  - 25 groups signed up
- Creation/distribution of training and test data
  - 30 training reference sets released March 1
  - 30 test sets of documents released June 15
- System development
- System testing
- Evaluation at NIST
  - 15 sets of summaries submitted July 1
  - Human judgments of submissions at NIST July 9-31
- Analysis of results
- Discussion of results and plans
  - DUC-2001 at SIGIR in New Orleans Sept. 13-14
5. Goals of the talk
- Provide an overview of the
  - Data
  - Tasks
  - Evaluation
- Experience with implementing the evaluation procedure
  - Feedback from NIST assessors
- Introduce the results
  - Sanity checking the results and measures
  - Effect of reassessment with a different model summary (Phase 2)
- Emphasize
  - Exploratory data analysis
  - Attention to evaluation fundamentals over final conclusions
  - Improving future evaluations
6. The design
7. Data: formation of training/test document sets
- Each of 10 NIST information analysts chose one set of newswire/paper articles of each of the following types:
  - A single event with causes and consequences
  - Multiple distinct events of a single type
  - Subject (discuss a single subject)
  - One of the above in the domain of natural disasters
  - Biographical (discuss a single person)
  - Opinion (different opinions about the same subject)
- Each set contains about 10 documents (mean 10.2, std 2.1)
- All documents in a set to be mainly about a specific concept
8. Human summary creation
[Figure: workflow from documents to single-document summaries and to multi-document summaries of 400, 200, 100, and 50 words]
- A: Read hardcopy of the documents.
- B: Create a 100-word softcopy single-document summary for each document using the document author's perspective.
- C: Create a 400-word softcopy multi-document summary of all 10 documents, written as a report for a contemporary adult newspaper reader.
- D, E, F: Cut, paste, and reformulate to reduce the size of the multi-document summary by half at each step (400 to 200 to 100 to 50 words).
9. Training and test document sets
- For each of the 10 authors,
  - 3 docsets were chosen at random to be training sets
  - the 3 remaining sets were reserved for testing
- Counts of docsets by type:

                                  Training   Test
    Single event                         9      3
    Multiple events of same type         6     12
    Subject                              4      6
    Biographical                         7      3
    Opinion                              4      6
10. Example training and test document sets
- Assessor A
  - TR - D01 Clarence Thomas's nomination to the Supreme Court (11)
  - TR - D06 Police misconduct (16)
  - TR - D05 Mad cow disease (11)
  - TE - D04 Hurricane Andrew (11)
  - TE - D02 Rise and fall of Michael Milken (11)
  - TE - D03 Sununu resignation (11)
- Assessor B
  - TR - D09 America's response to the Iraqi invasion of Kuwait (16)
  - TE - D08 Solar eclipses (11)
  - TR - D07 Antarctica (9)
  - TE - D11 Tornadoes (8)
  - TR - D10 Robert Bork (12)
  - TE - D12 Welfare reform (8)
11. Automatic baselines
- NIST created 3 baselines automatically, based roughly on algorithms suggested by Daniel Marcu from earlier work
- Single-document summaries:
  - Take the first 100 words in the document.
- Multi-document summaries:
  - Take the first 50, 100, 200, or 400 words in the most recent document.
    - 23.3% of the 400-word summaries were shorter than the target.
  - Take the first sentence of the 1st, 2nd, 3rd, ... document in chronological sequence until you have the target summary size. Truncate the last sentence if the target size is exceeded.
    - 86.7% of the 400-word summaries and 10% of the 200-word summaries were shorter than the target.
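The extraction baselines above can be sketched as follows. This is a minimal illustration only: it assumes whitespace tokenization and a naive regex sentence splitter (NIST's actual splitter was tuned to the documents and submissions), and the function names are illustrative, not from the DUC tooling.

```python
import re

def first_n_words(text, n):
    """Baseline: take the first n words of a document."""
    return " ".join(text.split()[:n])

def sentences(text):
    """Naive sentence splitter (illustrative; the real one was tuned)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def lead_sentence_baseline(docs_in_date_order, target_words):
    """Baseline: take the first sentence of the 1st, 2nd, 3rd, ... document
    in chronological sequence until the target size is reached; truncate
    the last sentence if the target is exceeded.  With one sentence per
    document, the result can fall short of the target, as reported above."""
    out, count = [], 0
    for doc in docs_in_date_order:
        sents = sentences(doc)
        if not sents:
            continue
        words = sents[0].split()
        if count + len(words) > target_words:
            out.extend(words[:target_words - count])
            count = target_words
            break
        out.extend(words)
        count += len(words)
    return " ".join(out)
```

The word-truncation behavior of both baselines is what produces the sentence fragments discussed later in the talk.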
12. Submitted summaries

    Code   Group name             Multi-doc.   Single-doc.
    L      Columbia University           120         -----
    M      Cogentex                      112         -----
    N      USC/ISI Webclopedia           120         -----
    O      Univ. of Ottawa               120           307
    P      Univ. of Michigan             120           308
    Q      Univ. of Lethbridge         -----           308
    R      SUNY at Albany                120           308
    S      TNO/TPD                       118           308
    T      SMU                           120           307
    U      Rutgers Univ.                 120         -----
    V      NYU                         -----           308
    W      NSA                           120           279
    X      NIJL                        -----           296
    Y      USC/ISI                       120           308
    Z      Baldwin Lang. Tech.           120           308
    -----------------------------------------------------
    Total                               1430          3345
13. Evaluation basics
- Intrinsic evaluation by humans using a special version of SEE (thanks to Chin-Yew Lin, ISI)
- Compare
  - a model summary - authored by a human
  - a peer summary - system-created, baseline, or human
- Produce judgments of
  - Peer grammaticality, cohesion, organization
  - Coverage of each model unit by the peer (recall)
  - Characteristics of peer-only material
14. Phases: summary evaluation and evaluation of the evaluation
- Phase 1: Assessor judged peers against his/her own models.
- Phase 2: Assessor judged a subset of peers for a subset of docsets twice, against two other humans' summaries.
- Phase 3 (not implemented): 2 different assessors judge the same peers using the same models.
15. Models
- Source
  - Authored by a human
  - Phase 1: assessor is document selector and model author
  - Phase 2: assessor is neither document selector nor model author
- Formatting
  - Divided into model units (MUs) (EDUs - thanks to William Wong at ISI)
  - Lightly edited by authors to integrate uninterpretable fragments
  - Flowed together with HTML tags for SEE
16. Model editing: very limited
17. Peers
- Formatting
  - Divided into peer units (PUs)
    - simple automatically determined sentences
    - tuned slightly to documents and submissions
      - Abbreviations list
      - Submission ending most sentences with
      - Submission formatted as lists of titles
  - Flowed together with HTML tags for SEE
- 3 sources
  - Automatically generated by research systems
    - For single-document summaries: 5 randomly selected, common
    - No multi-document summaries for docset 31 (model error)
  - Automatically generated by baseline algorithms
  - Authored by a human other than the assessor
18. The implementation
19. Origins of the evaluation framework: SEE
- Evaluation framework builds on ISI work embodied in the original SEE software
- Challenges for DUC-2001:
  - Better explain the questions posed to the NIST assessors
  - Modify the software to reduce sources of error/distraction
  - Get agreement from the DUC program committee
- Three areas of assessment in SEE:
  - Overall peer quality
  - Per-unit content
  - Unmarked peer units
20. Overall peer quality: difficult to define operationally
- Grammaticality: Do the sentences, clauses, phrases, etc. follow the basic rules of English?
  - Don't worry here about style or the ideas.
  - Concentrate on grammar.
- Cohesion: Do the sentences fit in as they should with the surrounding sentences?
  - Don't worry about the overall structure of the ideas.
  - Concentrate on whether each sentence naturally follows the preceding one and leads into the next.
- Organization: Is the content expressed and arranged in an effective manner?
  - Concentrate here on the high-level arrangement of the ideas.
21. SEE: overall peer quality
22. Overall peer quality: assessor feedback
- How much should typos, truncated sentences, obvious junk characters, headlines vs. full sentences, etc. affect the grammaticality score?
- Hard to keep all three questions separate, especially cohesion and organization.
- The 5-value answer scale is OK.
- Good to be able to go back and change judgments for correctness and consistency.
- Need a rule for small and single-unit summaries: cohesion and organization as defined don't make much sense for these.
23. Counts of peer units (sentences) in submissions: widely variable
24. Grammaticality across all summaries
- Most scores relatively high
- System score range very wide
- Medians/means: Baselines < Systems < Humans
- But why are baselines (extractions) less than perfect?

               Mean   Std.
    Baseline   3.23   0.67
    System     3.53   0.75
    Human      3.79   0.52

Note: notches in the box plots indicate 95% confidence intervals around the median only if the sample is large (> 30) or has an approximately normal distribution.
25. Most baselines contained a sentence fragment
- Single-document summaries:
  - Take the first 100 words in the document.
    - 91.7% of these summaries ended with a sentence fragment.
- Multi-document summaries:
  - Take the first 50, 100, 200, or 400 words in the most recent document.
    - 87.5% of these summaries ended with a sentence fragment.
  - Take the first sentence of the 1st, 2nd, 3rd, ... document in chronological sequence until you have the target summary size. Truncate the last sentence if the target size is exceeded.
    - 69.2% of these summaries ended with a sentence fragment.
26. Grammaticality, singles vs. multis: single- vs. multi-document seems to have little effect
27. Grammaticality among multis: why more low scores for baseline 50s and human 400s?
28. Cohesion across all summaries: median baselines ≈ systems < humans
29. Cohesion, singles vs. multis
- Better results on singles than multis
- For singles, median baselines ≈ systems ≈ humans
30. Cohesion among multis: why more high-scoring system summaries in the 50s?
31. Organization across all summaries: median baselines > systems > humans
32. Organization, singles vs. multis
- Generally lower scores for multi-document summaries than single-document summaries
33. Organization among multis: why more high-scoring system summaries in the 50s? Why are human summaries worse for the 200s?
34. Cohesion vs. organization: any real difference for assessors? Why is organization ever higher than cohesion?
35. Per-unit content: evaluation details
- First, find all the peer units which tell you at least some of what the current model unit tells you, i.e., peer units which express at least some of the same facts as the current model unit. When you find such a PU, click on it to mark it.
- When you have marked all such PUs for the current MU, then think about the whole set of marked PUs and answer the question:
  - "The marked PUs, taken together, express [All, Most, Some, Hardly any, or None] of the meaning expressed by the current model unit."
36. SEE: per-unit content
37. Per-unit content: assessor feedback
- This is a laborious process and easy to get wrong: a loop within a loop.
- How to interpret fragments as units, e.g., a date standing alone?
- How much and what kind of information (e.g., from context) can/should you add to determine what a peer unit means?
- Criteria for marking a PU need to be clear - sharing of what?
  - Facts
  - Ideas
  - Meaning
  - Information
  - Reference
38. Per-unit content measures
- Recall
  - Average coverage - the average of the per-MU completeness judgments (0..4) for a peer summary
  - Recall at various threshold levels:
    - Recall4 = #MUs with all information covered / #MUs
    - Recall3 = #MUs with all/most information covered / #MUs
    - Recall2 = #MUs with all/most/some information covered / #MUs
    - Recall1 = #MUs with all/most/some/any information covered / #MUs
  - Weighted average?
- Precision problems
  - Peer summary lengths are fixed
  - Insensitive to
    - Duplicate information
    - Partially unused peer units
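The recall measures above follow directly from the per-MU judgments once the five answers are coded 0..4. A minimal sketch (the function name and numeric coding are illustrative assumptions, not the DUC tooling):

```python
def coverage_measures(mu_scores):
    """Per-unit content measures from per-MU completeness judgments,
    coded None=0, Hardly any=1, Some=2, Most=3, All=4.

    Returns average coverage and recall at each threshold t:
        Recall_t = (# MUs judged >= t) / (# MUs)
    so Recall4 counts fully covered MUs, Recall1 counts MUs with
    at least some ("hardly any" or better) information covered.
    """
    n = len(mu_scores)
    avg_coverage = sum(mu_scores) / n
    recalls = {t: sum(1 for s in mu_scores if s >= t) / n for t in (1, 2, 3, 4)}
    return avg_coverage, recalls
```

Note that, as the slide says, these are pure recall measures: a peer summary full of duplicated or unused material is not penalized, since summary lengths are fixed.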
39. Average coverage across all summaries
- Medians: baselines < systems < humans
- Lots of outliers
- Best system summaries approach, equal, or exceed human models
40. Average coverage, singles vs. multis
- Relatively lower baseline and system scores for multi-document summaries
41. Average coverage among multis: small improvement as size increases
42. Average coverage by system for singles
[Box plot: systems T, R, O, P, Q, W, Y, V, X, S, Z, alongside the baseline and human summaries]
43. Average coverage by system for multis
[Box plot: systems T, N, Y, L, P, S, M, R, O, Z, W, U, alongside the baselines and human summaries]
44. Average coverage by docset for 2 systems: averages hide lots of variation by docset/assessor
45. SEE: unmarked peer units
46. Unmarked peer units: evaluation details
- Think of 3 categories of unmarked PUs:
  - really should be in the model in place of something already there
  - not good enough to be in the model, but at least relevant to the model's subject
  - not even related to the model
- Answer the following question for each category:
  - "[All, Most, Some, Hardly any, or None] of the unmarked PUs belong in this category."
- Every PU should be accounted for in some category.
- If there are no unmarked PUs, then answer each question with "None".
- If there is only one unmarked PU, then the answers can only be "All" or "None".
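Constraints like these can be checked mechanically, which is what motivated the later feedback about correcting illogical results. A sketch of such a consistency check; the function name and exact rule set are illustrative assumptions, not SEE's actual implementation:

```python
def check_unmarked_answers(needed, relevant, unrelated, n_unmarked):
    """Flag illogical combinations of the three category answers
    ("All", "Most", "Some", "Hardly any", "None") given the number
    of unmarked PUs.  Returns a list of error messages (empty = OK)."""
    answers = [needed, relevant, unrelated]
    errors = []
    # No unmarked PUs: every question must be answered "None".
    if n_unmarked == 0 and any(a != "None" for a in answers):
        errors.append("No unmarked PUs: every answer must be 'None'.")
    # One unmarked PU: only "All" or "None" are possible answers.
    if n_unmarked == 1 and any(a not in ("All", "None") for a in answers):
        errors.append("One unmarked PU: answers can only be 'All' or 'None'.")
    # If one category holds all unmarked PUs, the others hold none.
    if "All" in answers and any(a != "None" for a in answers if a != "All"):
        errors.append("If one answer is 'All', the others must be 'None'.")
    return errors
```

A check along these lines would have caught the "one answered all, others not none" errors before assessment results were saved.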
47. Unmarked peer units: assessor feedback
- Many errors (illogical results) had to be corrected, e.g., if one question is answered "All", then the others must be answered "None".
- Allow identification of duplicate information in the peer.
- Very little peer material deserved to be in the model in place of something already there.
- Assessors were possessive of their model formulations.
48. Unmarked peer units: few extremely good or bad

                         Needed in model   Not needed, but relevant   Not relevant
    All (mean / median)       0.32 / 0           2.80 / 4               0.40 / 0
    Singles                   0.35               2.57                   0.33
    Multis                    0.29               3.06                   0.47
    50                        0.16               1.06                   0.55
    100                       0.25               3.03                   0.49
    200                       0.34               3.18                   0.50
    400                       0.41               3.27                   0.41
49. Phase 2: initial results
- Designed to gauge the effect of different models
- Restricted to multi-document summaries of sizes 50 and 200 words
- Assessor used 2 models created by other authors
- Within-assessor differences mostly very small
  - Mean: 0.020
  - Std: 0.55
- Still want to compare to the original judgments
50. Summing up
51. Summing up
- Overall peer quality
  - Grammaticality (especially the "All" choice) was too sensitive to low-level formatting.
  - Cohesion and organization, as defined, made little sense for very short summaries.
  - Cohesion was generally hard to distinguish from organization.
- For the future:
  - Address the dependence of grammaticality on low-level formatting
  - Reassess the operational definitions of cohesion/organization to better capture what researchers want to measure
52. Summing up
- Per-unit content (coverage)
  - Assessors were often in a quandary about how much information to bring to the interpretation of the summaries.
  - Even for simple sentences/EDUs, determining shared meaning was very hard.
- For the future:
  - Results seem to pass a sanity check?
  - Was the assessor time worth it in terms of what researchers can learn from the output?
53. Summing up
- Unmarked peers
  - Very little difference in the DUC 2001 summaries with respect to the quality of the unmarked (peer-only) material
- For the future:
  - DUC-2001 systems are not producing junk, so little will be unrelated.
  - Summary authors are not producing junk, so little will be good enough to replace what is there.
  - What, if anything, can be usefully measured with respect to peer-only material?
54. The End
55. Average coverage by docset type (confounded by docset and assessor/author): human and system summaries
56. Average coverage by docset type (confounded by docset and assessor/author): single- and multi-doc baselines