Title: Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems
1. Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems
- Paul Over
- Retrieval Group
- Information Access Division
- National Institute of Standards and Technology
2. Document Understanding Conferences (DUC)
- Summarization has always been a major TIDES component
- An evaluation roadmap was completed in the summer of 2000 following the spring TIDES PI meeting
- DUC-2000 occurred in November 2000
  - research reports
  - planning for first evaluation using the roadmap
3. Summarization road map
- Specifies a series of annual cycles, with
  - progressively more demanding text data
  - both direct (intrinsic) and indirect (extrinsic, task-based) evaluations
  - increasing challenge in tasks
- Year 1 (September 2001)
  - Intrinsic evaluation of generic summaries
    - of newswire/paper stories
    - for single and multiple documents
    - with fixed lengths of 50, 100, 200, and 400 words
  - 60 sets of 10 documents used
    - 30 for training
    - 30 for test
4. DUC-2001 schedule
- Preliminary call out via ACL
  - over 55 responses
  - 25 groups signed up
- Creation/distribution of training and test data
  - 30 training reference sets released March 1
  - 30 test sets of documents released June 15
- System development
- System testing
- Evaluation at NIST
  - 15 sets of summaries submitted July 1
  - Human judgments of submissions at NIST July 9-31
- Analysis of results
- Discussion of results and plans
  - DUC-2001 at SIGIR in New Orleans Sept. 13-14
5. Goals of the talk
- Provide an overview of the
  - Data
  - Tasks
  - Evaluation
- Experience with implementing the evaluation procedure
  - Feedback from NIST assessors
- Introduce the results
  - Sanity checking the results and measures
  - Effect of reassessment with a different model summary (Phase 2)
- Emphasize
  - Exploratory data analysis
  - Attention to evaluation fundamentals over final conclusions
  - Improving future evaluations
6. The design
7. Data: formation of training/test document sets
- Each of 10 NIST information analysts chose one set of newswire/paper articles of each of the following types:
  - A single event with causes and consequences
  - Multiple distinct events of a single type
  - Subject (discuss a single subject)
  - One of the above in the domain of natural disasters
  - Biographical (discuss a single person)
  - Opinion (different opinions about the same subject)
- Each set contains about 10 documents (mean 10.2, std 2.1)
- All documents in a set to be mainly about a specific concept
8. Human summary creation
[Figure: workflow from documents to single-document summaries and to multi-document summaries of 400, 200, 100, and 50 words]
- A: Read hardcopy of the documents.
- B: Create a 100-word softcopy single-document summary for each document using the document author's perspective.
- C: Create a 400-word softcopy multi-document summary of all 10 documents, written as a report for a contemporary adult newspaper reader.
- D, E, F: Cut, paste, and reformulate to reduce the size of the multi-document summary by half at each step (400 to 200 to 100 to 50 words).
9. Training and test document sets
- For each of the 10 authors,
  - 3 docsets were chosen at random to be training sets
  - the 3 remaining sets were reserved for testing
- Counts of docsets by type:

                                  Training   Test
    Single event                         9      3
    Multiple events of same type         6     12
    Subject                              4      6
    Biographical                         7      3
    Opinion                              4      6
10. Example training and test document sets
- Assessor A
  - TR - D01 Clarence Thomas's nomination to the Supreme Court (11)
  - TR - D06 Police misconduct (16)
  - TR - D05 Mad cow disease (11)
  - TE - D04 Hurricane Andrew (11)
  - TE - D02 Rise and fall of Michael Milken (11)
  - TE - D03 Sununu resignation (11)
- Assessor B
  - TR - D09 America's response to the Iraqi invasion of Kuwait (16)
  - TE - D08 Solar eclipses (11)
  - TR - D07 Antarctica (9)
  - TE - D11 Tornadoes (8)
  - TR - D10 Robert Bork (12)
  - TE - D12 Welfare reform (8)
11. Automatic baselines
- NIST created 3 baselines automatically, based roughly on algorithms suggested by Daniel Marcu from earlier work
- Single-document summaries:
  - Take the first 100 words in the document.
- Multi-document summaries:
  - Take the first 50, 100, 200, or 400 words in the most recent document.
    - 23.3% of the 400-word summaries were shorter than the target.
  - Take the first sentence of the 1st, 2nd, 3rd, ... document in chronological sequence until you have the target summary size. Truncate the last sentence if the target size is exceeded.
    - 86.7% of the 400-word summaries and 10% of the 200-word summaries were shorter than the target.
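The extraction baselines above can be sketched as follows. This is a minimal illustration only: it assumes whitespace tokenization and a naive regex sentence splitter (NIST's actual splitter was tuned to the documents and submissions), and the function names are illustrative, not from the DUC tooling.

```python
import re

def first_n_words(text, n):
    """Baseline: take the first n words of a document."""
    return " ".join(text.split()[:n])

def sentences(text):
    """Naive sentence splitter (illustrative; the real one was tuned)."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]

def lead_sentence_baseline(docs_in_date_order, target_words):
    """Baseline: take the first sentence of the 1st, 2nd, 3rd, ... document
    in chronological sequence until the target size is reached; truncate
    the last sentence if the target is exceeded.  With one sentence per
    document, the result can fall short of the target, as reported above."""
    out, count = [], 0
    for doc in docs_in_date_order:
        sents = sentences(doc)
        if not sents:
            continue
        words = sents[0].split()
        if count + len(words) > target_words:
            out.extend(words[:target_words - count])
            count = target_words
            break
        out.extend(words)
        count += len(words)
    return " ".join(out)
```

The word-truncation behavior of both baselines is what produces the sentence fragments discussed later in the talk.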
12. Submitted summaries

    Code   Group name             Multi-doc.   Single-doc.
    L      Columbia University           120         -----
    M      Cogentex                      112         -----
    N      USC/ISI Webclopedia           120         -----
    O      Univ. of Ottawa               120           307
    P      Univ. of Michigan             120           308
    Q      Univ. of Lethbridge         -----           308
    R      SUNY at Albany                120           308
    S      TNO/TPD                       118           308
    T      SMU                           120           307
    U      Rutgers Univ.                 120         -----
    V      NYU                         -----           308
    W      NSA                           120           279
    X      NIJL                        -----           296
    Y      USC/ISI                       120           308
    Z      Baldwin Lang. Tech.           120           308
    -----------------------------------------------------
    Total                               1430          3345
13. Evaluation basics
- Intrinsic evaluation by humans using a special version of SEE (thanks to Chin-Yew Lin, ISI)
- Compare
  - a model summary - authored by a human
  - a peer summary - system-created, baseline, or human
- Produce judgments of
  - Peer grammaticality, cohesion, organization
  - Coverage of each model unit by the peer (recall)
  - Characteristics of peer-only material
14. Phases: summary evaluation and evaluation of the evaluation
- Phase 1: Assessor judged peers against his/her own models.
- Phase 2: Assessor judged a subset of peers for a subset of docsets twice, against two other humans' summaries.
- Phase 3 (not implemented): 2 different assessors judge the same peers using the same models.
15. Models
- Source
  - Authored by a human
  - Phase 1: assessor is document selector and model author
  - Phase 2: assessor is neither document selector nor model author
- Formatting
  - Divided into model units (MUs) (EDUs - thanks to William Wong at ISI)
  - Lightly edited by authors to integrate uninterpretable fragments
  - Flowed together with HTML tags for SEE
16. Model editing: very limited
17. Peers
- Formatting
  - Divided into peer units (PUs)
    - simple automatically determined sentences
    - tuned slightly to documents and submissions
      - Abbreviations list
      - Submission ending most sentences with
      - Submission formatted as lists of titles
  - Flowed together with HTML tags for SEE
- 3 sources
  - Automatically generated by research systems
    - For single-document summaries: 5 randomly selected, common
    - No multi-document summaries for docset 31 (model error)
  - Automatically generated by baseline algorithms
  - Authored by a human other than the assessor
18. The implementation
19. Origins of the evaluation framework: SEE
- Evaluation framework builds on ISI work embodied in the original SEE software
- Challenges for DUC-2001:
  - Better explain the questions posed to the NIST assessors
  - Modify the software to reduce sources of error/distraction
  - Get agreement from the DUC program committee
- Three areas of assessment in SEE:
  - Overall peer quality
  - Per-unit content
  - Unmarked peer units
20. Overall peer quality: difficult to define operationally
- Grammaticality: Do the sentences, clauses, phrases, etc. follow the basic rules of English?
  - Don't worry here about style or the ideas.
  - Concentrate on grammar.
- Cohesion: Do the sentences fit in as they should with the surrounding sentences?
  - Don't worry about the overall structure of the ideas.
  - Concentrate on whether each sentence naturally follows the preceding one and leads into the next.
- Organization: Is the content expressed and arranged in an effective manner?
  - Concentrate here on the high-level arrangement of the ideas.
21. SEE: overall peer quality
22. Overall peer quality: assessor feedback
- How much should typos, truncated sentences, obvious junk characters, headlines vs. full sentences, etc. affect the grammaticality score?
- Hard to keep all three questions separate, especially cohesion and organization.
- The 5-value answer scale is OK.
- Good to be able to go back and change judgments for correctness and consistency.
- Need a rule for small and single-unit summaries: cohesion and organization as defined don't make much sense for these.
23. Counts of peer units (sentences) in submissions: widely variable
24. Grammaticality across all summaries
- Most scores relatively high
- System score range very wide
- Medians/means: Baselines < Systems < Humans
- But why are baselines (extractions) less than perfect?

               Mean   Std.
    Baseline   3.23   0.67
    System     3.53   0.75
    Human      3.79   0.52

Note: notches in the box plots indicate 95% confidence intervals around the median only if the sample is large (> 30) or has an approximately normal distribution.
25. Most baselines contained a sentence fragment
- Single-document summaries:
  - Take the first 100 words in the document.
    - 91.7% of these summaries ended with a sentence fragment.
- Multi-document summaries:
  - Take the first 50, 100, 200, or 400 words in the most recent document.
    - 87.5% of these summaries ended with a sentence fragment.
  - Take the first sentence of the 1st, 2nd, 3rd, ... document in chronological sequence until you have the target summary size. Truncate the last sentence if the target size is exceeded.
    - 69.2% of these summaries ended with a sentence fragment.
26. Grammaticality, singles vs. multis: single- vs. multi-document seems to have little effect
27. Grammaticality among multis: why more low scores for baseline 50s and human 400s?
28. Cohesion across all summaries: median baselines ≈ systems < humans
29. Cohesion, singles vs. multis
- Better results on singles than multis
- For singles, median baselines ≈ systems ≈ humans
30. Cohesion among multis: why more high-scoring system summaries in the 50s?
31. Organization across all summaries: median baselines > systems > humans
32. Organization, singles vs. multis
- Generally lower scores for multi-document summaries than single-document summaries
33. Organization among multis: why more high-scoring system summaries in the 50s? Why are human summaries worse for the 200s?
34. Cohesion vs. organization: any real difference for assessors? Why is organization ever higher than cohesion?
35. Per-unit content: evaluation details
- First, find all the peer units which tell you at least some of what the current model unit tells you, i.e., peer units which express at least some of the same facts as the current model unit. When you find such a PU, click on it to mark it.
- When you have marked all such PUs for the current MU, then think about the whole set of marked PUs and answer the question:
  - "The marked PUs, taken together, express [All, Most, Some, Hardly any, or None] of the meaning expressed by the current model unit."
36. SEE: per-unit content
37. Per-unit content: assessor feedback
- This is a laborious process and easy to get wrong: a loop within a loop.
- How to interpret fragments as units, e.g., a date standing alone?
- How much and what kind of information (e.g., from context) can/should you add to determine what a peer unit means?
- Criteria for marking a PU need to be clear - sharing of what?
  - Facts
  - Ideas
  - Meaning
  - Information
  - Reference
38. Per-unit content measures
- Recall
  - Average coverage - the average of the per-MU completeness judgments (0..4) for a peer summary
  - Recall at various threshold levels:
    - Recall4 = #MUs with all information covered / #MUs
    - Recall3 = #MUs with all/most information covered / #MUs
    - Recall2 = #MUs with all/most/some information covered / #MUs
    - Recall1 = #MUs with all/most/some/any information covered / #MUs
  - Weighted average?
- Precision problems
  - Peer summary lengths are fixed
  - Insensitive to
    - Duplicate information
    - Partially unused peer units
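The recall measures above follow directly from the per-MU judgments once the five answers are coded 0..4. A minimal sketch (the function name and numeric coding are illustrative assumptions, not the DUC tooling):

```python
def coverage_measures(mu_scores):
    """Per-unit content measures from per-MU completeness judgments,
    coded None=0, Hardly any=1, Some=2, Most=3, All=4.

    Returns average coverage and recall at each threshold t:
        Recall_t = (# MUs judged >= t) / (# MUs)
    so Recall4 counts fully covered MUs, Recall1 counts MUs with
    at least some ("hardly any" or better) information covered.
    """
    n = len(mu_scores)
    avg_coverage = sum(mu_scores) / n
    recalls = {t: sum(1 for s in mu_scores if s >= t) / n for t in (1, 2, 3, 4)}
    return avg_coverage, recalls
```

Note that, as the slide says, these are pure recall measures: a peer summary full of duplicated or unused material is not penalized, since summary lengths are fixed.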
39. Average coverage across all summaries
- Medians: baselines < systems < humans
- Lots of outliers
- Best system summaries approach, equal, or exceed human models
40. Average coverage, singles vs. multis
- Relatively lower baseline and system scores for multi-document summaries
41. Average coverage among multis: small improvement as size increases
42. Average coverage by system for singles
[Box plot: systems T, R, O, P, Q, W, Y, V, X, S, Z, alongside the baseline and human summaries]
43. Average coverage by system for multis
[Box plot: systems T, N, Y, L, P, S, M, R, O, Z, W, U, alongside the baselines and human summaries]
44. Average coverage by docset for 2 systems: averages hide lots of variation by docset/assessor
45. SEE: unmarked peer units
46. Unmarked peer units: evaluation details
- Think of 3 categories of unmarked PUs:
  - really should be in the model in place of something already there
  - not good enough to be in the model, but at least relevant to the model's subject
  - not even related to the model
- Answer the following question for each category:
  - "[All, Most, Some, Hardly any, or None] of the unmarked PUs belong in this category."
- Every PU should be accounted for in some category.
- If there are no unmarked PUs, then answer each question with "None".
- If there is only one unmarked PU, then the answers can only be "All" or "None".
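Constraints like these can be checked mechanically, which is what motivated the later feedback about correcting illogical results. A sketch of such a consistency check; the function name and exact rule set are illustrative assumptions, not SEE's actual implementation:

```python
def check_unmarked_answers(needed, relevant, unrelated, n_unmarked):
    """Flag illogical combinations of the three category answers
    ("All", "Most", "Some", "Hardly any", "None") given the number
    of unmarked PUs.  Returns a list of error messages (empty = OK)."""
    answers = [needed, relevant, unrelated]
    errors = []
    # No unmarked PUs: every question must be answered "None".
    if n_unmarked == 0 and any(a != "None" for a in answers):
        errors.append("No unmarked PUs: every answer must be 'None'.")
    # One unmarked PU: only "All" or "None" are possible answers.
    if n_unmarked == 1 and any(a not in ("All", "None") for a in answers):
        errors.append("One unmarked PU: answers can only be 'All' or 'None'.")
    # If one category holds all unmarked PUs, the others hold none.
    if "All" in answers and any(a != "None" for a in answers if a != "All"):
        errors.append("If one answer is 'All', the others must be 'None'.")
    return errors
```

A check along these lines would have caught the "one answered all, others not none" errors before assessment results were saved.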
47. Unmarked peer units: assessor feedback
- Many errors (illogical results) had to be corrected, e.g., if one question is answered "All", then the others must be answered "None".
- Allow identification of duplicate information in the peer.
- Very little peer material deserved to be in the model in place of something already there.
- Assessors were possessive of their model formulations.
48. Unmarked peer units: few extremely good or bad

                         Needed in model   Not needed, but relevant   Not relevant
    All (mean / median)       0.32 / 0           2.80 / 4               0.40 / 0
    Singles                   0.35               2.57                   0.33
    Multis                    0.29               3.06                   0.47
    50                        0.16               1.06                   0.55
    100                       0.25               3.03                   0.49
    200                       0.34               3.18                   0.50
    400                       0.41               3.27                   0.41
49. Phase 2: initial results
- Designed to gauge the effect of different models
- Restricted to multi-document summaries of sizes 50 and 200 words
- Assessor used 2 models created by other authors
- Within-assessor differences mostly very small
  - Mean: 0.020
  - Std: 0.55
- Still want to compare to the original judgments
50. Summing up
51. Summing up
- Overall peer quality
  - Grammaticality (especially the "All" choice) was too sensitive to low-level formatting.
  - Cohesion and organization, as defined, made little sense for very short summaries.
  - Cohesion was generally hard to distinguish from organization.
- For the future:
  - Address the dependence of grammaticality on low-level formatting
  - Reassess the operational definitions of cohesion/organization to better capture what researchers want to measure
52. Summing up
- Per-unit content (coverage)
  - Assessors were often in a quandary about how much information to bring to the interpretation of the summaries.
  - Even for simple sentences/EDUs, determining shared meaning was very hard.
- For the future:
  - Results seem to pass a sanity check?
  - Was the assessor time worth it in terms of what researchers can learn from the output?
53. Summing up
- Unmarked peers
  - Very little difference in the DUC 2001 summaries with respect to the quality of the unmarked (peer-only) material
- For the future:
  - DUC-2001 systems are not producing junk, so little will be unrelated.
  - Summary authors are not producing junk, so little will be good enough to replace what is there.
  - What, if anything, can be usefully measured with respect to peer-only material?
54. The End
55. Average coverage by docset type (confounded by docset and assessor/author): human and system summaries
56. Average coverage by docset type (confounded by docset and assessor/author): single- and multi-doc baselines