Introduction to DUC-2001: an Intrinsic Evaluation of Generic News Text Summarization Systems

1
Introduction to DUC-2001: an Intrinsic
Evaluation of Generic News Text Summarization
Systems
  • Paul Over
  • Retrieval Group
  • Information Access Division
  • National Institute of Standards and Technology

2
Document Understanding Conferences (DUC)
  • Summarization has always been a major TIDES
    component
  • An evaluation roadmap was completed in the summer
    of 2000 following the spring TIDES PI meeting
  • DUC-2000 occurred in November 2000
  • research reports
  • planning for first evaluation using the roadmap

3
Summarization road map
  • Specifies a series of annual cycles, with
  • progressively more demanding text data
  • both direct (intrinsic) and indirect (extrinsic,
    task-based) evaluations
  • increasing challenge in tasks
  • Year 1 (September 2001)
  • Intrinsic evaluation of generic summaries,
  • of newswire/paper stories
  • for single and multiple documents
  • with fixed lengths of 50, 100, 200, and 400 words
  • 60 sets of 10 documents used
  • 30 for training
  • 30 for test

4
DUC-2001 schedule
  • Preliminary call out via ACL
  • over 55 responses
  • 25 groups signed up
  • Creation/Distribution of training and test data
  • 30 training reference sets released March 1
  • 30 test sets of documents released June 15
  • System development
  • System testing
  • Evaluation at NIST
  • 15 sets of summaries submitted July 1
  • Human judgments of submissions at NIST July
    9-31
  • Analysis of results
  • Discussion of results and plans
  • DUC-2001 at SIGIR in New Orleans Sept. 13-14

5
Goals of the talk
  • Provide an overview of the
  • Data
  • Tasks
  • Evaluation
  • Experience with implementing the evaluation
    procedure
  • Feedback from NIST assessors
  • Introduce the results
  • Sanity checking the results and measures
  • Effect of reassessment with a different model
    summary (Phase 2)
  • Emphasize
  • Exploratory data analysis
  • Attention to evaluation fundamentals over final
    conclusions
  • Improving future evaluations

6
The design
7
Data: Formation of training/test document sets
  • Each of 10 NIST information analysts chose one
    set of newswire/paper articles of each of the
    following types
  • A single event with causes and consequences
  • Multiple distinct events of a single type
  • Subject (discuss a single subject)
  • One of the above in the domain of natural
    disasters
  • Biographical (discuss a single person)
  • Opinion (different opinions about the same
    subject)
  • Each set contains about 10 documents (mean 10.2,
    std 2.1)
  • All documents in a set to be mainly about a
    specific concept

8
Human summary creation
[Diagram: creation of single-document and multi-document summaries from each set of documents]
A: Read hardcopy of the documents.
B: Create a 100-word softcopy summary for each document, using the document author's perspective.
C: Create a 400-word softcopy multi-document summary of all 10 documents, written as a report for a contemporary adult newspaper reader.
D, E, F: Cut, paste, and reformulate to reduce the size of the multi-document summary by half at each step (400 -> 200 -> 100 -> 50 words).
9
Training and test document sets
  • For each of the 10 authors,
  • 3 docsets were chosen at random to be training
    sets
    the 3 remaining sets were reserved for testing
    (see the sketch after the table below)
  • Counts of docsets by type

                                 Training   Test
  Single event                        9        3
  Multiple events of same type        6       12
  Subject                             4        6
  Biographical                        7        3
  Opinion                             4        6
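
A minimal sketch of this per-author split, assuming the docsets are available as a mapping from author to docset identifiers; the random-selection code here is illustrative, not the actual NIST procedure.

```python
import random

def split_docsets(docsets_by_author, seed=2001):
    """For each author, choose 3 docsets at random for training and
    reserve the remaining docsets for testing (illustrative only)."""
    rng = random.Random(seed)
    training, test = {}, {}
    for author, docsets in docsets_by_author.items():
        train = rng.sample(docsets, 3)       # 3 training docsets per author
        training[author] = train
        test[author] = [d for d in docsets if d not in train]
    return training, test
```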
10
Example training and test document sets
  • Assessor A
  • TR - D01 Clarence Thomas's nomination to the
    Supreme Court 11
  • TR - D06 Police misconduct 16
  • TR - D05 Mad cow disease 11
  • 4/1. TE - D04 Hurricane Andrew 11
  • 5. TE - D02 Rise and fall of Michael Milken
    11
  • 6. TE - D03 Sununu resignation 11
  • Assessor B
  • TR - D09 America's response to the Iraqi
    invasion of Kuwait 16
  • TE - D08 Solar eclipses 11
  • TR - D07 Antarctica 9
  • 4/2. TE - D11 Tornadoes 8
  • 5. TR - D10 Robert Bork 12
  • 6. TE - D12 Welfare reform 8

11
Automatic baselines
  • NIST created 3 baselines automatically based
    roughly on algorithms suggested by Daniel Marcu
    from earlier work
  • Single-document summaries
  • Take the first 100 words in the document
  • Multi-document summaries
  • Take the first 50, 100, 200, 400 words in the
    most recent document.
  • 23.3% of the 400-word summaries were shorter than
    the target.
  • Take the first sentence in the 1st, 2nd, 3rd,
    document in chronological sequence until you have
    the target summary size. Truncate the last
    sentence if target size is exceeded.
  • 86.7% of the 400-word summaries and 10% of the
    200-word summaries were shorter than the target
    (see the sketch below).
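
A minimal sketch of these baselines, assuming each document is a plain-text string and the multi-document sets are ordered chronologically; the naive sentence splitting and function names are illustrative, not NIST's actual implementation. The first multi-document baseline is simply the first-words rule applied to the most recent document.

```python
def first_words_baseline(doc, n_words=100):
    """Single-document baseline: the first n_words words of the document."""
    return " ".join(doc.split()[:n_words])


def lead_sentence_baseline(docs_in_date_order, target_words):
    """Multi-document baseline: take the first sentence of the 1st, 2nd,
    3rd, ... document until the target size is reached; truncate the last
    sentence if the target is exceeded.  (When the documents run out first,
    the summary is shorter than the target.)"""
    summary_words = []
    for doc in docs_in_date_order:
        first_sentence = doc.split(".")[0] + "."      # naive sentence boundary
        words = first_sentence.split()
        room = target_words - len(summary_words)
        summary_words.extend(words[:room])            # truncate at the target
        if len(summary_words) >= target_words:
            break
    return " ".join(summary_words)
```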

12
Submitted summaries
  Code  Group name             Multi-doc.  Single-doc.
  L     Columbia University        120        -----
  M     Cogentex                   112        -----
  N     USC/ISI Webclopedia        120        -----
  O     Univ. of Ottawa            120         307
  P     Univ. of Michigan          120         308
  Q     Univ. of Lethbridge       -----        308
  R     SUNY at Albany             120         308
  S     TNO/TPD                    118         308
  T     SMU                        120         307
  U     Rutgers Univ.              120        -----
  V     NYU                       -----        308
  W     NSA                        120         279
  X     NIJL                      -----        296
  Y     USC/ISI                    120         308
  Z     Baldwin Lang. Tech.        120         308
                                  -----       -----
  Total                           1430        3345

13
Evaluation basics
  • Intrinsic evaluation by humans using special
    version of SEE (thanks to Chin-Yew Lin, ISI)
  • Compare
  • a model summary - authored by a human
  • a peer summary - system-created, baseline, or
    human
  • Produce judgments of
  • Peer grammaticality, cohesion, organization
  • Coverage of each model unit by the peer (recall)
  • Characteristics of peer-only material

14
Phases: Summary evaluation and evaluation of the
evaluation
  • Phase 1: Assessor judged peers against his/her
    own models.
  • Phase 2: Assessor judged a subset of peers for a
    subset of docsets twice, against two other
    humans' summaries.
  • Phase 3 (not implemented): 2 different assessors
    judge the same peers using the same models.

15
Models
  • Source
  • Authored by a human
  • Phase 1: assessor is document selector and
    model author
  • Phase 2: assessor is neither document selector
    nor model author
  • Formatting
  • Divided into model units (MUs) (EDUs - thanks to
    William Wong at ISI)
  • Lightly edited by authors to integrate
    uninterpretable fragments
  • Flowed together with HTML tags for SEE

16
Model editing: very limited
17
Peers
  • Formatting
  • Divided into peer units (PUs)
  • simple automatically determined sentences (see
    the sketch below)
  • tuned slightly to documents and submissions
  • Abbreviations list
  • Submission ending most sentences with
  • Submission formatted as lists of titles
  • Flowed together with HTML tags for SEE
  • 3 Sources
  • Automatically generated by research systems
  • For single-document summaries: 5 randomly
    selected, common
  • No multi-document summaries for docset 31 (model
    error)
  • Automatically generated by baseline algorithms
  • Authored by a human other than the assessor
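
A rough sketch of what "simple automatically determined sentences" tuned with an abbreviations list might look like; the regular expression and abbreviation set below are invented for illustration and are not the actual NIST rules.

```python
import re

# Illustrative abbreviation list; the real list was tuned to the
# DUC documents and submissions.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Gen.", "U.S.", "Jan.", "Co.", "Inc."}

def split_into_peer_units(text):
    """Split a peer summary into sentence-like peer units (PUs),
    avoiding breaks right after known abbreviations."""
    # Candidate boundaries: sentence-final punctuation followed by whitespace.
    pieces = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]
    units, current = [], ""
    for piece in pieces:
        current = (current + " " + piece) if current else piece
        if current.split()[-1] in ABBREVIATIONS:
            continue                      # not a real boundary; keep going
        units.append(current)
        current = ""
    if current:                           # trailing fragment without a boundary
        units.append(current)
    return units
```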

18
The implementation
19
Origins of the evaluation framework: SEE
  • Evaluation framework builds on ISI work embodied
    in original SEE software
  • Challenges for DUC-2001
  • Better explain questions posed to the NIST
    assessors
  • Modify the software to reduce sources of
    error/distraction
  • Get agreement from DUC program committee
  • Three areas of assessment in SEE
  • Overall peer quality
  • Per-unit content
  • Unmarked peer units

20
Overall peer quality: Difficult to define
operationally
  • Grammaticality: Do the sentences, clauses,
    phrases, etc. follow the basic rules of English?
  • Don't worry here about style or the ideas.
  • Concentrate on grammar.
  • Cohesion: Do the sentences fit in as they
    should with the surrounding sentences?
  • Don't worry about the overall structure of the
    ideas.
  • Concentrate on whether each sentence naturally
    follows the preceding one and leads into the
    next.
  • Organization: Is the content expressed and
    arranged in an effective manner?
  • Concentrate here on the high-level arrangement of
    the ideas.

21
SEE: overall peer quality
22
Overall peer quality: assessor feedback
  • How much should typos, truncated sentences,
    obvious junk characters, headlines vs. full
    sentences, etc. affect grammaticality score?
  • Hard to keep all three questions separate,
    especially cohesion and organization.
  • The 5-value answer scale is OK.
  • Good to be able to go back and change judgments
    for correctness and consistency.
  • Need a rule for small and single-unit summaries:
    cohesion and organization as defined don't make
    much sense for these.

23
Counts of peer units (sentences) in submissions:
Widely variable
24
Grammaticality across all summaries
  • Most scores relatively high
  • System score range very wide
  • Medians/means
  • Baselines < Systems < Humans
  • But why are baselines (extractions) less than
    perfect?

           Mean   Std.
Baseline   3.23   0.67
System     3.53   0.75
Human      3.79   0.52
Notches in box plots indicate 95% confidence
intervals around the mean if and only if
- the sample is large (> 30), or
- the sample has an approximately normal
distribution (see the sketch below).
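
For reference, a minimal sketch of the usual large-sample 95% confidence interval for a mean that such a notch corresponds to (the exact formula behind the plots is assumed here).

```python
import math

def ci95_mean(values):
    """Large-sample 95% confidence interval for the mean:
    mean +/- 1.96 * s / sqrt(n), appropriate when n > 30 or the
    sample is approximately normal."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))  # sample std
    half_width = 1.96 * s / math.sqrt(n)
    return mean - half_width, mean + half_width
```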
25
Most baselines contained a sentence fragment
  • Single-document summaries
  • Take the first 100 words in the document
  • 91.7% of these summaries ended with a sentence
    fragment.
  • Multi-document summaries
  • Take the first 50, 100, 200, 400 words in the
    most recent document.
  • 87.5% of these summaries ended with a sentence
    fragment.
  • Take the first sentence in the 1st, 2nd, 3rd,
    document in chronological sequence until you have
    the target summary size. Truncate the last
    sentence if target size is exceeded.
  • 69.2% of these summaries ended with a sentence
    fragment.

26
Grammaticality, singles vs. multis: Single- vs.
multi-document seems to have little effect
27
Grammaticality among multis: Why more low scores
for the baseline 50s and human 400s?
28
Cohesion across all summaries: Median for baselines
and systems < humans
29
Cohesion singles vs multis
  • Better results on singles than multis
  • For singles, medians are similar for baselines,
    systems, and humans

30
Cohesion among multis: Why do more system summaries
score higher in the 50s?
31
Organization across all summaries: Median
baselines > systems > humans
32
Organization singles vs multis
  • Generally lower scores for multi-document
    summaries than single-document summaries

33
Organization among multis: Why do more system
summaries score higher in the 50s? Why are human
summaries worse for the 200s?
34
Cohesion vs. organization: Any real difference for
assessors? Why is organization ever higher than
cohesion?
35
Per-unit content evaluation details
  • First, find all the peer units which tell you
    at least some of what the current model unit
    tells you, i.e., peer units which express at
    least some of the same facts as the current model
    unit. When you find such a PU, click on it to
    mark it.
  • When you have marked all such PUs for the
    current MU, then think about the whole set of
    marked PUs and answer the question:
  • The marked PUs, taken together, express
    "All", "Most", "Some", "Hardly any", or "None" of
    the meaning expressed by the current model unit.

36
SEE: per-unit content
37
Per-unit content: assessor feedback
  • This is a laborious process, and easy to get
    wrong: a loop within a loop.
  • How to interpret fragments as units, e.g., a date
    standing alone?
  • How much and what kind of information (e.g., from
    context) can/should you add to determine what a
    peer unit means?
  • Criteria for marking a PU need to be clear -
    sharing of what?
  • Facts
  • Ideas
  • Meaning
  • Information
  • Reference

38
Per-unit content measures
  • Recall
  • Average coverage: average of the per-MU
    completeness judgments (0..4) for a peer summary
    (see the sketch after this list)
  • Recall at various threshold levels
  • Recall4 = (# MUs with all information covered) /
    (# MUs)
  • Recall3 = (# MUs with all/most information
    covered) / (# MUs)
  • Recall2 = (# MUs with all/most/some information
    covered) / (# MUs)
  • Recall1 = (# MUs with all/most/some/any
    information covered) / (# MUs)
  • Weighted average?
  • Precision problems
  • Peer summary lengths fixed
  • Insensitive to
  • Duplicate information
  • Partially unused peer units
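
A minimal sketch of how average coverage and the threshold recalls can be computed from the per-MU judgments, assuming each judgment has already been mapped to the 0..4 scale (None = 0 ... All = 4); the function name is illustrative.

```python
def coverage_measures(mu_judgments):
    """mu_judgments: one completeness judgment per model unit,
    encoded 0=None, 1=Hardly any, 2=Some, 3=Most, 4=All."""
    n = len(mu_judgments)
    average_coverage = sum(mu_judgments) / n
    # RecallT = (# MUs judged at least T) / (# MUs)
    recall_at = {t: sum(1 for j in mu_judgments if j >= t) / n
                 for t in (1, 2, 3, 4)}
    return average_coverage, recall_at

# Example: a peer covering three of five MUs at least partially.
avg, recall = coverage_measures([4, 3, 2, 0, 0])
# avg = 1.8; recall[4] = 0.2, recall[3] = 0.4, recall[2] = recall[1] = 0.6
```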

39
Average coverage across all summaries
  • Medians: baselines < systems < humans
  • Lots of outliers
  • Best system summaries approach, equal, or exceed
    human models

40
Average coverage singles vs multis
  • Relatively lower baseline and system coverage
    for multi-document summaries

41
Average coverage among multis: Small improvement
as size increases
42
Average coverage by system for singles
[Box plot of average coverage by system for single-document summaries; systems ordered T, R, O, P, Q, W, Y, V, X, S, Z, shown alongside the baseline and human summaries.]
43
Average coverage by system for multis
[Box plot of average coverage by system for multi-document summaries; systems ordered T, N, Y, L, P, S, M, R, O, Z, W, U, shown alongside the baselines and human summaries.]
44
Average coverage by docset for 2 systems: Averages
hide lots of variation by docset/assessor
45
SEE: unmarked peer units
46
Unmarked peer units: evaluation details
  • Think of 3 categories of unmarked PUs
  • really should be in the model in place of
    something already there
  • not good enough to be in the model, but at least
    relevant to the model's subject
  • not even related to the model
  • Answer the following question for each category:
  • "All", "Most", "Some", "Hardly any", or "None" of
    the unmarked PUs belong in this category.
  • Every PU should be accounted for in some
    category.
  • If there are no unmarked PUs, then answer each
    question with "None".
  • If there is only one unmarked PU, then the
    answers can only be "All" or "None" (a
    consistency-check sketch follows below).
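
A small sketch of the consistency constraints these rules imply; the string encoding of the answers is illustrative.

```python
def answers_consistent(needed, relevant, unrelated, n_unmarked_pus):
    """Illustrative check of the rules above.  Each answer is one of
    "All", "Most", "Some", "Hardly any", "None"."""
    answers = [needed, relevant, unrelated]
    if n_unmarked_pus == 0:
        # No unmarked PUs: every question must be answered "None".
        return all(a == "None" for a in answers)
    if all(a == "None" for a in answers):
        return False                  # every unmarked PU must land somewhere
    if n_unmarked_pus == 1 and any(a not in ("All", "None") for a in answers):
        return False                  # a single PU allows only All or None
    if "All" in answers:
        # If one category holds all the unmarked PUs, the others hold none.
        return answers.count("All") == 1 and all(
            a == "None" for a in answers if a != "All")
    return True
```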

47
Unmarked peer units: assessor feedback
  • Many errors (illogical results) had to be
    corrected, e.g., if one question is answered
    "All", then the others must be answered "None".
  • Allow identification of duplicate information in
    the peer.
  • Very little peer material that deserved to be in
    the model in place of something there.
  • Assessors were possessive of their model
    formulations.

48
Unmarked peer units: Few extremely good or bad

                       Needed in model   Not needed, but relevant   Not relevant
All (mean / median)       0.32 / 0              2.80 / 4               0.40 / 0
Singles                   0.35                  2.57                   0.33
Multis                    0.29                  3.06                   0.47
50                        0.16                  1.06                   0.55
100                       0.25                  3.03                   0.49
200                       0.34                  3.18                   0.50
400                       0.41                  3.27                   0.41
49
Phase 2 initial results
  • Designed to gauge effect of different models
  • Restricted to multi-document summaries of 50
    and 200 words
  • Assessor used 2 models created by other authors
  • Within-assessor differences mostly very small
  • Mean 0.020
  • Std 0.55
  • Still want to compare to original judgments

50
Summing up
51
Summing up
  • Overall peer quality
  • Grammaticality (especially the "All" choice) was
    too sensitive to low-level formatting.
  • Coherence and organization, as defined, made
    little sense for very short summaries.
  • Coherence was generally hard to distinguish
    from organization.
  • For the future
  • Address dependence of grammaticality on low-level
    formatting
  • Reassess operational definitions of
    coherence/organization to better capture what
    researchers want to measure

52
Summing up
  • Per-unit content (coverage)
  • Assessors were often in a quandary about how much
    information to bring to the interpretation of the
    summaries.
  • Even for simple sentences/EDUs, determination of
    shared meaning was very hard.
  • For the future
  • Results seem to pass sanity check?
  • Was the assessor time worth it in terms of what
    researchers can learn from the output?

53
Summing up
  • Unmarked peers
  • Very little difference in the DUC 2001 summaries
    with respect to quality of the unmarked
    (peer-only) material
  • For the future
  • DUC-2001 systems are not producing junk, so
    little will be unrelated.
  • Summary authors are not producing junk, so little
    will be good enough to replace what is there.
  • What then if anything can be usefully measured
    with respect to peer-only material?

54
The End
55
Average coverage by docset type (confounded by
docset and assessor/author): Human and system
summaries
56
Average coverage by docset type (confounded by
docset and assessor/author): Single- and multi-doc
baselines