Transcript and Presenter's Notes

Title: Evaluate, Evaluate, Evaluate: Issues and Results in Pyramid Evaluation for DUC 2003 and 2005


1
Evaluate, Evaluate, Evaluate: Issues and Results
in Pyramid Evaluation for DUC 2003 and 2005
  • Rebecca Passonneau
  • Center for Computational Learning Systems
  • Columbia University

2
Outline
  • Overview of Pyramids
  • Inter-annotator Agreement
  • Pyramid Annotation DUC 2003
  • Inter-annotator Agreement
  • Score Correlations
  • Peer Annotation DUC 2003, DUC 2005
  • A Meditation on Evaluation

3
Building a Pyramid from Model Summaries
[Figure, built up across slides 3-7: a pyramid with tiers W1 (bottom) through W4 (top); tier Wi contains the SCUs that appear in i of the model summaries]
8
Applying the Method
  • Collect human model summaries for a document set
  • Create a pyramid: manual SCU annotation
  • Manually annotate new summaries against the
    pyramid
  • Match SCUs to the summary
  • Sum the SCU weights (Obs)
  • Calculate Obs/Max

9
Pyramid Score
With O_i the number of matched (observed) SCUs of weight i in the peer summary, T_i the pyramid tier of SCUs of weight i, and X the number of SCUs the summary is expected to express:

$$\mathrm{Obs} = \sum_{i=1}^{n} i \cdot O_i$$

$$\mathrm{Max} = \sum_{i=j+1}^{n} i \cdot |T_i| \; + \; j\Big(X - \sum_{i=j+1}^{n} |T_i|\Big), \qquad j = \max_{i}\Big(\sum_{t=i}^{n} |T_t| \ge X\Big)$$

$$P = \mathrm{Obs} / \mathrm{Max}$$
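A minimal Python sketch of this computation, assuming the pyramid is given as tier sizes |T_1|..|T_n|, the peer annotation as per-weight counts O_1..O_n of matched SCUs, and X as the number of SCUs the summary is expected to express; the names (pyramid_score, tier_sizes, observed, x) are illustrative, not part of any DUC tooling.

```python
def pyramid_score(tier_sizes, observed, x):
    """Sketch of P = Obs / Max for a pyramid with n tiers.

    tier_sizes[i] -- |T_{i+1}|, number of SCUs of weight i+1 in the pyramid
    observed[i]   -- O_{i+1}, matched SCUs of weight i+1 in the peer summary
    x             -- number of SCUs the summary is expected to express
    """
    n = len(tier_sizes)

    # Obs: sum of the weights of the matched SCUs
    obs = sum((i + 1) * o for i, o in enumerate(observed))

    # j: the largest weight such that tiers of weight >= j hold at least x SCUs
    j = 1
    for i in range(n, 0, -1):
        if sum(tier_sizes[i - 1:]) >= x:
            j = i
            break

    # Max: take every SCU of weight > j, then fill up to x SCUs from tier j
    above_j = sum(tier_sizes[i - 1] for i in range(j + 1, n + 1))
    max_score = (sum(i * tier_sizes[i - 1] for i in range(j + 1, n + 1))
                 + j * (x - above_j))
    return obs / max_score if max_score else 0.0

# Example: 4-tier pyramid, a peer matching 7 SCUs, expected summary size of 7 SCUs
print(pyramid_score(tier_sizes=[10, 6, 4, 2], observed=[1, 2, 2, 2], x=7))
```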
10
Inter-annotator Agreement
  • What is it?
  • How is it measured?
  • When is it good enough?
  • There is no set answer (Krippendorff, 1980)

11
Canonical Agreement Matrix
12
Inter-annotator Metrics
  • 0 if agreement is no better than chance
  • 1 for perfect agreement
  • Approaches -1 for systematic disagreement
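Coefficients with these properties (e.g., Krippendorff's alpha) take the chance-corrected form 1 - D_o / D_e, observed over expected disagreement. A minimal sketch for two annotators labeling the same items, with a pluggable distance function and no missing-data handling; the names are illustrative.

```python
from itertools import permutations

def alpha(coder_a, coder_b, distance):
    """Krippendorff-style alpha for two coders over the same items.

    coder_a, coder_b -- parallel lists of labels, one per item
    distance         -- d(u, v) in [0, 1], with d(u, u) == 0
    """
    n = len(coder_a)
    # Observed disagreement: mean distance between the two labels of each item
    d_obs = sum(distance(a, b) for a, b in zip(coder_a, coder_b)) / n
    # Expected disagreement: mean distance over all ordered pairs of pooled labels
    pooled = list(coder_a) + list(coder_b)
    pairs = list(permutations(pooled, 2))
    d_exp = sum(distance(u, v) for u, v in pairs) / len(pairs)
    return 1 - d_obs / d_exp if d_exp else 1.0

# Nominal distance: 0 if the labels match, 1 otherwise
nominal = lambda u, v: 0.0 if u == v else 1.0
print(alpha(["a", "b", "a", "c"], ["a", "b", "b", "c"], nominal))  # ~0.67
```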

13
Paradigmatic Reliability Study
  • Measure inter-annotator agreement
  • Determine the effect of the annotated variables
    in an independent context

14
Example of SCU Similarity
15
SCUs from Different Annotators
  • Similar labels
  • Americans asked Saudi officials for help
  • Through the Saudis, the U.S. tried to get
    cooperation from the Taliban
  • Large amount of overlap
  • Similar weights
  • Compare each word unit to each SCU category

16
Table 1: Set Subsumption
17
Table 2: Symmetric Difference
18
Tools to Compare Sets
  • Size: Jaccard coefficient (J)
  • Type of relation: M, for two sets P, O
  • If P = O, M = 1
  • If P is a subset of O, M = 2/3
  • If P, O are in symmetric difference (partial overlap), M = 1/3
  • If P, O are disjoint, M = 0

19
Measuring Agreement for Set-valued Items (MASI)
  • MASI = J × M
  • (or 1 - J × M if measuring disagreement)
  • Originally developed for co-reference
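A minimal sketch of MASI over Python sets, as defined on this and the previous slide; 1 - MASI can serve as the distance function in the alpha sketch above. Function names are illustrative.

```python
def jaccard(p, o):
    """Size component J: |P intersect O| / |P union O|."""
    return len(p & o) / len(p | o) if p | o else 1.0

def monotonicity(p, o):
    """Relation component M: 1 equal, 2/3 subset, 1/3 overlap, 0 disjoint."""
    if p == o:
        return 1.0
    if p <= o or o <= p:
        return 2 / 3
    if p & o:
        return 1 / 3
    return 0.0

def masi(p, o):
    """MASI = J * M; use 1 - masi(p, o) as a disagreement/distance value."""
    return jaccard(p, o) * monotonicity(p, o)

# Two annotators' versions of an SCU, as sets of contributing word-unit ids
print(masi({1, 2, 3}, {2, 3, 4}))  # J = 0.5, M = 1/3 -> 0.1667
```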

20
Agreement for Example SCUs
21
DUC 2003 Pyramids
  • Docsets: 10 news articles, 500 words each
  • Summaries: 100 words (one paragraph)
  • Number of model summaries: 7
  • Number of pyramids: 5

22
Agreement on Five Pyramids
23
Score Correlations
Number of peer (machine) summaries: 16
24
Mean SCU Weight
No significant difference between annotators
25
DUC 2003 Data (Retrospective)
  • Docsets: 10 documents, 500 words each
  • Summaries: 100 words (one paragraph, 2/5 the 2005 length)
  • Number of model summaries: 7
  • Number of pyramids: 8
  • Three consensus pyramids
  • Five independent pairs of pyramids
  • Number of peers: 16 systems

26
DUC 2005 Data
  • Docsets: 30 documents, 720 words each
  • Summaries: 250 words (one page, 2 1/2 times the 2003 length)
  • Number of model summaries: 7
  • Number of pyramids: 20
  • Number of peers: 25 systems

27
Comparison Across Years
  • Mean SCU weight
  • 2003: 2.88
  • 2005: 1.9
  • Number of significantly different groups
  • 2003: 5 (4, 4, 3, 2, 2, 1)
  • 2005: 2 (23 vs. 2)
  • Number of significantly different docsets
  • 2003: 8

28
PRSs Peer Annotation
  • 2003 peer annotation
  • Mean Alpha_DICE = .78
  • Score_Mod correlations: Pearson = .84 (p = 0)
  • 2005 peer annotation
  • Mean Alpha_DICE = .61
  • Score_Mod correlations: Pearson = .88 (p = 0)

29
Conclusion
  • Pyramid annotation reliability accords with score
    correlations (2003)
  • Peer annotation reliability accords with score
    correlations (2003, 2005)
  • Mean SCU weight matters
  • Varies with model length
  • Varies with document set

30
Do the Evaluation Rock (apologies to Allen Ginsberg)
If you sit and evaluate for an hour or a minute every day
You can tell your Program Manager to sit the same way.
You can tell your Program Manager to watch 'n wait
And stop and evaluate 'cause it's never too late.
Do the Evaluation, Do the Evaluation
Get yourself together, lots of energy
And Generosity, Generosity and Generosity.