The Pyramid Method at DUC05

1
The Pyramid Method at DUC05
  • Ani Nenkova
  • Becky Passonneau
  • Kathleen McKeown
  • Other team members: David Elson, Advaith
    Siddharthan, Sergey Siegelman

2
Overview
  • Review of Pyramids (Kathy)
  • Characteristics of the responses
  • Analyses (Ani)
  • Scores and Significant Differences
  • Reliability of Pyramid scoring
  • Comparisons between annotators
  • Impact of editing on scores
  • Impact of Weight 1 SCUs
  • Correlation with responsiveness and Rouge
  • Lessons learned

3
Pyramids
  • Uses multiple human summaries
  • Previous data indicated 5 needed for score
    stability
  • Information is ranked by its importance
  • Allows for multiple good summaries
  • A pyramid is created from the human summaries
  • Elements of the pyramid are content units
  • System summaries are scored by comparison with
    the pyramid

4
Summarization Content Units
  • Near-paraphrases from different human summaries
  • Clause or less
  • Avoids explicit semantic representation
  • Emerges from analysis of human summaries

5
SCU: A cable car caught fire (Weight = 4)
  • A. The cause of the fire was unknown.
  • B. A cable car caught fire just after entering a
    mountainside tunnel in an alpine resort in
    Kaprun, Austria on the morning of November 11,
    2000.
  • C. A cable car pulling skiers and snowboarders
    to the Kitzsteinhorn resort, located 60 miles
    south of Salzburg in the Austrian Alps, caught
    fire inside a mountain tunnel, killing
    approximately 170 people.
  • D. On November 10, 2000, a cable car filled to
    capacity caught on fire, trapping 180 passengers
    inside the Kitzsteinhorn mountain, located in the
    town of Kaprun, 50 miles south of Salzburg in the
    central Austrian Alps.

6
SCU: The cause of the fire is unknown (Weight = 1)
  • A. The cause of the fire was unknown.
  • B. A cable car caught fire just after entering a
    mountainside tunnel in an alpine resort in
    Kaprun, Austria on the morning of November 11,
    2000.
  • C. A cable car pulling skiers and snowboarders
    to the Kitzsteinhorn resort, located 60 miles
    south of Salzburg in the Austrian Alps, caught
    fire inside a mountain tunnel, killing
    approximately 170 people.
  • D. On November 10, 2000, a cable car filled to
    capacity caught on fire, trapping 180 passengers
    inside the Kitzsteinhorn mountain, located in the
    town of Kaprun, 50 miles south of Salzburg in the
    central Austrian Alps.

7
SCU: The accident happened in the Austrian Alps (Weight = 3)
  • A. The cause of the fire was unknown.
  • B. A cable car caught fire just after entering a
    mountainside tunnel in an alpine resort in
    Kaprun, Austria on the morning of November 11,
    2000.
  • C. A cable car pulling skiers and snowboarders
    to the Kitzsteinhorn resort, located 60 miles
    south of Salzburg in the Austrian Alps, caught
    fire inside a mountain tunnel, killing
    approximately 170 people.
  • D. On November 10, 2000, a cable car filled to
    capacity caught on fire, trapping 180 passengers
    inside the Kitzsteinhorn mountain, located in the
    town of Kaprun, 50 miles south of Salzburg in the
    central Austrian Alps.
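
A minimal sketch of how the three example SCUs above get their weights, assuming an SCU's weight is simply the number of model summaries (A to D) that express it. The contributor sets are read off the example by hand, and the name `build_pyramid` is illustrative, not part of any released annotation tool.

```python
from collections import defaultdict

# Hand-assigned for illustration: which model summaries express each SCU.
scu_contributors = {
    "A cable car caught fire": {"A", "B", "C", "D"},
    "The accident happened in the Austrian Alps": {"B", "C", "D"},
    "The cause of the fire is unknown": {"A"},
}


def build_pyramid(contributors):
    """Group SCU labels into tiers keyed by weight, heaviest tier first."""
    tiers = defaultdict(list)
    for label, summaries in contributors.items():
        # Weight = number of distinct model summaries expressing the SCU.
        tiers[len(summaries)].append(label)
    return dict(sorted(tiers.items(), reverse=True))


pyramid = build_pyramid(scu_contributors)
for weight, labels in pyramid.items():
    print(f"W{weight}: {labels}")
```

With four model summaries the top tier can hold at most weight 4, which is why the first SCU sits at the top of this small pyramid.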

8
Idealized representation
  • Tiers of differentially weighted SCUs
  • Top: few SCUs, high weight
  • Bottom: many SCUs, low weight

[Figure: pyramid with tiers labeled W3, W2, W1]

9
Creation of pyramids
  • Done for each of 20 out of 50 sets
  • Primary annotator, secondary checker
  • Held round-table discussions of problematic
    constructions that occurred in this data set
      • Comma-separated lists
          • Extractive reserves have been formed for managed
            harvesting of timber, rubber, Brazil nuts, and
            medical plants without deforestation.
      • General vs. specific
          • Eastern Europe vs. Hungary, Poland, Lithuania,
            and Turkey

10
Characteristics of the Responses
  • Proportion of SCUs of Weight 1 is large
      • 44% (D324) to 81% (D695)
  • Mean SCU weight: 1.9
  • Agreement among human responders is quite low
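
The two statistics on this slide can be computed directly from a pyramid's list of SCU weights. The sketch below uses a made-up list of ten weights, not DUC05 data, and `pyramid_stats` is a hypothetical helper name.

```python
def pyramid_stats(weights):
    """Return (mean SCU weight, share of SCUs that have weight 1)."""
    n = len(weights)
    mean_weight = sum(weights) / n
    weight1_share = sum(1 for w in weights if w == 1) / n
    return mean_weight, weight1_share


# Hypothetical pyramid: one weight per SCU.
weights = [4, 3, 2, 2, 1, 1, 1, 1, 1, 1]
mean_weight, weight1_share = pyramid_stats(weights)
print(f"mean SCU weight = {mean_weight:.1f}, weight-1 share = {weight1_share:.0%}")
```

A large weight-1 share pulls the mean weight down, which is the pattern the slide reports for the 2005 pyramids.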

11
[Figure: SCUs at each weight; axis label: SCU Weights]

12
Pyramids: DUC 2003
  • 100 word summaries (vs. 250 word)
  • 10 500-word articles per cluster (vs. 30 720-word
    articles)
  • 3 clusters (vs. 20 clusters)
  • Mean SCU Weight (7 models)
      • 2005: avg 1.9
      • 2003: avg 2.4
  • Proportion of SCUs of W1
      • 2005: avg 60%, 44% to 81%
      • 2003: avg 40%, 37% to 47%

13
[Figure: side-by-side panels for DUC03 and DUC05]

14
Computing pyramid scores: Ideally informative summary
  • Does not include an SCU from a lower tier unless
    all SCUs from higher tiers are included as well
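
The tier rule above amounts to filling a summary from the top of the pyramid downward. The sketch below, with a hypothetical six-SCU pyramid and illustrative function names, picks an ideally informative summary of a given size and checks whether an arbitrary selection respects the rule.

```python
def ideal_summary(pyramid, size):
    """Pick `size` SCUs, always exhausting heavier tiers before lighter ones."""
    ranked = sorted(pyramid.items(), key=lambda item: item[1], reverse=True)
    return ranked[:size]


def respects_tiers(chosen, pyramid):
    """True if no SCU is chosen while a strictly heavier SCU is left out."""
    if not chosen:
        return True
    lowest = min(weight for _, weight in chosen)
    chosen_labels = {label for label, _ in chosen}
    return all(
        label in chosen_labels
        for label, weight in pyramid.items()
        if weight > lowest
    )


# Hypothetical pyramid: SCU label -> weight.
pyramid = {"s1": 4, "s2": 3, "s3": 3, "s4": 2, "s5": 1, "s6": 1}

best = ideal_summary(pyramid, 3)
print(best, respects_tiers(best, pyramid))

# Skipping the weight-3 tier to include a weight-1 SCU breaks the rule.
print(respects_tiers([("s1", 4), ("s5", 1)], pyramid))
```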

20
Original Pyramid Score
  • SCORE = D/MAX
  • D = Sum of the weights of the SCUs in a summary
  • MAX = Sum of the weights of the SCUs in an ideally
    informative summary
  • Measures the proportion of good information in
    the summary (precision)

21
Modified pyramid score (recall)
  • EN = average number of SCUs in the human models
      • This is the number of content units humans chose
        to convey about the story
  • W = the weight of a maximally informative
    summary of size EN
  • D/W is the modified pyramid score
  • Shows the proportion of expected good information
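
A small sketch contrasting the two scores. It assumes, as in the published description of the method, that MAX for the original score is computed for an ideal summary with as many SCUs as the peer summary, while W uses EN SCUs. Rounding EN to a whole number of SCUs is an assumption made here, and all names and numbers are illustrative.

```python
def max_weight(pyramid_weights, size):
    """Weight of an ideally informative summary with `size` SCUs."""
    return sum(sorted(pyramid_weights, reverse=True)[:size])


def original_score(matched_weights, num_peer_scus, pyramid_weights):
    """D / MAX, with MAX sized to the number of SCUs in the peer summary."""
    best = max_weight(pyramid_weights, num_peer_scus)
    return sum(matched_weights) / best if best else 0.0


def modified_score(matched_weights, model_scu_counts, pyramid_weights):
    """D / W, with W sized to EN, the average SCU count of the models."""
    en = round(sum(model_scu_counts) / len(model_scu_counts))  # assumed rounding
    best = max_weight(pyramid_weights, en)
    return sum(matched_weights) / best if best else 0.0


# Hypothetical pyramid and peer annotation.
pyramid_weights = [4, 3, 3, 2, 2, 1, 1, 1]
matched = [4, 2, 1]   # weights of pyramid SCUs found in the peer summary
num_peer_scus = 4     # the peer also has one SCU that is not in the pyramid

orig = original_score(matched, num_peer_scus, pyramid_weights)
mod = modified_score(matched, [5, 6, 6, 6], pyramid_weights)
print(f"original = {orig:.3f}, modified = {mod:.3f}")
```

Because the modified denominator depends only on the models, the peer's non-matching SCUs do not need to be counted for it.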

22
Scoring Methods
  • Presents scores for the 20 pyramid sets
  • Recompute Rouge for comparison
      • We compute Rouge using only 7 models
      • 8 and 9 reserved for computing human performance
      • Best because of significant topic effect
  • Comparisons between Pyramid (original, modified),
    responsiveness, and Rouge-SU4
      • Pyramid scores computed from multiple humans
      • Responsiveness is just one human's judgment
      • Rouge-SU4 equivalent to Rouge-2

23
Preview of Results
  • Manual metrics
  • Large differences between humans and machines
  • No single system the clear winner
  • But a top group identified by all metrics
  • Significant differences
  • Different predictions from manual and automatic
    metrics
  • Correlations between metrics
  • Some correlation but one cannot be substituted
    for another
  • This is good

24
Human performance/Best sys
  Pyramid       Modified      Resp         ROUGE-SU4
  B   0.5472    B   0.4814    A   4.895    A   0.1722
  A   0.4969    A   0.4617    B   4.526    B   0.1552
  14  0.2587    10  0.2052    4   2.85     15  0.139

Best system ~50% of human performance on manual metrics
Best system ~80% of human performance on ROUGE
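
The two percentages can be checked against the figures on this slide by dividing the best system score by the best human score for each metric. The pairing below (best human, best system) is read off the slide's rows; the dictionary layout is just for illustration.

```python
# (best human score, best system score) per metric, taken from the slide.
scores = {
    "Pyramid (original)": (0.5472, 0.2587),
    "Pyramid (modified)": (0.4814, 0.2052),
    "Responsiveness": (4.895, 2.85),
    "ROUGE-SU4": (0.1722, 0.139),
}

ratios = {
    metric: best_system / best_human
    for metric, (best_human, best_system) in scores.items()
}

for metric, ratio in ratios.items():
    print(f"{metric}: best system at {ratio:.0%} of best human")
```
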
25
  Pyramid original    Modified        Resp           Rouge-SU4
  Sys  Score          Sys  Score      Sys  Score     Sys  Score
  14   0.2587         10   0.2052     4    2.85      15   0.139
  17   0.2492         17   0.1972     14   2.8       4    0.134
  15   0.2423         14   0.1908     10   2.65      17   0.1346
  10   0.2379         7    0.1852     15   2.6       19   0.1275
  4    0.2321         15   0.1808     17   2.55      11   0.1259
  7    0.2297         4    0.177      11   2.5       10   0.1278
  16   0.2265         16   0.1722     28   2.45      6    0.1239
  6    0.2197         11   0.1703     21   2.45      7    0.1213
  32   0.2145         6    0.1671     6    2.4       14   0.1264
  21   0.2127         12   0.1664     24   2.4       25   0.1188
  12   0.2126         19   0.1636     19   2.4       21   0.1183
  11   0.2116         21   0.1613     6    2.4       16   0.1218
  26   0.2106         32   0.1601     27   2.35      24   0.118
  19   0.2072         26   0.1464     12   2.35      12   0.116
  28   0.2048         3    0.145      7    2.3       3    0.1198
  13   0.1983         28   0.1427     25   2.2       28   0.1203
  3    0.1949         13   0.1424     32   2.15      27   0.110

29
Significant Differences
  • Manual metrics
      • Few differences between systems
      • Pyramid: 23 is worse
      • Responsiveness: 23 and 31 are worse
      • Both humans better than all systems
  • Automatic (Rouge-SU4)
      • Many differences between systems
      • One human indistinguishable from 5 systems

30
Multiple and pairwise comparisons
  • Multiple comparisons
      • Tukey's method
      • Controls for the experiment-wise type I error
      • Shows fewer significant differences
  • Pairwise comparisons
      • Wilcoxon paired test
      • Controls the error for individual comparisons
      • Appropriate for seeing how your system did, for
        development
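
For the pairwise case, a Wilcoxon signed-rank test on per-topic scores of two systems can be sketched in plain Python. This version computes an exact two-sided p-value by enumerating all sign assignments, so it is only practical for a small number of topics; the per-topic scores are invented, and nothing here is claimed to be the procedure used for the DUC05 analysis.

```python
from itertools import product


def wilcoxon_signed_rank(x, y):
    """Return (W+, exact two-sided p-value) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Tied absolute differences share the average of their ranks.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    total = sum(ranks)
    observed = abs(w_plus - total / 2)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - total / 2) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Hypothetical per-topic scores for two systems on eight topics.
system_a = [0.31, 0.22, 0.27, 0.35, 0.19, 0.28, 0.30, 0.25]
system_b = [0.24, 0.20, 0.29, 0.26, 0.15, 0.21, 0.22, 0.23]
w_plus, p_value = wilcoxon_signed_rank(system_a, system_b)
print(f"W+ = {w_plus}, p = {p_value:.4f}")
```

Running this once per system pair controls the error of each individual comparison only, which is why the slide contrasts it with Tukey's method.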

31
  Peer                             Better than
  21 32 6 12 19 11 16 4 15 7 14    23
  17 10                            23 20
  A B                              23 20 30 24 31 1 27 25 28 13 26 3 21 32 6 12 19 11 16 4 15 7 14 17 10
  • Modified pyramid: significant differences
  • One system accounts for most of the differences
  • Humans significantly better than all systems

32
  Peer                             Better than
  26 13 20 3 32 25 7 12 27         23
  6 16 19 24 21 28 11 17           23 31
  15 10                            23 31 1
  14                               23 31 1 30 26 13 20
  4                                23 31 1 30 26 13 20 3
  B A                              23 31 1 30 26 13 20 3 32 25 7 12 27 6 16 19 24 21 28 11 17 15 10 14 4
  • Responsiveness-1: significant differences
  • Differences primarily between 2 systems
  • Differences between humans and each system

33
  Peer                             Better than
  16 12 15 28 3 7 4                23
  14 17 10                         23 31 20
  B A                              23 31 1 30 26 13 20 3 32 25 7 12 27 6 16 19 24 21 28 11 17 15 10 14 4
  • Responsiveness-2
  • Similar shape to original

34
  Peer                                          Better than
  20 31 26 1                                    23
  32                                            23 20
  11 28 13 30 27 3                              23 20 31
  16 21 12 24 25 7 14 6 19 10 17 4 15           23 20 31 26 1
  B                                             23 20 31 26 1 32 11 28 13 30 27 3 16 21 12 24 25 7 14 6
  A                                             23 20 31 26 1 32 11 28 13 30 27 3 16 21 12 24 25 7 14 6 19 10 17 4 15
  • Skip-bigram: significant differences
  • Many more differences between systems than any
    manual metric
  • No difference between human and 5 systems

36
Pairwise comparisons: Modified Pyramid
Peer
10 17 14 7 15 4 16 11 19 12 6 32 21 3 26 13 28 25 27 31 24 30 20 23
Better than
3 25 27 24 30 20 23 25 27 1 24 30 20 23 13 25 27 31 24 30 20 23 3 25 27 1 24 30 20 23 25 27 31 24 30 20 23 24 30 23 24 30 23 24 30 23 30 23 31 30 23 24 30 20 23 24 30 23 30 23 23 23 30 20 23


37
Agreement between annotators
                       Overall   Low    High
  Percent Agreement    95        90     96
  Kappa                .57       .46    .62
  Alpha                .57       .41    .59
  Alpha-Dice           .67       .49    .68
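
The gap between percent agreement and kappa in this table is typical: kappa discounts the agreement expected by chance. The sketch below computes both for two annotators using Cohen's kappa, which is an assumption since the slide does not say which kappa variant was used; the labels are invented, and the alpha measures are not covered.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which the two annotators chose the same label."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Chance-corrected agreement between two annotators."""
    n = len(a)
    observed = percent_agreement(a, b)
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement from each annotator's own label distribution.
    expected = sum(counts_a[l] * counts_b[l] for l in set(a) | set(b)) / n ** 2
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical decisions: 1 = "this text span matches the SCU", 0 = "no match".
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]

agreement = percent_agreement(annotator_1, annotator_2)
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"agreement = {agreement:.0%}, kappa = {kappa:.2f}")
```
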
38
Editing of participant annotations
  • To correct obvious errors
  • Ensures uniform checking
  • Predominantly involved correcting the splitting of
    non-matching SCUs
  • Average paired differences
      • Original: 0.0043
      • Modified: 0.0005
  • Average magnitude of the difference
      • Original: 0.0115
      • Modified: 0.0032
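
The two summary figures differ in whether the sign of each difference is kept. A short sketch, with invented scores and an assumed direction of subtraction (edited minus unedited), shows why the average paired difference can be much smaller than the average magnitude.

```python
def paired_difference_stats(before, after):
    """Return (mean signed difference, mean absolute difference)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    return sum(diffs) / n, sum(abs(d) for d in diffs) / n


# Hypothetical scores for four peer summaries, before and after editing.
unedited = [0.20, 0.25, 0.18, 0.30]
edited = [0.21, 0.24, 0.18, 0.32]

mean_diff, mean_magnitude = paired_difference_stats(unedited, edited)
print(f"mean difference = {mean_diff:.4f}, mean magnitude = {mean_magnitude:.4f}")
```

Positive and negative changes cancel in the first figure but not in the second.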

39
Excluding weight 1 SCUs
  • Removing weight 1 SCUs improves agreement
      • Kappa 0.64 (was 0.57)
  • Annotating without weight 1 has negligible impact
    on scores
      • Set D324 done without weight 1 SCUs
      • Avg. magnitude of paired differences
      • On average 0.07 difference

40
Correlations: Pearson's, 25 systems
              Pyr-mod   Resp-1   Resp-2   R-2     R-SU4
  Pyr-orig    0.96      0.77     0.86     0.84    0.80
  Pyr-mod               0.81     0.90     0.90    0.86
  Resp-1                         0.83     0.92    0.92
  Resp-2                                  0.88    0.87
  R-2                                             0.98
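
Each cell in this table is a Pearson correlation between two lists of per-system scores. The sketch below computes one for six systems whose original and modified pyramid scores appear in the ranking slide; because it is only a subset, the result is not expected to match the 0.96 reported for all 25 systems.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


# Systems 14, 17, 15, 10, 4, 7 in that order, matched by system ID.
original = [0.2587, 0.2492, 0.2423, 0.2379, 0.2321, 0.2297]
modified = [0.1908, 0.1972, 0.1808, 0.2052, 0.1770, 0.1852]

r = pearson(original, modified)
print(f"Pearson r on six systems = {r:.2f}")
```
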
41
Correlations: Pearson's, 25 systems
Questionable that responsiveness could be a gold
standard
42
Pyramid and responsiveness
High correlation, but the metrics are not
mutually substitutable
43
Pyramid and Rouge
High correlation, but the metrics are not
mutually substitutable
44
Lessons Learned
  • Comparing content is hard
  • All kinds of judgment calls
  • We didn't evaluate the NIST assessors in previous
    years
  • Paraphrases
      • VP vs. NP
          • Ministers have been exchanged
          • Reciprocal ministerial visits
      • Length and constituent type
          • Robotics assists doctors in the medical operating
            theater
          • Surgeons started using robotic assistants

45
Modified scores better
  • Easier peer annotation
  • Can drop weight 1 SCUs
  • Better agreement
  • No emphasis on splitting non-matching SCUs

46
Agreement between annotators
  • Participants can perform peer annotation reliably
  • Absolute difference between scores
      • Original: 0.0555
      • Modified: 0.0617
  • Empirical prediction of difference: 0.06 (HLT 2004)

47
Correlations
  • Original and modified can substitute for each
    other
  • High correlation between manual and automatic,
    but automatic not yet a substitute
  • Similar patterns between pyramid and
    responsiveness

48
Current Directions
  • Automated identification of SCUs (Harnly et al.
    05)
  • Applied to DUC05 pyramid data set
  • Correlation of .91 with modified pyramid scores

49
Questions
  • What was the experience annotating pyramids?
  • Does it shed insight on the problem?
  • Are people willing to do it again?
  • Would you have been willing to go through
    training?
  • If you've done pyramid analysis, can you share
    your insights?

52
Correlations of Scores on Matched Sets
53
SCU Weight by Cardinality (Ten pyramids)
54
Mean SCU Weight (Ten pyramids)