1
Automatic Text Summarization: A Solid Base
  • Martijn B. Wieling
  • Rijksuniversiteit Groningen

November 25th, 2004
2
Outline
  • Why should we bother at all? (a.k.a.
    Introduction)
  • A frequency-based ATS [Luhn, 1958]
  • An ATS based on multiple features [Edmundson, 1969]
  • Automatically combining the features (1) [Kupiec et al., 1995]
  • Automatically combining the features (2) [Teufel & Moens, 1997]
  • Why should we still bother? (a.k.a. Conclusion)

3
Why should we bother at all?
  • Time saving
  • Large-scale applications are possible, e.g.
  • Google-xtract
  • Extract translation
  • Abstracts will be consistent and objective

4
And in the beginning there was...
  • Hans Peter Luhn (father of Information Retrieval): "The Automatic Creation of Literature Abstracts" - 1958

Image Courtesy IBM
5
Luhn's method: basic idea
  • Target documents: technical literature
  • The method is based on the following assumptions:
  • Frequency of word occurrence in an article is a useful measurement of word significance
  • Relative position of these significant words within a sentence is also a useful measurement of word significance
  • Based on the limited capabilities of machines (IBM 704) → no semantic information

IBM 704 - Courtesy IBM
6
Why word frequency?
  • Important words are repeated throughout the text:
  • examples are given in favor of a certain principle
  • arguments are given for a certain principle
  • Technical literature → one word, one notion
  • Simple and straightforward algorithm → cheap to implement (processing time is costly)
  • Note that different forms of the same word are counted as the same word

7
When significant?
  • Words with too low a frequency are not significant
  • Words with too high a frequency are also not significant (e.g. "the", "and")
  • Removing low-frequency words is easy:
  • set a minimum frequency threshold
  • Removing common (high-frequency) words:
  • setting a maximum frequency threshold (statistically obtained)
  • comparing to a common-word list (see the sketch below)

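A minimal Python sketch of this filtering (the thresholds and the tiny common-word list are illustrative assumptions, not Luhn's actual values):

    from collections import Counter

    # Illustrative stand-in for a common-word list; Luhn's real list was larger.
    COMMON_WORDS = {"the", "and", "of", "a", "in", "to", "is"}

    def significant_words(words, min_freq=2, max_freq=50):
        """Keep words frequent enough to matter, but neither too
        frequent nor on the common-word list."""
        counts = Counter(w.lower() for w in words)
        return {w for w, c in counts.items()
                if min_freq <= c <= max_freq and w not in COMMON_WORDS}
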
Figure 1 from Luhn, 1958
8
Using relative position
  • Where the greatest number of high-frequency words are found closest together, the probability is very high that representative information is given
  • Based on the characteristic that an explanation of a certain idea is represented by words close together (e.g. sentences - paragraphs - chapters)

9
The significance factor
  • The significance factor of a sentence reflects the number of occurrences of significant words within a sentence and the linear distance between them due to non-significant words in between
  • Only consider the portion of the sentence bracketed by significant words, with a maximum of 5 non-significant words in between
  • Significance factor formula: (number of significant words in the bracketed portion)² / (total number of words in the portion)
  • e.g. a bracketed portion of 10 words containing 5 significant words scores 5² / 10 = 2.5

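A sketch of this scoring, reusing significant_words from the earlier sketch (the gap of 5 follows the slide):

    def significance_factor(words, significant, max_gap=5):
        """Luhn's significance factor: bracket the sentence by significant
        words at most max_gap non-significant words apart, then score each
        portion as (significant words)**2 / (total words in the portion)."""
        idx = [i for i, w in enumerate(words) if w.lower() in significant]
        best, start = 0.0, 0
        for k in range(1, len(idx) + 1):
            # Close the cluster at the end or when the gap grows too large.
            if k == len(idx) or idx[k] - idx[k - 1] - 1 > max_gap:
                cluster = idx[start:k]
                portion = cluster[-1] - cluster[0] + 1  # bracketed span length
                best = max(best, len(cluster) ** 2 / portion)
                start = k
        return best  # 5 significant words in a 10-word portion: 25/10 = 2.5
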
10
Generating the abstract
  • For every sentence the significance factor is
    calculated
  • The sentences with a significance factor higher than a certain cut-off value are returned (alternatively, the N highest-valued sentences can be returned; both selections are sketched below)
  • For large texts, it can also be applied to subdivisions of the text
  • No evaluation of the results is present in the journal paper!

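Continuing the sketch, both selection strategies from this slide:

    def luhn_abstract(sentences, significant, cutoff=None, top_n=None):
        """Return sentences above `cutoff`, or the `top_n` best, in
        original document order."""
        scores = [significance_factor(s.split(), significant)
                  for s in sentences]
        if top_n is not None:
            ranked = sorted(range(len(sentences)),
                            key=lambda i: scores[i], reverse=True)
            keep = sorted(ranked[:top_n])
        else:
            keep = [i for i, f in enumerate(scores) if f > cutoff]
        return [sentences[i] for i in keep]
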
11
A new method by Edmundson
  • H.P. Edmundson: "New Methods in Automatic Extracting" - 1969

IBM 7090 - Courtesy IBM
12
Four methods for weighting
  • Weighting methods:
  • Cue Method
  • Key Method
  • Title Method
  • Location Method
  • The weight of a sentence is a linear combination of the weights obtained with the above four methods
  • The highest-weighted sentences are included in the abstract
  • Target documents: technical literature

13
Cue Method
  • Based on the hypothesis that the probable relevance of a sentence is affected by the presence of pragmatic words (e.g. "Significant", "Greatest", "Impossible", "Hardly")
  • Three types of Cue words:
  • Bonus words: positively affecting the relevance of a sentence (e.g. "Significant", "Greatest")
  • Stigma words: negatively affecting the relevance of a sentence (e.g. "Impossible", "Hardly")
  • Null words: irrelevant

14
Obtaining Cue words
  • The lists were obtained by statistical analyses of 100 documents
  • Dispersion (δ): the number of documents in which the word occurred
  • Selection ratio (σ): the ratio of the number of occurrences in extractor-selected sentences to the number of occurrences in all sentences
  • Bonus words: σ > t_high
  • Stigma words: σ < t_low
  • Null words: δ > t_δ and t_low < σ < t_high

15
Resulting Cue lists
  • Bonus list (783 words): comparatives, superlatives, adverbs of conclusion, value terms, etc.
  • Stigma list (73 words): anaphoric expressions, belittling expressions, etc.
  • Null list (139 words): ordinals, cardinals, the verb "to be", prepositions, pronouns, etc.

16
Cue weight of sentence
  • Tag all Bonus words with weight b > 0, all Stigma words with weight s < 0, all Null words with weight n = 0
  • Cue weight of sentence = Σ (Cue weight of each word in the sentence), as sketched below

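A minimal sketch with toy cue lists (the real lists held 783 bonus, 73 stigma and 139 null words; the weights here are illustrative):

    # Toy cue lists; Null and unlisted words both contribute weight 0.
    BONUS = {"significant": 1.0, "greatest": 1.0}    # b > 0
    STIGMA = {"impossible": -1.0, "hardly": -1.0}    # s < 0

    def cue_weight(sentence_words):
        """Cue weight of a sentence = sum of the cue weights of its words."""
        return sum(BONUS.get(w.lower(), 0.0) + STIGMA.get(w.lower(), 0.0)
                   for w in sentence_words)
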
17
Key Method
  • Principle based on Luhn: counting the frequency of words
  • The algorithm differs:
  • Create a key glossary of all non-Cue words in the document which have a frequency larger than a certain threshold
  • The weight of each key word in the key glossary is set to its frequency in the document
  • Assign the key weight to each word which can be found in the key glossary
  • If a word is not in the key glossary, its key weight = 0
  • No relative position is used (unlike Luhn)
  • Key weight of sentence = Σ (Key weight of each word in the sentence); see the sketch below

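A sketch of the Key Method as described (the frequency threshold is an assumed value):

    from collections import Counter

    def key_glossary(document_words, cue_words, threshold=5):
        """Non-Cue words above the frequency threshold, weighted by frequency."""
        counts = Counter(w.lower() for w in document_words)
        return {w: c for w, c in counts.items()
                if c > threshold and w not in cue_words}

    def key_weight(sentence_words, glossary):
        # Words outside the glossary get key weight 0.
        return sum(glossary.get(w.lower(), 0) for w in sentence_words)
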
18
Title Method
  • Based on the hypothesis that an author conceives the title as circumscribing the subject matter of the document (similarly for headings vs. paragraphs)
  • Create a title glossary consisting of all non-Null words in the title, subtitle and headings of the document
  • Words are given a positive title weight if they appear in this glossary
  • Title words are given a larger weight than heading words (see the sketch below)
  • Title weight of sentence = Σ (Title weight of each word in the sentence)

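A sketch of the glossary construction; the weights 2.0 and 1.0 are illustrative, the method only requires title words to outweigh heading words:

    def title_glossary(title_words, heading_words, null_words,
                       title_w=2.0, heading_w=1.0):
        """Non-Null title and heading words; title words get the larger weight."""
        g = {w.lower(): heading_w for w in heading_words
             if w.lower() not in null_words}
        g.update({w.lower(): title_w for w in title_words
                  if w.lower() not in null_words})
        return g

    def title_weight(sentence_words, glossary):
        return sum(glossary.get(w.lower(), 0.0) for w in sentence_words)
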
19
Location Method
  • Based on the hypothesis that:
  • Sentences occurring under certain headings are positively relevant
  • Topic sentences tend to occur very early or very late in a document and its paragraphs
  • Global idea:
  • Give each sentence below a heading the same weight as the heading itself (note that this is independent of the Title Method) - Heading weight
  • Give each sentence a certain weight based on its position - Ordinal weight
  • Location weight of sentence = Ordinal weight of sentence + Heading weight of sentence

20
Location Method Heading weight
  • Compare each word in a heading with the pre-stored Heading dictionary
  • If the word occurs in this dictionary, assign it a weight equal to the weight it has in the dictionary
  • Heading weight of a heading = Σ (heading weight of each word in the heading)
  • Heading weight of a sentence = Heading weight of its heading

21
Creating the Heading dictionary
  • The Heading dictionary was created by listing all words in the headings of 120 documents and calculating the selection ratio for each word
  • Selection ratio (σ): the ratio of the number of occurrences in extractor-selected sentences to the number of occurrences in all headings
  • Deletions from this list were made on the basis of low frequency and unrelatedness to the desired information types (subject, purpose, conclusion, etc.)
  • Weights were given to the words in the Heading dictionary proportional to the selection ratio
  • The resulting Heading dictionary contained 90 words

22
Location Method Ordinal weight
  • Sentences of the first paragraph are tagged with weight O1
  • Sentences of the last paragraph are tagged with weight O2
  • The first sentence of a paragraph is tagged with weight O3
  • The last sentence of a paragraph is tagged with weight O4
  • Ordinal weight of sentence = O1 + O2 + O3 + O4, summing the weights of the tags that apply (see the sketch below)

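A sketch of the ordinal tagging; the four O-weights are free parameters here:

    def ordinal_weight(sent_i, par_i, n_sents_in_par, n_pars,
                       o1=1.0, o2=1.0, o3=1.0, o4=1.0):
        """Sum the O-weights of every positional tag that applies."""
        w = 0.0
        if par_i == 0:                    w += o1  # in first paragraph
        if par_i == n_pars - 1:           w += o2  # in last paragraph
        if sent_i == 0:                   w += o3  # paragraph-initial
        if sent_i == n_sents_in_par - 1:  w += o4  # paragraph-final
        return w
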
23
Generating the abstract
  • Calculate the weight of a sentence = aC + bK + cT + dL, with a, b, c, d constant positive integers, C = Cue weight, K = Key weight, T = Title weight, L = Location weight
  • The values of a, b, c and d were obtained by manually comparing the generated automatic abstracts with the desired (human-made) abstracts
  • Return the highest-weighted N sentences under their proper headings as the abstract (including the title); see the sketch below
  • N is calculated by taking a percentage of the size of the original document; in this journal paper 25% is used

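Putting the four methods together (the unit coefficients are placeholders for the hand-tuned values; the 25% extraction rate follows the slide):

    def sentence_weight(C, K, T, L, a=1, b=1, c=1, d=1):
        """Edmundson's linear combination aC + bK + cT + dL."""
        return a * C + b * K + c * T + d * L

    def edmundson_abstract(sentences, weights, fraction=0.25):
        """Return the N highest-weighted sentences in document order."""
        n = max(1, round(fraction * len(sentences)))
        ranked = sorted(range(len(sentences)),
                        key=lambda i: weights[i], reverse=True)
        return [sentences[i] for i in sorted(ranked[:n])]
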
24
Which combination is best?
  • All combinations of C, K, T and L were tried to
    see which result had (on average) the most
    overlap with the handmade extract
  • As can be seen in the figure below (only the
    interesting results are shown), the Key method
    was omitted and only C, T and L are used to
    create the best abstract
  • Surprising result! (Luhn used only keywords to
    create the abstract)

Figure 4 from Edmundson, 1969
25
Evaluation
  • Evaluation was done on unseen data (40 technical documents), comparing with handmade abstracts
  • Result: 44% of the sentences co-selected, 66% similarity between abstracts (human judge)
  • Random abstract: 25% of the sentences co-selected, 34% similarity between abstracts
  • Another evaluation criterion: extract-worthiness
  • Result: 84% of the selected sentences are extract-worthy
  • Therefore one document allows many possible abstracts (differing in length and content)

26
Comments
  • [Goldstein et al., 1999]: Not good to base the length of the abstract on the length of the document
  • Summary length is independent of document length
  • The longer the document, the smaller the compression ratio (size of abstract / size of document)
  • Better to use a constant summary length
  • [Rath et al., 1961]: Human selection of sentences in abstracts is very variable
  • 6 abstracts of 20 sentences: only 32% overlap between 5 subjects
  • Abstracting the same document twice by the same person with 8 weeks in between: only 55% overlap (average for 6 subjects)
  • Perhaps the Key Method algorithm used here is not that good (Luhn's algorithm could be better)

27
Time and cost of this system?
  • Speed of extracting: 7800 words/minute
  • Cost: $0.015 / word
  • (including keypunching costs of $0.01 / word)
  • Used corpus of 29,500 words → $442.50 total cost
  • Adjusted for inflation (CPI, 2003): $2798.00 total cost

28
A jump in time
  • 1969: First man on the moon
  • 1972: Watergate scandal
  • 1980: John Lennon killed
  • 1981: First identification of AIDS + birth of me :-)
  • 1986: Space Shuttle Challenger explodes after launch
  • 1989: Fall of the Berlin Wall
  • 1990: Start of the Gulf War; introduction of the WWW
  • 1991: Soviet Union breaks up
  • 1992: Formal end of the Cold War
  • 1993: Creation of the European Union (Treaty of Maastricht)
  • 1994: Nelson Mandela president of South Africa

29
1995: Trained summarization
  • Julian Kupiec, Jan Pedersen and Francine Chen: "A Trainable Document Summarizer" - 1995

30
Trained weighting
  • Edmundson used subjective weighting of the features (Cue, Key, Title, Location) to create an abstract
  • In this journal paper, generating the abstract is approached as a statistical classification problem:
  • given a training set of documents with handmade abstracts,
  • develop a classification function that estimates the probability that a given sentence is included in the abstract
  • This requires a training corpus of documents with abstracts
  • Target documents: technical literature

31
Features
  • Five features were used
  • Sentence Length Cut-off Feature
  • Fixed Phrase Feature
  • Paragraph Feature
  • Thematic Word Feature
  • Uppercase Word Feature
  • The above features were chosen by experimentation

32
Sentence Length Cut-off Feature
  • Based on the principle that short sentences are
    often not included in abstracts
  • Given a threshold (e.g. 5 words)
  • SLC-value is true for sentences longer than the
    threshold
  • SLC-value is false otherwise
  • Note that this feature is not similar to any of
    the features Edmundson used

33
Fixed-Phrase Feature
  • Based on the hypothesis that:
  • sentences containing any of a list of fixed phrases (mostly 2 words long) are likely to be in the abstract (e.g. "in conclusion", "this result"; 26 elements in total)
  • sentences following a heading containing a certain keyword are more likely to be in the abstract (e.g. "conclusions", "results", "summary")
  • FP-value is true for sentences in the above situations, false otherwise
  • Note that this feature is a combination of Edmundson's Location Method and Cue Method, though in reduced form

34
Paragraph Feature
  • Each sentence in the first ten and last five paragraphs is tagged based on its location:
  • Paragraph-initial
  • Paragraph-final (P > 1 sentence)
  • Paragraph-medial (P > 2 sentences)
  • Note that this feature is a reduced form of Edmundson's Location Method

35
Thematic Word Feature
  • The most frequent words in a document are defined as thematic words
  • A small number of thematic words is selected, and each sentence is scored as a function of the frequency of these thematic words
  • TW-value is true if it is one of the highest-scoring sentences
  • TW-value is false otherwise
  • Note that this feature is an adapted version of Edmundson's Key Method

36
Uppercase Word Feature
  • Based on the hypothesis that proper names are often important, since they are the explanatory text for acronyms (e.g. "the ISO (International Standards Organization)")
  • Count the frequency of each proper name
  • Constraint: the uppercase thematic word is not sentence-initial and begins with a capital letter
  • The word must occur several times and may not be an abbreviated measurement unit
  • Score each sentence based on the number of frequent proper names in it
  • The score of a sentence in which the frequent proper name appears first is twice as high as for later occurrences
  • UW-value is true if it is one of the highest-scoring sentences, false otherwise
  • Note that this feature is a bit similar to Edmundson's Key Method (a sketch of the full feature vector follows below)

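A sketch of how the five features might be discretised per sentence; the threshold and the precomputed "top set" inputs are assumptions, not the paper's exact procedure:

    def feature_vector(words, par_position, sent_id,
                       fixed_phrases, thematic_top, uppercase_top,
                       length_cutoff=5):
        """Return the tuple of discrete feature values (F1, ..., F5)."""
        text = " ".join(words).lower()
        return (
            len(words) > length_cutoff,             # sentence length cut-off
            any(p in text for p in fixed_phrases),  # fixed-phrase feature
            par_position,                           # 'initial'/'medial'/'final'
            sent_id in thematic_top,                # thematic word feature
            sent_id in uppercase_top,               # uppercase word feature
        )
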
37
Classification
  • For each sentence s, the probability P(s ∈ S | F1, ..., Fk) that it will be included in the summary S given the k features is calculated (Bayes' rule)
  • Assuming statistical independence of the features: P(s ∈ S | F1, ..., Fk) = P(s ∈ S) · Πj P(Fj | s ∈ S) / Πj P(Fj)
  • P(s ∈ S) is constant, and P(Fj | s ∈ S) and P(Fj) can be estimated directly from the training set by counting occurrences
  • This function assigns each s a score which can be used to select sentences for inclusion in the abstract (see the sketch below)

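A sketch of the count-based classifier under the independence assumption; the add-one smoothing is my addition, to keep the estimates well-defined on small counts:

    from collections import Counter

    def train(feature_vectors, in_summary):
        """Count-based estimates for P(s in S), P(Fj | s in S) and P(Fj)."""
        n, n_pos = len(in_summary), sum(in_summary)
        k = len(feature_vectors[0])
        cond = [Counter() for _ in range(k)]  # Fj counts in summary sentences
        marg = [Counter() for _ in range(k)]  # Fj counts over all sentences
        for fv, y in zip(feature_vectors, in_summary):
            for j, f in enumerate(fv):
                marg[j][f] += 1
                if y:
                    cond[j][f] += 1
        return n, n_pos, cond, marg

    def score(fv, model):
        """P(s in S | F1..Fk) ~ P(s in S) * prod_j P(Fj | s in S) / P(Fj)."""
        n, n_pos, cond, marg = model
        p = n_pos / n
        for j, f in enumerate(fv):
            # Add-one smoothing guards against unseen feature values.
            p *= ((cond[j][f] + 1) / (n_pos + 2)) / ((marg[j][f] + 1) / (n + 2))
        return p
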
38
The training material
  • 188 documents with professionally created abstracts from the scientific/technical domain; the average length of the abstracts is 3 sentences (3.5% of the total size of the document)
  • Sentences from the abstract were matched to the original document:
  • 79% direct sentence matches
  • 3% direct joins (2 sentences combined)
  • 18% no direct match or join possible
  • Therefore the maximum performance of the automatic system is 82%

39
Evaluation (1)
  • Too little material → cross-validation used to evaluate
  • Two evaluation measures:
  • Fraction of manually selected sentences which were reproduced correctly: average result 35%
  • Fraction of the matchable selected sentences which were reproduced correctly: average result 42%
  • Performance of features (2nd measure):

Feature          Individual sentences correct   Cumulative sentences correct
Paragraph                33%                            33%
Fixed Phrases            29%                            42%
Length Cut-off           24%                            44%
Thematic Word            20%                            42%
Uppercase Word           20%                            42%
40
Evaluation (2)
  • The best combination is Paragraph + Fixed Phrase + Length Cut-off (44% performance)
  • Addition of the frequency keyword features results in a slight decrease of performance (44% → 42%)
  • Note that Edmundson also reports a decrease in performance in this case
  • In the final implementation the frequency keyword features are retained in favor of robustness
  • Baseline used in this experiment: selecting N sentences from the beginning (with Length Cut-off, thus positively biased)
  • The full feature set has an improvement of 74% over the baseline (24% → 42%)

41
Evaluation (3)
  • If the size of the generated abstract is increased to 25%, the performance improves to 84%
  • Edmundson only had a performance of 44%

42
Comments
  • The features used in this paper were chosen by experimentation
  • No results/discussions of these experiments are given in the paper, so the reasons for the choices remain unclear
  • The comparison to Edmundson is not very fair:
  • Edmundson's handmade reference abstracts had a size of 25% (here: 3.5%)
  • The comments which were given about Edmundson also apply here:
  • Not good to base the length of the abstract on the length of the document
  • Human selection of sentences in abstracts is very variable
  • Perhaps the Key Method algorithm used here is too simple (Luhn's algorithm could be better)

43
Kupiec et al. (1995) revisited
  • Simone Teufel and Marc Moens: "Sentence extraction as a classification task" - 1997

44
Main research questions
  • Could Kupiec et al.'s methodology (training a model with a corpus) be used for another evaluation criterion?
  • What is the difference in extraction performance between the two evaluation criteria for different types of documents?
  • Note that another set of features is used here than Kupiec et al. used

45
Another evaluation method
  • Kupiec et al. used the "match sentences" evaluation criterion
  • Here the training and test set abstracts were created by the authors themselves (as opposed to Kupiec et al.)
  • Hence fewer alignable sentences are available in the document:
  • 32% on average vs. 79% in Kupiec et al.
  • This does not mean there are fewer extract-worthy sentences in the document → another evaluation method is chosen
  • Evaluation: ask a human to identify abstract-worthy non-matchable sentences in the original document

46
Features
  • The features used here are different from Kupiec et al.:
  • Cue Phrase Method (1670 cue phrases)
  • Location Method
  • Sentence Length Method
  • Thematic Word Method
  • Title Method

47
Cue Phrase Method
  • Similar to Edmundson, with some differences:
  • A 5-point scale (-1 to 3) is used instead of 3 classes (Bonus, Null, Stigma)
  • Cue phrases are used instead of Cue words
  • If a phrase was entered into the list, syntactically and semantically similar phrases were also manually included in the list
  • A sentence gets the score of its maximum-scored Cue phrase; if no Cue phrases are present it gets a score of 0
  • The list was manually created by inspecting extracted sentences
  • Also based on the relative frequency in the abstract and the relative frequency in the document
  • Sentences occurring directly after headings like "Introduction" or "Conclusion" are given a prior score of 2 (in Edmundson this is part of the Location Method)

48
Location Method
  • As in Edmundson, with the exception of the sentences directly after headings mentioned previously
  • Sensitive to certain headings (e.g. "Introduction"); if such headings cannot be found, only the sentences of the first 7 and last 3 paragraphs are tagged (initial, medial, final)

49
Sentence Length Method
  • As in Kupiec et al.
  • The threshold is set to 15 tokens (including punctuation)

50
Thematic Word Method
  • As in Kupiec et al., with a few differences:
  • Selecting (non-Cue) words which occur frequently in this document, but rarely in the overall collection of documents
  • For each (non-Cue) word the term frequency-inverse document frequency value is calculated:
  • score(w) = f_loc · log(100 · N / f_glob)
  • with N = total number of documents, f_loc = frequency of word w in the document, f_glob = number of documents containing word w
  • The top 10 scoring words are defined as thematic words
  • The top 40 sentences based on the frequency of thematic words (averaged over sentence length) are given a TW-value of 1, all others 0 (see the sketch below)

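The slide's scoring formula in a short sketch:

    import math

    def thematic_score(f_loc, f_glob, n_docs):
        """score(w) = f_loc * log(100 * N / f_glob)."""
        return f_loc * math.log(100 * n_docs / f_glob)

    def thematic_words(doc_counts, doc_freq, n_docs, top=10):
        """The top-10 scoring (non-Cue) words are the thematic words."""
        scored = {w: thematic_score(c, doc_freq[w], n_docs)
                  for w, c in doc_counts.items()}
        return sorted(scored, key=scored.get, reverse=True)[:top]
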
51
Title Method
  • As in Edmundson, with the difference that:
  • the Title score of a sentence is the mean frequency of Title word occurrences in the sentence (in Edmundson each Title word was given the same score and the scores were summed)
  • headings are not taken into account here (determined by experimentation)
  • The 18 top-scoring sentences receive a Title-value of 1, the others 0

52
The experiment
  • Training set: a corpus of 124 documents from different areas of computational linguistics, with summaries written by the authors
  • A human judge marked additional abstract-worthy sentences in each document
  • 32% alignable sentences in the abstracts
  • Two evaluation methods (alignable and abstract-worthy), which were also combined

53
Summary of results
                                   Alignability   Abstract-worthy   Combined
Best single feature (Cue Method)      23.2%           46.7%          55.2%
All features                          31.6%           57.2%          68.4%

  • Baseline: 28% (obtained in a similar fashion as Kupiec et al.)
  • The low alignability performance of 31.6% can be explained by there being fewer alignable sentences to train on
  • Short abstracts were generated (2-5% of the size of the original document)
  • If the abstract size were increased to 25%, performance would increase to:
  • Alignability: 96% (Kupiec et al.: 84%)
  • Abstract-worthy: 98%
  • Combined: 97.3%
  • Therefore compression makes the difference, not the evaluation criterion

54
Conclusions of this experiment
  • The method proposed by Kupiec et al. of classificatory sentence selection is not restricted to texts which have high-quality handmade abstracts
  • A higher alignability of the handmade abstract is therefore not necessary for the purpose of sentence extraction; the compression rate is the factor which influences the result
  • However, if more flexible abstracts are to be generated, the addition of other training and evaluation criteria is useful
  • Increased training did not improve results; improvement must be sought in the extraction methods themselves

55
Comments
  • The features used in this paper were different from those of Kupiec et al.
  • No motivation was given why, for instance, the Uppercase Word feature was omitted, and why adapted versions of Edmundson's methods were chosen instead of the versions Kupiec et al. used
  • The comments which were given about Edmundson also apply here:
  • Not good to base the length of the abstract on the length of the document
  • Human selection of abstract-worthy sentences in abstracts is very variable

56
Why should we still bother?
  • In the discussed methods no attention is given to:
  • Cohesion of the abstract: filtering anaphora out of the abstract (e.g. "it", "that")
  • Filtering out repetition in the abstract
  • The semantics of the document
  • Cohesion: an attempt is made by using Lexical Chains
  • Repetition: an attempt is made by using Maximum Marginal Relevance
  • Semantics: this still cannot be done for the general case, but an attempt is made by using Rhetorical Tree Structures
  • Interested in these problems?
  • Wicher will explain extraction methods which address the repetition and semantics problems in his presentation
  • Terrence will explain Lexical Chains in his presentation

57
References
  • H.P. Luhn, "The Automatic Creation of Literature Abstracts", 1958
  • H.P. Edmundson, "New Methods in Automatic Extracting", 1969
  • J. Kupiec, J. Pedersen and F. Chen, "A Trainable Document Summarizer", 1995
  • S. Teufel and M. Moens, "Sentence Extraction as a Classification Task", 1997
  • G.J. Rath et al., "The Formation of Abstracts by the Selection of Sentences", 1961
  • C.D. Paice, "Constructing Literature Abstracts by Computer: Techniques and Prospects", 1990
  • Goldstein et al., "Summarizing Text Documents: Sentence Selection and Evaluation Metrics", 1999

58
Any questions?