Title: MEAD 3.09: A platform for multidocument multilingual text summarization
1MEAD 3.09: A platform for multidocument
multilingual text summarization
- University of Michigan, Smith College, Columbia
University, University of Pennsylvania, Johns
Hopkins University, Chinese University of Hong
Kong, University of Alabama, University of
Sheffield, University of Cambridge
- JHU Summer School 2004 - Baltimore
2Text summarization
- Identifying the most important information from
a document or set of documents.
- Extractive/abstractive
- Single-document/multi-document
- Informative/Indicative
3MEAD
- Multi-document, multilingual, extractive
summarization platform
- Open-source (Perl and Java), well-documented API
and utilities
- v. 1.0-2.0 (Michigan 2000), v. 3.0 (JHU 2001)
- Latest release is v. 3.09 (Michigan 2001-2004)
4Four stages
- Preprocessing and clustering
  - CIDR, XML representation
- Feature extraction
  - Default and custom
- Score extraction
  - Feature combination
- Sentence reranking
  - Cross-sentence relationships: repetitions,
    chronology, source preferences
5Sample .config file
<MEAD-CONFIG TARGET='GA3' LANG='ENG'
    CLUSTER-PATH='/clair4/mead/data/GA3'
    DATA-DIRECTORY='/clair4/mead/data/GA3/docsent'>
  <FEATURE-SET BASE-DIRECTORY='/clair4/mead/data/GA3/feature/'>
    <FEATURE NAME='Centroid'
        SCRIPT='/clair4/mead/bin/feature-scripts/Centroid.pl HK-WORD-enidf ENG'/>
    <FEATURE NAME='Position'
        SCRIPT='/clair4/mead/bin/feature-scripts/Position.pl'/>
    <FEATURE NAME='Length'
        SCRIPT='/clair4/mead/bin/feature-scripts/Length.pl'/>
  </FEATURE-SET>
  <CLASSIFIER
      COMMAND-LINE='/clair4/mead/bin/default-classifier.pl \
      Centroid 1 Position 1 Length 9'
      SYSTEM='MEADORIG' RUN='10/09'/>
  <RERANKER
      COMMAND-LINE='/clair4/mead/bin/default-reranker.pl MEAD-cosine 0.7'/>
  <COMPRESSION BASIS='sentences' PERCENT='20'/>
</MEAD-CONFIG>
6Sample .sentfeature file
<SENT-FEATURE>
  <S DID="87" SNO="1">
    <FEATURE N="Centroid" V="0.2749"/>
  </S>
  <S DID="87" SNO="2">
    <FEATURE N="Centroid" V="0.8288"/>
  </S>
  <S DID="81" SNO="1">
    <FEATURE N="Centroid" V="0.1538"/>
  </S>
  <S DID="81" SNO="2">
    <FEATURE N="Centroid" V="1.0000"/>
  </S>
  <S DID="41" SNO="1">
    <FEATURE N="Centroid" V="0.1539"/>
  </S>
  <S DID="41" SNO="2">
    <FEATURE N="Centroid" V="0.9820"/>
  </S>
</SENT-FEATURE>
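A file like this can be read with any XML parser. The following is a minimal Python sketch, not part of MEAD itself (which is written in Perl). The helper name `read_sentfeature` is hypothetical, and the embedded sample assumes the tag and attribute layout of the sample above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample; layout assumed from the slide.
SAMPLE = """<SENT-FEATURE>
  <S DID="87" SNO="1"><FEATURE N="Centroid" V="0.2749"/></S>
  <S DID="87" SNO="2"><FEATURE N="Centroid" V="0.8288"/></S>
  <S DID="81" SNO="1"><FEATURE N="Centroid" V="0.1538"/></S>
</SENT-FEATURE>"""


def read_sentfeature(xml_text):
    """Map (DID, SNO) -> {feature name: value}. Hypothetical helper."""
    features = {}
    for s in ET.fromstring(xml_text).iter("S"):
        key = (s.get("DID"), int(s.get("SNO")))
        features[key] = {
            f.get("N"): float(f.get("V")) for f in s.iter("FEATURE")
        }
    return features


features = read_sentfeature(SAMPLE)
print(features[("87", 2)]["Centroid"])  # 0.8288
```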
7Sample .extract file
<!DOCTYPE EXTRACT SYSTEM '/clair/tools/mead/dtd/extract.dtd'>
<EXTRACT QID='GA3' LANG='ENG' COMPRESSION='7'
    SYSTEM='MEADORIG' RUN='Sun Oct 13 11:01:19 2002'>
  <S ORDER='1' DID='41' SNO='2' />
  <S ORDER='2' DID='41' SNO='3' />
  <S ORDER='3' DID='41' SNO='11' />
  <S ORDER='4' DID='81' SNO='3' />
  <S ORDER='5' DID='81' SNO='7' />
  <S ORDER='6' DID='87' SNO='2' />
  <S ORDER='7' DID='87' SNO='3' />
</EXTRACT>
8Sample .sentjudge file
<SENT-JUDGE QID='551'>
  <S DID='D-19980731_003.e' PAR='1' RSNT='1' SNO='1'>
    <JUDGE N='smith' UTIL='10'/>
    <JUDGE N='huang' UTIL='10'/>
    <JUDGE N='moorthy' UTIL='6'/>
  </S>
  <S DID='D-19980731_003.e' PAR='2' RSNT='1' SNO='2'>
    <JUDGE N='smith' UTIL='6'/>
    <JUDGE N='huang' UTIL='10'/>
    <JUDGE N='moorthy' UTIL='10'/>
  </S>
  <S DID='D-19980731_003.e' PAR='3' RSNT='1' SNO='3'>
    <JUDGE N='smith' UTIL='6'/>
    <JUDGE N='huang' UTIL='9'/>
    <JUDGE N='moorthy' UTIL='10'/>
  </S>
  <S DID='D-19981105_011.e' PAR='5' RSNT='2' SNO='7'>
    <JUDGE N='smith' UTIL='2'/>
    <JUDGE N='huang' UTIL='1'/>
    <JUDGE N='moorthy' UTIL='4'/>
  </S>
</SENT-JUDGE>
9Sample .query
<!DOCTYPE QUERY SYSTEM "/clair4/mead/dtd/query.dtd">
<QUERY QID="Q-551-E" QNO="551" TRANSLATED="NO">
  <TITLE>
    Natural disaster victims aided
  </TITLE>
  <DESCRIPTION>
    The description is usually a few sentences
    describing the cluster.
  </DESCRIPTION>
  <NARRATIVE>
    The narrative often describes exactly what the
    user is looking for in the summary.
  </NARRATIVE>
</QUERY>
10Preprocessing and clustering
- Docjudge: relevance judgements
- Docsent: document representation
- Extract: sentences to be extracted
- Mead-config: configuration parameters
- Query: similar to TREC
- Sentfeature: feature values
- Sentjudge: importance annotations
- Sentrel: cross-sentence relations
11Features
- Centroid: cosine overlap with the centroid vector
of the cluster
- SimWithFirst: cosine overlap with the first
sentence in the document (or with the title, if
it exists)
- Length: 1 if the length of the sentence is above
a given threshold and 0 otherwise
- RealLength: the length of the sentence in words
- Position: the position of the sentence in the
document
- QueryOverlap: cosine overlap with a query
sentence or phrase
- KeywordMatch: full match from a list of keywords
- CosineCentrality: eigenvector centrality of the
sentence on the lexical connectivity matrix with
a defined threshold
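Several of these features reduce to a cosine between two bag-of-words vectors or to a simple length test. The sketch below is an illustration in Python rather than MEAD's own Perl feature scripts. It uses raw term counts, whereas the sample configuration passes an IDF resource (`enidf`) to the Centroid script, so real scores would differ. The function names are hypothetical.

```python
import math
from collections import Counter


def cosine_overlap(text_a, text_b):
    """Cosine of two raw term-count vectors (assumption: no IDF weights)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def real_length(sentence):
    """RealLength: number of whitespace-separated words."""
    return len(sentence.split())


def length_feature(sentence, threshold):
    """Length: 1 if the sentence is longer than the threshold, else 0."""
    return 1 if real_length(sentence) > threshold else 0
```

SimWithFirst and QueryOverlap would then call `cosine_overlap` with the first sentence or the query text as the second argument.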
12Centrality in summarization
- Motivation: capture the most central words in a
document or cluster
- Centroid score (Radev et al. 2000, 2004a)
- Alternative methods for computing centrality?
13Social networks
- Induced by a relation r
- Prestige (centrality) in social networks
- Degree centrality: number of friends
- Geodesic centrality: bridge quality
- Eigenvector centrality: who your friends are
14Eigenvectors of stochastic graphs
- Square connectivity matrix
- Directed vs. undirected
- An eigenvalue of a square matrix $A$ is a scalar
$\lambda$ such that there exists a vector
$x \neq 0$ with $A x = \lambda x$
- The normalized eigenvector associated with the
largest $\lambda$ is called the principal
eigenvector of $A$
- A matrix is called a stochastic matrix when the
entries in each row sum to 1 and none is
negative. All stochastic matrices have a
principal eigenvector
- The connectivity matrix used in PageRank (Page
et al. 1998) is irreducible (Langville & Meyer 2003)
- An iterative method (power method) can be used to
compute the principal eigenvector
- That eigenvector corresponds to the stationary
value of the Markov stochastic process described
by the connectivity matrix
- This is also equivalent to performing a random
walk on the matrix
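The power method mentioned above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not code from MEAD: the matrix is row-stochastic, the stationary vector is read as the left principal eigenvector (p P = p), and the three-state example chain, the tolerance, and the function name are made up for the demonstration.

```python
def stationary_distribution(P, tol=1e-12, max_iter=10_000):
    """Power method: iterate p <- p P until the L1 change is below tol."""
    n = len(P)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(p[i] * P[i][j] for i in range(n)) for j in range(n)]
        if sum(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical 3-state chain; every row sums to 1.
P = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
pi = stationary_distribution(P)
print([round(x, 4) for x in pi])  # [0.25, 0.5, 0.25]
```

Each iteration is one step of the random walk applied to the whole probability distribution at once, which is why the two views coincide.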
15Eigenvectors of stochastic graphs
- The stationary value of the Markov stochastic
matrix can be computed using an iterative power
method
- PageRank adds an extra twist to deal with
dead-end pages. With a probability 1-?, a random
starting point is chosen. This has a natural
interpretation in the case of Web page ranking
(su: successor nodes; pr: predecessor nodes)
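A random walk with a random restart can be sketched as follows. This is a generic PageRank-style iteration in Python, not the exact formula from the slide: the parameter name `jump`, its value 0.15, the uniform treatment of dead-end nodes, and the toy graph are all assumptions made for illustration. The dictionary `su` maps each node to its successor nodes.

```python
def pagerank(su, jump=0.15, tol=1e-12, max_iter=10_000):
    """Random walk with restart over a successor map (assumed parameters)."""
    nodes = sorted(su)
    n = len(nodes)
    p = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Probability mass on dead-end nodes is spread over all nodes.
        dead_mass = sum(p[u] for u in nodes if not su[u])
        nxt = {v: jump / n + (1 - jump) * dead_mass / n for v in nodes}
        for u in nodes:
            for v in su[u]:
                # Each node splits its mass evenly among its successors.
                nxt[v] += (1 - jump) * p[u] / len(su[u])
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical graph; "e" is a dead end with no successors.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"], "e": []}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # c
```

Node `c` collects links from three predecessors, so it ends up with the highest score even though `a` receives all of `c`'s outgoing mass.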
16LexPageRank (Cosine centrality)
Example (cluster d1003t)
1 (d1s1) Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met.
2 (d2s1) Iraqi Vice president Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990.
3 (d2s2) Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it."
4 (d2s3) Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation.
5 (d3s1) The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area.
6 (d3s2) Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, "will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region."
7 (d3s3) Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM).
8 (d4s1) The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors.
9 (d5s1) British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq "did not end" and that Britain is still "ready, prepared, and able to strike Iraq."
10 (d5s2) In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq "will not end until Iraq has absolutely and unconditionally respected its commitments" towards the United Nations.
11 (d5s3) A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.
17Cosine centrality
       1     2     3     4     5     6     7     8     9    10    11
 1  1.00  0.45  0.02  0.17  0.03  0.22  0.03  0.28  0.06  0.06  0.00
 2  0.45  1.00  0.16  0.27  0.03  0.19  0.03  0.21  0.03  0.15  0.00
 3  0.02  0.16  1.00  0.03  0.00  0.01  0.03  0.04  0.00  0.01  0.00
 4  0.17  0.27  0.03  1.00  0.01  0.16  0.28  0.17  0.00  0.09  0.01
 5  0.03  0.03  0.00  0.01  1.00  0.29  0.05  0.15  0.20  0.04  0.18
 6  0.22  0.19  0.01  0.16  0.29  1.00  0.05  0.29  0.04  0.20  0.03
 7  0.03  0.03  0.03  0.28  0.05  0.05  1.00  0.06  0.00  0.00  0.01
 8  0.28  0.21  0.04  0.17  0.15  0.29  0.06  1.00  0.25  0.20  0.17
 9  0.06  0.03  0.00  0.00  0.20  0.04  0.00  0.25  1.00  0.26  0.38
10  0.06  0.15  0.01  0.09  0.04  0.20  0.00  0.20  0.26  1.00  0.12
11  0.00  0.00  0.00  0.01  0.18  0.03  0.01  0.17  0.38  0.12  1.00
18Cosine centrality (t=0.3)
[Figure: similarity graph over the 11 sentences (d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3) at threshold t = 0.3]
19Cosine centrality (t=0.2)
[Figure: similarity graph over the 11 sentences (d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3) at threshold t = 0.2]
20Cosine centrality (t=0.1)
[Figure: similarity graph at threshold t = 0.1, with d4s1 labeled]
Sentences vote for the most central sentence!
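The voting idea can be checked directly on the cosine matrix from the cluster d1003t example. The Python sketch below counts, for each sentence, how many other sentences have cosine similarity above a threshold. The strict `>` comparison and the exclusion of self-similarity are choices made for this illustration, and the function name is hypothetical.

```python
SENTENCE_IDS = ["d1s1", "d2s1", "d2s2", "d2s3", "d3s1", "d3s2",
                "d3s3", "d4s1", "d5s1", "d5s2", "d5s3"]

# Pairwise cosine matrix copied from the slide (rows and columns 1..11).
COSINE = [
    [1.00, 0.45, 0.02, 0.17, 0.03, 0.22, 0.03, 0.28, 0.06, 0.06, 0.00],
    [0.45, 1.00, 0.16, 0.27, 0.03, 0.19, 0.03, 0.21, 0.03, 0.15, 0.00],
    [0.02, 0.16, 1.00, 0.03, 0.00, 0.01, 0.03, 0.04, 0.00, 0.01, 0.00],
    [0.17, 0.27, 0.03, 1.00, 0.01, 0.16, 0.28, 0.17, 0.00, 0.09, 0.01],
    [0.03, 0.03, 0.00, 0.01, 1.00, 0.29, 0.05, 0.15, 0.20, 0.04, 0.18],
    [0.22, 0.19, 0.01, 0.16, 0.29, 1.00, 0.05, 0.29, 0.04, 0.20, 0.03],
    [0.03, 0.03, 0.03, 0.28, 0.05, 0.05, 1.00, 0.06, 0.00, 0.00, 0.01],
    [0.28, 0.21, 0.04, 0.17, 0.15, 0.29, 0.06, 1.00, 0.25, 0.20, 0.17],
    [0.06, 0.03, 0.00, 0.00, 0.20, 0.04, 0.00, 0.25, 1.00, 0.26, 0.38],
    [0.06, 0.15, 0.01, 0.09, 0.04, 0.20, 0.00, 0.20, 0.26, 1.00, 0.12],
    [0.00, 0.00, 0.00, 0.01, 0.18, 0.03, 0.01, 0.17, 0.38, 0.12, 1.00],
]


def degree_centrality(sim, threshold):
    """Count neighbours with similarity strictly above the threshold."""
    n = len(sim)
    return [
        sum(1 for j in range(n) if j != i and sim[i][j] > threshold)
        for i in range(n)
    ]


degrees = degree_centrality(COSINE, 0.1)
most_central = SENTENCE_IDS[degrees.index(max(degrees))]
print(most_central, max(degrees))  # d4s1 8
```

At t = 0.3 only two edges survive (d1s1 with d2s1, and d5s1 with d5s3), which illustrates how sensitive the graph is to the chosen threshold.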
21Cosine centrality vs. centroid centrality
22Classifiers
- Default: linear combination (possibly using
thresholds)
- Lead-based: positional and chronological
- Random
- Decision-tree: trainable
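A linear combination of feature values can be sketched as below. This is a Python illustration only, not the behavior of `default-classifier.pl`: the rule that sentences shorter than a minimum length score zero is an assumed reading of "possibly using thresholds", and the feature values, sentence ids, and function names are hypothetical.

```python
def linear_score(features, weights):
    """Weighted sum of the named feature values."""
    return sum(w * features.get(name, 0.0) for name, w in weights.items())


def rank_sentences(sentence_features, weights, min_length=0):
    """Return sentence ids ordered by descending score."""
    scored = []
    for sid, feats in sentence_features.items():
        # Assumed threshold rule: too-short sentences get a score of 0.
        long_enough = feats.get("RealLength", 0) >= min_length
        score = linear_score(feats, weights) if long_enough else 0.0
        scored.append((score, sid))
    return [sid for score, sid in sorted(scored, key=lambda t: -t[0])]


# Hypothetical feature values for three sentences.
sentence_features = {
    "s1": {"Centroid": 0.2, "Position": 1.0, "RealLength": 12},
    "s2": {"Centroid": 0.9, "Position": 0.5, "RealLength": 20},
    "s3": {"Centroid": 1.0, "Position": 0.7, "RealLength": 4},
}
weights = {"Centroid": 1.0, "Position": 1.0}
ranking = rank_sentences(sentence_features, weights, min_length=9)
print(ranking)  # ['s2', 's1', 's3']
```

Sentence `s3` has the highest raw sum but falls to the bottom because it is shorter than the cutoff.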
23Rerankers
- Identity: trivial
- Default: remove sentences that are too similar
- Time-based: use chronology
- Source-based: source preference
- Novelty
- CST-based: cross-document structure theory (Radev
2000, Zhang et al. 2002, Zhang & Radev 2004)
- MMR: maximal marginal relevance (Carbonell &
Goldstein 1998)
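Removing sentences that are too similar can be sketched as a greedy pass over the ranked list. The Python below is an illustration, not the logic of `default-reranker.pl`: the cutoff 0.7 echoes the `MEAD-cosine 0.7` argument in the sample configuration, but treating it as a maximum allowed cosine is an assumption, and the cosine here uses raw term counts rather than IDF weights. The function names and sentences are made up.

```python
import math
from collections import Counter


def cosine(text_a, text_b):
    """Cosine of two raw term-count vectors (assumption: no IDF weights)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_redundant(ranked_sentences, max_similarity=0.7):
    """Keep a sentence only if it is not too close to any kept one."""
    kept = []
    for sentence in ranked_sentences:
        if all(cosine(sentence, k) <= max_similarity for k in kept):
            kept.append(sentence)
    return kept


ranked = [
    "Iraq refuses to cooperate with inspectors",
    "Iraq refuses to cooperate with the inspectors",
    "Blair said Britain is ready",
]
print(drop_redundant(ranked))
```

The second sentence differs from the first by a single word, so it is dropped, while the third shares no terms with the first and is kept.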
24Evaluation methods
- Precision/recall/f-measure: baseline
- Kappa: interjudge agreement and difficulty
- Relative utility: non-binary judgements (Radev
2000)
- Relevance correlation: IR-based
- Cosine: default or TFIDF
- Longest-common subsequence (Saggion et al. 2002)
- Word overlap
- BLEU: n-gram precision (Papineni et al. 2002)
- ROUGE: n-gram recall and LCS (Lin 2004)
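The baseline measures can be computed by treating each extract as a set of (DID, SNO) pairs. The following Python sketch shows one common way to do this. It is not MEAD's evaluation code, and the reference extract is invented for the example.

```python
def precision_recall_f1(system, reference):
    """Sentence-level precision, recall, and F1 between two extracts."""
    system, reference = set(system), set(reference)
    hits = len(system & reference)
    precision = hits / len(system) if system else 0.0
    recall = hits / len(reference) if reference else 0.0
    if hits == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# (DID, SNO) pairs; the reference extract is hypothetical.
system_extract = [("41", 2), ("41", 3), ("81", 3), ("87", 2)]
reference_extract = [("41", 2), ("81", 3), ("81", 7)]
p, r, f = precision_recall_f1(system_extract, reference_extract)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.5 0.667 0.571
```

Because these measures are binary, a system gets no credit for choosing a sentence that is nearly as good as the reference one, which is the gap that relative utility is meant to address.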
25Corpora
- SummBank
  - 40 clusters in Chinese and English
  - 360 multidocument, human-written non-extractive
    summaries
  - 2 million single and multi-document extracts
    created manually and automatically
  - Prepared at JHU 2001 (Radev et al. 2003)
  - LDC release (2003)
- CSTBank
  - Cross-document structure theory
  - Identity, fulfillment, paraphrase, subsumption
26Recent applications
- NewsInEssence (www.newsinessence.com)
- DUC 2001-2004
- WapMEAD
- Java-MEAD interface
- Chronological fact extraction
- Novelty detection
- Protein interaction extraction
30More recent additions
- MEAD addons: conversion from plain text, HTML,
PDF, etc. to MEAD XML
- Client-server
- Summary to sentjudge conversion
- Trainable version of MEAD using decision trees,
maxent, and SVM
31Successes
- Large-scale effort (more than 20 people have
participated in it)
- Open architecture
- Downloaded more than 1,000 times in the last 2
years
- Used in teaching
- Novel models of centrality: centroid, degree,
cosine centrality
- Currently in five languages: English, Chinese,
Korean, Spanish, Japanese
- DUC (including several first-place rankings in
2003, 2004)
32Acknowledgments
- NSF grants IIS-0082884 (Cross-document structure
theory) and IIS-0329043 (Graph-based NLP)
- Johns Hopkins University: Fred Jelinek
- The Linguistic Data Consortium: Stephanie
Strassel
- Matthew Craig, Naomi Daniel, Günes Erkan,
Amardeep Grewal, Anna Osepayshvili, Siwei Shen,
Jin Yi
- www.summarization.com/mead
33Sample .meadrc file
compression_basis sentences
compression_absolute 1
classifier \
  /clair4/projects/mead307/source/mead/bin/default-classifier.pl \
  Centroid 3.0 Position 1.0 Length 15 SimWithFirst 2.0
reranker \
  /clair4/projects/mead307/source/mead/bin/default-reranker.pl \
  MEAD-cosine 0.9 enidf