MEAD 3.09: A platform for multidocument multilingual text summarization

1
MEAD 3.09: A platform for multidocument
multilingual text summarization
  • University of Michigan, Smith College, Columbia
    University, University of Pennsylvania, Johns
    Hopkins University, Chinese University of Hong
    Kong, University of Alabama, University of
    Sheffield, University of Cambridge
  • JHU Summer School 2004 - Baltimore

2
Text summarization
  • Identifying the most important information from
    a document or set of documents.
  • Extractive/abstractive
  • Single-document/multi-document
  • Informative/Indicative

3
MEAD
  • Multi-document, multilingual, extractive
    summarization platform
  • Open-source (Perl, Java), well-documented API
    and utilities
  • v. 1.0-2.0 (Michigan 2000), v. 3.0 (JHU 2001)
  • Latest release is v. 3.09 (Michigan 2001-2004)

4
Four stages
  • Preprocessing and clustering
      • CIDR, XML representation
  • Feature extraction
      • Default and custom
  • Score extraction
      • Feature combination
  • Sentence reranking
      • Cross-sentence relationships: repetitions,
        chronology, source preferences

5
Sample .config file
<MEAD-CONFIG TARGET='GA3' LANG='ENG'
    CLUSTER-PATH='/clair4/mead/data/GA3'
    DATA-DIRECTORY='/clair4/mead/data/GA3/docsent'>
  <FEATURE-SET BASE-DIRECTORY='/clair4/mead/data/GA3/feature/'>
    <FEATURE NAME='Centroid'
        SCRIPT='/clair4/mead/bin/feature-scripts/Centroid.pl HK-WORD-enidf ENG'/>
    <FEATURE NAME='Position'
        SCRIPT='/clair4/mead/bin/feature-scripts/Position.pl'/>
    <FEATURE NAME='Length'
        SCRIPT='/clair4/mead/bin/feature-scripts/Length.pl'/>
  </FEATURE-SET>
  <CLASSIFIER
      COMMAND-LINE='/clair4/mead/bin/default-classifier.pl \
        Centroid 1 Position 1 Length 9'
      SYSTEM='MEADORIG' RUN='10/09'/>
  <RERANKER
      COMMAND-LINE='/clair4/mead/bin/default-reranker.pl MEAD-cosine 0.7'/>
  <COMPRESSION BASIS='sentences' PERCENT='20'/>
</MEAD-CONFIG>
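
As a rough illustration of how such a configuration could be read by a script, the sketch below parses a copy of the sample with Python's standard `xml.etree.ElementTree`. The function name `read_config` and the returned dictionary layout are assumptions made for this example, not part of MEAD; the line-continuation backslash in the classifier command line is dropped, and the arguments after the script path are assumed to be name/value pairs.

```python
import xml.etree.ElementTree as ET

# Copy of the sample configuration (continuation backslash removed).
CONFIG = """
<MEAD-CONFIG TARGET='GA3' LANG='ENG'
             CLUSTER-PATH='/clair4/mead/data/GA3'
             DATA-DIRECTORY='/clair4/mead/data/GA3/docsent'>
  <FEATURE-SET BASE-DIRECTORY='/clair4/mead/data/GA3/feature/'>
    <FEATURE NAME='Centroid' SCRIPT='/clair4/mead/bin/feature-scripts/Centroid.pl HK-WORD-enidf ENG'/>
    <FEATURE NAME='Position' SCRIPT='/clair4/mead/bin/feature-scripts/Position.pl'/>
    <FEATURE NAME='Length' SCRIPT='/clair4/mead/bin/feature-scripts/Length.pl'/>
  </FEATURE-SET>
  <CLASSIFIER COMMAND-LINE='/clair4/mead/bin/default-classifier.pl Centroid 1 Position 1 Length 9'
              SYSTEM='MEADORIG' RUN='10/09'/>
  <RERANKER COMMAND-LINE='/clair4/mead/bin/default-reranker.pl MEAD-cosine 0.7'/>
  <COMPRESSION BASIS='sentences' PERCENT='20'/>
</MEAD-CONFIG>
"""


def read_config(xml_text):
    """Hypothetical helper: pull the main settings out of a MEAD-CONFIG element."""
    root = ET.fromstring(xml_text.strip())
    # Feature name -> command used to compute it.
    features = {f.get("NAME"): f.get("SCRIPT") for f in root.iter("FEATURE")}
    # Assumption: arguments after the script path alternate name, value.
    args = root.find("CLASSIFIER").get("COMMAND-LINE").split()
    weights = dict(zip(args[1::2], (float(v) for v in args[2::2])))
    compression = root.find("COMPRESSION")
    return {
        "target": root.get("TARGET"),
        "features": features,
        "classifier_args": weights,
        "compression": (compression.get("BASIS"), int(compression.get("PERCENT"))),
    }


cfg = read_config(CONFIG)
```

Here `cfg["classifier_args"]` holds the three name/value pairs from the classifier command line and `cfg["compression"]` is the pair `("sentences", 20)`.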
6
Sample .sentfeature file
<SENT-FEATURE>
  <S DID="87" SNO="1">
    <FEATURE N="Centroid" V="0.2749"/>
  </S>
  <S DID="87" SNO="2">
    <FEATURE N="Centroid" V="0.8288"/>
  </S>
  <S DID="81" SNO="1">
    <FEATURE N="Centroid" V="0.1538"/>
  </S>
  <S DID="81" SNO="2">
    <FEATURE N="Centroid" V="1.0000"/>
  </S>
  <S DID="41" SNO="1">
    <FEATURE N="Centroid" V="0.1539"/>
  </S>
  <S DID="41" SNO="2">
    <FEATURE N="Centroid" V="0.9820"/>
  </S>
</SENT-FEATURE>
7
Sample .extract file
<!DOCTYPE EXTRACT SYSTEM '/clair/tools/mead/dtd/extract.dtd'>
<EXTRACT QID='GA3' LANG='ENG' COMPRESSION='7'
    SYSTEM='MEADORIG' RUN='Sun Oct 13 11:01:19 2002'>
  <S ORDER='1' DID='41' SNO='2'/>
  <S ORDER='2' DID='41' SNO='3'/>
  <S ORDER='3' DID='41' SNO='11'/>
  <S ORDER='4' DID='81' SNO='3'/>
  <S ORDER='5' DID='81' SNO='7'/>
  <S ORDER='6' DID='87' SNO='2'/>
  <S ORDER='7' DID='87' SNO='3'/>
</EXTRACT>
8
Sample .sentjudge file
<SENT-JUDGE QID='551'>
  <S DID='D-19980731_003.e' PAR='1' RSNT='1' SNO='1'>
    <JUDGE N='smith' UTIL='10'/>
    <JUDGE N='huang' UTIL='10'/>
    <JUDGE N='moorthy' UTIL='6'/>
  </S>
  <S DID='D-19980731_003.e' PAR='2' RSNT='1' SNO='2'>
    <JUDGE N='smith' UTIL='6'/>
    <JUDGE N='huang' UTIL='10'/>
    <JUDGE N='moorthy' UTIL='10'/>
  </S>
  <S DID='D-19980731_003.e' PAR='3' RSNT='1' SNO='3'>
    <JUDGE N='smith' UTIL='6'/>
    <JUDGE N='huang' UTIL='9'/>
    <JUDGE N='moorthy' UTIL='10'/>
  </S>
  <S DID='D-19981105_011.e' PAR='5' RSNT='2' SNO='7'>
    <JUDGE N='smith' UTIL='2'/>
    <JUDGE N='huang' UTIL='1'/>
    <JUDGE N='moorthy' UTIL='4'/>
  </S>
</SENT-JUDGE>
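
To show how the per-judge utilities in a sentjudge file might be used, the sketch below sums them per sentence and computes a simplified relative-utility ratio: the total utility of the sentences an extract selects, divided by the best total achievable with the same number of sentences. The function names are hypothetical, and the ratio is only the basic form of the idea; the published metric includes further normalization that is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Copy of the sample sentjudge data shown on the slide.
SENTJUDGE = """
<SENT-JUDGE QID='551'>
  <S DID='D-19980731_003.e' PAR='1' RSNT='1' SNO='1'>
    <JUDGE N='smith' UTIL='10'/><JUDGE N='huang' UTIL='10'/><JUDGE N='moorthy' UTIL='6'/>
  </S>
  <S DID='D-19980731_003.e' PAR='2' RSNT='1' SNO='2'>
    <JUDGE N='smith' UTIL='6'/><JUDGE N='huang' UTIL='10'/><JUDGE N='moorthy' UTIL='10'/>
  </S>
  <S DID='D-19980731_003.e' PAR='3' RSNT='1' SNO='3'>
    <JUDGE N='smith' UTIL='6'/><JUDGE N='huang' UTIL='9'/><JUDGE N='moorthy' UTIL='10'/>
  </S>
  <S DID='D-19981105_011.e' PAR='5' RSNT='2' SNO='7'>
    <JUDGE N='smith' UTIL='2'/><JUDGE N='huang' UTIL='1'/><JUDGE N='moorthy' UTIL='4'/>
  </S>
</SENT-JUDGE>
"""


def total_utilities(xml_text):
    """Map (DID, SNO) to the sum of all judges' utility scores."""
    root = ET.fromstring(xml_text.strip())
    return {
        (s.get("DID"), int(s.get("SNO"))): sum(int(j.get("UTIL")) for j in s.iter("JUDGE"))
        for s in root.iter("S")
    }


def relative_utility(extract, utilities):
    """Simplified ratio: utility achieved / best utility for an extract of equal size."""
    achieved = sum(utilities[key] for key in extract)
    best = sum(sorted(utilities.values(), reverse=True)[: len(extract)])
    return achieved / best if best else 0.0


utilities = total_utilities(SENTJUDGE)
extract = [("D-19980731_003.e", 3), ("D-19981105_011.e", 7)]
score = relative_utility(extract, utilities)  # (25 + 7) / (26 + 26)
```

Because the judges assign graded scores, an extract that picks a sentence rated 25 instead of one rated 26 is penalized only slightly, unlike a binary precision/recall comparison.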
9
Sample .query
<!DOCTYPE QUERY SYSTEM "/clair4/mead/dtd/query.dtd">
<QUERY QID="Q-551-E" QNO="551" TRANSLATED="NO">
  <TITLE>
    Natural disaster victims aided
  </TITLE>
  <DESCRIPTION>
    The description is usually a few sentences
    describing the cluster.
  </DESCRIPTION>
  <NARRATIVE>
    The narrative often describes exactly what the
    user is looking for in the summary.
  </NARRATIVE>
</QUERY>
10
Preprocessing and clustering
  • Docjudge: relevance judgements
  • Docsent: document representation
  • Extract: sentences to be extracted
  • Mead-config: configuration parameters
  • Query: similar to TREC
  • Sentfeature: feature values
  • Sentjudge: importance annotations
  • Sentrel: cross-sentence relations

11
Features
  • Centroid: cosine overlap with the centroid vector
    of the cluster
  • SimWithFirst: cosine overlap with the first
    sentence in the document (or with the title, if
    it exists)
  • Length: 1 if the length of the sentence is above
    a given threshold and 0 otherwise
  • RealLength: the length of the sentence in words
  • Position: the position of the sentence in the
    document
  • QueryOverlap: cosine overlap with a query
    sentence or phrase
  • KeywordMatch: full match from a list of keywords
  • CosineCentrality: eigenvector centrality of the
    sentence on the lexical connectivity matrix with
    a defined threshold
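
A few of these features can be sketched in a handful of lines. The example below is illustrative only and makes several simplifying assumptions that differ from MEAD itself: it uses raw term counts instead of IDF-weighted vectors, builds the centroid from a single tokenized document rather than a cluster, and reports Position as a plain 1-based index. The names `cosine` and `sentence_features` are made up for this example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def sentence_features(sentences, length_threshold=9):
    """Return one feature dict per sentence; each sentence is a list of tokens."""
    bags = [Counter(tokens) for tokens in sentences]
    centroid = Counter()
    for bag in bags:
        centroid.update(bag)  # assumption: centroid = summed raw counts
    features = []
    for index, (tokens, bag) in enumerate(zip(sentences, bags)):
        features.append({
            "Centroid": cosine(bag, centroid),
            "SimWithFirst": cosine(bag, bags[0]),
            "Length": 1 if len(tokens) > length_threshold else 0,
            "RealLength": len(tokens),
            "Position": index + 1,  # assumption: plain 1-based index
        })
    return features


doc = [
    ["iraq", "refuses", "to", "cooperate"],
    ["iraq", "refuses"],
    ["blair", "spoke"],
]
feats = sentence_features(doc, length_threshold=3)
```

Each dictionary plays the role of one `<S>` entry in a sentfeature file, with one value per feature name.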

12
Centrality in summarization
  • Motivation: capture the most central words in a
    document or cluster
  • Centroid score [Radev et al. 2000, 2004a]
  • Alternative methods for computing centrality?

13
Social networks
  • Induced by a relation r
  • Prestige (centrality) in social networks
  • Degree centrality: number of friends
  • Geodesic centrality: bridge quality
  • Eigenvector centrality: who your friends are

14
Eigenvectors of stochastic graphs
  • Square connectivity matrix
  • Directed vs. undirected
  • An eigenvalue of a square matrix A is a scalar λ
    such that there exists a vector x ≠ 0 with
    Ax = λx
  • The normalized eigenvector associated with the
    largest λ is called the principal eigenvector of
    A
  • A matrix is called a stochastic matrix when the
    entries in each row sum to 1 and none is
    negative. All stochastic matrices have a
    principal eigenvector
  • The connectivity matrix used in PageRank [Page
    et al. 1998] is irreducible [Langville & Meyer
    2003]
  • An iterative method (power method) can be used to
    compute the principal eigenvector
  • That eigenvector corresponds to the stationary
    value of the Markov stochastic process described
    by the connectivity matrix
  • This is also equivalent to performing a random
    walk on the graph described by the matrix
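
The power method mentioned above can be sketched as follows. This is a minimal illustration, assuming a small row-stochastic matrix chosen for the example; it iterates p ← pA, so the vector it returns is the left principal eigenvector of A (equivalently, the ordinary eigenvector of the transpose), which is the stationary distribution of the corresponding Markov chain.

```python
def principal_eigenvector(matrix, tol=1e-10, max_iter=1000):
    """Power iteration p <- pA for a row-stochastic matrix given as nested lists."""
    n = len(matrix)
    p = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        nxt = [sum(p[i] * matrix[i][j] for i in range(n)) for j in range(n)]
        if max(abs(a - b) for a, b in zip(nxt, p)) < tol:
            return nxt
        p = nxt
    return p


# Illustrative 3-state chain; every row sums to 1.
A = [
    [0.50, 0.50, 0.00],
    [0.25, 0.50, 0.25],
    [0.00, 0.50, 0.50],
]
stationary = principal_eigenvector(A)  # converges to (0.25, 0.5, 0.25)
```

For this chain the result can be checked by hand: multiplying (0.25, 0.5, 0.25) by A gives the same vector back.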

15
Eigenvectors of stochastic graphs
  • The stationary value of the Markov stochastic
    matrix can be computed using an iterative power
    method
  • PageRank adds an extra twist to deal with
    dead-end pages. With a probability 1-?, a random
    starting point is chosen. This has a natural
    interpretation in the case of Web page ranking

(Formula legend: su = successor nodes, pr = predecessor nodes)
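
The random-restart idea can be sketched as a small variation on power iteration. The code below is an illustration rather than the formula from the slide: the parameter name `jump` (the probability of restarting at a random node), its value 0.15, and the choice to spread the mass of a dead-end node uniformly over all nodes are assumptions made for this example.

```python
def pagerank(successors, jump=0.15, tol=1e-10, max_iter=1000):
    """Iterate rank updates on a graph given as {node: [successor, ...]}.

    Every node must appear as a key, even if it has no successors.
    """
    nodes = sorted(successors)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Each node receives an equal share of the random-restart mass.
        nxt = {v: jump / n for v in nodes}
        for u in nodes:
            # Assumption: a dead-end node passes its mass to all nodes equally.
            targets = successors[u] or nodes
            share = (1.0 - jump) * rank[u] / len(targets)
            for v in targets:
                nxt[v] += share
        if max(abs(nxt[v] - rank[v]) for v in nodes) < tol:
            return nxt
        rank = nxt
    return rank


# Tiny chain a -> b -> c, where c is a dead end.
ranks = pagerank({"a": ["b"], "b": ["c"], "c": []})
```

In this chain the dead-end node `c` ends up with the highest rank and `a` with the lowest, and the ranks still sum to 1 because no probability mass is lost at the dead end.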
16
LexPageRank (Cosine centrality)
Example (cluster d1003t)
1 (d1s1) Iraqi Vice President Taha Yassin Ramadan announced today, Sunday, that Iraq refuses to back down from its decision to stop cooperating with disarmament inspectors before its demands are met.
2 (d2s1) Iraqi Vice president Taha Yassin Ramadan announced today, Thursday, that Iraq rejects cooperating with the United Nations except on the issue of lifting the blockade imposed upon it since the year 1990.
3 (d2s2) Ramadan told reporters in Baghdad that "Iraq cannot deal positively with whoever represents the Security Council unless there was a clear stance on the issue of lifting the blockade off of it."
4 (d2s3) Baghdad had decided late last October to completely cease cooperating with the inspectors of the United Nations Special Commission (UNSCOM), in charge of disarming Iraq's weapons, and whose work became very limited since the fifth of August, and announced it will not resume its cooperation with the Commission even if it were subjected to a military operation.
5 (d3s1) The Russian Foreign Minister, Igor Ivanov, warned today, Wednesday against using force against Iraq, which will destroy, according to him, seven years of difficult diplomatic work and will complicate the regional situation in the area.
6 (d3s2) Ivanov contended that carrying out air strikes against Iraq, who refuses to cooperate with the United Nations inspectors, "will end the tremendous work achieved by the international group during the past seven years and will complicate the situation in the region."
7 (d3s3) Nevertheless, Ivanov stressed that Baghdad must resume working with the Special Commission in charge of disarming the Iraqi weapons of mass destruction (UNSCOM).
8 (d4s1) The Special Representative of the United Nations Secretary-General in Baghdad, Prakash Shah, announced today, Wednesday, after meeting with the Iraqi Deputy Prime Minister Tariq Aziz, that Iraq refuses to back down from its decision to cut off cooperation with the disarmament inspectors.
9 (d5s1) British Prime Minister Tony Blair said today, Sunday, that the crisis between the international community and Iraq "did not end" and that Britain is still "ready, prepared, and able to strike Iraq."
10 (d5s2) In a gathering with the press held at the Prime Minister's office, Blair contended that the crisis with Iraq "will not end until Iraq has absolutely and unconditionally respected its commitments" towards the United Nations.
11 (d5s3) A spokesman for Tony Blair had indicated that the British Prime Minister gave permission to British Air Force Tornado planes stationed in Kuwait to join the aerial bombardment against Iraq.
17
Cosine centrality
       1     2     3     4     5     6     7     8     9    10    11
 1  1.00  0.45  0.02  0.17  0.03  0.22  0.03  0.28  0.06  0.06  0.00
 2  0.45  1.00  0.16  0.27  0.03  0.19  0.03  0.21  0.03  0.15  0.00
 3  0.02  0.16  1.00  0.03  0.00  0.01  0.03  0.04  0.00  0.01  0.00
 4  0.17  0.27  0.03  1.00  0.01  0.16  0.28  0.17  0.00  0.09  0.01
 5  0.03  0.03  0.00  0.01  1.00  0.29  0.05  0.15  0.20  0.04  0.18
 6  0.22  0.19  0.01  0.16  0.29  1.00  0.05  0.29  0.04  0.20  0.03
 7  0.03  0.03  0.03  0.28  0.05  0.05  1.00  0.06  0.00  0.00  0.01
 8  0.28  0.21  0.04  0.17  0.15  0.29  0.06  1.00  0.25  0.20  0.17
 9  0.06  0.03  0.00  0.00  0.20  0.04  0.00  0.25  1.00  0.26  0.38
10  0.06  0.15  0.01  0.09  0.04  0.20  0.00  0.20  0.26  1.00  0.12
11  0.00  0.00  0.00  0.01  0.18  0.03  0.01  0.17  0.38  0.12  1.00
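
One way to read this matrix is to keep only the pairs whose cosine exceeds a threshold t and count how many neighbours each sentence retains. The sketch below does that for the values in the table; the strict comparison (`>`) and the names `SIM`, `LABELS`, and `degrees` are assumptions for this example. It computes degree counts only, not the eigenvector-based score.

```python
# Pairwise cosine values copied from the table (rows and columns 1..11).
SIM = [
    [1.00, 0.45, 0.02, 0.17, 0.03, 0.22, 0.03, 0.28, 0.06, 0.06, 0.00],
    [0.45, 1.00, 0.16, 0.27, 0.03, 0.19, 0.03, 0.21, 0.03, 0.15, 0.00],
    [0.02, 0.16, 1.00, 0.03, 0.00, 0.01, 0.03, 0.04, 0.00, 0.01, 0.00],
    [0.17, 0.27, 0.03, 1.00, 0.01, 0.16, 0.28, 0.17, 0.00, 0.09, 0.01],
    [0.03, 0.03, 0.00, 0.01, 1.00, 0.29, 0.05, 0.15, 0.20, 0.04, 0.18],
    [0.22, 0.19, 0.01, 0.16, 0.29, 1.00, 0.05, 0.29, 0.04, 0.20, 0.03],
    [0.03, 0.03, 0.03, 0.28, 0.05, 0.05, 1.00, 0.06, 0.00, 0.00, 0.01],
    [0.28, 0.21, 0.04, 0.17, 0.15, 0.29, 0.06, 1.00, 0.25, 0.20, 0.17],
    [0.06, 0.03, 0.00, 0.00, 0.20, 0.04, 0.00, 0.25, 1.00, 0.26, 0.38],
    [0.06, 0.15, 0.01, 0.09, 0.04, 0.20, 0.00, 0.20, 0.26, 1.00, 0.12],
    [0.00, 0.00, 0.00, 0.01, 0.18, 0.03, 0.01, 0.17, 0.38, 0.12, 1.00],
]
# Sentence identifiers in the order used by the example cluster.
LABELS = ["d1s1", "d2s1", "d2s2", "d2s3", "d3s1", "d3s2",
          "d3s3", "d4s1", "d5s1", "d5s2", "d5s3"]


def degrees(sim, threshold):
    """Number of other sentences whose cosine with sentence i exceeds the threshold."""
    n = len(sim)
    return [sum(1 for j in range(n) if j != i and sim[i][j] > threshold)
            for i in range(n)]


deg_01 = degrees(SIM, 0.1)  # [4, 6, 1, 5, 4, 6, 1, 8, 4, 5, 4]
deg_03 = degrees(SIM, 0.3)  # [1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1]
most_connected = LABELS[deg_01.index(max(deg_01))]  # "d4s1"
```

At t = 0.1 sentence 8 (d4s1) has the most neighbours, while at t = 0.3 only two pairs remain connected, which shows how strongly the threshold shapes the graph.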
18
Cosine centrality (t = 0.3)
[Figure: sentence similarity graph at threshold 0.3; nodes d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3]
19
Cosine centrality (t = 0.2)
[Figure: sentence similarity graph at threshold 0.2; nodes d1s1, d2s1, d2s2, d2s3, d3s1, d3s2, d3s3, d4s1, d5s1, d5s2, d5s3]
20
Cosine centrality (t = 0.1)
[Figure: sentence similarity graph at threshold 0.1; node d4s1 labeled]
Sentences vote for the most central sentence!
21
Cosine centrality vs. centroid centrality
22
Classifiers
  • Default: linear combination (possibly using
    thresholds)
  • Lead-based: positional and chronological
  • Random
  • Decision-tree: trainable
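
A linear-combination classifier with a threshold can be sketched in a few lines. The version below is an assumption-laden illustration, not MEAD's default classifier: it treats the length setting as a cutoff that zeroes out short sentences and sums the remaining weighted features. The function name and the feature values are made up for the example.

```python
def default_score(features, weights, length_cutoff):
    """Weighted sum of feature values; sentences below the length cutoff score 0."""
    # Assumption: the length parameter acts as a hard threshold, not a weight.
    if features["RealLength"] < length_cutoff:
        return 0.0
    return sum(weight * features[name] for name, weight in weights.items())


weights = {"Centroid": 1.0, "Position": 1.0}
long_sentence = {"Centroid": 0.8288, "Position": 0.5, "RealLength": 12}
short_sentence = {"Centroid": 0.9820, "Position": 1.0, "RealLength": 4}

long_score = default_score(long_sentence, weights, length_cutoff=9)    # 0.8288 + 0.5
short_score = default_score(short_sentence, weights, length_cutoff=9)  # 0.0
```

The short sentence is discarded despite its high centroid value, which is the effect a length threshold is meant to have on headlines and fragments.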

23
Rerankers
  • Identity: trivial
  • Default: remove sentences that are too similar
  • Time-based: use chronology
  • Source-based: source preference
  • Novelty
  • CST-based: cross-document structure theory [Radev
    2000, Zhang et al. 2002, Zhang & Radev 2004]
  • MMR: maximal marginal relevance [Carbonell &
    Goldstein 1998]
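
The idea of removing sentences that are too similar can be sketched as a greedy pass over the scored sentences. This is an illustration under stated assumptions, not MEAD's reranker: similarity is plain cosine over raw term counts, the 0.7 threshold is borrowed from the sample configuration, and the names `cosine` and `rerank` are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rerank(scored_sentences, threshold=0.7, limit=None):
    """Take sentences by descending score, skipping any too similar to one already kept."""
    chosen = []
    for score, tokens in sorted(scored_sentences, key=lambda pair: -pair[0]):
        bag = Counter(tokens)
        if all(cosine(bag, Counter(kept)) <= threshold for _, kept in chosen):
            chosen.append((score, tokens))
        if limit is not None and len(chosen) == limit:
            break
    return chosen


candidates = [
    (0.9, ["iraq", "refuses", "to", "cooperate"]),
    (0.8, ["iraq", "refuses", "to", "cooperate", "today"]),  # near-duplicate of the first
    (0.5, ["blair", "warned", "iraq"]),
]
kept = rerank(candidates, threshold=0.7)
```

The second candidate is dropped because its cosine with the first is about 0.89, so the lower-scored but novel third sentence takes its place.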

24
Evaluation methods
  • Precision/recall/F-measure: baseline
  • Kappa: interjudge agreement and difficulty
  • Relative utility: non-binary judgements [Radev
    2000]
  • Relevance correlation: IR-based
  • Cosine: default or TFIDF
  • Longest-common subsequence [Saggion et al. 2002]
  • Word overlap
  • BLEU: n-gram precision [Papineni et al. 2002]
  • ROUGE: n-gram recall and LCS [Lin 2004]
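
The baseline measure is simple to state in code. The sketch below compares a system extract with a reference extract at the sentence level, treating each as a set of (DID, SNO) pairs; the function name and the sample reference are assumptions for illustration.

```python
def precision_recall_f(system, reference):
    """Sentence-level precision, recall, and F-measure between two extracts."""
    system, reference = set(system), set(reference)
    overlap = len(system & reference)
    precision = overlap / len(system) if system else 0.0
    recall = overlap / len(reference) if reference else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f_measure


# Extracts as sets of (document id, sentence number) pairs.
system = {("41", 2), ("41", 3), ("81", 3)}
reference = {("41", 2), ("81", 3), ("87", 2), ("87", 3)}
p, r, f = precision_recall_f(system, reference)  # 2/3, 1/2, 4/7
```

Because the comparison is binary, a system sentence that paraphrases a reference sentence earns no credit, which is one motivation for the graded and content-based measures in the rest of the list.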

25
Corpora
  • SummBank
      • 40 clusters in Chinese and English
      • 360 multidocument, human-written non-extractive
        summaries
      • 2 million single and multi-document extracts
        created manually and automatically
      • Prepared at JHU 2001 [Radev et al. 2003]
      • LDC release (2003)
  • CSTBank
      • Cross-document structure theory
      • Identity, fulfillment, paraphrase, subsumption

26
Recent applications
  • NewsInEssence (www.newsinessence.com)
  • DUC 2001-2004
  • WapMEAD
  • Java-MEAD interface
  • Chronological fact extraction
  • Novelty detection
  • Protein interaction extraction

30
More recent additions
  • MEAD addons: conversion from plain text, HTML,
    PDF, etc. to MEAD XML
  • Client/server
  • Summary to sentjudge conversion
  • Trainable version of MEAD using decision trees,
    maxent, and SVM

31
Successes
  • Large-scale effort (more than 20 people have
    participated in it)
  • Open architecture
  • Downloaded more than 1,000 times in the last 2
    years
  • Used in teaching
  • Novel models of centrality: centroid, degree,
    cosine centrality
  • Currently in five languages: English, Chinese,
    Korean, Spanish, Japanese
  • DUC (including several first-place rankings in
    2003, 2004)

32
Acknowledgments
  • NSF grants IIS-0082884 (Cross-document structure
    theory) and IIS-0329043 (Graph-based NLP)
  • Johns Hopkins University: Fred Jelinek
  • The Linguistic Data Consortium: Stephanie
    Strassel
  • Matthew Craig, Naomi Daniel, Günes Erkan,
    Amardeep Grewal, Anna Osepayshvili, Siwei Shen,
    Jin Yi
  • www.summarization.com/mead

33
Sample .meadrc file
compression_basis sentences
compression_absolute 1
classifier \
  /clair4/projects/mead307/source/mead/bin/default-classifier.pl \
  Centroid 3.0 Position 1.0 Length 15 SimWithFirst 2.0
reranker \
  /clair4/projects/mead307/source/mead/bin/default-reranker.pl \
  MEAD-cosine 0.9 enidf