Title: Open-Source Implementation of Document Structuring Algorithm for NLTK
1 Open-Source Implementation of Document Structuring Algorithm for NLTK
2 Natural Language Generation
- Generate coherent text outputs to express information
  - Express the right information
  - Express information in the right order
3 NLG Tasks
- Document Structuring: the most important and relevant information is selected from the knowledge base (Content Determination), then ordered and structured in such a way as to maximize coherence and informativeness (Text Planning)
- Micro-Planning: specifics of word selection, referring expressions, and the finalization of ordering are determined
- Realization: internal representations of the above decisions are realized in actual text output
4 Document Structuring
- Given a set of information to be expressed, determine the order and grouping of this information
- Texts cannot be simply a random bag of sentences
- Order of message presentation has a significant effect on meaning (Hovy 1993)
- One way:
  - 1 - Maria was diagnosed with cancer some months ago.
  - 2 - Maria and Zurab had a fight last night.
  - 3 - She was found dead this morning.
- Vs.
  - 1 - Maria was diagnosed with cancer some months ago.
  - 2 - She was found dead this morning.
  - 3 - Maria and Zurab had a fight last night.
5 Document Structuring
- Ordering also affects coherence
  - John was hungry. John went to the store. He bought some bread to make a sandwich.
  - John bought some bread to make a sandwich. He went to the store. John was hungry.
6 Discourse relations
- Relationship between messages or groups of messages
- Elaboration(m1, m2)
  - I love jazz music (m1). My favourite album is Oscar Peterson's Night Train (m2).
- Contrast(m1, m3)
  - I love jazz music (m1). However, my favourite album is The Beatles' White Album (m3).
  - Cue word: However
7 Rhetorical Structure Theory
- Mann and Thompson 1988
- A text is coherent by virtue of relationships that hold between messages in the text
- A small number of relations (25) can explain relationships between messages in a wide range of text
8 Project Proposal
- Implement these general algorithms for inclusion in NLTK
- Provide a sample Data Set and DR schema for testing and illustration
  - based on hypothetical WeatherExplainer from Reiter and Dale 2000
- Experiment utilizing these new tools as part of Abstractive Summarization System for Evaluative Statement Summarization (ASSESS)
9 Implementation 1: Schemas
- Top-Down Approach
- Output document structure is predictable and stereotyped
- Schemas are patterns of expansion, similar to CFG
- E.g.
  - CompareAndContrast → DescribeRelationship CompareProperties.
  - CompareProperties → CompareProperty CompareProperties.
  - CompareProperties → .
  - John is much bigger than Kate (DR). He is five inches taller (CP) and weighs almost twice as much (CP).
- Specify rules for choosing if multiple expansions exist
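The schema expansion above can be sketched in a few lines of Python. This is only an illustration, not the project's code: the `SCHEMAS` table, the `expand` function, and the choice rule (keep recursing while property messages remain, otherwise take the empty expansion) are all assumptions made for the example.

```python
# Hypothetical schema table mirroring the CompareAndContrast example.
# Each schema name maps to its alternative expansions; [] is the empty one.
SCHEMAS = {
    "CompareAndContrast": [["DescribeRelationship", "CompareProperties"]],
    "CompareProperties": [["CompareProperty", "CompareProperties"], []],
}


def expand(symbol, properties):
    """Expand a schema top-down into a flat list of (step, content) pairs.

    `properties` is consumed from the front as CompareProperty steps are emitted.
    """
    if symbol not in SCHEMAS:
        # Leaf step: a CompareProperty uses up the next property message.
        if symbol == "CompareProperty":
            return [(symbol, properties.pop(0))]
        return [(symbol, None)]
    # Assumed choice rule: take the empty expansion once no properties remain.
    if symbol == "CompareProperties" and not properties:
        expansion = SCHEMAS[symbol][1]
    else:
        expansion = SCHEMAS[symbol][0]
    steps = []
    for child in expansion:
        steps.extend(expand(child, properties))
    return steps


plan = expand("CompareAndContrast", ["height", "weight"])
```

With two property messages the result has one DescribeRelationship step followed by two CompareProperty steps, matching the John and Kate example.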
10 Top-Down Problems
- Hypothesis-Driven
- Content selection done on-line
- Not easily pipelined
- Therefore, Bottom-Up used
11 Implementation 2: Bottom-Up
- Output document structure is not predictable
- POOL = messages to be expressed
- while size(POOL) > 1:
  - find all pairs of elements in POOL which can be joined by a DR
  - assign a desirability score to each potential DR
  - find the pair Ei and Ej with the highest score and combine them into Ek
  - remove Ei and Ej from POOL, replace with Ek
- end while
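The loop above can be sketched as a small greedy planner. This is a minimal sketch under stated assumptions, not the NLTK submission itself: messages are plain dicts, each rule is a hypothetical function returning `(score, combined)` or `None`, and the names `greedy_structure`, `elaboration`, and `sequence` are invented for the example.

```python
from itertools import permutations


def greedy_structure(messages, rules):
    """Repeatedly join the best-scoring pair in the pool until one plan remains."""
    pool = list(messages)
    while len(pool) > 1:
        candidates = []
        # Try every ordered pair of pool elements against every rule.
        for i, j in permutations(range(len(pool)), 2):
            for rule in rules:
                result = rule(pool[i], pool[j])
                if result is not None:
                    score, combined = result
                    candidates.append((score, i, j, combined))
        if not candidates:
            break  # no relation applies, so the pool cannot shrink further
        # max() keeps the first candidate among equal scores.
        _, i, j, combined = max(candidates, key=lambda c: c[0])
        pool = [e for k, e in enumerate(pool) if k not in (i, j)] + [combined]
    return pool


def elaboration(m1, m2):
    """Hypothetical rule: monthly rainfall elaborated by total rainfall."""
    if (m1.get("msgType"), m2.get("msgType")) == ("MonthlyRainfallMsg", "TotalRainfallMsg"):
        return 3, {"relType": "Elaboration", "nucleus": m1, "aux": m2}
    return None


def sequence(m1, m2):
    """Hypothetical low-priority fallback that can join any two elements."""
    return 1, {"relType": "Sequence", "nucleus": m1, "aux": m2}


msgs = [
    {"msgType": "MonthlyTemperatureMsg"},
    {"msgType": "MonthlyRainfallMsg"},
    {"msgType": "TotalRainfallMsg"},
]
plan = greedy_structure(msgs, [elaboration, sequence])
```

The higher-scoring Elaboration is applied first, and the leftover temperature message is then joined to it by Sequence.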
12 Implementation
- Used nltk.featstruct for Messages and DocPlans
- A mapping from feature identifiers to feature values, where each feature value is either a basic value (such as a string or an integer), or a nested feature structure.

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''

msgType 'TotalRainfallMsg'
attribute
  direction ''
  magnitude
    number 4
    unit 'inches'
  type 'RelativeVariation'
period
  month 6
  year 1996
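As a rough illustration of the nested mapping described above, the same message can be written with plain Python dicts. The dicts are only a stand-in for nltk.featstruct, and the `get_path` helper is a hypothetical convenience for following dotted feature paths such as `attribute.magnitude.number`.

```python
# Plain nested dicts standing in for a feature structure (an assumption made
# for illustration; the slides use nltk.featstruct for this).
total_rainfall = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {
        "type": "RelativeVariation",
        "magnitude": {"unit": "inches", "number": 4},
        "direction": "",
    },
}


def get_path(fs, path):
    """Follow a dotted feature path, one nested structure per step."""
    value = fs
    for feature in path.split("."):
        value = value[feature]
    return value


inches = get_path(total_rainfall, "attribute.magnitude.number")
year = get_path(total_rainfall, "period.year")
```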
13 Implementation
- nltk.featstruct.FeatStruct
- unify(other)
- Unify fstruct1 with fstruct2, and return the resulting feature structure. This unified feature structure is the minimal feature structure that:
  - contains all feature value assignments from both fstruct1 and fstruct2.
  - preserves all reentrance properties of fstruct1 and fstruct2.
- If no such feature structure exists (because fstruct1 and fstruct2 specify incompatible values for some feature), then unification fails, and unify returns None.
14 Unification
TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4

unified with

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    direction ''

gives

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''
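The idea of unification can be sketched on plain nested dicts. This is a simplified stand-in written for illustration, not the nltk.featstruct implementation: it merges compatible features recursively and returns None on a clash, but it ignores reentrance entirely. The function name and the dict layout are assumptions.

```python
def unify(fs1, fs2):
    """Merge two nested dicts; return None if any feature values clash."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature not in result:
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            merged = unify(result[feature], value)
            if merged is None:
                return None  # a nested feature was incompatible
            result[feature] = merged
        elif result[feature] != value:
            return None  # two different basic values for the same feature
    return result


with_magnitude = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {
        "type": "RelativeVariation",
        "magnitude": {"unit": "inches", "number": 4},
    },
}
with_direction = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {"type": "RelativeVariation", "direction": ""},
}

unified = unify(with_magnitude, with_direction)
clash = unify({"period": {"month": 6}}, {"period": {"month": 7}})
```

Here `unified` carries both the magnitude and the direction, while `clash` is None because the two month values are incompatible.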
15 Implementation
- nltk.featstruct.FeatStruct
- subsumes(other)
- True if self subsumes other. I.e., return true if unifying self with other would result in a feature structure equal to other.
16 Subsumes
TotalRainfallMsg
  period
    year 1996
    month 06

subsumes

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4

does not subsume

TotalRainfallMsg
  period
    year 1996
    month 06
17 Using Subsumes
Select from messages all DocPlans with a relType of Contrast and a nucleus which is a message of msgType 'TotalRainfallMsg':
d = DocPlan(relType='Contrast', nucleus=Message('TotalRainfallMsg'))
return filter(lambda msg: d.subsumes(msg), messages)
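The same selection can be tried out without NLTK using nested dicts. This is a sketch under assumptions: the `subsumes` function below is a simplified stand-in (no reentrance) for the FeatStruct method, and the sample plans are invented for illustration.

```python
def subsumes(general, specific):
    """True if every feature of `general` appears, with a compatible value, in `specific`."""
    for feature, value in general.items():
        if feature not in specific:
            return False
        other = specific[feature]
        if isinstance(value, dict):
            if not isinstance(other, dict) or not subsumes(value, other):
                return False
        elif value != other:
            return False
    return True


# Partial structure describing what we are looking for.
pattern = {"relType": "Contrast", "nucleus": {"msgType": "TotalRainfallMsg"}}

# Hypothetical document plans to filter.
doc_plans = [
    {"relType": "Contrast",
     "nucleus": {"msgType": "TotalRainfallMsg", "period": {"year": 1996, "month": 6}},
     "aux": {"msgType": "MonthlyRainfallMsg"}},
    {"relType": "Elaboration",
     "nucleus": {"msgType": "TotalRainfallMsg"},
     "aux": {"msgType": "MonthlyRainfallMsg"}},
    {"relType": "Contrast",
     "nucleus": {"msgType": "MonthlyRainfallMsg"},
     "aux": {"msgType": "TotalRainfallMsg"}},
]

selected = [plan for plan in doc_plans if subsumes(pattern, plan)]
```

Only the first plan is kept: the second has the wrong relType and the third has the wrong nucleus.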
18 Implementation: Input Formats
TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''
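19 Input Formats
Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
  (M1.attribute.direction == M2.attribute.direction)
  ConstituentSet('Elaboration', M1, M2)
  3
- inputs: Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2
- conditions: (M1.attribute.direction == M2.attribute.direction)
- return: ConstituentSet('Elaboration', M1, M2)
- heuristic: 3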
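One way to picture the four parts of a rule is as a small record with an `apply` method. The sketch below is an assumption-laden illustration, not the project's parser or rule class: the `Rule` dataclass, its field names, and the dict-based messages and constituents are all hypothetical, and treating M1 as nucleus and M2 as aux is a guess based on the WeatherExplainer result.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Rule:
    rel_type: str                  # name of the discourse relation
    input_types: Tuple[str, str]   # expected msgType of M1 and M2 (inputs)
    condition: Callable            # (m1, m2) -> bool (conditions)
    heuristic: int                 # desirability score (heuristic)

    def apply(self, m1, m2):
        """Return (heuristic, constituent) if the rule fires, else None."""
        if (m1.get("msgType"), m2.get("msgType")) != self.input_types:
            return None
        if not self.condition(m1, m2):
            return None
        # Assumed layout of the returned constituent (return).
        return self.heuristic, {"relType": self.rel_type, "nucleus": m1, "aux": m2}


elaboration = Rule(
    rel_type="Elaboration",
    input_types=("MonthlyRainfallMsg", "TotalRainfallMsg"),
    condition=lambda m1, m2: m1["attribute"]["direction"] == m2["attribute"]["direction"],
    heuristic=3,
)

monthly = {"msgType": "MonthlyRainfallMsg", "attribute": {"direction": ""}}
total = {"msgType": "TotalRainfallMsg", "attribute": {"direction": ""}}
result = elaboration.apply(monthly, total)
```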
20 Example Usage
with open('msg_file', 'r') as f:
    msg_string = f.read()
with open('rule_file', 'r') as f:
    rule_string = f.read()
messages = read_messages(msg_string)
rules = read_rules(rule_string)
plan = bottom_up_plan(messages, rules)
21 Data Set - WeatherExplainer
- Simple example provided in Reiter and Dale 2000
- Created 3 messages and 3 rules in the input format
22 WeatherExplainer Messages
TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''

MonthlyRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 2
    direction ''

MonthlyTemperatureMsg
  period
    year 1996
    month 06
  temperature
    category 'hot'
23 WeatherExplainer Rules
Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
  (M1.attribute.direction == M2.attribute.direction)
  ConstituentSet('Elaboration', M1, M2)
  3

Contrast(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
  (M1.attribute.direction != M2.attribute.direction)
  ConstituentSet('Contrast', M1, M2)
  2

Sequence(Message('MonthlyTemperatureMsg') | ConstituentSet(nucleus=Message('MonthlyTemperatureMsg')) M1, Message('MonthlyRainfallMsg') | ConstituentSet(nucleus=Message('MonthlyRainfallMsg')) M2)
  ()
  ConstituentSet(Sequence, M1, M2)
  1
24 WeatherExplainer Result
type 'DPDocument'
children
  aux
    aux
      msgType 'TotalRainfallMsg'
      attribute
        direction ''
        magnitude
          number 4
          unit 'inches'
        type 'RelativeVariation'
      period
        month 6
        year 1996
    nucleus
      msgType 'MonthlyRainfallMsg'
      attribute
        direction ''
        magnitude
          number 2
          unit 'inches'
        type 'RelativeVariation'
      period
        month 6
        year 1996
    relType "'Elaboration'"
  nucleus
    msgType 'MonthlyTemperatureMsg'
    period
      month 6
      year 1996
    temperature
      category 'hot'
  relType 'Sequence'
title
  text None
  type None
25 WeatherExplainer Result
Roughly: This has been a hot month. Average rainfall this month is greater than usual. So far, rainfall is four inches above average.
26 ASSESS
- Summarization of Evaluative Opinions
27 An Abstractive Summarization Pipeline
[Pipeline diagram: Input Reviews -> (extract all information from input corpus) -> Data -> (determine most relevant information and generate summary) -> Summary]
28 ASSESS Testing
- Input
  - Review sentences tagged with crude-feature evaluations
  - Crude-Feature to User-Defined-Feature mapping
- Simple content selection
  - Group evaluations by UDF
  - Calculate average evaluation
  - Also include info on UDF-parent in hierarchy, number of evaluations
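The content selection steps above can be sketched as a simple group-and-average pass. Everything here is an assumption for illustration: the signed integer scores, the split of the mean into a `polarity` sign and a `valence` magnitude, the function name `average_opinions`, and the sample crude features are not taken from ASSESS itself.

```python
from collections import defaultdict


def average_opinions(evaluations, cf_to_udf, udf_parents):
    """Group (crude_feature, score) pairs by UDF and build one message per UDF."""
    grouped = defaultdict(list)
    for crude_feature, score in evaluations:
        grouped[cf_to_udf[crude_feature]].append(score)

    messages = []
    for udf, scores in grouped.items():
        mean = sum(scores) / len(scores)
        messages.append({
            "msgType": "AverageOpinionMessage",
            "udf": udf,
            "udf_parent": udf_parents.get(udf),
            "numOpinions": len(scores),
            # Assumed encoding: sign of the mean as polarity, size as valence.
            "polarity": "-" if mean < 0 else "+",
            "valence": abs(mean),
        })
    return messages


# Hypothetical inputs: tagged evaluations, the CF-to-UDF mapping, the hierarchy.
evaluations = [("remote", -2), ("remote buttons", -1), ("screen", 3)]
cf_to_udf = {
    "remote": "Universal Remote Control",
    "remote buttons": "Universal Remote Control",
    "screen": "Picture",
}
udf_parents = {"Universal Remote Control": "Extra Features"}

opinion_messages = average_opinions(evaluations, cf_to_udf, udf_parents)
```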
29 Example Message
msgType 'AverageOpinionMessage'
numOpinions 17
polarity '-'
udf 'Universal Remote Control'
udf_parent 'Extra Features'
valence 1.1764705882352942

12 messages generated
30 Rules
Conjunction(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
  (M1.udf_parent == M2.udf_parent and M1.polarity == M2.polarity)
  ConstituentSet(Conjunction, M1, M2)
  (2, M1.numOpinions + M2.numOpinions)

Contrast(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
  (M1.udf_parent == M2.udf_parent and M1.polarity != M2.polarity)
  ConstituentSet(Contrast, M1, M2)
  (3, M1.numOpinions + M2.numOpinions)

Explanation(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
  (M1.udf == M2.udf_parent and M1.polarity == M2.polarity)
  ConstituentSet(Explanation, M1, M2)
  (5, 0)

Explanation(Message('AverageOpinionMessage') M1, ConstituentSet(relType='Conjunction', nucleus=Message('AverageOpinionMessage')) M2)
  (M1.udf == M2.nucleus.udf_parent and M1.polarity == M2.nucleus.polarity)
  ConstituentSet(DExplanation, M1, M2)
  (100)

Sequence(Message('AverageOpinionMessage') | ConstituentSet() M1, Message('AverageOpinionMessage') | ConstituentSet() M2)
  ()
  ConstituentSet(Sequence, M1, M2)
  (1, 0)
31 ASSESS Result
- It works!
- Evaluation of resulting DocPlan would say more about Rules and Content Selection than Document Structuring Algorithm
- Was able to handle larger number of messages and rules
  - 4 of 5 rules used
- Still, only one message type used
32 Future Improvements
- Investigate whether this simple framework can be used to develop more intelligent rules for more sophisticated domain models
  - Carenini 2008 SEA
  - May require changes to implementation
- Complete comprehensive documentation and user-manual
- Submit to NLTK
33 References
- Bird, S., Klein, E., and Loper, E. (2009). Natural Language Processing with Python. O'Reilly Media Inc. Print and online.
- Carenini, G. and Moore, J.D. (2006). Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11): 925-952.
- Carenini, G., Ng, R., and Pauls, A. (2006). Multi-Document Summarization of Evaluative Text. Proc. of the Conf. of the European Chapter of the Association for Computational Linguistics.
- FitzGerald, N. (2009). A Complete Pipeline for Semantic Evaluation Summarization. Unpublished Project Report.
- Lester, J. and Porter, B. (1997). Developing and empirically testing robust explanation generators: the KNIGHT experiments. Computational Linguistics, 23(1): 65-101.
- Mann, W. and Thompson, S. (1988). Rhetorical structure theory: toward a functional theory of text organization. Text, 8(3): 243-281.
- Marcu, D. (1997). From local to global coherence: A bottom-up approach to text planning. Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-1997), 629-635.
- Pitler, E. et al. (2008). Easily Identifiable Discourse Relations. University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-08-24.
- Reiter, E. and Dale, R. (1997). Building applied natural language generation systems. Natural Language Engineering, 3(1): 57-87.
- Reiter, E. and Dale, R. (2000). Building Natural Language Generation Systems (Studies in Natural Language Processing). New York: Cambridge UP. Print.
- Young, R.M. and Moore, J.D. (1994). DPOCL: A principled approach to discourse planning. Proceedings of the 7th International Workshop on Natural Language Generation, Kennebunkport, ME, June 17-21, 1994, pp. 13-20.