Open-Source Implementation of Document Structuring Algorithm for NLTK

1
Open-Source Implementation of Document
Structuring Algorithm for NLTK

Nicholas FitzGerald

2
Natural Language Generation

Generate coherent text outputs to express information
Express the right information
Express information in the right order

3
NLG Tasks

Document Structuring: the most important and relevant information is selected from the knowledge base (Content Determination), then ordered and structured in such a way as to maximize coherence and informativeness (Text Planning)
Micro-Planning: specifics of word selection, referring expressions, and the finalization of ordering are determined
Realization: internal representations of the above decisions are realized in actual text output

4
Document Structuring

Given a set of information to be expressed, determine the order and grouping of this information
Texts cannot simply be a random bag of sentences
Order of message presentation has a significant effect on meaning (Hovy 1993)
One way:
1 - Maria was diagnosed with cancer some months ago.
2 - Maria and Zurab had a fight last night.
3 - She was found dead this morning.
Vs.
1 - Maria was diagnosed with cancer some months ago.
2 - She was found dead this morning.
3 - Maria and Zurab had a fight last night.

5
Document Structuring

Ordering also affects coherence
John was hungry. John went to the store. He bought some bread to make a sandwich.
John bought some bread to make a sandwich. He went to the store. John was hungry.

6
Discourse relations

Relationship between messages or groups of messages
Elaboration(m1, m2)
I love jazz music (m1). My favourite album is Oscar Peterson's Night Train (m2).
Contrast(m1, m3)
I love jazz music (m1). However, my favourite album is The Beatles' White Album (m3).
Cue word: "However"

7
Rhetorical Structure Theory

Mann and Thompson (1988)
A text is coherent by virtue of the relationships that hold between messages in the text
A small number of relations (25) can explain the relationships between messages in a wide range of texts

8
Project Proposal

Implement these general algorithms for inclusion in NLTK
Provide a sample data set and DR schema for testing and illustration
Based on the hypothetical WeatherExplainer from Reiter and Dale (2000)
Experiment with these new tools as part of the Abstractive Summarization System for Evaluative Statement Summarization (ASSESS)

9
Implementation 1: Schemas

Top-Down Approach
Output document structure is predictable and stereotyped
Schemas are patterns of expansion, similar to a CFG
E.g.:
CompareAndContrast → DescribeRelationship CompareProperties.
CompareProperties → CompareProperty CompareProperties.
CompareProperties → . (empty expansion)
John is much bigger than Kate (DR). He is five inches taller (CP) and weighs almost twice as much (CP).
Specify rules for choosing an expansion if multiple expansions exist
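The expansion above can be sketched in a few lines of Python. This is only an illustration, not the presented implementation: the `SCHEMAS` table, the `expand` function, and the `make_chooser` helper are hypothetical names, and the chooser stands in for the "rules for choosing" an expansion.

```python
# Hypothetical schema table: each schema name maps to a list of alternative
# expansions, mirroring the CompareAndContrast example on this slide.
SCHEMAS = {
    "CompareAndContrast": [["DescribeRelationship", "CompareProperties"]],
    "CompareProperties": [["CompareProperty", "CompareProperties"], []],
}


def expand(symbol, choose):
    """Expand `symbol` top-down and return the flat list of leaf steps.

    `choose(symbol, alternatives)` is the rule that picks one expansion
    when several exist.
    """
    if symbol not in SCHEMAS:  # a leaf: corresponds to one message
        return [symbol]
    steps = []
    for child in choose(symbol, SCHEMAS[symbol]):
        steps.extend(expand(child, choose))
    return steps


def make_chooser(n_properties):
    """Pick the recursive CompareProperties expansion n_properties times."""
    remaining = [n_properties]

    def choose(symbol, alternatives):
        if symbol == "CompareProperties":
            if remaining[0] > 0:
                remaining[0] -= 1
                return alternatives[0]
            return alternatives[1]  # the empty expansion ends the list
        return alternatives[0]

    return choose


plan = expand("CompareAndContrast", make_chooser(2))
print(plan)
```

With two properties to compare, the leaves come out in the same order as the John and Kate example: one DescribeRelationship followed by two CompareProperty steps.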

10
Top-Down Problems

Hypothesis-Driven
Content selection done on-line
Not easily pipelined
Therefore, Bottom-Up used

11
Implementation 2: Bottom-Up

Output document structure is not predictable
POOL = messages to be expressed
while size(POOL) > 1:
    find all pairs of elements in POOL which can be joined by a DR
    assign a desirability score to each potential DR
    find the pair Ei and Ej with the highest score and combine them into Ek
    remove Ei and Ej from POOL, replace with Ek
end while
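A rough Python rendering of this loop, under stated assumptions: messages are plain strings, a combined element is a `(relation, nucleus, aux)` tuple, and each rule is a hypothetical `(relation, applies, score)` triple. The function name `greedy_structure` is illustrative, and the early exit when no pair can be joined is an addition that the pseudocode does not mention.

```python
from itertools import permutations


def nucleus(element):
    """Return the nucleus message of a (relation, nucleus, aux) tuple."""
    return element if isinstance(element, str) else nucleus(element[1])


def greedy_structure(messages, rules):
    """Repeatedly join the best-scoring pair until one element is left."""
    pool = list(messages)
    while len(pool) > 1:
        # Every ordered pair that some rule can join, with its score.
        candidates = [
            (score, relation, a, b)
            for a, b in permutations(pool, 2)
            for relation, applies, score in rules
            if applies(a, b)
        ]
        if not candidates:
            break  # assumption: stop if nothing can be joined
        _, relation, a, b = max(candidates, key=lambda c: c[0])
        # Replace the two joined elements with the combined element.
        pool = [e for e in pool if e is not a and e is not b]
        pool.append((relation, a, b))
    return pool


# Hypothetical toy rules, loosely modelled on the weather example.
rules = [
    ("Elaboration", lambda a, b: a == "monthly_rain" and b == "total_rain", 3),
    ("Sequence",
     lambda a, b: nucleus(a) == "temperature" and nucleus(b) == "monthly_rain",
     1),
]
result = greedy_structure(["temperature", "monthly_rain", "total_rain"], rules)
print(result)
```

The higher-scoring Elaboration is applied first, and the resulting element is then joined to the temperature message by Sequence.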

12
Implementation

Used nltk.featstruct for Messages and DocPlans
A mapping from feature identifiers to feature values, where each feature value is either a basic value (such as a string or an integer), or a nested feature structure.

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''

msgType 'TotalRainfallMsg'
attribute
  direction ''
  magnitude
    number 4
    unit 'inches'
  type 'RelativeVariation'
period
  month 6
  year 1996
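For illustration, the same message can be mirrored with plain nested dictionaries. This is a stand-in used here to show the shape of the data, not the `nltk.featstruct` API itself; `get_path` is a hypothetical helper for following dotted feature paths such as `attribute.direction`.

```python
# Plain nested dicts as a stand-in for a feature structure (assumption:
# the real implementation uses nltk.featstruct rather than dicts).
total_rainfall = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {
        "type": "RelativeVariation",
        "magnitude": {"unit": "inches", "number": 4},
        "direction": "",
    },
}


def get_path(fstruct, path):
    """Follow a dotted feature path such as 'attribute.magnitude.number'."""
    value = fstruct
    for feature in path.split("."):
        value = value[feature]
    return value


print(get_path(total_rainfall, "attribute.magnitude.number"))
print(get_path(total_rainfall, "period.year"))
```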

13
Implementation

nltk.featstruct.FeatStruct
unify(other)
Unify fstruct1 with fstruct2, and return the resulting feature structure. This unified feature structure is the minimal feature structure that contains all feature value assignments from both fstruct1 and fstruct2, and that preserves all reentrance properties of fstruct1 and fstruct2.
If no such feature structure exists (because fstruct1 and fstruct2 specify incompatible values for some feature), then unification fails, and unify returns None.

14
Unification
TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
unified with
TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    direction ''
yields
TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''
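The behaviour shown on this slide can be approximated over plain nested dictionaries. The sketch below is a simplified stand-in for `FeatStruct.unify`, written only to illustrate the idea: the function name `unify` and the dict representation are assumptions, and reentrance is not modelled.

```python
def unify(fs1, fs2):
    """Return the smallest nested dict holding every feature value of fs1
    and fs2, or None if they assign incompatible values to a feature."""
    result = dict(fs1)
    for feature, value in fs2.items():
        if feature not in result:
            result[feature] = value
        elif isinstance(result[feature], dict) and isinstance(value, dict):
            merged = unify(result[feature], value)
            if merged is None:
                return None
            result[feature] = merged
        elif result[feature] != value:
            return None  # incompatible values: unification fails
    return result


with_magnitude = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {
        "type": "RelativeVariation",
        "magnitude": {"unit": "inches", "number": 4},
    },
}
with_direction = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {"type": "RelativeVariation", "direction": ""},
}

unified = unify(with_magnitude, with_direction)
clash = unify(with_magnitude, {"period": {"year": 1997}})
print(unified["attribute"])
print(clash)
```

The first call merges the magnitude and direction features under `attribute`; the second fails because the two structures disagree on `period.year`.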

15
Implementation

nltk.featstruct.FeatStruct
subsumes(other)
True if self subsumes other. I.e., return true if unifying self with other would result in a feature structure equal to other.

16
Subsumes
TotalRainfallMsg
  period
    year 1996
    month 06
subsumes
TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
does not subsume
TotalRainfallMsg
  period
    year 1996
    month 06
17
Using Subsumes
Select from messages all DocPlans with a relType of Contrast and a nucleus which is a message of msgType 'TotalRainfallMsg'
d = DocPlan(relType='Contrast', nucleus=Message('TotalRainfallMsg'))
return filter(lambda msg: d.subsumes(msg), messages)
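The same selection can be tried out without NLTK by treating DocPlans as nested dictionaries. This is a sketch under that assumption: `subsumes` here is a hand-written stand-in for `FeatStruct.subsumes`, and the candidate plans are invented for the example.

```python
def subsumes(general, specific):
    """True if every feature value in `general` also appears in `specific`."""
    for feature, value in general.items():
        if feature not in specific:
            return False
        other = specific[feature]
        if isinstance(value, dict):
            if not (isinstance(other, dict) and subsumes(value, other)):
                return False
        elif value != other:
            return False
    return True


# Template: any Contrast whose nucleus is a TotalRainfallMsg.
template = {"relType": "Contrast", "nucleus": {"msgType": "TotalRainfallMsg"}}

# Hypothetical candidate DocPlans.
candidates = [
    {"relType": "Contrast",
     "nucleus": {"msgType": "TotalRainfallMsg", "period": {"year": 1996}},
     "aux": {"msgType": "MonthlyRainfallMsg"}},
    {"relType": "Elaboration", "nucleus": {"msgType": "TotalRainfallMsg"}},
    {"relType": "Contrast", "nucleus": {"msgType": "MonthlyRainfallMsg"}},
]

matches = [plan for plan in candidates if subsumes(template, plan)]
print(len(matches))
```

Only the first candidate matches: the second has the wrong relType and the third has the wrong nucleus. The template works as a partial description, so extra features on a candidate do not prevent a match.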
18
Implementation Input Formats

Messages

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''
19
Input Formats

Rules

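Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2) (M1.attribute.direction == M2.attribute.direction) ConstituentSet('Elaboration', M1, M2) 3

inputs: Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2
conditions: M1.attribute.direction == M2.attribute.direction
return: ConstituentSet('Elaboration', M1, M2)
heuristic: 3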
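One way to picture the four parts of a rule is as a small Python object. The `Rule` class below is hypothetical and is not the parser or data model behind `read_rules`; it also assumes that `ConstituentSet(relation, M1, M2)` makes M1 the nucleus and M2 the auxiliary.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    relation: str                            # discourse relation name
    input_types: tuple                       # required msgType of M1 and M2
    condition: Callable[[dict, dict], bool]  # test over M1 and M2
    heuristic: int                           # desirability score

    def apply(self, m1, m2):
        """Return (heuristic, constituent set) if the rule fires, else None."""
        if (m1["msgType"], m2["msgType"]) != self.input_types:
            return None
        if not self.condition(m1, m2):
            return None
        # Assumption: M1 becomes the nucleus and M2 the auxiliary.
        return self.heuristic, {"relType": self.relation, "nucleus": m1, "aux": m2}


elaboration = Rule(
    relation="Elaboration",
    input_types=("MonthlyRainfallMsg", "TotalRainfallMsg"),
    condition=lambda m1, m2: m1["attribute"]["direction"] == m2["attribute"]["direction"],
    heuristic=3,
)

monthly = {"msgType": "MonthlyRainfallMsg", "attribute": {"direction": ""}}
total = {"msgType": "TotalRainfallMsg", "attribute": {"direction": ""}}

fired = elaboration.apply(monthly, total)
print(fired[0], fired[1]["relType"])
```

Swapping the arguments makes the input check fail, so `elaboration.apply(total, monthly)` returns `None`.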
20
Example Usage
with open('msg_file', 'r') as f:
    msg_string = f.read()
with open('rule_file', 'r') as f:
    rule_string = f.read()
messages = read_messages(msg_string)
rules = read_rules(rule_string)
plan = bottom_up_plan(messages, rules)
21
Data Set - WeatherExplainer

Simple example provided in Reiter and Dale 2000
Created 3 messages and 3 rules in the input format

22
WeatherExplainer Messages
TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''

MonthlyRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 2
    direction ''

MonthlyTemperatureMsg
  period
    year 1996
    month 06
  temperature
    category 'hot'
23
WeatherExplainer Rules
Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
  (M1.attribute.direction == M2.attribute.direction)
  ConstituentSet('Elaboration', M1, M2)
  3
Contrast(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
  (M1.attribute.direction != M2.attribute.direction)
  ConstituentSet('Contrast', M1, M2)
  2
Sequence(Message('MonthlyTemperatureMsg')ConstituentSet(nucleus=Message('MonthlyTemperatureMsg')) M1, Message('MonthlyRainfallMsg')ConstituentSet(nucleus=Message('MonthlyRainfallMsg')) M2)
  ()
  ConstituentSet(Sequence, M1, M2)
  1
24
WeatherExplainer Result
type 'DPDocument'
title
  text None
  type None
children
  relType 'Sequence'
  nucleus
    msgType 'MonthlyTemperatureMsg'
    period
      month 6
      year 1996
    temperature
      category 'hot'
  aux
    relType "'Elaboration'"
    nucleus
      msgType 'MonthlyRainfallMsg'
      attribute
        direction ''
        magnitude
          number 2
          unit 'inches'
        type 'RelativeVariation'
      period
        month 6
        year 1996
    aux
      msgType 'TotalRainfallMsg'
      attribute
        direction ''
        magnitude
          number 4
          unit 'inches'
        type 'RelativeVariation'
      period
        month 6
        year 1996

25
WeatherExplainer Result
Roughly: "This has been a hot month. Average rainfall this month is greater than usual. So far, rainfall is four inches above average."
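The sentence order in this rough rendering is consistent with reading the plan nucleus-first. A minimal sketch of such a traversal, assuming the plan is a nested dictionary and that each nucleus is realized before its auxiliary (an assumption made here for illustration; the slides do not state how the plan is linearized):

```python
def linearize(node):
    """Yield msgType names in reading order: nucleus first, then aux."""
    if "msgType" in node:  # a leaf message
        yield node["msgType"]
        return
    for role in ("nucleus", "aux"):
        if role in node:
            yield from linearize(node[role])


# Skeleton of the WeatherExplainer result, keeping only the message types.
plan = {
    "relType": "Sequence",
    "nucleus": {"msgType": "MonthlyTemperatureMsg"},
    "aux": {
        "relType": "Elaboration",
        "nucleus": {"msgType": "MonthlyRainfallMsg"},
        "aux": {"msgType": "TotalRainfallMsg"},
    },
}

order = list(linearize(plan))
print(order)
```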
26
ASSESS

Summarization of Evaluative Opinions

27
An Abstractive Summarization Pipeline
Input Reviews → Extract all information from input corpus → Data → Determine most relevant information and generate summary → Summary
28
ASSESS Testing

Input
  Review sentences tagged with crude-feature evaluations
  Crude-Feature to User-Defined-Feature (UDF) mapping
Simple content selection
  Group evaluations by UDF
  Calculate average evaluation
  Also include info on UDF-parent in hierarchy and number of evaluations
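The content-selection step can be sketched as follows. Everything in the sample data is hypothetical (the crude features, the scores, and the parent `Lens`), and the split of the average into a `polarity` sign and an absolute `valence` is an assumption made for this sketch.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical inputs: crude-feature evaluations and the two mappings.
evaluations = [("remote", -2), ("remote control", -1), ("zoom", 3)]
crude_to_udf = {
    "remote": "Universal Remote Control",
    "remote control": "Universal Remote Control",
    "zoom": "Zoom",
}
udf_parent = {"Universal Remote Control": "Extra Features", "Zoom": "Lens"}


def select_content(evaluations, crude_to_udf, udf_parent):
    """Group evaluations by UDF and build one average-opinion message each."""
    grouped = defaultdict(list)
    for crude_feature, score in evaluations:
        grouped[crude_to_udf[crude_feature]].append(score)

    messages = []
    for udf, scores in grouped.items():
        average = mean(scores)
        messages.append({
            "msgType": "AverageOpinionMessage",
            "udf": udf,
            "udf_parent": udf_parent[udf],
            "numOpinions": len(scores),
            # Assumption: sign goes to polarity, magnitude to valence.
            "polarity": "-" if average < 0 else "+",
            "valence": abs(average),
        })
    return messages


messages = select_content(evaluations, crude_to_udf, udf_parent)
for message in messages:
    print(message["udf"], message["polarity"], message["valence"])
```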

29
Example Message
msgType 'AverageOpinionMessage'
numOpinions 17
polarity '-'
udf 'Universal Remote Control'
udf_parent 'Extra Features'
valence 1.1764705882352942

12 messages generated
30
Rules
Conjunction(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
  (M1.udf_parent == M2.udf_parent and M1.polarity == M2.polarity)
  ConstituentSet(Conjunction, M1, M2)
  (2, M1.numOpinions + M2.numOpinions)
Contrast(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
  (M1.udf_parent == M2.udf_parent and M1.polarity != M2.polarity)
  ConstituentSet(Contrast, M1, M2)
  (3, M1.numOpinions + M2.numOpinions)
Explanation(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
  (M1.udf == M2.udf_parent and M1.polarity == M2.polarity)
  ConstituentSet(Explanation, M1, M2)
  (5, 0)
Explanation(Message('AverageOpinionMessage') M1, ConstituentSet(relType='Conjunction', nucleus=Message('AverageOpinionMessage')) M2)
  (M1.udf == M2.nucleus.udf_parent and M1.polarity == M2.nucleus.polarity)
  ConstituentSet(DExplanation, M1, M2)
  (100)
Sequence(Message('AverageOpinionMessage')ConstituentSet() M1, Message('AverageOpinionMessage')ConstituentSet() M2)
  ()
  ConstituentSet(Sequence, M1, M2)
  (1, 0)
31
ASSESS Result

It works!
Evaluation of the resulting DocPlan would say more about the Rules and Content Selection than about the Document Structuring Algorithm
Was able to handle a larger number of messages and rules
4 of 5 rules used
Still, only one message type used

32
Future Improvements

Investigate whether this simple framework can be used to develop more intelligent rules for more sophisticated domain models
Carenini 2008 SEA
May require changes to implementation
Complete comprehensive documentation and user manual
Submit to NLTK

33
References

Bird, S., Klein, E., and Loper, E. (2009) Natural Language Processing with Python. O'Reilly Media Inc. Print and online.
Carenini, G. and Moore, J.D. (2006) Generating and evaluating evaluative arguments. Artificial Intelligence, 170(11): 925-952.
Carenini, G., Ng, R., and Pauls, A. (2006) Multi-Document Summarization of Evaluative Text. Proc. of the Conf. of the European Chapter of the Association for Computational Linguistics.
FitzGerald, N. (2009) A Complete Pipeline for Semantic Evaluation Summarization. Unpublished project report.
Lester, J. and Porter, B. (1997) Developing and empirically testing robust explanation generators: the KNIGHT experiments. Computational Linguistics, 23(1): 65-101.
Mann, W. and Thompson, S. (1988) Rhetorical structure theory: toward a functional theory of text organization. Text, 8(3): 243-281.
Marcu, D. (1997) From local to global coherence: A bottom-up approach to text planning. Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI-1997), 629-635.
Pitler, E. et al. (2008) Easily Identifiable Discourse Relations. University of Pennsylvania Department of Computer and Information Science Technical Report No. MS-CIS-08-24.
Reiter, E. and Dale, R. (1997) Building applied natural language generation systems. Natural Language Engineering, 3(1): 57-87.
Reiter, E. and Dale, R. (2000) Building Natural Language Generation Systems (Studies in Natural Language Processing). New York: Cambridge UP. Print.
Young, R.M. and Moore, J.D. (1994) DPOCL: A principled approach to discourse planning. Proceedings of the 7th International Workshop on Natural Language Generation, Kennebunkport, ME, June 17-21, 1994, pp. 13-20.