Open-Source Implementation of Document Structuring Algorithm for NLTK - PowerPoint PPT Presentation

1
Open-Source Implementation of Document
Structuring Algorithm for NLTK
  • Nicholas FitzGerald

2
Natural Language Generation
  • Generate coherent text outputs to express
    information
  • Express the right information
  • Express information in the right order

3
NLG Tasks
  1. Document Structuring - most important and
    relevant information selected from knowledge base
    (Content Determination), then ordered and
    structured in such a way as to maximize coherence
    and informativeness (Text Planning)
  2. Micro-Planning - specifics of word selection,
    referring expressions, and the finalization of
    ordering are determined
  3. Realization - internal representations of the
    above decisions are realized in actual text output

4
Document Structuring
  • Given a set of information to be expressed,
    determine the order and grouping of this
    information
  • Texts cannot be simply a random bag of sentences
  • Order of message presentation has a significant
    effect on meaning (Hovy 1993)
  • One way:
  • 1 - Maria was diagnosed with cancer some months
    ago.
  • 2 - Maria and Zurab had a fight last night.
  • 3 - She was found dead this morning.
  • Vs.
  • 1 - Maria was diagnosed with cancer some months
    ago.
  • 2 - She was found dead this morning.
  • 3 - Maria and Zurab had a fight last night.

5
Document Structuring
  • Ordering also affects coherence
  • John was hungry. John went to the store. He
    bought some bread to make a sandwich.
  • John bought some bread to make a sandwich. He
    went to the store. John was hungry.

6
Discourse relations
  • Relationships between messages or groups of
    messages
  • Elaboration(m1,m2)
  • I love jazz music(m1). My favourite album is
    Oscar Peterson's Night Train (m2).
  • Contrast(m1, m3)
  • I love jazz music (m1). However, my favourite
    album is The Beatles' White Album (m3).
  • Cue word - However

7
Rhetorical Structure Theory
  • Mann and Thompson 1988
  • A text is coherent by virtue of relationships
    that hold between messages in the text
  • A small number of relations (25) can explain
    relationships between messages in a wide range of
    text

8
Project Proposal
  • Implement these general algorithms for inclusion
    in NLTK
  • Provide a sample Data Set and DR schema for
    testing and illustration
  • based on hypothetical WeatherExplainer from
    Reiter and Dale 2000
  • Experiment utilizing these new tools as part of
    Abstractive Summarization System for Evaluative
    Statement Summarization (ASSESS)

9
Implementation 1: Schemas
  • Top-Down Approach
  • Output document structure is predictable and
    stereotyped
  • Schemas are patterns of expansion, similar to CFG
  • E.g.
  • CompareAndContrast → DescribeRelationship
    CompareProperties
  • CompareProperties → CompareProperty
    CompareProperties
  • CompareProperties → ε (empty)
  • John is much bigger than Kate (DR). He is five
    inches taller (CP) and weighs almost twice as
    much (CP).
  • Specify rules for choosing if multiple expansions
    exist
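
The schema rules above behave like a small context-free grammar, so top-down expansion can be sketched in a few lines. This is only an illustration, not the presented implementation: `SCHEMAS` and `expand` are hypothetical names, the third rule is assumed to be the empty expansion, and a simple counter stands in for the rule that chooses among multiple expansions.

```python
# Hypothetical schema table: each schema maps to its possible expansions.
SCHEMAS = {
    "CompareAndContrast": [["DescribeRelationship", "CompareProperties"]],
    "CompareProperties": [["CompareProperty", "CompareProperties"], []],
}


def expand(symbol, remaining):
    """Expand a schema top-down into a flat list of leaf slots.

    `remaining` is an assumed choice rule: CompareProperties keeps using
    its recursive expansion while properties remain, then the empty one.
    """
    if symbol not in SCHEMAS:
        return [symbol]  # leaf: a slot to be filled by one message
    if symbol == "CompareProperties":
        if remaining == 0:
            return []  # empty expansion ends the recursion
        body = SCHEMAS[symbol][0]
        remaining -= 1
    else:
        body = SCHEMAS[symbol][0]
    slots = []
    for child in body:
        slots.extend(expand(child, remaining))
    return slots


plan = expand("CompareAndContrast", 2)
print(plan)
```

With two properties to compare, the expansion yields one DescribeRelationship slot followed by two CompareProperty slots, which matches the DR, CP, CP pattern of the John and Kate example.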

10
Top-Down Problems
  • Hypothesis-Driven
  • Content selection done on-line
  • Not easily pipelined
  • Therefore, Bottom-Up used

11
Implementation 2: Bottom-Up
  • Output document structure is not predictable
  • POOL = messages to be expressed
  • while size(POOL) > 1
  • find all pairs of elements in POOL which can be
    joined by a DR
  • assign a desirability score to each potential DR
  • find the pair Ei and Ej with the highest score and
    combine them into Ek
  • remove Ei and Ej from POOL, replace with Ek
  • end while
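
The loop above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code from the presentation: messages are plain dicts, each rule is a hypothetical `(name, applies, score)` triple, and the combined element is a dict with `relType`, `nucleus` and `aux` keys. The sketch also stops early if no rule applies, a case the pseudocode does not address.

```python
from itertools import permutations


def bottom_up_sketch(pool, rules):
    """Greedy bottom-up planner: repeatedly join the best-scoring pair.

    rules is a list of (name, applies, score); applies(a, b) returns a bool
    and a higher score means a more desirable relation.
    """
    pool = list(pool)
    while len(pool) > 1:
        candidates = []
        for i, j in permutations(range(len(pool)), 2):
            for name, applies, score in rules:
                if applies(pool[i], pool[j]):
                    candidates.append((score, i, j, name))
        if not candidates:
            break  # assumption: give up when nothing can be joined
        score, i, j, name = max(candidates, key=lambda c: c[0])
        combined = {"relType": name, "nucleus": pool[i], "aux": pool[j]}
        pool = [m for k, m in enumerate(pool) if k not in (i, j)]
        pool.append(combined)
    return pool


def head(element):
    """msgType of a message, or of the nucleus of a combined element."""
    if "msgType" in element:
        return element["msgType"]
    return head(element["nucleus"])


messages = [
    {"msgType": "TotalRainfallMsg"},
    {"msgType": "MonthlyRainfallMsg"},
    {"msgType": "MonthlyTemperatureMsg"},
]
rules = [
    ("Elaboration",
     lambda a, b: a.get("msgType") == "MonthlyRainfallMsg"
     and b.get("msgType") == "TotalRainfallMsg", 3),
    ("Sequence",
     lambda a, b: head(a) == "MonthlyTemperatureMsg"
     and head(b) == "MonthlyRainfallMsg", 1),
]
plan = bottom_up_sketch(messages, rules)
```

Because the highest-scoring pair is joined first, the two rainfall messages are combined before the temperature message is attached, so the final structure depends on the scores rather than on a fixed template.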

12
Implementation
  • Used nltk.featstruct for Messages and DocPlans
  • A mapping from feature identifiers to feature
    values, where each feature value is either a
    basic value (such as a string or an integer), or
    a nested feature structure.

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''

msgType 'TotalRainfallMsg'
attribute
  direction ''
  magnitude
    number 4
    unit 'inches'
  type 'RelativeVariation'
period
  month 6
  year 1996

13
Implementation
  • nltk.featstruct.FeatStruct
  • unify(other)
  • Unify fstruct1 with fstruct2, and return the
    resulting feature structure. This unified feature
    structure is the minimal feature structure that
  • contains all feature value assignments from both
    fstruct1 and fstruct2.
  • preserves all reentrance properties of fstruct1
    and fstruct2.
  • If no such feature structure exists (because
    fstruct1 and fstruct2 specify incompatible values
    for some feature), then unification fails, and
    unify returns None.
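
To make the definition concrete, here is a minimal sketch of unification over plain nested dicts. It is a simplified stand-in and not NLTK's implementation: `unify_dicts` is a hypothetical helper, and it ignores reentrance entirely.

```python
def unify_dicts(a, b):
    """Unify two nested dicts; return None if any feature values conflict."""
    result = dict(a)
    for key, b_val in b.items():
        if key not in result:
            result[key] = b_val  # feature only in b: copy it over
            continue
        a_val = result[key]
        if isinstance(a_val, dict) and isinstance(b_val, dict):
            merged = unify_dicts(a_val, b_val)  # unify nested structures
            if merged is None:
                return None
            result[key] = merged
        elif a_val != b_val:
            return None  # incompatible basic values: unification fails
    return result


with_magnitude = {
    "msgType": "TotalRainfallMsg",
    "attribute": {
        "type": "RelativeVariation",
        "magnitude": {"unit": "inches", "number": 4},
    },
}
with_direction = {
    "msgType": "TotalRainfallMsg",
    "attribute": {"type": "RelativeVariation", "direction": ""},
}
unified = unify_dicts(with_magnitude, with_direction)
```

The result carries every feature from both inputs, while two structures that disagree on a basic value (for example different `month` values) produce `None`.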

14
Unification
TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4

unified with

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    direction ''

gives

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''


15
Implementation
  • nltk.featstruct.FeatStruct
  • subsumes(other)
  • True if self subsumes other. I.e., return true if
    unifying self with other would result in a
    feature structure equal to other.
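
Subsumption can be illustrated the same way over plain nested dicts. This is a sketch under assumptions, not NLTK's code: `subsumes_dicts` is a hypothetical helper that ignores reentrance, and it checks directly that every feature of the more general structure appears with a compatible value in the more specific one.

```python
def subsumes_dicts(general, specific):
    """True if every feature in `general` is present and compatible in `specific`."""
    for key, g_val in general.items():
        if key not in specific:
            return False
        s_val = specific[key]
        if isinstance(g_val, dict):
            if not isinstance(s_val, dict) or not subsumes_dicts(g_val, s_val):
                return False
        elif g_val != s_val:
            return False
    return True


partial = {"msgType": "TotalRainfallMsg", "period": {"year": 1996, "month": 6}}
full = {
    "msgType": "TotalRainfallMsg",
    "period": {"year": 1996, "month": 6},
    "attribute": {"type": "RelativeVariation", "direction": ""},
}
other = {"msgType": "MonthlyTemperatureMsg", "period": {"year": 1996, "month": 6}}

# Use the partial structure as a query pattern over a list of messages.
matches = [m for m in (full, other) if subsumes_dicts(partial, m)]
```

A partial structure therefore works as a query pattern: it matches any message that contains at least the features it specifies.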

16
Subsumes
TotalRainfallMsg
  period
    year 1996
    month 06

subsumes

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4

Does not subsume

TotalRainfallMsg
  period
    year 1996
    month 06

17
Using Subsumes
Select from messages all DocPlans with a relType
of Contrast and a nucleus which is a message of
msgType 'TotalRainfallMsg':

d = DocPlan(relType='Contrast', nucleus=Message('TotalRainfallMsg'))
return filter(lambda msg: d.subsumes(msg), messages)

18
Implementation: Input Formats
  • Messages

TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''

19
Input Formats
  • Rules

inputs:     Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
conditions: (M1.attribute.direction == M2.attribute.direction)
return:     ConstituentSet('Elaboration', M1, M2)
heuristic:  3
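
The four parts of a rule (inputs, conditions, return value, heuristic) map naturally onto a small data structure. The following is a sketch only: `RuleSketch` and its field names are hypothetical, messages are plain dicts rather than feature structures, and placing M1 as nucleus and M2 as aux is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class RuleSketch:
    rel_type: str                            # name of the discourse relation
    input_types: tuple                       # required msgType of M1 and M2
    condition: Callable[[dict, dict], bool]  # extra test on the bound inputs
    heuristic: int                           # desirability score

    def apply(self, m1, m2):
        """Return a constituent-set dict if the rule fires, else None."""
        if (m1.get("msgType"), m2.get("msgType")) != self.input_types:
            return None
        if not self.condition(m1, m2):
            return None
        return {"relType": self.rel_type, "nucleus": m1, "aux": m2}


elaboration = RuleSketch(
    rel_type="Elaboration",
    input_types=("MonthlyRainfallMsg", "TotalRainfallMsg"),
    condition=lambda m1, m2: (
        m1["attribute"]["direction"] == m2["attribute"]["direction"]
    ),
    heuristic=3,
)
monthly = {"msgType": "MonthlyRainfallMsg", "attribute": {"direction": ""}}
total = {"msgType": "TotalRainfallMsg", "attribute": {"direction": ""}}
result = elaboration.apply(monthly, total)
```

Keeping the heuristic on the rule itself lets a planner compare candidate relations without knowing anything about the domain.
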
20
Example Usage
with open('msg_file', 'r') as f:
    msg_string = f.read()
with open('rule_file', 'r') as f:
    rule_string = f.read()
messages = read_messages(msg_string)
rules = read_rules(rule_string)
plan = bottom_up_plan(messages, rules)

21
Data Set - WeatherExplainer
  • Simple example provided in Reiter and Dale 2000
  • Created 3 messages and 3 rules in the input format

22
WeatherExplainer Messages
TotalRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 4
    direction ''

MonthlyRainfallMsg
  period
    year 1996
    month 06
  attribute
    type 'RelativeVariation'
    magnitude
      unit 'inches'
      number 2
    direction ''

MonthlyTemperatureMsg
  period
    year 1996
    month 06
  temperature
    category 'hot'

23
WeatherExplainer Rules
Elaboration(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
(M1.attribute.direction == M2.attribute.direction)
ConstituentSet('Elaboration', M1, M2)
3

Contrast(Message('MonthlyRainfallMsg') M1, Message('TotalRainfallMsg') M2)
(M1.attribute.direction != M2.attribute.direction)
ConstituentSet('Contrast', M1, M2)
2

Sequence(Message('MonthlyTemperatureMsg') ConstituentSet(nucleus=Message('MonthlyTemperatureMsg')) M1,
         Message('MonthlyRainfallMsg') ConstituentSet(nucleus=Message('MonthlyRainfallMsg')) M2)
()
ConstituentSet(Sequence, M1, M2)
1

24
WeatherExplainer Result
type 'DPDocument'
title
  text None
  type None
children
  relType 'Sequence'
  nucleus
    msgType 'MonthlyTemperatureMsg'
    period
      month 6
      year 1996
    temperature
      category 'hot'
  aux
    relType "'Elaboration'"
    nucleus
      msgType 'MonthlyRainfallMsg'
      attribute
        direction ''
        magnitude
          number 2
          unit 'inches'
        type 'RelativeVariation'
      period
        month 6
        year 1996
    aux
      msgType 'TotalRainfallMsg'
      attribute
        direction ''
        magnitude
          number 4
          unit 'inches'
        type 'RelativeVariation'
      period
        month 6
        year 1996

25
WeatherExplainer Result
Roughly: This has been a hot month. Average
rainfall this month is greater than usual. So
far, rainfall is four inches above average.
26
ASSESS
  • Summarization of Evaluative Opinions

27
An Abstractive Summarization Pipeline
  • Input Reviews
  • Extract all information from input corpus
  • Data
  • Determine most relevant information and generate
    summary
  • Summary

28
ASSESS Testing
  • Input
  • Review sentences tagged with crude-feature
    evaluations
  • Crude-Feature to User-Defined-Feature mapping
  • Simple content selection
  • Group evaluations by UDF
  • Calculate average evaluation
  • Also include info on UDF-parent in hierarchy,
    number of evaluations
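
The content selection step described above amounts to a group-by followed by an average. Below is a minimal sketch assuming evaluations arrive as `(crude_feature, score)` pairs; the function name, the two mapping dicts, and the choice to store the sign in `polarity` and the absolute mean in `valence` are all assumptions, not details confirmed by the slides.

```python
from collections import defaultdict


def average_opinion_messages(evaluations, cf_to_udf, udf_parent):
    """Group scores by user-defined feature and emit one message per group."""
    grouped = defaultdict(list)
    for crude_feature, score in evaluations:
        grouped[cf_to_udf[crude_feature]].append(score)

    messages = []
    for udf, scores in grouped.items():
        mean = sum(scores) / len(scores)
        messages.append({
            "msgType": "AverageOpinionMessage",
            "udf": udf,
            "udf_parent": udf_parent.get(udf),    # parent in the UDF hierarchy
            "numOpinions": len(scores),
            "polarity": "-" if mean < 0 else "+",  # assumed encoding
            "valence": abs(mean),                  # assumed encoding
        })
    return messages


evaluations = [("remote", -2), ("remote control", -1), ("zoom", 3)]
cf_to_udf = {
    "remote": "Universal Remote Control",
    "remote control": "Universal Remote Control",
    "zoom": "Zoom",
}
udf_parent = {"Universal Remote Control": "Extra Features", "Zoom": "Lens"}
msgs = average_opinion_messages(evaluations, cf_to_udf, udf_parent)
```

Each resulting dict has the same fields as the example message on the next slide, so it could be fed to the document structuring rules once converted to a feature structure.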

29
Example Message
msgType 'AverageOpinionMessage'
numOpinions 17
polarity '-'
udf 'Universal Remote Control'
udf_parent 'Extra Features'
valence 1.1764705882352942

12 messages generated

30
Rules
Conjunction(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
(M1.udf_parent == M2.udf_parent and M1.polarity == M2.polarity)
ConstituentSet(Conjunction,M1,M2)
(2,M1.numOpinionsM2.numOpinions)

Contrast(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
(M1.udf_parent == M2.udf_parent and M1.polarity != M2.polarity)
ConstituentSet(Contrast,M1,M2)
(3,M1.numOpinionsM2.numOpinions)

Explanation(Message('AverageOpinionMessage') M1, Message('AverageOpinionMessage') M2)
(M1.udf == M2.udf_parent and M1.polarity == M2.polarity)
ConstituentSet(Explanation,M1,M2)
(5,0)

Explanation(Message('AverageOpinionMessage') M1,
            ConstituentSet(relType='Conjunction', nucleus=Message('AverageOpinionMessage')) M2)
(M1.udf == M2.nucleus.udf_parent and M1.polarity == M2.nucleus.polarity)
ConstituentSet(DExplanation,M1,M2)
(100)

Sequence(Message('AverageOpinionMessage') ConstituentSet() M1,
         Message('AverageOpinionMessage') ConstituentSet() M2)
()
ConstituentSet(Sequence,M1,M2)
(1,0)

31
ASSESS Result
  • It works!
  • Evaluation of resulting DocPlan would say more
    about Rules and Content Selection than Document
    Structuring Algorithm
  • Was able to handle larger number of messages and
    rules
  • 4 of 5 rules used
  • Still, only one message type used

32
Future Improvements
  • Investigate whether this simple framework can be
    used to develop more intelligent rules for more
    sophisticated domain models
  • Carenini 2008 SEA
  • May require changes to implementation
  • Complete comprehensive documentation and
    user-manual
  • Submit to NLTK

33
References
  • Bird, S., Klein, E., and Loper, E. (2009)
    Natural Language Processing with Python.
    O'Reilly Media Inc. Print and online.
  • Carenini, G. and Moore, J.D. (2006) Generating and
    evaluating evaluative arguments. Artificial
    Intelligence, 170(11): 925-952.
  • Carenini, G., Ng, R., and Pauls, A. (2006)
    Multi-Document Summarization of Evaluative Text.
    Proc. of the Conf. of the European Chapter of
    the Association for Computational Linguistics.
  • FitzGerald, N. (2009) A Complete Pipeline for
    Semantic Evaluation Summarization. Unpublished
    Project Report.
  • Lester, J. and Porter, B. (1997) Developing and
    empirically testing robust explanation
    generators: the KNIGHT experiments.
    Computational Linguistics, 23(1): 65-101.
  • Mann, W. and Thompson, S. (1988) Rhetorical
    structure theory: toward a functional theory of
    text organization. Text 8(3): 243-281.
  • Marcu, D. (1997) From local to global coherence:
    A bottom-up approach to text planning.
    Proceedings of the Fourteenth National Conference
    on Artificial Intelligence (AAAI-1997), 629-635.
  • Pitler, E. et al. (2008) Easily Identifiable
    Discourse Relations. University of Pennsylvania
    Department of Computer and Information Science
    Technical Report No. MS-CIS-08-24.
  • Reiter, E. and Dale, R. (1997) Building applied
    natural language generation systems. Natural
    Language Engineering 3(1): 57-87.
  • Reiter, E. and Dale, R. (2000) Building Natural
    Language Generation Systems (Studies in Natural
    Language Processing). New York: Cambridge UP.
    Print.
  • Young, R.M. and Moore, J.D. (1994) DPOCL: A
    principled approach to discourse planning, in
    Proceedings of the 7th International Workshop on
    Natural Language Generation, Kennebunkport, ME,
    June 17-21, 1994, pp. 13-20.