Data-driven Approaches for Information Structure Identification

Provided by: oanapos

1
Data-driven Approaches for Information Structure
Identification
  • HLT/EMNLP Oct 6, 2005

2
Information Structure (IS)
  • Division of the sentence into two parts:
  • One links the sentence to the discourse
  • One advances the discourse (brings new information)
  • Example:
  • Rob needs to talk things out, and he certainly
    isn't going to do that with Dick or Barry.
  • So, he talks to HIMSELF instead.
  • Not the given/new distinction.

3
Why / Where is IS important?
  • Realization of IS
  • Prosody (English)
  • Word order variation (Czech, German)
  • Morphology (Japanese)
  • Applications
  • Text-to-speech systems
  • Natural Language Generation
  • Machine Translation

4
Outline
  • The Prague School Approach to IS
  • Theory of Topic-Focus Articulation (TFA)
  • Annotation of TFA
  • Automatic Extraction of Topic and Focus
  • Experimental setup
  • Results
  • Error analysis

5
Topic-Focus Articulation
  • Topic: what the sentence is about
  • Focus: information asserted about the topic
  • SentenceSem = Focus(Topic)
  • Defined in terms of the Contextually-Bound /
    Non-Bound distinction

6
Contextually Bound / Non-Bound
  • Operational criterion based on question-answer
    test
  • CB (Contextually-Bound) items are:
  • Weak and zero pronouns
  • Items in the answer which reproduce expressions
    present (or associated with those present) in the
    question
  • NB (Non-Bound) items are:
  • The item corresponding to the wh-word
  • Strong/stressed pronouns

7
Example: CB vs. NB
  • Rob needs to talk things out, and he certainly
    isn't going to do that with Dick or Barry.
  • So, he talks to HIMSELF instead.
  • So, whom does he talk to instead?

8
TFA: Theoretical Definition
  • Focus
  • The main verb (V) and any child of V (and the
    subordinated sub-tree) iff they are NB.
  • If V and all its children are CB, then the NB
    items subordinated to the children (and the
    subordinated sub-trees).
  • Topic
  • All items not belonging to Focus cf. 1
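
The two-case definition above can be read as a small tree procedure. The following is a minimal sketch of one such reading, not the treebank's own implementation: the `Node` class, the function names, and the flat tree used for the example sentence are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tecto-node: a word form, a CB/NB flag, and dependents."""
    form: str
    bound: bool  # True = contextually bound (CB), False = non-bound (NB)
    children: List["Node"] = field(default_factory=list)


def subtree(node: Node) -> List[Node]:
    """Return the node together with everything subordinated to it."""
    out = [node]
    for child in node.children:
        out.extend(subtree(child))
    return out


def nb_below(node: Node) -> List[Node]:
    """Collect the highest NB nodes below `node`, each with its sub-tree."""
    out: List[Node] = []
    for child in node.children:
        if not child.bound:
            out.extend(subtree(child))
        else:
            out.extend(nb_below(child))
    return out


def focus(verb: Node) -> List[Node]:
    """Focus of a clause headed by `verb`, following the two cases above."""
    result: List[Node] = []
    if not verb.bound:
        result.append(verb)
    for child in verb.children:
        if not child.bound:
            result.extend(subtree(child))
    if result:
        return result
    # Case 2: V and all its children are CB, so look at deeper NB items.
    for child in verb.children:
        result.extend(nb_below(child))
    return result


def topic(verb: Node) -> List[Node]:
    """Topic: every node of the clause that is not in the focus."""
    in_focus = {id(n) for n in focus(verb)}
    return [n for n in subtree(verb) if id(n) not in in_focus]


# Assumed flat tree for "So, he talks to HIMSELF instead."
clause = Node("talks to", True, [
    Node("So", True), Node("he", True),
    Node("HIMSELF", False), Node("instead", True),
])
focus_forms = [n.form for n in focus(clause)]
topic_forms = [n.form for n in topic(clause)]
```

Under these assumptions the only NB item, HIMSELF, ends up as the focus and the remaining nodes form the topic.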

9
Example: Topic and Focus
  • So, he talks to HIMSELF instead.
  • [Figure: dependency tree of the sentence with node
    labels: talks to (CB), he (CB), instead (CB),
    So (CB), HIMSELF (NB)]

10
Roadmap
  • The Prague School Approach to IS
  • Theory of Topic-Focus Articulation (TFA)
  • Annotation of TFA
  • Automatic Extraction of Topic and Focus
  • Experimental setup
  • Results
  • Error analysis

11
Prague Dependency Treebank
  • Three layers of annotation
  • Morphological
  • Analytical
  • Syntactic trees containing each token of the
    surface form (incl. punctuation marks)
  • Main syntactic functions: SUBJ, OBJ, ...
  • Tectogrammatical: deep structure of the sentence
  • Only autosemantic words
  • Recovered words (deleted on the surface)
  • Detailed classification of functors: PAT, ACT,
    ADDR, ...
  • Topic-Focus Articulation (TFA)

12
Annotation of TFA
  • Marked on all nodes from the tectogrammatical
    layer (50K sentences / 632K tecto-nodes).
  • Three classes: t, c, f
  • t: non-contrastive CB nodes
  • c: contrastive CB nodes
  • f: NB nodes
  • Guidelines use notions of
  • surface order
  • intonation center (IC)
  • systemic (canonical) order of dependents

13
Annotators' Agreement (in %)
  • Veselá, Havelka & Hajičová, LREC 2004

14
Roadmap
  • The Prague School Approach to IS
  • Theory of Topic-Focus Articulation (TFA)
  • Annotation of TFA
  • Automatic Extraction of Topic and Focus
  • Experimental setup
  • Results
  • Error analysis

15
Extraction of Topic and Focus
  • Goal: label each tecto-node with t(opic) or
    f(ocus)
  • Steps:
  • 1. Rule-based system (used as 2nd baseline)
  • 2. Build models for different classifiers
  • C4.5
  • MaxEnt
  • Ripper
  • 3. Error Analysis

16
Rule-based system 1/2
17
Rule-based system 2/2
18
Machine Learning Models
  • Three different techniques
  • Decision trees (C4.5)
  • Rule induction (RIPPER)
  • Maximum Entropy (MaxEnt)
  • Use 2 classes of features
  • Basic features (attributes from the treebank)
  • Derived features (inspired by the annotation
    guidelines)

19
Basic Features
  • nodetype -> complex, atom, ...
  • functor -> ACT, PAT, ATT, ...
  • coref -> true, false
  • coreftype -> text, gram, NA
  • afun -> Sbj, Obj, Pred, ...
  • SUBPOS -> P1, PD, PE, NN, ...
  • (up to feature 23) The grammatemes: sempos,
    verbmod, aspect, ...

20
Derived Features
  • is_rightmost_dependent_of_the_verb
  • is_rightside_dependent_of_the_verb
  • is_rightside_dependent
  • is_embedded_attribute
  • has_repeated_lemma
  • is_in_canonical_order
  • is_weak_pronoun
  • is_indexical_expression
  • is_pronoun_with_general_meaning
  • is_strong_pronoun_with_no_prep
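
The slides name the derived features but do not define them. The sketch below shows how a few of them might be computed from surface order and tree structure. The `TNode` class, its fields, and the exact conditions are assumptions inferred from the feature names, not the definitions used in the experiments.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class TNode:
    """Hypothetical tecto-node with just the fields this sketch needs."""
    lemma: str
    order: int                       # surface position in the sentence
    is_verb: bool = False
    parent: Optional["TNode"] = None


def is_rightside_dependent(node: TNode) -> bool:
    """Assumed reading: the node follows its parent in surface order."""
    return node.parent is not None and node.order > node.parent.order


def is_rightside_dependent_of_the_verb(node: TNode) -> bool:
    """Assumed reading: as above, and the parent is a verb."""
    return is_rightside_dependent(node) and node.parent.is_verb


def is_rightmost_dependent_of_the_verb(node: TNode, sentence: List[TNode]) -> bool:
    """Assumed reading: no sibling under the same verb stands further right."""
    if not is_rightside_dependent_of_the_verb(node):
        return False
    siblings = [n for n in sentence if n.parent is node.parent]
    return node.order == max(n.order for n in siblings)


def has_repeated_lemma(node: TNode, previous_lemmas: Set[str]) -> bool:
    """Assumed reading: the lemma already occurred in the preceding text."""
    return node.lemma in previous_lemmas


# Toy sentence "Rob talks to Barry" (tree shape assumed for illustration).
verb = TNode("talk", 2, is_verb=True)
rob = TNode("Rob", 1, parent=verb)
barry = TNode("Barry", 3, parent=verb)
sentence = [rob, verb, barry]

features_barry = {
    "is_rightside_dependent": is_rightside_dependent(barry),
    "is_rightmost_dependent_of_the_verb":
        is_rightmost_dependent_of_the_verb(barry, sentence),
    "has_repeated_lemma": has_repeated_lemma(barry, {"Rob"}),
}
```

Each feature is a boolean per node, so it can be appended directly to the basic treebank attributes as an extra classifier input.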

21
Data statistics
  • 3,168 files
  • 49,442 sentences
  • Instances: 621,991
  • Training set: 494,756 (78.3%)
  • Development set: 66,711 (10.5%)
  • Test set: 70,323 (11.2%)
  • TFA distribution on the training set:
  • 63.11% f
  • 36.89% t

22
Evaluation
  • Metric: correctly classified instances
  • Baseline: assigns the class that has the most
    instances (f)
  • Second baseline: the rule-based system
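
The metric and the first baseline can be sketched in a few lines. The label lists below are toy data invented for illustration and the function names are assumptions; only the idea (always predict the most frequent training class, then count correct instances) comes from the slide.

```python
from collections import Counter
from typing import List


def majority_class(train_labels: List[str]) -> str:
    """Return the label with the most instances in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predicted: List[str], gold: List[str]) -> float:
    """Percentage of correctly classified instances."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)


# Toy labels, not treebank data.
train_labels = ["f", "f", "t", "f", "t"]
test_labels = ["f", "t", "f", "f"]

baseline_label = majority_class(train_labels)
baseline_accuracy = accuracy([baseline_label] * len(test_labels), test_labels)
```

By construction, this baseline scores exactly the share of the majority class in whatever set it is evaluated on.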

23
Results
24
Error analysis
  • New contexts in development data: 2,043 (2,125
    instances)

25
Naïve Predictor
  • New contexts in test data -> f
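
The slide states only that contexts unseen in training are labelled f. A minimal sketch of a lookup-style predictor consistent with that rule follows. Treating a "context" as a tuple of feature values and using the most frequent training label for seen contexts are both assumptions, as are the class and method names.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable


class NaivePredictor:
    """Memorise training contexts; fall back to a default label for new ones."""

    def __init__(self, default: str = "f") -> None:
        self.default = default
        self.table: Dict[Hashable, str] = {}

    def fit(self, contexts: Iterable[Hashable], labels: Iterable[str]) -> "NaivePredictor":
        counts = defaultdict(Counter)
        for context, label in zip(contexts, labels):
            counts[context][label] += 1
        # Assumption: a seen context gets its most frequent training label.
        self.table = {c: k.most_common(1)[0][0] for c, k in counts.items()}
        return self

    def predict(self, context: Hashable) -> str:
        # New contexts -> f, as on the slide.
        return self.table.get(context, self.default)


# Toy contexts built from (functor, afun) pairs; not treebank data.
predictor = NaivePredictor().fit(
    [("ACT", "Sbj"), ("ACT", "Sbj"), ("PAT", "Obj")],
    ["t", "t", "f"],
)
seen_prediction = predictor.predict(("ACT", "Sbj"))
unseen_prediction = predictor.predict(("ADDR", "Obj"))
```
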
26
Naïve Predictor evaluation
  • χ² = 26.3, p < 0.001
  • χ² = 30.7, p < 0.001
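
The slide does not say which test produced these statistics or how many degrees of freedom it has. Assuming one degree of freedom, as in a pairwise comparison of two classifiers, the p-value of a chi-square statistic can be checked with the standard library alone. The function name is an assumption.

```python
import math


def chi2_p_value_1df(statistic: float) -> float:
    """Upper-tail p-value of a chi-square statistic with ONE degree of freedom.

    With 1 df the statistic is the square of a standard normal variable,
    so P(X > x) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2.0))


# Assumption: both reported statistics have 1 degree of freedom.
p_first = chi2_p_value_1df(26.3)
p_second = chi2_p_value_1df(30.7)
```

Under that assumption both values fall well below the 0.001 threshold reported on the slide.
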
27
Learning curves
  • [Figure: learning curves for C4.5, the naïve
    predictor, MaxEnt, and RIPPER]

28
Conclusions
  • Information Structure can be recovered using
    mostly syntactic features.
  • Further improvement is more likely to come from
    introducing more features than from providing
    more annotated data.
  • Future research: transferring IS from Czech to
    English through word alignment.

29
Thank you!
30
More examples: Topic and Focus
  • [Figures: dependency trees with nodes labelled
    CB / NB, including an example of proxy foci]