DiscAn: Towards a Discourse Annotation system for Dutch language corpora, or why and how we would want to annotate corpora on the discourse level

Provided by: sande209
Learn more at: http://www.clarin.nl

1
DiscAn: Towards a Discourse Annotation system
for Dutch language corpora, or why and how we
would want to annotate corpora on the discourse
level
  • Ted Sanders
  • Utrecht Institute of Linguistics
  • Universiteit Utrecht

2
Coherence in discourse
  • Many tourists come to Switzerland. They want to
    see the mountains.
      • Referential coherence
  • Many tourists come to Switzerland because they
    want to see the mountains.
      • Relational coherence
  • John was happy. It was a Saturday.
      • We do not need explicit linguistic indicators

3
Coherence in discourse, 2
  • Coherence is a cognitive phenomenon
  • Coherence relations are conceptual relations that
    constitute coherence between discourse segments
    (minimally clauses)
  • Connectives, Cue Phrases and other lexical
    markers can but need not make this coherence
    explicit.
  • Coherence relations are the building blocks of
    discourse structure (causal, contrastive,
    additive)

4
In annotated corpora?
  • The discourse level is largely lacking in
    annotated Dutch corpora
  • There is an international tendency towards
    discourse annotation
      • The Penn Discourse Treebank (Prasad, Joshi,
        Webber et al.)
      • The Potsdam Corpus (Stede et al.)
  • And at the same time, we do have much data on
    Dutch
      • On connectives
      • Mainly causal
      • Across media (various written genres, spoken,
        chat)
      • At various stages of annotation

5
Larger research issues in the field
  • To be answered on the basis of annotated corpora
  • The meaning and use of connectives vary across
    languages: omdat vs. parce que vs. weil
  • Semantic-pragmatic restrictions on use
  • Similarities and differences in acquisition
  • We will start discourse annotation with a study
    on the category of causals

6
Annotation
  • Some criteria
  • Order: cause-consequence and vice versa
  • Subjectivity: want, puisque, since, denn vs.
    omdat, parce que, because, weil
  • Linguistic marking: yes/no, perspective, etc.
  • Characteristics of the segments: propositional
    attitude, modality, tense, syntax

7
Current situation: 15 studies.
  • Column headers: corpus, conn, fragmnr, s1, s2,
    modality s1, modality s2, protag s1, protag s2,
    relation
  • 7 | omdat | 2502 | 176 | 176 | 1 | 1 | irrelevant because fact | 6 | 1 | 1 | 1 | Irrelevant because fact | Irrelevant because fact | 1
  • 7 | omdat | 2502b | 177 | 177 | 2 | 1 | Speaker/author | 6 | 2 | 1 | 1 | Explicitly present | Irrelevant because fact | 1
  • 7 | omdat | 2509 | 707 | 707 | 1 | 1 | irrelevant because fact | 6 | 1 | 1 | 1 | Irrelevant because fact | Irrelevant because fact | 1
  • 7 | omdat | 2539 | 3320 | 3320 | 1 | 1 | irrelevant because fact | 6 | 1 | 1 | 1 | Irrelevant because fact | Irrelevant because fact | 1
  • 7 | omdat | 2546 | 3810 | 3810 | 1 | 2 | irrelevant because fact | 33 | 2 | 3 | 1 | Irrelevant because fact | Implicit | 19
  • 7 | omdat | 2551 | 4357 | 4357 | 1 | 2 | irrelevant because fact | 31 | 2 | 1 | 1 | Irrelevant because fact | Explicitly present | 1
  • 7 | omdat | 2525 | 2547 | 2547 | 3 | 1 | Speaker/author | 6 | 2 | 1 | 1 | Explicitly present | Irrelevant because fact | 1

8
The DiscAn project has five main goals
  • standardize and open up an existing set of Dutch
    corpus analyses of coherence relations and
    discourse connectives
  • develop the foundations for a discourse
    annotation system
  • improve the metadata by investigating existing
    CMDI profiles or adding new profiles suited for
    this type of analysis
  • inventory the required categories and
    investigate to what extent these could be
    included in ISOcat categories for discourse
  • build an interdisciplinary discourse community of
    text, corpus and computational linguists to
    initiate further research in a European context

9
A model of analysis
  • Var 1: Name of the coder (values: the names of the
    two authors)
  • Var 2: Number of the fragment (the values were
    present in the fragments)
  • Var 3: Utterance number(s) of the segment
    preceding the connective want (S1)
  • Var 4: Utterance number(s) of the segment
    following want (S2)
  • Var 5: Propositional attitude of S1 (values:
    action, fact, opinion, observation, knowledge,
    experience)
  • Var 6: Propositional attitude of S2 (values:
    action, fact, opinion, observation, knowledge,
    experience)
  • Var 7: Identity of the conceptualizer in S1
    (values: speaker/1st person, second person,
    third person (nominal or pronominal), generic
    person)
  • Var 8: Identity of the conceptualizer in S2
    (values: speaker/1st person, second person,
    third person (nominal or pronominal), generic
    person)
  • Var 9: Type of relation expressed by want
    (values: non-volitional content, volitional
    content, explanation of a mental state,
    epistemic, textual, speech act)
  • Var 10: Syntactic modification of want (values:
    no modification, coordinating conjunction,
    intensifier, focus element)
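
The ten variables above could be captured as a typed record, which makes the closed value sets explicit and rejects labels outside them. The following is only a minimal sketch in Python, not the project's actual format: the class names, field names, dictionary keys and the example values are all hypothetical, and it assumes that "generic person" is a separate conceptualizer value alongside "third person".

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class PropositionalAttitude(Enum):  # value set of Var 5 and Var 6
    ACTION = "action"
    FACT = "fact"
    OPINION = "opinion"
    OBSERVATION = "observation"
    KNOWLEDGE = "knowledge"
    EXPERIENCE = "experience"


class Conceptualizer(Enum):  # value set of Var 7 and Var 8
    SPEAKER = "speaker/1st person"
    SECOND_PERSON = "second person"
    THIRD_PERSON = "third person"      # nominal or pronominal
    GENERIC_PERSON = "generic person"  # assumed to be a separate value


class RelationType(Enum):  # value set of Var 9
    NON_VOLITIONAL_CONTENT = "non-volitional content"
    VOLITIONAL_CONTENT = "volitional content"
    MENTAL_STATE_EXPLANATION = "explanation of a mental state"
    EPISTEMIC = "epistemic"
    TEXTUAL = "textual"
    SPEECH_ACT = "speech act"


class SyntacticModification(Enum):  # value set of Var 10
    NONE = "no modification"
    COORDINATING_CONJUNCTION = "coordinating conjunction"
    INTENSIFIER = "intensifier"
    FOCUS_ELEMENT = "focus element"


@dataclass(frozen=True)
class WantAnnotation:
    """One coded occurrence of the connective 'want' (hypothetical layout)."""
    coder: str                                # Var 1
    fragment: str                             # Var 2
    s1_utterances: Tuple[int, ...]            # Var 3
    s2_utterances: Tuple[int, ...]            # Var 4
    attitude_s1: PropositionalAttitude        # Var 5
    attitude_s2: PropositionalAttitude        # Var 6
    conceptualizer_s1: Conceptualizer         # Var 7
    conceptualizer_s2: Conceptualizer         # Var 8
    relation: RelationType                    # Var 9
    modification: SyntacticModification       # Var 10


def parse_annotation(row: dict) -> WantAnnotation:
    """Build a record from plain labels; Enum lookup raises ValueError
    for any label that is not in the value set of its variable."""
    return WantAnnotation(
        coder=row["coder"],
        fragment=row["fragment"],
        s1_utterances=tuple(row["s1"]),
        s2_utterances=tuple(row["s2"]),
        attitude_s1=PropositionalAttitude(row["attitude_s1"]),
        attitude_s2=PropositionalAttitude(row["attitude_s2"]),
        conceptualizer_s1=Conceptualizer(row["conceptualizer_s1"]),
        conceptualizer_s2=Conceptualizer(row["conceptualizer_s2"]),
        relation=RelationType(row["relation"]),
        modification=SyntacticModification(row["modification"]),
    )


# Invented example values, for illustration only (not taken from any corpus).
example = parse_annotation({
    "coder": "coder A",
    "fragment": "F001",
    "s1": [12],
    "s2": [13, 14],
    "attitude_s1": "opinion",
    "attitude_s2": "fact",
    "conceptualizer_s1": "speaker/1st person",
    "conceptualizer_s2": "third person",
    "relation": "epistemic",
    "modification": "no modification",
})
```

Using enumerations rather than free-text labels means a misspelled or unlisted value fails at parse time, which is one way to keep codings from different studies comparable.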