Diapositiva 1 - PowerPoint PPT Presentation

About This Presentation
Title:

Diapositiva 1

Description:

Italian corpus comparable in size (62K words), content and annotation to TB 1.2. ... under development: 13K words annotated, 1,755 events; ...

Slides: 17
Provided by: Tomm221
Learn more at: http://www.lrec-conf.org

Transcript and Presenter's Notes

1
A Bilingual Corpus of Inter-linked Events
Tommaso Caselli (1), Nancy Ide (2), Roberto Bartolini (1)
(1) Istituto di Linguistica Computazionale, ILC-CNR, Pisa
(2) Department of Computer Science, Vassar College, USA
tommaso.caselli_at_ilc.cnr.it
ide_at_cs.vassar.edu
roberto.bartolini_at_ilc.cnr.it
LREC 08 Marrakech, 30 May 2008
2
Outline
  • Motivations
  • A (gentle) introduction to TimeML, TimeBank and
    Italian TimeBank Corpora
  • The Bilingual Corpus: linking events in TimeBank
    and Italian TimeBank by means of the Inter-Lingual
    Index
  • Evaluation and Experiments
  • similar events in the two corpora
  • the ILI as a bootstrapping device for creating
    comparable corpora
  • Conclusion

3
Motivations
  • Retrieving the temporal relations between events
    in texts is required to improve the performance
    of I.R. and Open Domain Q.A. systems
  • one of the most challenging tasks is event
    identification
  • can we facilitate event recognition by linking
    two corpora in two different languages that are
    comparable in size, content and annotation, by
    means of the Inter-Lingual Index (ILI), which
    links IWN and WN?
  • are events encoded in the same way in Italian
    and English?
  • can we import layers of annotations from a
    corpus to another in two different languages by
    exploiting the ILI?

4
TimeML, TimeBank & Italian TimeBank
  • TimeML (Pustejovsky et al., 2003) is a
    specification language to annotate core elements
    in a temporal framework
  • temporal expressions (<TIMEX3>), e.g. December
    1st, three years
  • a wide range of linguistic expressions, like
    verbs, nouns, nominalizations, stative
    adjectives..., realizing eventualities (<EVENT>),
    i.e. events and states, which it classifies into
    7 classes, i.e. ASPECTUAL, REPORTING, I_ACTION,
    I_STATE, PERCEPTION, STATE, OCCURRENCE, according
    to semantic and syntactic criteria
  • connectives and temporal prepositions
    (<SIGNAL>), which make explicit the relation
    holding between two entities
  • it creates dependencies between events (<SLINK>,
    <ALINK>, <TLINK>) and between events and times
    (<TLINK>).
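
As a rough illustration of the markables listed above, the following Python sketch parses a small TimeML-style fragment and lists its <EVENT>, <TIMEX3> and <SIGNAL> elements. The sentence, the identifiers and the attribute values are invented for this example and do not come from TimeBank; only the tag names and the event classes are taken from the slide.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: sentence, ids and values are invented.
SAMPLE = (
    '<TimeML>The company <EVENT eid="e1" class="REPORTING">announced</EVENT> '
    '<SIGNAL sid="s1">on</SIGNAL> '
    '<TIMEX3 tid="t1" type="DATE" value="2007-12-01">December 1st</TIMEX3> '
    'that it would <EVENT eid="e2" class="I_ACTION">try</EVENT> to '
    '<EVENT eid="e3" class="OCCURRENCE">buy</EVENT> its rival.</TimeML>'
)


def extract_markables(xml_text):
    """Return (tag, id, text, class-or-value) tuples in document order."""
    root = ET.fromstring(xml_text)
    rows = []
    for el in root.iter():
        if el.tag == "EVENT":
            rows.append(("EVENT", el.get("eid"), el.text, el.get("class")))
        elif el.tag == "TIMEX3":
            rows.append(("TIMEX3", el.get("tid"), el.text, el.get("value")))
        elif el.tag == "SIGNAL":
            rows.append(("SIGNAL", el.get("sid"), el.text, None))
    return rows


if __name__ == "__main__":
    for row in extract_markables(SAMPLE):
        print(row)
```

The link tags (<SLINK>, <ALINK>, <TLINK>) are left out of the sketch; they would be read the same way, as elements whose attributes point to the ids collected here.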

5
TimeML, TimeBank & Italian TimeBank (2)
TimeBank 1.2.
  • first available corpus annotated with TimeML
  • 183 news articles from different sources,
    including the Penn TreeBank2 Wall Street Journal,
    for a total of 61K words
  • 7,935 events; K = 0.81 on partial match for event
    identification, K = 0.67 for event classification

Italian TimeBank
  • Italian corpus comparable in size (62K words),
    content and annotation to TB 1.2. (171 articles
    from the Italian TreeBank and the PAROLE corpus)
  • under development: > 13K words annotated, 1,755
    events
  • customization of TimeML to Italian (ISO-TimeML):
    IMPERFECT value for TENSE; two new attributes,
    V_FORM and MOOD, for the <EVENT> tag;
    modification of the <EVENT> tag text span
  • mapping of the 7 TimeML event classes to the
    SIMPLE Ontology to improve event classification
    (K = 0.84)

6
The Bilingual Corpus: Linking Events
Linkage between the TimeBank (TB) and the Italian
TimeBank is accomplished through the
Inter-Lingual Index (ILI), developed in the
EuroWordNet Project (1999)
The ILI is effectively an unstructured version of
WN, used as a hub through which WN synsets are
associated with synsets in WNs of other languages
  • In IWN the ILI is augmented with several
    semantic relations, such as eq_synonym,
    eq_hyperonym, eq_cause..., which give specific
    information on the synset relations between
    English and Italian.
  • 1,835 events (1,777 verbs, 658 nominalizations):
    manual annotation of WN 2.0 senses by 2 native
    speakers; 91% inter-annotator agreement

1,686 events
  • 1,253 events (778 verbs, 462 nominalizations
    and nouns): semi-automatic annotation of IWN
    senses.

7
The Bilingual Corpus: Linking Events (2)
[Diagram; labels: WN 2.0, IWD 1.5, Auto-Generated Mapping from WD 2.0 to IWD 1.5, WN SENSE, IWN SENSE, ILI, Augmented TimeBank SENSE (WN 2.0), Italian TimeBank SENSE (from IWN), ILI LINK, ILI (IWN)]
The ILI link is automatically determined and
restricted to the eq_synonym and eq_near_synonym
relations: only events with exactly or
approximately the same meaning are linked.
1,103 events in TB with 115 event synsets; 1,250
events in Italian TB with 653 event synsets
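
The restriction on the ILI link can be pictured with a short Python sketch. It is only an illustration under stated assumptions: the function name, the input layout (lists of tuples and a lookup dictionary) and the toy records are hypothetical, not the authors' implementation. The two accepted relations and the ILI record 1432563 (seek3 / cercare2) are the only details taken from the slides.

```python
from collections import defaultdict

# Relations accepted for an ILI link, as stated on the slide.
ALLOWED = {"eq_synonym", "eq_near_synonym"}


def link_events(tb_events, it_events, iwn_to_ili):
    """Pair TimeBank and Italian TimeBank events that share an ILI record.

    tb_events:  list of (event_id, ili_id); ILI reached from the WN 2.0 sense
    it_events:  list of (event_id, iwn_synset)
    iwn_to_ili: dict iwn_synset -> list of (relation, ili_id)
    All three layouts are hypothetical and chosen for this sketch.
    """
    by_ili = defaultdict(list)
    for ev_id, synset in it_events:
        for relation, ili_id in iwn_to_ili.get(synset, []):
            if relation in ALLOWED:  # other relations (e.g. eq_hyperonym) are dropped
                by_ili[ili_id].append(ev_id)

    links = []
    for ev_id, ili_id in tb_events:
        for it_id in by_ili.get(ili_id, []):
            links.append((ev_id, it_id, ili_id))
    return links


if __name__ == "__main__":
    tb = [("tb_e1", 1432563), ("tb_e2", 999)]              # 999 is a toy id
    it = [("it_e1", "cercare2"), ("it_e2", "toy_synset")]
    table = {
        "cercare2": [("eq_synonym", 1432563)],
        "toy_synset": [("eq_hyperonym", 999)],              # filtered out
    }
    print(link_events(tb, it, table))
```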
8
Evaluation: Similar Events
  • To what extent is the introduction of WN senses
    useful for event identification?
  • Verify the Semantic Homogeneity Hypothesis:
    events with (almost) the same meaning are
    assigned the same TimeML class, i.e. are
    semantically homogeneous.

Automatic extraction of all events (nouns and
verbs) with the same ILI from both corpora
  • 56 common event synsets

DATA SPARSENESS
  • 35 common event synsets for verbs vs. 11 common
    event synsets for nouns
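
A minimal sketch of the extraction and of the check behind the Semantic Homogeneity Hypothesis: collect the ILI records found in both corpora, then compare the TimeML classes assigned on each side. The input layout and the toy data are assumptions made for this example. The share is computed here per common synset, which may differ from the unit the authors counted.

```python
from collections import defaultdict


def homogeneity(tb_events, it_events):
    """tb_events / it_events: lists of (ili_id, timeml_class) pairs.

    Returns (sorted common ILI ids, share of common ids whose sets of
    TimeML classes are identical in the two corpora).
    """
    def classes_by_ili(events):
        table = defaultdict(set)
        for ili_id, timeml_class in events:
            table[ili_id].add(timeml_class)
        return table

    tb, it = classes_by_ili(tb_events), classes_by_ili(it_events)
    common = sorted(set(tb) & set(it))
    agreeing = [i for i in common if tb[i] == it[i]]
    share = len(agreeing) / len(common) if common else 0.0
    return common, share


if __name__ == "__main__":
    # Toy ids and classes, invented for the example.
    tb = [(1, "OCCURRENCE"), (2, "I_ACTION"), (3, "STATE")]
    it = [(1, "OCCURRENCE"), (2, "I_STATE"), (4, "REPORTING")]
    print(homogeneity(tb, it))
```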

9
Evaluation: Similar Events - Verbs
Analysis of common event synsets with a
significant number of occurrences in both
languages: 25 event synsets, each with at least 5
occurrences
  • for each event token we analyzed its TimeML
    class and its semantic pattern:
  • basic argument structure, e.g. ARG0 E ARG1
    ARG2
  • semantic class of each argument and thematic
    role, e.g. ARG0PersonAgent
  • subvalency features, e.g. PersonAgent Def_Np
    E EventThemeClause
  • 30 different patterns have been identified for
    the 25 common synsets
  • 93.22% of cases support the Semantic Homogeneity
    Hypothesis: same meaning, same semantic pattern,
    same TimeML class
  • instances of event subcategorization (5 cases),
    i.e. more than one pattern.

10
Evaluation: Similar Events - Verbs (2)
< 10% of cases seem to question the validity of
Semantic Homogeneity
NOT A COUNTEREXAMPLE
ILI: 1432563; WN: seek3; IWN: cercare2
same semantic pattern: person/organization E event
TimeBank class: I_ACTION; Italian TB class: I_STATE
Inconsistency of the data is due to the
exploitation of the SIMPLE-TimeML Mapping and
Heuristics (Caselli et al. 2007)
- SIMPLE-TimeML Mapping: SIMPLE Semantic_type
Modal Event -> I_STATE
- cercare2: Modal Event -> I_STATE;
Purpose Act -> I_ACTION
All other possible counterexamples we have found
can be explained by factors other than a real
difference between event realizations in the 2
languages
11
Evaluation: Similar Events - Nouns
All 11 common types have been analyzed. They are
all instances of nominalization of a
corresponding event verb.
The presence of WN senses is useful for identifying
incorrect or inconsistent annotations in the
source and target corpora, and for more easily
identifying those instances which satisfy the
criteria for an event in TimeML
Incorrect annotations in Italian TB: missing
semantic types in SIMPLE, e.g. aumento_n has 3
senses in IWN but 1 semantic type in SIMPLE
Incorrect annotations in TB: over-extension of
the notion nominalization = event, e.g.
payment_n: 8/10 occurrences are marked as EVENT
when their meaning is "a sum of money".
BUT WN senses are not always sufficient to
determine whether a nominal realizes an event or
not, because the lexicon contains cases where the
(non-)eventive reading is, somehow, always
possible.
12
Experiments: the ILI as Bootstrapping Device
  • Can the ILI and wordnet senses be used as a
    bootstrapping strategy for the creation of
    comparable corpora?
  • Key idea: if the Semantic Homogeneity Hypothesis
    holds, one layer of annotation can be imported
    from a source corpus to a target one.

To verify the validity of this hypothesis we
developed a system which takes as input the
events augmented with WN senses from the TB, and
gives as output an additional layer of
annotation, i.e. it creates the EVENT tag in
Italian.
[Diagram; labels: Italian Corpus IWN sense, Italian Corpus IWN sense (partial) EVENT annotation, TB WN sense, ILI, P.O.S of TB EVENT]
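
The import of the EVENT layer described on this slide can be sketched as follows. This is not the authors' system: the function name, the data layout and the lookup table are hypothetical, and the rule that separates an "EVENT" from a "PROBABLE_EVENT" (ILI match with or without a part-of-speech match) is an assumption made only to keep the example concrete.

```python
def project_events(tb_event_keys, italian_tokens, iwn_to_ili):
    """Propose EVENT tags for sense-annotated Italian tokens.

    tb_event_keys:  set of (ili_id, pos) pairs observed on TimeBank events
    italian_tokens: list of dicts with keys 'form', 'pos', 'iwn_sense'
    iwn_to_ili:     dict iwn_sense -> ili_id (hypothetical lookup table)
    """
    event_ilis = {ili for ili, _ in tb_event_keys}
    tagged = []
    for tok in italian_tokens:
        ili = iwn_to_ili.get(tok["iwn_sense"])
        if ili is None or ili not in event_ilis:
            label = None                 # no link to a TimeBank event sense
        elif (ili, tok["pos"]) in tb_event_keys:
            label = "EVENT"              # same ILI record and same POS
        else:
            # ASSUMPTION: an ILI match without a POS match is treated as a
            # partial match and queued for human validation.
            label = "PROBABLE_EVENT"
        tagged.append((tok["form"], label))
    return tagged


if __name__ == "__main__":
    keys = {(1432563, "V")}
    tokens = [
        {"form": "cerca", "pos": "V", "iwn_sense": "cercare2"},
        {"form": "ricerca", "pos": "N", "iwn_sense": "toy_n1"},   # toy sense id
        {"form": "casa", "pos": "N", "iwn_sense": "casa1"},       # not in table
    ]
    table = {"cercare2": 1432563, "toy_n1": 1432563}
    print(project_events(keys, tokens, table))
```

Only the tokens labelled PROBABLE_EVENT would then need manual checking, which is where the reduction of annotation effort would come from.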
13
Experiments: the ILI as Bootstrapping Device (2)
  • To evaluate the reliability of this approach we
    have used the entire corpus of the Italian
    TreeBank where a total of 62,522 words (9,832
    verbs and 44,957 nouns) are manually assigned a
    sense from IWD.

Our system has identified 3,700 events (6.7%),
1,183 of which are considered "probable events"
that need human post-processing. 58 new event
synsets have been retrieved.
- identification of annotation inconsistencies,
i.e. over-extension of the notion of event for
nominalizations (e.g. movement4: social
movement)
- sense assignment is not sufficient to
disambiguate the eventive/non-eventive reading of
nominals, e.g. indication1, segnale1
- partial matches occur due to the way sense
annotation is performed with WN
- significant reduction of manual effort: only the
set of probable events requires validation, and it
is restricted to those words whose event reading
is not present in WN senses.
14
Conclusion
  • Identification of a new methodology to link
    comparable corpora in different languages by
    means of WN senses and the ILI
  • Data from the resulting resource can be used for
    contrastive analysis of events as well as
    multilingual temporal analysis of texts
  • There is a semantic homogeneity between similar
    events in different languages, including semantic
    preferences for thematic roles and TimeML
    classes
  • Sense assignment to events improves accuracy in
    annotation, in particular for event
    identification, and is useful for revealing
    inconsistencies and errors
  • A modification to TimeML is suggested: the
    introduction of a tag for ambiguous cases where
    a double reading (eventive/non-eventive) is
    always possible
  • The ILI can be used as a semi-automatic
    bootstrapping device to create resources by
    importing layers of annotation for words with
    similar sense

15
  • Thank You!!

16
Experiments and Evaluation: Similar Events -
Nouns (2)
Identification of the senses is not enough to
determine whether a nominal may realize an event
or not.
  • the pair "agreement1 eq_synonym intesa3,
    accordo3" does not have a clear-cut eventive
    sense in either wordnet, BUT
  • in TB 31/32 occurrences are tagged as events:
    over-extension of the event reading
  • in Italian TB only 7/16 occurrences are tagged
    as events: in Italian, intesa3 and accordo3
    cannot be systematically interpreted as events
  • no difference between the eventive and
    non-eventive readings is signalled in the WN and
    IWN senses!!
  • This calls for a refinement of annotation
    schemes for events, to provide explicit means to
    mark ambiguous cases where the double reading is,
    somehow, always possible.