Spanish FrameNet Project


Title: Spanish FrameNet Project


1
Spanish FrameNet Project
  • Autonomous University of Barcelona
  • Marc Ortega

2
Spanish FrameNet Project
  • Spanish FrameNet is a research project which is
    sponsored by the Department of Education of Spain
    (Grant No. TSI2005-01200) from December 2005 to
    December 2006.
  • A new grant proposal has been submitted to the
    Spanish Department of Education for the period
    2007-2009
  • SFN is developed at the Autonomous University of
    Barcelona (Spain) and the International Computer
    Science Institute (Berkeley, CA) in cooperation
    with the FrameNet Project.
  • PI: Carlos Subirats; System Analyst: Marc Ortega;
    2 linguists

3
SFN Goals
  • The Spanish FrameNet Project is creating an
    online lexical resource for Spanish, based on
    frame semantics and supported by corpus evidence.
  • SFN will be available to the public by July 2007
  • SFN will contain at least 1,000 lexical items
    (approx.): verbs, predicative nouns, adjectives,
    adverbs, prepositions and entities,
    representative of a wide range of semantic
    domains.
  • The aim is to document the range of semantic and
    syntactic combinatory possibilities (valences) of
    each word in each of its senses

4
Frame Semantics
  • Spanish FrameNet (SFN) uses FrameNet frames,
    modifying them where needed to fit Spanish
  • Some SFN Frames are the same as English FN (with
    Spanish examples)
  • Some SFN Frames have the same English FN name but
    they are different (slightly different
    definition, different FEs, or different core
    sets)
  • To adapt FN to Spanish, we defined some new frames
    (using the same FN format) and left some FN
    frames unused. Examples:
      • Cause_to_halt
      • Change_emotional_state
      • Collapse
      • Inventing
      • Motion_backwards, Motion_interruption,
        Motion_manner, Motion_medium,
        Motion_up_downwards
      • Return
      • Social_interaction
      • Think_up

5
Current Project Status
  • Frames defined: 92
  • Lexical units: 624
      • Annotated: 413
      • Subcorporated: 130
      • Created but without subcorporation: 23

6
Spanish FrameNet Corpus and Tools
  • Spanish FrameNet is using a 350 million word
    corpus
  • It includes both European and New World Spanish
    (40% and 60%, respectively)
  • The SFN Corpus has been developed by the SFN
    research team, since there are no (large) public
    domain Spanish corpora available
  • The SFN Corpus is lemmatized and tagged with a
    set of in-house tools
  • FNDesktop
  • Web Reports
  • Sato Tool

7
The SFN tagging and chunking system
  • The SFN Corpus is tagged and lemmatized using:
      • An electronic dictionary of Spanish with 600,000
        forms, expanded from a dictionary of 93,000
        lemmas:
          • 66,000 single-word lexical units, like unir
            (unite), inmoralidad (immorality), allí
            (there), etc.
          • 26,000 multi-word lexical units (MWLUs), like
            muerte cerebral (brain death), etc., which
            are automatically expanded into 55,000
            inflected MWLU forms.
      • A corpus tagger that converts plain text to
        deterministic finite state automata (DFSAs)
      • 2,000 finite state transducers (FSTs) for
        multi-word verbs
      • Transducers for the heads of verbal phrases
        (compound verbal tenses)
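
As a rough illustration of how a multi-word lexical unit can be expanded into its inflected forms, the sketch below pluralizes a noun + adjective unit. The function names and the simplified plural rule are assumptions made for this example; the actual SFN expansion rules are not described in the slides.

```python
# Toy expansion of a noun + adjective MWLU into singular and plural forms.
# Hypothetical helper names; the real SFN dictionary tools are not shown here.
VOWELS = "aeiou"

def pluralize(word: str) -> str:
    """Simplified Spanish plural: final unstressed vowel -> +s, otherwise +es.

    Deliberately ignores accented endings, z -> ces, and invariable words.
    """
    return word + "s" if word[-1] in VOWELS else word + "es"

def expand_noun_adj(mwlu: str) -> list[str]:
    """Return the singular form plus the plural form with noun-adjective agreement."""
    noun, adj = mwlu.split()
    return [mwlu, f"{pluralize(noun)} {pluralize(adj)}"]

forms = expand_noun_adj("muerte cerebral")
print(forms)  # ['muerte cerebral', 'muertes cerebrales']
```

Two forms per unit is in line with the figures on the slide (26,000 units expanding into 55,000 forms, a little over two forms each).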

8
The SFN tagging and chunking system
  • The POS tagging process produces two corpus
    formats:
      • Automata Corpus
      • IMS-CWB (Institut für Maschinelle
        Sprachverarbeitung, Corpus Workbench)

9
Automata Corpus
  • Lexical tagging (part-of-speech, lemma)
  • Word ambiguities are represented in deterministic
    finite state automata (DFSAs) as different
    possible transitions between two consecutive
    states
  • Allows efficient word disambiguation
  • Allows extended lexical tagging using automata
    transduction
  • Compound verbal forms tagging
  • Multi-word verb recognition
  • Very high processing rates
  • Almost impossible for humans to read directly
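
The idea of storing ambiguities as alternative transitions between two consecutive states can be sketched as follows. The tag names (DET, PRON, N, V), the lemmas, and the single disambiguation rule are assumptions for illustration only; the SFN tag set and rules are not given in the slides.

```python
from math import prod

# Each word contributes a set of alternative (lemma, POS) transitions
# between state i and state i + 1 of the sentence automaton.
sentence = [
    ("la", {("el", "DET"), ("lo", "PRON")}),
    ("casa", {("casa", "N"), ("casar", "V")}),
]

def count_readings(lattice):
    """Number of distinct paths through the automaton."""
    return prod(len(edges) for _, edges in lattice)

def prefer_det_noun(lattice):
    """Toy local rule: when a DET reading is followed by an N reading, keep only those."""
    result = [(form, set(edges)) for form, edges in lattice]
    for i in range(len(result) - 1):
        det = {e for e in result[i][1] if e[1] == "DET"}
        noun = {e for e in result[i + 1][1] if e[1] == "N"}
        if det and noun:
            result[i] = (result[i][0], det)
            result[i + 1] = (result[i + 1][0], noun)
    return result

resolved = prefer_det_noun(sentence)
print(count_readings(sentence), count_readings(resolved))  # 4 1
```

Because every rule only prunes transitions between neighbouring states, disambiguation never has to enumerate whole sentence readings, which is one plausible reason for the processing rates mentioned above.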

10
CWB Corpus
  • Lexical tagging (part-of-speech, lemma)
  • Text DFSAs are disambiguated and converted to XML
    format
  • Unambiguous corpus
  • Allows human access to corpus contents
  • Allows human corpus search
  • Corpus contents are codified and indexed for an
    efficient corpus search
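
For orientation, CWB's encoder conventionally reads "verticalized" text: one token per line with tab-separated attributes, and XML-style tags on their own lines for structural units such as sentences. The sketch below writes disambiguated tokens in that shape. The attribute order (form, POS, lemma) and the `s` element are assumptions; the exact layout used by SFN is not shown in the slides.

```python
def to_vertical(sentences):
    """Serialize disambiguated sentences as one token per line inside <s> elements.

    `sentences` is a list of sentences, each a list of (form, pos, lemma) tuples.
    """
    lines = []
    for idx, sent in enumerate(sentences, start=1):
        lines.append(f'<s id="{idx}">')
        for form, pos, lemma in sent:
            lines.append("\t".join((form, pos, lemma)))
        lines.append("</s>")
    return "\n".join(lines) + "\n"

vertical = to_vertical([[("la", "DET", "el"), ("casa", "N", "casa")]])
print(vertical, end="")
```

Since each token now carries exactly one tag and one lemma, the result is readable by humans and can be indexed for search.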

11
Multi-word verb recognition
  • Inflectional morphological properties are kept
  • An intervening adverb such as siempre (always) is
    detected between the core verb and the idiom
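
A minimal sketch of this kind of recognition, assuming a lemmatized and tagged token stream. The idiom tomar el pelo (to tease), the tag names, and the function name are chosen for illustration and do not come from the slides; SFN does this with finite state transducers rather than a Python loop.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str
    lemma: str
    pos: str

def find_multiword_verb(tokens, verb_lemma, idiom_forms):
    """Return (start, end) indices of verb + idiom, skipping adverbs in between, or None."""
    for start, tok in enumerate(tokens):
        if tok.pos != "V" or tok.lemma != verb_lemma:
            continue
        i = start + 1
        # Allow any number of adverbs between the core verb and the idiom.
        while i < len(tokens) and tokens[i].pos == "ADV":
            i += 1
        tail = [t.form.lower() for t in tokens[i:i + len(idiom_forms)]]
        if tail == list(idiom_forms):
            return start, i + len(idiom_forms)
    return None

sentence = [
    Token("Nos", "nosotros", "PRON"),
    Token("toma", "tomar", "V"),
    Token("siempre", "siempre", "ADV"),
    Token("el", "el", "DET"),
    Token("pelo", "pelo", "N"),
]
span = find_multiword_verb(sentence, "tomar", ("el", "pelo"))
print(span, sentence[span[0]].form)  # (1, 5) toma
```

Matching on the verb's lemma while returning the original token span keeps the inflected form (toma) and its morphological tags available after recognition.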

12
Subcorporation Process
  • The in-house tools GramCreator and XQS are used
    to create subcorporation grammars

Request solicitud N-de-GN-de <PALABRA>
4 <NPRED> ( <APRED> <PALABRA> )
<de.PREP> ( (<PRON> (
( <E> <PREDET> ) ( <E> <DET>
<APOS> ) ( <E> <APRED> <VPREDPP>
) )) <N> (<NPROP> ( <E> <NPROP>
)) ) <de.PREP>
Solicitud grammar example: the syntactic
structure N-de-GN-de is detected
13
Subcorporation Process
  • Each grammar (regular expression) is converted to
    a Finite State Transducer (FST)
  • Each LU's subcorpus is transduced with a set of
    grammar FSTs to produce a set of subcorpora
  • The transduction process achieves high
    processing rates (100 transductions per second)
  • The subcorporation set is converted to XML and
    imported into FNDesktop
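
The grouping of an LU's sentences into subcorpora by grammar can be approximated with ordinary regular expressions over tagged text. This sketch uses Python's `re` module in place of compiled FSTs, and the `form/TAG` encoding, the two grammar patterns, and the first-match policy are all assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical grammars over space-separated "form/TAG" tokens,
# listed from most specific to least specific.
GRAMMARS = {
    "N-de-GN-de": re.compile(r"\bsolicitud/N de/PREP (?:\S+/DET )?\S+/N de/PREP\b"),
    "N-de-GN": re.compile(r"\bsolicitud/N de/PREP (?:\S+/DET )?\S+/N\b"),
}

def subcorporate(tagged_sentences):
    """Assign each sentence to the first grammar that matches it."""
    subcorpora = defaultdict(list)
    for sent in tagged_sentences:
        for name, pattern in GRAMMARS.items():
            if pattern.search(sent):
                subcorpora[name].append(sent)
                break
    return dict(subcorpora)

s1 = "solicitud/N de/PREP la/DET empresa/N de/PREP transporte/N"
s2 = "solicitud/N de/PREP ayuda/N"
s3 = "la/DET solicitud/N fue/V rechazada/V"
result = subcorporate([s1, s2, s3])
print(sorted(result))  # ['N-de-GN', 'N-de-GN-de']
```

Sentences that match no grammar (such as the third one) are simply left out of every subcorpus in this sketch.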

14
Subcorporation Process
N-de-GN-de structure detection
15
Annotation Tool
  • SFN uses the FN annotation tool (FNDesktop) to
    add semantic annotation to the LU subcorporation
    sets
  • The FNClassifier has been adapted to Spanish: the
    classifier has new rules tailored to the Spanish
    tags and Spanish local syntactic contexts

16
Annotation search tools (Web Reports)
17
Annotation search tools (Sato Tool)