Verb Valency Frame Extraction Using Morphological and Syntactic Features of Croatian

1
Verb Valency Frame Extraction Using Morphological
and Syntactic Features of Croatian
  • Krešimir Šojat, Željko Agić, Marko Tadić
  • Department of Linguistics, Department of
    Information Sciences, Faculty of Humanities and
    Social Sciences, University of Zagreb
  • ksojat, zagic, marko.tadic_at_ffzg.hr
  • FASSBL 7 Conference
  • Dubrovnik, Croatia, 2010-10-05

2
Overview
  • What?
  • extraction and semi-automatic construction of
    verb valency frames
  • How?
  • rule-based extraction procedure run on the
    Croatian dependency treebank
  • manual assignment of tectogrammatical functors
  • inference of rules for assigning functors to
    unseen text
  • Why?
  • creation of treebank-based verb valency lexicon
  • enhancement and enrichment of existing resources

3
Valency frames
  • valency frame extraction means detecting all
    environments of a particular verb that are
    attested in the treebank
  • such an approach aims at fast construction of
    valency frames
  • extraction is automatic: no frame elements are
    added manually by human annotators
  • such an automatically acquired verb valency
    lexicon can serve as a basis for further
    enrichment and enhancement of manually
    constructed resources, either existing or
    constructed from scratch

4
The treebank
  • Croatian Dependency Treebank (HOBS)
  • follows the guidelines of the Prague Dependency Treebank (PDT)
  • taken from the Croatia Weekly 100 kw sub-corpus
    of the Croatian National Corpus (HNK)
  • XCES-encoded up to the word level
  • sentence-delimited, tokenized, manually
    lemmatized and MSD-tagged
  • serves as the morphological layer of the treebank
  • annotated on the syntactic layer
  • approximately 2,700 sentences, 67,000 tokens
  • manually assigned syntactic functions
  • ca. 1,300 sentences double-checked and used in
    this experiment

5
The treebank

HR: Unija je već dogovorila neke mjere kako bi pomogla Hrvatskoj.
EN: The Union has already arranged some measures in order to help Croatia.

6
Extraction algorithm
  • the algorithm aims at extraction of verb valency
    frame instances
  • for each verb in the treebank sample, it descends:
  • one level down the dependency tree to retrieve
    subjects (Sb), objects (Obj), adverbials (Adv) and
    nominal predicates (Pnom)
  • two levels down to retrieve tokens of the same
    kinds that are introduced by subordinating
    conjunctions (AuxC) or prepositions (AuxP)
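
The descent rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Token` class, the names `DIRECT`, `LINKS` and `extract_frame`, and the in-memory tree are all hypothetical. The example tree uses the sentence from the treebank slide; the analytical function given to "pomogla" (Adv) is an assumption, since the slides do not state it.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Token:
    form: str
    afun: str       # analytical function, e.g. Sb, Obj, AuxP
    msd: str = ""   # morphosyntactic tag; left empty where the slides give none
    children: List["Token"] = field(default_factory=list)


DIRECT = {"Sb", "Obj", "Adv", "Pnom"}   # retrieved one level down
LINKS = {"AuxC", "AuxP"}                # conjunctions/prepositions to look through


def extract_frame(verb: Token) -> List[Tuple[Token, ...]]:
    """Collect frame elements one level below the verb, or two levels
    below it when the intermediate node is AuxC or AuxP."""
    frame = []
    for child in verb.children:
        if child.afun in DIRECT:
            frame.append((child,))
        elif child.afun in LINKS:
            for grandchild in child.children:
                if grandchild.afun in DIRECT:
                    frame.append((child, grandchild))
    return frame


# "Unija je već dogovorila neke mjere kako bi pomogla Hrvatskoj."
pomogla = Token("pomogla", "Adv")  # afun assumed for illustration
dogovorila = Token("dogovorila", "Pred", children=[
    Token("Unija", "Sb", "Ncfsn"),
    Token("mjere", "Obj", "Ncfpa"),
    Token("već", "Adv", "Rt"),
    Token("kako", "AuxC", "Css", [pomogla]),
])

FRAME = extract_frame(dogovorila)
LABELS = ["->".join(t.form for t in chain) for chain in FRAME]
print(LABELS)
```

Keeping the connective and the token it introduces together as one chain means the preposition or conjunction stays available as a feature of the frame slot.
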

7
Extraction algorithm
  • algorithm illustration
  • dogovorila (dogovoriti Pred)
  • Unija Ncfsn Sb, mjere Ncfpa Obj, već Rt Adv,
    kako Css AuxC

8
Extraction algorithm
  • the first version retrieved predicates only; it
    was then expanded to retrieve all the verbs in
    the treebank sample
  • the algorithm was adapted to retrieve any verb
    found in the dependency structure, regardless of
    its analytical function and position within the
    dependency tree
  • the adaptation raises the recall of the
    algorithm while maintaining its precision,
    because the simple set of descending rules is
    left unchanged
  • i.e. it retrieves as many verbs as possible given
    the limited size of the treebank sample used in
    the experiment

9
Extraction algorithm
  • the verb imati (Vmn) is annotated as object
    (Obj)

10
Extraction algorithm
  • thus, the number of frames extracted from each
    sentence corresponds to the number of its verbs
  • one frame for the main clause that captures the
    whole syntactic structure of the sentence
  • frames extracted from dependent clauses
  • naglasio (naglasiti Vmps-sma Pred)
  • Mikuška Np-sn Sb, kako->imati Css AuxC->Obj
  • imati (imati Vmn Obj)
  • stanovništvo Ncnsn Sb, korist Ncfsa Obj
  • od->projekta Spsg->Ncmsg AuxP->Adv
  • kroz->ekoturizam Spsa->Ncmsa AuxP->Adv
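
A sketch of the "one frame per verb" behaviour on the example above, assuming a flat row format (id, form, MSD, head id, analytical function). The row layout, the token order, and the function names are hypothetical; the tags and dependencies are those shown on the slide. Verbs are selected by an MSD starting with "V", so the infinitive "imati" yields a frame even though its analytical function is Obj.

```python
# (id, form, msd, head_id, afun); head_id 0 marks the root.
# Token order is arbitrary here; only the dependencies matter.
ROWS = [
    (1, "naglasio", "Vmps-sma", 0, "Pred"),
    (2, "Mikuška", "Np-sn", 1, "Sb"),
    (3, "kako", "Css", 1, "AuxC"),
    (4, "imati", "Vmn", 3, "Obj"),
    (5, "stanovništvo", "Ncnsn", 4, "Sb"),
    (6, "korist", "Ncfsa", 4, "Obj"),
    (7, "od", "Spsg", 4, "AuxP"),
    (8, "projekta", "Ncmsg", 7, "Adv"),
    (9, "kroz", "Spsa", 4, "AuxP"),
    (10, "ekoturizam", "Ncmsa", 9, "Adv"),
]

FRAME_AFUNS = {"Sb", "Obj", "Adv", "Pnom"}
LINK_AFUNS = {"AuxC", "AuxP"}


def children_of(rows, head_id):
    return [r for r in rows if r[3] == head_id]


def frames_for_all_verbs(rows):
    """Return one (verb form, slots) pair per verb, whatever its afun."""
    frames = []
    for vid, form, msd, _head, _afun in rows:
        if not msd.startswith("V"):
            continue
        slots = []
        for cid, cform, cmsd, _h, cafun in children_of(rows, vid):
            if cafun in FRAME_AFUNS:
                slots.append(f"{cform} {cmsd} {cafun}")
            elif cafun in LINK_AFUNS:
                for _gid, gform, gmsd, _gh, gafun in children_of(rows, cid):
                    if gafun in FRAME_AFUNS:
                        slots.append(
                            f"{cform}->{gform} {cmsd}->{gmsd} {cafun}->{gafun}"
                        )
        frames.append((form, slots))
    return frames


FRAMES = frames_for_all_verbs(ROWS)
for verb, slots in FRAMES:
    print(verb, slots)
```
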

11
Functor assignment
  • to annotate verbal frames we used a set of 5
    argument functors and 32 free modification
    functors
  • Argument functors: ACT, PAT, ADDR, ORIG, EFF
  • Temporal functors: TWHEN, TFHL, TFRWH, THL, THO,
    TOWH, TPAR, TSIN, TTILL
  • Locative and directional functors: DIR1, DIR2,
    DIR3, LOC
  • Functors for causal relations: AIM, CAUS, CNCS,
    COND, INTT
  • Functors for expressing manner: ACMP, CPR, CRIT,
    DIFF, EXT, MANN, MEANS, REG, RESL, RESTR
  • Functors for specific modifications: BEN, CONTRD,
    HER, SUBS
  • 936 frame instances were manually annotated for
    424 different verbs
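
One way to hold this inventory is a small lookup that can also check manually annotated frames for labels outside the set. The sketch below is an assumption about how such a check could look; the group keys and the `unknown_functors` helper are hypothetical names, while the functor labels are those listed above.

```python
FUNCTORS = {
    "argument": ["ACT", "PAT", "ADDR", "ORIG", "EFF"],
    "temporal": ["TWHEN", "TFHL", "TFRWH", "THL", "THO",
                 "TOWH", "TPAR", "TSIN", "TTILL"],
    "locative_directional": ["DIR1", "DIR2", "DIR3", "LOC"],
    "causal": ["AIM", "CAUS", "CNCS", "COND", "INTT"],
    "manner": ["ACMP", "CPR", "CRIT", "DIFF", "EXT",
               "MANN", "MEANS", "REG", "RESL", "RESTR"],
    "specific": ["BEN", "CONTRD", "HER", "SUBS"],
}

ALL_FUNCTORS = {f for group in FUNCTORS.values() for f in group}
FREE_MODIFICATIONS = ALL_FUNCTORS - set(FUNCTORS["argument"])


def unknown_functors(frame: str):
    """Return the labels in a space-separated frame that are not in the inventory."""
    return [label for label in frame.split() if label not in ALL_FUNCTORS]


print(len(FUNCTORS["argument"]), len(FREE_MODIFICATIONS))
print(unknown_functors("ACT ADDR AIM PAT"))
```

A check of this kind flags misspelled labels in annotated frames before they are counted as separate frame types.
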

12
Results
  • valency frame frequency across verb lemmas

Verb       Frequency
biti       188
imati      23
reći       15
dobiti     12
raditi     10
kazati     9
pokazati   8
postati    8
vidjeti    8
dati       7

raditi (en. to work, to do)
Valency frame        Frequency
ACT PAT              2
ACT CRIT LOC THL     1
ACT MANN TWHEN       1
ACT MEANS TWHEN      1
ACT PAT TSIN         1

dati (en. to give)
Valency frame        Frequency
ACT ADDR PAT         4
ACT ADDDR PAT        1
ACT ADDR AIM PAT     1
ACT PAT              1

13
Results
  • frequency of verb valency frames, i.e. n-tuples
    of tectogrammatical functors

Frame            Count   Percent
ACT PAT          250     26.71
PAT              157     16.77
ACT PAT TWHEN    30      3.21
ACT MANN PAT     23      2.46
ACT ADDR PAT     20      2.14
ACT LOC          20      2.14
ACT LOC PAT      20      2.14
MANN PAT         17      1.82
ACT CAUS PAT     16      1.71
ACT MANN         13      1.39
LOC PAT          12      1.28
ADDR PAT         11      1.18
Other            347     37.07
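
A table like this can be tallied from annotated frame instances as sketched below. The functors within each frame in the table appear in alphabetical order, so the sketch assumes a frame is an order-insensitive tuple of functors. The function names are hypothetical, and the sample instances are the frames listed for "raditi" in the per-verb table. The percentages in the table are consistent with a total of 936 annotated instances (for example, 250 / 936 = 26.71%).

```python
from collections import Counter


def frame_key(functors):
    """Normalise a frame instance to a sorted, space-separated string."""
    return " ".join(sorted(functors))


def frame_table(instances):
    """Return (frame, count, percent) rows, most frequent first."""
    counts = Counter(frame_key(f) for f in instances)
    total = sum(counts.values())
    return [(frame, n, round(100 * n / total, 2))
            for frame, n in counts.most_common()]


# Frames listed for "raditi" in the per-verb table, in arbitrary functor order.
RADITI = [
    ["PAT", "ACT"],
    ["ACT", "PAT"],
    ["ACT", "CRIT", "LOC", "THL"],
    ["TWHEN", "ACT", "MANN"],
    ["ACT", "MEANS", "TWHEN"],
    ["ACT", "PAT", "TSIN"],
]

TABLE = frame_table(RADITI)
for row in TABLE:
    print(row)

# Cross-check of the first row of the overall table against 936 instances.
ACT_PAT_PERCENT = round(100 * 250 / 936, 2)
```
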
14
Results
  • frames annotated with MSD, analytical functions
    and tectogrammatical functors

Lemma          Form          MSD             A-fun       Functor
djelovati (djeluje Pred)
  neozbiljno   Neozbiljno    Rnp             Adv         MANN
  odustajanje  odustajanje   Ncnsn           Sb          ACT
osloboditi (oslobodili Pred) ACT
  nikada       Nikada        Rt              Adv         THL
  zloduh       zloduha       Ncmsg           Obj         PAT
postati (postali Pred)
  studij       studiji       Ncmpn           Sb          ACT
  fakultet     fakultet      Ncmsn           Obj         PAT
zaustaviti (zaustavio Atr) ACT
  oni          ih            Pp3-pa--y-n--   Obj         PAT
  dolina       u->dolini     Spsl->Ncfsl     AuxP->Adv   LOC

15
Results
  • Distribution of (MSD, analytical function) pairs
    across tectogrammatical functors
  • serves as basis for defining functor assignment
    rules from MSD and analytical function

ACT (Actor)
A-fun        MSD              %
Sb           Ncmsn            14.91
Sb           Np-sn            13.50
Sb           Ncfsn            12.87
Sb           Ncmpn            9.89
Sb           Npfsn            5.65
Sb           Pi-mpn--n-a--    4.71
Sb           Ncfpn            3.30
Sb           Ncnsn            2.98
Sb           Pi-msn--n-a--    2.51
Sb           Pi-fsn--n-a--    1.88

PAT (Patient)
A-fun        MSD              %
Obj          Ncfsa            11.25
Obj          Ncmsa            9.18
Pnom         Ncmsn            5.69
Obj          Ncmpa            4.53
Obj          Vmn              4.40
Obj          Ncnsa            3.75
Obj          Ncfpa            3.49
Pnom         Ncfsn            2.72
(AuxC) Obj   (Css) Vmip3s     2.07
Obj          Ncmsn            1.81

LOC (Locative)
A-fun        MSD              %
(AuxP) Adv   (Spsl) Ncfsl     21.88
(AuxP) Adv   (Spsl) Ncmsl     16.41
(AuxP) Adv   (Spsl) Npmsl     10.16
(AuxP) Adv   (Spsl) Ncnsl     8.59
(AuxP) Adv   (Spsl) Npfsl     8.59
(AuxP) Adv   (Spsl) Ncmpl     5.47
(AuxP) Adv   (Spsl) Ncfpl     3.91
Adv          Rl               3.13
Adv          Css              1.56
(AuxP) Adv   (Spsg) Ncmsg     1.56
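
Distributions of this kind could be turned into assignment rules in several ways. The sketch below assumes the simplest one, a majority vote per (analytical function, MSD) pair; the slides do not say that this is the procedure used, and the function names are hypothetical. The training triples are taken from the annotated frames on the previous slide.

```python
from collections import Counter, defaultdict

# (analytical function, MSD, functor) triples from the annotated examples
ANNOTATED = [
    ("Adv", "Rnp", "MANN"),
    ("Sb", "Ncnsn", "ACT"),
    ("Adv", "Rt", "THL"),
    ("Obj", "Ncmsg", "PAT"),
    ("Sb", "Ncmpn", "ACT"),
    ("Obj", "Ncmsn", "PAT"),
    ("Obj", "Pp3-pa--y-n--", "PAT"),
    ("AuxP->Adv", "Spsl->Ncfsl", "LOC"),
]


def infer_rules(triples):
    """Map each (afun, MSD) pair to its most frequent functor."""
    by_pair = defaultdict(Counter)
    for afun, msd, functor in triples:
        by_pair[(afun, msd)][functor] += 1
    return {pair: counts.most_common(1)[0][0]
            for pair, counts in by_pair.items()}


def assign(rules, afun, msd, default=None):
    """Look up a functor for an unseen token; return default if no rule applies."""
    return rules.get((afun, msd), default)


RULES = infer_rules(ANNOTATED)
print(assign(RULES, "Sb", "Ncmpn"))
print(assign(RULES, "AuxP->Adv", "Spsl->Ncfsl"))
```

Pairs that are split across several functors in the data would need further features, such as the verb lemma or the preposition, to be resolved reliably.
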
16
Conclusions
  • in this experiment we have designed and
    implemented one possible approach
  • to semi-automatic extraction of a valency frame
    lexicon for Croatian verbs
  • to the refinement of existing lexicons by using
    the Croatian Dependency Treebank as an underlying
    resource
  • we have automatically extracted 2930 verb valency
    frame instances and manually annotated 936 of
    them, obtaining two results
  • the distribution of valency frames for each of
    the encountered verbs
  • the distribution of analytical functions and
    morphosyntactic tags for each of the
    tectogrammatical functors

17
Future work
  • the first result enables the enrichment of
    existing valency lexicons, such as CROVALLEX
  • the second result enables the implementation of a
    rule-based system for automatic assignment of
    tectogrammatical functors to morphosyntactically
    tagged and dependency-parsed unseen text
  • this procedure for automatic detection of valency
    frames will also be used in several other
    projects dealing with factored SMT (e.g. ACCURAT)
  • regarding dependency parsing of Croatian with
    the Croatian Dependency Treebank, we shall
    pursue various research directions in order to
    increase overall parsing accuracy

18
Thank you for your attention.
www.accurat-project.eu
  • The research within the project ACCURAT leading
    to these results has received funding from the
    European Union Seventh Framework Programme
    (FP7/2007-2013), grant agreement no 248347.