Title: Verb Valency Frame Extraction Using Morphological and Syntactic Features of Croatian
1 Verb Valency Frame Extraction Using Morphological and Syntactic Features of Croatian
- Krešimir Šojat, Željko Agić, Marko Tadić
- Department of Linguistics, Department of Information Sciences
- Faculty of Humanities and Social Sciences, University of Zagreb
- ksojat, zagic, marko.tadic_at_ffzg.hr
- FASSBL 7 Conference
- Dubrovnik, Croatia, 2010-10-05
2 Overview
- What?
  - extraction and semi-automatic construction of verb valency frames
- How?
  - rule-based extraction procedure run on the Croatian Dependency Treebank
  - manual assignment of tectogrammatical functors
  - inference of rules for assigning functors to unseen text
- Why?
  - creation of a treebank-based verb valency lexicon
  - enhancement and enrichment of existing resources
3 Valency frames
- valency frame extraction means detecting all environments of a particular verb as found in the treebank
- this approach aims at fast construction of valency frames
- extraction is automatic; no frame elements are added manually by human annotators
- such an automatically acquired verb valency lexicon can serve as a basis for further enrichment and enhancement of manually constructed resources, either existing or built from scratch
4 The treebank
- Croatian Dependency Treebank (HOBS)
- follows the guidelines of the Prague Dependency Treebank
- taken from the Croatia Weekly 100 kw sub-corpus of the Croatian National Corpus (HNK)
- XCES-encoded up to the word level
- sentence-delimited, tokenized, manually lemmatized and MSD-tagged
- serves as the morphological layer of the treebank
- annotated on the syntactic layer
- approximately 2,700 sentences, 67,000 tokens
- manually assigned syntactic functions
- ca. 1,300 sentences double-checked and used in this experiment
5 The treebank
HR: Unija je već dogovorila neke mjere kako bi pomogla Hrvatskoj.
EN: The Union has already arranged some measures in order to help Croatia.
6 Extraction algorithm
- the algorithm aims at extracting verb valency frame instances
- for each verb in the treebank sample, it descends
  - one level down the dependency tree to retrieve subjects (Sb), objects (Obj), adverbials (Adv) and nominal predicates (Pnom)
  - two levels down to retrieve tokens of the same kinds that are introduced by subordinating conjunctions (AuxC) or prepositions (AuxP)
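The descent described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the `Node` class, the `extract_frame` function and the `->` joining of the connecting word with its dependent are assumptions made for the example. The sample tokens and tags are those of the verb imati as shown in this deck.

```python
from dataclasses import dataclass, field

# Analytical functions collected as frame elements (named on the slide).
FRAME_AFUNS = {"Sb", "Obj", "Adv", "Pnom"}
# Analytical functions of connecting nodes that the descent looks through.
LINK_AFUNS = {"AuxC", "AuxP"}


@dataclass
class Node:
    """Hypothetical treebank node: word form, MSD tag, analytical function."""
    form: str
    msd: str
    afun: str
    children: list = field(default_factory=list)  # list of Node


def extract_frame(verb):
    """Return (form, MSD, afun) triples for one verb's frame instance."""
    frame = []
    for child in verb.children:
        if child.afun in FRAME_AFUNS:
            # One level down: direct dependents of the verb.
            frame.append((child.form, child.msd, child.afun))
        elif child.afun in LINK_AFUNS:
            # Two levels down: dependents introduced by a conjunction
            # or a preposition.
            for grandchild in child.children:
                if grandchild.afun in FRAME_AFUNS:
                    frame.append((
                        f"{child.form}->{grandchild.form}",
                        f"{child.msd}->{grandchild.msd}",
                        f"{child.afun}->{grandchild.afun}",
                    ))
    return frame


# The verb imati with its dependents, tags as given in this deck.
imati = Node("imati", "Vmn", "Obj", [
    Node("stanovništvo", "Ncnsn", "Sb"),
    Node("korist", "Ncfsa", "Obj"),
    Node("od", "Spsg", "AuxP", [Node("projekta", "Ncmsg", "Adv")]),
    Node("kroz", "Spsa", "AuxP", [Node("ekoturizam", "Ncmsa", "Adv")]),
])

frame = extract_frame(imati)
for element in frame:
    print(*element)
```

Dependents with any other analytical function are simply skipped, which keeps the rule set as small as the one described on the slide.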
7 Extraction algorithm
- algorithm illustration
- dogovorila (dogovoriti Pred)
  - Unija Ncfsn Sb
  - mjere Ncfpa Obj
  - već Rt Adv
  - kako Css AuxC
8 Extraction algorithm
- the first version retrieved predicates only and was then expanded to retrieve all verbs from the treebank sample
- the algorithm was adapted to retrieve any verb found in the dependency structure, regardless of its analytical function and position within the dependency tree
- the adaptation raises the recall of the algorithm while maintaining its precision, because the simple set of descending rules is unchanged
  - i.e. it retrieves as many verbs as possible given the limited size of the treebank sample used in the experiment
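The difference between the two versions can be illustrated with a small sketch. The function names are hypothetical, and selecting verbs by an MSD that starts with "V" is an assumption based on the verb tags shown in this deck (Vmn, Vmps-sma); the slides do not state how verbs are detected. The token list is a partial, illustrative selection from one treebank sentence.

```python
# (word form, MSD, analytical function); tags as given in this deck.
tokens = [
    ("naglasio", "Vmps-sma", "Pred"),
    ("Mikuška", "Np-sn", "Sb"),
    ("kako", "Css", "AuxC"),
    ("imati", "Vmn", "Obj"),
    ("stanovništvo", "Ncnsn", "Sb"),
    ("korist", "Ncfsa", "Obj"),
]


def predicates_only(tokens):
    """First version: only nodes with the analytical function Pred."""
    return [form for form, msd, afun in tokens if afun == "Pred"]


def all_verbs(tokens):
    """Adapted version: any verb, whatever its analytical function.

    Assumption: a token is a verb if its MSD starts with 'V'.
    """
    return [form for form, msd, afun in tokens if msd.startswith("V")]


pred_verbs = predicates_only(tokens)
any_verbs = all_verbs(tokens)
print(pred_verbs)
print(any_verbs)
```

Here the predicate-only selection finds one verb, while the adapted selection also finds imati, which is annotated as Obj.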
9 Extraction algorithm
- the verb imati (Vmn) is annotated as object (Obj)
10 Extraction algorithm
- thus, the number of frames extracted from each sentence corresponds to the number of verbs in it
  - one frame for the main clause that captures the whole syntactic structure of the sentence
  - frames extracted from dependent clauses
- naglasio (naglasiti Vmps-sma Pred)
  - Mikuška Np-sn Sb
  - kako->imati Css AuxC->Obj
- imati (imati Vmn Obj)
  - stanovništvo Ncnsn Sb
  - korist Ncfsa Obj
  - od->projekta Spsg->Ncmsg AuxP->Adv
  - kroz->ekoturizam Spsa->Ncmsa AuxP->Adv
11 Functor assignment
- to annotate verbal frames we used a set of 5 argument functors and 32 free-modification functors
  - argument functors: ACT, PAT, ADDR, ORIG, EFF
  - temporal functors: TWHEN, TFHL, TFRWH, THL, THO, TOWH, TPAR, TSIN, TTILL
  - locative and directional functors: DIR1, DIR2, DIR3, LOC
  - functors for causal relations: AIM, CAUS, CNCS, COND, INTT
  - functors for expressing manner: ACMP, CPR, CRIT, DIFF, EXT, MANN, MEANS, REG, RESL, RESTR
  - functors for specific modifications: BEN, CONTRD, HER, SUBS
- 936 frame instances were manually annotated for 424 different verbs
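The functor inventory above can be written down as a small data structure, which makes it easy to check an annotated frame against the allowed labels. This is only a sketch: the variable names, the group keys and the `unknown_functors` helper are hypothetical; the functor labels themselves are the ones listed on the slide.

```python
ARGUMENT_FUNCTORS = ["ACT", "PAT", "ADDR", "ORIG", "EFF"]

# Group keys are illustrative names for the slide's categories.
FREE_MODIFICATION_FUNCTORS = {
    "temporal": ["TWHEN", "TFHL", "TFRWH", "THL", "THO",
                 "TOWH", "TPAR", "TSIN", "TTILL"],
    "locative_directional": ["DIR1", "DIR2", "DIR3", "LOC"],
    "causal": ["AIM", "CAUS", "CNCS", "COND", "INTT"],
    "manner": ["ACMP", "CPR", "CRIT", "DIFF", "EXT",
               "MANN", "MEANS", "REG", "RESL", "RESTR"],
    "specific": ["BEN", "CONTRD", "HER", "SUBS"],
}

ALL_FUNCTORS = set(ARGUMENT_FUNCTORS)
for group in FREE_MODIFICATION_FUNCTORS.values():
    ALL_FUNCTORS.update(group)


def functor_group(functor):
    """Return 'argument', a free-modification group name, or None."""
    if functor in ARGUMENT_FUNCTORS:
        return "argument"
    for name, members in FREE_MODIFICATION_FUNCTORS.items():
        if functor in members:
            return name
    return None


def unknown_functors(frame):
    """Return the labels in a frame that are not in the inventory."""
    return [f for f in frame if f not in ALL_FUNCTORS]


n_free = sum(len(g) for g in FREE_MODIFICATION_FUNCTORS.values())
print(len(ARGUMENT_FUNCTORS), n_free)
print(functor_group("TWHEN"), unknown_functors(["ACT", "ADDDR", "PAT"]))
```

A check of this kind catches labels that fall outside the 5 + 32 inventory before frames are counted.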
12 Results
- valency frame frequency across verb lemmas

Verb      Frequency
biti      188
imati     23
reći      15
dobiti    12
raditi    10
kazati    9
pokazati  8
postati   8
vidjeti   8
dati      7

raditi (en. to work, to do)
Valency frame     Frequency
ACT PAT           2
ACT CRIT LOC THL  1
ACT MANN TWHEN    1
ACT MEANS TWHEN   1
ACT PAT TSIN      1

dati (en. to give)
Valency frame     Frequency
ACT ADDR PAT      4
ACT ADDDR PAT     1
ACT ADDR AIM PAT  1
ACT PAT           1
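Per-verb frame tables like the ones above can be produced by counting annotated frame instances. The sketch below is an assumption about how such counting might be done, not the authors' code: it treats a frame as an order-independent set of functors written in alphabetical order (which is how the frames appear on the slides), and the function names are hypothetical. The instances are reconstructed from the raditi rows shown here; the order of functors inside each input list is illustrative.

```python
from collections import Counter, defaultdict


def frame_key(functors):
    """Normalize a frame to an alphabetically sorted tuple of functors."""
    return tuple(sorted(functors))


def count_frames(instances):
    """Map each verb lemma to a Counter of its normalized frames."""
    counts = defaultdict(Counter)
    for lemma, functors in instances:
        counts[lemma][frame_key(functors)] += 1
    return counts


# (verb lemma, functors of one annotated frame instance)
instances = [
    ("raditi", ["PAT", "ACT"]),
    ("raditi", ["ACT", "PAT"]),
    ("raditi", ["ACT", "LOC", "CRIT", "THL"]),
    ("raditi", ["TWHEN", "ACT", "MANN"]),
    ("raditi", ["ACT", "MEANS", "TWHEN"]),
    ("raditi", ["ACT", "PAT", "TSIN"]),
]

counts = count_frames(instances)
for frame, n in counts["raditi"].most_common():
    print(" ".join(frame), n)
```

Sorting the functors means that two instances with the same functors in a different order are counted as the same frame.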
13 Results
- frequency of verb valency frames, i.e. n-tuples of tectogrammatical functors

Frame           Count  Percent
ACT PAT         250    26.71
PAT             157    16.77
ACT PAT TWHEN   30     3.21
ACT MANN PAT    23     2.46
ACT ADDR PAT    20     2.14
ACT LOC         20     2.14
ACT LOC PAT     20     2.14
MANN PAT        17     1.82
ACT CAUS PAT    16     1.71
ACT MANN        13     1.39
LOC PAT         12     1.28
ADDR PAT        11     1.18
Other           347    37.07
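The Percent column follows directly from the counts: each count is divided by the total number of annotated frame instances. A short check, with the counts copied from the table above (the variable and function names are only illustrative):

```python
frame_counts = {
    "ACT PAT": 250,
    "PAT": 157,
    "ACT PAT TWHEN": 30,
    "ACT MANN PAT": 23,
    "ACT ADDR PAT": 20,
    "ACT LOC": 20,
    "ACT LOC PAT": 20,
    "MANN PAT": 17,
    "ACT CAUS PAT": 16,
    "ACT MANN": 13,
    "LOC PAT": 12,
    "ADDR PAT": 11,
    "Other": 347,
}

# The counts add up to the number of manually annotated frame instances.
total = sum(frame_counts.values())


def percent(count, total):
    """Share of all frame instances, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)


percents = {frame: percent(n, total) for frame, n in frame_counts.items()}
for frame, n in frame_counts.items():
    print(f"{frame:<15} {n:>4} {percents[frame]:>6.2f}")
```

The total comes out at 936, which matches the number of manually annotated frame instances reported earlier in the deck.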
14 Results
- frames annotated with MSD, analytical functions and tectogrammatical functors
- djelovati (djeluje Pred)
  - neozbiljno Neozbiljno Rnp Adv MANN
  - odustajanje odustajanje Ncnsn Sb ACT
- osloboditi (oslobodili Pred)
  - ACT
  - nikada Nikada Rt Adv THL
  - zloduh zloduha Ncmsg Obj PAT
- postati (postali Pred)
  - studij studiji Ncmpn Sb ACT
  - fakultet fakultet Ncmsn Obj PAT
- zaustaviti (zaustavio Atr)
  - ACT
  - oni ih Pp3-pa--y-n-- Obj PAT
  - dolina u->dolini Spsl->Ncfsl AuxP->Adv LOC
15 Results
- distribution of (MSD, analytical function) pairs across tectogrammatical functors
- serves as a basis for defining functor assignment rules from MSD and analytical function

ACT (Actor)
A-fun  MSD            %
Sb     Ncmsn          14.91
Sb     Np-sn          13.50
Sb     Ncfsn          12.87
Sb     Ncmpn          9.89
Sb     Npfsn          5.65
Sb     Pi-mpn--n-a--  4.71
Sb     Ncfpn          3.30
Sb     Ncnsn          2.98
Sb     Pi-msn--n-a--  2.51
Sb     Pi-fsn--n-a--  1.88

PAT (Patient)
A-fun       MSD           %
Obj         Ncfsa         11.25
Obj         Ncmsa         9.18
Pnom        Ncmsn         5.69
Obj         Ncmpa         4.53
Obj         Vmn           4.40
Obj         Ncnsa         3.75
Obj         Ncfpa         3.49
Pnom        Ncfsn         2.72
(AuxC) Obj  (Css) Vmip3s  2.07
Obj         Ncmsn         1.81

LOC (Locative)
A-fun       MSD           %
(AuxP) Adv  (Spsl) Ncfsl  21.88
(AuxP) Adv  (Spsl) Ncmsl  16.41
(AuxP) Adv  (Spsl) Npmsl  10.16
(AuxP) Adv  (Spsl) Ncnsl  8.59
(AuxP) Adv  (Spsl) Npfsl  8.59
(AuxP) Adv  (Spsl) Ncmpl  5.47
(AuxP) Adv  (Spsl) Ncfpl  3.91
Adv         Rl            3.13
Adv         Css           1.56
(AuxP) Adv  (Spsg) Ncmsg  1.56
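To show how such a distribution might feed functor assignment rules, here is a deliberately simple lookup sketch. It is not the authors' rule set: the mapping contains only a handful of (analytical function, MSD) pairs copied from the table, it treats every listed pair as unambiguous, and the names `PAIR_TO_FUNCTOR`, `assign_functor` and `assign_frame` are hypothetical. Elements introduced by a preposition are written with `->` joining the preposition and the noun.

```python
# Hypothetical rules: each (analytical function, MSD) pair from the table
# is mapped to the functor under which it is listed.
PAIR_TO_FUNCTOR = {
    ("Sb", "Ncmsn"): "ACT",
    ("Sb", "Np-sn"): "ACT",
    ("Sb", "Ncfsn"): "ACT",
    ("Sb", "Ncmpn"): "ACT",
    ("Obj", "Ncfsa"): "PAT",
    ("Obj", "Ncmsa"): "PAT",
    ("Pnom", "Ncmsn"): "PAT",
    ("Obj", "Ncmsn"): "PAT",
    ("AuxP->Adv", "Spsl->Ncfsl"): "LOC",
    ("AuxP->Adv", "Spsl->Ncmsl"): "LOC",
    ("AuxP->Adv", "Spsl->Npmsl"): "LOC",
}


def assign_functor(afun, msd):
    """Return the functor for a known pair, or None if no rule applies."""
    return PAIR_TO_FUNCTOR.get((afun, msd))


def assign_frame(elements):
    """Assign a functor to every (afun, MSD) element of one frame."""
    return [assign_functor(afun, msd) for afun, msd in elements]


# Frame elements of the verb postati as annotated in this deck.
postati = [("Sb", "Ncmpn"), ("Obj", "Ncmsn")]
assigned = assign_frame(postati)
print(assigned)
```

For this frame the lookup reproduces the manual annotation (ACT, PAT). A realistic system would need to handle pairs that occur under several functors and pairs that were never seen, which this sketch simply leaves as None.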
16 Conclusions
- in this experiment we designed and implemented one possible approach
  - to semi-automatic extraction of a valency frame lexicon for Croatian verbs
  - to the refinement of existing lexicons by using the Croatian Dependency Treebank as an underlying resource
- we automatically extracted 2930 verb valency frame instances and annotated 936 frames, which gave us
  - the distribution of valency frames for each of the encountered verbs
  - the distribution of analytical functions and morphosyntactic tags for each of the tectogrammatical functors
17 Future work
- the first result enables the enrichment of existing valency lexicons, such as CROVALLEX
- the second result enables the implementation of a rule-based system for automatic assignment of tectogrammatical functors to morphosyntactically tagged and dependency-parsed unseen text
- this procedure for automatic detection of valency frames will also be used in several other projects dealing with factored SMT (e.g. ACCURAT)
- regarding dependency parsing of Croatian with the Croatian Dependency Treebank, we shall pursue various research directions in order to increase overall parsing accuracy
18 Thank you for your attention.
www.accurat-project.eu
- The research within the project ACCURAT leading
to these results has received funding from the
European Union Seventh Framework Programme
(FP7/2007-2013), grant agreement no 248347.