Title: Frame based extraction of term candidates from the domain of social welfare
1Frame based extraction of term candidates from
the domain of social welfare
CULT 2004 Barcelona, January 22-24th, 2004
2Overview
- introduction
- DOT-project
- term candidate extractor
- conclusions and future work
3Project
Embedded in the DOT project (Databank OverheidsTerminologie):
a contrastive database of governmental terminology for the Netherlands and Flanders
- Co-operation of
- Lexicology, University of Amsterdam
- IMS, University of Stuttgart
- Zeno, Belgian software company
4DOT
- Planned use as
- linguistic knowledge database
- monolingual lexicon
- intralingual relations
- translation help
- for both translators and gov. personnel
- active and passive
- terminology support for government personnel
- descriptive
- prescriptive
5DOT
concept: professional training
lexemes:
- BIJSCHOLING (Netherlands)
- HERSCHOLING (Netherlands)
- NASCHOLING (Netherlands and Flanders)
6DOT
- Problems with manual filling by experts
- not all entries are completely accurate
- lack of intuition
- time consuming task
- Our extractor should feed this database semi-automatically
- automatic part
- based on syntactic and semantic knowledge
- manual part
- experts correct and/or accept proposed data
7So
We developed a RULE BASED EXTRACTOR for TERM
CANDIDATES from CHUNKED CORPORA using FRAMES
8Frames
A representation model for semantic
knowledge (Minsky 1974)
- Frames consist of
- semantic slots which represent the semantic valence
- lexical fillers
FrameNet (Charles Fillmore, Berkeley) http://www.icsi.berkeley.edu/framenet/
9Frames example
TO PAY (is a verb of financial transaction)

SLOTS          FILLERS
Payer          citizen
Beneficiary    government
Payee          fee
Means          cash
Reason         breaking the law
Amount         E. 20,--
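As an illustration only, the TO PAY example can be written down as a small data structure. This is a minimal sketch, not the representation used in the DOT project: the class name `Frame`, its fields, and the method `add_filler` are assumptions introduced here, while the slot names and fillers are the ones from the slide.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    """Hypothetical container: a predicate with semantic slots and lexical fillers."""
    name: str
    is_a: str
    slots: dict = field(default_factory=dict)

    def add_filler(self, slot: str, filler: str) -> None:
        # Every slot collects all fillers observed for it.
        self.slots.setdefault(slot, []).append(filler)


# Slot names and fillers copied from the TO PAY example.
to_pay = Frame(name="TO PAY", is_a="verb of financial transaction")
for slot, filler in [
    ("Payer", "citizen"),
    ("Beneficiary", "government"),
    ("Payee", "fee"),
    ("Means", "cash"),
    ("Reason", "breaking the law"),
    ("Amount", "E. 20,--"),
]:
    to_pay.add_filler(slot, filler)
```

Keeping a list per slot means that later extraction runs can add further fillers (term candidates) to the same slot.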
10Corpus
- HFD-Corpus (Het Financiële Dagblad / Dutch Financial Times)
- approx. 8.5 million words
- tokenized
- POS-tagged and lemmatized
- (using TreeTagger, Schmid 1994, adapted to Dutch, tagset similar to the CGN tagset)
- chunked
- (using YAC_NL recursive chunker, Spranger 2002, Spranger and Heid, 2003)
11Chunking
12Chunking
- Feature attributes gathered in feature sets are annotated together with the chunks
- Current practice for specialized texts
- Annotation of the head lemma of a chunk
- Lexical properties of the head
- Proper name
- <np><np_f ne>Koningin Beatrix</np_f></np>
- Semantic class
- <np><np_f temp>eind februari</np_f></np>
- Structural properties of a chunk
- <np><np_f street>Beursplein 37</np_f></np>
- Text-structural properties of a chunk
- <np><np_f quot>absolute must</np_f></np>
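To show how such annotations could be consumed downstream, here is a minimal sketch that reads flat chunk annotations into records. It assumes the angle-bracket notation `<np><np_f FEATURES>TEXT</np_f></np>` with features separated by whitespace or `|`. The function name `parse_chunks` and the record layout are hypothetical, and nested chunks are not handled.

```python
import re

# Assumed flat notation: <cat><cat_f FEATURES>TEXT</cat_f></cat>
CHUNK_RE = re.compile(
    r"<(?P<cat>\w+)><(?P=cat)_f(?P<feats>[^>]*)>"
    r"(?P<text>[^<]*)</(?P=cat)_f></(?P=cat)>"
)


def parse_chunks(annotated: str) -> list:
    """Return one record per flat chunk found in the annotated string."""
    chunks = []
    for match in CHUNK_RE.finditer(annotated):
        # Features may be separated by whitespace or by "|".
        features = [f for f in re.split(r"[|\s]+", match.group("feats")) if f]
        chunks.append(
            {
                "category": match.group("cat"),
                "features": features,
                "text": match.group("text").strip(),
            }
        )
    return chunks


example = "<np><np_f temp>eind februari</np_f></np>"
records = parse_chunks(example)
```

For the example string, `records` holds a single `np` chunk with the feature `temp` and the text `eind februari`.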
13Chunking
De rekeningen worden per 1998 in Nederlandse guldens voldaan.
As of 1998, the bills are paid in Dutch guilders.

per 1998:
<pp>
  <pp_f card|temp|year>
    <pp_h per|1998>
      per
      <np>
        <np_f card|temp|year>
          <np_h 1998>
            1998
          </np_h>
        </np_f>
      </np>
    </pp_h>
  </pp_f>
</pp>

in Nederlandse guldens:
<pp>
  <pp_f norm|curr>
    <pp_h in|gulden>
      in
      <np>
        <np_f norm|curr>
          <np_h gulden>
            <ap>
              Nederlandse
            </ap>
            guldens
          </np_h>
        </np_f>
      </np>
    </pp_h>
  </pp_f>
</pp>
14Term candidate extractor
- Identification of important verbs from terminology domain
- Semi-automatic term candidate extraction
- Manual specification of semantic frames
- Manual specification of syntactico-semantic mappings
- Building query macros for different constructions
- Active constructions
- Passive constructions
- Nested constructions
- Fully automatic extraction of chunks / phrases
- Fully automatic mapping onto semantic frame elements
15Syntactico-semantic mapping example
TO PAY, passive constructions

BETAALDE (payee)
  NP (Subj)       De vergoeding wordt betaald.
  PP (aan)        Er is 500 mln aan rente betaald.
  PP (naast)      Naast salaris worden ook lonen betaald.
  PP (behalve)    De rekeninghouders wordt behalve rente ook dividend betaald.
  PP (voor)       10 mrd werd er in totaal voor het aandelenpakket betaald.

BETALER (payer)
  PP (door)       Leges worden normaliter door de burgers aan de gemeente betaald.

ONTVANGER (beneficiary)
  NP (Ind.Obj.)   De rekeninghouders wordt behalve rente ook dividend betaald.
  PP (aan)        De boete is aan de overheid betaald.
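The mapping for passive TO PAY constructions can be sketched as a lookup table from syntactic observables to frame elements. This is an illustrative encoding only: the names `PASSIVE_PAY_MAPPING`, `candidate_roles`, and `is_ambiguous` are assumptions, and the extractor itself works with query macros over the chunked corpus rather than a Python dictionary.

```python
# (chunk type, preposition or grammatical function) -> possible frame elements.
# Entries follow the passive-construction examples for TO PAY.
PASSIVE_PAY_MAPPING = {
    ("NP", "Subj"): ["BETAALDE"],
    ("PP", "aan"): ["BETAALDE", "ONTVANGER"],
    ("PP", "naast"): ["BETAALDE"],
    ("PP", "behalve"): ["BETAALDE"],
    ("PP", "voor"): ["BETAALDE"],
    ("PP", "door"): ["BETALER"],
    ("NP", "Ind.Obj."): ["ONTVANGER"],
}


def candidate_roles(chunk_type: str, marker: str) -> list:
    """Frame elements a syntactic structure may map onto (empty if unknown)."""
    return PASSIVE_PAY_MAPPING.get((chunk_type, marker), [])


def is_ambiguous(chunk_type: str, marker: str) -> bool:
    """True when one syntactic structure maps onto several frame elements."""
    return len(candidate_roles(chunk_type, marker)) > 1
```

With this table, `PP (door)` maps to BETALER only, while `PP (aan)` yields both BETAALDE and ONTVANGER and therefore needs further disambiguation.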
16Extraction task
Acquire term classes from syntactic observables
- ideal: 1:1 mapping from syntactic structure onto semantic element
- but often there is ambiguity: one syntactic structure maps onto several semantic elements
17Ambiguous structures
- PP (uit) NP is semantically ambiguous
- a SOURCE for the payment is indicated
- De premiebetaling wordt uit de rente voldaan.
- The premiums were paid from the interest.
- a DATE is indicated
- De premies uit mei en juni 2003 zijn overgemaakt.
- The premiums from May and June 2003 have been remitted.
18Disambiguation
PP (uit) NP is semantically ambiguous
Disambiguation using annotated features
DATE
De premies uit mei en juni 2003 zijn overgemaakt.
The premiums from May and June 2003 have been remitted.
<np><np_f temp>uit mei en juni 2003</np_f></np>
Temporal noun chunks are excluded from being potential term candidates
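The feature-based exclusion can be sketched as a simple filter. The `Chunk` type and the function `term_candidates` are hypothetical names, and the sketch assumes that temporal chunks carry a feature named `temp`, as in the annotated example.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    """Hypothetical record for an extracted chunk and its annotated features."""
    text: str
    features: frozenset


def term_candidates(chunks: list) -> list:
    # Chunks annotated as temporal are dropped; everything else stays a candidate.
    return [chunk for chunk in chunks if "temp" not in chunk.features]


chunks = [
    Chunk("uit de rente", frozenset()),
    Chunk("uit mei en juni 2003", frozenset({"temp"})),
]
kept = [chunk.text for chunk in term_candidates(chunks)]
```

Here only `uit de rente` is kept, which corresponds to the SOURCE reading of the PP.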
19Ambiguous structures
- PP (aan) NP is semantically ambiguous
- a THEME of the payment is indicated
- 500 mln werd er binnen een maand aan belasting betaald.
- Within a month, 500 mln of taxes were paid.
- a BENEFICIARY is indicated
- 500 mln werd er binnen een maand aan de overheid betaald.
- Within a month, 500 mln were paid to the government.
20Disambiguation
PP (aan) NP is semantically ambiguous
Disambiguation using annotated structure
Observation:
THEME → definite article is not possible
500 mln werd er binnen een maand aan de belasting betaald.
Within a month, 500 mln was paid to the taxes.
BENEFICIARY → definite article is required
500 mln werd er binnen een maand aan de overheid betaald.
Within a month, 500 mln was paid to the government.
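The observation about the definite article can be turned into a one-line rule. The sketch below assumes the noun phrase inside the `aan` PP is available as a list of tokens. The function name `classify_aan_pp` and the article list are introduced here for illustration and are not taken from the extractor.

```python
# Dutch definite articles; the presence of one is the deciding cue.
DEFINITE_ARTICLES = {"de", "het"}


def classify_aan_pp(np_tokens: list) -> str:
    """Label the NP inside a PP (aan) as THEME or BENEFICIARY.

    Rule from the observation: a BENEFICIARY requires a definite article,
    a THEME does not allow one.
    """
    if np_tokens and np_tokens[0].lower() in DEFINITE_ARTICLES:
        return "BENEFICIARY"
    return "THEME"


theme_label = classify_aan_pp(["belasting"])
beneficiary_label = classify_aan_pp(["de", "overheid"])
```

The rule only inspects the first token, so noun phrases with other determiners or pre-modifiers would need additional handling.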
21Output
De salarissen worden door de werkgever aan de werknemer betaald.
Salaries are paid to the employees by the employers.
22Output
, dat de zorg uit de verzekeringsgelden wordt betaald.
, that social services are paid by insurance premiums
23Output
Uw pensioen wordt door ons in contanten of per overboeking uitgekeerd.
Your pension is remitted by us in cash or by bank transfer.
24Output
25Evaluation
26Conclusions
- Overall precision of 86.02%
- Overall recall of 82.37%
- most urgent problem: ambiguous structures
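For reference, precision and recall are conventionally computed as below. The slides report only the resulting figures, so the evaluation unit (slot fillers are assumed here) and the toy data are hypothetical and do not reproduce the reported numbers.

```python
def precision_recall(extracted: set, gold: set) -> tuple:
    """Precision = correct / extracted, recall = correct / gold."""
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical (slot, filler) pairs, for illustration only.
gold = {
    ("BETALER", "burger"),
    ("ONTVANGER", "gemeente"),
    ("BETAALDE", "leges"),
    ("BETAALDE", "rente"),
}
extracted = {
    ("BETALER", "burger"),
    ("ONTVANGER", "gemeente"),
    ("BETAALDE", "leges"),
    ("BETAALDE", "mei"),
}
precision, recall = precision_recall(extracted, gold)
```

In this toy case three of the four extracted pairs are correct and three of the four gold pairs are found.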
27Future work
- Disambiguation of extraction results by co-occurrence comparisons between fillers that cannot be assigned unambiguously and unambiguously assignable fillers
- Full coverage of active sentences and of infinitivals such that broadest possible coverage of the texts to be analyzed can be achieved
- Collecting more relevant verb-based frames, providing the pertaining mapping and broadening the extraction exercise
28References
- Kermes, Hannah and Stefan Evert (2002). YAC - a recursive chunker for unrestricted German text. In Manuel Gonzalez Rodriguez and Carmen Paz Suarez Araujo, eds., Proceedings of the Third International Conference on Language Resources and Evaluation, vol. V, pp. 1805-1812.
- Martin, Willy et al. (2001). DOT Eindrapport (deel 1). Rapport van de Onderzoeksgroep VU-Lexicologie in het kader van het DOT-Project. Ms. Amsterdam: Vrije Universiteit.
- Minsky, Marvin (1974). A framework for representing knowledge. Ms. MIT.
- Schmid, Helmut (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, pp. 44-49.
- Spranger, Kristina (2002). A lexically informed chunking analysis as a starting point for the extraction of linguistic information and terminology from Dutch text. Master's thesis, University of Stuttgart, IMS.
- Spranger, Kristina and Ulrich Heid (2003). A Dutch Chunker as a Basis for the Extraction of Linguistic Knowledge. In Tanja Gaustad (ed.), Computational Linguistics in the Netherlands 2002. Selected Papers from the Thirteenth CLIN Meeting.
- Verhagen, Michel (2003). Frame van economische transacties. Technical report. Vrije Universiteit Amsterdam.