1
Frame-based extraction of term candidates from
the domain of social welfare
CULT 2004, Barcelona, January 22-24, 2004
2
Overview
  • introduction
  • DOT-project
  • term candidate extractor
  • conclusions and future work

3
Project
Embedded in the DOT project (Databank
OverheidsTerminologie): Contrastive Database of
Governmental Terminology, the Netherlands and
Flanders
  • Co-operation of
  • Lexicology, University of Amsterdam
  • IMS, University of Stuttgart
  • Zeno, Belgian software company

4
DOT
  • Planned use as
  • linguistic knowledge database
  • monolingual lexicon
  • intralingual relations
  • translation help
  • for both translators and gov. personnel
  • active and passive
  • terminology support for government personnel
  • descriptive
  • prescriptive

5
DOT
Concept: professional training
Lexemes:
  • BIJSCHOLING (Netherlands)
  • HERSCHOLING (Netherlands)
  • NASCHOLING (Netherlands and Flanders)
6
DOT
  • Problems with manual filling by experts
  • not all entries are completely accurate
  • lack of intuition
  • time consuming task
  • Our extractor should feed this database
    semi-automatically
  • automatic part
  • based on syntactic and semantic knowledge
  • manual part
  • experts correct and/or accept proposed data

7
So
We developed a RULE-BASED EXTRACTOR for TERM
CANDIDATES from CHUNKED CORPORA using FRAMES
8
Frames
A representation model for semantic
knowledge (Minsky 1974)
  • Frames consist of
  • semantic slots which represent the semantic
    valence
  • lexical fillers

FrameNet (Charles Fillmore, Berkeley):
http://www.icsi.berkeley.edu/framenet/
9
Frames example
TO PAY: is a verb of financial transaction

SLOTS        FILLERS
Payer        citizen
Beneficiary  government
Payee        fee
Means        cash
Reason       breaking the law
Amount       E. 20,--
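A minimal sketch of how such a frame could be held in code. The class name `Frame`, its fields, and the `unfilled` helper are illustrative assumptions, not part of the DOT tooling; only the slot names and fillers come from the TO PAY example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Frame:
    """A semantic frame: a predicate with semantic slots and lexical fillers."""
    name: str
    description: str
    # Slot name -> lexical filler; None marks a slot that is still open.
    slots: Dict[str, Optional[str]] = field(default_factory=dict)

    def unfilled(self) -> List[str]:
        """Return the names of slots that have no filler yet."""
        return [slot for slot, filler in self.slots.items() if filler is None]


# Slots and fillers copied from the TO PAY example.
to_pay = Frame(
    name="TO PAY",
    description="verb of financial transaction",
    slots={
        "Payer": "citizen",
        "Beneficiary": "government",
        "Payee": "fee",
        "Means": "cash",
        "Reason": "breaking the law",
        "Amount": "E. 20,--",
    },
)

# A hypothetical, partially filled instance to show the helper.
partial = Frame("TO PAY", "verb of financial transaction",
                {"Payer": "citizen", "Amount": None})
```

Here `to_pay.unfilled()` returns an empty list, while `partial.unfilled()` reports the open `Amount` slot.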
10
Corpus
  • HFD-Corpus (Het Financiële Dagblad / Dutch
    Financial Times)
  • approx. 8.5 million words
  • tokenized
  • POS-tagged and lemmatized
  • (using TreeTagger, Schmid 1994, adapted to Dutch
    with a tagset similar to the CGN tagset)
  • chunked
  • (using YAC_NL recursive chunker, Spranger 2002,
    Spranger and Heid, 2003)

11
Chunking
12
Chunking
  • Feature attributes gathered in feature sets are
    annotated together with the chunks
  • Current practice for specialized texts
  • Annotation of the head lemma of a chunk
  • Lexical properties of the head
  • Proper name
  • <np><np_f ne>Koningin Beatrix</np_f></np>
  • Semantic class
  • <np><np_f temp>eind februari</np_f></np>
  • Structural properties of a chunk
  • <np><np_f street>Beursplein 37</np_f></np>
  • Text-structural properties of a chunk
  • <np><np_f quot>absolute must</np_f></np>
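The feature annotations listed above can be read back with a small pattern matcher. This is only a sketch: it assumes each annotated noun chunk has the surface form `<np><np_f FEATURE>text</np_f></np>` with a single feature name, and the function name `read_chunks` is hypothetical.

```python
import re
from typing import List, Tuple

# Assumed surface form of one annotated noun chunk:
#   <np><np_f FEATURE>chunk text</np_f></np>
CHUNK_RE = re.compile(r"<np><np_f (?P<feature>\w+)>(?P<text>.*?)</np_f></np>")


def read_chunks(line: str) -> List[Tuple[str, str]]:
    """Return (feature, text) pairs for every annotated noun chunk in line."""
    return [(m.group("feature"), m.group("text")) for m in CHUNK_RE.finditer(line)]


# The four examples from the list above, joined into one string.
EXAMPLES = (
    "<np><np_f ne>Koningin Beatrix</np_f></np> "
    "<np><np_f temp>eind februari</np_f></np> "
    "<np><np_f street>Beursplein 37</np_f></np> "
    "<np><np_f quot>absolute must</np_f></np>"
)

chunks = read_chunks(EXAMPLES)
```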

13
Chunking
De rekeningen worden per 1998 in Nederlandse guldens voldaan.
As of 1998, the bills are paid in Dutch guilders.

per 1998:
<pp>
  <pp_f cardtempyear>
    <pp_h per1998>
      per
      <np>
        <np_f cardtempyear>
          <np_h 1998>
            1998
          </np_h>
        </np_f>
      </np>
    </pp_h>
  </pp_f>
</pp>

in Nederlandse guldens:
<pp>
  <pp_f normcurr>
    <pp_h ingulden>
      in
      <np>
        <np_f normcurr>
          <np_h gulden>
            <ap>
              Nederlandse
            </ap>
            guldens
          </np_h>
        </np_f>
      </np>
    </pp_h>
  </pp_f>
</pp>
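A nested chunk annotation like the one above can be turned into a tree with a simple stack. The sketch below assumes angle-bracket tags with an optional attribute string, and copies the attribute values exactly as they appear in this transcript. The names `parse_chunks` and `words` are hypothetical and unrelated to the YAC_NL chunker itself.

```python
import re

# One opening/closing tag with an optional attribute string, or one plain word.
TOKEN_RE = re.compile(r"<(/?)(\w+)(?:\s+([^>]*))?>|([^<\s]+)")


def parse_chunks(text):
    """Parse nested chunk markup into a list of {tag, value, children} nodes."""
    root = {"tag": "root", "value": None, "children": []}
    stack = [root]
    for match in TOKEN_RE.finditer(text):
        closing, tag, value, word = match.groups()
        if word is not None:
            stack[-1]["children"].append(word)      # plain token
        elif closing:
            if len(stack) == 1 or stack[-1]["tag"] != tag:
                raise ValueError(f"unexpected closing tag </{tag}>")
            stack.pop()
        else:
            node = {"tag": tag, "value": value, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
    if len(stack) != 1:
        raise ValueError(f"unclosed tag <{stack[-1]['tag']}>")
    return root["children"]


def words(node):
    """Collect the plain tokens below a node, in document order."""
    found = []
    for child in node["children"]:
        found.extend([child] if isinstance(child, str) else words(child))
    return found


EXAMPLE = ("<pp><pp_f cardtempyear><pp_h per1998>per "
           "<np><np_f cardtempyear><np_h 1998>1998</np_h></np_f></np>"
           "</pp_h></pp_f></pp>")
tree = parse_chunks(EXAMPLE)
```

Calling `words(tree[0])` recovers the tokens of the prepositional chunk, and each node keeps its attribute string in `value`.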
14
Term candidate extractor
  • Identification of important verbs from
    terminology domain
  • Semi-automatic term candidate extraction
  • Manual specification of semantic frames
  • Manual specification of syntactico-semantic
    mappings
  • Building query macros for different
    constructions
  • Active constructions
  • Passive constructions
  • Nested constructions
  • Fully automatic extraction of chunks / phrases
  • Fully automatic mapping onto semantic frame
    elements

15
Syntactico-semantic mapping example
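TO PAY, passive constructions

BETAALDE (payee)
  NP (Subj): De vergoeding wordt betaald.
    (The compensation is paid.)
  PP (aan): Er is 500 mln aan rente betaald.
    (500 million was paid in interest.)
  PP (naast): Naast salaris worden ook lonen betaald.
    (Besides salary, wages are also paid.)
  PP (behalve): De rekeninghouders wordt behalve rente ook dividend betaald.
    (Besides interest, the account holders are also paid dividend.)
  PP (voor): 10 mrd werd er in totaal voor het aandelenpakket betaald.
    (In total, 10 billion was paid for the share package.)

BETALER (payer)
  PP (door): Leges worden normaliter door de burgers aan de gemeente betaald.
    (Fees are normally paid by the citizens to the municipality.)

ONTVANGER (beneficiary)
  NP (Ind.Obj.): De rekeninghouders wordt behalve rente ook dividend betaald.
    (Besides interest, the account holders are also paid dividend.)
  PP (aan): De boete is aan de overheid betaald.
    (The fine has been paid to the government.)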
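The mapping above can be written down as a lookup table from syntactic pattern to frame element. The sketch below is one possible encoding under that reading of the slide. The names `PASSIVE_MAPPINGS`, `build_index`, `frame_elements`, and `ambiguous_patterns` are hypothetical; the extractor itself works with query macros over the chunked corpus.

```python
from collections import defaultdict

# (frame element, syntactic pattern) pairs as listed for the passive
# constructions of TO PAY.
PASSIVE_MAPPINGS = [
    ("BETAALDE", "NP (Subj)"),
    ("BETAALDE", "PP (aan)"),
    ("BETAALDE", "PP (naast)"),
    ("BETAALDE", "PP (behalve)"),
    ("BETAALDE", "PP (voor)"),
    ("BETALER", "PP (door)"),
    ("ONTVANGER", "NP (Ind.Obj.)"),
    ("ONTVANGER", "PP (aan)"),
]


def build_index(mappings):
    """Invert the list into: syntactic pattern -> candidate frame elements."""
    index = defaultdict(list)
    for element, pattern in mappings:
        index[pattern].append(element)
    return dict(index)


INDEX = build_index(PASSIVE_MAPPINGS)


def frame_elements(pattern):
    """Return every frame element a pattern may map onto (empty if unknown)."""
    return INDEX.get(pattern, [])


def ambiguous_patterns(index):
    """Patterns that map onto more than one frame element."""
    return sorted(p for p, elements in index.items() if len(elements) > 1)
```

With this table, `frame_elements("PP (door)")` yields a single element, while `PP (aan)` shows up in `ambiguous_patterns(INDEX)` because it can mark either the BETAALDE or the ONTVANGER.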
16
Extraction task
Acquire term classes from syntactic observables.
Ideal: a 1:1 mapping from syntactic structure onto
semantic element. But often there is ambiguity:
one syntactic structure maps onto several semantic
elements.
17
Ambiguous structures
  • PP (uit) NP is semantically ambiguous
  • a SOURCE for the payment is indicated
  • De premiebetaling wordt uit de rente voldaan.
  • The premiums are paid from the interest.
  • a DATE is indicated
  • De premies uit mei en juni 2003 zijn
    overgemaakt.
  • The premiums from May and June 2003 have been
    remitted.

18
Disambiguation
PP (uit) NP is semantically ambiguous.
Disambiguation using annotated features:
DATE: De premies uit mei en juni 2003 zijn overgemaakt.
The premiums from May and June 2003 have been remitted.
<np><np_f temp>uit mei en juni 2003</np_f></np>
Temporal noun chunks are excluded from being
potential term candidates.
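The exclusion of temporal chunks amounts to a filter on the annotated features. A minimal sketch, assuming each chunk is available as a pair of a feature set and its text; the function name `term_candidates` and the empty feature set on the second chunk are illustrative assumptions.

```python
from typing import FrozenSet, List, Tuple

Chunk = Tuple[FrozenSet[str], str]  # (annotated features, chunk text)


def term_candidates(chunks: List[Chunk]) -> List[str]:
    """Keep only chunks that are not annotated as temporal ('temp')."""
    return [text for features, text in chunks if "temp" not in features]


chunks = [
    (frozenset({"temp"}), "uit mei en juni 2003"),  # DATE reading: excluded
    (frozenset(), "uit de rente"),                  # SOURCE reading: kept
]

candidates = term_candidates(chunks)
```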
19
Ambiguous structures
  • PP (aan) NP is semantically ambiguous
  • a THEME of the payment is indicated
  • 500 mln werd er binnen een maand aan belasting
    betaald.
  • Within a month, 500 million in taxes was paid.
  • a BENEFICIARY is indicated
  • 500 mln werd er binnen een maand aan de
    overheid betaald.
  • Within a month, 500 million was paid to the
    government.

20
Disambiguation
PP (aan) NP is semantically ambiguous.
Disambiguation using annotated structure.
Observation:
THEME: a definite article is not possible
* 500 mln werd er binnen een maand aan de belasting betaald.
  Within a month, 500 million was paid to the taxes.
BENEFICIARY: a definite article is required
500 mln werd er binnen een maand aan de overheid betaald.
Within a month, 500 million was paid to the government.
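The observation about the definite article can be phrased as a simple rule over the tokens of the PP. This is a deliberately simplified sketch: the function name `classify_aan_pp` is hypothetical, it checks only for the Dutch definite articles "de" and "het", and it ignores cases such as proper names that the annotated chunk structure would have to cover.

```python
from typing import List

DEFINITE_ARTICLES = {"de", "het"}


def classify_aan_pp(pp_tokens: List[str]) -> str:
    """Label a PP headed by 'aan' as BENEFICIARY or THEME.

    Rule from the observation: a definite article directly after 'aan'
    signals a BENEFICIARY; its absence signals a THEME.
    """
    if not pp_tokens or pp_tokens[0].lower() != "aan":
        raise ValueError("expected a PP that starts with 'aan'")
    noun_phrase = pp_tokens[1:]
    if noun_phrase and noun_phrase[0].lower() in DEFINITE_ARTICLES:
        return "BENEFICIARY"
    return "THEME"


theme = classify_aan_pp(["aan", "belasting"])
beneficiary = classify_aan_pp(["aan", "de", "overheid"])
```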
21
Output
De salarissen worden door de werkgever aan de werknemer betaald.
Salaries are paid by the employer to the employee.
22
Output
..., dat de zorg uit de verzekeringsgelden wordt betaald.
..., that social services are paid for from insurance funds.
23
Output
Uw pensioen wordt door ons in contanten of per overboeking uitgekeerd.
Your pension is remitted by us in cash or by bank transfer.
24
Output
25
Evaluation
26
Conclusions
  • Overall precision of 86.02%
  • Overall recall of 82.37%
  • Most urgent problem: ambiguous structures
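For reference, precision and recall follow the standard definitions sketched below. The counts in the example are hypothetical and serve only to show the arithmetic; the slides do not give the underlying counts for the reported figures.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of proposed slot fillers that are correct."""
    proposed = true_positives + false_positives
    return true_positives / proposed if proposed else 0.0


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of correct slot fillers in the text that were actually found."""
    relevant = true_positives + false_negatives
    return true_positives / relevant if relevant else 0.0


# Hypothetical counts, not taken from the evaluation.
example_precision = precision(true_positives=8, false_positives=2)
example_recall = recall(true_positives=8, false_negatives=8)
```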

27
Future work
  • Disambiguation of extraction results by
    comparing the co-occurrences of fillers that
    cannot be assigned unambiguously with those of
    unambiguously assignable fillers
  • Full coverage of active sentences and of
    infinitivals, so that the broadest possible
    coverage of the texts to be analyzed is achieved
  • Collecting more relevant verb-based frames,
    providing the corresponding mappings, and
    broadening the extraction exercise

28
References
  • Kermes, Hannah and Stefan Evert (2002). YAC - a
    recursive chunker for unrestricted German text.
    In Manuel Gonzalez Rodriguez and Carmen Paz
    Suarez Araujo, eds., Proceedings of the Third
    International Conference on Language Resources
    and Evaluation, vol. V, pp. 1805-1812.
  • Martin, Willy et al. (2001). DOT Eindrapport
    (deel 1). Rapport van de Onderzoeksgroep
    VU-Lexicologie in het kader van het DOT-Project.
    Ms. Amsterdam, Vrije Universiteit.
  • Minsky, Marvin (1974). A framework for
    representing knowledge. Ms. MIT.
  • Schmid, Helmut (1994). Probabilistic
    part-of-speech tagging using decision trees. In
    International Conference on New Methods in
    Language Processing, pp. 44-49.
  • Spranger, Kristina (2002). A lexically informed
    chunking analysis as a starting point for the
    extraction of linguistic information and
    terminology from Dutch text. Master's thesis,
    University of Stuttgart, IMS.
  • Spranger, Kristina and Ulrich Heid (2003). A
    Dutch Chunker as a Basis for the Extraction of
    Linguistic Knowledge. In Tanja Gaustad (ed.)
    Computational Linguistics in the Netherlands
    2002. Selected Papers from the Thirteenth CLIN
    Meeting.
  • Verhagen, Michel (2003). Frame van economische
    transacties. Technical report. Vrije Universiteit
    Amsterdam