Constructing a Romanian Electronic Dictionary. Andrei Filip, Universitat Autònoma de Barcelona


1
  • Constructing a Romanian Electronic Dictionary
  • Andrei Filip
  • Universitat Autònoma de Barcelona

2
1. The Format of the Romanian Electronic Dictionary
   1.1. The Macrostructure
   1.2. The Microstructure
2. The Noun Inflection System. NooJ Graphs Implementation
   2.1. The Gender and Determination Issues
   2.2. The Grammatical Category of Number
   2.3. The Grammatical Category of Case

1. The Format of the Romanian Electronic Dictionary
1.1. The Macrostructure
The macrostructure is composed of the different lexical units which make up the dictionary (in our case about 30,738 entries).
What makes it different from paper dictionaries? In what we call traditional dictionaries, each entry generally corresponds to a basic unit form. This implies the separation of syntax (the structures in which the respective units can be combined) and lexicon (the inventory of forms associated with one or more meanings).
3
At least two major problems arise from this treatment as far as natural language processing is concerned:
a) polysemy
b) idiomatic expressions
Traditional entries therefore describe either only part of a lexical unit or several lexical units at the same time. The strategy we adopt is to consider the entry not as a form but as a lexical unit made up of a form, a meaning and a combinatory.
e.g. Este o veste însemnată. (Fr. une nouvelle importante, "important news")
     Vaca care este însemnată îi aparține. (Fr. marquée, "marked, branded")
     Este un om însemnat. (Fr. personne estropiée, "a maimed person")
4
In the previous sentences, each use of the adjective însemnat is characterised by its own combinatory and a single meaning, and thus corresponds to an independent lexical unit. Moreover, as we have already seen, each lexical unit corresponds to a different translation unit in the target language.
If we define lexical units in this way, lexical ambiguity is no longer a problem, as each lexical unit corresponds to a single meaning.
We should also distinguish between simple and compound lexical units. For the time being we concentrate only on the Romanian dictionary of simple forms and leave the dictionary of compound lexical forms for further research.
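As a rough illustration of this point, the three uses of însemnat can be stored as three separate lexical units that happen to share a form. The Python sketch below is only an assumption about how such an index might look: the LexicalUnit type, its field names, the UNITS list and the translate helper are hypothetical, the context nouns come from the example sentences, and the French equivalents are the masculine citation forms of the glosses given with those sentences.

```python
from collections import defaultdict
from typing import NamedTuple


class LexicalUnit(NamedTuple):
    """One form/meaning/combinatory association (hypothetical structure)."""
    form: str         # citation form shared by several units
    combinatory: str  # noun it combines with in the example sentence
    french: str       # translation equivalent in the target language


# Illustrative data taken from the three example sentences.
UNITS = [
    LexicalUnit("însemnat", "veste", "important"),
    LexicalUnit("însemnat", "vacă", "marqué"),
    LexicalUnit("însemnat", "om", "estropié"),
]

# Index the units by form: one form may point to several lexical units.
index = defaultdict(list)
for unit in UNITS:
    index[unit.form].append(unit)


def translate(form: str, context_noun: str) -> str:
    """Pick the unit whose combinatory matches the context noun."""
    for unit in index[form]:
        if unit.combinatory == context_noun:
            return unit.french
    raise KeyError(f"no lexical unit for {form!r} with {context_noun!r}")
```

Because each unit carries exactly one meaning, the ambiguity sits in the lookup (which unit fits the context), not inside any single entry.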
5
We should also mention that spelling variants have been treated separately: each is given its own entry and description in the dictionary.
e.g. atunci/atuncea, acum/acuma, flutur/fluture
We have also taken a different approach to gender. We give separate entries to masculine and feminine nouns (what we could also term correlative nouns).
e.g. bunic - bunică, copil - copilă, cuscru - cuscră, cumnat - cumnată, profesor - profesoară, italian - italiancă, leu - leoaică, țăran - țărancă, doctor - doctoriță, cârciumar - cârciumăreasă, păun - păuniță etc.
The feminine nouns therefore correspond to different inflection graphs and are not generated as inflections of the corresponding masculine noun. The aim is also to facilitate the lexicographical treatment of natural gender.
6
1.2. The Microstructure
The microstructure of an electronic dictionary is made up of the different kinds of lexicographic information it records, that is, information on the lemma, on its possible arguments and on lexical units semantically related to the lemma (i.e. lexical restrictions and translation equivalents).
All this information is divided among the descriptive fields of the database.
Each entry is characterised first of all by its morphological description (G field). This corresponds to the inflection graphs that characterise the parts of speech N, A, V, ADV, PREP, DET, PRO and Residual. From the inflection code attached to each entry we can also derive further information, for instance on gender.
7
The next field, T, provides information about the syntactico-semantic features of each entry. These concern mainly nouns. We distinguish between Hum (human), Inc (inanimate concrete), Anl (animal), Veg (plant), Loc (locative), Tps (time) and Abs (abstract, which is further subdivided into states, actions and events).
The fourth field, C, is reserved for the classes d'objets, or classes of objects (Gross, G. 1994; Le Pesant and Mathieu-Colas, 1998). They have been established from the syntactic characteristics of the lexical units. A class of elementary arguments is defined by the predicates which select arguments belonging to the same class of objects. Higher-order predicates, which accept other predicates in their argument domain, are likewise grouped into classes d'objets. For the time being 59 classes have been implemented in our dictionary.
e.g. cântăreață  C: artist

8
To make our description more precise we also include the D field, which corresponds to the domains (about 91) that have been accurately described by the Laboratoire de Linguistique Informatique of Paris 13.
"un ensemble d'expressions dénommant dans une langue naturelle des notions relevant d'un domaine de connaissance thématisé" (Lerat, 1995)
(a set of expressions which name, in a natural language, notions belonging to a thematised field of knowledge)
This kind of description will allow us to disambiguate polysemous lexical units.
For further precision, the SD (subdomain) field has been introduced.
e.g. cineast  D: cinema-photography  SD: cinema
9
Our next field corresponds to the translation equivalent (Fr/Es). It is important to state that we do not consider this field as metalinguistic information about a lexical unit but rather as a pointer to another lexical unit which has its own linguistic description in the target-language dictionary.
Our aim is to create coordinated monolingual electronic dictionaries (cf. Blanco 2001), as in most cases the morphological and syntactic descriptions differ from one language to the other.
We have also introduced a further field, P (cf. Garrigues 1997), to account for the use a speaker would make of one lexical entry or another. Two criteria are taken into consideration for this field:
- the (non-)existence of a mental image of a given word in a person's mental lexicon;
- how often a given word would occur in everyday speech (we refer here not to the form but to the form/meaning association).
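The fields described so far can be pictured as one record per lexical unit. The sketch below is a hypothetical Python rendering, not the NooJ dictionary format itself: the class name Entry, the attribute names and the filled_fields helper are assumptions made for illustration, and only the field values that appear in the examples (C for cântăreață, D and SD for cineast) are filled in.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Entry:
    """One dictionary entry; attribute names mirror the field labels."""
    lemma: str
    g: Optional[str] = None   # G: inflection code (morphological description)
    t: Optional[str] = None   # T: syntactico-semantic feature, e.g. Hum, Inc
    c: Optional[str] = None   # C: class of objects
    d: Optional[str] = None   # D: domain
    sd: Optional[str] = None  # SD: subdomain
    tr: Optional[str] = None  # pointer to the Fr/Es translation equivalent
    p: Optional[str] = None   # P: usage by speakers

    def filled_fields(self) -> dict:
        """Return only the fields that carry a value."""
        return {k: v for k, v in asdict(self).items()
                if k != "lemma" and v is not None}


# Values below are the ones shown in the examples; other fields stay empty.
cantareata = Entry("cântăreață", c="artist")
cineast = Entry("cineast", d="cinema-photography", sd="cinema")
```

Keeping the translation as a pointer (here just a string key) rather than as a gloss matches the idea that the target-language unit has its own full description elsewhere.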

10
A final field is still to be introduced; it has to do with what Hausmann (1989) calls diasystematics.
11
  • 2. The Noun Inflection System. NooJ Graphs Implementation
  • 2.1. The Gender and Determination Issues
  • As far as gender is concerned, we distinguish three main classes in Romanian:
  • Masculine: un frate - doi frați
  • Feminine: o colegă - două colege
  • Neuter: un drum - două drumuri
  • From a morphological point of view, neuter nouns behave like masculine nouns in the singular and like feminine nouns in the plural. They will therefore select different operators according to number.
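The behaviour of the neuter can be sketched in a few lines of Python. This is an illustrative assumption about how agreement might be resolved, not the NooJ operators themselves; the function names and the NUMERALS table are hypothetical, and the numeral forms are the ones used in the three examples on this slide.

```python
def agreement_gender(gender: str, number: str) -> str:
    """Gender a noun agrees in: neuter is masculine in sg., feminine in pl."""
    if gender == "neuter":
        return "masculine" if number == "sg" else "feminine"
    return gender


# Forms of "one"/"two" taken from the examples (un, o, doi, două).
NUMERALS = {
    ("masculine", "sg"): "un",
    ("feminine", "sg"): "o",
    ("masculine", "pl"): "doi",
    ("feminine", "pl"): "două",
}


def with_numeral(form: str, gender: str, number: str) -> str:
    """Prefix a noun form with the numeral that agrees with it."""
    key = (agreement_gender(gender, number), number)
    return f"{NUMERALS[key]} {form}"
```

For instance, with_numeral("drum", "neuter", "sg") and with_numeral("drumuri", "neuter", "pl") pick the masculine and the feminine numeral respectively.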

12
From a semantic point of view we could say that the neuter is quite a homogeneous class, as it includes mostly Inc nouns (e.g. ciocan - ciocane), HumColl nouns (e.g. popor, trib, grup, colectiv etc.) and Anl nouns which denote species (e.g. mamifer, gasteropod, dobitoc).
As far as the grammatical category of determination is concerned, we shall concentrate here only on the definite article. All the other determiners (Det) have their own inflection system, depending on case and on whether or not they precede the NG.
The definite article in Romanian is an enclitic morpheme attached to the noun, which needs to be described in the inflection graph.
e.g. studentul, steaua, cartea, regele, codrul, țară - țara, popă - popa, poezie - poezia etc.
13
  • As far as plural nouns are concerned, the definite article morpheme depends only on the gender of the corresponding noun:
  • -i for masculine nouns
  • e.g. studenți - studenții, frați - frații, copaci - copacii
  • -le for feminine and neuter nouns
  • e.g. studente - studentele, poezii - poeziile, popoare - popoarele, sigilii - sigiliile
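The plural rule is simple enough to write down directly. The following Python sketch assumes the unarticulated plural form is already known and merely appends the article; the names PLURAL_ARTICLE and definite_plural are hypothetical, and the real inflection graphs presumably handle this inside a larger paradigm.

```python
# Enclitic definite article for plural nouns, keyed by gender.
PLURAL_ARTICLE = {
    "masculine": "i",
    "feminine": "le",
    "neuter": "le",
}


def definite_plural(plural_form: str, gender: str) -> str:
    """Attach the plural definite article to an unarticulated plural."""
    return plural_form + PLURAL_ARTICLE[gender]
```

Applied to the examples above, definite_plural("studenți", "masculine") gives studenții and definite_plural("popoare", "neuter") gives popoarele.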

14
  • 2.2. The Grammatical Category of Number
  • As regards the inflectional morphemes that mark the singular-plural opposition, we can distinguish three main classes of nouns in Romanian:
  • a) Variable nouns with a regular inflection paradigm
  • e.g. casă - case, școlar - școlari, drum - drumuri
  • b) Variable nouns with an irregular inflection paradigm
  • e.g. om - oameni, soră - surori
  • c) Invariable nouns
  • e.g. tei - tei, învățătoare, pronume
  • So far we have created 11 different inflection graphs for masculine nouns, 16 for feminine nouns and 9 for neuter nouns.

15
We need to add that several nouns (especially feminine and neuter ones) have two plural forms.
e.g. coală - coli/coale, vreme - vremi/vremuri, chibrit - chibrituri/chibrite, hotel - hoteluri/hotele
However, in some cases the two plurals call for different lexico-semantic descriptions: the singular form is the same, but the meaning and combinatory differ. Such nouns are therefore treated under different entries in our dictionary.
e.g. corn - coarne vs. corn - cornuri
     mâncare - mâncări vs. mâncare - mâncăruri
Special attention should be paid to Singularia Tantum and Pluralia Tantum nouns. The strategy we adopt is to record the fact that they lack this inflection feature in the graph code with which we label the entry in the G field.
e.g. ochelari N11P, moaște N23P
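Labels of this kind can be read mechanically. The Python sketch below assumes, from the two examples only, that a G code consists of a part-of-speech tag, a graph number and an optional final P marking a plural-only noun; the names GCode and parse_g_code are hypothetical, and since the coding of Singularia Tantum nouns is not shown here, it is left out.

```python
import re
from typing import NamedTuple


class GCode(NamedTuple):
    pos: str           # part-of-speech tag, e.g. "N"
    graph: int         # number of the inflection graph
    plural_only: bool  # True when the code ends in "P"


# Assumed layout: capital letters, digits, optional final "P".
_G_CODE = re.compile(r"([A-Z]+)(\d+)(P?)")


def parse_g_code(code: str) -> GCode:
    """Split a G-field label such as 'N11P' into its parts."""
    match = _G_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"unrecognised G code: {code!r}")
    pos, graph, plural = match.groups()
    return GCode(pos, int(graph), plural == "P")
```

A generator of inflected forms could consult plural_only to skip the singular branch of the graph for nouns such as ochelari.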
16
  • 2.3. The Grammatical Category of Case
  • A third main factor we have to consider when building our inflection graphs is case. From the point of view of their internal structure, nouns can be grouped into the same three main classes determined by the number opposition:
  • a) Variable nouns with a regular inflection pattern
  • b) Variable nouns with an irregular inflection pattern
  • c) Invariable nouns
  • Let's first consider nouns in the Nominative and the Accusative. They may or may not be inflected with the enclitic definite article (om - omul, oameni - oamenii, casă - casa, case - casele etc.).
  • The uninflected noun may or may not be accompanied by the indefinite article or another determiner, which takes over the inflection pattern.

17
  • The main noun forms in the Nominative/Accusative are:
  • With the definite article

18
  • With the indefinite article

19
There were plenty of orthographic constraints that we had to consider when conceiving our inflection graphs, but for the sake of concision we are not going to go into detail here.
20
  • As far as the Genitive and the Dative are concerned, we distinguish the following main forms:
  • a) Articulated Forms

21
b) Unarticulated Forms
22
We have to note that the only nouns that change their form in the Dative and Genitive are feminine nouns in the singular.
e.g. casă - casei / unei case, basma - basmalei / unei basmale, vulpe - vulpii / unei vulpi
In this case the Dative and the Genitive singular are indicated both by the form that the noun takes and by the form of the inflected determiner (the definite article -i or the indefinite article unei).
As in the case of Nominative/Accusative nouns, we also have to deal here with orthographic exceptions. We can identify four main types, but we refer here to only one example.
23
Feminine nouns ending in a vowel or diphthong in the Nominative sg. are written with final -ei or -ii when they are inflected with the definite article. To avoid confusion, we prefer to start from the form of the unarticulated noun in the Nominative pl.
e.g.
N. pl. unart.    D/G sg. unart.    D/G sg. art.
(niște) case     (unei) case       casei
vulpi            vulpi             vulpii
femei            femei             femeii
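The table suggests a compact derivation: the unarticulated Dative/Genitive singular equals the unarticulated Nominative plural, and the articulated form adds -i to it. The Python sketch below encodes just that observation for feminine nouns; the function name is hypothetical, and the other orthographic types mentioned earlier are not covered.

```python
def dative_genitive_sg(nominative_pl_unart: str) -> dict:
    """Derive the D/G singular forms of a feminine noun from its N. pl."""
    return {
        # Same form as the unarticulated Nominative plural.
        "unarticulated": nominative_pl_unart,
        # Definite article -i attached to that form.
        "articulated": nominative_pl_unart + "i",
    }
```

Starting from the plural avoids having to decide between -ei and -ii: case yields casei, while vulpi yields vulpii.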
24
  • Finally, when it comes to nouns in the Vocative, we can distinguish four different cases:
  • 1. Some nouns have specific forms for the Vocative
  • e.g. bărbate, cumetre, bunicule (masculine)
  • bunico, cuscro (feminine)
  • 2. Some nouns can have specific Vocative forms, but they also accept an alternative form which is identical with the inflected Nominative/Accusative form
  • e.g. bunico - bunica
  • 3. The majority of nouns have specific forms for the Vocative case, but when we want to emphasise the appellative function we use the same form as for the uninflected Nominative/Accusative nouns
  • e.g. frate, tată, mamă

25
4. For masculine and feminine plural nouns in the Vocative we use the same forms as for the uninflected Nominative/Accusative nouns or the inflected Genitive/Dative forms.
e.g. Veniți, frați! Stați, fraților/vecinilor/fetelor!
With the support of the Universitat Autònoma de Barcelona