Title: Lemmatizing and tagging a corpus : which information for which linguistic purposes?
Lemmatizing and tagging a corpus: which information for which linguistic purposes?
The example of the Greek and Latin LASLA databases compared to others
Dominique Longrée, LASLA, Université de Liège, and FUSL (Brussels)
0. Introduction: objectives
- to share the expertise of LASLA (50 years)
- - Laboratoire d'Analyse statistique des Langues anciennes
- - set up in 1961 at the University of Liège
- to open a discussion
- - which information should a database contain, and for which purposes?
- - what influence does this have on the results of our linguistic studies?
- to compare the lemmatizing and tagging practices of LASLA with those of other (Greek and Latin) databases
0. Introduction: plan
- LASLA and its databases
- the research project LatLem
- the Opera Latina Web interface and the Hyperbase-Latin CD-ROM
- the process of tokenization
- the process of lemmatization
- the process of tagging (morphosyntactic tags)
- the process of tagging (syntactic, semantic and pragmatic tags)
- the research project LatSynt
1. The LASLA Databases: Greek and Latin
- The Laboratory for Statistical Analysis of Classical Languages (L.A.S.L.A.)
- - set up in September 1961
- - the first research centre aiming to study classical languages (Greek and Latin) using automatic data processing technologies
- part of the Faculty of Philosophy and Letters at the University of Liège
- Missions
- - 1) a detailed study of Greek and Latin languages and literatures using computer techniques as well as statistical and quantitative methods
- - 2) the making of literary data banks and computer tools in order to distribute those data banks and make the most of them through all available media
1. The LASLA Greek Database
1,200,000 words/tokens
- Attic orators: Andocides, Antiphon, Isocrates and Lysias
- Aristotle: De Anima, De partibus animalium, Categories, Metaphysics, Physics, Historia animalium
- Plato: 8 dialogues
- All classical tragedies: Aeschylus, Sophocles, Euripides and fragments
- Pausanias
- Christian authors, for example St John Chrysostom, De sacerdotio; Hesychius of Jerusalem, Homilies
1. The LASLA Greek Database
Alongside each word form, the following data appear:
1. the reference of the word form, according to the ars citandi
2. the lemma (the word as it appears in the dictionary of reference, which is the Greek-English Lexicon of H. G. Liddell, R. Scott and H. S. Jones)
3. the grammatical category of the word (POS)

Lemma        Token        Reference        POS
? 2          ?            2 1 1 1 1 1 A    ?
??????       ??????       2 1 1 2 2 2 A    ?
a?t?de?f??   a?t?de?f??   2 1 1 3 3 3 A    ?
??sµ???      ? ?sµ????    2 1 1 4 4 4 A    β
???a 1       ???a         2 1 1 5 5 5 A    β
??a          ?? ?         2 1 2 1 6 6 A    μ
??da         ??s? ?       2 1 2 2 7 7 A    ?
1. The LASLA Latin Database
- Latin classical texts: 2,000,000 tokens
- The LASLA method
- - Étienne ÉVRARD, Le laboratoire d'analyse statistique des langues anciennes de l'Université de Liège, Mouvement scientifique en Belgique, 9, 1962, p. 163-169
- - Joseph DENOOZ, L'ordinateur et le latin, Techniques et méthodes, Revue de l'organisation internationale pour l'étude des langues anciennes par ordinateur, 1978, 4, p. 1-36.
1. The LASLA Latin Database
The available fully lemmatized and encoded texts: classical texts (more than 2,000,000 words/tokens)
Caesar et alii, Cato, Catullus, Cicero (rhetorical works; all philosophical works; partim), Curtius, Horatius, Iuvenalis, Lucretius, Ovidius, Persius, Petronius, Plautus (8 plays), Plinius (Iunior), Propertius, Sallustius, Seneca, Tacitus, Tibullus, Virgilius
1. The LASLA Latin Database
The available fully lemmatized and encoded texts: other texts
- Medieval Latin: Sedulius Scottus; hagiographic texts (300,000 words)
- Neo-Latin: Descartes, Spinoza
Texts available next (works in progress):
Cicero (letters), Cornelius Nepos, Livius, Suetonius, Historia Augusta, Busbecq (by L. Grailet)
1. The LASLA Latin Database
- 2,500,000 words/tokens
- Bibliotheca Teubneriana Latina: 13 million tokens
- fully lemmatized texts
- with full morphosyntactic tagging and 1 syntactic tag
- systematically verified by a philologist
1. The LASLA Latin Database
For each word of the text:
1. the lemma (the word as it appears in the dictionary of reference, the Lexicon totius latinitatis of Forcellini, ed. Corradini, Padua, 1864)
2. an index which makes it possible to distinguish homograph lemmas (ET 1: adverb; ET 2: coordinating conjunction) or to spot proper names and adjectives derived from proper names (N next to Roma means "proper name")
3. the form as it appears in the text
4. the reference, according to the ars citandi
5. the complete morphological analysis in alphanumeric format
6. for verbs, syntactic indications: main-clause verbs vs. subordinate-clause verbs (sorted by subordination type)
1. The LASLA Latin Database
The information available for each Latin form: Lemma, Index, Text form, Reference, Analysis
Index: N = name; ET 1 = adverb, ET 2 = coordinating conjunction
1. The LASLA Latin Database
The information available for each Latin form: Lemma, Index, Text form, Reference, Analysis
Analysis:
urbem    13C00   1 = noun; 3 = 3rd declension; C = accusative singular
habuere  52L14   5 = verb; 2 = 2nd conjugation, active; L = 3rd person plural; 1 = indicative; 4 = perfectum; main clause
1. The LASLA Latin Database
The information available for each Latin form
Analysis:
audisset    5JC32   5 = verb; J = 1st conjugation, deponent; C = 3rd person singular; 3 = subjunctive; 2 = imperfect; BN = cum clause
requisisse  53074   5 = verb; 3 = 3rd conjugation, active; 0 = impersonal; 7 = infinitive; 4 = perfectum; AG = accusativus cum infinitivo
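
The positional reading of the alphanumeric analysis codes can be sketched in a few lines of Python. This is only an illustration: the tables below contain just the values spelled out on the slides (the real LASLA code tables are much larger), and the function name `decode` is hypothetical.

```python
# Values attested on the slides only; everything else is reported as unknown.
POS = {"1": "noun", "2": "adjective", "5": "verb"}
NOUN_CLASS = {"3": "3rd declension"}
VERB_CLASS = {
    "2": "2nd conjugation, active",
    "3": "3rd conjugation, active",
    "J": "1st conjugation, deponent",
}
CASE_NUMBER = {"A": "nominative singular", "C": "accusative singular"}
PERSON_NUMBER = {"0": "impersonal", "C": "3rd person singular", "L": "3rd person plural"}
MOOD = {"1": "indicative", "3": "subjunctive", "7": "infinitive"}
TENSE = {"2": "imperfect", "4": "perfectum"}


def look(table: dict, key: str) -> str:
    """Return the label for a code character, or flag it as unknown."""
    return table.get(key, f"unknown ({key})")


def decode(tag: str) -> list:
    """Decode the first five positions of an analysis code (hypothetical helper)."""
    if len(tag) < 5:
        raise ValueError(f"expected at least 5 characters, got {tag!r}")
    pos, cls, infl, mood, tense = tag[:5]
    if pos == "1":  # nouns: positions 4-5 are unused ("00")
        return [look(POS, pos), look(NOUN_CLASS, cls), look(CASE_NUMBER, infl)]
    if pos == "5":  # verbs: class, person/number, mood, tense
        return [look(POS, pos), look(VERB_CLASS, cls), look(PERSON_NUMBER, infl),
                look(MOOD, mood), look(TENSE, tense)]
    return [look(POS, pos)]


for code in ("13C00", "52L14", "53074"):
    print(code, decode(code))
```

Note that the same character means different things in different positions and for different parts of speech (C is "accusative singular" for a noun but "3rd person singular" for a verb), which is why the lookup is driven by the first position.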
1. The LASLA system for tagging
- old-fashioned
- the project LatLem
1. The LASLA Greek and Latin Databases
- accessible through
- - indexes published by
- - - G. Olms (Hildesheim)
- - - the Centre Informatique de Philosophie et Lettres (CIPL-Liège)
- - for Greek texts: specific software
- - for Latin texts
- - - the Opera Latina Web interface
- - - the Hyperbase-Latin CD-ROM
1. The LASLA Latin Database
accessible through Opera Latina
www.ulg.ac.be/cipl/lsl.htm
1. The LASLA Latin Database
accessible through the Hyperbase-Latin CD-ROM
in collaboration with the UMR 6039 Bases, corpus, langage (CNRS-University of Nice)
2. Tokenizing a text: establishing the text
2. Tokenizing a text: segmenting the text into sentences
/Accusa senatum, accusa equestrem ordinem...,
accusa omnes ordines, omnes ciues../
/Accusa senatum,/ /accusa equestrem ordinem...,
/ /accusa omnes ordines, omnes ciues../
2. Tokenizing a text: segmenting the text into words
- compare with the PHI 05 CD-ROM of the Packard Humanities Institute
- the string -ibil-
2. Tokenizing a text: segmenting the text into words
- compare with the PHI 05 CD-ROM of the Packard Humanities Institute
- Vergil's Aeneid
- - arma virumque cano
- - arma uirumque cano
- clitic -que
- - /que<blank>/, /que<,>/, /que<>/, /que<>/, /que<.>/
- - atque, ubique, undique, quicumque
- amatus est / amatust
- animum aduertere / animaduertere
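
The -que problem can be made concrete with a small Python sketch. It is an illustration only, not the LASLA or PHI procedure: the exception list holds just the four forms named on the slide, and the functions `normalise`, `split_que` and `tokenize` are hypothetical names.

```python
import re

# Forms ending in -que where -que is not the enclitic (only the slide's examples;
# a real list would be far longer and would still miss ambiguous cases).
NOT_ENCLITIC = {"atque", "ubique", "undique", "quicumque"}


def normalise(word: str) -> str:
    """Lower-case and merge the v/u and j/i spellings, so that
    'virumque' and 'uirumque' are treated as the same string."""
    return word.lower().replace("v", "u").replace("j", "i")


def split_que(word: str) -> list:
    """Detach a final -que unless the word is a known exception."""
    if word.endswith("que") and len(word) > 3 and word not in NOT_ENCLITIC:
        return [word[:-3], "que"]
    return [word]


def tokenize(text: str) -> list:
    """Split on non-letters, normalise spelling, then detach enclitic -que."""
    tokens = []
    for word in re.findall(r"[^\W\d_]+", text):
        tokens.extend(split_que(normalise(word)))
    return tokens


print(tokenize("Arma virumque cano"))
print(tokenize("atque undique"))
```

A purely string-based rule of this kind is exactly what the slide warns about: without the exception list it would cut atque or undique in two, and no finite list settles forms whose analysis depends on context, which is why the segmentation is checked by a philologist.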
3. Lemmatizing a text
- to allow the recognition of the same lemma in its various occurrences in a text, independently of the variety of its forms in those occurrences
- - 1) Greek and Latin are inflected languages
3. Lemmatizing a text
- to allow the recognition of the same lemma in its various occurrences in a text, independently of the variety of its forms in those occurrences
- - 2) Latin spelling is not completely fixed
- - - assimilation phenomena (inlicio/illicio, adtuli/attuli, quidquid/quicquid)
- - - haplologies (exspecto/expecto)
- - - weak phonological status of some phonemes (harena/arena, exhibeo/exibeo, mihi/mi, consul/cosul, etc.)
- - - transformation of diphthongs into monophthongs (saeta/seta, plaudite/plodite, poenicus/punicus)
3. Lemmatizing a text
- to allow the recognition of the same lemma in its various occurrences in a text, independently of the variety of its forms in those occurrences
- - 2) Latin spelling is not completely fixed
- - - elision, epenthesis, apheresis, contraction, as well as abbreviation
- - - disjunction of the parts of compound words, or tmesis
- - - - res publica for respublica
- - - - quo modo for quomodo
- - - - quam... ante for antequam
- - - diachronic and synchronic morphological variants
- - - - pater familias/pater familiae
- - - - siet/sit
- - - - igni/igne
- - - - fecerit/faxit
3. Lemmatizing a text
- populus 1, "the people", and populus 2, "the poplar"
- licet 1, "it is allowed", and licet 2, "although"
3. Lemmatizing a text
Dux as a lemma: co-occurrent lemmas

Deviation  Corpus  Extract  Word
038        932     934      dvx
015        1616    120      miles
014        1285    105      exercitvs1
009        25801   604      qve
009        2447    107      bellvm
009        2298    98       romanvsa
009        1910    88       hostis
008        1725    78       arma
008        1059    58       legio
007        1113    55       castra2
007        862     45       copia
007        615     37       imperator
007        519     36       avctor
006        40004   802      et2
006        1968    70       vrbs
006        786     39       tot
006        536     31       cohors
006        493     29       agmen
006        371     26       comes

Dux as a word form: co-occurrent word forms

Deviation  Corpus  Extract  Word
038        141     141      dux
005        336     8        romanus
005        170     7        auctor
004        2733    19       erat
004        626     8        bello
004        506     8        exercitus
004        482     8        hostium
004        299     7        miles
004        151     5        comes
004        119     4        militiae
004        113     4        deae
004        106     4        cohortibus
004        87      4        campis
004        53      3        diuersis
004        44      3        copiarum
004        39      3        uoluntatis
004        37      3        rati
004        20      3        gregis
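
Why the two searches give such different pictures can be sketched in Python on a toy example. The three annotated "sentences" below are invented for illustration (they are not LASLA records), the function name `cooccurrents` is hypothetical, and the sketch only counts raw co-occurrences; it does not compute the deviation score shown in the tables.

```python
from collections import Counter

# Toy annotated sentences, invented for illustration: (word form, lemma) pairs.
SENTENCES = [
    [("dux", "DVX"), ("milites", "MILES"), ("hortatur", "HORTOR")],
    [("ducem", "DVX"), ("miles", "MILES"), ("sequitur", "SEQVOR")],
    [("duces", "DVX"), ("exercitum", "EXERCITVS"), ("ducunt", "DVCO")],
]


def cooccurrents(target: str, field: int) -> Counter:
    """Count the items sharing a sentence with `target`.
    field 0 compares word forms, field 1 compares lemmas."""
    counts = Counter()
    for sentence in SENTENCES:
        items = [record[field] for record in sentence]
        if target in items:
            counts.update(item for item in items if item != target)
    return counts


by_form = cooccurrents("dux", 0)    # only the nominative singular is found
by_lemma = cooccurrents("DVX", 1)   # every inflected form of the lemma is found

print("form :", dict(by_form))
print("lemma:", dict(by_lemma))
```

Searching by form reaches one sentence out of three and scatters the neighbours over their inflected forms (milites, miles), while searching by lemma reaches all three and groups them, which mirrors the contrast between 141 occurrences of the form dux and 934 of the lemma in the tables.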
4. Morphosyntactic tagging
- using the POS tag
- - research on Greek determiners (UMR 6039-Nice, Michèle Biraud)
- - sequences α ? ε β and α μ ? ε β attested in the LASLA files
- - - α: article
- - - β: noun
- - - ?: adjective
- - - ε: adjective/pronoun
- - - μ: particle
- - ?? ????? p??te? ?????p??
4. Morphosyntactic tagging
- using the POS tag
- - research on parallel and reminiscent passages between literary works (Koen Van Haegendoren, Liège)
4. Morphosyntactic tagging
- using the POS tag
- - to characterise authors and genres
4. Morphosyntactic tagging
- using the POS tag
- - LASLA Latin texts and BFM medieval French texts
4. Morphosyntactic tagging
- lemmatization and POS: the case of adjectives used as nouns
- - one solution: amicus 1 (noun) vs. amicus 2 (adjective)
- - but sunt christiani?
- - - > sanctus, beatus, fidelis or impius
- - another solution: sanctus is analyzed as
- - - 21A00_4 when used as an adjective (2 for adjective, 1 for first class, A for nominative singular, 4 for masculine)
- - - 21A0014 when used as a noun (the additional 1 indicating this use)
- - also for fideles / omnes fideles, credentes / omnes credentes, laudantes, audientes, legentes
4. Morphosyntactic tagging
- full morphosyntactic analysis: Greek declension in Latin
4. Morphosyntactic tagging
- full morphosyntactic analysis: 4th conjugation
4. Morphosyntactic tagging
- full morphosyntactic analysis: the deviant forms
- - one solution: a special code for the whole declension (domus)
- - another solution: several lemmas
- - e.g. a masculine accusative plural saxos, instead of the neuter plural saxa
- - - a form of a new masculine lemma saxus (not attested in the dictionaries)
- - - an anomalous form of the neuter lemma saxum
- - - in both cases with the same codification (12: 1 for noun, 2 for 2nd declension)
- - but
- - - e.g. facta est tonitrua in aera
4. Morphosyntactic tagging
- full morphosyntactic analysis: the deviant forms
- - e.g. facta est tonitrua in aera
- - - tonitrua used as a nominative singular of the first declension
- - - in the dictionaries
- - - - tonitrus, us (4th decl.)
- - - - tonitruum, i (2nd decl.)
- - - - tonitrus, i (m) (2nd decl.)
- - - - tonitru (n) (4th decl.)
- - - - but no tonitrua
- - - explained by a neuter plural of tonitruum reinterpreted as a feminine singular, but how to lemmatize it?
- - one solution
- - - to consider tonitrua as a form of the lemma tonitruum
- - - the peculiarity of its use is recorded only in the tag corresponding to the morphosyntactic analysis
4. Morphosyntactic tagging
- full morphosyntactic analysis: the deviant forms
- - e.g.
- - - regular forms of the lemma dulcis, "sweet", tagged with the code of the adjectives of the second class in -is (24)
- - - anomalous form dulciam, tagged as a form of the lemma dulcis with the code of the adjectives of the first class (21)
- - and in the Classical Latin corpus
- - - caelum (n) and caelus (m)
- - - inferni (m) and inferna (n)
- - - cingula (f), cingulus (m) and cingulum (n)
4. Morphosyntactic tagging
- using the full morphosyntactic analysis: narrative indicative tenses
4. Morphosyntactic tagging
- using the full morphosyntactic analysis: repeated sequences (adj.-adj.)
5. Syntactic tagging
- the Treebank approaches
- - Index Thomisticus
- - Latin Treebanks at Perseus
- based on
- - Dependency Grammar (Prague Dependency Treebank)
- - the Latin grammar of H. Pinkster
- a training corpus is tagged manually and other corpora are encoded by using automatic taggers
- problems
- - a method imposing a specific linguistic framework
- - mixing theoretical linguistic frameworks
- - producing data which are not verified
5. Syntactic tagging: the LatSynt project
- an original and innovative research project on word order and Latin sentence structures
- Objectives
- - to develop automatic parsing procedures based on word order rules (in order to offer an alternative to Treebank approaches)
- - to evaluate the relevance of recent linguistic descriptions
- - to offer new tools for textual data analysis (TDA)
- - - for enunciative structure modeling
- - - for Latin text classification and segmentation
- Methods
- - to develop automatic procedures grounded on
- - - the morphological information already encoded in the LASLA database
- - - the linearity of the text
- - to refine and improve the computer programmes in successive stages
5. Syntactic tagging: the LatSynt project, the first stage
- Objectives
- - to mark out the boundaries of clauses with a finite verb (introduced by a subordinating word)
- - to specify the level of their subordination (their embedding)
- starting from the alphanumeric data of the LASLA database
5. Syntactic tagging: the LatSynt project, the first stage
Analysis: audisset: BN (cum clause); cum: 32 (subj. imperf.)
5. Syntactic tagging: the LatSynt project, the first stage
- 1st stage
- - quem uidi: LN14 (subordination in QVI, perfect indicative) is transferred to the records of both
- - - the quem form (lemma QVI) and
- - - the uidi form (lemma VIDEO)
- 2nd stage
- - 0014 LN14 -LN14 LN12 GK32 -GK32 -LN12
5. Syntactic tagging: the LatSynt project, the first stage
- 3rd stage
- - <0014>LN14 -LN14LN12 GK32 -GK32 -LN12.
- Final stage
- - Tacitus, Annals, 13,11,2 / P2849 /
- - <0014>LN14-LN14LN12GK32-GK32-LN12
- - <secuta (est)> que lenitas in Plautium Lateranum quem ob adulterium Messalinae ordine demotum -reddidit senatui clementiam suam obstringens crebris orationibus quas Seneca testificando quam honesta -praeciperet uel iactandi ingenii uoce principis -uulgabat
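
The embedding levels implied by such a tag string can be computed with a simple stack, as in the Python sketch below. The reading of the notation is an assumption drawn from the slides: a bare code (LN14) opens a subordinate clause on its subordinator, the same code preceded by a hyphen (-LN14) closes it on its verb, and a code in angle brackets (<0014>) marks the main verb. The function name `clause_depths` and the requirement that clauses be strictly nested are also assumptions, not a description of the LatSynt programmes.

```python
def clause_depths(tags: list) -> list:
    """Return (code, embedding level) for each subordinate clause,
    in the order in which the clauses open (hypothetical helper)."""
    stack = []    # codes of the clauses currently open
    depths = []
    for tag in tags:
        if tag.startswith("<") and tag.endswith(">"):
            continue                      # main-clause verb: level 0, nothing to open
        if tag.startswith("-"):
            code = tag[1:]
            if not stack or stack[-1] != code:
                raise ValueError(f"closing tag {tag!r} does not match an open clause")
            stack.pop()                   # the clause ends on its verb
        else:
            stack.append(tag)             # the clause starts on its subordinator
            depths.append((tag, len(stack)))
    if stack:
        raise ValueError(f"clauses left open: {stack}")
    return depths


# Tag string of Tacitus, Annals, 13,11,2 as given on the slide.
tacitus = ["<0014>", "LN14", "-LN14", "LN12", "GK32", "-GK32", "-LN12"]
print(clause_depths(tacitus))
```

On this reading, the quem clause and the quas clause are both at level 1, while the quam clause (GK32) is at level 2 because it opens and closes inside the quas clause.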
5. Syntactic tagging: the LatSynt project, the first stage
- analysing left dislocations
- - 5,22,2 P0909 1 BN35-BN35<0014>
- - ii cum ad castra -uenissent, nostri eruptione facta multis eorum interfectis, capto etiam nobili duce Lugotorige suos incolumes <reduxerunt>
5. Syntactic tagging: the LatSynt project, the first stage
- Results: to bring out
- - linguistic regularities (prolepsis)
- - distances between texts (Caesar, Tacitus)
- - the importance of semantic and pragmatic phenomena
- Perspectives
- - to mark out the boundaries of complex syntagms (in order to mark out the boundaries of subordinate clauses without a subordinator)
- - to promote interactions with other research on text topology (at the micro- and macro-structural levels)
- - - repeated segments (Hyperbase-Latin, in collaboration with BCL-Nice/CNRS)
- - - syntactic and multidimensional motifs (in collaboration with BCL-Nice/CNRS)
- - to use the results for text segmentation and classification
6. Tagging: what else?
- semantic and pragmatic information
- - semantic functions: Goal, Recipient, Agent, etc.
- - pragmatic functions: Rheme, Topic, Focus, etc.
- building databases available for all kinds of research
- - without imposing specific linguistic frameworks or analyses
- tokenization, lemmatization or tagging
- - not trivial processes
- - requiring thorough theoretical thinking