Lemmatizing and tagging a corpus: which information for which linguistic purposes?

1
Lemmatizing and tagging a corpus: which information for which linguistic purposes?
The example of the Greek and Latin LASLA databases compared to others
Dominique Longrée, LASLA, Université de Liège and FUSL (Brussels)
2
0. Introduction: objectives

- to share the expertise of LASLA (50 years), the Laboratoire d'Analyse statistique des Langues anciennes, set up in 1961 at the University of Liège
- to open a discussion:
1) which information belongs in a database, and for which purposes?
2) what influence does it have on the results of our linguistic studies?
- to compare the lemmatizing and tagging practices of LASLA with the practices of other (Greek and Latin) databases

3
0. Introduction: plan

- LASLA and its databases
- the research project LatLem
- the Opera Latina Web interface and the Hyperbase-Latin CD-Rom
- the process of tokenization
- the process of lemmatization
- the process of tagging (morphosyntactic tags)
- the process of tagging (syntactic, semantic and pragmatic tags)
- the research project LatSynt

4
1. The LASLA databases: Greek and Latin

The Laboratory for Statistical Analysis of Classical Languages (L.A.S.L.A.)
- set up in September 1961
- the first research centre aiming to study classical languages (Greek and Latin) using automatic data processing technologies
- part of the Faculty of Philosophy and Letters at the University of Liège
Missions
1) a detailed study of the Greek and Latin languages and literatures using computer techniques as well as statistical and quantitative methods
2) the creation of literary data banks and computer tools, in order to distribute those data banks and make the most of them through all available media.

5
1. The LASLA Greek database
1,200,000 words/tokens
- Attic orators: Andocides, Antiphon, Isocrates and Lysias
- Aristotle: De Anima, De partibus animalium, Categories, Metaphysics, Physics, Historia animalium
- Plato: 8 dialogues
- all classical tragedies: Aeschylus, Sophocles, Euripides, and fragments
- Pausanias
- Christian authors, for example St John Chrysostom, De sacerdotio; Hesychius of Jerusalem, Homilies
6
1. The LASLA Greek database
Each word form is accompanied by the following data:
1. the reference of the word form, according to the ars citandi
2. the lemma (the word as it appears in the dictionary of reference, the Greek-English Lexicon of H. G. Liddell, R. Scott and H. S. Jones)
3. the grammatical category of the word (POS)
lemma      token      reference        POS
[...] 2    [...]      2 1 1 1 1 1 A    [...]
[...]      [...]      2 1 1 2 2 2 A    [...]
[...]      [...]      2 1 1 3 3 3 A    [...]
[...]      [...]      2 1 1 4 4 4 A    β
[...] 1    [...]      2 1 1 5 5 5 A    β
[...]      [...]      2 1 2 1 6 6 A    μ
[...]      [...]      2 1 2 2 7 7 A    [...]
([...] marks Greek characters lost in extraction)
7
1. The LASLA Latin database

Latin classical texts: 2,000,000 tokens
The LASLA method
- Étienne ÉVRARD, "Le laboratoire d'analyse statistique des langues anciennes de l'Université de Liège", Mouvement scientifique en Belgique, 9, 1962, p. 163-169
- Joseph DENOOZ, "L'ordinateur et le latin, Techniques et méthodes", Revue de l'organisation internationale pour l'étude des langues anciennes par ordinateur, 1978, 4, p. 1-36.

8
1. The LASLA Latin database
the available fully lemmatized and encoded texts: Classical texts (more than 2,000,000 words/tokens)
Caesar et alii, Cato, Catullus, Cicero (rhetoric works: all; philosophical works: partim), Curtius, Horatius, Iuvenalis, Lucretius, Ovidius, Persius, Petronius, Plautus (8 plays), Plinius (Iunior), Propertius, Sallustius, Seneca, Tacitus, Tibullus, Virgilius
9
1. The LASLA Latin database
the available fully lemmatized and encoded texts: other texts
- Medieval Latin: Sedulius Scottus, hagiographic texts (300,000 words)
- Neo-Latin: Descartes, Spinoza
next available texts (works in progress)
Cicero (letters), Cornelius Nepos, Livius, Suetonius, Historia Augusta, Busbecq (by L. Grailet)
10
1. The LASLA Latin database

2,500,000 words/tokens
(Bibliotheca Teubneriana Latina: 13 million tokens)
fully lemmatized texts, with full morphosyntactic tagging and 1 syntactic tag, systematically verified by a philologist

11
1. The LASLA Latin database
For each word of the text:
1. the lemma (the word as it appears in the dictionary of reference, the Lexicon totius latinitatis of Forcellini, ed. Corradini, Padua, 1864)
2. an index which makes it possible to distinguish homograph lemmas (ET 1: adverb; ET 2: coordinating conjunction) or to spot proper names and adjectives derived from proper names (N opposite Roma means proper name)
3. the form as it appears in the text
4. the reference, according to the ars citandi
5. the complete morphological analysis in alphanumeric format
6. for verbs, syntactic indications: main-clause verbs vs. subordinate-clause verbs (sorted by subordination type)
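As a minimal sketch of how one such record could be modelled in code, the Python dataclass below mirrors the six fields listed on this slide. The class name, the field names and the placeholder reference string are illustrative assumptions, not the actual LASLA file format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TokenRecord:
    """One annotated word form; all field names are hypothetical."""
    lemma: str                     # dictionary headword, e.g. "ET"
    index: Optional[str]           # homograph number ("1", "2") or "N" for a proper name
    form: str                      # the word as it appears in the text
    reference: str                 # location according to the ars citandi
    analysis: str                  # alphanumeric morphological code
    syntax: Optional[str] = None   # clause-type code, filled in for verbs only

    def lemma_key(self) -> str:
        """Lemma plus index, so that homograph lemmas stay distinct."""
        return self.lemma if self.index is None else f"{self.lemma}_{self.index}"


# The code 52L14 comes from the slides; the reference string is a placeholder.
habuere = TokenRecord("HABEO", None, "habuere", "ref-placeholder", "52L14")
et_adverb = TokenRecord("ET", "1", "et", "ref-placeholder", "")
et_conjunction = TokenRecord("ET", "2", "et", "ref-placeholder", "")
print(et_adverb.lemma_key(), et_conjunction.lemma_key())
```

Keeping the index separate from the lemma means the two ET entries share a headword but can still be counted apart.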
12
1. The LASLA Latin database
the information available for each Latin form:
Lemma, index, TextForm, Reference, Analysis
Index: N = name; ET 1 = adverb, ET 2 = coordinating conjunction
13
1. The LASLA Latin database
the information available for each Latin form:
Lemma, index, TextForm, Reference, Analysis
Analysis
urbem 13C00: 1 = noun; 3 = 3rd declension; C = accusative singular
habuere 52L14: 5 = verb; 2 = 2nd conjugation, active; L = 3rd person plural; 1 = indicative; 4 = perfectum; main clause
14
1. The LASLA Latin database
the information available for each Latin form
Analysis
audisset 5JC32: 5 = verb; J = 1st conjugation, deponent; C = 3rd person singular; 3 = subjunctive; 2 = imperfect; BN = cum clause
requisisse 53074: 5 = verb; 3 = 3rd conjugation, active; 0 = impersonal; 7 = infinitive; 4 = perfectum; AG = accusativus cum infinitivo
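The alphanumeric analysis can be read position by position. The sketch below decodes only the values glossed in the examples on these two slides. The positional layout (part of speech, inflectional class, then case/number for nouns or person/number, mood and tense for verbs) is inferred from those examples and is an assumption, not the documented LASLA specification.

```python
# Code tables limited to the values glossed on the slides; anything else is "unknown".
POS = {"1": "noun", "5": "verb"}
NOUN_CLASS = {"3": "3rd declension"}
CASE_NUMBER = {"C": "accusative singular"}
VERB_CLASS = {
    "2": "2nd conjugation, active",
    "3": "3rd conjugation, active",
    "J": "1st conjugation, deponent",
}
PERSON_NUMBER = {"C": "3rd person singular", "L": "3rd person plural", "0": "impersonal"}
MOOD = {"1": "indicative", "3": "subjunctive", "7": "infinitive"}
TENSE = {"2": "imperfect", "4": "perfectum"}


def decode(code: str) -> dict:
    """Expand a 5-character analysis code into labelled features."""
    if len(code) != 5:
        raise ValueError("expected a 5-character analysis code")
    pos = POS.get(code[0], "unknown")
    features = {"pos": pos}
    if pos == "noun":
        features["class"] = NOUN_CLASS.get(code[1], "unknown")
        features["case_number"] = CASE_NUMBER.get(code[2], "unknown")
    elif pos == "verb":
        features["class"] = VERB_CLASS.get(code[1], "unknown")
        features["person_number"] = PERSON_NUMBER.get(code[2], "unknown")
        features["mood"] = MOOD.get(code[3], "unknown")
        features["tense"] = TENSE.get(code[4], "unknown")
    return features


for form, code in [("urbem", "13C00"), ("habuere", "52L14"), ("requisisse", "53074")]:
    print(form, decode(code))
```

Returning "unknown" rather than raising keeps the decoder usable on codes the slides do not explain.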
15
1. The LASLA system for tagging
- old-fashioned
- the project LatLem
16
1. The LASLA Greek and Latin databases

accessible through
- indexes published by G. Olms (Hildesheim)
- the Centre Informatique de Philosophie et Lettres (CIPL-Liège)
- for Greek texts: specific software
- for Latin texts:
  - the Opera Latina Web interface
  - the Hyperbase-Latin CD-Rom

17
1. The LASLA Latin database
accessible through Opera Latina
www.ulg.ac.be/cipl/lsl.htm
18
1. The LASLA Latin database
accessible through the CD-Rom Hyperbase-Latin, in collaboration with the UMR 6039 Bases, corpus, langage (CNRS-University of Nice)
19
2. Tokenizing a text: establishing the text
20
2. Tokenizing a text: segmenting the text into sentences
/Accusa senatum, accusa equestrem ordinem..., accusa omnes ordines, omnes ciues../
/Accusa senatum,/ /accusa equestrem ordinem...,/ /accusa omnes ordines, omnes ciues../
21
2. Tokenizing a text: segmenting the text into words
compare with the CD-Rom PHI 05 of the Packard Humanities Institute
The string -ibil-
22
2. Tokenizing a text: segmenting the text into words

compare with the CD-Rom PHI 05 of the Packard Humanities Institute

Vergil's Aeneid
arma virumque cano
arma uirumque cano
the clitic que
/que<blank>/, /que<,>/, /que<>/, /que<>/, /que<.>/
atque, ubique, undique, quicumque
amatus est / amatust
animum aduertere / animaduertere
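The handling of the clitic que can be illustrated with a small sketch: detach a final -que as its own token unless the word is one of the lexical units listed on the slide. The function names and the exception-list approach are assumptions made for illustration. A real list would need to be far longer (quoque, neque and itaque are further cases), and this is not a description of the LASLA tokenizer.

```python
import re
from typing import List

# Words ending in (or built on) -que that the slide lists as single units.
LEXICALIZED = {"atque", "ubique", "undique", "quicumque"}


def split_que(token: str) -> List[str]:
    """Detach an enclitic -que unless the word is a known lexical unit."""
    lower = token.lower()
    if lower in LEXICALIZED or len(lower) <= 3 or not lower.endswith("que"):
        return [token]
    return [token[:-3], token[-3:]]


def tokenize(text: str) -> List[str]:
    """Split on non-letters, then detach enclitic -que from each word."""
    tokens: List[str] = []
    for word in re.findall(r"[A-Za-z]+", text):
        tokens.extend(split_que(word))
    return tokens


print(tokenize("arma uirumque cano"))
print(tokenize("atque undique"))
```

Searching for the raw string que followed by a blank or a punctuation mark, as in the patterns above, would also hit atque or undique, which is why the split has to be decided word by word.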

23
3. Lemmatizing a text

to allow the recognition of the same lemma in its various occurrences in a text, independently of the variety of its forms in those occurrences
1) Greek and Latin are inflected languages

24
3. Lemmatizing a text

2) Latin spelling is not completely fixed
- assimilation phenomena (inlicio/illicio, adtuli/attuli, quidquid/quicquid)
- haplologies (exspecto/expecto)
- weak phonological status of some phonemes (harena/arena, exhibeo/exibeo, mihi/mi, consul/cosul, etc.)
- reduction of diphthongs to monophthongs (saeta/seta, plaudite/plodite, poenicus/punicus)
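A minimal sketch of why this matters for counting: if spelling variants are mapped to one reference spelling before lookup, occurrences of harena and arena end up under the same entry. The table below uses pairs from this slide and assumes, purely for illustration, that the first member of each pair is the reference spelling.

```python
from collections import Counter

# Variant -> reference spelling; treating the first member of each slide pair
# as the reference form is an assumption made for this example.
VARIANTS = {
    "illicio": "inlicio",
    "attuli": "adtuli",
    "quicquid": "quidquid",
    "expecto": "exspecto",
    "arena": "harena",
    "exibeo": "exhibeo",
    "cosul": "consul",
    "seta": "saeta",
    "plodite": "plaudite",
    "punicus": "poenicus",
}


def normalize(form: str) -> str:
    """Return the reference spelling of a form, lower-cased."""
    lower = form.lower()
    return VARIANTS.get(lower, lower)


forms = ["harena", "arena", "Arena", "seta"]
print(Counter(normalize(f) for f in forms))
```

A lookup table of this kind only covers listed variants; it does not replace lemmatization, which must also deal with inflection and homographs.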

25
3. Lemmatizing a text

2) Latin spelling is not completely fixed
- elision, epenthesis, apheresis, contraction, as well as abbreviation
- disjunction of the parts of compound words, or tmesis
  res publica for respublica
  quo modo for quomodo
  quam... ante for antequam
- diachronic and synchronic morphological variants
  pater familias/pater familiae
  siet/sit
  igni/igne
  fecerit/faxit

26
3. Lemmatizing a text

populus 1, "the people", and populus 2, "the poplar"
licet 1, "it is allowed", and licet 2, "although"

27
3. Lemmatizing a text
Dux as a lemma: cooccurrent lemmas

Deviation  Corpus  Extract  Word
038           932      934  dvx
015          1616      120  miles
014          1285      105  exercitvs1
009         25801      604  qve
009          2447      107  bellvm
009          2298       98  romanvsa
009          1910       88  hostis
008          1725       78  arma
008          1059       58  legio
007          1113       55  castra2
007           862       45  copia
007           615       37  imperator
007           519       36  avctor
006         40004      802  et2
006          1968       70  vrbs
006           786       39  tot
006           536       31  cohors
006           493       29  agmen
006           371       26  comes

Dux as a word form: cooccurrent word forms

Deviation  Corpus  Extract  Word
038           141      141  dux
005           336        8  romanus
005           170        7  auctor
004          2733       19  erat
004           626        8  bello
004           506        8  exercitus
004           482        8  hostium
004           299        7  miles
004           151        5  comes
004           119        4  militiae
004           113        4  deae
004           106        4  cohortibus
004            87        4  campis
004            53        3  diuersis
004            44        3  copiarum
004            39        3  uoluntatis
004            37        3  rati
004            20        3  gregis
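The contrast between the two tables can be sketched in a few lines: counting cooccurrents at the lemma level pools all the inflected forms of a word, while counting at the word-form level only sees one form at a time. The toy sentences below are invented for illustration, and the function counts raw cooccurrences only; it does not compute the deviation score shown in the tables.

```python
from collections import Counter

# Toy sentences as (form, lemma) pairs; invented data, not taken from the corpus.
SENTENCES = [
    [("dux", "DVX"), ("milites", "MILES"), ("hortatur", "HORTOR")],
    [("duces", "DVX"), ("militem", "MILES"), ("laudant", "LAVDO")],
    [("ducem", "DVX"), ("miles", "MILES"), ("sequitur", "SEQVOR")],
]


def cooccurrents(sentences, target, level):
    """Count items sharing a sentence with `target`; level is "form" or "lemma"."""
    position = 0 if level == "form" else 1
    counts = Counter()
    for sentence in sentences:
        keys = [token[position] for token in sentence]
        if target in keys:
            counts.update(key for key in keys if key != target)
    return counts


print(cooccurrents(SENTENCES, "DVX", "lemma"))
print(cooccurrents(SENTENCES, "dux", "form"))
```

In this toy data the lemma MILES cooccurs with DVX in all three sentences, whereas the form dux meets only milites once, which mirrors the much richer profile of the lemma table.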
28
4. Morphosyntactic tagging

using the POS tag
research on Greek determiners (UMR6039-Nice, Michèle Biraud)
- sequences α [?] ε β and α μ [?] ε β attested in the LASLA files
  α: article
  β: noun
  [?]: adjective (tag letter lost in extraction)
  ε: adjective/pronoun
  μ: particle
- [Greek example lost in extraction]

29
4. Morphosyntactic tagging

using the POS tag
research on parallel and reminiscent passages between literary works (Koen Van Haegendoren, Liège)

30
4. Morphosyntactic tagging

using the POS tag to characterise authors and genres

31
4. Morphosyntactic tagging

using the POS tag
LASLA Latin texts and BFM French medieval texts

32
4. Morphosyntactic tagging

lemmatization and POS: the case of adjectives used as nouns
a solution: amicus 1 (noun) vs. amicus 2 (adjective)
but: sunt christiani?
> sanctus, beatus, fidelis or impius
another solution: sanctus is analyzed as
- 21A00_4 when used as an adjective (2 for adjective, 1 for first class, A for nominative singular, 4 for masculine)
- 21A0014 when used as a noun (the additional 1 indicating this use)
also for fideles / omnes fideles, credentes / omnes credentes, laudantes, audientes, legentes
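The second solution keeps a single lemma and moves the noun/adjective distinction into the tag. A minimal sketch of reading that flag is given below. It assumes a 7-character tag in which the sixth character is 1 for nominal use and the seventh is the gender, as the two sanctus examples suggest; the function name and returned keys are hypothetical.

```python
def parse_adjective_tag(tag: str) -> dict:
    """Read an adjective tag such as 21A00_4 or 21A0014 (layout assumed)."""
    if len(tag) != 7 or tag[0] != "2":
        raise ValueError("expected a 7-character tag starting with 2 (adjective)")
    return {
        "class": tag[1],                # 1 = first class
        "case_number": tag[2],          # A = nominative singular
        "used_as_noun": tag[5] == "1",  # the additional 1 marks nominal use
        "gender": tag[6],               # 4 = masculine
    }


print(parse_adjective_tag("21A00_4"))
print(parse_adjective_tag("21A0014"))
```

With this layout, all occurrences of sanctus stay under one lemma, and nominal uses can still be filtered with a single test on the flag.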

33
4. Morphosyntactic tagging

full morphosyntactic analysis: Greek declension in Latin

34
4. Morphosyntactic tagging

full morphosyntactic analysis: 4th conjugation

35
4. Morphosyntactic tagging

full morphosyntactic analysis: the deviant forms
a solution: a special code for the whole declension (domus)
another solution: several lemmas
e.g. a masculine accusative plural saxos, instead of the neuter plural saxa
- a form of a new masculine lemma saxus (not attested in the dictionaries)
- an anomalous form of the neuter lemma saxum
in both cases with the same codification (12: 1 for noun, 2 for 2nd decl.)
but
e.g. facta est tonitrua in aera

36
4. Morphosyntactic tagging

full morphosyntactic analysis: the deviant forms
e.g. facta est tonitrua in aera
- tonitrua used as a nominative singular of the first declension
- in the dictionaries:
  tonitrus, us (4th decl.)
  tonitruum, i (2nd decl.)
  tonitrus, i (m) (2nd decl.)
  tonitru (n) (4th decl.)
  but no tonitrua
- explained as a neuter plural of tonitruum reinterpreted as a feminine singular; but how to lemmatize it?
a solution:
to consider tonitrua as a form of the lemma tonitruum, and to record the peculiarity of its use only in the tag corresponding to the morphosyntactic analysis

37
4. Morphosyntactic tagging

full morphosyntactic analysis: the deviant forms
e.g.
regular forms of the lemma dulcis, "sweet": tagged with the code of the adjectives of the second class in -is (24)
anomalous form dulciam: tagged as a form of the lemma dulcis, with the code of the adjectives of the first class in -is (21)
and in the Classical Latin corpus:
caelum (n) and caelus (m)
inferni (m) and inferna (n)
cingula (f), cingulus (m) and cingulum (n)

38
4. Morphosyntactic tagging

using the full morphosyntactic analysis: narrative indicative tenses

39
4. Morphosyntactic tagging

using the full morphosyntactic analysis: repeated sequences (adj.-adj.)

40
5. Syntactic tagging

the Treebank approaches
- Index Thomisticus
- Latin Treebanks at Perseus
based on
- Dependency grammar (Prague Dependency Treebank)
- the Latin grammar of H. Pinkster
a training corpus is tagged manually and other corpora are encoded using automatic taggers
problems:
- a method imposing a specific linguistic framework
- mixing theoretical linguistic frameworks
- producing data which are not verified

41
5. Syntactic tagging: the Project LatSynt

original and innovative research on word order and Latin sentence structures
Objectives
- to develop automatic procedures for parsing based on word order rules (in order to offer an alternative to Treebank approaches)
- to evaluate the relevance of recent linguistic descriptions
- to offer new tools for textual data analysis (TDA)
  - for modeling enunciative structure
  - for Latin text classification and segmentation
Methods
- to develop automatic procedures grounded on
  - the morphological information already encoded in the LASLA database
  - the linearity of the text
- to refine and improve the computer programs in successive stages

42
5. Syntactic tagging: the Project LatSynt, the first stage

Objective
- to mark out the boundaries of finite-verb clauses (provided with a subordinating word)
- to specify the level of their subordination (their embedding)
from the alphanumeric data of the LASLA database

43
5. Syntactic tagging: the Project LatSynt, the first stage
Analysis
audisset: BN = cum clause
cum: 32 = subj. imp.
44
5. Syntactic tagging: the Project LatSynt, the first stage

1st stage
quem uidi: LN14 (subordination in QVI, perfect indicative) transferred to the records of both
- the quem form (lemma QVI) and
- the uidi form (lemma VIDEO)
2nd stage
0014 LN14 -LN14 LN12 GK32 -GK32 -LN12

45
5. Syntactic tagging: the Project LatSynt, the first stage

3rd stage
<0014>LN14 -LN14LN12 GK32 -GK32 -LN12.
- Final stage
Tacitus, Annals, 13,11,2 / P2849 / <0014>LN14-LN14LN12GK32-GK32-LN12
<secuta (est)> que lenitas in Plautium Lateranum
quem ob adulterium Messalinae ordine demotum
-reddidit senatui clementiam suam obstringens
crebris orationibus quas Seneca testificando
quam honesta -praeciperet uel iactandi ingenii
uoce principis -uulgabat
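The bracketing formula of the final stage can be checked mechanically: a bare code opens a subordinate clause, the same code preceded by a hyphen closes it, and the main-clause code stands in angle brackets. The sketch below computes the embedding level of each boundary with a stack. The code shape (two capital letters plus two digits) and the angle-bracket notation for the main clause are assumptions read off the examples, and this is not the LatSynt program itself.

```python
import re

# <dddd> = main-clause code; XXdd opens a clause; -XXdd closes it (assumed shapes).
TOKEN = re.compile(r"<(\d{4})>|(-?)([A-Z]{2}\d{2})")


def clause_depths(formula: str):
    """Return (code, kind, depth) triples and verify that clauses nest properly."""
    stack = []
    events = []
    for match in TOKEN.finditer(formula):
        main, closing, code = match.groups()
        if main is not None:
            events.append((main, "main", len(stack)))
        elif closing:
            if not stack or stack[-1] != code:
                raise ValueError(f"unmatched closing boundary -{code}")
            events.append((code, "close", len(stack)))
            stack.pop()
        else:
            stack.append(code)
            events.append((code, "open", len(stack)))
    if stack:
        raise ValueError(f"unclosed clause(s): {stack}")
    return events


for event in clause_depths("<0014>LN14-LN14LN12GK32-GK32-LN12"):
    print(event)
```

For the Tacitus sentence this gives depth 1 for the quem and quas clauses and depth 2 for the quam clause embedded in the latter.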

46
5. Syntactic tagging: the Project LatSynt, the first stage

analysing left dislocations
5,22,2 P0909 1 BN35-BN35<0014>
ii cum ad castra -uenissent, nostri eruptione
facta multis eorum interfectis, capto etiam
nobili duce Lugotorige suos incolumes
<reduxerunt>

47
5. Syntactic tagging: the Project LatSynt, the first stage

Results: to bring out
- linguistic regularities (prolepsis)
- distances between texts (Caesar, Tacitus)
- the importance of semantic and pragmatic phenomena
Perspectives
- to mark out the boundaries of complex syntagms (in order to mark out the boundaries of subordinate clauses without a subordinator)
- to promote interactions with other research on text topology (at the micro- and macro-structural levels)
  - repeated segments (Hyperbase-Latin, in collaboration with BCL Nice/CNRS)
  - syntactic and multidimensional motifs (in collaboration with BCL Nice/CNRS)
- to use the results for text segmentation and classification