Title: Lemmatizing and tagging a corpus: which information for which linguistic purposes?


1
Lemmatizing and tagging a corpus: which
information for which linguistic purposes?
The example of the Greek and Latin LASLA databases
compared to others
Dominique Longrée, LASLA, Université de Liège and
FUSL (Bruxelles)
2
0. Introduction: objectives
  • to share the expertise of LASLA (50 years)
  •   Laboratoire d'Analyse statistique des Langues anciennes,
  •   set up in 1961 at the University of Liège
  • to open a discussion
  •   1) which information in a database, and for which purposes?
  •   2) what influence on the results of our linguistic studies?
  • to compare the lemmatizing and tagging practices of LASLA
    with the practices of other (Greek and Latin) databases

3
0. Introduction: plan
  • LASLA and its databases
  • the research project LatLem
  • the Opera Latina Web interface and the
    Hyperbase-Latin CD-Rom
  • the process of tokenization
  • the process of lemmatization
  • the process of tagging (morphosyntactic tags)
  • the process of tagging (syntactic, semantic and
    pragmatic tags)
  • the research project LatSynt

4
1. The LASLA databases: Greek and Latin
  • The Laboratory for Statistical Analysis of Classical Languages (L.A.S.L.A.)
  • - set up in September 1961
  • - the first research centre
  • - aiming to study classical languages (Greek and Latin)
  • - using automatic data processing technologies
  • part of the Faculty of Philosophy and Letters at the University of Liège
  • Missions
  • 1) a detailed study of Greek and Latin languages and literatures using computer techniques as well as statistical and quantitative methods
  • 2) the creation of literary data banks and computer tools, in order to distribute those data banks and make the most of them through all available media

5
1. The LASLA Greek Database
1,200,000 words/tokens
Attic orators: Andocides, Antiphon, Isocrates and Lysias
Aristotle: De Anima, De partibus animalium, Categories, Metaphysics, Physics, Historia animalium
Plato: 8 dialogues
All classical tragedies: Aeschylus, Sophocles, Euripides and fragments
Pausanias
Christian authors, for example St John Chrysostom, De sacerdotio; Hesychius of Jerusalem, Homilies
6
1. The LASLA Greek Database
For each word form, the following data are given:
1. the reference of the word form, according to the ars citandi
2. the lemma (the word as it appears in the dictionary of reference, the Greek-English Lexicon of H. G. Liddell, R. Scott and H. S. Jones)
3. the grammatical category of the word (POS)
[Table: sample records with the columns lemma, token, reference, POS; the Greek characters of the records were lost in extraction]
7
1. The LASLA Latin Database
  • Latin classical texts: 2,000,000 tokens
  • The LASLA method
  • - Étienne ÉVRARD, Le laboratoire d'analyse statistique des langues anciennes de l'Université de Liège, Mouvement scientifique en Belgique, 9, 1962, p. 163-169
  • - Joseph DENOOZ, L'ordinateur et le latin, Techniques et méthodes, Revue de l'organisation internationale pour l'étude des langues anciennes par ordinateur, 1978, 4, p. 1-36

8
1. The LASLA Latin Database
the available fully lemmatized and encoded texts
Classical texts (more than 2,000,000 words/tokens):
Caesar et alii, Cato, Catullus, Cicero (rhetoric works all philosophical works partim), Curtius, Horatius, Iuvenalis, Lucretius, Ovidius, Persius, Petronius, Plautus (8 plays), Plinius (Iunior), Propertius, Sallustius, Seneca, Tacitus, Tibullus, Virgilius
9
1. The LASLA Latin Database
the available fully lemmatized and encoded texts: other texts
Medio-Latin: Sedulius Scottus, hagiographic texts (300,000 words)
Neo-Latin: Descartes, Spinoza
next available texts (works in progress):
Cicero (letters), Cornelius Nepos, Livius, Suetonius, Historia Augusta, Busbecq (by L. Grailet)
10
1. The LASLA Latin Database
  • 2,500,000 words/tokens
  • Bibliotheca Teubneriana Latina: 13 million tokens
  • fully lemmatized texts,
  • with full morphosyntactic tagging and 1 syntactic tag
  • systematically verified by a philologist

11
1. The LASLA Latin Database
For each word of the text:
1. the lemma (the word as it appears in the dictionary of reference, the Lexicon totius latinitatis of Forcellini, ed. Corradini, Padua, 1864)
2. an index which makes it possible to distinguish homograph lemmas (ET 1 adverb, ET 2 coordinating conjunction) or to mark proper names and adjectives derived from proper names (N opposite Roma means proper name)
3. the form as it appears in the text
4. the reference, according to the ars citandi
5. the complete morphological analysis in alphanumeric format
6. for verbs, syntactic indications: main clause verbs, subordinate clause verbs (sorted by subordination type)
12
1. The LASLA Latin Database
the information available for each Latin form:
Lemma, index, TextForm, Reference, Analysis
Index: N = name; ET 1 = adverb, ET 2 = coordinating conjunction
13
1. The LASLA Latin Database
the information available for each Latin form:
Lemma, index, TextForm, Reference, Analysis
Analysis:
urbem 13C00: 1 = Noun, 3 = 3rd decl., C = Acc. sing.
habuere 52L14: 5 = Verb, 2 = 2nd conj. act., L = 3rd pers. plur., 1 = Ind., 4 = Perfectum; main clause
14
1. The LASLA Latin Database
the information available for each Latin form:
Analysis:
audisset 5JC32: 5 = Verb, J = 1st conj. dep., C = 3rd pers. sing., 3 = Subj., 2 = Imperf.; BN = cum clause
requisisse 53074: 5 = Verb, 3 = 3rd conj. act., 0 = impers., 7 = Inf., 4 = Perfectum; AG = Accusativus cum Infinitivo
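
The positional reading of the analysis codes can be sketched in a few lines of Python. This is only an illustration: the value tables below contain just the codes glossed on the slides for urbem, habuere and requisisse, the function and table names are hypothetical, and the real LASLA tables are far larger and are not reproduced here.

```python
# Value tables restricted to the codes glossed on the slides (assumption:
# each position of the 5-character code carries one piece of information).
POS = {"1": "noun", "5": "verb"}
NOUN_CLASS = {"3": "3rd declension"}
NOUN_CASE_NUMBER = {"C": "accusative singular"}
VERB_CLASS = {"2": "2nd conjugation, active", "3": "3rd conjugation, active"}
VERB_PERSON_NUMBER = {"L": "3rd person plural", "0": "impersonal"}
MOOD = {"1": "indicative", "7": "infinitive"}
TENSE = {"4": "perfectum"}


def decode_analysis(code):
    """Turn a 5-character analysis code into a dictionary of labelled fields."""
    if len(code) != 5:
        raise ValueError("expected a 5-character code, got %r" % code)
    pos = POS.get(code[0], "unknown")
    if pos == "noun":
        return {
            "pos": pos,
            "class": NOUN_CLASS.get(code[1], "unknown"),
            "case_number": NOUN_CASE_NUMBER.get(code[2], "unknown"),
        }
    if pos == "verb":
        return {
            "pos": pos,
            "class": VERB_CLASS.get(code[1], "unknown"),
            "person_number": VERB_PERSON_NUMBER.get(code[2], "unknown"),
            "mood": MOOD.get(code[3], "unknown"),
            "tense": TENSE.get(code[4], "unknown"),
        }
    # Other parts of speech are not covered by this sketch.
    return {"pos": pos}


for form, code in [("urbem", "13C00"), ("habuere", "52L14"), ("requisisse", "53074")]:
    print(form, decode_analysis(code))
```

Codes that the tables do not know are reported as "unknown" rather than guessed, which keeps the sketch honest about how little of the coding scheme the slides actually show.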
15
1. The LASLA system for tagging
old-fashioned
the LatLem project
16
1. The LASLA Greek and Latin Databases
  • accessible through
  • indexes published by
  •   G. Olms (Hildesheim)
  •   the Centre Informatique de Philosophie et Lettres (CIPL-Liège)
  • for Greek texts: specific software
  • for Latin texts:
  •   the Opera Latina Web interface
  •   the Hyperbase-Latin CD-Rom

17
1. The LASLA Latin Database
accessible through Opera Latina
www.ulg.ac.be/cipl/lsl.htm
18
1. The LASLA Latin Database
accessible through the Hyperbase-Latin CD-Rom
in collaboration with the UMR 6039 Bases, corpus, langage (CNRS-University of Nice)
19
2. Tokenizing a text: establishing the text
20
2. Tokenizing a text: segmenting the text into sentences
/Accusa senatum, accusa equestrem ordinem...,
accusa omnes ordines, omnes ciues../
/Accusa senatum,/ /accusa equestrem ordinem...,
/ /accusa omnes ordines, omnes ciues../
21
2. Tokenizing a text: segmenting the text into words
compare with the PHI 05 CD-Rom of the Packard Humanities Institute
The string -ibil-
22
2. Tokenizing a text: segmenting the text into words
  • compare with the PHI 05 CD-Rom of the Packard Humanities Institute
  • Vergil's Aeneid
  • arma virumque cano
  • arma uirumque cano
  • the clitic -que
  • /que<blank>/, /que<,>/, /que<>/, /que<>/, /que<.>/
  • atque, ubique, undique, quicumque
  • amatus est / amatust
  • animum aduertere / animaduertere
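
The -que problem can be made concrete with a small Python sketch. It is not the LASLA or PHI procedure, which the slides do not describe; it only shows why a plain string search is not enough. The names `NOT_ENCLITIC`, `split_que` and `tokenize` are hypothetical, and the exception list holds only the four forms quoted on the slide.

```python
# Forms ending in -que where -que is not a detachable enclitic.
# Only the four examples from the slide; a usable list would be much longer
# (forms such as quoque or neque would be split wrongly by this sketch).
NOT_ENCLITIC = {"atque", "ubique", "undique", "quicumque"}


def split_que(token):
    """Split 'word + que' into two tokens unless the form is a listed exception."""
    lower = token.lower()
    if lower.endswith("que") and len(lower) > 3 and lower not in NOT_ENCLITIC:
        return [token[:-3], "que"]
    return [token]


def tokenize(text):
    """Whitespace tokenization, punctuation stripping, then enclitic splitting."""
    tokens = []
    for raw in text.split():
        word = raw.strip(".,;:!?")
        if word:
            tokens.extend(split_que(word))
    return tokens


print(tokenize("arma uirumque cano"))
print(tokenize("atque undique"))
```

Stripping the punctuation before testing for -que is what lets one rule cover the blank, comma and full-stop contexts listed above.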

23
3. Lemmatizing a text
  • to allow the recognition of the same lemma in
    its various occurrences in a text,
  • independently of the variety of its forms in
    those occurrences
  • 1) Greek and Latin are inflected languages

24
3. Lemmatizing a text
  • to allow the recognition of the same lemma in
    its various occurrences in a text,
  • independently of the variety of its forms in
    those occurrences
  • 2) the Latin spelling is not completely fixed
  • assimilation phenomena (inlicio/illicio
    adtuli/attuli quidquid/quicquid)
  • haplologies (exspecto/expecto)
  • weak phonological status of some phonemes
  • (harena/arena, exhibeo/exibeo, mihi/mi,
    consul/cosul, etc)
  • transformation of diphthongs into monophthongs
  • (saeta/seta plaudite/plodite, poenicus/punicus)
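
One simple way to neutralise such spelling variation before lemmatization is a lookup table that maps each variant to a single reference spelling. The Python sketch below uses only pairs quoted on the slide. Treating the first member of each pair as the reference spelling is an arbitrary assumption made for the illustration, and the names `PAIRS`, `REFERENCE` and `normalize` are hypothetical.

```python
# (reference spelling, variant) pairs taken from the slide. Which member
# counts as the reference is an arbitrary choice made for this sketch.
PAIRS = [
    ("inlicio", "illicio"),
    ("adtuli", "attuli"),
    ("quidquid", "quicquid"),
    ("exspecto", "expecto"),
    ("harena", "arena"),
    ("exhibeo", "exibeo"),
    ("consul", "cosul"),
    ("saeta", "seta"),
    ("plaudite", "plodite"),
    ("poenicus", "punicus"),
]

REFERENCE = {variant: reference for reference, variant in PAIRS}


def normalize(form):
    """Return the reference spelling of a form, or the form itself if unlisted."""
    lower = form.lower()
    return REFERENCE.get(lower, lower)


print([normalize(f) for f in ["Arena", "expecto", "consul", "plodite"]])
```

A table of this kind cannot resolve forms that are variants of one word and regular forms of another, which is one reason why verification by a philologist remains necessary.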

25
3. Lemmatizing a text
  • to allow the recognition of the same lemma in
    its various occurrences in a text,
  • independently of the variety of its forms in
    those occurrences
  • 2) the Latin spelling is not completely fixed
  • elision, epenthesis, apheresis, contraction, as
    well as abbreviation
  • disjunction of parts of the compound words or
    tmesis
  • res publica for respublica
  • quo modo for quomodo
  • quam... ante for antequam
  • diachronic and synchronic morphological variants
  • pater familias/pater familiae
  • siet/sit
  • igni/igne
  • fecerit/faxit.

26
3. Lemmatizing a text
  • populus 1, "the people", and populus 2, "the poplar"
  • licet 1, "it is allowed", and licet 2, "although"

27
3. Lemmatizing a text
Dux as a lemma: co-occurrent lemmas
Ecart  Corpus  Extrait  Mot
038      932     934    dvx
015     1616     120    miles
014     1285     105    exercitvs1
009    25801     604    qve
009     2447     107    bellvm
009     2298      98    romanvsa
009     1910      88    hostis
008     1725      78    arma
008     1059      58    legio
007     1113      55    castra2
007      862      45    copia
007      615      37    imperator
007      519      36    avctor
006    40004     802    et2
006     1968      70    vrbs
006      786      39    tot
006      536      31    cohors
006      493      29    agmen
006      371      26    comes
Dux as a word form: co-occurrent word forms
Ecart  Corpus  Extrait  Mot
038      141     141    dux
005      336       8    romanus
005      170       7    auctor
004     2733      19    erat
004      626       8    bello
004      506       8    exercitus
004      482       8    hostium
004      299       7    miles
004      151       5    comes
004      119       4    militiae
004      113       4    deae
004      106       4    cohortibus
004       87       4    campis
004       53       3    diuersis
004       44       3    copiarum
004       39       3    uoluntatis
004       37       3    rati
004       20       3    gregis
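
The difference between the two lists can be reproduced on a toy scale. The Python sketch below counts the units that share a sentence with a pivot, once by word form and once by lemma. The three sentences are invented for the illustration, the function name `cooccurrents` is hypothetical, and the sketch returns raw counts only: it does not compute the Ecart score shown in the output above.

```python
from collections import Counter

# Invented toy corpus: each sentence is a list of (word form, lemma) pairs.
SENTENCES = [
    [("dux", "DVX"), ("milites", "MILES"), ("hortatur", "HORTOR")],
    [("duces", "DVX"), ("militum", "MILES"), ("adsunt", "ADSVM")],
    [("ducem", "DVX"), ("miles", "MILES"), ("sequitur", "SEQVOR")],
]

FORM, LEMMA = 0, 1


def cooccurrents(sentences, pivot, level):
    """Count units found in the same sentence as the pivot, at the given level."""
    counts = Counter()
    for sentence in sentences:
        units = [token[level] for token in sentence]
        if pivot in units:
            counts.update(unit for unit in units if unit != pivot)
    return counts


print(cooccurrents(SENTENCES, "dux", FORM))
print(cooccurrents(SENTENCES, "DVX", LEMMA))
```

At the word form level the pivot dux is found in one sentence only, while at the lemma level all three sentences contribute and MILES is counted three times: the same effect, in miniature, as the shift from 141 to 932 corpus occurrences between the two lists.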
28
4. Morphosyntactic tagging
  • using the POS tag
  • research on Greek determiners (UMR 6039-Nice, Michèle Biraud)
  • - sequences α ? ε β and α μ ? ε β attested in the LASLA files
  • α article
  • β noun
  • ? adjective
  • ε adjective/pronoun
  • μ particle
  • - ?? ????? p??te? ?????p??
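
Searching a tagged text for such POS sequences amounts to sliding a window over the tag column. The Python sketch below is a minimal illustration. It uses spelled-out tag names instead of the one-letter LASLA tags, placeholder tokens instead of Greek text, and a hypothetical function name `find_tag_sequences`.

```python
def find_tag_sequences(tagged, pattern):
    """Return every run of tokens whose POS tags match the pattern exactly."""
    pattern = list(pattern)
    tags = [tag for _, tag in tagged]
    width = len(pattern)
    return [
        tagged[start:start + width]
        for start in range(len(tags) - width + 1)
        if tags[start:start + width] == pattern
    ]


# Placeholder tokens (w1, w2, ...) stand in for the Greek word forms.
TAGGED = [
    ("w1", "particle"),
    ("w2", "article"),
    ("w3", "adjective"),
    ("w4", "adjective/pronoun"),
    ("w5", "noun"),
    ("w6", "verb"),
]

hits = find_tag_sequences(TAGGED, ["article", "adjective", "adjective/pronoun", "noun"])
print(hits)
```

Because the search runs on the tags and not on the word forms, it retrieves every instance of the construction whatever the lexical items involved.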

29
4. Morphosyntactic tagging
  • using the POS tag
  • research on parallel and reminiscent passages
    between literary works
  • (Koen Van Haegendoren, Liège)

30
4. Morphosyntactic tagging
  • using the POS tag
  • to characterise authors
  • and genres

31
4. Morphosyntactic tagging
  • using the POS tag
  • LASLA Latin texts and BFM medieval French texts

32
4. Morphosyntactic tagging
  • lemmatization and POS: the case of adjectives used as nouns
  • a solution: amicus 1 noun vs. amicus 2 adjective
  • but sunt christiani?
  • > sanctus, beatus, fidelis or impius
  • another solution: sanctus is analyzed as
  • 21A00_4 when used as an adjective (2 for adjective,
  •   1 for first class,
  •   A for nominative singular,
  •   4 for masculine)
  • 21A0014 when used as a noun (the additional 1 indicating this use)
  • also for fideles / omnes fideles, credentes / omnes credentes, laudantes, audientes, legentes
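
With this second solution, the nominal use can be read off the code without touching the lemma. The Python sketch below assumes a 7-character code whose sixth position is blank (written _ on the slide) or 1, and whose seventh position gives the gender. That layout is inferred from the two examples above only, and the function name is hypothetical.

```python
def describe_adjective_code(code):
    """Read the fields of a 7-character adjective code such as 21A00_4."""
    if len(code) != 7 or code[0] != "2":
        raise ValueError("expected a 7-character adjective code, got %r" % code)
    return {
        "class": code[1],              # 1 = first class
        "case_number": code[2],        # A = nominative singular
        "used_as_noun": code[5] == "1",  # the additional 1 marks the nominal use
        "gender": code[6],             # 4 = masculine
    }


print(describe_adjective_code("21A00_4"))
print(describe_adjective_code("21A0014"))
```

Both records keep the lemma sanctus, so a query on the lemma finds every occurrence, and a query on the flag separates the two uses.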

33
4. Morphosyntactic tagging
  • full morphosyntactic analysis: Greek declension in Latin

34
4. Morphosyntactic tagging
  • full morphosyntactic analysis: 4th conjugation

35
4. Morphosyntactic tagging
  • full morphosyntactic analysis: the deviant forms
  • a solution: a special code for the whole declension (domus)
  • another solution: several lemmas
  • e.g. a masculine accusative plural saxos, instead of the neuter plural saxa
  • - a form of a new masculine lemma saxus (not attested in the dictionaries)
  • - an anomalous form of the neuter lemma saxum
  • in both cases with the same codification (12: 1 for noun, 2 for 2nd decl.)
  • but
  • e.g. facta est tonitrua in aera

36
4. Morphosyntactic tagging
  • full morphosyntactic analysis: the deviant forms
  • e.g. facta est tonitrua in aera
  • - tonitrua used as a nominative singular of the first declension
  • - in the dictionaries:
  •   tonitrus, us (4th decl.)
  •   tonitruum, i (2nd decl.)
  •   tonitrus, i (m) (2nd decl.)
  •   tonitru (n) (4th decl.)
  •   but no tonitrua
  • - explained as a neuter plural of tonitruum reinterpreted as a feminine singular, but how should it be lemmatized?
  • a solution:
  •   to consider tonitrua as a form of the lemma tonitruum
  •   and to record the peculiarity of its use only in the tag corresponding to the morphosyntactic analysis

37
4. Morphosyntactic tagging
  • full morphosyntactic analysis: the deviant forms
  • e.g.
  • regular forms of the lemma dulcis, "sweet",
  • tagged with the code of the adjectives of the second class in -is (24)
  • anomalous form dulciam
  • tagged as a form of the lemma dulcis
  • with the code of the adjectives of the first class in -is (21)
  • and in the Classical Latin corpus:
  • caelum (n) and caelus (m)
  • inferni (m) and inferna (n)
  • cingula (f), cingulus (m) and cingulum (n)

38
4. Morphosyntactic tagging
  • using the full morphosyntactic analysis: narrative indicative tenses

39
4. Morphosyntactic tagging
  • using the full morphosyntactic analysis: repeated sequences (adj.-adj.)

40
5. Syntactic tagging
  • the Treebank approaches
  • - Index Thomisticus
  • - Latin Treebanks at Perseus
  • based on
  • - Dependency Grammar (Prague Dependency Treebank)
  • - the Latin grammar of H. Pinkster
  • a training corpus is tagged manually and other corpora are encoded by using automatic taggers
  • problems
  • a method imposing a specific linguistic framework
  • mixing theoretical linguistic frameworks
  • producing data which are not verified

41
5. Syntactic tagging: the LatSynt project
  • original and innovative research on word order and Latin sentence structures
  • Objectives
  • - to develop automatic parsing procedures based on word order rules (in order to offer an alternative to Treebank approaches)
  • - to evaluate the relevance of recent linguistic descriptions
  • - to offer new tools for textual data analysis (TDA)
  •   - for modeling enunciative structure
  •   - for the classification and segmentation of Latin texts
  • Methods
  • - to develop automatic procedures grounded on
  •   - the morphological information already encoded in the LASLA database
  •   - the linearity of the text
  • - to refine and improve the computer programmes in successive stages

42
5. Syntactic tagging: the LatSynt project, the first stage
  • Objective: - to mark out the boundaries of finite verb clauses (introduced by a subordinating word)
  • - to specify their level of subordination (their embedding)
  • from the alphanumeric data of the LASLA database

43
5. Syntactic tagging: the LatSynt project, the first stage
Analysis: audisset: BN = cum clause (cum), 32 = subj. imperf.
44
5. Syntactic tagging: the LatSynt project, the first stage
  • 1st stage
  • quem uidi: LN14 (subordination in QVI, perfect indicative) is transferred to the records of both
  • - the quem form (lemma QVI) and
  • - the uidi form (lemma VIDEO)
  • 2nd stage
  • 0014 LN14 -LN14 LN12 GK32 -GK32 -LN12

45
5. Syntactic tagging: the LatSynt project, the first stage
  • 3rd stage
  • <0014> LN14 -LN14 LN12 GK32 -GK32 -LN12.
  • - Final stage
  • Tacitus, Annales, 13,11,2 / P2849 /
  • <0014> LN14 -LN14 LN12 GK32 -GK32 -LN12
  • <secuta (est)> que lenitas in Plautium Lateranum
    quem ob adulterium Messalinae ordine demotum
    -reddidit senatui clementiam suam obstringens
    crebris orationibus quas Seneca testificando
    quam honesta -praeciperet uel iactandi ingenii
    uoce principis -uulgabat
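
The embedding level of each clause can be computed from such a marker string with a simple stack. The Python sketch below rests on a reading of the notation that the slides suggest but do not state outright: a bare code (LN14) opens a clause at its subordinator, the same code with a leading hyphen (-LN14) closes it at its verb, and a code in angle brackets (<0014>) marks the main verb. The function name `embedding_depths` is hypothetical, and this is not the LatSynt program itself.

```python
def embedding_depths(markers):
    """Pair opening and closing clause markers and report the depth of each one."""
    stack = []
    result = []
    for marker in markers:
        if marker.startswith("<") and marker.endswith(">"):
            # Main clause verb: it sits at whatever depth is currently open.
            result.append((marker, len(stack)))
        elif marker.startswith("-"):
            code = marker[1:]
            if not stack or stack[-1] != code:
                raise ValueError("closing marker %r has no matching opener" % marker)
            result.append((marker, len(stack)))
            stack.pop()
        else:
            stack.append(marker)
            result.append((marker, len(stack)))
    if stack:
        raise ValueError("unclosed clause(s): %r" % stack)
    return result


for marker, depth in embedding_depths("<0014> LN14 -LN14 LN12 GK32 -GK32 -LN12".split()):
    print(depth, marker)
```

On the Tacitus string, the clause coded GK32 comes out at depth 2, inside the clause coded LN12, while the first LN14 clause stays at depth 1.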

46
5. Syntactic tagging: the LatSynt project, the first stage
  • analysing left dislocations
  • 5,22,2 P0909 1 BN35 -BN35 <0014>
  • ii cum ad castra -uenissent, nostri eruptione
    facta multis eorum interfectis, capto etiam
    nobili duce Lugotorige suos incolumes
    <reduxerunt>

47
5. Syntactic tagging: the LatSynt project, the first stage
  • Results: to bring out
  • - linguistic regularities (prolepsis)
  • - distances between texts (Caesar, Tacitus)
  • - the importance of semantic and pragmatic phenomena
  • Perspectives
  • - to mark out the boundaries of complex syntagms (in order to mark out the boundaries of subordinate clauses without a subordinator)
  • - to promote interactions with other research on text topology (at the micro- and macro-structural levels):
  •   - repeated segments (Hyperbase-Latin, in collaboration with BCL Nice/CNRS)
  •   - syntactic and multidimensional motifs (in collaboration with BCL Nice/CNRS)
  • - to use the results for text segmentation and classification

48
6. Tagging: what else?
  • semantic and pragmatic information
  • semantic functions: Goal, Recipient, Agent, etc.
  • pragmatic functions: Rheme, Topic, Focus, etc.
  • building databases available for all kinds of research
  • without imposing specific linguistic frameworks or analyses
  • tokenization, lemmatization and tagging:
  • not trivial processes
  • requiring thorough theoretical thinking