Title: Eesti keele ressursid keeleteaduse allikmaterjalina

1
Eesti keele ressursid keeleteaduse allikmaterjalina
(Estonian language resources as source material for linguistics)

Kadri Muischnek
TÜ (University of Tartu)

2
Overview of the talk

What are language resources?
Language corpora:
what corpora are and what they are good for
some corpora of Estonian at UT: www.cl.ut.ee/korpused
what should be borne in mind while using these corpora
User interfaces and facilities, such as:
user interface: www.cl.ut.ee/kasutajaliides
morphology-aware user interface: www.keeleveeb.ee
collocation extraction tool: www.rabauti.ee/clc
frequency lists: www.cl.ut/sagedused1

3
Motto
What is the difference between a dialect and a language?
Some time ago: "A language is a dialect with an army and a navy" (Uriel Weinreich)
Nowadays: "A language is a dialect with a dictionary, grammar, parser and a multi-million corpus" (Lars Borin)
4
Language resources

Language resources: all knowledge sources based on language, including text corpora, lexicons, databases, formal grammar descriptions, etc.
Keeleressursid (the Estonian term): electronic data collections, including text collections (corpora), lexicons, databases and formal grammar descriptions
For Estonian language resources, see www.keeleressursid.ee

5
Language corpora
What is a language corpus?
Easy answer: an electronic text collection.
Ideal: a polyfunctional electronic text collection consisting of texts chosen purposefully so that, taken together, they give a faithful picture of the whole language (or of some sublanguage) over a certain time span.
6
Language corpora 2

Size depends on the sublanguage, annotation, etc.
Eesti keele Koondkorpus (Reference Corpus of Estonian): ca 245 million words
Deutsches Referenzkorpus (DeReKo), or the Mannheim corpus: ca 5.4 billion words
http://www.ids-mannheim.de/kl/projekte/korpora/

7
Representativeness and text classes
... texts that are chosen purposefully to give a full representation of a language (or, at least, of a certain part of the language) during a certain time span.
A corpus meeting this condition is called a representative corpus.
Let's have a look at the text classes of the Brown/LOB-style corpus of Estonian:
http://www.cl.ut.ee/korpused/baaskorpus/1980/
Closed vs open corpora
Synchronic vs diachronic corpora
8
Sublanguages
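An example from legal language:
Määruse alusel § 3 lõikes 1 nimetatud taotleja kaudu põllumajandustoodete töötlejatele ning taotleja liikmetele, kes ei ole põllumajandustootjad ega põllumajandustoodete töötlejad, antav toetus on vähese tähtsusega abi komisjoni määruse nr 1998/2006, milles käsitletakse asutamislepingu artiklite 87 ja 88 kohaldamist vähese tähtsusega abi suhtes (ELT L 379, 28.12.2006, lk 5-10), mõistes.
(Support granted on the basis of the regulation, through the applicant named in § 3(1), to processors of agricultural products and to members of the applicant who are neither agricultural producers nor processors of agricultural products, is de minimis aid within the meaning of Commission Regulation No 1998/2006 on the application of Articles 87 and 88 of the Treaty to de minimis aid (OJ L 379, 28.12.2006, pp. 5-10).)
An example from informal internet language:
a olks, sorri offtopicu eest.
(but OK, sorry for the off-topic.)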
9
Corpora of written Estonian at CL.UT

www.cl.ut.ee/korpused
Baaskorpus: written Estonian of the 1980s; closed, representative, 1 million words
Written Estonian from the period 1890-1990: closed, partly (?) representative, fewer text classes than in the previous corpus
Koondkorpus (Reference Corpus) 1990: open, new texts added constantly, ca 245 million words at the moment
A subcorpus of the Reference Corpus, the Balanced Corpus: 5 million words of fiction texts, 5 million words of newspaper texts, 5 million words of science texts

10
Corpora of written Estonian at CL.UT

How one can use these corpora:
1) concordancer: http://www.cl.ut.ee/korpused/kasutajaliides/ (a bit slow; regular expressions can be used)
2) download the TEI XML versions: http://www.cl.ut.ee/korpused/segakorpus/
We'll talk about these later:
3) morphology-aware user interfaces
4) collocation extraction tool
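
What a concordancer with regular-expression support does can be sketched in a few lines. This is only an illustration, not the code behind the CL.UT interface: the function name kwic, the context width and the sample text (borrowed from a later slide) are assumptions.

```python
import re

def kwic(text, pattern, width=30):
    """Return one keyword-in-context line per regex match in text."""
    lines = []
    for m in re.finditer(pattern, text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so that the keywords line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

sample = ("Triibuline kass lasi koerakuudi juurest ruttu jalga, "
          "aga suur koer ajas teda haukudes taga.")

# All word-forms beginning with "kass" (kass, kassi, kassile, ...).
hits = kwic(sample, r"\bkass\w*")
for line in hits:
    print(line)
```

A pattern such as `\bkass\w*` is one way to catch several inflected forms at once when the interface has no morphological information to rely on.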

11
Corpus annotation
Annotation: adding some explicit information to the corpus, e.g.
1) Structure of the text: paragraphs, sentences, non-text (tables, formulae, etc.)
2) Morphological information: lemmas, parts of speech, grammatical categories
3) Syntactic information: syntactic functions, phrases, the relations between words/phrases
4) Semantic information: word senses, semantic roles, etc.
5) ... and much more
12
Morphological annotation
Mees peeti kinni ("The man was detained"); every possible analysis is listed under each word-form:
Mees
  mees0 //_S_ sg n, //
  mesis //_S_ sg in, //
peeti
  peet0 //_S_ adt, sg p, //
  pidati //_V_ main indic impf imps af //
kinni
  kinni0 //_D_ //
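
Analyses in this notation can be turned into structured records with a small parser. The sketch below assumes that every reading has the shape "form //_POS_ features //", exactly as it appears in this transcript; the name parse_readings is hypothetical, and the real corpus files may differ in detail.

```python
import re

# One reading: an analysed form, a one-letter part-of-speech code between
# underscores, then grammatical categories up to the closing "//".
READING = re.compile(r"(\S+)\s+//_(\w)_\s*(.*?)\s*//")

def parse_readings(analysis):
    """Split an analysis string into (form, pos, features) tuples."""
    return [
        (m.group(1), m.group(2), m.group(3).rstrip(", "))
        for m in READING.finditer(analysis)
    ]

readings = parse_readings(
    "peet0 //_S_ adt, sg p, // pidati //_V_ main indic impf imps af //"
)
for form, pos, features in readings:
    print(form, pos, features)
```

Here the word-form peeti comes out with two readings, one nominal (S) and one verbal (V), which is the kind of ambiguity that manual checking or an automatic tagger has to resolve.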
13
Morphologically annotated corpora of Estonian

http://www.cl.ut.ee/korpused/morfkorpus/
613,000 words, manually double-checked
Can be used via the user interface or downloaded (500,000 words)
User interface: regular expressions can be used
Let's search for an impersonal verb and the postposition poolt
www.keeleveeb.ee
The whole Koondkorpus (Reference Corpus), 245 million words
Tagged automatically with a statistical HMM trigram-based tagger
Several systematic errors in the annotation; we are fixing them gradually

14
Corpus query system at www.keeleveeb.ee

The corpus can be queried for a word-form, a lemma, a grammatical category or a combination of those
A combination can be adjacency, co-occurrence within a sentence or co-occurrence within a clause
One can ask for occurrences of a word/lemma that don't co-occur with another word/lemma/grammatical category
Some examples (based on the fiction subcorpus):
Can the noun kala really be used with the sid-ending in the partitive plural?
Can the word-form plehku be used without the verbs pistma and panema?
Let's again search for a clause containing an impersonal verb and the postposition poolt
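
The query types described above (co-occurrence within a unit, and occurrence without another item) can be sketched over a tagged corpus as follows. Everything here is an assumption made for illustration: the Token fields, the tag labels, the helper names has, cooccur and without, and the two hand-tagged toy sentences. It is not how www.keeleveeb.ee is implemented.

```python
from typing import NamedTuple

class Token(NamedTuple):
    form: str
    lemma: str
    tags: str  # space-separated labels, e.g. "V impf imps" (illustrative only)

def has(unit, **want):
    """True if some token in the unit (sentence or clause) meets every constraint."""
    def ok(tok):
        return all(
            (value in tok.tags.split()) if field == "tags"
            else (getattr(tok, field) == value)
            for field, value in want.items()
        )
    return any(ok(tok) for tok in unit)

def cooccur(unit, first, second):
    """Both constraint sets are satisfied somewhere in the same unit."""
    return has(unit, **first) and has(unit, **second)

def without(unit, target, excluded):
    """The target occurs and none of the excluded constraint sets does."""
    return has(unit, **target) and not any(has(unit, **e) for e in excluded)

# Hand-tagged toy data, invented for this sketch.
corpus = [
    [Token("Poiss", "poiss", "S sg n"), Token("pistis", "pistma", "V indic impf"),
     Token("plehku", "plehku", "D")],
    [Token("Maja", "maja", "S sg n"), Token("ehitati", "ehitama", "V impf imps"),
     Token("venna", "vend", "S sg g"), Token("poolt", "poolt", "K")],
]

impersonal_with_poolt = [s for s in corpus
                         if cooccur(s, {"tags": "imps"}, {"lemma": "poolt"})]
plehku_alone = [s for s in corpus
                if without(s, {"form": "plehku"},
                           [{"lemma": "pistma"}, {"lemma": "panema"}])]
```

Passing clauses instead of sentences as units gives the clause-level variant of the same queries.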

15
Collocations

Definitions
In computational linguistics: a collocation in the statistical sense is the occurrence of words (or word-forms) in each other's vicinity more often than would be expected from their own frequencies, on the assumption that words in general occur in a text at random.
In linguistics, collocations are also understood more narrowly: they are combinations of words frequently used together that do not fall under the definition of an idiom or of a particle verb.
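
One common way to make "more often than expected" precise is to compare observed and expected pair frequencies. The slide gives no formula, so the notation is an assumption: f(w) is the corpus frequency of word w, N the number of tokens in the corpus, O(w_1, w_2) the observed number of times the two words occur next to each other, and E(w_1, w_2) the number expected if words were placed at random.

```latex
E(w_1, w_2) = \frac{f(w_1)\, f(w_2)}{N},
\qquad
\text{collocation candidate if } O(w_1, w_2) \gg E(w_1, w_2)
```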

16
Collocations 2

A problem when learning or using foreign languages:
make a decision vs take a decision ??
have a drink vs have an eat ??
tähelepanu pöörama/osutama/suunama/keerama/panema ?? ("to pay attention": literally to turn / point / direct / twist / put attention)
which is why they are needed by, e.g., lexicographers

17
Collocations

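Collocations can be further divided based on semantic compositionality:
1) Idioms (in the grammatical description of Estonian, idiomatic verb combinations are called väljendverbid)
Laseme jalga ("let's run off"), löövad lokku ("they sound the alarm")
Miski seisab savist jalgadel ("something stands on feet of clay")
2) Non-idioms, collocations in the linguistic sense
kange kohv ("strong coffee"), tähelepanu pöörama ("to pay attention"), kartuleid võtma ("to lift potatoes"), marju korjama ("to pick berries")
Of course, borderline cases exist: laulu lööma ("to strike up a song"), tünga tegema ("to play a trick on someone")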

18
Collocations

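Collocations can be further divided based on their syntactic structure, e.g.
1) Noun phrase
kange kohv (cf. English strong coffee), ere näide ("vivid example"), helge tulevik ("bright future") (so-called fixed epithets)
2) Verb + complement
3) Particle verb
pöörasin viimaks tähelepanu asjaolule, et ("I finally paid attention to the fact that")
laseme siit kähku jalga ("let's get out of here quickly"), ajalehed löövad lokku ("the newspapers are sounding the alarm")
autod põrkasid kokku ("the cars collided"), sõdur jooksis vaenlase poole üle ("the soldier defected to the enemy")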

19
How to extract collocations from a text corpus?

1. Frequency
1.1. Frequent word pairs (or trigrams etc.): kange kohv
1.2. Frequent word pairs (or trigrams etc.) separated by several intervening words: Kass laskis koerakuudi lähedusest kiiresti jalga. ("The cat quickly ran off from near the doghouse.")
But the most frequent word pairs in Estonian text are ei ole ("is not"), see on ("this is"), ta on ("he/she is"), etc.
2. In addition to the frequency of the word combination, the frequencies of the words outside the combination could also be taken into account -> co-occurrence statistics
http://purl.org/stefan.evert/PUB/Evert2007HSK_extended_manuscript.pdf
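
The difference between raw frequency (point 1) and a co-occurrence statistic (point 2) can be shown on a toy example. The sketch below uses pointwise mutual information, which is only one of many association measures and is chosen here as an assumption, not because the slides name it; the token list is invented.

```python
import math
from collections import Counter

def pmi_scores(tokens):
    """Pointwise mutual information for every adjacent word pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Count expected if the two words occurred independently of each other.
        expected = unigrams[w1] * unigrams[w2] / n
        scores[(w1, w2)] = math.log2(observed / expected)
    return scores

tokens = "see on kange kohv see on hea raamat see on".split()
pair_counts = Counter(zip(tokens, tokens[1:]))
scores = pmi_scores(tokens)

print(pair_counts.most_common(1))   # the most frequent pair
print(max(scores, key=scores.get))  # the pair with the highest score
```

In this toy text see on is the most frequent pair, yet kange kohv gets a higher score, because see and on are frequent words in their own right. Pointwise mutual information is known to overrate very rare pairs, so in practice it is usually combined with a frequency threshold or replaced by another measure.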

20
How to extract collocations from a text corpus? 2

In practice
Triibuline kass lasi koerakuudi juurest ruttu jalga. ("The striped cat quickly ran off from the doghouse.")
1. Word pairs
Variant A: adjacent pairs
triibuline kass, kass lasi, lasi koerakuudi, etc.
Variant B: words can be separated by up to n words, e.g. n = 3
triibuline kass, triibuline lasi, triibuline koerakuudi, triibuline juurest, etc.
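
Variants A and B can be sketched as two small functions. The function names are assumptions, and the sentence is lowercased and stripped of punctuation by hand; "separated by up to n words" is read as at most n intervening words, which matches the pairs listed on the slide.

```python
def adjacent_pairs(tokens):
    """Variant A: every word paired with its immediate right neighbour."""
    return list(zip(tokens, tokens[1:]))

def window_pairs(tokens, n):
    """Variant B: pairs whose members have at most n words between them."""
    pairs = []
    for i, left in enumerate(tokens):
        # n intervening words means the partner is at most n + 1 positions away.
        for right in tokens[i + 1:i + n + 2]:
            pairs.append((left, right))
    return pairs

sentence = "triibuline kass lasi koerakuudi juurest ruttu jalga".split()

print(adjacent_pairs(sentence)[:3])
print(window_pairs(sentence, 3)[:4])
```

With n = 0 the second function reduces to the first, so Variant A is a special case of Variant B.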

21
How to extract collocations from a text corpus? 2

Variant C: candidate pairs are formed by combining all words in a clause, if information about the clause boundaries is available
A whole sentence of written language is too long a context:
Triibuline kass lasi koerakuudi juurest ruttu jalga, aga suur koer ajas teda haukudes taga. ("The striped cat quickly ran off from the doghouse, but a big dog chased it, barking.")
Word pairs or lemma pairs?
Morphology-based filtering: e.g. only verb-particle combinations are considered
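
Variant C with lemma pairs and a morphological filter might look as follows. The lemmas and part-of-speech codes below were assigned by hand for this sketch and may not match the corpus annotation; the function name clause_pairs and the choice of verb + adverb (V, D) as a stand-in for "verb-particle" are assumptions.

```python
from itertools import combinations

def clause_pairs(clause, keep=None):
    """Lemma pairs within one clause, optionally restricted to a POS pair."""
    pairs = []
    for (lemma1, pos1), (lemma2, pos2) in combinations(clause, 2):
        if keep is None or {pos1, pos2} == keep:
            pairs.append((lemma1, lemma2))
    return pairs

# The example sentence, split into its two clauses; (lemma, POS) per word.
clause1 = [("triibuline", "A"), ("kass", "S"), ("laskma", "V"),
           ("koerakuut", "S"), ("juurest", "K"), ("ruttu", "D"), ("jalg", "S")]
clause2 = [("aga", "J"), ("suur", "A"), ("koer", "S"), ("ajama", "V"),
           ("tema", "P"), ("haukuma", "V"), ("taga", "D")]

verb_adverb = {"V", "D"}
candidates = (clause_pairs(clause1, verb_adverb)
              + clause_pairs(clause2, verb_adverb))
print(candidates)
```

Because pairs are formed inside each clause only, a cross-clause pair such as laskma + taga never becomes a candidate. The surviving candidates would still need to be ranked by a co-occurrence statistic over the whole corpus.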

22
Collocation extraction tool

www.rabauti.ee/clc
Examples: find verbs that collocate with the word eile ("yesterday")
Find verbs that collocate with the verb võima ("can, may")
Find verbs that collocate with the adverb üle ("over")
Find words that collocate with the word plehku

23
Corpus-based frequency lists

Based on a 1 million word corpus: http://www.cl.ut.ee/ressursid/sagedused/
Based on a 15 million word corpus: http://www.cl.ut.ee/ressursid/sagedused1/

15 million words                  1 million words
(fiction, newspapers, science)    (fiction, newspapers)

639802 olema                      44904 olema
403162 ja                         27232 ja
263713 see                        21850 tema
233960 tema                       18441 see
180248 mina                       14011 mina
172779 ei                         13813 ei
171666 et                         12318 et
127728 kui                        8600 kui
124942 mis                        8230 mis
100058 ka                         6194 ka
78156 saama                       5894 saama
76123 ning                        5738 oma
66694 oma                         5276 aga
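
A lemma frequency list of this kind is a simple count, and counts from corpora of different sizes become comparable once they are normalised. The sketch below is an illustration under assumptions: the function names are hypothetical, the lemma sequence is a toy example, and the corpus sizes are taken as the nominal 15 million and 1 million words.

```python
from collections import Counter

def frequency_list(lemmas):
    """Lemmas with their counts, most frequent first."""
    return Counter(lemmas).most_common()

def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size

# Toy lemma sequence, standing in for a lemmatised corpus.
toy = frequency_list(["olema", "ja", "olema", "see", "ja", "olema"])

# The counts of olema from the two lists above, on a common scale.
olema_15m = per_million(639802, 15_000_000)
olema_1m = per_million(44904, 1_000_000)
print(round(olema_15m), round(olema_1m))
```

On this scale the two counts of olema are close to each other (about 42,700 and 44,900 per million), even though the raw numbers differ by an order of magnitude.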

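Estonian language technology projects, including those whose results were discussed here, are funded by Eesti Keeletehnoloogia Sihtprogramm (the Estonian language technology programme): www.keeletehnoloogia.ee
In the future, Estonian language resources will be managed and distributed by Eesti Keeleressursside Keskus (the Estonian language resources centre): http://www.keeleressursid.ee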
