Migration of Intex resources towards NooJ - the case of Serbian

The 8th INTEX/NooJ Workshop, Besançon, May 30-June 2, 2005.

1
Migration of Intex resources towards NooJ - the case of Serbian
Faculty of Mining and Geology, Belgrade
Ranka Stanković, assistant, ranka_at_rgf.bg.ac.yu
Ivan Obradović, professor, ivano_at_rgf.bg.ac.yu
Faculty of Philology, Belgrade
Cvetana Krstev, professor, cvetana_at_matf.bg.ac.yu
Faculty of Mathematics, Belgrade
Duško Vitas, professor, vitas_at_matf.bg.ac.yu
Gordana Pavlović-Lažetić, professor, gordana_at_matf.bg.ac.yu
2
CONTENTS
  • Overview of Intex resources for Serbian
  • Specific features of Serbian
  • The approach to Intex -> NooJ migration of
    lexical resources
  • ConvertIN: Convert Intex to NooJ (migration
    software scripts)
  • Migration results and open questions

3
Intex dictionaries for Serbian
  • Dictionary of simple words: 73000 lemmas (DELAS
    entries) and more than a million word forms
    (DELAF entries)
  • Dictionaries of proper names: 21000 lemmas, or
    145000 forms
  • Dictionary of compound words (compound nouns,
    prepositions, conjunctions and adverbs, compound
    toponyms and proper names)
  • Auxiliary dictionaries (special purpose filter
    dictionaries and auxiliary dictionaries for the
    processing of particular texts)

4
Intex transducers for Serbian
  • Transducers for the description of inflectional
    classes, used for generating DELAF from DELAS
    dictionaries. This is the largest group of
    transducers: 333 for nouns, 60 for adjectives and
    344 for verbs
  • Transducers for derivation in Serbian. Second
    largest group of transducers: 40
  • Transducers for the identification of specific
    forms, such as acronyms with their appropriate
    inflection and derivation (e.g. OEBS, OEBS-a,
    OEBS-ov): 66
  • Transducers for disambiguation: 41

5
Specific features of Serbian
  • š, ш → sx
  • đ, ђ → dx
  • č, ч → cy
  • ć, ћ → cx
  • ž, ж → zx
  • nj, њ → nx
  • lj, љ → lx
  • dž, џ → dy
  • Use of two alphabets
  • Official Cyrillic alphabet
  • Serbian Latin alphabet (also widely used)
  • Absence of a unique transliteration procedure in
    any of the standard coding schemas
  • Rich morphological system
  • Reflected at both the inflectional and the
    derivational level

6
The approach to migration
  • Development resources are kept in transliterated
    Latin alphabet (with the adopted transliteration
    scheme)
  • Automatic production of both Serbian Latin and
    Cyrillic Unicode versions of the resources, which
    can then be used either separately or jointly
  • Application of the procedure to existing DELAF
    dictionaries will make the translation of
    numerous transducers for inflection unnecessary
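
The automatic production of the Serbian Latin and Cyrillic versions from the transliterated resources can be sketched as follows. This is a minimal illustration, not the ConvertIN code: the function name `transliterate` and the lowercase-only tables are assumptions made for the example, and the digraph scheme (sx, dx, cy, cx, zx, nx, lx, dy) is taken to be the adopted transliteration scheme.

```python
# Digraphs of the transliterated (ASCII) encoding and their Serbian Latin
# and Cyrillic counterparts. Lowercase only, to keep the sketch short.
TO_LATIN = {"sx": "š", "dx": "đ", "cy": "č", "cx": "ć",
            "zx": "ž", "nx": "nj", "lx": "lj", "dy": "dž"}
TO_CYRILLIC = {"sx": "ш", "dx": "ђ", "cy": "ч", "cx": "ћ",
               "zx": "ж", "nx": "њ", "lx": "љ", "dy": "џ"}
# The remaining single letters map one-to-one onto Cyrillic.
TO_CYRILLIC.update(zip("abcdefghijklmnoprstuvz", "абцдефгхијклмнопрстувз"))


def transliterate(word, table):
    """Rewrite `word` left to right, trying two-letter codes first."""
    out = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in table:
            out.append(table[pair])
            i += 2
        else:
            # Characters without a mapping are copied unchanged.
            out.append(table.get(word[i], word[i]))
            i += 1
    return "".join(out)


print(transliterate("cyudovisxte", TO_LATIN))     # čudovište
print(transliterate("cyudovisxte", TO_CYRILLIC))  # чудовиште
```

Because x and y occur only inside the two-letter codes, a single left-to-right pass is unambiguous in this direction; going back from Serbian Latin to the transliterated form would need extra care for nj, lj and dž.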

7
DELAS transformation
Intex DELAS entry:
ponesxto,PRO13+Indef+ProN
NooJ:
ponesxto,ponesxto,PRO+FLX=PRO13+Indef+ProN
ponešto,ponesxto,PRO+FLX=PRO13_lat+Indef+ProN
понешто,ponesxto,PRO+FLX=PRO13_cir+Indef+ProN
delas-im.dic -> ascdelas-im.dic, latdelas-im.dic, cirdelas-im.dic
delas-gl.dic -> ascdelas-gl.dic, latdelas-gl.dic, cirdelas-gl.dic
....
  • š, ш → sx
  • đ, ђ → dx
  • č, ч → cy
  • ć, ћ → cx
  • ž, ж → zx
  • nj, њ → nx
  • lj, љ → lx
  • dž, џ → dy

Different inflectional classes
8
Options for transforming DELAF (1)
  • Use the same lemma (transliterated option)
  • cyudovisxta,cyudovisxte.N+Hum:ns2v:np1v:np2v:np4v:np5v
  • čudovišta,cyudovisxte.N:ns2v:np1v:np2v:np4v:np5v
  • чудовишта,cyudovisxte.N:ns2v:np1v:np2v:np4v:np5v

cyudovisxta,cyudovisxte,N+Hum+ns2v
cyudovisxta,cyudovisxte,N+np1v
cyudovisxta,cyudovisxte,N+np2v
cyudovisxta,cyudovisxte,N+np4v
cyudovisxta,cyudovisxte,N+np5v
чудовишта,cyudovisxte,N+ns2v
чудовишта,cyudovisxte,N+np1v
чудовишта,cyudovisxte,N+np2v
чудовишта,cyudovisxte,N+np4v
чудовишта,cyudovisxte,N+np5v
čudovišta,cyudovisxte,N+ns2v
čudovišta,cyudovisxte,N+np1v
čudovišta,cyudovisxte,N+np2v
čudovišta,cyudovisxte,N+np4v
čudovišta,cyudovisxte,N+np5v
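
The expansion on this slide, from one compact DELAF line to one line per inflectional code, might look like the sketch below. It assumes the usual DELAF layout `form,lemma.CODE+Feature:code1:code2` and joins category and code with `+` in the output; both separators, the helper name `expand_delaf` and its `form`/`lemma` override parameters are assumptions made for illustration.

```python
def expand_delaf(line, form=None, lemma=None):
    """Split one DELAF line into one entry per inflectional code.

    `form` and `lemma` optionally replace the surface form and the lemma,
    for example with their Serbian Latin or Cyrillic versions.
    """
    head, _, info = line.partition(".")        # "form,lemma" / "N+Hum:ns2v:..."
    src_form, _, src_lemma = head.partition(",")
    category, *codes = info.split(":")         # "N+Hum", ["ns2v", "np1v", ...]
    form = form or src_form
    lemma = lemma or src_lemma
    return [f"{form},{lemma},{category}+{code}" for code in codes]


line = "cyudovisxta,cyudovisxte.N+Hum:ns2v:np1v:np2v:np4v:np5v"

# Same (transliterated) lemma, Cyrillic surface form.
for entry in expand_delaf(line, form="чудовишта"):
    print(entry)

# Different lemmas: both form and lemma in Cyrillic.
for entry in expand_delaf(line, form="чудовишта", lemma="чудовиште"):
    print(entry)
```

Keeping the transliterated lemma preserves the link between the three versions of an entry, while overriding the lemma produces independent dictionaries per alphabet.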
9
Options for transforming DELAF (2)
  • Use different lemmas
  • cyudovisxta,cyudovisxte.N+Hum:ns2v:np1v:np2v:np4v:np5v
  • čudovišta,cyudovisxte.N:ns2v:np1v:np2v:np4v:np5v
  • чудовишта,cyudovisxte.N:ns2v:np1v:np2v:np4v:np5v

cyudovisxta,cyudovisxte,N+Hum+ns2v
cyudovisxta,cyudovisxte,N+np1v
cyudovisxta,cyudovisxte,N+np2v
cyudovisxta,cyudovisxte,N+np4v
cyudovisxta,cyudovisxte,N+np5v
чудовишта,чудовиште,N+ns2v
чудовишта,чудовиште,N+np1v
чудовишта,чудовиште,N+np2v
чудовишта,чудовиште,N+np4v
чудовишта,чудовиште,N+np5v
čudovišta,čudovište,N+ns2v
čudovišta,čudovište,N+np1v
čudovišta,čudovište,N+np2v
čudovišta,čudovište,N+np4v
čudovišta,čudovište,N+np5v
10
ConvertIN - Overview
Delas, Delaf Graph,Text Inflexion?
11
ConvertIN: a glimpse at the code
"<E>" 28 168 1 16
"" 760 144 0
"-" 236 168 1 13
"godisxnxice" 380 64 1 7
"godisxnxicu" 384 160 1 8
"godisxnxicom" 384 212 1 9
"godisxnxica" 404 288 1 10
"<E>/12godisxnxice,12godisxnxica.NCf2s" 516 64 1 1
"<E>/12godisxnxicu,12godisxnxica.NCf4s" 529 160 1 1
"<E>/12godisxnxicom,12godisxnxica.NCf6s" 524 212 1 1
"<E>/12godisxnxica,12godisxnxica.NCf1s" 524 288 1 1
"BrojCifre" 88 168 1 15
"(2" 192 168 1 2
")" 276 168 5 3 4 5 6 14
"godisxnxici" 382 108 1 17
")" 164 168 1 12
"(1" 68 165 1 11
"<E>/12godisxnxici,12godisxnxica.NCf3sf7s" 520 108 1 1
  • perl script
  • C code

12
ConvertIN: a glimpse at the code
"<E>" 28 168 1 16
"" 760 144 0
"-" 236 168 1 13
"годишњице" 380 64 1 7
"годишњицу" 384 160 1 8
"годишњицом" 384 212 1 9
"годишњица" 404 288 1 10
"<E>/12годишњице,12godisxnxica.NCf2s" 516 64 1 1
"<E>/12годишњицу,12godisxnxica.NCf4s" 529 160 1 1
"<E>/12годишњицом,12godisxnxica.NCf6s" 524 212 1 1
"<E>/12годишњица,12godisxnxica.NCf1s" 524 288 1 1
"BrojCifre" 88 168 1 15
"(2" 192 168 1 2
")" 276 168 5 3 4 5 6 14
"годишњици" 382 108 1 17
")" 164 168 1 12
"(1" 68 165 1 11
"<E>/12годишњици,12godisxnxica.NCf3sf7s" 520 108 1 1
NCf2s NCf4s NCf6s NCf1s
13
ConvertIN: DELAS editor
PRO_Distribution Cr Demon Ek Gen Ijk Indef Int Neg Pos ProA ProN Prs Ref Rel Sr
Delas editor
_properties.def
14
ConvertIN: properties
_properties.def
15
Results of dictionary conversion
Small dictionaries (all in one)
Bigger dictionaries (one each)
16
Conversion of Cyrillic dictionaries
  • Small Cyrillic dictionaries were converted
    successfully with the transliterated Latin lemma
  • Large Cyrillic dictionaries could not be
    converted with the transliterated Latin lemma
  • чудовишта,cyudovisxte,N+ns2v
  • чудовишта,cyudovisxte,N+np1v
  • ....
  • but could be converted with the Cyrillic lemma
  • чудовишта,чудовиште,N+ns2v
  • чудовишта,чудовиште,N+np1v
  • ....
  • The problem of different lemmas for Latin and
    Cyrillic: loss of connection

17
Results of graph conversion
18
Results of graph conversion
19
Lexical analysis: a comparison
  • Results of a lexical analysis using the original
    Intex resources and the resources converted to
    NooJ format, on the 13KW ebit2002-bez test corpus
  • Intex:
  • 900 unknown words
  • NooJ (from Intex):
  • 1140 unknown words for transliterated Latin
  • 1034 unknown words for Serbian Latin
  • 10000 unknown words for Cyrillic (V,N,A)

20
Lexical analysis: Cyrillic lemma
  • All dictionaries with Cyrillic lemma
  • All graphs with Cyrillic lemma

21
Lexical analysis: Cyrillic lemma
22
Morphology: an open question
  • <E>/msNNgmsAAqmsVVg
  • 2na/fsNNgfsVVgnpNNgnpAAgnpVVg
  • 2ne/fsGGgmpAAgfpNNgfpAAgfpVVg
  • 2ni/mpNNgmpVVg
  • 2nih/mpGGgfpGGgnpGGg
  • 2nim/msIIgnsIIgmpXXgmpIIgmpWWgfpXXgfpIIgfp
    WWgnpXXgnpIIgnpWWg
  • 2nima/mpXXgmpIIgmpWWgfpXXgfpIIgfpWWgnpXXgn
    pIIgnpWWg
  • 2no/nsNNgnsAAgnsVVg
  • 2nog/msGGgmsAAvnsGGg
  • 2noga/msGGgmsAAvnsGGg
  • 2noj/fsXXgfsWWg
  • 2nom/msXXgmsWWgnsXXgnsWWgfsIIg
  • 2nome/msXXgmsWWgnsXXgnsWWg
  • 2nomu/msXXgmsWWgnsXXgnsWWg
  • 2nu/fsAAg

NUM01.exp and NooJ equivalent
NUM01
<E>/msNNg <E>/msAAq <E>/msVVg
<B2>na(<E>/fsNNg <E>/fsVVg <E>/npNNg <E>/npAAg <E>/npVVg)
<B2>ne(<E>/fsGGg <E>/mpAAg <E>/fpNNg <E>/fpAAg <E>/fpVVg)
<B2>ni(<E>/mpNNg <E>/mpVVg)
<B2>nih(<E>/mpGGg <E>/fpGGg <E>/npGGg)
<B2>nim(<E>/msIIg <E>/nsIIg <E>/mpXXg <E>/mpIIg <E>/mpWWg <E>/fpXXg <E>/fpIIg <E>/fpWWg <E>/npXXg <E>/npIIg <E>/npWWg)
<B2>nima(<E>/mpXXg <E>/mpIIg <E>/mpWWg <E>/fpXXg <E>/fpIIg <E>/fpWWg <E>/npXXg <E>/npIIg <E>/npWWg)
<B2>no(<E>/nSNNg <E>/nSAAg <E>/nSVVg)
<B2>nog(<E>/msGGg <E>/msAAv <E>/nsGGg)
<B2>noga(<E>/msGGg <E>/msAAv <E>/nsGGg)
<B2>noj(<E>/fsXXg <E>/fsWWg)
<B2>nom(<E>/msXXg <E>/msWWg <E>/nsXXg <E>/nsWWg <E>/fsIIg)
<B2>nome(<E>/msXXg <E>/msWWg <E>/nsXXg <E>/nsWWg)
<B2>nomu(<E>/msXXg <E>/msWWg <E>/nsXXg <E>/nsWWg)
<B2>nu/fsAAg
23
Morphology: an open question
ASC: N7.exp for bubanx, pirinacy (with fleeting a)
  • N7 <E>/ms1q <E>/ms4q <L2><B><R2>a/ms2q
    <L2><B><R2>u/(<E>/ms3q <E>/ms7q)
    <L2><B><R2>e/(<E>/ms5q <E>/mp4q)
    <L2><B><R2>em/ms6q a/mp2q
    <L2><B><R2>ima/(<E>/mp3q <E>/mp6q <E>/mp7q)
  • N3 <E>/ms1q <E>/ms4q <L><B><R>a/ms2q
    <L><B><R>u/(<E>/ms3q <E>/ms7q)
    <L><B><R>e/(<E>/ms5q <E>/mp4q)
    <L><B><R>em/ms6q a/mp2q
    <L><B><R>ima/(<E>/mp3q <E>/mp6q <E>/mp7q)

LAT: bubanj N7, pirinač N3
CIR: бубањ N7, пиринач N3 (ALL NEW)
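
Why one transliterated class splits into several classes for the other alphabets can be seen by running the commands by hand. The sketch below is a simplified reading of the NooJ generic commands used on this slide (cursor moves, backspace, literal text typed at the cursor); the function `apply_nooj` and its exact cursor semantics are assumptions for illustration, not NooJ's implementation.

```python
import re


def apply_nooj(lemma, command):
    """Apply a NooJ-style inflection command to `lemma`.

    Supported, in simplified form: <E> (empty string, no change),
    <B>/<Bn> (backspace), <L>/<Ln> and <R>/<Rn> (move the cursor).
    Any other text is typed at the cursor, which starts at the end.
    """
    chars = list(lemma)
    cursor = len(chars)
    for op, count, text in re.findall(r"<([A-Z])(\d*)>|([^<]+)", command):
        n = int(count) if count else 1
        if text:                                  # literal suffix characters
            chars[cursor:cursor] = text
            cursor += len(text)
        elif op == "B":                           # delete n chars before the cursor
            start = max(cursor - n, 0)
            del chars[start:cursor]
            cursor = start
        elif op == "L":
            cursor = max(cursor - n, 0)
        elif op == "R":
            cursor = min(cursor + n, len(chars))
        elif op != "E":
            raise ValueError(f"unsupported command <{op}{count}>")
    return "".join(chars)


# Transliterated: nx and cy are both two characters, so N7 fits both nouns.
print(apply_nooj("bubanx", "<L2><B><R2>a"))    # bubnxa
print(apply_nooj("pirinacy", "<L2><B><R2>a"))  # pirincya
# Serbian Latin: nj is still two characters, but č is one, hence N3.
print(apply_nooj("bubanj", "<L2><B><R2>a"))    # bubnja
print(apply_nooj("pirinač", "<L><B><R>a"))     # pirinča
```

The number of characters to skip before deleting the fleeting a depends on how the final consonant is written, which is why the same noun may need a different class in each alphabet.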
24
A few more open questions
  • The choice of lemma: unique (transliterated) or
    two/three different ones
  • Differences in lexical analysis compared with Intex
  • Rules for automatic conversion of .exp files

Generic commands:
<B> keyboard Backspace
<D> Duplicate current char
<E> Empty string
<L> keyboard Left arrow
<N> go to end of Next word form
<P> go to end of Previous word form
<R> keyboard Right arrow
Stack operators: L R C LR LL RL LC
c: insert character 'c' at the end of the form
L: delete last character, push it onto the stack
R: pop the stack
C: copy the character at the top of the stack to
the end of the form, pop the stack
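
Reading the stack operator descriptions above literally, the Intex side of the conversion can be simulated with a few lines. This is only a sketch: the function `apply_stack_ops`, the operator string `LLLRCCa` and the restriction to lowercase suffix characters are assumptions for illustration, not taken from the actual .exp files.

```python
def apply_stack_ops(lemma, ops):
    """Run a string of stack operators over `lemma`.

    L: delete the last character and push it onto the stack
    R: pop the stack (the character is discarded)
    C: copy the top of the stack to the end of the form, then pop
    any other character: append it to the end of the form
    """
    form = list(lemma)
    stack = []
    for op in ops:
        if op == "L":
            stack.append(form.pop())
        elif op == "R":
            stack.pop()
        elif op == "C":
            form.append(stack.pop())
        else:
            form.append(op)
    return "".join(form)


# Hypothetical operator string for the fleeting a: set aside "x", "n", "a",
# drop the "a", restore "n" and "x", then add the ending "a".
print(apply_stack_ops("bubanx", "LLLRCCa"))  # bubnxa
```

Under this reading, a rule for automatic conversion has to translate sequences such as `LLLRCC` into cursor commands like `<L2><B><R2>`, which is one of the open questions listed on this slide.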
25
Conclusion
  • Further work will include procedures for
    automatic import and export between resources in
    Intex/NooJ format and resources in MULTEXT-east
    and other currently accepted formats (MAF, LMF)
  • The process described in this paper has proven
    beneficial for both kinds of resources.
  • The tool is language independent
  • However, we should point out some problems in its
    application. First of all, the sizes of SMD.