Procedures in building Croatian-English parallel corpus


1
Procedures in building Croatian-English parallel
corpus
  • Marko Tadić (marko.tadic_at_ffzg.hr)
  • Filozofski fakultet Sveučilišta u Zagrebu, Zavod
    za lingvistiku (http://www.ffzg.hr/zzl/zzl-home.htm)
  • 4th TELRI seminar, Bratislava, 1999-11-05

2
Parallel corpora
  • multilingual language research
  • lexicography
  • contrastive linguistics
  • MT
  • ...
  • parallel corpora: essential importance
  • role of English as lingua communis
  • common language pairs: en → Lx, Lx → en

3
Croatian-English parallel corpora
  • 1st English-Croatian pairing
  • Rudolf Filipović, 1968-1971
  • Yugoslav Serbo-Croatian-English Contrastive
    Project
  • Brown corpus cut in half (505,822 tokens)
  • preserving original genre balance
  • morphosyntactically marked
  • translated
  • concordance with morphosyntactic categories as
    keywords
  • bilingual sentence database
  • 1st use of computers in contrastive
    linguistics?
  • tapes with data still archived at the Institute of
    Linguistics in Zagreb, but there is no computer
    system that can read them
  • project publications: Contrastive Studies, New
    Contrastive Studies, Chapters in Contrastive
    Linguistics

4
Croatian-English parallel corpora 2
  • 2nd Croatian-English pair
  • Plato's Republic, TELRI CD-ROM, 1998
  • hr-en not the only pair
  • rather small
  • properly aligned?

5
Croatian-English parallel corpora 3
  • 3rd hr-en parallel corpus
  • being collected at the Institute of Linguistics,
    Philosophical Faculty, Zagreb
  • aims: to test
  • text conversion procedures
  • corpus organization
  • alignment and encoding
  • will be used later in parallel corpora projects
  • Croatian-Slovene parallel corpus
  • approved in July by both Ministries of Science as
    one of 17 bilateral scientific projects in the
    humanities
  • effectively launched last week

6
Corpus collecting
  • representativeness in parallel corpora?
  • demand for parallelism narrows the choice (at
    least for languages with a smaller number of
    speakers and/or translations)
  • we are happy to get any valuable translations
  • result: an unbalanced and non-representative set of
    parallel texts
  • methodologically cleaner approach (possible?)
  • texts from one source
  • Corpus of X
  • no problems with representativeness?

7
Corpus collecting 2
  • source: Croatia Weekly
  • publisher: Croatian Institute for Culture and
    Information (HIKZ)
  • started January 1998
  • like USA Today: different domains
  • politics, economy and finance, tourism, ecology,
    culture, art, events, sports
  • 12 pages, A3
  • prepared in Croatian, then translated by a
    professional translation office
  • availability
  • No. 90 is being prepared now
  • access to all texts in electronic form in both
    languages, except for the first 5 issues

8
Corpus collecting 3
  • size
  • average issue: 15,170 tokens hr, 17,900
    tokens en
  • approx. 1,300,000 tokens hr, 1,520,000
    tokens en
  • methodological disturbance
  • the biggest weekly newspaper, Nacional
  • important source of hr texts for the Croatian
    National Corpus
  • started publishing English translations of approx.
    30% of the Croatian issue on its Web page
  • Ministry of Science and Technology
  • descriptions of all completed scientific projects
    in the Republic of Croatia (RH)
  • on the Web
  • in Croatian and English

9
Making corpus
  • platform
  • NT instead of UNIX
  • all software (commercial, shareware, custom-made)
    runs on Win9x
  • text formats
  • hr texts: plain ASCII, no markup => manual
    marking
  • en texts: DTP files, RTF extraction
  • conversion
  • 2XML: custom-made software we use for the Croatian
    National Corpus
  • input: HTML, RTF
  • output: XML, no header
  • two-step conversion by user-defined scripts
  • enables a high level of automation

10
Making corpus 2
  • Sentence marking
  • script in Search and Replace shareware by Funduc
    Software
  • </S><S> inserted after punctuation followed by a
    capital letter
  • filtered for known exceptions: Mr., Mrs., Miss.,
    dr., St. etc.
  • Tokenizer
  • custom-made
  • XML input
  • output
  • tabbed file
  • XML with <W></W>
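
The sentence-marking rule on this slide (boundary after punctuation followed by a capital letter, minus known abbreviations) can be sketched in Python. This is only an illustration of the rule, not the Search and Replace script itself; the names `EXCEPTIONS`, `BOUNDARY` and `mark_sentences` are hypothetical, and the set of capital letters is an assumption covering English and Croatian.

```python
import re

# Abbreviations from the slide after which a period is not a sentence end.
EXCEPTIONS = {"Mr.", "Mrs.", "Miss.", "dr.", "St."}

# Punctuation, whitespace, then a capital letter (assumed: A-Z plus Croatian capitals).
BOUNDARY = re.compile(r"([.!?])\s+(?=[A-ZČĆĐŠŽ])")


def mark_sentences(text: str) -> str:
    """Wrap each detected sentence of a paragraph in <S>...</S>."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        end = match.end(1)  # position just after the punctuation mark
        last_word = text[start:end].split()[-1]
        if last_word in EXCEPTIONS:
            continue  # known abbreviation: not a real boundary
        sentences.append(text[start:end])
        start = match.end()  # next sentence begins at the capital letter
    if start < len(text):
        sentences.append(text[start:])  # whatever remains is the last sentence
    return "".join(f"<S>{s}</S>" for s in sentences)


print(mark_sentences("Mr. Kinkel arrived. He met dr. Granić. Done."))
```

A rule like this still over-splits on abbreviations missing from the list, which is why the exception list has to be maintained by hand.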

11
Making corpus 3
  • <BODY> CW011199803260101hr 1 X
  • <DIV0 type="MAIN"> CW011199803260101hr 7 X
  • <HEAD type="NA"> CW011199803260101hr 25 X
  • <S> CW011199803260101hr 41 X
  • Predsjednik CW011199803260101hr 44 R
  • Tuđman CW011199803260101hr 56 R
  • primio CW011199803260101hr 63 R
  • Kinkela CW011199803260101hr 70 R
  • , CW011199803260101hr 77 I
  • Vedrinea CW011199803260101hr 79 R
  • i CW011199803260101hr 88 R
  • Primakova CW011199803260101hr 90 R
  • </S> CW011199803260101hr 99 X
  • </HEAD> CW011199803260101hr 103 X
  • <HEAD type="PN"> CW011199803260101hr 110 X
  • <S> CW011199803260101hr 126 X
  • Tuđman CW011199803260101hr 129 R
  • : CW011199803260101hr 135 I
  • Hrvatska CW011199803260101hr 137 R
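
The tabbed output above can be approximated with a short Python sketch. It is not the custom-made tokenizer from the talk. It assumes that each row holds the token, the text ID, a 1-based character offset and a type code, and that the codes mean markup (X), word (R) and punctuation (I); that reading is inferred from the listing. The names `TOKEN` and `tokenize` are hypothetical.

```python
import re

# A token is a markup tag (X), a run of word characters (R),
# or a single other non-space character (I).
TOKEN = re.compile(r"(?P<X><[^>]+>)|(?P<R>\w+)|(?P<I>[^\w\s])")


def tokenize(text: str, text_id: str):
    """Yield (token, text_id, 1-based offset, type code) for each token."""
    for m in TOKEN.finditer(text):
        yield m.group(), text_id, m.start() + 1, m.lastgroup


sample = ('<BODY><DIV0 type="MAIN"><HEAD type="NA"><S>Predsjednik Tuđman '
          'primio Kinkela, Vedrinea i Primakova</S>')
for row in tokenize(sample, "CW011199803260101hr"):
    print("\t".join(str(field) for field in row))
```

Run on the first headline, the sketch yields offsets that agree with the listing (for example `Predsjednik` at 44 and the comma at 77), which suggests the offsets count characters in the tagged text before `<W>` elements are added.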

12
Making corpus 4
  • <BODY><DIV0 type="MAIN"><HEAD type="NA"><S><W type="R">Predsjednik</W>
    <W type="R">Tu&#273;man</W> <W type="R">primio</W>
    <W type="R">Kinkela</W><W type="I">,</W> <W type="R">Vedrinea</W>
    <W type="R">i</W> <W type="R">Primakova</W></S></HEAD>
    <HEAD type="PN"><S><W type="R">Tu&#273;man</W><W type="I">:</W>
    <W type="R">Hrvatska</W> <W type="R">vojno</W><W type="I">,</W>
    <W type="R">gospodarski</W> <W type="R">i</W>
    <W type="R">sigurnosno</W> <W type="R">orijentirana</W>
    <W type="R">na</W> <W type="R">europske</W>
    <W type="R">integracije</W><W type="I">.</W></S>
    <S><W type="R">Ministri</W> <W type="R">Vedrine</W>
    <W type="R">i</W> <W type="R">Kinkel</W>
    <W type="R">uputili</W> <W type="R">zahtjev</W>
    <W type="R">Hrvatskoj</W> <W type="R">da</W>
    <W type="R">izradi</W> <W type="R">konkretan</W>
    <W type="R">plan</W> <W type="R">povratka</W>
    <W type="R">izbjeglica</W><W type="I">.</W></S>
    <S><W type="R">Na</W> <W type="R">sastanku</W>
    <W type="R">s</W> <W type="R">Primakovom</W>
    <W type="R">dogovoren</W> <W type="R">Tu&#273;manov</W>
    <W type="R">posjet</W> <W type="R">Rusiji</W></S></HEAD>
    <P><S><W type="R">Hrvatska</W>
    <W type="R">je</W> <W type="R">spremna</W>
    <W type="R">jam&#269;iti</W> <W type="R">puna</W>
    <W type="R">manjinska</W> <W type="R">prava</W>
    <W type="R">svim</W> <W type="R">Srbima</W><W type="I">,</W>
    <W type="R">gra&#273;anima</W> <W type="R">Republike</W>
    <W type="R">Hrvatske</W><W type="I">,</W> <W type="R">nastaviti</W>
    <W type="R">&#263;e</W> <W type="R">s</W>
    <W type="R">politikom</W> <W type="R">koja</W>
    <W type="R">je</W> <W type="R">dovela</W>
    <W type="R">do</W> <W type="R">mirne</W>
    <W type="R">reintegracije</W> <W type="R">uz</W>
    <W type="R">punu</W> <W type="R">za&#353;titu</W>
    <W type="R">i</W> <W type="R">sigurnost</W>
    <W type="R">srpske</W> <W type="R">manine</W>
    <W type="R">u</W> <W type="R">cijeloj</W>
    <W type="R">Hrvatskoj</W> <W type="R">i</W>
    <W type="R">spremna</W> <W type="R">je</W>
    <W type="R">prihvatiti</W> <W type="R">sve</W>
    <W type="R">izbjeglice</W> <W type="R">iz</W>
    <W type="R">SRJ</W> <W type="R">i</W>
    <W type="R">sve</W> <W type="R">izbjeglice</W>
    <W type="R">koji</W> <W type="R">to</W>
    <W type="R">&#382;ele</W><W type="I">,</W>
    <W type="R">izjavio</W> <W type="R">je</W>
    <W type="R">dr</W> <W type="R">Mate</W>
    <W type="R">Grani&#263;</W> <W type="R">nakon</W>
    <W type="R">sastanka</W> <W type="R">izme&#273;u</W>
    <W type="R">predsjednika</W> <W type="R">Tu&#273;mana</W>
    <W type="R">i</W> <W type="R">ministara</W>
    <W type="R">vanjskih</W> <W type="R">poslova</W>
    <W type="R">Njema&#269;ke</W> <W type="R">i</W>
    <W type="R">Francuske</W> <W type="R">Klausa</W>
    <W type="R">Kinkela</W> <W type="R">i</W>
    <W type="R">Huberta</W> <W type="R">Vedrinea</W><W type="I">,</W>
    <W type="R">&#353;to</W> <W type="R">je</W>
    <W type="R">u</W> <W type="R">Predsjedni&#269;kim</W>
    <W type="R">dvorima</W><W type="I">.</W></S></P>
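
Once tokens are wrapped in W elements with a type attribute, standard XML tooling can read them back. The following Python sketch uses the standard library's `xml.etree.ElementTree` on a small well-formed fragment copied from the sample above. The helper names `count_tokens` and `words` are hypothetical, and treating R as "word" and I as "punctuation" is an assumption based on the sample.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# A well-formed fragment in the encoding shown on the slide.
SAMPLE = ('<P><S><W type="R">gra&#273;anima</W> <W type="R">Republike</W> '
          '<W type="R">Hrvatske</W><W type="I">,</W></S></P>')


def count_tokens(xml_text: str) -> Counter:
    """Count W elements per value of their type attribute."""
    root = ET.fromstring(xml_text)
    return Counter(w.get("type") for w in root.iter("W"))


def words(xml_text: str) -> list:
    """Return the text of all W elements typed R, in document order."""
    root = ET.fromstring(xml_text)
    return [w.text for w in root.iter("W") if w.get("type") == "R"]


print(count_tokens(SAMPLE))
print(words(SAMPLE))
```

The parser resolves numeric character references such as `&#273;` on its own, so the word list comes back with the proper Croatian letters.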

13
Alignment
  • testing stage
  • demo of Atril's Déjà Vu translation memory
    database, v2.3.82
  • aligning module
  • works fine for 1:1 alignments
  • manual work for 2:1, 3:1, 1:2, 1:3
  • export to TMX format
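
For orientation, a TMX file stores each aligned pair as a translation unit (`tu`) with one `tuv` per language. The Python sketch below builds such a file with the standard library. It is not what Déjà Vu produces; it assumes TMX 1.4 conventions (`xml:lang` on `tuv`), the header values are placeholders, the function name `to_tmx` is hypothetical, and the English sentence is an illustrative translation written for this example, not taken from the corpus.

```python
import xml.etree.ElementTree as ET

# ElementTree serializes this namespaced attribute as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def to_tmx(pairs, src="hr", tgt="en") -> str:
    """Serialize (source, target) sentence pairs as a minimal TMX document."""
    tmx = ET.Element("tmx", version="1.4")
    ET.SubElement(tmx, "header", {
        "creationtool": "sketch", "creationtoolversion": "0.1",  # placeholders
        "segtype": "sentence", "o-tmf": "none", "adminlang": "en",
        "srclang": src, "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for source, target in pairs:
        tu = ET.SubElement(body, "tu")  # one translation unit per aligned pair
        for lang, text in ((src, source), (tgt, target)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")


pairs = [(
    "Ministri Vedrine i Kinkel uputili zahtjev Hrvatskoj da izradi "
    "konkretan plan povratka izbjeglica.",
    # Illustrative translation, not from the corpus.
    "Ministers Vedrine and Kinkel asked Croatia to draw up a concrete "
    "plan for the return of refugees.",
)]
print(to_tmx(pairs))
```

In this layout a 2:1 or 1:2 alignment would have to put two sentences into a single `seg`, which is one reason such cases need manual handling.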

14
Alignment 2

15
Alignment 3

16
Alignment 4
  • encoding problem: how to store alignments?
  • several ways to do it now
  • CES with pointers to IDs in a 3rd file
  • translation memory (Translation Units as aligned
    pairs)
  • since we are in XML => PLUG project DTD
    (Tiedemann 1998)
  • si-en parallel corpus (Erjavec 1999): SGML,
    TEI <BODY> modified to hold TUs. But all upper- and
    lower-level encoding (<DIV>s, <HEAD>s, <HI
    rend>) is lost. Is there a way to retain it?
  • TEI XML DTD, Nancy, July 1999. Interpretation of
    the TEI DTD? Would that DTD prefer alignment by IDs
    and pointers?
  • Is the SGML/XML decision really a problem for us?
    To the same <BODY> element we can attach
    different headers, convert character entities and
    have SGML instead of XML?
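
The "pointers to IDs in a 3rd file" option can be illustrated with a stand-off sketch in Python: both monolingual files keep all their structural markup, and a separate link file only lists which sentence IDs belong together. The element and attribute names (`linkGrp`, `link`, `xtargets`) are loosely modelled on CES-style alignment documents and should be checked against the actual DTD; the sentence IDs and function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_links(links, hr_doc: str, en_doc: str) -> ET.Element:
    """Build a stand-off link group from (hr_ids, en_ids) pairs."""
    group = ET.Element("linkGrp", {"fromDoc": hr_doc, "toDoc": en_doc})
    for hr_ids, en_ids in links:
        # IDs on each side are space-separated; ';' separates the two sides.
        targets = " ".join(hr_ids) + " ; " + " ".join(en_ids)
        ET.SubElement(group, "link", {"xtargets": targets})
    return group


def resolve(link: ET.Element, hr_sents: dict, en_sents: dict):
    """Look up the sentences a link points to in both documents."""
    hr_part, en_part = link.get("xtargets").split(";")
    return ([hr_sents[i] for i in hr_part.split()],
            [en_sents[i] for i in en_part.split()])


# Hypothetical IDs: one 1:1 link and one 1:2 link.
links = [(["hr.s1"], ["en.s1"]), (["hr.s2"], ["en.s2", "en.s3"])]
group = build_links(links, "CW011hr.xml", "CW011en.xml")
print(ET.tostring(group, encoding="unicode"))
```

Because the links live outside the texts, 1:2 and 2:1 alignments need no change to either document, and the division, heading and highlighting markup stays untouched.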

17
Preliminary statistics
  • already for <S> alignment it seems that we will
    have a lot of manual work
  • discrepancy between the number of <S> and <W> in hr
    and en

    issue  element     hr     en  increase (%)
    CW010  <P>        195    195
           <S>        729    796   9.2
           <W>      15483  18176  17.4
    CW011  <P>        178    178
           <S>        675    754  11.7
           <W>      14853  17602  18.5

  • <W> alignment is not on the schedule yet
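
The increase column is consistent with a relative increase in percent of the en count over the hr count. A short Python check, using only the counts from this slide (the names `COUNTS` and `increase_pct` are hypothetical):

```python
# (issue, element) -> (hr count, en count), copied from the slide.
COUNTS = {
    ("CW010", "P"): (195, 195),
    ("CW010", "S"): (729, 796),
    ("CW010", "W"): (15483, 18176),
    ("CW011", "P"): (178, 178),
    ("CW011", "S"): (675, 754),
    ("CW011", "W"): (14853, 17602),
}


def increase_pct(hr: int, en: int) -> float:
    """Relative increase of the en count over the hr count, in percent."""
    return round((en - hr) / hr * 100, 1)


for (issue, element), (hr, en) in COUNTS.items():
    print(issue, element, hr, en, increase_pct(hr, en))
```

Paragraph counts match exactly, so the extra manual work comes from sentence boundaries that differ inside otherwise parallel paragraphs.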