Experiments on Processing Overlapping Parallel Corpora

Slides: 20
Provided by: mphi
Learn more at: http://www.lrec-conf.org

Transcript and Presenter's Notes


1
Experiments on Processing Overlapping Parallel Corpora
  • University of Tartu
  • Mark Fishel and Heiki-Jaan Kaalep

2
Outline
  • Parallel corpora containing overlapping parts
  • A method for processing these
  • Some experiments on JRC-Acquis (Estonian,
    Latvian, English)

3
Overlapping parallel corpora
  • Hunglish and OPUS
      • Hu-En subtitles
  • Hunglish and JRC-Acquis
      • Hu-En legislation texts
  • Univ. of Tartu corpus and JRC-Acquis
      • Et-En legislation texts
  • JRC-Acquis Vanilla and HunAlign
      • legislation texts

4
Overlapping parallel corpora
  • Additional difficulties in handling
      • source version differences
      • encoding differences
      • format differences
  • But also potential benefits
      • detect alignment errors
      • raise corpora quality
      • increase segmentation depth

5
ParAlign: the method
  • A method of finding and matching corresponding
    corpora parts
  • Enables
      • combining corpora
      • detecting potential error spots
      • increasing alignment depth
      • evaluating and improving alignment quality

6
Method based on finding corpora correspondence

7
Aligning the corresponding language parts

8
Aligning the corresponding language parts
  • Edit distance over the corpora documents
      • comparing N to M sentences
      • matching weight: approximate sentence matching
  • Approximate sentence matching: modified edit
    distance
      • same letter, different case: replacing is free
      • number: inserting/replacing is infinitely costly
      • punctuation: replacing is cheap
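
The cost rules on this slide can be sketched as a character-level edit distance. This is a minimal illustration, not the ParAlign code: the function names, the value of `PUNCT_COST`, the unit cost for ordinary edits, and the choice to make deleting a digit as costly as inserting one are all assumptions, since the slide only says "free", "cheap" and "infinitely costly".

```python
import math
import string

PUNCT_COST = 0.1  # assumed value; the slide only says "cheap"


def sub_cost(a: str, b: str) -> float:
    """Cost of replacing character a with character b."""
    if a == b:
        return 0.0
    if a.lower() == b.lower():
        return 0.0  # same letter, different case: free
    if a.isdigit() or b.isdigit():
        return math.inf  # a number must never be replaced
    if a in string.punctuation and b in string.punctuation:
        return PUNCT_COST  # punctuation for punctuation: cheap
    return 1.0  # assumed cost of an ordinary replacement


def indel_cost(c: str) -> float:
    """Cost of inserting or deleting character c (deletion is an assumption)."""
    return math.inf if c.isdigit() else 1.0


def sentence_distance(s: str, t: str) -> float:
    """Modified edit distance between two sentences, row by row."""
    prev = [0.0]
    for ch in t:
        prev.append(prev[-1] + indel_cost(ch))
    for a in s:
        cur = [prev[0] + indel_cost(a)]
        for j, b in enumerate(t, start=1):
            cur.append(min(
                prev[j] + indel_cost(a),       # delete a
                cur[j - 1] + indel_cost(b),    # insert b
                prev[j - 1] + sub_cost(a, b),  # replace or match
            ))
        prev = cur
    return prev[-1]
```

Under these assumptions, "Article 5" and "article 5" are at distance 0, while "Article 5" and "Article 6" are infinitely far apart, so sentences that differ in a number can never be matched.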

9
Aligning the language alignments
  • Levenshtein distance
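
The slide names only the distance measure, so the following is a generic sketch of the standard Levenshtein distance applied to two sequences of alignment units. Representing a unit as a pair of sentence-index tuples is an assumption made for illustration, and the sample alignments are invented.

```python
from typing import Hashable, Sequence


def levenshtein(a: Sequence[Hashable], b: Sequence[Hashable]) -> int:
    """Standard Levenshtein distance with unit costs over two sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # replace, or match at no cost
            ))
        prev = cur
    return prev[-1]


# Hypothetical alignment units: (source sentence ids, target sentence ids).
alignment_a = [((0,), (0,)), ((1, 2), (1,)), ((3,), (2,))]
alignment_b = [((0,), (0,)), ((1,), (1,)), ((2,), ()), ((3,), (2,))]
difference = levenshtein(alignment_a, alignment_b)
```

Here the two alignments agree on the first and last units and differ in how the middle sentences are grouped, which costs one replacement and one insertion.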

10
ParAlign, the Implementation
  • Combine corpora, include side with more sentences
  • Print out all mismatching parts (potential error
    spots)
  • Use one corpus as a guideline, proof the other one
  • Available at http://ats.cs.ut.ee/smt/paralign
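
One possible reading of the first two points, sketched under stated assumptions: the two corpora have already been matched into corresponding regions, each region is a pair of alignment-unit lists, and "include side with more sentences" means keeping the more finely segmented side. The `combine` function and its input format are hypothetical and do not reflect the actual ParAlign interface.

```python
from typing import List, Sequence, Tuple

Unit = Tuple[str, str]  # hypothetical alignment unit: (source text, target text)
Region = Tuple[Sequence[Unit], Sequence[Unit]]  # same region in corpus A and B


def combine(regions: Sequence[Region]):
    """Build a joint corpus and collect mismatching regions for inspection."""
    joint: List[Unit] = []
    mismatches = []
    for index, (units_a, units_b) in enumerate(regions):
        if list(units_a) != list(units_b):
            # Potential error spot: the two corpora disagree here.
            mismatches.append((index, units_a, units_b))
        # Keep the side with more units, i.e. the finer segmentation.
        finer = units_a if len(units_a) >= len(units_b) else units_b
        joint.extend(finer)
    return joint, mismatches


# Invented sample data: the second region is segmented more finely in corpus B.
regions = [
    ([("a", "x")], [("a", "x")]),
    ([("b c", "y z")], [("b", "y"), ("c", "z")]),
]
joint, mismatches = combine(regions)
```

Returning the mismatches separately keeps the joint corpus usable while leaving the disagreeing regions available for manual proofing.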

11
Method Benefits
  • Handles different segmentation levels (M to N
    alignment unit relations)
  • Insensitive to minor input differences
      • Encoding
      • Typing errors

12
Experiment-1
  • Univ. of Tartu corpus and JRC-Acquis
    (English-Estonian)
  • Overlapping parts found by comparing the CELEX
    codes
  • Aim: generate a joint corpus
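
Finding the overlap by CELEX code amounts to intersecting the document identifiers of the two corpora. A minimal sketch, assuming each corpus is held as a dictionary keyed by CELEX code; the function name, the data layout and the sample codes are all hypothetical.

```python
from typing import Dict, List, Tuple


def overlapping_documents(
    corpus_a: Dict[str, str], corpus_b: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (code, document from A, document from B) for every shared code."""
    shared = sorted(corpus_a.keys() & corpus_b.keys())
    return [(code, corpus_a[code], corpus_b[code]) for code in shared]


# Invented sample data; the keys only imitate the shape of CELEX codes.
tartu = {"32001R0001": "doc T1", "32002L0002": "doc T2"}
acquis = {"32002L0002": "doc J2", "32003D0003": "doc J3"}
pairs = overlapping_documents(tartu, acquis)
```

Only the document present in both sample corpora is returned; the remaining documents would enter a joint corpus unchanged.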

13
Results
  • Joint corpus size: 670,000 alignment units

14
Segmentation differences

15
Experiment-2
  • JRC-Acquis
      • English-Estonian
      • English-Latvian
      • Estonian-Latvian
  • Aim: compare alignments produced by Vanilla and
    HunAlign
      • almost 100% overlapping

16
Results

17
Future Work
  • Other corpora
  • Optimizing
  • Test on other domains

18
Summary
  • A method for combining, comparing and evaluating
    parallel corpora using their overlapping parts
  • Implementation available
  • Joint En-Et corpus
  • Comparison results between the HunAlign and Vanilla
    versions of the JRC-Acquis En-Et, En-Lv and Et-Lv
    parts

19
Thank You!