1
Using Comparable Corpora to Adapt a Translation
Model to Domains
The 7th International Conference on Language
Resources and Evaluation, Malta, May 2010.
  • Hiroyuki Kaji, Takashi Tsunakawa, Daisuke Okada
  • Department of Computer Science, Shizuoka
    University

2
Overview
  • Motivation and goal
  • Proposed method
  • Estimating noun translation pseudo-probabilities
  • Estimating noun-sequence translation
    pseudo-probabilities
  • Phrase-based SMT using translation
    pseudo-probabilities
  • Experiments
  • Discussion
  • Related work
  • Summary

3
Motivation and goal
  • Statistical machine translation
  • Able to learn a translation model from a parallel
    corpus
  • Suffers from the limited availability of large
    parallel corpora
  • Use comparable corpora for SMT
  • Estimate translation pseudo-probabilities from a
    bilingual dictionary and comparable corpora
  • Use the pseudo-probabilities estimated from
    in-domain comparable corpora to
  • Adapt a translation model learned from an
    out-of-domain parallel corpus, or
  • Augment a translation model learned from a small
    in-domain parallel corpus

4
Overview
  • Motivation and goal
  • Proposed method
  • Estimating noun translation pseudo-probabilities
  • Estimating noun-sequence translation
    pseudo-probabilities
  • Phrase-based SMT using translation
    pseudo-probabilities
  • Experiments
  • Discussion
  • Related work
  • Summary

5
Basic idea for estimating word translation
pseudo-probabilities from comparable corpora
  • Word associations suggest particular senses or
    translations of a polysemous word (Yarowsky 1993)
  • (tank, soldier) → the military-vehicle sense of
    tank, i.e., the translation 戦車 SENSHA
  • (tank, gasoline) → the container-for-liquid-or-gas
    sense of tank, i.e., the translation タンク TANKU
  • Comparable corpora allow us to determine which
    word associations suggest which translations of a
    polysemous word (Kaji & Morimoto 2002)
  • Assume that the more word associations suggest a
    translation, the higher the probability of that
    translation

6
Naive method for estimating word translation
pseudo-probabilities
Japanese corpus
English corpus
Extract word associations
Extract word associations
English-Japanese dictionary
Align word associations
Fuel, gasoline, and others suggest
タンク TANKU
Missile, soldier, and others suggest
戦車 SENSHA
Calculate the percentage of associated words
suggesting each translation
Pps(タンク TANKU | tank) = |{fuel, gasoline, …}| /
  |{fuel, gasoline, missile, soldier, …}|
Pps(戦車 SENSHA | tank) = |{missile, soldier, …}| /
  |{fuel, gasoline, missile, soldier, …}|
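To make the naive percentage computation concrete, here is a minimal Python sketch (not from the slides; the word sets and the helper name naive_pseudo_probs are illustrative):

```python
def naive_pseudo_probs(suggesting_words):
    """Naive estimate: Pps(translation | word) = share of the word's associated
    words that suggest that translation (illustrative sketch only)."""
    total = sum(len(words) for words in suggesting_words.values())
    return {t: len(words) / total for t, words in suggesting_words.items()}

# Hypothetical alignment result for the English noun "tank"
suggesting_words = {
    "タンク TANKU": {"fuel", "gasoline"},
    "戦車 SENSHA": {"missile", "soldier"},
}
print(naive_pseudo_probs(suggesting_words))
# -> {'タンク TANKU': 0.5, '戦車 SENSHA': 0.5}
```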
7
Difficulties the naive method suffers from
  • Failure in word-association alignment
  • (tank, Chechen) → ?
  • due to the disparity in topical coverage between
    the two language corpora
  • (tank, Chechen) cannot be aligned with
    (戦車 SENSHA, チェチェン CHECHEN)
  • due to the incomplete coverage of the
    intermediary bilingual dictionary
  • Incorrect word-association alignment
  • (tank, troop) → (水槽 SUISOU, 群れ MURE)
  • due to incidental word-for-word correspondence
    between word associations that do not really
    correspond to each other

8
How to overcome the difficulties
  • Two words associated with a third word are likely
    to suggest the same sense or translation of the
    third word when they are also associated with
    each other
  • Soldier and troop, both of which are
    associated with tank, are associated with each
    other
  • Soldier and troop suggest the same
    translation 戦車 SENSHA
  • Define a correlation between an associated word
    and a translation using the correlations between
    other associated words and the translation
  • C(troop, 戦車 SENSHA)
    ∝ MI(troop, tank) × { MI(troop, soldier) × C(soldier, 戦車 SENSHA)
      + MI(troop, missile) × C(missile, 戦車 SENSHA) }
  • C(troop, タンク TANKU)
    ∝ MI(troop, tank) × { MI(troop, soldier) × C(soldier, タンク TANKU)
      + MI(troop, missile) × C(missile, タンク TANKU) }

9
  • Calculate the correlations iteratively starting
    with the initial values determined according to
    the results of word-association alignment via a
    bilingual dictionary
  • C0(associated_word, translation): initial values from the alignment

Alignment via the bilingual dictionary (✓ = aligned)
tank        戦車 SENSHA   タンク TANKU
Chechen         -             -
fuel            -             ✓
gasoline        -             ✓
missile         ✓             -
soldier         ✓             -
troop           ✓             ✓

Initial correlation values C0
tank        戦車 SENSHA   タンク TANKU
Chechen        0.0           0.0
fuel           0.0           1.0
gasoline       0.0           1.0
missile        1.0           0.0
soldier        1.0           0.0
troop          0.5           0.5
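A short Python sketch of one possible reading of this iterative computation (an assumption on my part: the exact update rule and normalisation are not fully specified on the slides). MI_head holds MI(a, tank) for each associated word and MI_assoc the MI values between associated words; all MI numbers below are made up for illustration.

```python
import numpy as np

def iterate_correlations(C0, MI_head, MI_assoc, iterations=10):
    """Iteratively refine the correlation matrix C (associated words x translations).
    Assumed update, following the slide's pattern:
        C_new(a, t) ∝ MI(a, headword) * sum_{a'} MI(a, a') * C_old(a', t)
    Rows are renormalised at each step (an added assumption)."""
    C = C0.astype(float)
    for _ in range(iterations):
        C = MI_head[:, None] * (MI_assoc @ C)
        row_sums = C.sum(axis=1, keepdims=True)
        C = np.divide(C, row_sums, out=np.zeros_like(C), where=row_sums > 0)
    return C

words = ["Chechen", "fuel", "gasoline", "missile", "soldier", "troop"]
# Initial values from the alignment step (columns: 戦車 SENSHA, タンク TANKU)
C0 = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 1.0],
               [1.0, 0.0], [1.0, 0.0], [0.5, 0.5]])
MI_head = np.array([1.2, 2.0, 1.8, 2.5, 2.2, 1.5])   # made-up MI(a, tank)
MI_assoc = np.array([                                 # made-up MI(a, a'), zero diagonal
    [0.0, 0.0, 0.0, 0.3, 0.2, 0.4],
    [0.0, 0.0, 1.5, 0.0, 0.0, 0.0],
    [0.0, 1.5, 0.0, 0.0, 0.0, 0.0],
    [0.3, 0.0, 0.0, 0.0, 1.2, 0.8],
    [0.2, 0.0, 0.0, 1.2, 0.0, 1.0],
    [0.4, 0.0, 0.0, 0.8, 1.0, 0.0],
])
C = iterate_correlations(C0, MI_head, MI_assoc)
# With these toy values, Chechen and troop end up most correlated with 戦車 SENSHA
print(dict(zip(words, C.round(2).tolist())))
```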
10
Overview of our method for estimating noun
translation pseudo-probabilities
English corpus
Japanese corpus
Window size: ±10 content words
Extract pairs of words co-occurring in a window
Extract pairs of words co-occurring in a window
English-Japanese dictionary
Calculate point-wise mutual information
English word associations
Calculate point-wise mutual information
Japanese word associations
Align
Initial value of correlation matrix of English
associated words vs. Japanese translations for an
English noun
Calculate pairwise correlation between associated
words and translations iteratively
Correlation matrix of associated words vs.
translations
Assign each associated word to the translation
with which it has the highest correlation and
calculate the percentage of associated words
assigned to each translation
Noun translation pseudo-probabilities
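The first two boxes of this pipeline (windowed co-occurrence extraction plus pointwise mutual information) could look roughly like the following Python sketch; the thresholds and the restriction to content words are simplified assumptions, not the authors' exact settings.

```python
import math
from collections import Counter

def extract_associations(sentences, window=10, min_pair_freq=2):
    """Collect word pairs co-occurring within a `window`-token span and score
    them with pointwise mutual information (PMI).  Simplified: the real system
    works on content words and applies frequency/PMI thresholds."""
    word_freq, pair_freq = Counter(), Counter()
    for tokens in sentences:
        word_freq.update(tokens)
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if w != v:
                    pair_freq[tuple(sorted((w, v)))] += 1
    n_words, n_pairs = sum(word_freq.values()), sum(pair_freq.values())
    associations = {}
    for (w, v), f in pair_freq.items():
        if f < min_pair_freq:
            continue
        pmi = math.log2((f / n_pairs) /
                        ((word_freq[w] / n_words) * (word_freq[v] / n_words)))
        associations[(w, v)] = pmi
    return associations
```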
11
Example correlation matrix and estimated noun
translation pseudo-probabilities
plant         装置      設備      植物         工場      プラント   苗      植木
              SOUCHI   SETSUBI  SHOKUBUTSU  KOUJOU   PURANTO  NAE    UEKI
activity       0.02     0.03     2.10        0.20     0.03     0.01   0.02
bacteria       0.02     0.03     1.98        0.01     0.02     0.27   0.02
boiler         0.05     2.70     0.05        0.03     2.73     0.03   0.04
coal           0.87     2.35     1.70        0.68     2.06     0.65   0.99
computer       0.55     0.71     0.02        0.49     0.73     0.01   0.01
control        0.47     0.51     0.17        0.15     0.62     0.06   0.01
culture        0.03     0.05     3.26        0.23     0.12     0.77   0.88
environment    0.76     1.25     1.32        0.03     0.05     0.23   0.03
failure        0.93     1.22     0.03        0.53     1.43     0.01   0.01
flower         0.04     0.06     4.02        0.04     0.04     1.23   1.70

Translation pseudo-probabilities
               .047     .241     .423        .022     .223     .022   .022
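The final step of slide 10 (assign each associated word to its best translation, then normalise the votes) can be sketched as below. The helper name pseudo_probs_from_matrix is illustrative, and the slide's probabilities are computed over the full set of associated words, so this toy reproduces the procedure rather than the numbers above.

```python
import numpy as np

def pseudo_probs_from_matrix(C, translations):
    """Each associated word (row of C) votes for the translation (column) it
    correlates with most strongly; votes are normalised into pseudo-probabilities."""
    votes = np.zeros(len(translations))
    for row in C:
        votes[int(np.argmax(row))] += 1
    return dict(zip(translations, votes / votes.sum()))

translations = ["装置 SOUCHI", "設備 SETSUBI", "植物 SHOKUBUTSU", "工場 KOUJOU",
                "プラント PURANTO", "苗 NAE", "植木 UEKI"]
C = np.array([  # the ten rows shown above (activity ... flower)
    [0.02, 0.03, 2.10, 0.20, 0.03, 0.01, 0.02],
    [0.02, 0.03, 1.98, 0.01, 0.02, 0.27, 0.02],
    [0.05, 2.70, 0.05, 0.03, 2.73, 0.03, 0.04],
    [0.87, 2.35, 1.70, 0.68, 2.06, 0.65, 0.99],
    [0.55, 0.71, 0.02, 0.49, 0.73, 0.01, 0.01],
    [0.47, 0.51, 0.17, 0.15, 0.62, 0.06, 0.01],
    [0.03, 0.05, 3.26, 0.23, 0.12, 0.77, 0.88],
    [0.76, 1.25, 1.32, 0.03, 0.05, 0.23, 0.03],
    [0.93, 1.22, 0.03, 0.53, 1.43, 0.01, 0.01],
    [0.04, 0.06, 4.02, 0.04, 0.04, 1.23, 1.70],
])
print(pseudo_probs_from_matrix(C, translations))
```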
12
Overview
  • Motivation and goal
  • Proposed method
  • Estimating noun translation pseudo-probabilities
  • Estimating noun-sequence translation
    pseudo-probabilities
  • Phrase-based SMT using translation
    pseudo-probabilities
  • Experiments
  • Discussion
  • Related work
  • Summary

13
Our method for estimating noun-sequence
translation pseudo-probabilities
English-Japanese dictionary
English corpus
Japanese corpus
E(1) = e1(1) e2(1) … em(1),  E(2) = e1(2) e2(2) … em(2),
…,  E(n) = e1(n) e2(n) … em(n)
Extract a noun sequence with its frequency
Retrieve compositional translations and count
their frequencies
Generate all compositional translations
F = f1 f2 … fm
Estimate according to occurrence frequencies
Estimate according to constituent-word
translation pseudo-probabilities
Combine two estimates
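A minimal Python sketch of the "generate all compositional translations" box and of the constituent-word estimate; the dictionary contents, the helper names, and the independence assumption are illustrative, and the frequency-based estimate and the combination step are not shown.

```python
from itertools import product

def compositional_translations(noun_seq, dictionary):
    """Enumerate all compositional translations E(k) = e1(k) ... em(k) of a
    source noun sequence F = f1 ... fm by combining per-word dictionary entries.
    (Word-order differences between the languages are ignored in this sketch.)"""
    options = [dictionary[f] for f in noun_seq]
    return [" ".join(combo) for combo in product(*options)]

def constituent_word_estimate(noun_seq, translation_words, word_pseudo_probs):
    """Estimate a noun-sequence translation pseudo-probability as the product of
    the constituent-word pseudo-probabilities (independence is an assumption)."""
    p = 1.0
    for f, e in zip(noun_seq, translation_words):
        p *= word_pseudo_probs.get((f, e), 0.0)
    return p

# Hypothetical example
dictionary = {"発電": ["power generation", "generation"], "プラント": ["plant"]}
print(compositional_translations(["発電", "プラント"], dictionary))
# -> ['power generation plant', 'generation plant']
```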
14
Overview
  • Motivation and goal
  • Proposed method
  • Estimating noun translation pseudo-probabilities
  • Estimating noun-sequence translation
    pseudo-probabilities
  • Phrase-based SMT using translation
    pseudo-probabilities
  • Experiments
  • Discussion
  • Related work
  • Summary

15
Phrase-based SMT using translation
pseudo-probabilities
In-domain source- language corpus
In-domain target- language corpus
Out-of-domain (or in-domain) parallel corpus
GIZA++ and heuristics
Bilingual dictionary
Estimate translation pseudo-probabilities
In-domain phrase table (pseudo-probabilities)
Basic phrase table
Merge
SRILM
Adapted (or augmented) phrase table
In-domain language model
Moses decoder
Source language text
Target language text
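How the "Merge" box is realised is not detailed on the slide; below is a hedged Python sketch assuming the usual Moses text format (src ||| tgt ||| scores ...) and a simple policy of appending in-domain entries that the basic table does not already contain. The file names and the merging policy are assumptions, not the authors' implementation.

```python
def merge_phrase_tables(basic_path, indomain_path, out_path):
    """Append in-domain phrase-table entries (pseudo-probabilities) that are not
    already covered by the basic phrase table.  A sketch only; the actual
    merging/scoring scheme is not specified on the slide."""
    def key(line):
        parts = line.split(" ||| ")
        return (parts[0], parts[1]) if len(parts) >= 2 else None

    with open(basic_path, encoding="utf-8") as f:
        basic_lines = f.readlines()
    seen = {key(line) for line in basic_lines}

    with open(indomain_path, encoding="utf-8") as f:
        extra = [line for line in f if key(line) not in seen]

    with open(out_path, "w", encoding="utf-8") as f:
        f.writelines(basic_lines + extra)

# merge_phrase_tables("basic-phrase-table", "indomain-phrase-table", "adapted-phrase-table")
```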
16
Overview
  • Motivation and goal
  • Proposed method
  • Estimating noun translation pseudo-probabilities
  • Estimating noun-sequence translation
    pseudo-probabilities
  • Phrase-based SMT using translation
    pseudo-probabilities
  • Experiments
  • Discussion
  • Related work
  • Summary

17
Experimental setting
  • Experiment A
  • Adapt a phrase table learned from an
    out-of-domain parallel corpus by using in-domain
    comparable corpora
  • Experiment B
  • Augment a phrase table learned from a small
    in-domain parallel corpus by using larger
    in-domain comparable corpora

Training parallel corpus
  Experiment A: 20,000 pairs of Japanese and English patent abstracts
    in the physics domain
  Experiment B: 20,000 pairs of Japanese and English sentences with high
    similarity, extracted from scientific-paper abstracts in the chemistry domain
Training comparable corpora (same for both experiments)
  Scientific-paper abstracts in the chemistry domain
  Japanese: 151,958 abstracts (90.8 Mbytes); English: 102,730 abstracts (64.9 Mbytes)
Test corpus (same for both experiments)
  1,000 Japanese sentences from scientific-paper abstracts in the chemistry
  domain, each with one reference English translation
Bilingual dictionary (same for both experiments)
  333,656 pairs of translation equivalents between 163,247 Japanese and
  93,727 English nouns, from the EDR, EIJIRO, and EDICT dictionaries
18
  • Our method in four cases using different volumes
    of comparable corpora
  • Japanese all, English all
  • Japanese half, English all
  • Japanese all, English half
  • Japanese half, English half
  • Two baseline methods using the phrase table
    learned from the parallel corpus
  • Baseline without dictionary
  • Baseline with dictionary: the phrase table was
    augmented with the bilingual dictionary
  • Note: the target-language (TL) language model, learned
    from the whole TL monolingual corpus, was used in common
    across all cases, for both our method and the baselines
  • Evaluation metric: BLEU-4

19
Experimental results
  • BLEU-4 score
  • Our method slightly improved the BLEU score
  • The effect of the difference in volume of
    comparable corpora remains unclear
  • Simply adding a bilingual dictionary improved the
    out-of-domain phrase table, but did not improve
    the in-domain phrase table

Method                        Experiment A   Experiment B
Our method (Jall, Eall)          13.30          16.82
Our method (Jhalf, Eall)         13.19          16.70
Our method (Jall, Ehalf)         13.21          16.78
Our method (Jhalf, Ehalf)        13.27          16.71
Baseline w/o dictionary          11.42          16.37
Baseline w/ dictionary           12.94          16.32
20
Overview
  • Motivation and goal
  • Proposed method
  • Estimating noun translation pseudo-probabilities
  • Estimating noun-sequence translation
    pseudo-probabilities
  • Phrase-based SMT using translation
    pseudo-probabilities
  • Experiments
  • Discussion
  • Related work
  • Summary

21
Discussion
  • Optimization of the parameters
  • Parameters, including the window size and
    thresholds for word occurrence frequency,
    co-occurrence frequency, and pointwise mutual
    information, affect the correlation matrix of
    associated words vs. translations
  • How to optimize the values for the parameters
    remains unsolved
  • Alternatives for word-association measure
  • Pointwise mutual information, which tends to
    overestimate low-frequency words, is not the most
    suitable for acquiring word associations
  • Need to compare it with alternatives such as the
    log-likelihood ratio and the Dice coefficient
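For concreteness, a small Python sketch of these association measures computed from co-occurrence counts (the counts in the example are made up): the rare pair receives a far larger PMI than the frequent, well-attested pair, whereas the log-likelihood ratio ranks them the other way around.

```python
import math

def pmi(c_xy, c_x, c_y, n):
    """Pointwise mutual information from pair count, word counts, corpus size."""
    return math.log2((c_xy * n) / (c_x * c_y))

def dice(c_xy, c_x, c_y):
    """Dice coefficient."""
    return 2.0 * c_xy / (c_x + c_y)

def llr(c_xy, c_x, c_y, n):
    """Dunning's log-likelihood ratio (G^2) over the 2x2 contingency table."""
    k = [[c_xy, c_x - c_xy], [c_y - c_xy, n - c_x - c_y + c_xy]]
    rows = [sum(r) for r in k]
    cols = [k[0][j] + k[1][j] for j in range(2)]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            if k[i][j] > 0:
                g2 += k[i][j] * math.log(k[i][j] / (rows[i] * cols[j] / n))
    return 2.0 * g2

n = 100_000
print(pmi(2, 2, 3, n), llr(2, 2, 3, n), dice(2, 2, 3))                # rare pair
print(pmi(500, 2000, 3000, n), llr(500, 2000, 3000, n), dice(500, 2000, 3000))  # frequent pair
```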

22
  • Refinement of the definition of translation
    pseudo-probability
  • Need to consider the frequencies of associated
    words as well as the dependence among associated
    words
  • Need to reconsider the strategy of assigning an
    associated word to only one translation
  • Estimation of verb translation pseudo-probabilities
  • Need to use syntactic co-occurrence, instead of
    co-occurrence in a window, to extract verb-noun
    associations from corpora
  • Need to define pairwise correlations between
    associated nouns and translations recursively,
    based on the heuristic that two nouns associated
    with a verb are likely to suggest the same sense
    of the verb when they belong to the same semantic
    class

23
Overview
  • Motivation and goal
  • Proposed method
  • Estimating noun translation pseudo-probabilities
  • Estimating noun-sequence translation
    pseudo-probabilities
  • Phrase-based SMT using translation
    pseudo-probabilities
  • Experiments
  • Discussion
  • Related work
  • Summary

24
Related work
  • Many studies on bilingual lexicon acquisition
    from bilingual comparable corpora have been
    reported since the mid-1990s, but few studies
    address the estimation of word translation
    probabilities from bilingual comparable corpora
  • Estimating word translation probabilities from
    comparable corpora with an EM algorithm (Koehn &
    Knight 2000) can be greatly affected by the
    occurrence frequencies of translation candidates
    in the TL corpus
  • In contrast, our method produces translation
    pseudo-probabilities that reflect the
    distribution of the senses of the SL word in the
    SL corpus
  • Methods for extracting parallel sentence pairs
    from bilingual comparable corpora (Zhao & Vogel
    2002; Utiyama & Isahara 2003; Fung & Cheung 2004;
    Munteanu & Marcu 2005): the extracted parallel
    sentences can be used to learn a translation
    model with a conventional method based on
    word-for-word alignment. However, this approach
    is applicable only to closely comparable corpora.
  • In contrast, our method is applicable even to a
    pair of unrelated monolingual corpora.

25
Overview
  • Motivation and goal
  • Proposed method
  • Estimating noun translation pseudo-probabilities
  • Estimating noun-sequence translation
    pseudo-probabilities
  • Phrase-based SMT using translation
    pseudo-probabilities
  • Experiments
  • Discussion
  • Related work
  • Summary

26
Summary
  • A method for estimating translation
    pseudo-probabilities from a bilingual dictionary
    and bilingual comparable corpora was developed
  • Assumption: the more associated words a
    translation is correlated with, the higher its
    translation probability
  • Essence of the method: calculate pairwise
    correlations between the associated words of an
    SL word and its TL translations
  • A phrase-based SMT framework using an out-of-domain
    parallel corpus and in-domain comparable corpora
    was proposed
  • The experiments showed promising results: the BLEU
    score was improved by using the translation
    pseudo-probabilities estimated from in-domain
    comparable corpora.
  • Future work includes optimizing the parameters
    and extending the method to estimate translation
    pseudo-probabilities for verbs.