Graph-based Bilingual Phrase Sense Disambiguation for Statistical Machine Translation

Description:

E-J translation (en) gets worse with the Wikipedia dictionary. J-E translation (ja) performs slightly better with the Wikipedia dictionary than without ...

Slides: 20

Provided by: komachi

Transcript and Presenter's Notes

1
Graph-based Bilingual Phrase Sense Disambiguation
for Statistical Machine Translation

Mamoru Komachi
mamoru-k_at_is.naist.jp
2008-06-04

2
Background

The success of supervised ML methods depends on annotated corpora
Hard to maintain
Weakly supervised methods require only a small amount of tagged data
Can reduce the amount of human effort

3
Remaining problems

WSD is crucial to weakly supervised methods
Cf. semantic drift
Parallel corpora (and dictionaries) may help disambiguate word senses
WSD models in SMT systems are gaining much attention (Carpuat and Wu, 2007)

4
WSD and SMT

Improving Statistical Machine Translation using Word Sense Disambiguation (Carpuat and Wu, EMNLP 2007)
SMT is known to suffer from inaccurate lexical choice (based on a Senseval-style sense inventory)
Domain adaptation problem
Input is typically a word
Limited contextual features

5
Phrase-based WSD models for SMT

Sense annotations are derived from the phrase alignments learned during SMT training
WSD senses are taken from the SMT phrasal translation lexicon (phrase table)
Not only words but also phrases are disambiguated
Supervised WSD (an ensemble of naïve Bayes, maximum entropy (ME), boosting and Kernel PCA-based models)

6
Phrase table is highly ambiguous

Phrase table constructed from the NTCIR-7 J-E parallel corpus
0.51 GB (in gzip format)
2.53 candidates per phrase (3.24 candidates per phrase for phrases shorter than 5 words)
Includes function words as well
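
The "candidates per phrase" figure above can be illustrated with a minimal sketch. It assumes a Moses-style text phrase table with `source ||| target ||| scores` lines, which the slides do not confirm; `candidates_per_phrase` and `toy_table` are hypothetical names and data made up for illustration, not the NTCIR-7 table.

```python
from collections import defaultdict

def candidates_per_phrase(lines, max_source_len=None):
    """Average number of distinct target candidates per source phrase.

    Assumes 'source ||| target ||| ...' lines (an assumption, see lead-in).
    If max_source_len is given, longer source phrases are skipped.
    """
    targets = defaultdict(set)
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 2:
            continue  # skip malformed lines
        source, target = fields[0], fields[1]
        if max_source_len is not None and len(source.split()) > max_source_len:
            continue
        targets[source].add(target)
    if not targets:
        return 0.0
    return sum(len(t) for t in targets.values()) / len(targets)

# Toy entries with placeholder targets, for illustration only.
toy_table = [
    "plant ||| target_a ||| 0.4",
    "plant ||| target_b ||| 0.3",
    "plant ||| target_c ||| 0.3",
    "the power plant ||| target_d ||| 1.0",
    "in the case of the present invention ||| target_e ||| 1.0",
]

overall = candidates_per_phrase(toy_table)
short_only = candidates_per_phrase(toy_table, max_source_len=4)  # shorter than 5 words
```

For a gzip-compressed table, the lines could be read with `gzip.open(path, "rt", encoding="utf-8")` from the standard library and passed to the same function.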

Plant has 120 translations in the phrase table!
[Examples: "Plant" paired with several Japanese translations; Japanese text lost in extraction]

7
Motivation

Propose a novel graph-based approach to phrase sense disambiguation
Can exploit bilingual contextual patterns
Evaluate phrase sense disambiguation within an SMT framework

8
Monolingual bootstrapping

Pioneered by Yarowsky (1995)
Learn decision lists from a small set of seed instances (input: instances I; output: classifier)
One-sense-per-discourse constraint

9
Bootstrapping

Iteratively conduct pattern induction and instance extraction, starting from seed instances
Can grow a small set of seed instances

[Figure: bootstrapping over a query log (corpus)]
Instances: vaio, Toshiba satellite, HP xb3000
Contextual pattern: Compare [slot] laptop
Query log: Compare vaio laptop; Compare toshiba satellite laptop; Compare HP xb3000 laptop
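
The loop of pattern induction and instance extraction on this slide can be sketched as follows. This is a simplified illustration built on the slide's query-log example, not the presenter's implementation: it has no pattern or instance scoring, and the function names (`induce_patterns`, `extract_instances`, `bootstrap`) and the `<slot>` marker are assumptions introduced here.

```python
import re

SLOT = "<slot>"

def induce_patterns(queries, instances):
    """Replace a known instance inside a query with a slot to form a pattern."""
    patterns = set()
    for query in queries:
        lowered = query.lower()
        for inst in instances:
            pos = lowered.find(inst.lower())
            if pos >= 0:
                patterns.add(query[:pos] + SLOT + query[pos + len(inst):])
    return patterns

def extract_instances(queries, patterns):
    """Match each pattern against the queries and collect the slot fillers."""
    found = set()
    for pattern in patterns:
        prefix, suffix = pattern.split(SLOT)
        regex = re.compile(
            re.escape(prefix) + r"(.+)" + re.escape(suffix) + r"$",
            re.IGNORECASE,
        )
        for query in queries:
            match = regex.match(query)
            if match:
                found.add(match.group(1).lower())
    return found

def bootstrap(queries, seeds, iterations=2):
    """Alternate pattern induction and instance extraction from the seeds."""
    instances = {s.lower() for s in seeds}
    for _ in range(iterations):
        patterns = induce_patterns(queries, instances)
        instances |= extract_instances(queries, patterns)
    return instances

query_log = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]
result = bootstrap(query_log, seeds=["vaio"])
```

Starting from the single seed `vaio`, the induced pattern `Compare <slot> laptop` pulls in the other two product names. With a generic pattern and no scoring, the same loop would also pull in unrelated fillers, which is the semantic drift problem mentioned earlier.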
10
Bilingual bootstrapping

Word Translation Disambiguation Using Bilingual Bootstrapping (Li and Li, ACL-2002)

[Figure: "Plant" with the senses "Mill" and "Vegetable", two Japanese translations, corpora, and a WSD classifier; Japanese text lost in extraction]

11
Formalization of bootstrapping

Score vector of seed instances ($i_0$)
Pattern-instance matrix P
The (p,i) element of P is the co-occurrence between pattern p and instance i
Iterate: $i_n = A^n i_0$ with $A = P^{T}P$
Output ranked instances when the stopping criterion is met
[One further Japanese note lost in extraction]
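
The iteration on this slide, in which instance scores are updated by repeated multiplication with A = P^T P starting from the seed score vector, can be sketched in plain Python. The rescaling step inside the loop is an assumption added here for numerical stability (it does not change the ranking), and the toy matrix `P` is made up for illustration.

```python
def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    Yt = transpose(Y)
    return [[sum(a * b for a, b in zip(row, col)) for col in Yt] for row in X]

def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def bootstrap_scores(P, seed, steps):
    """Return instance scores proportional to A^steps * seed, with A = P^T P."""
    A = matmul(transpose(P), P)
    scores = [float(s) for s in seed]
    for _ in range(steps):
        scores = matvec(A, scores)
        # Assumption: rescale to sum 1 so the values stay bounded.
        total = sum(abs(s) for s in scores) or 1.0
        scores = [s / total for s in scores]
    return scores

# Toy pattern-instance matrix: rows are patterns, columns are instances.
P = [
    [1, 1, 0],
    [0, 1, 1],
]
scores = bootstrap_scores(P, seed=[1, 0, 0], steps=2)
ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
```

In this toy case the well-connected instance 1 overtakes the seed instance 0 after two steps, which hints at why plain iteration tends to favour generic, highly connected instances.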
12
Bilingual phrase sense disambiguation

The (p,i) element of the pattern-instance matrix P is the co-occurrence between pattern p and instance i
p: contextual features from both language sides, with phrase alignment from GIZA++
i: candidate (monolingual) phrase to disambiguate
$A = P^{T}P$
Similarity is given by the regularized Laplacian kernel
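
As a rough sketch of how such a matrix might be assembled, the code below counts co-occurrences between instances and context features from both language sides, then forms A = P^T P. The input layout (one tuple per phrase occurrence), the `en:`/`ja:` feature prefixes, and the toy data with romanized Japanese tokens are all assumptions made for illustration; the slides do not specify the feature extraction.

```python
from collections import Counter

def build_pattern_instance_matrix(occurrences):
    """Build P from (instance, english_context, japanese_context) tuples.

    Rows of P are patterns (context features), columns are instances.
    """
    counts = Counter()
    for instance, en_context, ja_context in occurrences:
        for word in en_context:
            counts[("en:" + word, instance)] += 1
        for word in ja_context:
            counts[("ja:" + word, instance)] += 1
    patterns = sorted({p for p, _ in counts})
    instances = sorted({i for _, i in counts})
    P = [[counts[(p, i)] for i in instances] for p in patterns]
    return patterns, instances, P

def instance_similarity(P):
    """A = P^T P: entry (a, b) sums shared pattern counts of instances a and b."""
    n = len(P[0])
    return [[sum(row[a] * row[b] for row in P) for b in range(n)] for a in range(n)]

# Hypothetical occurrences of the phrase "plant" with bilingual context.
occurrences = [
    ("plant#1", ["power", "nuclear"], ["hatsuden"]),
    ("plant#2", ["power"], ["hatsuden"]),
    ("plant#3", ["leaf"], ["ha"]),
]
patterns, instances, P = build_pattern_instance_matrix(occurrences)
A = instance_similarity(P)
```

Here the first two occurrences share both an English and a Japanese context feature, so they end up similar to each other and unrelated to the third.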

13
Regularized Laplacian kernel

Predict final sense by k-NN given the target
instance

[Japanese slide text lost in extraction; surviving symbols: graph G, Laplacian L, matrix A, parameter β, matrix D (i-th element), kernel R_β, −L]
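
Since the formula on this slide did not survive, the sketch below uses the commonly cited definition of the regularized Laplacian kernel, R_β = (I + βL)^{-1} with L = D − A and D the diagonal degree matrix; treating that as the slide's definition is an assumption. The k-NN step, the toy graph, the value of β, and the sense labels are likewise illustrative choices, not the presenter's setup.

```python
from collections import Counter

def laplacian(A):
    """L = D - A, where D holds the row sums of A on its diagonal."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def invert(M):
    """Gauss-Jordan inversion with partial pivoting (small dense matrices)."""
    n = len(M)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def regularized_laplacian_kernel(A, beta):
    """R_beta = (I + beta * L)^-1 (assumed definition, see lead-in)."""
    n = len(A)
    L = laplacian(A)
    return invert([[(1.0 if i == j else 0.0) + beta * L[i][j] for j in range(n)]
                   for i in range(n)])

def knn_sense(kernel, target, labels, k=1):
    """Vote among the k labeled instances most similar to the target."""
    neighbours = sorted(labels, key=lambda j: -kernel[target][j])[:k]
    return Counter(labels[j] for j in neighbours).most_common(1)[0][0]

# Toy similarity graph: 0-1 and 2-3 strongly linked, 1-2 weakly linked.
A = [
    [0, 2, 0, 0],
    [2, 0, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 2, 0],
]
R = regularized_laplacian_kernel(A, beta=0.5)
labels = {0: "mill", 3: "vegetable"}  # hypothetical labeled instances
sense_of_1 = knn_sense(R, target=1, labels=labels)
sense_of_2 = knn_sense(R, target=2, labels=labels)
```

Each unlabeled instance takes the sense of the labeled instance it is most strongly connected to under the kernel, rather than the one with the most connections overall.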
14
NTCIR-7 Patent Translation Task

Large-scale Japanese-English parallel corpus
2M sentences (comparable to A-E, C-E MT)
Mainly technical documents
Timeline
2008.01 dry run
2008.05 formal run
2008.12 final meeting

15
NAIST-NTT at NTCIR-7

Bilingual dictionary extracted (solely) from Wikipedia
Used langlinks from the Wikipedia DB
1:n translations are expanded to n pairs of bilingual phrases (en, ja)
Extracted 200,000 pairs
12,000 pairs appear in the training corpus (8.8%)
44.7% of words (tokens) are covered by the automatically constructed bilingual lexicon (GIZA++)
Learned 1,193 (0.6%) novel translations
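
The expansion of 1:n translations into n pairs, and the check of how many pairs occur in the training corpus, might look like the sketch below. The dictionary layout, the substring-based matching rule, and the toy entries are assumptions for illustration; the slides do not say how corpus occurrence was determined.

```python
def expand_one_to_many(langlinks):
    """Turn {en_title: [ja_title, ...]} into a flat list of (en, ja) pairs."""
    return [(en, ja) for en, ja_titles in langlinks.items() for ja in ja_titles]

def share_in_corpus(pairs, corpus):
    """Share of pairs whose two sides co-occur in one sentence pair.

    Assumption: simple substring matching on both sides.
    """
    if not pairs:
        return 0.0
    hits = sum(
        1 for en, ja in pairs
        if any(en in en_sent and ja in ja_sent for en_sent, ja_sent in corpus)
    )
    return hits / len(pairs)

# Toy entries, made up for illustration.
langlinks = {
    "plant": ["植物", "工場"],
    "patent": ["特許"],
}
corpus = [
    ("the plant is shut down", "工場 は 停止 した"),
    ("a patent was granted", "特許 が 付与 された"),
]
pairs = expand_one_to_many(langlinks)
ratio = share_in_corpus(pairs, corpus)
```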

16
Proposed method (training not yet finished)

Extract translation pairs relevant to the given domain (patent translation task)
Construct a pattern-instance matrix P
Pattern: bag-of-words features and link features extracted from Wikipedia ja-en abstracts
Instance: translation pair (en, ja)
Seed instances: 40 translation pairs from the target domain
Apply the Laplacian kernel

17
Evaluation (BLEU score)