Title: Graph-based Bilingual Phrase Sense Disambiguation for Statistical Machine Translation
1Graph-based Bilingual Phrase Sense Disambiguation
for Statistical Machine Translation
- Mamoru Komachi
- mamoru-k_at_is.naist.jp
- 2008-06-04
2Background
- Success of supervised ML methods depends on an annotated corpus
- Hard to maintain
- Weakly supervised methods require only a small amount of tagged data
- Can reduce the amount of human effort
3Remaining problems
- WSD is crucial to weakly supervised methods
- Cf. semantic drift
- Parallel corpora (and dictionaries) may help disambiguate word senses
- WSD models in SMT systems have gained much attention (Carpuat and Wu, 2007)
4WSD and SMT
- Improving Statistical Machine Translation using Word Sense Disambiguation (Carpuat and Wu, EMNLP 2007)
- SMT is known to suffer from inaccurate lexical choice (based on Senseval-style sense inventory)
- Domain adaptation problem
- Input is typically a word
- Limited contextual features
5Phrase-based WSD models for SMT
- Sense annotations are derived from phrase alignments learned during SMT training
- WSD senses come from the SMT phrasal translation lexicon (phrase table)
- Not only words but also phrases are disambiguated
- Supervised WSD (an ensemble of naïve Bayes, maximum entropy, boosting, and Kernel PCA-based models)
6Phrase table is highly ambiguous
- Phrase table constructed from the NTCIR-7 J-E parallel corpus
- 0.51 GB (gzip-compressed)
- 2.53 candidates per phrase (3.24 candidates per phrase for phrases shorter than 5 words)
- Includes function words as well
Plant has 120 translations in the phrase table!
[Figure: sample Japanese translation candidates for "plant" from the phrase table]
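The ambiguity figures on this slide (candidates per phrase, optionally restricted to short phrases) can be computed with a few lines of code. The sketch below assumes a Moses-style phrase table in which fields are separated by `|||`; that file format, the function name `candidates_per_phrase`, and the toy entries are illustrative assumptions, not details given on the slides.

```python
from collections import defaultdict

def candidates_per_phrase(lines, max_len=None):
    """Average number of distinct target candidates per source phrase.

    Assumes (hypothetically) Moses-style lines: "source ||| target ||| scores".
    If max_len is given, only source phrases shorter than max_len words count.
    """
    targets = defaultdict(set)
    for line in lines:
        fields = [f.strip() for f in line.split("|||")]
        if len(fields) < 2:
            continue  # skip malformed lines
        src, tgt = fields[0], fields[1]
        if max_len is not None and len(src.split()) >= max_len:
            continue
        targets[src].add(tgt)
    if not targets:
        return 0.0
    return sum(len(t) for t in targets.values()) / len(targets)

# Toy phrase table (made-up entries for illustration only).
toy_table = [
    "plant ||| 工場 ||| 0.5",
    "plant ||| 植物 ||| 0.3",
    "power plant ||| 発電所 ||| 0.9",
]
avg_all = candidates_per_phrase(toy_table)
avg_single_word = candidates_per_phrase(toy_table, max_len=2)
```

On a real phrase table the file would be streamed line by line (for example through `gzip.open`) rather than held in a list.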
7Motivation
- Propose a novel graph-based approach to phrase sense disambiguation
- Can exploit bilingual contextual patterns
- Evaluate phrase sense disambiguation within an SMT framework
8Monolingual bootstrapping
- Pioneered by Yarowsky (1995)
- Learn decision lists from a small set of seed instances (input: instances I, output: classifier)
- One-sense-per-discourse constraint
9Bootstrapping
- Iteratively conduct pattern induction and instance extraction, starting from seed instances
- Can grow a small set of seed instances into a larger one
[Figure: bootstrapping over a query log (corpus)]
- Instances: vaio, Toshiba satellite, HP xb3000
- Contextual pattern: "Compare [slot] laptop"
- Matching queries: "Compare vaio laptop", "Compare toshiba satellite laptop", "Compare HP xb3000 laptop"
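The loop on this slide can be sketched in a few lines. This is a minimal illustration using the query-log example from the figure; the helper names `induce_patterns` and `extract_instances` and the `[slot]` marker are assumptions made for the sketch, and real systems also score and filter patterns, which is omitted here.

```python
import re

SLOT = "[slot]"  # hypothetical marker for the position an instance fills

def induce_patterns(queries, instances):
    """Pattern induction: replace a known instance in a query with a slot."""
    patterns = set()
    for query in queries:
        lowered = query.lower()
        for inst in instances:
            idx = lowered.find(inst.lower())
            if idx >= 0:
                patterns.add(query[:idx] + SLOT + query[idx + len(inst):])
    return patterns

def extract_instances(queries, patterns):
    """Instance extraction: collect whatever fills the slot of a pattern."""
    found = set()
    for pattern in patterns:
        prefix, _, suffix = pattern.partition(SLOT)
        regex = re.compile(
            re.escape(prefix) + r"(.+)" + re.escape(suffix) + r"$",
            re.IGNORECASE,
        )
        for query in queries:
            match = regex.match(query)
            if match:
                found.add(match.group(1).lower())
    return found

queries = [
    "Compare vaio laptop",
    "Compare toshiba satellite laptop",
    "Compare HP xb3000 laptop",
]
instances = {"vaio"}  # seed instance
for _ in range(2):  # two bootstrapping rounds
    patterns = induce_patterns(queries, instances)
    instances |= extract_instances(queries, patterns)
```

Starting from the single seed "vaio", the first round induces the pattern "Compare [slot] laptop", which in turn extracts the other two product names.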
10Bilingual bootstrapping
- Word Translation Disambiguation Using Bilingual Bootstrapping (Li and Li, ACL 2002)
[Figure: "plant" with the senses "mill" and "vegetable", their Japanese translations, a corpus, and the resulting WSD classifier]
11Formalization of bootstrapping
- Score vector of seed instances: $i_0$
- Pattern-instance matrix: $P$
- Iterate: $i_n = A^n i_0$, where $A = P^T P$
- Output ranked instances when the stopping criterion is met
- The $(p,i)$ element of $P$ relates pattern $p$ to instance $i$
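The iteration can be made concrete with a tiny matrix. The sketch below builds $A = P^T P$ from a made-up 2-pattern by 3-instance matrix and applies it repeatedly to a seed vector. The per-step normalization is an assumption added to keep the numbers bounded; it does not change the ranking. All names and values are illustrative.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]

def normalize(v):
    total = sum(v)
    return [x / total for x in v] if total else v

# Rows are patterns, columns are instances (toy co-occurrence counts).
P = [
    [1, 1, 0],
    [0, 1, 1],
]
A = matmul(transpose(P), P)   # instance-instance similarity, A = P^T P
scores = [1.0, 0.0, 0.0]      # i_0: instance 0 is the only seed
for _ in range(3):            # i_n is proportional to A^n i_0
    scores = normalize(matvec(A, scores))

# Instance indices, best first.
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
```

In this toy graph the well-connected instance 1 overtakes the seed after a few iterations, which is the kind of behaviour usually discussed under the name semantic drift.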
12Bilingual phrase sense disambiguation
- The (p,i) element of the pattern-instance matrix P is the co-occurrence between pattern p and instance i
- p: contextual features from both language sides, with phrase alignment from GIZA++
- i: candidate (monolingual) phrase to disambiguate
- $A = P^T P$
- Similarity is given by the regularized Laplacian kernel
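One way to picture the bilingual pattern-instance matrix is as a sparse mapping from each instance to counts of language-tagged context features. The sketch below is only an illustration: the occurrence tuples, the `en:`/`ja:` prefixes, and the function names are assumptions, and the slides do not specify which contextual features are actually used.

```python
from collections import Counter, defaultdict

# Hypothetical toy data: (instance id, English context words, Japanese context words).
occurrences = [
    ("plant#1", ["power", "nuclear"], ["発電", "原子力"]),
    ("plant#2", ["power", "station"], ["発電"]),
    ("plant#3", ["grow", "seed"], ["栽培"]),
]

def build_columns(occurrences):
    """Return {instance: Counter of pattern features}, i.e. the columns of P."""
    columns = defaultdict(Counter)
    for inst, en_context, ja_context in occurrences:
        for word in en_context:
            columns[inst]["en:" + word] += 1
        for word in ja_context:
            columns[inst]["ja:" + word] += 1
    return columns

def similarity(columns, a, b):
    """Entry (a, b) of A = P^T P: dot product of two instance columns."""
    return sum(count * columns[b][feat] for feat, count in columns[a].items())

columns = build_columns(occurrences)
sim_same_sense = similarity(columns, "plant#1", "plant#2")
sim_other_sense = similarity(columns, "plant#1", "plant#3")
```

Here the two "power plant" occurrences share one English and one Japanese feature, while the gardening occurrence shares none, so features from either language side can contribute to separating the senses.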
13Regularized Laplacian kernel
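- Predict the final sense by k-NN given the target instance
- $L$: Laplacian of graph $G$
- $A$: adjacency matrix; $\beta$: diffusion factor
- $D$: diagonal degree matrix of $A$
- $R_\beta$: regularized Laplacian kernel, defined with $-L$ in place of $A$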
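A small worked example may help here. The sketch assumes the textbook form of the regularized Laplacian kernel, $L = D - A$ and $R_\beta = \sum_{n \ge 0} \beta^n (-L)^n = (I + \beta L)^{-1}$; some formulations additionally multiply by $A$, and the slides do not settle which variant is meant. The graph, labels, and function names are made up for illustration.

```python
from collections import Counter

def laplacian(A):
    """Graph Laplacian L = D - A, with D the diagonal matrix of row sums."""
    n = len(A)
    return [[(sum(A[i]) if i == j else 0.0) - A[i][j] for j in range(n)]
            for i in range(n)]

def inverse(M):
    """Matrix inverse by Gauss-Jordan elimination with partial pivoting."""
    n = len(M)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def regularized_laplacian(A, beta):
    """R_beta = (I + beta * L)^-1 (assumed textbook form, see lead-in)."""
    L = laplacian(A)
    n = len(A)
    M = [[(1.0 if i == j else 0.0) + beta * L[i][j] for j in range(n)]
         for i in range(n)]
    return inverse(M)

def knn_sense(R, target, labels, k=1):
    """Majority sense among the k labeled instances most similar to target."""
    nearest = sorted(labels, key=lambda j: R[target][j], reverse=True)[:k]
    return Counter(labels[j] for j in nearest).most_common(1)[0][0]

# Toy chain graph 0 - 1 - 2 - 3 with edge weights 2, 1, 2.
A = [
    [0, 2, 0, 0],
    [2, 0, 1, 0],
    [0, 1, 0, 2],
    [0, 0, 2, 0],
]
labels = {0: "factory", 3: "vegetation"}  # hypothetical labeled instances
R = regularized_laplacian(A, beta=0.5)
predicted = knn_sense(R, target=1, labels=labels, k=1)
```

Instance 1 is strongly linked to the "factory" instance and only indirectly to the "vegetation" instance, so the kernel value to the former is larger and 1-NN picks "factory".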
14NTCIR-7 Patent Translation Task
- Large-scale Japanese-English parallel corpus
- 2M sentences (comparable to A-E, C-E MT)
- Mainly technical documents
- Timeline
- 2008.01 dry run
- 2008.05 formal run
- 2008.12 final meeting
15NAIST-NTT at NTCIR-7
- Bilingual dictionary extracted (solely) from Wikipedia
- Used langlinks from the Wikipedia DB
- 1:n translations are expanded to n bilingual phrase pairs (en, ja)
- Extracted 200,000 pairs
- 12,000 pairs appear in the training corpus (8.8)
- 44.7% of words (tokens) are covered by the automatically constructed bilingual lexicon (GIZA++)
- Learned 1,193 (0.6%) novel translations
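The expansion and filtering steps on this slide can be sketched as follows. The input format (a mapping from an English title to its Japanese counterparts), the substring-based matching, and the function names are assumptions made for the illustration; the slides do not describe how the langlinks table was actually processed.

```python
def expand_pairs(langlinks):
    """Expand {en: [ja_1, ..., ja_n]} into n separate (en, ja) pairs per entry."""
    return [(en, ja) for en, ja_list in langlinks.items() for ja in ja_list]

def pairs_in_corpus(pairs, parallel_sentences):
    """Keep pairs whose two sides both occur in the same aligned sentence pair.

    Uses naive substring matching; a real system would match on tokens.
    """
    return [
        (en, ja)
        for en, ja in pairs
        if any(en.lower() in e.lower() and ja in j for e, j in parallel_sentences)
    ]

# Toy dictionary and corpus (made-up entries for illustration only).
langlinks = {
    "plant": ["植物", "工場"],
    "semiconductor": ["半導体"],
}
corpus = [
    ("The plant produces semiconductor devices.", "その工場は半導体素子を製造する。"),
]
pairs = expand_pairs(langlinks)
attested = pairs_in_corpus(pairs, corpus)
```

In the toy corpus only the "factory" reading of "plant" is attested, which shows how the in-corpus filter can also act as a crude domain filter.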
16Proposed method (but not yet finished training)
- Extract translation pairs relevant to the given domain (patent translation task)
- Construct a pattern-instance matrix P
- Pattern: bag-of-words features and link features extracted from Wikipedia ja-en abstracts
- Instance: translation pair (en, ja)
- Seed instances: 40 translation pairs from the target domain
- Apply the Laplacian kernel
17Evaluation (BLEU score)
- E-J translation (en) gets worse with the Wikipedia dictionary
- J-E translation (ja) performs slightly better with the Wikipedia dictionary than without it
18Future work
- Implement bilingual phrase sense disambiguation
- Evaluate this method on the IWSLT 2006 J-E/E-J and NTCIR-7 J-E/E-J datasets
19Future work(2)
- Automatic extraction of a biomedical lexicon, starting from a life-science dictionary (mining from MEDLINE, etc.)
- Summarization (Harendra's work)