1
Graph-based Bilingual Phrase Sense Disambiguation
for Statistical Machine Translation
  • Mamoru Komachi
  • mamoru-k@is.naist.jp
  • 2008-06-04

2
Background
  • Success of supervised ML methods depends on an
    annotated corpus
  • Annotated corpora are hard to maintain
  • Weakly supervised methods require only a small
    amount of tagged data
  • Can reduce the amount of human effort

3
Remaining problems
  • WSD is crucial for weakly supervised methods
  • Cf. semantic drift
  • Parallel corpora (and dictionaries) may help
    disambiguate word senses
  • WSD models in SMT systems have gained much
    attention (Carpuat and Wu, 2007)

4
WSD and SMT
  • Improving Statistical Machine Translation using
    Word Sense Disambiguation (Carpuat and Wu, EMNLP
    2007)
  • SMT is known to suffer from inaccurate lexical
    choice (based on a Senseval-style sense inventory)
  • Domain adaptation problem
  • Input is typically a word
  • Limited contextual features

5
Phrase-based WSD models for SMT
  • Sense annotations are derived from phrase
    alignment learned during SMT training
  • WSD senses come from the SMT phrasal translation
    lexicon (the phrase table)
  • Not only words but also phrases are to be
    disambiguated
  • Supervised WSD (an ensemble of naïve Bayes,
    maximum entropy, boosting, and kernel PCA models)

6
Phrase table is highly ambiguous
  • Phrase table constructed from NTCIR-7 J-E
    parallel corpus
  • 0.51 GB (gzip-compressed)
  • 2.53 candidates per phrase (3.24 candidates per
    phrase for phrases shorter than 5 words)
  • Includes function words as well

Plant has 120 translations in the phrase table!
[Slide example: the English word plant shown alongside several of its
Japanese translation candidates from the phrase table]
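These ambiguity figures can be reproduced by counting the distinct target
candidates per source phrase. The sketch below is a minimal illustration,
assuming a Moses-style phrase table ("source ||| target ||| scores")
stored gzip-compressed; the file name in the usage comment is hypothetical.

```python
import gzip
from collections import defaultdict

def avg_candidates_per_phrase(path, max_len=None):
    """Average number of distinct target candidates per source phrase.

    Assumes a Moses-style phrase table: "source ||| target ||| scores".
    If max_len is given, only source phrases with fewer than max_len
    words are counted.
    """
    candidates = defaultdict(set)
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            fields = line.split(" ||| ")
            if len(fields) < 2:
                continue
            src, tgt = fields[0].strip(), fields[1].strip()
            if max_len is not None and len(src.split()) >= max_len:
                continue
            candidates[src].add(tgt)
    return sum(len(t) for t in candidates.values()) / len(candidates)

# Hypothetical usage:
# avg_candidates_per_phrase("phrase-table.gz")             # all phrases
# avg_candidates_per_phrase("phrase-table.gz", max_len=5)  # short phrases only
```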
7
Motivation
  • Propose a novel graph-based approach to phrase
    sense disambiguation
  • Can exploit bilingual contextual patterns
  • Evaluate phrase sense disambiguation on SMT
    framework

8
Monolingual bootstrapping
  • Pioneered by (Yarowsky, 1995)
  • Learn decision lists from a small set of seed
    instances (input: seed instances; output: a
    classifier)
  • One sense per discourse constraint
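A minimal sketch of the decision-list idea, assuming labeled examples are
given as (features, sense) pairs: each contextual feature is scored by a
smoothed log-likelihood ratio, and the highest-scoring matching rule
decides the sense. The smoothing constant and the data layout are
illustrative assumptions, not Yarowsky's exact formulation.

```python
import math
from collections import Counter

def build_decision_list(labeled, alpha=0.1):
    """Rank (feature, sense) rules by smoothed log-likelihood ratio.

    labeled: iterable of (features, sense) pairs, where features is a
    collection of contextual clues (e.g. neighbouring words).
    """
    counts = Counter()          # (feature, sense) -> count
    feature_totals = Counter()  # feature -> count
    senses = set()
    for features, sense in labeled:
        senses.add(sense)
        for f in features:
            counts[(f, sense)] += 1
            feature_totals[f] += 1
    rules = []
    for f in feature_totals:
        for s in senses:
            favour = counts[(f, s)] + alpha
            against = feature_totals[f] - counts[(f, s)] + alpha
            rules.append((f, s, math.log(favour / against)))
    return sorted(rules, key=lambda rule: -rule[2])
```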

9
Bootstrapping
  • Iteratively conduct pattern induction and
    instance extraction starting from seed instances
  • Can grow a small set of seed instances into a
    much larger set

[Diagram: bootstrapping over a query log (corpus). The seed instance
vaio occurs in the query compare vaio laptop, which yields the
contextual pattern compare X laptop (X marks the instance slot);
matching that pattern against queries such as compare toshiba
satellite laptop and compare HP xb3000 laptop extracts the new
instances Toshiba satellite and HP xb3000.]
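The following is a minimal sketch of this pattern/instance alternation,
assuming the corpus is pre-segmented into (left context, instance, right
context) triples; the scoring (raw co-occurrence counts with a fixed
top-k cut-off) is a simplification, not the exact induction procedure.

```python
from collections import Counter

def bootstrap(corpus, seeds, iterations=3, top_k=5):
    """Alternate pattern induction and instance extraction (simplified).

    corpus: list of (left_context, instance, right_context) triples.
    A pattern is the (left_context, right_context) pair, i.e. the query
    with the instance slot masked out.
    """
    instances, patterns = set(seeds), set()
    for _ in range(iterations):
        # Pattern induction: contexts that co-occur with known instances.
        pattern_scores = Counter(
            (l, r) for l, i, r in corpus if i in instances)
        patterns |= {p for p, _ in pattern_scores.most_common(top_k)}
        # Instance extraction: slot fillers seen with known patterns.
        instance_scores = Counter(
            i for l, i, r in corpus
            if (l, r) in patterns and i not in instances)
        instances |= {i for i, _ in instance_scores.most_common(top_k)}
    return instances, patterns

# Hypothetical usage, mirroring the query-log example above:
# corpus = [("compare", "vaio", "laptop"),
#           ("compare", "toshiba satellite", "laptop"),
#           ("compare", "HP xb3000", "laptop")]
# bootstrap(corpus, seeds={"vaio"})
```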
10
Bilingual bootstrapping
  • Word Translation Disambiguation Using Bilingual
    Bootstrapping (Li and Li, ACL-2002)

[Diagram: bilingual bootstrapping for the ambiguous word plant, whose
English sense indicators (mill, vegetable) are linked to Japanese
translation candidates through corpora in both languages and the WSD
classifier trained on each language side.]
11
Formalization of bootstrapping
  • Score vector of seed instance
  • Pattern-instance matrix P
  • Iterate
  • Output ranked instances when stopping criterion
    met

The (p, i) element of the pattern-instance matrix P is the
co-occurrence of pattern p and instance i. Starting from the seed
score vector i_0, iterate i_n = A^n i_0 with A = P^T P, and output
the instances ranked by their scores.
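As an illustration of this formalization, here is a minimal NumPy sketch,
assuming P is a dense pattern-instance count matrix and the seed vector
has 1 for seed instances and 0 elsewhere; the normalisation step and the
fixed number of iterations stand in for the unstated stopping criterion.

```python
import numpy as np

def rank_instances(P, seed_indices, n_steps=10):
    """Rank instances by iterating the seed score vector through A = P^T P.

    P: (num_patterns, num_instances) pattern-instance co-occurrence matrix.
    seed_indices: indices of the seed instances.
    Returns instance indices sorted from highest to lowest score.
    """
    A = P.T @ P
    i = np.zeros(P.shape[1])
    i[list(seed_indices)] = 1.0
    for _ in range(n_steps):
        i = A @ i
        i /= np.linalg.norm(i)   # keep the scores from overflowing
    return np.argsort(-i)
```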
12
Bilingual phrase sense disambiguation
  • The (p, i) element of a pattern-instance matrix P
    is a co-occurrence between pattern p and instance i
  • p: contextual features from both language sides,
    with phrase alignment from GIZA
  • i: a candidate (monolingual) phrase to disambiguate
  • A = P^T P
  • Similarity is given by the regularized Laplacian
    kernel
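A minimal sketch of how such a bilingual pattern-instance matrix could be
assembled, assuming the aligned training data has already been reduced to
(candidate phrase, bilingual context features) pairs; the data layout and
feature extraction are assumptions, not the system's actual pipeline.

```python
import numpy as np

def build_pattern_instance_matrix(examples):
    """Build P from (instance, features) pairs.

    examples: list of (instance, features), where instance is a candidate
    translation phrase and features are context words drawn from both the
    source and target sides of an aligned sentence pair.
    Returns (P, patterns, instances) with P of shape
    (num_patterns, num_instances).
    """
    instances = sorted({inst for inst, _ in examples})
    patterns = sorted({f for _, feats in examples for f in feats})
    inst_id = {x: k for k, x in enumerate(instances)}
    pat_id = {x: k for k, x in enumerate(patterns)}
    P = np.zeros((len(patterns), len(instances)))
    for inst, feats in examples:
        for f in feats:
            P[pat_id[f], inst_id[inst]] += 1.0
    return P, patterns, instances
```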

13
Regularized Laplacian kernel
  • Predict final sense by k-NN given the target
    instance

Given a graph G with adjacency matrix A, the graph Laplacian is
L = D - A, where D is the diagonal degree matrix with D_ii = Σ_j A_ij.
The regularized Laplacian kernel is
R_β = Σ_{n=0}^∞ β^n (-L)^n = (I + βL)^{-1}, with diffusion parameter
β ≥ 0, i.e. the power series over A with A replaced by -L.
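A minimal sketch combining the kernel with the k-NN prediction step,
assuming A is a dense adjacency matrix over instances and the labeled set
is given as (instance index, sense) pairs; the values of β and k here are
chosen for illustration only.

```python
import numpy as np
from collections import Counter

def regularized_laplacian(A, beta=0.01):
    """R_beta = (I + beta * L)^{-1}, with L = D - A the graph Laplacian."""
    D = np.diag(A.sum(axis=1))
    L = D - A
    return np.linalg.inv(np.eye(A.shape[0]) + beta * L)

def knn_sense(R, target, labeled, k=3):
    """Predict the sense of instance `target` by majority vote among the
    k labeled instances most similar to it under the kernel R."""
    nearest = sorted(labeled, key=lambda pair: -R[target, pair[0]])[:k]
    return Counter(sense for _, sense in nearest).most_common(1)[0][0]
```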
14
NTCIR-7 Patent Translation Task
  • Large-scale Japanese-English parallel corpus
  • 2M sentence pairs (comparable in scale to
    Arabic-English and Chinese-English MT corpora)
  • Mainly technical documents
  • Timeline
  • 2008.01 dry run
  • 2008.05 formal run
  • 2008.12 final meeting

15
NAIST-NTT at NTCIR-7
  • Bilingual dictionary extracted (solely) from
    Wikipedia
  • Used langlinks from the Wikipedia DB
  • 1-to-n translations are expanded to n bilingual
    phrase pairs (en, ja)
  • Extracted 200,000 pairs
  • 12,000 pairs appear in the training corpus (8.8%)
  • 44.7% of word tokens are covered by the
    automatically constructed bilingual lexicon (GIZA)
  • Learned 1,193 (0.6%) novel translations
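A minimal sketch of the langlink expansion and coverage check, assuming
the langlinks have already been dumped as (English title, list of Japanese
titles) pairs; the underscore-to-space normalisation and the coverage
measure are illustrative assumptions, not the exact extraction procedure.

```python
def expand_langlinks(langlinks):
    """Expand 1-to-n Wikipedia langlinks into (en, ja) phrase pairs."""
    pairs = []
    for en_title, ja_titles in langlinks:
        for ja_title in ja_titles:
            # Wikipedia titles use underscores where phrases use spaces.
            pairs.append((en_title.replace("_", " "), ja_title))
    return pairs

def corpus_coverage(pairs, corpus_phrases):
    """Fraction of extracted English phrases found in the training corpus."""
    hits = sum(1 for en, _ in pairs if en in corpus_phrases)
    return hits / len(pairs) if pairs else 0.0
```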

16
Proposed method (training not yet finished)
  • Extract translation pairs relevant to the given
    domain (patent translation task)
  • Construct a pattern-instance matrix P
  • Pattern features: bag-of-words and link features
    extracted from Wikipedia ja-en abstracts
  • Instance: a translation pair (en, ja)
  • Seed instances: 40 translation pairs from the
    target domain
  • Apply the regularized Laplacian kernel

17
Evaluation (BLEU score)
  • E-J translation (en) gets worse with the Wikipedia
    dictionary
  • J-E translation (ja) gives slightly better
    performance with the Wikipedia dictionary than
    without

18
Future work
  • Implement bilingual phrase sense disambiguation
  • Evaluate this method against IWSLT 2006 J-E/E-J
    and NTCIR-7 J-E/E-J datasets

19
Future work(2)
  • Automatic extraction of a biomedical lexicon,
    starting from a life-science dictionary (mining
    from MEDLINE, etc.)
  • Summarization (Harendra's work)