A Discriminative Framework for Bilingual Word Alignment

Transcript and Presenter's Notes

1
A Discriminative Framework for Bilingual Word
Alignment
  • Robert C. Moore
  • Natural Language Processing Group
  • Microsoft Research

2
Micro-tutorial on current statistical MT
  • Word align a parallel bilingual corpus
  • Extract translation pairs of contiguous phrases
  • Optimize a weighted linear combination of
    • Log of phrase-translation probabilities
    • Log of word-translation probabilities
    • Degree of nonmonotonicity of the translation
    • Log of n-gram target-language string
      probability
    • Number of target-language words
    • Other features
  • Find the highest-scoring possible translation
    for source-language sentences (the objective is
    written out below)
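
In symbols, the decoder's objective is the standard
log-linear model; the slide does not spell it out, so
this is a generic form, with h_i the feature functions
listed above and lambda_i their tuned weights:

    \hat{t} = \arg\max_t \sum_i \lambda_i h_i(s, t)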

3
Example of word alignment and phrase extraction
  • I don't speak French
  • Je ne parle pas français
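
The alignment links (shown graphically on the slide)
pair I with Je, don't with ne and pas, speak with
parle, and French with français. Contiguous phrase
pairs consistent with that alignment would include,
for example:

    (Je, I)                    (parle, speak)
    (français, French)         (ne parle pas, don't speak)
    (Je ne parle pas, I don't speak)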

5
Standard approach to word alignment
  • Generative models, maximum-likelihood training
    using approximations to EM
    • IBM Models 1-5
    • Aachen HMM-based model
  • Problems
    • Free parameters not trained by EM are difficult
      to optimize
    • A generative story is required to add new
      features

6
Our approach based on discriminative training
  • A weighted linear model
  • Applied using a beam-search aligner
  • Trained with the averaged perceptron algorithm
    on a small number (200) of hand-aligned sentence
    pairs (a sketch of the update follows below)
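
A minimal sketch of averaged-perceptron training for
a linear alignment scorer, assuming a feature
extractor phi(pair, alignment) that returns a feature
dict and a best_alignment search routine; both are
hypothetical stand-ins for the beam-search aligner in
this talk:

    from collections import defaultdict

    def averaged_perceptron(data, phi, best_alignment, epochs=10):
        """Averaged-perceptron training for a linear alignment model.
        data: list of (sentence_pair, gold_alignment) pairs.
        phi: feature extractor -> {feature_name: value}.
        best_alignment: search for the best-scoring alignment
        under the current weights."""
        weights = defaultdict(float)   # current weight vector
        totals = defaultdict(float)    # running sums for averaging
        steps = 0
        for _ in range(epochs):
            for pair, gold in data:
                guess = best_alignment(pair, weights)
                if guess != gold:
                    # standard perceptron update: promote the gold
                    # alignment's features, demote the guess's
                    for f, v in phi(pair, gold).items():
                        weights[f] += v
                    for f, v in phi(pair, guess).items():
                        weights[f] -= v
                for f, v in weights.items():
                    totals[f] += v
                steps += 1
        # averaging the weight vectors over all updates reduces
        # overfitting on the small hand-aligned training set
        return {f: v / steps for f, v in totals.items()}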

7
Technical advantages
  • Discriminative training is generally superior to
    maximum-likelihood training
  • Conceptually very simple: easy to understand and
    implement
  • Easy to add new features, without having to
    invent a generative story
  • Fast to train with the averaged perceptron

8
Two models
  • First model based on log-likelihood-ratio (LLR)
    word-association statistics
  • Two versions of a second model based on the
    conditional probability of a link cluster, given
    co-occurrence, trained on alignments produced by
    a simpler model:
    • LLR-based discriminative model
    • LLR-based greedy heuristic model

9
Log-likelihood-ratio measure of word association
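
The standard form of this measure (Dunning 1993, as
used in Moore 2004), for a source word f and target
word e, is:

    LLR(f, e) = \sum_{f' \in \{f, \neg f\}}
                \sum_{e' \in \{e, \neg e\}}
                C(f', e') \log \frac{p(f' \mid e')}{p(f')}

where C(f', e') counts aligned sentence pairs
exhibiting the indicated combination of events (e.g.,
f present, e absent), and the probabilities are
maximum-likelihood estimates from those counts.
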
10
Features for LLR-based model
  • Sum of LLR scores for linked word pairs
  • Two nonmonotonicity features (sketched below),
    computed by ordering linked word pairs by source
    token position, then by target token position,
    and
    • Summing backward jumps in target position
    • Counting backward jumps in target position
  • One-to-many feature counting the number of links
    in which one of the linked words also
    participates in another link
  • Unlinked-word feature counting the number of
    words with no link
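
A minimal sketch of the two nonmonotonicity features,
assuming an alignment is a collection of
(source_pos, target_pos) links (the names are mine,
not the paper's):

    def nonmonotonicity(links):
        """links: iterable of (source_pos, target_pos) pairs.
        Orders links by source, then target, position and
        returns (sum of backward jumps, count of backward
        jumps) in the target positions."""
        ordered = sorted(links)
        jump_sum = jump_count = 0
        for (_, prev_t), (_, cur_t) in zip(ordered, ordered[1:]):
            if cur_t < prev_t:   # target position moved backward
                jump_sum += prev_t - cur_t
                jump_count += 1
        return jump_sum, jump_count

For the slide-3 example, with 0-based positions, the
links are (0,0), (1,1), (1,3), (2,2), (3,4); the
target sequence 0, 1, 3, 2, 4 contains one backward
jump of size 1, so both features equal 1.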

11
Hard constraints in LLR-based model
  • No many-to-many links
  • No more than three words linked to one word
  • A link is allowed only if it has the highest LLR
    score in the sentence pair for at least one of
    the two words it connects (sketched below)
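
A minimal sketch of that last constraint, assuming a
table llr[(i, j)] holding an association score for
every source-target word pair in the sentence pair
(the table and function names are mine, not the
paper's):

    def candidate_links(n_source, n_target, llr):
        """Keep (i, j) only if it is the highest-LLR link
        for source word i or for target word j."""
        keep = []
        for i in range(n_source):
            for j in range(n_target):
                best_i = max(llr[(i, k)] for k in range(n_target))
                best_j = max(llr[(k, j)] for k in range(n_source))
                if llr[(i, j)] == best_i or llr[(i, j)] == best_j:
                    keep.append((i, j))
        return keep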

12
Features for CLP-based model
  • Sum of logs of discounted estimates of
    conditional link cluster probabilities (the
    estimate is sketched below)
  • Nonmonotonicity features (as in the LLR-based
    model)
  • Unlinked-word feature
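
The slide does not spell out the estimate; a
plausible absolute-discounting form, writing links(c)
for the number of times cluster c is linked in the
training alignments, cooc(c) for the number of
sentence pairs in which its words co-occur, and d for
a fixed discount (the notation is mine), is:

    CLP_d(c) = \frac{links(c) - d}{cooc(c)}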

13
Novel alignment search
  • Greedy search used by Liu, Liu, and Lin (sketched
    below)
    • Start with the empty alignment
    • Until the improvement obtained < threshold
      • Estimate how much each remaining possible
        link would improve the alignment
      • Add the estimated best link
    • Performs a large number of alignment
      evaluations
  • Our alignment search performs far fewer
    alignment evaluations
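
A minimal sketch of that greedy loop, assuming a
score(alignment) function for whatever model is being
searched (the helper names are mine):

    def greedy_align(possible_links, score, threshold=0.0):
        """Greedy search in the style of Liu, Liu, and Lin:
        keep adding the single link whose addition most
        improves the alignment score, until the best
        improvement falls below the threshold."""
        alignment = set()
        remaining = set(possible_links)
        current = score(alignment)
        while remaining:
            # score every remaining link: this inner loop is
            # why the greedy search needs many evaluations
            best_link, best_gain = None, 0.0
            for link in remaining:
                gain = score(alignment | {link}) - current
                if gain > best_gain:
                    best_link, best_gain = link, gain
            if best_link is None or best_gain < threshold:
                break
            alignment.add(best_link)
            remaining.remove(best_link)
            current += best_gain
        return alignment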

14
Alignment search for LLR-based model
  • Initialize existing alignments to contain the
    empty alignment and its score
  • For each possible link L, in decreasing order of
    LLR score:
    • Initialize recent alignments to be empty
    • For each existing alignment A:
      • Create a new alignment adding L to A
      • For each link L' overlapping with L:
        • Create a new alignment adding L to A and
          removing L'
    • For each new alignment A':
      • If A' meets the hard constraints and is not
        in recent alignments, compute its score,
        and if it exceeds the threshold, add it to
        recent alignments
    • Add recent alignments to existing alignments,
      sort by score, and prune to the N best (see
      the code sketch below)
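
A compact sketch of this beam search, assuming
score(alignment) and meets_constraints(alignment)
functions and links given as (i, j) tuples sorted by
decreasing LLR (the helper names and defaults are
mine, not the paper's):

    def beam_align(links_by_llr, score, meets_constraints,
                   beam_size=20, threshold=float("-inf")):
        """Beam search over alignments (frozensets of (i, j)
        links), mirroring the pseudocode on this slide."""
        existing = {frozenset(): score(frozenset())}
        for link in links_by_llr:
            recent = {}
            for a in existing:
                candidates = [a | {link}]
                # also try swapping out each existing link that
                # overlaps (shares a word with) the new link
                for other in a:
                    if other[0] == link[0] or other[1] == link[1]:
                        candidates.append((a | {link}) - {other})
                for new_a in candidates:
                    if new_a in recent or not meets_constraints(new_a):
                        continue
                    s = score(new_a)
                    if s > threshold:
                        recent[new_a] = s
            existing.update(recent)
            # sort by score and prune to the N best alignments
            kept = sorted(existing.items(), key=lambda kv: -kv[1])
            existing = dict(kept[:beam_size])
        # return the best-scoring alignment found
        return max(existing, key=existing.get)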

16
Alignment search for CLP-based model
  • Initialize existing alignments to contain the
    empty alignment and its score
  • For each possible link cluster L, in decreasing
    order of log conditional link probability:
    • Initialize recent alignments to be empty
    • For each existing alignment A:
      • Create a new alignment adding L to A and
        removing any overlapping link clusters
      • If the new alignment is not in recent
        alignments, compute its score, and if it
        exceeds the threshold, add it to recent
        alignments
    • Add recent alignments to existing alignments,
      sort by score, and prune to the N best

17
Evaluation methodology
  • Used 500K English-French (mostly unannotated)
    sentence pairs from the 2003 parallel text
    workshop
  • 447 annotated sentence pairs, evenly split into
    a training set and a test set
  • Evaluated on recall, precision, and alignment
    error rate (AER, defined below)
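
AER is not defined on the slide; the standard
definition (Och and Ney 2003), with A the
hypothesized links, S the sure gold links, and P the
possible gold links (S a subset of P), is:

    AER = 1 - \frac{|A \cap S| + |A \cap P|}{|A| + |S|}

When the annotation marks only sure links (S = P),
this reduces to one minus the balanced F-measure of
precision and recall.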

18
Evaluation results
19
Conclusions
  • Discriminatively trained linear models for
    bilingual word alignment can be
    • Simpler to implement than the standard
      approach
    • Easier to add features to than the standard
      approach
    • Easier to optimize than the standard approach
    • At least as accurate as the standard approach