Semi-Supervised Boosting for Statistical Word Alignment

1
Semi-Supervised Boosting for Statistical Word Alignment
  • Wu Hua
  • 2006/10/18

2
Outline
  • Introduction to semi-supervised learning
  • Introduction to boosting
  • Semi-supervised boosting for word alignment
  • Evaluation results
  • Conclusion

3
Machine Learning Methods
  • Supervised learning
    • Labeled data
  • Unsupervised learning
    • Unlabeled data
  • Semi-supervised learning
    • Combine both labeled data and unlabeled data

4
Semi-Supervised Learning in NLP
  • Word sense disambiguation
    • (Yarowsky, 1995; Pham et al., 2005)
  • Classification
    • (Blum and Mitchell, 1998; Thorsten, 1999)
  • Clustering
    • (Basu et al., 2004)
  • Named entity classification
    • (Collins and Singer, 1999)
  • Parsing
    • (Sarkar, 2001)

5
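Boosting Supervised Learning
[Flowchart; box labels in reading order: Initialization, Supervised Learning, Call Learner, Calculate Error Rate, Re-weight Training Data, Build Ensemble. One branch is labeled "Yes"; the decision condition is not preserved in this transcript.]
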
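The loop named in the flowchart (initialize weights, call the learner, compute the error rate, re-weight the training data, build the ensemble) can be sketched generically. The block below is a minimal AdaBoost-style illustration on toy 1-D data, not the procedure from the talk: the threshold-stump learner, the stopping rule (`err >= 0.5`), and the names `train_stump`, `boost`, and `predict` are all assumptions made for the example.

```python
import math

def train_stump(xs, ys, weights):
    """Weak learner (assumed for illustration): best threshold/sign by weighted error."""
    best = None
    for thr in sorted(set(xs)):
        for sign in (1, -1):
            preds = [sign if x >= thr else -sign for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, thr, sign)
    return best  # (weighted error, threshold, sign)

def boost(xs, ys, rounds=5):
    n = len(xs)
    weights = [1.0 / n] * n                       # Initialization: uniform weights
    ensemble = []
    for _ in range(rounds):
        err, thr, sign = train_stump(xs, ys, weights)   # Call Learner
        if err >= 0.5:                            # Calculate Error Rate; stop if no better than chance
            break
        err = max(err, 1e-10)                     # avoid division by zero for a perfect learner
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, thr, sign))
        # Re-weight Training Data: misclassified examples gain weight
        weights = [
            w * math.exp(-alpha * y * (sign if x >= thr else -sign))
            for w, x, y in zip(weights, xs, ys)
        ]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble                               # Build Ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(a * (s if x >= t else -s) for a, t, s in ensemble)
    return 1 if score >= 0 else -1
```

Calling `boost([1, 2, 3, 4, 5, 6], [-1, -1, -1, 1, 1, 1])` returns a list of `(alpha, threshold, sign)` triples that `predict` combines by weighted vote.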
6
Boosting in NLP
  • Tagging and PP attachment
    • (Abney et al., 1999)
  • Word sense disambiguation
    • (Escudero et al., 2000)
  • Parser construction
    • (Haruno et al., 1999; Henderson and Brill, 2000)
  • Sentence generation
    • (Walker et al., 2001)

7
Semi-Supervised Boosting
  • Three main problems
    • Semi-supervised learner
      • Combine labeled data and unlabeled data
    • Reference set
      • Automatically construct a reference set for unlabeled data
    • Error rate calculation
      • How to calculate the error rate with both labeled data and unlabeled data

8
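Semi-Supervised Boosting Applied to Word Alignment
[Flowchart; box labels in reading order: Labeled Data, Unlabeled Data, Supervised Training, Unsupervised Training, Model Interpolation, Real Reference Set, Error Rate Calculation, Pseudo Reference Set, Re-weight Training Data, Build Ensemble. One branch is labeled "Yes"; the decision condition is not preserved in this transcript.]
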
9
Semi-Supervised Boosting Applied to Word Alignment
  • Five main components
    • Word alignment model interpolation
    • Pseudo reference set construction for unlabeled data
    • Error rate calculation
    • Weight update
    • Final ensemble

10
Word Alignment Model
  • Supervised alignment model
    • Calculate the probabilities for IBM Model 4 based on the labeled data
  • Unsupervised alignment model
    • Use GIZA++ to train IBM Model 4
  • Perform model interpolation
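One common way to combine a supervised and an unsupervised model is linear interpolation of their probability tables. The sketch below assumes that form; the slide does not state the interpolation formula or how the weight is chosen, so `interpolate`, the weight `lam`, and the toy translation table are hypothetical. IBM Model 4 has several parameter tables (translation, fertility, distortion); only one generic table is shown.

```python
def interpolate(p_supervised, p_unsupervised, lam=0.5):
    """Linearly interpolate two probability tables keyed by (source, target).

    lam is the weight on the supervised model (an assumed free parameter,
    which could be tuned on a held-out set). Entries missing from one
    table are treated as probability 0.0.
    """
    keys = set(p_supervised) | set(p_unsupervised)
    return {
        k: lam * p_supervised.get(k, 0.0) + (1.0 - lam) * p_unsupervised.get(k, 0.0)
        for k in keys
    }

# Toy translation probabilities for one source word (made-up values).
p_sup = {("house", "fangzi"): 0.9, ("house", "jia"): 0.1}
p_unsup = {("house", "fangzi"): 0.6, ("house", "jia"): 0.3, ("house", "wu"): 0.1}
p_mix = interpolate(p_sup, p_unsup, lam=0.5)
```

Because both inputs sum to one over the targets of `house` and the weights `lam` and `1 - lam` sum to one, the interpolated table is again a proper distribution.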

11
Pseudo Reference Set Construction
  • Obtain bi-directional word alignment sets S1 and S2 on the training data
  • Obtain the intersection set of these two alignment sets
  • Filter the union set of the two alignment sets
  • Build the pseudo reference set
[The equation and the definitions that followed "where" are not preserved in this transcript.]

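The construction steps listed on this slide can be sketched with set operations. The filtering criterion is not preserved in the transcript, so the block below assumes a hypothetical per-link confidence function `link_score` and cutoff `threshold`; links are represented as `(source_index, target_index)` tuples, which is also an assumption.

```python
def build_pseudo_reference(s1, s2, link_score, threshold):
    """Combine two directional alignment sets into a pseudo reference set.

    s1, s2: sets of (source_index, target_index) links from the two directions.
    link_score: hypothetical function returning a confidence for a link.
    threshold: hypothetical cutoff for keeping links outside the intersection.
    """
    intersection = s1 & s2                       # links both directions agree on
    union_only = (s1 | s2) - intersection        # links proposed by only one direction
    kept = {link for link in union_only if link_score(link) >= threshold}
    return intersection | kept

s1 = {(0, 0), (1, 1), (2, 3)}
s2 = {(0, 0), (1, 1), (2, 2)}
scores = {(2, 3): 0.9, (2, 2): 0.2}              # made-up confidences
pseudo_ref = build_pseudo_reference(s1, s2, lambda link: scores.get(link, 0.0), 0.5)
```

In this toy case the two agreed links are always kept, and of the two disputed links only `(2, 3)` clears the assumed threshold.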
12
Error Rate Calculation
  • For a sentence pair
  • Calculate the error rate of an aligner
    • Based on the labeled data instead of the whole data set
[The equations are not preserved in this transcript. The surviving definition refers to the normalized weight of the i-th sentence pair at the l-th round.]

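A weighted error rate over the labeled sentence pairs could look like the sketch below. Since the slide's formulas are lost, both the per-sentence error (an AER-style measure against the real reference) and the names `sentence_error` and `aligner_error` are assumptions; only the use of normalized per-sentence weights and the restriction to labeled data come from the slide.

```python
def sentence_error(predicted, reference):
    """Assumed per-sentence error: 1 - 2|A ∩ R| / (|A| + |R|)."""
    if not predicted and not reference:
        return 0.0
    return 1.0 - 2.0 * len(predicted & reference) / (len(predicted) + len(reference))

def aligner_error(labeled_pairs, weights):
    """Weighted error of one aligner, computed on labeled sentence pairs only.

    labeled_pairs: list of (predicted_links, reference_links) sets.
    weights: current weights of those sentence pairs (must sum to > 0);
             they are normalized here so they sum to one.
    """
    total = sum(weights)
    return sum(
        (w / total) * sentence_error(pred, ref)
        for w, (pred, ref) in zip(weights, labeled_pairs)
    )

labeled = [
    ({(0, 0), (1, 1)}, {(0, 0), (1, 1)}),   # fully correct
    ({(0, 0), (1, 2)}, {(0, 0), (1, 1)}),   # one of two links wrong
]
```

With equal weights the second sentence contributes half of its 0.5 error; giving it three times the weight raises the aligner's error accordingly.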
13
Re-Weight the Training Data
  • Re-weight each sentence pair in the training set
  • For each sentence pair, there may be correct links and incorrect links when compared with the pseudo reference set
  • Calculate the weight of each sentence pair according to its correct and incorrect links
[The equation is not preserved in this transcript.]
where K is the number of error links and n is the total number of links in the reference

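One plausible AdaBoost-style update that uses K and n as defined on the slide is shown below. The formula itself is an assumption, since the original equation is lost: each weight is scaled by `(K + (n - K) * beta) / n` with `beta = error_rate / (1 - error_rate)`, so sentence pairs with more error links keep more of their weight. The function name `reweight` is hypothetical.

```python
def reweight(weights, error_counts, reference_sizes, error_rate):
    """Assumed weight update for sentence pairs, followed by normalization.

    weights: current sentence-pair weights.
    error_counts: K for each pair (number of error links).
    reference_sizes: n for each pair (links in the reference), each > 0.
    error_rate: the aligner's error rate this round, with 0 < error_rate < 1.
    """
    beta = error_rate / (1.0 - error_rate)
    updated = [
        w * (k + (n - k) * beta) / n
        for w, k, n in zip(weights, error_counts, reference_sizes)
    ]
    total = sum(updated)
    return [w / total for w in updated]

new_weights = reweight([0.5, 0.5], error_counts=[0, 2], reference_sizes=[4, 4], error_rate=0.2)
```

Under this assumed rule a fully correct sentence pair is scaled by `beta`, while a fully wrong one keeps its weight before normalization, which shifts the next round's focus toward harder sentence pairs.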
14
Final Ensemble
  • Obtain the final ensemble according to the trained word aligners on each round
[The equation is not preserved in this transcript. Its definitions cover three quantities: the final ensemble for word alignment, the weight of each alignment pair (s, t) produced by a word aligner, and the weight of each word aligner.]

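The three quantities named on this slide suggest a weighted vote in which each aligner's proposed link counts in proportion to both the aligner's weight and the link's weight. The sketch below assumes exactly that combination (a product of the two weights, summed over rounds, with an arg-max per source position); the exact formula is not in the transcript, and `ensemble_align` and its argument layout are hypothetical.

```python
from collections import defaultdict

def ensemble_align(aligner_outputs, aligner_weights, link_weights):
    """Combine per-round alignments by weighted voting (assumed form).

    aligner_outputs: one dict per round, mapping source position -> target position.
    aligner_weights: one float per round (weight of that word aligner).
    link_weights: one dict per round, mapping (s, t) -> weight of that link;
                  missing links default to 1.0.
    """
    votes = defaultdict(lambda: defaultdict(float))
    for output, a_w, l_w in zip(aligner_outputs, aligner_weights, link_weights):
        for s, t in output.items():
            votes[s][t] += a_w * l_w.get((s, t), 1.0)
    # For each source position, keep the target with the highest total vote.
    return {s: max(cands, key=cands.get) for s, cands in votes.items()}

outputs = [{0: 0, 1: 1}, {0: 0, 1: 2}, {0: 0, 1: 2}]
weights = [2.0, 0.7, 0.7]
plain = ensemble_align(outputs, weights, [{}, {}, {}])
with_links = ensemble_align(outputs, weights, [{(1, 1): 0.5}, {}, {}])
```

With aligner weights alone the strong first aligner wins position 1; down-weighting its link `(1, 1)` lets the other two rounds outvote it, which illustrates why a second kind of weight can change the result.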
15
Evaluation
  • Training set
    • Unlabeled data: 320,000 English-Chinese pairs
    • Labeled data: 30,000 English-Chinese pairs
  • Held-out set
    • 1,500 sentence pairs
  • Testing set
    • 1,000 bilingual English-Chinese sentence pairs
    • 8,651 alignment links in total

16
Evaluation Metric
  • Word alignment
    • Precision and recall
    • Alignment Error Rate (AER)
  • Phrase-based machine translation
    • System: Pharaoh
    • Metrics: NIST and BLEU
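The word alignment metrics can be computed directly from link sets. The sketch below assumes a single reference set with no distinction between sure and possible links, in which case AER reduces to 1 - 2|A ∩ R| / (|A| + |R|); whether the talk used that simplification is not stated. The function name `alignment_metrics` is hypothetical.

```python
def alignment_metrics(predicted, reference):
    """Return (precision, recall, AER) for two sets of (source, target) links.

    Assumes one reference set without a sure/possible distinction.
    """
    correct = len(predicted & reference)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(reference) if reference else 0.0
    denom = len(predicted) + len(reference)
    aer = 1.0 - 2.0 * correct / denom if denom else 0.0
    return precision, recall, aer

predicted = {(0, 0), (1, 1), (2, 3)}
reference = {(0, 0), (1, 1), (2, 2), (3, 3)}
precision, recall, aer = alignment_metrics(predicted, reference)
```

Here two of the three predicted links are correct and two of the four reference links are found, so AER is 1 - 4/7.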

17
Word Alignment Results
[Results not preserved in this transcript.]

18
Weights in Ensembles
  • Two kinds of weights
  • Weights for the individual aligners
  • Weights for the individual alignment links

The baseline uses only the first kind of weight; our method uses both kinds.

19
Translation Results
[Results not preserved in this transcript.]

20
Conclusion
  • Features of our semi-supervised boosting method
    • Performs model interpolation
    • Automatically builds a pseudo reference set
    • Calculates the error rate on the training set using the labeled data
    • Uses two kinds of weights in the ensemble
      • One for aligners
      • The other for alignment links
  • Boosting does improve word alignment and translation quality
  • Semi-supervised boosting performs the best

21
Thanks!