1
Confidence Estimation for Machine Translation
  • Lucia Specia
  • Xerox Research Centre Europe, Grenoble
  • (collaboration with Marco Turchi & Nello
    Cristianini, U. Bristol)

2
Outline
  • The task of Confidence Estimation for Machine
    Translation
  • Our approach
  • Method
  • Features
  • Algorithms
  • Experiments
  • Ongoing and future work

3
The task of CE for MT
  • Goal: given the output of a Machine Translation
    (MT) system for a given input, provide an
    estimate of its quality.
  • Motivation: assessing the quality of
    translations is
  • Time-consuming: reading the translation takes
    time
  • Los piratas se han incautado un gigante
    petrolero saudita llevando su carga completa de 2
    m de barriles - más de una cuarta parte de la
    Arabia Saudita de la producción diaria - el
    sábado frente a la costa de Kenya (unas 450
    millas náuticas al sudeste de Mombasa) y la
    dirección hacia la puerto somalí de AEL, la
    Marina de los EE.UU. dice.
  • Not possible if user does not know the source
    language
  • (Example translation in another language; the
    characters are garbled in this transcript.)

4
The task of CE for MT
  • Uses:
  • Filter out bad translations to avoid
    professional translators wasting time reading /
    post-editing them.
  • Make end-users aware of the translation
    quality.

Is it worth providing this translation as a
suggestion to the professional translator?
Should this translation be highlighted as
suspect to the reader?
5
General approach
  • Different from MT evaluation (BLEU/NIST):
    reference translations are NOT available
  • Unit: word, phrase or sentence
  • Embedded in the SMT system (word or phrase
    probabilities) or a dedicated layer (machine
    learning problem)
  • Binary problem: distinguish between good and bad
    translations


7
Related work (sentence-level)
  • Workshop at JHU (Blatz et al., Coling-04) & MSR
    (Quirk, LREC-04)
  • Automatic MT metrics or few manually assessed
    cases
  • Poor analysis of the contribution of different
    features
  • Predictions did not have a positive effect on
    practical tasks
  • MSR (Gamon et al., EAMT-05)
  • Human-likeness classification
  • Resource-dependent features
  • Poor performance compared to BLEU; little
    correlation with humans
  • One MT system, one domain, one language pair
  • Only good / bad estimates: a binary task

8
Our approach
  • Sentence level: the natural scenario for MT
  • Many resource- and language-independent features
  • Take the contribution of features into account
  • MT system-dependent vs. independent features
  • Machine learning problem: regression
  • Any continuous score
  • Human annotation as training data
  • Several MT systems, text domains, language pairs
    and quality scores
  • Results useful in practical applications

9
Method
  • Identify and extract information sources.
  • Refine the set of information sources to keep
    only the relevant ones
  • Increase performance.
  • Decrease extraction cost (time).
  • Learn a model to produce quality scores for new
    translations.
  • Use the CE score in some application.

10
Features
  • Most of those identified in previous work + new
    ones
  • Black-box (77): from the input and translation
    sentences, monolingual or parallel corpora, e.g.:
  • Source and target sentence lengths and their
    ratios
  • Language model and other statistics in the corpus
  • Shallow syntax checking (target, and target
    against source)
  • Average number of possible translations per
    source word (SMT)
  • Practical scenario:
  • Useful when it is not possible to access the
    internal features of the MT system (e.g.,
    commercial systems).
  • Provides a way to perform CE across different MT
    systems, which may use different frameworks (see
    the sketch below).
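As a rough illustration, here is a minimal sketch in Python of a few of the black-box features above; the language model object lm and its score() method are assumptions made for illustration, not something specified on the slides.

import string

def black_box_features(source, target, lm):
    # A few black-box CE features (sketch).
    src_toks, tgt_toks = source.split(), target.split()
    punct = set(string.punctuation)
    return {
        "src_len": len(src_toks),
        "tgt_len": len(tgt_toks),
        # source/target length ratio
        "len_ratio": len(tgt_toks) / max(len(src_toks), 1),
        # language model statistic on the target sentence
        # (assumes lm.score() returns a log-probability)
        "tgt_lm_logprob": lm.score(target),
        # shallow check: punctuation-count mismatch between
        # source and target
        "punct_mismatch": abs(sum(c in punct for c in source)
                              - sum(c in punct for c in target)),
    }

Only the input and translation sentences (plus corpus statistics such as the LM) are needed, which is what makes these features usable across MT systems.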

11
Features
  • Glass-box (54): depend on some aspect of the
    translation process, e.g.:
  • Language model (target) using the n-best list,
    word- or phrase-based
  • Proximity to other hypotheses in the n-best list
  • MT base model features:
  • Distortion count, gap count, (compositional)
    bi-phrase probability
  • Search nodes in the graph (aborted, pruned)
  • Proportion of unknown words in the source (see
    the sketch below)
  • Richer scenario:
  • When it is possible to access the internal
    features of the MT system.
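For instance, the proportion-of-unknown-words feature falls out directly once the decoder's source vocabulary is available; a minimal sketch, where the vocab set standing in for the SMT system's vocabulary is an assumption:

def unknown_word_proportion(source, vocab):
    # Glass-box feature: fraction of source tokens absent from the
    # SMT system's vocabulary, i.e. words the decoder cannot translate.
    toks = source.split()
    return sum(t not in vocab for t in toks) / max(len(toks), 1)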

12
Learning methods
  • Feature selection: Partial Least Squares (PLS)
  • Regression: PLS, SVM

13
Partial Least Squares Regression
  • Given two matrices X (input variables) and Y
    (response variables), predict Y from X and
    describe their common structure.
  • Projects the original data onto a different space
    of latent variables (or components).
  • Provides, as a by-product, an ordering of the
    original features according to their importance.
  • Particularly indicated when the features in X are
    strongly correlated (multicollinearity) → the
    case of CE datasets.
  • Widely applied in other fields, but not yet in
    NLP.

14
Partial Least Squares Regression
  • Ordinary multiple regression problem:
    Y = X Bw + F
  • Where:
  • Bw is the regression matrix, computed directly
    using an optimal number of components.
  • F is the residual matrix.
  • When X is standardized, an element of Bw with a
    large absolute value indicates an important
    X-variable (see the sketch below).
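A minimal sketch of this regression with scikit-learn's PLSRegression; the slides do not name an implementation, and the data below is random placeholder data:

import numpy as np
from sklearn.cross_decomposition import PLSRegression

# X: one row per translation, one column per CE feature;
# y: quality scores. Placeholder data for illustration.
rng = np.random.default_rng(0)
X = rng.random((200, 77))
y = rng.random(200)

pls = PLSRegression(n_components=10)  # chosen on validation data
pls.fit(X, y)                         # X is standardized internally
y_pred = pls.predict(X)

# Bw: with standardized X, a coefficient with a large absolute
# value marks an important feature.
bw = np.ravel(pls.coef_)
ranking = np.argsort(-np.abs(bw))     # most to least important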

15
Feature Selection with PLS
  • Method:
  • Compute the Bw matrix on some training data, for
    different numbers of components (all possible)
  • Sort the absolute values of the Bw coefficients.
    This produces a list of features from the most
    important to the least important (Lb)
  • Select the top n features, training and testing
    on a validation set according to some objective
    criterion
  • Train and test these n features on a test set
  • Evaluate predictions using appropriate metrics

16
Feature Selection with PLS
  • Method:
  • Compute the Bw matrix on some training data, for
    different numbers of components (all possible)
  • Sort the absolute values of the Bw coefficients.
    This produces a list of features from the most
    important to the least important (Lb)
  • Done for each i-th training subsample, obtaining
    several lists Lb(i)
  • The final list L is obtained by picking the most
    voted feature for each column (the mode), e.g.
    L = 66, 56, ..., 35, 10 (see the sketch below)
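A sketch of the voting step, assuming each training subsample has already produced an ordered feature list Lb(i); the helper name is illustrative:

from collections import Counter

def vote_feature_list(rankings):
    # rankings: one ordered feature list Lb(i) per training subsample.
    # At each rank position (column), pick the most voted feature
    # (the mode) among those not selected yet.
    final, used = [], set()
    for pos in range(len(rankings[0])):
        for feat, _ in Counter(r[pos] for r in rankings).most_common():
            if feat not in used:
                final.append(feat)
                used.add(feat)
                break
    # Append any features never selected, in the order of the
    # first ranking, so the final list stays complete.
    for feat in rankings[0]:
        if feat not in used:
            final.append(feat)
            used.add(feat)
    return final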

17
Feature Selection with PLS
  • Method:
  • Compute the Bw matrix on some training data, for
    different numbers of components (all possible)
  • Sort the absolute values of the Bw coefficients.
    This produces a list of features from the most
    important to the least important (Lb)
  • Select the top n features, training and testing
    on a validation set according to some objective
    criterion
  • Objective criterion: RMSPE
  • Analyze learning curves to select the top n
    features

18
Feature Selection with PLS
  • (Figure: learning curves used to select the top
    n features.)
19
Feature Selection with PLS
  • Method:
  • Compute the Bw matrix on some training data, for
    different numbers of components (all possible)
  • Sort the absolute values of the Bw coefficients.
    This produces a list of features from the most
    important to the least important (Lb)
  • Select the top n features, training and testing
    on a validation set according to some objective
    criterion
  • Train and test these n features on a test set
  • SVM or PLS with the optimal number of components

20
Feature Selection with PLS
  • Method:
  • Compute the Bw matrix on some training data, for
    different numbers of components (all possible)
  • Sort the absolute values of the Bw coefficients.
    This produces a list of features from the most
    important to the least important (Lb)
  • Select the top n features, training and testing
    on a validation set according to some objective
    criterion
  • Train and test these n features on a test set
  • Evaluate predictions
  • Root Mean Squared Prediction Error (RMSPE)

21
Feature Selection with PLS
  • Root Mean Squared Prediction Error (RMSPE),
    implemented in the sketch below:
    RMSPE = sqrt( (1/N) * Σ (y - ŷ)² )
  • N: number of test cases
  • TP: true positives
  • FP: false positives
  • y: expected value
  • ŷ: prediction obtained
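Written out as code, the metric is simply:

import numpy as np

def rmspe(y_true, y_pred):
    # Root Mean Squared Prediction Error over N test cases.
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))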

22
Experiments
  • Datasets:
  • Europarl data with quality scores from automatic
    metrics.
  • News data with manually assigned quality scores
    (1-5).
  • Europarl with manually assigned quality scores
    (1-4).
  • Technical documents with manually assigned
    quality scores (1-4).
  • Technical documents with post-edition time
    annotation.

23
GKLS 1-4 en-es dataset
  • WMT-2008 Europarl English-Spanish dev + test
    data
  • 4K translations; SMT systems: Matrax, Portage,
    Sinuhe and MMR (P-ES-1, P-ES-2, P-ES-3 and
    P-ES-4)
  • Quality score: 1-4
  • Features: 131 for P-ES-1gb, 77 black-box for the
    others
  • Little gain with glass-box features
  • Good from a practical point of view

24
GKLS 1-4 en-dk dataset
  • Automotive documents, English-Danish
  • En-Es is a reasonably close language pair; try
    En-Dk
  • 2K translations; SMT system (Matrax) trained on
    170K parallel sentences
  • Quality score: 1-4
  • Features: black-box + glass-box
  • Results (RMSPE)
  • Considerable gain in using glass-box features
  • Expected with more distant language pairs

25
GKLS post-edition time dataset
  • Automotive documents, English-Russian
  • 3K translations; SMT systems (Matrax, Portage and
    MMR: P-ER-1, P-ER-2 and P-ER-3) trained on 250K
    sentences
  • Quality score: post-edition time in seconds
    (normalized by sentence length)
  • Given a source sentence in English and its
    translation into Russian, a professional
    translator post-edited the translation into a
    good-quality sentence while the time was
    recorded.
  • Features: 77 black-box

26
Discussion
  • Results for a subset of features outperform
    those for all features.
  • GKLS 1-4: CE models deviate by 0.6-0.7. E.g., a
    sentence that should be classified as "fit for
    purpose" would never be classified as "requires
    complete retranslation".
  • Post-edition times vary considerably from system
    to system. E.g., for P-ER-1 (2), CE models
    deviate by up to 2 seconds/word.

27
Manually annotated datasets
  • Best features (BB):
  • source & target sentence 3-gram language model
    probability
  • source & target sentence lengths
  • percentage of, and mismatch in, numbers and
    punctuation symbols in the source and target

28
Manually annotated datasets
  • Best features (GB):
  • size of the n-best list
  • sentence n-gram log-probabilities using the
    n-best list for training a LM (using words or
    phrases)
  • bi-phrase count
  • distortion count
  • bi-phrase probability
  • translation model
  • average size of hypotheses in the n-best list
  • number of search nodes in the decoder's final
    search graph

29
More results GKLS 1-4 en-es
  • System combination
  • Produce CE predictions for a test set decoded by
    the 4 systems
  • Sort the four CE predictions for each test
    instance
  • Select the translation with the highest score
  • Evaluate whether this was the best sentence
    according to the human annotation (sketch below)
  • Matches for Top: 659 / 802 (82.1%)
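A sketch of this selection scheme; the array layout is an assumption made for illustration:

import numpy as np

def combine_systems(ce_scores, translations):
    # ce_scores[s][i]: CE prediction of system s for test instance i;
    # translations[s][i]: the corresponding translation.
    best_sys = np.argmax(np.asarray(ce_scores), axis=0)
    return [translations[s][i] for i, s in enumerate(best_sys)]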

30
More results GKLS 1-4 en-es
  • Predictive power of features: Pearson's
    correlation (see the sketch below)
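This kind of analysis is easy to reproduce with scipy's pearsonr; a sketch, not necessarily the exact setup behind the slide:

from scipy.stats import pearsonr

def feature_correlations(X, y):
    # Pearson's r between each feature column of X and the scores y.
    return [pearsonr(X[:, j], y)[0] for j in range(X.shape[1])]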

31
More results GKLS 1-4 en-es
  • CE score vs. MT metrics: Pearson's correlation

32
More results GKLS 1-4 en-es
  • Filter out bad translations for Lang. Service
    Providers

34
More results GKLS 1-4 en-dk
  • Predictive power of features: Pearson's
    correlation
  • Top features: source LM log probability; target
    sentence 1-gram perplexity using the n-best list
    for training a LM
35
More results GKLS 1-4 en-dk
  • Filter out bad translations for Lang. Service
    Providers
  • Aim is to do better than sentence length
  • Rank (300) test sentences according to true
    score, CE score or sentence length, and take the
    average true score for the top n (sketch below)
  • CE > 2.3 (≈ 3 true score) vs. Sentence Length
    < 12
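A sketch of that ranking comparison; the per-sentence score arrays are assumed inputs:

import numpy as np

def avg_true_score_of_top_n(key, true_scores, n):
    # Rank test sentences by `key` (CE score, true score, or negated
    # sentence length, since shorter sentences are assumed better)
    # and return the average true score of the top n.
    top = np.argsort(-np.asarray(key))[:n]
    return float(np.mean(np.asarray(true_scores)[top]))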

36
More results GKLS 1-4 en-dk
  • Using thresholds on the CE score or sentence
    length we select:
  • 86% of the good sentences (human scores 3 or 4)
  • bad sentences (human scores 1 or 2)
  • When sentence length and CE score disagree

37
Discussion
  • Results considered to be satisfactory (except
    for post-edition time).
  • Prediction errors are similar across different
    language pairs
  • Quality of the MT system has some influence.
  • Predictions correlate better with human scores
    than metrics using reference translations.
  • Prediction error would yield uncertainty only in
    the boundaries between two adjacent categories.
  • Estimating continuous scores is more appropriate
    than binary classification, even for a binary
    application
  • Use of Inductive Confidence Machines to threshold
    the predicted score

38
Discussion
  • The most relevant features include many that
    have not been used before:
  • average size of the phrases in the target,
  • several mismatch features between source and
    target,
  • proportion of aborted search nodes, etc.
  • Future work: further investigate uses for these
    most relevant features
  • In SMT models, to improve translation quality:
  • To complement existing features in SMT models.
  • To rerank n-best lists produced by SMT systems.

39
Discussion
  • In MT evaluation:
  • To provide additional features to reference-based
    metrics built on ML algorithms.
  • To provide a score to be combined with other MT
    evaluation metrics.
  • To provide a new evaluation metric in itself,
    with some function to optimize the correlation
    with human annotations, without the need for
    reference translations.

40
Discussion
  • Uses of the CE score for other applications:
  • Cross-language information retrieval
  • Finding parallel data in comparable corpora
  • Whenever it is important to identify whether a
    target sentence is a GOOD translation of a source
    sentence.

41
Thanks!
  • Lucia Specia
  • lucia.specia@xrce.xerox.com

42
The source
  • Pirates have seized a giant Saudi oil tanker
    carrying its full load of 2m barrels - more than
    one-quarter of Saudi Arabia's daily output - on
    Saturday off the Kenyan coast (some 450 nautical
    miles south-east of Mombasa) and are steering it
    towards the Somali port of Eyl, the US Navy says.
    BBC News, 17/11/08