1
Statistical Machine Translation: SMT Basic Ideas
  • Stephan Vogel
  • MT Class
  • Spring Semester 2011

2
Overview
  • Deciphering foreign text: an example
  • Principles of SMT
  • Data processing

3
Deciphering Example
  • Apinaye → English
  • Apinaye belongs to the Ge family of Brazil
  • Spoken by 800 (according to SIL, 1994)
  • http://www.ethnologue.com/show_family.asp?subid=90784
  • http://www.language-museum.com/a/apinaye.php
  • Example from Linguistic Olympics 2008, see http://www.naclo.cs.cmu.edu
  • Parallel Corpus (some characters adapted)
  • Kukre kokoi → The monkey eats
  • Ape kra → The child works
  • Ape kokoi rats → The big monkey works
  • Ape mi mets → The good man works
  • Ape mets kra → The child works well
  • Ape punui mi pinjets → The old man works badly
  • Can we translate a new sentence?

4
Deciphering Example
  • Parallel Corpus (some characters adapted)
  • Can we build a lexicon from these sentence pairs?
  • Observations:
  • Apinaye: Kukre (1), Ape (5); English: The (6), works (5). Aha! → first guess: Ape = works
  • monkey in 1, 3; child in 2, 5; man in 4, 6: different distributions over the corpus. Do we find words with similar distributions on the Apinaye side?

Kukre kokoi → The monkey eats
Ape kra → The child works
Ape kokoi rats → The big monkey works
Ape mi mets → The good man works
Ape mets kra → The child works well
Ape punui mi pinjets → The old man works badly
5
Vocabularies
  • Corpus and vocabularies

Kukre kokoi → The monkey eats
Ape kra → The child works
Ape kokoi rats → The big monkey works
Ape mi mets → The good man works
Ape mets kra → The child works well
Ape punui mi pinjets → The old man works badly

Apinaye (9): kukre, kokoi, ape, kra, rats, mi, mets, punui, pinjets
English (11): The, monkey, eats, child, works, big, good, man, well, old, badly

  • Observations:
  • 9 Apinaye words, 11 English words
  • Expectations:
  • English words without translation?
  • Apinaye words corresponding to more than one English word?

6
Word Frequencies
  • Corpus and vocabularies, with frequencies

Apinaye          English
kukre    1       The      6
kokoi    2       monkey   2
ape      5       eats     1
kra      2       child    2
rats     1       works    5
mi       2       big      1
mets     2       good     1
punui    1       man      2
pinjets  1       well     1
                 old      1
                 badly    1

Kukre kokoi → The monkey eats
Ape kra → The child works
Ape kokoi rats → The big monkey works
Ape mi mets → The good man works
Ape mets kra → The child works well
Ape punui mi pinjets → The old man works badly

  • Suggestions:
  • ape (5) could align to The (6) or works (5)
  • More likely that the content word works has a match, i.e. ape = works
  • Other word pairs are difficult to predict: too many similar frequencies
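The frequency counts above take only a few lines of code. A minimal sketch (mine, not from the slides), using Python's `Counter` over the toy corpus:

```python
# Count word frequencies on each side of the toy Apinaye-English corpus.
from collections import Counter

corpus = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

# Lowercase the Apinaye side so sentence-initial "Ape"/"Kukre"
# merge with their lowercase forms.
src_freq = Counter(w.lower() for s, _ in corpus for w in s.split())
tgt_freq = Counter(w for _, t in corpus for w in t.split())

print(src_freq["ape"], tgt_freq["The"], tgt_freq["works"])  # 5 6 5
```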

7
Location in Corpus
  • Corpus and vocabularies, with sentence occurrences

Apinaye   Sentences      English   Sentences
kukre     1              The       1 2 3 4 5 6
kokoi     1 3            monkey    1 3
ape       2 3 4 5 6      eats      1
kra       2 5            child     2 5
rats      3              works     2 3 4 5 6
mi        4 6            big       3
mets      4 5            good      4
punui     6              man       4 6
pinjets   6              well      5
                         old       6
                         badly     6

Kukre kokoi → The monkey eats
Ape kra → The child works
Ape kokoi rats → The big monkey works
Ape mi mets → The good man works
Ape mets kra → The child works well
Ape punui mi pinjets → The old man works badly

  • Observations:
  • Same sentence sets: kukre = eats, kokoi = monkey, ape = works, kra = child, rats = big, mi = man
  • mets (4 and 5) → good (4) and well (5): makes sense
  • punui and pinjets match old and badly, but which is which?
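The "same sentence sets" observation is mechanical enough to automate. A sketch (my own illustration, not the slides' method) that pairs words whose occurrence sets are identical:

```python
# Pair words that occur in exactly the same set of sentences.
# Unique occurrence sets (kukre/eats, ape/works, ...) give one
# candidate each; punui and pinjets share set {6} with old and
# badly, so each yields two candidates.
from collections import defaultdict

corpus = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

def occurrences(side):
    """Map each (lowercased) word to the set of sentence numbers it occurs in."""
    occ = defaultdict(set)
    for i, pair in enumerate(corpus, start=1):
        for w in pair[side].lower().split():
            occ[w].add(i)
    return occ

src_occ, tgt_occ = occurrences(0), occurrences(1)
pairs = [(s, t) for s, ss in src_occ.items()
                for t, ts in tgt_occ.items() if ss == ts]
```

Note that mets ({4, 5}) matches neither good ({4}) nor well ({5}) this way; exact-set matching is too strict for one-to-many correspondences.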

8
Location in Sentence
  • Corpus

Apinaye              | English                 | Alignment EN→AP
Kukre kokoi          | The monkey eats         | 1-0 2-2 3-1
Ape kra              | The child works         | 1-0 2-2 3-1
Ape kokoi rats       | The big monkey works    | 1-0 2-3 3-2 4-1
Ape mi mets          | The good man works      | 1-0 2-3 3-2 4-1
Ape mets kra         | The child works well    | 1-0 2-3 3-1 4-2
Ape punui mi pinjets | The old man works badly | 1-0 2-??? 3-3 4-1 5-???

  • Observations:
  • First English word (The) does not align: we say it aligns to the NULL word
  • Apinaye verb in first position
  • English last word aligns to 1st or 2nd position
  • English → Apinaye: reversed word order (not strictly so in sentence pair 5)
  • Hypothesis:
  • alignment for the last sentence pair is 1-0 2-4 3-3 4-1 5-2
  • I.e., pinjets = old and punui = badly

9
POS Information
  • Corpus

Kukre kokoi (V N) → The monkey eats (Det N V)
Ape kra (V N) → The child works (Det N V)
Ape kokoi rats (V N Adj) → The big monkey works (Det Adj N V)
Ape mi mets (V N Adj) → The good man works (Det Adj N V)
Ape mets kra (V Adv N) → The child works well (Det N V Adv)
Ape punui mi pinjets (V ??? N ???) → The old man works badly (Det Adj N V Adv)

  • Observations:
  • English determiner (The) does not align: perhaps no determiners in Apinaye
  • English Verb Adverb → Apinaye Verb Adverb → no reordering
  • English Adjective Noun → Apinaye Noun Adjective → reordering
  • Hypothesis:
  • pinjets is Adj to make it N Adj, punui is Adv (consistent with the alignment hypothesis)

10
Translate New Sentences Ap → En
  • Source sentence: Ape rats mi mets
  • Lexical information: works big man good/well
  • Reordering information: The good man works big
  • Better lexical choice: The good man works hard
  • Compare: Ape mi mets → The good man works
  • Source sentence: Kukre rats kokoi punui
  • Lexical information: eats big monkey badly
  • Reordering information: The bad monkey eats big
  • Better lexical choice: The bad monkey eats a lot

11
Translate New Sentences En → Ap
  • Source sentence: The old monkey eats a lot
  • Lexical information: NULL pinjets kokoi kukre rats
  • Reordering information: kukre rats kokoi pinjets
  • Or:
  • Deleting words: old monkey eats a lot
  • Rephrase: old monkey eats big
  • Reorder: eats big monkey old
  • Lexical information: kukre rats kokoi pinjets
  • Source sentence: The big child works a long time
  • Delete plus rephrase: big child works big
  • Reorder: works big child big
  • Lexical information: Ape rats kra rats

12
Overview
  • Deciphering foreign text: an example
  • Principles of SMT
  • Data processing

13
Principles of SMT
  • We will use the same approach: learning from data
  • Build translation models using frequency, co-occurrence, word position, etc. information
  • Use the models to translate new sentences
  • Not manually, but fully automatically
  • The training is done automatically
  • There is still lots of manual work left: designing models, preparing data, running experiments, etc.

14
Machine Translation Approaches
  • Grammar-based
  • Interlingua-based
  • Transfer-based
  • Direct
  • Example-based
  • Statistical

15
Statistical Approach
  • Using statistical models
  • Create many alternatives; we call them hypotheses
  • Give a score to each hypothesis based on statistical models
  • Select the best → a search problem
  • Advantages:
  • Avoid hard decisions
  • Sometimes, optimality can be guaranteed
  • Speed can be traded against quality, not all-or-nothing
  • It works better!
  • Disadvantages:
  • Difficulties in handling structurally rich models, mathematically and computationally (but that's also true for non-statistical systems)
  • Need data to train the model parameters

16
Statistical versus Grammar-Based
  • Often statistical and grammar-based MT are seen as alternatives, even opposing approaches: wrong!
  • The dichotomies are:
  • Use probabilities vs. everything is equally likely (yes/no decisions)
  • Rich (deep) structure vs. no or only flat structure
  • Both dimensions are continuous
  • Examples:
  • EBMT: no/little structure and heuristics
  • SMT (initially only): flat structure and probabilities
  • XFER: deep(er) structure and heuristics
  • Goal: structurally rich probabilistic models
  • statXFER: deep structure and probabilities
  • Syntax-augmented SMT: deep structure and probabilities

                 No Probs             Probs
Flat Structure   EBMT                 SMT
Deep Structure   XFER, Interlingua    Holy Grail
17
Statistical Machine Translation
  • Translator translates source text
  • Use machine learning techniques to extract useful knowledge
  • Translation model: word and phrase translations
  • Language model: how likely words are to follow in a particular sequence
  • Translation system (decoder) uses these models to translate new sentences
  • Advantages:
  • Can quickly train for new languages
  • Can adapt to new domains
  • Problems:
  • Need parallel data
  • All words, even punctuation, are treated equally
  • Difficult to pinpoint the causes of errors

(Diagram: source and target training texts feed the translation model and language model; the decoder combines both to turn a source sentence into a translation.)
18
Tasks in SMT
  • Modelling: build statistical models which capture characteristic features of translation equivalences and of the target language
  • Training: train the translation model on a bilingual corpus, train the language model on a monolingual corpus
  • Decoding: find the best translation for new sentences according to the models
  • Evaluation:
  • Subjective evaluation: fluency, adequacy
  • Automatic evaluation: WER, BLEU, etc.
  • And all the nitty-gritty stuff:
  • Text preprocessing, data cleaning
  • Parameter tuning (minimum error rate training)
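Of the automatic metrics mentioned, WER is simple enough to sketch here (my own illustration, not part of the slides): word-level Levenshtein distance normalized by the reference length.

```python
# Word error rate: minimum number of word substitutions, insertions
# and deletions to turn the hypothesis into the reference, divided
# by the reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / len(ref)

print(wer("the good man works", "the good man works well"))  # 0.25
```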

19
Noisy Channel View
  • French is actually English, which has been garbled during transmission: recover the correct, original English

Speaker speaks English → noisy channel distorts it into French → you hear French, but need to recover the English
20
Bayesian Approach
  • Select the translation which has the highest probability:
  • ê = argmax p(e | f)
  •   = argmax p(e) · p(f | e)
  • p(e): source model; p(f | e): channel model; argmax: search process
21
SMT Architecture
p(e): language model, p(f | e): translation model
22
Log-Linear Model
  • In practice: ê = argmax [ log p(e) + log p(f | e) ]
  • Translation model (TM) and language model (LM) may be of different quality:
  • - simplifying assumptions
  • - trained on different amounts of data
  • Give different weights to both models:
  • ê = argmax [ w1 · log p(e) + w2 · log p(f | e) ]
  • Why not add more features?
  • ê = argmax [ w1 · h1(e, f) + ... + wn · hn(e, f) ]
  • Note: We don't need the normalization constant for the argmax
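A hedged sketch of how such a log-linear scorer might look in code. The two feature functions and the weights below are invented for illustration; real systems use trained LM/TM probabilities and tuned weights:

```python
# Log-linear model: score(e, f) = sum_i w_i * h_i(e, f),
# and decoding picks the hypothesis with the highest score.
import math

def score(weights, features, e, f):
    return sum(w * h(e, f) for w, h in zip(weights, features))

# Toy features: a fake "language model" preferring sentences that
# start with "The", and a fake "translation model" preferring
# hypotheses one word longer than the source (for the NULL word).
lm = lambda e, f: math.log(0.5 if e.startswith("The") else 0.1)
tm = lambda e, f: math.log(0.4 if len(e.split()) == len(f.split()) + 1 else 0.2)

f = "Ape mi mets"
hypotheses = ["The good man works", "good man works The"]
best = max(hypotheses, key=lambda e: score([1.0, 0.5], [lm, tm], e, f))
```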

23
Overview
  • Deciphering foreign text: an example
  • Principles of SMT
  • Data processing

24
Corpus Statistics
  • We want to know how much data we have
  • Corpus size: not file size, not documents, but words and sentences
  • Why is file size not important?
  • Vocabulary: number of word types
  • We want to know some distributions
  • How many words are seen only once?
  • Why is this interesting?
  • Does it help to increase the corpus?
  • How long are the sentences?
  • Does it matter if we have many short or fewer, but longer sentences?
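The statistics above are a few lines of code. A sketch (my own toy example, not the slides' data) computing tokens, types, singletons, and average sentence length:

```python
# Basic corpus statistics over a toy corpus.
from collections import Counter

sentences = [
    "the monkey eats",
    "the child works",
    "the big monkey works",
]

tokens = [w for s in sentences for w in s.split()]
freq = Counter(tokens)

n_tokens = len(tokens)     # corpus size in words
n_types = len(freq)        # vocabulary size (word types)
singletons = [w for w, c in freq.items() if c == 1]
avg_len = n_tokens / len(sentences)
```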

25
All Simple, Basic, Important
  • Important: When you publish, these numbers are important
  • To be able to interpret the results. E.g., what works on small corpora may not work on large corpora
  • To make them comparable to other papers
  • Basic: no deep thinking, nothing fancy
  • Simple: a few unix commands, a few simple scripts
  • wc, grep, sed, sort, uniq
  • perl, awk (my favorite), perhaps python, ...
  • Let's look at some data!

26
BTEC Spa-Eng
  • Corpus statistics:
  • Corpus and vocabulary size
  • Percentage of singletons
  • Number of unknown words, out-of-vocabulary (OOV) rate
  • Sentence length balance
  • Text normalization
  • Spoken language forms: I'll, we're, but also I will, we are
  • Note: this was shown online

27
Tokenization
  • Punctuation attached to words
  • Example: "you" "you," "you." "you?"
  • All different strings, i.e. all are different words
  • Tokenization can be tricky:
  • What about punctuation in numbers?
  • What about abbreviations? (A5-0104/1999)
  • Numbers are not just numbers:
  • Percentages: 1.2%
  • Ordinals: 1st, 2.
  • Ranges: 2000-2006, 31
  • And more: (A5-0104/1999)
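A sketch of why naive whitespace splitting fails and how a tokenizer might split off punctuation while keeping number-like strings intact. The regex is my own illustration, not the slides' tokenizer; real tokenizers handle far more cases:

```python
# Whitespace splitting leaves "you?", "you," and "you." as three
# distinct "words"; the regex splits punctuation off as separate
# tokens but keeps percentages and IDs like A5-0104/1999 together.
import re

text = "You? you, you. 1.2% (A5-0104/1999)"

naive = text.split()
tokens = re.findall(r"\w+(?:[.,\-/%]\w+)*%?|[^\w\s]", text)
```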

28
GigaWord Corpus
  • Distributed by LDC
  • Collection of newspapers: NYT, Xinhua News, ...
  • > 3 billion words
  • How large is the vocabulary?
  • Some observations in the vocabulary:
  • Number of entries with digits
  • Number of entries with special characters
  • Number of strange words
  • Some observations in the corpus:
  • Sentences with lots of numbers
  • Sentences with lots of punctuation
  • Sentences with very long words
  • Note: this was shown online

29
And then the more interesting Stuff
  • POS tagging
  • Parsing
  • For syntax-based MT systems
  • How parallel are the parse trees?
  • Word segmentation
  • Morphological processing
  • In all these tasks the central problem is:
  • How to make the corpus more parallel?