Title: Statistical Machine Translation (SMT)
1 Statistical Machine Translation: SMT Basic Ideas
- Stephan Vogel
- MT Class
- Spring Semester 2011
2 Overview
- Deciphering foreign text: an example
- Principles of SMT
- Data processing
3 Deciphering Example
- Apinaye -> English
- Apinaye belongs to the Ge family of Brazil
- Spoken by 800 people (according to SIL, 1994)
- http://www.ethnologue.com/show_family.asp?subid=90784
- http://www.language-museum.com/a/apinaye.php
- Example from the Linguistic Olympics 2008, see http://www.naclo.cs.cmu.edu
- Parallel Corpus (some characters adapted)
- Kukre kokoi -> The monkey eats
- Ape kra -> The child works
- Ape kokoi rats -> The big monkey works
- Ape mi mets -> The good man works
- Ape mets kra -> The child works well
- Ape punui mi pinjets -> The old man works badly
- Can we translate a new sentence?
4 Deciphering Example
- Parallel Corpus (some characters adapted)
- Can we build a lexicon from these sentence pairs?
- Observations
- Apinaye: Kukre (1), Ape (5); English: The (6), works (5). Aha! -> first guess: Ape = works
- monkey in 1, 3; child in 2, 5; man in 4, 6: different distributions over the corpus; do we find words with similar distributions on the Apinaye side?
Kukre kokoi -> The monkey eats
Ape kra -> The child works
Ape kokoi rats -> The big monkey works
Ape mi mets -> The good man works
Ape mets kra -> The child works well
Ape punui mi pinjets -> The old man works badly
5 Vocabularies
Kukre kokoi -> The monkey eats
Ape kra -> The child works
Ape kokoi rats -> The big monkey works
Ape mi mets -> The good man works
Ape mets kra -> The child works well
Ape punui mi pinjets -> The old man works badly
Apinaye vocabulary: kukre, kokoi, ape, kra, rats, mi, mets, punui, pinjets
English vocabulary: The, monkey, eats, child, works, big, good, man, well, old, badly
- Observations
- 9 Apinaye words, 11 English words
- Expectations
- English words without translation?
- Apinaye words corresponding to more than one English word?
6 Word Frequencies
- Corpus
- Vocabularies, with frequencies
Apinaye: kukre 1, kokoi 2, ape 5, kra 2, rats 1, mi 2, mets 2, punui 1, pinjets 1
English: The 6, monkey 2, eats 1, child 2, works 5, big 1, good 1, man 2, well 1, old 1, badly 1
Kukre kokoi -> The monkey eats
Ape kra -> The child works
Ape kokoi rats -> The big monkey works
Ape mi mets -> The good man works
Ape mets kra -> The child works well
Ape punui mi pinjets -> The old man works badly
- Suggestions
- ape (5) could align to The (6) or works (5)
- More likely that the content word works has a match, i.e. ape = works
- Other word pairs are difficult to predict: too many similar frequencies (a small counting sketch follows below)
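A minimal sketch of how such frequency tables can be computed; the toy corpus is copied from the slide, while the lowercasing choice and variable names are just illustrative assumptions:

```python
from collections import Counter

# Toy parallel corpus from the slides: (Apinaye, English) sentence pairs.
corpus = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

# Lowercase the Apinaye side so 'Kukre'/'kukre' and 'Ape'/'ape' are counted together.
ap_freq = Counter(w.lower() for ap, _ in corpus for w in ap.split())
en_freq = Counter(w for _, en in corpus for w in en.split())

print("Apinaye:", ap_freq.most_common())   # ape 5, kokoi 2, kra 2, mets 2, mi 2, ...
print("English:", en_freq.most_common())   # The 6, works 5, monkey 2, child 2, man 2, ...
```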
7 Location in Corpus
- Corpus
- Vocabularies, with occurrences (word: sentence numbers)
Apinaye: kukre 1; kokoi 1, 3; ape 2, 3, 4, 5, 6; kra 2, 5; rats 3; mi 4, 6; mets 4, 5; punui 6; pinjets 6
English: The 1, 2, 3, 4, 5, 6; monkey 1, 3; eats 1; child 2, 5; works 2, 3, 4, 5, 6; big 3; good 4; man 4, 6; well 5; old 6; badly 6
Kukre kokoi -> The monkey eats
Ape kra -> The child works
Ape kokoi rats -> The big monkey works
Ape mi mets -> The good man works
Ape mets kra -> The child works well
Ape punui mi pinjets -> The old man works badly
- Observations
- Same sentences: kukre = eats, kokoi = monkey, ape = works, kra = child, rats = big, mi = man
- mets (4 and 5) -> good (4) and well (5): makes sense
- punui and pinjets match old and badly, but which is which? (see the co-occurrence sketch below)
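A small sketch of this co-occurrence reasoning: map each word to the set of sentences it occurs in and pair up words whose sets are identical. The corpus is from the slides; everything else is illustrative.

```python
from collections import defaultdict

corpus = [
    ("Kukre kokoi", "The monkey eats"),
    ("Ape kra", "The child works"),
    ("Ape kokoi rats", "The big monkey works"),
    ("Ape mi mets", "The good man works"),
    ("Ape mets kra", "The child works well"),
    ("Ape punui mi pinjets", "The old man works badly"),
]

def occurrences(sentences):
    occ = defaultdict(set)
    for i, sent in enumerate(sentences, start=1):
        for w in sent.lower().split():
            occ[w].add(i)
    return occ

ap_occ = occurrences(ap for ap, _ in corpus)
en_occ = occurrences(en for _, en in corpus)

# English words with exactly the same sentence set are candidate translations.
for ap_word, sents in sorted(ap_occ.items()):
    matches = [e for e, es in en_occ.items() if es == sents]
    print(ap_word, sorted(sents), "->", matches)
# ape -> [works], kokoi -> [monkey], kra -> [child], kukre -> [eats], mi -> [man], rats -> [big];
# mets -> [] (good and well split over 4/5); punui and pinjets both -> [old, badly]
```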
8 Location in Sentence
Apinaye -> English, with alignment (EN-AP):
Kukre kokoi -> The monkey eats (1-0 2-2 3-1)
Ape kra -> The child works (1-0 2-2 3-1)
Ape kokoi rats -> The big monkey works (1-0 2-3 3-2 4-1)
Ape mi mets -> The good man works (1-0 2-3 3-2 4-1)
Ape mets kra -> The child works well (1-0 2-3 3-1 4-2)
Ape punui mi pinjets -> The old man works badly (1-0 2-??? 3-3 4-1 5-???)
- Observations
- The first English word (The) does not align: we say it aligns to the NULL word
- Apinaye verb in first position
- The last English word aligns to the 1st or 2nd position
- English -> Apinaye: reverse word order (not strictly so in sentence pair 5)
- Hypothesis
- The alignment for the last sentence pair is 1-0 2-4 3-3 4-1 5-2
- I.e. pinjets = old and punui = badly (a small bookkeeping sketch follows below)
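The same bookkeeping can be written down explicitly. This sketch is added for illustration: the alignment lists are copied from the table above, everything else is assumed.

```python
# Alignments as (English position, Apinaye position) pairs; Apinaye position 0 is the NULL word.
alignments = {
    1: [(1, 0), (2, 2), (3, 1)],
    2: [(1, 0), (2, 2), (3, 1)],
    3: [(1, 0), (2, 3), (3, 2), (4, 1)],
    4: [(1, 0), (2, 3), (3, 2), (4, 1)],
    5: [(1, 0), (2, 3), (3, 1), (4, 2)],
    6: [(1, 0), (3, 3), (4, 1)],   # 'old' (2) and 'badly' (5) not yet placed
}
en_len = {1: 3, 2: 3, 3: 4, 4: 4, 5: 4, 6: 5}   # English sentence lengths
ap_len = {1: 2, 2: 2, 3: 3, 4: 3, 5: 3, 6: 4}   # Apinaye sentence lengths

s = 6
used_en = {e for e, a in alignments[s]}
used_ap = {a for e, a in alignments[s] if a != 0}
print("free English positions:", [e for e in range(1, en_len[s] + 1) if e not in used_en])  # [2, 5]
print("free Apinaye positions:", [a for a in range(1, ap_len[s] + 1) if a not in used_ap])  # [2, 4]
# The reverse-order tendency pairs them as 2-4 (old = pinjets) and 5-2 (badly = punui).
```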
9 POS Information
Kukre kokoi (V N) -> The monkey eats (Det N V)
Ape kra (V N) -> The child works (Det N V)
Ape kokoi rats (V N Adj) -> The big monkey works (Det Adj N V)
Ape mi mets (V N Adj) -> The good man works (Det Adj N V)
Ape mets kra (V Adv N) -> The child works well (Det N V Adv)
Ape punui mi pinjets (V ??? N ???) -> The old man works badly (Det Adj N V Adv)
- Observations
- The English determiner (The) does not align: perhaps there are no determiners in Apinaye
- English Verb Adverb -> Apinaye Verb Adverb -> no reordering
- English Adjective Noun -> Apinaye Noun Adjective -> reordering
- Hypothesis
- pinjets is Adj to make it N Adj; punui is Adv (consistent with the alignment hypothesis)
10 Translate New Sentences Ap -> En
- Source Sentence: Ape rats mi mets
- Lexical information: works big man good/well
- Reordering information: The good man works big
- Better lexical choice: The good man works hard
- Compare: Ape mi mets -> The good man works
- Source Sentence: Kukre rats kokoi punui
- Lexical information: eats big monkey badly
- Reordering information: The bad monkey eats big
- Better lexical choice: The bad monkey eats a lot (see the lookup-and-reorder sketch below)
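Purely illustrative: a lookup-and-reorder sketch built from the lexicon hypothesized on the earlier slides. The single-sense lexicon (e.g. mets mapped only to "good") and the crude full reversal are assumptions; the slide's hand translations keep the verb right after the subject, so the outputs differ slightly.

```python
# Hypothesized lexicon from the deciphering slides (one sense per word, an oversimplification).
lexicon = {
    "kukre": "eats", "kokoi": "monkey", "ape": "works", "kra": "child",
    "rats": "big", "mi": "man", "mets": "good", "pinjets": "old", "punui": "badly",
}

def translate_ap_to_en(sentence):
    glosses = [lexicon.get(w, w) for w in sentence.lower().split()]  # lexical lookup
    reordered = reversed(glosses)                                    # crude reverse-order reordering
    return "The " + " ".join(reordered)                              # Apinaye seems to lack determiners

print(translate_ap_to_en("Ape rats mi mets"))        # -> The good man big works
print(translate_ap_to_en("Kukre rats kokoi punui"))  # -> The badly monkey big eats
```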
11 Translate New Sentences En -> Ap
- Source Sentence: The old monkey eats a lot
- Lexical information: NULL pinjets kokoi kukre rats
- Reordering information: kukre rats kokoi pinjets
- Or
- Deleting words: old monkey eats a lot
- Rephrase: old monkey eats big
- Reorder: eats big monkey old
- Lexical information: kukre rats kokoi pinjets
- Source Sentence: The big child works a long time
- Delete plus rephrase: big child works big
- Reorder: works big child big
- Lexical information: Ape rats kra rats
12 Overview
- Deciphering foreign text: an example
- Principles of SMT
- Data processing
13 Principles of SMT
- We will use the same approach: learning from data
- Build translation models using frequency, co-occurrence, word position, etc.
- Use the models to translate new sentences
- Not manually, but fully automatically
- The training will be done automatically
- There is still lots of manual work left: designing models, preparing data, running experiments, etc.
14 Machine Translation Approaches
- Grammar-based
- Interlingua-based
- Transfer-based
- Direct
- Example-based
- Statistical
15 Statistical Approach
- Using statistical models
- Create many alternatives: we call them hypotheses
- Give a score to each hypothesis based on statistical models
- Select the best -> search problem
- Advantages
- Avoid hard decisions
- Sometimes, optimality can be guaranteed
- Speed can be traded against quality; not all-or-nothing
- It works better!
- Disadvantages
- Difficulties in handling structurally rich models, mathematically and computationally (but that's also true for non-statistical systems)
- Need data to train the model parameters
16 Statistical versus Grammar-Based
- Often statistical and grammar-based MT are seen as alternatives, even opposing approaches: wrong!!!
- The dichotomies are
- Use probabilities vs. everything is equally likely (yes/no decisions)
- Rich (deep) structure vs. no or only flat structure
- Both dimensions are continuous
- Examples
- EBMT: no/little structure and heuristics
- SMT (initially only): flat structure and probabilities
- XFER: deep(er) structure and heuristics
- Goal: structurally rich probabilistic models
- statXFER: deep structure and probabilities
- Syntax-augmented SMT: deep structure and probabilities
               | No Probs          | Probs
Flat Structure | EBMT              | SMT
Deep Structure | XFER, Interlingua | Holy Grail
17 Statistical Machine Translation
- Translator translates source text
- Use machine learning techniques to extract useful knowledge
- Translation model: word and phrase translations
- Language model: how likely words follow each other in a particular sequence
- Translation system (decoder) uses these models to translate new sentences
- Advantages
- Can quickly train for new languages
- Can adapt to new domains
- Problems
- Need parallel data
- All words, even punctuation, are equal
- Difficult to pinpoint the causes of errors
[Diagram: Source and Target text, Translation Model, Language Model; a Source Sentence is turned into a Translation]
18 Tasks in SMT
- Modelling: build statistical models which capture characteristic features of translation equivalences and of the target language
- Training: train the translation model on a bilingual corpus, train the language model on a monolingual corpus
- Decoding: find the best translation for new sentences according to the models
- Evaluation
- Subjective evaluation: fluency, adequacy
- Automatic evaluation: WER, BLEU, etc.
- And all the nitty-gritty stuff
- Text preprocessing, data cleaning
- Parameter tuning (minimum error rate training)
19 Noisy Channel View
- French is actually English which has been garbled during transmission: recover the correct, original English
- Speaker speaks English
- Noisy channel distorts it into French
- You hear French, but need to recover the English
20 Bayesian Approach
- Select the translation which has the highest probability
- ê = argmax_e p(e | f)
-   = argmax_e p(e) * p(f | e)
- argmax: search process; p(e): model of the source; p(f | e): model of the channel (derivation spelled out below)
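For completeness, the step that is implicit above, written out (standard Bayes' rule reasoning, not an addition to the slide's content):

```latex
\begin{aligned}
\hat{e} &= \arg\max_{e} \; p(e \mid f) \\
        &= \arg\max_{e} \; \frac{p(e)\, p(f \mid e)}{p(f)} && \text{(Bayes' rule)} \\
        &= \arg\max_{e} \; p(e)\, p(f \mid e) && \text{($p(f)$ does not depend on $e$)}
\end{aligned}
```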
21 SMT Architecture
p(e): language model, p(f | e): translation model
22 Log-Linear Model
- In practice: ê = argmax [ log p(e) + log p(f | e) ]
- Translation model (TM) and language model (LM) may be of different quality
- simplifying assumptions
- trained on different amounts of data
- Give different weights to both models
- ê = argmax [ w1 log p(e) + w2 log p(f | e) ]
- Why not add more features?
- ê = argmax [ w1 h1(e, f) + ... + wn hn(e, f) ] (see the scoring sketch below)
- Note: we don't need the normalization constant for the argmax
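A minimal scoring sketch of this log-linear argmax; the two hypotheses, their feature values, and the weights are invented purely for illustration:

```python
import math

# Two made-up candidate translations with made-up model probabilities:
# lm = log p(e) from the language model, tm = log p(f|e) from the translation model.
hypotheses = {
    "the good man works": {"lm": math.log(0.020), "tm": math.log(0.010)},
    "the man good works": {"lm": math.log(0.002), "tm": math.log(0.015)},
}
weights = {"lm": 1.0, "tm": 0.8}   # feature weights w1, w2 (e.g. tuned by MERT)

def score(features):
    # Log-linear score: weighted sum of feature values; no normalization needed for argmax.
    return sum(weights[name] * value for name, value in features.items())

best = max(hypotheses, key=lambda e: score(hypotheses[e]))
print(best)   # -> the good man works
```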
23 Overview
- Deciphering foreign text: an example
- Principles of SMT
- Data processing
24 Corpus Statistics
- We want to know how much data we have
- Corpus size: not file size, not documents, but words and sentences
- Why is file size not important?
- Vocabulary: number of word types
- We want to know some distributions
- How many words are seen only once?
- Why is this interesting?
- Does it help to increase the corpus?
- How long are the sentences?
- Does it matter if we have many short, or fewer but longer, sentences? (a small counting script follows below)
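A sketch of the kind of counting script meant here, assuming one sentence per line in a plain-text file (the file and function names are made up):

```python
import sys
from collections import Counter

def corpus_stats(path):
    """Sentences, tokens, vocabulary size, singletons, and average sentence length."""
    freq = Counter()
    lengths = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.split()
            lengths.append(len(tokens))
            freq.update(tokens)
    n_sent, n_tok = len(lengths), sum(lengths)
    singletons = sum(1 for c in freq.values() if c == 1)
    print(f"sentences:  {n_sent}")
    print(f"tokens:     {n_tok}")
    print(f"vocabulary: {len(freq)} types")
    print(f"singletons: {singletons} ({100.0 * singletons / len(freq):.1f}% of types)")
    print(f"average sentence length: {n_tok / n_sent:.1f} tokens")

if __name__ == "__main__":
    corpus_stats(sys.argv[1])   # e.g. python corpus_stats.py corpus.en
```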
25 All Simple, Basic, Important
- Important: when you publish, these numbers are important
- To be able to interpret the results, e.g. what works on small corpora may not work on large corpora
- To make them comparable to other papers
- Basic: no deep thinking, nothing fancy
- Simple: a few unix commands, a few simple scripts
- wc, grep, sed, sort, uniq
- perl, awk (my favorite), perhaps python, ...
- Let's look at some data!
26 BTEC Spa-Eng
- Corpus Statistics
- Corpus and vocabulary size
- Percentage of singletons
- Number of unknown words, out-of-vocabulary (OOV) rate (see the OOV sketch below)
- Sentence length balance
- Text normalization
- Spoken language forms: I'll, we're, but also I will, we are
- Note: this was shown online
27 Tokenization
- Punctuation is attached to words
- Example: you  you,  you.  you?
- All different strings, i.e. all are different words
- Tokenization can be tricky (see the tokenizer sketch below)
- What about punctuation in numbers?
- What about abbreviations? (A5-0104/1999)
- Numbers are not just numbers
- Percentages: 1.2
- Ordinals: 1st, 2.
- Ranges: 2000-2006, 31
- And more: (A5-0104/1999)
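A deliberately simple tokenizer sketch; the regular expression is one possible choice (an assumption, not the rule set used in the class), and real toolkits handle many more cases:

```python
import re

# Split off punctuation, but keep decimal numbers, ranges like 2000-2006,
# and document IDs like A5-0104/1999 as single tokens.
TOKEN = re.compile(r"\d+(?:[.,:\-/]\d+)*%?|\w+(?:[\-/]\w+)*|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Growth was 1.2% in 2000-2006, see A5-0104/1999."))
# -> ['Growth', 'was', '1.2%', 'in', '2000-2006', ',', 'see', 'A5-0104/1999', '.']
```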
28 GigaWord Corpus
- Distributed by LDC
- Collection of newspapers: NYT, Xinhua News, ...
- > 3 billion words
- How large is the vocabulary?
- Some observations in the vocabulary (see the filtering sketch below)
- Number of entries with digits
- Number of entries with special characters
- Number of strange words
- Some observations in corpus
- Sentences with lots of numbers
- Sentences with lots of punctuation
- Sentences with very long words
- Note this was shown online
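A sketch of how such vocabulary observations can be gathered; the threshold, file handling, and report format are assumptions, and a corpus of GigaWord size would call for more careful streaming than this:

```python
import re
import sys
from collections import Counter

def vocab_report(path, max_word_len=30):
    """Count vocabulary entries containing digits or special characters, and very long 'words'."""
    vocab = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            vocab.update(line.split())
    with_digits = sum(1 for w in vocab if re.search(r"\d", w))
    with_special = sum(1 for w in vocab if re.search(r"[^A-Za-z0-9]", w))
    very_long = sum(1 for w in vocab if len(w) > max_word_len)
    print(f"vocabulary size:       {len(vocab)}")
    print(f"entries with digits:   {with_digits}")
    print(f"entries with specials: {with_special}")
    print(f"entries longer than {max_word_len} characters: {very_long}")

if __name__ == "__main__":
    vocab_report(sys.argv[1])
```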
29 And Then the More Interesting Stuff
- POS tagging
- Parsing
- For syntax-based MT systems
- How parallel are the parse trees?
- Word segmentation
- Morphological processing
- In all these tasks the central problem is: how to make the corpus more parallel?