Introduction to Statistical Machine Translation - PowerPoint PPT Presentation
1
Introduction to Statistical Machine Translation
ShihHsiang
2
Reference
  • Brown, Cocke et al., "A Statistical Approach to Machine Translation," Computational Linguistics, 16(2):79-85, 1990
  • Papineni, Roukos et al., "BLEU: a Method for Automatic Evaluation of Machine Translation," Technical Report, IBM Research Division, 2001
  • Chou and Juang, Pattern Recognition in Speech and Language Processing, Chapter 11, CRC Press
  • Some slides are borrowed directly from:
  • Dr. Kevin Knight, University of Southern California
  • Dr. Philipp Koehn, University of Edinburgh
  • Dr. Franz Josef Och, Google

3
The Rosetta Stone (196 BC)
Egyptian hieroglyphs (used from 3300 BC to 400 AD)
Egyptian Demotic (a late cursive script)
Greek (the language of Ptolemy V, ruler of Egypt)
1799: a stone with Egyptian text and its translation into Greek was found → humans could learn how to translate Egyptian
4
Warren Weaver (1947)
"When I look at an article in Russian, I say to myself: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."
5
Interest in MT
  • Commercial interest
  • U.S. has invested in MT for intelligence purposes
  • MT is popular on the web; it is the most used of Google's special features
  • The EU spends more than $1 billion on translation costs each year
  • (Semi-)automated translation could lead to huge
    savings
  • Academic interest
  • One of the most challenging problems in NLP
    research
  • Requires knowledge from many NLP sub-areas, e.g., lexical semantics, parsing, morphological analysis, statistical modeling, etc.
  • Being able to establish links between two
    languages allows for transferring resources from
    one language to another

6
Why It's Challenging
7
Competitions
  • Progress driven by MT competitions
  • NIST/DARPA: yearly campaigns for Arabic-English and Chinese-English news texts, since 2001
  • IWSLT: yearly competitions for Asian languages and Arabic into English, speech travel domain, since 2003
  • WPT/WMT: yearly competitions for European languages, European Parliament proceedings, since 2005
  • An increasing number of statistical MT groups participate
  • Competitions won by statistical systems

8
Major Speech Translations Systems
9
AT&T: How May I Help You?
  • Spanish-to-English
  • MT via a "transnizer"
  • A transnizer is a stochastic finite-state
    transducer that integrates the language model of
    a speech recognizer and the translation model
    into one single finite-state transducer
  • Directly maps source language phones into target
    language word sequences
  • One step instead of two

10
MIT Lincoln Lab
11
NEC
Stand-alone version [Isotani '03]
Client/server version as in [Yamabana, ACL '03]
12
Levels of Transfer
13
Methodologies
  • Word-for-word translation
  • Syntactic transfer
  • Interlingual approaches
  • Example-based
  • Statistical

14
Word-for-word translation
  • Use a machine-readable bilingual dictionary to
    translate each word in a text
  • Advantages
  • Easy to implement, results give a rough idea
    about what the text is about
  • Disadvantages
  • Problems with word order mean that this results in low-quality translations
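A word-for-word translator is little more than a dictionary lookup. The minimal sketch below, using a tiny made-up Spanish-English dictionary, shows both the approach and its word-order weakness:

```python
# Word-for-word translation: look each word up in a bilingual dictionary
# and pass it through unchanged when no entry exists.
# The tiny Spanish-English dictionary is invented for illustration.
BILINGUAL_DICT = {
    "que": "what", "hambre": "hunger", "tengo": "have", "yo": "I",
}

def word_for_word(sentence):
    return " ".join(BILINGUAL_DICT.get(w, w) for w in sentence.lower().split())

print(word_for_word("Que hambre tengo yo"))  # -> "what hunger have I"
```

Note that the output preserves the Spanish word order, which is exactly the low-quality-translation problem the slide describes.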

15
Syntactic transfer
  • It includes three steps
  • Parse the sentence → rearrange constituents → translate the words
  • Advantages
  • Deals with the word-order problem
  • Disadvantages
  • Must construct transfer rules for each language pair that you deal with
  • Sometimes there is a syntactic mismatch, e.g., English word order is subject-verb-object while Japanese order is subject-object-verb
16
Interlingua
  • Assign a logical form to sentences
  • John must not go
  • OBLIGATORY(NOT(GO(JOHN)))
  • John may not go
  • NOT(PERMITTED(GO(JOHN)))
  • Use logical form to generate a sentence in
    another language
  • Advantages
  • Single logical form means that we can translate
    between all languages and only write a
    parser/generator for each language once
  • Disadvantages
  • Difficult to define a single logical form; English words in all capital letters probably won't cut it

17
Example-based MT
  • Fundamental idea
  • People do not translate by doing deep linguistic analysis of a sentence
  • They translate by decomposing a sentence into fragments, translating each of those, and then composing them properly
  • Translate
  • He buys a book on international politics
  • With these examples
  • (He buys) a notebook.
  • (Kare ha) nouto (wo kau).
  • I read (a book on international politics).
  • Watashi ha (kokusaiseiji nitsuite kakareta hon)
    wo yomu
  • → (Kare ha) (kokusaiseiji nitsuite kakareta hon) (wo kau).

18
Example-based MT
  • Challenges
  • Locating similar sentences
  • Aligning sub-sentential fragments
  • Combining multiple fragments of example
    translations into a single sentence
  • Determining when it is appropriate to substitute
    one fragment for another
  • Selecting the best translation out of many
    candidates
  • Advantages
  • Uses fragments of human translations which can
    result in higher quality
  • Disadvantages
  • May have limited coverage depending on the size
    of the example database, and flexibility of
    matching heuristics

19
Statistical MT
  • Find the most probable target sentence given a source (foreign-language) sentence
  • Automatically align words and phrases within
    sentence pairs in a parallel corpus
  • Probabilities are determined automatically by
    training a statistical model using the parallel
    corpus

20
Statistical MT
  • Advantages
  • Has a way of dealing with lexical ambiguity
  • Can deal with idioms that occur in the training
    data
  • Requires minimal human effort
  • Can be created for any language pair that has
    enough training data
  • No need for a staff of linguists or language experts
  • Disadvantages
  • Does not explicitly deal with syntax

21
Example-based MT vs. Statistical MT
  • Both are empirical approaches
  • As opposed to rule-based machine translation
  • EBMT emphasizes learning from examples
  • Often heuristic scoring/learning methods
  • SMT emphasizes making optimal decisions
  • SMT and EBMT remain astonishingly separate research communities
  • SMT researchers often use methods and terminology from speech recognition research
  • Different terminology is used in the two communities

22
Parallel Corpora
  • Collections of texts and their translations into different languages
  • Alignment across languages at various levels
  • Document
  • Section
  • Paragraph
  • Sentence (not necessarily one-to-one)
  • Phrase
  • Word
  • Examples of Parallel Corpora
  • European Parliament Proceedings Parallel Corpus
  • The Bible

23
Statistical MT Systems
(Diagram: a Spanish/English bilingual text and an English text each undergo statistical analysis; Spanish input is mapped to "broken English" candidates, then to fluent English)
Que hambre tengo yo → "What hunger have I" / "Hungry I am so" / "I am so hungry" / "Have I that hunger" → I am so hungry
24
Statistical MT Systems
(Diagram: the Spanish/English bilingual text trains the translation model P(f|e); the English text trains the language model P(e))
Que hambre tengo yo → I am so hungry
Decoding algorithm: ê = argmax_e P(e) · P(f|e)
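The noisy-channel decoding step can be sketched as a search over candidate translations of the Spanish input, scoring each by log P(e) + log P(f|e); all probabilities below are invented for illustration:

```python
import math

# Toy noisy-channel ranking: score each candidate English sentence e by
# log P(e) + log P(f|e) and return the argmax.
# The language-model (lm) and translation-model (tm) values are made up.
candidates = {
    "what hunger have I": {"lm": 1e-9, "tm": 1e-2},
    "hungry I am so":     {"lm": 1e-7, "tm": 1e-3},
    "I am so hungry":     {"lm": 1e-5, "tm": 1e-4},
}

def decode(cands):
    return max(cands,
               key=lambda e: math.log(cands[e]["lm"]) + math.log(cands[e]["tm"]))

print(decode(candidates))  # -> "I am so hungry"
```

The fluent candidate wins even though its translation-model score is lowest, which is the division of labor the slide describes.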
25
Statistical MT Systems
26
Three Problems for Statistical MT
  • Language model
  • Assigns a higher probability to fluent /
    grammatical sentences
  • Estimated using monolingual corpora
  • good English string → high P(e)
  • random word sequence → low P(e)
  • Translation model
  • Assigns higher probability to sentences that have
    corresponding meaning
  • Estimated using bilingual corpora
  • <f, e> that look like translations → high P(f | e)
  • <f, e> that don't look like translations → low P(f | e)
  • Decoding algorithm
  • Given a language model, a translation model, and a new sentence f, find the translation e maximizing P(e) · P(f | e)

27
Translation Model Alignment
  • Source language string f₁ ... f_J
  • Target language string e₁ ... e_I
  • Alignment: a mapping a₁ ... a_J, where a_j = i aligns source word f_j to target word e_i

28
Translation Model Alignment
  • Decomposition without loss of generality:
    P(f₁^J | e₁^I) = Σ_{a₁^J} P(J | e₁^I) · Π_{j=1..J} P(a_j | a₁^{j-1}, f₁^{j-1}, e₁^I) · P(f_j | a₁^j, f₁^{j-1}, e₁^I)

Length Model: P(J | e₁^I)
Alignment Model: P(a_j | a₁^{j-1}, f₁^{j-1}, e₁^I)
Lexicon Model: P(f_j | a₁^j, f₁^{j-1}, e₁^I)
29
IBM Model 1
  • Generative model: break up the translation process into smaller steps
  • Length model
  • Alignment model
  • Lexicon model

(Figure: word-alignment diagrams between Italian "la casa blu" and English "the blue house")
30
How to estimate Lexicon Model?
  • Observation
  • Co-occurring words → potential translations
  • Frequently co-occurring words → likely translations
  • Rarely co-occurring words → unlikely translations
  • Idea
  • Estimate translation probabilities from co-occurrence counts
  • Problem
  • co-occurrences are very noisy

31
Lexicon model estimation with known alignments
  • Haus - house: 2 occurrences
  • P(Haus | house) = 1.0
  • blau - blue: 1
  • blaue - blue: 1
  • P(blau | blue) = 1/2 = 0.5
  • P(blaue | blue) = 1/2 = 0.5
  • P(f | e) = N(f, e) / N(e)

Given alignment information: simple relative frequency
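The relative-frequency estimate can be sketched directly from a list of aligned word pairs; the pairs below reproduce the slide's own counts:

```python
from collections import Counter

# With known word alignments, the lexicon model is a relative frequency:
# P(f | e) = N(f, e) / N(e). The aligned pairs mirror the slide's example.
pairs = [("Haus", "house"), ("Haus", "house"),
         ("blau", "blue"), ("blaue", "blue")]

n_fe = Counter(pairs)                 # joint counts N(f, e)
n_e = Counter(e for _, e in pairs)    # marginal counts N(e)

def p(f, e):
    return n_fe[(f, e)] / n_e[e]

print(p("Haus", "house"))  # -> 1.0
print(p("blau", "blue"))   # -> 0.5
```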
32
Lexicon model estimation with uncertain alignments
  • Haus - house: 1.8 times
  • blaue - house: 0.2 times
  • P(Haus | house) = 1.8 / (1.8 + 0.2) = 0.9
  • P(blaue | house) = 0.2 / (1.8 + 0.2) = 0.1
  • blaue - blue: 0.8
  • das - blue: 0.2
  • blau - blue: 1.0
  • P(blaue | blue) = 0.8 / 2.0 = 0.4
  • P(das | blue) = 0.2 / 2.0 = 0.1
  • P(blau | blue) = 1.0 / 2.0 = 0.5

33
Lexicon model estimation with uncertain alignments
  • N(f, e; a, f₁^J, e₁^I): count of alignments between f and e in sentence pair (f₁^J, e₁^I) under alignment a
  • c(f | e): fractional counts, i.e., counts weighted with the alignment probability

Chicken-and-egg problem: alignments are needed to estimate the lexicon model, and the lexicon model is needed to compute alignment probabilities
34
Lexicon model estimation with uncertain alignments
  • Solution: EM algorithm
  • Iteratively re-estimate parameters given the previous setting
  • Start from uniform parameters
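The iterative re-estimation can be sketched for the IBM Model 1 lexicon in a few lines. The toy German-English corpus below is an assumption for illustration; starting from a uniform lexicon, the fractional counts gradually disambiguate the word correspondences:

```python
from collections import defaultdict

# EM training of an IBM Model 1 lexicon P(f | e): start uniform, collect
# fractional counts weighted by the current alignment posteriors, renormalize.
# The three-sentence toy corpus is invented for illustration.
corpus = [(["das", "Haus"], ["the", "house"]),
          (["das", "Buch"], ["the", "book"]),
          (["ein", "Buch"], ["a", "book"])]

f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization

for _ in range(20):
    count = defaultdict(float)                # fractional counts c(f | e)
    total = defaultdict(float)
    for fs, es in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)    # normalizer over alignments of f
            for e in es:
                c = t[(f, e)] / z             # posterior weight of this link
                count[(f, e)] += c
                total[e] += c
    for (f, e), c in count.items():           # M-step: renormalize
        t[(f, e)] = c / total[e]

print(round(t[("Haus", "house")], 3))         # approaches 1.0
```

Even though "das" co-occurs with "house" in the first sentence pair, the second pair pins "das" to "the", so the mass on (Haus, house) grows toward 1 over the iterations.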

35
More sophisticated models
  • IBM Model 2
  • Adds dependence on absolute word positions
  • can learn for example that words at the beginning
    of a sentence are often also translated at the
    beginning
  • HMM
  • Adds dependence on relative word positions
  • can learn for example that alignments are often
    monotone

36
More sophisticated models
  • IBM Model 3 (+ 4, 5)
  • Adds a new probability distribution p(n | e) for the fertility of words
  • Fertility of e: the number of foreign words that e aligns to
  • Adds soft coverage constraint for English words
  • Context-dependent lexicon model
  • Takes into account word context

37
Phrase-Based Statistical MT
(Alignment example: German "Morgen fliege ich nach Kanada zur Konferenz" ↔ English "Tomorrow I will fly to the conference in Canada", aligned phrase by phrase)
  • Foreign input is segmented into phrases
  • A phrase is any sequence of words
  • Each phrase is probabilistically translated into English
  • P(to the conference | zur Konferenz)
  • P(into the meeting | zur Konferenz)
  • Phrases are probabilistically re-ordered
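A minimal sketch of the phrase-based pipeline, assuming a hypothetical phrase table and greedy longest-match segmentation; it translates monotonically, whereas a real decoder also scores alternative segmentations, reorderings, and a language model:

```python
# Toy monotone phrase-based translation: greedily segment the input into
# the longest phrases found in a phrase table and translate each phrase.
# The phrase table below is invented for illustration.
PHRASE_TABLE = {
    ("morgen",): "tomorrow",
    ("fliege", "ich"): "I will fly",
    ("nach", "kanada"): "to Canada",
    ("zur", "konferenz"): "to the conference",
}

def translate(sentence):
    words = sentence.lower().split()
    out, i = [], 0
    while i < len(words):
        for j in range(len(words), i, -1):    # longest match first
            phrase = tuple(words[i:j])
            if phrase in PHRASE_TABLE:
                out.append(PHRASE_TABLE[phrase])
                i = j
                break
        else:
            out.append(words[i])              # pass through unknown words
            i += 1
    return " ".join(out)

print(translate("Morgen fliege ich zur Konferenz nach Kanada"))
```

Phrases like "fliege ich" → "I will fly" capture local reordering inside the phrase, but the monotone sketch cannot move "nach Kanada" across "zur Konferenz".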

38
Advantages of Phrase-Based
  • Phrases capture local reordering
  • In single-word-based models, reordering must be stored in the alignment model
  • Local context is useful for disambiguation
  • In single-word-based models, only the target language model does disambiguation
  • Phrases are reordered as a whole
  • Works well for non-compositional phrases
  • With a lot of data sometimes whole sentences can
    be covered

39
Evaluation of MT
  • Ideal criterion: user satisfaction
  • Problems
  • Expensive, slow, inconsistent, subjective
  • Problematic to use in system development
  • Goal: automatic, objective evaluation of machine translation quality
  • Idea: compute similarity of MT output with good human translations (reference translations)
  • Hope
  • If MT output is good → similar to good human translations
  • If MT output is bad → very different from human translations
  • Question: which similarity metric?

40
Evaluation of MT
  • Use a set of bilingual test sentences so that,
    for each source sentence, an associated target
    sentence is given
  • WER (word error rate)
  • SER (sentence error rate)
  • PER (position-independent word error rate)
  • without taking the word order into account
  • BLEU (Bilingual Evaluation Understudy)
  • an MT metric based on n-gram precision
  • ROUGE
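WER, the first metric in the list above, is a word-level Levenshtein distance normalized by the reference length; a minimal sketch:

```python
# Word error rate: edit distance (substitutions, insertions, deletions)
# between hypothesis and reference word sequences, divided by the
# reference length.
def wer(hyp, ref):
    h, r = hyp.split(), ref.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)

print(wer("I am hungry", "I am so hungry"))  # one deletion -> 0.25
```

PER would instead compare the two sentences as bags of words, ignoring order.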

41
BLEU (Bilingual Evaluation Understudy)
  • Modified n-gram precision
  • N-gram precision: fraction of the output's n-grams that occur in the references
  • Modified n-gram precision: the same part of a reference cannot be used twice
  • Brevity penalty
  • Penalizes translations that are too short
  • BP = exp(min(1 - r/c, 0))
  • c = length of MT output, r = length of reference translation
  • BLEU = BP · (p₁ · p₂ · p₃ · p₄)^(1/4), i.e., n-grams up to n = 4
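A sentence-level sketch of BLEU with clipped (modified) n-gram counts and the brevity penalty; real implementations aggregate counts over a whole corpus and often smooth zero precisions instead of returning 0:

```python
import math
from collections import Counter

def ngrams(words, n):
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def bleu(hyp, refs, max_n=4):
    h = hyp.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h_ngr = ngrams(h, n)
        total = sum(h_ngr.values())
        # Clip each n-gram count at its maximum count in any reference,
        # so the same part of a reference cannot be matched twice.
        clipped = sum(min(cnt, max(ngrams(r.split(), n)[ng] for r in refs))
                      for ng, cnt in h_ngr.items())
        if total == 0 or clipped == 0:
            return 0.0
        log_prec += math.log(clipped / total) / max_n
    c = len(h)
    r = min((len(ref.split()) for ref in refs), key=lambda L: abs(L - c))
    bp = math.exp(min(1 - r / c, 0))            # brevity penalty
    return bp * math.exp(log_prec)

print(bleu("I am so hungry", ["I am so hungry"]))  # -> 1.0
```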

42
Typical BLEU scores (2005 NIST evaluation data)
  • Arabic-English news translation, 4 references
  • Best statistical (research) system: 51 BLEU
  • (Some) commercial systems: 10-34 BLEU
  • Estimated human score: 63 BLEU
  • Chinese-English news translation, 4 references
  • Best statistical (research) system: 35 BLEU
  • (Some) commercial systems: 15 BLEU
  • Estimated human score: 55 BLEU
  • Approach used to estimate the human BLEU score (given 4 references)
  • Round robin: score one reference against the other 3 references

43
SMT for Spoken Language
  • Spoken language translation is not merely translation of written text containing ASR errors

44
SMT for Spoken Language: Traditional Approach
  • 1-best ASR hypothesis passed to SMT
  • Other ASR hypotheses not considered
  • ASR / SMT systems developed independently
  • Trained using different data
  • Performance optimized for different criteria (WER / BLEU)

Hope: end-to-end system performance is good
45
Tighter Coupling for SLT
46
Outlook: Progress from
  • Better models and training
  • Generalized phrase models (e.g., hierarchical)
  • Long-distance dependencies
  • Topic adaptation
  • Discriminative training with many more features
  • Much more data
  • Monolingual data: > 1 trillion words
  • Bilingual data: > 1 billion words
  • Better automatic machine translation evaluation (BLEU)
  • Better engineering / infrastructure / tools