1
Transfer-based MT
2
Syntactic Transfer-based Machine Translation
  • Direct and Example-based approaches are two ends of a spectrum
  • Recombination of fragments for better coverage
  • What if the matching/transfer is done at the syntactic parse level?
  • Three steps (a toy sketch follows this list):
  • Parse: syntactic parse of the source-language sentence, i.e. a
    hierarchical representation of the sentence
  • Transfer: rules transform the source parse tree into a target
    parse tree, e.g. Subject-Verb-Object → Subject-Object-Verb
  • Generation: regenerate the target-language sentence from the parse
    tree, handling the morphology of the target language
  • Tree structure provides better matching and longer-distance
    transformations than are possible in string-based EBMT.
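
A minimal end-to-end sketch of the three steps on a toy tree. The Tree class, the single SVO → SOV transfer rule, and the tiny lexicon are all illustrative inventions, not part of any system described in these slides.

```python
# Toy illustration of parse -> transfer -> generate (all names invented).

class Tree:
    def __init__(self, label, children=None, word=None):
        self.label = label            # e.g. "S", "NP", "V"
        self.children = children or []
        self.word = word              # set only on leaves

def transfer(node):
    """Recursively rewrite the source tree into a target tree."""
    kids = [transfer(c) for c in node.children]
    # Toy transfer rule: S -> NP V NP becomes S -> NP NP V (SVO -> SOV)
    if node.label == "S" and [k.label for k in kids] == ["NP", "V", "NP"]:
        kids = [kids[0], kids[2], kids[1]]
    return Tree(node.label, kids, node.word)

def generate(node, lexicon):
    """Flatten the target tree to words; a real generator adds morphology."""
    if node.word is not None:
        return [lexicon.get(node.word, node.word)]
    return [w for c in node.children for w in generate(c, lexicon)]

# The parse step is assumed already done; we hand-build the source tree.
src = Tree("S", [Tree("NP", word="I"), Tree("V", word="use"),
                 Tree("NP", word="card")])
toy_lexicon = {"I": "watashi-wa", "use": "tsukau", "card": "kaado-o"}
print(" ".join(generate(transfer(src), toy_lexicon)))
# -> watashi-wa kaado-o tsukau
```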

3
Examples of SynTran-MT
[Figure: parallel parse trees aligning Spanish "ajá quiero usar mi
tarjeta de crédito" with English "yeah I wanna use my credit card"]
  • Mostly parallel parse structures
  • Might have to insert words: pronouns, morphological particles

4
Example of SynTran-MT (2)
  • Pros
  • Allows for structure transfer
  • Re-orderings are typically restricted to the
    parent-child nodes.
  • Cons
  • Transfer rules must be written for each language pair (N² sets of
    rules)
  • Hard to reuse rules when one of the languages is
    changed

5
Lexical-semantic Divergences
  • Linguistic Divergences
  • Structural differences between languages
  • Categorical Divergence
  • Translation of words in one language into words
    that have different parts of speech in another
    language
  • To be jealous
  • Tener celos (To have jealousy)

6
Issues
  • Linguistic Divergences
  • Conflational Divergence
  • Translation of two or more words in one language
    into one word in another language
  • To kick
  • Dar una patada (Give a kick)

7
Issues
  • Linguistic Divergences
  • Structural Divergence
  • Realization of verb arguments in different
    syntactic configurations in different languages
  • To enter the house
  • Entrar en la casa (Enter in the house)

8
Issues
  • Linguistic Divergences
  • Head-Swapping Divergence
  • Inversion of a structural-dominance relation
    between two semantically equivalent words
  • To run in
  • Entrar corriendo (Enter running)

9
Issues
  • Linguistic Divergences
  • Thematic Divergence
  • Realization of verb arguments that reflect
    different thematic to syntactic mapping orders
  • I like grapes
  • Me gustan uvas (To-me please grapes)

10
Divergence counts from Bonnie Dorr
  • 32% of sentences in the UN Spanish/English Corpus (5K)

11
Transfer rules
12
Syntax-driven statistical machine translation
Slides from Deyi Xiong, CAS, Beijing
13
Why syntax-based SMT
  • Weakness of phrase-based SMT
  • Long-distance reordering is limited to phrase-level reordering
  • Discontinuous phrases
  • Generalization
  • Other methods using syntactic knowledge
  • Word alignment integrating syntactic constraints
  • Pre-order source sentences
  • Rerank n-best output of translation models

14
SSMT based on formal structures
  • Compared with phrase-based SMT:
  • Translation proceeds hierarchically
  • The target structures finally generated are not necessarily real
    linguistic structures, but they
  • Make long-distance reordering more feasible
  • Introduce non-terminals/variables
  • Handle discontinuous phrases, e.g. "put x on" paired with a
    discontinuous source phrase containing x
  • Generalize better

15
SCFG
  • Formulation: two CFGs plus their correspondences, i.e. a pairing
    of productions and of the nonterminals in their right-hand sides
    (a toy sketch follows)
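
As a concrete illustration, here is a minimal sketch of an SCFG as paired productions with linked nonterminal slots. The grammar fragment is the Yamada & Knight English/Japanese example that appears later in these slides; the termination trick (each structural rule fires once per path) is purely a toy device to make the sample derivation deterministic.

```python
# Each rule: (lhs, source RHS, target RHS); ("N", i) is a linked
# nonterminal slot with index i, bare strings are terminals.
RULES = [
    ("VB",  [("PRP", 1), ("VB1", 2), ("VB2", 3)],
            [("PRP", 1), ("VB2", 3), ("VB1", 2)]),
    ("VB2", [("VB", 1), ("TO", 2)], [("TO", 2), ("VB", 1), "ga"]),
    ("TO",  [("TO", 1), ("NN", 2)], [("NN", 2), ("TO", 1)]),
    ("PRP", ["he"], ["kare", "ha"]),
    ("VB1", ["adores"], ["daisuki", "desu"]),
    ("VB",  ["listening"], ["kiku", "no"]),
    ("TO",  ["to"], ["wo"]),
    ("NN",  ["music"], ["ongaku"]),
]

def derive(sym, spent=frozenset()):
    """Expand sym on both sides at once; returns (source, target) words."""
    for lhs, src, tgt in RULES:
        if lhs != sym:
            continue
        structural = any(isinstance(x, tuple) for x in src)
        if structural and sym in spent:
            continue  # toy control so the sample derivation terminates
        # Expand each linked nonterminal ONCE; both sides share the result.
        subs = {i: derive(nt, spent | {sym})
                for nt, i in (x for x in src if isinstance(x, tuple))}
        def side(rhs, k):
            return [w for x in rhs
                    for w in (subs[x[1]][k] if isinstance(x, tuple) else [x])]
        return side(src, 0), side(tgt, 1)
    raise KeyError(sym)

e, j = derive("VB")
print(" ".join(e))  # he adores listening to music
print(" ".join(j))  # kare ha ongaku wo kiku no ga daisuki desu
```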

16
SCFG an example
17
SCFG derivation
18
ITG
  • Synchronous CFGs in which the links between nonterminals in a
    production are restricted to two possible configurations:
  • Inverted
  • Straight
  • Any ITG can be converted into a synchronous CFG
    of rank two.

19
BTG
20
ITG as reordering constraint
  • Two kinds of reordering
  • Inverted
  • Straight
  • Coverage (a pattern-check sketch follows this list)
  • Wu (1997): "been unable to find real examples of cases where
    alignments would fail under this constraint, at least in lightly
    inflected languages, such as English and Chinese"
  • Wellington et al. (2006): found counterexamples in at least 5% of
    Chinese/English sentence pairs
  • Weakness
  • No strong mechanism for determining which order is better,
    inverted or straight
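
The coverage question above can be checked mechanically: a permutation is expressible with straight/inverted binary rules exactly when it avoids the two "inside-out" patterns 2413 and 3142; the permutation (3, 1, 4, 2) discussed later in the Satta slides is exactly such a case. A brute-force sketch:

```python
# A minimal sketch of the ITG reordering constraint as a pattern check.
from itertools import combinations

def itg_expressible(perm):
    """True if perm avoids the two forbidden 4-element patterns."""
    bad = [(1, 3, 0, 2), (2, 0, 3, 1)]   # 2413 and 3142, zero-based
    for idxs in combinations(range(len(perm)), 4):
        vals = [perm[i] for i in idxs]
        rank = [sorted(vals).index(v) for v in vals]  # pattern of this subset
        if tuple(rank) in bad:
            return False
    return True

print(itg_expressible([1, 0, 3, 2]))  # True: two inverted blocks
print(itg_expressible([2, 0, 3, 1]))  # False: the 3142 "inside-out" case
```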

21
Chiang05 Hierarchical Phrase-based Model (HPM)
  • Rules
  • Glue rule
  • Model: log-linear
  • Decoder: CKY

22
Chiang05 rule extraction
23
Chiang05 rule extraction restrictions
  • Initial base rule: at most 15 words on the French side
  • Final rule: at most 5 symbols on the French side
  • At most two non-terminals on each side, nonadjacent
  • At least one aligned terminal pair (see the sketch after this list)
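
A minimal sketch of these filters as a predicate. The encoding (French side as a token list with integer nonterminal slots) is an assumption for illustration, and the 15-word cap on initial phrases would be enforced earlier, at phrase-pair extraction time.

```python
# Toy filter for extracted rules; alignment is a set of terminal links.

def keep_rule(french_side, n_nonterminals_english, aligned_pairs):
    nts = [i for i, s in enumerate(french_side) if isinstance(s, int)]
    if len(french_side) > 5:                        # final rule: <=5 French symbols
        return False
    if len(nts) > 2 or n_nonterminals_english > 2:  # <=2 nonterminals per side
        return False
    if any(b - a == 1 for a, b in zip(nts, nts[1:])):  # nonadjacent nonterminals
        return False
    if not aligned_pairs:                           # >=1 aligned terminal pair
        return False
    return True

# "ne X pas": one nonterminal, one aligned terminal pair -> kept
print(keep_rule(["ne", 1, "pas"], 1, {(0, 1)}))   # True
print(keep_rule([1, 2], 2, {(0, 0)}))             # False: adjacent nonterminals
```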

24
Chiang05 Model
  • Log-linear combination of rule features and a language model

25
Chiang05 decoder
26
SSMT based on phrase structures
  • Using grammars with linguistic knowledge
  • The grammars are based on SCFG
  • Two categories
  • Tree-string
  • Tree-to-string
  • String-to-tree
  • Tree-tree

27
Yamada & Knight 2001, 2003
28
Yamada's work vs. SCFG
  • Insertion operation
  • A → (w A1, A1)
  • Reordering operation
  • A → (A1 A2 A3, A1 A3 A2)
  • Translating operation
  • A → (x, y)  (a toy sketch of the three operations follows)
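
A minimal top-down sketch of the three channel operations applied to a source parse tree. The tables and the tree encoding here are toy stand-ins for the learned reorder/insert/translate probability tables, not the actual model.

```python
# Toy stand-ins for the r-table, n-table and t-table.
REORDER   = {("PRP", "VB1", "VB2"): (0, 2, 1)}   # child permutation
INSERT    = {"VB2": ("right", "ga")}             # optional function word
TRANSLATE = {"he": "kare", "adores": "daisuki"}  # leaf translation

def channel(node):
    """node = (label, children) for internal nodes, (label, word) for leaves."""
    label, rest = node
    if isinstance(rest, str):                     # leaf: translating operation
        out = [TRANSLATE.get(rest, rest)]
    else:                                         # internal: reordering operation
        labels = tuple(c[0] for c in rest)
        order = REORDER.get(labels, range(len(rest)))
        out = [w for i in order for w in channel(rest[i])]
    side, word = INSERT.get(label, (None, None))  # insertion operation
    return ([word] if side == "left" else []) + out + \
           ([word] if side == "right" else [])

tree = ("VB", [("PRP", "he"), ("VB1", "adores"), ("VB2", [("VB", "sings")])])
print(" ".join(channel(tree)))  # -> kare sings ga daisuki
```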

29
Yamada: weaknesses
  • Only single-level mappings, though reordering may need to span
    multiple levels; Yamada's fix is to flatten the trees
  • Word-based translation; Yamada's fix is phrasal leaves

30
Galley et al. 2004, 2006
  • Translation model incorporates syntactic structure on the
    target-language side
  • Trained by learning translation rules from bilingual data
  • The decoder uses a parser-like method to create syntactic trees as
    output hypotheses

31
Translation rules
  • Translation rules
  • Target side: multi-level subtrees
  • Source side: continuous or discontinuous phrases
  • Types of translation rules
  • Translating source phrases into target chunks
  • NPB(PRP/I) ↔ [Chinese word]
  • NP-C(NPB(DT/this NN/address)) ↔ [Chinese phrase]

32
Types of translation rules
  • Rules with variables
  • NP-C(NPB(PRP/my x0:NN)) ↔ [Chinese] x0
  • PP(TO/to NP-C(NPB(x0:NNS NNP/park))) ↔ [Chinese] x0 [Chinese]
  • Rules that combine previously translated results together
  • VP(x0:VBZ x1:NP-C) ↔ x1 x0
  • takes a noun phrase followed by a verb, switches their order, then
    combines them into a new verb phrase (a sketch follows this list)
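
A minimal sketch of applying such a combination rule at decoding time, assuming the children's translations are already available; the helper and the toy data are invented for illustration.

```python
# Apply VP(x0:VBZ x1:NP-C) <-> x1 x0: glue child translations in target order.

def combine(label, target_order, translated):
    """Returns (label, words) for the newly built constituent."""
    return label, [w for i in target_order for w in translated[i]]

translated = {0: ["eats"], 1: ["the", "apple"]}   # x0:VBZ, x1:NP-C already done
print(combine("VP", [0, 1], translated))
# -> ('VP', ['eats', 'the', 'apple']): source showed the NP first,
#    the rule put the verb first in the English output.
```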

33
Rule extraction
  • Word-align a parallel corpus
  • Parse the target side
  • Extract translation rules
  • Minimal rules: cannot be decomposed further
  • Composed rules: built by composing minimal rules
  • Estimate probabilities

34
Rule extraction
Minimal rule
35
Composed rules
36
Format is Expressive
[Figure: example rules showing the format is expressive:
non-constituent phrases, phrasal translation, and non-contiguous
phrases ("poner x0" ↔ "put x0 on", "hay x0" ↔ "there is x0",
"está cantando" ↔ "is singing"); multilevel re-ordering, lexicalized
re-ordering, and context-sensitive word insertion (e.g. of "the",
"of")]
Knight & Graehl, 2005
37
Decoder
  • Probabilistic CKY-style parsing algorithm with beams
  • Results in an English syntax tree corresponding to the Chinese
    sentence
  • Guarantees the output has some kind of globally coherent syntactic
    structure

38
Decoding example
[Slides 38-42: step-by-step decoding figures]
43
Marcu et al. 2006
  • SPMT
  • Integrating non-syntactifiable phrases
  • Multiple features for each rule
  • Decoding with multiple models

44
SSMT based on phrase structures
  • Two categories
  • Tree-string
  • String-to-tree
  • Tree-to-string
  • Tree-tree

45
Tree-to-string
  • Liu et al. 2006
  • Tree-to-string alignment template model

46
TAT
47
TAT extraction
  • Constraints
  • Source trees must be complete subtrees of the parse
  • Must be consistent with the word alignment
  • Restrictions on extraction (see the sketch after this list)
  • Both the first and last symbols in the target string must be
    aligned to some source symbols
  • The height of T(z) is limited to no greater than h
  • The number of direct descendants of a node of T(z) is limited to
    no greater than c
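
A minimal sketch of checking these restrictions; the (label, children) tree encoding and the default values of h and c are assumptions for illustration.

```python
# Toy TAT extraction filter.

def height(t):
    label, children = t
    return 1 if not children else 1 + max(height(c) for c in children)

def max_branch(t):
    label, children = t
    if not children:
        return 0
    return max([len(children)] + [max_branch(c) for c in children])

def keep_tat(subtree, tgt_aligned, h=3, c=4):
    """tgt_aligned[i] is True iff target symbol i is aligned to the source."""
    return (height(subtree) <= h and max_branch(subtree) <= c
            and tgt_aligned[0] and tgt_aligned[-1])

t = ("NP", [("DT", []), ("NN", [])])
print(keep_tat(t, [True, False, True]))  # True: boundary symbols aligned
```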

48
TAT Model
49
Decoding
50
Tree-to-string vs. string-to-tree
  • Tree-to-string
  • Integrates source structures into translation and reordering
  • The output may not be grammatical
  • String-to-tree
  • Guarantees the output has some kind of globally coherent syntactic
    structure
  • Cannot use any knowledge from the source structures

51
SSMT based on phrase structures
  • Two categories
  • Tree-string
  • String-to-tree
  • Tree-to-string
  • Tree-tree

52
Tree-Tree
  • Synchronous tree-adjoining grammar (STAG)
  • Synchronous tree substitution grammar (STSG)

53
STAG
54
STAG derivation
55
STSG
56
STSG elementary trees
57
Dependency structures
[Figure: (a) a phrase-structure tree (IP, VP, NP, ADJP nodes with POS
tags NN, VV, NR, JJ) and (b) the corresponding dependency structure
for the same Chinese sentence]
58
For MT: dependency structures vs. phrase structures
  • Advantages of dependency structures over phrase
    structures for machine translation
  • Inherent lexicalization
  • Meaning-relative
  • Better representation of divergences across
    languages

59
SSMT based on dependency structures
  • Lin 2004
  • A Path-based Transfer Model for Machine
    Translation
  • Quirk et al. 2005
  • Dependency Treelet Translation: Syntactically Informed Phrasal SMT
  • Ding et al. 2005
  • Machine Translation Using Probabilistic
    Synchronous Dependency Insertion Grammars

60
Lin 2004
  • Translation model trained by learning transfer rules from a
    bilingual corpus in which the source-language sentences are parsed
  • Decoding: finding the minimum path covering of the source-language
    dependency tree

61
Lin 2004 path
62
Lin 2004 transfer rule
63
Quirk et al. 2005
  • Translation model trained by learning treelet pairs from a
    bilingual corpus in which the source-language sentences are parsed
  • Decoding: CKY-style

64
Treelet pairs
65
Quirk 2005 decoding
66
Ding 2005
67
Summary
68
State-of-the-art machine translation systems are based on statistical
models rooted in the theory of formal grammars/automata. Translation
models based on finite-state devices cannot easily model translations
between languages with strong differences in word ordering. Recently,
several models based on context-free grammars have been investigated,
borrowing from the theory of compilers the idea of synchronous
rewriting.
Slides from G. Satta
69
Translation models based on synchronous rewriting:
Inversion Transduction Grammars (Wu, 1997)
Head Transducer Grammars (Alshawi et al., 2000)
Tree-to-string models (Yamada & Knight, 2001; Galley et al., 2004)
Loosely tree-based model (Gildea, 2003)
Multi-Text Grammars (Melamed, 2003)
Hierarchical phrase-based model (Chiang, 2005)
We use synchronous CFGs to study formal properties of all these.
70
A synchronous context-free grammar (SCFG) is based on three
components: a context-free grammar (CFG) for the source language, a
CFG for the target language, and a pairing relation on the productions
of the two grammars and on the nonterminals in their right-hand sides.
71
Example (Yamada & Knight, 2001)
English CFG:
VB  → PRP(1) VB1(2) VB2(3)
VB2 → VB(1) TO(2)
TO  → TO(1) NN(2)
PRP → he
VB1 → adores
VB  → listening
TO  → to
NN  → music

Japanese CFG:
VB  → PRP(1) VB2(3) VB1(2)
VB2 → TO(2) VB(1) ga
TO  → NN(2) TO(1)
PRP → kare ha
VB1 → daisuki desu
VB  → kiku no
TO  → wo
NN  → ongaku
72
Example (cont'd)
73
A pair of CFG productions in an SCFG is called a synchronous
production. An SCFG generates pairs of trees/strings, where each
component is a translation of the other. An SCFG can be extended with
probabilities: each pair of productions is assigned a probability, and
the probability of a pair of trees is the product of the probabilities
of the synchronous productions involved.
74
The membership problem (Wu, 1997) for SCFGs is defined as follows.
Input: an SCFG and a pair of strings w1, w2. Output: Yes/No, depending
on whether w1 translates into w2 under the SCFG. Applications are in
segmentation, word alignment and bracketing of parallel corpora. The
assumption that the SCFG is part of the input is made here to
investigate the dependency of the problem's complexity on grammar
size.
75
Result: the membership problem for SCFGs is NP-complete. The proof
uses SCFG derivations to explore the space of consistent truth
assignments that satisfy a source 3SAT instance. Remark: the result
transfers to (Yamada & Knight, 2001), (Gildea, 2003) and (Melamed,
2003), which are at least as powerful as SCFG.
76
  • Remarks (cont'd)
  • The problem can be solved in polynomial time if:
  • the input grammar is fixed or the production length is bounded
    (Melamed, 2004)
  • Inversion Transduction Grammars (Wu, 1997)
  • Head Transducer Grammars (Alshawi et al., 2000)
  • For NLP applications, it is more realistic to assume a fixed
    grammar and a varying input string

77
Providing an exponential-time lower bound for the membership problem
would amount to showing P ≠ NP. But we can show such a lower bound if
we make some assumptions on the class of algorithms and data
structures used to solve the problem. Result: if chart-parsing
techniques are used to solve the membership problem for SCFGs, the
number of partial analyses obtained grows exponentially with the
production length of the input grammar.
78
Chart parsing for CFGs works by combining completed constituents with
partial analyses:
A → B1 B2 B3 … Bn
Three indices are used to process each combination, for a total of
O(n³) possible combinations that must be checked, where n is the
length of the input string.
79
Consider the synchronous production
[ A → B(1) B(2) B(3) B(4) ,  A → B(3) B(1) B(4) B(2) ]
representing the permutation (3, 1, 4, 2).
80
When applying chart parsing, there is no way to
keep partial analyses contiguous
81
The proof of our result generalizes the previous observations. We show
that, for some worst-case permutations of length q, any combination
strategy we choose leads to a number of indices growing with order at
least sqrt(q). Then, for SCFGs of size q, sqrt(q) is an asymptotic
lower bound for the membership problem when chart-parsing algorithms
are used.
82
A probabilistic SCFG provides the probability that tree t1 translates
into tree t2: Pr(t1, t2). Accordingly, we can define the probability
that string w1 translates into string w2:
Pr(w1, w2) = Σ_{t1 ⇒ w1, t2 ⇒ w2} Pr(t1, t2)
and the probability that string w translates into tree t:
Pr(w, t) = Σ_{t1 ⇒ w} Pr(t1, t)
83
The string-to-tree translation problem for probabilistic SCFGs is
defined as follows. Input: a probabilistic SCFG and a string w.
Output: the tree t such that Pr(w, t) is maximized. This has
applications in machine translation. Again, the assumption that the
SCFG is part of the input is made to investigate the dependency of the
problem's complexity on grammar size.
84
Result: the string-to-tree translation problem for probabilistic SCFGs
(summing over possible source trees) is NP-hard. The proof reduces
from the consensus problem: strings generated by a probabilistic
finite automaton or hidden Markov model have probabilities defined as
a sum of the probabilities of several paths, and maximizing such a
summation is NP-hard (Casacuberta & de la Higuera, 2000; Lyngsø &
Pedersen, 2002).
85
Remarks: the source of the complexity of the problem comes from the
fact that several source trees can be translated into the same target
tree. The result persists if there is a constant bound on the length
of synchronous productions. Open problem: can the problem be solved in
polynomial time if the probabilistic SCFG is fixed?
86
Learning Non-Isomorphic Tree Mappings for Machine
Translation
[Figure: dependency trees for "wrongly report events to-John" and
"him misinform of the events", with aligned little trees marked]
Slides from J. Eisner
87
Syntax-Based Machine Translation
  • Previous work assumes essentially isomorphic
    trees
  • Wu 1995, Alshawi et al. 2000, Yamada & Knight 2000
  • But trees are not isomorphic!
  • Discrepancies between the languages
  • Free translation in the training data

88
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English.
beaucoup d'enfants donnent un baiser à Sam ↔ kids kiss Sam quite often
89
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English. A possible alignment is
shown in orange.
[Figure: the aligned trees; French donnent (give), à (to), baiser
(kiss), un (a), beaucoup (lots), d' (of), enfants (kids) vs. English
kiss, Sam, kids, quite, often]
beaucoup d'enfants donnent un baiser à Sam ↔ kids kiss Sam quite often
90
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English. A possible alignment is
shown in orange. A much worse alignment ...
[Figure: the same trees with a much worse alignment]
beaucoup d'enfants donnent un baiser à Sam ↔ kids kiss Sam quite often
91
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English. A possible alignment is
shown in orange.
[Figure: the same trees and alignment as before]
beaucoup d'enfants donnent un baiser à Sam ↔ kids kiss Sam quite often
92
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English. A possible alignment is
shown in orange. Alignment shows how trees are
generated synchronously from little trees ...
beaucoup d'enfants donnent un baiser à Sam ↔ kids kiss Sam quite often
93
Grammar = Set of Elementary Trees
[Slides 93-98: figures building up the set of elementary tree pairs]
99
Probability model similar to PCFG
Probability of generating training trees T1, T2 with alignment A
(a numeric sketch follows):
P(T1, T2, A) = Π p(t1, t2, a)
i.e. the product of the probabilities of the little trees that are
used
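
A tiny numeric sketch of this product, with made-up probabilities for three little tree pairs:

```python
# Toy: the aligned tree pair decomposes into three little tree pairs,
# and the joint probability is just the product of their probabilities.
little_tree_pair_probs = [0.4, 0.5, 0.25]   # p(t1, t2, a), invented values

p_joint = 1.0
for p in little_tree_pair_probs:
    p_joint *= p
print(p_joint)   # -> 0.05
```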
100
Form of model of big tree pairs
Joint model Pθ(T1, T2).
Wise to use the noisy-channel form Pθ(T1 | T2) · Pθ(T2):
Pθ(T2) could be trained on zillions of target-language trees, while
Pθ(T1 | T2) must be trained on paired trees (hard to get).
But any joint model will do.
In synchronous TSG, an aligned big tree pair is generated by choosing
a sequence of little tree pairs:
P(T1, T2, A) = Π p(t1, t2, a)
101
Maxent model of little tree pairs
p(t1, t2, a) is a maxent model with features such as:
  • report+wrongly → misinform? (use dictionary)
  • report → misinform? (at root)
  • wrongly → misinform?
  • verb incorporates adverb child?
  • verb incorporates child 1 of 3?
  • children 2, 3 switch positions?
  • common tree sizes & shapes?
  • ... etc. ... (a scoring sketch follows this list)
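
A minimal sketch of how such binary features feed a log-linear (maxent) score; the feature names, weights, and candidate pairs below are invented for illustration.

```python
# Toy log-linear scoring of little tree pairs: p proportional to exp(theta.f).
import math

theta = {"dict:report+wrongly->misinform": 2.1,
         "verb_incorporates_adverb_child": 0.7,
         "children_2_3_switch": -0.3}

def score(features):                       # unnormalized log-linear score
    return math.exp(sum(theta.get(f, 0.0) for f in features))

cands = {"misinform": ["dict:report+wrongly->misinform",
                       "verb_incorporates_adverb_child"],
         "report":    ["children_2_3_switch"]}
z = sum(score(f) for f in cands.values())  # normalize over the candidates
probs = {k: score(f) / z for k, f in cands.items()}
print(probs)  # the dictionary feature makes "misinform" far more likely
```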

102
Inside Probabilities
[Figure: the aligned tree pair decomposed into little trees; β(·)
denotes the inside probability of an aligned node pair]
103
Inside Probabilities
[Figure: same decomposition; only O(n²) aligned node pairs (c1, c2)
need inside probabilities β(c1, c2)]
104
P(T1, T2, A) = Π p(t1, t2, a)
  • Alignment: find A to max Pθ(T1, T2, A)
  • Decoding: find T2, A to max Pθ(T1, T2, A)
  • Training: find θ to max Σ_A Pθ(T1, T2, A)
  • Do everything on little trees instead!
  • Only need to train & decode a model of pθ(t1, t2, a)
  • But we are not sure how to break the big tree up correctly
  • So try all possible little trees & all ways of combining them, by
    dynamic programming

105
Alignment Pseudocode
  • for each node c1 of T1 (bottom-up)
  • for each possible little tree t1 rooted at c1
  • for each node c2 of T2 (bottom-up)
  • for each possible little tree t2 rooted at c2
  • for each matching a between frontier nodes of t1 and t2
  • p = p(t1, t2, a)
  • for each pair (d1, d2) of frontier nodes matched by a
  • p = p · β(d1, d2)  // inside probability of kids
  • β(c1, c2) += p     // our inside probability
  • Nonterminal states are used in practice but not shown here
  • For EM training, also find outside probabilities
    (a runnable sketch follows this list)
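
A runnable simplification of this loop, assuming the smallest possible case: each little tree is one node plus its children as frontier, and matchings are full permutations of the frontier. Real systems also enumerate deeper little trees and nonterminal states; this is illustrative only.

```python
# Toy inside pass over a pair of dependency trees (dicts with "label"
# and "children"); pair_prob is a stand-in for p(t1, t2, a).
from itertools import permutations

def inside(n1, n2, beta, pair_prob):
    """beta[(id(n1), id(n2))] = inside probability that n1 aligns to n2."""
    key = (id(n1), id(n2))
    if key in beta:
        return beta[key]
    total = 0.0
    k1, k2 = n1["children"], n2["children"]
    if len(k1) == len(k2):
        for a in permutations(range(len(k2))):   # matchings of frontier nodes
            p = pair_prob(n1["label"], n2["label"], a)
            for i, j in enumerate(a):            # multiply in the kids' betas
                p *= inside(k1[i], k2[j], beta, pair_prob)
            total += p
    beta[key] = total
    return total

leaf = lambda w: {"label": w, "children": []}
t1 = {"label": "report", "children": [leaf("events"), leaf("wrongly")]}
t2 = {"label": "misinform", "children": [leaf("him"), leaf("of-events")]}
uniform = lambda l1, l2, a: 0.5                  # toy little-tree-pair model
print(inside(t1, t2, {}, uniform))               # -> 0.25
```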

106
An MT Architecture
[Diagram: a dynamic-programming engine underlies both the Decoder,
which scores all alignments between a big tree T1 and a forest of big
trees T2, and the Trainer, which scores all alignments of two big
trees T1, T2. Both sit on a probability model pθ(t1, t2, a) of little
trees, which scores little tree pairs, proposes translations t2 of a
little tree t1, and updates the parameters θ]
107
Related Work
  • Synchronous grammars (Shieber & Schabes 1990)
  • Statistical work has allowed only 1:1 (isomorphic trees)
  • Stochastic inversion transduction grammars (Wu 1995)
  • Head transducer grammars (Alshawi et al. 2000)
  • Statistical tree translation
  • Noisy-channel model (Yamada & Knight 2000)
  • Infers the tree: trains on (string, tree) pairs, not (tree, tree)
    pairs
  • But again, allows only 1:1, plus 1:0 at the leaves
  • Data-oriented translation (Poutsma 2000)
  • Synchronous DOP model trained on already-aligned trees
  • Statistical tree generation
  • Similar to our decoding: construct a forest of appropriate trees,
    pick by highest probability
  • Dynamic programming search in a packed forest (Langkilde 2000)
  • Stack decoder (Ratnaparkhi 2000)

108
What Is New Here?
  • Learning full elementary tree pairs, not rule
    pairs or subcat pairs
  • Previous statistical formalisms have basically
    assumed isomorphic trees
  • Maximum-entropy modeling of elementary tree pairs
  • New, flexible formalization of synchronous Tree
    Subst. Grammar
  • Allows either dependency trees or
    phrase-structure trees
  • Empty trees permit insertion and deletion
    during translation
  • Concrete enough for implementation (cf. informal
    previous descriptions)
  • TSG is more powerful than CFG for modeling trees,
    but faster than TAG
  • Observation that dynamic programming is
    surprisingly fast
  • Find all possible decompositions into aligned
    elementary tree pairs
  • O(n²) if both input trees are fully known and elementary tree size
    is bounded