Title: Learning Dependency Translation Models as Collections of Finite-State Head Transducers
1 Learning Dependency Translation Models as Collections of Finite-State Head Transducers
Machine Translation Seminar, Winter 2006
- Alshawi, H., Bangalore, S., Douglas, S.
- ACL, 2000
Presenter: Yow-Ren Chiang
2 Overview
- Weighted head transducers: finite-state machines
- perform middle-out string transduction
- more expressive than left-to-right FSTs
- Dependency transduction models
- collections of weighted head transducers that are applied hierarchically
- A dynamic programming search algorithm
- finds the optimal transduction of an input string
- A method for automatically training a dependency transduction model
- from a set of input-output example strings
- searches for hierarchical alignments of the training examples, guided by correlation statistics
- constructs the transitions of head transducers that are consistent with these alignments
- Experiments
- applying the training method to translation from English to Spanish and Japanese
3 Head Transducers
- Sub-topic 1
- Weighted Finite-State Head Transducers
- Sub-topic 2
- Relationship to Standard FSTs
4 Weighted Finite-State Head Transducers
- Quintuple ⟨W, V, Q, F, T⟩
- W: an alphabet of input symbols
- V: an alphabet of output symbols
- Q: a finite set of states q0, ..., qs
- F: a set of final states
- T: a finite set of state transitions
5 Transition
Weighted Finite-State Head Transducers
- A transition from state q to state q′ has the form
- ⟨q, q′, w, v, α, β, c⟩
- w ∈ W or w = ε (the empty string)
- v ∈ V or v = ε (the empty string)
- α: input position
- β: output position
- c: weight or cost of the transition
- Head transition
- ⟨q, q′, w0, v0, 0, 0, c⟩
6 Transition (cont'd)
Weighted Finite-State Head Transducers
- Notional input and output tapes
- Reading w from square α on the input tape
- if input was already read from position α:
- if α < 0, w is taken from the next unread square to the left of α
- if α ≥ 0, from the next unread square to the right of α
- Writing v to square β on the output tape
- if square β is occupied:
- if β < 0, v is written to the next empty square to the left of β
- if β ≥ 0, to the next empty square to the right of β
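The spill rule above can be sketched as a tiny tape simulation (my own illustration; the paper defines the tapes only notionally):

```python
# A minimal sketch of the head transducer's notional output-tape semantics:
# write v at square beta; if beta is occupied, spill to the nearest empty
# square to the left (beta < 0) or to the right (beta >= 0).

def write_symbol(tape, beta, v):
    """Write v at square beta of `tape` (a dict: position -> symbol).
    Returns the position actually written."""
    pos = beta
    if pos in tape:                      # square already occupied
        step = -1 if beta < 0 else 1     # spill direction depends on sign
        while pos in tape:
            pos += step                  # next empty square in that direction
    tape[pos] = v
    return pos

tape = {}
write_symbol(tape, 0, "v0")   # head word lands at square 0
write_symbol(tape, 1, "v1")   # right dependent at square 1
write_symbol(tape, 1, "v2")   # square 1 taken -> spills right to square 2
write_symbol(tape, -1, "v3")  # left dependent at square -1
```

The output string is then read off the tape in increasing position order.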
7 Transition Symbols and Positions
Weighted Finite-State Head Transducers
8 Head Transition
Weighted Finite-State Head Transducers
- Head transition: ⟨q, q′, w0, v0, 0, 0, c⟩
- w0: a symbol of the input string (not necessarily the leftmost)
- v0: a symbol of the output string
- α = 0: w0 is read from square 0 of the input tape
- β = 0: v0 is written to square 0 of the output tape
9 Cost (weight)
Weighted Finite-State Head Transducers
- The cost of a derivation
- the sum of the costs of its transitions
- String-to-string transduction function
- maps an input string to the output string of the lowest-cost valid derivation
- Not defined when
- there are no derivations, or
- there are multiple outputs with the same minimal cost
10 Relationship to Standard FSTs
- A traditional left-to-right transducer can be simulated by a head transducer
- starting at the leftmost input symbol
- setting the positions of the first transition taken to α = 0 and β = 0
- setting the positions of each subsequent transition taken to α = 1 and β = 1
- etc.
11 Head Transducers Are More Expressive
- Reverse a string of arbitrary length
- Convert a palindrome of arbitrary length into one of its component halves
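String reversal is a good illustration of the extra expressive power: read the head at square 0, then repeatedly read the next unread input square to the right (α = 1) while writing to the next empty output square to the left (β = −1). A left-to-right FST cannot do this with finitely many states. A sketch (my own illustration, not code from the paper):

```python
# Simulate a two-state head transducer that reverses its input:
#   head transition <q0, q1, w, w, 0, 0, c>  reads/writes the first symbol
#   loop transition <q1, q1, w, w, 1, -1, c> reads rightward, writes leftward

def reverse_by_head_transducer(s):
    out = {}                 # output tape: position -> symbol
    out[0] = s[0]            # head transition places first symbol at square 0
    pos = 0
    for w in s[1:]:          # loop transition for each remaining symbol
        pos -= 1             # beta = -1: next empty square to the left
        out[pos] = w
    return "".join(out[p] for p in sorted(out))

print(reverse_by_head_transducer("abc"))  # -> "cba"
```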
12 More Expressive: Example
13 Dependency Transduction Models
- Sub-topic 1
- Dependency Transduction Models using Head Transducers
- Sub-topic 2
- Transduction Algorithm
14 Dependency Transduction Models using Head Transducers
- Consist of a collection of head transducers
- applied hierarchically
- A nonhead transition is interpreted as
- reading and writing a pair of strings headed by (w, v), according to the derivation of a subnetwork
15 Machine Translation
Dependency Transduction Models
- The transducers derive pairs of dependency trees
- a source-language dependency tree
- a target-language dependency tree
- The dependency trees are ordered
- an ordering on the nodes of each local tree
- The target sentence can be constructed directly by a simple recursive traversal of the target dependency tree
- Each pair of source and target trees is synchronized
16 Dependency tree
Dependency Transduction Models
- Dependency grammar (Hays 1964; Hudson 1984)
- The words of the sentence appear as nodes
- The parent of a node is its head
- The children of a node are its dependents
17 Synchronized dependency trees
Dependency Transduction Models
18 Head Transducers
Dependency Transduction Models
- The head transducer converts
- a sequence consisting of a headword w and its left and right dependent words
- to
- a sequence consisting of the target word v and its left and right dependent words
19 Head Transducer
Dependency Transduction Models
20 Relation
Dependency Transduction Models
- Head transducers and dependency transduction models are related as follows
- Each pair of local trees produced by a dependency transduction derivation is the result of a head transducer derivation
- The input to the head transducer is the string corresponding to the flattened local source dependency tree
- The output of the head transducer derivation is the string corresponding to the flattened local target dependency tree
21 Cost
Dependency Transduction Models
- The cost of a derivation produced by a dependency transduction model
- the sum of the weights of all the head transducer derivations involved
- When applied to language translation
- choose the target string of the lowest-cost dependency derivation
22 Probabilistic parameterizations
Cost
- The probability of a transition, given headwords w and v and dependent words w′ and v′
- The probability of choosing a head transition ⟨q0, q1, w, v, 0, 0⟩
- The probability of choosing w0, v0 as the root nodes of the two trees
23 Probability of a derivation
Cost
- The probability of such a derivation can be expressed as a product (equation shown on slide)
- where P(D_{w,v}) is the probability of a subderivation headed by w and v
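The equation itself did not survive this export. A plausible reconstruction from the surrounding definitions (my notation; the paper's exact factorization may differ) multiplies a root-selection probability with recursively defined local-derivation probabilities:

```latex
P(D) = P_{\mathrm{root}}(w_0, v_0)\, P(D_{w_0, v_0}),
\qquad
P(D_{w,v}) = \prod_{i} P(\tau_i \mid w, v) \,\prod_{j} P(D_{w_j, v_j})
```

where the τ_i are the head transducer transitions taken in the local derivation headed by (w, v), and the (w_j, v_j) are the dependent word pairs introduced by its nonhead transitions.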
24 Transduction Algorithm
- The algorithm works bottom-up, maintaining a set of configurations
- A configuration has the form
- ⟨n1, n2, w, v, q, c, t⟩
- corresponding to a bottom-up partial derivation
- currently in state q, covering the input sequence between nodes n1 and n2 of the input lattice
- w and v are the topmost nodes in the source and target derivation trees
- only the target tree t is stored in the configuration
25 Transduction Algorithm
- initializes configurations for the input words
- performs transitions and optimizations to develop the set of configurations bottom-up
26 Transduction Algorithm
- Initialization
- an initial configuration has the form ⟨n, n′, w0, v0, q′, c, v0⟩
- for each word edge between nodes n and n′ in the lattice with source word w0
- for any head transition of the form ⟨q, q′, w0, v0, 0, 0, c⟩
- Transition
- Example: a transition with a source dependent w1 to the left of a headword w, and the corresponding target dependent v1 to the right of the target head v
- The transition applied is ⟨q, q′, w1, v1, -1, +1, c′⟩
- It is applicable when there are the following head and dependent configurations
- ⟨n2, n3, w, v, q, c, t⟩
- ⟨n1, n2, w1, v1, qf, c1, t1⟩
- where the dependent configuration is in a final state qf
- The result of applying the transition is to add the following to the set of configurations
- ⟨n1, n3, w, v, q′, c + c1 + c′, t′⟩
- where t′ is the target dependency tree formed by adding t1 as the rightmost dependent of t
- Optimization
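The combination step just described can be sketched in code (a simplified illustration of the slide's description, not the authors' implementation; the names and tree encoding are mine):

```python
# Bottom-up combination: a head configuration absorbs an adjacent dependent
# configuration via a transition with input position -1 (source dependent to
# the left of the head) and output position +1 (target dependent to the right).
from collections import namedtuple

Config = namedtuple("Config", "n1 n2 w v q c t")          # t: target tree (v, [deps])
Transition = namedtuple("Transition", "q q2 w v alpha beta c")

def apply_left_dep(head, dep, tr, final_states):
    """Combine a head config with a dependent config immediately to its left."""
    assert tr.alpha == -1 and tr.beta == 1
    if dep.q not in final_states:        # dependent derivation must be complete
        return None
    if dep.n2 != head.n1:                # configs must be adjacent in the lattice
        return None
    if (tr.q, tr.w, tr.v) != (head.q, dep.w, dep.v):
        return None                      # transition must match state and words
    v_head, deps = head.t
    new_tree = (v_head, deps + [dep.t])  # t1 becomes the rightmost target dependent
    return Config(dep.n1, head.n2, head.w, head.v, tr.q2,
                  head.c + dep.c + tr.c, new_tree)

head = Config(2, 3, "w", "v", "q", 1.0, ("v", []))
dep  = Config(1, 2, "w1", "v1", "qf", 0.5, ("v1", []))
tr   = Transition("q", "q2", "w1", "v1", -1, 1, 0.2)
combined = apply_left_dep(head, dep, tr, {"qf"})
# combined now spans nodes 1..3 with cost c + c1 + c' = 1.7
```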
27 Transduction Algorithm
- The optimal derivation is the one with the lowest cost
- after all applicable transitions have been taken
- if there are configurations spanning the entire input lattice
- A pragmatic approach in the translation application
- when there are no configurations spanning the entire input lattice
- simply concatenate the lowest-cost of the minimal-length sequences of partial derivations that span the entire lattice
- To find the optimal sequence of derivations
- a Viterbi-like search of the graph formed by configurations is used
28 Training Method
- Sub-topic 1
- Computing Pairing Costs
- Sub-topic 2
- Computing Hierarchical Alignments
- Sub-topic 3
- Constructing Transducers
- Sub-topic 4
- Multiword Pairings
29 Training Method
- requires a set of training examples
- each example (bitext) consists of a source-language string paired with a target-language string
- to produce bilingual dependency representations
- Example
- headwords in both languages are chosen to force a synchronized alignment, in order to simplify cases involving head-switching
30 Four Stages
- Compute co-occurrence statistics
- Search for an optimal synchronized hierarchical alignment for each bitext
- Construct a set of head transducers that can generate these alignments
- with transition weights derived from maximum likelihood estimation
- Multiword pairings
31 Computing Pairing Costs
- the translation pairing cost c(w, v)
- for each source word w in the data set, assign a cost for all possible translations v into the target language
- the statistical function φ: a correlation measure
- indicates the strength of co-occurrence correlation between source and target words
- indicative of carrying the same semantic content
- apply this statistic to co-occurrences of the source word with all its possible translations in the data set examples
- In addition, the cost includes a distance-measure component
- penalizes pairings proportionately to the difference between the (normalized) positions of the source and target words in their respective sentences
32 Computing Hierarchical Alignments
- Each derivation generates a pair of dependency trees
- Synchronized hierarchical alignment of two strings
- A hierarchical alignment consists of four functions
- Function 1: an alignment mapping f
- from source words w to target words f(w)
- Function 2: an inverse alignment mapping f′
- from target words v to source words f′(v)
- the inverse mapping is needed to handle mappings of target words to ε; it coincides with f for pairs without a source ε
- Function 3: a source head-map g
- mapping source dependent words w to their headwords g(w) in the source string
- Function 4: a target head-map h
- mapping target dependent words v to their headwords h(v) in the target string
33 Hierarchical Alignment
Computing Hierarchical Alignments
34 Computing Hierarchical Alignments
- A hierarchical alignment is synchronized if these conditions hold
- Nonoverlap
- if w1 ≠ w2, then f(w1) ≠ f(w2); similarly, if v1 ≠ v2, then f′(v1) ≠ f′(v2)
- Synchronization
- if f(w) = v and v ≠ ε, then f(g(w)) = h(v) and f′(v) = w
- similarly, if f′(v) = w and w ≠ ε, then f′(h(v)) = g(w) and f(w) = v
- Phrase contiguity
- the image under f of the maximal substring dominated by a headword w is a contiguous segment of the target string
35 Computing Hierarchical Alignments
- The source and target strings of a bitext are decomposed into three aligned regions
- a head region, consisting of the headword w in the source string and its corresponding target f(w) in the target string
- a left substring region, consisting of the source substring to the left of w and its projection under f on the target string
- a right substring region, consisting of the source substring to the right of w and its projection under f on the target string
36 Computing Hierarchical Alignments
- The decomposition is recursive
- the left substring region is decomposed around a left headword wl
- the right substring region is decomposed around a right headword wr
- the process continues for each left and right substring until each contains only a single word
37 Decomposing source and target
Computing Hierarchical Alignments
38 Computing Hierarchical Alignments
- The goal is to find an alignment that respects
- the co-occurrence statistics of the bitexts
- the phrasal structure implicit in the source and target strings
39 The cost function
Computing Hierarchical Alignments
- The cost function is the sum of three terms
- the total of all the translation pairing costs c(w, f(w))
- a term proportional to the distance in the source string between dependents wd and their heads g(wd)
- a term proportional to the distance in the target string between target dependent words vd and their heads h(vd)
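Spelled out, the three terms above can be sketched as follows (the λ weighting constants and the pos(·) notation are mine, not the paper's):

```latex
C = \sum_{w} c\bigl(w, f(w)\bigr)
  + \lambda_s \sum_{w_d} \bigl|\,\mathrm{pos}(w_d) - \mathrm{pos}(g(w_d))\,\bigr|
  + \lambda_t \sum_{v_d} \bigl|\,\mathrm{pos}(v_d) - \mathrm{pos}(h(v_d))\,\bigr|
```

The distance terms bias the search toward alignments in which dependents stay close to their heads in both strings.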
40 Computing Hierarchical Alignments
- The hierarchical alignment that minimizes this cost function is computed using a dynamic programming procedure
- the pairing costs are first retrieved for each possible source-target pair
- adjacent source substrings are combined to determine the lowest-cost subalignments for successively larger substrings of the bitext satisfying the constraints stated above
- the successively larger substrings eventually span the entire source string, yielding the optimal hierarchical alignment for the bitext
41 Constructing Transducers
- creating appropriate head transducer states
- tracing hypothesized head transducer transitions
- the main transitions traced are those that map the heads wl and wr of the left and right dependent phrases of w to their translations
42 Constructing Transducers
- The positions of the dependents in the target string are computed by comparing the positions of f(wl) and f(wr) to the position of v = f(w)
43 Constructing Transducers
- In order to generalize from instances in the training data, some model states arising from different training instances are shared
- there is only one final state
- To specify the sharing of states
- use a one-to-one state-naming function s
- from sequences of strings to transducer states
- The same state naming is used for all examples in the data set
- ensuring that the transducer fragments recorded for the entire data set form a complete collection of head transducer transition networks
44 Constructing Transducers
Figure 7: swapping decomposition
- shows a decomposition in which
- w has a dependent on either side
- v has both dependents to the right
- the alignment is swapping
- f(wl) is to the right of f(wr)
45 Figure 7: swapping decomposition
Constructing Transducers
46 Figure 8: construction for figure 7
Constructing Transducers
- Construct a transition from s1 = s(initial) to s2 = s(w, f(w), head)
- mapping the source headword w to the target head f(w), at position 0 in both source and target
- Construct a transition from s2 to s3 = s(w, f(w), swapping, wr, f(wr))
- mapping the source dependent wr at position +1 to the target dependent f(wr) at position +1
- since the target dependent f(wr) is to the left of the target dependent f(wl), the wr transition is constructed first, so that the target dependent nearest the head is output first
- Construct a transition from s3 to s4 = s(w, f(w), final)
- mapping the source dependent wl at position -1 to the target dependent f(wl) at position +1
47 Figure 8: construction for figure 7
Constructing Transducers
48 Figure 9: Decomposing parallel
Constructing Transducers
49 Figure 10: Construction for Figure 9
Constructing Transducers
- the wl transition is constructed first
- since the target dependent f(wl) is to the left of the target dependent f(wr), so that the target dependent nearest the head is output first
- a different state s5 = s(w, f(w), parallel, wl, f(wl)) is used
- instead of state s3
50 Decomposing parallel (cont'd)
51 Multiword Pairings
- Short substrings (compounds) of the source and target strings
- Example: show me / muéstreme, nonstop / sin escalas
- The cost: the φ statistic multiplied by the number of words in the source substring
- For alignment
- produce dependency trees with nodes that are compounds
- For transducer construction
- one of the words of a compound is the headword
- the least common word
- an extra chain of transitions is constructed to transduce the other words of the compound
52 Experiments
- Comparing the target string produced by the system against a reference human translation from held-out data
- Simple accuracy
- computed by first finding a transformation of one string into another that minimizes the total weight of insertions, deletions, and substitutions
- Translation accuracy
- includes transpositions of words as well as insertions, deletions, and substitutions
53 Experiments
- Simple accuracy
- 1 - (I + D + S)/R
- for the lowest edit-distance transformation between the reference translation and the system output
- I: the number of insertions
- D: the number of deletions
- S: the number of substitutions
- R: the number of words in the reference translation string
- Translation accuracy
- 1 - (I + D + S + T)/R
- T: the number of transpositions in the lowest-weight transformation including transpositions
- since a transposition corresponds to an insertion and a deletion, the values of I and D for translation accuracy will differ from I and D in the computation of simple accuracy
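Simple accuracy can be sketched with a standard Levenshtein dynamic program that tracks the operation counts (my own illustration of the metric, not the authors' evaluation code; translation accuracy additionally needs transpositions and is omitted here):

```python
# Simple accuracy = 1 - (I + D + S)/R, where I, D, S are counted along a
# minimum-edit-distance alignment and R is the reference length in words.

def edit_ops(reference, hypothesis):
    """Return (I, D, S) for the cheapest transformation of reference
    into hypothesis, all operations weighted 1."""
    R, H = len(reference), len(hypothesis)
    # dp[i][j] = (cost, I, D, S) for reference[:i] -> hypothesis[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(R + 1):
        for j in range(H + 1):
            if i == j == 0:
                continue
            cands = []
            if j > 0:                                   # insert hypothesis[j-1]
                c, I, D, S = dp[i][j - 1]
                cands.append((c + 1, I + 1, D, S))
            if i > 0:                                   # delete reference[i-1]
                c, I, D, S = dp[i - 1][j]
                cands.append((c + 1, I, D + 1, S))
            if i > 0 and j > 0:                         # match or substitute
                c, I, D, S = dp[i - 1][j - 1]
                sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                cands.append((c + sub, I, D, S + sub))
            dp[i][j] = min(cands)
    return dp[R][H][1:]

def simple_accuracy(reference, hypothesis):
    I, D, S = edit_ops(reference, hypothesis)
    return 1 - (I + D + S) / len(reference)

ref = "show me the morning flights".split()
hyp = "show me morning the flight".split()
print(simple_accuracy(ref, hyp))  # -> 0.4 (minimum edit cost 3, R = 5)
```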
54 English to Spanish
- The training and testing data
- a set of transcribed utterances from the Air Travel Information System (ATIS) corpus, together with a translation of each utterance to Spanish
- an utterance is typically a single sentence but is sometimes more than one sentence spoken in sequence
- a total of 13,966 training bitexts
- alignment search and transduction training were carried out only on bitexts with sentences up to length 20
- the test set consisted of 1,185 held-out bitexts at all lengths
- Table 1 shows the word accuracy percentages for the trained model, e2s, against the original held-out translations at various source sentence lengths
- scores are also given for a word-for-word baseline, sww
- each English word is translated by the most highly correlated Spanish word
55 English to Japanese
- The training and test data for English-to-Japanese
- a set of transcribed utterances of telephone service customers talking to AT&T operators
- these utterances, collected from real customer-operator interactions, tend to include fragmented language, restarts, etc.
- 12,226 training bitexts and 3,253 held-out test bitexts
- both training and test partitions were restricted to bitexts with at most 20 English words
- Word boundaries for the Japanese text
- these word boundaries are parasitic on the word boundaries in the English transcriptions
- the translators were asked to insert a word boundary between any two Japanese characters that are taken to have arisen from the translation of distinct English words
- Table 2 shows the Japanese character accuracy percentages for
- the trained English-to-Japanese model, e2j
- a baseline model, jww, which gives each English word its most highly correlated translation
56 Experimental results
57 Review
- Head Transducers
- Dependency Transduction Models
- Training Method
- Experiments
- Concluding remarks
58 Dependency grammar
59 φ correlation measure
- Dichotomous variables may be correlated. The kind of correlation that applies to two binary variables is the phi correlation.
- A correlation between two dummy variables is a phi correlation.
- The phi correlation has a particular formula
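The formula itself is not in this export. For a 2×2 contingency table with cell counts a, b, c, d, the standard phi coefficient is φ = (ad − bc) / √((a+b)(c+d)(a+c)(b+d)). A minimal sketch (my own illustration of the measure, not the authors' code):

```python
# Phi coefficient for two binary variables, e.g. "source word w occurs in
# the bitext" vs "target word v occurs in the bitext", computed from the
# 2x2 contingency table of co-occurrence counts over the training bitexts.
import math

def phi(a, b, c, d):
    """a: both occur, b: only w occurs, c: only v occurs, d: neither."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

# Perfectly correlated words co-occur in every bitext where either appears:
print(phi(50, 0, 0, 950))   # -> 1.0
print(phi(5, 45, 45, 905))  # weakly correlated pair
```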