Title: Learning Dependency Translation Models as Collections of Finite-State Head Transducers
1 Learning Dependency Translation Models as Collections of Finite-State Head Transducers
Machine Translation Seminar, Winter 2006
- Alshawi, H., Bangalore, S., Douglas, S.
- ACL, 2000
Presenter: Yow-Ren Chiang
2 Overview
- Weighted head transducers: finite-state machines
- perform middle-out string transduction
- more expressive than left-to-right FSTs
- Dependency transduction models
- collections of weighted head transducers that are applied hierarchically
- A dynamic programming search algorithm
- finds the optimal transduction of an input string
- A method for automatically training a dependency transduction model
- from a set of input-output example strings
- searches for hierarchical alignments of the training examples, guided by correlation statistics
- constructs the transitions of head transducers that are consistent with these alignments
- Experiments
- applying the training method to translation from English to Spanish and Japanese
3 Head Transducers
- Sub-topic 1
- Weighted Finite-State Head Transducers
- Sub-topic 2
- Relationship to Standard FSTs
4 Weighted Finite-State Head Transducers
- Quintuple ⟨W, V, Q, F, T⟩
- W: an alphabet of input symbols
- V: an alphabet of output symbols
- Q: a finite set of states q0, ..., qs
- F: a set of final states
- T: a finite set of state transitions
5 Transition
Weighted Finite-State Head Transducers
- A transition from state q to state q′ has the form
- ⟨q, q′, w, v, α, β, c⟩
- w ∈ W or w = ε (the empty string)
- v ∈ V or v = ε (the empty string)
- α: input position
- β: output position
- c: weight or cost of the transition
- Head transition
- ⟨q, q′, w0, v0, 0, 0, c⟩
6 Transition (cont'd)
Weighted Finite-State Head Transducers
- Notional input and output tapes
- Reading w from square α on the input tape
- if input was already read from position α:
- if α < 0, w is taken from the next unread square to the left of α
- if α ≥ 0, from the next unread square to the right of α
- Writing v to square β on the output tape
- if square β is occupied:
- if β < 0, v is written to the next empty square to the left of β
- if β ≥ 0, to the next empty square to the right of β
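The spill rule above can be sketched as a tiny tape simulation (my own illustration; the paper defines the tapes only notionally):

```python
# A minimal sketch of the head transducer's notional output-tape semantics:
# write v at square beta; if beta is occupied, spill to the nearest empty
# square to the left (beta < 0) or to the right (beta >= 0).

def write_symbol(tape, beta, v):
    """Write v at square beta of `tape` (a dict: position -> symbol).
    Returns the position actually written."""
    pos = beta
    if pos in tape:                      # square already occupied
        step = -1 if beta < 0 else 1     # spill direction depends on sign
        while pos in tape:
            pos += step                  # next empty square in that direction
    tape[pos] = v
    return pos

tape = {}
write_symbol(tape, 0, "v0")   # head word lands at square 0
write_symbol(tape, 1, "v1")   # right dependent at square 1
write_symbol(tape, 1, "v2")   # square 1 taken -> spills right to square 2
write_symbol(tape, -1, "v3")  # left dependent at square -1
```

The output string is then read off the tape in increasing position order.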
7 Transition Symbols and Positions
Weighted Finite-State Head Transducers
8 Head Transition
Weighted Finite-State Head Transducers
- Head transition: ⟨q, q′, w0, v0, 0, 0, c⟩
- w0: a symbol of the input string (not necessarily the leftmost)
- v0: a symbol of the output string
- α = 0: w0 is read from square 0 of the input tape
- β = 0: v0 is written to square 0 of the output tape
9 Cost (weight)
Weighted Finite-State Head Transducers
- The cost of a derivation
- the sum of the costs of its transitions
- String-to-string transduction function
- maps an input string to the output string of the lowest-cost valid derivation
- Not defined when
- there are no derivations, or
- there are multiple outputs with the same minimal cost
10 Relationship to Standard FSTs
- A traditional left-to-right transducer can be simulated by a head transducer
- starting at the leftmost input symbol
- setting the positions of the first transition taken to α = 0 and β = 0
- setting the positions of each subsequent transition taken to α = 1 and β = 1
- etc.
11 Head Transducers Are More Expressive
- Reverse a string of arbitrary length
- Convert a palindrome of arbitrary length into one of its component halves
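String reversal is a good illustration of the extra expressive power: read the head at square 0, then repeatedly read the next unread input square to the right (α = 1) while writing to the next empty output square to the left (β = −1). A left-to-right FST cannot do this with finitely many states. A sketch (my own illustration, not code from the paper):

```python
# Simulate a two-state head transducer that reverses its input:
#   head transition <q0, q1, w, w, 0, 0, c>  reads/writes the first symbol
#   loop transition <q1, q1, w, w, 1, -1, c> reads rightward, writes leftward

def reverse_by_head_transducer(s):
    out = {}                 # output tape: position -> symbol
    out[0] = s[0]            # head transition places first symbol at square 0
    pos = 0
    for w in s[1:]:          # loop transition for each remaining symbol
        pos -= 1             # beta = -1: next empty square to the left
        out[pos] = w
    return "".join(out[p] for p in sorted(out))

print(reverse_by_head_transducer("abc"))  # -> "cba"
```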
12 More Expressive: Example
13 Dependency Transduction Models
- Sub-topic 1
- Dependency Transduction Models using Head Transducers
- Sub-topic 2
- Transduction Algorithm
14 Dependency Transduction Models using Head Transducers
- Consist of a collection of head transducers
- applied hierarchically
- A nonhead transition is interpreted as
- reading and writing a pair of strings headed by (w, v), according to the derivation of a subnetwork
15 Machine Translation
Dependency Transduction Models
- The transducers derive pairs of dependency trees
- a source-language dependency tree
- a target-language dependency tree
- The dependency trees are ordered
- an ordering on the nodes of each local tree
- The target sentence can be constructed directly by a simple recursive traversal of the target dependency tree
- Each pair of source and target trees is synchronized
16 Dependency tree
Dependency Transduction Models
- Dependency grammar (Hays 1964; Hudson 1984)
- The words of the sentence appear as nodes
- The parent of a node is its head
- The children of a node are its dependents
17 Synchronized dependency trees
Dependency Transduction Models
18 Head Transducers
Dependency Transduction Models
- The head transducer converts
- a sequence consisting of a headword w and its left and right dependent words
- to
- a sequence consisting of the target word v and its left and right dependent words
19 Head Transducer
Dependency Transduction Models
20 Relation
Dependency Transduction Models
- Head transducers and dependency transduction models are related as follows
- Each pair of local trees produced by a dependency transduction derivation is the result of a head transducer derivation
- The input to the head transducer is the string corresponding to the flattened local source dependency tree
- The output of the head transducer derivation is the string corresponding to the flattened local target dependency tree
21 Cost
Dependency Transduction Models
- The cost of a derivation produced by a dependency transduction model
- the sum of the weights of all the head transducer derivations involved
- When applied to language translation
- choose the target string of the lowest-cost dependency derivation
22 Probabilistic parameterizations
Cost
- The probability of a transition, given headwords w and v and dependent words w′ and v′
- The probability of choosing a head transition ⟨q0, q1, w, v, 0, 0⟩
- The probability of choosing w0, v0 as the root nodes of the two trees
23 Probability of a derivation
Cost
- The probability of such a derivation can be expressed as a product (equation shown on slide)
- where P(D_{w,v}) is the probability of a subderivation headed by w and v
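The equation itself did not survive this export. A plausible reconstruction from the surrounding definitions (my notation; the paper's exact factorization may differ) multiplies a root-selection probability with recursively defined local-derivation probabilities:

```latex
P(D) = P_{\mathrm{root}}(w_0, v_0)\, P(D_{w_0, v_0}),
\qquad
P(D_{w,v}) = \prod_{i} P(\tau_i \mid w, v) \,\prod_{j} P(D_{w_j, v_j})
```

where the τ_i are the head transducer transitions taken in the local derivation headed by (w, v), and the (w_j, v_j) are the dependent word pairs introduced by its nonhead transitions.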
24 Transduction Algorithm
- The algorithm works bottom-up, maintaining a set of configurations
- A configuration has the form
- ⟨n1, n2, w, v, q, c, t⟩
- corresponding to a bottom-up partial derivation
- currently in state q, covering the input sequence between nodes n1 and n2 of the input lattice
- w and v are the topmost nodes in the source and target derivation trees
- only the target tree t is stored in the configuration
25 Transduction Algorithm
- initializes configurations for the input words
- performs transitions and optimizations to develop the set of configurations bottom-up
26 Transduction Algorithm
- Initialization
- an initial configuration has the form ⟨n, n′, w0, v0, q′, c, v0⟩
- for each word edge between nodes n and n′ in the lattice with source word w0
- for any head transition of the form ⟨q, q′, w0, v0, 0, 0, c⟩
- Transition
- Example: a transition with a source dependent w1 to the left of a headword w, and the corresponding target dependent v1 to the right of the target head v
- The transition applied is ⟨q, q′, w1, v1, -1, +1, c′⟩
- It is applicable when there are the following head and dependent configurations
- ⟨n2, n3, w, v, q, c, t⟩
- ⟨n1, n2, w1, v1, qf, c1, t1⟩
- where the dependent configuration is in a final state qf
- The result of applying the transition is to add the following to the set of configurations
- ⟨n1, n3, w, v, q′, c + c1 + c′, t′⟩
- where t′ is the target dependency tree formed by adding t1 as the rightmost dependent of t
- Optimization
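The combination step just described can be sketched in code (a simplified illustration of the slide's description, not the authors' implementation; the names and tree encoding are mine):

```python
# Bottom-up combination: a head configuration absorbs an adjacent dependent
# configuration via a transition with input position -1 (source dependent to
# the left of the head) and output position +1 (target dependent to the right).
from collections import namedtuple

Config = namedtuple("Config", "n1 n2 w v q c t")          # t: target tree (v, [deps])
Transition = namedtuple("Transition", "q q2 w v alpha beta c")

def apply_left_dep(head, dep, tr, final_states):
    """Combine a head config with a dependent config immediately to its left."""
    assert tr.alpha == -1 and tr.beta == 1
    if dep.q not in final_states:        # dependent derivation must be complete
        return None
    if dep.n2 != head.n1:                # configs must be adjacent in the lattice
        return None
    if (tr.q, tr.w, tr.v) != (head.q, dep.w, dep.v):
        return None                      # transition must match state and words
    v_head, deps = head.t
    new_tree = (v_head, deps + [dep.t])  # t1 becomes the rightmost target dependent
    return Config(dep.n1, head.n2, head.w, head.v, tr.q2,
                  head.c + dep.c + tr.c, new_tree)

head = Config(2, 3, "w", "v", "q", 1.0, ("v", []))
dep  = Config(1, 2, "w1", "v1", "qf", 0.5, ("v1", []))
tr   = Transition("q", "q2", "w1", "v1", -1, 1, 0.2)
combined = apply_left_dep(head, dep, tr, {"qf"})
# combined now spans nodes 1..3 with cost c + c1 + c' = 1.7
```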
27 Transduction Algorithm
- The optimal derivation is the one with the lowest cost
- after all applicable transitions have been taken
- if there are configurations spanning the entire input lattice
- A pragmatic approach in the translation application
- when there are no configurations spanning the entire input lattice
- simply concatenate the lowest-cost of the minimal-length sequences of partial derivations that span the entire lattice
- To find the optimal sequence of derivations
- a Viterbi-like search of the graph formed by configurations is used
28 Training Method
- Sub-topic 1
- Computing Pairing Costs
- Sub-topic 2
- Computing Hierarchical Alignments
- Sub-topic 3
- Constructing Transducers
- Sub-topic 4
- Multiword Pairings
29 Training Method
- requires a set of training examples
- each example (bitext) consists of a source-language string paired with a target-language string
- to produce bilingual dependency representations
- Example
- headwords in both languages are chosen to force a synchronized alignment, in order to simplify cases involving head-switching
30 Four Stages
- Compute co-occurrence statistics
- Search for an optimal synchronized hierarchical alignment for each bitext
- Construct a set of head transducers that can generate these alignments
- with transition weights derived from maximum likelihood estimation
- Multiword pairings
31 Computing Pairing Costs
- the translation pairing cost c(w, v)
- for each source word w in the data set, assign a cost for all possible translations v into the target language
- the statistical function φ: a correlation measure
- indicates the strength of co-occurrence correlation between source and target words
- indicative of carrying the same semantic content
- apply this statistic to co-occurrences of the source word with all its possible translations in the data set examples
- In addition, the cost includes a distance-measure component
- penalizes pairings proportionately to the difference between the (normalized) positions of the source and target words in their respective sentences
32 Computing Hierarchical Alignments
- Each derivation generates a pair of dependency trees
- Synchronized hierarchical alignment of two strings
- A hierarchical alignment consists of four functions
- Function 1: an alignment mapping f
- from source words w to target words f(w)
- Function 2: an inverse alignment mapping f′
- from target words v to source words f′(v)
- the inverse mapping is needed to handle mappings of target words to ε; it coincides with f for pairs without a source ε
- Function 3: a source head-map g
- mapping source dependent words w to their headwords g(w) in the source string
- Function 4: a target head-map h
- mapping target dependent words v to their headwords h(v) in the target string
33 Hierarchical Alignment
Computing Hierarchical Alignments
34 Computing Hierarchical Alignments
- A hierarchical alignment is synchronized if these conditions hold
- Nonoverlap
- if w1 ≠ w2, then f(w1) ≠ f(w2); similarly, if v1 ≠ v2, then f′(v1) ≠ f′(v2)
- Synchronization
- if f(w) = v and v ≠ ε, then f(g(w)) = h(v) and f′(v) = w
- similarly, if f′(v) = w and w ≠ ε, then f′(h(v)) = g(w) and f(w) = v
- Phrase contiguity
- the image under f of the maximal substring dominated by a headword w is a contiguous segment of the target string
35 Computing Hierarchical Alignments
- The source and target strings of a bitext are decomposed into three aligned regions
- a head region, consisting of the headword w in the source string and its corresponding target f(w) in the target string
- a left substring region, consisting of the source substring to the left of w and its projection under f on the target string
- a right substring region, consisting of the source substring to the right of w and its projection under f on the target string
36 Computing Hierarchical Alignments
- The decomposition is recursive
- the left substring region is decomposed around a left headword wl
- the right substring region is decomposed around a right headword wr
- the process continues for each left and right substring until each contains only a single word
37 Decomposing source and target
Computing Hierarchical Alignments
38 Computing Hierarchical Alignments
- The goal is to find an alignment that respects
- the co-occurrence statistics of the bitexts
- the phrasal structure implicit in the source and target strings
39 The cost function
Computing Hierarchical Alignments
- The cost function is the sum of three terms
- the total of all the translation pairing costs c(w, f(w))
- a term proportional to the distance in the source string between dependents wd and their heads g(wd)
- a term proportional to the distance in the target string between target dependent words vd and their heads h(vd)
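Spelled out, the three terms above can be sketched as follows (the λ weighting constants and the pos(·) notation are mine, not the paper's):

```latex
C = \sum_{w} c\bigl(w, f(w)\bigr)
  + \lambda_s \sum_{w_d} \bigl|\,\mathrm{pos}(w_d) - \mathrm{pos}(g(w_d))\,\bigr|
  + \lambda_t \sum_{v_d} \bigl|\,\mathrm{pos}(v_d) - \mathrm{pos}(h(v_d))\,\bigr|
```

The distance terms bias the search toward alignments in which dependents stay close to their heads in both strings.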
40 Computing Hierarchical Alignments
- The hierarchical alignment that minimizes this cost function is computed using a dynamic programming procedure
- the pairing costs are first retrieved for each possible source-target pair
- adjacent source substrings are combined to determine the lowest-cost subalignments for successively larger substrings of the bitext satisfying the constraints stated above
- the successively larger substrings eventually span the entire source string, yielding the optimal hierarchical alignment for the bitext
41 Constructing Transducers
- creating appropriate head transducer states
- tracing hypothesized head transducer transitions
- the main transitions traced are those that map the heads wl and wr of the left and right dependent phrases of w to their translations
42 Constructing Transducers
- The positions of the dependents in the target string are computed by comparing the positions of f(wl) and f(wr) to the position of v = f(w)
43 Constructing Transducers
- In order to generalize from instances in the training data, some model states arising from different training instances are shared
- there is only one final state
- To specify the sharing of states
- use a one-to-one state-naming function s
- from sequences of strings to transducer states
- The same state naming is used for all examples in the data set
- ensuring that the transducer fragments recorded for the entire data set form a complete collection of head transducer transition networks
44 Constructing Transducers
Figure 7: swapping decomposition
- shows a decomposition in which
- w has a dependent on either side
- v has both dependents to the right
- the alignment is swapping
- f(wl) is to the right of f(wr)
45 Figure 7: swapping decomposition
Constructing Transducers
46 Figure 8: construction for figure 7
Constructing Transducers
- Construct a transition from s1 = s(initial) to s2 = s(w, f(w), head)
- mapping the source headword w to the target head f(w), at position 0 in both source and target
- Construct a transition from s2 to s3 = s(w, f(w), swapping, wr, f(wr))
- mapping the source dependent wr at position +1 to the target dependent f(wr) at position +1
- since the target dependent f(wr) is to the left of the target dependent f(wl), the wr transition is constructed first, so that the target dependent nearest the head is output first
- Construct a transition from s3 to s4 = s(w, f(w), final)
- mapping the source dependent wl at position -1 to the target dependent f(wl) at position +1
47 Figure 8: construction for figure 7
Constructing Transducers
48 Figure 9: Decomposing parallel
Constructing Transducers
49 Figure 10: Construction for Figure 9
Constructing Transducers
- the wl transition is constructed first
- since the target dependent f(wl) is to the left of the target dependent f(wr), so that the target dependent nearest the head is output first
- a different state s5 = s(w, f(w), parallel, wl, f(wl)) is used
- instead of state s3
50 Decomposing parallel (cont'd)
51 Multiword Pairings
- Short substrings (compounds) of the source and target strings
- Example: show me / muéstreme, nonstop / sin escalas
- The cost: the φ statistic multiplied by the number of words in the source substring
- For alignment
- produce dependency trees with nodes that are compounds
- For transducer construction
- one of the words of a compound is the headword
- the least common word
- an extra chain of transitions is constructed to transduce the other words of the compound
52 Experiments
- Comparing the target string produced by the system against a reference human translation from held-out data
- Simple accuracy
- computed by first finding a transformation of one string into another that minimizes the total weight of insertions, deletions, and substitutions
- Translation accuracy
- includes transpositions of words as well as insertions, deletions, and substitutions
53 Experiments
- Simple accuracy
- 1 - (I + D + S)/R
- for the lowest edit-distance transformation between the reference translation and the system output
- I: the number of insertions
- D: the number of deletions
- S: the number of substitutions
- R: the number of words in the reference translation string
- Translation accuracy
- 1 - (I + D + S + T)/R
- T: the number of transpositions in the lowest-weight transformation including transpositions
- since a transposition corresponds to an insertion and a deletion, the values of I and D for translation accuracy will differ from I and D in the computation of simple accuracy
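Simple accuracy can be sketched with a standard Levenshtein dynamic program that tracks the operation counts (my own illustration of the metric, not the authors' evaluation code; translation accuracy additionally needs transpositions and is omitted here):

```python
# Simple accuracy = 1 - (I + D + S)/R, where I, D, S are counted along a
# minimum-edit-distance alignment and R is the reference length in words.

def edit_ops(reference, hypothesis):
    """Return (I, D, S) for the cheapest transformation of reference
    into hypothesis, all operations weighted 1."""
    R, H = len(reference), len(hypothesis)
    # dp[i][j] = (cost, I, D, S) for reference[:i] -> hypothesis[:j]
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(R + 1):
        for j in range(H + 1):
            if i == j == 0:
                continue
            cands = []
            if j > 0:                                   # insert hypothesis[j-1]
                c, I, D, S = dp[i][j - 1]
                cands.append((c + 1, I + 1, D, S))
            if i > 0:                                   # delete reference[i-1]
                c, I, D, S = dp[i - 1][j]
                cands.append((c + 1, I, D + 1, S))
            if i > 0 and j > 0:                         # match or substitute
                c, I, D, S = dp[i - 1][j - 1]
                sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                cands.append((c + sub, I, D, S + sub))
            dp[i][j] = min(cands)
    return dp[R][H][1:]

def simple_accuracy(reference, hypothesis):
    I, D, S = edit_ops(reference, hypothesis)
    return 1 - (I + D + S) / len(reference)

ref = "show me the morning flights".split()
hyp = "show me morning the flight".split()
print(simple_accuracy(ref, hyp))  # -> 0.4 (minimum edit cost 3, R = 5)
```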
54 English to Spanish
- The training and testing data
- a set of transcribed utterances from the Air Travel Information System (ATIS) corpus, together with a translation of each utterance to Spanish
- an utterance is typically a single sentence but is sometimes more than one sentence spoken in sequence
- a total of 13,966 training bitexts
- alignment search and transduction training were carried out only on bitexts with sentences up to length 20
- the test set consisted of 1,185 held-out bitexts at all lengths
- Table 1 shows the word accuracy percentages for the trained model, e2s, against the original held-out translations at various source sentence lengths
- scores are also given for a word-for-word baseline, sww
- each English word is translated by the most highly correlated Spanish word
55 English to Japanese
- The training and test data for English-to-Japanese
- a set of transcribed utterances of telephone service customers talking to AT&T operators
- these utterances, collected from real customer-operator interactions, tend to include fragmented language, restarts, etc.
- 12,226 training bitexts and 3,253 held-out test bitexts
- both training and test partitions were restricted to bitexts with at most 20 English words
- Word boundaries for the Japanese text
- these word boundaries are parasitic on the word boundaries in the English transcriptions
- the translators were asked to insert a word boundary between any two Japanese characters that are taken to have arisen from the translation of distinct English words
- Table 2 shows the Japanese character accuracy percentages for
- the trained English-to-Japanese model, e2j
- a baseline model, jww, which gives each English word its most highly correlated translation
56 Experimental results
57 Review
- Head Transducers
- Dependency Transduction Models
- Training Method
- Experiments
- Concluding remarks
58 Dependency grammar
59 φ correlation measure
- Dichotomous variables may be correlated. The kind of correlation that applies to two binary variables is the phi correlation.
- A correlation between two dummy variables is a phi correlation.
- The phi correlation has a particular formula
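The formula itself is not in this export. For a 2×2 contingency table with cell counts a, b, c, d, the standard phi coefficient is φ = (ad − bc) / √((a+b)(c+d)(a+c)(b+d)). A minimal sketch (my own illustration of the measure, not the authors' code):

```python
# Phi coefficient for two binary variables, e.g. "source word w occurs in
# the bitext" vs "target word v occurs in the bitext", computed from the
# 2x2 contingency table of co-occurrence counts over the training bitexts.
import math

def phi(a, b, c, d):
    """a: both occur, b: only w occurs, c: only v occurs, d: neither."""
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

# Perfectly correlated words co-occur in every bitext where either appears:
print(phi(50, 0, 0, 950))   # -> 1.0
print(phi(5, 45, 45, 905))  # weakly correlated pair
```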