Learning Dependency Translation Models as Collections of Finite-State Head Transducers

1
Learning Dependency Translation Models as
Collections of Finite-State Head Transducers
Machine Translation Seminar, 2006 Winter
  • Alshawi, H., Bangalore, S., Douglas, S.
  • ACL, 2000

Presenter: Yow-Ren Chiang
2
Overview
  • Weighted head transducers: finite-state machines
    for middle-out string transduction, more
    expressive than left-to-right FSTs
  • Dependency transduction models: collections of
    weighted head transducers that are applied
    hierarchically
  • A dynamic programming search algorithm for finding
    the optimal transduction of an input string
  • A method for automatically training a dependency
    transduction model from a set of input-output
    example strings: it searches for hierarchical
    alignments of the training examples guided by
    correlation statistics, and constructs the
    transitions of head transducers that are
    consistent with these alignments
  • Experiments applying the training method to
    translation from English to Spanish and Japanese

3
Head Transducers
  • Sub-topic 1
  • Weighted Finite-State Head Transducers
  • Sub-topic 2
  • Relationship to Standard FSTs

4
Weighted Finite-State Head Transducers
  • Quintuple <W, V, Q, F, T>
  • W: an alphabet of input symbols
  • V: an alphabet of output symbols
  • Q: a finite set of states q0, ..., qs
  • F: a set of final states
  • T: a finite set of state transitions

5
Transition
Weighted Finite-State Head Transducers
  • A transition from state q to state q' has the
    form
  • <q, q', w, v, α, β, c>
  • w ∈ W or w = ε (the empty string)
  • v ∈ V or v = ε (the empty string)
  • α: input position
  • β: output position
  • c: weight or cost of the transition
  • Head transition (a data-structure sketch follows)
  • <q, q', w0, v0, 0, 0, c>
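A minimal Python sketch of how a transition and a head transducer might be represented; the class and field names are illustrative assumptions, not notation from the paper:

from typing import NamedTuple, Optional

class Transition(NamedTuple):
    # <q, q', w, v, alpha, beta, c>; None stands for the empty string (epsilon).
    q: str                 # source state
    q_next: str            # destination state
    w: Optional[str]       # input symbol read (or None for epsilon)
    v: Optional[str]       # output symbol written (or None for epsilon)
    alpha: int             # input position (0 for head transitions)
    beta: int              # output position (0 for head transitions)
    cost: float            # weight of the transition

class HeadTransducer(NamedTuple):
    # <W, V, Q, F, T> as on slide 4; W and V are implicit in the transitions.
    states: set
    final_states: set
    transitions: list      # list of Transition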

6
Transition (cont'd)
Weighted Finite-State Head Transducers
  • Notional input and output tapes
  • Reading w from square α on the input tape
  • If input was already read from position α,
  • if α < 0
  • w is taken from the next unread square to
    the left of α
  • if α ≥ 0
  • to the right of α
  • Writing v to square β of the output tape
  • If square β is occupied
  • if β < 0
  • v is written to the next empty square to
    the left of β
  • if β ≥ 0
  • to the right of β

7
Transition Symbols and Positions
Weighted Finite-State Head Transducers
8
Head Transition
Weighted Finite-State Head Transducers
  • Head transition: <q, q', w0, v0, 0, 0, c>
  • w0: the input symbol (not necessarily the leftmost)
  • v0: the output symbol
  • α = 0: w0 is read from square 0 of the input tape
  • β = 0: v0 is written to square 0 of the output tape

9
Cost (weight)
Weighted Finite-State Head Transducers
  • The cost of a derivation
  • The sum of the costs of transitions
  • String-to-string transduction function
  • Maps an input string to the output string
  • By the lowest-cost valid derivation
  • Not defined if
  • there are no derivations, or
  • there are multiple outputs with the same minimal
    cost

10
Relationship to Standard FSTs
  • Traditional left-to-right transducer can be
    simulated by a head transducer
  • Starting at the leftmost input symbol
  • Setting the positions of the first transition
    taken to α = 0 and β = 0
  • Setting the positions of each subsequent
    transition taken to α = 1 and β = 1
  • etc.

11
Head Transducer: More Expressive
  • Reverses a string of arbitrary length (see the
    sketch below)
  • Converts a palindrome of arbitrary length into one
    of its component halves
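As an illustration of the reversal claim (my own sketch, not code or notation from the paper), a single-state head transducer whose non-head transitions pair input position +1 with output position -1, and input position -1 with output position +1, reverses its input; the simulation below assumes that construction:

from collections import deque

def reverse_middle_out(s: str, head_index: int) -> str:
    # Simulates a one-state head transducer with (assumed) transitions
    #   <q, q, a, a, 0, 0>    head transition: copy the head symbol to square 0
    #   <q, q, a, a, +1, -1>  read the next symbol to the right, write it to the left
    #   <q, q, a, a, -1, +1>  read the next symbol to the left, write it to the right
    out = deque([s[head_index]])           # output square 0 holds the head symbol
    for a in s[head_index + 1:]:           # symbols to the right of the head ...
        out.appendleft(a)                  # ... land on successive squares to the left
    for a in reversed(s[:head_index]):     # symbols to the left of the head ...
        out.append(a)                      # ... land on successive squares to the right
    return "".join(out)

assert reverse_middle_out("abcde", 2) == "edcba"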

12
More expressive Example
13
Dependency Transduction Models
  • Sub-topic 1
  • Dependency Transduction Models using Head
    Transducers
  • Sub-topic 2
  • Transduction Algorithm

14
Dependency Transduction Models using Head
Transducers
  • Consists of a collection of head transducers
  • Applied hierarchically
  • A nonhead transition is interpreted as
  • reading and writing a pair of strings headed by
    (w, v) according to the derivation of a
    subnetwork

15
Machine Translation
Dependency Transduction Models
  • The transducers derive pairs of dependency trees
  • A source language dependency tree
  • A target dependency tree
  • The dependency trees are ordered
  • An ordering on the nodes of each local tree
  • The target sentence can be constructed directly
    by a simple recursive traversal of the target
    dependency tree
  • Each pair of source and target trees is
    synchronized

16
Dependency tree
Dependency Transduction Models
  • Dependency grammar (Hays 1964 and Hudson 1984)
  • The words of the sentence appear as nodes
  • The parent of a node is its head
  • The child of a node is the node's dependent

17
Synchronized dependency trees
Dependency Transduction Models
18
Head Transducers
Dependency Transduction Models
  • The head transducer converts
  • a sequence consisting of a source headword w and
    its left and right dependent words
  • to
  • a sequence consisting of a target word v and
    its left and right dependent words

19
Head Transducer
Dependency Transduction Models
20
Relation
Dependency Transduction Models
  • Head transducers and dependency transduction
    models are related as follows
  • Each pair of local trees produced by a dependency
    transduction derivation is the result of a head
    transducer derivation
  • The input to a head transducer is the string
    corresponding to the flattened local source
    dependency tree
  • The output of the head transducer derivation is
    the string corresponding to the flattened local
    target dependency tree

21
Cost
Dependency Transduction Models
  • The cost of a derivation produced by a dependency
    transduction model
  • the sum of all the weights of the head transducer
    derivations involved
  • When applied to language translation
  • Choose the target string of the lowest-cost
    dependency derivation

22
Probabilistic parameterizations
Cost
  • Probability for a transition with headwords w and
    v and dependent words w' and v'
  • The probability of choosing a head transition
    <q0, q1, w, v, 0, 0>
  • The probability of choosing w0, v0 as the root
    nodes of the two trees

23
Probability of a derivation
Cost
  • The probability of such a derivation can be
    expressed recursively as sketched below
  • where P(D_w,v) is the probability of a
    subderivation headed by w and v
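The formula itself appeared only as an image on the original slide; the following LaTeX is a hedged reconstruction based on the parameterization listed on slide 22 (the exact conditioning is my reading, not a verbatim copy of the paper):

$$
P(D) = P(w_0, v_0 \mid \mathrm{root}) \, P(D_{w_0,v_0}),
\qquad
P(D_{w,v}) = \prod_i P\big(\langle q_i, q_{i+1}, w'_i, v'_i, \alpha_i, \beta_i \rangle \mid w, v, q_i\big)\, P(D_{w'_i,v'_i})
$$

Here the product runs over the transitions of the head transducer derivation for the pair (w, v), and each dependent pair (w'_i, v'_i) heads its own subderivation.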

24
Transduction Algorithm
  • The algorithm works bottom-up, maintaining a set
    of configurations.
  • A configuration has the form
  • <n1, n2, w, v, q, c, t>
  • Corresponding to a bottom-up partial derivation
  • currently in state q covering an input sequence
    between nodes n1 and n2 of the input lattice.
  • w and v are the topmost nodes in the source and
    target derivation trees.
  • Only the target tree t is stored in the
    configuration.

25
Transduction Algorithm
  • initializes configurations for the input words
  • performs transitions and optimizations to develop
    the set of configurations bottom-up

26
Transduction Algorithm
  • Initialization
  • an initial configuration has the form
  • <n, n', w0, v0, q', c, v0>
  • for each word edge between nodes n and n' in the
    lattice with source word w0,
  • for any head transition of the form <q, q', w0,
    v0, 0, 0, c>
  • Transition
  • Example: a transition that pairs
  • a source dependent w1 to the left of a headword w
  • with the corresponding target dependent v1 to the
    right of the target head v
  • The transition applied is
  • <q, q', w1, v1, -1, +1, c'>
  • It is applicable when there are the following
    head and dependent configurations
  • <n2, n3, w, v, q, c, t>
  • <n1, n2, w1, v1, qf, c1, t1>
  • where the dependent configuration is in a final
    state qf
  • The result of applying the transition is to add
    the following to the set of configurations (see
    the sketch after this slide)
  • <n1, n3, w, v, q', c + c1 + c', t'>
  • where t' is the target dependency tree formed by
    adding t1 as the rightmost dependent of t
  • Optimization
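A minimal Python sketch of the transition step just described; the class names, the tree representation, and the adjacency/final-state checks are illustrative assumptions, not the authors' implementation:

from typing import NamedTuple, Optional

class Config(NamedTuple):
    # <n1, n2, w, v, q, c, t>: a partial derivation covering lattice nodes n1..n2
    n1: int
    n2: int
    w: str
    v: str
    q: str
    cost: float
    t: tuple               # target dependency tree, e.g. (v, [dependent subtrees])

def apply_left_right_transition(head: Config, dep: Config, q_next: str,
                                trans_cost: float, final_states: set) -> Optional[Config]:
    # Applies a transition <q, q', w1, v1, -1, +1, c'>: the dependent phrase lies
    # immediately to the left of the head's span, and its (already translated)
    # target tree is attached as the rightmost dependent of the head's target tree.
    if dep.q not in final_states or dep.n2 != head.n1:
        return None
    new_tree = (head.t[0], head.t[1] + [dep.t])
    return Config(dep.n1, head.n2, head.w, head.v, q_next,
                  head.cost + dep.cost + trans_cost, new_tree)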

27
Transduction Algorithm
  • the optimal derivation is the one with the lowest
    cost
  • chosen after all applicable transitions have been
    taken,
  • if there are configurations spanning the entire
    input lattice
  • A pragmatic approach in the translation
    application
  • When there are no configurations spanning the
    entire input lattice,
  • simply concatenate the lowest-cost of the
    minimal-length sequences of partial derivations
    that span the entire lattice
  • To find the optimal sequence of derivations,
  • a Viterbi-like search of the graph formed by the
    configurations is used

28
Training Method
  • Subtopic 1
  • Computing Pairing Costs
  • Subtopic 2
  • Computing Hierarchical Alignments
  • Subtopic 3
  • Constructing transducers
  • Subtopic 4
  • Multiword pairings

29
Training Method
  • requires a set of training examples.
  • Each example (bitext) consists of a source
    language string paired with a target language
    string
  • to produce bilingual dependency representations
  • Example
  • headwords in both languages are chosen to force
    a synchronized alignment in order to simplify
    cases involving head-switching

30
Four Stages
  • Compute co-occurrence statistics
  • Search for an optimal synchronized hierarchical
    alignment for each bitext
  • Construct a set of head transducers that can
    generate these alignments
  • With transition weights derived from maximum
    likelihood estimation
  • Multiword Pairings

31
Computing Pairing Costs
  • the translation pairing cost c(w, v)
  • For each source word w in the data set, assign a
    cost for all possible translations v into the
    target language
  • the statistical function: the φ correlation
    measure
  • indicates the strength of co-occurrence
    correlation between source and target words
  • indicative of carrying the same semantic content
  • apply this statistic to co-occurrences of the
    source word with all its possible translations in
    the data set examples
  • In addition, the cost includes a distance-measure
    component (a cost sketch follows this slide)
  • penalizes pairings proportionately to the
    difference between the (normalized) positions of
    the source and target words in their respective
    sentences
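A hedged Python sketch of such a pairing cost; how the φ statistic and the distance penalty are combined, and the weight lambda_dist, are assumptions made for illustration rather than the paper's exact formula:

import math

def phi(n11: int, n10: int, n01: int, n00: int) -> float:
    # Standard phi coefficient over a 2x2 co-occurrence table:
    # n11 = bitexts containing both w and v, n10 = w only,
    # n01 = v only, n00 = neither.
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def pairing_cost(phi_wv: float, src_pos: float, tgt_pos: float,
                 lambda_dist: float = 1.0) -> float:
    # Lower cost for strongly correlated pairs; penalize pairs whose
    # normalized (0..1) sentence positions are far apart.
    return -phi_wv + lambda_dist * abs(src_pos - tgt_pos)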

32
Computing Hierarchical Alignments
  • Each derivation generates a pair of dependency
    trees
  • Synchronized hierarchical alignment of two
    strings
  • A hierarchical alignment consists of four
    functions
  • Function 1: an alignment mapping f
  • from source words w to target words f(w)
  • Function 2: an inverse alignment mapping f'
  • from target words v to source words f'(v)
  • The inverse mapping is needed to handle target
    words mapped to ε; it coincides with the inverse
    of f for pairs not involving ε
  • Function 3: a source head-map g
  • mapping source dependent words w to their heads
    g(w) in the source string
  • Function 4: a target head-map h
  • mapping target dependent words v to their
    headwords h(v) in the target string

33
Hierarchical Alignment
Computing Hierarchical Alignments
34
Computing Hierarchical Alignments
  • A hierarchical alignment is synchronized if these
    conditions hold (a checking sketch follows this
    slide)
  • Nonoverlap
  • If w1 ≠ w2, then f(w1) ≠ f(w2); similarly, if
    v1 ≠ v2, then f'(v1) ≠ f'(v2)
  • Synchronization
  • If f(w) = v and v ≠ ε, then f(g(w)) = h(v) and
    f'(v) = w
  • Similarly, if f'(v) = w and w ≠ ε, then
    f'(h(v)) = g(w) and f(w) = v
  • Phrase contiguity
  • The image under f of the maximal substring
    dominated by a headword w is a contiguous segment
    of the target string
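For illustration only (my own sketch; the dictionary encoding, the use of None for ε, and the function name are assumptions), the first two conditions could be checked as follows; phrase contiguity needs the word positions and is omitted:

def is_synchronized(f: dict, f_inv: dict, g: dict, h: dict) -> bool:
    # f: source word -> target word (None for epsilon)
    # f_inv: target word -> source word (None for epsilon)
    # g: source dependent -> source head; h: target dependent -> target head
    # Nonoverlap: f and f_inv are injective on their non-epsilon values.
    targets = [v for v in f.values() if v is not None]
    sources = [w for w in f_inv.values() if w is not None]
    if len(targets) != len(set(targets)) or len(sources) != len(set(sources)):
        return False
    # Synchronization: heads must map to heads, in both directions.
    for w, v in f.items():
        if v is not None and w in g:
            if f.get(g[w]) != h.get(v) or f_inv.get(v) != w:
                return False
    for v, w in f_inv.items():
        if w is not None and v in h:
            if f_inv.get(h[v]) != g.get(w) or f.get(w) != v:
                return False
    return True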

35
Computing Hierarchical Alignments
  • The source and target strings of a bitext are
    decomposed into three aligned regions
  • A head region, consisting of headword w in the
    source and its corresponding target f(w) in the
    target string
  • A left substring region, consisting of the source
    substring to the left of w and its projection
    under f on the target string
  • A right substring region, consisting of the
    source substring to the right of w and its
    projection under f on the target string

36
Computing Hierarchical Alignments
  • The decomposition is recursive
  • The left substring region is decomposed around a
    left headword wl
  • The right substring region is decomposed around a
    right headword wr
  • The process continues for each left and right
    substring until it only contains a single word

37
Decomposing source and target
Computing Hierarchical Alignments
38
Computing Hierarchical Alignments
  • To find an alignment that
  • respects the co-occurrence statistics of the
    bitexts
  • and the phrasal structure implicit in the source
    and target strings

39
The cost function
Computing Hierarchical Alignments
  • The cost function is the sum of three terms
  • the total of all the translation pairing costs
    c(w, f(w))
  • a term proportional to the distance in the source
    string between dependents wd and their heads
    g(wd)
  • a term proportional to the distance in the target
    string between target dependent words vd and
    their heads h(vd)

40
Computing Hierarchical Alignments
  • The hierarchical alignment that minimizes this
    cost function is computed using a dynamic
    programming procedure
  • The pairing costs are first retrieved for each
    possible source-target pair
  • Adjacent source substrings are combined to
    determine the lowest-cost subalignments for
    successively larger substrings of the bitext
    satisfying the constraints stated above
  • The successively larger substrings eventually
    span the entire source string, yielding the
    optimal hierarchical alignment for the bitext

41
Constructing Transducers
  • creating appropriate head transducer states
  • tracing hypothesized head transducer transitions
  • The main transitions that are traced are those
    that map the heads, wl and wr, of the left and
    right dependent phrases of w to their translations

42
Constructing Transducers
  • The positions of the dependents in the target
    string are computed by comparing the positions of
    f(wl) and f(wr) to the position of v = f(w)

43
Constructing Transducers
  • In order to generalize from instances in the
    training data, some model states arising for
    different training instances are shared
  • There is only one final state
  • To specify the sharing of states
  • Use a one-to-one state-naming function s
  • from sequences of strings to transducer states
  • The same state-naming is used for all examples in
    the data set
  • ensuring that the transducer fragments recorded
    for the entire data set will form a complete
    collection of head transducer transition networks

44
Constructing Transducers
Figure 7 swapping decomposition
  • shows a decomposition
  • w has a dependent to either side
  • v has both dependents to the right
  • the alignment is swapping
  • f(wl) is to the right of f(wr)

45
Figure 7 swapping decomposition
Constructing Transducers
46
Figure 8 construction of figure 7
Constructing Transducers
  • Construct a transition from s1 = s(initial) to
    s2 = s(w, f(w), head)
  • mapping the source headword w to the target head
    f(w) at position 0 in source and target
  • Construct a transition from s2 to s3 = s(w, f(w),
    swapping, wr, f(wr))
  • mapping the source dependent wr at position +1 to
    the target dependent f(wr) at position +1
  • Since the target dependent f(wr) is to the left
    of target dependent f(wl), the wr transition is
    constructed first so that the target dependent
    nearest the head is output first
  • Construct a transition from s3 to s4 = s(w, f(w),
    final)
  • mapping the source dependent wl at position -1 to
    the target dependent f(wl) at position +1
  • (a construction sketch follows this slide)
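A hedged Python sketch of this construction for the swapping case; the state-naming convention and the tuple layout are illustrative assumptions, not the authors' code:

def s(*key) -> str:
    # One-to-one state-naming function: the same key always yields the same state.
    return "s(" + ", ".join(map(str, key)) + ")"

def swapping_transitions(w: str, fw: str, wl: str, fwl: str, wr: str, fwr: str):
    # Returns (q, q', input, output, alpha, beta) tuples for the construction above.
    s1, s2 = s("initial"), s(w, fw, "head")
    s3, s4 = s(w, fw, "swapping", wr, fwr), s(w, fw, "final")
    return [
        (s1, s2, w,  fw,  0,  0),   # head transition: w -> f(w) at positions 0/0
        (s2, s3, wr, fwr, +1, +1),  # right source dependent, output nearest the head
        (s3, s4, wl, fwl, -1, +1),  # left source dependent, output further to the right
    ]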

47
Figure 8 construction for figure 7
Constructing Transducers
48
Figure 9 Decomposing parallel
Constructing Transducers
49
Figure 10 Construction for Figure 9
Constructing Transducers
  • Here the wl transition is constructed first
  • since the target dependent f(wl) is to the left
    of target dependent f(wr), so that the target
    dependent nearest the head is output first
  • A different state, s5 = s(w, f(w), parallel, wl,
    f(wl)), is used
  • instead of state s3

50
Decomposing parallel (cont'd)
51
Multiword Pairings
  • Short substrings (compounds) of the source and
    target strings
  • Example: show me → muestreme, nonstop → sin
    escalas
  • The cost: the φ statistic multiplied by the number
    of words in the source substring
  • For alignment
  • produce dependency trees with nodes that are
    compounds
  • For the transducer construction phase
  • one of the words of a compound acts as the
    headword
  • the least common word
  • an extra chain of transitions is constructed to
    transduce the other words of the compound

52
Experiments
  • Comparing the target string produced by the
    system against a reference human translation from
    held-out data
  • Simple accuracy
  • Computed by first finding a transformation of one
    string into another that minimizes the total
    weight of insertions, deletions, and
    substitutions
  • Translation accuracy
  • Includes transpositions of words as well as
    insertions, deletions, and substitutions.

53
Experiments
  • Simple accuracy
  • 1 - (I + D + S)/R
  • for the lowest edit-distance transformation
    between the reference translation and system
    output
  • I: the number of insertions
  • D: the number of deletions
  • S: the number of substitutions
  • R: the number of words in the reference
    translation string
  • Translation accuracy (a computation sketch follows
    this slide)
  • 1 - (I + D + S + T)/R
  • T: the number of transpositions in the
    lowest-weight transformation including
    transpositions
  • Since a transposition corresponds to an insertion
    and a deletion, the values of I and D for
    translation accuracy will be different from I and
    D in the computation of simple accuracy
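A small Python sketch of simple accuracy using a standard word-level Levenshtein alignment (my own illustration, not the authors' evaluation code; the transposition handling needed for translation accuracy is omitted):

def simple_accuracy(reference: list, hypothesis: list) -> float:
    # Edit distance with unit-cost insertions, deletions, and substitutions
    # of whole words; simple accuracy = 1 - (I + D + S) / R.
    R, H = len(reference), len(hypothesis)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i
    for j in range(H + 1):
        d[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return 1.0 - d[R][H] / R

print(simple_accuracy("show me the flights".split(),
                      "show the flight please".split()))   # 0.25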

54
English to Spanish
  • The training and testing data
  • a set of transcribed utterances from the Air
    Travel Information System (ATIS) corpus together
    with a translation of each utterance to Spanish.
  • An utterance is typically a single sentence but
    is sometimes more than one sentence spoken in
    sequence.
  • a total of 13,966 training bitexts
  • Alignment search and transduction training were
    carried out only on bitexts with sentences up to
    length 20
  • The test set consisted of 1,185 held-out bitexts
    at all lengths
  • Table 1 shows the word accuracy percentages for
    the trained model, e2s, against the original
    held-out translations at various source sentence
    lengths.
  • Scores are also given for a word-for-word
    baseline, sww,
  • each English word is translated by the most
    highly correlated Spanish word.

55
English to Japanese
  • The training and test data for the
    English-to-Japanese
  • a set of transcribed utterances of telephone
    service customers talking to AT&T operators.
  • These utterances, collected from real
    customer-operator interactions, tend to include
    fragmented language, restarts, etc.
  • 12,226 training bitexts and 3,253 held-out test
    bitexts
  • Both training and test partitions were restricted
    to bi-texts with at most 20 English words
  • word boundaries for the Japanese text
  • these word boundaries are parasitic on the word
    boundaries in the English transcriptions
  • the translators were asked to insert such a word
    boundary between any two Japanese characters that
    are taken to have arisen from the translation of
    distinct English words.
  • Table 2 shows the Japanese character accuracy
    percentages
  • the trained English-to-Japanese model, e2j,
  • a baseline model, jww,
  • gives each English word its most highly
    correlated translation

56
Experimental results
57
Review
  • Head Transducers
  • Dependency Transduction Models
  • Training Method
  • Experiments
  • Concluding remarks

58
Dependency grammar
59
Φ correlation measure
  • Dichotomous variables may be correlated. The kind
    of correlation that is applied to two binary
    variables is the phi correlation.
  • A correlation between two dummy variables is a
    phi correlation.
  • The phi correlation has a particular formula
    (a standard form is reconstructed below)
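The formula appeared only as an image on the original slide; the standard phi coefficient for a 2×2 table of co-occurrence counts (my reconstruction, with n11 = both words present in a bitext, n10 and n01 = only one present, n00 = neither) is:

$$
\phi = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}
            {\sqrt{(n_{11}+n_{10})\,(n_{01}+n_{00})\,(n_{11}+n_{01})\,(n_{10}+n_{00})}}
$$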