Shuffling Non-Constituents - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Shuffling Non-Constituents


1
Shuffling Non-Constituents
  • Jason Eisner

with David A. Smith and Roy
Tromble
syntactically-flavored reordering search methods
ACL SSST Workshop, June 2008
2
Starting point: Synchronous alignment
  • Synchronous grammars are very pretty.
  • But does parallel text actually have parallel
    structure?
  • Depends on what kind of parallel text
  • Free translations? Noisy translations?
  • Were the parsers trained on parallel annotation
    schemes?
  • Depends on what kind of parallel structure
  • What kinds of divergences can your synchronous
    grammar formalism capture?
  • E.g., wh-movement versus wh in situ

3
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English.
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
4
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English. A possible alignment is
shown in orange.
(Figure: the two aligned dependency trees. Glosses: donnent = give, à = to, baiser = kiss, un = a, beaucoup = lots, d' = of, enfants = kids.)
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
5
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation
from French to English. A possible alignment is
shown in orange. A much worse alignment ...
(Figure: the same trees with a much worse alignment.)
beaucoup d'enfants donnent un baiser à Sam
kids kiss Sam quite often
7
Grammar = Set of Elementary Trees
8
But many examples are harder
(Figure: aligned dependency trees. German words with glosses: Auf (to), Frage (question), diese (this), bekommen (received), ich (I), habe (have), leider (alas), Antwort (answer), keine (no), plus NULL.
English: I did not unfortunately receive an answer to this question.)
9
But many examples are harder
(Figure: the same aligned trees as above.)
Displaced modifier (negation)
11
But many examples are harder
(Figure: the same aligned trees as above.)
Displaced argument (here, because of the projective parser)
12
But many examples are harder
(Figure: the same aligned trees as above.)
Head-swapping (here, different annotation
conventions)
13
Free Translation
(Figure: aligned dependency trees. German words with glosses: Tschernobyl (Chernobyl), könnte (could), dann (then), etwas (something), später (later), an (on), die (the), Reihe (queue), kommen (come), plus NULL.
English: Then we could deal with Chernobyl some time later.)
14
Free Translation
(Figure: the same aligned trees as above.)
Probably not systematic (but words are correctly
aligned)
15
Free Translation
(Figure: the same aligned trees as above.)
Erroneous parse
16
What to do?
  • Current practice
  • Don't try to model all systematic phenomena!
  • Just use non-syntactic alignments (Giza).
  • Only care about the fragments that recur often
  • Phrases or gappy phrases
  • Sometimes even syntactic constituents (can favor these, e.g., Marton & Resnik 2008)
  • Use these (gappy) phrases in a decoder
  • Phrase-based or hierarchical

17
What to do?
  • Current practice
  • Use non-syntactic alignments (Giza)
  • Keep frequent phrases for a decoder
  • But could syntax give us better alignments?
  • Would have to be loose syntax
  • Why do we want better alignments?
  • Throw away less of the parallel training data
  • Help learn a smarter, syntactic, reordering model
  • Could help decoding: less reliance on the LM
  • Some applications care about full alignments

18
Quasi-synchronous grammar
  • How do we handle loose syntax?
  • Translation story
  • Generate target English by a monolingual grammar
  • Any grammar formalism is okay
  • Pick a dependency grammar formalism for now

P(I | did, PRP)
(Figure: the English dependency tree of "I did not unfortunately receive an answer to this question".)
P(PRP | no previous left children of did)
parsing: O(n³)
19
Quasi-synchronous grammar
  • How do we handle loose syntax?
  • Translation story
  • Generate target English by a monolingual grammar
  • But probabilities are influenced by source
    sentence
  • Each English node is aligned to some source node
  • Prefers to generate children aligned to nearby
    source nodes

(Figure: the English dependency tree again, each node aligned to a source node.)
parsing: O(n³)
20
QCFG Generative Story
observed:
(Figure: the German dependency tree over Auf, Frage, diese, bekommen, ich, leider, Antwort, keine, habe, plus NULL.)
P(parent-child), P(breakage)
P(I | did, PRP, ich)
(Figure: the English dependency tree, each node aligned to a German node.)
P(PRP | no previous left children of did, habe)
aligned parsing: O(m²n³)
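To make the generative story concrete, here is a minimal sketch (not the authors' actual parametrization) of how one English dependency might be scored: a monolingual dependency probability multiplied by a probability for how the parent's and child's source-side alignments relate (parent-child, same node, or none of the above, as on the next slide). All names and numbers are illustrative.

```python
# A minimal sketch, assuming a toy parametrization: score one English
# dependency as P(child | parent) times P(alignment configuration).
# p_mono, p_config, align, and src_tree are hypothetical inputs.

def alignment_config(src_parent, src_child, src_tree):
    """Classify how the source nodes aligned to an English parent/child relate."""
    if src_parent is None or src_child is None:
        return "unaligned"
    if src_tree.get(src_child) == src_parent:   # child's node hangs under parent's node
        return "parent-child"
    if src_child == src_parent:
        return "same-node"
    return "none-of-the-above"

def score_dependency(p_mono, p_config, eng_parent, eng_child, align, src_tree):
    mono = p_mono[(eng_parent, eng_child)]      # monolingual dependency probability
    config = alignment_config(align.get(eng_parent), align.get(eng_child), src_tree)
    return mono * p_config[config]

# Example: "did" generates its child "I"; both are aligned into the German tree.
p_mono = {("did", "I"): 0.2}
p_config = {"parent-child": 0.7, "same-node": 0.1,
            "none-of-the-above": 0.15, "unaligned": 0.05}
align = {"did": "habe", "I": "ich"}             # English word -> aligned German word
src_tree = {"ich": "habe"}                      # German child -> its head
print(score_dependency(p_mono, p_config, "did", "I", align, src_tree))  # 0.14
```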
21
What's a nearby node?
  • Given the parent's alignment, where might the child be aligned?

(Figure: the possible configurations, from the synchronous grammar case to "none of the above".)
22
Quasi-synchronous grammar
  • How do we handle loose syntax?
  • Translation story
  • Generate target English by a monolingual grammar
  • But probabilities are influenced by source
    sentence
  • Useful analogies
  • Generative grammar with latent word senses
  • MEMM
  • Generate n-gram/tag sequence,
  • but probabilities are influenced by word
    sequence

23
Quasi-synchronous grammar
  • How do we handle loose syntax?
  • Translation story
  • Generate target English by a monolingual grammar
  • But probabilities are influenced by source
    sentence
  • Useful analogies
  • Generative grammar with latent word senses
  • MEMM
  • IBM Model 1
  • Source nodes can be freely reused or unused
  • Future work: Enforce 1-to-1 to allow good decoding (NP-hard to do exactly)

24
Some results: Quasi-synchronous Dependency Grammar
  • Alignment (D. Smith & Eisner 2006)
  • Quasi-synchronous much better than synchronous
  • Maybe also better than IBM Model 4
  • Question answering (Wang et al. 2007)
  • Align question w/ potential answer
  • Mean average precision: 43 → 48 → 60
  • (previous state of the art → QG → lexical features)
  • Bootstrapping a parser for a new language (D. Smith & Eisner 2007, ongoing)
  • Learn how parsed parallel text influences target dependencies
  • Along with many other features! (cf. co-training)
  • Unsupervised: German 30 → 69, Spanish 26 → 65

25
Summary of part I
  • Current practice
  • Use non-syntactic alignments (Giza)
  • Some bits align nicely
  • Use the frequent bits in a decoder
  • Suggestion: Let syntax influence alignments.
  • So far, loose syntax methods are like IBM Model 1.
  • NP-hard to enforce 1-to-1 in any interesting model.
  • Rest of talk
  • How to enforce 1-to-1 in interesting models?
  • Can we do something smarter than beam search?

26
Shuffling Non-Constituents
  • Jason Eisner

with David A. Smith and Roy
Tromble
syntactically-flavored reordering model
ACL SSST Workshop, June 2008
27
Motivation
  • MT is really easy!
  • Just use a finite-state transducer!
  • Phrases, morphology, the works!

28
Permutation search in MT
(Figure: the tagged French sentence NNP Marie, NEG ne, PRP m', AUX a, NEG pas, VBN vu: its initial order (French), its best order (French), and then an easy transduction.)
29
Motivation
  • MT is really easy!
  • Just use a finite-state transducer!
  • Phrases, morphology, the works!
  • Just have to fix that pesky word order.

Framing it this way lets us enforce 1-to-1
exactly at the permutation step. Deletion and
fertility > 1 are still allowed in the subsequent
transduction.
30
Often want to find an optimal permutation
  • Machine translation: Reorder French to French-prime (Brown et al. 1992) so it's easier to align or translate
  • MT eval
  • How much do you need to rearrange MT output so it scores well under an LM derived from reference translations?
  • Discourse generation, e.g., multi-doc summarization: Order the output sentences (Lapata 2003) so they flow nicely
  • Reconstruct temporal order of events after info extraction
  • Learn rule ordering or constraint ranking for phonology?
  • Multi-word anagrams that score well under an LM

31
Other applications (there are many )
  • LOP
  • Maximum-weight acyclic subgraph (equivalent)
  • Graph drawing, task scheduling, archaeology,
    aggregating ranked ballots,
  • TSP
  • Transportation scheduling (schoolbus,
    meals-on-wheels, service calls, )
  • Motion scheduling (drill head, space telescopes,
    )
  • Topology of a ring network
  • Genome assembly

32
Permutation search: The problem
initial order
best order according to some cost function
33
Traditional approach: Beam search
Approx. best path through a really big FSA: N! paths, one for each permutation, but only 2^N states
34
An alternative: Local search (hill climbing)
The SWAP neighborhood of the current order 1 2 3 4 5 6 (cost 22):
1 3 2 4 5 6 (cost 20), 2 1 3 4 5 6 (cost 26), 1 2 4 3 5 6 (cost 19), 1 2 3 5 4 6 (cost 25)
35
An alternative: Local search (hill-climbing)
The SWAP neighborhood: step from 1 2 3 4 5 6 (cost 22) to the best neighbor, 1 2 4 3 5 6 (cost 19).
36
An alternative: Local search (hill-climbing). Like the greedy decoder of Germann et al. 2001.
The SWAP neighborhood
(Figure: the current order, cost 22.)
Why are the costs always going down? We pick the best swap.
How long does it take to pick the best swap? O(N) if you're careful.
How many swaps might you need to reach the answer? O(N²).
What if you get stuck in a local min? Random restarts.
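A minimal sketch of this hill-climbing loop over the SWAP neighborhood, assuming an arbitrary black-box cost function; the O(N) bookkeeping for picking the best swap and the larger neighborhoods discussed next are not implemented here.

```python
import random

def swap_hill_climb(perm, cost, restarts=10):
    """Greedy local search over the SWAP neighborhood (adjacent transpositions).
    cost maps a permutation tuple to a number; lower is better.
    Each step scans all N-1 adjacent swaps and takes the best one."""
    best, best_cost = list(perm), cost(tuple(perm))
    for r in range(restarts):
        cur = list(perm) if r == 0 else random.sample(list(perm), len(perm))
        cur_cost = cost(tuple(cur))
        while True:
            # evaluate every adjacent swap and keep the best ("we pick best swap")
            cands = []
            for i in range(len(cur) - 1):
                cand = cur[:i] + [cur[i+1], cur[i]] + cur[i+2:]
                cands.append((cost(tuple(cand)), cand))
            c, cand = min(cands, key=lambda x: x[0])
            if c >= cur_cost:
                break                              # local minimum reached
            cur, cur_cost = cand, c
        if cur_cost < best_cost:
            best, best_cost = cur, cur_cost
    return best, best_cost

# tiny usage example: cost = number of out-of-order pairs, so sorting is optimal
inversions = lambda p: sum(p[i] > p[j] for i in range(len(p)) for j in range(i+1, len(p)))
print(swap_hill_climb((3, 1, 4, 2, 5), inversions))
```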
37
Larger neighborhood
(The SWAP neighborhood of 1 2 3 4 5 6 (cost 22) again: 1 3 2 4 5 6 (cost 20), 2 1 3 4 5 6 (cost 26), 1 2 4 3 5 6 (cost 19), 1 2 3 5 4 6 (cost 25).)
38
Larger neighborhood (well known in the literature; reportedly works well)
The INSERT neighborhood
(Figure: move one item, e.g., 3, to a new position in 1 2 3 4 5 6; cost 22.)
Fewer local minima? Yes: 3 can move past 4 to get past 5.
Graph diameter (max moves needed)? O(N) rather than O(N²).
How many neighbors? O(N²) rather than O(N).
How long to find the best neighbor? O(N²) rather than O(N).
39
Even larger neighborhood
The BLOCK neighborhood
(Figure: exchange two adjacent blocks of 1 2 3 4 5 6; cost 22.)
Fewer local minima? Yes: 2 can get past 4 5 without having to cross 3, or move 3 first.
Graph diameter (max moves needed)? Still O(N).
How many neighbors? O(N³) rather than O(N), O(N²).
How long to find the best neighbor? O(N³) rather than O(N), O(N²).
40
Larger yet: via dynamic programming??
(Figure: the current order, cost 22.)
Fewer local minima?
Graph diameter (max moves needed)? logarithmic
How many neighbors? exponential
How long to find the best neighbor? polynomial
41
Unifying/generalizing neighborhoods so far
(Figure: exchanging two adjacent blocks of the current order.)
Exchange two adjacent blocks, of max widths w and w′:
SWAP: w = 1, w′ = 1
INSERT: w = 1, w′ = N
BLOCK: w = N, w′ = N
runtime and number of neighbors: O(w·w′·N)
everything in this talk can be generalized to other values of w, w′
42
Very large-scale neighborhoods
  • What if we consider multiple simultaneous exchanges that are independent?
  • The DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000)

(Figure: several non-overlapping adjacent swaps applied at once to 1 2 3 4 5 6.)
Lowest-cost neighbor is lowest-cost path
43
Very large-scale neighborhoods
Lowest-cost neighbor is lowest-cost path
  • Why would this be a good idea?

Help get out of bad local minima? No, they're still local minima.
Help avoid getting into bad local minima? Yes, less greedy.
(Figure: an example 4×4 table of swap costs along the path.)
44
Very large-scale neighborhoods
Lowest-cost neighbor is lowest-cost path
  • Why would this be a good idea?

Help get out of bad local minima? No, they're still local minima.
Help avoid getting into bad local minima? Yes, less greedy.
More efficient? Yes! The shortest-path algorithm finds the best set of swaps in O(N) time, as fast as the best single swap. Up to N moves as fast as 1 move: no penalty for parallelism! Globally optimizes over exponentially many neighbors (paths).
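A minimal sketch of one DYNASEARCH step for LOP costs, under my own indexing conventions: the cost changes of adjacent swaps are independent as long as the swaps don't overlap, so a single O(N) dynamic program (a shortest path through "swap here or don't" decisions) picks the best set of simultaneous swaps.

```python
def dynasearch_step(perm, b):
    """One DYNASEARCH move for the Linear Ordering Problem: find the best set
    of non-overlapping adjacent swaps with a single O(N) dynamic program.
    perm is a list of item ids; b[u][v] = cost of item u preceding item v."""
    n = len(perm)
    # delta[j]: change in total LOP cost if positions j and j+1 are swapped
    delta = [b[perm[j+1]][perm[j]] - b[perm[j]][perm[j+1]] for j in range(n - 1)]
    g = [0.0] * (n + 1)          # g[i]: best change using swaps within positions 0..i-1
    take = [False] * (n + 1)
    for i in range(2, n + 1):
        g[i] = g[i-1]
        if g[i-2] + delta[i-2] < g[i]:
            g[i] = g[i-2] + delta[i-2]
            take[i] = True
    new = list(perm)             # trace back and apply the chosen swaps
    i = n
    while i >= 2:
        if take[i]:
            new[i-2], new[i-1] = new[i-1], new[i-2]
            i -= 2
        else:
            i -= 1
    return new, g[n]             # g[n] <= 0; 0 means no improving set of swaps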
45
Can we extend this idea (up to N moves in parallel, by dynamic programming) to neighborhoods beyond SWAP?
(Exchange two adjacent blocks of max widths w and w′, as before: SWAP = 1, 1; INSERT = 1, N; BLOCK = N, N; runtime and number of neighbors O(w·w′·N).)
46
Let's define each neighbor by a colored tree. Just like ITG!
(Figure: a colored tree over the current order 1 2 3 4 5 6.)
48
Let's define each neighbor by a colored tree. Just like ITG!
(Figure: another colored tree.)
This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.
49
If that was the optimal neighbor ... now look for its optimal neighbor (a new tree!)
(Figure: a new colored tree over the reordered items.)
51
If that was the optimal neighbor ... now look for its optimal neighbor ... repeat till you reach a local optimum.
Each tree defines a neighbor. At each step, optimize over all possible trees by dynamic programming (CKY parsing).
(Figure: another colored tree over the current order.)
Use your favorite parsing speedups (pruning, best-first, ...)
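Here is a rough sketch of one such search step for LOP costs only (no WFSA component), under my own indexing: a CKY-style dynamic program over spans of the current order, where each binary node may keep or swap its two children's blocks. For clarity the cross-block cost uses an O(span) inner sum, giving O(N⁴) overall; the O(1) reuse trick discussed on a later slide brings the talk's version down to O(N³).

```python
def best_itg_neighbor(perm, b):
    """One step of the colored-tree (ITG-style) neighborhood for the Linear
    Ordering Problem: CKY over all ways of recursively exchanging adjacent
    blocks of the current order.  b[u][v] = cost of item u preceding item v.
    Returns (delta, neighbor): the best cost change (<= 0) and the reordering."""
    n = len(perm)
    # S[x][y]: cost change if the item now at position x ends up after the
    # item now at position y (for x < y in the current order)
    S = [[b[perm[y]][perm[x]] - b[perm[x]][perm[y]] for y in range(n)] for x in range(n)]
    R = [[0] * (n + 1) for _ in range(n)]          # prefix sums of S over y
    for x in range(n):
        for y in range(n):
            R[x][y + 1] = R[x][y] + S[x][y]

    dp = [[0.0] * (n + 1) for _ in range(n + 1)]   # best delta for span [i, k)
    back = {}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            best, arg = 0.0, None                  # None = leave the span as-is
            for j in range(i + 1, k):
                straight = dp[i][j] + dp[j][k]
                swap = straight + sum(R[x][k] - R[x][j] for x in range(i, j))
                for val, tag in ((straight, ("S", j)), (swap, ("I", j))):
                    if val < best:
                        best, arg = val, tag
            dp[i][k], back[(i, k)] = best, arg

    def build(i, k):                               # follow backpointers
        if k - i <= 1 or back.get((i, k)) is None:
            return list(perm[i:k])
        tag, j = back[(i, k)]
        left, right = build(i, j), build(j, k)
        return left + right if tag == "S" else right + left

    return dp[0][n], build(0, n)
```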
52
Very-large-scale versions of SWAP, INSERT, and BLOCK: all by the algorithm we just saw.
(Exchange two adjacent blocks, of max widths w and w′.)
Runtime of the algorithm we just saw was O(N³), because we considered O(N³) distinct (i, j, k) triples. More generally, restrict to only the O(w·w′·N) triples of interest to define a smaller neighborhood with runtime O(w·w′·N). (Yes, the dynamic programming recurrences go through.)
53
How many steps to get from here to there?
(Figure: an arbitrary initial order of eight items and the best order.)
One twisted-tree step? No. As you probably know, 3 1 4 2 → 1 2 3 4 is impossible.
54
Can you get to the answer in one step?
German-English, Giza alignment
not always (yay, local search)
often (yay, big neighborhood)
55
How many steps to the answer in the worst case?
(what is diameter of the search space?)
(Figure: an arbitrary order of eight items and the target order.)
Claim: only log₂ N steps at worst (if you know where to step). Let's sketch the proof!
56
Quicksort anything into, e.g., 1 2 3 4 5 6 7 8
(Figure: a right-branching tree that partitions the items around a pivot: ≥ 5 on one side, ≤ 4 on the other.)
57
Quicksort anything into, e.g., 1 2 3 4 5 6 7 8
Only log2 N steps to get to 1 2 3 4 5 6 7 8
or to anywhere!
(Figure: the recursive partition: around 4/5 at the top level, then around 6/7 and 2/3 below.)
58
Defining "best order": What class of cost functions can we handle efficiently? How fast can we compute a subtree's cost from its child subtrees?
initial order
best order according to some cost function
59
Defining "best order": What class of cost functions?
A =
0 15 22 80 5 -7
-30 0 -76 24 63 -44
15 28 0 -15 71 -99
12 8 -31 0 54 -6
7 -9 41 24 0 82
6 5 -22 8 93 0
Traveling Salesperson Problem (TSP): the cost of an order adds one entry of A per adjacent pair, e.g., a14 + a42 + a25 + a56 + a63 + a31
best order according to some cost function
60
Defining "best order": What class of cost functions?
B =
0 5 -22 93 8 6
12 0 8 -31 -6 54
-7 41 0 -9 24 82
88 17 -6 0 12 -60
11 -17 10 -59 0 23
5 4 -12 6 55 0
Linear Ordering Problem (LOP): b26 = cost of 2 preceding 6. Add up n(n-1)/2 such costs; any order will incur either b26 or b62.
best order according to some cost function
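For concreteness, minimal cost functions for the two objectives, assuming items are indexed 0..n-1 and the matrices are nested lists; whether the TSP variant also pays for returning to the start is left out.

```python
def tsp_cost(order, a):
    """TSP-style cost: add one entry of A for each adjacent pair in the order."""
    return sum(a[order[i]][order[i + 1]] for i in range(len(order) - 1))

def lop_cost(order, b):
    """LOP cost: for each of the n(n-1)/2 unordered pairs, add b[u][v]
    if u ends up anywhere before v (otherwise the order incurs b[v][u])."""
    return sum(b[order[i]][order[j]]
               for i in range(len(order)) for j in range(i + 1, len(order)))
```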
61
Defining "best order": What class of cost functions?
  • TSP and LOP are both NP-complete
  • In fact, believed to be inapproximable
  • hard even to achieve C × optimal cost (for any C ≥ 1)
  • Practical approaches
  • correct answer, typically fast → branch-and-bound, ILP, ...
  • fast answer, typically close to correct → beam search, ..., this talk, ...

62
Moving small blocks helps on LOP (experiment on the LOLIB collection of 250-word problems from economics)
63
Defining "best order": What class of cost functions?
initial order
cost of this order:
  • Does my favorite WFSA like this string? (a sketch of WFSA scoring follows below)
  • Non-local pair order ok?
  • Non-local triple order ok?
  • Can add these all up
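A minimal sketch of the first component, assuming the WFSA is given as a plain transition table; the talk's actual integration of WFSA costs is via grammar nonterminals (a few slides later), so this only shows how a single proposed order would be scored.

```python
import math

def wfsa_cost(seq, arcs, start, final_cost):
    """Viterbi cost of a sequence under a weighted FSA.
    arcs[(state, symbol)] -> list of (next_state, cost); lower total is better.
    final_cost[state] -> cost of stopping there (missing = not a final state)."""
    best = {start: 0.0}
    for sym in seq:
        nxt = {}
        for state, c in best.items():
            for to, w in arcs.get((state, sym), []):
                if c + w < nxt.get(to, math.inf):
                    nxt[to] = c + w
        best = nxt
        if not best:
            return math.inf          # the automaton rejects this order
    return min((c + final_cost.get(s, math.inf) for s, c in best.items()),
               default=math.inf)

# e.g., a bigram language model over items is a WFSA whose state is the previous item
```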

64
Costs are derived from source sentence features
(Figure: the tagged French sentence NNP Marie, NEG ne, PRP m', AUX a, NEG pas, VBN vu in its initial order, with its cost matrices B and A.)
65
Costs are derived from source sentence features
(Figure: the same sentence and cost matrices.)
Can also include phrase boundary symbols in the input!
66
Costs are derived from source sentence features
(Figure: the same sentence and cost matrices.)
FSA costs: distortion model; language model looks ahead to the next step! (Will this order have a good finite-state translation into good English?)
67
Dynamic program must pick the tree that leads to
the lowest-cost permutation
initial order
cost of this order
  1. Does my favorite WFSA like it as a string?

68
Scoring with a weighted FSA
This particular WFSA implements TSP scoring for N = 3: after you read 1, you're in state 1; after you read 2, you're in state 2; after you read 3, you're in state 3; and this state determines the cost of the next symbol you read.
  • We'll handle a WFSA with Q states by using a fancier grammar, with nonterminals. (Now the runtime goes up to O(N³Q³).)

69
Including WFSA costs via nonterminals
A possible preterminal for word 2 is an arc in A that's labeled with 2. The preterminal 4→2 rewrites as word 2, with a cost equal to the arc's cost.
(Figure: the words 1-6 under their arc preterminals.)
70
Including WFSA costs via nonterminals
71
Dynamic program must pick the tree that leads to
the lowest-cost permutation
initial order
cost of this order
  1. Does my favorite WFSA like it as a string?
  2. Non-local pair order ok?

72
Incorporating the pairwise ordering costs
This puts 5, 6, 7 before 1, 2, 3, 4.
So this hypothesis must add costs 5 < 1, 5 < 2, 5 < 3, 5 < 4, 6 < 1, 6 < 2, 6 < 3, 6 < 4, 7 < 1, 7 < 2, 7 < 3, 7 < 4. Uh-oh! So now it takes O(N²) time to combine two subtrees, instead of O(1) time? Nope: dynamic programming to the rescue again!
73
Computing LOP cost of a block move
(Figure: exchanging the block 5, 6, 7 with the block 1, 2, 3, 4.)
This puts 5, 6, 7 before 1, 2, 3, 4, so we would have to add O(N²) costs just to consider this single neighbor! Instead, reuse work from other, narrower block moves: the new cost is computed in O(1). (The figure assembles the wide move's cost from the costs of narrower block moves by adding and subtracting them, plus one brand-new pairwise cost.)
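A minimal sketch of that reuse, with my own indexing: D[i][j][k] is the LOP cost change for exchanging the adjacent blocks occupying positions [i, j) and [j, k) of the current order, and each entry is filled from three narrower block moves plus one new pair.

```python
def block_move_costs(perm, b):
    """LOP cost changes D[i][j][k] for exchanging the adjacent blocks that
    occupy positions [i, j) and [j, k) of the current order.  The O(N^3)
    table is filled so each wider block move costs only O(1) extra work,
    by inclusion-exclusion over narrower moves plus the new pair (i, k-1)."""
    n = len(perm)
    # S[x][y]: cost change if the item now at position x ends up after the
    # item now at position y
    S = [[b[perm[y]][perm[x]] - b[perm[x]][perm[y]] for y in range(n)]
         for x in range(n)]
    D = [[[0.0] * (n + 1) for _ in range(n + 1)] for _ in range(n + 1)]
    for width in range(2, n + 1):          # width = k - i
        for i in range(n - width + 1):
            k = i + width
            for j in range(i + 1, k):      # split point between the two blocks
                D[i][j][k] = (D[i+1][j][k] + D[i][j][k-1]
                              - D[i+1][j][k-1] + S[i][k-1])
    return D
```

In the actual search one would only fill the triples of interest (as on the earlier slide), rather than the full table sketched here.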
74
Incorporating 3-way ordering costs
  • See the initial paper (Eisner & Tromble 2006)
  • A little tricky, but
  • comes for free if you're willing to accept a certain restriction on these costs
  • more expensive without that restriction, but possible

75
Another option: Markov chain Monte Carlo
  • Random walk in the space of permutations
  • interpret a permutation's cost as a log-probability
  • Sample a permutation from the neighborhood
    instead of always picking the most probable
  • Why?
  • Simulated annealing might beat greedy-with-random-
    restarts
  • When learning the parameters of the distribution,
    can use sampling to compute the feature
    expectations

76
Another option: Markov chain Monte Carlo
  • Random walk in the space of permutations
  • interpret a permutation's cost as a log-probability
  • Sample a permutation from the neighborhood instead of always picking the most probable
  • How?
  • Pitfall: Sampling a permutation ≠ sampling a tree
  • Spurious ambiguity: some permutations have many trees
  • Solution: Exclude some trees, leaving 1 per permutation
  • A normal form has long been known for colored trees
  • For restricted colored trees (which limit the size of blocks to swap), we have devised a more complicated normal form

77
Sampling from permutation space: p(π) = exp(-cost(π)) / Z. Why is this useful?
  • To train the weights that determine the cost matrix (as we saw earlier)
  • And to compute expectations of other quantities (e.g., how often does 2 precede 5?)
  • Less greedy heuristic for finding the lowest-cost permutation
  • This is the mode of p, i.e., the highest-probability permutation.
  • Take a sample from p. If most of p's probability mass is on the mode, you have a good chance of getting the mode.
  • If not, boost the odds: sample instead from p_β, for β > 1
  • defined as p_β(π) = exp(-β·cost(π)) / Z_β (so p_2(π) is proportional to p(π)²)
  • As β → ∞, the chance of getting the mode → 1
  • But as β → ∞, MCMC sampling gets slower and slower (no free lunch!)
  • → simulated annealing: gradually increase β during MCMC sampling
-Z′/Z = Σ_π p(π) cost′(π) = E_p[cost′(π)]   (′ = derivative with respect to a weight)
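A generic sketch of that annealed sampler, using single adjacent-swap proposals for brevity; the talk instead samples whole neighbors from the large tree-shaped neighborhoods, using normal-form trees. The schedule `betas` is an assumed increasing sequence of inverse temperatures.

```python
import math, random

def anneal(perm, cost, betas):
    """Metropolis sampler over permutations with adjacent-swap proposals, run
    at an increasing schedule of inverse temperatures (simulated annealing)."""
    cur = list(perm)
    cur_cost = cost(tuple(cur))
    best, best_cost = list(cur), cur_cost
    for beta in betas:
        i = random.randrange(len(cur) - 1)
        cand = cur[:i] + [cur[i+1], cur[i]] + cur[i+2:]
        cand_cost = cost(tuple(cand))
        # p_beta(pi) is proportional to exp(-beta * cost(pi)), so accept with
        # probability min(1, exp(-beta * (cand_cost - cur_cost)))
        if cand_cost <= cur_cost or random.random() < math.exp(-beta * (cand_cost - cur_cost)):
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = list(cur), cur_cost
    return best, best_cost

# e.g., betas = [0.1 + 0.001 * t for t in range(20000)]   (gradually raise beta)
```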
78
Learning the costs
  • Where do these costs come from?
  • If we have some examples on which we know the
    true permutation, could try to learn them

(Figure: the cost matrices B and A from before.)
79
Learning the costs
  • Where do these costs come from?
  • If we have some examples on which we know the
    true permutation, could try to learn them
  • More precisely, try to learn these weights θ (the knowledge that's reused across examples)

50: a verb (e.g., vu) shouldn't precede its subject (e.g., Marie)
27: words at a distance of 5 shouldn't swap order
-2: words with PRP between them ought to swap
(Figure: the cost matrices B and A.)
80
Learning the costs
  • Typical learning approach (details omitted): Tune the weights θ to maximize the probability of the correct answer π*
  • Probability??? We were just trying to minimize the cost.
  • But there's a standard way to convert costs to probabilities

actually, log probability: a convex optimization with the same answer
  • For every permutation π, define p(π) = exp(-cost(π)) / Z
  • where the partition function Z = Σ_π exp(-cost(π)), so Σ_π p(π) = 1
  • Search is now argmax_π p(π)
  • Learning is now argmax_θ log p(π*): increase log p(π*) by gradient ascent

(plus the example weights from the previous slide)
81
Learning the costs
  • Typical learning approach (details omitted): Tune the weights θ to maximize the probability of the correct answer π*

actually, log probability: a convex optimization with the same answer
Find the gradient of log p(π*) with respect to the weights θ we're trying to learn. Easy: cost(π) is typically just a sum of many weights. Slow: requires a sum over all permutations!
  • For every permutation π, define p(π) = exp(-cost(π)) / Z
  • where the partition function Z = Σ_π exp(-cost(π)), so Σ_π p(π) = 1
  • Search is now argmax_π p(π)
  • Learning: increase log p(π*) by gradient ascent

82
Learning the costs
  • Typical learning approach (details omitted): Tune the weights θ to maximize the probability of the correct answer π*

actually, log probability: a convex optimization with the same answer
Find the gradient of log p(π*) with respect to the weights θ. What is this gradient anyway?
(log p(π*))′ = (-cost(π*) - log Z)′ = -cost′(π*) - Z′/Z
Z = Σ_π exp(-cost(π)), so Z′ = Σ_π exp(-cost(π)) · (-cost′(π))
so -Z′/Z = Σ_π p(π) cost′(π) = E_p[cost′(π)]
  • For every permutation π, define p(π) = exp(-cost(π)) / Z
  • where the partition function Z = Σ_π exp(-cost(π)), so Σ_π p(π) = 1
  • Search is now argmax_π p(π)
  • Learning: increase log p(π*) by gradient ascent

aha! estimate by sampling from p (more about this
later)
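A minimal sketch of that sampling-based gradient step, assuming a linear cost cost(π) = Σ_k θ[k]·f_k(π) so that cost′ is just the feature vector; the feature extractor `feats` and the sampled permutations are whatever your features and MCMC sampler provide.

```python
def sgd_step(theta, feats, gold_perm, sampled_perms, lr=0.1):
    """One gradient-ascent step on log p(gold), assuming a linear cost
    cost(pi) = sum_k theta[k] * feats(pi)[k] and p(pi) ~ exp(-cost(pi)) / Z.
    The gradient is E_p[feats] - feats(gold); the intractable expectation is
    replaced by an average over MCMC-sampled permutations."""
    gold_f = feats(gold_perm)
    expect = {}
    for pi in sampled_perms:               # Monte Carlo estimate of E_p[feats]
        for k, v in feats(pi).items():
            expect[k] = expect.get(k, 0.0) + v / len(sampled_perms)
    for k in set(gold_f) | set(expect):    # ascend the estimated gradient
        theta[k] = theta.get(k, 0.0) + lr * (expect.get(k, 0.0) - gold_f.get(k, 0.0))
    return theta
```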
83
Experimenting with training LOP params (the LOP is quite fast: O(n³) with no grammar constant)
PDS VMFIN PPER ADV APPR ART NN PTKNEG VVINF .
Das kann ich so aus dem Stand nicht sagen .  (roughly: "I can't say that offhand.")
B7,9
84
LOP feature templates
85
LOP feature templates
  • Only LOP features so far (a sketch of such templates follows below)
  • And they're unnecessarily simple (don't examine syntactic constituency)
  • And the input sequence is only words (not interspersed with syntactic brackets)
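For concreteness, a sketch of how weighted templates of this kind could populate the pairwise cost matrix B; the particular templates and the weight dictionary `theta` are illustrative stand-ins, not the talk's exact feature set.

```python
def lop_matrix(words, tags, theta):
    """Fill the pairwise cost matrix B from weighted feature templates:
    B[i][j] = cost of the word at position i ending up before the word at
    position j.  The templates below (tag pair, word pair, distance-bucketed
    tag pair, intervening tag) are illustrative; theta maps feature tuples
    to learned weights."""
    n = len(words)
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            feats = [("tag-pair", tags[i], tags[j]),
                     ("word-pair", words[i], words[j]),
                     ("tag-pair-dist", tags[i], tags[j], min(abs(i - j), 5))]
            feats += [("between", tags[i], t, tags[j])
                      for t in tags[min(i, j) + 1:max(i, j)]]
            B[i][j] = sum(theta.get(f, 0.0) for f in feats)
    return B
```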

86
Learning LOP Costs for MT
(interesting, if odd, to try to reorder with only the LOP costs)
(Pipeline: German → German′ → English; MOSES baseline.)
  • Define German′ to be German in English word order
  • To get German′ for training data, use Giza to align all German positions to English positions (disallow NULL)
87
Learning LOP Costs for MT (interesting, if odd, to try to reorder with only the LOP costs)
(Pipeline: German → German′ → English; MOSES baseline.)
  • Easy first try: Naïve Bayes
  • Treat each feature in θ as independent
  • Count and normalize over the training data
  • No real improvement over baseline

88
Learning LOP Costs for MT (interesting, if odd, to try to reorder with only the LOP costs)
(Pipeline: German → German′ → English; MOSES baseline.)
  • Easy second try: Perceptron
(Figure: run local search to a local optimum, ..., then update toward the gold standard.)
Note: Search error can be beneficial, e.g., just take 1 step from the identity permutation
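A minimal sketch of the perceptron update under the same linear cost cost(π) = Σ_k θ[k]·f_k(π): after local search returns its (possibly only locally optimal) reordering, push the gold ordering's features down in cost and the prediction's up.

```python
def perceptron_update(theta, feats, predicted_perm, gold_perm, lr=1.0):
    """Structured-perceptron-style update after a local-search run, assuming
    the linear cost cost(pi) = sum_k theta[k] * feats(pi)[k]: make the gold
    ordering cheaper and the (locally optimal) prediction more expensive."""
    pred_f, gold_f = feats(predicted_perm), feats(gold_perm)
    for k in set(pred_f) | set(gold_f):
        theta[k] = theta.get(k, 0.0) + lr * (pred_f.get(k, 0.0) - gold_f.get(k, 0.0))
    return theta
```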
89
(Figure: search error vs. model error. Warning: different data.)
90
Benefit from reordering
Learning method           | BLEU vs. German′ | BLEU vs. English
No reordering             | 49.65            | 25.55
Naïve Bayes, POS          | 49.21            |
Naïve Bayes, POS+lexical  | 49.75            |
Perceptron, POS           | 50.05            | 25.92
Perceptron, POS+lexical   | 51.30            | 26.34
obviously, not yet unscrambling German; need more features
91
Alternatively, work back from gold standard
  • Contrastive estimation (Smith & Eisner 2005)
  • Maximize the probability of the desired
    permutation relative to its ITG neighborhood
  • Requires summing all permutations in a
    neighborhood
  • Must use normal-form trees here
  • Stochastic gradient descent

gold standard
92
Alternatively, work back from gold standard
  • k-best MIRA in the neighborhood
  • Make gold standard beat its local competitors
  • Beat the bad ones by a bigger margin
  • Good = close to gold in swap distance?
  • Good = close to gold using BLEU?
  • Good = translates into English that's close to the reference?

gold standard
93
Alternatively, train each iterate
(Figure: at each iterate, the model's best permutation in the neighborhood of π(0) is updated toward an oracle in the neighborhood of π(0).)
  • Or could do a k-best MIRA version of this, too; even use a loss measure based on lookahead to π(n)

94
Open Questions
  • Search: Is there practical benefit to using larger neighborhoods (speed, quality of solution) for hill-climbing? For MCMC?
  • Search: Are the large-scale versions worth the constant-factor runtime penalty? At some sizes?
  • Learning: How should we learn the weights if we plan to use them in greedy search?
  • Learning: Can we tune adaptive search methods that vary the neighborhood and the temperature dynamically from step to step?
  • Theoretical: Can it be determined in polytime whether two permutations have a common neighbor (using the full colored-tree neighborhood)?
  • Theoretical: Mixing time of MCMC with these neighborhoods?
  • Algorithmic: Is there a master theorem for normal forms?

95
Summary of part II
  • Local search is fun and easy
  • Popular elsewhere in AI
  • Closely related to MCMC sampling
  • Probably useful for translation
  • Maybe other NP-hard problems too
  • Can efficiently use huge local neighborhoods
  • Algorithms are closely related to parsing and
    FSMs
  • Our community knows that stuff better than
    anyone!