Title: Towards Syntactically Constrained Statistical Word Alignment
1 Towards Syntactically Constrained Statistical Word Alignment
- Greg Hanneman
- 11-734 Advanced Machine Translation Seminar
- April 30, 2008
2 Outline
- The word alignment problem
- Base approaches
- Syntax-based approaches
  - Distortion models
  - Tree-to-string models
  - Tree-to-tree models
- Discussion
3 Word Alignment
- Parallel sentence pair F and E
- Most general: map a subset of F to a subset of E
4 Word Alignment
- Very large alignment spaces!
- An n-word parallel sentence has $n^2$ possible links and $2^{n^2}$ possible alignments
- Restrict to one-to-one alignments: $n!$ possible alignments
- Alignment models try to restrict or learn a probability distribution over this space to get the best alignment of a sentence
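To make the growth rates above concrete, here is a minimal sketch that evaluates the three counts for a few sentence lengths. The function name `alignment_space_sizes` is made up for illustration, and the sketch assumes both sentences have exactly n words, as the slide does.

```python
from math import factorial

def alignment_space_sizes(n: int) -> dict:
    """Count links and alignments for an n-word by n-word sentence pair."""
    links = n * n                      # every (source word, target word) pair
    return {
        "links": links,
        "unconstrained": 2 ** links,   # each link is independently on or off
        "one_to_one": factorial(n),    # each permutation is one alignment
    }

for n in (2, 5, 10):
    print(n, alignment_space_sizes(n))
```

Even the one-to-one restriction leaves a factorial space, which is why the models on the following slides add further structure.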
6 A Generative Story [Brown et al. 1990]
7 The Framework
- F words: $f_1 \ldots f_j \ldots f_n$
- E words: $e_1 \ldots e_i \ldots e_m$
- Compute $P(F, A \mid E)$ for hidden alignment variable $A = a_1 \ldots a_j \ldots a_n$
- The major step: decomposition, model parameters, EM algorithm, etc.
- $a_j = i$: word $f_j$ is aligned to word $e_i$
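Because A is hidden, the framework is normally used in two ways: summing it out to score a sentence pair, and maximizing over it to read off a single alignment. Written in the slide's symbols (the hat notation $\hat{A}$ is an addition for illustration):

```latex
P(F \mid E) = \sum_{A} P(F, A \mid E)
\qquad
\hat{A} = \arg\max_{A} P(F, A \mid E)
```

The individual models differ in how they decompose $P(F, A \mid E)$ into parameters that EM can estimate.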
8 The IBM Models [Brown et al. 1993; Och and Ney 2003]
- Model 1: Bag of words; word order doesn't affect alignment
- Model 2: Position of words being aligned does matter
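The bag-of-words property of Model 1 shows up directly in its EM updates: the posterior for a link depends only on lexical translation probabilities, never on positions. The following is a minimal sketch under simplifying assumptions (no NULL word, uniform initialization, a three-sentence toy corpus); the names `train_model1` and `viterbi_align` are illustrative and not taken from any toolkit.

```python
from collections import defaultdict

def train_model1(corpus, iterations=10):
    """corpus: list of (f_words, e_words) pairs. Returns t[(f, e)] ~ P(f | e)."""
    f_vocab = {f for f_words, _ in corpus for f in f_words}
    t = defaultdict(lambda: 1.0 / len(f_vocab))   # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)   # expected count of (f, e) links
        total = defaultdict(float)   # expected count of e being linked
        for f_words, e_words in corpus:
            for f in f_words:
                # E-step: posterior over which e generated f (position-independent)
                norm = sum(t[(f, e)] for e in e_words)
                for e in e_words:
                    p = t[(f, e)] / norm
                    count[(f, e)] += p
                    total[e] += p
        # M-step: renormalize the expected counts
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

def viterbi_align(f_words, e_words, t):
    """a_j = index i of the most probable e_i for each f_j (no NULL word here)."""
    return [max(range(len(e_words)), key=lambda i: t[(f_words[j], e_words[i])])
            for j in range(len(f_words))]

# Toy corpus (an assumption for illustration only)
corpus = [("das haus".split(), "the house".split()),
          ("das buch".split(), "the book".split()),
          ("ein buch".split(), "a book".split())]
t = train_model1(corpus)
print(viterbi_align("das haus".split(), "the house".split(), t))
```

Model 2 would multiply each term by a position-dependent alignment probability, which is the point at which word order starts to matter.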
9 The IBM Models [Brown et al. 1993; Och and Ney 2003]
- Later models use more implicit structural or linguistic information, but not really syntax, and not really overtly
- Fertility: $P(\phi \mid e_i)$ of $e_i$ producing $\phi$ words in F
- Distortion: $P(\tau, \pi \mid E)$ for a set of F words $\tau$ in a permutation $\pi$
- Previous alignments: probabilities for positions in F of the different words of a fertile $e_i$
10 The HMM Model [Vogel et al. 1996; Och and Ney 2003]
- Linguistic intuition: words, and their alignments, tend to clump together in clusters
- $a_j$ depends on absolute size of jump between it and $a_{j-1}$
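A jump-based distortion model can be sketched as a distribution over the next aligned position that depends only on the jump width from the previous one. The weight function below (halving with each extra position) is a hypothetical stand-in for learned parameters, and `jump_probs` is an illustrative name.

```python
def jump_probs(prev_i, m, jump_weight):
    """Distribution over a_j in 0..m-1 given a_{j-1} = prev_i.

    jump_weight(d) is a non-negative weight for jump width d = i - prev_i.
    """
    weights = [jump_weight(i - prev_i) for i in range(m)]
    z = sum(weights)                      # normalize over the m target positions
    return [w / z for w in weights]

# Hypothetical weights: short jumps are preferred, so alignments clump together.
probs = jump_probs(prev_i=2, m=6, jump_weight=lambda d: 0.5 ** abs(d))
print([round(p, 3) for p in probs])
```

In a trained HMM aligner the weights would be estimated with EM alongside the lexical probabilities rather than fixed by hand.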
11 Discriminative Training
- Consider all possible alignments, score them, and pick the best ones under some set of constraints
- Can incorporate arbitrary features; generative models more fixed
- Generative models: EM requires lots of unlabeled training data; discriminative requires some labeled data
12 Discriminative Alignment [Taskar et al. 2005]
- Features:
  - Co-occurrence
  - Position difference
  - Co-occurrence of following words
  - Word-frequency rank
  - Model 4 prediction
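The "score every alignment and pick the best" idea can be sketched as a linear model over link features followed by a search for the best one-to-one matching. This sketch swaps in exhaustive search over permutations, which is only feasible for tiny sentences and is not the solver used in the cited work; the feature values and weights are invented for illustration.

```python
from itertools import permutations

def link_score(features, weights):
    """Linear model: the score of one link is the dot product w . phi(link)."""
    return sum(weights[name] * value for name, value in features.items())

def best_matching(scores):
    """Exhaustively search one-to-one alignments of an n x n score matrix.

    scores[j][i] is the score of linking f_j to e_i; returns a tuple a with a[j] = i.
    """
    n = len(scores)
    return max(permutations(range(n)),
               key=lambda a: sum(scores[j][a[j]] for j in range(n)))

# Hypothetical co-occurrence values for a 3 x 3 sentence pair, indexed [j][i]
cooc = [[0.9, 0.2, 0.1],
        [0.1, 0.3, 0.8],
        [0.2, 0.7, 0.3]]
weights = {"cooc": 1.0, "pos_diff": -0.1}   # hand-set, not learned
n = len(cooc)
scores = [[link_score({"cooc": cooc[j][i], "pos_diff": abs(j - i) / n}, weights)
           for i in range(n)] for j in range(n)]
print(best_matching(scores))
```

In the discriminative setting the weights would be trained on a small hand-aligned set, and further features such as a Model 4 prediction would simply be added to the feature dictionary.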
14 Syntax-Based Approaches
- Constrain alignment space by looking beyond flat text stream: take higher-level sentence structure into account
- Representations:
  - Constituency structure
  - Inversion Transduction Grammar
  - Dependency structure
15 An MT Motivation
16 Syntax-Based Distortion [DeNero and Klein 2007]
- Syntax-based MT should start from syntax-aware word alignments
- HMM model + target-language parse trees: prefer alignments that respect tree
- Handled in distortion model: jumps should reflect tree structure
17 Syntax-Based Distortion [DeNero and Klein 2007]
- HMM distortion: size of jump between $a_{j-1}$ and $a_j$
- Syntactic distortion: tree path between $a_{j-1}$ and $a_j$
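To contrast the two notions of distance, the sketch below measures the tree path between two leaves as the number of edges climbed to their lowest common ancestor and the number descended from it. This is a simplification for illustration, not the parameterization of the cited model; the parent-pointer encoding, the node names, and `tree_path` are all assumptions.

```python
def tree_path(parent, u, v):
    """Return (up, down): edges climbed from u to the lowest common ancestor,
    and edges descended from there to v. parent maps node -> parent (root -> None)."""
    def ancestors(node):
        chain = [node]
        while parent[chain[-1]] is not None:
            chain.append(parent[chain[-1]])
        return chain

    up_chain, down_chain = ancestors(u), ancestors(v)
    on_v_path = set(down_chain)
    lca = next(node for node in up_chain if node in on_v_path)
    return up_chain.index(lca), down_chain.index(lca)

# (S (NP the cat) (VP sat (PP on (NP the mat)))) with unique node names
parent = {"S": None, "NP1": "S", "VP": "S", "the1": "NP1", "cat": "NP1",
          "sat": "VP", "PP": "VP", "on": "PP", "NP2": "PP",
          "the2": "NP2", "mat": "NP2"}
print(tree_path(parent, "the1", "cat"))   # siblings under the same NP
print(tree_path(parent, "cat", "mat"))    # crosses the top of the tree
```

Two words that are adjacent in the string can still be far apart in the tree, which is the kind of information a purely jump-based distortion model cannot see.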
18 Syntax-Based Distortion [DeNero and Klein 2007]
- Training: 100,000 parallel French-English and Chinese-English sentences with English parse trees
- Both E→F and F→E, combined with different unions and intersections, plus thresholds
- Test: Hand-aligned Hansards and NIST MT 2002 data
19 Syntax-Based Distortion [DeNero and Klein 2007]
- HMMs roughly equal, better than GIZA++
- Soft union for French; hard union for Chinese; competitive thresholding
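Combining the two directional models can be sketched with plain set operations for the hard variants and with posterior averaging for the soft one. The definition of soft union used here (average the two link posteriors, then threshold) is an assumption made for this sketch, and the posterior values are invented.

```python
def hard_union(links_ef, links_fe):
    """Keep a link if either direction proposes it."""
    return links_ef | links_fe

def hard_intersection(links_ef, links_fe):
    """Keep a link only if both directions propose it."""
    return links_ef & links_fe

def soft_union(post_ef, post_fe, threshold=0.5):
    """Keep a link if the average of the two directional posteriors
    reaches the threshold (a missing posterior counts as 0)."""
    keys = set(post_ef) | set(post_fe)
    return {k for k in keys
            if (post_ef.get(k, 0.0) + post_fe.get(k, 0.0)) / 2 >= threshold}

# Hypothetical link posteriors, keyed by (i, j)
post_ef = {(0, 0): 0.9, (1, 1): 0.6, (1, 2): 0.2}
post_fe = {(0, 0): 0.8, (1, 1): 0.3, (2, 2): 0.9}
print(sorted(soft_union(post_ef, post_fe, threshold=0.5)))
print(sorted(soft_union(post_ef, post_fe, threshold=0.4)))
```

Lowering the threshold trades precision for recall, which is one reason the best combination can differ between language pairs.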
20 Tree-to-String Models
21 Tree-to-String Models
- New generative story
- Word-level fertility and distortion replaced with node insertion and sibling reordering
- Lexical translation still the same
- Word alignment produced as a side effect from lexical translations
22 Tree-to-String Alignment [Yamada and Knight 2001]
- Discussed in other sessions this semester
- Training: 2121 short Japanese-English sentences, modified Collins parser output for English
- Test: First 50 sentences of training corpus
- Beat IBM Model 5 on human judgements; perplexity between Model 1 and Model 5
23 Subtree Cloning [Gildea 2003]
- Original tree-to-string model is too strict
  - Syntactic divergences, reordering
- Soft constraint: allow alignments that violate tree structure, but at a cost
- Tweak the tree side of the alignment to contain things needed for the string side
- Ex.: SVO to OSV
24-26 Subtree Cloning [Gildea 2003]
[Figure slides: parse-tree example of subtree cloning; surviving node labels are S, VP, AUX, VP, and the word "do"]
27 Subtree Cloning [Gildea 2003]
- For a node $n_p$:
  - Probability of cloning something as a new child of $n_p$: single EM-learned constant for all $n_p$
  - Probability of making that clone a node $n_c$: uniform over all $n_c$
- Surprising that this works
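The two-factor cloning probability is small enough to write down directly. In this sketch the constant `p_clone` is set by hand rather than learned with EM, "all $n_c$" is read as all nodes of the tree, and the function name is illustrative.

```python
def clone_prob(p_clone: float, num_nodes: int) -> float:
    """P(insert a clone under n_p) * P(the clone is one particular n_c).

    p_clone:   one constant shared by every parent node n_p (hand-set here).
    num_nodes: number of candidate nodes n_c; the choice among them is uniform.
    """
    return p_clone * (1.0 / num_nodes)

# Hypothetical values: a 20-node tree and a 10% chance of inserting any clone
print(clone_prob(p_clone=0.1, num_nodes=20))
```

Nothing in this probability depends on which subtree is cloned or where it lands, which is what makes the model's effectiveness surprising.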
28 Subtree Cloning [Gildea 2003]
- Compared with IBM 1-3, basic tree-to-string, basic tree-to-tree models
- Training: 4982 Korean-English sentence pairs, with manual Korean parse trees
- Test: 101 hand-aligned held-out sentences
29 Subtree Cloning [Gildea 2003]
- Cloning helps: as good or better than IBM
- Tree-to-tree model runs faster
30 Tree-to-Tree Models
- Alignment must conform to tree structure on both sides: space is more constrained
- Requires more transformation operations to handle divergent structures [Gildea 2003]
- Or we could be more permissive
31 Inversion Transduction Grammar [Wu 1997]
- For bilingual parsing; get one-to-one word alignment as a side effect
- Parallel binary-branching trees with reordering
32 ITG Operations
- A → [A A]
  - Produce A1 A2 in source and target streams
- A → <A A>
  - Produce A1 A2 in source stream, A2 A1 in target stream
- A → e / f
  - Produce e in source stream, f in target stream
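The three operations can be sketched as a recursive function that expands a derivation tree into its source and target streams. The tuple encoding (`"lex"`, `"straight"`, `"inverted"`), the function name `expand`, and the example sentence are assumptions for illustration.

```python
def expand(node):
    """Return (source_words, target_words) generated by an ITG derivation.

    node is ("lex", e, f), ("straight", left, right) or ("inverted", left, right).
    """
    kind = node[0]
    if kind == "lex":                       # A -> e / f
        _, e, f = node
        return [e], [f]
    _, left, right = node
    src_l, tgt_l = expand(left)
    src_r, tgt_r = expand(right)
    if kind == "straight":                  # A -> [A A]: same order in both streams
        return src_l + src_r, tgt_l + tgt_r
    if kind == "inverted":                  # A -> <A A>: order swapped in the target
        return src_l + src_r, tgt_r + tgt_l
    raise ValueError(f"unknown rule type: {kind}")

derivation = ("straight",
              ("lex", "the", "la"),
              ("inverted", ("lex", "red", "rouge"), ("lex", "car", "voiture")))
source, target = expand(derivation)
print(" ".join(source), "|", " ".join(target))
```

Each lexical rule pairs one source word with one target word, so the word alignment falls out of the derivation without any extra bookkeeping.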
33 ITG Operations
- Canonical form ITG produces only one derivation for a given alignment
- S → A | B | C
- A → [A B] | [B B] | [C B] | [A C] | [B C] | [C C]
- B → <A A> | <B A> | <C A> | <A C> | <B C> | <C C>
- C → e / f
34 Alignment with ITG [Zhang and Gildea 2004]
- Compared IBM 1, IBM 4, ITG, and tree-to-string (with and without cloning)
- Training: Chinese-English (18,773) and French-English (20,000) sentences less than 25 words long
- Test: Hand-aligned Chinese-English (48) and French-English (447)
35 Alignment with ITG [Zhang and Gildea 2004]
- ITG best, or at least as good as IBM or tree-to-string plus cloning
- ITG has no linguistic syntax
36 Dependency Parsing
- Discussed in other sessions this semester
- Notion of violating phrasal cohesion
  - Usually bad, but not always
37 Dependencies + ITG [Cherry and Lin 2006]
- Find invalid dependency spans; assign score of $-\infty$ if used by the ITG parser
- Simple model: maximize co-occurrence score with penalty for distant words
- ITG reduces AER by 13% relative; dependencies + ITG reduce by 34%
38 Dependencies + ITG [Cherry and Lin 2006]
- Discriminative training with an SVM
- Feature vector for each ITG rule instance
  - Features from Taskar et al. 2005
  - Feature marking ITG inversion rules
  - Feature (penalty) marking invalid spans based on dependency tree
39 Dependencies + ITG [Cherry and Lin 2006]
- Compared Taskar et al. to D-ITG with hard and soft constraints
- Training: 50,000 French-English sentence pairs for counts and probabilities; 100 hand-annotated pairs with derived ITG trees for discriminative training
- Test: 347 hand-annotated sentences from 2003 parallel text workshop
40 Dependencies + ITG [Cherry and Lin 2006]
- Relative improvement smaller in discriminative training scenario with stronger objective function
- Hard constraint starts to hurt recall
42 All These Tradeoffs
- Mathematical and statistical correctness vs. computability
- Simple model vs. capturing linguistic phenomena
- Not enough syntactic information vs. too much syntactic information
- Ruling out bad alignments vs. keeping good alignments around
43 Alignment Spaces
- Completely unconstrained: every alignment link $(e_i, f_j)$ either on or off
- Permutation space: one-to-one alignment with reordering [Taskar et al. 2005]
- ITG space: permutation space satisfying binary tree constraint [Wu 1997]
- Dependency space: permutation space maintaining phrasal cohesion
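The gap between permutation space and ITG space can be checked by brute force for short sentences. The sketch below tests whether a permutation can be built by recursively joining two adjacent blocks in straight or inverted order, which is one way to state the binary tree constraint; `in_itg_space` is an illustrative name and the recursion is exponential, so it is only meant for small n.

```python
from itertools import permutations

def in_itg_space(perm):
    """True if perm can be built by recursively combining two adjacent blocks
    in straight or inverted order (the binary tree constraint)."""
    n = len(perm)
    if n <= 1:
        return True
    lo, hi = min(perm), max(perm)
    for k in range(1, n):
        left = perm[:k]
        # The left block must hold the k smallest or the k largest values.
        straight = max(left) == lo + k - 1
        inverted = min(left) == hi - k + 1
        if (straight or inverted) and in_itg_space(perm[:k]) and in_itg_space(perm[k:]):
            return True
    return False

for n in range(1, 6):
    perms = list(permutations(range(n)))
    allowed = sum(in_itg_space(p) for p in perms)
    print(n, len(perms), allowed)
```

For n = 4 the check accepts 22 of the 24 permutations, rejecting only the two "inside-out" orders (1, 3, 0, 2) and (2, 0, 3, 1) in zero-based form.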
44 Alignment Spaces
- D-ITG space: Dependency ∩ ITG space [Cherry and Lin 2006]
- HD-ITG space: D-ITG space where each span must contain a head [Cherry and Lin 2006a]
45 Examining Alignment Spaces [Cherry and Lin 2006a]
- Alignment score
- Learned co-occurrence score
- Gold-standard oracle score
46 Examining Alignment Spaces [Cherry and Lin 2006a]
- Learned co-occurrence score
- More restricted spaces give better results
47 Examining Alignment Spaces [Cherry and Lin 2006a]
- Oracle score: subsets of permutation space
- ITG rules out almost nothing correct
- Beam search in dependency space does worst
48 Conclusions
- Base alignment models are mathematical, with limited notions of sentence structure
- Syntax-aware alignment helpful for syntax-aware MT [DeNero and Klein 2007]
- Using structure as a hard constraint is harmful for divergent sentences; tweaking trees [Gildea 2003] or using soft constraints [Cherry and Lin 2006] helps fix this
49 Conclusions
- Surprise winner: ITG
  - Computationally straightforward
  - Permissive, simple grammar that mostly only rules out bad alignments [Cherry and Lin 2006a]
  - Does a lot, even when it's not the best
- Discriminative framework looks promising and flexible: can incorporate generative models as features [Taskar et al. 2005]
50 Towards the Future
- Easy-to-run GIZA++ made complicated IBM models the norm; promising discriminative or syntax-based models currently lack such a toolkit
- Syntax-based discriminative techniques: morphology, POS, semantic information
- Any other ideas?
51 References
- Brown, P., J. Cocke, S. Della Pietra, V. Della Pietra, F. Jelinek, J. Lafferty, R. Mercer, and P. Roossin, "A statistical approach to machine translation," Computational Linguistics, 16(2):79-85, 1990.
- Brown, P., S. Della Pietra, V. Della Pietra, and R. Mercer, "The mathematics of statistical machine translation: Parameter estimation," Computational Linguistics, 19(2):263-311, 1993.
- Cherry, Colin and Dekang Lin, "Soft syntactic constraints for word alignment through discriminative training," Proceedings of the COLING/ACL Poster Session, 105-112, 2006.
- Cherry, Colin and Dekang Lin, "A comparison of syntactically motivated alignment spaces," Proceedings of EACL, 145-152, 2006a.
- DeNero, John and Dan Klein, "Tailoring word alignments to syntactic machine translation," Proceedings of ACL, 17-24, 2007.
- Gildea, Daniel, "Loosely tree-based alignment for machine translation," Proceedings of ACL, 80-87, 2003.
52 References
- Och, Franz and Hermann Ney, "A systematic comparison of various statistical alignment models," Computational Linguistics, 29(1):19-51, 2003.
- Taskar, B., S. Lacoste-Julien, and D. Klein, "A discriminative matching approach to word alignment," Proceedings of HLT/EMNLP, 73-80, 2005.
- Vogel, S., H. Ney, and C. Tillmann, "HMM-based word alignment in statistical translation," Proceedings of COLING, 836-841, 1996.
- Wu, Dekai, "Stochastic inversion transduction grammars and bilingual parsing of parallel corpora," Computational Linguistics, 23(3):377-403, 1997.
- Yamada, Kenji and Kevin Knight, "A syntax-based statistical translation model," Proceedings of ACL, 523-530, 2001.
- Zhang, Hao and Daniel Gildea, "Syntax-based alignment: Supervised or unsupervised?" Proceedings of COLING, 418-424, 2004.