1
Johns Hopkins 2003 Summer Workshop on Syntax and
Statistical Machine Translation, Chapters 1-4
  • Shauna Eggers

2
Outline
  • Introduction
  • Implicit Syntactic Feature Functions
  • Shallow Syntactic Feature Functions
  • Deep Syntactic Feature Functions

3
Introduction
  • Motivation
  • The Plan
  • The Baseline System

4
Motivation
Introduction
  • Statistical MT systems are the current
    state of the art, but they often make stupid
    syntax errors
  • Missing content words: "Ukraine condemns US
    interference in its internal affairs" →
    "Condemns US interference in its internal affairs"
  • Missing articles: "he is fully able to activate
    the team" → "he is fully able to activate team"
  • Incorrect dependencies: "particularly those
    players who cheat the audience" → "particularly
    those who cheat the audience players"
  • These systems are data-driven, so they reflect
    implicit syntactic properties of language through
    n-grams and alignment models
  • What if we incorporate some explicit syntactic
    knowledge?

5
The Plan
Introduction
  • Investigate the effect of integrating syntactic
    rules on the performance of an SMT system
  • Analyze errors in a baseline SMT system
  • Develop syntactically-motivated feature functions
    to target specific errors
  • Observe effect of each feature on system score
  • Hope for improved results!
  • Measuring improvement: BLEU
  • Is this always an effective/appropriate metric?

6
The Baseline System
Introduction
  • The Model
  • Feature Functions
  • Training and Test corpora
  • Results

7
The Model
Baseline System
  • Alignment template system
    (Och 2002; Och, Tillmann, and Ney 1999; Och and
    Ney 2004)
  • How it works: segment the input sentence into
    phrases, translate the phrases, and reorder them
    in the target language
  • Uses a log-linear modelling approach for direct
    translation (general form below)
  • Basic idea of the experiments: model each
    syntactic feature as a function and plug it into
    the model
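For reference, the log-linear direct translation model (standard since Och 2002) picks the hypothesis that maximizes a weighted sum of feature functions; each experiment below contributes a new h_m:

$$\hat{e} = \operatorname*{argmax}_{e} \; \sum_{m=1}^{M} \lambda_m \, h_m(e, f)$$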

8
Fig 1.1: System architecture based on log-linear
modelling
9
Alignment Templates
Baseline System
  • Used to do the phrase-based translation
  • Sentences e and f are decomposed into K phrase
    pairs, and a template z is assigned to translate
    each pair
  • Parameters:
  • Segmentation points in both e and f
  • Search for the optimal segmentation is included in
    the Global Search component of the model (Fig 1.1)
  • Set of templates 1 through K: z_1^K
  • Permutation of templates 1 through K: π_1^K
  • This parameter allows for reordering of phrases
  • The z_1^K and π_1^K parameters are added as hidden
    variables to the model

10
Fig 1.2: Sample segmentation of e and f and
translation into alignment templates
11
Fig 1.3: Dependencies in the alignment template
model
12
Feature Functions
Baseline System
  • Alignment template selection
  • Word selection
  • Phrase alignment
  • Language Model features
  • Word/Phrase penalty
  • Phrases from conventional lexicon
  • Additional features

13
Alignment template selection
Baseline System
  • Product of alignment template probabilities
  • Feature function: equation on the original slide
    (a reconstruction follows below)
  • Notice there are no insertions or deletions at the
    phrase level, just permutations
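A plausible reconstruction of this feature, following Och and Ney (2004), where f̃_{π_k} denotes the source phrase covered by the k-th template in the permutation:

$$h_{\text{AT}}(e, f; \pi_1^K, z_1^K) = \log \prod_{k=1}^{K} p\bigl(z_k \mid \tilde{f}_{\pi_k}\bigr)$$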

14
Word selection
Baseline System
  • Product of word translation probabilities
  • Feature function
  • Notice i and j to include dependence on word
    positions (ubermorgen -gt the day after tomorrow
    should be weighted higher than ubermorgen -gt
    after the day tomorrow)
  • Ei is word class for word ei
  • A is matrix of word alignments Ap1Kz1K
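A plausible reconstruction, again after Och and Ney (2004): each target word is scored against the source words the alignment matrix links it to, conditioned on its word class:

$$h_{\text{WRD}}(e, f; A) = \log \prod_{i=1}^{I} p\bigl(e_i \mid \{ f_j : (i, j) \in A \}, E_i\bigr)$$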

15
Phrase Alignment
Baseline System
  • Feature function for phrase alignment (a sketch
    follows below)
  • Sums, over alignment templates that are
    consecutive in the target, their distance in the
    source
  • Measures the non-monotonicity of phrases
  • Takes into account that monotone alignments are
    very often the correct alignments
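A minimal sketch of one plausible way to compute such a non-monotonicity score (the report's exact formula may differ); `source_start` is a hypothetical list giving, for each phrase in target order, its start position in the source:

```python
def phrase_alignment_score(source_start):
    """Sum of jump distances between the source positions of
    phrases taken in target order; 0 for a monotone alignment."""
    score = 0
    prev_end = 0  # source position just after the previous phrase
    for start in source_start:
        score += abs(start - prev_end)
        prev_end = start + 1  # simplification: treat phrases as length 1
    return score

# A monotone order costs nothing; a reordering is penalized:
assert phrase_alignment_score([0, 1, 2]) == 0
assert phrase_alignment_score([2, 0, 1]) == 5
```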

16
Language Model features
Baseline System
  • A standard word-based trigram is used for the
    language model feature (shown below)
  • The report mentions a total of four language
    models; don't know what the other three are (or
    are they just four variations on the trigram?)
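The standard trigram feature, for reference:

$$h_{\text{LM}}(e) = \log \prod_{i=1}^{I} p(e_i \mid e_{i-2}, e_{i-1})$$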

17
Word-phrase penalty
Baseline System
  • Number of produced words, i.e., the length of the
    target sentence
  • Number of produced phrases
  • Can be arranged to prefer long or short phrases
    (I imagine this means a smaller K for longer
    phrases, and a larger K for shorter phrases...?)
    (see the simple forms below)

18
Additional Features
Baseline System
  • Phrases from a conventional lexicon:
  • Entries in the Chinese-English lexicon provided
    by the Linguistic Data Consortium can serve as
    potential phrase translation pairs in the
    alignment template system
  • A feature function is included that counts the
    number of times each lexical entry is used in
    training
  • The model allows further addition of any number of
    feature functions, for example other syntactic
    features (number of verb arguments), semantic
    features, or pragmatic features

19
Training and Test corpora
Baseline System
  • Three corpora:
  • training corpus (train): 170M English words
  • development corpus (dev): 993 sentences (25K
    words) in both languages, plus 5765 sentences
    (175K words) for use in post-workshop experiments
  • test corpus (test)
  • And an unseen test corpus (blind-test), for
    experiments on completely unseen data

20
Preprocessing
Baseline System
  • Some additional tweaking was needed to get the
    system ready to roll
  • Segmentation and POS tagging: standard tools
    distributed by the LDC
  • Slightly different tag sets are appropriate for
    English and Chinese data (no NN vs. NNS
    distinction in Chinese, no M for measure words in
    English)
  • Parsing: Collins (1999) for English, Bikel (2002)
    for Chinese
  • Chunking: the fnTBL chunker
  • Case issues: an HMM to insert upper case back into
    the baseline system output
  • Tokenization issues: normalize hyphenation and
    other formatting for I/O into the various systems

21
The Baseline Result
Baseline System
  • BLEU score: 31.6
  • This is the score that every experiment will be
    compared against
  • SPOILER! You're not going to see anything much
    different from this...
  • (But hold on anyway... here we go!)

22
Outline
  • Introduction
  • Implicit Syntactic Feature Functions
  • Shallow Syntactic Feature Functions
  • Deep Syntactic Feature Functions

23
Implicit Syntactic Feature Functions
  • A Trio for Punctuation
  • Specific Word Penalty
  • Model 1 Score
  • Missing Content Words
  • Multi-Sequence Alignment (MSA) of Hypotheses

24
Punctuation
Implicit Functions
  • Problem: Ungrammatical punctuation in hypotheses
    affects the syntactic quality of the output
  • Idea 1: Count of unmatched or empty parens and
    quotes (a sketch follows below)
  • The feature function penalizes ungrammatical
    punctuation
  • Idea 2: Percent overlap between punctuation groups
    in e and f
  • penalizes word movement around punctuation
  • penalizes punctuation deletion
  • Idea 3: Add hypotheses to correct bad
    punctuation
  • delete unaligned parens and quotes
  • insert an opening paren/quote before the first
    word aligned to the first Chinese word inside the
    parens
  • insert a closing paren/quote after the last word
    aligned to the last Chinese word inside the parens
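A minimal sketch of Idea 1, assuming tokenized hypotheses and covering only parentheses and plain double quotes (the workshop feature may have handled more pair types):

```python
def bad_punct_count(tokens):
    """Count unmatched or empty paren/quote pairs (Idea 1)."""
    count = 0
    open_parens = 0
    quote_open = False
    prev = None
    for tok in tokens:
        if tok == "(":
            open_parens += 1
        elif tok == ")":
            if open_parens > 0:
                open_parens -= 1
                if prev == "(":      # empty pair "( )"
                    count += 1
            else:
                count += 1           # unmatched closing paren
        elif tok == '"':
            quote_open = not quote_open
            if prev == '"':          # empty quote pair
                count += 1
        prev = tok
    return count + open_parens + int(quote_open)
```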

25
Punctuation - Results
  • BLEU score: no statistically significant
    improvement
  • Ideas 1 and 2 are restricted in their application
  • They have little discriminating power when most of
    the hypotheses for a Chinese sentence make similar
    punctuation mistakes
  • Idea 2 doesn't work when the punctuation deletion
    is at the borders of the sentence, or next to
    another punctuation mark
  • Idea 3 hypotheses apparently make only trivial
    changes to the feature function values (and hence
    the n-best score)
  • Conclusion: Punctuation soundness has little
    influence on BLEU

26
Specific Word Penalty
Implicit Functions
  • Problem: Errant (i.e., wrongly placed, inserted,
    or deleted) content words
  • Idea: Use counts of the ten most common
    non-content words as feature functions
  • Individually: 10 counts → 10 feature functions
  • Combined into one count, to avoid overfitting
  • Results, compared to the 31.6 baseline:
  • Using individual counts as features: 31.1
  • Combined into one feature value: 31.7
  • Conclusion: BLEU drops with these features
  • But! They did find that "that" and "a" were more
    systematically mistranslated than others. Maybe
    further experiments can be done on a larger list
    of non-content words

27
Model 1 Score
Implicit Functions
  • Idea: Use IBM Model 1 for two feature functions
  • Model 1 gives the sum of all possible alignment
    probabilities (a sketch follows below)
  • Feature functions: p(f|e) and p(e|f)
  • Trained with a subset of the training corpus for
    the baseline system: 30M English words
  • Smoothing constant t(f_j | e_i) = 10^-40 used for
    unknown words
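A minimal sketch of the Model 1 score, assuming a lexical table `t` mapping (target word, source word) pairs to probabilities; the sum over all alignments factorizes into a product over target positions, a standard Model 1 identity:

```python
import math

def model1_logprob(src_words, tgt_words, t, floor=1e-40):
    """log p(tgt | src) under IBM Model 1: for each target word,
    average its translation probabilities over the source words
    plus the NULL word, then take the product over target words.
    Unseen pairs get the smoothing floor (10^-40 in the report)."""
    src = ["NULL"] + list(src_words)
    logp = 0.0
    for f in tgt_words:
        logp += math.log(sum(t.get((f, e), floor) for e in src) / len(src))
    return logp
```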

28
Model 1 - Results
  • Compared to the 31.6 baseline:
  • p(f|e) yields 32.5 on average, p(e|f) 30.6
  • One of the best-performing features in the
    workshop
  • Breakdown for different training sizes:
  • (numbers for p(e|f) are a little jumpy; may be a
    bug in the eval script)

29
Missing Content Words
Implicit Functions
  • Problem: Those missing content words are really
    annoying.
  • Sentences missing content words can have an
    overall higher probability ranking than those with
    the correct content words
  • Idea: Implement a feature function that counts the
    number of content words missing in a candidate
    translation (a sketch follows below)
  • Results: 31.9 BLEU score
  • 0.3 improvement over the 31.6 baseline
  • A comparatively large improvement, yet not
    statistically significant
  • Discussion: The BLEU score is not significantly
    better, but on manual inspection the adequacy of
    the resulting sentences is much better. Perhaps
    BLEU is not the best metric for evaluating this
    feature function.
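A minimal sketch of such a counter, assuming access to the hypothesis word alignment and a language-appropriate non-content word list (both the list and the alignment interface here are illustrative):

```python
# Illustrative non-content list; the workshop's actual split of
# content vs. non-content words is not given in these chapters.
NON_CONTENT = {"the", "a", "an", "of", "to", "in", "and", "is", "that"}

def missing_content_words(src_words, alignment):
    """Count source content words with no aligned target word.
    `alignment` is a set of (src_pos, tgt_pos) links."""
    aligned_src = {i for (i, _) in alignment}
    return sum(1 for i, w in enumerate(src_words)
               if w not in NON_CONTENT and i not in aligned_src)
```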

30
MSA for Hypotheses
Implicit Functions
  • Problem: No real range of diversity in the
    translation sentences
  • Idea: Use Multi-Sequence Alignment (MSA) lattices
    to recombine subparts of existing hypotheses into
    new ones. Three features:
  • Path weight of each hypothesis
  • Arc weight: the number of hypotheses that agree
    with that arc
  • Binary feature: does the arc represent the
    majority of hypotheses?
  • Number of arcs on which a hypothesis agreed with
    the consensus path

31
MSA - Results
  • BLEU scores: Meh.
  • Conclusion: SMT is not constrained enough to be a
    very good fit for MSA

32
Implicit Syntax Results
33
Outline
  • Introduction
  • Implicit Syntactic Feature Functions
  • Shallow Syntactic Feature Functions
  • Deep Syntactic Feature Functions

34
Shallow Syntactic Feature Functions
  • Overview
  • Part-Of-Speech and Chunk Tag Counts
  • Tag Fertility Models
  • Projected POS Language Model
  • Aligned POS-Tag Sequences

35
Overview
Shallow Functions
  • Shallow features depend on POS tagging or
    chunking
  • Motivations:
  • Overcome data sparseness
  • Generalize from the behavior of words to the
    behavior of tags and chunks
  • Make stronger generalizations about syntactic
    behavior than what is observed in the training
    corpus
  • Disadvantages: it may not be possible to capture
    more info than is already implicitly modeled in
    the baseline
  • POS and baseline systems are trained on the same
    input
  • Chunker output is not at a much higher granularity
    than the Alignment Templates
  • Advantages:
  • Efficiency of POS tagging and chunking systems
  • Decisions are local, so better for noisy
    hypotheses
  • Lots of available input data (1.3M tagged
    parallel sentences available for training)
  • Simpler models allow quicker reaction to
    problems and contrastive error analysis

36
Part-Of-Speech and Chunk Tag Counts
Shallow Functions
  • Problem: the baseline is systematically under- and
    over-generating certain POS and chunk types
  • Idea: Favor sentences with more or fewer of
    certain tags (depending on under- or
    over-generation). For example:
  • Number of NPs in English
  • Difference in the number of NPs from Chinese to
    English
  • Number of Chinese N tags translated only to non-N
    tags in English
  • Results: Meh.
  • Conclusions:
  • Individual tag-count features are probably already
    encoded in the trigram models
  • Combined tag-count features do better, maybe
    because they counteract biases in more
    sophisticated features

37
POS and Chunk counts - Results
38
Tag Fertility Models
Shallow Functions
  • Problem: Tag distribution again
  • Idea: Model the expected English tag distribution,
    with and without given Chinese tags
  • A single feature consisting of a product of
    various probability distributions for English tags
    (one plausible combined form is sketched below)
  • Some bag-of-tags models, e.g., P(N_e = 2), the
    probability that the English sentence contains
    exactly two N tags
  • Some conditional, given Chinese tags, e.g.,
    P(NP_e = 2 | NP_f = 1)
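The combined feature might then look as follows; this is a hedged reconstruction from the bullets above, not the report's notation, with N_T(e) the count of tag T in e and n_T, m_T the observed English and Chinese counts:

$$h_{\text{fert}}(e, f) = \log \prod_{T} P\bigl(N_T(e) = n_T \mid N_T(f) = m_T\bigr)$$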

39
Tag Fertility - Results
  • Not as good as hoped
  • Discussion:
  • Parameter estimation was rather simplistic:
    obviously-related probabilities (shown on the
    original slide) were calculated independently
  • Fewer free parameters might be tried

40
Projected POS Language Model
Shallow Functions
  • Problem: The word movement model in the baseline
    system is pretty weak.
  • Idea: Since Chinese words are too sparse to
    model movement, use POS instead (a sketch follows
    below)
  • Use the word alignment to project Chinese POS
    sequences into English
  • Similar to the HMM alignment model of Vogel, Ney,
    and Tillmann (1996), but with POS tags instead of
    words
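A minimal sketch of the projection step; `alignment[j]` is assumed to give the source position linked to target position j (or None if unaligned), and `tag_lm` to be any n-gram scorer over tag sequences exposing a `logprob()` method (both interfaces are hypothetical):

```python
def projected_pos_logprob(src_tags, alignment, tag_lm):
    """Project source POS tags into target word order through
    the word alignment, then score the projected tag sequence
    with an n-gram model over tags."""
    projected = [src_tags[i] if i is not None else "NONE"
                 for i in alignment]
    return tag_lm.logprob(projected)
```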

41
Projected POS - Results
  • This is a little better...
  • Conclusions:
  • Results are better simply because of the weak
    movement handling in the baseline
  • Strongest-performing of the shallow features
  • Should be investigated further; indicates a
    possible move from purely word-based models to
    ones based on shallow syntax

42
Aligned POS-Tag Sequences
Shallow Functions
  • Problem: Alignments in the baseline are computed
    at the word level; however, lexical item
    distribution is always sparse
  • Idea: Use POS tag sequence alignments instead
  • Replace the words in the alignment templates with
    POS tags, and use the following alignment models,
    where s_f and s_e are aligned source and target
    tag sequences:
  • Unigram: p(f, e) = ∏ p(s_f, s_e)
  • Conditional: p(e, f) = p(f) · ∏ p(s_e | s_f)

43
Aligned POS - Results
  • Unigram model: average 31.6
  • Conditional model: average 31.4
  • Conclusion: Maybe more input is needed for
    training the models
  • The baseline system does not output alignment
    information for words translated by rules, so
    these particular alignments cannot be recovered
  • Performance of these feature functions may improve
    if the baseline system can be reconfigured to
    output more alignments

44
Shallow Syntax Results
45
Outline
  • Introduction
  • Implicit Syntactic Feature Functions
  • Shallow Syntactic Feature Functions
  • Deep Syntactic Feature Functions

46
Deep Syntactic Feature Functions
  • Grammaticality Test of English Parser
  • Parser Probability
  • Parser Probability / Unigram LM Scores
  • Syntax-based Translation Models
  • Tree to String
  • Tree to Tree Alignment
  • Dependency Tree-to-Tree Alignments

47
Overview
  • Deep syntactic features depend on parser output
  • Grammaticality is measured by parse trees
  • How to use parser output:
  • simple features
  • model-based features
  • dependency-based features
  • other complex features
  • tricky features (Chapter 5, Ethan)

48
Grammaticality Test of English Parser
Deep Functions
  • Idea: Grammatical sentences should have a higher
    parse probability
  • Try the parse probability of the sentence by
    itself
  • Try the parse probability of the sentence divided
    by the unigram probability of the words in the
    sentence
  • Result: Worse than baseline! Guess these
    probabilities are not really related...

49
Tree to String Model
Deep Functions
  • Idea: Incorporate the syntax-based Tree-to-String
    model as a feature function (Yamada and Knight
    2001, 2002); a reconstruction of the feature
    follows below
  • θ is the set of reordering, insertion, and
    leaf-word translation operations
  • Results: Average 31.7 BLEU
  • Conclusions:
  • The results are not bad, but this is
    computationally very expensive! The expense makes
    it impractical for this model.
  • Try reducing the cost by fragmenting long
    sentences with a tool called "machete"; kinks are
    still being worked out of this tool, but it may be
    promising
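A plausible reconstruction of the feature, following Yamada and Knight (2001): the tree-to-string model sums, over all derivations θ (sequences of the operations above), the probability of producing the Chinese string from the English parse tree:

$$h_{\text{TTS}}(e, f) = \log \sum_{\theta} P\bigl(f, \theta \mid \text{tree}(e)\bigr)$$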

50
Tree to Tree Alignment
Deep Functions
  • Idea: Use (Gildea 2003) tree alignment
    probabilities as a feature function
  • Remember, Gildea's model includes cloning, and
    many-to-one and one-to-many node mappings
  • Experiment:
  • Lexical translation probabilities for leaf nodes
    were trained using IBM Model 1
  • Some tweaks for performance: max fan-out of 6,
    max sentence length of 60
  • Results: 31.6 BLEU

51
Dependency Tree-to-Tree Alignments
Deep Functions
  • Idea: Try a gaggle of dependency-derived
    features (listed in the results table, next slide)
  • By representing relationships between words,
    dependency trees for the source and target
    sentences supposedly have less conflicting
    structure than constituency trees
  • Results: Not much different from baseline
  • Conclusion: Much of the lack of gain for this
    approach is probably accounted for by errors in
    the parsing tools. Fixing these errors would
    likely improve the results from this feature.

52
Dependency Tree Alignments - Results