Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System


Transcript and Presenter's Notes

Title: Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System


1
Rapid Prototyping of a Transfer-based
Hebrew-to-English Machine Translation System
  • Alon Lavie
  • Language Technologies Institute
  • Carnegie Mellon University
  • Joint work with
  • Shuly Wintner, Danny Shacham, Nurit Melnik,
    Yuval Krymolowski - University of Haifa
  • Erik Peterson - Carnegie Mellon University

2
Outline
  • Context of this Work
  • CMU Statistical Transfer MT Framework
  • Hebrew and its Challenges for MT
  • Hebrew-to-English System
  • Morphological Analysis and Generation
  • MT Resources: lexicon and grammar
  • Translation Examples
  • Performance Evaluation
  • Conclusions, Current and Future Work

3
Current State-of-the-art in Machine Translation
  • MT underwent a major paradigm shift over the past
    15 years
  • From manually crafted rule-based systems with
    manually designed knowledge resources
  • To search-based approaches founded on automatic
    extraction of translation models/units from large
    sentence-parallel corpora
  • Current Dominant Approach: Phrase-based
    Statistical MT
  • Extract and statistically model large volumes of
    phrase-to-phrase correspondences from
    automatically word-aligned parallel corpora
  • Decode new input by searching for the most
    likely sequence of phrase matches, using a
    statistical Language Model for the target language

4
Current State-of-the-art in Machine Translation
  • Phrase-based MT: State-of-the-art
  • Requires minimally several million words of
    parallel text for adequate training
  • Limited to language-pairs for which such data
    exists: major European languages, Chinese,
    Japanese, a few others
  • Linguistically shallow and highly lexicalized
    models result in weak generalization
  • Best performance levels (BLEU ~0.6) on
    Arabic-to-English provide understandable but
    often still somewhat disfluent translations
  • Ill-suited for Hebrew and most of the world's
    minor languages

5
CMUs Statistical-Transfer (XFER) Approach
  • Framework: Statistical search-based approach with
    syntactic translation transfer rules that can be
    acquired from data but also developed and
    extended by experts
  • Elicitation: use bilingual native informants to
    produce a small high-quality word-aligned
    bilingual corpus of translated phrases and
    sentences
  • Transfer-rule Learning: apply ML-based methods to
    automatically acquire syntactic transfer rules
    for translation between the two languages
  • XFER Decoder:
  • XFER engine produces a lattice of possible
    transferred structures at all levels
  • Decoder searches and selects the best-scoring
    combination
  • Rule Refinement: refine the acquired rules via a
    process of interaction with bilingual informants
  • Word and Phrase bilingual lexicon acquisition

6
(No Transcript)
7
Transfer Rule Formalism
{NP,NP}  SL: the old man  TL: ha-ish ha-zaqen
NP::NP [DET ADJ N] -> [DET N DET ADJ]
(
 (X1::Y1) (X1::Y3) (X2::Y4) (X3::Y2)
 ((X1 AGR) = 3-SING)
 ((X1 DEF) = DEF)
 ((X3 AGR) = 3-SING)
 ((X3 COUNT) = +)
 ((Y1 DEF) = DEF)
 ((Y3 DEF) = DEF)
 ((Y2 AGR) = 3-SING)
 ((Y2 GENDER) = (Y4 GENDER))
)
  • Type information
  • Part-of-speech/constituent information
  • Alignments
  • x-side constraints
  • y-side constraints
  • xy-constraints, e.g. ((Y1 AGR) = (X1 AGR))
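To make these pieces concrete, here is one purely illustrative way to hold such a rule as data and to test a subset of its x-side constraints against feature structures; the real transfer engine has its own rule representation and unification machinery.

# Illustrative sketch: the {NP,NP} rule above as a plain data structure,
# with a helper that checks a few of its x-side constraints.
rule = {
    "type": ("NP", "NP"),                            # SL type :: TL type
    "x_side": ["DET", "ADJ", "N"],                   # source constituent sequence
    "y_side": ["DET", "N", "DET", "ADJ"],            # target constituent sequence
    "alignments": [(1, 1), (1, 3), (2, 4), (3, 2)],  # Xi -> Yj
    "x_constraints": [(("X1", "AGR"), "3-SING"),
                      (("X1", "DEF"), "DEF"),
                      (("X3", "AGR"), "3-SING")],
}

def x_constraints_hold(rule, x_feats):
    """x_feats: dict like {"X1": {"AGR": "3-SING", "DEF": "DEF"}, ...}."""
    return all(x_feats.get(var, {}).get(feat) == value
               for (var, feat), value in rule["x_constraints"])

print(x_constraints_hold(rule, {"X1": {"AGR": "3-SING", "DEF": "DEF"},
                                "X3": {"AGR": "3-SING"}}))   # True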

8
The Transfer Engine
  • Main algorithm: chart-style bottom-up integrated
    parsing + transfer with beam pruning
  • Seeded by word-to-word translations
  • Driven by transfer rules
  • Generates a lattice of transferred translation
    segments at all levels
  • Some Unique Features
  • Works with either learned or manually-developed
    transfer grammars
  • Handles rules with or without unification
    constraints
  • Supports interfacing with servers for
    morphological analysis and generation
  • Can handle ambiguous source-word analyses and/or
    SL segmentations represented in the form of
    lattice structures
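The following toy sketch shows the general chart idea only, not the actual engine: the chart is seeded with word-to-word translations and rules combine adjacent spans bottom-up, with every arc kept as part of the lattice. There is no beam pruning, unification, or reordering here, and the lexicon entries and rule are hypothetical.

# Toy sketch of chart-style bottom-up parsing+transfer over an input word sequence.
from collections import defaultdict

lexicon = {"H": [("DET", "THE")], "ILD": [("N", "BOY")]}      # hypothetical entries
rules = [("NP", ("DET", "N"), lambda det, n: f"{det} {n}")]   # NP -> DET N, same order

def transfer_lattice(words):
    chart = defaultdict(list)                 # (start, end) -> [(label, translation)]
    for i, w in enumerate(words):             # seed with word-to-word translations
        for label, trans in lexicon.get(w, []):
            chart[(i, i + 1)].append((label, trans))
    for span in range(2, len(words) + 1):     # combine adjacent spans bottom-up
        for start in range(len(words) - span + 1):
            end = start + span
            for mid in range(start + 1, end):
                for lhs, (a, b), build in rules:
                    for la, ta in chart[(start, mid)]:
                        for lb, tb in chart[(mid, end)]:
                            if (la, lb) == (a, b):
                                chart[(start, end)].append((lhs, build(ta, tb)))
    return chart

print(dict(transfer_lattice(["H", "ILD"])))   # includes the NP arc "THE BOY" over span (0, 2)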

9
XFER Output Lattice
(28 28 "AND" -5.6988 "W" "(CONJ,0 'AND')") (29 29
"SINCE" -8.20817 "MAZ " "(ADVP,0 (ADV,5 'SINCE'))
") (29 29 "SINCE THEN" -12.0165 "MAZ " "(ADVP,0
(ADV,6 'SINCE THEN')) ") (29 29 "EVER SINCE"
-12.5564 "MAZ " "(ADVP,0 (ADV,4 'EVER SINCE'))
") (30 30 "WORKED" -10.9913 "BD " "(VERB,0 (V,11
'WORKED')) ") (30 30 "FUNCTIONED" -16.0023 "BD "
"(VERB,0 (V,10 'FUNCTIONED')) ") (30 30
"WORSHIPPED" -17.3393 "BD " "(VERB,0 (V,12
'WORSHIPPED')) ") (30 30 "SERVED" -11.5161 "BD "
"(VERB,0 (V,14 'SERVED')) ") (30 30 "SLAVE"
-13.9523 "BD " "(NP0,0 (N,34 'SLAVE')) ") (30 30
"BONDSMAN" -18.0325 "BD " "(NP0,0 (N,36
'BONDSMAN')) ") (30 30 "A SLAVE" -16.8671 "BD "
"(NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NP0,0
(N,34 'SLAVE')) ) ) ) ") (30 30 "A BONDSMAN"
-21.0649 "BD " "(NP,1 (LITERAL 'A') (NP2,0
(NP1,0 (NP0,0 (N,36 'BONDSMAN')) ) ) ) ")
10
The Lattice Decoder
  • Simple Stack Decoder, similar in principle to
    simple Statistical MT decoders
  • Searches for best-scoring path of non-overlapping
    lattice arcs
  • No reordering during decoding
  • Scoring based on log-linear combination of
    scoring components, with weights trained using
    MERT
  • Scoring components
  • Statistical Language Model
  • Fragmentation: how many arcs are needed to cover
    the entire translation?
  • Length Penalty
  • Rule Scores
  • Lexical Probabilities (not fully integrated)
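A minimal sketch of the log-linear path score follows, assuming each arc carries its target words and a rule score; the feature definitions and weights below are made-up stand-ins for the MERT-trained ones.

# Sketch: log-linear score of one path of non-overlapping lattice arcs.
weights = {"lm": 1.0, "frag": -0.5, "length": 0.1, "rule": 0.8}   # hypothetical weights

def lm_logprob(words):
    return -2.0 * len(words)            # stand-in for a real n-gram language model

def path_score(arcs, source_len):
    """arcs: list of dicts, each with 'words' (target words) and 'rule_score'."""
    words = [w for arc in arcs for w in arc["words"]]
    features = {
        "lm": lm_logprob(words),
        "frag": len(arcs) / source_len,              # fragmentation: arcs per source word
        "length": len(words),                        # length penalty / bonus
        "rule": sum(arc["rule_score"] for arc in arcs),
    }
    return sum(weights[name] * value for name, value in features.items())

arcs = [{"words": ["THE", "GOVERNMENT"], "rule_score": -3.2},
        {"words": ["ANNOUNCED"], "rule_score": -1.1}]
print(path_score(arcs, source_len=3))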

11
XFER Lattice Decoder
0 0 ON THE FOURTH DAY THE LION ATE THE RABBIT TO A MORNING MEAL
Overall: -8.18323, Prob: -94.382, Rules: 0, Frag: 0.153846, Length: 0, Words: 13,13
235 < 0 8 -19.7602 B H IWM RBI&I
  (PP,0 (PREP,3 'ON') (NP,2 (LITERAL 'THE') (NP2,0 (NP1,1 (ADJ,2 (QUANT,0 'FOURTH')) (NP1,0 (NP0,1 (N,6 'DAY'))))))) >
918 < 8 14 -46.2973 H ARIH AKL AT H $PN
  (S,2 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,17 'LION'))))) (VERB,0 (V,0 'ATE')) (NP,100 (NP,2 (LITERAL 'THE') (NP2,0 (NP1,0 (NP0,1 (N,24 'RABBIT'))))))) >
584 < 14 17 -30.6607 L ARWXH BWQR
  (PP,0 (PREP,6 'TO') (NP,1 (LITERAL 'A') (NP2,0 (NP1,0 (NNP,3 (NP0,0 (N,32 'MORNING')) (NP0,0 (N,27 'MEAL'))))))) >
12
XFER MT Prototypes
  • General XFER framework under development for past
    five years
  • Prototype systems so far
  • German-to-English
  • Dutch-to-English
  • Chinese-to-English
  • Hindi-to-English
  • Hebrew-to-English
  • In progress or planned
  • Mapudungun-to-Spanish
  • Quechua-to-Spanish
  • Brazilian Portuguese-to-English
  • Native-Brazilian languages to Brazilian
    Portuguese
  • Hebrew-to-Arabic

13
Challenges for Hebrew MT
  • Paucity of existing language resources for Hebrew
  • No publicly available broad-coverage
    morphological analyzer
  • No publicly available bilingual lexicons or
    dictionaries
  • No POS-tagged corpus or parse tree-bank corpus
    for Hebrew
  • No large Hebrew/English parallel corpus
  • Scenario well suited for CMU transfer-based MT
    framework for languages with limited resources

14
Modern Hebrew Spelling
  • Two main spelling variants
  • KTIV XASER ("deficient" spelling): words are
    written with the vowel diacritics, and become
    consonant-only when the diacritics are removed
  • KTIV MALEH ("full" spelling): words with I/O/U
    vowels are written with long vowels, which
    include a letter
  • KTIV MALEH is predominant, but not strictly
    adhered to even in newspapers and official
    publications → inconsistent spelling
  • Example:
  • niqud (spelling): NIQWD, NQWD, NQD
  • When written as NQD, could also be niqed, naqed,
    nuqad

15
Morphological Analyzer
  • We use a publicly available morphological
    analyzer distributed by the Technion's Knowledge
    Center, adapted for our system
  • Coverage is reasonable (for nouns, verbs and
    adjectives)
  • Produces all analyses or a disambiguated analysis
    for each word
  • Output format includes lexeme (base form), POS,
    morphological features
  • Output was adapted to our representation needs
    (POS and feature mappings)
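The adaptation step amounts to mapping the analyzer's tags and feature names onto the representation the transfer engine expects. Here is a small sketch of that idea; the analyzer-side tag names are invented for illustration and are not the Technion analyzer's actual tag set.

# Sketch: map analyzer output (hypothetical tag names) to the system's representation.
POS_MAP = {"noun": "N", "verb": "V", "adjective": "ADJ", "preposition": "PREP"}
FEAT_MAP = {"gender": "GEN", "number": "NUM", "status": "STATUS"}

def adapt_analysis(analysis):
    """analysis: one reading from the analyzer (lexeme, POS, features)."""
    adapted = {"LEX": analysis["lexeme"], "POS": POS_MAP[analysis["pos"]]}
    for feat, value in analysis.get("features", {}).items():
        if feat in FEAT_MAP:
            adapted[FEAT_MAP[feat]] = value.upper()
    return adapted

print(adapt_analysis({"lexeme": "B$WRH", "pos": "noun",
                      "features": {"gender": "f", "number": "s"}}))
# {'LEX': 'B$WRH', 'POS': 'N', 'GEN': 'F', 'NUM': 'S'}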

16
Morphology Example
  • Input word: B$WRH
  • 0    1    2    3    4
  • --------B$WRH--------
  • --B-----$WR------H---
  • --B---H------$WRH----

17
Morphology Example
  • Y0 ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS N) (GEN F) (NUM S) (STATUS ABSOLUTE))
  • Y1 ((SPANSTART 0) (SPANEND 2) (LEX B) (POS PREP))
  • Y2 ((SPANSTART 1) (SPANEND 3) (LEX $WR) (POS N) (GEN M) (NUM S) (STATUS ABSOLUTE))
  • Y3 ((SPANSTART 3) (SPANEND 4) (LEX $LH) (POS POSS))
  • Y4 ((SPANSTART 0) (SPANEND 1) (LEX B) (POS PREP))
  • Y5 ((SPANSTART 1) (SPANEND 2) (LEX H) (POS DET))
  • Y6 ((SPANSTART 2) (SPANEND 4) (LEX $WRH) (POS N) (GEN F) (NUM S))
  • Y7 ((SPANSTART 0) (SPANEND 4) (LEX B$WRH) (POS LEX))
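These analyses form a lattice over positions 0-4. As an illustration only (the data structures are assumed, not the system's internal format), the sketch below stores the spans and enumerates the complete segmentations of the word; the duplicate whole-word reading Y7 is left out for brevity.

# Sketch: the morphological analyses as lattice arcs keyed by span, plus a helper
# that enumerates all complete segmentations from position 0 to 4.
analyses = [
    {"lex": "B$WRH", "pos": "N",    "start": 0, "end": 4},
    {"lex": "B",     "pos": "PREP", "start": 0, "end": 2},
    {"lex": "$WR",   "pos": "N",    "start": 1, "end": 3},
    {"lex": "$LH",   "pos": "POSS", "start": 3, "end": 4},
    {"lex": "B",     "pos": "PREP", "start": 0, "end": 1},
    {"lex": "H",     "pos": "DET",  "start": 1, "end": 2},
    {"lex": "$WRH",  "pos": "N",    "start": 2, "end": 4},
]

def segmentations(pos, end=4):
    """All ways to cover positions [pos, end) with analysis arcs laid end to end."""
    if pos == end:
        return [[]]
    paths = []
    for a in analyses:
        if a["start"] == pos:
            for rest in segmentations(a["end"], end):
                paths.append([a["lex"]] + rest)
    return paths

for path in segmentations(0):
    print(" + ".join(path))
# B$WRH
# B + $WRH
# B + $WR + $LH
# B + H + $WRH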

18
Translation Lexicon
  • Constructed our own Hebrew-to-English lexicon,
    based primarily on existing Dahan H-to-E and
    E-to-H dictionary made available to us, augmented
    by other public sources
  • Coverage is not great but not bad as a start
  • Dahan H-to-E is about 15K translation pairs
  • Dahan E-to-H is about 7K translation pairs
  • Base forms, POS information on both sides
  • Converted Dahan into our representation, added
    entries for missing closed-class entries
    (pronouns, prepositions, etc.)
  • Had to deal with spelling conventions
  • Recently augmented with 50K translation pairs
    extracted from Wikipedia (mostly proper names and
    named entities)
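The conversion step is essentially a matter of normalizing each dictionary entry into a word-to-word translation pair with base forms and POS on both sides. A tiny sketch of that idea follows; the field names and entries are illustrative, not the Dahan dictionary's actual format.

# Sketch: normalize dictionary entries into simple bilingual lexicon entries.
raw_entries = [
    {"he": "B$WRH", "he_pos": "N", "en": "tidings", "en_pos": "N"},
    {"he": "$MLH",  "he_pos": "N", "en": "dress",   "en_pos": "N"},
]

def to_lexicon(entries):
    return [{"SL_LEX": e["he"], "SL_POS": e["he_pos"],
             "TL_LEX": e["en"].upper(), "TL_POS": e["en_pos"]} for e in entries]

for entry in to_lexicon(raw_entries):
    print(entry)
# {'SL_LEX': 'B$WRH', 'SL_POS': 'N', 'TL_LEX': 'TIDINGS', 'TL_POS': 'N'}
# {'SL_LEX': '$MLH', 'SL_POS': 'N', 'TL_LEX': 'DRESS', 'TL_POS': 'N'}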

19
Manual Transfer Grammar (human-developed)
  • Initially developed by Alon in a couple of days,
    extended and revised by Nurit over time
  • Current grammar has 36 rules
  • 21 NP rules
  • one PP rule
  • 6 verb complexes and VP rules
  • 8 higher-phrase and sentence-level rules
  • Captures the most common (mostly local)
    structural differences between Hebrew and English

20
Transfer GrammarExample Rules
{NP1,2}
SL: $MLH ADWMH  TL: A RED DRESS
NP1::NP1 [NP1 ADJ] -> [ADJ NP1]
(
 (X2::Y1) (X1::Y2)
 ((X1 def) = -)
 ((X1 status) =c absolute)
 ((X1 num) = (X2 num))
 ((X1 gen) = (X2 gen))
 (X0 = X1)
)

{NP1,3}
SL: H $MLWT H ADWMWT  TL: THE RED DRESSES
NP1::NP1 [NP1 "H" ADJ] -> [ADJ NP1]
(
 (X3::Y1) (X1::Y2)
 ((X1 def) = +)
 ((X1 status) =c absolute)
 ((X1 num) = (X3 num))
 ((X1 gen) = (X3 gen))
 (X0 = X1)
)
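As an illustration of what the {NP1,2} pattern does, the sketch below reorders a Hebrew noun-adjective pair into English adjective-noun order only when the definiteness, status, and agreement constraints hold; the feature dictionaries are assumed stand-ins for the engine's feature structures.

# Sketch: applying the {NP1,2} pattern (indefinite noun followed by an agreeing adjective).
def apply_np1_2(noun, adj):
    """noun/adj: dicts with a translated form plus num/gen (and def/status on the noun)."""
    if (noun["def"] == "-" and noun["status"] == "absolute"
            and noun["num"] == adj["num"] and noun["gen"] == adj["gen"]):
        return f'{adj["trans"]} {noun["trans"]}'     # ADJ NP1 order on the English side
    return None                                      # rule does not apply

noun = {"trans": "DRESS", "num": "s", "gen": "f", "def": "-", "status": "absolute"}
adj  = {"trans": "RED",   "num": "s", "gen": "f"}
print(apply_np1_2(noun, adj))   # RED DRESS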
21
Hebrew-to-English MT Prototype
  • Initial prototype developed within a two-month
    intensive effort
  • Accomplished
  • Adapted available morphological analyzer
  • Constructed a preliminary translation lexicon
  • Translated and aligned Elicitation Corpus
  • Learned XFER rules
  • Developed (small) manual XFER grammar
  • System debugging and development
  • Evaluated performance on unseen test data using
    automatic evaluation metrics

22
Example Translation
  • Input (Hebrew source, shown here as a word-by-word gloss)
  • After debates many decided the government to hold
    referendum in issue the withdrawal
  • Output
  • AFTER MANY DEBATES THE GOVERNMENT DECIDED TO HOLD
    A REFERENDUM ON THE ISSUE OF THE WITHDRAWAL

23
Noun Phrases Construct State
HXL@T HNSIA HRA$WN
decision.3SF-CS the-president.3SM the-first.3SM
THE DECISION OF THE FIRST PRESIDENT

HXL@T HNSIA HRA$WNH
decision.3SF-CS the-president.3SM the-first.3SF
THE FIRST DECISION OF THE PRESIDENT
24
Noun Phrases - Possessives
HNSIA HKRIZ $HM$IMH HRA$WNH $LW THIH
the-president announced that-the-task.3SF the-first.3SF of-him will.3SF
LMCWA PTRWN LSKSWK BAZWRNW
to-find solution to-the-conflict in-region-POSS.1P

Without transfer grammar: THE PRESIDENT ANNOUNCED
THAT THE TASK THE BEST OF HIM WILL BE TO FIND
SOLUTION TO THE CONFLICT IN REGION OUR

With transfer grammar: THE PRESIDENT ANNOUNCED THAT
HIS FIRST TASK WILL BE TO FIND A SOLUTION TO THE
CONFLICT IN OUR REGION
25
Subject-Verb Inversion
ATMWL HWDI&H HMM$LH
yesterday announced.3SF the-government.3SF
$T&RKNH BXIRWT BXWD$ HBA
that-will-be-held.3PF elections.3PF in-the-month the-next

Without transfer grammar: YESTERDAY ANNOUNCED THE
GOVERNMENT THAT WILL RESPECT OF THE FREEDOM OF THE
MONTH THE NEXT

With transfer grammar: YESTERDAY THE GOVERNMENT
ANNOUNCED THAT ELECTIONS WILL ASSUME IN THE NEXT
MONTH
26
Subject-Verb Inversion
LPNI KMH $BW&WT HWDI&H HNHLT HMLWN
before several weeks announced.3SF management.3SF.CS the-hotel
$HMLWN ISGR BSWF H$NH
that-the-hotel.3SM will-be-closed.3SM at-end.3SM.CS the-year

Without transfer grammar: IN FRONT OF A FEW WEEKS
ANNOUNCED ADMINISTRATION THE HOTEL THAT THE HOTEL
WILL CLOSE AT THE END THIS YEAR

With transfer grammar: SEVERAL WEEKS AGO THE
MANAGEMENT OF THE HOTEL ANNOUNCED THAT THE HOTEL
WILL CLOSE AT THE END OF THE YEAR
27
Evaluation Results
  • Test set of 62 sentences from Haaretz newspaper,
    2 reference translations

28
Current and Future Work
  • Issues specific to the Hebrew-to-English system
  • Coverage: further improvements in the translation
    lexicon and morphological analyzer
  • Manual Grammar development
  • Acquiring/training of word-to-word translation
    probabilities
  • Acquiring/training of a Hebrew language model at
    a post-morphology level that can help with
    disambiguation
  • General Issues related to XFER framework
  • Discriminative Language Modeling for MT
  • Effective models for assigning scores to transfer
    rules
  • Improved grammar learning
  • Merging/integration of manual and acquired
    grammars

29
Conclusions
  • Test case for the CMU XFER framework for rapid MT
    prototyping
  • Preliminary system was a two-month, three-person
    effort; we were quite happy with the outcome
  • Core concept of XFER + Decoding is very powerful
    and promising for MT
  • We experienced the main bottlenecks of knowledge
    acquisition for MT: morphology, translation
    lexicons, grammar...

30
Questions?