1
Machine Transliteration
  • Joshua Waxman

2
Overview
  • Words written in a language with alphabet A →
    written in a language with alphabet B
  • שלום → shalom
  • Importance for MT, for cross-language IR
  • Forward transliteration, Romanization,
    back-transliteration

3
Is there a convergence towards standards?
  • Perhaps for really famous names. But even such
    standard names have multiple acceptable
    spellings. Whether anyone regulates such
    spellings probably depends on the culture. In
    the meantime, there is a lot of variance,
    especially on the Web. E.g. holiday of Succot
    (סוכות)
  • Variance in pronunciation across different
    cultural groups and dialects (soo-kot, suh-kes),
    and variance in how one chooses to transliterate
    different Hebrew letters (kk, cc, gemination).
  • Sukkot 7.1 million
  • Succot 173 thousand
  • Succos 153 thousand
  • Sukkoth 113 thousand
  • Succoth 199 thousand
  • Sukos 112 thousand
  • Sucos 927 thousand, but probably almost none
    related to holiday
  • Sucot 101 thousand. Spanish transliteration of
    holiday
  • Sukkes 1.4 thousand. Yiddish rendition
  • Succes 68 million. Misspelling of success
  • Sukket 45 thousand. But not Yiddish, because it
    wouldn't have 't' ending
  • Recently in the news: AP "Emad Borat"; Arutz
    Sheva "Imad Muhammad Intisar Boghnat"

4
Can we enforce standards?
  • Would make task easier.
  • News articles, perhaps
  • However
  • Would they listen to us?
  • Does the standard make sense across the board?
    Once again, dialectal differences. E.g. ?, ?,
    vowels. Also, fold-over of alphabet. ?-?, ?-?,
    ?-?, ?-?, ?-?
  • 2N for N languages

5
(No Transcript)
6
Four Papers
  • Cross Linguistic Name Matching in English and
    Arabic
  • For IR search. Fuzzy string matching.
    Modification of Soundex to use cross-language
    mapping, using character equivalence classes
  • Machine Transliteration
  • For Machine translation. Back transliteration. 5
    steps in transliteration. Use Bayes rule
  • Transliteration of Proper Names in
    Cross-Language Applications
  • Forward transliteration, purely statistical based
  • Statistical Transliteration for English-Arabic
    Cross Language Information Retrieval
  • Forward transliteration. For IR, generating every
    possible transliteration, then evaluate. Using
    selected n-gram model

7
Cross Linguistic Name Matching in English and
Arabic: A One to Many Mapping Extension of the
Levenshtein Edit Distance Algorithm
  • Dr. Andrew T. Freeman, Dr. Sherri L. Condon and
  • Christopher M. Ackerman
  • The Mitre Corporation

8
Cross Linguistic Name Matching
  • What?
  • Match personal names in English to the same names
    in Arabic script.
  • Why is this not a trivial problem?
  • There are multiple transcription schemes, so it
    is not one-to-one
  • e.g. معمر القذافي can be Muammar Gaddafi, Muammar
    Qaddafi, Moammar Gadhafi, Muammar Qadhafi,
    Muammar al Qadhafi
  • because certain consonants and vowels can be
    represented multiple ways in English
  • Note: Arabic is just an example of this
    phenomenon
  • so standard string comparison is insufficient
  • For what purpose?
  • For search on, say, news articles. How do you
    match all occurrences of Qadhafi?
  • Their solution
  • Enter the search term in Arabic, use Character
    Equivalence Classes (CEQ) to generate possible
    transliterations, supplement the Levenshtein Edit
    Distance Algorithm

9
Elaboration on Multiple Transliteration Schemes
  • Why?
  • No standard English phoneme corresponding to
    Arabic /q/
  • Different dialects: in Libya, this is pronounced
    g
  • Note: similar for Hebrew dialects

10
Fuzzy string matching
  • Definition: matching strings based on similarity
    rather than identity
  • Examples
  • edit-distance
  • n-gram matching
  • normalization procedures like Soundex.

11
Survey of Fuzzy Matching Methods - Soundex
  • Soundex
  • Odell and Russell, 1918
  • Some obvious pluses
  • (not mentioned explicitly by paper)
  • we eliminate vowels, so Moammar/Muammar not a
    problem
  • Groups of letters will take care of different
    English letters corresponding to Arabic
  • Elimination of repetition and of h will remove
    gemination/fricatives
  • Some minuses
  • Perhaps dialects will transgress Soundex phonetic
    code boundaries. e.g. ת in Hebrew can be t, th,
    s. ח can be ch or h. Is a ו to be w or v? But
    could modify algorithm to match.
  • Note: al in al-Qadafi
  • Perhaps would match too many inappropriate results

12
Noisy Channel Model
13
Levenshtein Edit Distance
  • AKA Minimum Edit Distance
  • Minimum number of operations of insertion,
    deletion, substitution. Cost per operation 1
  • Via dynamic programming
  • Example taken from Jurafsky and Martin, but with
    corrections
  • Minimum of diagonal subst, or down/left
    insertion/deletion cost
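A minimal sketch of the dynamic program described above, with the substitution cost as a parameter so that both variants of the example (cost 1 and cost 2) can be reproduced. The function name edit_distance is illustrative.

```python
def edit_distance(source: str, target: str, sub_cost: int = 1) -> int:
    """Minimum edit distance; insertion and deletion each cost 1."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i          # delete all of source[:i]
    for j in range(cols):
        dist[0][j] = j          # insert all of target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # diagonal
                dist[i - 1][j] + 1,                              # deletion
                dist[i][j - 1] + 1,                              # insertion
            )
    return dist[-1][-1]


print(edit_distance("intention", "execution", sub_cost=1))
print(edit_distance("intention", "execution", sub_cost=2))
```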

14
Minimum Edit Distance Example (substitution cost 2)
N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
15
Minimum Edit Distance Example (substitution cost 1)
N  9  8  8  8  8  8  8  7  6  5
O  8  7  7  7  7  7  7  6  5  6
I  7  6  6  6  6  6  6  5  6  7
T  6  5  5  5  5  5  5  6  7  8
N  5  4  4  4  4  5  6  7  7  7
E  4  3  4  3  4  5  6  6  7  8
T  3  3  3  3  4  5  5  6  7  8
N  2  2  2  3  4  5  6  7  7  7
I  1  1  2  3  4  5  6  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
16
Minimum Edit Distance
  • Score of 0: perfect match, since no edit ops
  • s of len m, t of len n
  • Fuzzy match: divide edit score by length of
    shortest (or longest) string, and take 1 minus
    this number. Set threshold for strings to be a
    match. Then, longer pairs of strings are more
    likely to be matched than shorter pairs of
    strings with the same number of edits, because
    we get the percentage of chars that need ops.
    Otherwise, A vs. I has same edit distance as
    tuning vs. turning.
  • A good algorithm for fuzzy string comparison can
    see that Muammar Gaddafi, Muammar Qaddafi,
    Moammar Gadhafi, Muammar Qadhafi, Muammar al
    Qadhafi are relatively close.
  • But we don't really want a substitution cost for
    G/Q, O/U, DD/DH, or certain insertion/deletion
    costs. That is why they supplement it with these
    Character Equivalence Classes (CEQ), which we'll
    get to a bit later.
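A small sketch of the length-normalized score described above. It divides by the longer string (one of the two options mentioned) so the score stays within 0..1. The names levenshtein and similarity are illustrative.

```python
def levenshtein(s: str, t: str) -> int:
    """Edit distance with unit costs, keeping only two rows of the table."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j - 1] + (cs != ct),  # substitute or match
                            prev[j] + 1,               # delete
                            curr[j - 1] + 1))          # insert
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """1 minus (edit distance / length of the longer string)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


print(similarity("a", "i"))            # one edit, but nothing in common
print(similarity("tuning", "turning")) # one edit in a longer pair
print(similarity("Muammar Gaddafi", "Moammar Gadhafi"))
```

With a threshold such as 0.8 (an arbitrary value for illustration), tuning/turning would match while A/I would not, although both pairs are one edit apart.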

17
Editex
  • Zobel and Dart (1996): Soundex + Levenshtein
    Edit Distance
  • replace e(si, tj), which was basically 1 if
    unequal, 0 if equal (that is, cost of an op),
    with r(si, tj), which makes use of Soundex
    equivalences: 0 if identical, 1 if in same group,
    2 if different
  • Also neutralizes h and w in general. Show example
    based on chart from before. When initializing or
    calculating the cost of insertion/deletion, do
    not count h and w; otherwise have cost of 1.
  • Other enhancements to standard Soundex and Edit
    distance for the purpose of comparison, e.g.
    tapering (edits count less later in the word)
    and phonometric methods (input strings mapped to
    phonemic representations, e.g. rough).
  • They say it performed better than Soundex, Min
    Edit Distance, counting n-gram sequences, and 10
    permutations of tapering and phonometric
    enhancements to standard algorithms

18
SecondString (Tool)
  • Java based implementation of many of these string
    matching algorithms. They use this for comparison
    purposes. Also, SecondString allows hybrid
    algorithms by mixing and matching, tools for
    string matching metrics, tools for matching
    tokens within strings.

19
Baseline Task (??)
  • Took 106 Arabic, 105 English texts from newswire
    articles
  • Took names from these articles, 408 names from
    English, 255 names from Arabic.
  • manual cross-script matching, got 29 common names
    (rather than manually coming up with all possible
    transliterations)
  • But to get baseline, tried matching all names in
    Arabic (transliterated using Atrans by Basis
    2004) to all names in English, using algorithms
    from SecondString. Thus, have one standard
    transliteration, and try to match it to all other
    English transliterations
  • Empirically set threshold to something that
    yielded good result.
  • R (recall) = correctly matched English names /
    available correct English matches in set. What
    percentage of the total correct did they get?
  • P (precision) = total correct names / total
    number of names returned. What percentage of
    their guesses were accurate?
  • Defined F-score as 2 × (P × R) / (P + R)
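The three measures above, as a short sketch over sets of returned and correct matches. The function name and the toy sets are illustrative, not data from the paper.

```python
def precision_recall_f(returned: set, correct: set) -> tuple:
    """Return (P, R, F) for a set of returned matches against the gold set."""
    hits = len(returned & correct)
    p = hits / len(returned) if returned else 0.0
    r = hits / len(correct) if correct else 0.0
    f = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f


# Toy example: 4 names returned, 2 of them right, 3 right answers available.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
print(p, r, f)
```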

20
Other Algorithms Used For Comparison
  • Smith-Waterman: Levenshtein edit, with some
    parameterization of gap score
  • SLIM: iterative statistical learning algorithm
    based on a variety of expectation-maximization in
    which a Levenshtein edit-distance matrix is
    iteratively processed to find the statistical
    probabilities of the overlap between two strings.
  • Jaro: n-gram
  • Last one is Edit distance

21
Their Enhancements
  • Motivation: an Arabic letter can have more than
    one possible English letter equivalent. Also,
    Arabic transliterations of English names are not
    predictable: 6 different ways to represent
    Milosevic in Arabic.

22
Some Real World Knowledge
23
Character Equivalence Classes
  • Same idea as Editex, except use Ar(si, tj) where
    s is an Arabic word, so si is an Arabic letter,
    and t is an English word, and tj is an English
    letter.
  • So, comparing Arabic to English directly, rather
    than a standard transliteration
  • The sets within Ar handle the (modified)
    Buckwalter transliteration, the default
    transliteration of Basis software
  • Basis uses English digraphs for certain letters
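A rough sketch of the idea: the same edit-distance table, but a substitution is free when the English letter belongs to the equivalence class of the Arabic letter. The Arabic side is written in Buckwalter symbols, and the CEQ table below is a hypothetical, tiny stand-in for the paper's classes; digraphs and normalization are left out.

```python
# Hypothetical equivalence classes, keyed by Buckwalter symbols.
CEQ = {
    "q": {"q", "g", "k"},  # qaf
    "*": {"d", "z"},       # dhal
    "A": {"a"},            # alif
    "f": {"f"},            # fa
    "y": {"y", "i"},       # ya
}


def ceq_distance(arabic: str, english: str) -> int:
    """Edit distance where Ar(si, tj) is 0 if tj is in the class of si, else 1."""
    english = english.lower()
    prev = list(range(len(english) + 1))
    for i, a in enumerate(arabic, start=1):
        curr = [i]
        for j, e in enumerate(english, start=1):
            sub = 0 if e in CEQ.get(a, ()) else 1
            curr.append(min(prev[j - 1] + sub, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


for candidate in ["qdafy", "gdafi", "gadafi", "sdafi"]:
    print(candidate, ceq_distance("q*Afy", candidate))
```

Unlike Soundex, nothing is collapsed on the English side; the table only widens what each Arabic letter is allowed to match.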

24
Buckwalter Transliteration Scheme
  • A scholarly transliteration scheme, unlikely to
    be found in newspaper articles
  • Wikipedia: The Buckwalter Arabic transliteration
    was developed at Xerox by Tim Buckwalter in the
    1990s. It is an ASCII-only transliteration
    scheme, representing Arabic orthography strictly
    one-to-one, unlike the more common romanization
    schemes that add morphological information not
    expressed in Arabic script. Thus, for example, a
    waw will be transliterated as w regardless of
    whether it is realized as a vowel u or a
    consonant w. Only when the waw is modified by a
    hamza (ؤ) does the transliteration change to &.
    The unmodified letters are straightforward to
    read (except for maybe * = dhaal, E = ayin, and
    v = thaa), but the transliteration of letters
    with diacritics and the harakat takes some time
    to get used to; for example, the nunated irab
    -un, -an, -in appear as N, F, K, and the sukun
    ("no vowel") as o. Ta marbouta ة is p.
  • hamza
  • lone hamza '
  • hamza on alif >
  • hamza on wa &
  • hamza on ya }
  • alif
  • madda on alif |
  • alif al-wasla {
  • dagger alif `
  • alif maqsura Y

25
The Equivalence Classes
26
Normalization
  • They normalize Buckwalter and the English in the
    newspaper articles.
  • Thus, $ → sh from Buckwalter,
  • ph → f in English, eliminate dupes, etc.
  • Move vowels from each language closer to one
    another by only retaining matching vowels (that
    is, where exist in both)

27
(No Transcript)
28
Why different from Soundex and Editex
  • What we do here is the opposite of the approach
    taken by the Soundex and Editex algorithms. They
    try to reduce the complexity by collapsing groups
    of characters into a single super-class of
    characters. The algorithm here does some of that
    with the steps that normalize the strings.
    However, the largest boost in performance is with
    CEQ, which expands the number of allowable
    cross-language matches for many characters.

29
Machine (Back-) Transliteration
  • Kevin Knight and Jonathan Graehl
  • University of Southern California

30
Machine Transliteration
  • For Translation purposes
  • Foreign Words commonly transliterated, using
    approximate phonemic equivalents
  • computer → konpyuuta
  • Problem: usually, translate by looking up in
    dictionaries, but these often don't show up in
    dictionaries
  • Usually not a problem for some languages, like
    Spanish/English, since they have similar
    alphabets. But languages that are non-alphabetic
    or have different alphabets are more problematic
    (e.g. Japanese, Arabic)
  • Popular on the Internet: The Coca-Cola name in
    China was first read as "Ke-kou-ke-la," meaning
    "Bite the wax tadpole" or "female horse stuffed
    with wax," depending on the dialect. Coke then
    researched 40,000 characters to find a phonetic
    equivalent to "ko-kou-ko-le," translating into
    "happiness in the mouth."
  • Solution: backwards transliteration to get the
    original word, using a generative model

31
Machine Transliteration
  • Japanese transliterates foreign names and
    loan-words, e.g. from English, in katakana.
  • Compromises, e.g. golfbag
  • L/R map to same character
  • Japanese has alternating consonant-vowel pattern,
    so cannot have consonant cluster LFB
  • Syllabary instead of alphabet.
  • Goruhubaggu
  • Dot separator, but inconsistent, so
  • aisukuriimu can be I scream
  • or ice cream

32
Back Transliteration
  • Going from katakana back to original English word
  • For translation: katakana words are not found in
    bilingual dictionaries, so just generate the
    original English (assuming it is English)
  • Yamrom 1994 pattern matching
  • Arbabi 1994 neural net/expert system
  • Information loss, so not easy to invert

33
More Difficult Than
  • Forward transliteration
  • several ways to transliterate into katakana, all
    valid, so you might encounter any of them
  • But only one English spelling: can't say arture
    for archer
  • Romanization
  • we have seen examples of this: the katakana
    examples above
  • more difficult because of spelling variations
  • Certain things cannot be handled by
    back-transliteration
  • Onomatopoeia
  • Shorthand, e.g. waapuro for word processing

34
Desired Features
  • Accuracy
  • Portability to other languages
  • Robust against OCR errors
  • Relevant to ASR where speaker has heavy accent
  • Ability to take context (topical/syntactic) into
    account, or at least return ranked list of
    possibilities
  • Really requires 100% knowledge

35
Learning Approach: Initial Attempt
  • Can learn what letters transliterate for what by
    training on corpus of katakana phrases in
    bilingual dictionaries
  • Drawbacks
  • with naïve approach, how can we make sure we get
    a normal transliteration?
  • E.g. we can get iskrym as back transliteration
    for aisukuriimu.
  • Take letter frequency into account! So can get
    isclim
  • Restrict to real words! Is crime.
  • We want ice cream!

36
Modular Learning Approach
  • Build generative model of transliteration
    process,
  • English phrase is written
  • Translator pronounces it in English
  • Pronunciation modified to fit Japanese sound
    inventory
  • Sounds are converted into katakana
  • Katakana is written
  • Solve and coordinate solutions to these
    subproblems, use generative models in reverse
    direction
  • Use probabilities and Bayes Rule

37
Bayes Rule Example
  • Example 1: conditional probabilities, from
    Wikipedia
  • Suppose there are two bowls full of cookies. Bowl
    1 has 10 chocolate chip cookies and 30 plain
    cookies, while bowl 2 has 20 of each. Fred picks
    a bowl at random, and then picks a cookie at
    random. We may assume there is no reason to
    believe Fred treats one bowl differently from
    another, likewise for the cookies. The cookie
    turns out to be a plain one. How probable is it
    that Fred picked it out of bowl 1?
  • Intuitively, it seems clear that the answer
    should be more than a half, since there are more
    plain cookies in bowl 1. The precise answer is
    given by Bayes's theorem. But first, we can
    clarify the situation by rephrasing the question
    to "what's the probability that Fred picked bowl
    1, given that he has a plain cookie?" Thus, to
    relate to our previous explanation, the event A
    is that Fred picked bowl 1, and the event B is
    that Fred picked a plain cookie. To compute
    Pr(A|B), we first need to know
  • Pr(A), or the probability that Fred picked bowl
    1 regardless of any other information. Since
    Fred is treating both bowls equally, it is 0.5.
  • Pr(B), or the probability of getting a plain
    cookie regardless of any information on the
    bowls. In other words, this is the probability of
    getting a plain cookie from each of the bowls. It
    is computed as the sum of the probability of
    getting a plain cookie from a bowl multiplied by
    the probability of selecting this bowl. We know
    from the problem statement that the probability
    of getting a plain cookie from bowl 1 is 0.75,
    and the probability of getting one from bowl 2
    is 0.5, and since Fred is treating both bowls
    equally the probability of selecting any one of
    them is 0.5. Thus, the probability of getting a
    plain cookie overall is 0.75 × 0.5 + 0.5 × 0.5 =
    0.625.
  • Pr(B|A), or the probability of getting a plain
    cookie given that Fred has selected bowl 1. From
    the problem statement, we know this is 0.75,
    since 30 out of 40 cookies in bowl 1 are plain.
  • Given all this information, we can compute the
    probability of Fred having selected bowl 1 given
    that he got a plain cookie, as such:
    Pr(A|B) = Pr(B|A) × Pr(A) / Pr(B)
            = 0.75 × 0.5 / 0.625 = 0.6
  • As we expected, it is more than half.
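The cookie arithmetic above can be checked in a few lines. The numbers come straight from the example; the function name bayes_posterior is illustrative.

```python
def bayes_posterior(prior: float, likelihood: float, evidence: float) -> float:
    """Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)."""
    return likelihood * prior / evidence


p_bowl1 = 0.5                      # Pr(A): Fred picks bowl 1
p_plain_given_bowl1 = 30 / 40      # Pr(B|A)
p_plain_given_bowl2 = 20 / 40

# Pr(B): total probability of a plain cookie over both bowls.
p_plain = p_plain_given_bowl1 * p_bowl1 + p_plain_given_bowl2 * (1 - p_bowl1)

p_bowl1_given_plain = bayes_posterior(p_bowl1, p_plain_given_bowl1, p_plain)
print(p_plain, p_bowl1_given_plain)
```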

38
Application To Task At Hand
  • English Phrase Generator produces word sequences
    according to probability distribution P(w)
  • English Pronouncer probabilistically assigns a
    set of pronunciations to word sequence, according
    to P(p|w)
  • Given pronunciation p, find word sequence that
    maximizes P(w|p)
  • Based on Bayes' Rule: P(w|p) = P(p|w) P(w) /
    P(p)
  • But P(p) will be the same regardless of the
    specific word sequence, so can just search for
    word sequence that maximizes P(p|w) P(w), which
    are the two distributions we just modeled
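A toy sketch of that search: score each candidate word sequence by P(p|w) P(w) and keep the best. All probabilities and the pronunciation string below are made-up illustrative values, not figures from the paper.

```python
# Hypothetical language model P(w).
P_W = {"ice cream": 0.6, "I scream": 0.4}

# Hypothetical pronunciation model P(p|w), keyed by (p, w).
P_P_GIVEN_W = {
    ("AY S K R IY M", "ice cream"): 0.9,
    ("AY S K R IY M", "I scream"): 0.8,
}


def decode(pronunciation: str) -> str:
    """Return the w maximizing P(p|w) * P(w); P(p) is constant and dropped."""
    return max(
        P_W,
        key=lambda w: P_P_GIVEN_W.get((pronunciation, w), 0.0) * P_W[w],
    )


print(decode("AY S K R IY M"))
```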

39
Five Probability Distributions
  • Extending this notion, built 5 probability
    distributions
  • P(w) generates written English word sequences
  • P(e|w) pronounces English word sequences
  • P(j|e) converts English sounds into Japanese
    sounds
  • P(k|j) converts Japanese sounds into katakana
    writing
  • P(o|k) introduces misspellings caused by OCR
  • Parallels 5 steps above
  • English phrase is written
  • Translator pronounces it in English
  • Pronunciation modified to fit Japanese sound
    inventory
  • Sounds are converted into katakana
  • Katakana is written
  • Given katakana string o observed by OCR, we wish
    to maximize
  • P(w) P(e|w) P(j|e) P(k|j) P(o|k)
    over all e, j, k
  • Why? Let's say we have e and want to determine
    the most probable w given e, that is, P(w|e); we
    would maximize P(w) P(e|w) / P(e)

40
Implementation of the probability distributions
  • P(w) as WFSA (weighted finite state acceptor),
    others as WFST (transducers)
  • WFSA: state transition diagram with both symbols
    and weights on the transitions, such that some
    transitions are more likely than others
  • WFST: the same, but with both input and output
    symbols
  • Implemented composition algorithm to yield P(x|z)
    from models P(x|y) and P(y|z), treating WFSAs
    simply as WFSTs with identical input and output
  • Yields one large WFSA; use Dijkstra's
    shortest path algorithm to extract the most
    probable path
  • No pruning; use Viterbi approximation, searching
    best path through WFSA rather than best sequence
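A minimal sketch of the last step only: finding the most probable path through a small WFSA by running Dijkstra's algorithm on negative log probabilities. The acceptor, its states, and its weights are hypothetical, and composition is not shown.

```python
import heapq
import math


def most_probable_path(arcs: dict, start, final):
    """arcs maps state -> list of (next_state, symbol, probability).

    Minimizing the sum of -log(prob) is the same as maximizing the product
    of the probabilities along the path.
    """
    heap = [(0.0, start, [])]
    done = set()
    while heap:
        cost, state, symbols = heapq.heappop(heap)
        if state == final:
            return symbols, math.exp(-cost)
        if state in done:
            continue
        done.add(state)
        for nxt, sym, prob in arcs.get(state, []):
            if prob > 0 and nxt not in done:
                heapq.heappush(heap, (cost - math.log(prob), nxt, symbols + [sym]))
    return None, 0.0


# Hypothetical acceptor with two competing paths from state 0 to state 3.
TOY_WFSA = {
    0: [(1, "ice", 0.6), (2, "I", 0.4)],
    1: [(3, "cream", 0.7)],
    2: [(3, "scream", 0.9)],
}

print(most_probable_path(TOY_WFSA, 0, 3))
```

This returns one best path, which matches the Viterbi approximation noted above: several paths spelling the same sequence are not summed.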

41
First Model: Word Sequences
  • ice cream > ice crème > aice kreme
  • Unigram scoring mechanism which multiplies scores
    of known words and phrases in a sequence
  • Corpus: WSJ corpus + online English name list +
    online gazetteer of place names
  • Should really e.g. ignore auxiliaries and favor
    surnames. Approximate by removing high-frequency
    words

42
Model 2: Eng Word Sequences → Eng Sound Sequences
  • Use English phoneme inventory from CMU
    Pronunciation Dictionary, minus stress marks
  • 40 sounds: 14 vowel sounds, 25 consonant sounds
    (e.g. K, HH, R), additional symbol PAUSE
  • Dictionary has 100,000 (125,000) word
    pronunciations
  • Used top 50,000 words because of memory
    limitations
  • Capital letters: Eng sounds; lowercase words:
    Eng words

43
(No Transcript)
44
(No Transcript)
45
Example: Second WFST
Note: Why not letters instead of phonemes?
Doesn't match Japanese transliteration
mispronunciation, and that is modeled in next
step.
46
Model 3: English Sounds → Japanese Sounds
  • Information-losing process: R, L → r; 14 vowels →
    5 Japanese vowels
  • Identify Japanese sound inventory
  • Build WFST to perform the sequence mapping
  • Japanese sound inventory has 39 symbols: 5
    vowels, 33 consonants (including doubled kk),
    special symbol pause.
  • (P R OW PAUSE S AA K ER) (pro-soccer) maps to (p
    u r o pause s a kk a a)
  • Use machine learning to train WFST from 8000
    pairs of English/Japanese sound sequences (for
    example, soccer). Created this corpus by
    modifying an English/katakana dictionary,
    converting into these sounds; used EM
    (expectation-maximization) algorithm to generate
    symbol-matching probabilities. See table on next
    page

47
(No Transcript)
48
The EM Algorithm
Note: pays no heed to context
49
Model 4: Japanese sounds → Katakana
  • Manually construct 2 WFSTs.
  • 1 just merges sequential doubled sounds into
    single sound. o o → oo
  • 2 just does mapping, accounting for different
    spelling variation. e.g.

50
Model 5: katakana → OCR
51
Example
52
Transliteration of Proper Names in
Cross-Language Applications
  • Paola Virga, Sanjeev Khudanpur
  • Johns Hopkins University

53
Abstract
  • For MT, for IR, specifically cross-language IR
  • Names important, particularly for short queries
  • Transliteration: writing a name in a foreign
    language, preserving the way it sounds
  • Render English name in phonemic form
  • Convert phonemic string into foreign orthography,
    e.g. Mandarin Chinese
  • Mentions back transliteration for Japanese, and
    application to Arabic, by Knight etc.
  • For Korean, strongly phonetic orthography allows
    good transliteration using simple HMMs
  • Hand-crafted rules to change English spelling to
    accord with Mandarin syllabification, then learns
    to convert English phoneme sequence to Mandarin
    syllable sequence.
  • They extend the previous, making it fully
    data-driven rather than relying on hand-crafted
    rules, to accomplish English → Mandarin
    transliteration

54
Four steps in transliteration process
  • English → Phonetic English (using Festival)
  • Festival: free, source available, multilingual,
    interfaces to shell, Scheme, Java, C++, emacs
    (see next page)
  • English phoneme → initials and finals
  • Initial/final sequence → pin-yin symbols
  • Wikipedia: Pinyin is a system of romanization
    (phonemic notation and transcription to Roman
    script) for Standard Mandarin, where pin means
    "spell" and yin means "sound".
  • Pinyin is a romanization and not an
    anglicization; that is, it uses Roman letters to
    represent sounds in Standard Mandarin. The way
    these letters represent sounds in Standard
    Mandarin will differ from how other languages
    that use the Roman alphabet represent sound. For
    example, the sounds indicated in pinyin by b and
    g are not as heavily voiced as in the Western use
    of the Latin script. Other letters, like j, q, x
    or zh indicate sounds that do not correspond to
    any exact sound in English. Some of the
    transcriptions in pinyin, such as the ang ending,
    do not correspond to English pronunciations,
    either.
  • By letting Roman characters refer to specific
    Chinese sounds, pinyin produces a compact and
    accurate romanization, which is convenient for
    native Chinese speakers and scholars. However, it
    also means that a person who has not studied
    Chinese or the pinyin system is likely to
    severely mispronounce words, which is a less
    serious problem with some earlier romanization
    systems such as Wade-Giles.
  • Different from katakana
  • Pin-yin → Chinese character sequence
  • Steps 1 and 3 are deterministic; steps 2 and 4
    are statistical

55
(No Transcript)
56
Noisy Channel Model
  • We had this concept before
  • Think of e as an i-word English sentence output
    from the noisy channel, and c as a j-word Chinese
    input into the noisy channel. Except the words
    are phonemes
  • Find most likely Chinese sentence to have
    generated English output. Use Bayes' rule.

57
How to train and use the transliteration system:
see next slide
58
Training
  • Got from authors of 3 4, their corpus.
  • 3875 English names, Chinese transliterations,
    pin-yin counterparts, used Festival to generate
    phonemic English, pronunciation of pinyin based
    on Initial/Final inventory from Mandarin
    phonology text
  • First corpus lines 2, 3
  • Second corpus lines 4, 5
  • Compare to 4, Do more general test

59
Spoken Document Retrieval
  • Infrastructure developed at Johns Hopkins Summer
    Workshop Mandarin audio to be searched using
    English text queries
  • English proper names unavailable in translation
    lexicon, thus ignored during retrieval
  • Improved mean average precision by adding name
    transliteration (from 0.501 to 0.515)

60
Statistical Transliteration for English-Arabic
Cross Language Information Retrieval
  • Nasreen AbdulJaleel, Leah Larkey

61
Overview
  • For IR
  • Motivation: not proper nouns but rather OOV (out
    of vocabulary) words. When there is no
    corresponding word in the dictionary, simply
    transliterate it
  • Though they train the English-to-Arabic
    transliteration model from pairs of names
  • Selected n-gram model
  • Two stage training model
  • Learn which n-gram segments should be added to
    unigram inventory for source language
  • Then learn translation model over this inventory
  • No need for heuristics
  • No need for knowledge of either language

62
The Problem
  • OOV words problem in cross language information
    retrieval
  • Named entities
  • Numbers
  • Technical terms
  • Acronyms
  • These compose a significant portion of OOV
    words, and when named entity translation is not
    available, there is a reduction in average
    precision of 50%
  • Variability in spelling foreign words. E.g.
    Qaddafi from before
  • OK to use own spelling in foreign language when
    the languages share the same alphabet (e.g.
    Italian, Spanish, German), but not when the
    alphabet is different. Then transliterate.

63
Multiple Spellings In Arabic
  • Thus, useful to have way to generate multiple
    spellings in Arabic from single source
  • Use statistical transliteration to generate no
    heuristics, no linguistic knowledge
  • Statistical transliteration is special case of
    statistical translation, in which the words are
    letters.

64
Selected N-gram transliteration model
  • Generative statistical model, producing string of
    Arabic chars from string of English chars
  • Model: set of conditional probability
    distributions over Arabic chars and NULL
  • Each English char n-gram ei can be mapped to
    Arabic char or sequence of chars ai with
    probability P(ai|ei)
  • Most probabilities are 0, in practice.
  • Probabilities of s, z, tz
  • Also, English source symbol inventory has,
    besides unigrams (such as single letters), some
    end symbols and n-grams such as sh, bb, eE

65
Training of Model
  • From lists of English/Arabic name pairs
  • 2 alignment stages
  • 1: to select n-grams for the model
  • 2: to determine translation probabilities for the
    n-grams
  • Used GIZA++ for letter alignment rather than word
    alignment, treating letters as words
  • Corpus: 125,000 English proper nouns and Arabic
    translations, retaining only those existing in AP
    news article corpus
  • Some normalization: made lowercase, prefixed
    with B and ended with E
  • Alignment 1: align using GIZA++, count instances
    in which English char sequence aligned to single
    Arabic character. Take top 50 of these n-grams
    and add to English symbol inventory
  • Resegment based on new inventory, using
    greedy-ish method
  • Ashcroft → a sh c r o f t
  • Alignment 2, using GIZA++
  • Count up alignments, use them as conditional
    probabilities, removing alignments with
    probability below a threshold of 0.01
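A sketch of the final counting step, assuming the aligner's output has already been reduced to (English segment, Arabic segment) pairs. The function name, the toy pairs, and the choice not to renormalize after thresholding are assumptions.

```python
from collections import Counter, defaultdict


def estimate_probs(aligned_pairs, threshold: float = 0.01) -> dict:
    """Relative-frequency estimate of P(a|e), dropping entries below threshold."""
    counts = defaultdict(Counter)
    for eng, ara in aligned_pairs:
        counts[eng][ara] += 1
    table = {}
    for eng, ara_counts in counts.items():
        total = sum(ara_counts.values())
        table[eng] = {
            ara: n / total
            for ara, n in ara_counts.items()
            if n / total >= threshold
        }
    return table


# Toy alignments: "sh" maps to one Arabic letter 199 times and to another once.
toy_pairs = [("sh", "ش")] * 199 + [("sh", "س")]
print(estimate_probs(toy_pairs))
```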

66
Generation of Arabic Transliterations
  • Take English word ew.
  • Segment, greedily (?) from n-gram inventory
  • All possible transliterations wa are generated
  • Rank according to probabilities, obtained by
    multiplying the segment probabilities
  • Ran experiments; improvement over unigram-only
    model. Etc.
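The generation steps above, sketched end to end: greedy longest-match segmentation against the n-gram inventory, then every combination of per-segment outputs, ranked by the product of their probabilities. The inventory and probability table are hypothetical toy values, NULL is modeled as an empty string, and the B/E boundary symbols are left out.

```python
from itertools import product


def segment(word: str, ngrams: set, max_len: int = 3) -> list:
    """Greedy left-to-right segmentation, preferring the longest known n-gram."""
    pieces, i = [], 0
    while i < len(word):
        for size in range(min(max_len, len(word) - i), 0, -1):
            piece = word[i:i + size]
            if size == 1 or piece in ngrams:
                pieces.append(piece)
                i += size
                break
    return pieces


def transliterations(word: str, table: dict, ngrams: set) -> list:
    """All candidate outputs with their probabilities, most probable first."""
    options = [table.get(seg, {}) for seg in segment(word, ngrams)]
    results = []
    for combo in product(*(opt.items() for opt in options)):
        text = "".join(ara for ara, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        results.append((text, prob))
    return sorted(results, key=lambda item: -item[1])


TOY_NGRAMS = {"sh"}
TOY_TABLE = {"sh": {"ش": 1.0}, "a": {"ا": 0.7, "": 0.3}}

print(segment("ashcroft", TOY_NGRAMS))
print(transliterations("sha", TOY_TABLE, TOY_NGRAMS))
```

A segment with no entry in the table yields no candidates at all in this sketch; a real system would need some fallback for that case.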

67
(No Transcript)