Mining New Word Translations from Comparable Corpora

Transcript and Presenter's Notes
1
Mining New Word Translations from Comparable
Corpora
  • Li Shao and Hwee Tou Ng
  • Department of Computer Science
  • National University of Singapore

2
Motivation
  • New words appear every day
  • These words include names, technical terms, etc.
  • A machine translation system must be able to
    learn the translation of new words

3
Objective
  • A separate lexicon learning system to learn
    translations of new names, technical terms, etc.
  • Given comparable corpora, mine translations of
    new words from the corpora

4
Why Comparable Corpora?
  • More readily available than parallel corpora
  • News articles are up-to-date and contain the
    latest new words

5
Translation by Context
  • (Fung and Yee, 1998; Rapp, 1995; Rapp, 1999)
  • If two words are translations of each other, then
    their contexts are similar.
  • Compute the similarity of the contexts of two
    words with the vector space model
  • (Cao and Li, 2002) used web data and the EM
    algorithm to identify the translations

6
Language Modeling
  • (Ponte and Croft, 1998; Ng, 2000)
  • Build a language model for each document
  • Compute the probability of generating a query
    using the language model and sort the documents
    accordingly

7
Translation by Transliteration
  • (Knight and Graehl, 1998) presented a method for
    machine transliteration based on word
    pronunciation
  • (Al-Onaizan and Knight, 2002a, 2002b) presented
    improvements using monolingual and bilingual
    resources and using word spelling

8
Our Approach
  • A new approach to mine new word translations by
    combining both context and transliteration
    information

9
Our Approach
  • [Diagram] Translation by context and translation
    by transliteration each produce a sorted list of
    candidates; the two lists are combined into a
    final sorted list of candidates
10
Translation by Context
  • C(c), the context of a Chinese word c, is viewed
    as a query
  • C(e), the context of an English word e, is viewed
    as a document
  • Let e be the correct translation. C(e), the
    context of e, is the most similar to C(c)
  • C(e) is the most relevant document to the query
    C(c)

11
Translation by Context
The probability of generating query Q according
to the language model derived from document D
is estimated as

  P(Q | D) = ∏_t P(t | D)^{c(t,Q)}

where t is a term in the corpus, c(t,Q) is the
number of times term t occurs in Q, and n = Σ_t
c(t,Q) is the total number of terms in Q
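The generation probability above can be sketched in Python. The interpolation weight `lam` and the use of a corpus-level fallback model are assumptions for illustration; the slides mention backoff and linear interpolation without giving the exact scheme.

```python
import math
from collections import Counter

def log_p_query(query_terms, doc_terms, corpus_terms, lam=0.5):
    """log P(Q|D) = sum over terms t of c(t,Q) * log P(t|D),
    where P(t|D) is linearly interpolated with a corpus model."""
    q = Counter(query_terms)
    d = Counter(doc_terms)
    c = Counter(corpus_terms)
    nd, nc = sum(d.values()), sum(c.values())
    logp = 0.0
    for t, cnt in q.items():
        p = lam * d[t] / nd + (1 - lam) * c[t] / nc
        if p == 0.0:
            return float("-inf")  # term unseen everywhere
        logp += cnt * math.log(p)
    return logp
```

Documents (here, contexts of candidate English words) are then sorted by this score, highest first.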
12
Translation by Context
We then estimate P(C(c) | C(e)) for each e as

  P(C(c) | C(e)) = ∏_{c_i} P(c_i | T(C(e)))^{c(c_i, C(c))}

where c_i is a Chinese word, c(c_i, C(c)) is the
number of occurrences of c_i in C(c), and T(C(e))
is the bag of Chinese words obtained by translating
the English words in C(e), using a bilingual
dictionary
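Forming T(C(e)) can be sketched as follows; representing the bilingual dictionary as a mapping from an English word to a list of Chinese translations is an assumption, and English words with no dictionary entry are simply dropped here.

```python
def translate_context(english_context, dictionary):
    """T(C(e)): the bag of Chinese words obtained by translating each
    English word in the context via a bilingual dictionary."""
    bag = []
    for w in english_context:
        bag.extend(dictionary.get(w, []))  # words without an entry are skipped
    return bag
```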
13
Translation by Context
We use backoff and linear interpolation for
probability estimation
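A standard linearly interpolated estimate has the form below; this is a sketch, since the slides do not give the exact weights or the backoff order.

```latex
P(t \mid D) = \lambda \, P_{ML}(t \mid D) + (1 - \lambda) \, P_{ML}(t \mid \text{corpus}),
\qquad 0 \le \lambda \le 1
```

where P_{ML} denotes a maximum-likelihood (relative frequency) estimate.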
14
Translation by Transliteration
  • Use the pronunciation of a Chinese word in pinyin
  • Assume character-to-pinyin mapping is one-to-one
  • A pinyin syllable is mapped to an English letter
    sequence
  • Restrict each pinyin syllable to only map to 0,
    1, or 2 English letters. No cross mapping is
    allowed

15
Translation by Transliteration
For a particular alignment a between the pinyin of
c and the English word e,

  P(e | c, a) = ∏_i P(e_i^a | s_i)

where s_i is the ith syllable of the pinyin, and
e_i^a is the English letter sequence that the ith
pinyin syllable maps to in the alignment a
16
Translation by Transliteration
  • The probability of mapping between pinyin
    syllable and English letter sequence is learned
    by EM algorithm
  • Require a list of Chinese-English name pairs as
    training data for EM
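The EM training can be sketched as follows. Because each pinyin syllable maps to only 0, 1, or 2 English letters with no cross mapping, all alignments of a name pair can be enumerated exhaustively. The data layout (a list of pinyin syllables paired with a lowercase English word) is an assumption for illustration.

```python
import math
from collections import defaultdict

def alignments(sylls, word):
    """All alignments in which each pinyin syllable maps, in order and
    without crossing, to 0, 1, or 2 English letters covering the word."""
    if not sylls:
        if not word:
            yield []
        return
    for k in range(3):  # 0, 1, or 2 letters for the first syllable
        if k <= len(word):
            for rest in alignments(sylls[1:], word[k:]):
                yield [(sylls[0], word[:k])] + rest

def train_em(pairs, iterations=10):
    """Learn P(English letter sequence | pinyin syllable) from a list of
    (syllables, english_word) training name pairs."""
    prob = defaultdict(lambda: 1.0)  # uniform start: every mapping equally likely
    for _ in range(iterations):
        counts = defaultdict(float)
        totals = defaultdict(float)
        for sylls, word in pairs:
            aligns = list(alignments(sylls, word))
            weights = [math.prod(prob[(s, seq)] for s, seq in a) for a in aligns]
            z = sum(weights)
            if z == 0:
                continue
            # E-step: fractional counts weighted by alignment probability
            for a, w in zip(aligns, weights):
                for s, seq in a:
                    counts[(s, seq)] += w / z
                    totals[s] += w / z
        # M-step: renormalize per syllable
        prob = defaultdict(float,
                           {pair: c / totals[pair[0]] for pair, c in counts.items()})
    return prob
```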

17
Combining Context and Transliteration
  • Two sorted lists for each Chinese word (one list
    from each individual method)
  • Sort the English words appearing in both lists
    within the top M positions, according to their
    average rank
  • Output the English word with the highest average
    rank as the translation
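The combination step can be sketched as follows; both input lists are assumed to be ranked best-first.

```python
def combine(context_ranked, translit_ranked, M=10):
    """Sort the English words appearing within the top M positions of
    both ranked lists by their average rank (lower is better)."""
    r1 = {w: r for r, w in enumerate(context_ranked[:M], start=1)}
    r2 = {w: r for r, w in enumerate(translit_ranked[:M], start=1)}
    common = set(r1) & set(r2)
    return sorted(common, key=lambda w: (r1[w] + r2[w]) / 2.0)
```

The first element of the returned list, if any, is output as the translation.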

18
Experiment - Resource
  • Chinese corpus: LDC Chinese Gigaword Corpus,
    Jan-Dec 1995
  • Jul-Dec 1995 (120 MB) was used to come up with
    candidate Chinese words
  • Jan-Jun 1995 was used to determine if a
    candidate Chinese word was new
  • English corpus: LDC English Gigaword Corpus,
    Jul-Dec 1995 (730 MB)

19
Experiment - Resource
  • A Chinese-English dictionary of about 10,000
    entries, for translating words in the context
  • A list of 1,580 Chinese-English name pairs, as
    training data for EM

20
Experiment - Preprocessing
  • Chinese corpus
  • Perform Chinese word segmentation (Ng and Low,
    EMNLP 2004)
  • Divide the Chinese corpus Jul - Dec 1995 into 12
    periods, each containing text from a half-month
    period
  • Extract new words occurring at least 5 times in
    each period. New words are those appearing in
    this period but not before
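The new-word selection step can be sketched as follows; the token list for the period and the vocabulary of earlier text are assumed inputs.

```python
from collections import Counter

def new_words(period_tokens, seen_before, min_count=5):
    """Words occurring at least min_count times in this half-month
    period that never appeared in any earlier text."""
    counts = Counter(period_tokens)
    return {w for w, c in counts.items()
            if c >= min_count and w not in seen_before}
```

On the English side, the same filter applies with min_count=10 plus the dictionary and stop-word exclusions.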

21
Experiment - Preprocessing
  • English corpus
  • Perform sentence segmentation
  • Convert each word to its morphological root
  • Divide English corpus into 12 periods, each
    containing text from a half-month period
  • For each period, select English candidate words
    that occur at least 10 times, are not present in
    the 10,000-word dictionary, and are not stop
    words

22
Experiment - Translation
  • Context of a Chinese word: a window of 50
    characters
  • Context of an English word: a window of 100
    words
  • Perform translation by combining context and
    transliteration

23
Experiment - Evaluation
[Results table] M = 10
24
Experiment - Evaluation
  • Recall
  • We manually found the English translations for
    the Chinese words in two periods (Dec 1-15 and
    Dec 16-31). We managed to find 43 Chinese words
    with English translations. The number of English
    translations present in the corpus, and the
    recall (for M = 10), were estimated from these
    43 words
25
Experiment - Effect of varying M
26
Experiment - Analysis
  • We examined the 43 Chinese words with English
    translations in Dec 1995
  • English translations in multi-word phrases: 8
  • English translations occurring fewer than 10
    times: 4
  • English translations in the dictionary: 6

27
Experiment - Analysis
  • Combining both context and transliteration
    performs better than either info source alone
  • Number of correct English translations at the
    rank-one position:
  • Context: 10
  • Transliteration: 9
  • Combined method (M = 10): 15
  • Combined method (M = 8): 19

28
Related Work
  • (Fung and Yee, 1998; Rapp, 1995; Rapp, 1999; Cao
    and Li, 2002)
  • (Knight and Graehl, 1998; Al-Onaizan and Knight,
    2002a, 2002b)
  • (Huang, Vogel, and Waibel, HLT-NAACL 2004)
  • Different contextual semantic similarity model
  • Our method does not require POS tag info
  • Our method uses average rank for combination
    instead of weighting

29
Conclusion
  • A new approach to mine new word translations by
    combining both context and transliteration
    information
  • Using both sources of information outperforms
    using either information alone