Title: Mining New Word Translations from Comparable Corpora
1. Mining New Word Translations from Comparable Corpora
- Li Shao and Hwee Tou Ng
- Department of Computer Science
- National University of Singapore
2. Motivation
- New words appear every day
- These words include names, technical terms, etc.
- A machine translation system must be able to
learn the translation of new words
3. Objective
- A separate lexicon learning system to learn translations of new names, technical terms, etc.
- Given comparable corpora, mine translations of new words from the corpora
4. Why Comparable Corpora?
- More readily available than parallel corpora
- News articles are up-to-date and contain the
latest new words
5. Translation by Context
- (Fung and Yee, 1998; Rapp, 1995; Rapp, 1999)
- If two words are translations of each other, then their contexts are similar
- Compute the similarity of the contexts of two words with the vector space model
- (Cao and Li, 2002) used web data and the EM algorithm to identify the translations
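The context-similarity idea can be sketched as follows. This is a minimal illustration, not the authors' implementation; the helper names `context_vector` and `cosine`, and the window size, are hypothetical choices:

```python
from collections import Counter
from math import sqrt

def context_vector(corpus_sentences, word, window=5):
    """Count words co-occurring within `window` positions of each occurrence of `word`."""
    vec = Counter()
    for sent in corpus_sentences:
        for i, w in enumerate(sent):
            if w == word:
                for c in sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]:
                    vec[c] += 1
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Words that appear in the same kinds of sentences get similar vectors, and hence high cosine similarity, even if the words themselves never co-occur.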
6. Language Modeling
- (Ponte and Croft, 1998; Ng, 2000)
- Build a language model for each document
- Compute the probability of generating a query
using the language model and sort the documents
accordingly
7. Translation by Transliteration
- (Knight and Graehl, 1998) presented a method for machine transliteration based on word pronunciation
- (Al-Onaizan and Knight, 2002a, 2002b) presented improvements using monolingual and bilingual resources and using word spelling
8. Our Approach
- A new approach to mine new word translations by
combining both context and transliteration
information
9. Our Approach
- Translate by context -> sorted list of candidates
- Translate by transliteration -> sorted list of candidates
- Combine both lists -> final sorted list of candidates
10. Translation by Context
- C(c), the context of a Chinese word c, is viewed as a query
- C(e), the context of an English word e, is viewed as a document
- Let e be the correct translation. C(e), the context of e, is the most similar to C(c)
- C(e) is the most relevant document to the query C(c)
11. Translation by Context
The probability of generating query Q according to the language model derived from document D is estimated as

P(Q|D) = \prod_t P(t|D)^{c(t,Q)}

where t is a term in the corpus, c(t,Q) is the number of times term t occurs in Q, and n = \sum_t c(t,Q) is the total number of terms in Q.
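A minimal sketch of this query-likelihood score, computed in log space. Add-one smoothing is my assumption here, purely to avoid zero probabilities; the actual system uses backoff and linear interpolation:

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, vocab_size):
    """log P(Q|D) = sum_t c(t,Q) * log P(t|D).
    P(t|D) uses add-one smoothing (an illustrative assumption)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    q_counts = Counter(query_terms)
    return sum(c * math.log((doc_counts[t] + 1) / (doc_len + vocab_size))
               for t, c in q_counts.items())
```

Documents are then sorted by this score, highest first, to rank them against the query.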
12. Translation by Context
We then estimate P(C(c)|C(e)) for each e as

P(C(c)|C(e)) = \prod_t P(t|T(C(e)))^{c(t,C(c))}

where t is a Chinese word, c(t,C(c)) is the number of occurrences of t in C(c), and T(C(e)) is the bag of Chinese words obtained by translating the English words in C(e), using a bilingual dictionary.
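The cross-language scoring step can be sketched as below: translate the English context into a bag of Chinese words via the dictionary, then score the Chinese context against that bag. The dictionary entries and the add-one smoothing are illustrative assumptions, not the paper's exact estimator:

```python
import math
from collections import Counter

def translated_context_score(chinese_context, english_context, bilingual_dict,
                             vocab_size=10000):
    """log P(C(c)|C(e)): build T(C(e)) by dictionary lookup, then apply a
    query-likelihood model with add-one smoothing (an assumption)."""
    bag = Counter()                      # T(C(e)): bag of translated Chinese words
    for e in english_context:
        for c in bilingual_dict.get(e, []):
            bag[c] += 1
    n = sum(bag.values())
    q = Counter(chinese_context)
    return sum(cnt * math.log((bag[t] + 1) / (n + vocab_size))
               for t, cnt in q.items())
```

The English word e whose translated context best "generates" C(c) ranks highest.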
13. Translation by Context
We use backoff and linear interpolation for
probability estimation
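The slide does not give the exact smoothing formula; a common form of linear interpolation with a background corpus model, shown purely as an assumption, looks like this:

```python
def interpolated_prob(t, doc_counts, doc_len, corpus_counts, corpus_len, lam=0.5):
    """P(t|D) = lam * P_ml(t|D) + (1 - lam) * P_ml(t|corpus).
    lam is a hypothetical mixing weight; the paper tunes its own scheme."""
    p_doc = doc_counts.get(t, 0) / doc_len if doc_len else 0.0
    p_bg = corpus_counts.get(t, 0) / corpus_len if corpus_len else 0.0
    return lam * p_doc + (1 - lam) * p_bg
```

The point of smoothing is that a term absent from the document still gets nonzero probability from the corpus-wide background estimate.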
14. Translation by Transliteration
- Use the pronunciation of a Chinese word in pinyin
- Assume the character-to-pinyin mapping is one-to-one
- A pinyin syllable is mapped to an English letter sequence
- Restrict each pinyin syllable to map to only 0, 1, or 2 English letters; no cross mapping is allowed
15. Translation by Transliteration

P(E|P) = \sum_a \prod_i P(a_i | p_i)

where p_i is the ith syllable of pinyin P, and a_i is the English letter sequence that the ith pinyin syllable maps to in the particular alignment a.
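The sum over alignments can be computed efficiently by dynamic programming, since each syllable emits 0, 1, or 2 letters in order with no crossing. This sketch assumes a `seg_prob` lookup of learned mapping probabilities; the function name is hypothetical:

```python
from functools import lru_cache

def transliteration_prob(syllables, letters, seg_prob):
    """P(letters | syllables) = sum over alignments of prod_i P(segment_i | syllable_i),
    where each syllable maps to 0, 1, or 2 letters, in order (no crossing).
    seg_prob(syl, seg) returns P(seg | syl) -- a learned table in the real system."""
    n, m = len(syllables), len(letters)

    @lru_cache(maxsize=None)
    def f(i, j):
        """Probability of generating letters[j:] from syllables[i:]."""
        if i == n:
            return 1.0 if j == m else 0.0
        total = 0.0
        for k in range(3):  # syllable i emits 0, 1, or 2 letters
            if j + k <= m:
                total += seg_prob(syllables[i], letters[j:j + k]) * f(i + 1, j + k)
        return total

    return f(0, 0)
```

The memoized recursion visits each (syllable, position) pair once, so the cost is linear in the product of the two lengths rather than exponential in the number of alignments.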
16. Translation by Transliteration
- The probability of mapping between a pinyin syllable and an English letter sequence is learned by the EM algorithm
- Requires a list of Chinese-English name pairs as training data for EM
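A toy version of the EM training loop, with uniform initialization as a simplifying assumption (the real system trains on 1,580 name pairs and may initialize differently):

```python
from collections import defaultdict

def alignments(syls, letters):
    """All ways to split `letters` into len(syls) in-order segments of 0-2 letters each."""
    if not syls:
        if not letters:
            yield []
        return
    for k in range(3):
        if k <= len(letters):
            for rest in alignments(syls[1:], letters[k:]):
                yield [(syls[0], letters[:k])] + rest

def em_train(pairs, iterations=5):
    """Learn P(letter segment | pinyin syllable) from (syllable list, letter string) pairs."""
    prob = defaultdict(lambda: 1.0)  # uniform start (a simplifying assumption)
    for _ in range(iterations):
        counts = defaultdict(float)
        totals = defaultdict(float)
        for syls, letters in pairs:
            aligns = list(alignments(syls, letters))
            weights = []
            for a in aligns:
                w = 1.0
                for s, seg in a:
                    w *= prob[(s, seg)]
                weights.append(w)
            z = sum(weights)
            if z == 0.0:
                continue
            # E-step: fractional counts weighted by each alignment's posterior
            for a, w in zip(aligns, weights):
                for s, seg in a:
                    counts[(s, seg)] += w / z
                    totals[s] += w / z
        # M-step: renormalize counts per syllable
        prob = defaultdict(float,
                           {(s, seg): c / totals[s] for (s, seg), c in counts.items()})
    return prob
```

Unambiguous name pairs anchor the mapping probabilities, which then disambiguate the alignments of the ambiguous pairs over successive iterations.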
17. Combining Context and Transliteration
- Two sorted lists for each Chinese word (one list from each individual method)
- Sort the English words appearing in both lists within the top M positions, according to their average rank
- Output the English word with the highest average rank as the translation
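The average-rank combination can be sketched as follows (the function name is hypothetical):

```python
def combine_rankings(list_a, list_b, M=10):
    """Keep candidates in the top M of both lists; sort by average 1-based rank,
    best (smallest) average first."""
    rank_a = {w: i + 1 for i, w in enumerate(list_a[:M])}
    rank_b = {w: i + 1 for i, w in enumerate(list_b[:M])}
    common = set(rank_a) & set(rank_b)
    return sorted(common, key=lambda w: (rank_a[w] + rank_b[w]) / 2)
```

Requiring a candidate to appear in both top-M lists filters out words that only one information source happens to favor.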
18. Experiment - Resources
- Chinese corpus: LDC Chinese Gigaword Corpus, Jan-Dec 1995
  - Jul-Dec 1995 (120 MB) was used to come up with candidate Chinese words
  - Jan-Jun 1995 was used to determine if a candidate Chinese word was new
- English corpus: LDC English Gigaword Corpus, Jul-Dec 1995 (730 MB)
19. Experiment - Resources
- A Chinese-English dictionary of about 10,000 entries, for translating words in the context
- A list of 1,580 Chinese-English name pairs, as training data for EM
20. Experiment - Preprocessing
- Chinese corpus
  - Perform Chinese word segmentation (Ng and Low, EMNLP 2004)
  - Divide the Chinese corpus Jul-Dec 1995 into 12 periods, each containing text from a half-month period
  - Extract new words occurring at least 5 times in each period. New words are those appearing in this period but not before
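The new-word extraction step can be sketched as (the function name and threshold parameter are illustrative):

```python
from collections import Counter

def new_words(period_tokens, history_vocab, min_freq=5):
    """Words occurring at least min_freq times in this period
    and never seen in any earlier period."""
    counts = Counter(period_tokens)
    return {w for w, c in counts.items() if c >= min_freq and w not in history_vocab}
```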
21. Experiment - Preprocessing
- English corpus
  - Perform sentence segmentation
  - Convert each word to its morphological root
  - Divide the English corpus into 12 periods, each containing text from a half-month period
  - For each period, select English candidate words that occur at least 10 times, are not present in the 10,000-word dictionary, and are not stop words
22. Experiment - Translation
- Context of a Chinese word: window of 50 characters
- Context of an English word: window of 100 words
- Perform translation by combining context and transliteration
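Context extraction on the Chinese side (a character window, per the slide) can be sketched as follows; the helper name is hypothetical, and the English side would use a word window instead:

```python
def char_context(text, word, window=50):
    """Characters within `window` positions on each side of every occurrence of `word`."""
    contexts = []
    start = text.find(word)
    while start != -1:
        left = text[max(0, start - window):start]
        right = text[start + len(word):start + len(word) + window]
        contexts.append(left + right)
        start = text.find(word, start + len(word))
    return contexts
```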
23. Experiment - Evaluation
- M = 10
24. Experiment - Evaluation
- Recall
- We manually found the English translations for the Chinese words in two periods (Dec 01-15 and Dec 16-31). We managed to find 43 Chinese words with English translations.
- The number of English translations present is estimated from these two periods, and recall (for M = 10) is estimated accordingly.
25. Experiment - Effect of varying M
26. Experiment - Analysis
- We examined the 43 Chinese words with English translations in Dec 1995
- English translations that are multi-word phrases: 8
- English translations occurring less than 10 times: 4
- English translations already in the dictionary: 6
27. Experiment - Analysis
- Combining both context and transliteration performs better than either information source alone
- Number of correct English translations at the rank-one position using:
  - Context: 10
  - Transliteration: 9
  - Combined method (M = 10): 15
  - Combined method (M = 8): 19
28. Related Work
- (Fung and Yee, 1998; Rapp, 1995; Rapp, 1999; Cao and Li, 2002)
- (Knight and Graehl, 1998; Al-Onaizan and Knight, 2002a, 2002b)
- (Huang, Vogel, and Waibel, HLT-NAACL 2004)
  - Different contextual semantic similarity model
  - Our method does not require POS tag information
  - Our method uses average rank for combination instead of weighting
29. Conclusion
- A new approach to mine new word translations by combining both context and transliteration information
- Using both sources of information outperforms using either source alone