Title: Mining New Word Translations from Comparable Corpora
1. Mining New Word Translations from Comparable Corpora
- Li Shao and Hwee Tou Ng
- Department of Computer Science
- National University of Singapore
2. Motivation
- New words appear every day
- These words include names, technical terms, etc.
- A machine translation system must be able to
learn the translation of new words
3. Objective
- A separate lexicon learning system to learn translations of new names, technical terms, etc.
- Given comparable corpora, mine translations of new words from the corpora
4. Why Comparable Corpora?
- More readily available than parallel corpora
- News articles are up-to-date and contain the
latest new words
5. Translation by Context
- (Fung and Yee, 1998; Rapp, 1995; Rapp, 1999)
- If two words are translations of each other, then their contexts are similar
- Compute the similarity of the contexts of two words with the vector space model
- (Cao and Li, 2002) used web data and the EM algorithm to identify the translations
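The context-similarity idea can be sketched as follows. This is a minimal illustration, not the authors' implementation; the helper names `context_vector` and `cosine`, and the window size, are hypothetical choices:

```python
from collections import Counter
from math import sqrt

def context_vector(corpus_sentences, word, window=5):
    """Count words co-occurring within `window` positions of each occurrence of `word`."""
    vec = Counter()
    for sent in corpus_sentences:
        for i, w in enumerate(sent):
            if w == word:
                for c in sent[max(0, i - window):i] + sent[i + 1:i + 1 + window]:
                    vec[c] += 1
    return vec

def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Words that appear in the same kinds of sentences get similar vectors, and hence high cosine similarity, even if the words themselves never co-occur.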
6. Language Modeling
- (Ponte and Croft, 1998; Ng, 2000)
- Build a language model for each document
- Compute the probability of generating a query
using the language model and sort the documents
accordingly
7. Translation by Transliteration
- (Knight and Graehl, 1998) presented a method for machine transliteration based on word pronunciation
- (Al-Onaizan and Knight, 2002a, 2002b) presented improvements using monolingual and bilingual resources and using word spelling
8. Our Approach
- A new approach to mine new word translations by
combining both context and transliteration
information
9. Our Approach
- Translate by context -> sorted list of candidates
- Translate by transliteration -> sorted list of candidates
- Combine both lists -> final sorted list of candidates
10. Translation by Context
- C(c), the context of a Chinese word c, is viewed as a query
- C(e), the context of an English word e, is viewed as a document
- Let e be the correct translation. C(e), the context of e, is the most similar to C(c)
- C(e) is the most relevant document to the query C(c)
11. Translation by Context
The probability of generating query Q according to the language model derived from document D is estimated as

P(Q|D) = \prod_t P(t|D)^{c(t,Q)}

where t is a term in the corpus, c(t,Q) is the number of times term t occurs in Q, and n = \sum_t c(t,Q) is the total number of terms in Q.
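A minimal sketch of this query-likelihood score, computed in log space. Add-one smoothing is my assumption here, purely to avoid zero probabilities; the actual system uses backoff and linear interpolation:

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms, vocab_size):
    """log P(Q|D) = sum_t c(t,Q) * log P(t|D).
    P(t|D) uses add-one smoothing (an illustrative assumption)."""
    doc_counts = Counter(doc_terms)
    doc_len = len(doc_terms)
    q_counts = Counter(query_terms)
    return sum(c * math.log((doc_counts[t] + 1) / (doc_len + vocab_size))
               for t, c in q_counts.items())
```

Documents are then sorted by this score, highest first, to rank them against the query.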
12. Translation by Context
We then estimate P(C(c)|C(e)) for each e as

P(C(c)|C(e)) = \prod_t P(t|T(C(e)))^{c(t,C(c))}

where t is a Chinese word, c(t,C(c)) is the number of occurrences of t in C(c), and T(C(e)) is the bag of Chinese words obtained by translating the English words in C(e), using a bilingual dictionary.
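The cross-language scoring step can be sketched as below: translate the English context into a bag of Chinese words via the dictionary, then score the Chinese context against that bag. The dictionary entries and the add-one smoothing are illustrative assumptions, not the paper's exact estimator:

```python
import math
from collections import Counter

def translated_context_score(chinese_context, english_context, bilingual_dict,
                             vocab_size=10000):
    """log P(C(c)|C(e)): build T(C(e)) by dictionary lookup, then apply a
    query-likelihood model with add-one smoothing (an assumption)."""
    bag = Counter()                      # T(C(e)): bag of translated Chinese words
    for e in english_context:
        for c in bilingual_dict.get(e, []):
            bag[c] += 1
    n = sum(bag.values())
    q = Counter(chinese_context)
    return sum(cnt * math.log((bag[t] + 1) / (n + vocab_size))
               for t, cnt in q.items())
```

The English word e whose translated context best "generates" C(c) ranks highest.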
13. Translation by Context
We use backoff and linear interpolation for
probability estimation
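The slide does not give the exact smoothing formula; a common form of linear interpolation with a background corpus model, shown purely as an assumption, looks like this:

```python
def interpolated_prob(t, doc_counts, doc_len, corpus_counts, corpus_len, lam=0.5):
    """P(t|D) = lam * P_ml(t|D) + (1 - lam) * P_ml(t|corpus).
    lam is a hypothetical mixing weight; the paper tunes its own scheme."""
    p_doc = doc_counts.get(t, 0) / doc_len if doc_len else 0.0
    p_bg = corpus_counts.get(t, 0) / corpus_len if corpus_len else 0.0
    return lam * p_doc + (1 - lam) * p_bg
```

The point of smoothing is that a term absent from the document still gets nonzero probability from the corpus-wide background estimate.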
14. Translation by Transliteration
- Use the pronunciation of a Chinese word in pinyin
- Assume the character-to-pinyin mapping is one-to-one
- A pinyin syllable is mapped to an English letter sequence
- Restrict each pinyin syllable to map to only 0, 1, or 2 English letters; no cross mapping is allowed
15. Translation by Transliteration

P(E|P) = \sum_a \prod_i P(a_i | p_i)

where p_i is the ith syllable of pinyin P, and a_i is the English letter sequence that the ith pinyin syllable maps to in the particular alignment a.
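The sum over alignments can be computed efficiently by dynamic programming, since each syllable emits 0, 1, or 2 letters in order with no crossing. This sketch assumes a `seg_prob` lookup of learned mapping probabilities; the function name is hypothetical:

```python
from functools import lru_cache

def transliteration_prob(syllables, letters, seg_prob):
    """P(letters | syllables) = sum over alignments of prod_i P(segment_i | syllable_i),
    where each syllable maps to 0, 1, or 2 letters, in order (no crossing).
    seg_prob(syl, seg) returns P(seg | syl) -- a learned table in the real system."""
    n, m = len(syllables), len(letters)

    @lru_cache(maxsize=None)
    def f(i, j):
        """Probability of generating letters[j:] from syllables[i:]."""
        if i == n:
            return 1.0 if j == m else 0.0
        total = 0.0
        for k in range(3):  # syllable i emits 0, 1, or 2 letters
            if j + k <= m:
                total += seg_prob(syllables[i], letters[j:j + k]) * f(i + 1, j + k)
        return total

    return f(0, 0)
```

The memoized recursion visits each (syllable, position) pair once, so the cost is linear in the product of the two lengths rather than exponential in the number of alignments.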
16. Translation by Transliteration
- The probability of mapping between a pinyin syllable and an English letter sequence is learned by the EM algorithm
- Requires a list of Chinese-English name pairs as training data for EM
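A toy version of the EM training loop, with uniform initialization as a simplifying assumption (the real system trains on 1,580 name pairs and may initialize differently):

```python
from collections import defaultdict

def alignments(syls, letters):
    """All ways to split `letters` into len(syls) in-order segments of 0-2 letters each."""
    if not syls:
        if not letters:
            yield []
        return
    for k in range(3):
        if k <= len(letters):
            for rest in alignments(syls[1:], letters[k:]):
                yield [(syls[0], letters[:k])] + rest

def em_train(pairs, iterations=5):
    """Learn P(letter segment | pinyin syllable) from (syllable list, letter string) pairs."""
    prob = defaultdict(lambda: 1.0)  # uniform start (a simplifying assumption)
    for _ in range(iterations):
        counts = defaultdict(float)
        totals = defaultdict(float)
        for syls, letters in pairs:
            aligns = list(alignments(syls, letters))
            weights = []
            for a in aligns:
                w = 1.0
                for s, seg in a:
                    w *= prob[(s, seg)]
                weights.append(w)
            z = sum(weights)
            if z == 0.0:
                continue
            # E-step: fractional counts weighted by each alignment's posterior
            for a, w in zip(aligns, weights):
                for s, seg in a:
                    counts[(s, seg)] += w / z
                    totals[s] += w / z
        # M-step: renormalize counts per syllable
        prob = defaultdict(float,
                           {(s, seg): c / totals[s] for (s, seg), c in counts.items()})
    return prob
```

Unambiguous name pairs anchor the mapping probabilities, which then disambiguate the alignments of the ambiguous pairs over successive iterations.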
17. Combining Context and Transliteration
- Two sorted lists for each Chinese word (one list from each individual method)
- Sort the English words appearing in both lists within the top M positions, according to their average rank
- Output the English word with the highest average rank as the translation
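The average-rank combination can be sketched as follows (the function name is hypothetical):

```python
def combine_rankings(list_a, list_b, M=10):
    """Keep candidates in the top M of both lists; sort by average 1-based rank,
    best (smallest) average first."""
    rank_a = {w: i + 1 for i, w in enumerate(list_a[:M])}
    rank_b = {w: i + 1 for i, w in enumerate(list_b[:M])}
    common = set(rank_a) & set(rank_b)
    return sorted(common, key=lambda w: (rank_a[w] + rank_b[w]) / 2)
```

Requiring a candidate to appear in both top-M lists filters out words that only one information source happens to favor.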
18. Experiment - Resources
- Chinese corpus: LDC Chinese Gigaword Corpus, Jan-Dec 1995
  - Jul-Dec 1995 (120 MB) was used to come up with candidate Chinese words
  - Jan-Jun 1995 was used to determine if a candidate Chinese word was new
- English corpus: LDC English Gigaword Corpus, Jul-Dec 1995 (730 MB)
19. Experiment - Resources
- A Chinese-English dictionary of about 10,000 entries, for translating words in the context
- A list of 1,580 Chinese-English name pairs, as training data for EM
20. Experiment - Preprocessing
- Chinese corpus
  - Perform Chinese word segmentation (Ng and Low, EMNLP 2004)
  - Divide the Chinese corpus Jul-Dec 1995 into 12 periods, each containing text from a half-month period
  - Extract new words occurring at least 5 times in each period. New words are those appearing in this period but not before
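The new-word extraction step can be sketched as (the function name and threshold parameter are illustrative):

```python
from collections import Counter

def new_words(period_tokens, history_vocab, min_freq=5):
    """Words occurring at least min_freq times in this period
    and never seen in any earlier period."""
    counts = Counter(period_tokens)
    return {w for w, c in counts.items() if c >= min_freq and w not in history_vocab}
```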
21. Experiment - Preprocessing
- English corpus
  - Perform sentence segmentation
  - Convert each word to its morphological root
  - Divide the English corpus into 12 periods, each containing text from a half-month period
  - For each period, select English candidate words that occur at least 10 times, are not present in the 10,000-word dictionary, and are not stop words
22. Experiment - Translation
- Context of a Chinese word: window of 50 characters
- Context of an English word: window of 100 words
- Perform translation by combining context and transliteration
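Context extraction on the Chinese side (a character window, per the slide) can be sketched as follows; the helper name is hypothetical, and the English side would use a word window instead:

```python
def char_context(text, word, window=50):
    """Characters within `window` positions on each side of every occurrence of `word`."""
    contexts = []
    start = text.find(word)
    while start != -1:
        left = text[max(0, start - window):start]
        right = text[start + len(word):start + len(word) + window]
        contexts.append(left + right)
        start = text.find(word, start + len(word))
    return contexts
```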
23. Experiment - Evaluation
- M = 10
24. Experiment - Evaluation
- Recall
- We manually found the English translations for the Chinese words in two periods (Dec 01-15 and Dec 16-31). We managed to find 43 Chinese words with English translations.
- The number of English translations present is estimated from these two periods, and recall (for M = 10) is estimated accordingly.
25. Experiment - Effect of varying M
26. Experiment - Analysis
- We examined the 43 Chinese words with English translations in Dec 1995
- English translations that are multi-word phrases: 8
- English translations occurring less than 10 times: 4
- English translations already in the dictionary: 6
27. Experiment - Analysis
- Combining both context and transliteration performs better than either information source alone
- Number of correct English translations at the rank-one position using:
  - Context: 10
  - Transliteration: 9
  - Combined method (M = 10): 15
  - Combined method (M = 8): 19
28. Related Work
- (Fung and Yee, 1998; Rapp, 1995; Rapp, 1999; Cao and Li, 2002)
- (Knight and Graehl, 1998; Al-Onaizan and Knight, 2002a, 2002b)
- (Huang, Vogel, and Waibel, HLT-NAACL 2004)
  - Different contextual semantic similarity model
  - Our method does not require POS tag information
  - Our method uses average rank for combination instead of weighting
29. Conclusion
- A new approach to mine new word translations by combining both context and transliteration information
- Using both sources of information outperforms using either source alone