Title: Combining Query Translation and Document Translation in Cross-Language Retrieval
1 Combining Query Translation and Document Translation in Cross-Language Retrieval
Aitao Chen, School of Information Management and Systems
Fredric C. Gey, UC Data Archive & Technical Assistance
University of California at Berkeley
CLEF 2003 Workshop, 21-22 August 2003, Trondheim, Norway
2 Talk Outline
- Development of new resources
- Fast approximate document translation
- Combining query translation and document translation
- Conclusions
3 New Resources
- Finnish and Swedish stoplists
- Base Finnish and Swedish lexicons for decompounding
- Statistical translation lexicons derived from parallel texts
- Finnish and Swedish statistical stemmers automatically generated from parallel texts
- English spelling normalizer
4 Development of Swedish Stoplist (by someone who doesn't know Swedish)
Look for Swedish words whose English translations are English stopwords in Swedish textbooks (e.g., a grammar) written in English.
- en park (a park)
- ett piano (a piano)
- Jag vet inte mycket om honom (I don't know much about him)
- efter skolan (after school)
- Hans och Greta (Hans and Greta)
(Source: Swedish: A Comprehensive Grammar by P. Holmes & I. Hinchliffe)
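The selection rule can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the word-level glosses are typed in by hand from the textbook examples, and the small English stopword set is a stand-in for whatever list was actually used.

```python
# Hypothetical, hand-entered word glosses read off the textbook examples.
GLOSSES = [
    ("en", "a"), ("park", "park"),
    ("ett", "a"), ("piano", "piano"),
    ("efter", "after"), ("skolan", "school"),
    ("och", "and"),
    ("mycket", "much"), ("om", "about"), ("honom", "him"),
]

# Assumed English stopword list (a tiny stand-in, not the authors' list).
ENGLISH_STOPWORDS = {"a", "an", "the", "and", "after", "about", "him", "much"}


def swedish_stopword_candidates(glosses, english_stopwords):
    """Collect Swedish words whose English gloss is an English stopword."""
    return sorted({sv.lower() for sv, en in glosses if en.lower() in english_stopwords})


candidates = swedish_stopword_candidates(GLOSSES, ENGLISH_STOPWORDS)
print(candidates)
```

Content words such as park and piano are filtered out because their glosses are not English stopwords; the surviving candidates would still need a manual check.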
5 Development of Swedish Base Lexicon
A base lexicon should contain all and only the words and their variants that are not compounds.
- Compile a list of Swedish words (e.g., from the Swedish document collection).
- Remove the words that are 4 or fewer characters long.
- Remove the long words that can be decomposed into short words in the initial wordlist.
Example wordlist: animation, animationen, dator, datoranimation, datorgrafik, datorteknologi, datorvirus, grafik, teknologi, virus
Remove the compounds that are decomposed:
datoranimation → dator + animation
datorgrafik → dator + grafik
datorteknologi → dator + teknologi
datorvirus → dator + virus
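The three steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes compounds are plain concatenations of words in the filtered list (no linking elements), and the function names are made up for the example.

```python
def decompose(word, vocab):
    """Return two or more vocab words that concatenate to `word`, or None."""
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head not in vocab:
            continue
        if tail in vocab:
            return [head, tail]
        rest = decompose(tail, vocab)
        if rest:
            return [head] + rest
    return None


def build_base_lexicon(words, min_len=5):
    """Drop words shorter than min_len, then drop every decomposable word."""
    vocab = {w.lower() for w in words if len(w) >= min_len}
    return sorted(w for w in vocab if decompose(w, vocab) is None)


wordlist = [
    "animation", "animationen", "dator", "datoranimation", "datorgrafik",
    "datorteknologi", "datorvirus", "grafik", "teknologi", "virus",
]
base_lexicon = build_base_lexicon(wordlist)
print(base_lexicon)
```

On the example wordlist the variant animationen is kept, because its remainder "en" is not a word in the list, while the four dator compounds are removed.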
6 Development of Statistical Translation Lexicons from Parallel Texts
parallel texts (EU Official Journal)
→ PDF-to-text conversion
→ paragraph and sentence alignment
→ statistical association, statistical MT toolkit
→ statistical translation lexicons
Language pairs:
- Italian → Spanish
- German → Italian
- Finnish → German
- English → Dutch
- English → Finnish
- English → Swedish
- Dutch → English
- Finnish → English
- Swedish → English
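The slide names "statistical association" without saying which measure was used. As one possible illustration, the sketch below scores word pairs from sentence-aligned text with the Dice coefficient; the choice of Dice, the threshold, and the toy sentences are all assumptions made for this example.

```python
from collections import Counter
from itertools import product


def dice_lexicon(aligned_pairs, min_score=0.5):
    """Map each source word to its best target word by Dice co-occurrence."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        src_words = set(src_sentence.lower().split())
        tgt_words = set(tgt_sentence.lower().split())
        src_freq.update(src_words)
        tgt_freq.update(tgt_words)
        pair_freq.update(product(src_words, tgt_words))

    lexicon = {}
    for (src, tgt), both in pair_freq.items():
        score = 2 * both / (src_freq[src] + tgt_freq[tgt])
        if score >= min_score and score > lexicon.get(src, ("", 0.0))[1]:
            lexicon[src] = (tgt, score)
    return lexicon


# Toy Swedish-English sentence pairs (invented for the example).
pairs = [
    ("datorn är ny", "the computer is new"),
    ("diamanten är ny", "the diamond is new"),
    ("datorn är gammal", "the computer is old"),
]
lexicon = dice_lexicon(pairs)
print(lexicon["datorn"], lexicon["diamanten"])
```

With so little data, function words such as "är" tie between several candidates; a real lexicon needs a large parallel corpus such as the one named on the slide.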
7 Development of Statistical Stemmers
Swedish words: dator datorn datorer datorersom datornät datornernä diamanten diamanterna diamanter diamant informatik
Statistical English translations:
computer, computers → computer
diamond, diamonds → diamond
computer cluster: dator datorn datorer datorersom datornät datornernä informatik → dator
diamond cluster: diamanten diamanterna diamanter diamant → diamant
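The clustering idea can be sketched as below. Several parts are assumptions for illustration only: the table of top English translations is invented, the English normalizer is a crude plural stripper, and picking the shortest cluster member as the stem is just one rule that happens to agree with the slide's example.

```python
from collections import defaultdict

# Hypothetical top English translation per Swedish word, as a statistical
# translation lexicon might supply it.
TOP_TRANSLATION = {
    "dator": "computer", "datorn": "computer", "datorer": "computers",
    "informatik": "computers",
    "diamant": "diamond", "diamanten": "diamond",
    "diamanter": "diamonds", "diamanterna": "diamonds",
}


def normalize_english(word):
    """Crude stand-in for an English stemmer: strip a plural 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_stemmer(top_translation):
    """Cluster Swedish words by normalized English translation."""
    clusters = defaultdict(list)
    for swedish, english in top_translation.items():
        clusters[normalize_english(english)].append(swedish)

    stem_of = {}
    for members in clusters.values():
        # Assumed rule: the shortest member represents the cluster.
        representative = min(members, key=lambda w: (len(w), w))
        for word in members:
            stem_of[word] = representative
    return stem_of


stem_of = build_stemmer(TOP_TRANSLATION)
print(stem_of["datorer"], stem_of["informatik"], stem_of["diamanterna"])
```

Because clusters are formed by translation rather than by spelling, a word like informatik can share a stem with dator, as on the slide.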
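8 Fast Approximate Document Translation
1. Spanish documents → list of Spanish words
2. List of Spanish words → Spanish-English MT → list of English words
3. List of Spanish words + list of English words → bilingual Spanish-English wordlist
4. Spanish documents + bilingual wordlist → word-by-word English translations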
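A minimal sketch of these steps follows. The MT system is replaced by a hypothetical `toy_translate_word` lookup, and the tokenizer and function names are assumptions for the example; the point is that MT is called once per distinct word rather than once per document.

```python
import re


def build_bilingual_wordlist(documents, translate_word):
    """Steps 1-3: collect the vocabulary and translate each word once."""
    vocabulary = {w for doc in documents for w in re.findall(r"\w+", doc.lower())}
    return {word: translate_word(word) for word in sorted(vocabulary)}


def translate_document(document, wordlist):
    """Step 4: replace each word by its wordlist entry, keeping unknown words."""
    words = re.findall(r"\w+", document.lower())
    return " ".join(wordlist.get(word, word) for word in words)


# Hypothetical stand-in for the Spanish-English MT system.
TOY_MT = {"las": "the", "dietas": "diets", "para": "for", "celíacos": "celiacs"}


def toy_translate_word(word):
    return TOY_MT.get(word, word)


docs = ["Las dietas para celíacos"]
wordlist = build_bilingual_wordlist(docs, toy_translate_word)
translated = translate_document(docs[0], wordlist)
print(translated)
```

Word order and context are ignored, which is why the translation is only approximate, but a bag-of-words retrieval system does not depend on either.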
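9 Query Translation-based Multilingual Retrieval
Query translation: L&H MT (English → French, German, Spanish)
Query        Documents      Result
English → IR → English  → English docs
French  → IR → French   → French docs
German  → IR → German   → German docs
Spanish → IR → Spanish  → Spanish docs
English docs + French docs + German docs + Spanish docs → merger → combined ranked list of documents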
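The merger step can be sketched as below, assuming it simply pools the per-language result lists and sorts by the raw retrieval score (the evaluation slide labels the merging method "raw score"). The document IDs and scores are invented for the example.

```python
def merge_by_raw_score(results_by_language, k=1000):
    """Pool per-language hits and rank them by raw retrieval score."""
    merged = [
        (doc_id, score, language)
        for language, hits in results_by_language.items()
        for doc_id, score in hits
    ]
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:k]


# Hypothetical per-language ranked lists of (document ID, score).
results = {
    "English": [("EN-001", 12.4), ("EN-002", 7.9)],
    "French": [("FR-001", 10.1)],
    "German": [("DE-001", 11.3), ("DE-002", 3.2)],
    "Spanish": [("ES-001", 9.5)],
}
combined = merge_by_raw_score(results, k=4)
print([doc_id for doc_id, _, _ in combined])
```

Raw-score merging assumes that scores from the separate per-language runs are comparable, which is not guaranteed when queries differ in translation quality.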
10 Document Translation-based Multilingual Retrieval
Documents:
English documents
French documents → English
German documents → English
Spanish documents → English
English query → IR → unified ranked list of documents
11 Evaluation of Multilingual Retrieval
Multilingual-4, English, TD
Run ID      Trans. method             Merging method   Average precision
bkmul4en1   query-trans               raw score        0.3783
bkmul4en2   doc-trans                 none             0.4082
bkmul4en3   query-trans + doc-trans   raw score        0.4260
Multilingual-8, English, TD
Run ID      Trans. method             Merging method   Average precision
bkmul8en1   query-trans               raw score        0.3317
bkmul8en2   doc-trans                 none             0.3401
bkmul8en3   query-trans + doc-trans   raw score        0.3733
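The slides do not say how the query-translation and document-translation runs were combined beyond the "raw score" label. Purely as an illustration, the sketch below assumes a common fusion rule: sum the raw scores a document receives in each run, then re-rank. The function name and the scores are hypothetical.

```python
from collections import defaultdict


def combine_runs(*runs):
    """Sum each document's raw scores across runs and re-rank (assumed rule)."""
    totals = defaultdict(float)
    for run in runs:
        for doc_id, score in run.items():
            totals[doc_id] += score
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from a query-translation run and a document-translation run.
query_trans_run = {"d1": 0.5, "d2": 0.25}
doc_trans_run = {"d2": 0.5, "d3": 0.125}
combined_run = combine_runs(query_trans_run, doc_trans_run)
print(combined_run)
```

Under this rule a document found by both runs, such as d2 here, moves ahead of documents found by only one.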
12 Query Translation vs. Document Translation
Topic 161
English words in topic: Diets for Celiacs
Query translation:
  (Spanish) Las Dietas para Celiacs
  (German) Nahrungen für Celiacs
Document translation (word-by-word):
  Spanish doc words: celíacos dietas → (English) celiacs diets
  German doc words: diät zöliakie → (English) diets coeliac diseases
Average precision with query translation: 0.0003 (mul4en1)
Average precision with document translation: 0.6750 (mul4en2)
Topic 186
English words in topic: Dutch Netherlands
Query translation:
  (French) Hollandais Hollande
Document translation (word-by-word):
  French document words: Néerlandais Pays-Bas → (English) Dutch Netherlands
Average precision with query translation: 0.2213 (mul4en1)
Average precision with document translation: 0.6167 (mul4en2)
13 Manual vs. Automatic Stemming
CLEF 2003
Language   No stemming   Manual (Snowball)   Automatic (parallel texts)
Finnish    0.3801        0.4972              0.4304
Swedish    0.3630        0.4121              0.3844
(Topic fields: TD. No decompounding or query expansion.)
CLEF 2001-2002
Language   No stemming   Manual (Muscat)   Automatic (L&H MT)
French     0.3905        0.4528            0.4521
Italian    0.3801        0.4324            0.4322
Spanish    0.4687        0.5166            0.5285
(Topic fields: TD. No query expansion.)
14 Evaluation of Decompounding, Stemming and Query Expansion in Monolingual Retrieval
Topics (TD); percent improvement over baseline in parentheses
Dutch German Finnish Swedish
.5304 (22.16%) .5678 (52.35%) .5633 (48.20%) .5465 (50.55%)
decomp+stem+expan
.4962 .4804 .5541 .4838
.5126 .5473 .4469 .4880
.4955 .5111 .4972 .4727
decomp+expan
stem+expan
decomp+stem
.4744 .4294 .4204 .4331
.4480 .4220 .4974 .4121
.4673 .4867 .4071 .4224
stem
expan
decomp
.4342 .3727 .3801 .3630
baseline
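The parenthesized figures in the first row are relative gains over the baseline row. The short check below recomputes them from the two rows; the dictionary and function names are made up for the example.

```python
# Scores copied from the baseline row and the decompounding + stemming +
# query expansion row of the table.
BASELINE = {"Dutch": 0.4342, "German": 0.3727, "Finnish": 0.3801, "Swedish": 0.3630}
ALL_THREE = {"Dutch": 0.5304, "German": 0.5678, "Finnish": 0.5633, "Swedish": 0.5465}


def percent_gain(score, baseline):
    """Relative improvement over the baseline, in percent."""
    return round(100.0 * (score - baseline) / baseline, 2)


gains = {lang: percent_gain(ALL_THREE[lang], BASELINE[lang]) for lang in BASELINE}
print(gains)
```

The gain is smallest for Dutch, at about 22%, and close to 50% for German, Finnish and Swedish.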
15 Conclusions
- Fast approximate document translation worked well. Combining document translation with query translation was even better.
- Decompounding with stemming and query expansion worked well for languages with rich compounds.
- Statistical stemmers derived from parallel texts were not as effective as manually built stemmers for Finnish and Swedish, but there is still room for improving statistical stemmers.
16 Software
The Berkeley Text Retrieval System is available for research purposes. Send requests to aitao_at_sims.berkeley.edu
17 Thank You