Combining Query Translation and Document Translation in Cross-Language Retrieval

1
Combining Query Translation and Document
Translation in Cross-Language Retrieval
Aitao Chen, School of Information Management and Systems
Fredric C. Gey, UC Data Archive & Technical Assistance
University of California at Berkeley
CLEF 2003 Workshop, 21-22 August 2003, Trondheim, Norway
2
Talk Outline
  • Development of new resources
  • Fast approximate document translation
  • Combining query translation and document
    translation
  • Conclusions

3
New Resources
  • Finnish and Swedish stoplists
  • Base Finnish and Swedish lexicons for
    decompounding
  • Statistical translation lexicons derived from
    parallel texts
  • Finnish and Swedish statistical stemmers
    automatically generated from parallel texts
  • English spelling normalizer

4
Development of Swedish Stoplist (by someone who
doesn't know Swedish)
Look in Swedish textbooks written in English (e.g.,
grammars) for Swedish words whose English
translations are English stopwords.
  • en park (a park)
  • ett piano (a piano)
  • Jag vet inte mycket om honom (I don't know much
    about him)
  • efter skolan (after school)
  • Hans och Greta (Hans and Greta)

(Source: Swedish: A Comprehensive Grammar by P.
Holmes and I. Hinchliffe)
5
Development of Swedish Base Lexicon
A base lexicon should contain all and only the
words and their variants that are not compounds.
  • Compile a list of Swedish words (e.g., from the
    Swedish document collection).
  • Remove the words that are 4 or fewer characters
    long.
  • Remove the long words that can be decomposed into
    short words in the initial wordlist.

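Example wordlist: animation, animationen, dator, datoranimation,
datorgrafik, datorteknologi, datorvirus, grafik, teknologi, virus
Remove the compounds that can be decomposed:
  • datoranimation → dator + animation
  • datorgrafik → dator + grafik
  • datorteknologi → dator + teknologi
  • datorvirus → dator + virus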
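The base-lexicon steps on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `can_decompose` and `build_base_lexicon` are hypothetical, compounds are assumed to be plain concatenations (no linking letters), and the minimum length of 5 follows the "4 or fewer characters" rule on the slide.

```python
def can_decompose(word, lexicon):
    """Return True if word splits into two or more words from lexicon."""
    n = len(word)
    # reachable[i] is True when word[:i] can be written as lexicon words
    reachable = [True] + [False] * n
    for end in range(1, n + 1):
        for start in range(end):
            if start == 0 and end == n:
                continue  # the whole word is not a decomposition of itself
            if reachable[start] and word[start:end] in lexicon:
                reachable[end] = True
                break
    return reachable[n]


def build_base_lexicon(words, min_length=5):
    """Drop short words, then drop words that decompose into other words."""
    # Assumption: components are looked up in the length-filtered wordlist.
    candidates = {w.lower() for w in words if len(w) >= min_length}
    return {w for w in candidates if not can_decompose(w, candidates)}
```

Applied to the example wordlist on the slide, the four dator- compounds are removed and animation, animationen, dator, grafik, teknologi, and virus remain.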
6
Development of Statistical Translation Lexicons
from Parallel Texts
Processing pipeline, in order:
  • parallel texts (EU Official Journal)
  • PDF → text conversion
  • paragraph and sentence alignment
  • statistical association, statistical MT toolkit
  • statistical translation lexicons

Language pairs:
  1. Italian → Spanish
  2. German → Italian
  3. Finnish → German

  1. English → Dutch
  2. English → Finnish
  3. English → Swedish
  4. Dutch → English
  5. Finnish → English
  6. Swedish → English
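The slide does not say which association measure was used, so the following is a minimal sketch of the "statistical association" step that assumes the Dice coefficient over sentence-aligned pairs. The function name `dice_lexicon` and the input format (a list of source/target sentence strings) are hypothetical.

```python
from collections import Counter
from itertools import product


def dice_lexicon(sentence_pairs):
    """Map each source word to its best target word by Dice coefficient."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        src_words = set(src.lower().split())
        tgt_words = set(tgt.lower().split())
        src_count.update(src_words)   # sentence pairs containing each source word
        tgt_count.update(tgt_words)   # sentence pairs containing each target word
        pair_count.update(product(src_words, tgt_words))  # co-occurrences
    lexicon = {}
    for (s, t), both in pair_count.items():
        score = 2 * both / (src_count[s] + tgt_count[t])
        if s not in lexicon or score > lexicon[s][1]:
            lexicon[s] = (t, score)
    return lexicon
```

With a real corpus the counts would come from the aligned sentences of the parallel texts; a statistical MT toolkit would replace this simple measure with a trained translation model.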
7
Development of Statistical Stemmers
Swedish words:
dator datorn datorer datorersom datornät datornernä diamanten
diamanterna diamanter diamant informatik
Statistical English translations, reduced to English stems:
  • computer, computers → computer
  • diamond, diamonds → diamond
Clusters of Swedish words and the stem chosen for each:
  • computer cluster: dator datorn datorer datorersom datornät
    datornernä informatik → dator
  • diamond cluster: diamanten diamanterna diamanter diamant
    → diamant
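The clustering idea on this slide can be sketched as follows. Several parts are assumptions rather than the authors' stated method: `english_stem` is a crude placeholder for a real English stemmer, picking the shortest cluster member as the stem is a guess that happens to agree with the slide's examples, and the word-to-translation mapping passed in is illustrative.

```python
from collections import defaultdict


def english_stem(word):
    """Placeholder English stemmer: strips a final plural 's' only."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word


def build_stemmer(translations):
    """translations maps a source-language word to its English translation."""
    clusters = defaultdict(list)
    for source_word, english in translations.items():
        # Words whose translations share an English stem form one cluster.
        clusters[english_stem(english)].append(source_word)
    stemmer = {}
    for members in clusters.values():
        # Assumption: the shortest member (ties broken alphabetically) is the stem.
        stem = min(members, key=lambda w: (len(w), w))
        for word in members:
            stemmer[word] = stem
    return stemmer
```

For example, if dator, datorn, datorer, and informatik all translate to computer or computers, each of them maps to the stem dator.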
8
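Fast Approximate Document Translation
  1. Extract the list of Spanish words from the Spanish documents.
  2. Translate the list of Spanish words into a list of English
     words with Spanish-English MT.
  3. Pair the two lists into a bilingual Spanish-English wordlist.
  4. Translate the Spanish documents word by word with the
     wordlist to produce English translations.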
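A minimal sketch of this procedure, assuming whitespace tokenization and a caller-supplied `translate_word` function that stands in for the Spanish-English MT system (both function names below are hypothetical). The point of the design is that the MT system is called once per distinct word rather than once per document.

```python
def build_bilingual_wordlist(documents, translate_word):
    """Translate each distinct word once and return a source-to-target dict."""
    vocabulary = sorted({w for doc in documents for w in doc.lower().split()})
    return {word: translate_word(word) for word in vocabulary}


def translate_document(document, wordlist):
    """Word-by-word translation; words missing from the wordlist are kept."""
    return " ".join(wordlist.get(w, w) for w in document.lower().split())
```

Once the wordlist is built, translating a collection is a dictionary lookup per token, which is why the approach is fast and why the output is only an approximate translation.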
9
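Query Translation-based Multilingual Retrieval
  • The English query is used as-is for the English docs and
    translated (L&H MT) into French, German, and Spanish.
  • Each version of the query goes through its own IR run against
    the documents in the same language: English docs, French docs,
    German docs, Spanish docs.
  • A merger combines the four result lists into a combined ranked
    list of documents.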
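The merger step can be sketched as a raw-score merge, the merging method named in the evaluation table. The function name `merge_by_raw_score`, the (doc_id, score) result format, and the cutoff of 1000 documents are assumptions made for illustration.

```python
def merge_by_raw_score(ranked_lists, limit=1000):
    """Merge per-language result lists by sorting on the raw retrieval score."""
    merged = [item for results in ranked_lists for item in results]
    # Raw-score merging assumes scores are comparable across collections.
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged[:limit]
```

Because no normalization is applied, a collection whose retrieval scores run systematically higher will dominate the merged list, which is the usual weakness of this merging method.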
10
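Document Translation-based Multilingual
Retrieval
  • English documents stay in English; French, German, and Spanish
    documents are translated into English.
  • A single IR run matches the English query against all of the
    English-language documents.
  • The output is a unified ranked list of documents, with no
    merging step.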
11
Evaluation of Multilingual Retrieval
Multilingual-4, English, TD
Run ID     Trans. method            Merging method  Average precision
bkmul4en1  query-trans              raw score       0.3783
bkmul4en2  doc-trans                none            0.4082
bkmul4en3  query-trans + doc-trans  raw score       0.4260

Multilingual-8, English, TD
Run ID     Trans. method            Merging method  Average precision
bkmul8en1  query-trans              raw score       0.3317
bkmul8en2  doc-trans                none            0.3401
bkmul8en3  query-trans + doc-trans  raw score       0.3733
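The slides do not define "average precision". The sketch below assumes the standard non-interpolated average precision for a single topic, which is then averaged over topics to give a run-level figure like those in the tables. The function name and argument format are hypothetical.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision of one ranked list for one topic."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    # Relevant documents that are never retrieved contribute zero.
    return precision_sum / len(relevant)
```

For a ranking d1, d2, d3, d4 with d1 and d3 relevant, the precisions at the relevant ranks are 1/1 and 2/3, so the average precision is 5/6.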
12
Query Translation vs. Document Translation
Topic 161, English words: Diets for Celiacs
  • query translation: Las Dietas para Celiacs (Spanish),
    Nahrungen für Celiacs (German)
  • Spanish doc words: celíacos dietas; document translation
    (word-by-word): celiacs diets (English)
  • German doc words: diät zöliakie; document translation
    (word-by-word): diets coeliac diseases (English)
  • Average precision 0.0003 (mul4en1, query translation)
  • Average precision 0.6750 (mul4en2, document translation)
Topic 186, English words: Dutch Netherlands
  • query translation: Hollandais Hollande (French)
  • French document words: Néerlandais Pays-Bas; document
    translation (word-by-word): Dutch Netherlands (English)
  • Values shown on the diagram: 0.0, 1.0
  • Average precision 0.2213 (mul4en1, query translation)
  • Average precision 0.6167 (mul4en2, document translation)
13
Manual vs. Automatic Stemming
CLEF 2003
Language  No stemming  Manual (Snowball)  Automatic (parallel texts)
Finnish   0.3801       0.4972             0.4304
Swedish   0.3630       0.4121             0.3844
(topic fields TD. No decompounding or query expansion)

CLEF 2001-2002
Language  No stemming  Manual (Muscat)  Automatic (L&H MT)
French    0.3905       0.4528           0.4521
Italian   0.3801       0.4324           0.4322
Spanish   0.4687       0.5166           0.5285
(topic fields TD. No query expansion)
14
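Evaluation of Decompounding, Stemming and Query
Expansion in Monolingual Retrieval
Topics (TD). Figures in parentheses are the percent improvement
over the baseline.
Method             Dutch          German         Finnish        Swedish
decomp+stem+expan  .5304 (22.16)  .5678 (52.35)  .5633 (48.20)  .5465 (50.55)
baseline           .4342          .3727          .3801          .3630

Rows for the two-method runs (slide labels: decomp+expan,
stem+expan, decomp+stem), same column order:
.4962 .4804 .5541 .4838
.5126 .5473 .4469 .4880
.4955 .5111 .4972 .4727

Rows for the single-method runs (slide labels: stem, expan,
decomp), same column order:
.4744 .4294 .4204 .4331
.4480 .4220 .4974 .4121
.4673 .4867 .4071 .4224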
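The parenthesized figures next to the decomp+stem+expan scores agree with the relative improvement over the baseline row, which can be checked with a one-line calculation. The function name below is hypothetical.

```python
def relative_improvement(score, baseline):
    """Percent improvement of score over baseline."""
    return 100.0 * (score - baseline) / baseline


# Dutch: decompounding + stemming + expansion vs. baseline
dutch_gain = round(relative_improvement(0.5304, 0.4342), 2)
```

The same calculation on the German, Finnish, and Swedish columns reproduces the other three parenthesized values on the slide.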
15
Conclusions
  • Fast approximate document translation worked
    well, outperforming query translation on both
    multilingual tasks. Combining document translation
    with query translation was even better.
  • Decompounding with stemming and query expansion
    worked well for languages with rich compounds.
  • Statistical stemmers derived from parallel texts
    were not as effective as manually built stemmers
    for Finnish and Swedish. But there is still room
    for improving statistical stemmers.

16
Software
The Berkeley Text Retrieval System is available for
research purposes. Send requests to
aitao_at_sims.berkeley.edu
17
THANK YOU