1
MIRACLE: Multilingual Information RetrievAl for
the CLEF campaign
  • DAEDALUS Data, Decisions and Language, S.A.
  • www.daedalus.es
  • Universidad Carlos III de Madrid (UC3M)
  • www.uc3m.es
  • Universidad Politécnica de Madrid (UPM)
  • www.upm.es
  • Partially funded by the IST-2001-32174
    (OmniPaper) and CAM 07T/0055/2003 projects

2
ImageCLEF 2003
  • Participation in:
  • Monolingual task
  • English-English: 5 different runs
  • Bilingual tasks
  • Spanish-English: 6 runs
  • German-English: 6 runs
  • French-English: 4 runs
  • Italian-English: 4 runs
  • TOTAL: 25 runs

3
System Architecture
  • IR engine: XAPIAN (based on the Probabilistic IR
    model)
  • Filtering components: text and word extraction,
    topic extraction, word counts, statistical
    calculations
  • Linguistic components: tokenizers, stemmers
    (based on the Porter algorithm), a German word
    decompounding module, stopword filters
  • Translation components: API to FreeTranslation.com
    (full text) and the ERGANE dictionary (word by
    word)
  • Semantic components: synonym expansion for
    English (WordNet)
  • Our idea is to couple these components in
    different ways to evaluate different approaches
    and compare the influence of each one on the P/R
    of the IR process for each language (see the
    pipeline sketch below)
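
A minimal sketch of how such components might be coupled into an interchangeable pipeline (the function names and the tiny stopword list are illustrative assumptions, not the actual MIRACLE code):

    # Illustrative sketch only: the real MIRACLE components are not
    # public, so the names and details here are assumptions.
    import re
    import xapian  # Xapian Python bindings (the probabilistic IR engine)

    STOPWORDS = {"a", "an", "and", "in", "of", "the"}  # stand-in list

    def tokenize(text):
        # Linguistic component: naive word extraction.
        return re.findall(r"[a-z]+", text.lower())

    def filter_stopwords(tokens):
        # Filtering component: drop stopwords.
        return [t for t in tokens if t not in STOPWORDS]

    def stem(tokens):
        # Linguistic component: Porter-style stemming via Xapian.
        stemmer = xapian.Stem("english")
        return [stemmer(t) for t in tokens]

    def pipeline(data, steps):
        # Couple components in different orders to compare approaches.
        for step in steps:
            data = step(data)
        return data

    # Baseline topic processing: tokenize -> stopword filter -> stem
    terms = pipeline("Pictures of Edinburgh Castle",
                     [tokenize, filter_stopwords, stem])

Swapping, adding or removing steps (for example, inserting synonym expansion before stemming) would then yield the different runs compared below.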

4
IR Process: Index
  • All the images are indexed in the same XAPIAN
    collection
  • For each image, the HEADLINE and TEXT fields are
    used (without tags and IDs), as sketched below
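
A hedged sketch of this indexing step with the Xapian Python bindings (the database name and the example record are assumptions; the slide only fixes that HEADLINE and TEXT are indexed, without tags and IDs):

    import xapian

    db = xapian.WritableDatabase("images.db", xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("english"))  # Porter-style stemmer

    def index_image(image_id, headline, text):
        doc = xapian.Document()
        doc.set_data(image_id)        # the ID is stored, not indexed
        indexer.set_document(doc)
        indexer.index_text(headline)  # HEADLINE field
        indexer.index_text(text)      # TEXT field
        db.add_document(doc)

    index_image("example_0001",       # hypothetical record
                "Old Course, St Andrews",
                "Golfers on the Old Course, clubhouse behind.")
    db.commit()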

5
IR Process: Retrieval
  • Different runs, basically consisting of:
  • Create the query from the topic
  • Execute the query in the XAPIAN system
  • Retrieve the 1000 best results (ranked list), as
    sketched below
  • For each topic, only the TITLE field and the 1st
    translation variant are used
  • Evaluation: four relevance sets (2 judges)
  • Union (any assessor) / Intersection (both
    assessors)
  • Strict (relevant only) / Relaxed (also partially
    relevant)
  • In our evaluation, we have considered the
    intersection-strict set, which is the most
    restrictive
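
A minimal sketch of the query-and-retrieve step, continuing the indexing example above (the topic title is a hypothetical example; the slide only fixes the use of the TITLE field and the 1000-result cutoff):

    import xapian

    db = xapian.Database("images.db")
    enquire = xapian.Enquire(db)

    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("english"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)

    # Topic TITLE only; Xapian's default operator is OR, so the
    # parsed terms form a weighted OR query.
    enquire.set_query(qp.parse_query("Pictures of Edinburgh Castle"))

    # Retrieve the 1000 best results as a ranked list.
    for m in enquire.get_mset(0, 1000):
        print(m.rank + 1, m.percent, m.document.get_data().decode())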

6
Monolingual Runs (en-en)
  • OR
  • Word extraction in topic title → stop word
    filtering → stemming → weighted OR operator with
    stems
  • Intended as the baseline run
  • ORlem
  • Word extraction in topic title → stop word
    filtering → stemming → weighted OR operator with
    stems and original words
  • Idea: measure the effect of stemming
  • ORlemexp
  • Word extraction in topic title → stop word
    filtering → synonym expansion → stemming →
    weighted OR operator with stems, original words
    and synonyms
  • Idea: measure the effect of increasing recall
    despite the penalty in precision
  • Doc
  • Index the topic title as a document → retrieve
    similar docs
  • Idea: confirm that this approach is similar to
    the vector space model
  • ORrf
  • Query with OR operator with stems → top 25 docs →
    250 most important terms → new weighted OR query
    (see the sketch below)
  • Idea: measure the effect of the simplest blind
    relevance feedback
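
A hedged sketch of ORrf-style blind relevance feedback using Xapian's relevance and expand sets; the 25/250 parameters come from the slide, while the rest is one plausible realization, not the actual MIRACLE code:

    import xapian

    db = xapian.Database("images.db")
    enquire = xapian.Enquire(db)

    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("english"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    enquire.set_query(qp.parse_query("Pictures of Edinburgh Castle"))

    # Blind feedback: assume the top 25 documents are relevant.
    rset = xapian.RSet()
    for m in enquire.get_mset(0, 25):
        rset.add_document(m.docid)

    # Take the 250 most important terms from those documents ...
    eset = enquire.get_eset(250, rset)

    # ... and build a new weighted OR query from them.
    expanded = xapian.Query(xapian.Query.OP_OR,
                            [xapian.Query(item.term) for item in eset])
    enquire.set_query(expanded)
    final = enquire.get_mset(0, 1000)

With 250 expansion terms drawn from captions averaging about 50 words, this configuration adds more noise than signal, which is exactly what the P-R analysis on the next slide shows.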

7
P-R curve (en-en)
  • The best runs have suspiciously high precision
    values (the set of relevant documents is not
    complete)
  • Relevance feedback is the worst (noise due to
    inappropriate parameter values: 250 terms, when
    the mean length of an image description is about
    50 words)
  • Any kind of term expansion reduces precision (low
    number of documents, presence of ambiguity)

8
Average Precision (en-en)
  • The best runs are the weighted OR query and Doc
    (in the Probabilistic IR model, a weighted OR
    acts like term weights in the Vector Space Model)
  • Evaluation against the other relevance sets gives
    a slight increase in overall precision

9
Bilingual Runs (fr,ge,it,sp-en)
  • TOR1
  • Topic title → FreeTranslation → word extraction →
    stop word filtering → stemming → weighted OR
    operator with stems
  • Similar to the monolingual OR run, intended as
    the baseline run
  • TOR3
  • Topic title → FreeTranslation + ERGANE → word
    extraction → stop word filtering → stemming →
    weighted OR operator with stems
  • Idea: improve translation by combining different
    sources
  • Tdoc
  • Topic title → FreeTranslation → index as document
    → retrieve similar docs
  • TOR3exp
  • Topic title → FreeTranslation + ERGANE → word
    extraction → stop word filtering → synonym
    expansion → stemming → weighted OR operator with
    stems, original words and synonyms
  • TOR3full
  • The same as TOR3, but also including the topic
    title in the original language (see the sketch
    below)
  • Idea: evaluate the effect of text that cannot be
    translated or is translated incorrectly
  • TOR3fullexp
  • Combination of TOR3exp and TOR3full
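
A sketch of how a TOR3full-style query could be assembled; free_translation and ergane_lookup are hypothetical stand-ins (with canned outputs) for the FreeTranslation.com API and the ERGANE dictionary, whose real interfaces are not shown in the slides:

    from collections import Counter
    import xapian

    def free_translation(title):
        # HYPOTHETICAL stand-in for the FreeTranslation.com full-text
        # API; a real client would call the web service.
        canned = {"fotos del castillo de Edimburgo":
                  "photos of the castle of Edinburgh"}
        return canned.get(title, title)

    def ergane_lookup(word):
        # HYPOTHETICAL word-by-word lookup in the ERGANE dictionary.
        canned = {"castillo": ["castle"], "fotos": ["photos", "pictures"]}
        return canned.get(word, [])

    def tor3full_query(title):
        stemmer = xapian.Stem("english")
        terms = free_translation(title).lower().split()   # full text
        for word in title.lower().split():
            terms += ergane_lookup(word)                  # word by word
        terms += title.lower().split()  # TOR3full: keep original words
        # Stopword filtering omitted for brevity; repeated terms get a
        # higher within-query frequency (wqf) in the weighted OR query.
        counts = Counter(stemmer(t) for t in terms)
        return xapian.Query(xapian.Query.OP_OR,
                            [xapian.Query(term, wqf)
                             for term, wqf in counts.items()])

    q = tor3full_query("fotos del castillo de Edimburgo")

Keeping the untranslated original words costs little in an English collection but preserves proper nouns and untranslatable terms, which is what TOR3full sets out to measure.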

10
P-R curve (fr,ge,it,sp-en)
11
P-R curve (fr,ge,it,sp-en)
  • Although all results are similar, TOR1 and Tdoc
    are the best ones in all cases
  • Word-by-word translation with ERGANE has proved
    to be worse: either the translation is not
    adequate, or the expansion makes queries wider,
    thus reducing precision
  • Again, as in the monolingual task, any kind of
    term expansion reduces precision if ambiguity is
    not handled
  • Spanish, German and Italian have similar results,
    but French is slightly worse: either
    FreeTranslation is worse for French, or the
    French topics are harder to translate
  • Spanish-English gives our best individual
    results!
  • Comparing bilingual and monolingual results, a
    difference of about 10-15 points arises (similar
    to our participation in the CLEF tasks this year)

12
Average Precision (fr,ge,it,sp-en)
13
Conclusions and Future Work
  • As newcomers to CLEF, we have worked hard to
    build the infrastructure needed to easily
    execute different runs
  • The simplest approaches have proved to be the
    best when the ambiguity introduced by term
    expansion is not handled
  • Next time:
  • POS filtering for syntactic disambiguation, to
    handle ambiguity
  • Evaluate the effect of using stemming or not (and
    of the stemmer's quality) in highly inflected
    languages like Spanish/French/Italian
  • More focus on Spanish: a better stemmer, better
    synonym expansion (directly in Spanish)
  • Evaluate the quality of translation engines with
    respect to the IR process