1
MIRACLE: Multilingual Information RetrievAl for
the CLEF campaign
  • DAEDALUS Data, Decisions and Language, S.A.
  • www.daedalus.es
  • Universidad Carlos III de Madrid (UC3M)
  • www.uc3m.es
  • Universidad Politécnica de Madrid (UPM)
  • www.upm.es
  • Partially funded by the IST-2001-32174
    (OmniPaper) and CAM 07T/0055/2003 projects

2
ImageCLEF 2003
  • Participation in:
  • Monolingual task
  • English-English: 5 different runs
  • Bilingual tasks
  • Spanish-English: 6 runs
  • German-English: 6 runs
  • French-English: 4 runs
  • Italian-English: 4 runs
  • TOTAL: 25 runs

3
System Architecture
  • IR engine: XAPIAN (based on the Probabilistic IR
    model)
  • Filtering components: text and word extraction,
    topic extraction, word counts, statistical
    calculations
  • Linguistic components: tokenizers, stemmers
    (based on the Porter algorithm), a German word
    decompounding module, stopword filters
  • Translation components: API to FreeTranslation.com
    (full text) and the ERGANE dictionary (word by
    word)
  • Semantic components: synonym expansion for
    English (WordNet)
  • Our idea is to couple these components in
    different ways to evaluate different approaches
    and compare the influence of each one on the P/R
    of the IR process for each language (see the
    pipeline sketch below)
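
A minimal sketch of how such components might be coupled into an interchangeable pipeline (the function names and the tiny stopword list are illustrative assumptions, not the actual MIRACLE code):

    # Illustrative sketch only: the real MIRACLE components are not
    # public, so the names and details here are assumptions.
    import re
    import xapian  # Xapian Python bindings (the probabilistic IR engine)

    STOPWORDS = {"a", "an", "and", "in", "of", "the"}  # stand-in list

    def tokenize(text):
        # Linguistic component: naive word extraction.
        return re.findall(r"[a-z]+", text.lower())

    def filter_stopwords(tokens):
        # Filtering component: drop stopwords.
        return [t for t in tokens if t not in STOPWORDS]

    def stem(tokens):
        # Linguistic component: Porter-style stemming via Xapian.
        stemmer = xapian.Stem("english")
        return [stemmer(t) for t in tokens]

    def pipeline(data, steps):
        # Couple components in different orders to compare approaches.
        for step in steps:
            data = step(data)
        return data

    # Baseline topic processing: tokenize -> stopword filter -> stem
    terms = pipeline("Pictures of Edinburgh Castle",
                     [tokenize, filter_stopwords, stem])

Swapping, adding or removing steps (for example, inserting synonym expansion before stemming) would then yield the different runs compared below.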

4
IR Process: Index
  • All the images are indexed in the same XAPIAN
    collection
  • For each image, the HEADLINE and TEXT fields are
    used (without tags and IDs), as sketched below
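
A hedged sketch of this indexing step with the Xapian Python bindings (the database name and the example record are assumptions; the slide only fixes that HEADLINE and TEXT are indexed, without tags and IDs):

    import xapian

    db = xapian.WritableDatabase("images.db", xapian.DB_CREATE_OR_OPEN)
    indexer = xapian.TermGenerator()
    indexer.set_stemmer(xapian.Stem("english"))  # Porter-style stemmer

    def index_image(image_id, headline, text):
        doc = xapian.Document()
        doc.set_data(image_id)        # the ID is stored, not indexed
        indexer.set_document(doc)
        indexer.index_text(headline)  # HEADLINE field
        indexer.index_text(text)      # TEXT field
        db.add_document(doc)

    index_image("example_0001",       # hypothetical record
                "Old Course, St Andrews",
                "Golfers on the Old Course, clubhouse behind.")
    db.commit()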

5
IR Process: Retrieval
  • Different runs, basically consisting of:
  • Create the query from the topic
  • Execute the query in the XAPIAN system
  • Retrieve the 1000 best results (ranked list), as
    sketched below
  • For each topic, only the TITLE field and the 1st
    translation variant are used
  • Evaluation: four relevance sets (2 judges)
  • Union (any assessor) / Intersection (both
    assessors)
  • Strict (relevant only) / Relaxed (also partially
    relevant)
  • In our evaluation, we have considered the
    intersection-strict set, which is the most
    restrictive
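
A minimal sketch of the query-and-retrieve step, continuing the indexing example above (the topic title is a hypothetical example; the slide only fixes the use of the TITLE field and the 1000-result cutoff):

    import xapian

    db = xapian.Database("images.db")
    enquire = xapian.Enquire(db)

    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("english"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)

    # Topic TITLE only; Xapian's default operator is OR, so the
    # parsed terms form a weighted OR query.
    enquire.set_query(qp.parse_query("Pictures of Edinburgh Castle"))

    # Retrieve the 1000 best results as a ranked list.
    for m in enquire.get_mset(0, 1000):
        print(m.rank + 1, m.percent, m.document.get_data().decode())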

6
Monolingual Runs (en-en)
  • OR
  • Word extraction in topic title → stop word
    filtering → stemming → weighted OR operator with
    stems
  • Intended as the baseline run
  • ORlem
  • Word extraction in topic title → stop word
    filtering → stemming → weighted OR operator with
    stems and original words
  • Idea: measure the effect of stemming
  • ORlemexp
  • Word extraction in topic title → stop word
    filtering → synonym expansion → stemming →
    weighted OR operator with stems, original words
    and synonyms
  • Idea: measure the effect of increasing recall
    despite the penalty in precision
  • Doc
  • Index the topic title as a document → retrieve
    similar docs
  • Idea: confirm that this approach is similar to
    the vector space model
  • ORrf
  • Query with OR operator with stems → top 25 docs →
    250 most important terms → new weighted OR query
    (see the sketch below)
  • Idea: measure the effect of the simplest blind
    relevance feedback
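
A hedged sketch of ORrf-style blind relevance feedback using Xapian's relevance and expand sets; the 25/250 parameters come from the slide, while the rest is one plausible realization, not the actual MIRACLE code:

    import xapian

    db = xapian.Database("images.db")
    enquire = xapian.Enquire(db)

    qp = xapian.QueryParser()
    qp.set_stemmer(xapian.Stem("english"))
    qp.set_stemming_strategy(xapian.QueryParser.STEM_SOME)
    enquire.set_query(qp.parse_query("Pictures of Edinburgh Castle"))

    # Blind feedback: assume the top 25 documents are relevant.
    rset = xapian.RSet()
    for m in enquire.get_mset(0, 25):
        rset.add_document(m.docid)

    # Take the 250 most important terms from those documents ...
    eset = enquire.get_eset(250, rset)

    # ... and build a new weighted OR query from them.
    expanded = xapian.Query(xapian.Query.OP_OR,
                            [xapian.Query(item.term) for item in eset])
    enquire.set_query(expanded)
    final = enquire.get_mset(0, 1000)

With 250 expansion terms drawn from captions averaging about 50 words, this configuration adds more noise than signal, which is exactly what the P-R analysis on the next slide shows.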

7
P-R curve (en-en)
  • The best runs have suspiciously high precision
    values (the set of relevant documents is not
    complete)
  • Relevance feedback is the worst (noise due to
    inappropriate parameter values: 250 terms, when
    the mean length of an image description is about
    50 words)
  • Any kind of term expansion reduces precision (low
    number of documents, presence of ambiguity)

8
Average Precision (en-en)
  • The best runs are the weighted OR query and Doc
    (in the Probabilistic IR model, a weighted OR
    acts like term weights in the Vector Space Model)
  • Evaluation against the other relevance sets gives
    a slight increase in overall precision

9
Bilingual Runs (fr,ge,it,sp-en)
  • TOR1
  • Topic title → FreeTranslation → word extraction →
    stop word filtering → stemming → weighted OR
    operator with stems
  • Similar to the monolingual OR run, intended as
    the baseline run
  • TOR3
  • Topic title → FreeTranslation + ERGANE → word
    extraction → stop word filtering → stemming →
    weighted OR operator with stems
  • Idea: improve translation by combining different
    sources
  • Tdoc
  • Topic title → FreeTranslation → index as document
    → retrieve similar docs
  • TOR3exp
  • Topic title → FreeTranslation + ERGANE → word
    extraction → stop word filtering → synonym
    expansion → stemming → weighted OR operator with
    stems, original words and synonyms
  • TOR3full
  • The same as TOR3, but also including the topic
    title in the original language (see the sketch
    below)
  • Idea: evaluate the effect of text that cannot be
    translated or is translated incorrectly
  • TOR3fullexp
  • Combination of TOR3exp and TOR3full
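
A sketch of how a TOR3full-style query could be assembled; free_translation and ergane_lookup are hypothetical stand-ins (with canned outputs) for the FreeTranslation.com API and the ERGANE dictionary, whose real interfaces are not shown in the slides:

    from collections import Counter
    import xapian

    def free_translation(title):
        # HYPOTHETICAL stand-in for the FreeTranslation.com full-text
        # API; a real client would call the web service.
        canned = {"fotos del castillo de Edimburgo":
                  "photos of the castle of Edinburgh"}
        return canned.get(title, title)

    def ergane_lookup(word):
        # HYPOTHETICAL word-by-word lookup in the ERGANE dictionary.
        canned = {"castillo": ["castle"], "fotos": ["photos", "pictures"]}
        return canned.get(word, [])

    def tor3full_query(title):
        stemmer = xapian.Stem("english")
        terms = free_translation(title).lower().split()   # full text
        for word in title.lower().split():
            terms += ergane_lookup(word)                  # word by word
        terms += title.lower().split()  # TOR3full: keep original words
        # Stopword filtering omitted for brevity; repeated terms get a
        # higher within-query frequency (wqf) in the weighted OR query.
        counts = Counter(stemmer(t) for t in terms)
        return xapian.Query(xapian.Query.OP_OR,
                            [xapian.Query(term, wqf)
                             for term, wqf in counts.items()])

    q = tor3full_query("fotos del castillo de Edimburgo")

Keeping the untranslated original words costs little in an English collection but preserves proper nouns and untranslatable terms, which is what TOR3full sets out to measure.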

10
P-R curve (fr,ge,it,sp-en)
11
P-R curve (fr,ge,it,sp-en)
  • Although all results are similar, TOR1 and Tdoc
    are the best ones in all cases
  • Word-by-word translation with ERGANE has proved
    to be worse: either the translation is not
    adequate, or the expansion makes queries wider,
    thus reducing precision
  • Again, as in the monolingual task, any kind of
    term expansion reduces precision if ambiguity is
    not handled
  • Spanish, German and Italian have similar results,
    but French is slightly worse: either
    FreeTranslation is worse for French, or the
    French topics are harder to translate
  • Spanish-English gives our best individual
    results!
  • Comparing bilingual and monolingual results, a
    difference of about 10-15 points arises (similar
    to our participation in the CLEF tasks this year)

12
Average Precision (fr,ge,it,sp-en)
13
Conclusions and Future Work
  • As newcomers to CLEF, we have worked hard to
    build the infrastructure needed to easily
    execute different runs
  • The simplest approaches have proved to be the
    best when the ambiguity introduced by term
    expansion is not handled
  • Next time:
  • POS filtering for syntactic disambiguation, to
    handle ambiguity
  • Evaluate the effect of using stemming or not (and
    of the stemmer's quality) in highly inflected
    languages like Spanish/French/Italian
  • More focus on Spanish: a better stemmer, better
    synonym expansion (directly in Spanish)
  • Evaluate the quality of translation engines with
    respect to the IR process