1
Iterative Translation Disambiguation for Cross
Language Information Retrieval
  • Christof Monz and Bonnie J. Dorr
  • Institute for Advanced Computer Studies
  • University of Maryland
  • SIGIR 2005

2
INTRODUCTION
  • Query translation requires access to some form of
    translation dictionary
  • One option is to use a machine translation system
    to translate the entire query into the target
    language
  • Another is to use a dictionary to produce a number
    of target-language translations for words or
    phrases in the source language
  • A third is to use a parallel corpus to estimate the
    probability that a word w in the source language
    translates into a word w' in the target language

3
INTRODUCTION
  • An approach that does not require a parallel
    corpus to induce translation probabilities
  • It needs only a machine-readable dictionary
    (without any rankings or frequency statistics)
  • and a monolingual corpus in the target language

4
TRANSLATION SELECTION
  • Translation ambiguity is very common
  • One option is to apply word-sense disambiguation
  • but for most languages the appropriate resources
    do not exist
  • and word-sense disambiguation is a non-trivial
    enterprise in itself

5
TRANSLATION SELECTION
  • Our approach uses co-occurrences between terms to
    model context for the problem of word selection
  • Ex. (see the sketch below):
    s1 → {t1,1, t1,2, t1,3}
    s2 → {t2,1, t2,2}
    s3 → {t3,1}
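For concreteness, these candidate sets can be held in a
plain mapping; a minimal sketch in Python (the term names
are placeholders, not from the paper), reused by the code
sketch after slide 10:

    candidates = {
        "s1": ["t1,1", "t1,2", "t1,3"],
        "s2": ["t2,1", "t2,2"],
        "s3": ["t3,1"],
    }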

6
TRANSLATION SELECTION
  • Computing co-occurrence statistics for a larger
    number of terms induces a data-sparseness issue
  • Use very large corpora (Web)
  • Apply smoothing techniques

7
ITERATIVE DISAMBIGUATION
  • We only examine pairs of terms, in order to gather
    partial evidence for the likelihood of a
    translation in a given context (avoiding the
    sparseness of higher-order co-occurrences)

8
ITERATIVE DISAMBIGUATION
  • Assume that ti,1 occurs more frequently with tj,1
    than any other pair of translation candidates for
    si and sj
  • On the other hand, assume that ti,1 and tj,1 do not
    co-occur with tk,1 at all, but ti,2 and tj,2 do
  • Which pair should be preferred:
  • ti,1 and tj,1, or ti,2 and tj,2?
  • Iterating the weight computation resolves this,
    since the evidence from tk,1 can propagate to
    ti,2 and tj,2

9
ITERATIVE DISAMBIGUATION
  • Associate a weight with each translation candidate
    (t is a translation candidate for si)
  • Each term weight is recomputed based on two
    different inputs: the weights of the terms that
    link to the term, and the link weights, where
    w_L(t, t') is the link weight between t and t'
    (formalized below)
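A plausible formalization of this update, reconstructed
from the slide's wording (the paper's exact formula may
differ in detail): weights start out uniform over each
candidate set tr(si) and are then reinforced by linked
candidates of the other query terms:

    w_0(t \mid s_i) = \frac{1}{|tr(s_i)|},
    \qquad
    w_n(t \mid s_i) = w_{n-1}(t \mid s_i)
      + \sum_{t' \in inlink(t)} w_{n-1}(t') \cdot w_L(t, t')

where inlink(t) denotes the translation candidates of the
other query terms, and w_L(t, t') is their association
strength estimated from the monolingual target-language
corpus.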

10
ITERATIVE DISAMBIGUATION
  • Normalize the term weights
  • The iteration stops once the changes in term
    weights become smaller than some threshold
    (w_T is the vector of all term weights; v_k is the
    k-th element of the vector); a code sketch of the
    full loop follows
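A minimal runnable sketch of the complete loop (update,
normalization, stopping), assuming Dice-coefficient link
weights computed from document frequencies in the
monolingual target-language corpus; the helper names
(doc_freq, joint_freq) and the sum-normalization are
illustrative choices, not taken from the paper:

    def dice(t1, t2, doc_freq, joint_freq):
        # Dice coefficient as the link weight w_L(t, t'), estimated
        # from (joint) document frequencies in a monolingual corpus.
        denom = doc_freq.get(t1, 0) + doc_freq.get(t2, 0)
        return 2.0 * joint_freq.get(frozenset((t1, t2)), 0) / denom if denom else 0.0

    def iterative_disambiguation(candidates, doc_freq, joint_freq,
                                 threshold=1e-4, max_iters=50):
        # candidates maps each source term si to its list tr(si) of
        # target-language translation candidates (as on slide 5).
        weights = {s: {t: 1.0 / len(ts) for t in ts}
                   for s, ts in candidates.items()}
        for _ in range(max_iters):
            new_weights = {}
            for s, ts in candidates.items():
                new_weights[s] = {}
                for t in ts:
                    # Partial evidence from the candidates of all
                    # *other* source terms.
                    support = sum(w * dice(t, t2, doc_freq, joint_freq)
                                  for s2 in candidates if s2 != s
                                  for t2, w in weights[s2].items())
                    new_weights[s][t] = weights[s][t] + support
                # Normalize within each candidate set (one possible choice).
                total = sum(new_weights[s].values())
                for t in ts:
                    new_weights[s][t] /= total
            # Stop once the largest change in any term weight is small.
            delta = max(abs(new_weights[s][t] - weights[s][t])
                        for s in candidates for t in candidates[s])
            weights = new_weights
            if delta < threshold:
                break
        return weights

On the example of slide 8, the co-occurrence of ti,2 and
tj,2 with tk,1 feeds weight back into ti,2 and tj,2 across
iterations, so they can eventually overtake ti,1 and tj,1
despite the latter pair's higher direct co-occurrence.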

11
ITERATIVE DISAMBIGUATION
  • There are a number of ways to compute the
    association strength between two terms
    (standard definitions below):
  • Mutual information (MI)
  • Dice coefficient
  • Log-likelihood ratio
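The slide names the measures without giving formulas;
their standard textbook definitions, in terms of the
(co-)occurrence frequencies f and probabilities P
estimated from the monolingual corpus (not copied from
the paper), are:

    \mathrm{MI}(t, t') = \log \frac{P(t, t')}{P(t)\, P(t')},
    \qquad
    \mathrm{Dice}(t, t') = \frac{2\, f(t, t')}{f(t) + f(t')},
    \qquad
    \mathrm{LLR}(t, t') = 2 \sum_{i,j} O_{ij} \log \frac{O_{ij}}{E_{ij}}

where O_{ij} and E_{ij} are the observed and expected
counts in the 2x2 co-occurrence contingency table of
t and t'.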

12
ITERATIVE DISAMBIGUATION
  • Example

13
EXPERIMENT Set-Up
  • Test Data
  • CLEF 2003 English to German bilingual data
  • Contains 60 topics, four of which were removed by
    the CLEF organizers because they have no relevant
    documents
  • Each topic has a title, a description, and a
    narrative field; for our experiments, we used
    only the title field to formulate the queries

14
EXPERIMENT Set-Up
  • Morphological normalization
  • Since the dictionary only contains base forms,
    the words in the topics must be mapped to their
    respective base forms as well
  • Compounds are very frequent in German
  • Instead of decompounding, we use character
    5-grams, an approach that yields almost the same
    retrieval performance as decompounding (a sketch
    follows)
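A minimal sketch of the character 5-gram alternative to
decompounding (the function name and details are
illustrative assumptions, not from the paper):

    def char_ngrams(word, n=5):
        # Overlapping character n-grams; words shorter
        # than n are kept whole.
        if len(word) <= n:
            return [word]
        return [word[i:i + n] for i in range(len(word) - n + 1)]

    # char_ngrams("Autobahn") -> ['Autob', 'utoba', 'tobah', 'obahn']

Overlapping n-grams let the parts of a German compound
share substrings with related words without an explicit
decompounding step.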

15
EXPERIMENT Set-Up
  • Example topics
  • Intermediate results of the query formulation
    process

16
EXPERIMENT Set-Up
  • Retrieval Model - Lnu.ltc weighting scheme
  • We used sl = 0.1; pv is the average number of
    unique words per document; uw_d refers to the
    number of unique words in document d; w(i) is the
    weight of term i (the weighting formula is
    reconstructed below)
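For reference, the standard Lnu document-term weight with
pivoted unique-word normalization (as defined by Singhal
et al.; reconstructed here, not copied from the slide) is:

    w_d(i) = \frac{(1 + \log tf_{i,d}) / (1 + \log avgtf_d)}
                  {(1 - sl) \cdot pv + sl \cdot uw_d}

with sl = 0.1, pv the average number of unique words per
document, and uw_d the number of unique words in document
d; query terms receive ltc weights, i.e.
(1 + \log tf) \cdot \log(N / df), cosine-normalized.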

17
Experimental Results
18
Experimental Results
  • Individual average precision decreases for a
    number of queries
  • 6% of all English query terms were not in the
    dictionary
  • Unknown words are treated as proper names, and
    the original word from the source language is
    included in the target-language query
  • Ex. the word "Women" is falsely considered a
    proper noun; although faulty translations of this
    type affect both the baseline system and the run
    using term weights, the latter is affected more
    severely

19
RELATED WORK
  • Pirkola's approach does not consider
    disambiguation at all
  • Jang's approach uses MI to re-compute translation
    probabilities for cross-language retrieval
  • It only considers mutual information between
    consecutive terms in the query
  • and does not compute the translation probabilities
    in an iterative fashion

20
RELATED WORK
  • Adriani's approach is similar to the approach by
    Jang
  • but does not benefit from using multiple iterations
  • Gao et al. use a decaying mutual-information score
    in combination with syntactic dependency relations
  • whereas we did not consider distances between words

21
RELATED WORK
  • Maeda et al. compare a number of co-occurrence
    statistics with respect to their usefulness for
    improving retrieval effectiveness
  • They consider all pairs of possible translations
    of words in the query
  • and use co-occurrence information to select
    translations of words from the topic for query
    formulation, instead of re-weighting them

22
RELATED WORK
  • Kikui's approach:
  • only needs a dictionary and monolingual resources
    in the target language
  • and computes the coherence between all possible
    combinations of translation candidates of the
    source terms

23
CONCLUSIONS
  • Introduced a new algorithm for computing
    topic-dependent translation probabilities for
    cross-language information retrieval
  • We experimented with different term association
    measures; experimental results show the
    log-likelihood ratio has the strongest positive
    impact on retrieval effectiveness

24
CONCLUSIONS
  • An important advantage of our approach is that it
    only requires a bilingual dictionary and a
    monolingual corpus
  • An issue that remains open at this point is the
    handling of query terms that are not covered by
    the bilingual dictionary