Cross Language Information Retrieval (CLIR)

Author: Miguel Ruiz
Last modified by: Lab-301
Created: 2/12/2003 4:51:16 PM
Format: PowerPoint presentation

1
Cross Language Information Retrieval (CLIR)

Modern Information Retrieval
Sharif University of Technology
Mohsen Jamali

2
The General Problem

Find documents written in any language
Using queries expressed in a single language

3
The General Problem (cont)

Traditional IR identifies relevant documents in
the same language as the query (monolingual IR)
Cross-language information retrieval (CLIR) tries
to identify relevant documents in a language
different from that of the query
This problem is more and more acute for IR on the
Web due to the fact that the Web is a truly
multilingual environment

4
Why is CLIR important?
5
Characteristics of the WWW

Country of Origin of Public Web Sites, 2001 (% of
Total) (OCLC Web Characterization, 2001)

6
Global Internet User Population
[Figure: charts of Internet user population by language, 2000 and 2005; labeled segments include English and Chinese. Source: Global Reach]
7
Importance of CLIR

CLIR research is becoming more and more important
for global information exchange and knowledge
sharing.
National Security
Foreign Patent Information Access
Medical Information Access for Patients

8
CLIR is Multidisciplinary

CLIR involves researchers from the following
fields:
information retrieval, natural language
processing, machine translation and
summarization, speech processing, document
image understanding, human-computer
interaction

9
User Needs

Search a monolingual collection in a language
that the user cannot read.
Retrieve information from a multilingual
collection using a query in a single language.
Select images from a collection indexed with free
text captions in an unfamiliar language.
Locate documents in a multilingual collection of
scanned page images.

10
Why Do Cross-Language IR?

When users can read several languages
Eliminates multiple queries
Query in most fluent language
Monolingual users can also benefit
If translations can be provided
If it suffices to know that a document exists
If text captions are used to search for images

11
Language Identification

Can be specified using metadata
Included in HTTP and HTML
Determined using word-scale features
Which dictionary gets the most hits?
Determined using subword features
Letter n-grams in electronic and printed text
Phoneme n-grams in speech
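
The subword approach above can be sketched with letter trigrams. This is a minimal illustration, assuming two tiny made-up training texts and a simple overlap score; the names char_ngrams, overlap and identify_language are hypothetical, and a usable identifier would need far more training text per language.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return a Counter of overlapping letter n-grams, padded with spaces."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def overlap(profile, sample):
    """Count the n-gram occurrences in the sample that also appear in the profile."""
    return sum(count for gram, count in sample.items() if gram in profile)

# Tiny illustrative training texts (assumed, far too small for real use).
TRAINING = {
    "english": "the quick brown fox jumps over the lazy dog and the cat",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso y el gato",
}
PROFILES = {lang: char_ngrams(text) for lang, text in TRAINING.items()}

def identify_language(text):
    """Pick the language whose n-gram profile overlaps most with the text."""
    sample = char_ngrams(text)
    return max(PROFILES, key=lambda lang: overlap(PROFILES[lang], sample))

print(identify_language("the dog and the fox"))
print(identify_language("el perro y el gato"))
```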

12
Language Encoding Standards

Language (alphabet) specific native encodings
Chinese: GB, Big5, ...
Western European: ISO-8859-1 (Latin-1)
Russian: KOI-8, ISO-8859-5, CP-1251
Unicode (ISO/IEC 10646)
UTF-8: variable length, 1 to 4 bytes per character
UTF-16: 2 or 4 bytes per character; UCS-2: fixed 2 bytes
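
To make the byte lengths concrete, the following sketch encodes a few characters with Python's built-in codecs. The helper names encoded_lengths and legacy_bytes are assumptions for illustration; the codec names are the ones Python ships with.

```python
def encoded_lengths(ch):
    """Return the byte length of one character in UTF-8 and UTF-16 (no BOM)."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

def legacy_bytes(ch, codecs):
    """Encode one character with each named legacy codec."""
    return {name: ch.encode(name) for name in codecs}

# ASCII letter, accented Latin letter, CJK ideograph, and an emoji outside
# the Basic Multilingual Plane (which UCS-2 cannot represent).
for ch in ["A", "\u00e9", "\u4e2d", "\U0001F600"]:
    print(repr(ch), encoded_lengths(ch))

# The same Cyrillic letter gets a different byte value in each Russian encoding.
print(legacy_bytes("\u042f", ["koi8_r", "iso8859_5", "cp1251"]))

# The same Chinese character is two bytes in both GB2312 and Big5.
print(legacy_bytes("\u4e2d", ["gb2312", "big5"]))
```

A retrieval system therefore has to know (or detect) the encoding of every document and query before any term matching can work.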

13
CLIR Experimental System

Two systems:
SMART: information retrieval system modified to
work with 11 European languages (Danish, Dutch,
English, Finnish, French, German, Italian,
Norwegian, Portuguese, Spanish, Swedish)
Generation of restricted bigrams
Pseudo-relevance feedback
TAPIR: a language model IR system written by M.
Srikanth. It has been adapted to work with 12
European languages (Danish, Dutch,
English, Finnish, French, German, Italian,
Norwegian, Portuguese, Russian, Spanish, Swedish)
Stemming using Porter's stemmer
Translation using InterTran (http://www.tranexp.com:2000/InterTran)

14
Approaches to CLIR
15
Design Decisions

What to index?
Free text or controlled vocabulary
What to translate?
Queries or documents
Where to get translation knowledge?
Dictionary, ontology, training corpus

16
Cross-Language Text Retrieval
[Figure: taxonomy tree of approaches. Node labels, in reading order: Query Translation; Document Translation; Text Translation; Vector Translation; Controlled Vocabulary; Free Text; Knowledge-based; Corpus-based; Ontology-based; Dictionary-based; Term-aligned; Sentence-aligned; Document-aligned; Unaligned; Thesaurus-based; Parallel; Comparable]
17
Early Development

1964: International Road Research Documentation
English, French and German thesaurus
1969: Pevzner
Exact match with a large Russian/English
thesaurus
1970: Salton
Ranked retrieval with small English/German
dictionary
1971: UNESCO
Proposed standard for multilingual thesauri

18
Controlled Vocabulary Matures

1977: IBM STAIRS-TLS
Large-scale commercial cross-language IR
1978: ISO Standard 5964
Guidelines for developing multilingual thesauri
1984: EUROVOC thesaurus
Now includes all 9 EC languages
1985: ISO Standard 5964 revised

19
Free Text Developments

1970, 1973: Salton
Hand-coded bilingual term lists
1990: Latent Semantic Indexing
1994: European multilingual IR project
First precision/recall evaluation
1996: SIGIR Cross-lingual IR workshop
1998: EU/NSF digital library working group

20
Query vs. Document Translation

Query translation
Very efficient for short queries
Not as big an advantage for relevance feedback
Hard to resolve ambiguous query terms
Document translation
May be needed by the selection interface
And supports adaptive filtering well
Slow, but only need to do it once per document
Poor scale-up to large numbers of languages

21
Document Translation Example

Approach
Select a single query language
Translate every document into that language
Perform monolingual retrieval
Long documents provide enough context
And many translation errors do not hurt retrieval
Much of the generation effort is wasted
And choosing a single translation can hurt

22
Text Translation

One weakness of present fully automatic machine
translation systems is that they produce
high-quality translations only in limited domains
Text retrieval systems are typically more
tolerant of syntactic than of semantic translation
errors, but semantic accuracy suffers when
insufficient domain knowledge is encoded into the
translation system
In fact, some of the work done by a machine
translation system could actually reduce some
measures of retrieval effectiveness

23
Query Translation Example

Select controlled vocabulary search terms
Retrieve documents in desired language
Form monolingual query from the documents
Perform a monolingual free text search

[Figure: data-flow diagram. Labels: Information Need; French Query Terms; Controlled Vocabulary Multilingual Text Retrieval System; Thesaurus; English Abstracts; Alta Vista; English Web Pages]
24
Query Translation
An English-Chinese CLIR System
[Figure: system diagram. Labels: Queries (E); Queries (C); Chinese Documents; Results (C); Results (E)]
25
Controlled Vocabulary

A controlled vocabulary information retrieval
system can be very useful in the hands of a
skilled searcher, but end users often find free
text searching to be more helpful.
Experience has shown that, although the domain
knowledge encoded in a thesaurus permits
experienced users to form more precise queries,
casual and intermittent users have
difficulty exploiting the expressive power of a
traditional query interface in exact match
retrieval systems.
Controlled vocabulary text retrieval systems are
widely used in libraries, and user needs
assessment has received considerable attention
from library and information science researchers.

26
Knowledge-based Techniques for Free Text Searching
27
Knowledge Structures for IR

Ontology
Representation of concepts and relationships
Thesaurus
Ontology specialized for retrieval
Bilingual lexicon
Ontology specialized for machine translation
Bilingual dictionary
Ontology specialized for human translation

28
Machine Readable Dictionaries

Based on printed bilingual dictionaries
Becoming widely available
Used to produce bilingual term lists
Cross-language term mappings are accessible
Sometimes listed in order of most common usage
Some knowledge structure is also present
Hard to extract and represent automatically
The challenge is to pick the right translation

29
CLIR Dictionary Based

Problems
Limitations of dictionaries
Inflected word forms
Phrases and compound words
Lexical ambiguity
Possible solution
Approximate string matching

30
Unconstrained Query Translation

Replace each word with every translation
Typically 5-10 translations per word
About 50% of monolingual effectiveness
Ambiguity is a serious problem
Example: "fly" (English)
8 word senses (e.g., to fly a flag)
13 Spanish translations (enarbolar, ondear, ...)
38 English retranslations (hoist, brandish, lift, ...)
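
Unconstrained query translation can be sketched as a plain dictionary lookup. The term list below is a hypothetical toy (only enarbolar and ondear come from the slide), and translate_unconstrained is an assumed name, not part of any system described here.

```python
# Hypothetical toy English-to-Spanish term list; entries are illustrative only.
TERM_LIST = {
    "fly": ["volar", "enarbolar", "ondear", "mosca"],
    "flag": ["bandera"],
}

def translate_unconstrained(query, term_list):
    """Replace each query word with every listed translation.

    Words missing from the term list are kept unchanged.
    """
    translated = []
    for word in query.lower().split():
        translated.extend(term_list.get(word, [word]))
    return translated

print(translate_unconstrained("fly flag", TERM_LIST))
```

The translated query mixes the insect, flying, and flag-hoisting senses, which is how ambiguity dilutes the query.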

32
Exploiting Part-of-Speech Tags

Constrain translations by part of speech
Noun, verb, adjective, ...
Effective taggers are available
Works well when queries are full sentences
Short queries provide little basis for tagging
Constrained matching can hurt monolingual IR
Nouns in queries often match verbs in documents
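
A part-of-speech constraint can be sketched by keying the term list on (word, tag) pairs. The tags are assumed to come from an external tagger, and the term list and function name are hypothetical.

```python
# Hypothetical toy term list keyed by (word, part of speech).
TAGGED_TERM_LIST = {
    ("fly", "verb"): ["volar", "ondear"],
    ("fly", "noun"): ["mosca"],
}

def translate_tagged(tagged_query, term_list):
    """Translate (word, tag) pairs, keeping only translations with a matching tag.

    Pairs missing from the term list keep the original word.
    """
    translated = []
    for word, tag in tagged_query:
        translated.extend(term_list.get((word.lower(), tag), [word]))
    return translated

# The tags here are supplied by hand; a real system would run a tagger first.
print(translate_tagged([("fly", "noun")], TAGGED_TERM_LIST))
print(translate_tagged([("fly", "verb")], TAGGED_TERM_LIST))
```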

33
Phrase Indexing

Improves retrieval effectiveness two ways
Phrases are less ambiguous than single words
Idiomatic phrases translate as a single concept
Three ways to identify phrases
Semantic (e.g., appears in a dictionary)
Syntactic (e.g., parse as a noun phrase)
Cooccurrence (words found together often)
Semantic phrase results are impressive

34
Corpus-based Techniques for Free Text Searching
35
Types of Bilingual Corpora

Parallel corpora: translation-equivalent pairs
Document pairs
Sentence pairs
Term pairs
Comparable corpora
Content-equivalent document pairs
Unaligned corpora
Content from the same domain

36
Pseudo-Relevance Feedback

Enter query terms in French
Find top French documents in parallel corpus
Construct a query from English translations
Perform a monolingual free text search

[Figure: data-flow diagram. Labels: French Query Terms; French Text Retrieval System; Parallel Corpus; Top ranked French Documents; English Translations; Alta Vista; English Web Pages]
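
The four steps of this slide can be sketched on a toy parallel corpus. Everything here is an assumption for illustration: the sentence-sized documents, the stopword list, the raw term-count ranking, and the name cross_language_query.

```python
from collections import Counter

# Hypothetical sentence-sized "documents"; each pair is (French, English).
PARALLEL_CORPUS = [
    ("le chat noir dort", "the black cat sleeps"),
    ("le chat mange le poisson", "the cat eats the fish"),
    ("la bourse monte aujourd'hui", "the stock market rises today"),
]
STOPWORDS = {"the"}

def score(query_terms, text):
    """Rank by raw count of query terms in the text (a stand-in for a real ranker)."""
    words = text.split()
    return sum(words.count(term) for term in query_terms)

def cross_language_query(french_query, corpus, top_docs=2, top_terms=3):
    """Build an English query from the translations of the top French documents."""
    terms = french_query.lower().split()
    ranked = sorted(corpus, key=lambda pair: score(terms, pair[0]), reverse=True)
    counts = Counter()
    for french, english in ranked[:top_docs]:
        if score(terms, french) > 0:  # ignore documents that match nothing
            counts.update(w for w in english.split() if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_terms)]

print(cross_language_query("chat", PARALLEL_CORPUS))
```

The resulting English terms would then be submitted to a monolingual search engine, which is the role Alta Vista plays in the slide.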
37
Learning From Document Pairs

Count how often each term occurs in each pair
Treat each pair as a single document

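English Terms: E1 E2 E3 E4 E5
Spanish Terms: S1 S2 S3 S4
[Table: term counts for each document pair. The column position of each count was not preserved; counts are listed in row order.]
Doc 1: 4, 2, 2, 1
Doc 2: 8, 4, 4, 2
Doc 3: 2, 2, 1, 2
Doc 4: 2, 1, 2, 1
Doc 5: 4, 1, 2, 1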
38
Similarity-Based Dictionaries

Automatically developed from aligned documents
Terms E1 and E3 are used in similar ways
Terms E1 and S1 (or E3 and S4) are even more similar
For each term, find the most similar terms in the
other language
Retain only the top few (5 or so)
Performs as well as dictionary-based techniques
Evaluated on a comparable corpus of news stories
Stories were automatically linked based on date
and subject
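
A similarity-based dictionary can be sketched by treating each aligned pair as one document and comparing term vectors across languages. The counts below are hypothetical and are not the table from the slides; cosine similarity is one reasonable choice of measure, assumed here.

```python
import math

# Hypothetical counts: each dict is one aligned document pair treated as a
# single document. These numbers are illustrative, not the slide's table.
PAIR_COUNTS = [
    {"house": 3, "casa": 3, "dog": 0, "perro": 0},
    {"house": 1, "casa": 2, "dog": 2, "perro": 2},
    {"house": 0, "casa": 0, "dog": 4, "perro": 3},
]

def term_vector(term):
    """Describe a term by its count in every document pair."""
    return [pair.get(term, 0) for pair in PAIR_COUNTS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_dictionary(source_terms, target_terms, keep=1):
    """For each source term, retain the `keep` most similar target terms."""
    dictionary = {}
    for source in source_terms:
        ranked = sorted(
            target_terms,
            key=lambda target: cosine(term_vector(source), term_vector(target)),
            reverse=True,
        )
        dictionary[source] = ranked[:keep]
    return dictionary

print(similarity_dictionary(["house", "dog"], ["casa", "perro"]))
```

The slide suggests keeping about 5 candidates per term; keep=1 is used here only because the toy vocabulary is so small.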

39
Generalized Vector Space Model

Term space of each language is different
But the document space for a corpus is the same
Describe new documents based on the corpus
Vector of cosine similarity to each corpus
document
Easily generated from a vector of term weights
Multiply by the term-document matrix
Compute cosine similarity in document space
Excellent results when the domain is the same
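
The document-space idea can be sketched as follows: each text is described by its cosine similarity to every training document in its own language, and the two descriptions are then compared. The aligned toy corpus and function names are assumptions, and raw term counts stand in for real term weights.

```python
import math
from collections import Counter

# Hypothetical aligned training corpus: document i in each list is a translation pair.
ENGLISH_DOCS = ["cat fish", "stock market", "cat market"]
SPANISH_DOCS = ["gato pescado", "bolsa mercado", "gato mercado"]

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0) for key, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def to_document_space(text, corpus_docs):
    """Vector of cosine similarities between the text and each corpus document."""
    vector = Counter(text.split())
    return {i: cosine(vector, Counter(doc.split())) for i, doc in enumerate(corpus_docs)}

def cross_language_similarity(english_text, spanish_text):
    """Compare texts in different languages through the shared document space."""
    english_vec = to_document_space(english_text, ENGLISH_DOCS)
    spanish_vec = to_document_space(spanish_text, SPANISH_DOCS)
    return cosine(english_vec, spanish_vec)

print(cross_language_similarity("cat", "gato"))
print(cross_language_similarity("cat", "bolsa"))
```

No bilingual dictionary appears anywhere: the only cross-language knowledge is the alignment of the training documents.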

40
Latent Semantic Indexing

Designed for better monolingual effectiveness
Works well across languages too
Cross-language is just a type of term choice
variation
Produces short dense document vectors
Better than long sparse ones for adaptive
filtering
Training data needs grow with dimensionality
Not as good for retrieval efficiency
Always 300 multiplications, even for short queries

41
Sentence-Aligned Parallel Corpora

Easily constructed from aligned documents
Match pattern of relative sentence lengths
Not yet used directly for effective retrieval
But all experiments have included domain shift
Good first step for term alignment
Sentences define a natural context

42
Cooccurrence-Based Translation

Align terms using cooccurrence statistics
How often does a term pair occur in sentence pairs?
Weighted by relative position in the sentences
Retain term pairs that occur unusually often
Useful for query translation
Excellent results when the domain is the same
Also practical for document translation
Term usage reinforces good translations
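
A minimal sketch of term alignment from sentence pairs, assuming the Dice coefficient as the measure of "unusually often" and omitting the position weighting mentioned above. The sentence pairs and function names are hypothetical.

```python
from collections import Counter
from itertools import product

# Hypothetical sentence-aligned corpus: (English, Spanish).
SENTENCE_PAIRS = [
    ("the cat sleeps", "el gato duerme"),
    ("the cat eats", "el gato come"),
    ("the dog sleeps", "el perro duerme"),
]

def dice_scores(pairs):
    """Score every co-occurring term pair with the Dice coefficient."""
    source_count, target_count, joint = Counter(), Counter(), Counter()
    for source, target in pairs:
        source_words, target_words = set(source.split()), set(target.split())
        source_count.update(source_words)
        target_count.update(target_words)
        joint.update(product(source_words, target_words))
    return {
        (s, t): 2 * n / (source_count[s] + target_count[t])
        for (s, t), n in joint.items()
    }

def best_translation(word, scores):
    """Return the target term with the highest score for the word, or None."""
    candidates = {t: value for (s, t), value in scores.items() if s == word}
    return max(candidates, key=candidates.get) if candidates else None

scores = dice_scores(SENTENCE_PAIRS)
print(best_translation("cat", scores), best_translation("dog", scores))
```

Frequent function words such as "el" co-occur with everything, but the normalization by term frequency keeps them from winning.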

43
Exploiting Unaligned Corpora

Documents about the same set of subjects
No known relationship between document pairs
Easily available in many applications
Two approaches
Use a dictionary for rough translation
But refine it using the unaligned bilingual
corpus
Use a dictionary to find alignments in the corpus
Then extract translation knowledge from the
alignments

44
Feedback with Unaligned Corpora

Pseudo-relevance feedback is fully automatic
Augment the query with top ranked documents
Improves recall
Recenters queries based on the corpus
Short queries get the most dramatic improvement
Two opportunities
Query language: improve the query
Document language: suppress translation error

45
Context Linking

Automatically align portions of documents
For each query term
Find translation pairs in corpus using dictionary
Select a context of nearby terms
e.g., +/- 5 words in each language
Choose translations from most similar contexts
Based on cooccurrence with other translation
pairs
No reported experimental results

46
Problems with CLIR

Morphological processing difficult for some
languages (e.g. Arabic)
Many different encodings for Arabic
Windows Arabic (e.g. dictionaries)
Unicode (UTF-8) (e.g. corpus)
Macintosh Arabic (e.g. queries)
Normalization
Remove diacritics
العَرَبِيَّة to العربية: Arabic (language)
Standardize spellings for foreign names
كلينتون vs كلنتون: Kleentoon vs Klntoon for
Clinton
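
Diacritic removal can be sketched with the standard unicodedata module. This assumes that dropping every nonspacing mark is acceptable, which covers the Arabic short-vowel signs and shadda; the function name is hypothetical and the example word is written with escapes so it survives any encoding.

```python
import unicodedata

def strip_diacritics(text):
    """Remove nonspacing combining marks (Unicode category Mn)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# The word for "Arabic (language)" with full diacritics, written as escapes.
vocalized = "\u0627\u0644\u0639\u064e\u0631\u064e\u0628\u0650\u064a\u0651\u064e\u0629"
plain = strip_diacritics(vocalized)
print(len(vocalized), len(plain))
```

Queries, dictionaries and documents all have to pass through the same normalization, otherwise identical words fail to match.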

47
Problems with CLIR (contd)

Morphological processing (contd.)
Arabic stemming
Root + patterns + suffixes + prefixes = word
ktb + CiCaC = kitab
All verbs and nouns derived from fewer than 2000
roots
Roots too abstract for information retrieval
ktb → kitab (a book), kitabi (my book),
alkitab (the book), kitabuki (your book, f.),
kataba (to write), kitabuka (your book, m.),
maktab (office), kitabuhu (his book),
maktaba (library, bookstore), ...
Want stem = root + pattern + derivational affixes?
No standard stemmers available,
only morphological (root) analyzers
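
The root-and-pattern combination can be sketched on transliterated text. Only ktb and CiCaC come from the slide; the patterns maCCaC and CaCaCa are assumptions inferred from the listed words maktab and kataba, and apply_pattern is a hypothetical helper.

```python
def apply_pattern(root, pattern):
    """Fill each 'C' slot in the pattern with the next root consonant."""
    if pattern.count("C") != len(root):
        raise ValueError("pattern must have one 'C' slot per root consonant")
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in pattern)

for pattern in ["CiCaC", "maCCaC", "CaCaCa"]:
    print(pattern, apply_pattern("ktb", pattern))
```

Indexing by the root alone would conflate book, office and to write, which is why the slide calls roots too abstract for retrieval.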

48
Problems with CLIR (contd)

Availability of resources
Names and phrases are very important, but most
lexicons do not have good coverage of them
Bilingual dictionaries are difficult to get hold of,
but can sometimes be found on the Web
e.g., for a recent Arabic cross-lingual evaluation
we used 3 online Arabic-English dictionaries
(including harvesting) and a small lexicon of
country and city names
Parallel corpora are more difficult to obtain and
require more formal arrangements

49
CLIR better than IR?

How can cross-language beat within-language?
We know there are translation errors
Surely those errors should hurt performance
Hypothesis: the translation process may
disambiguate some query terms
Words that are ambiguous in Arabic may not be
ambiguous in English
Expansion during translation from English to
Arabic prevents the ambiguity from reappearing
It has been proposed that CLIR is a model for IR
Translate the query into another language and then
back to the original
Given the hypothesis, this should yield an improved query
It should be reasonable to do this across many
different languages

50
Low Density Languages

Languages for which few on-line resources exist
Rumor has it that 25 languages are well
represented on the Web
An extreme case is "kitchen languages" that are only
spoken
More extreme: a language made up of whistling
The corpus to be searched may also be very small
Bilingual dictionaries often exist in print; may
need to use an interlingua such as French
Some approaches, such as those relying on
translation probabilities, may not work well
Solution depends on specific application

51
Performance Evaluation
52
Constructing Test Collections

One collection for retrospective retrieval
Start with a monolingual test collection
Documents, queries, relevance judgments
Translate the queries by hand
Need 2 collections for adaptive filtering
Monolingual test collection in one language
Plus a document collection in the other language
Generate relevance judgments for the same queries

53
Evaluating Corpus-Based Techniques

Same domain evaluation
Partition a bilingual corpus
Design queries
Generate relevance judgments for evaluation part
Cross-domain evaluation
Can use existing collections and corpora
No good metric for degree of domain shift

54
Evaluation Example

Corpus-based same domain evaluation
Use average precision as figure of merit

Technique                        Cross-lang  Monolingual  Ratio (%)
Cooccurrence-based dictionary    0.43        0.47         91
Pseudo-relevance feedback        0.40        0.44         90
Generalized vector space model   0.38        0.40         95
Latent semantic indexing         0.31        0.37         84
Dictionary-based translation     0.29        0.47         61
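
For reference, the figure of merit can be sketched as non-interpolated average precision for a single query; the slides do not say which variant was used, so that choice is an assumption, and the function names are hypothetical. The ratio column is the cross-language score divided by the monolingual score.

```python
def average_precision(ranked_ids, relevant_ids):
    """Mean of the precision values at the rank of each relevant document."""
    relevant = set(relevant_ids)
    hits, total = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def effectiveness_ratio(cross_language, monolingual):
    """Cross-language effectiveness as a percentage of monolingual effectiveness."""
    return 100 * cross_language / monolingual

# Relevant documents retrieved at ranks 1 and 3.
print(average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"}))
print(effectiveness_ratio(0.43, 0.47))
```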
55
User Interface Design
56
Query Formulation

Interactive word sense disambiguation
Show users the translated query
Retranslate it for monolingual users
Provide an easy way of adjusting it
But don't require that users adjust or approve it

57
Selection and Examination

Document selection is a decision process
Relevance feedback, problem refinement, read it
Based on factors not used by the retrieval system
Provide information to support that decision
May not require very good translations
e.g., Word-by-word title translation
People can read past some ambiguity
May help to display a few alternative translations

58
References

Miguel E. Ruiz. Cross Language Information
Retrieval (CLIR). PowerPoint presentation,
University of Buffalo, 2002.
Douglas W. Oard and Bonnie J. Dorr. A Survey of
Multilingual Text Retrieval. 1996.
Jian-Yun Nie. Cross-Language Information
Retrieval. IEEE Computational Intelligence
Bulletin 2(1): 19-24, 2003.
Preben Hansen, Daniela Petrelli, Jussi Karlgren,
Micheline Beaulieu, and Mark Sanderson.
User-Centered Interface Design for Cross-Language
Information Retrieval. In Proceedings of the
Twenty-fifth Annual International ACM SIGIR
Conference on Research and Development in
Information Retrieval, Tampere, Finland, 2002.
Elizabeth D. Liddy and Anne R. Diekema.
Cross-Language Information Exploitation of
Arabic. PowerPoint presentation, April 2005.