Information Retrieval and Web Search

Slides: 28
Provided by: cse54
1
Information Retrieval and Web Search
  • Cross-Language Information Retrieval
  • Instructor: Rada Mihalcea
  • Some of the slides are from a course taught by Doug Oard at U. Maryland

2
The General Problem
  • Find documents written in any language
  • Using queries expressed in a single language

3
Why Do Cross-Language IR?
  • When users can read several languages
  • Eliminates multiple queries
  • Query in most fluent language
  • Monolingual users can also benefit
  • If translations can be provided
  • If it suffices to know that a document exists
  • If text captions are used to search for images

4
Top Ten Languages on the Web
Source: internetworldstats.com, March 2011
5
Demand Side: Top Spoken Languages
Source: http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers
6
Search Technology
[Diagram: search pipeline for a Chinese query. Components: Language Identification, Chinese Feature Assignment (shown twice), English Feature Assignment, Monolingual Chinese Matching (example ranked output: 1 0.72, 2 0.48), Cross-Language Matching (example ranked output: 3 0.91, 4 0.57, 5 0.36)]
7
Language Identification
  • Can be specified using metadata
  • Included in HTTP and HTML
  • Can be determined using word-scale features
  • Which dictionary gets the most hits?
  • Can be determined using subword features
  • Letter n-grams, for example
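
The subword approach on this slide can be sketched in a few lines. This is a minimal illustration, not a production identifier: the two training sentences, the trigram size, and the cosine scoring are all assumptions chosen for brevity, and the names `char_ngrams`, `PROFILES`, and `identify_language` are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count overlapping letter n-grams, padding with a space at each end."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    """Cosine similarity between two n-gram count profiles."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical, tiny training samples; a real profile needs far more text.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sleeps in the house",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme en la casa",
}
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}

def identify_language(text):
    """Return the language whose trigram profile is most similar to the text."""
    grams = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))
```

The word-scale variant ("which dictionary gets the most hits?") would replace the trigram profiles with per-language word lists and count matches instead.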

8
Design Decisions
  • What to index?
  • Free text or controlled vocabulary
  • What to translate?
  • Queries or documents
  • Where to get translation knowledge?

9
Query Vector Translation
[Diagram: Chinese Query Features -> Query (Vector) Translation -> Monolingual English Matching against English Document Features -> ranked results (3 0.91, 4 0.57, 5 0.36)]
10
Document Vector Translation
[Diagram: English Document Features -> Document (Vector) Translation -> Monolingual Chinese Matching against Chinese Query Features -> ranked results (3 0.91, 4 0.57, 5 0.36)]
11
Matching Interlingual Representations
[Diagram: Chinese Query Features -> Query Folding In; English Document Features -> Document Folding In; both feed Interlingual Matching -> ranked results (3 0.91, 4 0.57, 5 0.36)]
12
Query vs. Document Translation
  • Query translation
  • Very efficient for short queries
  • Not as big an advantage for relevance feedback
  • Hard to resolve ambiguous query terms
  • Document translation
  • May be needed by the selection interface
  • And supports adaptive filtering well
  • Slow, but only need to do it once per document
  • Poor scale-up to large numbers of languages

13
Cross-Language Text Retrieval
[Diagram: taxonomy of cross-language text retrieval approaches. Node labels: Query Translation, Document Translation, Text Translation, Vector Translation, Controlled Vocabulary, Free Text, Knowledge-based, Corpus-based, Ontology-based, Dictionary-based, Thesaurus-based, Term-aligned, Sentence-aligned, Document-aligned, Unaligned, Parallel, Comparable]
14
Translation Knowledge
  • A lexicon
  • e.g., extract term list from a bilingual
    dictionary
  • Corpora
  • Parallel or comparable, linked or unlinked
  • Algorithmic
  • e.g., transliteration rules, cognate matching
  • The user

15
Types of Lexicons
  • Ontology
  • Representation of concepts and relationships
  • Thesaurus
  • Ontology specialized for retrieval
  • Bilingual lexicon
  • Ontology specialized for machine translation
  • Bilingual dictionary
  • Ontology specialized for human translation

16
Multilingual Thesauri
  • Adapt the knowledge structure
  • Cultural differences influence indexing choices
  • Use language-independent descriptors
  • Matched to a unique term in each language
  • Three construction techniques
  • Build it from scratch
  • Translate an existing thesaurus
  • Merge monolingual thesauri

17
Machine Readable Dictionaries
  • Based on printed bilingual dictionaries
  • Becoming widely available
  • Used to produce bilingual term lists
  • Cross-language term mappings are accessible
  • Sometimes listed in order of most common usage
  • Some knowledge structure is also present
  • Hard to extract and represent automatically
  • The challenge is to pick the right translation

18
Unconstrained Query Translation
  • Replace each word with every translation
  • Typically 5-10 translations per word
  • About 50% of monolingual effectiveness
  • Ambiguity is a serious problem
  • Example: Fly (English)
  • 8 word senses (e.g., to fly a flag)
  • 13 Spanish translations (enarbolar, ondear, ...)
  • 38 English retranslations (hoist, brandish, lift, ...)
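
As a rough sketch of what "replace each word with every translation" means in practice, the snippet below expands a query against a bilingual term list. The term list is a hypothetical toy: only "enarbolar" and "ondear" come from the slide, and the other entries are added for illustration. The function name `translate_query` is likewise an assumption.

```python
# Hypothetical toy English -> Spanish term list. Only "enarbolar" and "ondear"
# appear on the slide; the remaining entries are illustrative additions.
TERM_LIST = {
    "fly": ["volar", "enarbolar", "ondear", "mosca"],
    "flag": ["bandera"],
}

def translate_query(query, term_list):
    """Replace every query word with all of its listed translations."""
    translated = []
    for word in query.lower().split():
        # Words missing from the term list (e.g. proper names) pass through unchanged.
        translated.extend(term_list.get(word, [word]))
    return translated

expanded = translate_query("fly flag", TERM_LIST)
```

Here the two-word query grows to five target terms, and three of the four translations of "fly" belong to senses the user may not have meant, which is the ambiguity problem the slide describes.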

19
Phrase Indexing
  • Improves retrieval effectiveness two ways
  • Phrases are less ambiguous than single words
  • Idiomatic phrases translate as a single concept
  • Three ways to identify phrases
  • Semantic (e.g., appears in a dictionary)
  • Syntactic (e.g., parse as a noun phrase)
  • Co-occurrence (words found together often)
  • Semantic phrase results are impressive

20
Types of Bilingual Corpora
  • Parallel corpora: translation-equivalent pairs
  • Document pairs
  • Sentence pairs
  • Term pairs
  • Comparable corpora
  • Content-equivalent document pairs
  • E.g., newspaper articles in different languages,
    published on the same day about the same event
  • Unaligned corpora
  • Content from the same domain

21
How to Use Bilingual Corpora?
  • Pseudo-relevance feedback
  • Enter query terms in Spanish
  • Find top Spanish documents in parallel corpus
  • Construct a query from English translations
  • Perform a monolingual free text search

[Diagram: the same process shown for French. Labels: French Query Terms, French Text Retrieval System, Parallel Corpus, Top ranked French Documents, English Translations, Alta Vista, English Web Pages]
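
The four steps on this slide can be sketched as follows, using Spanish as in the bullets. Everything here is an assumption made for illustration: the three-pair parallel corpus, the stopword list, the overlap-count ranking, and the function name `cross_language_query`. A real system would use a proper retrieval model for the first step and term weighting for the third.

```python
from collections import Counter

# Hypothetical parallel corpus of (Spanish document, English translation) pairs.
PARALLEL = [
    ("el banco central sube los tipos de interés", "the central bank raises interest rates"),
    ("el banco del parque es de madera", "the park bench is made of wood"),
    ("los tipos de interés del banco central bajan", "central bank interest rates fall"),
]
STOPWORDS = {"the", "of", "is", "made"}  # illustrative, not a real stopword list

def cross_language_query(spanish_query, parallel, top_docs=2, top_terms=4):
    """Build an English query from the translations of the top Spanish documents."""
    query_terms = set(spanish_query.lower().split())
    # Steps 1-2: rank the Spanish sides by how many query terms they contain.
    ranked = sorted(parallel,
                    key=lambda pair: len(query_terms & set(pair[0].split())),
                    reverse=True)
    # Step 3: collect the most frequent English terms from the top documents.
    counts = Counter()
    for _, english in ranked[:top_docs]:
        counts.update(w for w in english.split() if w not in STOPWORDS)
    return [term for term, _ in counts.most_common(top_terms)]

english_query = cross_language_query("banco central interés", PARALLEL)
```

The resulting term list would then be submitted to an ordinary monolingual English search engine (step 4). Note that the ambiguous word "banco" is resolved by context: the "bench" document ranks last, so "bench" never enters the English query.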
22
How to Use Bilingual Corpora?
  • Count how often each term occurs in each pair
  • Treat each pair as a single document

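[Table: term counts per aligned document pair. Columns: English terms E1 E2 E3 E4 E5 and Spanish terms S1 S2 S3 S4. Each row lists the four counts that survive; their column positions are not recoverable.]
Doc 1: 4, 2, 2, 1
Doc 2: 8, 4, 4, 2
Doc 3: 2, 2, 1, 2
Doc 4: 2, 1, 2, 1
Doc 5: 4, 1, 2, 1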
23
Similarity-Based Dictionaries
  • Automatically developed from aligned documents
  • Terms E1 and E3 are used in similar ways
  • Terms E1 and S1 (or E3 and S4) are even more similar
  • For each term, find the most similar terms in the other
    language
  • Retain only the top few (5 or so)
  • Performs as well as dictionary-based techniques
  • Evaluated on a comparable corpus of news stories
  • Stories were automatically linked based on date
    and subject
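
One way to read "find most similar in other language" is as a nearest-neighbour search over term vectors, where each term is represented by its counts across the aligned document pairs. The sketch below assumes cosine similarity and uses hypothetical counts (the slide's own table did not survive extraction intact); `COUNTS` and `most_similar` are illustrative names.

```python
import math

# Hypothetical counts: one entry per aligned document pair, per term.
COUNTS = {
    "E1": [4, 8, 0, 0],
    "E2": [0, 0, 2, 2],
    "S1": [2, 4, 0, 0],
    "S2": [0, 0, 1, 2],
}

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def most_similar(term, candidates, counts, top_k=1):
    """Return the top_k candidate terms whose usage pattern best matches `term`."""
    ranked = sorted(candidates,
                    key=lambda c: cosine(counts[term], counts[c]),
                    reverse=True)
    return ranked[:top_k]
```

With these counts, `most_similar("E1", ["S1", "S2"], COUNTS)` selects `S1`, because E1 and S1 occur in the same document pairs in the same proportions. Setting `top_k` to about 5 corresponds to the slide's "retain only the top few".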

24
Sentence-Aligned Parallel Corpora
  • Easily constructed from aligned documents
  • Match pattern of relative sentence lengths
  • Not yet used directly for effective retrieval
  • But all experiments have included domain shift
  • Good first step for term alignment
  • Sentences define a natural context
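
Matching the pattern of relative sentence lengths can be illustrated with a small dynamic program. This is a deliberately simplified sketch: it scores each candidate grouping by the absolute difference in character length plus a fixed penalty, whereas published length-based aligners use a probabilistic cost. The bead shapes, the penalty values, and the name `align_by_length` are assumptions.

```python
def align_by_length(src, tgt):
    """Align two sentence lists by minimising total length mismatch."""
    # Allowed groupings: (source sentences used, target sentences used, penalty).
    beads = [(1, 1, 0), (1, 0, 30), (0, 1, 30), (2, 1, 10), (1, 2, 10)]
    inf = float("inf")
    n, m = len(src), len(tgt)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                candidate = cost[i][j] + abs(len_s - len_t) + penalty
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the aligned groups.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return pairs[::-1]
```

Each returned pair holds the source sentences and target sentences judged to be mutual translations, which gives term alignment the sentence-level context mentioned on the slide.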

25
Co-occurrence-Based Translation
  • Align terms using co-occurrence statistics
  • How often does a term pair occur in sentence pairs?
  • Weighted by relative position in the sentences
  • Retain term pairs that occur unusually often
  • Useful for query translation
  • Excellent results when the domain is the same
  • Also practical for document translation
  • Term usage reinforces good translations
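
A minimal sketch of co-occurrence-based term alignment is given below. It assumes the Dice coefficient as the measure of "unusually often" and omits the position weighting mentioned on the slide; the threshold and the name `cooccurrence_translations` are illustrative choices, not the method used in the experiments the slide refers to.

```python
from collections import Counter
from itertools import product

def cooccurrence_translations(sentence_pairs, min_score=0.6):
    """Map each source term to its best-scoring target term by Dice coefficient."""
    src_freq, tgt_freq, pair_freq = Counter(), Counter(), Counter()
    for src, tgt in sentence_pairs:
        s_terms, t_terms = set(src.split()), set(tgt.split())
        src_freq.update(s_terms)   # sentence pairs containing each source term
        tgt_freq.update(t_terms)   # sentence pairs containing each target term
        pair_freq.update(product(s_terms, t_terms))  # joint occurrences
    best = {}
    for (s, t), joint in pair_freq.items():
        dice = 2 * joint / (src_freq[s] + tgt_freq[t])
        # Keep a pair only if it clears the threshold and beats the current best.
        if dice >= min_score and dice > best.get(s, ("", 0.0))[1]:
            best[s] = (t, dice)
    return best
```

On a toy corpus such as `[("red house", "casa roja"), ("big house", "casa grande")]`, "house" and "casa" co-occur in every pair and receive a Dice score of 1.0, while the adjectives score lower because they each co-occur with "casa" only once.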

26
Exploiting Unaligned Corpora
  • Documents about the same set of subjects
  • No known relationship between document pairs
  • Easily available in many applications
  • Two approaches
  • Use a dictionary for rough translation
  • But refine it using the unaligned bilingual
    corpus
  • Use a dictionary to find alignments in the corpus
  • Then extract translation knowledge from the
    alignments

27
CLIR Evaluation Resources
  • Electronic texts
  • Text Retrieval Conference
  • Topic Detection and Tracking
  • Document images
  • No evaluation programs yet
  • Recorded speech
  • Topic Detection and Tracking
  • Sign language
  • No evaluation programs yet
  • Cross-language question answering
  • CLEF Evaluation
  • http://www.clef-campaign.org/