Title: Cross-Language Retrieval and Laboratory

Description: Free Text CLIR. What to translate? Queries or documents. Where to get translation knowledge? ... Document translation. Rapid support for interactive selection ...

Transcript and Presenter's Notes

1
Cross-Language Retrieval (and Laboratory)
  • Philip Resnik
  • University of Maryland

With many slides borrowed from Doug Oard
2
Information Access
Information Use
Translation
Translingual Browsing
Translingual Search
Select
Examine
Query
Document
3
A Little (Confusing) Vocabulary
  • Multilingual document
  • Document containing more than one language
  • Multilingual collection
  • Collection of documents in different languages
  • Multilingual system
  • Can retrieve from a multilingual collection
  • Cross-language system
  • Query in one language finds document in another
  • Translingual system
  • Queries can find documents in any language

4
Who needs Cross-Language Search?
  • When users can read several languages
  • Eliminate multiple queries
  • Query in most fluent language
  • Monolingual users can also benefit
  • If translations can be provided
  • If it suffices to know that a document exists
  • If text captions are used to search for images

5
Motivations
  • Commerce
  • Security
  • Social

6
Global Internet Hosts
Source: Network Wizards, Jan 99 Internet Domain Survey
7

Global Web Page Languages
Source: Jack Xu, Excite@Home, 1999
8
European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
9
European Web Size Projection
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000
10
Global Internet Audio
Almost 2000 Internet-accessible Radio and
Television Stations
Source: www.real.com, Feb 2000
11
13 Months Later
About 2500 Internet-accessible Radio and
Television Stations
Source: www.real.com, Mar 2001
12
User Needs Assessment
  • Who are the potential users?
  • What goals do we seek to support?
  • What language skills must we accommodate?

13
Global Languages
Source: http://www.g11n.com/faq.html
14
Global Trade
Billions of US Dollars (1999)
Source: World Trade Organization 2000 Annual Report
15
Global Internet User Population
2000
2005
English
English
Chinese
Source: Global Reach
16
The Search Process
Author
Choose Document-Language Terms
Query-Document Matching
Document
18
Some History
  • 1964 International Road Research
  • Multilingual thesauri
  • 1970 SMART
  • Dictionary-based free-text cross-language
    retrieval
  • 1978 ISO Standard 5964 (revised 1985)
  • Guidelines for developing multilingual thesauri
  • 1990 Latent Semantic Indexing
  • Corpus-based free-text translingual retrieval

19
Multilingual Thesauri
  • Build a cross-cultural knowledge structure
  • Cultural differences influence indexing choices
  • Use language-independent descriptors
  • Matched to language-specific lead-in vocabulary
  • Three construction techniques
  • Build it from scratch
  • Translate an existing thesaurus
  • Merge monolingual thesauri

20
Free Text CLIR
  • What to translate?
  • Queries or documents
  • Where to get translation knowledge?
  • Dictionary or corpus
  • How to use it?

21
Translingual Retrieval Architecture
Chinese Term Selection
Monolingual Chinese Retrieval
1 0.72 2 0.48
Language Identification
Chinese Term Selection
Chinese Query
English Term Selection
Cross- Language Retrieval
3 0.91 4 0.57 5 0.36
22
Evidence for Language Identification
  • Metadata
  • Included in HTTP and HTML
  • Word-scale features
  • Which dictionary gets the most hits?
  • Subword features
  • Character n-gram statistics
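The subword idea on this slide can be sketched in a few lines: build a character-trigram profile per language and pick the profile most similar to the text. Everything here (the training snippets, the cosine measure) is a toy illustration, not the talk's actual method.

```python
# Minimal character-n-gram language identification sketch.
# Profiles built from toy snippets; cosine over trigram counts.
from collections import Counter
import math

def ngrams(text, n=3):
    text = f" {text.lower()} "          # pad so word edges become features
    return Counter(text[i:i+n] for i in range(len(text) - n + 1))

def cosine(a, b):
    dot = sum(a[g] * b[g] for g in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def identify(text, profiles):
    # Pick the language whose trigram profile is most similar.
    return max(profiles, key=lambda lang: cosine(ngrams(text), profiles[lang]))

profiles = {
    "en": ngrams("the quick brown fox jumps over the lazy dog the end"),
    "de": ngrams("der schnelle braune fuchs springt über den faulen hund"),
}
print(identify("the fox and the hound", profiles))  # en
```

Real systems train the profiles on large corpora and often use rank-order statistics instead of cosine, but the shape of the computation is the same.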

23
Query-Language Retrieval
Chinese Query Terms
English Document Terms
Monolingual Chinese Retrieval
3 0.91 4 0.57 5 0.36
Document Translation
24
Example: Modular Use of MT
  • Select a single query language
  • Translate every document into that language
  • Perform monolingual retrieval

25
Is Machine Translation Enough?
TDT-3 Mandarin Broadcast News
Systran
Balanced 2-best translation
26
Document-Language Retrieval
Chinese Query Terms
Query Translation
English Document Terms
Monolingual English Retrieval
3 0.91 4 0.57 5 0.36
27
Query vs. Document Translation
  • Query translation
  • Efficient for short queries (not relevance
    feedback)
  • Limited context for ambiguous query terms
  • Document translation
  • Rapid support for interactive selection
  • Need only be done once (if query language is
    same)
  • Merged query and document translation
  • Can produce better effectiveness than either alone

28
The Short Query Challenge
Source: Jack Xu, Excite@Home, 1999
29
Interlingual Retrieval
Chinese Query Terms
Query Translation
English Document Terms
Interlingual Retrieval
3 0.91 4 0.57 5 0.36
Document Translation
30
Key Challenges in CLIR
probe survey take samples
cymbidium goeringii
oil petroleum
restrain
31
What's a Term?
  • Granularity of a term depends on the task
  • Long for translation, more fine-grained for
    retrieval
  • Phrases improve translation two ways
  • Less ambiguous than single words
  • Idiomatic expressions translate as a single
    concept
  • Three ways to identify phrases
  • Semantic (e.g., appears in a dictionary)
  • Syntactic (e.g., parse as a noun phrase)
  • Co-occurrence (appear together unexpectedly often)

32
Segmentation
  • Choose a model
  • Assemble evidence
  • Choose a preference criterion
  • Choose a search strategy

33
Segmentation Models
  • Unique segmentation
  • Assign each item to at most one term
  • Plausible segmentation
  • Allow alternative segmentations of a string
  • Plausible inference
  • Expand contractions and abbreviations
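The "unique segmentation" model above (each character assigned to at most one term) is often implemented as greedy longest match against a lexicon. A minimal sketch with a made-up toy lexicon:

```python
# Greedy longest-match ("maximum matching") segmentation sketch.
# At each position, take the longest lexicon entry; fall back to
# a single character when nothing matches.
def segment(text, lexicon, max_len=10):
    terms, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i+n] in lexicon or n == 1:
                terms.append(text[i:i+n])
                i += n
                break
    return terms

lexicon = {"中国", "人民", "银行", "中国人民银行"}
print(segment("中国人民银行", lexicon))  # longest entry wins: one term
```

Plausible-segmentation models instead keep several alternatives and let the retrieval engine weigh them; this sketch commits to one.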

34
Sources of Evidence for Segmentation
  • Lexical resources
  • Dictionaries, term lists, name lists, gazetteers
  • Corpus statistics
  • Within-document, within-collection, balanced
  • Algorithms
  • Transliteration rules, name cues, parsers,
  • The user
  • Forced join, forced split

35
Sources of Evidence for Translation
  • Lexical resources
  • Corpus statistics
  • Algorithms
  • The user

36
Types of Lexical Resources
  • Ontology
  • Representation of concepts and relationships
  • Thesaurus
  • Ontology specialized for retrieval
  • Lexicon
  • Ontology specialized for machine translation
  • Dictionary
  • Ontology specialized for human translation
  • Bilingual term list
  • List of translation-equivalent pairs

37
Machine Readable Dictionaries
  • Based on printed bilingual dictionaries
  • Becoming widely available
  • Cross-language term mappings are accessible
  • Sometimes listed in order of most common usage
  • Some knowledge structure is also present
  • Hard to extract and represent automatically

38
TREC topic 351, title and description fields
Manual translation, then automatic segmentation
Unbalanced translation: all translations of every term
Balanced translation: 1-best translation of each term
39
Dictionary-Based Query Translation
  • Original query: El Nino and infectious diseases
  • Term selection: El Nino infectious diseases
  • Term translation
  • (Dictionary coverage: El Nino is not found)
  • Translation selection
  • Query formulation
  • Structure
  • Post-translation resegmentation

40
The Unbalanced Translation Problem
  • Common query terms may have many translations
  • Some of the translations may be rare
  • IR systems give rare translations higher weights
  • The wrong documents get highly ranked

41
Solution 1: Balanced Translation
  • Replace each term with plausible translations
  • Common terms have many translations
  • Specific terms are more useful for retrieval
  • Balance the contribution of each translation
  • Modular: duplicate translations
  • Integrated: average the contributions

42
Solution 2: Structured Queries
  • Weight of term a in a document i depends on
  • TF(a,i): frequency of term a in document i
  • DF(a): how many documents term a occurs in
  • Build pseudo-terms from alternate translations
  • TF(syn(a,b),i) = TF(a,i) + TF(b,i)
  • DF(syn(a,b)) = |docs with a ∪ docs with b|
  • Downweights terms with any common translation
  • Particularly effective for long queries
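The pseudo-term arithmetic above is easy to make concrete. A sketch of the syn(a,b) statistics over a toy postings index (the documents and translations are invented for illustration):

```python
# Pirkola-style structured-query statistics over a toy index.
# postings maps term -> {doc_id: term frequency}.
postings = {
    "bank": {1: 3, 2: 1, 3: 2},   # common translation
    "damm": {3: 1},               # rare translation
}

def syn_tf(translations, doc):
    # TF(syn(a,b), i) = TF(a,i) + TF(b,i)
    return sum(postings.get(t, {}).get(doc, 0) for t in translations)

def syn_df(translations):
    # DF(syn(a,b)) = |docs with a  ∪  docs with b|
    docs = set()
    for t in translations:
        docs |= set(postings.get(t, {}))
    return len(docs)

print(syn_tf(["bank", "damm"], 3), syn_df(["bank", "damm"]))  # 3 3
```

Because DF is taken over the union, a rare translation sharing a pseudo-term with a common one no longer gets an inflated IDF weight, which is exactly the unbalanced-translation fix the slide describes.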

43
Computing Weights
  • Unbalanced
  • Overweights query terms that have many translations
  • Balanced (sum)
  • Sensitive to rare translations
  • Pirkola (syn)
  • Deemphasizes query terms with any common translation

[Chart: weights assigned to query terms 1-3 under each scheme]
44
Relative Effectiveness
NTCIR-2 ECIR Collection, CETALDC Dictionary,
Inquery 3.1p1
45
Exploiting Part-of-Speech (POS)
  • Constrain translations by part-of-speech
  • Requires POS tagger and POS-tagged lexicon
  • Works well when queries are full sentences
  • Short queries provide little basis for tagging
  • Constrained matching can hurt monolingual IR
  • Nouns in queries often match verbs in documents

46
Types of Bilingual Corpora
  • Parallel corpora translation-equivalent pairs
  • Document pairs
  • Sentence pairs
  • Term pairs
  • Comparable corpora topically related
  • Collection pairs
  • Document pairs

47
Corpus-Based CLIR Example
Top ranked French Documents
French Query Terms
Top ranked English Documents
English Translations
Parallel Corpus
French IR System
English IR System
48
Exploiting Comparable Corpora
  • Blind relevance feedback
  • Existing CLIR technique + collection-linked corpus
  • Lexicon enrichment
  • Existing lexicon + collection-linked corpus
  • Dual-space techniques
  • Document-linked corpus

49
Blind Relevance Feedback
  • Augment a representation with related terms
  • Find related documents, extract distinguishing
    terms
  • Multiple opportunities
  • Before doc translation: enrich the vocabulary
  • After doc translation: mitigate translation errors
  • Before query translation: improve the query
  • After query translation: mitigate translation errors
  • Short queries get the most dramatic improvement
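The basic loop behind blind relevance feedback can be sketched in a few lines: rank with the original query, take the top documents as if they were relevant, and add their most selective (low-DF) terms. The scoring function and four-document collection here are toy assumptions.

```python
# Blind (pseudo-) relevance feedback sketch over a toy collection.
from collections import Counter

docs = {
    1: "el nino weather pacific ocean warming",
    2: "infectious disease outbreak cholera epidemic",
    3: "el nino linked to cholera outbreak in peru",
    4: "stock market report",
}

def score(query_terms, text):
    # Toy relevance score: summed term frequency.
    words = text.split()
    return sum(words.count(t) for t in query_terms)

def expand(query_terms, docs, top_k=2, n_terms=3):
    ranked = sorted(docs, key=lambda d: score(query_terms, docs[d]), reverse=True)
    top = ranked[:top_k]
    df = Counter(w for d in docs for w in set(docs[d].split()))
    # Prefer selective (low-DF) terms drawn from the top documents.
    cand = Counter()
    for d in top:
        for w in set(docs[d].split()):
            if w not in query_terms:
                cand[w] += 1.0 / df[w]
    return list(query_terms) + [w for w, _ in cand.most_common(n_terms)]

print(expand(["el", "nino"], docs))
```

Run before or after translation, the same loop serves the four roles listed above; only the collection it is run against changes.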

50
Example: Post-Translation Document Expansion
English Query
Term Selection
IR System
Document to be Indexed
Top 5
IR System
Results
Single Document
Term-to-Term Translation
English Corpus
Automatic Segmentation
Mandarin Chinese Documents
51
Post-Translation Document Expansion
Mandarin Newswire Text
52
Why Document Expansion Works
  • Story-length objects provide useful context
  • Ranked retrieval finds signal amid the noise
  • Selective terms discriminate among documents
  • Enrich index with low DF terms from top documents
  • Similar strategies work well in other
    applications
  • CLIR query translation
  • Monolingual spoken document retrieval

53
Lexicon Enrichment
  • Use a bilingual lexicon to align context
    regions
  • Regions with high coincidence of known
    translations
  • Pair unknown terms with unmatched terms
  • Unknown: a language-A term, not in the lexicon
  • Unmatched: a language-B term, not covered by translation
  • Treat the most surprising pairs as new
    translations
  • Not yet tested in a CLIR application

54
Lexicon Enrichment
Similar techniques can guide translation selection
55
Learning From Document Pairs
56
Similarity Thesauri
  • For each term, find most similar in other
    language
  • Terms E1 and S1 (or E3 and S4) are used in similar ways
  • Treat top related terms as candidate translations
  • Applying dictionary-based techniques
  • Performed well on a comparable news corpus
  • Automatically linked based on date and subject
    codes
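A similarity thesaurus can be sketched directly from linked document pairs: describe each term by the pairs it occurs in, then rank other-language terms by usage similarity. The two tiny vocabularies below are invented for illustration.

```python
# Similarity-thesaurus sketch: terms as sets of linked document-pair ids.
import math

eng = {"bank": {1, 2, 5}, "river": {3, 4}}
ger = {"bank": {1, 2, 5}, "fluss": {3, 4}, "haus": {2, 6}}

def sim(a, b):
    # Cosine over binary occurrence vectors.
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def candidates(term, source, target):
    # Rank target-language terms by usage similarity to `term`.
    return sorted(target, key=lambda t: sim(source[term], target[t]), reverse=True)

print(candidates("bank", eng, ger))  # "bank" first
```

The top-ranked candidates then feed the same machinery as dictionary translations, as the slide notes.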

57
Generalized Vector Space Model
  • Term space of each language is different
  • Document links define a common document space
  • Describe documents based on the corpus
  • Vector of similarities to each corpus document
  • Compute cosine similarity in document space
  • Very effective in a within-domain evaluation
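The GVSM trick above is worth seeing in miniature: a query in one term space and a document in another both become vectors of similarities to the linked corpus documents, and those vectors live in one shared space. The three-term vectors below are toy assumptions.

```python
# Generalized vector space model sketch: re-describe texts by their
# similarity to each document of a linked (translation-paired) corpus.
import math

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def to_doc_space(vec, corpus):
    # One coordinate per linked corpus document.
    return [cos(vec, d) for d in corpus]

# corpus_en[i] and corpus_fr[i] are translations of each other.
corpus_en = [[1, 0, 0], [0, 1, 1]]
corpus_fr = [[1, 1, 0], [0, 0, 1]]

query_en = [1, 0, 0]    # English term space
doc_fr   = [2, 1, 0]    # French term space
q = to_doc_space(query_en, corpus_en)
d = to_doc_space(doc_fr, corpus_fr)
print(round(cos(q, d), 3))  # high: matched via the linked documents
```

The within-domain caveat on the slide falls out of the construction: the shared space only covers whatever the linked corpus covers.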

58
Latent Semantic Indexing
  • Cosine similarity captures noise with signal
  • Term choice variation and word sense ambiguity
  • Signal-preserving dimensionality reduction
  • Conflates terms with similar usage patterns
  • Reduces term choice effect, even across languages
  • Computationally expensive
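Cross-language LSI can be sketched with a truncated SVD: stack both languages' term rows for the linked training documents, factor once, then fold queries and documents into the shared latent space. The 4x3 matrix is a toy assumption, and the sketch needs numpy.

```python
# Cross-language LSI sketch via truncated SVD (numpy required).
import numpy as np

# Rows: terms (English then French); columns: linked training documents.
X = np.array([
    [1, 0, 1],   # en: bank
    [0, 1, 0],   # en: river
    [1, 0, 1],   # fr: banque
    [0, 1, 0],   # fr: fleuve
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

def fold_in(term_vec):
    # Project a term-space vector into the k-dimensional latent space.
    return (term_vec @ Uk) / sk

q_en = np.array([1.0, 0, 0, 0])   # "bank" only
d_fr = np.array([0, 0, 1.0, 0])   # "banque" only
a, b = fold_in(q_en), fold_in(d_fr)
sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(sim, 3))  # translation-equivalent terms land together
```

The "computationally expensive" bullet refers to the SVD itself, which is cubic-ish in the smaller matrix dimension; the fold-in step is cheap.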

59
Exploiting Parallel Corpora
  • Document-linked techniques
  • Corpus-guided translation selection
  • Statistical machine translation

60
Hieroglyphic
Egyptian Demotic
Greek
61
Corpus-Guided Translation Selection
  • Build target-language term n-gram language model
  • Can use the collection being searched
  • Smooth statistics using comparable and balanced
    corpora
  • Use it to rank translation alternatives for each
    term
  • Unigram language models are easy to build
  • Limits uncommon translation and spelling error
    effects
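A unigram version of this selection step fits in a few lines: estimate word probabilities from the target-language collection being searched and sort each term's translation alternatives by probability. The collection text and smoothing constant below are toy assumptions.

```python
# Corpus-guided translation selection sketch: rank alternatives by
# smoothed unigram probability in the target-language collection.
from collections import Counter

collection = "the bank raised interest rates the bank of england".split()
lm = Counter(collection)
total = sum(lm.values())

def rank_translations(alternatives, smooth=0.5):
    # Additive smoothing keeps unseen translations at nonzero probability.
    def p(w):
        return (lm[w] + smooth) / (total + smooth * len(alternatives))
    return sorted(alternatives, key=p, reverse=True)

print(rank_translations(["bank", "embankment", "bench"]))
```

Uncommon translations and spelling errors rank low simply because the collection rarely contains them, which is the effect the slide describes.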

62
Statistical Machine Translation
  • Add a translation model
  • Trained using a term-aligned corpus
  • Statistical MT toolkits are becoming available
  • Excellent results on hand-assembled corpora
  • Promising initial results on harvested Web pages

63
Obtaining Parallel Corpora
  • Translating monolingual corpora is impractical
  • Humans are expensive, machines are inaccurate
  • Harvesting new parallel corpora can be expensive
  • Reverse engineering collection-specific link
    encoding
  • Crawling the Web offers an interesting
    alternative
  • Low-quality translations, but lots of them
  • Reuse of existing parallel corpora is limited
  • Cross-domain applications typically perform poorly

64
Cognate Matching
  • Dictionary coverage is inherently limited
  • Translation of proper names
  • Translation of newly coined terms
  • Translation of unfamiliar technical terms
  • Strategy: model derivational translation
  • Orthography-based
  • Pronunciation-based

65
Matching Orthographic Cognates
  • Retain untranslatable words unchanged
  • Often works well between European languages
  • Rule-based systems
  • Even off-the-shelf spelling correction can help!
  • Character-level statistical MT
  • Trained using a set of representative cognates
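The spelling-correction observation above amounts to nearest-neighbor search under edit distance: keep the untranslatable word and look for the closest target-vocabulary item. The vocabulary and distance threshold here are toy assumptions.

```python
# Orthographic cognate matching sketch: closest vocabulary item by
# Levenshtein edit distance, with a cutoff for non-cognates.
def edit_distance(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cognate(word, vocab, max_dist=3):
    best = min(vocab, key=lambda v: edit_distance(word, v))
    return best if edit_distance(word, best) <= max_dist else None

vocab = ["telefon", "bank", "universität", "haus"]
print(cognate("telephone", vocab))  # telefon
```

The character-level statistical MT mentioned on the slide generalizes this by learning language-pair-specific substitution costs instead of the uniform costs used here.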

66
Matching Phonetic Cognates
  • Forward transliteration
  • Generate all potential transliterations
  • Reverse transliteration
  • Guess source string(s) that produced a
    transliteration
  • Match in phonetic space

67
Post-Translation Resegmentation
68
Cross-Language Information Retrieval
  • Controlled Vocabulary
  • Free Text
  • Query Translation vs. Document Translation
  • Dictionary-based vs. Corpus-based
  • Corpus-based: Term-aligned, Document-aligned, or Collection-aligned
  • Corpora: Parallel or Comparable
69
Evaluation Collections
  • TREC
  • TREC-6/7/8 English, French, German, Italian text
  • TREC-9 Chinese text
  • TREC-10 Arabic text
  • CLEF
  • CLEF-1 English, French, German, Italian text
  • CLEF-2 Above, plus Spanish and Dutch
  • TDT
  • TDT-2/3 Chinese and English, text and speech
  • NTCIR
  • NTCIR-1 Japanese and English text
  • NTCIR-2 Japanese, Chinese and English text

70
Topic Detection and Tracking
  • English and Chinese news stories
  • Newswire, radio, and television sources
  • Query-by-example, mixed language/source
  • Merged multilingual result set
  • Set-based retrieval measures
  • Focus on utility

71
TDT: Evaluating CL Speech Retrieval
Development collection: TDT-2 (Jan 98 - Jun 98)
  • 17 topics, variable number of exemplars
  • 2265 manually segmented stories
Evaluation collection: TDT-3 (Oct 98 - Dec 98)
  • 56 topics, variable number of exemplars
  • 3371 manually segmented stories
English text topic exemplars: Associated Press, New York Times
Mandarin audio: Voice of America broadcast news
Exhaustive relevance assessment based on event overlap
72
President Bill Clinton and Chinese President
Jiang Zemin engaged in a spirited, televised
debate Saturday over human rights and
the Tiananmen Square crackdown, and announced a
string of agreements on arms control, energy and
environmental matters. There were no announced
breakthroughs on American human rights concerns,
including Tibet, but both leaders accentuated
the positive
Query by Example
English Newswire Exemplars
Mandarin Audio Stories
[Mandarin story text: Chinese characters not preserved in this transcript]
73
Known Item Retrieval
  • Design queries to retrieve a single document
  • Measure the rank of that document in the list
  • Average the inverse of the rank across queries
  • Use sign test for statistical significance
  • Useful first pass evaluation strategy
  • Avoids the cost of relevance judgments
  • Poor mean inverse rank implies poor average
    precision
  • Does not distinguish well among fairly close
    systems
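The evaluation recipe above (rank of the one known relevant document, averaged as inverse rank) is mean reciprocal rank, and a sketch makes the arithmetic explicit. The run data below is invented for illustration.

```python
# Known-item evaluation sketch: mean inverse (reciprocal) rank.
def mean_inverse_rank(runs):
    # runs: list of (ranked_doc_ids, known_item_id) pairs.
    total = 0.0
    for ranking, target in runs:
        if target in ranking:
            total += 1.0 / (ranking.index(target) + 1)
        # a known item that is never retrieved contributes 0
    return total / len(runs)

runs = [
    (["d3", "d1", "d7"], "d1"),   # rank 2 -> 0.5
    (["d2", "d4", "d9"], "d2"),   # rank 1 -> 1.0
    (["d5", "d6"], "d8"),         # not retrieved -> 0.0
]
print(mean_inverse_rank(runs))  # 0.5
```

Because inverse rank collapses everything below the first few ranks toward zero, two systems that both usually rank the item near the top get very similar scores, which is the "does not distinguish well" caveat on the slide.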

74
Evaluating Corpus-Based Techniques
  • Within-domain evaluation (upper bound)
  • Partition a bilingual corpus into training and
    test
  • Use the training part to tune the system
  • Generate relevance judgments for evaluation part
  • Cross-domain evaluation (fair)
  • Use existing corpora and evaluation collections
  • No good metric for degree of domain shift

75
Evaluating Lexicon Coverage
  • Lexicon size
  • Vocabulary coverage of the collection
  • Term instance coverage of the collection
  • Term weight coverage of the collection
  • Term weight coverage on representative queries
  • Retrieval performance on a test collection

76
Outline
  • Questions
  • Overview
  • Search
  • Browsing

77
Interactive CLIR
  • Important: strong support for interactive relevance judgments can make up for less accurate nominations
  • Hersh et al., SIGIR 2000
  • Practical: interactive relevance judgments based on imperfect translations can beat fully automatic nominations alone
  • Oard and Resnik, IPM 1997

78
User-Assisted Query Translation
79
Reverse Translation
Search
Swiss bank
Query in English
Click on a box to remove a possible translation
bank
Bankgebäude (bank building), Bankverbindung (bank account, correspondent), Bank (bench, settle), Damm (causeway, dam, embankment), Ufer (shore, strand, waterside), Wall (parapet, rampart)
Continue
80
Selection
  • Goal Provide information to support decisions
  • May not require very good translations
  • e.g., Term-by-term title translation
  • People can read past some ambiguity
  • May help to display a few alternative translations

81
Language-Specific Selection
Search
Swiss bank
Query in English
English
German
(Swiss) (Bankgebäude, bankverbindung, bank)
1 (0.72) Swiss Bankers Criticized AP / June 14,
1997 2 (0.48) Bank Director Resigns AP / July
24, 1997
1 (0.91) U.S. Senator Warpathing NZZ / June 14,
1997 2 (0.57) Bankensecret Law Change SDA /
August 22, 1997 3 (0.36) Banks Pressure
Existent NZZ / May 3, 1997
82
Translingual Selection
Search
Swiss bank
Query in English
German Query
(Swiss) (Bankgebäude, bankverbindung, bank)
1 (0.91) U.S. Senator Warpathing NZZ June 14,
1997 2 (0.57) Bankensecret Law Change
SDA August 22, 1997 3 (0.52) Swiss Bankers
Criticized AP June 14, 1997 4 (0.36) Banks
Pressure Existent NZZ May 3, 1997 5 (0.28) Bank
Director Resigns AP July 24, 1997
83
Merging Ranked Lists
List 1: 1. voa4062 (.22), 2. voa3052 (.21), 3. voa4091 (.17), ..., 1000. voa4221 (.04)
List 2: 1. voa4062 (.52), 2. voa2156 (.37), 3. voa3052 (.31), ..., 1000. voa2159 (.02)
  • Types of Evidence
  • Rank
  • Score
  • Evidence Combination
  • Weighted round robin
  • Score combination
  • Parameter tuning
  • Condition-based
  • Query-based

Merged: 1. voa4062, 2. voa3052, 3. voa2156, ..., 1000. voa4201
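The two evidence-combination strategies on this slide can be sketched side by side: round robin interleaves by rank, while score combination sums normalized scores. The normalization choice (min-max) and the short lists are illustrative assumptions.

```python
# Merging ranked lists: rank-based round robin vs. score combination.
def round_robin(lists):
    # Interleave lists rank by rank, skipping duplicate documents.
    merged, seen = [], set()
    for rank in range(max(len(l) for l in lists)):
        for l in lists:
            if rank < len(l) and l[rank][0] not in seen:
                seen.add(l[rank][0])
                merged.append(l[rank][0])
    return merged

def score_combination(lists):
    # Sum min-max normalized scores per document, then sort.
    totals = {}
    for l in lists:
        scores = [s for _, s in l]
        lo, hi = min(scores), max(scores)
        for doc, s in l:
            norm = (s - lo) / (hi - lo) if hi > lo else 1.0
            totals[doc] = totals.get(doc, 0.0) + norm
    return sorted(totals, key=totals.get, reverse=True)

a = [("voa4062", 0.22), ("voa3052", 0.21), ("voa4091", 0.17)]
b = [("voa4062", 0.52), ("voa2156", 0.37), ("voa3052", 0.31)]
print(round_robin([a, b]))
print(score_combination([a, b]))
```

The weighting and parameter-tuning bullets correspond to multiplying each list's contribution by a learned weight, either per condition or per query; the sketch uses equal weights.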
84
Examination Interface
  • Two goals
  • Refine document delivery decisions
  • Support vocabulary discovery for query
    refinement
  • Rapid translation is essential
  • Document translation retrieval strategies are a
    good fit
  • Focused on-the-fly translation may be a viable
    alternative

85
The Challenge
86
State-of-the-Art Machine Translation
87
Term-by-Term Gloss Translation
88
Experiment Design
Participant  Task Order
1            Topic 11, Topic 17; Topic 13, Topic 29
2            Topic 11, Topic 17; Topic 13, Topic 29
3            Topic 17, Topic 11; Topic 29, Topic 13
4            Topic 17, Topic 11; Topic 29, Topic 13
Topic Key: Narrow = 11, 13; Broad = 17, 29
System Key: System A, System B
89
An Experiment Session
  • Task and system familiarization (30 minutes)
  • Gain experience with both systems
  • 4 searches (20 minutes x 4)
  • Read topic description (in a language you know)
  • Examine translations (into that same language)
  • Select one of 5 relevance judgments (two
    clusters)
  • Relevant
  • Somewhat relevant, Not relevant, Unsure, Not
    judged
  • Instructed to seek high precision
  • 8 questionnaires
  • Initial (1), each topic (4), each system (2),
    Final (1)

90
Measure of Effectiveness
  • Unbalanced F-measure: F = 1 / (α/P + (1−α)/R)
  • P = precision
  • R = recall
  • α = 0.8
  • Favors precision over recall
  • This models an application in which
  • Fluent translation is expensive
  • Missing some relevant documents would be okay
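The asymmetry is easy to see numerically: with alpha = 0.8, swapping a high precision for a high recall lowers the score. A minimal sketch of the measure:

```python
# Unbalanced F-measure sketch: F = 1 / (alpha/P + (1-alpha)/R).
def f_measure(precision, recall, alpha=0.8):
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1 - alpha) / recall)

# At alpha = 0.8, high precision is rewarded more than high recall:
print(round(f_measure(0.9, 0.3), 3))
print(round(f_measure(0.3, 0.9), 3))
```

At alpha = 0.5 the two calls above would tie (the measure reduces to the usual harmonic mean), so the 0.8 setting is doing exactly the precision-favoring work the slide claims.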

91
French Results Overview
92
English Results Overview
93
Maryland Experiments
[Charts: results for broad topics and for narrow topics]
  • MT is almost always better
  • Significant overall and for narrow topics alone
    (one-tailed t-test, p
  • F measure is less insightful for narrow topics
  • Always near 0 or 1

94
Some Observations
  • Small agreement with CLEF assessments!
  • Time pressure, precision bias, strict judgments
  • Systran was fairly consistent across sites
  • Only when the language pair was the same
  • Monolingual > Systran > Gloss
  • In both recall and precision
  • UNED's phrase translations improve recall
  • With no adverse effect on precision

95
Delivery
  • Use may require high-quality translation
  • Machine translation quality is often rough
  • Route to best translator based on
  • Acceptable delay
  • Required quality (language and technical skills)
  • Cost

96
Summary
  • Controlled vocabulary
  • Mature, efficient, easily explained
  • Dictionary-based
  • Simple, broad coverage
  • Collection-aligned corpus-based
  • Generally helpful
  • Document- and Term-aligned corpus-based
  • Effective in the same domain
  • User interface
  • Only very preliminary results available

97
Research Opportunities (Oard's view)
Segmentation, Phrase Indexing
Lexical Coverage