Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words

Gina-Anne Levow, University of Chicago, July 7, 2003

1
Issues in Pre- and Post-translation Document Expansion: Untranslatable Cognates and Missegmented Words
  • Gina-Anne Levow
  • University of Chicago
  • July 7, 2003

2
Roadmap
  • Goals of expansion
  • Expansion points in CL-SDR
  • Pre- and Post-translation document expansion
    experiments
  • Task, query document processing
  • Expansion methodology
  • Results
  • Discussion and Conclusions

3
Why Expansion?
  • Recover terms that could have appeared
  • Compensate for difference in term choice
  • Author concepts vs searcher information need
  • Compensate for noisy processing
  • ASR transcription errors
  • Misrecognitions, deletions, missegmentations
  • Translation errors
  • Gaps, missegmentations
  • Context disambiguates

4
Expansion Opportunities
  • Query
  • (Ballesteros & Croft '96; McNamee & Mayfield 2002)
  • Before translation, after translation, or both
  • Different enhancements to precision/recall
  • Pre-translation is key: provides something to translate
  • European languages
  • Document
  • Before translation, after translation, or both
  • Developed for monolingual SDR (Singhal 1999)
  • CLIR (SDR) (Levow & Oard 2000)
  • Post-translation promising

5
Experimental Configuration: Basic Task
  • Variant of Topic Detection and Tracking (TDT)
  • English queries to Mandarin documents
  • Query-by-example
  • English newswire or broadcast news stories
  • Mandarin audio broadcast news documents
  • Automatically transcribed by Dragon ASR system
  • Modifications
  • Retrospective retrieval
  • Evaluation metric: Mean Average Precision
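
For readers unfamiliar with the evaluation metric named above, the following is a minimal sketch of uninterpolated Mean Average Precision. The function names and the data layout (ranked lists of document ids, sets of relevant ids) are assumptions made for illustration; the slides do not say which scoring tool was used.

```python
from typing import Dict, List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average the precision values observed at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero precision.
    return precision_sum / len(relevant)


def mean_average_precision(runs: Dict[str, List[str]],
                           judgments: Dict[str, Set[str]]) -> float:
    """Mean of per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(query, []), relevant)
              for query, relevant in judgments.items()]
    return sum(scores) / len(scores) if scores else 0.0
```

With a ranking `["d1", "d2", "d3"]` and relevant set `{"d1", "d3"}`, the per-query value is (1/1 + 2/3) / 2, about 0.83.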

6
Experimental Configuration: Query and Document Processing
  • Query
  • Select top 180 positively correlated terms in 4
    exemplars
  • Based on χ² test
  • 996 prior documents assumed not relevant
  • Document
  • Dictionary-based word-for-word translation
  • Segmentation: NMSU ch_seg
  • Translation resource
  • Merged bilingual term list: CETA, LDC term list
  • Translation ranking
  • Target-language unigram frequency: single words,
    multi-word
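
The query construction step above (top positively correlated terms from the exemplars, scored with a χ² test against prior documents assumed not relevant) can be sketched as follows. This is an illustrative reading, not the original implementation: documents are reduced to sets of terms, and the function names and the 2x2 contingency layout are assumptions.

```python
from typing import List, Set


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square statistic for a 2x2 table:
    a = exemplars containing the term, b = exemplars lacking it,
    c = background documents containing it, d = background documents lacking it."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def select_query_terms(exemplars: List[Set[str]],
                       background: List[Set[str]],
                       k: int = 180) -> List[str]:
    """Return up to k terms positively correlated with the exemplars, best first."""
    scored = []
    vocabulary = set().union(*exemplars)
    for term in vocabulary:
        a = sum(term in doc for doc in exemplars)
        b = len(exemplars) - a
        c = sum(term in doc for doc in background)
        d = len(background) - c
        if a * d > b * c:  # keep only terms over-represented in the exemplars
            scored.append((chi_square(a, b, c, d), term))
    # Highest statistic first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [term for _, term in scored[:k]]
```

In the configuration described on this slide, `exemplars` would hold the 4 example stories and `background` the 996 prior documents, with `k` left at 180.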

7
Experimental Configuration: Document Expansion

8
Document Expansion Details
  • Side collections
  • Mandarin: TDT-2 Xinhua, Zaobao newswire
  • English: TDT-2 New York Times, AP news
  • Expansion term selection
  • Top 5 documents
  • Sort candidate terms by idf
  • Exclude terms in only one document
  • Add one term instance per document
  • Add until document doubled in length
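
The term selection steps above can be sketched as a small routine. Several points are assumptions rather than details from the slides: the tf-idf overlap score is a stand-in for whatever retrieval engine ranked the side collection, "terms in only one document" is read as terms found in only one of the top-ranked documents, and "one term instance per document" is read as one added instance for each top-ranked document containing the term. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(collection: List[List[str]]) -> Dict[str, float]:
    """Inverse document frequency of every term in the side collection."""
    n = len(collection)
    df = Counter(term for doc in collection for term in set(doc))
    return {term: math.log(n / count) for term, count in df.items()}


def top_documents(doc: List[str], collection: List[List[str]],
                  idf: Dict[str, float], k: int = 5) -> List[List[str]]:
    """Rank side-collection documents with the document itself as the query."""
    query = Counter(doc)

    def score(candidate: List[str]) -> float:
        tf = Counter(candidate)
        return sum(query[t] * tf[t] * idf.get(t, 0.0) for t in query)

    return sorted(collection, key=score, reverse=True)[:k]


def expand_document(doc: List[str], collection: List[List[str]],
                    k: int = 5) -> List[str]:
    """Append high-idf terms from the top k documents until the length doubles."""
    idf = idf_table(collection)
    top = top_documents(doc, collection, idf, k)
    # Number of top-ranked documents that contain each candidate term.
    support = Counter(term for cand in top for term in set(cand))
    candidates = [t for t, n in support.items() if n > 1]
    candidates.sort(key=lambda t: (-idf[t], t))
    expanded = list(doc)
    budget = len(doc)  # stop once the document has doubled in length
    for term in candidates:
        if budget <= 0:
            break
        n_add = min(support[term], budget)
        expanded.extend([term] * n_add)
        budget -= n_add
    return expanded
```

For pre-translation expansion the side collection would be the Mandarin newswire and the routine would run before translation; for post-translation expansion it would run on the translated document against the English collection.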

9
Results
  • Post-translation significantly outperforms
    pre-translation expansion

Expansion   None   Pre    Post   Pre+Post
MAP         0.39   0.46   0.59   0.61

10
Discussion: Post-translation Effectiveness
  • Post-translation document expansion significantly
    improves retrieval effectiveness
  • Little improvement from pre-translation expansion
  • Either alone or in conjunction
  • Expansion introduces key enriching terms
  • Named entities, alternate forms
  • E.g., Tariq Aziz, Saddam, Yeltsin, etc.
  • Available in English (post-translation) collection
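
The scores reported on the Results slide can be restated as relative gains over the unexpanded baseline. The snippet below is plain arithmetic on those four numbers; the dictionary keys and the helper name are illustrative, and statistical significance cannot be derived from the means alone.

```python
# Mean average precision values as listed on the Results slide.
map_scores = {"None": 0.39, "Pre": 0.46, "Post": 0.59, "PrePost": 0.61}


def relative_gain(score: float, base: float) -> float:
    """Percentage change of score relative to base."""
    return (score - base) / base * 100.0


baseline = map_scores["None"]
for condition, score in map_scores.items():
    print(f"{condition}: {score:.2f} ({relative_gain(score, baseline):+.1f}% vs. no expansion)")
```

Relative to no expansion this works out to roughly +18% for pre-translation, +51% for post-translation, and +56% for both; adding pre-translation expansion on top of post-translation moves the score from 0.59 to 0.61, about 3% relative.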

11
Discussion: Pre-translation Limitations
  • Expansion terms do not exist
  • Segmentation and transcription rely on term lists
  • Named entities frequently absent
  • Cannot extract terms from Mandarin newswire
  • Expansion terms cannot be translated
  • Key terms (e.g. named entities) absent from
    bilingual term lists
  • All examples on the previous slide are absent
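
Both limitations above can be illustrated with a toy sketch. It uses generic greedy forward maximum matching, which is an assumption for illustration and not a description of how NMSU ch_seg works, together with a dictionary lookup that drops terms lacking an entry. The lexicon, the term list, and the sample sentence are invented for the example.

```python
from typing import Dict, List, Set


def segment(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy forward maximum matching; unknown spans fall back to single characters."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def translate(tokens: List[str], term_list: Dict[str, List[str]]) -> List[str]:
    """Word-for-word lookup keeping the top-ranked translation; terms with no entry are dropped."""
    return [term_list[t][0] for t in tokens if t in term_list]


# Toy resources: the named entity is missing from both lists.
lexicon = {"访问", "北京"}
term_list = {"访问": ["visit"], "北京": ["Beijing"]}

tokens = segment("萨达姆访问北京", lexicon)
english = translate(tokens, term_list)
```

Here `tokens` comes out as `["萨", "达", "姆", "访问", "北京"]`: the name is split into single characters because the lexicon lacks it, and `english` is `["visit", "Beijing"]` because no fragment has a term-list entry. Adding the name to both resources would recover it, which is the gap the slide describes.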

12
Discussion: Contrasts
  • Contradicts prior query expansion results
  • Re: primacy of pre-translation expansion
  • Explanation
  • Prior languages mostly European
  • Common writing system, white-space delimited
  • Pre-translation expansion produces translatable
    terms plus (possibly) untranslatable cognates
  • Cognates still match, even without translation
  • Current experiment: English-Mandarin
  • Untranslatable cognates useless
  • Different orthography
  • Terms not identified: missegmentation
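
The contrast above can be made concrete with a small sketch in which terms that have no dictionary entry pass through translation unchanged. With a shared writing system the untranslated name still matches the index; with a different orthography it does not. The term lists, index vocabulary, and function names are hypothetical toy data, not resources from the experiments.

```python
from typing import Dict, List, Set


def translate_with_passthrough(terms: List[str],
                               term_list: Dict[str, str]) -> List[str]:
    """Translate word-for-word; terms with no entry are kept unchanged."""
    return [term_list.get(term, term) for term in terms]


def matching_terms(translated: List[str], index_vocabulary: Set[str]) -> List[str]:
    """Terms that can actually match something in the target-language index."""
    return [term for term in translated if term in index_vocabulary]


# Toy English index vocabulary.
english_index = {"saddam", "visit", "baghdad"}

# European-language case: the name has no entry but is spelled the same way.
european = translate_with_passthrough(["visite", "saddam"], {"visite": "visit"})
european_hits = matching_terms(european, english_index)

# Mandarin case: the name has no entry and shares no orthography with English.
mandarin = translate_with_passthrough(["访问", "萨达姆"], {"访问": "visit"})
mandarin_hits = matching_terms(mandarin, english_index)
```

In this toy setting `european_hits` keeps both terms, while `mandarin_hits` keeps only `"visit"`: the untranslated name survives translation in both cases but is useful only where the two languages share an orthography.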

13
Conclusion
  • Document expansion improves effectiveness
  • For CL-SDR case, recovers terms lost by
    missegmentation, mistranscription, or
    mistranslation; supports different term choices
  • Post-translation expansion most effective
  • Translated terms provide context for retrieval
  • Correct translations/transcriptions are coherent;
    others are noise
  • Enriching terms often absent from term lists
  • Segmentation, transcription, translation all rely
    on lists
  • Expansion in indexing language bypasses barriers
  • Crucial in languages with segmentation issues and
    different forms