Bridging the language gap: making digital collections available to a multilingual society

Description:

... language gap: making digital collections available to a multilingual society. Paul Clough Department of Information Studies. p.d.clough_at_sheffield.ac.uk. 9/14/09 ...

Slides: 20
Provided by: paulc53
Learn more at: http://www.ukoln.ac.uk
Transcript and Presenter's Notes


1
Bridging the language gap: making digital
collections available to a multilingual society
  • Paul Clough Department of
    Information Studies
  • p.d.clough_at_sheffield.ac.uk

2
Overview
  • Making digital collections available
      • Information Retrieval (IR)
  • to a multicultural society
      • Cross-Language Information Retrieval (CLIR)
  • Example CLIR applications
      • Cross-language retrieval from texts and audio
      • Cross-language retrieval from images

3
Information Retrieval (IR)
  • To find information relevant to a user need
      • Includes storage, retrieval, and presentation
  • Information
      • e.g. texts, images, audio, video
  • User need
      • Find information about a given topic
      • Involves searching and/or browsing
  • Relevance (core to IR)
      • Does retrieved information meet the expected
        user need?
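
The storage, retrieval and relevance steps above can be sketched in a few lines. This is a minimal illustration, not a method described in the talk: the toy documents, the whitespace tokenizer and the TF-IDF scoring are all assumptions chosen for brevity.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_index(docs):
    """Storage: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index

def search(query, index, n_docs):
    """Retrieval: rank documents by summed TF-IDF of the query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term does not occur in the collection
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical three-document collection.
docs = {
    "d1": "castle ruins in england",
    "d2": "sheep dogs herding sheep",
    "d3": "museum exhibits in england",
}
index = build_index(docs)
results = search("castle ruins", index, len(docs))
```

Whether the top-ranked document actually meets the user need is the relevance question, which the score only approximates.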

4
Trends in IR research
  • Classical IR
      • IR models, evaluation, query languages and
        document indexing on small document collections
  • Modern IR
      • Internet search engines
      • Mark-up languages
      • Multimedia content
      • Distributed collections
      • User interaction
      • Multilingualism

5
Cross-Language IR (CLIR)
  • Find documents written in any language
      • Driven by multilingual content and users
  • Applications of CLIR
      • Sharing information between global communities
      • Selling products globally on the Web
      • Searching multilingual documents
  • Performing CLIR
      • Combination of IR and translation technologies
      • Translate search requests, documents (or both)
  • Requirements of a multilingual IR system
      • Store, retrieve and present multilingual
        information
6
Monolingual IR
  • Documents and user requests in the same language

7
Cross-language IR
  • Documents and user requests are in different
    languages (bilingual IR)

[Figure labels: source language, target language]
8
Multilingual IR
  • Documents in collection in different languages,
    search requests in any language

e.g. the Web
9
Drivers of CLIR
Multilingual content (e.g. web): share of Web pages by language (%)
  English      68.4
  Japanese      5.9
  German        5.8
  Chinese       3.9
  French        3.0
  Spanish       2.4
  Russian       1.9
  Italian       1.6
  Portuguese    1.4
  Korean        1.3
  Other         4.6
  Total Web pages: 313 B
  Source: Vilaweb.com, as quoted by eMarketer (2003)
Multilingual users
10
Users of CLIR systems
  • Obvious question
      • Why do users want to retrieve documents they
        presumably can't read?
  • Some users are multilingual
      • Can formulate searches and judge relevance in
        many languages
      • Want convenience of a single query
  • Some users are monolingual
      • Want to query in their native language
      • Can judge relevance even if results are not
        translated
      • Have access to document translation
      • Objects retrieved are language-independent (e.g.
        images)

11
CLIR methods
  • How is it done?
      • Translate search requests, documents (or both)
  • Translation resources
      • Machine Translation (MT)
      • Parallel/comparable corpora
      • Bilingual dictionaries
  • Example problems
      • Handling non-ASCII character sets
      • Out-of-vocabulary (OOV) terms, e.g. proper names
      • Multi-word concepts, e.g. phrases and idioms
      • Ambiguity, e.g. polysemy
      • Inflections, e.g. plurals and gender
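
Three of the problems listed above (non-ASCII characters, OOV terms and ambiguity) show up even in a tiny dictionary-based translator. The sketch below is an assumption-laden illustration: the Spanish-English entries are invented for the example, and stripping diacritics is a crude normalization that a real system might not want (in Spanish, "ñ" is a distinct letter).

```python
import unicodedata

# Hypothetical Spanish -> English dictionary; each entry lists every sense.
ES_EN = {
    "banco": ["bank", "bench"],  # ambiguity (polysemy)
    "perro": ["dog"],
    "nino": ["child", "boy"],    # key stored without diacritics
}

def normalize(word):
    """Lower-case and strip diacritics so 'Niño' matches the key 'nino'."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def translate_query(query, dictionary):
    """Return (translated terms, OOV words).

    Every listed sense is kept, so ambiguity widens the query.
    OOV words (often proper names) are passed through untranslated.
    """
    translated, oov = [], []
    for word in query.split():
        key = normalize(word)
        if key in dictionary:
            translated.extend(dictionary[key])
        else:
            translated.append(word)
            oov.append(word)
    return translated, oov

terms, oov = translate_query("Niño banco Sheffield", ES_EN)
```

Keeping all senses favours recall over precision; choosing a single sense would need extra evidence such as corpus statistics or user input.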

12
Example translation errors (MT)
Italian
Dogs that assemble sheep
Exposures in museums
Ruins of castles in England
German
Dogs with sheep hats
Museumaustellungssteucke
Castle ruins in England
Ruin of castles in United Kingdom
Dutch
Dogs which sheep bejeendrijven
Museumstukken
French
Dogs gathering of the preois
Exposure of objects in museum
Castles in ruins in England
Spanish
Dogs urging on ewes
Objects of museum
Castles in ruins in England
13
Does it work?
  • Best systems at TREC-6 (1997)
      • English-French: 49% of highest French monolingual
      • English-German: 64% of highest German monolingual
  • Best systems at CLEF (2002)
      • English-French: 83% of highest French monolingual
      • English-German: 86% of highest German monolingual
  • Best systems at CLEF (2003) for unusual pairs
      • Italian-Spanish: 83% of highest Spanish
        monolingual
      • German-Italian: 87% of highest Italian
        monolingual
      • French-Dutch: 82% of highest Dutch monolingual
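
Figures of this kind compare a cross-language run against the best monolingual run on the same documents. The slide does not say which effectiveness measure was used; the sketch below assumes average precision, and the rankings and relevance judgements are invented purely to show the arithmetic.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def percent_of_monolingual(cross_score, mono_score):
    """Cross-language effectiveness as a percentage of the monolingual score."""
    return 100.0 * cross_score / mono_score

# Hypothetical judgements and rankings for a single query.
relevant = {"a", "b", "c"}
mono_ranking = ["a", "b", "x", "c"]        # query in the document language
cross_ranking = ["a", "x", "b", "y", "c"]  # translated query

mono_ap = average_precision(mono_ranking, relevant)
cross_ap = average_precision(cross_ranking, relevant)
pct = percent_of_monolingual(cross_ap, mono_ap)
```

In an evaluation campaign the scores would be averaged over many queries before the ratio is taken.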

14
CLARITY
  • Cross-language multi-media information retrieval
    system
      • For rare languages few electronic translation
        resources exist
  • Collection
      • Newspaper texts and audio documents in mixed
        languages
  • Translation approach
      • Query translation using dictionary-based lookup
      • Transitive cross-language retrieval for varying
        language pairs
      • N-gram techniques for translating OOV words
      • Support for Baltic languages (e.g. Latvian and
        Lithuanian)
  • End-users of CLARITY
      • Journalists working for BBC Monitoring (UK) and
        Alma Media (Finland)
      • Users are polyglots

http://www.dcs.shef.ac.uk/nlp/clarity/
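
Two of the ideas on this slide, transitive translation and n-gram handling of OOV words, can be sketched as follows. This is not CLARITY's actual implementation: the pivot dictionaries, the character-trigram Dice similarity, the 0.4 threshold and the small target vocabulary are all assumptions made for the example.

```python
def char_ngrams(word, n=3):
    """Set of character n-grams of a word padded with boundary markers."""
    padded = f"_{word.lower()}_"
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}

def dice(a, b):
    """Dice coefficient between the n-gram sets of two words."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))

def closest_term(word, vocabulary, threshold=0.4):
    """Map an OOV word to the most similar index term, if similar enough."""
    best = max(vocabulary, key=lambda term: dice(word, term))
    return best if dice(word, best) >= threshold else None

def transitive_translate(word, src_to_pivot, pivot_to_tgt):
    """Translate via a pivot language when no direct dictionary exists."""
    pivot = src_to_pivot.get(word)
    return pivot_to_tgt.get(pivot) if pivot is not None else None

# Hypothetical dictionaries: Finnish -> English -> Latvian.
FI_EN = {"sanomalehti": "newspaper"}
EN_LV = {"newspaper": "laikraksts"}
# Hypothetical index terms from a Lithuanian collection.
lt_vocabulary = ["helsinkis", "vilnius", "ryga"]

via_pivot = transitive_translate("sanomalehti", FI_EN, EN_LV)
name_match = closest_term("Helsinki", lt_vocabulary)
```

Proper names are good candidates for n-gram matching because they often differ only by an inflectional ending across languages.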
15
Eurovision
  • Multilingual access to image collections
      • Many images have associated text
      • Users often formulate queries in natural language
  • Collection
      • St Andrews Historic Photographic Archive
        (http://specialcollections.st-and.ac.uk/photcol.htm)
      • 30,000 historic photographs with English captions
  • Translation approach
      • MT for both query and caption translation
      • Exploited on-line version of Babelfish
        (http://babelfish.altavista.com/)
  • End-users of Eurovision
      • Historians and general public (monoglots)

http://ir.shef.ac.uk/eurovision/
16
Involving the user
  • Interactive CLIR systems
      • Help users locate and identify relevant
        foreign-language documents
  • Users may have different language skills
      • Polyglots or monoglots
  • Interactive CLIR systems can help users
      • Formulate and translate the query (e.g. entering
        diacritics, selecting translation alternatives)
      • Query re-formulation (e.g. selecting query
        expansion terms)
      • Browsing/navigating results (e.g. translating
        metadata)
      • Identifying relevant documents (e.g. summarising
        and translating results)
  • Users can also help CLIR systems
      • By providing feedback to the system
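
Two of the interactions above, selecting translation alternatives and selecting query expansion terms, can be sketched as plain functions. The data structures and names below are hypothetical; a real interface would collect the user's choices through the search screen rather than a dictionary.

```python
from collections import Counter

def choose_translations(alternatives, chosen):
    """Keep only the senses the user selected for each query word.

    Falls back to all senses for words where nothing was selected.
    """
    query = []
    for word, senses in alternatives.items():
        picked = [s for s in senses if s in chosen.get(word, [])]
        query.extend(picked or senses)
    return query

def expansion_terms(relevant_docs, query, k=2):
    """Suggest the k most frequent new words in documents the user marked relevant."""
    counts = Counter(
        w for doc in relevant_docs for w in doc.lower().split() if w not in query
    )
    return [term for term, _ in counts.most_common(k)]

# Hypothetical session: the system offers senses, the user picks one.
alternatives = {"banco": ["bank", "bench"], "parque": ["park"]}
user_choice = {"banco": ["bench"]}
query = choose_translations(alternatives, user_choice)

# Feedback: captions the user judged relevant drive the suggestions.
marked_relevant = ["wooden bench in the park", "old wooden bench"]
suggestions = expansion_terms(marked_relevant, query)
```

A practical version would filter stop words such as "in" and "the" before suggesting terms.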

17
Summary
  • IR enables access to digital collections
  • Cross-language information retrieval (CLIR)
    provides multilingual access
  • CLIR works and is being used in practice
  • CLIR provides opportunities
      • For both content providers and end-users
      • Widens accessibility to all kinds of information
  • Challenges include
      • Bridging the language gap
      • Providing effective access for users
  • Multilingual retrieval - where next?

18
Spanish
English
Japanese
Italian
19
multilingual communities
  • Large-scale photo management tool
      • Over 5 million publicly accessible images
      • Users can upload, manage and share their photos
      • Users can discuss images (within groups)
      • Naturally multilingual
  • Annotations include
      • Pre-defined tags
      • Textual descriptions
      • Manually-defined keywords (folksonomies)
      • Collaborative annotations (comments)
  • New challenges for IR
      • Multiple annotation types (e.g. comments, tags)
      • Non-uniform and subjective categorisation and
        annotation of images
      • Annotations in multiple languages

http://flickr.com