Sharing the effort to produce the needed LRs

1
Sharing the effort to produce the needed LRs.
  • J. Mariani
  • LIMSI-CNRS
  • Director
  • Institute for Multilingual Multimedia
    Information (IMMI)

2
The Multilingual challenge
  • Preserve the cultures (languages)
  • Allow citizens to speak their own language
  • Preference for native language (Web sites in
    German (75%)...)
  • 30% of the Web in English (50% in 2000, 35% in
    2004)
  • 50% of European citizens speak only one language
  • Only 3% of Japanese citizens speak a foreign
    language
  • Allow for communication between humans
  • European Union
  • 27 countries, 23 official languages / 506
    language pairs, 60 languages
  • 2,500 translators at the EC, 1.5 M pages
    translated in 2007
  • 30% of the European Parliament budget (300 M€), 500
    translators
  • Annual total cost for the EU: 1.1 B€, i.e. 2.2 € per
    European citizen
  • Globalization of information
  • 13 hours of new video every minute on YouTube
  • 6,000 languages, 36,000,000 language pairs
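The language-pair figures on this slide follow from counting ordered (source, target) pairs: n languages give n x (n - 1) translation directions. The short Python sketch below reproduces that arithmetic; the function name is only illustrative, and the figure quoted for 6,000 languages is a rounding of the exact count.

```python
def directed_pairs(n_languages: int) -> int:
    """Number of ordered (source, target) pairs among n languages."""
    return n_languages * (n_languages - 1)


# Language counts taken from the slide.
for label, n in [("EU official languages", 23), ("World languages", 6000)]:
    print(f"{label}: {n} languages -> {directed_pairs(n):,} language pairs")
```

With 23 official languages this gives 506 pairs, and with 6,000 languages it gives 35,994,000, which the slide rounds to 36,000,000. The quadratic growth is what makes per-pair resources so costly.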

3
Needs
  • Europe
  • European Digital Library (Europeana)
  • 4.6 million documents in 26 languages
  • Multilingual / crosslingual: tools are needed
  • Business Intelligence (ENISA)
  • Multilingual alert and information exchange
    platform for Member States
  • European Patent Office
  • 3 languages (English, German, French) to decrease
    the costs
  • European Commission, Parliament, Court of Justice
    (documents, reports, meetings)
  • International
  • Internet Governance
  • UN Forum on Internet Governance (FGI): English is
    the only working language
  • Accents and graphemes in Domain Names
  • Dubbing and subtitling of audiovisuals
  • Technical notices (aeronautics, car industry,
    consumer products)
  • Writing of scientific articles in mother tongue
  • Translation of text, video, radio/TV broadcasts
    on the Internet
  • Interpretation at conferences, workshops, courses

4
Findings (1)
  • It is impossible to (quickly) meet all the
    numerous present and future needs with the
    present (and future) human resources

5
Findings (2)
  • Multilingualism is not the first priority in any
    economic sector
  • But the sum of the priorities across those
    economic sectors is very large
  • It therefore calls for political thinking and
    political action

6
Findings (3)
  • Multilingualism is necessary, but its cost is
    very high
  • Having Language Technologies would facilitate
    multilingualism by:
  • Decreasing the cost
  • Generalizing it
  • Provided that LT performance meets user needs

7
Findings (4)
  • LT are not yet mature enough
  • Automatic translation has not yet reached good
    enough quality for translating literary books or
    documents requiring a good translation
  • However, it may already:
  • On the one hand, help human translators in their
    work
  • On the other hand, provide an approximate
    translation for the general public, especially if
    free and online
  • Therefore the research effort should be
    increasingly supported

8
Language Technologies
  • Written Language
  • Monolingual
  • PoS tagging, syntactic parsing, terminology
    extraction, text understanding and generation,
    automatic summarization, monolingual information
    retrieval, QA
  • Crosslingual
  • Crosslingual information retrieval, automatic or
    machine assisted translation...
  • Spoken Language
  • Monolingual
  • Speech recognition and understanding, speech
    synthesis, oral dialogue, speaker recognition
  • Crosslingual
  • Language identification, speech translation
  • Gestural
  • Sign Language Processing (analysis, synthesis,
    translation)
  • Accessibility
  • Crossmedia: Text-to-Speech synthesis (visual
    impairment), Speech-to-Text transcription, Sign
    Language processing (hearing impairment), Voice
    command (motor impairment)
  • Crosslingual: language barriers

9
Language Resources and Evaluation
  • Necessity to have a platform to develop
    technologies
  • Language Resources
  • Data: corpora, lexica, dictionaries, terminology
    databases
  • Necessary for research investigations in
    linguistics
  • Necessary for training automatic processing
    systems based on statistical approaches
  • Standards for data distribution
  • Technology Evaluation
  • Compare the performance of systems coming from
    different laboratories, based on different
    approaches, on common data, with the same
    protocol, in the framework of evaluation campaigns
  • Indicate the quality of research and
    technological advances
  • International competition / cooperation
    ("coopetition")
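The evaluation principle above (different systems, common data, same protocol) can be illustrated with a toy harness. This is only a sketch under stated assumptions: the task (language identification), the two systems, the test items and the accuracy metric are all made up for illustration and do not come from any actual campaign.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical shared test set: (input text, reference label) pairs.
TestSet = List[Tuple[str, str]]
System = Callable[[str], str]


def accuracy(system: System, test_set: TestSet) -> float:
    """Fraction of test items where the system output equals the reference."""
    correct = sum(1 for text, ref in test_set if system(text) == ref)
    return correct / len(test_set)


def run_campaign(systems: Dict[str, System], test_set: TestSet) -> List[Tuple[str, float]]:
    """Score every system on the same data with the same metric, best first."""
    scores = [(name, accuracy(fn, test_set)) for name, fn in systems.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy language-identification data and two made-up systems.
test_set = [("bonjour", "fr"), ("hello", "en"), ("guten Tag", "de"), ("merci", "fr")]
systems = {
    "always_english": lambda text: "en",
    "keyword_rules": lambda text: {"bonjour": "fr", "merci": "fr", "hello": "en"}.get(text, "en"),
}
ranking = run_campaign(systems, test_set)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because every system sees exactly the same items and is scored by the same function, differences in the ranking reflect the systems rather than the test conditions.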

10
Courtesy NIST/DARPA

11
Courtesy NIST/DARPA

12
Courtesy NIST

13
Courtesy NIST

14
Mono/Multi/Cross-lingual Technologies
  • Preserve the cultures (languages)
  • Produce monolingual technologies and resources
    for each language
  • Allow for communication between humans
  • Produce crosslingual technologies and resources
    for each language pair
  • Interest in coordinating research and development
    of technologies for each language in order to
    better develop crosslingual technologies
  • Facilitate a joint effort (standards for data and
    tools exchange, experience feedback, Best
    Practices)
  • Necessary for speech translation, for
    crosslingual information retrieval (automatic
    summary), for document localization (spelling
    checker), ...

15
Language coverage
  • Two-speed situation: digital divide
  • 95% of languages are spoken by 6% of the
    population worldwide
  • 90% of languages to disappear in the next century
  • Resourced / under- or non-resourced / oral
    tradition languages
  • Sufficiently large data sets are missing
  • Cf. parallel corpora (bilingual), noisy parallel
    corpora (monolingual/bilingual), comparable
    corpora (monolingual/bilingual), quasi-comparable
    corpora
  • How to address minority and regional languages?
    Migrants' languages? Foreign and regional accents?
  • Political challenge
  • Who pays for it?
  • What about languages which do not have the
    "chance" of being a peacekeeping operations
    language, or the language of victims of war or a
    tsunami?

16
  • International Forum on Multilingualism
  • Bamako, January 2009
  • Follow-up of WSIS
  • Commitment for a universal multilingualism
  • Ethical use of information, in terms of language
  • Multilingual Education (mother tongue)
  • Multilingual Cyberspace
  • Language Resources and Language Technologies

17
How to address the problem?
  • Produce the necessary LT and LR?
  • Several models
  • Centres
  • Google, NIST/LDC, ELRA-ELDA
  • Programs
  • GALE, TDIL, TechnoLangue, Quaero
  • Networks
  • (Oriental) Cocosda, LangNet, Clarin, FLaReNet,
    T4ME proposal
  • Advantages / Drawbacks
  • Sustainability, links to the community, links to
    applications, quality checking

18
Commercial interest
  • Google
  • Search Engine in 124 languages (national and
    regional)
  • Online MT and crosslingual information retrieval
  • From 13 languages and 30 language pairs (2008)
  • to 41 languages (including Catalan and Galician)
    and 1,640 language pairs (now)
  • Google Book Search (7 million documents in 44
    languages)
  • Microsoft
  • Word
  • Spelling Checker in 126 languages (233 with
    regional variants)
  • Grammar Checker in 6 languages (61 with regional
    variants)

19
TDIL
  • TDIL: Technology Development for Indian
    Languages
  • One of the 10 priorities of the National Indian
    Information Society Program
  • Automatic Language Processing for English and 18
    languages constitutionally recognized in India
  • Assamese, Bengali, Gujarati, Hindi, Kannada,
    Kashmiri, Konkani, Malayalam, Manipuri, Marathi,
    Nepali, Oriya, Punjabi, Sanskrit, Sindhi, Tamil,
    Telugu, Urdu.
  • Several Language Technologies
  • Machine Translation, Text-to-Speech synthesis,
    Speech recognition, Search engines, Character
    recognition, Spelling checkers, Language
    Resources for 19 languages

20
FRANCE
  • Technolangue National Program
  • 2 M€ / year over 3 years (2002-2005)
  • Language Resources
  • Dictionaries (bilingual), lexica, corpora,
    terminological databases, tools
  • 8 evaluation campaigns (spoken / written language)
  • Syntactic Parsing
  • Automatic Terminology Extraction
  • Information Retrieval (Question Answering)
  • Text-to-Speech Synthesis
  • Oral Dialog
  • Speech Transcription for automatic indexing of
    radio/TV broadcasts
  • Speech Corpus: 1,600 hours, 100 hours transcribed,
    1 million words, 350 speakers
  • Workshop with linguists (data, tools, results)

21
FRANCE
  • Parallel Text Alignment for automatic bilingual
    dictionary building
  • French - English, German, Italian, Spanish
  • French - Arabic, Mandarin, Greek, Japanese,
    Farsi, Russian
  • Automatic Translation (English-French and
    Arabic-French)
  • Statistical or Rule-Based Approaches
  • Study of Evaluation Metrics (BLEU, mWER, X-score,
    WNM) for measuring the quality of an automatic
    machine translation by comparison with a human
    reference translation.
  • www.technolangue.net
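Metrics such as mWER score a machine translation by comparing it with human reference translations. The Python sketch below shows one common reading of the idea: word error rate (word-level edit distance divided by reference length), and a multi-reference variant that keeps the score against the closest reference. It is an illustration only; the tokenization (whitespace splitting), the function names and the example sentences are assumptions, and the exact definitions used in the Technolangue campaigns may differ.

```python
from typing import List


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Levenshtein distance counted in words (substitution, insertion, deletion)."""
    prev = list(range(len(ref) + 1))  # distances from an empty hypothesis
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # distance from hyp[:i] to an empty reference
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(hypothesis: str, reference: str) -> float:
    """Word error rate against one reference (assumed non-empty)."""
    hyp, ref = hypothesis.split(), reference.split()
    return word_edit_distance(hyp, ref) / len(ref)


def mwer(hypothesis: str, references: List[str]) -> float:
    """Multi-reference WER: the error rate against the closest reference."""
    return min(wer(hypothesis, ref) for ref in references)


# Made-up example: one system output, two human references.
hypothesis = "the cat sat on mat"
references = ["the cat sat on the mat", "a cat was sitting on the mat"]
print(f"mWER = {mwer(hypothesis, references):.3f}")
```

Lower is better: a score of 0 means the output matches one of the references word for word. Using several references softens the penalty for legitimate variation in wording.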

22
FRANCE
  • Quaero Program
  • Multilingual and multimedia document processing
  • Structured around technology development (30)
    for the needs of applicative projects (6), based
    on the use of corpora and evaluation
  • Several languages (>9)
  • 200 M€ budget over 5 years (2008-2012)

23
FRANCE
  • 09.08: États généraux du multilinguisme
  • Paris-La Sorbonne, EU French Presidency
  • MCC (DGLF2), MAEE, MEN, MESR; 1,000 participants
  • Reports: "Multilingualism, translation and
    circulation of cultural works in Europe" and
    "Multilingualism, competitiveness and social
    cohesion"
  • 11.08: Resolution of the European Ministerial
    Council on Multilingualism
  • 03.09: Resolution of the European Parliament on
    Multilingualism
  • Euro-Mediterranean Program
  • First produce the tools to allow for
    multilingualism?
  • Multilingual Innovation Portal

24
European Commission
  • Commissioner for Multilingualism: L. Orban
    (01.07)
  • A new strategic framework for multilingualism
    (2005)
  • High Level Group on Multilingualism (2006)
  • Communications to European Parliament and Council
    (2008)
  • DGT Call for Proposals (June 2008)
  • Study of the size of language industries in the
    EU
  • Single European Information Space
  • European Information Society for Growth and Jobs
  • Language Technologies are back!
  • Unit E1: Language Technologies, Machine
    Translation (07.08)
  • Challenge 2: Cognitive systems, Interaction,
    Robotics
  • Objective 2.1: Cognitive systems, Interaction,
    Robotics
  • Objective 2.2: Language-based interaction (4th
    FP7 Call)

25
Lang-Net proposal
  • Build-up ERA-Net proposal (infrastructural)
  • LR, LT evaluation, Standards, Survey
  • Information sharing
  • Strategic activities and Best Practice
  • Implementation of joint activities
  • Transnational research activities
  • Identify EU countries or regions with similar
    programs
  • 11 countries / regions in partnership: Germany,
    France, Italy, Czech Republic, Denmark, Norway,
    The Netherlands / Belgium-Flanders (Dutch
    Language Union), Spain, Sweden, Basque and Trento
    regions
  • Advanced EU contacts: Austria, Catalonia,
    Finland, Greece, Iceland, Portugal, Switzerland,
    UK
  • NMS: Slovenia, Cyprus, Poland, Hungary, Malta,
    Baltic countries, Romania, Bulgaria
  • Contacts: USA, Japan, South Africa, Israel,
    Canada
  • Submitted to the last FP6 ERA-NET Call (2005)
  • DG Research, HSS sector. Rejected

26
CLARIN
  • Common Language Resources and Technology
    Infrastructure
  • ESFRI (European Strategy Forum on Research
    Infrastructures)
  • Distribution of LR and tools for Human and Social
    Sciences
  • 4.1 M€ EC funding
  • 2008-2010: preparatory phase
  • 200 M€ planned budget
  • 2008-2020: preparatory, construction and
    exploitation phases
  • 124 members (32 countries)

27
FLaReNet
  • Fostering Language Resources Network
  • Thematic Network E-Content
  • 0.9 M€ EC funding (2008-2011)
  • Language Resources (data/tools) for Automatic
    Language Processing (ICT)
  • Think Tank
  • Launching event (Vienna, February 2009)
  • 78 institutional members (30 countries), 200
    individual members
  • Link EU-US INTEROP (NSF)

28
T4ME
  • Technologies for a Multilingual Europe
  • Proposal of a NoE in FP7 4th ICT Call
  • Pushing the frontiers (MT/LT)
  • Open Resource Infrastructure (ORI)
  • On-line LR production, annotation,
    standardisation, validation, distribution
  • On-line LT evaluation
  • Vision, planning, promotion (LT)
  • Under evaluation

29
Next steps
  • European Technology Platform (ETP) / SRA?
  • Joint Technology Initiative (JTI)?
  • Industrial support from large companies?
  • ERA-Net? ERA-Net?
  • Cf. LangNet (2005)
  • Article 169?
  • Allows for supporting a large effort jointly
    shared among Member States (regions), the European
    Commission and industry
  • Double decision from European Council and
    Parliament
  • Existing initiatives: Infectious diseases (600 M€),
    Ambient Assisted Living (AAL), Support to SMEs,
    Metrology, Research in the Baltic Sea

30
Shared effort
  • Language Technologies are well suited to a shared
    effort
  • Coordination ensured by the EC
  • Management, standards, technology evaluation,
    communication
  • Core Language Technologies development
  • LR availability ensured by Member States /
    Regions
  • LR (mandatory): corpora, lexica, dictionaries
  • LT development / adaptation to the specificities
    of their language(s)

31
Conclusion and Perspectives
  • Language Technologies may facilitate
    multilingualism (in Europe, and elsewhere)
  • Very large effort for developing the necessary
    technologies for all (European) languages
  • Number of Technologies x Number of Languages
  • Too large for the European Commission alone
  • Interest in sharing the effort with the Member
    States and the regions (Language Resources)
  • It would be possible to generalize this
    cooperation scheme at the international level

32
Pros and cons
  • Current debates on which model to favor
  • Centres (Google, LDC, ELRA)
  • Sustainability ensured (business model)
  • Issues: coverage to be ensured, language
    priorities, links with producers, cross-border
    funding
  • Networks (Cocosdas, Clarin, FLaReNet)
  • Collaborative (networks of producers /
    users)
  • Issues: sustainability, permanent coordination,
    funding, quality assurance of the distributed
    resources
  • Programs (Quaero, GALE, Choral)
  • Applications are taken into account
  • Issues: sustainability, openness beyond the
    partnership (participation in evaluations,
    availability of resources)