European Language Resource Association: A European Infrastructure for Language Resource Distribution and HL Technology Evaluation. Khalid CHOUKRI


European Language Resource Association
A European Infrastructure for Language Resource
distribution And HL Technology evaluation
Khalid CHOUKRI, ELRA/ELDA
55 Rue Brillat-Savarin, F-75013 Paris, France
Tel. 33 1 43 13 33 33 -- Fax. 33 1 43 13 33 30
  • Rationale behind ELRA
  • ELRA's Mission, Structure, Services
  • Membership
  • Identification
  • Distribution
  • Legal issues
  • Market forecasts, needs and requirements
  • Promotion
  • ELRA Catalogue: a quick overview (BLARK)
  • Activities in Europe / European national scenes:
    role of ELRA
  • The ENABLER Initiative
  • Conclusion

European Language Resource Association: An
Improved Infrastructure for Data Sharing
Centralized not-for-profit organization for the
collection, distribution, and validation of
Language Resources and tools.
Operational agency: ELDA
(Evaluation & Language Resources Distribution)
The Association
  • Membership Drive
  • ELRA is open to Europeans and non-Europeans
  • Resources are available to members and non-members
  • Pay per resource
  • Some of the benefits of becoming a member:
  • Substantial discounts on LR prices (over 70%)
  • Legal and contractual assistance with respect to
    LR matters
  • Access to Validation and production manuals
    (Quality assessment)
  • Figures and facts about the Market (results of
    ELRA surveys)
  • Newsletter and other publications

European Language Resource Association: An
Improved Infrastructure for Data Sharing
An Association of users of Language Resources
  • A Repository Center?
  • Technical & logistic issues
  • Commercial issues (prices, fees, royalties)
  • Legal issues (Licensing, IPR)
  • Information Dissemination

Application to Norwegian
Infrastructure for the evaluation of Human
Language Technologies, providing resources,
tools, methodologies, logistics, exit strategies
/ capitalization on evaluation packages
ELRA Offer
Membership Drive
  • Colleges: Speech, Written, Terminology
  • Membership fees: 4 categories

Legal Issues: Licensing
Provider-User Agreements
Legal Issues: Licensing
Distribution Agreement
Quick Overview: Basic Language Resources
(Spoken & Written Resources)
  • What should be available for all languages
  • Lexicons based on: PAROLE, SIMPLE, (Euro)WordNet,
    and more generally EAGLES/ISLE
  • Corpora: (country/language) National Corpus;
    (...) business/scientific corpus;
    (...) broadcast news transcriptions
  • Multilingual/bilingual: lexica;
    corpora (comparable / aligned / parallel)

Quick Overview: Basic Language Resources
(Spoken & Written Resources)
  • What should be available for all languages
  • Articulatory databases (e.g. ACCOR)
  • Basic speech data: some phonetic material and
    some phonetic sequences, by a small number of
    speakers, recorded in a quiet environment
    (EUROM 1, BABEL)
  • Pronunciation lexicon (BDLEX, PHONOLEX)
  • Proper names pronunciation lexicon (ONOMASTICA)
  • Newspaper read text (BREF, Siemens-100, Apasci)
  • Basic telephone speech (SPEECHDAT)
  • Telephone-based speaker verification (PolyVar)
  • Text corpora for language models (MLCC, Le
    Monde)

BLARK: Basic LAnguage Resource Kit
Basic Speech resources (Europe)
  • A: Available through ELRA
  • S: Available through ELRA within the next quarter
  • E: Exist/identified but not (never!) available
  • "blank": Probably not available / has not been
    identified
  • U: Under completion / well-advanced project with
    distribution plans
  • We exclude the lexicons that come with SpeechDat
  • Available through German telecoms
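The availability codes in the legend above lend themselves to a small lookup structure. The following is only a sketch under stated assumptions: the enum member names, the helper `obtainable_via_elra`, and the example cells are hypothetical and do not reproduce the actual BLARK table, which is not preserved in this transcript.

```python
from enum import Enum


class Availability(Enum):
    """Availability codes taken from the BLARK legend (member names are assumed)."""
    AVAILABLE = "A"              # available through ELRA
    NEXT_QUARTER = "S"           # available through ELRA within the next quarter
    EXISTS_NOT_AVAILABLE = "E"   # exists/identified but not available
    UNDER_COMPLETION = "U"       # well-advanced project with distribution plans
    NOT_IDENTIFIED = ""          # "blank": probably not available / not identified


def obtainable_via_elra(code: str) -> bool:
    """True for the two codes meaning ELRA distributes the resource now or soon."""
    status = Availability(code.strip())
    return status in (Availability.AVAILABLE, Availability.NEXT_QUARTER)


# Hypothetical cells, for illustration only (not values from the real table).
example_cells = {
    ("resource type 1", "language 1"): "A",
    ("resource type 1", "language 2"): "E",
    ("resource type 2", "language 1"): "",
}

# Cells that would count as gaps in a language's basic resource kit.
gaps = [cell for cell, code in example_cells.items() if not obtainable_via_elra(code)]
print(gaps)
```

Listing the gaps per language in this way is one possible reading of how such a matrix could be used to spot missing basic resources.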
Funding(s) of Language Resources
  • Public Funding
  • Commission of the European Union (e.g. R&D FPs)
  • National agencies & authorities
  • Private Funding

Criteria for Language Resources funding .. !
Brief Overview of recent activities in
Europe: European Union Level
Some projects within FP5 and previous FPs
related to our concerns
  • Resources production: SpeechDat Family
  • Specifications of new types of resources: Natural
    Interaction and MultiModality within the ISLE
    (International Standards for Language
    Engineering) project
  • Standards/metadata: EAGLES and its extension,
    the EU/US collaborative project ISLE;
    coming soon: INTERA
  • Coordination: ENABLER; coming soon: NEMLAR
  • Information gathering & dissemination: Euromap
    and its follow-up Hope

SpeechDat Family
  • SpeechDat(M): fixed telephone network,
    1K speakers
  • SpeechDat-II: fixed, mobile, 1-5K speakers
  • SpeechDat-II: speaker verification
  • SpeechDat-E (CEE: Polish, Czech, Slovak, Russian, ...)
  • SALA (Speech Across Latin America), and now:
  • SpeechDat-Car (inc. cellular)
  • SpeeCon (Consumer products)
  • Orientel

SpeechDat Family: SpeeCon Project
SpeechDat Family: SALA-II, what you may get
with PRIVATE funding
SALA II: cellular/mobile network (1000
speakers). Partner: Latin
America, US and Canada
Brief Overview of recent activities at National Level
Top-down vs bottom-up approaches
Examples of National Projects/programs
Over nine national projects, among which:
  • Netherlands & Belgium: Continue; now Release 5.
    Data available via ELRA (release of April 2002)
  • France: Action Techno-Langue
  • Italy: Infrastruttura nazionale per le risorse
    linguistiche nel settore del trattamento
    automatico della lingua naturale parlata e
    scritta
  • Norway: Norwegian Language Bank
  • Dutch / Flemish
National Projects/programs: Example of Italy
With contribution from N. Calzolari and A.
  • 2 national projects under 2 different programs.
  • The programs were not specific to HLT:
    one was for industrial R&D
    and the other for the South of Italy.
  • Both projects are coordinated by A. Zampolli in
  • Goal: to extend core resources built in EU
    projects, create new LR, the tools needed to
    manage the resources, a platform for NLP
    development, and technology transfer towards SMEs.

National Projects/programs: Example of Italy
TAL - Infrastruttura nazionale per le risorse
linguistiche nel settore del trattamento
automatico della lingua naturale parlata e
scritta [National infrastructure for language
resources in the field of automatic processing
of spoken and written natural language]
(with 13 partners of private
organisations). Duration 2 years, finished in
Partners:
  • CPR - Consorzio Pisa Ricerche
  • ITC - Istituto Trentino di Cultura
  • CSELT - Centro Studi e Laboratori
    Telecomunicazioni
  • SYNTHEMA
  • CVR - Consorzio Venezia Ricerche
  • CERTIA - Centro per la Ricerca, Sviluppo,
    Formazione nelle Tecnologie e Applicazioni
    Informatiche
  • QUINARY
  • Soluzioni Tecnologiche
  • INTERACTIVE MEDIA
  • NECSY - Network Control Systems
National Projects/programs Example of Italy
  • Linguistica computazionale: ricerche monolingui
    e multilingui [Computational linguistics:
    monolingual and multilingual research]
    (cluster "Linguistica", Law 488, with 16
    partners of private and public organisations).
  • Duration 3 years; will finish in 2003.
  • Partners: CPR, Pisa; CIRASS, Napoli; THAMUS,
    Salerno; ILC-CNR, Pisa; SYNTHEMA, Pisa;
    Istituto Universitario Orientale, Napoli;
    Dipartimento di Scienze Storiche del Mondo
    Antico, Università di Pisa; Sportello per la
    Cooperazione Scientifica e Tecnologica con i
    Paesi del Mediterraneo (SMED) del CNR, Napoli.

National Projects/programs Example of Italy
Italy: Infrastruttura nazionale per le risorse
linguistiche nel settore del trattamento
automatico della lingua naturale parlata e
scritta
  • ItalWordNet (50,000 entries).
  • Corpus di italiano parlato (corpus of spoken
    Italian): 100 hours of speech consisting of
  • a) 10h radio/TV broadcast data (news,
    interviews, talk shows),
  • b) 60h Map Task-like collection
  • c) 5h lab data for lexical coverage
  • d) 10h telephone conversational speech
  • e) 10h domain-specific (finance, tourist
    information, etc.)
  • Annotated dialogues for speech interfaces (H-H
    and H-M interactions)
    (Dialoghi annotati per applicazioni di
    interfacce vocali avanzate)
  • 450 dialogues annotated at all levels
    (morphology, prosody, semantics, ...)

Bergen 2002/10/24-25 Norwegian Language Bank
National Projects/programs: Example of Italy
The goal was to extend core resources built in EU
projects and to create new LR, the tools needed to
manage the resources, a platform for NLP
development, and technology transfer towards SMEs.
The total cost was about 7 million euro, with
funding of almost 5 million euro. The costs were
equally divided between the Spoken and Written
areas. In both projects the consortia agreed to
distribute the LR through ELRA (with a special
price for Italian users).
Now, after the TIPI conference in Rome, under the
sponsorship of the Ministry of Communications,
the topic of HLT has been inserted in the
Framework Programme for the financing of R&D in
Italy. It was also decided to constitute a Forum
for HLT, of which Zampolli is president. The
Forum will start working soon, in particular to
prepare new national initiatives, to maintain LR,
to write a white paper on HLT in Italy, and to
coordinate with national activities in other EU
countries.
Example of France: National Projects/programs
France: Technolangue Action
With contribution from J. Mariani
Ministère de la Culture et de la Communication
Ministère de la Jeunesse, de l'Éducation
Nationale et de la Recherche
Ministère de l'Économie, des Finances et de
l'Industrie
Language Technologies: "TechnoLangue"
 TechnoLangue  action
  • Report to Prime Minister (November 2000)
  • Meeting Min. Industry, Research, Culture June
  • Action: technology survey and evaluation
  • Basic Technological Research
  • Articulate with present actions
  • Research & Innovation Technological Networks
    (RRIT)
  • 4 ICT RRITs: Telecommunications, Software,
    Micro/nanotechnologies, Audiovisual & multimedia
  • Ministry of Research action on Technological
    Survey (VSE)

 TechnoLangue  structure
  • Infrastructure program to support technological
    innovation, while existing R&D projects stay with
    RRIT & VSE (120 M€ / year)

[Diagram; recoverable labels: Basic Research,
Technology Development, Application Development;
Quantitative Evaluation, Usage Evaluation;
meeting points with technology development;
technologies necessitated for applications;
bottleneck identification; research results in
quantitative evaluation; technologies which have
been validated; long term / high risk, large
return on investment; usability, acceptability]
 TechnoLangue  action
  • Organization
  • Executive Committee (EC) chaired by C. Fluhr
  • Comprising 15 members
  • 3 RRIT representatives: B. Bachimont (INA -
    RIAM), C. Sedogbo (Thalès - RNTL), C. Waast (IBM
    - RNRT)
  • 3 public research: C. Fluhr (CEA), E. Geoffrois
    (DGA), P. Paroubek (Limsi-CNRS)
  • 5 industrials: K. Choukri (ELDA), B. Normier
    (Lingway), J.-J. Rigoni (Elan Informatique), F.
    Segond (Xerox), C. Sorin (FT R&D)
  • 4 administrations: S. Chaudiron (MR), J. Mariani
    (MR), D. Malbert (MCC), J. Mathieu (MinEFI)
  • Good balance between research & industry

 TechnoLangue  action
  • Install a User Committee
  • Ministry of Foreign Affairs
  • Automatic translation, multilingualism
  • Ministry of public administration
  • Simplification of the administrative language...
  • Ministry of National Education
  • Training technologies, language training...

 TechnoLangue  Call
  • International cooperation
  • Cooperation mechanisms within TechnoLangue:
    foreign entities may participate in the projects,
    financing from their own funds
  • Future cooperation among similar national
  • EU Countries (Italy, Germany, Norway, Spain,
    Greece, The Netherlands, Switzerland)
  • Prepare the construction of the European Research
    Area (ERA)
  • The EC supports the coordination and generic
    technologies cost
  • Each country supports the cost of covering its
    language(s): specific technology
    development/adaptation, (annotated) corpus
    (spoken/written), lexicon (incl. pronunciation),
  • USA, Japan, South Africa

 TechnoLangue  Call
  • 4 meetings of the Executive Committee
  • A Call for Proposals with 4 parts
  • Part 1 Language resources
  • Part 2 Evaluation
  • Part 3 Norms & standards
  • Part 4 Technological survey
  • Calendar
  • Launched April 15, 2002
  • Deadline May 31 / June 10 (Electronic) - June
    17 (Paper)
  • Results July 19, 2002

 TechnoLangue  Call
  • Language resources
  • Spoken/written data (corpus, dictionaries,
    terminological data)
  • Basic Language Processing Tools (Open Source)
  • Production, validation, distribution (incl.
    legal, economical aspects)
  • For a large use by a large community (education,
  • Evaluation
  • Technology (evaluation campaign)
  • Applications (evaluation toolkits)
  • Methodology (metrics / protocols)
  • Norms & standards
  • Shared effort to improve French participation
  • Technological survey
  • In relationship with on-going actions (Euromap...)

Part 1 Language Resources
  • Stimulate the production and the distribution of
    language resources for
  • answering minimal needs (Basic LAnguage Resource
    Kit) for the French language
  • promoting resource reusability
  • supporting research
  • helping industrial application development
  • decreasing the cost of entering the sector for
    newcomers
  • Should include the French language, possibly in
    connection with other languages

Part 1 Language Resources
  • Spoken and written data
  • oral corpus, pronunciation lexicons, etc.
  • databases for speech synthesis
  • monolingual and multilingual text corpus
    (parallel, comparable...)
  • lexicons, terminology, grammars,...
  • Lexical semantic resources ontologies,
  • Multimodal corpus,...etc
  • Basic software tools
  • morphosyntactic taggers, syntactic parsers,
    semantic tools,
  • terminology extractors,
  • language identifiers,
  • corpus annotation tools,
  • lemmatizers, etc.

Part 1 Language Resources
  • Encourage and facilitate the use of those
  • Putting them in new (young) user hands
  • Same approach as for GUIs & VUIs
  • Language Technology Kits with user's guide
  • Distribution towards specialized education
    entities (NLP, Document Engineering) and more
    broadly towards training centers (Universities,
    Technical Universities, Engineering schools...)
  • While ensuring feedback from experience
  • Open Source software economic model

Part 2 Evaluation
  • 3 areas
  • Technology evaluation
  • Application evaluation
  • Evaluation methodologies

Part 2 Evaluation
  • Technology evaluation
  • Organization of comparative evaluation campaigns
    for technologies presently not covered by
    European or international programs, or with a
    complementary approach
  • Includes the production of the data necessary for
    the evaluation, in a monolingual, multilingual or
    crosslingual context
  • Scientific and industrial interest of the
    evaluation should appear (large enough number of
  • The projects must define the evaluation
    methodology and justify the practical
    organization aspects

Part 2 Evaluation
  • Application evaluation
  • The objective is to develop evaluation
    methodologies for industrial or pre-industrial
  • The methodologies may result in toolboxes, also
    regrouping user-oriented methodologies and
    protocols, or in test software packages
  • The methodologies should be generic (class of
  • The proposals should demonstrate the project's
    economic and industrial interest, and how the
    toolboxes will be distributed

Part 2 Evaluation
  • Evaluation methodologies
  • Improve the present evaluation methodologies
  • Identify new (quantitative and qualitative)
    approaches for already evaluated technologies
  • socio-technical and psycho-cognitive aspects
  • cognitive modeling of evaluation
  • Identify protocols for new technologies and
  • Virtual Reality, Multimodal interaction, Language
    on the Internet...

Part 3 Standards
  • Support the participation of French actors in
    normalization and standardization bodies
  • Presently weak participation of French actors in
    normalization and standardization bodies
  • Of strategic importance
  • Variety of places where the normalization
    activities are taking place official or
    non-official committees, forums, projects,...

Part 3 Standards
  • Actions
  • Support the creation of consortia to reinforce
    the French presence in various bodies (ISO, CEN,
  • Help the sharing of efforts among French
  • Identify a topic and ensure a permanent
    participation in all related bodies: character
    sets, exchange format, phonetic alphabet
    transcription, etc.
  • Necessity of articulating the project with French
    bodies already involved: AFNOR, W3C French

Part 4 Survey
  • Part 4 - Install an information survey
  • Create a portal on Language Engineering in order
    to give access to
  • panorama of the industrial and technological
  • state-of-the-art in science and technology
  • identification of language resources
  • identification of technological bottlenecks
  • a list of Calls for Proposals
  • a presentation of key market figures
  • information on norms and standards (with
    Internet links)
  • Should be linked with existing sites

  • 52 proposals submitted
  • Total proposal costs: 35.9 M€
  • Total requested support: 21.7 M€
  • Clustering within each of the 4 topics
  • 26 projects selected
  • 173 participations, 94 participants:
  • 33 industry
  • 39 public research
  • 11 other (Associations, CEA, DGA)
  • 11 foreign (Bell Labs, NII, EPFL, LATL)
  • Budget: 6.2 M€
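The figures on this slide can be cross-checked with a few lines of arithmetic. This is a minimal sketch using only the numbers listed above (amounts in millions, as given on the slide); the variable names are illustrative, and reading the budget as a share of the total requested support is an assumption, since the slide does not say what the budget is measured against.

```python
# Participant breakdown as listed on the slide.
participants = {
    "industry": 33,
    "public research": 39,
    "other": 11,
    "foreign": 11,
}
total_participants = sum(participants.values())

# Amounts in millions, as given on the slide.
proposal_costs_m = 35.9
requested_support_m = 21.7
budget_m = 6.2

# Share of the proposal costs for which support was requested.
requested_share = requested_support_m / proposal_costs_m

# Assumption: the budget is compared with the total requested support.
budget_coverage = budget_m / requested_support_m

print(f"participants: {total_participants}")
print(f"requested share of costs: {requested_share:.1%}")
print(f"budget vs requested support: {budget_coverage:.1%}")
```

The four participant categories add up to the 94 participants stated on the slide; support was requested for roughly 60% of the proposal costs, and the budget corresponds to a bit under 30% of the requested support.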

  • 26 selected projects
  • 8 on Language resources
  • BLARK (Cf BNC), Fr-En, G, Sp, It, Arabic
  • Specialized (aerospace, automotive), proper
    names dictionaries
  • Aligned corpus (7 novels of 19th-century
    literature in 4 languages)
  • 6 on Tools (Open source)
  • Lemmatizer, Chunker, Guesser, Tagger, Parser,
    Speaker recognition, Topic & NE detector,
    summarizer, terminology extractor, Search
    engine...
  • 3 on Standards (Spoken / Written)
  • 1 on Technological survey (Portal)
  • 8 on Evaluation 7 on technology, 1 on usage

Technology Evaluation
  • Written language
  • Machine translation
  • Text alignment
  • Syntactic parsing
  • Information query
  • Spoken Language
  • Speech transcription / indexing (incl. Named
  • Speech synthesis
  • Spoken dialog

French Techno-Langue Conclusions
  • Launch a large national program on Language
    Technology (TechnoLangue)
  • In the perspective of installing a permanent
    infrastructure for Language Resources,
    Evaluation, Standards and Survey
  • Hope that it can participate in the construction
    of the European Research Area
  • And articulates well with international activities

Example of NORWAY National Projects/programs
Norway Norwegian Language Bank
  • language technology resources in Norway
  • Launch conference 24-25 October 2002 (Bergen,
  • The language bank will contain three types of
    data spoken data, text and lexical resources.
  • It will be organized as a foundation with state
  • The estimated budget is about NOK 100 million,
    (12 M)

ENABLER: European National Activities for Basic
Language Engineering Resources
Information Dissemination
(Bilingual English/French issued each quarter)