ISO TC 37 / SC4 Language Resources

About This Presentation
Title:

ISO TC 37 / SC4 Language Resources

Description:

ISO TC 37 / SC4 Language Resources: An overview (Amended 2-5 February 2002), Laurent Romary

Slides: 26

Transcript and Presenter's Notes


1
ISO TC 37 / SC4 Language Resources
  • An overview
  • (Amended 2-5 February 2002)
  • Laurent Romary

2
Standards for language processing
  • Access protocols: Corba, SOAP
  • Primary resources (text, dialogues): structural mark-up, basic annotations; TEI, MPEG7, TMX (XHTML), etc.
  • Knowledge structures: hierarchies of types, relations between concepts (subjects/topics, etc.), links to primary resources; Topic Maps, OIL, RDF
  • Links
  • NLP structures (annotations): POS tagging, chunks (cf. Named Entities), deep syntactic structures, co-references, etc.; Eagles/ISLE, CES, MATE
  • Lexical structures (language models): terminologies, transfer lexica, LTAG/HPSG/LFG lexica; TBX, OLIF, Eagles/ISLE (Genelex)
  • Meta-data: Dublin Core, OLAC, ISLE, MPEG7, RDF

3
Context
  • ISO TC37 - Terminology and other language
    resources
  • SC3 - Computer applications in terminology
  • ISO 12200 - Martif
  • Latest version of TEI Terminology chapter
  • ISO 12620 - Data categories
  • ISO CD (DIS under ballot) 16642 - TMF
    (Terminological Markup Framework)
  • SC4 - Language resources

4
TC37/SC4 details
  • Scope: Platform for designing and implementing
    linguistic resource formats and processes
  • Multi-layer annotation of linguistic resources
  • Exchange of information between NLP modules
  • General strategy
  • Involve a wide community from academia and
    industry
  • Identification of experts in the various work
    items
  • Involvement through national standardizing bodies
  • Agenda
  • Current identification of possible work items
    and working groups
  • Constituency meeting and technical workshop at
    LREC (May 2002)

5
Organization
  • Secretary
  • Prof. Key-Sun Choi, Korea
  • Chair
  • Laurent Romary, France
  • International Advisory Committee
  • Permanent Chair: Prof. Antonio Zampolli, Italy

6
SC4 and other standardizing bodies
Contributing organizations
  • TEI: text representation; reference for primary sources, e.g. text archives
  • OSCAR
  • W3C: basic protocols and formats (XML (Schemas), XPath, XPointer, RDF, SVG, SMIL, SOAP)
  • ISO TC37/SC4: language resources from an NLP perspective, e.g. linguistic annotations, lexical formats
  • MPEG: multimedia, XML based, e.g. MPEG7-4 word and phone lattices
  • What about gestures? Kinetic in the TEI; SMIL?
Diagram area labels: Text, Audio/Speech, Technical background

7
Working groups
  • WG1: Basic descriptors and mechanisms for
    language resources
  • Convener: Laurent Romary
  • WG2: Representation schemes
  • Convener: Kiyong Lee
  • WG3: Multilingual text representation
  • Convener: Alan K. Melby
  • WG4: Lexical databases
  • Convener: ??
  • WG5: Workflow of Language Resource Management
  • Convener: Christian Galinski

8
TC37/SC4 Work Items
  • WG1/WI-0 Terminology of Language Resources
  • WG1/WI-1 Linguistic annotation framework
  • WG1/WI-2 Meta-data for multimodal and
    multilingual information
  • WG2/WI-3 Structural content representation
    scheme
  • WG2/WI-4 Multimodal content representation scheme
  • WG2/WI-5 Discourse level representation scheme

9
TC37/SC4 Work Items - cont.
  • WG3/WI-6a Translation Memory, Alignment of
    parallel corpora
  • WG3/WI-6b Segmentation and counting algorithms
    (characters, words, sentences etc.)
  • WG3/WI-6c Meta-markup for GIL (Globalization,
    Internationalization and Localization)
  • WG4/WI-7 NLP Lexica
  • WG5/WI-8 Validation of language resources
  • WG5/WI-9 Net-based distributed cooperative work
    for the creation of LRs

10
WI-0
  • Terminology of Language Resources
  • Basic terminology of the various sub-fields of
    language resources and general methodology
  • Project leader: Klaus-Dirk Schmitz
  • Sources
  • ISO 1087
  • LREC proceedings KAIST
  • English dictionaries in Linguistics?
  • Support from GTW

11
WI-1
  • Linguistic annotation framework
  • Basic mechanisms and data structures for
    linguistic annotation and representation data
    architecture
  • Methods and principles for the design of an
    annotation scheme
  • Structural nodes and information units, Data
    category specification
  • Linking and pointing mechanisms, Feature
    Structures, Meta-Markup
  • Stand-off and in-line views: equivalences,
    combining levels
  • Administrative data categories
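
The equivalence between stand-off and in-line views mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not something taken from the slides: the sample sentence, the offset-based annotation records, the `<w>` element, the `pos` attribute and the function name `to_inline` are all hypothetical.

```python
from xml.sax.saxutils import escape

# Primary text, left untouched by the stand-off annotations.
text = "Laurent chairs SC4"

# Stand-off view (hypothetical format): each annotation points into the
# primary text by character offsets instead of being embedded in it.
standoff = [
    {"start": 0, "end": 7, "pos": "NOUN"},
    {"start": 8, "end": 14, "pos": "VERB"},
    {"start": 15, "end": 18, "pos": "NOUN"},
]


def to_inline(text, annotations):
    """Build an in-line view by wrapping each annotated span in a <w> element.

    Assumes the annotations do not overlap; text outside any span is copied
    through unchanged.
    """
    out, cursor = [], 0
    for ann in sorted(annotations, key=lambda a: a["start"]):
        out.append(escape(text[cursor:ann["start"]]))
        span = escape(text[ann["start"]:ann["end"]])
        out.append(f'<w pos="{ann["pos"]}">{span}</w>')
        cursor = ann["end"]
    out.append(escape(text[cursor:]))
    return "".join(out)


inline = to_inline(text, standoff)
print(inline)
```

Both views carry the same information here; the stand-off form keeps the primary text read-only, which is what makes it possible to combine several annotation levels over one source.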

12
WI-1 - cont.
  • Project leader: Nancy Ide (TBC)
  • Contributors: Alan Melby, Koiti Hasida, Lee
    Gillam, Yves Savourel, Laurent Romary
  • Possible sources
  • TMF, iso12620-revised, Mate (general methodology)
  • TEI (Linking mechanisms, feature structures)
  • Link with Linguistic DS

13
WI-2
  • Meta-data for multimodal and multilingual
    information
  • Description of a meta-data representation scheme
    to document linguistic information structures and
    processes
  • General content description
  • Local content description
  • Project leader: Peter Wittenburg, MPI (Nijmegen,
    NL)
  • Participants: Steven Bird, TEI-aware person
  • Possible sources
  • OLAC, Mile, TEI Header
  • Liaison: TC46 (SC9), MPEG7/MDS, SCORM

14
WI-3
  • Structural content representation scheme
  • Definition of annotation/representation scheme(s)
    for morpho-syntax and syntax, to be used for
    annotation and interchange purposes
  • Meta-model for morpho-syntactic annotation
  • Meta-model(s) for syntactic annotation
    (lexicalized grammar, elementary trees,
    dependency structures)
  • Corresponding data category registries

15
WI-3 - cont.
  • Project leader: John Carroll ??
  • Participants: Nuria Bell, representatives from
    existing TreeBanks initiatives
  • Possible sources
  • Eagles, TAGML, Linguistic DS
  • SIGPARSE

16
WI-4
  • Multimodal meaning representation scheme
  • Representation scheme for the semantic content of
    multimodal information (textual, spoken,
    graphical and gestural)
  • Meta-model for content representation (Events,
    participants, etc.)
  • Data category registry for multimodal content
  • Project leader: Harry Bunt (id1)
  • Possible sources
  • SIGSEM working group on semantic content
  • Chair 1
  • Liaison
  • Semantic web activities

17
WI-5
  • Discourse level representation scheme
  • Meta-model for discourse and dialogue
    representation
  • Meta-model for discourse level annotation (e.g.
    reference annotation)
  • Corresponding data category (DatCat) registry
  • Possible sources
  • SIGDIAL
  • DRI - Discourse Resource Initiative
  • Mate

18
WI-6a
  • Translation Memory, Alignment of parallel corpora
  • Provides formats for the representation of
    multilingual textual data as produced in
    translation activities or constructed from
    existing primary sources
  • Sources
  • OSCAR/TMX for translation memories
  • TEI based linking mechanism (or see WI-1) for
    Parallel texts

19
WI-6b
  • Segmentation and counting algorithms (characters,
    words, sentences etc.)
  • Provide methods for segmenting streams of text
    with markup and means for counting the
    corresponding segments
  • Possible sources
  • OSCAR
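
As a rough illustration of what segmenting a marked-up text stream and counting the resulting segments could look like, here is a minimal sketch. It is not the algorithm envisaged by the work item: the tag-stripping regular expression, the sentence-boundary rule, the definition of a word as a run of `\w` characters, and the names `strip_markup` and `count_segments` are all simplifying assumptions.

```python
import re


def strip_markup(stream):
    """Drop anything that looks like an XML/HTML tag (a deliberate simplification)."""
    return re.sub(r"<[^>]+>", "", stream)


def count_segments(stream):
    """Count characters, words and sentences in a marked-up text stream.

    Assumed rules: markup is ignored, a word is a run of word characters,
    and a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    text = strip_markup(stream)
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"\w+", text)
    return {
        "characters": len(text),
        "words": len(words),
        "sentences": len(sentences),
    }


counts = count_segments("<p>Standards <b>matter</b>. Count the words!</p>")
print(counts)
```

Even this toy version shows why a shared definition is useful: whether markup, whitespace and punctuation count as characters, or how abbreviations affect sentence boundaries, changes the totals.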

20
WI-6c
  • Meta-markup for GIL (Globalization,
    Internationalization and Localization)
  • Identification of the specific markup modules
    needed to perform GIL activities
  • Possible sources
  • OSCAR/OpenTag

21
WI-7
  • NLP lexica
  • Lexicon representation formats for the various
    types of NLP applications (Machine Readable
    Lexica)
  • Define a set of meta-models (classes of
    applications)
  • Specific data categories (derivation, phonology,
    etc.)
  • Based on the work done in other work items
  • Possible sources
  • Eagles
  • Multext
  • ISLE Computational lexicon Working group
  • OLIF

22
WI-8
  • Validation of language resources
  • Defines guidelines and requirements for producing
    and distributing high quality language resources
  • Contacts
  • ELRA, TEI
  • Possible sources
  • To be defined

23
WI-9
  • Net-based distributed cooperative work for the
    creation of LRs
  • Principles and methods for designing
    collaborative and cooperative compilation of LRs
  • Define what is specific to LRs with regard to:
  • Traceability of resources, version control,
    validation, quality management
  • Protocols (Corba, SOAP), Workflow standards, Data
    management
  • Contacts: Christian Galinski, Remi Zajac,
  • Sources: To be defined

24
Liaison - OSCAR (AKM)
  • Brief history of LR exchange standards
  • Parallel events since 1997
  • Open Tag - meta-markup (XML vs. Others)
  • Major current OSCAR activities
  • TMX - Translation Memory eXchange
  • Counting and segmentation algorithms
  • TBX (Terminologies) and OLIF (MT lexica)
  • XLIFF and CGS - Annotation of source code and
    localisation of web sites
  • xml:lang etc. J. DeCamp and S.-E. Wright

25
Liaison - TEI (LR)
  • General architecture and data modeling
  • WI-1
  • Annotations (paragraph level, external
    annotations)
  • WI-1
  • TEI Header
  • WI-2
  • NLP lexica (with regard to terminologies and
    dictionaries)
  • WI-7
  • Feature structures
  • WI-1