Standards, Use and Prospects for Language Resource Management

1
Standards, Use and Prospects for Language
Resource Management
  • Key-Sun Choi
  • 16 Aug. 2008
  • TII, Moscow

2
MOTIVATION
3
Wikipedia
  • Web-based collaborative authoring multi-lingual
    encyclopedia
  • 8.29 M pages/ 253 languages (2007/9)
  • 2.0 M pages/ English (2007/9) now 5.0 M pages

Computer science
Category Classification
Databases
Computer scientists
Algorithms
Category Page
Martin Kay
Robert Watson
Parallel database
SQL
Divide and Conquer
4
Problem: IS-A Relation Extraction from Wikipedia
  • Relation Classification from Category System
  • By Term Formation Rule, Wikipedia Structure
    (Ponzetto & Strube, 2007)

Relation Classification
IS-A relation
Upper-lower level Category relation
Not IS-A relation
Computer science
Not IS-A
IS-A
IS-A
Databases
Computer scientists
Algorithms
5
Relation Extraction by Pattern
  • (Ryu & Choi, 2007)
  • http://cseight.kaist.ac.kr:8080/RelExt

Computer display mode
IS-A
Text mode
6
Problem: IS-A Relation Extraction from Wiktionary
  • Web-based Collaborative Multilingual Dictionary
  • 617,639 entries / 401 languages
  • IS-A relation extraction from definition patterns
  • http://cseight.kaist.ac.kr:8080/Wiktionary

IS-A
IS-A
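A minimal sketch of IS-A extraction from a definition pattern, assuming dictionary definitions of the form "A/An/The <genus> that ...". The regular expression, the stop-word list and the function name `extract_isa` are illustrative assumptions, not the system behind the URL above.

```python
import re

# Hypothetical definition pattern: an article, then the genus phrase, which
# ends at a relativizer/preposition, at punctuation, or at end of text.
_DEF = re.compile(
    r"^(?:an?|the)\s+(?P<genus>.+?)"
    r"(?:\s+(?:that|which|who|of|for|used|with|in)\b|[,.;]|$)",
    re.IGNORECASE,
)

def extract_isa(entry, definition):
    """Return (entry, 'IS-A', head of the genus phrase), or None if no match."""
    m = _DEF.match(definition.strip())
    if not m:
        return None
    # Naive head selection: take the last word of the genus phrase.
    head = m.group("genus").strip().split()[-1].lower()
    return (entry, "IS-A", head)

print(extract_isa("camera", "A device that takes video."))
```

Taking the last word as head only suits head-final noun phrases such as English ones; a real system would use a parser or chunker.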
7
Problem: IS-A Relation Extraction from WordNet
  • Semantic Word Net (English)
  • 117,798 nouns, 82,115 synsets (Ver. 3.0)
  • IS-A relation extraction through IS-A between
    synsets

Synset 12
engineering, applied science
IS-A
IS-A
Synset 22
Synset 23
Synset 33
chemical engineering
computer science, computing
electrical engineering
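A small sketch of how IS-A links between synsets can be expanded into IS-A pairs between words, using the example words on this slide. The synset keys and the two dictionaries are toy data made up for illustration; they are not WordNet identifiers or a WordNet API.

```python
# Toy data: synset key -> member words (keys are hypothetical labels).
SYNSETS = {
    "engineering": ["engineering", "applied science"],
    "chemical_engineering": ["chemical engineering"],
    "computer_science": ["computer science", "computing"],
    "electrical_engineering": ["electrical engineering"],
}

# Toy data: child synset -> parent synset (the IS-A links between synsets).
HYPERNYM = {
    "chemical_engineering": "engineering",
    "computer_science": "engineering",
    "electrical_engineering": "engineering",
}

def isa_pairs(synsets, hypernym):
    """Expand synset-level IS-A links into (word, parent word) pairs."""
    pairs = set()
    for child, parent in hypernym.items():
        for child_word in synsets[child]:
            for parent_word in synsets[parent]:
                pairs.add((child_word, parent_word))
    return pairs

pairs = isa_pairs(SYNSETS, HYPERNYM)
```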
8
LMF
  • Lexical Markup Framework

9
Wikipedia IS-A Annotation
IS-A (Entry, Term in Page)
IS-A (Term in Page, Term in Page)
Synonymy (Entry, Term in Page)
10
What is common representation?
  • Graph Structure

11
Linguistic Annotation Framework
  • ISO GrAF: graph-structure-based annotation
  • GrAF XML schema type hierarchy
  • graphElementType (attributes: ID, type)
  • edgeType extends graphElementType
  • nodeType extends graphElementType
  • spanType extends nodeType (attributes: start, end)
  • graphElementSetType
  • edgeSetType extends graphElementSetType
  • nodeSetType extends graphElementSetType
  • featureStructureType
  • featureType
  • annotationSetType
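The type hierarchy above can be mirrored with plain Python dataclasses. This is a sketch only: the class names, the `features` dict standing in for featureStructureType, and the `from_id`/`to_id` edge endpoints are assumptions made for illustration, not the GrAF schema itself.

```python
from dataclasses import dataclass, field

@dataclass
class GraphElement:              # graphElementType: ID, type
    id: str
    type: str = ""

@dataclass
class Node(GraphElement):        # nodeType extends graphElementType
    # Assumption: a plain dict stands in for featureStructureType.
    features: dict = field(default_factory=dict)

@dataclass
class Span(Node):                # spanType extends nodeType: start, end
    start: int = 0
    end: int = 0

@dataclass
class Edge(GraphElement):        # edgeType extends graphElementType
    from_id: str = ""            # assumption: endpoints are node ids
    to_id: str = ""

text = "Skin cancer is caused by sun exposure."
effect = Span(id="n1", type="NP", start=0, end=11)
cause = Span(id="n2", type="NP", start=25, end=37)
link = Edge(id="e1", type="cause", from_id=cause.id, to_id=effect.id)
```

Spans point into the primary text by character offsets, so the annotation stays stand-off and the text itself is never modified.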

12
Problem: Causality between Terms
  • Causal relation between terms
  • Term clustering based on inter-term causality
  • Terms with similar causality tend to denote
    similar concepts.
  • Realization and evaluation

Skin cancer usually appears in adulthood, but
it is caused by sun exposure and sunburns
that began in childhood.
TG
Stat5
Interleukin-2
IL-2
Egr-1
IFN-gamma
13
Is it true? Terms with similar causality tend to
denote similar concepts.
  • The oral bacteria that cause gum disease appear
    to be the culprit. Cigarette smoking and use of
    smokeless tobacco products may also cause gum
    disease. Gum disease is the second most common
    cause of toothache.

Periodontal disease can lead to toothache.
Cigarette smoking is the number one environmental
risk for periodontal disease.
14
What to do
  • Is it true?
  • Terms with similar causality tend to denote
    similar concepts
  • We test term clustering based on causal
    information
  • Show that causality is an effective feature
    for term clustering.
  • Focus on
  • Causal NP pair extraction (Chang and Choi, 2004)
  • Causal term pair extraction
  • Term clustering based on causal similarity
  • Term clustering evaluation
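One way to make "causal similarity" concrete is to describe each term by the causal relations it takes part in and compare terms by set overlap. The sketch below uses Jaccard similarity as an assumed measure (the slides do not name one); the causal pairs are read off the gum disease example on the previous slide.

```python
from collections import defaultdict

def causal_features(pairs):
    """Map each term to the set of causal relations it participates in."""
    feats = defaultdict(set)
    for cause, effect in pairs:
        feats[cause].add(("causes", effect))
        feats[effect].add(("caused_by", cause))
    return feats

def jaccard(a, b):
    """Set overlap in [0, 1]; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# (cause, effect) pairs taken from the example sentences.
PAIRS = [
    ("oral bacteria", "gum disease"),
    ("cigarette smoking", "gum disease"),
    ("smokeless tobacco products", "gum disease"),
    ("gum disease", "toothache"),
    ("periodontal disease", "toothache"),
    ("cigarette smoking", "periodontal disease"),
]

feats = causal_features(PAIRS)
sim = jaccard(feats["gum disease"], feats["periodontal disease"])
print(sim)
```

"Gum disease" and "periodontal disease" share a cause and an effect, which is the behaviour the hypothesis predicts for near-synonymous terms.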

15
Features on term clustering (1/3)
  • Useful features for Term clustering
  • Internal feature
  • Word lexicon/structure in terms
  • (Bourigault and Jacquemin, 1999) POS sequences
    including insertion
  • NPDNInsAj: Noun1 ((Adv? Adj){0-3} Prep Det? (Adv?
    Adj){0-3}) Noun3
  • 93-98% precision
  • Outer-term feature
  • Structural modifier/modifiee of term
  • Some words nearby term
  • (Maynard et al., 2000)
  • Hand-made semantic frame information

16
Feature Structure Representation
  • (1) Employee
  • <SEX, female>, <NAME, Sandy Jones>, <AGE, 30>
  • (2) Sound segment /p/
  • <CONSONANTAL, +>, <ANTERIOR, +>, <VOICED, ->,
    <CONTINUANT, ->
  • (3) Grammatical features of the verb love
  • <POS, verb>, <VALENCE, transitive>,
    <SEMANTIC_RELATION, loving>,
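A flat feature structure maps directly onto a dictionary of attribute-value pairs. The sketch below is illustrative: the helper `to_pairs` is a hypothetical name, and the "+" values for /p/ are an assumed reading of the slide, whose plus signs did not survive transcription.

```python
def to_pairs(fs):
    """Render a flat feature structure in <ATTRIBUTE, value> notation."""
    return ", ".join(f"<{attr}, {value}>" for attr, value in fs.items())

employee = {"SEX": "female", "NAME": "Sandy Jones", "AGE": 30}

# Assumption: CONSONANTAL and ANTERIOR are "+" for /p/.
segment_p = {"CONSONANTAL": "+", "ANTERIOR": "+",
             "VOICED": "-", "CONTINUANT": "-"}

print(to_pairs(employee))
print(to_pairs(segment_p))
```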

17
FSR Graph vs. Matrix Notation
  • (No Transcript)

18
Related Works on term clustering (3/3)
  • Discussion
  • Causal information is a kind of long-distance
    contextual information

Cigarette smoking and use of smokeless tobacco
products may also cause gum disease.
cause
use
Gum disease
Smokeless tobacco product
19
Event ternary extraction
Skin cancer usually appears in adulthood, but it
is caused by sun exposure and sunburns that began
in childhood.
Dependency Structure (figure; nodes: appears,
usually, Skin cancer, in, adulthood, but, caused,
it, is, by, and, Sun exposure, Sunburns, began,
that, in, child)
NP chunking
Reference finding
Cue phrases filtering
Verb selection
Causal event pair candidate: <cause event, cue
phrase, effect event>
Skin cancer RNP caused by CNP sun exposure
Skin cancer RNP caused by CNP sunburns
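The last step of this pipeline, producing <cause event, cue phrase, effect event> triples, can be sketched with a single cue phrase. This is a rough stand-in that assumes NP chunking and reference finding have already been done (so "it" has been replaced by "Skin cancer"); the regular expression and the function name `causal_triples` are illustrative assumptions.

```python
import re

# Hypothetical pattern for one cue phrase, "caused by"; the cause phrase
# ends at a relative clause ("that ..."), at punctuation, or at end of text.
CUE = re.compile(
    r"(?P<effect>.+?)\s+(?:is|are|was|were)\s+(?P<cue>caused by)\s+"
    r"(?P<causes>.+?)(?:\s+that\b|[.,;]|$)"
)

def causal_triples(clause):
    """Return (cause event, cue phrase, effect event) triples."""
    m = CUE.search(clause)
    if not m:
        return []
    # Coordinated causes ("A and B") yield one triple each.
    causes = re.split(r"\s+and\s+", m.group("causes"))
    effect = m.group("effect").strip()
    return [(c.strip(), m.group("cue"), effect) for c in causes]

resolved = ("Skin cancer is caused by sun exposure and sunburns "
            "that began in childhood.")
print(causal_triples(resolved))
```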
20
Representation Scheme
  • Morpho-syntactic Annotation Framework
  • Syntactic Annotation Framework

21
Morpho-Syntactic Annotation Framework MAF
  • <token id="t1">to</token>
  • <token id="t2">eventually</token>
  • <token id="t3">decide</token>
  • <wordForm lemma="to_decide" tokens="t1 t3"/>
  • <wordForm lemma="eventually" tokens="t2"/>
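The example shows a word form that spans non-adjacent tokens ("to ... decide"). A short sketch of resolving word forms back to their tokens with the standard library follows; the enclosing `<maf>` root element is an assumption added so that the fragment parses as one document.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in an assumed <maf> root element.
MAF = """<maf>
  <token id="t1">to</token>
  <token id="t2">eventually</token>
  <token id="t3">decide</token>
  <wordForm lemma="to_decide" tokens="t1 t3"/>
  <wordForm lemma="eventually" tokens="t2"/>
</maf>"""

def word_forms(xml_text):
    """Return (lemma, [token strings]) for every wordForm element."""
    root = ET.fromstring(xml_text)
    tokens = {t.get("id"): (t.text or "").strip() for t in root.iter("token")}
    return [
        (wf.get("lemma"), [tokens[ref] for ref in wf.get("tokens").split()])
        for wf in root.iter("wordForm")
    ]

print(word_forms(MAF))
```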

22
MAF token
  • <token id="t1">The</token>
  • <token id="t2">victim</token>
  • <token id="t3">'s</token>
  • <token id="t4">friends</token>
  • <token id="t5">told</token>
  • <token id="t6">police</token>
  • <token id="t7">that</token>
  • <token id="t8">Krueger</token>
  • <token id="t9">drove</token>
  • <token id="t10">into</token>
  • <token id="t11">the</token>
  • <token id="t12">quarry</token>
  • <token id="t13">and</token>
  • <token id="t14">never</token>
  • <token id="t15">surfaced</token>
  • <token id="t16">.</token>

23
Syntactic Annotation Framework
24
Semantic Annotation Framework TimeML
  • no more than 60 days
  • <TIMEX3 tid="t1" type="DURATION" value="P60D"
    mod="EQUAL_OR_LESS">no more than 60 days</TIMEX3>
  • the dawn of 2000
  • <TIMEX3 tid="t2" type="DATE" value="2000"
    mod="START">the dawn of 2000</TIMEX3>
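A TIMEX3 value such as "P60D" is an ISO 8601 duration, so a consumer can turn the annotation into a number of days plus a modifier. The sketch below handles only the day-count form "P<n>D" shown on this slide; the function name `duration_days` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET
from datetime import timedelta

def duration_days(timex_xml):
    """Return (timedelta, mod) for a TIMEX3 DURATION of the form P<n>D."""
    el = ET.fromstring(timex_xml)
    if el.tag != "TIMEX3" or el.get("type") != "DURATION":
        return None
    m = re.fullmatch(r"P(\d+)D", el.get("value", ""))
    if not m:
        return None  # other ISO 8601 duration forms are not handled here
    return (timedelta(days=int(m.group(1))), el.get("mod"))

example = ('<TIMEX3 tid="t1" type="DURATION" value="P60D" '
           'mod="EQUAL_OR_LESS">no more than 60 days</TIMEX3>')
print(duration_days(example))
```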

25
ONTOLOGY EXTRACTION/LEARNING AND
QUESTION-ANSWERING
26
(No Transcript)
27
Word Segmentation
28
MULTILINGUAL INFORMATION FRAMEWORK
29
IT Ontology
IT Core Ontology
30
A Scenario
Control Server
Ontology Reasoner
Rule Reasoner
User
What is the best RTOS Vendor?
Do you know?
No
What is RTOS?
Real-time Operating System
What are instances?
VxWorks
Vendor?
Wind River
. .
Microsoft
Which is better?
31
Dialogue acts
  • Well-known examples of communicative functions
    (core dialogue acts)
  • question
  • WH-question
  • YN-question
  • check/verification
  • statement/inform
  • answer (WH-answer. YN-answer)
  • confirmation, disconfirmation
  • request
  • instruct
  • promise
  • acknowledgement
  • greeting

32
General-purpose functions
  • Applicable in any dimension are
  • Information-seeking functions
  • WH-question, YN-question,
    Alternatives-question, Check,..
  • Information-providing functions
  • Inform, WH-Answer, YN-Answer, Confirmation,
    Disconfirmation, Agreement, Correction,..
  • Commissive functions
  • Offer, Promise, AcceptRequest,..
  • Directive functions
  • Instruct, Request, Suggest,..

33
DiaML concrete syntax
  • ltdiaML idd2 speakers addresseea
    markablem1 commfunctionscfs1gt
  • ltsourceText idm1 sb1..se1blabla
    sb3..se3blablagt
  • ltcfs idcfs1 taskFunf1 feedbackFunf2gt
  • ltcomfun idf1 functionanwer
    respTod1gt
  • ltcomfun idf2 functionpositiv
    respTod1gt
  • lt/cfsgt
  • lt/diaMLgt

34
From sentence to ontologies
artifact
contents
device
ontology

camera
video
(camera, ISA, device) (camera, hasPropertyOf,
that AND (take video))
Triplets extraction
Dependency analysis
camera
is
device
takes
that
video
A camera is a device that takes video.
Term recognition
Sentence
A camera is a device that takes video.
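The step from sentence to triplets can be illustrated for the single pattern used on this slide. The sketch replaces dependency analysis with one regular expression, so it covers only "A X is a Y that VERBs Z"; the pattern, the naive removal of a final "s" from the verb, and the function name `triples` are all assumptions for illustration.

```python
import re

# Hypothetical surface pattern standing in for dependency analysis.
PATTERN = re.compile(
    r"^An?\s+(?P<term>\w+)\s+is\s+an?\s+(?P<genus>\w+)\s+that\s+"
    r"(?P<verb>\w+?)s?\s+(?P<obj>\w+)\.?$",
    re.IGNORECASE,
)

def triples(sentence):
    """Return the ISA and hasPropertyOf triples for a matching sentence."""
    m = PATTERN.match(sentence.strip())
    if not m:
        return []
    term = m.group("term").lower()
    prop = f"that AND ({m.group('verb').lower()} {m.group('obj').lower()})"
    return [
        (term, "ISA", m.group("genus").lower()),
        (term, "hasPropertyOf", prop),
    ]

print(triples("A camera is a device that takes video."))
```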
35
Standards for language processing
  • Access protocols [Corba, SOAP]
  • Primary resources (text, dialogues): structural
    mark-up, basic annotations [TEI, MPEG7, TMX
    (XHTML), etc.]
  • Knowledge structures: hierarchies of types,
    relations between concepts (subjects/topics
    etc.), links to primary resources [Topic Maps,
    OIL, RDF]
  • Links
  • NLP structures (annotations): POS tagging, chunks
    (cf. Named Entities), deep syntactic structures,
    co-references etc. [Eagles/ISLE, CES, MATE, ...]
  • Lexical structures (language models):
    terminologies, transfer lexica, LTAG/HPSG/LFG
    lexica [TBX, OLIF, Eagles/ISLE (Genelex)]
  • Meta-data [Dublin core, OLAC, ISLE, MPEG7, RDF]
36
Context
  • ISO TC37 - Terminology and other language
    resources
  • SC3 - Computer applications in terminology
  • ISO 12200 - Martif
  • Latest version of TEI Terminology chapter
  • ISO 12620 - Data categories
  • ISO CD (DIS under ballot) 16642 - TMF
    (Terminological Markup Framework)
  • SC4 - Language resources

37
TC37/SC4 details
  • Scope: Platform for designing and implementing
    linguistic resource formats and processes
  • Multi-layer annotation of linguistic resources
  • Exchange of information between NLP modules
  • General strategy
  • Involve a wide community from academia and
    industry
  • Identification of experts in the various work
    items
  • Involvement through national standardizing bodies
  • Agenda
  • Current identification of possible work items
    and working groups
  • Constituency meeting and technical workshop at
    LREC (May 2002)

38
Organization
  • Chair
  • Laurent Romary, France
  • Secretary
  • Key-Sun Choi, Korea
  • International Advisory Committee
  • Chair: Prof. Antonio Zampolli, Italy

39
SC4 and other standardizing bodies
  • TEI
  • text representation
  • Reference for primary sources
  • e.g. text archives

Oscar
Text
  • W3C
  • basic protocols and formats
  • XML (Schemas)
  • XPath
  • XPointer
  • RDF, SVG, SMIL, SOAP

ISO TC37/SC4 - language resources, NLP
perspective e.g. linguistic annotations, lexical
formats
Technical background
  • What about gestures?
  • Kinetic in the TEI
  • SMIL?

MPEG - Multimedia, XML based e.g. MPEG7-4 Word
and phone lattices
Audio/Speech
40
TC37/SC4 Work Items
  • WG1/WI-0 Terminology of Language Resources
  • WG1/WI-1 Linguistic annotation framework
  • WG1/WI-2 Meta-data for multimodal and
    multilingual information
  • WG2/WI-3 Structural content representation
    scheme
  • WG2/WI-4 Multimodal content representation scheme
  • WG2/WI-5 Discourse level representation scheme

41
TC37/SC4 Work Items - cont.
  • WG3/WI-6a Multilingual text representation
  • WG4/WI-7 NLP Lexica
  • WG5/WI-8 Net-based distributed cooperative work
    for the creation of LRs

42
WI-0
  • Terminology of Language Resources
  • Basic terminology of the various sub-fields of
    language resources and general methodology
  • Project leader: Klaus-Dirk Schmitz
  • Sources
  • ISO 1087
  • LREC proceedings KAIST
  • English dictionaries in Linguistics?
  • Support from GTW

43
WI-1
  • Linguistic annotation framework
  • Basic mechanisms and data structures for
    linguistic annotation and representation data
    architecture
  • Methods and principles for the design of an
    annotation scheme
  • Structural nodes and information units, Data
    category specification
  • Linking and pointing mechanisms, Feature
    Structures, Meta-Markup
  • "Stand-off" and "in-line" views:
    equivalences, combining levels.
  • Administrative data categories

44
WI-1 - cont.
  • Project leader: Nancy Ide (TBC)
  • Contributors: Alan Melby, Koiti Hasida, Lee
    Gillam, Yves Savourel, Laurent Romary
  • Possible sources
  • TMF, iso12620-revised, Mate (general methodology)
  • TEI (Linking mechanisms, feature structures)
  • Link with Linguistic DS

45
WI-2
  • Meta-data for multimodal and multilingual
    information
  • Description of a meta-data representation scheme
    to document linguistic information structures and
    processes
  • General content description
  • Local content description
  • Project leader: Peter Wittenburg, MPI (Nijmegen,
    NL)
  • Participants: Steven Bird, TEI aware person
  • Possible sources
  • OLAC, Mile, TEI Header
  • Liaison: TC46 (SC9), MPEG7/MDS, SCORM

46
WI-3
  • Structural content representation scheme
  • Definition of annotation/representation scheme(s)
    for morpho-syntax and syntax, to be used for
    annotation and interchange purposes
  • Meta-model for morpho-syntactic annotation
  • Meta-model(s) for syntactic annotation
    (lexicalized grammar, elementary trees,
    dependency structures)
  • corresponding Data category registries
  • Project leader: John Carroll ??
  • Participants: Nuria Bell
  • Possible sources
  • Eagles, TAGML, Linguistic DS
  • SIGPARSE
  • Working group with representatives from existing
    TreeBanks initiatives

47
WI-4
  • Multimodal meaning representation scheme
  • Representation scheme for the semantic content of
    multimodal information (textual, spoken,
    graphical and gestural)
  • Meta-model for content representation (Events,
    participants, etc.)
  • Data category registry for multimodal content
  • Project leader: Harry Bunt (id1)
  • Possible sources
  • SIGSEM working group on semantic content
  • Chair 1
  •  Liaison 
  • Semantic web activities

48
WI-5
  • Discourse level representation scheme
  • Meta-model for discourse and dialogue
    representation
  • Meta-model for discourse level annotation (e.g.
    reference annotation)
  • corresponding DatCat registry
  • Possible sources
  • SIGDIAL
  • DRI - Discourse Resource Initiative
  • Mate

49
WI-6
  • Multilingual text representation scheme
  • Framework for representing language specific and
    multi-lingual textual information
  • Translation Memory
  • Alignment of parallel corpora
  • Word count algorithms (characters, words,
    segments)
  • Possible sources
  • TMX for translation memories
  • TEI based linking mechanism (or see WI-1) for
    Parallel texts

50
  • WI 6A
  • Translation Memory, Alignment of parallel corpora
  • Sources
  • OSCAR/TMX for translation memories
  • TEI based linking mechanism (or see WI-1) for
    Parallel texts

51
  • WI 6b
  • Segmentation and counting algorithms (characters,
    words, sentences etc.)
  • Sources
  • OSCAR

52
  • WI 6C
  • Meta-markup for GIL (Globalization,
    Internationalization and Localization)
  • Possible sources
  • OSCAR/OpenTag

53
WI-7
  • NLP lexica
  • Lexicon representation formats for the various
    types of NLP applications (Machine Readable
    Lexica)
  • Define a set of meta-models (classes of
    applications)
  • Specific data categories (derivation, phonology,
    etc.)
  • Based on the work done in other work items
  • Sources
  • Eagles/multext
  • ISLE Computational Working group/Genelex
  • OLIF

54
WI-8
  • Net-based distributed cooperative work for the
    creation of LRs
  • Principles and methods for designing
    collaborative and cooperative compilation of LRs
  • Define what is specific to LRs with regards
  • Traceability of resources, version control,
    validation, quality management
  • Protocols (Corba, SOAP), Workflow standards, Data
    management
  • Contacts: Christian Galinski, Remi Zajac
  • Sources

55
Liaison - OSCAR
  • Brief history of LR exchange standards
  • Parallel events since 1997
  • Open Tag - meta-markup (XML vs. Others)
  • Major current OSCAR activities
  • TMX - Translation Memory eXchange
  • Counting and segmentation algorithms
  • TBX (Terminologies) and OLIF (MT lexica)
  • XLIFF and CGS - Annotation of source code and
    localisation of web sites
  • xml:lang etc. (J. DeCamp and S.-E. Wright)

56
Liaison - TEI
  • General architecture and data modeling
  • WI-1
  • Annotations (paragraph level, external
    annotations)
  • WI-1
  • TEI Header
  • WI-2
  • NLP lexica
  • WI-7
  • Feature structures
  • WI-1