Title: The Cross-Lingual Reuse and Extension of Knowledge Resources in Ontological Semantics
1The Cross-Lingual Reuse and Extension of
Knowledge Resources in Ontological Semantics
- Marjorie McShane, Sergei Nirenburg, Stephen Beale, Margalit Zabludowski
- The Institute for Language and Information Technologies (ILIT)
- University of Maryland Baltimore County
2Plan of the Talk
- Background
  - the OntoSem environment
  - the OntoSem processors and their parameterization potential
  - language-independent and language-dependent knowledge resources
- Focus of the talk: the benefits and challenges of using our current, English, lexicon to seed the building of lexicons for other languages
- For comparison (one of many possible): the European Union's SIMPLE lexicon-building project
3What is OntoSem?
- Ontological Semantic (OntoSem) text processing takes as input open text and returns text-meaning representations (TMRs), which are structured representations of its meaning (Nirenburg and Raskin, Ontological Semantics, MIT Press, forthcoming).
- TMRs are the basis for all other applications: MT, QA, Knowledge Extraction
- OntoSem is a practical, non-toy system that concentrates on encoding and interpreting text meaning (in contrast with most stochastic methods, which focus on surface strings)
4OntoSem Processors
- Processors include:
  - pre-processor (tokenization, POS tagging, morphological analysis, etc.)
  - syntactic analyzer
  - semantic analyzer
- All processors are parameterizable; i.e., basic functionality can apply to any language, but certain language-specific information must be recorded
- Recent experiments: porting the analyzers to Arabic and Persian (CRL) was successful in a small experiment
5Language Independent Knowledge Sources
- the TMR language
- an ontology, currently containing 5500 concepts, designed to support many-to-one lexical mappings
- a fact repository, which is a database of real-world facts that represent instances of ontological concepts
- Brief descriptions of each follow.
6Static Resources 1: The TMR Language
- He asked the UN to authorize the war.

REQUEST-ACTION-69
  AGENT             HUMAN-72
  THEME             ACCEPT-70
  BENEFICIARY       ORGANIZATION-71
  SOURCE-ROOT-WORD  ask
  TIME              (< (FIND-ANCHOR-TIME))
ACCEPT-70
  THEME             WAR-73
  THEME-OF          REQUEST-ACTION-69
  SOURCE-ROOT-WORD  authorize
ORGANIZATION-71
  HAS-NAME          United-Nations
  BENEFICIARY-OF    REQUEST-ACTION-69
  SOURCE-ROOT-WORD  UN
HUMAN-72
  HAS-NAME          Colin Powell
  AGENT-OF          REQUEST-ACTION-69
  SOURCE-ROOT-WORD  he    (reference resolution has been carried out)
WAR-73
  THEME-OF          ACCEPT-70
  SOURCE-ROOT-WORD  war
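
The frame structure of the TMR above can be mirrored in a small data structure. The following is a minimal Python sketch, not OntoSem's actual representation: the dictionary layout, the `INVERSES` table, and the helper `missing_inverses` are assumptions made for illustration, and the TIME slot is omitted for brevity. It checks that each case-role link in the example has its inverse recorded on the target frame.

```python
# Hypothetical encoding of the TMR for "He asked the UN to authorize the war."
# Frame names and slot values are copied from the slide; the layout is assumed.
tmr = {
    "REQUEST-ACTION-69": {
        "AGENT": "HUMAN-72",
        "THEME": "ACCEPT-70",
        "BENEFICIARY": "ORGANIZATION-71",
        "SOURCE-ROOT-WORD": "ask",
    },
    "ACCEPT-70": {
        "THEME": "WAR-73",
        "THEME-OF": "REQUEST-ACTION-69",
        "SOURCE-ROOT-WORD": "authorize",
    },
    "ORGANIZATION-71": {
        "HAS-NAME": "United-Nations",
        "BENEFICIARY-OF": "REQUEST-ACTION-69",
        "SOURCE-ROOT-WORD": "UN",
    },
    "HUMAN-72": {
        "HAS-NAME": "Colin Powell",
        "AGENT-OF": "REQUEST-ACTION-69",
        "SOURCE-ROOT-WORD": "he",
    },
    "WAR-73": {
        "THEME-OF": "ACCEPT-70",
        "SOURCE-ROOT-WORD": "war",
    },
}

# Assumed table of relations and their inverses, limited to those in the example.
INVERSES = {
    "AGENT": "AGENT-OF",
    "THEME": "THEME-OF",
    "BENEFICIARY": "BENEFICIARY-OF",
}


def missing_inverses(frames):
    """Return (frame, relation, target) triples whose inverse link is absent."""
    problems = []
    for name, slots in frames.items():
        for relation, inverse in INVERSES.items():
            target = slots.get(relation)
            if target in frames and frames[target].get(inverse) != name:
                problems.append((name, relation, target))
    return problems
```

Calling `missing_inverses(tmr)` on the example returns an empty list, since every AGENT, THEME, and BENEFICIARY link on the slide is matched by its inverse.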
7Static Resources 2: The Ontology
8Ontology (cont.)
- Why our ontology is not just a word net: it contains many properties and their values (an average of 16 per concept), scripts, etc.
- BIOLOGICAL-WEAPON
  - MADE-OF MICRO-ORGANISM
  - INSTRUMENT-OF DESTROY, ANIMAL-DISEASE, PLANT-DISEASE
- GUN
  - INSTRUMENT-OF DISCHARGE, HUNTING-EVENT
  - PRODUCED-BY GUNSMITH
  - MADE-OF METAL (PLASTIC, WOOD)
- COLON-CANCER
  - LOCATION COLON
  - ESTABLISHED-BY COLONOSCOPY, SIGMOIDOSCOPY
  - HAS-SYMPTOM CONSTIPATION
  - REMEDIED-BY DRUG, PERFORM-SURGERY
  - EXPERIENCER ANIMAL
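
Because concepts carry properties and values, they can be queried by content rather than only by name. The sketch below is an illustration under stated assumptions: the dictionary layout and the helper `concepts_with` are hypothetical, and the parenthesized fillers in `MADE-OF METAL (PLASTIC, WOOD)` are flattened into one list, which loses whatever distinction the parentheses mark in the actual ontology.

```python
# Hypothetical encoding of the three concepts shown on the slide.
ontology = {
    "BIOLOGICAL-WEAPON": {
        "MADE-OF": ["MICRO-ORGANISM"],
        "INSTRUMENT-OF": ["DESTROY", "ANIMAL-DISEASE", "PLANT-DISEASE"],
    },
    "GUN": {
        "INSTRUMENT-OF": ["DISCHARGE", "HUNTING-EVENT"],
        "PRODUCED-BY": ["GUNSMITH"],
        # Assumption: default and parenthesized fillers are merged here.
        "MADE-OF": ["METAL", "PLASTIC", "WOOD"],
    },
    "COLON-CANCER": {
        "LOCATION": ["COLON"],
        "ESTABLISHED-BY": ["COLONOSCOPY", "SIGMOIDOSCOPY"],
        "HAS-SYMPTOM": ["CONSTIPATION"],
        "REMEDIED-BY": ["DRUG", "PERFORM-SURGERY"],
        "EXPERIENCER": ["ANIMAL"],
    },
}


def concepts_with(prop, filler):
    """Return the sorted names of concepts whose property `prop` lists `filler`."""
    return sorted(
        name for name, props in ontology.items() if filler in props.get(prop, [])
    )
```

For example, `concepts_with("INSTRUMENT-OF", "DESTROY")` selects BIOLOGICAL-WEAPON, a lookup that a bare word net without property values could not support.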
9Static Resources 3: Fact Repository
10Language Dependent Knowledge Sources
- Lexicons and onomastica (lexicons of proper names) are language dependent
- For lexicons, however, the semantic representation is largely transferable across languages, and the syntactic patterns that realize a given semantic meaning often are as well; this is the source of the cross-linguistic portability we'll explore here
11Example of a Basic Lexicon Entry
- watch
  - watch-v1
    - synonyms observe
    - anno
      - definition "to observe, look at"
      - example "He's watching the demolition team."
    - syn-struc
      - subject var1
      - v var0
      - directobject var2
    - sem-struc
      - VOLUNTARY-VISUAL-EVENT
        - agent var1
        - theme var2
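
The variables are what link the two halves of an entry: var1 and var2 are bound by the syn-struc and reused in the sem-struc. The following Python sketch shows that linking for watch-v1. It is an illustration only, not the OntoSem analyzer: the `LexEntry` class, the `instantiate` function, and the concept name TEAM used in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class LexEntry:
    sense: str
    syn_struc: dict  # syntactic role -> variable
    concept: str     # head concept of the sem-struc
    sem_struc: dict  # case role -> variable


# The watch-v1 entry from the slide, in the assumed layout.
watch_v1 = LexEntry(
    sense="watch-v1",
    syn_struc={"subject": "var1", "v": "var0", "directobject": "var2"},
    concept="VOLUNTARY-VISUAL-EVENT",
    sem_struc={"agent": "var1", "theme": "var2"},
)


def instantiate(entry, parse):
    """Fill the sem-struc from a parse given as {syntactic role: filler meaning}."""
    # Step 1: bind each variable to whatever fills its syntactic role.
    bindings = {
        var: parse[role] for role, var in entry.syn_struc.items() if role in parse
    }
    # Step 2: copy the bound values into the case roles that use those variables.
    frame = {"concept": entry.concept}
    for case_role, var in entry.sem_struc.items():
        if var in bindings:
            frame[case_role] = bindings[var]
    return frame
```

With a parse such as `{"subject": "HUMAN", "v": "watch", "directobject": "TEAM"}`, the subject's meaning lands in the agent slot and the direct object's meaning in the theme slot of a VOLUNTARY-VISUAL-EVENT frame.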
12Methods of Expressing Word / Phrase Meaning in OntoSem
- 1. mapping directly to an ontological concept (dog maps to DOG)
- 2. mapping to an ontological concept with modification by properties, e.g.,
  - Zionist maps to POLITICAL-ROLE
    - AGENT-OF SUPPORT
      - THEME Israel
  - asphalt (v.) maps to COVER
    - INSTRUMENT ASPHALT
  - recall (v., as in They recalled the high chairs) maps to RETURN-OBJECT
    - THEME ARTIFACT, INGESTIBLE, MATERIAL
    - CAUSED-BY FOR-PROFIT-CORPORATION
13Methods of Expressing Meaning (Cont.)
- 3. using modality or aspect

(might-aux1
  (def "expresses the possibility of something happening - epistemic .5")
  (ex "he might come over")
  (syn-struc
    ((subject ((root var1) (cat n)))
     (root var0) (cat v)
     (inf-cl ((root var2) (cat v)))))
  (sem-struc
    (var2
      (epistemic .5)
      (agent (value var1))))
  (meaning-procedure
    (fix-case-role (value var1) (value var2))))
14Methods of Expressing Meaning (Cont.)
- 4. using our non-ontological methods of expressing time, sets, etc.

(yesterday-adv1
  (def "the day before the speech time")
  (ex "he admitted that yesterday")
  (syn-struc
    ((root var1) (cat v)
     (mods ((root var0) (cat adv) (type pre-verb-post-clause)))))
  (sem-struc
    (var1
      (time (combine-time (find-anchor-time) (day 1) before)))))
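
The time expression in this entry reads as "one day before the anchor time". How find-anchor-time and combine-time are implemented in OntoSem is not shown on the slide; the sketch below is a minimal Python analogue that assumes the anchor time has already been resolved to a calendar date, and the function name `combine_time` and its signature are hypothetical.

```python
from datetime import date, timedelta


def combine_time(anchor, days, direction):
    """Shift an anchor date by a number of days, before or after it."""
    if direction not in ("before", "after"):
        raise ValueError(f"unknown direction: {direction!r}")
    delta = timedelta(days=days)
    return anchor - delta if direction == "before" else anchor + delta


def yesterday(anchor):
    """Analogue of (combine-time (find-anchor-time) (day 1) before)."""
    return combine_time(anchor, 1, "before")
```

Given an anchor of `date(2003, 3, 1)`, `yesterday` yields `date(2003, 2, 28)`; the meaning of the word stays relative to the speech time rather than naming a fixed date.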
15Methods of Expressing Meaning (Cont.)
- 5. calling a meaning procedure

(she-pro1
  (def "the pronoun 'she'") (ex "she kicked the can.")
  (syn-struc
    ((root var0) (cat n) (type pro)))
  (sem-struc
    (animal))
  (meaning-procedure
    (trigger-reference
      (person third) (number sing) (gender female)
      (same-clause .1) (preceding-clause .7) (pre-preceding-clause .5)
      (preceding-sent .5) (sentence-minus-2 .2) (sentence-minus-3 .1)
      (para-break .5) (repeat-collocation .7) (synonym-collocation .6)
      (agent-theme .8) (pp-embedded .2) (function-match .7) (coord .7))))
16Why Port Semantic Representations?
- Once a semantic representation of a word sense has been created (along with the concurrent extension, if necessary, of other resources), it not only can but should be used to represent the same sense in any language. Why?
- Time (we want to save it!)
- Paraphrase (weapons of mass destruction: a) weapons that can kill more than people; b) nuclear and/or bio weapons)
- Options for resource development (e.g., one can ontologize a fine-grained notion or describe its properties in the lexicon; either is fine)
17A Driving Principle: The Principle of Practical Effability
- What can be expressed in one language can be expressed in every language, so the sem-strucs, by definition, must be portable (apart from their variables)
18What Can Be Involved in Editing a Word/Phrase Sense for a New L (on the example of Polish)
- No modification required, just a new translation: dog > pies. This may rely on global syntactic rules for parameterization; e.g., subject in English maps to Nominative case in Polish as a default, so a basic transitive frame in English can generally map to a basic transitive frame in Polish.
- Manual syntactic modification required; e.g., an object in Polish can have quirky case-marking; an xcomp in English might be realized as a comp in Polish; a category that is optional in English might be required in Polish, or vice versa, etc.
- Linking of variables might be different in different languages.
- Semantic distinctions in one language that are missing in another (e.g., the English hand/arm distinction is missing in many languages).
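
The first two cases can be sketched as a default mapping with per-sense overrides. This is a Python illustration under stated assumptions, not the procedure used in OntoSem: the entry layout, the `DEFAULT_CASE` table, and `port_sense` are hypothetical. The slide states only the subject-to-Nominative default; the accusative default for direct objects, the translations oglądać and pomagać, and the HELP entry are added here for illustration.

```python
import copy

# Assumed global defaults for Polish; only the subject default is from the slide.
DEFAULT_CASE = {"subject": "nominative", "directobject": "accusative"}

# English entries in a simplified, hypothetical layout.
watch_v1 = {
    "word": "watch",
    "syn_struc": {"subject": "var1", "v": "var0", "directobject": "var2"},
    "sem_struc": {"concept": "VOLUNTARY-VISUAL-EVENT", "agent": "var1", "theme": "var2"},
}
help_v1 = {  # hypothetical entry, used only to show a quirky-case override
    "word": "help",
    "syn_struc": {"subject": "var1", "v": "var0", "directobject": "var2"},
    "sem_struc": {"concept": "HELP", "agent": "var1", "beneficiary": "var2"},
}


def port_sense(entry, translation, case_overrides=None):
    """Copy an English sense for L2, keeping the sem-struc unchanged."""
    ported = copy.deepcopy(entry)
    ported["word"] = translation
    # Apply the global defaults to every role that has one.
    cases = {
        role: DEFAULT_CASE[role] for role in entry["syn_struc"] if role in DEFAULT_CASE
    }
    # Manual modification: quirky case-marking replaces the default.
    cases.update(case_overrides or {})
    ported["case"] = cases
    return ported


polish_watch = port_sense(watch_v1, "oglądać")
polish_help = port_sense(help_v1, "pomagać", {"directobject": "dative"})
```

The design point is that the acquirer touches only the translation and, where needed, the overrides; the sem-struc is carried over as is. Differences in variable linking or in semantic distinctions would still require manual editing that this sketch does not cover.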
19From Porting Senses to Porting a Whole Lexicon
- Porting senses is fairly straightforward: one can provide as many translations of a given sense as needed using the synonyms field or, if there are any syntactic differences or semantic nuances to be captured, a new entry; translations can be words or phrases of any complexity
- However, porting a whole lexicon introduces difficulties of a more organizational nature, to which we turn now
20Organizational Issues
- Leave the base lexicon as is or attempt to improve its quality while building L2 (e.g., add more distinguishing properties and values)?
- Expand lexicons simultaneously (e.g., add more senses to English words and their corresponding L2 equivalents; add more words in general)?
- Be driven by correspondences in head words or simply by sem-struc meanings? (e.g., all English senses of table will be in one head entry; should all senses of all L2 translations of table be handled at once during L2 acquisition?)
- To what extent should the regular acquisition process, including ontology supplementation, be carried out on L2?
- Automate? If so, how and how much?
21Insight from an Experiment
- We attempted to do a fast port of part of the English lexicon to Polish to determine time savings, problems, and automation potential
- The experiment was carried out by one bilingual English and Polish speaker working for about a week.
- A portion of the lexicon was ported; problems and successes were reported.
22As Regards Automation
- If an on-line bilingual English-L2 lexicon has one sense for a given word form, it is reasonable to assume identity (big time savings for technical terms, real-world objects (frying pan), etc.). Presenting results in a quickly inspectable format would help.
- If there is a many-to-one correspondence, we would need an interface to really exploit the time savings; otherwise, it is somewhat difficult to keep track of senses.
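
The split between these two cases can be sketched as a triage pass over the word list. The Python below is an illustration, not a tool from the experiment: `triage`, the sense identifiers, and the sample data are hypothetical, and the extra requirement that the English lexicon also have exactly one sense for the word is an added assumption, made so that the translation can be attached to a sense without ambiguity.

```python
def triage(english_senses, bilingual):
    """Split words into automatically portable senses and a manual-review list.

    english_senses: word -> list of sense ids in the English lexicon
    bilingual:      word -> list of L2 translations in an on-line lexicon
    """
    auto, manual = {}, []
    for word, senses in english_senses.items():
        translations = bilingual.get(word, [])
        if len(senses) == 1 and len(translations) == 1:
            # One sense on each side: assume identity of meaning.
            auto[senses[0]] = translations[0]
        else:
            # Several senses or translations: leave for a human acquirer.
            manual.append(word)
    return auto, manual


# Hypothetical sample data for English-Polish.
english_senses = {
    "frying pan": ["frying-pan-n1"],
    "table": ["table-n1", "table-n2"],
}
bilingual = {
    "frying pan": ["patelnia"],
    "table": ["stół", "tabela"],
}

auto, manual = triage(english_senses, bilingual)
```

Here frying pan is ported automatically, while table goes to the manual list, which is the part of the work where an acquisition interface would matter most.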
23As Regards Content of The English and L2 Lexicons
- How the lexicons would best develop simultaneously depends in large part upon the capabilities of acquirers, their training, etc. Driving English acquisition from the L2 side is perfectly fine, since the sem-struc is language independent.
- Need to divide tasks according to their difficulty: a relatively untrained informant could do simple nouns, whereas a trained informant is necessary for polysemous verbs, phrasals, etc.
- Time savings clearly depend upon the organization of efforts: getting bogged down in simultaneous development of multiple lexicons and the ontology is a real risk
24For Comparison: The SIMPLE Project
- Goal: develop 10K-sense semantic lexicons for 12 European Union languages (the earlier PAROLE project developed 20K-sense morphological and syntactic ones)
- Each lexicon is built separately; the word list for each L is based on corpus evidence for that L
- Each L must cover a given inventory of high-level concepts in EuroWordNet to ensure some overlap
- Each L uses the same inventory of template types, which indicates which types of properties should be described for different types of words
25SIMPLE Aims
- Apparently, translation is the main aim, since semantic description is shallow (wouldn't support reasoning)
- Semantic description is limited to:
  - mapping to EuroWordNet concepts (which are iconic, not descriptive; few properties are used)
  - using a slightly expanded version of qualia, which are properties that support reasoning about generative properties of words (Pustejovsky, The Generative Lexicon)
- Qualia represent just the generative corner of lexical description; they do not provide breadth of description
26Generalized vs. Application-Specific Resources
- "A dichotomy at stake here is the one between generality of a LR [lexical resource] vs. usefulness for applications. In principle, only when we know the actual specific use we intend to do of a LR can we build the very best LR for that use, but this has proved to be too expensive and not realistic. In practice, however, there exists a large core of information that can be shared by many applicative uses, and this leads to the concept of generic LR, which is at the basis for the EAGLES initiative and of the PAROLE/SIMPLE projects, to be then enhanced and tuned with other means" ("Syntactic/semantic lexicons for the European languages: towards a standardised infrastructure", Calzolari 1999: 42)
- However, rebuilding resources for every application is not cost effective either, which brings us back to the approach we're taking in OntoSem