SCHEMAS Workshop 3 Introduction

Slides: 33

Provided by: mak113

Transcript and Presenter's Notes

1
Extraction of Ontological Information from Lexicon and Corpora
Dimitrios Kokkinakis, Maria Toporowska Gronostaj
2
Motto

To process information
you need information
P. Vossen, 2003

3
Content

Introduction
Background
Language resources
Methodology
Lexicon-driven extraction of ontological data
Corpus-driven extraction of ontological data
Conclusions

4
Background

What is ontological information?
information necessary for making common-sense-like inferences based on our knowledge of the world
How is it represented?
in the form of structured sets of conceptual types, often including the semantic relations underlying them
Where?
SIMPLE-ontology, EWN, LexiQuest

5
Background

Why is ontological information relevant for NLP?
promotes the development of lexical resources aimed at text understanding, since it offers a means of disambiguation
provides knowledge needed in
machine translation (MT)
information retrieval (IR)
information extraction (IE)
summarization
computer-aided language learning (CALL)
enables communication on the Semantic Web

6
Background

What is meant by semi-automatic extraction of ontological information (OI)?
some human intervention is involved in information processing to maximize its effects
What will we achieve with it?
enhance the content of the Swedish SIMPLE lexicon in a quick and cost-effective way
investigate lexicon-driven and corpus-driven methodologies

7
Methodology in general (1)

Methodological assumptions
lexical databases, MRD lexica and corpora can be
mined for ontological information
relevant factors in information processing
resource size
degree of extractability
implicitness and explicitness of information
bootstrapping

8
Methodology in general (2)

Approach: text data mining (TDM)
TDM is a process of exploratory data analysis using text that leads to the discovery of heretofore unknown information, or to answers to questions for which the answer is not currently known (Mitkov 2003, Hearst 2003)
Result: evolutionary lexicon model
output data are reused to discover new data, which leads to a successive enlargement of the lexicon

9
Language resources SIMPLE-SE (1)

Corpora
150 million words in Språkbanken
Lexicon resources
SIMPLE-SE lexicon
GLDB (Göteborg lexical database)
SEMNET

10
Language resources SIMPLE-SE (2)

About SIMPLE-SE
computational lexicon with explicit ontological information (OI)
10 000 lexical units
7 000 nouns, 2 000 verbs, 1 000 adjectives
manually annotated with semantic and ontological information, which is linked to the morphosyntactic information in the PAROLE lexicon
multidimensional

11
Language resources SIMPLE-SE (3)

SIMPLE-SE supports
word sense disambiguation
kastanji 1/1/0 FRUIT
kastanji 1/1/1 PLANT
kastanji 1/1/2sms COLOUR
kastanji 1/1/3 FOOD
kastanji 1/2/0 ORGANIC OBJECT
finding regular polysemy
creating multilingual links between lexicons

12
Language resources SIMPLE-SE (4)

SIMPLE-SE supports
text annotation
text data mining (knowledge-based information processing)
evaluation
pattern matching based on the ontological
information assigned to arguments (selection
restrictions/preferences)

13
Language resources SIMPLE-SE (5)

Selection-restriction-based pattern matching
Word/expression                 Position            Ontological term
injicera (inject)               object              Substance
bebo (inhabit)                  object              Area
griljera (roast)                object              Food
förlova sig (become engaged)    subj., prep. obj.   Human
devalvera (devaluate)           obj.                Money
ha ont i (have pain in)         prep. obj.          Body part
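
A minimal sketch of how such a table could drive pattern matching, assuming a parser that delivers (predicate, position, noun) triples. The triple format, the function names and the nouns skinka and ö are illustrative assumptions, not part of SIMPLE-SE; the restrictions themselves are copied from the table.

```python
# Selection restrictions taken from the table:
# (predicate, argument position) -> expected ontological type.
SELECTION_RESTRICTIONS = {
    ("injicera", "object"): "Substance",
    ("bebo", "object"): "Area",
    ("griljera", "object"): "Food",
    ("förlova sig", "subject"): "Human",
    ("förlova sig", "prep. object"): "Human",
    ("devalvera", "object"): "Money",
    ("ha ont i", "prep. object"): "Body part",
}


def propose_type(predicate, position):
    """Return the type the predicate expects in that position, or None."""
    return SELECTION_RESTRICTIONS.get((predicate, position))


def annotate_arguments(parsed_clauses, lexicon):
    """Propose types for argument nouns that are not yet in the lexicon.

    parsed_clauses: iterable of (predicate, position, noun) triples
                    (hypothetical parser output).
    lexicon: dict mapping nouns to already known ontological types.
    """
    proposals = {}
    for predicate, position, noun in parsed_clauses:
        expected = propose_type(predicate, position)
        if expected is not None and noun not in lexicon:
            proposals.setdefault(noun, set()).add(expected)
    return proposals


clauses = [
    ("devalvera", "object", "rubel"),
    ("griljera", "object", "skinka"),  # already annotated, so skipped
    ("bebo", "object", "ö"),
]
proposals = annotate_arguments(clauses, lexicon={"skinka": "Food"})
```

Collecting proposals as sets keeps conflicting evidence visible when one noun occurs with predicates that expect different types.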

14
Language resources GLDB

Göteborg lexical database, GLDB
67 000 core senses with stringent definition
format
implicit, but extractable genus proximum (genus
word)
implicit ontological information about arguments in definition extensions
35 000 explicit semantic references on semantic
relations like synonymy, antonymy, hyperonymy,
hyponymy and cohyponymy

15
Language resources SEMNET (1)

SEMNET: hyperonymic taxonomy
Extraction of hyperonymy relations from GLDB definitions
(methodology and software: Y. Cederholm, 1999)
Recognition of headwords (genus proximum) in definitions

16
Language resources SEMNET (2)

Input data
GLDB definitions
44 915 noun lexemes
10 082 verb lexemes
Two analysis methods that complement each other

17
Language resources SEMNET (3)

Method I
identifying typical definition patterns for core senses
(see overhead/handout from Cederholm Y. 1999, Tabell 1. Definitionsformler [Table 1. Definition formulas])
pattern matching against non-lemmatized
definitions (using regular expressions)

18
Language resources SEMNET (4)

Method II
Input: lemmatized definitions
Assumption
the genus word is the first word in the definition that matches the part of speech of the headword (the word being defined)
Method II also finds genus words that Method I cannot parse
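
The Method II assumption can be sketched in a few lines. The token format (lemma, part of speech) and the sample definition of astma are invented for illustration and are not quoted from GLDB.

```python
def find_genus_word(headword_pos, definition_tokens):
    """Return the first lemma whose POS equals the headword's POS, or None.

    definition_tokens: list of (lemma, pos) pairs for a lemmatized definition.
    """
    for lemma, pos in definition_tokens:
        if pos == headword_pos:
            return lemma
    return None


# Hypothetical lemmatized, POS-tagged definition of the noun "astma":
# "kronisk sjukdom i luftrör" (chronic disease of the airways).
definition = [
    ("kronisk", "ADJ"),
    ("sjukdom", "NOUN"),
    ("i", "PREP"),
    ("luftrör", "NOUN"),
]
genus = find_genus_word("NOUN", definition)
```

Because the rule only looks at part of speech and word order, it also returns a candidate for definitions that fit none of the Method I patterns, at the cost of some wrong picks.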

19
Language resources SEMNET (5)

Analysis results for nouns
                  Total analyses    Correct analyses
Method I          8127 (64%)        7141 (56%)
Method II         12 194 (95%)      8974 (70%)
Methods I + II    12 528 (98%)      10 536 (83%)
(evaluation based on 12 786 manually annotated noun genus words)
Approximated result for ca. 45 000 nouns in genus position:
36 500 correctly recognised noun genus words
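
As a worked check, the figures in parentheses can be read as shares of the 12 786 manually annotated genus words. That reading is an assumption; the sketch below only recomputes it from the counts given above.

```python
GOLD = 12786  # manually annotated noun genus words used for evaluation

# method -> (total analyses, correct analyses), as given in the results
results = {
    "Method I": (8127, 7141),
    "Method II": (12194, 8974),
    "Methods I + II": (12528, 10536),
}


def share(count, total=GOLD):
    """Unrounded percentage of the evaluation set."""
    return 100 * count / total


shares = {name: (share(a), share(c)) for name, (a, c) in results.items()}
for name, (analysed, correct) in shares.items():
    print(f"{name}: analysed {analysed:.1f}%, correct {correct:.1f}%")

# Scaling the combined share of correct analyses to all 44 915 noun lexemes.
projected = round(44915 * 10536 / GOLD)
```

The recomputed shares agree with the stated percentages to within one percentage point, and the projection comes out at roughly 37 000, in the same range as the approximated 36 500.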

20
Language resources SEMNET (6)

The 33 most frequent noun genus words in SEMNET
2702 person 858 typ 612 del
461 anordning 314 område 261 kvinna
228 tillstånd 219 lära 217 titel
207 grupp 183 föremål 173 sammanfattning
172 mängd 169 sätt 167 plats
166 system 165 växt 162 ämne
153 apparat 145 förmåga 133 medlem
128 språk 122 stycke 122 redskap
122 plats 119 känsla 118 form
116 metod 116 handling 113 enhet
111 ljud 110 instrument 102 verksamhet

21
Language resources SEMNET (7)

Hyperonymy taxonomy: sjukdom (disease)
-- 1 akutfall 1/1
-- 2 almsjuka 1/1
-- 3 astma 1/1
-- 4 avitaminos 1/1
-- 5 basedow 1/1
-- 6 bladrullsjuka 1/1
-- 7 blodkräfta 1/1
-- 8 blodsjukdom 1/1
-- 9 blödarsjuka 1/1
... (66 hyponyms in total)

22
Definition-driven extraction of ontological
information (1)

Resources: SIMPLE-SE, SEMNET, GLDB
Methodological assumptions
Hyperonymic taxonomy in combination with ontological information in SIMPLE-SE supports semi-automatic extraction of ontological information
Procedure
Preparatory phase, relevant for all ontological processing: annotate GLDB data with the ontological information from SIMPLE-SE to generate an ontologically enriched SEMNET

23
Definition-driven extraction of ontological
information (2)

Methodological assumptions (cont.)
The extracted ontological information is an approximation of the ontological category until verified with other methods, e.g. a corpus-driven methodology, semantic/ontological data from GLDB, or pattern matching based on selection restrictions
Since annotated words in SIMPLE cover both hyperonyms and hyponyms, two methods are proposed here, each focusing on one of these semantic categories

24
Definition-driven extraction of ontological
information (3)

Method I
from annotated hyponyms to new annotations of
hyperonyms
Assumption
One can approximate the ontological category of a hyperonym given some information on its hyponyms and using the structural knowledge inherent in the ontology
Annotation of a hyperonym can be performed if all of the annotated hyponyms share the same ontological tag, or if the tags share a common superordinate tag, except the tag Entity, which is ontologically heterogeneous and thus relatively uninformative

25
Definition-driven extraction of ontological
information (4)

Method I example
Hyponyms (known info)
diabetes   Disease        cat    Air_animal
asthma     Disease        dog    Air_animal
cholera    Disease        fisk   Water_animal
Hyperonym (new info)
disease  > Disease        djur > Animal
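
A minimal sketch of the Method I rule: take the most specific tag shared by all annotated hyponyms and reject it if it is Entity. The PARENT mapping is a simplified, hypothetical fragment of the type hierarchy, used only to reproduce the example.

```python
# Hypothetical, simplified fragment of the ontology: child tag -> parent tag.
PARENT = {
    "Disease": "Entity",
    "Air_animal": "Animal",
    "Water_animal": "Animal",
    "Animal": "Entity",
}


def ancestors(tag):
    """Return [tag, parent, grandparent, ...] up to the root."""
    chain = [tag]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def approximate_hyperonym_tag(hyponym_tags):
    """Most specific tag shared by all hyponym tags.

    Returns None when there are no tags or when only Entity is shared.
    """
    tags = list(hyponym_tags)
    if not tags:
        return None
    # ancestors() is ordered from most to least specific, so the first
    # candidate shared by every hyponym is the lowest common tag.
    for candidate in ancestors(tags[0]):
        if all(candidate in ancestors(t) for t in tags[1:]):
            return None if candidate == "Entity" else candidate
    return None


disease_tag = approximate_hyperonym_tag(["Disease", "Disease", "Disease"])
djur_tag = approximate_hyperonym_tag(["Air_animal", "Air_animal", "Water_animal"])
mixed_tag = approximate_hyperonym_tag(["Disease", "Water_animal"])
```

Here disease_tag is Disease and djur_tag is Animal, while the mixed case yields no annotation because the only shared tag is Entity.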

26
Definition-driven extraction of ontological
information (5)

Method II
from annotated hyperonyms to new annotations of
hyponyms
Assumption (resulting in approximation)
Direct hyponyms (hyponyms which are directly subordinated to the genus word/hyperonym) automatically inherit the ontological category of their hyperonyms, and therefore manual annotation of the most frequent genus words/hyperonyms can be recommended and justified.
Hyperonym (known info)        Hyponyms (new info)
myntenhet  Money          >   dollar, krona, pund, rubel...  Money
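
The inheritance step of Method II could look roughly like this. The dictionary layout for the taxonomy and the annotations is an assumption made for the sketch; the words come from the example.

```python
def inherit_from_hyperonym(taxonomy, annotations):
    """Copy each annotated hyperonym's tag to its unannotated direct hyponyms.

    taxonomy: dict mapping a hyperonym to its list of direct hyponyms.
    annotations: dict mapping words to known ontological tags.
    Returns only the newly proposed annotations (approximations).
    """
    proposed = {}
    for hyperonym, hyponyms in taxonomy.items():
        tag = annotations.get(hyperonym)
        if tag is None:
            continue  # nothing to inherit from an unannotated genus word
        for word in hyponyms:
            if word not in annotations:  # never overwrite known information
                proposed[word] = tag
    return proposed


taxonomy = {"myntenhet": ["dollar", "krona", "pund", "rubel"]}
new_annotations = inherit_from_hyperonym(taxonomy, {"myntenhet": "Money"})
```

One manual annotation of myntenhet yields four proposed annotations, which is why annotating the most frequent genus words first pays off.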

27
Definition-driven extraction of ontological
information (6)

The assumption has far-reaching consequences for all those annotated hyponymic words which also occur as genus words, since their subordinates can automatically inherit the ontological class from the hyperonym/genus word.
Cascade effect
sjukdom (disease) 66 hyponyms
infektionssjukdom 25 hyponyms
könssjukdom 4 hyponyms
28
Definition-driven extraction of ontological
information (7)

Cascade distribution of the ontological type
Animal
Djur 102 hyponyms
hovdjur 10
ryggradsdjur 8
fågel 98
däggdjur 18
Note: the 80 most frequent genus words, when ontologically annotated, give rise to 11 000 automatically annotated genus words at the first hyponymy level. This number increases further due to the cascade effect.
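
The cascade effect amounts to propagating a tag level by level down the taxonomy. The sketch below uses a breadth-first traversal over a toy taxonomy; the nesting of the three disease words and the truncated hyponym lists are assumptions for illustration.

```python
from collections import deque


def cascade(taxonomy, root, tag):
    """Propagate tag from root through all hyponymy levels (breadth-first).

    Returns a dict mapping each reached word to (tag, level), where
    level 1 means a direct hyponym of root.
    """
    result = {}
    queue = deque([(root, 0)])
    while queue:
        word, level = queue.popleft()
        for hyponym in taxonomy.get(word, []):
            # Skip words already reached, which also guards against cycles.
            if hyponym not in result and hyponym != root:
                result[hyponym] = (tag, level + 1)
                queue.append((hyponym, level + 1))
    return result


# Toy taxonomy; real hyponym lists are much longer (66, 25 and 4 entries).
taxonomy = {
    "sjukdom": ["astma", "infektionssjukdom"],
    "infektionssjukdom": ["könssjukdom"],
}
annotated = cascade(taxonomy, "sjukdom", "Disease")
first_level = [w for w, (_, level) in annotated.items() if level == 1]
```

Recording the level with each inherited tag makes it possible to treat deeper, more indirect annotations with more caution.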

29
Definition-driven extraction of ontological
information (8)

Freq.   Genus word        Gloss      Ontological type
2702    person 1/1        person     HUMAN
461     anordning 1/1     device     ARTIFACT
314     område 1/1        area       AREA (> LOCATION)
261     kvinna 1/1        woman      HUMAN
238     tillstånd 1/      state      STATE
219     lära 1/1          doctrine   DOMAIN
217     titel 1/1         title      SOCIAL_STATUS (> HUMAN)
183     föremål 1/        thing      CONCRETE_ENTITY
169     sätt 1/1          manner     CONSTITUTIVE
167     plats 1/1 or 4    place      LOCATION
166     system 1/1        system     CONSTITUTIVE
165     växt 1/1          plant      PLANT

30
Conclusion

Ontological annotations are approximations. They need to be verified against manually annotated data and/or by means of a corpus-driven methodology for extracting ontological information
The status of ontological annotations needs to be explicitly specified in the database
Method I (from hyponyms to hyperonyms) seems to complement Method II (from hyperonyms to hyponyms), since the range of annotated categories increases rapidly
The quality (and quantity) of the lexical resources used determines the precision of the acquired ontological results
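
One way to make the status of an annotation explicit is to store it as a field next to the tag. The record layout and the three status values below are assumptions for this sketch, not the actual SIMPLE-SE database schema.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    MANUAL = "manual"              # annotated by hand
    APPROXIMATED = "approximated"  # inherited or inferred, not yet checked
    VERIFIED = "verified"          # approximated, then confirmed independently


@dataclass
class Annotation:
    word: str
    tag: str
    status: Status
    source: str = ""  # which method produced the annotation


def verify(annotation, independent_tag):
    """Promote an approximated annotation when an independent method agrees."""
    if annotation.status is Status.APPROXIMATED and annotation.tag == independent_tag:
        return Annotation(annotation.word, annotation.tag, Status.VERIFIED, annotation.source)
    return annotation  # unchanged if already final or if the methods disagree


inherited = Annotation("dollar", "Money", Status.APPROXIMATED, "Method II")
confirmed = verify(inherited, "Money")       # e.g. corpus evidence agrees
unconfirmed = verify(inherited, "Artifact")  # disagreement keeps the status
```

Keeping the status separate from the tag lets approximated annotations be used right away while staying distinguishable from manual and verified ones.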

31
Conclusion (cont.)

To prevent overgeneration of incorrect ontological annotations, special attention needs to be paid to
disambiguation of polysemous and homographic genus words (hyperonyms)
krona: Artifact, Money, Part
analysis of compound nouns
gosedjur: Artifact vs. husdjur: Animal
32
References

Cederholm, Y. 1999. Automatisk konstruktion av en hyperonymitaxonomi baserad på definitioner i GLDB. In Från dataskärm och forskarpärm. MISS 25. Göteborgs universitet.
Hearst, M. 2003. Text Data Mining. In R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics. Oxford.
Mitkov, R. 2003. The Oxford Handbook of Computational Linguistics. Oxford: Oxford University Press.
Vossen, P. 2003. Ontologies. In R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics. Oxford.
About SIMPLE, see http://spraakbanken.gu.se
