Automatic event extraction from text on the base of linguistic and semantic annotation

1
Automatic event extraction from text on the base
of linguistic and semantic annotation

Thierry Declerck
DFKI Language Technology Lab

2
Events

Involve entities and relations between them
Imply a change of state
Example: The striker of Liverpool scored a
wonderful goal in the 87th minute.
1 event (goal-shot)
2 entities (person and team)
1 change of state (the score)

3
Events in textual documents

Various types of text
Structured: Example and Example_2
Processing requires pattern matching techniques.
Very little linguistic knowledge is needed
Semi-structured: Example
Requires a mixture of pattern matching and more
linguistic knowledge
Unstructured: Example
Requires a mixture of layout analysis and
linguistic knowledge
All types of text require a domain-specific
knowledge base (ontology) for event extraction

4
Domain Knowledge

Domain knowledge can be organised in
terminologies, thesauri, taxonomies or
ontologies. Example of a (non-formal)
multilingual ontology for the soccer domain.
More on ontology engineering in the talk by
Borislav

5
Automatic Event Extraction from Text is

A combination of human language technology (HLT)
and semantic web technologies (ontologies)
Can also be done on the basis of purely
statistical means (with minimal linguistic
knowledge), but we concentrate here on the
HLT-based approach

6
What is Human Language Technology?
7
Linguistic Analysis
Language technology tools are needed to support
the upgrade of the current web to the Semantic Web
(SW) by providing an automatic analysis of the
linguistic structure of textual documents. Free
text documents undergoing linguistic analysis
become available as semi-structured documents,
from which meaningful units can be extracted
automatically (information extraction) and
organized through clustering or classification
(text mining). Here we focus on the following
linguistic analysis steps that underlie the
extraction tasks: tokenization, morphological
analysis, part-of-speech tagging, chunking,
dependency structure analysis, semantic tagging.
8
Tokenisation
Tokenisation deals with the detection of the word
units in a text and with the detection of
sentence boundaries. Example: The markets
acknowledge the measures taken on the 24th of
September by the CEO of XYZ Corp.
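A minimal sketch of this step, not the tokeniser used in the talk. The abbreviation list and both function names are assumptions made for illustration. It shows why the example sentence is tricky: the final period of "Corp." belongs to the abbreviation and also ends the sentence.

```python
import re

# Hypothetical abbreviation list; a real tokeniser needs a far larger one.
ABBREVIATIONS = {"Corp.", "Inc.", "Dr.", "Mr."}

def tokenize(text):
    """Split text into word and punctuation tokens."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # keep the period attached to the abbreviation
        else:
            # words (with internal hyphens or apostrophes) or single punctuation marks
            tokens.extend(re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", chunk))
    return tokens

def split_sentences(tokens):
    """Start a new sentence after each '.', '!' or '?' token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

tokens = tokenize("The markets acknowledge the measures taken on the 24th "
                  "of September by the CEO of XYZ Corp.")
print(tokens[-3:])  # ['of', 'XYZ', 'Corp.']
```

Because "Corp." is kept as one token, this sketch never splits a sentence after an abbreviation. Deciding when an abbreviation period also closes a sentence needs more context than is used here.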
9
Morphological Analysis
Morphological analysis is concerned with the
inflectional, derivational, and compounding
processes in word formation in order to determine
properties such as stem and inflectional
information. Together with part-of-speech (PoS)
information this process delivers the
morpho-syntactic properties of a word. When
processing the German word Häusern (houses), the
following morphological information should be
produced: PoS=N NUM=PL CASE=DAT GEN=NEUT
STEM=HAUS
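The result of the analysis can be pictured as a small feature record. The sketch below assumes a toy full-form lexicon (the names MorphAnalysis, LEXICON and analyse are hypothetical). A real analyser would derive these features from inflection rules rather than list every word form.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MorphAnalysis:
    pos: str      # part of speech
    number: str
    case: str
    gender: str
    stem: str

# Toy full-form lexicon holding only the example from the slide (assumption).
LEXICON = {
    "häusern": MorphAnalysis(pos="N", number="PL", case="DAT",
                             gender="NEUT", stem="HAUS"),
}

def analyse(word: str) -> Optional[MorphAnalysis]:
    """Look up a word form; returns None for unknown forms."""
    return LEXICON.get(word.lower())

print(analyse("Häusern"))
```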
10
Part-of-Speech Tagging
Part-of-Speech (PoS) tagging is the process of
determining the correct syntactic class (a
part-of-speech, e.g. noun, verb, etc.) for a
particular word given its current context. The
word 'works' in the following sentences is
either a verb or a noun: He works [N,V] the
whole day for nothing. His works [N,V] have all
been sold abroad. PoS tagging involves
disambiguation between multiple part-of-speech
tags, as well as guessing the correct
part-of-speech tag for unknown words on the basis
of context information.
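A minimal sketch of context-based disambiguation, assuming a toy lexicon and two hand-written rules (after a determiner prefer a noun, after a pronoun prefer a verb). The tag names and rules are illustrative assumptions. Practical taggers learn such preferences from annotated corpora.

```python
# Toy lexicon (an assumption for illustration): word -> possible tags.
LEXICON = {
    "he": ["PRON"], "his": ["DET"], "the": ["DET"], "works": ["N", "V"],
    "whole": ["ADJ"], "day": ["N"], "have": ["V"], "all": ["PRON"],
    "been": ["V"], "sold": ["V"], "abroad": ["ADV"],
}

def tag(tokens):
    """Assign one tag per token, using the previous tag as context."""
    tags = []
    for token in tokens:
        candidates = LEXICON.get(token.lower(), ["N"])  # unknown words: guess noun
        previous = tags[-1] if tags else None
        if len(candidates) > 1 and previous == "DET" and "N" in candidates:
            choice = "N"
        elif len(candidates) > 1 and previous == "PRON" and "V" in candidates:
            choice = "V"
        else:
            choice = candidates[0]
        tags.append(choice)
    return list(zip(tokens, tags))

print(tag("He works the whole day".split())[1])                 # ('works', 'V')
print(tag("His works have all been sold abroad".split())[1])    # ('works', 'N')
```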
11
Chunking
Chunks are sequences of words which are grouped
on the basis of linguistic properties, such as
nominal, prepositional, adjectival and adverbial
phrases and verb groups. [NP His works] [VG
have] [NP all] [VG been sold] [AdvP abroad].
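The bracketing above can be reproduced with a very small sketch that maps each PoS tag to a chunk type and groups neighbouring tokens of the same type. The tag set and the mapping are assumptions. This grouping would wrongly merge two adjacent noun phrases, so real chunkers use grammar rules or trained models instead.

```python
from itertools import groupby

# Hypothetical mapping from PoS tag to chunk type.
CHUNK_OF_TAG = {"DET": "NP", "ADJ": "NP", "N": "NP", "PRON": "NP",
                "V": "VG", "ADV": "AdvP"}

def chunk(tagged):
    """Group consecutive (word, tag) pairs that map to the same chunk type."""
    chunks = []
    for label, group in groupby(tagged, key=lambda wt: CHUNK_OF_TAG.get(wt[1], "O")):
        chunks.append((label, [word for word, _ in group]))
    return chunks

tagged = [("His", "DET"), ("works", "N"), ("have", "V"), ("all", "PRON"),
          ("been", "V"), ("sold", "V"), ("abroad", "ADV")]
for label, words in chunk(tagged):
    print(f"[{label} {' '.join(words)}]")
```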
12
Named Entity Detection
Related to chunking is the recognition of
so-called named entities (names of institutions
and companies, date expressions, etc.). The
extraction of named entities is mostly based on a
strategy that combines look-up in gazetteers
(lists of companies, cities, etc.) with the
definition of regular expression patterns. Named
entity recognition can be included as part of the
linguistic chunking procedure. The following
sentence fragment: the secretary-general of
the United Nations, Kofi Annan, will be
annotated as a nominal phrase including two
named entities: United Nations, with named entity
class organization, and Kofi Annan, with named
entity class person.
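A minimal sketch of the combined strategy: a gazetteer look-up for names plus one regular expression for date expressions. The gazetteer entries, the date pattern and the function name are assumptions for illustration only.

```python
import re

# Hypothetical gazetteer: surface form -> named entity class.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Matches expressions such as "24th of September".
DATE_RE = re.compile(rf"\b\d{{1,2}}(?:st|nd|rd|th)? of (?:{MONTHS})\b")

def find_entities(text):
    """Return (entity text, class) pairs in order of appearance."""
    found = []
    for name, entity_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), entity_class))
    for match in DATE_RE.finditer(text):
        found.append((match.start(), match.group(), "date"))
    return [(surface, entity_class) for _, surface, entity_class in sorted(found)]

print(find_entities("the secretary-general of the United Nations, Kofi Annan,"))
```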
13
Dependency Structure Analysis
A dependency structure consists of two or more
linguistic units that immediately dominate each
other in a syntax tree. The detection of such
structures is generally not provided by chunking
but builds on top of it. There are two
main types of dependencies that are relevant for
our purposes: on the one hand, the internal
dependency structure of phrasal units or chunks,
and on the other hand, the so-called grammatical
functions (like subject and direct object).
14
Internal Dependency Structure
To describe the internal dependency structure,
linguistic analysis uses the terms head,
complements and modifiers, where the
head is the dominating node in the syntax tree of
a phrase (chunk), complements are necessary
qualifiers thereof, and modifiers are optional
qualifiers. Consider the following example: The
shot by Christian Ziege goes over the goal. The
prepositional phrase by Christian Ziege
(containing the named entity Christian Ziege)
depends on (and modifies) the head noun shot.
15
Grammatical Functions
Grammatical functions determine the role
(function) of each of the linguistic chunks in
the sentence and make it possible to identify the
actors involved in certain events. For example,
in the following sentence the syntactic (and also
the semantic) subject is the NP constituent The
shot by Christian Ziege: The shot by Christian
Ziege goes over the goal. This nominal phrase
depends on (and complements) the verb goes,
whereas the noun shot is the head of the NP (it is
the shot that goes over the goal, not Christian Ziege!)
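The two kinds of dependency can be sketched as a small data structure: each chunk records its head and its modifiers, and a grammatical function is attached afterwards. The Chunk class and the subject heuristic (the last NP before the first verb group) are assumptions that only fit simple declarative sentences like the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Chunk:
    label: str                       # e.g. "NP", "PP", "VG"
    head: str                        # dominating word of the chunk
    words: List[str]
    modifiers: List["Chunk"] = field(default_factory=list)
    function: Optional[str] = None   # e.g. "SUBJ"

def mark_subject(chunks):
    """Heuristic: the last NP before the first verb group is the subject."""
    subject = None
    for chunk in chunks:
        if chunk.label == "VG":
            break
        if chunk.label == "NP":
            subject = chunk
    if subject is not None:
        subject.function = "SUBJ"
    return subject

# "The shot by Christian Ziege goes over the goal."
by_ziege = Chunk("PP", "by", ["by", "Christian", "Ziege"])
sentence = [
    Chunk("NP", "shot", ["The", "shot"], modifiers=[by_ziege]),
    Chunk("VG", "goes", ["goes"]),
    Chunk("PP", "over", ["over", "the", "goal"]),
]
subject = mark_subject(sentence)
print(subject.head)  # shot
```

Reading the head of the subject chunk, rather than the named entity inside its modifier, is what identifies the shot (and not Christian Ziege) as the thing going over the goal.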
16
Semantic Tagging
Automatic semantic annotation has developed
within language technology in recent years in
connection with more integrated tasks like
information extraction, which require a certain
level of semantic analysis. Semantic tagging
consists in the annotation of each content word
in a document with a semantic category. Semantic
categories are assigned on the basis of
semantic resources like WordNet for English or
EuroWordNet, which links words across many
European languages through a common interlingua
of concepts.
17
Semantic Resources

Semantic resources are captured in dictionaries,
thesauri, and semantic networks, all of which
express, either implicitly or explicitly, an
ontology of the world in general or of more
specific domains, such as medicine.
They can be roughly distinguished into the
following three groups:
Thesauri: Semantic resources that group
together similar words or terms according to a
standard set of relations, including broader
term, narrower term, sibling, etc. (like Roget)
Semantic Lexicons: Semantic resources that
group together words (or more complex lexical
items) according to lexical semantic relations
like synonymy, hyponymy, meronymy, and antonymy
(like WordNet)
Semantic Networks: Semantic resources that
group together objects denoted by natural
language expressions (terms) according to a set
of relations that originate in the nature of the
domain of application (like UMLS in the medical
domain)

18
The MeSH Thesaurus
MeSH (Medical Subject Headings) is a thesaurus
for indexing articles and books in the medical
domain, which may then be used for searching
MeSH-indexed databases. MeSH provides for each
term a number of term variants that refer to the
same concept. It currently includes a vocabulary
of over 250,000 terms. The following is a sample
entry for the term gene library (MH is the term
itself, ENTRY are term variants):
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
etc.
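A minimal sketch of how such an entry can be used for indexing, assuming each record line has the form "FIELD = value". Both function names are hypothetical. Every term variant is mapped back to its main heading, so a search for any variant reaches the same concept.

```python
def parse_mesh_entry(lines):
    """Read the heading (MH) and term variants (ENTRY) from one record."""
    heading, variants = None, []
    for line in lines:
        field, separator, value = line.partition("=")
        if not separator:
            continue  # skip lines without a "FIELD = value" shape
        field, value = field.strip(), value.strip()
        if field == "MH":
            heading = value
        elif field == "ENTRY":
            variants.append(value)
    return heading, variants

def build_variant_index(entries):
    """Map each lower-cased heading or variant to its main heading."""
    index = {}
    for heading, variants in entries:
        for term in [heading, *variants]:
            index[term.lower()] = heading
    return index

record = [
    "MH = Gene Library",
    "ENTRY = Bank, Gene",
    "ENTRY = Banks, Gene",
    "ENTRY = DNA Libraries",
    "ENTRY = Gene Bank",
]
index = build_variant_index([parse_mesh_entry(record)])
print(index["gene bank"])  # Gene Library
```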
19
The WordNet Semantic Lexicon
WordNet has primarily been designed as a
computational account of the human capacity of
linguistic categorization and covers an extensive
set of semantic classes (called synsets). Synsets
are collections of synonyms, grouping together
lexical items according to meaning similarity.
Synsets are actually not made up of lexical
items, but rather of lexical meanings (i.e.
senses)
21
WordNet: An Example
The word 'tree' has two meanings that roughly
correspond to the classes of plants and that of
diagrams, each with their own hierarchy of
classes that are included in more general
super-classes:
09396070 tree 0
  09395329 woody_plant 0 ligneous_plant 0
    09378438 vascular_plant 0 tracheophyte 0
      00008864 plant 0 flora 0 plant_life 0
        00002086 life_form 0 organism 0 being 0 living_thing 0
          00001740 entity 0 something 0
10025462 tree 0 tree_diagram 0
  09987563 plane_figure 0 two-dimensional_figure 0
    09987377 figure 0
      00015185 shape 0 form 0
        00018604 attribute 0
          00013018 abstraction 0
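The two hierarchies can be held in a small table and walked upwards to list the super-classes of each sense. The sketch below copies the synset numbers and words from the example and leaves out the "0" that follows each word. The table layout and function names are assumptions, and each synset is assumed to have a single parent.

```python
# synset number -> (words in the synset, number of the parent synset or None)
SYNSETS = {
    "09396070": (["tree"], "09395329"),
    "09395329": (["woody_plant", "ligneous_plant"], "09378438"),
    "09378438": (["vascular_plant", "tracheophyte"], "00008864"),
    "00008864": (["plant", "flora", "plant_life"], "00002086"),
    "00002086": (["life_form", "organism", "being", "living_thing"], "00001740"),
    "00001740": (["entity", "something"], None),
    "10025462": (["tree", "tree_diagram"], "09987563"),
    "09987563": (["plane_figure", "two-dimensional_figure"], "09987377"),
    "09987377": (["figure"], "00015185"),
    "00015185": (["shape", "form"], "00018604"),
    "00018604": (["attribute"], "00013018"),
    "00013018": (["abstraction"], None),
}

def senses(word):
    """Return the numbers of all synsets that contain the word."""
    return [number for number, (words, _) in SYNSETS.items() if word in words]

def hypernym_path(number):
    """Follow parent links upwards, returning the first word of each synset."""
    path = []
    while number is not None:
        words, parent = SYNSETS[number]
        path.append(words[0])
        number = parent
    return path

for sense in senses("tree"):
    print(" > ".join(hypernym_path(sense)))
```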
22
What is the Semantic Web?

The Semantic Web is a new initiative to
transform the web into a structure that supports
more intelligent querying and browsing, both by
machines and by humans. This transformation is to
be supported through the generation and use of
metadata constructed via web annotation tools
using user-defined ontologies that can be related
to one another.
Somewhere on the web

24
Extracting Events from Structured Documents

Detecting Metadata in our Example
Type of game: N/A
Teams involved: England - Deutschland
Players: Deutschland: Kahn (2) - Matthaeus (3) -
Babbel (3,5),
Final (and intermediate) score: 1:0 (0:0)
Referee: Schiedsrichter: Collina, Pierluigi
(Viareggio)
Date: N/A
Etc.

25
Extracting Events from Structured Documents (2)

Detecting Events in our Example
Substitution: Eingewechselt: 61. Gerrard fuer
Owen,
Goal: Tore: 1:0 Shearer (53., Kopfball,
Vorarbeit Beckham)
Cards: Gelbe Karten: Beckham - Babbel, Jeremies

26
Results in XML

Automatically extracted events (and entities and
relations) from structured text, on the basis of
patterns (DTD) of typical expressions and the
soccer ontology. Example and Example_2
Since various results are available in XML files,
those results can be merged automatically, guided
by the ontology. Example. This supports
incremental and dynamic extraction.
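A minimal sketch of merging and serialising extracted events, under stated assumptions: the element and attribute names are hypothetical and do not follow the DTD mentioned above, and events are matched on type and minute as a simple stand-in for ontology-guided merging.

```python
import xml.etree.ElementTree as ET

def merge_events(*sources):
    """Combine event lists; events with the same type and minute are merged."""
    merged = {}
    for events in sources:
        for event in events:
            key = (event["minute"], event["type"])
            target = merged.setdefault(
                key, {"type": event["type"], "minute": event["minute"],
                      "participants": {}})
            target["participants"].update(event["participants"])
    return [merged[key] for key in sorted(merged)]

def event_to_xml(event):
    """Serialise one event record as an XML element."""
    element = ET.Element("event", type=event["type"], minute=str(event["minute"]))
    for role, name in event["participants"].items():
        ET.SubElement(element, "participant", role=role).text = name
    return element

# Two partial descriptions of the same goal, e.g. from two documents.
first = [{"type": "goal", "minute": 53, "participants": {"scorer": "Shearer"}}]
second = [{"type": "goal", "minute": 53, "participants": {"assist": "Beckham"}}]

merged = merge_events(first, second)
print(ET.tostring(event_to_xml(merged[0]), encoding="unicode"))
```

Each new document adds to the same record instead of creating a second one, which is the incremental behaviour described above.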

27
Extracting Events from Semi-Structured Documents

Linguistic processing is needed to provide a
basic structure for the document, which allows
domain-specific annotation. Example.

28
Extracting Events from Semi-Structured Documents
(2)

The results from the semantic annotation of the
structured documents are used as well,
supporting incremental extraction. Example.

29
Current Development

Extracting information from multilingual balance
sheets (WINS eTen project), extending this to
unstructured text and extracting relations and
events from annexes to balance sheets (upcoming
project MUSING).
Detecting positive/negative mentions of
entities in news documents (project Direct-Info
on Media Monitoring). Example.

30
Further Challenge for HLT

Use HLT not only for the semantic annotation of
web pages (or other documents), but also for
supporting ontology extraction/learning from the
web (or other documents)

31
Example of semantic relation extraction in
bio-medicine

Rheumatoid arthritis is characterized by
progressive synovial inflammation
and joint destruction.
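A minimal sketch of pattern-based relation extraction for this sentence, assuming a single hand-written pattern for "X is characterized by Y and Z". The relation name and the function are hypothetical. A linguistically informed system would use the chunk and dependency analysis described earlier instead of a flat pattern.

```python
import re

PATTERN = re.compile(
    r"^(?P<subject>.+?) is characterized by (?P<objects>.+?)\s*\.?$"
)

def extract_relations(sentence):
    """Return (subject, relation, object) triples for one sentence."""
    match = PATTERN.match(" ".join(sentence.split()))  # normalise whitespace
    if not match:
        return []
    objects = re.split(r"\s+and\s+|\s*,\s*", match.group("objects"))
    return [(match.group("subject"), "characterized_by", obj)
            for obj in objects if obj]

sentence = ("Rheumatoid arthritis is characterized by "
            "progressive synovial inflammation and joint destruction .")
for triple in extract_relations(sentence):
    print(triple)
```

Splitting on "and" treats "progressive" as belonging to the first object only. Whether it also applies to "joint destruction" is a scope question that this pattern cannot settle.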

32
Open issues for HLT and SW

Achieving better coordination to improve
semantic annotation results
Development and use of standards for interrelated
linguistic and semantic annotation (see eContent
Project LIRICS for standards for language
resources)

33
Interoperable Standards?
34
Thank you!
