Title: Automatic event extraction from text on the base of linguistic and semantic annotation
1Automatic event extraction from text on the base
of linguistic and semantic annotation
- Thierry Declerck
- DFKI Language Technology Lab
2Events
- Involve entities and relations between them
- Imply a change of state
- Example: "The striker of Liverpool shot a
wonderful goal in the 87th minute."
  - 1 event (goal-shot)
  - 2 entities (person and team)
  - 1 change of state (the scoring)
3Events in textual documents
- Various types of text
  - Structured: Example and Example_2
    - Processing requires pattern matching
    techniques; very little linguistic knowledge
    is needed
  - Semi-structured: Example
    - Requires a mixture of pattern matching and
    more linguistic knowledge
  - Unstructured: Example
    - Requires a mixture of layout analysis and
    linguistic knowledge
- All types of text require a domain-specific
knowledge base (ontology) for event extraction
4Domain Knowledge
- Domain knowledge can be organised in
terminologies, thesauri, taxonomies or
ontologies. Example of a (non-formal)
multilingual ontology for the soccer domain.
- More on ontology engineering in the talk by
Borislav
5Automatic Event Extraction from Text is
- A combination of human language technology (HLT)
and semantic web technologies (ontologies)
- Can also be done on the basis of purely
statistical means (with minimal linguistic
knowledge), but we concentrate here on the
HLT-based approach
6What is Human Language Technology
7Linguistic Analysis
Language technology tools are needed to support
the upgrade of the current web to the Semantic Web
(SW) by providing an automatic analysis of the
linguistic structure of textual documents. Free
text documents undergoing linguistic analysis
become available as semi-structured documents,
from which meaningful units can be extracted
automatically (information extraction) and
organized through clustering or classification
(text mining). Here we focus on the following
linguistic analysis steps that underlie the
extraction tasks: tokenization, morphological
analysis, part-of-speech tagging, chunking,
dependency structure analysis, and semantic
tagging.
8Tokenisation
Tokenisation deals with the detection of the word
units in a text and with the detection of
sentence boundaries. Example: "The markets
acknowledge the measures taken on the 24th of
September by the CEO of XYZ Corp."
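The two detection tasks above can be sketched in a few lines of Python. This is only a minimal illustration, not the tool used in the talk: the abbreviation list and the rule "an abbreviation ends a sentence when it is followed by a capitalised word or by the end of the text" are assumptions made for the example.

```python
import re

# Hypothetical abbreviation list; a real tokeniser would use a larger lexicon.
ABBREVIATIONS = {"Corp.", "Inc.", "Ltd."}

# A word (optionally with internal hyphens/apostrophes) or a single punctuation mark.
WORD_OR_PUNCT = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")


def tokenize(text):
    """Split text into word units, keeping known abbreviations in one piece."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
        else:
            tokens.extend(WORD_OR_PUNCT.findall(chunk))
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences using a simple boundary heuristic."""
    sentences, current = [], []
    for i, tok in enumerate(tokens):
        current.append(tok)
        nxt = tokens[i + 1] if i + 1 < len(tokens) else None
        final_punct = tok in {".", "!", "?"}
        # Heuristic: an abbreviation closes a sentence only before a
        # capitalised word or at the very end of the text.
        abbrev_end = tok in ABBREVIATIONS and (nxt is None or nxt[0].isupper())
        if final_punct or abbrev_end:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


text = ("The markets acknowledge the measures taken on the 24th of "
        "September by the CEO of XYZ Corp. His works have all been sold abroad.")
sentences = split_sentences(tokenize(text))
```

The example shows why the two tasks interact: the period in "Corp." belongs to the token and, in this context, also marks the sentence boundary.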
9Morphological Analysis
Morphological analysis is concerned with the
inflectional, derivational, and compounding
processes in word formation in order to determine
properties such as stem and inflectional
information. Together with part-of-speech (PoS)
information this process delivers the
morpho-syntactic properties of a word. While
processing the German word Häusern (houses) the
following morphological information should be
analysed: PoS=N, NUM=PL, CASE=DAT, GEN=NEUT,
STEM=HAUS
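One simple way to picture the output of this step is a lookup in a full-form lexicon that returns a feature structure per reading. The sketch below assumes such a lexicon; the class name, field names and the single entry are illustrative only, and real analysers derive these features from inflection and compounding rules rather than from a hand-written table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MorphAnalysis:
    pos: str      # part of speech
    number: str   # SG or PL
    case: str     # NOM, GEN, DAT, ACC
    gender: str   # MASC, FEM, NEUT
    stem: str


# Toy full-form lexicon (hypothetical); one word form may have several readings.
FULL_FORMS = {
    "häusern": [MorphAnalysis(pos="N", number="PL", case="DAT",
                              gender="NEUT", stem="HAUS")],
}


def analyse(word):
    """Return all morphological readings of a word form, or [] if unknown."""
    return FULL_FORMS.get(word.lower(), [])


readings = analyse("Häusern")
```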
10Part-of-Speech Tagging
Part-of-Speech (PoS) tagging is the process of
determining the correct syntactic class (a
part-of-speech, e.g. noun, verb, etc.) for a
particular word given its current context. The
word "works" in the following sentences will be
either a verb or a noun: "He works [N,V] the
whole day for nothing. His works [N,V] have all
been sold abroad." PoS tagging involves
disambiguation between multiple part-of-speech
tags, as well as guessing the correct
part-of-speech tag for unknown words on the basis
of context information.
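The disambiguation of "works" can be illustrated with a toy lexicon and two hand-written context rules. This is a sketch under strong assumptions (the tag set, the lexicon and the rules are invented for the example); practical taggers learn such preferences from annotated corpora.

```python
# Toy lexicon: each word form maps to its possible tags (hypothetical tag set).
LEXICON = {
    "he": ["PRON"], "his": ["DET"], "works": ["N", "V"],
    "the": ["DET"], "whole": ["ADJ"], "day": ["N"], "for": ["PREP"],
    "nothing": ["PRON"], "have": ["V"], "all": ["PRON"],
    "been": ["V"], "sold": ["V"], "abroad": ["ADV"],
}


def tag(tokens):
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tags = []
    for tok in tokens:
        # Unknown words are guessed to be nouns (a crude default).
        candidates = LEXICON.get(tok.lower(), ["N"])
        prev = tags[-1] if tags else None
        if len(candidates) == 1:
            tags.append(candidates[0])
        elif prev == "DET" and "N" in candidates:
            tags.append("N")   # determiner/possessive + ambiguous word: noun
        elif prev == "PRON" and "V" in candidates:
            tags.append("V")   # subject pronoun + ambiguous word: verb
        else:
            tags.append(candidates[0])
    return tags


verb_reading = tag("He works the whole day for nothing".split())
noun_reading = tag("His works have all been sold abroad".split())
```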
11Chunking
Chunks are sequences of words which are grouped
on the basis of linguistic properties, such as
nominal, prepositional, adjectival and adverbial
phrases and verb groups. Example: [NP His works]
[VG have] [NP all] [VG been sold] [AdvP abroad].
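A very small chunker can be sketched as grouping adjacent tokens whose PoS tags map to the same chunk type. The tag-to-chunk table below is an assumption made for this example, and grouping by adjacency alone is far cruder than a grammar-based chunker, but it is enough to produce the chunks of the example sentence.

```python
from itertools import groupby

# Hypothetical mapping from PoS tags to chunk types.
TAG_TO_CHUNK = {"DET": "NP", "ADJ": "NP", "N": "NP", "PRON": "NP",
                "V": "VG", "ADV": "AdvP"}


def chunk(tagged):
    """Group adjacent (word, tag) pairs that map to the same chunk type."""
    chunks = []
    for label, group in groupby(tagged, key=lambda wt: TAG_TO_CHUNK.get(wt[1], "O")):
        chunks.append((label, [word for word, _ in group]))
    return chunks


def bracket(chunks):
    """Render chunks in labelled-bracket notation."""
    return " ".join(f"[{label} {' '.join(words)}]" for label, words in chunks)


tagged = [("His", "DET"), ("works", "N"), ("have", "V"), ("all", "PRON"),
          ("been", "V"), ("sold", "V"), ("abroad", "ADV")]
bracketed = bracket(chunk(tagged))
```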
12Named Entities detection
Related to chunking is the recognition of
so-called named entities (names of institutions
and companies, date expressions, etc.). The
extraction of named entities is mostly based on a
strategy that combines look up in gazetteers
(lists of companies, cities, etc.) with the
definition of regular expression patterns. Named
entity recognition can be included as part of the
linguistic chunking procedure. The following
sentence fragment, "the secretary-general of
the United Nations, Kofi Annan", will be
annotated as a nominal phrase including two
named entities: "United Nations" with named
entity class organization, and "Kofi Annan" with
named entity class person.
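The combination of gazetteer look-up and regular expression patterns might look as follows. The gazetteer entries, the class names and the date pattern are assumptions made for this sketch; it does no disambiguation and simply reports every match in text order.

```python
import re

# Toy gazetteer (hypothetical): surface form -> named entity class.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# Date expressions such as "24th of September".
DATE_RE = re.compile(r"\b\d{1,2}(?:st|nd|rd|th) of (?:" + MONTHS + r")\b")


def find_entities(text):
    """Return (surface form, class) pairs in order of appearance."""
    found = []
    for name, cls in GAZETTEER.items():
        for m in re.finditer(re.escape(name), text):
            found.append((m.start(), m.group(), cls))
    for m in DATE_RE.finditer(text):
        found.append((m.start(), m.group(), "date"))
    return [(surface, cls) for _, surface, cls in sorted(found)]


entities = find_entities("the secretary-general of the United Nations, Kofi Annan")
dates = find_entities("the measures taken on the 24th of September by the CEO")
```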
13Dependency Structure Analysis
A dependency structure consists of two or more
linguistic units that immediately dominate each
other in a syntax tree. The detection of such
structures is generally not provided by chunking
but builds on top of it. There are two main
types of dependencies that are relevant for our
purposes: on the one hand, the internal
dependency structure of phrasal units or chunks,
and on the other hand, the so-called grammatical
functions (like subject and direct object).
14Internal Dependency Structure
To describe this internal structure, linguistic
analysis uses the terms head, complement and
modifier, where the head is the dominating node
in the syntax tree of a phrase (chunk),
complements are necessary qualifiers thereof, and
modifiers are optional qualifiers. Consider the
following example: "The shot by Christian Ziege
goes over the goal." The prepositional phrase
"by Christian Ziege" (containing the named entity
Christian Ziege) depends on (and modifies) the
head noun "shot".
15Grammatical Functions
Grammatical functions determine the role
(function) of each of the linguistic chunks in
the sentence and make it possible to identify the
actors involved in certain events. For example,
in the following sentence the syntactic (and also
the semantic) subject is the NP constituent "The
shot by Christian Ziege": "The shot by Christian
Ziege goes over the goal." This nominal phrase
depends on (and complements) the verb "goes",
whereas the noun "shot" is the head of the NP (it
is the shot that goes over the goal, and not
Christian Ziege!).
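The two dependencies discussed for this sentence can be written down as head-relation-dependent triples. The class and the relation labels below are illustrative assumptions, not the output format of a particular parser; only the two relations stated in the text are encoded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dependency:
    head: str
    relation: str
    dependent: str


# Dependencies for "The shot by Christian Ziege goes over the goal."
DEPENDENCIES = [
    # Grammatical function: the NP headed by "shot" is the subject of "goes".
    Dependency(head="goes", relation="subject", dependent="shot"),
    # Internal structure of the NP: the PP modifies the head noun.
    Dependency(head="shot", relation="modifier", dependent="by Christian Ziege"),
]


def subject_of(verb):
    """Return the head of the subject chunk of a verb, or None."""
    for dep in DEPENDENCIES:
        if dep.head == verb and dep.relation == "subject":
            return dep.dependent
    return None


def modifiers_of(head):
    """Return all modifiers attached to a head."""
    return [d.dependent for d in DEPENDENCIES
            if d.head == head and d.relation == "modifier"]


actor = subject_of("goes")
```

Looking up the subject of "goes" returns "shot" rather than the named entity inside the modifier, which is the distinction the example is meant to show.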
16Semantic Tagging
Automatic semantic annotation has developed
within language technology in recent years in
connection with more integrated tasks like
information extraction, which require a certain
level of semantic analysis. Semantic tagging
consists in the annotation of each content word
in a document with a semantic category. Semantic
categories are assigned on the basis of a
semantic resource like WordNet for English or
EuroWordNet, which links words between many
European languages through a common inter-lingua
of concepts.
17Semantic Resources
- Semantic resources are captured in dictionaries,
thesauri, and semantic networks, all of which
express, either implicitly or explicitly, an
ontology of the world in general or of more
specific domains, such as medicine.
- They can be roughly distinguished into the
following three groups:
  - Thesauri: semantic resources that group
  together similar words or terms according to a
  standard set of relations, including broader
  term, narrower term, sibling, etc. (like Roget)
  - Semantic lexicons: semantic resources that
  group together words (or more complex lexical
  items) according to lexical semantic relations
  like synonymy, hyponymy, meronymy, and antonymy
  (like WordNet)
  - Semantic networks: semantic resources that
  group together objects denoted by natural
  language expressions (terms) according to a set
  of relations that originate in the nature of the
  domain of application (like UMLS in the medical
  domain)
18The MeSH Thesaurus
MeSH (Medical Subject Headings) is a thesaurus
for indexing articles and books in the medical
domain, which may then be used for searching
MeSH-indexed databases. MeSH provides for each
term a number of term variants that refer to the
same concept. It currently includes a vocabulary
of over 250,000 terms. The following is a sample
entry for the term gene library (MH is the term
itself, ENTRY are term variants):
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
etc.
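Term variants are useful for annotation because every variant found in a text can be mapped back to the same concept. The sketch below assumes that each record line is written as "FIELD = value"; that layout, the function names and the lower-casing of variants are assumptions made for the example.

```python
SAMPLE_RECORD = """\
MH = Gene Library
ENTRY = Bank, Gene
ENTRY = Banks, Gene
ENTRY = DNA Libraries
ENTRY = Gene Bank
"""


def parse_record(text):
    """Read the main heading (MH) and its term variants (ENTRY) from a record."""
    record = {"MH": None, "ENTRY": []}
    for line in text.splitlines():
        if "=" not in line:
            continue
        field, value = (part.strip() for part in line.split("=", 1))
        if field == "MH":
            record["MH"] = value
        elif field == "ENTRY":
            record["ENTRY"].append(value)
    return record


def variant_index(record):
    """Map every lower-cased variant (and the heading itself) to the heading."""
    index = {record["MH"].lower(): record["MH"]}
    for variant in record["ENTRY"]:
        index[variant.lower()] = record["MH"]
    return index


index = variant_index(parse_record(SAMPLE_RECORD))
```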
19The WordNet Semantic Lexicon
WordNet has primarily been designed as a
computational account of the human capacity of
linguistic categorization and covers an extensive
set of semantic classes (called synsets). Synsets
are collections of synonyms, grouping together
lexical items according to meaning similarity.
Synsets are actually not made up of lexical
items, but rather of lexical meanings (i.e.
senses)
21WordNet An Example
The word 'tree' has two meanings that roughly
correspond to the classes of plants and of
diagrams, each with its own hierarchy of classes
that are included in more general super-classes:

09396070 tree 0
  09395329 woody_plant 0 ligneous_plant 0
    09378438 vascular_plant 0 tracheophyte 0
      00008864 plant 0 flora 0 plant_life 0
        00002086 life_form 0 organism 0 being 0 living_thing 0
          00001740 entity 0 something 0

10025462 tree 0 tree_diagram 0
  09987563 plane_figure 0 two-dimensional_figure 0
    09987377 figure 0
      00015185 shape 0 form 0
        00018604 attribute 0
          00013018 abstraction 0
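The two hierarchies can be stored as a small table and walked from a sense up to its most general super-class. The sketch copies the synset numbers and lemmas from the example and assumes that each listed synset is the direct super-class of the one before it; it does not use the WordNet database or any WordNet library.

```python
# synset id -> (lemmas, id of the super-class synset or None), from the example.
SYNSETS = {
    "09396070": (["tree"], "09395329"),
    "09395329": (["woody_plant", "ligneous_plant"], "09378438"),
    "09378438": (["vascular_plant", "tracheophyte"], "00008864"),
    "00008864": (["plant", "flora", "plant_life"], "00002086"),
    "00002086": (["life_form", "organism", "being", "living_thing"], "00001740"),
    "00001740": (["entity", "something"], None),
    "10025462": (["tree", "tree_diagram"], "09987563"),
    "09987563": (["plane_figure", "two-dimensional_figure"], "09987377"),
    "09987377": (["figure"], "00015185"),
    "00015185": (["shape", "form"], "00018604"),
    "00018604": (["attribute"], "00013018"),
    "00013018": (["abstraction"], None),
}


def senses(word):
    """Return the ids of all synsets that contain the word."""
    return [sid for sid, (lemmas, _) in SYNSETS.items() if word in lemmas]


def superclass_chain(synset_id):
    """Follow super-class links upwards, returning the first lemma of each synset."""
    chain = []
    while synset_id is not None:
        lemmas, parent = SYNSETS[synset_id]
        chain.append(lemmas[0])
        synset_id = parent
    return chain


tree_senses = senses("tree")
diagram_chain = superclass_chain("10025462")
```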
22What is the Semantic Web
- The Semantic Web is a new initiative to
transform the web into a structure that supports
more intelligent querying and browsing, both by
machines and by humans. This transformation is to
be supported through the generation and use of
metadata constructed via web annotation tools
using user-defined ontologies that can be related
to one another.
- Somewhere on the web
24Extracting Events from Structured Documents
- Detecting Metadata in our Example
  - Type of game: N/A
  - Teams involved: England - Deutschland
  - Players: Deutschland: Kahn (2) - Matthaeus (3) -
  Babbel (3,5), ...
  - Final (and intermediate) score: 1:0 (0:0)
  - Referee: "Schiedsrichter: Collina, Pierluigi
  (Viareggio)"
  - Date: N/A
  - Etc.
25Extracting Events from Structured Documents (2)
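- Detecting Events in our Example
  - Substitution: "Eingewechselt: 61. Gerrard fuer
  Owen, ..." (substituted in: 61st minute, Gerrard
  for Owen)
  - Goal: "Tore: 1:0 Shearer (53., Kopfball,
  Vorarbeit Beckham)" (goals: 1:0 Shearer, 53rd
  minute, header, assist by Beckham)
  - Cards: "Gelbe Karten: Beckham - Babbel, Jeremies"
  (yellow cards)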
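For structured reports of this kind, event detection can be sketched with two regular expressions. The input layout (a German label, a colon, then the event details) is reconstructed from the example and is an assumption, as are the pattern names and the field names of the resulting events.

```python
import re

# "1:0 Shearer (53., Kopfball, Vorarbeit Beckham)"
GOAL_RE = re.compile(
    r"(?P<score>\d+:\d+) (?P<scorer>\w+) "
    r"\((?P<minute>\d+)\., (?P<how>\w+), Vorarbeit (?P<assist>\w+)\)"
)
# "61. Gerrard fuer Owen"
SUBSTITUTION_RE = re.compile(
    r"(?P<minute>\d+)\. (?P<player_in>\w+) fuer (?P<player_out>\w+)"
)

# Line label in the report -> (event type, pattern for the line body).
PATTERNS = {
    "Tore": ("goal", GOAL_RE),
    "Eingewechselt": ("substitution", SUBSTITUTION_RE),
}


def extract_events(report):
    """Return one dict per event found in the labelled lines of a report."""
    events = []
    for line in report.splitlines():
        label, _, body = line.partition(":")
        if label.strip() not in PATTERNS:
            continue
        event_type, pattern = PATTERNS[label.strip()]
        for m in pattern.finditer(body):
            event = {"type": event_type, **m.groupdict()}
            event["minute"] = int(event["minute"])
            events.append(event)
    return events


REPORT = """\
Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)
Eingewechselt: 61. Gerrard fuer Owen
"""
events = extract_events(REPORT)
```

Keeping the patterns in a table keyed by the line label means that further event types (for example cards) could be added without changing the extraction loop.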
26Results in XML
- Automatically extracted events (and entities and
relations) from structured text, on the basis of
patterns (DTD) of typical expressions and the
soccer ontology. Example and Example_2
- Since various results are available in XML files,
those results can be merged automatically, guided
by the ontology. Example. This supports an
incremental and dynamic extraction.
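Merging and XML output can be sketched with the Python standard library. The element and attribute names below are hypothetical and do not follow the DTD used in the talk, and the merge simply treats two events with the same type and minute as the same event, where the real system is guided by the ontology.

```python
import xml.etree.ElementTree as ET


def merge_events(*sources):
    """Merge event lists from several documents; events with the same type
    and minute are combined into one record."""
    merged = {}
    for events in sources:
        for event in events:
            key = (event["type"], event["minute"])
            merged.setdefault(key, {}).update(event)
    return [merged[key] for key in sorted(merged)]


def to_xml(events):
    """Serialise events as XML: type and minute become attributes,
    all other fields become child elements."""
    root = ET.Element("match")
    for event in events:
        node = ET.SubElement(root, "event",
                             type=event["type"], minute=str(event["minute"]))
        for name, value in event.items():
            if name not in ("type", "minute"):
                ET.SubElement(node, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


from_report = [{"type": "goal", "minute": 53, "scorer": "Shearer"}]
from_ticker = [{"type": "goal", "minute": 53, "assist": "Beckham"},
               {"type": "substitution", "minute": 61, "player_in": "Gerrard"}]
merged = merge_events(from_report, from_ticker)
xml_text = to_xml(merged)
```

Because each new document only adds to or extends the merged records, the extraction can proceed incrementally as further sources become available.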
27Extracting Events from Semi-Structured Documents
- Linguistic processing is needed to provide a
basic structure for the document, which then
allows the domain-specific annotation. Example.
28Extracting Events from Semi-Structured Documents
(2)
- The results of the semantic annotation of the
structured documents are used as well, which
supports incremental extraction. Example.
29Current Development
- Extracting information from multilingual balance
sheets (WINS eTen project), extending this to
unstructured text and extracting relations and
events from annexes to balance sheets (upcoming
project MUSING).
- Detecting positive/negative mentions of
entities in news documents (project Direct-Info
on Media Monitoring). Example.
30Further Challenge for HLT
- Not only use HLT for the semantic annotation of
web pages (or other documents), but use HLT for
supporting ontology extraction/learning from the
web (or other documents)
31Example of semantic relation extraction in
bio-medicine
- Rheumatoid arthritis is characterized by
progressive synovial inflammation and joint
destruction.
32Open issues for HLT and SW
- To achieve better coordination for improving
semantic annotation results
- Development and use of standards for interrelated
linguistic and semantic annotation (see eContent
project LIRICS for standards for language
resources)
33Interoperable Standards?
34Thank you!