Automatic event extraction from text on the base of linguistic and semantic annotation





1
Automatic event extraction from text on the base
of linguistic and semantic annotation
  • Thierry Declerck
  • DFKI Language Technology Lab

2
Events
  • Involve entities and relations between them
  • Imply a change of state
  • Example: The striker of Liverpool shot a
    wonderful goal in the 87th minute.
  • 1 event (goal-shot)
  • 2 entities (person and team)
  • 1 change of state (the scoring)
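The example above can be written down as a small record. This is only a minimal sketch in Python; the class and field names (Entity, Event, state_change, minute) are illustrative assumptions and do not come from the presentation or its ontology.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    kind: str  # e.g. "person" or "team"
    text: str  # surface string taken from the sentence


@dataclass
class Event:
    kind: str
    entities: List[Entity] = field(default_factory=list)
    state_change: str = ""
    minute: Optional[int] = None


# Hand-built record for the example sentence (not produced by any extractor).
goal = Event(
    kind="goal-shot",
    entities=[
        Entity("person", "The striker of Liverpool"),
        Entity("team", "Liverpool"),
    ],
    state_change="scoring",
    minute=87,
)
```

An extraction system would fill such records automatically; here the values are simply copied by hand from the slide.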

3
Events in textual documents
  • Various types of text
  • Structured: Example and Example_2
  • Processing requires pattern-matching techniques;
    very little linguistic knowledge is needed
  • Semi-structured: Example
  • Requires a mixture of pattern matching and more
    linguistic knowledge
  • Unstructured: Example
  • Requires a mixture of layout analysis and
    linguistic knowledge
  • All types of text require a domain-specific
    knowledge base (ontology) for event extraction

4
Domain Knowledge
  • Domain knowledge can be organised in
    terminologies, thesauri, taxonomies or
    ontologies. Example of a (non-formal)
    multilingual ontology for the soccer domain.
  • More on ontology engineering in the talk by
    Borislav

5
Automatic Event Extraction from Text is
  • A combination of human language technology (HLT)
    and semantic web technologies (ontologies)
  • Can also be done on the basis of purely
    statistical means (with minimal linguistic
    knowledge), but we concentrate here on the
    HLT-based approach

6
What is Human Language Technology
7
Linguistic Analysis
Language technology tools are needed to support
the upgrade of the current web to the Semantic Web
(SW) by providing an automatic analysis of the
linguistic structure of textual documents. Free
text documents undergoing linguistic analysis
become available as semi-structured documents,
from which meaningful units can be extracted
automatically (information extraction) and
organized through clustering or classification
(text mining). Here we focus on the following
linguistic analysis steps that underlie the
extraction tasks: tokenization, morphological
analysis, part-of-speech tagging, chunking,
dependency structure analysis, and semantic tagging.
8
Tokenisation
Tokenisation deals with the detection of the word
units in a text and with the detection of
sentence boundaries. Example: The markets
acknowledge the measures taken on the 24th of
September by the CEO of XYZ Corp.
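A rough sketch of both steps in Python follows. The abbreviation list and the function names are assumptions made for illustration; a real tokeniser needs far more rules. The example sentence shows why: the final period belongs to the abbreviation "Corp." and also ends the sentence.

```python
import re

# Tiny illustrative abbreviation list; a real system would use a large one.
ABBREVIATIONS = {"Corp.", "Inc.", "Dr."}


def tokenize(text):
    """Split text into word tokens and punctuation tokens."""
    tokens = []
    for chunk in text.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)  # keep the period attached to the abbreviation
        else:
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


def split_sentences(tokens):
    """Group tokens into sentences at '.', '!' and '?' tokens."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in {".", "!", "?"}:
            sentences.append(current)
            current = []
    if current:  # text that ends without a separate end-of-sentence token
        sentences.append(current)
    return sentences


example = ("The markets acknowledge the measures taken on the "
           "24th of September by the CEO of XYZ Corp.")
tokens = tokenize(example)
sentences = split_sentences(tokens)
```

With this design the abbreviation period is never treated as a sentence boundary, which is correct inside a sentence but misses the boundary when the abbreviation happens to be sentence-final.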
9
Morphological Analysis
Morphological analysis is concerned with the
inflectional, derivational, and compounding
processes in word formation in order to determine
properties such as stem and inflectional
information. Together with part-of-speech (PoS)
information this process delivers the
morpho-syntactic properties of a word. For the
German word Häusern (houses), the following
morphological information should be produced:
PoS=N NUM=PL CASE=DAT GEN=NEUT STEM=HAUS
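The result of such an analysis can be pictured as a set of feature-value pairs. The sketch below uses a hand-written lookup table with a single entry; this table and the function name are assumptions for illustration only, whereas real morphological analysers derive the features from inflection, derivation and compounding rules.

```python
# Hypothetical one-entry full-form lexicon keyed by lower-cased word form.
FULL_FORM_LEXICON = {
    "häusern": {
        "PoS": "N",
        "NUM": "PL",
        "CASE": "DAT",
        "GEN": "NEUT",
        "STEM": "HAUS",
    },
}


def analyse(word):
    """Return the feature dictionary for a word form, or None if unknown."""
    return FULL_FORM_LEXICON.get(word.lower())


features = analyse("Häusern")
```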
10
Part-of-Speech Tagging
Part-of-Speech (PoS) tagging is the process of
determining the correct syntactic class (a
part-of-speech, e.g. noun, verb, etc.) for a
particular word given its current context. The
word 'works' in the following sentences is
either a verb or a noun: He works [N,V] the
whole day for nothing. His works [N,V] have all
been sold abroad. PoS tagging involves
disambiguating between multiple candidate
part-of-speech tags, as well as guessing the
correct part-of-speech tag for unknown words on
the basis of context information.
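The disambiguation of 'works' can be sketched with two hand-written context rules. The lexicon, the tag names and the rules below are assumptions chosen to fit these two sentences only; practical taggers learn such preferences from annotated corpora.

```python
# Toy lexicon: each word form maps to its candidate tags (illustrative only).
LEXICON = {
    "he": ["PRON"],
    "his": ["DET"],
    "works": ["N", "V"],
}


def tag(tokens):
    """Assign one tag per token, using the previous tag to resolve ambiguity."""
    tags = []
    for token in tokens:
        candidates = LEXICON.get(token.lower(), ["UNK"])
        previous = tags[-1] if tags else None
        if len(candidates) == 1:
            choice = candidates[0]
        elif previous == "PRON" and "V" in candidates:
            choice = "V"  # a personal pronoun is typically followed by a verb
        elif previous == "DET" and "N" in candidates:
            choice = "N"  # a possessive determiner is typically followed by a noun
        else:
            choice = candidates[0]
        tags.append(choice)
    return tags


verb_reading = tag(["He", "works"])
noun_reading = tag(["His", "works"])
```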
11
Chunking
Chunks are sequences of words that are grouped
on the basis of linguistic properties, such as
nominal, prepositional, adjectival and adverbial
phrases and verb groups. Example: [NP His works]
[VG have] [NP all] [VG been sold] [AdvP abroad].
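One simple way to obtain such chunks is to map each part-of-speech tag to a chunk type and merge neighbouring tokens of the same type. This is a sketch under that assumption; the tag set and the mapping are invented for the example, and real chunkers use richer patterns.

```python
from itertools import groupby

# Hypothetical mapping from PoS tags to chunk types.
CHUNK_OF = {
    "DET": "NP", "N": "NP", "PRON": "NP",
    "AUX": "VG", "V": "VG",
    "ADV": "AdvP",
}


def chunk(tagged_tokens):
    """Merge adjacent (word, tag) pairs that map to the same chunk type."""
    chunks = []
    by_type = groupby(tagged_tokens, key=lambda pair: CHUNK_OF.get(pair[1], "O"))
    for label, group in by_type:
        chunks.append((label, [word for word, _ in group]))
    return chunks


tagged = [
    ("His", "DET"), ("works", "N"), ("have", "AUX"), ("all", "PRON"),
    ("been", "AUX"), ("sold", "V"), ("abroad", "ADV"),
]
chunks = chunk(tagged)
```

Note that this merging strategy would wrongly join two adjacent noun phrases into one, which is one reason chunkers usually work with explicit phrase patterns instead.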
12
Named Entity Detection
Related to chunking is the recognition of
so-called named entities (names of institutions
and companies, date expressions, etc.). The
extraction of named entities is mostly based on a
strategy that combines look up in gazetteers
(lists of companies, cities, etc.) with the
definition of regular expression patterns. Named
entity recognition can be included as part of the
linguistic chunking procedure. The sentence
fragment 'the secretary-general of the United
Nations, Kofi Annan,' will then be annotated as
a nominal phrase that includes two named
entities: United Nations, with named entity
class organization, and Kofi Annan, with named
entity class person.
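The combination of gazetteer lookup and regular expression patterns can be sketched as follows. The two-entry gazetteer and the date pattern are assumptions for illustration; they cover only the examples used in this presentation.

```python
import re

# Tiny illustrative gazetteer: name -> named entity class.
GAZETTEER = {
    "United Nations": "organization",
    "Kofi Annan": "person",
}

# Illustrative pattern for dates such as "24th of September".
DATE_PATTERN = re.compile(
    r"\b\d{1,2}(?:st|nd|rd|th) of "
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\b"
)


def find_entities(text):
    """Return (entity text, class) pairs in order of appearance."""
    found = []
    for name, entity_class in GAZETTEER.items():
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), entity_class))
    for match in DATE_PATTERN.finditer(text):
        found.append((match.start(), match.group(), "date"))
    return [(entity, entity_class) for _, entity, entity_class in sorted(found)]


names = find_entities("the secretary-general of the United Nations, Kofi Annan,")
dates = find_entities("the measures taken on the 24th of September by the CEO")
```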
13
Dependency Structure Analysis
A dependency structure consists of two or more
linguistic units that immediately dominate each
other in a syntax tree. The detection of such
structures is generally not provided by chunking
but builds on top of it. Two main types of
dependencies are relevant for our purposes: on
the one hand, the internal dependency structure
of phrasal units or chunks, and on the other
hand, the so-called grammatical functions (like
subject and direct object).
14
Internal Dependency Structure
To describe the internal structure of a phrase,
linguistic analysis uses the terms head,
complements and modifiers: the head is the
dominating node in the syntax tree of a phrase
(chunk), complements are necessary qualifiers
thereof, and modifiers are optional qualifiers.
Consider the following example: The shot by
Christian Ziege goes over the goal. The
prepositional phrase 'by Christian Ziege'
(containing the named entity Christian Ziege)
depends on (and modifies) the head noun 'shot'.
15
Grammatical Functions
Grammatical functions determine the role
(function) of each of the linguistic chunks in
the sentence and make it possible to identify the
actors involved in certain events. For example,
in the sentence 'The shot by Christian Ziege goes
over the goal.', the syntactic (and also the
semantic) subject is the NP constituent 'The shot
by Christian Ziege'. This nominal phrase depends
on (and complements) the verb 'goes', whereas the
noun 'shot' is the head of the NP (it is the shot
that goes over the goal, not Christian Ziege!).
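The two dependencies described for this sentence can be stored as simple triples. The sketch below is hand-annotated and the relation labels and function names are assumptions; it only shows how the subject head ('shot', not 'Christian Ziege') can be read off such a structure.

```python
# Hand-annotated (dependent, relation, head) triples for
# "The shot by Christian Ziege goes over the goal."
DEPENDENCIES = [
    ("by Christian Ziege", "modifier", "shot"),  # PP modifies the head noun
    ("shot", "subject", "goes"),                 # NP head is the verb's subject
]


def subject_of(verb, dependencies):
    """Return the head word of the verb's subject, or None."""
    for dependent, relation, head in dependencies:
        if relation == "subject" and head == verb:
            return dependent
    return None


def modifiers_of(word, dependencies):
    """Return all modifiers attached to the given head word."""
    return [dependent for dependent, relation, head in dependencies
            if relation == "modifier" and head == word]


actor = subject_of("goes", DEPENDENCIES)
qualifiers = modifiers_of("shot", DEPENDENCIES)
```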
16
Semantic Tagging
Automatic semantic annotation has developed
within language technology in recent years in
connection with more integrated tasks like
information extraction, which require a certain
level of semantic analysis. Semantic tagging
consists of annotating each content word
in a document with a semantic category. Semantic
categories are assigned on the basis of
semantic resources such as WordNet for English or
EuroWordNet, which links words across many
European languages through a common interlingua
of concepts.
17
Semantic Resources
  • Semantic resources are captured in dictionaries,
    thesauri, and semantic networks, all of which
    express, either implicitly or explicitly, an
    ontology of the world in general or of more
    specific domains, such as medicine.
  • They can be roughly divided into the
    following three groups:
  • Thesauri: Semantic resources that group
    together similar words or terms according to a
    standard set of relations, including broader
    term, narrower term, sibling, etc. (like Roget)
  • Semantic Lexicons: Semantic resources that
    group together words (or more complex lexical
    items) according to lexical semantic relations
    like synonymy, hyponymy, meronymy, and antonymy
    (like WordNet)
  • Semantic Networks: Semantic resources that
    group together objects denoted by natural
    language expressions (terms) according to a set
    of relations that originate in the nature of the
    domain of application (like UMLS in the medical
    domain)

18
The MeSH Thesaurus
MeSH (Medical Subject Headings) is a thesaurus
for indexing articles and books in the medical
domain, which may then be used for searching
MeSH-indexed databases. MeSH provides for each
term a number of term variants that refer to the
same concept. It currently includes a vocabulary
of over 250,000 terms. The following is a sample
entry for the term gene library (MH is the term
itself, ENTRY are term variants):
MH Gene Library
ENTRY Bank, Gene
ENTRY Banks, Gene
ENTRY DNA Libraries
ENTRY Gene Bank
etc.
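Term variants are useful for indexing because every variant can be mapped back to the same heading. A minimal sketch, assuming one field per line with the field name (MH or ENTRY) as the first word; this layout and the function name are assumptions, not the official MeSH file format.

```python
SAMPLE_RECORD = """MH Gene Library
ENTRY Bank, Gene
ENTRY Banks, Gene
ENTRY DNA Libraries
ENTRY Gene Bank"""


def build_variant_index(record):
    """Map the heading and each term variant (lower-cased) to the heading."""
    heading = None
    index = {}
    for line in record.splitlines():
        field_name, _, value = line.strip().partition(" ")
        value = value.strip()
        if field_name == "MH":
            heading = value
            index[value.lower()] = heading
        elif field_name == "ENTRY" and heading is not None:
            index[value.lower()] = heading
    return index


variant_index = build_variant_index(SAMPLE_RECORD)
```

Looking up "dna libraries" or "gene bank" in variant_index then yields the heading "Gene Library".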
19
The WordNet Semantic Lexicon
WordNet has primarily been designed as a
computational account of the human capacity of
linguistic categorization and covers an extensive
set of semantic classes (called synsets). Synsets
are collections of synonyms, grouping together
lexical items according to meaning similarity.
Synsets are actually not made up of lexical
items, but rather of lexical meanings (i.e.
senses)
21
WordNet: An Example
The word 'tree' has two meanings that roughly
correspond to the classes of plants and that of
diagrams, each with its own hierarchy of
classes that are included in more general
super-classes:
09396070 tree 0
09395329 woody_plant 0 ligneous_plant 0
09378438 vascular_plant 0 tracheophyte 0
00008864 plant 0 flora 0 plant_life 0
00002086 life_form 0 organism 0 being 0 living_thing 0
00001740 entity 0 something 0

10025462 tree 0 tree_diagram 0
09987563 plane_figure 0 two-dimensional_figure 0
09987377 figure 0
00015185 shape 0 form 0
00018604 attribute 0
00013018 abstraction 0
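The two hierarchies can be walked programmatically. The sketch below copies the synset numbers from the slide and assumes that each synset listed is the direct super-class of the one before it; the dictionaries and the function name are illustrative and do not use any WordNet library.

```python
# Synset id -> id of its super-class, read off the slide (assumed linear chains).
SUPER_CLASS = {
    "09396070": "09395329", "09395329": "09378438", "09378438": "00008864",
    "00008864": "00002086", "00002086": "00001740",
    "10025462": "09987563", "09987563": "09987377", "09987377": "00015185",
    "00015185": "00018604", "00018604": "00013018",
}

# Synset id -> first word listed for that synset on the slide.
FIRST_WORD = {
    "09396070": "tree", "09395329": "woody_plant", "09378438": "vascular_plant",
    "00008864": "plant", "00002086": "life_form", "00001740": "entity",
    "10025462": "tree", "09987563": "plane_figure", "09987377": "figure",
    "00015185": "shape", "00018604": "attribute", "00013018": "abstraction",
}


def super_class_chain(synset_id):
    """Follow super-class links upwards and return the words along the way."""
    chain = [synset_id]
    while chain[-1] in SUPER_CLASS:
        chain.append(SUPER_CLASS[chain[-1]])
    return [FIRST_WORD[s] for s in chain]


plant_sense = super_class_chain("09396070")
diagram_sense = super_class_chain("10025462")
```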
22
What is the Semantic Web
  • The Semantic Web is a new initiative to
    transform the web into a structure that supports
    more intelligent querying and browsing, both by
    machines and by humans. This transformation is to
    be supported through the generation and use of
    metadata constructed via web annotation tools
    using user-defined ontologies that can be related
    to one another.
  • Somewhere on the web

24
Extracting Events from Structured Documents
  • Detecting Metadata in our Example
  • Type of game: N/A
  • Teams involved: England - Deutschland
  • Players: Deutschland: Kahn (2) - Matthaeus (3) -
    Babbel (3,5),
  • Final (and intermediate) score: 1:0 (0:0)
  • Referee: Schiedsrichter: Collina, Pierluigi
    (Viareggio)
  • Date: N/A
  • Etc.

25
Extracting Events from Structured Documents (2)
  • Detecting Events in our Example
  • Substitution: Eingewechselt: 61. Gerrard fuer
    Owen,
  • Goal: Tore: 1:0 Shearer (53., Kopfball,
    Vorarbeit Beckham)
  • Cards: Gelbe Karten: Beckham - Babbel, Jeremies
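Because such match reports follow a fixed layout, a regular expression is enough to pull out a goal event. The sketch below assumes the goal line reads 'Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)', where Tore means goals, Kopfball means header and Vorarbeit means assist; the pattern and the output field names are assumptions, not the DTD patterns used in the actual system.

```python
import re

GOAL_PATTERN = re.compile(
    r"(?P<home>\d+):(?P<away>\d+)\s+(?P<scorer>[^\s(]+)\s+"
    r"\((?P<minute>\d+)\.(?:,\s*(?P<details>[^)]*))?\)"
)


def extract_goals(line):
    """Extract goal events from a line that starts with 'Tore:'."""
    if not line.startswith("Tore:"):
        return []
    events = []
    for match in GOAL_PATTERN.finditer(line):
        details = [d.strip() for d in (match.group("details") or "").split(",")
                   if d.strip()]
        assists = [d.split(None, 1)[1] for d in details
                   if d.startswith("Vorarbeit ")]
        events.append({
            "type": "goal",
            "score": f"{match.group('home')}:{match.group('away')}",
            "scorer": match.group("scorer"),
            "minute": int(match.group("minute")),
            "assist": assists[0] if assists else None,
            "manner": [d for d in details if not d.startswith("Vorarbeit ")],
        })
    return events


goals = extract_goals("Tore: 1:0 Shearer (53., Kopfball, Vorarbeit Beckham)")
```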

26
Results in XML
  • Events (and entities and relations) are
    extracted automatically from structured text, on
    the basis of patterns (DTD) of typical
    expressions and the soccer ontology. Example and
    Example_2
  • Since the various results are available as XML
    files, they can be merged automatically, guided
    by the ontology. Example. This supports
    incremental and dynamic extraction.
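The idea of writing events as XML and merging several result files can be sketched with Python's standard xml.etree.ElementTree module. The element and attribute names (match, event, participant, role) are assumptions, and the merge key (event type plus minute) is a crude stand-in for the ontology-guided merging described above.

```python
import xml.etree.ElementTree as ET


def event_to_xml(event):
    """Serialise one event dictionary as an <event> element."""
    element = ET.Element("event", type=event["type"], minute=str(event["minute"]))
    for role, name in event["participants"].items():
        ET.SubElement(element, "participant", role=role).text = name
    return element


def merge(*documents):
    """Merge <match> documents, keeping one event per (type, minute) key."""
    merged = ET.Element("match")
    seen = set()
    for document in documents:
        for event in document.findall("event"):
            key = (event.get("type"), event.get("minute"))
            if key not in seen:
                seen.add(key)
                merged.append(event)
    return merged


GOAL = {"type": "goal", "minute": 53,
        "participants": {"scorer": "Shearer", "assist": "Beckham"}}
SUBSTITUTION = {"type": "substitution", "minute": 61,
                "participants": {"in": "Gerrard", "out": "Owen"}}

# Two hypothetical result documents that overlap on the goal event.
doc_a = ET.Element("match")
doc_a.append(event_to_xml(GOAL))
doc_b = ET.Element("match")
doc_b.append(event_to_xml(GOAL))
doc_b.append(event_to_xml(SUBSTITUTION))

merged = merge(doc_a, doc_b)
```

Calling ET.tostring(merged, encoding="unicode") then gives a single XML document containing the goal once and the substitution once.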

27
Extracting Events from Semi-Structured Documents
  • Linguistic processing is needed to provide a
    basic structure for the document, which enables
    domain-specific annotation. Example.

28
Extracting Events from Semi-Structured Documents
(2)
  • The results from the semantic annotation of the
    structured documents are used as well,
    supporting incremental extraction. Example.

29
Current Development
  • Extracting information from multilingual balance
    sheets (WINS eTen project); extending this to
    unstructured text and extracting relations and
    events from annexes to balance sheets (upcoming
    project MUSING).
  • Detecting positive/negative mentions of
    entities in news documents (project Direct-Info
    on media monitoring). Example.

30
Further Challenge for HLT
  • Use HLT not only for the semantic annotation of
    web pages (or other documents), but also for
    supporting ontology extraction/learning from the
    web (or other documents)

31
Example of semantic relation extraction in
bio-medicine
  • Rheumatoid arthritis is characterized by
    progressive synovial inflammation and joint
    destruction.
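A relation of this kind can be approximated with a surface pattern. The sketch below handles only the construction 'X is characterized by Y and Z'; the relation name characterized_by and the naive split on 'and' are assumptions, and a real system would use the chunk and dependency analysis described earlier to find the coordinated phrases.

```python
import re

PATTERN = re.compile(
    r"^(?P<subject>.+?) is characterized by (?P<objects>.+?)\s*\.?$"
)


def extract_relations(sentence):
    """Return (subject, relation, object) triples for matching sentences."""
    match = PATTERN.match(sentence.strip())
    if not match:
        return []
    objects = re.split(r"\s+and\s+|,\s*", match.group("objects"))
    return [(match.group("subject"), "characterized_by", obj.strip())
            for obj in objects if obj.strip()]


relations = extract_relations(
    "Rheumatoid arthritis is characterized by progressive synovial "
    "inflammation and joint destruction."
)
```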

32
Open issues for HLT and SW
  • To achieve a better coordination for improving
    semantic annotation results
  • Development and use of standards for interrelated
    linguistic and semantic annotation (see eContent
    Project LIRICS for standards for language
    resources)

33
Interoperable Standards?
34
Thank you!