Human Language Technology for the Semantic Web

Transcript and Presenter's Notes
1
Human Language Technology for the Semantic Web
http://gate.ac.uk/  http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Diana Maynard, Valentin Tablan
ESWS, Crete, May 2004
This work has been supported by AKT (http://aktors.org/) and SEKT (http://sekt.semanticweb.org/)
2
Are you wasting your time?
3
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation: corpora & metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation
  • Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt

4
The Knowledge Economy and Human Language
  • Gartner, December 2002:
  • "taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications"
  • "through 2012 more than 95% of human-to-computer information input will involve textual language"
  • A contradiction: formal knowledge in semantics-based systems vs. ambiguous, informal natural language
  • The challenge: to reconcile these two opposing tendencies

5
HLT & Knowledge: Closing the Language Loop
KEY: (M)NLG = Multilingual Natural Language Generation; OBIE = Ontology-Based Information Extraction; (A)IE = Adaptive, Mixed-Initiative IE; CLIE = Controlled Language IE
[Diagram: Formal Knowledge (ontologies and instance bases) in the Semantic Web / Semantic Grid / Semantic Web Services, connected to Human Language via OBIE, (A)IE, Controlled Language (CLIE) and (M)NLG]
6
Background and Examples (1)
  • Like other areas of computer science, HLT has typical data structures and infrastructure requirements
  • Annotation: associating arbitrary data with areas of text or speech
  • De facto standard: stand-off markup (e.g. TEI/XCES, NITE, ATLAS, GATE)
  • Other issues: visualisation and editing; persistence and search; metrics; component model; baseline NLP tools ...
  • To cut a long story short: HLT has a lot of T underneath it, which comes in many shapes and sizes

7
Background and Examples (2)
  • Infrastructure: (many) examples in this tutorial
  • GATE, a General Architecture for Text Engineering: architecture, framework & IDE
  • Why?
  • I happen to know a little about it
  • Free software, relatively comprehensive, widely used, has extensive Semantic Web support
  • It means we can ignore the infrastructural issues
  • Not a claim that it is the best or the only one in all cases!

8
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation corpora metrics
  • IE approaches some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation
  • Slides http//gate.ac.uk/sale/talks/esws2004-tut
    orial.ppt

9
Information Extraction (1)
  • Information Extraction (IE) pulls facts and structured information from the content of large text collections.
  • Contrast IE and Information Retrieval
  • NLP history: from NLU to IE (if you can't score, why not move the goalposts?)

10
Information Extraction (2)
  • "When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the stage of science." (Kelvin)
  • "Not everything that counts can be counted, and not everything that can be counted counts." (Einstein)
  • IE progress driven by quantitative measures
  • MUC: Message Understanding Conferences
  • ACE: Automatic Content Extraction

11
MUC-7 tasks
  • Held in 1997, around 15 participants inc. 2 UK. Broke IE down into component tasks:
  • NE: Named Entity recognition and typing
  • CO: co-reference resolution
  • TE: Template Elements
  • TR: Template Relations
  • ST: Scenario Templates

12
An Example
  • "The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc."
  • NE: "rocket", "Tuesday", "Dr. Head", "We Build Rockets"
  • CO: "it" = rocket; "Dr. Head" = "Dr. Big Head"
  • TE: the rocket is "shiny red" and Head's "brainchild"
  • TR: Dr. Head works for We Build Rockets Inc.
  • ST: rocket launch event with various participants

13
Performance levels
  • Vary according to text type, domain, scenario, language
  • NE: up to 97% (tested in English, Spanish, Japanese, Chinese, etc.)
  • CO: 60-70% resolution
  • TE: 80%
  • TR: 75-80%
  • ST: 60% (but human level may be only 80%)

14
What are Named Entities?
  • NE involves identification of proper names in
    texts, and classification into a set of
    predefined categories of interest
  • Person names
  • Organizations (companies, government
    organisations, committees, etc)
  • Locations (cities, countries, rivers, etc)
  • Date and time expressions

15
What are Named Entities? (2)
  • Other common types: measures (percent, money, weight etc), email addresses, Web addresses, street addresses, etc.
  • Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
  • MUC-7 entity definition guidelines [Chinchor 97]:
  • http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html

16
What are NOT NEs (MUC-7)
  • Artefacts: Wall Street Journal
  • Common nouns referring to named entities: the company, the committee
  • Names of groups of people and things named after people: the Tories, the Nobel prize
  • Adjectives derived from names: Bulgarian, Chinese
  • Numbers which are not times, dates, percentages, or money amounts

17
Basic Problems in NE
  • Variation of NEs: e.g. John Smith, Mr Smith, John
  • Ambiguity of NE types: John Smith (company vs. person)
  • May (person vs. month)
  • Washington (person vs. location)
  • 1945 (date vs. time)
  • Ambiguity with common words, e.g. "may"

18
More complex problems in NE
  • Issues of style, structure, domain, genre etc.
  • Punctuation, spelling, spacing, formatting, ...
    all have an impact
  • Dept. of Computing and Maths
  • Manchester Metropolitan University
  • Manchester
  • United Kingdom
  • Tell me more about Leonardo
  • Da Vinci

19
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation corpora metrics
  • IE approaches some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation
  • Slides http//gate.ac.uk/sale/talks/esws2004-tut
    orial.ppt

20
Corpora and System Development
  • Gold standard data created by manual annotation
  • Corpora are typically divided into a training and a testing portion
  • Rules and/or learning algorithms are developed or trained on the training part
  • Tuned on the testing portion in order to optimise:
  • Rule priorities, rules' effectiveness, etc.
  • Parameters of the learning algorithm and the features used (typical routine: 10-fold cross-validation; see the sketch below)
  • Evaluation set: the best system configuration is run on this data and the system performance is obtained
  • No further tuning once the evaluation set is used!
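To make the routine concrete, a minimal sketch of the 10-fold split in Python (the document list is a hypothetical stand-in for annotated documents):

    # Minimal sketch of 10-fold cross-validation over annotated documents.
    def ten_fold_splits(documents, k=10):
        """Yield (training, testing) partitions for k-fold cross-validation."""
        for fold in range(k):
            testing = [d for i, d in enumerate(documents) if i % k == fold]
            training = [d for i, d in enumerate(documents) if i % k != fold]
            yield training, testing

    corpus = ["doc%03d" % i for i in range(100)]   # stand-in for annotated documents
    for training, testing in ten_fold_splits(corpus):
        pass  # develop/train on `training`, tune on `testing`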

21
Some NE Annotated Corpora
  • MUC-6 and MUC-7 corpora - English
  • CoNLL shared task corpora: http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German; http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
  • TIDES surprise language exercise (NEs in Cebuano and Hindi)
  • ACE - English - http://www.ldc.upenn.edu/Projects/ACE/

22
The MUC-7 corpus
  • 100 documents in SGML
  • News domain
  • Named Entities:
  • 1880 Organizations (46%)
  • 1324 Locations (32%)
  • 887 Persons (22%)
  • Inter-annotator agreement very high (97%)
  • http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf

23
The MUC-7 Corpus (2)
  • <ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.
  • <p>
  • Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.

24
ACE: Towards Semantic Tagging of Entities
  • MUC NE tags segments of text whenever that text
    represents the name of an entity
  • In ACE (Automated Content Extraction), these
    names are viewed as mentions of the underlying
    entities. The main task is to detect (or infer)
    the mentions in the text of the entities
    themselves
  • Rolls together the NE and CO tasks
  • Domain- and genre-independent approaches
  • ACE corpus contains newswire, broadcast news (ASR
    output and cleaned), and newspaper reports (OCR
    output and cleaned)

25
ACE Entities
  • Dealing with:
  • Proper names: e.g., England, Mr. Smith, IBM
  • Pronouns: e.g., he, she, it
  • Nominal mentions: the company, the spokesman
  • Identify which mentions in the text refer to which entities, e.g.,
  • Tony Blair, Mr. Blair, he, the prime minister, he
  • Gordon Brown, he, Mr. Brown, the chancellor

26
ACE Example
    <entity ID="ft-airlines-27-jul-2001-2"
            GENERIC="FALSE"
            entity_type="ORGANIZATION">
      <entity_mention ID="M003" TYPE="NAME"
                      string="National Air Traffic Services">
      </entity_mention>
      <entity_mention ID="M004" TYPE="NAME"
                      string="NATS">
      </entity_mention>
      <entity_mention ID="M005" TYPE="PRO"
                      string="its">
      </entity_mention>
      <entity_mention ID="M006" TYPE="NAME"
                      string="Nats">
      </entity_mention>

27
Annotation Tools (1): GATE
28
Annotation Tools (2): Alembic
29
Performance Evaluation
  • An evaluation metric mathematically defines how to measure the system's performance against a human-annotated gold standard
  • A scoring program implements the metric and provides performance measures:
  • For each document and over the entire corpus
  • For each type of NE

30
Evaluation Metrics
  • Most common are Precision and Recall
  • Precision = correct answers / answers produced
  • Recall = correct answers / total possible correct answers
  • Trade-off between precision and recall
  • F-Measure = (β² + 1)PR / (β²R + P) [van Rijsbergen 75]
  • β reflects the weighting between precision and recall; typically β = 1
  • Some tasks sometimes use other metrics, e.g.
  • false positives (not sensitive to doc richness)
  • cost-based (good for application-specific adjustment)

31
The Evaluation Metric (2)
  • We may also want to take account of partially correct answers:
  • Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
  • Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
  • Why: NE boundaries are often misplaced, so some results are partially correct (see the sketch below)
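A sketch of these formulas in Python (the counts are assumed to have been obtained by aligning system output against the gold standard; β = 1 by default):

    def ne_scores(correct, partial, incorrect, missing, beta=1.0):
        """Precision/recall with half credit for partially correct answers,
        plus the F-measure from the previous slide."""
        produced = correct + incorrect + partial    # answers the system gave
        possible = correct + missing + partial      # answers in the gold standard
        p = (correct + 0.5 * partial) / produced if produced else 0.0
        r = (correct + 0.5 * partial) / possible if possible else 0.0
        b2 = beta ** 2
        f = (b2 + 1) * p * r / (b2 * r + p) if (p + r) else 0.0
        return p, r, f

    print(ne_scores(correct=80, partial=10, incorrect=5, missing=15))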

32
The GATE Evaluation Tool
33
Corpus-level Regression Testing
  • Need to track the system's performance over time
  • When a change is made, we want to know the implications over the whole corpus
  • Why? Because an improvement in one case can lead to problems in others
  • GATE offers an automated tool to help with the NE development task over time

34
Regression Testing (2)
At corpus level: GATE's corpus benchmark tool, tracking the system's performance over time
35
SW IE Evaluation tasks
  • Detection of entities and events, given a target ontology of the domain.
  • Disambiguation of the entities and events from the documents with respect to instances in the given ontology. For example, measuring whether the IE correctly disambiguated "Cambridge" in the text to the correct instance (Cambridge, UK vs Cambridge, MA).
  • Decision when a new instance needs to be added to the ontology, because the text contains an instance that does not already exist in the ontology.

36
Challenge: Evaluating Richer NE Tagging
  • Need for new metrics when evaluating hierarchy/ontology-based NE tagging
  • Need to take into account distance in the hierarchy
  • Tagging a company as a charity is less wrong than tagging it as a person (see the sketch below)
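One possible shape for such a metric, as a sketch only (the class hierarchy is hypothetical and this is not a published standard): give each response a credit that decays with its distance from the gold class.

    # Distance-weighted credit over a tiny hypothetical class hierarchy.
    PARENT = {"Company": "Organisation", "Charity": "Organisation",
              "Organisation": "Entity", "Person": "Entity", "Entity": None}

    def path_to_root(cls):
        path = []
        while cls is not None:
            path.append(cls)
            cls = PARENT[cls]
        return path

    def hierarchy_credit(gold, response):
        """1.0 for an exact match, decaying with hierarchy distance."""
        g, r = path_to_root(gold), path_to_root(response)
        common = set(g) & set(r)
        if not common:
            return 0.0
        lca = min(common, key=g.index)        # lowest common ancestor
        distance = g.index(lca) + r.index(lca)
        return 1.0 / (1.0 + distance)

    print(hierarchy_credit("Company", "Charity"))  # 0.33: sibling classes
    print(hierarchy_credit("Company", "Person"))   # 0.25: further apart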

37
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation: corpora & metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation
  • Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt

38
Two kinds of IE approaches
  • Knowledge Engineering
  • rule-based
  • developed by experienced language engineers
  • makes use of human intuition
  • requires only a small amount of training data
  • development can be very time-consuming
  • some changes may be hard to accommodate
  • Learning Systems
  • use statistics or other machine learning
  • developers do not need LE expertise
  • require large amounts of annotated training data
  • some changes may require re-annotation of the entire training corpus
  • annotators are cheap (but you get what you pay for!)

39
A) NE Baseline: list lookup approach
  • System that recognises only entities stored in its lists (gazetteers); a sketch follows below
  • Advantages: simple, fast, language independent, easy to retarget (just create lists)
  • Disadvantages: impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
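A minimal sketch of this baseline in Python (the gazetteer entries are illustrative only):

    # List-lookup NE baseline: mark every gazetteer entry found in the text.
    GAZETTEER = {"London": "Location", "Bulgaria": "Location",
                 "IBM": "Organization", "John Smith": "Person"}

    def lookup_entities(text):
        """Return (start, end, type) for each gazetteer string in the text."""
        found = []
        for name, etype in GAZETTEER.items():
            start = text.find(name)
            while start != -1:
                found.append((start, start + len(name), etype))
                start = text.find(name, start + 1)
        return sorted(found)

    print(lookup_entities("John Smith moved from London to IBM."))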

40
B) Shallow parsing approach using internal structure
  • Internal evidence: names often have internal structure. These components can be either stored or guessed, e.g. location:
  • Cap. Word + City | Forest | Center | River; e.g. Sherwood Forest
  • Cap. Word + Street | Boulevard | Avenue | Crescent | Road; e.g. Portobello Street
  • (a regular-expression sketch follows below)
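Such internal-evidence rules translate almost directly into patterns; a sketch using Python regular expressions (the indicator word lists are illustrative):

    import re

    # Capitalised word followed by a location indicator word.
    NATURAL = re.compile(r"\b([A-Z][a-z]+)\s+(City|Forest|Center|River)\b")
    STREET = re.compile(r"\b([A-Z][a-z]+)\s+(Street|Boulevard|Avenue|Crescent|Road)\b")

    text = "Walk down Portobello Street towards Sherwood Forest."
    for pattern in (NATURAL, STREET):
        for match in pattern.finditer(text):
            print(match.group(0), "-> Location")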

41
Problems ...
  • Ambiguously capitalised words (first word in sentence): "All American Bank" vs. "All State Police"
  • Semantic ambiguity: "John F. Kennedy" = airport (location); "Philip Morris" = organisation
  • Structural ambiguity: "Cable and Wireless" vs. "Microsoft and Dell"; "Center for Computational Linguistics" vs. message from "City Hospital" for "John Smith"

42
C) Shallow parsing with context
  • Use of context-based patterns is helpful in ambiguous cases
  • "David Walton" and "Goldman Sachs" are indistinguishable
  • But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "Person of Organization" to identify "Goldman Sachs" correctly.

43
Examples of context patterns
  • PERSON earns MONEY
  • PERSON joined ORGANIZATION
  • PERSON left ORGANIZATION
  • PERSON joined ORGANIZATION as JOBTITLE
  • ORGANIZATION's JOBTITLE PERSON
  • ORGANIZATION JOBTITLE PERSON
  • the ORGANIZATION JOBTITLE
  • part of the ORGANIZATION
  • ORGANIZATION headquarters in LOCATION
  • price of ORGANIZATION
  • sale of ORGANIZATION
  • investors in ORGANIZATION
  • ORGANIZATION is worth MONEY
  • JOBTITLE PERSON
  • PERSON, JOBTITLE
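A sketch of applying one such pattern ("PERSON of ORGANIZATION") once person names have been recognised; the helper function and test sentence are hypothetical:

    import re

    def organisations_from_context(text, known_persons):
        """Apply the 'PERSON of ORGANIZATION' pattern: treat the capitalised
        word sequence after '<person> of' as an organisation name."""
        orgs = []
        for person in known_persons:
            pattern = re.escape(person) + r" of ((?:[A-Z][a-z&]+ ?)+)"
            for m in re.finditer(pattern, text):
                orgs.append(m.group(1).strip())
        return orgs

    print(organisations_from_context(
        "A statement by David Walton of Goldman Sachs followed.",
        known_persons=["David Walton"]))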

44
Example Rule-based System - ANNIE
  • ANNIE: A Nearly-New IE system
  • A version distributed as part of GATE
  • GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
  • GATE has a finite-state pattern-action rule language, used by ANNIE
  • A reusable and easily extendable set of components

45
NE Components
46
Gazetteer lists for rule-based NE
  • Needed to store the indicator strings for the internal structure and context rules
  • Internal location indicators: e.g., river, mountain, forest for natural locations; street, road, crescent, place, square for address locations
  • Internal organisation indicators: e.g., company designators GmbH, Ltd, Inc, ...
  • Produces Lookup results of the given kind

47
The Named Entity Transducers
  • Phases run sequentially and constitute a cascade
    of FSTs over the pre-processing results
  • Hand-coded rules applied to annotations to
    identify NEs
  • Annotations from format analysis, tokeniser,
    sentence splitter, POS tagger, and gazetteer
    modules
  • Use contextual information
  • Finds person names, locations, organisations,
    dates, addresses.

48
An NE Rule in JAPE
  • JAPE: a Java Annotation Patterns Engine
  • Light, robust regular-expression-based processing
  • Cascaded finite state transduction
  • Low-overhead development of new components
  • Simplifies multi-phase regex processing

    Rule: Company1
    Priority: 25
    (
      ( {Token.orthography == upperInitial} )    // from tokeniser
      {Lookup.kind == companyDesignator}         // from gazetteer lists
    ):match
    -->
    :match.NamedEntity = { kind = "company", rule = "Company1" }

49
Named Entities in GATE
50
Using co-reference to classify ambiguous NEs
  • The orthographic co-reference module matches proper names in a document
  • Improves NE results by assigning an entity type to previously unclassified names, based on relations with classified NEs
  • May not reclassify already classified entities
  • Classification of unknown entities is very useful for surnames which match a full name, or abbreviations, e.g. "Bonfield" will match "Sir Peter Bonfield"; "International Business Machines Ltd." will match "IBM" (a sketch follows below)
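A sketch of the surname-matching part of such a module (simplified; a real module also covers abbreviations, titles and case):

    def match_surname(unknown, classified_persons):
        """Assign type Person to a bare surname if it equals the last token
        of an already classified person name ('Bonfield' -> 'Sir Peter Bonfield')."""
        for full_name in classified_persons:
            if full_name.split()[-1] == unknown:
                return full_name
        return None

    print(match_surname("Bonfield", ["Sir Peter Bonfield", "Gordon Brown"]))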

51
Named Entity Coreference
52
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation: corpora & metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation
  • Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt

53
Machine Learning Approaches
  • Approaches:
  • Train ML models on manually annotated text
  • Mixed-initiative learning:
  • Used for producing training data
  • Used for producing working systems
  • ML methods:
  • Symbolic learning: rule/decision-tree induction
  • Statistical models: HMMs, Bayesian methods, Maximum Entropy

54
ML Terminology
  • Instances (tokens, entities)
  • Occurrences of a phenomenon
  • Attributes (features)
  • Characteristics of the instances
  • Classes
  • Sets of similar instances

55
Methodology
  • The task can be broken into several subtasks
    (that can use different methods)
  • Boundary detection
  • Entity classification into NE types
  • Different models for different entity types
  • Several models can be used in competition.
  • Some algorithms perform better on little data
    while others are better when more training is
    available

56
Methodology (2)
  • Boundary (and entity type) notations:
  • S(-XXX), E(-XXX):
  • <S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/> heads for <S-LOC/>Baghdad<E-LOC/>.
  • IOB notation (Inside, Outside, Beginning_of):
  • U.N.      I-ORG
  • official  O
  • Ekeus     I-PER
  • heads     O
  • for       O
  • Baghdad   I-LOC
  • .         O
  • Translations between the two conventions are straightforward (see the sketch below)
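For instance, a sketch collecting entity spans from IOB-tagged tokens (the reverse translation is just as direct):

    def iob_to_spans(tagged_tokens):
        """Collect (entity_type, text) spans from (token, IOB-tag) pairs."""
        spans, current = [], None
        for token, tag in tagged_tokens:
            if tag == "O":
                current = None
            elif current is not None and tag[2:] == current[0]:
                current[1].append(token)              # continue the open span
            else:
                current = (tag[2:], [token])          # start a new span
                spans.append(current)
        return [(etype, " ".join(words)) for etype, words in spans]

    print(iob_to_spans([("U.N.", "I-ORG"), ("official", "O"), ("Ekeus", "I-PER"),
                        ("heads", "O"), ("for", "O"), ("Baghdad", "I-LOC"), (".", "O")]))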

57
Features
  • Linguistic features:
  • POS
  • Morphology
  • Syntax
  • Lexicon data
  • Semantic features:
  • Ontological class
  • etc.
  • Document structure:
  • Original markup
  • Paragraph/sentence structure
  • Surface features:
  • Token length
  • Capitalisation
  • Token type (word, punctuation, symbol)
  • Feature selection: the most difficult part
  • Some automatic scoring methods can be used

58
Mixed Initiative Learning
  • Human-computer interaction
  • Speeds up the creation of training data
  • Can be used for corpus/system creation
  • Example implementations:
  • Alembic [Day et al. 97] and later
  • Amilcare [Ciravegna 03]

59
Mixed Initiative Learning (2)
[Diagram: loop in which the user annotates and the system learns; behaviour governed by confidence thresholds P > t1 and P > t2]
60
Example 1: Alembic [Day et al. 1997]
  • Mixed-initiative approach implemented in the Alembic Workbench
  • Bootstrapping procedure: use already tagged data to pre-annotate new documents
  • Transforms the process from tagging to review
  • Finally, the trained system can be used on its own

61
Mixed-Initiative text annotation
The user can also edit the induced rules and write new ones; Brill-style learning (generate-and-test)
62
Considerations
  • Too high recall: the human might become over-reliant on the system's annotations
  • Too high precision: might have a similar effect
  • Theory-creep: the choices of the human annotator are increasingly influenced by the machine's and might deviate from the task definition; measure inter-annotator agreement

63
Example 2: Amilcare & Melita
  • Amilcare: rule-learning algorithm
  • Tagging rules: learn to insert tags in the text, given training examples
  • Correction rules: learn to move already inserted tags to their correct place in the text
  • Learns begin and end tags independently
  • Melita: supports adaptive IE
  • Applied in SemWeb context (see below)
  • Being extended as part of the EU-funded DOT.KOM project towards KM and SemWeb applications

64
Comparison of Alembic & Melita
  • The life cycle of the user tagging, the machine learning, then making suggestions which the user corrects, is very similar in Melita and Alembic
  • Alembic is more oriented towards NLP developers, while Melita is more towards end-users
  • Melita considers timeliness and intrusiveness as criteria, while Alembic does not (possibly due to a performance bottleneck on old hardware)
  • Both acknowledge but do not address problems with over-reliance on machine annotations
  • From an ML perspective the two are very similar: rule-learning

65
Eg. 3: GATE Machine Learning support
  • Uses classification:
  • Attr1, Attr2, Attr3, ... Attrn -> Class
  • Classifies annotations.
  • (Documents can be classified as well, using a 1-to-1 relation with annotations.)
  • Annotations of a particular type are selected as instances.
  • Attributes refer to features of the instance annotations or their context.
  • Generic implementation for attribute collection; can be linked to any ML engine.
  • ML engines currently integrated: WEKA and Ontotext's HMM.

66
Implementation
  • Machine Learning PR in GATE.
  • Has two functioning modes:
  • training
  • application
  • Uses an XML file for configuration:

    <?xml version="1.0" encoding="windows-1252"?>
    <ML-CONFIG>
      <DATASET> </DATASET>
      <ENGINE> </ENGINE>
    </ML-CONFIG>

67
Attributes Collection
Instance type: Token
68
Dataflow
[Dataflow diagram: plain text documents -> NLP pipeline (tokeniser, gazetteer, POS tagger, lexicon lookup, semantic tagger, etc.) -> annotated documents -> feature collection -> engine interface -> machine learning engine -> results converter, bridging GATE and the ML library]
69
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation: corpora & metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation
  • Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt

70
Towards Semantic Tagging of Entities
  • The MUC NE task tags selected segments of text whenever that text represents the name of an entity.
  • Semantic tagging: view names as mentions of the underlying instances from the ontology
  • Identify which mentions in the text refer to which instances in the ontology, e.g.,
  • Tony Blair, Mr. Blair, he, the prime minister, he
  • Gordon Brown, he, Mr. Brown, the chancellor

71
Tasks
  • Identify entity mentions in the text
  • Reference disambiguation:
  • Add new instances if needed
  • Disambiguate wrt instances in the ontology
  • Identify instances of attributes and relations
  • take into account which are allowed given the ontology, using domain/range as constraints

72
Example
"XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in ..."
[Diagram: Ontology & KB with classes Location, City, Country, Company and relations HQ, partOf, establOn, type; XYZ is typed as a Company with establOn 03/11/1978 and HQ in London, a City that is partOf a Country]
73
Classes, instances & metadata
Gordon Brown met George Bush during his two day visit.

    <metadata>
      <DOC-ID>http://...1.html</DOC-ID>
      <Annotation>
        <s_offset>0</s_offset> <e_offset>12</e_offset>
        <string>Gordon Brown</string>
        <class>Person</class> <inst>Person12345</inst>
      </Annotation>
      <Annotation>
        <s_offset>18</s_offset> <e_offset>32</e_offset>
        <string>George Bush</string>
        <class>Person</class> <inst>Person67890</inst>
      </Annotation>
    </metadata>

[Diagram: classes & instances before and after; the instance "Bush" is added]
74
Classes, instances & metadata (2)
Gordon Brown met Tony Blair to discuss the university tuition fees.

    <metadata>
      <DOC-ID>http://...2.html</DOC-ID>
      <Annotation>
        <s_offset>0</s_offset> <e_offset>12</e_offset>
        <string>Gordon Brown</string>
        <class>Person</class> <inst>Person12345</inst>
      </Annotation>
      <Annotation>
        <s_offset>18</s_offset> <e_offset>30</e_offset>
        <string>Tony Blair</string>
        <class>Person</class> <inst>Person26389</inst>
      </Annotation>
    </metadata>

[Diagram: classes & instances before (G. Brown, G. Bush) and after]
75
Why not put metadata in ontologies?
  • Can be encoded in RDF/OWL, etc., but does it need to be put as instances in the ontology?
  • Typically we do not need to reason with it
  • Reasoning happens in the ontology when new instances of classes and properties are added; the metadata statements are different from them, they only refer to them
  • A lot more metadata than instances:
  • Millions of metadata statements, thousands of instances, hundreds of concepts
  • Different access required:
  • By offset (give me all metadata of the first paragraph)
  • Efficient metadata-wide statistics based on strings: not an operation that people would do on other concepts
  • Mixing with keyword-based search using IR-style indexing

76
Metadata Creation with IE
  • Semantic tagging creates metadata
  • Stand-off or part of document
  • Semi-automatic:
  • One view (given by the user, one ontology)
  • More reliable
  • Automatic metadata creation:
  • Many views: change with the ontology; re-train the IE engine for each ontology
  • Always up to date, if the ontology changes
  • Less reliable

77
Problems with traditional IE for metadata creation
  • S-CREAM: Semi-automatic CREAtion of Metadata [Handschuh et al. 02]
  • Semantic tags from IE need to be mapped to instances of concepts, attributes or relations
  • Most ML-based IE systems do not deal well with relations, mainly entities
  • Amilcare does not handle anaphora resolution; GATE has such a component, but it is not used here
  • Implemented a discourse model with logical rules
  • LASIE used a discourse model with a domain ontology; the problem is robustness and domain portability

78
Example
[Handschuh et al. 02] S-CREAM, EKAW'02
79
S-CREAM Discourse Rules
  • Rules to attach instances only when the ontology
    allows that (e.g., prices)
  • Attach tag values to the nearest preceding
    compatible entity (e.g., prices and rooms)
  • Create a complex object between two concept
    instances if they are adjacent (e.g., rate
    number followed by currency)
  • Experienced users can write new rules

80
Challenges for IE for SemWeb
  • Portability: different and changing ontologies
  • Different text types: structured, free, etc.
  • Utilise ontology information where available
  • Train from a small amount of annotated text
  • Output results wrt the given ontology
  • bridge the gap demonstrated in S-CREAM
  • Learn/model at the right level:
  • ontologies are hierarchical and data will get sparser the lower we go

DOT.KOM: http://nlp.shef.ac.uk/dot.kom/
81
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation: corpora & metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE (OBIE)
  • Platforms for large-scale processing
  • Language Generation
  • Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt

82
Eg. 1: GATE metadata extraction
  • Combines learning and rule-based methods
  • Allows combination of IE and IR
  • Enables use of large-scale linguistic resources for IE, such as WordNet
  • Supports ontologies as part of IE applications - Ontology-Based IE (OBIE)

83
Ontology Management in GATE
84
Information Retrieval: currently based on the Lucene IR engine; useful for combining semantic and keyword-based search
85
WordNet support
86
Populating Ontologies with IE
87
Example 2: OBIE in h-TechSight
  • h-TechSight project: using Ontology-Based IE for semantic tagging of job adverts, news and reports in the chemical engineering domain
  • Aim is to track technological change over time through terminological analysis
  • Fundamental to the application is a domain-specific ontology
  • Terminological gazetteer lists are linked to classes in the ontology
  • Rules classify the mentions in the text wrt the domain ontology
  • Annotations output into a database or as an ontology

88
(No Transcript)
89
(No Transcript)
90
Exported Database
91
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation: corpora & metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation
  • Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt

92
Platforms for Large-Scale Metadata Creation
  • Allow use of corpus-wide statistics to improve metadata quality, e.g., disambiguation
  • Automated alias discovery
  • Generate SemWeb output (RDF, OWL)
  • Stand-off storage and indexing of metadata
  • Use large instance bases to disambiguate against
  • Ontology servers for reasoning and access
  • Architecture elements:
  • Crawler, onto storage, doc indexing, query, annotators
  • Apps: semantic browsers, authoring tools, etc.

93
Example 1: SemTag
  • Lookup of all instances from the ontology (TAP): 65K instances
  • Disambiguate the occurrences as:
  • One of those in the taxonomy
  • Not present in the taxonomy
  • Not very high ambiguity of instances with the same label in TAP, so concentrate on the second problem
  • Use a bag-of-words approach for disambiguation (see the sketch below)
  • 3 people evaluated 200 labels in context; they agreed on only 68.5% (metonymy)
  • Placing labels in the taxonomy is hard

Dill et al., SemTag and Seeker, WWW'03
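A sketch of the bag-of-words idea (not SemTag's actual implementation): compare a mention's context with a stored context vector for the taxonomy node, e.g. via cosine similarity.

    import math
    from collections import Counter

    def cosine(a, b):
        """Cosine similarity between two bag-of-words Counters."""
        dot = sum(a[w] * b[w] for w in a)
        norm = math.sqrt(sum(v * v for v in a.values())) * \
               math.sqrt(sum(v * v for v in b.values()))
        return dot / norm if norm else 0.0

    # Illustrative context words for a taxonomy node and for a mention.
    node_context = Counter("jaguar car engine speed drive".split())
    mention_context = Counter("the jaguar ran through the jungle".split())

    # Below some threshold, judge the mention "not present in the taxonomy".
    print(cosine(mention_context, node_context))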
94
Seeker
  • High-performance distributed infrastructure
  • 128 dual-processor machines, each with a separate ½ terabyte of storage
  • Each node processes approx. 200 documents per second
  • Service-oriented architecture: Vinci (SOAP)

Dill et al., SemTag and Seeker, WWW'03
95
Example 2: OBIE in KIM
  • The ontology (KIMO) and a KB of 86K/200K instances
  • High ambiguity of instances with the same label: need for a disambiguation step
  • Lookup phase marks mentions from the ontology
  • Combined with a rule-based IE system to recognise new instances of concepts and relations
  • Special KB enrichment stage where some of these new instances are added to the KB
  • Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)

Popov et al., KIM, ISWC'03
96
OBIE in KIM (2)
Popov et al., KIM, ISWC'03
97
Comparison between SemTag & KIM
  • SemTag only aims for accuracy (precision) of classification of the annotated entities
  • KIM also aims for coverage (recall): whether all possible mentions of entities were found
  • Trade-off: sometimes finding some is enough
  • SemTag does not attempt to discover and expand the KB with new instances (e.g., a new company); this is the reason why KIM uses IE, not simple KB lookup
  • i.e. OBIE is often needed for ontology population, not just metadata creation

98
Two Annotation Scenarios (1)
  • Getting the instances and the relations between them is enough; maybe not all mentions in the text are covered, but this is compensated by giving access to this info from the annotated text

99
Example
"Gordon Brown met president Bush during his two day visit. Afterwards George Bush said ..."
[Diagram: the system links the "Bush" mentions to a single instance]
Score: 100%
100
Two Annotation Scenarios (2)
  • Exhaustive annotation is required, so all
    occurrences of all instances and relations are
    needed
  • Allows sentence and paragraph-level exploration,
    rather than document-level as in the previous
    scenario
  • Harder to achieve
  • Distinction between these scenarios needs to be
    made in the metadata annotation tools/KM tools
    using IE

101
Example
"Gordon Brown met president Bush during his two day visit. Afterwards George Bush said ..."

Exhaustive annotation (all three mentions):

    <metadata>
      <Annotation>
        <s_offset>0</s_offset> <e_offset>12</e_offset>
        <class>Person</class> <inst>Person12345</inst>
      </Annotation>
      <Annotation>
        <s_offset>18</s_offset> <e_offset>32</e_offset>
        <class>Person</class> <inst>Person1267</inst>
      </Annotation>
      <Annotation>
        <s_offset>61</s_offset> <e_offset>72</e_offset>
        <class>Person</class> <inst>Person1267</inst>
      </Annotation>
    </metadata>

Only two of the three mentions found:

    <metadata>
      <Annotation>
        <s_offset>0</s_offset> <e_offset>12</e_offset>
        <class>Person</class> <inst>Person12345</inst>
      </Annotation>
      <Annotation>
        <s_offset>61</s_offset> <e_offset>72</e_offset>
        <class>Person</class> <inst>Person1267</inst>
      </Annotation>
    </metadata>

Score: 66%
102
Eg. 3: SWAN, a Semantic Web Annotator
  • Collaboration between DERI/NUIG, OntoText and USFD, hosted at DERI
  • GATE + KIM + SECO
  • Custom indexing of news or other web fractions
  • Quantitative media reporting
  • Annotated web workbench service
  • Custom knowledge services
  • Demo and poster at ESWS

103
SWAN Logical Architecture
[Architecture diagram: Web -> IE (64-bit) -> Annotation (Oracle) and Knowledge base (Sesame) -> Web UI / Web services, serving UI users and service users]
104
Cluster Controller
105
Semantic Reference Disambiguation
  • Possible approaches:
  • Vector-space models: compare context similarity; runs over a corpus
  • SemTag
  • Bagga's cross-document coreference work
  • Communities-of-practice approach from KM
  • Identity criteria from the ontology, based on properties, e.g., date_of_birth, name

106
Why disambiguation is hard: not all knowledge is explicit in text
  • "Paris fashion week underway as cancellations continue"
  • By Jo Johnson and Holly Finn - Oct 07 2001 18:48:17 (FT)
  • "Even as Paris fashion week opened at the weekend, the cancellations and reschedulings were still trickling in over the fax machines: Loewe, the leather specialists owned by the LVMH empire, is not showing; Cerruti, the Italian tailor, is downscaling to private viewings; Helmut Lang, master of the sharp suit, is cancelling his catwalk."
  • "The Oscar de la Renta show, for example, which had been planned for September 11th in New York, and which might easily enough have moved over to Paris instead, is not on the schedule. When the Dominican Republic-born designer consulted American Vogue's influential editor, Anna Wintour, she reportedly told him it would be unpatriotic to decamp."

107
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation: corpora & metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation
  • Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt

108
Natural Language Generation
  • NLG is a:
  • "subfield of AI and CL that is concerned with the construction of computer systems that can produce understandable texts in English or other human languages from some underlying linguistic representation of information" [Reiter & Dale 97]
  • NLG techniques are also applied for producing speech, e.g., in speech dialogue systems

109
Natural Language Generation
[Diagram: Ontology/KB/Database + Lexicons & Grammars -> NLG -> Text]
110
Requirements Analysis
  • Create a corpus of target texts and (if possible) their input representations
  • Analyse the information content:
  • Unchanging texts: thank you, hello, etc.
  • Directly available data: timetable of buses
  • Computable data: number of buses
  • Unavailable data: not in the system's KB/DB

111
NLG Tasks
  • Content determination
  • Discourse planning
  • Sentence aggregation
  • Lexicalisation
  • Referring expression generation
  • Linguistic realisation

112
Content determination
  • What information to include in the text: filtering and summarising input data into a formal knowledge representation
  • Application dependent
  • Example:
  • project: AKT
  • start_date: October-2000
  • end_date: October-2006
  • participants: A, E, OU, So, Sh

113
Discourse Planning
  • Determine ordering and structure over the knowledge to be generated
  • Theories of discourse: how texts are structured
  • Influences text readability
  • Result: a tree structure imposing ordering over the predicates and possibly providing discourse relations
114
Example
[Discourse tree: a SEQUENCE whose first element is (project AKT, duration 6 yrs) and whose second is a LIST of participants; ELABORATION relations link (project AKT, participant Shef) to (univ Shef, Web-page URL), alongside (project AKT, participant OU)]
115
Planning-Based Approaches
  • Use AI-style planners (e.g., [Moore & Paris 93])
  • Discourse relations (e.g., ELABORATION) are encoded as planning operators
  • Preconditions specify when the relation can apply
  • Planning starts from a top-level goal, e.g., define-project(X)
  • Computationally expensive and requires a lot of knowledge: a problem for real-world systems

116
Schema-Based Approaches
  • Capture typical text structuring patterns in templates (derived from a corpus), e.g., [McKeown 85]
  • Typically implemented as RTNs
  • Variety comes from the different knowledge available for each entity
  • Reusable ones available: Exemplars
  • Example (a sketch follows below):
  • Describe-Project-Schema -> Sequence(duration, ProjParticipants-Schema)
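A sketch of the schema idea in Python (schema, templates and facts are all illustrative, not from a real system):

    # Schema-based generation sketch: a schema is an ordered list of
    # predicate names; each predicate realises itself via a template.
    TEMPLATES = {
        "duration": "{project} is a {years}-year project.",
        "participants": "It has {n} participants, including {first}.",
    }

    def realise_schema(schema, facts):
        """Realise each predicate in the schema for which facts are available."""
        return " ".join(TEMPLATES[p].format(**facts[p]) for p in schema if p in facts)

    print(realise_schema(
        ["duration", "participants"],
        {"duration": {"project": "AKT", "years": 6},
         "participants": {"n": 5, "first": "Sheffield"}}))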

117
Sentence Aggregation
  • Determine which predicates should be grouped together in sentences
  • A less-understood process
  • Default: each predicate can be expressed as a sentence, so this is an optional step
  • SPOT: a trainable planner
  • Example:
  • AKT is a 6-year project with 5 participants:
  • Sheffield (URL)
  • OU

118
Lexicalisation
  • Choosing words and phrases to express the concepts and relations in predicates
  • Trivial solution: 1-to-1 mapping between concepts/relations and lexical entries
  • Variation is useful to avoid repetitiveness and also to convey pragmatic distinctions (e.g. formality)

119
Referring Expression Generation
  • Choose pronouns/phrases to refer to the entities in the text
  • Example: "he" vs "Mr Smith" vs "John Smith, the president of XXX Corp."
  • Depends on what was previously said:
  • "He" is only appropriate if the person has already been introduced in the text (a sketch follows below)
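The simplest possible policy, as a sketch (real modules also track salience and competing referents):

    def referring_expression(entity, already_mentioned, pronoun="he"):
        """Use the full name on first mention, a pronoun afterwards."""
        if entity in already_mentioned:
            return pronoun
        already_mentioned.add(entity)
        return entity

    mentioned = set()
    print(referring_expression("John Smith", mentioned))  # "John Smith"
    print(referring_expression("John Smith", mentioned))  # "he"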

120
Linguistic Realisation
  • Use grammar to generate text which is grammatical, i.e., syntactically and morphologically correct
  • Domain-independent
  • Reusable components are available, e.g., RealPro, FUF/SURGE
  • Example:
  • Morphology: participant -> participants
  • Syntactic agreement: AKT starts on ...

121
Example: a GATE-based generator
  • Input:
  • The MIAKT ontology
  • The RDF file for the given case
  • The MIAKT lexicon
  • Output:
  • GATE document with the generated text

122
Lexicalising Concepts and Instances
123
Example RDF Input
    <rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_patient'>
      <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Patient'/>
      <NS2:has_age>68</NS2:has_age>
      <NS2:involved_in_ta rdf:resource='c:\breast_cancer_ontology.daml#ta-soton-1069861276136'/>
    </rdf:Description>
    <rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_mammography'>
      <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Mammography'/>
      <NS2:carried_out_on rdf:resource='c:\breast_cancer_ontology.daml#01401_patient'/>
      <NS2:has_date>22 9 1995</NS2:has_date>
      <NS2:produce_result rdf:resource='c:\breast_cancer_ontology.daml#image_01401_right_cc'/>
    </rdf:Description>
    <rdf:Description rdf:about='c:\breast_cancer_ontology.daml#image_01401_right_cc'>
      <NS2:image_file>cancer/case0140/C_0140_1.RIGHT_CC.LJPEG</NS2:image_file>
      <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Right_CC_Image'/>
      <NS2:has_lateral rdf:resource='c:\breast_cancer_ontology.daml#lateral_right'/>
      <NS2:view_of_image rdf:resource='c:\breast_cancer_ontology.daml#craniocaudal_view'/>
      <NS2:contains_entity rdf:resource='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'/>
    </rdf:Description>
    <rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'>
124
CASE0140.RDF
  • "The 68 years old patient is involved in a triple assessment procedure. The triple assessment procedure contains a mammography exam. The mammography exam is carried out on the patient on 22 9 1995. The mammography exam produced a right CC image. The right CC image contains an abnormality and it has a right lateral side and a craniocaudal view. The abnormality has a mass, a microlobulated margin, a round shape, and a probably malignant assessment."

125
Further Reading on IE for SemWeb
  • Requirements for Information Extraction for Knowledge Management. http://nlp.shef.ac.uk/dot.kom/publications.html
  • Information Extraction as a Semantic Web Technology: Requirements and Promises. Adaptive Text Extraction and Mining workshop, 2003.
  • A. Kiryakov, B. Popov, et al. Semantic Annotation, Indexing, and Retrieval. 2nd International Semantic Web Conference (ISWC2003). http://www.ontotext.com/publications/index.html#KiryakovEtAl2003
  • S. Handschuh, S. Staab, R. Volz. On Deep Annotation. WWW'03. http://www.aifb.uni-karlsruhe.de/WBS/sha/papers/p273_handschuh.pdf
  • S. Dill, N. Eiron, et al. SemTag and Seeker: Bootstrapping the semantic web via automated semantic annotation. WWW'03. http://www.tomkinshome.com/papers/2Web/semtag.pdf
  • E. Motta, M. Vargas-Vera, et al. MnM: Ontology Driven Semi-Automatic and Automatic Support for Semantic Markup. Knowledge Engineering and Knowledge Management (Ontologies and the Semantic Web), EKAW'02. http://www.aktors.org/publications/selected-papers/06.pdf
  • K. Bontcheva, A. Kiryakov, H. Cunningham, B. Popov, M. Dimitrov. Semantic Web Enabled, Open Source Language Technology. Language Technology and the Semantic Web, Workshop on NLP and XML (NLPXML-2003). http://www.gate.ac.uk/sale/eacl03-semweb/bontcheva-etal-final.pdf
  • Handschuh, Staab, Ciravegna. S-CREAM - Semi-automatic CREAtion of Metadata (2002). http://citeseer.nj.nec.com/529793.html
126
Further Reading on traditional IE
  • [Day et al. 97] D. Day, J. Aberdeen, L. Hirschman, R. Kozierok, P. Robinson, and M. Vilain. Mixed-Initiative Development of Language Processing Systems. In Proceedings of the Fifth Conference on Applied Natural Language Processing (ANLP'97). 1997.
  • [Ciravegna 02] F. Ciravegna, A. Dingli, D. Petrelli, Y. Wilks. User-System Cooperation in Document Annotation based on Information Extraction. Knowledge Engineering and Knowledge Management (Ontologies and the Semantic Web), EKAW'02, 2002.
  • N. Kushmerick, B. Thomas. Adaptive information extraction: Core technologies for information agents (2002). http://citeseer.nj.nec.com/kushmerick02adaptive.html
  • H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). 2002.
  • D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
  • Califf and Mooney. Relational Learning of Pattern-Matching Rules for Information Extraction. http://citeseer.nj.nec.com/6804.html
  • Borthwick, A. A Maximum Entropy Approach to Named Entity Recognition. PhD Dissertation. 1999.
  • Bikel D., Schwartz R., Weischedel R. An algorithm that learns what's in a name. Machine Learning 34, pp. 211-231, 1999.
  • Riloff, E. "Automatically Generating Extraction Patterns from Untagged Text". Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96), 1996, pp. 1044-1049. http://www.cs.utah.edu/~riloff/psfiles/aaai96.pdf
  • Daelemans W. and Hoste V. Evaluation of Machine Learning Methods for Natural Language Processing Tasks. In LREC 2002, Third International Conference on Language Resources and Evaluation, pages 755-760.
127
Further Reading on traditional IE
  • Black W.J., Rinaldi F., Mowatt D. Facile: Description of the NE System Used For MUC-7. Proceedings of the 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
  • Collins M., Singer Y. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, 1999.
  • Collins M. Ranking Algorithms for Named-Entity Extraction: Boosting and the Voted Perceptron. Proceedings of the 40th Annual Meeting of the ACL, Philadelphia, pp. 489-496, July 2002.
  • Gotoh Y., Renals S. Information extraction from broadcast news. Philosophical Transactions of the Royal Society of London, series A: Mathematical, Physical and Engineering Sciences, 2000.
  • Grishman R. The NYU System for MUC-6 or Where's the Syntax? Proceedings of the MUC-6 workshop, Washington. November 1995.
  • Krupka G. R., Hausman K. IsoQuest Inc.: Description of the NetOwl(TM) Extractor System as Used for MUC-7. Proceedings of the 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
  • McDonald D. Internal and External Evidence in the Identification and Semantic Categorization of Proper Names. In B. Boguraev and J. Pustejovsky, editors: Corpus Processing for Lexical Acquisition. Pages 21-39. MIT Press. Cambridge, MA. 1996.
  • Mikheev A., Grover C. and Moens M. Description of the LTG System Used for MUC-7. Proceedings of the 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
  • Miller S., Crystal M., et al. BBN: Description of the SIFT System as Used for MUC-7. Proceedings of the 7th Message Understanding Conference, Fairfax, VA, 19 April - 1 May, 1998.
128
Further Reading on multilingual IE
  • Palmer D., Day D.S. A Statistical Profile of the Named Entity Task. Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington, D.C., March 31 - April 3, 1997.
  • Sekine S., Grishman R. and Shinou H. A decision tree method for finding and classifying names in Japanese texts. Proceedings of the Sixth Workshop on Very Large Corpora, Montreal, Canada, 1998.
  • Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N. Chinese Named Entity Identification Using Class-based Language Model. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), pp. 967-973, 2002.
  • Takeuchi K., Collier N. Use of Support Vector Machines in Extended Named Entity Recognition. The 6th Conference on Natural Language Learning. 2002.
  • D. Maynard, K. Bontcheva and H. Cunningham. Towards a semantic extraction of named entities. Recent Advances in Natural Language Processing, Bulgaria, 2003.
  • M. M. Wood, S. J. Lydon, V. Tablan, D. Maynard and H. Cunningham. Using parallel texts to improve recall in IE. Recent Advances in Natural Language Processing, Bulgaria, 2003.
  • D. Maynard, V. Tablan and H. Cunningham. NE recognition without training data on a language you don't speak. ACL Workshop on Multilingual and Mixed-language Named Entity Recognition: Combining Statistical and Symbolic Models, Sapporo, Japan, 2003.

129
Further Reading on multilingual IE
  • H. Saggion, H. Cunningham, K. Bontcheva, D. Maynard, O. Hamza, Y. Wilks. Multimedia Indexing through Multisource and Multilingual Information Extraction: the MUMIS project. Data and Knowledge Engineering, 2003.
  • D. Manov, A. Kiryakov, B. Popov, K. Bontcheva, D. Maynard, H. Cunningham. Experiments with geographic knowledge for information extraction. Workshop on Analysis of Geographic References, HLT/NAACL'03, Canada, 2003.
  • H. Cunningham, D. Maynard, K. Bontcheva, V. Tablan. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02). Philadelphia, July 2002.
  • H. Cunningham. GATE, a General Architecture for Text Engineering. Computers and the Humanities, volume 36, pp. 223-254, 2002.
  • D. Maynard, H. Cunningham, K. Bontcheva, M. Dimitrov. Adapting A Robust Multi-Genre NE System for Automatic Content Extraction. Proc. of the 10th International Conference on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2002), 2002.
  • K. Pastra, D. Maynard, H. Cunningham, O. Hamza, Y. Wilks. How feasible is the reuse of grammars for Named Entity Recognition? Language Resources and Evaluation Conference (LREC'2002), 2002.
130
THANK YOU! (for not snoring)
The slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt
This work has been supported by AKT (http://aktors.org/) and SEKT (http://sekt.semanticweb.org/)