Title: Human Language Technology for the Semantic Web
1Human Language Technology for the Semantic Web
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Diana Maynard, Valentin Tablan
ESWS, Crete, May 2004
This work has been supported by AKT (http://aktors.org/) and SEKT (http://sekt.semanticweb.org/)
2Are you wasting your time?
3Structure of the Tutorial
- Motivation, background
- Information Extraction - definition
- Evaluation: corpora & metrics
- IE approaches: some examples
- Rule-based approaches
- Learning-based approaches
- Semantic Tagging
- Using traditional IE
- Ontology-based IE
- Platforms for large-scale processing
- Language Generation
- Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt
4The Knowledge Economy and Human Language
- Gartner, December 2002:
- taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
- through 2012 more than 95% of human-to-computer information input will involve textual language
- A contradiction: formal knowledge in semantics-based systems vs. ambiguous informal natural language
- The challenge: to reconcile these two opposing tendencies
5HLT & Knowledge: Closing the Language Loop
KEY:
- MNLG: Multilingual Natural Language Generation
- OBIE: Ontology-Based Information Extraction
- AIE: Adaptive Mixed-Initiative IE
- CLIE: Controlled Language IE
[Diagram: Human Language and Formal Knowledge (ontologies and instance bases) connected in a loop by OBIE, (A)IE, CLIE (via Controlled Language) and (M)NLG; the formal side is labelled Semantic Web, Semantic Grid, Semantic Web Services]
6Background and Examples (1)
- Like other areas of computer science, HLT has typical data structures and infrastructure requirements
- Annotation: associating arbitrary data with areas of text or speech
- De facto standard: stand-off markup (e.g. TEI/XCES, NITE, ATLAS, GATE)
- Other issues: visualisation and editing; persistence and search; metrics; component model; baseline NLP tools ...
- To cut a long story short: HLT has a lot of T underneath it, which comes in many shapes and sizes
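The idea of stand-off annotation can be illustrated with a few lines of Python. This is only a sketch: the `Annotation` class, its field names and the `annotate` helper are assumptions made for this example, not the data model of GATE or of any of the standards listed above. The point is that the text is never modified; annotations only carry offsets into it.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """Hypothetical stand-off annotation: offsets point into the text."""
    start: int
    end: int
    type: str
    features: dict = field(default_factory=dict)


text = "Dr. Head is a staff scientist at We Build Rockets Inc."


def annotate(text, surface, ann_type, **features):
    """Annotate the first occurrence of `surface` without touching the text."""
    start = text.index(surface)
    return Annotation(start, start + len(surface), ann_type, features)


annotations = [
    annotate(text, "Dr. Head", "Person", title="Dr."),
    annotate(text, "We Build Rockets Inc.", "Organization"),
]

for ann in annotations:
    print(ann.type, ann.start, ann.end, repr(text[ann.start:ann.end]))
```

Because the annotations live apart from the text, several overlapping or competing layers can be stored over the same document.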
7Background and Examples (2)
- Infrastructure: (many) examples in this tutorial
- GATE, a General Architecture for Text Engineering: architecture, framework, IDE
- Why?
- I happen to know a little about it
- Free software, relatively comprehensive, widely used, has extensive Semantic Web support
- It means we can ignore the infrastructural issues
- Not a claim that it is the best or only in all cases!
9Information Extraction (1)
- Information Extraction (IE) pulls facts and structured information from the content of large text collections.
- Contrast: IE and Information Retrieval
- NLP history: from NLU to IE (if you can't score, why not move the goalposts?)
10Information Extraction (2)
- "When you can measure what you are speaking about, and express it in numbers, you know something about it; but when you cannot measure it, when you cannot express it in numbers, your knowledge is of a meager and unsatisfactory kind: it may be the beginning of knowledge, but you have scarcely in your thoughts advanced to the stage of science." (Kelvin)
- "Not everything that counts can be counted, and not everything that can be counted counts." (Einstein)
- IE progress driven by quantitative measures
- MUC: Message Understanding Conferences
- ACE: Automatic Content Extraction
11MUC-7 tasks
- Held in 1997, around 15 participants inc. 2 UK. Broke IE down into component tasks:
- NE: Named Entity recognition and typing
- CO: co-reference resolution
- TE: Template Elements
- TR: Template Relations
- ST: Scenario Templates
12An Example
- The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.
- NE: "rocket", "Tuesday", "Dr. Head", "We Build Rockets"
- CO: "it" = rocket; "Dr. Head" = "Dr. Big Head"
- TE: the rocket is "shiny red" and Head's "brainchild".
- TR: Dr. Head works for We Build Rockets Inc.
- ST: rocket launch event with various participants
13Performance levels
- Vary according to text type, domain, scenario, language
- NE: up to 97% (tested in English, Spanish, Japanese, Chinese, etc.)
- CO: 60-70% resolution
- TE: 80%
- TR: 75-80%
- ST: 60% (but human level may be only 80%)
14What are Named Entities?
- NE involves identification of proper names in texts, and classification into a set of predefined categories of interest:
- Person names
- Organizations (companies, government organisations, committees, etc.)
- Locations (cities, countries, rivers, etc.)
- Date and time expressions
15What are Named Entities (2)
- Other common types: measures (percent, money, weight etc.), email addresses, Web addresses, street addresses, etc.
- Some domain-specific entities: names of drugs, medical conditions, names of ships, bibliographic references etc.
- MUC-7 entity definition guidelines [Chinchor97]
- http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html
16What are NOT NEs (MUC-7)
- Artefacts: Wall Street Journal
- Common nouns referring to named entities: the company, the committee
- Names of groups of people and things named after people: the Tories, the Nobel prize
- Adjectives derived from names: Bulgarian, Chinese
- Numbers which are not times, dates, percentages, and money amounts
17Basic Problems in NE
- Variation of NEs, e.g. John Smith, Mr Smith, John.
- Ambiguity of NE types:
- John Smith (company vs. person)
- May (person vs. month)
- Washington (person vs. location)
- 1945 (date vs. time)
- Ambiguity with common words, e.g. "may"
18More complex problems in NE
- Issues of style, structure, domain, genre etc.
- Punctuation, spelling, spacing, formatting, ... all have an impact:
- Dept. of Computing and Maths
- Manchester Metropolitan University
- Manchester
- United Kingdom
- Tell me more about Leonardo
- Da Vinci
20Corpora and System Development
- Gold standard data created by manual annotation
- Corpora are typically divided into a training and a testing portion
- Rules and/or learning algorithms are developed or trained on the training part
- Tuned on the testing portion in order to optimise:
- Rule priorities, rule effectiveness, etc.
- Parameters of the learning algorithm and the features used (typical routine: 10-fold cross-validation)
- Evaluation set: the best system configuration is run on this data and the system performance is obtained
- No further tuning once evaluation set is used!
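The corpus discipline described above can be sketched as follows. This is a minimal illustration under stated assumptions: documents are represented by hypothetical string ids, the 20% evaluation fraction is an arbitrary choice, and `split_corpus` and `k_folds` are names invented for this example.

```python
import random


def split_corpus(docs, eval_fraction=0.2, seed=0):
    """Shuffle and hold out an evaluation set that is never used for tuning."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n_eval = int(len(docs) * eval_fraction)
    return docs[n_eval:], docs[:n_eval]  # (development, evaluation)


def k_folds(docs, k=10):
    """Yield (train, test) pairs for k-fold cross-validation."""
    for i in range(k):
        test = docs[i::k]
        train = [d for j, d in enumerate(docs) if j % k != i]
        yield train, test


corpus = [f"doc{i:03d}" for i in range(100)]  # hypothetical document ids
development, evaluation = split_corpus(corpus)

for train, test in k_folds(development, k=10):
    pass  # train or tune on `train`, measure on `test`

# Only once the configuration is frozen is it run on `evaluation`.
```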
21Some NE Annotated Corpora
- MUC-6 and MUC-7 corpora - English
- CONLL shared task corpora:
- http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German
- http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
- TIDES surprise language exercise (NEs in Cebuano and Hindi)
- ACE - English - http://www.ldc.upenn.edu/Projects/ACE/
22The MUC-7 corpus
- 100 documents in SGML
- News domain
- Named Entities
- 1880 Organizations (46%)
- 1324 Locations (32%)
- 887 Persons (22%)
- Inter-annotator agreement very high (97%)
- http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf
23The MUC-7 Corpus (2)
- <ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD; Working in chilly temperatures <TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>, <ENAMEX TYPE="ORGANIZATION">NASA</ENAMEX> ground crews readied the space shuttle Endeavour for launch on a Japanese satellite retrieval mission.
- <p>
- Endeavour, with an international crew of six, was set to blast off from the <ENAMEX TYPE="ORGANIZATION|LOCATION">Kennedy Space Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX> at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>, the start of a 49-minute launching period. The <TIMEX TYPE="DATE">nine day</TIMEX> shuttle flight was to be the 12th launched in darkness.
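Inline markup of this kind can be converted to stand-off spans with a short script. The sketch below is an illustration only: it handles just the flat, non-nested ENAMEX and TIMEX elements seen in the excerpt, and the function name `to_standoff` and the tuple layout are assumptions of this example, not part of the MUC tooling.

```python
import re

# Flat (non-nested) elements only, as in the excerpt above.
TAG = re.compile(r'<(ENAMEX|TIMEX) TYPE="([^"]+)">(.*?)</\1>')


def to_standoff(sgml):
    """Return (plain_text, [(start, end, tag, type), ...]) from inline markup."""
    pieces, spans = [], []
    pos = 0      # read position in the marked-up string
    length = 0   # length of the plain text built so far
    for m in TAG.finditer(sgml):
        before = sgml[pos:m.start()]
        pieces.append(before)
        length += len(before)
        content = m.group(3)
        spans.append((length, length + len(content), m.group(1), m.group(2)))
        pieces.append(content)
        length += len(content)
        pos = m.end()
    pieces.append(sgml[pos:])
    return "".join(pieces), spans


sample = ('<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, '
          '<ENAMEX TYPE="LOCATION">Fla.</ENAMEX> Working in chilly temperatures '
          '<TIMEX TYPE="DATE">Wednesday</TIMEX> night')
text, spans = to_standoff(sample)
for start, end, tag, ne_type in spans:
    print(tag, ne_type, start, end, repr(text[start:end]))
```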
24ACE Towards Semantic Tagging of Entities
- MUC NE tags segments of text whenever that text represents the name of an entity
- In ACE (Automatic Content Extraction), these names are viewed as mentions of the underlying entities. The main task is to detect (or infer) the mentions in the text of the entities themselves
- Rolls together the NE and CO tasks
- Domain- and genre-independent approaches
- ACE corpus contains newswire, broadcast news (ASR output and cleaned), and newspaper reports (OCR output and cleaned)
25ACE Entities
- Dealing with:
- Proper names, e.g., England, Mr. Smith, IBM
- Pronouns, e.g., he, she, it
- Nominal mentions: the company, the spokesman
- Identify which mentions in the text refer to which entities, e.g.,
- Tony Blair, Mr. Blair, he, the prime minister, he
- Gordon Brown, he, Mr. Brown, the chancellor
26ACE Example
<entity ID="ft-airlines-27-jul-2001-2"
        GENERIC="FALSE"
        entity_type = "ORGANIZATION">
  <entity_mention ID="M003"
                  TYPE = "NAME"
                  string = "National Air Traffic Services">
  </entity_mention>
  <entity_mention ID="M004"
                  TYPE = "NAME"
                  string = "NATS">
  </entity_mention>
  <entity_mention ID="M005"
                  TYPE = "PRO"
                  string = "its">
  </entity_mention>
  <entity_mention ID="M006"
                  TYPE = "NAME"
                  string = "Nats">
  </entity_mention>
27Annotation Tools (1) GATE
28Annotation Tools (2) Alembic
29Performance Evaluation
- Evaluation metric: mathematically defines how to measure the system's performance against a human-annotated gold standard
- Scoring program: implements the metric and provides performance measures
- For each document and over the entire corpus
- For each type of NE
30Evaluation Metrics
- Most common are Precision and Recall
- Precision = correct answers / answers produced
- Recall = correct answers / total possible correct answers
- Trade-off between precision and recall
- F-Measure = (β² + 1)PR / (β²P + R) [van Rijsbergen 75]
- β reflects the weighting between precision and recall, typically β = 1
- Some tasks sometimes use other metrics, e.g.
- false positives (not sensitive to doc richness)
- cost-based (good for application-specific adjustment)
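As a worked illustration, the three measures can be computed from raw counts. The sketch uses the standard van Rijsbergen form with β²P + R in the denominator; the function name and the example counts are invented for this illustration.

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """Precision, recall and F-measure from raw counts.

    correct  - answers the system got right
    produced - all answers the system produced
    possible - all answers in the gold standard
    """
    precision = correct / produced if produced else 0.0
    recall = correct / possible if possible else 0.0
    denom = beta ** 2 * precision + recall
    f = (beta ** 2 + 1) * precision * recall / denom if denom else 0.0
    return precision, recall, f


# Hypothetical counts: 10 answers produced, 8 correct, 16 in the gold standard.
p, r, f = precision_recall_f(correct=8, produced=10, possible=16)
print(f"P={p:.2f} R={r:.2f} F1={f:.3f}")
```

With β = 1 the F-measure is the harmonic mean of precision and recall, so it is pulled towards the lower of the two.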
31The Evaluation Metric (2)
- We may also want to take account of partially correct answers:
- Precision = (Correct + ½ Partially correct) / (Correct + Incorrect + Partial)
- Recall = (Correct + ½ Partially correct) / (Correct + Missing + Partial)
- Why: NE boundaries are often misplaced, so some partially correct results
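The half-credit scheme above translates directly into code. The function name and the example counts below are hypothetical.

```python
def partial_credit_scores(correct, partial, incorrect, missing):
    """Precision and recall where a partially correct answer counts as one half."""
    credit = correct + 0.5 * partial
    produced = correct + incorrect + partial   # everything the system output
    possible = correct + missing + partial     # everything in the gold standard
    precision = credit / produced if produced else 0.0
    recall = credit / possible if possible else 0.0
    return precision, recall


# Hypothetical counts: 6 exact matches, 2 boundary errors, 2 spurious, 4 missed.
precision, recall = partial_credit_scores(correct=6, partial=2, incorrect=2, missing=4)
print(f"P={precision:.3f} R={recall:.3f}")
```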
32The GATE Evaluation Tool
33Corpus-level Regression Testing
- Need to track system's performance over time
- When a change is made we want to know implications over whole corpus
- Why: because an improvement in one case can lead to problems in others
- GATE offers automated tool to help with the NE development task over time
34Regression Testing (2)
At corpus level: GATE's corpus benchmark tool, tracking system's performance over time
35SW IE Evaluation tasks
- Detection of entities and events, given a target ontology of the domain.
- Disambiguation of the entities and events from the documents with respect to instances in the given ontology. For example, measuring whether the IE correctly disambiguated "Cambridge" in the text to the correct instance: Cambridge, UK vs Cambridge, MA.
- Decision when a new instance needs to be added to the ontology, because the text contains a new instance that does not already exist in the ontology.
36Challenge: Evaluating Richer NE Tagging
- Need for new metrics when evaluating hierarchy/ontology-based NE tagging
- Need to take into account distance in the hierarchy
- Tagging a company as a charity is less wrong than tagging it as a person
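One possible way to make "less wrong" measurable is to score a predicted class by its path distance from the gold class in the hierarchy. The sketch below is an assumption for illustration only: the toy hierarchy, the function names and the 1/(1+distance) scoring are not a metric proposed in the tutorial.

```python
# Hypothetical toy hierarchy: child -> parent.
PARENT = {
    "Entity": None,
    "Organization": "Entity",
    "Person": "Entity",
    "Company": "Organization",
    "Charity": "Organization",
}


def ancestors(cls):
    """Return the class followed by its ancestors up to the root."""
    path = []
    while cls is not None:
        path.append(cls)
        cls = PARENT[cls]
    return path


def distance(a, b):
    """Number of edges between two classes via their lowest common ancestor."""
    path_a, path_b = ancestors(a), ancestors(b)
    common = next(c for c in path_a if c in path_b)
    return path_a.index(common) + path_b.index(common)


def similarity(gold, predicted):
    """1.0 for an exact match, decreasing with distance in the hierarchy."""
    return 1.0 / (1.0 + distance(gold, predicted))


print(similarity("Company", "Company"))  # exact match
print(similarity("Company", "Charity"))  # sibling class
print(similarity("Company", "Person"))   # different branch
```

Such a similarity could replace the 0/1 notion of "correct" when counting answers for precision and recall.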
38Two kinds of IE approaches
- Knowledge Engineering
- rule based
- developed by experienced language engineers
- make use of human intuition
- requires only small amount of training data
- development could be very time consuming
- some changes may be hard to accommodate
- Learning Systems
- use statistics or other machine learning
- developers do not need LE expertise
- requires large amounts of annotated training data
- some changes may require re-annotation of the entire training corpus
- annotators are cheap (but you get what you pay for!)
39A) NE Baseline: list lookup approach
- System that recognises only entities stored in its lists (gazetteers).
- Advantages: simple, fast, language independent, easy to retarget (just create lists)
- Disadvantages: impossible to enumerate all names, collection and maintenance of lists, cannot deal with name variants, cannot resolve ambiguity
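A list-lookup baseline fits in a dozen lines, which also makes its weaknesses easy to see. The gazetteer contents and the function name below are invented for this sketch; matching is plain case-sensitive substring search.

```python
# Hypothetical gazetteer lists, keyed by entity type.
GAZETTEER = {
    "Location": ["London", "Sherwood Forest", "Bulgaria"],
    "Organization": ["Goldman Sachs", "NASA"],
}


def lookup(text, gazetteer=GAZETTEER):
    """Return sorted (start, end, type) for every exact list entry found."""
    found = []
    for ne_type, names in gazetteer.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                found.append((start, start + len(name), ne_type))
                start = text.find(name, start + 1)
    return sorted(found)


sentence = "NASA staff flew from London to Bulgaria; Nasa declined to comment."
for start, end, ne_type in lookup(sentence):
    print(ne_type, repr(sentence[start:end]))
```

The variant "Nasa" is not found because it is not in the list, which is the name-variant problem noted above.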
40B) Shallow parsing approach using internal structure
- Internal evidence: names often have internal structure. These components can be either stored or guessed, e.g. location:
- Cap. Word + {City, Forest, Center, River}
- e.g. Sherwood Forest
- Cap. Word + {Street, Boulevard, Avenue, Crescent, Road}
- e.g. Portobello Street
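The two location patterns above can be approximated with a regular expression. This is a rough sketch: "Cap. Word" is simplified to one capitalised ASCII word, and the constant names are invented for the example.

```python
import re

NATURAL = r"(?:City|Forest|Center|River)"
ADDRESS = r"(?:Street|Boulevard|Avenue|Crescent|Road)"

# One capitalised word followed by a location keyword.
LOCATION = re.compile(rf"\b[A-Z][a-z]+ (?:{NATURAL}|{ADDRESS})\b")

sentence = "He walked from Portobello Street to Sherwood Forest."
print(LOCATION.findall(sentence))
```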
41Problems ...
- Ambiguously capitalised words (first word in sentence): [All American Bank] vs. All [State Police]
- Semantic ambiguity: "John F. Kennedy" = airport (location); "Philip Morris" = organisation
- Structural ambiguity: [Cable and Wireless] vs. [Microsoft] and [Dell]; [Center for Computational Linguistics] vs. message from [City Hospital] for [John Smith]
42C) Shallow parsing with context
- Use of context-based patterns is helpful in ambiguous cases
- "David Walton" and "Goldman Sachs" are indistinguishable
- But with the phrase "David Walton of Goldman Sachs" and the Person entity "David Walton" recognised, we can use the pattern "[Person] of [Organization]" to identify "Goldman Sachs" correctly.
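A minimal sketch of the "Person of Organization" pattern follows. It assumes the person names have already been recognised by an earlier step; the function name and the simplified notion of a capitalised word sequence are assumptions of this example.

```python
import re

# A run of capitalised words, e.g. "Goldman Sachs".
CAP_SEQ = r"[A-Z][a-z]+(?: [A-Z][a-z]+)*"


def person_of_org(text, known_persons):
    """Label the capitalised sequence after '<Person> of' as an Organization."""
    found = []
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of (" + CAP_SEQ + r")")
        for m in pattern.finditer(text):
            found.append((m.group(1), "Organization"))
    return found


sentence = "David Walton of Goldman Sachs said rates would rise."
print(person_of_org(sentence, ["David Walton"]))
```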
43Examples of context patterns
- PERSON earns MONEY
- PERSON joined ORGANIZATION
- PERSON left ORGANIZATION
- PERSON joined ORGANIZATION as JOBTITLE
- ORGANIZATION's JOBTITLE PERSON
- ORGANIZATION JOBTITLE PERSON
- the ORGANIZATION JOBTITLE
- part of the ORGANIZATION
- ORGANIZATION headquarters in LOCATION
- price of ORGANIZATION
- sale of ORGANIZATION
- investors in ORGANIZATION
- ORGANIZATION is worth MONEY
- JOBTITLE PERSON
- PERSON, JOBTITLE
44Example Rule-based System - ANNIE
- ANNIE: A Nearly-New IE system
- A version distributed as part of GATE
- GATE automatically deals with document formats, saving of results, evaluation, and visualisation of results for debugging
- GATE has a finite-state pattern-action rule language, used by ANNIE
- A reusable and easily extendable set of components
45NE Components
46Gazetteer lists for rule-based NE
- Needed to store the indicator strings for the internal structure and context rules
- Internal location indicators, e.g., {river, mountain, forest} for natural locations; {street, road, crescent, place, square, ...} for address locations
- Internal organisation indicators, e.g., company designators {GmbH, Ltd, Inc, ...}
- Produces Lookup results of the given kind
47The Named Entity Transducers
- Phases run sequentially and constitute a cascade of FSTs over the pre-processing results
- Hand-coded rules applied to annotations to identify NEs
- Annotations from format analysis, tokeniser, sentence splitter, POS tagger, and gazetteer modules
- Use contextual information
- Finds person names, locations, organisations, dates, addresses.
48NE Rule in JAPE
- JAPE: a Java Annotation Patterns Engine
- Light, robust regular-expression-based processing
- Cascaded finite state transduction
- Low-overhead development of new components
- Simplifies multi-phase regex processing
Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+   //from tokeniser
  {Lookup.kind == companyDesignator}         //from gazetteer lists
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }
49Named Entities in GATE
50Using co-reference to classify ambiguous NEs
- Orthographic co-reference module matches proper names in a document
- Improves NE results by assigning entity type to previously unclassified names, based on relations with classified NEs
- May not reclassify already classified entities
- Classification of unknown entities very useful for surnames which match a full name, or abbreviations, e.g. "Bonfield" will match "Sir Peter Bonfield"; "International Business Machines Ltd." will match "IBM"
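The two matching heuristics mentioned above (surname of a full name, acronym of a longer name) can be sketched as follows. This is not the GATE orthomatcher; the rule set, the designator list and the function names are simplifications invented for this illustration.

```python
# Hypothetical list of company designators to ignore when matching.
DESIGNATORS = {"Ltd.", "Ltd", "Inc.", "Inc", "GmbH"}


def matches(unknown, known):
    """True if `unknown` is the last word or the acronym of `known`."""
    words = [w for w in known.split() if w not in DESIGNATORS]
    if not words:
        return False
    acronym = "".join(w[0] for w in words)
    return unknown == words[-1] or unknown == acronym


def propagate_types(classified, unclassified):
    """Give each unclassified name the type of the first classified name it matches."""
    result = {}
    for name in unclassified:
        for known, ne_type in classified.items():
            if matches(name, known):
                result[name] = ne_type
                break
    return result


classified = {
    "Sir Peter Bonfield": "Person",
    "International Business Machines Ltd.": "Organization",
}
print(propagate_types(classified, ["Bonfield", "IBM", "Endeavour"]))
```

Names that are already classified are never passed in as `unclassified`, so they are not reclassified.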
51Named Entity Coreference
53Machine Learning Approaches
- Approaches
- Train ML models on manually annotated text
- Mixed initiative learning
- Used for producing training data
- Used for producing working systems
- ML Methods
- Symbolic learning: rules/decision trees induction
- Statistical models: HMMs, Bayesian methods, Maximum Entropy
54ML Terminology
- Instances (tokens, entities)
- Occurrences of a phenomenon
- Attributes (features)
- Characteristics of the instances
- Classes
- Sets of similar instances
55Methodology
- The task can be broken into several subtasks (that can use different methods):
- Boundary detection
- Entity classification into NE types
- Different models for different entity types
- Several models can be used in competition.
- Some algorithms perform better on little data while others are better when more training is available
56Methodology (2)
- Boundaries (and entity types) notations
- S(-XXX), E(-XXX)
- <S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/> heads for <S-LOC/>Baghdad<E-LOC/>.
- IOB notation (Inside, Outside, Beginning_of)
- U.N. I-ORG
- official O
- Ekeus I-PER
- heads O
- for O
- Baghdad I-LOC
- . O
- Translations between the two conventions are straightforward
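To show how straightforward the translation is, here is a sketch that converts the S/E boundary notation into token/tag pairs in the IOB style used above. It assumes whitespace tokenisation and that a B- tag is needed only when an entity directly follows another entity of the same type (the usual reading of this I-/O convention); the function name is invented.

```python
import re

# Start marker, end marker, or a plain token.
PIECE = re.compile(r"<S-(\w+)/>|<E-(\w+)/>|([^\s<]+)")


def to_iob(marked_up):
    """Convert S/E boundary markers to (token, tag) pairs."""
    pairs, current, fresh = [], None, False
    for start, end, token in PIECE.findall(marked_up):
        if start:
            current, fresh = start, True      # entering an entity
        elif end:
            current = None                    # leaving an entity
        elif current is None:
            pairs.append((token, "O"))
        else:
            prev = pairs[-1][1] if pairs else "O"
            # B- only when a new entity directly follows one of the same type.
            prefix = "B" if fresh and prev[2:] == current else "I"
            pairs.append((token, f"{prefix}-{current}"))
            fresh = False
    return pairs


sentence = ("<S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/> "
            "heads for <S-LOC/>Baghdad<E-LOC/> .")
for token, tag in to_iob(sentence):
    print(token, tag)
```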
57Features
- Linguistic features
- POS
- Morphology
- Syntax
- Lexicon data
- Semantic features
- Ontological class
- ETC
- Document structure
- Original markup
- Paragraph/sentence structure
- Surface features
- Token length
- Capitalisation
- Token type (word, punctuation, symbol)
- Feature selection: the most difficult part
- Some automatic scoring methods can be used
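As an illustration of the surface features listed above, the sketch below turns a token into a small attribute dictionary of the kind a learner could consume. The feature names, the token-type categories and the `<s>` start marker are choices made for this example only.

```python
def token_type(tok):
    """Coarse token type based on the characters in the token."""
    if tok.isalpha():
        return "word"
    if tok.isdigit():
        return "number"
    if all(not c.isalnum() for c in tok):
        return "punctuation"
    return "other"


def surface_features(tokens, i):
    """Surface features for token i, plus the previous token as context."""
    tok = tokens[i]
    return {
        "length": len(tok),
        "capitalised": tok[0].isupper(),
        "type": token_type(tok),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
    }


tokens = ["Dr.", "Head", "joined", "IBM", "in", "1997", "."]
for i in range(len(tokens)):
    print(tokens[i], surface_features(tokens, i))
```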
58Mixed Initiative Learning
- Human computer interaction
- Speeds up the creation of training data
- Can be used for corpus/system creation
- Example implementations
- Alembic [Day et al 97] and later
- Amilcare [Ciravegna 03]
59Mixed Initiative Learning (2)
[Diagram: user annotates, system learns; transitions labelled P > t1 and P > t2]
60Example 1: Alembic [Day et al 1997]
- Mixed initiative approach implemented in Alembic Workbench
- Bootstrapping procedure: use already tagged data to pre-annotate new documents
- Transforms the process from tagging to review
- Finally, the trained system can be used on its own
61Mixed-Initiative text annotation
User can also edit the induced rules and write new ones
Brill-style learning: generate-and-test
62Considerations
- Too high recall and the human might become over-reliant on the system annotations
- Too high precision might have similar effect
- "Theory-creep": the choices of the human annotator are increasingly influenced by the machine's and might deviate from the task definition; measure inter-annotator agreement
63Example 2: Amilcare & Melita
- Amilcare: rule-learning algorithm
- Tagging rules: learn to insert tags in the text, given training examples
- Correction rules: learn to move already inserted tags to their correct place in the text
- Learns begin and end tags independently
- Melita: supports adaptive IE
- Applied in SemWeb context (see below)
- Being extended as part of the EU-funded DOT.KOM project towards KM and SemWeb applications
64Comparison of Alembic & Melita
- The life cycle of user tagging, then machine learning, then making suggestions which the user corrects, is very similar in Melita and Alembic
- Alembic is more oriented towards NLP developers, while Melita more towards end-users
- Melita considers timeliness and intrusiveness as criteria, while Alembic does not (possibly due to performance bottleneck from old hardware)
- Both acknowledge but do not address problems with over-reliance on machine annotations
- From ML perspective the two are very similar: rule-learning
65Eg. 3: GATE Machine Learning support
- Uses classification.
- [Attr1, Attr2, Attr3, ... Attrn] → Class
- Classifies annotations.
- (Documents can be classified as well using a 1-to-1 relation with annotations.)
- Annotations of a particular type are selected as instances.
- Attributes refer to features of the instance annotations or their context.
- Generic implementation for attribute collection: can be linked to any ML engine.
- ML engines currently integrated: WEKA and Ontotext's HMM.
66Implementation
- Machine Learning PR in GATE.
- Has two functioning modes
- training
- application
- Uses an XML file for configuration:
<?xml version="1.0" encoding="windows-1252"?>
<ML-CONFIG>
  <DATASET> </DATASET>
  <ENGINE></ENGINE>
</ML-CONFIG>
67Attributes Collection
Instances type: Token
68Dataflow
[Diagram: Plain text documents; NLP Pipeline (Tokeniser, Gazetteer, POS Tagger, Lexicon Lookup, Semantic Tagger, etc.); Annotated documents; GATE ML Library (Feature Collection, Engine Interface, Results Converter); Machine Learning Engine]
70Towards Semantic Tagging of Entities
- The MUC NE task tags selected segments of text whenever that text represents the name of an entity.
- Semantic tagging: view as mentions of the underlying instances from the ontology
- Identify which mentions in the text refer to which instances in the ontology, e.g.,
- Tony Blair, Mr. Blair, he, the prime minister, he
- Gordon Brown, he, Mr. Brown, the chancellor
71Tasks
- Identify entity mentions in the text
- Reference disambiguation
- Add new instances if needed
- Disambiguate wrt instances in the ontology
- Identify instances of attributes and relations
- take into account what is allowed given the ontology, using domain & range as constraints
72Example
XYZ was established on 03 November 1978 in
London. It opened a plant in Bulgaria in
[Diagram: Ontology & KB. Classes: Location, City, Country, Company; relations: type, partOf, HQ, establOn; value: 03/11/1978]
73Classes, instances metadata
Gordon Brown met George Bush during his two day
visit.
<metadata>
  <DOC-ID>http:// 1.html</DOC-ID>
  <Annotation> <s_offset> 0 </s_offset> <e_offset> 12 </e_offset> <string>Gordon Brown</string> <class>Person</class> <inst>Person12345</inst> </Annotation>
  <Annotation> <s_offset> 18 </s_offset> <e_offset> 32 </e_offset> <string>George Bush</string> <class>Person</class> <inst>Person67890</inst> </Annotation>
</metadata>
[Diagram: classes/instances before and after; labelled instance: Bush]
74Classes, instances metadata (2)
Gordon Brown met Tony Blair to discuss the
university tuition fees.
<metadata>
  <DOC-ID>http:// 2.html</DOC-ID>
  <Annotation> <s_offset> 0 </s_offset> <e_offset> 12 </e_offset> <string>Gordon Brown</string> <class>Person</class> <inst>Person12345</inst> </Annotation>
  <Annotation> <s_offset> 18 </s_offset> <e_offset> 30 </e_offset> <string>Tony Blair</string> <class>Person</class> <inst>Person26389</inst> </Annotation>
</metadata>
[Diagram: classes/instances before and after; labelled instances: G. Brown, G. Bush]
75Why not put metadata in ontologies?
- Can be encoded in RDF/OWL, etc. but does it need to be put as instances in the ontology?
- Typically we do not need to reason with it
- Reasoning happens in the ontology when the new instances of classes and properties are added, but the metadata statements are different from them; they only refer to them
- A lot more metadata than instances
- Millions of metadata statements, thousands of instances, hundreds of concepts
- Different access required
- By offset (give me all metadata of the first paragraph)
- Efficient metadata-wide statistics based on strings: not an operation that people would do on other concepts
- Mixing with keyword-based search using IR-style indexing
76Metadata Creation with IE
- Semantic tagging creates metadata
- Stand-off or part of document
- Semi-automatic
- One view (given by the user, one ontology)
- More reliable
- Automatic metadata creation
- Many views: change with ontology, re-train IE engine for each ontology
- Always up to date, if ontology changes
- Less reliable
77Problems with traditional IE for metadata creation
- S-CREAM: Semi-automatic CREAtion of Metadata [Handschuh et al 02]
- Semantic tags from IE need to be mapped to instances of concepts, attributes or relations
- Most ML-based IE systems do not deal well with relations, mainly entities
- Amilcare does not handle anaphora resolution; GATE has such a component but it is not used here
- Implemented a discourse model with logical rules
- LASIE used a discourse model with a domain ontology: problem is robustness and domain portability
78Example
Handschuh et al02 S-CREAM, EKAW02
79S-CREAM Discourse Rules
- Rules to attach instances only when the ontology allows that (e.g., prices)
- Attach tag values to the nearest preceding compatible entity (e.g., prices and rooms)
- Create a complex object between two concept instances if they are adjacent (e.g., rate: number followed by currency)
- Experienced users can write new rules
80Challenges for IE for SemWeb
- Portability: different and changing ontologies
- Different text types: structured, free, etc.
- Utilise ontology information where available
- Train from small amount of annotated text
- Output results wrt the given ontology
- bridge the gap demonstrated in S-CREAM
- Learn/Model at the right level
- ontologies are hierarchical and data will get
sparser the lower we go
DOT.KOM: http://nlp.shef.ac.uk/dot.kom/
82Eg. 1: GATE metadata extraction
- Combines learning and rule-based methods
- Allows combination of IE and IR
- Enables use of large-scale linguistic resources for IE, such as WordNet
- Supports ontologies as part of IE applications: Ontology-Based IE (OBIE)
83Ontology Management in GATE
84Information Retrieval
Currently based on the Lucene IR engine; useful for combining semantic and keyword-based search
85WordNet support
86Populating Ontologies with IE
87Example 2: OBIE in h-TechSight
- h-TechSight project using Ontology-Based IE for semantic tagging of job adverts, news and reports in chemical engineering domain
- Aim is to track technological change over time through terminological analysis
- Fundamental to the application is a domain-specific ontology
- Terminological gazetteer lists are linked to classes in the ontology
- Rules classify the mentions in the text wrt the domain ontology
- Annotations output into a database or as an ontology
90Exported Database
92Platforms for Large-Scale Metadata Creation
- Allow use of corpus-wide statistics to improve metadata quality, e.g., disambiguation
- Automated alias discovery
- Generate SemWeb output (RDF, OWL)
- Stand-off storage and indexing of metadata
- Use large instance bases to disambiguate to
- Ontology servers for reasoning and access
- Architecture elements:
- Crawler, onto storage, doc indexing, query, annotators
- Apps: sem browsers, authoring tools, etc.
93Example 1: SemTag
- Lookup of all instances from the ontology (TAP): 65K instances
- Disambiguate the occurrences as:
- One of those in the taxonomy
- Not present in the taxonomy
- Not very high ambiguity of instances with the same label in TAP: concentrate on the second problem
- Use bag-of-words approach for disambiguation
- 3 people evaluated 200 labels in context: agreed on only 68.5% - metonymy
- Placing labels in the taxonomy is hard
Dill et al, SemTag and Seeker. WWW03
94Seeker
- High-performance distributed infrastructure
- 128 dual-processor machines with separate ½ terabyte of storage
- Each node runs approx. 200 documents per sec.
- Service-oriented architecture: Vinci (SOAP)
Dill et al, SemTag and Seeker. WWW03
95Example 2: OBIE in KIM
- The ontology (KIMO) and 86K/200K instances KB
- High ambiguity of instances with the same label: need for disambiguation step
- Lookup phase marks mentions from the ontology
- Combined with rule-based IE system to recognise new instances of concepts and relations
- Special KB enrichment stage where some of these new instances are added to the KB
- Disambiguation uses an Entity Ranking algorithm, i.e., priority ordering of entities with the same label based on corpus statistics (e.g., Paris)
Popov et al. KIM. ISWC03
96OBIE in KIM (2)
Popov et al. KIM. ISWC03
97Comparison between SemTag & KIM
- SemTag only aims for accuracy (precision) of classification of the annotated entities
- KIM also aims for coverage (recall): whether all possible mentions of entities were found
- Trade-off: sometimes finding some is enough
- SemTag does not attempt to discover and expand the KB with new instances (e.g., new company): the reason why KIM uses IE, not simple KB lookup
- i.e. OBIE is often needed for ontology population, not just metadata creation
98Two Annotation Scenarios (1)
- Getting the instances and the relations between them is enough; maybe not all mentions in the text are covered, but this is compensated by giving access to this info from the annotated text
99Example
Gordon Brown met president Bush during his two
day visit. Afterwards George Bush said
The system
Bush
Score: 100%
100Two Annotation Scenarios (2)
- Exhaustive annotation is required, so all occurrences of all instances and relations are needed
- Allows sentence and paragraph-level exploration, rather than document-level as in the previous scenario
- Harder to achieve
- Distinction between these scenarios needs to be made in the metadata annotation tools/KM tools using IE
101Example
Gordon Brown met president Bush during his two
day visit. Afterwards George Bush said
<metadata>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <class>Person</class>
    <inst>Person1267</inst>
  </Annotation>
  <Annotation>
    <s_offset> 61 </s_offset>
    <e_offset> 72 </e_offset>
    <class>Person</class>
    <inst>Person1267</inst>
  </Annotation>
</metadata>

<metadata>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 61 </s_offset>
    <e_offset> 72 </e_offset>
    <class>Person</class>
    <inst>Person1267</inst>
  </Annotation>
</metadata>
Score 66
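The score on this slide can be reproduced with a small sketch, under two assumptions that the slide does not state explicitly: that the first metadata block (three annotations) is the reference and the second (two annotations) is the system output, and that the score is recall with exact matching on offsets, class and instance. The function name is hypothetical.

```python
def precision_recall(reference, response):
    """Return (precision, recall) for exact-match annotations."""
    ref, res = set(reference), set(response)
    correct = len(ref & res)
    precision = correct / len(res) if res else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall


# (start offset, end offset, class, instance), copied from the slide.
reference = [
    (0, 12, "Person", "Person12345"),
    (18, 32, "Person", "Person1267"),
    (61, 72, "Person", "Person1267"),
]
# Assumed to be the system output: the mention at 18-32 is missing.
response = [
    (0, 12, "Person", "Person12345"),
    (61, 72, "Person", "Person1267"),
]

precision, recall = precision_recall(reference, response)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these assumptions both returned annotations are correct, but only two of the three reference mentions are found, so recall is 2/3, which is consistent with the score of 66 shown above.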
102Eg. 3: SWAN, a Semantic Web Annotator
- Collaboration between DERI/NUIG, OntoText and
USFD, hosted at DERI
- GATE KIM SECO
- Custom indexing of news or other web fractions
- Quantitative media reporting
- Annotated web workbench service
- Custom knowledge services
- Demo and poster at ESWS
103SWAN Logical Architecture
[Architecture diagram: Web, IE (64 bit), Annotation (Oracle), Knowledgebase (Sesame), Web UI and Web services, UI Users, Service Users]
104Cluster Controller
105Semantic Reference Disambiguation
- Possible approaches
- Vector-space models compare context similarity
runs over a corpus - SemTag
- Bagga's cross-document coreference work
- Communities of practice approach from KM
- Identity criteria from the ontology based on
properties, e.g., date_of_birth, name
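As a rough illustration of the vector-space approach, the sketch below compares the words around a mention with a bag-of-words profile per candidate entity and picks the most similar one. It is a minimal sketch, not SemTag's algorithm: the profiles, entity identifiers and function names are invented, and a real system would build the profiles from a corpus and use weighting such as TF-IDF.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Very crude tokenisation: lowercase and split on whitespace."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical context profiles for two entities sharing the label "Paris".
PROFILES = {
    "Paris_France": bag_of_words("france capital city fashion seine louvre"),
    "Paris_Texas": bag_of_words("texas county city lamar highway"),
}


def disambiguate_by_context(context, profiles=PROFILES):
    """Return the entity whose profile is most similar to the context."""
    ctx = bag_of_words(context)
    return max(profiles, key=lambda entity: cosine(ctx, profiles[entity]))
```

For example, `disambiguate_by_context("Paris fashion week opened at the weekend")` selects the French capital here, because only that profile shares a word with the context.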
106Why disambiguation is hard: not all knowledge
is explicit in text
- Paris fashion week underway as cancellations
continue
- By Jo Johnson and Holly Finn - Oct 07 2001
18:48:17 (FT)
- Even as Paris fashion week opened at the
weekend, the cancellations and reschedulings were
still trickling in over the fax machines: Loewe,
the leather specialists owned by LVMH empire, is
not showing, Cerruti, the Italian tailor, is
downscaling to private viewings, Helmut Lang,
master of the sharp suit, is cancelling his
catwalk.
- The Oscar de la Renta show, for example, which
had been planned for September 11th in New York,
and which might easily enough have moved over to
Paris instead, is not on the schedule. When the
Dominican Republic-born designer consulted
American Vogue's influential editor, Anna Wintour,
she reportedly told him it would be unpatriotic
to decamp.
108Natural Language Generation
- NLG is the subfield of AI and CL that is concerned
with the construction of computer systems that can
produce understandable texts in English or other
human languages from some underlying non-linguistic
representation of information [Reiter & Dale 97]
- NLG techniques are also applied for producing
speech, e.g., in speech dialogue systems
109Natural Language Generation
[Diagram: Ontology/KB/Database, Lexicons and Grammars as inputs; Text as output]
110Requirements Analysis
- Create a corpus of target texts and (if possible)
their input representations
- Analyse the information content
- Unchanging texts: thank you, hello, etc.
- Directly available data: timetable of buses
- Computable data: number of buses
- Unavailable data: not in the system's KB/DB
111NLG Tasks
- Content determination
- Discourse planning
- Sentence aggregation
- Lexicalisation
- Referring expression generation
- Linguistic realisation
112Content determination
- What information to include in the text;
filtering and summarising input data into a
formal knowledge representation
- Application dependent
- Example
- project: AKT
- start_date: October-2000
- end_date: October-2006
- participants: A,E,OU,So,Sh
113Discourse Planning
- Determine ordering and structure over the
knowledge to be generated
- Theories of discourse: how texts are structured
- Influences text readability
- Result: tree structure imposing ordering over the
predicates and possibly providing discourse
relations
114Example
SEQUENCE
LIST
ELABORATION
ELABORATION
project AKT, duration 6 yrs
project AKT, participant Shef
univ Shef, Web-page URL
project AKT, participant OU
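A discourse plan of this kind can be held in a small tree structure and linearised by a depth-first walk. The sketch below is illustrative only: the class names are hypothetical, and the nesting of the relations is a guess, since the flattened slide keeps the node labels and leaves but not the shape of the tree.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Fact:
    """A leaf of the plan: one predicate to be expressed."""
    subject: str
    predicate: str
    value: str


@dataclass
class Relation:
    """An inner node: a discourse relation over ordered children."""
    name: str  # e.g. SEQUENCE, LIST, ELABORATION
    children: List[Union["Relation", Fact]] = field(default_factory=list)


def linearise(node):
    """Return the facts in the order imposed by the tree (depth-first)."""
    if isinstance(node, Fact):
        return [node]
    facts = []
    for child in node.children:
        facts.extend(linearise(child))
    return facts


# Assumed nesting; only the labels and leaves come from the slide.
plan = Relation("SEQUENCE", [
    Fact("AKT", "duration", "6 yrs"),
    Relation("LIST", [
        Relation("ELABORATION", [
            Fact("AKT", "participant", "Shef"),
            Fact("Shef", "web-page", "URL"),
        ]),
        Fact("AKT", "participant", "OU"),
    ]),
])
```

Calling `linearise(plan)` yields the facts in presentation order, which later stages (aggregation, lexicalisation, realisation) can then turn into sentences.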
115Planning-Based Approaches
- Use AI-style planners (e.g., Moore & Paris 93)
- Discourse relations (e.g., ELABORATION) are
encoded as planning operators
- Preconditions specify when the relation can apply
- Planning starts from a top-level goal, e.g.,
define-project(X)
- Computationally expensive and requires a lot of
knowledge: a problem for real-world systems
116Schema-Based Approaches
- Capture typical text structuring patterns in
templates (derived from corpus), e.g., McKeown 85
- Typically implemented as RTNs (recursive
transition networks)
- Variety comes from different available knowledge
for each entity
- Reusable ones available: Exemplars
- Example
- Describe-Project-Schema -> Sequence(duration,
ProjParticipants-Schema)
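The example schema can be sketched as two plain functions that fill a fixed pattern from whatever the knowledge base holds for an entity. This is a simplification under stated assumptions: real schema-based systems use transition networks rather than functions, and the toy KB layout, tuple encoding and function names are hypothetical.

```python
# Hypothetical toy KB: entity -> properties.
KB = {
    "AKT": {"duration": "6 yrs", "participants": ["Shef", "OU"]},
    "Shef": {"web-page": "URL"},
}


def proj_participants_schema(project, kb):
    """ProjParticipants-Schema: list participants, elaborating when the
    KB knows something more about a participant."""
    items = []
    for participant in kb[project].get("participants", []):
        node = ("participant", project, participant)
        extra = [(prop, participant, value)
                 for prop, value in kb.get(participant, {}).items()]
        items.append(("ELABORATION", [node] + extra) if extra else node)
    return ("LIST", items)


def describe_project_schema(project, kb):
    """Describe-Project-Schema -> Sequence(duration, ProjParticipants-Schema)."""
    return ("SEQUENCE", [
        ("duration", project, kb[project]["duration"]),
        proj_participants_schema(project, kb),
    ])
```

The variety mentioned on the slide shows up here too: Shef gets an ELABORATION because the KB holds a web page for it, while OU does not.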
117Sentence Aggregation
- Determine which predicates should be grouped
together in sentences
- Less understood process
- Default: each predicate can be expressed as a
sentence, so this is an optional step
- SPOT: trainable planner
- Example
- AKT is a 6-year project with 5 participants
- Sheffield (URL)
- OU
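One very simple aggregation strategy is to merge predicates that share a subject into a single sentence. The sketch below shows only that heuristic; it is not how SPOT works, and the phrase strings and function names are made up for illustration.

```python
def aggregate(facts):
    """Group (subject, phrase) pairs by subject, keeping first-seen order."""
    groups = {}
    for subject, phrase in facts:
        groups.setdefault(subject, []).append(phrase)
    return groups


def join_phrases(phrases):
    """Join phrases as 'a', 'a and b' or 'a, b and c'."""
    if len(phrases) == 1:
        return phrases[0]
    return ", ".join(phrases[:-1]) + " and " + phrases[-1]


def to_sentences(facts):
    """Produce one sentence per subject instead of one per predicate."""
    return [f"{subject} {join_phrases(phrases)}."
            for subject, phrases in aggregate(facts).items()]


facts = [
    ("AKT", "is a 6-year project"),
    ("AKT", "has 5 participants"),
    ("Sheffield", "is a participant"),
]
```

Without aggregation these three predicates would give three sentences; with it, the two AKT predicates are expressed together.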
118Lexicalisation
- Choosing words and phrases to express the
concepts and relations in predicates
- Trivial solution: 1-1 mapping between
concepts/relations and lexical entries
- Variation is useful to avoid repetitiveness and
also convey pragmatic distinctions (e.g.
formality)
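The difference between the trivial 1-1 mapping and a lexicon that allows variation can be shown in a few lines. The lexicon entries, style labels and function name here are invented for illustration.

```python
# Hypothetical lexicon: relation -> wording per style.
LEXICON = {
    "start_date": {"neutral": "starts in", "formal": "commences in"},
    "participant": {"neutral": "takes part in", "formal": "participates in"},
}


def lexicalise(relation, style="neutral"):
    """Pick the wording for a relation, falling back to the neutral one."""
    entry = LEXICON[relation]
    return entry.get(style, entry["neutral"])
```

With a single style per relation this reduces to the 1-1 mapping; adding styles gives the pragmatic variation the slide mentions.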
119Referring Expression Generation
- Choose pronouns/phrases to refer to the entities
in the text
- Example: he vs Mr Smith vs John Smith, the
president of XXX Corp.
- Depends on what is previously said
- He is only appropriate if the person is already
introduced in the text
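The dependence on what was previously said can be sketched with a mention history. The rule used below (full description on first mention, pronoun if the entity was the last one mentioned, short name otherwise) is a deliberately naive assumption, and the entity records and function name are hypothetical.

```python
def refer(entity, history):
    """Choose a referring expression given the ids mentioned so far."""
    if entity["id"] not in history:
        expression = entity["full"]      # first mention: introduce fully
    elif history[-1] == entity["id"]:
        expression = entity["pronoun"]   # just mentioned: pronoun is safe
    else:
        expression = entity["short"]     # mentioned earlier: short name
    history.append(entity["id"])
    return expression


smith = {"id": "p1", "full": "John Smith, the president of XXX Corp.",
         "short": "Mr Smith", "pronoun": "he"}
jones = {"id": "p2", "full": "Mary Jones", "short": "Ms Jones",
         "pronoun": "she"}
```

Real generators also check that a pronoun is unambiguous (for example, gender and number clashes with other recent entities), which this sketch ignores.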
120Linguistic Realisation
- Use grammar to generate text which is
grammatical, i.e., syntactically and
morphologically correct
- Domain-independent
- Reusable components are available, e.g.,
RealPro, FUF/SURGE
- Example
- Morphology: participant -> participants
- Syntactic agreement: AKT starts on
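The two example phenomena can be mimicked with a few hand-written rules. This is only a toy sketch covering regular English nouns and verbs; it does not use RealPro or FUF/SURGE, and the function names are hypothetical.

```python
def pluralise(noun, count):
    """Regular English plural: participant -> participants."""
    if count == 1:
        return noun
    if noun.endswith(("s", "x", "ch", "sh")):
        return noun + "es"
    if noun.endswith("y") and noun[-2:-1] not in "aeiou":
        return noun[:-1] + "ies"
    return noun + "s"


def agree(verb, subject_is_plural):
    """Third-person present agreement for regular verbs."""
    return verb if subject_is_plural else verb + "s"


def realise(subject, verb, complement, plural=False):
    """Build a simple subject-verb-complement sentence."""
    return f"{subject} {agree(verb, plural)} {complement}."
```

For instance, `realise("AKT", "start", "in October 2000")` inflects the verb for a singular subject, while `pluralise("participant", 5)` handles the morphology example above.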
121Example: a GATE-based generator
- Input
- The MIAKT ontology
- The RDF file for the given case
- The MIAKT lexicon
- Output
- GATE document with the generated text
122Lexicalising Concepts and Instances
123Example RDF Input
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_patient'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Patient'/>
  <NS2:has_age>68</NS2:has_age>
  <NS2:involved_in_ta rdf:resource='c:\breast_cancer_ontology.daml#ta-soton-1069861276136'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_mammography'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Mammography'/>
  <NS2:carried_out_on rdf:resource='c:\breast_cancer_ontology.daml#01401_patient'/>
  <NS2:has_date>22 9 1995</NS2:has_date>
  <NS2:produce_result rdf:resource='c:\breast_cancer_ontology.daml#image_01401_right_cc'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#image_01401_right_cc'>
  <NS2:image_file>cancer/case0140/C_0140_1.RIGHT_CC.LJPEG</NS2:image_file>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Right_CC_Image'/>
  <NS2:has_lateral rdf:resource='c:\breast_cancer_ontology.daml#lateral_right'/>
  <NS2:view_of_image rdf:resource='c:\breast_cancer_ontology.daml#craniocaudal_view'/>
  <NS2:contains_entity rdf:resource='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'>
124CASE0140.RDF
- The 68 years old patient is involved in a
triple assessment procedure. The triple
assessment procedure contains a mammography exam.
The mammography exam is carried out on the
patient on 22 9 1995. The mammography exam
produced a right CC image. The right CC image
contains an abnormality and it has a right
lateral side and a craniocaudal view. The
abnormality has a mass, a microlobulated margin ,
a round shape, and a probably malignant
assessment.
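How triples like those in the RDF input might be turned into sentences of this kind can be sketched as follows. This is not the MIAKT generator: the triples are transcribed by hand from the RDF slide, the two lexicon tables are guessed from the wording of the generated text, and the article rule (indefinite on first mention, definite afterwards) is an assumption.

```python
# (subject, property, object) triples transcribed from the RDF slide.
TRIPLES = [
    ("01401_mammography", "carried_out_on", "01401_patient"),
    ("01401_mammography", "produce_result", "image_01401_right_cc"),
    ("image_01401_right_cc", "contains_entity", "01401_right_cc_abnor_1"),
]

# Hypothetical lexicon, guessed from the generated text.
NOUNS = {
    "01401_patient": "patient",
    "01401_mammography": "mammography exam",
    "image_01401_right_cc": "right CC image",
    "01401_right_cc_abnor_1": "abnormality",
}
VERBS = {
    "carried_out_on": "is carried out on",
    "produce_result": "produced",
    "contains_entity": "contains",
}


def noun_phrase(instance, known):
    """Indefinite article on first mention, definite afterwards."""
    noun = NOUNS[instance]
    if instance in known:
        return "the " + noun
    known.add(instance)
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


def generate(triples):
    """Emit one sentence per triple, in the given order."""
    known = set()
    sentences = []
    for subject, prop, obj in triples:
        subject_np = noun_phrase(subject, known)
        object_np = noun_phrase(obj, known)
        sentence = f"{subject_np} {VERBS[prop]} {object_np}."
        sentences.append(sentence[0].upper() + sentence[1:])
    return " ".join(sentences)
```

The output of `generate(TRIPLES)` is close to, but not the same as, the text on the slide, which also aggregates properties and includes datatype values such as the age and date.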
125Further Reading on IE for SemWeb
- Requirements for Information Extraction for
Knowledge Management.
http://nlp.shef.ac.uk/dot.kom/publications.html
- Information Extraction as a Semantic Web
Technology: Requirements and Promises. Adaptive
Text Extraction and Mining workshop, 2003.
- A. Kiryakov, B. Popov, et al. Semantic
Annotation, Indexing, and Retrieval. 2nd
International Semantic Web Conference (ISWC2003).
http://www.ontotext.com/publications/index.html#KiryakovEtAl2003
- S. Handschuh, S. Staab, R. Volz. On Deep
Annotation. WWW03.
http://www.aifb.uni-karlsruhe.de/WBS/sha/papers/p273_handschuh.pdf
- S. Dill, N. Eiron, et al. SemTag and Seeker:
Bootstrapping the semantic web via automated
semantic annotation. WWW03.
http://www.tomkinshome.com/papers/2Web/semtag.pdf
- E. Motta, M. Vargas-Vera, et al. MnM: Ontology
Driven Semi-Automatic and Automatic Support for
Semantic Markup. Knowledge Engineering and
Knowledge Management (Ontologies and the Semantic
Web), (EKAW02).
http://www.aktors.org/publications/selected-papers/06.pdf
- K. Bontcheva, A. Kiryakov, H. Cunningham, B.
Popov, M. Dimitrov. Semantic Web Enabled, Open
Source Language Technology. Language Technology
and the Semantic Web, Workshop on NLP and XML
(NLPXML-2003).
http://www.gate.ac.uk/sale/eacl03-semweb/bontcheva-etal-final.pdf
- Handschuh, Staab, Ciravegna. S-CREAM -
Semi-automatic CREAtion of Metadata (2002).
http://citeseer.nj.nec.com/529793.html
126Further Reading on traditional IE
- [Day et al. 97] D. Day, J. Aberdeen, L. Hirschman,
R. Kozierok, P. Robinson, and M. Vilain.
Mixed-Initiative Development of Language
Processing Systems. In Proceedings of the Fifth
Conference on Applied Natural Language Processing
(ANLP97). 1997.
- [Ciravegna 02] F. Ciravegna, A. Dingli, D.
Petrelli, Y. Wilks. User-System Cooperation in
Document Annotation based on Information
Extraction. Knowledge Engineering and Knowledge
Management (Ontologies and the Semantic Web),
(EKAW02), 2002.
- N. Kushmerick, B. Thomas. Adaptive information
extraction: Core technologies for information
agents (2002).
http://citeseer.nj.nec.com/kushmerick02adaptive.html
- H. Cunningham, D. Maynard, K. Bontcheva, V.
Tablan. GATE: A Framework and Graphical
Development Environment for Robust NLP Tools and
Applications. 40th Anniversary Meeting of the
Association for Computational Linguistics
(ACL'02). 2002.
- D. Maynard, K. Bontcheva and H. Cunningham.
Towards a semantic extraction of named entities.
Recent Advances in Natural Language Processing,
Bulgaria, 2003.
- Califf and Mooney. Relational Learning of Pattern
Matching Rules for Information Extraction.
http://citeseer.nj.nec.com/6804.html
- Borthwick A. A Maximum Entropy Approach to Named
Entity Recognition. PhD Dissertation. 1999.
- Bikel D., Schwartz R., Weischedel R. An
algorithm that learns what's in a name. Machine
Learning 34, pp. 211-231, 1999.
- Riloff, E. (1996) "Automatically Generating
Extraction Patterns from Untagged Text".
Proceedings of the Thirteenth National Conference
on Artificial Intelligence (AAAI-96), 1996, pp.
1044-1049.
http://www.cs.utah.edu/%7Eriloff/psfiles/aaai96.pdf
- Daelemans W. and Hoste V. Evaluation of Machine
Learning Methods for Natural Language Processing
Tasks. In LREC 2002: Third International
Conference on Language Resources and Evaluation,
pages 755-760.
127Further Reading on traditional IE
- Black W.J., Rinaldi F., Mowatt D. Facile:
Description of the NE System Used For MUC-7.
Proceedings of 7th Message Understanding
Conference, Fairfax, VA, 29 April - 1 May, 1998.
- Collins M., Singer Y. Unsupervised models for
named entity classification. In Proceedings of the
Joint SIGDAT Conference on Empirical Methods in
Natural Language Processing and Very Large
Corpora, 1999.
- Collins M. Ranking Algorithms for Named-Entity
Extraction: Boosting and the Voted Perceptron.
Proceedings of the 40th Annual Meeting of the
ACL, Philadelphia, pp. 489-496, July 2002.
- Gotoh Y., Renals S. Information extraction from
broadcast news. Philosophical Transactions of the
Royal Society of London, series A: Mathematical,
Physical and Engineering Sciences, 2000.
- Grishman R. The NYU System for MUC-6 or Where's
the Syntax? Proceedings of the MUC-6 workshop,
Washington. November 1995.
- Krupka G. R., Hausman K. IsoQuest Inc.:
Description of the NetOwl(TM) Extractor System as
Used for MUC-7. Proceedings of 7th Message
Understanding Conference, Fairfax, VA, 29 April -
1 May, 1998.
- McDonald D. Internal and External Evidence in the
Identification and Semantic Categorization of
Proper Names. In B. Boguraev and J. Pustejovsky,
editors, Corpus Processing for Lexical
Acquisition. Pages 21-39. MIT Press. Cambridge,
MA. 1996.
- Mikheev A., Grover C. and Moens M. Description of
the LTG System Used for MUC-7. Proceedings of 7th
Message Understanding Conference, Fairfax, VA, 29
April - 1 May, 1998.
- Miller S., Crystal M., et al. BBN: Description of
the SIFT System as Used for MUC-7. Proceedings of
7th Message Understanding Conference, Fairfax,
VA, 29 April - 1 May, 1998.
128Further Reading on multilingual IE
- Palmer D., Day D.S. A Statistical Profile of the
Named Entity Task. Proceedings of the Fifth
Conference on Applied Natural Language
Processing, Washington, D.C., March 31 - April 3,
1997.
- Sekine S., Grishman R. and Shinou H. A decision
tree method for finding and classifying names in
Japanese texts. Proceedings of the Sixth Workshop
on Very Large Corpora, Montreal, Canada, 1998.
- Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N.
Chinese Named Entity Identification Using
Class-based Language Model. In Proceedings of the
19th International Conference on Computational
Linguistics (COLING2002), pp. 967-973, 2002.
- Takeuchi K., Collier N. Use of Support Vector
Machines in Extended Named Entity Recognition.
The 6th Conference on Natural Language Learning.
2002.
- D. Maynard, K. Bontcheva and H. Cunningham.
Towards a semantic extraction of named entities.
Recent Advances in Natural Language Processing,
Bulgaria, 2003.
- M. M. Wood, S. J. Lydon, V. Tablan, D.
Maynard and H. Cunningham. Using parallel texts
to improve recall in IE. Recent Advances in
Natural Language Processing, Bulgaria, 2003.
- D. Maynard, V. Tablan and H. Cunningham. NE
recognition without training data on a language
you don't speak. ACL Workshop on Multilingual and
Mixed-language Named Entity Recognition:
Combining Statistical and Symbolic Models,
Sapporo, Japan, 2003.
129Further Reading on multilingual IE
- H. Saggion, H. Cunningham, K. Bontcheva, D.
Maynard, O. Hamza, Y. Wilks. Multimedia Indexing
through Multisource and Multilingual Information
Extraction: the MUMIS project. Data and Knowledge
Engineering, 2003.
- D. Manov, A. Kiryakov, B. Popov, K.
Bontcheva, D. Maynard, H. Cunningham.
Experiments with geographic knowledge for
information extraction. Workshop on Analysis of
Geographic References, HLT/NAACL'03, Canada,
2003.
- H. Cunningham, D. Maynard, K. Bontcheva, V.
Tablan. GATE: A Framework and Graphical
Development Environment for Robust NLP Tools and
Applications. Proceedings of the 40th Anniversary
Meeting of the Association for Computational
Linguistics (ACL'02). Philadelphia, July 2002.
- H. Cunningham. GATE, a General Architecture for
Text Engineering. Computers and the Humanities,
volume 36, pp. 223-254, 2002.
- D. Maynard, H. Cunningham, K. Bontcheva, M.
Dimitrov. Adapting A Robust Multi-Genre NE System
for Automatic Content Extraction. Proc. of the
10th International Conference on Artificial
Intelligence: Methodology, Systems, Applications
(AIMSA 2002), 2002.
- K. Pastra, D. Maynard, H. Cunningham, O. Hamza,
Y. Wilks. How feasible is the reuse of grammars
for Named Entity Recognition? Language Resources
and Evaluation Conference (LREC'2002), 2002.
130THANK YOU! (for not snoring)
The slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt
This work has been supported by AKT
(http://aktors.org/) and SEKT (http://sekt.semanticweb.org/)