Title: Human Language Technology for the Semantic Web


1
Human Language Technology for the Semantic Web
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Kalina Bontcheva, Diana Maynard, Valentin Tablan
ESWS, Crete, May 2004
This work has been supported by AKT (http://aktors.org/) and SEKT (http://sekt.semanticweb.org/)
2
Are you wasting your time?
3
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation: corpora and metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation
  • Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt

4
The Knowledge Economy and Human Language
  • Gartner, December 2002:
  • taxonomic and hierarchical knowledge mapping and
    indexing will be prevalent in almost all
    information-rich applications
  • through 2012 more than 95% of human-to-computer
    information input will involve textual language
  • A contradiction: formal knowledge in
    semantics-based systems vs. ambiguous informal
    natural language
  • The challenge: to reconcile these two opposing
    tendencies

5
HLT and Knowledge: Closing the Language Loop
KEY: MNLG = Multilingual Natural Language Generation; OBIE = Ontology-Based Information Extraction; AIE = Adaptive Mixed-Initiative IE; CLIE = Controlled Language IE
[Diagram: Human Language and Formal Knowledge (ontologies and instance bases) connected by OBIE, (A)IE, CLIE (via Controlled Language) and (M)NLG; further labels: Semantic Web, Semantic Grid, Semantic Web Services]
6
Background and Examples (1)
  • Like other areas of computer science, HLT has
    typical data structures and infrastructure
    requirements
  • Annotation: associating arbitrary data with areas
    of text or speech
  • De facto standard: stand-off markup (e.g.
    TEI/XCES, NITE, ATLAS, GATE)
  • Other issues: visualisation and editing;
    persistence and search; metrics; component model;
    baseline NLP tools; ...
  • To cut a long story short: HLT has a lot of T
    underneath it, which comes in many shapes and sizes
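
As a rough illustration of stand-off markup, the sketch below keeps the text untouched and stores annotations separately as character offsets plus arbitrary features. The class and method names (`Document`, `Annotation`, `annotate`, `covered_text`) are hypothetical and are not GATE's API; the sample sentence is taken from the rocket example later in the tutorial.

```python
from dataclasses import dataclass, field


@dataclass
class Annotation:
    start: int  # offset of the first character of the span
    end: int    # offset just past the last character of the span
    kind: str   # annotation type, e.g. "Person"
    features: dict = field(default_factory=dict)  # arbitrary attached data


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)

    def annotate(self, start, end, kind, **features):
        # The text itself is never modified; only the side table grows.
        self.annotations.append(Annotation(start, end, kind, features))

    def covered_text(self, ann):
        return self.text[ann.start:ann.end]


original = "Dr. Head is a staff scientist at We Build Rockets Inc."
doc = Document(original)
doc.annotate(0, 8, "Person", title="Dr.")
org_start = doc.text.index("We Build")
doc.annotate(org_start, len(doc.text), "Organization")

for ann in doc.annotations:
    print(ann.kind, repr(doc.covered_text(ann)), ann.features)
```

Because annotations only point into the text, several overlapping or competing annotation layers can coexist over the same document.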

7
Background and Examples (2)
  • Infrastructure: (many) examples in this
    tutorial
  • GATE, a General Architecture for Text
    Engineering: architecture, framework, IDE
  • Why?
  • I happen to know a little about it :-)
  • Free software, relatively comprehensive, widely
    used, has extensive Semantic Web support
  • It means we can ignore the infrastructural issues
  • Not a claim that it is the best or only in all
    cases!

9
Information Extraction (1)
  • Information Extraction (IE) pulls facts and
    structured information from the content of large
    text collections.
  • Contrast: IE and Information Retrieval
  • NLP history: from NLU to IE (if you can't score,
    why not move the goalposts?)

10
Information Extraction (2)
  • "When you can measure what you are speaking about,
    and express it in numbers, you know something
    about it; but when you cannot measure it, when
    you cannot express it in numbers, your knowledge
    is of a meager and unsatisfactory kind: it may be
    the beginning of knowledge, but you have scarcely
    in your thoughts advanced to the stage of
    science." (Kelvin)
  • "Not everything that counts can be counted, and
    not everything that can be counted counts."
    (Einstein)
  • IE progress driven by quantitative measures
  • MUC: Message Understanding Conferences
  • ACE: Automatic Content Extraction

11
MUC-7 tasks
  • Held in 1997, around 15 participants inc. 2 UK.
    Broke IE down into component tasks:
  • NE: Named Entity recognition and typing
  • CO: co-reference resolution
  • TE: Template Elements
  • TR: Template Relations
  • ST: Scenario Templates

12
An Example
  • The shiny red rocket was fired on Tuesday. It is
    the brainchild of Dr. Big Head. Dr. Head is a
    staff scientist at We Build Rockets Inc.
  • NE: "rocket", "Tuesday", "Dr. Head", "We Build
    Rockets"
  • CO: "it" = rocket; "Dr. Head" = "Dr. Big Head"
  • TE: the rocket is "shiny red" and Head's
    "brainchild".
  • TR: Dr. Head works for We Build Rockets Inc.
  • ST: rocket launch event with various participants

13
Performance levels
  • Vary according to text type, domain, scenario,
    language
  • NE: up to 97% (tested in English, Spanish,
    Japanese, Chinese, etc.)
  • CO: 60-70% resolution
  • TE: 80%
  • TR: 75-80%
  • ST: 60% (but human level may be only 80%)

14
What are Named Entities?
  • NE involves identification of proper names in
    texts, and classification into a set of
    predefined categories of interest
  • Person names
  • Organizations (companies, government
    organisations, committees, etc)
  • Locations (cities, countries, rivers, etc)
  • Date and time expressions

15
What are Named Entities (2)
  • Other common types: measures (percent, money,
    weight etc), email addresses, Web addresses,
    street addresses, etc.
  • Some domain-specific entities: names of drugs,
    medical conditions, names of ships, bibliographic
    references etc.
  • MUC-7 entity definition guidelines [Chinchor97]
  • http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/ne_task.html

16
What are NOT NEs (MUC-7)
  • Artefacts: Wall Street Journal
  • Common nouns referring to named entities: the
    company, the committee
  • Names of groups of people and things named after
    people: the Tories, the Nobel prize
  • Adjectives derived from names: Bulgarian,
    Chinese
  • Numbers which are not times, dates, percentages,
    or money amounts

17
Basic Problems in NE
  • Variation of NEs, e.g. John Smith, Mr Smith,
    John.
  • Ambiguity of NE types: John Smith (company vs.
    person)
  • May (person vs. month)
  • Washington (person vs. location)
  • 1945 (date vs. time)
  • Ambiguity with common words, e.g. "may"

18
More complex problems in NE
  • Issues of style, structure, domain, genre etc.
  • Punctuation, spelling, spacing, formatting, ...
    all have an impact
  • Dept. of Computing and Maths
  • Manchester Metropolitan University
  • Manchester
  • United Kingdom
  • Tell me more about Leonardo
  • Da Vinci

20
Corpora and System Development
  • Gold standard data created by manual annotation
  • Corpora are typically divided into a training and
    a testing portion
  • Rules and/or learning algorithms are developed or
    trained on the training part
  • Tuned on the testing portion in order to optimise:
  • Rule priorities, rule effectiveness, etc.
  • Parameters of the learning algorithm and the
    features used (typical routine: 10-fold cross
    validation)
  • Evaluation set: the best system configuration is
    run on this data and the system performance is
    obtained
  • No further tuning once the evaluation set is used!

21
Some NE Annotated Corpora
  • MUC-6 and MUC-7 corpora - English
  • CONLL shared task corpora:
    http://cnts.uia.ac.be/conll2003/ner/ - NEs in English and German;
    http://cnts.uia.ac.be/conll2002/ner/ - NEs in Spanish and Dutch
  • TIDES surprise language exercise (NEs in Cebuano
    and Hindi)
  • ACE English - http://www.ldc.upenn.edu/Projects/ACE/

22
The MUC-7 corpus
  • 100 documents in SGML
  • News domain
  • Named Entities
  • 1880 Organizations (46%)
  • 1324 Locations (32%)
  • 887 Persons (22%)
  • Inter-annotator agreement very high (97%)
  • http://www.itl.nist.gov/iaui/894.02/related_projects/muc/proceedings/muc_7_proceedings/marsh_slides.pdf

23
The MUC-7 Corpus (2)
  • <ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>,
    <ENAMEX TYPE="LOCATION">Fla.</ENAMEX> &MD;
    Working in chilly temperatures <TIMEX
    TYPE="DATE">Wednesday</TIMEX> <TIMEX
    TYPE="TIME">night</TIMEX>, <ENAMEX
    TYPE="ORGANIZATION">NASA</ENAMEX> ground crews
    readied the space shuttle Endeavour for launch on
    a Japanese satellite retrieval mission.
  • <p>
  • Endeavour, with an international crew of six, was
    set to blast off from the <ENAMEX
    TYPE="ORGANIZATION|LOCATION">Kennedy Space
    Center</ENAMEX> on <TIMEX TYPE="DATE">Thursday</TIMEX>
    at <TIMEX TYPE="TIME">4:18 a.m. EST</TIMEX>,
    the start of a 49-minute launching period. The
    <TIMEX TYPE="DATE">nine day</TIMEX> shuttle
    flight was to be the 12th launched in darkness.
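
Inline MUC-style tags such as ENAMEX and TIMEX can be converted to stand-off spans with a small script. The sketch below is only an illustration: it assumes non-nested tags carrying a single TYPE attribute, uses a regular expression rather than an SGML parser, and the function name `to_standoff` is hypothetical.

```python
import re

# Matches one non-nested element such as <ENAMEX TYPE="LOCATION">...</ENAMEX>
TAG = re.compile(r'<(ENAMEX|TIMEX)\s+TYPE="([^"]+)">(.*?)</\1>', re.S)


def to_standoff(marked_up):
    """Return (plain_text, spans); each span is (start, end, tag, type)."""
    pieces, spans = [], []
    pos = 0      # read position in the marked-up input
    offset = 0   # write position in the plain-text output
    for m in TAG.finditer(marked_up):
        before = marked_up[pos:m.start()]
        pieces.append(before)
        offset += len(before)
        inner = m.group(3)
        spans.append((offset, offset + len(inner), m.group(1), m.group(2)))
        pieces.append(inner)
        offset += len(inner)
        pos = m.end()
    pieces.append(marked_up[pos:])
    return "".join(pieces), spans


sample = ('<ENAMEX TYPE="LOCATION">CAPE CANAVERAL</ENAMEX>, '
          '<ENAMEX TYPE="LOCATION">Fla.</ENAMEX> Working in chilly temperatures '
          '<TIMEX TYPE="DATE">Wednesday</TIMEX> <TIMEX TYPE="TIME">night</TIMEX>')
text, spans = to_standoff(sample)
for start, end, tag, kind in spans:
    print(tag, kind, repr(text[start:end]))
```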

24
ACE: Towards Semantic Tagging of Entities
  • MUC NE tags segments of text whenever that text
    represents the name of an entity
  • In ACE (Automatic Content Extraction), these
    names are viewed as mentions of the underlying
    entities. The main task is to detect (or infer)
    the mentions in the text of the entities
    themselves
  • Rolls together the NE and CO tasks
  • Domain- and genre-independent approaches
  • ACE corpus contains newswire, broadcast news (ASR
    output and cleaned), and newspaper reports (OCR
    output and cleaned)

25
ACE Entities
  • Dealing with
  • Proper names, e.g., England, Mr. Smith, IBM
  • Pronouns, e.g., he, she, it
  • Nominal mentions: the company, the spokesman
  • Identify which mentions in the text refer to
    which entities, e.g.,
  • Tony Blair, Mr. Blair, he, the prime minister, he
  • Gordon Brown, he, Mr. Brown, the chancellor

26
ACE Example
<entity ID="ft-airlines-27-jul-2001-2"
        GENERIC="FALSE"
        entity_type="ORGANIZATION">
  <entity_mention ID="M003"
                  TYPE="NAME"
                  string="National Air Traffic Services">
  </entity_mention>
  <entity_mention ID="M004"
                  TYPE="NAME"
                  string="NATS">
  </entity_mention>
  <entity_mention ID="M005"
                  TYPE="PRO"
                  string="its">
  </entity_mention>
  <entity_mention ID="M006"
                  TYPE="NAME"
                  string="Nats">
  </entity_mention>

27
Annotation Tools (1): GATE
28
Annotation Tools (2): Alembic
29
Performance Evaluation
  • Evaluation metric: mathematically defines how to
    measure the system's performance against a
    human-annotated gold standard
  • Scoring program: implements the metric and
    provides performance measures
  • For each document and over the entire corpus
  • For each type of NE

30
Evaluation Metrics
  • Most common are Precision and Recall
  • Precision = correct answers / answers produced
  • Recall = correct answers / total possible correct
    answers
  • Trade-off between precision and recall
  • F-Measure $= \frac{(\beta^2 + 1) P R}{\beta^2 R + P}$ [van Rijsbergen 75]
  • $\beta$ reflects the weighting between precision and
    recall, typically $\beta = 1$
  • Some tasks sometimes use other metrics, e.g.
  • false positives (not sensitive to doc richness)
  • cost-based (good for application-specific
    adjustment)
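
A minimal sketch of strict scoring, assuming each answer is a (start, end, type) tuple and counts as correct only on an exact match with the gold standard. The F-measure follows the form given on this slide, (β² + 1)PR / (β²R + P), which for β = 1 is the harmonic mean of precision and recall. The function name and sample data are illustrative only.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Strict scoring: an answer is correct only if it is identical to a gold answer."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    denominator = beta ** 2 * recall + precision
    f = (beta ** 2 + 1) * precision * recall / denominator if denominator else 0.0
    return precision, recall, f


gold = {(0, 8, "Person"), (33, 54, "Organization"), (60, 67, "Date")}
predicted = {(0, 8, "Person"), (33, 54, "Location")}  # one right, one wrong type
p, r, f = precision_recall_f(gold, predicted)
print(f"P={p:.2f} R={r:.2f} F={f:.2f}")
```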

31
The Evaluation Metric (2)
  • We may also want to take account of partially
    correct answers
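  • Precision $= \frac{\text{Correct} + \frac{1}{2}\,\text{Partially correct}}{\text{Correct} + \text{Incorrect} + \text{Partial}}$
  • Recall $= \frac{\text{Correct} + \frac{1}{2}\,\text{Partially correct}}{\text{Correct} + \text{Missing} + \text{Partial}}$
  • Why: NE boundaries are often misplaced, so some
    results are partially correct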
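
The half-credit scheme can be sketched as follows. The `classify` helper encodes one possible definition of "partially correct" (same type, overlapping but not identical span); that definition is an assumption for illustration, and scoring tools may define it differently.

```python
def classify(gold_span, predicted_span):
    """Compare two (start, end, type) tuples: 'correct', 'partial' or 'incorrect'."""
    (g_start, g_end, g_type), (p_start, p_end, p_type) = gold_span, predicted_span
    overlaps = p_start < g_end and g_start < p_end
    if g_type != p_type or not overlaps:
        return "incorrect"
    return "correct" if (g_start, g_end) == (p_start, p_end) else "partial"


def lenient_scores(correct, partial, incorrect, missing):
    """Precision and recall with half credit for partially correct answers."""
    matched = correct + 0.5 * partial
    produced = correct + incorrect + partial
    possible = correct + missing + partial
    precision = matched / produced if produced else 0.0
    recall = matched / possible if possible else 0.0
    return precision, recall


print(classify((0, 8, "Person"), (4, 8, "Person")))  # boundary misplaced
p, r = lenient_scores(correct=6, partial=2, incorrect=2, missing=4)
print(f"P={p:.3f} R={r:.3f}")
```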

32
The GATE Evaluation Tool
33
Corpus-level Regression Testing
  • Need to track the system's performance over time
  • When a change is made we want to know the
    implications over the whole corpus
  • Why: because an improvement in one case can lead
    to problems in others
  • GATE offers an automated tool to help with the NE
    development task over time

34
Regression Testing (2)
At corpus level: GATE's corpus benchmark tool,
tracking the system's performance over time
35
SW IE Evaluation tasks
  • Detection of entities and events, given a target
    ontology of the domain.
  • Disambiguation of the entities and events from
    the documents with respect to instances in the
    given ontology. For example, measuring whether
    the IE correctly disambiguated Cambridge in the
    text to the correct instance Cambridge, UK vs
    Cambridge, MA.
  • Decision when a new instance needs to be added to
    the ontology, because the text contains a new
    instance, that does not already exist in the
    ontology.

36
Challenge: Evaluating Richer NE Tagging
  • Need for new metrics when evaluating
    hierarchy/ontology-based NE tagging
  • Need to take into account distance in the
    hierarchy
  • Tagging a company as a charity is less wrong than
    tagging it as a person
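
One simple way to make "distance in the hierarchy" concrete is to count the edges between two classes via their nearest common ancestor. The toy hierarchy and the function names below are hypothetical; the slide does not prescribe a particular metric, so this is only a sketch of the idea.

```python
# Hypothetical toy hierarchy: child class -> parent class
PARENT = {
    "Entity": None,
    "Agent": "Entity",
    "Location": "Entity",
    "Person": "Agent",
    "Organization": "Agent",
    "Company": "Organization",
    "Charity": "Organization",
}


def ancestors(cls):
    """Return the path from cls up to the root, starting with cls itself."""
    path = []
    while cls is not None:
        path.append(cls)
        cls = PARENT[cls]
    return path


def distance(a, b):
    """Number of edges between a and b via their nearest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    common = next(c for c in up_a if c in up_b)
    return up_a.index(common) + up_b.index(common)


print(distance("Company", "Charity"))  # siblings under Organization
print(distance("Company", "Person"))   # only related through Agent
```

A score could then discount an error by its distance, so that tagging a company as a charity is penalised less than tagging it as a person.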

38
Two kinds of IE approaches
  • Knowledge Engineering
  • rule based
  • developed by experienced language engineers
  • make use of human intuition
  • requires only small amount of training data
  • development could be very time consuming
  • some changes may be hard to accommodate
  • Learning Systems
  • use statistics or other machine learning
  • developers do not need LE expertise
  • requires large amounts of annotated training data
  • some changes may require re-annotation of the
    entire training corpus
  • annotators are cheap (but you get what you pay
    for!)

39
A) NE Baseline: list lookup approach
  • System that recognises only entities stored in
    its lists (gazetteers).
  • Advantages: simple, fast, language independent,
    easy to retarget (just create lists)
  • Disadvantages: impossible to enumerate all
    names, collection and maintenance of lists,
    cannot deal with name variants, cannot resolve
    ambiguity
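
A list-lookup baseline fits in a few lines. The gazetteer contents and the `lookup` function below are made up for illustration; the second call shows the name-variant weakness listed above.

```python
# Hypothetical gazetteer: entity type -> list of known names
GAZETTEER = {
    "Location": ["London", "New York", "Sherwood Forest"],
    "Organization": ["NASA", "IBM"],
}


def lookup(text, gazetteer):
    """Return sorted (start, end, type, name) for every exact occurrence of a listed name."""
    hits = []
    for kind, names in gazetteer.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                hits.append((start, start + len(name), kind, name))
                start = text.find(name, start + 1)
    return sorted(hits)


hits = lookup("NASA opened an office in London, not New York.", GAZETTEER)
print(hits)

# A spelling variant that is not in the lists is simply missed.
print(lookup("Mr Smith joined Nasa.", GAZETTEER))
```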

40
B) Shallow parsing approach using internal
structure
  • Internal evidence: names often have internal
    structure. These components can be either stored
    or guessed, e.g. location:
  • Cap. Word + City, Forest, Center, River
  • e.g. Sherwood Forest
  • Cap. Word + Street, Boulevard, Avenue, Crescent,
    Road
  • e.g. Portobello Street
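
The "capitalised word plus key word" idea can be approximated with a regular expression. This is a sketch only: the key-word list is copied from the slide, while the pattern and the function name `find_locations` are assumptions, and a real system would work on tokens rather than raw characters.

```python
import re

LOCATION_KEYS = ["City", "Forest", "Center", "River",
                 "Street", "Boulevard", "Avenue", "Crescent", "Road"]

# One or more capitalised words followed by one of the key words.
PATTERN = re.compile(
    r"\b((?:[A-Z][a-z]+ )+(?:" + "|".join(LOCATION_KEYS) + r"))\b"
)


def find_locations(text):
    return [m.group(1) for m in PATTERN.finditer(text)]


found = find_locations("They walked from Sherwood Forest to Portobello Street.")
print(found)
```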

41
Problems ...
  • Ambiguously capitalised words (first word in
    sentence): [All American Bank] vs. All [State
    Police]
  • Semantic ambiguity: "John F. Kennedy" = airport
    (location); "Philip Morris" = organisation
  • Structural ambiguity: [Cable and Wireless] vs.
    [Microsoft] and [Dell]; [Center for Computational
    Linguistics] vs. message from [City Hospital] for
    [John Smith]

42
C) Shallow parsing with context
  • Use of context-based patterns is helpful in
    ambiguous cases
  • "David Walton" and "Goldman Sachs" are
    indistinguishable
  • But with the phrase "David Walton of Goldman
    Sachs" and the Person entity "David Walton"
    recognised, we can use the pattern "Person of
    Organization" to identify "Goldman Sachs"
    correctly.

43
Examples of context patterns
  • PERSON earns MONEY
  • PERSON joined ORGANIZATION
  • PERSON left ORGANIZATION
  • PERSON joined ORGANIZATION as JOBTITLE
  • ORGANIZATION's JOBTITLE PERSON
  • ORGANIZATION JOBTITLE PERSON
  • the ORGANIZATION JOBTITLE
  • part of the ORGANIZATION
  • ORGANIZATION headquarters in LOCATION
  • price of ORGANIZATION
  • sale of ORGANIZATION
  • investors in ORGANIZATION
  • ORGANIZATION is worth MONEY
  • JOBTITLE PERSON
  • PERSON, JOBTITLE
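
As a sketch of how one such pattern might be applied, the code below implements only "PERSON of ORGANIZATION": given names already recognised as persons, it types the capitalised sequence that follows "of" as an organisation. The regular expression for capitalised sequences and the function name are assumptions for illustration.

```python
import re

# A run of capitalised words, e.g. "Goldman Sachs"
CAP_SEQ = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"


def apply_person_of_org(text, known_persons):
    """Return (start, end, type, string) for organisations found via 'PERSON of X'."""
    found = []
    for person in known_persons:
        pattern = re.compile(re.escape(person) + r" of (" + CAP_SEQ + r")")
        for m in pattern.finditer(text):
            found.append((m.start(1), m.end(1), "Organization", m.group(1)))
    return found


sentence = "David Walton of Goldman Sachs said rates would rise."
orgs = apply_person_of_org(sentence, ["David Walton"])
print(orgs)
```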

44
Example Rule-based System - ANNIE
  • ANNIE: A Nearly-New IE system
  • A version distributed as part of GATE
  • GATE automatically deals with document formats,
    saving of results, evaluation, and visualisation
    of results for debugging
  • GATE has a finite-state pattern-action rule
    language, used by ANNIE
  • A reusable and easily extendable set of components

45
NE Components
46
Gazetteer lists for rule-based NE
  • Needed to store the indicator strings for the
    internal structure and context rules
  • Internal location indicators, e.g., river,
    mountain, forest for natural locations; street,
    road, crescent, place, square for address
    locations
  • Internal organisation indicators, e.g., company
    designators: GmbH, Ltd, Inc, ...
  • Produces Lookup results of the given kind

47
The Named Entity Transducers
  • Phases run sequentially and constitute a cascade
    of FSTs over the pre-processing results
  • Hand-coded rules applied to annotations to
    identify NEs
  • Annotations from format analysis, tokeniser,
    sentence splitter, POS tagger, and gazetteer
    modules
  • Use contextual information
  • Finds person names, locations, organisations,
    dates, addresses.

48
NE Rule in JAPE
  • JAPE: a Java Annotation Patterns Engine
  • Light, robust regular-expression-based
    processing
  • Cascaded finite state transduction
  • Low-overhead development of new components
  • Simplifies multi-phase regex processing

Rule: Company1
Priority: 25
(
  ( {Token.orthography == upperInitial} )+  //from tokeniser
  {Lookup.kind == companyDesignator}        //from gazetteer lists
):match
-->
:match.NamedEntity = { kind=company, rule="Company1" }

49
Named Entities in GATE
50
Using co-reference to classify ambiguous NEs
  • Orthographic co-reference module matches proper
    names in a document
  • Improves NE results by assigning entity type to
    previously unclassified names, based on
    relations with classified NEs
  • May not reclassify already classified entities
  • Classification of unknown entities is very useful
    for surnames which match a full name, or for
    abbreviations, e.g. [Bonfield] will match [Sir
    Peter Bonfield]; [International Business
    Machines Ltd.] will match [IBM]
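
A minimal sketch of two orthographic matching rules, surname match and acronym match, using the examples from this slide. The rule set, the designator list and the function names are assumptions for illustration and are far simpler than a real orthomatcher; only unclassified names are passed in, so already classified entities are never reclassified.

```python
DESIGNATORS = {"Ltd", "Inc", "GmbH"}  # ignored when building acronyms


def content_words(name):
    words = [w.strip(".,") for w in name.split()]
    return [w for w in words if w and w not in DESIGNATORS]


def corefers(unknown, known):
    """True if `unknown` is the last word or the acronym of `known`."""
    words = content_words(known)
    if not words:
        return False
    if unknown == words[-1]:
        return True
    return unknown == "".join(w[0] for w in words)


def classify_unknown(unknown_names, classified):
    """Give each unclassified name the type of the first classified name it matches."""
    result = {}
    for name in unknown_names:
        for known, kind in classified.items():
            if corefers(name, known):
                result[name] = kind
                break
    return result


classified = {
    "Sir Peter Bonfield": "Person",
    "International Business Machines Ltd.": "Organization",
}
resolved = classify_unknown(["Bonfield", "IBM", "Endeavour"], classified)
print(resolved)
```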

51
Named Entity Coreference
53
Machine Learning Approaches
  • Approaches
  • Train ML models on manually annotated text
  • Mixed initiative learning
  • Used for producing training data
  • Used for producing working systems
  • ML Methods
  • Symbolic learning: rules/decision trees induction
  • Statistical models: HMMs, Bayesian methods,
    Maximum Entropy

54
ML Terminology
  • Instances (tokens, entities)
  • Occurrences of a phenomenon
  • Attributes (features)
  • Characteristics of the instances
  • Classes
  • Sets of similar instances

55
Methodology
  • The task can be broken into several subtasks
    (that can use different methods)
  • Boundary detection
  • Entity classification into NE types
  • Different models for different entity types
  • Several models can be used in competition.
  • Some algorithms perform better on little data
    while others are better when more training is
    available

56
Methodology (2)
  • Boundaries (and entity types) notations
  • S(-XXX), E(-XXX)
  • <S-ORG/>U.N.<E-ORG/> official <S-PER/>Ekeus<E-PER/>
    heads for <S-LOC/>Baghdad<E-LOC/>.
  • IOB notation (Inside, Outside, Beginning_of)
  • U.N. I-ORG
  • official O
  • Ekeus I-PER
  • heads O
  • for O
  • Baghdad I-LOC
  • . O
  • Translations between the two conventions are
    straightforward
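
The translation from IOB tags to start/end boundary markers can be sketched as below. The input uses the I-/O tags from the slide (a B- tag, if present, forces a new entity); the function name is hypothetical, and tokens are re-joined with single spaces, so the final full stop ends up separated from the last entity.

```python
def iob_to_boundaries(tagged):
    """tagged: list of (token, tag) pairs with tags such as 'I-ORG', 'B-ORG' or 'O'."""
    out = []
    current = None  # entity type of the previous token, or None
    for token, tag in tagged:
        kind = tag[2:] if tag != "O" else None
        starts_new = tag.startswith("B-") or kind != current
        if current and starts_new:
            out[-1] += f"<E-{current}/>"      # close the entity that just ended
        if kind and starts_new:
            token = f"<S-{kind}/>" + token    # open a new entity
        out.append(token)
        current = kind
    if current:
        out[-1] += f"<E-{current}/>"
    return " ".join(out)


tagged = [("U.N.", "I-ORG"), ("official", "O"), ("Ekeus", "I-PER"),
          ("heads", "O"), ("for", "O"), ("Baghdad", "I-LOC"), (".", "O")]
marked = iob_to_boundaries(tagged)
print(marked)
```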

57
Features
  • Linguistic features
  • POS
  • Morphology
  • Syntax
  • Lexicon data
  • Semantic features
  • Ontological class
  • ETC
  • Document structure
  • Original markup
  • Paragraph/sentence structure
  • Surface features
  • Token length
  • Capitalisation
  • Token type (word, punctuation, symbol)
  • Feature selection: the most difficult part
  • Some automatic scoring methods can be used
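
A few of the surface features listed above can be computed directly from tokens. The feature names, the capitalisation categories and the one-token context window in this sketch are assumptions for illustration; linguistic and semantic features would come from upstream components such as a POS tagger or gazetteer.

```python
def capitalisation(token):
    if token.isupper():
        return "allCaps"
    if token[:1].isupper():
        return "upperInitial"
    if token.islower():
        return "lowercase"
    return "other"


def token_type(token):
    # Simplification: anything containing a letter or digit counts as a word.
    return "word" if any(c.isalnum() for c in token) else "punctuation"


def token_features(tokens, i):
    """Surface features of tokens[i] plus its immediate neighbours."""
    token = tokens[i]
    return {
        "length": len(token),
        "capitalisation": capitalisation(token),
        "type": token_type(token),
        "previous": tokens[i - 1].lower() if i > 0 else "<start>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<end>",
    }


tokens = ["U.N.", "official", "Ekeus", "heads", "for", "Baghdad", "."]
features = token_features(tokens, 2)
print(features)
```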

58
Mixed Initiative Learning
  • Human computer interaction
  • Speeds up the creation of training data
  • Can be used for corpus/system creation
  • Example implementations
  • Alembic [Day et al 97] and later
  • Amilcare [Ciravegna 03]

59
Mixed Initiative Learning (2)
User annotates
System learns
P > t1
P > t2
60
Example 1: Alembic [Day et al 1997]
  • Mixed initiative approach implemented in Alembic
    Workbench
  • Bootstrapping procedure: use already tagged data
    to pre-annotate new documents
  • Transforms the process from tagging to review
  • Finally, the trained system can be used on its
    own

61
Mixed-Initiative text annotation
User can also edit the induced rules and
write new ones. Brill-style learning:
generate-and-test
62
Considerations
  • Too high recall and the human might become
    over-reliant on the system annotations
  • Too high precision might have similar effect
  • Theory-creep: the choices of the human
    annotator are increasingly influenced by the
    machine's and might deviate from the task
    definition; measure inter-annotator agreement

63
Example 2: Amilcare and Melita
  • Amilcare: rule-learning algorithm
  • Tagging rules: learn to insert tags in the text,
    given training examples
  • Correction rules: learn to move already inserted
    tags to their correct place in the text
  • Learns begin and end tags independently
  • Melita: supports adaptive IE
  • Applied in SemWeb context (see below)
  • Being extended as part of the EU-funded DOT.KOM
    project towards KM and SemWeb applications

64
Comparison of Alembic and Melita
  • The life cycle of user tagging, the machine
    learning, then making suggestions which user
    corrects, is very similar in Melita and Alembic
  • Alembic is more oriented towards NLP developers,
    while Melita more towards end-users
  • Melita considers timeliness and intrusiveness as
    criteria, while Alembic does not (possibly due to
    performance bottleneck from old hardware)
  • Both acknowledge but do not address problems with
    over-reliance on machine annotations
  • From an ML perspective the two are very similar:
    rule-learning

65
Eg. 3: GATE Machine Learning support
  • Uses classification.
  • Attr1, Attr2, Attr3, ..., Attrn → Class
  • Classifies annotations.
  • (Documents can be classified as well, using a
    1-to-1 relation with annotations.)
  • Annotations of a particular type are selected as
    instances.
  • Attributes refer to features of the instance
    annotations or their context.
  • Generic implementation for attribute collection:
    can be linked to any ML engine.
  • ML engines currently integrated: WEKA and
    Ontotext's HMM.

66
Implementation
  • Machine Learning PR in GATE.
  • Has two functioning modes
  • training
  • application
  • Uses an XML file for configuration:
  • <?xml version="1.0" encoding="windows-1252"?>
  • <ML-CONFIG>
  • <DATASET> </DATASET>
  • <ENGINE></ENGINE>
  • </ML-CONFIG>

67
Attributes Collection
Instances type: Token
68
Dataflow
[Diagram: plain text documents pass through the NLP pipeline (Tokeniser, Gazetteer, POS Tagger, Lexicon Lookup, Semantic Tagger, etc.) to give annotated documents; the GATE ML Library (Feature Collection, Results Converter, Engine Interface) connects them to the Machine Learning Engine]
70
Towards Semantic Tagging of Entities
  • The MUC NE task tags selected segments of text
    whenever that text represents the name of an
    entity.
  • Semantic tagging - view as mentions of the
    underlying instances from the ontology
  • Identify which mentions in the text refer to
    which instances in the ontology, e.g.,
  • Tony Blair, Mr. Blair, he, the prime minister, he
  • Gordon Brown, he, Mr. Brown, the chancellor

71
Tasks
  • Identify entity mentions in the text
  • Reference disambiguation
  • Add new instances if needed
  • Disambiguate wrt instances in the ontology
  • Identify instances of attributes and relations
  • take into account what is allowed by the
    ontology, using domain and range as constraints

72
Example
XYZ was established on 03 November 1978 in
London. It opened a plant in Bulgaria in
[Diagram: ontology and KB. Classes: Location, Company, City, Country. Relations: HQ, partOf, establOn, type. Value: 03/11/1978]
73
Classes, instances and metadata
Gordon Brown met George Bush during his two day
visit.
<metadata>
  <DOC-ID>http:// 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>Person</class>
    <inst>Person67890</inst>
  </Annotation>
</metadata>
[Diagram: classes and instances, before and after; label: Bush]
74
Classes, instances and metadata (2)
Gordon Brown met Tony Blair to discuss the
university tuition fees.
<metadata>
  <DOC-ID>http:// 2.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>Person</class>
    <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 30 </e_offset>
    <string>Tony Blair</string>
    <class>Person</class>
    <inst>Person26389</inst>
  </Annotation>
</metadata>
[Diagram: classes and instances, before and after; labels: G. Brown, G. Bush]
75
Why not put metadata in ontologies?
  • Can be encoded in RDF/OWL, etc. but does it need
    to be put as instances in the ontology?
  • Typically we do not need to reason with it
  • Reasoning happens in the ontology when the new
    instances of classes and properties are added,
    but the metadata statements are different from
    them, they only refer to them
  • A lot more metadata than instances
  • Millions of metadata statements, thousands of
    instances, hundreds of concepts
  • Different access required
  • By offset (give me all metadata of the first
    paragraph)
  • Efficient metadata-wide statistics based on
    strings: not an operation that people would do
    on other concepts
  • Mixing with keyword-based search using IR-style
    indexing

76
Metadata Creation with IE
  • Semantic tagging creates metadata
  • Stand-off or part of document
  • Semi-automatic
  • One view (given by the user, one ontology)
  • More reliable
  • Automatic metadata creation
  • Many views: change with ontology, re-train IE
    engine for each ontology
  • Always up to date, if ontology changes
  • Less reliable

77
Problems with traditional IE for metadata
creation
  • S-CREAM: Semi-automatic CREAtion of Metadata
    [Handschuh et al 02]
  • Semantic tags from IE need to be mapped to
    instances of concepts, attributes or relations
  • Most ML-based IE systems do not deal well with
    relations, mainly entities
  • Amilcare does not handle anaphora resolution,
    GATE has such component but not used here
  • Implemented a discourse model with logical rules
  • LASIE used a discourse model with a domain
    ontology; the problem is robustness and domain
    portability

78
Example
[Handschuh et al 02] S-CREAM, EKAW02
79
S-CREAM Discourse Rules
  • Rules to attach instances only when the ontology
    allows that (e.g., prices)
  • Attach tag values to the nearest preceding
    compatible entity (e.g., prices and rooms)
  • Create a complex object between two concept
    instances if they are adjacent (e.g., rate
    number followed by currency)
  • Experienced users can write new rules

80
Challenges for IE for SemWeb
  • Portability: different and changing ontologies
  • Different text types: structured, free, etc.
  • Utilise ontology information where available
  • Train from small amount of annotated text
  • Output results wrt the given ontology
  • bridge the gap demonstrated in S-CREAM
  • Learn/Model at the right level
  • ontologies are hierarchical and data will get
    sparser the lower we go

DOT.KOM: http://nlp.shef.ac.uk/dot.kom/
82
Eg. 1: GATE metadata extraction
  • Combines learning and rule-based methods
  • Allows combination of IE and IR
  • Enables use of large-scale linguistic resources
    for IE, such as WordNet
  • Supports ontologies as part of IE applications -
    Ontology-Based IE (OBIE)

83
Ontology Management in GATE
84
Information Retrieval: currently based on the
Lucene IR engine; useful for combining semantic
and keyword-based search
85
WordNet support
86
Populating Ontologies with IE
87
Example 2: OBIE in h-TechSight
  • h-TechSight project: using Ontology-Based IE for
    semantic tagging of job adverts, news and reports
    in the chemical engineering domain
  • Aim is to track technological change over time
    through terminological analysis
  • Fundamental to the application is a
    domain-specific ontology
  • Terminological gazetteer lists are linked to
    classes in the ontology
  • Rules classify the mentions in the text wrt the
    domain ontology
  • Annotations output into a database or as an
    ontology

90
Exported Database
92
Platforms for Large-Scale Metadata Creation
  • Allow use of corpus-wide statistics to improve
    metadata quality, e.g., disambiguation
  • Automated alias discovery
  • Generate SemWeb output (RDF, OWL)
  • Stand-off storage and indexing of metadata
  • Use large instance bases to disambiguate to
  • Ontology servers for reasoning and access
  • Architecture elements
  • Crawler, onto storage, doc indexing, query,
    annotators
  • Apps: semantic browsers, authoring tools, etc.

93
Example 1: SemTag
  • Lookup of all instances from the ontology (TAP):
    65K instances
  • Disambiguate each occurrence as either:
  • One of those in the taxonomy
  • Not present in the taxonomy
  • Instances with the same label are not highly
    ambiguous in TAP, so concentrate on the second
    problem
  • Use bag-of-words approach for disambiguation
  • 3 people evaluated 200 labels in context and agreed
    on only 68.5% (metonymy)
  • Placing labels in the taxonomy is hard

Dill et al, SemTag and Seeker. WWW03
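
A bag-of-words disambiguation step of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the SemTag algorithm itself: the candidate descriptions, the cosine measure and the threshold value are all assumptions made for the example.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Lower-case the text and count its alphabetic word tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def disambiguate(context, candidates, threshold=0.1):
    """Return the id of the best-matching candidate, or None when no
    candidate is similar enough ("not present in the taxonomy")."""
    ctx = bag_of_words(context)
    best_id, best_score = None, 0.0
    for cand_id, description in candidates.items():
        score = cosine(ctx, bag_of_words(description))
        if score > best_score:
            best_id, best_score = cand_id, score
    return best_id if best_score >= threshold else None


# Hypothetical taxonomy nodes with short textual descriptions.
CANDIDATES = {
    "Jaguar_Car": "jaguar car luxury vehicle engine british manufacturer",
    "Jaguar_Animal": "jaguar big cat animal jungle predator south america",
}
```

Calling `disambiguate("The jaguar is a predator that hunts in the jungle", CANDIDATES)` picks the animal node, while a context that shares no words with either description yields `None`.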
94
Seeker
  • High-performance distributed infrastructure
  • 128 dual-processor machines with separate ½
    terabyte of storage
  • Each node runs approx. 200 documents per sec.
  • Service-oriented architecture: Vinci (SOAP)

Dill et al, SemTag and Seeker. WWW03
95
Example 2: OBIE in KIM
  • The ontology (KIMO) and 86K/200K instances KB
  • High ambiguity of instances with the same label:
    need for disambiguation step
  • Lookup phase marks mentions from the ontology
  • Combined with rule-based IE system to recognise
    new instances of concepts and relations
  • Special KB enrichment stage where some of these
    new instances are added to the KB
  • Disambiguation uses an Entity Ranking algorithm,
    i.e., priority ordering of entities with the same
    label based on corpus statistics (e.g., Paris)

Popov et al. KIM. ISWC03
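
The idea of a priority ordering based on corpus statistics can be sketched as below. This is only an illustration of the principle, not KIM's Entity Ranking algorithm: the entity identifiers and the mention counts are invented for the example.

```python
from collections import Counter


def rank_entities(label, kb, mention_counts):
    """Order the KB entities that carry `label` by how often each one was
    mentioned in a reference corpus (most frequent first; ties are broken
    alphabetically so the result is deterministic)."""
    candidates = [entity for entity, labels in kb.items() if label in labels]
    return sorted(candidates, key=lambda e: (-mention_counts[e], e))


# Hypothetical knowledge base: entity id -> set of labels (aliases).
KB = {
    "Paris_France": {"Paris"},
    "Paris_Texas": {"Paris"},
    "Paris_Hilton": {"Paris", "Paris Hilton"},
}

# Hypothetical corpus statistics: how often each entity was mentioned.
COUNTS = Counter({"Paris_France": 950, "Paris_Hilton": 40, "Paris_Texas": 10})

ranking = rank_entities("Paris", KB, COUNTS)
```

With these made-up counts the ambiguous label "Paris" resolves to the French capital by default, and the other readings remain available as lower-priority alternatives.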
96
OBIE in KIM (2)
Popov et al. KIM. ISWC03
97
Comparison between SemTag and KIM
  • SemTag only aims for accuracy (precision) of
    classification of the annotated entities
  • KIM also aims for coverage (recall): whether all
    possible mentions of entities were found
  • Trade-off: sometimes finding some is enough
  • SemTag does not attempt to discover and expand
    the KB with new instances (e.g., new company);
    this is the reason why KIM uses IE, not simple KB lookup
  • i.e., OBIE is often needed for ontology
    population, not just metadata creation

98
Two Annotation Scenarios (1)
  • Getting the instances and the relations between
    them is enough; not all mentions in the text may
    be covered, but this is compensated by giving
    access to this info from the annotated text

99
Example
Gordon Brown met president Bush during his two
day visit. Afterwards George Bush said
The system
Bush
Score: 100%
100
Two Annotation Scenarios (2)
  • Exhaustive annotation is required, so all
    occurrences of all instances and relations are
    needed
  • Allows sentence and paragraph-level exploration,
    rather than document-level as in the previous
    scenario
  • Harder to achieve
  • Distinction between these scenarios needs to be
    made in the metadata annotation tools/KM tools
    using IE

101
Example
Gordon Brown met president Bush during his two
day visit. Afterwards George Bush said
<metadata>
  <Annotation>
    <s_offset> 0 </s_offset> <e_offset> 12 </e_offset>
    <class>Person</class> <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset> <e_offset> 32 </e_offset>
    <class>Person</class> <inst>Person1267</inst>
  </Annotation>
  <Annotation>
    <s_offset> 61 </s_offset> <e_offset> 72 </e_offset>
    <class>Person</class> <inst>Person1267</inst>
  </Annotation>
</metadata>
<metadata>
  <Annotation>
    <s_offset> 0 </s_offset> <e_offset> 12 </e_offset>
    <class>Person</class> <inst>Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 61 </s_offset> <e_offset> 72 </e_offset>
    <class>Person</class> <inst>Person1267</inst>
  </Annotation>
</metadata>
Score: 66%
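
One way to read the two example scores (100 and 66) is as agreement between a reference annotation set and a system's output, measured per instance in the first scenario and per mention in the second. The sketch below follows that reading. It is an assumption: the slides do not say how the score is computed, nor which metadata block is the reference, so here the three-annotation block is taken as the reference and the two-annotation block as the system output.

```python
def precision_recall(reference, system):
    """Precision and recall of `system` against `reference`, both given
    as collections of hashable items (exact match only)."""
    reference, system = set(reference), set(system)
    correct = len(reference & system)
    precision = correct / len(system) if system else 0.0
    recall = correct / len(reference) if reference else 0.0
    return precision, recall


# (start offset, end offset, instance id), copied from the example.
REFERENCE = [(0, 12, "Person12345"), (18, 32, "Person1267"), (61, 72, "Person1267")]
SYSTEM = [(0, 12, "Person12345"), (61, 72, "Person1267")]

# Scenario 2: every mention counts, so the missed mention lowers recall.
mention_scores = precision_recall(REFERENCE, SYSTEM)

# Scenario 1: only the set of instances per document counts.
instance_scores = precision_recall(
    {inst for _, _, inst in REFERENCE},
    {inst for _, _, inst in SYSTEM},
)
```

Under these assumptions the mention-level recall is two out of three, while the instance-level comparison is perfect because both people are found at least once.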
102
Example 3: SWAN, a Semantic Web Annotator
  • Collaboration between DERI/NUIG, OntoText and
    USFD, hosted at DERI
  • GATE, KIM, SECO
  • Custom indexing of news or other web fractions
  • Quantitative media reporting
  • Annotated web workbench service
  • Custom knowledge services
  • Demo and poster at ESWS

103
SWAN Logical Architecture
[Architecture diagram. Components: Web; IE (64 bit);
Annotation (Oracle); Knowledgebase (Sesame); Web UI,
Web services; UI Users; Service Users]
104
Cluster Controller
105
Semantic Reference Disambiguation
  • Possible approaches:
  • Vector-space models: compare context similarity;
    run over a corpus
  • SemTag
  • Bagga's cross-document coreference work
  • Communities-of-practice approach from KM
  • Identity criteria from the ontology based on
    properties, e.g., date_of_birth, name
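
The last approach, identity criteria based on properties, can be illustrated with a small sketch. The property list and the matching rule (agree on every identity property that both descriptions specify) are assumptions chosen for the example; an ontology would normally declare which properties count.

```python
# Properties assumed to identify a person; hypothetical choice.
IDENTITY_PROPERTIES = ("name", "date_of_birth")


def same_entity(a, b, keys=IDENTITY_PROPERTIES):
    """Treat two descriptions as the same entity if they share at least
    one identity property and agree on every identity property that
    both of them specify."""
    shared = [k for k in keys if k in a and k in b]
    return bool(shared) and all(a[k] == b[k] for k in shared)


# Hypothetical descriptions extracted from different documents.
smith_a = {"name": "John Smith", "date_of_birth": "1950-01-01"}
smith_b = {"name": "John Smith", "date_of_birth": "1980-05-05"}
smith_c = {"name": "John Smith"}
```

Here `smith_a` and `smith_b` share a name but differ in date of birth, so they are kept apart, whereas `smith_c` has no conflicting property and is merged with either. That second outcome shows the limit of the rule: missing knowledge makes the match optimistic.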

106
Why disambiguation is hard: not all knowledge
is explicit in text
  • Paris fashion week underway as cancellations
    continue
  • By Jo Johnson and Holly Finn - Oct 07 2001
    18:48:17 (FT)
  • Even as Paris fashion week opened at the
    weekend, the cancellations and reschedulings were
    still trickling in over the fax machines: Loewe,
    the leather specialists owned by LVMH empire, is
    not showing, Cerruti, the Italian tailor, is
    downscaling to private viewings, Helmut Lang,
    master of the sharp suit, is cancelling his
    catwalk.
  • The Oscar de la Renta show, for example, which
    had been planned for September 11th in New York,
    and which might easily enough have moved over to
    Paris instead, is not on the schedule. When the
    Dominican Republic-born designer consulted
    America Vogue's influential editor, Anna Wintour,
    she reportedly told him it would be unpatriotic
    to decamp.

107
Structure of the Tutorial
  • Motivation, background
  • Information Extraction - definition
  • Evaluation: corpora and metrics
  • IE approaches: some examples
  • Rule-based approaches
  • Learning-based approaches
  • Semantic Tagging
  • Using traditional IE
  • Ontology-based IE
  • Platforms for large-scale processing
  • Language Generation
  • Slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt

108
Natural Language Generation
  • NLG is
  • the subfield of AI and CL that is concerned with the
    construction of computer systems that can produce
    understandable texts in English or other human
    languages from some underlying non-linguistic
    representation of information [Reiter & Dale 97]
  • NLG techniques are also applied for producing
    speech, e.g., in speech dialogue systems

109
Natural Language Generation

Ontology/KB/Database
Lexicons, Grammars
Text
110
Requirements Analysis
  • Create a corpus of target texts and (if possible)
    their input representations
  • Analyse the information content
  • Unchanging texts: thank you, hello, etc.
  • Directly available data: timetable of buses
  • Computable data: number of buses
  • Unavailable data: not in the system's KB/DB

111
NLG Tasks
  • Content determination
  • Discourse planning
  • Sentence aggregation
  • Lexicalisation
  • Referring expression generation
  • Linguistic realisation

112
Content determination
  • What information to include in the text:
    filtering and summarising input data into a
    formal knowledge representation
  • Application dependent
  • Example:
  • project: AKT
  • start_date: October-2000
  • end_date: October-2006
  • participants: A, E, OU, So, Sh

113
Discourse Planning
  • Determine ordering and structure over the
    knowledge to be generated
  • Theories of discourse: how texts are structured
  • Influences text readability
  • Result: a tree structure imposing ordering over the
    predicates and possibly providing discourse
    relations

114
Example
[Discourse tree diagram. Nodes: SEQUENCE, LIST,
ELABORATION, ELABORATION. Leaves:]
project AKT, duration 6 yrs
project AKT, participant Shef
univ Shef, Web-page URL
project AKT, participant OU
115
Planning-Based Approaches
  • Use AI-style planners (e.g., [Moore & Paris 93])
  • Discourse relations (e.g., ELABORATION) are
    encoded as planning operators
  • Preconditions specify when the relation can apply
  • Planning starts from a top-level goal, e.g.,
    define-project(X)
  • Computationally expensive and require a lot of
    knowledge: a problem for real-world systems

116
Schema-Based Approaches
  • Capture typical text structuring patterns in
    templates (derived from corpus), e.g., [McKeown
    85]
  • Typically implemented as recursive transition
    networks (RTN)
  • Variety comes from different available knowledge
    for each entity
  • Reusable ones available: Exemplars
  • Example:
  • Describe-Project-Schema -> Sequence(duration,
    ProjParticipants-Schema)
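
The schema above can be read as a rewrite rule that expands into an ordered list of predicates. The sketch below shows one minimal way to do that expansion; it is not how Exemplars or any particular system implements schemas. The body given to `ProjParticipants-Schema` and the fact values are assumptions, since the slide does not define them.

```python
# Each schema maps to an ordered sequence of steps; a step is either
# another schema name or a predicate name looked up in the facts.
SCHEMAS = {
    "Describe-Project-Schema": ["duration", "ProjParticipants-Schema"],
    "ProjParticipants-Schema": ["participants"],  # assumed body
}


def expand(schema, facts):
    """Depth-first expansion of a schema into an ordered list of
    (predicate, value) pairs. Predicates with no value in `facts` are
    skipped, which is where variety between entities comes from."""
    plan = []
    for step in SCHEMAS[schema]:
        if step in SCHEMAS:
            plan.extend(expand(step, facts))
        elif step in facts:
            plan.append((step, facts[step]))
    return plan


# Hypothetical output of content determination for one project.
FACTS = {"project": "AKT", "duration": "6 yrs", "participants": ["Shef", "OU"]}

plan = expand("Describe-Project-Schema", FACTS)
```

The resulting plan lists the duration first and the participants second, which is the ordering the Sequence operator asks for.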

117
Sentence Aggregation
  • Determine which predicates should be grouped
    together in sentences
  • Less understood process
  • Default: each predicate can be expressed as a
    sentence, so this step is optional
  • SPOT: a trainable planner
  • Example:
  • AKT is a 6-year project with 5 participants
  • Sheffield (URL)
  • OU

118
Lexicalisation
  • Choosing words and phrases to express the
    concepts and relations in predicates
  • Trivial solution: 1-1 mapping between
    concepts/relations and lexical entries
  • Variation is useful to avoid repetitiveness and
    also convey pragmatic distinctions (e.g.
    formality)

119
Referring Expression Generation
  • Choose pronouns/phrases to refer to the entities
    in the text
  • Example: "he" vs "Mr Smith" vs "John Smith, the
    president of XXX Corp."
  • Depends on what has previously been said
  • "He" is only appropriate if the person has already
    been introduced in the text
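
The rule that a pronoun is only appropriate once the person has been introduced can be sketched as follows. This is a deliberately naive illustration: the entity record and its keys are hypothetical, and a real generator would also check whether the pronoun is ambiguous given other entities mentioned nearby.

```python
def refer(entity, already_mentioned):
    """Return a referring expression for `entity` and record the mention.
    The first mention uses the full name, later ones the pronoun."""
    if entity["id"] in already_mentioned:
        return entity["pronoun"]
    already_mentioned.add(entity["id"])
    return entity["full_name"]


# Hypothetical entity record.
smith = {"id": "person-1", "full_name": "John Smith", "pronoun": "he"}

mentioned = set()
first = refer(smith, mentioned)   # first mention in the text
second = refer(smith, mentioned)  # subsequent mention
```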

120
Linguistic Realisation
  • Use grammar to generate text which is
    grammatical, i.e., syntactically and
    morphologically correct
  • Domain-independent
  • Reusable components are available, e.g.,
    RealPro, FUF/SURGE
  • Example:
  • Morphology: participant -> participants
  • Syntactic agreement: AKT starts on
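
The two example phenomena, plural morphology and subject-verb agreement, can be sketched with toy rules. These functions are illustrative assumptions covering only regular English forms; they are not the RealPro or FUF/SURGE interfaces, which use full grammars.

```python
def pluralise(noun, count):
    """Naive English plural: add 's' unless the count is exactly 1."""
    return noun if count == 1 else noun + "s"


def agree(verb, subject_is_plural):
    """Naive present-tense agreement for regular verbs: a singular
    third-person subject takes the '-s' form."""
    return verb if subject_is_plural else verb + "s"


def realise_start(project, date, subject_is_plural=False):
    """Realise a start-date predicate as a sentence."""
    return f"{project} {agree('start', subject_is_plural)} on {date}."


sentence = realise_start("AKT", "1 October 2000")  # date is illustrative
```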

121
Example: a GATE-based generator
  • Input
  • The MIAKT ontology
  • The RDF file for the given case
  • The MIAKT lexicon
  • Output
  • GATE document with the generated text

122
Lexicalising Concepts and Instances
123
Example RDF Input
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_patient'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Patient'/>
  <NS2:has_age>68</NS2:has_age>
  <NS2:involved_in_ta rdf:resource='c:\breast_cancer_ontology.daml#ta-soton-1069861276136'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_mammography'>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Mammography'/>
  <NS2:carried_out_on rdf:resource='c:\breast_cancer_ontology.daml#01401_patient'/>
  <NS2:has_date>22 9 1995</NS2:has_date>
  <NS2:produce_result rdf:resource='c:\breast_cancer_ontology.daml#image_01401_right_cc'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#image_01401_right_cc'>
  <NS2:image_file>cancer/case0140/C_0140_1.RIGHT_CC.LJPEG</NS2:image_file>
  <rdf:type rdf:resource='c:\breast_cancer_ontology.daml#Right_CC_Image'/>
  <NS2:has_lateral rdf:resource='c:\breast_cancer_ontology.daml#lateral_right'/>
  <NS2:view_of_image rdf:resource='c:\breast_cancer_ontology.daml#craniocaudal_view'/>
  <NS2:contains_entity rdf:resource='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'/>
</rdf:Description>
<rdf:Description rdf:about='c:\breast_cancer_ontology.daml#01401_right_cc_abnor_1'>

124
CASE0140.RDF
  • The 68 years old patient is involved in a
    triple assessment procedure. The triple
    assessment procedure contains a mammography exam.
    The mammography exam is carried out on the
    patient on 22 9 1995. The mammography exam
    produced a right CC image. The right CC image
    contains an abnormality and it has a right
    lateral side and a craniocaudal view. The
    abnormality has a mass, a microlobulated margin ,
    a round shape, and a probably malignant
    assessment.
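
How a lexicon can turn statements like those in the RDF input into sentences like the ones above can be sketched with simple templates. This is an assumption-laden illustration, not the MIAKT generator: the lexicon entries, the template wording and the simplified triple format are invented for the example, and only three of the properties are covered.

```python
# Hypothetical lexicon: ontology class -> noun phrase.
CONCEPT_LEXICON = {"Patient": "patient", "Mammography": "mammography exam"}

# Hypothetical lexicon: property -> sentence template.
PROPERTY_TEMPLATES = {
    "has_age": "The {subject} is {object} years old.",
    "carried_out_on": "The {subject} is carried out on the {object}.",
    "has_date": "The {subject} took place on {object}.",
}


def verbalise(triples, types):
    """Render (subject, property, object) triples as text. Instances are
    named by the lexical entry of their class; literal objects are used
    as they are; properties without a template are skipped."""
    sentences = []
    for subj, prop, obj in triples:
        template = PROPERTY_TEMPLATES.get(prop)
        if template is None:
            continue  # no lexical entry: skip rather than guess
        sentences.append(template.format(
            subject=CONCEPT_LEXICON[types[subj]],
            object=CONCEPT_LEXICON.get(types.get(obj), obj),
        ))
    return " ".join(sentences)


TYPES = {"01401_patient": "Patient", "01401_mammography": "Mammography"}
TRIPLES = [
    ("01401_patient", "has_age", "68"),
    ("01401_mammography", "carried_out_on", "01401_patient"),
    ("01401_mammography", "has_date", "22 9 1995"),
]

text = verbalise(TRIPLES, TYPES)
```

One sentence per triple is the default without aggregation; the generated report above reads more smoothly because related statements are merged, for example the age into "the 68 years old patient".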

125
Further Reading on IE for SemWeb
  • Requirements for Information Extraction for
    Knowledge Management.
    http://nlp.shef.ac.uk/dot.kom/publications.html
  • Information Extraction as a Semantic Web
    Technology: Requirements and Promises. Adaptive
    Text Extraction and Mining workshop, 2003.
  • A. Kiryakov, B. Popov, et al. Semantic
    Annotation, Indexing, and Retrieval. 2nd
    International Semantic Web Conference (ISWC2003).
    http://www.ontotext.com/publications/index.html#KiryakovEtAl2003
  • S. Handschuh, S. Staab, R. Volz. On Deep
    Annotation. WWW03.
    http://www.aifb.uni-karlsruhe.de/WBS/sha/papers/p273_handschuh.pdf
  • S. Dill, N. Eiron, et al. SemTag and Seeker:
    Bootstrapping the semantic web via automated
    semantic annotation. WWW03.
    http://www.tomkinshome.com/papers/2Web/semtag.pdf
  • E. Motta, M. Vargas-Vera, et al. MnM: Ontology
    Driven Semi-Automatic and Automatic Support for
    Semantic Markup. Knowledge Engineering and
    Knowledge Management (Ontologies and the Semantic
    Web), (EKAW02).
    http://www.aktors.org/publications/selected-papers/06.pdf
  • K. Bontcheva, A. Kiryakov, H. Cunningham, B.
    Popov, M. Dimitrov. Semantic Web Enabled, Open
    Source Language Technology. Language Technology
    and the Semantic Web, Workshop on NLP and XML
    (NLPXML-2003).
    http://www.gate.ac.uk/sale/eacl03-semweb/bontcheva-etal-final.pdf
  • Handschuh, Staab, Ciravegna. S-CREAM:
    Semi-automatic CREAtion of Metadata (2002).
    http://citeseer.nj.nec.com/529793.html

126
Further Reading on traditional IE
  • Day et al97 D. Day, J. Aberdeen, L. Hirschman,
    R. Kozierok, P. Robinson, and M. Vilain.
    Mixed-Initiative Development of Language
    Processing Systems. In Proceedings of the Fifth
    Conference on Applied Natural Language Processing
    (ANLP97). 1997.
  • Ciravegna02 F. Ciravegna, A. Dingli, D.
    Petrelli, Y. Wilks User-System Cooperation in
    Document Annotation based on Information
    Extraction. Knowledge Engineering and Knowledge
    Management (Ontologies and the Semantic Web),
    (EKAW02), 2002.
  • N. Kushmerick, B. Thomas. Adaptive information
    extraction: Core technologies for information
    agents (2002).
    http://citeseer.nj.nec.com/kushmerick02adaptive.html
  • H. Cunningham, D. Maynard, K. Bontcheva, V.
    Tablan. GATE: A Framework and Graphical
    Development Environment for Robust NLP Tools and
    Applications. 40th Anniversary Meeting of the
    Association for Computational Linguistics
    (ACL'02). 2002.
  • D.Maynard, K. Bontcheva and H. Cunningham.
    Towards a semantic extraction of named entities.
    Recent Advances in Natural Language Processing,
    Bulgaria, 2003.
  • Califf and Mooney. Relational Learning of Pattern
    Matching Rules for Information Extraction.
    http://citeseer.nj.nec.com/6804.html
  • Borthwick A. A Maximum Entropy Approach to Named
    Entity Recognition. PhD Dissertation. 1999
  • Bikel D., Schwartz R., Weischedel R. An
    algorithm that learns what's in a name. Machine
    Learning 34, pp. 211-231, 1999
  • Riloff, E. (1996) "Automatically Generating
    Extraction Patterns from Untagged Text".
    Proceedings of the Thirteenth National Conference
    on Artificial Intelligence (AAAI-96), 1996, pp.
    1044-1049.
    http://www.cs.utah.edu/~riloff/psfiles/aaai96.pdf
  • Daelemans W. and Hoste V. Evaluation of Machine
    Learning Methods for Natural Language Processing
    Tasks. In LREC 2002: Third International
    Conference on Language Resources and Evaluation,
    pages 755-760

127
Further Reading on traditional IE
  • Black W.J., Rinaldi F., Mowatt D. Facile
    Description of the NE System Used For MUC-7.
    Proceedings of 7th Message Understanding
    Conference, Fairfax, VA, 19 April - 1 May, 1998.
  • Collins M., Singer Y. Unsupervised models for
    named entity classification. In Proceedings of the
    Joint SIGDAT Conference on Empirical Methods in
    Natural Language Processing and Very Large
    Corpora, 1999
  • Collins M. Ranking Algorithms for Named-Entity
    Extraction: Boosting and the Voted Perceptron.
    Proceedings of the 40th Annual Meeting of the
    ACL, Philadelphia, pp. 489-496, July 2002
  • Gotoh Y., Renals S. Information extraction from
    broadcast news, Philosophical Transactions of the
    Royal Society of London, series A: Mathematical,
    Physical and Engineering Sciences, 2000.
  • Grishman R. The NYU System for MUC-6 or Where's
    the Syntax? Proceedings of the MUC-6 workshop,
    Washington. November 1995.
  • Krupka G. R., Hausman K. IsoQuest Inc.
    Description of the NetOwlTM Extractor System as
    Used for MUC-7. Proceedings of 7th Message
    Understanding Conference, Fairfax, VA, 19 April -
    1 May, 1998.
  • McDonald D. Internal and External Evidence in the
    Identification and Semantic Categorization of
    Proper Names. In B. Boguraev and J. Pustejovsky,
    editors, Corpus Processing for Lexical
    Acquisition. Pages 21-39. MIT Press. Cambridge,
    MA. 1996
  • Mikheev A., Grover C. and Moens M. Description of
    the LTG System Used for MUC-7. Proceedings of 7th
    Message Understanding Conference, Fairfax, VA, 19
    April - 1 May, 1998
  • Miller S., Crystal M., et al. BBN Description of
    the SIFT System as Used for MUC-7. Proceedings of
    7th Message Understanding Conference, Fairfax,
    VA, 19 April - 1 May, 1998

128
Further Reading on multilingual IE
  • Palmer D., Day D.S. A Statistical Profile of the
    Named Entity Task. Proceedings of the Fifth
    Conference on Applied Natural Language
    Processing, Washington, D.C., March 31- April 3,
    1997.
  • Sekine S., Grishman R. and Shinou H. A decision
    tree method for finding and classifying names in
    Japanese texts. Proceedings of the Sixth Workshop
    on Very Large Corpora, Montreal, Canada, 1998
  • Sun J., Gao J.F., Zhang L., Zhou M., Huang C.N.
    Chinese Named Entity Identification Using
    Class-based Language Model. In proceeding of the
    19th International Conference on Computational
    Linguistics (COLING2002), pp.967-973, 2002.
  • Takeuchi K., Collier N. Use of Support Vector
    Machines in Extended Named Entity Recognition.
    The 6th Conference on Natural Language Learning.
    2002
  • D.Maynard, K. Bontcheva and H. Cunningham.
    Towards a semantic extraction of named entities.
    Recent Advances in Natural Language Processing,
    Bulgaria, 2003.
  • M. M. Wood and S. J. Lydon and V. Tablan and D.
    Maynard and H. Cunningham. Using parallel texts
    to improve recall in IE. Recent Advances in
    Natural Language Processing, Bulgaria, 2003.
  • D. Maynard, V. Tablan and H. Cunningham. NE
    recognition without training data on a language
    you don't speak. ACL Workshop on Multilingual and
    Mixed-language Named Entity Recognition:
    Combining Statistical and Symbolic Models,
    Sapporo, Japan, 2003.

129
Further Reading on multilingual IE
  • H. Saggion, H. Cunningham, K. Bontcheva, D.
    Maynard, O. Hamza, Y. Wilks. Multimedia Indexing
    through Multisource and Multilingual Information
    Extraction: the MUMIS project. Data and Knowledge
    Engineering, 2003.
  • D. Manov and A. Kiryakov and B. Popov and K.
    Bontcheva and D. Maynard, H. Cunningham.
    Experiments with geographic knowledge for
    information extraction. Workshop on Analysis of
    Geographic References, HLT/NAACL'03, Canada,
    2003.
  • H. Cunningham, D. Maynard, K. Bontcheva, V.
    Tablan. GATE: A Framework and Graphical
    Development Environment for Robust NLP Tools and
    Applications. Proceedings of the 40th Anniversary
    Meeting of the Association for Computational
    Linguistics (ACL'02). Philadelphia, July 2002.
  • H. Cunningham. GATE, a General Architecture for
    Text Engineering. Computers and the Humanities,
    volume 36, pp. 223-254, 2002.
  • D. Maynard, H. Cunningham, K. Bontcheva, M.
    Dimitrov. Adapting A Robust Multi-Genre NE System
    for Automatic Content Extraction. Proc. of the
    10th International Conference on Artificial
    Intelligence Methodology, Systems, Applications
    (AIMSA 2002), 2002.
  • K. Pastra, D. Maynard, H. Cunningham, O. Hamza,
    Y. Wilks. How feasible is the reuse of grammars
    for Named Entity Recognition? Language Resources
    and Evaluation Conference (LREC'2002), 2002.

130
THANK YOU! (for not snoring)
The slides: http://gate.ac.uk/sale/talks/esws2004-tutorial.ppt
This work has been supported by AKT
(http://aktors.org/) and SEKT
(http://sekt.semanticweb.org/)