Title: Using WordNet Predicates for Multilingual Named Entity Recognition
1. Using WordNet Predicates for Multilingual Named Entity Recognition
- Matteo Negri and Bernardo Magnini
- ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy
- negri,magnini_at_itc.it
- GWC04, Brno (Czech Republic), January 23, 2004
2. Outline
- Named Entity Recognition (NER)
- Rule-based approach using WordNet information
- WordNet Predicates (language independent)
  - Internal evidence: Word_Instances
  - External evidence: Word_Classes
- System architecture
- Experiments and results on English and Italian
- Future work
3. Named Entity Recognition (NER)
- Given a written text, identify and categorize:
  - entity names (e.g. persons, organizations, location names)
  - temporal expressions (e.g. dates and times)
  - numerical expressions (e.g. monetary values and percentages)
- NER is crucial for Information Extraction, Question Answering and Information Retrieval
- Up to 10% of a newswire text may consist of proper names, dates, times, etc.
4. NER for Question Answering
- Q1848: What was the name of the plane that dropped the Atomic Bomb on Hiroshima?
- [NE categories highlighted in the passage: PERSON, DATE, LOCATION, OTHER]
- "Tibbets piloted the Boeing B-29 Superfortress Enola Gay, which dropped the atomic bomb on Hiroshima on Aug. 6, 1945, causing an estimated 66,000 to 240,000 deaths. He named the plane after his mother, Enola Gay Tibbets."
5. Named Entity Hierarchy

ENTITY
- NAMEX: PERSON, ORGANIZATION, LOCATION
- TIMEX: DATE, TIME, DURATION
- MEASURE: MONEY, CARDINAL, PERCENT
- OTHER
6. Motivations
- Test how far we can go with NER using WordNet as the main source of semantic knowledge for one language
- Isolate language-independent knowledge relevant to the NER task
- Experiment with a multilingual approach taking advantage of aligned wordnets (e.g. English/Italian)
7. Knowledge-Based NER
- Combination of a wide range of knowledge sources:
  - lexical, syntactic, and semantic features of the input text
  - world knowledge (e.g. gazetteers)
  - discourse-level information (e.g. co-reference resolution)
8. Rule-Based Approach
- Example: "Rome is the capital of Italy" (tokens: t1 = Rome, t2 = is, t3 = the, t4 = capital)

PATTERN (t1 t2 t3 t4):
  t1: pos = NP, ort = Cap
  t2: lemma = be
  t3: pos = DT
  t4: sense = (location-p t4 English)
OUTPUT: <LOCATION>t1</LOCATION>

Result: <LOCATION>Rome</LOCATION> is the capital of Italy
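
A minimal sketch of how such a rule could be applied to a tokenized, POS-tagged sentence; the token-dictionary fields and the location_p helper (sketched on a later slide) are illustrative assumptions, not the system's actual implementation:

```python
def apply_location_rule(tokens, location_p, lang='eng'):
    """Tag t1 as LOCATION when the 4-token pattern above matches."""
    for i in range(len(tokens) - 3):
        t1, t2, t3, t4 = tokens[i:i + 4]
        if (t1['pos'] == 'NP' and t1['ort'] == 'Cap'  # capitalized proper noun
                and t2['lemma'] == 'be'               # copula
                and t3['pos'] == 'DT'                 # determiner
                and location_p(t4['lemma'], lang)):   # trigger word, e.g. "capital"
            t1['tag'] = 'LOCATION'                    # <LOCATION>t1</LOCATION>
    return tokens
```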
9. WordNet Predicates (1) (WN-preds)
- WN-preds are defined over a set of WordNet synsets which express a certain concept
- E.g. location-p is defined over the synsets location#1, solid_ground#1, mandate#2, geological_formation#1, road#1 and body_of_water#1; other predicates include person-p and measure-p
10. WordNet Predicates (2)
- Input:
  - a word w and a language L
- Output:
  - a boolean value (TRUE or FALSE)
  - TRUE if there exists at least one sense of w which is subsumed by at least one of the synsets defining the predicate
- Example: (location-p <lake> <English>) = TRUE, because there exists a sense of lake (lake#1) which is subsumed by one of the synsets that define the predicate (i.e. body_of_water#1)
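
A minimal sketch of such a predicate using NLTK's WordNet interface; the NLTK synset identifiers below are assumptions mapped from the paper's notation (the paper used WordNet 1.6 / MultiWordNet):

```python
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Synsets assumed to define location-p (cf. the list on the next slide)
LOCATION_P_SYNSETS = {
    wn.synset('location.n.01'),
    wn.synset('body_of_water.n.01'),
    wn.synset('geological_formation.n.01'),
    wn.synset('road.n.01'),
}

def location_p(word, lang='eng'):
    """TRUE iff some sense of `word` is subsumed by a defining synset."""
    for sense in wn.synsets(word, pos=wn.NOUN, lang=lang):
        # hypernym closure = all ancestors of this sense in the hierarchy
        ancestors = set(sense.closure(lambda s: s.hypernyms()))
        ancestors.add(sense)
        if ancestors & LOCATION_P_SYNSETS:
            return True
    return False

print(location_p('lake'))  # True: lake#1 is subsumed by body_of_water#1
```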
11. WordNet Predicates (3)
- WN-preds have been created for the following NE categories:
  - PERSON: person-name-p (person#1, spiritual_being#1); person-class-p (person#1, spiritual_being#1); first-name-p (person#1, spiritual_being#1); person-product-p (artifact#1)
  - LOCATION: location-name-p (location#1, road#1, mandate#1, body_of_water#1, solid_ground#1, geological_formation#1); location-class-p (same synsets as location-name-p); movement-verb-p (locomote#1)
  - ORGANIZATION: org-name-p (organization#1); org-class-p (organization#1); org-representative-p (trainer#1, top_dog, spokesperson#1)
  - MEASURE: measure-unit-p (measure#1, ...); number-p (digit#1, large_integer#1, common_fraction#1)
  - MONEY: money-p (monetary_unit#1, coin#1)
  - DATE: date-p (time_period#1)
12. WordNet Predicates (4)
- The definition of a WordNet predicate is language-independent
- In the case of aligned wordnets, WN-preds can easily be parametrized with respect to a given language without changing the predicate definition, e.g.:
  - (location-p lake English)
  - (location-p lago Italian)
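
With the location_p sketch above, the parametrization amounts to changing only the language argument; passing lang='ita' assumes the Open Multilingual WordNet data (nltk.download('omw-1.4')), standing in for the aligned MultiWordNet used in the paper:

```python
print(location_p('lake', 'eng'))  # True
print(location_p('lago', 'ita'))  # True: same defining synsets, Italian lexicalization
```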
13. Knowledge-Based NER
- Two kinds of information are usually distinguished in Named Entity Recognition (McDonald, 1996):
- Internal Evidence: provided by the candidate string itself (e.g. "Rome")
  - Drawbacks of gazetteers as a source of internal evidence:
    - dimension of reliable gazetteers
    - maintenance (gazetteers are never exhaustive)
    - overlap among the lists ("Washington": person or location?)
    - limited availability for languages other than English
- External Evidence: provided by the context in which the string appears (e.g. "capital")
14. Mining Evidence from WordNet
- Both IE and EE can be mined from WordNet:
  - low coverage of internal evidence (e.g. person names)
  - high coverage of trigger words
- Approach: distinguish between Word_Instances (e.g. Nile#1) and Word_Classes (e.g. river#1)
- Problem: in WordNet such a distinction is not explicit!
16. Word Classes and Word Instances (1)

[Figure: fragment of the WordNet noun hierarchy under person#1. The inner nodes (person, intellectual, scientist, physicist, astronomer, Italian, ...) are Word_Classes, exploited as External Evidence (EE); the leaves Galileo_Galilei and Kepler are Word_Instances, exploited as Internal Evidence (IE).]

- NOTE: in WordNet, the hyponyms of the synset person#1 are a mixture of concepts (e.g. astronomer, physicist, etc.) and individuals (e.g. Galileo Galilei, Kepler, etc.)
17. Word Classes and Word Instances (2)
- Semi-automatic procedure to distinguish Word_Instances from Word_Classes in WordNet, in 3 steps (see the sketch below):
  1) collect all the hyponyms of several high-level synsets (e.g. person#1, social_group#1, location#1, measure#1, etc.)
  2) separate capitalized words from lower-case words: capitalized words → Word_Instances; lower-case words → Word_Classes
  3) a manual filter is necessary ("Italian" is capitalized, but it is not an Instance!)
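
A minimal sketch of steps 1-2 with NLTK. Note that the WordNet 1.6 used in the paper has no explicit instance marking, while newer WordNet versions do (via instance hyponyms), so both relation types are traversed here; the root synset name and the manual-filter example are assumptions:

```python
from nltk.corpus import wordnet as wn

def split_instances_classes(root_name):
    """Step 1: collect all hyponyms; step 2: split on capitalization."""
    instances, classes = set(), set()
    root = wn.synset(root_name)
    for syn in root.closure(lambda s: s.hyponyms() + s.instance_hyponyms()):
        for lemma in syn.lemma_names():
            # capitalized lemmas are candidate Word_Instances,
            # lower-case lemmas candidate Word_Classes
            (instances if lemma[0].isupper() else classes).add(lemma)
    return instances, classes

instances, classes = split_instances_classes('person.n.01')
# Step 3: manual filter, e.g. "Italian" is capitalized but not an Instance
instances.discard('Italian')
classes.add('Italian')
```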
18. Distribution of Word Classes and Word Instances in MultiWordNet

                 ENG Classes   ENG Instances   ITA Classes   ITA Instances
PERSON                  6775            1202          5982             348
LOCATION                1591            2173           979             950
ORGANIZ.                1405             498           890             297
TOTAL                   9771            3873          7851            1595
19. System Architecture (NERD)
- Preprocessing:
  - tokenization
  - POS tagging
  - multiword recognition
- Basic rules application:
  - ~400 language-specific basic rules, both for English and Italian, are applied to find and tag all the possible NEs present in the input text
- Composition rules application:
  - higher-level language-independent rules for handling ambiguities between possible multiple tags and for co-reference resolution
20. Basic Rules I
- English basic rule for capturing IE
- Example: "Galileo invented the telescope"

PATTERN (t1):
  t1: sense = (person-name-p t1 English)
OUTPUT: <PERSON>t1</PERSON>

- NOTE: the WN-pred person-name-p is satisfied by any of the 1202 English Instances of the category PERSON
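
A sketch of this single-token rule, assuming a person_name_p predicate analogous to the location_p sketch above (defined over person#1 and spiritual_being#1) and the same hypothetical token dictionaries:

```python
def apply_person_instance_rule(tokens, person_name_p, lang='eng'):
    """Tag any token whose lemma is a known PERSON Word_Instance."""
    for t in tokens:
        if person_name_p(t['lemma'], lang):  # e.g. "Galileo"
            t['tag'] = 'PERSON'              # <PERSON>t1</PERSON>
    return tokens
```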
21. Basic Rules II
- Italian basic rule for capturing IE
- Example: "il telescopio fu inventato da Galileo" ("the telescope was invented by Galileo")

PATTERN (t1):
  t1: sense = (person-name-p t1 Italian)
OUTPUT: <PERSON>t1</PERSON>

- NOTE: here, the WN-pred person-name-p is satisfied by any of the 1550 Instances (1202 for English + 348 for Italian) of the category PERSON
22. Basic Rules III
- Basic rule for capturing EE (via trigger words)
- Example: "Roma è la capitale italiana" ("Rome is the Italian capital")

PATTERN (t1 t2 t3 t4):
  t1: pos = NP, ort = Cap
  t2: lemma = essere
  t3: pos = DT
  t4: sense = (location-p t4 Italian)
OUTPUT: <LOCATION>t1</LOCATION>

- NOTE: the WN-pred location-p is satisfied by any of the 979 Italian Classes of the category LOCATION
23. Basic Rules IV
- Basic rule for capturing EE (via sentence structure)
- Example: "Bowman, who was appointed by Reagan"

PATTERN (t1 t2 t3):
  t1: pos = NP, ort = Cap
  t2: lemma = ,
  t3: lemma = who
OUTPUT: <PERSON>t1</PERSON>

- NOTE: External Evidence can be captured from the context even in the absence of particular word senses
24. Composition Rules
- Input: a tagged text with all the possible Named Entities
- Output: a tagged text where
  - overlaps and inclusions between tags are removed
  - co-references are resolved
25. Composition Rules II
- Composition rule for handling tag inclusions
- Example: "... 200 miles from New York ...", where NE2 = "200" (TAG B = CARDINAL) is included in NE1 = "200 miles" (TAG A = MEASURE)

PATTERN (NE1, NE2):
  NE1: start = n, end = m, TAG = A
  NE2: start = o (n ≤ o < m), end = p (o < p ≤ m), TAG = B ≠ A
OUTPUT: NE1: start = n, end = m, TAG = A (the included NE2 is discarded)
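
A minimal sketch of the inclusion rule over candidate NEs represented as (start, end, tag) triples (a representation assumed for illustration):

```python
def drop_included(candidates):
    """Keep only maximal NE spans; discard NEs nested inside a larger one."""
    kept = []
    # widest spans first: sort by start ascending, then end descending
    for start, end, tag in sorted(candidates, key=lambda x: (x[0], -x[1])):
        if any(ks <= start and end <= ke for ks, ke, _ in kept):
            continue  # e.g. CARDINAL "200" inside MEASURE "200 miles"
        kept.append((start, end, tag))
    return kept

print(drop_included([(0, 9, 'MEASURE'), (0, 3, 'CARDINAL')]))
# [(0, 9, 'MEASURE')]
```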
26. Composition Rules III
- Composition rule for co-reference resolution
- Example: "... with Judge Pasco Bowman. Bowman was ..." ("Bowman" inherits the tag of "Pasco Bowman")

PATTERN (NE1 ... NE2):
  NE1: entity = α, TAG = A
  NE2: entity = β (β substring of α), TAG = NAMEX
OUTPUT: NE2: entity = β, TAG = A
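
A sketch of this rule, assuming NEs carry their surface string and tag; the generic NAMEX tag marks an entity whose specific category is still ambiguous:

```python
def resolve_coreference(nes):
    """Propagate a specific tag to later NAMEX mentions that are substrings."""
    for i, (text_i, tag_i) in enumerate(nes):
        if tag_i != 'NAMEX':
            continue
        for text_j, tag_j in nes[:i]:
            if tag_j != 'NAMEX' and text_i in text_j:
                nes[i] = (text_i, tag_j)  # "Bowman" inherits PERSON from "Pasco Bowman"
                break
    return nes

print(resolve_coreference([('Pasco Bowman', 'PERSON'), ('Bowman', 'NAMEX')]))
# [('Pasco Bowman', 'PERSON'), ('Bowman', 'PERSON')]
```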
27. Experiment
- DARPA/NIST HUB-4 competition test corpora and scoring software
- Categories: PERSON, LOCATION, ORGANIZATION
- Reference tagged corpora:
  - English: 365 KB of newswire texts
  - Italian: 77 KB of transcripts from two Italian broadcast news shows (7,000 words, 322 NEs)
- F-measure, Precision and Recall computed by comparing the reference corpora with the automatically tagged ones
- Type, content, and extension of each NE are considered
28. Results

                    Recall           Precision        F-Measure
                    ITA     ENG      ITA     ENG      ITA     ENG
PERSON              91.48   87.29    85.08   88.38    88.16   87.83
LOCATION            97.27   92.16    80.45   81.17    88.07   86.32
ORGANIZATION        83.88   82.71    72.70   83.02    77.89   82.87
All categories      91.32   87.28    74.75   82.99    82.21   84.12
29. Conclusion and Future Work
- We presented a NE recognition system based on information represented in WordNet
- Language-independent predicates for NEs have been defined
- Results on two languages show that the approach performs on a par with state-of-the-art rule-based systems
- The system has been successfully integrated in a QA system
- Future work:
  - move to WordNet 2.0
  - integrate gazetteers
  - use SUMO concepts