Title: Using WordNet Predicates for Multilingual Named Entity Recognition
1. Using WordNet Predicates for Multilingual Named Entity Recognition
- Matteo Negri and Bernardo Magnini
- ITC-irst, Centro per la Ricerca Scientifica e Tecnologica, Trento, Italy
- negri,magnini_at_itc.it
- GWC04, Brno (Czech Republic), January 23, 2004
2. Outline
- Named Entity Recognition (NER)
- Rule-based approach using WordNet information
- WordNet Predicates (language independent)
  - Internal evidence: Word_Instances
  - External evidence: Word_Classes
- System architecture
- Experiments and results on English and Italian
- Future work
3. Named Entity Recognition (NER)
- Given a written text, identify and categorize:
  - entity names (e.g. persons, organizations, location names)
  - temporal expressions (e.g. dates and times)
  - numerical expressions (e.g. monetary values and percentages)
- NER is crucial for Information Extraction, Question Answering and Information Retrieval
- Up to 10% of a newswire text may consist of proper names, dates, times, etc.
4. NER for Question Answering
- Q1848: What was the name of the plane that dropped the Atomic Bomb on Hiroshima?
- [NE categories highlighted in the passage: PERSON, DATE, LOCATION, OTHER]
- "Tibbets piloted the Boeing B-29 Superfortress Enola Gay, which dropped the atomic bomb on Hiroshima on Aug. 6, 1945, causing an estimated 66,000 to 240,000 deaths. He named the plane after his mother, Enola Gay Tibbets."
5. Named Entity Hierarchy

ENTITY
- NAMEX: PERSON, ORGANIZATION, LOCATION
- TIMEX: DATE, TIME, DURATION
- MEASURE: MONEY, CARDINAL, PERCENT
- OTHER
6. Motivations
- Test how far we can go with NER using WordNet as the main source of semantic knowledge for one language
- Isolate language-independent knowledge relevant to the NER task
- Experiment with a multilingual approach taking advantage of aligned wordnets (e.g. English/Italian)
7. Knowledge-Based NER
- Combination of a wide range of knowledge sources:
  - lexical, syntactic, and semantic features of the input text
  - world knowledge (e.g. gazetteers)
  - discourse-level information (e.g. co-reference resolution)
8. Rule-Based Approach
- Example: "Rome is the capital of Italy" (tokens: t1 = Rome, t2 = is, t3 = the, t4 = capital)

PATTERN (t1 t2 t3 t4):
  t1: pos = NP, ort = Cap
  t2: lemma = be
  t3: pos = DT
  t4: sense = (location-p t4 English)
OUTPUT: <LOCATION>t1</LOCATION>

Result: <LOCATION>Rome</LOCATION> is the capital of Italy
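
A minimal sketch of how such a rule could be applied to a tokenized, POS-tagged sentence; the token-dictionary fields and the location_p helper (sketched on a later slide) are illustrative assumptions, not the system's actual implementation:

```python
def apply_location_rule(tokens, location_p, lang='eng'):
    """Tag t1 as LOCATION when the 4-token pattern above matches."""
    for i in range(len(tokens) - 3):
        t1, t2, t3, t4 = tokens[i:i + 4]
        if (t1['pos'] == 'NP' and t1['ort'] == 'Cap'  # capitalized proper noun
                and t2['lemma'] == 'be'               # copula
                and t3['pos'] == 'DT'                 # determiner
                and location_p(t4['lemma'], lang)):   # trigger word, e.g. "capital"
            t1['tag'] = 'LOCATION'                    # <LOCATION>t1</LOCATION>
    return tokens
```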
9. WordNet Predicates (1) (WN-preds)
- WN-preds are defined over a set of WordNet synsets which express a certain concept
- E.g. location-p is defined over the synsets location#1, solid_ground#1, mandate#2, geological_formation#1, road#1 and body_of_water#1; other predicates include person-p and measure-p
10. WordNet Predicates (2)
- Input:
  - a word w and a language L
- Output:
  - a boolean value (TRUE or FALSE)
  - TRUE if there exists at least one sense of w which is subsumed by at least one of the synsets defining the predicate
- Example: (location-p <lake> <English>) = TRUE, because there exists a sense of lake (lake#1) which is subsumed by one of the synsets that define the predicate (i.e. body_of_water#1)
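
A minimal sketch of such a predicate using NLTK's WordNet interface; the NLTK synset identifiers below are assumptions mapped from the paper's notation (the paper used WordNet 1.6 / MultiWordNet):

```python
# Requires the WordNet corpus: nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# Synsets assumed to define location-p (cf. the list on the next slide)
LOCATION_P_SYNSETS = {
    wn.synset('location.n.01'),
    wn.synset('body_of_water.n.01'),
    wn.synset('geological_formation.n.01'),
    wn.synset('road.n.01'),
}

def location_p(word, lang='eng'):
    """TRUE iff some sense of `word` is subsumed by a defining synset."""
    for sense in wn.synsets(word, pos=wn.NOUN, lang=lang):
        # hypernym closure = all ancestors of this sense in the hierarchy
        ancestors = set(sense.closure(lambda s: s.hypernyms()))
        ancestors.add(sense)
        if ancestors & LOCATION_P_SYNSETS:
            return True
    return False

print(location_p('lake'))  # True: lake#1 is subsumed by body_of_water#1
```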
11. WordNet Predicates (3)
- WN-preds have been created for the following NE categories:
  - PERSON: person-name-p (person#1, spiritual_being#1); person-class-p (person#1, spiritual_being#1); first-name-p (person#1, spiritual_being#1); person-product-p (artifact#1)
  - LOCATION: location-name-p (location#1, road#1, mandate#1, body_of_water#1, solid_ground#1, geological_formation#1); location-class-p (same synsets as location-name-p); movement-verb-p (locomote#1)
  - ORGANIZATION: org-name-p (organization#1); org-class-p (organization#1); org-representative-p (trainer#1, top_dog, spokesperson#1)
  - MEASURE: measure-unit-p (measure#1, ...); number-p (digit#1, large_integer#1, common_fraction#1)
  - MONEY: money-p (monetary_unit#1, coin#1)
  - DATE: date-p (time_period#1)
12. WordNet Predicates (4)
- The definition of a WordNet predicate is language-independent
- In the case of aligned wordnets, WN-preds can easily be parametrized with respect to a given language without changing the predicate definition, e.g.:
  - (location-p lake English)
  - (location-p lago Italian)
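
With the location_p sketch above, the parametrization amounts to changing only the language argument; passing lang='ita' assumes the Open Multilingual WordNet data (nltk.download('omw-1.4')), standing in for the aligned MultiWordNet used in the paper:

```python
print(location_p('lake', 'eng'))  # True
print(location_p('lago', 'ita'))  # True: same defining synsets, Italian lexicalization
```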
13. Knowledge-Based NER
- Two kinds of information are usually distinguished in Named Entity Recognition (McDonald, 1996):
- Internal Evidence: provided by the candidate string itself (e.g. "Rome")
  - Drawbacks of gazetteers as a source of internal evidence:
    - dimension of reliable gazetteers
    - maintenance (gazetteers are never exhaustive)
    - overlap among the lists ("Washington": person or location?)
    - limited availability for languages other than English
- External Evidence: provided by the context in which the string appears (e.g. "capital")
14. Mining Evidence from WordNet
- Both IE and EE can be mined from WordNet:
  - low coverage of internal evidence (e.g. person names)
  - high coverage of trigger words
- Approach: distinguish between Word_Instances (e.g. Nile#1) and Word_Classes (e.g. river#1)
- Problem: in WordNet such a distinction is not explicit!
16. Word Classes and Word Instances (1)

[Figure: fragment of the WordNet noun hierarchy under person#1. The inner nodes (person, intellectual, scientist, physicist, astronomer, Italian, ...) are Word_Classes, exploited as External Evidence (EE); the leaves Galileo_Galilei and Kepler are Word_Instances, exploited as Internal Evidence (IE).]

- NOTE: in WordNet, the hyponyms of the synset person#1 are a mixture of concepts (e.g. astronomer, physicist, etc.) and individuals (e.g. Galileo Galilei, Kepler, etc.)
17. Word Classes and Word Instances (2)
- Semi-automatic procedure to distinguish Word_Instances from Word_Classes in WordNet, in 3 steps (see the sketch below):
  1) collect all the hyponyms of several high-level synsets (e.g. person#1, social_group#1, location#1, measure#1, etc.)
  2) separate capitalized words from lower-case words: capitalized words → Word_Instances; lower-case words → Word_Classes
  3) a manual filter is necessary ("Italian" is capitalized, but it is not an Instance!)
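
A minimal sketch of steps 1-2 with NLTK. Note that the WordNet 1.6 used in the paper has no explicit instance marking, while newer WordNet versions do (via instance hyponyms), so both relation types are traversed here; the root synset name and the manual-filter example are assumptions:

```python
from nltk.corpus import wordnet as wn

def split_instances_classes(root_name):
    """Step 1: collect all hyponyms; step 2: split on capitalization."""
    instances, classes = set(), set()
    root = wn.synset(root_name)
    for syn in root.closure(lambda s: s.hyponyms() + s.instance_hyponyms()):
        for lemma in syn.lemma_names():
            # capitalized lemmas are candidate Word_Instances,
            # lower-case lemmas candidate Word_Classes
            (instances if lemma[0].isupper() else classes).add(lemma)
    return instances, classes

instances, classes = split_instances_classes('person.n.01')
# Step 3: manual filter, e.g. "Italian" is capitalized but not an Instance
instances.discard('Italian')
classes.add('Italian')
```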
18. Distribution of Word Classes and Word Instances in MultiWordNet

                 ENG Classes   ENG Instances   ITA Classes   ITA Instances
PERSON                  6775            1202          5982             348
LOCATION                1591            2173           979             950
ORGANIZ.                1405             498           890             297
TOTAL                   9771            3873          7851            1595
19. System Architecture (NERD)
- Preprocessing:
  - tokenization
  - POS tagging
  - multiword recognition
- Basic rules application:
  - ~400 language-specific basic rules, both for English and Italian, are applied to find and tag all the possible NEs present in the input text
- Composition rules application:
  - higher-level language-independent rules for handling ambiguities between possible multiple tags and for co-reference resolution
20. Basic Rules I
- English basic rule for capturing IE
- Example: "Galileo invented the telescope"

PATTERN (t1):
  t1: sense = (person-name-p t1 English)
OUTPUT: <PERSON>t1</PERSON>

- NOTE: the WN-pred person-name-p is satisfied by any of the 1202 English Instances of the category PERSON
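
A sketch of this single-token rule, assuming a person_name_p predicate analogous to the location_p sketch above (defined over person#1 and spiritual_being#1) and the same hypothetical token dictionaries:

```python
def apply_person_instance_rule(tokens, person_name_p, lang='eng'):
    """Tag any token whose lemma is a known PERSON Word_Instance."""
    for t in tokens:
        if person_name_p(t['lemma'], lang):  # e.g. "Galileo"
            t['tag'] = 'PERSON'              # <PERSON>t1</PERSON>
    return tokens
```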
21. Basic Rules II
- Italian basic rule for capturing IE
- Example: "il telescopio fu inventato da Galileo" ("the telescope was invented by Galileo")

PATTERN (t1):
  t1: sense = (person-name-p t1 Italian)
OUTPUT: <PERSON>t1</PERSON>

- NOTE: here, the WN-pred person-name-p is satisfied by any of the 1550 Instances (1202 for English + 348 for Italian) of the category PERSON
22. Basic Rules III
- Basic rule for capturing EE (via trigger words)
- Example: "Roma è la capitale italiana" ("Rome is the Italian capital")

PATTERN (t1 t2 t3 t4):
  t1: pos = NP, ort = Cap
  t2: lemma = essere
  t3: pos = DT
  t4: sense = (location-p t4 Italian)
OUTPUT: <LOCATION>t1</LOCATION>

- NOTE: the WN-pred location-p is satisfied by any of the 979 Italian Classes of the category LOCATION
23. Basic Rules IV
- Basic rule for capturing EE (via sentence structure)
- Example: "Bowman, who was appointed by Reagan"

PATTERN (t1 t2 t3):
  t1: pos = NP, ort = Cap
  t2: lemma = ,
  t3: lemma = who
OUTPUT: <PERSON>t1</PERSON>

- NOTE: External Evidence can be captured from the context even in the absence of particular word senses
24. Composition Rules
- Input: a tagged text with all the possible Named Entities
- Output: a tagged text where
  - overlaps and inclusions between tags are removed
  - co-references are resolved
25. Composition Rules II
- Composition rule for handling tag inclusions
- Example: "... 200 miles from New York ...", where NE2 = "200" (TAG B = CARDINAL) is included in NE1 = "200 miles" (TAG A = MEASURE)

PATTERN (NE1, NE2):
  NE1: start = n, end = m, TAG = A
  NE2: start = o (n ≤ o < m), end = p (o < p ≤ m), TAG = B ≠ A
OUTPUT: NE1: start = n, end = m, TAG = A (the included NE2 is discarded)
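
A minimal sketch of the inclusion rule over candidate NEs represented as (start, end, tag) triples (a representation assumed for illustration):

```python
def drop_included(candidates):
    """Keep only maximal NE spans; discard NEs nested inside a larger one."""
    kept = []
    # widest spans first: sort by start ascending, then end descending
    for start, end, tag in sorted(candidates, key=lambda x: (x[0], -x[1])):
        if any(ks <= start and end <= ke for ks, ke, _ in kept):
            continue  # e.g. CARDINAL "200" inside MEASURE "200 miles"
        kept.append((start, end, tag))
    return kept

print(drop_included([(0, 9, 'MEASURE'), (0, 3, 'CARDINAL')]))
# [(0, 9, 'MEASURE')]
```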
26. Composition Rules III
- Composition rule for co-reference resolution
- Example: "... with Judge Pasco Bowman. Bowman was ..." ("Bowman" inherits the tag of "Pasco Bowman")

PATTERN (NE1 ... NE2):
  NE1: entity = α, TAG = A
  NE2: entity = β (β substring of α), TAG = NAMEX
OUTPUT: NE2: entity = β, TAG = A
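
A sketch of this rule, assuming NEs carry their surface string and tag; the generic NAMEX tag marks an entity whose specific category is still ambiguous:

```python
def resolve_coreference(nes):
    """Propagate a specific tag to later NAMEX mentions that are substrings."""
    for i, (text_i, tag_i) in enumerate(nes):
        if tag_i != 'NAMEX':
            continue
        for text_j, tag_j in nes[:i]:
            if tag_j != 'NAMEX' and text_i in text_j:
                nes[i] = (text_i, tag_j)  # "Bowman" inherits PERSON from "Pasco Bowman"
                break
    return nes

print(resolve_coreference([('Pasco Bowman', 'PERSON'), ('Bowman', 'NAMEX')]))
# [('Pasco Bowman', 'PERSON'), ('Bowman', 'PERSON')]
```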
27. Experiment
- DARPA/NIST HUB-4 competition test corpora and scoring software
- Categories: PERSON, LOCATION, ORGANIZATION
- Reference tagged corpora:
  - English: 365 KB of newswire texts
  - Italian: 77 KB of transcripts from two Italian broadcast news shows (7,000 words, 322 NEs)
- F-measure, Precision and Recall computed by comparing the reference corpora with the automatically tagged ones
- Type, content, and extension of each NE are considered
28. Results

                    Recall           Precision        F-Measure
                    ITA     ENG      ITA     ENG      ITA     ENG
PERSON              91.48   87.29    85.08   88.38    88.16   87.83
LOCATION            97.27   92.16    80.45   81.17    88.07   86.32
ORGANIZATION        83.88   82.71    72.70   83.02    77.89   82.87
All categories      91.32   87.28    74.75   82.99    82.21   84.12
29. Conclusion and Future Work
- We presented a NE recognition system based on information represented in WordNet
- Language-independent predicates for NEs have been defined
- Results on two languages show that the approach performs on a par with state-of-the-art rule-based systems
- The system has been successfully integrated in a QA system
- Future work:
  - move to WordNet 2.0
  - integrate gazetteers
  - use SUMO concepts