Using WordNet Predicates for Multilingual Named Entity Recognition
1
Using WordNet Predicates for Multilingual Named
Entity Recognition
  • Matteo Negri and Bernardo Magnini
  • ITC-irst
  • Centro per la Ricerca Scientifica e Tecnologica,
    Trento - Italy
  • {negri,magnini}@itc.it
  • GWC04, Brno (Czech Republic), January 23, 2004

2
Outline
  • Named Entity Recognition (NER)
  • Rule-based approach using WordNet information
  • WordNet Predicates (language independent)
  • Internal evidence: Word_Instances
  • External evidence: Word_Classes
  • System architecture
  • Experiments and results on English and Italian
  • Future work

3
Named Entity Recognition (NER)
  • Given a written text, identify and categorize:
  • Entity names (e.g. persons, organizations,
    location names)
  • Temporal expressions (e.g. dates and time)
  • Numerical expressions (e.g. monetary values and
    percentages)
  • NER is crucial for Information Extraction,
    Question Answering and Information Retrieval
  • Up to 10% of a newswire text may consist of
    proper names, dates, times, etc.

4
NER for Question Answering
Q1848: What was the name of the plane that dropped the Atomic Bomb on Hiroshima?
Candidate answer types: PERSON, DATE, LOCATION, OTHER

"Tibbets piloted the Boeing B-29 Superfortress Enola Gay, which dropped the atomic bomb on Hiroshima on Aug. 6, 1945, causing an estimated 66,000 to 240,000 deaths. He named the plane after his mother, Enola Gay Tibbets."
5
Named Entity Hierarchy
ENTITY
 ├─ NAMEX
 │   ├─ PERSON
 │   ├─ ORGANIZATION
 │   └─ LOCATION
 ├─ TIMEX
 │   ├─ DATE
 │   ├─ TIME
 │   └─ DURATION
 ├─ MEASURE
 │   ├─ MONEY
 │   ├─ CARDINAL
 │   └─ PERCENT
 └─ OTHER
6
Motivations
  1. Explore how far we can go with NER using
    WordNet as the main source of semantic knowledge
    for one language
  2. Isolate language-independent knowledge relevant
    to the NER task
  3. Experiment with a multilingual approach taking
    advantage of aligned wordnets (e.g.
    English/Italian)

7
Knowledge-Based NER
  • Combination of a wide range of knowledge sources
  • lexical, syntactic, and semantic features of the
    input text
  • world knowledge (e.g. gazetteers)
  • discourse level information (e.g. co-reference
    resolution)

8
Rule-Based approach
  • Example:  Rome  is  the  capital  of Italy
              (t1) (t2) (t3)   (t4)

PATTERN: t1 t2 t3 t4
  t1: pos=NP, ort=Cap
  t2: lemma=be
  t3: pos=DT
  t4: sense=(location-p t4 English)
OUTPUT: <LOCATION>t1</LOCATION>

Result: <LOCATION>Rome</LOCATION> is the capital of Italy
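The pattern above can be sketched as a sequence of per-token feature tests. This is a minimal sketch, not the actual NERD implementation: the token features pos/ort/lemma follow the slide, while the dictionary-based location_p stand-in and the function names are illustrative assumptions.

```python
# A rule is a list of per-token tests; tokens are dicts of features.
def location_p(word, language):
    # Stand-in for the WordNet predicate introduced later in the talk.
    return word in {'capital', 'city', 'river'}

rule = [
    lambda t: t['pos'] == 'NP' and t['ort'] == 'Cap',  # t1: capitalized proper noun
    lambda t: t['lemma'] == 'be',                      # t2
    lambda t: t['pos'] == 'DT',                        # t3
    lambda t: location_p(t['lemma'], 'English'),       # t4: trigger word
]

def apply_rule(tokens, rule, tag):
    """Slide a window over the tokens; where every test matches,
    wrap the first token of the window in <tag>...</tag>."""
    out = [t['word'] for t in tokens]
    for i in range(len(tokens) - len(rule) + 1):
        window = tokens[i:i + len(rule)]
        if all(test(tok) for test, tok in zip(rule, window)):
            out[i] = f'<{tag}>{out[i]}</{tag}>'
    return out

sentence = [
    {'word': 'Rome',    'lemma': 'Rome',    'pos': 'NP',  'ort': 'Cap'},
    {'word': 'is',      'lemma': 'be',      'pos': 'VBZ', 'ort': 'Low'},
    {'word': 'the',     'lemma': 'the',     'pos': 'DT',  'ort': 'Low'},
    {'word': 'capital', 'lemma': 'capital', 'pos': 'NN',  'ort': 'Low'},
]
print(' '.join(apply_rule(sentence, rule, 'LOCATION')))
# <LOCATION>Rome</LOCATION> is the capital
```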
9
WordNet Predicates (1) (WN-preds)
  • WN-preds are defined over a set of WordNet
    synsets which express a certain concept
  • E.g. location-p is defined over {location#1,
    solid_ground#1, mandate#2, geological_formation#1,
    road#1, body_of_water#1}
  • Other predicates: person-p, measure-p, ...

10
WordNet Predicates (2)
  • Input:
  • a word w and a language L
  • Output:
  • a boolean value (TRUE or FALSE)
  • TRUE if there exists at least one sense of w which
    is subsumed by at least one of the synsets
    defining the predicate

(location-p lake English) = TRUE,
because there exists a sense of lake (lake#1)
which is subsumed by one of the synsets that
define the predicate (i.e. body_of_water#1)
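A minimal sketch of how such a predicate could be evaluated, assuming a toy hypernym graph in place of real WordNet data (the HYPERNYMS and SENSES tables, the extra lake#2 sense, and all function names are illustrative assumptions):

```python
# Toy hypernym graph standing in for WordNet: synset -> its hypernyms.
HYPERNYMS = {
    'lake#1': ['body_of_water#1'],
    'body_of_water#1': ['location#1'],
    'lake#2': ['pit#1'],              # hypothetical extra sense
    'pit#1': ['cavity#1'],
}
# Toy sense inventory: (word, language) -> senses. With aligned
# wordnets, words of different languages point at shared synsets.
SENSES = {
    ('lake', 'English'): ['lake#1', 'lake#2'],
    ('lago', 'Italian'): ['lake#1'],
}

def subsumed_by(synset, targets):
    """True if `synset` equals, or has as an ancestor, one of `targets`."""
    stack = [synset]
    while stack:
        s = stack.pop()
        if s in targets:
            return True
        stack.extend(HYPERNYMS.get(s, []))
    return False

def wn_predicate(defining_synsets, word, language):
    """A WN-pred: TRUE iff at least one sense of `word` in `language`
    is subsumed by at least one of the defining synsets."""
    return any(subsumed_by(sense, set(defining_synsets))
               for sense in SENSES.get((word, language), []))

LOCATION_P = ['location#1', 'solid_ground#1', 'body_of_water#1']
print(wn_predicate(LOCATION_P, 'lake', 'English'))  # True
print(wn_predicate(LOCATION_P, 'lago', 'Italian'))  # True (same synset)
```

Because the predicate is stated over synsets, changing the language argument requires no change to the predicate definition, which is the point made on the "WordNet Predicates (4)" slide below.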
11
WordNet Predicates (3)
  • WN-preds have been created for the following NE
    categories:
  • PERSON: person-name-p (person#1, spiritual_being#1)
  •   person-class-p (person#1, spiritual_being#1)
  •   first-name-p (person#1, spiritual_being#1)
  •   person-product-p (artifact#1)
  • LOCATION: location-name-p (location#1, road#1,
    mandate#1, body_of_water#1, solid_ground#1,
    geological_formation#1)
  •   location-class-p (location#1, road#1,
    mandate#1, body_of_water#1, solid_ground#1,
    geological_formation#1)
  •   movement-verb-p (locomote#1)
  • ORGANIZATION: org-name-p (organization#1)
  •   org-class-p (organization#1)
  •   org-representative-p (trainer#1, top_dog,
    spokesperson#1)
  • MEASURE: measure-unit-p (measure#1, ...)
  •   number-p (digit#1, large_integer#1,
    common_fraction#1)
  • MONEY: money-p (monetary_unit#1, coin#1)
  • DATE: date-p (time_period#1)

12
WordNet Predicates (4)
  • The definition of a WordNet predicate is
    language-independent.
  • With aligned wordnets, WN-preds can easily be
    parametrized with respect to a given language
    without changing the predicate definition
  • E.g. (location-p lake English)
  •      (location-p lago Italian)

13
Knowledge-Based NER
  • Two kinds of information are usually
    distinguished in Named Entity Recognition
    (McDonald, 1996)
  • Internal Evidence: provided by the candidate
    string itself (e.g. Rome)
  • Drawbacks:
  • Dimension of reliable gazetteers
  • Maintenance (gazetteers are never exhaustive)
  • Overlap among the lists (Washington: person or
    location?)
  • Limited availability for languages other than
    English
  • External Evidence: provided by the context in
    which the string appears (e.g. capital)

14
Mining Evidence from WordNet
  • Both IE and EE can be mined from WordNet
  • Low coverage of Internal Evidence (e.g. person
    names)
  • High coverage of trigger words
  • Approach: distinguish between Word_Instances
    (e.g. Nile#1) and Word_Classes (e.g.
    river#1)
  • Problem: in WordNet such a distinction is not
    explicit!

15
Word Classes and Word Instances I

person
 ├─ Italian
 └─ intellectual
     └─ scientist
         └─ physicist
             └─ astronomer
                 ├─ Galileo_Galilei
                 └─ Kepler

In WordNet, the hyponyms of the synset person#1
are a mixture of concepts (e.g. astronomer,
physicist, etc.) and individuals (e.g.
Galileo_Galilei, Kepler, etc.)
16
Word Classes and Word Instances (1)
person                              \
 ├─ Italian                         |
 └─ intellectual                    |  EE (Word_Classes)
     └─ scientist                   |
         └─ physicist               |
             └─ astronomer          /
                 ├─ Galileo_Galilei \  IE (Word_Instances)
                 └─ Kepler          /

- NOTE: in WordNet, the hyponyms of the synset
person#1 are a mixture of concepts (e.g.
astronomer, physicist, etc.) and individuals
(e.g. Galileo_Galilei, Kepler, etc.)
17
Word Classes and Word Instances (2)
  • Semi-automatic procedure to distinguish
    Word_Instances and Word_Classes in WordNet
  • 3 steps:
  • 1) collect all the hyponyms of several high-level
    synsets (e.g. person#1, social_group#1,
    location#1, measure#1, etc.)
  • 2) separate capitalized words from lowercase
    words:
  •    capitalized words → Word_Instances
  •    lowercase words → Word_Classes
  • 3) a manual filter is necessary:
  •    "Italian" is capitalized, but it is not an
       Instance!
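Step 2 of the procedure amounts to a capitalization test. A minimal sketch (the function name is an assumption):

```python
def split_hyponyms(hyponyms):
    """Step 2: route capitalized hyponyms to Word_Instances and
    lowercase ones to Word_Classes."""
    instances, classes = [], []
    for word in hyponyms:
        (instances if word[0].isupper() else classes).append(word)
    return instances, classes

inst, cls = split_hyponyms(
    ['astronomer', 'Galileo_Galilei', 'physicist', 'Kepler', 'Italian'])
print(inst)  # ['Galileo_Galilei', 'Kepler', 'Italian']
print(cls)   # ['astronomer', 'physicist']
```

Note that "Italian" lands among the instances, which is exactly why the manual filter of step 3 is necessary.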

18
Distribution of Word Classes and Word Instances
in MultiWordNet
              ENG Classes  ENG Instances  ITA Classes  ITA Instances
PERSON            6775         1202           5982          348
LOCATION          1591         2173            979          950
ORGANIZ.          1405          498            890          297
TOTAL             9771         3873           7851         1595
19
System Architecture (NERD)
  • Preprocessing
  • tokenization
  • POS tagging
  • multiword recognition
  • Basic rules application
  • ~400 language-specific basic rules, both for
    English and Italian, are applied to find and tag
    all the possible NEs present in the input text
  • Composition rules application
  • higher level language-independent rules for
    handling ambiguities between possible multiple
    tags and for co-reference resolution

20
Basic Rules I
  • English basic rule for capturing IE
  • Example: Galileo invented the telescope

PATTERN: t1
  t1: sense=(person-name-p t1 English)
OUTPUT: <PERSON>t1</PERSON>

  • NOTE: the WN-pred person-name-p is satisfied by
    any of the 1202 English Instances of the category
    PERSON

21
Basic Rules II
  • Italian basic rule for capturing IE
  • Example: il telescopio fu inventato da Galileo
    ("the telescope was invented by Galileo")

PATTERN: t1
  t1: sense=(person-name-p t1 Italian)
OUTPUT: <PERSON>t1</PERSON>

  • NOTE: here, the WN-pred person-name-p is
    satisfied by any of the 1550 Instances (1202 for
    English + 348 for Italian) of the category PERSON

22
Basic Rules III
  • Basic rule for capturing EE (via trigger words)
  • Example: Roma è la capitale italiana
    ("Rome is the Italian capital")

PATTERN: t1 t2 t3 t4
  t1: pos=NP, ort=Cap
  t2: lemma=essere
  t3: pos=DT
  t4: sense=(location-p t4 Italian)
OUTPUT: <LOCATION>t1</LOCATION>

  • NOTE: the WN-pred location-p is satisfied by any
    of the 979 Italian Classes of the category
    LOCATION

23
Basic Rules IV
  • Basic rule for capturing EE (via sentence
    structure)
  • Example: Bowman, who was appointed by Reagan

PATTERN: t1 t2 t3
  t1: pos=NP, ort=Cap
  t2: lemma=","
  t3: lemma=who
OUTPUT: <PERSON>t1</PERSON>

  • NOTE: External Evidence can be captured from the
    context even in the absence of particular word
    senses

24
Composition Rules
  • Input: tagged text with all the possible Named
    Entities
  • Output: a tagged text where
  • overlaps and inclusions between tags are removed
  • co-references are resolved

25
Composition Rules II
  • Composition rule for handling tag inclusions
  • Example: ... 200 miles from New York ...
    ("200 miles" = MEASURE (A), which includes
    "200" = CARDINAL (B))

PATTERN: NE1 NE2
  NE1: start=n, end=m, TAG=A
  NE2: start=o (n ≤ o < m), end=p (o < p ≤ m), TAG=B ≠ A
OUTPUT: NE1: start=n, end=m, TAG=A
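The inclusion rule can be sketched as a filter over entity spans; a minimal sketch with illustrative names (the actual system states this as a pattern rule, not as the function below):

```python
def remove_included(entities):
    """Keep only NEs whose span is not strictly contained in another
    NE's span. Each entity is (start, end, tag), end exclusive."""
    kept = []
    for start, end, tag in entities:
        contained = any(
            s <= start and end <= e and (s, e) != (start, end)
            for s, e, _ in entities)
        if not contained:
            kept.append((start, end, tag))
    return kept

# "200 miles": MEASURE spans tokens 0-2 and includes CARDINAL "200" (0-1)
print(remove_included([(0, 2, 'MEASURE'), (0, 1, 'CARDINAL')]))
# [(0, 2, 'MEASURE')]
```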
26
Composition Rules III
  • Composition rule for co-reference resolution
  • Example: with Judge Pasco Bowman. Bowman was
    ...

PATTERN: NE1 ... NE2
  NE1: entity=α, TAG=A
  NE2: entity=β, β substring of α, TAG=NAMEX
OUTPUT: NE2: entity=β, TAG=A
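The substring-based co-reference step can be sketched as follows (a minimal sketch under the assumption that unresolved mentions carry the generic NAMEX tag, as in the pattern above; the function name is illustrative):

```python
def resolve_coref(entities):
    """Give an untyped NAMEX the tag of an earlier typed NE whose
    string contains it. Each entity is [text, tag], in document order."""
    for i, entity in enumerate(entities):
        if entity[1] != 'NAMEX':
            continue
        for prior in entities[:i]:
            if prior[1] != 'NAMEX' and entity[0] in prior[0]:
                entity[1] = prior[1]   # e.g. "Bowman" inherits PERSON
                break
    return entities

print(resolve_coref([['Judge Pasco Bowman', 'PERSON'], ['Bowman', 'NAMEX']]))
# [['Judge Pasco Bowman', 'PERSON'], ['Bowman', 'PERSON']]
```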
27
Experiment
  • DARPA/NIST HUB4 competition test corpora and
    scoring software
  • Categories: PERSON, LOCATION, ORGANIZATION
  • Reference tagged corpora:
  • English: 365 KB of newswire texts
  • Italian: 77 KB of transcripts from two Italian
    broadcast news shows (7000 words, 322 NEs)
  • F-measure, Precision and Recall computed by
    comparing reference corpora with automatically
    tagged ones
  • type, content, and extension of each NE are
    considered
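The scoring follows the usual precision/recall/F-measure definitions; a minimal sketch with hypothetical counts (the numbers below are invented for illustration, not the paper's results):

```python
def precision_recall_f(correct, guessed, reference):
    """Standard NER scoring: precision = correct/guessed,
    recall = correct/reference, F = harmonic mean of the two."""
    precision = correct / guessed
    recall = correct / reference
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Hypothetical counts: 280 NEs tagged correctly out of 320 system
# guesses, against 322 NEs in the reference corpus.
p, r, f = precision_recall_f(280, 320, 322)
print(f'P={p:.2%}  R={r:.2%}  F={f:.2%}')
```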

28
Results
                    Recall          Precision       F-Measure
                  ITA     ENG     ITA     ENG     ITA     ENG
PERSON           91.48   87.29   85.08   88.38   88.16   87.83
LOCATION         97.27   92.16   80.45   81.17   88.07   86.32
ORGANIZATION     83.88   82.71   72.70   83.02   77.89   82.87
All categories   91.32   87.28   74.75   82.99   82.21   84.12
29
Conclusion and Future Work
  • We presented a NE recognition system based on
    information represented in WordNet
  • Language-independent predicates for NEs have been
    defined
  • Results on two languages show that the approach
    performs on a par with state-of-the-art
    rule-based systems
  • The system has been successfully integrated in a
    QA system
  • Future work:
  • move to WordNet 2.0
  • integrate gazetteers
  • use SUMO concepts