Title: InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction
1. InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction
NAACL-HLT Workshop on the Analysis of Geographic References, May 31, 2003, Edmonton, Alberta
Huifeng Li, Rohini K. Srihari, Cheng Niu, and Wei Li - Cymfony Inc.
2. Contents
- Overview of the Information Extraction system InfoXtract
- Introduction to Location Normalization (LocNZ)
- Task of LocNZ
- Problems and proposed method
- Algorithm for LocNZ
- Experimental evaluation
- Future work
3. Overview of InfoXtract
- InfoXtract produces the following information objects from a text:
  - Named Entities (NEs): Bill Gates, chairman of Microsoft.
  - Correlated Entities (CEs): Bill Gates, chairman of Microsoft...
  - Subject-Verb-Object (SVO) triples: both syntactic and semantic forms of the structures
  - Entity Profiles: profiles for entity types like people and organizations
  - General Events (GEs): domain-independent events
    - Argument structures centering around a verb with the associated information: who did what to whom, when (or how often), and where
  - Predefined Events (PEs): domain-specific events
- The system integrates NLP and machine learning components into IE:
  - POS tagging
  - Shallow and deep parsing
  - Named Entity tagging
  - Combining supervised and unsupervised machine learning techniques
  - Concept-based analysis
  - Word sense disambiguation
4. InfoXtract Architecture
[Architecture diagram: a Document Processor (Web Manager, Document Manager, Tokenizer, Process Manager) receives the source document over HTTP, zones the text, and passes a tokenlist to the Linguistic Processors, which form a pipeline: Lexicon Lookup, POS Tagging, Named Entity Detection, Time Normalization, Location Normalization, Shallow Parsing, Deep Parsing, Relationship Detection, Alias/Coreference Linking, Pragmatic Filtering, and Profile/Event Merging and Linking. The pipeline produces NE, CE, SVO, CO, CGE, and PE output, returned as formatted XML in the HTTP response, with extracted-document and error logs. Knowledge Resources (Lexicon Resources, Grammars, Language Models) support the processors. Legend: each module is a grammar module, a procedure or statistical model, or a hybrid module; components communicate via HTTP and CORBA.]
5. Introduction to Location Normalization
- Task of location normalization (LocNZ):
  - Identify the correct sense of an ambiguous location named entity
    - (1) Decide whether a location name is a city, a province, or a country
      - Supports the NE tagger in deciding sub-tags
      - New York (NeLoc) → New York (NeLoc, NeCty)
    - (2) Decide which province or country a city, island, or state belongs to
      - 18 states have a city named Boston
      - Boston → Alabama, Arkansas, Massachusetts, Missouri, ...
- Results of LocNZ can be used to:
  - (1) Support event extraction, merging, and event visualization
    - Indicate where the event occurred
  - (2) Support profile generation
    - Provide location information for a person or an organization
  - (3) Support question answering
    - Provide the location area for document categorization
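As a minimal sketch of the ambiguity LocNZ faces, a gazetteer lookup can return every candidate sense of a name; the entries below are illustrative, not actual Tipster Gazetteer data.

```python
# Toy gazetteer illustrating LocNZ ambiguity; entries are illustrative,
# not actual Tipster Gazetteer data.
GAZETTEER = {
    "Boston": [
        ("CITY", "Massachusetts", "United States"),
        ("CITY", "Alabama", "United States"),
        ("CITY", "Arkansas", "United States"),
    ],
    "Canada": [
        ("COUNTRY", None, "Canada"),
        ("CITY", "Kansas", "United States"),
        ("CITY", "Kentucky", "United States"),
    ],
}

def candidate_senses(name):
    """Return all gazetteer senses (type, province, country) for a name."""
    return GAZETTEER.get(name, [])

print(candidate_senses("Canada"))  # one country sense plus two city senses
```

Disambiguation then amounts to picking one tuple per mention, which is what the rest of the deck addresses.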
6. Event and Profile Generation
- Event template: argument structures centering around a verb with the associated information
- Profile template: presents the subject's most noteworthy characteristics and achievements

<PersonProfile 001>
  Name: Julian Werner Hill
  Position: Research chemist
  Age: 91
  Birth-place: <LocationProfile 100>
  Affiliation: Du Pont Co.
  Education: MIT

<LocationProfile 100>
  Name: St. Louis
  State: Missouri
  Country: United States of America
  Zipcode: 63101
  Latitude: 38.634616
  Longitude: 90.191313
  Related_profiles: <PersonProfile 001>

Input: Alvin Karloff was replaced by John Doe as CEO of ABC at New York last month.

<General Event id=200>
  key verb: replace
  who: John Doe
  whom-what: Alvin Karloff
  complement: CEO of ABC
  when: last month
  where: <LocationProfile 101>
7. Event Visualization
- The result of LocNZ indicates the place where an event occurred

<Die Event 200>
  Event type: Die
  Who: <Julian Werner Hill, PersonProfile 001>
  When: 1996-01-07
  Where: <LocationProfile 103>
  Preceding_event: <hospitalize, Event 260>
  Subsequent_event: <bury, Event 250>

Visualized event:
  Predicate: Die
  Who: Julian Werner Hill
  When: 1996-01-07
  Where: <LocationProfile 103>

[Event visualization screenshot]
8. Problems in Location Normalization
- Differences between LocNZ and general WSD:
  - Selectional restriction is not sufficient
    - WSD for verb sense tagging relies mainly on co-occurrence constraints of semantic structures, Verb-Subject and Verb-Object in particular
    - LocNZ depends primarily on the co-occurrence of related location entities in the same discourse (text)
  - Fewer clues in a text than for verb and noun sense disambiguation
    - "located in" can only indicate that San Francisco is a location
    - Example: The Golden Gate Bridge is located in San Francisco
- Lack of sources for default senses of location names
  - The Tipster Gazetteer provides only a small part of the default senses
- Little previous research on solving LocNZ
9. Major Types of Ambiguities
- City versus country and state name ambiguity:
  - Canada (CITY), Kansas (PROVINCE 1), United States (COUNTRY)
  - Canada (CITY), Kentucky (PROVINCE 1), United States (COUNTRY)
  - Canada (COUNTRY)
- New York state versus New York city
- Same city name among different provinces:
  - 33 Washington entries in the Gazetteer:
    - Washington (CITY), Arkansas (PROVINCE 1), United States (COUNTRY)
    - Washington (CITY), California (PROVINCE 1), United States (COUNTRY)
    - Washington (CITY), Connecticut (PROVINCE 1), United States (COUNTRY)
    - Washington (CITY), District of Columbia (PROVINCE 1), United States (COUNTRY)
    - Washington (CITY), Georgia (PROVINCE 1), United States (COUNTRY)
    - Washington (CITY), Illinois (PROVINCE 1), United States (COUNTRY)
    - Washington (CITY), Indiana (PROVINCE 1), United States (COUNTRY)
    - Washington (CITY), Iowa (PROVINCE 1), United States (COUNTRY)
    - Washington (CITY), Kansas (PROVINCE 1), United States (COUNTRY)
    - Washington (CITY), Kentucky (PROVINCE 1), United States (COUNTRY)
    - ...
10. Example of a Text with Location Names
CNN news: http://www.cnn.com/2003/WEATHER/02/19/winter.storm.delays.ap/index.html

- A traveler gets the bad news as he looks at the departures list that shows all canceled flights at the Philadelphia International Airport.
- MIAMI (AP) -- Travelers heading to and from the Northeast faced continued uncertainty Tuesday, even as airports in the mid-Atlantic region began slowly digging themselves out from one of the worst winter storms on record.
- No flights left Florida for Baltimore-Washington International Airport until Tuesday afternoon. That airport was one of the hardest-hit by the storm, with a snowfall total of 28 inches.
- Rosanna Blum, 38, of Hunt Valley, Maryland, had a confirmed seat on a Miami to Baltimore flight Tuesday afternoon, but still wasn't optimistic that she'd actually have the chance to use it.
- Theresa York, from Maryland, works the phones at Miami Airport as she tries to find a flight back home.
- "It's surreal," said Dawn Shuford, 35, as she reclined against her suitcase in a darkened hallway at BWI. She'd been trying since Sunday morning to get home to Seattle.
- The Washington area's two other airports, Reagan National and Dulles, also had limited service.
- Marty Legrow, from Connecticut, rests on her suitcase at Ronald Reagan National Airport in Washington.
- Philadelphia International Airport resumed operations Tuesday but still expected to cancel about one-third of its flights. Flights slowly resumed at New York's LaGuardia, Kennedy and Newark airports, and Boston's Logan, where more than 2 feet of snow fell, had one runway open.
- Margie D'Onofrio, 48, of King Of Prussia, Pennsylvania, and a travel companion left the Bahamas on Sunday, hoping to fly back to Philadelphia. They made it to Miami, and D'Onofrio said she did not expect to be home anytime Tuesday.
- Passengers camped out overnight at many airports. Many fliers called ahead Tuesday and weren't clogging airports unnecessarily, Orlando International Airport spokeswoman Carolyn Fennell said.
11. Our Previous Method (Li et al., 2002)
- (1) Lexical grammar processing with local context
  - Identify City or State
    - City of Buffalo; New York State
  - Disambiguate the meaning of a word
    - e.g. Williamsville, New York, USA
    - e.g. Brussels, Belgium
  - Propagate the analysis result within the text where it appears
    - One sense per discourse (Gale, Church, and Yarowsky, 1992)
- (2) Construct a graph and calculate a maximum-weight spanning tree, using global information, with Kruskal's algorithm
  - Node: a location name sense
  - Edge: similarity weight between two location name senses
  - Calculate similarities between locations in the graph by referring to a predefined similarity table
  - Choose the maximum-weight spanning tree that reflects the most probable location senses in the document
- (3) Default sense application
  - If the similarity value is lower than a threshold, apply default senses
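The spanning-tree step can be sketched with a generic Kruskal maximum-weight spanning tree over a toy sense graph; the node names and weights below are invented for illustration (the real weights come from the predefined similarity table).

```python
# Sketch of step (2): maximum-weight spanning tree via Kruskal's algorithm
# with union-find. Nodes stand for location-name senses; weights are toy
# similarity values, not the paper's similarity table.

def kruskal_max_spanning_tree(nodes, edges):
    """edges: list of (weight, u, v). Returns a max-weight spanning forest."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):  # heaviest edges first
        ru, rv = find(u), find(v)
        if ru != rv:  # adding this edge creates no cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

# Two ambiguous mentions; the heavy edges favor the coherent sense pair.
nodes = ["Buffalo/NY", "Buffalo/TX", "Albany/NY", "Albany/GA"]
edges = [(3, "Buffalo/NY", "Albany/NY"), (1, "Buffalo/NY", "Albany/GA"),
         (1, "Buffalo/TX", "Albany/NY"), (2, "Buffalo/TX", "Albany/GA")]
print(kruskal_max_spanning_tree(nodes, edges))
```

The full-sort over all edges in this sketch is exactly the cost that slide 12 identifies as a problem when each location carries many senses.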
12. Problems of the Previous Method
- MST calculation requires sorting all the weighted edges
  - When there are many locations, each with over 20 senses, the number of edges grows sharply, edge sorting becomes time-consuming, and the weight values are not distinctive enough
- Solution: adopt Prim's algorithm for the MST, combined with heuristics
  - If a location has a country sense, select that sense as the default sense of that location (heuristic 1)
  - If a location has province or capital senses, select that sense as the default sense after local context application (heuristic 2)
- The number of location mentions and the distance between them are taken into account
  - The previous method could not reflect these factors
  - Assign weights to the sense nodes in the constructed graph
  - Choose the node with the maximum weight
13. Weight Calculation
Table 1: Impact weight of Sense2 on Sense1
14Weight Assigned to Sense Nodes
Canada Kansas, Kentucky, Country
Vancouver British Columbia Washington
port in USA Port in Canada
Charlottetown Prov in USA, New York City,
Toronto (Ontorio, New South Wales, Illinois,
New York Prov in USA, New York City,
Quebec (city in Quebec, Quebec Prov, Connecticut,
Prince Edward Island Island in Canada, Island in
South Africa, Province in Canada
15. Modified Algorithm
- (1) Look up the location gazetteer to associate candidate senses with each location NE
- (2) If a location has a country sense, select that sense as the default sense of that location (heuristic)
- (3) Call the pattern-matching sub-module for local patterns like "Williamsville, New York, USA"
- (4) Apply the one-sense-per-discourse principle for each disambiguated location name to propagate the selected sense to its other mentions within a document
- (5) Apply default-sense heuristics for a location with province or capital senses
- (6) Call Prim's algorithm in the discourse sub-module to resolve the remaining ambiguities
- (7) If the difference between the sense with the maximum weight and the sense with the next-largest weight is equal to or lower than a threshold, choose the default sense of that name from the lexicon; otherwise, choose the sense with the maximum weight as output
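The final thresholded decision in step (7) can be sketched as follows; the threshold value and sense labels are illustrative, not the system's actual settings.

```python
# Sketch of step (7): if the two highest sense weights are too close, the
# evidence is not distinctive enough, so fall back to the lexicon's default
# sense; otherwise output the maximum-weight sense. Threshold is illustrative.

def pick_sense(weighted_senses, default_sense, threshold=0.5):
    """weighted_senses: {sense: weight}. Returns the chosen sense."""
    ranked = sorted(weighted_senses.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best, w1), (_, w2) = ranked[0], ranked[1]
    if w1 - w2 <= threshold:  # not distinctive enough: use the default sense
        return default_sense
    return best

print(pick_sense({"city/NY": 2.5, "prov/USA": 2.4}, "prov/USA"))  # near tie
print(pick_sense({"city/NY": 2.5, "prov/USA": 0.5}, "prov/USA"))  # clear win
```

This fallback is what lets the hybrid method combine graph evidence with default senses, the combination credited with the best results in the discussion slide.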
16. Experimental Evaluation
17. Discussion
- Note: Columns 5-9 used the default-sense heuristics
- Local patterns (Col-4) alone contribute 12% to the overall performance
- Proper use of default senses and the heuristics (Col-5) can achieve close to 90%
- Prim's algorithm (Col-7) is clearly better than the previous method using Kruskal's algorithm (Col-6), by 13%
- But neither method alone can outperform default senses
- Using all three types of evidence, the new hybrid method reaches the 96% performance shown in Col-9
18. Future Work
- Extend the scope of location normalization
  - Extend the processing scope
    - Physical structures: famous buildings, bridges, airports, lakes, street names, ...
  - Extend the gazetteer
- Introduce more context information for disambiguation
- Upgrade default meaning assignment