InfoXtract Location Normalization: A Hybrid Approach to Geographic References in Information Extraction

1
InfoXtract Location Normalization: A Hybrid
Approach to Geographic References in Information
Extraction. May 31, 2003, Edmonton, Alberta
  • NAACL-HLT Workshop on the Analysis of Geographic
    References
  • Huifeng Li, Rohini K. Srihari, Cheng Niu, and Wei
    Li
  • Cymfony Inc.

2
Contents
  • Overview of the information extraction system
    InfoXtract
  • Introduction to location normalization (LocNZ)
  • The LocNZ task
  • Problems and Proposed Method
  • Algorithm for LocNZ
  • Experimental Evaluation
  • Future Work

3
Overview of InfoXtract
  • InfoXtract produces the following information
    objects from a text:
  • Named Entities (NEs) - e.g. Bill Gates,
    Microsoft
  • Correlated Entities (CEs) - e.g. Bill Gates,
    chairman of Microsoft
  • Subject-Verb-Object (SVO) triples - both
    syntactic and semantic forms of the structures
  • Entity Profiles - profiles for entity types
    like people and organizations
  • General Events (GEs) - domain-independent
    events: argument structures centering around a
    verb with the associated information, i.e. who
    did what to whom, when (or how often) and where
  • Predefined Events (PEs) - domain-specific
    events
  • The system integrates NLP and machine learning
    into IE:
  • POS tagging
  • Shallow and deep parsing
  • Named Entity tagging
  • Combining supervised and unsupervised machine
    learning techniques
  • Concept-based analysis
  • Word sense disambiguation

4
InfoXtract Architecture
[Architecture diagram: a Document Processor (web/document server, process
manager, zoned text) feeds a pipeline of Linguistic Processors - Tokenizer,
Lexicon Lookup, POS Tagging, Named Entity Detection, Shallow Parsing, Deep
Parsing, Time Normalization, Location Normalization, Relationship Detection,
Alias/Coreference Linking, Profile/Event Merging, Profile/Event Linking, and
Pragmatic Filtering - producing NE, CE, SVO, CO, CGE, and PE objects. The
modules draw on Knowledge Resources (lexicon resources, grammars, language
models); an Output Manager returns XML-formatted extracted documents over
HTTP/CORBA. The legend distinguishes grammar modules, procedures or
statistical models, and hybrid modules.]
5
Introduction of Location Normalization
  • Task of location normalization (LocNZ):
  • Identify the correct sense of an ambiguous
    location named entity
  • (1) Decide whether a location name is a city, a
    province or a country
  • Supports the NE tagger in deciding the sub-tag:
  • New York (NeLoc) -> New York (NeLoc, NeCty)
  • (2) Decide which state or country a city,
    island or state belongs to
  • 18 US states have a city named Boston
  • Boston -> Alabama, Arkansas, Massachusetts,
    Missouri, ...
  • The result of LocNZ can be used to:
  • (1) Support event extraction, merging and event
    visualization
  • Indicate where an event occurred
  • (2) Support profile generation
  • Provide location information for a person or an
    organization
  • (3) Support question answering
  • Provide the location area for document
    categorization
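The sense-inventory lookup that LocNZ starts from can be sketched as a simple gazetteer table. The entries below are illustrative stand-ins, not actual Tipster Gazetteer data (which, for instance, lists Boston in 18 states):

```python
# Minimal sketch of LocNZ's first step: map a location named entity to
# its candidate senses. Entries are illustrative, not real gazetteer data.
GAZETTEER = {
    # name -> list of (type, province, country) candidate senses
    "Boston": [
        ("CITY", "Massachusetts", "United States"),
        ("CITY", "Alabama", "United States"),
        ("CITY", "Arkansas", "United States"),
        ("CITY", "Missouri", "United States"),
    ],
    "Canada": [
        ("COUNTRY", None, "Canada"),
        ("CITY", "Kansas", "United States"),
        ("CITY", "Kentucky", "United States"),
    ],
}

def candidate_senses(name):
    """Return all gazetteer senses for a location named entity."""
    return GAZETTEER.get(name, [])

print(candidate_senses("Canada"))
```

Disambiguation then amounts to choosing one tuple from this candidate list for each mention.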

6
Event and Profile Generation
Event Template: argument structures centering
around a verb with the associated information.
Profile Template: presents the subject's most
noteworthy characteristics and achievements.

<PersonProfile 001>
  Name: Julian Werner Hill
  Position: Research chemist
  Age: 91
  Birth-place: <LocationProfile 100>
  Affiliation: Du Pont Co.
  Education: MIT

<LocationProfile 100>
  Name: St. Louis
  State: Missouri
  Country: United States of America
  Zipcode: 63101
  Latitude: 38.634616
  Longitude: 90.191313
  Related_profiles: <PersonProfile 001>

Input: Alvin Karloff was replaced by John Doe as
CEO of ABC at New York last month.

<GeneralEvent id=200>
  key verb: replace
  who: John Doe
  whom-what: Alvin Karloff
  complement: CEO of ABC
  when: last month
  where: <LocationProfile 101>
7
Event Visualization
The result of LocNZ indicates the place where an
event occurred.

<Die Event 200>
  Who: <Julian Werner Hill, PersonProfile 001>
  When: 1996-01-07
  Where: <LocationProfile 103>
  Preceding_event: <hospitalize, Event 260>
  Subsequent_event: <bury, Event 250>

Predicate: Die
Who: Julian Werner Hill
When: 1996-01-07
Where: <LocationProfile 103>
8
Problems in Location Normalization
  • Differences between LocNZ and general WSD:
  • Selectional restrictions are not sufficient
  • WSD verb sense tagging relies mainly on
    co-occurrence constraints of semantic
    structures, Verb-Subject and Verb-Object in
    particular
  • LocNZ depends primarily on the co-occurrence of
    related location entities in the same discourse
    (text)
  • Fewer clues in a text than for verb and noun
    sense disambiguation
  • "located in" can only indicate that San
    Francisco is a location
  • Example: The Golden Gate Bridge is located in
    San Francisco
  • Lack of sources for default senses of location
    names
  • The Tipster Gazetteer provides only a small
    part of the default senses
  • Little previous research on solving LocNZ

9
Major Types of Ambiguities
  • City versus country and state name ambiguity
  • Canada (CITY) Kansas (PROVINCE 1) United States
    (COUNTRY)
  • Canada (CITY) Kentucky (PROVINCE 1) United States
    (COUNTRY)
  • Canada (COUNTRY)
  • New York state versus New York city
  • The same city name in different provinces
  • 33 Washington entries in the Gazetteer
  • Washington (CITY) Arkansas (PROVINCE 1) United
    States (COUNTRY)
  • Washington (CITY) California (PROVINCE 1) United
    States (COUNTRY)
  • Washington (CITY) Connecticut (PROVINCE 1) United
    States (COUNTRY)
  • Washington (CITY) District of Columbia (PROVINCE
    1) United States (COUNTRY)
  • Washington (CITY) Georgia (PROVINCE 1) United
    States (COUNTRY)
  • Washington (CITY) Illinois (PROVINCE 1) United
    States (COUNTRY)
  • Washington (CITY) Indiana (PROVINCE 1) United
    States (COUNTRY)
  • Washington (CITY) Iowa (PROVINCE 1) United States
    (COUNTRY)
  • Washington (CITY) Kansas (PROVINCE 1) United
    States (COUNTRY)
  • Washington (CITY) Kentucky (PROVINCE 1) United
    States (COUNTRY)

10
Example of a text with location names (CNN news):
http://www.cnn.com/2003/WEATHER/02/19/winter.storm.delays.ap/index.html
  • A traveler gets the bad news as he looks at the
    departures list that shows all canceled flights
    at the Philadelphia International Airport.
  • MIAMI (AP) -- Travelers heading to and from the
    Northeast faced continued uncertainty Tuesday,
    even as airports in the mid-Atlantic region began
    slowly digging themselves out from one of the
    worst winter storms on record.
  • No flights left Florida for Baltimore-Washington
    International Airport until Tuesday afternoon.
    That airport was one of the hardest-hit by the
    storm, with a snowfall total of 28 inches.
  • Rosanna Blum, 38, of Hunt Valley, Maryland, had a
    confirmed seat on a Miami to Baltimore flight
    Tuesday afternoon, but still wasn't optimistic
    that she'd actually have the chance to use it.
  • Theresa York, from Maryland, works the phones at
    Miami Airport as she tries to find a flight back
    home.
  • "It's surreal," said Dawn Shuford, 35, as she
    reclined against her suitcase in a darkened
    hallway at BWI. She'd been trying since Sunday
    morning to get home to Seattle.
  • The Washington area's two other airports, Reagan
    National and Dulles, also had limited service.
  • Marty Legrow, from Connecticut, rests on her
    suitcase at Ronald Reagan National Airport in
    Washington.
  • Philadelphia International Airport resumed
    operations Tuesday but still expected to cancel
    about one-third of its flights. Flights slowly
    resumed at New York's LaGuardia, Kennedy and
    Newark airports, and Boston's Logan, where more
    than 2 feet of snow fell, had one runway open.
  • Margie D'Onofrio, 48, of King Of Prussia,
    Pennsylvania, and a travel companion left the
    Bahamas on Sunday, hoping to fly back to
    Philadelphia. They made it to Miami, and
    D'Onofrio said she did not expect to be home
    anytime Tuesday.
  • Passengers camped out overnight at many airports.
    Many fliers called ahead Tuesday and weren't
    clogging airports unnecessarily, Orlando
    International Airport spokeswoman Carolyn Fennell
    said.

11
Our Previous Method (Li et al. 2002)
  • (1) Lexical grammar processing with local
    context
  • Identify city or state
  • City of Buffalo; New York State
  • Disambiguate the meaning of a word
  • e.g. Williamsville, New York, USA
  • e.g. Brussels, Belgium
  • Propagate the analysis result within the text
    where it appears
  • One sense per discourse (Gale et al., 1992)
  • (2) Construct a graph and calculate the maximum
    weight spanning tree, considering global
    information, with Kruskal's algorithm
  • Node: location name senses
  • Edge: similarity weight between two location
    name senses
  • Calculate similarities between locations in the
    graph by referring to a predefined similarity
    table
  • Choose the maximum weight spanning tree, which
    reflects the most probable location senses in
    the document
  • (3) Default sense application
  • If the similarity value is lower than a
    threshold, apply default senses
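Step (2) above can be sketched as follows, with a handful of illustrative sense nodes and similarity weights (the real weights come from a predefined similarity table, and the full method must also keep competing senses of the same name mutually exclusive; this sketch shows only the MST mechanics):

```python
# Maximum-weight spanning tree over location-sense nodes via Kruskal's
# algorithm. Sense names and similarity weights are illustrative only.

def kruskal_max_spanning_tree(nodes, weighted_edges):
    """weighted_edges: list of (weight, u, v). Returns the chosen edges."""
    parent = {n: n for n in nodes}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    tree = []
    # Kruskal: process edges by weight, descending for a *maximum* tree.
    for w, u, v in sorted(weighted_edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:           # edge connects two components: no cycle
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

nodes = ["Buffalo/NY", "Buffalo/IL", "Albany/NY", "Albany/GA"]
edges = [
    (0.9, "Buffalo/NY", "Albany/NY"),   # same state: high similarity
    (0.2, "Buffalo/NY", "Albany/GA"),
    (0.2, "Buffalo/IL", "Albany/NY"),
    (0.1, "Buffalo/IL", "Albany/GA"),
]
print(kruskal_max_spanning_tree(nodes, edges))
```

The heaviest edges kept by the tree favor the mutually supporting senses (here, the two New York readings).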

12
Problems of the Previous Method
  • The MST calculation must sort all the weighted
    edges
  • When there are many locations and each location
    has over 20 senses, the number of edges grows
    rapidly, edge sorting takes much time, and the
    weight values are not distinctive enough
  • Solution: adopt Prim's algorithm for the MST,
    combined with heuristics
  • If a location has a country sense, select that
    sense as the default sense of that location
    (heuristic 1)
  • If a location has province or capital senses,
    select that sense as the default sense after
    local context application (heuristic 2)
  • The number of location mentions and the
    distance between them are taken into account
  • The previous method could not reflect these
    factors
  • Assign weights to the sense nodes in the
    constructed graph
  • Choose the node with the maximum weight
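One way to fold mention count and distance into the sense-node weights, as the modified method requires, is a distance-decayed sum over context mentions. The decay function and constants below are assumptions for illustration, not the paper's formula:

```python
# Sketch of distance-aware weighting: the support a context mention lends
# decays with token distance, and every mention contributes, so frequent
# nearby mentions dominate. Half-life constant is an assumed value.

def support(base_weight, distance_in_tokens, half_life=200):
    """Impact weight of one context mention, discounted by distance."""
    return base_weight * 0.5 ** (distance_in_tokens / half_life)

# Two mentions of a supporting sense, one near and one far:
total = support(0.8, 50) + support(0.8, 400)
print(round(total, 3))
```

Under this scheme a sense mentioned twice nearby outweighs a sense mentioned once far away, which edge-sorting over a similarity graph could not express.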

13
Weight Calculation
[Table 1: Impact weight of Sense2 on Sense1 - table contents not preserved
in this transcript]
14
Weight Assigned to Sense Nodes
Candidate senses per location name:
  • Canada: Kansas, Kentucky; country
  • Vancouver: British Columbia; Washington; port
    in USA; port in Canada
  • Charlottetown: ...
  • Toronto: Ontario; New South Wales; Illinois; ...
  • New York: province in USA; New York City; ...
  • Quebec: city in Quebec; Quebec province;
    Connecticut; ...
  • Prince Edward Island: island in Canada; island
    in South Africa; province in Canada
15
Modified Algorithm
  • Look up the location gazetteer to associate
    candidate senses with each location NE
  • If a location has a country sense, select that
    sense as the default sense of that location
    (heuristic)
  • Call the pattern-matching sub-module for local
    patterns like "Williamsville, New York, USA"
  • Apply the one sense per discourse principle for
    each disambiguated location name to propagate
    the selected sense to its other mentions within
    the document
  • Apply the default sense heuristic for a
    location with province or capital senses
  • Call Prim's algorithm in the discourse
    sub-module to resolve the remaining ambiguities
  • If the difference between the sense with the
    maximum weight and the sense with the next
    largest weight is equal to or lower than a
    threshold, choose the default sense of that
    name from the lexicon. Otherwise, choose the
    sense with the maximum weight as output.
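The final weighting-plus-fallback step above can be sketched as follows. The impact weights, threshold, and default-sense table are illustrative assumptions, not the paper's values:

```python
# Sketch of the modified algorithm's last step: each candidate sense
# accumulates weight from co-occurring location senses (an impact-weight
# table in the spirit of Table 1); if the top two weights are within a
# threshold, fall back to the default sense. All data here is illustrative.

THRESHOLD = 0.1
DEFAULT_SENSE = {"Vancouver": "Vancouver/British Columbia"}

# IMPACT[(sense_a, sense_b)]: support that sense_b lends to sense_a
IMPACT = {
    ("Vancouver/British Columbia", "Victoria/British Columbia"): 0.8,
    ("Vancouver/Washington", "Victoria/British Columbia"): 0.2,
}

def disambiguate(name, senses, context_senses):
    weights = {
        s: sum(IMPACT.get((s, c), 0.0) for c in context_senses)
        for s in senses
    }
    ranked = sorted(weights, key=weights.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if weights[best] - weights[runner_up] <= THRESHOLD:
        return DEFAULT_SENSE[name]   # evidence too weak: use default
    return best

print(disambiguate(
    "Vancouver",
    ["Vancouver/British Columbia", "Vancouver/Washington"],
    ["Victoria/British Columbia"],
))
```

With no context mentions at all, both weights are zero and the threshold test triggers the default-sense fallback, mirroring step (3) of the previous method.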

16
Experimental Evaluation
17
Discussion
  • Note: Columns 5-9 use the default-sense
    heuristics
  • Local patterns (Col-4) alone contribute 12% to
    the overall performance
  • Proper use of default senses and the heuristics
    (Col-5) can achieve close to 90%
  • Prim's algorithm (Col-7) is clearly better than
    the previous method using Kruskal's algorithm
    (Col-6), by 13%
  • But neither method alone outperforms the
    default senses
  • Using all three types of evidence, the new
    hybrid method reaches a performance of 96%, as
    shown in Col-9

18
Future Work
  • Extend the scope of location normalization:
  • Extend the processing scope to physical
    structures
  • famous buildings, bridges, airports, lakes,
    street names, ...
  • Extend the gazetteer
  • Introduce more context information for
    disambiguation
  • Upgrade default meaning assignment