The Role of Toponym Disambiguation in Information Retrieval and Question Answering


1
The Role of Toponym Disambiguation in Information Retrieval and Question Answering

A Ph.D. Thesis Project presented by Davide Buscaldi
to the Dpto. de Sistemas Informáticos y Computación
Supervisor: Paolo Rosso

2
Plan of the Talk

Introduction
Geographical Information Needs
Geographical ambiguity and related issues
Current solutions
Thesis Objectives
Methodology
Schedule

3
Geographical Information Needs

Users are interested in geographical or local information
Find an activity on a map, e.g. "Hotel in Tarragona"
Locate a place, e.g. "Where is the Leaning Tower?"
Find a route or path, e.g. "From Stazione C.le to Malpensa airport"
Calculate a distance, e.g. "How far is Valencia from Madrid?"
Computers can be used to address these information needs
Problems borrowed from computational geometry
Convex hull: given a set of points, find the smallest convex polyhedron/polygon containing all the points.
Segment intersection: find the intersections between a given set of line segments.
Shortest path: connect two points in a Euclidean space (with polyhedral obstacles) by a shortest path.
Point location: given a partitioning of the space into cells, produce a data structure that efficiently tells in which cell a query point is located.
However, a precise representation (map) is needed
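
As an illustration of the first geometric problem listed above, here is a minimal sketch of a 2D convex hull using Andrew's monotone chain. The choice of algorithm and the function names are assumptions made for this example; the slides do not prescribe any particular method.

```python
def cross(o, a, b):
    """Z-component of the cross product OA x OB (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower = []
    for p in pts:
        # Drop the last vertex while it would create a right turn or a straight line.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


# The interior point (1, 1) is discarded; only the square's corners remain.
square = convex_hull([(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)])
```

The point of the slide still holds for this sketch: it only works once every place already has precise coordinates.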

4
Geographical Information in Text

Geographical information on the web is spread over unstructured textual documents
Examples:
"Arizona Sen. John McCain won Tuesday in Florida"
"The Lonja de la Seda in Valencia was founded in 1469 as a market for oil"
[Sanderson and Kohler, 2004] showed that about 20% of searches include toponyms or other geographical constraints
Need to adapt Information Retrieval (IR) techniques to geographically constrained queries (from IR to Geographical IR, GIR)
Grounding: correctly identifying a location and assigning it coordinates
Indexing and ranking phases have to take geographical features into account
GeoCLEF, annual GIR workshops

5
GIR Issues

Implicit Geographical Information
Most geographical information is kept hidden
Example: "A federal judge in Detroit struck down the National Security Agency's domestic surveillance program yesterday, calling it unconstitutional and an illegal abuse of presidential power."
The geographical entities that contain Detroit (e.g. Michigan, United States) are implied by the text, but they are not mentioned explicitly

6
(More) GIR Issues

Synonymy
Different names for the same place
St. Petersburg, San Petersburgo, Санкт-Петербург, Leningrad
Ambiguity of toponyms
Class ambiguity (geo vs. non-geo)
Example: Muro (Mallorca) vs. muro (wall)
Geographical ambiguity
Example: Cambridge (U.S.) vs. Cambridge (U.K.)
Solutions
External knowledge sources, such as ontologies, gazetteers, encyclopaedias
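
The two issues above can be made concrete with a toy gazetteer lookup. This is only a sketch: the place IDs, the `GAZETTEER` table and the function names are hypothetical and stand in for a real external resource.

```python
from collections import defaultdict

# Hypothetical toy gazetteer: place ID -> (canonical label, known names).
GAZETTEER = {
    "P1": ("St. Petersburg (Russia)",
           ["St. Petersburg", "San Petersburgo", "Санкт-Петербург", "Leningrad"]),
    "P2": ("Cambridge (U.K.)", ["Cambridge"]),
    "P3": ("Cambridge (U.S.)", ["Cambridge"]),
}


def build_name_index(gazetteer):
    """Map each case-folded name to the set of place IDs it may refer to."""
    index = defaultdict(set)
    for place_id, (_label, names) in gazetteer.items():
        for name in names:
            index[name.casefold()].add(place_id)
    return index


NAME_INDEX = build_name_index(GAZETTEER)


def candidates(name):
    """Return the sorted place IDs a name may refer to (empty if unknown)."""
    return sorted(NAME_INDEX.get(name.casefold(), set()))


# Synonymy: several names resolve to one place.
same_place = candidates("Leningrad") == candidates("San Petersburgo")
# Geographical ambiguity: one name resolves to several places.
ambiguous = candidates("Cambridge")
```

Class ambiguity (Muro vs. muro) is not handled here; the lookup cannot tell whether a word is being used as a place name at all.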

7
Ambiguity

The richer the resource used, the higher the ambiguity
Example (figure): above, locations referenced in Wikipedia; below, locations referenced in the GNS gazetteer
Ambiguity is (probably) the major issue
Directly related to coverage
67.8% of toponyms in news are ambiguous [Garbin and Mani, 2005]
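
The link between coverage and ambiguity can be shown with a small calculation. The two gazetteers below are invented toy data, and `ambiguity_rate` is a hypothetical helper; this is not how the 67.8% figure was measured.

```python
def ambiguity_rate(mentions, gazetteer):
    """Fraction of covered toponym mentions that have more than one referent."""
    covered = [m for m in mentions if gazetteer.get(m)]
    if not covered:
        return 0.0
    ambiguous = sum(1 for m in covered if len(gazetteer[m]) > 1)
    return ambiguous / len(covered)


# Toy resources: the larger one covers more names AND lists more referents.
SMALL = {
    "Cambridge": ["Cambridge, U.K."],
    "Valencia": ["Valencia, Spain"],
}
LARGE = {
    "Cambridge": ["Cambridge, U.K.", "Cambridge, U.S."],
    "Valencia": ["Valencia, Spain", "Valencia, Venezuela"],
    "Tarragona": ["Tarragona, Spain"],
}

mentions = ["Cambridge", "Valencia", "Tarragona"]
rate_small = ambiguity_rate(mentions, SMALL)  # nothing is ambiguous, Tarragona is missed
rate_large = ambiguity_rate(mentions, LARGE)  # two of three covered names are ambiguous
```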

8
Current Solutions

A variety of methods have been proposed so far for toponym disambiguation (TD)
Decision trees [Olligschlaeger and Hauptmann, 1999]
Centroid distance (map-based) [Smith and Crane, 2001]
Supervised learning [Rauch et al., 2003]
Wikipedia-based method [Overell et al., 2006]
WordNet-based method [Buscaldi and Rosso, 2008]
Results range from 75% to 96%, but on (very) different test sets
There are open questions that have to be answered
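
To give a feel for the map-based family, here is a simplified sketch of the centroid idea: choose the candidate referent closest to the centroid of the other places mentioned in the context. It is an illustration only and does not reproduce the procedure of [Smith and Crane, 2001]; the coordinates are approximate and the function names are hypothetical.

```python
import math


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def centroid(points):
    """Arithmetic mean of (lat, lon) pairs; a rough centre for nearby points."""
    lats, lons = zip(*points)
    return (sum(lats) / len(lats), sum(lons) / len(lons))


def disambiguate(candidates, context_points):
    """Return the candidate name whose location is nearest the context centroid."""
    centre = centroid(context_points)
    return min(candidates, key=lambda name: haversine_km(candidates[name], centre))


# Approximate coordinates, for illustration only.
CAMBRIDGE = {"Cambridge (U.K.)": (52.21, 0.12), "Cambridge (U.S.)": (42.37, -71.11)}
british_context = [(51.51, -0.13), (51.75, -1.26)]     # London, Oxford
american_context = [(42.36, -71.06), (40.71, -74.01)]  # Boston, New York

uk_choice = disambiguate(CAMBRIDGE, british_context)
us_choice = disambiguate(CAMBRIDGE, american_context)
```

A method of this kind needs coordinates for every candidate, which is why it is called map-based.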

9
Thesis Objectives: Questions to be Answered

Previous methods have not been tested on the same corpus
Which is the best approach?
[Gonzalo et al., 1998] stated that word sense disambiguation (WSD) significantly improves IR results only when accuracy > 70%
How does TD affect GIR?
Is the same accuracy level needed?
Do improvements in IR carry over to QA?

10
More (Technical) Questions

Toponym indexing
Toponym representation: by name, ID or coordinates?
E.g. "London, U.K.", 00435713, or 51°30'28"N 00°07'41"W?
Include containing entities?
E.g. London -> England -> U.K. -> Europe -> Northern Hemisphere
Which resource?
How to deal with fuzzy areas?
E.g. Northern Europe
Weighting scheme
Is it better to include geography during search, or to filter results at the end?
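
The representation choices above can be sketched as alternative ways of producing index terms for one toponym. Everything here is hypothetical: the `Toponym` class, the `CONTAINED_IN` table and the function names are invented for the example, and the coordinates are read from the slide as 51°30'28"N, 0°07'41"W.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Toponym:
    name: str
    place_id: str
    lat: float
    lon: float


# Hypothetical containment table (child -> parent), following the slide's chain.
CONTAINED_IN = {
    "London": "England",
    "England": "U.K.",
    "U.K.": "Europe",
    "Europe": "Northern Hemisphere",
}


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert degrees/minutes/seconds to signed decimal degrees."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere in ("S", "W") else value


def containing_entities(name, contained_in=CONTAINED_IN):
    """Follow the containment chain upwards, guarding against cycles."""
    chain, seen = [], {name}
    while name in contained_in:
        name = contained_in[name]
        if name in seen:
            break
        seen.add(name)
        chain.append(name)
    return chain


def index_terms(toponym, representation="name", expand=False):
    """Return the terms to index for a toponym under one representation."""
    if representation == "name":
        base = toponym.name
    elif representation == "id":
        base = toponym.place_id
    elif representation == "coordinates":
        base = f"{toponym.lat:.4f},{toponym.lon:.4f}"
    else:
        raise ValueError(f"unknown representation: {representation}")
    return [base] + (containing_entities(toponym.name) if expand else [])


london = Toponym("London", "00435713",
                 dms_to_decimal(51, 30, 28, "N"), dms_to_decimal(0, 7, 41, "W"))
expanded = index_terms(london, "name", expand=True)
```

Expanding with containing entities lets a query for "Europe" match a document that only mentions London, at the cost of a larger index.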

11
(A Few) Answers

Some questions have already been answered
Containing entities may help with some queries [Buscaldi et al., 2007]
Filtering seems to perform better than dedicated weighting schemes [Cardoso and Silva, 2007]
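
The difference between the two strategies can be sketched on a toy result list. The linear score combination and the `geo_weight` parameter are assumptions made for illustration; they are not the weighting schemes evaluated in [Cardoso and Silva, 2007].

```python
def filter_results(results, region):
    """Rank by text score only, then drop documents outside the region."""
    ranked = sorted(results, key=lambda r: r["text_score"], reverse=True)
    return [r["doc"] for r in ranked if region in r["places"]]


def weighted_results(results, region, geo_weight=0.5):
    """Rank by a linear mix of text score and a binary geographic match."""
    def score(r):
        geo = 1.0 if region in r["places"] else 0.0
        return (1 - geo_weight) * r["text_score"] + geo_weight * geo
    return [r["doc"] for r in sorted(results, key=score, reverse=True)]


# Toy result list with hypothetical scores and geographic footprints.
RESULTS = [
    {"doc": "d1", "text_score": 0.9, "places": {"France"}},
    {"doc": "d2", "text_score": 0.6, "places": {"Spain", "Europe"}},
    {"doc": "d3", "text_score": 0.4, "places": {"Spain"}},
]

filtered = filter_results(RESULTS, "Spain")   # d1 is removed
weighted = weighted_results(RESULTS, "Spain")  # d1 is kept but ranked last
```

Filtering leaves the text ranking untouched, whereas weighting changes the order and needs a tuned parameter.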

12
Methodology: Toponym Disambiguation

We adapted a method based on Conceptual Density, originally computed over hierarchies of hypernyms, to hierarchies of holonyms
Given an ambiguous place name, different subhierarchies are obtained from the reference ontology (WordNet), one per candidate sense
Place names in context are added to the hierarchies
Finally, the sense related to the densest subhierarchy is selected
Results: 94.2% precision, 78.9% recall (GeoSemCor)
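
The steps above can be sketched on a toy holonym hierarchy. The slides do not give the Conceptual Density formula, so the ratio used below (context hits divided by subhierarchy size) is a stand-in assumed for this example, and the `HOLONYMS` and `SENSES` tables are invented data rather than WordNet content.

```python
# Toy part-of hierarchy (sense or place -> holonym); invented for illustration.
HOLONYMS = {
    "Cambridge#1": "England",
    "Cambridge#2": "Massachusetts",
    "Boston#1": "Massachusetts",
    "Oxford#1": "England",
    "England": "United Kingdom",
    "Massachusetts": "United States",
    "United Kingdom": "Europe",
    "United States": "North America",
}
SENSES = {
    "Cambridge": ["Cambridge#1", "Cambridge#2"],
    "Boston": ["Boston#1"],
    "Oxford": ["Oxford#1"],
}


def holonym_path(sense):
    """Subhierarchy of a sense: the sense plus all of its holonyms."""
    path = [sense]
    while path[-1] in HOLONYMS:
        path.append(HOLONYMS[path[-1]])
    return path


def density(sense, context):
    """Stand-in density: context names falling in the subhierarchy / its size."""
    sub = set(holonym_path(sense))
    hits = 0
    for name in context:
        if any(sub & set(holonym_path(s)) for s in SENSES.get(name, [])):
            hits += 1
    return hits / len(sub)


def disambiguate(name, context):
    """Pick the sense with the densest subhierarchy; abstain (None) on ties."""
    scores = {s: density(s, context) for s in SENSES[name]}
    best = max(scores.values())
    winners = [s for s, v in scores.items() if v == best]
    return winners[0] if best > 0 and len(winners) == 1 else None


american = disambiguate("Cambridge", ["Boston"])
british = disambiguate("Cambridge", ["Oxford"])
undecided = disambiguate("Cambridge", ["Boston", "Oxford"])
```

Abstaining when the context gives no clear winner is one way a method can reach higher precision than recall; whether the original method abstains in exactly this way is not stated in the slides.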

13
Methodology: Evaluation of TD

TD methods will be compared on the same resource
Initially GeoSemCor, a subset of 1200 toponyms from the SemCor corpus [Buscaldi and Rosso, 2008]
We will have to introduce a mapping between WordNet synsets and coordinates in order to evaluate map-based TD
J. Leidner's Reuters geo-tagged corpus [Leidner, 2006], when available
Same accuracy measures used in WSD
Precision (correctly disambiguated / attempted)
Recall (correctly disambiguated / total)
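
The two measures can be computed directly from their definitions. The function name and the toy annotations are hypothetical; a prediction of `None` is taken to mean that the method did not attempt that toponym.

```python
def precision_recall(gold, predicted):
    """Precision = correct / attempted; recall = correct / total gold items."""
    attempted = {k: v for k, v in predicted.items() if v is not None}
    correct = sum(1 for k, v in attempted.items() if gold.get(k) == v)
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Five toponym occurrences: four attempted, three of them correct.
gold = {"t1": "A", "t2": "B", "t3": "C", "t4": "D", "t5": "E"}
predicted = {"t1": "A", "t2": "B", "t3": "C", "t4": "X", "t5": None}

precision, recall = precision_recall(gold, predicted)  # (0.75, 0.6)
```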

14
Methodology: GeoWorSE GIR Search Engine

This is the system we used at GeoCLEF 2007
Tailored to address the problem of implicit geographical information
Core system: Lucene 2.1
Three indices: text, geo and WN
Indexing process
The document is stored in the text index using Lucene's standard analyzer
LingPipe identifies the named entities (NEs) in the document
If an NE is a toponym, it is added to the geo index
If it is in WordNet, it is added together with its synonyms and holonyms to the WN index
Search process
Standard mode: as in Lucene
Geo mode: toponyms are searched in the geo index, content words in the text index
GeoWN mode: toponyms are searched in the geo and WN indices
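
The indexing and search process can be mimicked with three in-memory inverted indices. This sketch does not use Lucene, LingPipe or WordNet: the `TOPONYMS` set and the `WN` table are toy stand-ins, the `GeoIndex` class is hypothetical, and combining query terms by intersection is an assumption.

```python
from collections import defaultdict

TOPONYMS = {"detroit", "valencia"}  # stand-in for NE recognition
WN = {  # stand-in for WordNet synonyms and holonyms
    "detroit": {"synonyms": ["motor city"], "holonyms": ["michigan", "united states"]},
}


class GeoIndex:
    def __init__(self):
        self.text = defaultdict(set)  # every token
        self.geo = defaultdict(set)   # toponyms only
        self.wn = defaultdict(set)    # toponyms plus synonyms and holonyms

    def add(self, doc_id, text):
        for raw in text.lower().split():
            token = raw.strip(".,;:!?")
            if not token:
                continue
            self.text[token].add(doc_id)
            if token in TOPONYMS:
                self.geo[token].add(doc_id)
                entry = WN.get(token)
                if entry:
                    for term in [token] + entry["synonyms"] + entry["holonyms"]:
                        self.wn[term].add(doc_id)

    def search(self, content_words, toponyms, mode="standard"):
        postings = [self.text.get(w.lower(), set()) for w in content_words]
        for t in (t.lower() for t in toponyms):
            if mode == "standard":
                postings.append(self.text.get(t, set()))
            elif mode == "geo":
                postings.append(self.geo.get(t, set()))
            elif mode == "geown":
                postings.append(self.geo.get(t, set()) | self.wn.get(t, set()))
            else:
                raise ValueError(f"unknown mode: {mode}")
        return set.intersection(*postings) if postings else set()


index = GeoIndex()
index.add("d1", "A federal judge in Detroit struck down the program.")
index.add("d2", "A judge in Valencia ruled on the market.")

# Only GeoWN mode finds d1 for "Michigan", which the document never mentions.
implicit_hit = index.search(["judge"], ["Michigan"], mode="geown")
```

Multi-word toponyms and real tokenization are left out to keep the sketch short.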

15
Methodology: TD and GIR

Evaluation of the impact of TD on GIR
Integration of TD in our existing GIR engine
Evaluation with GeoCLEF test sets (and direct participation in 2008)
(Possibly) evaluation over Geo-Reuters
Measures
MAP (Mean Average Precision), Recall (relevant retrieved / total relevant)
Experiments with different term representations, geographical resources, and weighting schemes
Evaluation of the impact of TD on QA
Include TD in our QA system
Evaluate results against past CLEF results and through participation in CLEF 2008
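
For reference, the two retrieval measures can be sketched as follows, using the usual definition of average precision (precision at each relevant rank, averaged over all relevant documents). The function names and the toy runs are hypothetical.

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant doc appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: one (ranked list, set of relevant docs) pair per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def recall(ranked, relevant):
    """Relevant documents retrieved divided by all relevant documents."""
    return len(set(ranked) & relevant) / len(relevant) if relevant else 0.0


runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # AP = (1/1 + 2/3) / 2
    (["d5", "d6"], {"d6", "d7"}),              # AP = (1/2) / 2, recall = 1/2
]
map_score = mean_average_precision(runs)
```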

16
Tasks

17
Schedule

18
Bibliography

[Buscaldi and Rosso, 2008] Davide Buscaldi and Paolo Rosso, A conceptual density-based approach for the disambiguation of toponyms, in International Journal of Geographical Information Science, (to appear) 2008.
[Buscaldi et al., 2007] Davide Buscaldi, Paolo Rosso and Emilio Sanchis, A WordNet-Based Indexing Technique for Geographical Information Retrieval, in Evaluation of Multilingual and Multi-modal Information Retrieval, pages 954-957, Springer, ISBN 978-3-540-74998-1, 2007.
[Cardoso and Silva, 2007] Nuno Cardoso and Mário J. Silva, Query expansion through geographical feature types, in GIR '07: Proceedings of the 4th ACM Workshop on Geographical Information Retrieval, pages 55-60, ACM, Lisbon, Portugal, 2007.
[Garbin and Mani, 2005] Eric Garbin and Inderjeet Mani, Disambiguating toponyms in news, in Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT05), pages 363-370, Association for Computational Linguistics, Vancouver, British Columbia, Canada, 2005.
[Gonzalo et al., 1998] Julio Gonzalo, Felisa Verdejo, Irina Chugur and José Cigarrán, Indexing with WordNet Synsets can improve Text Retrieval, in COLING/ACL '98 Workshop on the Usage of WordNet for NLP, pages 38-44, 1998.
[Leidner, 2006] Jochen L. Leidner, An evaluation dataset for the toponym resolution task, in Computers, Environment and Urban Systems, volume 30, number 4, pages 400-417, 2006.
[Olligschlaeger and Hauptmann, 1999] A. M. Olligschlaeger and A. G. Hauptmann, Multimodal Information Systems and GIS: The Informedia Digital Video Library, in 1999 ESRI User Conference, 1999.
[Overell et al., 2006] Simon Overell, Joao Magalhaes and Stefan Rüger, Place disambiguation with co-occurrence models, in GeoCLEF 2006 Workshop, 2006.
[Rauch et al., 2003] E. Rauch, M. Bukatin and K. Baker, A confidence-based framework for disambiguating geographic terms, in HLT-NAACL 2003 Workshop on Analysis of Geographic References, pages 50-54, 2003.
[Sanderson and Kohler, 2004] Mark Sanderson and Janet Kohler, Analyzing Geographic Queries, in Proceedings of the Workshop on Geographic Information Retrieval (GIR04), 2004.
[Smith and Crane, 2001] David A. Smith and Gregory Crane, Disambiguating Geographic Names in a Historical Digital Library, in Research and Advanced Technology for Digital Libraries, pages 127-137, Springer, 2001.