Title: The Role of Toponym Disambiguation in Information Retrieval and Question Answering
1 The Role of Toponym Disambiguation in Information Retrieval and Question Answering
- A Ph.D. Thesis Project presented by Davide Buscaldi
- to Dpto. De Sistemas Informáticos y Computación
- Supervisor: Paolo Rosso
2 Plan of the Talk
- Introduction
- Geographical Information Needs
- Geographical ambiguity and related issues
- Current solutions
- Thesis Objectives
- Methodology
- Schedule
3 Geographical Information Needs
- Users are interested in geographical or local information
- Find an activity on a map, e.g. Hotel in Tarragona
- Locate a place, e.g. Where is the Leaning Tower?
- Find a route / path, e.g. From Stazione C.le to Malpensa airport
- Calculate a distance, e.g. How far is Valencia from Madrid?
- Computers can be used to address these information needs
- Problems borrowed from computational geometry
- Convex hull: Given a set of points, find the smallest convex polyhedron/polygon containing all the points.
- Line segment intersection: Find the intersections between a given set of line segments.
- Shortest path: Connect two points in a Euclidean space (with polyhedral obstacles) by a shortest path.
- Point location: Given a partitioning of the space into cells, produce a data structure that efficiently tells in which cell a query point is located.
- However, a precise representation (map) is needed
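Once both places are grounded to coordinates, the distance question on this slide (How far is Valencia from Madrid?) reduces to a short calculation. The following is a minimal sketch, assuming a spherical Earth with a mean radius of 6371 km and approximate city-centre coordinates that are not taken from the talk; the function name `haversine_km` is chosen here for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; spherical approximation (assumption)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Approximate coordinates, assumed for illustration only.
MADRID = (40.4168, -3.7038)
VALENCIA = (39.4699, -0.3763)

distance = haversine_km(*MADRID, *VALENCIA)
print(f"Valencia-Madrid: about {distance:.0f} km")
```

This gives the straight-line (great-circle) distance only; a route query such as the Stazione C.le example would still need a road or rail network.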
4 Geographical Information in Text
- Geographical information on the web is spread over textual, unstructured documents
- Examples
- Arizona Sen. John McCain won Tuesday in Florida
- The Lonja de la Seda in Valencia was founded in 1469 as a market for oil
- [Sanderson and Kohler, 2004] showed that about 20% of searches include toponyms or other geographical constraints
- Need to adapt Information Retrieval techniques to geographically-constrained queries (from IR to GIR)
- Grounding (correctly identifying a location and assigning it coordinates)
- Indexing and ranking phases have to take the geographical features into account
- GeoCLEF, annual GIR workshops
5 GIR Issues
- Implicit Geographical Information
- Most geographical information is kept hidden
- "A federal judge in Detroit struck down the National Security Agency's domestic surveillance program yesterday, calling it unconstitutional and an illegal abuse of presidential power."
- The geographical entities that contain Detroit are implied by the text, but they are not mentioned explicitly
6 (More) GIR Issues
- Synonymy
- Different names for the same place
- St. Petersburg, San Petersburgo, Санкт-Петербург, Leningrad
- Ambiguity of toponyms
- Class ambiguity (geo vs. non-geo)
- Example: Muro (Mallorca) vs. muro (wall)
- Geographical ambiguity
- Example: Cambridge (U.S.) vs. Cambridge (U.K.)
- Solutions
- External knowledge sources, such as ontologies, gazetteers, encyclopaedias
7 Ambiguity
- The richer the resource used, the higher the ambiguity
- Example
- Picture above: locations referenced in Wikipedia
- Picture below: locations referenced in the GNS gazetteer
- Ambiguity is (probably) the major issue
- Directly related to coverage
- 67.8% of toponyms in news are ambiguous [Garbin and Mani, 2005]
8 Current Solutions
- A variety of methods have been proposed so far for toponym disambiguation (TD)
- Decision trees [Olligschlaeger and Hauptmann, 1999]
- Centroid distance (map-based) [Smith and Crane, 2001]
- Supervised learning [Rauch et al., 2003]
- Wikipedia-based method [Overell et al., 2006]
- WordNet-based method [Buscaldi and Rosso, 2008]
- Reported results range from 75% to 96%, but on (very) different test sets
- There are open questions that have to be answered
9 Thesis Objectives: Questions to be Answered
- Previous methods have not been tested on the same corpus
- Which is the best approach?
- [Gonzalo et al., 1998] stated that word sense disambiguation (WSD) significantly improves IR results only when accuracy > 70%
- How does TD affect GIR?
- Is the same accuracy level needed?
- Are improvements in IR reflected in QA?
10 More (Technical) Questions
- Toponym indexing
- Toponym representation: by name, ID or coordinates?
- E.g. "London, U.K.", 00435713, or 51°30′28″N 00°07′41″W?
- Include containing entities?
- E.g. London -> England -> U.K. -> Europe -> Northern Hemisphere
- Which resource?
- How to deal with fuzzy areas?
- E.g. Northern Europe
- Weighting scheme
- Is it better to include geography during search or to filter results at the end?
11 (A Few) Answers
- Some questions have already been answered
- Containing entities may help with some queries [Buscaldi et al., 2007]
- Filtering seems to perform better than dedicated weighting schemes [Cardoso and Silva, 2007]
12 Methodology: Toponym Disambiguation
- We adapted a method based on Conceptual Density, originally computed over hierarchies of hypernyms, to hierarchies of holonyms
- Given an ambiguous place name, different subhierarchies are obtained from the reference ontology (WordNet)
- Place names in context are added to the hierarchies
- Finally, the sense related to the densest subhierarchy is selected
- Results: 94.2% precision, 78.9% recall (GeoSemCor)
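The three steps on this slide can be illustrated with a toy example. This is only a sketch under strong assumptions: the holonym table and sense inventory are hypothetical stand-ins for WordNet, and the score is a crude density proxy (context senses attached to a subhierarchy, divided by its size), not the Conceptual Density formula used in the thesis. All names (`HOLONYM`, `SENSES`, `density`, `disambiguate`) are invented for illustration.

```python
# Toy holonym links (part -> whole). Hypothetical data standing in for WordNet.
HOLONYM = {
    "Cambridge (U.K.)": "England",
    "Cambridge (U.S.)": "Massachusetts",
    "London": "England",
    "Boston": "Massachusetts",
    "England": "U.K.",
    "Massachusetts": "U.S.",
    "U.K.": "Europe",
    "U.S.": "North America",
}

# Candidate senses for each surface form (hypothetical sense inventory).
SENSES = {
    "Cambridge": ["Cambridge (U.K.)", "Cambridge (U.S.)"],
    "London": ["London"],
    "Boston": ["Boston"],
}


def holonym_chain(sense):
    """Return the sense followed by all of its containing entities."""
    chain = [sense]
    while chain[-1] in HOLONYM:
        chain.append(HOLONYM[chain[-1]])
    return chain


def density(candidate, context_names):
    """Crude density proxy: context senses attached to the candidate's
    subhierarchy, divided by the number of nodes in that subhierarchy."""
    sub = holonym_chain(candidate)
    nodes = set(sub)
    hits = 0
    for name in context_names:
        for sense in SENSES.get(name, []):
            # A context sense is attached if one of its holonyms lies in the subhierarchy.
            if nodes & set(holonym_chain(sense)[1:]):
                hits += 1
    return hits / len(sub)


def disambiguate(name, context_names):
    """Pick the sense with the densest subhierarchy, or None without evidence."""
    scored = [(density(s, context_names), s) for s in SENSES[name]]
    best_score, best_sense = max(scored)
    return best_sense if best_score > 0 else None


print(disambiguate("Cambridge", ["Boston"]))
print(disambiguate("Cambridge", ["London"]))
```

In this sketch the function returns `None` when no subhierarchy receives any context evidence. Abstaining like this is one way a method can leave a toponym unattempted, which would lower recall without affecting precision.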
13 Methodology: Evaluation of TD
- TD methods will be compared on the same resource
- Initially GeoSemCor, a subset of 1200 toponyms from the SemCor corpus [Buscaldi and Rosso, 2008]
- We will have to introduce a mapping between WordNet synsets and coordinates in order to evaluate map-based TD
- J. Leidner's Reuters geo-tagged corpus [Leidner, 2006], when available
- Same accuracy measures used in WSD
- Precision (correctly disambiguated / attempted)
- Recall (correctly disambiguated / total)
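The two measures differ only in their denominator, as a small helper makes clear. This is a sketch with hypothetical counts chosen purely to illustrate the ratios; they are not results from the thesis.

```python
def precision_recall(correct, attempted, total):
    """Precision = correct / attempted; recall = correct / total."""
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


# Hypothetical counts: 100 toponyms, 90 attempted, 80 disambiguated correctly.
p, r = precision_recall(correct=80, attempted=90, total=100)
print(f"precision={p:.3f} recall={r:.3f}")
```

Since both ratios share the same numerator, recall / precision equals attempted / total, that is, the share of toponyms a method tried to disambiguate at all.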
14 Methodology: GeoWorSE GIR Search Engine
- This is the system we used at GeoCLEF 2007
- Tailored to address the problem of implicit geographical information
- Core system: Lucene 2.1
- Three indices: text, geo and WN
- Indexing process
- The document is stored in the text index using the standard analyzer of Lucene
- LingPipe identifies the named entities (NEs) in the document
- If an NE is a toponym, it is added to the geo index
- If it is in WordNet, it is added together with its synonyms and holonyms to the WN index
- Search process
- Standard mode: as Lucene
- Geo mode: toponyms are searched in the geo index, content words in the text index
- GeoWN mode: toponyms are searched in the geo and WN indices
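The routing of a document into the three indices can be sketched with plain dictionaries. This does not use the Lucene, LingPipe or WordNet APIs: `KNOWN_TOPONYMS` is a hypothetical stand-in for the named-entity step, `WN_EXPANSION` is a hypothetical stand-in for WordNet synonyms and holonyms, and the function names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical stand-ins: a toponym list in place of LingPipe's NE output,
# and a tiny table in place of WordNet synonyms and holonyms.
KNOWN_TOPONYMS = {"Detroit", "Valencia"}
WN_EXPANSION = {
    "Detroit": ["Motor City", "Michigan", "United States"],
}

text_index = defaultdict(set)  # term -> document ids
geo_index = defaultdict(set)   # toponym -> document ids
wn_index = defaultdict(set)    # toponym, synonym or holonym -> document ids


def index_document(doc_id, text):
    """Route a document into the text, geo and WN indices."""
    for token in text.lower().split():
        text_index[token.strip(".,")].add(doc_id)
    for toponym in KNOWN_TOPONYMS:
        if toponym in text:
            geo_index[toponym].add(doc_id)
            if toponym in WN_EXPANSION:
                for term in [toponym] + WN_EXPANSION[toponym]:
                    wn_index[term].add(doc_id)


def search_geo_wn(toponym):
    """GeoWN mode for the toponym part of a query: geo and WN indices."""
    return geo_index.get(toponym, set()) | wn_index.get(toponym, set())


index_document("d1", "A federal judge in Detroit struck down the program.")
print(search_geo_wn("Michigan"))
```

In this toy setup a query for Michigan retrieves a document that mentions only Detroit, which is the kind of implicit geographical information the WN index is meant to expose.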
15 Methodology: TD and GIR
- Evaluation of the impact of TD on GIR
- Integration of TD in our existing GIR engine
- Evaluation with GeoCLEF test sets (and direct participation in 2008)
- (Possibly) evaluation over Geo-Reuters
- Measures
- MAP (Mean Average Precision), Recall (relevant retrieved / total relevant)
- Experiments with different term representations, geographical resources, and weighting schemes
- Evaluation of the impact of TD on QA
- Include TD in our QA system
- Evaluate results against past CLEF results and through participation in CLEF 2008
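The two retrieval measures named on this slide can be computed as follows. This is a minimal sketch using the standard definitions of average precision and recall; the rankings and relevance judgements are hypothetical and do not come from GeoCLEF.

```python
def average_precision(ranked, relevant):
    """Sum of precision@k at each rank k holding a relevant document,
    divided by the number of relevant documents."""
    hits, total = 0, 0.0
    for k, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """runs: list of (ranked list, set of relevant documents), one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def recall(ranked, relevant):
    """Relevant documents retrieved divided by all relevant documents."""
    return len(set(ranked) & relevant) / len(relevant) if relevant else 0.0


# Hypothetical rankings and judgements for two queries.
runs = [
    (["d1", "d2", "d3"], {"d1", "d3"}),
    (["d4", "d5"], {"d5", "d6"}),
]
print(mean_average_precision(runs))
```

MAP rewards placing relevant documents early in the ranking, while recall only checks whether they were retrieved at all, so a filtering step applied at the end can change one without changing the other.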
16 Tasks
17 Schedule
18 Bibliography
- [Buscaldi and Rosso, 2008] Davide Buscaldi and Paolo Rosso, A conceptual density-based approach for the disambiguation of toponyms, in International Journal of Geographical Information Systems, to appear, 2008.
- [Buscaldi et al., 2007] Davide Buscaldi, Paolo Rosso and Emilio Sanchis, A WordNet-Based Indexing Technique for Geographical Information Retrieval, in Evaluation of Multilingual and Multi-modal Information Retrieval, pages 954-957, Springer, ISBN 978-3-540-74998-1, 2007.
- [Cardoso and Silva, 2007] Nuno Cardoso and Mário J. Silva, Query expansion through geographical feature types, in GIR '07: Proceedings of the 4th ACM workshop on Geographical information retrieval, pages 55-60, ACM, Lisbon, Portugal, 2007.
- [Garbin and Mani, 2005] Eric Garbin and Inderjeet Mani, Disambiguating toponyms in news, in Conference on Human Language Technology and Empirical Methods in Natural Language Processing (HLT05), pages 363-370, Association for Computational Linguistics, Vancouver, British Columbia, Canada, 2005.
- [Gonzalo et al., 1998] Julio Gonzalo, Felisa Verdejo, Irin Chugur and José Cigarrán, Indexing with WordNet Synsets can improve Text Retrieval, in COLING/ACL '98 workshop on the Usage of WordNet for NLP, pages 38-44, 1998.
- [Leidner, 2006] Jochen L. Leidner, An evaluation dataset for the toponym resolution task, in Computers, Environment and Urban Systems, volume 30, number 4, pages 400-417, 2006.
- [Olligschlaeger and Hauptmann, 1999] A. M. Olligschlaeger and A. G. Hauptmann, Multimodal Information Systems and GIS: The Informedia Digital Video Library, in 1999 ESRI User Conference, 1999.
- [Overell et al., 2006] Simon Overell, Joao Magalhaes and Stefan Rüger, Place disambiguation with co-occurrence models, in GeoCLEF 2006 Workshop, 2006.
- [Rauch et al., 2003] E. Rauch, M. Bukatin and K. Baker, A confidence-based framework for disambiguating geographic terms, in HLT-NAACL 2003 Workshop on Analysis of Geographic References, pages 50-54, 2003.
- [Sanderson and Kohler, 2004] Mark Sanderson and Janet Kohler, Analyzing Geographic Queries, in Proceedings of Workshop on Geographic Information Retrieval (GIR04), 2004.
- [Smith and Crane, 2001] David A. Smith and Gregory Crane, Disambiguating Geographic Names in a Historical Digital Library, in Research and Advanced Technology for Digital Libraries, pages 127-137, Springer, 2001.