The Role of Toponym Disambiguation in Information Retrieval and Question Answering

1
The Role of Toponym Disambiguation in Information
Retrieval and Question Answering
  • A Ph.D. Thesis Project presented by
  • Davide Buscaldi
  • to Dpto. De Sistemas Informáticos y Computación
  • Supervisor
  • Paolo Rosso

2
Plan of the Talk
  • Introduction
  • Geographical Information Needs
  • Geographical ambiguity and related issues
  • Current solutions
  • Thesis Objectives
  • Methodology
  • Schedule

3
Geographical Information Needs
  • Users are interested in geographical or local
    information
  • Find an activity on a map, e.g. "Hotel in
    Tarragona"
  • Locate a place, e.g. "Where is the Leaning
    Tower?"
  • Find a route / path, e.g. from Stazione C.le to
    Malpensa airport
  • Calculate a distance, e.g. "How far is Valencia
    from Madrid?"
  • Computers can be used to address these
    information needs
  • Problems borrowed from computational geometry:
  • Convex hull: given a set of points, find the
    smallest convex polyhedron/polygon containing all
    the points.
  • Segment intersection: find the intersections
    between a given set of line segments.
  • Shortest path: connect two points in a Euclidean
    space (with polyhedral obstacles) by a shortest
    path.
  • Point location: given a partitioning of the space
    into cells, produce a data structure that
    efficiently tells in which cell a query point is
    located.
  • However, a precise representation (map) is needed
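
As an illustration of the convex hull problem listed above, here is a minimal sketch using Andrew's monotone chain algorithm. It assumes planar (x, y) coordinates; treating longitude/latitude pairs this way is only an approximation. The function names are illustrative and do not come from the slides.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cross(o: Point, a: Point, b: Point) -> float:
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points: List[Point]) -> List[Point]:
    """Return the hull vertices in counter-clockwise order (monotone chain)."""
    pts = sorted(set(points))  # sort by x, then y, and drop duplicates
    if len(pts) <= 2:
        return pts

    lower: List[Point] = []
    for p in pts:
        # Pop while the last two points and p do not make a left turn.
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper: List[Point] = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other one.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    square_with_centre = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
    print(convex_hull(square_with_centre))
```

The interior point (1, 1) is discarded and only the four corners remain, which is the smallest convex polygon containing all the input points.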

4
Geographical Information in Text
  • Geographical information on the web is spread
    over unstructured text documents
  • Examples:
  • "Arizona Sen. John McCain won Tuesday in Florida"
  • "The Lonja de la Seda in Valencia was founded in
    1469 as a market for oil"
  • [Sanderson and Kohler, 2004] showed that about
    20% of searches include toponyms or other
    geographical constraints
  • Need to adapt Information Retrieval techniques to
    geographically constrained queries (from IR to
    GIR)
  • Grounding (correctly identifying a location and
    assigning it coordinates)
  • Indexing and ranking phases have to take
    geographical features into account
  • GeoCLEF, annual GIR workshops

5
GIR Issues
  • Implicit geographical information
  • Most geographical information is kept hidden
  • "A federal judge in Detroit struck down the
    National Security Agency's domestic surveillance
    program yesterday, calling it unconstitutional
    and an illegal abuse of presidential power."

The information about the geographical entities
that contain Detroit is implicit in the text:
it is never mentioned explicitly.

6
(More) GIR Issues
  • Synonymy
  • Different names for the same place
  • St. Petersburg, San Petersburgo,
    Санкт-Петербург, Leningrad
  • Ambiguity of toponyms
  • Class ambiguity (geo vs. non-geo)
  • Example: Muro (Mallorca) vs. muro (wall)
  • Geographical ambiguity
  • Example: Cambridge (U.S.) vs. Cambridge (U.K.)
  • Solutions
  • External knowledge sources, such as ontologies,
    gazetteers, encyclopaedias

7
Ambiguity
  • The richer the resource used, the higher the
    ambiguity
  • Example
  • Picture above: locations referenced in Wikipedia
  • Below: locations referenced in the GNS gazetteer
  • Ambiguity is (probably) the major issue
  • Directly related to coverage
  • 67.8% of toponyms in news are ambiguous [Garbin
    and Mani, 2005]

8
Current Solutions
  • A variety of methods have been proposed so far
    for toponym disambiguation (TD)
  • Decision trees [Olligschlaeger and Hauptmann,
    1999]
  • Centroid distance (map-based) [Smith and Crane,
    2001]
  • Supervised learning [Rauch et al., 2003]
  • Wikipedia-based method [Overell et al., 2006]
  • WordNet-based method [Buscaldi and Rosso, 2008]
  • Results vary in the range 75% to 96%, but on
    (very) different test sets
  • There are open questions that have to be answered

9
Thesis Objectives: Questions to be Answered
  • Previous methods have not been tested on the
    same corpus
  • Which is the best approach?
  • [Gonzalo et al., 1998] stated that WSD
    significantly improves IR results only when
    accuracy > 70%
  • How does TD affect GIR?
  • Is the same accuracy level needed?
  • Do improvements in IR carry over to QA?

10
More (Technical) Questions
  • Toponym indexing
  • Toponym representation: by name, ID or
    coordinates?
  • E.g. "London, U.K.", 00435713, or 51°30'28"N
    00°07'41"W?
  • Include containing entities?
  • E.g. London -> England -> U.K. -> Europe ->
    Northern Hemisphere
  • Which resource?
  • How to deal with fuzzy areas?
  • E.g. Northern Europe
  • Weighting scheme
  • Is it better to include geography during search
    or to filter results at the end?

11
(A Few) Answers
  • Some questions have already been answered
  • Containing entities may help with some queries
    [Buscaldi et al., 2007]
  • Filtering seems to perform better than dedicated
    weighting schemes [Cardoso and Silva, 2007]

12
Methodology: Toponym Disambiguation
  • We adapted a method based on Conceptual Density,
    computed over hierarchies of hypernyms, to
    hierarchies of holonyms
  • Given an ambiguous place name, different
    subhierarchies are obtained from the reference
    ontology (WordNet)
  • Place names in the context are added to the
    hierarchies
  • Finally, the sense related to the densest
    subhierarchy is selected
  • Results: 94.2% precision, 78.9% recall (GeoSemCor)
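
The selection step can be illustrated with a deliberately simplified sketch. It does not implement the Conceptual Density formula of the thesis: as a stand-in, each candidate sense is scored by the number of context place names whose holonym chain overlaps the holonym chain of that sense. The holonym table, the sense identifiers and the function names are all hypothetical toy data; a real system would read them from WordNet.

```python
from typing import Dict, List, Optional

# Hypothetical toy "part-of" (holonym) links, standing in for WordNet.
HOLONYM: Dict[str, str] = {
    "Cambridge#UK": "England",
    "Cambridge#US": "Massachusetts",
    "London": "England",
    "Boston": "Massachusetts",
    "England": "United Kingdom",
    "Massachusetts": "United States",
    "United Kingdom": "Europe",
    "United States": "North America",
}

# Hypothetical sense inventory for ambiguous place names.
SENSES: Dict[str, List[str]] = {"Cambridge": ["Cambridge#UK", "Cambridge#US"]}


def holonym_chain(node: str) -> List[str]:
    """Follow holonym links upwards, e.g. London -> England -> United Kingdom."""
    chain = [node]
    while chain[-1] in HOLONYM:
        chain.append(HOLONYM[chain[-1]])
    return chain


def disambiguate(toponym: str, context: List[str]) -> Optional[str]:
    """Pick the sense whose subhierarchy attracts the most context place names.

    Returns None (no attempt) when there is no evidence or when senses tie.
    """
    scores = {}
    for sense in SENSES.get(toponym, []):
        sub = set(holonym_chain(sense))
        # Simplified density: context names sharing a node with this sense.
        scores[sense] = sum(1 for c in context if sub & set(holonym_chain(c)))
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


if __name__ == "__main__":
    print(disambiguate("Cambridge", ["Boston"]))
    print(disambiguate("Cambridge", ["London"]))
```

Abstaining when the evidence is missing or tied is one way a method can end up with higher precision than recall, as in the figures reported above; whether the original method abstains in exactly these cases is an assumption of this sketch.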

13
Methodology: Evaluation of TD
  • TD methods will be compared on the same resource
  • Initially GeoSemCor, a subset of 1200 toponyms
    from the SemCor corpus [Buscaldi and Rosso, 2008]
  • We will have to introduce a mapping between
    WordNet synsets and coordinates in order to
    evaluate map-based TD
  • J. Leidner's Reuters geo-tagged corpus [Leidner,
    2006], when available
  • Same accuracy measures used in WSD
  • Precision (correctly disambiguated / attempted)
  • Recall (correctly disambiguated / total)
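
The two measures defined above can be computed as in the following short sketch. The counts in the usage example are hypothetical and are not taken from GeoSemCor; the function name is illustrative.

```python
from typing import Tuple


def td_precision_recall(correct: int, attempted: int, total: int) -> Tuple[float, float]:
    """Precision = correct / attempted; recall = correct / total.

    'attempted' counts the toponyms the method assigned a sense to,
    'total' counts all toponyms in the test set.
    """
    precision = correct / attempted if attempted else 0.0
    recall = correct / total if total else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical counts: 1000 toponyms, 800 attempted, 760 of them correct.
    p, r = td_precision_recall(correct=760, attempted=800, total=1000)
    print(f"precision={p:.3f} recall={r:.3f}")
```

Precision and recall differ only when the method leaves some toponyms unresolved; if every toponym is attempted, the two values coincide.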

14
Methodology: GeoWorSE GIR Search Engine
  • This is the system we used at GeoCLEF 2007
  • Tailored to address the problem of implicit
    geographical information
  • Core system: Lucene 2.1
  • Three indices: text, geo and WN
  • Indexing process
  • The document is stored in the text index using
    the standard analyzer of Lucene
  • LingPipe identifies the named entities (NEs) in
    the document
  • If an NE is a toponym, it is added to the geo
    index
  • If it is in WordNet, it is added together with
    its synonyms and holonyms to the WN index
  • Search process
  • Standard mode: as in Lucene
  • Geo mode: toponyms are searched in the geo index,
    content words in the text index
  • GeoWN mode: toponyms are searched in the geo and
    WN indices
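
The indexing and search flow above can be sketched with plain Python dictionaries standing in for the three Lucene indices. This is not the GeoWorSE code: the toponym list replaces LingPipe, the expansion table replaces WordNet, and the way hits are combined (union of geo and WN hits for a toponym, intersection across query terms) is an assumption made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Set

# Hypothetical stand-ins for the NE recognizer and for WordNet lookups.
TOPONYMS: Set[str] = {"detroit", "michigan", "spain"}
WN_EXPANSION: Dict[str, List[str]] = {
    "detroit": ["michigan", "united states"],  # toy synonym/holonym list
}


class ThreeIndexStore:
    """Toy inverted indices: term -> set of document ids."""

    def __init__(self) -> None:
        self.text: Dict[str, Set[str]] = defaultdict(set)
        self.geo: Dict[str, Set[str]] = defaultdict(set)
        self.wn: Dict[str, Set[str]] = defaultdict(set)

    def add_document(self, doc_id: str, body: str) -> None:
        for raw in body.split():
            tok = raw.strip(".,;:!?").lower()
            if not tok:
                continue
            self.text[tok].add(doc_id)          # every token goes to the text index
            if tok in TOPONYMS:
                self.geo[tok].add(doc_id)       # toponyms also go to the geo index
                if tok in WN_EXPANSION:         # known to "WordNet": add expansions
                    for term in [tok] + WN_EXPANSION[tok]:
                        self.wn[term].add(doc_id)

    def search(self, terms: List[str], mode: str = "standard") -> Set[str]:
        result = None
        for term in terms:
            t = term.lower()
            if mode != "standard" and t in TOPONYMS:
                hits = set(self.geo.get(t, set()))
                if mode == "geown":
                    hits |= self.wn.get(t, set())
            else:
                hits = set(self.text.get(t, set()))
            result = hits if result is None else result & hits
        return result or set()


if __name__ == "__main__":
    store = ThreeIndexStore()
    store.add_document("d1", "A federal judge in Detroit struck down the program.")
    store.add_document("d2", "The judge visited Spain.")
    for mode in ("standard", "geo", "geown"):
        print(mode, store.search(["judge", "Michigan"], mode))
```

In this toy setup only the GeoWN mode retrieves the Detroit document for a query about Michigan, because the containing region is never mentioned in the text and is reachable only through the expansion stored in the WN index.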

15
Methodology: TD and GIR
  • Evaluation of the impact of TD on GIR
  • Integration of TD in our existing GIR engine
  • Evaluation with GeoCLEF test sets (and direct
    participation in 2008)
  • (Possibly) evaluation over Geo-Reuters
  • Measures
  • MAP (Mean Average Precision), Recall (relevant
    retrieved / total relevant)
  • Experiments with different term representations,
    geographical resources, and weighting schemes
  • Evaluation of the impact of TD on QA
  • Include TD in our QA system
  • Evaluate against past CLEF results and through
    participation in CLEF 2008
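
For reference, the two retrieval measures named above can be computed as in this sketch. It uses the standard, non-interpolated definition of average precision; the document ids and relevance judgements in the usage example are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k at which a relevant document appears."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs: List[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average of the per-query average precision values."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


def recall(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Relevant retrieved / total relevant."""
    if not relevant:
        return 0.0
    return len(set(ranked) & relevant) / len(relevant)


if __name__ == "__main__":
    # Two hypothetical queries: (ranked result list, set of relevant documents).
    runs = [
        (["d1", "d2", "d3", "d4"], {"d1", "d3"}),
        (["d5", "d6"], {"d6", "d7"}),
    ]
    print(f"MAP={mean_average_precision(runs):.4f}")
    print(f"recall(query 2)={recall(*runs[1]):.2f}")
```

Relevant documents that are never retrieved contribute zero to the average precision, so MAP rewards both finding the relevant documents and ranking them early.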

16
Tasks

17
Schedule

18
Bibliography
  • [Buscaldi and Rosso, 2008] Davide Buscaldi and
    Paolo Rosso, A conceptual density-based approach
    for the disambiguation of toponyms, in
    International Journal of Geographical Information
    Systems, (to appear in) 2008.
  • [Buscaldi et al., 2007] Davide Buscaldi, Paolo
    Rosso and Emilio Sanchis, A WordNet-Based
    Indexing Technique for Geographical Information
    Retrieval, in Evaluation of Multilingual and
    Multi-modal Information Retrieval, pages 954-957,
    Springer, ISBN 978-3-540-74998-1, 2007.
  • [Cardoso and Silva, 2007] Nuno Cardoso and Mário
    J. Silva, Query expansion through geographical
    feature types, in GIR '07: Proceedings of the
    4th ACM workshop on Geographical information
    retrieval, pages 55-60, ACM, Lisbon, Portugal,
    2007.
  • [Garbin and Mani, 2005] Eric Garbin and Inderjeet
    Mani, Disambiguating toponyms in news, in
    conference on Human Language Technology and
    Empirical Methods in Natural Language Processing
    (HLT05), pages 363-370, Association for
    Computational Linguistics, Vancouver, British
    Columbia, Canada, 2005.
  • [Gonzalo et al., 1998] Julio Gonzalo, Felisa
    Verdejo, Irin Chugur and José Cigarrán, Indexing
    with WordNet Synsets can improve Text Retrieval,
    in COLING/ACL '98 workshop on the Usage of
    WordNet for NLP, pages 38-44, 1998.
  • [Leidner, 2006] Jochen L. Leidner, An evaluation
    dataset for the toponym resolution task, in
    Computers, Environment and Urban Systems, volume
    30, number 4, pages 400-417, 2006.
  • [Olligschlaeger and Hauptmann, 1999] A. M.
    Olligschlaeger and A. G. Hauptmann, Multimodal
    Information Systems and GIS: The Informedia
    Digital Video Library, in 1999 ESRI User
    Conference, 1999.
  • [Overell et al., 2006] Simon Overell, Joao
    Magalhaes and Stefan Rüger, Place disambiguation
    with co-occurrence models, in GeoCLEF 2006
    Workshop, 2006.
  • [Rauch et al., 2003] E. Rauch, M. Bukatin and K.
    Baker, A confidence-based framework for
    disambiguating geographic terms, in HLT-NAACL
    2003 Workshop on Analysis of Geographic
    References, pages 50-54, 2003.
  • [Sanderson and Kohler, 2004] Mark Sanderson and
    Janet Kohler, Analyzing Geographic Queries, in
    Proceedings of the Workshop on Geographic
    Information Retrieval (GIR04), 2004.
  • [Smith and Crane, 2001] David A. Smith and
    Gregory Crane, Disambiguating Geographic Names in
    a Historical Digital Library, in Research and
    Advanced Technology for Digital Libraries, pages
    127-137, Springer, 2001.