Title: Using TREC for cross-comparison between classic IR and ontology-based search models at a Web scale
1. Using TREC for cross-comparison between classic IR and ontology-based search models at a Web scale
Miriam Fernández¹, Vanessa López², Marta Sabou², Victoria Uren², David Vallet¹, Enrico Motta², Pablo Castells¹
Semantic Search 2009 Workshop (SemSearch 2009), 18th International World Wide Web Conference (WWW 2009)
21st April 2009, Madrid
2. Table of contents
- Motivation
- Part I. The proposal: a novel evaluation benchmark
  - Reusing the TREC Web track document collection
  - Introducing the semantic layer
    - Reusing ontologies from the Web
    - Populating ontologies from Wikipedia
    - Annotating documents
- Part II. Analyzing the evaluation benchmark
  - Using the benchmark to compare an ontology-based search approach against traditional IR baselines
    - Experimental conditions
    - Results
- Applications of the evaluation benchmark
- Conclusions
3. Motivation (I)
- Problem: how can semantic search systems be evaluated and compared with standard IR systems to study whether and how semantic search engines offer competitive advantages?
- Traditional IR evaluation
  - Evaluation methodologies generally based on the Cranfield paradigm (Cleverdon, 1967): documents, queries and judgments
  - Well-known retrieval performance metrics: Precision, Recall, P@10, Average Precision (AP), Mean Average Precision (MAP) (a minimal sketch of these metrics follows this slide)
  - Broad initiatives such as TREC to create and use standard evaluation collections, methodologies and metrics
  - The evaluation methods are systematic, easily reproducible, and scalable
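The slide above names the standard Cranfield-style metrics without defining them. Below is a minimal Python sketch of how P@10, AP and MAP are conventionally computed from a ranked result list and the set of relevant document ids per topic; the function names and data layout are illustrative and not taken from the slides.

```python
# Minimal sketch (illustrative, not from the slides) of the standard
# retrieval metrics: P@10, Average Precision, and Mean Average Precision.

def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k retrieved documents that are relevant (P@10 for k=10)."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k

def average_precision(ranking, relevant):
    """Average of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs, qrels):
    """MAP: mean of the per-topic average precision over all judged topics."""
    return sum(average_precision(runs[t], qrels[t]) for t in qrels) / len(qrels)
```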
4. Motivation (II)
- Ontology-based search evaluation
  - Ontology-based search approaches
    - Introduction of a new semantic search space (ontologies and KBs)
    - Change in the IR vision (input, output, scope)
  - The evaluation methods rely on user-centered studies, and therefore tend to be high-cost, non-scalable and difficult to reproduce
  - There is still a long way to go towards standard evaluation benchmarks for assessing the quality of ontology-based search approaches
- Goal: develop a new reusable evaluation benchmark for cross-comparison between classic IR and ontology-based models on a significant scale
5. Part I. The proposal
- Motivation
- Part I. The proposal: a novel evaluation benchmark
  - Reusing the TREC Web track document collection
  - Introducing the semantic layer
    - Reusing ontologies from the Web
    - Populating ontologies from Wikipedia
    - Annotating documents
- Part II. Analyzing the evaluation benchmark
  - Using the benchmark to compare an ontology-based search approach against traditional IR baselines
    - Experimental conditions
    - Results
- Part III. Applications of the evaluation benchmark
- Conclusions
6. The evaluation benchmark (I)
- A benchmark collection for cross-comparison between classic IR and ontology-based search models at a large scale should comprise five main components:
  - a set of documents,
  - a set of topics or queries,
  - a set of relevance judgments (or lists of relevant documents for each topic),
  - a set of semantic resources, ontologies and KBs, which provide the needed semantic information for ontology-based approaches,
  - a set of annotations that associate the semantic resources with the document collection (not needed for all ontology-based search approaches).
7. The evaluation benchmark (II)
- Start from a well-known standard IR evaluation benchmark
  - Reuse of the TREC Web track collection used in the TREC 9 and TREC 2001 editions of the TREC conference
  - Document collection: WT10g (Bailey, Craswell, Hawking, 2003), about 10 GB in size, 1.69 million Web pages
  - The TREC topics and judgments for this text collection are provided with the TREC 9 and TREC 2001 datasets (see the qrels-loading sketch after this slide)
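As a companion to the slide above, here is a small illustrative sketch (not part of the original material) of loading TREC-style relevance judgments, whose standard line format is `<topic> <iteration> <docno> <relevance>`; the file name in the usage comment is hypothetical.

```python
# Illustrative sketch: load a TREC qrels file into {topic_id: set(relevant docnos)}.
from collections import defaultdict

def load_qrels(path):
    """Keep only the documents judged relevant (relevance > 0) for each topic."""
    relevant = defaultdict(set)
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            topic, _iteration, docno, rel = line.split()
            if int(rel) > 0:
                relevant[topic].add(docno)
    return relevant

# Usage (hypothetical file name):
# qrels = load_qrels("qrels.trec9.main_web")
```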
8. The evaluation benchmark (III)
- Construct the semantic search space
  - In order to fulfill Web-like conditions, all the semantic search information should be available online
  - The selected semantic information should cover, or partially cover, the domains involved in the TREC query set
  - The selected semantic resources should be complemented with a larger set of random ontologies and KBs to approximate a fair scenario
  - If the semantic information available online has to be extended in order to cover the TREC queries, this must be done with information sources that are completely independent from the document collection and available online
9. The evaluation benchmark (IV)
- Document collection
  - TREC WT10g
- Queries and judgments
  - TREC 9 and TREC 2001 test corpora
  - 100 queries with their corresponding judgments
  - 20 queries selected and adapted to be used by an NLP QA query processing module
- Ontologies
  - 40 public ontologies covering a subset of the TREC domains and queries (370 files comprising 400 MB of RDF, OWL and DAML)
  - 100 additional repositories (2 GB of RDF and OWL)
- Knowledge bases
  - Some of the 40 selected ontologies have been semi-automatically populated from Wikipedia
- Annotations
  - 1.2 × 10⁸ non-embedded annotations generated and stored in a MySQL database
10. The evaluation benchmark (V)
- Selecting TREC queries
  - Queries have to be formulated in a way suitable for ontology-based search systems (informational queries)
    - E.g., queries such as "discuss the financial aspects of retirement planning" (topic 514) are not selected
  - Ontologies must be available for the domain of the query
  - We selected 20 queries
- Adapting TREC queries
11. The evaluation benchmark (VI)
- Populating ontologies from Wikipedia
  - The semantic resources available online are still scarce and incomplete (Sabou, Gracia, Angeletou, d'Aquin, Motta, 2007)
  - Generation of a simple semi-automatic ontology-population mechanism (a sketch follows this slide), which:
    - populates ontology classes with new individuals
    - extracts ontology relations for a specific ontology individual
    - uses Wikipedia lists and tables to extract this information
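The slide only outlines the population mechanism; the following is a rough, hypothetical sketch of the idea using requests, BeautifulSoup and rdflib. The ontology namespace, the target class and the example Wikipedia page are illustrative assumptions, and real list pages would need page-specific parsing.

```python
# Hypothetical sketch: populate an ontology class with individuals scraped
# from a Wikipedia list page (not the authors' actual implementation).
import requests
from bs4 import BeautifulSoup
from rdflib import Graph, Namespace, Literal, RDF, RDFS

EX = Namespace("http://example.org/onto#")  # hypothetical ontology namespace

def populate_class_from_wikipedia_list(list_url, target_class, graph):
    html = requests.get(list_url).text
    soup = BeautifulSoup(html, "html.parser")
    content = soup.find(id="mw-content-text")
    for link in content.select("ul li a[title]"):   # entries of the Wikipedia list
        label = link["title"]
        individual = EX[label.replace(" ", "_")]    # naive URI minting, sketch only
        graph.add((individual, RDF.type, target_class))
        graph.add((individual, RDFS.label, Literal(label)))
    return graph

# Usage (hypothetical page and class):
# g = populate_class_from_wikipedia_list(
#         "https://en.wikipedia.org/wiki/List_of_cloud_types", EX.Cloud, Graph())
```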
12. The evaluation benchmark (VII)
- Annotating documents with ontology entities
  - Identify ontology entities (classes, properties, instances or literals) within the documents to generate new annotations
  - Do not populate ontologies, but identify already available semantic knowledge within the documents
  - Support annotation in open-domain environments (any document can be associated or linked to any ontology without any predefined restriction)
  - This brings scalability limitations. To address them we propose to:
    - generate ontology indices
    - generate document indices
    - construct an annotation database which stores non-embedded annotations
  (a sketch of this annotation step follows below)
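A minimal sketch of the annotation step described above, under assumed data structures: an "ontology index" mapping entity labels to URIs is matched against document text, and each hit is stored as a non-embedded annotation. SQLite stands in here for the MySQL database mentioned in the benchmark description.

```python
# Sketch (assumed data structures): label-based annotation with non-embedded
# annotations stored in a relational table (SQLite as a stand-in for MySQL).
import re
import sqlite3

def build_label_index(entities):
    """'Ontology index': map lowercase entity labels to entity URIs."""
    return {label.lower(): uri for uri, label in entities}

def annotate(doc_id, text, label_index, conn):
    """Store one (document, entity) annotation per entity label found in the text."""
    lowered = text.lower()
    for label, uri in label_index.items():
        if re.search(r"\b" + re.escape(label) + r"\b", lowered):
            conn.execute(
                "INSERT INTO annotations(doc_id, entity_uri, label) VALUES (?, ?, ?)",
                (doc_id, uri, label),
            )
    conn.commit()

conn = sqlite3.connect("annotations.db")
conn.execute("CREATE TABLE IF NOT EXISTS annotations(doc_id TEXT, entity_uri TEXT, label TEXT)")
```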
13. The evaluation benchmark (VIII)
- Annotation based on contextual semantic information
  - Ambiguities: exploit ontologies as background knowledge for disambiguation (increasing precision but reducing the number of annotations); a sketch of this heuristic follows below
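A small illustrative sketch of the disambiguation heuristic mentioned above, with hypothetical helper names: an ambiguous label is resolved by preferring the candidate entity whose ontological neighbours also occur in the document, and discarded otherwise, which raises precision at the cost of fewer annotations.

```python
# Sketch (hypothetical helpers): contextual disambiguation of an ambiguous label.
def disambiguate(label, candidates, doc_text, neighbour_labels):
    """candidates: entity URIs sharing the label;
    neighbour_labels(uri) -> labels of entities related to uri in its ontology."""
    lowered = doc_text.lower()
    best, best_support = None, 0
    for uri in candidates:
        # Count how many ontological neighbours of this candidate appear in the document.
        support = sum(1 for n in neighbour_labels(uri) if n.lower() in lowered)
        if support > best_support:
            best, best_support = uri, support
    return best  # None when no candidate is supported by the document context
```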
14. Part II. Analyzing the evaluation benchmark
- Motivation
- Part I. The proposal: a novel evaluation benchmark
  - Reusing the TREC Web track document collection
  - Introducing the semantic layer
    - Reusing ontologies from the Web
    - Populating ontologies from Wikipedia
    - Annotating documents
- Part II. Analyzing the evaluation benchmark
  - Using the benchmark to compare an ontology-based search approach against traditional IR baselines
    - Experimental conditions
    - Results
- Applications of the evaluation benchmark
- Conclusions
15. Experimental conditions
- Keyword-based search (Lucene)
- Best TREC automatic search
- Best TREC manual search
- Semantic-based search (Fernández et al., 2008)
16. Results (I)
- MAP: mean average precision
- P@10: precision at 10
- Figures in bold correspond to the best result for each topic, excluding the best TREC manual approach (because of the way it constructs the query)
17. Results (II)
- By P@10, semantic retrieval outperforms the other two approaches
  - It provides maximal quality for 55% of the queries and is only outperformed by both Lucene and TREC in one query (511)
  - Semantic retrieval provides better results than Lucene for 60% of the queries and equal results for another 20%
  - Compared to the best TREC automatic engine, our approach improves on 65% of the queries and produces comparable results in 5%
- By MAP, there is no clear winner
  - The average performance of TREC automatic is greater than that of semantic retrieval
  - Semantic retrieval outperforms TREC automatic in 50% of the queries and Lucene in 75%
- Bias in the MAP measure
  - More than half of the documents retrieved by the semantic retrieval approach have not been rated in the TREC judgments
  - The annotation technique used for the semantic retrieval approach is very conservative (missing potentially correct annotations)
18. Results (III)
- For some queries for which the keyword search (Lucene) approach finds no relevant documents, the semantic search does
  - Queries 457 (Chevrolet trucks), 523 (facts about the five main clouds) and 524 (how to erase scar?)
- In the queries in which semantic retrieval did not outperform the keyword baseline, the semantic information obtained by the query processing module was scarce
- Still, overall, the keyword baseline only rarely provides significantly better results than semantic search
- TREC Web search evaluation topics are conceived for keyword-based search engines
  - With complex structured queries (involving relationships), the performance of semantic retrieval would improve significantly compared to keyword-based search
  - The full capabilities of the semantic retrieval model for formal semantic queries were not exploited in this set of experiments
19. Results (IV)
- Studying the impact of retrieved non-evaluated documents
  - 66% of the results returned by semantic retrieval were not judged
  - P@10 is not affected: results in the first positions have a higher probability of having been evaluated
  - MAP: evaluating the impact
    - Informal evaluation of the first 10 unevaluated results returned for every query
    - 89% of these results occur in the first 100 positions for their respective query
    - A significant portion, 31.5%, of the documents we judged turned out to be relevant
  - Even though this cannot be generalized to all the unevaluated results returned by the semantic retrieval approach (the probability of being relevant drops around the first 100 results and then varies very little), we believe that the lack of evaluations for all the results returned by semantic retrieval impairs its MAP value (a sketch of this effect follows below)
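To make the bias concrete, here is a small sketch (reusing the same average_precision definition as in the earlier metrics sketch) contrasting standard MAP, where unjudged documents count as non-relevant, with MAP over "condensed" rankings from which unjudged documents are removed before scoring. The condensed variant can only raise a system's score, so a run with many unjudged results is penalized by standard MAP.

```python
# Sketch: effect of unjudged documents on MAP (condensed-list comparison).

def average_precision(ranking, relevant):
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def condensed_ranking(ranking, judged):
    """Remove documents that were never judged, keeping the original order."""
    return [doc for doc in ranking if doc in judged]

def map_variants(runs, relevant, judged):
    topics = list(relevant)
    standard = sum(average_precision(runs[t], relevant[t]) for t in topics) / len(topics)
    condensed = sum(
        average_precision(condensed_ranking(runs[t], judged[t]), relevant[t]) for t in topics
    ) / len(topics)
    return standard, condensed  # condensed >= standard when many results are unjudged
```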
20. Applications of the benchmark
- Goal: how can this benchmark be applied to evaluate other ontology-based search approaches?
21. Conclusions (I)
- In the semantic search community, there is a need for standard evaluation benchmarks to evaluate and compare ontology-based approaches against each other, and against traditional IR models
- In this work, we have addressed two issues:
  - Construction of a potentially widely applicable ontology-based evaluation benchmark from traditional IR datasets, such as the TREC Web track reference collection
  - Use of the benchmark to evaluate a specific ontology-based search approach (Fernández et al., 2008) against different traditional IR models at a large scale
22. Conclusions (II)
- Potential limitations of the above benchmark are:
  - The need for ontology-based search systems to participate in the pooling methodology to obtain a better set of document judgments
  - The use of queries with a low level of expressivity in terms of relations, more oriented towards traditional IR models
  - The scarcity of the publicly available semantic information to cover the meanings involved in the document search space
- A common understanding of ontology-based search in terms of inputs, outputs and scope should be reached before achieving real standardization in the evaluation of ontology-based search models
23. Thank you!
http://nets.ii.uam.es/miriam/thesis.pdf (chapter 6)
http://nets.ii.uam.es/publications/icsc08.pdf