Redeeming Relevance for Subject Search in Citation Indexes

Provided by: ecdl2

Learn more at: http://www.ecdl2003.org

Transcript and Presenter's Notes

1
Redeeming Relevance for Subject Search in
Citation Indexes

Shannon Bradshaw
The University of Iowa
shannon-bradshaw_at_uiowa.edu

2
Citation Indexes

Valuable tools for research
Examples: SCI, CiteSeer, arXiv, CiteBase
Permit traversal of citation networks
Identify significant contributions
Subject search is often the entry point

3
Subject search

Query similarity
Citation frequency

4
Citation frequency

PageRank
Example: 2 papers
similar in terms of relevance
published at roughly the same time
Paper A cited only by its author
Paper B cited 10 times by other authors
Paper B likely to have greater priority for
reading

5
Problem

Boolean retrieval metrics
Many top documents are not relevant
Effective for Web searches
Any one of several popular pages will do
Not so for users of citation indexes

6
Reference Directed Indexing (RDI)

Objective: To combine strong measures of both
relevance and significance in a single metric
Intuition: The opinions of authors who cite a
document effectively distinguish both what a
document is about and how important a
contribution it makes
Similar to the use of anchor text to index Web
documents

7
Example

Paper by Ron Azuma and Gary Bishop
On tracking the heads of users in augmented
reality systems
Head tracking is necessary in order to generate
the correct perspective view

8
A single reference to Azuma
9
Summarizes Azuma paper as

A six degrees of freedom tracking system
With additional details
Improves dynamic registration
Optical beacon ceiling tracker
Linear accelerometers
Rate gyroscopes

10
Leveraging multiple citations

For any document cited more than once
We can compare the words of all authors
Terms used by many referrers make good index
terms for a document

11
Repeated use of tracking and augmented reality
12
A voting technique

RDI treats each citing document as a voter
The presence of a query term in referential text
is a vote of yes
The absence of that term, a no
The documents with the most votes for the query
terms rank highest
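
A minimal Python sketch of the voting idea on this slide. The transcript does not give Rosetta's actual term weighting or ranking metric, so this simply counts one vote per query term per citing passage. The function name, the dictionary layout, and the sample passages are assumptions made for illustration.

```python
def rank_by_votes(referential_text, query):
    """Rank cited documents by votes from their citing passages.

    referential_text: dict mapping a cited document id to a list of
    passages (strings) in which other authors refer to it.
    Each passage casts one "yes" vote for every query term it contains.
    """
    query_terms = query.lower().split()
    scores = []
    for doc_id, passages in referential_text.items():
        votes = 0
        for passage in passages:
            words = set(passage.lower().split())
            votes += sum(1 for term in query_terms if term in words)
        scores.append((doc_id, votes))
    # Most votes first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical referential text, not taken from the experiments.
sample_passages = {
    "paper_a": [
        "a six degrees of freedom tracking system for augmented reality",
        "head tracking for augmented reality improves dynamic registration",
    ],
    "paper_b": ["a survey of virtual reality displays"],
}

ranking = rank_by_votes(sample_passages, "augmented reality tracking")
print(ranking)
```

In this toy input, paper_a collects votes from two citing passages and paper_b from one, so paper_a ranks first.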

13
Related Work

McBryan - World Wide Web Worm
Brin and Page - Google
Chakrabarti et al. - CLEVER
Mendelzon et al. - TOPIC
Bharat et al. - Hilltop
Craswell et al. - Effective Site Finding

14
Contributions

Application to scientific literature
Anchor text for unrestricted subject search
Anchor text for combining measures of relevance
and significance

15
Rosetta

Experimental system in which we implemented RDI
Term weighting metric
Ranking metric

16
(No Transcript)
17
Experiments

10,000 research papers
Gathered from CiteSeer
Each document cited at least once
Evaluated
Retrieval precision
Impact of search results

18
Comparison system

We compared Rosetta to a traditional
content-based retrieval system
Comparison system uses TFIDF for term weighting
And the Cosine ranking metric
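
For readers unfamiliar with the baseline, here is a minimal Python sketch of TFIDF term weighting with cosine ranking. The exact TFIDF variant used in the comparison system is not stated in the transcript; this sketch assumes raw term frequency times log(N/df), whitespace tokenization, and hypothetical sample documents.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build TFIDF vectors from the words used within each document."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    n_docs = len(tokenized)
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))  # document frequency of each term
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = {
        d: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for d, tokens in tokenized.items()
    }
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(docs, query):
    """Return document ids ordered by cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    q_counts = Counter(query.lower().split())
    q_vec = {term: tf * idf.get(term, 0.0) for term, tf in q_counts.items()}
    return sorted(docs, key=lambda d: cosine(q_vec, vectors[d]), reverse=True)


# Hypothetical documents, for illustration only.
sample_docs = {
    "d1": "software architecture diagrams for large systems",
    "d2": "tools for drawing diagrams",
    "d3": "head tracking in augmented reality",
}
results = search(sample_docs, "augmented reality")
print(results)
```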

19
Indexing

Indexed collection in both Rosetta and the
TFIDF/Cosine system
Rosetta indexed documents based on references to
them
The TFIDF/Cosine system indexed documents based
on words used within them
Required that each document be cited at least
once, so that both systems indexed the same
set of documents

20
As referential text, Rosetta used CiteSeer's
contexts of citation
21
As referential text, Rosetta used CiteSeer's
contexts of citation
22
Queries

32 queries in our test set
Queries were key terms extracted from Keywords
sections of documents
Queries extracted from a sample of 24 documents
The document from which a key term was extracted
established the topic of interest

23
Queries
24
Relevance assessments

The topic of interest for a query was the idea
identified by the corresponding key term
Relevant documents directly addressed this same
topic
Example
Query: force feedback
Relevant: Work on providing a sense of touch in
VR applications or other computer simulations

25
Retrieval interface

Meta-interface
Queried both systems
Used top 10 search results from each system
Integrated all 20 search results
Presented them in random order
No way to determine the source of a retrieved
document
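
The blind pooling step can be sketched in Python as follows. The transcript does not say how a document returned by both systems was handled, so this sketch simply keeps both copies; the function name, the "A"/"B" labels, and the placeholder document ids are assumptions.

```python
import random


def blind_merge(results_a, results_b, k=10, seed=None):
    """Pool the top k results of two systems and shuffle them.

    Returns the shuffled list shown to the assessor and a separate key
    (position -> source label) that is kept hidden during assessment.
    """
    rng = random.Random(seed)
    pooled = [(doc, "A") for doc in results_a[:k]]
    pooled += [(doc, "B") for doc in results_b[:k]]
    rng.shuffle(pooled)
    shown = [doc for doc, _ in pooled]
    hidden_key = {pos: source for pos, (_, source) in enumerate(pooled)}
    return shown, hidden_key


# Placeholder result lists standing in for the two systems' rankings.
system_a = [f"a{i}" for i in range(15)]
system_b = [f"b{i}" for i in range(15)]

shown, hidden_key = blind_merge(system_a, system_b, k=10, seed=0)
print(shown)
```

Keeping the source labels in a separate key lets relevance judgments be attributed to each system after assessment without exposing the origin to the assessor.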

26
Experimental summary

32 queries drawn from document key terms
Document identified the topic of interest
Relevant documents addressed the same topic
Used a meta-search interface
Evaluated top 10 from both systems
Origin of search results hidden

27
Precision at top 10

On average RDI provided a 16.6% improvement over
TFIDF/Cosine
1 or 2 more relevant documents in the top 10
Result is significant
t-test of the mean paired difference
Test statistic 3.227
Significant at a confidence level of 99.5%
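
A minimal Python sketch of the two calculations named on this slide: precision at top 10 and the t statistic of the mean paired difference. The per-query counts below are made up for illustration and are not the experimental data; the function names are assumptions as well.

```python
import math
from statistics import mean, stdev


def precision_at_k(relevance_flags, k=10):
    """Fraction of the top k results judged relevant (flags are 0 or 1)."""
    return sum(relevance_flags[:k]) / k


def paired_t_statistic(xs, ys):
    """t statistic of the mean paired difference between xs and ys."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    # Sample standard deviation of the differences (n - 1 denominator).
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))


# Hypothetical relevance judgments for one query's top 10.
flags = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
p10 = precision_at_k(flags)

# Hypothetical counts of relevant documents in the top 10, per query.
rdi_relevant = [6, 5, 7, 4]
baseline_relevant = [5, 3, 6, 4]
t_stat = paired_t_statistic(rdi_relevant, baseline_relevant)
print(p10, round(t_stat, 3))
```

The statistic is compared against a t distribution with n - 1 degrees of freedom, where n is the number of queries.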

28
Precision at top 10 (cont'd)
29
Many retrieval errors avoided

Example: software architecture diagrams
Most papers about software architecture
frequently use the term diagrams
Few are about tools for diagramming
TFIDF/Cosine system -- 0/10 relevant
Rosetta -- 4/10 relevant (3 in top 5)
Rosetta made the correct distinction more often

30
Rosetta Shortcomings

Retrieval metric sorts search results by number
of query terms matched
Some authors reuse portions of text in which
other documents are cited

31
Impact of search results

A look at the number of citations to documents
retrieved for each query
Compared RDI to a baseline provided by the
TFIDF/Cosine system
TFIDF/Cosine includes no measure of impact
Seeking only a measure of the relative impact of
documents retrieved by RDI on a given topic

32
Experiment

For each query
Calculated the average citations/year for each
document
Average publication year for Rosetta: 1994
TFIDF/Cosine: 1995
Found the median number of citations/year for
each set of search results
Found the difference between the median for
Rosetta and the median for TFIDF/Cosine
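
The per-query calculation above might look like this in Python. The transcript does not say how the number of years since publication was counted, so the reference year and the one-year minimum are assumptions, as are the function names and the sample figures.

```python
from statistics import median


def citations_per_year(citations, pub_year, reference_year):
    """Average citations per year since publication (at least one year)."""
    years = max(reference_year - pub_year, 1)
    return citations / years


def median_impact_difference(rdi_docs, baseline_docs, reference_year):
    """Median citations/year of the RDI results minus that of the baseline.

    Each document is a (citation_count, publication_year) pair.
    """
    rdi_rates = [citations_per_year(c, y, reference_year) for c, y in rdi_docs]
    base_rates = [citations_per_year(c, y, reference_year) for c, y in baseline_docs]
    return median(rdi_rates) - median(base_rates)


# Hypothetical search results for one query, not experimental data.
rdi_results = [(40, 1993), (10, 1998), (24, 1995)]
baseline_results = [(5, 1998), (16, 1995), (0, 2000)]

difference = median_impact_difference(rdi_results, baseline_results, 2003)
print(difference)
```

A positive difference means the documents retrieved by RDI for that query were cited more often per year than those retrieved by the baseline.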

33
Difference in impact