Redeeming Relevance for Subject Search in Citation Indexes

Provided by: ecdl2
Learn more at: http://www.ecdl2003.org

1
Redeeming Relevance for Subject Search in
Citation Indexes
  • Shannon Bradshaw
  • The University of Iowa
  • shannon-bradshaw_at_uiowa.edu

2
Citation Indexes
  • Valuable tools for research
  • Examples: SCI, CiteSeer, arXiv, CiteBase
  • Permit traversal of citation networks
  • Identify significant contributions
  • Subject search is often the entry point

3
Subject search
  • Query similarity
  • Citation frequency

4
Citation frequency
  • PageRank
  • Example: 2 papers
  • similar in terms of relevance
  • published at roughly the same time
  • Paper A cited only by its author
  • Paper B cited 10 times by other authors
  • Paper B likely to have greater priority for
    reading

5
Problem
  • Boolean retrieval metrics
  • Many top documents are not relevant
  • Effective for Web searches
  • Any one of several popular pages will do
  • Not so for users of citation indexes

6
Reference Directed Indexing (RDI)
  • Objective: To combine strong measures of both
    relevance and significance in a single metric
  • Intuition: The opinions of authors who cite a
    document effectively distinguish both what a
    document is about and how important a
    contribution it makes
  • Similar to the use of anchor text to index Web
    documents

7
Example
  • Paper by Ron Azuma and Gary Bishop
  • On tracking the heads of users in augmented
    reality systems
  • Head tracking is necessary in order to generate
    the correct perspective view

8
A single reference to Azuma

9
Summarizes Azuma paper as
  • A six degrees of freedom tracking system
  • With additional details
  • Improves dynamic registration
  • Optical beacon ceiling tracker
  • Linear accelerometers
  • Rate gyroscopes

10
Leveraging multiple citations
  • For any document cited more than once
  • We can compare the words of all authors
  • Terms used by many referrers make good index
    terms for a document
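
A minimal sketch of this idea, not Rosetta's actual term weighting metric (which the slides do not spell out): for each term, count how many distinct citing contexts use it. The tokenizer, the stop-word list, and the sample contexts are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "to", "is", "on", "with"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens, dropping stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]

def referrer_term_counts(contexts):
    """Count, for each term, the number of citing contexts that use it at least once."""
    counts = Counter()
    for context in contexts:
        counts.update(set(tokenize(context)))  # set(): one count per referrer
    return counts

# Hypothetical referential text from three papers citing the same document.
contexts = [
    "a six degrees of freedom tracking system for augmented reality",
    "head tracking in augmented reality using an optical ceiling tracker",
    "improves dynamic registration with inertial tracking",
]
counts = referrer_term_counts(contexts)
print(counts["tracking"], counts["augmented"], counts["registration"])  # 3 2 1
```

Terms with the highest counts ("tracking" here) are the ones most referrers agree on, which is what makes them candidate index terms.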

11
Repeated use of tracking and augmented reality

12
A voting technique
  • RDI treats each citing document as a voter
  • The presence of a query term in referential text
    is a vote of yes
  • The absence of that term, a no
  • The documents with the most votes for the query
    terms rank highest
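
The voting description can be sketched as follows. This is a literal reading of the bullets above under stated assumptions, not Rosetta's actual ranking metric: whitespace tokenization, one vote per (citing document, query term) pair, and the document identifiers and referential text are hypothetical.

```python
def rank_by_votes(query, referential_text):
    """Rank documents by the number of yes votes cast by their citing documents.

    referential_text maps a document id to a list of strings, one per citing
    document. Each citing document votes yes once for every query term that
    appears in its text about the cited document.
    """
    query_terms = query.lower().split()
    scores = {}
    for doc_id, contexts in referential_text.items():
        votes = 0
        for context in contexts:
            words = set(context.lower().split())
            votes += sum(1 for term in query_terms if term in words)
        scores[doc_id] = votes
    # Most votes first; ties broken by document id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical referential text for two cited papers.
referential_text = {
    "paper_a": [
        "head tracking for augmented reality",
        "a tracking system with rate gyroscopes",
        "augmented reality registration",
    ],
    "paper_b": ["tracking objects in video"],
}
ranking = rank_by_votes("augmented reality tracking", referential_text)
print(ranking)  # [('paper_a', 6), ('paper_b', 1)]
```

Because every citing document can add votes, a frequently cited paper that referrers describe in the query's terms rises on both relevance and citation count at once.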

13
Related Work
  • McBryan - World Wide Web Worm
  • Brin and Page - Google
  • Chakrabarti et al. - CLEVER
  • Mendelzon et al. - TOPIC
  • Bharat et al. - Hilltop
  • Craswell et al. - Effective Site Finding

14
Contributions
  • Application to scientific literature
  • Anchor text for unrestricted subject search
  • Anchor text for combining measures of relevance
    and significance

15
Rosetta
  • Experimental system in which we implemented RDI
  • Term weighting metric
  • Ranking metric

16
(No Transcript)

17
Experiments
  • 10,000 research papers
  • Gathered from CiteSeer
  • Each document cited at least once
  • Evaluated:
  • Retrieval precision
  • Impact of search results

18
Comparison system
  • We compared Rosetta to a traditional
    content-based retrieval system
  • Comparison system uses TFIDF for term weighting
  • And the Cosine ranking metric
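
For readers unfamiliar with the baseline, a TFIDF/Cosine ranker can be sketched as below. The slides do not say which TF-IDF variant the comparison system used, so the weighting here (raw term frequency times log(N/df)), the whitespace tokenizer, and the sample documents are assumptions.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (term -> weight) for each tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query, docs):
    """Return document indices by decreasing similarity, plus the raw scores."""
    tokenized = [doc.lower().split() for doc in docs]
    vectors, idf = tfidf_vectors(tokenized)
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    scores = [cosine(q_vec, vec) for vec in vectors]
    return sorted(range(len(docs)), key=lambda i: -scores[i]), scores

# Hypothetical document texts (words used within the documents themselves).
docs = [
    "force feedback devices for virtual reality",
    "software architecture diagrams and tools",
    "haptic force rendering in simulations",
]
ranking, scores = search("force feedback", docs)
print(ranking)  # [0, 2, 1]
```

Unlike the referrer-based index, this baseline scores a document only by the words it contains, with no signal about how often it is cited.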

19
Indexing
  • Indexed collection in both Rosetta and the
    TFIDF/Cosine system
  • Rosetta indexed documents based on references to
    them
  • The TFIDF/Cosine system indexed documents based
    on words used within them
  • Required that each document be cited at least
    once to ensure that both systems indexed the same
    set of documents

20
As referential text, Rosetta used CiteSeer's
contexts of citation

21
As referential text, Rosetta used CiteSeer's
contexts of citation

22
Queries
  • 32 queries in our test set
  • Queries were key terms extracted from Keywords
    sections of documents
  • Queries extracted from sample of 24 documents
  • Document from which key term was extracted
    established the topic of interest

23
Queries

24
Relevance assessments
  • The topic of interest for a query was the idea
    identified by the corresponding key term
  • Relevant documents directly addressed this same
    topic
  • Example:
  • Query: force feedback
  • Relevant: Work on providing a sense of touch in
    VR applications or other computer simulations

25
Retrieval interface
  • Meta-interface
  • Queried both systems
  • Used top 10 search results from each system
  • Integrated all 20 search results
  • Presented them in random order
  • No way to determine the source of a retrieved
    document
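
The blinding procedure can be sketched roughly as follows. The function name, the source labels, and the sample result lists are hypothetical, and the slides do not say how a document returned by both systems was handled; this sketch simply keeps both copies.

```python
import random

def blind_merge(results_a, results_b, k=10, seed=None):
    """Pool the top-k results of two systems and shuffle them.

    Returns the shuffled document list shown to the assessor and a parallel
    list of source labels that stays hidden until judging is finished.
    """
    pooled = [(doc, "A") for doc in results_a[:k]]
    pooled += [(doc, "B") for doc in results_b[:k]]
    random.Random(seed).shuffle(pooled)
    shown = [doc for doc, _ in pooled]
    hidden_sources = [source for _, source in pooled]
    return shown, hidden_sources

# Hypothetical ranked result lists from the two systems.
system_a = [f"a{i}" for i in range(12)]
system_b = [f"b{i}" for i in range(12)]
shown, hidden_sources = blind_merge(system_a, system_b, k=10, seed=42)
print(len(shown))  # 20
```

Keeping the source labels in a separate list lets relevance judgments be attributed to each system afterwards without exposing the origin during assessment.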

26
Experimental summary
  • 32 queries drawn from document key terms
  • Document identified the topic of interest
  • Relevant documents addressed the same topic
  • Used a meta-search interface
  • Evaluated top 10 from both systems
  • Origin of search results hidden

27
Precision at top 10
  • On average RDI provided a 16.6% improvement over
    TFIDF/Cosine
  • 1 or 2 more relevant documents in the top 10
  • Result is significant
  • t-test of the mean paired difference
  • Test statistic: 3.227
  • Significant at a confidence level of 99.5%
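
For reference, the test statistic of a paired t-test is the mean of the per-query differences divided by its standard error. The sketch below shows that computation on hypothetical per-query counts of relevant documents in the top 10; it does not reproduce the study's data or its 3.227 result.

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Return t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = a - b."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical relevant-in-top-10 counts for five queries (not the study's data).
rdi_relevant = [6, 5, 7, 4, 8]
baseline_relevant = [5, 5, 5, 3, 6]
t = paired_t_statistic(rdi_relevant, baseline_relevant)
print(round(t, 3))  # 3.207
```

The statistic is compared against a t distribution with n - 1 degrees of freedom, which would be 31 if all 32 queries entered the test.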

28
Precision at top 10 (cont'd)

29
Many retrieval errors avoided
  • Example: software architecture diagrams
  • Most papers about software architecture
    frequently use the term "diagrams"
  • Few are about tools for diagramming
  • TFIDF/Cosine system -- 0/10 relevant
  • Rosetta -- 4/10 relevant (3 in top 5)
  • Rosetta made the correct distinction more often

30
Rosetta Shortcomings
  • Retrieval metric sorts search results by number
    of query terms matched
  • Some authors reuse portions of text in which
    other documents are cited

31
Impact of search results
  • A look at the number of citations to documents
    retrieved for each query
  • Compared RDI to a baseline provided by the
    TFIDF/Cosine system
  • TFIDF/Cosine includes no measure of impact
  • Seeking only a measure of the relative impact of
    documents retrieved by RDI on a given topic

32
Experiment
  • For each query
  • Calculated the average citations/year for each
    document
  • Average publication year for Rosetta: 1994
  • Average publication year for TFIDF/Cosine: 1995
  • Found the median number of citations/year for
    each set of search results
  • Found the difference between the median for
    Rosetta and the median for TFIDF/Cosine
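
The per-query calculation can be sketched as below. How citations/year was normalized is not stated in the slides, so dividing by the years elapsed since publication (at least 1), the reference year, and the sample data are all assumptions.

```python
from statistics import median

def citations_per_year(citations, publication_year, current_year):
    """Average citations per year since publication, counting at least one year."""
    return citations / max(current_year - publication_year, 1)

def median_impact_difference(rosetta_docs, baseline_docs, current_year):
    """Median citations/year of one result set minus that of the other.

    Each result set is a list of (citation_count, publication_year) pairs
    for the documents retrieved for a single query.
    """
    def result_median(docs):
        return median(citations_per_year(c, y, current_year) for c, y in docs)
    return result_median(rosetta_docs) - result_median(baseline_docs)

# Hypothetical result sets for one query, with 2003 as the assumed reference year.
rosetta_docs = [(90, 1993), (20, 1998), (8, 1999)]
baseline_docs = [(10, 1993), (6, 2000), (3, 2002)]
diff = median_impact_difference(rosetta_docs, baseline_docs, current_year=2003)
print(diff)  # 2.0
```

Using the median rather than the mean keeps a single heavily cited paper from dominating the figure for a query.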

33
Difference in impact
  • On average, the median citations/year was:
  • 8.9 for Rosetta
  • 1.5 for the baseline

34
Difference in impact (cont'd)

35
Summary of Experiments
  • Small study; results are tentative
  • Surpassed retrieval precision of a widely used
    relevance-based approach
  • Consistently retrieved documents that have had a
    significant impact

36
Future Work
  • Retrieval metric that eliminates Boolean
    component
  • Large scale implementation with CiteSeer data
  • Studies with more sophisticated relevance-based
    retrieval systems
  • Comparison with popularity-based retrieval
    techniques

37
Contact
  • Shannon Bradshaw
  • The University of Iowa
  • shannon-bradshaw_at_uiowa.edu
  • www.biz.uiowa.edu/sbradshaw