Dynamic Index Pruning for Effective Caching
Yohannes Tsegay, Andrew Turpin, Justin Zobel
School of Computer Science and IT, RMIT University, Melbourne, Australia
Introduction
  • Information retrieval systems make use of an
    inverted index to allow efficient query
    resolution.
  • An inverted index is built of two main
    components: a vocabulary of all the unique terms
    in the collection and, for each unique term, an
    inverted list of the documents in which that term
    occurs.
  • The size of inverted lists grows in proportion
    to the size of the collection.
  • On average, larger collections require more
    computation time and resources to evaluate a
    query.
  • A significant fraction of the time spent
    evaluating queries with an inverted index goes to
    fetching inverted lists from disk.
  • To reduce the amount of time spent accessing
    disk, search engines use inverted list caching.
  • Caching large inverted lists can be
    counterproductive: other cache items are evicted
    to make room for the large lists, resulting in a
    large number of cache misses.
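The two components described above can be sketched in a few lines; the toy collection and tokenization below are illustrative assumptions, not the authors' implementation:

```python
from collections import defaultdict

# Toy collection (hypothetical): document id -> text.
docs = {
    1: "the cat sat",
    2: "the dog sat",
    3: "dog chases cat",
}

# Build the two components of an inverted index:
# a vocabulary of unique terms, and one inverted list per term.
index = defaultdict(list)                   # term -> list of doc ids
for doc_id, text in sorted(docs.items()):
    for term in sorted(set(text.split())):  # unique terms in this document
        index[term].append(doc_id)

vocabulary = sorted(index)
```

Here `index["dog"]` is `[2, 3]`: the inverted list records exactly the documents in which the term occurs.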
Gradual stop
  • Process blocks in decreasing order of impact
    weight.
  • Add a new document to the accumulator set A only
    if its impact weight is at least min_score_doc(A),
    the lowest score currently in A.
  • To ensure documents in A are in the correct
    rank, continue processing further impact blocks,
    but do not add new documents; only update
    accumulators already in A.

[Figure: example query t1, t2, t3. Impact blocks per
term (t1: 5, 4, 2, 1; t2: 4, 3, 1; t3: 5, 2, 1) feed
the accumulators A; min_score_doc(A) marks the
admission threshold.]

  • Stop processing further blocks if, after
    processing a large number of documents, the order
    of documents in A does not change. In this case A
    is said to have become stable.
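A minimal sketch of the gradual-stop idea, under simplifying assumptions: all terms' impact blocks are merged into one globally impact-sorted stream, the admission threshold min_score_doc(A) only applies once A holds k documents, and stability is declared after the ranking survives a fixed window of postings. The impacts mirror the figure; the document ids and parameter values are hypothetical.

```python
def gradual_stop(postings_per_term, k=3, stable_window=4):
    """Rank documents with a simplified gradual-stop heuristic.

    postings_per_term: {term: [(impact, doc_id), ...]} with each term's
    blocks already sorted by decreasing impact.
    """
    # Merge every term's impact blocks into one decreasing-impact stream.
    stream = sorted(
        (p for plist in postings_per_term.values() for p in plist),
        key=lambda p: -p[0])

    A = {}              # accumulators: doc_id -> partial score
    adding = True       # phase 1: new documents may still enter A
    unchanged = 0
    prev_order = []
    for impact, doc in stream:
        if doc in A:
            A[doc] += impact
        elif adding:
            min_score = min(A.values()) if len(A) >= k else 0
            if impact >= min_score:  # admit: impact >= min_score_doc(A)
                A[doc] = impact
            else:
                adding = False       # phase 2: only update accumulators in A
        order = sorted(A, key=lambda d: -A[d])
        if order == prev_order:
            unchanged += 1
            if unchanged >= stable_window:
                break                # A is stable: stop processing blocks
        else:
            unchanged = 0
        prev_order = order
    return sorted(A, key=lambda d: -A[d])

# Impacts as in the figure; doc ids are made up for illustration.
postings = {
    "t1": [(5, 1), (4, 2), (2, 3), (1, 4)],
    "t2": [(4, 1), (3, 3), (1, 2)],
    "t3": [(5, 2), (2, 1), (1, 3)],
}
ranked = gradual_stop(postings)
```

Running this on the example admits documents while their impacts clear the threshold, then freezes membership of A and stops once the top-k order stops changing.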

[Figure: impact-sorted index layout. The inverted
list for "dog" is stored in blocks, each containing
equi-impact documents (impacts 7, 5, 4, 2), with
blocks sorted in decreasing order of impact.]

[Figure: query evaluation without caching vs. with
caching. The query evaluator fetches inverted lists
from the on-disk index directly, or through an
inverted list cache.]
Contributions
  • 1. A scheme to cache dynamically pruned inverted
    lists instead of full inverted lists. This not
    only increases the number of items maintained in
    cache, it also reduces the number of accumulators
    used to process a query.
  • 2. Gradual stop, a method for dynamically
    pruning inverted lists based on impacts.
  • 3. Cost-aware cache eviction policies which take
    into consideration the cost of reading an
    inverted list into cache.

Test data
  • We used two large TREC collections, WT100g (100
    GB) and GOV2 (425 GB).
  • Each collection has a corresponding query log: 2
    million queries from the MSN search engine and 2
    million queries from the Excite query log.
  • The last one million queries from each log were
    run against the corresponding collection using
    Gradual stop. The inverted lists used to process
    those queries were cached, and term hits and byte
    hits were counted.
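A hedged sketch of how term hits and byte hits might be tallied; the LRU cache, trace, and sizes below are illustrative stand-ins, not the experimental setup:

```python
from collections import OrderedDict

def simulate(trace, capacity_bytes):
    """Count term hits and byte hits for an LRU inverted-list cache.

    trace: iterable of (term, list_size_bytes) query-processing events.
    """
    cache = OrderedDict()  # term -> list size, in LRU order
    used = term_hits = byte_hits = 0
    total_terms = total_bytes = 0
    for term, size in trace:
        total_terms += 1
        total_bytes += size
        if term in cache:
            term_hits += 1
            byte_hits += size
            cache.move_to_end(term)  # refresh recency
        else:
            # Evict least-recently-used lists until the new one fits.
            while used + size > capacity_bytes and cache:
                _, evicted_size = cache.popitem(last=False)
                used -= evicted_size
            if size <= capacity_bytes:
                cache[term] = size
                used += size
    return term_hits / total_terms, byte_hits / total_bytes

# Hypothetical trace: (term, inverted list size in bytes).
trace = [("dog", 40), ("cat", 30), ("dog", 40), ("fish", 60), ("dog", 40)]
term_rate, byte_rate = simulate(trace, capacity_bytes=100)
```

On this toy trace, two of five term lookups hit the cache, so the term hit rate (0.4) and byte hit rate differ because the lists have unequal sizes.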

Results
1. How did the proposed pruning method perform?
2. How did caching pruned lists perform?
Cost-aware cache eviction policies
  • Current cache eviction policies make use of the
    recency and access frequency of items in cache.
  • Suppose two items of the same recency and access
    frequency are about to be evicted, but one costs
    twice as much to read from disk as the other:
    which should we keep to reduce disk access?
  • The cost of re-reading an inverted list should
    be considered before evicting it from cache.
  • Cost per byte
  • Greedy Dual Size with Frequency (GDSF)

[Figure: term hit rate, the percentage of inverted
lists found in cache out of the total number of
lists processed, for the GOV2-MSN data set.]
where
  d_i = disk access cost of inverted list i
  s_i = size of the inverted list
  t_i = age of the inverted list in cache
  f_i = number of times the list was accessed while in cache
  L   = cost of the last item evicted (initially 0)
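In their standard forms, cost per byte ranks lists by d_i / s_i, and Greedy Dual Size with Frequency (in Cherkasova's formulation) assigns each cached list the priority H_i = L + f_i * d_i / s_i, evicts the list with the lowest H_i, and advances L to that priority. Below is a minimal sketch of GDSF victim selection with hypothetical costs and sizes; the poster's exact variant may differ:

```python
def gdsf_evict(cache, L):
    """Pick the GDSF victim: evict the list with the lowest priority
    H_i = L + f_i * d_i / s_i, then advance the inflation value L.

    cache: {term: (d, s, f)} with d = disk access cost, s = list size,
    f = accesses while in cache; L = cost of the last item evicted.
    """
    def priority(term):
        d, s, f = cache[term]
        return L + f * d / s

    victim = min(cache, key=priority)
    new_L = priority(victim)
    del cache[victim]
    return victim, new_L

# Hypothetical cache contents: term -> (disk cost, size, frequency).
cache = {"dog": (8.0, 4.0, 1), "cat": (2.0, 4.0, 2), "fish": (6.0, 2.0, 1)}
victim, L = gdsf_evict(cache, L=0.0)
```

Here "cat" has the lowest priority (0 + 2 * 2.0 / 4.0 = 1.0), so it is evicted even though it was accessed most often: it is cheap to re-read relative to the space it occupies.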
Acknowledgments
Thanks to Microsoft for providing the MSN query
log. This research was supported by the Australian
Research Council.
CIKM, November 2007