1
Incremental Updates of Inverted Indices
  • Jefferson Heard
  • Wai Gen Yee, advisor
  • Illinois Institute of Technology
  • 2004

2
Inverted Indices
  • Traditionally, an inverted index must be rebuilt
    entirely whenever it is updated.
  • This can be very expensive now that realistic
    collections range up into the terabytes.
  • It would be much more convenient to be able to
    dynamically add documents to and delete documents
    from an index (a minimal interface is sketched below).
3
Convenience Detailed
  • Incremental updates open up new possibilities
  • Embedded retrieval systems with evolving
    repositories
  • Peer-to-peer systems blended with traditional IR
    for greater search efficiency and effectiveness
  • They also allow for tradeoffs in efficiency
    elsewhere in the indexing process
  • More complex parsing, analysis, and clustering can
    be done if an index isn't static.

4
Early Work
  • Some early work was done in 1990 by Cutting and
    Pedersen, involving storing indices as B-Trees.
  • An entire index was stored as one large B-Tree,
    with each posting list stored sequentially in the
    tree.
  • The index this created was very large and
    unwieldy, and is no longer practical for today's
    applications.

5
Posting Lists Revisited
  • A posting list is typically stored as a series of
    document-ids and term-frequencies. On disk, it
    is best if these are compressed.
  • In most dynamic systems, documents may not only
    be added, they may change or be deleted.
  • Very long lists can be very expensive to update:
    an update causes a sequential decompression up to
    the point of insertion and a recompression/copy of
    the rest of the list (sketched below).
  • Dozens, hundreds, or potentially thousands of
    posting lists are updated just to add a single
    typical document.
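
As an illustration of the cost described above, the sketch below
inserts one posting into a gap-encoded posting list held as a Python
list of (doc-id gap, TF) pairs. The representation and the function
name are assumptions made for illustration; the point is that the list
must be walked up to the insertion point and the tail rewritten.

def insert_posting(gaps, new_doc_id, new_tf):
    """Insert a posting into a gap-encoded list (assumes new_doc_id is absent).

    `gaps` is a list of (doc_id_gap, tf) pairs, each gap being the
    difference from the previous doc id.
    """
    out, doc_id, i = [], 0, 0
    # Sequential "decompression" up to the insertion point.
    for i, (gap, tf) in enumerate(gaps):
        if doc_id + gap > new_doc_id:
            break
        doc_id += gap
        out.append((gap, tf))
    else:
        i = len(gaps)
    # Splice in the new posting, then rewrite ("recompress") the tail.
    out.append((new_doc_id - doc_id, new_tf))
    if i < len(gaps):
        old_gap, old_tf = gaps[i]
        out.append((doc_id + old_gap - new_doc_id, old_tf))
        out.extend(gaps[i + 1:])
    return out

For example, inserting doc 10 into the list for docs 5, 8, 15, i.e.
insert_posting([(5, 2), (3, 1), (7, 4)], 10, 9), returns
[(5, 2), (3, 1), (2, 9), (5, 4)].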

6
The (Possible) Solution
  • Store each posting list as a B-Tree (a blocked
    sketch follows this list).
  • Seeking to the update point then takes only
    logarithmic time.
  • Each update only affects the page it is made in,
    rather than the whole list.
  • Using a B-Tree ensures that a posting list can be
    retrieved for IR purposes as a sequential
    operation, without consulting the tree portion.
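
A minimal in-memory sketch of the blocking idea: fixed-size blocks
("pages") plus a sorted array of separator keys standing in for the
B-Tree's internal nodes. A real implementation would keep both on
disk; the class name, BLOCK_SIZE, and methods are illustrative
assumptions.

import bisect

BLOCK_SIZE = 4  # tiny for illustration; a real page holds many postings

class BlockedPostingList:
    """Posting list split into blocks so an update touches one block only."""

    def __init__(self):
        self.blocks = [[]]       # each block: sorted list of (doc_id, tf)
        self.first_ids = [0]     # separator keys, one per block

    def insert(self, doc_id, tf):
        # Logarithmic search for the right block (the B-Tree's job on disk).
        k = max(0, bisect.bisect_right(self.first_ids, doc_id) - 1)
        block = self.blocks[k]
        bisect.insort(block, (doc_id, tf))   # only this block is modified
        if len(block) > BLOCK_SIZE:          # "page split", as in a B-Tree
            mid = len(block) // 2
            self.blocks[k:k + 1] = [block[:mid], block[mid:]]
            self.first_ids[k:k + 1] = [block[0][0], block[mid][0]]

    def scan(self):
        # Retrieval for IR remains a sequential walk over the leaf blocks.
        for block in self.blocks:
            yield from block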

7
Posting List Structure
[Figure: a posting-list B-Tree. The root node holds Doc ID keys
(1024, 2048); each child node stores Doc IDs as offsets from its root
key (0, 400, 800); the leaf blocks hold (Doc ID offset, TF) pairs such
as (0,3), (0,7), (0,8), (0,4), (0,1), (0,10), ...]
In reality, each DocID/TF pair is stored as an offset from the
previous DocID/TF pair and Elias encoded for space efficiency (an
encoding sketch follows).
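
A sketch of the encoding the note above describes, assuming the Elias
gamma code (the slides do not say which Elias variant is used) and
plain bit strings rather than packed bytes; the +1 shift is an
assumption so that a first doc id of 0 remains encodable.

def elias_gamma(n):
    """Elias gamma code of a positive integer n, as a bit string."""
    assert n >= 1
    binary = bin(n)[2:]                        # e.g. 9 -> '1001'
    return '0' * (len(binary) - 1) + binary    # e.g. 9 -> '0001001'

def encode_postings(postings):
    """Gap- and gamma-encode a sorted list of (doc_id, tf) pairs."""
    bits, prev = [], 0
    for doc_id, tf in postings:
        bits.append(elias_gamma(doc_id - prev + 1))  # doc id gap, shifted by 1
        bits.append(elias_gamma(tf))                 # term frequency (>= 1)
        prev = doc_id
    return ''.join(bits)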
8
Advantages / Avenues of Research
  • Updates
  • Updates can be prioritized and journaled, so they
    cause minimal downtime.
  • Topdoc as a retrieval-speed enhancement
  • By sorting blocks on disk by their highest TF-IDF
    score, we can guarantee that at most N blocks must
    be retrieved to answer an N-Topdoc query (sketched
    after this list).
  • Splitting
  • Indices stored this way should be splittable,
    allowing them to grow across nodes in a
    distributed environment.
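
A sketch of the Topdoc guarantee, assuming each block is a list of
(TF-IDF score, doc id) pairs and the blocks are already sorted in
descending order of their highest score. Every posting beyond the
first N blocks is dominated by the maximum of each of those N blocks,
so scanning N blocks is enough; the function name and layout are
illustrative assumptions.

import heapq

def top_n_docs(blocks, n):
    """Return the n highest-scoring (score, doc_id) postings.

    `blocks` must be ordered by descending maximum score, which is what
    bounds the scan at n blocks.
    """
    candidates = [posting for block in blocks[:n] for posting in block]
    return heapq.nlargest(n, candidates)

For example, with blocks = [[(0.9, 3), (0.4, 7)], [(0.8, 1)], [(0.2, 5)]],
top_n_docs(blocks, 2) returns [(0.9, 3), (0.8, 1)] after reading only
two blocks.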

9
Disadvantages / Avenues of Research
  • Space concerns
  • B-Trees waste up to 33% of their space. Is this
    acceptable, or can it be alleviated?
  • How much space overhead is it to store the tree
    portion of the lists?
  • How should cache be used, since there are as many
    trees as there are terms?
  • Speed concerns
  • How much does the wasted space affect retrieval
    time for whole posting lists?
  • How much real time does it take to update a
    document, given that the number of unique terms in
    a document grows roughly logarithmically with its
    length?

10
Alleviating Some of the Problems
  • Compression
  • Storing offsets from the parent keys in child
    nodes reduces the space needed at each level of
    the tree (sketched below).
  • Compressing inside each block increases the
    amount we can pack into the block.
  • If a collection becomes static, sorting the
    blocks by fill amount and then run-length encoding
    them blockwise may compress even better. How much
    better, on average?
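
A sketch of the parent-key-offset idea from the first bullet, using
the numbers from the structure diagram on slide 7; the helper names
are illustrative assumptions.

def keys_to_offsets(parent_key, child_keys):
    # Offsets are much smaller numbers than absolute doc ids, so they
    # compress better with a variable-length code at every tree level.
    return [k - parent_key for k in child_keys]

def keys_from_offsets(parent_key, offsets):
    # Absolute doc ids are recovered on the way back down the tree.
    return [parent_key + o for o in offsets]

# keys_to_offsets(1024, [1024, 1424, 1824]) -> [0, 400, 800]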

11
Alleviating Some of the Problems
  • Journalling/Caching
  • By retrieving blocks from disk, updating them in
    memory, and then committing upon a full cache or
    during inactivity, we can ensure faster updates (a
    write-back sketch follows this list).
  • Better B-Trees
  • Exploring variants on the B-Tree, like dancing
    trees, or tuning the design more closely to the
    problem, may yield better compression and update
    times.
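
A sketch of the journalling/caching scheme, assuming a storage object
exposing read_block/write_block methods; the class name, capacity, and
the whole-cache eviction policy are illustrative choices, not details
from the slides.

class WriteBackBlockCache:
    """Update blocks in memory; write them back on a full cache or when idle."""

    def __init__(self, storage, capacity=1024):
        self.storage = storage   # assumed interface: read_block / write_block
        self.capacity = capacity
        self.cache = {}          # block_id -> block contents
        self.dirty = set()       # block ids modified since the last commit

    def update(self, block_id, apply_change):
        if block_id not in self.cache:
            if len(self.cache) >= self.capacity:
                self.commit()    # cache is full: flush before reading more
            self.cache[block_id] = self.storage.read_block(block_id)
        apply_change(self.cache[block_id])
        self.dirty.add(block_id)

    def commit(self):
        # Also called explicitly during inactivity.
        for block_id in self.dirty:
            self.storage.write_block(block_id, self.cache[block_id])
        self.dirty.clear()
        self.cache.clear()       # simplistic eviction for the sketch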

12
References
  • Doug Cutting and Jan Pedersen. Optimizations for
    Dynamic Inverted Index Maintenance. SIGIR, 1990.
  • Douglas Comer. The Ubiquitous B-Tree. ACM
    Computing Surveys, 1979.