1
Incremental Updates of Inverted Indices
  • Jefferson Heard
  • Wai Gen Yee, advisor
  • Illinois Institute of Technology
  • 2004

2
Inverted Indices
  • Traditionally, an inverted index must be rebuilt
    entirely whenever it is updated.
  • This can be very expensive now that realistic
    collections range up into the terabytes.
  • It would be much more convenient to be able to
    dynamically add documents to and delete documents
    from an index (a minimal interface is sketched below).
3
Convenience Detailed
  • Incremental updates open up new possibilities
  • Embedded retrieval systems with evolving
    repositories
  • Peer-to-peer systems blended with traditional IR
    for greater search efficiency and effectiveness
  • They also allow for tradeoffs in efficiency
    elsewhere in the indexing process
  • More complex parsing, analysis, and clustering can
    be done if an index isn't static.

4
Early Work
  • Some early work was done in 1990 by Cutting and
    Pedersen, involving storing indices as B-Trees.
  • An entire index was stored as one large B-Tree,
    with each posting list stored sequentially in the
    tree.
  • The index this created was very large and
    unwieldy, and is no longer practical for today's
    applications.

5
Posting Lists Revisited
  • A posting list is typically stored as a series of
    document-ids and term-frequencies. On disk, it
    is best if these are compressed.
  • In most dynamic systems, documents may not only
    be added, they may change or be deleted.
  • Very long lists can be very expensive to update:
    an update causes a sequential decompression up to
    the point of insertion and a recompression/copy of
    the rest of the list (sketched below).
  • Dozens, hundreds, or potentially thousands of
    posting lists are updated just to add a single
    typical document.
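
As an illustration of the cost described above, the sketch below
inserts one posting into a gap-encoded posting list held as a Python
list of (doc-id gap, TF) pairs. The representation and the function
name are assumptions made for illustration; the point is that the list
must be walked up to the insertion point and the tail rewritten.

def insert_posting(gaps, new_doc_id, new_tf):
    """Insert a posting into a gap-encoded list (assumes new_doc_id is absent).

    `gaps` is a list of (doc_id_gap, tf) pairs, each gap being the
    difference from the previous doc id.
    """
    out, doc_id, i = [], 0, 0
    # Sequential "decompression" up to the insertion point.
    for i, (gap, tf) in enumerate(gaps):
        if doc_id + gap > new_doc_id:
            break
        doc_id += gap
        out.append((gap, tf))
    else:
        i = len(gaps)
    # Splice in the new posting, then rewrite ("recompress") the tail.
    out.append((new_doc_id - doc_id, new_tf))
    if i < len(gaps):
        old_gap, old_tf = gaps[i]
        out.append((doc_id + old_gap - new_doc_id, old_tf))
        out.extend(gaps[i + 1:])
    return out

For example, inserting doc 10 into the list for docs 5, 8, 15, i.e.
insert_posting([(5, 2), (3, 1), (7, 4)], 10, 9), returns
[(5, 2), (3, 1), (2, 9), (5, 4)].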

6
The (Possible) Solution
  • Store each posting list as a B-Tree (a blocked
    sketch follows this list).
  • Seeking to the update point then takes only
    logarithmic time.
  • Each update only affects the page it is made in,
    rather than the whole list.
  • Using a B-Tree ensures that a posting list can be
    retrieved for IR purposes as a sequential
    operation, without consulting the tree portion.
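
A minimal in-memory sketch of the blocking idea: fixed-size blocks
("pages") plus a sorted array of separator keys standing in for the
B-Tree's internal nodes. A real implementation would keep both on
disk; the class name, BLOCK_SIZE, and methods are illustrative
assumptions.

import bisect

BLOCK_SIZE = 4  # tiny for illustration; a real page holds many postings

class BlockedPostingList:
    """Posting list split into blocks so an update touches one block only."""

    def __init__(self):
        self.blocks = [[]]       # each block: sorted list of (doc_id, tf)
        self.first_ids = [0]     # separator keys, one per block

    def insert(self, doc_id, tf):
        # Logarithmic search for the right block (the B-Tree's job on disk).
        k = max(0, bisect.bisect_right(self.first_ids, doc_id) - 1)
        block = self.blocks[k]
        bisect.insort(block, (doc_id, tf))   # only this block is modified
        if len(block) > BLOCK_SIZE:          # "page split", as in a B-Tree
            mid = len(block) // 2
            self.blocks[k:k + 1] = [block[:mid], block[mid:]]
            self.first_ids[k:k + 1] = [block[0][0], block[mid][0]]

    def scan(self):
        # Retrieval for IR remains a sequential walk over the leaf blocks.
        for block in self.blocks:
            yield from block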

7
Posting List Structure
[Figure: a posting-list B-Tree. The root node holds Doc ID keys
(1024, 2048); each child node stores Doc IDs as offsets from its root
key (0, 400, 800); the leaf blocks hold (Doc ID offset, TF) pairs such
as (0,3), (0,7), (0,8), (0,4), (0,1), (0,10), ...]
In reality, each DocID/TF pair is stored as an offset from the
previous DocID/TF pair and Elias encoded for space efficiency (an
encoding sketch follows).
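
A sketch of the encoding the note above describes, assuming the Elias
gamma code (the slides do not say which Elias variant is used) and
plain bit strings rather than packed bytes; the +1 shift is an
assumption so that a first doc id of 0 remains encodable.

def elias_gamma(n):
    """Elias gamma code of a positive integer n, as a bit string."""
    assert n >= 1
    binary = bin(n)[2:]                        # e.g. 9 -> '1001'
    return '0' * (len(binary) - 1) + binary    # e.g. 9 -> '0001001'

def encode_postings(postings):
    """Gap- and gamma-encode a sorted list of (doc_id, tf) pairs."""
    bits, prev = [], 0
    for doc_id, tf in postings:
        bits.append(elias_gamma(doc_id - prev + 1))  # doc id gap, shifted by 1
        bits.append(elias_gamma(tf))                 # term frequency (>= 1)
        prev = doc_id
    return ''.join(bits)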
8
Advantages / Avenues of Research
  • Updates
  • Updates can be prioritized and journaled, so they
    cause minimal downtime.
  • Topdoc as a retrieval-speed enhancement
  • By sorting blocks on disk by their highest TF-IDF
    score, we can guarantee that at most N blocks must
    be retrieved to answer an N-Topdoc query (sketched
    after this list).
  • Splitting
  • Indices stored this way should be splittable,
    allowing them to grow across nodes in a
    distributed environment.
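
A sketch of the Topdoc guarantee, assuming each block is a list of
(TF-IDF score, doc id) pairs and the blocks are already sorted in
descending order of their highest score. Every posting beyond the
first N blocks is dominated by the maximum of each of those N blocks,
so scanning N blocks is enough; the function name and layout are
illustrative assumptions.

import heapq

def top_n_docs(blocks, n):
    """Return the n highest-scoring (score, doc_id) postings.

    `blocks` must be ordered by descending maximum score, which is what
    bounds the scan at n blocks.
    """
    candidates = [posting for block in blocks[:n] for posting in block]
    return heapq.nlargest(n, candidates)

For example, with blocks = [[(0.9, 3), (0.4, 7)], [(0.8, 1)], [(0.2, 5)]],
top_n_docs(blocks, 2) returns [(0.9, 3), (0.8, 1)] after reading only
two blocks.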

9
Disadvantages / Avenues of Research
  • Space concerns
  • B-Trees waste up to 33% of their space. Is this
    acceptable, or can it be alleviated?
  • How much space overhead is it to store the tree
    portion of the lists?
  • How should cache be used, since there are as many
    trees as there are terms?
  • Speed concerns
  • How much does the wasted space affect retrieval
    time for whole posting lists?
  • How much real time does it take to update a
    document, given that the number of unique terms in
    a document grows roughly logarithmically with its
    length?

10
Alleviating Some of the Problems
  • Compression
  • Storing offsets from the parent keys in child
    nodes reduces the space needed at each level of
    the tree (sketched below).
  • Compressing inside each block increases the
    amount we can pack into the block.
  • If a collection becomes static, sorting the
    blocks by fill amount and then run-length encoding
    them blockwise may compress even better. How much
    better, on average?
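
A sketch of the parent-key-offset idea from the first bullet, using
the numbers from the structure diagram on slide 7; the helper names
are illustrative assumptions.

def keys_to_offsets(parent_key, child_keys):
    # Offsets are much smaller numbers than absolute doc ids, so they
    # compress better with a variable-length code at every tree level.
    return [k - parent_key for k in child_keys]

def keys_from_offsets(parent_key, offsets):
    # Absolute doc ids are recovered on the way back down the tree.
    return [parent_key + o for o in offsets]

# keys_to_offsets(1024, [1024, 1424, 1824]) -> [0, 400, 800]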

11
Alleviating Some of the Problems
  • Journalling/Caching
  • By retrieving blocks from disk, updating them in
    memory, and then committing upon a full cache or
    during inactivity, we can ensure faster updates (a
    write-back sketch follows this list).
  • Better B-Trees
  • Exploring variants on the B-Tree, like dancing
    trees, or tuning the design more closely to the
    problem, may yield better compression and update
    times.
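
A sketch of the journalling/caching scheme, assuming a storage object
exposing read_block/write_block methods; the class name, capacity, and
the whole-cache eviction policy are illustrative choices, not details
from the slides.

class WriteBackBlockCache:
    """Update blocks in memory; write them back on a full cache or when idle."""

    def __init__(self, storage, capacity=1024):
        self.storage = storage   # assumed interface: read_block / write_block
        self.capacity = capacity
        self.cache = {}          # block_id -> block contents
        self.dirty = set()       # block ids modified since the last commit

    def update(self, block_id, apply_change):
        if block_id not in self.cache:
            if len(self.cache) >= self.capacity:
                self.commit()    # cache is full: flush before reading more
            self.cache[block_id] = self.storage.read_block(block_id)
        apply_change(self.cache[block_id])
        self.dirty.add(block_id)

    def commit(self):
        # Also called explicitly during inactivity.
        for block_id in self.dirty:
            self.storage.write_block(block_id, self.cache[block_id])
        self.dirty.clear()
        self.cache.clear()       # simplistic eviction for the sketch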

12
References
  • Doug Cutting and Jan Pedersen. Optimizations for
    Dynamic Inverted Index Maintenance. SIGIR, 1990.
  • Douglas Comer. The Ubiquitous B-Tree. ACM
    Computing Surveys, 1979.