K-tree/forest: Efficient Indexes for Boolean Queries

Provided by: scie231

Learn more at: https://www2.cs.uh.edu

Transcript and Presenter's Notes

1
K-tree/forest: Efficient Indexes for Boolean Queries

2
Boolean queries

3
Existing solutions: query "Bob and Alice"

Inverted file
Retrieve the inverted list (on disk) for Bob
Retrieve the inverted list for Alice
Merge the lists to compute the intersection, or
for AND only, retrieve the shorter list and scan the documents (disk I/Os saved? at the expense of CPU time)
Google times: query "Bob" 0.11 s, "Alice" 0.1 s, "Bob and Alice" 0.2 s
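
The two inverted-file strategies above can be sketched as follows. This is a minimal in-memory illustration, not the presenters' implementation: the names intersect_merge and intersect_scan, the toy docs collection, and the use of Python lists in place of on-disk inverted lists are all assumptions.

```python
def intersect_merge(list_a, list_b):
    """Intersect two sorted doc-ID lists with a two-pointer merge."""
    i, j, out = 0, 0, []
    while i < len(list_a) and j < len(list_b):
        if list_a[i] == list_b[j]:
            out.append(list_a[i])
            i += 1
            j += 1
        elif list_a[i] < list_b[j]:
            i += 1
        else:
            j += 1
    return out


def intersect_scan(short_list, docs, other_term):
    """Fetch only the shorter list, then scan each listed document for the other term."""
    return [doc_id for doc_id in short_list if other_term in docs[doc_id]]


# Hypothetical toy collection: doc ID -> set of terms in that document.
docs = {1: {"bob"}, 2: {"alice", "bob"}, 3: {"alice"}, 4: {"alice", "bob", "carol"}}
# Inverted lists (sorted doc IDs); on a real system these would live on disk.
inverted = {"bob": [1, 2, 4], "alice": [2, 3, 4]}

merged = intersect_merge(inverted["bob"], inverted["alice"])
scanned = intersect_scan(inverted["bob"], docs, "alice")
```

The merge variant reads both lists; the scan variant reads one list but touches every document on it, which is the I/O versus CPU tradeoff the slide raises.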

4
Existing solutions: query "Bob and Alice"

Build a secondary index on the inverted lists
Retrieve the secondary index on Bob's list from disk (assuming the secondary index on Bob's list is smaller)
Search for Alice in the secondary index
Retrieve the documents
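
The slide does not say how the secondary index is organized, so the sketch below assumes one plausible layout: for a term's inverted list, the secondary index maps every co-occurring term to the doc IDs on that list that contain it. The names build_secondary_index and and_query are hypothetical, and the index is built on demand here, whereas the slide implies it is prebuilt and stored on disk.

```python
from collections import defaultdict


def build_secondary_index(term, inverted, docs):
    """Map each co-occurring term to the doc IDs on `term`'s inverted list that contain it."""
    secondary = defaultdict(list)
    for doc_id in inverted[term]:
        for other in docs[doc_id]:
            if other != term:
                secondary[other].append(doc_id)
    return dict(secondary)


def and_query(term_a, term_b, inverted, docs):
    """Answer `term_a AND term_b` via the secondary index of the shorter list."""
    # Assumption: the shorter inverted list also has the smaller secondary index.
    primary, other = sorted((term_a, term_b), key=lambda t: len(inverted[t]))
    secondary = build_secondary_index(primary, inverted, docs)
    return secondary.get(other, [])


# Hypothetical toy collection.
docs = {1: {"bob"}, 2: {"alice", "bob"}, 3: {"alice"}, 4: {"alice", "bob"}, 5: {"alice"}}
inverted = {"bob": [1, 2, 4], "alice": [2, 3, 4, 5]}

result = and_query("bob", "alice", inverted, docs)
```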

5
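K-tree (leaves point to lists on disk)
[Figure: K-tree for the keywords Alice and Bob. The root tests Alice (branches 0 and 1), each child tests Bob (branches 0 and 1), and the four leaves point to lists on disk.]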
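
A minimal sketch of the K-tree idea, under stated assumptions: each root-to-leaf path is a tuple of 0/1 bits (one per indexed keyword), and the leaf holds the documents that contain exactly the keywords whose bit is 1. The tree is flattened into a dictionary keyed by path, and build_ktree and query are hypothetical names; the slides do not give the actual layout.

```python
from itertools import product


def build_ktree(keywords, docs):
    """Return leaves keyed by root-to-leaf path; bit i is 1 if the doc contains keywords[i]."""
    leaves = {path: [] for path in product((0, 1), repeat=len(keywords))}
    for doc_id in sorted(docs):
        path = tuple(int(k in docs[doc_id]) for k in keywords)
        leaves[path].append(doc_id)
    return leaves


def query(leaves, keywords, required=(), excluded=()):
    """AND / AND-NOT query: concatenate the leaves whose path satisfies the constraints."""
    result = []
    for path, doc_ids in leaves.items():
        bits = dict(zip(keywords, path))
        if all(bits[k] == 1 for k in required) and all(bits[k] == 0 for k in excluded):
            result.extend(doc_ids)
    return sorted(result)


# Hypothetical toy collection.
docs = {1: {"bob"}, 2: {"alice", "bob"}, 3: {"alice"}, 4: {"alice", "bob", "carol"}}
keywords = ["alice", "bob"]
leaves = build_ktree(keywords, docs)

both = query(leaves, keywords, required=("alice", "bob"))
alice_not_bob = query(leaves, keywords, required=("alice",), excluded=("bob",))
```

Because the leaves are disjoint, a query that constrains every keyword reads a single leaf, and a shorter query reads several leaves that can be fetched independently.
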
6
Experiments

7
Results for single-word query

Method                                       I/Os
BST-based inverted file                      31.26
K-tree (parallel)                            25.36
K-tree (sequential)                          37.05
K-tree (sequential with no fragmentation)    31.26

Note: index in memory, inverted lists on disk for all methods. Results are averages over all possible queries of the type listed before.

8
Results for 2-word AND query

Method                                       I/Os
BST-based inverted file (merge)              62.52
BST-based inverted file (scan)               10.13
K-tree (parallel)                             0.57
K-tree (sequential)                           0.77
K-tree (sequential with no fragmentation)     0.61

Note: index in memory, inverted lists on disk for all methods. Results are averages over all possible queries of the type listed before.

9
K-forest

Tradeoff: size of K-forest vs. post-processing
In general, choose the subset size s such that C(K,s) * 2^s < available memory. K can be reduced by standard techniques and by considering frequency.
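
The sizing rule can be checked numerically. The sketch below assumes the memory budget is expressed as a number of leaf entries (the slide gives no unit) and uses hypothetical function names. Since C(K,s) * 2^s is not monotonic in s, the search stops at the first s that no longer fits, which is a conservative choice.

```python
from math import comb


def forest_leaf_count(K, s):
    """Total leaves in a K-forest with one 2^s-leaf tree per s-subset of K keywords."""
    return comb(K, s) * 2 ** s


def largest_subset_size(K, budget):
    """Largest s such that every subset size from 1 to s stays strictly under the budget."""
    best = 0
    for s in range(1, K + 1):
        if forest_leaf_count(K, s) >= budget:
            break  # stop at the first size that exceeds the budget
        best = s
    return best


# Example: K = 10 keywords and a budget of 1000 leaf entries (both values are illustrative).
sizes = [forest_leaf_count(10, s) for s in (1, 2, 3, 4)]
chosen = largest_subset_size(10, 1000)
```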

[Figure: index on subsets of size 3, pointing to K-trees for 3 keywords]
10
K-tree highlights

Advantages
AND/BUT queries: no post-processing
OR queries: require some K-tree traversal
Easy to implement
Easy to parallelize, especially for shorter AND/AND-NOT queries and all OR queries
Disadvantage
Size 2^K for K keywords, but this is overkill since user queries are typically short (over 90% of queries contain at most 5 keywords). Queries with 10 or more keywords are very rare.

11
Conclusions and Future Work

We have presented efficient structures (K-tree/forest) for Boolean queries
One direction is to run more experiments using, for example, TREC collections
Another direction is to study how document characteristics can help in choosing the right set of keywords to include in these structures
