K-tree/forest: Efficient Indexes for Boolean Queries

Learn more at: https://www2.cs.uh.edu

1
K-tree/forest: Efficient Indexes for Boolean Queries
  • Rakesh M. Verma and Sanjiv Behl
  • University of Houston
  • www.cs.uh.edu/rmverma

2
Boolean queries
  • Alice and Bob -- retrieve documents containing
    both Bob and Alice
  • Alice or Bob -- retrieve documents containing
    either Bob or Alice, or both
  • Alice and not Bob -- retrieve documents containing
    Alice but not Bob
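
The three query types above correspond to set intersection, union, and difference over the documents that contain each keyword. A minimal sketch, assuming a toy in-memory index (the `index` dictionary and the document ids are hypothetical and only for illustration):

```python
# Hypothetical toy index: keyword -> set of ids of documents containing it.
index = {
    "Alice": {1, 2, 4},
    "Bob": {2, 3, 4},
}

def q_and(a, b):
    """Documents containing both a and b."""
    return index[a] & index[b]

def q_or(a, b):
    """Documents containing a, b, or both."""
    return index[a] | index[b]

def q_and_not(a, b):
    """Documents containing a but not b."""
    return index[a] - index[b]

print(q_and("Alice", "Bob"))      # {2, 4}
print(q_or("Alice", "Bob"))       # {1, 2, 3, 4}
print(q_and_not("Alice", "Bob"))  # {1}
```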

3
Existing solutions: query "Bob and Alice"
  • Inverted file
    • Retrieve the inverted list (on disk) for Bob
    • Retrieve the inverted list for Alice
    • Merge the lists to compute the intersection, or
    • For "and" only: retrieve the shorter list and
      scan its documents (disk I/Os saved? at the expense
      of CPU time)
  • Google times for the queries: Bob 0.11s, Alice
    0.1s, Bob and Alice 0.2s
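
The two inverted-file strategies can be sketched as follows. This is an illustration only, assuming sorted lists of document ids and a hypothetical in-memory `docs` mapping standing in for the documents on disk; it is not the implementation used in the experiments.

```python
def intersect_merge(a, b):
    """Merge technique: walk two sorted inverted lists in step."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_scan(shorter, other_word, docs):
    """Scan technique: fetch only the shorter list, then check each of
    its documents for the other keyword (CPU work instead of a second
    list retrieval)."""
    return [d for d in shorter if other_word in docs[d]]

# Hypothetical data: document id -> set of keywords it contains.
docs = {1: {"Alice"}, 2: {"Alice", "Bob"}, 3: {"Bob"},
        4: {"Alice", "Bob"}, 5: {"Bob"}}
alice = [1, 2, 4]     # inverted list for Alice (the shorter one)
bob = [2, 3, 4, 5]    # inverted list for Bob

print(intersect_merge(alice, bob))          # [2, 4]
print(intersect_scan(alice, "Bob", docs))   # [2, 4]
```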

4
Existing solutions: query "Bob and Alice"
  • Build a secondary index on the inverted lists
  • Retrieve the secondary index on Bob's list from disk
    (assuming the secondary index on Bob's list is
    smaller)
  • Search for Alice in the secondary index
  • Retrieve the documents
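
The slide does not say how the secondary index is organized. One possible reading, sketched here under that assumption, is that each keyword's inverted list carries its own small index keyed by co-occurring keyword. All names (`postings`, `docs`, `build_secondary`, `and_query`) are hypothetical.

```python
# Hypothetical data: inverted lists and per-document keyword sets.
postings = {"Alice": [1, 2, 4], "Bob": [2, 3, 4, 5]}
docs = {1: {"Alice"}, 2: {"Alice", "Bob"}, 3: {"Bob"},
        4: {"Alice", "Bob"}, 5: {"Bob"}}

def build_secondary(word):
    """Assumed layout: for the inverted list of `word`, map every other
    keyword to the documents of that list which also contain it."""
    sec = {}
    for d in postings[word]:
        for other in docs[d]:
            if other != word:
                sec.setdefault(other, []).append(d)
    return sec

secondary = {w: build_secondary(w) for w in postings}

def and_query(a, b):
    """Use the smaller of the two secondary indexes, then look up the
    other keyword in it."""
    if len(secondary[a]) <= len(secondary[b]):
        first, other = a, b
    else:
        first, other = b, a
    return secondary[first].get(other, [])

print(and_query("Bob", "Alice"))  # [2, 4]
```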

5
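K-tree (leaves point to lists on disk)
[Figure: binary tree with root Alice; its branches, labeled 0 and 1, each lead to a Bob node, whose branches, labeled 0 and 1, lead to the leaves.]
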
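
A minimal sketch of the structure in the figure, assuming that branch 1 means "keyword present" and branch 0 means "keyword absent", so that each leaf lists the documents matching exactly one presence pattern. For brevity the tree is stored as a dictionary keyed by root-to-leaf path rather than as explicit nodes; the function names and sample data are hypothetical.

```python
from itertools import product

def build_ktree(keywords, docs):
    """Map each 0/1 root-to-leaf path to the ids of the documents whose
    keyword presence pattern is exactly that path."""
    leaves = {bits: [] for bits in product((0, 1), repeat=len(keywords))}
    for doc_id, words in sorted(docs.items()):
        path = tuple(int(k in words) for k in keywords)
        leaves[path].append(doc_id)
    return leaves

def query(leaves, pattern):
    """pattern has one entry per keyword: 1 (must be present),
    0 (must be absent) or None (either)."""
    out = []
    for path, ids in leaves.items():
        if all(p is None or p == b for p, b in zip(pattern, path)):
            out.extend(ids)
    return sorted(out)

# Hypothetical data: document id -> set of keywords it contains.
docs = {1: {"Alice"}, 2: {"Alice", "Bob"}, 3: {"Bob"},
        4: {"Alice", "Bob"}, 5: {"Bob"}}
leaves = build_ktree(("Alice", "Bob"), docs)

print(leaves[(1, 1)])             # Alice and Bob      -> [2, 4]
print(leaves[(1, 0)])             # Alice and not Bob  -> [1]
print(query(leaves, (1, None)))   # Alice alone        -> [1, 2, 4]
```

Under this reading, an and/and-not query over all keywords reads a single leaf, while a single-keyword query has to combine several leaves.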
6
Experiments
  • Data
    • 1 million word documents divided into pages of
      100 words each
    • Pages indexed by the keywords they contain
  • Methods
    • BST-based inverted file using merge or scan
      technique
    • K-tree
  • Queries of type
    • Single keyword
    • Two keywords and/and-not

7
Results for single-word queries
  Method                                       I/Os
  BST-based inverted file                      31.26
  K-tree (parallel)                            25.36
  K-tree (sequential)                          37.05
  K-tree (sequential with no fragmentation)    31.26
  • Note: index in memory, inverted lists on disk for
    all methods. Results are averages over all
    possible queries of the type listed before.

8
Results for two-word "and" queries
  Method                                       I/Os
  BST-based inverted file (merge)              62.52
  BST-based inverted file (scan)               10.13
  K-tree (parallel)                             0.57
  K-tree (sequential)                           0.77
  K-tree (sequential with no fragmentation)     0.61
  • Note: index in memory, inverted lists on disk
    for all methods. Results are averages over all
    possible queries of the type listed before.

9
K-forest
  • Tradeoff: size of K-forest vs. post-processing
  • In general, choose the subset size s such that
    C(K, s) · 2^s < available memory. K can be reduced by
    standard techniques and by considering frequency.
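
The sizing rule above can be checked numerically. The sketch below assumes the memory budget is expressed as a number of leaves (one K-tree with 2^s leaves for each of the C(K, s) keyword subsets); the function names and the example budget are hypothetical. Because C(K, s) · 2^s is not monotonic in s, every candidate s is tested rather than stopping at the first one that does not fit.

```python
from math import comb

def forest_leaves(K, s):
    """Total leaves in a K-forest: one K-tree with 2**s leaves for
    each of the C(K, s) keyword subsets of size s."""
    return comb(K, s) * 2 ** s

def largest_subset_size(K, budget):
    """Largest s in 1..K with C(K, s) * 2**s < budget, or 0 if none."""
    fitting = [s for s in range(1, K + 1) if forest_leaves(K, s) < budget]
    return max(fitting, default=0)

print(forest_leaves(20, 3))                 # 9120
print(largest_subset_size(20, 1_000_000))   # 5
```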

[Figure: an index on keyword subsets of size 3 pointing to K-trees for 3 keywords.]
10
K-tree highlights
  • Advantages
    • And/But queries: no post-processing
    • Or queries require some K-tree traversal
    • Easy to implement
    • Easy to parallelize, especially for shorter
      and/and-not queries and all or queries
  • Disadvantage
    • Size 2^K for K keywords, but this is
      overkill since user queries are typically short
      (over 90% of queries contain at most 5 keywords).
      Queries with 10 or more keywords are very rare.

11
Conclusions and Future Work
  • We have presented efficient structures
    (K-tree/forest) for Boolean queries
  • One direction is to run more experiments using, for
    example, TREC collections
  • Another direction is to study how document
    characteristics can help in choosing the right
    set of keywords to include in these structures