Title: KLEE: A Framework for Distributed Top-k Query Algorithms
1 KLEE: A Framework for Distributed Top-k Query Algorithms
- Sebastian Michel
- Peter Triantafillou
- Gerhard Weikum
- VLDB 2005
- Presented by
- Amrita Tamrakar
2 Overview
- Problem Statement
- KLEE
- The Histogram Bloom Structure
- Candidate Filtering
- Conclusion
3 Problem Statement
- A query with t terms, whose index lists are spread across m peers P1 ... Pm
- Each peer Pj stores one inverted index list for a term t
- The top-k result: a list of (docID, TotalScore) pairs sorted by TotalScore, where the TotalScore of a docID is a monotonic aggregation of the scores of this document in all m index lists
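As a reference point, the exact top-k result can be sketched in a few lines when all index lists are available in one place. The sketch below assumes summation as the monotonic aggregation and assumes that a document missing from a list contributes a score of 0; the function name `top_k` and the sample data are illustrative only.

```python
from collections import defaultdict
import heapq

def top_k(index_lists, k):
    """Return the k (doc_id, total_score) pairs with the highest total score."""
    totals = defaultdict(float)
    for index_list in index_lists:       # one index list per peer
        for doc_id, score in index_list:
            totals[doc_id] += score      # summation is monotonic in every score
    # Highest total first; ties are broken by doc_id so the result is deterministic.
    return heapq.nsmallest(k, totals.items(), key=lambda kv: (-kv[1], kv[0]))

# Illustrative index lists held by three peers.
peers = [
    [("d1", 0.9), ("d2", 0.5), ("d3", 0.1)],
    [("d2", 0.8), ("d3", 0.7)],
    [("d1", 0.2), ("d3", 0.6)],
]
result = top_k(peers, 2)
```

Computing this exactly requires every entry of every list, which is the cost the distributed algorithms try to avoid.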
4 Problem Definition
P0 (also written Pinit) is the peer where the query is initiated
[Figure: peers P0, P1, P2, P3]
- Problems to be considered
- network consumption
- per peer load
- latency (query response time)
- processing
5 Naïve Solutions
- Have all m peers send their complete index lists to Pinit, then execute a centralized TA-style method
- Execute TA at Pinit and access the remote index lists one entry at a time (more message rounds needed!)
6 KLEE
- Different philosophy: approximate answers!
- Efficiency
- Reduces (docID, score)-pair transfers
- No random accesses at each peer
- Two pillars
- The HistogramBlooms structure
- The Candidate List Filter structure
7 KLEE Steps
- Exploration Step: get a better approximation of the min-k score threshold (topKScore)
- Optimization Step: decide whether to run 3 or 4 steps
- Candidate Filtering: a docID is a good candidate if it is high-scored at many peers
- Candidate Retrieval: get all good docID candidates
8 Histogram Bloom Structure
Each peer pre-computes, for each index list, an equi-width histogram with
- a Bloom filter for each cell
- the average score per cell
- the upper/lower score of each cell
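A minimal sketch of how a peer might pre-compute this structure for one index list. It assumes scores lie in [0, 1] and uses a small integer bitmask as the per-cell Bloom filter; the constants `BLOOM_BITS` and `BLOOM_HASHES` and all function names are hypothetical and not taken from the paper.

```python
import hashlib

BLOOM_BITS = 64     # assumed Bloom filter size per cell
BLOOM_HASHES = 3    # assumed number of hash functions

def bloom_positions(doc_id):
    """Derive BLOOM_HASHES bit positions for doc_id by salting SHA-256."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{doc_id}".encode()).digest()[:8], "big") % BLOOM_BITS
        for i in range(BLOOM_HASHES)
    ]

def bloom_add(bits, doc_id):
    """Return the bitmask with doc_id's positions set."""
    for p in bloom_positions(doc_id):
        bits |= 1 << p
    return bits

def bloom_test(bits, doc_id):
    """True if doc_id may be in the filter, False if it is definitely absent."""
    return all((bits >> p) & 1 for p in bloom_positions(doc_id))

def build_histogram_blooms(index_list, num_cells, lo=0.0, hi=1.0):
    """Equi-width histogram over scores; each cell keeps bounds, a Bloom filter, and an average."""
    width = (hi - lo) / num_cells
    cells = [
        {"lower": lo + i * width, "upper": lo + (i + 1) * width,
         "bloom": 0, "count": 0, "sum": 0.0, "avg": None}
        for i in range(num_cells)
    ]
    for doc_id, score in index_list:
        i = min(int((score - lo) / width), num_cells - 1)  # a score equal to hi goes to the last cell
        cell = cells[i]
        cell["bloom"] = bloom_add(cell["bloom"], doc_id)
        cell["count"] += 1
        cell["sum"] += score
    for cell in cells:
        if cell["count"]:
            cell["avg"] = cell["sum"] / cell["count"]
    return cells

index_list = [("d1", 0.95), ("d2", 0.85), ("d3", 0.40), ("d4", 0.05)]
cells = build_histogram_blooms(index_list, num_cells=4)
```

Only the per-cell Bloom filters, averages, and bounds need to be shipped to the coordinator, which is far smaller than the (docID, score) pairs they summarize.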
9 Bloom Filter
- A space-efficient probabilistic data structure used to test whether an element is a member of a set
- Bit vector V of m bits, initially all set to 0
- k hash functions, each with range 1..m
- Insert n docs by hashing their IDs and setting the corresponding bits
- Trade-off: accuracy vs. efficiency
[Figure: a Bloom filter with 4 hash functions, testing whether a ∈ A]
Given a query b, check the bits at positions h1(b), h2(b), ..., hk(b). If any of them is 0, then b is definitely not in the set A. If all of them are 1, b is probably in A (false positives are possible).
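The insert and lookup rules above can be sketched as follows. This is an illustrative implementation, not the one used in KLEE: the k hash functions are simulated by salting SHA-256, and positions run from 0 to m-1 rather than 1 to m.

```python
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m = m              # number of bits in the vector V
        self.k = k              # number of hash functions
        self.bits = [0] * m     # initially all 0

    def _positions(self, item):
        # Simulate k independent hash functions h1..hk by salting one hash.
        return [
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def might_contain(self, item):
        # Any 0 bit proves absence; all 1 bits only suggest presence.
        return all(self.bits[p] for p in self._positions(item))

bf = BloomFilter(m=256, k=4)
docs = [f"doc{i}" for i in range(20)]
for d in docs:
    bf.add(d)
```

Every inserted document is always reported as present; a document that was never inserted is usually, but not always, reported as absent. A larger m lowers the false-positive rate at the price of a bigger filter to transfer.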
10 Exploration Step
[Figure: coordinator peer P0 and cohort peers Pi and Pj, each cohort shown with its index list and scores]
11 Exploration Step
- To calculate topKScore, Pinit has to
- find the missing scores
- find the missing documents
- whenever they are not present in the index list of some peer Pj
- Pinit uses the Bloom filters of that peer to find the histogram cell in which the document may lie, and takes the average score of that cell as the score of that document
- After replacing all missing scores, Pinit computes the top-k list and takes the score of the k-th document in the list as topKScore
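The estimation described on this slide can be sketched as below. To keep the example short, plain Python sets stand in for the per-cell Bloom filters (so there are no false positives here), cells are probed from the highest-scoring cell downwards, and a document found in no cell is assumed to score 0; these choices and all names are assumptions for illustration.

```python
def estimate_score(doc_id, cells):
    """cells: list of (member_set, avg_score), highest-scoring cell first."""
    for members, avg_score in cells:
        if doc_id in members:       # stands in for a Bloom filter lookup
            return avg_score
    return 0.0                      # assumed score when no cell matches

def top_k_score(known, peer_cells, k):
    """known: {peer: {doc_id: score}} received so far; returns the k-th best total."""
    docs = set().union(*(scores.keys() for scores in known.values()))
    totals = []
    for doc_id in docs:
        total = 0.0
        for peer, scores in known.items():
            if doc_id in scores:
                total += scores[doc_id]                               # exact score
            else:
                total += estimate_score(doc_id, peer_cells[peer])     # cell average
        totals.append(total)
    totals.sort(reverse=True)
    return totals[k - 1] if len(totals) >= k else 0.0

# Illustrative data: each peer sent its best two entries plus its histogram cells.
known = {
    "P1": {"a": 0.9, "b": 0.8},
    "P2": {"b": 0.7, "c": 0.6},
}
peer_cells = {
    "P1": [({"a", "b"}, 0.85), ({"c"}, 0.3)],
    "P2": [({"b", "c"}, 0.65), ({"a"}, 0.2)],
}
threshold = top_k_score(known, peer_cells, k=2)
```

Here the totals are b = 1.5, a = 0.9 + 0.2 = 1.1 and c = 0.3 + 0.6 = 0.9, so the estimated topKScore for k = 2 is about 1.1.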
12 Candidate List Filter Matrix
- Goal: filter out unpromising candidate documents in step 2
- Estimate the maximum number of docs that are above the min-k / m threshold (Maximum_size_candidate_list)
[Figure: number of documents plotted against score]
- Send this number and the threshold to the peers
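One way the coordinator could derive this number from the histograms it already holds is sketched below. The sketch assumes each histogram is a list of (lower, upper, count) cells and counts every cell whose upper bound exceeds the threshold, which gives a conservative upper estimate; the function names are hypothetical.

```python
def docs_above(histogram, threshold):
    """Upper estimate of the docs scoring above threshold in one peer's histogram."""
    # A cell that straddles the threshold is counted in full.
    return sum(count for lower, upper, count in histogram if upper > threshold)

def max_candidate_list_size(histograms, top_k_score):
    """Return (threshold, largest per-peer estimate) for m = len(histograms) peers."""
    threshold = top_k_score / len(histograms)
    return threshold, max(docs_above(h, threshold) for h in histograms)

# Illustrative histograms for two peers, two cells each.
histograms = [
    [(0.0, 0.5, 40), (0.5, 1.0, 10)],
    [(0.0, 0.5, 70), (0.5, 1.0, 5)],
]
threshold, size = max_candidate_list_size(histograms, top_k_score=1.2)
```

With topKScore = 1.2 and two peers, the threshold is 0.6 and the estimated maximum candidate list size is 10.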
13 Candidate List Filter Matrix
Each peer returns a Bloom filter that contains all docs above the topKScore / m threshold
[Figure: the CLF matrix for m peers, one Bloom filter bit vector per row (rows 1 ... m), and the redefined CLF derived from it]
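The matrix can be sketched as one Bloom filter row per peer, with a document treated as a good candidate when enough rows report it. The filter size, the hash construction, and the `min_rows` parameter below are assumptions made for illustration; the slides do not specify them.

```python
import hashlib

NUM_BITS = 64      # assumed Bloom filter length (columns of the matrix)
NUM_HASHES = 2     # assumed number of hash functions

def positions(doc_id):
    """Bit positions of doc_id, simulated by salting SHA-256."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{doc_id}".encode()).digest()[:8], "big") % NUM_BITS
        for i in range(NUM_HASHES)
    ]

def peer_filter(index_list, threshold):
    """Bloom filter row of all docs whose local score is above the threshold."""
    bits = [0] * NUM_BITS
    for doc_id, score in index_list:
        if score > threshold:
            for p in positions(doc_id):
                bits[p] = 1
    return bits

def is_candidate(doc_id, clf_matrix, min_rows):
    """Good candidate if at least min_rows peers report the doc above the threshold."""
    hits = sum(all(row[p] for p in positions(doc_id)) for row in clf_matrix)
    return hits >= min_rows   # false positives can admit extra docs, never drop real ones

peers = [
    [("a", 0.9), ("b", 0.5), ("c", 0.1)],
    [("a", 0.8), ("c", 0.45), ("b", 0.2)],
    [("a", 0.7), ("b", 0.6), ("d", 0.05)],
]
top_k_score = 1.2
threshold = top_k_score / len(peers)                  # topKScore / m
clf_matrix = [peer_filter(p, threshold) for p in peers]
```

In this example "a" is above the threshold at all three peers and "b" at two, so both pass with `min_rows=2`. Because Bloom filters have no false negatives, a document that really is high-scored at enough peers is never filtered out.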
14 KLEE Candidate Filter
[Figure: coordinator peer P0 (current top-k, candidate set, min-k / m threshold) and cohort peers Pi and Pj, each with its index list, top-k entries, and bit vectors]
15 [Figure: coordinator peer P0 (current top-k, candidate set) and cohort peers Pi and Pj, each with its index list, top-k entries, and bit vectors]
16 Conclusion
- KLEE: approximate top-k algorithms for wide-area networks
- Significant performance benefits can be enjoyed, at only small penalties in result quality
- Flexible framework for top-k algorithms, allowing for trading off
- efficiency versus result quality and
- bandwidth savings versus the number of communication phases
- Various fine-tuning parameters