KLEE: A Framework for Distributed Top-k Query Algorithms presentation

1
KLEE: A Framework for Distributed Top-k Query Algorithms

Sebastian Michel
Peter Triantafillou
Gerhard Weikum
VLDB 2005
Presented by
Amrita Tamrakar

2
Overview

Problem Statement
KLEE
The Histogram Bloom Structure
Candidate Filtering
Conclusion

3
Problem Statement

Query with t terms, with index lists spread across m peers P1 ... Pm
Each peer Pj stores one inverted index list for a query term
The top-k result is a sorted list of (docID, TotalScore) pairs, where TotalScore for a docID is a monotonic aggregation of the scores of this document in all m index lists.
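
As an illustration of the definition above, here is a minimal sketch that computes an exact top-k result when all index lists are available in one place. It assumes summation as the monotonic aggregation; the function name `top_k` and the sample scores are hypothetical and not taken from the slides.

```python
from collections import defaultdict

def top_k(index_lists, k):
    """Return the k (doc_id, total_score) pairs with the highest summed score."""
    totals = defaultdict(float)
    for index_list in index_lists:           # one index list per peer
        for doc_id, score in index_list:
            totals[doc_id] += score          # assumption: sum is the aggregation
    # Sort by descending total score; break ties by doc_id for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Hypothetical index lists for two peers, each sorted by descending score.
example_lists = [
    [("d1", 9), ("d2", 5), ("d3", 1)],
    [("d2", 8), ("d3", 7), ("d1", 1)],
]
print(top_k(example_lists, 2))
```

Any other monotonic aggregation (for example a weighted sum) could replace the `+=` line without changing the rest of the sketch.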
4
Problem Definition
P0 is the peer where the query is initiated
[Figure: peers P0, P1, P2 and P3]

Problems to be considered
network consumption
per-peer load
latency (query response time)
processing

5
Naïve Solution

All m peers send their complete index lists to Pinit (the query initiator, P0), which then executes a centralized TA-style method
Alternatively, execute TA at Pinit and access the remote index lists one entry at a time (more message rounds needed!)
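
To make the cost of the second variant concrete, here is a minimal sketch of a TA-style (threshold algorithm) loop that counts sorted accesses; in a distributed setting each such access would be a remote request. It assumes summation as the aggregation and a score of 0 for documents missing from a list. The function name and the access counter are illustrative, not part of the slides.

```python
import heapq

def threshold_algorithm(index_lists, k):
    """TA-style top-k over score-sorted lists; returns (top_k, sorted_accesses)."""
    lookup = [dict(lst) for lst in index_lists]    # stands in for random access
    seen = {}                                      # doc_id -> aggregated score
    accesses = 0
    max_depth = max((len(lst) for lst in index_lists), default=0)
    for depth in range(max_depth):
        last_scores = []
        for lst in index_lists:
            if depth < len(lst):
                doc_id, score = lst[depth]         # one sorted access
                accesses += 1
                last_scores.append(score)
                if doc_id not in seen:
                    # Random accesses: fetch this document's score everywhere.
                    seen[doc_id] = sum(d.get(doc_id, 0.0) for d in lookup)
            else:
                last_scores.append(0.0)            # list exhausted
        # No unseen document can score higher than this threshold.
        threshold = sum(last_scores)
        best = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        if len(best) == k and best[-1][1] >= threshold:
            break
    ranked = sorted(seen.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k], accesses

example_lists = [
    [("d1", 9), ("d2", 5), ("d3", 1)],
    [("d2", 8), ("d3", 7), ("d1", 1)],
]
print(threshold_algorithm(example_lists, 1))
```

The loop stops as soon as the k-th best aggregated score reaches the threshold, so it usually reads only a prefix of each list, but every entry read costs a message when the lists are remote.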

6
KLEE

Different philosophy: approximate answers!
Efficiency:
reduces (docId, score)-pair transfers
no random accesses at each peer
Two pillars:
the HistogramBlooms structure
the Candidate List Filter structure

7
KLEE Steps

Exploration Step: get a better approximation of the min-k score threshold (topKScore)
Optimization Step: decide whether to run 3 or 4 steps
Candidate Filtering: a docID is a good candidate if it is high-scored at many peers
Candidate Retrieval: get all good docID candidates

8
Histogram Bloom Structure
Each peer pre-computes, for each index list, an equi-width histogram with, for each cell:
a Bloom filter
the average score
the upper/lower score
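
The structure described on this slide can be sketched as follows. This is a minimal illustration under stated assumptions: scores lie in [0, 1], the Bloom filter uses salted SHA-256 digests as hash functions, and each cell also keeps a document count. The names `BloomFilter` and `build_histogram` and all parameter values are hypothetical.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter; the hash choice is an assumption for this sketch."""
    def __init__(self, num_bits=256, num_hashes=4):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

def build_histogram(index_list, num_cells=4, low=0.0, high=1.0):
    """Equi-width histogram; each cell has bounds, a Bloom filter and an average."""
    width = (high - low) / num_cells
    cells = [{"lower": low + i * width, "upper": low + (i + 1) * width,
              "bloom": BloomFilter(), "scores": []} for i in range(num_cells)]
    for doc_id, score in index_list:
        i = min(int((score - low) / width), num_cells - 1)  # clamp score == high
        cells[i]["bloom"].add(doc_id)
        cells[i]["scores"].append(score)
    for cell in cells:
        scores = cell.pop("scores")
        cell["count"] = len(scores)
        cell["avg"] = sum(scores) / len(scores) if scores else 0.0
    return cells

cells = build_histogram([("d1", 0.9), ("d2", 0.8), ("d3", 0.3), ("d4", 0.1)])
for cell in cells:
    print(cell["lower"], cell["upper"], cell["count"], cell["avg"])
```

Because the cells are equi-width, the cell boundaries never need to be transmitted explicitly; the number of cells and the score range are enough to reconstruct them.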
9
Bloom Filter

A space-efficient probabilistic data structure that is used to test whether an element is a member of a set
vector V of m bits, initially all set to 0
k hash functions, each with range 1 ... m
insert n docs by hashing the ids and setting the corresponding bits
Trade-off: accuracy vs. efficiency

[Figure: a Bloom filter with 4 hash functions, inserting an element a ∈ A]
Given a query b, we check the bits at positions h1(b), h2(b), ..., hk(b). If any of them is 0, then b is not in the set A; if all of them are 1, b is probably in A (false positives are possible).
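
The insert and lookup rules above can be sketched in a few lines. This is a minimal illustration, not the presenters' implementation: the k hash functions are simulated with salted SHA-256 digests (an assumption), positions run from 0 to m-1 rather than 1 to m, and `false_positive_rate` uses the standard approximation (1 - e^(-kn/m))^k to show the accuracy versus space trade-off.

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, m, k):
        self.m = m                # number of bits in the vector V
        self.k = k                # number of hash functions
        self.bits = [0] * m       # all bits start at 0

    def _positions(self, item):
        # Assumption: k salted SHA-256 digests stand in for h1 ... hk.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Any 0 bit proves absence; all 1 bits mean "probably present".
        return all(self.bits[pos] for pos in self._positions(item))

def false_positive_rate(m, k, n):
    """Approximate false-positive probability after inserting n items."""
    return (1.0 - math.exp(-k * n / m)) ** k

bf = BloomFilter(m=64, k=4)
for doc_id in ["d1", "d2", "d3"]:
    bf.add(doc_id)
print(bf.might_contain("d1"), bf.might_contain("d99"))
print(false_positive_rate(64, 4, 3), false_positive_rate(1024, 4, 3))
```

A larger bit vector lowers the false-positive rate at the cost of more bits to ship between peers, which is the trade-off named on the slide.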
10
Exploration Step
[Figure: coordinator peer P0 and cohort peers Pi and Pj, each cohort holding an index list ordered by score]
11
Exploration Step

To calculate topKScore, Pinit has to
find the missing scores
find the missing documents
that is, those not present in the index list of some peers Pj.
Pinit uses the Bloom filters of that peer to find the histogram cell in which the document may lie, and takes the average score of that cell as the score of that document.
After replacing all missing scores, Pinit computes the top-k list and takes the score of the k-th document in the list as topKScore.
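
The estimation described on this slide can be sketched as follows. To keep the example short, each histogram cell's Bloom filter is replaced by a plain Python set (an explicit simplification: a real Bloom filter may also report false positives). Summation as the aggregation, the cell layout, and the names `estimate_score` and `top_k_score` are assumptions for illustration.

```python
def estimate_score(doc_id, histogram):
    """Return the average score of the first cell whose filter reports doc_id.

    `histogram` is a list of cells ordered from the highest score range to the
    lowest; each cell is {"filter": <membership test>, "avg": <float>}.
    """
    for cell in histogram:
        if doc_id in cell["filter"]:
            return cell["avg"]
    return 0.0                      # document not found at this peer

def top_k_score(prefixes, histograms, k):
    """Estimate the k-th best aggregated score from partial lists plus histograms."""
    candidates = {doc for prefix in prefixes for doc, _ in prefix}
    totals = []
    for doc in candidates:
        total = 0.0
        for prefix, histogram in zip(prefixes, histograms):
            known = dict(prefix)
            # Use the exact score if the peer sent it, else the cell average.
            total += known[doc] if doc in known else estimate_score(doc, histogram)
        totals.append(total)
    totals.sort(reverse=True)
    return totals[k - 1] if len(totals) >= k else 0.0

# Hypothetical data: the top entries sent by two peers and their histograms.
prefixes = [[("d1", 0.75), ("d2", 0.5)], [("d3", 0.75), ("d1", 0.5)]]
histograms = [
    [{"filter": {"d1", "d2"}, "avg": 0.625}, {"filter": {"d3"}, "avg": 0.125}],
    [{"filter": {"d1", "d3"}, "avg": 0.625}, {"filter": {"d2"}, "avg": 0.25}],
]
print(top_k_score(prefixes, histograms, 2))
```

In this example d3 is missing from the first peer's entries, so its score there is approximated by the average of the cell that reports it (0.125), and likewise for d2 at the second peer.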

12
Candidate List Filter Matrix

Goal: filter out unpromising candidate documents in step 2
Estimate the maximum number of docs that are above the min-k / m threshold (Maximum_size_candidate_list)

[Figure: plot of number of documents against score]

Send this number and the threshold to the peers
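
One way to read this estimate is as a count over histogram cells. The sketch below assumes each cell records its upper score bound and a document count (the count field is an assumption, suggested only by the axes of the figure), and it counts every cell that could still contain a score above the threshold. The function name is hypothetical, and how the per-peer estimates are combined is left open here.

```python
def max_candidate_list_size(histogram, top_k_score, num_peers):
    """Return (threshold, upper bound on docs scoring above the threshold).

    `histogram` is a list of cells, each {"upper": float, "count": int}.
    """
    threshold = top_k_score / num_peers          # the min-k / m threshold
    # A cell may hold docs above the threshold if its upper bound exceeds it.
    size = sum(cell["count"] for cell in histogram if cell["upper"] > threshold)
    return threshold, size

# Hypothetical histogram of one peer's index list, four equi-width cells.
histogram = [
    {"upper": 0.25, "count": 40},
    {"upper": 0.5, "count": 30},
    {"upper": 0.75, "count": 20},
    {"upper": 1.0, "count": 10},
]
print(max_candidate_list_size(histogram, top_k_score=1.0, num_peers=2))
```

The result is a conservative bound: a cell that straddles the threshold is counted in full, which can overestimate the size but never underestimate it.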

13
Candidate List Filter Matrix
Each peer returns a Bloom filter that contains all docs above the topKScore / m threshold
[Figure: the CLF for m peers, one Bloom filter bit vector per row, rows 1 .. m; a second panel is labelled "Redefined CLF"]
1    010101001011110101001001010101001
..   010010011001011111001001010111110
m    101010101010100110010010011110000
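
A rough sketch of how such a matrix could be assembled and queried is given below. It assumes all peers use the same filter length and hash functions (simulated here with salted SHA-256 digests), and it treats "appears in many rows" as the sign of a good candidate, following the wording of the KLEE Steps slide. The constants and function names are hypothetical, and the refinement step shown in the figure is not modelled.

```python
import hashlib

NUM_BITS = 64        # assumed filter length, shared by all peers
NUM_HASHES = 3       # assumed number of hash functions

def positions(doc_id):
    """Bit positions for doc_id; salted SHA-256 is an illustrative choice."""
    return [
        int.from_bytes(
            hashlib.sha256(f"{i}:{doc_id}".encode("utf-8")).digest()[:8], "big"
        ) % NUM_BITS
        for i in range(NUM_HASHES)
    ]

def peer_filter(index_list, threshold):
    """Bloom filter (as a bit list) of all docs scoring above the threshold."""
    bits = [0] * NUM_BITS
    for doc_id, score in index_list:
        if score > threshold:
            for pos in positions(doc_id):
                bits[pos] = 1
    return bits

def peers_matching(doc_id, clf_matrix):
    """Number of rows (peers) whose filter reports doc_id."""
    pos = positions(doc_id)
    return sum(all(row[p] for p in pos) for row in clf_matrix)

# Hypothetical index lists for three peers and a threshold of 0.5.
peer_lists = [
    [("d1", 0.9), ("d2", 0.7), ("d3", 0.2)],
    [("d1", 0.8), ("d3", 0.6), ("d2", 0.1)],
    [("d4", 0.3), ("d1", 0.2)],
]
clf_matrix = [peer_filter(lst, 0.5) for lst in peer_lists]
print(peers_matching("d1", clf_matrix))
```

Since Bloom filters have no false negatives, a document that is above the threshold at a peer is always counted for that peer; false positives can only inflate the count.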
14
KLEE Candidate Filter
[Figure: coordinator peer P0 (current top-k, candidate set, min-k / m threshold) exchanging Bloom filter bit vectors with cohort peers Pi and Pj, each holding an index list ordered by score]
15
[Figure, continued: coordinator peer P0 (current top-k, candidate set) and cohort peers Pi and Pj with their Bloom filter bit vectors and index list]
16
Conclusion

KLEE: approximate top-k algorithms for wide-area networks
significant performance benefits can be enjoyed, at only small penalties in result quality
flexible framework for top-k algorithms, allowing for trading off
efficiency versus result quality, and
bandwidth savings versus the number of communication phases
various fine-tuning parameters
