KLEE: A Framework for Distributed Top-k Query Algorithms


Title: KLEE: A Framework for Distributed Top-k Query Algorithms


1
KLEE: A Framework for Distributed Top-k Query Algorithms
  • Sebastian Michel
  • Peter Triantafillou
  • Gerhard Weikum
  • VLDB 2005
  • Presented by Amrita Tamrakar

2
Overview
  • Problem Statement
  • KLEE
  • The Histogram Bloom Structure
  • Candidate Filtering
  • Conclusion

3
Problem Statement
  • A query with t terms, with index lists spread
    across m peers P1 ... Pm.
  • Each peer Pj stores one inverted index list over a
    term t.
  • The top-k result is a sorted list of
    (docID, TotalScore) pairs, where TotalScore for a
    docID is a monotonic aggregation of the scores of
    this document in all m index lists.
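As a minimal sketch of this definition, the following Python computes the top-k result when all m index lists are available in one place. Summation is assumed here as one example of a monotonic aggregation, and the names `top_k` and `index_lists` are illustrative, not from the slides.

```python
from collections import defaultdict
import heapq


def top_k(index_lists, k):
    """Return the k (doc_id, total_score) pairs with the highest total score.

    index_lists: one dict {doc_id: score} per peer (hypothetical layout).
    Aggregation: plain sum, assumed as an example of a monotonic function.
    """
    totals = defaultdict(float)
    for index_list in index_lists:
        for doc_id, score in index_list.items():
            totals[doc_id] += score  # a doc missing from a list adds nothing
    # Highest total first; ties are broken by doc_id for determinism.
    return heapq.nsmallest(k, totals.items(), key=lambda p: (-p[1], p[0]))


index_lists = [
    {"d1": 0.9, "d2": 0.5, "d3": 0.1},
    {"d1": 0.2, "d2": 0.6, "d4": 0.7},
    {"d2": 0.4, "d3": 0.3, "d4": 0.1},
]
result = top_k(index_lists, 2)
```

Here d2 ranks first because it scores moderately in all three lists, while d1 ranks second on the strength of one high score.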
4
Problem Definition
P0 is the peer where the query is initiated.
[Figure: peers P0, P1, P2, and P3]
  • Problems to be considered:
  • network consumption
  • per peer load
  • latency (query response time)
  • processing

5
Naïve Solution
  • All m peers send their complete index lists to
    Pinit, which then executes a centralized TA-style
    method.
  • Alternatively, execute TA at Pinit and access the
    remote index lists one entry at a time (more
    message rounds needed!).
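The second variant can be sketched as follows, assuming "TA" refers to Fagin's threshold algorithm and that scores are aggregated by summation. Random accesses are simulated with local dictionaries; in the distributed setting each sorted or random access would be a message to a remote peer, which is the cost the slide points at. The function name and data layout are hypothetical.

```python
import heapq


def threshold_algorithm(sorted_lists, k):
    """TA-style top-k with sum aggregation (illustrative sketch).

    sorted_lists: per peer, a list of (doc_id, score) sorted by score descending.
    Returns (top-k list of (doc_id, total), number of sorted-access rounds).
    """
    lookup = [dict(lst) for lst in sorted_lists]  # stands in for random access
    totals = {}
    depth = 0
    max_depth = max(len(lst) for lst in sorted_lists)
    while depth < max_depth:
        last_seen = []
        for lst in sorted_lists:
            if depth < len(lst):
                doc_id, score = lst[depth]
                last_seen.append(score)
                if doc_id not in totals:
                    # Random access: fetch this doc's score from every list.
                    totals[doc_id] = sum(l.get(doc_id, 0) for l in lookup)
            else:
                last_seen.append(0)  # exhausted list: unseen docs score 0 here
        depth += 1
        threshold = sum(last_seen)  # best total any unseen doc could reach
        best = heapq.nlargest(k, totals.values())
        if len(best) == k and best[-1] >= threshold:
            break
    top = sorted(totals.items(), key=lambda p: (-p[1], p[0]))[:k]
    return top, depth


sorted_lists = [
    [("d1", 9), ("d2", 8), ("d3", 1), ("d5", 1)],
    [("d2", 7), ("d1", 6), ("d4", 2), ("d3", 1)],
    [("d2", 5), ("d1", 4), ("d3", 1), ("d4", 1)],
]
top, rounds = threshold_algorithm(sorted_lists, 2)
```

In this example the threshold drops from 21 to 18 after the second round, at which point both current top-2 totals are at least as high and the scan stops without reading the remaining entries.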

6
KLEE
  • Different philosophy: approximate answers!
  • Efficiency: reduces (docId, score)-pair transfers
    and needs no random accesses at each peer
  • Two pillars: the HistogramBlooms structure and
    the Candidate List Filter structure

7
KLEE Steps
  • Exploration Step: get a better approximation of
    the min-k score threshold (topKScore).
  • Optimization Step: decide whether to use 3 or 4
    steps.
  • Candidate Filtering: a docID is a good candidate
    if it is high-scored in many peers.
  • Candidate Retrieval: get all good docID
    candidates.

8
Histogram Bloom Structure
Each peer pre-computes, for each index list, an
equi-width histogram with:
  • a Bloom filter for each cell
  • the average score per cell
  • the upper/lower score
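A minimal sketch of such a structure is shown below, assuming scores lie in [0, max_score]. To keep the example short, a plain Python set stands in for each cell's Bloom filter; `Cell`, `build_histogram`, and `max_score` are hypothetical names, not taken from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    lower: float  # lower score bound of the cell
    upper: float  # upper score bound of the cell
    doc_ids: set = field(default_factory=set)  # stand-in for the Bloom filter
    score_sum: float = 0.0

    @property
    def average(self):
        """Average score of the docs in this cell (0.0 if the cell is empty)."""
        return self.score_sum / len(self.doc_ids) if self.doc_ids else 0.0


def build_histogram(index_list, num_cells, max_score=1.0):
    """Build an equi-width histogram over one index list {doc_id: score}."""
    width = max_score / num_cells
    cells = [Cell(i * width, (i + 1) * width) for i in range(num_cells)]
    for doc_id, score in index_list.items():
        # Clamp so that a score equal to max_score lands in the last cell.
        i = min(int(score / width), num_cells - 1)
        cells[i].doc_ids.add(doc_id)
        cells[i].score_sum += score
    return cells


cells = build_histogram({"d1": 0.9, "d2": 0.8, "d3": 0.3, "d4": 0.1}, num_cells=4)
```

With four cells of width 0.25, d1 and d2 share the top cell, so a coordinator that only learns "d1 is in the top cell" would estimate its score as that cell's average.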
9
Bloom Filter
  1. A space-efficient probabilistic data structure
    that is used to test whether an element is a
    member of a set
  2. Vector V of m bits, initially all set to 0
  3. k hash functions, each with range 1 ... m
  4. Insert n docs by hashing the ids and setting the
    corresponding bits
  5. Trade-off: accuracy vs. efficiency

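[Figure: a Bloom filter with 4 hash functions, a ∈ A]
Given a query b, we check the bits at positions
h1(b), h2(b), ..., hk(b). If any of them is 0,
then b is certainly not in the set A. If all of them
are 1, b is probably in A, with some chance of a
false positive.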
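The insert and lookup rules above can be sketched in a few lines of Python. The way the k hash functions are derived here (SHA-256 salted with the hash index) is an assumption made for illustration; the slides do not say which hash functions are used. Positions are 0-based in the code, whereas the slide counts from 1.

```python
import hashlib


class BloomFilter:
    """Bit vector of num_bits bits with num_hashes hash functions."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits  # initially all set to 0

    def _positions(self, item):
        # Assumed hashing scheme: salt SHA-256 with the hash function index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely absent"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))


bloom = BloomFilter(num_bits=64, num_hashes=4)
for doc_id in ["d1", "d2", "d3"]:
    bloom.add(doc_id)
```

A larger bit vector lowers the false-positive rate at the cost of more bytes on the wire, which is the accuracy versus efficiency trade-off named in the list.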
10
Exploration Step
[Figure: Coordinator Peer P0 and Cohort Peers Pi and Pj, each cohort shown with its Index List and a score axis]
11
Exploration Step
  • To calculate topKScore, Pinit has to find the
    missing scores and missing documents when they
    are not present in the index list of some peer Pj.
  • It uses the Bloom filters of that peer to find out
    which histogram cell the document may lie in, and
    takes the average of that cell as the score of
    that document.
  • After replacing all missing scores, Pinit computes
    the top-k list and identifies the score of the
    k-th document in the list as topKScore.
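The estimation described above might look like the sketch below. It assumes summation as the aggregation, scans a peer's cells from the highest score range downwards and takes the first cell whose filter reports a hit, and scores a document as 0 at a peer where no cell matches. These choices, along with the names `estimate_score` and `top_k_score`, are assumptions; plain set membership stands in for the Bloom filter test.

```python
def estimate_score(doc_id, histogram):
    """histogram: list of (might_contain, average), highest cell first."""
    for might_contain, average in histogram:
        if might_contain(doc_id):
            return average  # use the cell average as the estimated score
    return 0.0  # assumed: no matching cell means no contribution


def top_k_score(received, histograms, k):
    """Estimate the score of the k-th best document.

    received:   per peer, dict {doc_id: score} of entries sent so far.
    histograms: per peer, cells as accepted by estimate_score.
    """
    seen = set().union(*received)
    totals = []
    for doc_id in seen:
        total = 0.0
        for entries, histogram in zip(received, histograms):
            if doc_id in entries:
                total += entries[doc_id]  # exact score was received
            else:
                total += estimate_score(doc_id, histogram)  # fill the gap
        totals.append(total)
    totals.sort(reverse=True)
    return totals[k - 1] if len(totals) >= k else 0.0


received = [
    {"d1": 0.75, "d2": 0.5},
    {"d3": 0.75, "d1": 0.5},
]
histograms = [
    [({"d1", "d2"}.__contains__, 0.625), ({"d3"}.__contains__, 0.25)],
    [({"d3", "d1"}.__contains__, 0.625), ({"d2"}.__contains__, 0.125)],
]
score = top_k_score(received, histograms, k=2)
```

Here d3 is missing from the first peer's entries, so its score there is estimated as 0.25, giving it a total of 1.0 and making it the second-best document.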

12
Candidate List Filter Matrix
  • Goal: filter out unpromising candidate documents
    in step 2.
  • Estimate the max number of docs that are above
    the min-k / m threshold
    (Maximum_size_candidate_list).

[Figure: number of documents versus score]
  • Send this number and the threshold to the peers.
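One way to obtain such an estimate from histogram cell counts is sketched below. It counts every document in any cell whose upper bound exceeds the threshold, which gives an upper bound per peer. The slides do not state how the per-peer numbers are combined into a single Maximum_size_candidate_list value, so the sketch simply returns them; the function names and cell layout are hypothetical.

```python
def docs_above(cells, threshold):
    """Upper bound on the docs scoring above threshold in one histogram.

    cells: list of (lower, upper, doc_count) tuples.
    """
    return sum(count for _lower, upper, count in cells if upper > threshold)


def estimate_candidate_sizes(histograms, top_k_score):
    """Return the per-peer threshold and the per-peer candidate estimates."""
    m = len(histograms)
    threshold = top_k_score / m  # the min-k / m threshold from the slide
    return threshold, [docs_above(cells, threshold) for cells in histograms]


histograms = [
    [(0.0, 0.25, 40), (0.25, 0.5, 30), (0.5, 0.75, 20), (0.75, 1.0, 10)],
    [(0.0, 0.25, 50), (0.25, 0.5, 5), (0.5, 0.75, 3), (0.75, 1.0, 2)],
]
threshold, sizes = estimate_candidate_sizes(histograms, top_k_score=1.0)
```

With two peers and a topKScore of 1.0 the threshold is 0.5, so only the two upper cells of each histogram contribute to the estimate.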

13
Candidate List Filter Matrix
Each peer returns a Bloom filter that contains
all docs above the topKScore / m threshold.
CLF matrix for m peers (one Bloom filter per row):
1    010101001011110101001001010101001
..   010010011001011111001001010111110
m    101010101010100110010010011110000
[Figure label: Redefined CLF]
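The slides do not spell out how the rows of the matrix are combined. One plausible reading of "high-scored in many peers" is to keep only the bit positions that are set in at least some number of rows; the sketch below implements that column-count rule. Both the rule and the parameter `min_peers` are assumptions made for illustration.

```python
def combine_clf(rows, min_peers):
    """Combine per-peer Bloom filters into one bit string.

    rows: equal-length strings of '0'/'1', one Bloom filter per peer.
    A position is '1' in the result if at least min_peers rows set it.
    """
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all Bloom filters must have the same length")
    return "".join(
        "1" if sum(bit == "1" for bit in column) >= min_peers else "0"
        for column in zip(*rows)
    )


rows = ["0101", "0110", "1100"]
combined = combine_clf(rows, min_peers=2)
```

In this toy matrix only the second column is set by two or more peers, so only documents hashing to that position would remain candidates.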
14
KLEE Candidate Filter
[Figure: Coordinator Peer P0 with the current top-k, the candidate set, and the min-k / m threshold; Cohort Peers Pi and Pj, each with Bloom filter bit vectors, a top k, and an Index List with a score axis]
15
[Figure: continuation of the KLEE Candidate Filter diagram, with Coordinator Peer P0 (current top-k, candidate set) and Cohort Peers Pi and Pj (Bloom filter bit vectors, top k, Index List with a score axis)]
16
Conclusion
  • KLEE: approximate top-k algorithms for wide-area
    networks
  • Significant performance benefits can be enjoyed,
    at only small penalties in result quality
  • Flexible framework for top-k algorithms, allowing
    for trading off efficiency versus result quality
    and bandwidth savings versus the number of
    communication phases
  • Various fine-tuning parameters