Optimizing Data Popularity Conscious Bloom Filters

1
Optimizing Data Popularity Conscious Bloom Filters
  • Ming Zhong, Pin Lu, Kai Shen, Joel Seiferas
  • University of Rochester

2
Problem Overview
  • Bloom filters:
      • compact set representation in which each object
        is hashed into several bits in the filter
      • allow possible false positives in membership
        queries
      • useful in distributed applications that
        communicate sets.
  • Highly skewed data popularity distributions.
  • Data popularity conscious Bloom filters:
      • use a large number of hashes for likely false
        positive candidates: objects that are popular in
        queries and unpopular in sets.
  • Goal: customize the hash number for each object
    to minimize the false positive probability.
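
The idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the class name `VariableHashBloomFilter`, the `hash_count` callback, and the use of salted SHA-256 as the hash family are all assumptions made for the example. Senders and receivers would need to agree on the same `hash_count` mapping.

```python
import hashlib

class VariableHashBloomFilter:
    """Bloom filter in which each object may use its own number of hashes."""

    def __init__(self, m, hash_count):
        self.m = m                    # filter size in bits
        self.hash_count = hash_count  # hypothetical callback: object -> k_i
        self.bits = bytearray(m)      # one byte per bit, for readability

    def _positions(self, obj):
        # Derive k_i bit positions by salting SHA-256 with the hash index.
        for j in range(self.hash_count(obj)):
            digest = hashlib.sha256(f"{j}:{obj}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, obj):
        for pos in self._positions(obj):
            self.bits[pos] = 1

    def __contains__(self, obj):
        # True for every added object; may also be True for a non-member.
        return all(self.bits[pos] for pos in self._positions(obj))
```

An object that is queried often but rarely stored would be given a large `hash_count`, so that a false positive requires many bits to collide.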

3
Object Popularity Stability
  • Stable object popularity is important for
    learning the object popularity and for low
    adjustment overhead.
  • Illustration of stability across month-long trace
    segments

4
Problem Formulation and Result
  • Problem formulation:
      • in a universe of $N$ objects, an $n$-object set is
        represented by an $m$-bit filter
      • object $i$'s membership popularity is $p_i$; its
        non-member query popularity is $q_i$
      • find object hash numbers $k_1, k_2, \ldots, k_N$ to
        minimize the false positive probability
        $\sum_{1 \le i \le N} q_i B^{k_i}$
      • $B$ is the probability for an arbitrary filter bit
        to be 1, therefore
        $\sum_{1 \le i \le N} p_i k_i = K = \ln(1-B) / (n \ln(1-1/m))$.
  • Result (assume the $k_i$ are unrestricted real
    numbers):
      • Lagrangian function:
        $\sum_{1 \le i \le N} q_i B^{k_i} + \lambda \left( \sum_{1 \le i \le N} p_i k_i - K \right)$
      • the optimum is reached when the function's
        partial derivatives with respect to the $k_i$ and
        $\lambda$ are all zero
      • we find $k_i = C + \log_{1/B}(q_i/p_i)$, where $C$ is a
        constant
      • also $B = 0.5$.
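
As a rough numerical illustration of this result, the sketch below computes the unrestricted real-valued hash numbers and picks the constant C so that the budget constraint holds. The function name `real_optimal_hashes` is hypothetical, and the code assumes every p_i and q_i is strictly positive.

```python
import math

def real_optimal_hashes(p, q, n, m, B=0.5):
    """Return (k, K) with k_i = C + log_{1/B}(q_i / p_i) for real-valued k_i."""
    # Hash budget from the slide: sum_i p_i * k_i = K.
    K = math.log(1 - B) / (n * math.log(1 - 1 / m))
    logs = [math.log(qi / pi) / math.log(1 / B) for pi, qi in zip(p, q)]
    # Choose C so that the budget constraint is met exactly.
    C = (K - sum(pi * li for pi, li in zip(p, logs))) / sum(p)
    return [C + li for li in logs], K
```

Objects with a higher ratio q_i/p_i receive more hashes. The returned values are real numbers and may be fractional or even negative, which is why a separate integer treatment is needed in practice.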

5
Ranged Integer Problem
  • Practical constraint:
      • object $i$'s hash number $k_i$ must be a positive
        integer, and is often upper-bounded by $k_{\max}$.
  • Rounding real-number solutions to integers:
      • may increase the false positive rate
      • there is no known bound on how large the
        increase may be.
  • Overview of our approach:
      • introduce an importance score for each object
        (intuitively, more important objects should get
        more hashes)
      • the importance ranking helps produce fast
        approximation solutions.

6
Object Importance Score
  • Intuition:
      • revisit the optimal real-number solution
        $k_i = C + \log_2(q_i/p_i)$
      • Hint: $q_i/p_i$ provides a ranking on object hash
        numbers in a good solution.
  • Results:
      • for the ranged real-number problem, an optimal
        solution $k_1, k_2, \ldots, k_N$ must follow the
        importance ranking
      • there is a solution $k_1, k_2, \ldots, k_N$ that is a
        2-approximation to the ranged integer problem
        and that also follows the importance ranking.

7
Polynomial-Time 2-Approximation
  • Our result indicates that at least one solution
    that follows the importance score ranking is
    provably a 2-approximation.
  • ⇒ If we enumerate all importance-ranked
    solutions, the best is a 2-approximation.
  • $O(N^{k_{\max}})$-time 2-approximation:
      • no more than $(N+1)^{k_{\max}-1}$ importance-ranked
        solutions in total
      • it takes $O(N)$ time to check the constraint and
        calculate the false positive rate for each
        solution.
  • Practically expensive:
      • $N$ can be huge
      • the constant $k_{\max}$ may not be very small (e.g.,
        20).
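
The enumeration described on this slide can be sketched as a brute-force search. This is an illustrative sketch under stated assumptions, not the authors' code: the function name `best_importance_ranked` is hypothetical, the constraint is assumed to take the budget form "sum of p_i k_i at most K", and B is held fixed at 0.5.

```python
from itertools import combinations_with_replacement

def best_importance_ranked(p, q, K, k_max, B=0.5):
    """Brute-force search over importance-ranked integer hash assignments."""
    N = len(p)
    # Rank objects by importance score q_i / p_i, most important first.
    order = sorted(range(N), key=lambda i: q[i] / p[i], reverse=True)
    best_cost, best_k = None, None
    # Non-decreasing cut points split the ranking into k_max runs that
    # receive k_max, k_max - 1, ..., 1 hashes respectively.
    for cuts in combinations_with_replacement(range(N + 1), k_max - 1):
        bounds = (0,) + cuts + (N,)
        k = [0] * N
        for level in range(k_max):
            for rank in range(bounds[level], bounds[level + 1]):
                k[order[rank]] = k_max - level
        if sum(pi * ki for pi, ki in zip(p, k)) > K:
            continue  # violates the hash budget (assumed "<= K" form)
        cost = sum(qi * B ** ki for qi, ki in zip(q, k))  # O(N) per solution
        if best_cost is None or cost < best_cost:
            best_cost, best_k = cost, k
    return best_k, best_cost
```

Each candidate is defined by k_max - 1 cut points in the range 0..N, which is where the (N+1)^(k_max-1) bound on the number of importance-ranked solutions comes from.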

8
Faster Solutions
  • $(2+\epsilon)$-approximation:
      • the problem of identifying the best
        importance-ranked solution can be transformed
        into a knapsack problem
      • dynamic programming produces a
        $(2+\epsilon)$-approximation solution in
        $O(N^2/\epsilon)$ time.
  • Coarse-grained optimization:
      • partition a large number of objects into a small
        number of groups (objects in each group have
        similar importance scores)
      • optimize at the group granularity (then assign
        an equal hash number to all objects within one
        group) ⇒ much smaller $N$.
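
The coarse-grained step might look like the sketch below. The grouping rule used here (equal-size chunks of the importance ranking) and the function name `group_by_importance` are assumptions for illustration; the slides only say that objects in a group have similar importance scores.

```python
def group_by_importance(p, q, num_groups):
    """Merge objects with similar importance q_i / p_i into groups."""
    order = sorted(range(len(p)), key=lambda i: q[i] / p[i], reverse=True)
    size = -(-len(order) // num_groups)  # ceiling division
    groups = []
    for start in range(0, len(order), size):
        members = order[start:start + size]
        groups.append({
            "members": members,                 # original object indices
            "p": sum(p[i] for i in members),    # aggregated membership popularity
            "q": sum(q[i] for i in members),    # aggregated query popularity
        })
    return groups
```

Because all objects in a group share one hash number, summing p and q within a group leaves the objective and the constraint unchanged, so an optimizer can treat each group as a single object.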

9
Evaluation on Synthetic Data
  • Non-member query popularity $q_i$ follows a Zipf-like
    distribution.
  • Membership popularity $p_i$ follows a uniform
    distribution.
  • Our integer approximation solution significantly
    outperforms the real-rounding solution,
    particularly at high popularity skewness.
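
A synthetic workload of this shape could be generated as in the sketch below. The function names and the skew parameter `alpha` are assumptions; the slides do not give the exact parameters used in the evaluation.

```python
def zipf_like(N, alpha):
    """Popularity of rank r proportional to 1 / r**alpha, normalized to sum to 1."""
    weights = [1.0 / (r ** alpha) for r in range(1, N + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def uniform(N):
    """Every object equally popular."""
    return [1.0 / N] * N
```

A larger `alpha` gives a more skewed query distribution, and `alpha = 0` reduces to the uniform case.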

10
Trace-driven Evaluation on Distributed Caching
  • Distributed caches exchange their content (set of
    cached web objects) to cooperate.
  • Evaluation driven by web access traces from
    IRCache.net.

11
Trace-driven Evaluation on Distributed Keyword Searching
  • Distributed search engines pass keyword indexes
    to support distributed joins. False positives are
    resolved by additional communication.
  • Evaluation driven by web page listings at dmoz.com
    and keyword query traces at Ask.com.

12
Related Work
  • Compressed Bloom filters [Mitzenmacher 2002].
  • Bloom filters with additional functionalities:
      • deletion [Fan et al. 2000]
      • frequency queries [Cohen and Matias 2003]
      • associating objects with values [Chazelle et al.
        2004].
  • Alternative data structure [Pagh et al. 2005].
  • Weighted Bloom filters [Bruck et al. 2006]:
      • optimal real-number solution with integer
        rounding
      • analytically, the rounding-induced error increase
        is unbounded
      • practically, the error increase can be
        substantial.

13
Conclusions
  • Popularity conscious Bloom filters:
      • motivated by skewed, stable data popularity
        distributions
      • customize each object's hash number according to
        its popularity in sets and queries.
  • Unrestricted real-number problem:
      • the solution is optimal when each object's hash
        number is linear in log(query-pop/set-pop).
  • Ranged integer problem:
      • query-pop/set-pop serves as an object importance
        indicator
      • $O(N^{k_{\max}})$-time 2-approximation
      • $O(N^2/\epsilon)$-time $(2+\epsilon)$-approximation.
  • Quantitative evaluations driven by real
    distributed application traces.