Bloom Filters - PowerPoint PPT Presentation

Slides: 17
Provided by: TS189

Transcript and Presenter's Notes

Title: Bloom Filters


1
Bloom Filters
  • Lookup questions: does item x exist in a set or
    multiset?
  • The data set may be very big or expensive to access,
    so filter out lookups with negative results
    before accessing the data.
  • Allow false positive errors, as they only cost us
    an extra data access.
  • Don't allow false negative errors, because they
    result in wrong answers.

2
Bloom Filter [B70]
  • Encoding an attribute a ∈ U
  • Maintain a bit vector V of size m
  • Use k hash functions (h1..hk), hi: U → {1..m}
  • Encoding: for item x, turn on bits
    V[h1(x)]..V[hk(x)].
  • Lookup: for item i, check bits V[h1(i)]..V[hk(i)]. If all
    equal 1, return "Probably Yes". Else return "Definitely
    No".
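
The encoding and lookup steps above can be sketched in a few lines of Python. This is a minimal illustration, not the slides' implementation: the class name, the method names, and the way h1..hk are derived (SHA-256 salted with the hash index) are assumptions, and positions are 0-based.

```python
import hashlib


class BloomFilter:
    """Bit vector V of size m with k hash functions (0-based positions)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # V, one list entry per bit for readability

    def _positions(self, item: str):
        # Assumed hash family: SHA-256 salted with the hash index j.
        # Any k independent, well-mixed hash functions would do.
        for j in range(self.k):
            digest = hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        # Encoding: turn on bits V[h1(x)]..V[hk(x)].
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item: str) -> bool:
        # True means "probably yes"; False means "definitely no".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024, k=4)
for url in ("a.example/index", "b.example/about"):
    bf.add(url)
print(bf.might_contain("a.example/index"))
```

An inserted item is always reported as present. An item that was never inserted is usually rejected, but may occasionally be reported as present.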

3
Bloom Filter
[Figure: item x is hashed by h1(x), h2(x), h3(x), ..., hk(x) to positions in the bit vector V[0]..V[m-1].]
4
Bloom Errors
[Figure: items a, b, c, d and the bit vector V[0]..V[m-1], with the positions h1(x), h2(x), h3(x), ..., hk(x) of item x marked.]
x didn't appear, yet its bits are already set.
5
Error Estimation
  • Assumption: hash functions are perfectly random.
  • Probability of a bit being 0 after hashing all
    n elements: $(1 - 1/m)^{kn} \approx e^{-kn/m}$
  • Let $p = e^{-kn/m}$; the probability of a false positive is
    $(1 - p)^k = (1 - e^{-kn/m})^k$
  • Assuming we are given m and n, the optimal k is
    $k = (m/n) \ln 2$
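
For reference, the standard derivation behind these estimates can be written out as follows. It relies on the same assumption of perfectly random hash functions; the symbol f for the false positive probability is introduced here for convenience and does not appear on the slide.

```latex
\begin{align*}
\Pr[\text{a given bit is } 0] &= \left(1 - \tfrac{1}{m}\right)^{kn} \approx e^{-kn/m} = p \\
f &= (1 - p)^k = \left(1 - e^{-kn/m}\right)^k \\
\ln f &= k \ln(1 - p) = -\frac{m}{n} \ln p \, \ln(1 - p)
  \qquad \text{since } k = -\frac{m}{n} \ln p \\
\intertext{The product $\ln p \, \ln(1 - p)$ is symmetric in $p$ and $1 - p$ and is largest at $p = 1/2$, so $f$ is smallest there:}
k_{\text{opt}} &= \frac{m}{n} \ln 2, \qquad
f_{\min} = \left(\tfrac{1}{2}\right)^{k_{\text{opt}}} \approx 0.6185^{\,m/n}
\end{align*}
```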

6
Bloom Filter Tradeoffs
  • Three factors: m, k and n.
  • Normally, n and m are given, and we select k.
  • Small k:
      • Fewer computations.
      • The number of bit positions written (nk) is smaller,
        so the chance of a step-over (hitting a bit that is
        already set) is smaller too.
      • However, fewer bits need to be stepped over to
        generate an error.
  • For big k, the exact opposite holds.
  • Not surprisingly, when k is optimal, the hit
    ratio (fraction of bits set in the array) is
    0.5, since p = e^(-kn/m) = 1/2 at the optimal k.
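
The tradeoff can be checked numerically. The sketch below evaluates the false positive estimate (1 - e^(-kn/m))^k for a range of k. The sizes m = 10,000 and n = 1,000 are illustrative values chosen here, not figures from the slides.

```python
import math


def false_positive_rate(m: int, n: int, k: int) -> float:
    """Estimate (1 - p)^k with p = e^(-kn/m), assuming random hashing."""
    p = math.exp(-k * n / m)  # probability that a given bit is still 0
    return (1.0 - p) ** k


m, n = 10_000, 1_000  # illustrative sizes (assumption)
rates = {k: false_positive_rate(m, n, k) for k in range(1, 15)}
best_k = min(rates, key=rates.get)
fill_at_best = 1.0 - math.exp(-best_k * n / m)  # expected fraction of set bits

for k in (2, best_k, 12):
    print(f"k={k:2d}  false-positive rate={rates[k]:.4f}")
print(f"best integer k={best_k}, (m/n) ln 2={m / n * math.log(2):.2f}, "
      f"fraction of bits set={fill_at_best:.3f}")
```

Both a small k and a large k give a worse rate than the integer nearest to (m/n) ln 2, and at that k roughly half of the bits are set.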

7
Summary Cache [FCAB00]
  • Proxy servers maintain a local cache to minimize
    expensive internet requests.
  • A proxy must maintain an efficient lookup method
    into the cache.
  • The lookup structure must be stored in DRAM for
    performance.
  • The structure must be compact, as DRAM is expensive
    and is also used for hot-item storage and more.
  • Pages are usually replaced in the cache using an
    LRU algorithm.

8
ICP Request Handling
[Figure: ICP request handling; surviving labels: Client, Internet.]
9
Internet Cache Protocol (ICP)
  • Allows for scaling out when using proxies.
  • A protocol that supports discovery and retrieval of
    documents from neighboring caches.
  • Establishes a hierarchy of proxy caches.
  • If a page is not found in the local proxy cache, the
    proxy searches for it in neighboring proxies.
  • If the page is not found anywhere, it is fetched from
    the internet.
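
The lookup order described above can be illustrated with a toy model in which caches are plain dictionaries. This is only a sketch of the decision flow: the function name and data are hypothetical, and real ICP exchanges query and reply messages over the network rather than inspecting neighbor caches directly.

```python
def icp_fetch(url, local_cache, neighbor_caches, origin):
    """Return (source, body, queries_sent) for one request."""
    if url in local_cache:
        return "local", local_cache[url], 0
    # On a local miss, every neighbor is queried (modeled as a dict lookup).
    queries = len(neighbor_caches)
    hits = [name for name, cache in neighbor_caches.items() if url in cache]
    if hits:
        return hits[0], neighbor_caches[hits[0]][url], queries
    # Not found anywhere: fetch from the origin server on the internet.
    return "origin", origin[url], queries


local = {"/a": "A"}
neighbors = {"proxy-b": {"/b": "B"}, "proxy-c": {"/c": "C"}}
origin = {"/a": "A", "/b": "B", "/c": "C", "/d": "D"}

print(icp_fetch("/c", local, neighbors, origin))
print(icp_fetch("/d", local, neighbors, origin))
```

Every local miss costs one query per neighbor, even when no neighbor holds the page.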

10
ICP Request Handling
[Figure: ICP request handling; surviving labels: Client, Internet.]
11
Summary Cache
  • Each proxy maintains a Bloom Filter representing
    its local cache.
  • It also holds Bloom Filters representing the caches
    of other proxies.
  • Updates to Bloom Filters are exchanged
    periodically, or after a certain percentage of the
    documents in the cache has been replaced.
  • An ICP request is sent only to a proxy whose summary
    indicates that it holds the requested document.
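
A minimal sketch of how a proxy might pick which neighbors to query from their summaries. The helper names, the filter size (m = 256, k = 4), the URLs, and the SHA-256-based hash family are all assumptions made for illustration.

```python
import hashlib


def positions(url: str, m: int, k: int):
    # Assumed hash family: SHA-256 salted with the hash index.
    result = []
    for j in range(k):
        digest = hashlib.sha256(f"{j}:{url}".encode("utf-8")).digest()
        result.append(int.from_bytes(digest[:8], "big") % m)
    return result


def build_summary(urls, m: int = 256, k: int = 4):
    """Bloom filter (list of 0/1) summarizing one proxy's cache."""
    bits = [0] * m
    for url in urls:
        for pos in positions(url, m, k):
            bits[pos] = 1
    return bits


def candidates(url, summaries, k: int = 4):
    """Names of neighbors whose summary says 'probably yes' for url."""
    return [name for name, bits in summaries.items()
            if all(bits[pos] for pos in positions(url, len(bits), k))]


neighbor_caches = {
    "proxy-b": {"http://example.org/a", "http://example.org/b"},
    "proxy-c": {"http://example.org/c"},
}
summaries = {name: build_summary(urls) for name, urls in neighbor_caches.items()}
targets = candidates("http://example.org/c", summaries)
print(targets)
```

Only the proxies in `targets` receive a query. A proxy that actually holds the document always appears in the list as long as its summary is up to date; a proxy that does not hold it appears only on a false positive.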

12
ICP With Summary Cache
[Figure: ICP with Summary Cache; surviving labels: Client, Internet.]
13
Summary Cache Bloom Filters
  • To support deletions and updates, the proxy
    maintains the Bloom Filter and also an array of
    counters C, initially set to 0.
  • The Bloom Filter is filled with the contents of
    the cache.
  • Each bit in the BF is allotted a 4-bit
    counter.
  • On insert of item i, all counters C[h1(i)]..C[hk(i)]
    are incremented (to a maximum of 15).
  • On deletion of item i, the same counters are decremented.
  • When a counter C[j] increases from 0 to 1, bit V[j] is
    turned on.
  • When a counter C[j] decreases from 1 to 0, bit V[j] is
    turned off.
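
The counter scheme can be sketched as below. The class and method names and the SHA-256-based hash family are assumptions. So is the handling of saturated counters: this sketch leaves a counter that has reached 15 untouched on deletion, which is one conservative choice and not necessarily what the slides' system does.

```python
import hashlib

MAX_COUNT = 15  # a 4-bit counter saturates at 15


class CountingBloomFilter:
    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.counters = [0] * m  # C, kept locally by the proxy
        self.bits = [0] * m      # V, the part that acts as the summary

    def _positions(self, item: str):
        # Assumed hash family: SHA-256 salted with the hash index.
        return [
            int.from_bytes(
                hashlib.sha256(f"{j}:{item}".encode("utf-8")).digest()[:8], "big"
            ) % self.m
            for j in range(self.k)
        ]

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            if self.counters[pos] < MAX_COUNT:
                self.counters[pos] += 1
            if self.counters[pos] == 1:
                self.bits[pos] = 1  # counter went 0 -> 1: turn the bit on

    def delete(self, item: str) -> None:
        for pos in self._positions(item):
            # Assumption: saturated counters are never decremented.
            if 0 < self.counters[pos] < MAX_COUNT:
                self.counters[pos] -= 1
                if self.counters[pos] == 0:
                    self.bits[pos] = 0  # counter went 1 -> 0: turn the bit off

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(item))


cbf = CountingBloomFilter(m=64, k=3)
cbf.insert("http://example.org/a")
cbf.insert("http://example.org/b")
cbf.delete("http://example.org/a")
print(cbf.might_contain("http://example.org/b"))
```

Deleting one item never clears a bit that another inserted item still depends on, because the shared counter stays above zero.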

14
Summary Cache Bloom Filters
  • Hashing scheme:
      • Generate 128 bits using MD5 on the URL.
      • Divide them into segments of M bits (usually 32).
      • Take each segment modulo m, providing
        128/M hash values (4, for 32-bit segments).
      • If 128 bits are not enough, calculate MD5 of the URL
        concatenated with itself.
  • Bloom Filter exchange:
      • The header contains the MD5 properties and the size of the array.
      • If the refresh rate is high, send only deltas.
      • Otherwise, send the entire Bloom Filter.
      • Bit counts (the counters) are internal and are not exchanged.
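
The hashing scheme can be sketched with Python's standard `hashlib`. The function name, the big-endian interpretation of each segment, and the extension to further repetitions of the URL when more than two digests are needed are assumptions made for this illustration. `segment_bits` is assumed to be a multiple of 8.

```python
import hashlib


def md5_hash_values(url: str, m: int, k: int = 4, segment_bits: int = 32):
    """Derive k positions in [0, m) from MD5 digests of the URL."""
    seg_bytes = segment_bits // 8
    url_bytes = url.encode("utf-8")
    data = url_bytes
    values = []
    while len(values) < k:
        digest = hashlib.md5(data).digest()  # 128 bits = 16 bytes
        for start in range(0, len(digest) - seg_bytes + 1, seg_bytes):
            segment = int.from_bytes(digest[start:start + seg_bytes], "big")
            values.append(segment % m)
        # Not enough bits yet: hash the URL concatenated with itself
        # (and, as an assumption here, with further copies after that).
        data += url_bytes
    return values[:k]


vals = md5_hash_values("http://www.squid-cache.org/", m=1000)
print(vals)
```

With 32-bit segments a single digest yields the four hash values mentioned above; asking for more than four triggers the second digest.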

15
Summary Cache - Errors
  • False Misses:
      • The requested document is cached at some remote
        proxy, but the summary does not reflect that fact.
      • The hit ratio is reduced and a redundant internet access
        is performed.
  • False Hits:
      • The document is not at a remote proxy, but the summary
        suggests that it is.
      • An inter-proxy query message is wasted.
  • Remote Stale Hits:
      • The document is cached at a remote proxy, but is
        stale.
      • Occurs in both ICP and Summary Cache.
      • Might not be a totally wasted effort, as delta
        compression can be used.

16
Implementation - Squid
  • Squid: a publicly available web proxy cache
    software package.
  • http://www.squid-cache.org
  • Summary Cache is implemented in Squid v1.1.14.
  • A variation called Cache Digest is implemented in
    Squid 1.2b20.