Title: The Bloom Paradox
1. The Bloom Paradox
Ori Rottenstreich
Joint work with Yossi Kanizo and Isaac Keslassy
Technion, Israel
2. Problem Definition
- Requirement: a data structure at the user with a fast answer to the membership query "is $x \in S$?", where $S$ is the set of elements in the local cache.
- Solutions:
  - O(n): searching in a list
  - O(log(n)): searching in a sorted list
  - O(1): but with false positives / negatives
[Figure: the user requests elements x and y. The local cache S holds a subset of the elements (access cost 1); the central memory M holds all elements (access cost 10).]
3. Two Possible Errors
- False Positive: $x \notin S$ but the data structure answers $x \in S$.
  - Results in a redundant access to the local cache.
  - Additional cost of 1.
- False Negative: $x \in S$ but the data structure answers $x \notin S$.
  - Results in an expensive access to the central memory instead of the local cache.
  - Additional cost of 10 - 1 = 9.
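The two error penalties can be checked with a short Python sketch. The costs 1 (local cache) and 10 (central memory) are taken from the slides; the function name `access_cost` and the assumption that a failed cache access is followed by a central memory access are illustrative choices.

```python
CACHE_COST = 1     # cost of one access to the local cache S (from the slides)
MEMORY_COST = 10   # cost of one access to the central memory M (from the slides)


def access_cost(in_cache: bool, filter_says_in_cache: bool) -> int:
    """Total cost of fetching one element, given the truth and the filter's answer."""
    if not filter_says_in_cache:
        # The cache is skipped and the central memory is accessed directly.
        return MEMORY_COST
    if in_cache:
        # True positive: the cache access succeeds.
        return CACHE_COST
    # False positive: a wasted cache access, then the central memory.
    return CACHE_COST + MEMORY_COST


# Extra cost of each error relative to the correct answer for the same element.
false_positive_penalty = access_cost(False, True) - access_cost(False, False)
false_negative_penalty = access_cost(True, False) - access_cost(True, True)
print(false_positive_penalty, false_negative_penalty)  # 1 9
```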
4. Bloom Filters (Bloom, 1970)
- Initialization: array of $m$ zero bits.
- Insertion: each of the $n$ elements is hashed $k$ times; the corresponding bits are set.
- Query: hashing the element and checking that all $k$ bits are set.
- False positive rate (probability) of approximately $(1 - e^{-kn/m})^k$.
- No false negatives.
[Figure: the bit array before and after inserting x and y, followed by queries for z, x, and w.]
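A minimal Python sketch of the three operations above. The class name, the parameter names `m` and `k`, and the use of salted SHA-256 as the hash family are assumptions made for illustration; the slides do not specify a hashing scheme.

```python
import hashlib
import math


class BloomFilter:
    """Minimal Bloom filter with m bits and k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.bits = [0] * m  # initialization: array of zero bits

    def _positions(self, item: str):
        # Derive k positions by salting SHA-256 with the hash index.
        # This hashing choice is an assumption, not taken from the slides.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos] = 1

    def query(self, item: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


def false_positive_rate(n: int, m: int, k: int) -> float:
    """Standard approximation (1 - e^{-kn/m})^k after n insertions."""
    return (1.0 - math.exp(-k * n / m)) ** k


bf = BloomFilter(m=1024, k=4)
for element in ("x", "y"):
    bf.insert(element)
print(bf.query("x"), bf.query("y"))  # True True: inserted elements are never missed
print(false_positive_rate(n=2, m=1024, k=4))
```

A query for an element that was never inserted usually returns False, but may return True when all of its bits were set by other elements.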
5. Bloom Filters are Widely Used
- Cache/Memory Framework
- Packet Classification
- Intrusion Detection
- Routing
- Accounting
- Beyond networking: Spell Checking, DNA Classification
- Can be found in:
  - Google's web browser Chrome
  - Google's database system BigTable
  - Facebook's distributed storage system Cassandra
  - Mellanox's IB Switch System
6. Outline
- Introduction to Bloom Filters
- The Bloom Paradox
- The Variable-Increment Counting Bloom Filter
7. The Bloom Paradox
Sometimes, it is better to disregard the Bloom
filter results, and in fact not to even query it,
thus making the Bloom filter useless.
8. Example
- Parameters
- Extreme case without locality: all elements have equal probability of belonging to the cache.
- Toy example
9. The Bloom Paradox
- Parameters
- Let $B$ be the set of elements that the Bloom filter indicates are in $S$.
  - In particular, no false negatives $\Rightarrow S \subseteq B$.
- Intuition
[Figure: the user consults the Bloom filter B before accessing either the local cache S (cost 1) or the central memory M with all elements (cost 10).]
10. The Bloom Paradox
- Parameters
- Let $B$ be the set of elements that the Bloom filter indicates are in $S$.
  - In particular, no false negatives $\Rightarrow S \subseteq B$.
- Surprise
11. The Bloom Paradox
- Parameters
- Let $B$ be the set of elements that the Bloom filter indicates are in $S$.
  - In particular, no false negatives $\Rightarrow S \subseteq B$.
- Surprise
  - The Bloom filter indicates the membership of $|B|$ elements. Only $|S|$ of them are indeed in $S$.
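12. The Bloom Paradox
- When the Bloom filter states that $x \in S$, it is wrong with probability $1 - |S|/|B|$, where $B$ is the set of elements the filter reports as members of $S$.
- Average cost if we listen to the Bloom filter: a cache access (cost 1), plus a central memory access (cost 10) whenever the filter is wrong.
- Average cost if we don't: a central memory access (cost 10).
- The Bloom filter is useless!
$\Rightarrow$ Don't listen to the Bloom filter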
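The comparison can be reproduced with a short Python sketch. The costs 1 and 10 come from the slides; the memory size, cache size, and false positive rate below are hypothetical values picked for illustration, not the parameters of the original toy example. All elements are assumed equally likely to be in the cache, as in the no-locality case.

```python
CACHE_COST = 1     # from the slides
MEMORY_COST = 10   # from the slides


def wrong_probability(total: int, cached: int, fpr: float) -> float:
    """P(x not in S | filter says x in S) when all elements are equally likely."""
    true_positives = cached                    # no false negatives
    false_positives = (total - cached) * fpr   # expected number of false positives
    return false_positives / (true_positives + false_positives)


def cost_if_listening(p_wrong: float) -> float:
    # A positive answer triggers a cache access; when the filter was wrong,
    # a central memory access follows.
    return CACHE_COST + p_wrong * MEMORY_COST


def cost_if_ignoring() -> float:
    # Go straight to the central memory, which holds every element.
    return MEMORY_COST


# Hypothetical parameters: 10^9 elements in memory, 10^4 in the cache, FPR 10^-3.
p_wrong = wrong_probability(total=10**9, cached=10**4, fpr=1e-3)
print(f"P(wrong | positive) = {p_wrong:.3f}")
print(f"listen: {cost_if_listening(p_wrong):.2f}  ignore: {cost_if_ignoring():.2f}")
```

With these costs, listening to a positive answer pays off only when the filter is wrong with probability below 0.9; a large memory with a small cache pushes the error probability above that threshold even for a small false positive rate.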
13. Outline
- Introduction to Bloom Filters
- The Bloom Paradox
- The Variable-Increment Counting Bloom Filter
14. Counting Bloom Filters (CBFs)
- Bloom filters do not support deletions of elements. Simply resetting bits might cause false negatives.
- The solution: Counting Bloom filters, which store an array of counters instead of bits.
  - Insertion: incrementing counters by one.
  - Deletion: decrementing counters by one.
  - Query: checking that counters are positive.
- The same false positive probability.
- Require too much memory, e.g. 57 bits per element for a false positive rate of $10^{-3}$.
[Figure: counter array with elements x and y inserted.]
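The counter-based operations can be sketched in Python as follows. The class name and the salted SHA-256 hash family are assumptions for illustration; only the increment, decrement, and positivity rules come from the slide.

```python
import hashlib


class CountingBloomFilter:
    """Minimal counting Bloom filter with m counters and k hash functions."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m  # counters instead of bits

    def _positions(self, item: str):
        # Hypothetical hash family: SHA-256 salted with the hash index.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item: str) -> None:
        for pos in self._positions(item):
            self.counters[pos] += 1

    def delete(self, item: str) -> None:
        # Only valid for an element that was previously inserted.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def query(self, item: str) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(item))


cbf = CountingBloomFilter(m=1024, k=3)
cbf.insert("x")
cbf.insert("y")
cbf.delete("y")
print(cbf.query("x"))  # True: x is still present after y is deleted
```

Deleting y leaves the counters of x positive, which is exactly what resetting bits in a plain Bloom filter cannot guarantee.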
15. Intuition for Variable Increments
- Upon query, we should consider the exact values of the counters and not just their positiveness.
- Can we design a deterministic scheme that exploits the exact values of the counters?
- Idea: use variable increments to encode the element identity.
[Figure: counter array queried for elements z and y.]
16. Architecture
- Each hash entry contains a pair of counters:
  - $c_1$, fixed increments $\Rightarrow$ number of elements in the entry (as in CBF)
  - $c_2$, variable increments $\Rightarrow$ weighted sum of elements, with weights from a pre-determined set $D$
- We use two sets of hash functions:
  - The first set uses $k$ hash functions with range $\{1, \dots, m\}$, i.e. it points to the set of $m$ entries.
  - The second set uses $k$ hash functions with range $\{1, \dots, |D|\}$, i.e. it points to the set $D$.
[Figure: array of entries, each holding a fixed-increment counter c1 and a variable-increment counter c2.]
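17. Insertion
- Insertion of an element $x$
  - At each entry that $x$ hashes to, the two counters are updated as follows:
    - $c_1$ is incremented by one
    - $c_2$ is incremented by a variable increment from the pre-determined set, selected by the second set of hash functions
- Example 1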
[Figure: inserting x and z. In the hashed entries the c1 counters change 3 to 4, 0 to 1, 3 to 4, and 4 to 5; the c2 counters change 25 to 29, 30 to 43, 30 to 34, and 0 to 8, i.e. variable increments of 4, 13, 4, and 8.]
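The insertion rule can be sketched in Python as below. The set `D = (4, 8, 13)` is an assumption: these are the increment values visible in the slide's figure, but the deck's actual set is not recoverable. The class name and the salted SHA-256 hashing are likewise illustrative.

```python
import hashlib

# Assumed set of variable increments (values that appear in the slide's figure).
D = (4, 8, 13)


def _hash(item: str, salt: str, modulus: int) -> int:
    # Hypothetical hash family: SHA-256 over a salted key.
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class VariableIncrementCBF:
    """Each entry holds a pair of counters (c1, c2)."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.c1 = [0] * m  # fixed increments: number of elements in the entry
        self.c2 = [0] * m  # variable increments: weighted sum of elements

    def _entries_and_increments(self, item: str):
        for i in range(self.k):
            entry = _hash(item, f"h{i}", self.m)          # first set of hash functions
            increment = D[_hash(item, f"g{i}", len(D))]   # second set of hash functions
            yield entry, increment

    def insert(self, item: str) -> None:
        for entry, increment in self._entries_and_increments(item):
            self.c1[entry] += 1
            self.c2[entry] += increment


vicbf = VariableIncrementCBF(m=16, k=2)
vicbf.insert("x")
vicbf.insert("z")
print(list(zip(vicbf.c1, vicbf.c2)))
```

Deletion would mirror insertion by subtracting the same pair of amounts from each hashed entry.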
18. Query
- Query (element $y$, with variable increments 4 and 8)
- We ask whether:
  - 17 can be a sum of 2 elements from the set, including 4
  - 30 can be a sum of 3 elements from the set, including 8
- No, so $y \notin S$.
- How should we pick the set of variable increments?
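The per-entry question asked during a query can be sketched as a brute-force check in Python. The function name `can_contain` is hypothetical, and the set `D = (4, 8, 13)` is an assumption built from increment values seen in the deck's insertion figure; the actual set is not recoverable from the slide.

```python
from itertools import combinations_with_replacement


def can_contain(c1: int, c2: int, increment: int, increments) -> bool:
    """Can c2 be a sum of c1 values from `increments`, one of them being `increment`?"""
    if c1 < 1:
        return False  # an empty entry cannot contain the element
    remaining = c2 - increment
    # Try every multiset of the other c1 - 1 increments.
    return any(
        sum(rest) == remaining
        for rest in combinations_with_replacement(increments, c1 - 1)
    )


D = (4, 8, 13)  # assumed set of variable increments

print(can_contain(2, 17, 4, D))  # True: 17 = 4 + 13
print(can_contain(3, 30, 8, D))  # False: no two values of D sum to 22
```

Under this assumed set, one failed check is enough to conclude that the queried element is absent, since an inserted element would have contributed its increment to every one of its entries.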
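19. $B_h$ Sequences
- Definition 1
  - Let $D = \{v_1, v_2, \dots, v_\ell\}$ be a sequence of positive integers.
  - Then, $D$ is a $B_h$ sequence iff all the sums $v_{i_1} + v_{i_2} + \dots + v_{i_h}$ with $1 \le i_1 \le \dots \le i_h \le \ell$ are distinct.
- Example 2
  - All the sums of $h$ elements of $D$ are distinct.
  - Therefore, $D$ is a $B_h$ sequence.
- $B_h$ sequences are widely used in error-correcting codes.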
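The definition can be tested directly by enumerating all sums of $h$ elements with repetition allowed. This is a brute-force Python sketch; the function name `is_bh_sequence` and the example sequences are illustrative and are not the ones from the slide.

```python
from itertools import combinations_with_replacement


def is_bh_sequence(values, h: int) -> bool:
    """True iff all sums of h elements (repetition allowed, order ignored) are distinct."""
    seen = set()
    for combo in combinations_with_replacement(sorted(values), h):
        total = sum(combo)
        if total in seen:
            return False  # two different multisets give the same sum
        seen.add(total)
    return True


print(is_bh_sequence([1, 2, 4], 2))  # True: the sums are 2, 3, 5, 4, 6, 8
print(is_bh_sequence([1, 2, 3], 2))  # False: 1 + 3 == 2 + 2
```

The enumeration grows quickly with the sequence length and with $h$, so this check is only practical for small sets.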
20. The Bh-CBF Scheme: Query
- Example 3: [...] is a $B_h$ sequence.
- Since [...], the Bh-CBF can determine that [...].
21. The Bh-CBF Scheme: Operations (Query)
- Here, [...] and then necessarily [...].
- Since [...], the Bh-CBF can determine that [...].
22. The Bh-CBF Scheme: Operations (Query)
- Since [...], the Bh-CBF cannot exclude that [...].
23. Experimental Results
- Internet trace (equinix-chicago) with real hash functions.
- For the Bh-CBF, [...] (with [...]).
24. Experimental Results
- Internet trace (equinix-chicago) with real hash functions.
- For the Bh-CBF, [...] (with [...]).
- For the VI-CBF, [...] and [...].
25. Concluding Remarks
- The Bloom Paradox
  - Discovery of the Bloom paradox
  - Importance of the a priori membership probability
- The Variable-Increment Counting Bloom Filter
  - Can extend many variants of the counting Bloom filter
  - First time $B_h$ sequences are presented in networking applications
26. Thank You