The Bloomier Filter - PowerPoint PPT Presentation
Transcript and Presenter's Notes

Title: The Bloomier Filter


1
The Bloomier Filter
  • Bernard Chazelle, Princeton Un., NEC Lab.
  • Joe Kilian, NEC Lab.
  • Ronitt Rubinfeld, NEC Lab.
  • Ayellet Tal, Technion, Princeton Un.

Presented by Lilach Bien
2
Overview
  • The Problem
  • Definitions
  • The algorithm
  • Analysis
  • Lower Bounds
  • Deterministic algorithm
  • Mutable version of the problem

3
The Problem
  • Bloom Filters → Bloomier Filters

4
The Problem: Bloom Filters
  • A large set of data D, with a small subset S
  • We want to query whether an item d belongs to S
  • No false negatives (if d belongs to S, we'll recognize it)
  • A small false positive rate (we may say d belongs to S although it doesn't)
  • Allowing a small false positive rate enables building a compact data structure
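The Bloom-filter behavior above (no false negatives, a small false-positive rate) can be sketched as follows; this is a generic illustration, not the paper's construction, and the hash derivation and parameters are arbitrary choices:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch; sizes and hash scheme are illustrative."""

    def __init__(self, m_bits=256, k=4):
        self.m, self.k = m_bits, k
        self.bits = 0                     # bit array packed into one int

    def _indices(self, item):
        # Derive k table indices from one SHA-256 digest (a simple choice).
        digest = hashlib.sha256(str(item).encode()).digest()
        return [int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
                for i in range(self.k)]

    def add(self, item):
        for i in self._indices(item):
            self.bits |= 1 << i

    def query(self, item):
        # Never a false negative; false positives occur with small probability.
        return all((self.bits >> i) & 1 for i in self._indices(item))
```

With these parameters, every added item is always found, while a non-member is reported present only when all k of its bits happen to be set.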

5
The Problem: Bloomier Filters
  • Bloom filters: membership queries on a small subset S of D.
  • Bloomier filters: computing arbitrary functions defined only on a small subset S of D.
  • The function will be computed correctly for all members of S (no false negatives)
  • For items not in S, we almost always return a special value ⊥.
  • Allow dynamic updates to the function, as long as S doesn't change.

6
Example
  • D = {1,…,100}, S = {1,2,3}, R = {1,2}
  • f(1)=1, f(2)=1, f(3)=2
  • Queries: 1 2 87 55 40
  • Answers: 1 1 ⊥ ⊥ 1 (40 is a false positive)
  • Update: f(2) ← 2
  • Queries: 66 2 3
  • Answers: ⊥ 2 2

7
Bloomier Filters - Uses
  • Building a meta database for a union of
    databases.
  • Keeps track of which database contains
    information about each entry.
  • Maintaining directories when the data or code is maintained in multiple locations.

8
Definitions
9
Formal Definitions
  • f is a function from D = {0,…,N−1}
  • The range is R = {⊥, 1,…, 2^r − 1}
  • S = {t1,…,tn} is a subset of D of size n.
  • f(ti) = vi, vi ∈ R
  • f(x) = ⊥ for x outside of S
  • f can be specified by the assignment A = {(t1,v1),…,(tn,vn)}

10
Formal Definitions (Cont.)
  • Bloomier filters allow querying f at any point of S, always correctly
  • For a random x ∈ D\S the query returns f(x) = ⊥ with probability 1 − ε
  • The input to the algorithm is A and ε

11
Supported Operations
  • CREATE(A)
  • Given an assignment A = {(t1,v1),…,(tn,vn)}, we initialize the data structure Tables.
  • SET_VALUE(t, v, Tables)
  • For t ∈ D and v ∈ R we associate the value v with the domain element t in Tables.
  • It is required that t belongs to S.

12
Supported Operations (Cont.)
  • LOOKUP(t, Tables)
  • For t ∈ S we return the last value v associated with t.
  • For all but a fraction ε of D\S we return ⊥.
  • For the remaining elements of D\S we return an arbitrary element of R.

13
The Idea
  • We encode the values in R as elements of the additive group Q = {0,1}^q
  • Addition in Q is bitwise XOR
  • Any x ∈ R is transformed to Q by its q-bit binary expansion, ENCODE(x)
  • For y ∈ Q we define DECODE(y) as
  • The corresponding number in R, if y < |R|
  • ⊥ otherwise
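A minimal sketch of ENCODE/DECODE under these definitions, with q and |R| chosen arbitrarily for illustration:

```python
Q_BITS = 8   # q: width of the group Q = {0,1}^q (illustrative choice)
R_SIZE = 4   # |R|: encodings 0..R_SIZE-1 are legal values of R

def encode(x: int) -> int:
    """ENCODE: the q-bit binary expansion of x (a Python int stands in for the bit string)."""
    assert 0 <= x < R_SIZE <= 2 ** Q_BITS
    return x

def decode(y: int):
    """DECODE: the corresponding number in R if y < |R|, else ⊥ (None here)."""
    return y if 0 <= y < R_SIZE else None
```

So `decode(encode(x))` returns x for any legal value, while a q-bit value outside the range of R decodes to ⊥.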

14
The Idea (Cont.)
  • We'll save the function values for elements of S in a table.
  • We'll use a hash function to compute a random q-bit masking value M for every x in D.
  • To look up the value of x, we'll access a set of places in the table and calculate a q-bit number a.
  • We'll return M XOR a.

15
The Idea (Cont.)
  • If t is in S, we'll build the table so that a XOR M = ENCODE(f(t)).
  • Otherwise, since M is random, we'll get a random q-bit number y.
  • Proof: for the i-th bit of y,
  • suppose a_i = 0 (without loss of generality);
  • then y_i = a_i XOR M_i = M_i, which is 0 or 1 with probability 1/2 each.
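The masking claim can be checked exhaustively for a tiny q: XOR with a fixed value is a bijection on {0,1}^q, so if M is uniform then a XOR M is uniform too (a sanity check, not part of the original slides):

```python
q = 3
a = 0b101  # any fixed q-bit value computed from the table

# XOR with a fixed a permutes {0,1}^q, so a ^ M is uniform when M is uniform.
outputs = {a ^ M for M in range(2 ** q)}
assert outputs == set(range(2 ** q))
```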

16
The Idea (Cont.)
  • Since y is random, for big enough q, DECODE(y) returns ⊥ with high probability
  • If we store elements of R directly in the table, a random y decodes to a value of R (i.e., DECODE(y) ≠ ⊥) with probability |R|/2^q
  • We can do better.

17
Using 2 Tables
  • We have a table of size m, and a hash function HASH: D → {1,…,m}^k
  • If HASH(t) = (h1,…,hk), we say that {h1,…,hk} is the neighborhood of t, N(t)
  • For large enough m and k, we can choose for each t ∈ S an element τ(t) from N(t) such that
  • For each t, t′ ∈ S, t ≠ t′, it holds that τ(t) ≠ τ(t′)
  • If τ(t) = h_i, we use ι(t) to denote i.

18
Using 2 Tables (Cont.)
  • We'll use 2 tables
  • The first table will store values in {⊥, 1,…, k} encoded as values in Q.
  • It will return ι(t) for t in S, and return ⊥ for most of the other items.
  • The second table will store values in R.
  • For each t in S, the value f(t) will be in place τ(t).

19
Using 2 Tables (Cont.)
  • If x is in D\S, then with probability k/2^q the first table will not return ⊥.
  • In that case we will access the second table and return garbage.
  • Now we can also change function values if we want.
  • We use the first table to find which place in the second table stores the value we want to change.
  • We change the value in the second table.

20
The Algorithm
21
The First Table
  • Reminder:
  • We want to use the table to compute a value a for each item t in D.
  • For items in S, a XOR M will give us the encoded ι(t).
  • When we access the first table with an element t, we know N(t) = (h1,…,hk) and M.
  • We'll compute a = Table1[h1] XOR … XOR Table1[hk].
  • We want to set the values in the indices of N(t) so that
  • a XOR M will give us the encoded ι(t).

22
Order Respecting Matching
  • Let S be a set with neighborhood N(t) defined for each t ∈ S.
  • Let Π be a complete ordering on the elements of S.
  • A matching τ respects (S, Π, N) if
  • For all t ∈ S, τ(t) ∈ N(t)
  • If t_i >_Π t_j, then τ(t_i) ∉ N(t_j)

23
Order Respecting Matching (Cont.)
  • If a matching τ respects (S, Π, N) for the N defined by HASH, it has all the properties we wanted:
  • For all t ∈ S, τ(t) ∈ N(t)
  • For all t, t′ ∈ S, τ(t) ≠ τ(t′)
  • We may build the first table incrementally, in the order Π, so that
  • a XOR M will give us the encoded ι(t).

24
Building The First Table
  • Input:
  • Order Π
  • Neighborhood N(t) defined by HASH
  • Order respecting matching τ
  • For t = 1,…,n in the order Π, we set Table1[τ(t)] so that
  • Table1[h1] XOR … XOR Table1[hk] XOR M encodes ι(t).
  • Since τ is order respecting, we can't affect any value already set for t′ <_Π t.
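One step of this incremental construction can be sketched directly; the neighborhood, matched index, and parameters below are made-up values for illustration:

```python
import random
random.seed(0)

q, m = 8, 16
table1 = [random.randrange(2 ** q) for _ in range(m)]   # arbitrary initial contents

# Hypothetical element t: neighborhood N(t) = (h1, h2, h3) (0-based slots here),
# matched slot tau(t) = h2, i.e. iota(t) = 2, and t's random mask M from HASH.
neigh = (3, 7, 12)
iota = 2
M = random.randrange(2 ** q)
encoded = iota                       # ENCODE(iota(t)) as a q-bit integer

# Set Table1[tau(t)] = ENCODE(iota) XOR M XOR (XOR of t's other neighbors).
others = 0
for j, h in enumerate(neigh, start=1):
    if j != iota:
        others ^= table1[h]
table1[neigh[iota - 1]] = encoded ^ M ^ others

# Lookup invariant: XOR over the whole neighborhood, masked by M, recovers ENCODE(iota).
a = 0
for h in neigh:
    a ^= table1[h]
assert a ^ M == encoded
```

The XOR of the untouched neighbors cancels itself, which is why writing only the matched slot suffices.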

25
Finding A Good Ordering And Matching
  • We get S and HASH, and compute Π and τ so that τ is order respecting.
  • A location h ∈ {1,…,m} is a singleton for S if h ∈ N(t) for exactly one t ∈ S.
  • TWEAK(t, S, HASH) is the smallest value j such that h_j is a singleton for S, where N(t) = (h1,…,hk)
  • TWEAK(t, S, HASH) = ⊥ if no such j exists.
  • If TWEAK(t, S, HASH) is defined, we may set
  • ι(t) = TWEAK(t, S, HASH). This is an easy match.

26
Finding A Good Ordering And Matching (Cont.)
  • If t is an easy match, its matched location doesn't collide with the neighborhood of any other t′ ∈ S.
  • E = the subset of S with easy matches.
  • H = S \ E.
  • We recursively find (Π′, τ′) for H.
  • We extend (Π′, τ′) to (Π, τ):
  • We first put the ordered elements of H, and then the elements of E.
  • τ is the union of the matchings for H and E.

27
FIND_MATCH
  • FIND_MATCH(HASH, S)_{m,k}: find (Π, ι) for S, HASH
  • 1. E ← ∅
  •    For ti ∈ S:
  •      If TWEAK(ti, S, HASH) is defined:
  •        ιi ← TWEAK(ti, S, HASH)
  •        E ← E ∪ {ti}
  •    If E = ∅: Return failure
  • 2. H ← S \ E
  •    Recursively compute (Π′, ι′) ← FIND_MATCH(HASH, H)_{m,k}
  •    If FIND_MATCH(HASH, H)_{m,k} = failure: Return failure
  • 3. Π ← Π′
  •    For ti ∈ E:
  •      Add ti to the end of Π (i.e., make ti the largest element of Π thus far)
  •    Return (Π, ι = (ι1,…,ιn))
  •    (where ιi is determined for ti ∈ E in Step 1, and for ti ∈ H, via ι′, in Step 2.)
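The recursion above can be rendered in Python; here neighborhoods are given explicitly rather than by HASH, and the return convention (an ordered list, smallest element first, plus a 1-based slot index per key) is an illustrative choice:

```python
from collections import Counter

def find_match(neighborhoods):
    """FIND_MATCH sketch: neighborhoods maps each t in S to its tuple N(t) of
    table slots.  Returns (order, iota) with order listing S smallest-first in
    the ordering Pi and iota[t] the 1-based singleton slot index, or None on
    failure."""
    if not neighborhoods:
        return [], {}
    # A slot is a singleton if it lies in exactly one element's neighborhood.
    count = Counter(h for ns in neighborhoods.values() for h in set(ns))
    easy, iota = [], {}
    for t, ns in neighborhoods.items():
        for j, h in enumerate(ns, start=1):
            if count[h] == 1:              # TWEAK: smallest singleton index
                easy.append(t)
                iota[t] = j
                break
    if not easy:
        return None                        # no easy matches: failure
    hard = {t: ns for t, ns in neighborhoods.items() if t not in iota}
    rec = find_match(hard)                 # recurse on H = S \ E
    if rec is None:
        return None
    order, iota_hard = rec
    iota.update(iota_hard)
    return order + easy, iota              # easy matches become the largest elements
```

For example, on `{'a': (1, 2), 'b': (2, 3), 'c': (3, 4)}` the first round matches `a` and `c` through singleton slots 1 and 4, and the recursion then matches `b`, so `b` ends up smallest in the ordering.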

28
CREATE
  • CREATE(A = {(t1, v1),…, (tn, vn)})_{m,k,q} (create a mutable table)
  • 1. Uniformly choose HASH: D → {1,…,m}^k × {0,1}^q
  •    S ← {t1,…, tn}
  •    Create Table1, an array of m elements of {0,1}^q
  •    Create Table2, an array of m elements of R.
  •    (the initial values of both tables are arbitrary)
  •    Put (HASH, m, k, q) into the "header" of Table1
  •    (we assume that these values may be recovered from Table1)
  • 2. (Π, ι) ← FIND_MATCH(HASH, S)_{m,k}
  •    If FIND_MATCH(HASH, S)_{m,k} = failure: Goto Step 1
  • 3. For t = 1,…, n in the order Π:
  •    v ← A(t) (i.e., the value assigned by A to t)
  •    (h1,…,hk, M) ← HASH(t)
  •    l ← ι(t), L ← h_l (i.e., L = τ(t))
  •    Table1[L] ← ENCODE(l) ⊕ M ⊕ (⊕_{j≠l} Table1[h_j])
  •    Table2[L] ← v
  • 4. Return (Table = (Table1, Table2))

29
LOOKUP / SET_VALUE
  • LOOKUP(t, Table = (Table1, Table2))
  • 1. Get (HASH, m, k, q) from Table1
  •    (h1,…, hk, M) ← HASH(t)
  •    l ← DECODE(M ⊕ Table1[h1] ⊕ … ⊕ Table1[hk])
  • 2. If l is defined:
  •    L ← h_l
  •    Return Table2[L]
  •    Else: Return ⊥
  • SET_VALUE(t, v, Table = (Table1, Table2))
  • 1. Get (HASH, m, k, q) from Table1
  •    (h1,…, hk, M) ← HASH(t)
  •    l ← DECODE(M ⊕ Table1[h1] ⊕ … ⊕ Table1[hk])
  • 2. If l is defined:
  •    L ← h_l
  •    Table2[L] ← v
  •    Return success
  •    Else: Return failure
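Putting CREATE, LOOKUP, and SET_VALUE together, a toy end-to-end sketch (integer keys, a seeded PRNG standing in for HASH, the matching step inlined as a greedy singleton loop; all names and parameters are illustrative, not the paper's):

```python
import random
from collections import Counter

BOT = None  # stands in for the special value ⊥

class BloomierFilter:
    def __init__(self, assignment, m=64, k=3, q=16):
        self.m, self.k, self.q = m, k, q
        for seed in range(100):            # CREATE steps 1-2: redraw HASH until a matching exists
            self.seed = seed
            match = self._find_match(assignment)
            if match is not None:
                break
        else:
            raise RuntimeError("no order-respecting matching found")
        order, iota = match
        rng = random.Random(12345)
        self.table1 = [rng.randrange(2 ** q) for _ in range(m)]  # arbitrary initial values
        self.table2 = [0] * m
        for t in order:                    # CREATE step 3: smallest element of the order first
            hs, M = self._hash(t)
            i = iota[t]                    # 1-based index of the matched slot
            others = 0
            for j, h in enumerate(hs, start=1):
                if j != i:
                    others ^= self.table1[h]
            L = hs[i - 1]                  # L = tau(t)
            self.table1[L] = i ^ M ^ others  # ENCODE(i) XOR M XOR other neighbors
            self.table2[L] = assignment[t]

    def _hash(self, t):
        """HASH(t) -> (k distinct table slots, q-bit mask), via a seeded PRNG."""
        rng = random.Random(self.seed * 1_000_003 + t)
        return rng.sample(range(self.m), self.k), rng.randrange(2 ** self.q)

    def _find_match(self, assignment):
        """FIND_MATCH, iteratively: peel off 'easy' (singleton-slot) elements;
        elements peeled in later rounds come earlier in the build order."""
        remaining = {t: self._hash(t)[0] for t in assignment}
        order, iota = [], {}
        while remaining:
            count = Counter(h for ns in remaining.values() for h in ns)
            easy = []
            for t, ns in remaining.items():
                for j, h in enumerate(ns, start=1):
                    if count[h] == 1:      # TWEAK: smallest singleton slot index
                        easy.append(t)
                        iota[t] = j
                        break
            if not easy:
                return None                # stuck: caller redraws HASH
            for t in easy:
                del remaining[t]
            order = easy + order
        return order, iota

    def _locate(self, t):
        hs, M = self._hash(t)
        a = M
        for h in hs:
            a ^= self.table1[h]
        return hs[a - 1] if 1 <= a <= self.k else None  # DECODE: ⊥ unless a names a slot

    def lookup(self, t):
        L = self._locate(t)
        return BOT if L is None else self.table2[L]

    def set_value(self, t, v):             # reliable only for t in S
        L = self._locate(t)
        if L is None:
            return False
        self.table2[L] = v
        return True
```

Usage: `bf = BloomierFilter({1: 10, 2: 10, 3: 20})` makes `bf.lookup(1)` return 10, while a key outside S almost always returns `BOT` (with probability about k/2^q it instead returns garbage, as the slides describe).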

30
Analysis
31
Analyzing FIND_MATCH
  • We show that FIND_MATCH succeeds with constant probability for every S.
  • We'll define a bipartite graph G:
  • On the left side there are n vertices L = {L1,…,Ln} corresponding to S.
  • On the right side there are m vertices R = {R1,…,Rm} corresponding to {1,…,m}
  • There is an edge between Li and Rj if, for N(ti) = (h1,…,hk), there is l such that j = h_l.

32
The Singleton Property
  • We say that G has the singleton property if for all nonempty A ⊆ L there exists a vertex Ri ∈ R such that Ri is adjacent to exactly one vertex in A.
  • If G has the singleton property, FIND_MATCH will never get stuck (there will always be easy matches).
  • N(v) = the set of neighbors of v ∈ L.
  • N(A) = the set of neighbors of the elements of A.

33
Lossless Expansion Property
  • We say that G has the lossless expansion property if for all nonempty A ⊆ L, |N(A)| > k|A|/2
  • If G has the lossless expansion property, it has the singleton property:
  • Assume to the contrary that there is an A such that each node in N(A) has at least 2 neighbors in A.
  • The subgraph for A then has at least 2|N(A)| edges.
  • Since |N(A)| > k|A|/2, the subgraph has more than k|A| edges, a contradiction (the |A| left vertices contribute at most k|A| edges).
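The counting argument on this slide, written out as a chain of inequalities (a restatement, under the slide's own definitions):

```latex
% Suppose, for contradiction, that some nonempty A \subseteq L has no singleton:
% every vertex of N(A) has at least two neighbors in A.  Counting the edges of
% the subgraph induced on A \cup N(A) from both sides,
\[
   2\,\lvert N(A)\rvert \;\le\; \#\mathrm{edges} \;\le\; k\,\lvert A\rvert .
\]
% But the lossless expansion property gives \lvert N(A)\rvert > k\lvert A\rvert/2,
% so the left-hand side exceeds k\lvert A\rvert, a contradiction.
```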

34
Lossless Expansion Property (Cont.)
  • For a random graph G with
  • fixed k, k > 2
  • m = ckn for a fixed c
  • G is a lossless expander with constant probability.
  • Hence FIND_MATCH will succeed with constant probability.

35
Data Structure Complexity
  • The error probability is k/2^q
  • We have to set q ≥ log₂(k/ε) to make it at most ε
  • Space: O(n(r + log(1/ε))) bits
  • Lookup Time: O(1)
  • Update Time: O(1)
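The parameter setting can be made concrete; the numbers below are illustrative, with q chosen just large enough that the error probability k/2^q drops below a target ε:

```python
import math

k, epsilon, r = 3, 0.01, 8       # illustrative: k hash slots, target error, r value bits
c = 1.5                          # table-size factor, m = c*k*n (assumed constant)

q = math.ceil(math.log2(k / epsilon))   # smallest q with k / 2**q <= epsilon
assert k / 2 ** q <= epsilon

n = 1000
m = math.ceil(c * k * n)
space_bits = m * q + m * r       # Table1: m entries of q bits; Table2: m entries of r bits
# Since q = O(log k + log(1/epsilon)) and m = O(n), space = O(n (r + log(1/epsilon))).
```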

36
Data Structure Complexity (Cont.)
  • FIND_MATCH: we'll use the graph again.
  • We may show that, with high probability, for all non-empty A ⊆ L, |N(A)| > c|A| for some constant c > k/2.
  • For a set A ⊆ L, suppose there are a items in N(A) with one neighbor in A and c|A| − a items with more than one neighbor.
  • The subgraph for A has at least
  • a + 2(c|A| − a) = 2c|A| − a edges.
  • On the other hand, it has at most k|A| edges.

37
Data Structure Complexity (Cont.)
  • Each item in A has at most k neighbors.
  • From 2c|A| − a ≤ k|A| we get a ≥ (2c − k)|A|.
  • The number of items in A that have a neighbor belonging only to them is at least
  • a/k ≥ (2c − k)|A|/k = (2c/k − 1)|A| = p|A|
  • These items are easy matches.
  • The run time of FIND_MATCH, if there is such a c, is
  • O(n) + O((1−p)n) + O((1−p)²n) + … = O(n)
  • That is also the expected run time of CREATE.
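The run-time bound is a geometric series; under the slide's assumption that each round removes at least a p-fraction of the remaining elements:

```latex
\[
  T(n) \;=\; O(n) + O\big((1-p)n\big) + O\big((1-p)^2 n\big) + \cdots
        \;=\; O\!\Big( n \sum_{i\ge 0} (1-p)^i \Big)
        \;=\; O\!\left(\frac{n}{p}\right) \;=\; O(n),
  \qquad p \;=\; \frac{2c}{k} - 1 \;>\; 0 .
\]
```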

38
Lower Bounds
39
Deterministic Algorithm
  • If R = {1, 2, ⊥}, S splits into subsets A and B that map to 1 and 2, resp.
  • Even in that case, deterministic Bloomier filtering requires Ω(n + log log N) bits of storage.
  • Define G, a graph where each node is a vector in {−1,0,1}^N with exactly n coordinates equal to 1, and n others equal to −1.
  • The 1s represent A and the −1s represent B.
  • Two nodes v and v′ are adjacent if the set A of v intersects the set B of v′
  • (if v = (x1,…,xN) and v′ = (y1,…,yN), they are adjacent if there is i such that x_i · y_i = −1)

40
Deterministic Algorithm (Cont.)
  • Since the memory is the only source of information about A and B, no 2 adjacent nodes may correspond to the same memory configuration.
  • The memory size m is at least log χ(G) (χ(G) is the minimum number of colors required to color G).
  • We'll show that χ(G) is between Ω(2^n log N) and O(n · 2^{2n} log N).

41
Lower Bound On χ(G)
  • For every color c used to color G we have a vector z_c in {−1,1}^N.
  • For a node v = (x1,…,xN) colored c, we allow x_i to be 1 (or −1) only if z_{c,i} is 1 (resp. −1).
  • A set of binary vectors of length l is (l,k)-universal if for every choice of k coordinate positions we get all the possible 2^k patterns.
  • We'll show that {z_c} is (N,n)-universal if we turn the minus ones into zeroes.

42
Lower Bound On χ(G) (Cont.)
  • Let i1,…,in be n coordinate positions.
  • For each w in {−1,1}^n we have a node v whose i1,…,in coordinates match w.
  • If v is colored in color c, then the i1,…,in coordinates of z_c match w.
  • Therefore, for each choice of n coordinate positions we get all the possible patterns.
  • The size of any (N,n)-universal set is Ω(2^n log N), so this is a lower bound on χ(G).

43
Upper Bound On χ(G)
  • There exists an (N,2n)-universal set of vectors of size O(n · 2^{2n} log N).
  • We'll turn all the zeroes into minus ones.
  • We'll use that set as the z_c's.
  • Because the set {z_c} is universal, we may select for each node a vector z_c that matches the 1s and −1s of the node.
  • c will be the color of the node.

44
Mutable Filtering
  • If
  • and the number m of storage bits satisfies
  • for some large enough constant c, then Bloomier filtering cannot support dynamic updates on S of size 2n.
  • The proof is for R = {1, 2, ⊥}: S splits into subsets A and B, of size n, that map to 1 and 2, resp.
  • We assume the algorithm is randomized.

45
Mutable Filtering (Cont.)
  • Let ρ be a sequence of random choices made by the algorithm, when the input to the algorithm was A and B.
  • We assume B was a specific set B_org, and change A.
  • For each possible A we have a corresponding memory configuration.
  • In other words, for each memory configuration we have a family of possibilities for A that led to this configuration.
  • Let F be the largest family.

46
Mutable Filtering (Cont.)
  • Now we change B: for each possible B_new we get a different memory configuration.
  • For each configuration 1 ≤ i ≤ 2^m there is a family of options for B_new that leads to it. We mark it G_i.

(Figure: memory configurations, region I reached after changing A, region II after changing B.)
47
Mutable Filtering (Cont.)
  • Given a memory configuration C in II, for any path that leads to it:
  • B can be the B_new on the path.
  • For each item in such a set we must answer as in B.
  • A can be the set on the path before the configuration in I.
  • For each item in such a set that couldn't be changed to B_new on the path, we must answer as in A.
  • Suppose in I we were in the configuration F leads to, and then we randomly chose B_new.
  • i(B_new) denotes the j such that B_new ∈ G_j
  • In II we have to
  • Answer as in A for each item of a set in F that couldn't be changed to B_new
  • Answer as in B for each item of a set in G_{i(B_new)}

48
The Proof
  • The subset of interest is the part of F whose sets intersect B_new.
  • We show that, with high probability (over the selection of B_new), the sets
  • are intersecting.
  • Then there is an item for which the algorithm must answer both as in A and as in B.
  • So there is a set B_new that causes the algorithm to make errors.

49
Lk And Its Size
  • L_k is the set of items that belong to at least k sets in F.
  • We'll look at subsets of
  • that belong to L_k, and show they intersect.
  • We first bound the size of L_k.
  • F_k is the sub-family of F that contains only subsets of L_k.

50
Lk And Its Size (Cont.)

51
And Its Size
  • It is a subset of both    and L_k.
  • The algorithm should answer as in A for each item of it.
  • We'll show that with probability ≥ 1/2 it cannot be very small.
  • The expected number of sets in F that a random item of D intersects is

52
And Its Size (Cont.)
  • If an item in L_k does not appear in the first set, it intersects only sets in the second.
  • Such an item appears in at least k sets.
  • By Markov's inequality,

53
And Its Size
  • M_i is a subset of L_k and
  • The algorithm should answer as in B for each item of it.
  • By the Chernoff bound,

54
And Its Size (Cont.)
  • This probability is the number of B_new's that satisfy
  • in all G_i's, divided by the number of B_new's that satisfy

55
And Its Size (Cont.)

56
The Error
  • With probability at least ½ − o(1),
  • and
  • hold; that is, the 2 sets intersect.
  • The algorithm must answer as in A for each item of one set
  • and as in B for each item of the other.
  • There is a set B_new for which the algorithm will make an error.

57
Questions?