Bloom Based Filters for Hierarchical Data


1
Bloom Based Filters for Hierarchical Data
  • Georgia Koloniari and Evaggelia Pitoura
    University of Ioannina, Greece

2
Outline
  • Motivation
  • Problem Description
  • Related Work
  • Our approach: Multi-Level Bloom Filters
  • Performance Evaluation
  • Hierarchical Distribution of Filters
  • Experimental Results
  • Conclusions
  • Future Work

3
Motivation
  • Evolution of peer-to-peer systems as an effective
    way of sharing data
  • Wide use of XML for data representation and
    exchange in the Internet
  • Service Descriptions in XML-based languages
  • Growing interest in content-based routing of data
  • Challenge: how to efficiently discover the
    appropriate data based on their content?

4
The Problem
  • A peer-to-peer system where each node stores a
    set of XML documents
  • A query issued at a node may need results from
    multiple nodes in the system
  • Use data summaries at each node to assist query
    routing

[Figure: nodes A, B, and C, with summaries SumB and SumC for nodes B and C]
5
Summaries Requirements
  • Scalability: summaries should be able to scale to
    a large number of users and shared documents.
  • Distribution: summaries should be distributed across
    the nodes of the peer-to-peer system without
    requiring any central point of control.
  • Dynamic: summaries should support updates, since in a
    peer-to-peer system users join and leave the
    system at will.

6
Related Work
  • XML Indices
    • The Index Fabric [Cooper & Shadmon, RightOrder
      Inc 2001]
    • XSKETCH Synopsis [Polyzotis & Garofalakis, VLDB
      2002]
    • APEX [Chung, Min & Shim, ACM SIGMOD 2002]
    • Path Tree [Aboulnaga, Alameldeen & Naughton, VLDB
      2001]
    • Signature-based Indices [Park & Kim, DASFAA 2001]
  • Routing in P2P
    • Secure Service Discovery [Hodes et al, Mobicom
      99]
    • Routing indices [Crespo & Garcia-Molina, ICDCS
      2002]

7
Data Model
<xml>
  <device>
    <printer>
      <color></color>
      <postscript></postscript>
    </printer>
    <camera>
      <digital></digital>
    </camera>
  </device>
</xml>
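A minimal sketch of how such a document can be turned into the inputs a multi-level filter needs: the element names per level and the root-to-leaf paths. It assumes Python's standard `xml.etree.ElementTree` parser and treats `device` as level 1, with the outer wrapper element omitted; the helper names `elements_by_level` and `root_to_leaf_paths` are illustrative, not from the presentation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# The example document, with <device> taken as the root (level 1).
DOC = """
<device>
  <printer><color/><postscript/></printer>
  <camera><digital/></camera>
</device>
"""


def elements_by_level(root):
    """Map each level number (root = 1) to the set of element names on it."""
    levels = defaultdict(set)
    stack = [(root, 1)]
    while stack:
        node, depth = stack.pop()
        levels[depth].add(node.tag)
        stack.extend((child, depth + 1) for child in node)
    return dict(levels)


def root_to_leaf_paths(node, prefix=()):
    """Return every root-to-leaf path as a '/'-joined string, in document order."""
    path = prefix + (node.tag,)
    children = list(node)
    if not children:
        return ["/".join(path)]
    paths = []
    for child in children:
        paths.extend(root_to_leaf_paths(child, path))
    return paths


tree = ET.fromstring(DOC)
levels = elements_by_level(tree)
paths = root_to_leaf_paths(tree)
```

Here `levels` groups the names by depth and `paths` lists the three root-to-leaf paths of the example tree.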
8
Querying
  • XML-based data or service descriptions
  • Find the documents that satisfy a given query
  • Queries that exploit content and structure of the
    data
  • Membership Queries: is element X in set Y?
  • Path Queries: queries consisting of regular path
    expressions, e.g. device//camera

9
Bloom Filters
  • Compact data structures for a probabilistic
    representation of a set
  • Appropriate to answer membership queries

10
Bloom Filters (contd)
Query for b: check the bits at positions H1(b),
H2(b), ..., H4(b).
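The insert and look-up steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the slides do not say how the hash functions are built, so here hash function H_i is assumed to be MD5 applied to the element prefixed with the index i, and the class name `BloomFilter` is hypothetical.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings with k salted MD5 hash functions."""

    def __init__(self, m, k=4):
        self.m = m              # number of bits in the filter
        self.k = k              # number of hash functions
        self.bits = [0] * m

    def _positions(self, item):
        # Assumption: H_i(item) = MD5("i:item") mod m.
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=64)
for name in ("a", "b", "c"):
    bf.add(name)
found = "b" in bf   # an inserted element is always reported as present
```

An element that was inserted is never missed; an element that was not inserted is reported as present only if all of its positions happen to be set by other elements.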
11
Bloom Filters (contd)
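  • Appearance of false positives.
  • False positive probability: the probability that the
    filter recognizes an element as belonging to the set
    although it does not.
  • $P = (1 - e^{-kn/m})^k$, where k is the number of hash
    functions, n the number of elements in the set, and m
    the size of the filter in bits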
  • Ease of updates with the use of an array of
    counters
  • Unable to represent relationships between
    elements
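Two of the points above lend themselves to a short sketch: evaluating the false positive formula for given k, n, and m, and supporting deletions by replacing each bit with a counter. The parameter values and the class name `CountingBloomFilter` are illustrative assumptions, and the hash functions are again assumed to be index-salted MD5.

```python
import hashlib
import math


def false_positive_rate(k, n, m):
    """P = (1 - e^(-kn/m))^k for k hash functions, n elements, m bits."""
    return (1.0 - math.exp(-k * n / m)) ** k


class CountingBloomFilter:
    """Bloom filter variant that keeps a counter per position to allow removal."""

    def __init__(self, m, k=4):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item):
        # Assumption: H_i(item) = MD5("i:item") mod m.
        return [
            int.from_bytes(
                hashlib.md5(f"{i}:{item}".encode("utf-8")).digest(), "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def remove(self, item):
        # Only valid for items that were previously added.
        for pos in self._positions(item):
            self.counters[pos] -= 1

    def __contains__(self, item):
        return all(self.counters[pos] > 0 for pos in self._positions(item))


rate = false_positive_rate(k=4, n=50, m=400)   # roughly 0.024

cbf = CountingBloomFilter(m=256)
cbf.add("printer")
cbf.add("camera")
cbf.remove("camera")   # "printer" is still reported as present
```

Doubling m in the call above lowers the rate, which matches the intuition that a larger filter has fewer collisions.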

12
Our approach
  • Bloom filters suitable for distributed
    environments
  • Main drawback: unable to represent hierarchies
  • Extend to multi-level Bloom Filters in order to
    support path queries
  • Two approaches:
    • Breadth Bloom Filters
    • Depth Bloom Filters

13
Breadth Bloom Filters
  • One Bloom Filter BBFi for each level i of the tree
  • In each filter BBFi we insert the elements of all
    the nodes of level i.
  • An additional BBF0 with all the elements to
    improve performance
  • A different size for each filter
  • Look-up:
    • check BBF0 for all elements of the path
    • check each element ai of the path against the
      filter of the corresponding level

14
Breadth Bloom Filters
BBF0: (device ∪ printer ∪ camera ∪ color ∪ postscript ∪ digital)
BBF1: device
BBF2: printer ∪ camera
BBF3: (color ∪ postscript ∪ digital)
Queries: device/printer/color, /printer/postscript
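A hedged sketch of a Breadth Bloom Filter built over the example tree. Several simplifications are assumptions rather than details from the slides: all level filters share one size, hash functions are index-salted MD5, only child steps (no //) are handled, and a path may start at any level. The names `Bloom`, `BreadthBloomFilter`, and `may_match` are hypothetical.

```python
import hashlib


class Bloom:
    """Small Bloom filter; the bit array is stored in a Python int."""

    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        return [
            int.from_bytes(
                hashlib.md5(f"{i}:{item}".encode("utf-8")).digest(), "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


class BreadthBloomFilter:
    def __init__(self, levels, m=1024):
        # levels[0] holds the element names of level 1, levels[1] of level 2, ...
        self.bbf0 = Bloom(m)                           # all elements
        self.per_level = [Bloom(m) for _ in levels]    # BBF1 .. BBFd
        for level_filter, names in zip(self.per_level, levels):
            for name in names:
                level_filter.add(name)
                self.bbf0.add(name)

    def may_match(self, path):
        """path is a list of element names connected by child steps."""
        # Step 1: every element must appear somewhere in the document.
        if not all(name in self.bbf0 for name in path):
            return False
        # Step 2: try every starting level, matching element j at level start + j.
        for start in range(len(self.per_level) - len(path) + 1):
            if all(name in self.per_level[start + j] for j, name in enumerate(path)):
                return True
        return False


bbf = BreadthBloomFilter(
    [["device"], ["printer", "camera"], ["color", "postscript", "digital"]]
)
hit = bbf.may_match(["device", "printer", "color"])
structural_false_positive = bbf.may_match(["printer", "digital"])
```

The last query illustrates a limitation of the per-level design: `printer` is on level 2 and `digital` on level 3, so the filter accepts `printer/digital` even though `digital` is a child of `camera`.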
15
Depth Bloom Filters
  • One Bloom Filter DBFi for each path length i of the
    tree, i.e. for the paths with i + 1 nodes
  • In each DBFi we insert all paths of the tree
    with length i.
  • Look-up for a path of length p:
    • Check all elements of the query in DBF0
    • Check every sub-path of length 2 to p
    • For //: split the path at the position of // and
      check each sub-path separately

16
Depth Bloom Filters
DBF0: (device ∪ printer ∪ camera ∪ color ∪ postscript ∪ digital)
DBF1: (device/printer ∪ device/camera ∪ camera/digital ∪ printer/color ∪ printer/postscript)
DBF2: (device/camera/digital ∪ device/printer/color ∪ device/printer/postscript)
Queries: /device/printer/color, /device//postscript
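A hedged sketch of a Depth Bloom Filter for the same example. It assumes one filter per path length, all of the same size, index-salted MD5 hash functions, and a look-up that splits the query at // and checks each segment on its own without checking the order of the segments. The names `Bloom`, `DepthBloomFilter`, and `contiguous_subpaths` are hypothetical.

```python
import hashlib


class Bloom:
    """Small Bloom filter; the bit array is stored in a Python int."""

    def __init__(self, m=1024, k=4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        return [
            int.from_bytes(
                hashlib.md5(f"{i}:{item}".encode("utf-8")).digest(), "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        return all((self.bits >> pos) & 1 for pos in self._positions(item))


def contiguous_subpaths(nodes):
    """All contiguous sub-paths of a list of element names."""
    n = len(nodes)
    return [nodes[i:j] for i in range(n) for j in range(i + 1, n + 1)]


class DepthBloomFilter:
    def __init__(self, root_to_leaf_paths, m=1024):
        longest = max(len(p) for p in root_to_leaf_paths)
        # filters[i] plays the role of DBFi: paths with i + 1 nodes.
        self.filters = [Bloom(m) for _ in range(longest)]
        for path in root_to_leaf_paths:
            for sub in contiguous_subpaths(path):
                self.filters[len(sub) - 1].add("/".join(sub))

    def _segment_may_match(self, nodes):
        if len(nodes) > len(self.filters):
            return False
        return all(
            "/".join(sub) in self.filters[len(sub) - 1]
            for sub in contiguous_subpaths(nodes)
        )

    def may_match(self, query):
        # Split at "//" and check each child-only segment separately.
        segments = [seg.split("/") for seg in query.strip("/").split("//")]
        return all(self._segment_may_match(seg) for seg in segments)


dbf = DepthBloomFilter([
    ["device", "printer", "color"],
    ["device", "printer", "postscript"],
    ["device", "camera", "digital"],
])
exact = dbf.may_match("/device/printer/color")
descendant = dbf.may_match("/device//postscript")
rejected = dbf.may_match("/printer/digital")
```

Because `printer/digital` was never inserted as a two-node path, the query is rejected here, barring a hash collision.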
17
Experimental Evaluation
  • 200 XML documents produced by the Niagara
    Generator (www.cs.wisc.edu/niagara)
  • 4 hash functions using the MD5 message digest
    algorithm (RFC1321)
  • Size of the filter: 78000 bits, about 2% of the
    size of the documents
  • Levels of the documents: 4
  • Elements per document: 50
  • No repetition between element names
  • Length of queries: 3 (e.g. /device/camera/digital)
  • 90% of the elements forming the queries were
    contained in the documents
  • Metric: percentage of false positives

18
Influence of filter size
19
Influence of the number of elements per document
20
Influence of the levels of the document
21
Influence of the length of the queries
22
Varying the query workload
Workload type: /printer/digital
23
Summary of Results
  • Multi-level Bloom filters outperform Simple Bloom
    filters in evaluating path queries.
  • With a filter size of 2% of the total size of the
    data, multi-level Bloom filters evaluate path
    queries with a false positive ratio below 3%, while
    Simple Bloom filters fail to recognize the correct
    paths, no matter how much the filter size increases.
  • Breadth Blooms work better than Depth Blooms.
  • Depth Blooms require more space but are suitable
    for handling queries for which Breadth Blooms
    present a high ratio of false positives (exp. 5)

24
Distribution
  • Each node stores:
    • a local summary
    • a merged summary of its neighbours
  • The merged summary is constructed by applying the
    bit-wise OR per level
  • Nodes organized according to topological
    proximity
  • Two organizations of nodes:
    • hierarchical
    • horizons
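The per-level bit-wise OR can be sketched as below, assuming every node uses the same filter size and hash functions per level and that each level's bit array is stored as a Python integer. The function name `merge_summaries` is hypothetical.

```python
def merge_summaries(summaries):
    """Merge multi-level filters by OR-ing the bit arrays level by level.

    Each summary is a list of ints, one bit array per level. Summaries with
    fewer levels simply contribute nothing to the deeper levels.
    """
    depth = max(len(summary) for summary in summaries)
    merged = [0] * depth
    for summary in summaries:
        for level, bits in enumerate(summary):
            merged[level] |= bits
    return merged


node_e = [0b1010, 0b0001]
node_f = [0b0110]
merged = merge_summaries([node_e, node_f])   # [0b1110, 0b0001]
```

Any element reported by a child's filter is also reported by the merged filter, so a query can be safely pruned when the merged summary rejects it.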

25
Distribution Hierarchical Organization
Node C: Local filter; Merged filter: E OR F OR G OR H; Root filters: A, B, D
26
Bloom Filter Similarity
  • Nodes organized according to Bloom Filter
    Similarity
  • Measure: a similarity measure based on the
    Manhattan distance metric.
  • Let B and C be two filters of size m:
  • $d(B, C) = |B_1 - C_1| + |B_2 - C_2| + \dots + |B_m - C_m|$
  • $\mathrm{similarity}(B, C) = m - d(B, C)$

27
Bloom Filter Similarity (contd)
B = 1 0 0 1 1 0 0 1
C = 0 1 1 0 1 0 0 1
$\mathrm{similarity}(B, C) = 8 - (1 + 1 + 1 + 1 + 0 + 0 + 0 + 0) = 4$
For multi-level Bloom filters, similarity is
defined as the sum of the similarities of each pair
of corresponding levels
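A small sketch of the similarity measure, with filters represented as lists of 0/1 values and equal lengths assumed. The function names are illustrative.

```python
def manhattan_distance(b, c):
    """Sum of |b_i - c_i| over two equal-length bit vectors."""
    if len(b) != len(c):
        raise ValueError("filters must have the same size")
    return sum(abs(x - y) for x, y in zip(b, c))


def similarity(b, c):
    """similarity(B, C) = m - d(B, C), where m is the filter size."""
    return len(b) - manhattan_distance(b, c)


def multilevel_similarity(levels_b, levels_c):
    """Sum of the similarities of corresponding levels."""
    return sum(similarity(b, c) for b, c in zip(levels_b, levels_c))


B = [1, 0, 0, 1, 1, 0, 0, 1]
C = [0, 1, 1, 0, 1, 0, 0, 1]
s = similarity(B, C)   # 8 - 4 = 4
```

Two identical filters of size m reach the maximum similarity m, and two complementary filters reach 0.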
28
Content-Based Organization
  • When a node joins the system, it broadcasts its
    local summary and attaches to the most similar
    node available
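The join step can be sketched as picking the candidate whose summary is most similar to the new node's summary. This assumes single-level filters of size m stored as Python integers, for which the Manhattan distance between bit vectors equals the number of differing bits. The names `choose_parent` and the candidate labels are hypothetical, and the broadcast itself is not modelled.

```python
def bit_similarity(b, c, m):
    """m minus the number of positions where the two bit arrays differ."""
    return m - bin(b ^ c).count("1")


def choose_parent(new_summary, candidates, m):
    """Return the name of the candidate node with the most similar summary."""
    return max(
        candidates,
        key=lambda name: bit_similarity(new_summary, candidates[name], m),
    )


candidates = {"A": 0b11110000, "B": 0b10011001}
parent = choose_parent(0b10011011, candidates, m=8)   # "B": only one bit differs
```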

29
Performance in Distributed Setting
  • Hierarchical organization of nodes
  • Metric: number of hops
  • Parameters:
    • Variable number of nodes
    • Number of hierarchies: 5
    • Maximum out-degree: 5
    • Every 10% of all docs are 70% similar
    • Length of queries: 2
    • 10% of the documents have results
    • 70% of the documents contain the elements of
      the path query
    • One document per node

30
Finding the first result with respect to the nodes
31
Finding all the results with respect to the nodes
32
Finding the first result with varying number of
results
33
Finding the first result with respect to the nodes
34
Finding all the results with respect to the nodes
35
Summary of Results
  • The content-based organization is much more
    efficient than the proximity organization in
    finding all the results for a query.
  • They both perform similarly in discovering the
    first result.
  • The content-based organization outperforms the
    proximity one when few nodes satisfy a given
    query.
  • Both Simple and multi-level Blooms can be
    efficiently used as distributed filters.
  • For path queries, multi-level Blooms outperform
    Simple ones.

36
Conclusions
  • We introduced two novel data structures, Breadth
    and Depth Bloom Filters, that exploit both the
    content and the structure of XML documents at
    a small space overhead.
  • The new data structures outperform simple Bloom
    Filters with respect to false positives when
    addressing regular path expression queries.
  • They can be distributed in large-scale systems to
    support efficient service discovery.
  • We extended the use of Bloom filters to organize the
    nodes according to their content.

37
Future Work
  • Explore different policies for distributing the
    filters.
  • Explore different types of data summaries (e.g.
    Signatures)
  • Extend the data model to XML graphs and
    incorporate values into the indexes

38
Thank you