Model-Based Semantic Compression for Network-Data Tables

Transcript and Presenter's Notes

Title: Model-Based Semantic Compression for Network-Data Tables


1
Model-Based Semantic Compression for Network-Data
Tables
  • Shivnath Babu

Stanford University
with Minos Garofalakis, Rajeev Rastogi, Avi
Silberschatz
Bell Laboratories
NRDM, Santa Barbara, CA, May 25, 2001
2
Introduction
  • Networks create massive, fast-growing
    relational-data tables
  • Switch/router-level network performance data
  • SNMP and RMON data
  • Packet and flow traces (Sprint IP backbone -- 600
    gigabytes/day)
  • Call Detail Records (AT&T -- 300 million
    records/day)
  • Web-server logs (Akamai -- 10-100 billion
    log-lines/day)
  • The data is important for running big enterprises
    effectively
  • Application and user profiling
  • Capacity planning and provisioning, determining
    pricing plans
  • The data needs to be stored, analyzed, and
    (often) shipped across sites

3
Compressing Massive Tables
  • Example table: network flow measurements
    (simplified)
  • Good compression is essential
  • Optimizes storage, I/O, network bandwidth over
    the lifetime of the data
  • Can afford intelligent compression

4
Compressing Massive Tables: A New Direction in
Data Compression
  • Several generic compression techniques and
    tools exist (e.g., Huffman, Lempel-Ziv, Gzip)
  • Syntactic: operate at the byte level, view the
    table as a large byte string
  • Lossless only: do not support lossy
    compression
  • Semantic compression
  • Exploiting data characteristics and dependencies
    improves compression ratio significantly
  • Capturing aggregate data characteristics ties in
    with enterprise data monitoring and analysis
  • Benefits of lossy compression schemes
  • Enables trading precision for performance
    (compression time and storage)
  • Tradeoff can be adjusted by user (flexible)

5
SPARTAN: A Model-Based Semantic Compressor
  • New compression paradigm: Model-Based Semantic
    Compression (MBSC)
  • Extract data mining models from table
  • Derive compression plan using the extracted
    models
  • Use models to represent data succinctly
  • Use models to drive other model building
  • Compress different data partitions using
    different models
  • Lossless and lossy compression (within
    user-specified error bounds)
  • SPARTAN system implements a specific
    instantiation of MBSC
  • Key idea: Classification and Regression Trees
    (CaRTs) can capture cross-column dependencies and
    eliminate entire data columns

6
SPARTAN: Semantic Compression with Classification
and Regression Trees (CaRTs)

Protocol   Duration   Bytes   Packets
http       12         20K     3
http       16         24K     5
http       15         20K     8
http       19         40K     11
http       26         58K     18
ftp        27         100K    24
ftp        32         300K    35
ftp        18         80K     15

Error tolerances shown for the two CaRTs: error = 0, error <= 3
A compact CaRT can eliminate an entire column by
prediction
Outlier: Packets = 11, Duration = 19
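
A minimal sketch of how two small trees could stand in for the Protocol and Duration columns of the table above. The split points (Bytes > 60K, Packets > 16) and the predicted values (29 and 15) are assumptions chosen to fit this sample; the slide text does not give the trees, and the pairing of the tolerance of 3 with Duration is inferred from the outlier note.

```python
# Sample rows from the slide: (protocol, duration, bytes in KB, packets).
ROWS = [
    ("http", 12, 20, 3),
    ("http", 16, 24, 5),
    ("http", 15, 20, 8),
    ("http", 19, 40, 11),
    ("http", 26, 58, 18),
    ("ftp", 27, 100, 24),
    ("ftp", 32, 300, 35),
    ("ftp", 18, 80, 15),
]


def predict_protocol(bytes_kb):
    # Hypothetical one-split classification tree (assumed, not from the slides).
    return "ftp" if bytes_kb > 60 else "http"


def predict_duration(packets):
    # Hypothetical one-split regression tree (assumed, not from the slides).
    return 29 if packets > 16 else 15


def compress(rows, duration_tolerance=3):
    """Keep only Bytes and Packets, plus the rows each tree gets wrong."""
    kept = [(b, p) for _, _, b, p in rows]
    protocol_outliers = {
        i: proto
        for i, (proto, _, b, _) in enumerate(rows)
        if predict_protocol(b) != proto
    }
    duration_outliers = {
        i: d
        for i, (_, d, _, p) in enumerate(rows)
        if abs(predict_duration(p) - d) > duration_tolerance
    }
    return kept, protocol_outliers, duration_outliers


def decompress(kept, protocol_outliers, duration_outliers):
    """Rebuild all four columns; outliers override the tree predictions."""
    rows = []
    for i, (b, p) in enumerate(kept):
        proto = protocol_outliers.get(i, predict_protocol(b))
        duration = duration_outliers.get(i, predict_duration(p))
        rows.append((proto, duration, b, p))
    return rows
```

With these assumed trees, Protocol is reproduced exactly and the only stored Duration outlier is the row with Packets = 11, Duration = 19, which matches the outlier named on the slide.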
7
SPARTAN Compression: Problem Formulation
  • Given: a data table over a set of attributes X and
    per-attribute error tolerances
  • Find: a set of attributes P to be predicted using
    CaRTs such that
  • Overall storage cost (CaRTs + outliers +
    materialized columns) is minimized
  • Each attribute in P is predicted within its
    specified tolerance
  • A predicted attribute should not be used to
    predict another attribute -- otherwise errors
    will compound
  • Non-trivial problem
  • The space of possible CaRT predictors is
    exponential in the number of attributes
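
The objective above can be sketched as a small cost function. This is an illustration under stated assumptions: the function name, the dictionary layout, and the byte costs in the usage note are hypothetical, not taken from SPARTAN.

```python
def plan_cost(materialized, predicted, mater_cost, cart_cost,
              outlier_cost, predictors):
    """Storage cost of a plan: materialized columns + CaRTs + outliers.

    Raises ValueError if a predicted attribute relies on a predictor that
    is not materialized, since prediction errors would then compound.
    """
    materialized = set(materialized)
    for attr in predicted:
        missing = set(predictors[attr]) - materialized
        if missing:
            raise ValueError(
                f"{attr} is predicted from non-materialized {sorted(missing)}"
            )
    stored = sum(mater_cost[a] for a in materialized)
    modeled = sum(cart_cost[a] + outlier_cost[a] for a in predicted)
    return stored + modeled


# Hypothetical per-column costs in bytes (assumed for illustration only).
MATER_COST = {"protocol": 8, "duration": 32, "bytes": 32, "packets": 32}
CART_COST = {"protocol": 4, "duration": 6}
OUTLIER_COST = {"protocol": 0, "duration": 5}
PREDICTORS = {"protocol": ["bytes"], "duration": ["packets"]}
```

With these made-up numbers, materializing everything costs 104, while predicting protocol and duration from bytes and packets costs 64 + 4 + 0 + 6 + 5 = 79.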

8
Two Phase Compression
  • Planning Phase -- Come up with a compression
    plan
  • Compression Phase -- Scan the data and compress
    it using the plan

9
SPARTAN Architecture: Planning Phase
10
SPARTAN's DependencyFinder
  • Goal: identify strong dependencies among
    attributes to prune the (huge) search space of
    possible CaRT models
  • Input: random sample of input table T
  • Output: a Bayesian Network (BN) over T's
    attributes
  • Structure of BN: neighbors are the strongly
    related attributes
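
As a rough illustration of finding strongly related attributes from a random sample, the sketch below scores attribute pairs by mutual information. This is a swapped-in stand-in, not Bayesian Network learning and not SPARTAN's DependencyFinder; the function names, the threshold, and the assumption of discrete column values are all hypothetical.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def strong_dependencies(table, sample_size, threshold, seed=0):
    """Return attribute pairs whose sampled mutual information >= threshold.

    table maps a column name to a list of discrete values (equal lengths).
    """
    n_rows = len(next(iter(table.values())))
    rng = random.Random(seed)
    idx = rng.sample(range(n_rows), min(sample_size, n_rows))
    sample = {name: [col[i] for i in idx] for name, col in table.items()}
    names = sorted(sample)
    edges = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if mutual_information(sample[a], sample[b]) >= threshold:
                edges.append((a, b))
    return edges
```

Only the pairs returned here would be considered as predictor candidates, which is the pruning role the slide assigns to the BN's neighbor structure.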

11
SPARTAN Architecture: Planning Phase
12
SPARTAN's CaRTSelector
  • Heart of SPARTAN's semantic-compression engine
  • Output: subset of attributes P to be
    predicted (within tolerance) and the
    corresponding CaRTs
  • Uses the Bayesian Network constructed by
    DependencyFinder
  • Hard optimization problem: strict generalization
    of Weighted Maximum Independent Set (WMIS)
    (NP-hard)
  • Two solutions
  • Greedy heuristic
  • New heuristic based on WMIS approximation
    algorithms

13
Maximum Independent Set (MIS) CaRTSelector
  • Exploits mapping of WMIS to the CaRTSelector problem
  • Hill-climbing search that proceeds in iterations
  • Start with the set of predicted attributes (P) empty
    and all attributes materialized (M)
  • Each iteration improves the earlier solution by
    moving a selected subset of nodes from M to P
  • Map to a WMIS instance and use its solution
  • Weight of a node (attribute) =
    materializationCost - predictionCost
  • Stop when no improvement is possible
  • Number of CaRTs built (n attributes)
  • Greedy CaRTSelector: O(n)
  • MIS CaRTSelector: O(n^2) in the worst
    case, O(n log n) on average
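
One iteration of the idea above can be sketched with a simple greedy pass over a weighted graph. This is a generic greedy approximation for weighted independent set, used here as a stand-in; it is not the WMIS algorithm SPARTAN uses, and the costs and edges in the usage note are hypothetical.

```python
def node_weights(mater_cost, pred_cost):
    """Weight of each attribute = materializationCost - predictionCost."""
    return {a: mater_cost[a] - pred_cost[a] for a in mater_cost}


def greedy_wmis(weights, edges):
    """Pick positive-weight nodes, heaviest first, skipping any node that
    is adjacent to an already chosen node."""
    neighbors = {v: set() for v in weights}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    chosen = set()
    for v in sorted(weights, key=lambda v: weights[v], reverse=True):
        if weights[v] > 0 and not (neighbors[v] & chosen):
            chosen.add(v)
    return chosen


# Hypothetical costs; an edge joins an attribute to one of its predictors,
# so no chosen attribute is predicted from another chosen attribute.
MATER = {"protocol": 8, "duration": 32, "bytes": 32, "packets": 32}
PRED = {"protocol": 4, "duration": 11, "bytes": 40, "packets": 20}
EDGES = [("protocol", "bytes"), ("duration", "packets")]
```

Here `greedy_wmis(node_weights(MATER, PRED), EDGES)` selects duration and protocol to move from M to P; packets is skipped because it is adjacent to duration, and bytes because its weight is negative.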

14
SPARTAN Architecture: Planning Phase
15
Experimental Results: Summary
  • SPARTAN system has been tested over several real
    data sets
  • Full details are in S. Babu, M. Garofalakis, R.
    Rastogi. SPARTAN: A Model-Based Semantic
    Compression System for Massive Data Tables.
    SIGMOD 2001
  • Better compression ratios compared to Gzip and
    Fascicles
  • Factors up to 3 (for 5-10% error tolerances for
    numeric attributes)
  • 20-30% on average for 1% error for numeric
    attributes
  • Small sample sizes are effective for model-based
    compression
  • 50KB is often sufficient

16
Conclusions
  • MBSC: a novel approach to massive-table
    compression
  • SPARTAN: a specific instantiation of MBSC
  • Uses CaRTs to eliminate significant fractions of
    columns by prediction
  • Uses a Bayesian Network to identify predictive
    correlations and drive the selection of CaRTs
  • CaRT-selection problem is NP-hard
  • Two heuristic-search-based algorithms for
    CaRT-selection
  • Experimental evidence for effectiveness of
    SPARTAN's model-based approach

17
Future Direction in MBSC: Compressing Continuous
Data Streams
  • Networks generate continuous streams of data
  • E.g., packet traces, flow traces, SNMP data
  • Applying MBSC to continuous data streams
  • Data characteristics and dependencies can vary
    over time
  • Goal: compression plan should adapt to changes in
    data characteristics
  • Models must be maintained online as tuples arrive
    in the stream
  • Study data mining models with respect to online
    maintenance
  • Incremental
  • Data stream speeds
  • Parallelism
  • Trade precision for performance
  • Eager vs. lazy schemes
  • Compression plan must be maintained with respect
    to models
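
One way to picture a plan that adapts to changing data is a monitor that watches how often a model's predictions fall outside tolerance and signals when re-planning may be needed. The class below is a hypothetical sketch of that idea only; the slides do not describe a mechanism, and the window size and rate threshold are made-up parameters.

```python
from collections import deque


class OutlierRateMonitor:
    """Track the outlier rate of one predicted attribute over a sliding
    window of recent tuples (hypothetical helper, not part of SPARTAN)."""

    def __init__(self, window, max_rate):
        self.recent = deque(maxlen=window)  # True marks an outlier tuple
        self.max_rate = max_rate

    def observe(self, predicted, actual, tolerance):
        """Record one tuple and report whether re-planning looks warranted."""
        self.recent.append(abs(predicted - actual) > tolerance)
        return self.needs_replan()

    def needs_replan(self):
        # Only judge once the window is full, to avoid reacting to noise.
        if len(self.recent) < self.recent.maxlen:
            return False
        return sum(self.recent) / len(self.recent) > self.max_rate
```

A lazy scheme in this picture would rebuild models only when `needs_replan()` fires, whereas an eager scheme would update them on every tuple.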

18
Future Direction in MBSC: Distributed MBSC
  • Data collection infrastructure is often
    distributed
  • Multiple monitoring points over an ISP's network
  • Web servers are replicated for load balancing and
    reliability
  • Data must be compressed before being transferred
    to warehouses or repositories
  • MBSC can be done locally at each collection point
  • Lack of global data view might result in
    suboptimal compression plans
  • More sophisticated approaches might be beneficial
  • Distributed data mining problem
  • Opportunity cost of network bandwidth is high --
    keep communication overhead minimal

19
Future Direction in MBSC: Using Extracted Models
in Other Contexts
  • A crucial side-effect of MBSC -- capturing data
    characteristics helps enterprise data monitoring
    and analysis
  • Interaction models (e.g., Bayesian Network)
    enable event-correlation and root-cause analysis
    for network management
  • Anomaly detection -- intrusions, (distributed)
    denial-of-service attacks