Title: Model-Based Semantic Compression for Network-Data Tables

1
Model-Based Semantic Compression for Network-Data
Tables

Shivnath Babu

Stanford University
with Minos Garofalakis, Rajeev Rastogi, Avi
Silberschatz
Bell Laboratories
NRDM, Santa Barbara, CA, May 25, 2001
2
Introduction

Networks create massive, fast-growing
relational-data tables
Switch/router-level network performance data
SNMP and RMON data
Packet and flow traces (Sprint IP backbone -- 600
gigabytes/day)
Call Detail Records (AT&T -- 300 million
records/day)
Web-server logs (Akamai -- 10-100 billion
log-lines/day)

The data is important for running big enterprises
effectively
Application and user profiling
Capacity planning and provisioning, determining
pricing plans
The data needs to be stored, analyzed, and
(often) shipped across sites

3
Compressing Massive Tables

Example table: network flow measurements
(simplified)

Good compression is essential
Optimizes storage, I/O, network bandwidth over
the lifetime of the data
Can afford intelligent compression

4
Compressing Massive Tables: A New Direction in
Data Compression

Several generic compression techniques and
tools exist (e.g., Huffman,
Lempel-Ziv, Gzip)
Syntactic: operate at the byte level, view the table as
a large byte string
Lossless: do not support lossy
compression

Semantic compression
Exploiting data characteristics and dependencies
improves compression ratio significantly
Capturing aggregate data characteristics ties in
with enterprise data monitoring and analysis

Benefits of lossy compression schemes
Enables trading precision for performance
(compression time and storage)
Tradeoff can be adjusted by user (flexible)

5
SPARTAN: A Model-Based Semantic Compressor

New compression paradigm: Model-Based Semantic
Compression (MBSC)
Extract data mining models from table
Derive compression plan using the extracted
models
Use models to represent data succinctly
Use models to drive other model building
Compress different data partitions using
different models

Lossless and lossy compression (within
user-specified error bounds)

SPARTAN system implements a specific
instantiation of MBSC
Key idea: Classification and Regression Trees
(CaRTs) can capture cross-column dependencies and
eliminate entire data columns

6
SPARTAN: Semantic Compression with Classification
and Regression Trees (CaRTs)
error = 0, error <= 3

Protocol  Duration  Bytes  Packets
http      12        20K    3
http      16        24K    5
http      15        20K    8
http      19        40K    11
http      26        58K    18
ftp       27        100K   24
ftp       32        300K   35
ftp       18        80K    15

A compact CaRT can eliminate an entire column by
prediction
Outlier: Packets = 11, Duration = 19
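
A minimal sketch of how one CaRT could replace the Duration column of the example table. The one-split tree (Packets > 16), its leaf values (29 and 15), and the inclusive absolute tolerance of 3 are assumptions chosen for illustration; the slide's actual tree is not preserved in the transcript. Rows the tree predicts within tolerance are dropped; rows it misses are stored exactly as outliers.

```python
# Rows of the example table: (protocol, duration, bytes_in_K, packets).
ROWS = [
    ("http", 12, 20, 3),
    ("http", 16, 24, 5),
    ("http", 15, 20, 8),
    ("http", 19, 40, 11),
    ("http", 26, 58, 18),
    ("ftp", 27, 100, 24),
    ("ftp", 32, 300, 35),
    ("ftp", 18, 80, 15),
]

TOLERANCE = 3  # assumed absolute error bound on Duration


def predict_duration(packets):
    """Hypothetical one-split regression tree: Packets -> Duration."""
    return 29 if packets > 16 else 15


def compress_duration(rows, tolerance=TOLERANCE):
    """Return the outliers that must be stored once Duration is dropped."""
    outliers = {}
    for i, (_, duration, _, packets) in enumerate(rows):
        if abs(predict_duration(packets) - duration) > tolerance:
            outliers[i] = duration  # kept exactly, keyed by row index
    return outliers


def decompress_duration(rows_without_duration, outliers):
    """Rebuild Duration from Packets, overriding with stored outliers."""
    return [
        outliers.get(i, predict_duration(packets))
        for i, (_, _, packets) in enumerate(rows_without_duration)
    ]


outliers = compress_duration(ROWS)
stripped = [(proto, size, pkts) for (proto, _, size, pkts) in ROWS]
restored = decompress_duration(stripped, outliers)
```

Under these assumptions the only stored outlier is the row with Packets = 11 and Duration = 19, and every restored Duration value is within the tolerance of the original.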
7
SPARTAN Compression Problem Formulation

Given: a data table over a set of attributes X and
per-attribute error tolerances
Find: a set of attributes P to be predicted using
CaRTs such that
Overall storage cost (CaRTs + outliers +
materialized columns) is minimized
Each attribute in P is predicted within its
specified tolerance
A predicted attribute should not be used to
predict another attribute -- otherwise errors
will compound

Non-trivial problem
Space of possible CaRT predictors is
exponential in number of attributes
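
The objective and the "no predicted attribute as predictor" constraint can be sketched as a cost function over a candidate plan. All names and byte counts below are hypothetical; the per-attribute prediction cost is assumed to already include the CaRT and its outliers.

```python
def plan_cost(attributes, predicted, predictors, mat_cost, pred_cost):
    """Storage cost of a compression plan; raises if the plan is infeasible.

    attributes: all column names (the set X)
    predicted:  columns to be predicted by CaRTs (the set P)
    predictors: column -> set of columns its CaRT reads
    mat_cost:   column -> cost of storing the column as-is
    pred_cost:  column -> cost of its CaRT plus its outliers
    """
    materialized = set(attributes) - set(predicted)
    for col in predicted:
        # A predicted attribute may only depend on materialized ones,
        # otherwise prediction errors would compound.
        if not predictors[col] <= materialized:
            raise ValueError(f"{col} is predicted from a predicted attribute")
    return (sum(mat_cost[c] for c in materialized)
            + sum(pred_cost[c] for c in predicted))


# Hypothetical costs for the four columns of the example table.
ATTRS = ["protocol", "duration", "bytes", "packets"]
MAT_COST = {"protocol": 8, "duration": 32, "bytes": 32, "packets": 32}
PRED_COST = {"protocol": 6, "duration": 10}
PREDICTORS = {"protocol": {"packets", "bytes"}, "duration": {"packets"}}

cost_all_materialized = plan_cost(ATTRS, set(), PREDICTORS, MAT_COST, PRED_COST)
cost_with_carts = plan_cost(
    ATTRS, {"protocol", "duration"}, PREDICTORS, MAT_COST, PRED_COST
)
```

A search procedure would evaluate many such plans; this function only scores one of them.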

8
Two-Phase Compression

Planning Phase -- Come up with a compression
plan
Compression Phase -- Scan the data and compress
it using the plan

9
SPARTAN Architecture: Planning Phase
10
SPARTAN's DependencyFinder

Goal: Identify strong dependencies among
attributes to prune the
(huge) search space of possible CaRT models

Input: Random sample of input table T
Output: A Bayesian Network (BN) over T's
attributes

Structure of BN: Neighbors are the strongly
related attributes
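
To illustrate the idea of finding strongly related attributes from a sample, the sketch below uses pairwise mutual information as a stand-in. This is not a Bayesian-network learner and is not claimed to be what DependencyFinder does; the function names, the threshold, and the assumption that all attributes are discrete (numeric columns would first need binning) are illustrative only.

```python
import math
import random
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete columns."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def strong_neighbors(table, sample_size, threshold, seed=0):
    """Map each column to the columns it is strongly related to.

    table: dict mapping column name -> list of discrete values
    """
    cols = list(table)
    n = len(table[cols[0]])
    # Work on a random sample of rows rather than the full table.
    idx = random.Random(seed).sample(range(n), min(sample_size, n))
    sample = {c: [table[c][i] for i in idx] for c in cols}
    neighbors = {c: set() for c in cols}
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if mutual_information(sample[a], sample[b]) >= threshold:
                neighbors[a].add(b)
                neighbors[b].add(a)
    return neighbors


# Toy table: b duplicates a, c is independent of both.
TABLE = {
    "a": [0, 1] * 50,
    "b": [0, 1] * 50,
    "c": [0, 0, 1, 1] * 25,
}
NEIGHBORS = strong_neighbors(TABLE, sample_size=100, threshold=0.5)
```

Only neighboring attributes would then be considered as candidate predictors, which is what shrinks the search space of CaRT models.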

11
SPARTAN Architecture: Planning Phase
12
SPARTAN's CaRTSelector

Heart of SPARTAN's semantic-compression engine
Output: Subset of attributes P to be
predicted (within tolerance)
and corresponding CaRTs
Uses the Bayesian Network constructed by
DependencyFinder

Hard optimization problem: strict generalization
of Weighted Maximum Independent Set (WMIS)
(NP-hard)
Two solutions:
Greedy heuristic
New heuristic based on WMIS approximation
algorithms

13
Maximum Independent Set (MIS) CaRTSelector

Exploits mapping of WMIS to CaRTSelector problem
Hill-climbing search that proceeds in iterations
Start with the set of predicted attributes (P) empty and
all attributes materialized (M)
Each iteration improves the earlier solution by
moving a selected subset of nodes from M to P
Map to a WMIS instance and use its solution
Weight of a node (attribute) =
materializationCost - predictionCost
Stop when no improvement is possible
Number of CaRTs built (n attributes)
Greedy CaRTSelector: O(n)
MIS CaRTSelector: O(n^2) in the worst
case, O(n log n) on average
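
One iteration of the WMIS mapping can be sketched as follows, assuming node weight = materialization cost minus prediction cost and an edge between an attribute and each of its predictors. The greedy heaviest-first rule below is a simple stand-in for a WMIS approximation algorithm, not necessarily the one SPARTAN uses, and all costs are hypothetical.

```python
def greedy_wmis(weights, edges):
    """Greedy approximation to Weighted Maximum Independent Set.

    Visits nodes from heaviest to lightest and takes a node when it has
    positive weight and no already-chosen neighbor.
    """
    adjacent = {v: set() for v in weights}
    for u, v in edges:
        adjacent[u].add(v)
        adjacent[v].add(u)
    chosen, blocked = set(), set()
    for v in sorted(weights, key=weights.get, reverse=True):
        if weights[v] > 0 and v not in blocked:
            chosen.add(v)
            blocked |= adjacent[v]  # neighbors must stay materialized
    return chosen


# Hypothetical costs and candidate CaRT predictors per attribute.
MAT_COST = {"duration": 32, "protocol": 8, "packets": 32, "bytes": 32}
PRED_COST = {"duration": 10, "protocol": 6, "packets": 31, "bytes": 40}
PREDICTORS = {
    "duration": {"packets"},
    "protocol": {"packets", "bytes"},
    "packets": {"bytes"},
    "bytes": {"packets"},
}

# Weight = storage saved by predicting the attribute instead of storing it.
WEIGHTS = {a: MAT_COST[a] - PRED_COST[a] for a in MAT_COST}
# An attribute and its predictor cannot both be in P.
EDGES = [(a, p) for a, ps in PREDICTORS.items() for p in ps]

TO_PREDICT = greedy_wmis(WEIGHTS, EDGES)
SAVING = sum(WEIGHTS[a] for a in TO_PREDICT)
```

With these numbers the sketch moves duration and protocol from M to P and keeps packets and bytes materialized. The full algorithm would rebuild CaRTs and repeat until no iteration improves the plan.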

14
SPARTAN Architecture: Planning Phase
15
Experimental Results: Summary

SPARTAN system has been tested over several real
data sets
Full details are in S. Babu, M. Garofalakis, R.
Rastogi. SPARTAN: A Model-Based Semantic
Compression System for Massive Data Tables.
SIGMOD 2001
Better compression ratios compared to Gzip and
Fascicles
Factors up to 3 (for 5-10% error tolerances for
numeric attributes)
20-30% on average for 1% error for numeric
attributes
Small sample sizes are effective for model-based
compression
50KB is often sufficient

16
Conclusions

MBSC: A novel approach to massive-table
compression
SPARTAN: a specific instantiation of MBSC
Uses CaRTs to eliminate significant fractions of
columns by prediction
Uses a Bayesian Network to identify predictive
correlations and drive the selection of CaRTs
CaRT-selection problem is NP-hard
Two heuristic-search-based algorithms for
CaRT-selection
Experimental evidence for the effectiveness of
SPARTAN's model-based approach

17
Future Direction in MBSC: Compressing Continuous
Data Streams

Networks generate continuous streams of data
E.g., packet traces, flow traces, SNMP data
Applying MBSC to continuous data streams
Data characteristics and dependencies can vary
over time
Goal: the compression plan should adapt to changes in
data characteristics
Models must be maintained online as tuples arrive
in the stream
Study data mining models with respect to online
maintenance
Incremental
Data stream speeds
Parallelism
Trade precision for performance
Eager vs. lazy schemes
Compression plan must be maintained with respect
to models

18
Future Direction in MBSC: Distributed MBSC

Data collection infrastructure is often
distributed
Multiple monitoring points over an ISP's network
Web servers are replicated for load balancing and
reliability
Data must be compressed before being transferred
to warehouses or repositories
MBSC can be done locally at each collection point
Lack of global data view might result in
suboptimal compression plans
More sophisticated approaches might be beneficial
Distributed data mining problem
Opportunity cost of network bandwidth is high --
keep communication overhead minimal

19
Future Direction in MBSC: Using Extracted Models
in Other Contexts

A crucial side-effect of MBSC -- capturing data
characteristics helps enterprise data monitoring
and analysis
Interaction models (e.g., Bayesian Network)
enable event-correlation and root-cause analysis
for network management
Anomaly detection -- intrusions, (distributed)
denial-of-service attacks
