Accelerating Data Management Operations using New Hardware Paradigms
1
Accelerating Data Management Operations using New
Hardware Paradigms
Major Area Examination
  • Sudipto Das sudipto_at_cs

Committee: Prof. Divyakant Agrawal (co-chair),
Prof. Amr El Abbadi (co-chair), Prof. Timothy
Sherwood
2
Data Streams Model
  • What are Data Streams and how are they processed?

Conventional Database
Data Stream Model
Courtesy: Ahmed Metwally
  • Data is viewed as a passing stream (possibly
    infinite)
  • Only a single pass through the data tuples is
    allowed
  • Answers to queries are computed as the stream is
    viewed

3
Applications of Data Streams
  • Monitoring network and web traffic
  • Internet Advertising
  • Stock market analysis
  • Detecting DoS and DDoS attacks and malicious
    activities on the network
  • Distributed monitoring for load balancing

4
Challenges Involved
  • The volume of the stream over its entire lifetime
    is huge (possibly infinite)
  • Space complexity should be sub-linear in the
    stream size
  • Stream summaries introduce approximation
  • Goal: Reduce the error as well as the space
    requirements
  • Queries require timely answers
  • Most queries are online and continuous in nature
  • Response time must be small
  • Goal: Keep processing cost small
  • Overall: Answer queries with high accuracy using
    minimal space and small response time

5
Why new hardware paradigms?
  • Ever increasing data rates call for faster
    processing
  • Speed of processing cores bounded by physical
    barriers
  • The shift is towards multi-core architectures
    [Ram06]: from one core to multiple cores (Intel
    Digital Revolution)
  • 64-128 cores projected in the near future [Held06]
    (Intel Tera-scale Computing)
  • Cisco recently announced the 40-core QuantumFlow™
    Network Processor
  • A paradigm shift in algorithmic models is needed
    to exploit these architectures
  • Only concurrent programs can effectively exploit
    the potential of multi-core architectures
  • Specialized hardware like TCAMs or GPUs can be
    utilized to improve the efficiency of certain
    operations

6
(No Transcript)
7
Presentation Outline
  • Introduction & Motivation
  • Common Queries on Data Streams
  • New Hardware Paradigms
  • Work in Progress
  • Conclusion & Discussion

8
Frequent Elements
  • Find elements with frequency above a certain
    threshold (also known as the support)
  • Two classes of algorithms
  • Sketch-Based Techniques [Char02]
  • Use a set of hash functions to keep a sketch of
    frequent elements
  • Expensive in terms of computation; error bounds
    are not stringent
  • Counter-Based Techniques [Datar02, Dem02, Man02,
    Karp03, Met05, Pani07]
  • Monitor only a subset of the elements seen
  • Maintain counters for these elements
  • Heuristics limit the space used
  • Handling deletions is not trivial [Corm03]
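The counter-based idea above can be sketched in a few lines. Below is a minimal Misra-Gries-style sketch (a classic counter-based scheme in the same family as the cited algorithms, not a reimplementation of any one of them); the example stream is made up for illustration:

```python
def misra_gries(stream, k):
    """Counter-based frequent-elements sketch using at most k-1 counters.
    Any element with true frequency above n/k is guaranteed to survive;
    counts are underestimates by at most n/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # No free counter: decrement all, dropping those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

stream = ["a", "b", "a", "c", "a", "a", "b", "a"]
print(misra_gries(stream, 3))  # {'a': 4, 'b': 1}
```

Note the sub-linear space: only k-1 counters regardless of stream length, at the cost of approximate counts.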

9
Top K
  • Return the top-k elements based on some scoring
    function [Met05, Das07]
  • Top-k in the sliding window
  • Top-k in a window of data items
  • Window as number of items or based on time
  • Distributed Top-k Monitoring [Bab03]
  • Answer queries when monitors are distributed
  • Goal is to minimize communication and support
    continuous queries
  • Use filters and constraints to limit communication
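As a toy illustration of the top-k-in-a-sliding-window query, the following keeps exact counts over a count-based window; it is deliberately simple and memory-hungry, unlike the approximate summary-based algorithms cited above, and the example stream is made up:

```python
from collections import Counter, deque

def sliding_topk(stream, window, k):
    """Yield the exact top-k elements of the last `window` items
    after each arrival (count-based sliding window)."""
    win = deque()
    counts = Counter()
    for x in stream:
        win.append(x)
        counts[x] += 1
        if len(win) > window:          # expire the oldest item
            old = win.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        yield [e for e, _ in counts.most_common(k)]

# Top-2 answer for the final window of 4 items
*_, last = sliding_topk(["x", "y", "x", "z", "z", "z", "y"], 4, 2)
print(last)  # ['z', 'y']
```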

10
(No Transcript)
11
Presentation Outline
  • Introduction & Motivation
  • Common Queries on Data Streams
  • New Hardware Paradigms
  • Work in Progress
  • Conclusion & Discussion

12
New Hardware Paradigms
  • Network Processing Units (NPU)
  • Ternary Content Addressable Memory (TCAM)
  • Chip Multiprocessors (CMP)
  • Graphics Processing Units (GPU)
  • Cell Broadband Engine

13
Network Processing Unit (NPU)
  • Provides extensive parallelism, supporting up to
    10 Gb/s line rates
  • Examples: Intel IXP family, AMD NPs, Cisco
    QuantumFlow
  • The IXP 2855 NPU provides 16 Micro Engines (MEs),
    each operating at up to 1.5 GHz
  • Each ME has 8 hardware thread contexts
  • MEs are designed for simple data-plane operations
    and have a simple instruction set
  • The XScale core supports a much more diverse
    instruction set and is used as a control-plane
    processor
  • Built-in hashing and cryptography units
  • [Gold05]: TLP of the NPU used for accelerating
    database operations such as sequential scan and
    hash-based join

14
(No Transcript)
15
Ternary Content Addressable Memory
  • Provides constant-time lookups (searches based on
    content)
  • Ternary capability provides "don't care" bits
    that allow range matches
  • The IDT 75K62134 chip has 256K 36-bit words
  • Programmable word size,
    from 36 bits up to 576 bits
  • Supports pipelining of requests;
    128 request contexts supported
  • 100 million searches per second
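To make the "don't care" semantics concrete, a TCAM lookup can be modeled in software as below. The entries, masks, and priority order are illustrative; a real TCAM compares all entries in parallel in a single cycle, whereas this model scans them in priority order:

```python
def ternary_match(entries, key):
    """Software model of a TCAM lookup. Each entry is (value, mask):
    mask bits set to 1 must match the key exactly, mask bits set to 0
    are "don't care". Returns the index of the first matching entry."""
    for i, (value, mask) in enumerate(entries):
        if (key & mask) == (value & mask):
            return i
    return None

# Entry 0 matches any 8-bit key whose top 4 bits are 1010 (a range match);
# entry 1 requires an exact match on all 8 bits.
entries = [(0b10100000, 0b11110000), (0b10101111, 0b11111111)]
print(ternary_match(entries, 0b10101100))  # 0
```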

16
Applications of TCAM
  • TCAM-Conscious Heavy Distinct Hitters [Ban07b]
  • TCAM as a hardware implementation of a hash table
  • Adapted the 1-level filtering of [Venk05] to the
    TCAM setting
  • Acceleration due to the low cost of look-ups
  • Experiments on NLANR repository data confirm the
    intuition
  • Accelerating database operations [Ban05, Ban06]
  • Suggest a CAM-Cache architecture to interface
    a TCAM with the CPU
  • Modify the Nested Loop Join to exploit O(1)
    lookups
  • Efficient sorting and range intersection [Pani03]
  • TCAM-Conscious Frequent Elements [Ban07a]

17
Chip Multi-Processors (CMP)
  • Architecture characterized by multiple cores on a
    single die and shared cache between cores
  • Two broad categories
  • Lean Camp (LC)
  • Simple design of cores
  • Rely on Thread-Level Parallelism
  • Sun UltraSPARC T1 & T2, Compaq Piranha
  • Fat Camp (FC)
  • Target maximum single-thread performance
  • Larger cores compared to LC
  • Intel Core 2 Quad, IBM POWER6

18
Chip Multi-Processors (CMP)
  • Adaptive Aggregation on CMPs [Cie07a]
  • Uses the shared cache in a CMP for a
    hash-based aggregation algorithm
  • Local hash-table approach (best performance),
    shared hash table (plagued by synchronization
    issues), and a hybrid approach with an adaptive
    harness (best of both worlds)
  • [Har07] provides good experimental observations
    for a database server on a CMP
  • Under saturated workloads, the lean camp hides
    memory latencies better
  • L2 hit latency is a bottleneck for chips with
    larger caches
  • Parallel Buffers on CMPs [Cie07b]

19
(No Transcript)
20
Presentation Outline
  • Introduction & Motivation
  • Common Queries on Data Streams
  • New Hardware Paradigms
  • Work in Progress
  • Conclusion & Discussion

21
Integrated Frequent Elements/Top-K
  • Idea developed in [Met05], called Space-Saving
  • The authors develop a new counter-based technique
  • Count occurrences of elements and keep them
    sorted
  • The number of counters depends on the precision
    sought by the user
  • An occurrence of a monitored element results in
    incrementing its counter
  • A new element replaces the element with the
    minimum count
  • The intuition is that high-frequency elements
    will never be replaced
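The replacement policy described above can be sketched as follows. This is a simplified model: it uses a plain dict with a linear scan for the minimum, whereas the actual Stream-Summary structure on the following slides finds the minimum in constant time; the example stream is made up:

```python
def space_saving(stream, m):
    """Sketch of the Space-Saving idea: keep at most m counters; an
    unmonitored element evicts the current minimum and inherits its
    count plus one (so counts are overestimates by at most the min)."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < m:
            counters[x] = 1
        else:
            # Replace the minimum: O(m) here, O(1) with a Stream-Summary.
            victim = min(counters, key=counters.get)
            counters[x] = counters.pop(victim) + 1
    return counters

print(space_saving(["a", "a", "a", "b", "a", "c"], 2))  # {'a': 4, 'c': 2}
```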

22
Space-Saving By Example [Met05, Met06]
Courtesy: Ahmed Metwally
  • Elements must be kept sorted to identify the
    minimum in constant time

23
Stream Summary Data Structure [Dem02, Met05, Met06]
(Figure: a linked list of buckets with counts f1 < f2 < f3; each
bucket holds the elements sharing that count. When element e with
count f1 appears in the stream, e moves into the f2 bucket if
f2 = f1 + 1; if f2 > f1 + 1, a new bucket with count f1 + 1 is
added between f1 and f2.)
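A simplified software model of this bucket structure, using dicts and sets rather than the linked lists of the cited papers, might look like the following (the class and method names are illustrative):

```python
class StreamSummary:
    """Simplified model of the bucket structure: elements with equal
    counts share a bucket; an increment moves the element to the next
    bucket (creating it if needed), keeping elements sorted by count."""
    def __init__(self):
        self.count = {}    # element -> frequency
        self.bucket = {}   # frequency -> set of elements

    def hit(self, e):
        f = self.count.get(e, 0)
        if f:
            self.bucket[f].discard(e)      # leave the old bucket
            if not self.bucket[f]:
                del self.bucket[f]         # empty buckets are removed
        self.count[e] = f + 1
        self.bucket.setdefault(f + 1, set()).add(e)

s = StreamSummary()
for e in ["a", "b", "a"]:
    s.hit(e)
print(sorted(s.bucket.items()))  # [(1, {'b'}), (2, {'a'})]
```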
24
TCAM Adaptation of Space-Saving [Ban07a]
  • Need to maintain counters per element
  • Stream elements looked up efficiently using TCAM
  • Need to keep track of minimum element
  • Element frequencies also stored in TCAM
  • Minimum frequency can be looked up efficiently
  • Fast and Efficient
  • But elements are not sorted
  • Cannot answer top-k queries
  • Adaptation for continuous queries not easy

Figure taken from [Ban07a]. SS = Space-Saving [Met05],
LC = Lossy Counting [Man02]
25
TCAM Adapted Stream Summary
SRAM
TCAM
  • Each stream element must be looked up
  • Use the TCAM to store the elements so they can be
    looked up in O(1) time

26
Using the Parallelism of NPU
  • NPU has 16 Micro Engines (MEs)
  • Previous solution uses only a single ME
  • Challenges in using multiple MEs
  • Shared resources (TCAM) need synchronization
  • Synchronization leads to performance degradation
  • To avoid conflict, split the TCAM
  • Merge the counters to produce the global result
  • Splitting increases space overhead
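The merge step above can be sketched for the simple case where each ME keeps exact counters over its share of a partitioned stream; summing the maps then gives the exact global result. (With approximate per-core summaries, as in the actual TCAM setting, the per-core error bounds add up, which is part of the difficulty the slide alludes to.) The per-ME data below is made up:

```python
from collections import Counter

def merge_summaries(per_core_counts, k):
    """Merge per-core counter maps into a global top-k answer.
    Exact here because the stream is partitioned across cores."""
    total = Counter()
    for c in per_core_counts:
        total.update(c)          # element-wise sum of counts
    return total.most_common(k)

me0 = Counter({"a": 5, "b": 2})
me1 = Counter({"a": 3, "c": 4})
print(merge_summaries([me0, me1], 2))  # [('a', 8), ('c', 4)]
```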

27
Experiments
  • Experiments are performed on the Intel NPU
    platform IXDP 2801
  • Development using the Teja NP Application
    Development Environment and the Intel Internet
    Exchange Architecture (IXA)
  • Involves programming in two semi-low-level
    languages: Teja C™ and Micro C™
  • Synthetic data: Zipfian distribution with varying
    Zipfian factors

28
Experimental Platform
  • IXDP 2801 Development Platform from Intel with
    Integrated IDT TCAM chip

29
Some Preliminary Results
30
Some Preliminary Results
  • Analysis of performance
  • Pointer manipulation overhead: we have to live
    with it
  • Hypothesis: words in the TCAM are not aligned
    along boundaries; experiments revealed this is
    not the case
  • Hypothesis: deletions are not O(1) per stream
    element; this is indeed the culprit
  • The present TCAM word width supports only singly
    linked lists
  • Fix in progress: adding support for doubly linked
    lists

31
Experiments for Parallelism
32
Naïve Synchronization
Naïve Synchronization Ruins Parallelism
33
Open Questions
  • What would be an efficient synchronization
    scheme?
  • Can we do away with synchronization?
  • For a shared structure without synchronization,
    we have Lost Updates
  • Can we provide a bound for error introduced?
  • Challenges with merge
  • How do we merge unsorted lists to generate a
    sorted list?
  • How often do we merge?
  • Do we lose information during a merge?

34
(No Transcript)
35
Presentation Outline
  • Introduction & Motivation
  • Common Queries on Data Streams
  • New Hardware Paradigms
  • Work in Progress
  • Conclusion & Discussion

36
Conclusion
  • Data Streams form an important class of
    applications with specific needs
  • Increased data rates and on-line answering
    constraints necessitate acceleration
  • New hardware paradigms (like TCAMs, multi-core
    processors) can be exploited to accelerate these
    operations
  • Recent Trends in Multi-Core Chip Design also
    advocate development of algorithms to exploit
    these new architectures
  • NPU (parallelism bundled with TCAMs) provides a
    good framework

37
Discussions
  • The increasing popularity of TCAMs may lead to
    their being considered commodity chips (just like
    GPUs)
  • Multi-core architectures (16 to 64 Cores) bring
    forward new frontiers to explore the parallelism
  • Adapt additional stream operators to best exploit
    these advanced features
  • Vision: Design a data management system
    leveraging modern hardware paradigms to
    efficiently and quickly answer a diverse set of
    queries

38
Acknowledgements
  • My advisors and my committee members
  • Computer Science Dept at UCSB
  • Colleagues at DSL and at UCSB

39
References (I)
  • [Ban07a] Bandi et al., Fast Data Stream Algorithms using Associative Memory, SIGMOD 2007
  • [Cie07a] Cieslewicz et al., Adaptive Aggregation on Chip Multiprocessors, VLDB 2007
  • [Ged07] Gedik et al., Executing Stream Joins on the Cell Processor, VLDB 2007
  • [Met05] Metwally et al., Efficient Computation of Frequent and Top-k Elements in Data Streams, ICDT 2005
  • [Ban05] Bandi et al., Hardware Acceleration of Database Operations Using Content-Addressable Memories, DaMoN 2005
  • [Gold05] Gold et al., Accelerating Database Operators Using a Network Processor, DaMoN 2005
  • [Datar02] Datar et al., Maintaining Stream Statistics over Sliding Windows, SODA 2002
  • [Das07] Das et al., Ad-hoc Top-k Query Answering for Data Streams, VLDB 2007
  • [Venk05] Venkataraman et al., New Streaming Algorithms for Fast Detection of Superspreaders, NDSS 2005
  • [Shri04] Shrivastava et al., Medians and Beyond: New Aggregation Techniques for Sensor Networks, SenSys 2004
  • [Corm03] Cormode et al., What's Hot and What's Not: Tracking Most Frequent Items Dynamically, PODS 2003

40
References (II)
  • [Pani07] Panigrahy et al., Finding Frequent Elements in Non-Bursty Streams, ESA 2007
  • [Mot03] Motwani et al., Query Processing, Approximation, and Resource Management in a Data Stream Management System, CIDR 2003
  • [Corm04] Cormode et al., Diamond in the Rough: Finding Hierarchical Heavy Hitters in Multi-Dimensional Streams, SIGMOD 2004
  • [Corm08] Cormode et al., Finding Hierarchical Heavy Hitters in Streaming Data, ACM TKDD 2008
  • [Li07] Li et al., Stochastic Simulation of Biochemical Systems on the Graphics Processing Unit, Bioinformatics, 2007
  • [Dem02] Demaine et al., Frequency Estimation of Internet Packet Streams with Limited Space, ESA 2002
  • [Cie07b] Cieslewicz et al., Parallel Buffers for Chip Multiprocessors, DaMoN 2007
  • [Man02] Manku et al., Approximate Frequency Counts over Data Streams, VLDB 2002
  • [Ban07b] Bandi et al., Fast Algorithms for Heavy Distinct Hitters using Associative Memories, ICDCS 2007
  • [Har07] Hardavellas et al., Database Servers on Chip Multiprocessors: Limitations and Opportunities, CIDR 2007
  • [Corm06] Cormode et al., Space- and Time-Efficient Deterministic Algorithms for Biased Quantiles over Data Streams, PODS 2006

41
References (III)
  • [Col07] Colohan et al., CMP Support for Large and Dependent Speculative Threads, IEEE Transactions on Parallel and Distributed Systems, 2007
  • [Yu04] Yu et al., Efficient Multi-Match Packet Classification with TCAM, IEEE Hot Interconnects 2004
  • [Held06] Held et al., From a Few Cores to Many: A Tera-scale Computing Research Overview, Intel White Paper, 2006
  • [Ram06] Ramanathan, Intel Multi-Core Processors: Leading the Next Digital Revolution, Intel White Paper, 2006
  • [Ban04] Bandi et al., Hardware Acceleration in Commercial Databases: A Case Study of Spatial Operations, VLDB 2004
  • [Karp03] Karp et al., A Simple Algorithm for Finding Frequent Elements in Streams and Bags, TODS 2003
  • [Char02] Charikar et al., Finding Frequent Items in Data Streams, ICALP 2002
  • [Gilb02] Gilbert et al., Fast, Small-Space Algorithms for Approximate Histogram Maintenance, STOC 2002
  • [Fang98] Fang et al., Computing Iceberg Queries Efficiently, VLDB 1998
  • [Ross07] Ross, Efficient Hash Probes on Modern Processors, ICDE 2007
  • [Gre01] Greenwald et al., Space-Efficient Online Computation of Quantile Summaries, SIGMOD 2001
  • [Ban06] Bandi et al., Fast Computation of Database Operations Using CAM, DEXA 2006

42
References (IV)
  • [Gov05] Govindaraju et al., Fast and Approximate Stream Mining of Quantiles and Frequencies using Graphics Processors, SIGMOD 2005
  • [He08] He et al., Relational Joins on Graphics Processors, to appear, SIGMOD 2008
  • [Met06] Metwally et al., An Integrated Efficient Solution for Computing Frequent and Top-k Elements in Data Streams, ACM TODS, Sept 2006
  • [Corm06b] Cormode et al., What's Different: Distributed, Continuous Monitoring of Duplicate-Resilient Aggregates on Data Streams, ICDE 2006
  • [Bab03] Babcock et al., Distributed Top-k Monitoring, SIGMOD 2003
  • [Shar02] Sharma et al., Sorting and Searching using TCAMs, IEEE Hot Interconnects 2002
  • [Pani03] Panigrahy et al., Sorting and Searching using TCAMs, IEEE Micro 2003
  • [Akh07] Akhbarizadeh et al., A TCAM-Based Parallel Architecture for High-Speed Packet Forwarding, IEEE Transactions on Computers, 2007
  • [Est06] Estan et al., Bitmap Algorithms for Counting Active Flows on High-Speed Links, IEEE/ACM Transactions on Networking, Oct 2006
  • [Kum07] Kumar et al., On Finding Frequent Elements in a Data Stream, APPROX/RANDOM 2007
  • [Mou06] Mouratidis et al., Continuous Monitoring of Top-k Queries over Sliding Windows, SIGMOD 2006

43
Thank You
Questions?
44
Back up Slides
45
TCAM Adapted Stream Summary Key Observations
  • Elements are sorted by frequency
  • No overhead for maintaining the minimum
  • Frequency counting within some error bound
  • Can answer both frequent-elements and top-k
    queries
  • Can support continuous queries
  • Presence of the TCAM accelerates look-ups
  • Sorting needs pointer manipulations, which add
    overhead

46
Hot Items
  • Hot items are ones that appear in a significant
    fraction of the stream
  • We look for frequencies above N/(k+1),
    where N is the length of the stream seen so far
    and k is a parameter
  • Frequency counting algorithms such as [Dem02,
    Man02] can be used to answer these queries,
    but they cannot handle deletions in the stream
  • [Corm03] provides an algorithm that efficiently
    handles deletions
  • The input space is divided into subsets
  • Transactions result in incrementing/decrementing
    the appropriate subsets
  • Tests determine whether a set contains a hot item
  • Test results are combined to report the hot item
    set
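The group-testing idea can be illustrated in its simplest form: one counter per bit position of the item identifier, plus a total. This supports deletions and decodes a single majority item; the full algorithm in [Corm03] uses more groups to find k hot items. The update sequence below is made up:

```python
def bit_counter_hot_item(updates, bits=8):
    """Group-testing sketch for the single-hot-item case: maintain a
    counter per bit position and a total, with +1/-1 updates. If one
    item occurs in more than half the net stream, each of its bits wins
    its per-bit majority test, so the id can be decoded bit by bit."""
    total = 0
    bit_counts = [0] * bits
    for item, delta in updates:        # delta = +1 insert, -1 delete
        total += delta
        for b in range(bits):
            if (item >> b) & 1:
                bit_counts[b] += delta
    # Bit b of the hot item is 1 iff a majority of items have it set.
    return sum((1 << b) for b in range(bits) if 2 * bit_counts[b] > total)

ups = [(5, 1), (3, 1), (5, 1), (5, 1), (3, -1)]  # net stream: three 5s
print(bit_counter_hot_item(ups))  # 5
```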

47
Quantiles
  • Very important for summarization of streams
  • Can provide a wide range of statistics about the
    stream
  • Median
  • Percentiles
  • Distribution of the elements in the stream
  • A deterministic algorithm with a sub-linear space
    bound was suggested in [Gre01]
  • [Shri04] suggested an algorithm whose space
    complexity depends on the size of the alphabet U
  • [Corm06] suggested a space-efficient algorithm
    for answering biased quantile queries
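For contrast, a sampling-based quantile estimate is easy to sketch, though unlike the deterministic summaries cited above it offers only probabilistic accuracy. The sample size and seed below are arbitrary choices for illustration:

```python
import random

def stream_quantile(stream, q, sample_size=100, seed=7):
    """Estimate the q-quantile of a stream from a uniform reservoir
    sample kept in bounded space (one pass, sample_size items)."""
    random.seed(seed)
    reservoir = []
    for i, x in enumerate(stream):
        if i < sample_size:
            reservoir.append(x)
        else:
            j = random.randint(0, i)       # classic reservoir sampling
            if j < sample_size:
                reservoir[j] = x
    reservoir.sort()
    return reservoir[min(int(q * len(reservoir)), len(reservoir) - 1)]

# The median of 1..10000 is near 5000; the estimate is approximate.
print(stream_quantile(range(1, 10001), 0.5))
```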

48
Heavy Distinct Hitters
  • The naïve approach consumes a lot of space
  • Bitmap-based techniques [Est06]
  • Flow IDs are hashed into a bitmap; the flow count
    is a count of hits
  • To reduce space, sampling is used
  • Multi-resolution sampling reduces the dependence
    on stream length
  • Simple and fast, but prone to error
  • Hash-based technique [Venk05]
  • Uses different levels of filtering on the
    stream
  • Multi-level filtering is complex, but space
    efficient
  • Streams are sampled to find hosts with multiple
    connections
  • Probabilistic bounds on accuracy based on
    sampling rates
  • Requires tuning a huge number of parameters
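The bitmap idea can be illustrated with linear counting: hash each value into an m-bit bitmap and estimate the distinct count from the fraction of bits still zero. The bitmap size and hash choice below are illustrative, not those of [Est06]:

```python
import hashlib
import math

def bitmap_estimate(values, m=64):
    """Linear-counting sketch: duplicates hash to the same bit, so the
    estimate n_hat = -m * ln(zeros/m) counts distinct values only."""
    bitmap = [0] * m
    for v in values:
        h = int(hashlib.sha1(str(v).encode()).hexdigest(), 16) % m
        bitmap[h] = 1
    zeros = m - sum(bitmap)
    if zeros == 0:
        return float("inf")   # bitmap saturated; a larger m is needed
    return -m * math.log(zeros / m)

# 20 distinct destinations seen three times each: repeats are free.
est = bitmap_estimate([f"dst{i}" for i in range(20)] * 3, m=64)
print(round(est))  # roughly 20
```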

49
Graphics Processing Units (GPU)
  • Characterized by high parallelism and high memory
    bandwidth
  • Up to 200x more processing power than an Intel
    Core 2 Duo E6700 2.6 GHz [Gov05, Li07]
  • Memory bandwidth of 100 GB/s (vs. 10-15 GB/s for
    Intel CPUs)
  • Applications
  • Stream mining for quantile and frequency
    estimation [Gov05]
  • Develops a GPU-aware sorting algorithm that forms
    the core of computing the summaries
  • Uses Lossy Counting [Man02] for frequency
    estimation and [Gre01] for quantiles
  • [Ban04] used GPUs for accelerating spatial
    database operations

50
Cell Broadband Engine
  • Intended for Game Consoles and Multimedia rich
    consumer devices, product of STI
  • Typically contains 1 Power Processing Element
    (PPE) and 8 Synergistic Processing Elements (SPEs)
  • PPE: a dual-threaded 64-bit RISC processor that
    runs the system software
  • SPE: a 128-bit RISC processor specialized for
    data-rich, compute-intensive SIMD applications;
    provides instruction-level parallelism in the
    form of a dual pipeline
  • Band joins on stream windows have been
    parallelized on the Cell processor [Ged07]

51
TCAM Architecture
Courtesy: Banit Agrawal
52
Space Saving
  • An element ei with frequency fi > min must exist
    in the Stream-Summary
  • Assuming no specific distribution or
    user-supplied support, the Space-Saving algorithm
    finds all frequent elements within a given error
    using a bounded number of counters
  • Any element ei whose frequency exceeds the error
    threshold is guaranteed to be in the
    Stream-Summary
  • Zipfian data
  • Zipfian data with a given skew parameter has a
    power-law frequency distribution

53
Comparison of Frequency Counting Techniques
  • Sticky Sampling [Man02]
  • Lossy Counting [Man02]
  • GroupTest [Corm03]
  • CountSketch [Char02]
  • Misra-Gries
  • Frequent [Karp03]
  • SpaceSaving [Met05]
  • [Pani07]

54
Parallel Computing
  • Amdahl's Law
  • Multiple-core computing: multiple execution units
  • Symmetric Multiprocessing (SMP)
  • A memory architecture where two or more identical
    processors are connected to a single shared
    memory
  • In multi-core architectures, SMP refers to the
    shared cache
  • An advantage of CMP over SMP is the presence of
    on-chip cache-coherence hardware
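Amdahl's Law itself is a one-line formula; the 90% parallel fraction below is just an example workload:

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's Law: overall speedup is limited by the serial fraction,
    regardless of how many cores a CMP provides."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

# Even with 64 cores, a 10% serial portion caps speedup below 10x.
print(round(amdahl_speedup(0.9, 64), 2))  # 8.77
```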

55
References (V)
  • [Her04] Hershberger et al., Adaptive Spatial Partitioning for Multidimensional Streams, ISAAC 2004
  • [Her05] Hershberger et al., Space Complexity of Hierarchical Heavy Hitters in Multi-Dimensional Data Streams, PODS 2005
  • [Pani02] Panigrahy et al., Reducing TCAM Power Consumption and Increasing Throughput, IEEE Hot Interconnects 2002
  • [Lak05] Lakshminarayanan et al., Algorithms for Advanced Packet Classification with Ternary CAMs, SIGCOMM 2005