Title: High Performance Embedded Computing with Massively Parallel Processors
1. High Performance Embedded Computing with Massively Parallel Processors
- Yangdong Steve Deng
- dengyd_at_tsinghua.edu.cn
- Tsinghua University
2. Outline
- Motivation and background
- Morphing GPU into a network processor
- High performance radar DSP processor
- Conclusion
3. High Performance Embedded Computing
- Future IT infrastructure demands even higher computing power
- Core Internet router: throughput up to 90Tbps
- 4G wireless base station: 1Gbit/s data rate per customer and up to 200 subscribers in a service area
- CMU driverless car: 270 GFLOPs (Giga FLoating-point Operations Per second)
4. Fast Increasing IC Costs
- Fabrication cost
- Moore's Second Law: the cost of doubling circuit density increases in line with Moore's First Law
- Design cost
- Now $20-50M per product
- Will reach $75-120M at the 32nm node
The 4-year development of the Cell processor by Sony, IBM, and Toshiba cost over $400M.
5. Implications of the Prohibitive Cost
- ASICs would be unaffordable for many applications!
- Scott MacGregor, CEO of Broadcom: Broadcom is not intending a move to 45nm in the next year or so, as it will be too expensive.
- David Turek, VP of IBM: IBM will be pulling out of Cell development, with PowerXCell 8i to be the company's last entrance in the technology.
6. Multicore Machines Are Really Powerful!
Manufacturer | Type | Model | Cores | FP64 GFLOPs | FP32 GFLOPs
AMD | GPGPU | FireStream 9270 | 160/800 | 240 | 1200
AMD | GPU | Radeon HD 5870 | 320/1600 | 544 | 2720
AMD | GPU | Radeon HD 5970 | 640/3200 | 928 | 4640
AMD | CPU | Magny-Cours | 12 | 362.11 | 362.11
Fujitsu | CPU | SPARC64 VII | 4 | 128 | 128
Intel | CPU | Core 2 Extreme QX9775 | 4 | 51.2 | 51.2
nVidia | GPU | Fermi 480 | 512 | 780 | 1560
nVidia | GPGPU | Tesla C1060 | 240 | 77.76 | 933.12
nVidia | GPGPU | Tesla C2050 | 448 | 515.2 | 1288
Tilera | CPU | TilePro | 64 | 166 | 166
GPU: Graphics Processing Unit; GPGPU: General-Purpose GPU
(Figures: Tilera Tile-Gx100 CPU, AMD 12-core CPU, NVidia Fermi GPU)
7. Implications
- An increasing number of applications would be implemented with multi-core devices
- Huawei: multi-core base stations
- Intel: cluster-based Internet routers
- IBM: signal processing and radar applications on the Cell processor
- Also meets the strong demands for customizability and extendibility
8. Outline
- Motivation and background
- Morphing GPU into a network processor
- High performance radar DSP processor
- Conclusion
9. Software Routing with GPU
- Background and motivation
- GPU based routing processing
- Routing table lookup
- Packet classification
- Deep packet inspection
- GPU microarchitecture enhancement
- CPU and GPU integration
- QoS-aware scheduling
10. Ever-Increasing Internet Traffic
11. Fast Changing Network Protocols/Services
- New services are rapidly appearing
- Data centers, Ethernet forwarding, virtual LANs, ...
- Personal customization is often essential for QoS
- However, today's Internet heavily depends on 2 protocols, Ethernet and IPv4, both developed in the 1970s!
12. Internet Router
13. Internet Router
- Backbone network device
- Packet forwarding and path finding
- Connects multiple subnets
- Key requirements
- High throughput: 40Gbps to 90Tbps
- High flexibility
(Figure: packets entering and leaving a router)
14. Current Router Solutions
- Hardware routers
- Fast
- But long design time, expensive, and hard to maintain
- Network-processor-based routers
- Network processor: a data-parallel packet processor
- No good programming models
- Software routers
- Extremely flexible
- Low cost
- But slow
15. Outline
- Background and motivation
- GPU based routing processing
- Routing table lookup
- Packet classification
- Deep packet inspection
- GPU microarchitecture enhancement
- CPU and GPU integration
- QoS-aware scheduling
16. Critical Path of Routing Processing
17. GPU Based Software Router
- Data-level parallelism: packet-level parallelism
18. Routing Table Lookup
- Routing table contains network topology information
- Find the output port according to the destination IP address
- Potentially large routing tables (1M entries)
- Can be updated dynamically
An exemplar routing table:
Destination Address Prefix | Next Hop | Output Port
24.30.32/20 | 192.41.177.148 | 2
24.30.32.160/28 | 192.41.177.3 | 6
208.12.32/20 | 192.41.177.196 | 1
208.12.32.111/32 | 192.41.177.195 | 5
19. Routing Table Lookup
- Longest prefix match
- Memory bound
- Usually based on a trie data structure
- Trie: a prefix tree with strings as keys
- A node's position directly reflects its key
- Pointer operations
- Widely divergent branches!
Destination Address Prefix | Next Hop | Output Port
24.30.32/20 | 192.41.177.148 | 2
24.30.32.160/28 | 192.41.177.3 | 6
208.12.32/20 | 192.41.177.196 | 1
208.12.32.111/32 | 192.41.177.195 | 5
20. GPU Based Routing Table Lookup
- Organize the search trie into an array
- Pointers converted to offsets with respect to the array head
- 6X speedup even with frequent routing table updates
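The trie-to-array idea can be sketched in plain Python (our own illustrative layout and names, not the actual CUDA implementation): each node becomes three integers, so a lookup is pure index arithmetic that GPU threads can perform instead of chasing pointers.

```python
# Sketch: flatten a binary IP-lookup trie into a flat integer array.
# Node layout: [next_hop, left_child_offset, right_child_offset]; -1 = absent.

def build_trie(prefixes):
    """prefixes: list of (bit_string, next_hop). Returns a nested-dict trie."""
    root = {"hop": None, "0": None, "1": None}
    for bits, hop in prefixes:
        node = root
        for b in bits:
            if node[b] is None:
                node[b] = {"hop": None, "0": None, "1": None}
            node = node[b]
        node["hop"] = hop
    return root

def flatten(root):
    """Breadth-first numbering; child pointers become offsets from the array head."""
    nodes, index = [root], {id(root): 0}
    i = 0
    while i < len(nodes):
        for b in "01":
            child = nodes[i][b]
            if child is not None and id(child) not in index:
                index[id(child)] = len(nodes)
                nodes.append(child)
        i += 1
    arr = []
    for n in nodes:
        arr.append(-1 if n["hop"] is None else n["hop"])
        for b in "01":
            arr.append(index[id(n[b])] if n[b] is not None else -1)
    return arr

def lookup(arr, addr_bits):
    """Longest-prefix match by walking the flat array with index arithmetic."""
    node, best = 0, -1
    for b in addr_bits:
        if arr[3 * node] != -1:
            best = arr[3 * node]        # remember the longest match so far
        nxt = arr[3 * node + 1 + int(b)]
        if nxt == -1:
            return best                 # no deeper match possible
        node = nxt
    if arr[3 * node] != -1:
        best = arr[3 * node]
    return best
```

On the GPU, every thread would run `lookup` on one packet's destination address; the array form keeps all accesses as base-plus-offset loads.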
21. Packet Classification
- Match header fields against predefined rules
- Size of rule sets can be huge (e.g., over 5000 rules)
Rule examples:
Priority | Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority
Packet filtering | Deny all traffic from ISP3 destined to 166.111.66.77
Traffic rate limiting | Ensure ISP2 does not inject more than 10Mbps of email traffic on interface 2
Accounting and billing | Treat video traffic to 166.111.X.X as highest priority and perform accounting
22. Packet Classification
- Hardware solution
- Usually with Ternary CAM (TCAM)
- Expensive and power hungry
- Software solutions
- Linear search
- Hash based
- Tuple space search: convert the rules into a set of exact matches
23. GPU Based Packet Classification
- A linear search approach
- Scales to rule sets with 20,000 rules
- Meta-programming
- Compile rules into CUDA code with PyCUDA
Rule: Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority
Generated code: if ((DA >= 166.111.66.70) && (DA <= 166.111.66.77)) priority = 0;
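A minimal sketch of the meta-programming step (function name `compile_rules` and the packet field `da` are our assumptions): each range rule is compiled into a line of CUDA source text; the real flow would hand the resulting string to PyCUDA for JIT compilation.

```python
# Sketch: compile (lo_ip, hi_ip, priority) rules into CUDA source text.

def ip_to_int(ip):
    """Pack a dotted-quad IPv4 address into a 32-bit integer."""
    a, b, c, d = (int(x) for x in ip.split("."))
    return (a << 24) | (b << 16) | (c << 8) | d

def compile_rules(rules):
    """Emit one if-statement per rule; a real PyCUDA flow would wrap
    these lines in a __global__ kernel and JIT-compile the string."""
    lines = []
    for lo, hi, prio in rules:
        lines.append(
            "if (da >= %du && da <= %du) priority = %d;"
            % (ip_to_int(lo), ip_to_int(hi), prio)
        )
    return "\n".join(lines)
```

Because the rules are baked into the generated kernel as constants, the GPU's linear search over the rule set needs no rule table in memory at all.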
24. GPU Based Packet Classification
25. Deep Packet Inspection (DPI)
- Core component of network intrusion detection
- Guards against viruses, spam, software vulnerabilities, ...
Example rule: alert tcp $EXTERNAL_NET 27374 -> $HOME_NET any (msg:"BACKDOOR subseven 22"; flags:A; content:"0d0a5b52504c5d3030320d0a")
Snort data flow: Packet stream -> Sniffing -> Packet Decoder -> Preprocessor (plug-ins) -> Detection Engine (plug-ins: fixed string matching, regular expression matching) -> Output Stage (plug-ins) -> Alerts/Logs
26. GPU Based Deep Packet Inspection (DPI)
- Fixed string matching
- Each rule is just a string that is disallowed
- Bloom-filter based search
- One warp per packet and one thread per string
- Throughput: 19.2Gbps (30X speedup over Snort)
(Figure: Bloom vector at initialization, after pre-processing the rules, and while checking packet content)
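The Bloom-filter search can be illustrated with a small Python sketch (the hash construction, bit-vector size, and class name are our assumptions, not from the slides). A Bloom hit is only probabilistic, so a match must still be verified by exact comparison; a miss, however, is definitive.

```python
# Sketch: a Bloom filter for pre-screening packet content against rule strings.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = bytearray(m_bits)      # one byte per bit, for clarity

    def _positions(self, s):
        # Derive k positions by salting a single hash with the index.
        for i in range(self.k):
            h = hashlib.md5((str(i) + s).encode()).digest()
            yield int.from_bytes(h[:4], "big") % self.m

    def add(self, s):
        """Pre-processing: set the k bits for each disallowed string."""
        for p in self._positions(s):
            self.bits[p] = 1

    def maybe_contains(self, s):
        """False = definitely absent; True = possible hit, verify exactly."""
        return all(self.bits[p] for p in self._positions(s))
```

On the GPU, each thread of a warp would probe the shared bit vector for one candidate substring of its warp's packet, which is why the slide pairs one warp per packet with one thread per string.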
27. GPU Based Deep Packet Inspection (DPI)
- Regular expression matching
- Each rule is a regular expression
- e.g., a|b* matches ε, a, b, bb, bbb, ...
- Aho-Corasick algorithm
- Converts patterns into a finite state machine
- Matching is done by state traversal
- Memory bound
- Virtually no computation
- Compress the state table by merging don't-care entries
- Throughput: 9.3Gbps (15X speedup over Snort)
Example pattern set: he, she, his, hers
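The Aho-Corasick construction and traversal can be sketched compactly in Python (uncompressed state table; the GPU version additionally merges don't-care entries to shrink it):

```python
# Sketch: Aho-Corasick multi-pattern matching via goto/fail/output tables.
from collections import deque

def build_automaton(patterns):
    goto, fail, out = [{}], [0], [set()]     # state 0 is the root
    for pat in patterns:                     # 1) build the goto trie
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)
    q = deque(goto[0].values())              # 2) BFS to fill failure links
    while q:
        r = q.popleft()
        for ch, s in goto[r].items():
            q.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0) if goto[f].get(ch, 0) != s else 0
            out[s] |= out[fail[s]]           # inherit outputs via failure link
    return goto, fail, out

def match(text, goto, fail, out):
    """Single pass over text; reports (start_index, pattern) for every hit."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in out[s]:
            hits.append((i - len(pat) + 1, pat))
    return hits
```

Matching touches one table entry per input byte and does essentially no arithmetic, which is exactly why the slide calls it memory bound.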
28. Outline
- Background and motivation
- GPU based routing processing
- Routing table lookup
- Packet classification
- Deep packet inspection
- GPU microarchitecture enhancement
- CPU and GPU integration
- QoS-aware scheduling
29. Limitation of GPU-Based Packet Processing
(Figure: packet queue)
30. Microarchitectural Enhancements
- CPU-GPU integration with a shared memory
- Maintain the current CUDA interface
- Implemented on GPGPU-Sim
(Figure: GPU attached to a CPU/GPU shared memory)
A. Bakhoda, et al., "Analyzing CUDA Workloads Using a Detailed GPU Simulator," ISPASS, 2009.
31. Microarchitectural Enhancements
- Uniformly one thread per packet
- No thread blocks necessary
- Directly schedule and issue warps
- GPU fetches packet IDs from the task queue when
- either a sufficient number of packets have been collected,
- or a given interval has passed since the last fetch
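The fetch policy above can be sketched as a small simulation (function and parameter names are ours): a batch is issued when it fills, or when the timeout since the last fetch has expired, checked here at packet-arrival times.

```python
# Sketch: batch-or-timeout fetching of packet IDs from a task queue.

def fetch_batches(arrivals, batch=4, timeout=1.0):
    """arrivals: list of (timestamp, packet_id) in time order.
    Returns the list of batches the GPU front-end would issue."""
    batches, pending, last_fetch = [], [], 0.0
    for t, pid in arrivals:
        if pending and t - last_fetch >= timeout:
            batches.append(pending)      # timer expired: issue a partial batch
            pending, last_fetch = [], t
        pending.append(pid)
        if len(pending) == batch:
            batches.append(pending)      # batch full: issue immediately
            pending, last_fetch = [], t
    if pending:
        batches.append(pending)          # flush the tail at end of trace
    return batches
```

The timeout bounds worst-case packet latency under light load, while the batch threshold keeps warps full under heavy load.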
32. Results: Throughput
33. Results: Packet Latency
34. Outline
- Motivation and background
- Morphing GPU into a network processor
- High performance radar DSP processor
- Conclusion
35. High Performance Radar DSP Processor
- Motivation
- Feasibility of GPU for DSP processing
- Designing a massively parallel DSP processor
36. Research Objectives
- High performance DSP processor
- For high-performance applications: radar, sonar, cellular baseband, ...
- Performance requirements
- Throughput: 800 GFLOPS
- Power efficiency: 100 GFLOPS/W
- Memory bandwidth: 400 Gbit/s
- Scales to multi-chip solutions
37. Current DSP Platforms
Processor | Frequency | Cores | Throughput | Memory Bandwidth | Power | Power Efficiency (GFLOPS/W)
TI TMS320C6472-700 | 500MHz | 6 | 33.6 GMACs/s | N/A | 3.8W | 17.7
FreeScale MSC8156 | 1GHz | 6 | 48 GMACs/s | 1GB/s | 10W | 9.6
ADI TigerSHARC ADSP-TS201S | 600MHz | 1 | 4.8 GMACs/s | 38.4GB/s (on-chip) | 2.18W | 4.4
PicoChip PC205 | 260MHz | 1 GPP + 248 DSPs | 31 GMACs/s | N/A | <5W | 12.4
Intel Core i7 980XE | 3.3GHz | 6 | 107.5 GFLOPS | 31.8GB/s | 130W | 0.8
Tilera Tile64 | 866MHz | 64 CPUs | 221 GFLOPS | 6.25GB/s | 22W | 10.0
NVidia Fermi GPU | 1GHz | 512 scalar cores | 1536 GFLOPS | 230GB/s | 200W | 7.7
GDDR5 peak bandwidth: 28.2GB/s
38. High Performance Radar DSP Processor
- Motivation
- Feasibility of GPU for DSP processing
- Designing a massively parallel DSP processor
39. HPEC Challenge - Radar Benchmarks
Benchmark | Description
TDFIR | Time-domain finite impulse response filtering
FDFIR | Frequency-domain finite impulse response filtering
CT | Corner turn (matrix transpose) to place radar data into contiguous rows for efficient FFT
QR | QR factorization, prevalent in target recognition algorithms
SVD | Singular value decomposition; produces a basis for the matrix as well as its rank, for reducing interference
CFAR | Constant false-alarm rate detection: find targets in an environment with varying background noise
GA | Graph optimization via genetic algorithm, removing uncorrelated data relations
PM | Pattern matching: identify stored tracks that match a target
DB | Database operations to store and query target tracks
40. GPU Implementation
Benchmark | Description
TDFIR | Loops of multiplication and accumulation (MAC)
FDFIR | FFT followed by MAC loops
CT | GPU based matrix transpose, extremely efficient
QR | CPU + GPU pipeline, fast Givens algorithm
SVD | Based on QR factorization and fast matrix multiplication
CFAR | Accumulation of neighboring vector elements
GA | Parallel random number generator and inter-thread communication
PM | Vector-level parallelism
DB | Binary tree operations, hard for GPU implementation
41. Performance Results
Kernel | Data Set | CPU Throughput (GFLOPS) | GPU Throughput (GFLOPS) | Speedup
TDFIR | Set 1 | 3.382 | 97.506 | 28.8
TDFIR | Set 2 | 3.326 | 23.130 | 6.9
FDFIR | Set 1 | 0.541 | 61.681 | 114.1
FDFIR | Set 2 | 0.542 | 11.955 | 22.1
CT | Set 1 | 1.194 | 17.177 | 14.3
CT | Set 2 | 0.501 | 35.545 | 70.9
PM | Set 1 | 0.871 | 7.761 | 8.9
PM | Set 2 | 0.281 | 21.241 | 75.6
CFAR | Set 1 | 1.154 | 2.234 | 1.9
CFAR | Set 2 | 1.314 | 17.319 | 13.1
CFAR | Set 3 | 1.313 | 13.962 | 10.6
CFAR | Set 4 | 1.261 | 8.301 | 6.6
GA | Set 1 | 0.562 | 1.177 | 2.1
GA | Set 2 | 0.683 | 8.571 | 12.5
GA | Set 3 | 0.441 | 0.589 | 1.4
GA | Set 4 | 0.373 | 2.249 | 6.0
QR | Set 1 | 1.704 | 54.309 | 31.8
QR | Set 2 | 0.901 | 5.679 | 6.3
QR | Set 3 | 0.904 | 6.686 | 7.4
SVD | Set 1 | 0.747 | 4.175 | 5.6
SVD | Set 2 | 0.791 | 2.684 | 3.4
DB | Set 1 | 112.3 | 126.8 | 1.13
DB | Set 2 | 5.794 | 8.459 | 1.46
The throughputs of CT and DB are measured in MBytes/s and transactions/s, respectively.
42. Performance Comparison
- GPU: NVIDIA Fermi; CPU: Intel Core 2 Duo (3.33GHz); DSP: ADI TigerSHARC 101
43. Instruction Profiling
44. Thread Profiling
- Warp occupancy: the number of active threads in an issued warp
- 32 threads per warp
45. Off-Chip Memory Profiling
- DRAM efficiency: the percentage of memory service time actually spent sending data across the DRAM pins
46. Limitation
- The GPU suffers from low power efficiency (MFLOPS/W)
47. High Performance Radar DSP Processor
- Motivation
- Feasibility of GPU for DSP processing
- Designing a massively parallel DSP processor
48. Key Idea - Hardware Architecture
- Borrow the GPU microarchitecture
- Use a DSP core as the basic execution unit
- Multiprocessors organized into programmable pipelines
- Neighboring multiprocessors can be merged into wider datapaths
49. Key Idea - Parallel Code Generation
- Meta-programming based parallel code generation
- Foundation technologies
- GPU meta-programming frameworks: Copperhead (UC Berkeley) and PyCUDA (New York University)
- DSP code generation framework: Spiral (Carnegie Mellon University)
50. Key Idea - Internal Representation as KPN
- Kahn Process Network (KPN)
- A generic model for concurrent computation
- Solid theoretical foundation: process algebra
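A toy KPN in Python (our illustration, not the proposed internal representation): processes exchange tokens only through FIFO channels with blocking reads, which is what makes a KPN's output deterministic regardless of how its processes are scheduled.

```python
# Sketch: a two-process Kahn Process Network using threads and FIFO queues.
import queue
import threading

def producer(out_ch, n):
    """Emit 0..n-1 followed by an end-of-stream token (None)."""
    for i in range(n):
        out_ch.put(i)
    out_ch.put(None)

def doubler(in_ch, out_ch):
    """Read tokens (blocking, per the KPN rule), emit each doubled."""
    while True:
        v = in_ch.get()
        out_ch.put(None if v is None else 2 * v)
        if v is None:
            return

def run_network(n):
    a, b = queue.Queue(), queue.Queue()          # unbounded FIFO channels
    threads = [threading.Thread(target=producer, args=(a, n)),
               threading.Thread(target=doubler, args=(a, b))]
    for t in threads:
        t.start()
    result = []
    while True:                                   # main thread drains channel b
        v = b.get()
        if v is None:
            break
        result.append(v)
    for t in threads:
        t.join()
    return result
```

However the two threads interleave, `run_network(n)` always yields the same token sequence; this determinism is what lets a compiler reschedule, split, or pipeline KPN processes without changing program semantics.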
51. Scheduling and Optimization on KPN
- Automatic task and thread scheduling and mapping
- Extract data parallelism through process splitting
- Latency- and throughput-aware scheduling
- Performance estimation based on analytical models
52. Key Idea - Low Power Techniques
- GPU-like processors are power hungry!
- Potential low power techniques
- Aggressive memory coalescing
- Enable task pipelines to avoid synchronization via global memory
- Operation chaining to avoid extra memory accesses
- ???
53. Outline
- Motivation and background
- Morphing GPU into a network processor
- High performance radar DSP processor
- Conclusion
54. Conclusion
- A new market of high performance embedded computing is emerging
- Multi-core engines will be the workhorses
- Need both HW and SW research
- Case study 1: GPU based Internet routing
- Case study 2: massively parallel DSP processor
- Significant performance improvements
- More work ahead
- Low power, scheduling, parallel programming models, legacy code, ...