Transcript and Presenter's Notes

Title: High Performance Embedded Computing with Massively Parallel Processors


1
High Performance Embedded Computing with
Massively Parallel Processors
  • Yangdong Steve Deng
  • dengyd@tsinghua.edu.cn
  • Tsinghua University

2
Outline
  • Motivation and background
  • Morphing GPU into a network processor
  • High performance radar DSP processor
  • Conclusion

3
High Performance Embedded Computing
  • Future IT infrastructure demands even higher
    computing power
  • Core Internet router: throughput up to 90 Tbps
  • 4G wireless base station: 1 Gbit/s data rate per
    customer and up to 200 subscribers in the service
    area
  • CMU driverless car: 270 GFLOPS (giga floating-point
    operations per second)

4
Fast Increasing IC Costs
  • Fabrication Cost
  • Moore's Second Law: the cost of doubling circuit
    density increases in line with Moore's First Law
  • Design Cost
  • Now $20-50M per product
  • Will reach $75-120M at the 32nm node

The 4-year development of the Cell processor by Sony,
IBM, and Toshiba cost over $400M.
5
Implications of the Prohibitive Cost
  • ASICs would be unaffordable for many
    applications!
  • Scott MacGregor, CEO of Broadcom: Broadcom is not
    intending a move to 45nm in the next year or so, as
    it will be too expensive.
  • David Turek, VP of IBM: IBM will be pulling out of
    Cell development, with PowerXCell 8i to be the
    company's last entry in the technology.

6
Multicore Machines Are Really Powerful!
Manufacturer | Type | Model | Cores | GFLOPS (FP64) | GFLOPS (FP32)
AMD | GPGPU | FireStream 9270 | 160/800 | 240 | 1200
AMD | GPU | Radeon HD 5870 | 320/1600 | 544 | 2720
AMD | GPU | Radeon HD 5970 | 640/3200 | 928 | 4640
AMD | CPU | Magny-Cours | 12 | 362.11 | 362.11
Fujitsu | CPU | SPARC64 VII | 4 | 128 | 128
Intel | CPU | Core 2 Extreme QX9775 | 4 | 51.2 | 51.2
nVidia | GPU | Fermi 480 | 512 | 780 | 1560
nVidia | GPGPU | Tesla C1060 | 240 | 77.76 | 933.12
nVidia | GPGPU | Tesla C2050 | 448 | 515.2 | 1288
Tilera | CPU | TilePro64 | 64 | 166 | 166
GPU: Graphics Processing Unit; GPGPU: General-Purpose GPU
[Images: Tilera Tile-Gx100 CPU, AMD 12-core CPU, NVIDIA Fermi GPU]
7
Implications
  • An increasing number of applications would be
    implemented with multi-core devices
  • Huawei: multi-core base stations
  • Intel: cluster-based Internet routers
  • IBM: signal processing and radar applications on
    the Cell processor
  • Also meets the strong demands for customizability
    and extensibility

8
Outline
  • Motivation and background
  • Morphing GPU into a network processor
  • High performance radar DSP processor
  • Conclusion

9
Software Routing with GPU
  • Background and motivation
  • GPU based routing processing
  • Routing table lookup
  • Packet classification
  • Deep packet inspection
  • GPU microarchitecture enhancement
  • CPU and GPU integration
  • QoS-aware scheduling

10
Ever-Increasing Internet Traffic
11
Fast Changing Network Protocols/Services
  • New services are rapidly appearing
  • Data centers, Ethernet forwarding, virtual LANs, ...
  • Personal customization is often essential for QoS
  • However, today's Internet heavily depends on 2
    protocols
  • Ethernet and IPv4, both developed in the 1970s!

12
Internet Router
13
Internet Router
  • Backbone network device
  • Packet forwarding and path finding
  • Connect multiple subnets
  • Key requirements
  • High throughput: 40 Gbps - 90 Tbps
  • High flexibility

[Diagram: packets flowing into and out of the router]
14
Current Router Solutions
  • Hardware routers
  • Fast
  • Long design time
  • Expensive
  • And hard to maintain
  • Network processor based routers
  • Network processor: a data-parallel packet processor
  • No good programming models
  • Software routers
  • Extremely flexible
  • Low cost
  • But slow

15
Outline
  • Background and motivation
  • GPU based routing processing
  • Routing table lookup
  • Packet classification
  • Deep packet inspection
  • GPU microarchitecture enhancement
  • CPU and GPU integration
  • QoS-aware scheduling

16
Critical Path of Routing Processing
17
GPU Based Software Router
  • Data-level parallelism = packet-level parallelism

18
Routing Table Lookup
  • Routing table contains network topology
    information
  • Find the output port according to destination IP
    address
  • Potentially large routing table (1M entries)
  • Can be updated dynamically

An exemplar routing table
Destination Address Prefix | Next-Hop | Output Port
24.30.32/20 | 192.41.177.148 | 2
24.30.32.160/28 | 192.41.177.3 | 6
208.12.32/20 | 192.41.177.196 | 1
208.12.32.111/32 | 192.41.177.195 | 5
19
Routing Table Lookup
  • Longest prefix match
  • Memory bound
  • Usually based on a trie data structure
  • Trie: a prefix tree with strings as keys
  • A node's position directly reflects its key
  • Pointer operations
  • Widely divergent branches!

Destination Address Prefix | Next-Hop | Output Port
24.30.32/20 | 192.41.177.148 | 2
24.30.32.160/28 | 192.41.177.3 | 6
208.12.32/20 | 192.41.177.196 | 1
208.12.32.111/32 | 192.41.177.195 | 5
20
GPU Based Routing Table Lookup
  • Organize the search trie into an array
  • Pointers converted to offsets with respect to the
    array head (see the sketch below)
  • 6X speedup even with frequent routing table
    updates
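
A minimal CUDA sketch of the idea, assuming a hypothetical flattened node layout (child offsets plus an output-port field); one thread performs the longest-prefix-match walk for one destination address. This is an illustrative sketch, not the actual implementation.

// Array-based binary trie lookup, one thread per packet.
// Node layout, field names, and kernel signature are illustrative only.
struct TrieNode {
    int child[2];   // offsets of left/right children in the node array, -1 if absent
    int port;       // output port if this node ends a prefix, -1 otherwise
};

__global__ void lpm_lookup(const TrieNode *trie, const unsigned int *dst_ip,
                           int *out_port, int num_packets)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_packets) return;

    unsigned int ip = dst_ip[tid];
    int node = 0;            // root stored at offset 0
    int best = -1;           // output port of the longest prefix matched so far
    for (int bit = 31; node != -1; --bit) {
        if (trie[node].port != -1) best = trie[node].port;
        if (bit < 0) break;
        node = trie[node].child[(ip >> bit) & 1];  // offsets replace pointers
    }
    out_port[tid] = best;
}

Because children are plain array offsets, the whole table can be copied to (or updated in) GPU memory with ordinary memory copies, which is what makes frequent routing-table updates practical.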

21
Packet Classification
  • Match header fields with predefined rules
  • Size of rule-sets can be huge (e.g., over 5,000
    rules)

Rule | Example
Priority | Treat packets destined to 166.111.66.70 - 166.111.66.77 as highest priority
Packet filtering | Deny all traffic from ISP3 destined to 166.111.66.77
Traffic rate limit | Ensure ISP2 does not inject more than 10 Mbps of email traffic on interface 2
Accounting and billing | Treat video traffic to 166.111.X.X as highest priority and perform accounting
22
Packet Classification
  • Hardware solution
  • Usually with Ternary CAM
  • (TCAM)
  • Expensive and power hungry
  • Software solutions
  • Linear search
  • Hash based
  • Tuple space search
  • Convert the rules into a set of exact match

23
GPU Based Packet Classification
  • A linear search approach
  • Scale to rule sets with 20,000 rules
  • Meta-programming
  • Compile rules into CUDA code with PyCUDA

Rule: treat packets destined to 166.111.66.70 -
166.111.66.77 as highest priority

if (DA >= 166.111.66.70 && DA <= 166.111.66.77)
    priority = 0;
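
A minimal CUDA sketch of how such generated checks might sit inside a one-thread-per-packet linear-search kernel; the header layout, the second rule, and all names are illustrative assumptions, not the actual generated code.

// One thread classifies one packet by scanning the (generated) rule checks
// in priority order. dst_ip is assumed to be a host-order 32-bit integer.
struct Header { unsigned int src_ip, dst_ip; unsigned short src_port, dst_port; };

__device__ int match_rules(const Header &h)
{
    // Body below stands in for what meta-programming would emit from the rule set.
    if (h.dst_ip >= 0xA66F4246u && h.dst_ip <= 0xA66F424Du) return 0; // 166.111.66.70-77
    if (h.dst_port == 25) return 3;   // made-up second rule: flag email traffic
    return -1;                        // no rule matched
}

__global__ void classify(const Header *headers, int *action, int num_packets)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < num_packets)
        action[tid] = match_rules(headers[tid]);
}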
24
GPU Based Packet Classification
  • 60X speedup

25
Deep Packet Inspection (DPI)
  • Core component for network intrusion detection
  • Against viruses, spam, software vulnerabilities, ...

Example rule: alert tcp $EXTERNAL_NET 27374 -> $HOME_NET any
(msg:"BACKDOOR subseven 22"; flags:A;
content:"|0d0a5b52504c5d3030320d0a|";)
[Diagram: Snort data flow - packet stream -> sniffing -> packet decoder ->
preprocessor (plug-ins) -> detection engine (plug-ins: fixed string matching,
regular expression matching) -> output stage (plug-ins) -> alerts/logs]
26
GPU Based Deep Packet Inspection (DPI)
  • Fixed string match
  • Each rule is just a string that is disallowed
  • Bloom-filter based search (see the sketch below)
  • One warp per packet and one thread per string
  • Throughput: 19.2 Gbps (30X speedup over Snort)
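
A minimal CUDA sketch of the Bloom-filter membership probe; the hash function, probe count, and vector size are placeholder assumptions rather than the ones used in this work.

// Returns true if the candidate string *may* be in the rule set (false positives
// possible, no false negatives); a positive must be verified by exact comparison.
#define BLOOM_BITS (1 << 20)

__device__ unsigned int hash_fnv(const char *s, int len, unsigned int seed)
{
    unsigned int h = 2166136261u ^ seed;       // FNV-1a style hash, seeded per probe
    for (int i = 0; i < len; ++i) h = (h ^ (unsigned char)s[i]) * 16777619u;
    return h % BLOOM_BITS;
}

__device__ bool bloom_maybe_contains(const unsigned int *bloom_vec, const char *s, int len)
{
    for (unsigned int seed = 0; seed < 3; ++seed) {       // k = 3 hash probes
        unsigned int bit = hash_fnv(s, len, seed);
        if (!((bloom_vec[bit >> 5] >> (bit & 31)) & 1))   // any unset bit => definitely absent
            return false;
    }
    return true;
}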

[Figure: Bloom vector - initial Bloom filter, after pre-processing rules,
checking packet content]
27
GPU Based Deep Packet Inspection (DPI)
  • Regular expression matching
  • Each rule is a regular expression
  • e.g., a|b* denotes {e, a, b, bb, bbb, ...}
  • Aho-Corasick Algorithm
  • Converts patterns into a finite state machine
  • Matching is done by state traversal
  • Memory bound
  • Virtually no computation
  • Compress the state table
  • Merging don't-care entries
  • Throughput: 9.3 Gbps
  • 15X speedup over Snort

Example: P = {he, she, his, hers}
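
A minimal CUDA sketch of the state-traversal step, assuming an uncompressed |states| x 256 transition table and illustrative names (the real design compresses the table by merging don't-care entries).

// One thread scans one packet payload through the Aho-Corasick automaton.
// next_state: transition table; match_flag[s] != 0 marks accepting states.
__global__ void ac_match(const unsigned char *payloads, const int *payload_len,
                         int max_len, const int *next_state, const char *match_flag,
                         int *hit, int num_packets)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_packets) return;

    const unsigned char *p = payloads + (size_t)tid * max_len;
    int state = 0, found = 0;
    for (int i = 0; i < payload_len[tid]; ++i) {
        state = next_state[state * 256 + p[i]];   // one table lookup per input byte
        found |= match_flag[state];               // remember if an accepting state was hit
    }
    hit[tid] = found;
}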
28
Outline
  • Background and motivation
  • GPU based routing processing
  • Routing table lookup
  • Packet classification
  • Deep packet inspection
  • GPU microarchitecture enhancement
  • CPU and GPU integration
  • QoS-aware scheduling

29
Limitation of GPU-Based Packet Processing
[Diagram: packet queue]
30
Microarchitectural Enhancements
  • CPU-GPU integration with a shared memory
  • Maintain current CUDA interface
  • Implemented on GPGPU-Sim

[Diagram: GPU connected to a CPU/GPU shared memory]
A. Bakhoda et al., "Analyzing CUDA Workloads Using a Detailed GPU
Simulator," ISPASS 2009.
31
Microarchitectural Enhancements
  • Uniformly one thread per packet
  • No thread blocks necessary
  • Directly schedule and issue warps
  • GPU fetches packet IDs from the task queue when
  • Either a sufficient number of packets has already
    been collected
  • Or a given interval has passed since the last fetch
    (see the host-side sketch below)
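
A minimal host-side sketch of this batching policy; the callbacks, batch size, and timeout are hypothetical placeholders for the real task queue and kernel-launch path.

// Launch a batch when either BATCH_SIZE packet IDs have accumulated or
// MAX_WAIT has elapsed since the last launch.
#include <chrono>
#include <functional>
#include <vector>

void dispatch_loop(std::function<bool(int&)> try_pop,                       // pop one packet ID
                   std::function<void(const std::vector<int>&)> launch_batch) // launch kernel on a batch
{
    const size_t BATCH_SIZE = 4096;
    const auto MAX_WAIT = std::chrono::microseconds(100);
    std::vector<int> batch;
    auto last_launch = std::chrono::steady_clock::now();

    for (;;) {
        int id;
        while (batch.size() < BATCH_SIZE && try_pop(id))
            batch.push_back(id);                      // collect packet IDs from the task queue

        bool timeout = (std::chrono::steady_clock::now() - last_launch) >= MAX_WAIT;
        if (!batch.empty() && (batch.size() >= BATCH_SIZE || timeout)) {
            launch_batch(batch);                      // one GPU thread per packet ID
            batch.clear();
            last_launch = std::chrono::steady_clock::now();
        }
    }
}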

32
Results: Throughput
33
Results: Packet Latency
34
Outline
  • Motivation and background
  • Morphing GPU into a network processor
  • High performance radar DSP processor
  • Conclusion

35
High Performance Radar DSP Processor
  • Motivation
  • Feasibility of GPU for DSP processing
  • Designing a massively parallel DSP processor

36
Research Objectives
  • High performance DSP processor
  • For high-performance applications
  • Radar, sonar, cellular baseband, ...
  • Performance requirements
  • Throughput: 800 GFLOPS
  • Power efficiency: 100 GFLOPS/W
  • Memory bandwidth: 400 Gbit/s
  • Scale to multi-chip solutions

37
Current DSP Platforms
Processor | Frequency | Cores | Throughput | Memory Bandwidth | Power | Power Efficiency (GFLOPS/W)
TI TMS320C6472-700 | 500 MHz | 6 | 33.6 GMAC/s | N/A | 3.8 W | 17.7
Freescale MSC8156 | 1 GHz | 6 | 48 GMAC/s | 1 GB/s | 10 W | 9.6
ADI TigerSHARC ADSP-TS201S | 600 MHz | 1 | 4.8 GMAC/s | 38.4 GB/s (on-chip) | 2.18 W | 4.4
PicoChip PC205 | 260 MHz | 1 GPP + 248 DSPs | 31 GMAC/s | N/A | <5 W | 12.4
Intel Core i7 980XE | 3.3 GHz | 6 | 107.5 GFLOPS | 31.8 GB/s | 130 W | 0.8
Tilera Tile64 | 866 MHz | 64 CPUs | 221 GFLOPS | 6.25 GB/s | 22 W | 10.0
NVIDIA Fermi GPU | 1 GHz | 512 scalar cores | 1536 GFLOPS | 230 GB/s | 200 W | 7.7
GDDR5 peak bandwidth: 28.2 GB/s
38
High Performance Radar DSP Processor
  • Motivation
  • Feasibility of GPU for DSP processing
  • Designing a massively parallel DSP processor

39
HPEC Challenge - Radar Benchmarks
Benchmark | Description
TDFIR | Time-domain finite impulse response filtering
FDFIR | Frequency-domain finite impulse response filtering
CT | Corner turn, or matrix transpose, to place radar data into a contiguous row for efficient FFT
QR | QR factorization, prevalent in target recognition algorithms
SVD | Singular value decomposition; produces a basis for the matrix as well as the rank for reducing interference
CFAR | Constant false-alarm rate detection: find targets in an environment with varying background noise
GA | Graph optimization via genetic algorithm, removing uncorrelated data relations
PM | Pattern matching: identify stored tracks that match a target
DB | Database operations to store and query target tracks
40
GPU Implementation
Benchmark | GPU Implementation
TDFIR | Loops of multiplication and accumulation (MAC)
FDFIR | FFT followed by MAC loops
CT | GPU-based matrix transpose, extremely efficient
QR | CPU + GPU pipeline, fast Givens algorithm
SVD | Based on QR factorization and fast matrix multiplication
CFAR | Accumulation of neighboring vector elements
GA | Parallel random number generator and inter-thread communication
PM | Vector-level parallelism
DB | Binary tree operations, hard for GPU implementation
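
As a concrete picture of the TDFIR MAC loop listed above, a minimal CUDA sketch with one thread per output sample; the tap count and zero-padding convention are illustrative assumptions, not the benchmark code (complex-valued filtering is omitted).

// Time-domain FIR filter: y[n] = sum_k h[k] * x[n - k].
// x is assumed to be zero-padded with NTAPS-1 leading zeros, so x has
// n_out + NTAPS - 1 elements and every index below stays in range.
#define NTAPS 128

__global__ void tdfir(const float *x, const float *h, float *y, int n_out)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= n_out) return;

    float acc = 0.0f;
    for (int k = 0; k < NTAPS; ++k)          // the MAC loop
        acc += h[k] * x[n + NTAPS - 1 - k];  // index into the zero-padded input
    y[n] = acc;
}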
41
Performance Results
Kernel | Data Set | CPU Throughput (GFLOPS) | GPU Throughput (GFLOPS) | Speedup
TDFIR | Set 1 | 3.382 | 97.506 | 28.8
TDFIR | Set 2 | 3.326 | 23.130 | 6.9
FDFIR | Set 1 | 0.541 | 61.681 | 114.1
FDFIR | Set 2 | 0.542 | 11.955 | 22.1
CT | Set 1 | 1.194 | 17.177 | 14.3
CT | Set 2 | 0.501 | 35.545 | 70.9
PM | Set 1 | 0.871 | 7.761 | 8.9
PM | Set 2 | 0.281 | 21.241 | 75.6
CFAR | Set 1 | 1.154 | 2.234 | 1.9
CFAR | Set 2 | 1.314 | 17.319 | 13.1
CFAR | Set 3 | 1.313 | 13.962 | 10.6
CFAR | Set 4 | 1.261 | 8.301 | 6.6
GA | Set 1 | 0.562 | 1.177 | 2.1
GA | Set 2 | 0.683 | 8.571 | 12.5
GA | Set 3 | 0.441 | 0.589 | 1.4
GA | Set 4 | 0.373 | 2.249 | 6.0
QR | Set 1 | 1.704 | 54.309 | 31.8
QR | Set 2 | 0.901 | 5.679 | 6.3
QR | Set 3 | 0.904 | 6.686 | 7.4
SVD | Set 1 | 0.747 | 4.175 | 5.6
SVD | Set 2 | 0.791 | 2.684 | 3.4
DB | Set 1 | 112.3 | 126.8 | 1.13
DB | Set 2 | 5.794 | 8.459 | 1.46
The throughputs of CT and DB are measured in
MBytes/s and Transactions/s, respectively.
42
Performance Comparison
  • GPU: NVIDIA Fermi; CPU: Intel Core 2 Duo
    (3.33 GHz); DSP: ADI TigerSHARC 101

43
Instruction Profiling
44
Thread Profiling
  • Warp occupancy: the number of active threads in an
    issued warp
  • 32 threads per warp (e.g., a warp with 24 active
    threads has 75% occupancy)

45
Off-Chip Memory Profiling
  • DRAM efficiency: the percentage of time spent
    sending data across the DRAM pins, out of the total
    memory service time

46
Limitation
  • GPUs suffer from low power efficiency (MFLOPS/W)

47
High Performance Radar DSP Processor
  • Motivation
  • Feasibility of GPU for DSP processing
  • Designing a massively parallel DSP processor

48
Key Idea - Hardware Architecture
  • Borrow the GPU microarchitecture
  • Using a DSP core as the basic execution unit
  • Multiprocessors organized in programmable
    pipelines
  • Neighboring multiprocessors can be merged into
    wider datapaths

49
Key Idea Parallel Code Generation
  • Meta-programming based parallel code generation
  • Foundation technologies
  • GPU meta-programming frameworks
  • Copperhead (UC Berkeley) and PyCUDA (New York
    University)
  • DSP code generation framework
  • Spiral (Carnegie Mellon University)

50
Key Idea Internal Representation as KPN
  • Kahn Process Network (KPN)
  • A generic model for concurrent computation
  • Solid theoretical foundation
  • Process algebra

51
Scheduling and Optimization on KPN
  • Automatic task and thread scheduling and mapping
  • Extract data parallelism through process
    splitting
  • Latency- and throughput-aware scheduling
  • Performance estimation based on analytical models

52
Key Idea - Low Power Techniques
  • GPU-like processors are power hungry!
  • Potential low power techniques
  • Aggressive memory coalescing
  • Enable task pipelining to avoid synchronization via
    global memory
  • Operation chaining to avoid extra memory accesses
  • ...

53
Outline
  • Motivation and background
  • Morphing GPU into a network processor
  • High performance radar DSP processor
  • Conclusion

54
Conclusion
  • A new market of high performance embedded
    computing is emerging
  • Multi-core engines will be the workhorses
  • Need both HW and SW research
  • Case study 1: GPU-based Internet routing
  • Case study 2: massively parallel DSP processor
  • Significant performance improvements
  • More work ahead
  • Low power, scheduling, parallel programming
    models, legacy code, ...