Title: Benchmark Results for Ultra-High Performance Scalable Processing Architecture for Embedded Signal and Image Processing Applications
1. Benchmark Results for Ultra-High Performance Scalable Processing Architecture for Embedded Signal and Image Processing Applications
Authors
- Stewart Reddaway (sfr_at_wscapeinc.com)
- Nigel Bond (nigel.bond_at_wscapeinc.com)
- Rick Pancoast (rick.pancoast_at_lmco.com)
- Justin Kidman (justin.kidman_at_wscapeinc.com)
- Peter Rogina (peter.rogina_at_wscapeinc.com)
2. Scalable Processing Platform (SPP) Applications
- Image Processing
- Signal Processing
- Compression/Decompression
- Encryption/Decryption
- Network Processing
- Search Engine
- Certain Supercomputing Applications
Wide-ranging applicability to DoE/DoD/DARPA and commercial embedded HPC processing requirements
3. Scalable Processing Platform
- 50 GFLOPS 64-bit FP SIMD processor chip
- 300 GFLOPS 64-bit FP 6U VME expansion board
4. CamArray Multi-Sensor Platform
Initial 30-camera system will process and store over 1 billion pixels per second continuously.
Diagram labels:
- 6 cameras, dual CameraLink (x3)
- 42 cameras per cardcage (x 7)
- 6U host motherboard with SAN storage
5. 30-Camera CamArray with Sync
- Universal sync
- Horita BSG-50 Black Burst Generator
- Horita TG-50 SMPTE Generator/Inserter
- 42 cameras max per cardcage (21 Duprees x 2 cameras each)
- SMPTE level shifter / digital distribution amp (1 x 21 distribution amp)
- 8 m CameraLink cables (2 cameras per Dupree board)
- 1U SAN: 15 minutes uncompressed, 1.25 hrs at 5x compression, etc.
- Fibre Channel from host cards to SAN
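The camera count and SAN recording times quoted above follow from simple arithmetic. The sketch below reproduces them under one stated assumption: recording time scales linearly with the compression ratio, meaning fixed SAN capacity and a constant camera data rate. The function and constant names are illustrative and do not come from the slides.

```python
# Illustrative arithmetic only; names are hypothetical, values are from the slide.
DUPREE_BOARDS_PER_CARDCAGE = 21   # "21 Duprees"
CAMERAS_PER_DUPREE = 2            # "2 cameras per Dupree board"
UNCOMPRESSED_MINUTES = 15.0       # 1U SAN capacity, uncompressed
COMPRESSION_RATIO = 5.0           # "5x compression"


def cameras_per_cardcage(boards: int, cameras_per_board: int) -> int:
    """Maximum cameras a single cardcage can host."""
    return boards * cameras_per_board


def recording_hours(uncompressed_minutes: float, compression_ratio: float) -> float:
    """Recording time in hours, assuming time scales linearly with the ratio."""
    return uncompressed_minutes * compression_ratio / 60.0


if __name__ == "__main__":
    print(cameras_per_cardcage(DUPREE_BOARDS_PER_CARDCAGE, CAMERAS_PER_DUPREE))
    print(recording_hours(UNCOMPRESSED_MINUTES, COMPRESSION_RATIO))
```

With the slide's values this gives 42 cameras and 1.25 hours, matching the quoted figures.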
6. HPEC 2006: Hardware Benchmark Performance Results
DRAM-to-DRAM demonstration using realistic Pulse Compression data on actual CSX600 hardware.
Note: Streaming implementation on Dupree in 4Q06.
7. 1024-point complex Pulse Compression
- Per chip measurements
- 96 PEs enable 12 sets to run in parallel
- 8 PEs per Pulse Compression (PC)
- 9.39 usec DRAM to DRAM
- 9.87 usec on-chip (i.e. without I/O from/to DRAM)
- 11.55 GFLOPS with I/O
- 11.02 GFLOPS without I/O
Overlapping of I/O and compute results in 95% of cycles being used for computation.
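The slides do not say how the GFLOPS figures were derived. As a plausibility check, the sketch below assumes the common nominal flop count for pulse compression: a forward FFT, a pointwise complex multiply, and an inverse FFT, with 5 N log2(N) flops per complex FFT and 6 flops per complex multiply. That counting convention and all function names are assumptions, not taken from the source.

```python
import math

TOTAL_PES = 96    # processing elements per chip (from the slide)
PES_PER_PC = 8    # PEs cooperating on one pulse compression (from the slide)


def pulse_compression_flops(n: int) -> int:
    """Nominal flops for one n-point pulse compression (n a power of two).

    Assumed convention: 5*n*log2(n) per complex FFT, 6 per complex multiply.
    """
    fft_flops = 5 * n * int(math.log2(n))
    multiply_flops = 6 * n
    return 2 * fft_flops + multiply_flops   # forward FFT + inverse FFT + multiply


def gflops(flops: int, time_usec: float) -> float:
    """Sustained GFLOPS if `flops` operations complete every `time_usec`."""
    return flops / (time_usec * 1e-6) / 1e9


if __name__ == "__main__":
    flops = pulse_compression_flops(1024)
    print(TOTAL_PES // PES_PER_PC)          # parallel sets per chip
    print(flops)                            # nominal flops per pulse compression
    print(round(gflops(flops, 9.39), 2))    # compare with the quoted 11.55 GFLOPS
    print(round(gflops(flops, 9.87), 2))    # compare with the quoted 11.02 GFLOPS
```

Under this convention one 1024-point pulse compression costs 108,544 flops, and the two quoted per-chip times give roughly 11.56 and 11.00 GFLOPS, within about 1% of the slide's figures.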
8. Pulse Compression Demonstration
[Chart: hardware performance, per chip]
9. FFT/PC speedups
- Code speedups totaling 11% are possible now
  - Workarounds due to problems with existing microcode: 5%
  - Stripping out unnecessary measurement and interface code: 2%
  - Use of new microcode with 0 degree twiddle: 4%
- New microcoded instructions will enable more efficient butterfly pipelining (estimated 1.5x to 1.8x overall improvement)
  - A family of begin, middle and end butterflies nearly doubles speed
  - A similar technique will speed up a sequence of complex multiplies
  - More special-case butterflies can also be microcoded
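To illustrate why a dedicated 0 degree twiddle butterfly helps, the sketch below compares a general radix-2 butterfly with the special case where the twiddle factor is 1. It is a plain Python illustration of the arithmetic, not the CSX600 microcode; the function names and flop constants are hypothetical.

```python
from typing import Tuple


def butterfly(a: complex, b: complex, w: complex) -> Tuple[complex, complex]:
    """General radix-2 butterfly: one complex multiply, one add, one subtract."""
    t = w * b
    return a + t, a - t


def butterfly_w0(a: complex, b: complex) -> Tuple[complex, complex]:
    """0 degree twiddle (w = 1): the complex multiply disappears."""
    return a + b, a - b


# Real-valued flop counts per butterfly.
COMPLEX_MULTIPLY_FLOPS = 6   # 4 multiplies + 2 adds
COMPLEX_ADD_FLOPS = 2        # real and imaginary parts
GENERAL_FLOPS = COMPLEX_MULTIPLY_FLOPS + 2 * COMPLEX_ADD_FLOPS
TRIVIAL_FLOPS = 2 * COMPLEX_ADD_FLOPS

if __name__ == "__main__":
    a, b = 1 + 2j, 3 - 1j
    # Both forms agree when the twiddle factor is exactly 1.
    print(butterfly(a, b, 1 + 0j) == butterfly_w0(a, b))
    print(GENERAL_FLOPS, TRIVIAL_FLOPS)
```

The special case needs 4 real operations instead of 10. Only a fraction of the butterflies in an FFT have this trivial twiddle, so the saving over a whole transform is much smaller than the per-butterfly ratio.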
10. JPEG2000 Performance Estimates
For 30x compression of a 1K x 1K, 8-bpp RGB image (per chip):
- Optimization of compute/I/O overlap can boost performance to 48 fps (shown in brackets)
- Grayscale images are 3x faster, supporting up to 125 (146) fps
- Bayer-coded color images are 2.5x faster
- 10/12-bit pixels can be handled at proportionally lower rates
Analysis suggests a route to substantially higher performance.
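For a sense of scale, the sketch below converts the 48 fps RGB estimate into raw and compressed data rates. It assumes that "1K x 1K" means 1024 x 1024 pixels and that "8-bpp, RGB" means 8 bits per color channel (3 bytes per pixel). Both are interpretations, and the function names are illustrative.

```python
WIDTH = 1024            # assumed meaning of "1K"
HEIGHT = 1024
BYTES_PER_PIXEL = 3     # assumed: 8 bits for each of R, G, B
COMPRESSION_RATIO = 30  # "30x compression" (from the slide)


def raw_rate_mb_per_s(fps: float) -> float:
    """Uncompressed input rate in MB/s (1 MB = 10**6 bytes)."""
    return fps * WIDTH * HEIGHT * BYTES_PER_PIXEL / 1e6


def compressed_rate_mb_per_s(fps: float) -> float:
    """Output rate in MB/s after compression at the quoted ratio."""
    return raw_rate_mb_per_s(fps) / COMPRESSION_RATIO


if __name__ == "__main__":
    print(round(raw_rate_mb_per_s(48), 1))         # raw RGB input at 48 fps
    print(round(compressed_rate_mb_per_s(48), 2))  # compressed output at 48 fps
```

Under these assumptions, 48 fps corresponds to about 151 MB/s of raw input and about 5 MB/s of compressed output per chip.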
11. Summary
- SPP offers greatly enhanced performance AND performance/watt
- Scaled SIMD architecture very well suited for many embedded applications
- Existing and emerging libraries will ease integration and product insertion
- WorldScape partners uniquely qualified to deliver embedded application products
- Ongoing hardware and library development will foster accelerated ability to achieve TRL-6 status in FY07
- First customer shipments of SPP Dupree card: 1Q07
- Beta relationships being established now