Benchmark Results for Ultra-High Performance Scalable Processing Architecture for Embedded Signal and Image Processing Applications

1
Benchmark Results for Ultra-High Performance
Scalable Processing Architecture for Embedded
Signal and Image Processing Applications
Authors

Stewart Reddaway (sfr@wscapeinc.com)
Nigel Bond (nigel.bond@wscapeinc.com)
Rick Pancoast (rick.pancoast@lmco.com)
Justin Kidman (justin.kidman@wscapeinc.com)
Peter Rogina (peter.rogina@wscapeinc.com)
2
Scalable Processing Platform (SPP) Applications

Image Processing
Signal Processing
Compression/Decompression
Encryption/Decryption
Network Processing
Search Engine
Certain Supercomputing Applications

Wide Ranging Applicability to DoE/DoD/DARPA and
Commercial Embedded HPC Processing
Requirements
3
Scalable Processing Platform
50 GFLOPS 64-bit FP SIMD Processor Chip
300 GFLOPS 64-bit FP 6U VME Expansion board
4
CamArray Multi-Sensor Platform
Initial 30 camera system will process and store
over 1 Billion pixels per second continuously
6 Cameras Dual CameraLink (x3)
42 Cameras per Cardcage

(x 7)
6U Host Motherboard
With SAN Storage
SAN
5
30 Camera CamArray with sync
Universal Sync
Horita BSG-50 Black Burst Generator
Horita TG-50 SMPTE Generator/Inserter
42 Cameras max per Cardcage (21 Dupree boards x 2
cameras per board)
SMPTE Level Shifter / Digital Distribution Amp (1
x 21 Distribution Amp)
8m CameraLink cables (2 cameras per Dupree board)
1U SAN: 15 minutes uncompressed,
1.25 hrs. @ 5x compression, etc.
Fibre Channel from host cards to SAN
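
The two SAN figures are consistent with recording time scaling linearly with the compression ratio. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the linear scaling is an assumption rather than something stated on the slide.

```python
def recording_minutes(uncompressed_minutes: float, compression_ratio: float) -> float:
    """Recording time for a fixed-capacity store.

    Assumes recording time scales linearly with the compression ratio
    (an illustrative assumption, not a figure from the slides).
    """
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return uncompressed_minutes * compression_ratio


# Slide figures: 15 minutes uncompressed, 5x compression.
minutes = recording_minutes(15, 5)
hours = minutes / 60
print(f"{minutes:.0f} minutes = {hours:.2f} hours")
```

With the slide's inputs this gives 75 minutes, i.e. the 1.25 hrs. quoted for 5x compression.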
6
HPEC 2006 Hardware Benchmark Performance Results
DRAM to DRAM demonstration using realistic Pulse
Compression data on actual CSX600 hardware
Note: Streaming implementation on Dupree in 4Q06
7
1024-point complex Pulse Compression

Per chip measurements
96 PEs enable 12 sets in parallel
8 PEs per PC
9.39 usec DRAM to DRAM
9.87 usec on-chip (i.e. without I/O from/to DRAM)
11.55 GFLOPS with I/O
11.02 GFLOPS without I/O

Overlapping of I/O and compute results in 95% of
cycles being used for computation
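
The per-chip timings and GFLOPS figures can be cross-checked with a conventional FLOP count. The sketch below assumes pulse compression is a forward FFT, a pointwise complex multiply by a reference spectrum, and an inverse FFT, and that each FFT is counted as 5 N log2 N real operations and each complex multiply as 6. The slides do not state how FLOPs were counted, so this convention and the function names are assumptions.

```python
def pulse_compression_flops(n: int) -> int:
    """Nominal FLOP count for one n-point pulse compression.

    Assumed convention: 5*n*log2(n) per FFT, 6 per complex multiply.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    log2_n = n.bit_length() - 1
    fft_flops = 5 * n * log2_n          # one forward or inverse FFT
    multiply_flops = 6 * n              # pointwise complex multiply
    return 2 * fft_flops + multiply_flops


def gflops(flops: int, microseconds: float) -> float:
    """Throughput in GFLOPS for a given FLOP count and elapsed time."""
    return flops / (microseconds * 1e-6) / 1e9


flops = pulse_compression_flops(1024)
for usec in (9.39, 9.87):  # timings quoted on the slide
    print(f"{usec} usec -> {gflops(flops, usec):.2f} GFLOPS")
```

Under these assumptions one 1024-point pulse compression is 108,544 FLOPs, and the two timings map to values within about 0.03 GFLOPS of the 11.55 and 11.02 GFLOPS quoted above.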
8
Pulse Compression Demonstration
Hardware Performance
per chip
9
FFT/PC speedups

Code speedups totaling 11% are possible now
Workarounds due to problems with existing
microcode: 5%
Stripping out unnecessary measurement and
interface code: 2%
Use of new microcode with 0 degree twiddle: 4%
New microcoded instructions will enable more
efficient butterfly pipelining (estimated 1.5x
to 1.8x overall improvement)
A family of begin, middle and end butterflies
nearly doubles speed
A similar technique will speed up a sequence of
complex multiplies
More special case butterflies can also be
microcoded
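
The benefit of special case butterflies comes from twiddle factors whose complex multiply is trivial. A twiddle at 0 degrees equals 1, so the multiply disappears, and a twiddle at 90 degrees equals -j, so the multiply becomes a swap and a negation. The following is a minimal Python sketch of a radix-2 butterfly for illustration only; it is not the CSX600 microcode, and the function names are hypothetical.

```python
import cmath
import math


def butterfly(a: complex, b: complex, angle_deg: float) -> tuple[complex, complex]:
    """General radix-2 butterfly: one complex multiply, two complex adds."""
    w = cmath.exp(-1j * math.radians(angle_deg))  # twiddle factor
    t = w * b
    return a + t, a - t


def butterfly_0deg(a: complex, b: complex) -> tuple[complex, complex]:
    """Special case, twiddle = 1: no multiply at all."""
    return a + b, a - b


def butterfly_90deg(a: complex, b: complex) -> tuple[complex, complex]:
    """Special case, twiddle = -j: multiply replaced by swap and negate."""
    t = complex(b.imag, -b.real)  # equals -j * b
    return a + t, a - t


a, b = 1.5 + 2.0j, -0.5 + 0.25j
general_0 = butterfly(a, b, 0.0)
special_0 = butterfly_0deg(a, b)
general_90 = butterfly(a, b, 90.0)
special_90 = butterfly_90deg(a, b)
```

The special cases return the same values as the general routine (up to floating-point rounding) while skipping the four real multiplies and two real adds of a full complex multiply.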

10
JPEG2000 Performance Estimates
For 30x compression of 1K x 1K, 8-bpp, RGB image
(per chip)

Optimization of compute and I/O overlap can boost
performance to 48 fps (shown in brackets)
Grayscale images are 3x faster, supporting up to
125 (146) fps
Bayer coded color images are 2.5x faster
10/12-bit pixels can be handled proportionally
slower

Analysis suggests route to substantially higher
performance
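
The quoted frame rates can be converted to input data rates, which helps explain the 3x grayscale figure. This sketch assumes that 1K means 1024 and that "8-bpp, RGB" means 8 bits per color component (3 bytes per pixel); neither is stated explicitly on the slide, and the function name is hypothetical.

```python
def input_rate_mib_per_s(width: int, height: int, bytes_per_pixel: int, fps: float) -> float:
    """Uncompressed input data rate in MiB/s for a given frame format and rate."""
    return width * height * bytes_per_pixel * fps / 2**20


# Bracketed (optimized) frame rates from the slide.
rgb_rate = input_rate_mib_per_s(1024, 1024, 3, 48)    # RGB, assumed 3 bytes/pixel
gray_rate = input_rate_mib_per_s(1024, 1024, 1, 146)  # grayscale, 1 byte/pixel

# Output rate after the 30x compression quoted for the RGB case.
rgb_compressed_rate = rgb_rate / 30

print(f"RGB input:       {rgb_rate:.1f} MiB/s")
print(f"Grayscale input: {gray_rate:.1f} MiB/s")
print(f"RGB compressed:  {rgb_compressed_rate:.1f} MiB/s")
```

Under these assumptions both bracketed rates correspond to roughly 145 MiB/s of input pixels per chip, which is consistent with grayscale running about 3x faster because each frame carries one third of the data.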
11
Summary

SPP offers greatly enhanced performance AND
performance/watt
Scaled SIMD architecture very well suited for
many embedded applications
Existing and emerging libraries will ease
integration and product insertion
WorldScape partners uniquely qualified to
deliver embedded application products
Ongoing hardware and library development will
foster accelerated ability to achieve TRL-6
status in FY07
First customer shipments of SPP Dupree card -
1Q07
Beta relationships being established now