Title: Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers
1. Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers
- Esam El-Araby¹, Mohamed Taher¹, Tarek El-Ghazawi¹, and Kris Gaj²
- ¹The George Washington University, ²George Mason University
- {esam, mtaher, tarek}@gwu.edu, kgaj@gmu.edu
2. Introduction
- What are Reconfigurable Computers (RCs)?
- RCs are computing systems based on the close
system-level integration of one or more
general-purpose processors and one or more Field
Programmable Gate Array (FPGA) chips
- Benefits of RCs
- A trade-off between traditional hardware and software
- Hardware-like performance with software-like flexibility
- Hardware can be modified on-the-fly
- The programming model is aimed at shielding programmers from the details of the hardware description
- Orders of magnitude performance improvements over traditional systems
3. Introduction (cont'd)
- Status of RCs
- An important research subject due to the recent fast growth of FPGA technology
- Evolved from
- Glue logic between components to
- Accelerator boards to
- Stand-alone general-purpose RCs to
- Parallel reconfigurable computers
- However, multiple challenges remain to be resolved
4. Challenges
- Performance
- I/O Bandwidth
- Significant Configuration Latency
- Some systems spend 25% to 98% of their execution time performing reconfiguration
- Need for Efficient OS and Run-Time Reconfiguration Management
- Reconfiguration methods in current systems are not fully dynamic
- Ease of Use
- Compilers/Languages
- HDLs (VHDL and Verilog) are hard for application scientists to use
- HLLs and simple interfaces
- Debuggers
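
To see why configuration latency matters, here is a minimal sketch. It assumes a run that consists of one reconfiguration followed by the accelerated kernel, so the share of time lost to reconfiguration and the resulting speed-up over software are simple ratios. The function names and all timings are hypothetical, chosen only for illustration; they are not measurements from any system.

```python
def reconfiguration_fraction(config_time_s, compute_time_s):
    """Share of total run time spent reconfiguring the FPGA."""
    return config_time_s / (config_time_s + compute_time_s)


def effective_speedup(software_time_s, config_time_s, compute_time_s):
    """Speed-up over a software-only run once reconfiguration is counted."""
    return software_time_s / (config_time_s + compute_time_s)


# Hypothetical numbers: the kernel alone is 100x faster than software,
# but reconfiguration takes 49x as long as the kernel itself.
software_s, compute_s, config_s = 100.0, 1.0, 49.0
print(reconfiguration_fraction(config_s, compute_s))       # share lost to reconfiguration
print(effective_speedup(software_s, config_s, compute_s))  # end-to-end speed-up
```

With these made-up timings, 98% of the run goes to reconfiguration and the 100x kernel speed-up shrinks to 2x end to end, which is the effect the configuration-latency bullet points at.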
5. SRC Architecture (Hi-Bar™ Based Systems)
- Hi-Bar sustains 1.4 GB/s per port with 180 ns latency per tier
- Up to 256 input and 256 output ports with two tiers of switch
- Common Memory (CM) has controller with DMA capability
- Controller can perform other functions such as scatter/gather
- Up to 8 GB DDR SDRAM supported per CM node
Source: SRC
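
The bandwidth and latency figures above allow a rough estimate of how long a block of data takes to cross the switch. The sketch below uses a simple "latency plus size over bandwidth" model, which is an assumption and ignores protocol overhead and contention. It also assumes 1 GB = 10**9 bytes and one byte per pixel for the example image; the function and constant names are hypothetical.

```python
PORT_BANDWIDTH_BYTES_PER_S = 1.4e9  # 1.4 GB/s per port (assuming 1 GB = 10**9 bytes)
TIER_LATENCY_S = 180e-9             # 180 ns per switch tier


def transfer_time_s(num_bytes, tiers=1):
    """Estimated time to move num_bytes through `tiers` switch tiers.

    Simplified model (an assumption): fixed latency per tier plus
    payload size divided by the sustained per-port bandwidth.
    """
    return tiers * TIER_LATENCY_S + num_bytes / PORT_BANDWIDTH_BYTES_PER_S


# Example: a 512 x 512 image at one byte per pixel (pixel size is assumed).
image_bytes = 512 * 512
for tiers in (1, 2):
    print(tiers, "tier(s):", transfer_time_s(image_bytes, tiers) * 1e6, "microseconds")
```

For payloads of this size the per-tier latency is small next to the streaming time, so in this model the second switch tier adds little to the total.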
6. SRC Programming Environment
7. SRC Application Simulation Process
[Figure: SRC application simulation flow. Blocks shown: HLL source, Macro Definition, Compiler Front-end, CFG → DFG, Verilog Generator, Verilog, Synthesis, Place and Route, logic.bin, DFG Behavioral Simulation, User Chip Level Simulation, Macro Emulation C Code (Info File), Macro Verilog Code (optional)]
Source: SRC
8. Steps to Final Logic
- DFG Simulation
- Verifies memory allocation
- Verifies data movement
- Uses real run time environment
- Emulates the CM and OBM relationships
- Simulates User Logic
[Figure: stages shown: HLL Source, DFG Simulation]
Source: SRC
9. Steps to Final Logic
- DFG Simulation
- User Logic Simulation
- Test user developed macros
- The application becomes the test bench
- Push the generated logic one step closer to the actual hardware implementation
- Requires logic designer mentality for debugging
- Gives full visibility into the logic
[Figure: stages shown: HLL Source, DFG Simulation, UL Simulation]
Source: SRC
10. Steps to Final Logic
- DFG Emulation
- User Logic Simulation
- MAP Hardware Execution
- Full execution using ComList and User Logic on MAP
[Figure: stages shown: HLL Source, DFG Simulation, UL Simulation, MAP Execution]
Source: SRC
11. SGI Systems (System Architecture)
- NUMAlink system interconnect
- General-purpose compute nodes
- Peer-attached general purpose I/O
- Integrated graphics/visualization
- Reconfigurable Application Specific Computing
[Figure: system diagram with blocks labeled R, C, IO, V, RASC]
12. RASC Architecture
13. RASC Architecture (cont'd)
14. Design Flow (HDLs)
[Figure: HDL design flow. Labels: Design iterations, Design Verification, IA-32 Linux Machine]
- Stages: Design Entry (Verilog, VHDL), Behavioral Simulation (VCS, Modelsim), Design Synthesis (Synplify Pro, Amplify), Metadata Processing (Python), Design Implementation (ISE), Static Timing Analysis (ISE Timing Analyzer)
- File types passed between stages: .v, .vhd, .edf, .ncd, .pcf, .cfg, .bin
15. Design Flow (HLLs)
[Figure: HLL design flow. Labels: Design Verification, IA-32 Linux Machine]
- Stages: HLL Design Entry (Handel-C, Impulse C, Mitrion C, Viva), RTL Generation and Integration with Core Services, Behavioral Simulation (VCS, Modelsim), Design Synthesis (Synplify Pro, Amplify), Metadata Processing (Python), Design Implementation (ISE), Static Timing Analysis (ISE Timing Analyzer)
- File types passed between stages: .v, .vhd, .edf, .ncd, .pcf, .cfg, .bin
16. Application Programming Interface
- Rasclib
- Resource allocation in conjunction with the RASC Device Manager
- Data movement to/from the COP via DMA engines
- Algorithm control (start, stop, single step, stepN)
- Automatic scaling across multiple devices
- Interfaces necessary for debugging
17. Abstraction Layer Algorithm API
- The Abstraction Layer's algorithm API mirrors the COP API, with a few additions that enable wide scaling
18. RASC Debugging
- Based on the open-source GNU Debugger (GDB)
- Uses extensions to current command set
- Can debug host application and FPGA
- Provides notification when FPGA starts or stops
- Supplies information on FPGA characteristics
- Can single-step or run N steps of the algorithm
- Dumps data regarding the set of registers that are visible when the FPGA is active
19. Applications of DWT
- Pattern recognition
- Feature extraction
- Metallurgy: characterization of rough surfaces
- Trend detection
- Finance: exploring variation of stock prices
- Perfect reconstruction
- Communications: wireless channel signals
- De-noising noisy data
- FBI fingerprint compression
- Detecting self-similarity in a time series
- Video compression: JPEG 2000
- Hyperspectral Dimension Reduction
- Image Registration
20. Multi-Resolution DWT Decomposition (Mallat Algorithm)
- The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two, resulting in two "column-decimated" images L and H
- Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two
- This decomposition results in four images: LL, LH, HL and HH
- The LL image is taken as the new input to perform the next level of decomposition
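
The decomposition steps above can be sketched in software as follows. This is a minimal illustration, not the hardware design from the slides. It assumes Haar (Daub1) filter taps, periodic extension at the image borders, a correlation-style filter (taps not time-reversed), and images stored as lists of rows. The function names are hypothetical, and the sub-band naming (first letter for the row filter, second for the column filter) is one common convention.

```python
import math

# Haar (Daub1) analysis taps; longer Daubechies filters would be longer lists.
LOW = [1 / math.sqrt(2), 1 / math.sqrt(2)]
HIGH = [1 / math.sqrt(2), -1 / math.sqrt(2)]


def filter_and_decimate(signal, taps):
    """Filter a 1-D sequence with `taps` and keep every second output.

    Periodic (wrap-around) boundary handling is an assumption.
    """
    n = len(signal)
    return [
        sum(taps[k] * signal[(i + k) % n] for k in range(len(taps)))
        for i in range(0, n, 2)
    ]


def transpose(image):
    return [list(col) for col in zip(*image)]


def along_columns(image, taps):
    """Filter every column and halve the number of rows."""
    return transpose([filter_and_decimate(col, taps) for col in transpose(image)])


def dwt2_level(image):
    """One decomposition level; returns (LL, LH, HL, HH)."""
    # Step 1: filter along each row, halving the number of columns.
    low = [filter_and_decimate(row, LOW) for row in image]
    high = [filter_and_decimate(row, HIGH) for row in image]
    # Step 2: filter each intermediate image along its columns.
    return (along_columns(low, LOW), along_columns(low, HIGH),
            along_columns(high, LOW), along_columns(high, HIGH))


def dwt2(image, levels):
    """Multi-level decomposition: LL becomes the input of the next level."""
    details = []
    approx = image
    for _ in range(levels):
        approx, lh, hl, hh = dwt2_level(approx)
        details.append((lh, hl, hh))
    return approx, details
```

Calling `dwt2(image, 2)` on an image whose side lengths are multiples of four returns the level-2 LL image together with the (LH, HL, HH) triples of each level. On a constant image all detail sub-bands come out as zero, which is a quick sanity check.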
21. DWT Implementation (Top-Level)
22. FIR Module (Transposed Form)
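
For readers unfamiliar with the transposed form, the following is a minimal software model of the general structure. It is not the module shown on the slide. In a transposed-form FIR filter each input sample is broadcast to all multipliers at once, and the partial sums travel through a chain of registers toward the output. The function name and the use of Python floats are assumptions made for illustration.

```python
def fir_transposed(samples, taps):
    """Software model of a transposed-form FIR filter.

    regs[k] models the register between adder k and adder k+1;
    regs[0] is the one nearest the output.
    """
    regs = [0.0] * (len(taps) - 1)
    out = []
    for x in samples:
        # The same input sample feeds every multiplier.
        products = [b * x for b in taps]
        # Output adder: first product plus the nearest register.
        y = products[0] + (regs[0] if regs else 0.0)
        # Each register loads its product plus the register behind it.
        regs = [
            products[k + 1] + (regs[k + 1] if k + 1 < len(regs) else 0.0)
            for k in range(len(regs))
        ]
        out.append(y)
    return out


# Feeding a unit impulse returns the taps in order.
print(fir_transposed([1, 0, 0, 0], [1, 2, 3]))
```

The output matches a direct-form filter, y[n] = b0*x[n] + b1*x[n-1] + ..., but every adder is separated from the next by a register. This is one reason the transposed form is commonly chosen for pipelined hardware.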
23. DWT End-to-End Throughput (SRC-6 and SGI-RASC vs. P4)
Filter Type     SRC-6 (MB/sec)   SGI RASC (MB/sec)   P4 (MB/sec)
Daub1 (Haar)    199              130                 12
Daub2           199              130                 9.98
Daub3           199              130                 8.73
Image size: 512 x 512 pixels
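
The speed-up of each reconfigurable system over the P4 can be derived from the throughput table by simple division. The sketch below does only that. The ratios are computed here from the table values and are not figures quoted from the slides; the variable and function names are hypothetical.

```python
# Throughput in MB/sec, copied from the table above.
THROUGHPUT_MB_S = {
    "Daub1 (Haar)": {"SRC-6": 199, "SGI RASC": 130, "P4": 12},
    "Daub2": {"SRC-6": 199, "SGI RASC": 130, "P4": 9.98},
    "Daub3": {"SRC-6": 199, "SGI RASC": 130, "P4": 8.73},
}


def speedups(table, baseline="P4"):
    """Ratio of each platform's throughput to the baseline, per filter type."""
    return {
        filt: {
            platform: value / row[baseline]
            for platform, value in row.items()
            if platform != baseline
        }
        for filt, row in table.items()
    }


for filt, ratios in speedups(THROUGHPUT_MB_S).items():
    print(filt, {platform: round(r, 1) for platform, r in ratios.items()})
```

Because the SRC-6 and SGI RASC throughput stays constant across filter types while the P4 throughput drops, the computed speed-up grows with filter length, from roughly 17x to 23x for SRC-6 and roughly 11x to 15x for SGI RASC.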
24. Conclusions
- DWT is implemented on both SRC-6 and SGI-RASC systems
- Similarities and differences are analyzed with regard to:
- System hardware architecture
- Ease of programming
- Programming model
- Development time
- Hardware/software libraries
- Performance
- The speed-up vs. microprocessor is reported
- Primary bottlenecks limiting the performance of both systems are recognized
- The capability to share and port applications between the SRC and SGI systems is explored