Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers

Description:

Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers, by Esam El-Araby, Mohamed Taher, Tarek El-Ghazawi, and Kris Gaj

Provided by: klabsOrgm5

Learn more at: http://www.klabs.org

1
Implementation of Image Processing Kernels on SRC
and SGI Reconfigurable Computers

Esam El-Araby (1), Mohamed Taher (1), Tarek El-Ghazawi (1), and Kris Gaj (2)
(1) The George Washington University, (2) George Mason University
{esam, mtaher, tarek}@gwu.edu, kgaj@gmu.edu

2
Introduction

What are Reconfigurable Computers (RCs)?
RCs are computing systems based on the close
system-level integration of one or more
general-purpose processors and one or more Field
Programmable Gate Array (FPGA) chips

Benefits of RCs
A trade-off between traditional hardware and
software
Hardware-like performance with software-like
flexibility
Hardware can be modified on-the-fly
The programming model is aimed at shielding
programmers from the details of the hardware
description
Orders of magnitude performance improvements over
traditional systems

3
Introduction (cont'd)

Status of RCs
An important research subject due to the recent
fast growth of FPGA technology
Evolved from
glue logic between components, to
accelerator boards, to
stand-alone general-purpose RCs, to
parallel reconfigurable computers
However, there exist multiple challenges that
must be resolved

4
Challenges

Performance
I/O Bandwidth
Significant Configuration Latency
Some systems spend 25% to 98% of their execution
time performing reconfiguration
Need for Efficient OS and Run-Time
Reconfiguration Management
Reconfiguration methods in current systems are
not fully dynamic
Ease of Use
Compilers/Languages
HDLs (VHDL and Verilog) are hard for
application scientists to use
HLLs and simple interfaces
Debuggers

5
SRC Architecture (Hi-Bar™ Based Systems)

Hi-Bar sustains 1.4 GB/s per port with 180 ns
latency per tier
Up to 256 input and 256 output ports with two
tiers of switch
Common Memory (CM) has controller with DMA
capability
Controller can perform other functions such as
scatter/gather
Up to 8 GB DDR SDRAM supported per CM node
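
The per-port bandwidth and per-tier latency quoted above suggest a simple first-order estimate of how long a block of data takes to cross the switch. The sketch below is only an illustration of that arithmetic, not a model taken from the presentation: the function name `hibar_transfer_time` is hypothetical, 1 GB is assumed to mean 10^9 bytes, and the example call assumes one byte per pixel for a 512 x 512 image.

```python
# First-order Hi-Bar transfer-time estimate (illustrative only).
# Figures from the slide: 1.4 GB/s sustained per port, 180 ns latency per tier.
HIBAR_BANDWIDTH_BYTES_PER_S = 1.4e9   # assumption: 1 GB = 10**9 bytes
HIBAR_LATENCY_S_PER_TIER = 180e-9


def hibar_transfer_time(num_bytes, tiers=1):
    """Return estimated seconds to move num_bytes through `tiers` switch tiers.

    Model: fixed latency per tier plus payload size divided by the
    sustained port bandwidth. Contention and DMA setup are ignored.
    """
    latency = tiers * HIBAR_LATENCY_S_PER_TIER
    streaming = num_bytes / HIBAR_BANDWIDTH_BYTES_PER_S
    return latency + streaming


if __name__ == "__main__":
    # Assumption: 512 x 512 image, one byte per pixel, two switch tiers.
    image_bytes = 512 * 512
    seconds = hibar_transfer_time(image_bytes, tiers=2)
    print(f"{image_bytes} bytes: about {seconds * 1e6:.1f} microseconds")
```

Under these assumptions the streaming term dominates: the switch latency is a few hundred nanoseconds, while the payload takes on the order of a couple of hundred microseconds.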

Source: SRC
6
SRC Programming Environment
7
SRC Application Simulation Process
[Figure: SRC compile and simulation flow. Boxes: HLL source, compiler front-end, CFG, DFG, Verilog generator, Verilog, synthesis, place and route, logic.bin. Simulation boxes: DFG behavioral simulation, user chip-level simulation. Macro inputs: macro definition, macro emulation C code (info file), macro Verilog code. One element is marked optional.]
Source: SRC
8
Steps to Final Logic

DFG Simulation
Verifies memory allocation
Verifies data movement
Uses real run time environment
Emulates the CM and OBM relationships
Simulates User Logic

[Figure: progress through the flow: HLL Source, DFG Simulation]
Source: SRC
9
Steps to Final Logic

DFG Simulation
User Logic Simulation
Test user developed macros
The application becomes the test bench
Push the generated logic one step closer to the
actual hardware implementation
Requires logic designer mentality for debugging
Gives full visibility into the logic

[Figure: progress through the flow: HLL Source, DFG Simulation, UL Simulation]
Source: SRC
10
Steps to Final Logic

DFG Emulation
User Logic Simulation
MAP Hardware Execution
Full execution using ComList and User Logic on
MAP

[Figure: progress through the flow: HLL Source, DFG Simulation, UL Simulation, MAP Execution]
Source: SRC
11
SGI Systems (System Architecture)

NUMAlink system interconnect (R)
General-purpose compute nodes (C)
Peer-attached general-purpose I/O (IO)
Integrated graphics/visualization (V)
Reconfigurable Application Specific Computing (RASC)
12
RASC Architecture
13
RASC Architecture (cont'd)
14
Design Flow (HDLs)
[Figure: HDL design flow on an IA-32 Linux machine. Stages: Design Entry (Verilog, VHDL); Behavioral Simulation (VCS, ModelSim); Design Synthesis (Synplify Pro, Amplify); Metadata Processing (Python); Design Implementation (ISE); Static Timing Analysis (ISE Timing Analyzer). Files passed between stages: .v, .vhd, .edf, .ncd, .pcf, .cfg, .bin. Annotations: design iterations, design verification.]
15
Design Flow (HLLs)
[Figure: HLL design flow on an IA-32 Linux machine. Stages: HLL Design Entry (Handel-C, Impulse C, Mitrion C, Viva); RTL Generation and Integration with Core Services; Behavioral Simulation (VCS, ModelSim); Design Synthesis (Synplify Pro, Amplify); Metadata Processing (Python); Design Implementation (ISE); Static Timing Analysis (ISE Timing Analyzer). Files passed between stages: .v, .vhd, .edf, .ncd, .pcf, .cfg, .bin. Annotation: design verification.]
16
Application Programming Interface

Rasclib
Resource allocation in conjunction with the RASC
Device Manager
Data movement to/from the COP via DMA engines
Algorithm control (start, stop, single step,
stepN)
Automatic scaling across multiple devices
Interfaces necessary for debugging

17
Abstraction Layer Algorithm API

The Abstraction Layer's algorithm API mirrors the
COP API, with a few additions that enable wide
scaling and deep scaling.

18
RASC Debugging

Based on the open-source GNU Debugger (GDB)
Uses extensions to current command set
Can debug host application and FPGA
Provides notification when FPGA starts or stops
Supplies information on FPGA characteristics
Can single-step or run N steps of the
algorithm
Dumps data regarding the set of registers that
are visible when the FPGA is active

19
Applications of DWT

Pattern recognition
Feature extraction
Metallurgy: characterization of rough surfaces
Trend detection
Finance: exploring variation of stock prices
Perfect reconstruction
Communications: wireless channel signals
De-noising noisy data
FBI: fingerprint compression
Detecting self-similarity in a time series
Video compression: JPEG 2000
Hyperspectral Dimension Reduction
Image Registration

20
Multi-Resolution DWT Decomposition (Mallat
Algorithm)

The input image is first convolved along the rows by the two filters L and H and decimated along the columns by two, resulting in two "column-decimated" images, L and H
Each of the two images, L and H, is then convolved along the columns by the two filters L and H and decimated along the rows by two
This decomposition results in four images: LL, LH, HL and HH
The LL image is taken as the new input to perform the next level of decomposition
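
The decomposition steps above can be sketched in software as follows. This is a minimal pure-Python illustration, not the FPGA design from the presentation. All function names are hypothetical, the image is assumed to have even dimensions at every level, borders are handled by periodic extension (an assumption; the slides do not state the border policy), and each filter is applied as a sliding dot product, which for non-symmetric filters differs from strict convolution by a reversal of the taps.

```python
import math


def filter_decimate(signal, taps):
    """Filter a 1-D sequence (periodic extension), keeping every 2nd output."""
    n = len(signal)
    return [
        sum(t * signal[(i + k) % n] for k, t in enumerate(taps))
        for i in range(0, n, 2)
    ]


def transpose(image):
    """Swap rows and columns of a list-of-lists image."""
    return [list(col) for col in zip(*image)]


def filter_columns(image, taps):
    """Apply filter_decimate down every column, halving the row count."""
    return transpose([filter_decimate(col, taps) for col in transpose(image)])


def dwt2_level(image, low, high):
    """One Mallat level; returns the four sub-images (LL, LH, HL, HH)."""
    # Row pass: filter along each row, halving the number of columns.
    img_l = [filter_decimate(row, low) for row in image]
    img_h = [filter_decimate(row, high) for row in image]
    # Column pass on both intermediate images, halving the number of rows.
    return (
        filter_columns(img_l, low),
        filter_columns(img_l, high),
        filter_columns(img_h, low),
        filter_columns(img_h, high),
    )


def dwt2(image, low, high, levels):
    """Multi-level decomposition: LL feeds the next level."""
    details = []
    ll = image
    for _ in range(levels):
        ll, lh, hl, hh = dwt2_level(ll, low, high)
        details.append((lh, hl, hh))
    return ll, details


# Haar (Daub1) analysis filters, orthonormal scaling.
S = 1 / math.sqrt(2)
HAAR_LOW = [S, S]
HAAR_HIGH = [S, -S]
```

For example, `dwt2(image, HAAR_LOW, HAAR_HIGH, levels=2)` returns the final LL image together with one `(LH, HL, HH)` triple per level. Longer Daubechies filters can be passed as `low` and `high` in the same way.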

21
DWT Implementation (Top-Level)
22
FIR Module (Transposed Form)
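
The slide's diagram is not reproduced in this transcript, so the following is only a generic software model of a transposed-form FIR filter, not the authors' module. In this structure the input sample is broadcast to every multiplier at once, and the products are accumulated through a chain of registers. The function names are hypothetical, and `fir_direct` is included solely as a reference to check against.

```python
def fir_transposed(samples, taps):
    """Sample-by-sample model of a transposed-form FIR filter.

    regs[i] models the register between adder i+1 and adder i in the chain.
    """
    n = len(taps)
    regs = [0.0] * (n - 1)  # register chain, cleared at reset
    out = []
    for x in samples:
        # The same input sample feeds every multiplier in the same cycle.
        products = [t * x for t in taps]
        # Output adder: first product plus the first register.
        out.append(products[0] + (regs[0] if regs else 0.0))
        # Clock edge: each register loads its product plus the next register.
        # Iterating upward reads regs[i + 1] before it is overwritten.
        for i in range(n - 1):
            upstream = regs[i + 1] if i + 1 < n - 1 else 0.0
            regs[i] = products[i + 1] + upstream
    return out


def fir_direct(samples, taps):
    """Reference direct-form FIR: y[n] = sum over k of taps[k] * x[n - k]."""
    return [
        sum(t * samples[i - k] for k, t in enumerate(taps) if i - k >= 0)
        for i in range(len(samples))
    ]
```

Both functions produce the same output sequence. The transposed form is often preferred in hardware because the critical path is one multiplier and one adder regardless of filter length.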
23
DWT End-to-End Throughput (SRC-6 and SGI RASC vs. P4)

Filter Type    SRC-6 (MB/s)   SGI RASC (MB/s)   P4 (MB/s)
Daub1 (Haar)   199            130               12
Daub2          199            130               9.98
Daub3          199            130               8.73

Image size: 512 x 512 pixels
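
The speed-up of each reconfigurable system over the P4 follows directly from the throughput figures above. The snippet below is a small worked calculation using only the table's numbers; the names `THROUGHPUT_MB_S` and `speedups` are hypothetical.

```python
# End-to-end DWT throughput in MB/s, copied from the table above.
THROUGHPUT_MB_S = {
    "Daub1 (Haar)": {"SRC-6": 199.0, "SGI RASC": 130.0, "P4": 12.0},
    "Daub2": {"SRC-6": 199.0, "SGI RASC": 130.0, "P4": 9.98},
    "Daub3": {"SRC-6": 199.0, "SGI RASC": 130.0, "P4": 8.73},
}


def speedups(table, baseline="P4"):
    """Return throughput ratios of every platform relative to the baseline."""
    return {
        filt: {
            platform: rate / row[baseline]
            for platform, rate in row.items()
            if platform != baseline
        }
        for filt, row in table.items()
    }


if __name__ == "__main__":
    for filt, row in speedups(THROUGHPUT_MB_S).items():
        cells = ", ".join(f"{p}: {s:.1f}x" for p, s in row.items())
        print(f"{filt}: {cells}")
```

Because the FPGA throughput stays constant while the P4 slows down with longer filters, the ratio grows with filter length: from about 16.6x to 22.8x for the SRC-6 and from about 10.8x to 14.9x for the SGI RASC.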
24
Conclusions

DWT is implemented on both SRC-6 and SGI-RASC
systems
Similarities and differences are analyzed with
regard to
System hardware architecture
Ease of programming
Programming model
Development time
Hardware/software libraries
Performance
The speed-up vs. microprocessor is reported
Primary bottlenecks limiting the performance of
both systems are recognized
The capability to share and port applications
between the SRC and SGI systems is explored