Implementation of Image Processing Kernels on SRC and SGI Reconfigurable Computers - PowerPoint PPT Presentation

Learn more at: http://www.klabs.org
1
Implementation of Image Processing Kernels on SRC
and SGI Reconfigurable Computers
  • Esam El-Araby (1), Mohamed Taher (1), Tarek
    El-Ghazawi (1), and Kris Gaj (2)
  • (1) The George Washington University,
    (2) George Mason University
  • {esam, mtaher, tarek}@gwu.edu, kgaj@gmu.edu

2
Introduction
  • What are Reconfigurable Computers (RCs)?
  • RCs are computing systems based on the close
    system-level integration of one or more
    general-purpose processors and one or more Field
    Programmable Gate Array (FPGA) chips
  • Benefits of RCs
  • A trade-off between traditional hardware and
    software
  • Hardware-like performance with software-like
    flexibility
  • Hardware can be modified on-the-fly
  • The programming model is aimed at shielding
    programmers from the details of the hardware
    description
  • Orders of magnitude performance improvements over
    traditional systems

3
Introduction (cont'd)
  • Status of RCs
  • An important research subject due to the recent
    fast growth of FPGA technology
  • Evolved from
  • Glue logic between components to
  • Accelerator boards to
  • Stand-alone general-purpose RCs to
  • Parallel reconfigurable computers
  • However, there exist multiple challenges that
    must be resolved

4
Challenges
  • Performance
  • I/O Bandwidth
  • Significant Configuration Latency
  • Some systems spend 25% to 98% of their execution
    time performing reconfiguration
  • Need for Efficient OS and Run-Time
    Reconfiguration Management
  • Reconfiguration methods in current systems are
    not fully dynamic
  • Ease of Use
  • Compilers/Languages
  • HDLs (VHDL and Verilog) are hard for application
    scientists to use
  • HLLs and simple interfaces
  • Debuggers

5
SRC Architecture (Hi-Bar™ Based Systems)
  • Hi-Bar sustains 1.4 GB/s per port with 180 ns
    latency per tier
  • Up to 256 input and 256 output ports with two
    tiers of switch
  • Common Memory (CM) has controller with DMA
    capability
  • Controller can perform other functions such as
    scatter/gather
  • Up to 8 GB DDR SDRAM supported per CM node
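
As a rough worked example of what the Hi-Bar figures above imply, the sketch below estimates a transfer time with a simple latency-plus-bandwidth model. The model itself, the function name `transfer_time_s`, the reading of 1 GB as 1e9 bytes, and the 512 x 512 image of 8-bit pixels are assumptions made for illustration; protocol overhead and contention are ignored.

```python
# Figures quoted on the slide.
HIBAR_BANDWIDTH_BYTES_PER_S = 1.4e9   # 1.4 GB/s per port (1 GB = 1e9 bytes is an assumption)
HIBAR_LATENCY_PER_TIER_S = 180e-9     # 180 ns latency per switch tier


def transfer_time_s(num_bytes, tiers=2):
    """Estimate one transfer as switch latency plus payload time.

    Hypothetical first-order model: tiers * latency + bytes / bandwidth.
    """
    return tiers * HIBAR_LATENCY_PER_TIER_S + num_bytes / HIBAR_BANDWIDTH_BYTES_PER_S


if __name__ == "__main__":
    image_bytes = 512 * 512  # hypothetical image, one byte per pixel
    t = transfer_time_s(image_bytes, tiers=2)
    print(f"estimated transfer time: {t * 1e6:.1f} us")
```

Under these assumptions the payload term dominates: the two-tier latency adds well under one microsecond to a transfer of about 187.6 microseconds.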

Source: SRC
6
SRC Programming Environment
7
SRC Application Simulation Process
[Figure: flow diagram of the simulation process.
Recoverable labels: HLL source, Compiler Front-end,
CFG/DFG, Verilog Generator, Verilog, Synthesis,
Place and Route, logic.bin, DFG Behavioral
Simulation, User Chip Level Simulation, Macro
Definition, Macro Emulation C Code (Info File),
Macro Verilog Code, Optional]
Source: SRC
8
Steps to Final Logic
  • DFG Simulation
  • Verifies memory allocation
  • Verifies data movement
  • Uses real run time environment
  • Emulates the CM/OBM relationships
  • Simulates User Logic

[Figure: HLL Source → DFG Simulation]
Source: SRC
9
Steps to Final Logic
  • DFG Simulation
  • User Logic Simulation
  • Test user developed macros
  • The application becomes the test bench
  • Push the generated logic one step closer to the
    actual hardware implementation
  • Requires logic designer mentality for debugging
  • Gives full visibility into the logic

[Figure: HLL Source → DFG Simulation → UL Simulation]
Source: SRC
10
Steps to Final Logic
  • DFG Emulation
  • User Logic Simulation
  • MAP Hardware Execution
  • Full execution using ComList and User Logic on
    MAP

[Figure: HLL Source → DFG Simulation → UL Simulation
→ MAP Execution]
Source: SRC
11
SGI Systems (System Architecture)
  • NUMAlink system interconnect
  • General-purpose compute nodes
  • Peer-attached general purpose I/O
  • Integrated graphics/visualization
  • Reconfigurable Application Specific Computing

[Figure: system diagram with blocks labeled R, C,
IO, V, and RASC]
12
RASC Architecture
13
RASC Architecture (cont'd)
14
Design Flow (HDLs)
[Figure: HDL design flow on an IA-32 Linux machine.
Stages: Design Entry (Verilog, VHDL), Behavioral
Simulation (VCS, Modelsim), Design Synthesis
(Synplify Pro, Amplify), Metadata Processing
(Python), Design Implementation (ISE), Static
Timing Analysis (ISE Timing Analyzer), with Design
Verification and design iterations.
File types shown: .v, .vhd, .edf, .ncd, .pcf, .cfg,
.bin]
15
Design Flow (HLLs)
[Figure: HLL design flow on an IA-32 Linux machine.
Stages: HLL Design Entry (Handel-C, Impulse C,
Mitrion C, Viva), RTL Generation and Integration
with Core Services, Behavioral Simulation (VCS,
Modelsim), Design Synthesis (Synplify Pro,
Amplify), Metadata Processing (Python), Design
Implementation (ISE), Static Timing Analysis (ISE
Timing Analyzer), with Design Verification.
File types shown: .v, .vhd, .edf, .ncd, .pcf, .cfg,
.bin]
16
Application Programming Interface
  • Rasclib
  • Resource allocation in conjunction with the RASC
    Device Manager
  • Data movement to/from the COP via DMA engines
  • Algorithm control (start, stop, single step,
    stepN)
  • Automatic scaling across multiple devices
  • Interfaces necessary for debugging

17
Abstraction Layer Algorithm API
  • The Abstraction Layer's algorithm API mirrors the
    COP API, with a few additions that enable wide
    scaling and deep scaling

18
RASC Debugging
  • Based on the open-source GNU Debugger (GDB)
  • Uses extensions to current command set
  • Can debug host application and FPGA
  • Provides notification when FPGA starts or stops
  • Supplies information on FPGA characteristics
  • Can single-step or run N steps of the
    algorithm
  • Dumps data regarding the set of registers that
    are visible when the FPGA is active

19
Applications of DWT
  • Pattern recognition
  • Feature extraction
  • Metallurgy: characterization of rough surfaces
  • Trend detection
  • Finance: exploring variation of stock prices
  • Perfect reconstruction
  • Communications: wireless channel signals
  • De-noising noisy data
  • FBI fingerprint compression
  • Detecting self-similarity in a time series
  • Video compression: JPEG 2000
  • Hyperspectral Dimension Reduction
  • Image Registration

20
Multi-Resolution DWT Decomposition (Mallat
Algorithm)
  • The input image is first convolved along the rows
    by the two filters L and H and decimated along
    the columns by two, resulting in two
    "column-decimated" images L and H
  • Each of the two images, L and H, is then
    convolved along the columns by the two filters L
    and H and decimated along the rows by two
  • This decomposition results in four images, LL,
    LH, HL and HH
  • The LL image is taken as the new input to perform
    the next level of decomposition
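
The decomposition steps above can be sketched in plain Python. This is a minimal software illustration, not the FPGA design from the presentation: the helper names (`filter_and_decimate`, `dwt2_level`, `dwt2`), the periodic border extension, and the orthonormal Haar taps are assumptions chosen for brevity, and sign conventions for the high-pass filter vary between references.

```python
import math


def filter_and_decimate(signal, taps):
    """Filter a 1-D sequence (sliding dot product, periodic border) and keep every second output."""
    n = len(signal)
    out = []
    for start in range(0, n, 2):          # step of 2 performs the decimation
        acc = 0.0
        for k, tap in enumerate(taps):
            acc += tap * signal[(start + k) % n]
        out.append(acc)
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def dwt2_level(image, low, high):
    """One decomposition level: filter rows, then columns, giving LL, LH, HL, HH."""
    # Row pass: each row is filtered and every second column is kept.
    img_l = [filter_and_decimate(row, low) for row in image]
    img_h = [filter_and_decimate(row, high) for row in image]

    # Column pass: filter each column and keep every second row.
    def column_pass(img, taps):
        return transpose([filter_and_decimate(col, taps) for col in transpose(img)])

    return (column_pass(img_l, low), column_pass(img_l, high),
            column_pass(img_h, low), column_pass(img_h, high))


def dwt2(image, low, high, levels):
    """Multi-resolution decomposition: LL is fed back as the next level's input."""
    details = []
    current = image
    for _ in range(levels):
        ll, lh, hl, hh = dwt2_level(current, low, high)
        details.append((lh, hl, hh))
        current = ll
    return current, details


# Orthonormal Haar (Daub1) taps, used here only as an example filter pair.
S = 1.0 / math.sqrt(2.0)
HAAR_LOW = [S, S]
HAAR_HIGH = [S, -S]

if __name__ == "__main__":
    image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
    ll, details = dwt2(image, HAAR_LOW, HAAR_HIGH, levels=2)
    print(ll, len(details))
```

Each level halves both image dimensions, so the sketch assumes the width and height are divisible by two at every level requested.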

21
DWT Implementation (Top-Level)
22
FIR Module (Transposed Form)
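
The slide itself carries only a diagram, so the following is a generic software model of a transposed-form FIR filter rather than the authors' module. In this structure every tap multiplies the current input sample, and the partial sums move through a chain of registers toward the output. The function name `fir_transposed` and the example taps are assumptions for illustration.

```python
def fir_transposed(samples, taps):
    """Software model of a transposed-form FIR filter.

    regs[i] holds the partial sum that will reach the output i + 1 samples later.
    """
    regs = [0.0] * (len(taps) - 1)
    out = []
    for x in samples:
        # Output: first tap times the input, plus the oldest partial sum.
        y = taps[0] * x + (regs[0] if regs else 0.0)
        # Shift the partial sums one stage toward the output, adding tap products.
        for i in range(len(regs) - 1):
            regs[i] = taps[i + 1] * x + regs[i + 1]
        if regs:
            regs[-1] = taps[-1] * x
        out.append(y)
    return out


if __name__ == "__main__":
    # An impulse input reproduces the taps in order.
    print(fir_transposed([1, 0, 0, 0], [1, 2, 3]))
```

In hardware this form is often preferred because the input is broadcast to all multipliers and each adder is followed by a register, which keeps the combinational path short.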
23
DWT End-to-End Throughput (SRC-6 and SGI-RASC vs.
P4)

Filter Type    SRC-6 (MB/sec)  SGI RASC (MB/sec)  P4 (MB/sec)
Daub1 (Haar)   199             130                12
Daub2          199             130                9.98
Daub3          199             130                8.73

Image size: 512 x 512 pixels
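
The throughput figures above can be turned into speed-up ratios against the P4 baseline. The values in the sketch are copied from the table; the ratios are derived here for illustration and are not figures quoted by the authors. The names `throughput_mb_s` and `speedups` are hypothetical.

```python
# Throughput in MB/sec, copied from the table above.
throughput_mb_s = {
    "Daub1 (Haar)": {"SRC-6": 199.0, "SGI RASC": 130.0, "P4": 12.0},
    "Daub2":        {"SRC-6": 199.0, "SGI RASC": 130.0, "P4": 9.98},
    "Daub3":        {"SRC-6": 199.0, "SGI RASC": 130.0, "P4": 8.73},
}


def speedups(row, baseline="P4"):
    """Return each platform's throughput divided by the baseline's throughput."""
    base = row[baseline]
    return {name: value / base for name, value in row.items() if name != baseline}


if __name__ == "__main__":
    for filter_name, row in throughput_mb_s.items():
        ratios = speedups(row)
        summary = ", ".join(f"{name}: {ratio:.1f}x" for name, ratio in ratios.items())
        print(f"{filter_name}: {summary}")
```

Because the FPGA throughput stays constant while the P4 slows down with longer filters, the derived ratio grows from roughly 17x to 23x for the SRC-6 and from roughly 11x to 15x for the SGI RASC.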
24
Conclusions
  • DWT is implemented on both SRC-6 and SGI-RASC
    systems
  • Similarities and differences are analyzed with
    regard to
  • System hardware architecture
  • Ease of programming
  • Programming model
  • Development time
  • Hardware/software libraries
  • Performance
  • The speed-up vs. microprocessor is reported
  • Primary bottlenecks limiting the performance of
    both systems are recognized
  • The capability to share and port applications
    between the SRC and SGI systems is explored