Survey of C-based Application Mapping Tools for Reconfigurable Computing

By Brian Holland, Mauricio Vacas, Vikas Aggarwal, Ryan DeVille, Ian Troxel, and Alan D. George (PowerPoint presentation)

Slides: 22

Provided by: klabsOrgma

Learn more at: http://klabs.org

Transcript and Presenter's Notes

1
Survey of C-based Application Mapping Tools for
Reconfigurable Computing

Brian Holland, Mauricio Vacas, Vikas Aggarwal,
Ryan DeVille, Ian Troxel, and Alan D. George
High-performance Computing and Simulation (HCS)
Research Lab
Department of Electrical and Computer Engineering
University of Florida

2
Outline

Introduction
General Survey
Ten C-based Application Mappers
Benchmarking Results
Finite-Impulse Response (FIR)
N-Queens
Radix Sort
Lessons Learned
Conclusions
Acknowledgements
References

3
Motivation for Application Mappers

HDL programming has shortcomings
Limited applicability to application developers
More involved development process (vs. software)
Requires training beyond application level
Instead, can we find and exploit an environment
that allows a measure of hardware control along
with increased productivity?
Can we bring RC performance benefits to
application developers?
Would this be practical/possible in traditional
HDL?
HDL is well below the level of traditional
application programming
Consequently, we need to move to a higher level
of abstraction

4
Introduction

Selecting a Higher Level of Abstraction
CAD tools: Visually appealing, but tedious for
large projects
New language: Optimal, but requires complete
retraining
Traditional or Object-Oriented languages: Which?
How?
Ideally, use pure ANSI-C, "The Universal
Language"
Requires no additional knowledge or special
training
Port existing C programs into hardware
implementations (HDL)
Translation can be handled by a hardware compiler
Programmer concentrates on algorithmic
functionality

5
Commonalities

General characteristics of C-based application
mappers
Companies create proprietary ANSI C-based
languages
Languages do not have all ANSI C features
Extra pragmas are included for corresponding
compilers
Additional libraries of functions/macros for
further extensions
Must adhere to specific programming style for
maximum optimization
Emphasis on both hardware generation and I/O
interfaces

6
Spectrum of C-based Application Mappers

7
Carte (SRC Computers) [1]

C/Fortran FPGA environment
Direct mapping of C/Fortran code to configuration level
Software emulation and simulation of compiled code for debugging
Capable of multiprocessor and multi-FPGA computational definitions
Allows explicit data flow control within memory hierarchy
Targets SRC's MAP processor
Produces Unified Executables for HW or SW processor execution
Runtime libraries handle required interfacing and management

Catapult C (Mentor Graphics) [2-3]

Algorithmic synthesis tool for RTL generation
RTL from pure untimed C
No extensions, pragmas, etc.
Compiler uses wrappers around algorithmic code
External: manages I/O interface
Internal: constrains synthesis to optimize for chosen interface
Explicit architectural constraints and optimization
Output: RTL netlists in VHDL, Verilog, and SystemC

8
DIME-C (Nallatech) [4]

FPGA prototyping tool
Designs are not cycle-accurate
Allows application synthesis for a higher clock speed
Compilation/Optimization
Pipeline/parallelize where possible
Included IEEE-754 FP cores
Dedicated (integer) multipliers
Currently in beta, expected release 4Q05
Output: synthesizable VHDL and DIMEtalk components

Handel C (Celoxica) [5]

Environment for cycle-accurate application development
All operations occur in one deterministic clock cycle
Makes it cycle-accurate, but clock freq reduced to slowest operation
Decisions/Loops are penalty-free but can significantly impact timing
Language has pragmas for explicitly defined parallelism
Compiler can analyze, optimize, and rewrite code
Output: VHDL/Verilog, SystemC, or targeted EDIFs

9
Impulse C (Impulse Accelerated Technologies) [6]

Language/compiler for modeling sequential apps.
Processes: independent, potentially concurrent, computing blocks
Streams: communicate and synchronize processes
Uses Streams-C methodology
However, focuses on compatibility with C development environments
Compilation
Each process implemented as separate state machine
Output: Generic or FPGA-specific VHDL

Mitrion C (Mitrion) [7]

Softcore processor tactic
Processor creates abstraction layer between C code and FPGA
Compilation
C code is mapped to a generic API of possible functions
Processor instantiated on FPGA, tailored to specific application
Custom instruction bit-widths, specific cache and buffer sizes
Currently in beta, expected release 4Q05
Output: a VHDL IP core for target architectures

10
Napa C (National Semiconductor) [8]

Language/compiler for RISC/FPGA hybrid processor
Capitalize on single-cycle interconnect instead of I/O bus
Datapath Synthesis Technique
Hand-optimized pre-placed, pre-routed module generators
Compiler generates hardware pipelines from C loops
Targets NS NAPA1000 hybrid processor
Fixed-Instruction Processor (FIP), Adaptive Logic Processor (ALP)
ALP also compiles to RTL VHDL, structural VHDL, structural Verilog

SA-C (Colorado State University) [9-12]

High-level, expression-oriented, machine-independent, single-assignment language
Designed to implicitly express data-parallel operations
Image and signal processing
Compiler (UC-Irvine, UC-Riverside, Colorado State Univ.)
Loop optimizations
Structural transforms
Execution block placement
Target Platforms
UC Irvine Morphosys, Annapolis WildForce, StarFire, WildFire

11
Streams C (Los Alamos National Laboratory) [12-14]

Stream-oriented sequential process modeling
Essentially, data elements moving through discrete functional blocks
Compiler
Generates multi-threaded processor executables and multiple FPGA bitstreams
Allows parallel C program translation into a parallel arch.
Includes functional-level simulation environment
Output: synthesizable RTL

SystemC (Open SystemC Initiative, OSCI) [15-16]

Open-source extension of C for HW/SW modeling
Core language, modules and ports for defining structure, and interfaces and channels
Supports functional modeling
Hierarchical decomposition of a system into modules
Structural connectivity between modules using ports/exports
Scheduling and synchronization of concurrent processes using events
Event-driven simulator
Events are basic dynamic/static process synchronization objects
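
The stream-oriented model described for Streams C (and used by Impulse C) can be pictured in plain C as two processes joined by a bounded FIFO. The sketch below is only an illustration: `stream_t`, `stream_write`, `stream_read`, `run_pipeline`, and the depth `STREAM_DEPTH` are hypothetical names invented here, not part of any vendor API, and the two "processes" are simply stepped in turn inside one loop rather than running concurrently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define STREAM_DEPTH 4u  /* hypothetical FIFO depth, chosen for illustration */

typedef struct {
    int32_t buf[STREAM_DEPTH];
    size_t head;   /* index of the next element to read */
    size_t count;  /* number of elements currently queued */
} stream_t;

/* Non-blocking write: returns false when the stream is full. */
static bool stream_write(stream_t *s, int32_t value)
{
    if (s->count == STREAM_DEPTH) return false;
    s->buf[(s->head + s->count) % STREAM_DEPTH] = value;
    s->count++;
    return true;
}

/* Non-blocking read: returns false when the stream is empty. */
static bool stream_read(stream_t *s, int32_t *value)
{
    if (s->count == 0) return false;
    *value = s->buf[s->head];
    s->head = (s->head + 1) % STREAM_DEPTH;
    s->count--;
    return true;
}

/* Producer emits 1..n; consumer squares each element and accumulates.
 * Each loop iteration gives both "processes" one step. */
int64_t run_pipeline(int32_t n)
{
    stream_t s = { {0}, 0, 0 };
    int32_t next = 1;
    int64_t sum = 0;
    while (next <= n || s.count > 0) {
        int32_t v;
        if (next <= n && stream_write(&s, next)) next++;  /* producer step */
        if (stream_read(&s, &v)) sum += (int64_t)v * v;   /* consumer step */
    }
    return sum;
}
```

In a real mapper each process would become its own hardware block (or state machine) and the FIFO would supply the blocking synchronization; here the full/empty return values stand in for that blocking behavior.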

12
About the Benchmarks

Three classic algorithms used for benchmarking
Finite-Impulse Response (FIR)
Simple 51-tap FIR filter for standard DSP
applications
Compare compiler solutions and analyze their
usage metrics
N-Queens
Classic embarrassingly parallel HPC backtracking
search problem
Showcases the potential of optimized
implementations
Radix Sort
Sorts using binary bins, minimizing resources
Illustrates resource metrics in RAM-intensive
applications
Implementation Details
DIME-C, Handel C, Impulse C, VHDL, and ANSI-C
(for baseline timing)
Experiments performed on Nallatech BenNUEY-PCI
card with VirtexII-6000 FPGA
Resource utilization based on post
place-and-route data
Runtime represents communication time (setup and
verification I/O is negated)
Handel C and Impulse C require VHDL wrappers
which can increase resource usage

13
Finite-Impulse Response

FIR filter containing 51 taps, each 16-bits wide
(based on algorithms in [4,6])
Various application-mapper languages do not have
a consistent I/O interface
Could not create a consistent streaming channel
with requisite blocking in every tool
Instead, FIR algorithm operates on values stored
in a block RAM
Obtains speedup through parallel multiplication,
efficient memory accesses
The 51 coefficients and variables are stored in
local variables
Additional performance boosts are possible in
multi-channel DSP processing
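
As a rough software reference for the benchmark described above, a 51-tap FIR over a buffer of 16-bit samples might look like the following. This is a sketch under assumptions, not the authors' code: the function names, the zero-padding of samples before the start of the buffer, and the 64-bit accumulator width are all choices made here. The buffer `x` stands in for the block RAM mentioned on the slide.

```c
#include <stddef.h>
#include <stdint.h>

#define FIR_TAPS 51  /* tap count from the slide */

/* One output sample: y[n] = sum over k of coeff[k] * x[n - k].
 * Samples before the start of the buffer are treated as zero
 * (an assumption; the slides do not describe edge handling). */
int64_t fir_sample(const int16_t coeff[FIR_TAPS], const int16_t *x, size_t n)
{
    int64_t acc = 0;  /* wide accumulator so 51 16x16-bit products cannot overflow */
    for (size_t k = 0; k < FIR_TAPS && k <= n; k++) {
        acc += (int64_t)coeff[k] * (int64_t)x[n - k];
    }
    return acc;
}

/* Filter a whole buffer; x plays the role of the block RAM contents. */
void fir_filter(const int16_t coeff[FIR_TAPS], const int16_t *x,
                int64_t *y, size_t len)
{
    for (size_t n = 0; n < len; n++) {
        y[n] = fir_sample(coeff, x, n);
    }
}
```

The inner loop is the part a mapper would try to unroll into parallel multipliers; the 51 independent products per output sample are where the speedup from parallel multiplication comes from.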

14
N-Queens

Represents a purely computational algorithm with
virtually no communication overhead
Algorithm contains several parallelizable code
segments, exploitable for speedup
Implementations are based upon same baseline C
code
Every available technique and compiler
optimization is employed to boost performance
Notes
Handel C N-Queens is a benchmark from our
MAPLD'04 paper with additional refinements
VHDL N-Queens is the culmination of a semester-long
endeavor into the algorithm's parallelism
DIME-C and Impulse C N-Queens are results of
experimentation with beta compilers
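
For readers unfamiliar with the problem, a compact backtracking solution counter is sketched below. It is not the baseline C code used in the study (which is not shown in the slides); the bitmask formulation, the names `place` and `nqueens_count`, and the limit of 16 queens are assumptions made for this illustration.

```c
#include <stdint.h>

/* Place queens row by row. cols, diag1 and diag2 are bitmasks of the
 * columns attacked in the current row by queens already placed. */
static uint64_t place(uint32_t all, uint32_t cols, uint32_t diag1, uint32_t diag2)
{
    if (cols == all) return 1;  /* every column holds a queen: one solution */

    uint64_t count = 0;
    uint32_t free_cols = all & ~(cols | diag1 | diag2);
    while (free_cols) {
        uint32_t bit = free_cols & (~free_cols + 1u);  /* lowest free column */
        free_cols &= ~bit;
        /* Diagonal attacks shift by one column as we move to the next row. */
        count += place(all, cols | bit, (diag1 | bit) << 1, (diag2 | bit) >> 1);
    }
    return count;
}

/* Number of ways to place n non-attacking queens on an n x n board.
 * Returns 0 for n outside 1..16 (an arbitrary bound for this sketch). */
uint64_t nqueens_count(unsigned n)
{
    if (n == 0 || n > 16) return 0;
    uint32_t all = (1u << n) - 1u;
    return place(all, 0, 0, 0);
}
```

Each first-row placement starts an independent subtree of the search, which is one way to see why the slide calls the problem embarrassingly parallel: the subtrees can be handed to separate functional units with no communication until the counts are summed.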

15
Radix Sort

Sorts values one bit at a time (saving
significant resources vs. sorting one digit at a
time)
Represents a worst-case legacy algorithm,
containing no functional-level parallelism
Every element in every iteration depends on every
previous element in every iteration
Ideal for software processor with fast cache,
challenging in FPGA hardware
Speedup comes through efficient RAM usage and
compiler optimizations/pipelining
Reduce quantity and addressing complexity of RAM
accesses whenever possible
Metrics are based on sorting 600 32-bit integers
contained within a block RAM
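
A software sketch of the one-bit-at-a-time approach is given below, assuming unsigned 32-bit keys and a least-significant-bit-first pass order. The function name `radix_sort_binary`, the `MAX_ELEMS` bound (taken from the 600-element data set on the slide), and the two scratch bins are illustrative choices, not the benchmark source.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_ELEMS 600  /* data-set size from the slide */

/* Binary (two-bin) radix sort, least-significant bit first.
 * Returns 0 on success, -1 if len exceeds MAX_ELEMS. */
int radix_sort_binary(uint32_t *data, size_t len)
{
    uint32_t zeros[MAX_ELEMS];
    uint32_t ones[MAX_ELEMS];

    if (len > MAX_ELEMS) return -1;
    if (len == 0) return 0;

    for (unsigned bit = 0; bit < 32; bit++) {
        size_t nz = 0, no = 0;
        /* Distribute by the current bit, preserving order within each bin. */
        for (size_t i = 0; i < len; i++) {
            if ((data[i] >> bit) & 1u) ones[no++] = data[i];
            else                       zeros[nz++] = data[i];
        }
        /* Gather: all zero-bit keys first, then all one-bit keys. */
        memcpy(data, zeros, nz * sizeof data[0]);
        memcpy(data + nz, ones, no * sizeof data[0]);
    }
    return 0;
}
```

Each pass depends on the complete result of the previous pass, which matches the slide's point that the algorithm offers no functional-level parallelism; the work is dominated by the memory reads and writes in the distribute and gather steps.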

16
Some Optimization Techniques

Keep expensive computational operations to a
minimum
Multiplication, division, modulo, greater/less
than, and floating point are slow
Minimize reliance on arrays
Watch for combinable statements

Exploit functional level parallelism
Reduce bit-widths to minimal size
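
Two of these techniques can be illustrated in plain C. The helpers below are hypothetical examples written for this transcript: `next_index` replaces a modulo with a bit mask (valid only because the assumed buffer length is a power of two) and keeps the index in an 8-bit type, while `times_ten` replaces a multiplication by a constant with shifts and an add. Whether a particular mapper benefits from either rewrite is an assumption that should be checked against its synthesis reports.

```c
#include <stdint.h>

#define BUF_LEN 64u  /* assumed power-of-two buffer length */

/* Advance a ring-buffer index without a modulo operation.
 * Equivalent to (i + 1) % BUF_LEN because BUF_LEN is a power of two. */
static inline uint8_t next_index(uint8_t i)
{
    return (uint8_t)((i + 1u) & (BUF_LEN - 1u));
}

/* Multiply by the constant 10 using shifts and one add: 10x = 8x + 2x. */
static inline uint32_t times_ten(uint32_t x)
{
    return (x << 3) + (x << 1);
}
```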

17
Case Study: Dot Product

Color key: Green = Computation, Blue = Communication, Orange = Pragmas

DIME-C

    void Kernel(int a[50], int b[50], int *answer)
    {
        int i, temp = 0;
        for (i = 0; i < 50; i++)
            temp += a[i] * b[i];
        *answer = temp;
    }

    void dot_product(int a1[50], int b1[50], int a2[50], int b2[50], int *answer)
    {
        int answer1, answer2;
        #pragma genusc instance Kernel1
        Kernel(a1, b1, &answer1);
        #pragma genusc instance Kernel2
        Kernel(a2, b2, &answer2);
        *answer = answer1 + answer2;
    }

IMPULSE C

    void Kernel1(co_stream a1, co_stream b1, co_stream z1)
    {
        int i, a[50], b[50], answer = 0;
        co_stream_open(a1, O_RDONLY, INT_TYPE(32));  /* etc. */
        /* Communication: read both vectors from the input streams */
        for (i = 0; i < 50; i++) {
            co_stream_read(a1, &a[i], sizeof(int32));
            co_stream_read(b1, &b[i], sizeof(int32));
        }
        /* Computation: multiply-accumulate, unrolled by the compiler */
        for (i = 0; i < 50; i++) {
            #pragma CO UNROLL
            answer += a[i] * b[i];
        }
        co_stream_write(z1, &answer, sizeof(int32));
        co_stream_close(a1);  /* etc. */
    }

    void Kernel2(co_stream a2, co_stream b2, co_stream z2)
    { /* SAME AS IN Kernel1 */ }

    void dot_product(co_stream z1, co_stream z2, co_stream ans)
    {
        int i, answer1, answer2, answer;
        co_stream_open(z1, O_RDONLY, INT_TYPE(32));  /* etc. */
        co_stream_read(z1, &answer1, sizeof(int32));
        co_stream_read(z2, &answer2, sizeof(int32));
        answer = answer1 + answer2;
        co_stream_write(ans, &answer, sizeof(int32));
        co_stream_close(z1);  /* etc. */
    }

HANDEL C

    int 32 Kernel1(int 32 a[50], int 32 b[50])
    {
        static int 32 i, temp[50], answer;
        /* All 50 multiplications run in parallel */
        par (i = 0; i < 50; i++) {
            temp[i] = a[i] * b[i];
        }
        for (i = 0; i < 50; i++) {
            answer += temp[i];
        }
        return answer;
    }

    int 32 Kernel2(int 32 a[50], int 32 b[50])
    { /* SAME AS IN Kernel1 */ }

    void main()  // dot_product
    {
        int 32 a1[50]; int 32 b1[50];
        int 32 a2[50]; int 32 b2[50];
        int 32 ans1, ans2;
        int 32 answer;
        interface bus_out() OutputResult(answer);
        /* Both kernels execute concurrently */
        par {
            ans1 = Kernel1(a1, b1);
            ans2 = Kernel2(a2, b2);
        }
        answer = ans1 + ans2;
    }

Not all implementations are perfectly optimized.
Your mileage will vary.
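
For comparison, the same two-kernel dot product in standard C (C99), the kind of code that would serve as a software baseline, could be sketched as follows. This is not taken from the slides; `VEC_LEN`, `kernel`, and the return-by-value interface are choices made here for a self-contained example.

```c
#define VEC_LEN 50  /* vector length used in the case study */

/* Dot product of one pair of vectors. */
int kernel(const int a[VEC_LEN], const int b[VEC_LEN])
{
    int temp = 0;
    for (int i = 0; i < VEC_LEN; i++) {
        temp += a[i] * b[i];
    }
    return temp;
}

/* Sum of two independent dot products. In software the two kernel calls
 * run back to back; the mapper versions instantiate them side by side. */
int dot_product(const int a1[VEC_LEN], const int b1[VEC_LEN],
                const int a2[VEC_LEN], const int b2[VEC_LEN])
{
    return kernel(a1, b1) + kernel(a2, b2);
}
```

Setting this next to the three mapper versions shows what each tool adds: an instance pragma in DIME-C, explicit stream I/O in Impulse C, and `par` blocks with bit-width-annotated types in Handel C.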
18
Lessons Learned

Tools are not near point of automatic translation
Programs still require some tweaking for hardware
compilation [17]
Optimized Software C ≠ Optimized Hardware C
However, generating VHDL is significantly easier
Learning basics of a C-based mapper is
straightforward
At least two major challenges remain
Input/output interfaces become a limiting factor
Moving generic VHDL to unsupported platforms
requires VHDL knowledge
However, once a generic I/O wrapper is generated,
it should be reusable
True hardware debugging remains a challenge
Another level of abstraction means another layer
for mistranslation
With no knowledge of internal VHDL signals,
tracing becomes difficult

19
Conclusions

Advantages of C-based application mappers
Far broader audience of potential RC users with
high-level languages
Required HDL knowledge is significantly reduced
or eliminated
Time to preliminary results is much less than
manual HDL
Software-to-hardware porting is considerably
easier
Visualization of C hardware is far easier for
scientific community
Disadvantages
Mapper instructions are many times more powerful
than CPU instructions, but FPGA clocks are many
times slower
Mappers can parallelize and pipeline C code,
however they generally cannot automatically
instantiate multiple functional units
Optimized C-mapper code is obtained through
manual parallelization of existing code using
techniques pertinent to the algorithm's structure
Reduced development time can come at cost of
performance

20
Acknowledgements

We thank the following vendors for application
mapping tools, information, and technical
support
Celoxica (Handel C)
Impulse Accelerated Technologies (Impulse C)
Nallatech (DIME-C)
Mitrion (Mitrion C)
We thank the following vendors for providing
tools and/or hardware that made this study
possible
Aldec (Active-HDL and Riviera EDA tools)
Intel (Xeon servers)
Nallatech (FUSE and DIMEtalk tools, RC boards)
Xilinx (ISE, RC boards, FPGAs)

21
References

[1] http://www.srccomp.com.
[2] http://www.mentor.com/products/c-based_design/catapult_c_synthesis/.
[3] K. Morris, "Catapult C: Mentor Announces Architectural Synthesis," fpgajournal.com, June 1, 2004.
[4] Nallatech, Inc., "DIME-C User Guide," Reference Manual, United Kingdom, 2005.
[5] Celoxica, Ltd., "Using Handel-C with DK," Training Manual, United Kingdom, 2005.
[6] D. Pellerin and S. Thibault, Practical FPGA Programming in C, Pearson Education, Inc., Upper Saddle River, NJ, 2005.
[7] Mitrionics AB, Inc., "The Mitrion Processor," Product Overview, Sweden, 2005.
[8] M. Gokhale, J. Stone, and E. Gomersall, "Co-Synthesis to a Hybrid RISC/FPGA Architecture," Journal of VLSI Signal Processing Systems, 24, pp. 165-180, 2000.
[9] J. Hammes and W. Böhm, "The SA-C Language," Reference Manual, Colorado State University, 2001.
[10] J. Hammes, M. Chawathe, and W. Böhm, "The SA-C Compiler," Reference Manual, Colorado State University, 2001.
[11] Colorado State Univ., "Cameron," Poster for ACS PI Meeting, Arlington, VA, March 7, 2002.
[12] I. Troxel, "CARMA: An Infrastructure for Reconfigurable High-Performance Computing," Ph.D. Prospectus, University of Florida, pp. 30-32, 2005.
[13] R. Goering, "Open-source C compiler targets FPGAs," Embedded.com, October 18, 2002.
[14] J. Frigo, M. Gokhale, and D. Lavenier, "Evaluation of Streams-C C-to-FPGA Compiler: An Applications Perspective," Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 11-13, 2001.
[15] http://www.systemc.org.
[16] OSCI, "SystemC 2.0.1 Language Reference Manual," Reference Manual, San Jose, CA, 2003.
[17] D. A. Buell, S. Akella, J. P. Davis, G. Quan, and D. Caliga, "The DARPA boolean equation benchmark on a reconfigurable computer," Proc. Military Applications of Programmable Logic Devices (MAPLD), Washington, DC, September 8-10, 2004.
[18] V. Aggarwal, I. Troxel, and A. George, "Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI," Proc. MAPLD, Washington, DC, September 8-10, 2004.
[19] J. Jussel, "The future of programmable SoC design is C-based," Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, NV, June 27-30, 2005.
