Title: Survey of C-based Application Mapping Tools for Reconfigurable Computing
1 Survey of C-based Application Mapping Tools for Reconfigurable Computing
- Brian Holland, Mauricio Vacas, Vikas Aggarwal,
- Ryan DeVille, Ian Troxel, and Alan D. George
- High-performance Computing and Simulation (HCS) Research Lab
- Department of Electrical and Computer Engineering
- University of Florida
2 Outline
- Introduction
- General Survey
- Ten C-based Application Mappers
- Benchmarking Results
- Finite-Impulse Response (FIR)
- N-Queens
- Radix Sort
- Lessons Learned
- Conclusions
- Acknowledgements
- References
3 Motivation for Application Mappers
- HDL programming has shortcomings
  - Limited applicability to application developers
  - More involved development process (vs. software)
  - Requires training beyond application level
- Instead, can we find and exploit an environment that allows a measure of hardware control along with increased productivity?
  - Can we bring RC performance benefits to application developers?
  - Would this be practical/possible in traditional HDL?
- HDL is well below the level of traditional application programming
- Consequently, we need to move to a higher level of abstraction
4 Introduction
- Selecting a Higher Level of Abstraction
  - CAD tools: visually appealing, but tedious for large projects
  - New language: optimal, but requires complete retraining
  - Traditional or object-oriented languages: Which? How?
- Ideally, use pure ANSI C, "the universal language"
  - Requires no additional knowledge or special training
  - Port existing C programs into hardware implementations (HDL)
  - Translation can be handled by a hardware compiler
  - Programmer concentrates on algorithmic functionality
5 Commonalities
- General characteristics of C-based application mappers
  - Companies create proprietary ANSI C-based languages
  - Languages do not have all ANSI C features
  - Extra pragmas are included for corresponding compilers
  - Additional libraries of functions/macros for further extensions
  - Must adhere to specific programming style for maximum optimization
  - Emphasis on both hardware generation and I/O interfaces
6 Spectrum of C-based Application Mappers
7 Carte (SRC Computers) [1]
- C/Fortran FPGA environment
  - Direct mapping of C/Fortran code to configuration level
  - Software emulation and simulation of compiled code for debugging
  - Capable of multiprocessor and multi-FPGA computational definitions
  - Allows explicit data flow control within memory hierarchy
- Targets SRC's MAP processor
  - Produces Unified Executables for HW or SW processor execution
  - Runtime libraries handle required interfacing and management
Catapult C (Mentor Graphics) [2-3]
- Algorithmic synthesis tool for RTL generation
  - RTL from pure untimed C
  - No extensions, pragmas, etc.
- Compiler uses wrappers around algorithmic code
  - External: manages I/O interface
  - Internal: constrains synthesis to optimize for chosen interface
- Explicit architectural constraints and optimization
- Output: RTL netlists in VHDL, Verilog, and SystemC
8 DIME-C (Nallatech) [4]
- FPGA prototyping tool
  - Designs are not cycle-accurate
  - Allows application synthesis for a higher clock speed
- Compilation/Optimization
  - Pipeline/parallelize where possible
  - Included IEEE-754 FP cores
  - Dedicated (integer) multipliers
- Currently in beta, expected release 4Q05
- Output: synthesizable VHDL and DIMEtalk components
Handel C (Celoxica) [5]
- Environment for cycle-accurate application development
  - All operations occur in one deterministic clock cycle
  - Makes it cycle-accurate, but clock frequency is reduced to that of the slowest operation
  - Decisions/loops are penalty-free but can significantly impact timing
- Language has pragmas for explicitly defined parallelism
- Compiler can analyze, optimize, and rewrite code
- Output: VHDL/Verilog, SystemC, or targeted EDIFs
9 Impulse C (Impulse Accelerated Technologies) [6]
- Language/compiler for modeling sequential applications
  - Processes: independent, potentially concurrent, computing blocks
  - Streams: communicate and synchronize processes
- Uses Streams-C methodology
  - However, focuses on compatibility with C development environments
- Compilation
  - Each process implemented as separate state machine
- Output: generic or FPGA-specific VHDL
Mitrion C (Mitrion) [7]
- Softcore processor tactic
  - Processor creates abstraction layer between C code and FPGA
- Compilation
  - C code is mapped to a generic API of possible functions
  - Processor instantiated on FPGA, tailored to specific application
  - Custom instruction bit-widths, specific cache and buffer sizes
- Currently in beta, expected release 4Q05
- Output: a VHDL IP core for target architectures
10 Napa C (National Semiconductor) [8]
- Language/compiler for RISC/FPGA hybrid processor
  - Capitalize on single-cycle interconnect instead of I/O bus
- Datapath Synthesis Technique
  - Hand-optimized pre-placed, pre-routed module generators
  - Compiler generates hardware pipelines from C loops
- Targets NS NAPA1000 hybrid processor
  - Fixed-Instruction Processor (FIP), Adaptive Logic Processor (ALP)
  - ALP also compiles to RTL VHDL, structural VHDL, structural Verilog
SA-C (Colorado State University) [9-12]
- High-level, expression-oriented, machine-independent, single-assignment language
  - Designed to implicitly express data-parallel operations
  - Image and signal processing
- Compiler (UC Irvine, UC Riverside, Colorado State Univ.)
  - Loop optimizations
  - Structural transforms
  - Execution block placement
- Target Platforms
  - UC Irvine MorphoSys; Annapolis WildForce, StarFire, WildFire
11 Streams C (Los Alamos National Laboratory) [12-14]
- Stream-oriented sequential process modeling
  - Essentially, data elements moving through discrete functional blocks
- Compiler
  - Generates multi-threaded processor executables and multiple FPGA bitstreams
  - Allows parallel C program translation into a parallel architecture
- Includes functional-level simulation environment
- Output: synthesizable RTL
SystemC (Open SystemC Initiative (OSCI)) [15-16]
- Open-source extension of C++ for HW/SW modeling
  - Core language: modules and ports for defining structure, and interfaces and channels
- Supports functional modeling
  - Hierarchical decomposition of a system into modules
  - Structural connectivity between modules using ports/exports
  - Scheduling and synchronization of concurrent processes using events
- Event-driven simulator
  - Events are basic dynamic/static process synchronization objects
12 About the Benchmarks
- Three classic algorithms used for benchmarking
  - Finite-Impulse Response (FIR)
    - Simple 51-tap FIR filter for standard DSP applications
    - Compare compiler solutions and analyze their usage metrics
  - N-Queens
    - Classic embarrassingly parallel HPC backtracking search problem
    - Showcases the potential of optimized implementations
  - Radix Sort
    - Sorts using binary bins, minimizing resources
    - Illustrates resource metrics in RAM-intensive applications
- Implementation Details
  - DIME-C, Handel C, Impulse C, VHDL, and ANSI C (for baseline timing)
  - Experiments performed on Nallatech BenNUEY-PCI card with Virtex-II 6000 FPGA
  - Resource utilization based on post-place-and-route data
  - Runtime represents communication time (setup and verification I/O are excluded)
  - Handel C and Impulse C require VHDL wrappers, which can increase resource usage
13 Finite-Impulse Response
- FIR filter containing 51 taps, each 16 bits wide (based on algorithms in [4,6])
- Various application-mapper languages do not have a consistent I/O interface
  - Could not create a consistent streaming channel with requisite blocking in every tool
  - Instead, FIR algorithm operates on values stored in a block RAM
- Obtains speedup through parallel multiplication, efficient memory accesses
  - The 51 coefficients and variables are stored in local variables
  - Additional performance boosts are possible in multi-channel DSP processing
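The block-RAM formulation above can be sketched in plain C. This is only an illustration under stated assumptions, not the benchmarked code: the names `fir_block` and `NUM_TAPS`, the 64-bit accumulator, and the choice to treat samples before the start of the block as zero are all hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_TAPS 51 /* tap count taken from the slide */

/* Filter n 16-bit samples held in a memory block (a stand-in for block RAM).
 * out[i] = sum over k of coeff[k] * in[i - k]; samples before index 0 are
 * treated as zero (an assumption made for this sketch). */
void fir_block(const int16_t coeff[NUM_TAPS], const int16_t *in,
               int64_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int64_t acc = 0; /* wide accumulator so 51 16x16 products cannot overflow */
        /* The multiplies in this loop are independent of one another,
         * which is what lets a mapper perform them in parallel. */
        for (size_t k = 0; k < NUM_TAPS && k <= i; k++) {
            acc += (int64_t)coeff[k] * (int64_t)in[i - k];
        }
        out[i] = acc;
    }
}
```

Feeding a unit impulse through `fir_block` returns the coefficients in order, which is a quick way to check an implementation against the software baseline.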
14 N-Queens
- Represents a purely computational algorithm: virtually no communication overhead
- Algorithm contains several parallelizable code segments, exploitable for speedup
- Implementations are based upon same baseline C code
- Every available technique and compiler optimization is employed to boost performance
- Notes
  - Handel C N-Queens is a benchmark from our MAPLD'04 paper [18] with additional refinements
  - VHDL N-Queens is the culmination of a semester-long endeavor into the algorithm's parallelism
  - DIME-C and Impulse C N-Queens are results of experimentation with beta compilers
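For readers unfamiliar with the problem, a compact software formulation of N-Queens as a backtracking search is sketched below. It is not the baseline C code used in the study; the bitmask representation and the names `place` and `count_n_queens` are assumptions chosen for brevity.

```c
#include <stdint.h>

/* Place queens one row at a time. cols, diag1, and diag2 are bitmasks of the
 * columns attacked in the current row by queens already placed. */
static uint64_t place(uint32_t all, uint32_t cols, uint32_t diag1, uint32_t diag2)
{
    if (cols == all) {
        return 1; /* every column holds a queen: one complete solution */
    }
    uint64_t count = 0;
    uint32_t free_cols = all & ~(cols | diag1 | diag2);
    while (free_cols != 0) {
        uint32_t bit = free_cols & (~free_cols + 1u); /* lowest free column */
        free_cols ^= bit;
        /* Each candidate column starts an independent subtree, so the
         * subtrees can be explored in parallel. */
        count += place(all, cols | bit,
                       ((diag1 | bit) << 1) & all,
                       (diag2 | bit) >> 1);
    }
    return count;
}

/* Count all solutions for an n x n board (1 <= n <= 31 in this sketch). */
uint64_t count_n_queens(unsigned n)
{
    if (n == 0 || n > 31) {
        return 0; /* outside the range this sketch handles */
    }
    uint32_t all = (uint32_t)((1u << n) - 1u);
    return place(all, 0, 0, 0);
}
```

Because the search needs only a few words of state and no external data, nearly all of its runtime is computation, which matches the "virtually no communication overhead" characterization above.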
15 Radix Sort
- Sorts values one bit at a time (saving significant resources vs. sorting one digit at a time)
- Represents a worst-case legacy algorithm, containing no functional-level parallelism
  - Every element in every iteration depends on every previous element in every iteration
  - Ideal for software processor with fast cache, challenging in FPGA hardware
- Speedup comes through efficient RAM usage and compiler optimizations/pipelining
  - Reduce quantity and addressing complexity of RAM accesses whenever possible
- Metrics are based on sorting 600 32-bit integers contained within a block RAM
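A one-bit-per-pass radix sort of the kind described above can be sketched as follows. This is a generic least-significant-bit-first formulation, not the benchmarked source; the name `radix_sort_1bit`, the caller-supplied scratch buffer, and the use of unsigned values are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Sort n unsigned 32-bit values in place, one bit per pass (LSB first).
 * scratch must have room for n elements. */
void radix_sort_1bit(uint32_t *data, uint32_t *scratch, size_t n)
{
    for (unsigned bit = 0; bit < 32; bit++) {
        /* Count the elements that fall into the "0" bin for this bit. */
        size_t zeros = 0;
        for (size_t i = 0; i < n; i++) {
            if (((data[i] >> bit) & 1u) == 0u) {
                zeros++;
            }
        }
        /* Stable split: zeros fill the front of scratch, ones follow. */
        size_t zero_pos = 0;
        size_t one_pos = zeros;
        for (size_t i = 0; i < n; i++) {
            if ((data[i] >> bit) & 1u) {
                scratch[one_pos++] = data[i];
            } else {
                scratch[zero_pos++] = data[i];
            }
        }
        /* Copy back so the next pass reads the partially sorted data. */
        for (size_t i = 0; i < n; i++) {
            data[i] = scratch[i];
        }
    }
}
```

Each pass must finish before the next can start, and every pass touches every element in memory, which is why the slide singles out RAM access patterns rather than parallelism as the source of speedup.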
16 Some Optimization Techniques
- Keep expensive computational operations to a minimum
  - Multiplication, division, modulo, greater/less than, and floating point are slow
- Minimize reliance on arrays
- Watch for combinable statements
- Exploit functional-level parallelism
- Reduce bit-widths to minimal size
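Two of these techniques, avoiding modulo and reducing bit-widths, can be illustrated with a small before/after pair in plain C. The example is hypothetical: `BUF_LEN`, `next_index_mod`, and `next_index_mask` are invented names, and whether a given mapper would simplify the first form on its own is not something the survey states.

```c
#include <stdint.h>

#define BUF_LEN 64u /* hypothetical circular-buffer length; must be a power of two */

/* Typical software style: 32-bit index and a modulo to wrap around. */
uint32_t next_index_mod(uint32_t i)
{
    return (i + 1u) % BUF_LEN;
}

/* Hardware-minded style: a narrower index type, and a bit mask in place of
 * the modulo. The mask trick is valid only because BUF_LEN is a power of two. */
uint8_t next_index_mask(uint8_t i)
{
    return (uint8_t)((i + 1u) & (BUF_LEN - 1u));
}
```

Both functions return the same value for every index from 0 to 63; the second simply states the narrow width and the cheap operation explicitly instead of relying on the tool to discover them.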
17 Case Study: Dot Product
Color key: green = computation, blue = communication, orange = pragmas
DIME-C
    void Kernel(int a[50], int b[50], int *answer)
    {
        int i, temp = 0;
        for (i = 0; i < 50; i++)
            temp += a[i] * b[i];
        *answer = temp;
    }

    void dot_product(int a1[50], int b1[50], int a2[50], int b2[50], int *answer)
    {
        int answer1, answer2;
        #pragma genusc instance Kernel1
        Kernel(a1, b1, &answer1);
        #pragma genusc instance Kernel2
        Kernel(a2, b2, &answer2);
        *answer = answer1 + answer2;
    }
IMPULSE C
    void Kernel1(co_stream a1, co_stream b1, co_stream z1)
    {
        int i, a[50], b[50], answer = 0;
        co_stream_open(a1, O_RDONLY, INT_TYPE(32)); /* etc. */
        for (i = 0; i < 50; i++) {
            co_stream_read(a1, &a[i], sizeof(int32));
            co_stream_read(b1, &b[i], sizeof(int32));
        }
        for (i = 0; i < 50; i++) {
            #pragma CO UNROLL
            answer += a[i] * b[i];
        }
        co_stream_write(z1, &answer, sizeof(int32));
        co_stream_close(a1); /* etc. */
    }

    void Kernel2(co_stream a2, co_stream b2, co_stream z2)
    { /* SAME AS IN Kernel1 */ }

    void dot_product(co_stream z1, co_stream z2, co_stream ans)
    {
        int i, answer1, answer2, answer;
        co_stream_open(z1, O_RDONLY, INT_TYPE(32)); /* etc. */
        co_stream_read(z1, &answer1, sizeof(int32));
        co_stream_read(z2, &answer2, sizeof(int32));
        answer = answer1 + answer2;
        co_stream_write(ans, &answer, sizeof(int32));
        co_stream_close(z1); /* etc. */
    }
HANDEL C
    int 32 Kernel1(int 32 a[50], int 32 b[50])
    {
        static int 32 i, temp[50], answer;
        par (i = 0; i < 50; i++) {
            temp[i] = a[i] * b[i];
        }
        for (i = 0; i < 50; i++)
            answer += temp[i];
        return answer;
    }

    int 32 Kernel2(int 32 a[50], int 32 b[50])
    { /* SAME AS IN Kernel1 */ }

    void main() // dot_product
    {
        int 32 a1[50]; int 32 b1[50];
        int 32 a2[50]; int 32 b2[50];
        int 32 ans1, ans2;
        int 32 answer;
        interface bus_out() OutputResult(answer);
        par {
            ans1 = Kernel1(a1, b1);
            ans2 = Kernel2(a2, b2);
        }
        answer = ans1 + ans2;
    }
Not all implementations are perfectly optimized.
Your mileage will vary.
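For comparison with the three mapper versions, the same computation in plain ANSI-style C might look like the sketch below. It is not the baseline used for timing in the study; `VEC_LEN`, `kernel`, and this `dot_product` signature are assumptions that mirror the case-study names.

```c
#include <stddef.h>

#define VEC_LEN 50 /* vector length used in the case study */

/* One 50-element dot product, the software counterpart of a single kernel. */
int kernel(const int a[VEC_LEN], const int b[VEC_LEN])
{
    int temp = 0;
    for (size_t i = 0; i < VEC_LEN; i++) {
        temp += a[i] * b[i];
    }
    return temp;
}

/* Sum of two independent dot products. On a CPU the two calls run one after
 * the other; the mapper versions instantiate two kernels to run them side by side. */
int dot_product(const int a1[VEC_LEN], const int b1[VEC_LEN],
                const int a2[VEC_LEN], const int b2[VEC_LEN])
{
    return kernel(a1, b1) + kernel(a2, b2);
}
```

The software version needs no pragmas, streams, or explicit `par` blocks, which makes the extra annotations in each mapper version easy to identify by comparison.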
18 Lessons Learned
- Tools are not near point of automatic translation
  - Programs still require some tweaking for hardware compilation [17]
  - Optimized software C ≠ optimized hardware C
  - However, generating VHDL is significantly easier
  - Learning basics of a C-based mapper is straightforward
- At least two major challenges remain
  - Input/output interfaces become a limiting factor
    - Moving generic VHDL to unsupported platforms requires VHDL knowledge
    - However, once a generic I/O wrapper is generated, it should be reusable
  - True hardware debugging remains a challenge
    - Another level of abstraction means another layer for mistranslation
    - With no knowledge of internal VHDL signals, tracing becomes difficult
19 Conclusions
- Advantages of C-based application mappers
  - Far broader audience of potential RC users with high-level languages
  - Required HDL knowledge is significantly reduced or eliminated
  - Time to preliminary results is much less than manual HDL
  - Software-to-hardware porting is considerably easier
  - Visualization of C hardware is far easier for scientific community
- Disadvantages
  - Mapper instructions are many times more powerful than CPU instructions, but FPGA clocks are many times slower
  - Mappers can parallelize and pipeline C code; however, they generally cannot automatically instantiate multiple functional units
  - Optimized C-mapper code is obtained through manual parallelization of existing code using techniques pertinent to the algorithm's structure
  - Reduced development time can come at cost of performance
20 Acknowledgements
- We thank the following vendors for application mapping tools, information, and technical support
  - Celoxica (Handel C)
  - Impulse Accelerated Technologies (Impulse C)
  - Nallatech (DIME-C)
  - Mitrion (Mitrion C)
- We thank the following vendors for providing tools and/or hardware that made this study possible
  - Aldec (Active-HDL and Riviera EDA tools)
  - Intel (Xeon servers)
  - Nallatech (FUSE and DIMEtalk tools, RC boards)
  - Xilinx (ISE, RC boards, FPGAs)
21 References
- [1] http://www.srccomp.com
- [2] http://www.mentor.com/products/c-based_design/catapult_c_synthesis/
- [3] K. Morris, "Catapult C: Mentor Announces Architectural Synthesis," fpgajournal.com, June 1, 2004.
- [4] Nallatech, Inc., "DIME-C User Guide," Reference Manual, United Kingdom, 2005.
- [5] Celoxica, Ltd., "Using Handel-C with DK," Training Manual, United Kingdom, 2005.
- [6] D. Pellerin and S. Thibault, "Practical FPGA Programming in C," Pearson Education, Inc., Upper Saddle River, NJ, 2005.
- [7] Mitrionics AB, Inc., "The Mitrion Processor," Product Overview, Sweden, 2005.
- [8] M. Gokhale, J. Stone, and E. Gomersall, "Co-Synthesis to a Hybrid RISC/FPGA Architecture," Journal of VLSI Signal Processing Systems, 24, pp. 165-180, 2000.
- [9] J. Hammes and W. Böhm, "The SA-C Language," Reference Manual, Colorado State University, 2001.
- [10] J. Hammes, M. Chawathe, and W. Böhm, "The SA-C Compiler," Reference Manual, Colorado State University, 2001.
- [11] Colorado State Univ., Cameron Poster for ACS PI Meeting, Arlington, VA, March 7, 2002.
- [12] I. Troxel, "CARMA: An Infrastructure for Reconfigurable High-Performance Computing," Ph.D. Prospectus, University of Florida, pp. 30-32, 2005.
- [13] R. Goering, "Open-source C compiler targets FPGAs," Embedded.com, October 18, 2002.
- [14] J. Frigo, M. Gokhale, and D. Lavenier, "Evaluation of Streams-C C-to-FPGA Compiler: An Applications Perspective," Proc. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA), Monterey, CA, February 11-13, 2001.
- [15] http://www.systemc.org
- [16] OSCI, "SystemC 2.0.1 Language Reference Manual," Reference Manual, San Jose, CA, 2003.
- [17] D. A. Buell, S. Akella, J. P. Davis, G. Quan, and D. Caliga, "The DARPA boolean equation benchmark on a reconfigurable computer," Proc. Military Applications of Programmable Logic Devices (MAPLD), Washington, DC, September 8-10, 2004.
- [18] V. Aggarwal, I. Troxel, and A. George, "Design and Analysis of Parallel N-Queens on Reconfigurable Hardware with Handel-C and MPI," Proc. MAPLD, Washington, DC, September 8-10, 2004.
- [19] J. Jussel, "The future of programmable SoC design is C-based," Proc. Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, NV, June 27-30, 2005.