Survey of C-based Application Mapping Tools for Reconfigurable Computing

Provided by: klabsOrgma
Learn more at: http://klabs.org

1
Survey of C-based Application Mapping Tools for
Reconfigurable Computing
  • Brian Holland, Mauricio Vacas, Vikas Aggarwal,
  • Ryan DeVille, Ian Troxel, and Alan D. George
  • High-performance Computing and Simulation (HCS)
    Research Lab
  • Department of Electrical and Computer Engineering
  • University of Florida

2
Outline
  • Introduction
  • General Survey
  • Ten C-based Application Mappers
  • Benchmarking Results
  • Finite-Impulse Response (FIR)
  • N-Queens
  • Radix Sort
  • Lessons Learned
  • Conclusions
  • Acknowledgements
  • References

3
Motivation for Application Mappers
  • Motivation for Application Mappers
  • HDL programming has shortcomings
  • Limited applicability to application developers
  • More involved development process (vs. software)
  • Requires training beyond application level
  • Instead, can we find and exploit an environment
    that allows a measure of hardware control along
    with increased productivity?
  • Can we bring RC performance benefits to
    application developers?
  • Would this be practical/possible in traditional
    HDL?
  • HDL is well below the level of traditional
    application programming
  • Consequently, we need to move to a higher level
    of abstraction

4
Introduction
  • Selecting a Higher Level of Abstraction
  • CAD tools: Visually appealing, but tedious for
    large projects
  • New language: Optimal, but requires complete
    retraining
  • Traditional or object-oriented languages: Which?
    How?
  • Ideally, use pure ANSI-C, "the universal
    language"
  • Requires no additional knowledge or special
    training
  • Port existing C programs into hardware
    implementations (HDL)
  • Translation can be handled by a hardware compiler
  • Programmer concentrates on algorithmic
    functionality

5
Commonalities
  • General characteristics of C-based application
    mappers
  • Companies create proprietary ANSI C-based
    language
  • Languages do not have all ANSI C features
  • Extra pragmas are included for corresponding
    compilers
  • Additional libraries of functions/macros for
    further extensions
  • Must adhere to specific programming style for
    maximum optimization
  • Emphasis on both hardware generation and I/O
    interfaces

6
Spectrum of C-based Application Mappers

7
Carte (SRC Computers) [1]
  • C/Fortran FPGA environment
  • Direct mapping of C/Fortran code to configuration
    level
  • Software emulation and simulation of compiled
    code for debugging
  • Capable of multiprocessor and multi-FPGA
    computational definitions
  • Allows explicit data flow control within memory
    hierarchy
  • Targets SRC's MAP processor
  • Produces Unified Executables for HW or SW
    processor execution
  • Runtime libraries handle required interfacing and
    management

Catapult C (Mentor Graphics) [2-3]
  • Algorithmic synthesis tool for RTL generation
  • RTL from pure untimed C
  • No extensions, pragmas, etc.
  • Compiler uses wrappers around algorithmic code
  • External: manages I/O interface
  • Internal: constrains synthesis to optimize for
    chosen interface
  • Explicit architectural constraints and
    optimization
  • Output: RTL netlists in VHDL, Verilog, and SystemC

8
DIME-C (Nallatech) [4]
  • FPGA prototyping tool
  • Designs are not cycle-accurate
  • Allows application synthesis for a higher clock
    speed
  • Compilation/Optimization
  • Pipeline/parallelize where possible
  • Included IEEE-754 FP cores
  • Dedicated (integer) multipliers
  • Currently in beta, expected release 4Q05
  • Output: synthesizable VHDL and DIMEtalk components

Handel C (Celoxica) [5]
  • Environment for cycle-accurate application
    development
  • All operations occur in one deterministic clock
    cycle
  • Makes it cycle-accurate, but clock frequency is
    reduced to that of the slowest operation
  • Decisions/Loops are penalty-free but can
    significantly impact timing
  • Language has pragmas for explicitly defined
    parallelism
  • Compiler can analyze, optimize, and rewrite code
  • Output: VHDL/Verilog, SystemC, or targeted EDIFs

9
Impulse C (Impulse Accelerated Technologies) [6]
  • Language/compiler for modeling sequential apps.
  • Processes: independent, potentially concurrent,
    computing blocks
  • Streams: communicate and synchronize processes
  • Uses Streams-C methodology
  • However, focuses on compatibility with C
    development environments
  • Compilation
  • Each process implemented as separate state
    machine
  • Output: generic or FPGA-specific VHDL

Mitrion C (Mitrion) [7]
  • Softcore processor tactic
  • Processor creates abstraction layer between C
    code and FPGA
  • Compilation
  • C code is mapped to a generic API of possible
    functions
  • Processor instantiated on FPGA, tailored to
    specific application
  • Custom instruction bit-widths, specific cache and
    buffer sizes
  • Currently in beta, expected release 4Q05
  • Output: a VHDL IP core for target architectures

10
Napa C (National Semiconductor) [8]
  • Language/compiler for RISC/FPGA hybrid processor
  • Capitalize on single-cycle interconnect instead
    of I/O bus
  • Datapath Synthesis Technique
  • Hand-optimized pre-placed, pre-routed module
    generators
  • Compiler generates hardware pipelines from C
    loops
  • Targets NS NAPA1000 hybrid processor
  • Fixed-Instruction Processor (FIP), Adaptive Logic
    Processor (ALP)
  • ALP also compiles to RTL VHDL, structural VHDL,
    structural Verilog

SA-C (Colorado State University) [9-12]
  • High-level, expression-oriented,
    machine-independent, single-assignment language
  • Designed to implicitly express data-parallel
    operations
  • Image and signal processing
  • Compiler (UC-Irvine, UC-Riverside, Colorado State
    Univ.)
  • Loop optimizations
  • Structural transforms
  • Execution block placement
  • Target Platforms
  • UC Irvine MorphoSys; Annapolis WildForce,
    StarFire, WildFire

11
Streams-C (Los Alamos National Laboratory) [12-14]
  • Stream-oriented sequential process modeling
  • Essentially, data elements moving through
    discrete functional blocks
  • Compiler
  • Generates multi-threaded processor executables
    and multiple FPGA bitstreams
  • Allows parallel C program translation into a
    parallel arch.
  • Includes functional-level simulation environment
  • Output: synthesizable RTL

SystemC (Open SystemC Initiative, OSCI) [15-16]
  • Open-source extension of C for HW/SW modeling
  • Core language, modules and ports for defining
    structure, and interfaces and channels
  • Supports functional modeling
  • Hierarchical decomposition of a system into
    modules
  • Structural connectivity between modules using
    ports/exports
  • Scheduling and synchronization of concurrent
    processes using events
  • Event-driven simulator
  • Events are basic dynamic/static process
    synchronization objects

12
About the Benchmarks
  • Three classic algorithms used for benchmarking
  • Finite-Impulse Response (FIR)
  • Simple 51-tap FIR filter for standard DSP
    applications
  • Compare compiler solutions and analyze their
    usage metrics
  • N-Queens
  • Classic embarrassingly parallel HPC backtracking
    search problem
  • Showcases the potential of optimized
    implementations
  • Radix Sort
  • Sorts using binary bins, minimizing resources
  • Illustrates resource metrics in RAM-intensive
    applications
  • Implementation Details
  • DIME-C, Handel C, Impulse C, VHDL, and ANSI-C
    (for baseline timing)
  • Experiments performed on Nallatech BenNUEY-PCI
    card with VirtexII-6000 FPGA
  • Resource utilization based on post
    place-and-route data
  • Runtime represents communication time (setup and
    verification I/O is negated)
  • Handel C and Impulse C require VHDL wrappers
    which can increase resource usage

13
Finite-Impulse Response
  • FIR filter containing 51 taps, each 16 bits wide
    (based on algorithms in [4,6])
  • Various application-mapper languages do not have
    a consistent I/O interface
  • Could not create a consistent streaming channel
    with requisite blocking in every tool
  • Instead, FIR algorithm operates on values stored
    in a block RAM
  • Obtains speedup through parallel multiplication,
    efficient memory accesses
  • The 51 coefficients and variables are stored in
    local variables
  • Additional performance boosts are possible in
    multi-channel DSP processing
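
The filter described above can be sketched in plain ANSI C as a software baseline. This is only an illustration, not the code used in the study: the function name fir_filter, the 64-bit accumulator, and the zero-padding of samples before the start of the buffer are all assumptions. The input and output arrays stand in for the block RAM mentioned on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TAPS 51  /* tap count taken from the slide */

/* Hypothetical baseline: direct-form FIR over a buffer that stands in
 * for block RAM. out[n] = sum over k of coeff[k] * in[n - k], where
 * samples before in[0] are treated as zero (an assumption). */
void fir_filter(const int16_t coeff[NUM_TAPS], const int16_t *in,
                int64_t *out, size_t n_samples)
{
    assert(coeff != NULL && in != NULL && out != NULL);

    for (size_t n = 0; n < n_samples; n++) {
        /* 51 products of two 16-bit values can exceed 32 bits,
         * so accumulate in 64 bits. */
        int64_t acc = 0;
        for (size_t k = 0; k < NUM_TAPS; k++) {
            if (n >= k)
                acc += (int64_t)coeff[k] * (int64_t)in[n - k];
        }
        out[n] = acc;
    }
}
```

The inner loop is where a mapper would look for parallel multiplications: the 51 products for one output sample are independent of each other, and only the final summation ties them together.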

14
N-Queens
  • Represents a purely computational algorithm:
    virtually no communication overhead
  • Algorithm contains several parallelizable code
    segments, exploitable for speedup
  • Implementations are based upon same baseline C
    code
  • Every available technique and compiler
    optimization is employed to boost performance
  • Notes
  • Handel C N-Queens is a benchmark from our
    MAPLD'04 paper [18] with additional refinements
  • VHDL N-Queens is the culmination of a
    semester-long endeavor into the algorithm's
    parallelism
  • DIME-C and Impulse C N-Queens are results of
    experimentation with beta compilers
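
For readers unfamiliar with the problem, a compact backtracking solution counter in ANSI C might look like the sketch below. It is not the baseline code from the study; the bitmask formulation, the function names place and n_queens, and the limit of 16 queens are assumptions chosen for brevity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: count completions of a partial placement.
 * 'all' has one bit per column; 'cols', 'diag_l' and 'diag_r' mark
 * the columns attacked in the current row. */
static uint64_t place(uint32_t all, uint32_t cols,
                      uint32_t diag_l, uint32_t diag_r)
{
    if (cols == all)
        return 1;  /* every column holds a queen: one solution */

    uint64_t count = 0;
    uint32_t free_sq = all & ~(cols | diag_l | diag_r);
    while (free_sq) {
        uint32_t bit = free_sq & (~free_sq + 1u);  /* lowest free column */
        free_sq ^= bit;
        /* Diagonal attacks shift by one column per row. */
        count += place(all, cols | bit,
                       (diag_l | bit) << 1, (diag_r | bit) >> 1);
    }
    return count;
}

/* Count all solutions for an n x n board (1 <= n <= 16 assumed). */
uint64_t n_queens(int n)
{
    assert(n >= 1 && n <= 16);
    uint32_t all = (1u << n) - 1u;
    return place(all, 0, 0, 0);
}
```

Each choice of column for the first row starts an independent subtree of the search, which is one natural place to look for the parallelizable segments the slide refers to.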

15
Radix Sort
  • Sorts values one bit at a time (saving
    significant resources vs. sorting one digit at a
    time)
  • Represents a worst-case legacy algorithm,
    containing no functional-level parallelism
  • Every element in every iteration depends on every
    previous element in every iteration
  • Ideal for software processor with fast cache,
    challenging in FPGA hardware
  • Speedup comes through efficient RAM usage and
    compiler optimizations/pipelining
  • Reduce quantity and addressing complexity of RAM
    accesses whenever possible
  • Metrics are based on sorting 600 32-bit integers
    contained within a block RAM
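
A bit-at-a-time radix sort with two bins can be sketched in ANSI C as follows. This is an illustrative baseline only: the name radix_sort_binary, the use of unsigned keys, and the fixed scratch buffer of 600 entries (matching the data set size on the slide) are assumptions, not details of the benchmarked code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_KEYS 600  /* data set size quoted on the slide */

/* Hypothetical sketch: least-significant-bit-first radix sort using
 * two bins (bit = 0, bit = 1) per pass, 32 passes in total.
 * Unsigned keys are assumed. */
void radix_sort_binary(uint32_t *keys, size_t n)
{
    assert(keys != NULL && n <= MAX_KEYS);
    uint32_t scratch[MAX_KEYS];

    for (unsigned bit = 0; bit < 32; bit++) {
        /* Count the keys that fall in the "0" bin. */
        size_t zeros = 0;
        for (size_t i = 0; i < n; i++)
            if (((keys[i] >> bit) & 1u) == 0)
                zeros++;

        /* Stable scatter: "0" keys first, then "1" keys. */
        size_t z = 0, o = zeros;
        for (size_t i = 0; i < n; i++) {
            if ((keys[i] >> bit) & 1u)
                scratch[o++] = keys[i];
            else
                scratch[z++] = keys[i];
        }
        memcpy(keys, scratch, n * sizeof keys[0]);
    }
}
```

Every pass reads and writes the whole array, and each pass depends on the ordering produced by the previous one. That dependence is why the slide treats RAM access patterns, rather than arithmetic, as the place where speedup has to come from.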

16
Some Optimization Techniques
  • Keep expensive computational operations to a
    minimum
  • Multiplication, division, modulo, greater/less
    than, and floating point are slow
  • Minimize reliance on arrays
  • Watch for combinable statements
  • Exploit functional level parallelism
  • Reduce bit-widths to minimal size
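
Two of the techniques above, avoiding expensive operators and reducing bit-widths, can be illustrated in plain C. The sketch below is hypothetical: the buffer length, the function names, and the constant 10 are made up for illustration, and whether a particular mapper already performs these rewrites automatically is not something the slides state.

```c
#include <assert.h>
#include <stdint.h>

#define BUF_LEN 64u  /* assumed power of two, so modulo becomes a mask */

/* Straightforward form: a general modulo may imply a divider. */
uint8_t next_index_modulo(uint8_t i)
{
    return (uint8_t)((i + 1u) % BUF_LEN);
}

/* Cheaper form: the same wrap-around using a single AND. */
uint8_t next_index_mask(uint8_t i)
{
    return (uint8_t)((i + 1u) & (BUF_LEN - 1u));
}

/* Multiply by the constant 10 using two shifts and one add.
 * The 8-bit input and 16-bit result are the narrowest widths
 * that hold every possible value (255 * 10 = 2550). */
uint16_t times_ten(uint8_t x)
{
    return (uint16_t)(((uint16_t)x << 3) + ((uint16_t)x << 1));
}
```

Both index functions return the same value for every input; the point is only that the second form maps to less logic when the compiler does not make the substitution itself.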

17
Case Study: Dot Product
Legend: Green = Computation, Blue = Communication,
Orange = Pragmas

DIME-C

    void Kernel(int a[50], int b[50], int *answer)
    {
        int i, temp = 0;
        for (i = 0; i < 50; i++)
            temp += a[i] * b[i];
        *answer = temp;
    }

    void dot_product(int a1[50], int b1[50], int a2[50],
                     int b2[50], int *answer)
    {
        int answer1, answer2;
        #pragma genusc instance Kernel1
        Kernel(a1, b1, &answer1);
        #pragma genusc instance Kernel2
        Kernel(a2, b2, &answer2);
        *answer = answer1 + answer2;
    }

IMPULSE C

    void Kernel1(co_stream a1, co_stream b1, co_stream z1)
    {
        int i, a[50], b[50], answer = 0;
        co_stream_open(a1, O_RDONLY, INT_TYPE(32));  /* etc. */
        /* Communication: read both vectors from the streams */
        for (i = 0; i < 50; i++) {
            co_stream_read(a1, &a[i], sizeof(int32));
            co_stream_read(b1, &b[i], sizeof(int32));
        }
        /* Computation: unrolled multiply-accumulate */
        for (i = 0; i < 50; i++) {
            #pragma CO UNROLL
            answer += a[i] * b[i];
        }
        co_stream_write(z1, &answer, sizeof(int32));
        co_stream_close(a1);  /* etc. */
    }

    void Kernel2(co_stream a2, co_stream b2, co_stream z2)
    {
        /* SAME AS IN Kernel1 */
    }

    void dot_product(co_stream z1, co_stream z2, co_stream ans)
    {
        int i, answer1, answer2, answer;
        co_stream_open(z1, O_RDONLY, INT_TYPE(32));  /* etc. */
        co_stream_read(z1, &answer1, sizeof(int32));
        co_stream_read(z2, &answer2, sizeof(int32));
        answer = answer1 + answer2;
        co_stream_write(ans, &answer, sizeof(int32));
        co_stream_close(z1);  /* etc. */
    }

HANDEL C

    int 32 Kernel1(int 32 a[50], int 32 b[50])
    {
        static int 32 i, temp[50], answer;
        /* All 50 multiplications run in parallel */
        par (i = 0; i < 50; i++)
            temp[i] = a[i] * b[i];
        for (i = 0; i < 50; i++)
            answer += temp[i];
        return answer;
    }

    int 32 Kernel2(int 32 a[50], int 32 b[50])
    {
        /* SAME AS IN Kernel1 */
    }

    void main()  // dot_product
    {
        int 32 a1[50]; int 32 b1[50];
        int 32 a2[50]; int 32 b2[50];
        int 32 ans1, ans2;
        int 32 answer;
        interface bus_out() OutputResult(answer);
        /* Both kernels run in parallel */
        par {
            ans1 = Kernel1(a1, b1);
            ans2 = Kernel2(a2, b2);
        }
        answer = ans1 + ans2;
    }

Not all implementations are perfectly optimized.
Your mileage will vary.
18
Lessons Learned
  • Tools are not near point of automatic translation
  • Programs still require some tweaking for hardware
    compilation [17]
  • Optimized Software C ≠ Optimized Hardware C
  • However, generating VHDL is significantly easier
  • Learning basics of a C-based mapper is
    straightforward
  • At least two major challenges remain
  • Input/output interfaces become a limiting factor
  • Moving generic VHDL to unsupported platforms
    requires VHDL knowledge
  • However, once a generic I/O wrapper is generated,
    it should be reusable
  • True hardware debugging remains a challenge
  • Another level of abstraction means another layer
    for mistranslation
  • With no knowledge of internal VHDL signals,
    tracing becomes difficult

19
Conclusions
  • Advantages of C-based application mappers
  • Far broader audience of potential RC users with
    high-level languages
  • Required HDL knowledge is significantly reduced
    or eliminated
  • Time to preliminary results is much less than
    manual HDL
  • Software-to-hardware porting is considerably
    easier
  • Visualization of C hardware is far easier for
    scientific community
  • Disadvantages
  • Mapper instructions are many times more powerful
    than CPU instructions, but FPGA clocks are many
    times slower
  • Mappers can parallelize and pipeline C code,
    however they generally cannot automatically
    instantiate multiple functional units
  • Optimized C-mapper code is obtained through
    manual parallelization of existing code using
    techniques pertinent to the algorithm's structure
  • Reduced development time can come at cost of
    performance

20
Acknowledgements
  • We thank the following vendors for application
    mapping tools, information, and technical
    support
  • Celoxica (Handel C)
  • Impulse Accelerated Technologies (Impulse C)
  • Nallatech (DIME-C)
  • Mitrion (Mitrion C)
  • We thank the following vendors for providing
    tools and/or hardware that made this study
    possible
  • Aldec (Active-HDL, Riviera EDA tools)
  • Intel (Xeon servers)
  • Nallatech (FUSE, DIMEtalk tools, RC boards)
  • Xilinx (ISE, RC boards, FPGAs)

21
References
  • [1] http://www.srccomp.com
  • [2] http://www.mentor.com/products/c-based_design/catapult_c_synthesis/
  • [3] K. Morris, "Catapult C: Mentor Announces
    Architectural Synthesis," fpgajournal.com, June
    1, 2004.
  • [4] Nallatech, Inc., "DIME-C User Guide,"
    Reference Manual, United Kingdom, 2005.
  • [5] Celoxica, Ltd., "Using Handel-C with DK,"
    Training Manual, United Kingdom, 2005.
  • [6] D. Pellerin and S. Thibault, Practical FPGA
    Programming in C, Pearson Education, Inc., Upper
    Saddle River, NJ, 2005.
  • [7] Mitrionics AB, Inc., "The Mitrion Processor,"
    Product Overview, Sweden, 2005.
  • [8] M. Gokhale, J. Stone and E. Gomersall,
    "Co-Synthesis to a Hybrid RISC/FPGA
    Architecture," Journal of VLSI Signal Processing
    Systems, 24, pp. 165-180, 2000.
  • [9] J. Hammes and W. Böhm, "The SA-C Language,"
    Reference Manual, Colorado State University,
    2001.
  • [10] J. Hammes, M. Chawathe and W. Böhm, "The
    SA-C Compiler," Reference Manual, Colorado State
    University, 2001.
  • [11] Colorado State Univ., "Cameron Poster for ACS
    PI Meeting," Arlington, VA, March 7, 2002.
  • [12] I. Troxel, "CARMA: An Infrastructure for
    Reconfigurable High-Performance Computing," Ph.D.
    Prospectus, University of Florida, pp. 30-32,
    2005.
  • [13] R. Goering, "Open-source C compiler targets
    FPGAs," Embedded.com, October 18, 2002.
  • [14] J. Frigo, M. Gokhale and D. Lavenier,
    "Evaluation of Streams-C C-to-FPGA Compiler: An
    Applications Perspective," Proc. ACM/SIGDA
    International Symposium on Field-Programmable
    Gate Arrays (FPGA), Monterey, CA, February 11-13,
    2001.
  • [15] http://www.systemc.org
  • [16] OSCI, "SystemC 2.0.1 Language Reference
    Manual," Reference Manual, San Jose, CA, 2003.
  • [17] D. A. Buell, S. Akella, J. P. Davis, G.
    Quan, and D. Caliga, "The DARPA boolean equation
    benchmark on a reconfigurable computer," Proc.
    Military Applications of Programmable Logic
    Devices (MAPLD), Washington, DC, September 8-10,
    2004.
  • [18] V. Aggarwal, I. Troxel, and A. George,
    "Design and Analysis of Parallel N-Queens on
    Reconfigurable Hardware with Handel-C and MPI,"
    Proc. MAPLD, Washington, DC, September 8-10,
    2004.
  • [19] J. Jussel, "The future of programmable SoC
    design is C-based," Proc. Engineering of
    Reconfigurable Systems and Algorithms (ERSA), Las
    Vegas, NV, June 27-30, 2005.