Implementing Algorithms in FPGA-Based Reconfigurable Computers Using C-Based Synthesis


1
Implementing Algorithms in FPGA-Based
Reconfigurable Computers Using C-Based Synthesis
  • Doug Johnson, Technical Marketing Manager
  • NCSA/OSC Reconfigurable Systems Summer Institute
  • Urbana, Illinois, July 11-13 2005

2
Celoxica
  • UK-Based System design company
  • Provider of design tools, IP and services for
    Digital Imaging and Signal Processing
  • Image Processing
  • Video Processing
  • Sonar/ Radar signal processing
  • Biometrics
  • Massively parallel data mining and matching
  • Complete solutions for Electronic System Level
    (ESL) Design
  • System/ algorithm acceleration
  • Co-design partitioning
  • Co-simulation and co-verification (C/C++/SystemC/
    Handel-C/Matlab/VHDL/Verilog)
  • Hardware compilation: C synthesis to
    reconfigurable architectures
  • Consulting and professional services
  • Systems analysis and design strategy
  • System implementation capability

3
Presentation Objectives
  • Prerequisites
  • Motivations for using FPGAs in RC and HPC
  • HPC and RC FPGA systems hardware and
    infrastructure
  • Objectives
  • HPC algorithms and Considerations for
    Reconfigurable Computing (RC)
  • Share a perspective on the State-of-the-Art for
    C-based HW design
  • Describe the C to FPGA Flow
  • Illustrate with code examples
  • Look forward to some critical debate

4
Agenda
  • Reconfigurable Computing
  • Considerations, core algorithm relationships,
    commercial applications
  • C-based design
  • The solution space (its place in EDA)
  • Nature of C for HW design
  • The Design Flow
  • Summary
  • JPEG2000 Design Example

5
Agenda
  • Reconfigurable Computing (RC)
  • Considerations, core algorithm relationships,
    commercial applications
  • C-based design
  • The solution space (its place in EDA)
  • Nature of C for HW design
  • The Design Flow
  • Summary

RC: Using FPGAs for (algorithmic) computation
  1. Embedded: Well-established body of knowledge/experience
  2. Enterprise: Some
  3. HPC: Starting out
6
Reconfigurable Computing
[Timeline figure spanning 1980, 1990, 2000, 20X0? with the labels: FPGAs; First RC Successes; Closely Coupled Systems Partitioning Frameworks; Commercial C-to-FPGA tools]
  • Promised Opportunities
  • Algorithm Acceleration
  • Exploit parallelism to increase performance with
    custom HW implementation
  • Algorithm Offload
  • Free CPU resource by offloading bottleneck
    processes
  • BIG Challenges
  • Development complexity
  • Design framework and methods, deployment and
    integration/middleware
  • Coupling to coprocessor/data bandwidth
  • Price/Performance/Power!
  • Choosing the right applications!

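The acceleration and offload bullets above can be made concrete with Amdahl's law, which bounds the overall gain when only part of a program is moved to custom hardware. The following is a minimal sketch, not taken from the slides; the function and parameter names are illustrative.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `accel_fraction` of the
 * original run time is sped up by a factor `kernel_speedup`.
 * Both parameter names are illustrative, not from the presentation. */
double overall_speedup(double accel_fraction, double kernel_speedup)
{
    assert(accel_fraction >= 0.0 && accel_fraction <= 1.0);
    assert(kernel_speedup > 0.0);
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / kernel_speedup);
}
```

For example, offloading 90% of the run time to a kernel that is 10x faster gives roughly a 5.3x overall speedup, which is one way to see why choosing the right applications matters.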
7
FPGA Computing and Methodology
  • High Performance Embedded and Reconfigurable
    Computing
  • Why FPGA Computing?
  • Moore's Law showing signs of strain
  • Ability to parallelize in HW
  • Price/GOPS coming down rapidly
  • Hard IP blocks: excellent density
  • Example: Floating Point Performance
  • Maximum for Virtex-4: 50 GFLOPS (Courtesy of
    Dave Bennett, Xilinx Labs)
  • Maximum for Virtex-2: 17.5 GFLOPS
  • Can fit 10s of FPUs on 2 Xilinx Virtex-4s
    (Courtesy of Justin Tripp, LANL)
  • Use of hard macros for functions is mandatory
    (example: DSP48 on Virtex-4)
  • C-based design for FPGAs
  • Several offerings on commercial marketplace or in
    research
  • Commercial: Celoxica, Mentor Graphics, Impulse
    Technologies, Mitrion
  • Research: Sandia, UC Riverside, LANL
  • RTL/HDL is the most widely used way to get to
    FPGAs but is not usable by SW engineers

8
Conventional Wisdom for RC
  • 1. Small data objects
  • Data transfer overhead to coprocessor, High
    operation to byte ratio
  • 2. Modest arithmetic
  • Difficult to design and implement complex
    algorithms in HW
  • Integer/fixed precision calculations
  • Floating point too resource expensive
  • 3. Data-parallelism
  • Parallelism essential - FPGA clocks an order of
    magnitude slower than CPUs
  • Fine grain - wide data widths
  • Medium grain - operation/function routine
  • Coarse grain - multiple instantiations of
    application processes
  • 4. Pipeline-ability
  • Streaming Applications most successful
  • 5. Simple Control
  • Difficult to design complex scheduling schemes in
    Parallel HW

Essential
Fewer Issues with Latency in HPC
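The first rule above (small data objects, high operation-to-byte ratio) can be illustrated with a simple break-even model. This is a sketch under stated assumptions: transfers and computation do not overlap, and every rate is a hypothetical model input rather than a figure from the presentation.

```c
#include <assert.h>

/* Time (seconds) to process `bytes` of data on the host CPU. */
double host_time(double bytes, double ops_per_byte, double cpu_ops_per_s)
{
    assert(cpu_ops_per_s > 0.0);
    return bytes * ops_per_byte / cpu_ops_per_s;
}

/* Time (seconds) to process the same data on a coprocessor, including
 * the cost of moving the data across the link (no overlap assumed). */
double offload_time(double bytes, double ops_per_byte,
                    double link_bytes_per_s, double fpga_ops_per_s)
{
    assert(link_bytes_per_s > 0.0 && fpga_ops_per_s > 0.0);
    double transfer = bytes / link_bytes_per_s;
    double compute  = bytes * ops_per_byte / fpga_ops_per_s;
    return transfer + compute;
}

/* Returns 1 when the model predicts that offloading is faster. */
int offload_pays_off(double bytes, double ops_per_byte,
                     double link_bytes_per_s,
                     double cpu_ops_per_s, double fpga_ops_per_s)
{
    return offload_time(bytes, ops_per_byte, link_bytes_per_s, fpga_ops_per_s)
         < host_time(bytes, ops_per_byte, cpu_ops_per_s);
}
```

With a 100 MB/s link, a 1 Gop/s CPU and a 10 Gop/s coprocessor, the model rejects offloading at 1 operation per byte and accepts it at 100 operations per byte, since only then does the compute saving outweigh the transfer cost.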
9
Further Considerations
  • 6. Exploiting Soft programmable HW
  • Configurable Applications
  • Schedule and load HW content prior to HW
    execution
  • Reconfigurable Applications
  • Dynamically change HW content during HW execution

10
Commercial RC Applications
using C-based design
  • Well established in embedded systems
  • Digital Video Technology and Image Processing
  • PROCESSING AT THE SENSOR versus local and/or
    remote processing
  • 3D LCD display development and test
  • Real-time verification of HDTV image processing
    algorithms
  • Robust image matching - product tracking and
    production line control
  • Digital Signal Processing
  • Engine control unit for 3-phase motors
  • Radar and sonar beamforming and spatial filtering
  • Computer aided tomography security system
  • Communications and Networking
  • Internet reconfigurable multimedia terminal, MP3,
    VoIP etc.
  • Ground traffic simulation testbed for broadband
    satellite network communications
  • Satellite based Internet data tracking system
  • Rapid Systems Prototyping

11
Commercial RC Applications
using C-based design
  • Enterprise Computing
  • Content processing solutions
  • XML parsing, virus checking
  • Packet/Pattern Matching/Filtering
  • Compression/decompression
  • Security/Encryption: DES/3-DES, SHA, MD5,
    AES/Rijndael
  • High Performance Computing
  • Image processing
  • CT scan analysis, 3D modeling, Ray Tracing
  • Finite element analysis and simulation
  • Custom Vector Engines
  • Genome calculations
  • Seismic data processing

12
Core Algorithm Relationships in HPC
Rational Drug Design
Nanotechnology
Tomographic Reconstruction
Phylogenetic Trees
Biomolecular Dynamics
Neural Networks
Crystallography
Fracture Mechanics
MRI Imaging
Molecular Modeling
Reservoir Modelling
Diffraction Inversion Problems
Biosphere/Geosphere
Distribution Networks
Chemical Dynamics
Flow in Porous Media
Electrical Grids
Atomic Scattering
Pipeline Flows
Data Assimilation
Signal Processing
Condensed Matter Electronic Structure
Plasma Processing
Chemical Reactors
Electronic Structure
Cloud Physics
Boilers
Combustion
Actinide Chemistry
Radiation
CVD
Quantum Chemistry
Reaction-Diffusion
Fourier Methods
Graph Theoretic
Chemical Reactors
Cosmology
n-body
Transport
Astrophysics
Multiphase Flow
Manufacturing Systems
CFD
Basic Algorithms Numerical Methods
Weather and Climate
Discrete Events
PDE
Air Traffic Control
Structural Mechanics
Military Logistics
Seismic Processing
Population Genetics
Monte Carlo
ODE
Multibody Dynamics
Geophysical Fluids
VLSI Design
Transportation Systems
Aerodynamics
Economics
Raster Graphics
Fields
Orbital Mechanics
Nuclear Structure
Ecosystems
QCD
Pattern Matching
Symbolic Processing
Neutron Transport
Economics Models
Genome Processing
Virtual Reality
Astrophysics
Cryptography
Electromagnetics
Computer Vision
Virtual Prototypes
Intelligent Search
Multimedia Collaboration Tools
Databases
Magnet Design
Computational Steering
Computer Algebra
Scientific Visualization
Data Mining
Automated Deduction
Number Theory
Intelligent Agents
CAD
Source: Rick Stevens, ANL
13
Core Algorithm Relationships in HPC
How do we map out the right Apps?
Source: Rick Stevens, ANL
14
Exploiting FPGA in HPC
  • Hardware
  • Enterprise Quality co-processor system products
    (Cray XD1, SGI RASC)
  • Robust PCI/PCIx/VME-based FPGA card solutions for
    development
  • A software design methodology is essential
  • SW dominated application sector
  • Target developers have a SW background
  • Register Transfer Level (RTL), Hardware
    Description Languages (HDL) are foreign
  • Complete designs can be specified in a C
    environment
  • Porting to HW implementations simplified
  • Platform abstractions through APIs and Libraries
  • Simplified Specification, Development, Deployment

How do we select and benchmark?
15
Agenda
  • Reconfigurable Computing
  • Considerations, core algorithm relationships,
    commercial applications
  • C-based design
  • The solution space (its place in EDA, Electronic
    Design Automation)
  • Nature of C for HW design
  • The Design Flow
  • Summary
  • JPEG2000 Design Example

16
Embedded Hardware (HW) Design
Specification
Function
Algorithm Design
Block Design
Fixed Point extraction
DSP IP
Architecture
Architecture Exploration
Implementation IP Models
TLM Frameworks
Design Analysis
Fast Mixed Simulation
Custom Processors
HW Accelerated Simulation
Interface Synthesis
HLL Synthesis
Implementation
Reconfigurable Prototypes
Implementation IP
Emulation Platforms
RTL Verification
RTL
Physical Design
17
C to FPGA Accelerated System
Function Architecture
Algorithm Design
System Model
Partitioning
Architecture Exploration
APIs/Libraries
Mixed Simulation
Design Analysis
Optimization
C-Based Synthesis
Implementation
EDIF
RTL
OBJ
P&R
Synthesis
FPGA
Processor
18
Challenges for C-based synthesis
  • Concurrency (Parallelism)
  • Compiler-determined (behavioral synthesis)
  • Explicit
  • Timing
  • Constraints
  • Explicit
  • Rules-based
  • Data Types
  • Annotations, additional or C++
  • Communication
  • Additional or C-like

19
Two Approaches to C-based Design
20
Agenda
  • Reconfigurable Computing
  • Considerations, core algorithm relationships,
    commercial applications
  • C-based design
  • The solution space (its place in EDA)
  • Nature of C for HW design
  • The Design Flow
  • Summary
  • JPEG2000 Design Example

21
System Design Refinement
Function
par { processA(); processB();
      processC(); processD(); }
  • System Function
  • Coarse-grain parallelism

C/C++
AL
CP
Handel-C
  • Parallel algorithm design
  • Fine-grain parallelism
  • Bit/cycle true processes
  • Algorithm Testbench

void processD() { unsigned 9 a, b, c;
                  par { a = 1; b = 2; c = 3; } }
CA
Handel-C
Architecture
A
B
void main() interface port_in interface
port_out
  • Add interfaces
  • Signal/cycle accurate test

CA
Handel-C
C
EDIF/RTL
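The Handel-C fragments on this slide rely on two features that plain C lacks: `par` blocks, whose statements execute in the same clock cycle, and arbitrary-width types such as `unsigned 9`. The following C sketch emulates both for illustration only. The `Regs` struct, `MASK9` macro and `par_cycle` function are hypothetical names, and the swap example is not taken from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low 9 bits, mimicking a Handel-C `unsigned 9` variable. */
#define MASK9(x) ((uint16_t)((x) & 0x1FFu))

typedef struct { uint16_t a, b, c; } Regs;

/* One simulated clock cycle of:  par { a = b; b = a; c = c + 1; }
 * Every right-hand side is read from the state at the start of the
 * cycle, then all registers are updated together. */
Regs par_cycle(Regs now)
{
    Regs next;
    assert(now.a <= 0x1FFu && now.b <= 0x1FFu && now.c <= 0x1FFu);
    next.a = MASK9(now.b);
    next.b = MASK9(now.a);
    next.c = MASK9(now.c + 1u);
    return next;
}
```

Because both assignments read the old values, `a` and `b` swap in a single cycle without a temporary, and a 9-bit counter holding 511 wraps to 0. Sequential C needs the explicit two-phase update shown here to reproduce that behaviour.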
22
Systems Integration
Implementation
A
  • Complete system design
  • Interface to pins
  • Multi-Clock domain
  • IP Integration

B
EDIF (Electronic Design Interchange Format)
C
RTL from HDL IP
CLK
B
A
RST
Data
set clock = external CLK;
set reset = external RST;
interface Data()
void main() { par { processA(); processB();
                    processC(); processD(); } }
C
interface processB()
interface processD()
EDIF/RTL
23
Parallel Debug in C environment
Algorithm Design
24
Resource Usage/Speed Estimations
Architecture Exploration
25
FPGA Support
Technology mapping Optimizations
26
Handel-C Template Multiplier
set clock = external "clk";
void main() { while(1) { par { process(); } } }
void process() { unsigned W A, B, C;
                 while(1) { par { Multiply(A, B, C); } } }
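The template above parameterizes the operand width W, but the transcript does not show what `Multiply` does internally. As a rough illustration of a width-parameterized multiplier, here is a shift-and-add sketch in plain C. The function name `multiply_w` and the choice to return only the low `width` bits are assumptions, not the Celoxica implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Shift-and-add multiply of two `width`-bit operands; returns the low
 * `width` bits of the product. `width` stands in for the W parameter. */
uint32_t multiply_w(uint32_t a, uint32_t b, unsigned width)
{
    assert(width >= 1 && width <= 32);
    uint32_t mask = (width == 32) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    uint32_t acc = 0;
    a &= mask;
    b &= mask;
    for (unsigned i = 0; i < width; i++) {  /* one partial product per bit */
        if (b & 1u)
            acc += a;
        a <<= 1;
        b >>= 1;
    }
    return acc & mask;
}
```

In hardware each loop iteration could become one pipeline stage or be fully unrolled, and on devices with hard multiplier blocks a synthesis tool would normally map the operation to those instead.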
27
Agenda
  • Reconfigurable Computing
  • Considerations, core algorithm relationships,
    commercial applications
  • C-based design
  • The solution space (its place in EDA)
  • Nature of C for HW design
  • The Design Flow
  • Summary
  • JPEG2000 Design Example

28
Summary
  • Commercial C-based design is a reality
  • For the HPC and RC communities it offers
  • Fastest route to accelerating SW designs in FPGA
  • Lower barrier to adoption than RTL technologies
  • Greater customization and productivity than block
    based approaches
  • Complete integration with RTL/block based
    approaches for Power users
  • Deterministic and quality results
  • State of the art tools used by embedded systems
    designers
  • RC platforms for rapid prototyping
  • Simple migration, development to deployment with
    full library support

29
Design Example
  • JPEG2000 Image Compression Algorithm

30
Example Design
JPEG 2000 Compressor
  • Five Steps to HW Platform
  • 1. Specification Model
  • Algorithm Profiling
  • 2. Functional System Model
  • System Estimations
  • 3. Architecture and Communication Model
  • Optimization
  • 4. Implementation Model
  • Direct Synthesis C to EDIF
  • 5. HW Platform
  • Board level integration

Original Image
Pre processing
RGB to YUV conversion
DWT
Quantization
Rate Control
Tier-1 Encoder
Coded Image
Tier-2 Encoder
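The pipeline above includes an RGB to YUV conversion stage, but the slides do not say which transform the design used. As a minimal sketch, assuming the reversible component transform (RCT) that JPEG 2000 defines for lossless coding, the stage can be written with integer arithmetic only, which suits FPGA implementation. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Floor division by 4 that is also correct for negative inputs. */
static int floor_div4(int x)
{
    return (x >= 0) ? x / 4 : -((-x + 3) / 4);
}

/* Forward RCT: Y = floor((R + 2G + B) / 4), U = B - G, V = R - G. */
void rct_forward(int r, int g, int b, int *y, int *u, int *v)
{
    assert(y != NULL && u != NULL && v != NULL);
    *y = floor_div4(r + 2 * g + b);
    *u = b - g;
    *v = r - g;
}

/* Inverse RCT: G = Y - floor((U + V) / 4), R = V + G, B = U + G. */
void rct_inverse(int y, int u, int v, int *r, int *g, int *b)
{
    assert(r != NULL && g != NULL && b != NULL);
    *g = y - floor_div4(u + v);
    *r = v + *g;
    *b = u + *g;
}
```

The transform uses only additions, subtractions and a shift-like division, and the inverse recovers the input exactly, so no precision is lost before the DWT stage.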
31
1. Specification Model
Function Architecture
22 .c and .h files, 1468 lines of code
C/C++
AL
Original Image
Pre processing
RGB to YUV conversion
Algorithm Profiling: Memory, Processing Time, Data Flow
DWT
Quantization
Rate Control
Tier-1 Encoder
Coded Image
Tier-2 Encoder
DWT/Tier1 are the compute intensive blocks
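Since the DWT is one of the two compute-intensive blocks, it helps to see what such a kernel looks like in C. The slides do not say which wavelet filter the design used. The sketch below assumes the reversible 5/3 filter, one of the two filters JPEG 2000 specifies, in lifting form on a one-dimensional signal of even length with symmetric boundary extension. The function names are illustrative.

```c
#include <assert.h>

/* Floor division by a positive divisor, correct for negative x. */
static int floor_div(int x, int d)
{
    int q = x / d;
    return (x % d != 0 && x < 0) ? q - 1 : q;
}

/* Forward 5/3 lifting: splits x[0..n-1] into n/2 low-pass and n/2
 * high-pass coefficients. */
void dwt53_forward(const int *x, int n, int *low, int *high)
{
    assert(n >= 2 && n % 2 == 0);
    int half = n / 2;
    for (int i = 0; i < half; i++) {          /* predict step */
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        high[i] = x[2 * i + 1] - floor_div(x[2 * i] + right, 2);
    }
    for (int i = 0; i < half; i++) {          /* update step */
        int prev = (i > 0) ? high[i - 1] : high[0];
        low[i] = x[2 * i] + floor_div(prev + high[i] + 2, 4);
    }
}

/* Inverse 5/3 lifting: undoes the two steps in reverse order. */
void dwt53_inverse(const int *low, const int *high, int n, int *x)
{
    assert(n >= 2 && n % 2 == 0);
    int half = n / 2;
    for (int i = 0; i < half; i++) {          /* undo update */
        int prev = (i > 0) ? high[i - 1] : high[0];
        x[2 * i] = low[i] - floor_div(prev + high[i] + 2, 4);
    }
    for (int i = 0; i < half; i++) {          /* undo predict */
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        x[2 * i + 1] = high[i] + floor_div(x[2 * i] + right, 2);
    }
}
```

Each output depends only on a few neighbouring samples and uses additions and shifts, so the loop iterations are independent within each step. That structure is what makes the block a natural candidate for pipelined or parallel hardware.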
32
2. Functional System Model
Function Architecture
C/C++
AL
Original Image
System Model
Pre processing
Partitioning
RGB to YUV conversion
DWT
quantization
Rate Control
/* C */
void sw_block()
/* Handel-C */
extern "C" sw_block()
void main(void) { while(1) { par { sw_block(); hw_block(); } } }
void hw_block()
Tier-1 Encoder
Coded Image
Tier-2 Encoder
Cycles/speed/area
33
3. Architecture and Communication Model
Function Architecture
C/C++
AL
Original Image
Pre processing
RGB to YUV conversion
DWT
quantization
Rate Control
Tier-1 Encoder
DsmPortH2S
Coded Image
Tier-2 Encoder

Dataflow/Cycles/speed/area
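The `DsmPortH2S` label on this slide presumably marks a hardware-to-software stream port between the partitions. The transcript does not show the actual API, so the following is only a generic sketch of the idea: a bounded FIFO that one side writes and the other reads. All names (`StreamPort`, `port_write`, `port_read`, `PORT_DEPTH`) are hypothetical and unrelated to the Celoxica library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PORT_DEPTH 8u   /* hypothetical buffer depth */

typedef struct {
    uint32_t data[PORT_DEPTH];
    size_t head;    /* index of the next word to read */
    size_t count;   /* number of words currently queued */
} StreamPort;

void port_init(StreamPort *p)
{
    assert(p != NULL);
    p->head = 0;
    p->count = 0;
}

/* Producer side: returns 1 on success, 0 when the port is full. */
int port_write(StreamPort *p, uint32_t word)
{
    assert(p != NULL);
    if (p->count == PORT_DEPTH)
        return 0;
    p->data[(p->head + p->count) % PORT_DEPTH] = word;
    p->count++;
    return 1;
}

/* Consumer side: returns 1 on success, 0 when the port is empty. */
int port_read(StreamPort *p, uint32_t *word)
{
    assert(p != NULL && word != NULL);
    if (p->count == 0)
        return 0;
    *word = p->data[p->head];
    p->head = (p->head + 1) % PORT_DEPTH;
    p->count--;
    return 1;
}
```

Modelling the channel this way lets the dataflow between partitions be checked in a pure C simulation before any interface hardware exists. The full and empty return codes stand in for the back-pressure a real port would apply.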
34
4. Implementation Model
A
B
void main() interface port_in interface
port_out
C
EDIF
Device Family
Implementation
EDIF
RTL
35
Estimations from Synthesis
  • DWT: 6% of VII1000

36
5. Hardware Platform
From P&R Report for VII1000-4
A
B
uP
HW
uP
  • DWT
  • Slices: 758
  • Device utilization: 7%
  • Speed (MHz): 151
  • Lines of code: 395

HW
C
uP
HW
uP
HW
RAM
RAM
Board Level Integration Specific I/O
Implementations Pin Location constraints
Implementation Model Estimations
  • DWT: 6%

Implementation
EDIF
  • Microblaze Xilinx FPGA
  • Nios Altera FPGA
  • Xilinx V2Pro
  • Toshiba MeP FPGA
  • PowerPC PLB FPGA
  • PC FPGA PCI Card
  • etc

P&R
FPGA
37
JPEG2000 DWT Implementation
  • Example taken from a Xilinx Design Challenge
  • Comparison made with HDL approach
  • See Article in Xcell Volume 46:
    http://www.xilinx.com/publications/xcellonline/xcell_46/xc_celoxica46.htm
  • Observations
  • Comparable
  • Using C faster
  • Using C quicker
  • Expert vs Novice

                            HDL       C-Based Design
                                      1st pass   2nd pass   Final
  Slices                    800       646        546        758
  Device utilization (%)    7         6          5          7
  Speed (MHz)               128       110        130        151
  Lines of code             435       386        386        395
  Design time (days)        20        6          7 (6+1)    7 (6+1)
  Simulation time           6 hours   5 mins     5 mins     20 mins

Doesn't include partitioning and spec. development
Lena used as testbench throughout, input bit
width = 12, max 1K image width
38
JPEG2000 MQ coder Implementation
  • Observations
  • HDL Smaller
  • HC Faster
  • HC Quicker
  • Expert vs Novice

                                  Celoxica    Celoxica    HDL
                                  1st Pass    Final
  Slices                          1,347       1,999       620
  Device utilization (%)          12          18          6
  Speed (MHz)                     89.5        115.5       76
  Lines of code                   310         330         800
  Design time (days)              10          12 (10+2)   30
  Simulation time for Lena jpeg   5 mins      5 mins      Hours

Doesn't include partitioning and spec. development
  • Common language base eased porting of the MQ
    coder source to hardware
  • DSM allowed partition, co-verification and data
    to be moved between hardware and software
  • Optimizations included adding parallelism,
    replacing for() loops with while() loops, and
    simplifying loop control
  • Design developed in a unified design environment
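The slides do not explain why replacing for() loops with while() loops helps. One plausible reason, assuming the usual Handel-C timing rule that each assignment takes one clock cycle, is that a for() loop spends a separate cycle on its increment, whereas a while() loop can fold the increment into the body with `par`. The C sketch below models that cycle count. The `Result` struct and function names are hypothetical, and the model ignores everything except assignments.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { long sum; unsigned cycles; } Result;

/* Models: for (i = 0; i < n; i++) sum += data[i];
 * The body and the increment are separate assignments, so each
 * iteration costs two cycles in this model. */
Result sum_for(const int *data, unsigned n)
{
    Result r = {0, 0};
    unsigned i = 0;
    assert(data != NULL || n == 0);
    r.cycles++;                       /* loop initialisation */
    while (i < n) {
        r.sum += data[i]; r.cycles++; /* body assignment */
        i++;              r.cycles++; /* separate increment */
    }
    return r;
}

/* Models: i = 0; while (i < n) par { sum += data[i]; i++; }
 * The body and the increment share one cycle. */
Result sum_while_par(const int *data, unsigned n)
{
    Result r = {0, 0};
    unsigned i = 0;
    assert(data != NULL || n == 0);
    r.cycles++;                       /* loop initialisation */
    while (i < n) {
        r.sum += data[i];
        i++;
        r.cycles++;                   /* both updates in one cycle */
    }
    return r;
}
```

Both forms compute the same sum, but under this model the restructured loop needs about half as many cycles. The saving applies to every iteration of an inner loop.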