Reconfigurable Computing (RC), Parallel Computer
Architecture, High-Level Programming Paradigms
  • Brian Holland
  • EEL 6763 Guest Lecture
  • April 4, 2007

The 1000km View
  • Introduction
  • Reconfigurable Computing Overview
  • Enabling Technology
  • Overview
  • FPGA structure (Xilinx Virtex-II Pro)
  • Xilinx technology progression
  • RC Design Space
  • Custom RC Systems
  • COTS RC Systems
  • Embedded, clusters and large-scale
  • Parallelization Considerations
  • Brian's RC Research Emphasis
  • RC Amenability and Application Design Migration
  • High-Level Programming Paradigms for FPGAs
  • Conclusions and Future Work

Courtesy of Ian Troxel
Blurring the software/hardware demarcation
  • An otherwise fixed GPP augmented with
    computation-specific substructures
  • -G. Estrin, UCLA (1960)
  • Technology Overview
  • Goal: ASIC speed with GPP flexibility
  • Bit-level manipulation with less overhead
  • Large bit size parallel computation
  • Custom designs developed since 1992
  • Embedded systems and recent trend toward COTS
  • Many industry, government, academic projects

Figure labels (c/o Hauck, U. Wash.): Application-Specific Integrated
Circuit; General Purpose Processor
Courtesy of Ian Troxel
RC Visibility
  • RC Conferences and workshops
  • Universities
  • Florida, GWU, MIT, Berkeley, UIUC, UC-Davis, BYU,
    U. Washington, South Carolina, Tennessee, George
    Mason, Washington Univ. in St. Louis, Imperial
    College, VT, etc.
  • Government
  • LANL, ORNL, DoD (many), NASA, etc.
  • Industry
  • Xilinx, Cray, Honeywell, Boeing, Lockheed Martin,

References available upon request
Courtesy of Ian Troxel
RC @ University of Florida
  • HCS Lab leads first NSF Center in UF/ECE history
  • CHREC (research commenced in spring semester
  • Center for High-Performance Reconfigurable Computing
  • Pronounced "shreck"
  • Industry/government/university research
    consortium alliance
  • Under auspices of I/UCRC Program at NSF
  • Industry/University Cooperative Research Center
  • Broad spectrum of CHREC membership anticipated
  • Leading aerospace industry (e.g. Honeywell,
  • Leading supercomputing RC industry (e.g. Cray,
  • Leading government agencies (e.g. NASA, NSA,
  • Leading national laboratories (e.g. ORNL)

Courtesy of Ian Troxel
Objectives for CHREC
  • Establish first multidisciplinary NSF research
    center in reconfigurable high-performance
  • Basis for long-term partnership and collaboration
    amongst industry, academe, and government: a
    research consortium
  • RC from supercomputing to high-performance
    embedded systems
  • Directly support research needs of our Center
  • Highly cost-effective manner with pooled,
    leveraged resources and maximized synergy
  • Enhance educational experience for a diverse set
    of high-quality graduate and undergraduate
  • Ideal recruits after graduation for our Center
  • Advance knowledge and technologies in this field
  • Commercial relevance ensured with rapid
    technology transfer

Enabling Technology
  • Hardware advances provide more transistors than
    traditional processors can effectively use
    (besides more cache)
  • Very-Large Scale Integration (submicron)
  • Decreasing production cost (per chip)
  • Increasing system speeds
  • Field-Programmable Gate Array (FPGA)
  • Embedded SRAM configuration
  • Multi-FPGA systems
  • Multi-context FPGAs
  • Dynamically programmable
  • Partial reconfiguration
  • RISC / FPGA hybrid

FPGAs offer capacity advantages compared to CPLDs
due to the scalable design of internal resources.
The limit to this scalability is a topic up for
Chameleon Systems
Courtesy of Ian Troxel
Enabling Technology
PowerPC 405

Dedicated multipliers and memory
  • Digital Clock Management (DCM) provides
  • 16 independent clock domains
  • Clock divide, multiply, phase shift
  • Enhanced Phase Locked Loops (PLLs)

Routing Resources (90%)
More detail follows
Courtesy of Ian Troxel
Enabling Technology
  • FPGA Internal Structure (early Xilinx Virtex

LUT / RAM / ROM / Shift Register
Combinational Logic Block (CLB)
Input Signals
Output Signals
D Flip-flop
Carry Logic
Half of Slice (Logic Cell)
Note: chip manufacturers typically use the slice as
the common unit, but this can be misleading
One Slice
Courtesy of Ian Troxel
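
The LUT in each logic cell can be read as a small truth table indexed by its input signals. The sketch below models that idea in plain C, assuming a 4-input LUT; the function name `lut4_eval`, the two example masks, and the bit ordering of the mask are illustrative choices and are not meant to match any vendor's configuration format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a 4-input LUT: a 16-entry truth table packed
 * into a 16-bit mask. Bit i of `init` is the output for input value i. */
unsigned lut4_eval(uint16_t init, unsigned a, unsigned b,
                   unsigned c, unsigned d)
{
    assert(a <= 1 && b <= 1 && c <= 1 && d <= 1);
    unsigned index = (d << 3) | (c << 2) | (b << 1) | a;
    return ((unsigned)init >> index) & 1u;
}

/* 4-input AND: only index 15 (all inputs high) produces 1. */
#define LUT4_AND ((uint16_t)0x8000u)

/* 4-input XOR (parity): 1 wherever the index has an odd number of 1 bits. */
#define LUT4_XOR ((uint16_t)0x6996u)
```

Changing the 16-bit mask changes the logic function without changing the structure, which is the sense in which the same cell can serve as arbitrary combinational logic or as a small ROM.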
Enabling Technology
RocketIO X Receiver
  • RocketIO X transceivers
  • Physical Media Attachment (PMA)
  • Serializer/deserializer (SERDES)
  • TX and RX buffers
  • Clock generator / recovery circuitry
  • Physical Coding Sublayer (PCS)
  • 8B/10B encode/decode
  • 64B/66B encode/decode/scrambler/descrambler
  • Elastic buffer supporting channel bonding and
    clock correction
  • Supported Transceiver Interfaces (up to 10Gbps
    per pin)
  • 1x Infiniband -- 2.5Gbps
  • SONET OC-48
  • SONET OC-192
  • PCI Express
  • 10Gbps XAUI
  • 10Gbps Fibre Channel
  • 10Gbps Ethernet
  • Any custom design

RocketIO X Transmitter
Courtesy of Ian Troxel
Enabling Technology
XtremeDSP slices include an 18x18 two's-complement
signed multiplier and a 48-bit accumulator
Courtesy of Ian Troxel
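
To make the multiply-accumulate arithmetic concrete, here is a minimal software model of an 18x18 signed multiply feeding a 48-bit accumulator. It only mirrors the bit widths named on the slide; the function names (`wrap48`, `mac18x18`, `dot18`) are hypothetical, and wrap-around on overflow is an assumption of this sketch rather than a statement about the device.

```c
#include <assert.h>
#include <stdint.h>

#define ACC_BITS 48

/* Keep the low 48 bits of v and sign-extend them back to 64 bits. */
int64_t wrap48(int64_t v)
{
    const uint64_t mask = (UINT64_C(1) << ACC_BITS) - 1;
    uint64_t u = (uint64_t)v & mask;
    if (u & (UINT64_C(1) << (ACC_BITS - 1)))
        u |= ~mask;                 /* replicate the sign bit upward */
    return (int64_t)u;
}

/* One multiply-accumulate step: acc += a * b, with a and b limited to
 * the 18-bit two's-complement range and acc held to 48 bits. */
int64_t mac18x18(int64_t acc, int32_t a, int32_t b)
{
    assert(a >= -131072 && a <= 131071);   /* 18-bit signed range */
    assert(b >= -131072 && b <= 131071);
    int64_t product = (int64_t)a * (int64_t)b;   /* fits in 36 bits */
    return wrap48(acc + product);
}

/* Dot product built from repeated MAC steps. */
int64_t dot18(const int32_t *x, const int32_t *y, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = mac18x18(acc, x[i], y[i]);
    return acc;
}
```

The 48-bit accumulator leaves 12 bits of headroom above a 36-bit product, so thousands of products can be summed before overflow becomes a concern.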
RC Design Space
It's all about the application
Courtesy of Ian Troxel
Custom RC Systems
  • A long history of custom designs (numerous)
  • Teramac and HARP Oxford 1994
  • PowerPC chip connected via bus to an FPGA for
    simple computation assist
  • TIGER: UC Berkeley, 1997 (others from the BRASS
  • MIPS II core augmented by an array of FPGAs
  • Custom VLSI design
  • CORDS: Princeton, 1998
  • Theoretical DRFPGA model
  • PipeRench: Carnegie Mellon U, 1998
  • Pipeline reconfigurable processor with dynamic
  • MorphoSys: UC Irvine and Fed. U of Rio de
    Janeiro, 2000
  • RISC core with an 8x8 FPGA array
  • Many more examples
  • See Hartenstein's A Decade of RC: A Visionary

Most focus on COTS-based technology in order to
reduce problem complexity; design is much more difficult
when everything is new
Courtesy of Ian Troxel
COTS Embedded Systems
Dependable Multiprocessor (DM)
  • Supercomputing in space through FPGA acceleration
  • Speedups of 10× to 100×
  • Initial focus on 2D FFT and LU decomposition
  • Improving experience for hardware-accelerated
    application development
  • Removing wasted development efforts on
    proprietary compile-time hardware/software
    interfaces and run-time environments
  • Exposing hardware application developers to
    common FPGA resources
  • Providing earth and space scientists a
    transparent interface to DM's FPGA resources

Courtesy of Ian Troxel
COTS RC Clusters
RC board(s)
  • Concept gaining visibility
  • Virginia Tech. Tower Of Power
  • AFRL-IFTC 48-node RC cluster
  • RC cluster in HCS lab @ UF (aka Delta)
  • I/O speeds a potential limitation (typically PCI)
  • Especially cost-effective for applications with
    high computation to communication ratios

PCI Bridge
Memory Hierarchy
Courtesy of Ian Troxel
COTS RC Systems
PCI Boards and Software
Alpha-Data (HW): Headquartered in Edinburgh, UK
and founded in 1993. Several products (e.g.
ADM-XRC, ADM-XPL) with simple API access.
Celoxica (HW/SW): Headquartered in Oxford, UK and
founded in 1996. A few boards (e.g. RC2000,
RC200) plus a few PDKs featuring Handel-C.
Nallatech (HW/SW): UK company with US headquarters
in MD, founded in 1993. Several boards (e.g.
DIME, DIME-II) with API access plus FUSE
Annapolis Micro Systems (HW/SW): Headquartered in
Annapolis, MD and founded in 1982 with RC
beginning in 1994. Several boards (e.g. WILDSTAR,
FIREBIRD) plus the CoreFire™ graphical
application mapper.
Many others: Xess, StarBridge Systems, Tarari,
CoreTech, Avnet, Catalina Research, Cesys,
Dalanco Spry, CHIPit Power Edition, Orange Tree
Technologies, Traquair
  • PCI boards suffer large delay due to poor
    peripheral bus performance and scalability (but
    newer variants improving)
  • The Pilchard board with DIMM interface a notable

Courtesy of Ian Troxel
Large-scale RC Systems
SRC: Headquartered in Colorado Springs, CO and
founded in 1996 by Seymour Cray. They offer a
wide range of systems and are fully focused on RC.
SGI: Headquartered in Mountain View, CA. Breaking
into the RC market by building off of their
extensive supercomputing background.
Cray: Headquartered in Seattle, Washington. Formed
from the 2000 merger of Tera Computer Company and
Cray Research. Traditionally thought of as the
supercomputer company.
XDI: Headquartered in Schaumburg, IL. Incorporated
in 2003. Created an RC system for direct
attachment of Altera Stratix II FPGA to host
processor via Opteron Socket
These systems have the best raw
performance, which may well justify their extra
Courtesy of Ian Troxel
Large-scale RC Systems
  • Carte provides
  • C, Fortran parallel coding
  • Mapping for DLD and/or DEL
  • Manage IPC automatically
  • A level of debug

Courtesy of Ian Troxel
Large-scale RC Systems
A 16x16 crossbar to connect 256 nodes at 1400MB/s
for a bisection BW of 22,400 MB/s
Cluster solution available as well
8GB with 64-bit addressing at 1400MB/s
Connects through the DIMM slots to provide a
sustained BW of 1400MB/s (4x PCI-X133)
Courtesy of Ian Troxel
Large-scale RC Systems
  • FPGA products in development for the SGI Altix
    3000 family
  • First generation of the Athena board released
  • Up to 256 Itanium2 with 64-bit Linux DSM
  • SGI NUMAlink GSM interconnect fabric (up to 256

Message Passing/Commodity Bus
Distributed Shared Memory
Courtesy of Ian Troxel
Large-scale RC Systems
Real-time OS, distributed software and an
independent supervisory network monitor, control,
and manage the system
12 64-bit AMD Opteron 200 series processors in
six 2-way SMPs
Six Xilinx Virtex-II Pros per shelf attach to the
RapidArray fabric
12 custom comm. procs provide a 1Tb/s
nonblocking switching fabric per shelf to
deliver 8GB/s BW between SMPs
12 shelves can combine to provide 144 processors
and 72 FPGAs
Courtesy of Ian Troxel
Large-scale RC Systems
Algorithm Parallelization Considerations
  • Examine algorithm and identify options
  • Subdivide algorithm into atomic components
  • Decompose control and data flow
  • Examine performance on software system to
    identify processor-intensive components
  • Identify fine-grained operations (e.g. bit
    manipulation) and portions able to be deeply
    pipelined for RC
  • Define I/O requirements for each system component
    and their interactions
  • Search for pre-built RC cores to speed development

Courtesy of Ian Troxel
Algorithm Parallelization Considerations
  • Understand system component advantages
  • Traditional microprocessor
  • Control-bound algorithms (i.e. complex state
  • Random and/or sustained memory access (i.e. rich
  • Relatively infinite algorithm instruction count
    (analogous to FPGA area)
  • Rich tool support for development and debug
  • RC components
  • Dataflow and streaming algorithms (i.e. deep,
    custom pipelines)
  • Bit-level manipulations, especially non-standard
  • High degree of true hardware parallelism offered
  • Hybrid approaches can offer best or worst of both

Use the right tool for the job!
Courtesy of Ian Troxel
Algorithm Parallelization Considerations
Embedded Hybrid
  • Board-level RC system parallelization
  • Be sure to consider…
  • Buffer requirements, I/O bandwidth and latency,
    achievable parallel efficiency, component
    response time, etc.

Library-Based Coprocessor
Parallel Dual Processor
Parallel Multiprocessor
Stream Coprocessor
Pre/Post Processor
Streaming / Pipelined
Courtesy of Ian Troxel
Algorithm Parallelization Considerations
Cluster-level Tradeoffs
  • Intra-Node
  • Number and mix of boards
  • Network interface
  • Inter-Node
  • Interconnect
  • Arbitration and control
  • Discovery and identification
  • Intra-Cluster
  • Distributed job control
  • Configuration management
  • Resource monitoring
  • Parallel execution support
  • IPC
  • Inter-Cluster
  • User authentication
  • Safety (FPGA power issue)

X-node Cluster
Ethernet? SAN?
SAN? Backplane? Pin-to-pin?
Network Attached?
Y-node Cluster
Courtesy of Ian Troxel
Algorithm Parallelization Considerations
Grid-level Tradeoffs
X-node Cluster
  • Inter-Cluster
  • Resource advertisement
  • Domain sharing (gateways)
  • Job priority definition
  • User authentication
  • Safety (FPGA power issue)
  • Grid
  • Distributed resource monitoring
  • Distributed execution control
  • Data staging
  • Interconnect (latency tolerance)
  • Arbitration and control
  • Discovery and identification
  • Security and authentication

Long-haul Backbone
Y-node Cluster
New direction for RC research
Courtesy of Ian Troxel
The Research of Brian
RC-Amenability Test (RAT)
  • Should this application be done in an FPGA?
  • Cannot rely on rules of thumb such as O(n^2)
    comp. complexity
  • Needs to instead be based on two things
  • Ratio of communication time to computation time
  • (cannot be distinguished simply by looking at
    algorithm complexity)
  • Ratio of software execution time (t_soft) to RC
    latency (t_RC)
  • Comm./comp. ratio tells efficiency of RC device,
    soft/hard ratio tells speedup
  • RC latency (t_RC) as mentioned above means
  • Processing time of specific core design (t_comp),
  • Time it takes to send data and receive results
    (t_read + t_write)
  • The two metrics above should be easily
    approximated, even before beginning
    implementation (assume to be true for now, will
    discuss later)
  • The reason for using this rule of thumb is
    double-buffered performance, which is bound
    either by (t_comp) or by (t_read + t_write)
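
The two ratios above can be sketched as a few lines of C. This is only an illustration of the arithmetic, assuming the double-buffered case where the slower of communication and computation sets the per-call latency; the struct and function names are hypothetical, and treating the write as the host-to-FPGA direction is an assumption of this sketch.

```c
#include <assert.h>

/* Per-call time estimates in seconds. Field names are illustrative. */
typedef struct {
    double t_read;   /* host reads results back (assumed direction) */
    double t_write;  /* host writes input data (assumed direction)  */
    double t_comp;   /* FPGA core processing time                   */
    double t_soft;   /* pure-software execution time                */
} rat_times;

double rat_t_comm(const rat_times *t)
{
    return t->t_read + t->t_write;
}

/* Communication-to-computation ratio: above 1 the design is bound by
 * data movement, below 1 by the processing core. */
double rat_comm_comp_ratio(const rat_times *t)
{
    assert(t->t_comp > 0.0);
    return rat_t_comm(t) / t->t_comp;
}

/* Double-buffered latency per call: transfers overlap computation,
 * so the slower of the two sets the pace. */
double rat_t_rc_double_buffered(const rat_times *t)
{
    double t_comm = rat_t_comm(t);
    return (t_comm > t->t_comp) ? t_comm : t->t_comp;
}

/* Software-to-RC ratio, i.e. the predicted speedup. */
double rat_speedup(const rat_times *t)
{
    double t_rc = rat_t_rc_double_buffered(t);
    assert(t_rc > 0.0);
    return t->t_soft / t_rc;
}
```

With t_read = t_write = 0.25 s, t_comp = 1 s and t_soft = 4 s, the ratio is 0.5 (compute-bound) and the predicted speedup is 4.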

RC-Amenability Methodology
  • So, the following metrics are defined
  • t_comp = time for FPGA to process X bytes,
    assuming all data in FPGA
  • t_comm = t_read + t_write
  • t_RC = f(t_comp, t_comm) ≈ N * t_comp or N * t_comm
  • t_soft = pure-software execution time
  • Let's more carefully define "in FPGA"
  • By "in FPGA" above, I mean the storage location
    of the FPGA, so either external memory if any,
    otherwise internal memory
  • If external, there is one more layer of buffering
    between FPGA memory and internal processing cores
    (that communication/buffering is considered part
    of t_comp… see below)
  • Data movement between the FPGA's local (external or
    otherwise) RAM and processing core should be
    considered part of t_comp
  • Only data movement between actual host processor
    memory and FPGA board should be considered for
    t_comm in my proposed expression
  • Provides more apples-to-apples comparison, since
    software processor must also move data between
    local external memory and internal functional
  • Even an O(n) algorithm may be able to achieve 100%
    processor utilization when working out of local
    external memory (depends on local memory
    throughput, not PCI… will depend on specific RC
    card/vendor wrapper)

depends on application… discussed later
RC-Amenability Methodology
  • Let's look at a couple of pictures to clear this
    up, illustrating each case (FPGA has external
    memory, FPGA has no external memory)
  • Red line indicates communication associated with
    t_comm (t_read or t_write)
  • Blue line indicates communication included in
  • Important assumptions, before we continue
  • Must transfer X bytes to FPGA before beginning RC
    computation (can still be true for streaming
    applications, by the way)
  • Once all data is in FPGA memory, processor core
    can achieve 100% utilization

How To Calculate Metrics: t_RC
  • This metric is fairly difficult to formally
    define, but is of course critical to get right
    for accurate predictions
  • I propose to assume a double-buffered
    communication pattern; however, in some cases there
    may only be a single call to the FPGA… I know it's not
    the best solution, maybe just use an example?
  • For example at bottom of slide, assume repeated
    calls to same function, back-to-back
  • Even if in the software there is only one
    function call, depending on memory capacity of
    target FPGA co-processor, that one call may need
    to be broken up into many small calls to FPGA…
    the above assumption is thus valid for more than
    just cases of software-mandated multi-calls
  • Double buffering is usually possible, and should
    be assumed necessary for realistic
    implementations (maybe this isn't true, but I
    think it is… does everyone agree? If you
    disagree, what is a counter-example?)
  • For cases where double buffering simply isn't
    possible, all hope is not lost!
  • As a minimum, for a single, non-double buffered
    call, t_RC would be defined as t_write + t_comp
  • Perhaps alternate expressions can be
    fairly-easily derived if necessary by visualizing
    entire course of processing as a known sequence
    of read, write, and computation events
  • Use example below (double-buffered, repeated
    back-to-back calls) as guide to how to derive
    analytical expression based on tcomm and tcomp
    from such a sequence of events
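
One way to visualize such a sequence of events is to step through it in code. The sketch below is a simplified model, not the author's derivation: it assumes each call's read and write are lumped into a single transfer of length t_comm that precedes the computation, that there is one channel and one core, and that exactly two buffers are available. All function names are hypothetical.

```c
#include <assert.h>

double rat_max(double a, double b)
{
    return (a > b) ? a : b;
}

/* Step through n_calls back-to-back calls with two buffers and
 * return the time at which the last computation finishes. */
double t_rc_double_buffered_sim(double t_comm, double t_comp, int n_calls)
{
    double comm_done  = 0.0;  /* end of the most recent transfer   */
    double comp_prev  = 0.0;  /* end of computation for call i-1   */
    double comp_prev2 = 0.0;  /* end of computation for call i-2   */
    assert(t_comm >= 0.0 && t_comp >= 0.0 && n_calls >= 0);
    for (int i = 0; i < n_calls; i++) {
        /* Transfer i needs the channel and a free buffer; with two
         * buffers, the one used by call i-2 must have been consumed. */
        double comm_start = rat_max(comm_done, comp_prev2);
        comm_done = comm_start + t_comm;
        /* Computation i needs its data and an idle core. */
        double comp_start = rat_max(comm_done, comp_prev);
        comp_prev2 = comp_prev;
        comp_prev  = comp_start + t_comp;
    }
    return comp_prev;
}

/* Closed form under the same assumptions: one pipeline fill, then
 * the slower stage sets the pace for the remaining calls. */
double t_rc_double_buffered_closed(double t_comm, double t_comp, int n_calls)
{
    assert(n_calls >= 1);
    return t_comm + t_comp + (n_calls - 1) * rat_max(t_comm, t_comp);
}

/* Without overlap, every call pays for both transfer and computation. */
double t_rc_single_buffered(double t_comm, double t_comp, int n_calls)
{
    return n_calls * (t_comm + t_comp);
}
```

For t_comm = 1, t_comp = 3 and three calls, the stepped timeline and the closed form both give 10 time units, against 12 without double buffering. As the number of calls grows, the fill term matters less and the total approaches N times the larger of t_comm and t_comp.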

C-based Application Mappers
  • High-level languages increase the efficiency of
    hardware design
  • Software languages are more intuitive and more
    widely deployed than HDLs
  • Significantly more legacy code and programmers
    exist for HLLs
  • Main challenge remains effectively porting to
    particular platform
  • Moving to a supported platform is relatively
  • DIME-C inherently targets DIMEtalk and Nallatech
  • SRC machines are built around Carte environment
  • New versions of Impulse-C can directly target
    Cray XD1
  • However, unsupported platforms require manual
  • Handel-C, Impulse-C require additional user code
    to target other platforms
  • Inner constructs (e.g. Streams) do not port
    well to outside world
  • Like user FPGA code, HLLs may need to target
    standard interface
  • Hardware abstraction for application mappers is
    future goal of our USURP work

Courtesy of Ian Troxel
Accelerating Applications Using C-to-FPGA
Techniques: CoDeveloper and Impulse C
David Pellerin, CTO, Impulse Accelerated Technologies
Courtesy of Impulse Accelerated Technologies
What is Impulse C?
  • ANSI C for FPGA programming
  • A library of functions compatible with standard C
  • Functions for application partitioning
  • Functions and types for process communication
  • A software-to-hardware compiler
  • Optimizes C code for parallelism
  • Generates HDL, ready for FPGA synthesis
  • Also generates hardware/software interfaces
  • Purpose
  • Describe hardware accelerators using standard C
  • Move compute-intensive functions to FPGAs

Courtesy of Impulse Accelerated Technologies
C-to-FPGA Programming Goals
  • Support FPGA-based computing platforms
  • Allow true software programming of FPGAs, from C
  • Bring FPGAs within reach of software programmers
  • Allow hardware designers a faster path to
  • Maintain compatibility with existing tool flows
  • Use standard C development tools for design and
  • Use with existing FPGA synthesis tools and design

Courtesy of Impulse Accelerated Technologies
It's All About Parallelism
  • Parallelism at the system level
  • Multiple parallel processes
  • System-level pipelining and/or co-processing as
  • Hardware accelerators combined with embedded
  • Parallelism at the C statement level
  • Loop unrolling and pipelining
  • Instruction scheduling
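
As a small illustration of statement-level parallelism, the sketch below writes the same 4-term dot product in rolled and fully unrolled form. It is plain C and says nothing about how any particular C-to-FPGA compiler performs or requests unrolling; the function names and the `TAPS` constant are hypothetical.

```c
#include <assert.h>

#define TAPS 4

/* Rolled form: one multiply-add per loop iteration, with each
 * iteration depending on the running sum from the previous one. */
int dot_rolled(const int *x, const int *h)
{
    int sum = 0;
    for (int i = 0; i < TAPS; i++)
        sum += x[i] * h[i];
    return sum;
}

/* Fully unrolled form: the four products do not depend on each
 * other, so a hardware compiler is free to schedule them together
 * and combine them with a two-level adder tree. */
int dot_unrolled(const int *x, const int *h)
{
    int p0 = x[0] * h[0];
    int p1 = x[1] * h[1];
    int p2 = x[2] * h[2];
    int p3 = x[3] * h[3];
    return (p0 + p1) + (p2 + p3);
}
```

Both functions return the same value; the difference is only in how much independent work is exposed to the scheduler.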

FPGA bus
H/W accelerator
Courtesy of Impulse Accelerated Technologies
Impulse C Programming Model
  • Communicating Processes
  • Buffered communication channels to implement data
  • Supports dataflow and message-based
  • Supports parallelism at the application level and
    at the level of individual processes

Courtesy of Impulse Accelerated Technologies
Example: Simple Filter
  • Data passed into filter via data stream
  • Could also use shared memory
  • Written using untimed, hardware-independent C
  • Using coding styles familiar to C programmers
  • Software test bench written in C to test
  • In software simulation
  • In actual hardware

Courtesy of Impulse Accelerated Technologies
Simple Filter Process
Input data
Output data
  • Use C to describe the behavior of the filter
  • Read data from an input stream
  • Store samples as needed
  • Perform some computations
  • Write new data to an output stream
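
The four steps above can be sketched in ordinary C before any stream library is involved. In this minimal version, arrays stand in for the input and output streams and the "filter" is a 3-sample moving sum; the function name `filter3` and the choice of filter are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* 3-sample moving sum. Reads n_in samples from `in`, writes results
 * to `out`, and returns the number of output samples produced.
 * `out` must have room for at least n_in - 2 values when n_in >= 3. */
size_t filter3(const int *in, size_t n_in, int *out)
{
    int s0 = 0, s1 = 0;          /* the two most recent stored samples */
    size_t n_out = 0;
    for (size_t i = 0; i < n_in; i++) {
        int x = in[i];                     /* read from input stream  */
        if (i >= 2)
            out[n_out++] = x + s0 + s1;    /* compute, then write     */
        s1 = s0;                           /* store samples as needed */
        s0 = x;
    }
    return n_out;
}
```

Because the loop touches each sample once and keeps only two stored values, the same structure maps naturally onto a streaming process that never holds the whole data set.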

Courtesy of Impulse Accelerated Technologies
Impulse C Streaming Process
void img_proc(co_stream pixels_in, co_stream pixels_out)
{
    int nPixel;
    . . .
    do {
        co_stream_open(pixels_in, O_RDONLY, INT_TYPE(32));
        co_stream_open(pixels_out, O_WRONLY, INT_TYPE(32));
        while (co_stream_read(pixels_in, &nPixel, sizeof(int)) == 0) {
            . . .
            // Do some kind of filtering operation here…
            . . .
            co_stream_write(pixels_out, &nPixel, sizeof(int));
        }
        IF_SIM(break;) // Terminate here if desktop simulation
    } while (1);       // Run forever if hardware implementation
}
Courtesy of Impulse Accelerated Technologies
Impulse C shared memory process
void img_proc(co_signal start, co_memory datamem, co_signal done)
{
    double A[ARRAYSIZE], B[ARRAYSIZE]; // declarations inferred from use below
    int32 status;
    int32 offset = 0;
    . . .
    do {
        co_signal_wait(start, (int32 *)&status);
        co_memory_readblock(datamem, offset, A,
                            ARRAYSIZE * sizeof(double));
        . . .
        // Do some kind of computation here, perhaps
        // calculating A into B
        . . .
        co_memory_writeblock(datamem, offset, B,
                             ARRAYSIZE * sizeof(double));
        co_signal_post(done, 0);
    } while (1);
}
Courtesy of Impulse Accelerated Technologies
Parallel Programming Model
  • Communicating Process Programming Model
  • Buffered communication channels (FIFOs) to
    implement streams
  • Supports dataflow, message-based and memory
  • Supports parallelism at the application level and
    at the level of individual processes
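
A buffered channel between two processes behaves like a fixed-depth FIFO: the producer stalls when it is full and the consumer stalls when it is empty. The sketch below is a generic ring-buffer model in plain C, not the Impulse C stream implementation; the type `fifo`, its functions, and the depth of 4 are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 4   /* illustrative depth */

typedef struct {
    int    data[FIFO_DEPTH];
    size_t head;    /* index of the next element to read */
    size_t count;   /* number of elements currently held */
} fifo;

void fifo_init(fifo *f)
{
    f->head = 0;
    f->count = 0;
}

/* Returns false when full: the producer has to wait (back-pressure). */
bool fifo_write(fifo *f, int value)
{
    if (f->count == FIFO_DEPTH)
        return false;
    f->data[(f->head + f->count) % FIFO_DEPTH] = value;
    f->count++;
    return true;
}

/* Returns false when empty: the consumer has to wait. */
bool fifo_read(fifo *f, int *value)
{
    if (f->count == 0)
        return false;
    *value = f->data[f->head];
    f->head = (f->head + 1) % FIFO_DEPTH;
    f->count--;
    return true;
}
```

The depth decides how far the producer can run ahead of the consumer, which is why buffer requirements show up alongside bandwidth and latency when sizing a streaming design.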

Courtesy of Impulse Accelerated Technologies
An Impulse C Process
Multiple methods of process-to-process communications
are supported
Shared memory block reads/writes
Stream inputs
Stream outputs
Signal inputs
Signal outputs
Register inputs
Register outputs
App Monitor outputs
Processes are independently synchronized
Courtesy of Impulse Accelerated Technologies
Using Multiple Processes
Test image
Filtered image
Test producer
Image filter
Test consumer
Testing can be performed in desktop simulation
using Visual Studio or some other C environment.
Courtesy of Impulse Accelerated Technologies
  • Reconfigurable Computing, a growing field
  • Technology advancements
  • Design space growth
  • Numerous boards and developers
  • Embedded, clusters, and large-scale systems
  • Good visibility
  • Parallel RC design a tricky business
  • Examine algorithm and identify options
  • Design with a system-level approach in mind
  • Much scholarly research under development
  • RC Amenability Test
  • Initial version of RAT has been submitted for
  • However, still room for significant improvement
    and expansion
  • HLL Tools
  • Tools have matured greatly since initial research
    in 2005
  • But can they contribute to nontrivial scientific

Courtesy of Ian Troxel
The Future of RC?
  • The end is nigh!
  • Lack of widespread expertise
  • Too much legacy code to port
  • RC devices and tools in their infancy
  • Fractured market with expensive platforms
  • My traditional processor works for me now
  • The future is bright!
  • Economy of scale likely to come in time
  • Embedded market driving innovation
  • Large-scale HPC systems coming online
  • Hybrid approaches show merit
  • Programming models solidifying

Courtesy of Ian Troxel
There is hope that we will bridge the gap!
Thank you for listening and thank you to the HCS
lab members whose work was featured in this
Courtesy of Ian Troxel