Reconfigurable Computing (RC), Parallel Computer Architecture, High-Level Programming Paradigms

1
Reconfigurable Computing (RC), Parallel Computer
Architecture, High-Level Programming Paradigms
  • Brian Holland
  • EEL 6763 Guest Lecture
  • April 4, 2007

2
The 1000km View
  • Introduction
  • Reconfigurable Computing Overview
  • Enabling Technology
  • Overview
  • FPGA structure (Xilinx Virtex-II Pro)
  • Xilinx technology progression
  • RC Design Space
  • Custom RC Systems
  • COTS RC Systems
  • Embedded, clusters and large-scale
  • Parallelization Considerations
  • Brian's RC Research Emphasis
  • RC Amenability and Application Design Migration
  • High-Level Programming Paradigms for FPGAs
  • Conclusions and Future Work

Courtesy of Ian Troxel
3
Introduction
Blurring the software/hardware demarcation
  • "An otherwise fixed GPP augmented with
    computation-specific substructures"
  • - G. Estrin, UCLA (1960)
  • Technology Overview
  • Goal: ASIC speed with GPP flexibility
  • Bit-level manipulation with less overhead
  • Large bit size parallel computation
  • Custom designs developed since 1992
  • Embedded systems and recent trend toward COTS
    clusters
  • Many industry, government, academic projects

c/o Hauck (U. Wash.)
(ASIC = Application-Specific Integrated Circuit; GPP =
General-Purpose Processor)
Courtesy of Ian Troxel
4
RC Visibility
  • RC Conferences and workshops
  • FCCM, FPGA, RAW, ERSA, MAPLD, HPEC,
    HPCA, VLSI, FPL, FPT, ISVLSI, CHES, CRYPTO, etc.
  • Universities
  • Florida, GWU, MIT, Berkeley, UIUC, UC-Davis, BYU,
    U. Washington, South Carolina, Tennessee, George
    Mason, Washington Univ. in St. Louis, Imperial
    College, VT, etc.
  • Government
  • LANL, ORNL, DoD (many), NASA, etc.
  • Industry
  • Xilinx, Cray, Honeywell, Boeing, Lockheed Martin,
    etc.

References available upon request
Courtesy of Ian Troxel
5
RC University of Florida
  • HCS Lab leads first NSF Center in UF/ECE history
  • CHREC (research commenced in spring semester
    2007)
  • Center for High-Performance Reconfigurable
    Computing
  • Pronounced "shreck"
  • Industry/government/university research
    consortium alliance
  • Under auspices of I/UCRC Program at NSF
  • Industry/University Cooperative Research Center
  • Broad spectrum of CHREC membership anticipated
  • Leading aerospace industry (e.g. Honeywell,
    Boeing)
  • Leading supercomputing RC industry (e.g. Cray,
    Nallatech)
  • Leading government agencies (e.g. NASA, NSA,
    AFRL)
  • Leading national laboratories (e.g. ORNL)

Courtesy of Ian Troxel
6
Objectives for CHREC
  • Establish first multidisciplinary NSF research
    center in reconfigurable high-performance
    computing
  • Basis for long-term partnership and collaboration
    amongst industry, academia, and government: a
    research consortium
  • RC from supercomputing to high-performance
    embedded systems
  • Directly support research needs of our Center
    members
  • Highly cost-effective manner with pooled,
    leveraged resources and maximized synergy
  • Enhance educational experience for a diverse set
    of high-quality graduate and undergraduate
    students
  • Ideal recruits after graduation for our Center
    members
  • Advance knowledge and technologies in this field
  • Commercial relevance ensured with rapid
    technology transfer

7
Enabling Technology
  • Hardware advances provide more transistors than
    traditional processors can effectively use
    (besides more cache)
  • Very-Large Scale Integration (submicron)
  • Decreasing production cost (per chip)
  • Increasing system speeds
  • Field-Programmable Gate Array (FPGA)
  • Embedded SRAM configuration
  • Multi-FPGA systems
  • Multi-context FPGAs
  • Dynamically programmable
  • Partial reconfiguration
  • RISC / FPGA hybrid

FPGAs offer capacity advantages compared to CPLDs
due to the scalable design of internal resources.
The limit to this scalability is a topic up for
debate.
(Images: Xilinx, Chameleon Systems)
Courtesy of Ian Troxel
8
Enabling Technology
(Virtex-II Pro die diagram highlights)
  • Embedded PowerPC 405 processor block
  • Dedicated multipliers and memory
  • Digital Clock Management (DCM) provides
  • 16 independent clock domains
  • Clock divide, multiply, phase shift
  • Enhanced Phase Locked Loops (PLLs)
  • Routing resources (90%)
More detail follows
Courtesy of Ian Troxel
9
Enabling Technology
  • FPGA Internal Structure (early Xilinx Virtex
    series)

(Slice diagram: a logic cell, half of a slice, contains a LUT
usable as RAM/ROM/shift register, carry logic, and a D
flip-flop, with input and output signals; two logic cells make
up one slice within a Configurable Logic Block (CLB))
Note: chip manufacturers typically use the slice as the
common unit, but this can be misleading
10
Enabling Technology
RocketIO X Receiver
  • RocketIO X transceivers
  • Physical Media Attachment (PMA)
  • Serializer/deserializer (SERDES)
  • TX and RX buffers
  • Clock generator / recovery circuitry
  • Physical Coding Sublayer (PCS)
  • 8B/10B encode/decode
  • 64B/66B encode/decode/scrambler/descrambler
  • Elastic buffer supporting channel bonding and
    clock correction
  • Supported Transceiver Interfaces (up to 10Gbps
    per pin)
  • 1x Infiniband -- 2.5Gbps
  • SONET OC-48
  • SONET OC-192
  • PCI Express
  • 10Gbps XAUI
  • 10Gbps Fibre Channel
  • 10Gbps Ethernet
  • Any custom design

RocketIO X Transmitter
Courtesy of Ian Troxel
11
Enabling Technology
XtremeDSP slices include an 18x18 two's-complement
signed multiplier and a 48-bit accumulator
Courtesy of Ian Troxel
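For intuition, the operation such a slice implements in a single
hardware stage is an ordinary multiply-accumulate; the plain-C
sketch below (with hypothetical inputs) models it in software:

    #include <stdint.h>

    /* Plain-C model of the multiply-accumulate an XtremeDSP slice performs:
       an 18x18 two's-complement multiply feeding a 48-bit accumulator.
       The function name and inputs are hypothetical, for illustration only. */
    int64_t mac48(const int32_t *a, const int32_t *b, int n)
    {
        int64_t acc = 0;             /* stands in for the 48-bit accumulator */
        for (int i = 0; i < n; i++) {
            /* each operand is assumed to fit in 18 signed bits */
            acc += (int64_t)a[i] * (int64_t)b[i];
        }
        return acc;
    }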
12
RC Design Space
It's all about the application
Courtesy of Ian Troxel
13
Custom RC Systems
  • A long history of custom designs (numerous)
  • Teramac and HARP, Oxford, 1994
  • PowerPC chip connected via bus to an FPGA for
    simple computation assist
  • TIGER, UC Berkeley, 1997 (others from the BRASS
    project)
  • MIPS II core augmented by an array of FPGAs
  • Custom VLSI design
  • CORDS, Princeton, 1998
  • Theoretical DRFPGA model
  • PipeRench, Carnegie Mellon, 1998
  • Pipeline-reconfigurable processor with dynamic
    scheduling
  • MorphoSys, UC Irvine and Fed. U. of Rio de
    Janeiro, 2000
  • RISC core with an 8x8 FPGA array
  • Many more examples
  • See Hartenstein's "A Decade of RC: A Visionary
    Retrospective"

Most current efforts focus on COTS-based technology in order
to reduce problem complexity; everything is much more
difficult when every component is new
Courtesy of Ian Troxel
14
COTS Embedded Systems
Dependable Multiprocessor (DM)
  • Supercomputing in space through FPGA acceleration
  • Speedups of 10x to 100x
  • Initial focus on 2DFFT and LU decomposition
    kernels
  • Improving experience for hardware-accelerated
    application development
  • Removing wasted development efforts on
    proprietary compile-time hardware/software
    interfaces and run-time environments
  • Exposing hardware application developers to
    common FPGA resources
  • Providing earth and space scientists a
    transparent interface to DM's FPGA resources

Courtesy of Ian Troxel
15
COTS RC Clusters
  • Concept gaining visibility
  • Virginia Tech. Tower Of Power
  • AFRL-IFTC 48-node RC cluster
  • RC cluster in HCS lab at UF (aka Delta)
  • I/O speeds a potential limitation (typically PCI)
  • Especially cost-effective for applications with
    high computation to communication ratios

(Node diagram: CPU(s), memory hierarchy, PCI bridge, NIC, and
FPGA-based RC board)
Courtesy of Ian Troxel
16
COTS RC Systems
PCI Boards and Software
Alpha-Data (HW): Headquartered in Edinburgh, UK,
and founded in 1993. Several products (e.g.
ADM-XRC, ADM-XPL) with simple API access.
Celoxica (HW/SW): Headquartered in Oxford, UK, and
founded in 1996. A few boards (e.g. RC2000,
RC200) plus a few PDKs featuring Handel-C.
Nallatech (HW/SW): UK company with US headquarters
in MD, founded in 1993. Several boards (e.g.
DIME, DIME-II) with API access plus FUSE
middleware.
Annapolis Micro Systems (HW/SW): Headquartered in
Annapolis, MD, and founded in 1982, with RC work
beginning in 1994. Several boards (e.g. WILDSTAR,
FIREBIRD) plus the CoreFire graphical
application mapper.
Many others: Xess, StarBridge Systems, Tarari,
CoreTech, Avnet, Catalina Research, Cesys,
Dalanco Spry, CHIPit Power Edition, Orange Tree
Technologies, Traquair
  • PCI boards suffer large delays due to poor
    peripheral bus performance and scalability (but
    newer variants are improving)
  • The Pilchard board, with its DIMM interface, is a
    notable exception

Courtesy of Ian Troxel
17
Large-scale RC Systems
SRC: Headquartered in Colorado Springs, CO, and
founded in 1996 by Seymour Cray. They offer a
wide range of systems and are fully focused on RC.
SGI: Headquartered in Mountain View, CA. Breaking
into the RC market by building off of their
extensive supercomputing background.
Cray: Headquartered in Seattle, Washington. Formed
from the 2000 merger of Tera Computer Company and
Cray Research. Traditionally thought of as "the
supercomputer company."
XDI: Headquartered in Schaumburg, IL. Incorporated
in 2003. Created an RC system for direct
attachment of an Altera Stratix II FPGA to the host
processor via an Opteron socket.
These systems have the best raw
performance, which may well justify their extra
expense
Courtesy of Ian Troxel
18
Large-scale RC Systems
SRC
  • Carte provides
  • C, Fortran parallel coding
  • Mapping for DLD and/or DEL
  • Automatic IPC management
  • A level of debug support

Courtesy of Ian Troxel
19
Large-scale RC Systems
SRC
A 16x16 crossbar to connect 256 nodes at 1400MB/s
for a bisection BW of 22,400 MB/s
Cluster solution available as well
8GB with 64-bit addressing at 1400MB/s
Connects through the DIMM slots to provide a
sustained BW of 1400MB/s (4x PCI-X 133)
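(Presumably the quoted bisection bandwidth follows from the 16
crossbar ports: 16 x 1,400 MB/s = 22,400 MB/s.)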
Courtesy of Ian Troxel
20
Large-scale RC Systems
SGI
  • FPGA products in development for the SGI Altix
    3000 family
  • First generation of the Athena board released
  • Up to 256 Itanium 2 processors with 64-bit Linux DSM
  • SGI NUMAlink GSM interconnect fabric (up to 256
    devices)

(Diagram contrasts message passing over a commodity bus with
NUMAlink distributed shared memory)
Courtesy of Ian Troxel
21
Large-scale RC Systems
Cray
Real-time OS, distributed software and an
independent supervisory network monitor, control,
and manage the system
12 64-bit AMD Opteron 200 series processors in
six 2-way SMPs
Six Xilinx Virtex-II Pros per shelf attach to the
RapidArray fabric
12 custom comm. procs provide a 1Tb/s
nonblocking switching fabric per shelf to
deliver 8GB/s BW between SMPs
12 shelves can combine to provide 144 processors
and 96 FPGAs
Courtesy of Ian Troxel
22
Large-scale RC Systems
XDI
23
Algorithm Parallelization Considerations
  • Examine algorithm and identify options
  • Subdivide algorithm into atomic components
  • Decompose control and data flow
  • Examine performance on software system to
    identify processor-intensive components
  • Identify fine-grained operations (e.g. bit
    manipulation) and portions able to be deeply
    pipelined for RC
  • Define I/O requirements for each system component
    and their interactions
  • Search for pre-built RC cores to speed development

Courtesy of Ian Troxel
24
Algorithm Parallelization Considerations
  • Understand system component advantages
  • Traditional microprocessor
  • Control-bound algorithms (i.e. complex state
    machines)
  • Random and/or sustained memory access (i.e. rich
    hierarchy)
  • Effectively unlimited algorithm instruction count
    (analogous to FPGA area)
  • Rich tool support for development and debug
  • RC components
  • Dataflow and streaming algorithms (i.e. deep,
    custom pipelines)
  • Bit-level manipulations, especially non-standard
    widths
  • High degree of true hardware parallelism offered
  • Hybrid approaches can offer best or worst of both
    worlds

Use the right tool for the job!
Courtesy of Ian Troxel
25
Algorithm Parallelization Considerations
Embedded Hybrid
  • Board-level RC system parallelization
  • Be sure to consider
  • Buffer requirements, I/O bandwidth and latency,
    achievable parallel efficiency, component
    response time, etc.

(Diagram: board-level hybrid organizations pairing uP(s) and
FPGA(s) - library-based coprocessor, parallel dual processor,
parallel multiprocessor, stream coprocessor, pre/post
processor, and streaming/pipelined arrangements)
Courtesy of Ian Troxel
26
Algorithm Parallelization Considerations
Cluster-level Tradeoffs
  • Intra-Node
  • Number and mix of boards
  • Network interface
  • Inter-Node
  • Interconnect
  • Arbitration and control
  • Discovery and identification
  • Intra-Cluster
  • Distributed job control
  • Configuration management
  • Resource monitoring
  • Parallel execution support
  • IPC
  • Inter-Cluster
  • User authentication
  • Safety (FPGA power issue)

(Diagram: X-node and Y-node clusters; interconnect options
include Ethernet, SAN, backplane, pin-to-pin, and
network-attached FPGAs)
Courtesy of Ian Troxel
27
Algorithm Parallelization Considerations
Grid-level Tradeoffs
  • Inter-Cluster
  • Resource advertisement
  • Domain sharing (gateways)
  • Job priority definition
  • User authentication
  • Safety (FPGA power issue)
  • Grid
  • Distributed resource monitoring
  • Distributed execution control
  • Data staging
  • Interconnect (latency tolerance)
  • Arbitration and control
  • Discovery and identification
  • Security and authentication

(Diagram: X-node and Y-node clusters joined by a long-haul
backbone)
New direction for RC research
Courtesy of Ian Troxel
28
The Research of Brian
29
RC-Amenability Test (RAT)
  • Should this application be done in an FPGA?
  • Cannot rely on rules of thumb such as O(n²)
    computational complexity
  • It instead needs to be based on two things
  • Ratio of communication time to computation time
  • (cannot be determined simply by looking at
    algorithm complexity)
  • Ratio of software execution time (tsoft) to RC
    latency (tRC)
  • The comm./comp. ratio indicates the efficiency of
    the RC device; the soft/hard ratio indicates speedup
  • RC latency (tRC), as mentioned above, means
    either
  • Processing time of the specific core design (tcomp),
    OR
  • Time it takes to send data and receive results
    (tread + twrite)
  • The two metrics above should be easily
    approximated, even before beginning
    implementation (assume this to be true for now;
    discussed later)
  • The reason for using this rule of thumb is that
    double-buffered performance is bound by either
    (tcomp) or (tread + twrite)

30
RC-Amenability Methodology
  • So, the following metrics are defined
  • tcomp = time for the FPGA to process X bytes,
    assuming all data is in the FPGA
  • tcomm = tread + twrite
  • tRC = f(tcomp, tcomm) ≈ N x tcomp or N x tcomm,
    whichever dominates*  (MANY OTHERS POSSIBLE)
  • tsoft = pure-software execution time
  • Let's more carefully define "in the FPGA"
  • By "in the FPGA" above, I mean the storage location
    on the FPGA board: external memory if any,
    otherwise internal memory
  • If external, there is one more layer of buffering
    between FPGA memory and internal processing cores
    (that communication/buffering is considered part
    of tcomp; see below)
  • Data movement between the FPGA's local (external or
    otherwise) RAM and the processing core should be
    considered part of tcomp
  • Only data movement between actual host processor
    memory and FPGA board should be considered for
    tcomm in my proposed expression
  • Provides a more apples-to-apples comparison, since
    the software processor must also move data between
    local external memory and internal functional
    units
  • Even an O(n) algorithm may be able to achieve 100%
    processor utilization when working out of local
    external memory (depends on local memory
    throughput, not PCI; will depend on the specific RC
    card / vendor wrapper)

* depends on the application; discussed later
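To make the bookkeeping concrete, here is a minimal, hedged
sketch (not part of the original RAT slides) that plugs
hypothetical numbers into the metrics above; every constant
would be replaced with measured or estimated values for a real
application:

    /* Hedged sketch of a RAT-style estimate; all numbers are hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        double bytes   = 4.0e6;   /* X bytes transferred per call            */
        double bus_Bps = 2.0e8;   /* host<->FPGA throughput, bytes/s (~PCI)  */
        double t_write = bytes / bus_Bps;
        double t_read  = bytes / bus_Bps;
        double t_comm  = t_read + t_write;

        double t_comp  = 0.015;   /* estimated FPGA processing time per call */
        double t_soft  = 0.200;   /* measured software time per call         */
        int    N       = 10;      /* number of back-to-back calls            */

        /* Double-buffered assumption: each iteration is bound by the larger
           of computation and communication, so tRC = N * max(tcomp, tcomm). */
        double t_RC    = N * (t_comp > t_comm ? t_comp : t_comm);
        double speedup = (N * t_soft) / t_RC;

        printf("t_comm = %.3f s  t_RC = %.3f s  speedup = %.1fx\n",
               t_comm, t_RC, speedup);
        return 0;
    }

With these made-up numbers the call is communication-bound
(tcomm = 0.04 s versus tcomp = 0.015 s), giving tRC = 0.4 s and
a predicted speedup of about 5x over the 2 s of software time.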
31
RC-Amenability Methodology
  • Let's look at a couple of pictures to clear this
    up, illustrating each case (FPGA has external
    memory, FPGA has no external memory)
  • Red line indicates communication associated with
    tcomm (tread or twrite)
  • Blue line indicates communication included in
    tcomp
  • Important assumptions, before we continue
  • Must transfer X bytes to FPGA before beginning RC
    computation (can still be true for streaming
    applications, by the way)
  • Once all data is in FPGA memory, processor core
    can achieve 100% utilization

32
How To Calculate Metrics: tRC
  • This metric is fairly difficult to formally
    define, but it is of course critical to get right
    for accurate predictions
  • I propose to assume a double-buffered
    communication pattern; however, in some cases there
    may be only a single call to the FPGA (I know it's
    not the best solution; maybe just use an example?)
  • For the example at the bottom of the slide, assume
    repeated, back-to-back calls to the same function
  • Even if the software contains only one function
    call, depending on the memory capacity of the
    target FPGA co-processor, that one call may need
    to be broken up into many small calls to the FPGA;
    the above assumption is thus valid for more than
    just cases of software-mandated multi-calls
  • Double buffering is usually possible, and should
    be assumed necessary for realistic
    implementations (maybe this isn't true, but I
    think it is; does everyone agree? If you
    disagree, what is a counter-example?)
  • For cases where double buffering simply isn't
    possible, all hope is not lost!
  • As a minimum, for a single, non-double-buffered
    call, tRC would be defined as twrite + tcomp +
    tread
  • Alternate expressions can perhaps be fairly easily
    derived, if necessary, by visualizing the
    entire course of processing as a known sequence
    of read, write, and computation events
  • Use the example below (double-buffered, repeated
    back-to-back calls) as a guide to deriving an
    analytical expression based on tcomm and tcomp
    from such a sequence of events
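One plausible way to carry out that derivation, sketched here as
an assumption rather than taken from the original slides: suppose
N back-to-back, double-buffered calls in which the transfer for
block i+1 overlaps the computation on block i. Then

    tRC ≈ twrite + (N - 1) * max(tcomp, tcomm) + tcomp + tread
        ≈ N * max(tcomp, tcomm)    (for large N)

The first write and the last read cannot be hidden, but every
interior iteration costs only whichever of tcomp or tcomm
dominates, which recovers the double-buffered rule of thumb used
in the RAT discussion above.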



33
C-based Application Mappers
  • High-level languages improve the efficiency of
    hardware design
  • Software languages are more intuitive and more
    widely deployed than HDLs
  • Significantly more legacy code and programmers
    exist for HLLs
  • The main challenge remains effectively porting to a
    particular platform
  • Moving to a supported platform is relatively
    straightforward
  • DIME-C inherently targets DIMEtalk and Nallatech
    hardware
  • SRC machines are built around the Carte environment
  • New versions of Impulse-C can directly target the
    Cray XD1
  • However, unsupported platforms require manual
    effort
  • Handel-C and Impulse-C require additional user code
    to target other platforms
  • Inner constructs (e.g. streams) do not port
    well to the outside world
  • Like user FPGA code, HLLs may need to target a
    standard interface
  • Hardware abstraction for application mappers is a
    future goal of our USURP work

MAPLD'05
Courtesy of Ian Troxel
34
Accelerating Applications Using C-to-FPGA
Techniques: CoDeveloper and Impulse C
David Pellerin, CTO, Impulse Accelerated
Technologies
Courtesy of Impulse Accelerated Technologies
35
What is Impulse C?
  • ANSI C for FPGA programming
  • A library of functions compatible with standard C
  • Functions for application partitioning
  • Functions and types for process communication
  • A software-to-hardware compiler
  • Optimizes C code for parallelism
  • Generates HDL, ready for FPGA synthesis
  • Also generates hardware/software interfaces
  • Purpose
  • Describe hardware accelerators using standard C
  • Move compute-intensive functions to FPGAs

Courtesy of Impulse Accelerated Technologies
36
C-to-FPGA Programming Goals
  • Support FPGA-based computing platforms
  • Allow true software programming of FPGAs, from C
    language
  • Bring FPGAs within reach of software programmers
  • Allow hardware designers a faster path to
    prototypes
  • Maintain compatibility with existing tool flows
  • Use standard C development tools for design and
    debugging
  • Use with existing FPGA synthesis tools and design
    flows

Courtesy of Impulse Accelerated Technologies
37
It's All About Parallelism
  • Parallelism at the system level
  • Multiple parallel processes
  • System-level pipelining and/or co-processing as
    appropriate
  • Hardware accelerators combined with embedded
    software
  • Parallelism at the C statement level
  • Loop unrolling and pipelining
  • Instruction scheduling


(Diagram: processor, memory, and peripherals attached over the
FPGA bus to a hardware accelerator inside the FPGA)
Courtesy of Impulse Accelerated Technologies
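As a concrete, hedged illustration of statement-level
parallelism, Impulse C-style code typically requests loop
pipelining through a pragma; the identifiers below (TAPS, coef,
sample) are illustrative, and the exact pragma syntax should be
confirmed against the CoDeveloper documentation:

    #include "co.h"   /* Impulse C types (int32) and process/stream API */

    #define TAPS 8    /* illustrative tap count */

    /* Inner loop of a hypothetical FIR-style computation. The pipeline
       hint asks the compiler to begin a new loop iteration each cycle
       where possible. */
    int32 fir_tap_sum(const int32 coef[TAPS], const int32 sample[TAPS])
    {
        int32 acc = 0;
        int32 i;
        for (i = 0; i < TAPS; i++) {
    #pragma CO PIPELINE
            acc += coef[i] * sample[i];
        }
        return acc;
    }

An unroll hint (commonly written #pragma CO UNROLL) plays the
complementary role, replicating the loop body instead of
pipelining it.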
38
Impulse C Programming Model
  • Communicating Processes
  • Buffered communication channels to implement data
    streams
  • Supports dataflow and message-based
    communications
  • Supports parallelism at the application level and
    at the level of individual processes

Courtesy of Impulse Accelerated Technologies
39
Example Simple Filter
  • Data passed into filter via data stream
  • Could also use shared memory
  • Written using untimed, hardware-independent C
    code
  • Using coding styles familiar to C programmers
  • Software test bench written in C to test
    functionality
  • In software simulation
  • In actual hardware

Courtesy of Impulse Accelerated Technologies
40
Simple Filter Process
(Diagram: input data -> filter -> output data)
  • Use C to describe the behavior of the filter
  • Read data from an input stream
  • Store samples as needed
  • Perform some computations
  • Write new data to an output stream

Courtesy of Impulse Accelerated Technologies
41
Impulse C Streaming Process
void img_proc(co_stream pixels_in, co_stream pixels_out)
{
    int nPixel;
    . . .
    do {
        co_stream_open(pixels_in, O_RDONLY, INT_TYPE(32));
        co_stream_open(pixels_out, O_WRONLY, INT_TYPE(32));
        while ( co_stream_read(pixels_in, &nPixel, sizeof(int))
                == co_err_none ) {
            . . .  // Do some kind of filtering operation here
            co_stream_write(pixels_out, &nPixel, sizeof(int));
        }
        co_stream_close(pixels_in);
        co_stream_close(pixels_out);
        IF_SIM(break;)   // Terminate here if desktop simulation
    } while (1);         // Run forever if hardware implementation
}
Courtesy of Impulse Accelerated Technologies
42
Impulse C shared memory process
void img_proc(co_signal start, co_memory datamem, co_signal done)
{
    double A[ARRAYSIZE];
    double B[ARRAYSIZE];
    int32 status;
    int32 offset = 0;
    . . .
    do {
        co_signal_wait(start, &status);
        co_memory_readblock(datamem, offset, A,
                            ARRAYSIZE * sizeof(double));
        . . .  // Do some kind of computation here,
               // perhaps calculating A into B
        co_memory_writeblock(datamem, offset, B,
                             ARRAYSIZE * sizeof(double));
        co_signal_post(done, 0);
    } while (1);
}
Courtesy of Impulse Accelerated Technologies
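For context, and not shown on the original slide, the software
side of this shared-memory exchange would typically use the same
co_memory and co_signal calls from a host process; the following
is a hedged sketch with illustrative names and sizes:

    #include "co.h"

    #define ARRAYSIZE 1024   /* illustrative size */

    void host_proc(co_signal start, co_memory datamem, co_signal done)
    {
        double A[ARRAYSIZE], B[ARRAYSIZE];
        int32  code;

        /* ... fill A with input data ... */
        co_memory_writeblock(datamem, 0, A, ARRAYSIZE * sizeof(double));
        co_signal_post(start, 1);      /* tell the hardware process to begin */
        co_signal_wait(done, &code);   /* block until it signals completion  */
        co_memory_readblock(datamem, 0, B, ARRAYSIZE * sizeof(double));
        /* ... consume the results now in B ... */
    }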
43
Parallel Programming Model
  • Communicating Process Programming Model
  • Buffered communication channels (FIFOs) to
    implement streams
  • Supports dataflow, message-based and memory
    communications
  • Supports parallelism at the application level and
    at the level of individual processes

Courtesy of Impulse Accelerated Technologies
44
An Impulse C Process
Multiple methods of process-to-process
communications are supported
  • Shared memory block reads/writes
  • Stream inputs
  • Stream outputs
  • Signal inputs
  • Signal outputs
  • Register inputs
  • Register outputs
  • App monitor outputs
Processes are independently synchronized
Courtesy of Impulse Accelerated Technologies
45
Using Multiple Processes
(Diagram: a test producer feeds a test image to the image
filter, and a test consumer collects the filtered image)
Testing can be performed in desktop simulation
using Visual Studio or some other C environment.
Courtesy of Impulse Accelerated Technologies
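What the slides do not show is the configuration function that
creates the streams, instantiates the producer/filter/consumer
processes, and maps the filter to hardware. The sketch below
follows common Impulse C usage; the names are illustrative and
the exact API calls should be checked against the CoDeveloper
documentation:

    #include "co.h"

    /* Processes defined elsewhere: img_proc is shown on the earlier slide;
       test_producer and test_consumer are illustrative test-bench names. */
    extern void img_proc(co_stream pixels_in, co_stream pixels_out);
    extern void test_producer(co_stream pixels_out);
    extern void test_consumer(co_stream pixels_in);

    void config_img(void *arg)
    {
        co_stream to_filter   = co_stream_create("to_filter",   INT_TYPE(32), 8);
        co_stream from_filter = co_stream_create("from_filter", INT_TYPE(32), 8);

        co_process_create("producer",
                          (co_function)test_producer, 1, to_filter);
        co_process filter =
            co_process_create("filter",
                              (co_function)img_proc, 2, to_filter, from_filter);
        co_process_create("consumer",
                          (co_function)test_consumer, 1, from_filter);

        /* Map only the filter process to hardware; "PE0" is an
           illustrative processing-element name. */
        co_process_config(filter, co_loc, "PE0");
    }

    co_architecture co_initialize(void *arg)
    {
        return co_architecture_create("img_arch", "Generic", config_img, arg);
    }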
46
Conclusions
  • Reconfigurable Computing, a growing field
  • Technology advancements
  • Design space growth
  • Numerous boards and developers
  • Embedded, clusters, and large-scale systems
  • Good visibility
  • Parallel RC design a tricky business
  • Examine algorithm and identify options
  • Design with a system-level approach in mind
  • Much scholarly research under development
  • RC Amenability Test
  • Initial version of RAT has been submitted for
    publication
  • However, still room for significant improvement
    and expansion
  • HLL Tools
  • Tools have matured greatly since initial research
    in 2005
  • But can they contribute to nontrivial scientific
    applications?

Courtesy of Ian Troxel
47
The Future of RC?
  • The end is nigh!
  • Lack of widespread expertise
  • Too much legacy code to port
  • RC devices and tools in their infancy
  • Fractured market with expensive platforms
  • My traditional processor works for me now
  • The future is bright!
  • Economy of scale likely to come in time
  • Embedded market driving innovation
  • Large-scale HPC systems coming online
  • Hybrid approaches show merit
  • Programming models solidifying

Courtesy of Ian Troxel
48
There is hope that we will bridge the gap!
Thank you for listening, and thank you to the HCS
lab members whose work was featured in this
presentation.
Courtesy of Ian Troxel