Reconfigurable Computing (RC), Parallel Computer Architecture, High-Level Programming Paradigms

1
Reconfigurable Computing (RC), Parallel Computer
Architecture, High-Level Programming Paradigms
  • Brian Holland
  • EEL 6763 Guest Lecture
  • April 4, 2007

2
The 1000km View
  • Introduction
  • Reconfigurable Computing Overview
  • Enabling Technology
  • Overview
  • FPGA structure (Xilinx Virtex-II Pro)
  • Xilinx technology progression
  • RC Design Space
  • Custom RC Systems
  • COTS RC Systems
  • Embedded, clusters and large-scale
  • Parallelization Considerations
  • Brian's RC Research Emphasis
  • RC Amenability and Application Design Migration
  • High-Level Programming Paradigms for FPGAs
  • Conclusions and Future Work

Courtesy of Ian Troxel
3
Introduction
Blurring the software/hardware demarcation
  • "An otherwise fixed GPP augmented with
    computation-specific substructures"
  • - G. Estrin, UCLA (1960)
  • Technology Overview
  • Goal: ASIC speed with GPP flexibility
  • Bit-level manipulation with less overhead
  • Large bit size parallel computation
  • Custom designs developed since 1992
  • Embedded systems and recent trend toward COTS
    clusters
  • Many industry, government, academic projects

c/o Hauck (U. Wash.)
(ASIC = Application-Specific Integrated Circuit; GPP =
General-Purpose Processor)
Courtesy of Ian Troxel
4
RC Visibility
  • RC Conferences and workshops
  • FCCM, FPGA, RAW, ERSA, MAPLD, HPEC,
    HPCA, VLSI, FPL, FPT, ISVLSI, CHES, CRYPTO, etc.
  • Universities
  • Florida, GWU, MIT, Berkeley, UIUC, UC-Davis, BYU,
    U. Washington, South Carolina, Tennessee, George
    Mason, Washington Univ. in St. Louis, Imperial
    College, VT, etc.
  • Government
  • LANL, ORNL, DoD (many), NASA, etc.
  • Industry
  • Xilinx, Cray, Honeywell, Boeing, Lockheed Martin,
    etc.

References available upon request
Courtesy of Ian Troxel
5
RC University of Florida
  • HCS Lab leads first NSF Center in UF/ECE history
  • CHREC (research commenced in spring semester
    2007)
  • Center for High-Performance Reconfigurable
    Computing
  • Pronounced "shreck"
  • Industry/government/university research
    consortium alliance
  • Under auspices of I/UCRC Program at NSF
  • Industry/University Cooperative Research Center
  • Broad spectrum of CHREC membership anticipated
  • Leading aerospace industry (e.g. Honeywell,
    Boeing)
  • Leading supercomputing RC industry (e.g. Cray,
    Nallatech)
  • Leading government agencies (e.g. NASA, NSA,
    AFRL)
  • Leading national laboratories (e.g. ORNL)

Courtesy of Ian Troxel
6
Objectives for CHREC
  • Establish first multidisciplinary NSF research
    center in reconfigurable high-performance
    computing
  • Basis for long-term partnership and collaboration
    amongst industry, academia, and government: a
    research consortium
  • RC from supercomputing to high-performance
    embedded systems
  • Directly support research needs of our Center
    members
  • Highly cost-effective manner with pooled,
    leveraged resources and maximized synergy
  • Enhance educational experience for a diverse set
    of high-quality graduate and undergraduate
    students
  • Ideal recruits after graduation for our Center
    members
  • Advance knowledge and technologies in this field
  • Commercial relevance ensured with rapid
    technology transfer

7
Enabling Technology
  • Hardware advances provide more transistors than
    traditional processors can effectively use
    (besides more cache)
  • Very-Large Scale Integration (submicron)
  • Decreasing production cost (per chip)
  • Increasing system speeds
  • Field-Programmable Gate Array (FPGA)
  • Embedded SRAM configuration
  • Multi-FPGA systems
  • Multi-context FPGAs
  • Dynamically programmable
  • Partial reconfiguration
  • RISC / FPGA hybrid

FPGAs offer capacity advantages compared to CPLDs
due to the scalable design of internal resources.
The limit to this scalability is a topic up for
debate.
(Images: Xilinx, Chameleon Systems)
Courtesy of Ian Troxel
8
Enabling Technology
(Virtex-II Pro die diagram highlights)
  • Embedded PowerPC 405 processor block
  • Dedicated multipliers and memory
  • Digital Clock Management (DCM) provides
  • 16 independent clock domains
  • Clock divide, multiply, phase shift
  • Enhanced Phase Locked Loops (PLLs)
  • Routing resources (90%)
More detail follows
Courtesy of Ian Troxel
9
Enabling Technology
  • FPGA Internal Structure (early Xilinx Virtex
    series)

(Slice diagram: a logic cell, half of a slice, contains a LUT
usable as RAM/ROM/shift register, carry logic, and a D
flip-flop, with input and output signals; two logic cells make
up one slice within a Configurable Logic Block (CLB))
Note: chip manufacturers typically use the slice as the
common unit, but this can be misleading
10
Enabling Technology
RocketIO X Receiver
  • RocketIO X transceivers
  • Physical Media Attachment (PMA)
  • Serializer/deserializer (SERDES)
  • TX and RX buffers
  • Clock generator / recovery circuitry
  • Physical Coding Sublayer (PCS)
  • 8B/10B encode/decode
  • 64B/66B encode/decode/scrambler/descrambler
  • Elastic buffer supporting channel bonding and
    clock correction
  • Supported Transceiver Interfaces (up to 10Gbps
    per pin)
  • 1x Infiniband -- 2.5Gbps
  • SONET OC-48
  • SONET OC-192
  • PCI Express
  • 10Gbps XAUI
  • 10Gbps Fibre Channel
  • 10Gbps Ethernet
  • Any custom design

RocketIO X Transmitter
Courtesy of Ian Troxel
11
Enabling Technology
XtremeDSP slices include an 18x18 two's-complement
signed multiplier and a 48-bit accumulator
Courtesy of Ian Troxel
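For intuition, the operation such a slice implements in a single
hardware stage is an ordinary multiply-accumulate; the plain-C
sketch below (with hypothetical inputs) models it in software:

    #include <stdint.h>

    /* Plain-C model of the multiply-accumulate an XtremeDSP slice performs:
       an 18x18 two's-complement multiply feeding a 48-bit accumulator.
       The function name and inputs are hypothetical, for illustration only. */
    int64_t mac48(const int32_t *a, const int32_t *b, int n)
    {
        int64_t acc = 0;             /* stands in for the 48-bit accumulator */
        for (int i = 0; i < n; i++) {
            /* each operand is assumed to fit in 18 signed bits */
            acc += (int64_t)a[i] * (int64_t)b[i];
        }
        return acc;
    }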
12
RC Design Space
It's all about the application
Courtesy of Ian Troxel
13
Custom RC Systems
  • A long history of custom designs (numerous)
  • Teramac and HARP, Oxford, 1994
  • PowerPC chip connected via bus to an FPGA for
    simple computation assist
  • TIGER, UC Berkeley, 1997 (others from the BRASS
    project)
  • MIPS II core augmented by an array of FPGAs
  • Custom VLSI design
  • CORDS, Princeton, 1998
  • Theoretical DRFPGA model
  • PipeRench, Carnegie Mellon, 1998
  • Pipeline-reconfigurable processor with dynamic
    scheduling
  • MorphoSys, UC Irvine and Fed. U. of Rio de
    Janeiro, 2000
  • RISC core with an 8x8 FPGA array
  • Many more examples
  • See Hartenstein's "A Decade of RC: A Visionary
    Retrospective"

Most current efforts focus on COTS-based technology in order
to reduce problem complexity; everything is much more
difficult when every component is new
Courtesy of Ian Troxel
14
COTS Embedded Systems
Dependable Multiprocessor (DM)
  • Supercomputing in space through FPGA acceleration
  • Speedups of 10x to 100x
  • Initial focus on 2DFFT and LU decomposition
    kernels
  • Improving experience for hardware-accelerated
    application development
  • Removing wasted development efforts on
    proprietary compile-time hardware/software
    interfaces and run-time environments
  • Exposing hardware application developers to
    common FPGA resources
  • Providing earth and space scientists a
    transparent interface to DM's FPGA resources

Courtesy of Ian Troxel
15
COTS RC Clusters
  • Concept gaining visibility
  • Virginia Tech. Tower Of Power
  • AFRL-IFTC 48-node RC cluster
  • RC cluster in HCS lab at UF (aka Delta)
  • I/O speeds a potential limitation (typically PCI)
  • Especially cost-effective for applications with
    high computation to communication ratios

(Node diagram: CPU(s), memory hierarchy, PCI bridge, NIC, and
FPGA-based RC board)
Courtesy of Ian Troxel
16
COTS RC Systems
PCI Boards and Software
Alpha-Data (HW): Headquartered in Edinburgh, UK,
and founded in 1993. Several products (e.g.
ADM-XRC, ADM-XPL) with simple API access.
Celoxica (HW/SW): Headquartered in Oxford, UK, and
founded in 1996. A few boards (e.g. RC2000,
RC200) plus a few PDKs featuring Handel-C.
Nallatech (HW/SW): UK company with US headquarters
in MD, founded in 1993. Several boards (e.g.
DIME, DIME-II) with API access plus FUSE
middleware.
Annapolis Micro Systems (HW/SW): Headquartered in
Annapolis, MD, and founded in 1982, with RC work
beginning in 1994. Several boards (e.g. WILDSTAR,
FIREBIRD) plus the CoreFire graphical
application mapper.
Many others: Xess, StarBridge Systems, Tarari,
CoreTech, Avnet, Catalina Research, Cesys,
Dalanco Spry, CHIPit Power Edition, Orange Tree
Technologies, Traquair
  • PCI boards suffer large delays due to poor
    peripheral bus performance and scalability (but
    newer variants are improving)
  • The Pilchard board, with its DIMM interface, is a
    notable exception

Courtesy of Ian Troxel
17
Large-scale RC Systems
SRC: Headquartered in Colorado Springs, CO, and
founded in 1996 by Seymour Cray. They offer a
wide range of systems and are fully focused on RC.
SGI: Headquartered in Mountain View, CA. Breaking
into the RC market by building off of their
extensive supercomputing background.
Cray: Headquartered in Seattle, Washington. Formed
from the 2000 merger of Tera Computer Company and
Cray Research. Traditionally thought of as "the
supercomputer company."
XDI: Headquartered in Schaumburg, IL. Incorporated
in 2003. Created an RC system for direct
attachment of an Altera Stratix II FPGA to the host
processor via an Opteron socket.
These systems have the best raw
performance, which may well justify their extra
expense
Courtesy of Ian Troxel
18
Large-scale RC Systems
SRC
  • Carte provides
  • C, Fortran parallel coding
  • Mapping for DLD and/or DEL
  • Automatic IPC management
  • A level of debug support

Courtesy of Ian Troxel
19
Large-scale RC Systems
SRC
A 16x16 crossbar to connect 256 nodes at 1400MB/s
for a bisection BW of 22,400 MB/s
Cluster solution available as well
8GB with 64-bit addressing at 1400MB/s
Connects through the DIMM slots to provide a
sustained BW of 1400MB/s (4x PCI-X 133)
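(Presumably the quoted bisection bandwidth follows from the 16
crossbar ports: 16 x 1,400 MB/s = 22,400 MB/s.)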
Courtesy of Ian Troxel
20
Large-scale RC Systems
SGI
  • FPGA products in development for the SGI Altix
    3000 family
  • First generation of the Athena board released
  • Up to 256 Itanium 2 processors with 64-bit Linux DSM
  • SGI NUMAlink GSM interconnect fabric (up to 256
    devices)

(Diagram contrasts message passing over a commodity bus with
NUMAlink distributed shared memory)
Courtesy of Ian Troxel
21
Large-scale RC Systems
Cray
Real-time OS, distributed software and an
independent supervisory network monitor, control,
and manage the system
12 64-bit AMD Opteron 200 series processors in
six 2-way SMPs
Six Xilinx Virtex-II Pros per shelf attach to the
RapidArray fabric
12 custom comm. procs provide a 1Tb/s
nonblocking switching fabric per shelf to
deliver 8GB/s BW between SMPs
12 shelves can combine to provide 144 processors
and 96 FPGAs
Courtesy of Ian Troxel
22
Large-scale RC Systems
XDI
23
Algorithm Parallelization Considerations
  • Examine algorithm and identify options
  • Subdivide algorithm into atomic components
  • Decompose control and data flow
  • Examine performance on software system to
    identify processor-intensive components
  • Identify fine-grained operations (e.g. bit
    manipulation) and portions able to be deeply
    pipelined for RC
  • Define I/O requirements for each system component
    and their interactions
  • Search for pre-built RC cores to speed development

Courtesy of Ian Troxel
24
Algorithm Parallelization Considerations
  • Understand system component advantages
  • Traditional microprocessor
  • Control-bound algorithms (i.e. complex state
    machines)
  • Random and/or sustained memory access (i.e. rich
    hierarchy)
  • Effectively unlimited algorithm instruction count
    (analogous to FPGA area)
  • Rich tool support for development and debug
  • RC components
  • Dataflow and streaming algorithms (i.e. deep,
    custom pipelines)
  • Bit-level manipulations, especially non-standard
    widths
  • High degree of true hardware parallelism offered
  • Hybrid approaches can offer best or worst of both
    worlds

Use the right tool for the job!
Courtesy of Ian Troxel
25
Algorithm Parallelization Considerations
Embedded Hybrid
  • Board-level RC system parallelization
  • Be sure to consider
  • Buffer requirements, I/O bandwidth and latency,
    achievable parallel efficiency, component
    response time, etc.

(Diagram: board-level hybrid organizations pairing uP(s) and
FPGA(s) - library-based coprocessor, parallel dual processor,
parallel multiprocessor, stream coprocessor, pre/post
processor, and streaming/pipelined arrangements)
Courtesy of Ian Troxel
26
Algorithm Parallelization Considerations
Cluster-level Tradeoffs
  • Intra-Node
  • Number and mix of boards
  • Network interface
  • Inter-Node
  • Interconnect
  • Arbitration and control
  • Discovery and identification
  • Intra-Cluster
  • Distributed job control
  • Configuration management
  • Resource monitoring
  • Parallel execution support
  • IPC
  • Inter-Cluster
  • User authentication
  • Safety (FPGA power issue)

(Diagram: X-node and Y-node clusters; interconnect options
include Ethernet, SAN, backplane, pin-to-pin, and
network-attached FPGAs)
Courtesy of Ian Troxel
27
Algorithm Parallelization Considerations
Grid-level Tradeoffs
  • Inter-Cluster
  • Resource advertisement
  • Domain sharing (gateways)
  • Job priority definition
  • User authentication
  • Safety (FPGA power issue)
  • Grid
  • Distributed resource monitoring
  • Distributed execution control
  • Data staging
  • Interconnect (latency tolerance)
  • Arbitration and control
  • Discovery and identification
  • Security and authentication

(Diagram: X-node and Y-node clusters joined by a long-haul
backbone)
New direction for RC research
Courtesy of Ian Troxel
28
The Research of Brian
29
RC-Amenability Test (RAT)
  • Should this application be done in an FPGA?
  • Cannot rely on rules of thumb such as O(n²)
    computational complexity
  • It instead needs to be based on two things
  • Ratio of communication time to computation time
  • (cannot be determined simply by looking at
    algorithm complexity)
  • Ratio of software execution time (tsoft) to RC
    latency (tRC)
  • The comm./comp. ratio indicates the efficiency of
    the RC device; the soft/hard ratio indicates speedup
  • RC latency (tRC), as mentioned above, means
    either
  • Processing time of the specific core design (tcomp),
    OR
  • Time it takes to send data and receive results
    (tread + twrite)
  • The two metrics above should be easily
    approximated, even before beginning
    implementation (assume this to be true for now;
    discussed later)
  • The reason for using this rule of thumb is that
    double-buffered performance is bound by either
    (tcomp) or (tread + twrite)

30
RC-Amenability Methodology
  • So, the following metrics are defined
  • tcomp = time for the FPGA to process X bytes,
    assuming all data is in the FPGA
  • tcomm = tread + twrite
  • tRC = f(tcomp, tcomm) ≈ N x tcomp or N x tcomm,
    whichever dominates*  (MANY OTHERS POSSIBLE)
  • tsoft = pure-software execution time
  • Let's more carefully define "in the FPGA"
  • By "in the FPGA" above, I mean the storage location
    on the FPGA board: external memory if any,
    otherwise internal memory
  • If external, there is one more layer of buffering
    between FPGA memory and internal processing cores
    (that communication/buffering is considered part
    of tcomp; see below)
  • Data movement between the FPGA's local (external or
    otherwise) RAM and the processing core should be
    considered part of tcomp
  • Only data movement between actual host processor
    memory and FPGA board should be considered for
    tcomm in my proposed expression
  • Provides a more apples-to-apples comparison, since
    the software processor must also move data between
    local external memory and internal functional
    units
  • Even an O(n) algorithm may be able to achieve 100%
    processor utilization when working out of local
    external memory (depends on local memory
    throughput, not PCI; will depend on the specific RC
    card / vendor wrapper)

* depends on the application; discussed later
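To make the bookkeeping concrete, here is a minimal, hedged
sketch (not part of the original RAT slides) that plugs
hypothetical numbers into the metrics above; every constant
would be replaced with measured or estimated values for a real
application:

    /* Hedged sketch of a RAT-style estimate; all numbers are hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        double bytes   = 4.0e6;   /* X bytes transferred per call            */
        double bus_Bps = 2.0e8;   /* host<->FPGA throughput, bytes/s (~PCI)  */
        double t_write = bytes / bus_Bps;
        double t_read  = bytes / bus_Bps;
        double t_comm  = t_read + t_write;

        double t_comp  = 0.015;   /* estimated FPGA processing time per call */
        double t_soft  = 0.200;   /* measured software time per call         */
        int    N       = 10;      /* number of back-to-back calls            */

        /* Double-buffered assumption: each iteration is bound by the larger
           of computation and communication, so tRC = N * max(tcomp, tcomm). */
        double t_RC    = N * (t_comp > t_comm ? t_comp : t_comm);
        double speedup = (N * t_soft) / t_RC;

        printf("t_comm = %.3f s  t_RC = %.3f s  speedup = %.1fx\n",
               t_comm, t_RC, speedup);
        return 0;
    }

With these made-up numbers the call is communication-bound
(tcomm = 0.04 s versus tcomp = 0.015 s), giving tRC = 0.4 s and
a predicted speedup of about 5x over the 2 s of software time.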
31
RC-Amenability Methodology
  • Let's look at a couple of pictures to clear this
    up, illustrating each case (FPGA has external
    memory, FPGA has no external memory)
  • Red line indicates communication associated with
    tcomm (tread or twrite)
  • Blue line indicates communication included in
    tcomp
  • Important assumptions, before we continue
  • Must transfer X bytes to FPGA before beginning RC
    computation (can still be true for streaming
    applications, by the way)
  • Once all data is in FPGA memory, processor core
    can achieve 100% utilization

32
How To Calculate Metrics: tRC
  • This metric is fairly difficult to formally
    define, but it is of course critical to get right
    for accurate predictions
  • I propose to assume a double-buffered
    communication pattern; however, in some cases there
    may be only a single call to the FPGA (I know it's
    not the best solution; maybe just use an example?)
  • For the example at the bottom of the slide, assume
    repeated, back-to-back calls to the same function
  • Even if the software contains only one function
    call, depending on the memory capacity of the
    target FPGA co-processor, that one call may need
    to be broken up into many small calls to the FPGA;
    the above assumption is thus valid for more than
    just cases of software-mandated multi-calls
  • Double buffering is usually possible, and should
    be assumed necessary for realistic
    implementations (maybe this isn't true, but I
    think it is; does everyone agree? If you
    disagree, what is a counter-example?)
  • For cases where double buffering simply isn't
    possible, all hope is not lost!
  • As a minimum, for a single, non-double-buffered
    call, tRC would be defined as twrite + tcomp +
    tread
  • Alternate expressions can perhaps be fairly easily
    derived, if necessary, by visualizing the
    entire course of processing as a known sequence
    of read, write, and computation events
  • Use the example below (double-buffered, repeated
    back-to-back calls) as a guide to deriving an
    analytical expression based on tcomm and tcomp
    from such a sequence of events
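One plausible way to carry out that derivation, sketched here as
an assumption rather than taken from the original slides: suppose
N back-to-back, double-buffered calls in which the transfer for
block i+1 overlaps the computation on block i. Then

    tRC ≈ twrite + (N - 1) * max(tcomp, tcomm) + tcomp + tread
        ≈ N * max(tcomp, tcomm)    (for large N)

The first write and the last read cannot be hidden, but every
interior iteration costs only whichever of tcomp or tcomm
dominates, which recovers the double-buffered rule of thumb used
in the RAT discussion above.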



33
C-based Application Mappers
  • High-level languages improve the efficiency of
    hardware design
  • Software languages are more intuitive and more
    widely deployed than HDLs
  • Significantly more legacy code and programmers
    exist for HLLs
  • The main challenge remains effectively porting to a
    particular platform
  • Moving to a supported platform is relatively
    straightforward
  • DIME-C inherently targets DIMEtalk and Nallatech
    hardware
  • SRC machines are built around the Carte environment
  • New versions of Impulse-C can directly target the
    Cray XD1
  • However, unsupported platforms require manual
    effort
  • Handel-C and Impulse-C require additional user code
    to target other platforms
  • Inner constructs (e.g. streams) do not port
    well to the outside world
  • Like user FPGA code, HLLs may need to target a
    standard interface
  • Hardware abstraction for application mappers is a
    future goal of our USURP work

MAPLD'05
Courtesy of Ian Troxel
34
Accelerating Applications Using C-to-FPGA
Techniques: CoDeveloper and Impulse C
David Pellerin, CTO, Impulse Accelerated
Technologies
Courtesy of Impulse Accelerated Technologies
35
What is Impulse C?
  • ANSI C for FPGA programming
  • A library of functions compatible with standard C
  • Functions for application partitioning
  • Functions and types for process communication
  • A software-to-hardware compiler
  • Optimizes C code for parallelism
  • Generates HDL, ready for FPGA synthesis
  • Also generates hardware/software interfaces
  • Purpose
  • Describe hardware accelerators using standard C
  • Move compute-intensive functions to FPGAs

Courtesy of Impulse Accelerated Technologies
36
C-to-FPGA Programming Goals
  • Support FPGA-based computing platforms
  • Allow true software programming of FPGAs, from C
    language
  • Bring FPGAs within reach of software programmers
  • Allow hardware designers a faster path to
    prototypes
  • Maintain compatibility with existing tool flows
  • Use standard C development tools for design and
    debugging
  • Use with existing FPGA synthesis tools and design
    flows

Courtesy of Impulse Accelerated Technologies
37
It's All About Parallelism
  • Parallelism at the system level
  • Multiple parallel processes
  • System-level pipelining and/or co-processing as
    appropriate
  • Hardware accelerators combined with embedded
    software
  • Parallelism at the C statement level
  • Loop unrolling and pipelining
  • Instruction scheduling


(Diagram: processor, memory, and peripherals attached over the
FPGA bus to a hardware accelerator inside the FPGA)
Courtesy of Impulse Accelerated Technologies
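As a concrete, hedged illustration of statement-level
parallelism, Impulse C-style code typically requests loop
pipelining through a pragma; the identifiers below (TAPS, coef,
sample) are illustrative, and the exact pragma syntax should be
confirmed against the CoDeveloper documentation:

    #include "co.h"   /* Impulse C types (int32) and process/stream API */

    #define TAPS 8    /* illustrative tap count */

    /* Inner loop of a hypothetical FIR-style computation. The pipeline
       hint asks the compiler to begin a new loop iteration each cycle
       where possible. */
    int32 fir_tap_sum(const int32 coef[TAPS], const int32 sample[TAPS])
    {
        int32 acc = 0;
        int32 i;
        for (i = 0; i < TAPS; i++) {
    #pragma CO PIPELINE
            acc += coef[i] * sample[i];
        }
        return acc;
    }

An unroll hint (commonly written #pragma CO UNROLL) plays the
complementary role, replicating the loop body instead of
pipelining it.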
38
Impulse C Programming Model
  • Communicating Processes
  • Buffered communication channels to implement data
    streams
  • Supports dataflow and message-based
    communications
  • Supports parallelism at the application level and
    at the level of individual processes

Courtesy of Impulse Accelerated Technologies
39
Example Simple Filter
  • Data passed into filter via data stream
  • Could also use shared memory
  • Written using untimed, hardware-independent C
    code
  • Using coding styles familiar to C programmers
  • Software test bench written in C to test
    functionality
  • In software simulation
  • In actual hardware

Courtesy of Impulse Accelerated Technologies
40
Simple Filter Process
(Diagram: input data -> filter -> output data)
  • Use C to describe the behavior of the filter
  • Read data from an input stream
  • Store samples as needed
  • Perform some computations
  • Write new data to an output stream

Courtesy of Impulse Accelerated Technologies
41
Impulse C Streaming Process
void img_proc(co_stream pixels_in, co_stream pixels_out)
{
    int nPixel;
    . . .
    do {
        co_stream_open(pixels_in, O_RDONLY, INT_TYPE(32));
        co_stream_open(pixels_out, O_WRONLY, INT_TYPE(32));
        while ( co_stream_read(pixels_in, &nPixel, sizeof(int))
                == co_err_none ) {
            . . .  // Do some kind of filtering operation here
            co_stream_write(pixels_out, &nPixel, sizeof(int));
        }
        co_stream_close(pixels_in);
        co_stream_close(pixels_out);
        IF_SIM(break;)   // Terminate here if desktop simulation
    } while (1);         // Run forever if hardware implementation
}
Courtesy of Impulse Accelerated Technologies
42
Impulse C shared memory process
void img_proc(co_signal start, co_memory datamem, co_signal done)
{
    double A[ARRAYSIZE];
    double B[ARRAYSIZE];
    int32 status;
    int32 offset = 0;
    . . .
    do {
        co_signal_wait(start, &status);
        co_memory_readblock(datamem, offset, A,
                            ARRAYSIZE * sizeof(double));
        . . .  // Do some kind of computation here,
               // perhaps calculating A into B
        co_memory_writeblock(datamem, offset, B,
                             ARRAYSIZE * sizeof(double));
        co_signal_post(done, 0);
    } while (1);
}
Courtesy of Impulse Accelerated Technologies
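For context, and not shown on the original slide, the software
side of this shared-memory exchange would typically use the same
co_memory and co_signal calls from a host process; the following
is a hedged sketch with illustrative names and sizes:

    #include "co.h"

    #define ARRAYSIZE 1024   /* illustrative size */

    void host_proc(co_signal start, co_memory datamem, co_signal done)
    {
        double A[ARRAYSIZE], B[ARRAYSIZE];
        int32  code;

        /* ... fill A with input data ... */
        co_memory_writeblock(datamem, 0, A, ARRAYSIZE * sizeof(double));
        co_signal_post(start, 1);      /* tell the hardware process to begin */
        co_signal_wait(done, &code);   /* block until it signals completion  */
        co_memory_readblock(datamem, 0, B, ARRAYSIZE * sizeof(double));
        /* ... consume the results now in B ... */
    }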
43
Parallel Programming Model
  • Communicating Process Programming Model
  • Buffered communication channels (FIFOs) to
    implement streams
  • Supports dataflow, message-based and memory
    communications
  • Supports parallelism at the application level and
    at the level of individual processes

Courtesy of Impulse Accelerated Technologies
44
An Impulse C Process
Multiple methods of process-to-process
communications are supported
  • Shared memory block reads/writes
  • Stream inputs
  • Stream outputs
  • Signal inputs
  • Signal outputs
  • Register inputs
  • Register outputs
  • App monitor outputs
Processes are independently synchronized
Courtesy of Impulse Accelerated Technologies
45
Using Multiple Processes
(Diagram: a test producer feeds a test image to the image
filter, and a test consumer collects the filtered image)
Testing can be performed in desktop simulation
using Visual Studio or some other C environment.
Courtesy of Impulse Accelerated Technologies
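What the slides do not show is the configuration function that
creates the streams, instantiates the producer/filter/consumer
processes, and maps the filter to hardware. The sketch below
follows common Impulse C usage; the names are illustrative and
the exact API calls should be checked against the CoDeveloper
documentation:

    #include "co.h"

    /* Processes defined elsewhere: img_proc is shown on the earlier slide;
       test_producer and test_consumer are illustrative test-bench names. */
    extern void img_proc(co_stream pixels_in, co_stream pixels_out);
    extern void test_producer(co_stream pixels_out);
    extern void test_consumer(co_stream pixels_in);

    void config_img(void *arg)
    {
        co_stream to_filter   = co_stream_create("to_filter",   INT_TYPE(32), 8);
        co_stream from_filter = co_stream_create("from_filter", INT_TYPE(32), 8);

        co_process_create("producer",
                          (co_function)test_producer, 1, to_filter);
        co_process filter =
            co_process_create("filter",
                              (co_function)img_proc, 2, to_filter, from_filter);
        co_process_create("consumer",
                          (co_function)test_consumer, 1, from_filter);

        /* Map only the filter process to hardware; "PE0" is an
           illustrative processing-element name. */
        co_process_config(filter, co_loc, "PE0");
    }

    co_architecture co_initialize(void *arg)
    {
        return co_architecture_create("img_arch", "Generic", config_img, arg);
    }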
46
Conclusions
  • Reconfigurable Computing, a growing field
  • Technology advancements
  • Design space growth
  • Numerous boards and developers
  • Embedded, clusters, and large-scale systems
  • Good visibility
  • Parallel RC design a tricky business
  • Examine algorithm and identify options
  • Design with a system-level approach in mind
  • Much scholarly research under development
  • RC Amenability Test
  • Initial version of RAT has been submitted for
    publication
  • However, still room for significant improvement
    and expansion
  • HLL Tools
  • Tools have matured greatly since initial research
    in 2005
  • But can they contribute to nontrivial scientific
    applications?

Courtesy of Ian Troxel
47
The Future of RC?
  • The end is nigh!
  • Lack of widespread expertise
  • Too much legacy code to port
  • RC devices and tools in their infancy
  • Fractured market with expensive platforms
  • My traditional processor works for me now
  • The future is bright!
  • Economy of scale likely to come in time
  • Embedded market driving innovation
  • Large-scale HPC systems coming online
  • Hybrid approaches show merit
  • Programming models solidifying

Courtesy of Ian Troxel
48
There is hope that we will bridge the gap!
Thank you for listening, and thank you to the HCS
lab members whose work was featured in this
presentation.
Courtesy of Ian Troxel