An Introduction to High Performance Reconfigurable Computing
1
An Introduction to High Performance Reconfigurable
Computing
Grid Computing Workshop, Department of Physics,
University of Cape Town, 13 September 2006
  • Peter McMahon (peter@dotnet.za.net)
  • Department of Electrical Engineering
  • University of Cape Town

Disclaimer: references are missing. Only results
presented as mine are so.
2
Agenda
  • High performance computing and the motivation for
    alternative architectures
  • Speeding up computing with Field Programmable
    Gate Arrays
  • The current state of reconfigurable computing
  • Recent performance results
  • The future of reconfigurable computing

3
Motivation for alternative architectures (1)
  • Present systems use conventional architectures
  • CPU clock speeds have become a significant
    barrier: no longer doubling every 18 months
  • Power consumption has become a major issue. Many
    sizeable centres consume >5 MW, with next
    generation centres planning for 30 MW.
  • Koeberg produces 1800 MW!

4
(No Transcript)
5
Motivation for alternative architectures (2)
  • Present solution from vendors seems to be to tack
    together more processors (multi-core and bigger
    clusters)
  • More cores and/or chips lead to greater power
    consumption and cooling issues
  • Even if CPUs could be made to run faster, they
    would then run hotter
  • Maybe it's time to look at other ways of doing
    high performance computing?

6
What are we trying to do?
  • We're not looking at personal or embedded
    computing: some of the same issues, but not our
    fight!
  • Increase the performance capability of high
    performance computing systems, scaling to
    petaflops and beyond
  • The majority of scientific codes running at HPC
    centres are floating-point intensive
  • Hence specifically what we want to do is increase
    the performance of floating-point intensive
    software (roughly, FLOPS)

7
What options are there? (1)
  • Multi-core and/or multi-CPU systems
  • Parallelism via VLIW
  • Sea of ALUs/processors: IBM Cell Processor
  • High performance floating-point coprocessor
    (ClearSpeed)
  • Reconfigurable computing

8
What options are there? (2)
  • Sony/Toshiba/IBM Cell Processor
  • Delivers 30x the performance of a single PPC for
    some applications; 100x in exceptional cases.

9
What options are there? (3)
  • ClearSpeed floating-point accelerator
  • 0.17 GFLOPS/Watt in an HP cluster
  • Up from 0.07 GFLOPS/Watt without ClearSpeed
  • Performance increase of 2.7x by using two
    ClearSpeed accelerators per server

10
Introduction to FPGAs (1)
  • Field Programmable Gate Arrays
  • Back to basics: all programs are essentially a
    series of logic operations on bits
  • The key idea is that FPGAs are custom-designed
    like ASICs, but are also software-reprogrammable

11
Introduction to FPGAs (2)
  • You can in some sense think of an FPGA as a grid
    of wires connecting together logic gates. The
    joints between the wires are defined when you
    configure the device.
  • These wires have fuses between them, and the
    fuses can be blown or connected in software.
  • At least, that was the original idea
    (Programmable Array Logic); now they are far
    more sophisticated.

12
(No Transcript)
13
Introduction to FPGAs (3)
  • Instead of just AND/OR gates, FPGAs now use
    lookup table and flip-flop blocks, and include
    onboard memory (block RAM), hardware integer
    multipliers, fast I/O interconnects etc.

14
What is a reconfigurable computer?
  • Idea whereby hardware can modify itself to suit
    the executing program
  • Reconfigurable computing is sometimes used to
    refer to FPGAs alone.
  • We use the term to refer to hybrid computers that
    include both conventional microprocessors and
    FPGA reconfigurable logic.

15
A generic reconfigurable computer architecture
16
Performance advantages of reconfigurable computers
  • Simple idea: use the CPU where it is faster, and
    the FPGA where it is faster
  • FPGAs have been used to do application-type
    computation before, but
  • Typically programming has been done in
    VHDL/Verilog
  • All-or-nothing: whole machine built out of custom
    hardware

17
Performance in the real world
  • The first commercial reconfigurable computers
    have yielded promising results.

18
Performance in the real world
  • Why does the speedup vary with input size?

19
Performance in the real world
  • FPGAs are faster for certain applications
  • FPGAs can execute in parallel
  • Programs which do not depend on previously
    calculated values can be executed in a single
    clock cycle
  • Programs where the number of iterations is not
    known a priori generally perform better on
    general-purpose computers. Hardware to implement
    such routines requires complex control structures

20
Programming models
  • FPGA-based general-purpose computing devices have
    been possible for many years, but have not taken
    off. Why?
  • A simple matter of programming.
  • 10 years ago the problem was programming. It is
    still programming.

21
Programming models
  • VHDL is a nice abstraction, but it is still
    design.
  • Software developers cannot be expected to operate
    at the digital logic level.
  • There are conservatively 10x more software
    developers than there are digital electronics
    designers.
  • Hand-coding may yield 2x better performance than
    an automated tool, but productivity must be
    factored in: 100 apps running 50x faster is
    better than 2 apps running 100x faster.

22
Programming models
  • Ultimate objective: create a programming model
    that abstracts the hardware from the programmer,
    including the decision of whether to run code on
    the microprocessor or reconfigurable logic
    components
  • Intermediate objective: create a programming
    model that abstracts the complexities of FPGA
    design from the programmer, and allows the
    programmer to develop applications in a
    high-level language

23
FPGA programming models
  • Lots of wire
  • Vast quantities of Mountain Dew
  • Healthy disregard for personal hygiene
  • Very little wire
  • Vast quantities of Mountain Dew
  • Healthy disregard for personal hygiene

24
Programming models
  • In the simplest case, the CPU may be coded with a
    high-level language (e.g. C), and the FPGA with a
    HDL (e.g. VHDL or Verilog).
  • This is not ideal: the programmer shouldn't have
    to worry about logic design.
  • Solution: use a special C-to-HDL compiler for FPGA
    routines, allowing the programmer to write the
    entire program in a high-level language

25
Commercial implementations
[Chart: commercial C-to-HDL tools plotted against two
axes, Efficiency (low to high) and Ease of Use (easy to
difficult), with VHDL and Verilog shown as reference
points.]
26
SRC programming
  • SRC MapStation. Two languages: C, and MAP C for
    the MAP component (FPGA)

27
SRC MapStation
  • Xeon processors, common memory, and MAP FPGA
    boards

28
SRC programming
#include <libmap.h>

void sub_routine(int *, int *);

void main() {
    int *A = (int *)malloc(10 * sizeof(int));
    int *B = (int *)malloc(10 * sizeof(int));
    // Put data to process into A
    map_allocate(1);
    sub_routine(A, B);
    map_free(1);
    // Do something with data in B
    free(A);
    free(B);
}

#include <libmap.h>

void sub_routine(int *A, int *B) {
    OBM_BANK_A(AL, int, 10)
    OBM_BANK_B(BL, int, 10)
    DMA_CPU(CM2OBM, AL, , A, 1, 10 * sizeof(int), );
    wait_DMA(0);
    // Do some processing with AL
    DMA_CPU(OBM2CM, BL, , B, 1, 10 * sizeof(int), );
    wait_DMA(0);
}
29
Programming models
  • SRC's model hides the logic hardware design
    element from the programmer, but s/he still has
    to be familiar with the reconfigurable computer's
    architecture.
  • Estimating what subroutines will be best to run
    on the FPGA is not necessarily trivial: need to
    perform code profiling.
  • Short bursts of FPGA use are counter-productive
    due to call overhead.
  • Not easy to switch between microprocessor and
    FPGA target code.

30
Performance Results
  • SRC has reported 250x speedups for some signal
    processing applications
  • Non-floating-point applications seem to be
    getting speedups of 20x
  • Floating-point performance is currently fairly
    poor: not familiar with any floating-point code
    that has a speedup of >10x

31
What is limiting performance?
  • 10x is nice, but why should we care?
  • This is possibly just the tip of the iceberg.
  • Current clock rates are 100 MHz, so there's a
    possibility of scaling them up
  • Most importantly, FPGAs don't yet ship with
    floating-point multipliers, and have only limited
    integer multipliers.
  • Slice count limits parallelism (max 10 parallel
    engines); there's a possibility of scaling this
    up too

32
Where does performance come from? (1)
  • Intensity! (100MHz vs 4GHz)
  • CPUs can do 0.5 integer operations per clock
    cycle.
  • This is the best case, when there are no caching
    issues, the pipeline is working perfectly etc.
    The worst case is much worse!
  • FPGA only implements the functionality that is
    needed, resulting in much less complicated logic
    (no need for Control Unit, ALU, etc.)
  • Pipelined code can reduce to just a few clock
    cycles per loop iteration!

33
Where does performance come from? (2)
  • Lower latency
  • Ties into intensity (since lower latency
    increases intensity)
  • Can have 1 cycle access to static variables
  • Can have 5 cycle access to BRAM
  • Spatial parallelism on the chip
  • We can make one small pipeline for doing some
    computation, then replicate it
  • Since it is small, we can do much better than
    processor core or cluster parallelism!

34
Parallel random number generation
  • To do Monte Carlo, we need a decent random number
    generator.
  • To do Monte Carlo in parallel, we need to
    generate uncorrelated random number streams in
    parallel.
  • SRC does not provide an RNG library.
  • Implement parallel LCG from SPRNG
  • X_i = (a * X_(i-1) + b) mod M
  • Parameterize b (by stream)
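A minimal C sketch of the parameterize-b idea (this is not SPRNG's actual parameter table: the multiplier below is a known full-period constant mod 2^64 from the published LCG literature, and the per-stream choice of b is an illustrative assumption):

```c
#include <stdint.h>

/* Parallel LCG: all streams share the multiplier a and modulus
   M = 2^64 (implicit in uint64_t wraparound); each stream gets its
   own odd additive constant b, giving distinct full-period sequences. */
typedef struct {
    uint64_t state;
    uint64_t b;      /* per-stream additive constant (must be odd) */
} lcg_stream;

#define LCG_A 2862933555777941757ULL  /* full-period multiplier mod 2^64 */

void lcg_init(lcg_stream *s, uint64_t seed, unsigned stream_id) {
    s->state = seed;
    s->b = 2 * (uint64_t)stream_id + 1;  /* distinct odd b per stream */
}

uint64_t lcg_next(lcg_stream *s) {
    s->state = LCG_A * s->state + s->b;  /* X_i = (a*X_(i-1) + b) mod 2^64 */
    return s->state;
}
```

On the MAP, one such recurrence can be instantiated per parallel engine, each with its own b.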

35
Monte Carlo (1)
  • Estimate pi: CPU vs MAP
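For reference, the CPU-side estimator can be sketched as plain C using the standard rand() (an illustrative version, not the slide's actual code; the MAP version would substitute the parallel LCG and replicate the loop body across engines):

```c
#include <stdlib.h>

/* Monte Carlo estimate of pi: sample points uniformly in the unit
   square and count the fraction landing inside the quarter circle
   x^2 + y^2 <= 1; that fraction approaches pi/4. */
double estimate_pi(long samples, unsigned seed) {
    srand(seed);
    long hits = 0;
    for (long i = 0; i < samples; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return 4.0 * (double)hits / (double)samples;  /* 4 * (area ratio) */
}
```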

36
Monte Carlo (2)
  • Asian option pricing
  • Uses parallel random number generator
  • Significantly more calculation required than for
    the pi estimator
  • Floating-point code
  • 1x performance
  • Could only fit 2 parallel computations on the
    FPGA!

37
Conway's Game of Life (1)
  • A cellular automaton with simple rules
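The synchronous update kernel that gets replicated across the parallel engines can be sketched in plain C (the grid size and dead-border handling here are illustrative assumptions, not the slide's actual implementation):

```c
/* One synchronous Game of Life step on a small grid with dead borders.
   Rules: a live cell survives with 2 or 3 live neighbours; a dead cell
   becomes live with exactly 3 live neighbours. */
#define W 8
#define H 8

void life_step(const unsigned char cur[H][W], unsigned char next[H][W]) {
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            int n = 0;
            for (int dy = -1; dy <= 1; dy++)      /* count 8 neighbours */
                for (int dx = -1; dx <= 1; dx++) {
                    if (dx == 0 && dy == 0) continue;
                    int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < H && xx >= 0 && xx < W)
                        n += cur[yy][xx];
                }
            next[y][x] = cur[y][x] ? (n == 2 || n == 3) : (n == 3);
        }
    }
}
```

Because each output cell depends only on the previous generation, every cell update is independent, which is exactly the dependence-free structure that pipelines well on an FPGA.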

38
Conway's Game of Life (2)
  • 128x128 cell grid, 100,000 iterations
  • CPU: 2 min 39 sec
  • MAP: 58 sec
  • Speedup: 2.7x
  • Used 4 parallel engines, consuming 30% of
    slices.
  • Compiler issues (e.g. limit on lexical writes)
    made it more trouble than it was worth to extend
    to more engines, and I can already see how it
    scales.

39
Lattice Gas (1)
  • A cellular automaton algorithm, but more
    complicated than Conway's Game of Life.

40
Lattice Gas (2)
  • On a large lattice, can get reasonable results

41
Lattice Gas (3)
  • 480x480 point lattice, 10,000 iterations
  • CPU: 1 min 39 sec
  • MAP: 1 min 6 sec
  • Speedup: 1.5x
  • Used 4 parallel engines, consuming 50% of slices
    and 60% of BRAM.

42
Edge Detection (1)
  • Find the edges in an image

43
Edge Detection (2)
  • The Sobel edge detection algorithm just
    involves a 2D convolution, which is actually
    implemented in a very similar way to a CA
  • But CA uses 1,000s of iterations, and ED uses
    only one, so I/O time dominates.

44
Edge Detection (3)
  • 700x2100 image
  • CPU: 0.45 sec
  • MAP: 0.95 sec
  • Slowdown: 2.1x
  • Measuring just compute time, we see a 20x
    speedup.
  • Used 3 parallel engines, consuming 23% of
    slices.

45
Molecular model docking
  • Automatically dock molecular models into electron
    density maps.
  • Spent a couple of weeks converting existing code
    to a form suitable for MAP (i.e. C++ to C, static
    arrays, etc.)
  • No performance results yet.

46
Reconfigurable Computing at UCT
  • Some experience with FPGAs in Electrical
    Engineering, used for specific applications.
  • Plan to start a reconfigurable computing lab at
    the end of this year.
  • Will focus on applications for KAT. The main two
    are:
  • RFI excision
  • Pulsar searching
  • May look at other scientific computing software
    applications as well

47
The future: hardware
  • At the very least, reconfigurable computers will
    likely remain the best choice for
    non-floating-point code implementing simple
    algorithms and big loops.
  • Hardware is still expensive, but any application
    with 50x speedup or better starts to have a very
    compelling price/performance case.
  • Higher clock speeds, more hardware multipliers,
    floating-point hardware, more slices.
  • At least there is room to grow!

48
The future: software
  • Not yet mature enough: MAP C is better than
    Verilog, but not close enough to C.
  • Library support is lacking.
  • Adoption will be limited until programming
    becomes as easy as it is for clusters.
  • Currently, getting performance gains involves a
    lot of thinking about the hardware; abstracting
    this without ruining performance is going to be
    difficult.
  • But in the last 5 years, big gains have been
    made, so there is hope for the future.