An Introduction to High Performance Reconfigurable Computing
1
An Introduction to High Performance Reconfigurable
Computing
Grid Computing Workshop, Department of Physics,
University of Cape Town, 13 September 2006
  • Peter McMahon (peter@dotnet.za.net)
  • Department of Electrical Engineering
  • University of Cape Town

Disclaimer: references are missing. Only results
presented as mine are so.
2
Agenda
  • High performance computing and the motivation for
    alternative architectures
  • Speeding up computing with Field Programmable
    Gate Arrays
  • The current state of reconfigurable computing
  • Recent performance results
  • The future of reconfigurable computing

3
Motivation for alternative architectures (1)
  • Present systems use conventional architectures
  • CPU clock speeds have become a significant
    barrier: no longer doubling every 18 months
  • Power consumption has become a major issue. Many
    sizeable centres consume >5 MW, with next
    generation centres planning for 30 MW.
  • Koeberg produces 1800 MW!

4
(No Transcript)
5
Motivation for alternative architectures (2)
  • Present solution from vendors seems to be to tack
    together more processors (multi-core and bigger
    clusters)
  • More cores and/or chips lead to greater power
    consumption and cooling issues
  • Even if CPUs could be made to run faster, they
    would then run hotter
  • Maybe it's time to look at other ways of doing
    high performance computing?

6
What are we trying to do?
  • We're not looking at personal or embedded
    computing: some of the same issues, but not our
    fight!
  • Increase the performance capability of high
    performance computing systems, scaling to
    petaflops and beyond
  • The majority of scientific codes running at HPC
    centres are floating-point intensive
  • Hence specifically what we want to do is increase
    the performance of floating-point intensive
    software (roughly, FLOPS)

7
What options are there? (1)
  • Multi-core and/or multi-CPU systems
  • Parallelism via VLIW
  • Sea of ALUs/processors: IBM Cell Processor
  • High performance floating-point coprocessor
    (ClearSpeed)
  • Reconfigurable computing

8
What options are there? (2)
  • Sony/Toshiba/IBM Cell Processor
  • Delivers 30x the performance of a single PPC for
    some applications; 100x in exceptional cases.

9
What options are there? (3)
  • ClearSpeed floating-point accelerator
  • 0.17 GFLOPS/Watt in an HP cluster
  • Up from 0.07 GFLOPS/Watt without ClearSpeed
  • Performance increase of 2.7x by using two
    ClearSpeed accelerators per server

10
Introduction to FPGAs (1)
  • Field Programmable Gate Arrays
  • Back to basics: all programs are essentially a
    series of logic operations on bits
  • The key idea is that FPGAs are custom-designed
    like ASICs, but are also software-reprogrammable

11
Introduction to FPGAs (2)
  • You can in some sense think of an FPGA as a grid
    of wires connecting together logic gates. The
    joints between the wires are defined when you
    configure the device.
  • These wires have fuses between them, and the
    fuses can be blown or connected in software.
  • At least, that was the original idea
    (Programmable Array Logic); now they are far
    more sophisticated.

12
(No Transcript)
13
Introduction to FPGAs (3)
  • Instead of just AND/OR gates, FPGAs now use
    lookup table and flip-flop blocks, and include
    onboard memory (block RAM), hardware integer
    multipliers, fast I/O interconnects etc.

14
What is a reconfigurable computer?
  • Idea whereby hardware can modify itself to suit
    the executing program
  • Reconfigurable computing is sometimes used to
    refer to FPGAs alone.
  • We use the term to refer to hybrid computers that
    include both conventional microprocessors and
    FPGA reconfigurable logic.

15
A generic reconfigurable computer architecture
16
Performance advantages of reconfigurable computers
  • Simple idea: use the CPU where it is faster, and
    the FPGA where it is faster
  • FPGAs have been used to do application-type
    computation before, but
  • Typically programming has been done in
    VHDL/Verilog
  • All-or-nothing: whole machine built out of custom
    hardware

17
Performance in the real world
  • The first commercial reconfigurable computers
    have yielded promising results.

18
Performance in the real world
  • Why does the speedup vary with input size?

19
Performance in the real world
  • FPGAs are faster for certain applications
  • FPGAs can execute in parallel
  • Programs which do not depend on previously
    calculated values can be executed in a single
    clock cycle
  • Programs where the number of iterations is not
    known a priori generally perform better on
    general-purpose computers. Hardware to implement
    such routines requires complex control structures

20
Programming models
  • FPGA-based general-purpose computing devices have
    been possible for many years, but have not taken
    off. Why?
  • A simple matter of programming.
  • 10 years ago the problem was programming. It is
    still programming.

21
Programming models
  • VHDL is a nice abstraction, but it is still
    design.
  • Software developers cannot be expected to operate
    at the digital logic level.
  • There are conservatively 10x more software
    developers than there are digital electronics
    designers.
  • Hand-coding may yield 2x better performance than
    an automated tool, but productivity must be
    factored in: 100 apps running 50x faster is
    better than 2 apps running 100x faster.

22
Programming models
  • Ultimate objective: create a programming model
    that abstracts the hardware from the programmer,
    including the decision of whether to run code on
    the microprocessor or reconfigurable logic
    components
  • Intermediate objective: create a programming
    model that abstracts the complexities of FPGA
    design from the programmer, and allows the
    programmer to develop applications in a
    high-level language

23
FPGA programming models
  • Lots of wire
  • Vast quantities of Mountain Dew
  • Healthy disregard for personal hygiene
  • Very little wire
  • Vast quantities of Mountain Dew
  • Healthy disregard for personal hygiene

24
Programming models
  • In the simplest case, the CPU may be coded with a
    high-level language (e.g. C), and the FPGA with a
    HDL (e.g. VHDL or Verilog).
  • This is not ideal: the programmer shouldn't have
    to worry about logic design.
  • Solution: use a special C-to-HDL compiler for FPGA
    routines, allowing the programmer to write the
    entire program in a high-level language

25
Commercial implementations
[Chart: commercial C-to-HDL tools plotted against two
axes, Efficiency (low to high) and Ease of Use (easy to
difficult), with VHDL and Verilog shown as reference
points.]
26
SRC programming
  • SRC MapStation. Two languages: C, and MAP C for
    the MAP component (FPGA)

27
SRC MapStation
  • Xeon processors, common memory, and MAP FPGA
    boards

28
SRC programming
#include <libmap.h>

void sub_routine(int *, int *);

void main() {
    int *A = (int *)malloc(10 * sizeof(int));
    int *B = (int *)malloc(10 * sizeof(int));
    // Put data to process into A
    map_allocate(1);
    sub_routine(A, B);
    map_free(1);
    // Do something with data in B
    free(A);
    free(B);
}

#include <libmap.h>

void sub_routine(int *A, int *B) {
    OBM_BANK_A(AL, int, 10)
    OBM_BANK_B(BL, int, 10)
    DMA_CPU(CM2OBM, AL, , A, 1, 10 * sizeof(int), );
    wait_DMA(0);
    // Do some processing with AL
    DMA_CPU(OBM2CM, BL, , B, 1, 10 * sizeof(int), );
    wait_DMA(0);
}
29
Programming models
  • SRC's model hides the logic hardware design
    element from the programmer, but s/he still has
    to be familiar with the reconfigurable computer's
    architecture.
  • Estimating what subroutines will be best to run
    on the FPGA is not necessarily trivial: need to
    perform code profiling.
  • Short bursts of FPGA use are counter-productive
    due to call overhead.
  • Not easy to switch between microprocessor and
    FPGA target code.

30
Performance Results
  • SRC has reported 250x speedups for some signal
    processing applications
  • Non-floating-point applications seem to be
    getting speedups of 20x
  • Floating-point performance is currently fairly
    poor: not familiar with any floating-point code
    that has a speedup of >10x

31
What is limiting performance?
  • 10x is nice, but why should we care?
  • This is possibly just the tip of the iceberg.
  • Current clock rates are 100 MHz, so there's a
    possibility of scaling them up
  • Most importantly, FPGAs don't yet ship with
    floating-point multipliers, and have only limited
    integer multipliers.
  • Slice count limits parallelism (max 10 parallel
    engines); there's a possibility of scaling this
    up too

32
Where does performance come from? (1)
  • Intensity! (100MHz vs 4GHz)
  • CPUs can do 0.5 integer operations per clock
    cycle.
  • This is the best case, when there are no caching
    issues, the pipeline is working perfectly etc.
    The worst case is much worse!
  • FPGA only implements the functionality that is
    needed, resulting in much less complicated logic
    (no need for Control Unit, ALU, etc.)
  • Pipelined code can reduce to just a few clock
    cycles per loop iteration!

33
Where does performance come from? (2)
  • Lower latency
  • Ties into intensity (since lower latency
    increases intensity)
  • Can have 1 cycle access to static variables
  • Can have 5 cycle access to BRAM
  • Spatial parallelism on the chip
  • We can make one small pipeline for doing some
    computation, then replicate it
  • Since it is small, we can do much better than
    processor core or cluster parallelism!

34
Parallel random number generation
  • To do Monte Carlo, we need a decent random number
    generator.
  • To do Monte Carlo in parallel, we need to
    generate uncorrelated random number streams in
    parallel.
  • SRC does not provide an RNG library.
  • Implement parallel LCG from SPRNG
  • X_i = (a * X_(i-1) + b) mod M
  • Parameterize b (by stream)
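A minimal C sketch of the parameterize-b idea (this is not SPRNG's actual parameter table: the multiplier below is a known full-period constant mod 2^64 from the published LCG literature, and the per-stream choice of b is an illustrative assumption):

```c
#include <stdint.h>

/* Parallel LCG: all streams share the multiplier a and modulus
   M = 2^64 (implicit in uint64_t wraparound); each stream gets its
   own odd additive constant b, giving distinct full-period sequences. */
typedef struct {
    uint64_t state;
    uint64_t b;      /* per-stream additive constant (must be odd) */
} lcg_stream;

#define LCG_A 2862933555777941757ULL  /* full-period multiplier mod 2^64 */

void lcg_init(lcg_stream *s, uint64_t seed, unsigned stream_id) {
    s->state = seed;
    s->b = 2 * (uint64_t)stream_id + 1;  /* distinct odd b per stream */
}

uint64_t lcg_next(lcg_stream *s) {
    s->state = LCG_A * s->state + s->b;  /* X_i = (a*X_(i-1) + b) mod 2^64 */
    return s->state;
}
```

On the MAP, one such recurrence can be instantiated per parallel engine, each with its own b.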

35
Monte Carlo (1)
  • Estimate pi: CPU vs MAP
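For reference, the CPU-side estimator can be sketched as plain C using the standard rand() (an illustrative version, not the slide's actual code; the MAP version would substitute the parallel LCG and replicate the loop body across engines):

```c
#include <stdlib.h>

/* Monte Carlo estimate of pi: sample points uniformly in the unit
   square and count the fraction landing inside the quarter circle
   x^2 + y^2 <= 1; that fraction approaches pi/4. */
double estimate_pi(long samples, unsigned seed) {
    srand(seed);
    long hits = 0;
    for (long i = 0; i < samples; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return 4.0 * (double)hits / (double)samples;  /* 4 * (area ratio) */
}
```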

36
Monte Carlo (2)
  • Asian option pricing
  • Uses parallel random number generator
  • Significantly more calculation required than for
    the pi estimator
  • Floating-point code
  • 1x performance
  • Could only fit 2 parallel computations on the
    FPGA!

37
Conway's Game of Life (1)
  • A cellular automaton with simple rules
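The synchronous update kernel that gets replicated across the parallel engines can be sketched in plain C (the grid size and dead-border handling here are illustrative assumptions, not the slide's actual implementation):

```c
/* One synchronous Game of Life step on a small grid with dead borders.
   Rules: a live cell survives with 2 or 3 live neighbours; a dead cell
   becomes live with exactly 3 live neighbours. */
#define W 8
#define H 8

void life_step(const unsigned char cur[H][W], unsigned char next[H][W]) {
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            int n = 0;
            for (int dy = -1; dy <= 1; dy++)      /* count 8 neighbours */
                for (int dx = -1; dx <= 1; dx++) {
                    if (dx == 0 && dy == 0) continue;
                    int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < H && xx >= 0 && xx < W)
                        n += cur[yy][xx];
                }
            next[y][x] = cur[y][x] ? (n == 2 || n == 3) : (n == 3);
        }
    }
}
```

Because each output cell depends only on the previous generation, every cell update is independent, which is exactly the dependence-free structure that pipelines well on an FPGA.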

38
Conway's Game of Life (2)
  • 128x128 cell grid, 100,000 iterations
  • CPU: 2 min 39 sec
  • MAP: 58 sec
  • Speedup: 2.7x
  • Used 4 parallel engines, consuming 30% of
    slices.
  • Compiler issues (e.g. limit on lexical writes)
    made it more trouble than it was worth to extend
    to more engines, and I can already see how it
    scales.

39
Lattice Gas (1)
  • A cellular automaton algorithm, but more
    complicated than Conway's Game of Life.

40
Lattice Gas (2)
  • On a large lattice, can get reasonable results

41
Lattice Gas (3)
  • 480x480 point lattice, 10,000 iterations
  • CPU: 1 min 39 sec
  • MAP: 1 min 6 sec
  • Speedup: 1.5x
  • Used 4 parallel engines, consuming 50% of slices
    and 60% of BRAM.

42
Edge Detection (1)
  • Find the edges in an image

43
Edge Detection (2)
  • The Sobel edge detection algorithm just
    involves a 2D convolution, which is actually
    implemented in a very similar way to a CA
  • But CA uses 1,000s of iterations, and ED uses
    only one, so I/O time dominates.

44
Edge Detection (3)
  • 700x2100 image
  • CPU: 0.45 sec
  • MAP: 0.95 sec
  • Slowdown: 2.1x
  • Measuring just compute time, we see a 20x
    speedup.
  • Used 3 parallel engines, consuming 23% of
    slices.

45
Molecular model docking
  • Automatically dock molecular models into electron
    density maps.
  • Spent a couple of weeks converting existing code
    to a form suitable for MAP (i.e. C++ to C, static
    arrays, etc.)
  • No performance results yet.

46
Reconfigurable Computing at UCT
  • Some experience with FPGAs in Electrical
    Engineering, used for specific applications.
  • Plan to start a reconfigurable computing lab at
    the end of this year.
  • Will focus on applications for KAT. The main two
    are:
  • RFI excision
  • Pulsar searching
  • May look at other scientific computing software
    applications as well

47
The future: hardware
  • At the very least, reconfigurable computers will
    likely remain the best choice for
    non-floating-point code implementing simple
    algorithms and big loops.
  • Hardware is still expensive, but any application
    with 50x speedup or better starts to have a very
    compelling price/performance case.
  • Higher clock speeds, more hardware multipliers,
    floating-point hardware, more slices.
  • At least there is room to grow!

48
The future: software
  • Not yet mature enough: MAP C is better than
    Verilog, but not close enough to C.
  • Library support is lacking.
  • Adoption will be limited until programming
    becomes as easy as it is for clusters.
  • Currently, getting performance gains involves a
    lot of thinking about the hardware; abstracting
    this without ruining performance is going to be
    difficult.
  • But in the last 5 years, big gains have been
    made, so there is hope for the future.