Application-Specific Memory Interleaving Enables High Performance in FPGA-based Grid Computations

FPGAs: Technological opportunity

Traditional memory interleaving for broad parallelism: in use since the 1960s
- Generic: Designed to avoid application specifics
- Fixed bus: All applications use the same memory interface
- Expensive: High hardware costs, accessible only for major processor designs

FPGAs for memory interleaving: ideal technological match
- Customizable: Can adapt to arbitrary application characteristics. Customization is not just permitted; it is inherent and compulsory
- Configurable: Unique interleaving structure for each application. Multiple different structures for different parts of one application
- Free (almost): 10s to 100s of independently addressable RAM busses. On-chip bus widths of 100s to 1000s of bits. Cheap, fast logic for address generation and de-interleaving networks

FPGA-based computation is an emerging field
- Does not have software's huge base of widely applicable techniques
- Needs to develop a cookbook of reusable computation structures

Grid computation: candidate for acceleration

Many applications in molecular dynamics, physics, molecule docking, Perlin noise, image processing

Computation characteristics being addressed
- Cluster of grid points needed at each step
- Grid cells accessed in irregular order. Invalidates typical schemes for reusing data
- Working set fits into the FPGA's on-chip RAM

Implementing for FPGA computation
- Allows reconsideration of the whole algorithm. Optimal FPGA algorithms are commonly very different from sequential implementations
- Developer has access to the algorithm's logical indexing scheme. Extra design information in 2,3,... dimensional indexing, before flattening into RAM addresses

FPGAs support massive, fine-grained parallelism in the computation pipeline
- Often throttled by serial access to RAM operands
- Goal: Fetch enough operands to fill the width of the computation array

Bilinear interpolation for computing off-grid
points
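
As a reference point for the access pattern, bilinear interpolation reads a 2x2 cluster of grid points around each off-grid location. The Python sketch below is illustrative only: the function name and the list-of-rows grid layout are assumptions, not taken from the presentation.

```python
import math


def bilinear(grid, x, y):
    """Interpolate grid (a list of rows, indexed grid[iy][ix]) at real (x, y).

    Assumes 0 <= x < width - 1 and 0 <= y < height - 1, so that the whole
    2x2 cluster lies inside the grid.
    """
    ix, iy = math.floor(x), math.floor(y)
    fx, fy = x - ix, y - iy  # fractional position inside the grid cell

    # The 2x2 access cluster: four operands needed for one result.
    v00 = grid[iy][ix]
    v10 = grid[iy][ix + 1]
    v01 = grid[iy + 1][ix]
    v11 = grid[iy + 1][ix + 1]

    # Interpolate along x on both rows, then along y between the rows.
    top = v00 * (1 - fx) + v10 * fx
    bottom = v01 * (1 - fx) + v11 * fx
    return top * (1 - fy) + bottom * fy
```

All four operands are needed before the arithmetic can start, which is why fetching them from a single RAM port one at a time would throttle the pipeline.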
Implementation technique

1. Define the application's access cluster. Convert to rectangular array.
2. Round up to power-of-2 bounding box. RAM banks indexed by (X, Y) mod 4.
3. Address generation: Map access cluster to grid. Handle wraparound: (X, Y) / 4 vs. (X, Y) / 4 + 1.
4. De-interleaving: Map RAM banks to outputs.

[Figure: the four steps applied to an example access cluster; cell labels 0, 1, 1a-1d, 2a-2d, 3a-3d]
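
The four steps can be modeled in software. The following is a minimal Python sketch, assuming a 4x4 bank array, grid dimensions that are multiples of 4, and a rectangular read cluster of at most 4x4 cells. The class and method names are hypothetical, and the sequential loop stands in for banks that an FPGA would read in the same cycle.

```python
N = 4  # banks per axis; a power of 2, as in step 2


class InterleavedGrid:
    """Software model of a 2D grid spread across an N x N array of RAM banks."""

    def __init__(self, width, height):
        assert width % N == 0 and height % N == 0
        self.width, self.height = width, height
        self.cols = width // N   # words per row inside each bank
        self.rows = height // N  # rows inside each bank
        self.banks = [[[0] * (self.cols * self.rows) for _ in range(N)]
                      for _ in range(N)]

    def write(self, x, y, value):
        # Bank select comes from (x, y) mod N, in-bank address from (x, y) div N.
        self.banks[x % N][y % N][(y // N) * self.cols + (x // N)] = value

    def read_cluster(self, x0, y0, w, h):
        """Return h rows of w cells, starting at grid cell (x0, y0)."""
        assert 1 <= w <= N and 1 <= h <= N
        assert 0 <= x0 and x0 + w <= self.width
        assert 0 <= y0 and y0 + h <= self.height

        # Step 3, address generation: every bank gets its own address.
        # Banks "behind" the cluster origin have wrapped around, so they
        # use (x0 div N) + 1 instead of (x0 div N); likewise for y.
        fetched = {}
        for bx in range(N):
            for by in range(N):
                ax = x0 // N + (1 if bx < x0 % N else 0)
                ay = y0 // N + (1 if by < y0 % N else 0)
                if ax < self.cols and ay < self.rows:  # skip banks past the grid edge
                    fetched[(bx, by)] = self.banks[bx][by][ay * self.cols + ax]

        # Step 4, de-interleaving: route each bank's word to the cluster
        # position (dx, dy) it belongs to.
        return [[fetched[((x0 + dx) % N, (y0 + dy) % N)] for dx in range(w)]
                for dy in range(h)]
```

For example, a 2x2 read at (3, 2) on an 8x8 grid takes column 3 from bank column 3 at address 0 and column 4 from bank column 0 at address 1, which is the wraparound case named in step 3.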
The general case, not just limited to 1D or 2D arrays

[Figure: block diagram with index inputs X and Y, split into MSBs and LSBs; blocks: Address generation, RAM array, De-interleave; outputs A, B, C, D]
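
For power-of-2 bank counts, the split between bank selection and in-bank address is a bit slice of each index, in any number of dimensions. The sketch below works under that assumption; the function names are hypothetical and the per-bank loop again stands in for parallel hardware.

```python
from itertools import product


def split_index(coords, bank_bits):
    """Split each coordinate into (LSBs, MSBs): bank select and in-bank address."""
    mask = (1 << bank_bits) - 1
    bank = tuple(c & mask for c in coords)
    addr = tuple(c >> bank_bits for c in coords)
    return bank, addr


def flatten(addr, extents):
    """Row-major flattening of a multi-dimensional in-bank address."""
    flat = 0
    for a, n in zip(addr, extents):
        flat = flat * n + a
    return flat


def cluster_addresses(base, bank_bits):
    """For a cluster anchored at `base`, return the address tuple that each
    bank should be given.

    Along each axis, a bank whose index is below the base's LSBs holds the
    wrapped-around part of the cluster, so its MSB address is incremented.
    """
    n = 1 << bank_bits
    base_bank, base_addr = split_index(base, bank_bits)
    return {
        bank: tuple(a + (1 if b < bb else 0)
                    for a, b, bb in zip(base_addr, bank, base_bank))
        for bank in product(range(n), repeat=len(base))
    }
```

With `bank_bits = 2` and a 3D base of (5, 9, 14), the bank at (0, 0, 0) is addressed at (2, 3, 4), which corresponds to grid cell (8, 12, 16), inside the 4x4x4 window that starts at the base.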
Variations and extensions

Dimensions: 1, 2, 3, ...
- Adapts easily to dimensionality
- Adapts easily to cluster size and shape

Can use non-power-of-2 memory arrays
- LSBs become X mod NX: efficient implementations for modest NX
- MSBs become X div NX: efficient implementations using block multipliers

Allows a wide range of design tradeoffs
- Logic and multipliers vs. RAMs
- Latency vs. hardware
- Take advantage of dual-ported RAMs, when available
- Allocate less hardware to small grids

Optimize de-interleaving multiplexers
- A 4x4x4 RAM array requires 64:1 output multiplexers
- Implement efficiently as three layers of 4:1 multiplexers

Write port design choices
- Can use dual-ported RAM for non-interfering, concurrent read and write
- Write single words or clusters; the write cluster need not be the same shape as the read cluster

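
Two of the points above lend themselves to a small software model: splitting a 64:1 multiplexer into three layers of 4:1 multiplexers, and computing X div NX with a multiplier instead of a divider. The Python sketch below is illustrative only. The function names are hypothetical, the bank ordering in `mux64_layered` is an assumption, and the multiply-and-shift division is the standard reciprocal-constant technique, swapped in here as one plausible use of block multipliers rather than the authors' stated method.

```python
def mux4(inputs, sel):
    """Model of one 4:1 multiplexer."""
    assert len(inputs) == 4 and 0 <= sel < 4
    return inputs[sel]


def mux64_layered(banks, sx, sy, sz):
    """Select one of 64 bank outputs using three layers of 4:1 multiplexers.

    Assumes banks is a flat list ordered so that index = (z * 4 + y) * 4 + x.
    """
    layer1 = [mux4(banks[i:i + 4], sx) for i in range(0, 64, 4)]   # 16 muxes
    layer2 = [mux4(layer1[i:i + 4], sy) for i in range(0, 16, 4)]  # 4 muxes
    return mux4(layer2, sz)                                        # 1 mux


def reciprocal_constants(nx, width):
    """Constants for dividing any `width`-bit value by the constant nx."""
    shift = width + (nx - 1).bit_length()  # width + ceil(log2(nx))
    mult = -(-(1 << shift) // nx)          # ceil(2**shift / nx)
    return mult, shift


def div_mod_const(x, nx, mult, shift):
    """Return (x div nx, x mod nx) using one multiply, one shift, one subtract."""
    q = (x * mult) >> shift
    return q, x - q * nx
```

Each output needs 16 + 4 + 1 = 21 small multiplexers in this arrangement, and each layer is steered by the select value for one axis.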
Automation

Java program: initial version available
See http://www.bu.edu/caadlab/publications for source code and documentation

Sample input for hex grid example above:
    nameHexGrid                                 Name of VHDL component
    axishoriz axisvert                          Symbol names for axis indices
    databits16                                  Width of individual word
    outputB1,0,1 outputA2,1,0                   Define the access cluster
    outputB2,1,1 outputC2,1,2
    outputA3,2,0 outputB3,2,1 outputC3,2,2
    testSize150,75                              Grid size for test bench

VHDL output:
    HexGrid.vhdl               Synthesizable entity definition
    HexGrid_def.vhdl           Declaration package
    HexGrid_test_driver.vhdl   Test bench, confirms implementation
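
To connect the sample input to steps 1 and 2 of the implementation technique, the sketch below computes the bounding box of an access cluster and rounds it up to a power-of-2 bank array. It is written in Python for illustration and is not the Java tool. Reading each output entry as an output name followed by two grid offsets is an assumption about the input format, and the function name is hypothetical.

```python
# Assumed reading of the sample input: output name -> offsets along the two axes.
HEX_CLUSTER = {
    "B1": (0, 1),
    "A2": (1, 0), "B2": (1, 1), "C2": (1, 2),
    "A3": (2, 0), "B3": (2, 1), "C3": (2, 2),
}


def bank_array_shape(cluster):
    """Return (bounding box extents, extents rounded up to powers of 2)."""
    points = list(cluster.values())
    dims = len(points[0])
    lo = [min(p[d] for p in points) for d in range(dims)]
    hi = [max(p[d] for p in points) for d in range(dims)]
    box = [h - l + 1 for l, h in zip(lo, hi)]
    # (n - 1).bit_length() is ceil(log2(n)) for n >= 1.
    banks = [1 << (n - 1).bit_length() for n in box]
    return box, banks
```

Under that reading, the seven-point hex cluster has a 3x3 bounding box and would use a 4x4 bank array, with seven of the sixteen banks contributing to the outputs on any one access.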
[Figure labels: Y01, X01, Z01, Output]

Tom VanCourt, Martin Herbordt
{tvancour, herbordt} _at_ bu.edu

Barnes, George H., Richard M. Brown, Maso Kato, David J. Kuck, Daniel L. Slotnick, and Richard A. Stokes. The Illiac IV Computer. IEEE Transactions on Computers 17(8), August 1968.

Böhm, A.P.W., B. Draper, W. Najjar, J. Hammes, R. Rinker, M. Chawathe, and C. Ross. One-step Compilation of Image Processing Applications to FPGAs. Proc. FCCM, 2001.