Reconfigurable computing: a new supercomputing paradigm

1
Reconfigurable computing - a new supercomputing
paradigm
  • Walid Najjar
  • Computer Science and Engineering
  • University of California Riverside

2
ROCCC
  • Riverside Optimizing Compiler for Configurable
    Computing
  • Code acceleration
  • By mapping of circuits to FPGA
  • Achieve same speed as hand-written VHDL codes
  • Improved productivity
  • Allows design and algorithm space exploration
  • Keeps the user fully in control
  • We automate only what is very well understood

3
FPGA: A New HPC Platform?
David Strenski, FPGAs Floating-Point Performance
-- a pencil and paper evaluation, in HPCwire.com
  • Comparing a dual-core Opteron to an FPGA on
    floating-point (fp) performance
  • Opteron: 2.5 GHz, 1 add and 1 mult per cycle per
    core: 2.5 x 2 x 2 = 10 Gflops
  • FPGAs: Xilinx V4 and V5 with DSP cores
  • Balanced allocation of double-precision (dp) fp
    adders, multipliers and registers
  • Use both DSP and logic for multipliers, run at
    lower speed
  • Logic for I/O interfaces

4
Balanced Designs
  • Same number of mults as adds (matrix
    multiplication).
  • Double precision
  • Higher percentage of peak on FPGA (streaming)
  • 1/3 of the power!

5
Challenges
  • FPGA is an amorphous mass of logic
  • Languages reflect the von Neumann execution model

6
ROCCC Overview
[Figure: compiler flow. Stages: high level transformations
(procedure, loop and array optimizations), low level
transformations (instruction scheduling, pipelining and
storage optimizations), code generation. Intermediate forms:
Hi-CIRRF, Lo-CIRRF. Language labels: C/C++, Java, SystemC.]
CIRRF: Compiler Intermediate Representation for
Reconfigurable Fabrics
  • Limitations on the code
  • No recursion
  • No pointers

7
Focus
  • Extensive compile time optimizations
  • Maximize parallelism, speed and throughput
  • Minimize area and memory accesses
  • Optimizations
  • Loop level: fine grained parallelism
  • Storage level: compiler configured storage for
    data reuse
  • Circuit level: expression simplification,
    pipelining

8
Execution Model
  • A simplified model
  • Decoupled memory access from datapath
  • Parallel loop iterations
  • Pipelined datapath

9
So far, working compiler with
  • Extensive compiler optimizations and
    transformations
  • Analysis and hardware support for data reuse
  • Efficient code generation and pipelining
  • Import of existing IP cores
  • Support for dynamic partial reconfiguration

10
So far, working compiler with
  • Extensive compiler optimizations and
    transformations

Loop, array and procedure transformations. Maximize
clock speed and parallelism, within
resources. Under user control.
11
High Level Transformations
12
So far, working compiler with
  • Analysis and hardware support for data reuse

Smart buffer technique reduces off-chip memory
accesses by > 98%
13
So far, working compiler with
  • Efficient code generation and pipelining

Clock speed comparable to hand-written HDL codes
14
So far, working compiler with
  • Import of existing IP cores

Huge wealth of existing IP cores. Wrapper makes
core look like a function call in C code.
15
So far, working compiler with
  • Support for dynamic partial reconfiguration

Dynamic partial reconfiguration (DPR) allows
reconfiguration of a subset of the FPGA, dynamically,
under software control. Reduces configuration overhead.
16
Simple example
  • 5-tap FIR: B[i] = 3*A[i] + 5*A[i+1] + 7*A[i+2]
    + 9*A[i+3] + 11*A[i+4]

    #define N 516
    void begin_hw();
    void end_hw();
    int main()
    {
      int i;
      const int T[5] = {3, 5, 7, 9, 11};   /* filter coefficients */
      int A[N], B[N];
      begin_hw();
      L1: for (i = 0; i < (N-5); i = i+1)
        B[i] = T[0]*A[i] + T[1]*A[i+1] + T[2]*A[i+2]
             + T[3]*A[i+3] + T[4]*A[i+4];
      end_hw();
    }

17
Lo-CIRRF Viewer
Example: 3-tap FIR, unrolled once (two concurrent
iterations)
[Figure labels: indices of A; coefficients]

    int main()
    {
      int i;
      int A[32];
      int B[32];
      for (i = 0; i < 28; i = i+1)
        B[i] = 3*A[i] + 5*A[i+1] + 7*A[i+2];
    }
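What "unrolled once (two concurrent iterations)" means for the 3-tap FIR can be sketched in plain C. This is only an illustration of the transformation, not ROCCC output; the function names `fir3_rolled` and `fir3_unrolled2` are hypothetical, and the unrolled version assumes an even number of outputs.

```c
#include <stddef.h>

/* Rolled 3-tap FIR: one output per iteration, three reads of A each time. */
void fir3_rolled(const int *A, int *B, size_t n_out)
{
    for (size_t i = 0; i < n_out; i++)
        B[i] = 3 * A[i] + 5 * A[i + 1] + 7 * A[i + 2];
}

/* Unrolled once: two outputs per iteration. The four loaded values
   a0..a3 cover both windows, so A[i+1] and A[i+2] are read once and
   reused by both outputs. Assumes n_out is even (sketch only). */
void fir3_unrolled2(const int *A, int *B, size_t n_out)
{
    for (size_t i = 0; i + 1 < n_out; i += 2) {
        int a0 = A[i], a1 = A[i + 1], a2 = A[i + 2], a3 = A[i + 3];
        B[i]     = 3 * a0 + 5 * a1 + 7 * a2;
        B[i + 1] = 3 * a1 + 5 * a2 + 7 * a3;
    }
}
```

With 28 outputs, the rolled loop issues 84 reads of A, while the unrolled one issues 56. This is the kind of data reuse that wider unrolling and on-chip buffering can push further.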
18
RC Platform Models
[Figure: three platform models, numbered 1 to 3.
Labels: Fast Network, CPU, FPGA, Memory.]
19
Platforms for RC
  • SGI Altix 4700
  • Shared memory machine, fast interconnect:
    12.8 GB/sec
  • Itanium 2, 1.6 GHz
  • RASC RC100 Blade: 2 Virtex 4 LX200
  • Xtremedata XD1000
  • Altera Stratix II, drop-in for AMD Opteron
  • Integrated interface to HyperTransport:
    16 bits @ 800 M transfers/sec
  • Memory interface: 128 bits DDR-333, up to
    4 x 4 GB ECC
  • Flash memory: for FPGA configuration or data

20
SGI RASC RC100 Blade
[Figure: block diagram of the blade. Labels: V4LX200 (two),
SRAM banks, SSP, TIO, NL4, PCI, Selmap, Loader.]
21
RC 100 Blade
22
Xtremedata XD1000
23
XD 1000
24
XD 1000 (drop-in)
25
Examples
  • Molecular dynamics
  • Computes the forces exerted by atoms on atoms in
    a molecule and its environment
  • Time step: 1 femtosecond
  • Bioinformatics
  • Exact string edit distance computation
  • Using Smith-Waterman, a dynamic programming
    algorithm
  • Similar: dynamic time warping, motif discovery

26
Molecular Dynamics
  • Objective
  • Determine the shape of a molecule by computing
    the forces exerted on each atom by all other
    atoms, in the molecule and its environment.
  • N-body problem.
  • Forces
  • Electrostatic (Coulomb)
  • Van der Waals
  • Importance
  • Computationally intensive
  • Months and years of compute time for small
    problems
  • Impact: move biochemistry to digital simulation
  • Ultimate goal: protein folding

27
Algorithm
For every atom I in system
    for each other atom J in system
        compute the forces exerted by atom J on atom I
    sum all the forces
    compute its next position
Repeat until stable
  • Of course, not all forces have meaningful values
    (they fall off as 1/d^2)
  • More complex calculations on the boundaries
  • One loop body with 60 variants!
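The doubly nested force loop can be sketched in C as follows. This is a minimal illustration, not NAMD code. As an assumption, the pair kernel is a Lennard-Jones term (a common model for the Van der Waals force) with both of its constants set to 1, and a cutoff distance stands in for "not all forces have meaningful values". The names `vec3`, `pair_force` and `compute_forces` are hypothetical, and the position update and boundary variants are left out.

```c
#include <stddef.h>

typedef struct { double x, y, z; } vec3;

/* Hypothetical pair kernel: Lennard-Jones force on atom i due to atom j,
   with epsilon = sigma = 1 (assumed units). Only even powers of the
   distance appear, so no square root is needed. */
static vec3 pair_force(vec3 ri, vec3 rj, double cutoff2)
{
    vec3 f = {0.0, 0.0, 0.0};
    double dx = ri.x - rj.x, dy = ri.y - rj.y, dz = ri.z - rj.z;
    double d2 = dx * dx + dy * dy + dz * dz;
    if (d2 == 0.0 || d2 > cutoff2)
        return f;                     /* beyond the cutoff: ignored */
    double s2 = 1.0 / d2;             /* (sigma/d)^2 */
    double s6 = s2 * s2 * s2;         /* (sigma/d)^6 */
    double scale = 24.0 * (2.0 * s6 * s6 - s6) * s2;
    f.x = scale * dx;
    f.y = scale * dy;
    f.z = scale * dz;
    return f;
}

/* For every atom i, sum the forces exerted by every other atom j. */
void compute_forces(const vec3 *pos, vec3 *force, size_t n, double cutoff)
{
    double cutoff2 = cutoff * cutoff;
    for (size_t i = 0; i < n; i++) {
        vec3 sum = {0.0, 0.0, 0.0};
        for (size_t j = 0; j < n; j++) {
            if (j == i)
                continue;
            vec3 f = pair_force(pos[i], pos[j], cutoff2);
            sum.x += f.x;
            sum.y += f.y;
            sum.z += f.z;
        }
        force[i] = sum;
    }
}
```

The inner loop body is the part that would become the hardware data path. The outer time-step loop ("repeat until stable") would remain on the host.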

28
Nanoscale Molecular Dynamics
  • NAMD
  • MD code designed for high-performance simulation
    of large biomolecular systems
  • Double precision floating-point
  • Critical loop
  • Computes the forces: 82% of execution time
  • 60 variants to compute boundary conditions
  • Forces computed in X, Y and Z dimensions
  • 52 FP operations per loop body

29
Characteristics of NAMD
  • Required bytes per iteration
  • Single precision (sp): 48 bytes
  • Double precision (dp): 96 bytes
  • RASC: 6.4 GB/s

30
NAMD Results
  • Itanium
  • Ideal: one full EPIC instruction per cycle
  • Measured: actual execution time
  • FPGA
  • Enough bandwidth for single precision
  • Double precision: two cycles to fetch the data
    for each iteration
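The bandwidth claim can be checked with a few lines of arithmetic. The sketch below assumes a 100 MHz FPGA clock, which the NAMD slides do not state (that figure is quoted only for the Smith-Waterman design), and takes 1 GB as 10^9 bytes. The function names are hypothetical.

```c
#include <stdint.h>

/* Bytes the memory system can deliver per clock cycle. */
uint64_t bytes_per_cycle(uint64_t bandwidth_bytes_per_s, uint64_t clock_hz)
{
    return bandwidth_bytes_per_s / clock_hz;
}

/* Cycles needed to fetch the operands of one loop iteration (rounded up). */
uint64_t cycles_per_iteration(uint64_t bytes_needed, uint64_t per_cycle)
{
    return (bytes_needed + per_cycle - 1) / per_cycle;
}
```

Under those assumptions, 6.4 GB/s gives 64 bytes per cycle. The 48 bytes of a single-precision iteration then fit in one cycle, while the 96 bytes of a double-precision iteration need two.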

31
Smith Waterman Algorithm
  • Dynamic programming string matching algorithm
    used widely in genetics-related research.
  • Computes a matching score of two input strings S
    and T using a 2D matrix.
  • Computation of each cell depends on the computed
    values of three neighboring cells: north, west
    and northwest.
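The three-neighbor dependence can be seen in a minimal software version of the recurrence. This is a sketch only: the scoring values are passed in by the caller and are assumptions, the gap penalty is linear, and the function name `sw_score` is hypothetical. Only two matrix rows are kept, since a cell never looks further back than the previous row.

```c
#include <stdlib.h>
#include <string.h>

static int max2(int a, int b) { return a > b ? a : b; }

/* Local-alignment score of s and t. match, mismatch and gap are
   caller-supplied (gap is a negative linear penalty).
   Returns -1 if memory allocation fails. */
int sw_score(const char *s, const char *t, int match, int mismatch, int gap)
{
    size_t m = strlen(s), n = strlen(t);
    int *prev = calloc(n + 1, sizeof *prev);   /* row i-1 */
    int *curr = calloc(n + 1, sizeof *curr);   /* row i   */
    int best = 0;
    if (!prev || !curr) {
        free(prev);
        free(curr);
        return -1;
    }
    for (size_t i = 1; i <= m; i++) {
        curr[0] = 0;
        for (size_t j = 1; j <= n; j++) {
            int sub = (s[i - 1] == t[j - 1]) ? match : mismatch;
            int nw = prev[j - 1] + sub;        /* northwest neighbor */
            int no = prev[j] + gap;            /* north neighbor */
            int we = curr[j - 1] + gap;        /* west neighbor */
            curr[j] = max2(0, max2(nw, max2(no, we)));
            best = max2(best, curr[j]);
        }
        int *tmp = prev;                       /* swap the two rows */
        prev = curr;
        curr = tmp;
    }
    free(prev);
    free(curr);
    return best;
}
```

For example, `sw_score("ACGT", "ACGT", 2, -1, -1)` scores four matches. Cells on the same anti-diagonal do not depend on each other, which is what makes a wave-front hardware schedule possible.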

32-43
Smith Waterman Algorithm
44
Smith-Waterman Code
  • Dynamic Programming
  • Used in protein modeling, bio-informatics, data
    mining
  • A wave-front algorithm with two input strings
  • A[i,j] = F(A[i,j-1], A[i-1,j-1], A[i-1,j])
  • F = CostMatrix(A[i,0], A[0,j])
  • Our Approach
  • Chunk the input strings in fixed sizes k
  • Build a k x k template hardware by compiling two
    nested loops (k each) and fully unrolling both.
  • Host strip mines the two outer loops over this
    template.
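The chunking scheme can be mimicked in software to show why it preserves the result. In this sketch `fill_tile` plays the role of the k x k template and `sw_score_tiled` plays the host that strip-mines the two outer loops. Both names are hypothetical, the scoring values (+2 match, -1 mismatch, -1 gap) are assumptions, and edge tiles are simply clipped, whereas a fixed hardware template would more likely need padding.

```c
#include <stdlib.h>
#include <string.h>

static int max2(int a, int b) { return a > b ? a : b; }

/* Fill one tile of the score matrix H (row stride n + 1). The tile covers
   rows i0..i1-1 and columns j0..j1-1. Its north and west neighbors must
   already be filled. Returns the largest score inside the tile. */
static int fill_tile(int *H, size_t n, const char *s, const char *t,
                     size_t i0, size_t i1, size_t j0, size_t j1)
{
    int best = 0;
    for (size_t i = i0; i < i1; i++) {
        for (size_t j = j0; j < j1; j++) {
            int sub = (s[i - 1] == t[j - 1]) ? 2 : -1;
            int v = H[(i - 1) * (n + 1) + (j - 1)] + sub;   /* northwest */
            v = max2(v, H[(i - 1) * (n + 1) + j] - 1);      /* north */
            v = max2(v, H[i * (n + 1) + (j - 1)] - 1);      /* west */
            v = max2(v, 0);
            H[i * (n + 1) + j] = v;
            best = max2(best, v);
        }
    }
    return best;
}

/* Host side: walk the matrix in k x k chunks, row of tiles by row of tiles,
   so every tile finds its borders ready. Returns -1 on bad k or failed
   allocation. */
int sw_score_tiled(const char *s, const char *t, size_t k)
{
    size_t m = strlen(s), n = strlen(t);
    if (k == 0)
        return -1;
    int *H = calloc((m + 1) * (n + 1), sizeof *H);  /* row 0, column 0 stay 0 */
    if (!H)
        return -1;
    int best = 0;
    for (size_t i0 = 1; i0 <= m; i0 += k) {
        size_t i1 = (i0 + k <= m + 1) ? i0 + k : m + 1;
        for (size_t j0 = 1; j0 <= n; j0 += k) {
            size_t j1 = (j0 + k <= n + 1) ? j0 + k : n + 1;
            best = max2(best, fill_tile(H, n, s, t, i0, i1, j0, j1));
        }
    }
    free(H);
    return best;
}
```

The score is independent of the chunk size k. Only the amount of work handed to the template per call changes.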

45
S-W View
46
After (many) Transformations
  • Transformations
  • Loop unrolling
  • Scalar replacement
  • Feedback store elimination
  • > 70 passes

47
Systolic execution
48
SW Performance
49
SW Potential on the RASC
  • 100 MHz clock
  • 3 cores of 2K cells each, 72% of FPGA area
  • 3 x 2K x 100 MHz = 600 Gcups
  • Speedup over Itanium: 7140

50
Productivity Speedup
A ratio of 1,000
Productivity speedup: 10x to 100x
51
Impact?
Performance time
Programmability time
Programmability Performance 2
Commoditization of HPC will take science and
engineering to a new revolution
52
Conclusion
  • FPGAs: a viable platform for supercomputing
  • Including single and double precision fp
  • Main challenge is their programmability
  • ROCCC shown as bridging the gap between HLL
    program representation and circuit instantiation
  • A new paradigm: deskside supercomputing

53
  • Thank you
  • www.cs.ucr.edu/roccc

54
Intrusion Detection: Bloom filter
  • Bloom filter
  • Is a data structure used to test set membership
    of an element
  • Has an array of N elements, all of which are set
    to 0 initially
  • Members of the set are inserted in the filter
    using multiple hash functions; each returns a
    value in the range 0 to N-1, and the
    corresponding locations are set to 1.

55
Search operation in a bloom filter
  • During a search operation, multiple hash
    functions are applied to an incoming value.
  • If all the locations returned by the hash
    functions contain 1, then the element belongs to
    the set with a probability P.
  • Probability of a false positive: the standard
    approximation is (1 - e^(-Kn/m))^K, where
  • K: number of hash functions
  • m: number of bits in the Bloom filter array
  • n: number of elements inserted into the Bloom
    filter
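Insertion and search can be sketched in a few lines of C. This is a software illustration only, not the generated hardware. The array size, the number of hash functions and the FNV-1a based hash family are all assumptions chosen for the example, and the names `bloom_t`, `bloom_insert` and `bloom_query` are hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   1024u   /* size of the bit array (assumed) */
#define BLOOM_HASHES 4u      /* number of hash functions (assumed) */

typedef struct { uint8_t bits[BLOOM_BITS / 8]; } bloom_t;

/* Stand-in hash family: FNV-1a with a per-function seed mixed in.
   Each call returns a location in the range 0..BLOOM_BITS-1. */
static uint32_t bloom_hash(const uint8_t *data, size_t len, uint32_t seed)
{
    uint32_t h = 2166136261u ^ (seed * 0x9E3779B1u);
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h % BLOOM_BITS;
}

/* All locations start at 0. */
void bloom_init(bloom_t *b) { memset(b->bits, 0, sizeof b->bits); }

/* Insert: set the bit at every location the hash functions return. */
void bloom_insert(bloom_t *b, const void *data, size_t len)
{
    for (uint32_t k = 0; k < BLOOM_HASHES; k++) {
        uint32_t loc = bloom_hash(data, len, k);
        b->bits[loc / 8] |= (uint8_t)(1u << (loc % 8));
    }
}

/* Search: returns 1 if every probed bit is set (a member, or a false
   positive), 0 if any bit is clear (definitely not a member). */
int bloom_query(const bloom_t *b, const void *data, size_t len)
{
    for (uint32_t k = 0; k < BLOOM_HASHES; k++) {
        uint32_t loc = bloom_hash(data, len, k);
        if (!(b->bits[loc / 8] & (1u << (loc % 8))))
            return 0;
    }
    return 1;
}
```

A query that returns 0 is always correct. A query that returns 1 still has to be confirmed by an exact comparison if false positives matter.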

56
Bloom filter for virus detection
  • Signature Processing Engine (SPE) contains the
    generated bloom filter code
  • Bloom filter output contains false positives.
    Hence a RAM is used for absolute string
    comparison and to eliminate false positives.

Legend: SPE = Signature Processing Engine,
FPE = False Positive Eliminator
57
Virus signatures
  • Source: the virus rules in the Bleeding Snort database.
  • Each rule consists of a rule header and an
    option.
  • Header contains information to be used in packet
    classification.
  • Rule option contains the signatures to be used in
    intrusion detection.
  • Most of the signatures in bleeding-snort database
    were under 32 bytes.

58
Bloom Filter C Code
    for (i = 0; i < 248; i++) {
      for (j = 0; j < 7; j++) {
        value = input_stream[i][j];
        for (k = 0; k < 7; k++) {
          temp = value & 0x1;   /* current low bit of the input */
          result_location1 = result_location1 ^ (hash_function1[k] * temp);
          result_location2 = result_location2 ^ (hash_function2[k] * temp);
          result_location3 = result_location3 ^ (hash_function3[k] * temp);
          result_location4 = result_location4 ^ (hash_function4[k] * temp);
          value = value >> 1;
        }
      }
      /* member only if all four probed bits are set */
      found = bit_array[result_location1] & bit_array[result_location2]
            & bit_array[result_location3] & bit_array[result_location4];
    }

Compile-time constant, folded
In data-path table lookup
59
Datapath Analysis
  • Compiler exploits ILP by grouping instructions
    into different execution levels.
  • Each level corresponds to a loop iteration and
    the instructions are executed simultaneously.
  • ROCCC automatically places latches for pipelining

Each latched level corresponds to one pipeline
stage and has a delay of one cycle. In the
3-stage pipeline, each XOR box corresponds to
one byte of input being XORed with a hashing
function.
60
Throughput evaluation
  • Clock frequency of the synthesized circuit is
    73MHz.
  • The BRAM on our target FPGA can process 32 bytes
    per cycle.
  • Throughput = bits per cycle x clock frequency
  • 32 x 8 x 73 x 1,000,000 bits/sec
  • about 18.7 Gbps