Title: Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers?
1Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers?
- Stephen Craven
- Cameron Patterson
- Peter Athanas
- Configurable Computing Lab
- Virginia Tech
2Outline
- Background
- Large Integer Multiplication
- GIMPS
- Algorithm Comparison
- Floating-point FFT
- All-integer FFT
- Fast Galois Transform
- Accelerator Design
- System Design
- Operation
- Performance
- Improvements & Future Work
3Large Integer Multiplication
- Complexity
- Grade school: O(N^2)
- Fourier transform: O(N log N)
- Efficient FFT-Based Multiplication
- Divide integers into sequences of smaller digits.
- 867530924601 → 86, 75, 30, 92, 46, 01
- Convolution of two sequences is equivalent to multiplication.
- Element-wise multiplication in frequency domain ⇔ time domain convolution.
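The digit-splitting idea above can be sketched in a few lines of Python. This is only an illustration: the helper names (`to_digits`, `convolve`, `multiply`) are made up for this example, and the convolution is computed directly rather than with an FFT, which would produce the same sequence faster.

```python
def to_digits(n, base=100):
    """Split n into base-`base` digits, least significant first."""
    digits = []
    while n:
        n, d = divmod(n, base)
        digits.append(d)
    return digits or [0]

def convolve(a, b):
    """Acyclic convolution computed directly (stand-in for the FFT route)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

def multiply(x, y, base=100):
    """Multiply two non-negative integers via convolution of their digits."""
    conv = convolve(to_digits(x, base), to_digits(y, base))
    # Recombining with positional weights releases the carries.
    return sum(c * base**k for k, c in enumerate(conv))

# The slide's example, most significant digit first:
print(to_digits(867530924601)[::-1])
```

Each convolution output may exceed the base, which is why a carry-release pass is needed before the result is a proper digit sequence.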
4GIMPS
- Why multiply big numbers?
- Great Internet Mersenne Prime Search (GIMPS)
- Primality testing algorithm for Mersenne numbers (2^q - 1) requires squaring of multi-million digit numbers.
- Mersenne primes are the largest primes known; used in cryptography.
- Large integer convolution
- Performance comparison of Pentiums and FPGAs in traditional floating-point domains.
- Lucas-Lehmer Primality Test
- Mq = 2^q - 1, v = 4
- for i = 1 to q-2,
- v = v^2 - 2 (mod Mq)
- if v = 0, Mq is prime
- else, Mq is composite
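The Lucas-Lehmer loop above translates almost line for line into Python. This is a minimal sketch using built-in big integers; the function name `lucas_lehmer` is chosen here, and the test as written assumes q is an odd prime.

```python
def lucas_lehmer(q):
    """Return True if M_q = 2**q - 1 is prime (q assumed to be an odd prime)."""
    m = (1 << q) - 1
    v = 4
    for _ in range(q - 2):
        # One iteration = one large squaring; this is the step the accelerator targets.
        v = (v * v - 2) % m
    return v == 0

for q in (3, 5, 7, 11, 13):
    print(q, lucas_lehmer(q))
```

For multi-million digit candidates the squaring dominates the run time, which is why the rest of the talk concentrates on fast convolution.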
5Discrete Weighted Transform
- Discrete Weighted Transform (DWT)
- Variable base: each sequence digit can contain differing numbers of bits.
- Creates power-of-two sequence needed by FFT.
- Eliminates need to zero pad to convert cyclic, FFT-based convolution into acyclic convolution needed for squaring.
- Steps
- Number to be multiplied divided into variable-length digits.
- Sequence multiplied by a weight sequence.
- FFT performed on new, power-of-two length weighted sequence.
- Example for Mq = 2^37 - 1 with FFT length of 4
- Bits / digit = 10, 9, 9, 9
- To square 78,314,567,209 (mod Mq), our sequence would be 553, 93, 381, 291
- 553 + 93*2^10 + 381*2^19 + 291*2^28 = 78,314,567,209
- Multiply sequence by weights then FFT.
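The example above can be reproduced with a small Python sketch. The slides do not spell out the weight sequence, so the weights below, 2^(ceil(q*j/N) - q*j/N), are an assumption taken from the standard Crandall-Fagin formulation of the DWT. The function names are illustrative, a naive O(N^2) DFT stands in for a real FFT, and the input is assumed to be smaller than 2^q.

```python
import cmath

def digit_bounds(q, n):
    """Bit offsets ceil(q*j/n) for j = 0..n; digit j spans bounds[j]..bounds[j+1]."""
    return [-(-q * j // n) for j in range(n + 1)]

def to_digits(x, bounds):
    """Split x (< 2**q) into variable-width digits, least significant first."""
    return [(x >> lo) & ((1 << (hi - lo)) - 1)
            for lo, hi in zip(bounds, bounds[1:])]

def dft(seq, sign):
    """Naive DFT; sign=-1 is the forward transform, sign=+1 the unscaled inverse."""
    n = len(seq)
    return [sum(v * cmath.exp(sign * 2j * cmath.pi * m * k / n)
                for k, v in enumerate(seq)) for m in range(n)]

def square_mod_mersenne(x, q, n):
    """Square x modulo 2**q - 1 using a weighted cyclic convolution of length n."""
    bounds = digit_bounds(q, n)
    digits = to_digits(x, bounds)
    # Assumed weights (Crandall-Fagin): 2**(ceil(q*j/n) - q*j/n)
    weights = [2.0 ** (bounds[j] - q * j / n) for j in range(n)]
    spectrum = dft([d * w for d, w in zip(digits, weights)], -1)
    conv = dft([s * s for s in spectrum], +1)
    # Scale the inverse transform, remove the weights, round to integers.
    unweighted = [round(c.real / n / w) for c, w in zip(conv, weights)]
    # Carry release and Mersenne reduction, done here with big integers.
    return sum(v << bounds[j] for j, v in enumerate(unweighted)) % ((1 << q) - 1)

bounds = digit_bounds(37, 4)
print(to_digits(78314567209, bounds))   # the slide's digit sequence
print(square_mod_mersenne(78314567209, 37, 4))
```

Because the cyclic wrap-around of the convolution coincides with reduction modulo 2^q - 1, no zero padding is needed, which matches the claim on the slide.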
6Objective
- Compare performance of Pentium processors to FPGAs.
- GIMPS chosen because highly optimized code exists.
- GIMPS utilizes fast floating-point performance of Pentiums.
- Xilinx Virtex-II Pro 100 (2VP100) chosen as target device.
- Largest available 2VP device.
- Contains 444 17x17 unsigned multipliers
- 888 kB of embedded Block RAM
- Target 12 million digit numbers.
- Reward for first prime above 10 million digits.
7Floating-point FFT
- GIMPS implementation uses floating-point; requires round-off error checks.
- Using near double-precision floating-point (51-bit mantissa)
- 49 real multipliers can be placed on 2VP100
- 12 complex multipliers
- 12 million digit number → 2 million point FFT
- 44 million complex multiplies → 3.7 million cycles
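The figures on this slide can be reproduced with a short calculation. The breakdown is a guess at how the numbers were derived, not something the slides state: a 51-bit mantissa product is assumed to take a 3 x 3 array of 17-bit multipliers, a complex multiply is assumed to take four real multiplies, the FFT length is taken as 2^21, and one complex multiply is counted per butterfly for a forward and an inverse transform.

```python
HARD_MULTIPLIERS = 444      # 17x17 multipliers on the 2VP100 (from the slides)
BLOCKS_PER_REAL = 3 * 3     # assumption: 51-bit mantissa = 3 x 17-bit limbs
REAL_PER_COMPLEX = 4        # assumption: schoolbook complex multiply

real_multipliers = HARD_MULTIPLIERS // BLOCKS_PER_REAL
complex_multipliers = real_multipliers // REAL_PER_COMPLEX

fft_points = 2**21                               # assumption: "2 million point"
stages = fft_points.bit_length() - 1             # log2 of the FFT length
butterflies_per_fft = (fft_points // 2) * stages
complex_multiplies = 2 * butterflies_per_fft     # forward + inverse transform
cycles = complex_multiplies / complex_multipliers

print(real_multipliers, complex_multipliers)
print(complex_multiplies, round(cycles / 1e6, 1))
```

Under these assumptions the script gives 49 real and 12 complex multipliers, about 44 million complex multiplies, and about 3.7 million cycles, in line with the slide.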
8All-integer FFT
- Perform FFT modulo special prime.
- Prime must have nice roots of one and two.
- Reductions modulo prime should be simple.
- Primes of the form 2^k - 2^m + 1 meet requirements.
Prime              Multipliers  FFT Length  Iteration Time
2^47 - 2^24 + 1    49           4M          1.9M cycles
2^64 - 2^32 + 1    26           2M          1.7M cycles
2^73 - 2^37 + 1    17           2M          2.6M cycles
2^113 - 2^57 + 1   9            1M          2.3M cycles
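To illustrate why reductions modulo such a prime are simple, here is a sketch for p = 2^64 - 2^32 + 1. Since 2^64 = 2^32 - 1 and 2^96 = -1 (mod p), a 128-bit product can be reduced with shifts, adds, and subtracts only. The function name `reduce128` and the three-way split are choices made for this example, not a description of the accelerator's datapath.

```python
P = 2**64 - 2**32 + 1

def reduce128(x):
    """Reduce 0 <= x < 2**128 modulo P without a general division."""
    a = x & (2**64 - 1)            # bits 0..63
    b = (x >> 64) & (2**32 - 1)    # bits 64..95, weight 2**64 = 2**32 - 1 (mod P)
    c = x >> 96                    # bits 96..127, weight 2**96 = -1 (mod P)
    r = a + (b << 32) - b - c
    # At most a few conditional corrections bring r into [0, P).
    while r < 0:
        r += P
    while r >= P:
        r -= P
    return r

x = (P - 1) * (P - 1)              # largest product of two reduced operands
print(reduce128(x) == x % P)
```

The same identity shows that 2 is a root of unity modulo p (2^192 = 1), so multiplying by powers of two, which is just a shift, covers some of the transform's twiddle factors.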
9Fast Galois Transform
- All-integer transform using complex numbers modulo a Mersenne prime: a + bi (mod Mp)
- Real input sequence folded into complex input with half the length.
- Modular reductions via Mersenne primes are simple addition.
Prime      Multipliers  FFT Length  Iteration Time
2^61 - 1   6 (complex)  1M          3.5M cycles
2^89 - 1   3 (complex)  512K        3.3M cycles
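The arithmetic behind the Galois transform can be sketched as follows for Mp = 2^61 - 1. Reduction folds the high bits onto the low bits with an addition, and numbers of the form a + bi multiply as ordinary complex numbers with each component reduced. The names `mersenne_reduce` and `gauss_mul` are illustrative, and the four-multiply complex product is the plain schoolbook form rather than whatever the slides' multiplier counts assume.

```python
P_EXP = 61
MP = (1 << P_EXP) - 1

def mersenne_reduce(x):
    """Reduce x >= 0 modulo 2**61 - 1 by folding high bits onto low bits."""
    while x > MP:
        x = (x & MP) + (x >> P_EXP)   # 2**61 = 1 (mod MP), so high bits just add
    return 0 if x == MP else x

def gauss_mul(u, v):
    """(a + bi)(c + di) with both components reduced modulo MP."""
    a, b = u
    c, d = v
    # Real part a*c - b*d, kept non-negative by adding MP - (b*d mod MP).
    real = mersenne_reduce(a * c + (MP - mersenne_reduce(b * d)))
    imag = mersenne_reduce(a * d + b * c)
    return real, imag

print(gauss_mul((3, 4), (5, 6)))
print(gauss_mul((0, 1), (0, 1)))      # i * i = -1 (mod MP)
```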
10Algorithm Selection
- Considered algorithms
- Floating-point FFT: 3.7M cycles / iteration
- All-integer FFT: 1.7M cycles / iteration
- Galois Transform: 3.3M cycles / iteration
- Winograd Transform: no acceptable run lengths
- Chinese Remainder Theorem: added complexity
11FFT Design
- Multipliers and adder generated by CoreGen.
- 10 cycle butterfly latency.
12Complete Design
- 8-point FFTs lower the required cache throughput.
- Multiple caches allow for overlapping computation
with memory reads and writes.
13Performance Estimates
- XC2VP100-6ff1696
- ISE version 6.2i
- Iteration time
- 34 milliseconds
- FFT Engine frequency
- 80 MHz
- 2VP100 utilization
- 70% slices (not implemented)
- 24% BRAMs
- 86% multipliers
Iteration Stage              Time (us)
Weighted sequence creation   250
Forward FFT                  11,500
DFT coefficient squaring     250
Inverse FFT                  11,500
Weight removal               250
Carry releasing              5,000
Mersenne mod reduction       5,000
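As a quick consistency check, the stage times in the table can be summed and compared with the quoted iteration time. The dictionary below simply restates the table; the "FFT share" figure is derived here and does not appear on the slides.

```python
stage_us = {
    "Weighted sequence creation": 250,
    "Forward FFT": 11_500,
    "DFT coefficient squaring": 250,
    "Inverse FFT": 11_500,
    "Weight removal": 250,
    "Carry releasing": 5_000,
    "Mersenne mod reduction": 5_000,
}

total_us = sum(stage_us.values())
fft_us = stage_us["Forward FFT"] + stage_us["Inverse FFT"]

print(f"total: {total_us / 1000:.2f} ms")          # compare with the 34 ms estimate
print(f"FFT share: {fft_us / total_us:.0%}")       # derived, not from the slides
```

The stages add up to 33.75 ms, which rounds to the 34 ms iteration time, with the two transforms accounting for roughly two thirds of it.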
14Performance Comparison
- Pentium 4 Performance
- Non-SIMD (64-bit multiplies)
- 6.4 GFLOPs
- All-Integer transform leverages FPGA strengths
- 1.9 billion integer multiplies/sec
- Transform performance exceeds P4.
- FPGA vs. Pentium 4
- 34 ms vs. 60 ms → 1.76x speed-up!
- $10,000 vs. $500 → 20x more costly.†
- 600 sq mm vs. 146 sq mm → 4.1x more die area.
- † FPGAs would likely be less costly if volume equaled the P4.
- The P4 area estimate does not include the area required by all of the support chips.
- 2VP100 die area extrapolated from 2VP20 data supplied by Semiconductor Insights (www.semiconductor.com).
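The three ratios in the comparison follow directly from the quoted figures, as the short script below shows. The cost figures are assumed to be in US dollars, and the last quantity, speed-up per unit of relative cost, is derived here for illustration and does not appear on the slides.

```python
fpga = {"time_ms": 34, "cost": 10_000, "area_mm2": 600}   # cost assumed to be USD
p4 = {"time_ms": 60, "cost": 500, "area_mm2": 146}        # cost assumed to be USD

speedup = p4["time_ms"] / fpga["time_ms"]
cost_ratio = fpga["cost"] / p4["cost"]
area_ratio = fpga["area_mm2"] / p4["area_mm2"]
speedup_per_cost = speedup / cost_ratio      # derived figure, not from the slides

print(f"speed-up:  {speedup:.2f}x")
print(f"cost:      {cost_ratio:.0f}x")
print(f"die area:  {area_ratio:.1f}x")
print(f"speed-up per unit relative cost: {speedup_per_cost:.3f}")
```

On these numbers each dollar spent on the FPGA buys less than a tenth of the throughput per dollar of the Pentium 4, which is the trade-off the comparison is pointing at.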
15Improvements & Future Work
- Pentium assembly code is highly optimized, while the HW accelerator is a first draft.
- Algorithm exploration
- Nussbaumer's method using 17-bit primes
- Utilize nice form of prime to implement shift-only multiply for first two FFT stages.
- Cluster Implementation
- Configurable Computing Lab constructing a 16-node 2VP cluster with gigabit transceivers as interconnect.
- Alternative reduced-multiplier butterfly structures
- Floorplanning
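The shift-only idea can be illustrated with a 4-point transform, which corresponds to two radix-2 stages. The sketch assumes the prime is p = 2^64 - 2^32 + 1; the slides do not say which prime is meant. For that prime 2^96 = -1 (mod p), so 2^48 is a primitive 4th root of unity and the only non-trivial twiddle multiply becomes a 48-bit shift. Function names are illustrative, and `% P` stands in for the shift-and-add reduction a hardware design would use.

```python
P = 2**64 - 2**32 + 1    # assumed prime
SHIFT = 48               # 2**48 squared is 2**96 = -1 (mod P)

def ntt4_shift_only(x):
    """4-point transform whose single twiddle multiply is a left shift."""
    a, b, c, d = x
    s0, s1 = (a + c) % P, (a - c) % P
    s2 = (b + d) % P
    s3 = (((b - d) % P) << SHIFT) % P     # multiply by 2**48 as a shift
    return [(s0 + s2) % P, (s1 + s3) % P, (s0 - s2) % P, (s1 - s3) % P]

def ntt4_reference(x):
    """Direct evaluation of the same transform with general multiplies."""
    w = pow(2, SHIFT, P)
    return [sum(v * pow(w, j * k, P) for j, v in enumerate(x)) % P
            for k in range(4)]

data = [1, 2, 3, 4]
print(ntt4_shift_only(data) == ntt4_reference(data))
```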
16Conclusions
- All-integer FFTs attractive for hardware implementations of filters / convolutions.
- GIMPS accelerator designed
- Operates at 80 MHz
- 1.76x faster than 3.2 GHz Pentium 4
- Cost of accelerator outweighs benefit in this application.