Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers? - PowerPoint PPT Presentation

Learn more at: http://klabs.org


1
Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers?

Stephen Craven
Cameron Patterson
Peter Athanas
Configurable Computing Lab
Virginia Tech

2
Outline

Background
Large Integer Multiplication
GIMPS
Algorithm Comparison
Floating-point FFT
All-integer FFT
Fast Galois Transform
Accelerator Design
System Design
Operation
Performance
Improvements and Future Work

3
Large Integer Multiplication

Complexity
Grade School: O(N^2)
Fourier Transform: O(N log N)
Efficient FFT-Based Multiplication
Divide integers into sequences of smaller digits.
867530924601 → 86, 75, 30, 92, 46, 01
Convolution of two sequences is equivalent to multiplication.
Element-wise multiplication in frequency domain ⇔ time domain convolution.
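A minimal sketch of the idea on this slide, assuming base-100 digits as in the example and a textbook recursive FFT (the function names are illustrative, not from the presentation). The inputs are zero-padded so that the cyclic convolution computed by the FFT matches the acyclic convolution that multiplication needs.

```python
import cmath

def fft(a, sign):
    """Recursive radix-2 FFT; sign=-1 is forward, sign=+1 is the unscaled inverse."""
    n = len(a)
    if n == 1:
        return list(a)
    even = fft(a[0::2], sign)
    odd = fft(a[1::2], sign)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def to_digits(x, base):
    """Little-endian digits of x in the given base."""
    digits = []
    while x:
        x, d = divmod(x, base)
        digits.append(d)
    return digits or [0]

def fft_multiply(x, y, base=100):
    a, b = to_digits(x, base), to_digits(y, base)
    n = 1
    while n < len(a) + len(b):      # zero-pad to a power of two
        n *= 2
    fa = fft(a + [0] * (n - len(a)), -1)
    fb = fft(b + [0] * (n - len(b)), -1)
    # Element-wise product in the frequency domain = convolution in the time domain.
    conv = fft([u * v for u, v in zip(fa, fb)], +1)
    coeffs = [round(c.real / n) for c in conv]
    # Weighting each coefficient by base**i also resolves the carries.
    return sum(c * base**i for i, c in enumerate(coeffs))
```

The rounding step is where floating-point error can creep in, which is why floating-point implementations need round-off checks at large sizes.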

4
GIMPS

Why multiply big numbers?
Great Internet Mersenne Prime Search (GIMPS)
Primality testing algorithm for Mersenne numbers (2^q - 1) requires squaring of multi-million digit numbers.
Mersenne primes are the largest primes known; used in cryptography.
Large integer convolution
Performance comparison of Pentiums and FPGAs in
traditional floating-point domains.
Lucas-Lehmer Primality Test
M_q = 2^q - 1; v = 4
for i = 1 : q-2,
    v = v^2 - 2 (mod M_q)
if v = 0, M_q is prime
else, M_q is composite
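The test above translates almost line for line into code. This is a plain sketch using Python's built-in big integers; it assumes q is an odd prime and makes no attempt at the FFT-based squaring that GIMPS relies on for multi-million digit operands.

```python
def lucas_lehmer(q):
    """Return True if M_q = 2**q - 1 is prime; q is assumed to be an odd prime."""
    m = (1 << q) - 1
    v = 4
    for _ in range(q - 2):
        v = (v * v - 2) % m     # the squaring dominates the cost for large q
    return v == 0
```

For example, `lucas_lehmer(13)` is true (8191 is prime) while `lucas_lehmer(11)` is false (2047 = 23 x 89).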

5
Discrete Weighted Transform

Discrete Weighted Transform (DWT)
Variable base: each sequence digit can contain a different number of bits.
Creates power-of-two sequence needed by FFT.
Eliminates need to zero pad to convert cyclic,
FFT-based convolution into acyclic convolution
needed for squaring.
Steps
Number to be multiplied divided into
variable-length digits.
Sequence multiplied by a weight sequence.
FFT performed on new, power-of-two length
weighted sequence.
Example for M_q = 2^37 - 1 with FFT length of 4
Bits / digit: 10, 9, 9, 9
To square 78,314,567,209 (mod M_q), our sequence would be 553, 93, 381, 291
553 + 93·2^10 + 381·2^19 + 291·2^28 = 78,314,567,209
Multiply sequence by weights, then FFT.
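The example can be reproduced with a short sketch. Two assumptions not stated on the slide: digit j is taken to start at bit ceil(q*j/N), and the weights are 2^(ceil(q*j/N) - q*j/N), which is the usual choice for this kind of weighted transform. A direct O(N^2) cyclic convolution stands in for the FFT, pointwise squaring and inverse FFT, so only the weighting itself is illustrated.

```python
def bit_offsets(q, n):
    """Starting bit of each of the n variable-width digits, plus the end (q)."""
    return [-(-q * j // n) for j in range(n + 1)]   # integer ceil(q*j/n)

def split(x, q, n):
    """Split x < 2**q into n little-endian digits of varying bit width."""
    b = bit_offsets(q, n)
    return [(x >> b[j]) & ((1 << (b[j + 1] - b[j])) - 1) for j in range(n)]

def dwt_square(x, q, n):
    """Square x modulo 2**q - 1 using a weighted cyclic convolution of length n."""
    b = bit_offsets(q, n)
    weights = [2.0 ** (b[j] - q * j / n) for j in range(n)]
    w = [d * a for d, a in zip(split(x, q, n), weights)]
    # Cyclic convolution of the weighted sequence with itself (FFT stand-in).
    conv = [sum(w[i] * w[(k - i) % n] for i in range(n)) for k in range(n)]
    # Remove the weights and round back to integers.
    z = [round(c / a) for c, a in zip(conv, weights)]
    # Recombine at the variable bit offsets; the final % releases the carries.
    return sum(zk << b[k] for k, zk in enumerate(z)) % ((1 << q) - 1)
```

Because 2^q is congruent to 1 modulo 2^q - 1, the wrap-around terms of the cyclic convolution land exactly where the modular reduction would put them, so no zero padding is needed.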

6
Objective

Compare performance of Pentium processors to
FPGAs.
GIMPS chosen because highly optimized code
exists.
GIMPS utilizes fast floating-point performance of
Pentiums.
Xilinx Virtex-II Pro 100 (2VP100) chosen as
target device.
Largest available 2VP device.
Contains 444, 17x17 unsigned multipliers
888kB of embedded Block RAM
Target 12 million digit numbers.
Reward offered for first prime above 10 million digits.

7
Floating-point FFT

GIMPS implementation uses floating-point; requires round-off error checks.
Using near double-precision floating-point (51-bit mantissa):
49 real multipliers can be placed on 2VP100
12 complex multipliers
12 million digit number → 2 million point FFT
44 million complex multiplies → 3.7 million cycles
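The slide does not say how the cycle count was derived. One simple model that appears to reproduce it, offered here as an assumption, is a radix-2 FFT with one complex multiply per butterfly, a forward plus an inverse transform per iteration, and every multiplier busy on every cycle.

```python
def iteration_cycles(fft_length, multipliers):
    """Idealized cycles per iteration: forward + inverse radix-2 FFT,
    one multiply per butterfly, all multipliers fully utilized."""
    stages = fft_length.bit_length() - 1          # log2 of a power-of-two length
    butterflies_per_fft = (fft_length // 2) * stages
    return 2 * butterflies_per_fft / multipliers

# 2M-point FFT on 12 complex multipliers
print(round(iteration_cycles(2**21, 12) / 1e6, 1))   # 3.7
```

Under the same assumptions, `2 * (2**21 // 2) * 21` gives about 44 million multiplies, matching the figure quoted above.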

8
All-integer FFT

Perform FFT modulo special prime.
Prime must have nice roots of one and two.
Reductions modulo prime should be simple.
Primes of the form 2^k - 2^m + 1 meet requirements.
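As an illustration of both requirements, the sketch below uses the prime 2^64 - 2^32 + 1, one of the primes listed on this slide. The reduction exploits 2^64 = 2^32 - 1 and 2^96 = -1 (mod P), so it needs only shifts, masks, adds and subtracts. The transform is a textbook recursive radix-2 number-theoretic transform; the helper names and the root search are illustrative and are not taken from the accelerator design.

```python
P = 2**64 - 2**32 + 1

def reduce128(x):
    """Reduce 0 <= x < 2**128 modulo P without a division."""
    lo = x & (2**64 - 1)
    mid = (x >> 64) & (2**32 - 1)
    hi = x >> 96
    r = lo + (mid << 32) - mid - hi     # lo + mid*(2^32 - 1) - hi
    while r < 0:
        r += P
    while r >= P:
        r -= P
    return r

def root_of_unity(n):
    """Find an element of order exactly n (n >= 2, a power of two dividing P - 1)."""
    for g in range(2, 1000):
        w = pow(g, (P - 1) // n, P)
        if pow(w, n // 2, P) == P - 1:
            return w
    raise ValueError("no root found")

def ntt(a, w):
    """Recursive radix-2 transform of a modulo P using the root of unity w."""
    n = len(a)
    if n == 1:
        return list(a)
    even = ntt(a[0::2], w * w % P)
    odd = ntt(a[1::2], w * w % P)
    out = [0] * n
    t = 1
    for k in range(n // 2):
        v = reduce128(t * odd[k])
        out[k] = (even[k] + v) % P
        out[k + n // 2] = (even[k] - v) % P
        t = t * w % P
    return out

def cyclic_convolve(a, b):
    """Exact cyclic convolution of two equal-length integer sequences modulo P."""
    n = len(a)
    w = root_of_unity(n)
    prod = [reduce128(x * y) for x, y in zip(ntt(a, w), ntt(b, w))]
    inv_n = pow(n, P - 2, P)
    return [c * inv_n % P for c in ntt(prod, pow(w, P - 2, P))]
```

Unlike the floating-point version, every intermediate value is an exact residue, so no round-off check is needed as long as the true convolution values stay below P.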

Prime               Multipliers  FFT Length  Iteration Time
2^47 - 2^24 + 1     49           4M          1.9M cycles
2^64 - 2^32 + 1     26           2M          1.7M cycles
2^73 - 2^37 + 1     17           2M          2.6M cycles
2^113 - 2^57 + 1    9            1M          2.3M cycles

9
Fast Galois Transform

All-integer transform using complex numbers modulo a Mersenne prime: a + bi (mod M_p)
Real input sequence folded into complex input
with half the length.
Modular reductions via Mersenne primes are simple
addition.
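A small sketch of the arithmetic behind this slide, assuming the prime 2^61 - 1 from the table below. Because 2^61 is congruent to 1, reduction folds the high bits onto the low bits with an addition. The complex multiply is the plain four-multiplication form; the names are illustrative only.

```python
P_EXP = 61
M = (1 << P_EXP) - 1            # Mersenne prime 2^61 - 1

def mersenne_reduce(x):
    """Reduce x >= 0 modulo M by adding the high bits to the low bits."""
    while x > M:
        x = (x & M) + (x >> P_EXP)
    return 0 if x == M else x

def gmul(u, v):
    """Multiply u = (a, b) ~ a + bi by v = (c, d) ~ c + di, modulo M."""
    a, b = u
    c, d = v
    real = mersenne_reduce(a * c + (M - mersenne_reduce(b * d)))   # ac - bd
    imag = mersenne_reduce(a * d + b * c)                          # ad + bc
    return real, imag
```

For instance, `gmul((0, 1), (0, 1))` returns `(M - 1, 0)`, that is, i squared equals -1 in this arithmetic.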

Prime       Multipliers    FFT Length  Iteration Time
2^61 - 1    6 (complex)    1M          3.5M cycles
2^89 - 1    3 (complex)    512K        3.3M cycles

10
Algorithm Selection

Considered algorithms
Floating-point FFT: 3.7M cycles / iteration
All-integer FFT: 1.7M cycles / iteration
Galois Transform: 3.3M cycles / iteration
Winograd Transform: no acceptable run lengths
Chinese Remainder Theorem: added complexity

11
FFT Design

Multipliers and adder generated by CoreGen.
10 cycle butterfly latency.

12
Complete Design

8-point FFTs lower cache throughput.
Multiple caches allow for overlapping computation
with memory reads and writes.

13
Performance Estimates

XC2VP100-6ff1696
ISE version 6.2i
Iteration time: 34 milliseconds
FFT Engine frequency: 80 MHz
2VP100 utilization:
70% slices (Not Implemented)
24% BRAMs
86% multipliers

Iteration Stage               Time (us)
Weighted sequence creation    250
Forward FFT                   11,500
DFT coefficient squaring      250
Inverse FFT                   11,500
Weight removal                250
Carry releasing               5,000
Mersenne mod reduction        5,000

14
Performance Comparison

Pentium 4 Performance
Non-SIMD (64-bit multiplies)
6.4 GFLOPs
All-Integer transform leverages FPGA strengths
1.9 billion integer multiplies /sec
Transform performance exceeds P4.
FPGA vs. Pentium 4
34 ms vs. 60 ms → 1.76x speed-up!
$10,000 vs. $500 → 20x more costly.*
600 sq mm vs. 146 sq mm → 4.1x more die area.
* FPGAs would likely be less costly if volume equaled the P4.
The P4 area estimate does not include the area
required by all of the support chips.
2VP100 die area extrapolated from 2VP20 data
supplied by Semiconductor Insights
(www.semiconductor.com).
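The headline numbers can be cross-checked with a few lines of arithmetic. The stage times are copied from the Performance Estimates slide and the comparison figures from this one; nothing here is measured independently.

```python
stage_us = {
    "Weighted sequence creation": 250,
    "Forward FFT": 11_500,
    "DFT coefficient squaring": 250,
    "Inverse FFT": 11_500,
    "Weight removal": 250,
    "Carry releasing": 5_000,
    "Mersenne mod reduction": 5_000,
}

total_ms = sum(stage_us.values()) / 1000    # per-iteration time on the FPGA
speedup = 60 / 34                           # P4 time / FPGA time
cost_ratio = 10_000 / 500                   # cost figures as quoted on the slide
area_ratio = 600 / 146                      # die area in sq mm

print(f"{total_ms:.2f} ms, {speedup:.2f}x, {cost_ratio:.0f}x, {area_ratio:.1f}x")
# 33.75 ms, 1.76x, 20x, 4.1x
```

The stage times sum to 33.75 ms, which rounds to the quoted 34 ms, and the two transforms account for roughly two thirds of each iteration.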

15
Improvements and Future Work

Pentium assembly code is highly optimized, while the HW accelerator is a first draft.
Algorithm exploration
Nussbaumer's method using 17-bit primes
Utilize nice form of prime to implement
shift-only multiply for first two FFT stages.
Cluster Implementation
Configurable Computing Lab constructing a 16-node
2VP cluster with gigabit transceivers as
interconnect.
Alternative reduced-multiplier butterfly
structures
Floorplanning
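To illustrate the shift-only multiply idea, the sketch below assumes the prime is 2^64 - 2^32 + 1 (the slides do not name the prime used in the final design). For that prime, 2^96 is congruent to -1, so 2 has order 192 and 2^24 is an 8th root of unity. Twiddle factors for an 8-point transform are then powers of two, and multiplying by them is a bit shift followed by a reduction.

```python
P = 2**64 - 2**32 + 1

W8 = 2**24      # candidate 8th root of unity: a pure power of two

def mul_by_twiddle(x, k):
    """Multiply x by W8**k (k in 0..7) with a left shift instead of a multiplier.
    The % stands in for the shift-and-add reduction hardware would use."""
    return (x << (24 * k)) % P

# W8 has order exactly 8: its 4th power is -1 and its 8th power is 1.
print(pow(W8, 4, P) == P - 1, pow(W8, 8, P) == 1)
```

Larger transforms still need general multipliers for the later stages, since only small-order roots of unity are powers of two.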

16
Conclusions

All-integer FFTs attractive for hardware
implementations of filters / convolutions.
GIMPS accelerator designed
Operates at 80 MHz
1.76x faster than 3.2 GHz Pentium 4
Cost of accelerator outweighs benefit in this
application.
