Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers? - PowerPoint PPT Presentation

Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers? Stephen Craven Cameron Patterson Peter Athanas Configurable Computing Lab – PowerPoint PPT presentation

Slides: 17
Provided by: scr68
Learn more at: http://klabs.org

Transcript and Presenter's Notes

Title: Super-Sized Multiplies: How Do FPGAs Fare in Extended Digit Multipliers?


1
Super-Sized Multiplies: How Do FPGAs Fare in
Extended Digit Multipliers?
  • Stephen Craven
  • Cameron Patterson
  • Peter Athanas
  • Configurable Computing Lab
  • Virginia Tech

2
Outline
  • Background
  • Large Integer Multiplication
  • GIMPS
  • Algorithm Comparison
  • Floating-point FFT
  • All-integer FFT
  • Fast Galois Transform
  • Accelerator Design
  • System Design
  • Operation
  • Performance
  • Improvements & Future Work

3
Large Integer Multiplication
  • Complexity
  • Grade School: O(N^2)
  • Fourier Transform: O(N log N)
  • Efficient FFT-Based Multiplication
  • Divide integers into sequences of smaller digits.
  • 867530924601 → 86, 75, 30, 92, 46, 01
  • Convolution of two sequences is equivalent to
    multiplication.
  • Element-wise multiplication in frequency domain ⇔
    time domain convolution.
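
The digit-splitting and convolution steps above can be sketched in a few lines of Python. This is only an illustration, not the presenters' implementation: the function names are hypothetical, and a schoolbook O(N^2) convolution stands in for the FFT-based one so that the example stays exact and dependency-free.

```python
def to_digits(n, base=100):
    """Split a non-negative integer into base-`base` digits, least significant first."""
    digits = []
    while n:
        n, d = divmod(n, base)
        digits.append(d)
    return digits or [0]


def convolve(a, b):
    """Acyclic convolution of two digit sequences (O(N^2) stand-in for the FFT step)."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def from_digits(digits, base=100):
    """Evaluate a digit sequence; entries may exceed `base`, which resolves the carries."""
    return sum(d * base**i for i, d in enumerate(digits))


def multiply(x, y, base=100):
    """Multiply two integers by convolving their digit sequences."""
    return from_digits(convolve(to_digits(x, base), to_digits(y, base)), base)


# The slide's example, printed most significant digit first.
print(to_digits(867530924601)[::-1])  # [86, 75, 30, 92, 46, 1]
```
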

4
GIMPS
  • Why multiply big numbers?
  • Great Internet Mersenne Prime Search (GIMPS)
  • Primality testing algorithm for Mersenne numbers
    (2^q - 1) requires squaring of multi-million digit
    numbers.
  • Mersenne primes are the largest primes known; used
    in cryptography.
  • Large integer convolution
  • Performance comparison of Pentiums and FPGAs in
    traditional floating-point domains.
  • Lucas-Lehmer Primality Test
  • M_q = 2^q - 1; v = 4
  • for i = 1 .. q-2,
  •   v = v^2 - 2 (mod M_q)
  • if v = 0, M_q is prime
  • else, M_q is composite
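
The Lucas-Lehmer loop above translates almost directly into Python. This is a minimal sketch using Python's built-in big integers for the squaring; the function name is hypothetical, and the exponent q is assumed to be an odd prime.

```python
def lucas_lehmer(q):
    """Return True if M_q = 2^q - 1 is prime (q is assumed to be an odd prime)."""
    m = (1 << q) - 1
    v = 4
    for _ in range(q - 2):
        # One squaring modulo M_q per iteration: the step an accelerator would speed up.
        v = (v * v - 2) % m
    return v == 0


print([q for q in (3, 5, 7, 11, 13, 17, 19, 23) if lucas_lehmer(q)])
```

For multi-million digit numbers the `v * v` step dominates the run time, which is why the rest of the presentation concentrates on fast squaring.
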

5
Discrete Weighted Transform
  • Discrete Weighted Transform (DWT)
  • Variable base: each sequence digit can contain a
    differing number of bits.
  • Creates power-of-two sequence needed by FFT.
  • Eliminates need to zero pad to convert cyclic,
    FFT-based convolution into acyclic convolution
    needed for squaring.
  • Steps
  • Number to be multiplied divided into
    variable-length digits.
  • Sequence multiplied by a weight sequence.
  • FFT performed on new, power-of-two length
    weighted sequence.
  • Example for M_q = 2^37 - 1 with FFT length of 4
  • Bits / digit: 10, 9, 9, 9
  • To square 78,314,567,209 (mod M_q), our sequence
    would be 553, 93, 381, 291
  • 553 + 93 * 2^10 + 381 * 2^19 + 291 * 2^28 =
    78,314,567,209
  • Multiply sequence by weights then FFT.
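
The example above can be checked with a short Python sketch. It assumes the usual construction in which digit j starts at bit ceil(q*j/N) and carries the weight 2^(ceil(q*j/N) - q*j/N); the slides do not spell out these formulas, so treat them and the function names as assumptions. A direct cyclic convolution in floating point replaces the FFT to keep the code short.

```python
def bit_offsets(q, n):
    """Starting bit of each of the n digits, plus the final offset q."""
    return [-(-q * j // n) for j in range(n + 1)]  # integer ceil(q*j/n)


def to_variable_digits(x, offsets):
    """Cut x into digits whose widths are the gaps between consecutive offsets."""
    return [(x >> lo) & ((1 << (hi - lo)) - 1)
            for lo, hi in zip(offsets, offsets[1:])]


def dwt_square(x, q, n):
    """Square x modulo 2^q - 1 via a weighted cyclic convolution of length n."""
    offs = bit_offsets(q, n)
    digits = to_variable_digits(x, offs)
    weights = [2.0 ** (offs[j] - q * j / n) for j in range(n)]
    y = [d * w for d, w in zip(digits, weights)]
    # Cyclic convolution (an FFT would compute the same values faster).
    c = [sum(y[i] * y[(k - i) % n] for i in range(n)) for k in range(n)]
    # Remove the weights and round to the nearest integer.
    z = [round(ck / wk) for ck, wk in zip(c, weights)]
    return sum(zk << offs[k] for k, zk in enumerate(z)) % ((1 << q) - 1)


offs = bit_offsets(37, 4)
print([hi - lo for lo, hi in zip(offs, offs[1:])])  # [10, 9, 9, 9]
print(to_variable_digits(78314567209, offs))        # [553, 93, 381, 291]
```

No zero padding is needed: the weights make the length-4 cyclic convolution wrap around exactly as reduction modulo 2^37 - 1 requires.
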

6
Objective
  • Compare performance of Pentium processors to
    FPGAs.
  • GIMPS chosen because highly optimized code
    exists.
  • GIMPS utilizes fast floating-point performance of
    Pentiums.
  • Xilinx Virtex-II Pro 100 (2VP100) chosen as
    target device.
  • Largest available 2VP device.
  • Contains 444, 17x17 unsigned multipliers
  • 888kB of embedded Block RAM
  • Target 12 million digit numbers.
  • Reward for first prime above 10 million digits.

7
Floating-point FFT
  • GIMPS implementation uses floating-point;
    requires round-off error checks.
  • Using near double-precision floating-point
    (51-bit mantissa)
  • 49 real multipliers can be placed on 2VP100
  • 12 complex multipliers
  • 12 million digit number → 2 million point FFT
  • 44 million complex multiplies → 3.7 million
    cycles
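
The cycle counts quoted here are consistent with a simple cost model, sketched below. The model is an assumption rather than something stated on the slides: a forward and an inverse radix-2 FFT per iteration, one multiply per butterfly, and every multiplier producing one result per cycle. The function names are hypothetical.

```python
def multiplies_per_iteration(fft_length):
    """Butterfly multiplies in one forward plus one inverse radix-2 FFT."""
    stages = fft_length.bit_length() - 1  # log2 of a power-of-two length
    return 2 * (fft_length // 2) * stages


def iteration_cycles(fft_length, multipliers):
    """Cycles per iteration if the multipliers are kept fully busy."""
    return multiplies_per_iteration(fft_length) / multipliers


# 2 million point FFT on 12 complex multipliers.
print(multiplies_per_iteration(2**21))  # 44040192, about 44 million
print(iteration_cycles(2**21, 12))      # 3670016.0, about 3.7 million
```
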

8
All-integer FFT
  • Perform FFT modulo special prime.
  • Prime must have nice roots of one & two.
  • Reductions modulo prime should be simple.
  • Primes of the form 2^k - 2^m + 1 meet requirements.

Prime                 Multipliers  FFT Length  Iteration Time
2^47 - 2^24 + 1       49           4M          1.9M cycles
2^64 - 2^32 + 1       26           2M          1.7M cycles
2^73 - 2^37 + 1       17           2M          2.6M cycles
2^113 - 2^57 + 1      9            1M          2.3M cycles

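To illustrate why primes of this form are convenient, the sketch below runs a small transform modulo 2^64 - 2^32 + 1. Modulo this prime, 2^96 is congruent to -1, so 2 is a root of one of order 192 and the twiddle factors of short transforms are powers of two. The naive O(N^2) transform, the length limit, and the function names are illustrative assumptions, not the accelerator's design. The modular inverses use `pow(x, -1, P)`, which needs Python 3.8 or later.

```python
P = 2**64 - 2**32 + 1


def transform(seq, root):
    """Naive O(N^2) number-theoretic transform of seq modulo P."""
    n = len(seq)
    return [sum(x * pow(root, j * k, P) for j, x in enumerate(seq)) % P
            for k in range(n)]


def cyclic_convolve(a, b):
    """Cyclic convolution modulo P; the length must be a power of two up to 64."""
    n = len(a)
    root = pow(2, 192 // n, P)  # 2 has order 192, so this root has order n
    fa, fb = transform(a, root), transform(b, root)
    prod = [(x * y) % P for x, y in zip(fa, fb)]
    inv_n = pow(n, -1, P)
    return [(v * inv_n) % P for v in transform(prod, pow(root, -1, P))]


print(pow(2, 96, P) == P - 1)  # True: 2^96 is -1 modulo P
print(cyclic_convolve([1, 2, 3, 4, 0, 0, 0, 0], [5, 6, 7, 0, 0, 0, 0, 0]))
# [5, 16, 34, 52, 45, 28, 0, 0]
```

All arithmetic is exact, so no round-off error checks are needed, unlike the floating-point FFT.
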
9
Fast Galois Transform
  • All-integer transform using complex numbers
    modulo a Mersenne prime: a + bi (mod M_p)
  • Real input sequence folded into complex input
    with half the length.
  • Modular reductions via Mersenne primes are simple
    addition.

Prime       Multipliers  FFT Length  Iteration Time
2^61 - 1    6 (complex)  1M          3.5M cycles
2^89 - 1    3 (complex)  512K        3.3M cycles

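The "simple addition" reduction can be sketched as follows for M_p = 2^61 - 1: because 2^61 is congruent to 1, the high bits of a product are folded onto the low bits with a shift and an add. The complex multiply shows the a + bi (mod M_p) arithmetic the transform operates on. Function names are hypothetical and inputs are assumed to lie in the range 0 to M_p - 1.

```python
P_BITS = 61
M = (1 << P_BITS) - 1


def reduce_mersenne(x):
    """Reduce a non-negative x modulo M using only shifts, masks and additions."""
    while x > M:
        x = (x & M) + (x >> P_BITS)  # 2^61 = 1 (mod M), so fold the high bits down
    return 0 if x == M else x


def gauss_mul(u, v):
    """Multiply a + bi by c + di, with both components reduced modulo M."""
    (a, b), (c, d) = u, v
    real = reduce_mersenne(a * c + (M - reduce_mersenne(b * d)))  # ac - bd
    imag = reduce_mersenne(a * d + b * c)                         # ad + bc
    return real, imag


print(gauss_mul((3, 4), (5, 6)) == (M - 9, 38))  # (3+4i)(5+6i) = -9 + 38i
```
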
10
Algorithm Selection
  • Considered algorithms
  • Floating-point FFT: 3.7M cycles / iteration
  • All-integer FFT: 1.7M cycles / iteration
  • Galois Transform: 3.3M cycles / iteration
  • Winograd Transform: no acceptable run lengths
  • Chinese Remainder Theorem: added complexity

11
FFT Design
  • Multipliers and adder generated by CoreGen.
  • 10 cycle butterfly latency.

12
Complete Design
  • 8-point FFTs lower cache throughput.
  • Multiple caches allow for overlapping computation
    with memory reads and writes.

13
Performance Estimates
  • XC2VP100-6ff1696
  • ISE version 6.2i
  • Iteration time
  • 34 milliseconds
  • FFT Engine frequency
  • 80 MHz
  • 2VP100 utilization
  • 70% slices (Not Implemented)
  • 24% BRAMs
  • 86% multipliers

Iteration Stage               Time (µs)
Weighted sequence creation    250
Forward FFT                   11,500
DFT coefficient squaring      250
Inverse FFT                   11,500
Weight removal                250
Carry releasing               5,000
Mersenne mod reduction        5,000

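As a quick consistency check, the stage times in the table can be summed and converted to cycles at the stated 80 MHz. The short script below only restates the table's numbers; the variable names are hypothetical.

```python
# Stage times in microseconds, copied from the table above.
STAGE_US = {
    "Weighted sequence creation": 250,
    "Forward FFT": 11_500,
    "DFT coefficient squaring": 250,
    "Inverse FFT": 11_500,
    "Weight removal": 250,
    "Carry releasing": 5_000,
    "Mersenne mod reduction": 5_000,
}
CLOCK_MHZ = 80  # FFT engine frequency; 1 MHz is one cycle per microsecond

total_us = sum(STAGE_US.values())
fft_us = STAGE_US["Forward FFT"] + STAGE_US["Inverse FFT"]
total_cycles = total_us * CLOCK_MHZ

print(total_us / 1000, "ms per iteration")           # 33.75, quoted as 34 ms
print(total_cycles, "cycles per iteration")          # 2700000
print(round(100 * fft_us / total_us), "% in FFTs")   # 68
```
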
14
Performance Comparison
  • Pentium 4 Performance
  • Non-SIMD (64-bit multiplies)
  • 6.4 GFLOPs
  • All-Integer transform leverages FPGA strengths
  • 1.9 billion integer multiplies /sec
  • Transform performance exceeds P4.
  • FPGA vs. Pentium 4
  • 34 ms vs. 60 ms → 1.76x speed-up!
  • $10,000 vs. $500 → 20x more costly.†
  • 600 sq mm vs. 146 sq mm → 4.1x more die area.
  • † FPGAs would likely be less costly if volume
    equaled the P4.
  • The P4 area estimate does not include the area
    required by all of the support chips.
  • 2VP100 die area extrapolated from 2VP20 data
    supplied by Semiconductor Insights
    (www.semiconductor.com).
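
The three ratios in the comparison follow directly from the quoted figures, as the short calculation below shows. The last two lines, speed-up per unit cost and per unit die area, are an extra illustration and do not appear on the slides.

```python
# Figures as quoted on the slide: (FPGA, Pentium 4).
time_ms = (34, 60)
cost = (10_000, 500)
area_mm2 = (600, 146)

speedup = time_ms[1] / time_ms[0]     # Pentium time / FPGA time
cost_ratio = cost[0] / cost[1]        # FPGA cost / Pentium cost
area_ratio = area_mm2[0] / area_mm2[1]

print(round(speedup, 2), round(cost_ratio, 1), round(area_ratio, 1))  # 1.76 20.0 4.1
print(round(speedup / cost_ratio, 3))  # speed-up per unit cost
print(round(speedup / area_ratio, 2))  # speed-up per unit die area
```
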

15
Improvements & Future Work
  • Pentium assembly code is highly optimized, while
    the HW accelerator is a first draft.
  • Algorithm exploration
  • Nussbaumer's method using 17-bit primes
  • Utilize nice form of prime to implement
    shift-only multiply for first two FFT stages.
  • Cluster Implementation
  • Configurable Computing Lab constructing a 16-node
    2VP cluster with gigabit transceivers as
    interconnect.
  • Alternative reduced-multiplier butterfly
    structures
  • Floorplanning

16
Conclusions
  • All-integer FFTs attractive for hardware
    implementations of filters / convolutions.
  • GIMPS accelerator designed
  • Operates at 80 MHz
  • 1.76x faster than 3.2 GHz Pentium 4
  • Cost of accelerator outweighs benefit in this
    application.