Title: Fast 16-point FFT Core for Virtex-II FPGA
1Fast 16-point FFT Core for Virtex-II FPGA
- Subhradyuti Sarkar
- Rakesh Kumar
- s1sarkar, rakumar_at_cs.ucsd.edu
2Organization
- Goal
- Design
- Implementation
- Results
3Goal of the Project
- The Fast Fourier Transform is a very commonly used routine in DSP applications
- It is also computationally intensive, so a software implementation might not meet the timing requirements of a real-time embedded system
- Our aim was to design and implement a lightweight, fast FFT processor targeting the Virtex-II FPGA
4High-level Schematic
[Figure: block diagram of the 16-point FFT processor, with ports labeled input (streaming), reset, clock and output]
5What is SPARK? And why didn't we use it?
- SPARK is a high-level synthesis framework that reads a restricted C program and converts it into a VHDL description
- Since our core accepts streaming input, it must have pipelined READ, EXECUTE and OUTPUT stages
- It is nearly impossible to express streaming and pipelining in C
6FFT 101
- Computing the DFT directly from its definition requires O(N^2) computations. The FFT uses divide-and-conquer techniques to bring the complexity down to O(N log N).
- Since 16 is a power of four, we used the radix-4 decimation-in-frequency algorithm, breaking the 16-point DFT formula into four smaller DFTs. The final set of transforms looks like: [equations not reproduced here]
- Original DFT: 256 multiplications, 240 additions
- Radix-4 FFT: 64 multiplications, 192 additions!
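The radix-4 decimation-in-frequency split can be sketched in software as follows. This is a minimal illustration, not the VHDL core itself: the function names `dft_direct` and `fft_radix4_dif` and the test samples are hypothetical, and the recursion is chosen for clarity rather than to mirror the hardware pipeline. Each level combines the four input quarters with a radix-4 butterfly, applies a twiddle factor, and hands the result to a quarter-length DFT.

```python
import cmath

# Powers of W_4 = exp(-2*pi*i/4); multiplying by these needs no real multiplier.
W4 = [1, -1j, -1, 1j]

def dft_direct(x):
    """Direct O(N^2) DFT, used here only as a reference."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts))
            for k in range(n_pts)]

def fft_radix4_dif(x):
    """Recursive radix-4 decimation-in-frequency FFT; len(x) must be a power of 4."""
    n_pts = len(x)
    if n_pts == 1:
        return list(x)
    quarter = n_pts // 4
    out = [0j] * n_pts
    for q in range(4):
        # Radix-4 butterfly across the four input quarters, then a twiddle factor.
        sub = []
        for n in range(quarter):
            acc = sum(x[n + l * quarter] * W4[(l * q) % 4] for l in range(4))
            sub.append(acc * cmath.exp(-2j * cmath.pi * n * q / n_pts))
        # The quarter-length DFT of `sub` yields the outputs X[4r + q].
        for r, value in enumerate(fft_radix4_dif(sub)):
            out[4 * r + q] = value
    return out

# Arbitrary 16-point complex test input (hypothetical values).
samples = [complex(n % 5, -n) for n in range(16)]
max_err = max(abs(a - b)
              for a, b in zip(fft_radix4_dif(samples), dft_direct(samples)))
print(max_err < 1e-9)
```

For N = 16 the recursion bottoms out after two levels, which matches the "four smaller DFTs" description above.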
7Our Design version 1.0
- Pipelined READ, EXECUTE and OUTPUT stages
- Read one complex input (32 bits) every cycle
- A 16-point input is read in 16 cycles
- Output one complex result (32 bits) every cycle
- The EXECUTE stage also takes 16 cycles
- Performance
  - Latency: 17 cycles
  - Throughput: 1 transform per 16 cycles
[Figure: pipeline diagram of the READ, EXEC and OUTPUT stages for two successive transforms]
8Our Design version 2.0
- Completing the execution stage in 16 cycles resulted in large fan-outs for the internal registers, and hence very poor timing characteristics
- By trial and error, we divided the execution stage into 4 pipelined sub-stages
- That necessitated 5x the internal storage, but 2560 bits of storage is no big deal for the XC2V2000
- Final performance
  - Latency: 65 cycles
  - Throughput: 1 transform per 16 cycles (same as before)
[Figure: pipeline diagram of the R, E1, E2, E3, E4 and O stages for two successive transforms]
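The 2560-bit storage figure can be checked with a short calculation. This sketch assumes that "5x internal storage" means five full buffers of 16 complex samples at 32 bits each; that reading is an interpretation of the slide, not something it states explicitly, and the constant names are hypothetical.

```python
SAMPLES_PER_TRANSFORM = 16  # 16-point FFT
BITS_PER_SAMPLE = 32        # one complex value, as stated for the input and output
BUFFER_COPIES = 5           # assumed meaning of "5x internal storage"

# Total internal storage under the assumption above.
storage_bits = BUFFER_COPIES * SAMPLES_PER_TRANSFORM * BITS_PER_SAMPLE
print(storage_bits)  # 2560
```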
9Xilinx ISE Design Flow
- Inputs: VHDL; constraints (pin, area, timing)
- Synthesize: produces net-list(s)
- Translate: merges net-lists and constraints into a single database
- Map: maps the design to the board logic
- Place/Route: produces a floorplanned, placed and routed design
- Configure: the design is downloaded to the board
- Verify: using Xilinx ChipScope Pro
10Characteristics of the Final Design
- Resource
  - 1432 out of 21504 flip-flops (6%)
  - 1037 out of 21504 LUTs (4%)
  - 65 out of 624 I/O pins (10%)
- Timing
  - Minimum clock period 5.899 ns (maximum frequency 169.520 MHz)
- Power
  - 440 mW estimated by XPower
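The utilization percentages and the maximum frequency follow directly from the raw counts and the clock period. The sketch below recomputes them; the helper name `percent_used` is hypothetical, and the observation that the reported percentages match truncated (not rounded) values is an inference from the numbers.

```python
def percent_used(used, available):
    """Resource utilization as a percentage."""
    return 100.0 * used / available

flip_flops = percent_used(1432, 21504)  # about 6.66
luts = percent_used(1037, 21504)        # about 4.82
io_pins = percent_used(65, 624)         # about 10.42

# Truncating to whole percent reproduces the 6, 4 and 10 on the slide.
print(int(flip_flops), int(luts), int(io_pins))

# Maximum frequency in MHz is the reciprocal of the minimum period in ns, times 1000.
fmax_mhz = 1000.0 / 5.899
print(round(fmax_mhz, 2))
```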
11Placed-and-routed Design
Without any PIN constraint
- Input pins are constrained to Bank0
- Output pins are constrained to Bank1
12Post Place-and-route Simulation
- Input data: 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32, 0, 32
- Expected output: 16, 0, 0, 0, 0, 0, 0, 0, -16, 0, 0, 0, 0, 0, 0, 0
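The expected output can be reproduced in software with a direct DFT. This sketch rests on two assumptions that the slide does not state: the 16 input values are purely real samples, and the core scales its result by 1/16. Under those assumptions the numbers match; the function name `dft_scaled` is hypothetical.

```python
import cmath

def dft_scaled(x):
    """Direct DFT divided by N (the 1/N scaling is an assumption about the core)."""
    n_pts = len(x)
    return [sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / n_pts)
                for n in range(n_pts)) / n_pts
            for k in range(n_pts)]

test_input = [0, 32] * 8  # 0, 32, 0, 32, ... (16 real samples)
spectrum = dft_scaled(test_input)

# Round away floating-point noise; the imaginary parts are numerically zero.
real_parts = [round(v.real) for v in spectrum]
print(real_parts)
```

Only bins 0 and 8 are non-zero because the input is a constant offset plus an alternating component at half the sampling rate.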
13Speedup from Software Implementation
- Naïve software implementation
  - 2.4 GHz P-4, 512 MB RAM, gcc 3.3.1
  - On average, every transform takes 325 ns
- FPGA implementation
  - Virtex-II XC2V2000 FF896 board with speed grade -4
  - In steady state, every transform takes 5.9 × 16 = 94.4 ns
  - 3.44X speedup
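The speedup figure is simple arithmetic on the numbers above: one transform completes every 16 cycles at the minimum clock period, and that time is compared with the measured software average. A short check, with hypothetical variable names:

```python
clock_period_ns = 5.899     # minimum clock period from the timing report
cycles_per_transform = 16   # steady-state throughput: 1 transform per 16 cycles
software_ns = 325.0         # measured software average per transform

fpga_ns = clock_period_ns * cycles_per_transform
speedup = software_ns / fpga_ns

print(round(fpga_ns, 1), round(speedup, 2))
```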
14Configuration and Verification
- The board was connected to the LPT1 port of the PC using a Xilinx Parallel Cable IV (JTAG mode)
- iMPACT successfully downloaded the bitstream file to the board
- Unfortunately, ChipScope Pro could not upload the data generated by our test bench
15Thank You
QUESTIONS?